Fine-Tuning GPT-4o-Mini for Blog Post Generation

GPT-4o-mini, released on July 18, 2024, outperforms GPT-3.5 and approaches GPT-4 in performance, at roughly half the cost of GPT-3.5 and with the fastest response speed in the series. OpenAI has now opened the fine-tuning API for GPT-4o-mini, offering 2M training tokens free per day until September 23, 2024.

1 Fine-Tuning Application Scenarios

For simple tasks, writing prompts is sufficient for the model to perform well. For more complex tasks, you can use the Chain of Thought technique to break down the task into multiple steps and reason through them step by step. However, for tasks requiring high precision and consistent output, fine-tuning is necessary.

The following table compares the pros and cons of these three methods and their application scenarios.

| Method | Advantages | Disadvantages | Application Scenarios |
| --- | --- | --- | --- |
| Fine-Tuning | Provides high-quality results; suitable for complex tasks and custom domains; saves tokens and reduces latency | Requires significant time and resources to prepare data and train; slow feedback loop and high training cost; requires knowledge of deep learning | Tasks needing stable, reliable, high-quality output; improving model performance in a specific task or domain; tasks needing high precision or a unique style, tone, or format |
| Prompting | Quick iteration and testing; suitable for initial exploration and general tasks; no additional data preparation or training resources needed | Depends on the quality of the prompt design; may not be accurate enough for complex tasks; not suitable for tasks with many examples and complex logic | Quick prototyping and testing of common tasks; when flexible adjustment of model output is needed |
| Chain of Thought | Provides step-by-step logic and reasoning; improves performance on complex tasks; easy to combine with other strategies and tools | Increases prompt complexity and length; increases token usage and latency; may still not be enough for very complex tasks | Tasks requiring reasoning and logical steps; multi-step problem solving; when a clear logical process and step-by-step execution are needed |

The No Free Lunch theorem tells us that no single method suits every scenario, and fine-tuning is no exception: it is not inherently better than the other two approaches. It is, however, clearly well suited to "hard-to-describe" requirements such as a specific style and tone. The three methods are also not mutually exclusive; well-designed prompts, possibly combined with chain of thought, can make a fine-tuned model perform even better.

For a simple task like writing a generic article or paragraph, a prompt is enough. An SEO-oriented blog post, however, involves many details such as core-keyword frequency that the model may not fully grasp and that you, as a user, may struggle to describe precisely in a prompt. Writing this kind of blog post is therefore a good candidate for fine-tuning.

2 Preparing Data

Data needs to be organized in JSONL format, with each line being a JSON object that holds one complete chat example. For example:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

You can also set per-message weights in multi-turn conversations: an assistant message with weight 0 is excluded from training (the model does not learn from it), while weight 1 (the default) means the message is trained on.

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already.", "weight": 1}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "William Shakespeare", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?", "weight": 1}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "384,400 kilometers", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters.", "weight": 1}]}

Of course, preparing the data is the most time-consuming part. You can also use the dataset I created for fine-tuning large models: it was built by scraping over 3,000 pages across 13 categories from the reads.alibaba.com website, and the open-source release includes the processed data, the raw data, and the crawler code.
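
If you are preparing your own data instead, the conversion into the chat format is mostly mechanical. Below is a minimal sketch, assuming each raw record already carries a title, keywords, and the finished article body; the field names and sample values are illustrative, and the system prompt is the same one used when calling the model later in this post.

import json

# Illustrative raw records; in practice these come from your scraped/processed data.
raw_records = [
    {
        "title": "What is the Difference Between a Mural and a Mosaic?",
        "core_keyword": "mural vs mosaic",          # assumed field name and value
        "related_keywords": "wall art, tile art",   # assumed field name and value
        "article_html": "<h2>...</h2><p>...</p>",   # the target assistant output
    },
]

SYSTEM_PROMPT = (
    "Please write an SEO article of no less than 800 words based on the title "
    "I gave you, including at least 4 subtitles by HTML format. Do not include "
    "the <h1> , <body> tag.  Do not include the <html> tag in the start and end "
    "of the content. Directly start with the content."
)

# Write one chat example per line in the JSONL training file.
with open("all_filter_2120.jsonl", "w", encoding="utf-8") as f:
    for r in raw_records:
        example = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"title:{r['title']},core keyword:{r['core_keyword']},related keyword:{r['related_keywords']}"},
                {"role": "assistant", "content": r["article_html"]},
            ]
        }
        f.write(json.dumps(example, ensure_ascii=False) + "\n")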

Upload the prepared data and record the returned file ID.

from openai import OpenAI
client = OpenAI()

# Upload the training file; keep the returned file ID for the fine-tuning job
file = client.files.create(
  file=open("all_filter_2120.jsonl", "rb"),
  purpose="fine-tune"
)
print(file.id)
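
Independently of the upload, it is worth a quick local sanity check of the file and a rough token estimate before kicking off a job (OpenAI also validates the file server-side). A minimal sketch using tiktoken; per-message overhead tokens are ignored, so treat the total as an approximation:

import json
import tiktoken

# o200k_base is the encoding used by the GPT-4o family (needs a recent tiktoken).
enc = tiktoken.get_encoding("o200k_base")
n_examples, total_tokens = 0, 0

with open("all_filter_2120.jsonl", "r", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        record = json.loads(line)  # every line must be valid JSON
        assert "messages" in record, f"line {i}: missing 'messages'"
        for msg in record["messages"]:
            assert msg["role"] in {"system", "user", "assistant"}, f"line {i}: bad role"
            total_tokens += len(enc.encode(msg["content"]))
        n_examples += 1

print(f"{n_examples} examples, ~{total_tokens} training tokens per epoch")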

3 Fine-Tuning the Model

Once the data is prepared, validated, and the token cost confirmed, you can create a fine-tuning job.

from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
  training_file="file-zWptPbsD37ZnemssjpsK6CnF", 
  model="gpt-4o-mini"
)

More detailed parameter configurations for this step can be found in the official API documentation.
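
For instance, the same call accepts a few optional parameters. A hedged sketch with illustrative values (the validation-file ID and suffix are placeholders, and the epoch count is just an example):

from openai import OpenAI
client = OpenAI()

# Optional extras (values are illustrative): a held-out validation file,
# a fixed number of epochs, and a suffix that becomes part of the model name.
client.fine_tuning.jobs.create(
  training_file="file-zWptPbsD37ZnemssjpsK6CnF",
  validation_file="file-...",          # placeholder ID for a validation set
  model="gpt-4o-mini-2024-07-18",
  suffix="blog-seo",
  hyperparameters={"n_epochs": 3},
)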

OpenAI Fine-Tuning UI

Both steps can also be completed quickly in the web UI. After submitting the job, you can track its progress and the loss curve there in real time as well.

OpenAI Fine-Tuning Process Logs

4 Using the Model

You can check the status of the fine-tuning job with the code below. Once the job succeeds, the fine_tuned_model field will contain the model’s name; note it down, as you will use it to call the model.

from openai import OpenAI
client = OpenAI()

# List fine-tuning jobs
client.fine_tuning.jobs.list(limit=10)

# Retrieve the details of a fine-tuning job
client.fine_tuning.jobs.retrieve("ftjob-gvP0VB7RlWcF3QHdQrEVf49Y")

# Cancel a job
client.fine_tuning.jobs.cancel("ftjob-gvP0VB7RlWcF3QHdQrEVf49Y")

# List the events (logs) of a job
client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-gvP0VB7RlWcF3QHdQrEVf49Y", limit=10)

# Delete a fine-tuned model
client.models.delete("ft:gpt-3.5-turbo:acemeco:suffix:abc123")
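
A minimal polling sketch that waits for the job to finish and then reads the fine_tuned_model field (it reuses the job ID shown above; the printed name will look like the ft:gpt-4o-mini-... model used below):

import time
from openai import OpenAI

client = OpenAI()

# Poll the job until it reaches a terminal state, then read the model name.
job_id = "ftjob-gvP0VB7RlWcF3QHdQrEVf49Y"
while True:
    job = client.fine_tuning.jobs.retrieve(job_id)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

print(job.status, job.fine_tuned_model)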

Calling the fine-tuned model works exactly like calling the official models; you only need to change the model name. For example:

from openai import OpenAI
from types import SimpleNamespace

client = OpenAI()

# Example brief; in the real pipeline, `task` comes from your own task source
task = SimpleNamespace(
    title="What is the Difference Between a Mural and a Mosaic?",
    coreKeywords="mural vs mosaic",        # illustrative value
    relatedKeywords="wall art, tile art",  # illustrative value
)

completion = client.chat.completions.create(
  model="ft:gpt-4o-mini-2024-07-18:personal:0724:9oMH6S7A",
  messages=[
    {"role": "system", "content": "Please write an SEO article of no less than 800 words based on the title I gave you, including at least 4 subtitles by HTML format. Do not include the <h1> , <body> tag.  Do not include the <html> tag in the start and end of the content. Directly start with the content."},
    {"role": "user", "content": f"title:{task.title},core keyword:{task.coreKeywords},related keyword:{task.relatedKeywords}"}
  ]
)
print(completion.choices[0].message)
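
Since the model is trained to return the article body as HTML, the generated content can be written straight to a file for publishing. A small usage sketch (the file name is just an example):

# Extract the generated HTML and save it (file name is illustrative)
html = completion.choices[0].message.content
with open("mural-vs-mosaic.html", "w", encoding="utf-8") as f:
    f.write(html)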

5 Evaluating the Results

During training, there are two metrics to watch: loss and token accuracy. The official explanation is as follows:

Validation loss and validation token accuracy are calculated in two different ways - on a small batch of data during each step and on the entire validation set at the end of each epoch. The overall validation loss and overall validation token accuracy indicators are the most accurate indicators for tracking the overall performance of the model. These statistics are intended to provide a sanity check to ensure training is proceeding smoothly (loss should decrease, token accuracy should increase).
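
To inspect these numbers programmatically rather than in the UI, you can download the job's result file, a CSV of per-step training and validation metrics. A sketch using the job ID from earlier (the exact column names may vary slightly):

from openai import OpenAI
client = OpenAI()

# A finished job exposes result_files: CSVs with per-step train/validation
# loss and token accuracy. Print the header and the final row as a quick check.
job = client.fine_tuning.jobs.retrieve("ftjob-gvP0VB7RlWcF3QHdQrEVf49Y")
for file_id in job.result_files:
    csv_text = client.files.content(file_id).text
    rows = csv_text.strip().splitlines()
    print(rows[0])    # header, e.g. step, train loss, validation loss, token accuracy
    print(rows[-1])   # metrics at the last recorded step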

However, these metrics are only a reference; the actual effect still needs to be evaluated by yourself. The fine-tuned model showed at least the following improvements:

  • Article length increases by 20%
  • Article structure is closer to the training data
  • No more format errors (e.g., markdown format, adding CSS, etc.)

Here is an example of an article generated with the title “What is the Difference Between a Mural and a Mosaic?”:

Evaluation Results

