Fine-Tuning GPT-4o-Mini to Generate Blog Articles

The new model GPT-4o-mini, released on July 18, 2024, surpasses GPT-3.5 and approaches GPT-4 in performance, while costing only about half as much as GPT-3.5. It also has the fastest response time of the entire model series. Today, OpenAI officially opened the fine-tuning interface for GPT-4o-mini, offering a free daily quota of 2M training tokens until September 23, 2024.

It’s not that Llama 3.1 405B is unaffordable; it’s just that GPT-4o-mini offers better value for money.

1 Suitable Scenarios for Fine-tuning

For simple, general tasks, a well-written prompt is enough for the model to perform well. For more complex tasks, you can try Chain of Thought prompting to break the task into multiple steps and reason through them gradually. However, for tasks requiring high precision and consistent output, fine-tuning is necessary.

The table below compares the advantages, disadvantages, and application scenarios of these three methods.

| Method | Advantages | Disadvantages | Application Scenarios |
| --- | --- | --- | --- |
| Fine-tuning | Provides high-quality results | Requires significant time and resources to prepare data and train | Stable, reliable, high-quality output is required |
| | Suitable for complex tasks and customization in specific fields | Slow feedback loop, high training cost | Improving model performance on specific tasks or in specific fields |
| | Saves tokens, reduces latency | Requires foundational knowledge of deep learning | Tasks requiring high precision or a particular style, tone, or format |
| Prompting | Fast iteration and testing | Depends on the quality of prompt design | Quick prototyping and testing of common tasks |
| | Suitable for initial exploration and general tasks | May not be accurate enough for complex tasks | When flexible adjustment of model output is needed |
| | No additional data preparation or training resources needed | Not suitable for tasks with many examples and complex logic | |
| Chain of Thought | Provides step-by-step logic and reasoning | Increases the complexity and length of prompts | Tasks requiring reasoning and logical steps |
| | Improves performance on complex tasks | Increases token usage and latency | Multi-step problem-solving scenarios |
| | Easily combined with other strategies and tools | May still not be enough for very complex tasks | When a clear logical process and step-by-step execution are needed |

The No Free Lunch (NFL) theorem tells us that no single method suits every scenario, and the same applies here: fine-tuning is not necessarily better than the other two methods. It is, however, clearly suited to “hard-to-describe” tasks, such as reproducing a specific style and tone. Moreover, the three methods are not mutually exclusive; a fine-tuned model driven by carefully designed prompts, or even combined with Chain of Thought, may achieve better results.

For simply writing an article or a paragraph, prompts are enough. A blog article written with SEO in mind, however, involves many details, such as the frequency of core keywords. A large model may not fully grasp these details, and as a user you may not be able to describe them well in a prompt. Generating this kind of blog article is therefore a good fit for fine-tuning.

2 Preparing Data

Data needs to be organized in JSONL format, with each line being a JSON object. For example:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

You can also set weights on assistant messages in multi-turn dialogues: a weight of 0 means that message is excluded from training (the model does not learn from it), while a weight of 1 means it is trained on.

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already.", "weight": 1}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "William Shakespeare", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?", "weight": 1}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "384,400 kilometers", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters.", "weight": 1}]}

Of course, preparing the data is the most time-consuming part, and you can use the dataset I created directly. It was built for fine-tuning large models by scraping more than 3,000 pages across 13 categories from the reads.alibaba.com website. The open-source release includes not only the processed data but also the raw data and the crawler code.
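
Before uploading, it is worth sanity-checking that every line parses as JSON and follows the expected chat format. Here is a minimal sketch (assuming the training file is named all_filter_2120.jsonl, as in the upload step below):

import json

# Quick format check of the JSONL training file before uploading it
with open("all_filter_2120.jsonl", "r", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)  # raises if the line is not valid JSON
        messages = example["messages"]
        assert messages, f"line {i}: empty messages list"
        assert messages[-1]["role"] == "assistant", f"line {i}: last message should be from the assistant"
        for m in messages:
            assert m["role"] in ("system", "user", "assistant"), f"line {i}: unexpected role {m['role']}"
            assert isinstance(m["content"], str), f"line {i}: content must be a string"
print("format check passed")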

Upload the prepared data and record the returned file ID.

from openai import OpenAI
client = OpenAI()

# Upload the training file; the returned file ID is needed to create the job
training_file = client.files.create(
  file=open("all_filter_2120.jsonl", "rb"),
  purpose="fine-tune"
)
print(training_file.id)

3 Fine-tuning the Model

Once the data is prepared and verified and the token cost is confirmed, you can create a fine-tuning job.
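
As a rough check on the training token cost, you can count the tokens in the file yourself, for example with the tiktoken package. This is only a sketch and an approximation: it assumes the o200k_base encoding used by the GPT-4o family and ignores the small per-message formatting overhead.

import json
import tiktoken

# Rough estimate of billable training tokens per epoch (approximation only)
enc = tiktoken.get_encoding("o200k_base")  # encoding used by the GPT-4o family

total_tokens = 0
with open("all_filter_2120.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        for message in json.loads(line)["messages"]:
            total_tokens += len(enc.encode(message["content"]))

print(f"~{total_tokens} tokens per epoch")

The fine-tuning job itself is then created as follows: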

from openai import OpenAI
client = OpenAI()

# Create the fine-tuning job with the uploaded file's ID
job = client.fine_tuning.jobs.create(
  training_file="file-zWptPbsD37ZnemssjpsK6CnF",
  model="gpt-4o-mini"
)
print(job.id)

More detailed parameter configurations for this step can be found in the official API documentation.
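
For instance, the create call also accepts an optional suffix (which appears in the resulting model name, like the 0724 segment in the example further below) and a hyperparameters object such as the number of epochs. The following is a sketch based on those documented parameters; the values are illustrative:

from openai import OpenAI
client = OpenAI()

# Same job as above, with an explicit snapshot, a name suffix and an epoch count
client.fine_tuning.jobs.create(
  training_file="file-zWptPbsD37ZnemssjpsK6CnF",
  model="gpt-4o-mini-2024-07-18",     # dated snapshot, matching the prefix of the fine-tuned model name
  suffix="0724",                      # appears in the fine-tuned model name
  hyperparameters={"n_epochs": 3}     # illustrative value
)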

[Screenshot: OpenAI fine-tuning UI]

Both of these steps can also be completed quickly in the web UI. After submitting the job, you can monitor its progress and the loss curve there in real time.

[Screenshot: OpenAI fine-tuning process log]

4 Invoking the Model

Use the following code to query the status of the fine-tuning job. Once the job succeeds, the fine_tuned_model field will contain the new model’s name; note it down for invocation.

from openai import OpenAI
client = OpenAI()

# Query fine-tuning job list
client.fine_tuning.jobs.list(limit=10)

# Query fine-tuning job details
client.fine_tuning.jobs.retrieve("ftjob-gvP0VB7RlWcF3QHdQrEVf49Y")

# Cancel job
client.fine_tuning.jobs.cancel("ftjob-gvP0VB7RlWcF3QHdQrEVf49Y")

# View logs in the job
client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-gvP0VB7RlWcF3QHdQrEVf49Y", limit=10)

# Delete fine-tuned model
client.models.delete("ft:gpt-3.5-turbo:acemeco:suffix:abc123")
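
If you prefer to wait for completion in a script, a simple polling loop works. This is a sketch; the job ID is the one returned when the job was created:

import time
from openai import OpenAI
client = OpenAI()

# Poll the job until it reaches a terminal state, then print the model name
job_id = "ftjob-gvP0VB7RlWcF3QHdQrEVf49Y"
while True:
    job = client.fine_tuning.jobs.retrieve(job_id)
    print(job.status)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

print(job.fine_tuned_model)  # e.g. ft:gpt-4o-mini-2024-07-18:personal:0724:9oMH6S7A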

Invocation works the same way as with the official models; you only need to change the model name. For example:

from types import SimpleNamespace
from openai import OpenAI
client = OpenAI()

# Hypothetical example task; in practice the title and keywords come from your own task queue
task = SimpleNamespace(
  title="What is the Difference Between a Mural and a Mosaic?",
  coreKeywords="mural vs mosaic",
  relatedKeywords="wall art, tile art"
)

completion = client.chat.completions.create(
  model="ft:gpt-4o-mini-2024-07-18:personal:0724:9oMH6S7A",
  messages=[
    {"role": "system", "content": "Please write an SEO article of no less than 800 words based on the title I gave you, including at least 4 subtitles by HTML format. Do not include the <h1> , <body> tag.  Do not include the <html> tag in the start and end of the content. Directly start with the content."},
    {"role": "user", "content": f"title:{task.title},core keyword:{task.coreKeywords},related keyword:{task.relatedKeywords}"}
  ]
)
print(completion.choices[0].message.content)

5 Evaluating Results

During training, two metrics are available for reference: the loss value and token accuracy. The official explanation is as follows:

Validation loss and validation token accuracy are computed in two different ways: on a small batch of data at each step, and on the full validation set at the end of each epoch. The full validation loss and full validation token accuracy are the most accurate indicators of the model’s overall performance. These statistics serve as a sanity check that training is proceeding smoothly (loss should decrease, token accuracy should increase).
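
These metrics can also be pulled down programmatically once the job finishes: the job’s result_files point to a CSV of step-level metrics. The following is a sketch, assuming the result-file layout documented by OpenAI:

from openai import OpenAI
client = OpenAI()

# The finished job attaches a result file with step-level training metrics
job = client.fine_tuning.jobs.retrieve("ftjob-gvP0VB7RlWcF3QHdQrEVf49Y")
result_file_id = job.result_files[0]

# CSV columns include step, train_loss, train_accuracy, valid_loss,
# valid_mean_token_accuracy
metrics_csv = client.files.content(result_file_id).read().decode("utf-8")
print(metrics_csv[:500])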

That said, metrics are only a reference; you still need to judge the actual output yourself. The fine-tuned model shows at least the following improvements:

  • Article length increased by 20%
  • Article structure is closer to the training data
  • No more formatting errors (such as emitting Markdown or adding CSS)

An article generated with the title “What is the Difference Between a Mural and a Mosaic?” is as follows:

[Screenshot: evaluation results]

