Fine-Tuning GPT-4o-Mini to Generate Blog Articles

The new model GPT-4o-mini, released on July 18, 2024, surpasses GPT-3.5 and approaches GPT-4 in performance, while costing only about half as much as GPT-3.5. It also has the fastest response time of the entire model series. Today, OpenAI officially opened the fine-tuning interface for GPT-4o-mini, offering a free daily quota of 2M training tokens until September 23, 2024.

It’s not that Llama 3.1 405B is unaffordable; it’s just that GPT-4o-mini offers better value for money.

1 Suitable Scenarios for Fine-tuning

For simple, general tasks, a well-written prompt is enough for the model to perform well. For more complex tasks, you can try Chain of Thought prompting to break the task into multiple steps and reason through them gradually. However, for tasks requiring high precision and consistent output, fine-tuning is necessary.

The table below compares the advantages, disadvantages, and application scenarios of these three methods.

| Method | Advantages | Disadvantages | Application Scenarios |
| --- | --- | --- | --- |
| Fine-tuning | Provides high-quality results | Requires significant time and resources to prepare data and train | Stable, reliable, high-quality output is required |
| | Suitable for complex tasks and customization in specific fields | Slow feedback loop, high training cost | Improving model performance on specific tasks or in specific fields |
| | Saves tokens, reduces latency | Requires foundational knowledge of deep learning | Tasks requiring high precision or a particular style, tone, or format |
| Prompting | Fast iteration and testing | Depends on the quality of prompt design | Quick prototyping and testing of common tasks |
| | Suitable for initial exploration and general tasks | May not be accurate enough for complex tasks | When flexible adjustment of model output is needed |
| | No additional data preparation or training resources needed | Not suitable for tasks with many examples and complex logic | |
| Chain of Thought | Provides step-by-step logic and reasoning | Increases the complexity and length of prompts | Tasks requiring reasoning and logical steps |
| | Improves performance on complex tasks | Increases token usage and latency | Multi-step problem-solving scenarios |
| | Easily combined with other strategies and tools | May still not be enough for very complex tasks | When a clear logical process and step-by-step execution are needed |

The No Free Lunch (NFL) theorem tells us that no single method suits every scenario, and the same applies here: fine-tuning is not necessarily better than the other two methods. It is, however, clearly suited to “hard-to-describe” tasks, such as reproducing a specific style and tone. Moreover, the three methods are not mutually exclusive; a fine-tuned model driven by carefully designed prompts, or even combined with Chain of Thought, may achieve better results.

For simply writing an article or a paragraph, prompts are enough. A blog article written with SEO in mind, however, involves many details, such as the frequency of core keywords. A large model may not fully grasp these details, and as a user you may not be able to describe them well in a prompt. Generating this kind of blog article is therefore a good fit for fine-tuning.

2 Preparing Data

Data needs to be organized in JSONL format, with each line being a JSON object. For example:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

You can also set weights on assistant messages in multi-turn dialogues: a weight of 0 means that message is excluded from training (the model does not learn from it), while a weight of 1 means it is trained on.

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already.", "weight": 1}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "William Shakespeare", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?", "weight": 1}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "384,400 kilometers", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters.", "weight": 1}]}

Of course, preparing the data is the most time-consuming part, and you can use the dataset I created directly. It was built for fine-tuning large models by scraping more than 3,000 pages across 13 categories from the reads.alibaba.com website. The open-source release includes not only the processed data but also the raw data and the crawler code.
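
Before uploading, it is worth sanity-checking that every line parses as JSON and follows the expected chat format. Here is a minimal sketch (assuming the training file is named all_filter_2120.jsonl, as in the upload step below):

import json

# Quick format check of the JSONL training file before uploading it
with open("all_filter_2120.jsonl", "r", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)  # raises if the line is not valid JSON
        messages = example["messages"]
        assert messages, f"line {i}: empty messages list"
        assert messages[-1]["role"] == "assistant", f"line {i}: last message should be from the assistant"
        for m in messages:
            assert m["role"] in ("system", "user", "assistant"), f"line {i}: unexpected role {m['role']}"
            assert isinstance(m["content"], str), f"line {i}: content must be a string"
print("format check passed")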

Upload the prepared data and record the returned file ID.

from openai import OpenAI
client = OpenAI()

# Upload the training file; the returned file ID is needed to create the job
training_file = client.files.create(
  file=open("all_filter_2120.jsonl", "rb"),
  purpose="fine-tune"
)
print(training_file.id)

3 Fine-tuning the Model

Once the data is prepared and verified and the token cost is confirmed, you can create a fine-tuning job.
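
As a rough check on the training token cost, you can count the tokens in the file yourself, for example with the tiktoken package. This is only a sketch and an approximation: it assumes the o200k_base encoding used by the GPT-4o family and ignores the small per-message formatting overhead.

import json
import tiktoken

# Rough estimate of billable training tokens per epoch (approximation only)
enc = tiktoken.get_encoding("o200k_base")  # encoding used by the GPT-4o family

total_tokens = 0
with open("all_filter_2120.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        for message in json.loads(line)["messages"]:
            total_tokens += len(enc.encode(message["content"]))

print(f"~{total_tokens} tokens per epoch")

The fine-tuning job itself is then created as follows: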

from openai import OpenAI
client = OpenAI()

# Create the fine-tuning job with the uploaded file's ID
job = client.fine_tuning.jobs.create(
  training_file="file-zWptPbsD37ZnemssjpsK6CnF",
  model="gpt-4o-mini"
)
print(job.id)

More detailed parameter configurations for this step can be found in the official API documentation.
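
For instance, the create call also accepts an optional suffix (which appears in the resulting model name, like the 0724 segment in the example further below) and a hyperparameters object such as the number of epochs. The following is a sketch based on those documented parameters; the values are illustrative:

from openai import OpenAI
client = OpenAI()

# Same job as above, with an explicit snapshot, a name suffix and an epoch count
client.fine_tuning.jobs.create(
  training_file="file-zWptPbsD37ZnemssjpsK6CnF",
  model="gpt-4o-mini-2024-07-18",     # dated snapshot, matching the prefix of the fine-tuned model name
  suffix="0724",                      # appears in the fine-tuned model name
  hyperparameters={"n_epochs": 3}     # illustrative value
)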

[Screenshot: OpenAI fine-tuning UI]

Both of these steps can also be completed quickly in the web UI. After submitting the job, you can monitor its progress and the loss curve there in real time.

[Screenshot: OpenAI fine-tuning process log]

4 Invoking the Model

Use the following code to query the status of the fine-tuning job. Once the job succeeds, the fine_tuned_model field will contain the new model’s name; note it down for invocation.

from openai import OpenAI
client = OpenAI()

# Query fine-tuning job list
client.fine_tuning.jobs.list(limit=10)

# Query fine-tuning job details
client.fine_tuning.jobs.retrieve("ftjob-gvP0VB7RlWcF3QHdQrEVf49Y")

# Cancel job
client.fine_tuning.jobs.cancel("ftjob-gvP0VB7RlWcF3QHdQrEVf49Y")

# View logs in the job
client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-gvP0VB7RlWcF3QHdQrEVf49Y", limit=10)

# Delete fine-tuned model
client.models.delete("ft:gpt-3.5-turbo:acemeco:suffix:abc123")
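
If you prefer to wait for completion in a script, a simple polling loop works. This is a sketch; the job ID is the one returned when the job was created:

import time
from openai import OpenAI
client = OpenAI()

# Poll the job until it reaches a terminal state, then print the model name
job_id = "ftjob-gvP0VB7RlWcF3QHdQrEVf49Y"
while True:
    job = client.fine_tuning.jobs.retrieve(job_id)
    print(job.status)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

print(job.fine_tuned_model)  # e.g. ft:gpt-4o-mini-2024-07-18:personal:0724:9oMH6S7A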

Invocation works the same way as with the official models; you only need to change the model name. For example:

from types import SimpleNamespace
from openai import OpenAI
client = OpenAI()

# Hypothetical example task; in practice the title and keywords come from your own task queue
task = SimpleNamespace(
  title="What is the Difference Between a Mural and a Mosaic?",
  coreKeywords="mural vs mosaic",
  relatedKeywords="wall art, tile art"
)

completion = client.chat.completions.create(
  model="ft:gpt-4o-mini-2024-07-18:personal:0724:9oMH6S7A",
  messages=[
    {"role": "system", "content": "Please write an SEO article of no less than 800 words based on the title I gave you, including at least 4 subtitles by HTML format. Do not include the <h1> , <body> tag.  Do not include the <html> tag in the start and end of the content. Directly start with the content."},
    {"role": "user", "content": f"title:{task.title},core keyword:{task.coreKeywords},related keyword:{task.relatedKeywords}"}
  ]
)
print(completion.choices[0].message.content)

5 Evaluating Results

During training, two metrics are available for reference: the loss value and token accuracy. The official explanation is as follows:

Validation loss and validation token accuracy are computed in two different ways: on a small batch of data at each step, and on the full validation set at the end of each epoch. The full validation loss and full validation token accuracy are the most accurate indicators of the model’s overall performance. These statistics serve as a sanity check that training is proceeding smoothly (loss should decrease, token accuracy should increase).
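
These metrics can also be pulled down programmatically once the job finishes: the job’s result_files point to a CSV of step-level metrics. The following is a sketch, assuming the result-file layout documented by OpenAI:

from openai import OpenAI
client = OpenAI()

# The finished job attaches a result file with step-level training metrics
job = client.fine_tuning.jobs.retrieve("ftjob-gvP0VB7RlWcF3QHdQrEVf49Y")
result_file_id = job.result_files[0]

# CSV columns include step, train_loss, train_accuracy, valid_loss,
# valid_mean_token_accuracy
metrics_csv = client.files.content(result_file_id).read().decode("utf-8")
print(metrics_csv[:500])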

That said, metrics are only a reference; you still need to judge the actual output yourself. The fine-tuned model shows at least the following improvements:

  • Article length increased by 20%
  • Article structure is closer to the training data
  • No more formatting errors (such as emitting Markdown or adding CSS)

An article generated with the title “What is the Difference Between a Mural and a Mosaic?” is as follows:

[Screenshot: evaluation results]

