Fine-Tuning GPT-4o-Mini to Generate Blog Articles
The new GPT-4o-mini model, released on July 18, surpasses GPT-3.5 and approaches GPT-4 in performance, while costing only about half as much as GPT-3.5. It also has the fastest response time of the entire model series. OpenAI officially opened the fine-tuning interface for GPT-4o-mini today, offering a free daily quota of 2M training tokens until September 23, 2024.
It’s not that Llama 3.1 405B is unaffordable; it’s that GPT-4o-mini offers better value for money.
1 Suitable Scenarios for Fine-tuning
For common, simple tasks, a well-written prompt is enough for the model to perform well. For more complex tasks, you can try Chain of Thought prompting, breaking the task into multiple steps and reasoning through them one at a time. For tasks that demand high precision and consistent output, however, fine-tuning is necessary.
The table below compares the advantages, disadvantages, and application scenarios of these three methods.
Method | Advantages | Disadvantages | Application Scenarios |
---|---|---|---|
Fine-tuning | Provides high-quality results | Requires significant time and resources to prepare data and train | When stable, reliable, high-quality output is required |
 | Suitable for complex tasks and domain-specific customization | Slow feedback loop, high training cost | Improving model performance on a specific task or domain |
 | Saves tokens, reduces latency | Requires foundational knowledge of deep learning | When tasks require high precision or a unique style, tone, or format |
Prompting | Fast iteration and testing | Depends on the quality of prompt design | Quick prototyping and testing of common tasks |
 | Suitable for initial exploration and general tasks | May not be accurate enough for complex tasks | When flexible adjustment of model output is needed |
 | No additional data preparation or training resources needed | Not suitable for tasks with many examples or complex logic | |
Chain of Thought | Provides step-by-step logic and reasoning | Increases the complexity and length of prompts | Tasks requiring reasoning and logical steps |
 | Improves performance on complex tasks | Increases token usage and latency | Multi-step problem-solving scenarios |
 | Easily combines with other strategies and tools | May still not be enough for very complex tasks | When a clear logical process and step-by-step execution are needed |
The No Free Lunch (NFL) theorem tells us that no single method suits every scenario, and the same applies here: fine-tuning is not necessarily better than the other two approaches. It is, however, clearly well suited to “hard-to-describe” tasks, such as reproducing a specific style and tone. Moreover, the three methods are not mutually exclusive; a fine-tuned model driven by carefully designed prompts, or even combined with Chain of Thought, may achieve better results.
For simply writing an article or a paragraph, prompts are enough. A blog article written with SEO in mind, however, involves many details, such as the frequency of core keywords. A large model may not fully grasp these details, and as a user you may not be able to articulate them well in a prompt. This makes such blog articles a good candidate for fine-tuning.
2 Preparing Data
Data needs to be organized in JSONL format, with each line being a JSON object. For example:
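Here is a minimal example in OpenAI's chat fine-tuning format; the content strings are placeholders:

```json
{"messages": [{"role": "system", "content": "You are a blog writer for a B2B e-commerce site."}, {"role": "user", "content": "Write an article titled: How to Choose LED Strip Lights"}, {"role": "assistant", "content": "LED strip lights are a flexible way to..."}]}
```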
You can also set weights on assistant messages in multi-turn dialogues; a weight of 0 tells the API to skip that message during training, so the model does not learn from that response.
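For example, a sketch of a weighted multi-turn sample (content abbreviated; `weight` applies to assistant messages):

```json
{"messages": [{"role": "user", "content": "Write an intro about mosaics"}, {"role": "assistant", "content": "A weak first draft...", "weight": 0}, {"role": "user", "content": "Make it more engaging"}, {"role": "assistant", "content": "An improved draft...", "weight": 1}]}
```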
Of course, preparing data is the most time-consuming part, so you can also use the dataset I created directly. It is built for fine-tuning large models and was sourced by scraping more than 3,000 pages across 13 categories from the reads.alibaba.com website. The open-source release includes not only the processed data but also the raw data and the crawler code.
Upload the prepared data and record the returned file ID.
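A sketch using the official `openai` Python SDK; the filename is a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL training file; purpose must be "fine-tune"
training_file = client.files.create(
    file=open("blog_articles.jsonl", "rb"),
    purpose="fine-tune",
)
print(training_file.id)  # e.g. "file-abc123", needed for the next step
```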
3 Fine-tuning the Model
Once the data is prepared and validated and the token cost is confirmed, you can create a fine-tuning job.
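A minimal sketch, reusing the `client` and file ID from the upload step; `gpt-4o-mini-2024-07-18` is the snapshot that supports fine-tuning:

```python
# Create the fine-tuning job on the GPT-4o-mini snapshot
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id)  # e.g. "ftjob-abc123"
```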
More detailed parameter configurations for this step can be found in the official API documentation.
Both steps can also be completed quickly in the web UI, and after submitting the job you can watch its progress and loss curve there in real time.
4 Invoking the Model
Use the following code to query the status of the fine-tuning job. Once the job succeeds, the `fine_tuned_model` field will be filled with the model's name; note this name for invocation.
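For example (continuing with the same `client`; the job ID is a placeholder):

```python
# Poll the job until it finishes
job = client.fine_tuning.jobs.retrieve("ftjob-abc123")
print(job.status)            # "succeeded" when training is done
print(job.fine_tuned_model)  # e.g. "ft:gpt-4o-mini-2024-07-18:org::abc123"
```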
The invocation method is the same as for the official models; you only need to change the model name, for example:
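A sketch; the `ft:` model name below is a placeholder, so substitute your own `fine_tuned_model` value:

```python
completion = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:org::abc123",  # placeholder; use your fine_tuned_model
    messages=[
        {"role": "user", "content": "Write a blog article titled: What is the Difference Between a Mural and a Mosaic?"},
    ],
)
print(completion.choices[0].message.content)
```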
5 Evaluating Results
During training, there are two metrics available for reference: loss value and token accuracy. The official explanation is as follows:
Validation loss and validation token accuracy are computed in two different ways: on a small batch of data at each step, and on the full validation set at the end of each epoch. The full validation loss and full validation token accuracy metrics are the most accurate indicators of the model's overall performance. These statistics are intended as a sanity check that training is proceeding smoothly (loss should decrease, token accuracy should increase).
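If you want to inspect these metrics programmatically, one option (a sketch; the exact CSV columns may vary) is to download the result file attached to a finished job:

```python
# Download the per-step metrics CSV attached to a finished job
job = client.fine_tuning.jobs.retrieve("ftjob-abc123")  # placeholder job ID
result_file_id = job.result_files[0]
csv_text = client.files.content(result_file_id).read().decode("utf-8")
print(csv_text[:500])  # columns like step, train_loss, valid_loss, valid_mean_token_accuracy
```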
However, metrics are only a reference; the actual effect still needs to be evaluated by you. The fine-tuned model showed at least the following improvements:
- Article length increased by 20%
- Article structure is closer to the training data
- No more formatting errors (such as broken Markdown or stray CSS)
An article generated with the title “What is the Difference Between a Mural and a Mosaic?” is as follows: