Data Exhaustion Crisis: The Data Bottleneck in AI Development and Countermeasures
We estimate that the stock of public text generated by humans is about 300 trillion tokens. If trends continue, language models will completely exhaust this stock between 2026 and 2032, or even earlier if overtrained. (Epoch AI)
In 2006, Fei-Fei Li, then a professor at the University of Illinois (now at Stanford University), saw the internet's potential to change artificial intelligence (AI) research. Linguistics research had catalogued roughly 80,000 "noun synonym sets," collections of synonyms that describe the same kind of thing. Li hypothesized that among the billions of images on the internet there must be countless examples of each of these synonym sets; if enough of them could be collected, the result would be a database far larger than any previous AI training resource. As she put it, "Many people focus on models; we should focus on data." Thus the ImageNet project was born.
The internet supplied not only the images but also the means to annotate them. After images of cats, dogs, chairs, and so on were found through search engines, workers on Amazon's crowdsourcing platform Mechanical Turk manually checked and labeled them. The result was a database of millions of verified images. It was the use of part of ImageNet to train AlexNet in 2012 that demonstrated the great potential of "deep learning," kicking off the last AI cycle and spawning an industry dependent on large amounts of annotated data.
1 The Data-Driven AI Era
In this AI cycle, development has extended to large language models (LLMs), which also rely on internet data for training, but in a different way. The classic training task in computer vision (CV) is to predict the content of an image (image classification), whereas the classic task for LLM training is to predict missing or upcoming words in a text from their context.
This training method requires no manually annotated data: the system hides words, predicts them, and checks its predictions against the original text, a process known as "self-supervised learning." What it does require is an enormous amount of data. Generally speaking, the more text a model is trained on, the better it performs (the scaling laws). The internet supplies tens of billions of documents, which are to LLMs what carbon deposited over hundreds of millions of years is to modern industry: a precious resource that can be refined into fuel.
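To make the self-supervision idea concrete, here is a minimal sketch of next-token prediction in PyTorch. The tiny character-level model, the toy text, and the hyperparameters are all illustrative assumptions; the point is only that the training signal comes from the text itself, with no human labels.

```python
# Minimal sketch of self-supervised next-token prediction (a hypothetical toy
# model, not any production LLM): the text provides its own supervision.
import torch
import torch.nn as nn

text = "the internet provides a vast amount of text for self-supervised training"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([stoi[ch] for ch in text])

# Inputs are all tokens except the last; targets are the same sequence shifted by one.
x, y = ids[:-1], ids[1:]

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        return self.proj(self.embed(tokens))   # logits over the vocabulary

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(200):
    logits = model(x)
    # Cross-entropy between the predicted distribution and the actual next token.
    loss = nn.functional.cross_entropy(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.3f}")
```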
A common source of training data is Common Crawl, an internet archive containing some 50 billion web pages. As AI models have developed, more sources have been added, such as Books3, a corpus of nearly 200,000 books. But as AI's appetite for text grows, the supply of high-quality data on the internet is running out. According to Epoch AI's estimates, high-quality text data on the internet will be fully used up by around 2028, and the industry will hit the so-called "data wall." The situation is even more severe on the Chinese internet: from laments that "the Chinese internet is collapsing" to major platforms walling off their content, everyone has realized the value of data and is locking it away. How to get past this barrier may be one of the hardest problems in AI's future development, and the one most likely to slow its progress.
2 Data Ownership and Copyright Issues
AI models increasingly rely on internet data, but the copyright status of that data is contentious. Much of the data used to train large language models is taken without the consent of the copyright holders, and some AI companies have even used content behind paywalls. AI companies claim such use falls under the "fair use" doctrine of copyright law; copyright holders disagree. Getty Images has sued the image-generation company Stability AI, accusing it of unauthorized use of its image library. The New York Times has sued OpenAI and Microsoft, accusing them of infringing the copyrights of millions of articles. Stack Overflow, Reddit, and X (formerly Twitter) now charge AI companies for access. Zhihu has even begun serving garbled text to crawlers such as Bing and Google to keep its Chinese-language content from being scraped into AI training datasets.
Different regions take different attitudes toward this issue. Japan and Israel have adopted lenient stances to promote their AI industries. The EU has no general "fair use" concept and may be stricter. In China, so far a National Data Administration has been established, which treats data as having a dual role: both a means of production and an object of production.
3 Existing Data Usage Strategies
Facing the data wall, the AI field has proposed several countermeasures. One key strategy is to focus on data quality rather than quantity. AI labs no longer indiscriminately train models on the entire internet; instead they put more effort into filtering, cleaning, and optimizing data so that models extract the most valuable content from it. Over the past year (2024), OpenAI's models no longer seem "far ahead," and everyone's models perform comparably, with differences on particular tasks coming largely from how the training data is constructed. After all, open-source algorithms and models are plentiful, but open-source datasets are very rare.
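As an illustration of what "filtering and cleaning" can mean in practice, here is a toy sketch of exact deduplication plus a few quality heuristics. The thresholds and rules are purely illustrative assumptions, not any lab's actual pipeline.

```python
# A toy sketch of the filtering and deduplication step described above; all
# heuristics and thresholds are illustrative assumptions.
import hashlib
import re

def quality_ok(doc):
    words = doc.split()
    if len(words) < 50:                      # drop very short fragments
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    if alpha_ratio < 0.7:                    # drop documents dominated by symbols/markup
        return False
    if len(set(words)) / len(words) < 0.3:   # drop highly repetitive boilerplate
        return False
    return True

def dedup_and_filter(docs):
    seen = set()
    for doc in docs:
        normalized = re.sub(r"\s+", " ", doc.lower()).strip()
        digest = hashlib.sha1(normalized.encode()).hexdigest()
        if digest in seen:                   # exact-duplicate removal
            continue
        seen.add(digest)
        if quality_ok(doc):
            yield doc
```

Real pipelines add much more on top of such heuristics: near-duplicate detection (for example MinHash), language identification, and model-based quality scoring.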
Obtaining "real-world" information is crucial, especially for models that do a lot of reasoning, which makes authoritative resources such as academic textbooks particularly valuable. However, finding the right balance between different data sources remains more art than science.
Models also face the problem of "catastrophic forgetting": when a system is trained too heavily on one type of data, it may excel in that domain while forgetting what it learned earlier. The order of data during training therefore needs careful thought. If all of the data on a given topic (say, mathematics) is concentrated at the end of training, the model may do well on math problems while its abilities elsewhere weaken; such unbalanced scheduling exacerbates the risk of catastrophic forgetting.
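One common remedy for both the source-balance question and the ordering question is to sample from all sources throughout training according to fixed mixture weights, rather than concentrating any one topic at the end. The sketch below is a minimal illustration; the source names and ratios are made up for the example.

```python
# A minimal sketch of mixing data sources throughout training rather than
# concentrating one topic (e.g. math) at the end. Names and weights are
# illustrative only.
import random

def interleave(sources, weights, n_steps, seed=0):
    """Yield one training example per step, sampling sources in proportion to weights."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    for _ in range(n_steps):
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, rng.choice(sources[name])

corpus = {
    "web":  ["web doc 1", "web doc 2"],
    "code": ["def f(): pass"],
    "math": ["2 + 2 = 4"],
}
mix = {"web": 0.7, "code": 0.2, "math": 0.1}   # every stretch of training sees every source

for source, example in interleave(corpus, mix, n_steps=5):
    print(source, "->", example)
```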
These strategies grow more complex when data spans different fields and different forms (modalities). As new text data becomes scarce, leading models such as OpenAI's GPT-4 and Google's Gemini train not only on text but also on images, video, and audio during self-supervised learning. Video is particularly tricky, because it carries an extremely dense stream of data. To simplify the problem, existing models usually extract only a handful of frames, and academia is still searching for more efficient solutions.
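The frame subsampling mentioned above can be as simple as the following OpenCV sketch, which grabs a few evenly spaced frames instead of decoding the whole video. The file path and frame budget are placeholders.

```python
# A minimal sketch of frame subsampling for video: take a handful of evenly
# spaced frames rather than every frame. Path and frame count are illustrative.
import cv2

def sample_frames(path, num_frames=8):
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        # Jump to evenly spaced positions instead of decoding the whole video.
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)   # each frame is an HxWx3 array (BGR)
    cap.release()
    return frames

frames = sample_frames("example_video.mp4", num_frames=8)
print(f"sampled {len(frames)} frames")
```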
4 Synthetic Data and AI Self-Training
Model capabilities can also be enhanced by fine-tuning (with additional data) the versions produced by self-supervised learning (the pre-trained models). For example, "supervised fine-tuning" (SFT) gives the model question-answer pairs collected or written by humans, to teach it what a good answer looks like. Another method, "reinforcement learning from human feedback" (RLHF), tells the model whether an answer satisfied the person who asked.
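A common detail in how such question-answer pairs are prepared is loss masking: the prompt and the answer are concatenated, but only the answer tokens contribute to the loss. The sketch below uses made-up token IDs and PyTorch's convention of ignoring targets set to -100; it is an illustration, not a specific framework's API.

```python
# A sketch of preparing an SFT example: concatenate prompt and answer, and mask
# the prompt positions so the loss is computed only on the answer tokens.
IGNORE_INDEX = -100  # PyTorch's cross_entropy skips targets with this value

def build_sft_example(prompt_ids, answer_ids):
    input_ids = prompt_ids + answer_ids
    # The model should learn to produce the answer, not to parrot the prompt,
    # so prompt positions are excluded from the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return {"input_ids": input_ids, "labels": labels}

# Token IDs below are arbitrary stand-ins for a real tokenizer's output.
print(build_sft_example(prompt_ids=[101, 102, 103], answer_ids=[201, 202]))
```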
In RLHF, users give feedback on the quality of the model's output, and that feedback is used to adjust the model's parameters (weights). Interactions with chatbots, such as likes or dislikes on answers, are particularly useful here. This is the "data flywheel": more users bring more data, which in turn yields better models. AI companies watch closely what users ask their models, then collect data to adjust the models to cover those topics. When companies such as Alibaba, ByteDance, and Minimax launch price wars over their models, it is hard to believe this consideration plays no part.
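One concrete way such feedback is used is to train a reward model on preference pairs: the preferred ("thumbs up") response should score higher than the rejected one. The sketch below shows only the standard pairwise objective, with toy scores standing in for a real reward model's outputs.

```python
# A sketch of turning user feedback into training signal for a reward model:
# the preferred response should receive a higher score than the rejected one.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    # Standard pairwise (Bradley-Terry style) objective used in RLHF reward modeling.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores standing in for reward-model outputs on a batch of feedback pairs.
chosen = torch.tensor([1.2, 0.3, 0.8])
rejected = torch.tensor([0.1, 0.5, -0.2])
print(preference_loss(chosen, rejected))
```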
As pre-training data on the internet gradually depletes, the importance of post-training is becoming increasingly prominent. Annotation companies like Scale AI and Surge AI earn hundreds of millions of dollars annually by collecting post-training data. Scale recently raised $1 billion at a valuation of $14 billion. Today’s annotation work has surpassed the era of Mechanical Turk: top annotators can earn up to $100 per hour. Although post-training helps generate better models and meets the needs of many commercial applications, it remains an incremental improvement, addressing symptoms but not the root cause.
Besides chipping away at the data wall, another option is to bypass it entirely with machine-generated synthetic data. DeepMind's AlphaGo Zero is a good example. The company's first successful Go model was trained on millions of moves from amateur games, while AlphaGo Zero used no existing data at all: it learned Go by playing 4.9 million games against itself over three days and recording which strategies succeeded. Through this "reinforcement learning" it learned how to respond to an opponent's move by simulating a large number of possible continuations and choosing the one with the highest probability of success.
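To show the shape of the self-play idea without any of AlphaGo Zero's machinery (no neural network, no tree search), here is a toy loop that generates training data from a trivial take-the-last-stone game. Everything about it is a simplifying assumption; only the pattern carries over: play against yourself, record positions, and label them with the eventual outcome.

```python
# A toy self-play data generator: a tiny Nim-like game stands in for Go.
# Nothing here is DeepMind's actual method.
import random

def legal_moves(stones):
    # A player may take one or two stones.
    return [m for m in (1, 2) if m <= stones]

def self_play_episode(policy, start_stones=7):
    """Play one game of 'whoever takes the last stone wins' against itself.
    Returns (position, player, outcome) triples usable as training data."""
    stones, player, history = start_stones, 1, []
    while stones > 0:
        move = policy(stones)
        history.append((stones, player))
        stones -= move
        if stones == 0:
            winner = player          # taking the last stone wins
        player = -player
    # Label every recorded position with the result from the acting player's view.
    return [(pos, who, winner * who) for pos, who in history]

def random_policy(stones):
    return random.choice(legal_moves(stones))

# Generate a small batch of self-play games; a real system would use these
# labeled positions to improve the policy and then repeat.
games = [self_play_episode(random_policy) for _ in range(3)]
print(games[0])
```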
Similar methods can be applied to LLMs. Take Llama 3.1, currently among the strongest open-source models: a significant share of its SFT data is synthetic data generated by the model itself, and a large share of Gemma 2's SFT data is synthesized by larger models, suggesting that well-constructed synthetic data need not be inferior to human-annotated data.
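A typical pattern for producing such data looks like the sketch below: a stronger "teacher" model answers a pool of prompts, a filter keeps the plausible answers, and the survivors become SFT data for a smaller model. The `teacher_generate` function and the length-based filter are placeholders; real pipelines verify answers far more carefully (reward models, unit tests for code, exact checks for math).

```python
# A sketch of distillation-style synthetic data generation. `teacher_generate`
# stands in for a call to a stronger model; the filter is deliberately simplistic.
def teacher_generate(prompt):
    return f"[teacher answer to: {prompt}]"   # placeholder, not a real model call

def keep(answer):
    # Illustrative heuristic only; real pipelines apply far stronger checks.
    return 10 < len(answer) < 2000

prompts = [
    "Explain the data wall in one sentence.",
    "Write a Python function that reverses a string.",
]

synthetic_sft = []
for prompt in prompts:
    answer = teacher_generate(prompt)
    if keep(answer):
        synthetic_sft.append({"prompt": prompt, "response": answer})

print(synthetic_sft)   # this list would feed the SFT stage of a smaller model
```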
Can we generate synthetic data without limit and bootstrap our way upward forever? I believe the answer is no. A study published in Nature last month found that "abusing" synthetic data during training can cause "irreversible defects." When a model is repeatedly fine-tuned on data it synthesized itself, it takes only a few rounds before it starts producing gibberish, a phenomenon the researchers call "model collapse."
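The intuition can be seen even in a toy statistical setting: fit a model to data, sample from the fit, refit, and repeat; the tails of the distribution are gradually lost. The sketch below refits a Gaussian generation after generation. It mirrors the intuition behind the Nature result, not its experimental setup, and the numbers of chains, generations, and samples are arbitrary.

```python
# A toy numerical illustration of "model collapse": repeatedly fit a Gaussian to
# samples drawn from the previous generation's fit. Averaged over many runs, the
# fitted spread shrinks generation by generation as tail information is lost.
import numpy as np

rng = np.random.default_rng(0)
n_chains, n_generations, n_samples = 200, 30, 50
final_stds = []
for _ in range(n_chains):
    mu, sigma = 0.0, 1.0                                 # the "real" data distribution
    for _ in range(n_generations):
        samples = rng.normal(mu, sigma, size=n_samples)  # train on the previous model's output
        mu, sigma = samples.mean(), samples.std()        # fit the next-generation "model"
    final_stds.append(sigma)

print(f"true std: 1.000, average fitted std after {n_generations} generations: "
      f"{np.mean(final_stds):.3f}")
```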
The bigger question is how to extend this approach to vertical domains such as healthcare or education. In games, victory is well defined, and it is easy to collect data on whether a move helped. In other fields this is far harder. Data on "good" decisions is usually collected from experts, which is expensive and slow, and the coverage is incomplete; determining whether the experts themselves are right is, in turn, a recursive version of the same problem.
5 Conclusion
Acquiring more data will be key to maintaining rapid AI progress. Whether it is specialized data obtained from expert sources or machine-generated synthetic data, AI’s progress depends on the continuous supply of data. As the most easily accessible data reserves gradually deplete, the AI industry has made many efforts to alleviate this issue:
- Emphasizing data quality and data cleaning
- Increasing the proportion of mathematical, logical, and code data, adjusting training order
- Using synthetic data to supplement real data
But these measures look unsustainable on their own. Either new data sources and sustainable substitutes must be found, or, at the level of algorithms and architectures, designs that are far less data-hungry must be developed, ushering in the next AI cycle.
6 Recommended Reading
- The Chinese Internet is Accelerating Collapse | He Jiayan
- Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data | Epoch AI
- Zhihu is Interfering with Crawlers like Bing/Google with Garbled Text | CSDN
- AI Training Data is Depleting, Synthetic Data Sparks Huge Controversy | Wall Street Insights
- A Brief Talk on Llama 3.1: From Structure, Training Process, Impact to Data Synthesis | Volcano Community
- AI Models Collapse When Trained on Recursively Generated Data | Nature