All-powerful, ever-pervasive AI is running out of internet
There is no such thing as unlimited data
![Glass brain with connectors.](https://cdn.mos.cms.futurecdn.net/TQRhvEFDjwy3nS4DLQyeQ-415-80.jpg)
Artificial intelligence (AI) has relied on high-quality language data to train its models, but supply is running low. That depletion is forcing companies to look elsewhere for data sourcing as well as to change their algorithms to use data more efficiently.
What is the scope of AI's data problem?
AI models are trained on data, and that data is running out. A paper by Epoch, an AI research organization, found that AI could exhaust all the current high-quality language data available on the internet as soon as 2026. This could pose a problem as AI continues to grow. "The issue stems from the fact that, as researchers build more powerful models with greater capabilities, they have to find ever more texts to train them on," said the MIT Technology Review. The quality of the data used in training AI is important. "The [data shortage] issue stems partly from the fact that language AI researchers filter the data they use to train models into two categories: high-quality and low-quality," said the Review. "The line between the two categories can be fuzzy," but "text from [high-quality data] is viewed as better-written and is often produced by professional writers."
AI models require vast amounts of data to be functional. For example, "the algorithm powering ChatGPT was originally trained on 570 gigabytes of text data, or about 300 billion words," said Singularity Hub. In addition, "low-quality data such as social media posts or blurry photographs are easy to source but aren't sufficient to train high-performing AI models," and could even be "biased or prejudiced or may include disinformation or illegal content which could be replicated by the model." Much of the data on the internet is considered useless for AI modeling. Instead, "AI companies are hunting for untapped information sources and rethinking how they train these systems," said The Wall Street Journal. "Companies also are experimenting with using AI-generated, or synthetic, data as training material — an approach many researchers say could actually cause crippling malfunctions."
What are AI companies doing to combat the imminent data scarcity?
The ticking clock on high-quality data has forced AI developers to think more creatively. For instance, Google has considered using user data from Google Docs, Google Sheets and similar company products. Other companies are "searching for content outside the free online space, such as that held by large publishers and offline repositories," including works published before the internet existed, said Singularity Hub. Meta has considered purchasing the publishing house Simon & Schuster to gain access to all its literary works. More broadly, many companies have looked to synthetic data, which is generated by AI itself. "As long as you can get over the synthetic data event horizon, where the model is smart enough to make good synthetic data, everything will be fine," OpenAI CEO Sam Altman said at a tech conference in 2023. However, using synthetic data can present other problems. "Feeding a model text that is itself generated by AI is considered the computer-science version of inbreeding," said the Journal. "Such a model tends to produce nonsense, which some researchers call 'model collapse.'"
The other option is to rework AI algorithms to use the existing high-quality data more efficiently. One strategy being explored is called curriculum learning, in which "data is fed to language models in a specific order in hopes that the AI will form smarter connections between concepts," said the Journal. If successful, the method could cut the data required to train an AI model by half. Companies may also diversify the data sets used in AI models to include some lower-quality sources or instead opt to create smaller models that require less data altogether. "We've seen how smaller models that are trained on higher-quality data can outperform larger models trained on lower-quality data," Percy Liang, a computer science professor at Stanford University, told the MIT Technology Review.
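To make the idea of feeding data "in a specific order" concrete, here is a minimal Python sketch of one way a curriculum could be arranged. It is an illustration under stated assumptions, not the method any company named above uses: the article does not say how the order is chosen, so the difficulty measure here (word count, with shorter texts treated as easier) and the names `word_count` and `curriculum_batches` are hypothetical.

```python
from typing import Callable, Iterator, List


def word_count(text: str) -> int:
    """Hypothetical difficulty proxy: longer texts are treated as harder."""
    return len(text.split())


def curriculum_batches(
    texts: List[str],
    batch_size: int,
    difficulty: Callable[[str], int] = word_count,
) -> Iterator[List[str]]:
    """Yield batches ordered from easiest to hardest under the given proxy."""
    # sorted() is stable, so texts with equal difficulty keep their input order.
    ordered = sorted(texts, key=difficulty)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


# Toy corpus, invented for illustration only.
corpus = [
    "Cats purr.",
    "The committee postponed its decision on the proposed zoning changes.",
    "Rain fell all day.",
    "Birds sing at dawn in spring.",
]

for step, batch in enumerate(curriculum_batches(corpus, batch_size=2)):
    # A real training loop would pass each batch to the model here.
    print(step, batch)
```

In this toy run the two shortest sentences form the first batch and the two longest form the second. Real curricula would likely rely on richer signals than length, and whether any given ordering actually reduces the data a model needs is the open question the researchers are testing.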
Devika Rao has worked as a staff writer at The Week since 2022, covering science, the environment, climate and business. She previously worked as a policy associate for a nonprofit organization advocating for environmental action from a business perspective.