The murky world of AI training
Despite public interest in artificial intelligence models themselves, few consider how those models are trained.

Reddit will reportedly allow an unnamed artificial intelligence company to train its models using the online message board's user-generated content.
California-based Reddit told prospective investors ahead of its initial public offering (IPO) that it had signed a contract with "an unnamed large AI company" worth about $60 million (£48 million) annually, according to Bloomberg. The agreement "could be a model for future contracts of a similar nature".
Apple has already "opened negotiations" with several major news and publishing organisations to develop its generative AI systems with their materials, according to The New York Times. The tech giant has "floated multiyear deals worth at least $50 million" (£40 million) to license the archives of news articles, anonymous sources told the paper.
But the rapidly accelerating world of AI training has been marked by controversy, from arguments over copyright to fears of ethics violations and the replication of human bias.
How does AI learn?
Tech companies train AI models, the best known being ChatGPT, on "massive amounts of data and text scraped from the internet", said Business Insider – including copyrighted material.
ChatGPT creator OpenAI designed the model to find patterns in vast text databases: it ultimately analysed 300 billion words and 570 gigabytes of data, said BBC Science Focus. Text-to-image models such as Stable Diffusion, which generate images from text prompts, were trained on the LAION-5B dataset of nearly 6 billion image-text pairs.
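In practice, "finding patterns in text" typically means next-token prediction: the model repeatedly guesses which word comes next in a passage and is adjusted whenever it guesses wrong. The sketch below is a deliberately tiny, illustrative version of that idea using word-pair counts; the corpus and every name in it are invented for the example, and real systems like ChatGPT use neural networks many orders of magnitude larger.

```python
from collections import Counter, defaultdict

# A toy corpus standing in for the billions of words real models ingest.
corpus = "the cat sat on the mat the dog sat on the rug".split()

# Count how often each word follows each other word: the simplest
# possible "pattern" a language model can extract from text.
follow_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follow_counts[current_word][next_word] += 1

def predict_next(word):
    """Return the word most often seen after `word` during training."""
    candidates = follow_counts.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("sat"))  # -> "on", the only word that followed "sat"
```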
What are the issues with training AI models?
OpenAI and Google have both been accused of training AI models on creators' work without licensing it or obtaining permission. The New York Times is even suing OpenAI and Microsoft for copyright infringement over the use of its articles.
OpenAI further fanned the flames of the copyright debate when it released Sora, a video generator, last week. Sora is able to create incredibly lifelike videos from simple text prompts, but OpenAI "has barely shared anything about its training data", said Mashable. Speculation immediately began that Sora was trained on copyrighted material.
Despite their vast data reserves, these models still require a human touch. In a process known as reinforcement learning from human feedback, human reviewers evaluate the accuracy and appropriateness of a model's output. "So click by click, a largely unregulated army of humans is transforming the raw data into AI feedstock," said The Washington Post.
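To make that concrete, here is a hedged sketch of what the click-by-click ratings feed into: raters compare pairs of candidate answers, and their preferences become a crude reward signal. Real pipelines train a neural reward model on such comparisons and use it to fine-tune the main model; everything below, from the example answers to the win-rate "reward", is invented purely for illustration.

```python
from collections import defaultdict

# Each entry: two candidate answers and which one a human rater preferred.
comparisons = [
    ("Paris is the capital of France.", "France's capital is Lyon.", "a"),
    ("Paris is the capital of France.", "I am not sure.", "a"),
    ("The moon is made of cheese.", "The moon is rock and dust.", "b"),
]

wins = defaultdict(int)
appearances = defaultdict(int)
for answer_a, answer_b, preferred in comparisons:
    appearances[answer_a] += 1
    appearances[answer_b] += 1
    wins[answer_a if preferred == "a" else answer_b] += 1

# A toy "reward" for each answer: the share of comparisons it won.
for answer, seen in appearances.items():
    print(f"{wins[answer] / seen:.2f}  {answer}")
```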
Not only is it costly and time-consuming to employ people to babysit AI models, but the process is subjective: individuals have different standards for what counts as accurate or appropriate output.
This human feedback work has also led to the exploitation of workers, The Washington Post reported. In the Philippines, former employees have accused San Francisco start-up Scale AI of paying workers "extremely low rates" via an outsourced digital work platform called Remotasks, or of withholding payments entirely. Human rights groups say it is "among a number of American AI companies that have not abided by basic labor standards for their workers abroad", said the paper.
AI companies have also inadvertently hired children and teenagers to perform these roles, reported Wired, because tasks are "often outsourced to gig workers, via online crowdsourcing platforms".
Ethical concerns aside, the creators of AI models face a looming supply problem: the internet may contain massive amounts of data, but it is not unlimited.
The most advanced AI programs "have consumed most of the text and images available" and are running out of training data: their "most precious resource", said The Atlantic. This has "stymied the technology's growth, leading to iterative updates rather than massive paradigm shifts".
What's coming down the pipeline?
OpenAI, Google DeepMind, Microsoft and other big tech companies have recently "published research that uses an AI model to improve another AI model, or even itself", said The Atlantic. Tech executives have heralded this approach, which relies on synthetic data, as "the technology's future".
Training an AI model on data that another AI model has produced is risky, however: it can reinforce conclusions the model drew from the original data, which may be incorrect or even biased.
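Researchers have dubbed this degradation "model collapse": each generation trained on the previous generation's output drifts toward its most common patterns and gradually loses rare ones. The toy simulation below illustrates the mechanism with word frequencies; the corpus, the small majority bias and the number of generations are all invented for the example.

```python
import random

random.seed(0)

# Generation 0: "real" data containing both common and rare words.
corpus = ["common"] * 90 + ["rare"] * 10

for generation in range(1, 6):
    # Each new generation is "trained" on text sampled from the previous
    # one, with a mild bias toward its most frequent word (a stand-in for
    # models preferring high-probability output).
    majority = max(set(corpus), key=corpus.count)
    corpus = random.choices(corpus, k=95) + [majority] * 5
    share = corpus.count("rare") / len(corpus)
    print(f"generation {generation}: rare words make up {share:.0%}")
```

Over successive generations, the rare words' share tends toward zero: exactly the narrowing of outputs that worries researchers about synthetic data.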
As the AI industry continues its exponential growth, what happens next is unclear – but it is likely to be both "exciting and scary", said The New York Times.