The Internet Archive is in danger
More companies are opting not to archive their sites
The Internet Archive has saved and provided access to more than a trillion web pages over the past 30 years. AI is now hampering the organization’s work, as archived data is being used to train large language models without permission. As a result, many companies no longer allow their content to be archived, which could lead to a large loss of historical records.
Access denied
The Internet Archive is a nonprofit that is building a “digital library of internet sites and other cultural artifacts,” according to its website. The organization uses web crawlers to capture snapshots of sites. These snapshots are then made available through its public-facing tool, the Wayback Machine, which operates like a library, providing “free access to researchers, historians, scholars, people with print disabilities and the general public.” However, amid the rise of AI, the Internet Archive’s “commitment to free information access has turned its digital library into a potential liability for some news publishers,” said an analysis by Nieman Lab.
Currently, “241 news sites from nine countries explicitly disallow at least one out of the four Internet Archive crawling bots,” including The New York Times and Reddit, said Nieman Lab. Of these sites, 87% are owned by USA Today Co., the “largest newspaper conglomerate in the United States, formerly known as Gannett.” The Guardian has also restricted the Internet Archive; the publication does not block the crawlers, but it “excludes its content from the Internet Archive API and filters out articles from the Wayback Machine interface, which makes it harder for regular people to access archived versions of its articles,” said Wired.
Many of the same media outlets blocking the Internet Archive’s crawlers have used the resource themselves to access older data and articles. “Journalists rely on the Archive as a resource in our reporting, and many digital investigations into issues like misinformation or censorship are possible only because it preserves material that would otherwise disappear,” said the organizations Fight for the Future, the Electronic Frontier Foundation and Public Knowledge, in a letter to the Internet Archive. “Without that ongoing work to preserve the web, large parts of journalism’s recent history would already be lost.”
On record
Artificial intelligence is the biggest reason sites are blocking the Internet Archive. There is “evidence that the Wayback Machine has been used to train large language models,” said Forbes. The archive allows tech companies to “skirt copyright laws by using the Wayback Machine as a workaround for training language models on their content,” said Morning Brew. Despite this, Mark Graham, the director of the Wayback Machine, “emphasizes that the digital archive has controls to limit abuse of AI automation and prevent large-scale data extraction.”
Unfortunately, a few bad apples ruin the whole bunch. The Internet Archive “tends to be good citizens,” Robert Hahn, the head of business affairs and licensing at The Guardian, said to Nieman Lab. “It’s the law of unintended consequences: You do something for really good purposes, and it gets abused.” The nonprofit “has taken on the Herculean task of preserving the internet, and many news organizations aren’t equipped to save their own work,” Nieman Lab said.
There is “no widely available public tool comparable to the Wayback Machine,” said Wired. If it “continues to lose access to major news sources, its preservation efforts could erode to the point where early digital records of history become much harder to access, or are even lost altogether.”
Devika Rao has worked as a staff writer at The Week since 2022, covering science, the environment, climate and business. She previously worked as a policy associate for a nonprofit organization advocating for environmental action from a business perspective.
