The Internet Archive is in danger

More companies are opting not to archive their sites

Illustrative collage of the Internet Archive logo, cracked
Many media sites have blocked the Internet Archive’s ability to capture snapshots of their pages
(Image credit: Illustration by Julia Wytrazek / Getty Images)

The Internet Archive has been responsible for saving and providing access to trillions of websites over the past 30 years. AI is putting a damper on the organization’s work, as large language models are using the data without permission. As a result, many companies are no longer allowing their content to be archived, which could lead to a large loss of historical records in the future.

Access denied

Currently, “241 news sites from nine countries explicitly disallow at least one out of the four Internet Archive crawling bots,” including The New York Times and Reddit, said Nieman Lab. Of these sites, 87% are owned by USA Today Co., the “largest newspaper conglomerate in the United States, formerly known as Gannett.” The Guardian has also restricted the Internet Archive; the publication does not block the crawlers, but it “excludes its content from the Internet Archive API and filters out articles from the Wayback Machine interface, which makes it harder for regular people to access archived versions of its articles,” said Wired.

Article continues below

The Week

Escape your echo chamber. Get the facts behind the news, plus analysis from multiple perspectives.

SUBSCRIBE & SAVE
https://cdn.mos.cms.futurecdn.net/flexiimages/jacafc5zvs1692883516.jpg

Sign up for The Week's Free Newsletters

From our morning news briefing to a weekly Good News Newsletter, get the best of The Week delivered directly to your inbox.

From our morning news briefing to a weekly Good News Newsletter, get the best of The Week delivered directly to your inbox.

Sign up

Many of the same media outlets banning Internet Archive’s crawlers have used the resource themselves to access older data and articles. “Journalists rely on the Archive as a resource in our reporting, and many digital investigations into issues like misinformation or censorship are possible only because it preserves material that would otherwise disappear,” said the organizations Fight for the Future, the Electronic Frontier Foundation and Public Knowledge, in a letter to the Internet Archive. “Without that ongoing work to preserve the web, large parts of journalism’s recent history would already be lost.”

On record

Artificial intelligence is the biggest reason sites are blocking the Internet Archive. There is “evidence that the Wayback Machine has been used to train large language models,” said Forbes. The archive allows tech companies to “skirt copyright laws by using the Wayback Machine as a workaround for training language models on their content,” said Morning Brew. Despite this, Mark Graham, the director of the Wayback Machine, “emphasizes that the digital archive has controls to limit abuse of AI automation and prevent large-scale data extraction.”

Unfortunately, a few bad apples ruin the whole bunch. The Internet Archive “tends to be good citizens,” Robert Hahn, the head of business affairs and licensing at The Guardian, said to Nieman Lab. “It’s the law of unintended consequences: You do something for really good purposes, and it gets abused.” The nonprofit “has taken on the Herculean task of preserving the internet, and many news organizations aren’t equipped to save their own work,” Nieman Lab said.

There is “no widely available public tool comparable to the Wayback Machine,” said Wired. If it “continues to lose access to major news sources, its preservation efforts could erode to the point where early digital records of history become much harder to access, or are even lost altogether.”

Devika Rao, The Week US

 Devika Rao has worked as a staff writer at The Week since 2022, covering science, the environment, climate and business. She previously worked as a policy associate for a nonprofit organization advocating for environmental action from a business perspective.