The Internet Archive has saved, and provided access to, more than a trillion web pages over the past three decades. But AI is complicating its work, as large language models hoover up the archived data without permission. In response, many publishers are no longer allowing their content to be archived, a shift that could leave substantial gaps in the historical record.
The Internet Archive is a non-profit organisation that is building a “digital library of internet sites and other cultural artifacts”, according to its website. It uses web crawlers to capture snapshots of sites; these are then made available through the Wayback Machine, a public-facing tool that operates like a library. Amid the rise of AI, however, the Internet Archive’s “commitment to free information access has turned its digital library into a potential liability for some news publishers”, according to an analysis by Nieman Lab.
Currently, “241 news sites from nine countries explicitly disallow at least one out of the four Internet Archive crawling bots”, said Nieman Lab; among them are The New York Times and Reddit. The Guardian has also restricted the Internet Archive. The publication does not block the crawlers, but it “excludes its content from the Internet Archive API and filters out articles from the Wayback Machine interface, which makes it harder for regular people to access archived versions of its articles”, said tech site Wired.
There is “no widely available public tool comparable to the Wayback Machine”, added Wired. If the Internet Archive “continues to lose access to major news sources, its preservation efforts could erode to the point where early digital records of history become much harder to access or are even lost altogether”.