The Internet Archive has saved and provided access to trillions of web pages over nearly 30 years. The nonprofit builds a “digital library of internet sites,” according to its website, using web crawlers to capture snapshots that are then made available through its public-facing tool, the Wayback Machine. However, amid the rise of AI, the Internet Archive’s “commitment to free information access has turned its digital library into a potential liability,” Nieman Lab said in an analysis.
Currently, “241 news sites from nine countries explicitly disallow at least one out of the four Internet Archive crawling bots,” including The New York Times and Reddit, said Nieman Lab. Many of the same outlets blocking the crawlers have used the resource themselves to access older data and articles. “Journalists rely on the Archive in our reporting, and many digital investigations into issues like misinformation or censorship are possible only because it preserves material that would otherwise disappear,” the advocacy groups Fight for the Future, the Electronic Frontier Foundation and Public Knowledge said in a letter.
Artificial intelligence is the biggest reason sites are blocking the Internet Archive. There is “evidence that the Wayback Machine has been used to train large language models,” said Forbes. But Mark Graham, the director of the Wayback Machine, “emphasizes that the digital archive has controls to limit abuse of AI automation,” said Morning Brew.
There is “no widely available public tool comparable to the Wayback Machine,” said Wired. If it “continues to lose access to major news sources,” early “digital records of history” may become “lost altogether.”