
More than 340 local news outlets are limiting the Internet Archive’s access to their journalism
McClatchy, Advance Local, Tribune Publishing and other major newspaper chains are restricting the nonprofit’s archiving bots.
By Andrew Deck and Hanaa’ Tameez, May 20, 2026, 5:03 p.m.
In January, Nieman Lab broke the story that major news publishers — including The New York Times, The Guardian, and USA Today Co. — had started blocking the Internet Archive due to concerns that AI companies might scrape the nonprofit’s repositories for training data.
No news publisher has confirmed to Nieman Lab that an AI company has already scraped their content from the Wayback Machine. Still, in the nearly four months since we published our story, the number of news sites blocking the Internet Archive has continued to rise.
Overwhelmingly, these sites are local news outlets.
Our new analysis shows that more than 340 local news sites across the United States are now limiting the Internet Archive’s ability to access and preserve their stories. Many sites in our sample are owned by five of the seven largest local news publishers in the country: USA Today Co., McClatchy, Advance Local, MediaNews Group, and Tribune Publishing. The latter two are both subsidiaries of the “vulture hedge fund” Alden Global Capital.
Researchers, historians, and citizens around the world rely on the web archives of local news sites to do their work.
“Blocking the Internet Archive’s web crawlers threatens one of the most effective ways that we capture and store news content for the long term,” Edward McCain, a journalism librarian at the University of Missouri, said. “In the present we may have some workarounds, but in the long run, it weakens a vital link in primary source materials that we need to understand where we’ve been and where we want to go.”
Working journalists are among the most frequent users of the Wayback Machine’s local news archives. Over the last month, online petitions have called for news media companies to allow the Internet Archive to preserve their journalism.
“I cover news within a larger news desert in New York’s Rockland, Sullivan, and Rockland counties. This means I need to heavily rely on archival data of old news articles from now deceased, or zombie-fied, media outlets,” wrote B.J. Mendelson, the editor of The Monroe Gazette newsletter, in one recent petition signed by over 200 journalists. “Without the Internet Archive, my [work] would be incredibly difficult to do.”
In the face of publisher concerns, the Wayback Machine has highlighted its efforts to minimize abuse of its site, including implementing systems that limit bulk downloading and working with vendors like Cloudflare to monitor bot activity. “We are in conversation with many publishers and appreciate the opportunity to address their concerns,” Mark Graham, the director of the Wayback Machine, told Nieman Lab, noting that the Internet Archive’s terms of use permit using its collections only for scholarship or research purposes.
Meredith Broussard, a data journalist and professor at New York University, said that as profit margins for news thin, it has only become more important for news publishers to protect their intellectual property.
“This is the same fight that everybody has been having with the Internet Archive since its inception,” Broussard said. “Internet Archive is a very old-school, ‘information-should-be-free’ organization. But the people who are invested differently have different priorities. There are lots of different historical and legal and economic issues that are colliding in this situation. AI companies [are] the catalyst for the latest skirmish in a very old battle.”
In January, Nieman Lab used journalist Ben Welsh’s database of 1,167 news websites’ robots.txt files to determine which sites were disallowing the Internet Archive. At the time, the Internet Archive did not respond to requests to confirm which crawling bots it was using, so we identified four bots that the AI user agent watchdog service Dark Visitors had associated with it. (You can find our full methodology here.)
We found that 241 news websites disallowed at least one Internet Archive-affiliated crawling bot. About 80% of these sites belonged to USA Today Co., the company formerly known as Gannett.
By May, we found that an additional 141 news websites disallowed at least one Internet Archive-affiliated bot, increasing the total number of sites in our sample to 382. Some of these additions appeared in Welsh’s database. We found others by checking robots.txt files ourselves. Our final sample includes sites in 10 countries, though the vast majority (93%) are based in the United States.
Of the 382 news sites in our updated sample, 342 are local. Of course, our data doesn’t include all the local news outlets in the United States, but it shows that many of the country’s largest local news publishers are at least attempting to limit Internet Archive access.
The crawling bots we tracked in our new analysis are Heritrix, My-heritrix-crawler, heritrix/3.3.0, Archive-It, archive.org_bot, ia_archiver-web.archive.org, and Special_archiver. (Our January analysis included Archive-It, archive.org_bot, ia_archiver-web.archive.org, and Special_archiver. We added Heritrix and its variations after confirming that they belong to the Internet Archive.)
Graham told Nieman Lab that the Wayback Machine doesn’t use the bots “ia_archiver,” “ia_archiverbot” or “ia_archiver-web.archive.org.”
Third-party websites and internet forums have regularly documented “ia_archiver-web.archive.org” as an alleged user agent of the Wayback Machine. We continue to include “ia_archiver-web.archive.org” in our dataset because news publishers are disallowing the bot under the assumption that it is used by the Internet Archive.
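For readers who want to run a similar check on a single site, here is a minimal Python sketch of how a robots.txt file could be scanned for the bots listed above. It is not Nieman Lab's actual method, which is described in the linked methodology. The function names (parse_groups, disallowed_bots) and the sample file are hypothetical, and the sketch makes two assumptions of its own: a bot counts as disallowed only if it is named explicitly in a group that has at least one non-empty Disallow rule, and wildcard (User-agent: *) rules are ignored.

```python
# Bot names as listed in the article. Case-insensitive matching is an assumption.
TRACKED_BOTS = [
    "Heritrix", "My-heritrix-crawler", "heritrix/3.3.0", "Archive-It",
    "archive.org_bot", "ia_archiver-web.archive.org", "Special_archiver",
]


def parse_groups(robots_txt):
    """Split robots.txt text into (user_agents, disallow_paths) groups."""
    groups = []
    agents, disallows = [], []
    seen_rule = False  # True once the current group has an Allow/Disallow line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a User-agent line after rules starts a new group
                groups.append((agents, disallows))
                agents, disallows, seen_rule = [], [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            seen_rule = True
            if field == "disallow" and value:  # an empty Disallow blocks nothing
                disallows.append(value)
    if agents:
        groups.append((agents, disallows))
    return groups


def disallowed_bots(robots_txt, bots=TRACKED_BOTS):
    """Return tracked bots named explicitly in a group that has a Disallow rule."""
    blocked = set()
    for agents, disallows in parse_groups(robots_txt):
        if disallows:
            blocked.update(agents)
    return [bot for bot in bots if bot.lower() in blocked]


# Hypothetical robots.txt content, for illustration only.
sample = """\
User-agent: *
Disallow: /admin/

User-agent: archive.org_bot
User-agent: ia_archiver-web.archive.org
Disallow: /
"""

print(disallowed_bots(sample))
# ['archive.org_bot', 'ia_archiver-web.archive.org']
```

Under these assumptions, a site counts toward the tally if disallowed_bots returns a non-empty list for its robots.txt file. A stricter check might count only a full-site "Disallow: /" rule, and a looser one might also count wildcard blocks.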
In July 2025, Alden ran an editorial in more than 60 of its daily newspapers openly criticizing OpenAI and other AI companies that have used news content to train their models without compensation. “Securing permission from, and fairly compensating, those publishers who created this great foundation of knowledge is the right, just and American thing to do,” read the editorial. Both Alden publishers are part of the major copyright infringement suit against OpenAI and Microsoft that includes The New York Times and is currently winding its way through federal court.
Continue/Read Original Article: More than 340 local news outlets are limiting the Internet Archive’s access to their journalism | Nieman Journalism Lab
