Your Digital Collections Are Training AI. Did You Know That?
Quick question: Have you checked whether AI companies are scraping your library’s digital collections for training data?
If not, you should. And while you’re at it, check whether your privacy policy is lying about AI.
If your answer is “We don’t have a way to check that,” you’re not alone. Most libraries don’t.
Here’s the uncomfortable truth: If your digital collections are publicly accessible on the web, they’ve been used to train AI. Past tense. It already happened. (Common Crawl, whose data was used to train GPT-3 and other major AI models, scrapes publicly accessible web content—including library digital collections.)
And you might not have any say in it.
How AI Training Actually Works
AI models like ChatGPT, Claude, and Google’s Gemini are trained on massive datasets—billions of documents, images, and web pages scraped from the internet.
The companies building these models use web crawlers (bots) that systematically visit websites, download content, and add it to training datasets.
Your library’s website? Fair game. Your digital collections portal? Fair game. Your institutional repository? Fair game.
Unless you’ve specifically blocked these crawlers (and most libraries haven’t), your content is getting scraped.
Why This Matters
Copyright infringement risk: If your digital collections include copyrighted material (digitized books, journal articles, photographs, archival documents), and AI companies are training on that content without permission, that may be copyright infringement. If courts rule against fair use, you’re caught in the middle. Rightsholders might sue the AI company. The AI company might say “We got this content from [your library]’s website.”
Metadata and privacy issues: Your digital collections contain metadata. Catalog records, subject headings, usage statistics, timestamps, user-generated tags. That metadata is valuable. If your metadata includes information about who accessed what and when… that’s privacy-sensitive. Most libraries don’t realize their metadata is being scraped alongside content.
Loss of control: Once your content is in an AI training dataset, you can’t get it back. Even if you later decide “We don’t want our content used for AI,” it’s too late. The AI has already learned from it. You’ve permanently lost control.
Ethical concerns: Many libraries have ethical commitments around how they share and preserve cultural heritage. You’ve spent years building digital collections that respect Indigenous knowledge protocols, donor restrictions, privacy of individuals in archival materials, and cultural sensitivities. AI companies don’t care about any of that. They’re scraping everything indiscriminately.
Your carefully curated collection becomes just another data dump in a corporate AI model.
The Common Crawl Problem
You’ve probably never heard of Common Crawl. But it’s one of the largest sources of AI training data.
Common Crawl is a nonprofit that regularly crawls the entire web and releases the data as open datasets. AI companies use Common Crawl data extensively because it’s free, comprehensive, and legally (somewhat) defensible.
If your library’s website is indexed by Common Crawl, your content is in datasets being used to train AI.
Want to check? Go to index.commoncrawl.org and search the index for your library’s domain.
I’ll wait.
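If you’d rather check programmatically, Common Crawl’s URL index has a public CDX API at index.commoncrawl.org. Here’s a minimal Python sketch; the crawl ID and domain below are placeholders (pick a current crawl ID from the index page and substitute your own domain):

import requests

# The crawl ID below is an example -- check index.commoncrawl.org
# for the list of current crawls and substitute one.
CRAWL_ID = "CC-MAIN-2024-10"
DOMAIN = "digital.example-library.org"  # hypothetical; use your own domain

resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL_ID}-index",
    params={"url": f"{DOMAIN}/*", "output": "json", "limit": "20"},
    timeout=30,
)

# The CDX API returns one JSON record per captured URL, one per line.
# It returns HTTP 404 when no captures match the query.
if resp.ok:
    captures = resp.text.strip().splitlines()
    print(f"Found {len(captures)} captures (showing up to 20):")
    for record in captures:
        print(record)
else:
    print(f"No captures found, or error (HTTP {resp.status_code})")

If that query returns records, your collections are already in Common Crawl’s datasets.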
What AI Companies Are Saying (And Why It’s Nonsense)
“It’s publicly accessible, so it’s fair game.” Wrong. “Publicly accessible” doesn’t mean “free to use for commercial AI training.” Your library makes content publicly accessible for research, education, and preservation—not to train corporate AI models.
“We’re following robots.txt rules.” Robots.txt is a file that tells web crawlers which parts of your site they can and can’t access. But it’s voluntary. Crawlers can ignore it. And many AI companies do ignore it or interpret it selectively.
“It’s fair use.” Maybe. Courts haven’t decided yet. And if they decide it’s not fair use, you’re left dealing with the consequences.
“We anonymize and aggregate the data.” Great. But you didn’t ask permission. And “anonymized” doesn’t mean “ethical.”
What You Can Do Right Now
Update your robots.txt file:
Add these lines to block common AI crawlers:
# Block OpenAI (ChatGPT)
User-agent: GPTBot
Disallow: /
# Block Google AI
User-agent: Google-Extended
Disallow: /
# Block Anthropic (Claude) -- anthropic-ai is the older token
User-agent: anthropic-ai
Disallow: /
# Anthropic's current crawler
User-agent: ClaudeBot
Disallow: /
# Block Common Crawl
User-agent: CCBot
Disallow: /
# Block ByteDance (Bytespider)
User-agent: Bytespider
Disallow: /
# Block ChatGPT user-initiated browsing (OpenAI)
User-agent: ChatGPT-User
Disallow: /
This won’t stop all AI scraping (some crawlers don’t identify themselves, others ignore robots.txt), but it’s a start.
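Once your robots.txt is live, you can sanity-check it with Python’s standard-library parser. A minimal sketch, using a hypothetical domain (substitute your own):

from urllib.robotparser import RobotFileParser

# Hypothetical domain -- substitute your library's actual site.
SITE = "https://digital.example-library.org"

rp = RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()  # fetch and parse the live robots.txt

# The AI user agents blocked in the rules above.
for agent in ["GPTBot", "Google-Extended", "anthropic-ai", "ClaudeBot",
              "Bytespider", "CCBot", "ChatGPT-User"]:
    allowed = rp.can_fetch(agent, f"{SITE}/collections/")
    print(f"{agent}: {'blocked' if not allowed else 'STILL ALLOWED'}")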
Add a Terms of Service page:
Create a clear Terms of Service that states:
- Content on this site is provided for research and educational purposes
- Commercial use, including AI training, is prohibited without written permission
- Automated scraping for AI training is explicitly forbidden
Will this legally stop AI companies? No. But it establishes your intent and could be useful if you ever need to take legal action.
Monitor your server logs:
Check who’s accessing your digital collections. Look for:
- High-volume automated access patterns
- User agents associated with AI crawlers (GPTBot, CCBot, Bytespider, etc.)
- Unusual traffic spikes
If you see suspicious activity, block those IPs or user agents.
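What does checking your logs actually look like? Here’s a minimal sketch that tallies requests from known AI crawlers, assuming a combined-format access log at a hypothetical path (adjust the path and parsing for your own server):

import re
from collections import Counter

# Known AI-crawler user-agent substrings (not exhaustive).
AI_AGENTS = ["GPTBot", "CCBot", "ClaudeBot", "anthropic-ai",
             "Google-Extended", "Bytespider", "ChatGPT-User"]

hits = Counter()
# Hypothetical path -- point this at your actual access log.
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        # In the combined log format, the user agent is the last quoted field.
        quoted = re.findall(r'"([^"]*)"', line)
        if not quoted:
            continue
        user_agent = quoted[-1].lower()
        for agent in AI_AGENTS:
            if agent.lower() in user_agent:
                hits[agent] += 1

for agent, count in hits.most_common():
    print(f"{agent}: {count} requests")

Remember: crawlers that don’t identify themselves won’t show up here, so treat this as a floor, not a full picture.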
Use metadata to assert rights:
Add rights statements to your metadata:
“This content is made available for research and education. Use for commercial purposes, including training artificial intelligence models, is prohibited without permission.”
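Where that statement lives depends on your repository software. If your system exposes Dublin Core (as most OAI-PMH-compliant repositories do), the natural home is a dc:rights element. A sketch, with a hypothetical record:

<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example digitized photograph (hypothetical record)</dc:title>
  <dc:rights>This content is made available for research and education.
    Use for commercial purposes, including training artificial intelligence
    models, is prohibited without permission.</dc:rights>
</oai_dc:dc>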
Participate in collective action:
Individual libraries can’t fight Big Tech alone. But collective action matters.
Join organizations like Authors Alliance, Creative Commons, or ALA’s Intellectual Freedom Committee. The more libraries speak up, the more pressure there is for regulation and ethical AI practices.
The Questions Nobody’s Asking (But Should Be)
If we make content publicly accessible, do we have any right to control how it’s used?
Legally, maybe not (depending on copyright and fair use). Ethically, absolutely. Libraries exist to serve the public good, not to fuel corporate AI development.
Should we treat AI companies like we treat other commercial entities?
You wouldn’t let a for-profit publisher scrape your digital collections and republish them without permission. Why is AI training different?
Are we complicit in AI harms if we allow our content to be scraped?
If an AI trained on your library’s content generates biased, harmful, or false information, do you bear any responsibility?
What do our donors and rightsholders think?
If you digitized materials donated to your library, do the donors know their content might be training corporate AI? Did they consent to that?
The HathiTrust Paradox
HathiTrust successfully argued in Authors Guild v. HathiTrust that building a searchable database of scanned books was fair use. The court ruled in HathiTrust’s favor because the project wasn’t commercially exploiting the content.
Now AI companies are doing the opposite—scraping content from libraries and using it for commercial products. And they’re claiming the same “fair use” defense HathiTrust used.
If courts allow that, it undermines the whole point of the HathiTrust decision. Libraries lose control over digital preservation and access.
My Recommendation: Don’t Stay Silent
You have three options:
Option 1: Do nothing. Accept that your digital collections will be used to train AI.
Option 2: Block AI scrapers. Update robots.txt, monitor server logs, assert your rights.
Option 3: Engage and negotiate. Reach out to AI companies and say “If you want our content, let’s talk.” Set terms. Require transparency.
I recommend a combination of 2 and 3.
Blocking AI scrapers sends a message: You don’t have free rein over our content. Engaging with AI companies (or advocating for regulation) ensures libraries have a voice in how AI is built.
Doing nothing is the worst option. Because silence is consent—and right now, AI companies are treating library silence as permission.
Check your robots.txt. Block AI scrapers. Stop letting them take without asking.
Authenticity note: With the exception of images, this post was not created with the aid of any LLM product for prose or description. It is original writing by a human librarian with opinions.