Pillar 3 · Urgent

The danger of AI scraping your diaries

Personal blogs, posts you later deleted, leaked personal data: anything that reaches the public web can be scraped to train language models. Once it is inside a trained model, you cannot pull it back out. Your vulnerability becomes training data. Here's what's happening and how to protect yourself.

AI & Privacy April 21, 2026 13 min read

ChatGPT, Claude, Gemini, Llama: every major large language model was trained on massive datasets, much of it scraped from the internet. That covers nearly every kind of public writing: published books, social media posts, code repositories, research papers, and yes, diaries that were posted online.

Not just published diaries. Leaked diaries. Hacked diaries. Deleted diaries. If it was ever publicly accessible online, there is a real chance it was crawled.

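This is not speculation. It's documented. OpenAI trained GPT-3 on a filtered copy of Common Crawl, a public web archive that includes personal blogs and sites like Tumblr and Medium. Meta has said it trains on public Facebook and Instagram posts. Google's privacy policy says it may use publicly available information to train its AI models. Most companies building AI have used scraped text, and almost none asked the authors first.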

And once your words are in a trained model, there's no opt-out.

Your private thoughts are not yours anymore. They're training data.

How AI companies scraped the internet (and your diary)

The AI boom was built on data scraping. Here's how it works:

The AI training pipeline: From your diary to machine learning

1
Web crawlers scrape everything
Bots automatically download publicly accessible webpages: personal blogs, Medium posts, Reddit threads, deleted tweets archived by third parties, Wayback Machine snapshots.
2
Data is packaged and shared
Dataset builders (Common Crawl, LAION, and Google's C4 corpus) aggregate and package scraped data, and most of it is released publicly for anyone to download. No compensation to original authors.
3
AI companies train models
OpenAI, Google, Meta, Anthropic download these datasets and use them to train language models. Your words help tune how AI responds to other people's prompts.
4
Models are deployed commercially
ChatGPT, Gemini, Claude go live. They're built on your data. You get no compensation and almost no control.
5
Your data can be extracted
Researchers have shown that LLMs can reproduce memorized training text verbatim when queried with carefully chosen prompts (a training data extraction attack). If your diary entry was memorized, it could resurface in the model's output.

This happened without your knowledge. Without your consent. Without compensation.

You wrote it privately. They scraped it publicly. Now it's in a model trained on billions of tokens of human vulnerability.

What's actually in the training data

Here's what we know about datasets used to train major LLMs:

2020
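Common Crawl: about 410B tokens of filtered web text in GPT-3
Includes personal blogs, Tumblr, Medium, archived deleted sites. OpenAI filtered roughly 45 TB of raw crawl data down to 570 GB of text for GPT-3. GPT-2 used a separate corpus, WebText, built from pages linked on Reddit.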
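2021-2022
LAION datasets: up to 5.85B image-text pairs from the web
LAION-400M (2021) and LAION-5B (2022) pair image URLs with their captions, extracted from Common Crawl pages. Photo-sharing sites and personal photo blogs are among the sources, so identifying information and private moments are included.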
2022
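ChatGPT launches, built on web data with Reddit in the mix
OpenAI has not published ChatGPT's full training set, but its predecessors leaned on Reddit: GPT-2's WebText and GPT-3's WebText2 were built from pages linked in Reddit posts. In 2024 Reddit signed data licensing deals with Google and OpenAI. People's confessions, medical details, personal stories: all of it is potential training data.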
2023
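Google's privacy policy opens public content to AI training
The July 2023 policy update says Google may use publicly available information to train models such as Bard (now Gemini). Google says it does not train its generative models on Workspace content like Gmail and Docs without permission, but few users read the terms that set those boundaries.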
2024-2026
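Pressure on private spaces grows
Reports describe scraping of Discord servers and other semi-private communities, and the same fears extend to messaging groups and private repositories. These reports are harder to verify than the public datasets above, but the direction is clear: conversational data is in demand, and consent is rarely requested.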

The datasets that trained your favorite AI models draw on sources like these.

Your vulnerabilities are in these models. Your secrets are in these weights.

The four dangers of AI scraping

🔄
Data extraction attacks
Researchers can query LLMs to extract memorized training data. Your diary entries could be reproduced by the model in response to the right prompt.
🎯
Identity inference
If enough of your writing is in the training data, someone can infer who you are. Writing style combined with personal details adds up to re-identification.
💰
Unpaid labor
Your writing trained a multi-billion-dollar model. You got nothing. Your vulnerability is now someone's profit margin.
⚖️
Legal limbo
If an AI model outputs text similar to your work, you're not compensated. Copyright lawsuits are ongoing, but individual users have little leverage.
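
To make the identity-inference danger concrete, here is a minimal stylometry sketch in Python. It compares character trigram frequencies between an anonymous passage and known writing samples using cosine similarity. The function names and sample texts are hypothetical, and real re-identification work relies on far richer features and much more text; this only illustrates the principle.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after normalizing case and whitespace."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_author(anonymous: str, known: dict[str, str]) -> str:
    """Return the known author whose sample is stylistically closest to the anonymous text."""
    profile = char_ngrams(anonymous)
    return max(known, key=lambda name: cosine_similarity(profile, char_ngrams(known[name])))
```

Given a handful of signed posts per candidate, `most_similar_author` picks the closest stylistic match. Habits like spelling, punctuation, and filler words carry over between a public account and an "anonymous" diary, which is why style alone can narrow down who wrote something.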

⚠️ Critical: You Cannot Opt Out

Once data is scraped and used in training, there is no reliable way to undo it. Crawler opt-outs such as robots.txt only affect future crawls. You cannot have your data removed from a trained model (it's baked into the weights), and you cannot demand compensation. The data is gone.

How to protect yourself starting now

You cannot stop AI companies from scraping what's already online. But you can prevent new vulnerability from being scraped:

1

Stop writing to cloud platforms

Google Docs, Microsoft 365, Notion, Medium: all of them store your text where the provider can read it, and anything you publish can be scraped by third parties. Write locally. Use encrypted storage. Never assume "private" means safe.

2

Delete old online writing

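Delete old blogs, Medium posts, Reddit accounts. Deletion won't automatically remove copies from the Wayback Machine, but it keeps your writing out of future crawls, and you can ask the Internet Archive to exclude your site. For sites you control, a robots.txt file can tell AI crawlers to stay away, and DMCA takedown requests can remove copies hosted elsewhere. Compliance with robots.txt is voluntary, so treat it as a signal, not a lock.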
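
For a site you control, the robots.txt signal can be checked locally before you rely on it. The sketch below uses Python's standard `urllib.robotparser` to confirm that a sample robots.txt blocks several AI-related crawler tokens (GPTBot, CCBot, Google-Extended, ClaudeBot) while leaving the site open to everyone else. The sample file, the URL, and the helper name `blocked_crawlers` are illustrative assumptions, and crawler tokens change over time, so check each company's current documentation.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens published by their operators at the time of writing (assumed list).
AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended", "ClaudeBot"]

# Hypothetical robots.txt for a personal blog.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_crawlers(robots_txt: str, url: str) -> dict[str, bool]:
    """Map each AI crawler token to True if robots_txt disallows it from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in AI_CRAWLERS}
```

Calling `blocked_crawlers(ROBOTS_TXT, "https://example.com/diary/")` reports which tokens the file turns away. This only tells you what a well-behaved crawler would do; it cannot stop a scraper that ignores the file, and it does nothing about copies that were collected earlier.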

3

Use offline-first encrypted storage

Your diary should never leave your device unencrypted. If it's encrypted before transmission, AI companies cannot meaningfully scrape it.
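
As an illustration of what "encrypted before transmission" can look like, here is a minimal Python sketch: a passphrase is stretched into a key with PBKDF2, and the entry is sealed with AES-256-GCM before anything is uploaded. It assumes the third-party `cryptography` package is installed. The function names, blob layout, and iteration count are hypothetical choices for this example, not a description of how CHRONOS or any particular product is implemented.

```python
import hashlib
import os

# Third-party dependency (assumption): pip install cryptography
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

SALT_LEN = 16                # random salt stored alongside the ciphertext
NONCE_LEN = 12               # 96-bit nonce, the standard size for AES-GCM
PBKDF2_ITERATIONS = 600_000  # illustrative work factor (assumption)


def _derive_key(passphrase: str, salt: bytes) -> bytes:
    """Stretch a passphrase into a 256-bit key; the key never leaves the device."""
    return hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, PBKDF2_ITERATIONS, dklen=32
    )


def encrypt_entry(plaintext: str, passphrase: str) -> bytes:
    """Return salt + nonce + ciphertext; this blob is the only thing a server would store."""
    salt = os.urandom(SALT_LEN)
    nonce = os.urandom(NONCE_LEN)
    key = _derive_key(passphrase, salt)
    ciphertext = AESGCM(key).encrypt(nonce, plaintext.encode("utf-8"), None)
    return salt + nonce + ciphertext


def decrypt_entry(blob: bytes, passphrase: str) -> str:
    """Reverse encrypt_entry; raises InvalidTag if the passphrase or blob is wrong."""
    salt = blob[:SALT_LEN]
    nonce = blob[SALT_LEN:SALT_LEN + NONCE_LEN]
    ciphertext = blob[SALT_LEN + NONCE_LEN:]
    key = _derive_key(passphrase, salt)
    return AESGCM(key).decrypt(nonce, ciphertext, None).decode("utf-8")
```

The design point is that the passphrase and derived key stay on the device. Whoever holds the uploaded blob, whether a hosting provider or a scraper, sees random-looking bytes and an authentication tag, which carry nothing useful for model training.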

4

Assume everything you write will be scraped

Write anything you post online with the assumption that your words will end up in an AI training dataset. Don't publish secrets you cannot afford to have extracted. This is the new reality.

5

Support legal action against scrapers

Copyright lawsuits against OpenAI, Google, Meta are ongoing (2024-2026). They're asking: is scraping without consent fair use? Support the writers suing. Push for legislation.

The shift: From privacy to sovereignty over training data

Traditional privacy concerns were about *access*: Can someone read your data? Can they see you?

AI scraping introduces a new concern: *use*. Even if no human reads your diary, it's used to train models. Your vulnerability becomes training data used to improve AI that serves your competitors.

This is why offline-first, encrypted storage matters. If your data never leaves your device unencrypted, it cannot be meaningfully scraped or used in training.

CHRONOS was built with this in mind. Your vault data is encrypted before it ever leaves your device. The server sees only ciphertext. Even if an AI company somehow gained access to Vercel's database, they'd only find encrypted bytes—useless for training.

The only data safe from AI scraping is data that never exists in plaintext outside your own device.

The future: AI arms race vs. privacy

This is not a solved problem. Governments are starting to act (EU AI Act, copyright lawsuits), but it's slow.

In the meantime, the only certainty is this: if you want your writing to remain yours, keep it encrypted and offline-first. That's the only architecture that protects you in an age of AI scraping.

CHRONOS

Your thoughts stay yours.
Not AI training data.

Encrypted before transmission. The server never sees plaintext. Your vulnerability remains private.
