ChatGPT, Claude, Gemini, Llama: every large language model was trained on massive datasets, much of them scraped from the internet. That covers a huge share of what people have written online: published books, social media posts, code repositories, research papers, and yes, diaries kept on blogs and journaling sites.
Not just diaries people meant to publish. Deleted diaries can survive in archived snapshots. Leaked or hacked diaries can end up on public pages where crawlers find them. If it was ever posted somewhere a crawler could reach, it may well have been scraped.
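This is not speculation. It's documented. OpenAI trained GPT-3 largely on filtered Common Crawl data, a web crawl that sweeps up personal blogs, Tumblr, and Medium. Google's privacy policy says it may use publicly available information to train its AI models. Meta has said it trains its models on public Facebook and Instagram posts. Across the industry, public writing was collected without anyone asking the authors for consent.
And once a model has been trained, there's no opt-out.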
Your private thoughts are not yours anymore. They're training data.
How AI companies scraped the internet (and your diary)
The AI boom was built on data scraping. Here's how it works:
The AI training pipeline: From your diary to machine learning
This happened without your knowledge. Without your consent. Without compensation.
You wrote it for yourself, or for a handful of readers. A crawler collected it anyway. Now it's in a model trained on trillions of tokens of human writing, vulnerability included.
What's actually in the training data
Here's what we know about the datasets used to train major LLMs. They include, or can include:
- Personal blogs: Decades of intimate writing, auto-scraped by web crawlers
- Reddit confessions: People sharing trauma, medical details, secrets. All in training data.
- Leaked datasets: Hacked databases of journal apps, dating apps, and therapy platforms that get reposted on public sites, where crawlers can pick them up
- Deleted content: Wayback Machine archives, third-party snapshots of deleted posts
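- Platform data: Whatever you post publicly on a platform (YouTube comments, for example) is within reach of the company that runs it, and Google's privacy policy allows training on publicly available information. For Gmail or Google Docs, what protects you is the provider's policy, not a technical barrier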
- Medical data: Patient forums, health subreddits, symptoms shared online
Your vulnerabilities are in these models. Your secrets are in these weights.
The three dangers of AI scraping
⚠️ Critical: You Cannot Opt Out
Once data has been scraped and used in training, there is no practical opt-out mechanism. You cannot have your writing removed from a trained model, because it's baked into the weights and cannot be selectively deleted. No major provider offers compensation for it. The data is out of your hands.
How to protect yourself starting now
You cannot stop AI companies from using what has already been scraped. But you can keep new writing out of reach:
Stop writing to cloud platforms
Google Docs, Microsoft 365, Notion, Medium: whatever you store in plaintext on someone else's servers can be read by that company, exposed in a breach, or made public through a sharing setting. Write locally. Use encrypted storage. Never assume "private" means safe.
Delete old online writing
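Delete old blogs, Medium posts, and Reddit posts (deleting a Reddit account alone leaves the posts up). This won't remove copies from the Wayback Machine, but it stops new crawls from picking the text up and signals that the author wants it gone. Some crawlers honor opt-out signals such as robots.txt, and a DMCA takedown request can get a copy removed from a host.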
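If you still run a blog you want to keep online, the robots.txt signal can be sketched roughly as follows. This is a minimal illustration, not a guarantee: the crawler tokens listed (GPTBot, CCBot, Google-Extended) are the ones published by OpenAI, Common Crawl, and Google, but the list is not exhaustive, the example URL is a placeholder, and the helper names are hypothetical. Honoring robots.txt is voluntary on the crawler's side.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens to opt out of. This list is an assumption for illustration
# and is not exhaustive; check each vendor's documentation for current names.
AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended"]


def build_robots_txt(agents):
    """Return robots.txt text that disallows the whole site for each agent."""
    blocks = [f"User-agent: {agent}\nDisallow: /\n" for agent in agents]
    # Every other crawler (ordinary search engines, for example) stays allowed.
    blocks.append("User-agent: *\nAllow: /\n")
    return "\n".join(blocks)


def is_blocked(robots_txt, agent, url):
    """Check whether `agent` is asked not to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


robots_txt = build_robots_txt(AI_CRAWLERS)
page = "https://example.com/diary/2019-03-14"  # placeholder URL

for agent in AI_CRAWLERS:
    print(agent, "blocked:", is_blocked(robots_txt, agent, page))
print("Other crawlers blocked:", is_blocked(robots_txt, "SomeSearchBot", page))
```

The generated text goes in a file named robots.txt at the root of the site. It only affects future crawls by crawlers that choose to respect it; it does nothing about copies that already exist.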
Use offline-first encrypted storage
Your diary should never leave your device unencrypted. If it's encrypted before transmission, AI companies cannot meaningfully scrape it.
Assume everything you write will be scraped
When you do publish something online, assume it will end up in an AI training dataset. Don't post secrets you cannot afford to have extracted. This is the new reality.
Support legal action against scrapers
Copyright lawsuits against OpenAI, Google, and Meta are ongoing (2024-2026). The core question: is training on scraped writing without consent fair use? Support the writers suing. Push for legislation.
The shift: From privacy to sovereignty over training data
Traditional privacy concerns were about *access*: Can someone read your data? Can they see you?
AI scraping introduces a new concern: *use*. Even if no human ever reads your diary, it can still be used to train models. Your vulnerability becomes raw material for AI products that serve everyone else, your competitors included.
This is why offline-first, encrypted storage matters. If your data never leaves your device unencrypted, it cannot be meaningfully scraped or used in training.
CHRONOS was built with this in mind. Your vault data is encrypted before it ever leaves your device. The server sees only ciphertext. Even if an AI company somehow gained access to the CHRONOS database hosted on Vercel, they'd find only encrypted bytes, useless for training.
The only data safe from AI scraping is data that never exists in plaintext outside your own device.
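To make "encrypted before it leaves your device" concrete, here is a generic sketch of client-side encryption. It is not CHRONOS's actual code: the function names, the passphrase-based key, and the scrypt parameters are all assumptions chosen for illustration. It uses AES-256-GCM from the third-party `cryptography` package and scrypt from Python's standard library.

```python
import hashlib
import os

# Third-party dependency (assumption): pip install cryptography
from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def derive_key(passphrase: str, salt: bytes) -> bytes:
    """Stretch a passphrase into a 256-bit key with scrypt."""
    return hashlib.scrypt(passphrase.encode("utf-8"), salt=salt,
                          n=2**14, r=8, p=1, dklen=32)


def encrypt_entry(plaintext: str, passphrase: str) -> dict:
    """Encrypt on the device; the returned dict is all a server would receive."""
    salt = os.urandom(16)   # random per entry, stored next to the ciphertext
    nonce = os.urandom(12)  # AES-GCM nonce, must never repeat for a given key
    key = derive_key(passphrase, salt)
    ciphertext = AESGCM(key).encrypt(nonce, plaintext.encode("utf-8"), None)
    return {"salt": salt, "nonce": nonce, "ciphertext": ciphertext}


def decrypt_entry(blob: dict, passphrase: str) -> str:
    """Decrypt on the device; raises InvalidTag on a wrong passphrase or tampering."""
    key = derive_key(passphrase, blob["salt"])
    plaintext = AESGCM(key).decrypt(blob["nonce"], blob["ciphertext"], None)
    return plaintext.decode("utf-8")


entry = "Today I finally admitted I'm scared of the surgery."
blob = encrypt_entry(entry, "correct horse battery staple")

# What the storage provider holds: random-looking bytes, no readable words.
print(blob["ciphertext"].hex())
# Only someone with the passphrase can recover the text.
print(decrypt_entry(blob, "correct horse battery staple") == entry)
```

The salt and nonce are not secret and can be stored with the ciphertext. The passphrase and the derived key are what must never be uploaded; a server that only ever receives the `blob` has nothing readable to scrape.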
The future: AI arms race vs. privacy
This is not a solved problem. Governments and courts are starting to act (the EU AI Act, copyright lawsuits), but progress is slow.
In the meantime:
- AI companies will keep scraping, because training data is valuable and regulations are weak
- More data breaches will happen, and leaked sensitive data can end up in training datasets
- Extraction attacks will improve, making it easier to reconstruct training data from models
- Legal battles will intensify, but litigation is slow; by the time it's resolved, every major model will already have been trained on scraped data
The only certainty: if you want your writing to remain yours, keep it encrypted and offline-first. That's the only architecture that protects you in an age of AI scraping.
CHRONOS
Your thoughts stay yours.
Not AI training data.
Encrypted before transmission. The server never sees plaintext. Your vulnerability remains private.