In February 2025, Thomson Reuters won the first major AI copyright case in US history.
The defendant, Ross Intelligence, had copied thousands of pages from Reuters'
legal research database to train an AI product that would compete with Reuters. The court's ruling was unambiguous: this was not fair use. It was theft. The AI industry's response: keep doing it anyway, just with bigger lawyers.

## The Scale of What Was Taken

Before we talk about "innovation," let's talk about what actually happened.

**Books3 Dataset:** 183,000 copyrighted books, scraped from shadow libraries.
Works by Stephen King, Margaret Atwood, Zadie Smith. Living authors who never
consented. This dataset was used to train models worth hundreds of billions of
dollars.

**LAION-5B:** 5.85 billion image-text pairs scraped from the internet. Getty
Images watermarks visible in the training data. Personal photographs.
Professional photography. Art that took years to create, consumed in
milliseconds.

**The Stack:** 6TB of source code from GitHub repositories, regardless of
license. GPL-licensed code, which requires derivative works to be open source,
fed into proprietary models. Proprietary code that was accidentally public,
swallowed whole.

**Common Crawl:** An estimated 400 billion web pages. Blog posts, news
articles, forum discussions, personal websites. Everything you ever published
online, scraped and consumed without your knowledge.

They didn't ask. They took.

## The "Fair Use" Shield

When caught, every AI company reaches for the same defense: fair use. Their argument: "Training AI is like a student reading books to learn. It's
transformative. The output is different from the input."

Here's why that argument falls apart:

### AI Models Memorize, They Don't "Learn"

Research has repeatedly shown that large language models can reproduce exact
passages from their training data. A student who recited entire chapters
verbatim from memory would be accused of plagiarism, not "learning."

### The Output Competes With the Input

When an AI generates an image in an artist's style, that output competes with
the artist's work in the marketplace. Fair use traditionally requires that the
new work doesn't serve as a market substitute for the original.

### The Scale Changes Everything

A human reading one book is learning. A machine consuming 183,000 books and
using them to generate competing content is industrial reproduction. The scale
transforms the nature of the act.

### Thomson Reuters Disagrees

The first court to rule squarely on the question sided with the copyright
holder. Thomson Reuters v. Ross Intelligence established that using proprietary
data to train a competing AI product is not fair use. The ruling is working
its way through appeals, but the precedent is set.

## The Lawsuits Piling Up

As of early 2026, more than 25 active lawsuits challenge AI companies'
training practices:

- **NYT v. OpenAI** -- The Times alleges millions of articles were used without consent. The case centers on "regurgitation" -- the model's ability to reproduce near-exact copies of Times journalism.
- **Getty Images v. Stability AI** -- Stock photos used to train image generators, complete with watermarks in the output.
- **Authors Guild v. OpenAI** -- Living authors allege their copyrighted books were scraped from shadow libraries.
- **GitHub Copilot Class Action** -- Developers allege GPL-licensed code was used in violation of its open-source license terms.
- **Encyclopedia Britannica v. OpenAI** -- The latest in the wave, filed March 2026.
- **Disney v. Midjourney** -- "The Mouse bites back." Characters and imagery under trademark and copyright.

These cases will define whether AI companies have to ask permission before they
take your work. The outcomes are uncertain. The stakes are existential for
creators.

## The Consent Problem

The core issue is not complicated: AI companies built trillion-dollar products using creative works they never
got permission to use.

The "opt-out" response is a joke. Creators are expected to:

1. Discover that their work was used (impossible without transparency)
2. Find the correct opt-out mechanism for each company
3. Submit individual requests
4. Trust that companies actually comply
5. Accept zero compensation for past use

This is not consent. This is coercion with extra steps.

The power imbalance is staggering. Individual artists, writers, and developers
versus trillion-dollar tech companies with armies of lawyers. No collective
bargaining power. No transparency about what was used. Legal recourse requires
expensive litigation that most creators cannot afford.

## What "Innovation" Actually Looks Like

Contrast the scraping approach with companies that built AI ethically.

Adobe Firefly trained exclusively on licensed and public-domain imagery. It works fine. The images are good. The company didn't need to steal anyone's work to build a competitive product.

Models with transparent training data exist. They publish what they trained on. Creators can verify and opt out before training, not after.

The difference between theft and innovation is consent.

Building a competitive AI product without stealing anyone's work is possible.
Several companies have done it. The reason most didn't is simple: stealing was
cheaper and faster. "Move fast and break things" met "ask forgiveness, not
permission," and the result was the largest unauthorized copying of creative
works in human history.

## The X/Twitter Terms of Service Precedent

In January 2026, X (formerly Twitter) updated its Terms of Service to
explicitly grant itself the right to use all user content for AI training with
no opt-out and no compensation. Users grant a "worldwide, royalty-free,
sublicensable license" for "any purpose," including training Grok. The FTC has warned that retroactively changing terms of service to expand AI
training rights may constitute unfair or deceptive practices. But enforcement
lags behind deployment.

No one asked users whether they wanted their tweets training Grok. X changed
the terms and called it consent.

## What Actually Helps

### For Creators

- **Check if your work was used:** haveibeentrained.com searches LAION datasets
- **Block AI crawlers:** Add robots.txt directives for GPTBot, ChatGPT-User, Google-Extended, and other known scrapers
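The crawler-blocking tip above can be sketched as a minimal robots.txt and sanity-checked with Python's standard-library `urllib.robotparser`. The user-agent strings are the crawler names the vendors publish; CCBot (Common Crawl's crawler) is added here as an assumption, and the list should be treated as a starting point, not exhaustive:

```python
from urllib import robotparser

# Illustrative robots.txt that blocks known AI training crawlers
# while leaving the site open to ordinary user agents.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

# Parse the rules and verify they behave as intended.
parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for bot in ("GPTBot", "ChatGPT-User", "Google-Extended", "CCBot"):
    assert not parser.can_fetch(bot, "https://example.com/post")

# A regular browser user agent is still allowed.
assert parser.can_fetch("Mozilla/5.0", "https://example.com/post")
```

Note that robots.txt is purely advisory: compliant crawlers honor it, but nothing technically prevents a scraper from ignoring it, which is part of the enforcement problem this section describes.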
- **Register copyrights:** Required for statutory damages in US courts
- **Join class actions:** Multiple ongoing suits need plaintiffs
- **Use protective tools:** Glaze and Nightshade add adversarial perturbations that disrupt AI training

### For Everyone

- **Demand transparency:** AI companies must disclose training datasets
- **Support opt-in legislation:** Consent before training, not after
- **Back creator compensation models:** Revenue sharing for training data
- **Choose ethical AI products:** Support companies that license their data

## Where This Ends

The AI industry's greatest innovation was not a model architecture or a
training technique. It was the legal fiction that scraping the entire internet
without permission constitutes "fair use."

Thomson Reuters won the first case. The NYT's case is heading toward a ruling
that could reshape the industry. Twenty-five lawsuits and counting are testing
whether "innovation" includes stealing other people's work.

Stealing isn't innovation. It's stealing. And the fact that billions of dollars
and the best legal talent in the world are being deployed to argue otherwise
tells you everything you need to know about the AI industry's relationship with
consent.

They didn't ask before they took your work. They built empires on it. And
now they're fighting in court to ensure they never have to ask.

---

Related:

- AI Training Data Theft
- Thomson Reuters v. Ross: AI Theft on Trial
- NYT v. Perplexity: Journalism Theft
- Suno/Udio: Silence of the Jams

## Sources

- Thomson Reuters wins first major AI copyright case -- Reuters, February 2025
- The New York Times sues OpenAI and Microsoft -- NYT, December 2023
- Encyclopedia Britannica sues OpenAI over AI training -- Reuters, March 2026
- FTC: Quietly changing terms of service could be unfair or deceptive -- FTC, February 2024
- X Terms of Service grant AI training rights with no opt-out -- CryptoSlate, January 2026