Two of the world's most trusted reference publishers just threw a legal bomb at OpenAI. Encyclopedia Britannica and Merriam-Webster filed a lawsuit alleging the AI giant infringed their copyright in nearly 100,000 articles by scraping their content to train large language models. The suit marks another major escalation in the battle over AI training data, coming as OpenAI faces mounting legal pressure from publishers, artists, and creators who say the company built its empire on stolen intellectual property.
Encyclopedia Britannica and Merriam-Webster are taking OpenAI to court over what they claim is wholesale theft of their editorial work. The publishers allege OpenAI scraped close to 100,000 articles from their databases to train the company's large language models without permission, licensing deals, or compensation.
The lawsuit lands at a particularly sensitive moment for OpenAI. The company is already fighting multiple copyright battles with The New York Times, authors including John Grisham and George R.R. Martin, and visual artists. Each case tests OpenAI's defense that training AI models on publicly available content constitutes fair use under copyright law.
What makes this case different is the nature of the plaintiffs. Encyclopedia Britannica has been publishing authoritative reference content since 1768. Merriam-Webster has defined the English language for generations. These aren't content mills or news aggregators - they're institutions that have spent generations building reputations on accuracy, editorial standards, and meticulous fact-checking.
That carefully curated content apparently ended up in OpenAI's training datasets anyway. The publishers claim their articles, definitions, and educational materials were ingested to help ChatGPT and other models generate human-like text. In essence, OpenAI took the reference works that students and researchers have relied on for decades and fed them into its AI without asking.
The legal theory here hinges on whether scraping published content for AI training qualifies as transformative use. OpenAI has consistently argued that training AI models is fundamentally different from copying or republishing content. The company claims its models learn patterns and concepts rather than memorizing specific text. Publishers aren't buying it. They point out that AI models can and do reproduce content verbatim, and that the entire value proposition of these models depends on ingesting high-quality source material.
Encyclopedia Britannica and Merriam-Webster want more than just an acknowledgment. They're seeking damages for what they characterize as industrial-scale copyright infringement. The nearly 100,000 articles represent decades of editorial investment, expert curation, and intellectual labor. If the publishers prevail, it could force OpenAI to either license reference content properly or exclude it from training entirely.
The timing puts extra pressure on OpenAI as the company tries to position itself as a responsible AI leader. The firm has signed content licensing deals with some publishers, including Axel Springer and the Associated Press. But those agreements came after lawsuits started piling up, and many publishers say the offered terms don't come close to fair compensation for their archives.
This lawsuit also raises uncomfortable questions about what happens when AI companies train on humanity's collective knowledge without permission. Reference works aren't just commercial products - they're educational resources, cultural artifacts, and records of human understanding. If AI companies can scrape them freely, what incentive remains for publishers to invest in accuracy and expertise?
The case will likely take months or years to resolve, but its implications stretch far beyond Encyclopedia Britannica and Merriam-Webster. Educational publishers, academic institutions, and libraries are watching closely. If reference content isn't protected from unauthorized AI training, virtually no published material is safe.
OpenAI hasn't issued a detailed response yet, but the company's past statements suggest it will lean heavily on fair use arguments and claims that AI training doesn't harm the original publishers' markets. That argument gets harder to make as AI-generated content increasingly competes with the very sources it was trained on. Why pay for an Encyclopedia Britannica subscription when ChatGPT can answer your questions using knowledge extracted from Britannica articles?
This lawsuit isn't just about 100,000 articles or two publishers seeking damages. It's a referendum on whether AI companies can build billion-dollar businesses on the backs of content creators without permission or fair compensation. If Encyclopedia Britannica and Merriam-Webster win, it could force the entire AI industry to rethink how it sources training data. If they lose, it might open the floodgates for AI companies to scrape virtually anything published online. Either way, the case will help define the rules for AI's relationship with human knowledge in the years ahead.