AI Models Accurately Recreate Copyrighted Books: Legal Risks Loom
As artificial intelligence (AI) companies defend themselves against a growing wave of legal challenges, a carefully constructed distinction is emerging: whether their models “learn” from copyrighted works or store and reproduce them. The question goes to the core of copyright law, which protects original works by granting copyright holders the exclusive right to reproduce, adapt, distribute, perform, and display them, as set out in the U.S. Copyright Act of 1976.
The Core of the Debate
The “fair use” doctrine, however, allows others to use copyrighted material without permission for purposes such as criticism, journalism, and research. It has been a key defense for the AI industry against accusations of infringement. OpenAI CEO Sam Altman has stated that the industry’s progress would “be over” if it were prohibited from freely using copyrighted data to train its models.
For years, copyright holders have protested, alleging that AI companies are training their models on pirated and copyrighted works, profiting from them without fairly compensating authors, journalists, and artists. This dispute has already resulted in some settlements.
New Research Raises Concerns
Recent research may force AI companies back on the defensive. A study by researchers at Stanford and Yale Universities found compelling evidence that AI models are, in fact, reproducing data rather than simply “learning” from it. Specifically, four prominent Large Language Models (LLMs) – OpenAI’s GPT-4.1, Google’s Gemini 2.5 Pro, xAI’s Grok 3, and Anthropic’s Claude 3.7 Sonnet – were able to recreate lengthy passages from popular, copyrighted works with remarkable accuracy.
The researchers found that Claude recreated “entire books almost verbatim” with 95.8% accuracy. Gemini reproduced the novel Harry Potter and the Sorcerer’s Stone with 76.8% accuracy, while Claude recreated George Orwell’s 1984, which is still under copyright, with over 94% accuracy relative to the original text.
“While many assume that LLMs do not memorize much of their training data, recent research shows that a substantial amount of copyrighted text can be extracted from open models,” the researchers wrote.
Potential Legal Ramifications
The implications of these findings could be substantial, as numerous copyright infringement cases are currently being litigated in U.S. courts. According to Alex Reisner of The Atlantic, the results further weaken the AI industry’s argument that LLMs “learn” from texts rather than storing information and later reproducing it. This evidence “could be a huge legal liability for AI companies” and “could cost the industry billions of dollars in copyright infringement.”
Whether AI companies are liable for copyright infringement remains a hotly debated topic. Stanford law professor Mark Lemley, who has represented AI companies in copyright cases, told The Atlantic he is unsure whether it is accurate to say that an AI model “has” a book, as opposed to being able to recreate it “instantaneously in response to a prompt.”
Unsurprisingly, the industry continues to maintain that it does not technically reproduce copyrighted works. In 2023, Google stated to the U.S. Copyright Office that “We find no copies of training data—texts, images, or other formats—within the model.” OpenAI similarly reported to the office that its “models do not retain copies of the information they are trained on.”
Reisner of The Atlantic argues that the analogy of AI models learning like humans is “misleading, a comforting idea that gets in the way of the public discussion we need to have about how AI companies are using creative and intellectual works on which they are entirely dependent.”
What’s Next?
It remains unclear whether the judges presiding over these copyright infringement cases will agree with this assessment. The stakes are high, particularly as authors, journalists, and other content creators find it increasingly difficult to earn a living while the AI industry’s valuation climbs ever higher, according to Futurism.
Frequently Asked Questions
What is the “fair use” doctrine?
The “fair use” doctrine allows others to use copyrighted material for purposes such as criticism, journalism, and research.
Which AI models were examined in the recent study?
The study examined OpenAI’s GPT-4.1, Google’s Gemini 2.5 Pro, xAI’s Grok 3, and Anthropic’s Claude 3.7 Sonnet.
What did the AI companies state regarding copies of training data?
Both Google and OpenAI reported to the U.S. Copyright Office that their models do not retain copies of the information they are trained on.