This article explores the evolving legal landscape around AI training on copyrighted material in the United States, focusing on the principle that using lawfully obtained content in a transformative way may qualify as fair use. It highlights the importance of how training data is sourced, emphasizing that acquiring pirated content constitutes infringement regardless of how transformative the AI training itself may be. While grounded in U.S. court decisions and laws, these developments are globally relevant: many leading AI companies operate out of the U.S., and U.S. legal precedents often influence international policy and industry practices.
An article by Manasvi Madhumohan and Srividhya Perumal
In June 2025, a U.S. federal court delivered the first big legal win for Artificial Intelligence (AI) companies using copyrighted material to train their models. In Bartz v. Anthropic, Judge William Alsup ruled that Anthropic acted legally when it trained its Claude model on books the company had bought and scanned, calling that use “exceedingly transformative” and protected under fair use. “Exceedingly transformative” means that the court found the new use of copyrighted material to be extremely different in purpose, function, or outcome from how the original work was intended to be used. This marks the first time a court has explicitly held that training an AI on copyrighted text may qualify as fair use, provided that the data comes from lawful sources.
The judge rejected Anthropic’s fair-use defense for the more than seven million pirated books it downloaded, and that issue now goes to trial to decide potential statutory damages. So, before AI companies rush to train on every book they can find, it is worth pausing to understand the ruling’s implications.
What Is the Anthropic Case?
Three authors filed a lawsuit against Anthropic, alleging that it trained its AI, Claude, on their copyrighted books without consent or payment, in part through a dataset called “The Pile”. It is important to note that Anthropic acquired books for its training library in two primary ways. First, it downloaded enormous collections of illegally obtained books from pirate sources such as Books3 and Library Genesis. Second, it bought used physical books in bulk, removed their bindings, scanned them into digital format, and discarded the originals.
The court referenced Perfect 10, Inc. v. Amazon.com, Inc. as a guideline for analyzing fair use, stating that each use of the copyrighted material must be examined separately. In Perfect 10, Google’s thumbnail images were held to be fair use because they served as pointers to the original content, not replacements. While Anthropic claimed it used the books only for training AI, the authors argued there were multiple uses. Unlike the single transformative purpose in Perfect 10, the court found that some of Anthropic’s uses, such as storing pirated books in a permanent library, were not transformative and therefore violated copyright law.
The court also cited the Google Books case (Authors Guild v. Google), in which Google scanned millions of library books to create searchable databases, text snippets for users, and accessible formats for disabled readers. Those uses were found transformative because they served new purposes such as search and accessibility. Anthropic claimed it could have legally bought used copies or taken a similar lawful route under the first sale doctrine (17 U.S.C. § 109(a)). However, the judgment concluded that it did neither. Instead, it downloaded pirated books from shadow libraries to train its core model.
Judge Alsup held that training AI with books may qualify as fair use if the material is lawfully obtained, but that downloading pirated books violates copyright law. The court found that Anthropic made unauthorized copies and committed mass infringement, though some actions were excused as fair use. The key takeaway is that how developers obtain training data matters as much as how they use it.
AI & Copyright
This is not the first time Intellectual Property (IP) and AI have clashed. Getty Images launched a high-profile lawsuit against Stability AI in 2023, alleging that the company used up to 12 million of Getty’s copyrighted photos to train its AI image generator, Stable Diffusion, without permission, and seeking $1.8 trillion in damages.
Internationally, Getty Images v. Stability AI remains a wildcard. In London, Getty dropped its main copyright claims but pursued trademark infringement, arguing that model “weights” themselves may infringe by embedding copyrighted images. Model weights are the internal settings a neural network learns from training on text or images, representing the AI’s knowledge or response patterns. If courts find that model weights can violate copyright, developers may need to retrain models regionally or restrict their use to follow local laws.
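To make the “weights” concept concrete, here is a minimal sketch in Python. It uses a one-layer linear model rather than a full neural network, and nothing in it is drawn from any party’s filings, but the principle is the same: weights start as meaningless numbers and end up encoding patterns extracted from the training data. The open question in the litigation is whether copyrighted expression can be said to persist inside those numbers.

```python
# Minimal sketch: "model weights" are numeric parameters adjusted during
# training. Hypothetical one-layer example, not any party's actual model.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # 100 training examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])     # the pattern hidden in the data
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)                         # weights begin as "blank" numbers
for _ in range(500):                    # training nudges the weights so the
    grad = X.T @ (X @ w - y) / len(y)   # model's outputs better fit the data
    w -= 0.1 * grad                     # gradient-descent update

print(w)  # after training, the weights encode what was learned: roughly [2, -1, 0.5]
```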
Several high-profile copyright cases emerged in 2024 and 2025, most notably the lawsuit against Meta, in which a group of authors alleged that Meta infringed their IP rights by using their books to train its AI models. Two days after the Anthropic judgment, Judge Vince Chhabria ruled similarly in Kadrey v. Meta, granting Meta summary judgment on fair use. Though Meta used pirated books to train its LLaMA models, the plaintiffs failed to show market harm. The judge noted that stronger evidence in future cases could change the outcome, reinforcing that economic impact is key in fair use rulings.
In Thomson Reuters v. ROSS, a Delaware federal court ruled against ROSS Intelligence. ROSS had scraped Westlaw’s proprietary legal headnotes to build a competing research tool. The court found this use non-transformative and directly harmful because it served the same commercial purpose as the original, and rejected the fair-use defense. This case demonstrates that not all AI training is the same: courts examine both the purpose of the use and its effect on the market.
The legal community and lawmakers have responded swiftly. The bipartisan AI Accountability and Personal Data Protection Act (2025) would give individuals the power to sue AI companies that use their copyrighted works without consent. Taken together, these cases and this legislation begin to sketch out the first real “rules of the road” for generative AI.
Fair Use: Meaning and Connection to AI
The legal principle known as “fair use”, a doctrine of U.S. copyright law with analogues in other jurisdictions, encourages freedom of expression by permitting certain unlicensed uses of copyrighted works under specific conditions. It serves as an affirmative defense that allows parties to use such works without explicit consent for purposes such as criticism, comment, news reporting, teaching, or research.
Section 107 of the Copyright Act in the U.S. identifies four factors for determining whether a given use of a copyrighted work is a fair use: (1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.
U.S. courts generally agree that training AI to understand and generate text differs from merely copying protected works. Judges have compared large language models to people learning ideas from what they read, which makes training “transformative” and supports fair use. However, courts are cautious when AI outputs directly compete with copyrighted works. Market harm is often the deciding factor, as seen in Kadrey v. Meta, where the judge leaned toward fair use due to a lack of clear economic harm but signaled that stronger evidence could succeed in future cases. In contrast, Bartz v. Anthropic stressed that the source of training data matters, ruling that acquiring pirated content is infringement regardless of how the AI is later trained.
Emerging Regulations
The U.S. Copyright Office’s 2025 Part III report on generative AI training reflects the concerns of U.S. courts, and U.S. legal developments often influence global AI policy debates. The report emphasized that copying creative works for AI training can go beyond fair use, especially if the data comes from illegal sources. The Office favors voluntary licensing over mandatory government-imposed licensing as a way to balance innovation with the rights of copyright owners.
Meanwhile, the AI Accountability and Personal Data Protection Act would impose statutory damages of at least $1,000 per unauthorized work, prohibit training on copyrighted datasets without consent, and create new tort claims. Supporters view it as a necessary protection for creators, whereas critics believe it could undermine existing fair use protections.
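A back-of-the-envelope calculation shows the scale such a floor implies. The corpus sizes below are illustrative only (the seven-million figure echoes the pirated-book count at issue in Bartz v. Anthropic), and the bill’s final damages provisions could differ:

```python
# Hypothetical exposure under a proposed $1,000-per-work statutory floor.
# Corpus sizes are illustrative, not findings from any case.
STATUTORY_FLOOR = 1_000  # dollars per unauthorized work (proposed minimum)

for works in (10_000, 1_000_000, 7_000_000):
    print(f"{works:>9,} works -> ${works * STATUTORY_FLOOR:>14,} minimum exposure")
```

At that rate, a corpus of seven million unauthorized works implies at least $7 billion in exposure before any multipliers, which is why training-set size itself becomes a compliance variable.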
What This Means for Developers and Rights Holders
Developers should now recognize the importance of documenting where their datasets come from. They can reduce legal risks by actively keeping logs of purchases and license terms. Companies should conduct audits to remove pirated or potentially infringing materials and invest in testing whether models memorize and replicate copyrighted outputs verbatim (one possible shape for such record-keeping and testing is sketched at the end of this section). One of the most debated unresolved risks is whether an AI system that outputs large verbatim sections of copyrighted works, either by accident or design, could itself be infringing, even if the training was lawful. Courts have not yet ruled on this critical issue, and future cases may turn on empirical evidence about how much (and what kind of) content models can “remember” from their training corpora. New legislation could create liability, so firms must update their compliance programs and include potential statutory damages in their risk assessments.

For rights holders, these cases offer a clearer pathway to enforcement. Collecting data on economic harms, identifying and targeting pirate repositories, and negotiating licenses add up to a more effective strategy.
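The following sketch illustrates the two developer practices described above: a provenance record for each acquisition and a naive verbatim-overlap probe. Every detail here (the field names, the eight-word threshold, the helper names) is a hypothetical design choice for illustration, not a legal standard or any company’s actual pipeline.

```python
# Hypothetical sketch of (1) a dataset-provenance record and (2) a naive
# verbatim-memorization probe. Field names and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class ProvenanceRecord:
    source: str         # e.g. vendor or purchase channel
    license_terms: str  # e.g. "purchased physical copy, first sale"
    acquired_on: str    # ISO date of acquisition
    lawful_basis: str   # e.g. "license", "purchase", "public domain"

def ngram_overlap(model_output: str, protected_text: str, n: int = 8) -> bool:
    """Flag outputs that reproduce any n consecutive words verbatim."""
    out_words = model_output.split()
    grams = {tuple(out_words[i:i + n]) for i in range(len(out_words) - n + 1)}
    src_words = protected_text.split()
    return any(tuple(src_words[i:i + n]) in grams
               for i in range(len(src_words) - n + 1))

# Usage: log every acquisition, then probe model generations against known works.
record = ProvenanceRecord(source="used-book wholesaler",
                          license_terms="purchased physical copy",
                          acquired_on="2025-01-15",
                          lawful_basis="purchase")
sample_output = "it was the best of times it was the worst of times indeed"
known_work = "it was the best of times it was the worst of times"
if ngram_overlap(sample_output, known_work):
    print("Potential verbatim reproduction detected; route for legal review.")
```

A production system would compare outputs against indexed corpora and tune the match length to whatever thresholds courts eventually endorse, but the core idea, detecting verbatim reproduction above some threshold, is the same.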
Conclusion
The Bartz v. Anthropic ruling marks a pivotal moment in AI copyright law, establishing the “exceedingly transformative” standard that will likely influence future litigation, particularly in the United States. Even so, much remains unsettled, and significant legal uncertainties persist. One point of emerging consensus is clear: the source and manner of data acquisition has become as important as the transformative nature of the use itself. Judge William Alsup’s distinction between lawfully purchased books and pirated content creates a practical roadmap for AI companies, though it also raises compliance costs and operational complexity. As this body of law continues to develop, AI companies that proactively address copyright concerns through legitimate data sourcing and careful documentation will be better positioned to withstand future legal challenges.
Published under licence CC BY-NC-ND.