What happens when tech giants rely on millions of books to train artificial intelligence, but authors never gave permission? The class action filed in 2025 by authors Grady Hendrix and Jennifer Roberson accuses Apple of using pirated copies of copyrighted books from the Books3 dataset to train its OpenELM models without consent or compensation.
Large generative AI systems depend on vast text corpora to learn patterns in language. Securing legitimate licenses for hundreds of thousands of books would be complex and costly. That gap has led some developers to rely on collections like the Books3 dataset, which has been reported to include unauthorized digital copies of many in-copyright works.
This case could set an important legal precedent about copyright and AI training data. A ruling for the plaintiffs may establish that using copyrighted books for training without permission can constitute copyright infringement. That outcome could require retroactive licensing deals and change the economics of AI development by making AI training data licensing and author compensation mandatory components of model building.
The dispute highlights a key debate in AI policy. The fair use defense for training data remains unsettled in court, and legal experts are divided on whether large-scale use of books for training qualifies as fair use. Greater training data transparency and clearer licensing practices are emerging as proposed solutions to balance technological progress with respect for creator rights.
Generative AI is increasingly integrated into business and creative workflows. With roughly 42% of companies using AI to produce long-form content, the outcome of the 2025 Apple lawsuit could reshape how training data is sourced and paid for across the industry. For authors and publishers, the case offers a chance to assert rights and seek fair compensation. For AI developers, it could mean new compliance and licensing obligations, but also clearer standards that support sustainable, ethical AI.
As the litigation progresses, stakeholders should watch for rulings that clarify the interplay between copyright law and AI training practices and consider building strategies that prioritize training data licensing, author compensation, and training data transparency.