Apple, known for its stance on user privacy and ethical business practices, now faces a copyright lawsuit over its AI training data. Two U.S. authors, Grady Hendrix and Jennifer Roberson, filed a proposed class action in early September 2025 alleging that Apple used their copyrighted books without permission to train its language models. The complaint points to Books3- and RedPajama-derived datasets, which some observers have described as pirated training data. Could this high-profile case force the industry to change how it sources training material, and increase demands for dataset disclosure and licensed content deals?
For years, companies have relied on large scraped datasets to train generative AI systems. The Books3 dataset, at the center of this suit, contains close to 200,000 books and has been used by multiple AI developers despite questions about its provenance. The lawsuit arrives as courts and legislators focus increasingly on fair use in AI training and on policies that would require transparency about training datasets.
This case follows a wave of litigation and settlements in 2024 and 2025 that made headlines and set new expectations for AI companies. A record-breaking settlement earlier this year highlighted the financial exposure companies face when they train models on disputed content. The plaintiffs in the Apple case are seeking monetary damages and injunctive relief that could force the company to retrain its models using only properly licensed material, a costly and time-consuming process.
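The outcome could accelerate several trends we are already seeing, including licensed content deals, stronger demands for dataset disclosure, and closer documentation of dataset provenance.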
The lawsuit against Apple over its alleged use of pirated books for AI training is more than a single legal dispute. It underscores the shift toward heightened legal scrutiny of how commercial AI systems are trained and the growing demand for transparency in AI training datasets. Companies that adopt robust licensing practices, document dataset provenance, and publish transparent policies will be better positioned as the legal and regulatory landscape evolves.