How AI Companies Are Mining Human Knowledge: The Story of Project Panama
Information follows a fundamental principle: the more it is consumed, the more it is reshaped and passed on. Human intelligence grows by absorbing knowledge, combining ideas, and handing them forward in new forms. From ancient oral traditions to stone inscriptions, from handwritten letters to printed books, and from early computers to sophisticated algorithms, every stage of human progress has depended on preserving existing knowledge while reworking it for future use.
Artificial Intelligence Built on Human Thought
Artificial intelligence systems operate on this same foundational principle. To deliver comprehensive answers on virtually any topic, anywhere, and at any moment, these systems require access to the broadest and most reliable records of human thought accumulated over centuries. For generations, books have served as the most trusted and durable vessels of information, documenting humanity's remarkable journey from primitive tools and survival techniques to rockets exploring Mars, from personal correspondence to global digital networks, and from hunting and gathering to instant food delivery services.
These bound volumes preserve critical ideas across generations, creating a continuous thread of human understanding. Within Anthropic, the artificial intelligence research company, planners recognized books as concentrated repositories of human knowledge, carefully shaped by authors, editors, and the passage of time. They believed that long-form texts could teach artificial intelligence systems to reason more logically and write more coherently than the fragmented content typically found across the internet.
The Birth of Project Panama
This conviction sparked an internal initiative that later became known as Project Panama. Recently unsealed court filings from a copyright lawsuit provide unprecedented insight into how this ambitious project operated. Anthropic embarked on purchasing physical books in massive quantities, then systematically dismantling and scanning them at high speeds to create digital versions. Once the digitization process was complete, the original paper copies were sent for recycling, leaving no physical archive behind.
The primary objective was to rapidly expand the volume of book-based data available for training the company's artificial intelligence systems. The scale of this operation became publicly known following a detailed report by The Washington Post, which offered a rare glimpse into how aggressively AI companies pursued high-quality textual data as competition to develop more capable chatbots intensified across the industry.
Strategic Decisions and Industry Pressures
Internal documents reveal that Anthropic deliberately chose this approach instead of negotiating licensing agreements at scale. Company executives argued that purchasing physical copies and conducting digitization internally proved faster and more practical given the competitive landscape. This strategy reflected the increasingly fierce race to dominate artificial intelligence development, where every technological advance can translate directly into market share, investment opportunities, and revenue generation.
In an industry moving at extraordinary speed, with new developments emerging almost daily, access to high-quality training data has become one of the most valuable assets in the broader push to transform AI capabilities into commercial power. The unsealed records surrounding Project Panama offer one of the clearest views yet into how modern AI systems are constructed, revealing that behind consumer-facing chatbots lies an industrial pipeline involving substantial capital investments, significant legal risks, and irreversible data extraction processes.
Processing Millions of Books
Vendor proposals and court records indicate that Anthropic sought scanning capacity for approximately 500,000 to 2 million books over a six-month period. Although the precise final number remains redacted in legal documents, filings repeatedly describe the purchase and destruction of millions of volumes, acquired in batches numbering tens of thousands at a time. The project involved tens of millions of dollars in expenditures covering book purchases, logistical operations, and scanning services, highlighting how central books had become to AI training strategies.
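The scale implied by those figures can be sanity-checked with simple arithmetic. The sketch below is a rough back-of-envelope estimate based only on the 500,000-to-2-million-book range and the six-month window reported in the filings; the day count is an assumption, not a number from the court records.

```python
# Back-of-envelope estimate of the scanning throughput implied by the
# filings: 500,000 to 2,000,000 books over roughly six months.
# The 182-day figure is an illustrative assumption (about half a year).

SIX_MONTHS_DAYS = 182

for total_books in (500_000, 2_000_000):
    books_per_day = total_books / SIX_MONTHS_DAYS
    print(f"{total_books:>9,} books -> ~{books_per_day:,.0f} books/day")
```

Even at the low end, that works out to thousands of volumes per day, which helps explain why the work was routed to commercial vendors with industrial scanning equipment rather than handled in-house.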
Once acquired, books were transported to commercial vendors equipped for industrial-scale document processing. Hydraulic cutting machines sliced off the spines so that individual pages could be fed through high-speed production scanners. Following digitization, the paper copies were scheduled for recycling, a process deliberately designed to be irreversible.
Preservation specialists note that this destructive approach distinguishes Project Panama from earlier digitization efforts, which typically preserved original copies. Court records suggest Anthropic viewed destructive scanning as a safer alternative to downloading large pirated digital libraries. The methodology drew lessons from previous mass digitization initiatives, including Google Books, and was partially shaped by Tom Turvey, who had previously worked on that pioneering project. Unlike Google Books, however, Project Panama prioritized speed and exclusivity over public accessibility or long-term preservation.
Legal Considerations and Fair Use
A federal judge later ruled that training AI models on books can qualify as fair use when the process demonstrates transformative characteristics. However, the court also determined that Anthropic's earlier downloads of pirated books raised separate copyright concerns, making clear that how training data is acquired remains legally significant even if the training process itself might be permitted under certain circumstances.
Authors' Response and Settlement Agreement
Authors reacted strongly to these disclosures, arguing that artificial intelligence companies were benefiting substantially from creative work without obtaining proper consent or providing adequate compensation. Ed Newton-Rex, a former AI executive, stated that this case illustrated a growing imbalance between technology firms and the creators whose work forms the foundation of modern AI systems. He and other critics have contended that existing copyright frameworks do not adequately address the complexities of large-scale machine learning operations.
In 2025, Anthropic agreed to pay $1.5 billion to settle claims related to its earlier use of pirated books, without admitting any wrongdoing. Under this agreement, authors whose works were included can seek compensation estimated at approximately $3,000 per title, though actual payouts may vary depending on specific circumstances. Anthropic has clarified that the settlement addressed data acquisition practices rather than the fundamental legality of AI training methodologies.
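The reported settlement figures imply a rough scale for the number of affected works. The snippet below is only an order-of-magnitude check using the $1.5 billion total and the approximately $3,000-per-title estimate quoted above; as the article notes, actual payouts may vary, so this is not a claims calculation.

```python
# Order-of-magnitude check on the reported settlement figures:
# a $1.5 billion fund at roughly $3,000 per title.
# Actual per-title payouts may differ, so treat this as an estimate.

settlement_total = 1_500_000_000  # USD, as reported
per_title = 3_000                 # approximate USD per title

implied_titles = settlement_total // per_title
print(f"Implied covered titles: ~{implied_titles:,}")
```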
Part of a Broader Industry Pattern
Project Panama does not represent an isolated incident within the artificial intelligence industry. Court filings from other lawsuits reveal that Meta employees debated downloading extensive shadow libraries of books, while OpenAI has acknowledged downloading similar datasets in the past before subsequently deleting them. Both Google and Microsoft also face ongoing legal challenges concerning their AI training data practices.
Legal scholar James Grimmelmann has observed that the industry essentially carried academic data-use norms into a commercial arms race, only confronting legal risks after making massive investments. By that point, he noted, companies had become effectively locked into data pipelines that would prove difficult to unwind or substantially modify.
Ongoing Debates and Future Implications
This case has intensified debates about whether current copyright law is adequately equipped to handle machine learning operations conducted at unprecedented scale. As courts continue to define the boundaries of fair use in AI training contexts, Project Panama stands as a defining example of the competitive pressures shaping artificial intelligence development and the unresolved tension between technological advancement and creators' rights.
The revelations about Project Panama show that behind the polished interfaces of consumer chatbots sits a sprawling data-acquisition operation built on heavy spending and calculated legal risk. The case continues to shape discussions about ethical data sourcing, fair compensation for creators, and the legal frameworks appropriate for technologies that depend so heavily on existing human knowledge.