Meta is under intense scrutiny after newly unsealed court documents revealed internal discussions about using copyrighted content, including pirated books, to train its AI models. The revelations, part of the Kadrey v. Meta lawsuit, shed light on how Meta employees weighed the legal risks of using unlicensed data while attempting to keep pace with AI competitors.

Internal Deliberations Over Copyrighted Content

Court documents show that Meta employees debated whether to train AI models on copyrighted materials without explicit permission. In internal work chats, staff discussed acquiring copyrighted books without licensing deals and escalating the decision to company executives.

Meta research engineer Xavier Martinet suggested an “ask forgiveness, not for permission” approach in a chat dated February 2023, according to the filings, stating:

“[T]his is why they set up this gen ai org for [sic]: so we can be less risk averse.”

He further argued that negotiating deals with publishers was inefficient and that competitors were likely already using pirated data.

“I mean, worst case: we found out it is finally ok, while a gazillion start up [sic] just pirated tons of books on bittorrent,” Martinet wrote, according to the filings. “[M]y 2 cents again: trying to have deals with publishers directly takes a long time …”

Meta’s AI leadership acknowledged that licenses were needed for publicly available data, but employees noted that the company’s legal team was becoming more flexible on approving training data sources.

Talks of Libgen and Legal Risks

The filings reveal that Meta employees discussed using Libgen, a site known for providing unauthorized access to copyrighted books. In a work chat, Melanie Kambadur, a senior manager for Meta’s Llama model research team, suggested using Libgen as an alternative to licensed datasets.

According to the filings, in one conversation Sony Theakanath, director of product management at Meta, called Libgen “essential to meet SOTA numbers across all categories,” emphasizing that without it, Meta’s AI models might fall behind state-of-the-art (SOTA) benchmarks.

Theakanath also proposed strategies to mitigate legal risks, including removing data from Libgen that was “clearly marked as pirated/stolen” and ensuring that Meta would not publicly cite its use of the dataset.

“We would not disclose use of Libgen datasets used to train,” he wrote in an internal email to Meta AI VP Joelle Pineau.

Further discussions among Meta employees suggested that the company attempted to filter out risky content from Libgen files by searching for terms like “stolen” or “pirated” while still leveraging the remaining data for AI training.

Some staff raised concerns, including by pointing to a Google search result stating “No, Libgen is not legal,” but internal discussions about using the platform continued.

Meta’s AI Data Sources and Training Strategies

Additional filings suggest that Meta explored scraping Reddit data using techniques similar to those employed by a third-party service, Pushshift. There were also discussions about revisiting past decisions not to use Quora content, scientific articles, and licensed books. In a March 2024 chat, Chaya Nayak, director of product management for Meta’s generative AI division, indicated that leadership was considering overriding prior restrictions on training sets.

She emphasized the need for more diverse data sources, stating: “[W]e need more data.” Meta’s AI team also worked on tuning models to avoid reproducing copyrighted content, blocking responses to direct requests for protected materials and preventing AI from revealing its training data sources.

Legal and Industry Implications

The plaintiffs in Kadrey v. Meta have amended their lawsuit multiple times since filing in 2023 in the U.S. District Court for the Northern District of California. The latest claims allege that Meta not only used pirated data but also cross-referenced copyrighted books with available licensed versions to determine whether to pursue publishing agreements.

In response to the growing legal pressure, Meta has strengthened its legal defense by adding two Supreme Court litigators from the law firm Paul Weiss to its team. Meta has not yet publicly addressed these latest allegations. However, the case highlights the ongoing conflict between AI companies’ need for massive datasets and the legal protections surrounding intellectual property. The outcome could set a major precedent for how AI companies train models and navigate copyright laws in the future.

Meta Faces Legal Battle Over AI Training with Copyrighted Content

Munazza Shaheen

Administrator
