Common Corpus: Open Data Set Powers Global AI Training & Avoids Copyright Issues
A French startup, Pleias, has significantly expanded its open-source large language model (LLM) training dataset, the Common Corpus, to more than 2.267 trillion tokens, adding substantial multilingual support, particularly from Asia. The dataset aims to provide a legally sound alternative to the widespread practice of scraping copyrighted material from the internet to train AI models, a practice currently facing legal challenges in U.S. courts.
The Common Corpus distinguishes itself by containing only data that is permissively licensed and meticulously documenting its provenance. This approach addresses growing concerns about copyright infringement in the AI industry, where LLMs are often trained on vast quantities of copyrighted text and code. Recent court decisions, as highlighted by McCarter & English, have begun to define the boundaries of “fair use” in the context of LLM training, but the legal landscape remains complex. In both Bartz v. Anthropic and Kadrey v. Meta Platforms, courts found that using copyrighted works to train LLMs could constitute fair use, but circumstances matter.
The Common Corpus is categorized into five main areas: OpenGovernment, OpenCulture, OpenScience, OpenWeb, and OpenSource. OpenGovernment includes financial and legal documents, while OpenCulture focuses on cultural heritage data such as books and newspapers, many dating from the 18th and 19th centuries. OpenScience comprises publicly available academic publications, and OpenWeb includes transcripts from public domain YouTube videos and content from Stack Exchange. Finally, OpenSource consists of code from permissively licensed GitHub repositories.
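For developers working with data organized this way, the collection labels make it straightforward to select only the subsets they need. The sketch below is purely illustrative: the record schema (the `collection` and `license` field names) is an assumption for this example, not Common Corpus's documented metadata format.

```python
# Illustrative sketch: selecting corpus records by collection and provenance.
# The "collection" and "license" field names are assumed for illustration
# and may not match Common Corpus's actual metadata schema.

ALLOWED_COLLECTIONS = {
    "OpenGovernment", "OpenCulture", "OpenScience", "OpenWeb", "OpenSource",
}

def select_records(records, collections=ALLOWED_COLLECTIONS):
    """Yield records that belong to a known collection and carry
    documented provenance (a non-empty license field)."""
    for rec in records:
        if rec.get("collection") in collections and rec.get("license"):
            yield rec

# Toy sample data standing in for real corpus records.
sample = [
    {"text": "An 1850 newspaper page.", "collection": "OpenCulture",
     "license": "public-domain"},
    {"text": "def add(a, b): return a + b", "collection": "OpenSource",
     "license": "MIT"},
    {"text": "Scraped blog post.", "collection": "Unknown", "license": ""},
]

kept = list(select_records(sample))
print(len(kept))  # two of the three sample records pass the filter
```

Filtering on explicit license metadata like this is what provenance documentation makes possible in the first place; with scraped web data, there is usually no such field to check.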
The latest update significantly broadens the linguistic diversity of the corpus, which now supports eight languages with over 10 billion tokens each (English, French, German, Spanish, Italian, Polish, Greek, and Latin) and 33 languages with over 1 billion tokens. This expansion is particularly notable for its inclusion of data from China, Japan, Korea, Brazil, India, Africa, and Southeast Asia.
Beyond legal compliance, the Common Corpus offers practical advantages for developers. The dataset has been curated to remove harmful and toxic content, and personally identifiable information (PII) has been removed to ensure GDPR compliance. This curation results in LLMs trained on the corpus being less prone to generating problematic outputs. The AI Alliance notes that the dataset “exceeds the requirements of even the strictest regulations on AI training data, such as the EU AI Act.”
The dataset also facilitates the creation of open-source AI models, aligning with the Open Source Initiative’s definition, which permits use “for any purpose and without having to ask for permission.” This characteristic positions the Common Corpus as a potential foundation for “public AI” systems, an approach advocated by some as a means of ensuring equitable access to AI technology. The French government, along with organizations like Wikimedia Enterprise and Libraries Without Borders, is already supporting the project.
The development of the Common Corpus underscores a growing movement towards transparency and ethical sourcing in the AI industry. While legal battles over copyrighted training data continue, initiatives like this offer a viable path forward, potentially reducing legal risks and fostering a more sustainable ecosystem for LLM development.
