When EleutherAI unveiled its latest project, eyebrows across the tech world shot up in surprise.
The group, well known among AI researchers, rolled out The Common Pile v0.1, a mammoth open dataset that clocks in at eight terabytes. Two years in the making, thanks to joint efforts with AI startups like Poolside and Hugging Face as well as university partners, the dataset aims to reshape how developers train machine learning models.
What makes The Common Pile v0.1 noteworthy is the attention EleutherAI paid to legally sourced material. While many prominent players in AI development have faced uncomfortable questions and courtroom battles over using data pulled straight from the internet, EleutherAI wanted a cleaner approach. Rather than skirting the edge of copyright, the team built the dataset using open content and fully licensed materials, including over three hundred thousand books from the Library of Congress and the Internet Archive.
The timing is significant. As corporations such as OpenAI contend with lawsuits challenging their reliance on content scraped from the web, the conversation about ethics and legality in AI data collection keeps getting louder. Some companies try to establish formal agreements with publishers, but most rely on the fair use doctrine to fend off legal fallout tied to training on copyrighted work.
Open Data, Open Ambitions
EleutherAI believes the rise in these lawsuits has made bigger companies more secretive about their data sources. According to executive director Stella Biderman, that wall of secrecy has slowed progress in the wider AI field. She noted that some researchers are now even barred from releasing their findings due to ongoing legal wrangling.
To counteract this, Biderman and her team made sure The Common Pile v0.1 was reviewed by legal advisers and drew on a wide range of public resources. They also used Whisper, OpenAI's open source speech-to-text model, to transcribe spoken audio into usable text for the collection.
The results are already tangible. EleutherAI’s new AI models, Comma v0.1-1T and Comma v0.1-2T, both feature seven billion parameters, placing them in the same neighborhood as early Llama models from Meta and yielding strong results in coding, math, and image recognition tests. The models were built using just a slice of the newly released dataset but went toe to toe with others trained using far murkier data.
Biderman argues that the commonly held belief that model quality depends on unlicensed text no longer stands up to scrutiny. As open and licensed datasets become more available, she is confident that models built on these resources can keep pace with, and even surpass, those created with proprietary data.
This announcement also signals a change of heart for EleutherAI. The group acknowledges its previous open dataset, known simply as The Pile, included protected material, leading to criticism and legal headaches for companies that deployed it. Now, EleutherAI says it will prioritize more frequent collaboration and transparent releases, aiming for truly open research with the support of academic and industry partnerships.