# EleutherAI

> Source: https://aiwiki.ai/wiki/eleutherai
> Updated: 2026-06-21
> Categories: AI Companies, AI Research, Large Language Models, Open Source AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**EleutherAI** is a non-profit artificial intelligence research institute that builds and openly releases [large language models](/wiki/large_language_model), datasets, and evaluation tools, and studies their [interpretability](/wiki/interpretability) and [alignment](/wiki/ai_alignment). It was founded in July 2020 by Connor Leahy, Sid Black, and Leo Gao as a grassroots [Discord](/wiki/discord) community attempting to replicate [OpenAI](/wiki/openai)'s [GPT-3](/wiki/gpt-3), and it incorporated as a registered non-profit, the EleutherAI Institute, in early 2023. EleutherAI describes itself as "a leading non-profit research institute focused on large-scale artificial intelligence research," and its stated goal is to "ensure that the ability to study foundation models is not restricted to a handful of companies." [7]

EleutherAI is best known for producing several of the most widely used open-source language models, including GPT-Neo, [GPT-J](/wiki/gpt_j)-6B, GPT-NeoX-20B, and the Pythia model suite; for creating [The Pile](/wiki/the_pile), an 825 GiB open text dataset that became one of the most common training corpora in language-model research; and for the LM Evaluation Harness, the de facto standard framework for benchmarking language models. As of 2025 the organization reports that its models have been downloaded more than 70 million times and that its researchers have published more than 130 papers in venues including [NeurIPS](/wiki/neurips), [ICML](/wiki/icml), [ICLR](/wiki/iclr), EMNLP, and Nature, while employing roughly two dozen full- and part-time staff plus about a dozen regular volunteers. [7] The name "EleutherAI" derives from the Greek word *eleutheria*, meaning liberty, reflecting the group's commitment to making powerful AI systems freely accessible to everyone. In 2025 the group extended this work with the Common Pile v0.1, an 8 TB dataset built entirely from public domain and openly licensed text, and the Comma family of models trained on it. [14][15]

## History

### Origins on Discord (2020)

EleutherAI began in the summer of 2020, shortly after [OpenAI](/wiki/openai) released its [GPT-3](/wiki/gpt-3) paper describing a 175-billion-parameter language model. On July 2, 2020, a user known as "Daj" (Connor Leahy) posted in Shawn Presser's AI-focused Discord server suggesting the community should "give OpenAI a run for their money" by attempting to replicate GPT-3 as an open-source project. [12] Leahy had previously gained attention in 2019 for reverse-engineering [GPT-2](/wiki/gpt-2) in his bedroom.

The idea resonated with several members of the server, and on July 7, 2020, Leahy, Sid Black, and Leo Gao created a new Discord server under the tentative name "LibreAI." The server quickly attracted volunteers from around the world who were interested in democratizing access to large-scale AI. On July 28, 2020, the group rebranded from "LibreAI" to "EleutherAI," choosing the name as a reference to the Greek concept of eleutheria (liberty or freedom).

In its earliest days, EleutherAI operated entirely as a volunteer-driven collective. Contributors were independent researchers, students, hobbyists, and professionals who donated their time and skills to the project. The group coordinated its work through Discord channels, with no formal organizational structure, funding, or institutional backing.

### First Year: The Pile and GPT-Neo (2020-2021)

The first major milestone came on December 31, 2020, when EleutherAI publicly released [The Pile](/wiki/the_pile), an 825 GiB curated dataset of diverse English text assembled from 22 different sources. [1] The dataset was primarily curated by Leo Gao and Stella Biderman, who would later become the organization's Executive Director. The Pile was designed to provide a high-quality, diverse training corpus that could serve as an alternative to proprietary datasets used by companies like OpenAI.

On March 21, 2021, EleutherAI released the GPT-Neo model family, consisting of models with 125 million, 1.3 billion, and 2.7 billion parameters. These were the first open-source models explicitly designed to replicate the GPT-3 architecture, built using the [Mesh TensorFlow](/wiki/mesh_tensorflow) library for distributed training on [TPUs](/wiki/tensor_processing_unit_tpu). Although originally intended as a proof of concept, GPT-Neo attracted far more attention than the team anticipated. The models were trained on The Pile using compute resources from [Google](/wiki/google)'s TPU Research Cloud (TRC) program, which Connor Leahy had access to from a prior research allocation.

On June 9, 2021, the group released [GPT-J](/wiki/gpt_j)-6B, a six-billion-parameter model that was, at the time of release, the largest publicly available GPT-3-style language model in the world. [3] GPT-J was trained using Ben Wang's Mesh Transformer [JAX](/wiki/jax) library on a TPU v3-256 pod and achieved performance comparable to OpenAI's 6.7-billion-parameter Curie model. The release of GPT-J marked a turning point for the open-source AI community, demonstrating that volunteer-driven projects could produce models competitive with those built by well-funded corporations.

### Scaling Up: GPT-NeoX and Partnerships (2021-2022)

As EleutherAI's ambitions grew, so did its need for computational resources beyond what Google's TPU Research Cloud could provide. In early 2021, the group accepted a partnership with [CoreWeave](/wiki/coreweave), a cloud computing company that provided access to clusters of [NVIDIA](/wiki/nvidia) [A100](/wiki/nvidia_a100) GPUs without requiring financial payment. Additional compute support came from SpellML, a cloud infrastructure company.

EleutherAI developed a new training framework called GPT-NeoX, built on top of NVIDIA's [Megatron](/wiki/megatron_lm) language model framework and Microsoft's [DeepSpeed](/wiki/deepspeed) library. This GPU-based codebase was designed to scale to hundreds of billions of parameters and beyond, overcoming the limitations of the earlier TPU-based Mesh TensorFlow approach.

On February 10, 2022, EleutherAI released GPT-NeoX-20B, a 20-billion-parameter autoregressive language model trained on The Pile. [4] At the time, it was the largest dense, publicly available language model. The model was trained on CoreWeave's NVIDIA A100 GPU cluster and was released under the Apache 2.0 open-source license. The accompanying paper, authored by Sid Black, Stella Biderman, and 15 other contributors, was published at the BigScience Workshop at ACL 2022.

GPT-NeoX-20B departed from the original GPT-3 design in several ways. Like GPT-J, it used rotary positional embeddings (RoPE) instead of learned positional encodings, and it computed the attention and feed-forward layers in parallel rather than sequentially, yielding roughly 15% greater training throughput. [4]

### Incorporation as a Non-Profit (2023)

In early 2023, EleutherAI formalized its structure by incorporating as a non-profit research institute, the EleutherAI Institute. [11] The organization announced that it would be led by Stella Biderman as Executive Director and Head of Research, Curtis Huebner as Head of Alignment, and Shivanshu Purohit as Head of Engineering.

The non-profit received funding and support from a range of backers, including [Stability AI](/wiki/stability_ai), [Hugging Face](/wiki/hugging_face), [Lambda Labs](/wiki/lambda_labs), Canva, the Mozilla Foundation, [Open Philanthropy](/wiki/open_philanthropy), the Omidyar Network, and individual donor Nat Friedman (former CEO of GitHub). [11] [CoreWeave](/wiki/coreweave) and Google TRC continued to provide compute resources.

Alongside the incorporation, EleutherAI announced a strategic shift in focus. While the organization would continue to release open-source models, it would place greater emphasis on [interpretability](/wiki/interpretability), [AI alignment](/wiki/ai_alignment), and scientific research. EleutherAI now describes its core mission as advancing "research on the interpretability and alignment of open-source foundation models." [7] The non-profit structure allowed EleutherAI to hire full-time staff for the first time, eventually growing to approximately two dozen full-time and part-time researchers with about a dozen regular volunteers and external collaborators. [7]

### Pythia and Research Focus (2023-Present)

In April 2023, EleutherAI released the Pythia model suite, a collection of 16 language models ranging from 70 million to 12 billion parameters. [5] Unlike the earlier GPT-Neo and GPT-J releases, which were primarily aimed at providing open alternatives to proprietary models, Pythia was designed from the ground up as a research tool for studying how language models learn and develop over the course of training.

The Pythia paper was published at [ICML](/wiki/icml) 2023, with Stella Biderman and Hailey Schoelkopf as lead authors. [5] The suite has since become a standard resource for research in mechanistic interpretability, learning dynamics, and [AI ethics](/wiki/ai_ethics).

By 2025, EleutherAI had accumulated over 70 million model downloads and published more than 130 papers in top venues including [NeurIPS](/wiki/neurips), ICML, [ICLR](/wiki/iclr), EMNLP, and Nature. [7] The organization also released the Common Pile v0.1 in June 2025, an 8-terabyte dataset composed entirely of public domain and openly licensed text, developed in partnership with Poolside, Hugging Face, and the US Library of Congress. [14]

### 2025-2026 Developments

EleutherAI remained highly active through 2025 and into 2026, releasing new datasets and models, expanding its safety research, and launching a recurring research-onboarding program.

In June 2025 the organization published the Common Pile v0.1 and, alongside it, the **Comma** family of language models, the first models the group had trained on a fully open-license corpus. Comma v0.1-1T and Comma v0.1-2T are both 7-billion-parameter models trained on the Common Pile, using 1 trillion and 2 trillion training tokens respectively. EleutherAI reported that the Comma models perform comparably to leading models trained in the same compute regime on unlicensed data, with coverage noting that they rival [Meta](/wiki/meta)'s first-generation [LLaMA](/wiki/llama) on benchmarks for coding, image understanding, and mathematics. Both models were released openly on [Hugging Face](/wiki/hugging_face). [14][15][16]

In safety research, EleutherAI released "Deep Ignorance" on August 12, 2025, a study showing that filtering dangerous knowledge out of pretraining data can build tamper-resistant safeguards into open-weight models. By removing text related to dual-use biology from the training corpus, the team produced models whose general capabilities were unaffected but whose biothreat-proxy capabilities stayed low and resisted up to 10,000 steps and 300 million tokens of adversarial fine-tuning, far beyond what conventional safety fine-tuning had withstood. The paper, authored by Kyle O'Brien, Stella Biderman, Aviya Skowron, and Quentin Anthony, was posted as arXiv:2508.06601, with all models and filtered datasets released on Hugging Face. [17][18]

EleutherAI also contributed to RWKV-7 "Goose," released in March 2025 as a collaboration with the [RWKV](/wiki/rwkv) community and academic partners. RWKV-7 is a 2.9-billion-parameter attention-free [recurrent](/wiki/recurrent_neural_network) architecture that achieves constant memory and constant per-token inference cost. It set state-of-the-art downstream results at the 3-billion-parameter scale on multilingual tasks while being trained on far fewer tokens than comparable models, and it was released under the Apache 2.0 license. [19]

Continuing its long-standing interest in evaluation methodology, EleutherAI published "Quantifying the Effect of Test Set Contamination on Generative Evaluations" in February 2026 (arXiv:2601.04301). By pretraining models on mixtures of web data and the MATH benchmark while varying model size and the number of contaminating test-set copies, the researchers found that including even a single replica of the test set lets a model reach a lower loss than the irreducible error of training on an uncontaminated corpus, a finding with implications for how frontier systems are benchmarked. [20]

To address what it described as a missing onboarding path into open AI research, EleutherAI launched the **Summer of Open AI Research (SOAR)**, a remote hackathon-style mentorship program. The first edition, labeled SOAR 2025, ran in August 2025 and, despite being organized on short notice without a dedicated budget, drew more than 500 applications, of which 142 were accepted across 12 projects spanning [mechanistic interpretability](/wiki/mechanistic_interpretability), alignment, audio modeling, and other areas. [22] Several SOAR 2025 projects led to accepted publications at venues including ISMIR 2025, an ICASSP 2026 workshop, and a NeurIPS 2025 workshop. EleutherAI opened applications for a second edition, SOAR 2026, in May 2026, with the program scheduled to run from July 13 to August 16, 2026. [21][22]

## Key People

| Person | Role | Notes |
|--------|------|-------|
| Connor Leahy | Co-founder | Reverse-engineered GPT-2 in 2019. Later co-founded Conjecture, an AI safety company, in March 2022. Active advocate for AI regulation and existential risk mitigation. |
| Sid Black | Co-founder | Lead author on the GPT-NeoX-20B paper. Later co-founded Conjecture alongside Connor Leahy and Gabriel Alfour. |
| Leo Gao | Co-founder | Primary architect of The Pile dataset and the LM Evaluation Harness. Joined [OpenAI](/wiki/openai) as a researcher in 2021, where he continued working on alignment. |
| Stella Biderman | Executive Director, Head of Research | Joined the project early and became a central figure. Holds a background in mathematics, computer science, and philosophy. Has led the non-profit institute since 2023. Co-author on the 2025 Deep Ignorance and Common Pile work. |
| Curtis Huebner | Head of Alignment | Directs EleutherAI's research on AI alignment and safety. |
| Shivanshu Purohit | Head of Engineering | Leads engineering efforts; contributed to the Pythia project and GPT-NeoX development. |
| Ben Wang | Core Contributor | Created the Mesh Transformer JAX library used to train GPT-J-6B. |
| Hailey Schoelkopf | Researcher | Co-lead author on the Pythia paper. |
| Quentin Anthony | Researcher | Co-author on the Pythia and 2025 Deep Ignorance papers; contributor to GPT-NeoX engineering. |

## What models has EleutherAI released?

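EleutherAI has released several families of open-source language models. The subsections below describe each family in order of release, and the comparison table at the end of this section summarizes their size, training data, framework, hardware, and license.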

### GPT-Neo (March 2021)

GPT-Neo was EleutherAI's first model release and the organization's initial attempt to replicate the GPT-3 architecture in an open-source setting. The model family consisted of three sizes: 125M, 1.3B, and 2.7B parameters. All three variants were trained on [The Pile](/wiki/the_pile) using the Mesh TensorFlow library, which enabled distributed training across Google's [TPU](/wiki/tensor_processing_unit_tpu) pods. [2]

The GPT-Neo architecture closely followed GPT-3's design but incorporated local attention in alternating layers for improved efficiency. In the local attention layers, the window size was set to 256 tokens. The 2.7B model used 32 layers with a hidden dimension of 2,560 and 20 attention heads. The 125M model was trained for 300 billion tokens over 572,300 steps.
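
The alternating global/local pattern described above can be illustrated with a small mask-building sketch. This is not EleutherAI's implementation: the function names are hypothetical, starting the alternation with a global layer is an assumption, and the exact edge of the local window (whether it counts the current token) is also assumed here.

```python
def attention_types(num_layers):
    """Alternate global and local attention, one type per layer.

    Assumption of this sketch: the first layer is global.
    """
    return ["global" if i % 2 == 0 else "local" for i in range(num_layers)]


def causal_mask(seq_len, window=None):
    """Return mask[i][j] == True where query position i may attend to key j.

    window=None gives ordinary (global) causal attention. An integer window
    restricts each query to itself and the window - 1 positions before it.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i  # causal: never look at future tokens
            if window is not None:
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask


layers = attention_types(32)               # 32 layers, as in the 2.7B model
local = causal_mask(seq_len=6, window=3)   # tiny window so the band is visible
for row in local:
    print("".join("x" if allowed else "." for allowed in row))
```

With the real window of 256, a token at position 299 in a local layer would see only the 256 most recent positions, while the same token in a global layer would see all 300.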

All GPT-Neo models were released under the MIT open-source license, making them freely available for both research and commercial use. At the time of release, GPT-Neo 2.7B was the largest publicly available transformer language model trained on a curated, diverse dataset.

### GPT-J-6B (June 2021)

[GPT-J](/wiki/gpt_j)-6B represented a significant leap in scale for EleutherAI. The model contained 6 billion parameters and was trained on The Pile using Ben Wang's Mesh Transformer JAX framework on a TPU v3-256 pod. Training consumed 402 billion tokens over 383,500 steps.

Architecturally, GPT-J consisted of 28 transformer layers with a model dimension of 4,096 and a feedforward dimension of 16,384. The model dimension was split across 16 attention heads, each with a dimension of 256. GPT-J was one of the first large language models to use Rotary Position [Embeddings](/wiki/embeddings) ([RoPE](/wiki/rotary_position_embedding)), which were applied to 64 dimensions of each attention head. The model used the same BPE tokenizer as GPT-2 and GPT-3, with a vocabulary of 50,257 tokens.
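
To make the partial rotary scheme concrete, here is a minimal sketch that rotates only the first 64 of a head's 256 dimensions. It is an illustration under stated assumptions rather than GPT-J's actual code: the function names are hypothetical, the frequency base of 10000 and the pairing of adjacent dimensions are assumptions, and real implementations operate on whole tensors instead of Python lists.

```python
import math


def apply_partial_rope(head_vec, position, rotary_dims=64, base=10000.0):
    """Rotate the first `rotary_dims` entries of one attention-head vector.

    Entries are treated as consecutive pairs (x[2i], x[2i+1]); pair i is
    rotated by the angle position * base ** (-2i / rotary_dims). Entries past
    `rotary_dims` are returned unchanged.
    """
    if rotary_dims % 2 != 0 or rotary_dims > len(head_vec):
        raise ValueError("rotary_dims must be even and no larger than the head size")
    out = list(head_vec)
    for i in range(rotary_dims // 2):
        theta = position * base ** (-2.0 * i / rotary_dims)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = head_vec[2 * i], head_vec[2 * i + 1]
        out[2 * i] = x * cos_t - y * sin_t
        out[2 * i + 1] = x * sin_t + y * cos_t
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


HEAD_DIM = 4096 // 16  # model dimension split across 16 heads
query = [math.sin(0.1 * i) for i in range(HEAD_DIM)]
key = [math.cos(0.07 * i) for i in range(HEAD_DIM)]

# The query-key score depends only on the offset between positions (2 here).
score_near = dot(apply_partial_rope(query, 5), apply_partial_rope(key, 3))
score_far = dot(apply_partial_rope(query, 105), apply_partial_rope(key, 103))
print(abs(score_near - score_far) < 1e-9)
```

The comparison at the end shows the property that motivates rotary embeddings: shifting both positions by the same amount leaves the attention score unchanged, so the model sees relative rather than absolute position.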

At the time of its release, GPT-J-6B was the largest publicly available GPT-3-style model. [3] [Benchmarks](/wiki/benchmarks) showed it performing comparably to the similarly sized GPT-3 Curie variant (6.7B parameters). GPT-J was released under the Apache 2.0 license.

### GPT-NeoX-20B (February 2022)

GPT-NeoX-20B was EleutherAI's largest model and marked the organization's transition from TPU-based training to GPU-based training. The model contained 20 billion parameters and was trained on The Pile using EleutherAI's custom GPT-NeoX framework, which combined NVIDIA's Megatron with Microsoft's [DeepSpeed](/wiki/deepspeed). [4]

The architecture featured 44 transformer layers with a hidden dimension of 6,144 and 64 attention heads. The feedforward intermediate dimension was 24,576, and the vocabulary size was expanded to 50,432 tokens. Like GPT-J, GPT-NeoX-20B used rotary positional embeddings, applied to the first 25% of each head's embedding dimensions. The model computed attention and feed-forward sub-layers in parallel rather than sequentially, a design choice that improved training throughput by approximately 15%.
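
As a rough sanity check, the dimensions above can be plugged into a back-of-the-envelope parameter count. The sketch below is an approximation, not an exact accounting: the function name is hypothetical, biases and layer-norm weights are ignored, and separate (untied) input and output embedding matrices are assumed.

```python
def estimate_parameters(num_layers, hidden_dim, ffn_dim, vocab_size, tied_embeddings=False):
    """Rough weight count for a decoder-only transformer (biases and norms ignored)."""
    attention = 4 * hidden_dim * hidden_dim    # query, key, value, output projections
    feed_forward = 2 * hidden_dim * ffn_dim    # up- and down-projection
    embeddings = vocab_size * hidden_dim * (1 if tied_embeddings else 2)
    return num_layers * (attention + feed_forward) + embeddings


neox_20b = estimate_parameters(
    num_layers=44, hidden_dim=6144, ffn_dim=24576, vocab_size=50432
)
print(f"{neox_20b / 1e9:.2f}B parameters")

head_dim = 6144 // 64        # dimensions per attention head
rotary_dims = head_dim // 4  # rotary embeddings cover the first 25% of each head
print(head_dim, rotary_dims)
```

Under these assumptions the estimate lands a little above 20 billion, consistent with the model's name, and each 96-dimensional head applies rotary embeddings to its first 24 dimensions.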

GPT-NeoX-20B was trained on CoreWeave's cluster of NVIDIA A100-SXM4-40GB GPUs with a batch size of approximately 3.15 million tokens (1,538 sequences of 2,048 tokens each) for 150,000 steps. The model was released under the Apache 2.0 license and hosted on GooseAI, a managed inference service operated jointly by CoreWeave and Anlatan (the creators of NovelAI).
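
The batch and step figures above determine the total training budget. The short calculation below uses only the numbers quoted in this paragraph; the helper names are illustrative.

```python
def tokens_per_step(sequences_per_batch, sequence_length):
    """Tokens consumed by one optimizer step."""
    return sequences_per_batch * sequence_length


def total_training_tokens(sequences_per_batch, sequence_length, steps):
    """Tokens consumed over a whole training run."""
    return tokens_per_step(sequences_per_batch, sequence_length) * steps


batch_tokens = tokens_per_step(1538, 2048)
total_tokens = total_training_tokens(1538, 2048, 150_000)

print(f"{batch_tokens:,} tokens per step")
print(f"{total_tokens / 1e9:.1f} billion tokens in total")
```

The result is about 3.15 million tokens per step and roughly 472 billion tokens overall, close to the rounded figure given for GPT-NeoX-20B in the model comparison table.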

### Pythia Suite (April 2023)

The Pythia suite represented a departure from EleutherAI's earlier work, which was primarily focused on pushing the boundaries of model scale. Instead, Pythia was designed as a controlled scientific resource for studying how language models learn and evolve during training.

The suite consisted of 16 models in eight sizes: 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B parameters. Each size was trained in two variants: one on the standard Pile dataset and one on a deduplicated version of The Pile. Within each variant, all models were trained on the exact same data in the exact same order, using a uniform batch size of about 2 million tokens (2,097,152 tokens) for 143,000 steps, totaling approximately 300 billion tokens per model. [5]

A distinguishing feature of Pythia was the release of 154 intermediate checkpoints for every model, saved at regular intervals throughout training. These checkpoints enabled researchers to study learning dynamics, memorization patterns, and the emergence of biases at different stages of training. The training setup was fully reproducible, and all results in the published paper were independently verified by at least one external laboratory. [5]
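
The figure of 154 checkpoints can be reproduced from the schedule given on the Pythia model cards: the initial weights, ten log-spaced early steps (1 through 512), and one checkpoint every 1,000 steps through step 143,000. That schedule and the `step<N>` naming of the corresponding Hugging Face revisions come from the model cards rather than from the text above, and the helper below is a hypothetical sketch.

```python
def pythia_checkpoint_steps(total_steps=143_000, interval=1_000, last_log_step=512):
    """List the training steps at which checkpoints were saved."""
    steps = [0]                       # initialization, before any training
    step = 1
    while step <= last_log_step:      # log-spaced early checkpoints: 1, 2, 4, ...
        steps.append(step)
        step *= 2
    steps.extend(range(interval, total_steps + 1, interval))  # evenly spaced
    return steps


TOKENS_PER_STEP = 2_097_152  # uniform batch size quoted for the suite

steps = pythia_checkpoint_steps()
revisions = [f"step{s}" for s in steps]              # e.g. "step3000"
tokens_seen = {s: s * TOKENS_PER_STEP for s in steps}

print(len(steps), revisions[:5], revisions[-1])
print(f"{tokens_seen[steps[-1]] / 1e9:.1f} billion tokens at the final checkpoint")
```

Mapping each step to the number of tokens seen is what lets researchers line up behavior across model sizes: because every model uses the same batch size and data order, a given step corresponds to the same point in the training data for all of them.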

The Pythia paper presented case studies demonstrating novel findings in memorization behavior, the effect of term frequency on few-shot arithmetic performance, and methods for reducing gender bias during training.

In March 2025, EleutherAI extended this line of work with PolyPythias, a study of training stability and outliers across fifty separate Pythia pre-training runs that varied only in random seed. The project released thousands of additional checkpoints and examined how stable the learning trajectory and final behavior of a model are when everything except initialization and data order is held fixed. [23]

### Comma v0.1 (June 2025)

The Comma models were EleutherAI's first language models trained entirely on openly licensed and public domain data. Released in June 2025 together with the Common Pile v0.1 dataset, Comma v0.1-1T and Comma v0.1-2T are both 7-billion-parameter models, distinguished by their training budget: Comma v0.1-1T was trained on 1 trillion tokens and Comma v0.1-2T on 2 trillion tokens, both drawn from the Common Pile. [14][16]

The Comma models were created to test a specific claim, namely that competitive language models can be built without scraping unlicensed copyrighted text. EleutherAI reported that they perform comparably to leading models trained on the same number of tokens from unlicensed web data, and external coverage described them as rivaling Meta's first-generation [LLaMA](/wiki/llama) on coding, image understanding, and math benchmarks. The models were published on Hugging Face as common-pile/comma-v0.1-1t and common-pile/comma-v0.1-2t. [15][16]

### Model Comparison Table

| Model | Parameters | Release Date | Training Data | Training Framework | Hardware | Training Tokens | License |
|-------|-----------|--------------|---------------|-------------------|----------|----------------|--------|
| GPT-Neo 125M | 125M | March 2021 | [The Pile](/wiki/the_pile) | Mesh TensorFlow | [TPU](/wiki/tensor_processing_unit_tpu) | ~300B | MIT |
| GPT-Neo 1.3B | 1.3B | March 2021 | [The Pile](/wiki/the_pile) | Mesh TensorFlow | [TPU](/wiki/tensor_processing_unit_tpu) | ~380B | MIT |
| GPT-Neo 2.7B | 2.7B | March 2021 | [The Pile](/wiki/the_pile) | Mesh TensorFlow | [TPU](/wiki/tensor_processing_unit_tpu) | ~420B | MIT |
| [GPT-J](/wiki/gpt_j)-6B | 6B | June 2021 | [The Pile](/wiki/the_pile) | Mesh Transformer JAX | TPU v3-256 | ~402B | Apache 2.0 |
| GPT-NeoX-20B | 20B | February 2022 | [The Pile](/wiki/the_pile) | GPT-NeoX (Megatron + [DeepSpeed](/wiki/deepspeed)) | [NVIDIA A100](/wiki/nvidia_a100) GPUs | ~473B | Apache 2.0 |
| Pythia 70M | 70M | April 2023 | [The Pile](/wiki/the_pile) | GPT-NeoX | [GPU](/wiki/gpu_computing) | ~300B | Apache 2.0 |
| Pythia 160M | 160M | April 2023 | [The Pile](/wiki/the_pile) | GPT-NeoX | [GPU](/wiki/gpu_computing) | ~300B | Apache 2.0 |
| Pythia 410M | 410M | April 2023 | [The Pile](/wiki/the_pile) | GPT-NeoX | [GPU](/wiki/gpu_computing) | ~300B | Apache 2.0 |
| Pythia 1B | 1B | April 2023 | [The Pile](/wiki/the_pile) | GPT-NeoX | [GPU](/wiki/gpu_computing) | ~300B | Apache 2.0 |
| Pythia 1.4B | 1.4B | April 2023 | [The Pile](/wiki/the_pile) | GPT-NeoX | [GPU](/wiki/gpu_computing) | ~300B | Apache 2.0 |
| Pythia 2.8B | 2.8B | April 2023 | [The Pile](/wiki/the_pile) | GPT-NeoX | [GPU](/wiki/gpu_computing) | ~300B | Apache 2.0 |
| Pythia 6.9B | 6.9B | April 2023 | [The Pile](/wiki/the_pile) | GPT-NeoX | [GPU](/wiki/gpu_computing) | ~300B | Apache 2.0 |
| Pythia 12B | 12B | April 2023 | [The Pile](/wiki/the_pile) | GPT-NeoX | [GPU](/wiki/gpu_computing) | ~300B | Apache 2.0 |
| Comma v0.1-1T | 7B | June 2025 | Common Pile v0.1 | GPT-NeoX | [GPU](/wiki/gpu_computing) | ~1T | Open weights |
| Comma v0.1-2T | 7B | June 2025 | Common Pile v0.1 | GPT-NeoX | [GPU](/wiki/gpu_computing) | ~2T | Open weights |

## What is The Pile?

[The Pile](/wiki/the_pile) is an 825 GiB diverse, open-source English text dataset created by EleutherAI specifically for training large language models. [1] It was publicly released on December 31, 2020, and was primarily curated by Leo Gao and Stella Biderman. The corresponding paper, "The Pile: An 800GB Dataset of Diverse Text for Language Modeling," was submitted to arXiv in January 2021. [1]

The Pile was designed to address a gap in the availability of high-quality, diverse training data for language model research. At the time, most large language models were trained on proprietary datasets or on Common Crawl-derived corpora that, while large, lacked diversity across knowledge domains.

### Composition

The Pile is composed of 22 smaller, high-quality subsets spanning a wide range of domains. Many of these subsets were newly constructed for the project, while others were existing datasets that were cleaned and reformatted. The following table lists all 22 component datasets:

| Subset | Size (GiB) | Description |
|--------|-----------|-------------|
| Pile-CC | 227.12 | Filtered subset of [Common Crawl](/wiki/common_crawl) with improved extraction quality |
| PubMed Central | 90.27 | Full-text biomedical and life sciences research articles |
| Books3 | 100.96 | Collection of books (later removed due to copyright concerns) |
| [ArXiv](/wiki/arxiv) | ~56 | Academic preprints in physics, mathematics, computer science, and other fields |
| GitHub | 95.16 | Open-source code repositories |
| OpenWebText2 | 62.77 | Extension of the original OpenWebText dataset, web pages linked from Reddit |
| FreeLaw | 51.15 | Legal opinions from the Free Law Project |
| [Stack Exchange](/wiki/stack_exchange) | 32.20 | Questions and answers from the Stack Exchange network |
| USPTO Backgrounds | 22.90 | Patent application background sections from the US Patent and Trademark Office |
| PubMed Abstracts | 19.26 | Abstracts from biomedical literature |
| OpenSubtitles | 12.98 | Movie and television subtitles |
| Project Gutenberg (PG-19) | 10.88 | Public domain books |
| DM Mathematics | 7.75 | Algorithmically generated math problems from [DeepMind](/wiki/deepmind) |
| Wikipedia (en) | ~6.4 | English Wikipedia articles |
| BookCorpus2 | 6.30 | Extension of the original BookCorpus dataset |
| Ubuntu IRC | 5.52 | Chat logs from Ubuntu support IRC channels |
| EuroParl | 4.59 | Multilingual European Parliament proceedings |
| HackerNews | 3.90 | Comments from the Hacker News technology forum |
| YouTube Subtitles | 3.73 | Subtitles from YouTube videos |
| PhilPapers | 2.38 | Philosophy papers and abstracts |
| NIH ExPORTER | 1.89 | Abstracts from NIH-funded research grants |
| Enron Emails | 0.88 | The Enron email corpus |

The dataset was stored in jsonlines format compressed with zstandard. Models trained on The Pile showed significant performance improvements over models trained on raw Common Crawl data or CC-100, particularly on specialized domains like scientific literature, legal text, and code.
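
As a quick consistency check, the component sizes in the table above can be summed and compared with the headline figure of 825 GiB. The values are copied from the table, including the two approximate entries for ArXiv and Wikipedia, so the total is itself approximate.

```python
PILE_COMPONENTS_GIB = {
    "Pile-CC": 227.12,
    "PubMed Central": 90.27,
    "Books3": 100.96,
    "ArXiv": 56.0,            # listed as ~56 in the table
    "GitHub": 95.16,
    "OpenWebText2": 62.77,
    "FreeLaw": 51.15,
    "Stack Exchange": 32.20,
    "USPTO Backgrounds": 22.90,
    "PubMed Abstracts": 19.26,
    "OpenSubtitles": 12.98,
    "Project Gutenberg (PG-19)": 10.88,
    "DM Mathematics": 7.75,
    "Wikipedia (en)": 6.4,    # listed as ~6.4 in the table
    "BookCorpus2": 6.30,
    "Ubuntu IRC": 5.52,
    "EuroParl": 4.59,
    "HackerNews": 3.90,
    "YouTube Subtitles": 3.73,
    "PhilPapers": 2.38,
    "NIH ExPORTER": 1.89,
    "Enron Emails": 0.88,
}

total_gib = sum(PILE_COMPONENTS_GIB.values())
largest = max(PILE_COMPONENTS_GIB, key=PILE_COMPONENTS_GIB.get)
largest_share = PILE_COMPONENTS_GIB[largest] / total_gib

print(f"{len(PILE_COMPONENTS_GIB)} components, about {total_gib:.0f} GiB in total")
print(f"Largest component: {largest} ({largest_share:.0%} of the raw size)")
```

The shares computed here are by raw size on disk only; they say nothing about how heavily each subset was weighted during training.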

### Copyright Controversy

The Pile attracted controversy over the inclusion of copyrighted material, particularly through the Books3 subset, which contained books compiled from Bibliotik. In 2024, a class action lawsuit was filed by authors seeking damages over the use of their copyrighted works. In response, EleutherAI eventually removed the Books3 component. The copyright issues surrounding The Pile contributed to EleutherAI's decision to create the Common Pile v0.1 in 2025, which contained only public domain and openly licensed content.

### The Common Pile v0.1

The Common Pile v0.1 is EleutherAI's openly licensed successor to The Pile, released on June 5, 2025. It is an 8-terabyte corpus assembled entirely from public domain and openly licensed text, drawn from roughly 30 source datasets. Notable components include approximately 300,000 public domain books sourced from the US Library of Congress and the Internet Archive, along with text transcribed from audio using OpenAI's [Whisper](/wiki/whisper) speech-recognition model. The dataset took about two years to assemble. [13][14][15] EleutherAI uses a strict definition of openness for the project, stating that "'open' means that permission is granted to use, study, modify, and redistribute by any person." [14]

The project was a broad collaboration. EleutherAI worked with the AI startup Poolside, [Hugging Face](/wiki/hugging_face), and a range of academic and library partners reported to include the University of Toronto, the Vector Institute, the Allen Institute for AI, Teraflop AI, and others, in addition to the US Library of Congress. Executive Director Stella Biderman framed the effort as a direct response to the wave of copyright lawsuits against AI developers, arguing that the litigation had reduced the transparency companies were willing to provide and that openly licensed data could still support competitive model development. In ablation experiments, models trained on the Common Pile outperformed those trained on several earlier open-license corpora and roughly matched models trained on the original Pile, while still trailing the web-derived FineWeb dataset. The accompanying paper, "The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text," was released on arXiv along with the dataset and the Comma models trained on it. [13][15][16]

## What is the LM Evaluation Harness?

The Language Model Evaluation Harness (lm-evaluation-harness) is an open-source framework developed by EleutherAI for evaluating generative language models across a wide variety of benchmarks. Originally created by Leo Gao, the tool has evolved since 2020 into what is widely considered the standard evaluation framework for [large language models](/wiki/large_language_model) in both academic and industry settings. [10]

The Evaluation Harness provides a unified codebase that allows any causal language model to be tested on the same inputs, ensuring that results from different models are directly comparable. It supports evaluation with publicly available prompts for full reproducibility, configurable few-shot settings, and compatibility with models hosted on [Hugging Face](/wiki/hugging_face), OpenAI APIs, [vLLM](/wiki/vllm), and custom local endpoints.
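
The idea of testing every model on identical inputs can be sketched with a toy few-shot, multiple-choice scorer. This is not the harness's API: every name below is hypothetical, the prompt template is an assumption, and `toy_score` is a stand-in for a real model's log-likelihood function.

```python
def build_prompt(fewshot_examples, question):
    """Join k solved examples and the unsolved question into one prompt."""
    shots = [f"Question: {q}\nAnswer: {a}" for q, a in fewshot_examples]
    return "\n\n".join(shots + [f"Question: {question}\nAnswer:"])


def pick_answer(score_fn, prompt, choices):
    """Return the choice whose continuation the model scores highest."""
    return max(choices, key=lambda choice: score_fn(prompt, " " + choice))


def accuracy(score_fn, fewshot_examples, test_items):
    """Fraction of items where the top-scored choice matches the gold answer."""
    correct = 0
    for question, choices, gold in test_items:
        prompt = build_prompt(fewshot_examples, question)
        correct += pick_answer(score_fn, prompt, choices) == gold
    return correct / len(test_items)


def toy_score(prompt, continuation):
    """Placeholder model: simply prefers shorter continuations."""
    return -len(continuation)


shots = [("2 + 2?", "4")]
items = [
    ("3 + 3?", ["6", "seven"], "6"),
    ("1 + 1?", ["three", "2"], "2"),
    ("5 + 5?", ["10", "9"], "10"),
]
print(accuracy(toy_score, shots, items))
```

Because the prompts and scoring rule are fixed, swapping `toy_score` for a different scoring function changes only the model under test, which is the property that makes results comparable across models.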

The framework serves as the backend for Hugging Face's Open LLM Leaderboard, one of the most widely referenced leaderboards for comparing language model performance. It has been used in hundreds of published research papers and is employed internally by organizations including NVIDIA, [Cohere](/wiki/cohere), BigScience, BigCode, [Nous Research](/wiki/nous_research), and MosaicML. [10]

All tasks in the current version of the harness are defined through YAML configuration files. Together with the codebase commit hash, these configuration files can be shared to enable precise replication of any evaluation setup. The framework supports advanced features such as output post-processing, answer extraction, [LoRA](/wiki/lora) adapter evaluation via Hugging Face's [PEFT](/wiki/peft) library, and data-parallel inference for faster evaluation. The project has continued under active development through 2025 and into 2026; from the v0.4.7 release onward, the base installation no longer requires PyTorch or Transformers, which keeps the tool lightweight when evaluating models served through an API. [24]

## How did EleutherAI get its compute and funding?

EleutherAI trained its large language models without the resources of a major technology company. Instead, the organization relied on a combination of donated compute, grant programs, and strategic partnerships to fund its research.

### Google TPU Research Cloud

In its earliest phase, EleutherAI's primary source of compute was the Google TPU Research Cloud (TRC) program, which provides free access to Google's [TPU](/wiki/tensor_processing_unit_tpu) hardware for research projects that commit to publishing their results publicly. Connor Leahy had an existing TRC allocation from prior work, and this became the foundation for training the GPT-Neo models and GPT-J-6B. The TRC program was well suited to EleutherAI's open-source mission, since its requirement to share results matched the group's own goals.

### CoreWeave Partnership

As EleutherAI's models grew in size, TPU resources alone became insufficient. In early 2021, the group established a partnership with [CoreWeave](/wiki/coreweave), a cloud computing company that provided access to clusters of NVIDIA GPUs. This partnership was structured as a donation of compute resources rather than a financial transaction. CoreWeave's infrastructure powered the training of GPT-NeoX-20B and later served as a platform for ongoing experiments. Additional GPU support came from [Stability AI](/wiki/stability_ai), which provided access to a Slurm cluster.

### Non-Profit Funding

When EleutherAI incorporated as a non-profit in 2023, it received grants and donations from Stability AI, Hugging Face, Lambda Labs, Canva, the Mozilla Foundation, Open Philanthropy, the Omidyar Network, and Nat Friedman. [11] These funds supported hiring full-time staff and expanding research operations.

## Relationship to Other Organizations

### Stability AI

EleutherAI and [Stability AI](/wiki/stability_ai) have had a close but informal relationship. Stability AI's founder, Emad Mostaque, began supporting EleutherAI during its early days and later provided both financial donations and compute resources. Stability AI is listed as a donor on EleutherAI's website, and the two organizations (along with [LAION](/wiki/laion)) collaborated on the development of [Stable Diffusion](/wiki/stable_diffusion). However, there is no formal organizational affiliation between the two entities, and Stability AI does not hold any intellectual property rights over EleutherAI's models.

### Conjecture

Two of EleutherAI's three co-founders, Connor Leahy and Sid Black, went on to co-found Conjecture, an [AI safety](/wiki/ai_safety) research company focused on the alignment problem, in March 2022. Conjecture grew directly out of the founders' experiences at EleutherAI, which had deepened their understanding of the capabilities and risks of large language models. Leahy has since become a prominent voice calling for regulation of frontier AI development, including proposals for a moratorium on large-scale training runs.

### OpenAI

Leo Gao, the third EleutherAI co-founder, joined [OpenAI](/wiki/openai) as a researcher in 2021. His prior work at EleutherAI on The Pile, the LM Evaluation Harness, and alignment research informed his continued contributions to the field. Despite his move to OpenAI, Gao has continued to participate in alignment discussions within the EleutherAI community.

### BigScience and BLOOM

Many EleutherAI members participated in the BigScience research workshop, a large international collaboration coordinated by Hugging Face that produced [BLOOM](/wiki/bloom), a 176-billion-parameter multilingual language model released in July 2022. EleutherAI contributors played roles in the design, development, and evaluation of BLOOM and the related mT0 model. Before BigScience convened, EleutherAI was the only non-corporate entity outside China actively developing large language models.

### Community Projects

Beyond its own models, EleutherAI members have contributed to a range of community AI projects, including [VQGAN-CLIP](/wiki/vqgan_clip) ([AI art](/wiki/ai_art) generation), [Stable Diffusion](/wiki/stable_diffusion) (text-to-image generation), and OpenFold (protein structure prediction). The organization's Discord server has served as a hub for open-source AI research, with ongoing discussions spanning topics from [mechanistic interpretability](/wiki/mechanistic_interpretability) to biological machine learning. In 2025, the group also began collaborating with the [RWKV](/wiki/rwkv) community on attention-free recurrent architectures, contributing to the RWKV-7 "Goose" release. [19]

## Research Focus Areas

### Interpretability

Since its pivot in 2023, EleutherAI has made [mechanistic interpretability](/wiki/mechanistic_interpretability) a primary research focus. The Pythia model suite was specifically designed to support interpretability research by providing consistent training conditions and intermediate checkpoints.

In July 2024, EleutherAI released an open-source pipeline for generating and evaluating natural-language explanations of [sparse autoencoder](/wiki/sparse_autoencoder) (SAE) features using large language models. The organization has also published research on whether interpretability methods designed for [transformer](/wiki/transformer) architectures transfer effectively to recurrent models like [Mamba](/wiki/mamba) and [RWKV](/wiki/rwkv).

In January 2025, EleutherAI researchers co-authored "Open Problems in Mechanistic Interpretability," a paper that brought together 29 researchers from 18 organizations to formalize the goals and open questions in the field. In March 2025, the organization launched the "Interpreting Across Time" project, which studies how model internals evolve during training to identify potential interventions for shaping model behavior.

During 2025 the group continued to publish on sparse autoencoders and steering, including work on evaluating SAE interpretability without relying on natural-language explanations and on steering model refusal behavior through SAE features. This followed its earlier paper "Does Transformer Interpretability Transfer to RNNs?" (arXiv:2404.05971, first posted in April 2024), a direct test of whether interpretability tools built for transformers carry over to recurrent models such as Mamba and RWKV. [25]

### AI Alignment and Safety

EleutherAI maintains an active alignment research program led by Curtis Huebner. In February 2025, the organization launched Alignment-MineTest, a project that uses the open-source Minetest voxel game engine to study alignment properties of [reinforcement learning](/wiki/reinforcement_learning) agents, with a focus on corrigibility and misgeneralization.

The group's 2025 safety agenda increasingly focused on the risks of open-weight models. The Deep Ignorance project (August 2025) demonstrated that carefully filtering hazardous content out of a model's pretraining data, rather than only fine-tuning the model afterward, can produce safeguards that survive thousands of steps of adversarial fine-tuning while leaving general capabilities intact. This approach was positioned as a more durable alternative to post-hoc safety fine-tuning, which prior work had shown could often be undone with only a few dozen steps of additional training. [17][18]
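The general idea of filtering at the data stage, before any training happens, can be illustrated with a toy keyword filter. This is only a sketch under loud assumptions: it is not the Deep Ignorance pipeline, the names `BLOCKLIST`, `is_flagged`, and `filter_corpus` are hypothetical, and the blocklist entries are inert placeholder tokens.

```python
from typing import Iterable, Iterator

# Placeholder tokens only; a real blocklist would be curated by domain experts.
BLOCKLIST = {"hazard_term_a", "hazard_term_b"}


def is_flagged(document: str, blocklist: set[str] = BLOCKLIST) -> bool:
    """Return True if any whitespace-separated token matches the blocklist."""
    tokens = set(document.lower().split())
    return not tokens.isdisjoint(blocklist)


def filter_corpus(documents: Iterable[str]) -> Iterator[str]:
    """Yield only the documents that pass the filter, preserving order."""
    for doc in documents:
        if not is_flagged(doc):
            yield doc


corpus = [
    "a recipe for sourdough bread",
    "detailed notes on hazard_term_a synthesis",
    "notes on sorting algorithms",
]
# Flagged documents never reach the training set, so there is nothing to unlearn later.
kept = list(filter_corpus(corpus))
print(kept)
```

The design point is where the safeguard sits: a document removed here is never seen by the model, whereas post-hoc fine-tuning has to suppress knowledge the model has already acquired.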

### Multilingual Models

Through its Polyglot project, EleutherAI has extended its work to non-English languages, developing and releasing multilingual model variants. Its 2025 contributions to the RWKV-7 "Goose" model, which set state-of-the-art results at the 3-billion-parameter scale on multilingual tasks, continued this emphasis on language coverage beyond English. [19]

## Impact on Open-Source AI

EleutherAI's influence on the broader AI ecosystem extends well beyond the models and tools it has directly produced. The organization helped establish the principle that large language models could and should be developed openly, at a time when the field was trending toward closed, proprietary systems.

The GPT-Neo and GPT-J model releases in 2021 are widely credited with sparking a wave of open-source AI development. These models demonstrated that meaningful language modeling capabilities were achievable outside the confines of major technology companies, inspiring subsequent open-source efforts by organizations including [Meta](/wiki/meta) (with [LLaMA](/wiki/llama)), [Mistral AI](/wiki/mistral_ai), and the broader Hugging Face community. As one industry account put it, EleutherAI's openly licensed models "for a while fueled an entirely new wave of startups." [11]

The Pile became one of the most widely used training datasets in the field, adopted by researchers and companies well beyond EleutherAI's own projects. The LM Evaluation Harness established a common standard for comparing language model performance, helping to bring rigor and reproducibility to an area that had previously lacked consistent evaluation practices. [10]

With the Common Pile v0.1 and the Comma models in 2025, EleutherAI extended this influence into the debate over training-data provenance, offering a concrete demonstration that competitive models can be built from openly licensed text rather than scraped copyrighted material. The release arrived amid mounting copyright litigation against AI developers and was widely covered as evidence that more transparent data practices were technically feasible. [14][15][16]

EleutherAI also served as a training ground for AI researchers. Several members went on to take prominent positions at leading AI organizations, and the collaborative, open-science culture fostered by the Discord community influenced how other groups approached open-source AI development. The organization formalized this onboarding role in 2025 with the launch of the Summer of Open AI Research program, which brought hundreds of newcomers into open AI research and produced several peer-reviewed publications in its first cohort. [21][22]

## Organizational Structure

EleutherAI operates as a registered non-profit research institute (since early 2023). The organization maintains approximately two dozen full-time and part-time research staff, along with roughly a dozen regular volunteers and external collaborators. [7] Day-to-day coordination continues to take place on the organization's public Discord server, maintaining the open, community-driven culture from its origins.

The non-profit is governed by its leadership team, with Stella Biderman serving as Executive Director and Head of Research, Curtis Huebner as Head of Alignment, and Shivanshu Purohit as Head of Engineering. This structure enables EleutherAI to accept grants, hire staff, and enter formal partnerships while preserving its commitment to open research. As of 2026 the institute continues to combine open model and dataset releases with research on interpretability, alignment, evaluation, and the safety of open-weight systems, while running the recurring Summer of Open AI Research program to bring new contributors into the field. [21]

## References

1. Gao, L., Biderman, S., et al. (2021). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." arXiv:2101.00027.
2. Black, S., Gao, L., Wang, P., et al. (2021). "GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow." Zenodo.
3. Wang, B. & Komatsuzaki, A. (2021). "GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model."
4. Black, S., Biderman, S., Hallahan, E., et al. (2022). "GPT-NeoX-20B: An Open-Source Autoregressive Language Model." Proceedings of BigScience Episode #5, ACL 2022. arXiv:2204.06745.
5. Biderman, S., Schoelkopf, H., Anthony, Q., et al. (2023). "Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling." Proceedings of ICML 2023. arXiv:2304.01373.
6. Biderman, S., et al. (2022). "Datasheet for the Pile." arXiv:2201.07311.
7. EleutherAI. "About." https://www.eleuther.ai/about/
8. EleutherAI Blog. "What A Long, Strange Trip It's Been: EleutherAI One Year Retrospective." https://blog.eleuther.ai/year-one/
9. EleutherAI Blog. "Announcing GPT-NeoX-20B." https://blog.eleuther.ai/announcing-20b/
10. Mozilla Foundation. "Evaluation Harness Is Setting the Benchmark for Auditing Large Language Models." https://www.mozillafoundation.org/en/blog/evaluation-harness-is-setting-the-benchmark-for-auditing-large-language-models/
11. TechCrunch. "Stability AI, Hugging Face and Canva back new AI research nonprofit." March 2, 2023. https://techcrunch.com/2023/03/02/stability-ai-hugging-face-and-canva-back-new-ai-research-nonprofit/
12. IEEE Spectrum. "EleutherAI: When OpenAI Isn't Open Enough." 2021. https://spectrum.ieee.org/eleutherai-openai-not-open-enough
13. Kandpal, N., Biderman, S., et al. (2025). "The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text." arXiv:2506.05209.
14. EleutherAI Blog. "The Common Pile v0.1." June 5, 2025. https://blog.eleuther.ai/common-pile/
15. TechCrunch. "EleutherAI releases massive AI training dataset of licensed and open domain text." June 6, 2025. https://techcrunch.com/2025/06/06/eleutherai-releases-massive-ai-training-dataset-of-licensed-and-open-domain-text/
16. Willison, S. "Comma v0.1 1T and 2T - 7B LLMs trained on openly licensed text." June 7, 2025. https://simonwillison.net/2025/Jun/7/comma/
17. O'Brien, K., Biderman, S., Skowron, A., Anthony, Q., et al. (2025). "Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs." arXiv:2508.06601.
18. EleutherAI Blog. "Pretraining Data Filtering for Open-Weight AI Safety (Deep Ignorance)." August 12, 2025. https://blog.eleuther.ai/deep-ignorance/
19. Peng, B., et al. (2025). "RWKV-7 'Goose' with Expressive Dynamic State Evolution." arXiv:2503.14456.
20. EleutherAI (2026). "Quantifying the Effect of Test Set Contamination on Generative Evaluations." arXiv:2601.04301. February 16, 2026.
21. EleutherAI. "Summer of Open AI Research." https://www.eleuther.ai/soar
22. EleutherAI. "A short retrospective on the EleutherAI Summer of Open AI Research." May 18, 2026. https://www.eleuther.ai/news/a-short-retrospective-on-the-eleutherai-summer-of-open-ai-research
23. van der Wal, O., Biderman, S., et al. (2025). "PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs." arXiv (March 2025).
24. EleutherAI/lm-evaluation-harness. GitHub releases. https://github.com/EleutherAI/lm-evaluation-harness/releases
25. EleutherAI (2024). "Does Transformer Interpretability Transfer to RNNs?" arXiv:2404.05971.