GPT-J
Last reviewed
Apr 30, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 · 3,388 words
GPT-J is a 6-billion-parameter autoregressive transformer language model released by the EleutherAI collective on June 9, 2021. The full release name on Hugging Face is GPT-J-6B. At launch it was the largest publicly available open-source GPT-3-style language model, and it was the first community-built model to come close to the performance of the 6.7B ("Curie") variant of OpenAI's then-state-of-the-art GPT-3. Because the weights were published under the Apache License 2.0, anyone could download, fine-tune, redistribute, or commercialise the model, and that single fact reshaped the open-source large language model scene for the next eighteen months.
The model was designed and trained by Ben Wang, who also wrote the underlying training framework Mesh Transformer JAX, with Aran Komatsuzaki as co-author of the introductory blog post. It was trained on the Pile, EleutherAI's 825 GiB curated text dataset, using a TPU v3-256 pod donated through Google's TPU Research Cloud (TRC) program. GPT-J set the architectural template that several later open models, including GPT-NeoX-20B and Pythia, would build on, and a long list of community fine-tunes (GPT4All-J, the original Dolly v1, KoboldAI's adventure variants, BERTIN, and so on) have GPT-J as their base model.
EleutherAI began in July 2020 as a Discord server organised by Connor Leahy, Leo Gao, and Sid Black, with the explicit goal of replicating the GPT-3 paper in the open. The group's first public deliverable was the Pile dataset in December 2020, followed by the GPT-Neo family of models (125M, 1.3B, and 2.7B parameters) released in March 2021. GPT-Neo proved that the collective could ship working models, but at 2.7 billion parameters the largest GPT-Neo was still well short of the GPT-3 scales that researchers wanted to study.
GPT-J was the next step. Ben Wang, working largely independently, had been building a JAX-based model-parallel training library called Mesh Transformer JAX, optimised for the unusual sharding constraints of Google's TPU v3 pods. Wang and Komatsuzaki announced GPT-J-6B in a blog post titled "GPT-J-6B: 6B JAX-Based (Mesh) Transformer LM" on June 4, 2021, and the weights and code were uploaded to the kingoflolz/mesh-transformer-jax GitHub repository alongside a public web demo. EleutherAI hosted the artefact on its site and on Hugging Face under the EleutherAI/gpt-j-6B identifier, with Stella Biderman handling the Hugging Face port. InfoQ covered the launch in July 2021 and quoted Komatsuzaki claiming GPT-J was "the best-performing publicly available Transformer LM in terms of zero-shot performance."
The political subtext mattered. OpenAI had moved from open-source GPT-2 weights in 2019 to closed weights and a paid API for GPT-3 in 2020, citing safety concerns about misuse. EleutherAI argued, both in its writing and through actions, that closed models were not actually a safety strategy because the architectures and recipes were already well known, and that gating access to weights mostly hurt academic safety research. GPT-J became the practical demonstration of that argument. Connor Leahy summarised the position in the InfoQ piece: GPT-like models are "simple and theoretically straight-forward," and trying to keep them locked away while everyone roughly knew how to build them was not a productive use of attention.
GPT-J is a decoder-only autoregressive transformer in the same broad family as GPT-2 and GPT-3, but with three architectural choices that were unusual at the time and that proved influential afterwards.
The first is rotary position embeddings (RoPE), introduced in the RoFormer paper by Su et al. and then adopted by EleutherAI before most major labs. GPT-J does not use the learned absolute position embeddings of GPT-2/3; it applies RoPE to a portion of each attention head's dimensions.
The second is parallel placement of the attention and feed-forward sub-layers in each transformer block. In a standard GPT-2/3 block the input goes through self-attention, then through the feed-forward network, with residual additions in between. GPT-J computes attention and feed-forward in parallel from the same layer-normalised input, then adds both outputs to the residual stream. This trick was sketched in the appendix of the GPT-3 paper as a way to reduce communication overhead in large-scale model parallelism, and EleutherAI was among the first groups to deploy it at six-billion-parameter scale. PaLM (Google, 2022) used the same parallel-layer design, citing GPT-J as prior art.
The third is the use of an unusually wide attention-head dimension (256 instead of GPT-3's 128) with relatively few heads (16), which Wang found gave better TPU utilisation.
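The difference between the sequential and parallel block layouts can be sketched in a few lines of plain Python. This is a toy illustration, not code from Mesh Transformer JAX: `layer_norm` omits the learned scale and shift, and `toy_attn` and `toy_ffn` are hypothetical stand-ins for the real attention and feed-forward sub-layers.

```python
from typing import Callable, List

Vector = List[float]
SubLayer = Callable[[Vector], Vector]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalise to zero mean and unit variance (learned scale/shift omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum, used for the residual connections."""
    return [p + q for p, q in zip(a, b)]


def sequential_block(x: Vector, attn: SubLayer, ffn: SubLayer) -> Vector:
    """GPT-2/3 style: the feed-forward sub-layer sees the attention output."""
    x = add(x, attn(layer_norm(x)))
    x = add(x, ffn(layer_norm(x)))
    return x


def parallel_block(x: Vector, attn: SubLayer, ffn: SubLayer) -> Vector:
    """GPT-J style: one layer norm, both sub-layers read the same input."""
    h = layer_norm(x)
    return add(x, add(attn(h), ffn(h)))


def toy_attn(h: Vector) -> Vector:
    """Hypothetical stand-in for attention: a running mean over the vector."""
    return [sum(h[: i + 1]) / (i + 1) for i in range(len(h))]


def toy_ffn(h: Vector) -> Vector:
    """Hypothetical stand-in for the feed-forward network: a plain ReLU."""
    return [max(0.0, v) for v in h]


if __name__ == "__main__":
    x = [1.0, -2.0, 0.5, 4.0]
    print("sequential:", sequential_block(x, toy_attn, toy_ffn))
    print("parallel:  ", parallel_block(x, toy_attn, toy_ffn))
```

In the parallel form both sub-layers depend only on `h`, so neither has to wait for the other's result. The two layouts are not numerically equivalent, which is why the choice has to be made before training rather than applied afterwards.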
The canonical configuration, taken from the EleutherAI/gpt-j-6B Hugging Face model card and the kingoflolz/mesh-transformer-jax repository:
| Hyperparameter | Value |
|---|---|
| Total parameters | 6,053,381,344 (~6.05 billion) |
| Transformer layers | 28 |
| Model dimension (hidden size) | 4,096 |
| Feed-forward dimension | 16,384 |
| Attention heads | 16 |
| Head dimension | 256 |
| Context length | 2,048 tokens |
| Vocabulary size | 50,257 (GPT-2 BPE tokenizer; the Hugging Face GPTJ config exposes 50,400 with padding for efficient TPU sharding) |
| Position encoding | Rotary (RoPE) on 64 dimensions per head (rotary_dim = 64 in the Hugging Face config, i.e. a rotary fraction of 0.25) |
| Attention/FFN layout | Parallel (computed simultaneously, then summed into the residual) |
| Activation function | GELU |
| Tied input/output embeddings | No (separate input embedding and output projection) |
| Training precision | bfloat16 weights, fp32 master copy |
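One way to read the position-encoding row above: only the first 64 of each head's 256 dimensions are rotated by a position-dependent angle, and the remaining 192 pass through unchanged. The sketch below is a minimal plain-Python illustration under stated assumptions: the frequency base of 10,000 follows the RoFormer paper, adjacent dimensions are paired for rotation, and the function name `apply_partial_rope` is hypothetical rather than taken from any GPT-J codebase.

```python
import math
from typing import List


def apply_partial_rope(head: List[float], position: int,
                       rotary_dim: int = 64, base: float = 10000.0) -> List[float]:
    """Rotate the first `rotary_dim` entries of one head vector in 2-D pairs.

    Entries from index `rotary_dim` onward are copied through untouched.
    """
    rotated = list(head)
    for i in range(0, rotary_dim, 2):
        # Each pair gets its own frequency; the angle grows with token position.
        theta = position * base ** (-i / rotary_dim)
        x0, x1 = head[i], head[i + 1]
        rotated[i] = x0 * math.cos(theta) - x1 * math.sin(theta)
        rotated[i + 1] = x0 * math.sin(theta) + x1 * math.cos(theta)
    return rotated


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    query = [math.sin(i) for i in range(256)]      # illustrative values only
    key = [math.cos(0.5 * i) for i in range(256)]
    # Same relative offset (2) at two different absolute positions.
    near = dot(apply_partial_rope(query, 5), apply_partial_rope(key, 3))
    far = dot(apply_partial_rope(query, 12), apply_partial_rope(key, 10))
    print(near, far)
```

The two printed scores agree up to floating-point error, because rotating query and key by their own positions leaves a score that depends only on the distance between them. That relative-position property is the reason RoPE can replace learned absolute position embeddings.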
A few specifics are worth highlighting. The model uses the same byte-level BPE tokenizer as GPT-2 and GPT-3; the 50,257-token vocabulary is padded out to 50,400 in the JAX checkpoint to keep the embedding matrix divisible across the TPU shard axis, which is why the Hugging Face model configuration reports 50,400 while the tokenizer itself has 50,257 entries. The 64-dimensional RoPE slice (out of 256 head dims) corresponds to a rotary fraction of 0.25, the same fraction that GPT-NeoX-20B inherited the following year. The parameter count of 6.05B is the exact figure including the embedding weights, the attention projections, the feed-forward weights, and the layer norms; it differs slightly from the round "6B" used in the model name.
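The figures in the configuration table can be cross-checked with a rough tally. The sketch below is an approximation under stated assumptions, not the official accounting: it assumes bias-free attention projections, biased feed-forward layers, one layer norm per block plus a final one, separate input and output embedding matrices over the padded vocabulary, and an 8-way shard axis. The function name `count_parameters` is hypothetical.

```python
def count_parameters(n_layers: int = 28, d_model: int = 4096, d_ff: int = 16384,
                     vocab: int = 50400, tie_embeddings: bool = False) -> int:
    """Approximate parameter tally from layer shapes (assumptions in comments)."""
    attn = 4 * d_model * d_model               # q, k, v, out projections; no biases assumed
    ffn = 2 * d_model * d_ff + d_ff + d_model  # two dense layers with biases
    norm = 2 * d_model                         # one layer norm (scale + shift) per block
    blocks = n_layers * (attn + ffn + norm)
    embed = vocab * d_model                    # input embedding table
    # Output projection with bias, unless it reuses the input embedding.
    head = 0 if tie_embeddings else vocab * d_model + vocab
    final_norm = 2 * d_model
    return blocks + embed + head + final_norm


PUBLISHED = 6_053_381_344   # figure from the configuration table
SHARDS = 8                  # assumed width of the TPU shard axis

if __name__ == "__main__":
    total = count_parameters()
    print(f"tally: {total:,}  published: {PUBLISHED:,}")
    print(f"relative gap: {abs(total - PUBLISHED) / PUBLISHED:.4%}")
    print("padded vocab divisible by shards:", 50400 % SHARDS == 0)
    print("raw vocab divisible by shards:   ", 50257 % SHARDS == 0)
    print("rotary fraction:", 64 / 256)
```

Under these assumptions the tally lands within 0.1% of the published count; the small remaining gap is not explained by this sketch. The divisibility check shows why the raw 50,257-entry vocabulary would not split evenly across eight shards while the padded 50,400 does.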
GPT-J was trained on the Pile, EleutherAI's 825 GiB English-language corpus assembled by Leo Gao, Stella Biderman, and several other contributors and described in the December 2020 paper "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" (arXiv:2101.00027). The Pile is a union of 22 sub-datasets ranging from web crawls (Pile-CC, derived from Common Crawl) and reference works (Wikipedia, Stack Exchange, PubMed, ArXiv, USPTO patents) to books (Books3, BookCorpus2, Project Gutenberg), email and chat (Enron, Hacker News, OpenSubtitles, Ubuntu IRC), legal text (FreeLaw), and a substantial slice of code from GitHub. The diversity of source material was a deliberate departure from web-only corpora like OpenAI's WebText and Google's C4; the EleutherAI authors argued that mixing high-quality non-web text with cleaned web crawl produced better downstream performance per token.
The inclusion of GitHub code mattered for GPT-J specifically: it gave the model a noticeable head start on programming tasks compared with GPT-3 of equivalent size. The Books3 subset, contributed by Shawn Presser, was a 37 GiB collection of novels and other long-form prose drawn from the Bibliotik shadow library; this component would later attract copyright lawsuits against several model providers, although those legal disputes did not target GPT-J or EleutherAI directly.
GPT-J was trained for 402 billion tokens over 383,500 steps on a single TPU v3-256 pod (256 TPU v3 cores, organised as 32 hosts each with 8 cores). Training took roughly five weeks of wall-clock time. The compute was donated by Google's TPU Research Cloud (TRC) program, the same program that had funded the Pile and the GPT-Neo runs. EleutherAI does not publish a precise dollar figure for the compute donation; secondary estimates in community discussions put the value at several hundred thousand dollars at then-current TPU rental rates.
Wang reported a sustained training throughput of about 151,000 tokens per second on the TPU v3-256 pod, achieving roughly 8.1 PFLOP/s out of a theoretical 13.4 PFLOP/s peak (around 60% hardware utilisation). The total training compute budget worked out to approximately 1.5 × 10²² floating-point operations, which is in the same ballpark as GPT-3 6.7B's 1.2 × 10²² FLOPs but with the difference that GPT-J's compute was on TPUs and used bfloat16 throughout, with model parallelism implemented through JAX's xmap and pjit primitives.
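The training figures above hang together under back-of-the-envelope arithmetic. The sketch below uses the common rule of thumb of about 6 floating-point operations per parameter per training token, which is an approximation introduced here rather than the method the authors used; all other inputs are the numbers quoted in the text.

```python
N_PARAMS = 6_053_381_344      # parameter count from the configuration table
TOKENS = 402e9                # training tokens
STEPS = 383_500               # optimiser steps
CONTEXT = 2_048               # tokens per sequence
TOKENS_PER_SECOND = 151_000   # reported throughput
ACHIEVED_PFLOPS = 8.1         # reported sustained rate
PEAK_PFLOPS = 13.4            # reported theoretical peak


def training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb estimate: roughly 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


flops = training_flops(N_PARAMS, TOKENS)
sequences_per_step = TOKENS / STEPS / CONTEXT
utilisation = ACHIEVED_PFLOPS / PEAK_PFLOPS
compute_days = TOKENS / TOKENS_PER_SECOND / 86_400

if __name__ == "__main__":
    print(f"estimated training compute: {flops:.2e} FLOPs")
    print(f"sequences per step:         {sequences_per_step:.1f}")
    print(f"hardware utilisation:       {utilisation:.1%}")
    print(f"days at quoted throughput:  {compute_days:.1f}")
```

The estimate comes out just under 1.5 × 10²² FLOPs, the step count implies a batch of about 512 sequences of 2,048 tokens, and the utilisation is about 60%. At the quoted throughput the token budget takes about 31 days of uninterrupted compute, a little less than the roughly five weeks of wall-clock time; the difference is not accounted for in the source.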
The Mesh Transformer JAX library that Wang built for the run is itself a notable artefact. It was one of the earliest serious attempts to use JAX's then-experimental xmap operator for production-scale model parallelism, and the design influenced later JAX-based frameworks such as MaxText and the EasyLM project. The library implements a Megatron-style tensor-parallel split of the attention and feed-forward weights across the TPU shards, plus a custom data pipeline that streams the Pile from a Google Cloud Storage bucket.
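The idea behind a Megatron-style split of the feed-forward layer can be shown with plain lists. This is a single-process sketch, not the Mesh Transformer JAX implementation: the shards are simulated in a loop, the final summation stands in for the all-reduce a real device mesh would perform, and the names `feed_forward` and `sharded_feed_forward` are hypothetical. It assumes the hidden width divides evenly by the shard count.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(x: Vector, w: Matrix) -> Vector:
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def gelu(v: Vector) -> Vector:
    """Exact GELU, applied element-wise."""
    return [0.5 * t * (1.0 + math.erf(t / math.sqrt(2.0))) for t in v]


def feed_forward(x: Vector, w_in: Matrix, w_out: Matrix) -> Vector:
    """Unsharded reference: project up, apply GELU, project back down."""
    return matvec(gelu(matvec(x, w_in)), w_out)


def sharded_feed_forward(x: Vector, w_in: Matrix, w_out: Matrix,
                         n_shards: int) -> Vector:
    """Split w_in by columns and w_out by matching rows, then sum the partials."""
    d_ff = len(w_out)
    width = d_ff // n_shards          # assumes d_ff is divisible by n_shards
    total = [0.0] * len(w_out[0])
    for shard in range(n_shards):
        cols = range(shard * width, (shard + 1) * width)
        w_in_part = [[row[j] for j in cols] for row in w_in]   # column slice
        w_out_part = [w_out[j] for j in cols]                  # matching row slice
        # GELU is element-wise, so each shard can apply it to its own slice.
        partial = matvec(gelu(matvec(x, w_in_part)), w_out_part)
        total = [a + b for a, b in zip(total, partial)]        # stands in for all-reduce
    return total


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0]
    w_in = [[0.1 * (i + j) - 0.2 for j in range(4)] for i in range(3)]
    w_out = [[0.05 * (i - j) for j in range(3)] for i in range(4)]
    print(feed_forward(x, w_in, w_out))
    print(sharded_feed_forward(x, w_in, w_out, n_shards=2))
```

Because the activation is element-wise, each shard needs only its own slice of the hidden layer, and a single summation at the end recovers the unsharded result. The same column-then-row pattern is what keeps communication to one reduction per sub-layer.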
The headline claim from Wang and Komatsuzaki was that GPT-J-6B was approximately on par with GPT-3 6.7B on standard zero-shot benchmarks, and that on code-related tasks it was actually better because the Pile contained substantial GitHub data while GPT-3's training corpus did not (the Codex effort to add code data to GPT-3 came later in 2021). The published numbers from the EleutherAI blog post and the Hugging Face model card support that claim closely.
Figures below are taken from the GPT-J Hugging Face model card and match the EleutherAI blog post within rounding. LAMBADA PPL is perplexity (lower is better); the rest are accuracy (higher is better).
| Model | Public weights | Training FLOPs | LAMBADA PPL | LAMBADA Acc | Winogrande | HellaSwag | PIQA |
|---|---|---|---|---|---|---|---|
| GPT-2 1.5B | Yes | n/a | 10.63 | 51.21% | 59.4% | 50.9% | 70.8% |
| GPT-Neo 1.3B | Yes | 3.0e21 | 7.50 | 57.2% | 55.0% | 48.9% | 71.1% |
| GPT-Neo 2.7B | Yes | 6.8e21 | 5.63 | 62.2% | 56.5% | 55.8% | 73.0% |
| GPT-3 Ada (~350M) | No | n/a | 9.95 | 51.6% | 52.9% | 43.4% | 70.5% |
| GPT-3 Babbage (1.3B) | No | 2.4e21 | 5.44 | 63.6% | 58.7% | 54.7% | 75.1% |
| GPT-3 2.7B | No | 4.8e21 | 4.60 | 67.1% | 62.3% | 62.8% | 75.6% |
| GPT-J 6B | Yes | 1.5e22 | 3.99 | 69.7% | 65.3% | 66.1% | 76.5% |
| GPT-3 Curie (6.7B) | No | 1.2e22 | 4.00 | 70.3% | 64.5% | 67.4% | 78.0% |
| GPT-3 13B | No | 2.3e22 | 3.56 | 72.5% | 67.9% | 70.9% | 78.5% |
| GPT-3 Davinci (175B) | No | 3.1e23 | 3.00 | 76.2% | 70.2% | 78.9% | 81.0% |
The pattern is clear. GPT-J essentially matches GPT-3 6.7B on LAMBADA perplexity and accuracy, and is fractionally better on Winogrande while being slightly behind on HellaSwag and PIQA. It comfortably beats every smaller publicly available model. Reported MMLU performance for GPT-J is around 27% accuracy, which is close to the 25% chance level on a four-way multiple-choice test but consistent with what was expected from a model of that scale before instruction tuning became standard.
On code generation, the EleutherAI blog post showed GPT-J producing usable Python from natural-language descriptions, a capability GPT-3 6.7B did not have at release because OpenAI had not yet trained Codex. This made GPT-J the first widely-available open model that researchers could fine-tune for programming-assistant tasks.
GPT-J is released under the Apache License 2.0, a permissive licence that allows commercial use, modification, and redistribution, and that includes an express patent grant. This was a deliberate choice by EleutherAI: a copyleft licence (such as GPL) would have discouraged enterprise adoption, and a more restrictive bespoke licence (such as the OPT and LLaMA licences that came later) would have been against the collective's open-source ethos. The Apache 2.0 licence is one of the main reasons GPT-J became the default base model for so many downstream products in 2022 and 2023.
The model is distributed in three main forms. The original JAX checkpoint sits in the kingoflolz/mesh-transformer-jax GitHub repository and uses the bfloat16 weights produced by the TPU run. A PyTorch port lives at EleutherAI/gpt-j-6B on Hugging Face, exposing GPTJModel and GPTJForCausalLM classes inside the transformers library, with both float32 and float16 branches. The float16 branch was added to make inference fit on a single 24 GB GPU. Later the llama.cpp and ggml ecosystems added GGUF/GGML support for GPT-J, allowing CPU inference with int8 or int4 quantisation on commodity hardware.
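Loading the half-precision branch through the transformers library looks roughly like the sketch below. It is a hedged illustration rather than an official recipe: the helper names `load_gptj_fp16` and `complete` are hypothetical, the sampling settings are arbitrary, the argument names follow the transformers documentation as of GPT-J's integration and may differ in newer releases, and the imports sit inside the function so the file can be read or imported without the libraries installed. The model identifier and the `float16` revision are the ones named in the text.

```python
MODEL_ID = "EleutherAI/gpt-j-6B"


def load_gptj_fp16(device: str = "cuda"):
    """Load the float16 branch of GPT-J and its tokenizer (a multi-gigabyte download)."""
    # Imported lazily: both packages are third-party and heavy.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        revision="float16",         # half-precision branch of the repository
        torch_dtype=torch.float16,  # keep the weights in fp16 once loaded
    )
    return tokenizer, model.to(device)


def complete(prompt: str, tokenizer, model, max_new_tokens: int = 40) -> str:
    """Sample a continuation of `prompt`; GPT-J is a base model, not a chatbot."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.9,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    tok, gptj = load_gptj_fp16()
    print(complete("EleutherAI is", tok, gptj))
```

Because GPT-J has no instruction tuning, prompts work best when phrased as the beginning of a document to be continued rather than as a question or command.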
For deployment, full-precision fp32 inference needs roughly 24 GB of VRAM for the weights alone (a 16 GB GPU is enough only if the model is loaded in fp16). Fine-tuning with the standard Adam optimiser is much heavier, because each parameter also needs a gradient and Adam's two moment buffers, which comes to roughly four times the weight memory in fp32; practical fine-tuning therefore typically uses DeepSpeed ZeRO-3 or LoRA-style adapters. The original Mesh Transformer JAX repository also includes scripts to fine-tune the JAX checkpoint on a TPU v3-8 slice, which is the cheapest official option.
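The memory figures follow directly from the parameter count. The sketch below counts weight storage only, in decimal gigabytes; activations, the attention cache, and framework overhead are ignored, and the full fine-tuning estimate assumes fp32 weights, fp32 gradients, and two fp32 Adam moment buffers with no sharding or offloading. The helper names are hypothetical.

```python
N_PARAMS = 6_053_381_344   # parameter count quoted for GPT-J

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(dtype: str, n_params: int = N_PARAMS) -> float:
    """Weight storage in decimal gigabytes for a given numeric format."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def adam_finetune_gb(n_params: int = N_PARAMS) -> float:
    """Weights + gradients + two Adam moments, all held in fp32 (an assumption)."""
    buffers_per_param = 4
    return n_params * buffers_per_param * BYTES_PER_PARAM["fp32"] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_gb(dtype):.1f} GB of weights")
    print(f"full Adam fine-tune: {adam_finetune_gb():.1f} GB before activations")
```

The fp32 weights come to about 24 GB and the fp16 weights to about 12 GB, which matches the GPU sizes quoted above. The fine-tuning estimate of nearly 100 GB before activations shows why full fine-tuning on a single GPU is impractical and why sharded optimisers or adapter methods are the usual route.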
The Apache 2.0 licence and the model's competitive performance set off a Cambrian explosion of GPT-J fine-tunes and derivatives. A non-exhaustive sample:
| Derivative | Maintainer | Year | Purpose |
|---|---|---|---|
| GPT4All-J | Nomic AI | 2023 | Instruction-tuned chatbot, trained on 800K instruction-response pairs; Apache 2.0 |
| Dolly v1 | Databricks | March 2023 | First widely-publicised instruction-tuned open chatbot; later replaced by Dolly 2.0 on Pythia-12B |
| KoboldAI Adventure | KoboldAI community | 2021-2022 | Long-form storytelling and text adventure |
| BERTIN GPT-J-6B | BERTIN Project | 2022 | Spanish fine-tune |
| nb-gpt-j-6B | National Library of Norway | 2022 | Norwegian fine-tune |
| GPT-JT-6B | Together AI | 2022 | Instruction- and multitask-tuned variant |
| PygmalionAI 6B | Pygmalion community | 2022-2023 | Conversational role-play model (later moved to LLaMA bases) |
| CodeGen-6B (GPT-J-style architecture, not a fine-tune) | Salesforce | 2022 | Code-focused model inspired by the GPT-J recipe |
Several commercial inference-as-a-service providers, including NLP Cloud, Banana, Modal, Replicate, and Forefront, ran GPT-J as a hosted endpoint during 2021-2023, often as a cheaper alternative to GPT-3 Curie. Cerebras and Graphcore both used GPT-J as a reference workload for their custom AI accelerators in marketing material.
To place GPT-J in the 2021 landscape, here is a snapshot of the most-discussed large language models from roughly the same window:
| Model | Developer | Parameters | Year | Open weights |
|---|---|---|---|---|
| GPT-2 1.5B | OpenAI | 1.5B | 2019 | Yes |
| GPT-Neo 2.7B | EleutherAI | 2.7B | March 2021 | Yes (MIT) |
| GPT-3 Ada | OpenAI | ~350M | 2020 | No |
| GPT-3 Babbage | OpenAI | 1.3B | 2020 | No |
| GPT-3 Curie | OpenAI | 6.7B | 2020 | No |
| GPT-J-6B | EleutherAI | 6B | June 2021 | Yes (Apache 2.0) |
| Megatron-LM 8.3B | NVIDIA | 8.3B | 2019-2020 | Code yes, weights mostly no |
| Jurassic-1 Jumbo | AI21 Labs | 178B | August 2021 | No (paid API) |
| GPT-3 Davinci | OpenAI | 175B | 2020 | No |
| Codex (initial) | OpenAI | 12B | August 2021 | No |
| GPT-NeoX-20B | EleutherAI | 20B | February 2022 | Yes (Apache 2.0) |
| BLOOM-176B | BigScience | 176B | July 2022 | Yes (RAIL licence) |
| OPT-175B | Meta AI | 175B | May 2022 | Yes (research licence) |
In that company GPT-J occupied an awkwardly useful spot. It was much smaller than the 100-billion-plus closed models from OpenAI and AI21 Labs, but it was the only GPT-3-style model in the single-digit-billion range that anyone outside a large lab could actually run, study, and modify. For the eight months between June 2021 and February 2022, it was the largest fully open GPT-3-style language model available.
GPT-J carried the limitations typical of a pre-RLHF base model. It is a pure next-token predictor with no instruction tuning, no reinforcement learning from human feedback, and no safety filtering applied at training time. The Pile contains a fair amount of toxic and explicit content (the Hugging Face model card warns about this directly), and GPT-J reproduces those patterns when prompted. The 2,048-token context window feels short by 2024 standards, where 32K, 128K, and even 1M token contexts have become normal.
The GPT-2 byte-level BPE tokenizer is also showing its age. It was trained on English web text from 2018 and is inefficient for non-English languages, source code, and mathematical notation; later tokenizers, such as the SentencePiece-based ones used in T5, LLaMA, and many subsequent models, compress these inputs better. Training data ran through roughly 2020 (the Pile cut-off), so GPT-J knows nothing about the COVID-19 vaccine rollout, the Russia-Ukraine war, or any AI development after late 2020.
For English text continuation in a research or hobby setting GPT-J still works fine, but it is not appropriate for production deployment without careful fine-tuning, supervision, and content moderation. The model card explicitly says so.
The deeper importance of GPT-J is structural rather than technical. The release made several things visible that had previously been argued only in principle.
It showed that a small, mostly volunteer collective with donated compute could train a model competitive with the public-API tier of the leading commercial provider. That broke the assumption that frontier-adjacent capability required a corporate budget, and it gave academic researchers a model they could actually open up and study. The Pythia model suite, released by EleutherAI in 2023, would later push that further by publishing not just final weights but every training checkpoint along the way.
It established a template for open-source LLM releases (open weights, permissive licence, transparent training data, public model card, multiple format variants on Hugging Face) that later projects largely copied. GPT-NeoX-20B (EleutherAI, February 2022), BLOOM (BigScience, July 2022), OPT (Meta AI, May 2022), Cerebras-GPT (March 2023), MPT (MosaicML, May 2023), Pythia (EleutherAI, April 2023), RWKV, and Falcon all followed broadly the same playbook. Even LLaMA (February 2023), although Meta initially released it under a research-only licence, was clearly aimed at the same ecological niche GPT-J had been occupying.
It also seeded a generation of researchers and engineers. Many of the people who later worked on Pythia, GPT-NeoX-20B, OpenLLaMA, RedPajama, and the various Mistral and Mixtral derivatives passed through EleutherAI's Discord during the GPT-J period. The Mesh Transformer JAX codebase fed directly into later JAX-based frameworks. Ben Wang himself went on to work on commercial LLM systems after GPT-J.
Finally, GPT-J quietly normalised several architectural choices: rotary position embeddings, parallel attention/feed-forward blocks, and careful TPU/GPU sharding. Almost every high-profile open model since 2022 uses RoPE, and the parallel-block design reappeared in GPT-NeoX-20B, PaLM, and Falcon, although many later architectures, LLaMA among them, returned to sequential blocks.
By 2024, GPT-J was small by frontier standards and few people deployed it for new applications. As a moment in the history of open-source AI, however, it occupies a position comparable to that of the original GPT-2 release: the first time the wider research community got a model big enough to feel real, and the foundation on which most of what came next was built.