# GPT-J

> Source: https://aiwiki.ai/wiki/gpt_j
> Updated: 2026-06-23
> Categories: Large Language Models, Open Source AI, Research Organizations
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**GPT-J** (full release name **GPT-J-6B**) is a 6-billion-parameter autoregressive [transformer](/wiki/transformer) [language model](/wiki/language_model) released by the [EleutherAI](/wiki/eleutherai) collective on June 9, 2021, and first announced in a blog post dated June 4, 2021. [1][3] It was the largest publicly available open-source language model in the world at launch, and the first community-built model to come close to the performance of OpenAI's then-state-of-the-art [GPT-3](/wiki/gpt-3) 6.7B ("Curie") variant, which its authors described as performing "nearly on par with 6.7B GPT-3 (or Curie) on various zero-shot down-streaming tasks." [1] Because the weights were published under the [Apache License 2.0](/wiki/apache_license), anyone could download, fine-tune, redistribute, or commercialise the model, and that single fact reshaped the open-source large language model scene for the next eighteen months.

The model was designed and trained by Ben Wang, who also wrote the underlying training framework Mesh Transformer JAX, with Aran Komatsuzaki as co-author of the introductory blog post. [1][4] It was trained on the [Pile](/wiki/the_pile), EleutherAI's 825 GiB curated text dataset, using a TPU v3-256 pod donated through Google's TPU Research Cloud (TRC) program. [3][5] GPT-J set the architectural template that several later open models, including [GPT-NeoX-20B](/wiki/gpt_neox) and Pythia, would build on, and a long list of community fine-tunes ([GPT4All-J](/wiki/gpt4all), the original Dolly v1, KoboldAI's adventure variants, BERTIN, and so on) have GPT-J as their base model.

## What is the background and origin of GPT-J?

EleutherAI began in July 2020 as a Discord server organised by [Connor Leahy](/wiki/connor_leahy), Leo Gao, and Sid Black, with the explicit goal of replicating the GPT-3 paper in the open. [7] The group's first public deliverable was the Pile dataset in December 2020, followed by the GPT-Neo family of models (125M, 1.3B, and 2.7B parameters) released in March 2021. GPT-Neo proved that the collective could ship working models, but at 2.7 billion parameters the largest GPT-Neo was still a long way short of the GPT-3 scales that researchers wanted to study.

GPT-J was the next step. Ben Wang, working largely independently, had been building a [JAX](/wiki/jax)-based model-parallel training library called Mesh Transformer JAX, optimised for the unusual sharding constraints of Google's TPU v3 pods. Wang and Komatsuzaki announced GPT-J-6B in a blog post titled "GPT-J-6B: 6B JAX-Based (Mesh) Transformer LM" on June 4, 2021, and the weights and code were uploaded to the kingoflolz/mesh-transformer-jax GitHub repository alongside a public web demo. [1][4] EleutherAI hosted the artefact on its site and on [Hugging Face](/wiki/hugging_face) under the EleutherAI/gpt-j-6B identifier, with Stella Biderman handling the Hugging Face port. [2][3] InfoQ covered the launch in July 2021 and quoted Komatsuzaki claiming GPT-J was "the best-performing publicly available Transformer LM in terms of zero-shot performance." [1][8]

The political subtext mattered. OpenAI had moved from open-source GPT-2 weights in 2019 to closed weights and a paid API for GPT-3 in 2020, citing safety concerns about misuse. EleutherAI argued, both in its writing and through actions, that closed models were not actually a safety strategy because the architectures and recipes were already well known, and that gating access to weights mostly hurt academic safety research. GPT-J became the practical demonstration of that argument. Connor Leahy summarised the position in the InfoQ piece: GPT-like models are "simple and theoretically straight-forward," and trying to keep them locked away while everyone roughly knew how to build them was not a productive use of attention. [8]

## How is GPT-J built? Architecture

GPT-J is a decoder-only autoregressive transformer in the same broad family as [GPT-2](/wiki/gpt-2) and GPT-3, but with three architectural choices that were unusual at the time and that proved influential afterwards.

The first is [rotary position embeddings](/wiki/rotary_position_embedding) (RoPE), introduced in the RoFormer paper by Su et al. and then adopted by EleutherAI before most major labs. [11][12] GPT-J does not use the learned absolute position embeddings of GPT-2/3; it applies RoPE to a portion of each attention head's dimensions, specifically 64 of the 256 head dimensions (a rotary fraction of 0.25). [3]

The second is parallel placement of the attention and feed-forward sub-layers in each transformer block. In a standard GPT-2/3 block the input goes through self-attention, then through the feed-forward network, with residual additions in between. GPT-J computes attention and feed-forward in parallel from the same layer-normalised input, then adds both outputs to the residual stream. This trick was sketched in the appendix of the GPT-3 paper as a way to reduce communication overhead in large-scale model parallelism, and EleutherAI was among the first groups to deploy it at six-billion-parameter scale. Google's [PaLM](/wiki/palm) (2022) used the same parallel-layer design, stating in its paper that it adopted the "Parallel" formulation "as in GPT-J-6B" because it yielded roughly 15% faster training at large scale. [15]

The third is the use of an unusually wide attention-head dimension (256 instead of GPT-3's 128) with relatively few heads (16), which Wang found gave better TPU utilisation. [3]
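The difference between the sequential and parallel layouts described above can be made concrete with a toy NumPy sketch. It is an illustration under simplifying assumptions, not the Mesh Transformer JAX implementation: attention is single-head, RoPE is left out, layer normalisation has no learned parameters, and every function name and size here is hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise the last axis to zero mean and unit variance (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def causal_attention(h, w_qkv, w_out):
    """Single-head causal self-attention, just enough to show the data flow."""
    q, k, v = np.split(h @ w_qkv, 3, axis=-1)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)               # hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ w_out

def feed_forward(h, w_in, w_out):
    return gelu(h @ w_in) @ w_out

def sequential_block(x, p):
    """GPT-2/3 style: attention first, then feed-forward on the updated residual."""
    x = x + causal_attention(layer_norm(x), p["w_qkv"], p["w_attn_out"])
    return x + feed_forward(layer_norm(x), p["w_ff_in"], p["w_ff_out"])

def parallel_block(x, p):
    """GPT-J style: both sub-layers read the same normalised input."""
    h = layer_norm(x)
    attn = causal_attention(h, p["w_qkv"], p["w_attn_out"])
    ff = feed_forward(h, p["w_ff_in"], p["w_ff_out"])
    return x + attn + ff  # one residual update per block

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 16, 64, 5  # toy sizes, not GPT-J's 4,096 and 16,384
params = {
    "w_qkv": rng.normal(0, 0.1, (d_model, 3 * d_model)),
    "w_attn_out": rng.normal(0, 0.1, (d_model, d_model)),
    "w_ff_in": rng.normal(0, 0.1, (d_model, d_ff)),
    "w_ff_out": rng.normal(0, 0.1, (d_ff, d_model)),
}
x = rng.normal(size=(seq_len, d_model))
y_par = parallel_block(x, params)
y_seq = sequential_block(x, params)
print(y_par.shape, float(np.abs(y_par - y_seq).max()))
```

In the parallel form the two sub-layers do not depend on each other's output, which is what lets their matrix multiplications be scheduled together; the sequential form has to finish attention before the feed-forward network can start.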

### What are GPT-J's hyperparameters?

The canonical configuration, taken from the EleutherAI/gpt-j-6B Hugging Face model card and the kingoflolz/mesh-transformer-jax repository: [3][4]

| Hyperparameter | Value |
| --- | --- |
| Total parameters | 6,053,381,344 (~6.05 billion) |
| Transformer layers | 28 |
| Model dimension (hidden size) | 4,096 |
| Feed-forward dimension | 16,384 |
| Attention heads | 16 |
| Head dimension | 256 |
| Context length | 2,048 tokens |
| Vocabulary size | 50,257 (GPT-2 BPE tokenizer; the Hugging Face GPTJ config exposes 50,400 with padding for efficient TPU sharding) |
| Position encoding | Rotary (RoPE) on 64 dimensions per head (rotary_pct = 0.25) |
| Attention/FFN layout | Parallel (computed simultaneously, then summed into the residual) |
| Activation function | GELU |
| Tied input/output embeddings | No (separate input embedding and output projection) |
| Training precision | bfloat16 weights, fp32 master copy |
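As a sanity check on the table, the layer sizes can be turned into a rough parameter count. The sketch below rests on assumptions modelled on the Hugging Face port rather than stated in this article: bias-free attention projections, biased feed-forward layers, one LayerNorm per block, a biased output projection, and the padded 50,400-entry vocabulary. The function name and the `tie_embeddings` flag are hypothetical.

```python
def gptj_param_count(n_layer=28, d_model=4096, d_ff=16384, vocab=50400,
                     tie_embeddings=False):
    """Approximate GPT-J parameter count from its published layer sizes."""
    attn = 4 * d_model * d_model                            # q, k, v, out (assumed bias-free)
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model  # two linear layers with biases
    norm = 2 * d_model                                      # LayerNorm scale and shift
    blocks = n_layer * (attn + ffn + norm)
    embed = vocab * d_model                                 # input embedding matrix
    # A tied head would reuse the embedding matrix and add no new weights.
    head = 0 if tie_embeddings else vocab * d_model + vocab
    final_norm = 2 * d_model
    return blocks + embed + head + final_norm

untied = gptj_param_count()
tied = gptj_param_count(tie_embeddings=True)
print(f"untied head: {untied:,}")
print(f"tied head:   {tied:,}")
print(f"gap to the quoted 6,053,381,344: {abs(untied - 6_053_381_344) / 6_053_381_344:.4%}")
```

Under these assumptions the untied count lands within about 0.05% of the quoted figure, while a tied output head would come out roughly 0.2 billion parameters lower. The small remaining gap presumably comes from bookkeeping details of the original JAX checkpoint that this sketch does not model.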

A few specifics worth highlighting. The vocabulary uses the same byte-level [Byte-Pair Encoding](/wiki/byte_pair_encoding) tokenizer as GPT-2 and GPT-3; the 50,257-token vocabulary is padded out to 50,400 in the JAX checkpoint to keep the embedding matrix divisible across the TPU shard axis, which is why the Hugging Face configuration sometimes reports the larger number. [3] The 64-dimensional RoPE slice (out of 256 head dims) corresponds to a rotary fraction of 0.25 and is the same fraction that GPT-NeoX-20B inherited the following year. [10] The parameter count of 6,053,381,344 is the exact figure including the embedding weights, the attention projections, the feed-forward weights, and the layer norms; it differs slightly from the round "6B" used in the model name. [3]
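The 64-of-256 rotary slice can be illustrated with a short NumPy sketch. It assumes the common RoPE conventions of a base of 10,000 and rotation of adjacent (even, odd) feature pairs; those details, and the function name `partial_rope`, are assumptions for illustration rather than facts taken from this article.

```python
import numpy as np

def partial_rope(x, positions, rotary_dim=64, base=10000.0):
    """Rotate the first `rotary_dim` features of each vector; pass the rest through.

    x:         (seq_len, head_dim) queries or keys for one attention head
    positions: (seq_len,) integer token positions
    """
    x_rot, x_pass = x[:, :rotary_dim], x[:, rotary_dim:]
    # One rotation frequency per feature pair.
    inv_freq = 1.0 / base ** (np.arange(0, rotary_dim, 2) / rotary_dim)
    angles = positions[:, None] * inv_freq[None, :]   # (seq_len, rotary_dim / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x_rot[:, 0::2], x_rot[:, 1::2]
    rotated = np.empty_like(x_rot)
    rotated[:, 0::2] = x_even * cos - x_odd * sin      # 2-D rotation of each pair
    rotated[:, 1::2] = x_even * sin + x_odd * cos
    return np.concatenate([rotated, x_pass], axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 256))  # one query vector with GPT-J's head dimension
k = rng.normal(size=(1, 256))  # one key vector

def score(m, n):
    """Attention logit between the query at position m and the key at position n."""
    qm = partial_rope(q, np.array([m]))
    kn = partial_rope(k, np.array([n]))
    return float(np.dot(qm[0], kn[0]))

# Both pairs are two positions apart, so the logits agree up to rounding.
print(score(3, 1), score(10, 8))
```

The two printed scores agree because rotating query and key by position-dependent angles makes their dot product depend only on the offset between positions. The 192 untouched dimensions carry no positional signal at all.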

## What data was GPT-J trained on? The Pile

GPT-J was trained on the Pile, EleutherAI's 825 GiB, mostly English corpus assembled by Leo Gao, Stella Biderman, and several other contributors and described in the December 2020 paper "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" (arXiv:2101.00027). [5] The Pile is a union of 22 sub-datasets ranging from web crawls (Pile-CC, derived from [Common Crawl](/wiki/common_crawl)) and reference works (Wikipedia, Stack Exchange, PubMed, ArXiv, USPTO patents) to books (Books3, BookCorpus2, Project Gutenberg), email, forums, subtitles, and chat (Enron Emails, Hacker News, OpenSubtitles, Ubuntu IRC), legal text (FreeLaw), and a substantial slice of code from GitHub. [5] The diversity of source material was a deliberate departure from web-only corpora like OpenAI's WebText and Google's C4; the EleutherAI authors argued that mixing high-quality non-web text with cleaned web crawl produced better downstream performance per token. [5]

The inclusion of GitHub code mattered for GPT-J specifically: it gave the model a noticeable head start on programming tasks compared with GPT-3 of equivalent size. The Books3 subset, contributed by Shawn Presser, was a 37 GiB collection of novels and other long-form prose drawn from the Bibliotik shadow library; this component would later attract copyright lawsuits against several model providers, although those legal disputes did not target GPT-J or EleutherAI directly.

## How much compute did GPT-J need? Training infrastructure

GPT-J was trained for 402 billion tokens over 383,500 steps on a single TPU v3-256 pod (256 TPU v3 cores, organised as 32 hosts each with 8 cores). [3] Training took roughly five weeks of wall-clock time. The compute was donated by Google's TPU Research Cloud (TRC) program, the same program that had funded the Pile and the GPT-Neo runs. [1][3] EleutherAI does not publish a precise dollar figure for the compute donation; secondary estimates in community discussions put the value at several hundred thousand dollars at then-current TPU rental rates.

Wang reported a sustained training throughput of about 151,000 tokens per second on the TPU v3-256 pod, slightly faster than GPT-Neo-2.7B's 148,000 tokens per second, achieving roughly 8.1 PFLOP/s out of a theoretical 13.4 PFLOP/s peak (around 60% hardware utilisation). [1] The total training compute came to approximately 1.5 × 10^22 floating-point operations, in the same ballpark as the 1.2 × 10^22 FLOPs of GPT-3 6.7B. The difference is that GPT-J's compute ran on TPUs in bfloat16 throughout, with model parallelism implemented through JAX's xmap and pjit primitives. [4]
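The training figures above can be cross-checked with a few lines of arithmetic. The sketch uses the widely used approximation that training costs about 6 floating-point operations per parameter per token; that rule of thumb is an assumption brought in here, not something stated in the sources, and the variable names are illustrative.

```python
N_PARAMS = 6_053_381_344   # parameter count quoted in the hyperparameter table
TOKENS = 402e9             # training tokens
STEPS = 383_500            # optimiser steps
THROUGHPUT = 151_000       # reported tokens per second
SEQ_LEN = 2_048            # context length

# Batch size implied by the token and step counts.
tokens_per_step = TOKENS / STEPS
sequences_per_step = tokens_per_step / SEQ_LEN

# Assumed rule of thumb: about 6 FLOPs per parameter per training token.
train_flops = 6 * N_PARAMS * TOKENS

# Pure compute time at the sustained throughput, ignoring restarts and evaluation.
days = TOKENS / THROUGHPUT / 86_400

# Ratio of the reported achieved rate to the reported hardware peak.
utilisation = 8.1 / 13.4

print(f"tokens per step:    {tokens_per_step:,.0f}")
print(f"sequences per step: {sequences_per_step:.1f}")
print(f"training FLOPs:     {train_flops:.2e}")
print(f"days at throughput: {days:.1f}")
print(f"utilisation:        {utilisation:.1%}")
```

The implied batch is close to 2^20 tokens per step, which would correspond to about 512 full-length sequences. The time estimate is a lower bound on wall-clock duration, so it sits somewhat under the "roughly five weeks" figure.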

The Mesh Transformer JAX library that Wang built for the run is itself a notable artefact. It was one of the earliest serious attempts to use JAX's then-experimental xmap operator for production-scale model parallelism, and the design influenced later JAX-based frameworks such as MaxText and the EasyLM project. [4] The library implements a Megatron-style tensor-parallel split of the attention and feed-forward weights across the TPU shards, plus a custom data pipeline that streams the Pile from a Google Cloud Storage bucket.

## How good is GPT-J? Performance and benchmarks

The headline claim from Wang and Komatsuzaki was that GPT-J-6B was approximately on par with GPT-3 6.7B on standard zero-shot benchmarks, and that on code-related tasks it was actually better because the Pile contained substantial GitHub data while GPT-3's training corpus did not (the Codex effort to add code data to GPT-3 came later in 2021). [1] The published numbers from the EleutherAI blog post and the Hugging Face model card support that claim closely.

### How does GPT-J compare to GPT-3 zero-shot?

Figures below are taken from the GPT-J Hugging Face model card and match the EleutherAI blog post within rounding. [3] LAMBADA PPL is perplexity (lower is better); the rest are accuracy (higher is better).

| Model | Public weights | Training FLOPs | LAMBADA PPL | LAMBADA Acc | Winogrande | HellaSwag | PIQA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-2 1.5B | Yes | n/a | 10.63 | 51.21% | 59.4% | 50.9% | 70.8% |
| GPT-Neo 1.3B | Yes | 3.0e21 | 7.50 | 57.2% | 55.0% | 48.9% | 71.1% |
| GPT-Neo 2.7B | Yes | 6.8e21 | 5.63 | 62.2% | 56.5% | 55.8% | 73.0% |
| GPT-3 Ada (~350M) | No | n/a | 9.95 | 51.6% | 52.9% | 43.4% | 70.5% |
| GPT-3 Babbage (1.3B) | No | 2.4e21 | 5.44 | 63.6% | 58.7% | 54.7% | 75.1% |
| GPT-3 2.7B | No | 4.8e21 | 4.60 | 67.1% | 62.3% | 62.8% | 75.6% |
| **GPT-J 6B** | **Yes** | **1.5e22** | **3.99** | **69.7%** | **65.3%** | **66.1%** | **76.5%** |
| GPT-3 Curie (6.7B) | No | 1.2e22 | 4.00 | 70.3% | 64.5% | 67.4% | 78.0% |
| GPT-3 13B | No | 2.3e22 | 3.56 | 72.5% | 67.9% | 70.9% | 78.5% |
| GPT-3 Davinci (175B) | No | 3.1e23 | 3.00 | 76.2% | 70.2% | 78.9% | 81.0% |

The pattern is clear. GPT-J essentially matches GPT-3 6.7B on [LAMBADA](/wiki/lambada) perplexity (3.99 versus 4.00) and accuracy (69.7% versus 70.3%), and is fractionally better on Winogrande while being slightly behind on [HellaSwag](/wiki/hellaswag) and PIQA. [3] It comfortably beats every smaller publicly available model. Reported [MMLU](/wiki/mmlu) performance for GPT-J is around 27% accuracy, which is close to the 25% chance level on four-way multiple choice but consistent with what was expected from a model of that scale before instruction tuning became standard.

On code generation, the EleutherAI blog post showed GPT-J producing usable Python from natural-language descriptions, a capability GPT-3 6.7B did not have at release because OpenAI had not yet trained [Codex](/wiki/codex). [1] This made GPT-J the first widely-available open model that researchers could fine-tune for programming-assistant tasks.

## Is GPT-J open source? License, distribution, and frameworks

GPT-J is released under the Apache License 2.0, a permissive licence that allows commercial use, modification, and redistribution and includes an explicit patent grant. [3] This was a deliberate choice by EleutherAI: a copyleft licence (such as GPL) would have discouraged enterprise adoption, and a more restrictive bespoke licence (such as the OPT and LLaMA licences that came later) would have been against the collective's open-source ethos. The Apache 2.0 licence is one of the main reasons GPT-J became the default base model for so many downstream products in 2022 and 2023.

The model is distributed in three main forms. The original JAX checkpoint sits in the kingoflolz/mesh-transformer-jax GitHub repository and uses the bfloat16 weights produced by the TPU run. [4] A PyTorch port lives at EleutherAI/gpt-j-6B on Hugging Face, exposing GPTJModel and GPTJForCausalLM classes inside the transformers library, with both float32 and float16 branches. [9] The float16 branch halves the weight footprint to roughly 12 GB, so that inference fits on a single 16 GB GPU. Later the [llama.cpp](/wiki/llama_cpp) and ggml ecosystems added GGUF/GGML support for GPT-J, allowing CPU inference with int8 or int4 quantisation on commodity hardware.
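For readers who want to try the Hugging Face port, a minimal loading sketch follows. It assumes the `transformers` and `torch` packages are installed. The `revision="float16"` branch name and the `torch_dtype` argument follow the transformers documentation cited as [9] and may differ between library versions. The helper names `load_gptj` and `complete` are hypothetical.

```python
def load_gptj(revision="float16"):
    """Load the Hugging Face port of GPT-J in half precision.

    Imports live inside the function, so defining it downloads nothing;
    calling it fetches the checkpoint (roughly 12 GB of weights in fp16).
    """
    import torch
    from transformers import AutoTokenizer, GPTJForCausalLM

    model_id = "EleutherAI/gpt-j-6B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = GPTJForCausalLM.from_pretrained(
        model_id,
        revision=revision,           # branch holding the fp16 weights
        torch_dtype=torch.float16,   # keep the weights in half precision
    )
    return tokenizer, model.eval()

def complete(tokenizer, model, prompt, max_new_tokens=40):
    """Sample a continuation of `prompt` from a loaded model."""
    import torch

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.9,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

A typical session would call `tokenizer, model = load_gptj()`, move the model to a GPU with `model.to("cuda")` if one is available, and then call `complete(tokenizer, model, "EleutherAI is")`. Because GPT-J is a base model, the result is a continuation of the prompt rather than an answer to it.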

Deployment-wise, full-precision fp32 inference requires roughly 24 GB of VRAM for the weights alone (a 16 GB GPU is enough only if the model is loaded in fp16). [9] Fine-tuning with the standard Adam optimiser is much heavier because mixed-precision Adam keeps three additional fp32 buffers per parameter (a master copy of the weights and two moment estimates) on top of the gradients, so practical fine-tuning typically uses [DeepSpeed](/wiki/deepspeed) ZeRO-3 or [LoRA](/wiki/lora)-style adapters. The original Mesh Transformer JAX repository also includes scripts to fine-tune the JAX checkpoint on a TPU v3-8 slice, which is the cheapest official option. [4]
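The memory figures above follow from simple arithmetic on the parameter count. The sketch below counts weights and optimiser state only; activations, the attention cache, and framework overhead are ignored. The fine-tuning layout (fp16 weights and gradients plus three fp32 buffers) is an assumed mixed-precision setup for illustration.

```python
N_PARAMS = 6_053_381_344   # parameter count quoted in the hyperparameter table
GB = 1e9                   # decimal gigabytes

def weights_gb(bytes_per_param):
    """Memory needed to hold one copy of the parameters at a given width."""
    return N_PARAMS * bytes_per_param / GB

fp32_weights = weights_gb(4)   # full-precision inference copy
fp16_weights = weights_gb(2)   # half-precision inference copy

# Assumed mixed-precision Adam setup: fp16 weights and gradients, plus three
# fp32 buffers per parameter (master weights and Adam's two moment estimates).
fp16_grads = weights_gb(2)
adam_buffers = 3 * weights_gb(4)
finetune_total = fp16_weights + fp16_grads + adam_buffers

print(f"fp32 weights:      {fp32_weights:.1f} GB")
print(f"fp16 weights:      {fp16_weights:.1f} GB")
print(f"Adam fp32 buffers: {adam_buffers:.1f} GB")
print(f"fine-tuning total: {finetune_total:.1f} GB before activations")
```

The optimiser state alone comes out several times larger than the fp16 weights. That is why sharding it across devices (ZeRO) or training only small adapter matrices (LoRA) makes fine-tuning tractable on ordinary hardware.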

## What models are built on GPT-J? Notable derivatives and fine-tunes

The Apache 2.0 licence and the model's competitive performance set off a Cambrian explosion of GPT-J fine-tunes. A non-exhaustive sample:

| Derivative | Maintainer | Year | Purpose |
| --- | --- | --- | --- |
| GPT4All-J | Nomic AI | 2023 | Instruction-tuned chatbot, trained on 800K instruction-response pairs; Apache 2.0 [13][14] |
| Dolly v1 | [Databricks](/wiki/databricks) | March 2023 | First widely-publicised instruction-tuned open chatbot; later replaced by Dolly 2.0 on Pythia-12B |
| KoboldAI Adventure | KoboldAI community | 2021-2022 | Long-form storytelling and text adventure |
| BERTIN GPT-J-6B | BERTIN Project | 2022 | Spanish fine-tune |
| nb-gpt-j-6B | National Library of Norway | 2022 | Norwegian fine-tune |
| GPT-JT-6B | [Together AI](/wiki/together_ai) | 2022 | Instruction- and multitask-tuned variant |
| PygmalionAI 6B | Pygmalion community | 2022-2023 | Conversational role-play model (later moved to LLaMA bases) |
| CodeGen-6B (GPT-J-style architecture) | Salesforce | 2022 | Code-focused model inspired by the GPT-J recipe rather than a fine-tune of its weights |

Several commercial inference-as-a-service providers, including NLP Cloud, Banana, Modal, Replicate, and Forefront, ran GPT-J as a hosted endpoint during 2021-2023, often as a cheaper alternative to GPT-3 Curie. [Cerebras](/wiki/cerebras) and Graphcore both used GPT-J as a reference workload for their custom AI accelerators in marketing material.

## How did GPT-J compare with contemporary models?

To place GPT-J in the 2021 landscape, here is a snapshot of the most-discussed large language models from roughly the same window:

| Model | Developer | Parameters | Year | Open weights |
| --- | --- | --- | --- | --- |
| GPT-2 1.5B | OpenAI | 1.5B | 2019 | Yes |
| GPT-Neo 2.7B | EleutherAI | 2.7B | March 2021 | Yes (MIT) |
| GPT-3 Ada | OpenAI | ~350M | 2020 | No |
| GPT-3 Babbage | OpenAI | 1.3B | 2020 | No |
| GPT-3 Curie | OpenAI | 6.7B | 2020 | No |
| **GPT-J-6B** | **EleutherAI** | **6B** | **June 2021** | **Yes (Apache 2.0)** |
| [Megatron-LM](/wiki/megatron_lm) 8.3B | NVIDIA | 8.3B | 2019-2020 | Code yes, weights mostly no |
| Jurassic-1 Jumbo | [AI21 Labs](/wiki/ai21_labs) | 178B | August 2021 | No (paid API) |
| GPT-3 Davinci | OpenAI | 175B | 2020 | No |
| Codex (initial) | OpenAI | 12B | August 2021 | No |
| GPT-NeoX-20B | EleutherAI | 20B | February 2022 | Yes (Apache 2.0) |
| [BLOOM](/wiki/bloom)-176B | BigScience | 176B | July 2022 | Yes (RAIL licence) |
| [OPT](/wiki/opt)-175B | Meta AI | 175B | May 2022 | Yes (research licence) |

In that company GPT-J occupied an awkwardly useful spot. It was much smaller than the 100-billion-plus closed models from OpenAI and AI21 Labs, but it was the only model in the high single-digit-billion range that anyone outside a large lab could actually run, study, and modify. For the eight months between June 2021 and February 2022, it was the largest fully open language model in existence.

## What are GPT-J's limitations?

GPT-J carried the limitations typical of a pre-RLHF base model. It is a pure next-token predictor with no instruction tuning, no [reinforcement learning from human feedback](/wiki/rlhf), and no safety filtering applied at training time. The Pile contains a fair amount of toxic and explicit content (the Hugging Face model card warns about this directly), and GPT-J reproduces those patterns when prompted. [3] The model card is explicit that the model is a research artefact rather than a product: "GPT-J-6B is not intended for deployment without fine-tuning, supervision, and/or moderation. It is not a in itself a product and cannot be used for human-facing interactions." [3] The 2,048-token context window feels short by 2024 standards, where 32K, 128K, and even 1M token contexts have become normal.

The GPT-2 byte-level BPE tokenizer is also showing its age. It was trained on English web text from 2018 and is inefficient for non-English languages, source code, and mathematical notation; later tokenizers based on [SentencePiece](/wiki/sentencepiece) (used in T5, LLaMA, and many subsequent models) compress these inputs better. Training data ran through roughly 2020 (the Pile cut-off), so GPT-J knows nothing about the COVID-19 vaccine rollout, the 2022 Russian invasion of Ukraine, or any AI development after late 2020.

For English text continuation in a research or hobby setting GPT-J still works fine, but the model card's caveat stands: production use calls for careful fine-tuning, supervision, and content moderation. [3]

## Why does GPT-J matter? Influence and legacy

The deeper importance of GPT-J is structural rather than technical. The release made several things visible that had previously been argued only in principle.

It showed that a small, mostly volunteer collective with donated compute could train a model competitive with the public-API tier of the leading commercial provider. That broke the assumption that frontier-adjacent capability required a corporate budget, and it gave academic researchers a model they could actually open up and study. The Pythia model suite, released by EleutherAI in 2023, would later push that further by publishing not just final weights but every training checkpoint along the way.

It established a template for open-source LLM releases (open weights, permissive licence, transparent training data, public model card, multiple format variants on Hugging Face) that later projects largely copied. GPT-NeoX-20B (EleutherAI, February 2022), whose architecture the authors describe as "largely the same as GPT-J," followed it most directly. [10] OPT (Meta AI, May 2022), BLOOM (BigScience, July 2022), Cerebras-GPT (March 2023), Pythia (EleutherAI, April 2023), MPT (MosaicML, May 2023), [RWKV](/wiki/rwkv), and [Falcon](/wiki/falcon) all followed broadly the same playbook. Even [LLaMA](/wiki/llama) (February 2023), although Meta initially released it under a research-only licence, was clearly aimed at the same ecological niche GPT-J had been occupying.

It also seeded a generation of researchers and engineers. Many of the people who later worked on Pythia, GPT-NeoX-20B, OpenLLaMA, RedPajama, and the various Mistral and Mixtral derivatives passed through EleutherAI's Discord during the GPT-J period. The Mesh Transformer JAX codebase fed directly into later JAX-based frameworks. Ben Wang himself went on to work on commercial LLM systems after GPT-J.

Finally, GPT-J quietly normalised several architectural choices: rotary position embeddings, parallel attention/feed-forward blocks, and careful TPU/GPU sharding. Almost every high-profile open model since 2022 uses RoPE, and the parallel-block design carried over into GPT-NeoX-20B and PaLM, although many later models returned to sequential blocks.

By 2024, GPT-J was small by frontier standards and few people deployed it for new applications. As a moment in the history of [open-source AI](/wiki/open_source), however, it occupies the position equivalent to the original GPT-2 release: the first time the wider research community got a model big enough to feel real, and the foundation on which most of what came next was built.

## References

1. Wang, Ben and Komatsuzaki, Aran. "GPT-J-6B: 6B JAX-Based (Mesh) Transformer LM." June 4, 2021. https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/
2. EleutherAI. "GPT-J" model artefact page. https://www.eleuther.ai/artifacts/gpt-j
3. EleutherAI/gpt-j-6B model card. Hugging Face. https://huggingface.co/EleutherAI/gpt-j-6b
4. Wang, Ben. kingoflolz/mesh-transformer-jax GitHub repository. 2021. https://github.com/kingoflolz/mesh-transformer-jax
5. Gao, Leo et al. "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." arXiv:2101.00027, December 2020. https://arxiv.org/abs/2101.00027
6. "GPT-J" Wikipedia entry. https://en.wikipedia.org/wiki/GPT-J
7. "EleutherAI" Wikipedia entry. https://en.wikipedia.org/wiki/EleutherAI
8. InfoQ. "EleutherAI Open-Sources Six Billion Parameter GPT-3 Clone GPT-J." July 2021. https://www.infoq.com/news/2021/07/eleutherai-gpt-j/
9. Hugging Face transformers documentation. "GPT-J" model_doc page. https://huggingface.co/docs/transformers/model_doc/gptj
10. Black, Sid et al. "GPT-NeoX-20B: An Open-Source Autoregressive Language Model." arXiv:2204.06745, April 2022. https://arxiv.org/abs/2204.06745
11. Su, Jianlin et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864, 2021. https://arxiv.org/abs/2104.09864
12. EleutherAI Blog. "Rotary Embeddings: A Relative Revolution." https://blog.eleuther.ai/rotary-embeddings/
13. Anand, Yuvanesh et al. "GPT4All: An Ecosystem of Open Source Compressed Language Models." 2023. https://aclanthology.org/2023.nlposs-1.7.pdf
14. nomic-ai/gpt4all-j model card. Hugging Face. https://huggingface.co/nomic-ai/gpt4all-j
15. Chowdhery, Aakanksha et al. "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311, 2022. https://arxiv.org/abs/2204.02311