ESM3
Last reviewed
May 16, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 3,349 words
ESM3 (Evolutionary Scale Modeling 3) is a frontier multimodal generative language model for biology developed by EvolutionaryScale. Released on June 25, 2024 and formally published in Science in January 2025, ESM3 was the first generative model to reason jointly over the sequence, three dimensional structure, and biological function of proteins within a single neural network. The flagship 98 billion parameter version is the largest protein language model ever trained, and a smaller 1.4 billion parameter checkpoint was released with open weights under a non commercial license, making it one of the most widely accessible foundation models in computational biology.
The model attracted broad scientific attention because its training run consumed roughly 10^24 floating point operations (about one trillion teraFLOPs), more than any earlier biological model, and because the launch demonstration produced a novel green fluorescent protein, esmGFP, whose amino acid sequence is only 58 percent identical to the closest natural fluorescent protein. The authors estimated that arriving at such a divergent functional sequence through natural mutation and selection would have taken on the order of 500 million years, which became the headline figure of the launch and the title of the Science publication.
ESM3 is the third generation of the ESM family of protein language models, which began at Meta AI Research with the original ESM work in 2019 and continued with ESM-2 and ESMFold in 2022. After Meta dissolved its protein team in 2023, the lead researchers founded EvolutionaryScale as an independent public benefit company and built ESM3 from scratch. Unlike earlier ESM models, which were trained only on amino acid sequences, ESM3 ingests three discrete token tracks at once: sequence, structure, and function. The model is trained with a masked language modeling objective across all three tracks simultaneously, which lets users prompt it with any combination of the three modalities and receive completions in the others.
This multimodal design gives ESM3 a flexibility that pure structure prediction systems do not have. The same network can predict structure from sequence in the manner of ESMFold or AlphaFold, but it can also do the inverse design problem of generating a sequence that folds into a desired structure, infer plausible function annotations for an unknown protein, or generate proteins conditioned on arbitrary combinations of structural motifs and functional tags. EvolutionaryScale frames the model as a programmable platform for protein engineering rather than a single task system, which is how it is positioned alongside specialist tools such as AlphaFold 3 in the broader landscape of biological foundation models.
ESM3 is built on a transformer architecture but adds geometric attention in the first block so that atomic coordinates can be folded into the same latent space as sequence and function tokens. The architectural choices, the order of magnitude jump in training compute, and the open release of a usable checkpoint together pushed ESM3 to the center of the conversation about AI for drug discovery in 2024 and 2025.
The ESM line of work originated at Meta AI, then known as Facebook AI Research, where Alexander Rives and colleagues showed in 2019 that large transformer language models trained on amino acid sequences alone could learn rich representations of protein biology. The 2022 ESM-2 release scaled the family up to 15 billion parameters and was paired with ESMFold, a single sequence structure predictor that ran roughly 60 times faster than AlphaFold 2 on short proteins because it skipped the multiple sequence alignment step. ESM-2 and ESMFold were then used to predict structures for more than 600 million metagenomic proteins, which were released in 2022 as the ESM Metagenomic Atlas.
In the summer of 2023 Meta closed the protein team as part of a broader reorganization that refocused the lab on generative consumer AI. Rives and several colleagues spun out a new company, EvolutionaryScale, in July 2023. The startup operated in stealth for roughly a year before unveiling ESM3 and a 142 million dollar seed round on June 25, 2024. The round was led by Nat Friedman, Daniel Gross, and Lux Capital, with participation from Amazon, NVentures (the venture arm of NVIDIA), and a number of angel investors. The company is structured as a public benefit corporation, which the founders said was meant to give them flexibility to balance commercial pressure with open science commitments.
The original ESM3 preprint appeared on bioRxiv on July 1, 2024 under the title "Simulating 500 million years of evolution with a language model." After roughly six months of peer review the work was published in Science on January 17, 2025 under DOI 10.1126/science.ads0018, making it one of the most prominent publications of a frontier AI for biology system to date.
ESM3 is described in the Science paper as a multi track masked generative transformer. Three independent token tracks are fed into a shared trunk that mixes them through bidirectional self attention. Each track has its own discrete vocabulary so that all three modalities can share the same modeling objective.
The sequence track uses the 20 standard amino acid tokens plus rare and gap tokens. The structure track converts the local three dimensional environment of each residue into a discrete code using a learned VQ-VAE that quantizes backbone geometry into 4,096 structural tokens, which means that protein structures can be expressed as a token string analogous to text. The function track uses a hierarchical vocabulary built from InterPro functional annotations along with keyword and Gene Ontology tags. A separate set of secondary structure and solvent accessibility tracks is included for finer grained control during inference.
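The quantization step can be illustrated with a toy nearest neighbour codebook lookup. This is a minimal sketch, not the ESM3 tokenizer: the codebook is tiny and hand written, the 2-D feature vectors stand in for a learned encoding of each residue's local backbone geometry, and the function name `quantize` is hypothetical.

```python
import math

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    This is the discretization step of a VQ-VAE: a continuous embedding
    is replaced by the integer id of the closest code vector.
    """
    tokens = []
    for vec in features:
        distances = [math.dist(vec, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens

# Toy codebook with 4 entries (the article describes 4,096 for ESM3).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical per-residue embeddings of local backbone geometry.
residue_features = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (0.95, 1.1)]

structure_tokens = quantize(residue_features, codebook)
print(structure_tokens)  # [0, 1, 2, 3]
```

Once every residue is reduced to an integer id in this way, a structure becomes a sequence of discrete symbols that a masked language model can predict just like amino acid tokens.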
The trunk uses Pre Layer Normalization, rotary position embeddings, and SwiGLU feed forward non linearities, which are the same building blocks used in modern large language models such as Llama. The first transformer block is augmented with an SE(3) invariant geometric attention layer that conditions on raw backbone atomic coordinates when they are available, which lets ESM3 use real structures without first quantizing them into structure tokens when the user prefers to provide continuous geometry.
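As an illustration of one of these building blocks, the following is a minimal pure Python sketch of a SwiGLU feed forward layer. The weight matrices, dimensions, and helper names (`silu`, `matvec`, `swiglu_ffn`) are made up for the example; it shows the general SwiGLU formulation used in Llama-style models and is not taken from the ESM3 code.

```python
import math

def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed forward: W_down( silu(W_gate x) * (W_up x) )."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gating
    return matvec(w_down, hidden)

# Toy sizes: model dim 2, hidden dim 3. Weights are arbitrary.
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
w_down = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

y = swiglu_ffn([1.0, 0.0], w_gate, w_up, w_down)
print(y)
```

The gate branch decides, per hidden unit, how much of the linear "up" projection passes through, which is the feature that distinguishes SwiGLU from a plain two layer feed forward block.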
The training objective is a generalized masked language modeling loss applied across all three tracks at varying mask ratios, which means the model has to learn to fill in any subset of tokens given any other subset. At inference time this is what enables prompting in any combination, since training exposes the model to a wide range of randomly sampled mask patterns rather than one fixed prediction direction. EvolutionaryScale also describes a chain of thought style inference procedure in which the model generates an intermediate structural reasoning trace before producing a final sequence, conceptually similar to reasoning prompts used in language models.
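A rough sketch of how such training examples could be constructed is shown below. It assumes each track draws its own mask rate uniformly at random, which is a simplification chosen for illustration and not the noise schedule reported for ESM3. The token values and the names `mask_track` and `make_training_example` are hypothetical.

```python
import random

MASK = "<mask>"

def mask_track(tokens, mask_rate, rng):
    """Replace each token with MASK independently with probability mask_rate."""
    return [MASK if rng.random() < mask_rate else tok for tok in tokens]

def make_training_example(tracks, rng):
    """Corrupt every track with its own randomly drawn mask rate.

    Returns (corrupted, targets). Targets hold the original token at
    masked positions and None elsewhere, so a loss would only be
    computed where the model had to fill something in.
    """
    corrupted, targets = {}, {}
    for name, tokens in tracks.items():
        rate = rng.random()  # assumption: uniform mask rate per track
        noisy = mask_track(tokens, rate, rng)
        corrupted[name] = noisy
        targets[name] = [t if n == MASK else None for t, n in zip(tokens, noisy)]
    return corrupted, targets

rng = random.Random(0)
tracks = {
    "sequence": list("MKTAYIAK"),
    "structure": [17, 901, 42, 42, 3050, 8, 8, 511],  # made-up structure token ids
    "function": ["kinase"] * 8,                       # made-up annotation tag
}
corrupted, targets = make_training_example(tracks, rng)
for name in tracks:
    print(name, corrupted[name])
```

Because the mask rate varies per track and per example, some examples look like structure prediction (sequence mostly visible, structure mostly hidden) and others look like inverse folding, all under one objective.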
ESM3 is released in a family of three sizes that share architecture but differ in width, depth, and training compute.
| Model | Parameters | Transformer blocks | Hidden dim (approx) | Availability |
|---|---|---|---|---|
| ESM3-small (open) | 1.4 billion | 48 | 1,536 | Open weights on Hugging Face and GitHub under the Cambrian Non Commercial License |
| ESM3-medium | 7 billion | 96 | 2,560 | Available through the Forge API |
| ESM3-large (98B) | 98 billion | 216 | 6,144 | Available through the Forge API and via partner platforms such as AWS and NVIDIA BioNeMo |
The 1.4 billion parameter open model is officially named esm3-sm-open-v1 and is the version that most academic researchers use. The 7 billion parameter medium model is positioned for users who want a balance of cost and capability through the API. The 98 billion parameter flagship is the model used for the headline generative experiments in the Science paper and is what powers EvolutionaryScale's commercial offerings.
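The block counts and hidden dimensions in the table can be sanity checked against the stated parameter counts with a common rule of thumb for transformers: roughly 12 × d² weights per block (about 4d² for attention projections and 8d² for the feed forward layers), ignoring embeddings, norms, and any ESM3-specific modules. The sketch below applies that approximation; it is an estimate under those assumptions, not a count taken from the model code.

```python
def approx_params(num_blocks, hidden_dim):
    """Rule-of-thumb transformer size: about 12 * d^2 weights per block.

    4*d^2 for the attention projections plus roughly 8*d^2 for the
    feed forward layers; embeddings, norms, and biases are ignored.
    """
    return 12 * num_blocks * hidden_dim ** 2

# (name, blocks, hidden dim, stated parameters) taken from the table above.
models = [
    ("ESM3-small", 48, 1536, 1.4e9),
    ("ESM3-medium", 96, 2560, 7e9),
    ("ESM3-large", 216, 6144, 98e9),
]

for name, blocks, dim, stated in models:
    estimate = approx_params(blocks, dim)
    print(f"{name}: ~{estimate / 1e9:.1f}B estimated vs {stated / 1e9:.1f}B stated")
```

The estimates come out at about 1.4B, 7.5B, and 97.8B, each within roughly 10 percent of the stated sizes, which suggests the table's depth and width figures are mutually consistent.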
ESM3 was trained on a corpus assembled from public sequence and structure databases together with extensive metagenomic data, covering proteins from environments as diverse as the Amazon rainforest, the deep ocean, hydrothermal vents, and ordinary soil microbiomes. The published figures emphasize the breadth of natural diversity that the model has seen.
| Quantity | Value |
|---|---|
| Unique protein sequences | 2.78 billion |
| Protein sequences after augmentation | 3.15 billion |
| Protein structures | 236 million |
| Function annotations | 539 million |
| Total training tokens | 771 billion |
| Compute for the 98B model | 1.07 x 10^24 FLOPs (about one trillion teraflops) |
| Largest model size | 98 billion parameters |
EvolutionaryScale describes the training as "the most compute ever applied to training a biological model." The infrastructure used was provided in part by Amazon Web Services and built around NVIDIA H100 GPU clusters. The compute figure of roughly 10^24 FLOPs puts ESM3 in the same order of magnitude as contemporary general purpose large language models, which is a sharp break from the much smaller compute budgets that biology models typically received in earlier years.
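The "one trillion teraflops" phrasing is a unit conversion of the FLOP count in the table, which the short calculation below makes explicit. The sustained throughput used in the second half is a purely hypothetical round number for scale and does not describe the actual training cluster.

```python
TOTAL_FLOPS = 1.07e24      # training compute for the 98B model, from the table
FLOPS_PER_TERAFLOP = 1e12  # one teraFLOP = 10^12 floating point operations

teraflops = TOTAL_FLOPS / FLOPS_PER_TERAFLOP
print(f"{teraflops:.2e} teraFLOPs")  # 1.07e+12, i.e. about one trillion

# Hypothetical illustration: wall-clock time at a sustained 1 exaFLOP/s.
SUSTAINED_FLOPS_PER_SECOND = 1e18  # assumed throughput, not from the source
days = TOTAL_FLOPS / SUSTAINED_FLOPS_PER_SECOND / 86_400
print(f"{days:.1f} days at 1 exaFLOP/s")
```

Note that the figure counts total operations performed during training, not a processing rate.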
Because ESM3 is trained to predict masked tokens across all three tracks, the same checkpoint can perform a wide range of tasks depending on what is supplied at inference time.
| Capability | What ESM3 does | Example use case |
|---|---|---|
| Sequence prediction | Generates a plausible amino acid sequence from partial sequence, structure, or function constraints | Filling in a missing loop in a known protein |
| Structure prediction | Predicts a three dimensional structure from sequence | Single sequence structure prediction comparable to ESMFold |
| Function annotation | Predicts likely InterPro or Gene Ontology tags for an unknown protein | Annotating metagenomic dark matter |
| Inverse folding | Designs a sequence that folds into a specified structural template | Protein engineering for a target backbone |
| Conditional generation | Generates novel proteins that satisfy combinations of constraints | Designing a binder to a specified surface with a desired active site geometry |
| Atomic coordination | Solves tasks where specific residues must form a precise geometric arrangement | Engineering metal binding sites or catalytic triads |
| Chain of thought reasoning | Produces intermediate structural reasoning before final sequence output | Multi step generative design with internal scratch space |
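Several rows of the table differ only in which tracks are supplied and which are left masked. The sketch below makes that concrete with a small hypothetical data structure; `TrackPrompt` and `tracks_to_generate` are illustrative names and are not part of the EvolutionaryScale SDK, and the token ids and tags are invented.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TrackPrompt:
    """Per-residue entries for each track; None means masked (to be filled in)."""
    sequence: List[Optional[str]]
    structure: List[Optional[int]]
    function: List[Optional[str]]

def tracks_to_generate(prompt):
    """Return names of tracks that contain at least one masked position."""
    return [name for name, values in vars(prompt).items() if None in values]

n = 4
# Structure prediction: sequence given, structure masked.
fold = TrackPrompt(list("MKTA"), [None] * n, [None] * n)
# Inverse folding: structure tokens given (made-up ids), sequence masked.
inverse = TrackPrompt([None] * n, [5, 17, 17, 901], [None] * n)
# Sequence infilling: two residues missing, other tracks fully specified.
infill = TrackPrompt(["M", None, None, "A"], [5, 17, 17, 901], ["kinase"] * n)

print(tracks_to_generate(fold))     # ['structure', 'function']
print(tracks_to_generate(inverse))  # ['sequence', 'function']
print(tracks_to_generate(infill))   # ['sequence']
```

Seen this way, the capabilities are different mask patterns presented to the same model rather than separate task-specific heads.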
The most striking demonstrated capability is conditional generation of fully novel proteins. In the Science paper, ESM3 is prompted with high level constraints, such as a functional tag requesting fluorescence together with a small set of conserved residues, and the model produces candidate sequences that, once synthesized, express, fold, and fluoresce in the wet lab.
The headline experiment in the launch announcement and in the Science paper was the design of a new green fluorescent protein, esmGFP. Green fluorescent proteins are among the most studied tools in modern biology, used as visual reporters in everything from neuroscience to gene expression studies. Every known natural and engineered GFP variant shares a deeply conserved chromophore forming motif and a characteristic eleven stranded beta barrel fold.
To test whether ESM3 could go beyond paraphrasing the GFPs in its training data, the team prompted the 98 billion parameter model with a very small set of conserved chromophore residues and a high level functional tag asking for fluorescence, then ran a chain of thought style generation procedure. After in silico ranking and laboratory screening of the top candidates, a protein the authors named esmGFP folded correctly, formed the chromophore, and produced bright green fluorescence. Its sequence differed from the nearest natural GFP by 96 mutations and was only 58 percent identical to it, far more diverged than any GFP that had been engineered before.
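Percent identity figures like the one above are computed over aligned positions. The following minimal sketch shows the calculation on a toy aligned pair (not real GFP sequences), and then the back-of-envelope length implied by the two reported numbers, assuming the 96 mutations are counted over the same aligned length as the identity.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned, non-gap columns where the residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

# Toy aligned pair: 10 columns, 1 gap column skipped, 6 matches out of 9.
toy_identity = percent_identity("MSKGEELFTG", "MSRGA-LFSG")
print(round(toy_identity, 1))  # 66.7

# Implied aligned length if 96 differences correspond to 42% non-identity.
implied_length = 96 / (1 - 0.58)
print(round(implied_length))  # 229
```

The implied length of roughly 229 positions is only a consistency check on the two reported figures under the stated assumption.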
Using published estimates for the rate at which the GFP family has diversified in nature, the authors calculated that traversing this much sequence space through ordinary mutation and selection would have taken roughly 500 million years of natural evolution. The figure is approximate, since rates of molecular evolution vary, but it captured public attention because it made concrete the idea that a generative model could shortcut evolutionary time.
The esmGFP result is best understood as a proof of concept that ESM3 can produce functional proteins well outside the distribution of natural sequences while respecting the structural and chemical constraints needed for activity. The same generative procedure has since been used by EvolutionaryScale and its partners to design candidates for binding proteins, enzymes, and other targets, although most of those results remain unpublished.
ESM3 sits at the intersection of two earlier research traditions. The ESM family produced fast, sequence only models that learned biology through self supervised pretraining, while AlphaFold and AlphaFold 2 were highly accurate structure predictors that relied on multiple sequence alignments and, in AlphaFold 2, equivariant attention over protein geometry. ESM3 is closer in spirit to ESM-2 than to AlphaFold 2 in that it relies on language modeling at scale, but it borrows the idea of geometry aware attention from the AlphaFold lineage and pushes into generative territory that AlphaFold 2 was never designed for.
| Feature | ESM-2 (2022) | AlphaFold 2 (2021) | AlphaFold 3 (2024) | ESM3 (2024) |
|---|---|---|---|---|
| Developer | Meta AI | Google DeepMind | Google DeepMind, Isomorphic Labs | EvolutionaryScale |
| Modalities | Sequence only | Sequence with MSA | Sequence with MSA, plus ligands and nucleic acids | Sequence, structure, function |
| Largest version | 15 billion parameters | Fixed architecture | Fixed architecture | 98 billion parameters |
| Primary task | Representation learning, masked language modeling | Structure prediction | Structure prediction including biomolecular complexes | Multimodal generation, prediction, and design |
| Requires MSA | No | Yes | Yes | No |
| Generative | No | No | No | Yes |
| Structure accuracy | Below AlphaFold 2 | State of the art at release | Improved over AlphaFold 2, especially for complexes | Below AlphaFold 2 on monomer accuracy benchmarks, but supports tasks the AlphaFold family cannot |
| Open weights | Yes, fully open | Yes | Not released at launch; restricted access | Yes for the 1.4B small model, API only for 7B and 98B |
| License of weights | MIT | CC BY 4.0 (model parameters) | Restricted | Cambrian Non Commercial License for the open checkpoint |
ESM3 does not displace AlphaFold 2 or AlphaFold 3 for pure structure prediction. Published benchmarks show that AlphaFold 2 remains the most accurate single chain structure predictor and that AlphaFold 3 improved on it for complexes and for biomolecular interactions involving small molecules and nucleic acids. ESM3 trades some monomer accuracy for the flexibility of generative design, which is a different and largely complementary capability.
The 1.4 billion parameter ESM3 model was released with open weights on launch day and hosted on Hugging Face under the name EvolutionaryScale/esm3-sm-open-v1, with code and inference utilities on the evolutionaryscale/esm GitHub repository. EvolutionaryScale used a custom license that it initially called the ESM3 Community License Agreement and later revised under the name Cambrian Non Commercial License Agreement when releasing the related ESMC family.
The license allows use of the weights for non commercial research at universities, non profit research institutes, government laboratories, and similar organizations. It explicitly disallows hosting the model as a service, using outputs for commercial activities, and training a competing model on outputs of the released weights. Users must accept the license before downloading the weights on Hugging Face. The Cambrian revision removed an earlier restriction that excluded drug development, added an attribution and naming requirement for derived models, and clarified that fine tuned weights are considered derivative works.
The 7 billion and 98 billion parameter models are not released as weights and are accessible only through the Forge API operated by EvolutionaryScale, which offers free tier access for academic users and paid commercial access for industry. The commercial path is also offered through enterprise partners, most notably AWS and the NVIDIA BioNeMo platform.
ESM3 was positioned from launch as a platform for industrial drug discovery rather than a purely academic curiosity, and the surrounding partnerships reflected that.
NVIDIA announced on launch day that ESM3 would be optimized for inference and training through the NVIDIA BioNeMo platform and offered as a NIM microservice on NVIDIA AI Enterprise, which is the company's stack for deploying foundation models inside regulated industries. This made ESM3 available alongside other biology models in the BioNeMo catalog, including AlphaFold derivatives, MolMIM, and DiffDock. NVIDIA cited collaborations with more than 200 biotech and pharmaceutical users of BioNeMo at the time of launch, with that number growing in subsequent updates.
Amazon Web Services partnered with EvolutionaryScale to host the Forge API and to make the full ESM3 family accessible on AWS to enterprise customers, including the majority of the top ten global pharmaceutical companies. The collaboration included support for secure fine tuning on proprietary protein data without exposing the data to EvolutionaryScale.
The Chan Zuckerberg Initiative added ESM3 to its Virtual Cells Platform in 2025 as part of its catalog of open biology models. Academic groups began publishing applications quickly, including a 2025 study from the Tranos lab that used ESM3 representations to build vESM, a variant effect predictor for clinical genetics. ESM3 has also been adopted as a baseline in benchmarking papers for protein structure prediction and design.
The scientific and trade press described ESM3 in unusually large terms. Several outlets called it a "ChatGPT moment for biology," referring both to the openness of the release and to the conceptual leap of treating proteins as a multimodal generative modeling problem rather than a structure prediction problem. The 142 million dollar seed round was the largest disclosed seed financing in AI for biology at the time and helped catalyze further venture investment in the space during the second half of 2024.
The Science publication in January 2025 added formal peer reviewed weight to the launch claims, especially the esmGFP demonstration, which had initially appeared only in a preprint and a press release. Commentators noted both the achievement and the limits of the result, observing that ESM3 still lags AlphaFold 2 on monomer structure prediction and that the open 1.4 billion parameter model captures only a fraction of the capability of the 98 billion parameter flagship. The community response also raised familiar concerns about biosecurity that apply to any generative model that can design functional proteins, although the design of pathogen related proteins is restricted by the license and is not a marketed use case.
In December 2024 EvolutionaryScale released ESMC, a refreshed family of sequence only ESM models that includes a 600 million parameter ESMC-600m checkpoint with open weights. ESMC is positioned as a focused upgrade to the sequence only path that ESM-2 occupied, designed for fast embedding generation and for downstream tasks such as variant effect prediction. The company has signaled that future ESM3 generations will continue to expand the multimodal training corpus and incorporate additional biological modalities.
Through 2025 EvolutionaryScale also expanded the Forge platform with new features including conditional generation templates and fine tuning interfaces, building on the foundation laid by the initial ESM3 release.