ESMFold
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,098 words
ESMFold is an end-to-end protein structure prediction model developed by Meta AI's Fundamental AI Research (FAIR) Protein Team. It infers atomic-level three-dimensional structures directly from a single amino acid sequence, using the ESM-2 protein masked language model as a backbone.[1][2] Unlike earlier high-accuracy folders such as AlphaFold 2 and RoseTTAFold, ESMFold does not perform a multiple sequence alignment (MSA) search at inference time and does not require an external sequence or template database; structural information is extracted from the embeddings produced by ESM-2, which was pretrained on tens of millions of natural protein sequences from UniRef.[1][2][3] On benchmark sets, the system outperforms single-sequence variants of AlphaFold 2 and narrows the gap to the full MSA-based pipeline on targets with shallow alignments, while running roughly 6 to 60 times faster on a comparable GPU, depending on whether MSA construction is counted.[1][4] Meta released a 15-billion-parameter ESM-2 checkpoint together with ESMFold and used ESMFold to fold more than 617 million metagenomic proteins, which were collected into the publicly accessible ESM Metagenomic Atlas.[3][5]
| Attribute | Value |
|---|---|
| Type | Single-sequence protein structure prediction model |
| Developer | Meta AI FAIR Protein Team |
| Lead authors | Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Alexander Rives, et al. |
| Backbone | ESM-2 transformer protein language model (family spans 8M to 15B parameters; released ESMFold checkpoints use the 3B model) |
| Folding head | Folding Trunk (48 blocks) plus Structure Module with Invariant Point Attention |
| Training data | UniRef50 / UR50/D (ESM-2 pretraining) and Protein Data Bank (folding head) |
| Pretraining objective | Masked language modeling on amino-acid sequences |
| Initial preprint | bioRxiv, 21 July 2022 (10.1101/2022.07.20.500902)[2] |
| Journal publication | Science 379(6637):1123-1130, 16 March 2023 (DOI 10.1126/science.ade2574)[1] |
| ESM Metagenomic Atlas launch | 1 November 2022[3] |
| Production checkpoint | facebook/esmfold_v1 on Hugging Face[6] |
| License | MIT (model code), CC BY 4.0 (Atlas data)[4][6] |
The use of unsupervised neural networks for protein sequences was pioneered by Alexander Rives and collaborators at FAIR, who in 2019 posted a preprint titled "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences," eventually published in PNAS in 2021.[7] That work introduced the ESM-1 family of transformer encoders, trained with a masked language modeling objective on UniParc-derived sequence sets, and showed that representations learned without any structural supervision encoded secondary structure, tertiary contacts and remote homology relationships.[7] The successor ESM-1b model (33 layers, 650M parameters) used the same masked objective on UniRef50 and became a widely cited baseline for protein representation learning.[4][7]
The protein structure prediction landscape was reshaped in 2020 and 2021 by DeepMind's AlphaFold 2, which combined deep evolutionary information from a multiple sequence alignment with a transformer-style "EvoFormer" trunk and an SE(3)-equivariant structure module to reach atomic accuracy at the 14th Critical Assessment of protein Structure Prediction (CASP14).[1][8] RoseTTAFold from David Baker's lab followed shortly afterward with a related three-track architecture.[8] Both models rely heavily on MSAs and templates, which are expensive to build and become unreliable for orphan sequences or rapidly evolving viral proteins.[8][9] Meta's blog post announcing the ESM Metagenomic Atlas noted that scaling traditional structure prediction to the metagenomic universe "would have taken decades."[3]
ESMFold was first announced through a bioRxiv preprint posted on 21 July 2022 by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido and Alexander Rives under the title "Language models of protein sequences at the scale of evolution enable accurate structure prediction."[2] A second version was posted in October 2022 expanding the analysis to metagenomic sequences; the title was changed to "Evolutionary-scale prediction of atomic level protein structure with a language model" for the Science submission.[2] The peer-reviewed paper appeared in Science volume 379, issue 6637, pages 1123 to 1130 on 16 March 2023, with the additional co-authors Nikita Smetanin, Robert Verkuil, Ori Kabeli and Yaniv Shmueli listed on the journal version.[1] On 1 November 2022 Meta AI's blog announced the ESM Metagenomic Atlas, releasing more than 600 million predicted structures along with the ESM-2 and ESMFold weights and code.[3]
In the weeks following the bioRxiv release, the Hugging Face team ported the ESMFold inference path into the transformers library, simplifying the original implementation (which depended on OpenFold, PyTorch Geometric and several Meta-internal utilities) into a stand-alone EsmForProteinFolding class that can be invoked through standard transformers conventions.[4][8] The Hugging Face blog and the transformers documentation became the most popular entry point for new ESMFold users, particularly in conjunction with Google Colab notebooks that wrapped a single-GPU inference loop. By the end of 2023, ESMFold had become one of the most-downloaded biology-specific models on the Hugging Face Hub and was integrated into a wide range of downstream applications.[6][8]
After Meta wound down its dedicated protein team in 2023, several core authors, including Rives, Lin and Hie, co-founded the New York startup EvolutionaryScale, which released ESM3 in June 2024 as a multimodal generative successor.[10] The Meta-published ESMFold weights remain freely available on GitHub and Hugging Face.[4][6]
ESMFold is composed of two stacked modules trained largely independently. The first is the ESM-2 transformer language model, pretrained from scratch on raw amino-acid sequences with a masked language modeling objective. The second is a "folding head" that consumes the per-residue embeddings and attention maps from the frozen or lightly fine-tuned ESM-2 stem and outputs atomic coordinates and confidence scores.[1][2] Unlike AlphaFold 2, no MSA features, templates or external homology search are used at inference time; the input is the raw single-letter amino-acid string of one chain.[1][4]
The Hugging Face transformers documentation describes the design as follows: ESMFold "relies on the token embeddings from the large pre-trained protein language model stem and does not perform a multiple sequence alignment (MSA) step at inference time, which means that ESMFold checkpoints are fully 'standalone' and do not require a database of known protein sequences and structures with associated external query tools."[4]
ESM-2 is an encoder-only Transformer with the same general topology as BERT but applied to protein sequences. The paper trained a family of six model sizes that span more than three orders of magnitude in parameters; the Meta GitHub repository and the Hugging Face transformers integration list them as follows:[4]
| Model | Parameters | Layers | Hidden dim |
|---|---|---|---|
| ESM-2 (8M) | 7.5M | 6 | 320 |
| ESM-2 (35M) | 35M | 12 | 480 |
| ESM-2 (150M) | 150M | 30 | 640 |
| ESM-2 (650M) | 650M | 33 | 1280 |
| ESM-2 (3B) | 3B | 36 | 2560 |
| ESM-2 (15B) | 15B | 48 | 5120 |
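As a rough sanity check on the table, the parameter counts can be approximated from the layer count and hidden width alone. The sketch below assumes the common rule of thumb of about 12·d² weights per transformer block (4d² for the attention projections plus 8d² for a feed-forward layer with 4x expansion) and ignores embeddings, biases and layer norms. That rule is an assumption made for illustration, not a figure from the paper, and `estimate_parameters` is a hypothetical helper.

```python
# Rough parameter estimate for an encoder-only transformer.
# Assumption (not from the source): about 12 * d^2 weights per block,
# ignoring embeddings, biases and layer norms.

ESM2_SIZES = {
    # name: (layers, hidden_dim, nominal_parameters from the table)
    "8M": (6, 320, 7.5e6),
    "35M": (12, 480, 35e6),
    "150M": (30, 640, 150e6),
    "650M": (33, 1280, 650e6),
    "3B": (36, 2560, 3e9),
    "15B": (48, 5120, 15e9),
}


def estimate_parameters(layers: int, hidden_dim: int) -> int:
    """Approximate weight count from depth and width only."""
    return 12 * layers * hidden_dim ** 2


if __name__ == "__main__":
    for name, (layers, dim, nominal) in ESM2_SIZES.items():
        estimate = estimate_parameters(layers, dim)
        print(f"ESM-2 {name}: estimated {estimate / 1e6:,.0f}M, "
              f"nominal {nominal / 1e6:,.0f}M")
```

Under this approximation every estimate lands within about 10% of the nominal size in the table, which suggests the listed depths and widths are mutually consistent.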
All variants were pretrained on UniRef50, release 2021_04 (specifically the UR50/D variant that samples sequences from UniRef100 uniformly across UniRef50 clusters), comprising roughly 65 million unique sequences and around 48 million distinct cluster representatives.[4][11] Training used random masking of 15% of residue positions with a BERT-style masked language modeling loss, in which the model must reconstruct the masked amino acids conditioned on the surrounding sequence.[11] Compared with ESM-1b, the 2022 architecture replaces learned absolute positional embeddings with rotary position embedding (RoPE), allowing longer sequences and improved length generalization.[4]
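The masking step described above can be sketched in a few lines. This is a minimal illustration, assuming every selected position is replaced by a mask token; the original BERT recipe also leaves some selected tokens unchanged or swaps them for random ones, and the text here does not say which variant ESM-2 uses. `MASK_TOKEN` and `mask_sequence` are hypothetical names.

```python
import random

MASK_TOKEN = "<mask>"  # placeholder name; the real vocabulary may differ


def mask_sequence(sequence: str, mask_fraction: float = 0.15, seed: int = 0):
    """Hide a random subset of residues and return (tokens, targets).

    `targets` maps each masked position to the residue the model must
    reconstruct; a masked-LM loss would be computed only at these positions.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK_TOKEN
    return tokens, targets


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"
    tokens, targets = mask_sequence(seq)
    print(" ".join(tokens))
    print(targets)
```

For a 40-residue sequence the sketch hides 6 positions; the model is then trained to predict the residues stored in `targets` from the unmasked context.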
Beyond raw perplexity, the central empirical finding of the ESM-2 study is that the structural information accessible from the model's representations grows steadily as the model is scaled from 8M to 15B parameters: linear probes on attention maps recover progressively sharper contact predictions, and downstream folding accuracy improves with scale, echoing the scaling behavior reported for natural-language transformers.[1][2] The 15-billion-parameter ESM-2 was the largest publicly released protein language model at the time of publication.[1][3]
The folding head consumes two tensors derived from the ESM-2 forward pass: a sequence representation of size (L, d_s) built from the final hidden states and a pairwise representation of size (L, L, d_p) initialized from the model's attention maps across heads and layers. These representations are fed into a "Folding Trunk" that consists of 48 stacked folding blocks. Each block applies axial-style sequence and pair attention plus updates that allow information to flow between the two tracks, analogous to a single-sequence simplification of the AlphaFold 2 EvoFormer that drops the MSA dimension and the row-wise gated MSA attention.[1][4][12]
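To make the two input shapes concrete, the sketch below counts the elements in a sequence representation of shape (L, d_s) and a pair representation of shape (L, L, d_p). The default widths `d_s=1024` and `d_p=128` and the 4-byte element size are illustrative assumptions, not values stated in the text above.

```python
def representation_sizes(length: int, d_s: int = 1024, d_p: int = 128):
    """Return element counts for the (L, d_s) and (L, L, d_p) tensors.

    d_s and d_p are assumed channel widths chosen for illustration.
    """
    seq_elements = length * d_s
    pair_elements = length * length * d_p
    return seq_elements, pair_elements


def megabytes(elements: int, bytes_per_element: int = 4) -> float:
    """Size of one tensor in MB, assuming 32-bit floats by default."""
    return elements * bytes_per_element / 1e6


if __name__ == "__main__":
    for length in (100, 400, 1600):
        seq_n, pair_n = representation_sizes(length)
        print(f"L={length}: sequence {megabytes(seq_n):.1f} MB, "
              f"pair {megabytes(pair_n):.1f} MB")
```

The sequence track grows linearly with chain length while the pair track grows quadratically, so doubling L quadruples the pair tensor. This is why the pair representation dominates memory for long chains.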
The trunk is run multiple times in a recycling loop, with the default ESMFold v1 configuration using up to four recycles in which the trunk's output structure and pair representation are fed back as additional inputs. After the final recycle, the pair representation is passed to a structure module based on Invariant Point Attention (IPA), the SE(3)-equivariant attention introduced by AlphaFold 2 that operates on rigid-body frames per residue and emits up to 14 atom coordinates per amino acid, covering backbone and side-chain heavy atoms.[12] The structure module also predicts a per-residue pLDDT confidence score using the same network head and confidence calibration as AlphaFold 2.[1][12]
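The control flow of recycling can be sketched independently of the network itself. In the toy example below, `toy_trunk` is a hypothetical stand-in that simply moves the previous estimate halfway toward its input; the real trunk is a learned network, and how "up to four recycles" maps onto a pass count is treated here as an assumption (`num_passes=4`).

```python
from typing import Callable, List

Trunk = Callable[[List[float], List[float]], List[float]]


def run_with_recycling(inputs: List[float], trunk: Trunk,
                       num_passes: int = 4) -> List[float]:
    """Call the trunk repeatedly, feeding each output back as extra input."""
    recycled = [0.0] * len(inputs)  # nothing to recycle on the first pass
    for _ in range(num_passes):
        recycled = trunk(inputs, recycled)
    return recycled


def toy_trunk(inputs: List[float], recycled: List[float]) -> List[float]:
    """Stand-in for the learned trunk: refine the previous estimate by
    moving it halfway toward the input values."""
    return [r + 0.5 * (x - r) for x, r in zip(inputs, recycled)]


if __name__ == "__main__":
    for passes in range(1, 5):
        print(passes, run_with_recycling([1.0], toy_trunk, num_passes=passes))
```

Each additional pass starts from the previous output rather than from scratch, which is the sense in which recycling acts as iterative refinement with shared weights.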
The published ESMFold checkpoints (v0 and v1) pair the 3-billion-parameter ESM-2 stem with the 48-block trunk; ablation experiments in the paper that train only the structure module on top of each of the six ESM-2 sizes (0 trunk blocks) demonstrate how downstream folding accuracy improves smoothly with backbone scale.[1][12]
The Folding Trunk and structure module were trained on a curated subset of the Protein Data Bank with a training cutoff of May 2020 to allow fair benchmarking against AlphaFold 2 on CASP14 targets and CAMEO assessments collected after that date.[1][9] Following AlphaFold 2 practice, the authors augmented the training set with high-confidence predicted structures (self-distillation), and the folding losses combined frame-aligned point error (FAPE), distogram, secondary-structure, masked-LM language consistency and a confidence-prediction loss.[1][2] During folding-head training the ESM-2 weights are lightly fine-tuned end-to-end with a low learning rate.[2]
The paper reports that training the folding head against PDB targets took roughly two weeks on tens of NVIDIA A100 GPUs, considerably less than the multi-month compute budget for the underlying ESM-2 language model. After the folding head converges, the authors apply an OpenFold-style refinement protocol using AMBER force-field relaxation to remove minor stereochemical violations, although this final step is optional and is omitted in the production Hugging Face inference path.[1][4]
While the Folding Trunk inherits many architectural ideas from AlphaFold 2's EvoFormer, several simplifications are notable. AlphaFold 2 operates on an (s, L, d) MSA tensor with depth s up to a few thousand, requiring row-wise gated attention with mean-pooled gating across the MSA dimension to keep memory tractable. ESMFold drops the MSA dimension entirely; the row-track becomes a single (L, d_s) sequence representation and the entire row-wise gated MSA attention is replaced by standard self-attention on a single track. The triangular multiplicative updates and triangle attention on the pair representation are retained, because they remain the most efficient way to propagate information across pairs of residues. AlphaFold 2's "extra MSA stack," which processes a much deeper but lower-dimensional MSA before the main EvoFormer, has no analog in ESMFold and is replaced by the embeddings produced by the upstream ESM-2 model.[1][4][12]
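The core of a triangular multiplicative update is a contraction over a third residue k. The sketch below shows the "outgoing edges" form as described for AlphaFold 2, reduced to a single scalar channel; it deliberately omits the layer normalization, learned projections and gating of the full operation, so it illustrates the idea rather than reproducing the ESMFold implementation.

```python
from typing import List

Matrix = List[List[float]]


def triangle_multiply_outgoing(a: Matrix, b: Matrix) -> Matrix:
    """Update for pair (i, j): sum over k of a[i][k] * b[j][k].

    Edge (i, j) is updated from the two other edges of every triangle
    (i, j, k) that leave i and j toward k. In the full operation a and b
    are gated linear projections of the pair representation.
    """
    n = len(a)
    return [
        [sum(a[i][k] * b[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(triangle_multiply_outgoing(a, b))
```

For a single channel this is the matrix product of `a` with the transpose of `b`, which costs O(L³) in the chain length but lets every pair see information from all intermediate residues in one step.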
On CAMEO assessments (194 targets in the original paper window) ESMFold reaches an average TM-score of approximately 0.83, and on the 71 publicly assessed targets of CASP14 it reaches approximately 0.68.[1][2][9] The full AlphaFold 2 pipeline with MSAs and templates achieves about 0.88 on CAMEO and 0.85 on CASP14 on the same selections, and RoseTTAFold reaches roughly 0.82 and 0.81 respectively.[9] The performance gap between ESMFold and AlphaFold 2 narrows substantially on targets with shallow MSAs, where evolutionary signal is sparse, and ESMFold matches or beats single-sequence variants of AlphaFold 2 and RoseTTAFold (which lose roughly 0.3 to 0.5 of TM-score, on its 0-to-1 scale, when their MSA input is removed).[1][2] An independent 2025 benchmark in Frontiers in Genetics reported a median TM-score of 0.95 for ESMFold versus 0.96 for AlphaFold 2 on a curated globular set, and median pLDDT of 87.4 for ESMFold versus 92.65 for AlphaFold 2.[9]
| Method | CAMEO TM-score | CASP14 TM-score | Inputs needed |
|---|---|---|---|
| AlphaFold 2 (full pipeline) | ~0.88 | ~0.85 | Sequence + MSA + templates |
| RoseTTAFold | ~0.82 | ~0.81 | Sequence + MSA |
| ESMFold (v1) | ~0.83 | ~0.68 | Sequence only |
| AlphaFold 2 (single-seq) | substantially lower | ~0.37 | Sequence only |
Numbers are approximate values reported in the Science paper and a 2025 Frontiers benchmark.[1][9]
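For readers unfamiliar with the metric, the TM-score used in these comparisons can be computed from per-residue distances between a prediction and the reference structure. The sketch below uses the standard length-dependent scale d0 = 1.24·(L − 15)^(1/3) − 1.8 and assumes the two structures are already superimposed and aligned residue by residue; full TM-score programs additionally search for the superposition that maximizes the score and special-case very short chains, both of which are omitted here.

```python
from typing import Sequence


def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale (in angstroms) for the TM-score."""
    if target_length <= 15:
        raise ValueError("this sketch omits the special case for short chains")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances: Sequence[float], target_length: int) -> float:
    """TM-score for a fixed superposition.

    `distances` holds the distance between each aligned residue pair;
    unaligned residues contribute nothing but still count in target_length.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


if __name__ == "__main__":
    length = 100
    print(f"d0 for L={length}: {tm_d0(length):.2f} A")
    print(f"uniform 2 A error: {tm_score([2.0] * length, length):.2f}")
    print(f"uniform 6 A error: {tm_score([6.0] * length, length):.2f}")
```

Because each residue contributes a value between 0 and 1 and the sum is normalized by the target length, the score is bounded by 1, and differences such as 0.85 versus 0.68 correspond to a substantial fraction of residues being placed poorly.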
The Science paper reports that on a single NVIDIA V100, ESMFold predicts the structure of a 384-residue protein in roughly 14 seconds, about 6 times faster than the AlphaFold 2 neural network alone and approximately 60 times faster than the AlphaFold 2 end-to-end pipeline including MSA construction and template search.[1][4][9] At scale, Meta reported folding more than 600 million metagenomic proteins in approximately two weeks on a cluster of about 2,000 GPUs, an operation that the team described as previously requiring years of wall-clock time with MSA-based methods.[3][5]
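The figures in this paragraph can be combined into a quick back-of-the-envelope calculation. The sketch below takes the reported values at face value (617 million structures, 2,000 GPUs, 14 days, 14 seconds, and speedups of 6x and 60x); since the source gives these only approximately, the results are rough, and the helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def gpu_seconds_per_structure(n_structures: int, n_gpus: int,
                              days: float) -> float:
    """Average GPU time spent per predicted structure."""
    return n_gpus * days * SECONDS_PER_DAY / n_structures


def implied_baseline_seconds(esmfold_seconds: float, speedup: float) -> float:
    """Runtime of the slower method implied by a reported speedup factor."""
    return esmfold_seconds * speedup


if __name__ == "__main__":
    atlas = gpu_seconds_per_structure(617_000_000, 2000, 14)
    print(f"Atlas run: about {atlas:.1f} GPU-seconds per structure")
    print(f"AlphaFold 2 network alone: about {implied_baseline_seconds(14, 6):.0f} s")
    print(f"AlphaFold 2 full pipeline: about {implied_baseline_seconds(14, 60):.0f} s")
```

The Atlas average of roughly 4 GPU-seconds per structure is taken over sequences of all lengths, so it is not directly comparable to the 14-second figure quoted for a 384-residue protein.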
ESMFold inherits AlphaFold 2's pLDDT confidence metric and its calibration tracks the empirical accuracy reasonably well: predictions with pLDDT > 70 are usually globally correct, while regions below 50 are typically disordered or low confidence.[1] In the bulk metagenomic fold, roughly one third of predicted structures have a mean pLDDT above 70, and about 13% have mean pLDDT above 90 across the full sequence.[3][5]
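In practice these thresholds are often applied as a simple filter over per-residue scores. The sketch below assumes pLDDT values on a 0 to 100 scale and uses the cut-offs mentioned above (50, 70 and 90); the band names and the `summarize` helper are hypothetical labels chosen for illustration.

```python
from statistics import mean
from typing import Dict, Sequence, Union


def confidence_band(plddt: float) -> str:
    """Map a pLDDT value (0 to 100 scale assumed) to a coarse label."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt >= 50:
        return "low"
    return "likely disordered"


def summarize(per_residue_plddt: Sequence[float]) -> Dict[str, Union[float, str]]:
    """Summarize a prediction by its mean pLDDT and low-confidence fraction."""
    scores = list(per_residue_plddt)
    average = mean(scores)
    return {
        "mean_plddt": average,
        "band": confidence_band(average),
        "fraction_below_50": sum(s < 50 for s in scores) / len(scores),
    }


if __name__ == "__main__":
    print(summarize([95, 92, 88, 45, 40, 91]))
```

Reporting the fraction of residues below 50 alongside the mean helps distinguish a uniformly mediocre prediction from a well-folded domain with a disordered tail.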
The reference open-source repository, github.com/facebookresearch/esm, distributes both ESM-2 language model weights at six sizes (8M, 35M, 150M, 650M, 3B, 15B) and two ESMFold checkpoints, esmfold_v0 (used for the original Atlas release) and esmfold_v1 (the recommended production checkpoint).[4] Both production ESMFold checkpoints use the 3B-parameter ESM-2 stem with a 48-block folding trunk and an 8-block structure module, while ablation checkpoints that train only the structure module on top of each ESM-2 size are also published for research purposes.[4][12] The repository was archived as read-only in August 2024 after the team moved to EvolutionaryScale, but weights and inference code remain downloadable under an MIT license.[4][10]
Hugging Face shipped ESMFold and ESM-2 in the transformers library in late 2022 via the EsmModel and EsmForProteinFolding classes, ported in part from the OpenFold reimplementation. The canonical production checkpoint is published at facebook/esmfold_v1, with separate repositories for each ESM-2 size (e.g. facebook/esm2_t33_650M_UR50D for the 650M variant and facebook/esm2_t48_15B_UR50D for the 15B model).[4][6] The transformers integration removed many heavyweight dependencies from the original Meta implementation, making single-GPU inference straightforward through the standard from_pretrained API. As of 2026 the facebook/esmfold_v1 model card reports roughly two million monthly downloads.[6]
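A minimal inference sketch using the transformers integration might look like the following. It follows the usage pattern shown in the transformers documentation (tokenize without special tokens, call the model, read `positions` and `plddt` from the output) but has not been tuned for memory or speed; `validate_sequence` and `fold` are hypothetical helpers, and restricting input to the 20 standard residues is a conservative choice made here rather than a requirement of the model.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids


def validate_sequence(sequence: str) -> str:
    """Normalize a one-letter sequence and reject unexpected characters."""
    cleaned = sequence.strip().upper()
    invalid = sorted(set(cleaned) - VALID_RESIDUES)
    if not cleaned or invalid:
        raise ValueError(f"unsupported residues: {invalid}")
    return cleaned


def fold(sequence: str):
    """Predict a structure for one chain; downloads the checkpoint on first use."""
    # Heavy imports are deferred so the helper above works without them.
    import torch
    from transformers import AutoTokenizer, EsmForProteinFolding

    tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
    model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1")
    model.eval()

    inputs = tokenizer([validate_sequence(sequence)],
                       return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        outputs = model(**inputs)
    # positions: predicted atom coordinates; plddt: per-residue confidence
    return outputs.positions, outputs.plddt


if __name__ == "__main__":
    print(validate_sequence("mlknvqvqlv"))
    # positions, plddt = fold("MLKNVQVQLV")  # uncomment to run the model
```

Loading the checkpoint requires several gigabytes of download and memory, so the call to `fold` is left commented out; moving the model and inputs to a GPU is the usual next step for real workloads.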
Meta deployed a hosted API at api.esmatlas.com and a web search interface at esmatlas.com on 1 November 2022, providing folding-as-a-service and search-by-sequence over the 617-million-structure database under a CC BY 4.0 license.[3][4][5] The Atlas allows users to retrieve precomputed structures, search by sequence identity using MMseqs2-based tooling, and submit short sequences for on-demand folding via the API.[3][5]
The flagship application announced alongside ESMFold was the ESM Metagenomic Atlas, a collection of more than 617 million predicted protein structures derived from MGnify-clustered metagenomic sequences from soil, ocean and host-associated microbiomes.[3][5] At the time of release, Meta's blog described it as "the largest database of high-resolution predicted protein structures"; it was roughly three times larger than other public structure databases such as the AlphaFold Protein Structure Database snapshot then available.[3] Subsequent studies have used the Atlas to identify novel folds, characterize uncharacterized protein families and expand the known protein structural space.[5][9]
Because ESMFold provides sequence-only inference, it is well suited to high-throughput screening pipelines in AI drug discovery and protein engineering campaigns where MSAs are difficult to construct, for example for designed proteins, antibodies, intrinsically disordered regions, or rapidly evolving pathogens. ESMFold is widely used as a fast structure oracle in diffusion model-based protein design loops, including the RFdiffusion family, where candidate sequences must be folded and scored thousands of times per design experiment.[9]
Independently of the folding head, the underlying ESM-2 embeddings have become a default representation for downstream protein tasks including function prediction, variant effect prediction, contact and binding-site prediction, and remote homology detection.[1][4] The medium-sized 150M and 650M ESM-2 checkpoints are frequently used as feature extractors in academic pipelines because they balance accuracy with modest GPU memory requirements.[11] Independent transfer-learning studies, including a 2025 Scientific Reports analysis, found that the medium-sized ESM-2 checkpoints often match or exceed the 15B variant on practical downstream tasks once labeled fine-tuning data is available, suggesting a saturation effect for some tasks even though raw structural information continues to improve with backbone scale.[11]
The original Lin et al. 2023 paper reported that zero-shot fitness predictions made from masked-LM log-likelihood ratios on ESM-2 (15B) correlate strongly with experimentally measured deep mutational scanning datasets, replicating and extending earlier findings from the ESM-1 line. This zero-shot capability makes ESM-2 a popular starting point for predicting the functional effects of human genetic variants and for guiding directed evolution campaigns where no labeled data is available.[1][7]
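The log-likelihood ratio behind this kind of zero-shot scoring is simple to state in code. The sketch below assumes the language model has already produced a probability distribution over amino acids at a masked position; the probabilities shown are made-up toy values, not model outputs, and `log_likelihood_ratio` is a hypothetical helper.

```python
import math
from typing import Dict


def log_likelihood_ratio(position_probs: Dict[str, float],
                         wild_type: str, mutant: str) -> float:
    """Score a substitution as log p(mutant) - log p(wild type).

    `position_probs` is the model's predicted distribution at the masked
    position. Negative scores mean the model finds the mutant residue
    less plausible than the wild-type residue in this context.
    """
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])


if __name__ == "__main__":
    # Toy distribution for one masked position (hypothetical values).
    probs = {"A": 0.50, "V": 0.25, "P": 0.01}
    for mutant in ("V", "P"):
        score = log_likelihood_ratio(probs, wild_type="A", mutant=mutant)
        print(f"A -> {mutant}: {score:.2f}")
```

Scores computed this way are typically compared with measured fitness by rank correlation across all variants in a deep mutational scan, since only their relative ordering is meaningful.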
ESMFold sits within a broader wave of foundation models applied to structural biology, including the AlphaFold family from Google DeepMind, RoseTTAFold from the Baker lab, OmegaFold from Helixon, and successor multimodal models such as ESM3 and AlphaFold 3.[9][10]
The Science paper and several follow-up evaluations note that ESMFold does not match the full AlphaFold 2 pipeline on average accuracy when high-quality MSAs are available; on CASP14 targets specifically the gap is roughly 0.17 TM-score points, which can translate into qualitatively wrong folds on hard cases.[1][9] ESMFold also produces lower confidence predictions on average than AlphaFold 2, with median pLDDT roughly five points lower on globular benchmarks.[9] The original release supports monomeric chains only and does not natively handle multimers, ligands, nucleic acids, post-translational modifications, or membrane environments; AlphaFold 3 and Chai-1 (2024) added these capabilities but use diffusion-style architectures that depart from the ESMFold design.[9]
Even with the 3-billion-parameter ESM-2 stem used by the released checkpoints, single-GPU inference on long sequences is memory limited. The Meta team initially reported that ESMFold handled proteins of up to roughly 3,000 residues on then-current data center hardware, with quadratic memory scaling in sequence length limiting longer chains.[5] Reproducibility of the headline benchmark numbers has also been the subject of community discussion: an issue thread in the official GitHub repository documents users who could not exactly reproduce the published CAMEO and CASP14 averages without the precise target lists and evaluation protocol described in the paper supplement.[4]
A more conceptual critique is that ESMFold remains a discriminative regressor that maps sequence to a single best-guess structure. It does not capture conformational ensembles, alternative folds, or unstructured-to-folded transitions, and its outputs should not be interpreted as physically sampled states.[9] Generative successors such as ESM3 and diffusion-based folders such as AlphaFold 3 explicitly address some of these limitations.[10]
There is also a continuing debate about whether single-sequence prediction can ever fully recover the information present in deep evolutionary alignments. Several benchmark studies have noted that ESMFold's accuracy advantage over MSA-based methods materializes only when MSAs are shallow or unavailable; on protein families with rich evolutionary histories, AlphaFold 2's accuracy lead remains substantial.[9] Critics have noted that the 15B parameter ESM-2 model effectively learns an internal representation of evolutionary signal during pretraining, raising the question of whether it has merely shifted the MSA bottleneck from inference time to training time rather than eliminating it. The compute resources required to train ESM-2 (15B), reported in the supplement at over 4 million A100 GPU hours, are themselves considerable, even though inference-time cost is dramatically reduced.[1][2]
Finally, the Atlas itself is a snapshot rather than a continuously updated resource. After the 2022 release Meta did not refresh the Atlas with newer metagenomic sequences, and maintenance of the esmatlas.com infrastructure transferred to the EvolutionaryScale team in 2024. Users seeking up-to-date metagenomic structures must either re-fold from updated MGnify releases or rely on third-party mirrors.[3][10]
| System | Year | Backbone | MSA needed | Approx. CASP14 TM | Speed vs AlphaFold 2 pipeline | Multimer / ligands |
|---|---|---|---|---|---|---|
| AlphaFold 2 | 2021 | EvoFormer + IPA structure module | Yes | ~0.85 | 1x (baseline) | Multimer extension (separate) |
| RoseTTAFold | 2021 | Three-track network | Yes | ~0.81 | Similar order to AlphaFold 2 | Limited |
| OmegaFold | 2022 | OmegaPLM (670M) + Geoformer | No | Lower than AlphaFold 2 with MSA, higher than AlphaFold 2 without | Faster than AlphaFold 2 | No |
| ESMFold | 2022 to 2023 | ESM-2 (3B in released checkpoints) + Folding Trunk + IPA | No | ~0.68 | ~6x faster than the AlphaFold 2 network alone, ~60x faster than the full pipeline with MSA search | Monomer only |
| ESM3 | 2024 | Multimodal generative encoder-decoder | No | n/a (generative model) | Variable | Sequence/structure/function |
| AlphaFold 3 | 2024 | Diffusion-based | Optional | Higher than AlphaFold 2 | Comparable | Multimer + ligands + nucleic acids |
Values are illustrative and drawn from the cited primary papers and benchmark studies.[1][9][10]
ESMFold demonstrated that the scaling laws established for general-purpose language models transfer to biological sequence modeling: as ESM-2 grows from 8M to 15B parameters, structural information emerges in its attention patterns at a steady rate, and the corresponding folding head improves smoothly.[1][2] By eliminating the MSA bottleneck, the model enabled the first dense survey of the structural universe of metagenomic proteins on the time scale of weeks rather than years, and helped catalyze the founding of EvolutionaryScale and the broader generative protein modeling agenda that followed in 2024.[3][10] Because of its standalone single-sequence operation and permissive MIT licensing, ESMFold also became a default piece of infrastructure for protein design pipelines that need to fold thousands of candidate sequences inside an inner loop, and AI drug discovery pipelines and self-supervised learning research alike continue to use ESMFold and ESM-2 as default baselines.[4][6]
ESMFold belongs to a family of deep learning protein structure predictors that emerged after 2018. The most direct comparison is AlphaFold 2 (DeepMind, 2021), which set the bar for accuracy by using deep MSAs. RoseTTAFold (Baker lab, 2021) used a similar approach with a three-track design. OmegaFold (Helixon, 2022) is the closest single-sequence peer to ESMFold, pairing the smaller OmegaPLM language model with a Geoformer trunk. Inside Meta's lineage, ESMFold was preceded by ESM-1, ESM-1b and ESM-MSA-1, and was succeeded by ESM3 from EvolutionaryScale (2024), which generalizes the architecture to a generative multimodal foundation model jointly trained on sequence, structure and function. AlphaFold 3 (DeepMind, 2024) and Chai-1 (2024) extend protein folding to ligands, nucleic acids and complexes using diffusion-based generative architectures.[1][9][10]