OpenFold
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,708 words
OpenFold is an open source, trainable, GPU friendly PyTorch reimplementation of AlphaFold 2, developed initially by the AlQuraishi lab at Columbia University and now governed by the OpenFold Consortium, a multi institution non profit hosted by the Open Molecular Software Foundation.[^1][^2][^3] First released publicly in June 2022, the project closed a gap left when Google DeepMind published inference code and weights for AlphaFold 2 but released neither usable training code nor the training data, leaving the wider research community unable to retrain, fine tune, or systematically ablate the model.[^4][^2] In May 2024 a technical paper describing the retraining of AlphaFold 2 from scratch using OpenFold and the accompanying OpenProteinSet dataset was published in Nature Methods under the title "OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization."[^1][^5] OpenFold matches AlphaFold 2 accuracy while running roughly three to five times faster at inference on modern GPUs, and its structure module has been reused as the geometric back end of several downstream systems, including Meta AI's ESMFold protein language model.[^1][^6][^7]
| Field | Value |
|---|---|
| Initial release | June 2022 (code and first MSA tranche)[^4] |
| Technical paper | Ahdritz et al., Nature Methods, 14 May 2024 (preprint: bioRxiv, 22 Nov 2022)[^1][^5] |
| Lead authors | Gustaf Ahdritz, Nazim Bouatta, Mohammed AlQuraishi (Columbia, Harvard Medical School)[^1] |
| Code | github.com/aqlaboratory/openfold[^3] |
| License (code) | Apache License 2.0[^3] |
| License (weights and data) | CC BY 4.0[^1][^8] |
| Companion dataset | OpenProteinSet (more than 16 million MSAs, NeurIPS 2023)[^8][^9] |
| Implementation | Python, PyTorch, custom CUDA kernels, DeepSpeed integration[^3][^1] |
| Governance | OpenFold Consortium, hosted by the Open Molecular Software Foundation[^2][^10] |
When DeepMind released AlphaFold 2 in July 2021, alongside a paper in Nature and a code repository, the release was unusual for an industrial AI system in that it included both inference source code and the trained network weights.[^1] However, several critical assets were missing. Training code was not shipped in a usable, fully reproducible form, and the curated multiple sequence alignment (MSA) inputs and self distilled training set on which the model had been trained were not made publicly available.[^1][^4] Without these, external researchers could not retrain the network from scratch, could not run principled ablations to ask which components mattered, and could not adapt the model to tasks that required new training data such as protein ligand complexes, antibody specific datasets, or evolution free regimes.[^1]
A second, more diffuse problem was the implementation itself. AlphaFold 2 was written in JAX and tightly coupled to the Tensor Processing Units that DeepMind used internally; while researchers could run inference on commodity GPUs, scaling training to thousands of GPU hours, debugging the gradient flow, and integrating modern training systems like DeepSpeed required a different implementation. Many groups in academia and industry had standardized on PyTorch and wanted an implementation native to that ecosystem so that they could mix the structure prediction model with their own networks, optimizers, and trainers.[^3][^1]
The original AlphaFold work also raised foundational scientific questions that could only be answered with a reproducible training pipeline. Among them: how data efficient is the architecture, how sensitive is it to the diversity of its training distribution, in what order does it learn structural features, and how much of its apparent generalization is in fact memorization of training set folds rather than learned physical rules.[^1][^11] OpenFold was designed in part to answer those questions empirically.[^1]
The project was initiated by Mohammed AlQuraishi's laboratory at Columbia University in collaboration with Nazim Bouatta at Harvard Medical School and a network of academic and industry partners. AlQuraishi's group had published end to end differentiable models for protein structure (recurrent geometric networks) before AlphaFold 2 appeared, and was well positioned to attempt a faithful retraining.[^4][^1]
On 28 June 2022 the OpenFold AI Research Consortium was publicly announced. The founding members were the AlQuraishi laboratory at Columbia University together with the protein design and drug discovery companies Arzeda, Cyrus Biotechnology, Genentech's Prescient Design, and Outpace Bio.[^4][^12] The consortium described its mission as the development of free, open source software tools for biology and drug discovery, and it positioned itself as a project hosted under the Open Molecular Software Foundation.[^2][^4] The initial public release of OpenFold consisted of the trainable PyTorch implementation of AlphaFold 2 and the first 400,000 multiple sequence alignments of what would become the OpenProteinSet corpus.[^4] At that point the consortium stated that OpenFold had been trained from scratch in roughly 100,000 A100 GPU hours and that the resulting model was on par with the original DeepMind weights.[^4]
A full technical preprint titled "OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization" appeared on bioRxiv on 20 November 2022 (doi: 10.1101/2022.11.20.517210), with arXiv listing 2207.05485 as the related arXiv identifier.[^11][^1] The peer reviewed version appeared in Nature Methods on 14 May 2024 (volume 21, pages 1514 to 1524, doi: 10.1038/s41592-024-02272-z) with first author Gustaf Ahdritz and senior author Mohammed AlQuraishi.[^1][^5] The author list spans Columbia University, Harvard University and Harvard Medical School, the NYU Courant Institute, the Flatiron Institute, Nvidia, EleutherAI, Genentech's Prescient Design group, Cyrus Bio, Outpace Bio, Arzeda, Rutgers University, the University of Illinois, Microsoft, and Stability AI, among others.[^5]
A companion paper, "OpenProteinSet: Training data for structural biology at scale," was posted to arXiv on 10 August 2023 (arXiv:2308.05326) and presented in the Datasets and Benchmarks track at NeurIPS 2023.[^9][^13] OpenProteinSet released more than 16 million multiple sequence alignments together with associated structural homologs from the Protein Data Bank and AlphaFold 2 self distillation predictions, all under the CC BY 4.0 license.[^9][^8] The corpus is hosted on the AWS Registry of Open Data and was, at release, the largest public training resource for MSA based protein structure prediction.[^1][^8]
The consortium expanded substantially after its first year. In September 2023, UCB, Nvidia, and Valence Labs (a Recursion company) joined as new members.[^14] In August 2024 a further six organizations joined: Astex, Biogen, Congruence, Polaris Quantum, Psivant, and SandboxAQ.[^15] In April 2025 the consortium announced another tranche of new members including Bristol Myers Squibb, COGNANO, Lambda, Novo Nordisk, Structure Therapeutics, Tamarind, Unnatural Products, and Visterra.[^16] By that point the consortium described itself as having 24 partner organizations including six global pharma companies.[^17][^16]
In parallel the consortium developed a successor system. OpenFold3, a fully open reproduction targeting AlphaFold 3 style biomolecular complex prediction (proteins, nucleic acids, and ligands), was released in preview form on 28 October 2025.[^17][^18] OpenFold3 was trained on more than 300,000 publicly available experimental structures plus an OpenFold curated synthetic database of over 13 million structures. It was released under the permissive Apache License 2.0 for both research and commercial use, in contrast to AlphaFold 3, whose code and weights are gated for non commercial academic use.[^17][^18] OpenFold3 lives in its own GitHub repository (aqlaboratory/openfold-3), separate from the original OpenFold codebase.[^18]
OpenFold targets a near bitwise reproduction of the AlphaFold 2 architecture as described in the 2021 Jumper et al. paper. The model takes as input a query protein sequence plus an MSA encoding evolutionary context, processes them through an "Evoformer" stack of attention layers that mix sequence and pair representations, and produces 3D coordinates via a structure module that applies invariant point attention to update per residue frames.[^1] The implementation reproduces both the monomer inference path (corresponding to AlphaFold v2.0.1) and the multimer inference path (corresponding to AlphaFold Multimer v2.3.2) and is compatible with DeepMind's published JAX parameter files as well as with OpenFold's own retrained weights.[^3][^19]
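The role of per residue frames can be illustrated with a small, dependency free sketch. It is not OpenFold's structure module or its invariant point attention code; it only demonstrates the property that makes such attention "invariant": distances between points placed by residue frames do not change when the same global rigid motion is applied to every frame. All function names and numeric values below are illustrative assumptions, and only rotations about the z axis are used to keep the example short.

```python
import math

def rot_z(theta):
    """3x3 rotation matrix about the z axis (illustrative; any rotation would do)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matvec(R, v):
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply_frame(frame, local_point):
    """Map a point from a residue's local frame to global coordinates: R x + t."""
    R, t = frame
    rotated = matvec(R, local_point)
    return [rotated[i] + t[i] for i in range(3)]

def compose(global_motion, frame):
    """Apply a global rigid motion (Rg, tg) to a residue frame (R, t)."""
    Rg, tg = global_motion
    R, t = frame
    new_t = [a + b for a, b in zip(matvec(Rg, t), tg)]
    return (matmul(Rg, R), new_t)

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Two hypothetical residue frames and one local point attached to each.
frame_i = (rot_z(0.3), [1.0, 2.0, 0.5])
frame_j = (rot_z(-1.1), [4.0, -1.0, 2.0])
point_i = [0.5, 0.0, 1.0]
point_j = [-0.2, 0.7, 0.3]

before = sq_dist(apply_frame(frame_i, point_i), apply_frame(frame_j, point_j))

# Move the whole "protein" rigidly and measure the same distance again.
global_motion = (rot_z(2.0), [10.0, -3.0, 7.0])
after = sq_dist(apply_frame(compose(global_motion, frame_i), point_i),
                apply_frame(compose(global_motion, frame_j), point_j))

print(math.isclose(before, after, rel_tol=1e-9))
```

Because the squared distance is unchanged, any attention weight computed from such distances is independent of where the protein sits in space, which is the design motivation for operating on frames rather than raw coordinates.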
The model is implemented in PyTorch rather than JAX. The training pipeline supports full precision, half precision (FP16) and bfloat16 training, with optional DeepSpeed integration including ZeRO stage 2 optimizer sharding. The OpenFold team developed and integrated the DeepSpeed DS4Sci_EvoformerAttention kernel through a collaboration with Microsoft's DeepSpeed4Science initiative, producing a memory efficient kernel tailored to the Evoformer block.[^3][^1] OpenFold also supports custom CUDA attention kernels adapted from the FastFold project that perform in place attention during inference and training, using roughly four to five times less GPU memory than equivalent stock PyTorch implementations.[^3]
The Nature Methods paper reports that OpenFold is between roughly three and five times faster than the original AlphaFold 2 implementation at inference for most proteins on a single NVIDIA A100 GPU.[^1] The implementation can predict structures of proteins beyond 4,000 residues on a single 40 GB A100, whereas AlphaFold 2 typically crashes due to memory pressure beyond about 2,500 residues.[^1][^19] During training, the initial phase accepts crops of up to 1,200 residues, compared with 384 residues in AlphaFold 2, which allows larger structural context to be presented during gradient updates.[^1] OpenFold can also be combined with FlashAttention for further speed gains on shorter sequences.[^3]
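A back of envelope calculation suggests why sequence length is the limiting factor for memory. The sketch below estimates the size of a single copy of the pair representation, assuming 128 channels per residue pair (the value used in the AlphaFold 2 paper) and 2 bytes per value; both are parameters of this illustration rather than measurements of OpenFold. Real memory use is considerably higher because attention intermediates and multiple activations are held at once, so the figures are a lower bound only.

```python
def pair_representation_gib(num_residues, channels=128, bytes_per_value=2):
    """Size in GiB of one N x N x channels tensor (illustrative lower bound)."""
    return num_residues ** 2 * channels * bytes_per_value / 2 ** 30

# Lengths mentioned in the text: an AlphaFold 2 style crop, the approximate
# AlphaFold 2 failure point, and the length OpenFold handles on a 40 GB A100.
for n in (384, 2500, 4000):
    print(f"{n:>5} residues: {pair_representation_gib(n):.2f} GiB per copy")

# Doubling the sequence length quadruples the footprint.
ratio = pair_representation_gib(4000) / pair_representation_gib(2000)
print(f"2000 -> 4000 residues: x{ratio:.1f}")
```

The quadratic growth is the point of the example: memory saving kernels matter most for long sequences, where each extra copy of a pair sized tensor costs gigabytes.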
The headline retraining experiment used 44 NVIDIA A100 GPUs (40 GB each), one protein per GPU with three way gradient accumulation, giving an effective batch size of 132 proteins. Mixed precision used bfloat16 by default with an FP16 mode available for older V100 hardware. Optimization used DeepSpeed 0.5.10 with ZeRO stage 2 and the Adam optimizer, as in AlphaFold 2.[^1] The original training run for parity with AlphaFold 2 consumed on the order of 50,000 GPU hours in total.[^1] The consortium's first public announcement quoted around 100,000 A100 hours, a figure that includes auxiliary experiments and the fine tuning phase.[^4]
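The batch size arithmetic and the corresponding DeepSpeed settings can be sketched as follows. The dictionary uses key names from the current DeepSpeed configuration documentation (`train_micro_batch_size_per_gpu`, `gradient_accumulation_steps`, `bf16`, `zero_optimization`); it is an illustrative fragment assembled from the numbers in the text, not the configuration file shipped with OpenFold, and older DeepSpeed releases may spell some keys differently.

```python
import json

# Values taken from the description of the retraining run.
num_gpus = 44
proteins_per_gpu = 1
grad_accum_steps = 3

# Effective batch size = GPUs x micro batch per GPU x accumulation steps.
effective_batch = num_gpus * proteins_per_gpu * grad_accum_steps
print(effective_batch)

# Hypothetical DeepSpeed config fragment matching those settings.
deepspeed_config = {
    "train_micro_batch_size_per_gpu": proteins_per_gpu,
    "gradient_accumulation_steps": grad_accum_steps,
    "bf16": {"enabled": True},          # bfloat16 mixed precision
    "zero_optimization": {"stage": 2},  # shard optimizer state and gradients
}
print(json.dumps(deepspeed_config, indent=2))
```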
During replication the authors discovered that some training runs would plateau at relatively poor accuracy (lDDT C alpha values between 0.30 and 0.35) for extended periods. They traced the issue to the FAPE loss clamping schedule in the original recipe: AlphaFold 2 clamped the per residue loss for an entire batch with 90% probability, occasionally producing a batch with no clamping at all and yielding very large gradient spikes. The OpenFold team changed the clamping to operate at the level of individual samples within a batch and reported that this substantially improved training stability and reduced time to convergence.[^1]
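The difference between the two clamping schedules can be seen in a small simulation. This is a sketch of the sampling logic only, written from the description above rather than from the OpenFold source: it draws the clamp decision either once per batch or once per sample and counts how often a batch ends up with no clamped samples at all. The function names, the batch size of 132, and the trial count are assumptions of the example.

```python
import random

def clamp_flags_per_batch(batch_size, p_clamp, rng):
    """Original recipe as described: one draw decides clamping for the whole batch."""
    clamp = rng.random() < p_clamp
    return [clamp] * batch_size

def clamp_flags_per_sample(batch_size, p_clamp, rng):
    """Revised schedule as described: each sample draws its own clamp decision."""
    return [rng.random() < p_clamp for _ in range(batch_size)]

def fraction_fully_unclamped(flag_fn, batch_size=132, p_clamp=0.9,
                             trials=10_000, seed=0):
    """Share of simulated batches in which no sample is clamped."""
    rng = random.Random(seed)
    unclamped = sum(
        1 for _ in range(trials)
        if not any(flag_fn(batch_size, p_clamp, rng))
    )
    return unclamped / trials

per_batch = fraction_fully_unclamped(clamp_flags_per_batch)
per_sample = fraction_fully_unclamped(clamp_flags_per_sample)
print(f"fully unclamped batches, per batch draw:  {per_batch:.3f}")
print(f"fully unclamped batches, per sample draw: {per_sample:.3f}")
```

With a per batch draw roughly one batch in ten is entirely unclamped, which is where the large gradient spikes would originate; with per sample draws each batch contains a stable mix of clamped and unclamped samples, so a fully unclamped batch is vanishingly unlikely.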
OpenProteinSet is the open data side of the project. Released in 2023, it comprises more than 16 million multiple sequence alignments paired with associated structural homologs from the Protein Data Bank and AlphaFold 2 self distillation predictions.[^9] The dataset is hosted under the CC BY 4.0 license on the AWS Registry of Open Data.[^8] In the retraining experiments, the authors used approximately 132,000 unique experimental chains (about 640,000 non unique chains including symmetry mates) from the Protein Data Bank, augmented with approximately 270,000 self distilled MSAs as predicted structure targets, sampled from a starting pool of roughly 15 million Uniclust30 MSAs filtered for diversity and depth.[^1][^9] OpenProteinSet was, at release, the largest open MSA corpus assembled specifically as a training and benchmarking resource for structural biology machine learning, and it was presented at NeurIPS 2023 in the Datasets and Benchmarks track.[^13][^9]
The dataset has subsequently been reused as training data for follow on structure prediction systems and for studies of how MSA quality and depth affect model accuracy. The authors explicitly framed it as a resource for protein structure, function, and design tasks as well as for broader large scale multimodal machine learning research over biological sequences.[^9]
A central contribution of the Nature Methods paper is not the existence of OpenFold per se but what could be learned by running controlled experiments on retraining the network from scratch.[^1] These findings would have been difficult or impossible to obtain without an end to end trainable reproduction.
OpenFold proved remarkably data efficient. Models trained on as few as 1,000 protein chains (roughly 0.76% of the full training set) achieved an lDDT C alpha score of 0.64 on a held out test set, a value the paper notes exceeds the median lDDT C alpha of 0.62 achieved at CASP13 by the first generation AlphaFold.[^1] Training with 10,000 chains (about 7.6% of the full data) approached near parity with the full training set. The authors argue that this exceptional data efficiency is what differentiates AlphaFold 2 style models from earlier protein structure predictors and that, in this regime, architecture matters more than scale of data.[^1]
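For readers unfamiliar with the metric, the following is a simplified sketch of an lDDT C alpha style score. It assumes the commonly cited definition (a 15 Å inclusion radius in the reference structure and distance difference thresholds of 0.5, 1, 2, and 4 Å) and averages over all residue pairs globally; it is not the evaluation code used in the paper, and details such as boundary handling and per residue averaging may differ from reference implementations. The toy coordinates are invented for illustration.

```python
import math

def lddt_ca(pred, ref, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified global lDDT over C alpha coordinates (lists of (x, y, z))."""
    n = len(ref)
    preserved = 0.0
    total = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d_ref = math.dist(ref[i], ref[j])
            if d_ref >= cutoff:
                continue  # only local pairs in the reference are scored
            diff = abs(d_ref - math.dist(pred[i], pred[j]))
            # Fraction of thresholds under which this distance is preserved.
            preserved += sum(diff < t for t in thresholds) / len(thresholds)
            total += 1
    return preserved / total if total else 0.0

# Toy example: a straight chain with 3.8 Angstrom spacing between residues.
reference = [(3.8 * i, 0.0, 0.0) for i in range(10)]
stretched = [(7.6 * i, 0.0, 0.0) for i in range(10)]  # deliberately wrong model

perfect_score = lddt_ca(reference, reference)
stretched_score = lddt_ca(stretched, reference)
print(perfect_score, stretched_score)
```

Because the score depends only on internal distances, it needs no superposition of the predicted and reference structures, which is one reason it is convenient for tracking accuracy across many training checkpoints.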
A second striking observation is that approximately 90% of final model accuracy is achieved within roughly the first 3% of training compute. Concretely, the model reached about 90% of its eventual accuracy in around 1,500 GPU hours of the roughly 50,000 GPU hours total, and around 95% within 2,500 GPU hours.[^1] The remaining compute mostly polishes the model: the long tail of training improves but does not transform structural accuracy. This fast early convergence is operationally important because it enables rapid exploration of model variants and ablations at modest cost.[^1]
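The compute shares quoted above follow directly from the GPU hour figures, as this short check shows. The numbers are the approximate values given in the text; the variable names are illustrative.

```python
total_gpu_hours = 50_000

# Approximate GPU hours needed to reach a given share of final accuracy.
milestones = {0.90: 1_500, 0.95: 2_500}

compute_share = {
    accuracy: hours / total_gpu_hours
    for accuracy, hours in milestones.items()
}

for accuracy, share in compute_share.items():
    print(f"{accuracy:.0%} of final accuracy after {share:.0%} of total compute")
```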
The authors performed aggressive ablations on the training distribution to ask how the model generalizes beyond the kinds of structures it has seen. They trained variants of the model on data filtered by CATH topology, architecture, and class, in some cases removing entire categories of secondary structure. Models trained on only 5% of all topologies retained an lDDT C alpha near 0.6. Models trained on a single CATH architecture (one of 42) reached a peak lDDT C alpha near 0.6. Most strikingly, models trained on proteins that were almost exclusively alpha helical achieved lDDT C alpha above 0.7 on proteins containing beta sheets, and vice versa.[^1] The implication is that the model is not simply memorizing the topologies in its training set: it has learned generalizable principles of how local geometry assembles into tertiary structure.
By examining intermediate model checkpoints, the authors showed that OpenFold learns structural features in a specific, staggered order. In early training (roughly the first 300 gradient steps) accuracy on local 10 residue fragments improves dramatically while global structure remains poor. In the middle phase (roughly steps 300 to 1,800) tertiary structure quality grows quickly. In later training, global and secondary structure converge to their final accuracy.[^1] Within secondary structure elements, alpha helices, which are most frequent in the training distribution, are learned first; beta sheets follow; rarer secondary structure motifs come last. Helices in particular tend to be learned in an abrupt, almost phase transition like manner rather than gradually.[^1]
A line of follow on work led by Lauren Porter, Devlina Chakravarty and colleagues argued separately that AlphaFold 2's apparent success on fold switching proteins reflects memorization of training set structures rather than learned physical principles, achieving about 35% success on fold switchers likely present in its training data but capturing only one of seven experimentally confirmed fold switchers outside its training set despite sampling roughly 280,000 candidate models.[^11] The OpenFold retraining experiments provide a useful counterpoint: the architecture itself appears to generalize robustly even when entire classes of structures are withheld, suggesting that the "shortcut learning" failures of AlphaFold 2 on fold switchers may say more about the training distribution and inference time sampling strategy than about a fundamental limit of the architecture.[^1][^11]
The pLDDT confidence prediction head correlates with true accuracy early in training, even before the model becomes generally accurate. Templates contribute only modestly to prediction quality, especially in low data regimes. The second, fine tuning phase of the original AlphaFold training recipe (with violations losses and a richer loss head) has only a modest effect on overall structural accuracy and primarily resolves chemical constraint violations.[^1]
One of the most consequential downstream effects of OpenFold has been the reuse of its structure module by other groups. The structure module is the geometric back end of AlphaFold 2 style models: it takes a learned representation of the protein and produces per residue 3D frames using invariant point attention. Because OpenFold provides an open, PyTorch native, well tested implementation of this module, other systems can simply import it rather than reimplement.
The most prominent example is Meta AI's ESMFold. ESMFold replaces the MSA input of AlphaFold with embeddings produced by the ESM 2 protein language model, then folds the embeddings into 3D structure using a structure module derived from OpenFold's implementation; the Hugging Face port of ESMFold relies on portions of the OpenFold library, and the [esmfold] pip installation option requires OpenFold as a dependency.[^6][^7] This is the most public example of the structure module reuse pattern, but similar reuse occurs in academic codebases for protein design and benchmark pipelines.[^7]
OpenFold ships an MSA free monomer inference mode named SoloSeq that uses ESM 1b sequence embeddings in place of evolutionary alignments. This makes OpenFold usable in regimes where no homologs exist for the query protein, including for synthetic and designed sequences.[^3]
OpenFold supports multimer inference using AlphaFold Multimer v2.3.2 weights and additionally provides an "AlphaFold Gap" zero shot mode that uses monomer weights with a sequence break token to model complexes when multimer weights are not used.[^3][^19] The gap based mode is less accurate than full AlphaFold Multimer but is useful as a baseline.[^19]
OpenFold's PyTorch implementation and its retrained weights have been used as a validation step in generative protein design pipelines. Practitioners commonly use RFdiffusion or similar diffusion model generators to design sequences and then fold the candidate sequences using OpenFold or AlphaFold 2 to check that the predicted structure agrees with the design target.[^20] This pattern has been documented in tutorials by groups including the Institute for Protein Design and in academic write ups of the design and validation loop.[^20]
OpenFold3, released in preview on 28 October 2025, extends the project from a monomer and multimer AlphaFold 2 reproduction to a full biomolecular foundation model covering proteins, nucleic acids, and small molecule ligands, in line with AlphaFold 3. The system is released under permissive open source terms and is the consortium's flagship successor effort.[^17][^18]
| System | Year (initial release) | Trainable code release | Inputs | Implementation | License (code/weights) |
|---|---|---|---|---|---|
| AlphaFold 2 (DeepMind) | 2021 | Inference only; training code partial[^1] | Sequence + MSA + templates | JAX | Apache 2.0 (code), CC BY 4.0 (weights since Jan 2022)[^3] |
| OpenFold (AlQuraishi lab, OpenFold Consortium) | 2022 | Full training and inference[^3][^1] | Sequence + MSA (or ESM 1b embeddings via SoloSeq) | PyTorch | Apache 2.0 (code), CC BY 4.0 (data)[^3] |
| ColabFold (Mirdita et al.) | 2022 | Wrapper, no retraining of core network[^21] | Sequence + MMseqs2 fast MSA | Python notebooks over AlphaFold 2 / OpenFold | MIT (front end); inherits AlphaFold license[^21] |
| RoseTTAFold (Baker lab) | 2021 | Yes, before OpenFold | Sequence + MSA + templates | PyTorch | MIT[^22] |
| ESMFold (Meta AI; Lin et al.) | 2022 | Yes; uses ESM 2 LM + OpenFold derived structure module | Sequence only (no MSA) | PyTorch | MIT[^6][^7] |
| AlphaFold 3 (DeepMind / Isomorphic) | 2024 | Code and weights gated to academic non commercial use | Sequence + ligands + nucleic acids (diffusion) | JAX | Restricted academic[^17] |
| Boltz 1 (MIT Jameel Clinic) | 2024 | Fully open source MIT licensed reproduction of AlphaFold 3 | Sequence + ligands + nucleic acids | PyTorch | MIT License[^23] |
| OpenFold3 (OpenFold Consortium) | 2025 (preview) | Fully open source reproduction of AlphaFold 3 style cofolding | Sequence + ligands + nucleic acids | PyTorch | Apache License 2.0[^17][^18] |
OpenFold's design choices make it the most direct open analogue to AlphaFold 2 for the monomer case: it matches accuracy, is faster, uses less memory, and provides a fully reproducible training loop. ColabFold by contrast does not retrain the underlying network but instead wraps AlphaFold 2 (and, optionally, OpenFold) with a faster MSA pipeline based on MMseqs2; combining the two has become a common practical pattern for high throughput structure prediction.[^21][^19] RoseTTAFold is an independently developed system from the Baker lab with a three track architecture (1D sequence, 2D pair, 3D coordinates). It was released earlier than OpenFold, but it predates OpenProteinSet and is not framed primarily as an analytic reproduction of AlphaFold.[^22] ESMFold, Boltz 1, and OpenFold3 all build on the lineage that OpenFold helped to open by demonstrating that a faithful retraining of AlphaFold 2 was feasible from a public codebase and public data.[^6][^23][^18]
OpenFold has been applied along the same axis as AlphaFold 2: rapid prediction of protein 3D structure from sequence, including for proteins whose structures have not yet been determined experimentally. Because OpenFold is faster and more memory efficient, it is a popular choice for large scale prediction jobs and for long sequences where AlphaFold 2 fails.[^1][^3] The trainability of OpenFold opens additional applications that are out of reach for AlphaFold 2: fine tuning on domain specific protein datasets (for example antibodies or specific enzyme families), training small student models distilled from a teacher, ablating individual components for interpretability research, and integrating the structure module into larger differentiable pipelines for design tasks.[^3][^1] In AI drug discovery settings, OpenFold has been adopted by member companies of the consortium as a base for proprietary fine tuning and as a structure prediction back end for compound and target evaluation, with the consortium's commercial members including pharmaceutical and biotechnology firms.[^4][^14][^15][^16] On the infrastructure side, OpenFold is integrated into Nvidia BioNeMo's MSA Search NIM workflow as one of the supported structure prediction back ends.[^21]
OpenFold inherits the scientific limits of AlphaFold 2: it predicts a single static structure per query (or an ensemble assembled by stochastic sampling), it depends critically on the depth and diversity of the input MSA for many proteins, and it does not natively model conformational dynamics, ligand binding, post translational modifications, or large complexes with high accuracy outside the multimer setting.[^1][^19] Fold switching proteins remain a known weakness, partly because of the training distribution and the way the model is sampled at inference; this issue has been documented quantitatively in subsequent work showing that AlphaFold 2 captures only a small fraction of experimentally confirmed fold switchers outside its training set despite extensive sampling.[^11]
OpenFold's multimer support is currently the AlphaFold Multimer v2.3.2 inference path plus an "AlphaFold Gap" zero shot baseline. While useful, the gap based approach falls short of true AlphaFold Multimer accuracy when only monomer weights are available.[^19][^3] OpenFold also did not, in its initial form, address biomolecular interaction beyond protein protein complexes; for ligand cofolding and nucleic acid prediction, the relevant successor is OpenFold3, which is at the time of writing still in preview status and not yet peer reviewed.[^17][^18]
A broader caveat from the Nature Methods paper itself is that the demonstrated robustness of the architecture to training distribution does not imply that any specific deployed model is robust: a model trained on the canonical OpenProteinSet may still memorize folds from its training distribution. The fold switching critique is the clearest empirical example of this distinction.[^11][^1]
The OpenFold codebase is distributed under the Apache License 2.0.[^3] The retrained weights and the OpenProteinSet dataset are distributed under the CC BY 4.0 license.[^1][^8] DeepMind's original AlphaFold 2 parameters, which OpenFold can also load, were originally released under CC BY NC 4.0 and were re licensed by DeepMind to CC BY 4.0 in January 2022, removing the non commercial restriction.[^3]
The OpenFold Consortium is a non profit organization hosted as a project of the Open Molecular Software Foundation. Governance includes a Governing Board with member representation weighted by membership tier, a Technical Advisory Council of elected technical members, and an Executive Committee that provides strategic direction.[^10][^2] The consortium has grown from five founding organizations in June 2022 to roughly two dozen partner organizations by 2025, including a mix of academic laboratories, biotech startups, drug discovery companies, large pharmaceutical firms, cloud and hardware vendors such as Nvidia, and AI research organizations.[^4][^14][^15][^16]