OpenFold

Updated Jul 16, 2026

OpenFold is an open source, trainable, GPU friendly PyTorch reimplementation of AlphaFold 2, developed initially by the AlQuraishi lab at Columbia University and then governed by the OpenFold Consortium, a multi institution non profit hosted by the Open Molecular Software Foundation.^[1]^[2]^[3] First released publicly in June 2022, the project closed a gap left when Google DeepMind published only inference code and weights for AlphaFold 2 without releasing usable training code or training data, leaving the wider research community unable to retrain, fine tune, or systematically ablate the model.^[4]^[2] In May 2024 a technical paper describing the retraining of AlphaFold 2 from scratch using OpenFold and the accompanying OpenProteinSet dataset was published in Nature Methods under the title "OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization."^[1]^[5] OpenFold matches AlphaFold 2 accuracy while running roughly three to five times faster at inference on modern GPUs, and its structure module has been reused as the geometric back end of several downstream systems, including Meta AI's ESMFold protein language model.^[1]^[6]^[7]

Infobox

Field	Value
Initial release	June 2022 (code and first MSA tranche)^[4]
Technical paper	Ahdritz et al., Nature Methods, 14 May 2024 (preprint: bioRxiv, 22 Nov 2022)^[1]^[5]
Lead authors	Gustaf Ahdritz, Nazim Bouatta, Mohammed AlQuraishi (Columbia, Harvard Medical School)^[1]
Code	github.com/aqlaboratory/openfold^[3]
License (code)	Apache License 2.0^[3]
License (weights and data)	CC BY 4.0^[1]^[8]
Companion dataset	OpenProteinSet (more than 16 million MSAs, NeurIPS 2023)^[8]^[9]
Implementation	Python, PyTorch, custom CUDA kernels, DeepSpeed integration^[3]^[1]
Governance	OpenFold Consortium, hosted by the Open Molecular Software Foundation^[2]^[10]

Background and motivation

When DeepMind released AlphaFold 2 in July 2021, alongside a paper in Nature and a code repository, the release was unusual for an industrial AI system in that it included both inference source code and the trained network weights.^[1] However, several critical assets were missing. Training code was not shipped in a usable, fully reproducible form, and the curated multiple sequence alignment (MSA) inputs and self distilled training set on which the model had been trained were not made publicly available.^[1]^[4] Without these, external researchers could not retrain the network from scratch, could not run principled ablations to ask which components mattered, and could not adapt the model to tasks that required new training data such as protein ligand complexes, antibody specific datasets, or evolution free regimes.^[1]

A second, more diffuse problem was the implementation itself. AlphaFold 2 was written in JAX and tightly coupled to Tensor Processing Units that DeepMind had used internally; while researchers could run inference on commodity GPUs, scaling training to thousands of GPU hours, debugging the gradient flow, and integrating modern training systems like DeepSpeed required a different implementation. Many groups in academia and industry standardised on PyTorch and wanted an implementation native to that ecosystem so that they could mix the structure prediction model with their own networks, optimizers, and trainers.^[3]^[1]

The original AlphaFold work also raised foundational scientific questions that could only be answered with a reproducible training pipeline. Among them: how data efficient is the architecture, how sensitive is it to the diversity of its training distribution, in what order does it learn structural features, and how much of its apparent generalization is in fact memorization of training set folds rather than learned physical rules.^[1]^[11] OpenFold was designed in part to answer those questions empirically.^[1]

The project was initiated by Mohammed AlQuraishi's laboratory at Columbia University in collaboration with Nazim Bouatta at Harvard Medical School and a network of academic and industry partners. AlQuraishi's group had previously published end to end differentiable models for protein structure (recurrent geometric networks) before AlphaFold 2, and was well positioned to attempt a faithful retraining.^[4]^[1]

History and releases

Founding of the OpenFold Consortium (June 2022)

On 28 June 2022 the OpenFold AI Research Consortium was publicly announced. The founding members were the AlQuraishi laboratory at Columbia University together with the protein design and drug discovery companies Arzeda, Cyrus Biotechnology, Genentech's Prescient Design, and Outpace Bio.^[4]^[12] The consortium described its mission as the development of free, open source software tools for biology and drug discovery, and it positioned itself as a project hosted under the Open Molecular Software Foundation.^[2]^[4] The initial public release of OpenFold consisted of the trainable PyTorch implementation of AlphaFold 2 and the first 400,000 multiple sequence alignments of what would become the OpenProteinSet corpus.^[4] At that point the consortium stated that OpenFold had been trained from scratch in roughly 100,000 A100 GPU hours and that the resulting model was on par with the original DeepMind weights.^[4]

bioRxiv preprint (November 2022) and Nature Methods paper (May 2024)

A full technical preprint titled "OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization" appeared on bioRxiv on 22 November 2022 (doi: 10.1101/2022.11.20.517210), with 2207.05485 listed as the related arXiv identifier.^[12]^[1] The peer reviewed version appeared in Nature Methods on 14 May 2024 (volume 21, pages 1514 to 1524, doi: 10.1038/s41592-024-02272-z) with first author Gustaf Ahdritz and senior author Mohammed AlQuraishi.^[1]^[5] The author list spans Columbia University, Harvard University and Harvard Medical School, the NYU Courant Institute, the Flatiron Institute, Nvidia, EleutherAI, Genentech's Prescient Design group, Cyrus Bio, Outpace Bio, Arzeda, Rutgers University, the University of Illinois, Microsoft, and Stability AI, among others.^[5]

OpenProteinSet (NeurIPS 2023)

A companion paper, "OpenProteinSet: Training data for structural biology at scale," was posted to arXiv on 10 August 2023 (arXiv:2308.05326) and presented in the Datasets and Benchmarks track at NeurIPS 2023.^[9]^[13] OpenProteinSet released more than 16 million multiple sequence alignments together with associated structural homologs from the Protein Data Bank and AlphaFold 2 self distillation predictions, all under the CC BY 4.0 license.^[9]^[8] The corpus is hosted on the AWS Registry of Open Data and was, at release, the largest public training resource for MSA based protein structure prediction.^[1]^[8]

Consortium growth and OpenFold3 (2023 to 2025)

The consortium expanded substantially after its first year. In September 2023, UCB, Nvidia, and Valence Labs (a Recursion company) joined as new members.^[14] In August 2024 a further six organizations joined: Astex, Biogen, Congruence, Polaris Quantum, Psivant, and SandboxAQ.^[15] In April 2025 the consortium announced another tranche of new members including Bristol Myers Squibb, COGNANO, Lambda, Novo Nordisk, Structure Therapeutics, Tamarind, Unnatural Products, and Visterra.^[16] By that point the consortium described itself as having 24 partner organizations including six global pharma companies.^[17]^[16]

In parallel the consortium developed a successor system. OpenFold3, a fully open reproduction targeting AlphaFold 3 style biomolecular complex prediction (proteins, nucleic acids, and ligands), was released in preview form on 28 October 2025.^[17]^[18] OpenFold3 was trained on more than 300,000 publicly available experimental structures plus an OpenFold curated synthetic database of over 13 million structures, and was released under the Apache License 2.0 family of permissive licenses for both research and commercial use, in contrast to AlphaFold 3 whose code and weights are gated for non commercial academic use.^[17]^[18] OpenFold3 lives in a separate GitHub repository (aqlaboratory/openfold-3) alongside the original OpenFold codebase.^[18]

Technical details

Faithful reimplementation of AlphaFold 2

OpenFold targets a near bitwise reproduction of the AlphaFold 2 architecture as described in the 2021 Jumper et al. paper. The model takes as input a query protein sequence plus an MSA encoding evolutionary context, processes them through an "Evoformer" stack of attention layers that mix sequence and pair representations, and produces 3D coordinates via a structure module that applies invariant point attention to update per residue frames.^[1] The implementation reproduces both the monomer inference path (corresponding to AlphaFold v2.0.1) and the multimer inference path (corresponding to AlphaFold Multimer v2.3.2) and is compatible with DeepMind's published JAX parameter files as well as with OpenFold's own retrained weights.^[3]^[19]
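
Invariant point attention takes its name from the fact that its output does not depend on where the protein sits in space or how it is oriented. The sketch below is not an implementation of that layer; it only illustrates the underlying property, namely that distances between points are unchanged by a global rotation and translation. The coordinates, angle, and shift are made up for the illustration.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = point
    return (c * x - s * y, s * x + c * y, z)

def rigid_transform(point, angle, shift):
    """Apply a rotation about z followed by a translation."""
    x, y, z = rotate_z(point, angle)
    return (x + shift[0], y + shift[1], z + shift[2])

def pairwise_distances(points):
    """All unordered pairwise Euclidean distances, in a fixed order."""
    return [math.dist(a, b) for i, a in enumerate(points) for b in points[i + 1:]]

# Toy C alpha trace (arbitrary values, not a real protein).
ca_trace = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.1, 3.6, 0.0), (8.7, 4.2, 1.5)]
moved = [rigid_transform(p, angle=1.1, shift=(10.0, -4.0, 2.5)) for p in ca_trace]

before = pairwise_distances(ca_trace)
after = pairwise_distances(moved)
max_change = max(abs(a - b) for a, b in zip(before, after))
print(f"largest change in any pairwise distance: {max_change:.2e}")
```

Any quantity computed only from such distances inherits the same invariance, which is why a layer built on them does not need the input to be placed in a canonical orientation.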

PyTorch native and DeepSpeed integrated

The model is implemented in PyTorch rather than JAX. The training pipeline supports full precision, half precision (FP16) and bfloat16 training, with optional DeepSpeed integration including ZeRO stage 2 optimizer sharding. The OpenFold team developed and integrated the DeepSpeed DS4Sci_EvoformerAttention kernel through a collaboration with Microsoft's DeepSpeed4Science initiative, producing a memory efficient kernel tailored to the Evoformer block.^[3]^[1] OpenFold also supports custom CUDA attention kernels adapted from the FastFold project that perform in place attention during inference and training, using roughly four to five times less GPU memory than equivalent stock PyTorch implementations.^[3]

Speed and memory improvements over AlphaFold 2

The Nature Methods paper reports that OpenFold is between roughly three and five times faster than the original AlphaFold 2 implementation at inference for most proteins on a single NVIDIA A100 GPU.^[1] The implementation can predict structures of proteins beyond 4,000 residues on a single 40 GB A100, whereas AlphaFold 2 typically crashes due to memory pressure beyond about 2,500 residues.^[1]^[19] During training, the initial phase accepts crops of up to 1,200 residues, compared with 384 residues in AlphaFold 2, which allows larger structural context to be presented during gradient updates.^[1] OpenFold can also be combined with FlashAttention for further speed gains on shorter sequences.^[3]
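
One way to see why sequence length dominates memory use is that AlphaFold 2 style models hold a pair representation with one feature vector per residue pair. The following is a rough estimate rather than a measurement of OpenFold: it assumes 128 pair channels (the value given in the AlphaFold 2 paper) and 2 byte half precision values, and it counts a single copy of that tensor while ignoring attention intermediates and activations.

```python
def pair_representation_gib(n_residues: int, channels: int = 128, bytes_per_value: int = 2) -> float:
    """Size in GiB of one N x N x C pair tensor (illustrative estimate only)."""
    return n_residues * n_residues * channels * bytes_per_value / 2**30

# Sequence lengths mentioned in the text: training crops and inference limits.
for n in (384, 1200, 2500, 4000):
    print(f"{n:>5} residues: {pair_representation_gib(n):.2f} GiB per copy")
```

The quadratic growth means that doubling the sequence length quadruples this footprint, so kernels that avoid materialising extra copies have the largest effect on long proteins.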

Training procedure and hardware

The headline retraining experiment used 44 NVIDIA A100 GPUs (40 GB each), one protein per GPU with three way gradient accumulation, giving an effective batch size of 132 proteins. Mixed precision used bfloat16 by default with an FP16 mode available for older V100 hardware. Optimization used DeepSpeed 0.5.10 with ZeRO stage 2 and the Adam optimizer, following AlphaFold 2.^[1] The original training run for parity with AlphaFold 2 consumed on the order of 50,000 GPU hours in total.^[1] The consortium's first public announcement quoted around 100,000 A100 hours including auxiliary experiments and the fine tuning phase.^[4]
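
The batch size arithmetic above can be written out as a configuration sketch. The dictionary below uses key names from DeepSpeed's JSON configuration schema, but the specific file is hypothetical: it is assembled from the numbers in this section and is not copied from the OpenFold repository.

```python
import json

NUM_GPUS = 44             # from the text
MICRO_BATCH_PER_GPU = 1   # one protein per GPU
GRAD_ACCUM_STEPS = 3      # three way gradient accumulation

# Proteins contributing to each optimizer step.
effective_batch = NUM_GPUS * MICRO_BATCH_PER_GPU * GRAD_ACCUM_STEPS

# Hypothetical DeepSpeed style config (illustrative, not OpenFold's actual file).
ds_config = {
    "train_batch_size": effective_batch,
    "train_micro_batch_size_per_gpu": MICRO_BATCH_PER_GPU,
    "gradient_accumulation_steps": GRAD_ACCUM_STEPS,
    "bf16": {"enabled": True},          # bfloat16 mixed precision
    "zero_optimization": {"stage": 2},  # shard optimizer state and gradients
}

print(f"effective batch size: {effective_batch}")
print(json.dumps(ds_config, indent=2))
```

On hardware without bfloat16 support, the `bf16` entry would be swapped for an `fp16` entry, which corresponds to the FP16 mode mentioned for V100 GPUs.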

Training stability

During replication the authors discovered that some training runs would plateau at relatively poor accuracy (lDDT C alpha values between 0.30 and 0.35) for extended periods. They traced the issue to the FAPE loss clamping schedule in the original recipe: AlphaFold 2 clamped the per residue loss for an entire batch with 90% probability, occasionally producing a batch with no clamping at all and yielding very large gradient spikes. The OpenFold team changed the clamping to operate at the level of individual samples within a batch and reported that this substantially improved training stability and reduced time to convergence.^[1]
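
The difference between the two clamping schemes can be illustrated with a toy simulation. This is a sketch of the idea only, not the OpenFold loss code: the clamp threshold, the raw error value, and the function names are assumptions chosen for the illustration, and only the 90% probability and the batch size of 132 come from the text.

```python
import random

CLAMP_PROB = 0.9    # probability of clamping, from the text
CLAMP_VALUE = 10.0  # assumed clamp threshold (illustrative)

def clamp(error):
    return min(error, CLAMP_VALUE)

def batch_level(errors, rng):
    """One coin flip decides clamping for every sample in the batch."""
    use_clamp = rng.random() < CLAMP_PROB
    return [clamp(e) if use_clamp else e for e in errors]

def sample_level(errors, rng):
    """Each sample gets its own independent coin flip."""
    return [clamp(e) if rng.random() < CLAMP_PROB else e for e in errors]

def mean(values):
    return sum(values) / len(values)

rng = random.Random(0)
errors = [50.0] * 132  # toy batch in which every sample has a large raw error

batch_losses = [mean(batch_level(errors, rng)) for _ in range(2000)]
sample_losses = [mean(sample_level(errors, rng)) for _ in range(2000)]

print("batch level  min/max loss:", min(batch_losses), max(batch_losses))
print("sample level min/max loss:", min(sample_losses), max(sample_losses))
```

Under batch level clamping the mean loss jumps between the fully clamped and fully unclamped values, whereas per sample clamping mixes the two within every batch and keeps the mean in a much narrower band.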

OpenProteinSet

OpenProteinSet is the open data side of the project. Released in 2023, it comprises more than 16 million multiple sequence alignments paired with associated structural homologs from the Protein Data Bank and AlphaFold 2 self distillation predictions.^[9] The dataset is hosted under the CC BY 4.0 license on the AWS Registry of Open Data.^[8] In the retraining experiments, the authors used approximately 132,000 unique experimental chains (about 640,000 non unique chains including symmetry mates) from the Protein Data Bank, augmented with approximately 270,000 self distilled MSAs as predicted structure targets, sampled from a starting pool of roughly 15 million Uniclust30 MSAs filtered for diversity and depth.^[1]^[9] OpenProteinSet was, at release, the largest open MSA corpus assembled specifically as a training and benchmarking resource for structural biology machine learning, and it was presented at NeurIPS 2023 in the Datasets and Benchmarks track.^[13]^[9]

The dataset has subsequently been reused as training data for follow on structure prediction systems and for studies of how MSA quality and depth affect model accuracy. The authors explicitly framed it as supporting protein structure, function, and design tasks as well as broader large scale multimodal machine learning research over biological sequences.^[9]

Scientific findings from retraining

A central contribution of the Nature Methods paper is not the existence of OpenFold per se but what could be learned by running controlled experiments on retraining the network from scratch.^[1] These findings would have been difficult or impossible to obtain without an end to end trainable reproduction.

Data efficiency

OpenFold proved remarkably data efficient. Models trained on as few as 1,000 protein chains (roughly 0.76% of the full training set) achieved an lDDT C alpha score of 0.64 on a held out test set, a value the paper notes exceeds the median lDDT C alpha of 0.62 achieved at CASP13 by the first generation AlphaFold.^[1] Training with 10,000 chains (about 7.6% of the full data) came close to the accuracy of models trained on the full set. The authors argue that this exceptional data efficiency is what differentiates AlphaFold 2 style models from earlier protein structure predictors and that, in this regime, architecture matters more than scale of data.^[1]
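
The lDDT C alpha scores quoted in this section measure how well local distances between residues are preserved, without superimposing the two structures. Below is a simplified sketch of the metric, assuming the standard 15 Å inclusion radius and tolerance thresholds of 0.5, 1, 2, and 4 Å. It pools all residue pairs into one global score, which may differ slightly from the per residue averaging used in the paper's evaluation code.

```python
import math

def lddt_ca(pred, ref, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified global lDDT over C alpha atoms.

    pred, ref: equal length lists of (x, y, z) coordinates in angstroms.
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same number of residues")
    preserved = 0
    total = 0
    n = len(ref)
    for i in range(n):
        for j in range(i + 1, n):
            d_ref = math.dist(ref[i], ref[j])
            if d_ref >= cutoff:
                continue  # only pairs that are close in the reference are scored
            diff = abs(math.dist(pred[i], pred[j]) - d_ref)
            preserved += sum(diff < t for t in thresholds)
            total += len(thresholds)
    return preserved / total if total else 0.0

# Toy reference: ten residues on a straight line, 3.8 angstroms apart.
ref = [(3.8 * i, 0.0, 0.0) for i in range(10)]
shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in ref]
stretched = [(2.0 * x, y, z) for x, y, z in ref]

print("identical:", lddt_ca(ref, ref))
print("rigidly shifted:", lddt_ca(shifted, ref))
print("stretched:", round(lddt_ca(stretched, ref), 3))
```

Because only internal distances are compared, a rigidly shifted copy of the reference scores the same as the reference itself, while a distorted copy is penalised.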

Early stopping signature

A second striking observation is that approximately 90% of final model accuracy is achieved within roughly the first 3% of training compute. Concretely, the model reached about 90% of its eventual accuracy in around 1,500 GPU hours of the roughly 50,000 GPU hours total, and around 95% within 2,500 GPU hours.^[1] The remaining compute mostly polishes the model: the long tail of training improves but does not transform structural accuracy. This early stopping behaviour is operationally important because it enables rapid exploration of model variants and ablations at modest cost.^[1]
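
The compute shares follow directly from the figures quoted above. A short worked calculation, using only the approximate GPU hour numbers from this section:

```python
TOTAL_GPU_HOURS = 50_000  # approximate full training run, from the text

# GPU hours needed to reach a given share of final accuracy, from the text.
milestones = {0.90: 1_500, 0.95: 2_500}

compute_share = {acc: hours / TOTAL_GPU_HOURS for acc, hours in milestones.items()}
for acc, share in compute_share.items():
    print(f"{acc:.0%} of final accuracy after {share:.0%} of total compute")

# Number of truncated runs that fit in the budget of one full run.
short_runs_per_full_run = TOTAL_GPU_HOURS // milestones[0.90]
print(f"truncated runs per full run budget: {short_runs_per_full_run}")
```

The last figure is what makes ablation studies affordable: dozens of variants can be trained to most of their final accuracy for the cost of a single complete run.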

Robustness to training distribution

The authors performed aggressive ablations on the training distribution to ask how the model generalizes beyond the kinds of structures it has seen. They trained variants of the model on data filtered by CATH topology, architecture, and class, in some cases removing entire categories of secondary structure. Models trained on only 5% of all topologies retained an lDDT C alpha near 0.6. Models trained on a single CATH architecture (one of 42) reached a peak lDDT C alpha near 0.6. Most strikingly, models trained on proteins that were almost exclusively alpha helical achieved lDDT C alpha above 0.7 on proteins containing beta sheets, and vice versa.^[1] The implication is that the model is not simply memorising the topologies in its training set: it has learned generalizable principles of how local geometry assembles into tertiary structure.
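
A held out split of this kind can be sketched by filtering on the CATH hierarchy, whose codes read Class.Architecture.Topology.Homologous superfamily. The chain identifiers and the assignment of codes to them below are toy data, and the strict class filter is a simplification: the paper's actual filtering criteria are not reproduced here.

```python
# Top level CATH classes.
CATH_CLASS_NAMES = {
    "1": "mainly alpha",
    "2": "mainly beta",
    "3": "alpha beta",
    "4": "few secondary structures",
}

# Hypothetical chains paired with example CATH codes (toy data).
chains = [
    ("chain_a", "1.10.8.10"),
    ("chain_b", "1.20.5.170"),
    ("chain_c", "2.60.40.10"),
    ("chain_d", "3.40.50.300"),
    ("chain_e", "2.40.50.140"),
]

def cath_class(code: str) -> str:
    """Return the class digit, the first field of a CATH code."""
    return code.split(".")[0]

def split_by_class(records, train_classes):
    """Keep chains from `train_classes` for training; hold out the rest."""
    train = [r for r in records if cath_class(r[1]) in train_classes]
    held_out = [r for r in records if cath_class(r[1]) not in train_classes]
    return train, held_out

train, held_out = split_by_class(chains, {"1"})
print("train on:", [(name, CATH_CLASS_NAMES[cath_class(code)]) for name, code in train])
print("evaluate on:", [(name, CATH_CLASS_NAMES[cath_class(code)]) for name, code in held_out])
```

The same function filters at the architecture or topology level if the comparison uses the first two or three fields of the code instead of the first.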

Hierarchy of learning

By examining intermediate model checkpoints, the authors showed that OpenFold learns structural features in a specific, staggered order. In early training (roughly the first 300 gradient steps) accuracy on local 10 residue fragments improves dramatically while global structure remains poor. In the middle phase (roughly steps 300 to 1,800) tertiary structure quality grows quickly. In later training, global and secondary structure converge to their final accuracy.^[1] Within secondary structure elements, alpha helices, which are most frequent in the training distribution, are learned first; beta sheets follow; rarer secondary structure motifs come last. Helices in particular tend to be learned in an abrupt, almost phase transition like manner rather than gradually.^[1]

Implications for the "shortcut learning" debate

A line of follow on work led by Lauren Porter, Devlina Chakravarty and colleagues argued separately that AlphaFold 2's apparent success on fold switching proteins reflects memorization of training set structures rather than learned physical principles. That study reported about 35% success on fold switchers likely present in the training data, but the model captured only one of seven experimentally confirmed fold switchers outside its training set despite sampling roughly 280,000 candidate models.^[11] The OpenFold retraining experiments provide a useful counterpoint: the architecture itself appears to generalize robustly even when entire classes of structures are withheld. This suggests that the "shortcut learning" failures of AlphaFold 2 on fold switchers may say more about the training distribution and inference time sampling strategy than about a fundamental limit of the architecture.^[1]^[11]

Confidence calibration and ablation findings

The pLDDT confidence prediction head correlates with true accuracy early in training, even before the model becomes generally accurate. Templates contribute only modestly to prediction quality, especially in low data regimes. The second, fine tuning phase of the original AlphaFold training recipe (with violations losses and a richer loss head) has only a modest effect on overall structural accuracy and primarily resolves chemical constraint violations.^[1]

Variants and downstream uses

Structure module reuse

One of the most consequential downstream effects of OpenFold has been the reuse of its structure module by other groups. The structure module is the geometric back end of AlphaFold 2 style models: it takes a learned representation of the protein and produces per residue 3D frames using invariant point attention. Because OpenFold provides an open, PyTorch native, well tested implementation of this module, other systems can import it rather than reimplement it.

The most prominent example is Meta AI's ESMFold. ESMFold replaces the MSA input of AlphaFold with embeddings produced by the ESM 2 protein language model, then folds the embeddings into 3D structure using a structure module derived from OpenFold's implementation; the Hugging Face port of ESMFold relies on portions of the OpenFold library, and the [esmfold] pip installation option automatically pulls OpenFold as a dependency.^[6]^[7] This is the most public example of the structure module reuse pattern, but similar reuse occurs in academic codebases for protein design and benchmark pipelines.^[7]

SoloSeq and template free inference

OpenFold ships an MSA free monomer inference mode named SoloSeq that uses ESM 1b sequence embeddings in place of evolutionary alignments. This makes OpenFold usable in regimes where no homologs exist for the query protein, including for synthetic and designed sequences.^[3]

Multimer support

OpenFold supports multimer inference using AlphaFold Multimer v2.3.2 weights and additionally provides an "AlphaFold Gap" zero shot mode that uses monomer weights with a sequence break token to model complexes when multimer weights are not used.^[3]^[19] The gap based mode is less accurate than full AlphaFold Multimer but is useful as a baseline.^[19]
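
Gap based complex prediction with monomer weights generally works by concatenating the chains and signalling the chain break to the model. One common way to do this is a jump in the residue index at each chain boundary, so that the relative position encoding treats the chains as far apart. The sketch below shows that variant; the function name and the gap of 200 are illustrative assumptions, and whether this matches the break mechanism in OpenFold's implementation is not confirmed by the sources cited here.

```python
def gapped_residue_index(chain_lengths, gap=200):
    """Residue indices for concatenated chains with a jump at each chain break.

    chain_lengths: number of residues in each chain, in order.
    gap: extra offset added at every chain boundary (illustrative default).
    """
    index = []
    offset = 0
    for length in chain_lengths:
        index.extend(range(offset, offset + length))
        offset += length + gap  # skip ahead so the next chain looks distant
    return index

# Two toy chains of 3 and 2 residues.
print(gapped_residue_index([3, 2]))
```

A single chain is left unchanged by this function, so the same input path serves both monomers and gap based complexes.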

OpenFold in downstream protein design pipelines

OpenFold's PyTorch implementation and its retrained weights have been used as a validation step in generative protein design pipelines. Practitioners commonly use RFdiffusion or similar diffusion model generators to design sequences and then fold the candidate sequences using OpenFold or AlphaFold 2 to check that the predicted structure agrees with the design target.^[20] This pattern has been documented in tutorials by groups including the Institute for Protein Design and in academic write ups of the design and validation loop.^[20]

OpenFold3

OpenFold3, released in preview on 28 October 2025, extends the project from a monomer and multimer AlphaFold 2 reproduction to a full biomolecular foundation model covering proteins, nucleic acids, and small molecule ligands, in line with AlphaFold 3. The system is released under permissive open source terms and is the consortium's flagship successor effort.^[17]^[18]

Comparison with other protein structure prediction systems

System	Year (initial release)	Trainable code release	Inputs	Implementation	License (code/weights)
AlphaFold 2 (DeepMind)	2021	Inference only; training code partial^[1]	Sequence + MSA + templates	JAX	Apache 2.0 (code), CC BY 4.0 (weights since Jan 2022)^[3]
OpenFold (AlQuraishi lab, OpenFold Consortium)	2022	Full training and inference^[3]^[1]	Sequence + MSA (or ESM 1b embeddings via SoloSeq)	PyTorch	Apache 2.0 (code), CC BY 4.0 (weights and data)^[3]
ColabFold (Mirdita et al.)	2022	Wrapper, no retraining of core network^[21]	Sequence + MMseqs2 fast MSA	Python notebooks over AlphaFold 2 / OpenFold	MIT (front end); inherits AlphaFold license^[21]
RoseTTAFold (Baker lab)	2021	Yes, before OpenFold	Sequence + MSA + templates	PyTorch	MIT^[22]
ESMFold (Meta AI; Lin et al.)	2022	Yes; uses ESM 2 LM + OpenFold derived structure module	Sequence only (no MSA)	PyTorch	MIT^[6]^[7]
AlphaFold 3 (DeepMind / Isomorphic)	2024	Code and weights gated to academic non commercial use	Sequence + ligands + nucleic acids (diffusion)	JAX	Restricted academic^[17]
Boltz 1 (MIT Jameel Clinic)	2024	Fully open source MIT licensed reproduction of AlphaFold 3	Sequence + ligands + nucleic acids	PyTorch	MIT License^[23]

OpenFold's design choices make it the most direct open analogue to AlphaFold 2 for the monomer case: it matches accuracy, is faster, uses less memory, and provides a fully reproducible training loop. ColabFold by contrast does not retrain the underlying network but instead wraps AlphaFold 2 (and, optionally, OpenFold) with a faster MSA pipeline based on MMseqs2; combining the two has become a common practical pattern for high throughput structure prediction.^[21]^[19] RoseTTAFold is an independently developed system from the Baker lab with a three track architecture (1D sequence, 2D pair, 3D coordinates). It was released earlier than OpenFold and predates OpenProteinSet, and it is not framed primarily as an analytic reproduction of AlphaFold.^[22] ESMFold, Boltz 1, and OpenFold3 all build on the lineage that OpenFold helped to open by demonstrating that a faithful retraining of AlphaFold 2 was feasible from a public codebase and public data.^[6]^[23]^[18]

Applications

OpenFold has been applied along the same axis as AlphaFold 2: rapid prediction of protein 3D structure from sequence, including for proteins whose structures have not yet been determined experimentally. Because OpenFold is faster and more memory efficient, it is a popular choice for large scale prediction jobs and for long sequences where AlphaFold 2 fails.^[1]^[3] The trainability of OpenFold opens additional applications that are out of reach for AlphaFold 2: fine tuning on domain specific protein datasets (for example antibodies or specific enzyme families), training small student models distilled from a teacher, ablating individual components for interpretability research, and integrating the structure module into larger differentiable pipelines for design tasks.^[3]^[1] In AI drug discovery settings, OpenFold has been adopted by member companies of the consortium as a base for proprietary fine tuning and as a structure prediction back end for compound and target evaluation, with the consortium's commercial members including pharmaceutical and biotechnology firms.^[4]^[14]^[15]^[16] On the infrastructure side, OpenFold is integrated into Nvidia BioNeMo's MSA Search NIM workflow as one of the supported structure prediction back ends.^[21]

Limitations and criticisms

OpenFold inherits the scientific limits of AlphaFold 2: it predicts a single static structure per query (or an ensemble assembled by stochastic sampling), it depends critically on the depth and diversity of the input MSA for many proteins, and it does not natively model conformational dynamics, ligand binding, post translational modifications, or large complexes with high accuracy outside the multimer setting.^[1]^[19] Fold switching proteins remain a known weakness, partly because of the training distribution and the way the model is sampled at inference; this issue has been documented quantitatively in subsequent work showing that AlphaFold 2 captures only a small fraction of experimentally confirmed fold switchers outside its training set despite extensive sampling.^[11]

OpenFold's multimer support is currently the AlphaFold Multimer v2.3.2 inference path plus an "AlphaFold Gap" zero shot baseline. While useful, the gap based approach falls short of true AlphaFold Multimer accuracy when only monomer weights are available.^[19]^[3] OpenFold also did not, in its initial form, address biomolecular interaction beyond protein protein complexes; for ligand cofolding and nucleic acid prediction, the relevant successor is OpenFold3, which is at the time of writing still in preview status and not yet peer reviewed.^[17]^[18]

A broader caveat from the Nature Methods paper itself is that the demonstrated robustness of the architecture to training distribution does not imply that any specific deployed model is robust: a model trained on the canonical OpenProteinSet may still memorize folds from its training distribution. The fold switching critique is the clearest empirical example of this distinction.^[11]^[1]

Governance and license

The OpenFold codebase is distributed under the Apache License 2.0.^[3] The retrained weights and the OpenProteinSet dataset are distributed under the CC BY 4.0 license.^[1]^[8] DeepMind's original AlphaFold 2 parameters, which OpenFold can also load, were originally released under CC BY NC 4.0 and were re licensed by DeepMind to CC BY 4.0 in January 2022, removing the non commercial restriction.^[3]

The OpenFold Consortium is a non profit organization hosted as a project of the Open Molecular Software Foundation. Governance includes a Governing Board with member representation weighted by membership tier, a Technical Advisory Council of elected technical members, and an Executive Committee that provides strategic direction.^[10]^[2] The consortium has grown from five founding organizations in June 2022 to roughly two dozen partner organizations by 2025, including a mix of academic laboratories, biotech startups, drug discovery companies, large pharmaceutical firms, cloud and hardware vendors such as Nvidia, and AI research organizations.^[4]^[14]^[15]^[16]

See also

AlphaFold: the Google DeepMind system that OpenFold reproduces in trainable form.
AlphaFold 3: the 2024 DeepMind / Isomorphic Labs successor for biomolecular complexes, which OpenFold3 reproduces in open source form.
RoseTTAFold: independently developed open system for AlphaFold style structure prediction from the Baker lab.
Boltz: the open reproduction of AlphaFold 3 from the MIT Jameel Clinic, released under the MIT License.
ESM3: a later evolutionary scale generative protein language model from EvolutionaryScale, in the same family as the ESM line that powers ESMFold.
AI Drug Discovery: the broader application context for OpenFold and related models.
AI for Science: the broader research programme that OpenFold and AlphaFold sit within.

References

1. Ahdritz, Gustaf et al., "OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization", Nature Methods 21, 1514 to 1524, 2024-05-14. https://www.nature.com/articles/s41592-024-02272-z. Accessed 2026-05-20.
2. OpenFold Consortium, "Welcome to the OpenFold Consortium", openfold.ghost.io, 2022. https://openfold.ghost.io/welcome-to-the-openfold-consortium/. Accessed 2026-05-20.
3. aqlaboratory, "openfold: Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2", GitHub repository, 2022 to 2026. https://github.com/aqlaboratory/openfold. Accessed 2026-05-20.
4. BusinessWire, "Academic, Industry Leaders Form OpenFold AI Research Consortium to Develop Open Source Software Tools To Understand Biological Systems and Discover New Medicines", 2022-06-28. https://www.businesswire.com/news/home/20220628005101/en/Academic-Industry-Leaders-Form-OpenFold-AI-Research-Consortium-to-Develop-Open-Source-Software-Tools-To-Understand-Biological-Systems-and-Discover-New-Medicines. Accessed 2026-05-20.
5. Ahdritz, Gustaf et al., "OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization", PubMed Central PMC11645889, Nature Methods, 2024-05-14. https://pmc.ncbi.nlm.nih.gov/articles/PMC11645889/. Accessed 2026-05-20.
6. Lin, Zeming et al., "Evolutionary-scale prediction of atomic-level protein structure with a language model", Science 379, 1123 to 1130, 2023-03-17. https://www.science.org/doi/10.1126/science.ade2574. Accessed 2026-05-20.
7. Hugging Face, "ESM model documentation", Transformers library docs, 2024. https://huggingface.co/docs/transformers/en/model_doc/esm. Accessed 2026-05-20.
8. AWS Open Data Registry, "OpenFold dataset", Registry of Open Data on AWS, 2023. https://registry.opendata.aws/openfold/. Accessed 2026-05-20.
9. Ahdritz, Gustaf et al., "OpenProteinSet: Training data for structural biology at scale", arXiv:2308.05326, 2023-08-10. https://arxiv.org/abs/2308.05326. Accessed 2026-05-20.
10. OpenFold Consortium, "OpenFold Consortium homepage", openfold.io, 2024 to 2026. https://openfold.io/. Accessed 2026-05-20.
11. Chakravarty, Devlina et al., "AlphaFold predictions of fold-switched conformations are driven by structure memorization", Nature Communications 15, 2024. https://www.nature.com/articles/s41467-024-51801-z. Accessed 2026-05-20.
12. Ahdritz, Gustaf et al., "OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization", bioRxiv preprint 2022.11.20.517210, 2022-11-22. https://www.biorxiv.org/content/10.1101/2022.11.20.517210v1. Accessed 2026-05-20.
13. Ahdritz, Gustaf et al., "OpenProteinSet: Training data for structural biology at scale", NeurIPS 2023 Datasets and Benchmarks Track. https://proceedings.neurips.cc/paper_files/paper/2023/hash/0eb82171240776fe19da498bef3b1abe-Abstract-Datasets_and_Benchmarks.html. Accessed 2026-05-20.
14. BusinessWire, "OpenFold AI Research Consortium Welcomes Three New Members: UCB, NVIDIA and Valence Labs", 2023-09-12. https://www.businesswire.com/news/home/20230912365077/en/OpenFold-AI-Research-Consortium-Welcomes-Three-New-Members-UCB-NVIDIA-and-Valence-Labs. Accessed 2026-05-20.
15. BusinessWire, "OpenFold AI Research Consortium Welcomes Six New Members: Astex, Biogen, Congruence, Polaris Quantum, Psivant, and SandboxAQ", 2024-08-13. https://www.businesswire.com/news/home/20240813538795/en/OpenFold-AI-Research-Consortium-Welcomes-Six-New-Members-Astex-Biogen-Congruence-Polaris-Quantum-Psivant-and-SandboxAQ. Accessed 2026-05-20.
16. BusinessWire, "OpenFold AI Research Consortium Welcomes New Members Including Bristol Myers Squibb, COGNANO, Lambda, Novo Nordisk, Structure Therapeutics, Tamarind, Unnatural Products, and Visterra", 2025-04-15. https://www.businesswire.com/news/home/20250415351561/en/OpenFold-AI-Research-Consortium-Welcomes-New-Members-Including-Bristol-Myers-Squibb-COGNANO-Lambda-Novo-Nordisk-Structure-Therapeutics-Tamarind-Unnatural-Products-and-Visterra. Accessed 2026-05-20.
17. BusinessWire, "OpenFold Consortium Releases Preview of OpenFold3: An Open-Source Foundation Model for Structure Prediction of Proteins, Nucleic Acids, and Drugs", 2025-10-28. https://www.businesswire.com/news/home/20251028507233/en/OpenFold-Consortium-Releases-Preview-of-OpenFold3-An-Open-Source-Foundation-Model-for-Structure-Prediction-of-Proteins-Nucleic-Acids-and-Drugs. Accessed 2026-05-20.
18. aqlaboratory, "openfold-3: A fully open source biomolecular structure prediction model based on AlphaFold3", GitHub repository, 2025. https://github.com/aqlaboratory/openfold-3. Accessed 2026-05-20.
19. OpenFold contributors, "OpenFold documentation", openfold.readthedocs.io, 2022 to 2026. https://openfold.readthedocs.io/en/latest/original_readme.html. Accessed 2026-05-20.
20. Krishna, Rohith et al., "Generalized biomolecular modeling and design with RoseTTAFold All-Atom", Science 384, 2024. https://www.science.org/doi/10.1126/science.adl2528. Accessed 2026-05-20.
21. Mirdita, Milot et al., "ColabFold: making protein folding accessible to all", Nature Methods 19, 679 to 682, 2022. https://www.nature.com/articles/s41592-022-01488-1. Accessed 2026-05-20.
22. Baek, Minkyung et al., "Accurate prediction of protein structures and interactions using a three-track neural network (RoseTTAFold)", Science 373, 871 to 876, 2021-08-19. https://www.science.org/doi/10.1126/science.abj8754. Accessed 2026-05-20.
23. Wohlwend, Jeremy et al., "Boltz-1: Democratizing Biomolecular Interaction Modeling", bioRxiv preprint, 2024-11. https://gcorso.github.io/assets/boltz1.pdf. Accessed 2026-05-20.
