Boltz-2
Last reviewed
May 31, 2026
Sources
12 citations
Review status
Source-backed
Revision
v3 · 2,429 words
Boltz-2 is an open-source biomolecular foundation model that jointly predicts the 3D structure of biological complexes and the binding affinity between small molecules and proteins. It was released in June 2025 by a team at the Massachusetts Institute of Technology Jameel Clinic and CSAIL together with the drug-discovery company Recursion, and it builds directly on Boltz (Boltz-1), the open reproduction of AlphaFold 3 that the same group released in late 2024. What sets Boltz-2 apart from earlier work is that it does not stop at the shape of a molecular complex. It also estimates how tightly a candidate drug grips its target, and it does so fast enough to make large-scale screening practical. The authors describe it as, to their knowledge, the first deep learning model to approach the accuracy of free-energy perturbation (FEP) calculations for protein-ligand affinity while being at least 1000 times more computationally efficient. [1][2][3]
The model weights, the inference code, and the full training pipeline were published under a permissive MIT license that allows both academic and commercial use. That licensing choice matters in a field where the most capable competing system, AlphaFold 3 from Google DeepMind and Isomorphic Labs, kept its weights and training details restricted at launch. Boltz-2 inherits the open posture of Boltz-1 and pushes it into territory that had previously belonged to slow physics simulations. [3][4][5]
For most of the deep learning era of structural biology, the goal was geometry. AlphaFold and its successors learned to fold a sequence into a plausible 3D structure, and AlphaFold 3 extended that to complexes of proteins, nucleic acids, ions, and small molecules. Boltz-1 reproduced that capability in the open. These are co-folding models: given the components of a complex, they predict where every atom sits. [1][6]
Knowing the shape of a protein-ligand complex is useful, but drug discovery teams usually want a different number. They want binding affinity, a measure of how strongly a candidate molecule sticks to its target. Affinity is what separates a promising lead from a dead end, and predicting it accurately from structure alone has been hard. The established gold standard, free-energy perturbation, runs long atomistic molecular dynamics simulations to estimate the free energy of binding. FEP can be accurate, but it is slow and expensive. A single compound can take hours or days on cluster hardware and cost upward of one hundred dollars to evaluate, which makes screening large libraries impractical. [2][3][7]
Boltz-2 narrows the gap between fast structure prediction and slow, physics-based affinity estimation. It keeps the co-folding behavior of its predecessors and adds a dedicated affinity module that reads the predicted complex and outputs a binding estimate directly. The model handles proteins, DNA, RNA, and small-molecule ligands, and it produces a structure and an affinity prediction for a protein-ligand pair in roughly 20 seconds on a single GPU. The same kind of answer from FEP would take orders of magnitude longer. [1][2][3]
Boltz-2 keeps the overall shape of the AlphaFold 3 and Boltz-1 design and extends it. The technical report describes four main components: a trunk, a denoising module that generates atomic coordinates, a confidence module, and the new affinity module. The trunk processes the input, including a multiple sequence alignment for proteins, and organizes the network around a pairwise representation that aggregates evolutionary and geometric information. A diffusion-based denoiser then turns that representation into 3D coordinates, the same family of generative method used elsewhere in structural biology by tools such as RFdiffusion. [1][8]
The affinity module is the headline addition. It uses a pairwise transformer block to refine the protein-ligand and intra-ligand interactions in the predicted pose, then produces two outputs. One is a binary probability that the ligand is a genuine binder rather than a decoy, which is useful for separating real hits from noise during screening. The other is a continuous affinity value reported on a log scale, expressed as log10 of the IC50 in micromolar units, where lower numbers mean tighter binding. [1][9]
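The log-scale output can be converted back into more familiar units with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical and are not part of the Boltz-2 package, and the free-energy conversion assumes that IC50 can stand in for a dissociation constant at 298.15 K, which holds only loosely for real assays.

```python
import math

# Gas constant in kcal/(mol*K); the temperature is an assumed 25 degrees C.
R_KCAL = 1.987204e-3
TEMP_K = 298.15


def ic50_micromolar(log10_ic50_um: float) -> float:
    """Undo the log10 scale and return IC50 in micromolar."""
    return 10.0 ** log10_ic50_um


def pic50(log10_ic50_um: float) -> float:
    """pIC50 = -log10(IC50 in molar) = 6 - log10(IC50 in micromolar)."""
    return 6.0 - log10_ic50_um


def approx_delta_g_kcal(log10_ic50_um: float) -> float:
    """Rough binding free energy in kcal/mol, assuming IC50 ~ Kd."""
    log10_molar = log10_ic50_um - 6.0  # micromolar -> molar on a log scale
    return R_KCAL * TEMP_K * math.log(10.0) * log10_molar


if __name__ == "__main__":
    # Three example outputs: 10 uM, 1 uM, and 1 nM binders.
    for value in (1.0, 0.0, -3.0):
        print(value, ic50_micromolar(value), pic50(value),
              round(approx_delta_g_kcal(value), 2))
```

Lower log values map to smaller IC50, higher pIC50, and more negative approximate free energy, which is the sense in which lower numbers mean tighter binding.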
Training happened in phases. The model first learned structure, then confidence, then affinity. The data mix was broad. It drew on experimental structures from the Protein Data Bank, on thousands of short molecular dynamics simulations that taught the model something about how molecules move rather than how they sit in a single crystal pose, on distillation data from predicted structures, and on roughly five million binding affinity measurements aggregated and batch-corrected from the experimental literature. That affinity corpus is what makes the affinity head possible, since public structural data alone does not carry enough labeled binding strengths. The heavy training and validation ran on Recursion's NVIDIA-accelerated BioHive-2 supercomputer, with NVIDIA also distributing the model through its inference catalog. [1][2][3]
Boltz-2 also adds controllability features that the earlier open models lacked. Users can condition structure prediction on the experimental method, supply distance or pocket constraints to steer the pose, and pass multi-chain templates. These give practitioners more control over what the model produces rather than leaving it to guess. [1]
Boltz-2 came out of a partnership that paired academic method development with industrial compute and data. The model was built in Regina Barzilay's group at MIT CSAIL and the Jameel Clinic, with Tommi Jaakkola also leading on the MIT side and a core team that included Saro Passaro, Gabriele Corso, and Jeremy Wohlwend, several of whom had worked on Boltz-1. Recursion, a public TechBio company, contributed AI research effort along with the training infrastructure, and additional authors came from Valence Labs and ETH Zurich. [1][2][3]
The compute side ran on BioHive-2, Recursion's NVIDIA-accelerated supercomputer, which handled the heavy training and validation runs. The result was released as an open project rather than a product locked behind a company. Barzilay framed the motivation around a long-standing bottleneck, saying the model helps scientists ask questions about binding they could not ask before. Najat Khan, Recursion's chief research and development officer and chief commercial officer, described it as a tool that lets teams triage compounds more effectively and concentrate their resources on the most promising ones. The model is installable from the Python Package Index and is also distributed through NVIDIA's hosted inference catalog, which lowers the barrier for groups that do not want to manage their own GPUs. [2][3][7]
The central claim of Boltz-2 is that a learned model can get close to physics-based accuracy at a tiny fraction of the cost. On the widely used FEP+ benchmark, which tests how well a method ranks related compounds by potency, Boltz-2 reaches an average Pearson correlation of about 0.66, ahead of the cheaper physical methods and deep learning baselines the authors compared against. On the OpenFE subset it lands near 0.62, roughly comparable to the open-source OpenFE pipeline itself, while running more than 1000 times faster. Commercial FEP+ still scores higher, around 0.78, so the model narrows the gap to physics rather than closing it. On the harder, blind CASP16 affinity challenge, which covered 140 protein-ligand pairs across two targets, Boltz-2 ran out of the box with no fine-tuning or input curation and outperformed all of the participating methods. [2][3][9][10]
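For readers unfamiliar with the metric, the sketch below shows how a Pearson correlation between predicted and measured affinities can be computed, and how per-target values might be averaged. The helper names and the numbers are made up for illustration, and averaging one correlation per target is an assumption about how an "average Pearson" is formed, not a description of the authors' evaluation code.

```python
import math
from typing import Dict, Sequence, Tuple


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def mean_per_target_r(
    series: Dict[str, Tuple[Sequence[float], Sequence[float]]]
) -> float:
    """Average the per-target correlations (assumed aggregation scheme)."""
    values = [pearson_r(pred, meas) for pred, meas in series.values()]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical predicted vs. measured log-affinities for two targets.
    toy = {
        "target_a": ([-1.0, 0.2, 0.9, 1.5], [-0.8, 0.0, 1.2, 1.1]),
        "target_b": ([0.5, 0.1, -0.7], [0.4, 0.3, -0.2]),
    }
    print(round(mean_per_target_r(toy), 3))
```

Because the correlation is computed within each congeneric series, it rewards getting the relative order and spacing of related compounds right rather than the absolute values.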
The model was also tested on the kind of task that screening teams actually run. On a hit-discovery benchmark drawn from MF-PCBA assay data, where the job is to surface real binders out of a large pool of mostly inactive molecules, Boltz-2 retrieved close to twice as many true binders among its top-ranked picks as the next-best baseline, roughly doubling average precision over machine learning and docking methods. Separately, when the model was conditioned on molecular dynamics data, it reproduced measures of local protein flexibility such as root-mean-square fluctuation about as well as specialized dynamics models like AlphaFlow and BioEmu, which is notable for a system whose main job is single-structure prediction. [1][3][9]
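Average precision, the metric behind the hit-discovery comparison, can be sketched in a few lines. The function below is a generic textbook formulation with hypothetical names and toy data, not the benchmark's own scoring script.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], is_binder: Sequence[bool]) -> float:
    """Mean of the precision values measured at the rank of each true binder."""
    if len(scores) != len(is_binder):
        raise ValueError("scores and labels must have the same length")
    # Rank compounds from most to least confident.
    ranked = sorted(zip(scores, is_binder), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` picks
    if hits == 0:
        raise ValueError("average precision is undefined without any binders")
    return precision_sum / hits


if __name__ == "__main__":
    # Toy library: two binders among four compounds.
    toy_scores = [0.9, 0.8, 0.7, 0.1]
    toy_labels = [True, False, True, False]
    print(round(average_precision(toy_scores, toy_labels), 4))
```

The metric is sensitive to where the true binders land near the top of the ranking, which is why it suits libraries in which almost every molecule is inactive.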
The speed difference is the practical story. FEP is accurate enough to trust but too slow to run across a vast chemical library. A model that approaches FEP-level ranking in seconds rather than hours opens up structure-based virtual screening at a scale that was not feasible before, letting teams triage hundreds of thousands of candidate molecules and spend their expensive physics simulations and lab time only on the survivors. The economics shift too, since a per-compound estimate that once cost hours of cluster time and a meaningful dollar amount can now run on a single GPU in seconds. [1][2][3][7]
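The scale argument can be made concrete with back-of-the-envelope arithmetic. The sketch below uses the roughly 20 seconds per compound and the figure of at least one hundred dollars per FEP evaluation quoted in this article; the library size, the GPU count, and the function names are hypothetical, and the estimate ignores queueing, input preparation, and batching effects.

```python
def screening_gpu_hours(n_compounds: int, seconds_per_compound: float = 20.0) -> float:
    """Total GPU time, in hours, to score a library at a fixed per-compound cost."""
    return n_compounds * seconds_per_compound / 3600.0


def wall_clock_days(n_compounds: int, n_gpus: int,
                    seconds_per_compound: float = 20.0) -> float:
    """Elapsed days if the work is split evenly over n_gpus (idealized)."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return screening_gpu_hours(n_compounds, seconds_per_compound) / n_gpus / 24.0


def fep_cost_floor_usd(n_compounds: int, usd_per_compound: float = 100.0) -> float:
    """Lower bound on FEP spend, using the 'upward of $100' per-compound figure."""
    return n_compounds * usd_per_compound


if __name__ == "__main__":
    library = 100_000  # hypothetical library size
    print(round(screening_gpu_hours(library), 1), "GPU hours")
    print(round(wall_clock_days(library, n_gpus=8), 2), "days on 8 GPUs")
    print(fep_cost_floor_usd(library), "USD floor for FEP on the same library")
```

Under these assumptions a library of one hundred thousand compounds is a matter of days on a small GPU node, while running FEP on every compound would cost at least several million dollars.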
The authors also showed a forward-looking workflow that points past simple screening. By pairing Boltz-2 with a generative model for small molecules, they searched for new, synthesizable, high-affinity binders against the TYK2 target, a kinase relevant to inflammatory disease, then confirmed the best candidates with rigorous absolute FEP simulations. That combination, fast neural triage followed by careful physics validation, is how many groups expect the tool to be used in practice: the model proposes and ranks, and the slower physics confirms what matters. [1][2][7]
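The propose-rank-confirm pattern can be sketched as a two-stage filter. Everything here is hypothetical scaffolding: the `Prediction` record, the probability cutoff of 0.5, and the shortlist size are illustrative choices rather than values taken from the technical report.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Prediction:
    compound_id: str
    binder_probability: float  # 0..1, higher means more likely a real binder
    log10_ic50_um: float       # lower means tighter predicted binding


def triage(predictions: Iterable[Prediction],
           min_probability: float = 0.5,
           top_k: int = 10) -> List[Prediction]:
    """Keep likely binders, rank them by predicted affinity, return a shortlist."""
    likely = [p for p in predictions if p.binder_probability >= min_probability]
    likely.sort(key=lambda p: p.log10_ic50_um)  # tightest predicted binders first
    return likely[:top_k]


if __name__ == "__main__":
    # Toy model outputs for four generated molecules.
    batch = [
        Prediction("mol-1", 0.92, -1.4),
        Prediction("mol-2", 0.31, -2.0),  # strong affinity but unlikely binder
        Prediction("mol-3", 0.77, 0.3),
        Prediction("mol-4", 0.85, -0.6),
    ]
    shortlist = triage(batch, top_k=2)
    # Only the shortlist would be handed to slower physics-based confirmation.
    print([p.compound_id for p in shortlist])
```

Using the binary probability as a gate and the continuous value as a ranking key mirrors the two outputs of the affinity module; how the two signals should be weighted in practice is a design decision left to the user.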
The table below collects the verified figures from the technical report and the official release materials.
| Property | Detail |
|---|---|
| Developers | MIT Jameel Clinic and CSAIL with Recursion (and Valence Labs, ETH Zurich) |
| Release | June 2025 |
| License | MIT (open weights, inference, and training code; commercial use allowed) |
| Task | Joint biomolecular structure prediction and protein-ligand binding affinity |
| Modalities | Proteins, DNA, RNA, small-molecule ligands and complexes |
| Affinity outputs | Binary binder probability, continuous affinity as log10(IC50) in micromolar |
| Inference speed | About 20 seconds for structure plus affinity on a single GPU |
| Efficiency vs FEP | At least 1000 times more computationally efficient |
| FEP+ benchmark | Boltz-2 average Pearson about 0.66; about 0.62 on the OpenFE subset; commercial FEP+ about 0.78 |
| CASP16 affinity challenge | Outperformed all participants out of the box across 140 protein-ligand pairs |
| Affinity training data | About 5 million batch-corrected assay measurements from the literature |
| Structure training data | Protein Data Bank, molecular dynamics trajectories, distillation data |
| Architecture | Trunk, diffusion denoising module, confidence module, affinity module |
| Predecessor | Boltz-1 (open reproduction of AlphaFold 3, 2024) |
Boltz-2 sits in a lineage. AlphaFold 3 set the bar for predicting the structure of biomolecular complexes, but its restricted release left academic and commercial users without weights they could run and modify freely. Boltz-1 answered that by reproducing AlphaFold 3-level structure prediction as a fully open model, and it became a widely used open alternative across academia and industry. Boltz-2 is the next step in that line. On structure prediction it matches or moderately improves over Boltz-1 and still trails AlphaFold 3 by a small margin, so its structural accuracy is best described as approaching the AlphaFold 3 frontier rather than surpassing it. It does show gains over Boltz-1 on difficult cases such as antibody-antigen complexes, and when conditioned on molecular dynamics data it captures local protein flexibility competitively with specialized dynamics models like AlphaFlow and BioEmu. [1][3][4][6]
The bigger leap is the affinity capability, which neither AlphaFold 3 nor Boltz-1 offered. By adding it, the Boltz team moved the open-model story past structure and into the quantitative prediction that AI drug discovery teams care about most. Boltz-2 also fits the broader pattern of foundation models for science, where a single network trained on large, diverse data is adapted to several related tasks, and of open-source AI releases that let outside groups build on frontier methods. [3][5]
The accuracy claims come with caveats that the authors and independent reviewers have been careful to state. Boltz-2 approaches FEP for ranking related compounds, but predicting absolute binding free energies remains hard, and the model is best read as producing relative rankings rather than exact binding constants. It tends to work better as a binary binder-versus-decoy classifier than as a fine-grained ranker of close structural analogues, even though fine-grained ranking is exactly the resolution the hit-to-lead stage needs. Because the affinity head was trained largely on IC50-type assay data, its outputs are most reliable for targets and chemical series that resemble what it saw in training, and performance can fall off on novel targets, unusual ligands, or chemotypes far from the training distribution. [1][9]
There are also hard scope limits. The affinity model covers small-molecule ligands only and does not predict protein-protein binding affinity. Its affinity estimates are considered unreliable for ligands larger than about 50 atoms and are not computed at all past roughly 128 atoms. The structural side carries the usual risks of learned predictors. The model can be confidently wrong on new folds, struggles with large conformational changes, and is weaker on membrane proteins such as GPCRs and ion channels, so predicted poses should be validated rather than trusted blindly. That is part of why the authors frame the strongest use case as fast triage that feeds into physics-based confirmation rather than a replacement for it. Independent evaluations have probed these questions of reliability and applicability domain, and one March 2026 study reported structural inconsistencies and signs of overfitting to features of the training data. The practical advice that has emerged is to treat Boltz-2 as a powerful filter for early-stage screening while keeping slower, more rigorous methods in the loop for final decisions. [9][11][12]
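A screening script can encode the ligand-size limits as a simple pre-check before any affinity value is trusted. The sketch below treats the approximate figures of 50 and 128 atoms as hard cutoffs for illustration. The function and constant names are hypothetical, and how atoms are counted (for example, whether hydrogens are included) is left as an assumption the user must settle.

```python
RELIABLE_MAX_ATOMS = 50   # above this, affinity estimates are considered unreliable
HARD_MAX_ATOMS = 128      # above this, no affinity estimate is produced


def affinity_scope(n_ligand_atoms: int) -> str:
    """Classify a ligand by size relative to the stated affinity limits."""
    if n_ligand_atoms <= 0:
        raise ValueError("a ligand must contain at least one atom")
    if n_ligand_atoms > HARD_MAX_ATOMS:
        return "not_computed"
    if n_ligand_atoms > RELIABLE_MAX_ATOMS:
        return "unreliable"
    return "in_scope"


if __name__ == "__main__":
    # Hypothetical atom counts for three ligands.
    for count in (32, 75, 140):
        print(count, affinity_scope(count))
```

Flagging oversized ligands early keeps out-of-scope predictions from silently entering a ranked list.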