Open Catalyst Project
Last reviewed
Jun 3, 2026
Sources
13 citations
Review status
Source-backed
Revision
v1 · 1,529 words
The Open Catalyst Project (OCP) is a research collaboration between Meta AI's Fundamental AI Research group (FAIR) and Carnegie Mellon University's Department of Chemical Engineering. Launched in October 2020, it applies machine learning to the discovery of new catalysts for renewable energy storage. The project's central output has been a series of large open datasets of quantum chemistry calculations, along with public competitions and open-source model code, aimed at training ML models that can approximate slow physics simulations and screen candidate materials orders of magnitude faster.[1][2]
The effort was led on the academic side by Zachary Ulissi, then an assistant professor of chemical engineering at Carnegie Mellon, and on the industry side by C. Lawrence (Larry) Zitnick and colleagues at FAIR.[2][3] Over time the project broadened well beyond catalysis, and the umbrella code and model organization was rebranded as FAIR Chemistry in 2024 to reflect work spanning materials science, direct air capture, and molecular chemistry.[4]
As electricity grids add intermittent renewable sources such as wind and solar, storing energy across hours, days, or seasons becomes a bottleneck. One scalable option is to convert surplus renewable electricity into chemical fuels, for example splitting water to make hydrogen or synthesizing ammonia, and then converting those fuels back to electricity in fuel cells. These conversions depend on catalysts, and the most effective ones often rely on scarce and expensive metals such as platinum. Finding cheaper, more abundant catalysts that work well would lower the cost of the whole chain.[3][5]
The standard computational tool for evaluating a candidate catalyst is density functional theory (DFT), a quantum chemistry method that estimates the energy and forces of a system of atoms. DFT is accurate enough to be useful but very slow: a single relaxation of one adsorbate-catalyst structure can take from roughly twelve hours to three days of computation, which makes brute-force screening of the enormous space of possible materials and surface configurations impractical.[3] The project's premise is that a machine learning model, once trained on a large body of DFT results, can act as a surrogate that predicts the same energies and forces far faster, letting researchers search a much larger set of candidates before committing scarce DFT or laboratory time to the most promising ones.[1][2]
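A back-of-the-envelope calculation makes the scale of the problem concrete. The sketch below uses the per-relaxation range quoted above (twelve hours to three days); the campaign size of 100,000 candidate structures is an illustrative assumption, not a figure from the project, and the result treats the quoted times as serial compute time.

```python
# Rough cost of brute-force DFT screening, using the 12 hour to 3 day
# per-relaxation range quoted in the text.
HOURS_PER_YEAR = 24 * 365


def dft_compute_years(n_relaxations: int, hours_each: float) -> float:
    """Total serial compute time, in years, for n_relaxations DFT relaxations."""
    return n_relaxations * hours_each / HOURS_PER_YEAR


# Hypothetical screening campaign size (assumption for illustration only).
n_candidates = 100_000

low = dft_compute_years(n_candidates, 12)   # optimistic: 12 hours each
high = dft_compute_years(n_candidates, 72)  # pessimistic: 3 days each
print(f"{low:.0f} to {high:.0f} compute-years")
```

Under these assumptions the total comes to roughly 137 to 822 compute-years, which is why a fast surrogate is attractive as a first-pass filter.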
The project's main contribution has been releasing large, openly licensed datasets of DFT calculations on catalyst surfaces with adsorbed molecules. The flagship release, Open Catalyst 2020 (OC20), was the largest dataset of its kind when published. Surfaces were drawn from low-Miller-index facets of stable materials in the Materials Project, spanning 55 elements, combined with 82 adsorbates relevant to renewable energy and environmental chemistry (small species and carbon-, nitrogen-, and oxygen-containing intermediates).[1][6]
A follow-up, Open Catalyst 2022 (OC22), targeted oxide materials, which were under-represented in OC20 yet central to the oxygen evolution reaction (OER) and other electrochemical processes. OC22 also generalized the prediction target from adsorption energy to total energy, broadening the range of properties a trained model could address.[7] Later datasets extended the same methodology to other application areas, most notably Open DAC 2023 (ODAC23) for sorbent discovery in direct air capture, built on metal-organic frameworks (MOFs) with adsorbed CO2 and water.[8]
| Dataset | Year | Domain | DFT relaxations | Single-point calculations | Scope |
|---|---|---|---|---|---|
| OC20 | 2020 | Catalyst surfaces + adsorbates | 1,281,040 | ~264,890,000 | 55 elements, 82 adsorbates |
| OC22 | 2022 | Oxide electrocatalysts (e.g. OER) | 62,331 | 9,854,504 | Oxide materials, total-energy task |
| ODAC23 | 2023 | Direct air capture (MOFs) | N/A | >38,000,000 | >8,400 MOFs with CO2 and/or H2O |
The OC20 dataset paper reports 1,281,040 DFT relaxations corresponding to roughly 264.9 million single-point evaluations; it was first posted to arXiv in October 2020 and published in ACS Catalysis in 2021.[1][6] The OC22 paper reports 62,331 relaxations (about 9.85 million single-point calculations) and appeared in ACS Catalysis in 2023.[7] ODAC23, described as the largest set of MOF adsorption calculations at the DFT level available at the time, comprises more than 38 million DFT calculations on more than 8,400 MOF structures, including both pristine and defective frameworks.[8]
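The ratio of single-point calculations to relaxations gives a feel for how many intermediate frames each relaxation contributes on average. The short calculation below takes the counts quoted above at face value; the function and variable names are illustrative only.

```python
# Counts reported in the dataset papers, as quoted in the text.
# The OC20 single-point count is a rounded figure.
datasets = {
    "OC20": {"relaxations": 1_281_040, "single_points": 264_890_000},
    "OC22": {"relaxations": 62_331, "single_points": 9_854_504},
}


def frames_per_relaxation(relaxations: int, single_points: int) -> float:
    """Average number of single-point evaluations per relaxation."""
    return single_points / relaxations


for name, counts in datasets.items():
    avg = frames_per_relaxation(**counts)
    print(f"{name}: about {avg:.0f} single-point calculations per relaxation")
```

On these figures, OC20 averages about 207 single-point calculations per relaxation and OC22 about 158.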
The learning problem is to predict, for a system of a catalyst surface (a slab of atoms) plus an adsorbed molecule, the properties that DFT would compute: the system's total energy and the force on each atom. Because relaxing a structure means repeatedly moving atoms downhill in energy until forces are near zero, a model that predicts energy and forces can drive that relaxation directly, replacing the DFT inner loop. OC20 framed this as three benchmark tasks with associated public leaderboards.[1][6]
| Task | Abbreviation | What the model predicts |
|---|---|---|
| Structure to Energy and Forces | S2EF | Total energy and per-atom forces for a given atomic configuration |
| Initial Structure to Relaxed Structure | IS2RS | The relaxed (lowest-energy) atomic positions, starting from an unrelaxed structure |
| Initial Structure to Relaxed Energy | IS2RE | The energy of the relaxed state, predicted from the initial structure |
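The relationship between the three tasks can be sketched with a toy example: given any function that returns an energy and per-atom forces (the S2EF outputs), a relaxation loop yields relaxed positions (the IS2RS output) and a relaxed energy (the IS2RE output). The code below is a minimal sketch, not the project's implementation. `toy_model` is a hypothetical stand-in (a harmonic bond between two atoms) for a trained model, and plain steepest descent is used for simplicity where practical workflows typically use a more sophisticated optimizer.

```python
import math


def toy_model(positions, k=1.0, r0=1.0):
    """Hypothetical stand-in for an S2EF model: a harmonic bond between two atoms.

    Returns (energy, forces), the two quantities an S2EF model predicts.
    """
    a, b = positions
    d = tuple(bi - ai for ai, bi in zip(a, b))
    r = math.sqrt(sum(x * x for x in d))
    energy = 0.5 * k * (r - r0) ** 2
    # Force is the negative gradient of the energy with respect to position.
    f_b = tuple(-k * (r - r0) * x / r for x in d)
    f_a = tuple(-x for x in f_b)
    return energy, [f_a, f_b]


def relax(positions, model, step=0.1, fmax=0.01, max_steps=500):
    """Steepest descent: move atoms along their forces until forces are small."""
    positions = list(positions)
    for _ in range(max_steps):
        _, forces = model(positions)
        largest = max(math.sqrt(sum(c * c for c in f)) for f in forces)
        if largest < fmax:
            break
        positions = [
            tuple(p + step * f for p, f in zip(pos, frc))
            for pos, frc in zip(positions, forces)
        ]
    energy, _ = model(positions)
    return positions, energy


initial = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]  # bond stretched past r0 = 1.0
initial_energy, _ = toy_model(initial)
relaxed_positions, relaxed_energy = relax(initial, toy_model)
print(relaxed_positions, relaxed_energy)
```

Replacing `toy_model` with a learned model is what allows the surrogate to stand in for the DFT inner loop. Models can also be trained to predict the IS2RE target directly from the initial structure, skipping the loop altogether.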
The evaluation was designed to test generalization. Test splits separated systems drawn from the same distribution as training from out-of-domain cases involving unseen adsorbates, unseen catalyst compositions, or both, since a useful surrogate must extrapolate to materials it was not trained on.[6]
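The split logic can be illustrated as a simple membership check against the training set. This is a sketch of the idea only: the split labels, the example adsorbate and catalyst names, and the function name are hypothetical, and the project's actual splitting procedure may differ in detail.

```python
def split_name(adsorbate, catalyst, train_adsorbates, train_catalysts):
    """Label a test system by whether its adsorbate and catalyst were seen in training."""
    seen_adsorbate = adsorbate in train_adsorbates
    seen_catalyst = catalyst in train_catalysts
    if seen_adsorbate and seen_catalyst:
        return "in-domain"
    if seen_catalyst:
        return "out-of-domain adsorbate"
    if seen_adsorbate:
        return "out-of-domain catalyst"
    return "out-of-domain both"


# Hypothetical training vocabulary, for illustration only.
train_adsorbates = {"*OH", "*CO"}
train_catalysts = {"Cu", "Pt3Ni"}

for ads, cat in [("*OH", "Cu"), ("*NH2", "Cu"), ("*CO", "AgPd"), ("*NH2", "AgPd")]:
    print(ads, cat, "->", split_name(ads, cat, train_adsorbates, train_catalysts))
```

Reporting errors separately for each category shows how much accuracy a model loses when it has to extrapolate.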
Most baseline and competitive models for these tasks are graph neural networks, which represent each structure as a graph of atoms connected by edges to nearby neighbors and learn to predict energies and forces from that geometry. The OC20 paper benchmarked architectures including CGCNN, SchNet, and DimeNet++,[1][6] and later entries used more specialized equivariant and message-passing models such as GemNet-OC, EquiformerV2, and eSEN.[4][9]
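The graph construction step can be sketched as follows: atoms become nodes, and an edge joins any pair closer than a cutoff distance. This is a simplified illustration that ignores the periodic boundary conditions a real slab calculation requires; the coordinates and the cutoff value are arbitrary assumptions, and `radius_graph` here is a hypothetical helper, not a library function.

```python
import math
from itertools import combinations


def radius_graph(positions, cutoff):
    """Return undirected edges (i, j, distance) for atom pairs closer than cutoff."""
    edges = []
    for (i, p), (j, q) in combinations(enumerate(positions), 2):
        dist = math.dist(p, q)
        if dist < cutoff:
            edges.append((i, j, dist))
    return edges


# Four atoms: three close together and one far away (arbitrary coordinates).
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
edges = radius_graph(positions, cutoff=1.5)
for i, j, dist in edges:
    print(f"atom {i} - atom {j}: {dist:.3f}")
```

A graph neural network then passes messages along these edges, so each atom's learned representation depends on its local chemical environment.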
To spur progress, the project ran the Open Catalyst Challenge as a competition at NeurIPS in 2021 and 2022, with a related session at the 2023 AI for Science workshop. Both the 2021 and 2022 editions centered on the IS2RE task, predicting a relaxed-state energy from an initial structure, with participants given millions of training samples and evaluated on a held-out test set.[10][11] Reported winning energy mean absolute errors (MAE) fell from about 0.547 eV at NeurIPS 2021 to about 0.396 eV at NeurIPS 2022, with teams from Microsoft Research Asia and Tencent AI Lab among the top finishers.[10][11] The project maintained public evaluation servers and leaderboards so that submissions could be compared on a common footing, and released its model code and pretrained checkpoints as open source.[2]
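The headline metric is the mean absolute error between predicted and reference energies. The sketch below shows the calculation on a few made-up values (the `predicted` and `reference` lists are hypothetical, not challenge data) and then computes the relative improvement implied by the two winning scores quoted above.

```python
def energy_mae(predicted, reference):
    """Mean absolute error between predicted and reference energies (same units)."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


# Hypothetical energies in eV, for illustration only.
predicted = [-1.20, 0.35, -0.48]
reference = [-1.00, 0.50, -0.50]
mae = energy_mae(predicted, reference)
print(f"MAE on toy data: {mae:.3f} eV")

# Winning scores quoted in the text.
mae_2021, mae_2022 = 0.547, 0.396
relative_drop = (mae_2021 - mae_2022) / mae_2021
print(f"Relative improvement, 2021 to 2022: {relative_drop:.1%}")
```

On the quoted figures, the winning error fell by roughly 28 percent between the two editions.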
By 2024 the same DFT-surrogate approach had outgrown catalysis. The code, datasets, and models were consolidated and rebranded as FAIR Chemistry, with the modeling library renamed fairchem, reflecting use cases that now included direct air capture and general inorganic materials.[4] In October 2024 the team released Open Materials 2024 (OMat24), a dataset of over 110 million DFT calculations on bulk inorganic materials, together with EquiformerV2-based models that ranked at the top of the Matbench Discovery leaderboard for predicting material stability and formation energies.[9]
In May 2025 FAIR Chemistry released Open Molecules 2025 (OMol25), a dataset of over 100 million high-accuracy quantum chemistry calculations on molecular systems, alongside a family of Universal Models for Atoms (UMA). UMA is a single machine learning interatomic potential trained jointly on the molecular and materials datasets the group had accumulated over roughly five years, spanning catalysts, oxides, MOFs, bulk materials, and molecules.[12][13] These releases position the work that began as the Open Catalyst Project within a wider program in AI for science: building fast, general-purpose surrogates for quantum chemistry that can be reused across catalysis, energy storage, carbon capture, and materials discovery.