Open Catalyst Project

Updated Jul 16, 2026

The Open Catalyst Project (OCP) is a research collaboration between Meta AI's Fundamental AI Research group (FAIR) and Carnegie Mellon University's Department of Chemical Engineering. Launched in October 2020, it applies machine learning to the discovery of new catalysts for renewable energy storage. The project's central output has been a series of large open datasets of quantum chemistry calculations, along with public competitions and open-source model code, aimed at training ML models that can approximate slow physics simulations and screen candidate materials orders of magnitude faster.^[1]^[2]

The effort was led on the academic side by Zachary Ulissi, then an assistant professor of chemical engineering at Carnegie Mellon, and on the industry side by C. Lawrence (Larry) Zitnick and colleagues at FAIR.^[2]^[3] Over time the project broadened well beyond catalysis, and the umbrella code and model organization was rebranded as FAIR Chemistry in 2024 to reflect work spanning materials science, direct air capture, and molecular chemistry.^[4]

Motivation

As electricity grids add intermittent renewable sources such as wind and solar, storing energy across hours, days, or seasons becomes a bottleneck. One scalable option is to convert surplus renewable electricity into chemical fuels, for example splitting water to make hydrogen or synthesizing ammonia, and then converting those fuels back to electricity in fuel cells. These conversions depend on catalysts, and the most effective ones often rely on scarce and expensive metals such as platinum. Finding cheaper, more abundant catalysts that work well would lower the cost of the whole chain.^[3]^[5]

The standard computational tool for evaluating a candidate catalyst is density functional theory (DFT), a quantum chemistry method that estimates the energy and forces of a system of atoms. DFT is accurate enough to be useful but very slow: a single relaxation of one adsorbate-catalyst structure can take from roughly twelve hours to three days of computation, which makes brute-force screening of the enormous space of possible materials and surface configurations impractical.^[3] The project's premise is that a machine learning model, once trained on a large body of DFT results, can act as a surrogate that predicts the same energies and forces far faster, letting researchers search a much larger set of candidates before committing scarce DFT or laboratory time to the most promising ones.^[1]^[2]

Datasets

The project's main contribution has been releasing large, openly licensed datasets of DFT calculations on catalyst surfaces with adsorbed molecules. The flagship release, Open Catalyst 2020 (OC20), was the largest dataset of its kind when published. Surfaces were drawn from low-Miller-index facets of stable materials in the Materials Project, spanning 55 elements, combined with 82 adsorbates relevant to renewable energy and environmental chemistry (small species and carbon-, nitrogen-, and oxygen-containing intermediates).^[1]^[6]

A follow-up, Open Catalyst 2022 (OC22), targeted oxide materials, which were under-represented in OC20 yet central to the oxygen evolution reaction (OER) and other electrochemical processes. OC22 also generalized the prediction target from adsorption energy to total energy, broadening the range of properties a trained model could address.^[7] Later datasets extended the same methodology to other application areas, most notably Open DAC 2023 (ODAC23) for sorbent discovery in direct air capture, built on metal-organic frameworks (MOFs) with adsorbed CO2 and water.^[8]

Dataset	Year	Domain	DFT relaxations	Single-point calculations	Scope
OC20	2020	Catalyst surfaces + adsorbates	1,281,040	~264,890,000	55 elements, 82 adsorbates
OC22	2022	Oxide electrocatalysts (e.g. OER)	62,331	9,854,504	Oxide materials, total-energy task
ODAC23	2023	Direct air capture (MOFs)	N/A	>38,000,000	>8,400 MOFs with CO2 and/or H2O

The OC20 dataset paper reports 1,281,040 DFT relaxations corresponding to roughly 264.9 million single-point evaluations; it was first posted to arXiv in October 2020 and published in ACS Catalysis in 2021.^[1]^[6] The OC22 paper reports 62,331 relaxations (about 9.85 million single-point calculations) and appeared in ACS Catalysis in 2023.^[7] ODAC23, described as the largest set of MOF adsorption calculations at the DFT level available at the time, comprises more than 38 million DFT calculations on more than 8,400 MOF structures, including both pristine and defective frameworks.^[8]
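As a quick sanity check on these figures, the ratio of single-point calculations to relaxations gives a rough sense of how many frames each relaxation contributes on average. The sketch below uses only the counts quoted above; treating every single-point calculation as belonging to a relaxation is a simplifying assumption, and the OC20 single-point figure is itself rounded.

```python
# Counts as quoted in the text above; the OC20 single-point figure is a
# rounded value ("roughly 264.9 million"), so the ratio is approximate.
DATASETS = {
    "OC20": {"relaxations": 1_281_040, "single_points": 264_890_000},
    "OC22": {"relaxations": 62_331, "single_points": 9_854_504},
}


def single_points_per_relaxation(name: str) -> float:
    """Average number of single-point calculations per relaxation."""
    counts = DATASETS[name]
    return counts["single_points"] / counts["relaxations"]


for name in DATASETS:
    ratio = single_points_per_relaxation(name)
    print(f"{name}: about {ratio:.1f} single points per relaxation")
```

Under these assumptions the averages come out to roughly 207 for OC20 and 158 for OC22, which is why the single-point counts are so much larger than the relaxation counts.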

The machine learning task

The learning problem is to predict, for a system of a catalyst surface (a slab of atoms) plus an adsorbed molecule, the properties that DFT would compute: the system's total energy and the force on each atom. Because relaxing a structure means repeatedly moving atoms downhill in energy until forces are near zero, a model that predicts energy and forces can drive that relaxation directly, replacing the DFT inner loop. OC20 framed this as three benchmark tasks with associated public leaderboards.^[1]^[6]
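The relaxation loop described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: `toy_model` is a hypothetical stand-in for a trained surrogate (a set of harmonic wells with a made-up minimum), and the plain steepest-descent update, step size, and force threshold are assumptions chosen for readability. Production workflows typically use more sophisticated optimizers.

```python
from typing import Callable, List, Tuple

Vec = Tuple[float, ...]
Model = Callable[[List[Vec]], Tuple[float, List[Vec]]]


def toy_model(positions: List[Vec]) -> Tuple[float, List[Vec]]:
    """Hypothetical surrogate: E = 0.5*k*sum|r_i - r0_i|^2, F_i = -k*(r_i - r0_i)."""
    k = 1.0
    targets = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]  # made-up energy minimum
    energy = 0.0
    forces = []
    for r, r0 in zip(positions, targets):
        d = tuple(a - b for a, b in zip(r, r0))
        energy += 0.5 * k * sum(c * c for c in d)
        forces.append(tuple(-k * c for c in d))  # force = -dE/dr
    return energy, forces


def relax(positions: List[Vec], model: Model, fmax: float = 0.05,
          step: float = 0.1, max_steps: int = 1000):
    """Steepest descent: move atoms along predicted forces until they are small."""
    for n in range(max_steps):
        energy, forces = model(positions)  # one S2EF-style prediction
        largest = max(sum(c * c for c in f) ** 0.5 for f in forces)
        if largest < fmax:
            return positions, energy, n
        positions = [
            tuple(x + step * fx for x, fx in zip(r, f))
            for r, f in zip(positions, forces)
        ]
    energy, _ = model(positions)
    return positions, energy, max_steps


initial = [(0.4, -0.3, 0.2), (1.0, 0.5, 0.0)]  # made-up starting structure
relaxed, relaxed_energy, steps = relax(initial, toy_model)
print(f"converged after {steps} steps, energy {relaxed_energy:.6f}")
```

In this framing, each call to the model is a structure-to-energy-and-forces prediction, the returned positions are the relaxed structure, and the returned energy is the relaxed-state energy.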

Task	Abbreviation	What the model predicts
Structure to Energy and Forces	S2EF	Total energy and per-atom forces for a given atomic configuration
Initial Structure to Relaxed Structure	IS2RS	The relaxed (lowest-energy) atomic positions, starting from an unrelaxed structure
Initial Structure to Relaxed Energy	IS2RE	The energy of the relaxed state, predicted from the initial structure

The evaluation was designed to test generalization. Test splits separated systems drawn from the same distribution as training from out-of-domain cases involving unseen adsorbates, unseen catalyst compositions, or both, since a useful surrogate must extrapolate to materials it was not trained on.^[6]
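The split logic can be illustrated with a small helper that labels a test system by whether its adsorbate and catalyst were seen in training. This is a sketch under stated assumptions: the function name, the label strings, and the idea of identifying a catalyst by a composition string are all hypothetical, and the actual split definitions are those given in the OC20 paper.

```python
from typing import Set


def split_label(adsorbate: str, catalyst: str,
                train_adsorbates: Set[str], train_catalysts: Set[str]) -> str:
    """Label a test system as in-domain or one of three out-of-domain cases."""
    seen_adsorbate = adsorbate in train_adsorbates
    seen_catalyst = catalyst in train_catalysts
    if seen_adsorbate and seen_catalyst:
        return "in-domain"
    if seen_catalyst:
        return "out-of-domain adsorbate"
    if seen_adsorbate:
        return "out-of-domain catalyst"
    return "out-of-domain both"


# Made-up training sets, for illustration only.
train_ads = {"*OH", "*CO"}
train_cats = {"Pt", "CuZn"}

print(split_label("*OH", "Pt", train_ads, train_cats))     # in-domain
print(split_label("*NH3", "Pt", train_ads, train_cats))    # out-of-domain adsorbate
print(split_label("*CO", "NiGa", train_ads, train_cats))   # out-of-domain catalyst
print(split_label("*NH3", "NiGa", train_ads, train_cats))  # out-of-domain both
```

Reporting errors separately for each label shows how much accuracy a model loses when it has to extrapolate rather than interpolate.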

Most baseline and competitive models for these tasks are graph neural networks, which represent each structure as a graph of atoms connected by edges to nearby neighbors and learn to predict energies and forces from that geometry. The OC20 paper benchmarked architectures including CGCNN, SchNet, and DimeNet++,^[1]^[6] and later entries used more specialized equivariant and message-passing models such as GemNet-OC, EquiformerV2, and eSEN.^[4]^[9]
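The graph representation these models share can be sketched as a radius graph: every pair of atoms closer than a cutoff distance is connected by an edge in both directions. The sketch below is a simplified illustration, not any particular model's preprocessing. It assumes a non-periodic system and an arbitrary cutoff, whereas real catalyst slabs are periodic and practical pipelines usually also cap the number of neighbors per atom.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def radius_graph(positions: Sequence[Sequence[float]],
                 cutoff: float) -> List[Tuple[int, int]]:
    """Return directed edges (i, j) for every atom pair closer than `cutoff`.

    Assumes open (non-periodic) boundaries; a real slab needs periodic images.
    """
    edges = []
    for i, j in combinations(range(len(positions)), 2):
        if math.dist(positions[i], positions[j]) < cutoff:
            edges.append((i, j))  # message from i to j
            edges.append((j, i))  # and from j to i
    return edges


# Made-up coordinates: two nearby atoms and one far away.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
print(radius_graph(atoms, cutoff=2.0))  # [(0, 1), (1, 0)]
```

A graph neural network then passes messages along these edges, so the cutoff controls how far geometric information can travel in a single layer.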

Challenges and leaderboards

To spur progress, the project ran the Open Catalyst Challenge as a competition at NeurIPS in 2021 and 2022, with a related session at the 2023 AI for Science workshop. Both the 2021 and 2022 editions centered on the IS2RE task (predicting a relaxed-state energy from an initial structure), with participants given millions of training samples and evaluated on a held-out test set.^[10]^[11] The reported winning energy mean absolute error (MAE) fell from about 0.547 eV at NeurIPS 2021 to about 0.396 eV at NeurIPS 2022, with teams from Microsoft Research Asia and Tencent AI Lab among the top finishers.^[10]^[11] The project maintained public evaluation servers and leaderboards so that submissions could be compared on a common footing, and released its model code and pretrained checkpoints as open source.^[2]
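The energy metric quoted for the challenge is a mean absolute error between predicted and DFT relaxed energies. A minimal sketch of that calculation follows; the energy values are made up for illustration, and the function is a generic MAE rather than the challenge's evaluation code. The last lines compute the relative improvement between the two winning scores reported above.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float],
                        reference: Sequence[float]) -> float:
    """Average of |prediction - reference| over all systems."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) for p, r in zip(predicted, reference))
    return total / len(predicted)


# Made-up relaxed energies in eV, for illustration only.
dft_energies = [-1.20, 0.35, -0.80, 2.10]
model_energies = [-1.00, 0.50, -1.10, 2.00]
mae = mean_absolute_error(model_energies, dft_energies)
print(f"MAE = {mae:.4f} eV")

# Relative improvement between the winning scores quoted in the text.
mae_2021, mae_2022 = 0.547, 0.396
improvement = (mae_2021 - mae_2022) / mae_2021
print(f"relative improvement: {improvement:.1%}")
```

With the figures quoted in the text, the 2022 winning error is about 27.6% lower than the 2021 one.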

Broader Open Materials and FAIR Chemistry work

By 2024 the same DFT-surrogate approach had outgrown catalysis. The code, datasets, and models were consolidated and rebranded as FAIR Chemistry, with the modeling library renamed fairchem, reflecting use cases that now included direct air capture and general inorganic materials.^[4] In October 2024 the team released Open Materials 2024 (OMat24), a dataset of over 110 million DFT calculations on bulk inorganic materials, together with EquiformerV2-based models that ranked at the top of the Matbench Discovery leaderboard for predicting material stability and formation energies.^[9]

In May 2025 FAIR Chemistry released Open Molecules 2025 (OMol25), a dataset of over 100 million high-accuracy quantum chemistry calculations on molecular systems, alongside a family of Universal Models for Atoms (UMA). Each UMA model is a single machine learning interatomic potential trained jointly on the molecular and materials datasets the group had accumulated over roughly five years, spanning catalysts, oxides, MOFs, bulk materials, and molecules.^[12]^[13] These releases position the work that began as the Open Catalyst Project within a wider program in AI for science: building fast, general-purpose surrogates for quantum chemistry that can be reused across catalysis, energy storage, carbon capture, and materials discovery.

References

1. L. Chanussot, A. Das, S. Goyal, et al., "Open Catalyst 2020 (OC20) Dataset and Community Challenges," *ACS Catalysis*, 2021. https://pubs.acs.org/doi/10.1021/acscatal.0c04525
2. Open Catalyst Project, official website. https://opencatalystproject.org/
3. "Ulissi and Facebook AI Create World's Largest Catalysis Dataset," Carnegie Mellon University College of Engineering, October 14, 2020. https://engineering.cmu.edu/news-events/news/2020/10/14-facebook-ai.html
4. FAIR Chemistry / facebookresearch fairchem repository, "FAIR Chemistry's library of machine learning methods for chemistry." https://github.com/facebookresearch/fairchem
5. "Facebook and Carnegie Mellon team up for AI-led energy storage research," Data Center Dynamics. https://www.datacenterdynamics.com/en/news/facebook-and-carnegie-mellon-team-ai-led-energy-storage-research/
6. L. Chanussot, A. Das, S. Goyal, et al., "The Open Catalyst 2020 (OC20) Dataset and Community Challenges," arXiv:2010.09990. https://arxiv.org/abs/2010.09990
7. R. Tran, J. Lan, M. Shuaibi, et al., "The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide Electrocatalysts," arXiv:2206.08917; *ACS Catalysis*, 2023. https://arxiv.org/abs/2206.08917
8. A. S. Sriram, et al., "The Open DAC 2023 Dataset and Challenges for Sorbent Discovery in Direct Air Capture," *ACS Central Science*, 2024. https://pubs.acs.org/doi/10.1021/acscentsci.3c01629
9. L. Barroso-Luque, M. Shuaibi, X. Fu, et al., "Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models," arXiv:2410.12771. https://arxiv.org/abs/2410.12771
10. A. Das, et al., "The Open Catalyst Challenge 2021: Competition Report," *Proceedings of Machine Learning Research*, vol. 176. https://proceedings.mlr.press/v176/das22a.html
11. Open Catalyst Challenge page, Open Catalyst Project. https://opencatalystproject.org/challenge.html
12. D. S. Levine, M. Shuaibi, et al., "Open Molecules 2025 (OMol25)," FAIR Chemistry; see also Meta AI announcement. https://ai.meta.com/blog/meta-fair-science-new-open-source-releases/
13. "Universal Model for Atoms (UMA)," FAIR Chemistry documentation. https://fair-chem.github.io/
