Virtual Biology Initiative
Last reviewed: Jun 7, 2026
Sources: 18 citations
Review status: Source-backed
Revision: v1 · 1,930 words
The Virtual Biology Initiative is a five-year, $500 million research program announced on April 29, 2026 by Biohub, the science organization of the Chan Zuckerberg Initiative (CZI). The initiative aims to generate the large, open biological datasets and computing infrastructure needed to build AI "virtual cell" models, predictive systems that simulate how human cells work in health and disease. Rather than funding a single model, the program is designed to galvanize a coordinated global effort across leading research institutions, with Biohub committing $100 million to external research and $400 million to internal technology development. It brings together the Broad Institute of MIT and Harvard, the Allen Institute, the Arc Institute, the Wellcome Sanger Institute, the chipmaker NVIDIA as technology partner, and international data consortia including the Human Cell Atlas and the Human Protein Atlas. The effort sits at the center of CZI's broader pivot toward AI for science and its founding goal of helping to cure, prevent, or manage all disease by the end of the century.
The Chan Zuckerberg Initiative was founded in 2015 by Meta chief executive Mark Zuckerberg and the pediatrician Priscilla Chan, who pledged to give 99 percent of their Facebook (now Meta) shares to philanthropy over their lifetimes. Its science arm launched the original Biohub in San Francisco in 2016, and CZI has since donated roughly $4 billion to basic science research. Over that period the "virtual cell" idea, an AI model that can simulate the behavior of a cell across molecular, spatial, and temporal scales, became a central organizing goal. In a 2023 essay in MIT Technology Review, Chan and Zuckerberg argued that bringing generative AI to biology at scale could let researchers query simulations of healthy and diseased cells instead of running every experiment at the bench.
A virtual cell model is broadly analogous to a large language model, but trained on biological measurements such as single-cell RNA sequencing, imaging, and protein data rather than text. The central obstacle has long been data. As Biohub Head of Science Alex Rives put it, building AI that can "accurately represent the full complexity of biology" requires "orders of magnitude more data than exists today." Existing biological datasets are often siloed, proprietary, and structured to answer one narrow question, which limits their usefulness for training general models. The Virtual Biology Initiative is CZI's bet that solving the data bottleneck is the prerequisite for building useful AI in healthcare.
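The language-model analogy can be made concrete with a small sketch of how one cell's measurements might become a token sequence. The rank-by-expression scheme below is one approach described in the single-cell modeling literature; it is an assumption for illustration, not a tokenization the article attributes to any CZI model. The function name `cell_to_tokens`, the `max_len` parameter, and the toy counts are all hypothetical.

```python
from typing import Dict, List


def cell_to_tokens(expression: Dict[str, float], max_len: int = 4) -> List[str]:
    """Turn one cell's gene-expression counts into an ordered token list.

    Assumed scheme (illustrative only): genes are ranked from most to
    least expressed, so the "sentence" for a cell is its gene ranking.
    """
    # Keep only genes that were actually detected in this cell.
    detected = [(gene, count) for gene, count in expression.items() if count > 0]
    # Highest expression first; ties are broken alphabetically so the
    # output is deterministic.
    detected.sort(key=lambda item: (-item[1], item[0]))
    # Truncate to a fixed context length, as a text model would.
    return [gene for gene, _ in detected[:max_len]]


# Hypothetical counts for a single cell (not real data).
toy_cell = {"CD3E": 12.0, "ACTB": 40.0, "INS": 0.0, "GAPDH": 25.0, "CD19": 1.0}
tokens = cell_to_tokens(toy_cell)
print(tokens)
```

A model trained on many such sequences sees genes where a text model sees words, which is why the volume and diversity of measured cells matters so much for training.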
In April 2025, CZI laid the groundwork by announcing four scientific grand challenges: building an AI virtual cell model; developing new imaging technologies to map biological systems; creating tools to sense and measure inflammation in tissues in real time; and harnessing the immune system for early disease detection and treatment. The Virtual Biology Initiative is the funding and data engine for the first of these challenges.
Biohub committed $500 million over five years to anchor the program. The money is split into two parts. About $100 million funds external research to help nucleate a coordinated, worldwide data-generation effort, while roughly $400 million is invested internally at Biohub to generate data at scale and to develop next-generation technologies for measuring, imaging, and engineering biology. The figure and structure were reported consistently by Biohub's own announcement, a PR Newswire release, Axios, and trade press.
The stated goal is to create the open data foundation for what Biohub calls "predictive models of life," AI systems that let scientists ask and answer biological questions digitally at a scale far beyond what is feasible in a physical laboratory. Biohub frames the scale of the data gap in concrete terms: today's AI-ready biology datasets amount to roughly one billion cells, which the organization estimates may be an order of magnitude or more below what is needed to train models that genuinely predict cellular behavior. All data generated under the initiative is to be released openly and freely to the global scientific community.
The announcement followed a broader restructuring of CZI. Over 2024 and 2025 the organization wound down most non-science work, including grantmaking tied to diversity, education, and immigration advocacy, and unified its scientific efforts under the Biohub name. Reporting in late 2025 described CZI shifting the bulk of its philanthropy toward AI-powered biology, operating at roughly $1 billion per year and stating it was on track to double its cumulative $4 billion in science giving over the following decade. The Virtual Biology Initiative is the largest concrete expression of that pivot.
The initiative is structured as a consortium rather than a grant to a single lab. Biohub coordinates participating institutions and consortia, each contributing data-generation capacity, expertise, or computing resources. It builds on Biohub's prior support for large-scale atlases including the Human Cell Atlas, the Billion Cells Project (which coordinates 17 projects across institutions such as MIT, Stanford, UC San Francisco, Columbia, the University of Washington, ETH Zurich, and the Genome Institute of Singapore), and the Tabula Sapiens multi-organ cell atlas.
| Partner | Type | Role in initiative |
|---|---|---|
| Biohub (CZI) | Founder / funder | Commits $500M; builds core technology and infrastructure |
| Broad Institute of MIT and Harvard | Research institution | Large-scale data generation |
| Allen Institute | Research institution | Cell biology and imaging data |
| Arc Institute | Research institution | Functional genomics and virtual-cell research |
| Wellcome Sanger Institute | Research institution | Genomics and Human Cell Atlas contributions |
| NVIDIA | Technology partner | Accelerated computing, GPU software, technical expertise |
| Human Cell Atlas | Consortium | Open single-cell reference data |
| Human Protein Atlas | Consortium | Protein localization and expression data |
| Billion Cells Project | Consortium | Coordinates 17 data-generation projects |
| Renaissance Philanthropy | Philanthropic partner | Coordination and co-funding |
NVIDIA's role centers on compute. CZI operates one of the largest computing systems dedicated to nonprofit life-science research. Announced in September 2023, the cluster is built around more than 1,000 NVIDIA H100 GPUs; trade reporting described it as 1,024 H100s in an NVIDIA DGX SuperPOD configuration paired with VAST data storage, intended to come online in 2024. The cluster lets CZI train large biological models in-house, an unusual capability for a nonprofit. In October 2025, CZI and NVIDIA expanded their collaboration to develop open virtual cell models, GPU-accelerated tools for harmonizing petabyte-scale biological data spanning billions of cellular observations, and a shared evaluation suite called cz-benchmarks, all delivered through CZI's open Virtual Cells Platform.
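For a rough sense of scale, the reported GPU count can be turned into node and memory figures. This is a back-of-the-envelope sketch: the eight-GPUs-per-system figure matches NVIDIA's DGX H100 design, while the 80 GB of memory per GPU is an assumption about which H100 variant is installed, and the helper `cluster_summary` is hypothetical.

```python
from typing import Dict

GPUS_TOTAL = 1024        # figure from the trade reporting cited above
GPUS_PER_SYSTEM = 8      # a DGX H100 system houses eight H100 GPUs
MEMORY_PER_GPU_GB = 80   # assumption: the 80 GB H100 variant


def cluster_summary(gpus: int, gpus_per_system: int, mem_per_gpu_gb: int) -> Dict[str, float]:
    """Derive system count and aggregate GPU memory from a GPU count."""
    systems, leftover = divmod(gpus, gpus_per_system)
    return {
        "systems": systems,
        "leftover_gpus": leftover,
        # Decimal terabytes: 1 TB = 1000 GB.
        "total_gpu_memory_tb": gpus * mem_per_gpu_gb / 1000,
    }


summary = cluster_summary(GPUS_TOTAL, GPUS_PER_SYSTEM, MEMORY_PER_GPU_GB)
print(summary)
```

Under these assumptions the cluster corresponds to 128 eight-GPU systems with roughly 82 TB of aggregate GPU memory.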
The Virtual Biology Initiative is deliberately distinct from any specific model: it funds the data and infrastructure, while the models are built on top. CZI and its collaborators have already released several virtual-cell models that the initiative is designed to feed.
TranscriptFormer, posted as a bioRxiv preprint on April 25, 2025, is a generative, transformer-based model trained on single-cell transcriptomics from more than 110 million cells across 12 species, spanning roughly 1.5 billion years of evolution. Stephen Quake, then CZI's head of science, described it as "the equivalent of genome assembly for all the cell atlas data," and it supports tasks such as disease-state identification and gene-interaction prediction. As a foundation model for cells, it is meant to generalize across many biological problems.
GREmLN (Gene Regulatory Embedding-based Large Neural model), released in mid-2025 and developed with researchers at Columbia University, takes a more specialized, graph-aware approach. Instead of treating each gene as an isolated token, it integrates gene-regulatory network structure into its attention mechanism, learning embeddings that capture each gene's role in the broader network. It was initially trained on about 11 million single-cell RNA-seq profiles spanning 162 cell types across tissues including brain, lung, kidney, and blood, and is aimed at applications such as pinpointing cancer-associated cell states. These transformer- and graph-based systems are extensions of modern deep learning applied to molecular biology, and they rely on the same kind of neural network architectures used in language and vision.
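The idea of folding network structure into attention can be illustrated with a toy example. The sketch below adds a fixed bonus to the attention score of any gene pair linked in a regulatory graph; this additive-bias scheme is an assumption chosen for clarity and is not GREmLN's published formulation. The function `graph_biased_attention`, the `edge_bias` value, and the three-gene graph are hypothetical.

```python
import math
from typing import List


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def graph_biased_attention(
    queries: List[List[float]],
    keys: List[List[float]],
    adjacency: List[List[int]],
    edge_bias: float = 2.0,
) -> List[List[float]]:
    """Scaled dot-product attention weights with a bonus for linked genes.

    adjacency[i][j] == 1 means genes i and j are connected in the
    (hypothetical) regulatory network.
    """
    dim = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        logits = []
        for j, k in enumerate(keys):
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            if adjacency[i][j]:
                score += edge_bias  # graph structure nudges attention
            logits.append(score)
        weights.append(softmax(logits))
    return weights


# Three genes with identical embeddings, so unbiased attention is uniform.
embeddings = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
# Genes 0 and 1 are linked; gene 2 has no regulatory edges.
adjacency = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
attn = graph_biased_attention(embeddings, embeddings, adjacency)
for row in attn:
    print([round(w, 3) for w in row])
```

Because the embeddings are identical, any departure from uniform weights comes purely from the graph: gene 0 attends most strongly to its linked partner, gene 1, while the unlinked gene 2 spreads its attention evenly.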
Beyond single-cell models, Biohub released ESMFold 2 on May 27, 2026, an open protein-biology engine from the team led by Alex Rives, whose protein-modeling company EvolutionaryScale was absorbed into Biohub. ESMFold 2 builds on the ESM family of protein language models and shipped with an atlas of billions of predicted protein structures, positioning it as an open alternative to structure-prediction systems like AlphaFold. CZI's portfolio also includes rBio and the RNA model CodonFM. The initiative is not the only effort of its kind; the Arc Institute's State model and a 2025 Arc Virtual Cell Challenge reflect a wider race to build predictive cell models.
The Virtual Biology Initiative is notable for its scale, its open-data philosophy, and what it signals about the direction of one of the world's largest science philanthropies. By committing $500 million primarily to generating shared, open datasets rather than to a proprietary model, Biohub is treating data scarcity, not algorithms, as the binding constraint on AI for biology. That framing echoes broader debates in machine learning about whether progress is limited more by compute and data than by model design.
For CZI, the initiative consolidates a decade of work into a single coordinated push behind the virtual cell, aligning its in-house GPU cluster, its model portfolio, and a network of elite research institutions. If successful, the resulting datasets and models could let researchers simulate how cells transition from healthy to diseased states and test interventions computationally before bringing them to the bench. Skeptics note that AI biology datasets remain far below the scale Biohub itself estimates is necessary, and that virtual cell models are early and largely unvalidated for clinical use. The program's significance will ultimately depend on whether the open data it generates proves sufficient to train models that make reliable, testable biological predictions.