State (Arc Institute virtual cell model)

Updated Jul 17, 2026

State is a machine learning "virtual cell" model developed by the Arc Institute, a nonprofit biomedical research organization, to predict how cells respond to perturbations such as drugs, cytokines, and genetic edits. Released on June 23, 2025, State takes a starting cell transcriptome plus a specified perturbation and predicts the resulting shifts in RNA expression.^[1] It was the institute's first virtual cell model and followed its Evo 2 DNA language models, marking a shift from modeling genome sequences to modeling cellular behavior. State is built from two components: a transformer-based State Transition (ST) model that learns perturbation effects across sets of cells, and an optional State Embedding (SE) model that converts single-cell transcriptomes into numerical representations.^[1] The work was published as a preprint on bioRxiv with Yusuf Roohani as corresponding author and Arc co-founders Silvana Konermann and Patrick Hsu among the senior authors.^[3]^[4]

Background: Arc Institute and virtual cells

The Arc Institute is a nonprofit research organization based in Palo Alto, California. It was founded in 2021 by Patrick Hsu, a UC Berkeley bioengineer; Silvana Konermann, a Stanford biochemist who serves as executive director; and Patrick Collison, the chief executive of Stripe. It operates in partnership with Stanford University, UC Berkeley, and UCSF, and was launched with roughly 650 million dollars in initial funding so that scientists can pursue long-horizon research without writing grant applications.^[16]^[17] Hsu also leads the Evo line of DNA foundation models, so both of Arc's flagship AI efforts sit under closely related leadership.^[15]

A "virtual cell" is a computational model intended to simulate how a living cell behaves, including how it changes when it is perturbed. The idea sits at the intersection of AI for science, generative AI, and AI in healthcare. The motivation is practical: laboratory experiments that measure how thousands of genes or drugs affect cells are slow and expensive, so a model that can predict those outcomes could narrow down which experiments are worth running. State was built and distributed alongside Arc's broader Virtual Cell Initiative, which also produced the data resources the model was trained on.^[1]

The State model: architecture and training data

State is described by Arc as a multi-scale model that operates on sets of cells rather than one cell at a time, so it can capture both population-level and cell-level perturbation effects.^[2] It uses a modern transformer architecture and combines two interlocking modules.

The State Transition (ST) model is the core perturbation predictor. It uses a bidirectional transformer that applies self-attention over sets of cells, learning how a population of cells transitions across a learned manifold of cell states in response to a given perturbation.^[3] This set-level design distinguishes it from earlier perturbation models that make predictions for a single cell in isolation. The State Embedding (SE) model is an encoder that maps a single-cell transcriptome into a smooth vector space of embeddings; cells of the same type cluster together, and the representation is meant to be more robust to technical noise. The released SE checkpoint, SE-600M, is reported on Hugging Face at roughly 0.7 billion parameters.^[6] Arc also released task-specific ST checkpoints named ST-Tahoe and ST-Parse.^[7]
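The set-level attention idea can be illustrated with a minimal sketch. This is not Arc's implementation: the function names, the toy two-gene inputs, and the identity projection matrices are assumptions chosen for illustration, and the real ST model is far larger and has components not shown here. The sketch applies scaled dot-product self-attention to a set of cells with no positional encoding and no causal mask, so every cell attends to every other cell and reordering the input cells only reorders the outputs.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def set_self_attention(cells, w_q, w_k, w_v):
    """Bidirectional self-attention over a set of cells.

    cells: one row per cell, one column per feature (hypothetical toy input).
    w_q, w_k, w_v: projection matrices (learned in a real model, fixed here).
    """
    q = matmul(cells, w_q)
    k = matmul(cells, w_k)
    v = matmul(cells, w_v)
    scale = math.sqrt(len(q[0]))
    # Each cell scores every cell in the set, including itself (no mask).
    scores = [
        [sum(qi * ki for qi, ki in zip(q_row, k_row)) / scale for k_row in k]
        for q_row in q
    ]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of all cells' value vectors.
    return matmul(weights, v)


# Toy set of three cells with two features each (illustrative values only).
cells = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = set_self_attention(cells, identity, identity, identity)

# Shuffling the cells permutes the outputs in the same way.
order = [2, 0, 1]
shuffled = [cells[i] for i in order]
out_shuffled = set_self_attention(shuffled, identity, identity, identity)
```

Because nothing in the computation depends on a cell's position, the output for a given cell is the same wherever it appears in the input, which is the property that lets a model treat a group of cells as an unordered population.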

State was trained on two kinds of single-cell data. The SE model was trained on observational data from about 167 million human cells, meaning measurements taken without any deliberate intervention. The ST model was trained on perturbational data from more than 100 million cells across roughly 70 human cell lines or contexts.^[1]^[3] The perturbation datasets named in Arc's materials include Tahoe-100M, Parse-PBMC, and Replogle-Nadig. Arc described this as the largest body of single-cell perturbation data used to train any such model at the time. Much of the data came from the Arc Virtual Cell Atlas, a public resource that combines observational and perturbed cells. Tahoe-100M, contributed by Tahoe Therapeutics (formerly Vevo Therapeutics) with the Chan Zuckerberg Biohub, was open-sourced on February 25, 2025 and maps roughly 60,000 drug-cell interactions spanning 50 cancer cell lines and about 1,200 drug perturbations.^[12]^[13] The atlas also draws on scBaseCount, a corpus of around 200 million cells assembled from public sequencing data across 21 species.
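The Tahoe-100M figures quoted above are mutually consistent, as a short arithmetic check shows. The check assumes a full design in which every drug is profiled in every cell line, which the sources summarized here do not state explicitly, and both input figures are approximate.

```python
# Figures quoted in the text for Tahoe-100M (both approximate).
cell_lines = 50
drug_perturbations = 1_200

# Assumption: every drug is applied to every cell line,
# so the number of drug-cell pairs is the simple product.
drug_cell_interactions = cell_lines * drug_perturbations
print(drug_cell_interactions)  # 60000
```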

The following table summarizes the model's key facts.

Attribute	Detail
Name	State
Developer	Arc Institute (nonprofit)
Released	June 23, 2025^[1]
Type	Virtual cell model for perturbation response prediction
Components	State Transition (ST) and State Embedding (SE)
Architecture	Bidirectional transformer; self-attention over sets of cells
Released checkpoints	SE-600M (about 0.7B parameters), ST-Tahoe, ST-Parse^[6]
Observational training data	About 167 million human cells
Perturbational training data	More than 100 million cells across about 70 cell contexts
Key datasets	Tahoe-100M, Parse-PBMC, Replogle-Nadig, Arc Virtual Cell Atlas, scBaseCount
Predicts	RNA expression shifts after drugs, cytokines, or genetic perturbations
Preprint	bioRxiv 2025.06.26.661135 (corresponding author Yusuf Roohani)^[3]
Software	Python package, GitHub repo ArcInstitute/state, models on Hugging Face^[5]
License	Code CC BY-NC-SA 4.0; weights under a noncommercial Arc model license

The Virtual Cell Challenge

Alongside State, Arc announced the inaugural Virtual Cell Challenge in June 2025, a public competition designed to benchmark perturbation prediction and, in Arc's framing, to work toward a kind of Turing test for the virtual cell.^[8]^[9] The challenge was sponsored by NVIDIA, 10x Genomics, and Ultima Genomics, with a prize structure of 100,000 dollars for first place, 50,000 dollars for second, and 25,000 dollars for third, plus NVIDIA DGX Cloud credits.^[8]

Entrants had to predict how cells respond to genetic perturbations and, critically, to generalize to new cellular contexts. The evaluation used a newly generated dataset of roughly 300,000 H1 human embryonic stem cells subjected to about 300 genetic perturbations, and submissions were scored on three metrics: prediction of differentially expressed genes, discrimination between distinct perturbation effects, and overall expression count accuracy.^[11] Participants were allowed to train on public resources such as the Arc Virtual Cell Atlas, scBaseCount, Tahoe-100M, and X-Atlas/Orion. State served as the competition's baseline model.^[9]
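The three scoring ideas can be sketched in simplified form. The definitions below are assumptions made for illustration and are not the challenge's official formulas: the function names, the use of L1 distance, and the toy gene and perturbation labels are all hypothetical.

```python
from statistics import mean


def de_overlap(predicted_de, true_de):
    """Fraction of truly differentially expressed genes recovered by the prediction."""
    true_de = set(true_de)
    if not true_de:
        return 0.0
    return len(set(predicted_de) & true_de) / len(true_de)


def l1_distance(a, b):
    """Sum of absolute differences between two expression vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def discrimination_score(predicted, observed):
    """Average, over perturbations, of how well a prediction singles out its own effect.

    predicted, observed: dicts mapping perturbation name to an expression vector.
    A score of 1.0 means every predicted profile is closer to its own observed
    profile than to the observed profile of any other perturbation.
    """
    names = list(observed)
    scores = []
    for name in names:
        own = l1_distance(predicted[name], observed[name])
        closer = sum(
            1
            for other in names
            if other != name and l1_distance(predicted[name], observed[other]) < own
        )
        scores.append(1.0 - closer / len(names))
    return mean(scores)


def mean_absolute_error(predicted, observed):
    """Mean absolute difference across all perturbations and genes."""
    return mean(
        abs(p - o)
        for name in observed
        for p, o in zip(predicted[name], observed[name])
    )


# Toy data: two hypothetical knockouts measured over three genes.
observed = {"geneA_KO": [0.0, 2.0, 1.0], "geneB_KO": [3.0, 0.0, 1.0]}
predicted = {"geneA_KO": [0.5, 2.0, 1.0], "geneB_KO": [2.0, 0.0, 1.0]}

overlap = de_overlap({"G1", "G2", "G3"}, {"G2", "G3", "G4", "G5"})  # 0.5
disc = discrimination_score(predicted, observed)                    # 1.0
mae = mean_absolute_error(predicted, observed)                      # 0.25
```

The three numbers can disagree: a model may reproduce overall expression levels closely while failing to separate one perturbation from another, which is one reason for scoring submissions on several metrics rather than a single one.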

Arc reported strong participation: more than 5,000 people registered across 114 countries, over 1,200 teams submitted results, and more than 300 teams made final submissions.^[10] The grand prize went to a BioMap Research team for a model called xTrimoSCPerturb, which combined deep learning with classical statistics. A separate Generalist Prize went to a team from Altos Labs for a model called go-with-the-flow that achieved the highest average ranking across the metrics.^[10]

Performance and reception

On Arc's internal benchmarks, State improved discrimination of perturbation effects on large datasets by more than 50 percent and roughly doubled the accuracy of identifying truly differentially expressed genes compared with existing models.^[1] Arc also described State as the first model in this area to consistently beat simple linear baselines, a claim that matters because such baselines have historically been hard for deep learning models to surpass. The benchmark figures vary by dataset and by preprint version; the headline improvement is most often cited as over 50 percent for perturbation discrimination.

The claim about linear baselines touches on an active debate in the field. Multiple independent studies have argued that deep-learning perturbation predictors, including large foundation models, do not reliably outperform simple additive or linear baselines, and that apparent gains can shrink or vanish under stricter evaluation, on unseen perturbations, or across datasets.^[20] A line of 2026 critiques, including bioRxiv preprints arguing that virtual cells "need context, not just scale" and that current virtual cell models are of limited use for scientific discovery, contends that the main bottleneck is coverage of diverse biological contexts rather than model size.^[18]^[19] The Virtual Cell Challenge results echoed this: Arc itself noted that perturbation prediction models were not yet consistently beating naive baselines across all metrics, and the top-placing teams explicitly blended statistical features with deep learning rather than relying on neural networks alone.^[10]
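To make the notion of a simple baseline concrete, the following sketch shows two common forms. It is an illustration under stated assumptions, not the exact baseline used in any cited study: the function names and toy expression values are hypothetical. The additive baseline predicts a combined perturbation by adding each single perturbation's average shift to the control profile, and the mean-shift baseline predicts an unseen perturbation by adding the average shift seen in training.

```python
def mean_profile(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def shift(perturbed_cells, control_cells):
    """Average expression change caused by a perturbation, relative to control."""
    ctrl = mean_profile(control_cells)
    pert = mean_profile(perturbed_cells)
    return [p - c for p, c in zip(pert, ctrl)]


def additive_prediction(control_cells, shift_a, shift_b):
    """Predict a double perturbation as control plus the two single-perturbation shifts."""
    ctrl = mean_profile(control_cells)
    return [c + a + b for c, a, b in zip(ctrl, shift_a, shift_b)]


def mean_shift_prediction(control_cells, training_shifts):
    """Predict an unseen perturbation as control plus the average training shift."""
    ctrl = mean_profile(control_cells)
    avg = mean_profile(training_shifts)
    return [c + s for c, s in zip(ctrl, avg)]


# Toy data: two genes, two cells per condition (illustrative values only).
control = [[1.0, 2.0], [3.0, 2.0]]
perturbed_a = [[4.0, 2.0], [4.0, 2.0]]
perturbed_b = [[2.0, 1.0], [2.0, 1.0]]

shift_a = shift(perturbed_a, control)  # [2.0, 0.0]
shift_b = shift(perturbed_b, control)  # [0.0, -1.0]

double = additive_prediction(control, shift_a, shift_b)        # [4.0, 1.0]
unseen = mean_shift_prediction(control, [shift_a, shift_b])    # [3.0, 1.5]
```

Neither baseline has learned parameters beyond averages, which is why a deep model that fails to beat them on held-out perturbations draws scrutiny.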

Significance

State is notable as one of the first large transformer-based attempts to build a general perturbation-prediction model for cells, trained at a scale of hundreds of millions of single cells. It is distinct from Arc's Evo DNA models, which are genome language models that read and design DNA sequences at single-nucleotide resolution; State instead operates on cell transcriptomes and predicts expression responses, so the two represent different layers of biology rather than versions of the same system.^[15] By releasing the model, the training data through the Virtual Cell Atlas, and an open benchmark in the Virtual Cell Challenge, Arc helped define an evaluation framework for the emerging virtual cell field. The mixed competition outcomes and the surrounding critiques also clarified how far the field still has to go before such models reliably generalize to new contexts. In January 2026 Arc released a complementary single-cell model called Stack that learns new tasks at inference time without retraining, and the institute has said a successor, State 2, is in development and will build on lessons from both models.^[14]

References

1. Arc Institute. "Arc Institute's first virtual cell model: State." arcinstitute.org, June 23, 2025. https://arcinstitute.org/news/virtual-cell-model-state
2. Arc Institute. "Arc Virtual Cell Model: State." arcinstitute.org. https://arcinstitute.org/tools/state
3. Adduri, A.K., Gautam, D., et al. (corresponding author Y.H. Roohani). "Predicting cellular responses to perturbation across diverse contexts with State." bioRxiv 2025.06.26.661135. https://www.biorxiv.org/content/10.1101/2025.06.26.661135v1
4. bioRxiv API record for 10.1101/2025.06.26.661135 (full author list and abstract). https://api.biorxiv.org/details/biorxiv/10.1101/2025.06.26.661135
5. ArcInstitute/state. GitHub repository. https://github.com/ArcInstitute/state
6. Arc Institute. "SE-600M." Hugging Face model card. https://huggingface.co/arcinstitute/SE-600M
7. Arc Institute. "ST-Tahoe." Hugging Face model card. https://huggingface.co/arcinstitute/ST-Tahoe
8. GEN (Genetic Engineering & Biotechnology News). "Arc Institute Launches Virtual Cell Challenge to Accelerate AI Model Development." June 2025. https://www.genengnews.com/topics/artificial-intelligence/arc-institute-launches-virtual-cell-challenge-to-accelerate-ai-model-development/
9. Arc Institute. "Virtual Cell Challenge: Toward a Turing test for the virtual cell" (Cell). https://www.cell.com/cell/fulltext/S0092-8674(25)00675-0
10. Arc Institute. "Virtual Cell Challenge 2025 Wrap-Up: Winners and Reflections." https://arcinstitute.org/news/virtual-cell-challenge-2025-wrap-up
11. Arc Institute. "Behind the Data of the Virtual Cell Challenge." https://arcinstitute.org/news/behind-the-data-virtual-cell-challenge
12. Arc Institute. "Tahoe Open Sources Tahoe-100M ... as the Inaugural Contribution to Arc Institute's New Virtual Cell Atlas." February 25, 2025. https://arcinstitute.org/news/arc-vevo
13. Tahoe Bio. "Tahoe Therapeutics, Arc Institute, and Biohub Partner to Generate the Largest Perturbation Dataset for Virtual Cell Models." https://www.tahoebio.ai/news/tahoe-arc-and-biohub-partnership
14. Arc Institute. "Stack: Simulating cellular conditions via prompt engineering, without the need for fine-tuning." January 9, 2026. https://arcinstitute.org/news/foundation-model-stack
15. Arc Institute. "Evo 2: DNA Foundation Model." https://arcinstitute.org/tools/evo
16. Arc Institute. "About." https://arcinstitute.org/about
17. Wikipedia. "Arc Institute." https://en.wikipedia.org/wiki/Arc_Institute
18. "Virtual Cells Need Context, Not Just Scale." bioRxiv, 2026. https://www.biorxiv.org/content/10.64898/2026.02.04.703804v1.full
19. "Are Current AI Virtual Cell Models Useful for Scientific Discovery?" bioRxiv, 2026. https://www.biorxiv.org/content/10.64898/2026.04.23.719015v1.full
20. Ahlmann-Eltze, C., et al. "Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines." PMC. https://pmc.ncbi.nlm.nih.gov/articles/PMC12328236/
