State (Arc Institute virtual cell model)
Last reviewed: Jun 7, 2026
Sources: 20 citations
Review status: Source-backed
Revision: v1 · 1,767 words
State is a machine learning "virtual cell" model developed by the Arc Institute, a nonprofit biomedical research organization, to predict how cells respond to perturbations such as drugs, cytokines, and genetic edits. Released on June 23, 2025, State takes a starting cell transcriptome plus a specified perturbation and predicts the resulting shifts in RNA expression. It was the institute's first virtual cell model and followed its Evo 2 DNA language models, marking a shift from modeling genome sequences to modeling cellular behavior. State is built from two components: a transformer-based State Transition (ST) model that learns perturbation effects across sets of cells, and an optional State Embedding (SE) model that converts single-cell transcriptomes into numerical representations. The work was published as a preprint on bioRxiv with Yusuf Roohani as corresponding author and Arc co-founders Silvana Konermann and Patrick Hsu among the senior authors.
The Arc Institute is a nonprofit research organization based in Palo Alto, California, founded in 2021 by Patrick Hsu, a UC Berkeley bioengineer; Silvana Konermann, a Stanford biochemist who serves as executive director; and Patrick Collison, the chief executive of Stripe. It operates in partnership with Stanford University, UC Berkeley, and UCSF, and was launched with roughly 650 million dollars in initial funding so that scientists can pursue long-horizon research without writing grant applications. Patrick Hsu also leads the Evo line of DNA foundation models, which places both of Arc's flagship AI efforts under closely related leadership.
A "virtual cell" is a computational model intended to simulate how a living cell behaves, including how it changes when it is perturbed. The idea sits at the intersection of AI for science, generative AI, and AI in healthcare. The motivation is practical: laboratory experiments that measure how thousands of genes or drugs affect cells are slow and expensive, so a model that can predict those outcomes could narrow down which experiments are worth running. State was built and distributed alongside Arc's broader Virtual Cell Initiative, which also produced the data resources the model was trained on.
State is described by Arc as a multi-scale model that operates on sets of cells rather than one cell at a time, so it can capture both population-level and cell-level perturbation effects. It uses a modern transformer architecture and combines two interlocking modules.
The State Transition (ST) model is the core perturbation predictor. It uses a bidirectional transformer that applies self-attention over sets of cells, learning how a population of cells transitions across a learned manifold of cell states in response to a given perturbation. This set-level design distinguishes it from earlier perturbation models that make predictions for a single cell in isolation. The State Embedding (SE) model is an encoder that maps a single-cell transcriptome into a smooth vector space of embeddings; cells of the same type cluster together, and the representation is meant to be more robust to technical noise. The released SE checkpoint, SE-600M, is reported on Hugging Face at roughly 0.7 billion parameters. Arc also released task-specific ST checkpoints named ST-Tahoe and ST-Parse.
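The set-level idea can be illustrated with a toy sketch. The code below is not Arc's implementation and does not use the State package: it has no learned weights, uses identity query/key/value projections, and the function names `set_self_attention` and `predict_perturbed` are hypothetical. It only shows what unmasked (bidirectional) self-attention over a set of cell vectors means in practice: each output cell depends on every cell in the set, and reordering the input cells simply reorders the outputs.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def set_self_attention(cells):
    """Unmasked dot-product self-attention over a set of cell vectors.

    Toy assumption: queries, keys, and values are the cell vectors
    themselves (identity projections, no learned parameters).
    """
    dim = len(cells[0])
    scale = math.sqrt(dim)
    attended = []
    for query in cells:
        # Score this cell against every cell in the set, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in cells]
        weights = softmax(scores)
        # Weighted average of all cell vectors.
        attended.append(
            [sum(w * cell[j] for w, cell in zip(weights, cells)) for j in range(dim)]
        )
    return attended


def predict_perturbed(cells, perturbation_vec):
    """Hypothetical stand-in for a set-level predictor.

    Adds a perturbation vector to every cell, mixes information across
    the set with self-attention, then adds the result back onto the
    original cells as a residual-style update.
    """
    conditioned = [[c + p for c, p in zip(cell, perturbation_vec)] for cell in cells]
    mixed = set_self_attention(conditioned)
    return [[c + m for c, m in zip(cell, row)] for cell, row in zip(cells, mixed)]


control_cells = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
predicted = predict_perturbed(control_cells, [0.5, -0.5])
```

Because no positional information is attached to the cells, the operation treats its input as an unordered set, which is the property that separates a set-level predictor from one that processes each cell in isolation.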
State was trained on two kinds of single-cell data. The SE model was trained on observational data from about 167 million human cells, meaning measurements taken without any deliberate intervention. The ST model was trained on perturbational data from more than 100 million cells across roughly 70 human cell lines or contexts. The perturbation datasets named in Arc's materials include Tahoe-100M, Parse-PBMC, and Replogle-Nadig. Arc described this as the largest body of single-cell perturbation data used to train any such model at the time. Much of the data came from the Arc Virtual Cell Atlas, a public resource that combines observational and perturbed cells. Tahoe-100M, contributed by Tahoe Therapeutics (formerly Vevo Therapeutics) with the Chan Zuckerberg Biohub, was open-sourced on February 25, 2025 and maps roughly 60,000 drug-cell interactions spanning 50 cancer cell lines and about 1,200 drug perturbations. The atlas also draws on scBaseCount, a corpus of around 200 million cells assembled from public sequencing data across 21 species.
The following table summarizes the model's key facts.
| Attribute | Detail |
|---|---|
| Name | State |
| Developer | Arc Institute (nonprofit) |
| Released | June 23, 2025 |
| Type | Virtual cell model for perturbation response prediction |
| Components | State Transition (ST) and State Embedding (SE) |
| Architecture | Bidirectional transformer; self-attention over sets of cells |
| Released checkpoints | SE-600M (about 0.7B parameters), ST-Tahoe, ST-Parse |
| Observational training data | About 167 million human cells |
| Perturbational training data | More than 100 million cells across about 70 cell contexts |
| Key datasets | Tahoe-100M, Parse-PBMC, Replogle-Nadig, Arc Virtual Cell Atlas, scBaseCount |
| Predicts | RNA expression shifts after drugs, cytokines, or genetic perturbations |
| Preprint | bioRxiv 2025.06.26.661135 (corresponding author Yusuf Roohani) |
| Software | Python package, GitHub repo ArcInstitute/state, models on Hugging Face |
| License | Code CC BY-NC-SA 4.0; weights under a noncommercial Arc model license |
Alongside State, Arc announced the inaugural Virtual Cell Challenge in June 2025, a public competition designed to benchmark perturbation prediction and, in Arc's framing, to work toward a kind of Turing test for the virtual cell. The challenge was sponsored by NVIDIA, 10x Genomics, and Ultima Genomics, with a prize structure of 100,000 dollars for first place, 50,000 dollars for second, and 25,000 dollars for third, plus NVIDIA DGX Cloud credits.
Entrants had to predict how cells respond to genetic perturbations and, critically, to generalize to new cellular contexts. The evaluation used a newly generated dataset of roughly 300,000 H1 human embryonic stem cells subjected to about 300 genetic perturbations, and submissions were scored on three metrics: prediction of differentially expressed genes, discrimination between distinct perturbation effects, and overall expression count accuracy. Participants were allowed to train on public resources such as the Arc Virtual Cell Atlas, scBaseCount, Tahoe-100M, and X-Atlas/Orion. State served as the competition's baseline model.
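To make the three scoring ideas concrete, here is a simplified sketch. These functions are illustrative stand-ins, not the challenge's official scoring code: the names `mean_abs_error`, `discrimination_score`, and `de_overlap`, the use of mean absolute distance, and the toy profiles are all assumptions made for this example.

```python
def mean_abs_error(pred, true):
    """Average absolute difference between two expression profiles."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def discrimination_score(pred_profiles, true_profiles):
    """Toy perturbation-discrimination score in [0, 1].

    For each perturbation, measures the fraction of *other* true profiles
    that lie farther from the prediction than the matching true profile.
    A score of 1.0 means every prediction is closest to its own target.
    """
    names = list(true_profiles)
    scores = []
    for name in names:
        d_match = mean_abs_error(pred_profiles[name], true_profiles[name])
        others = [n for n in names if n != name]
        farther = sum(
            mean_abs_error(pred_profiles[name], true_profiles[o]) > d_match
            for o in others
        )
        scores.append(farther / len(others))
    return sum(scores) / len(scores)


def de_overlap(pred_de_genes, true_de_genes):
    """Fraction of truly differentially expressed genes that were predicted."""
    true_set = set(true_de_genes)
    if not true_set:
        return 0.0
    return len(true_set & set(pred_de_genes)) / len(true_set)


# Toy average ("pseudobulk") profiles over two genes for three perturbations.
true_profiles = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [2.0, 2.0]}
pred_profiles = {"A": [0.9, 0.1], "B": [0.1, 0.9], "C": [1.8, 2.1]}

disc = discrimination_score(pred_profiles, true_profiles)
overlap = de_overlap(["g1", "g2", "g3"], ["g2", "g3", "g4", "g5"])
mae = mean_abs_error([1.0, 2.0], [0.0, 4.0])
```

The three quantities reward different things: a model can match overall counts closely while still failing to tell perturbations apart, which is why a benchmark of this kind scores them separately.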
Arc reported strong participation: more than 5,000 people registered across 114 countries, over 1,200 teams submitted results, and more than 300 teams made final submissions. The grand prize went to a BioMap Research team for a model called xTrimoSCPerturb, which combined deep learning with classical statistics. A separate Generalist Prize went to a team from Altos Labs for a model called go-with-the-flow that achieved the highest average ranking across the metrics.
On Arc's internal benchmarks, State improved discrimination of perturbation effects on large datasets by more than 50 percent and roughly doubled the accuracy of identifying truly differentially expressed genes compared with existing models. Arc also described State as the first model in this area to consistently beat simple linear baselines, a claim that matters because such baselines have historically been hard for deep learning models to surpass. The benchmark figures vary by dataset and by preprint version; the headline improvement is most often cited as over 50 percent for perturbation discrimination.
The claim about linear baselines touches on an active debate in the field. Multiple independent studies have argued that deep-learning perturbation predictors, including large foundation models, do not reliably outperform simple additive or linear baselines, and that apparent gains can shrink or vanish under stricter evaluation, on unseen perturbations, or across datasets. A line of 2026 critiques, including bioRxiv preprints arguing that virtual cells "need context, not just scale" and that current virtual cell models are of limited use for scientific discovery, contends that the main bottleneck is coverage of diverse biological contexts rather than model size. The Virtual Cell Challenge results echoed this: Arc itself noted that perturbation prediction models were not yet consistently beating naive baselines across all metrics, and the top-placing teams explicitly blended statistical features with deep learning rather than relying on neural networks alone.
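For readers unfamiliar with what a "simple additive baseline" looks like, the sketch below shows one common form, a mean-shift baseline. It is an illustration under stated assumptions, not the specific baseline used in any of the cited studies or in Arc's benchmarks; the function names and the toy data are hypothetical.

```python
def mean_profile(cells):
    """Per-gene mean expression across a list of cells."""
    n = len(cells)
    return [sum(gene_values) / n for gene_values in zip(*cells)]


def fit_additive_baseline(train_control, train_perturbed):
    """Learn one shift vector per perturbation.

    train_perturbed maps a perturbation name to a list of perturbed cells.
    Each shift is mean(perturbed cells) minus mean(control cells).
    """
    control_mean = mean_profile(train_control)
    return {
        pert: [m - c for m, c in zip(mean_profile(cells), control_mean)]
        for pert, cells in train_perturbed.items()
    }


def predict_in_new_context(new_control, deltas, pert):
    """Predict the perturbed mean profile in an unseen context.

    Assumes the shift learned in the training context transfers
    unchanged: prediction = new control mean + learned shift.
    """
    base = mean_profile(new_control)
    return [b + d for b, d in zip(base, deltas[pert])]


# Toy data: two genes, one hypothetical perturbation "KO_X".
train_control = [[1.0, 2.0], [3.0, 2.0]]
train_perturbed = {"KO_X": [[4.0, 1.0], [4.0, 3.0]]}
new_control = [[10.0, 0.0], [12.0, 2.0]]

deltas = fit_additive_baseline(train_control, train_perturbed)
prediction = predict_in_new_context(new_control, deltas, "KO_X")
```

A baseline like this has no parameters beyond the averaged shifts, which is part of why it is treated as a minimum bar: a deep model that cannot beat it on held-out contexts has not demonstrated that it learned anything beyond the average effect.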
State is notable as one of the first large transformer-based attempts to build a general perturbation-prediction model for cells, trained at a scale of hundreds of millions of single cells. It is distinct from Arc's Evo DNA models, which are genome language models that read and design DNA sequences at single-nucleotide resolution; State instead operates on cell transcriptomes and predicts expression responses, so the two represent different layers of biology rather than versions of the same system. By releasing the model, the training data through the Virtual Cell Atlas, and an open benchmark in the Virtual Cell Challenge, Arc helped define an evaluation framework for the emerging virtual cell field. The mixed competition outcomes and the surrounding critiques also clarified how far the field still has to go before such models reliably generalize to new contexts. In January 2026 Arc released a complementary single-cell model called Stack that learns new tasks at inference time without retraining, and the institute has said a successor, State 2, is in development and will build on lessons from both models.