Evo 2 (genomic model)
Last reviewed
May 31, 2026
Sources
15 citations
Review status
Source-backed
Revision
v3 · 2,806 words
Evo 2 is a genomic foundation model built by the Arc Institute together with NVIDIA, Stanford University, and collaborators, and first released in February 2025. It learns the language of DNA directly from sequence, with no labels and no task-specific supervision, by training on sequence drawn from across the tree of life; the largest version saw roughly 9.3 trillion nucleotide tokens. That version has about 40 billion parameters and a context window of up to roughly 1 million base pairs at single-nucleotide resolution. From a single training objective, predicting the next nucleotide, Evo 2 can score the likely effect of mutations without any fine-tuning and can also generate new DNA at the scale of whole genes, chromosomes, and small genomes. The model, its training code, and the training data were all released openly, alongside an interpretability tool for inspecting what the network has learned. [1][2][3]
The project is the second generation of Evo. The first Evo model, published in Science in late 2024, was a 7 billion parameter model trained on about 300 billion nucleotides of bacterial, archaeal, and phage sequence with a context length of 131 kilobases. [4] Evo 2 is the much larger successor, trained on roughly 30 times more data and extended to cover eukaryotes, including plant, fungal, and human genomes. [1][5]
Most work on AI for biology has focused on proteins. Tools like AlphaFold predict the three-dimensional shape a protein folds into, which is one important step downstream of the genome. Evo 2 works one level up, at the DNA itself. The idea behind a genomic foundation model is the same idea that powers a large language model. Instead of training on text, you train a sequence model on long stretches of genetic code and ask it, over and over, to predict the next letter. The four DNA bases, A, C, G, and T, take the place of words. To do that prediction well across billions of examples, the network has to pick up on the statistical structure of real biology: where genes start and stop, which codons spell out which amino acids, how regulatory regions are arranged, and which changes a genome tolerates versus which ones break it. [1][6]
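The next-letter objective can be illustrated with a deliberately tiny model. The sketch below is not Evo 2 or anything close to it: it is a count-based model that conditions on only the previous two bases, and the function names and the training string are invented for illustration. It shows the shape of the task, which is to take a context and return a probability for each of A, C, G, and T.

```python
from collections import defaultdict

BASES = "ACGT"

def train_markov(sequences, k=2, pseudocount=1.0):
    """Count how often each base follows each length-k context."""
    counts = defaultdict(lambda: {b: pseudocount for b in BASES})
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1
    return counts

def next_base_probs(counts, context, pseudocount=1.0):
    """Return P(next base | context) over A, C, G, T."""
    # Unseen contexts fall back to a uniform distribution.
    row = counts.get(context) or {b: pseudocount for b in BASES}
    total = sum(row.values())
    return {b: row[b] / total for b in BASES}

# Toy training data: a repeating ATG pattern (illustrative only).
model = train_markov(["ATGATGATGATG"], k=2)
probs = next_base_probs(model, "AT")
print(max(probs, key=probs.get), probs["G"])  # G 0.625
```

A model like Evo 2 replaces the lookup table with a deep network and the two-base context with up to about a million bases, but the input and output have the same form.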
Because the model is trained only to predict sequence, the things it learns are general rather than tied to one assay. The same network that estimates how surprising a given stretch of DNA is can be turned toward judging whether a mutation looks harmful, toward annotating an unfamiliar genome, or toward writing fresh sequence that follows the rules of life. This is the generative AI recipe applied to genomes, and it is what lets one model serve as a base for many biological tasks. [2][6]
Evo 2 was trained on OpenGenome2, a dataset the team assembled. It contains about 8.8 trillion nucleotides of curated, non-redundant sequence pulled from public databases, spanning bacteria, archaea, eukarya, and bacteriophage. [6][7] In other words, it reaches across all domains of life rather than focusing on a single branch, which is the headline claim of the work. The collection draws on more than 128,000 genomes and is roughly 30 times larger than the data behind the original Evo. [1][3]
The two released model sizes saw different amounts of this data during training. The 7 billion parameter model was trained on about 2.4 trillion tokens, while the 40 billion parameter model was trained on about 9.3 trillion tokens, passing over the curated sequence using a staged curriculum rather than reading each base exactly once. [6][7] Training ran in two phases. An initial pretraining phase used a shorter window and weighted the data toward functional genetic elements such as genes and their regulatory neighborhoods. A second midtraining phase then stretched the context length out toward 1 million base pairs so the model could pick up longer-range structure that only becomes visible across very long stretches of a chromosome. [6][8]
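The relationship between tokens seen and corpus size is simple arithmetic on the figures quoted above. The sketch assumes one token per nucleotide, which matches the single-nucleotide resolution described for the model; the function name is illustrative.

```python
# Figures quoted in the text; one token is assumed to equal one nucleotide.
CORPUS_NUCLEOTIDES = 8.8e12
TOKENS_SEEN = {"7B": 2.4e12, "40B": 9.3e12}

def corpus_equivalents(tokens, corpus=CORPUS_NUCLEOTIDES):
    """Training tokens expressed as a multiple of the corpus size."""
    return tokens / corpus

for name, tokens in TOKENS_SEEN.items():
    print(f"{name}: {corpus_equivalents(tokens):.2f} corpus-equivalents")
```

The ratio is an average only. Because the curriculum weights some regions more heavily than others, a value near 1 does not mean every base was read exactly once.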
The data was also filtered with safety in mind. Sequences from viruses that infect humans and other eukaryotes were deliberately left out of OpenGenome2, a choice the team made to reduce the chance that the model could help design dangerous pathogens. [9][10]
Evo 2 does not use a standard Transformer. A Transformer relies on self-attention, whose cost grows with the square of the sequence length, which makes a context of a million tokens very expensive. Evo 2 instead uses an architecture called StripedHyena 2, which descends from the Hyena line of work and is related to the broader family of state space models and long-convolution sequence layers. The point of these designs is to handle very long inputs with compute and memory that scale far more gently than attention does. [6][8]
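The scaling argument can be made concrete with rough operation counts. The sketch below compares the number of token pairs that full self-attention must consider with an n log n step count, using FFT-based convolution only as a familiar example of sub-quadratic cost. It ignores constant factors, hardware effects, and the fact that StripedHyena 2 mixes several operator types, so the numbers indicate growth rates rather than measured speedups.

```python
import math

def attention_pairs(length):
    """Token pairs compared by full self-attention: grows as length squared."""
    return length * length

def fft_conv_steps(length):
    """Rough step count for an FFT-based long convolution: length * log2(length)."""
    return length * math.log2(length)

for length in (8_192, 131_072, 1_000_000):
    ratio = attention_pairs(length) / fft_conv_steps(length)
    print(f"{length:>9,} tokens: attention/convolution step ratio ~ {ratio:,.0f}")
```

Doubling the context quadruples the attention count but only slightly more than doubles the convolution count, which is why the gap widens so sharply at a million tokens.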
StripedHyena 2 is described as a convolutional multi-hybrid. Rather than a single repeated block, it interleaves several kinds of operator so that each can specialize. The published layout mixes short explicit hyena filters, which capture local patterns like codons; medium regularized filters for intermediate structure such as the boundaries between coding and non-coding regions; and long implicit filters for sweeping, genome-scale dependencies. A smaller amount of attention is woven in as well. [6][11] According to the developers, this mix let Evo 2 train nearly three times faster than an optimized Transformer of comparable scale on long genomic sequences, which is what made a million-base context practical. [3][8] The models were trained using an open framework called Savanna, running on roughly 2,000 NVIDIA H100 GPUs provided through NVIDIA DGX Cloud on AWS. [3][9]
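The difference between an explicit short filter and an implicit long filter can be sketched with a plain causal convolution. Everything specific here is an assumption made for illustration: the three-tap weights, the exponential-decay formula for the long filter, and the function names are not taken from the published architecture, which also adds gating, multiple channels, and learned parameterizations.

```python
def causal_conv(signal, kernel):
    """y[t] = sum_j kernel[j] * signal[t - j], using only past and present inputs."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            if t - j < 0:
                break
            acc += w * signal[t - j]
        out.append(acc)
    return out

def implicit_long_kernel(length, decay=0.9):
    """A long filter produced by a formula instead of stored weights
    (hypothetical parameterization, chosen only for illustration)."""
    return [decay ** j for j in range(length)]

signal = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]        # a single impulse at position 0
short = causal_conv(signal, [0.5, 0.3, 0.2])   # explicit 3-tap filter, codon-sized
long_ = causal_conv(signal, implicit_long_kernel(len(signal)))
print(short)  # the impulse influences only the next three positions
print(long_)  # the impulse influences every later position, with decaying weight
```

The short filter needs three stored weights and sees three positions. The long filter reaches across the whole input while being described by a single decay parameter, which is the sense in which a filter can be long without being expensive to store.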
| Property | Detail |
|---|---|
| Developers | Arc Institute, NVIDIA, Stanford University, UC Berkeley, UC San Francisco, Goodfire |
| First release | February 19, 2025 |
| Peer-reviewed publication | Nature, 2026 (volume 652, pp. 1349 to 1361) |
| Model type | Autoregressive genomic foundation model (DNA) |
| Largest model size | About 40 billion parameters |
| Smaller released sizes | 7 billion and 1 billion parameters (later checkpoints include 20 billion) |
| Architecture | StripedHyena 2 (convolutional multi-hybrid) |
| Resolution | Single nucleotide |
| Maximum context length | Up to about 1 million base pairs |
| Training dataset | OpenGenome2, about 8.8 trillion nucleotides |
| Tokens seen (40B model) | About 9.3 trillion |
| Genomes covered | More than 128,000, across all domains of life |
| Training hardware | About 2,000 NVIDIA H100 GPUs via DGX Cloud on AWS |
| Training framework | Savanna |
| License | Apache 2.0 |
| Open release | Model weights, training and inference code, OpenGenome2 data, interpretability visualizer |
The capability that drew the most attention is zero-shot variant-effect prediction. The method is simple in outline. Evo 2 assigns a likelihood to a DNA sequence, so you can score a healthy reference sequence and then score the same sequence with a single base changed. A change that makes the sequence much less likely under the model tends to be one that biology rarely tolerates, which is a signal that the mutation disrupts function. Because this comes straight from the pretrained model, there is no separate classifier to train for each gene. [12][6]
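The scoring procedure can be sketched in a few lines. The `toy_model` function below is a hypothetical stand-in for Evo 2 that simply expects a repeating ATG pattern; the scoring functions accept any callable that maps a prefix to next-base probabilities. This illustrates the idea of comparing likelihoods and is not the project's own scoring code.

```python
import math

def sequence_log_likelihood(seq, next_base_probs):
    """Sum of log P(base | preceding bases) under an autoregressive model."""
    total = 0.0
    for i, base in enumerate(seq):
        total += math.log(next_base_probs(seq[:i])[base])
    return total

def variant_score(reference, position, alt_base, next_base_probs):
    """Log-likelihood of the mutated sequence minus that of the reference."""
    variant = reference[:position] + alt_base + reference[position + 1:]
    return (sequence_log_likelihood(variant, next_base_probs)
            - sequence_log_likelihood(reference, next_base_probs))

def toy_model(prefix):
    """Hypothetical stand-in for Evo 2: strongly expects a repeating ATG pattern."""
    expected = "ATG"[len(prefix) % 3]
    return {b: (0.91 if b == expected else 0.03) for b in "ACGT"}

reference = "ATGATGATG"
print(variant_score(reference, 4, "C", toy_model))  # negative: the change is unexpected
print(variant_score(reference, 4, "T", toy_model))  # 0.0: same base as the reference
```

A strongly negative score means the model finds the mutated sequence much less plausible than the reference. With a real model, a single substitution also changes the probabilities of every downstream base, so the whole window is rescored, as the function above already does.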
The team highlighted the BRCA1 gene, where harmful mutations raise the risk of breast and ovarian cancer. On the task of telling benign variants apart from pathogenic ones, Evo 2 reached over 90 percent accuracy, and it was especially strong on the harder case of non-coding BRCA1 variants that most tools struggle with. A classifier built on Evo 2 embeddings reached an AUROC of about 0.95 on the BRCA1 test set, beating supervised predictors made specifically for splicing. [1][12] Patrick Hsu, one of the senior authors, framed the result by noting that Evo 2 is roughly the second-best model for coding mutations but state of the art for non-coding ones, which is where existing variant tools tend to fall down. [13] The model scored both coding and non-coding variants well across BRCA1 and BRCA2, which suggests it is a reasonably well-calibrated predictor of human variant effects even though it was never trained on a list of known-bad mutations. The same machinery extends beyond humans. Evo 2 predicts the functional impact of mutations in bacterial and other eukaryotic genomes. [5][6]
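AUROC, the metric quoted above, is the probability that a randomly chosen positive example (here, a pathogenic variant) receives a higher score than a randomly chosen negative one (a benign variant). The sketch below computes it directly from that definition. The labels and scores are made up for illustration; with a likelihood-based predictor, the score might be the negated change in log-likelihood so that higher means more damaging.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = pathogenic, 0 = benign.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.2, 0.1]
print(auroc(labels, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a figure near 0.95 means the classifier ranks a pathogenic variant above a benign one in about 95 percent of such pairs.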
Evo 2 is also a generative model. Prompted with a starting context, it will continue the sequence one base at a time, and it can keep going long enough to produce genome-scale output. In the paper the team generated sequences at the scale of whole genomes. Prompted with the start of the human mitochondrial genome, the model produced roughly 16 kilobase mitochondrial genomes with the right complement of transfer RNAs, ribosomal RNAs, and protein-coding genes, and proteins folded from those sequences by AlphaFold 3 resembled their natural counterparts. From a fragment of Mycoplasma genitalium the model extended out a minimal bacterial genome of about 580 kilobases, and from a piece of a yeast chromosome it generated about 330 kilobases of eukaryotic sequence complete with introns, promoters, and tRNA clusters. [1][6][13] The generated sequences are not just random strings that happen to use the right letters. They reproduce realistic features such as gene-like coding regions and plausible overall organization, though the bacterial design was missing some elements it would need to actually function. [6][13]
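Base-by-base continuation is the standard autoregressive sampling loop. The sketch below uses a hypothetical `toy_model` in place of Evo 2 and an optional temperature parameter; neither the names nor the settings come from the Evo 2 codebase.

```python
import random

def generate(prompt, next_base_probs, n_new, rng, temperature=1.0):
    """Extend prompt by n_new bases, sampling each from the model's distribution."""
    seq = prompt
    for _ in range(n_new):
        probs = next_base_probs(seq)
        bases = list(probs)
        # Temperature below 1 sharpens the distribution; above 1 flattens it.
        weights = [probs[b] ** (1.0 / temperature) for b in bases]
        seq += rng.choices(bases, weights=weights, k=1)[0]
    return seq

def toy_model(prefix):
    """Hypothetical stand-in for Evo 2: favors a repeating ATG pattern."""
    expected = "ATG"[len(prefix) % 3]
    return {b: (0.91 if b == expected else 0.03) for b in "ACGT"}

print(generate("ATG", toy_model, 12, random.Random(0)))
```

Each new base is appended to the context before the next prediction, so errors and choices made early carry forward. That is one reason long generated sequences can look locally realistic while still missing elements a working genome would need.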
A further result combined generation with guidance, which the authors describe as among the first inference-time scaling results in biology. The team used separate predictors of chromatin accessibility, models in the Enformer and Borzoi family, as the scoring function in a beam search over candidate sequences, and in this way steered Evo 2 to design DNA whose predicted epigenome followed a chosen pattern. In one demonstration they laid out accessible and closed regions to spell short messages in Morse code, such as ARC and EVO2, across the epigenome, with predicted and measured accessibility agreeing at an AUROC of roughly 0.92 to 0.95. The demonstration shows that the model can be aimed at a specific functional readout rather than only sampling freely. [6][10]
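Guided generation of this kind follows the general pattern of beam search with an external scorer. In the sketch below both components are toy stand-ins. `propose` offers every single-base extension, where the described setup draws candidates from Evo 2, and `score` treats G and C as "open" and A and T as "closed", where the described setup uses accessibility predictors. That G/C rule is an arbitrary proxy chosen so the example runs, not a biological claim, and the beam width and step size are likewise illustrative.

```python
def beam_search(prompt, propose, score, steps, beam_width):
    """Keep the beam_width best partial sequences at each step, ranked by an external score."""
    beams = [prompt]
    for _ in range(steps):
        candidates = [seq + ext for seq in beams for ext in propose(seq)]
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Hypothetical target pattern: 1 = "open" region, 0 = "closed" region, one entry per base.
TARGET = [1, 1, 1, 0, 0, 0, 1, 1, 1]

def propose(seq):
    """Stand-in for the generator: offer every single-base extension."""
    return ["A", "C", "G", "T"]

def score(seq):
    """Stand-in for an accessibility predictor: count positions matching TARGET."""
    return sum((base in "GC") == bool(t) for base, t in zip(seq, TARGET))

best = beam_search("", propose, score, steps=len(TARGET), beam_width=4)
print(best, score(best))
```

Spending more compute at inference time, through wider beams or more candidates per step, gives the search more chances to find sequences that match the target, which is the sense in which this counts as inference-time scaling.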
Because Evo 2 was released openly, the team could look inside it. Working with the interpretability company Goodfire, they trained sparse autoencoders on the model's internal activations, a technique from mechanistic interpretability that pulls apart tangled representations into individual, human-readable features. Many of the features that emerged line up with real biology that no one labeled for the model. The autoencoders surfaced units that fire at exon and intron boundaries, units that track protein secondary structure such as alpha helices and beta sheets, and units tied to other functional elements including transcription factor binding motifs, mobile genetic elements, and prophage regions. One feature even activated on the spacer sequences inside CRISPR arrays, the genetic memory bacteria keep of past viral infections. [6][11][13] These learned features turn out to be portable across species. The team built a single-nucleotide exon classifier on Evo 2 features that reached AUROC values from about 0.82 to 0.99 across organisms, and used features learned partly from primate and mouse genomes to annotate the exon and intron structure of the woolly mammoth genome, an extinct species the model never saw in training. [5][6] The team paired the work with a public visualizer so that others can browse the features the model uses, which is part of why Evo 2 is interesting as an interpretability target and not only as a predictor. [2][3]
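A sparse autoencoder re-expresses an activation vector as a combination of a larger set of features, most of which are zero for any given input. The text does not say which variant was trained on Evo 2, so the sketch below assumes a common form: a ReLU encoder, a linear decoder, and an L1 penalty that encourages sparsity. The weights are hand-picked rather than learned, and all names are illustrative.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation into non-negative features, then reconstruct it."""
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps only positive evidence
    recon = [a + b for a, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff):
    """Reconstruction error plus an L1 sparsity penalty (features are non-negative)."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    return mse + l1_coeff * sum(features)

# Hand-picked weights (hypothetical): a 2-dimensional activation, 4 candidate features.
x = [1.0, -1.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b_dec = [0.0, 0.0]

features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, recon, l1_coeff=0.1)
print(features)  # only two of the four features are active
print(recon, loss)
```

In interpretability work, the interesting step comes after training: each feature is inspected to see which inputs activate it, which is how units tied to exon boundaries or CRISPR spacers are identified.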
Evo 2 came out of a collaboration led by the Arc Institute and NVIDIA, with contributors from Stanford University, the University of California, Berkeley, and the University of California, San Francisco, plus Goodfire on the interpretability side. [1][9] NVIDIA supplied the compute through DGX Cloud and worked on the model with the Arc team, and the result is distributed through NVIDIA's BioNeMo platform, including as a hosted NIM microservice, and through a web tool called Evo Designer that lets people generate sequences interactively. [3][9]
The release is unusually complete for a model of this size. The weights, the training and inference code, and the full OpenGenome2 dataset were all made available under an Apache 2.0 license, the kind of open-source AI posture that lets outside groups reproduce, audit, and build on the work. [2][7] Several model checkpoints were published, ranging from a 1 billion parameter version up to the 40 billion parameter flagship, with shorter-context base variants alongside the full long-context models, and later checkpoints added a 20 billion parameter option. [7] In 2026 the work was peer reviewed and published in Nature, having first appeared as a preprint in February 2025. [5][14]
Evo 2 matters as a scaling result for biology. It takes the recipe that worked for language, a large model trained with self-supervision on a very large and diverse corpus, and shows that it transfers to genomes spanning every domain of life rather than to one narrow slice. The same base model handles prediction and design across DNA, RNA, and protein-coding sequence, which points toward general-purpose tools for reading and writing genetic information. [2][6] For the field of AI for science it is also a notable open release, since the data, code, and weights together make it a shared resource that other labs can extend rather than a closed product. [3][7]
For the deep learning community specifically, Evo 2 is one of the clearest large-scale demonstrations that an architecture built mostly from convolutional operators, with only a small amount of attention, can reach frontier scale. Its use of StripedHyena 2 over a standard Transformer is what made the million-base context tractable, and that choice is part of an ongoing line of deep learning research into sequence layers that scale better than self-attention for very long inputs. [8][11]
Evo 2 is a strong predictor of variant effects, but it is not a diagnostic, and its likelihood scores are a statistical proxy for function rather than a measured experimental result. Performance varies by gene and by the type of variant, and predictions still need laboratory or clinical validation before anyone acts on them. [12] On the generative side, designing realistic genome-scale sequences remains hard, and independent analyses have questioned how closely model-generated genomes match the fine-grained statistics of natural ones, so the design results are best read as early progress rather than finished biological parts. [15]
The biosafety question is real and the team addressed it directly. A model that can read and write DNA across all of life could in principle be misused to help engineer pathogens, so the developers excluded viruses that infect humans and other eukaryotes from the training data and then red-teamed the released model. In those tests the sequences Evo 2 produced for pathogenic viral proteins were effectively random, meaning the model does not appear to offer a useful shortcut for designing such agents. [9][10] The authors are candid that this is a starting point. As genomic foundation models grow more capable, keeping them safe will take continued alignment work and governance, which is one reason the open and documented nature of the Evo 2 release is meant to support outside scrutiny rather than replace it. [9][10]