AlphaGenome
Last reviewed
May 16, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 3,017 words
AlphaGenome is a unified deep learning model developed by Google DeepMind that predicts thousands of functional genomic properties from raw DNA sequences. The model accepts up to one million base pairs of DNA as input and produces predictions at single-base-pair resolution across eleven distinct molecular modalities, including gene expression, RNA splicing, chromatin accessibility, histone modifications, transcription factor binding, and three-dimensional chromatin contacts. AlphaGenome was announced on June 25, 2025 through a DeepMind blog post and an accompanying bioRxiv preprint, and the peer-reviewed research describing the system was published in Nature on January 28, 2026 under the title "Advancing regulatory variant effect prediction with AlphaGenome."
The model is the direct successor to Enformer, DeepMind's earlier genomic deep learning system released in 2021, and represents a substantial step forward in the field of computational genomics. In a head-to-head evaluation against the strongest available external models, AlphaGenome matched or exceeded the state of the art on 25 of 26 variant effect prediction benchmarks and on 22 of 24 sequence track prediction benchmarks. It is the first publicly described model capable of jointly predicting all of the assayed regulatory modalities at single-base resolution from a single shared backbone, and it can score the impact of an individual genetic variant on every modality in roughly one second on a modern GPU. AlphaGenome is available to non-commercial researchers through a free API and an accompanying Python software development kit.
The system is positioned within DeepMind's broader portfolio of structural and functional biology models, which includes AlphaFold for protein structure prediction, AlphaFold 3 for biomolecular complex prediction, and AlphaMissense for the pathogenicity of protein-coding missense variants. Whereas AlphaMissense focuses on the roughly two percent of the human genome that encodes proteins, AlphaGenome is designed to interpret the remaining ninety-eight percent, the non-coding regions that regulate when, where, and how strongly genes are expressed.
The human genome contains approximately three billion base pairs of DNA. Of these, only a small fraction, roughly two percent, directly encodes proteins. The remaining non-coding regions were historically underexplored, but a substantial body of work over the past two decades, including the ENCODE Project, the Roadmap Epigenomics Mapping Consortium, GTEx, and FANTOM5, has established that non-coding DNA carries enhancers, promoters, insulators, splice regulatory elements, and many other features that control gene regulation. The overwhelming majority of disease-associated genetic variants identified by genome-wide association studies fall in non-coding regions, yet interpreting them mechanistically remains difficult because their effects are typically subtle, context dependent, and mediated through long-range regulatory interactions.
Machine learning models that map DNA sequence to functional output have become a central tool for this interpretation problem. Early convolutional networks such as DeepSEA, Basset, and Basenji learned to predict assays like DNase-seq, ChIP-seq, and CAGE-seq from windows of sequence ranging from under a kilobase to roughly 131 kilobases. The 2021 release of Enformer extended this paradigm by combining convolutional layers with transformer blocks, enabling the model to consider context up to 196,608 base pairs and capture longer-range enhancer-promoter interactions. The Borzoi model released by the Calico research team in 2024 further enlarged the input to roughly half a megabase and added RNA-seq coverage prediction, while ChromBPNet, ProCapNet, and Orca specialized in chromatin accessibility, transcription initiation, and three-dimensional contact maps respectively. SpliceAI and Pangolin set strong baselines for splice site prediction.
Despite this progress, the field lacked a single model that could jointly predict every major regulatory modality at base-pair resolution, span the megabase-scale context relevant for distal regulation, and run quickly enough to be applied to large variant catalogs. AlphaGenome was designed to fill all three of these gaps simultaneously.
AlphaGenome uses a hybrid architecture that combines convolutional neural network layers with transformer blocks, organized as a U-Net-style encoder-decoder backbone with skip connections. The design produces output representations at three different scales, allowing each modality to be predicted at the resolution that is biologically meaningful for it.
The encoder consumes a one-megabase DNA sequence encoded as a one-hot matrix and progressively downsamples it through stacked convolutional blocks interleaved with pooling operations. Local sequence patterns such as transcription factor binding motifs, splice donor and acceptor consensus sequences, and short core promoter elements are captured by these convolutional layers. After downsampling, the sequence representation enters a transformer tower that operates at 128-base-pair resolution. Multi-head self-attention in this stage allows the model to share information across positions separated by hundreds of thousands of base pairs, which is essential for capturing enhancer-promoter interactions, insulator effects, and other distal regulatory phenomena.
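The input encoding and the downsampling arithmetic described above can be sketched in a few lines. This is only an illustration: the A/C/G/T channel order, the all-zero row for unknown bases, and the assumption that each pooling step halves the sequence length are choices made for the example, not details confirmed by the source.

```python
# Assumed channel order; the source does not specify one.
ALPHABET = "ACGT"


def one_hot(sequence: str) -> list[list[int]]:
    """Encode a DNA string as one 4-element row per base.

    Bases outside ALPHABET (for example N) become an all-zero row,
    which is an assumption made for this sketch.
    """
    return [
        [1 if base == letter else 0 for letter in ALPHABET]
        for base in sequence.upper()
    ]


def halvings_needed(input_resolution_bp: int, target_resolution_bp: int) -> int:
    """Count stride-2 pooling steps between two power-of-two resolutions."""
    ratio = target_resolution_bp // input_resolution_bp
    if ratio < 1 or ratio & (ratio - 1) != 0:
        raise ValueError("resolutions must differ by a power of two")
    return ratio.bit_length() - 1


encoded = one_hot("ACGTN")
pool_steps = halvings_needed(1, 128)        # halvings from 1 bp to 128 bp
positions_at_128bp = 1_048_576 // 128       # sequence length seen by the transformer tower
```

Under these assumptions, the transformer tower attends over a few thousand positions rather than a million, which is what makes self-attention across the full window tractable.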
After the transformer tower, a decoder upsamples the representation back toward base-pair resolution through transposed convolutions, with skip connections from the encoder injecting fine-grained local features at each level. The decoder produces three sets of internal embeddings: one-dimensional embeddings at 1-base-pair resolution, one-dimensional embeddings at 128-base-pair resolution, and two-dimensional embeddings at 2,048-base-pair resolution that are used for predicting pairwise chromatin contact maps.
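To make the three embedding scales concrete, the following sketch computes the spatial shape of each one for a full-length input. The dictionary keys are hypothetical labels, and the channel dimension is left out because the source does not state embedding widths.

```python
SEQUENCE_LENGTH_BP = 1_048_576  # maximum input length stated in the article


def embedding_shapes(
    sequence_length_bp: int = SEQUENCE_LENGTH_BP,
) -> dict[str, tuple[int, ...]]:
    """Spatial shape of each embedding scale (channel dimension omitted)."""
    bins_128 = sequence_length_bp // 128
    bins_2048 = sequence_length_bp // 2_048
    return {
        "1d_1bp": (sequence_length_bp,),       # one position per base
        "1d_128bp": (bins_128,),               # one position per 128 bp bin
        "2d_2048bp": (bins_2048, bins_2048),   # pairwise grid for contact maps
    }


shapes = embedding_shapes()
for name, shape in shapes.items():
    print(name, shape)
```

The two-dimensional embedding is square because every 2,048 bp bin is paired with every other bin in the window.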
To handle a one-megabase input within available accelerator memory, AlphaGenome partitions the sequence dimension across eight interconnected tensor processing unit (TPU) devices during training. Each TPU processes a 131,072-base-pair chunk of the full sequence, with communication between devices implementing the attention operations across chunk boundaries. This sequence parallelism design is what allows the model to combine a megabase-scale context with single-base resolution outputs without sacrificing either.
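The partitioning arithmetic can be checked with a short sketch. It only computes which base-pair range each device would hold; the cross-device communication that implements attention across chunk boundaries is not modeled, and the function name is hypothetical.

```python
def sequence_chunks(sequence_length_bp: int, num_devices: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) base-pair ranges, one per device."""
    if sequence_length_bp % num_devices != 0:
        raise ValueError("sequence length must divide evenly across devices")
    chunk = sequence_length_bp // num_devices
    return [(i * chunk, (i + 1) * chunk) for i in range(num_devices)]


chunks = sequence_chunks(1_048_576, 8)
chunk_sizes = {end - start for start, end in chunks}
```

Every chunk has the same size, which matches the 131,072 bp per device figure given above.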
A set of modality-specific output heads sits on top of the shared backbone. Each head maps the appropriate internal embedding to one or more genomic tracks. In total the production model predicts 5,930 human genomic tracks and 1,128 mouse genomic tracks across the eleven supported modalities.
AlphaGenome accepts a single contiguous DNA sequence of up to 1,048,576 base pairs and produces a multi-track output spanning eleven assay types. The table below summarizes the modalities, the typical experimental assays they correspond to, and the resolution at which AlphaGenome makes predictions for each.
| Modality | Underlying assay | Output resolution |
|---|---|---|
| Gene expression | RNA-seq coverage | 1 bp |
| Transcription initiation | CAGE-seq | 1 bp |
| Nascent transcription | PRO-cap | 1 bp |
| Splice sites | Splice donor and acceptor probabilities | 1 bp |
| Splice site usage | Fractional usage across cell types | 1 bp |
| Splice junctions | Junction coordinates and strength | Junction-level |
| DNA accessibility (DNase) | DNase-seq | 1 bp |
| DNA accessibility (ATAC) | ATAC-seq | 1 bp |
| Histone modifications | ChIP-seq for histone marks | 128 bp |
| Transcription factor binding | ChIP-seq for sequence-specific TFs | 128 bp |
| Chromatin contact maps | Hi-C and Micro-C | 2,048 bp pairwise |
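One practical consequence of the mixed resolutions is that the same genomic offset falls into a different output bin for each modality. The sketch below copies the resolutions from the table and maps an offset within the input window to a bin index. The modality keys are hypothetical names, not identifiers from the AlphaGenome SDK, and splice junctions are left out because they are reported per junction rather than per bin.

```python
# Output resolution in base pairs, copied from the table above.
RESOLUTION_BP = {
    "gene_expression": 1,
    "transcription_initiation": 1,
    "nascent_transcription": 1,
    "splice_sites": 1,
    "splice_site_usage": 1,
    "dnase_accessibility": 1,
    "atac_accessibility": 1,
    "histone_modifications": 128,
    "tf_binding": 128,
    "chromatin_contact_maps": 2_048,
}


def bin_index(offset_bp: int, modality: str) -> int:
    """Index of the output bin containing a 0-based offset in the input window."""
    if offset_bp < 0:
        raise ValueError("offset must be non-negative")
    return offset_bp // RESOLUTION_BP[modality]


offset = 300_000
bins = {name: bin_index(offset, name) for name in RESOLUTION_BP}
```

For contact maps, a pair of offsets would be mapped to a pair of bin indices, giving one cell of the pairwise grid.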
The inclusion of explicit splice junction modeling is a notable innovation. Earlier sequence models typically predicted splice donor and acceptor scores at individual positions but could not directly represent the coordinates and quantitative strength of the resulting junctions. Because many Mendelian diseases are caused by aberrant splicing, this capability allows AlphaGenome to be applied to a clinically important class of variants that earlier general-purpose models handled poorly.
AlphaGenome was trained on aligned experimental measurements from large public consortia.
Additional datasets cover PRO-cap nascent transcription and curated splice junction catalogs. The training corpus covers hundreds of distinct human and mouse cell types and tissues. The model is trained jointly on human and mouse genomes, which improves generalization to cell types and assays that are sparsely represented in either species alone.
Training proceeds in two stages: a pre-training stage in which independent teacher models learn from the experimental tracks, followed by a distillation stage in which a single student model is trained to reproduce the teacher predictions. The student model is the one served through the API and used for variant scoring, and because it replaces several teachers with one network, it can evaluate a variant in roughly one second on a single GPU.
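The distillation idea can be illustrated with a toy example. The sketch assumes that teacher predictions are combined by a simple mean and that the student is penalized with a mean squared error; the source does not state how the teachers are aggregated or which loss is used, so both are placeholders, and the numbers are made up.

```python
def ensemble_target(teacher_tracks: list[list[float]]) -> list[float]:
    """Average several teachers' predictions position by position (assumed aggregation)."""
    count = len(teacher_tracks)
    return [sum(values) / count for values in zip(*teacher_tracks)]


def distillation_loss(student_track: list[float], target: list[float]) -> float:
    """Mean squared error between student output and teacher target (assumed loss)."""
    if len(student_track) != len(target):
        raise ValueError("tracks must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_track, target)]
    return sum(squared) / len(squared)


# Made-up predictions from two teachers over three positions.
teachers = [[0.0, 2.0, 4.0], [2.0, 2.0, 0.0]]
target = ensemble_target(teachers)
loss = distillation_loss([1.0, 1.0, 2.0], target)
```

The student never sees the experimental tracks directly in this sketch; it only learns to match the combined teacher output.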
A notable feature of the training procedure is its efficiency relative to predecessor models. According to the DeepMind announcement, training a single AlphaGenome model takes approximately four hours on TPUs and consumes roughly half of the compute budget required to train the original Enformer model. The table below summarizes the compute profile reported by the authors.
| Item | Value |
|---|---|
| Input sequence length | 1,048,576 base pairs (1 Mb) |
| Sequence parallelism | 8 TPU v3 devices, 131,072 bp per device |
| Training time per model | Approximately 4 hours |
| Relative compute vs Enformer | Approximately one half |
| Inference time per variant | About 1 second on an NVIDIA H100 GPU |
| Human tracks predicted | 5,930 |
| Mouse tracks predicted | 1,128 |
| Output modalities | 11 |
The authors attribute the compute reduction to architectural improvements, in particular the U-Net backbone and the sequence parallelism scheme, rather than to a reduction in model expressiveness.
The AlphaGenome team evaluated the model along two main axes: sequence track prediction, which measures how well the model reproduces held-out experimental measurements when given a reference genomic sequence, and variant effect prediction, which measures how well the model predicts the functional consequences of substituting or inserting an alternative allele at a specific position.
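Variant effect prediction, as described here, amounts to comparing model output on the reference sequence with output on the sequence carrying the alternative allele. The sketch below shows that comparison with a deliberately trivial stand-in predictor that counts occurrences of a TATA-like motif. The predictor, the function names, and the example sequence are all hypothetical; a real model would return full tracks rather than a single number.

```python
from typing import Callable


def apply_allele(sequence: str, position: int, ref: str, alt: str) -> str:
    """Replace `ref` at a 0-based position with `alt` (substitution or insertion)."""
    if sequence[position:position + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return sequence[:position] + alt + sequence[position + len(ref):]


def variant_effect(
    predict: Callable[[str], float],
    sequence: str,
    position: int,
    ref: str,
    alt: str,
) -> float:
    """Score a variant as prediction(alternative) minus prediction(reference)."""
    alt_sequence = apply_allele(sequence, position, ref, alt)
    return predict(alt_sequence) - predict(sequence)


def toy_predict(sequence: str) -> float:
    """Stand-in for a sequence model: counts a TATA-like motif."""
    return float(sequence.count("TATAAA"))


# A C-to-A substitution that creates the motif in this made-up sequence.
score = variant_effect(toy_predict, "GGCTATCAAGG", 6, "C", "A")
inserted = apply_allele("ACGT", 1, "C", "CTT")
```

A positive score means the alternative allele raises the predicted signal relative to the reference, and a negative score means it lowers it.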
On sequence track prediction, AlphaGenome was compared against task-specific state-of-the-art models for each modality. The model matched or exceeded the best external model on 22 of 24 evaluations. For chromatin accessibility, the reported improvements relative to ChromBPNet ranged from roughly 8 to 19 percent depending on the cell type. For splice prediction, AlphaGenome outperformed SpliceAI and Pangolin on six of seven benchmarks and was the only one of the three to jointly predict splice sites, splice site usage, and full junction coordinates.
The variant effect evaluation suite spans 26 distinct benchmarks drawn from massively parallel reporter assays (MPRA), saturation mutagenesis experiments, eQTL and sQTL fine-mapping datasets, clinical variant databases such as ClinVar, and curated disease cohorts. AlphaGenome matched or exceeded the strongest external comparator on 25 of these 26 benchmarks. On expression quantitative trait locus direction-of-effect prediction, the model achieved a reported 25.5 percent relative improvement over Borzoi.
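For readers unfamiliar with how a figure such as a relative improvement is derived, the sketch below works through the arithmetic on made-up numbers. It assumes that direction-of-effect performance is measured as the fraction of variants whose predicted sign matches the observed sign; the paper's exact metric may differ, and none of the values are taken from the benchmarks.

```python
def sign_accuracy(predicted: list[float], observed: list[float]) -> float:
    """Fraction of variants whose predicted effect has the observed sign."""
    if len(predicted) != len(observed):
        raise ValueError("inputs must have the same length")
    matches = sum((p > 0) == (o > 0) for p, o in zip(predicted, observed))
    return matches / len(predicted)


def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Improvement of new_score over baseline_score, as a fraction of the baseline."""
    return (new_score - baseline_score) / baseline_score


# Made-up effect sizes for four variants.
accuracy = sign_accuracy([0.5, -0.2, 0.1, -0.3], [1.0, -1.0, -0.5, -0.1])
# Made-up baseline score of 0.60 for a comparison model.
gain = relative_improvement(accuracy, 0.60)
```

A relative improvement is expressed against the baseline, so the same absolute gain looks larger when the baseline score is low.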
The table below summarizes the high-level comparison against the most directly comparable prior models.
| Model | Maximum input | Output resolution | Modalities | Splice junction prediction | Year |
|---|---|---|---|---|---|
| Basenji2 | 131,072 bp | 128 bp | Limited | No | 2020 |
| Enformer | 196,608 bp | 128 bp | Limited | No | 2021 |
| Sei | 4,096 bp | Sequence class | 21,907 chromatin profiles (40 sequence classes) | No | 2022 |
| Borzoi | 524,288 bp | 32 bp | Several including RNA-seq | No | 2024 |
| AlphaGenome | 1,048,576 bp | 1 bp | 11 | Yes | 2025 |
The authors and external collaborators demonstrated AlphaGenome on several biological case studies. In T-cell acute lymphoblastic leukemia, the model recovered the known mechanism by which non-coding mutations create a novel MYB transcription factor binding motif that activates the TAL1 oncogene. In a screen of fine-mapped non-coding variants associated with autoimmune and neurodegenerative diseases, AlphaGenome prioritized causal candidates consistent with subsequent experimental validation. The team also showed that the model can be used in synthetic biology applications to design promoter and enhancer sequences with targeted cell-type-specific activity.
AlphaGenome is available to academic and non-commercial researchers through the AlphaGenome API, hosted by DeepMind. Access requires registering with an institutional email and agreeing to non-commercial terms of use. A Python software development kit accompanies the API and provides convenience functions for scoring single variants, scoring batches of variants from a VCF file, retrieving full-track predictions across a window, and visualizing predicted regulatory landscapes. Commercial use requires a separate licensing arrangement, for which DeepMind has published an inquiry form.
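Before a batch of variants can be scored, each VCF record has to be turned into a variant and paired with an input window. The sketch below shows that preprocessing step using only the Python standard library. It does not call the AlphaGenome SDK: the `Variant` class and both functions are hypothetical helpers written for this example, and the window is not clipped to the chromosome end because chromosome lengths are not available here.

```python
from dataclasses import dataclass

WINDOW_BP = 1_048_576  # maximum input length stated in the article


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record; not the SDK's own type."""
    chromosome: str
    position: int  # 1-based, as in VCF
    ref: str
    alt: str


def parse_vcf_lines(lines: list[str]) -> list[Variant]:
    """Parse VCF data lines, emitting one Variant per alternative allele."""
    variants = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the column header
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            variants.append(Variant(chrom, int(pos), ref, alt))
    return variants


def centered_window(variant: Variant, width: int = WINDOW_BP) -> tuple[int, int]:
    """0-based half-open window of `width` bp centered on the variant, clipped at 0."""
    center = variant.position - 1
    start = max(0, center - width // 2)
    return start, start + width


# Illustrative records, not real variants of interest.
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t1000000\t.\tA\tC,G",
]
variants = parse_vcf_lines(vcf)
windows = [centered_window(v) for v in variants]
```

Multi-allelic records are split so that each alternative allele can be scored independently against the same reference window.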
The DeepMind team has also launched a community forum at alphagenomecommunity.com to coordinate user discussions, share notebooks, and report issues. As of the Nature publication, the team has stated an intent to release the full model weights in a future update, though the exact timing has not been published.
The authors and independent commentators have explicitly flagged several limitations of AlphaGenome that are important for interpreting its predictions.
Coverage of AlphaGenome has emphasized both its technical scope and its position within a rapidly maturing field of functional genomics models. STAT News, in its launch-day report, framed the release as DeepMind's most ambitious extension of the AlphaFold methodology beyond proteins. Chemical and Engineering News and MIT Technology Review highlighted the breadth of jointly predicted modalities and the model's relevance for interpreting the non-coding genome. Science covered the Nature publication in January 2026 and noted the strong benchmark performance against Borzoi and ChromBPNet.
Research engineer Natasha Latysheva of DeepMind characterized the project as more open-ended than protein structure prediction. "Genomics is more of a fuzzy field," she told STAT News. "There's no single metric of success." In response, the team designed the evaluation suite to span as many independent benchmarks as feasible, with the explicit goal of avoiding overfitting to any single proxy task.
The ALZFORUM commentary on the Nature paper described AlphaGenome as the new "top dog" for noncoding variant interpretation in neurodegenerative disease research, noting concrete improvements on Alzheimer's disease fine-mapped variants. The Science Media Centre collected reactions from independent UK-based geneticists, most of whom welcomed the model while emphasizing the need for orthogonal experimental validation of its predictions.
AlphaGenome is the latest entry in a sequence of DeepMind biology models that began with the first AlphaFold in 2018 and AlphaFold 2 in 2020. Where AlphaFold and AlphaFold 3 address the three-dimensional structure of proteins and biomolecular complexes, AlphaGenome addresses the functional regulatory consequences of DNA sequence variation. AlphaMissense, released in 2023, predicts the pathogenicity of missense variants in protein-coding sequences. Together, these tools partition the variant interpretation problem into complementary domains: AlphaMissense for coding missense variants, AlphaFold for the structural consequences of those variants, and AlphaGenome for the much larger space of non-coding regulatory variation.
The broader DeepMind biology program also includes work on protein design through systems such as AlphaProteo, and a commercial spin-out, Isomorphic Labs, focused on drug discovery. AlphaGenome itself is positioned as a research tool rather than a direct therapeutic product, but its developers have argued that improved variant interpretation is an important upstream input to drug target identification and clinical trial design.
The Nature paper lists Žiga Avsec as first author and Pushmeet Kohli, vice president of research at Google DeepMind, as corresponding author, with all authors affiliated with Google DeepMind in London. The paper was published in volume 649, issue 8099, pages 1206 to 1218, on January 28, 2026, under digital object identifier 10.1038/s41586-025-10014-0, and is released under a Creative Commons Attribution 4.0 International license. The June 2025 preprint is hosted on bioRxiv under the identifier 10.1101/2025.06.25.661532.