AlphaGenome

Updated Jul 16, 2026

AlphaGenome is a unified deep learning model developed by Google DeepMind that predicts thousands of functional genomic properties from raw DNA sequences. The model accepts up to one million base pairs of DNA as input and produces predictions across eleven distinct molecular modalities at resolutions down to a single base pair. The modalities include gene expression, RNA splicing, chromatin accessibility, histone modifications, transcription factor binding, and three-dimensional chromatin contacts. AlphaGenome was announced on June 25, 2025 through a DeepMind blog post and an accompanying bioRxiv preprint,^[3]^[2] and the peer-reviewed research describing the system was published in Nature on January 28, 2026 under the title "Advancing regulatory variant effect prediction with AlphaGenome."^[1]

The model is the direct successor to Enformer, DeepMind's genomic deep learning system released in 2021. In a head-to-head evaluation against the strongest available external models, AlphaGenome matched or exceeded the state of the art on 25 of 26 variant effect prediction benchmarks and on 22 of 24 sequence track prediction benchmarks.^[1] It is the first publicly described model capable of jointly predicting all of the assayed regulatory modalities from a single shared backbone, at resolutions down to a single base pair, and it can score the impact of an individual genetic variant on every modality in roughly one second on a modern GPU.^[1] AlphaGenome is available to non-commercial researchers through a free API and an accompanying Python software development kit.^[3]

The system is positioned within DeepMind's broader portfolio of structural and functional biology models, which includes AlphaFold for protein structure prediction, AlphaFold 3 for biomolecular complex prediction, and AlphaMissense for the pathogenicity of protein-coding missense variants. Whereas AlphaMissense focuses on the roughly two percent of the human genome that encodes proteins, AlphaGenome is designed to interpret the remaining ninety-eight percent, the non-coding regions that regulate when, where, and how strongly genes are expressed.

Background and motivation

The human genome contains approximately three billion base pairs of DNA. Of these, only a small fraction, roughly two percent, directly encodes proteins. The remaining non-coding regions were historically underexplored, but a substantial body of work over the past two decades, including the ENCODE Project, the Roadmap Epigenomics Mapping Consortium, GTEx, and FANTOM5, has established that non-coding DNA carries enhancers, promoters, insulators, splice regulatory elements, and many other features that control gene regulation. The overwhelming majority of disease-associated genetic variants identified by genome-wide association studies fall in non-coding regions, yet interpreting them mechanistically remains difficult because their effects are typically subtle, context dependent, and mediated through long-range regulatory interactions.

Machine learning models that map DNA sequence to functional output have become a central tool for this interpretation problem. Early convolutional networks such as DeepSEA, Basenji, and Basset learned to predict assays like DNase-seq, ChIP-seq, and CAGE-seq from short windows of sequence. The 2021 release of Enformer extended this paradigm by combining convolutional layers with transformer blocks, enabling the model to consider context up to 196,608 base pairs and capture longer-range enhancer-promoter interactions.^[10] The Borzoi model released by the Calico research team in 2024 further enlarged the input to roughly half a megabase and added RNA-seq coverage prediction,^[11] while ChromBPNet, ProCapNet, and Orca specialized in chromatin accessibility, transcription initiation, and three-dimensional contact maps respectively. SpliceAI and Pangolin set strong baselines for splice site prediction.

Despite this progress, the field lacked a single model that could jointly predict every major regulatory modality at base-pair resolution, span the megabase-scale context relevant for distal regulation, and run quickly enough to be applied to large variant catalogs. AlphaGenome was designed to fill all three of these gaps simultaneously.

Architecture

AlphaGenome uses a hybrid architecture that combines convolutional neural network layers with transformer blocks, organized as a U-Net-style encoder-decoder backbone with skip connections. The design produces output representations at three different scales, allowing each modality to be predicted at the resolution that is biologically meaningful for it.^[1]

Encoder, transformer tower, and decoder

The encoder consumes a one-megabase DNA sequence encoded as a one-hot matrix and progressively downsamples it through stacked convolutional blocks interleaved with pooling operations. Local sequence patterns such as transcription factor binding motifs, splice donor and acceptor consensus sequences, and short core promoter elements are captured by these convolutional layers. After downsampling, the sequence representation enters a transformer tower that operates at 128-base-pair resolution. Multi-head self-attention in this stage allows the model to share information across positions separated by hundreds of thousands of base pairs, which is essential for capturing enhancer-promoter interactions, insulator effects, and other distal regulatory phenomena.
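
As a minimal illustration of the one-hot input representation described above, the sketch below encodes a DNA string as one row of four channels per base. The A, C, G, T channel order and the all-zero encoding of ambiguous bases such as N are assumptions made for illustration; the source does not specify either.

```python
# Channel order is an assumption for illustration, not taken from the paper.
BASES = "ACGT"


def one_hot_encode(sequence: str) -> list[list[int]]:
    """Return one row of four 0/1 channels per base.

    Bases outside A, C, G, T (for example N) map to an all-zero row,
    which is an assumed convention.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

A full-length input would have 1,048,576 such rows, which is the matrix the convolutional encoder then downsamples.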

After the transformer tower, a decoder upsamples the representation back toward base-pair resolution through transposed convolutions, with skip connections from the encoder injecting fine-grained local features at each level. The decoder produces three sets of internal embeddings: one-dimensional embeddings at 1-base-pair resolution, one-dimensional embeddings at 128-base-pair resolution, and two-dimensional embeddings at 2,048-base-pair resolution that are used for predicting pairwise chromatin contact maps.
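
The three resolutions imply concrete position counts for a full-length input, which can be worked out directly. The sketch below computes only the number of positional bins per embedding set; the names are hypothetical, and the channel width of each embedding is not stated in this article, so it is left out.

```python
INPUT_LENGTH_BP = 1_048_576  # 2**20 base pairs, the maximum input length

# Base pairs per bin for the three embedding sets described above.
# The dictionary keys are hypothetical labels, not names from the paper.
RESOLUTIONS_BP = {"1d_fine": 1, "1d_coarse": 128, "2d_pairwise": 2_048}


def embedding_bins(input_length_bp: int = INPUT_LENGTH_BP) -> dict[str, tuple[int, ...]]:
    """Positional shape of each embedding set; the 2D set is bins x bins."""
    shapes = {}
    for name, resolution in RESOLUTIONS_BP.items():
        bins = input_length_bp // resolution
        shapes[name] = (bins, bins) if name.startswith("2d") else (bins,)
    return shapes


if __name__ == "__main__":
    for name, shape in embedding_bins().items():
        print(name, shape)
```

For a one-megabase input this gives 1,048,576 positions at 1 bp, 8,192 positions at 128 bp, and a 512 by 512 grid at 2,048 bp.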

Sequence parallelism

To handle a one-megabase input within available accelerator memory, AlphaGenome partitions the sequence dimension across eight interconnected tensor processing unit (TPU) devices during training. Each TPU processes a 131,072-base-pair chunk of the full sequence, with communication between devices implementing the attention operations across chunk boundaries. This sequence parallelism design is what allows the model to combine a megabase-scale context with single-base resolution outputs without sacrificing either.^[1]
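
The partition arithmetic can be checked with a short sketch. It computes only the half-open chunk boundaries each device would own; the cross-device communication that implements attention across chunks is not modelled, and the function name is hypothetical.

```python
def shard_bounds(total_length: int, num_devices: int) -> list[tuple[int, int]]:
    """Split [0, total_length) into equal half-open chunks, one per device."""
    if total_length % num_devices != 0:
        raise ValueError("sequence length must divide evenly across devices")
    chunk = total_length // num_devices
    return [(i * chunk, (i + 1) * chunk) for i in range(num_devices)]


if __name__ == "__main__":
    for device, (start, end) in enumerate(shard_bounds(1_048_576, 8)):
        print(f"device {device}: [{start}, {end}) -> {end - start} bp")
```

Eight chunks of 131,072 base pairs cover the full 1,048,576-base-pair input exactly, matching the figures quoted above.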

Output heads

A set of modality-specific output heads sits on top of the shared backbone. Each head maps the appropriate internal embedding to one or more genomic tracks. In total the production model predicts 5,930 human genomic tracks and 1,128 mouse genomic tracks across the eleven supported modalities.^[1]

Input and output modalities

AlphaGenome accepts a single contiguous DNA sequence of up to 1,048,576 base pairs and produces a multi-track output spanning eleven assay types. The table below summarizes the modalities, the typical experimental assays they correspond to, and the resolution at which AlphaGenome makes predictions for each.^[1]

Modality	Underlying assay	Output resolution
Gene expression	RNA-seq coverage	1 bp
Transcription initiation	CAGE-seq	1 bp
Nascent transcription	PRO-cap	1 bp
Splice sites	Splice donor and acceptor probabilities	1 bp
Splice site usage	Fractional usage across cell types	1 bp
Splice junctions	Junction coordinates and strength	Junction-level
DNA accessibility (DNase)	DNase-seq	1 bp
DNA accessibility (ATAC)	ATAC-seq	1 bp
Histone modifications	ChIP-seq for histone marks	128 bp
Transcription factor binding	ChIP-seq for sequence-specific TFs	128 bp
Chromatin contact maps	Hi-C and Micro-C	2,048 bp pairwise

The inclusion of explicit splice junction modeling is a notable innovation. Earlier sequence models typically predicted splice donor and acceptor scores at individual positions but could not directly represent the coordinates and quantitative strength of the resulting junctions. Because many Mendelian diseases are caused by aberrant splicing, this capability allows AlphaGenome to be applied to a clinically important class of variants that earlier general-purpose models handled poorly.

Training data

AlphaGenome was trained on aligned experimental measurements from large public consortia. The principal data sources include:

ENCODE (Encyclopedia of DNA Elements), providing ChIP-seq, DNase-seq, ATAC-seq, and RNA-seq data across hundreds of human and mouse cell lines and tissues.
GTEx (Genotype-Tissue Expression Project), providing RNA-seq and splicing measurements across human tissues, along with expression quantitative trait loci used in downstream evaluation.
4D Nucleome, providing Hi-C and Micro-C chromatin contact maps.
FANTOM5, providing CAGE-seq transcription initiation profiles across a broad panel of human and mouse samples.

Additional datasets cover PRO-cap nascent transcription and curated splice junction catalogs. The training corpus covers hundreds of distinct human and mouse cell types and tissues. The model is trained jointly on human and mouse genomes, which improves generalization to cell types and assays that are sparsely represented in either species alone.^[1]

Training procedure and compute

Training proceeds in two stages: a pre-training stage in which independent teacher models learn from the experimental tracks, followed by a distillation stage in which a single student model is trained to reproduce the teacher predictions. The student model is the one served through the API and used for variant scoring, and distilling the teachers into one model keeps variant evaluation to roughly one second on a single GPU.^[1]
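
The distillation stage can be pictured with a toy objective. The sketch below assumes the teacher predictions are averaged into a single target and that the student is penalized with mean squared error; both choices are assumptions made for illustration, since the article states only that the student is trained to reproduce the teacher predictions.

```python
def teacher_target(teacher_predictions: list[list[float]]) -> list[float]:
    """Average per-position predictions across teachers (an assumed aggregation)."""
    num_teachers = len(teacher_predictions)
    return [sum(values) / num_teachers for values in zip(*teacher_predictions)]


def distillation_loss(student: list[float], target: list[float]) -> float:
    """Mean squared error between student output and the teacher-derived target."""
    return sum((s - t) ** 2 for s, t in zip(student, target)) / len(target)


if __name__ == "__main__":
    teachers = [[1.0, 2.0], [3.0, 4.0]]  # two toy teachers, two positions each
    target = teacher_target(teachers)
    print(target, distillation_loss([2.0, 2.0], target))
```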

A notable feature of the training procedure is its efficiency relative to predecessor models. According to the DeepMind announcement, training a single AlphaGenome model takes approximately four hours on the TPU cluster used, consuming roughly half of the compute budget required to train the original Enformer model.^[3] The table below summarizes the compute profile reported by the authors.

Item	Value
Input sequence length	1,048,576 base pairs (1 Mb)
Sequence parallelism	8 TPU v3 devices, 131,072 bp per device
Training time per model	Approximately 4 hours
Relative compute vs Enformer	Approximately one half
Inference time per variant	About 1 second on an NVIDIA H100 GPU
Human tracks predicted	5,930
Mouse tracks predicted	1,128
Output modalities	11

The authors attribute the compute reduction to architectural improvements, in particular the U-Net backbone and the sequence parallelism scheme, rather than to a reduction in model expressiveness.^[1]

Evaluation

The AlphaGenome team evaluated the model along two main axes: sequence track prediction, which measures how well the model reproduces held-out experimental measurements when given a reference genomic sequence, and variant effect prediction, which measures how well the model predicts the functional consequences of substituting or inserting an alternative allele at a specific position.
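
Variant effect prediction in this sense amounts to comparing model output for the reference sequence with output for the same sequence carrying the alternative allele. The sketch below shows that comparison with a toy stand-in predictor; `toy_gc_model` is hypothetical and unrelated to AlphaGenome, and summing the track difference is an assumed aggregation rather than the scoring used in the paper.

```python
from typing import Callable


def apply_variant(sequence: str, position: int, ref: str, alt: str) -> str:
    """Replace `ref` with `alt` at a 0-based position, checking the reference allele."""
    if sequence[position:position + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return sequence[:position] + alt + sequence[position + len(ref):]


def variant_effect(
    predict: Callable[[str], list[float]],
    sequence: str,
    position: int,
    ref: str,
    alt: str,
) -> float:
    """Summed alternative-minus-reference signal (an assumed aggregation)."""
    ref_track = predict(sequence)
    alt_track = predict(apply_variant(sequence, position, ref, alt))
    return sum(alt_track) - sum(ref_track)


def toy_gc_model(sequence: str) -> list[float]:
    """Hypothetical stand-in predictor: 1.0 at G/C positions, 0.0 elsewhere."""
    return [1.0 if base in "GC" else 0.0 for base in sequence]


if __name__ == "__main__":
    print(variant_effect(toy_gc_model, "ACGT", 1, "C", "T"))      # substitution
    print(variant_effect(toy_gc_model, "ACGT", 1, "C", "CGG"))    # insertion
```

The same reference-versus-alternative comparison covers both substitutions and insertions; only the edited sequence differs.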

Sequence track benchmarks

On sequence track prediction, AlphaGenome was compared against task-specific state-of-the-art models for each modality. The model matched or exceeded the best external model on 22 of 24 evaluations.^[1] For chromatin accessibility, the reported improvements relative to ChromBPNet ranged from roughly 8 to 19 percent depending on the cell type.^[1] For splice prediction, AlphaGenome outperformed SpliceAI and Pangolin on six of seven benchmarks while also being the only model to simultaneously model splice sites, splice site usage, and full junction coordinates.^[1]

Variant effect prediction benchmarks

The variant effect evaluation suite spans 26 distinct benchmarks drawn from massively parallel reporter assays (MPRA), saturation mutagenesis experiments, eQTL and sQTL fine-mapping datasets, clinical variant databases such as ClinVar, and curated disease cohorts. AlphaGenome matched or exceeded the strongest external comparator on 25 of these 26 benchmarks.^[1] On expression quantitative trait locus direction-of-effect prediction, the model achieved a reported 25.5 percent relative improvement over Borzoi.^[1]
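
A relative improvement is a ratio of scores, not a difference in percentage points, which is easy to misread. The sketch below illustrates the distinction with a simple sign-agreement metric and made-up numbers; the exact direction-of-effect metric used in the paper is not described in this article, so `sign_accuracy` is an assumed stand-in and none of the values are reported results.

```python
def sign_accuracy(predicted: list[float], observed: list[float]) -> float:
    """Fraction of variants whose predicted effect has the same sign as observed."""
    matches = sum(1 for p, o in zip(predicted, observed) if (p > 0) == (o > 0))
    return matches / len(observed)


def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Relative gain of new_score over baseline_score, in percent."""
    return (new_score - baseline_score) / baseline_score * 100.0


if __name__ == "__main__":
    # Made-up effect sizes for four variants, for illustration only.
    predicted = [0.5, -0.2, 0.1, -0.4]
    observed = [1.0, -1.0, -1.0, -1.0]
    print(sign_accuracy(predicted, observed))
    # Moving a score from 0.5 to 0.6 is 10 points absolute but about 20% relative.
    print(relative_improvement(0.6, 0.5))
```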

The table below summarizes the high-level comparison against the most directly comparable prior models.

Model	Maximum input	Output resolution	Modalities	Splice junction prediction	Year
Basenji2	131,072 bp	128 bp	Limited	No	2020
Enformer	196,608 bp	128 bp	Limited	No	2021^[10]
Sei	4,096 bp	Sequence class	21,907 chromatin profiles	No	2022
Borzoi	524,288 bp	32 bp	Several including RNA-seq	No	2024^[11]
AlphaGenome	1,048,576 bp	1 bp	11	Yes	2025^[1]

Independent applications

The authors and external collaborators demonstrated AlphaGenome on several biological case studies. In T-cell acute lymphoblastic leukemia, the model recovered the known mechanism by which non-coding mutations create a novel MYB transcription factor binding motif that activates the TAL1 oncogene.^[1] In a screen of fine-mapped non-coding variants associated with autoimmune and neurodegenerative diseases, AlphaGenome prioritized causal candidates consistent with subsequent experimental validation.^[1] The team also showed that the model can be used in synthetic biology applications to design promoter and enhancer sequences with targeted cell-type-specific activity.^[1]

Access and licensing

AlphaGenome is available to academic and non-commercial researchers through the AlphaGenome API, hosted by DeepMind. Access requires registering with an institutional email and agreeing to non-commercial terms of use. A Python software development kit accompanies the API and provides convenience functions for scoring single variants, scoring batches of variants from a VCF file, retrieving full-track predictions across a window, and visualizing predicted regulatory landscapes. Commercial use requires a separate licensing arrangement, for which DeepMind has published an inquiry form.^[3]
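
Batch scoring from a VCF file starts with turning each record into a chromosome, position, reference allele, and alternative allele. The sketch below does only that parsing step in plain Python and does not call the AlphaGenome SDK; `VariantRecord` and `parse_vcf_lines` are hypothetical names, and the SDK's actual functions should be taken from its own documentation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class VariantRecord:
    """Hypothetical container for one allele; not an SDK type."""
    chromosome: str
    position: int  # 1-based, as in the VCF POS column
    ref: str
    alt: str


def parse_vcf_lines(lines: Iterable[str]) -> list[VariantRecord]:
    """Parse the first five VCF columns, emitting one record per ALT allele."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the header row
        chrom, pos, _identifier, ref, alts = line.split("\t")[:5]
        for alt in alts.split(","):
            records.append(VariantRecord(chrom, int(pos), ref, alt))
    return records


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT",
        "chr1\t100\t.\tA\tG,T",
    ]
    for record in parse_vcf_lines(example):
        print(record)
```

Splitting multi-allelic rows into one record per alternative allele keeps each downstream scoring call tied to a single reference-to-alternative change.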

The DeepMind team has also launched a community forum at alphagenomecommunity.com to coordinate user discussions, share notebooks, and report issues.^[3] As of the Nature publication, the team has stated an intent to release the full model weights in a future update, though the exact timing has not been published.^[1]

Limitations

The authors and independent commentators have explicitly flagged several limitations of AlphaGenome that are important for interpreting its predictions.

Distal regulation beyond 100 kb. Despite the megabase context window, the model captures the effects of regulatory elements located more than approximately 100,000 base pairs from their target gene less reliably than nearby elements. This reflects both the sparsity of supervised signal at very long ranges and the underlying biological difficulty of associating distal enhancers with specific promoters.^[1]
Cell-type and tissue specificity. While AlphaGenome makes predictions for hundreds of human and mouse cell types and tissues, its coverage is limited to the cell types represented in the training data. Rare or transient cell states are poorly covered.^[1]
Not a clinical tool. The model is explicitly framed as a research tool. The authors caution that its predictions are not validated for clinical use and should not be used to make individual diagnostic or treatment decisions.^[1]
Personal genome prediction. AlphaGenome is not designed to predict an individual's phenotype from their personal genome. Complex traits depend on combinations of variants, gene-environment interactions, and developmental factors that lie outside the scope of a sequence-to-function model.^[1]
Population genetics inputs. The model is purely sequence based and does not consume allele frequency, linkage disequilibrium, or other population genetic features. This is an advantage for rare variants, where population priors are uninformative, but it means that AlphaGenome scores complement rather than replace population genetic approaches for common variant analysis.^[1]

Reception and impact

Coverage of AlphaGenome has emphasized both its technical scope and its position within a rapidly maturing field of functional genomics models. STAT News, in its launch-day report, framed the release as DeepMind's most ambitious extension of the AlphaFold methodology beyond proteins.^[4] Chemical and Engineering News and MIT Technology Review highlighted the breadth of jointly predicted modalities and the model's relevance for interpreting the non-coding genome.^[5] Science covered the Nature publication in January 2026 and noted the strong benchmark performance against Borzoi and ChromBPNet.

Research engineer Natasha Latysheva of DeepMind characterized the project as more open-ended than protein structure prediction. "Genomics is more of a fuzzy field," she told STAT News. "There's no single metric of success."^[4] Reflecting this, the team designed the evaluation suite to span as many independent benchmarks as feasible, with the explicit goal of avoiding overfitting to any single proxy task.

The ALZFORUM commentary on the Nature paper described AlphaGenome as the new "top dog" for noncoding variant interpretation in neurodegenerative disease research, noting concrete improvements on Alzheimer's disease fine-mapped variants.^[8] The Science Media Centre collected reactions from independent UK-based geneticists, most of whom welcomed the model while emphasizing the need for orthogonal experimental validation of its predictions.^[9]

Relationship to other DeepMind biology models

AlphaGenome is the latest entry in a sequence of DeepMind biology models that began with AlphaFold in 2018 and AlphaFold 2 in 2020. Where AlphaFold and AlphaFold 3 address the three-dimensional structure of proteins and biomolecular complexes, AlphaGenome addresses the functional regulatory consequences of DNA sequence variation. AlphaMissense, released in 2023, predicts the pathogenicity of missense variants in protein-coding sequences. Together, these tools partition the variant interpretation problem into complementary domains: AlphaMissense for coding missense variants, AlphaFold for the structural consequences of those variants, and AlphaGenome for the much larger space of non-coding regulatory variation.

The broader DeepMind biology program also includes work on protein design through systems such as AlphaProteo, and a commercial spin-out, Isomorphic Labs, focused on drug discovery. AlphaGenome itself is positioned as a research tool rather than a direct therapeutic product, but its developers have argued that improved variant interpretation is an important upstream input to drug target identification and clinical trial design.

Authorship and publication

The Nature paper lists Žiga Avsec as first author and Pushmeet Kohli, vice president of research at Google DeepMind, as corresponding author, with all authors affiliated with Google DeepMind in London.^[1] The paper was published in volume 649, issue 8099, pages 1206 to 1218, on January 28, 2026, under digital object identifier 10.1038/s41586-025-10014-0, and is released under a Creative Commons Attribution 4.0 International license.^[1] The June 2025 preprint is hosted on bioRxiv under the identifier 10.1101/2025.06.25.661532.^[2]

References

1. Avsec, Ž. et al. "Advancing regulatory variant effect prediction with AlphaGenome." *Nature* 649, 1206–1218 (January 28, 2026). DOI: 10.1038/s41586-025-10014-0.
2. Avsec, Ž. et al. "AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model." bioRxiv preprint (June 25, 2025). DOI: 10.1101/2025.06.25.661532.
3. Google DeepMind. "AlphaGenome: AI for better understanding the genome." DeepMind blog, June 25, 2025.
4. Molteni, M. "DeepMind launches AlphaGenome, aiming to predict gene regulation from DNA sequence." *STAT News*, June 25, 2025.
5. *Chemical and Engineering News*. "Google's AlphaGenome predicts the function of a DNA sequence." January 2026.
6. MarkTechPost. "Google DeepMind Releases AlphaGenome: A Deep Learning Model that can more Comprehensively Predict the Impact of Single Variants or Mutations in DNA." June 26, 2025.
7. InfoQ. "Google DeepMind Unveils AlphaGenome: a Unified AI Model for High-Resolution Genome Interpretation." July 2025.
8. ALZFORUM. "Top Dog: AlphaGenome Predicts How Noncoding Variants Work." 2026.
9. Science Media Centre. "Expert reaction to paper on Google DeepMind's AlphaGenome." January 28, 2026.
10. Avsec, Ž. et al. "Effective gene expression prediction from sequence by integrating long-range interactions." *Nature Methods* 18, 1196–1203 (2021). (Enformer paper, for comparison.)
11. Linder, J. et al. "Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation." Calico Life Sciences, 2024. (Borzoi paper, for comparison.)
