ZAYA1-8B

Updated Jul 16, 2026

ZAYA1-8B is an open-weight, reasoning-focused Mixture-of-Experts (MoE) large language model released by San Francisco-based AI research lab Zyphra on May 6, 2026.^[1] The model has 8.4 billion total parameters with only 760 million active parameters per token, and was the first frontier-class reasoning model pretrained entirely on AMD Instinct MI300X hardware without any NVIDIA accelerators in the training stack.^[1] ZAYA1-8B is distributed under the Apache 2.0 license on Hugging Face, alongside a free serverless endpoint on Zyphra Cloud and a multimodal vision-language variant called ZAYA1-VL-8B.^[3]

Designed to maximize what Zyphra calls "intelligence density per parameter," ZAYA1-8B introduces three core architectural innovations on top of the standard MoE Transformer recipe: Compressed Convolutional Attention (CCA) that reduces key-value cache memory by roughly 8x, an MLP-based expert router replacing the conventional linear projection, and learned residual scaling for stable deep-network training.^[2] Combined with a novel test-time compute method called Markovian Recursive Self-Aggregation (Markovian RSA), the model reaches 91.9% on AIME 2025 and 89.6% on HMMT 2025, competitive with much larger frontier reasoning systems including DeepSeek-V3.2 and GPT-5-High while remaining small enough to deploy on a single accelerator.^[2]

Overview

ZAYA1-8B is the first model in Zyphra's ZAYA family, succeeding the company's earlier Zamba and Zamba2 hybrid state-space-attention models. Where Zamba was a dense hybrid architecture, ZAYA1 returns to a pure MoE Transformer skeleton but augments it with several efficiency-oriented modifications collectively branded as the MoE++ architecture. The model targets the same niche as small reasoning specialists such as Qwen3-4B-Thinking, Gemma-4-E4B, and Nemotron-3-Nano-30B, while attempting to match the math, code, and reasoning quality of much larger systems like DeepSeek-R1, Mistral-Small-4-119B, and Gemini 2.5 Pro.

The model is reasoning-tuned, meaning that responses include explicit chain-of-thought traces before delivering a final answer. Both a base checkpoint (Zyphra/ZAYA1-base) and a reasoning base (Zyphra/ZAYA1-reasoning-base) are available in addition to the final post-trained chat model. Sampling defaults recommended by Zyphra are temperature 1.0 with top-p 0.95 for general use and temperature 0.6 with top-p 0.95 for agentic and code workloads.^[3]
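The recommended sampling defaults can be captured as a small lookup. This is a minimal sketch: the preset names `general` and `agentic` and the helper `sampling_for` are hypothetical, and only the temperature and top-p values come from the model card.

```python
# Sampling presets using the values recommended by Zyphra.
# The preset names and this helper are illustrative, not part of any Zyphra API.
SAMPLING_PRESETS = {
    "general": {"temperature": 1.0, "top_p": 0.95},
    "agentic": {"temperature": 0.6, "top_p": 0.95},  # agentic and code workloads
}


def sampling_for(workload: str) -> dict:
    """Return a copy of the preset for a workload, falling back to general use."""
    return dict(SAMPLING_PRESETS.get(workload, SAMPLING_PRESETS["general"]))
```

The returned dictionary can be passed as keyword arguments to whichever inference client is in use, provided that client accepts `temperature` and `top_p`.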

Background and release

Zyphra was founded in 2020 by Krithik Puthalath, Tomas Figliolia, Beren Millidge, and Danny Martinelli, with the company describing itself as a superintelligence research and product lab. After raising an 11 million USD seed round in 2023 with participation from Intel Capital, Future Ventures, and Bison, the company closed a 100 million USD Series A in June 2025 led by Jaan Tallinn at a one billion USD post-money valuation. Tallinn was previously an early backer of DeepMind and Anthropic.

Zyphra's prior public models included Zamba (April 2024), a 7B hybrid state-space and attention foundation model, and the smaller Zamba2-2.7B aimed at efficient on-device inference. ZAYA1-8B was announced jointly with AMD and IBM on May 6, 2026,^[1] alongside a technical report on arXiv (2605.05365)^[2] and the vision-language extension ZAYA1-VL-8B published roughly one week later.^[4] Founder and CEO Krithik Puthalath framed the release as a demonstration of co-designed architecture and training: "ZAYA1-8B demonstrates what is possible when architecture, pretraining, and reinforcement learning are co-designed toward a single objective: maximizing the intelligence extracted per parameter and per FLOP."^[6]

Model architecture

ZAYA1-8B uses a 40-layer Transformer decoder with a hidden dimension of 2048 and 16 routed experts per MoE layer. Routing is top-1, meaning each token activates exactly one expert per MoE layer, with no always-on shared expert. The tokenizer is borrowed from Gemma 3 and has a vocabulary of 262,272 tokens. Weights are stored in bfloat16, with selective FP32 upcasting on the language-model head, attention norms, routing logits, and residual connections.^[2]

Parameter breakdown

Component	Value
Total parameters	8.4 billion
Active parameters per token	760 million
Transformer layers	40
Hidden dimension	2048
Experts per MoE layer	16
Expert routing	Top-1
Expert FFN width	4096 pre-activation, 2048 post-activation
Attention	CCGQA with CCA (2x query, 8x KV compression)
Vocabulary	262,272 (Gemma 3 tokenizer)
Precision	BF16 with selective FP32

Compressed Convolutional Attention

The single most prominent architectural innovation in ZAYA1-8B is Compressed Convolutional Attention, abbreviated CCA. Standard multi-head attention scales the key-value (KV) cache linearly with both context length and the number of heads, which dominates memory and bandwidth costs at long contexts and is the primary bottleneck for decoding throughput. CCA performs sequence mixing in a compressed latent space and combines compression on both the query and KV pathways. In ZAYA1-8B the configuration delivers a 2x reduction in query projection size and an 8x reduction in KV-cache size compared with full multi-head attention at equivalent quality. In practical terms a conversation or document that would normally require around 8 GB of KV-cache memory shrinks to roughly 1 GB without a meaningful accuracy penalty, which is what allows an 8.4B-parameter MoE to run long reasoning traces on a single mid-range accelerator.^[2] The CCA design is closely related to the Multi-head Latent Attention used in DeepSeek's models but is presented by Zyphra as a distinct convolutional variant.^[2]
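The 8 GB to 1 GB figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes a full multi-head baseline whose cached key and value width equals the hidden dimension, BF16 (2-byte) cache entries, and a hypothetical 24,000-token context; none of these baseline details are stated in the source, so the numbers are illustrative only.

```python
def kv_cache_bytes(tokens, layers, kv_width, bytes_per_value=2, compression=1):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor 2 covers keys plus values; `compression` divides the cached width.
    return 2 * tokens * layers * kv_width * bytes_per_value // compression


LAYERS, HIDDEN = 40, 2048   # from the architecture description
TOKENS = 24_000             # hypothetical context length chosen for illustration

full = kv_cache_bytes(TOKENS, LAYERS, HIDDEN)                 # uncompressed baseline
cca = kv_cache_bytes(TOKENS, LAYERS, HIDDEN, compression=8)   # 8x KV compression

print(f"full MHA: {full / 1e9:.2f} GB, with 8x compression: {cca / 1e9:.2f} GB")
# prints "full MHA: 7.86 GB, with 8x compression: 0.98 GB"
```

Because the cache grows linearly with context length, the 8x ratio holds at any length; only the absolute sizes change.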

MoE++ router and load balancing

The ZAYA1 router departs from the standard linear top-k softmax router used in most MoE models. Instead it uses a small MLP (multi-layer perceptron) followed by Exponential Depth Averaging (EDA) that smooths routing decisions across consecutive layers. Together these changes are reported to improve expert balancing, expert specialization, and entropy recovery during training compared with a linear router baseline. Load balance is additionally enforced through a PID-controller-style bias adjustment, which dynamically nudges per-expert biases up or down to drive average utilization toward the target. The router enables stable training with top-1 routing, which is more compute-efficient than the top-2 routing common in earlier MoE designs.^[2]
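The bias-based load balancing can be illustrated with a toy controller. This is a sketch under stated assumptions rather than Zyphra's implementation: it assumes the per-expert bias is added to the routing score before the top-1 choice, the gains `kp`, `ki`, and `kd` are hypothetical values, and the MLP router and EDA smoothing are omitted.

```python
from typing import List


def top1_route(logits: List[float], biases: List[float]) -> int:
    """Pick the single expert with the highest biased routing score."""
    scores = [l + b for l, b in zip(logits, biases)]
    return max(range(len(scores)), key=scores.__getitem__)


class PIDBiasController:
    """Nudges per-expert biases so utilization drifts toward a uniform target."""

    def __init__(self, n_experts: int, kp: float = 0.1, ki: float = 0.01, kd: float = 0.0):
        self.kp, self.ki, self.kd = kp, ki, kd      # hypothetical gains
        self.biases = [0.0] * n_experts
        self._integral = [0.0] * n_experts
        self._prev_err = [0.0] * n_experts

    def update(self, counts: List[int]) -> List[float]:
        """Update biases from the number of tokens each expert received."""
        total = sum(counts)
        if total == 0:
            return self.biases
        target = 1.0 / len(counts)
        for e, count in enumerate(counts):
            err = target - count / total            # positive when underused
            self._integral[e] += err
            deriv = err - self._prev_err[e]
            self.biases[e] += self.kp * err + self.ki * self._integral[e] + self.kd * deriv
            self._prev_err[e] = err
        return self.biases


ctrl = PIDBiasController(n_experts=4)
ctrl.update([70, 10, 10, 10])   # expert 0 is overloaded, so its bias goes down
```

After the update, a token whose raw logits slightly favor expert 0 can be redirected to an underused expert, which is the balancing effect the paragraph describes.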

Learned residual scaling

The third architectural ingredient is learned residual scaling, in which each residual connection in the network has a small learned coefficient that controls how much the residual stream norm grows through depth. The mechanism adds approximately 4 x L x D parameters, where L is the number of layers and D is the hidden dimension, which is negligible relative to the model's total parameter count and FLOPs. Zyphra reports that the technique reduces residual norm blow-up at deeper layers, stabilizing training of deeper MoEs without requiring careful initialization tricks.^[2]
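Plugging the published layer count and hidden dimension into the 4 x L x D estimate shows how small the overhead is. The snippet is plain arithmetic on figures from the article; the variable names are illustrative.

```python
LAYERS, HIDDEN = 40, 2048      # L and D from the architecture description
TOTAL_PARAMS = 8.4e9           # total parameter count

residual_scale_params = 4 * LAYERS * HIDDEN
share = residual_scale_params / TOTAL_PARAMS

print(residual_scale_params)   # 327680
print(f"{share:.4%}")          # share of total parameters, about 0.004%
```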

Architectural comparison with other MoEs

Model	Total parameters	Active parameters	Routing	Attention type	Open weights
ZAYA1-8B	8.4B	0.76B	Top-1, MLP router	CCA, 8x KV compression	Yes (Apache 2.0)
DeepSeek-V3.2	671B	37B	Top-K with shared expert	MLA	Yes
DeepSeek-R1	671B	37B	Top-K with shared expert	MLA	Yes
Mistral-Small-4	119B	~22B	Linear router	GQA	Yes
Qwen3-Next-80B-A3B	80B	3B	Linear router	GQA	Yes
Mixtral 8x7B	47B	13B	Top-2 linear	GQA	Yes

Training

Hardware: AMD MI300X cluster

ZAYA1-8B was pretrained, midtrained, and supervised fine-tuned exclusively on AMD hardware. The primary training cluster was built jointly with IBM and consisted of 1,024 AMD Instinct MI300X accelerators connected by AMD Pensando Pollara 400 networking.^[5] Each MI300X carries 192 GB of HBM3, more than double the 80 GB offered by an NVIDIA H100, which provides meaningful memory headroom for MoE training, where weights and activations from many experts must coexist.^[5] Zyphra also reported that an earlier 1,024-MI300X reference cluster delivered more than 750 PFLOPs of real-world training throughput on dense workloads.^[10]

Neither the technical report nor the company blog claims a specific aggregate FLOP count for ZAYA1-8B, but the absence of NVIDIA accelerators in the training stack is repeatedly emphasized in press materials and in coverage by VentureBeat, MarkTechPost, and AIWire.^[7]^[8]^[9] The training stack relied on a modified Megatron fork compiled for ROCm together with custom distributed-training kernels and Zyphra's own optimization runtime.^[5] Zyphra has stated that the total training spend was funded out of its 100 million USD Series A and was therefore well under 100 million USD, a notable figure for a model competitive with much larger frontier systems.^[1]

Training hardware and compute details

Item	Value
Accelerator	AMD Instinct MI300X (192 GB HBM3)
Cluster size	1,024 GPUs
Networking	AMD Pensando Pollara 400
Infrastructure partner	IBM (custom cluster on IBM Cloud)
Software stack	ROCm, modified Megatron fork, custom Zyphra kernels
Precision	BF16 with selective FP32 upcasting
NVIDIA usage	None (entire pretraining, midtraining, SFT, RL on AMD)
Funding round	100M USD Series A (June 2025)

Data and pretraining stages

The pretraining recipe is divided into multiple sequential phases at increasing context lengths.

Phase	Tokens	Context length	Focus
Base pretraining 1	8 trillion	4K	General web and code
Base pretraining 2	4 trillion	4K	Higher-quality and reasoning data
Reasoning midtraining	1.2 trillion	32K	Long chain-of-thought reasoning
Supervised fine-tuning	660 billion	131K	Long-context SFT

The two base pretraining phases therefore total approximately 12 trillion tokens, followed by an additional 1.2 trillion midtraining tokens at 32K context and 660 billion SFT tokens at 131K context, for roughly 14 trillion tokens across all four phases.^[2] The reasoning midtrain mix is dominated by long chain-of-thought reasoning (86.1%), supplemented by web and synthetic data (5.7%), code (3.0%), math and STEM (3.0%), short instruction data (1.4%), and long-context data (0.8%).^[2] A novel preprocessing step called Answer-Preserving Trimming truncates the middle of long reasoning traces while retaining final answers, allowing the model to ingest long-CoT examples even during short-context pretraining.^[2]
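The idea behind trimming the middle of a trace can be sketched in a few lines. This is an illustration of the general technique, not the procedure from the technical report: the function name `trim_middle`, the `<trimmed>` marker, and the fixed head and tail split are all assumptions.

```python
def trim_middle(tokens, max_len, tail_len, marker="<trimmed>"):
    """Shorten a trace to `max_len` items by cutting its middle.

    The last `tail_len` tokens (assumed to contain the final answer) are
    always kept, along with as much of the opening as still fits.
    """
    if len(tokens) <= max_len:
        return list(tokens)
    if not 1 <= tail_len < max_len - 1:
        raise ValueError("tail_len must leave room for a head token and the marker")
    head_len = max_len - tail_len - 1          # one slot is used by the marker
    return list(tokens[:head_len]) + [marker] + list(tokens[-tail_len:])


trace = list(range(100))                        # stand-in for a long reasoning trace
short = trim_middle(trace, max_len=10, tail_len=4)
print(short)   # [0, 1, 2, 3, 4, '<trimmed>', 96, 97, 98, 99]
```

Keeping the tail intact is what preserves the final answer, so a trace far longer than the 4K pretraining context can still contribute a complete problem statement and conclusion.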

Post-training reinforcement learning cascade

ZAYA1-8B uses a reinforcement learning cascade after SFT, listed below as five stages (four if the two Math + Code + TTC phases are counted as one), building on the verifiable-reward paradigm popularized by DeepSeek-R1.^[2]

Stage	Steps	Data	Reward
Reasoning warmup	232	Math, puzzles, test-time compute traces	Verifiable task reward
RLVE-Gym curriculum	400	400 adaptive task environments	Environment verifier
Math + Code + TTC Phase 1	384	General math, code, TTC	Verifiable task reward
Math + Code + TTC Phase 2	464	Code-focused mix	Verifiable task reward
Behavioral RL	384	Chat, instruction-following	Reward model score

Key infrastructural choices in the RL phase include PipelineRL for asynchronous rollout and trainer separation, a Dr-GRPO sequence-mean over token-sum-norm loss aggregation, MaxRL advantage estimation normalized by mean reward, a DPPO Binary-TV trust region with delta 0.1 instead of an explicit KL penalty, a momentum-free Muon optimizer for actors with AdamW for embeddings, and router replay where the trainer reuses the routing decisions made during vLLM rollout to prevent the MoE logit mismatch problem. Zyphra also documents a streaming compressibility canary based on LZ77 that flags degenerate, repetitive rollouts before they can poison the reward signal.^[2]
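A compressibility check of this kind can be approximated with Python's standard `zlib` module, whose DEFLATE format uses LZ77 matching. The sketch below is not Zyphra's canary: the class name, the 0.2 ratio threshold, and the 512-byte warm-up are hypothetical choices made for illustration.

```python
import zlib


class CompressibilityCanary:
    """Flags a rollout as degenerate when its text compresses too well."""

    def __init__(self, threshold: float = 0.2, min_bytes: int = 512):
        self.threshold = threshold          # hypothetical ratio cutoff
        self.min_bytes = min_bytes          # avoid judging very short prefixes
        self._comp = zlib.compressobj(level=6)
        self.bytes_in = 0
        self.bytes_out = 0

    def feed(self, chunk: str) -> bool:
        """Add newly generated text; return True once the stream looks degenerate."""
        data = chunk.encode("utf-8")
        # Z_SYNC_FLUSH emits everything pending so the running ratio is current.
        out = self._comp.compress(data) + self._comp.flush(zlib.Z_SYNC_FLUSH)
        self.bytes_in += len(data)
        self.bytes_out += len(out)
        return self.is_degenerate()

    def ratio(self) -> float:
        """Compressed size divided by raw size (lower means more repetitive)."""
        return self.bytes_out / self.bytes_in if self.bytes_in else 1.0

    def is_degenerate(self) -> bool:
        return self.bytes_in >= self.min_bytes and self.ratio() < self.threshold


canary = CompressibilityCanary()
for _ in range(20):
    looping = canary.feed("the answer is 4. " * 16)   # a rollout stuck in a loop
```

Because the compressor keeps its state between calls, each chunk is judged against everything generated so far, which is what lets the check run while a rollout is still streaming.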

The combined RL stack delivers substantial gains over the 131K SFT checkpoint:

Benchmark	After SFT	After RL	Gain
AIME 2026	68.3	89.1	+20.8
HMMT February 2026	39.2	71.6	+32.4
LiveCodeBench v6	54.8	64.8	+10.0
IFBench	30.2	52.6	+22.4

Markovian recursive self-aggregation

For extended test-time compute, Zyphra introduces Markovian Recursive Self-Aggregation, abbreviated Markovian RSA. The method generates N parallel candidate reasoning traces, each with a fixed per-candidate budget of beta tokens. After each round, the candidates are compressed into tau-token tail summaries, and the next round generates an improved solution conditioned on a small subset of those tails. The key property is Markovian: between rounds the model carries forward only bounded-length tails rather than full reasoning traces, so the working context length remains bounded even as the total amount of test-time compute grows. The default configuration uses beta = 40K tokens per candidate, tau = 4K-token tails, T = 2 aggregation rounds, N = 16 parallel candidates, and C = 4 tails aggregated per iteration.^[2]
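The round structure described above can be sketched as a short loop. This is a simplified illustration under several assumptions: `generate` is a hypothetical callable standing in for the model, a tail is taken to be the last tau tokens of a candidate, tails are chosen uniformly at random, and an initial round precedes the T aggregation rounds. The source does not specify how the final answer is selected, so the sketch simply returns the last round's candidates.

```python
import random
from typing import Callable, List

Tokens = List[int]
Generate = Callable[[Tokens, int], Tokens]   # (context, max_new_tokens) -> new tokens


def markovian_rsa(prompt: Tokens, generate: Generate, n: int = 16, c: int = 4,
                  rounds: int = 2, beta: int = 40_000, tau: int = 4_000,
                  seed: int = 0) -> List[Tokens]:
    """Run an initial round plus `rounds` aggregation rounds; return final candidates."""
    rng = random.Random(seed)
    # Initial round: N independent candidates from the bare prompt.
    candidates = [generate(list(prompt), beta) for _ in range(n)]
    for _ in range(rounds):
        # Markovian step: only the last tau tokens of each candidate survive.
        tails = [cand[-tau:] for cand in candidates]
        next_candidates = []
        for _ in range(n):
            context = list(prompt)
            for tail in rng.sample(tails, c):     # C tails per new candidate
                context.extend(tail)
            next_candidates.append(generate(context, beta))
        candidates = next_candidates
    return candidates


# Tiny stand-in model that records how long each context was.
seen_context_lengths: List[int] = []

def fake_generate(context: Tokens, max_new_tokens: int) -> Tokens:
    seen_context_lengths.append(len(context))
    return list(range(max_new_tokens))

final = markovian_rsa([0] * 5, fake_generate, n=4, c=2, rounds=2, beta=10, tau=3)
```

In this sketch the input context never exceeds the prompt plus C x tau tokens regardless of how many rounds run. With the default configuration that is the prompt plus 16K tokens of tails, to which each candidate adds at most 40K generated tokens.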

This configuration is the source of ZAYA1-8B's headline reasoning numbers. With Markovian RSA at a 40K-token candidate budget and 4K-token tails, the model reaches 91.9% on AIME 2025 (up from 89.1% with single-rollout sampling) and 89.6% on HMMT 2025 (up from 71.6% single-rollout), closing much of the gap to Gemini 2.5 Pro, DeepSeek-V3.2, and GPT-5-High on those benchmarks.^[2] Markovian RSA itself is trained into the model through both an SFT phase, which constructs aggregation traces by reshuffling expert-model rollouts, and RL stages that explicitly train both single-rollout and aggregation modes.

Benchmarks and performance

Zyphra evaluates ZAYA1-8B across mathematics, coding, general reasoning, instruction following, style, and agentic benchmarks. The model is reported as leading its in-class comparison group of roughly 4B active-parameter reasoning models on the math and coding benchmarks, though it trails Qwen3.5-4B on GPQA-Diamond and MMLU-Pro, while remaining competitive with much larger systems on mathematics and coding tasks specifically.^[2]

Single-rollout results

Benchmark	ZAYA1-8B	Qwen3-4B-Thinking	Qwen3.5-4B	Gemma-4-E4B
AIME 2026	89.1	79.0	84.5	50.3
HMMT February 2026	71.6	53.6	63.6	32.1
LiveCodeBench v6	64.8	54.9	55.8	54.2
GPQA-Diamond	71.0	66.1	76.2	57.4
MMLU-Pro	74.2	74.3	79.7	70.2
IMO-AnswerBench	59.3	n/a	n/a	n/a
APEX-shortlist	32.2	n/a	n/a	n/a
IFEval	85.58	n/a	n/a	n/a
IFBench	52.56	n/a	n/a	n/a
EQBench	72.95	n/a	n/a	n/a
Creative Writing v3	62.97	n/a	n/a	n/a
BFCL v4	39.22	n/a	n/a	n/a
Tau-squared	43.12	n/a	n/a	n/a

Test-time compute results

With the default Markovian RSA configuration (40K-token candidates, 4K-token tails) ZAYA1-8B's reasoning scores rise substantially on the two flagship math benchmarks:

Benchmark	Single rollout	Markovian RSA (40K/4K)
AIME 2025	89.1	91.9
HMMT 2025	71.6	89.6

Comparisons with larger models

Model	Active params	Total params	AIME 2026	HMMT Feb 2026	LiveCodeBench v6
ZAYA1-8B	0.76B	8.4B	89.1	71.6	64.8
Nemotron-3-Nano-30B	~3B	30B	90.1	n/a	n/a
Qwen3-Next-80B-A3B-Think	3B	80B	90.2	n/a	n/a
OLMo-3.1-32B-Think	n/a	32B	78.9	n/a	n/a
DeepSeek-R1-0528	37B	671B	~87.5 (AIME 2025)	n/a	n/a
Claude 4.5 Sonnet	undisclosed	undisclosed	n/a	88.3 (HMMT 2025)	n/a
GPT-5-High	undisclosed	undisclosed	n/a	n/a	n/a

In the head-to-head with Claude 4.5 Sonnet on HMMT 2025, ZAYA1-8B with Markovian RSA scores 89.6 versus 88.3 for Sonnet, and on AIME 2025 ZAYA1-8B reaches 91.9 versus DeepSeek-R1-0528's 87.5.^[2] Coverage from MarkTechPost and VentureBeat highlighted the contrast that ZAYA1-8B uses fewer than one billion active parameters compared to 37 billion active parameters in DeepSeek's reasoning models, a roughly 49x difference in active parameters per forward pass (37B versus 0.76B).^[7]^[8]

ZAYA1-VL-8B vision-language model

Approximately one week after the language model release, Zyphra published ZAYA1-VL-8B, a vision-language extension of ZAYA1-8B.^[4] The vision-language variant pairs the ZAYA1-8B language model with a Qwen 2.5-VL vision tower, bringing the total parameter count to approximately 10 billion (the 0.7B vision encoder and 8.4B language model account for about 9.1B of that). The composite system is released under the same Apache 2.0 license and ships through the Zyphra fork of the Hugging Face Transformers library.^[4]

Two architectural innovations distinguish ZAYA1-VL-8B from naive bolted-on vision tower designs. First, vision-specific LoRA parameters are attached to the MLP and CCA weights of every layer and are activated only when the token being processed is a vision token. This gives the model a dedicated visual processing pathway without doubling the parameter count or requiring a separate vision-LLM training pass. Second, image tokens use bidirectional attention masks rather than causal masks, reflecting the non-causal nature of 2D visual data.^[4]
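The masking rule can be made concrete with a small boolean matrix. The sketch assumes that bidirectional attention applies only among tokens of the same contiguous image span, while everything else stays causal; the source says only that image tokens use bidirectional masks, so the span restriction and the function name are assumptions.

```python
from typing import List, Optional


def build_attention_mask(is_image: List[bool]) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j."""
    n = len(is_image)
    # Give each contiguous run of image tokens its own span id.
    span_ids: List[Optional[int]] = []
    current, prev = -1, False
    for flag in is_image:
        if flag and not prev:
            current += 1
        span_ids.append(current if flag else None)
        prev = flag
    # Start from an ordinary causal mask.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    # Let image tokens also look ahead within their own span.
    for i in range(n):
        for j in range(i + 1, n):
            if span_ids[i] is not None and span_ids[i] == span_ids[j]:
                mask[i][j] = True
    return mask


# Sequence layout: text, image, image, text
mask = build_attention_mask([False, True, True, False])
```

Here the first image token can attend to the second one even though it comes later, while the text tokens keep strictly causal visibility.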

Reported vision benchmark results include 87.5 on AI2D test, 92.5 on DocVQA test, 74.4 on TextVQA validation, 80.0 on VQA v2.0 validation, 64.0 on MathVista mini, and 46.0 on MMMU validation.^[4] Zyphra reports that on these benchmarks the model outperforms comparable-size vision-language systems and several larger ones on efficiency-adjusted metrics, while trailing larger specialists such as InternVL3.5-20B on some tasks.^[4]

Deployment and tooling

ZAYA1-8B is distributed via Hugging Face with model weights, tokenizer, and configuration available at huggingface.co/Zyphra/ZAYA1-8B.^[3] Because the model uses novel CCA attention and the ZAYA1 router, it is not natively supported by mainline Hugging Face Transformers or vLLM at release; Zyphra maintains forks of both projects with ZAYA1 support, installable via pip directly from the company's GitHub. A typical vLLM launch command uses the qwen3 reasoning parser, the custom zaya_xml tool-call parser, BF16 precision, and an FP32 Mamba cache (a leftover convenience flag from earlier Zyphra hybrid models). Beyond Hugging Face, the model is available as a free serverless endpoint on Zyphra Cloud at cloud.zyphra.com, and the company has published quantized GGUF builds for llama.cpp, Ollama, LM Studio, and Jan to support local on-device inference.^[3]
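A launch command along the lines described above might be assembled as follows. This is a hedged sketch, not the command from the model card: it assumes Zyphra's vLLM fork is installed, uses standard vLLM option names, picks an arbitrary port, and leaves out the FP32 Mamba cache flag because the source does not give its exact spelling.

```python
import shlex
from typing import List


def vllm_serve_command(model: str = "Zyphra/ZAYA1-8B", port: int = 8000) -> List[str]:
    """Build an argv list for serving the model with Zyphra's vLLM fork."""
    return [
        "vllm", "serve", model,
        "--port", str(port),                  # arbitrary port for illustration
        "--dtype", "bfloat16",                # BF16 precision
        "--reasoning-parser", "qwen3",        # reasoning parser named in the article
        "--enable-auto-tool-choice",          # vLLM needs this for tool-call parsing
        "--tool-call-parser", "zaya_xml",     # custom parser; requires the fork
        # FP32 Mamba cache flag omitted: its exact name is not given in the source.
    ]


command = vllm_serve_command()
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it can be handed to `subprocess.run` without shell quoting issues.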

Reception and significance

Reception of ZAYA1-8B in technical media in May 2026 focused on two themes: efficiency relative to active parameter count, and the demonstration of frontier training on AMD hardware. VentureBeat headlined its coverage "Meet ZAYA1-8B, a super efficient, open reasoning model trained on AMD Instinct MI300 GPUs," emphasizing the dual achievement of small active parameter count and non-NVIDIA training.^[7] MarkTechPost described the model as "a reasoning MoE trained on AMD hardware that punches far above its weight class,"^[8] and AIWire framed the release as a proof point for AMD's ability to host frontier model development.^[9]

For AMD, the release was a strategically important demonstration that frontier MoE training is feasible at scale on the MI300X platform with the ROCm software stack, providing a counterweight to the perception that production-scale AI training is exclusively a CUDA ecosystem activity. For IBM, the deployment served as a showcase for its custom AI training cluster offering. For Zyphra, the release positioned the company as a research-driven challenger optimizing for parameter efficiency and open licensing rather than scale, in contrast to closed-weight frontier labs and to compute-intensive open releases like DeepSeek-V3.2.

The broader significance of ZAYA1-8B lies in three areas. First, the model substantiates the claim that small active parameter counts plus extended test-time compute can match dense or heavily activated MoE reasoning models on math and coding benchmarks, supporting an architectural direction sometimes described as small-active sparse compute. Second, the model establishes Markovian RSA as a credible alternative to long single-trace chain-of-thought scaling, by keeping working context bounded as compute grows. Third, it offers an existence proof that frontier-class reasoning models can be developed without exclusive dependence on a single hardware vendor.

Limitations

Zyphra acknowledges several limitations of ZAYA1-8B in the technical report. Agentic benchmark scores are modest, with BFCL v4 at 39.2% and tau-squared at 43.1%, reflecting the absence of a dedicated multi-turn tool-use RL stage during post-training.^[2] The model trails larger general-knowledge systems on MMLU-Pro (74.2) and on certain knowledge-heavy GPQA-Diamond subsets, indicating that intelligence density gains are concentrated in reasoning rather than broad recall. Test-time compute validation is limited to comparisons against open-weight and select proprietary systems, and per-stage ablations of the RL cascade are not provided. The custom CCA attention and ZAYA1 router require Zyphra's forked inference stacks at release, which introduces some operational friction relative to fully mainline architectures.

References

1. Zyphra. "ZAYA1-8B: Frontier intelligence density, trained on AMD." Zyphra blog, May 6, 2026. https://www.zyphra.com/post/zaya1-8b
2. Zyphra. "ZAYA1-8B Technical Report." arXiv:2605.05365, May 2026. https://arxiv.org/abs/2605.05365
3. Zyphra. "ZAYA1-8B" model card. Hugging Face. https://huggingface.co/Zyphra/ZAYA1-8B
4. Zyphra. "ZAYA1-VL-8B" model card. Hugging Face. https://huggingface.co/Zyphra/ZAYA1-VL-8B
5. Zyphra. "ZAYA1: Pretraining on Integrated AMD Platform." Zyphra blog. https://www.zyphra.com/post/zaya1
6. PR Newswire. "Zyphra Releases ZAYA1-8B, a Reasoning Model trained on AMD and Optimized for Maximum Intelligence Density per Parameter." May 6, 2026. https://www.prnewswire.com/news-releases/zyphra-releases-zaya1-8b-a-reasoning-model-trained-on-amd-and-optimized-for-maximum-intelligence-density-per-parameter-302764700.html
7. VentureBeat. "Meet ZAYA1-8B, a super efficient, open reasoning model trained on AMD Instinct MI300 GPUs." May 6, 2026. https://venturebeat.com/technology/meet-zaya1-8b-a-super-efficient-open-reasoning-model-trained-on-amd-instinct-mi300-gpus
8. MarkTechPost. "Zyphra Releases ZAYA1-8B: A Reasoning MoE Trained on AMD Hardware That Punches Far Above Its Weight Class." May 6, 2026. https://www.marktechpost.com/2026/05/06/zyphra-releases-zaya1-8b-a-reasoning-moe-trained-on-amd-hardware-that-punches-far-above-its-weight-class/
9. AIWire / HPCwire. "Zyphra Releases ZAYA1-8B Reasoning Model." May 7, 2026. https://www.hpcwire.com/aiwire/2026/05/07/zyphra-releases-zaya1-8b-reasoning-model/
10. AMD. "Zyphra Demonstrates Large Scale Training on AMD with ZAYA1." AMD blog. https://www.amd.com/en/blogs/2025/zyphra-demonstrates-large-scale-training-on-amd-with-zaya1.html
11. BuildFastWithAI. "ZAYA1-8B: The Efficient MoE Reasoning Model Explained (2026)." https://www.buildfastwithai.com/blogs/zaya1-8b-reasoning-model-2026
12. Let's Data Science. "Zyphra ZAYA1-8B: AMD-Trained Reasoning Model Beats Claude on Math (2026)." https://letsdatascience.com/blog/zaya1-8b-amd-mi300x-claude-sonnet-math
