ZAYA1-8B
Last reviewed
May 16, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 3,505 words
ZAYA1-8B is an open-weight, reasoning-focused Mixture-of-Experts (MoE) large language model released by San Francisco-based AI research lab Zyphra on May 6, 2026. The model has 8.4 billion total parameters with only 760 million active parameters per token, and was the first frontier-class reasoning model pretrained entirely on AMD Instinct MI300X hardware without any NVIDIA accelerators in the training stack. ZAYA1-8B is distributed under the Apache 2.0 license on Hugging Face, alongside a free serverless endpoint on Zyphra Cloud and a multimodal vision-language variant called ZAYA1-VL-8B.
Designed to maximize what Zyphra calls "intelligence density per parameter," ZAYA1-8B introduces three core architectural innovations on top of the standard MoE Transformer recipe: Compressed Convolutional Attention (CCA) that reduces key-value cache memory by roughly 8x, an MLP-based expert router replacing the conventional linear projection, and learned residual scaling for stable deep-network training. Combined with a novel test-time compute method called Markovian Recursive Self-Aggregation (Markovian RSA), the model reaches 91.9% on AIME 2025 and 89.6% on HMMT 2025, competitive with much larger frontier reasoning systems including DeepSeek-V3.2 and GPT-5-High while remaining small enough to deploy on a single accelerator.
ZAYA1-8B is the first model in Zyphra's ZAYA family, succeeding the company's earlier Zamba and Zamba2 hybrid state-space-attention models. Where Zamba was a dense hybrid architecture, ZAYA1 returns to a pure MoE Transformer skeleton but augments it with several efficiency-oriented modifications collectively branded as the MoE++ architecture. The model targets the same niche as small reasoning specialists such as Qwen3-4B-Thinking, Gemma-4-E4B, and Nemotron-3-Nano-30B, while attempting to match the math, code, and reasoning quality of much larger systems like DeepSeek-R1, Mistral-Small-4-119B, and Gemini 2.5 Pro.
The model is reasoning-tuned, meaning that responses include explicit chain-of-thought traces before delivering a final answer. Both a base checkpoint (Zyphra/ZAYA1-base) and a reasoning base (Zyphra/ZAYA1-reasoning-base) are available in addition to the final post-trained chat model. Sampling defaults recommended by Zyphra are temperature 1.0 with top-p 0.95 for general use and temperature 0.6 with top-p 0.95 for agentic and code workloads.
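The two recommended presets can be illustrated with a small sketch of temperature scaling followed by nucleus (top-p) filtering. This is a generic plain-Python illustration of those two standard sampling steps, not Zyphra's inference code; the preset keys and helper names are hypothetical.

```python
import math

# Values come from the recommendations above; the dictionary keys are illustrative names.
SAMPLING_PRESETS = {
    "general": {"temperature": 1.0, "top_p": 0.95},
    "agentic_code": {"temperature": 0.6, "top_p": 0.95},
}

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_indices(probs, top_p):
    """Return the smallest set of token indices whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept
```

With the same logits, the 0.6 temperature preset concentrates more probability on the top-ranked token than the 1.0 preset, which is consistent with recommending it for code and tool-calling workloads where determinism matters more than diversity.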
Zyphra was founded in 2020 by Krithik Puthalath, Tomas Figliolia, Beren Millidge, and Danny Martinelli, with the company describing itself as a superintelligence research and product lab. After raising an 11 million USD seed round in 2023 with participation from Intel Capital, Future Ventures, and Bison, the company closed a 100 million USD Series A in June 2025 led by Jaan Tallinn at a one billion USD post-money valuation. Tallinn was previously an early backer of DeepMind and Anthropic.
Zyphra's prior public models included Zamba (April 2024), a 7B hybrid state-space and attention foundation model, and the smaller Zamba2-2.7B aimed at efficient on-device inference. ZAYA1-8B was announced jointly with AMD and IBM on May 6, 2026, alongside a technical report on arXiv (2605.05365); the vision-language extension ZAYA1-VL-8B followed roughly one week later. Founder and CEO Krithik Puthalath framed the release as a demonstration of co-designed architecture and training: "ZAYA1-8B demonstrates what is possible when architecture, pretraining, and reinforcement learning are co-designed toward a single objective: maximizing the intelligence extracted per parameter and per FLOP."
ZAYA1-8B uses a 40-layer Transformer decoder with a hidden dimension of 2048 and 16 routed experts per MoE layer. Routing is top-1, meaning each token activates exactly one expert per MoE layer, with no always-on shared expert. The tokenizer is borrowed from Gemma 3 and has a vocabulary of 262,272 tokens. Weights are stored in bfloat16, with selective FP32 upcasting on the language-model head, attention norms, routing logits, and residual connections.
| Component | Value |
|---|---|
| Total parameters | 8.4 billion |
| Active parameters per token | 760 million |
| Transformer layers | 40 |
| Hidden dimension | 2048 |
| Experts per MoE layer | 16 |
| Expert routing | Top-1 |
| Expert FFN width | 4096 pre-activation, 2048 post-activation |
| Attention | CCGQA with CCA (2x query, 8x KV compression) |
| Vocabulary | 262,272 (Gemma 3 tokenizer) |
| Precision | BF16 with selective FP32 |
The single most prominent architectural innovation in ZAYA1-8B is Compressed Convolutional Attention, abbreviated CCA. Standard multi-head attention scales the key-value (KV) cache linearly with both context length and the number of heads, which dominates memory and bandwidth costs at long contexts and is the primary bottleneck for decoding throughput. CCA performs sequence mixing in a compressed latent space and combines compression on both the query and KV pathways. In ZAYA1-8B the configuration delivers a 2x reduction in query projection size and an 8x reduction in KV-cache size compared with full multi-head attention at equivalent quality. In practical terms a conversation or document that would normally require around 8 GB of KV-cache memory shrinks to roughly 1 GB without a meaningful accuracy penalty, which is what allows an 8.4B-parameter MoE to run long reasoning traces on a single mid-range accelerator. The CCA design is closely related to the Multi-head Latent Attention used in DeepSeek's models but is presented by Zyphra as a distinct convolutional variant.
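The 8 GB to 1 GB example can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes an uncompressed baseline in which keys and values each span the full 2048-wide hidden dimension in BF16; that baseline layout and the function name are illustrative assumptions, not figures from the technical report.

```python
def kv_cache_bytes(seq_len, layers=40, kv_dim=2048, bytes_per_value=2, compression=1):
    """Bytes needed to cache keys and values for one sequence.

    layers=40 and the BF16 width (2 bytes) come from the article; kv_dim=2048
    assumes a baseline whose K and V each span the full hidden dimension.
    """
    keys_and_values = 2  # one K vector and one V vector per token per layer
    return keys_and_values * layers * (kv_dim // compression) * seq_len * bytes_per_value

# About 26K tokens fills roughly 8 GiB under the assumed baseline layout.
baseline = kv_cache_bytes(seq_len=26_214)
compressed = kv_cache_bytes(seq_len=26_214, compression=8)
print(f"baseline:   {baseline / 2**30:.2f} GiB")
print(f"compressed: {compressed / 2**30:.2f} GiB")
```

Under these assumptions the cache grows by 320 KiB per token without compression and by 40 KiB per token with 8x compression, so the saving scales linearly with context length.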
The ZAYA1 router departs from the standard linear top-k softmax router used in most MoE models. Instead it uses a small MLP (multi-layer perceptron) followed by Exponential Depth Averaging (EDA) that smooths routing decisions across consecutive layers. Together these changes are reported to improve expert balancing, expert specialization, and entropy recovery during training compared with a linear router baseline. Load balance is additionally enforced through a PID-controller-style bias adjustment, which dynamically nudges per-expert biases up or down to drive average utilization toward the target. The router enables stable training with top-1 routing, which is more compute-efficient than the top-2 routing common in earlier MoE designs.
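The bias-based load balancing described above can be sketched as follows. This is a deliberately simplified, proportional-only controller (the integral and derivative terms of a full PID loop are omitted), and it takes router scores as given rather than modeling the MLP or EDA; the function names, the uniform utilization target, and the gain value are all assumptions made for illustration.

```python
def route_top1(scores, biases):
    """Pick the single expert with the highest bias-adjusted router score."""
    adjusted = [s + b for s, b in zip(scores, biases)]
    return max(range(len(adjusted)), key=adjusted.__getitem__)

def update_biases(biases, token_counts, gain=1.0):
    """Proportional-only step: raise biases of underused experts, lower overused ones."""
    total = sum(token_counts)
    target = 1.0 / len(biases)  # uniform utilization target (assumption)
    return [b + gain * (target - n / total) for b, n in zip(biases, token_counts)]

# Two experts; expert 0 received 3 of the last 4 tokens, so it is overused.
biases = update_biases([0.0, 0.0], token_counts=[3, 1])
print(biases)                           # [-0.25, 0.25]
print(route_top1([1.0, 0.9], biases))   # 1: the bias now tips a close call to expert 1
```

Because the bias only shifts which expert wins the argmax, this style of balancing steers utilization without adding an auxiliary loss term to the training objective.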
The third architectural ingredient is learned residual scaling, in which each residual connection in the network has a small learned coefficient that controls how much the residual stream norm grows through depth. The mechanism adds approximately 4 × L × D parameters, where L is the number of layers and D is the hidden dimension, a negligible addition to both parameter count and total FLOPs. Zyphra reports that the technique reduces residual norm blow-up at deeper layers, stabilizing training of deeper MoEs without requiring careful initialization tricks.
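Plugging the published layer count and hidden dimension into the 4 × L × D estimate shows how small the overhead is. The helper name below is hypothetical; the inputs are the values from the architecture table.

```python
def residual_scaling_params(layers, hidden_dim, vectors_per_layer=4):
    """Parameter overhead of learned residual scaling under the 4 x L x D estimate."""
    return vectors_per_layer * layers * hidden_dim

extra = residual_scaling_params(layers=40, hidden_dim=2048)
share = extra / 8.4e9  # fraction of the 8.4B total parameters
print(extra)            # 327680
print(f"{share:.4%}")   # 0.0039%
```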
| Model | Total parameters | Active parameters | Routing | Attention type | Open weights |
|---|---|---|---|---|---|
| ZAYA1-8B | 8.4B | 0.76B | Top-1, MLP router | CCA, 8x KV compression | Yes (Apache 2.0) |
| DeepSeek-V3.2 | 671B | 37B | Top-K with shared expert | MLA | Yes |
| DeepSeek-R1 | 671B | 37B | Top-K with shared expert | MLA | Yes |
| Mistral-Small-4 | 119B | ~22B | Linear router | GQA | Yes |
| Qwen3-Next-80B-A3B | 80B | 3B | Linear router | GQA | Yes |
| Mixtral 8x7B | 47B | 13B | Top-2 linear | GQA | Yes |
ZAYA1-8B was pretrained, midtrained, and supervised fine-tuned exclusively on AMD hardware. The primary training cluster was built jointly with IBM and consisted of 1,024 AMD Instinct MI300X accelerators connected by AMD Pensando Pollara 400 networking. Each MI300X carries 192 GB of HBM3, more than double the 80 GB offered by an NVIDIA H100, which provides meaningful memory headroom for MoE training, where weights and activations from many experts must coexist. Zyphra also reported that an earlier 1,024-MI300X reference cluster delivered more than 750 PFLOPs of real-world training throughput on dense workloads.
Neither the technical report nor the company blog claims a specific aggregate FLOP count for ZAYA1-8B, but the absence of NVIDIA accelerators in the training stack is repeatedly emphasized in press materials and in coverage by VentureBeat, MarkTechPost, and AIWire. The training stack relied on a modified Megatron fork compiled for ROCm together with custom distributed-training kernels and Zyphra's own optimization runtime. Zyphra has stated that the total training spend was funded out of its 100 million USD Series A and was therefore well under 100 million USD, a notable figure for a model competitive with much larger frontier systems.
| Item | Value |
|---|---|
| Accelerator | AMD Instinct MI300X (192 GB HBM3) |
| Cluster size | 1,024 GPUs |
| Networking | AMD Pensando Pollara 400 |
| Infrastructure partner | IBM (custom cluster on IBM Cloud) |
| Software stack | ROCm, modified Megatron fork, custom Zyphra kernels |
| Precision | BF16 with selective FP32 upcasting |
| NVIDIA usage | None (entire pretraining, midtraining, SFT, RL on AMD) |
| Funding round | 100M USD Series A (June 2025) |
The pretraining recipe is divided into multiple sequential phases at increasing context lengths.
| Phase | Tokens | Context length | Focus |
|---|---|---|---|
| Base pretraining 1 | 8 trillion | 4K | General web and code |
| Base pretraining 2 | 4 trillion | 4K | Higher-quality and reasoning data |
| Reasoning midtraining | 1.2 trillion | 32K | Long chain-of-thought reasoning |
| Supervised fine-tuning | 660 billion | 131K | Long-context SFT |
The base ZAYA1 checkpoint was therefore trained on approximately 14 trillion total tokens, with an additional 1.2 trillion midtraining tokens at 32K context and 660 billion SFT tokens at 131K context. The reasoning midtrain mix is dominated by long chain-of-thought reasoning (86.1%), supplemented by web and synthetic data (5.7%), code (3.0%), math and STEM (3.0%), short instruction data (1.4%), and long-context data (0.8%). A novel preprocessing step called Answer-Preserving Trimming truncates the middle of long reasoning traces while retaining final answers, allowing the model to ingest long-CoT examples even during short-context pretraining.
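The idea behind Answer-Preserving Trimming can be sketched as a middle-truncation over a token list. The source does not specify how the kept head and tail are sized, so the split below (a fixed-length tail assumed to contain the final answer, with the remaining budget given to the head) and the function name are illustrative assumptions.

```python
def trim_middle(tokens, max_len, tail_len):
    """Drop tokens from the middle so the result fits in max_len.

    Keeps the first (max_len - tail_len) tokens and the last tail_len tokens,
    where the tail is assumed to contain the final answer.
    """
    if tail_len > max_len:
        raise ValueError("tail_len must not exceed max_len")
    if len(tokens) <= max_len:
        return list(tokens)  # already fits; nothing to trim
    head_len = max_len - tail_len
    tail = tokens[len(tokens) - tail_len:]
    return list(tokens[:head_len]) + list(tail)

trace = list(range(10))          # stand-in for a 10-token reasoning trace
print(trim_middle(trace, max_len=6, tail_len=2))  # [0, 1, 2, 3, 8, 9]
```

Keeping the opening of the trace preserves the problem statement and initial approach, while keeping the tail preserves the answer that the training signal depends on.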
After SFT, ZAYA1-8B goes through a five-stage reinforcement learning cascade (four if the two Math + Code + TTC phases are counted as one), building on the verifiable-reward paradigm popularized by DeepSeek-R1.
| Stage | Steps | Data | Reward |
|---|---|---|---|
| Reasoning warmup | 232 | Math, puzzles, test-time compute traces | Verifiable task reward |
| RLVE-Gym curriculum | 400 | 400 adaptive task environments | Environment verifier |
| Math + Code + TTC Phase 1 | 384 | General math, code, TTC | Verifiable task reward |
| Math + Code + TTC Phase 2 | 464 | Code-focused mix | Verifiable task reward |
| Behavioral RL | 384 | Chat, instruction-following | Reward model score |
Key infrastructural choices in the RL phase include PipelineRL for asynchronous rollout and trainer separation, a Dr-GRPO sequence-mean over token-sum-norm loss aggregation, MaxRL advantage estimation normalized by mean reward, a DPPO Binary-TV trust region with delta 0.1 instead of an explicit KL penalty, a momentum-free Muon optimizer for actors with AdamW for embeddings, and router replay where the trainer reuses the routing decisions made during vLLM rollout to prevent the MoE logit mismatch problem. Zyphra also documents a streaming compressibility canary based on LZ77 that flags degenerate, repetitive rollouts before they can poison the reward signal.
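A compressibility check of this kind can be approximated with Python's `zlib` module, whose DEFLATE format combines LZ77 matching with Huffman coding. The sketch below scores a whole rollout at once rather than streaming, and the 0.2 threshold and function names are hypothetical choices for illustration, not values from the report.

```python
import zlib

def compression_ratio(text):
    """Compressed size divided by raw size; values near 0 mean highly repetitive text."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0  # treat empty output as incompressible rather than degenerate
    return len(zlib.compress(raw)) / len(raw)

def looks_degenerate(text, threshold=0.2):
    """Flag a rollout whose text compresses far better than ordinary prose would."""
    return compression_ratio(text) < threshold

looping = "the answer is 4. " * 500
normal = "Let x be the number of apples; then 3x + 2 = 17, so x = 5."
print(looks_degenerate(looping))  # True
print(looks_degenerate(normal))   # False
```

A streaming variant could feed chunks through `zlib.compressobj` as tokens arrive and compare running compressed and raw byte counts, so that a looping rollout is flagged before it completes.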
The combined RL stack delivers substantial gains over the 131K SFT checkpoint:
| Benchmark | After SFT | After RL | Gain |
|---|---|---|---|
| AIME 2026 | 68.3 | 89.1 | +20.8 |
| HMMT February 2026 | 39.2 | 71.6 | +32.4 |
| LiveCodeBench v6 | 54.8 | 64.8 | +10.0 |
| IFBench | 30.2 | 52.6 | +22.4 |
For extended test-time compute, Zyphra introduces Markovian Recursive Self-Aggregation, abbreviated Markovian RSA. The method generates N parallel candidate reasoning traces, each with a fixed per-candidate budget of beta tokens. After each round, the candidates are compressed into tau-token tail summaries, and the next round generates an improved solution conditioned on a small subset of those tails. The key property is Markovian: between rounds the model carries forward only bounded-length tails rather than full reasoning traces, so the working context length remains bounded even as the total amount of test-time compute grows. The default configuration uses beta = 40K tokens per candidate, tau = 4K-token tails, T = 2 aggregation rounds, N = 16 parallel candidates, and C = 4 tails aggregated per iteration.
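The control flow described above can be sketched as a short loop. The `generate` callable is a hypothetical stand-in for a model call, "tail" is taken to mean the last tau tokens of a trace, and each new candidate is assumed to see a random subset of C tails; the source does not pin down these details or how a final answer is selected, so the sketch simply returns the last population.

```python
import random

def markovian_rsa(generate, prompt, n=16, c=4, rounds=2, beta=40_000, tau=4_000, seed=0):
    """Run N parallel candidates for `rounds` aggregation rounds.

    generate(prompt, tails, max_new_tokens) -> list of tokens is a hypothetical
    model interface. Only tau-token tails cross round boundaries, so the context
    passed to `generate` never exceeds the prompt plus c * tau tokens.
    """
    rng = random.Random(seed)
    # Round 0: independent candidates with no prior tails to condition on.
    population = [generate(prompt, [], beta) for _ in range(n)]
    for _ in range(rounds):
        # Markovian step: discard full traces, keep only bounded-length tails.
        tails = [trace[-tau:] for trace in population]
        population = [generate(prompt, rng.sample(tails, c), beta) for _ in range(n)]
    return population

# Toy generator that just emits max_new_tokens placeholder tokens.
def toy_generate(prompt, tails, max_new_tokens):
    return list(range(max_new_tokens))

final = markovian_rsa(toy_generate, prompt=[1, 2, 3], n=4, c=2, rounds=2, beta=10, tau=3)
print(len(final))  # 4
```

With the default settings, each aggregation call carries at most C × tau = 16K tokens of prior reasoning in its context, regardless of how many rounds have been run.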
This configuration is the source of ZAYA1-8B's headline reasoning numbers. With Markovian RSA at a 40K-token candidate budget and 4K-token tails, the model reaches 91.9% on AIME 2025 (up from 89.1% with single-rollout sampling) and 89.6% on HMMT 2025 (up from 71.6% single-rollout), closing much of the gap to Gemini 2.5 Pro, DeepSeek-V3.2, and GPT-5-High on those benchmarks. Markovian RSA itself is trained into the model through both an SFT phase that constructs aggregation traces by reshuffling expert-model rollouts and RL stages that explicitly train both single-rollout and aggregation modes.
Zyphra evaluates ZAYA1-8B across mathematics, coding, general reasoning, instruction following, style, and agentic benchmarks. The model is reported as a clear leader in its in-class comparison group of roughly 4B active-parameter reasoning models, while remaining competitive with much larger systems on mathematics and coding tasks specifically.
| Benchmark | ZAYA1-8B | Qwen3-4B-Thinking | Qwen3.5-4B | Gemma-4-E4B |
|---|---|---|---|---|
| AIME 2026 | 89.1 | 79.0 | 84.5 | 50.3 |
| HMMT February 2026 | 71.6 | 53.6 | 63.6 | 32.1 |
| LiveCodeBench v6 | 64.8 | 54.9 | 55.8 | 54.2 |
| GPQA-Diamond | 71.0 | 66.1 | 76.2 | 57.4 |
| MMLU-Pro | 74.2 | 74.3 | 79.7 | 70.2 |
| IMO-AnswerBench | 59.3 | n/a | n/a | n/a |
| APEX-shortlist | 32.2 | n/a | n/a | n/a |
| IFEval | 85.58 | n/a | n/a | n/a |
| IFBench | 52.56 | n/a | n/a | n/a |
| EQBench | 72.95 | n/a | n/a | n/a |
| Creative Writing v3 | 62.97 | n/a | n/a | n/a |
| BFCL v4 | 39.22 | n/a | n/a | n/a |
| Tau-squared | 43.12 | n/a | n/a | n/a |
With the default Markovian RSA configuration (40K-token candidates, 4K-token tails) ZAYA1-8B's reasoning scores rise substantially on the two flagship math benchmarks:
| Benchmark | Single rollout | Markovian RSA (40K/4K) |
|---|---|---|
| AIME 2025 | 89.1 | 91.9 |
| HMMT 2025 | 71.6 | 89.6 |
| Model | Active params | Total params | AIME 2026 (unless noted) | HMMT Feb 2026 (unless noted) | LiveCodeBench v6 |
|---|---|---|---|---|---|
| ZAYA1-8B | 0.76B | 8.4B | 89.1 | 71.6 | 64.8 |
| Nemotron-3-Nano-30B | ~3B | 30B | 90.1 | n/a | n/a |
| Qwen3-Next-80B-A3B-Think | 3B | 80B | 90.2 | n/a | n/a |
| OLMo-3.1-32B-Think | n/a | 32B | 78.9 | n/a | n/a |
| DeepSeek-R1-0528 | 37B | 671B | ~87.5 (AIME 2025) | n/a | n/a |
| Claude 4.5 Sonnet | undisclosed | undisclosed | n/a | 88.3 (HMMT 2025) | n/a |
| GPT-5-High | undisclosed | undisclosed | n/a | n/a | n/a |
In the head-to-head with Claude 4.5 Sonnet on HMMT 2025, ZAYA1-8B with Markovian RSA scores 89.6 versus 88.3 for Sonnet, and on AIME 2025 ZAYA1-8B reaches 91.9 versus DeepSeek-R1-0528's 87.5. Coverage from MarkTechPost and VentureBeat highlighted the contrast that ZAYA1-8B uses fewer than one billion active parameters compared with 37 billion active parameters in DeepSeek's reasoning models, a roughly 49x difference in active parameters per forward pass (37B versus 0.76B).
Approximately one week after the language model release, Zyphra published ZAYA1-VL-8B, a vision-language extension of ZAYA1-8B. The vision-language variant pairs the ZAYA1-8B language model with a Qwen 2.5-VL vision tower, bringing the total parameter count to approximately 10 billion (0.7B vision encoder plus 8.4B language model). The composite system is released under the same Apache 2.0 license and ships through the Zyphra fork of the Hugging Face Transformers library.
Two architectural innovations distinguish ZAYA1-VL-8B from naive bolted-on vision tower designs. First, vision-specific LoRA parameters are attached to the MLP and CCA weights of every layer and are activated only when the token being processed is a vision token. This gives the model a dedicated visual processing pathway without doubling the parameter count or requiring a separate vision-LLM training pass. Second, image tokens use bidirectional attention masks rather than causal masks, reflecting the non-causal nature of 2D visual data.
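The second point, mixing causal and bidirectional attention, can be illustrated by constructing the attention mask directly. The sketch below assumes that each contiguous run of image tokens belongs to one image and that bidirectional attention applies only within such a run; the function name and that grouping rule are illustrative assumptions.

```python
def build_attention_mask(is_image):
    """Return mask where mask[i][j] is True when token i may attend to token j."""
    n = len(is_image)
    # Label each contiguous run of image tokens with a span id; text tokens get None.
    span_ids, current = [], -1
    for idx, flag in enumerate(is_image):
        if flag:
            if idx == 0 or not is_image[idx - 1]:
                current += 1  # a new image span starts here
            span_ids.append(current)
        else:
            span_ids.append(None)
    mask = [[j <= i for j in range(n)] for i in range(n)]  # causal baseline
    for i in range(n):
        for j in range(i + 1, n):
            if span_ids[i] is not None and span_ids[i] == span_ids[j]:
                mask[i][j] = True  # image tokens in the same span see each other
    return mask

# Sequence layout: text, image, image, text
mask = build_attention_mask([False, True, True, False])
print(mask[1][2])  # True: the first image token can look ahead to the second
print(mask[0][1])  # False: text tokens remain strictly causal
```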
Reported vision benchmark results include 87.5 on AI2D test, 92.5 on DocVQA test, 74.4 on TextVQA validation, 80.0 on VQA v2.0 validation, 64.0 on MathVista mini, and 46.0 on MMMU validation. Zyphra reports that on these benchmarks the model outperforms comparable-size vision-language systems and several larger ones on efficiency-adjusted metrics, while trailing larger specialists such as InternVL3.5-20B on some tasks.
ZAYA1-8B is distributed via Hugging Face with model weights, tokenizer, and configuration available at huggingface.co/Zyphra/ZAYA1-8B. Because the model uses novel CCA attention and the ZAYA1 router, it is not natively supported by mainline Hugging Face Transformers or vLLM at release; Zyphra maintains forks of both projects with ZAYA1 support, installable via pip directly from the company's GitHub. A typical vLLM launch command uses the qwen3 reasoning parser, the custom zaya_xml tool-call parser, BF16 precision, and an FP32 Mamba cache (a leftover convenience flag from earlier Zyphra hybrid models). Beyond Hugging Face, the model is available as a free serverless endpoint on Zyphra Cloud at cloud.zyphra.com, and the company has published quantized GGUF builds for llama.cpp, Ollama, LM Studio, and Jan to support local on-device inference.
Reception of ZAYA1-8B in technical media in May 2026 focused on two themes: efficiency relative to active parameter count, and the demonstration of frontier training on AMD hardware. VentureBeat headlined its coverage "Meet ZAYA1-8B, a super efficient, open reasoning model trained on AMD Instinct MI300 GPUs," emphasizing the dual achievement of small active parameter count and non-NVIDIA training. MarkTechPost described the model as "a reasoning MoE trained on AMD hardware that punches far above its weight class," and AIWire framed the release as a proof point for AMD's ability to host frontier model development.
For AMD, the release was a strategically important demonstration that frontier MoE training is feasible at scale on the MI300X platform with the ROCm software stack, providing a counterweight to the perception that production-scale AI training is exclusively a CUDA ecosystem activity. For IBM, the deployment served as a showcase for its custom AI training cluster offering. For Zyphra, the release positioned the company as a research-driven challenger optimizing for parameter efficiency and open licensing rather than scale, in contrast to closed-weight frontier labs and to compute-intensive open releases like DeepSeek-V3.2.
The broader significance of ZAYA1-8B lies in three areas. First, the model substantiates the claim that small active parameter counts plus extended test-time compute can match dense or heavily activated MoE reasoning models on math and coding benchmarks, supporting an architectural direction sometimes described as small-active sparse compute. Second, the model establishes Markovian RSA as a credible alternative to long single-trace chain-of-thought scaling, by keeping working context bounded as compute grows. Third, it offers an existence proof that frontier-class reasoning models can be developed without exclusive dependence on a single hardware vendor.
Zyphra acknowledges several limitations of ZAYA1-8B in the technical report. Agentic benchmark scores are modest, with BFCL v4 at 39.2% and tau-squared at 43.1%, reflecting the absence of a dedicated multi-turn tool-use RL stage during post-training. The model trails larger general-knowledge systems on MMLU-Pro (74.2) and on certain knowledge-heavy GPQA-Diamond subsets, indicating that intelligence density gains are concentrated in reasoning rather than broad recall. Test-time compute validation is limited to comparisons against open-weight and select proprietary systems, and per-stage ablations of the RL cascade are not provided. The custom CCA attention and ZAYA1 router require Zyphra's forked inference stacks at release, which introduces some operational friction relative to fully mainline architectures.