Zyphra
Last reviewed
May 16, 2026
Sources
36 citations
Review status
Source-backed
Revision
v1 · 3,924 words
Zyphra is an American artificial intelligence research and product company headquartered in San Francisco, California, with a secondary office in London. The company builds open-weight foundation models, specializing in hybrid architectures that combine state space models with attention mechanisms, sparse mixture of experts (MoE) designs, and efficient training systems. Zyphra was founded in 2020 by Krithik Puthalath, Beren Millidge, Tomas Figliolia, and Danny Martinelli. In June 2025 the company closed a $100 million Series A funding round at a $1 billion post-money valuation, becoming an AI unicorn.
Zyphra is best known for three product lines: the Zamba family of hybrid Mamba-Transformer large language models, the Zonos text-to-speech models, and the ZAYA1-8B reasoning MoE released on May 6, 2026. ZAYA1-8B was pretrained entirely on AMD Instinct MI300 hardware using a 1,024-GPU cluster built jointly with IBM, making it the first frontier-scale reasoning model trained end-to-end on the AMD stack. The company has also published influential research on Tree Attention, Compressed Convolutional Attention (CCA), and the Zyda training data corpus.
Zyphra was founded in 2020 with a stated mission of building open superintelligence. The four co-founders bring backgrounds in physics, neuroscience, machine learning research, and infrastructure engineering. Krithik Puthalath, who serves as chairman and chief executive officer, holds a theoretical and mathematical physics degree from the University of Illinois Urbana-Champaign and previously worked at Cambridge Quantum on product and research, with earlier stints at IBM, Xerion Advanced Battery, and SunPower. Beren Millidge is co-founder, president, and chief scientist; he completed postdoctoral research at the University of Oxford on active inference and predictive coding before turning to applied machine learning. Tomas Figliolia and Danny Martinelli round out the founding team and continue to lead research and engineering work, including authorship on the company's Compressed Convolutional Attention paper.
Quentin Anthony, who heads model training and systems work, is not a co-founder but is among the most visible technical leads at the company and an author on most Zyphra papers, including the Zamba, Zamba2, and ZAYA1 technical reports. His research focuses on the intersection of deep learning systems and model architectures. The wider research team has grown alongside Zyphra's GPU access; the ZAYA1-8B technical report lists 18 named authors.
The company operates two divisions: Zyphra Research, which develops multimodal open models, and Zyphra Cloud, which provides hosted inference infrastructure. The cloud platform launched a free serverless endpoint for ZAYA1-8B at the time of the model's release in May 2026.
Zyphra remained relatively under the radar through its earliest years, releasing the first Zamba paper in May 2024 with limited public visibility. The fundraising picture shifted decisively in mid-2025. In June 2025, Zyphra closed a $100 million Series A round at a $1 billion post-money valuation, led by Jaan Tallinn through his investment vehicle Metaplanet. Tallinn is an Estonian programmer and one of the founders of Skype, and was an early backer of both DeepMind and Anthropic. The round also included Bison Ventures and Future Ventures, the firm co-founded by Steve Jurvetson.
A second pillar of the company's strategy is hardware. On October 1, 2025, IBM and AMD jointly announced a multi-year collaboration with Zyphra to deliver dedicated AI training infrastructure on IBM Cloud. The agreement is the first large-scale dedicated training cluster on IBM Cloud built around AMD Instinct MI300X GPUs, paired with AMD Pensando Pollara 400 AI NICs and AMD Pensando Ortano DPUs. Initial deployment landed in September 2025, with expansion planned through 2026. Puthalath described the deal as the first time AMD's full-stack training platform had been integrated and scaled on IBM Cloud.
| Round | Date | Amount | Lead investor | Valuation |
|---|---|---|---|---|
| Seed | 2020-2024 (multiple) | Undisclosed | Various angels | Undisclosed |
| Series A | June 9, 2025 | $100 million | Metaplanet (Jaan Tallinn) | $1 billion post-money |
| Strategic infrastructure | October 1, 2025 | Undisclosed | IBM and AMD (multi-year compute agreement) | n/a |
Zyphra's technical work clusters around four ideas. First, hybrid architectures that pair Mamba state space layers with a small number of attention layers, trading dense per-layer self-attention for lower latency and a smaller key-value cache. Second, sparse MoE designs that raise total parameter count without a matching increase in active compute per token. Third, efficient attention variants such as Tree Attention and Compressed Convolutional Attention. Fourth, large open training datasets, distributed through the Zyda series. Each of these threads informs the others; ZAYA1-8B, for example, draws on CCA for attention compression, on the MoE++ recipe for routing and load balancing, and on a four-stage reinforcement learning pipeline with a custom test-time compute method called Markovian RSA.
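The gap between total and active parameters in a sparse MoE can be illustrated with a minimal sketch. It assumes plain top-k routing over a flat list of experts; the expert count, parameter sizes, router scores, and function names are hypothetical, and the sketch does not model Zyphra's own router or load balancing.

```python
def top_k_route(router_scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])


def parameter_counts(n_experts, params_per_expert, shared_params, k):
    """Total parameters stored vs. parameters touched per token."""
    total = shared_params + n_experts * params_per_expert
    active = shared_params + k * params_per_expert
    return total, active


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
chosen = top_k_route([0.1, 2.3, -0.7, 1.9, 0.4, 0.0, 1.1, -1.2], k=2)
total, active = parameter_counts(
    n_experts=8, params_per_expert=1_000_000, shared_params=500_000, k=2
)
print(chosen, total, active)
```

Adding experts grows `total` but leaves `active` unchanged, which is the property that lets an MoE scale capacity without scaling per-token compute.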
The company's published artifacts are all permissive open releases, almost always under the Apache 2.0 license, with full model weights, inference code, and (in most cases) a detailed technical report. The licensing posture is deliberate: Zyphra's marketing material describes the company as advocating for open superintelligence rather than closed control of frontier capability.
Zyphra's model releases sit at the intersection of language, audio, and reasoning. The table below summarizes the headline releases through May 2026.
| Release | Date | Type | Parameters | License |
|---|---|---|---|---|
| Zamba-7B | April 2024 | Hybrid Mamba1-Transformer LLM | 7B | Apache 2.0 |
| Zyda | June 2024 | Pretraining dataset | 1.3T tokens | ODC-By |
| Tree Attention | August 2024 | Research paper and library | n/a | Apache 2.0 (code) |
| Zamba2-2.7B | August 2024 | Hybrid Mamba2-Transformer LLM | 2.7B | Apache 2.0 |
| Zamba2-mini (1.2B) | August 2024 | Hybrid LLM for on-device | 1.2B | Apache 2.0 |
| Zamba2-7B | October 14, 2024 | Hybrid Mamba2-Transformer LLM | 7B | Apache 2.0 |
| Zyda-2 | October 2024 | Pretraining dataset | 5T tokens | ODC-By |
| Zonos-v0.1 | February 10, 2025 | Text-to-speech (transformer and SSM hybrid) | 1.6B each | Apache 2.0 |
| Compressed Convolutional Attention | October 2025 | Research paper | n/a | Apache 2.0 (code) |
| ZAYA1-8B | May 6, 2026 | Reasoning MoE | 8.4B total / 760M active | Apache 2.0 |
| ZAYA1-VL-8B | May 2026 | Vision language MoE | 9.2B total / 1.4B active | Apache 2.0 |
The Zamba family was Zyphra's first major release and remains the most cited line of work for the company. The original Zamba-7B was announced in April 2024 and accompanied by an arXiv paper, "Zamba: A Compact 7B SSM Hybrid Model" (arXiv:2405.16712), published on May 26, 2024. The model uses a backbone of Mamba state space layers interleaved with a single shared attention block whose weights are reused across the network. The shared attention design lets the model gain global mixing without paying the parameter cost of a full per-layer transformer.
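The effect of reusing one attention block can be sketched with simple parameter accounting. The layer count, the `attn_every` spacing, and `d_model` below are hypothetical values rather than Zamba-7B's actual configuration, and attention weights are approximated as four square projections.

```python
def build_schedule(n_mamba: int, attn_every: int) -> list[str]:
    """Interleave Mamba layers with invocations of one shared attention block."""
    schedule = []
    for i in range(1, n_mamba + 1):
        schedule.append("mamba")
        if i % attn_every == 0:
            schedule.append("shared_attn")  # same weights on every invocation
    return schedule


def attention_params(d_model: int, n_invocations: int, shared: bool) -> int:
    """Approximate attention weights as four d_model x d_model projections."""
    per_block = 4 * d_model * d_model  # Q, K, V, and output projections
    return per_block if shared else per_block * n_invocations


# Hypothetical configuration, for illustration only.
schedule = build_schedule(n_mamba=12, attn_every=6)
invocations = schedule.count("shared_attn")
shared_cost = attention_params(d_model=1024, n_invocations=invocations, shared=True)
unshared_cost = attention_params(d_model=1024, n_invocations=invocations, shared=False)
print(invocations, shared_cost, unshared_cost)
```

With sharing, the attention parameter cost stays constant no matter how many times the block is invoked along the depth of the network.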
Zyphra reported that Zamba-7B approached the performance of Mistral and Gemma at comparable scales while training on roughly half the tokens, and outperformed LLaMA-2 7B and OLMo-7B across a broad set of standard benchmarks. The model was open-sourced under Apache 2.0 along with all training checkpoints, an unusual choice that let researchers study the trajectory of capability emergence.
The second generation, Zamba2, replaced the Mamba1 backbone with Mamba2 blocks. Mamba2 is a refinement of the original state space architecture, with roughly four times the throughput of an equivalent-parameter transformer block. The Zamba2 suite shipped in three sizes: Zamba2-mini (1.2B), Zamba2-2.7B, and Zamba2-7B.
A distinctive feature of Zamba2 is that it stores key-value caches only for invocations of the shared attention block, not for every layer. At a typical ratio of one attention invocation per six Mamba2 blocks, this cuts KV-cache memory by a factor of roughly six compared with a pure transformer of similar size. LoRA projectors are applied to each invocation of the shared MLP and attention blocks, which allows a degree of depth specialization without inflating parameter counts.
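The KV-cache saving follows from counting how many layers actually cache keys and values. The sketch below assumes one attention invocation for every six Mamba2 blocks and compares against a transformer whose attention-layer count matches the Mamba2 backbone depth; the layer counts, head dimensions, and sequence length are hypothetical.

```python
def kv_cache_bytes(n_attn_layers, seq_len, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2 covers keys plus values; bytes_per_value=2 assumes 16-bit storage.
    return 2 * n_attn_layers * seq_len * n_kv_heads * head_dim * bytes_per_value


# Hypothetical shapes, for illustration only.
mamba_blocks = 36
attn_invocations = mamba_blocks // 6  # one attention call per six Mamba2 blocks

pure = kv_cache_bytes(n_attn_layers=mamba_blocks, seq_len=4096, n_kv_heads=32, head_dim=128)
hybrid = kv_cache_bytes(n_attn_layers=attn_invocations, seq_len=4096, n_kv_heads=32, head_dim=128)
print(pure / hybrid)
```

Mamba2 blocks carry a fixed-size recurrent state instead of a cache that grows with sequence length, so only the attention invocations contribute to the figure above.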
| Model | Total parameters | Pretraining tokens | Mamba variant | Shared attention blocks | Tokenizer |
|---|---|---|---|---|---|
| Zamba-7B (v1) | 7B | ~1T (less than half of Mistral/Gemma comparators) | Mamba1 | 1 (global) | Custom |
| Zamba2-mini (1.2B) | 1.2B | 3T + 100B anneal | Mamba2 | 1 | Mistral v0.1 |
| Zamba2-2.7B | 2.7B | 3T + 100B anneal | Mamba2 | 2 (ABAB) | Mistral v0.1 |
| Zamba2-7B | 7B | ~3T + 100B anneal | Mamba2 | 2 (ABAB) | Mistral v0.1 |
The Zamba2 technical report was published on arXiv as 2411.15242 in November 2024 with authors Paolo Glorioso, Quentin Anthony, Yury Tokpanov, and collaborators.
Tree Attention is a research result first published in August 2024, with the paper "Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters" (arXiv:2408.04093). The work was a collaboration between Zyphra researchers and EleutherAI, and the code is hosted at github.com/Zyphra/tree_attention under Apache 2.0.
The paper makes two contributions. First, it derives a scalar energy function whose gradient computes the self-attention block, drawing a connection between attention and energy-based models such as Hopfield networks. Second, and more practically, it exploits a structural property of attention that prior work had not put to use. The reduction of the logsumexp and max operators across the sequence axis is associative, which means it can be computed in parallel via an associative scan, the same primitive used in state space model training.
In distributed decoding on GPU clusters, this lets attention be computed using a tree reduction with logarithmic depth in the number of devices, rather than the linear depth of ring attention. The reported result is up to 8x faster cross-device decoding than ring attention, with significantly lower communication volume and 2x lower peak memory. The Tree Attention algorithm has been folded into several inference engines and is one of the standard reference points for long-context decoding on multi-GPU servers.
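The associativity argument can be checked with a small single-query sketch. Each simulated device reduces its shard of keys and values to a triple (running max, sum of exponentials, exponent-weighted value sum), and triples are merged pairwise in a tree. This is a plain-Python illustration under simplifying assumptions (one query, no 1/sqrt(d) scaling, no real communication), not the library's implementation; all function names are hypothetical.

```python
import math


def local_partial(q, keys, values):
    """Per-device partial: (max score, sum of exps, exp-weighted value sum)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    numer = [sum(w * v[d] for w, v in zip(weights, values))
             for d in range(len(values[0]))]
    return m, sum(weights), numer


def combine(a, b):
    """Associative merge of two partials, rescaled to a shared max."""
    m = max(a[0], b[0])
    sa, sb = math.exp(a[0] - m), math.exp(b[0] - m)
    return m, a[1] * sa + b[1] * sb, [x * sa + y * sb for x, y in zip(a[2], b[2])]


def tree_reduce(partials):
    """Pairwise reduction; the number of rounds grows with log2 of the device count."""
    rounds = 0
    while len(partials) > 1:
        merged = [combine(partials[i], partials[i + 1])
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:
            merged.append(partials[-1])  # odd one out waits for the next round
        partials, rounds = merged, rounds + 1
    return partials[0], rounds


def attention(q, keys, values, n_devices):
    """Shard keys/values across simulated devices, then tree-reduce the partials."""
    chunk = math.ceil(len(keys) / n_devices)
    partials = [local_partial(q, keys[i:i + chunk], values[i:i + chunk])
                for i in range(0, len(keys), chunk)]
    (_, denom, numer), rounds = tree_reduce(partials)
    return [x / denom for x in numer], rounds


q = [0.5, -1.0]
keys = [[math.sin(i), math.cos(i)] for i in range(8)]
values = [[float(i), float(-i)] for i in range(8)]
single, single_rounds = attention(q, keys, values, n_devices=1)
sharded, sharded_rounds = attention(q, keys, values, n_devices=4)
print(single, sharded, sharded_rounds)
```

Because `combine` is associative, the sharded result matches the single-device softmax attention up to floating-point error, and four shards need two merge rounds rather than three sequential hops.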
Compressed Convolutional Attention (CCA) is a more recent attention variant, introduced in October 2025 in arXiv:2510.04476 by Tomas Figliolia, Nicholas Alonso, Rishi Iyer, Quentin Anthony, and Beren Millidge. CCA down-projects queries, keys, and values into a shared low-dimensional latent space and performs the entire attention operation inside that latent space. A single compression factor therefore reduces parameters, KV-cache size, and FLOPs together.
A convolution over the sequence and channel axes is applied to the compressed Q and K latents, which is the source of the "convolutional" name. The paper reports that this additional mixing lets CCA match or exceed multi-head latent attention (MLA) and standard multi-head attention (MHA) on quality metrics. An extension called Compressed Convolutional Grouped Query Attention (CCGQA) adds GQA-style head sharing inside the latent space and yields a further 2x KV cache reduction without measurable performance loss.
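The data flow described above can be sketched in plain Python. This is a heavily simplified single-head illustration, not the algorithm from the paper: the convolution here mixes only along the sequence axis (the paper also mixes channels), the weights are arbitrary deterministic values, and every name is hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def causal_seq_conv(x, taps):
    """Mix each position with its predecessors; taps[0] weights the current token."""
    out = []
    for t in range(len(x)):
        row = [0.0] * len(x[0])
        for lag, w in enumerate(taps):
            if t - lag >= 0:
                row = [r + w * v for r, v in zip(row, x[t - lag])]
        out.append(row)
    return out


def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    total = sum(e)
    return [v / total for v in e]


def latent_attention(x, w_q, w_k, w_v, w_up, taps):
    """Causal attention computed entirely in a down-projected latent space."""
    q = causal_seq_conv(matmul(x, w_q), taps)  # (seq, d_latent)
    k = causal_seq_conv(matmul(x, w_k), taps)  # cached at latent width
    v = matmul(x, w_v)                         # cached at latent width
    scale = 1.0 / math.sqrt(len(k[0]))
    out = []
    for t, q_t in enumerate(q):
        scores = [scale * sum(a * b for a, b in zip(q_t, k[s])) for s in range(t + 1)]
        probs = softmax(scores)
        out.append([sum(p * v[s][d] for s, p in enumerate(probs))
                    for d in range(len(v[0]))])
    return matmul(out, w_up), (k, v)           # up-project only after attention


# Hypothetical sizes: an 8-wide model compressed to a 2-wide latent.
d_model, d_latent, seq_len = 8, 2, 4
x = [[math.sin(t + i) for i in range(d_model)] for t in range(seq_len)]
w_q = [[math.cos(i * j + 1) for j in range(d_latent)] for i in range(d_model)]
w_k = [[math.sin(i - j) for j in range(d_latent)] for i in range(d_model)]
w_v = [[math.cos(i + 2 * j) for j in range(d_latent)] for i in range(d_model)]
w_up = [[math.sin(i + j + 1) for j in range(d_model)] for i in range(d_latent)]
y, (k_cache, v_cache) = latent_attention(x, w_q, w_k, w_v, w_up, taps=[0.5, 0.5])
print(len(y[0]), len(k_cache[0]))
```

The cached keys and values have width `d_latent` rather than `d_model`, and the score and value products also run at that width, which is where the combined cache and FLOP savings come from in this toy setting.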
On an MoE workload, CCGQA achieves 8x KV cache compression with no drop in performance relative to MHA. On H100 GPUs, the fused CCA/CCGQA kernel cuts prefill latency by roughly 1.7x at a 16k sequence length compared to MHA. CCA is the attention method used inside ZAYA1-8B, and the kernel work formed the foundation for the model's efficient inference profile.
ZAYA1-8B is Zyphra's first reasoning-focused MoE and was announced on May 6, 2026. The model has 8.4 billion total parameters but only 760 million active parameters per forward pass, placing it in the under-1B-active class. It is built on an in-house recipe the company calls MoE++, which combines several architectural choices the company introduced in earlier papers, among them Compressed Convolutional Attention, an MLP router with PID balancing, and learned residual scaling.
The ZAYA1-8B post-training pipeline involves a four-stage reinforcement learning cascade and introduces Markovian RSA, a test-time compute method that fuses the Markovian thinker idea (reasoning in fixed-length context chunks where only the tail of each chunk is passed forward) with parallel trace generation and recursive self-aggregation (RSA). The result is that the model can run effectively unbounded chains of thought without ever overflowing a fixed context window, by carrying forward only a 4K-token tail between chunks and periodically aggregating partial traces.
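The bounded-context property can be illustrated with a toy loop. The sketch below is not Zyphra's implementation: `stub_generate` and `stub_aggregate` are hypothetical stand-ins for model calls (a real aggregation step would have the model synthesize a new trace from the candidates), and the chunk, tail, and prompt sizes are small made-up values rather than the 4K-token tail described above.

```python
import random


def stub_generate(context, n_new, rng):
    """Hypothetical stand-in for a model call: returns n_new pseudo-tokens."""
    return [rng.randrange(1000) for _ in range(n_new)]


def stub_aggregate(traces):
    """Hypothetical aggregation: keep one candidate trace (largest token sum)."""
    return max(traces, key=sum)


def markovian_reasoning(prompt, n_chunks, chunk_len, tail_len, n_parallel, seed=0):
    rng = random.Random(seed)
    carried = []          # tail passed forward from the previous chunk
    max_context = 0       # largest context any single call ever sees
    total_generated = 0   # total test-time compute, in tokens
    for _ in range(n_chunks):
        context = prompt + carried  # everything older than the tail is dropped
        traces = [stub_generate(context, chunk_len, rng) for _ in range(n_parallel)]
        best = stub_aggregate(traces)
        max_context = max(max_context, len(context) + chunk_len)
        total_generated += chunk_len * n_parallel
        carried = best[-tail_len:]  # only the tail moves to the next chunk
    return carried, max_context, total_generated


prompt = list(range(100))
carried, max_context, total_generated = markovian_reasoning(
    prompt, n_chunks=50, chunk_len=16, tail_len=4, n_parallel=4
)
print(max_context, total_generated)
```

Total generated tokens grow linearly with the number of chunks, while the context seen by any single call stays capped at prompt length plus tail plus one chunk.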
The figures below are from the official Hugging Face model card and the Zyphra release post. Mathematics scores are reported across single-pass evaluation and extended test-time compute with Markovian RSA enabled. Scores driven by Markovian RSA are not strictly comparable to standard pass@1 numbers on closed-source models.
| Benchmark | ZAYA1-8B | Qwen3-4B | Qwen3.5-4B | Gemma-4-E4B |
|---|---|---|---|---|
| AIME'26 | 89.1 | 77.5 | 84.5 | 50.3 |
| HMMT Feb.'26 | 71.6 | 60.8 | 63.6 | 32.1 |
| IMO-AnswerBench | 59.3 | 50.9 | 48.7 | 27.3 |
| LiveCodeBench-v6 | 65.8 | 54.2 | n/a | 54.2 |
| GPQA-Diamond | 71.0 | 66.5 | 76.2 | 57.4 |
| MMLU-Pro | 74.2 | 74.3 | 79.1 | 70.2 |
| IFEval | 85.58 | 86.80 | 89.80 | 88.50 |
| APEX-shortlist | 32.2 | n/a | n/a | n/a |
With Markovian RSA enabled at high test-time compute (a roughly 5.5 million token budget), Zyphra reports a 91.9 score on AIME'25 and 89.6 on HMMT'25, the latter exceeding Claude 4.5 Sonnet (88.3) and approaching GPT-5-High on the same benchmark. On APEX-shortlist with extended compute, the company reports outperforming DeepSeek-V3.2 at a tiny fraction of the active parameter count. These results sit in an unusual region of the design space: a model with under 1B active parameters, run with aggressive test-time compute, matching the math performance of models with hundreds of times more active parameters.
The accompanying CEO statement framed the design philosophy directly. "ZAYA1-8B demonstrates what is possible when architecture, pretraining, and reinforcement learning are co-designed toward a single objective: maximizing the intelligence extracted per parameter and per FLOP," Puthalath said in the launch announcement.
A vision-language variant, ZAYA1-VL-8B, was released alongside the text model in May 2026 and shares the same base LLM. The VLM has 9.2 billion total parameters (1.4 billion active including the vision encoder) and uses the Qwen2.5-VL vision tower for image encoding. Two architectural choices distinguish it from the base model: vision-specific LoRA adapters integrated into the LLM rather than additional experts, which keeps the MoE structure intact, and bidirectional attention over image tokens to improve visual understanding. Zyphra reports performance competitive with Molmo2-4B and InternVL3.5-4B, exceeding Qwen2.5-VL-3B, PLM-3B, and MolmoE-8B on a range of image understanding, reasoning, and counting benchmarks.
ZAYA1-8B is the first widely discussed reasoning model trained end-to-end on AMD hardware. The training cluster, hosted on IBM Cloud, consists of 1,024 AMD Instinct MI300X GPUs connected by AMD Pensando Pollara 400 AI NICs. The MI300X is AMD's flagship data center accelerator, with 192 GB of HBM3 memory per GPU, the largest memory capacity among commercially available AI training accelerators at the time of ZAYA1's training. The Pollara interconnect uses Ultra Ethernet to provide high-bandwidth GPU-to-GPU communication, an alternative to the NVIDIA NVLink and InfiniBand-based topologies common at other frontier labs.
The cluster was the same infrastructure announced in the October 2025 IBM-AMD-Zyphra partnership and served as a real-world workload demonstrating the AMD training stack at scale. Zyphra's engineering team contributed back to the ecosystem with kernel work, profiling tools, and ROCm-targeted modifications that have since been folded into AMD's public software stack.
Zyphra also publishes large open pretraining datasets under the Zyda brand. The original Zyda was released in June 2024 as a 1.3 trillion token dataset, built by filtering and deduplicating across RefinedWeb, Starcoder, C4, the Pile, SlimPajama, peS2o, and arXiv. The accompanying paper (arXiv:2406.01981) reports that models trained on Zyda outperform models trained on any individual source dataset at comparable scale.
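Merging several corpora into one deduplicated set can be sketched as a filter pass followed by a cross-source duplicate check. The example below swaps in exact-match hashing of normalized text and a word-count filter for simplicity; it is not the Zyda pipeline, and the source names, threshold, and sample documents are hypothetical.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def filter_and_dedup(sources, min_words=5):
    """Drop very short documents, then keep the first copy of each duplicate."""
    seen = set()
    kept = []
    for name, docs in sources.items():
        for doc in docs:
            norm = normalize(doc)
            if len(norm.split()) < min_words:
                continue  # quality filter: too short to be useful
            digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # already seen in this or an earlier source
            seen.add(digest)
            kept.append((name, doc))
    return kept


# Hypothetical toy corpora.
sources = {
    "web_a": ["The quick brown fox jumps over the lazy dog.", "too short"],
    "web_b": [
        "the quick  brown fox jumps over the lazy dog.",
        "State space models scale linearly with sequence length.",
    ],
}
kept = filter_and_dedup(sources)
print(kept)
```

Deduplicating across sources rather than within each one matters when the component datasets overlap, since the same page can otherwise appear once per source.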
Zyda-2 followed in October 2024 and expanded the corpus to 5 trillion tokens, incorporating DCLM, FineWeb-Edu, and the Common Crawl portion of Dolma v1.7. The data pipeline was rebuilt using NVIDIA's NeMo Curator, a GPU-accelerated curation library, which cut total processing time from roughly three weeks to two days. Models trained on Zyda-2 outperform identical models trained on the Pile, RefinedWeb, FineWeb, FineWeb-Edu, or DCLM individually. Both datasets are available on the Zyphra Hugging Face organization under the ODC-By license.
In February 2025 Zyphra extended its open-model strategy beyond text with the beta release of Zonos-v0.1, a pair of 1.6B parameter text-to-speech models. One uses a standard transformer architecture; the other is an SSM hybrid based on Mamba2, the first publicly released open-source SSM-based TTS model. Both were trained on roughly 200,000 hours of speech, primarily English with substantial Chinese, Japanese, French, Spanish, and German content.
Zonos generates audio natively at 44 kHz and supports voice cloning from clips as short as five seconds. The models accept conditioning inputs for speaking rate, pitch, audio quality, and emotion. Zyphra released the model weights on Hugging Face under the Apache 2.0 license, framing the release as a direct alternative to closed-source TTS providers.
Maia is the name Zyphra has given to a general-purpose AI agent currently in development under the IBM-AMD partnership. The company describes it as a superagent spanning language, vision, and audio modalities, targeted at enterprise knowledge workers. Maia is built on top of the same MoE foundation models Zyphra trains on the IBM Cloud MI300X cluster, with ZAYA1 representing the first publicly released building block. The agent's full capabilities and launch timing have not been disclosed in public statements as of the May 2026 ZAYA1-8B release.
Zyphra's standing in the AI research community changed sharply between 2024 and 2026. The original Zamba paper drew technical interest but limited mainstream coverage. The Zamba2 release a few months later landed in the middle of the wave of Mamba-based research and brought the company to the attention of inference-efficiency-focused engineers. The Tree Attention paper widened that audience by addressing distributed decoding, a long-standing pain point for long-context serving.
The ZAYA1-8B release was the first time Zyphra was widely treated as a frontier laboratory rather than a hybrid-architecture specialist. Coverage emphasized two themes. First, that a model with fewer than one billion active parameters could match or exceed first-generation reasoning models (DeepSeek-R1-0528, Gemini 2.5 Pro, Claude 4.5 Sonnet) on hard mathematics benchmarks. Second, that the model was trained entirely on AMD hardware, demonstrating that competitive frontier training is possible outside the NVIDIA ecosystem. The combination matters for both technical and strategic reasons: it offers a path for labs that lack access to large NVIDIA allocations, and it gives AMD and IBM a publicly reproducible reference workload for selling MI300-class clusters to enterprise customers.
The company's open-license posture stands out among labs operating at this scale. Models, weights, inference code, datasets, and technical reports have all shipped under Apache 2.0 or equivalent permissive licenses, with no commercial use restrictions. Zyphra has framed this as central to its identity, arguing that intelligence should not be controlled by a small number of closed-source frontier labs.
The through-line of Zyphra's technical work is reducing the cost of intelligence per parameter and per FLOP, often by changing the building blocks rather than scaling up. The table below summarizes the company's named contributions and where they appear.
| Innovation | Year | Description | First used in |
|---|---|---|---|
| Shared attention block | 2024 | A single attention block whose weights are reused across the network | Zamba-7B |
| Mamba2 hybrid backbone | 2024 | Mamba2 SSM layers interleaved with shared attention | Zamba2 family |
| LoRA projectors on shared blocks | 2024 | Per-invocation low-rank adapters for depth specialization | Zamba2 family |
| Tree Attention | 2024 | Tree-reduction parallelization of attention across GPUs | Inference engines |
| Compressed Convolutional Attention | 2025 | Attention in a shared low-rank latent with sequence convolution | ZAYA1-8B |
| MoE++ architecture recipe | 2026 | CCA, MLP router with PID balancing, learned residual scaling | ZAYA1-8B |
| Markovian RSA | 2026 | Recursive self-aggregation over fixed-length reasoning chunks | ZAYA1-8B |