Zyphra


Updated Jul 16, 2026


Zyphra is an American artificial intelligence research and product company headquartered in San Francisco, California, with a secondary office in London.^[1] The company builds open-weight foundation models, specializing in hybrid architectures that combine state space models with attention mechanisms, sparse mixture of experts (MoE) designs, and efficient training systems.^[1] Zyphra was founded in 2020 by Krithik Puthalath, Beren Millidge, Tomas Figliolia, and Danny Martinelli.^[1]^[12] In June 2025 the company closed a $100 million Series A funding round at a $1 billion post-money valuation, becoming an AI unicorn.^[11]^[12]

Zyphra is best known for three product lines: the Zamba family of hybrid Mamba-Transformer large language models, the Zonos text-to-speech models, and the ZAYA1-8B reasoning MoE released on May 6, 2026.^[2] ZAYA1-8B was pretrained entirely on AMD Instinct MI300 hardware using a 1,024-GPU cluster built jointly with IBM, making it the first frontier-scale reasoning model trained end-to-end on the AMD stack.^[2]^[4] The company has also published influential research on Tree Attention, Compressed Convolutional Attention (CCA), and the Zyda training data corpus.^[24]^[27]^[28]

Company background

Zyphra was founded in 2020 with a stated mission of building open superintelligence.^[1] The four co-founders bring backgrounds in physics, neuroscience, machine learning research, and infrastructure engineering.^[1] Krithik Puthalath, who serves as chairman and chief executive officer, holds a theoretical and mathematical physics degree from the University of Illinois Urbana-Champaign and previously worked at Cambridge Quantum on product and research, with earlier stints at IBM, Xerion Advanced Battery, and SunPower.^[13] Beren Millidge is co-founder, president, and chief scientist; he completed postdoctoral research at the University of Oxford on active inference and predictive coding before turning to applied machine learning.^[14] Tomas Figliolia and Danny Martinelli round out the founding team and continue to lead research and engineering work, including authorship on the company's Compressed Convolutional Attention paper.^[1]^[27]

Quentin Anthony, who heads model training and systems work, is not a co-founder but is among the most visible technical leads at the company and an author on most Zyphra papers.^[1] His research focuses on the intersection of deep learning systems and model architectures, and he is a co-author of the Zamba, Zamba2, and ZAYA1 technical reports.^[15]^[20] The wider research team has grown alongside Zyphra's GPU access; the ZAYA1-8B technical report lists 18 named authors.^[5]

The company operates two divisions: Zyphra Research, which develops multimodal open models, and Zyphra Cloud, which provides hosted inference infrastructure.^[1] The cloud platform launched a free serverless endpoint for ZAYA1-8B at the time of the model's release in May 2026.^[2]^[5]

Funding and partnerships

Zyphra remained relatively under the radar through its earliest years, releasing the first Zamba paper in May 2024 with limited public visibility.^[15] The fundraising picture shifted decisively in mid-2025. In June 2025, Zyphra closed a $100 million Series A round at a $1 billion post-money valuation, led by Jaan Tallinn through his investment vehicle Metaplanet.^[11]^[12] Tallinn is an Estonian programmer and one of the founders of Skype, and was an early backer of both DeepMind and Anthropic.^[11] The round also included Bison Ventures and Future Ventures, the firm co-founded by Steve Jurvetson.^[12]

A second pillar of the company's strategy is hardware. On October 1, 2025, IBM and AMD jointly announced a multi-year collaboration with Zyphra to deliver dedicated AI training infrastructure on IBM Cloud.^[9] The agreement covers the first large-scale dedicated training cluster on IBM Cloud built around AMD Instinct MI300X GPUs, paired with AMD Pensando Pollara 400 AI NICs and AMD Pensando Ortano DPUs.^[9]^[10] Initial deployment landed in September 2025, ahead of the public announcement, with expansion planned through 2026.^[9] Puthalath described the deal as the first time AMD's full-stack training platform had been integrated and scaled on IBM Cloud.^[9]

Funding history

Round	Date	Amount	Lead investor	Valuation
Seed	2020-2024 (multiple)	Undisclosed	Various angels	Undisclosed
Series A	June 9, 2025	$100 million	Metaplanet (Jaan Tallinn)	$1 billion post-money^[11]^[12]
Strategic infrastructure	October 1, 2025	Undisclosed	IBM and AMD (multi-year compute agreement)	n/a^[9]

Research focus and product portfolio

Zyphra's technical work clusters around four ideas. The first is hybrid architectures that pair the Mamba state space model with a small number of shared attention layers, trading dense per-layer self-attention for lower latency and a smaller key-value cache.^[15] The second is sparse MoE designs that lift total parameter count without scaling active compute.^[5] The third is efficient attention variants such as Tree Attention and Compressed Convolutional Attention.^[24]^[27] The fourth is large open training datasets, distributed through the Zyda series.^[28] Each of these threads informs the others; ZAYA1-8B, for example, draws on CCA for attention compression, on the MoE++ recipe for routing and load balancing, and on a four-stage reinforcement learning pipeline with a custom test-time compute method called Markovian RSA.^[5]

The company's published artifacts are all permissive open releases, almost always under the Apache 2.0 license, with full model weights, inference code, and (in most cases) a detailed technical report.^[6] The licensing posture is deliberate: Zyphra's marketing material describes the company as advocating for open superintelligence rather than closed control of frontier capability.^[1]

Model timeline

Zyphra's releases span language models, speech models, reasoning models, pretraining datasets, and research code. The table below summarizes the headline releases through May 2026.

Release	Date	Type	Parameters	License
Zamba-7B	April 2024	Hybrid Mamba1-Transformer LLM	7B	Apache 2.0^[16]
Zyda	June 2024	Pretraining dataset	1.3T tokens	ODC-By^[28]^[30]
Tree Attention	August 2024	Research paper and library	n/a	Apache 2.0 (code)^[26]
Zamba2-2.7B	August 2024	Hybrid Mamba2-Transformer LLM	2.7B	Apache 2.0^[19]
Zamba2-mini (1.2B)	August 2024	Hybrid LLM for on-device	1.2B	Apache 2.0^[18]
Zamba2-7B	October 14, 2024	Hybrid Mamba2-Transformer LLM	7B	Apache 2.0^[17]
Zyda-2	October 2024	Pretraining dataset	5T tokens	ODC-By^[31]
Zonos-v0.1	February 10, 2025	Text-to-speech (transformer and SSM hybrid)	1.6B each	Apache 2.0^[33]
Compressed Convolutional Attention	October 2025	Research paper	n/a	Apache 2.0 (code)^[27]
ZAYA1-8B	May 6, 2026	Reasoning MoE	8.4B total / 760M active	Apache 2.0^[6]
ZAYA1-VL-8B	May 2026	Vision language MoE	9.2B total / 1.4B active	Apache 2.0^[7]

Zamba and Zamba2 model family

The Zamba family was Zyphra's first major release and remains the most cited line of work for the company. The original Zamba-7B was announced in April 2024 and followed by an arXiv paper, "Zamba: A Compact 7B SSM Hybrid Model" (arXiv:2405.16712), published on May 26, 2024.^[15]^[16] The model uses a backbone of Mamba state space layers interleaved with a single shared attention block whose weights are reused across the network.^[15] The shared attention design lets the model gain global mixing without paying the parameter cost of a full per-layer transformer.^[15]

Zyphra reported that Zamba-7B approached the performance of Mistral and Gemma at comparable scales while training on roughly half the tokens, and outperformed LLaMA-2 7B and OLMo-7B across a broad set of standard benchmarks.^[15]^[35] The model was open-sourced under Apache 2.0 along with all training checkpoints, an unusual choice that let researchers study the trajectory of capability emergence.^[16]

The second generation, Zamba2, replaced the Mamba1 backbone with Mamba2 blocks.^[20] Mamba2 is a refinement of the original state space architecture, with roughly four times the throughput of an equivalent-parameter transformer block.^[20] The Zamba2 suite shipped in three sizes:

Zamba2-mini (1.2B): announced in August 2024, this variant uses a single shared attention block (rather than two interleaved blocks) to maximize FLOP count per parameter, making it well suited to on-device inference.^[18]
Zamba2-2.7B: released alongside the mini, this size uses two interleaved shared attention blocks in an ABAB pattern and is positioned as a mid-scale efficient model.^[19]
Zamba2-7B: released on October 14, 2024, this is the flagship of the Zamba2 family, also using two shared attention blocks.^[17] Zyphra reported 25% faster time-to-first-token and a 20% improvement in tokens-per-second compared to Llama3-8B, with substantially lower memory usage.^[17] The Mistral v0.1 tokenizer was used across all Zamba2 models.^[20]

A distinctive architectural feature of Zamba2 is that it stores key-value caches only for invocations of the shared attention block, not for every layer.^[20] At a typical ratio of one attention invocation per six Mamba2 blocks, this cuts KV-cache memory by a factor of roughly six compared with a pure transformer of similar size.^[20] LoRA projectors are applied to each shared MLP and attention block to enable a degree of depth-specialization without inflating parameter counts.^[20]
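The KV-cache saving described above can be checked with a small calculation. The sketch below builds a layer schedule with one shared-attention call per six Mamba2 blocks and compares cache size against a stack that runs attention in every block. The block count, sequence length, head count, and head dimension are hypothetical values chosen for illustration, not Zamba2's actual configuration.

```python
def hybrid_schedule(n_mamba_blocks, mamba_per_attention):
    """Return layer labels with one shared-attention call before each group of Mamba2 blocks."""
    layers = []
    for i in range(n_mamba_blocks):
        if i % mamba_per_attention == 0:
            layers.append("shared_attn")  # the same weights are reused at every call
        layers.append("mamba2")
    return layers


def kv_cache_bytes(n_attention_calls, seq_len, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every attention call."""
    return 2 * n_attention_calls * seq_len * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical dimensions, chosen only to make the ratio visible.
N_BLOCKS, SEQ_LEN, KV_HEADS, HEAD_DIM = 54, 4096, 32, 128

layers = hybrid_schedule(N_BLOCKS, mamba_per_attention=6)
hybrid_calls = layers.count("shared_attn")

hybrid = kv_cache_bytes(hybrid_calls, SEQ_LEN, KV_HEADS, HEAD_DIM)
dense = kv_cache_bytes(N_BLOCKS, SEQ_LEN, KV_HEADS, HEAD_DIM)  # attention in every block

print(hybrid_calls, dense / hybrid)  # 9 6.0
```

Because the cache grows linearly with the number of attention calls, the saving equals the ratio of attention calls regardless of the other dimensions.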

Zamba family parameter counts

Model	Total parameters	Pretraining tokens	Mamba variant	Shared attention blocks	Tokenizer
Zamba-7B (v1)	7B	~1T (less than half of Mistral/Gemma comparators)	Mamba1	1 (global)	Custom^[15]
Zamba2-mini (1.2B)	1.2B	3T + 100B anneal	Mamba2	1	Mistral v0.1^[18]^[23]
Zamba2-2.7B	2.7B	3T + 100B anneal	Mamba2	2 (ABAB)	Mistral v0.1^[19]^[22]
Zamba2-7B	7B	~3T + 100B anneal	Mamba2	2 (ABAB)	Mistral v0.1^[17]^[21]

The Zamba2 technical report was published on arXiv as 2411.15242 in November 2024 with authors Paolo Glorioso, Quentin Anthony, Yury Tokpanov, and collaborators.^[20]

Tree Attention

Tree Attention is a research result first published in August 2024, with the paper "Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters" (arXiv:2408.04093).^[24] The work was a collaboration between Zyphra researchers and EleutherAI, and the code is hosted at github.com/Zyphra/tree_attention under Apache 2.0.^[24]^[26]

The paper makes two contributions. First, it derives a scalar energy function whose gradient computes the self-attention block, drawing a connection between attention and energy-based models such as Hopfield networks.^[24] Second, and more practically, it exploits a structural property of attention that prior work had not put to use. The reduction of the logsumexp and max operators across the sequence axis is associative, which means it can be computed in parallel via an associative scan, the same primitive used in state space model training.^[24]

In distributed decoding on GPU clusters, this lets attention be computed using a tree reduction with logarithmic depth in the number of devices, rather than the linear depth of ring attention.^[24] The reported result is up to 8x faster cross-device decoding than ring attention, with significantly lower communication volume and 2x lower peak memory.^[24]^[25] The Tree Attention algorithm has been folded into several inference engines and is one of the standard reference points for long-context decoding on multi-GPU servers.
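The associativity argument can be illustrated with a minimal single-query sketch in plain Python. Each simulated device holds a shard of keys and values and produces a partial result (running max, sum of exponentials, weighted value sum); partials are then merged pairwise in a tree. The function names and toy data are illustrative and are not taken from the Zyphra reference implementation, which targets real GPU collectives.

```python
import math


def local_partial(q, keys, values):
    """Per-shard partial: (max score, sum of exp, exp-weighted sum of values)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    numer = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
    return m, sum(weights), numer


def combine(a, b):
    """Associative merge of two partials, rescaled to a shared max for stability."""
    m_a, s_a, o_a = a
    m_b, s_b, o_b = b
    m = max(m_a, m_b)
    c_a, c_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, s_a * c_a + s_b * c_b, [x * c_a + y * c_b for x, y in zip(o_a, o_b)]


def tree_reduce(parts):
    """Merge partials pairwise; returns the result and the number of tree levels."""
    depth = 0
    while len(parts) > 1:
        merged = [combine(parts[i], parts[i + 1]) for i in range(0, len(parts) - 1, 2)]
        if len(parts) % 2:
            merged.append(parts[-1])
        parts, depth = merged, depth + 1
    return parts[0], depth


def make_rows(n, dim, phase):
    """Deterministic toy data."""
    return [[math.sin(phase + i + 0.5 * d) for d in range(dim)] for i in range(n)]


q = [0.3, -0.1, 0.7, 0.2]
keys, values = make_rows(16, 4, 0.0), make_rows(16, 4, 1.0)

# Eight simulated devices, two key/value rows each.
shards = [local_partial(q, keys[i:i + 2], values[i:i + 2]) for i in range(0, 16, 2)]
(m, s, o), depth = tree_reduce(shards)
tree_out = [x / s for x in o]

# Reference: the same attention computed in one place.
m_ref, s_ref, o_ref = local_partial(q, keys, values)
ref_out = [x / s_ref for x in o_ref]
```

With eight shards the merge takes three levels, which is the logarithmic depth the paper contrasts with a ring's linear number of steps.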

Compressed Convolutional Attention

Compressed Convolutional Attention (CCA) is a more recent attention variant, introduced in October 2025 in arXiv:2510.04476 by Tomas Figliolia, Nicholas Alonso, Rishi Iyer, Quentin Anthony, and Beren Millidge.^[27] CCA down-projects queries, keys, and values into a shared low-dimensional latent space and performs the entire attention operation inside that latent.^[27] Because the whole operation runs in the compressed space, a single compression factor reduces parameters, KV cache, and FLOPs together.^[27]

A convolution over the sequence and channel axes is applied to the compressed Q and K latents, which is the source of the "convolutional" name.^[27] The paper reports that this additional mixing lets CCA match or exceed multi-head latent attention (MLA) and standard multi-head attention (MHA) on quality metrics.^[27] An extension called Compressed Convolutional Grouped Query Attention (CCGQA) adds GQA-style head sharing inside the latent space and yields a further 2x KV cache reduction without measurable performance loss.^[27]

On an MoE workload, CCGQA achieves 8x KV cache compression with no drop in performance relative to MHA.^[27] On H100 GPUs, the fused CCA/CCGQA kernel reduces prefill latency by a factor of roughly 1.7 at a 16k sequence length compared to MHA.^[27] CCA is the attention method used inside ZAYA1-8B, and the kernel work formed the foundation for the model's efficient inference profile.^[5]
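The cache figures above can be made concrete with a short accounting sketch. It assumes, purely for illustration, that the 8x figure decomposes into a 4x latent compression and a 2x reduction from GQA-style head sharing; the source states the 8x total and the additional 2x from CCGQA, but the 4x split, the model width, and the head counts below are assumptions.

```python
def kv_cache_elems_per_token(d_model, compression=1, n_heads=1, n_kv_heads=None):
    """Cached key and value elements per token per layer.

    compression: factor by which the latent width is smaller than d_model.
    n_kv_heads:  number of key/value heads; defaults to n_heads (no sharing).
    """
    if n_kv_heads is None:
        n_kv_heads = n_heads
    latent_dim = d_model // compression
    return 2 * latent_dim * n_kv_heads // n_heads


# Hypothetical configuration.
D_MODEL, N_HEADS = 2048, 16

mha = kv_cache_elems_per_token(D_MODEL, n_heads=N_HEADS)
cca = kv_cache_elems_per_token(D_MODEL, compression=4, n_heads=N_HEADS)
ccgqa = kv_cache_elems_per_token(D_MODEL, compression=4, n_heads=N_HEADS, n_kv_heads=8)

print(mha // cca, mha // ccgqa)  # 4 8
```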

ZAYA1-8B

ZAYA1-8B is Zyphra's first reasoning-focused MoE and was announced on May 6, 2026.^[2] The model has 8.4 billion total parameters but only 760 million active parameters per forward pass, placing it in the under-1B-active class.^[2]^[6] It is built on an in-house architecture recipe the company calls MoE++, which combines several architectural choices the company introduced in earlier papers:

Compressed Convolutional Attention as the attention primitive, providing 8x KV-cache compression compared to standard attention.^[27]
MLP-based expert router with PID-controller bias balancing, replacing the linear router common in earlier MoE designs and improving routing stability across training.^[5]
Learned residual scaling, which manages residual stream norm growth as the network gets deeper.^[5]
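The idea of bias balancing with a PID controller can be sketched as a toy simulation. This is one plausible reading of the phrase, not Zyphra's implementation: the controller form, the gains, top-1 routing, and the Gaussian noise are all assumptions, and the fixed `base_logits` stand in for the output of the MLP router, which is not modeled here.

```python
import random


class PIDBiasBalancer:
    """Hypothetical per-expert PID controller that steers load toward uniform."""

    def __init__(self, n_experts, kp=1.0, ki=0.5, kd=0.1):
        self.target = 1.0 / n_experts
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = [0.0] * n_experts
        self.prev_error = [0.0] * n_experts
        self.bias = [0.0] * n_experts

    def update(self, load):
        """Recompute each expert's routing bias from its observed load fraction."""
        for e, frac in enumerate(load):
            error = self.target - frac  # positive when the expert is under-used
            self.integral[e] += error
            derivative = error - self.prev_error[e]
            self.bias[e] = self.kp * error + self.ki * self.integral[e] + self.kd * derivative
            self.prev_error[e] = error
        return self.bias


def route_batch(base_logits, bias, n_tokens, rng):
    """Top-1 routing of n_tokens; returns the fraction of tokens sent to each expert."""
    counts = [0] * len(base_logits)
    for _ in range(n_tokens):
        scores = [b + rng.gauss(0.0, 1.0) + extra for b, extra in zip(base_logits, bias)]
        counts[scores.index(max(scores))] += 1
    return [c / n_tokens for c in counts]


def simulate(steps=200, n_tokens=256, seed=0):
    rng = random.Random(seed)
    base_logits = [1.0, 0.5, 0.0, -0.5]  # deliberately skewed toward expert 0
    balancer = PIDBiasBalancer(len(base_logits))
    history = []
    for _ in range(steps):
        load = route_batch(base_logits, balancer.bias, n_tokens, rng)
        history.append(load)
        balancer.update(load)
    return history, balancer.bias


def imbalance(loads):
    """Gap between the busiest and idlest expert, averaged over the given steps."""
    mean = [sum(col) / len(loads) for col in zip(*loads)]
    return max(mean) - min(mean)


history, final_bias = simulate()
initial_imbalance = imbalance(history[:1])
final_imbalance = imbalance(history[-20:])
```

In this toy setting the integral term accumulates a negative bias on the over-used expert until loads even out, without adding an auxiliary loss term to the training objective.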

The ZAYA1-8B post-training pipeline involves a four-stage reinforcement learning cascade and introduces Markovian RSA, a test-time compute method that fuses the Markovian thinker idea (reasoning in fixed-length context chunks where only the tail of each chunk is passed forward) with parallel trace generation and recursive self-aggregation (RSA).^[5] The result is that the model can run effectively unbounded chains of thought without ever overflowing a fixed context window, by carrying forward only a 4K-token tail between chunks and periodically aggregating partial traces.^[3]^[5]
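The chunking half of this scheme can be sketched with a stub in place of the model. The code below covers only fixed-length chunks with a carried tail; the parallel traces and recursive self-aggregation are omitted. `toy_generator`, the chunk length, and the 4-token tail are hypothetical stand-ins (the source describes a 4K-token tail).

```python
def toy_generator(context, chunk_len, stop_at=50):
    """Hypothetical stand-in for a model call: counts upward from the last context token."""
    start = context[-1] + 1
    chunk = list(range(start, min(start + chunk_len, stop_at + 1)))
    return chunk, chunk[-1] == stop_at


def markovian_reasoning(prompt, generate_chunk, chunk_len=16, tail_len=4, max_chunks=8):
    """Reason in fixed-length chunks, passing only the tail of each chunk forward."""
    carry = []
    trace_len = peak_context = chunks_used = 0
    for _ in range(max_chunks):
        context = prompt + carry  # earlier chunks are dropped, only the tail survives
        chunk, done = generate_chunk(context, chunk_len)
        chunks_used += 1
        trace_len += len(chunk)
        peak_context = max(peak_context, len(context) + len(chunk))
        if done:
            break
        carry = chunk[-tail_len:]
    return {"trace_len": trace_len, "peak_context": peak_context, "chunks_used": chunks_used}


result = markovian_reasoning([0], toy_generator)
print(result)  # {'trace_len': 50, 'peak_context': 21, 'chunks_used': 4}
```

The total trace grows with the number of chunks, while the context seen by any single call stays bounded by the prompt length plus the tail length plus the chunk length.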

ZAYA1-8B benchmark results

The figures below are from the official Hugging Face model card and the Zyphra release post.^[5]^[6] Mathematics scores are reported across single-pass evaluation and extended test-time compute with Markovian RSA enabled. Scores driven by Markovian RSA are not strictly comparable to standard pass@1 numbers on closed-source models.

Benchmark	ZAYA1-8B	Qwen3-4B	Qwen3.5-4B	Gemma-4-E4B
AIME'26	89.1	77.5	84.5	50.3
HMMT Feb.'26	71.6	60.8	63.6	32.1
IMO-AnswerBench	59.3	50.9	48.7	27.3
LiveCodeBench-v6	65.8	54.2	n/a	54.2
GPQA-Diamond	71.0	66.5	76.2	57.4
MMLU-Pro	74.2	74.3	79.1	70.2
IFEval	85.58	86.80	89.80	88.50
APEX-shortlist	32.2	n/a	n/a	n/a^[6]

With Markovian RSA enabled at high test-time compute (a roughly 5.5 million token budget), Zyphra reports a 91.9 score on AIME'25 and 89.6 on HMMT'25, the latter exceeding Claude 4.5 Sonnet (88.3) and approaching GPT-5-High on the same benchmark.^[5] On APEX-shortlist with extended compute, the company reports outperforming DeepSeek-V3.2 at a tiny fraction of the active parameter count.^[5] These results sit in an unusual region of the design space: a model with under 1B active parameters, run with aggressive test-time compute, matching the math performance of models with hundreds of times more active parameters.^[3]

The accompanying CEO statement framed the design philosophy directly. "ZAYA1-8B demonstrates what is possible when architecture, pretraining, and reinforcement learning are co-designed toward a single objective: maximizing the intelligence extracted per parameter and per FLOP," Puthalath said in the launch announcement.^[2]

ZAYA1-VL-8B

A vision-language variant, ZAYA1-VL-8B, was released alongside the text model in May 2026 and shares the same base LLM.^[7]^[8] The VLM has 9.2 billion total parameters (1.4 billion active including the vision encoder) and uses the Qwen2.5-VL vision tower for image encoding.^[7] Two architectural choices distinguish it from the base model: vision-specific LoRA adapters integrated into the LLM rather than additional experts, which keeps the MoE structure intact, and bidirectional attention over image tokens to improve visual understanding.^[8] Zyphra reports performance competitive with Molmo2-4B and InternVL3.5-4B, exceeding Qwen2.5-VL-3B, PLM-3B, and MolmoE-8B on a range of image understanding, reasoning, and counting benchmarks.^[7]^[8]

Training infrastructure

ZAYA1-8B is the first widely discussed reasoning model trained end-to-end on AMD hardware.^[4]^[36] The training cluster, hosted on IBM Cloud, consists of 1,024 AMD Instinct MI300X GPUs connected by AMD Pensando Pollara 400 AI NICs.^[4]^[9] The MI300X is a flagship AMD data center accelerator with 192 GB of HBM3 memory per GPU, a large capacity by the standards of commercially available AI training accelerators.^[4] The Pollara interconnect uses Ultra Ethernet to provide high-bandwidth GPU-to-GPU communication, an alternative to NVIDIA's NVLink and InfiniBand-based topologies common at other frontier labs.^[9]^[10]

The cluster was the same infrastructure announced in the October 2025 IBM-AMD-Zyphra partnership and served as a real-world workload demonstrating the AMD training stack at scale.^[9]^[10] Zyphra's engineering team contributed back to the ecosystem with kernel work, profiling tools, and ROCm-targeted modifications that have since been folded into AMD's public software stack.^[4]

Datasets: Zyda and Zyda-2

Zyphra also publishes large open pretraining datasets under the Zyda brand. The original Zyda was released in June 2024 as a 1.3 trillion token dataset, built by filtering and deduplicating across RefinedWeb, Starcoder, C4, the Pile, SlimPajama, peS2o, and arXiv.^[28]^[30] The accompanying paper (arXiv:2406.01981) reports that models trained on Zyda outperform models trained on any individual source dataset at comparable scale.^[28]

Zyda-2 followed in October 2024 and expanded the corpus to 5 trillion tokens, incorporating DCLM, FineWeb-Edu, and the Common Crawl portion of Dolma v1.7.^[29]^[31] The data pipeline was rebuilt using NVIDIA's NeMo Curator, a GPU-accelerated curation library, which cut total processing time from roughly three weeks to two days.^[29] Models trained on Zyda-2 outperform identical models trained on the Pile, RefinedWeb, FineWeb, FineWeb-Edu, or DCLM individually.^[29] Both datasets are available on the Zyphra Hugging Face organization under the ODC-By license.^[30]^[31]

Zonos text-to-speech

In February 2025 Zyphra extended its open-model strategy beyond text with the beta release of Zonos-v0.1, a pair of 1.6B parameter text-to-speech models.^[32]^[33] One uses a standard transformer architecture; the other is an SSM hybrid based on Mamba2, reported as the first openly released SSM-based TTS model.^[32]^[33] Both were trained on roughly 200,000 hours of speech, primarily English with substantial Chinese, Japanese, French, Spanish, and German content.^[32]^[33]

Zonos generates audio natively at 44 kHz and supports voice cloning from clips as short as five seconds.^[32]^[34] The models accept conditioning inputs for speaking rate, pitch, audio quality, and emotion.^[34] Zyphra released the model weights on Hugging Face under the Apache 2.0 license, framing the release as a direct alternative to closed-source TTS providers.^[33]

Maia

Maia is the name Zyphra has given to a general-purpose AI agent currently in development under the IBM-AMD partnership.^[9] The company describes it as a superagent spanning language, vision, and audio modalities, targeted at enterprise knowledge workers.^[9] Maia is built on top of the same MoE foundation models Zyphra trains on the IBM Cloud MI300X cluster, with ZAYA1 representing the first publicly released building block.^[9] The agent's full capabilities and launch timing have not been disclosed in public statements as of the May 2026 ZAYA1-8B release.

Reception and significance

Zyphra's standing in the AI research community changed sharply between 2024 and 2026. The original Zamba paper drew technical interest but limited mainstream coverage.^[35] The Zamba2 release a few months later landed in the middle of the wave of Mamba-based research and brought the company to the attention of inference-efficiency-focused engineers. The Tree Attention paper widened that audience by addressing distributed decoding, a long-standing pain point for long-context serving.^[24]

The ZAYA1-8B release was the first time Zyphra was widely treated as a frontier laboratory rather than a hybrid-architecture specialist. Coverage emphasized two themes. First, that a model with fewer than one billion active parameters could match or exceed first-generation reasoning models (DeepSeek-R1-0528, Gemini 2.5 Pro, Claude 4.5 Sonnet) on hard mathematics benchmarks.^[3]^[4] Second, that the model was trained entirely on AMD hardware, demonstrating that competitive frontier training is possible outside the NVIDIA ecosystem.^[4]^[36] The combination matters for both technical and strategic reasons: it offers a path for labs that lack access to large NVIDIA allocations, and it gives AMD and IBM a publicly reproducible reference workload for selling MI300-class clusters to enterprise customers.

The company's open-license posture stands out among labs operating at this scale. Models, weights, inference code, datasets, and technical reports have all shipped under Apache 2.0 or equivalent permissive licenses, with no commercial use restrictions.^[6] Zyphra has framed this as central to its identity, arguing that intelligence should not be controlled by a small number of closed-source frontier labs.^[1]

Architectural innovations summary

The through-line of Zyphra's technical work is reducing the cost of intelligence per parameter and per FLOP, often by changing the building blocks rather than scaling up. The table below summarizes the company's named contributions and where they appear.

Innovation	Year	Description	First used in
Shared attention block	2024	A single attention block whose weights are reused across the network	Zamba-7B^[15]
Mamba2 hybrid backbone	2024	Mamba2 SSM layers interleaved with shared attention	Zamba2 family^[20]
LoRA projectors on shared blocks	2024	Per-invocation low-rank adapters for depth specialization	Zamba2 family^[20]
Tree Attention	2024	Tree-reduction parallelization of attention across GPUs	Inference engines^[24]
Compressed Convolutional Attention	2025	Attention in a shared low-rank latent with sequence convolution	ZAYA1-8B^[27]
MoE++ architecture recipe	2026	CCA, MLP router with PID balancing, learned residual scaling	ZAYA1-8B^[5]
Markovian RSA	2026	Recursive self-aggregation over fixed-length reasoning chunks	ZAYA1-8B^[5]

References

1. Zyphra company overview, About page. https://www.zyphra.com/about
2. "Zyphra Releases ZAYA1-8B, a Reasoning Model trained on AMD and Optimized for Maximum Intelligence Density per Parameter," PR Newswire, May 6, 2026. https://www.prnewswire.com/news-releases/zyphra-releases-zaya1-8b-a-reasoning-model-trained-on-amd-and-optimized-for-maximum-intelligence-density-per-parameter-302764700.html
3. "Zyphra Releases ZAYA1-8B: A Reasoning MoE Trained on AMD Hardware That Punches Far Above Its Weight Class," MarkTechPost, May 6, 2026. https://www.marktechpost.com/2026/05/06/zyphra-releases-zaya1-8b-a-reasoning-moe-trained-on-amd-hardware-that-punches-far-above-its-weight-class/
4. "Meet ZAYA1-8B, a super efficient, open reasoning model trained on AMD Instinct MI300 GPUs," VentureBeat, May 2026. https://venturebeat.com/technology/meet-zaya1-8b-a-super-efficient-open-reasoning-model-trained-on-amd-instinct-mi300-gpus
5. "ZAYA1-8B: Frontier intelligence density, trained on AMD," Zyphra release post. https://www.zyphra.com/post/zaya1-8b
6. ZAYA1-8B model card, Hugging Face. https://huggingface.co/Zyphra/ZAYA1-8B
7. ZAYA1-VL-8B model card, Hugging Face. https://huggingface.co/Zyphra/ZAYA1-VL-8B
8. ZAYA1-VL-8B release post, Zyphra. https://www.zyphra.com/post/zaya1-vl-8b
9. "IBM and AMD Collaborate with Zyphra on Next Generation AI Infrastructure," IBM newsroom, October 1, 2025. https://newsroom.ibm.com/2025-10-01-ibm-and-amd-collaborate-with-zyphra-on-next-generation-ai-infrastructure
10. "AI research and product company Zyphra signs deal for large AMD MI300X cluster on IBM Cloud," Data Center Dynamics. https://www.datacenterdynamics.com/en/news/ai-research-and-product-company-zyphra-signs-deal-for-large-amd-mi300x-cluster-on-ibm-cloud/
11. "Open-Source AI Startup Zyphra Targets $1 Billion Valuation," The Information. https://www.theinformation.com/briefings/open-source-ai-startup-zyphra-targets-1-billion-valuation
12. Zyphra company profile, Crunchbase. https://www.crunchbase.com/organization/zyphra-technologies
13. Krithik Puthalath profile, Crunchbase. https://www.crunchbase.com/person/krithik-puthalath
14. Beren Millidge personal site. https://www.beren.io/aboutme/
15. "Zamba: A Compact 7B SSM Hybrid Model," arXiv:2405.16712, Glorioso, Anthony et al., May 2024. https://arxiv.org/abs/2405.16712
16. "Zamba-7B," Zyphra release post, April 2024. https://www.zyphra.com/post/zamba
17. "Zamba2-7B," Zyphra release post, October 2024. https://www.zyphra.com/post/zamba2-7b
18. "Zamba2-mini (1.2B)," Zyphra release post. https://www.zyphra.com/post/zamba2-mini
19. "Zamba2-small (2.7B)," Zyphra release post. https://www.zyphra.com/post/zamba2-small
20. "The Zamba2 Suite: Technical Report," arXiv:2411.15242, Glorioso, Anthony, Tokpanov et al., November 2024. https://arxiv.org/abs/2411.15242
21. Zamba2-7B model card, Hugging Face. https://huggingface.co/Zyphra/Zamba2-7B
22. Zamba2-2.7B model card, Hugging Face. https://huggingface.co/Zyphra/Zamba2-2.7B
23. Zamba2-1.2B model card, Hugging Face. https://huggingface.co/Zyphra/Zamba2-1.2B
24. "Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters," arXiv:2408.04093. https://arxiv.org/abs/2408.04093
25. "Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU Clusters," Zyphra release post. https://www.zyphra.com/post/tree-attention-topology-aware-decoding-for-long-context-attention-on-gpu-clusters
26. Tree Attention reference implementation. https://github.com/Zyphra/tree_attention
27. "Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space," arXiv:2510.04476, Figliolia, Alonso, Iyer, Anthony, Millidge, October 2025. https://arxiv.org/abs/2510.04476
28. "Zyda: A 1.3T Dataset for Open Language Modeling," arXiv:2406.01981. https://arxiv.org/abs/2406.01981
29. "Train Highly Accurate LLMs with the Zyda-2 Open 5T-Token Dataset Processed with NVIDIA NeMo Curator," NVIDIA Technical Blog. https://developer.nvidia.com/blog/train-highly-accurate-llms-with-the-zyda-2-open-5t-token-dataset-processed-with-nvidia-nemo-curator/
30. Zyda dataset, Hugging Face. https://huggingface.co/datasets/Zyphra/Zyda
31. Zyda-2 dataset, Hugging Face. https://huggingface.co/datasets/Zyphra/Zyda-2
32. "Zyphra Introduces the Beta Release of Zonos: A Highly Expressive TTS Model with High Fidelity Voice Cloning," MarkTechPost, February 10, 2025. https://www.marktechpost.com/2025/02/10/zyphra-introduces-the-beta-release-of-zonos-a-highly-expressive-tts-model-with-high-fidelity-voice-cloning/
33. "Beta Release of Zonos-v0.1," Zyphra release post. https://www.zyphra.com/post/beta-release-of-zonos-v0-1
34. Zonos repository, GitHub. https://github.com/Zyphra/Zonos
35. "Zyphra releases Zamba, an SSM-hybrid foundation model to bring AI to more devices," VentureBeat. https://venturebeat.com/ai/zyphra-releases-zamba-an-ssm-hybrid-foundation-model-to-bring-ai-to-more-devices
36. "Zyphra Releases ZAYA1-8B Reasoning Model," HPCwire (AIwire), May 7, 2026. https://www.hpcwire.com/aiwire/2026/05/07/zyphra-releases-zaya1-8b-reasoning-model/
