Hyena
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,769 words
Hyena is a sub-quadratic, attention-free neural sequence operator introduced by Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré in the February 2023 paper "Hyena Hierarchy: Towards Larger Convolutional Language Models."[^1] The operator replaces self-attention with a recurrence that interleaves long convolutions parameterized implicitly by a small feed-forward network and elementwise, data-controlled multiplicative gating. Evaluated efficiently through Fast Fourier Transform convolutions, Hyena scales as O(L log L) in sequence length L rather than the O(L²) of dense attention, while matching transformer perplexity on standard language modeling benchmarks.[^1][^2] Hyena emerged from the Hazy Research group at Stanford University in collaboration with Mila and was published at ICML 2023.[^3][^1] Successor work, including the StripedHyena hybrid models released by Together AI in December 2023 and the Evo and Evo 2 genomic foundation models from the Arc Institute, demonstrated that the architecture scales to billions of parameters and to sequence contexts of more than one million tokens.[^4][^5][^6]
The dominant sequence modeling primitive since 2017 has been self-attention as defined in the original Transformer architecture. While expressive, self-attention has time and memory complexity that grows quadratically with sequence length, which limits the context window that large language models can ingest and bounds the throughput of long-document inference. Through 2021 and 2022, multiple research lines tried to soften this bottleneck. One line pushed kernel-level engineering, producing FlashAttention from Tri Dao and collaborators, which kept the asymptotic O(L²) compute but reduced wall-clock cost and memory use by tiling the attention computation in on-chip SRAM.[^3] A second line tried to replace attention outright with sub-quadratic alternatives such as linear attention, state-space models, and recurrent variants.[^7]
Within the second category, the deep-learning state-space model line of work, which produced S4 and its successors, struggled to match transformer quality on language modeling tasks that demand associative recall.[^8] The Hazy Research lab at Stanford, led by Christopher Ré, identified this recall gap and published the H3 paper, "Hungry Hungry Hippos: Towards Language Modeling with State Space Models," at ICLR 2023.[^8] H3 introduced a layer that stacked two state-space models with multiplicative interactions between their outputs and input projections, drawing structural inspiration from linear attention. The H3 paper demonstrated that hybrid H3 plus attention models could outperform transformers on OpenWebText perplexity and was the first SSM-based approach to come within striking distance of transformer quality at scale.[^8]
Hyena built directly on H3. Where H3 used two specific state-space layers (a shift SSM and a diagonal SSM) and a fixed number of projections, Hyena generalized the construction to an arbitrary recurrence order with implicit long convolution filters parameterized by a small neural network. The Hyena authors framed the change as the natural progression from H3: the same data-controlled multiplicative gating skeleton, but with the inner state-space layers replaced by more flexible long convolutions that could be efficiently evaluated via FFT.[^1][^2]
The paper was first posted to arXiv on 21 February 2023 (arXiv:2302.10866), revised through April 2023, and published in the Proceedings of the 40th International Conference on Machine Learning (ICML 2023).[^1][^9] The authors are affiliated with Stanford University, Mila and the Université de Montréal, with later commercialization through Together AI.[^10][^4]
Hyena is defined as an operator that takes an input sequence u of length L and returns an output sequence y of the same shape. The operator is parameterized by an integer order N (typically 2 or 3 in published configurations).[^2]
Given the input u, Hyena first produces N+1 projections through a learned linear map followed by a short, depthwise causal convolution. The projections play roles analogous to the queries, keys, and values of self-attention. In the order-N=2 case, the three projections are conventionally labeled x¹, x², and v, where v is the value path that will be progressively mixed with the gating signals.[^2][^11]
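The projection step can be sketched in plain Python. This is a minimal illustration rather than the reference implementation: the function names, the weight layout `W`, and the per-channel `taps` are assumptions introduced for the example, and nested lists stand in for tensors.

```python
def causal_depthwise_conv(z, taps):
    """Short causal convolution applied independently to each channel.

    z:    L rows, each a list of C channel values.
    taps: C lists of K filter weights (hypothetical layout); taps[c][k]
          weights the input k steps in the past for channel c.
    """
    L, C = len(z), len(z[0])
    out = [[0.0] * C for _ in range(L)]
    for t in range(L):
        for c in range(C):
            # Only positions t-k >= 0 contribute, so nothing leaks from the future.
            out[t][c] = sum(
                w * z[t - k][c] for k, w in enumerate(taps[c]) if t - k >= 0
            )
    return out


def hyena_projections(u, W, taps, order):
    """Map u (L x D) to order+1 sequences, each of shape L x D.

    W is a D x (order+1)*D matrix standing in for the learned linear map.
    In this sketch the last returned sequence plays the role of v and the
    others act as gates (an assumed ordering).
    """
    D = len(u[0])
    width = (order + 1) * D
    z = [[sum(row[i] * W[i][j] for i in range(D)) for j in range(width)] for row in u]
    z = causal_depthwise_conv(z, taps)
    return [[row[k * D:(k + 1) * D] for row in z] for k in range(order + 1)]
```

With `order=2` the function returns three sequences, matching the x¹, x², and v of the order-2 case. Because the short convolution is causal, changing a later input position leaves all earlier projection values unchanged.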
The Hyena operator is a recurrence with N steps. Starting from y⁰ = v, each step i = 1, …, N applies a long convolution with an implicitly parameterized filter hⁱ to the previous state and then multiplies the result elementwise by the gating projection xⁱ:
y^i = x^i ⊙ (h^i * y^{i-1})
where ⊙ denotes elementwise multiplication and * denotes causal convolution along the sequence dimension.[^2][^11] After N such steps, the final y^N is returned as the operator output. This corresponds to alternating a Toeplitz matrix multiplication (the convolution) with a diagonal matrix multiplication (the gate), and unrolling the recurrence shows that Hyena can be expressed as a product of alternating data-controlled diagonal matrices D_x^i and Toeplitz matrices S_h^i acting on v.[^1][^11]
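The recurrence and its matrix form can be checked against each other numerically. The sketch below handles a single channel, uses a direct O(L²) convolution for readability, and builds the product of gated Toeplitz matrices explicitly; the function names are illustrative assumptions, not identifiers from the paper.

```python
def causal_conv(h, y):
    """Direct causal convolution: out[t] = sum over s <= t of h[t-s] * y[s]."""
    return [sum(h[t - s] * y[s] for s in range(t + 1)) for t in range(len(y))]


def hyena_recurrence(v, gates, filters):
    """Apply y^i = x^i * (h^i conv y^{i-1}) for i = 1..N, starting at y^0 = v."""
    y = list(v)
    for x, h in zip(gates, filters):
        conv = causal_conv(h, y)
        y = [xt * ct for xt, ct in zip(x, conv)]
    return y


def hyena_matrix(gates, filters):
    """Build the L x L matrix D_x^N S_h^N ... D_x^1 S_h^1 explicitly."""
    L = len(gates[0])
    H = [[1.0 if r == c else 0.0 for c in range(L)] for r in range(L)]
    for x, h in zip(gates, filters):
        # Row t of D_x S_h is the gate x[t] times row t of the lower-triangular
        # Toeplitz matrix built from h.
        DS = [[x[t] * h[t - s] if s <= t else 0.0 for s in range(L)] for t in range(L)]
        H = [[sum(DS[r][k] * H[k][c] for k in range(L)) for c in range(L)] for r in range(L)]
    return H
```

Multiplying the matrix returned by `hyena_matrix` with v reproduces the output of `hyena_recurrence`, and the matrix is lower triangular, which reflects causality. The explicit matrix is only useful for inspection: materializing it costs O(L²) memory, which is exactly what the recurrence avoids.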
A key design choice is that the convolution filters hⁱ are not stored as L learned weights. Storing such filters explicitly would scale the parameter count linearly with sequence length. Instead, Hyena learns a small neural network γ_θⁱ, typically a multi-layer perceptron of fixed size, that maps a positional index t (optionally encoded with sinusoids or other positional encodings) to the filter value at that position.[^11][^2] The filter is then sampled at each of the L positions to form a long convolution kernel of arbitrary length. This implicit parameterization decouples parameter count from sequence length and enables the model to handle very long contexts without inflating its memory footprint. Window functions are applied to the implicit filter to encourage suitable temporal decay.[^1][^11]
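A toy version of an implicit filter makes the decoupling concrete. The sketch below assumes a one-hidden-layer network with sine activations, sinusoidal positional features, and an exponential decay window; the class name, layer sizes, and `decay` rate are illustrative choices rather than the published configuration.

```python
import math
import random


def positional_features(t, L, n_freq):
    """Normalized position plus sin/cos features at n_freq frequencies."""
    pos = t / max(L - 1, 1)
    feats = [pos]
    for k in range(1, n_freq + 1):
        feats += [math.sin(2 * math.pi * k * pos), math.cos(2 * math.pi * k * pos)]
    return feats


class ImplicitFilter:
    """Hypothetical filter network: position features -> filter value."""

    def __init__(self, n_freq=4, hidden=16, decay=4.0, seed=0):
        rng = random.Random(seed)
        d_in = 2 * n_freq + 1
        self.n_freq, self.decay = n_freq, decay
        self.W1 = [[rng.gauss(0, 1 / math.sqrt(d_in)) for _ in range(d_in)]
                   for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.gauss(0, 1 / math.sqrt(hidden)) for _ in range(hidden)]

    def num_params(self):
        # Depends only on the network size, never on the sequence length.
        return len(self.W1) * len(self.W1[0]) + len(self.b1) + len(self.w2)

    def __call__(self, L):
        """Sample the filter at positions 0..L-1."""
        h = []
        for t in range(L):
            f = positional_features(t, L, self.n_freq)
            hid = [math.sin(sum(w * x for w, x in zip(row, f)) + b)
                   for row, b in zip(self.W1, self.b1)]
            value = sum(w * a for w, a in zip(self.w2, hid))
            window = math.exp(-self.decay * t / L)  # assumed decay window
            h.append(value * window)
        return h
```

Calling the same `ImplicitFilter` instance with `L=64` and `L=1024` yields kernels of those lengths while `num_params()` stays fixed, which is the property the implicit parameterization is designed to provide.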
A long convolution of length L can be evaluated in O(L log L) time using the Fast Fourier Transform: transform the input and filter to the frequency domain, multiply elementwise, and transform back. Combined with the O(L) elementwise gates, the total cost of a Hyena layer is O(N L log L) for a fixed recurrence order N.[^2][^11] At very long sequence lengths this is dramatically cheaper than the O(L²) cost of dense attention. The authors of the original paper reported that an optimized Hyena layer was twice as fast as an optimized attention layer at L = 8K tokens and approximately 100 times faster at L = 64K tokens.[^1]
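The FFT route can be sketched with the standard library alone. The example below uses a textbook recursive radix-2 FFT for clarity, not speed, and compares the result with a direct O(L²) causal convolution; the function names are assumptions made for the example, and production implementations rely on optimized FFT kernels instead.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. Unnormalized."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def fft_causal_conv(h, y):
    """Causal convolution of filter h with signal y via zero-padded FFTs."""
    L = len(y)
    n = 1
    while n < 2 * L:  # pad so the circular convolution does not wrap around
        n *= 2
    H = fft([complex(x) for x in h] + [0j] * (n - len(h)))
    Y = fft([complex(x) for x in y] + [0j] * (n - L))
    prod = [a * b for a, b in zip(H, Y)]
    out = fft(prod, invert=True)
    return [(c / n).real for c in out[:L]]  # normalize and keep the causal part


def direct_causal_conv(h, y):
    """Reference O(L^2) implementation."""
    return [sum(h[t - s] * y[s] for s in range(t + 1)) for t in range(len(y))]
```

Zero-padding to at least twice the sequence length matters: without it the frequency-domain product computes a circular convolution, and late positions would leak into early ones, breaking causality.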
In attention, the mixing matrix is a function of the input through the QK product, making the operator "data-controlled" in the language of the Hyena paper. Hyena retains this property: the diagonal matrices come from the input-dependent projections x^i, so the structured operator H(u) varies with u even though its Toeplitz components do not. The Hyena authors argued that data control plus unrestricted global context plus sub-linear parameter scaling jointly identify what made attention powerful, and that an operator constructed from long convolutions plus gating could achieve all three without quadratic cost.[^1][^2]
The original Hyena paper presented experiments in three primary settings: synthetic recall and reasoning benchmarks, autoregressive language modeling on standard text datasets, and image classification on ImageNet.
The authors introduced a suite of synthetic tasks targeting capabilities such as associative recall, multi-query associative recall, and induction heads. On these tasks they compared Hyena against attention, several state-space models, and other sub-quadratic baselines at sequence lengths from a few hundred to hundreds of thousands of tokens. Hyena improved accuracy by more than 50 percentage points over the closest non-attention baselines and matched attention itself.[^1][^2] The authors argued that this performance differentiated Hyena from many sub-quadratic attempts that close the perplexity gap on bulk language modeling but fail to perform sharp targeted recall.
Hyena was trained as a causal language model on the WikiText-103 corpus and on The Pile at multiple scales up to roughly 1.3 billion parameters. On WikiText-103 the Hyena model matched transformer perplexity. On The Pile, the 1.3 billion parameter Hyena model reached approximately 10.8 perplexity after 5 billion training tokens, comparable to the GPTNeo-style transformer baselines at the same compute budget, while requiring about 20 percent less training compute at sequence length 2K.[^1][^12] The authors highlighted Hyena as the first dense, attention-free architecture to match transformer quality on these standard language modeling tasks without requiring a hybrid attention layer.[^1]
In addition to language, the authors evaluated Hyena as a drop-in replacement for self-attention inside a Vision Transformer (ViT) backbone, training from scratch on ImageNet-1k with about 88 million parameters in a Hyena-ViT-B model. The resulting model matched the accuracy of a comparably sized ViT-B baseline on ImageNet-1k, demonstrating that the operator generalizes beyond text.[^11][^1]
The paper reported wall-clock crossovers between Hyena and FlashAttention-based dense attention at approximately 6K tokens of sequence length, with the gap widening sharply at longer sequences. At 64K tokens, Hyena layers were reported as approximately 100 times faster than highly optimized attention layers.[^1][^11]
Hyena sits at the intersection of several research threads that all attempt to escape the quadratic cost of self-attention. A short comparison illustrates the design space.
| Architecture | Year | Mixing primitive | Asymptotic cost | Key idea |
|---|---|---|---|---|
| Transformer attention | 2017 | Softmax QK^T V | O(L²) | Token-pair scores via dot product[^13] |
| Linear attention | 2020 | Kernelized QK^T V | O(L) | Replace softmax with a feature map[^7] |
| S4 / state-space models | 2021 to 2022 | Linear SSM | O(L log L) | Convolutional view of structured SSM[^8] |
| H3 | December 2022 | Two stacked SSMs plus gating | O(L log L) | Bridge SSM and attention via gates[^8] |
| Hyena | February 2023 | Implicit long conv plus gates | O(L log L) | Generalize H3 to arbitrary recurrence order[^1] |
| RetNet | July 2023 | Retention | O(L) | Parallel, recurrent, and chunkwise forms[^14] |
| Mamba | December 2023 | Selective SSM | O(L) | Input-dependent SSM parameters[^15] |
Attention retains the strongest empirical performance on token-pair operations such as associative recall but at quadratic cost. Linear attention drops the softmax to obtain linear complexity but suffers a quality drop on language modeling. State-space models including S4 provide an alternative linear formulation through structured matrix decompositions but historically struggled with recall.[^8] H3 from Hazy Research showed that combining SSMs with multiplicative gating could narrow the recall gap. Hyena, in turn, generalized H3 by replacing the inner SSMs with implicit long convolutions and introducing an arbitrary recurrence order.[^1][^11] Subsequent work, particularly Mamba and Mamba 2, pushed the SSM line further by making the state-space parameters input-dependent.[^15]
RWKV and RetNet are recurrent variants that share Hyena's motivation of replacing attention but differ structurally: RWKV builds on a linear attention recurrence with time-mixing and channel-mixing blocks, while RetNet introduces a "retention" mechanism that admits parallel, recurrent, and chunkwise forms.[^14] Empirical comparisons reported in the literature show that on SuperGLUE-style downstream tasks RWKV is competitive on zero-shot accuracy while Hyena is stronger in few-shot regimes, and that all of these sub-quadratic alternatives lag dense transformers on tasks that require recalling rare or arbitrary information mentioned earlier in the prompt.[^7][^16] This recall gap was a central focus of subsequent Hazy Research papers including the Zoology benchmark suite.[^16]
A widely cited point of contrast is between Hyena and Mamba. Hyena uses an input-independent long convolution filter in each layer, evaluated via FFT, and obtains its input dependence from gating; Mamba instead uses an SSM whose B, C, and step-size (Δ) parameters are themselves functions of the input, giving it a selective mechanism built into the state-space recurrence itself.[^15] Both architectures achieve sub-quadratic compute and match transformer perplexity at small to mid scales, but they trade off differently between parallelizability, recurrence form, and hardware-friendly long-sequence inference.
The paper title refers to a "hierarchy" because the recurrence stacks N long convolutions with intermediate elementwise gating, and the authors organized the operator into a family parameterized by N. Order-2 Hyena is the smallest non-trivial variant and recovers an H3-like structure when the convolutions are specialized to particular SSM forms. Higher orders allow the operator to express more complex compositional patterns at the cost of a larger constant in the O(N L log L) compute bound.[^11][^1] In practice the published Hyena and StripedHyena models use small N, typically 2 or 3, with the bulk of capacity coming from stacking many Hyena layers and from larger filter MLPs in γ_θ.[^4]
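The dependence of cost on the order N can be made concrete with a leading-order operation count. The sketch below drops all constant factors and ignores hardware effects, so it illustrates growth rates only and should not be read as a prediction of measured speedups; both function names are hypothetical.

```python
import math


def attention_ops(L):
    """Leading-order count for dense attention mixing, constants dropped."""
    return L * L


def hyena_ops(L, order):
    """Leading-order count for `order` FFT convolutions plus `order` gates."""
    return order * (L * math.log2(L) + L)
```

Under this model, moving from order 2 to order 3 multiplies the Hyena count by 1.5 at any fixed length, while doubling the sequence length quadruples the attention count.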
Reference implementations of Hyena were released by the Hazy Research group as part of the Safari repository on GitHub.[^17] The implementation relies on causal FFT-based convolutions and short depthwise convolutions for the input projections. Subsequent work on FFT efficiency, particularly FlashFFTConv from Daniel Y. Fu, Hermann Kumbong, Eric Nguyen, and Christopher Ré, used tensor cores to speed up exact FFT convolutions by up to 7.93 times over PyTorch and delivered up to 4.4 times end-to-end speedup on Hyena and related models.[^18] FlashFFTConv reported that, for the same compute budget, it allowed a small Hyena-GPT model to reach 2.3 points better perplexity, and it extended HyenaDNA to 4 million token sequence length, sufficient to embed the longest human genes at single-nucleotide resolution.[^18]
The "Laughing Hyena Distillery" paper by Stefano Massaroli, Michael Poli, and collaborators, presented at NeurIPS 2023, addressed the autoregressive inference cost of long-convolution architectures.[^19] Naive long-convolution inference is O(L) per token because each new token requires convolving against an L-length filter. The Laughing Hyena work extracted compact linear state-space recurrences from pre-trained Hyena convolution filters using rational interpolation and model-order reduction, achieving O(1) compute and memory cost per generated token and reporting 10 times higher throughput than transformers and 1.5 times higher than vanilla Hyena at the 1.3 billion parameter scale, without quality loss after distillation.[^19]
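The equivalence that this kind of distillation relies on can be illustrated with a toy example. The sketch below does not implement the rational interpolation or model-order reduction described in the paper; it assumes a filter that is already an exact sum of decaying exponentials, with hypothetical `residues` and `poles`, and shows that such a filter can be applied token by token with a fixed-size state.

```python
def modal_filter(residues, poles, L):
    """Filter h[t] = sum_k residues[k] * poles[k]**t for t = 0..L-1."""
    return [sum(r * p ** t for r, p in zip(residues, poles)) for t in range(L)]


def conv_decode(h, inputs):
    """Convolution-style decoding: step t touches all t+1 past inputs."""
    outs = []
    for t in range(len(inputs)):
        outs.append(sum(h[t - s] * inputs[s] for s in range(t + 1)))
    return outs


def recurrent_decode(residues, poles, inputs):
    """Recurrent decoding: each step updates a state of size len(poles)."""
    state = [0.0] * len(poles)
    outs = []
    for u in inputs:
        # Diagonal linear recurrence: s_k <- pole_k * s_k + u
        state = [p * s + u for p, s in zip(poles, state)]
        outs.append(sum(r * s for r, s in zip(residues, state)))
    return outs
```

Both decoders produce the same outputs, but the recurrent form does a fixed amount of work per token regardless of how many tokens came before. A learned Hyena filter is generally not an exact sum of a few exponentials, which is why a distillation step is needed to find a compact approximation.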
StripedHyena, released by Together AI on 8 December 2023, was the first openly published seven-billion-parameter language model built around the Hyena operator.[^4][^20] The architecture is explicitly hybrid: Hyena layers (multi-head gated convolutions) carry the bulk of sequence mixing, while a small fraction of layers are conventional grouped or rotary attention layers responsible for targeted pattern recall.[^4] Two variants were released, StripedHyena-Hessian-7B (the base model) and StripedHyena-Nous-7B (a chat model fine-tuned in collaboration with Nous Research). StripedHyena was trained on the RedPajama dataset augmented with longer-context data and supports a 32K context window in its released base configuration, with internal scaling tests reaching 128K.[^4][^20]
Together AI reported that StripedHyena-Hessian-7B was competitive with strong open transformer baselines on the OpenLLM leaderboard and outperformed Llama-2 7B, Yi 7B, and RWKV 14B on several short-context tasks including ARC-Challenge, HellaSwag, WinoGrande, BoolQ, and TruthfulQA.[^4][^20] On the long-context ZeroScrolls benchmark, StripedHyena-7B exceeded a Mistral 7B baseline on GovReport F1 (27.9 versus 17.5) and on NarrativeQA F1 (25.8 versus 24.7), with a smaller gap on Qasper F1.[^4] On training throughput, the model was reported to be more than 30 percent faster than an optimized transformer with FlashAttention v2 at 32K sequence length, more than 50 percent faster at 64K, and more than 100 percent faster at 128K. Inference caches for autoregressive generation were reported to be more than 50 percent smaller than for a grouped-query attention transformer of equivalent quality.[^4]
The Hyena and StripedHyena architectures were applied to biological sequence modeling in a series of influential papers from Stanford, the Arc Institute, and collaborators.
The HyenaDNA paper, posted to arXiv as 2306.15794 in June 2023 and presented at NeurIPS 2023 as a spotlight, was the first to use Hyena for genomic foundation modeling.[^21] HyenaDNA modeled human genomic DNA at single-nucleotide resolution with context windows of up to one million tokens, which the authors described as an up to 500-fold increase over earlier dense-attention genomic models limited to a few thousand tokens, and reported up to 160 times faster training than transformers using FlashAttention.[^21] HyenaDNA achieved state-of-the-art results on 12 of 18 Nucleotide Transformer benchmarks and exceeded baselines on 7 of 8 GenomicBenchmarks tasks by roughly 10 accuracy points, while using substantially fewer parameters than competing genomic models. The paper also demonstrated the first use of in-context learning in genomics, showing that a single pre-trained model could adapt to novel tasks without weight updates.[^21]
The Evo model, described in "Sequence modeling and design from molecular to genome scale with Evo" by Eric Nguyen, Brian L. Hie, and collaborators, was posted as a preprint on 27 February 2024 and published in Science in November 2024.[^5][^22] Evo uses the StripedHyena architecture with a hybrid of 29 Hyena layers and 3 multi-head attention layers (roughly a 10 percent attention fraction), trained on the OpenGenome dataset of approximately 2.7 million prokaryotic and bacteriophage genomes totaling about 300 billion tokens, with a 131-kilobase context length.[^5][^22] The model has 7 billion parameters and is capable of both predictive and generative tasks across molecular and whole-genome scales, including generating coherent sequences longer than 650 kilobases.[^5]
Evo 2, released as a preprint on 19 February 2025 and later published in Nature, was developed by Arc Institute, Stanford University, UC Berkeley, UC San Francisco, and NVIDIA.[^6] Evo 2 scales the StripedHyena-style architecture to 40 billion parameters and a 1-megabase context window, with training over more than 9 trillion nucleotides drawn from more than 100,000 species across the entire tree of life. Reported applications include 90 percent accurate prediction of the functional impact of previously unrecognized BRCA1 mutations, generation of full bacterial-genome-scale sequences, and integration into the NVIDIA BioNeMo framework.[^6][^23]
A number of additional Hyena variants and extensions have appeared. Multi-Dimensional Hyena adapted the operator to two-dimensional inputs for image classification, and HyenaPixel, a later 2D extension, reported competitive ImageNet-1k top-1 accuracies (84.9 percent and 85.2 percent for variants) while outperforming several large-kernel convolutional baselines.[^24] A Hyena neural operator for partial differential equations applied the same convolution-plus-gating structure to physical simulation tasks.[^25] Scavenging Hyena explored distilling pre-trained transformers into long-convolution architectures, complementing the Laughing Hyena work on extracting recurrences after the fact.[^26]
Hyena and its descendants have been deployed across several domains, with the strongest impact in long-context language modeling and genomics.
StripedHyena-7B served as a public demonstration that a hybrid Hyena-plus-attention architecture could match seven-billion-parameter transformer language models on standard benchmarks while delivering substantial speedups at long context.[^4] The model and its weights were released openly under permissive licenses, allowing third parties to fine-tune the architecture and to use it as a backbone for downstream applications such as chat (StripedHyena-Nous-7B).[^20]
The most consequential application has been in computational biology. Standard transformer-based genomic models such as the Nucleotide Transformer relied on context windows of four thousand tokens or fewer and on k-mer tokenization, which discarded single-nucleotide resolution.[^21] The sub-quadratic compute of Hyena made it feasible to train on context windows of one million tokens, sufficient to span whole bacterial genomes or large eukaryotic loci, while preserving the single-nucleotide resolution needed for variant-effect prediction. HyenaDNA, Evo, and Evo 2 collectively demonstrated that this combination of long context, single-nucleotide resolution, and a foundation-model training recipe enabled new capabilities, including zero-shot variant pathogenicity prediction, generation of regulatory elements, and design of synthetic genomes.[^21][^5][^6]
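The difference between the two tokenization schemes can be shown on a short sequence. The sketch below is illustrative only: the integer vocabulary is hypothetical, and non-overlapping k-mers are assumed for simplicity.

```python
def nucleotide_tokens(seq):
    """One token per base (single-nucleotide resolution)."""
    vocab = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # hypothetical ids
    return [vocab[base] for base in seq.upper()]


def kmer_tokens(seq, k):
    """Non-overlapping k-mers; any trailing remainder is dropped."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
```

For the 12-base sequence `ACGTACGTACGT`, single-nucleotide tokenization yields 12 tokens and 6-mer tokenization yields 2. A single-base substitution changes exactly one nucleotide token, whereas with k-mers it replaces an entire 6-mer with a different vocabulary entry, so the model no longer sees which base changed. The trade-off is sequence length: single-nucleotide tokens make the sequence k times longer, which is where sub-quadratic operators become relevant.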
The original Hyena paper showed that the operator could substitute for attention in a Vision Transformer trained from scratch on ImageNet-1k.[^1] Subsequent work, including Multi-Dimensional Hyena and HyenaPixel, extended this to two-dimensional spatial mixing and reported competitive accuracy on image classification benchmarks with favorable memory scaling in the number of image patches.[^24] Hyena-style operators have also been used in speech and audio research, where long context is valuable.[^27]
Despite strong empirical results, Hyena and other sub-quadratic alternatives have several documented weaknesses.
The most discussed limitation is that sub-quadratic architectures, including Hyena, RWKV, and earlier SSMs, lag dense attention on tasks that require precise associative recall of rare or arbitrary tokens earlier in the context. The Zoology benchmark suite from Hazy Research, presented in late 2023, isolated this recall capability and showed that pure long-convolution models including Hyena underperformed attention by measurable margins on synthetic recall problems even when overall language modeling perplexity was close.[^16] This finding motivated the explicit hybridization in StripedHyena, where roughly 10 percent of layers retain attention to handle targeted recall.[^4][^16]
Vanilla Hyena layers, evaluated through FFT convolution, are sub-quadratic for the full sequence but require O(L) work per generated token during autoregressive decoding because each new token must be convolved against a filter of length up to L. This is in contrast to recurrent architectures and Mamba-style selective SSMs, which admit constant-time-per-token decoding by maintaining a fixed-size state. The Laughing Hyena Distillery paper from late 2023 addressed this gap by distilling each Hyena convolution into a compact state-space recurrence for inference time, but at the cost of an additional post-training step.[^19]
As of 2026 the open ecosystem around the Transformer remains substantially more mature than that around Hyena and its successors, with a much larger inventory of pretrained checkpoints, fine-tuning toolkits, and downstream alignment recipes. While public weights exist for StripedHyena, Evo, Evo 2, and HyenaDNA, the set of open-weight Hyena language models at seven billion parameters and above is small compared to the transformer ecosystem.[^4][^5]
In direct head-to-head comparisons with Mamba on long-sequence DNA modeling, the Mamba paper reported that Mamba's perplexity improved with longer context lengths up to 1 million tokens while a HyenaDNA baseline degraded with sequence length on the same task.[^15] Authors of subsequent biology papers have reported varied outcomes depending on the task: some find that Mamba-style selective SSMs and attention transformers outperform Hyena reimplementations on RNA-seq prediction, while others find Hyena and StripedHyena competitive on whole-genome generation.[^28] The relative ranking of Hyena, Mamba, and attention remains task-dependent.
Implicit filter parameterization is sensitive to the choice of window function, positional encoding into γ_θ, and initialization. The original paper and subsequent follow-ups documented several engineering details (such as exponential decay windows and modulated sinusoidal positional inputs) that materially affect convergence; replication efforts have noted that these details are nontrivial to get right.[^1][^11]
Hyena was developed within the broader research program of the Hazy Research group, a Stanford computer science research group led by Christopher Ré. The group's stated research agenda includes both foundational sequence modeling and systems for efficient machine learning, and it has produced several closely related lines of work over a multi-year period, including FlashAttention (Tri Dao and Daniel Y. Fu), S4 and related state-space models, H3, Hyena, FlashFFTConv, and the Zoology benchmarks for sub-quadratic recall.[^3][^8][^18][^16]
Commercialization of Hyena and its successors has occurred primarily through Together AI, a company that has employed several of the original Hyena authors including Michael Poli and Tri Dao and has published StripedHyena, Evo (via a partnership with the Arc Institute), and related kernels under open licenses on its company GitHub.[^10][^4][^5] Other authors have moved to related organizations: Stefano Massaroli has been associated with Liquid AI, a company building alternative sequence model architectures, while Yoshua Bengio continues to lead Mila in Montreal.[^10]
The following table summarizes published empirical claims for Hyena and closely related models, ranging from the operator-level comparisons in the original Hyena paper to multi-billion-parameter language and genomic models.
| Model or operator | Year | Scale | Reported claim | Source |
|---|---|---|---|---|
| Hyena (operator) | 2023 | up to 1.3B | Matches transformer on WikiText-103 and The Pile with about 20 percent less training compute at 2K sequence length[^1] | Poli et al. |
| Hyena vs FlashAttention | 2023 | layer-level | About 2x faster at 8K, about 100x faster at 64K[^1] | Poli et al. |
| HyenaDNA | 2023 | up to 1M-token context | State-of-the-art on 12 of 18 Nucleotide Transformer tasks; up to 160x faster training than transformers with FlashAttention[^21] | Nguyen et al. |
| Laughing Hyena Distillery | 2023 | 1.3B | 10x throughput vs transformer, 1.5x vs Hyena, no quality loss after distillation[^19] | Massaroli, Poli et al. |
| FlashFFTConv on Hyena | November 2023 | small Hyena-GPT | 2.3 points better perplexity at fixed compute; 4M-token HyenaDNA[^18] | Fu et al. |
| StripedHyena-7B | December 2023 | 7B | Outperforms Llama-2 7B on several OpenLLM tasks; more than 100 percent faster training at 128K[^4] | Together AI |
| Evo (StripedHyena) | 2024 | 7B | 131K context; generates more than 650K tokens; near-linear scaling of compute[^5] | Nguyen, Hie et al. |
| Evo 2 | 2025 | 40B | 1-megabase context; 9 trillion training nucleotides; predicts BRCA1 mutation effects with 90 percent accuracy[^6] | Arc Institute |