# Hyena

> Source: https://aiwiki.ai/wiki/hyena
> Updated: 2026-07-23
> Categories: Deep Learning, Model Architecture
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Hyena** is a sub-quadratic, attention-free neural sequence operator that replaces the self-[attention](/wiki/attention) operator of the [Transformer](/wiki/transformer) with a recurrence of long, implicitly parameterized convolutions and data-controlled (elementwise) gating. It was introduced by Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, [Stefano Ermon](/wiki/stefano_ermon), and [Christopher Ré](/wiki/christopher_re) in the February 2023 paper "Hyena Hierarchy: Towards Larger Convolutional Language Models."[^1] The paper describes Hyena as "a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating."[^1] Evaluated efficiently through Fast Fourier Transform convolutions, Hyena scales as O(L log L) in sequence length L rather than the O(L²) of dense attention, while reaching Transformer quality on standard language modeling benchmarks with a 20 percent reduction in training compute at sequence length 2K.[^1][^2]

Hyena emerged from the Hazy Research group at [Stanford University](/wiki/stanford_university) in collaboration with Mila and was published at [ICML](/wiki/icml) 2023.[^3][^1] Its authors reported that a Hyena operator is twice as fast as highly optimized attention at sequence length 8K and roughly 100 times faster at sequence length 64K.[^1] Successor work, including the StripedHyena hybrid models released by [Together AI](/wiki/together_ai) in December 2023 and the Evo and Evo 2 genomic foundation models from the [Arc Institute](/wiki/arc_institute), demonstrated that the architecture scales to billions of parameters and to sequence contexts of more than one million tokens.[^4][^5][^6]

## What is Hyena?

Hyena is a sequence-mixing layer designed as a direct, drop-in substitute for self-attention inside Transformer-style models. Whereas attention computes pairwise interaction scores between every pair of tokens (the source of its O(L²) cost), Hyena mixes tokens through a stack of long convolutions whose filters span the entire sequence, with input-dependent multiplicative gates inserted between the convolutions. The three properties the authors argue make attention effective, namely data-controlled mixing, unrestricted global context, and sub-linear growth of parameters with sequence length, are all retained by Hyena, but without paying the quadratic compute cost.[^1][^2]

The operator is "attention-free" in that it contains no [softmax](/wiki/softmax) and no explicit query-key score matrix. The data dependence that softmax attention obtains from the query-key product is instead supplied by the elementwise gates, whose values are projected from the input. The name "Hyena Hierarchy" refers to the recurrence stacking several long convolutions with intermediate gating into an operator family indexed by an order N.[^1][^11]

## Background

The dominant sequence modeling primitive since 2017 has been self-attention as defined in the original [Transformer](/wiki/transformer) architecture. While expressive, self-attention has time and memory complexity that grows quadratically with sequence length, which limits the context window that large language models can ingest and bounds the throughput of long-document inference. Through 2021 and 2022, multiple research lines tried to soften this bottleneck. One line pushed kernel-level engineering, producing FlashAttention from Tri Dao and collaborators, which kept the asymptotic O(L²) compute but reduced wall-clock cost and memory by tiling the attention computation in on-chip SRAM.[^3] A second line tried to replace attention outright with sub-quadratic alternatives such as [linear attention](/wiki/linear_attention), state-space models, and recurrent variants.[^7]

Within the second category, the [state space model](/wiki/state_space_model) line of [deep learning](/wiki/deep_learning) research, which produced S4 and its successors, struggled to match transformer quality on language modeling tasks that demand associative recall.[^8] The Hazy Research lab at Stanford, led by Christopher Ré, identified this recall gap and published the H3 paper, "Hungry Hungry Hippos: Towards Language Modeling with State Space Models," at ICLR 2023.[^8] H3 introduced a layer that stacked two state-space models with multiplicative interactions between their outputs and input projections, drawing structural inspiration from linear attention. The H3 paper demonstrated that hybrid H3-plus-attention models could outperform transformers on OpenWebText perplexity and was the first SSM-based approach to come within striking distance of transformer quality at scale.[^8]

Hyena built directly on H3. Where H3 used two specific state-space layers (a shift SSM and a diagonal SSM) and a fixed number of projections, Hyena generalized the construction to an arbitrary recurrence order with implicit long convolution filters parameterized by a small neural network. The Hyena authors framed the change as the natural progression from H3: the same data-controlled multiplicative gating skeleton, but with the inner state-space layers replaced by more flexible long convolutions that could be efficiently evaluated via FFT.[^1][^2]

The paper was first posted to arXiv on 21 February 2023 (arXiv:2302.10866), revised through April 2023, and published in the Proceedings of the 40th International Conference on Machine Learning (ICML 2023).[^1][^9] The authors are affiliated with Stanford University, Mila and the Université de Montréal, with later commercialization through Together AI.[^10][^4]

## How does the Hyena operator work?

Hyena is defined as an operator that takes an input sequence u of length L and returns an output sequence y of the same shape. The operator is parameterized by an integer order N (typically 2 or 3 in published configurations).[^2]

### Projections

Given the input u, Hyena first produces N+1 projections through a learned linear map followed by a short, depthwise causal convolution. The projections play roles analogous to the queries, keys, and values of self-attention. For order N = 2, the three projections are conventionally labeled x¹, x², and v, where x¹ and x² are the gating signals and v is the value path that is progressively mixed with them.[^2][^11]
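The projection step can be sketched in NumPy as follows. This is a minimal illustration, not the reference implementation: the function names, the tensor shapes, the kernel length of 3, and the choice to take the last split as v are all assumptions made for the example.

```python
import numpy as np

def short_causal_depthwise_conv(z, kernel):
    """Filter each channel of z independently with a short causal kernel.

    z: (L, C) sequence, kernel: (K, C). The output at position t depends
    only on z[t-K+1 .. t], so no future position leaks into the result.
    """
    L, C = z.shape
    K = kernel.shape[0]
    padded = np.concatenate([np.zeros((K - 1, C)), z], axis=0)
    out = np.zeros((L, C))
    for k in range(K):
        # kernel[k] weights the input that lies k steps in the past
        out += kernel[k] * padded[K - 1 - k : K - 1 - k + L]
    return out

def hyena_projections(u, W, kernel, order=2):
    """Return the gates [x^1, ..., x^N] and the value path v.

    u: (L, D) input, W: (D, (order+1)*D) linear map (hypothetical layout),
    kernel: (K, (order+1)*D) short depthwise filter.
    """
    z = short_causal_depthwise_conv(u @ W, kernel)
    *gates, v = np.split(z, order + 1, axis=1)  # assumed ordering: gates first, v last
    return gates, v

rng = np.random.default_rng(0)
L, D, K, N = 16, 4, 3, 2
u = rng.standard_normal((L, D))
W = rng.standard_normal((D, (N + 1) * D))
kernel = rng.standard_normal((K, (N + 1) * D))
gates, v = hyena_projections(u, W, kernel, order=N)
```

Perturbing the last input position changes only the last row of each projection, which is a quick way to check that the short convolution is causal.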

### Recurrence

The Hyena operator is a recurrence with N steps. Starting from y⁰ = v, each step i applies a long convolution with an implicitly parameterized filter hⁱ and then multiplies the result elementwise by the gating projection xⁱ:

$$
y^i = x^i \odot (h^i * y^{i-1})
$$

where ⊙ denotes elementwise multiplication and * denotes causal convolution along the sequence dimension.[^2][^11] After N such steps, the final y^N is returned as the operator output. This corresponds to alternating a Toeplitz matrix multiplication (the convolution) with a diagonal matrix multiplication (the gate), and unrolling the recurrence shows that Hyena can be expressed as a product of alternating data-controlled diagonal matrices D_x^i and Toeplitz matrices S_h^i acting on v.[^1][^11]
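The recurrence and its matrix form can be checked numerically. The sketch below works on a single channel with random gates and filters; the helper names (`causal_conv`, `hyena_recurrence`, `toeplitz_lower`) are illustrative and the direct `np.convolve` call stands in for the FFT evaluation used in practice.

```python
import numpy as np

def causal_conv(h, y):
    """(h * y)_t = sum over s <= t of h[t-s] * y[s], for 1-D arrays of length L."""
    return np.convolve(h, y)[: len(y)]

def hyena_recurrence(gates, filters, v):
    """Apply y^i = x^i ⊙ (h^i * y^{i-1}) for i = 1..N, starting from y^0 = v."""
    y = v
    for x, h in zip(gates, filters):
        y = x * causal_conv(h, y)
    return y

def toeplitz_lower(h):
    """Lower-triangular Toeplitz matrix S with S[t, s] = h[t-s] for t >= s."""
    L = len(h)
    idx = np.arange(L)[:, None] - np.arange(L)[None, :]
    return np.where(idx >= 0, h[np.clip(idx, 0, L - 1)], 0.0)

rng = np.random.default_rng(0)
L, N = 8, 2
v = rng.standard_normal(L)
gates = [rng.standard_normal(L) for _ in range(N)]
filters = [rng.standard_normal(L) for _ in range(N)]

y = hyena_recurrence(gates, filters, v)

# Unrolled form: H = D_x^N S_h^N ... D_x^1 S_h^1, applied to v
H = np.eye(L)
for x, h in zip(gates, filters):
    H = np.diag(x) @ toeplitz_lower(h) @ H
y_matrix = H @ v
```

Here `y` and `y_matrix` agree up to floating-point error, and `H` is lower triangular, which reflects the causality of the operator.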

### Implicit filter parameterization

A key design choice is that the convolution filters hⁱ are not stored as L learned weights. Storing such filters explicitly would scale the parameter count linearly with sequence length. Instead, Hyena learns a small neural network γ_θⁱ, typically a multi-layer perceptron of fixed size, that maps a positional index t (optionally encoded with sinusoids or other positional encodings) to the filter value at that position.[^11][^2] The filter is then sampled at each of the L positions to form a long convolution kernel of arbitrary length. This implicit parameterization decouples parameter count from sequence length and enables the model to handle very long contexts without inflating its memory footprint. Window functions are applied to the implicit filter to encourage suitable temporal decay.[^1][^11]
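A small sketch of the idea follows. The network size, the sinusoidal positional features, the sine activation, and the exponential decay window are assumptions chosen for illustration; the published models use their own settings. The point is only that the same parameters produce a filter of any requested length.

```python
import numpy as np

def positional_features(L, n_freq=4):
    """Map positions 0..L-1 (normalized to [0, 1]) to sinusoidal features."""
    t = np.linspace(0.0, 1.0, L)[:, None]            # (L, 1)
    freqs = np.arange(1, n_freq + 1)[None, :]        # (1, n_freq)
    angles = 2.0 * np.pi * freqs * t                 # (L, n_freq)
    return np.concatenate([t, np.sin(angles), np.cos(angles)], axis=1)

def init_filter_mlp(rng, in_dim, hidden=16, channels=4):
    """Parameters of a tiny two-layer network (hypothetical sizes)."""
    return {
        "W1": rng.standard_normal((in_dim, hidden)) / np.sqrt(in_dim),
        "b1": np.zeros(hidden),
        "W2": rng.standard_normal((hidden, channels)) / np.sqrt(hidden),
    }

def implicit_filter(params, L, decay=4.0, n_freq=4):
    """Sample the filter at L positions and apply an exponential decay window."""
    feats = positional_features(L, n_freq)
    hidden = np.sin(feats @ params["W1"] + params["b1"])  # assumed activation
    h = hidden @ params["W2"]                             # (L, channels)
    window = np.exp(-decay * np.linspace(0.0, 1.0, L))[:, None]
    return h * window

rng = np.random.default_rng(0)
params = init_filter_mlp(rng, in_dim=1 + 2 * 4)
n_params = sum(p.size for p in params.values())

h_short = implicit_filter(params, L=64)
h_long = implicit_filter(params, L=4096)
```

Both `h_short` and `h_long` come from the same `n_params` weights, whereas an explicit filter would need L weights per channel.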

### Why is Hyena sub-quadratic? Evaluation via FFT

A long convolution of length L can be evaluated in O(L log L) time using the Fast Fourier Transform: transform the input and filter to the frequency domain, multiply elementwise, and transform back. Combined with the O(L) elementwise gates, the total cost of a Hyena layer is O(N L log L) for a fixed recurrence order N.[^2][^11] At very long sequence lengths this is dramatically cheaper than the O(L²) cost of dense attention. The authors of the original paper reported that an optimized Hyena layer was twice as fast as an optimized attention layer at L = 8K tokens and approximately 100 times faster at L = 64K tokens.[^1]
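The FFT evaluation described above can be sketched with NumPy's real FFT routines. The function names are illustrative; the zero-padding to length 2L is what turns the circular convolution computed by the FFT into the causal, linear convolution the operator needs.

```python
import numpy as np

def fft_causal_conv(h, y):
    """Causal convolution of two length-L signals in O(L log L)."""
    L = len(y)
    n = 2 * L  # pad so the circular convolution does not wrap around
    H = np.fft.rfft(h, n=n)
    Y = np.fft.rfft(y, n=n)
    return np.fft.irfft(H * Y, n=n)[:L]

def direct_causal_conv(h, y):
    """Reference O(L^2) evaluation of the same sum, for checking only."""
    L = len(y)
    out = np.zeros(L)
    for t in range(L):
        for s in range(t + 1):
            out[t] += h[t - s] * y[s]
    return out

rng = np.random.default_rng(0)
L = 128
h = rng.standard_normal(L)
y = rng.standard_normal(L)

fast = fft_causal_conv(h, y)
slow = direct_causal_conv(h, y)
```

The two results match up to floating-point error; only the cost differs.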

### Data-controlled mixing

In attention, the mixing matrix is a function of the input through the QK product, making the operator "data-controlled" in the language of the Hyena paper. Hyena retains this property: the diagonal matrices come from the input-dependent projections x^i, so the overall operator H(u), the product of the diagonal and Toeplitz matrices, varies with u even though its Toeplitz components do not. The Hyena authors argued that data control, unrestricted global context, and sub-linear parameter scaling jointly identify what made attention powerful, and that an operator constructed from long convolutions plus gating could achieve all three without quadratic cost.[^1][^2] In the paper's framing, existing sub-quadratic methods fell short because, as the abstract puts it, they "need to be combined with dense attention layers to match Transformers, indicating a gap in capability," whereas Hyena was the first dense, attention-free operator to close that gap on standard language modeling.[^1]
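Written out with the symbols used in the recurrence section, the data-controlled operator takes the following form. The index names t, s, and r are introduced here for illustration.

$$
y = H(u)\,v, \qquad H(u) = D_x^{N} S_h^{N} \cdots D_x^{2} S_h^{2}\, D_x^{1} S_h^{1}
$$

where $D_x^{i} = \operatorname{diag}(x^{i})$ depends on the input u through the projection $x^{i}$, and $(S_h^{i})_{ts} = h^{i}_{t-s}$ for $t \ge s$ and $0$ otherwise. For N = 2 the entries of the output are

$$
y_t = x^{2}_t \sum_{s \le t} h^{2}_{t-s}\, x^{1}_s \sum_{r \le s} h^{1}_{s-r}\, v_r
$$

so the weight that position t places on position r changes with the gate values along the way, even though the filters $h^{1}$ and $h^{2}$ are the same for every input.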

## What results did the original Hyena paper report?

The original Hyena paper presented experiments at three primary scales: synthetic recall and reasoning benchmarks, autoregressive language modeling on standard text datasets, and image classification on [ImageNet](/wiki/imagenet).

### Mechanistic recall benchmarks

The authors introduced a suite of synthetic tasks targeting capabilities such as associative recall, multi-query associative recall, and induction heads. On these tasks they compared Hyena against attention, several state-space models, and other sub-quadratic baselines at sequence lengths from a few hundred to hundreds of thousands of tokens. Hyena improved accuracy by more than 50 percentage points over the closest non-attention baselines and matched attention itself.[^1][^2] The authors argued that this performance differentiated Hyena from many sub-quadratic attempts that close the perplexity gap on bulk language modeling but fail to perform sharp targeted recall.

### Autoregressive language modeling

Hyena was trained as a [causal language model](/wiki/causal_language_model) on the WikiText-103 corpus and on [The Pile](/wiki/the_pile) at multiple scales up to roughly 1.3 billion parameters. On WikiText-103 the Hyena model matched transformer perplexity. On The Pile, the 1.3 billion parameter Hyena model reached approximately 10.8 perplexity after 5 billion training tokens, comparable to the GPTNeo-style transformer baselines at the same compute budget, while requiring about 20 percent less training compute at sequence length 2K.[^1][^12] The authors highlighted Hyena as the first dense, attention-free architecture to match transformer quality on these standard language modeling tasks without requiring a hybrid attention layer.[^1]

### Vision

In addition to language, the authors evaluated Hyena as a drop-in replacement for self-attention inside a [Vision Transformer (ViT)](/wiki/vision_transformer_vit) backbone, training from scratch on ImageNet-1k with about 88 million parameters in a Hyena-ViT-B model. The resulting model matched the accuracy of a comparably sized ViT-B baseline on ImageNet-1k, demonstrating that the operator generalizes beyond text.[^11][^1]

### Compute and wall-clock benchmarks

The paper reported wall-clock crossovers between Hyena and FlashAttention-based dense attention at approximately 6K tokens of sequence length, with the gap widening sharply at longer sequences. At 64K tokens, Hyena layers were reported as approximately 100 times faster than highly optimized attention layers.[^1][^11]

## How does Hyena compare to attention and other architectures?

Hyena sits at the intersection of several research threads that all attempt to escape the quadratic cost of self-attention. A short comparison illustrates the design space.

| Architecture | Year | Mixing primitive | Asymptotic cost | Key idea |
|---|---|---|---|---|
| Transformer attention | 2017 | $$\text{Softmax } QK^\top V$$ | O(L²) | Token-pair scores via dot product[^13] |
| Linear attention | 2020 | $$\text{Kernelized } QK^\top V$$ | O(L) | Replace softmax with a feature map[^7] |
| S4 / state-space models | 2021 to 2022 | Linear SSM | O(L log L) | Convolutional view of structured SSM[^8] |
| H3 | December 2022 | Two stacked SSMs plus gating | O(L log L) | Bridge SSM and attention via gates[^8] |
| Hyena | February 2023 | Implicit long conv plus gates | O(L log L) | Generalize H3 to arbitrary recurrence order[^1] |
| RetNet | July 2023 | Retention | O(L) | Parallel, recurrent, and chunkwise forms[^14] |
| [Mamba](/wiki/mamba) | December 2023 | Selective SSM | O(L) | Input-dependent SSM parameters[^15] |

[Attention](/wiki/attention) retains the strongest empirical performance on token-pair operations such as associative recall but at quadratic cost. Linear attention drops the softmax to obtain linear complexity but suffers a quality drop on language modeling. State-space models including [S4](/wiki/state_space_model) provide an alternative linear formulation through structured matrix decompositions but historically struggled with recall.[^8] H3 from Hazy Research showed that combining SSMs with multiplicative gating could narrow the recall gap. Hyena, in turn, generalized H3 by replacing the inner SSMs with implicit long convolutions and introducing an arbitrary recurrence order.[^1][^11] Subsequent work, particularly [Mamba](/wiki/mamba) and [Mamba 2](/wiki/mamba_2), pushed the SSM line further by making the state-space parameters input-dependent.[^15]

[RWKV](/wiki/rwkv) and [RetNet](/wiki/retnet) are recurrent variants that share Hyena's motivation of replacing attention but differ structurally: RWKV builds on a linear attention recurrence with time-mixing and channel-mixing blocks, while RetNet introduces a "retention" mechanism that admits parallel, recurrent, and chunkwise forms.[^14] Empirical comparisons reported in the literature show that on SuperGLUE-style downstream tasks RWKV is competitive on zero-shot accuracy while Hyena is stronger in few-shot regimes, and that all of these sub-quadratic alternatives lag dense transformers on tasks that require recalling rare or arbitrary information mentioned earlier in the prompt.[^7][^16] This recall gap was a central focus of subsequent Hazy Research papers including the Zoology benchmark suite.[^16]

A widely cited point of contrast is between Hyena and Mamba. Hyena uses an input-independent long convolution in each layer, evaluated via FFT, and obtains data dependence from input-dependent gating; Mamba instead uses an SSM whose A, B, C, and step-size parameters are themselves functions of the input, giving it a selective mechanism that does not require explicit gating outside the SSM.[^15] Both architectures achieve sub-quadratic compute and match transformer perplexity at small to mid scales, but they trade off differently between parallelizability, recurrence form, and hardware-friendly long-sequence inference.

## Why is it called the Hyena "Hierarchy"?

The paper title refers to a "hierarchy" because the recurrence stacks N long convolutions with intermediate elementwise gating, and the authors organized the operator into a family parameterized by N. Order-2 Hyena is the smallest non-trivial variant and recovers an H3-like structure when the convolutions are specialized to particular SSM forms. Higher orders allow the operator to express more complex compositional patterns at the cost of a larger constant in the O(N L log L) compute bound.[^11][^1] In practice the published Hyena and StripedHyena models use small N, typically 2 or 3, with the bulk of capacity coming from stacking many Hyena layers and from larger filter MLPs in γ_θ.[^4]
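The effect of the order N on the asymptotic bound can be illustrated with simple arithmetic. The sketch below compares the leading terms L² and N·L·log₂L only; it ignores constant factors, memory traffic, and kernel efficiency, so the ratios are not predictions of wall-clock speedup. The function name is illustrative.

```python
import math

def dense_vs_hyena_ratio(L, order):
    """Ratio of the leading terms L^2 and order * L * log2(L)."""
    return (L ** 2) / (order * L * math.log2(L))

# Leading-term ratios for orders 2 and 3 at two sequence lengths
ratios = {
    (L, order): dense_vs_hyena_ratio(L, order)
    for L in (8192, 65536)
    for order in (2, 3)
}
```

With N = 2 the ratio is about 315 at L = 8,192 and exactly 2,048 at L = 65,536. Raising N to 3 scales both by two thirds, which is the "larger constant" mentioned above.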

## How is Hyena implemented in software?

Reference implementations of Hyena were released by the Hazy Research group as part of the Safari repository on GitHub.[^17] The implementation relies on causal FFT-based convolutions and short depthwise convolutions for the input projections. Subsequent work on FFT efficiency, particularly FlashFFTConv from Daniel Y. Fu, Hermann Kumbong, Eric Nguyen, and Christopher Ré, used tensor cores to speed up exact FFT convolutions by up to 7.93 times over PyTorch and delivered up to 4.4 times end-to-end speedup on Hyena and related models.[^18] FlashFFTConv reported that, for the same compute budget, it allowed a small Hyena-GPT model to reach 2.3 points better perplexity, and it extended HyenaDNA to 4 million token sequence length, sufficient to embed the longest human genes at single-nucleotide resolution.[^18]

The "Laughing Hyena Distillery" paper by Stefano Massaroli, Michael Poli, and collaborators, presented at NeurIPS 2023, addressed the autoregressive inference cost of long-convolution architectures.[^19] Naive long-convolution inference is O(L) per token because each new token requires convolving against an L-length filter. The Laughing Hyena work extracted compact linear state-space recurrences from pre-trained Hyena convolution filters using rational interpolation and model-order reduction, achieving O(1) compute and memory cost per generated token and reporting 10 times higher throughput than transformers and 1.5 times higher than vanilla Hyena at the 1.3 billion parameter scale, without quality loss after distillation.[^19]
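The reason a state-space recurrence gives constant cost per token can be seen in a toy case. The sketch below does not implement the distillation procedure from the paper. It assumes a filter that is already an exact sum of d decaying exponentials, in which case a d-dimensional diagonal recurrence reproduces the convolution output exactly. All names and sizes are illustrative.

```python
import numpy as np

def filter_from_modes(c, lam, L):
    """Build h[t] = sum_k c[k] * lam[k]**t for t = 0..L-1."""
    t = np.arange(L)[:, None]
    return (lam[None, :] ** t) @ c

def decode_by_convolution(h, y):
    """Step t re-sums over the whole history: O(t) work per token."""
    return np.array(
        [np.dot(h[: t + 1][::-1], y[: t + 1]) for t in range(len(y))]
    )

def decode_by_recurrence(c, lam, y):
    """Step t updates a d-dimensional state: O(d) work, independent of t."""
    state = np.zeros_like(lam)
    out = np.empty(len(y))
    for t, y_t in enumerate(y):
        state = lam * state + y_t   # state[k] = sum_{s<=t} lam[k]**(t-s) * y[s]
        out[t] = np.dot(c, state)
    return out

rng = np.random.default_rng(0)
L, d = 64, 4
lam = rng.uniform(0.5, 0.95, size=d)   # decay rates of the assumed modes
c = rng.standard_normal(d)
y_in = rng.standard_normal(L)

h = filter_from_modes(c, lam, L)
out_conv = decode_by_convolution(h, y_in)
out_rec = decode_by_recurrence(c, lam, y_in)
```

The two outputs agree up to floating-point error. For a learned filter that is not exactly of this form, a low-order recurrence is only an approximation, which is where the interpolation and model-order reduction steps described above come in.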

## What are the successor models to Hyena (StripedHyena, Evo)?

### StripedHyena

StripedHyena, released by [Together AI](/wiki/together_ai) on 8 December 2023, was the first openly published seven-billion-parameter language model built around the Hyena operator.[^4][^20] The architecture is explicitly hybrid: Hyena layers (multi-head gated convolutions) carry the bulk of sequence mixing, while a small fraction of layers are conventional grouped or rotary attention layers responsible for targeted pattern recall.[^4] Two variants were released, StripedHyena-Hessian-7B (the base model) and StripedHyena-Nous-7B (a chat model fine-tuned in collaboration with [Nous Research](/wiki/nous_research)). StripedHyena was trained on the RedPajama dataset augmented with longer-context data and supports a 32K context window in its released base configuration, with internal scaling tests reaching 128K.[^4][^20]

Together AI reported that StripedHyena-Hessian-7B was competitive with strong open transformer baselines on the OpenLLM leaderboard and outperformed Llama-2 7B, Yi 7B, and RWKV 14B on several short-context tasks including ARC-Challenge, HellaSwag, WinoGrande, BoolQ, and TruthfulQA.[^4][^20] On the long-context ZeroScrolls benchmark, StripedHyena-7B exceeded a Mistral 7B baseline on GovReport F1 (27.9 versus 17.5) and on NarrativeQA F1 (25.8 versus 24.7), with a smaller gap on Qasper F1.[^4] On training throughput, the model was reported to be more than 30 percent faster than an optimized transformer with FlashAttention v2 at 32K sequence length, more than 50 percent faster at 64K, and more than 100 percent faster at 128K. Inference caches for autoregressive generation were reported to be more than 50 percent smaller than for a grouped-query attention transformer of equivalent quality.[^4]

### Evo and Evo 2

The Hyena and StripedHyena architectures were applied to biological sequence modeling in a series of influential papers from Stanford, the Arc Institute, and collaborators: HyenaDNA, Evo, and Evo 2.

The HyenaDNA paper, posted to arXiv as 2306.15794 in June 2023 and presented at NeurIPS 2023 as a spotlight, was the first to use Hyena for genomic foundation modeling.[^21] HyenaDNA modeled human genomic DNA at single-nucleotide resolution with context windows of up to one million tokens, an increase the authors describe as up to 500 times over earlier dense-attention genomic models, whose contexts were four thousand tokens or fewer, and reported up to 160 times faster training than transformers using FlashAttention.[^21] HyenaDNA achieved state-of-the-art results on 12 of 18 Nucleotide Transformer benchmarks and exceeded baselines on 7 of 8 GenomicBenchmarks tasks by roughly 10 accuracy points on average, while using substantially fewer parameters than competing genomic models. The paper also demonstrated the first use of in-context learning in genomics, showing that a single pre-trained model could adapt to novel tasks without weight updates.[^21]

The Evo model, described in "Sequence modeling and design from molecular to genome scale with Evo" by Eric Nguyen, Brian L. Hie, and collaborators, was posted as a preprint on 27 February 2024 and published in Science in November 2024.[^5][^22] Evo uses the StripedHyena architecture with a hybrid of 29 Hyena layers and 3 multi-head attention layers (roughly a 10 percent attention fraction), trained on the OpenGenome dataset of approximately 2.7 million prokaryotic and bacteriophage genomes totaling about 300 billion tokens, with a 131-kilobase context length.[^5][^22] The model has 7 billion parameters and is capable of both predictive and generative tasks across molecular and whole-genome scales, including generating coherent sequences longer than 650 kilobases.[^5]

Evo 2, released as a preprint on 19 February 2025 and later published in Nature, was developed by Arc Institute, Stanford University, UC Berkeley, UC San Francisco, and NVIDIA.[^6] Evo 2 scales the StripedHyena-style architecture to 40 billion parameters and a 1-megabase context window, with training over more than 9 trillion nucleotides drawn from more than 100,000 species across the entire tree of life. Reported applications include 90 percent accurate prediction of the functional impact of previously unrecognized BRCA1 mutations, generation of full bacterial-genome-scale sequences, and integration into the NVIDIA BioNeMo framework.[^6][^23]

### Other Hyena variants

A number of additional Hyena variants and extensions have appeared. Multi-Dimensional Hyena adapted the operator to two-dimensional inputs for image classification, and HyenaPixel, a later 2D extension, reported competitive ImageNet-1k top-1 accuracies (84.9 percent and 85.2 percent for variants) while outperforming several large-kernel convolutional baselines.[^24] A Hyena neural operator for partial differential equations applied the same convolution-plus-gating structure to physical simulation tasks.[^25] Scavenging Hyena explored distilling pre-trained transformers into long-convolution architectures, complementing the Laughing Hyena work on extracting recurrences after the fact.[^26]

## What is Hyena used for?

Hyena and its descendants have been deployed across several domains, with the strongest impact in long-context language modeling and genomics.

### Language modeling

StripedHyena-7B served as a public demonstration that a hybrid Hyena-plus-attention architecture could match seven-billion-parameter transformer language models on standard benchmarks while delivering substantial speedups at long context.[^4] The model and its weights were released openly under permissive licenses, allowing third parties to fine-tune the architecture and to use it as a backbone for downstream applications such as chat (StripedHyena-Nous-7B).[^20]

### Genomic foundation models

The most consequential application has been in computational biology. Standard transformer-based genomic models such as the Nucleotide Transformer relied on context windows of four thousand tokens or fewer and on k-mer tokenization, which discarded single-nucleotide resolution.[^21] The sub-quadratic compute of Hyena made it feasible to train on context windows of one million tokens, sufficient to span whole bacterial genomes or large eukaryotic loci, while preserving the single-nucleotide resolution needed for variant-effect prediction. HyenaDNA, Evo, and Evo 2 collectively demonstrated that this combination of long context, single-nucleotide resolution, and a [foundation model](/wiki/foundation_model) training recipe enabled new capabilities, including zero-shot variant pathogenicity prediction, generation of regulatory elements, and design of synthetic genomes.[^21][^5][^6]

### Vision and multimodal models

The original Hyena paper showed that the operator could substitute for attention in a Vision Transformer trained from scratch on ImageNet-1k.[^1] Subsequent work, including Multi-Dimensional Hyena and HyenaPixel, extended this to two-dimensional spatial mixing and reported competitive accuracy on image classification benchmarks with favorable memory scaling in the number of image patches.[^24] Hyena-style operators have also been used in speech and audio research, where long context is valuable.[^27]

## What are the limitations of Hyena?

Despite strong empirical results, Hyena and other sub-quadratic alternatives have several documented weaknesses.

### Recall gap

The most discussed limitation is that sub-quadratic architectures, including Hyena, RWKV, and earlier SSMs, lag dense attention on tasks that require precise associative recall of rare or arbitrary tokens earlier in the context. The Zoology benchmark suite from Hazy Research, presented in late 2023, isolated this recall capability and showed that pure long-convolution models including Hyena underperformed attention by measurable margins on synthetic recall problems even when overall language modeling perplexity was close.[^16] This finding motivated the explicit hybridization in StripedHyena, where roughly 10 percent of layers retain attention to handle targeted recall.[^4][^16]

### Inference cost without distillation

Vanilla Hyena layers, evaluated through FFT convolution, are sub-quadratic for the full sequence but require O(L) work per generated token during autoregressive decoding because each new token must be convolved against a filter of length up to L. This is in contrast to recurrent architectures and Mamba-style selective SSMs, which admit constant-time-per-token decoding by maintaining a fixed-size state. The Laughing Hyena Distillery paper from late 2023 addressed this gap by distilling each Hyena convolution into a compact state-space recurrence for inference time, but at the cost of an additional post-training step.[^19]

### Ecosystem maturity

As of 2026 the open ecosystem around the [Transformer](/wiki/transformer) remains substantially more mature than that around Hyena and its successors, with a much larger inventory of pretrained checkpoints, fine-tuning toolkits, and downstream alignment recipes. While public weights exist for StripedHyena, Evo, Evo 2, and HyenaDNA, the set of open-weight Hyena language models at seven billion parameters and above is small compared to the transformer ecosystem.[^4][^5]

### Comparisons with Mamba

In direct head-to-head comparisons with [Mamba](/wiki/mamba) on long-sequence DNA modeling, the Mamba paper reported that Mamba's perplexity improved with longer context lengths up to 1 million tokens while a HyenaDNA baseline degraded with sequence length on the same task.[^15] Authors of subsequent biology papers have reported varied outcomes depending on the task: some find that Mamba-style selective SSMs and attention transformers outperform Hyena reimplementations on RNA-seq prediction, while others find Hyena and StripedHyena competitive on whole-genome generation.[^28] The relative ranking of Hyena, Mamba, and attention remains task-dependent.

### Sensitivity of the implicit filter parameterization

Implicit filter parameterization is sensitive to the choice of window function, positional encoding into γ_θ, and initialization. The original paper and subsequent follow-ups documented several engineering details (such as exponential decay windows and modulated sinusoidal positional inputs) that materially affect convergence; replication efforts have noted that these details are nontrivial to get right.[^1][^11]

## Who created Hyena, and who maintains it?

Hyena was developed within the broader research program of the Hazy Research group, a Stanford computer science research group led by Christopher Ré. The group's stated research agenda includes both foundational sequence modeling and systems for efficient machine learning, and it has produced several closely related lines of work over a multi-year period, including FlashAttention (Tri Dao and Daniel Y. Fu), [S4](/wiki/state_space_model) and related state-space models, H3, Hyena, FlashFFTConv, and the Zoology benchmarks for sub-quadratic recall.[^3][^8][^18][^16]

Commercialization of Hyena and its successors has occurred primarily through [Together AI](/wiki/together_ai), a company that has employed several of the original Hyena authors including Michael Poli and Tri Dao and has published StripedHyena, Evo (via a partnership with the Arc Institute), and related kernels under open licenses on its company GitHub.[^10][^4][^5] Other authors have moved to related organizations: Stefano Massaroli has been associated with [Liquid AI](/wiki/liquid_ai), a company building alternative sequence model architectures, while [Yoshua Bengio](/wiki/yoshua_bengio) continues to lead Mila in Montreal.[^10]

## Comparison Table of Empirical Results

The following table summarizes published empirical claims for Hyena and closely related work, at scales ranging from the layer-level operator comparisons of the original Hyena paper to the 40-billion-parameter Evo 2 model.

| Model or operator | Year | Scale | Reported claim | Source |
|---|---|---|---|---|
| Hyena (operator) | 2023 | up to 1.3B | Matches transformer on WikiText-103 and The Pile with about 20 percent less training compute at 2K[^1] | Poli et al. |
| Hyena vs FlashAttention | 2023 | layer-level | About 2x faster at 8K, about 100x faster at 64K[^1] | Poli et al. |
| StripedHyena-7B | December 2023 | 7B | Outperforms Llama-2 7B on several OpenLLM tasks; more than 100 percent faster at 128K training[^4] | Together AI |
| Evo 1 (StripedHyena) | 2024 | 7B | 131K context; generates more than 650K tokens; near-linear scaling of compute[^5] | Nguyen, Hie et al. |
| HyenaDNA | 2023 | up to 1M-token context | State-of-the-art on 12 of 18 Nucleotide Transformer tasks; up to 160x faster training than transformers with FlashAttention[^21] | Nguyen et al. |
| Evo 2 | 2025 | 40B | 1-megabase context; 9 trillion training nucleotides; predicts BRCA1 mutation effects with 90 percent accuracy[^6] | Arc Institute |
| Laughing Hyena Distillery | NeurIPS 2023 | 1.3B | 10x throughput vs transformer, 1.5x vs Hyena, no quality loss[^19] | Massaroli, Poli et al. |
| FlashFFTConv on Hyena | November 2023 | small Hyena-GPT | 2.3 points better perplexity at fixed compute; 4M-token HyenaDNA[^18] | Fu et al. |

## See also

- [attention](/wiki/attention)
- [transformer](/wiki/transformer)
- [self attention](/wiki/self_attention)
- [mamba](/wiki/mamba)
- [mamba 2](/wiki/mamba_2)
- [rwkv](/wiki/rwkv)
- [rwkv 7](/wiki/rwkv_7)
- [retnet](/wiki/retnet)
- [state space model](/wiki/state_space_model)
- [flash attention](/wiki/flash_attention)
- [flash attention 3](/wiki/flash_attention_3)
- [together ai](/wiki/together_ai)
- [stanford university](/wiki/stanford_university)
- [nous research](/wiki/nous_research)
- [liquid ai](/wiki/liquid_ai)
- [tri dao](/wiki/tri_dao)
- [yoshua bengio](/wiki/yoshua_bengio)
- [icml](/wiki/icml)
- [the pile](/wiki/the_pile)
- [imagenet](/wiki/imagenet)
- [vision transformer vit](/wiki/vision_transformer_vit)
- [foundation model](/wiki/foundation_model)
- [in-context learning](/wiki/in-context_learning)
- [context window](/wiki/context_window)
- [positional encoding](/wiki/positional_encoding)
- [convolution](/wiki/convolution)

## References

[^1]: Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré, "Hyena Hierarchy: Towards Larger Convolutional Language Models", arXiv, 2023-02-21. https://arxiv.org/abs/2302.10866. Accessed 2026-05-20.
[^2]: Michael Poli et al., "Hyena Hierarchy: Towards Larger Convolutional Language Models", Hazy Research blog, Stanford, 2023-03-07. https://hazyresearch.stanford.edu/blog/2023-03-07-hyena. Accessed 2026-05-20.
[^3]: Hazy Research, "Hazy Research lab homepage", Stanford University, 2024. https://hazyresearch.stanford.edu/. Accessed 2026-05-20.
[^4]: Together AI, "Paving the way to efficient architectures: StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers", Together AI Blog, 2023-12-08. https://www.together.ai/blog/stripedhyena-7b. Accessed 2026-05-20.
[^5]: Eric Nguyen, Michael Poli, Matthew G. Durrant, Brian Kang, Brian L. Hie et al., "Sequence modeling and design from molecular to genome scale with Evo", Science 386(6723), eado9336, 2024-11-14. https://www.science.org/doi/10.1126/science.ado9336. Accessed 2026-05-20.
[^6]: Arc Institute, "Evo 2: DNA Foundation Model", Arc Institute, 2025-02-19. https://arcinstitute.org/tools/evo. Accessed 2026-05-20.
[^7]: Eren Gölge, "Exploring Beyond Regular Transformers", Machine Learns, 2023-10-01. https://erogol.com/2023/10/01/transformer-alternatives. Accessed 2026-05-20.
[^8]: Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré, "Hungry Hungry Hippos: Towards Language Modeling with State Space Models", arXiv, 2022-12-28. https://arxiv.org/abs/2212.14052. Accessed 2026-05-20.
[^9]: Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré, "Hyena Hierarchy: Towards Larger Convolutional Language Models", Proceedings of the 40th International Conference on Machine Learning (ICML 2023), PMLR vol. 202, 2023. https://proceedings.mlr.press/v202/poli23a.html. Accessed 2026-05-20.
[^10]: Michael Poli, "Michael Poli profile", Google Scholar, 2024. https://scholar.google.com/citations?hl=en&user=RgIBwboAAAAJ. Accessed 2026-05-20.
[^11]: Hugging Face, "Hyena", Hugging Face Computer Vision Course, Unit 13, 2024. https://huggingface.co/learn/computer-vision-course/unit13/hyena. Accessed 2026-05-20.
[^12]: Andrew Lukyanenko, "Paper review: Hyena Hierarchy: Towards Larger Convolutional Language Models", Medium, 2023-03-13. https://artgor.medium.com/paper-review-hyena-hierarchy-towards-larger-convolutional-language-models-e56b55232800. Accessed 2026-05-20.
[^13]: Ashish Vaswani et al., "Attention Is All You Need", arXiv, 2017-06-12. https://arxiv.org/abs/1706.03762. Accessed 2026-05-20.
[^14]: Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei, "Retentive Network: A Successor to Transformer for Large Language Models", arXiv, 2023-07-17. https://arxiv.org/abs/2307.08621. Accessed 2026-05-20.
[^15]: Albert Gu, Tri Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces", arXiv, 2023-12-01. https://arxiv.org/abs/2312.00752. Accessed 2026-05-20.
[^16]: Hazy Research, "Zoology (Blogpost 1): Measuring and Improving Recall in Efficient Language Models", Stanford, 2023-12-11. https://hazyresearch.stanford.edu/blog/2023-12-11-zoology1-analysis. Accessed 2026-05-20.
[^17]: HazyResearch, "safari: Convolutions for Sequence Modeling", GitHub, 2023. https://github.com/HazyResearch/safari. Accessed 2026-05-20.
[^18]: Daniel Y. Fu, Hermann Kumbong, Eric Nguyen, Christopher Ré, "FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores", arXiv, 2023-11-10. https://arxiv.org/abs/2311.05908. Accessed 2026-05-20.
[^19]: Stefano Massaroli, Michael Poli, Daniel Y. Fu, Hermann Kumbong, Rom N. Parnichkun, Aman Timalsina, David W. Romero, Quinn McIntyre, Beidi Chen, Atri Rudra, Ce Zhang, Christopher Ré, Stefano Ermon, Yoshua Bengio, "Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions", arXiv, 2023-10-28. https://arxiv.org/abs/2310.18780. Accessed 2026-05-20.
[^20]: Together AI, "togethercomputer/stripedhyena", GitHub repository, 2023. https://github.com/togethercomputer/stripedhyena. Accessed 2026-05-20.
[^21]: Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Callum Birch-Sykes, Michael Wornow, Aman Patel, Clayton Rabideau, Stefano Massaroli, Yoshua Bengio, Stefano Ermon, Stephen A. Baccus, Christopher Ré, "HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution", arXiv, 2023-06-27. https://arxiv.org/abs/2306.15794. Accessed 2026-05-20.
[^22]: Eric Nguyen et al., "Sequence modeling and design from molecular to genome scale with Evo", bioRxiv preprint, 2024-02-27. https://www.biorxiv.org/content/10.1101/2024.02.27.582234v1. Accessed 2026-05-20.
[^23]: Arc Institute, "AI can now model and design the genetic code for all domains of life with Evo 2", Arc Institute News, 2025-02-19. https://arcinstitute.org/news/evo2. Accessed 2026-05-20.
[^24]: Julian Spravil, Sebastian Houben, "HyenaPixel: Global Image Context with Convolutions", arXiv, 2024-02-29. https://arxiv.org/abs/2402.19305. Accessed 2026-05-20.
[^25]: Saurabh Patil, Zijie Li, Amir Barati Farimani, "Hyena Neural Operator for Partial Differential Equations", arXiv, 2023-06-28. https://arxiv.org/abs/2306.16524. Accessed 2026-05-20.
[^26]: Tokiniaina Raharison Ralambomihanta, Shahrad Mohammadzadeh, Mohammad Sami Nur Islam, Wassim Jabbour, Laurence Liang, "Scavenging Hyena: Distilling Transformers into Long Convolution Models", arXiv, 2024-01-31. https://arxiv.org/abs/2401.17574. Accessed 2026-05-20.
[^27]: Marco Gaido, Sara Papi, Matteo Negri, Marco Turchi, Luisa Bentivogli, "How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena", arXiv, 2024-02-20. https://arxiv.org/abs/2402.13208. Accessed 2026-05-20.
[^28]: "Selective State Space Models Outperform Transformers at Predicting RNA-Seq Read Coverage", bioRxiv preprint, 2025-02-13. https://www.biorxiv.org/content/10.1101/2025.02.13.638190v1. Accessed 2026-05-20.