H-Net (dynamic chunking)

Updated Jun 8, 2026

H-Net, short for Hierarchical Network, is a tokenizer-free neural sequence model that learns to segment raw bytes into content-adaptive "chunks" as part of ordinary end-to-end training, rather than relying on a fixed, hand-designed tokenization step such as byte-pair encoding (BPE). It was introduced in the paper "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling," posted to arXiv on July 10, 2025 by Sukjun Hwang, Brandon Wang, and Albert Gu. Gu is an assistant professor of machine learning at Carnegie Mellon University, a co-founder and chief scientist of the AI startup Cartesia, and a co-inventor (with Tri Dao) of the Mamba architecture. The central contribution is a "dynamic chunking" (DC) mechanism that places chunk boundaries using a learned predictor and routes information through a U-Net-like hierarchy, so that a single network operating on raw UTF-8 bytes can match or beat a strong BPE-tokenized Transformer at equal compute. Reference code is released as the goombalab/hnet repository and pretrained checkpoints are distributed by Cartesia. ^[1]^[2]

Overview

Almost every modern large language model begins with a tokenizer, a separate, non-learned program that chops text into subword units before the neural network ever sees it. H-Net removes that program. It reads a sequence of bytes, uses small neural networks to decide on the fly where one chunk should end and the next begin, compresses the byte stream into a shorter sequence of chunks, performs the bulk of its computation there, and then expands back to bytes to make predictions. Because the boundary decisions are produced by differentiable modules, the whole system, including the segmentation policy itself, is trained jointly by gradient descent against the ordinary next-byte prediction objective. The authors report that a two-stage H-Net surpasses a compute-matched BPE Transformer after only about 30 billion training bytes, with the performance gap widening thereafter. ^[1]

Background: the tokenization problem

A tokenizer such as BPE is built by a frequency-counting algorithm run once over a corpus, producing a fixed vocabulary of subword pieces that is then frozen. This preprocessing has several well-known drawbacks that the paper sets out to remove. It is hand-designed and not learned, so it cannot adapt to the model or the task. It is brittle outside the data it was tuned on: vocabularies built mainly for English waste many tokens on other scripts, and languages such as Chinese, Japanese, or Thai that do not separate words with spaces are segmented poorly because BPE leans heavily on whitespace as an implicit boundary cue. It makes models fragile with respect to spelling, character-level manipulation, and orthographic noise, since changing one character can alter the entire downstream token sequence. And it is awkward or counterproductive for modalities that have no natural notion of a word at all, such as source code, raw audio, or DNA. ^[1]

The obvious alternative, modeling raw bytes directly, removes the tokenizer but creates a different problem: byte sequences are roughly four to five times longer than their tokenized counterparts, and because the cost of self-attention grows quadratically with sequence length, naive byte-level models are far more expensive and had generally been unable to match tokenized models at scale. The question H-Net poses is whether the compression that tokenization provides can instead be learned inside the network, keeping the efficiency of short sequences while discarding the hand-built vocabulary. ^[1]
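The cost argument can be made concrete with a little arithmetic. The sketch below assumes a purely quadratic cost model for self-attention and ignores the linear-cost parts of a Transformer layer; the function name is illustrative and not taken from the paper.

```python
def attention_cost_ratio(length_ratio: float) -> float:
    """Relative self-attention cost when a sequence grows by `length_ratio`.

    Assumption: cost scales with the square of sequence length; the
    linear-cost parts of a layer (projections, feedforward) are ignored.
    """
    return length_ratio ** 2


# Byte sequences are roughly four to five times longer than token sequences.
for ratio in (4.0, 5.0):
    cost = attention_cost_ratio(ratio)
    print(f"{ratio:.0f}x longer sequence -> {cost:.0f}x attention cost")
```

Under this simplified model, moving from tokens to raw bytes multiplies the attention cost by roughly 16 to 25, which is the gap that learned compression is meant to close.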

How H-Net works

Hierarchical architecture

H-Net borrows the encoder-decoder shape of a U-Net. The basic single-stage pipeline has three parts. First, a lightweight encoder network reads the raw byte embeddings at full resolution. Second, a dynamic chunking module compresses that sequence by selecting a subset of positions as chunk boundaries, and a "main" network performs the heavy modeling on this much shorter chunk sequence. Third, the chunk representations are upsampled back to byte resolution and passed through a small decoder network that produces the per-byte outputs, with residual connections carrying fine-grained information from the encoder around the compressed middle. ^[1]
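The data flow of a single stage can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the function name `hnet_stage` and its callable arguments are hypothetical stand-ins for the real networks, the smoothing and confidence weighting of dynamic chunking are omitted, and forcing the first position to open a chunk is an assumption made so that every byte has a chunk to read from.

```python
from typing import Callable, List

Vector = List[float]
Sequence = List[Vector]


def hnet_stage(
    x: Sequence,
    encoder: Callable[[Sequence], Sequence],
    find_boundaries: Callable[[Sequence], List[bool]],
    main: Callable[[Sequence], Sequence],
    decoder: Callable[[Sequence], Sequence],
) -> Sequence:
    """One stage: encode, downsample to chunks, model, upsample, decode."""
    h = encoder(x)                            # full byte resolution
    boundaries = list(find_boundaries(h))     # one flag per byte position
    boundaries[0] = True                      # assumption: first byte opens a chunk
    chunks = [v for v, b in zip(h, boundaries) if b]  # keep boundary positions only
    z = main(chunks)                          # heavy computation on the short sequence

    # Upsample: every byte position reads the output of its most recent chunk.
    upsampled, idx = [], -1
    for b in boundaries:
        idx += b
        upsampled.append(z[idx])

    # Residual connection: encoder detail is carried around the compressed middle.
    merged = [[u + r for u, r in zip(uv, hv)] for uv, hv in zip(upsampled, h)]
    return decoder(merged)


# Toy usage with stand-in "networks" (identity encoder/decoder, scaling main).
identity = lambda s: s
out = hnet_stage(
    [[1.0], [2.0], [3.0], [4.0]],
    encoder=identity,
    find_boundaries=lambda h: [True, False, True, False],
    main=lambda c: [[10 * v[0]] for v in c],
    decoder=identity,
)
print(out)  # [[11.0], [12.0], [33.0], [34.0]]
```

Because `main` is just a callable on a shorter sequence, it could itself be another call to `hnet_stage`, which is how a multi-stage hierarchy would nest in this sketch.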

A key design choice concerns which neural primitive is used where. The outer encoder and decoder, which must handle long, fine-grained byte sequences, are built from Mamba-2 state space model (SSM) layers, because SSMs process long sequences in linear time and have proven strong on fine-grained data such as audio and DNA. The inner main network, which operates on the compressed, more abstract chunk sequence that behaves much like a sequence of tokens, uses standard Transformer layers with gated feedforward blocks, where self-attention is affordable and most of the model's parameters are concentrated. The model dimension grows from the outer byte level inward to the chunk level. ^[1]

The architecture is recursive: the main network at the center can itself be another full H-Net rather than a plain Transformer, producing two or more nested stages of compression. The paper studies one-stage and two-stage models. A one-stage model targeting a compression ratio of six (denoted 6-DC) ends up using about 4.8 bytes per chunk in practice; a two-stage model targeting three at each level (3,3) reaches about 6.9 bytes per chunk, with the first stage learning roughly word-like units and the second composing them into larger spans. This compounding lets deeper hierarchies model several levels of abstraction at once. ^[1]

Dynamic chunking

Dynamic chunking is the mechanism that decides where chunk boundaries fall, and it is what makes the compression learnable. It has two cooperating parts, a routing module and a smoothing module. ^[1]

The routing module is a boundary predictor. For each position it projects the encoder output into a query vector q_t and a key vector k_t and compares each position with its predecessor using cosine similarity. The boundary probability is p_t = 0.5 * (1 - cos(q_t, k_{t-1})), which is large when adjacent byte representations are dissimilar and small when they are similar. A boundary is placed wherever p_t is at least 0.5, equivalently wherever the cosine similarity between neighbors falls to zero or below. The inductive bias is intuitive: a chunk boundary should fall where the content changes sharply, for example between a word and the punctuation that follows it. The downsampler then keeps only the representations at boundary positions and discards the rest, yielding the compressed sequence. ^[1]
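The routing rule above translates almost directly into code. The following is a minimal pure-Python sketch: in the real model `q` and `k` would be learned projections of the encoder output, whereas the toy usage simply passes the same vectors for both. Fixing the first probability to 1.0 (so the sequence always opens with a boundary) and the helper names are assumptions of this sketch.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def boundary_probs(q: List[Vector], k: List[Vector]) -> List[float]:
    """p_t = 0.5 * (1 - cos(q_t, k_{t-1})), with the first p fixed to 1.0."""
    probs = [1.0]  # assumption: the first position always starts a chunk
    for t in range(1, len(q)):
        probs.append(0.5 * (1.0 - cosine(q[t], k[t - 1])))
    return probs


def downsample(
    h: List[Vector], probs: List[float], threshold: float = 0.5
) -> Tuple[List[Vector], List[bool]]:
    """Keep only the representations at boundary positions."""
    mask = [p >= threshold for p in probs]
    return [v for v, keep in zip(h, mask) if keep], mask


# Toy encoder outputs: similar neighbours, then an orthogonal and an opposite one.
h = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, -1.0]]
probs = boundary_probs(h, h)
chunks, mask = downsample(h, probs)
print(probs)  # [1.0, 0.0, 0.5, 0.0, 1.0]
print(mask)   # [True, False, True, False, True]
```

Identical neighbours give a probability of 0, orthogonal neighbours sit exactly on the 0.5 threshold, and opposite neighbours give 1, so three of the five positions survive downsampling here.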

The difficulty is that selecting boundaries is a discrete, non-differentiable operation, so gradients cannot flow through it directly. H-Net solves this without reinforcement learning or auxiliary discrete losses. On the way back up, before the chunk vectors are expanded, the smoothing module (part of the "dechunking" operation) replaces the hard selection with a differentiable exponential moving average: each smoothed chunk vector is a confidence-weighted blend of the form z_t = P_t * z_hat_t + (1 - P_t) * z_{t-1}, where z_hat_t is the main network's output for chunk t, z_{t-1} is the previous smoothed vector, and P_t is the boundary probability of that chunk. A confidently chosen boundary (P_t close to 1) therefore stays sharp, while an uncertain one (P_t near 0.5) is interpolated with the previous chunk. The upsampler then repeats each chunk vector across the byte positions it covers and multiplies by a confidence score, using a straight-through estimator so the discrete routing decision still passes gradients during the backward pass. Finally, to keep the network from collapsing to a trivial solution (a boundary at every byte, or none at all), a "ratio loss" gently pushes the average compression toward a target ratio N without forcing any particular segmentation. Together these pieces make the boundary predictor trainable purely from the end-to-end language-modeling signal. ^[1]
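The smoothing, upsampling, and ratio-loss pieces can be sketched as follows. This is a forward-pass illustration in plain Python, so no gradients are actually computed. The confidence definition and the ratio-loss expression follow a load-balancing form as best recalled from the paper, and should be treated as assumptions to check against the source; all function names are hypothetical.

```python
from typing import List

Vector = List[float]


def smooth(z_hat: List[Vector], chunk_probs: List[float]) -> List[Vector]:
    """EMA over chunks: z_t = P_t * z_hat_t + (1 - P_t) * z_{t-1}."""
    out: List[Vector] = []
    for t, (z, p) in enumerate(zip(z_hat, chunk_probs)):
        if t == 0:
            out.append(list(z))  # first chunk has no predecessor to blend with
        else:
            out.append([p * a + (1.0 - p) * b for a, b in zip(z, out[-1])])
    return out


def upsample(z: List[Vector], mask: List[bool], probs: List[float]) -> List[Vector]:
    """Repeat each chunk vector over the byte positions it covers."""
    out, idx = [], -1
    for b, p in zip(mask, probs):
        idx += b
        c = p if b else 1.0 - p  # assumed confidence in the decision taken
        # Straight-through estimator: the forward value is 1, while an
        # autodiff framework would write c + stop_gradient(1 - c) so that
        # gradients still reach the boundary probability through c.
        ste = c + (1.0 - c)
        out.append([ste * v for v in z[idx]])
    return out


def ratio_loss(mask: List[bool], probs: List[float], target_ratio: float) -> float:
    """Assumed form: N/(N-1) * ((N-1)*F*G + (1-F)*(1-G)).

    F is the fraction of positions selected, G the mean boundary probability.
    """
    n = target_ratio
    f = sum(mask) / len(mask)
    g = sum(probs) / len(probs)
    return n / (n - 1.0) * ((n - 1.0) * f * g + (1.0 - f) * (1.0 - g))


mask = [True, False, True, False]
probs = [1.0, 0.1, 0.8, 0.2]
chunk_probs = [p for p, b in zip(probs, mask) if b]  # probabilities of kept positions
smoothed = smooth([[1.0], [3.0]], chunk_probs)       # second chunk blends toward the first
per_byte = upsample(smoothed, mask, probs)           # back to four byte positions
loss = ratio_loss(mask, probs, target_ratio=2.0)
```

In this form the loss reaches its minimum value of 1 when both the selected fraction and the mean probability equal 1/N, and it grows as the router drifts toward selecting every position, which is the gentle pressure toward the target ratio that the text describes.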

Training the chunker end to end

Because the encoder, the chunker, the main network, the decoder, and the ratio loss are all differentiable and trained together, H-Net learns its segmentation policy from scratch with no tokenizer, no vocabulary, and no boundary labels. The authors add several techniques to make the deep, multi-resolution stack train stably: normalization layers that balance the magnitudes of the encoder, main, and decoder signals so the deep inner network does not drown out fine-grained outer information; a separation of the residual and compression "streams" leaving the encoder; and higher learning rates for the longer-sequence outer stages, following maximal-update-parameterization (muP) reasoning. Everything is causal, so the model remains a valid autoregressive predictor; at inference the main network is invoked only when the router actually emits a boundary, which the authors compare to a form of speculative decoding. Inspecting trained models shows the router learns sensible, unsupervised boundaries, placing them at whitespace and grouping characters into word-like and phrase-like units even in text written without spaces. ^[1]

Results

H-Net was evaluated mainly as a language model on the FineWeb-Edu corpus at two scales, called Large (about 760 million parameters) and XL (about 1.3 billion parameters), each trained on roughly 100 billion bytes and compared against BPE-tokenized Transformer baselines matched for compute and data. Because one model sees tokens and the other sees bytes, quality is measured in bits per byte (BPB), a tokenizer-independent compression metric for which lower is better. The two-stage models, which spend their parameters on a deeper hierarchy, give the strongest results. ^[1]
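Bits per byte is straightforward to compute from a model's summed negative log-likelihood, which is what makes it comparable across tokenized and byte-level models. The sketch below assumes losses are measured in nats (natural logarithm); the numbers in the usage lines are hypothetical and chosen only for illustration.

```python
import math


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    The divisor is always the byte length of the evaluated text, no matter
    how many tokens or chunks the model split it into.
    """
    return total_nll_nats / (math.log(2) * num_bytes)


# Hypothetical numbers, for illustration only.
# Tokenized model: 1,000 tokens at a mean loss of 2.3 nats, covering 4,500 bytes.
token_model_bpb = bits_per_byte(1000 * 2.3, 4500)
# Byte-level model: the same 4,500 bytes at a mean loss of 0.5 nats per byte.
byte_model_bpb = bits_per_byte(4500 * 0.5, 4500)

print(f"tokenized: {token_model_bpb:.3f} BPB, byte-level: {byte_model_bpb:.3f} BPB")
```

Normalizing by bytes rather than by prediction steps removes the advantage a model would otherwise gain simply by making fewer, larger predictions.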

Model	Scale	Bits per byte (lower is better)	Zero-shot downstream average
Transformer (BPE)	Large, about 760M	0.756	53.3%
H-Net, 1-stage	Large	0.755	-
H-Net, 2-stage	Large	0.743	55.5%
Transformer (BPE)	XL, about 1.3B	0.730	55.5%
H-Net, 1-stage	XL	0.728	-
H-Net, 2-stage	XL	0.715	58.2%

Beyond bits per byte, the two-stage H-Net improved average zero-shot accuracy across a standard suite (LAMBADA, HellaSwag, PIQA, ARC, WinoGrande, OpenBookQA) and was markedly more robust to character-level corruption: on a perturbed version of HellaSwag the Large two-stage model scored about 39 against roughly 20 for the tokenized Transformer. The advantages were largest exactly where tokenization is weakest. On Chinese, the XL two-stage H-Net reached 66.3 on XWinograd-zh versus 59.9 for the tokenized baseline, and it showed comparable gains on code. On DNA language modeling, where there is no meaningful tokenizer at all, H-Net achieved nearly four times (about 3.6x) the data efficiency of an isotropic, non-hierarchical byte baseline. ^[1]

Relationship to other byte-level methods

H-Net sits at the convergence of two research threads: state space model sequence mixers and tokenizer-free, byte-level modeling. Its use of Mamba-2 layers in the encoder and decoder builds directly on Mamba, the selective SSM that processes long sequences in linear time and is well suited to the long byte streams H-Net must handle. Several earlier byte-level architectures share H-Net's goal of escaping tokenization but differ in how, or whether, they learn to segment.

MegaByte (Meta, 2023) splits a byte stream into fixed-size patches and models them with a multiscale Transformer; the patch boundaries are static and content-independent. ^[3]
MambaByte (2024) applies a Mamba SSM directly to bytes with no chunking at all; it is isotropic, relying on the SSM's linear-time scaling to absorb the long byte sequence rather than compressing it. ^[4]
SpaceByte (2024) inserts larger Transformer blocks at dynamic positions, but chooses those positions with a hand-written rule, placing boundaries at spacelike bytes, which reintroduces a whitespace heuristic that fails on space-free languages. ^[5]
Byte Latent Transformer (BLT) (Meta, December 2024) makes patching data-dependent by training a small separate byte-level language model and cutting a boundary wherever its next-byte entropy is high. This is dynamic, but the segmenter is trained in a separate stage on an auxiliary objective rather than jointly with the main model. ^[6]
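The difference between static and rule-based segmentation is easy to see on a small example. The sketch below contrasts fixed-size patching with a whitespace rule; both functions are simplified stand-ins written for illustration (the space rule here splits only on the ASCII space byte, which is narrower than SpaceByte's actual notion of spacelike bytes), and neither reproduces the cited systems.

```python
from typing import List


def fixed_patches(data: bytes, size: int) -> List[bytes]:
    """Content-independent patching: cut every `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def space_rule_patches(data: bytes) -> List[bytes]:
    """Hand-written rule: close a patch after each ASCII space byte."""
    patches, start = [], 0
    for i, byte in enumerate(data):
        if byte == 0x20:
            patches.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        patches.append(data[start:])
    return patches


english = "a dynamic chunk".encode("utf-8")
chinese = "今天天气很好".encode("utf-8")  # 6 characters, 18 bytes, no spaces

print(fixed_patches(english, 4))       # [b'a dy', b'nami', b'c ch', b'unk']
print(space_rule_patches(english))     # [b'a ', b'dynamic ', b'chunk']
print(len(fixed_patches(chinese, 4)))  # 5 patches, cutting through 3-byte characters
print(len(space_rule_patches(chinese)))  # 1 patch: the rule finds no boundary at all
```

Fixed patches ignore word structure entirely, while the whitespace rule recovers words in English but returns the whole Chinese sentence as a single patch, which is the failure mode on space-free languages noted above.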

H-Net's distinction is that its chunking is fully end-to-end: the boundary predictor is part of the same network and is optimized by the same loss, with no external tokenizer, no fixed patch size, no whitespace rule, and no separately trained entropy model. The authors also distinguish H-Net's learned, semantically meaningful sparsity from the generic conditional computation of mixture-of-experts models: H-Net decides which spans of the input deserve a chunk, not merely which parameters to activate. A later, independent paper, H-Net++ (August 2025), extended the dynamic-chunking idea to morphologically rich languages such as Persian. ^[1]^[7]

Significance

The paper has been received as a notable step toward removing one of the last hand-engineered components of the language-modeling pipeline, an instance of the "bitter lesson" that learned, general-purpose methods tend to overtake hand-designed ones given enough compute and data. By showing that end-to-end learned chunking can match BPE at the billion-parameter scale while improving robustness and working across English, Chinese, code, and DNA, H-Net points toward genuinely modality-agnostic foundation models that need no per-domain tokenizer. ^[1]

The authors are candid about the limitations. The current implementation trains roughly twice as slowly as a comparable isotropic model, because variable-length chunk sequences complicate efficient batching and memory layout on GPUs. The largest models studied are around 1.3 billion parameters, so behavior and stability at the much larger scales used in frontier models remain open questions. And the dynamic, data-dependent sequence lengths raise systems-level challenges for memory management and serving that fixed tokenization does not. Even so, H-Net, together with its public code and Cartesia-hosted checkpoints, has made tokenizer-free, end-to-end hierarchical modeling a concrete and reproducible alternative to the tokenize-then-model paradigm. ^[1]^[2]

References

1. Hwang, Sukjun; Wang, Brandon; Gu, Albert. "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling." arXiv:2507.07955, July 10, 2025. https://arxiv.org/abs/2507.07955
2. goombalab/hnet. "H-Net: Hierarchical Network with Dynamic Chunking" (reference implementation and pretrained checkpoints; models hosted by cartesia-ai on Hugging Face). GitHub. https://github.com/goombalab/hnet
3. Yu, Lili; Simig, Daniel; Flaherty, Colin; Aghajanyan, Armen; Zettlemoyer, Luke; Lewis, Mike. "MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers." arXiv:2305.07185, May 2023. https://arxiv.org/abs/2305.07185
4. Wang, Junxiong; Gangavarapu, Tushaar; Yan, Jing Nathan; Rush, Alexander M. "MambaByte: Token-free Selective State Space Model." arXiv:2401.13660, January 2024. https://arxiv.org/abs/2401.13660
5. Slagle, Kevin. "SpaceByte: Towards Deleting Tokenization from Large Language Modeling." arXiv:2404.14408, April 2024. https://arxiv.org/abs/2404.14408
6. Pagnoni, Artidoro; Pasunuru, Ram; Rodriguez, Pedro; Nguyen, John; Muller, Benjamin; Li, Margaret; Zhou, Chunting; Yu, Lili; Weston, Jason; Zettlemoyer, Luke; Ghosh, Gargi; Lewis, Mike; Holtzman, Ari; Iyer, Srinivasan. "Byte Latent Transformer: Patches Scale Better Than Tokens." arXiv:2412.09871, December 2024. https://arxiv.org/abs/2412.09871
7. "H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages." arXiv:2508.05628, August 2025. https://arxiv.org/abs/2508.05628
