Register tokens (Vision Transformers Need Registers)

Updated Jun 8, 2026

Overview

Register tokens are a small set of extra learnable tokens added to the input sequence of a Vision Transformer (ViT) so the network has a dedicated place to carry out internal, image-level computation. They were introduced in the paper "Vision Transformers Need Registers" by Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski, a collaboration between FAIR, the research division of Meta AI, and Université Grenoble Alpes with Inria in France ^[1]. The paper diagnosed a recurring defect in ViT feature maps: well-trained models spontaneously grow a few high-norm "artifact" tokens that occupy uninformative background patches and dominate the attention maps, which blurs dense features and ruins interpretability. The proposed fix is to append a handful of learnable tokens, called registers, to the patch sequence, let the model offload that global computation onto them instead of commandeering real image patches, and then discard the registers from the output ^[1].

The work was presented as an oral and received one of the Outstanding Paper Awards at the 2024 International Conference on Learning Representations (ICLR) ^[2]. Its most visible product is DINOv2 with registers, a re-released version of Meta's DINOv2 self-supervised foundation model ^[3] that ships four register tokens by default ^[1]^[6], and the idea has since been built natively into later backbones such as DINOv3 ^[7]. Registers are also widely read as the vision counterpart of the attention sink observed in language models, where a few positions absorb attention mass the model does not otherwise need ^[4]^[5].

The artifact-token problem

A ViT splits an image into a grid of patches, embeds each patch as a token, prepends a special classification (CLS) token, and processes the whole sequence with self-attention. Ideally each patch token's output reflects the content of its own region, so the patch grid can be reshaped into a clean dense feature map for tasks like segmentation or depth. Darcet et al. found that this assumption breaks for many trained ViTs. When the per-token feature norms are visualized over the image, a small number of tokens, roughly two percent of the patches, have norms far above the rest (the analysis flags tokens above a norm of about 150 as outliers), and they sit almost exclusively in flat, low-information background regions ^[1].
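
The norm-based outlier check described above can be sketched in a few lines. This is a minimal illustration rather than the authors' analysis code: it assumes NumPy, uses synthetic features in place of real ViT outputs, and the function name `find_artifact_tokens`, the 16x16 grid, the 64-dimensional features, and the chosen artifact positions are all hypothetical. Only the cutoff of 150 comes from the text.

```python
import numpy as np

NORM_CUTOFF = 150.0  # outlier threshold quoted in the text

def find_artifact_tokens(patch_features, cutoff=NORM_CUTOFF):
    """Return (mask, norms) flagging patch tokens whose L2 norm exceeds cutoff.

    patch_features: array of shape (num_patches, dim), CLS token already removed.
    """
    norms = np.linalg.norm(patch_features, axis=-1)
    return norms > cutoff, norms

# Synthetic stand-in for a 16x16 patch grid with 64-dim features (assumed sizes).
rng = np.random.default_rng(0)
grid, dim = 16, 64
features = rng.normal(size=(grid * grid, dim))   # typical norm near sqrt(64) = 8
artifact_idx = np.array([3, 40, 77, 130, 201])   # hypothetical background patches
features[artifact_idx] *= 40.0                   # inflate their norms well past 150

mask, norms = find_artifact_tokens(features)
norm_map = norms.reshape(grid, grid)             # per-patch norm map, ready to plot
print(f"flagged {mask.sum()} of {mask.size} tokens ({mask.mean():.1%})")
```

With real model outputs, `norm_map` is the quantity one would render as a heatmap over the image to see where the high-norm tokens sit.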

These artifacts are not present from the start. They appear only in sufficiently large models, ViT-Large and above, emerge after about a third of training, and concentrate in the middle and later blocks, around layer 15 of the 40-block ViT-giant ^[1]. They are not specific to one training recipe either: the authors observed them in the self-supervised DINOv2, the supervised DeiT-III, and the text-supervised OpenCLIP (CLIP), while the earlier DINO model, which produced famously clean attention maps, did not show them ^[1].

To understand what the high-norm tokens are doing, the authors probed them with linear classifiers. The artifact tokens turned out to hold little information about their own patch (they are poor at predicting their position in the grid and at reconstructing their local pixels) yet hold more global, image-level information than ordinary patches (a single artifact token is a better image classifier than a single normal token). The interpretation is that the network recognizes these patches as redundant and recycles them as scratch space for global computation, throwing away their local content in the process ^[1].

Linear probe	Artifact tokens	Normal tokens
Patch position prediction (top-1 accuracy)	22.8%	41.7%
Local pixel reconstruction (L2 error, lower is better)	25.23	18.38
Image classification from a single token (top-1 accuracy)	69.0%	65.8%

This recycling is what corrupts downstream use: the high-norm tokens soak up a disproportionate share of attention, so attention maps look noisy rather than object-focused, and the affected patch positions no longer carry usable local features for dense prediction ^[1].

Register tokens (the fix)

The remedy is deliberately minimal. The authors add a few extra learnable tokens, the registers, to the input sequence alongside the patch tokens and the CLS token. Structurally a register behaves exactly like the CLS token: it is an independent, learned embedding that does not correspond to any image patch, it is processed by every self-attention layer together with the patches, and it can read from and write to the rest of the sequence. The difference is purpose. Where the CLS token is read out as the image-level representation, the registers are simply discarded at the output and never used for any prediction. They exist only to give the model an explicit, content-free place to store and aggregate the global information it would otherwise have hidden inside background patches ^[1].
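
As a rough sketch of the mechanism, the following NumPy snippet builds an input sequence with registers, runs one simplified self-attention layer, and drops the register outputs. It is an illustration under stated assumptions, not the reference implementation: the weights are random rather than learned, the layer omits residual connections, normalization, multiple heads, and the MLP, and the token order `[CLS | registers | patches]` and all sizes are choices made here for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over the full sequence (no masking)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v, attn

rng = np.random.default_rng(0)
dim, num_patches, num_registers = 32, 16, 4   # assumed toy sizes

# Parameters that would be learned in practice (random here):
# one CLS embedding and a few register embeddings, none tied to a patch.
cls_token = rng.normal(size=(1, dim))
registers = rng.normal(size=(num_registers, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

patch_tokens = rng.normal(size=(num_patches, dim))  # stand-in for embedded patches

# Input sequence: [CLS | registers | patches]. Every token attends to every other,
# so registers can read from and write to the rest of the sequence.
sequence = np.concatenate([cls_token, registers, patch_tokens], axis=0)
output, attn = self_attention(sequence, w_q, w_k, w_v)

# Readout: keep CLS and the patch tokens; the register outputs are discarded.
cls_out = output[0]
patch_out = output[1 + num_registers:]
```

The only change relative to a plain ViT forward pass is the extra rows in `sequence` and the slice that skips them at readout.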

By default the method uses four registers, with ablations spanning zero to sixteen; performance is fairly flat once a few are present, so four is a sensible default ^[1]. The cost is negligible: four extra tokens raise the sequence length by a couple of percent, adding under two percent to the model's FLOPs and essentially no parameters ^[1]. In the original method the registers are present throughout pretraining, since the artifact behavior is something the model learns over training rather than a fixed property of the architecture (a training-free variant for already-trained models is discussed below). With registers in place, the high-norm outliers disappear, the feature norms become uniform across the patch grid, and the attention and feature maps become smooth and object-centric ^[1].
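
The overhead can be checked with back-of-the-envelope arithmetic. The sketch below is an estimate, not a measurement from the paper: the 224-pixel input, 14-pixel patches, width of 1024, and the per-block cost formula (counting only the main matrix multiplications) are assumptions chosen for illustration.

```python
def block_flops(num_tokens, dim):
    """Rough multiply-accumulate count for one Transformer block.

    Linear layers (QKV, output projection, 4x-wide MLP): 12 * N * D^2.
    Attention scores and weighted sum of values:          2 * N^2 * D.
    """
    return 12 * num_tokens * dim**2 + 2 * num_tokens**2 * dim

image_size, patch_size, dim = 224, 14, 1024    # assumed ViT-L/14-style setup
num_registers = 4                               # default from the text

num_patches = (image_size // patch_size) ** 2   # 16 * 16 = 256
base_len = num_patches + 1                      # plus CLS
reg_len = base_len + num_registers

seq_overhead = reg_len / base_len - 1
flop_overhead = block_flops(reg_len, dim) / block_flops(base_len, dim) - 1
extra_params = num_registers * dim              # one learned vector per register

print(f"sequence length: {base_len} -> {reg_len} (+{seq_overhead:.2%})")
print(f"per-block FLOPs: +{flop_overhead:.2%}")
print(f"extra parameters: {extra_params}")
```

Under these assumptions both the sequence length and the per-block cost grow by less than two percent, and the added parameters are a few thousand values against hundreds of millions in the backbone. At higher input resolutions the relative overhead shrinks further, since the patch count grows while the register count stays fixed.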

Results

Adding registers removes the artifacts for supervised, text-supervised, and self-supervised models alike, without hurting downstream performance and sometimes improving it. The gains are largest where clean dense features matter most. On unsupervised object discovery with the LOST algorithm, where artifacts had been especially damaging, DINOv2 with registers improves sharply, from 35.3 to 55.4 corloc on VOC 2007. On dense prediction such as semantic segmentation and monocular depth it improves modestly, and on global image classification it is essentially unchanged, confirming that the cleanup does not cost global quality ^[1].

DINOv2 frozen features, as reported by Darcet et al. ^[1]	Without registers	With registers
LOST object discovery, VOC 2007 (corloc)	35.3	55.4
ImageNet-1k linear classification (top-1)	84.3	84.8
ADE20k semantic segmentation (mIoU)	46.6	47.9
NYUd monocular depth (RMSE, lower is better)	0.378	0.366

For the supervised DeiT-III and text-supervised OpenCLIP the picture is the same: classification accuracy is left essentially unchanged (DeiT-III is unmoved on ImageNet, for example) while the visible artifacts vanish, so the method is close to a free improvement in feature quality and interpretability ^[1]. The paper reported a new state of the art for self-supervised models on several dense benchmarks at the time of publication ^[1].

Connection to attention sinks

Register tokens are now commonly described as the vision analog of the attention sink in large language models. In a decoder-only Transformer, a softmax attention head must distribute weights that sum to one even when it has nothing relevant to read, and the surplus mass piles onto a few always-visible positions, typically the first token. The contemporaneous StreamingLLM work by Xiao et al., also presented at ICLR 2024, both characterized this and showed that a single dedicated learnable "sink" token can absorb the no-op attention by design ^[5]. Register tokens play the same structural role for vision: rather than letting the model seize a few real background patches as no-op scratch space, they hand it a set of dedicated slots to dump that computation into, leaving the real tokens clean.
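
The sum-to-one constraint behind this behavior is easy to see numerically. The toy example below is a sketch with made-up logits, not data from either paper: it assumes a single query that matches none of eight content tokens, and a hypothetical sink position for which the head has learned a high score.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical attention logits for one query with nothing relevant to read:
# all 8 content tokens score the same.
content_logits = np.zeros(8)
sink_logit = 4.0  # assumed: the head assigns a high score to the sink position

no_sink = softmax(content_logits)
with_sink = softmax(np.concatenate([[sink_logit], content_logits]))

# Without a sink, the weights still sum to one, so the full unit of attention
# is forced onto content tokens even though none of them is relevant.
print("without sink, weight per content token:", no_sink[0])

# With a sink, most of the mass lands on the dedicated position instead.
print("with sink, weight on sink:", with_sink[0])
print("with sink, total weight on content:", with_sink[1:].sum())
```

A register plays the role of the sink position here: it gives the softmax somewhere harmless to put the mass, so real tokens do not have to be repurposed for it.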

The Darcet paper does not itself use the term attention sink or cite the language-model work; the two lines were independent and concurrent, both posted in late September 2023, and the connection has been drawn by the community since. It was made explicit by the 2025 follow-up "Vision Transformers Don't Need Trained Registers" (Nick Jiang, Amil Dravid, Alexei Efros, and Yossi Gandelsman, a NeurIPS 2025 spotlight), which traced the high-norm artifacts to a sparse set of "register neurons" inside a few attention heads and showed that their activations can be shifted into a single extra untrained token at test time. This reproduces the benefit of trained registers, cleaner attention and better dense features, on existing CLIP and DINOv2 checkpoints without any retraining, and frames the vision artifacts in the same terms as the attention-sink literature ^[4].

Adoption

The most direct adoption is by the original authors. Meta released DINOv2 with registers checkpoints across model sizes (small, base, large, and giant), and the variant was integrated into the Hugging Face Transformers library as a first-class model, making four register tokens the default for the cleaned-up DINOv2 ^[1]^[6]. The technique then became a standard ingredient in newer vision foundation models: Meta's DINOv3, released in 2025, builds registers in from the start, jointly learning one class token, four register tokens, and the patch tokens throughout training rather than retrofitting them ^[7].

Because registers are cheap and recipe-agnostic, they have spread broadly, and follow-up research continues to refine the idea. The training-free "test-time registers" approach offers the benefit to models that were trained without them ^[4], and 2026 work such as "Do All Vision Transformers Need Registers? A Cross-Architectural Reassessment" re-examines which architectures and scales actually develop the artifacts and gain from the fix ^[8]. Together these establish register tokens as a small, durable architectural pattern: a few dedicated scratch tokens that keep a Transformer's real tokens free of no-op computation.

References

1. Darcet, T., Oquab, M., Mairal, J., and Bojanowski, P. "Vision Transformers Need Registers." arXiv:2309.16588, September 2023; ICLR 2024 (Oral, Outstanding Paper Award). https://arxiv.org/abs/2309.16588
2. ICLR. "ICLR 2024 Outstanding Paper Awards." ICLR Blog, May 6, 2024. https://blog.iclr.cc/2024/05/06/iclr-2024-outstanding-paper-awards/
3. Oquab, M., Darcet, T., Moutakanni, T., et al. "DINOv2: Learning Robust Visual Features without Supervision." arXiv:2304.07193, April 2023. https://arxiv.org/abs/2304.07193
4. Jiang, N., Dravid, A., Efros, A. A., and Gandelsman, Y. "Vision Transformers Don't Need Trained Registers." arXiv:2506.08010, June 2025; NeurIPS 2025 (Spotlight). https://arxiv.org/abs/2506.08010
5. Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. "Efficient Streaming Language Models with Attention Sinks." arXiv:2309.17453, September 2023; ICLR 2024. https://arxiv.org/abs/2309.17453
6. Hugging Face. "DINOv2 with Registers." Transformers model documentation. https://huggingface.co/docs/transformers/en/model_doc/dinov2_with_registers
7. Meta AI. "DINOv3." Model card and repository, 2025. https://github.com/facebookresearch/dinov3
8. "Do All Vision Transformers Need Registers? A Cross-Architectural Reassessment." arXiv:2603.25803, 2026. https://arxiv.org/abs/2603.25803
