Test-Time Training (TTT)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,202 words
Test-Time Training (TTT) is a family of machine learning techniques in which a model updates a subset of its own parameters at inference time, optimizing a self-supervised auxiliary loss derived from the test input itself before producing the final prediction. The approach was introduced by Yu Sun and collaborators in the paper Test-Time Training with Self-Supervision for Generalization under Distribution Shifts (ICML 2020), where it was used to improve image-classifier robustness to corruptions and distribution shifts.[^1] In 2024 the same lead author extended the idea into a sequence-modeling primitive, the TTT layer, whose hidden state is itself a small neural network trained by gradient descent as tokens stream in, giving an alternative to attention and to linear-state recurrences such as Mamba.[^2] Subsequent work has applied TTT to abstract reasoning benchmarks such as ARC-AGI, to long-form video generation, and to in-context-learning regimes for large language models.[^3][^4]
Because TTT performs gradient updates per test instance (or per token), it sits between conventional supervised inference, which keeps weights frozen, and full continual learning, which adapts on labeled streaming data. The defining requirement is that the update signal comes from the test input under a self-supervised objective, with no test labels available. This positions TTT as a form of test-time compute that trades extra inference-time FLOPs for adaptation rather than for more search or longer chains of thought.
Adapting a model to new distributions at inference time is a long-standing concern in machine learning, encompassing unsupervised domain adaptation, transductive learning, and online learning. Classical domain-adaptation methods typically require an unlabeled batch from the target distribution at training time, while transductive methods require all test inputs in advance. TTT departs from both: each test sample defines its own learning problem, and only the current sample (or a short stream) is used to update parameters before prediction.[^1] The conceptual move is to recast inference as a small training loop driven by a self-supervised task that is the same at training time and at test time, so that any improvement on the auxiliary task is expected to transfer to the main task.
The original TTT paper, posted to arXiv in September 2019 and published in the Proceedings of the 37th ICML (2020), was authored by Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A. Efros, and Moritz Hardt, with affiliations spanning UC Berkeley and UC San Diego.[^1][^5] The system trains a Y-shaped network with a shared feature extractor and two heads: a main head for image classification and a self-supervised head that predicts the rotation (0, 90, 180, or 270 degrees) applied to an input image. At test time the shared feature extractor is fine-tuned to minimize the rotation-prediction loss on the current image, and the updated features are then used by the (unchanged) classification head to predict a label.[^1]
The paper reports two variants. Standard TTT updates a fresh copy of the model independently for each test image and then discards the update, so the next image starts from the original weights. Online TTT keeps the accumulated updates across the test stream, so the model becomes progressively more specialized to the test distribution.[^1] Both variants were evaluated on CIFAR-10-C, ImageNet-C, and other corruption and shift benchmarks designed for measuring robustness; online TTT in particular showed monotonically improving performance as more test samples were processed.[^1]
Subsequent work extended the original recipe to richer self-supervised objectives. Test-Time Training with Masked Autoencoders, by Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei A. Efros (arXiv 2209.07522, 2022), replaces rotation prediction with the reconstruction objective of a Masked Autoencoder (MAE).[^6] Because MAE is a strong self-supervised objective for Vision Transformers, using it as the inner-loop task allowed TTT to generalize across a wider range of corruptions; the paper also gives a theoretical analysis framed around a bias-variance tradeoff.[^6]
In July 2024 Yu Sun, with co-authors Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin, posted Learning to (Learn at Test Time): RNNs with Expressive Hidden States (arXiv 2407.04620), which generalized TTT from a per-image adaptation procedure into a generic sequence-modeling primitive.[^2] The key reframe is that a recurrent neural network can be viewed as a sequence-to-sequence map whose hidden state compresses past context; the TTT layer makes that hidden state itself a small machine-learning model whose parameters are updated by self-supervised gradient descent on each incoming token.[^2] Two instantiations are introduced: TTT-Linear (the hidden state is a linear model) and TTT-MLP (the hidden state is a two-layer multi-layer perceptron).[^2]
Two further developments are noteworthy. First, in November 2024 Ekin Akyurek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, and Jacob Andreas (MIT) released The Surprising Effectiveness of Test-Time Training for Few-Shot Learning, in which a TTT recipe is applied to an 8 billion-parameter language model to attack the ARC-AGI benchmark.[^3] Second, in April 2025, Karan Dalal and collaborators released One-Minute Video Generation with Test-Time Training, adding TTT layers to a pre-trained 5 billion-parameter diffusion transformer to generate one-minute Tom-and-Jerry-style cartoons from text storyboards.[^4] These results expanded TTT's footprint from a robustness technique into a generic tool for long-context generation and few-shot reasoning.
The general TTT recipe involves three ingredients: a main task f trained at training time, a self-supervised auxiliary task g sharing some parameters with f, and an adaptation procedure A that takes the test input x, evaluates the auxiliary loss L_aux(g; x), and runs one or more steps of gradient descent on the shared parameters before evaluating f on x.[^1] Because L_aux requires no labels, A can be applied at inference. The set of parameters updated is typically restricted to make adaptation cheap: in the original paper, only the shared feature-extractor weights are touched; in TTT layers, only the hidden state, which is itself a small set of model weights, is updated.[^1][^2]
In the ICML 2020 instantiation, the auxiliary task is image-rotation classification. Given a test image x, the procedure samples one of four rotations, asks the rotation head to predict it, and updates the shared encoder by gradient descent on the cross-entropy loss for this prediction. The classification head's weights remain fixed throughout the adaptation step, so the head sees a slightly different feature vector at test time than it did at training, but never an explicitly mislabeled input. Standard TTT resets the encoder weights to their pre-test values after each prediction, while online TTT does not, accumulating updates that can either help or hurt as the test distribution drifts.[^1]
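The difference between the standard and online variants can be sketched on a toy model. The following is a minimal illustration, not the paper's implementation: it assumes a linear shared encoder, linear heads, random 8x8 "images", and hand-derived gradients, and every name in it (`rotation_loss_and_grad`, `ttt_predict`, `run_stream`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
SIDE, HIDDEN, CLASSES = 8, 16, 10   # toy sizes (assumptions)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rotation_loss_and_grad(enc, rot_head, image, k):
    """Cross-entropy of predicting rotation k (in quarter turns), and its
    gradient with respect to the shared encoder only."""
    x = np.rot90(image, k).reshape(-1)      # rotated, flattened input
    h = enc @ x                             # shared features
    p = softmax(rot_head @ h)               # 4-way rotation prediction
    loss = -np.log(p[k])
    dlogits = p.copy()
    dlogits[k] -= 1.0                       # d loss / d logits
    grad_enc = np.outer(rot_head.T @ dlogits, x)
    return loss, grad_enc

def ttt_predict(enc, rot_head, cls_head, image, lr=0.01, steps=1):
    """Adapt a copy of the encoder on one image, then classify that image.
    Returns (predicted class, adapted encoder)."""
    adapted = enc.copy()
    for _ in range(steps):
        k = int(rng.integers(4))            # sample one of four rotations
        _, g = rotation_loss_and_grad(adapted, rot_head, image, k)
        adapted -= lr * g                   # both heads stay frozen
    label = int(np.argmax(cls_head @ (adapted @ image.reshape(-1))))
    return label, adapted

def run_stream(enc, rot_head, cls_head, images, online):
    current, preds = enc.copy(), []
    for img in images:
        label, adapted = ttt_predict(current, rot_head, cls_head, img)
        preds.append(label)
        if online:                          # online TTT keeps the update
            current = adapted
        # standard TTT: `current` is untouched, i.e. reset for each image
    return preds, current

enc = rng.normal(scale=0.1, size=(HIDDEN, SIDE * SIDE))
rot_head = rng.normal(scale=0.1, size=(4, HIDDEN))
cls_head = rng.normal(scale=0.1, size=(CLASSES, HIDDEN))
images = rng.normal(size=(5, SIDE, SIDE))

_, enc_standard = run_stream(enc, rot_head, cls_head, images, online=False)
_, enc_online = run_stream(enc, rot_head, cls_head, images, online=True)
```

After the run, `enc_standard` still equals the original encoder while `enc_online` has drifted away from it, which is the only behavioral difference between the two variants in this sketch.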
In the 2024 TTT-layers formulation, each token $x_t$ in a sequence is processed by a layer whose internal state $W_t$ is the parameters of a small machine-learning model $f(\cdot\,; W)$. For TTT-Linear, $f$ is a linear model; for TTT-MLP, $f$ is a two-layer MLP.[^2] Three learnable input projections produce a "training view" $\theta_K x_t$, a "label view" $\theta_V x_t$, and a "test view" $\theta_Q x_t$ per token. The inner-loop self-supervised reconstruction loss is
$$\ell(W; x_t) = \lVert f(\theta_K x_t; W) - \theta_V x_t \rVert^2,$$
and the layer's update rule is one step of (mini-batch) gradient descent on this loss, giving $W_t = W_{t-1} - \eta \nabla \ell(W_{t-1}; x_t)$. The layer's output for position $t$ is $f(\theta_Q x_t; W_t)$, the prediction of the updated inner model on the test view.[^2]
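The update and output rules above can be written out directly for the linear case. This is a naive sketch under simplifying assumptions: the inner model is `f(k; W) = W @ k`, the three projections are square matrices, the learning rate is a fixed scalar, and there is no normalization or residual connection. The function name `ttt_linear_forward` is hypothetical.

```python
import numpy as np

def ttt_linear_forward(X, theta_K, theta_V, theta_Q, eta=0.1, W0=None):
    """Naive online-GD TTT layer with a linear inner model f(k; W) = W @ k.
    X has shape (T, d); returns outputs of shape (T, d) and the final W."""
    T, d = X.shape
    W = np.zeros((d, d)) if W0 is None else W0.copy()
    Z = np.empty((T, d))
    for t in range(T):
        k, v, q = theta_K @ X[t], theta_V @ X[t], theta_Q @ X[t]
        err = W @ k - v                        # f(theta_K x_t; W) - theta_V x_t
        W = W - eta * 2.0 * np.outer(err, k)   # gradient of the squared error in W
        Z[t] = W @ q                           # updated model applied to the test view
    return Z, W

rng = np.random.default_rng(0)
T, d = 5, 4
X = rng.normal(size=(T, d))
theta_K, theta_V, theta_Q = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
Z, W_final = ttt_linear_forward(X, theta_K, theta_V, theta_Q)
```

The loop makes the recurrence explicit: the only state carried from one token to the next is the weight matrix `W`, whose size does not depend on how many tokens have been seen.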
This construction has two important properties. First, like a recurrence, it processes one token at a time, so the per-token cost is constant and the overall complexity is linear in sequence length. Second, because the hidden state is a parameterized function rather than a fixed-size vector, the layer can in principle compress information far richer than a single matrix or vector, addressing a well-known capacity bottleneck of classical RNNs and many state-space models.[^2]
A central theoretical result of the 2024 paper is that TTT-Linear reduces, under a particular set of choices, to linear attention. Specifically, when f is a linear model, the inner update uses batch gradient descent with learning rate 1/2, and the initial weights W_0 are zero, the layer's outputs across a sequence are identical to those produced by linear attention, where the accumulated inner-loop weights play the role of a linear-attention KV matrix.[^2] This unifies the RNN and attention perspectives: a TTT-Linear layer can be read either as an RNN that updates a parametric hidden state, or as a generalization of linear attention that supports richer optimizers, non-linear inner models (TTT-MLP), and self-supervised auxiliary losses other than the simple reconstruction described above.[^2]
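The reduction can be checked numerically. With zero initial weights, the gradient of the squared reconstruction error at W_0 for token t is -2 v_t k_t^T, so a learning rate of 1/2 makes the accumulated weights equal to the running sum of v_s k_s^T, which is the state of unnormalized causal linear attention. The sketch below assumes the three views are already projected (arrays `K`, `V`, `Q`) and uses arbitrary toy dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
K, V, Q = (rng.normal(size=(T, d)) for _ in range(3))

# Batch GD: every gradient is evaluated at W_0 = 0, with learning rate 1/2.
eta, W0 = 0.5, np.zeros((d, d))
W, ttt_out = W0.copy(), np.empty((T, d))
for t in range(T):
    grad = 2.0 * np.outer(W0 @ K[t] - V[t], K[t])   # equals -2 v_t k_t^T
    W = W - eta * grad                              # W_t = sum over s <= t of v_s k_s^T
    ttt_out[t] = W @ Q[t]

# Unnormalized causal linear attention: z_t = sum over s <= t of v_s (k_s . q_t).
scores = np.tril(Q @ K.T)
lin_attn_out = scores @ V

assert np.allclose(ttt_out, lin_attn_out)
```

Changing `eta`, starting from a nonzero `W0`, or evaluating each gradient at the latest weights instead of `W0` breaks the equality, which is one way to see how much of the TTT design space lies outside linear attention.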
To make TTT layers practical on accelerators, the paper introduces a mini-batch TTT algorithm that batches multiple consecutive tokens before computing each inner update, and a dual form that re-expresses the inner-loop updates in a way that can be implemented with matrix multiplications matched to GPU tensor cores. Without these systems-level techniques the wall-clock cost of TTT would dominate, but with them TTT-Linear is reported to be competitive in throughput with Mamba and significantly faster than Transformer models at long context lengths.[^2]
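The idea behind the dual form can be illustrated for the linear inner model. Within one mini-batch, all gradients are taken at the weights from the start of the batch, so the per-token outputs can be obtained from a few matrix products without materializing each intermediate weight matrix. The sketch below is a simplified illustration under stated assumptions (linear inner model, fixed scalar learning rate, no normalization or residual), not the paper's kernels; `minibatch_primal` and `minibatch_dual` are hypothetical names, and the batch size of 8 is arbitrary.

```python
import numpy as np

def minibatch_primal(W0, K, V, Q, eta):
    """Token-by-token reference: all gradients in the mini-batch are taken
    at W0, and token t uses the weights accumulated up to and including t."""
    W, Z = W0.copy(), np.empty_like(Q)
    for t in range(K.shape[0]):
        W = W - eta * 2.0 * np.outer(W0 @ K[t] - V[t], K[t])
        Z[t] = W @ Q[t]
    return Z, W

def minibatch_dual(W0, K, V, Q, eta):
    """Same outputs via matrix products, never materializing W_1..W_b."""
    err = K @ W0.T - V                       # (b, d) residuals at W0
    attn = np.tril(Q @ K.T)                  # causal mask: token t sees s <= t
    Z = Q @ W0.T - 2.0 * eta * attn @ err
    W_end = W0 - 2.0 * eta * err.T @ K       # state handed to the next mini-batch
    return Z, W_end

rng = np.random.default_rng(0)
b, d = 8, 4
W0 = rng.normal(size=(d, d))
K, V, Q = (rng.normal(size=(b, d)) for _ in range(3))
Z_primal, W_primal = minibatch_primal(W0, K, V, Q, eta=0.05)
Z_dual, W_dual = minibatch_dual(W0, K, V, Q, eta=0.05)
```

Both functions return the same outputs and the same end-of-batch weights; the dual version replaces the sequential loop with dense matrix multiplications, which is the property that makes it a better fit for accelerator hardware.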
The 2024 MIT TTT-for-ARC paper applies a different style of TTT to an already-trained large language model. Given an ARC task that consists of several input-output example pairs and a held-out test query, the method constructs "leave-one-out" tasks from the given examples, augments them through rule-based transformations such as rotations and color permutations, and uses these augmented tasks to fine-tune small task-specific LoRA adapters on top of a fine-tuned Llama-3 8B model. Each task gets its own adapter, trained for two epochs with batch size two and AdamW at learning rate around 5e-5 or 1e-4; the LoRA rank is 128 with alpha 16, applied to attention query and value projections, MLP weights, and the output projection.[^3] The adapted model is then run on the held-out query. This procedure is TTT in the sense that the model's parameters are updated from the test input alone (the few-shot examples), via a supervised loss derived without external labels.
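The data-construction step can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: grids are assumed to be integer NumPy arrays with color ids 0 to 9, only rotations and color permutations are shown, and the function names and the number of color permutations per task are hypothetical.

```python
import numpy as np

def leave_one_out_tasks(pairs):
    """Each demonstration pair takes a turn as the pseudo-test query, with
    the remaining pairs serving as its in-context examples."""
    tasks = []
    for i, held_out in enumerate(pairs):
        context = pairs[:i] + pairs[i + 1:]
        tasks.append((context, held_out))
    return tasks

def rotate_task(task, k):
    """Apply the same k quarter-turn rotation to every grid in a task."""
    context, (qx, qy) = task
    rotated_context = [(np.rot90(x, k), np.rot90(y, k)) for x, y in context]
    return rotated_context, (np.rot90(qx, k), np.rot90(qy, k))

def permute_colors(task, perm):
    """Relabel colors consistently; perm[c] is the new id of color c."""
    context, (qx, qy) = task
    return [(perm[x], perm[y]) for x, y in context], (perm[qx], perm[qy])

def build_ttt_dataset(pairs, rng, n_color_perms=2):
    dataset = []
    for task in leave_one_out_tasks(pairs):
        for k in range(4):                          # 0, 90, 180, 270 degrees
            rotated = rotate_task(task, k)
            dataset.append(rotated)
            for _ in range(n_color_perms):
                perm = rng.permutation(10)          # ARC grids use 10 colors
                dataset.append(permute_colors(rotated, perm))
    return dataset

rng = np.random.default_rng(0)
pairs = [(rng.integers(0, 10, size=(3, 3)), rng.integers(0, 10, size=(3, 3)))
         for _ in range(3)]
dataset = build_ttt_dataset(pairs, rng)
```

With three demonstration pairs, four rotations, and two color permutations per rotation, the sketch yields 36 small training tasks, each with two context pairs and one held-out pair. Those tasks would then be serialized and used to train the per-task adapter.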
| Variant | Year | Lead author(s) | Adaptation target | Update signal | Notes |
|---|---|---|---|---|---|
| Standard TTT (vision) | 2020 | Sun et al. | Shared encoder weights | Rotation-prediction loss on the test image | Resets weights per image[^1] |
| Online TTT (vision) | 2020 | Sun et al. | Shared encoder weights | Rotation-prediction loss on streaming images | Accumulates updates across stream[^1] |
| TTT-MAE | 2022 | Gandelsman et al. | ViT encoder, MAE setup | MAE reconstruction loss | Stronger auxiliary task, ViT backbone[^6] |
| TTT-Linear | 2024 | Sun et al. | Inner linear-model hidden state | Per-token reconstruction loss | Linear-time language model layer[^2] |
| TTT-MLP | 2024 | Sun et al. | Inner two-layer MLP hidden state | Per-token reconstruction loss | Higher capacity, more memory I/O[^2] |
| TTT for ARC | 2024 | Akyurek et al. | Task-specific LoRA adapters | Loss on augmented few-shot examples | 8B model, 53% on ARC public val[^3] |
| TTT-Video | 2025 | Dalal et al. | TTT layers added to a DiT | Inner-loop reconstruction during finetune | One-minute coherent video[^4] |
The 2024 TTT-layers paper was accompanied by two official open-source repositories:
Additional code, including custom kernels for the dual-form mini-batch TTT update and a video-generation codebase (ttt-video-dit), was released through the same test-time-training GitHub organization.[^7][^8]
Several independent or derivative implementations have appeared:
socialfoundations/tttlm repository.[^9]
On CIFAR-10-C, which applies 15 standard corruption types at five severities to the CIFAR-10 test set, the original TTT recipe improved over standard and pretraining-only baselines without hurting clean-data accuracy. The online variant improved further as more test samples were processed.[^1] On ImageNet-C, the larger-scale counterpart, the paper reports gains that grow with the size of the test stream, consistent with the model's slow adaptation to the corruption.[^1] The paper also evaluates on additional shift benchmarks including video robustness datasets, and reports consistent improvements over the baselines used.[^1]
The TTT-layers paper trains models from 125M to 1.3B parameters on The Pile at 2k and 8k context lengths and on Books3 at 32k context. The paper reports that at 2k context, TTT-Linear, Mamba, and a Transformer baseline have "mostly overlapping" perplexity curves across scales. At 8k context, both TTT-Linear and TTT-MLP perform "significantly better" than Mamba; as an example, the paper's ablation tables list a 1.3B perplexity of around 11.09, compared with roughly 15.23 for a linear-attention baseline. At 32k context on Books3, the Transformer and the TTT variants continue reducing perplexity as context grows, while Mamba plateaus after roughly 16k tokens.[^2] TTT-MLP achieves better perplexity than TTT-Linear at every scale tested, but its larger inner MLP creates additional memory-I/O pressure that partially offsets the quality gain.[^2]
On the ARC-AGI public validation set, the MIT group reports that applying TTT to a Llama-3 8B instruction-tuned model achieves 53.0% accuracy, an improvement of "nearly 25%" over the previous state of the art for purely neural public approaches, and "up to 6x higher accuracy compared to fine-tuned baselines."[^3] When combined with an existing program-synthesis solver, the joint system reaches 61.9% on ARC public validation, which the authors describe as matching average human performance on the same set.[^3] On BIG-Bench Hard in a 10-shot setting, the same TTT recipe improves accuracy from 50.5% to 57.8%, a 7.3 point gain.[^3]
The TTT-Video work adds TTT-MLP layers into a pre-trained 5 billion-parameter diffusion transformer (CogVideoX-5B), then finetunes on a custom dataset of more than seven hours of classic Tom-and-Jerry cartoons broken into 3-second annotated segments. Compared to baselines including Mamba 2, Gated DeltaNet, and a sliding-window-attention variant, the TTT-MLP model leads by 34 Elo points in a 100-video-per-method human evaluation, producing one-minute videos that the authors describe as more temporally coherent.[^4][^13] The paper notes residual visual artifacts and emphasizes that the implementation is not yet efficient enough for serving.[^4]
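To give a sense of scale for the 34-point margin, the following sketch converts an Elo gap into an expected head-to-head preference rate. It assumes the standard logistic Elo model with a 400-point scale; this article does not state which Elo variant the evaluation used, so the result is only indicative. The function name is hypothetical.

```python
def elo_expected_score(rating_gap, scale=400.0):
    """Expected win rate of the higher-rated side under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))

p = elo_expected_score(34)
print(f"{p:.3f}")
```

Under that assumption, a 34-point gap corresponds to the higher-rated method being preferred in roughly 55% of pairwise comparisons, a consistent but moderate edge.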
| Approach | Hidden state | Per-token cost | Long-context behavior | Test-time adaptation |
|---|---|---|---|---|
| Transformer (self-attention) | Full KV cache, grows with t | Linear in position t (quadratic over the full sequence) | Strong, but cost grows | No adaptation; weights frozen |
| RNN / LSTM | Fixed-size vector | Constant | Fades or saturates over very long contexts | No adaptation |
| Mamba / state-space models | Selective state vector | Constant | Strong up to roughly 16k tokens; plateaus thereafter on Books3 in TTT paper experiments[^2] | No adaptation |
| Linear attention | Matrix accumulator | Constant per token | Equivalent to TTT-Linear with batch gradient descent, zero W_0, and lr=1/2[^2] | Implicit accumulator, not gradient-based |
| TTT-Linear | Linear inner model trained at test time | Constant per token, plus inner-loop GD step | Reduces perplexity beyond 16k on Books3[^2] | Yes, by inner-loop SSL update |
| TTT-MLP | Two-layer MLP inner model | Constant per token, more memory I/O | Best long-context behavior reported in the paper[^2] | Yes, by inner-loop SSL update |
The qualitative picture from the 2024 TTT-layers paper is that classical fixed-vector RNNs and even modern SSMs face a representational ceiling that grows visible as context length increases. Self-attention avoids that ceiling by simply keeping the entire context in a KV cache, at quadratic cost. TTT layers attempt a third path: keep the per-token cost constant by storing context inside a small parametric model, and increase that model's expressivity by making the inner state itself a neural network. The fact that TTT-Linear collapses to linear attention under a specific reduction makes the relationship to existing methods precise rather than merely analogical.[^2]
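The trade-off between a growing cache and a fixed-size parametric state can be made concrete with a rough count of stored values per layer and head. The numbers below are illustrative only: the head width of 64 and the 4x hidden expansion for the MLP state are assumptions, biases are ignored, and the function names are hypothetical.

```python
def kv_cache_floats(t, d):
    """Keys and values for t tokens of width d (one layer, one head)."""
    return 2 * t * d

def ttt_linear_state_floats(d):
    """A d x d weight matrix, independent of context length."""
    return d * d

def ttt_mlp_state_floats(d, expansion=4):
    """Two weight matrices, d -> expansion*d -> d (biases ignored)."""
    return 2 * d * expansion * d

d = 64
for t in (2_048, 32_768):
    print(t, kv_cache_floats(t, d), ttt_linear_state_floats(d), ttt_mlp_state_floats(d))
```

The cache grows in proportion to the number of tokens seen, while both TTT states stay the same size at any context length. The fixed-size state must therefore compress the context, which is why the expressivity of the inner model matters.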
The original use case for TTT remains adapting a deployed image classifier to corruptions or covariate shifts that were not present at training. Because the auxiliary task (rotation prediction or MAE reconstruction) does not require labels, the procedure can be applied wherever a modest per-sample slowdown at inference is acceptable. Online TTT, with its accumulated updates, is particularly useful when the deployment environment drifts slowly and a continuously updated model is desired.[^1][^6]
The TTT-layer formulation suggests an alternative architecture for large language models in which the model's hidden state can absorb information from arbitrarily long contexts without paying quadratic attention cost. Empirically, TTT-Linear and TTT-MLP keep reducing perplexity beyond 16k tokens on Books3, where Mamba plateaus.[^2] This makes TTT a candidate primitive for long-document, code-base-scale, or multi-turn agent tasks where context windows reach into the hundreds of thousands of tokens. The official open-source releases give an entry point for experimentation, though the PyTorch reference implementation is not optimized for training and is intended as a tutorial.[^7]
The MIT result on ARC-AGI established TTT as an approach to few-shot abstract reasoning. The recipe suits ARC-style tasks because each task is presented as a handful of input-output pairs that already form a tiny supervised dataset, matching TTT's "learn at inference time" framing.[^3] More broadly, the ability to spin up a per-task LoRA adapter at inference suggests applications in personalized assistants, on-device adaptation, and tasks where the user's intent is conveyed through demonstrations rather than instructions.
The TTT-Video result extends TTT into generative modeling, applying TTT layers inside a pre-trained DiT to handle the very long sequences that one-minute videos require. The reported 34 Elo-point margin over baselines such as Mamba 2 on Tom-and-Jerry video generation suggests that the inner-loop expressivity of TTT can carry over from text to dense pixel sequences.[^4][^13]
Despite the encouraging results, TTT carries several practical limitations.
First, inference-time cost. Performing a gradient-descent step (or many steps) per test sample or per token can multiply the FLOPs and wall-clock time of inference. The TTT-layers paper introduces mini-batch TTT and a dual form to keep wall-clock time competitive with Mamba on existing accelerators, but the authors note that TTT-MLP still faces memory-I/O bottlenecks and that further systems work is needed.[^2] The TTT-Video paper similarly notes that its implementation is not yet efficient enough for production serving.[^4]
Second, risk of degradation. The original TTT paper observes that updating model weights at test time can hurt performance on clean, in-distribution data if the auxiliary task is poorly chosen or if too many adaptation steps are taken. The standard variant resets weights between samples partly to bound this risk. Online TTT, while powerful when the test stream is consistent, can drift if the distribution changes within the stream.[^1]
Third, auxiliary-task design. The framework is only as good as the self-supervised loss used in the inner loop. Rotation prediction works for natural images but is unlikely to help on, for example, satellite imagery, where images have no canonical orientation and rotational invariance is wanted. Per-token reconstruction in TTT layers may have its own systematic failure modes when sequences contain repeated structure or adversarial inputs.[^2][^6]
Fourth, evaluation maturity. As of mid-2026, head-to-head comparisons of TTT-style architectures against modern Transformer and Mamba 2 baselines exist mostly at sub-2B parameter scale, on a small set of language-modeling and reasoning benchmarks. The behavior of TTT layers at frontier-model scale, in instruction-tuned settings, and under adversarial inputs is not yet well characterized.[^2][^12]
Fifth, interaction with existing inference stacks. Many production systems assume frozen weights at serve time, which simplifies batching, KV-caching, request routing, and quantization. TTT layers, especially those that mutate weights per token, complicate these assumptions and may require new serving infrastructure to deploy widely. Hugging Face Transformers integration in ttt-lm-pytorch is described by the authors as primarily for study rather than for high-throughput deployment.[^7]
TTT belongs to a broader family of inference-time adaptation techniques. Test-time adaptation (TTA) usually refers to methods that adjust only batch-normalization statistics or a few parameters at test time on a batch of unlabeled samples; surveys such as Liang et al. taxonomize over fifty TTA methods, of which TTT is one branch defined by its use of an explicit auxiliary loss.[^12] Test-time augmentation averages predictions over augmented copies of the test input without changing weights and is therefore not TTT.
Other adjacent fields include meta-learning (which trains models to be quickly adaptable from a few examples), continual learning (which adapts on labeled streaming data), in-context learning (which adapts behavior without weight changes by conditioning on examples in the prompt), and online learning. The 2024 TTT-for-ARC paper explicitly compares TTT with in-context learning, finding that explicit gradient-based adaptation outperforms purely in-context inference on ARC-style tasks at the scale tested.[^3]
Within sequence modeling, TTT layers are most directly comparable to selective state space models such as Mamba and Mamba 2, to linear attention variants, and to RWKV-style linear RNNs. The relationship to linear attention is formal (Theorem 1 of the 2024 paper); the relationship to Mamba is empirical and shows up most clearly at context lengths beyond 16k tokens.[^2] The TTT-Video paper compares against Gated DeltaNet and sliding-window attention, both of which are recent alternatives in the linear-time-RNN family.[^4]