Test-Time Training (TTT)


Updated Jul 16, 2026


Test-Time Training (TTT) is a family of machine learning techniques in which a model updates a subset of its own parameters at inference time, optimizing a self-supervised auxiliary loss derived from the test input itself before producing the final prediction. The approach was introduced by Yu Sun and collaborators in the paper Test-Time Training with Self-Supervision for Generalization under Distribution Shifts (ICML 2020), where it was used to improve image-classifier robustness to corruptions and distribution shifts.^[1] In 2024 the same lead author extended the idea into a sequence-modeling primitive, the TTT layer, whose hidden state is itself a small neural network trained by gradient descent as tokens stream in, giving an alternative to attention and to linear-state recurrences such as Mamba.^[2] Subsequent work has applied TTT to abstract reasoning benchmarks such as ARC-AGI, to long-form video generation, and to in-context-learning regimes for large language models.^[3]^[4]

Because TTT performs gradient updates per test instance (or per token), it sits between conventional supervised inference, which keeps weights frozen, and full continual learning, which adapts on labeled streaming data. The defining requirement is that the update signal comes from the test input under a self-supervised objective, with no test labels available. This positions TTT as a form of test-time compute that trades extra inference-time FLOPs for adaptation rather than for more search or longer chains of thought.

History

Background and predecessors

Adapting a model to new distributions at inference time is a long-standing concern in machine learning, encompassing unsupervised domain adaptation, transductive learning, and online learning. Classical domain-adaptation methods typically require an unlabeled batch from the target distribution at training time, while transductive methods require all test inputs in advance. TTT departs from both: each test sample defines its own learning problem, and only the current sample (or a short stream) is used to update parameters before prediction.^[1] The conceptual move is to recast inference as a small training loop driven by a self-supervised task that is the same at training time and at test time, so that any improvement on the auxiliary task is expected to transfer to the main task.

The 2019 to 2020 ICML paper

The original TTT paper, posted to arXiv in September 2019 and published in the Proceedings of the 37th ICML (2020), was authored by Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A. Efros, and Moritz Hardt, with affiliations spanning UC Berkeley and UC San Diego.^[1]^[5] The system trains a Y-shaped network with a shared feature extractor and two heads: a main head for image classification and a self-supervised head that predicts the rotation (0, 90, 180, or 270 degrees) applied to an input image. At test time the shared feature extractor is fine-tuned to minimize the rotation-prediction loss on the current image, and the updated features are then used by the (unchanged) classification head to predict a label.^[1]

The paper reports two variants. Standard TTT updates a fresh copy of the model independently for each test image and then discards the update, so the next image starts from the original weights. Online TTT keeps the accumulated updates across the test stream, so the model becomes progressively more specialized to the test distribution.^[1] Both variants were evaluated on CIFAR-10-C, ImageNet-C, and other corruption and shift benchmarks designed for measuring robustness; online TTT in particular showed monotonically improving performance as more test samples were processed.^[1]

TTT for vision after 2020

Subsequent work extended the original recipe to richer self-supervised objectives. Test-Time Training with Masked Autoencoders, by Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei A. Efros (arXiv 2209.07522, 2022), replaces rotation prediction with the reconstruction objective of a Masked Autoencoder (MAE).^[6] Because MAE is a strong self-supervised objective for Vision Transformers, using it as the inner loop allowed TTT to generalize across a wider range of corruptions and to draw on theoretical analysis framed around a bias-variance tradeoff.^[6]

Expansion to sequence modeling: TTT layers (2024)

In July 2024 Yu Sun, with co-authors Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin, posted Learning to (Learn at Test Time): RNNs with Expressive Hidden States (arXiv 2407.04620), which generalized TTT from a per-image adaptation procedure into a generic sequence-modeling primitive.^[2] The key reframe is that a recurrent neural network can be viewed as a sequence-to-sequence map whose hidden state compresses past context; the TTT layer makes that hidden state itself a small machine-learning model whose parameters are updated by self-supervised gradient descent on each incoming token.^[2] Two instantiations are introduced: TTT-Linear (the hidden state is a linear model) and TTT-MLP (the hidden state is a two-layer multi-layer perceptron).^[2]

Recent extensions (2024 to 2025)

Two further developments are noteworthy. First, in November 2024 Ekin Akyurek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, and Jacob Andreas (MIT) released The Surprising Effectiveness of Test-Time Training for Few-Shot Learning, in which a TTT recipe is applied to an 8 billion-parameter language model to attack the ARC-AGI benchmark.^[3] Second, in April 2025, Karan Dalal and collaborators released One-Minute Video Generation with Test-Time Training, adding TTT layers to a pre-trained 5 billion-parameter diffusion transformer to generate one-minute Tom-and-Jerry-style cartoons from text storyboards.^[4] These results expanded TTT's footprint from a robustness technique into a generic tool for long-context generation and few-shot reasoning.

Technical Details

General recipe

The general TTT recipe involves three ingredients: a main task f trained at training time, a self-supervised auxiliary task g sharing some parameters with f, and an adaptation procedure A that takes the test input x, evaluates the auxiliary loss $L_{\text{aux}}(g; x)$, and runs one or more steps of gradient descent on the shared parameters before evaluating f on x.^[1] Because $L_{\text{aux}}$ requires no labels, A can be applied at inference. The set of parameters updated is typically restricted to make adaptation cheap: in the original paper, only the shared feature-extractor weights are touched; in TTT layers, only the hidden state, which is itself a small set of model weights, is updated.^[1]^[2]
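
As a minimal sketch of this recipe, assuming a toy linear encoder and a reconstruction-style auxiliary task (neither is taken from the cited papers), the NumPy code below copies the shared weights, runs a few gradient steps on the label-free auxiliary loss for one test input, and only then applies the frozen main head. The names `aux_loss`, `aux_grad`, `ttt_predict` and the matrices `W`, `D`, `C` are hypothetical.

```python
import numpy as np

def aux_loss(W, D, x):
    """Label-free auxiliary loss: reconstruct x from its encoding W @ x."""
    r = D @ (W @ x) - x
    return float(r @ r)

def aux_grad(W, D, x):
    """Gradient of aux_loss with respect to the shared encoder W."""
    r = D @ (W @ x) - x
    return 2.0 * np.outer(D.T @ r, x)

def ttt_predict(W, D, C, x, steps=5):
    """Adapt a copy of W on the test input x, then classify with the frozen head C."""
    # Step size 1/L for this quadratic loss, where L = 2 * ||x||^2 * ||D||_2^2.
    step = 0.5 / ((x @ x) * np.linalg.norm(D, 2) ** 2)
    W_adapted = W.copy()  # the deployed weights W are never modified
    for _ in range(steps):
        W_adapted -= step * aux_grad(W_adapted, D, x)
    label = int(np.argmax(C @ (W_adapted @ x)))
    return label, W_adapted

rng = np.random.default_rng(0)
d_in, d_hidden, n_classes = 8, 4, 3            # illustrative sizes (assumption)
W = rng.normal(size=(d_hidden, d_in)) * 0.1    # shared encoder, adapted at test time
D = rng.normal(size=(d_in, d_hidden)) * 0.1    # auxiliary head, kept fixed here
C = rng.normal(size=(n_classes, d_hidden))     # main head, kept fixed
x = rng.normal(size=d_in)                      # one unlabeled test input

label, W_adapted = ttt_predict(W, D, C, x)
```

Only the shared encoder changes, and only in a copy, which mirrors the restriction on the updated parameter set described above.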

The original rotation-prediction objective

In the ICML 2020 instantiation, the auxiliary task is image-rotation classification. Given a test image x, the procedure samples one of four rotations, asks the rotation head to predict it, and updates the shared encoder by gradient descent on the cross-entropy loss for this prediction. The classification head's weights remain fixed throughout the adaptation step, so the head sees a slightly different feature vector at test time than it did at training, but never an explicitly mislabeled input. Standard TTT resets the encoder weights to their pre-test values after each prediction, while online TTT does not, accumulating updates that can either help or hurt as the test distribution drifts.^[1]
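
The difference between the standard and online variants can be sketched as follows. This is an illustrative toy, not the paper's network: it assumes a linear encoder `W0`, a linear rotation head `R`, a linear classifier `C`, square single-channel images, and, as a simplification, it uses all four rotations of the image as one small batch instead of sampling a single rotation. All names are hypothetical.

```python
import numpy as np

def rotation_loss_and_grad(W, R, img):
    """Rotation-prediction cross-entropy on one image and its gradient w.r.t. W."""
    xs = np.stack([np.rot90(img, k).ravel() for k in range(4)])  # (4, d) rotated copies
    rows = np.arange(4)
    ys = np.arange(4)                                            # rotation labels 0..3
    logits = xs @ W.T @ R.T                                      # (4, 4)
    logits = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = float(-np.log(probs[rows, ys]).mean())
    dlogits = probs.copy()
    dlogits[rows, ys] -= 1.0
    dlogits /= 4.0
    grad_W = R.T @ dlogits.T @ xs                                # (hidden, d)
    return loss, grad_W

def adapt(W, R, img, lr=0.1):
    """One gradient step on the rotation loss; returns new weights, input untouched."""
    _, grad = rotation_loss_and_grad(W, R, img)
    return W - lr * grad

def run_stream(W0, R, C, images, online):
    """Standard TTT restarts from W0 for every image; online TTT keeps the updates."""
    W = W0
    preds = []
    for img in images:
        start = W if online else W0
        W = adapt(start, R, img)
        preds.append(int(np.argmax(C @ (W @ img.ravel()))))  # frozen classifier head
    return preds, W

rng = np.random.default_rng(0)
side, hidden, n_classes = 4, 8, 10                 # illustrative sizes (assumption)
W0 = rng.normal(size=(hidden, side * side)) * 0.1  # shared encoder
R = rng.normal(size=(4, hidden)) * 0.1             # rotation head
C = rng.normal(size=(n_classes, hidden))           # classification head
images = rng.random(size=(5, side, side))          # a short unlabeled test stream
W0_snapshot = W0.copy()

preds_standard, W_standard = run_stream(W0, R, C, images, online=False)
preds_online, W_online = run_stream(W0, R, C, images, online=True)
```

In both modes the classifier `C` and the rotation head `R` stay fixed; the only difference is whether each image starts from `W0` or from the weights left by the previous image.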

TTT layers as a sequence-modeling primitive

In the 2024 TTT-layers formulation, each token $x_t$ in a sequence is processed by a layer whose internal state $W_t$ is the parameters of a small machine-learning model $f(\cdot; W)$. For TTT-Linear, f is a linear model; for TTT-MLP, f is a two-layer MLP.^[2] Three learnable input projections produce a "training view" $\theta_K x_t$, a "label view" $\theta_V x_t$, and a "test view" $\theta_Q x_t$ per token. The inner-loop self-supervised reconstruction loss is

$$\ell(W; x_t) = \lVert f(\theta_K x_t; W) - \theta_V x_t \rVert^2$$

and the layer's update rule is one step of (mini-batch) gradient descent on this loss, giving $W_t = W_{t-1} - \eta \nabla \ell(W_{t-1}; x_t)$. The layer's output for position t is $f(\theta_Q x_t; W_t)$, the prediction of the updated inner model on the test view.^[2]
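
A minimal NumPy sketch of this forward pass for the linear case, $f(z; W) = Wz$, is given below. It follows the update and output rules exactly as written above, one token at a time; any normalization, learned step sizes, or other refinements used in the released implementations are omitted, and the function name `ttt_linear_forward` and the sizes are hypothetical.

```python
import numpy as np

def ttt_linear_forward(X, theta_K, theta_V, theta_Q, W0, eta):
    """Process tokens in order: one gradient step on the inner loss, then predict."""
    W = W0.copy()
    outputs = []
    for x in X:
        k = theta_K @ x   # training view
        v = theta_V @ x   # label view
        q = theta_Q @ x   # test view
        grad = 2.0 * np.outer(W @ k - v, k)   # gradient of ||W k - v||^2 w.r.t. W
        W = W - eta * grad                    # W_t = W_{t-1} - eta * grad
        outputs.append(W @ q)                 # output uses the updated weights W_t
    return np.stack(outputs), W

rng = np.random.default_rng(0)
d, T = 4, 6                                   # illustrative sizes (assumption)
theta_K, theta_V, theta_Q = (rng.normal(size=(d, d)) * 0.5 for _ in range(3))
X = rng.normal(size=(T, d))
outputs, W_final = ttt_linear_forward(X, theta_K, theta_V, theta_Q,
                                      W0=np.zeros((d, d)), eta=0.05)
```

The state carried between tokens is the matrix `W` alone, so memory use does not depend on how many tokens have been processed.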

This construction has two important properties. First, like a recurrence, it processes one token at a time with a state of fixed size, so the per-token cost is constant and the overall complexity is linear in sequence length. Second, because the hidden state is a trainable model rather than a fixed vector updated by a fixed rule, its capacity can be increased by choosing a more expressive inner model (for example, an MLP instead of a single matrix), addressing a well-known capacity bottleneck of classical RNNs and many state-space models.^[2]

Connection to linear attention

A central theoretical result of the 2024 paper is that TTT-Linear reduces, under a particular set of choices, to linear attention. Specifically, when f is a linear model, the inner update uses batch gradient descent (every gradient is evaluated at the initial weights $W_0$ rather than at the most recent iterate) with learning rate 1/2, and $W_0$ is zero, the layer's outputs across a sequence are identical to those produced by linear attention, with the accumulated inner-loop weights playing the role of the linear-attention KV matrix.^[2] This unifies the RNN and attention perspectives: a TTT-Linear layer can be read either as an RNN that updates a parametric hidden state, or as a generalization of linear attention that supports richer optimizers, non-linear inner models (TTT-MLP), and self-supervised auxiliary losses other than the simple reconstruction described above.^[2]
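
The reduction can be checked numerically. With $W_0 = 0$, the gradient of the reconstruction loss at $W_0$ is $-2(\theta_V x_s)(\theta_K x_s)^\top$, so batch gradient descent with learning rate 1/2 gives $W_t = \sum_{s \le t} (\theta_V x_s)(\theta_K x_s)^\top$, which is the running sum kept by causal linear attention. The sketch below assumes the unnormalized form of linear attention and works directly on already-projected views `K`, `V`, `Q` (rows are tokens); both function names are hypothetical.

```python
import numpy as np

def ttt_linear_batch_gd(K, V, Q, eta=0.5):
    """TTT-Linear with W0 = 0 and batch GD: every gradient is taken at W0."""
    W0 = np.zeros((V.shape[1], K.shape[1]))
    W = W0.copy()
    out = []
    for k, v, q in zip(K, V, Q):
        grad_at_W0 = 2.0 * np.outer(W0 @ k - v, k)  # equals -2 * v k^T because W0 = 0
        W = W - eta * grad_at_W0
        out.append(W @ q)
    return np.stack(out)

def linear_attention(K, V, Q):
    """Unnormalized causal linear attention: running sum of v k^T applied to q."""
    S = np.zeros((V.shape[1], K.shape[1]))
    out = []
    for k, v, q in zip(K, V, Q):
        S = S + np.outer(v, k)
        out.append(S @ q)
    return np.stack(out)

rng = np.random.default_rng(0)
T, d = 7, 5                                    # illustrative sizes (assumption)
K, V, Q = (rng.normal(size=(T, d)) for _ in range(3))
same = np.allclose(ttt_linear_batch_gd(K, V, Q), linear_attention(K, V, Q))
```

Changing `eta`, starting from a non-zero `W0`, or evaluating each gradient at the latest iterate breaks the equality, which is one way to see where TTT layers generalize linear attention.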

Mini-batch TTT and dual form

To make TTT layers practical on accelerators, the paper introduces a mini-batch TTT algorithm that batches multiple consecutive tokens before computing each inner update, and a dual form that re-expresses the inner-loop updates in a way that can be implemented with matrix multiplications matched to GPU tensor cores. Without these systems-level techniques the wall-clock cost of TTT would dominate, but with them TTT-Linear is reported to be competitive in throughput with Mamba and significantly faster than Transformer models at long context lengths.^[2]
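
For the linear inner model, the idea behind the dual form can be illustrated in a few lines. Within one mini-batch every gradient is taken at the weights `W0` held at the start of the batch, so the per-token loop collapses into a handful of matrix products with a causal mask. The sketch below is a simplified illustration under that assumption and does not reproduce the paper's kernels or its TTT-MLP variant; the function names are hypothetical and rows of `K`, `V`, `Q` are tokens.

```python
import numpy as np

def minibatch_primal(W0, K, V, Q, eta):
    """Loop form: one weight update per token, all gradients evaluated at W0."""
    W = W0.copy()
    out = []
    for k, v, q in zip(K, V, Q):
        W = W - eta * 2.0 * np.outer(W0 @ k - v, k)
        out.append(W @ q)
    return np.stack(out), W

def minibatch_dual(W0, K, V, Q, eta):
    """Matmul form: same outputs without materializing the intermediate weights."""
    delta = K @ W0.T - V            # (b, d_out) residuals of the inner model at W0
    attn = np.tril(Q @ K.T)         # (b, b): attn[t, s] = q_t . k_s, kept only for s <= t
    out = Q @ W0.T - 2.0 * eta * attn @ delta
    W_end = W0 - 2.0 * eta * delta.T @ K   # weights handed to the next mini-batch
    return out, W_end

rng = np.random.default_rng(0)
b, d = 16, 8                        # illustrative mini-batch size and width (assumption)
W0 = rng.normal(size=(d, d)) * 0.1
K, V, Q = (rng.normal(size=(b, d)) for _ in range(3))
out_p, W_p = minibatch_primal(W0, K, V, Q, eta=0.01)
out_d, W_d = minibatch_dual(W0, K, V, Q, eta=0.01)
```

The dual version replaces `b` rank-one updates with dense products of shape `(b, d)` and `(b, b)`, which is the kind of workload accelerators handle efficiently.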

TTT for in-context learning of language models

The 2024 MIT TTT-for-ARC paper applies a different style of TTT to an already-trained large language model. Given an ARC task that consists of several input-output example pairs and a held-out test query, the method constructs "leave-one-out" tasks from the given examples, augments them through rule-based transformations such as rotations and color permutations, and uses these augmented tasks to fine-tune small task-specific LoRA adapters on top of a fine-tuned Llama-3 8B model. Each task gets its own adapter, trained for two epochs with batch size two and AdamW at learning rate around 5e-5 or 1e-4; the LoRA rank is 128 with alpha 16, applied to attention query and value projections, MLP weights, and the output projection.^[3] The adapted model is then run on the held-out query. This procedure is TTT in the sense that the model's parameters are updated using only what the test instance itself provides (its few-shot examples): the loss is supervised in form, but its targets come from the task's own demonstration pairs rather than from external labels or the held-out answer.
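
The data-construction step can be sketched in plain Python. The code below is an illustration of the leave-one-out idea with one rotation and one color permutation, not the authors' pipeline: the grid encoding (lists of integer rows with ten colors), the dictionary layout, and all function names are assumptions, and the LoRA fine-tuning itself is not shown.

```python
import random

def rotate90(grid):
    """Rotate a grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def permute_colors(grid, mapping):
    """Relabel every cell through a color permutation."""
    return [[mapping[c] for c in row] for row in grid]

def leave_one_out_tasks(examples):
    """Hold out each example in turn; the remaining pairs act as demonstrations."""
    tasks = []
    for i, (held_in, held_out) in enumerate(examples):
        demos = examples[:i] + examples[i + 1:]
        tasks.append({"demos": demos, "query": held_in, "target": held_out})
    return tasks

def augment_task(task, rng, num_colors=10):
    """Apply one shared rotation and color permutation to every grid in a task."""
    colors = list(range(num_colors))
    shuffled = colors[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(colors, shuffled))

    def transform(grid):
        return permute_colors(rotate90(grid), mapping)

    return {
        "demos": [(transform(a), transform(b)) for a, b in task["demos"]],
        "query": transform(task["query"]),
        "target": transform(task["target"]),
    }

# Three toy demonstration pairs (input grid, output grid); contents are made up.
examples = [
    ([[1, 0, 0], [0, 0, 0]], [[0, 0, 1], [0, 0, 0]]),
    ([[0, 2, 0], [0, 0, 0]], [[0, 2, 0], [0, 0, 0]]),
    ([[0, 0, 0], [3, 0, 0]], [[0, 0, 0], [0, 0, 3]]),
]
tasks = leave_one_out_tasks(examples)
augmented = [augment_task(t, random.Random(seed)) for seed, t in enumerate(tasks)]
```

Because the same transformation is applied to demonstrations, query, and target, each augmented task stays internally consistent and can serve as an additional training example for the per-task adapter.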

Variants and Implementations

Variant	Year	Lead author(s)	Adaptation target	Update signal	Notes
Standard TTT (vision)	2020	Sun et al.	Shared encoder weights	Rotation-prediction loss on the test image	Resets weights per image^[1]
Online TTT (vision)	2020	Sun et al.	Shared encoder weights	Rotation-prediction loss on streaming images	Accumulates updates across stream^[1]
TTT-MAE	2022	Gandelsman et al.	ViT encoder, MAE setup	MAE reconstruction loss	Stronger auxiliary task, ViT backbone^[6]
TTT-Linear	2024	Sun et al.	Inner linear-model hidden state	Per-token reconstruction loss	Linear-time language model layer^[2]
TTT-MLP	2024	Sun et al.	Inner two-layer MLP hidden state	Per-token reconstruction loss	Higher capacity, more memory I/O^[2]
TTT for ARC	2024	Akyurek et al.	Task-specific LoRA adapters	Loss on augmented few-shot examples	8B model, 53% on ARC public val^[3]
TTT-Video	2025	Dalal et al.	TTT layers added to a DiT	Inner-loop reconstruction during finetune	One-minute coherent video^[4]

Official open-source releases

The 2024 TTT-layers paper was accompanied by two official open-source repositories:

ttt-lm-pytorch: a PyTorch implementation built on the Hugging Face Transformers library, distributed under the MIT license. The authors describe it as "a naive implementation of TTT layers for tutorial purposes" and explicitly recommend against using it for serious training because it lacks systems optimization. A "ttt-1b style configuration" is provided via TTTConfig.^[7]
ttt-lm-jax: a JAX implementation supporting both GPUs and Cloud TPU VMs (Python 3.11), used for the speed and scaling experiments reported in the paper. Separate dependency lists for GPU and TPU are provided.^[8]

Additional code, including custom kernels for the dual-form mini-batch TTT update and a video-generation codebase (ttt-video-dit), was released through the same test-time-training GitHub organization.^[7]^[8]

Other implementations

Several independent or derivative implementations have appeared:

Test-Time Training on Nearest Neighbors for Large Language Models by Hardt and Sun (arXiv 2305.18466, ICLR 2024) fine-tunes a language model on the nearest training-set neighbors of each test input, retrieved from an external index. Code is published as the socialfoundations/tttlm repository.^[9]
TTT-AdaptNet (ECCV 2024) uses adaptive linear layers for test-time adaptation in image reconstruction.^[10]
TTRL: Test-Time Reinforcement Learning (arXiv 2504.16084, 2025) treats TTT as a continual learning setup driven by a reward signal at inference time.^[11]
Comprehensive surveys such as Liang et al., A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts (IJCV, 2025), collect dozens of TTT variants and place them in a taxonomy that distinguishes batch, online, and per-sample adaptation regimes.^[12]

Results

Vision benchmarks (original TTT, 2020)

On CIFAR-10-C, which applies 15 standard corruption types at five severities to the CIFAR-10 test set, the original TTT recipe improved over standard and pretraining-only baselines without hurting clean-data accuracy. The online variant improved further as more test samples were processed.^[1] On ImageNet-C, the larger-scale counterpart, the paper reports gains that grow with the size of the test stream, consistent with the model adapting gradually to each corruption.^[1] The paper also evaluates on additional shift benchmarks including video robustness datasets, and reports consistent improvements over the baselines used.^[1]

Language modeling on The Pile and Books3 (TTT layers, 2024)

The TTT-layers paper trains models from 125M to 1.3B parameters on The Pile at 2k and 8k context lengths and on Books3 at 32k context. The paper reports that at 2k context, TTT-Linear, Mamba, and a Transformer baseline have "mostly overlapping" perplexity curves across scales. At 8k context, both TTT-Linear and TTT-MLP perform "significantly better" than Mamba. In the paper's ablation table, reported for the 125M-parameter model, TTT-Linear reaches a perplexity of about 11.09, compared to roughly 15.23 for a linear-attention baseline. At 32k context on Books3, the Transformer and the TTT variants continue reducing perplexity as context grows, while Mamba plateaus after roughly 16k tokens.^[2] TTT-MLP achieves better perplexity than TTT-Linear at every scale tested, but its larger inner MLP creates additional memory-I/O pressure that partially offsets the quality gain.^[2]

ARC-AGI (TTT for few-shot learning, 2024)

On the ARC-AGI public validation set, the MIT group reports that applying TTT to a Llama-3 8B instruction-tuned model achieves 53.0% accuracy, an improvement of "nearly 25%" over the previous state of the art for purely neural public approaches, and "up to 6x higher accuracy compared to fine-tuned baselines."^[3] When combined with an existing program-synthesis solver, the joint system reaches 61.9% on ARC public validation, which the authors describe as matching average human performance on the same set.^[3] On BIG-Bench Hard in a 10-shot setting, the same TTT recipe improves accuracy from 50.5% to 57.8%, a 7.3 point gain.^[3]

Video generation (TTT-Video, 2025)

The TTT-Video work adds TTT-MLP layers into a pre-trained 5 billion-parameter diffusion transformer (CogVideoX-5B), then finetunes on a custom dataset of more than seven hours of classic Tom-and-Jerry cartoons broken into 3-second annotated segments. Compared to baselines including Mamba 2, Gated DeltaNet, and a sliding-window-attention variant, the TTT-MLP model leads by 34 Elo points in a 100-video-per-method human evaluation, producing one-minute videos that the authors describe as more temporally coherent.^[4]^[13] The paper notes residual visual artifacts and emphasizes that the implementation is not yet efficient enough for serving.^[4]
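
To put the Elo margin in perspective, a rating gap can be converted into an expected pairwise preference rate. The helper below assumes the conventional logistic Elo formula with a scale of 400, which may differ in detail from the rating procedure used in the paper; the function name is hypothetical.

```python
def elo_expected_score(rating_gap, scale=400.0):
    """Expected win rate of the higher-rated method under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))

win_rate = elo_expected_score(34)   # about 0.549
```

Under that assumption, a 34-point lead corresponds to being preferred in roughly 55% of head-to-head comparisons, a consistent but modest advantage.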

Comparison with other sequence-modeling approaches

Approach	Hidden state	Per-token cost	Long-context behavior	Test-time adaptation
Transformer (self-attention)	Full KV cache, grows with t	Linear in t per token (quadratic over the full sequence)	Strong, but cost grows	No adaptation; weights frozen
RNN / LSTM	Fixed-size vector	Constant	Fades or saturates over very long contexts	No adaptation
Mamba / state-space models	Selective state vector	Constant	Strong up to roughly 16k tokens; plateaus thereafter on Books3 in TTT paper experiments^[2]	No adaptation
Linear attention	Matrix accumulator	Constant per token	Equivalent to TTT-Linear with zero $W_0$ and lr=1/2^[2]	Implicit accumulator, not gradient-based
TTT-Linear	Linear inner model trained at test time	Constant per token, plus inner-loop GD step	Reduces perplexity beyond 16k on Books3^[2]	Yes, by inner-loop SSL update
TTT-MLP	Two-layer MLP inner model	Constant per token, more memory I/O	Best long-context behavior reported in the paper^[2]	Yes, by inner-loop SSL update

The qualitative picture from the 2024 TTT-layers paper is that classical fixed-vector RNNs and even modern SSMs face a representational ceiling that grows visible as context length increases. Self-attention avoids that ceiling by simply keeping the entire context in a KV cache, at quadratic cost. TTT layers attempt a third path: keep the per-token cost constant by storing context inside a small parametric model, and increase that model's expressivity by making the inner state itself a neural network. The fact that TTT-Linear collapses to linear attention under a specific reduction makes the relationship to existing methods precise rather than merely analogical.^[2]
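
The storage side of this trade-off can be made concrete with a back-of-the-envelope count. The sketch below assumes a single layer with a single head of width `d`, ignores biases and any auxiliary buffers, and uses an illustrative width that is not taken from the paper; both helper names are hypothetical.

```python
def kv_cache_floats(t, d):
    """Numbers stored by a single-head KV cache after t tokens: t keys plus t values."""
    return 2 * t * d

def ttt_linear_state_floats(d):
    """Numbers stored by a d-by-d linear inner model, independent of t."""
    return d * d

d = 1024  # illustrative width (assumption)
for t in (512, 2048, 32768):
    print(t, kv_cache_floats(t, d), ttt_linear_state_floats(d))
```

Under these assumptions the two are equal at t = d/2 tokens, and at 32,768 tokens the cache is 64 times larger than the fixed-size state, which is why constant-state layers become attractive as context grows.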

Applications

Robustness to distribution shifts

The original use case for TTT remains adapting a deployed image classifier to corruptions or covariate shifts that were not present at training. Because the auxiliary task (rotation prediction or MAE reconstruction) does not require labels, the procedure can be applied wherever a modest per-sample slowdown at inference is acceptable. Online TTT, with its accumulated updates, is particularly useful when the deployment environment drifts slowly and a continuously adapting model is desired.^[1]^[6]

Long-context language modeling

The TTT-layer formulation suggests an alternative architecture for large language models in which the model's hidden state can absorb information from arbitrarily long contexts without paying quadratic attention cost. Empirically, TTT-Linear and TTT-MLP keep reducing perplexity beyond 16k tokens on Books3, where Mamba plateaus.^[2] This makes TTT a candidate primitive for long-document, code-base-scale, or multi-turn agent tasks where context windows reach into the hundreds of thousands of tokens. The official open-source releases give an entry point for experimentation, though the PyTorch reference implementation is not optimized for training and is intended as a tutorial.^[7]

Few-shot reasoning and ARC-AGI

The MIT result on ARC-AGI puts TTT on the map as an approach to few-shot abstract reasoning. The recipe is naturally suited to ARC-style tasks because each task is presented as a handful of input-output pairs that already form a tiny supervised dataset, perfectly matched to TTT's "learn at inference time" framing.^[3] More broadly, the ability to spin up a per-task LoRA adapter at inference suggests applications in personalized assistants, on-device adaptation, and tasks where the user's intent is conveyed through demonstrations rather than instructions.

Long-form video and other generative settings

The TTT-Video result extends TTT into generative modeling, applying TTT layers inside a pre-trained DiT to handle the very long sequences that one-minute videos require. The reported 34 Elo-point lead over baselines such as Mamba 2 on coherent Tom-and-Jerry video generation suggests that the inner-loop expressivity of TTT can carry over from text to dense pixel sequences.^[4]^[13]

Limitations

Despite the encouraging results, TTT carries several practical limitations.

First, inference-time cost. Performing a gradient-descent step (or many steps) per test sample or per token can multiply the FLOPs and wall-clock time of inference. The TTT-layers paper introduces mini-batch TTT and a dual form to keep wall-clock time competitive with Mamba on existing accelerators, but the authors note that TTT-MLP still faces memory-I/O bottlenecks and that further systems work is needed.^[2] The TTT-Video paper similarly notes that its implementation is not yet efficient enough for production serving.^[4]

Second, risk of degradation. The original TTT paper observes that updating model weights at test time can hurt performance on clean, in-distribution data if the auxiliary task is poorly chosen or if too many adaptation steps are taken. The standard variant resets weights between samples partly to bound this risk. Online TTT, while powerful when the test stream is consistent, can drift if the distribution changes within the stream.^[1]

Third, auxiliary-task design. The framework is only as good as the self-supervised loss used in the inner loop. Rotation prediction works for natural images but is unlikely to help for, say, satellite imagery where rotational invariance is wanted. Per-token reconstruction in TTT layers may have its own systematic failure modes when sequences contain repeated structure or adversarial inputs.^[2]^[6]

Fourth, evaluation maturity. As of mid-2026, head-to-head comparisons of TTT-style architectures against modern Transformer and Mamba 2 baselines exist mostly at sub-2B parameter scale, on a small set of language-modeling and reasoning benchmarks. The behavior of TTT layers at frontier-model scale, in instruction-tuned settings, and under adversarial inputs is still emerging.^[2]^[12]

Fifth, interaction with existing inference stacks. Many production systems assume frozen weights at serve time, which simplifies batching, KV-caching, request routing, and quantization. TTT layers, especially those that mutate weights per token, complicate these assumptions and may require new serving infrastructure to deploy widely. Hugging Face Transformers integration in ttt-lm-pytorch is described by the authors as primarily for study rather than for high-throughput deployment.^[7]

Related Work

TTT belongs to a broader family of inference-time adaptation techniques. Test-time adaptation (TTA) usually refers to methods that adjust only batch-normalization statistics or a few parameters at test time on a batch of unlabeled samples; surveys such as Liang et al. taxonomize over fifty TTA methods, of which TTT is one branch defined by its use of an explicit auxiliary loss.^[12] Test-time augmentation averages predictions over augmented copies of the test input without changing weights and is therefore not TTT.

Other adjacent fields include meta-learning (which trains models to be quickly adaptable from a few examples), continual learning (which adapts on labeled streaming data), in-context learning (which adapts behavior without weight changes by conditioning on examples in the prompt), and online learning. The 2024 TTT-for-ARC paper explicitly compares TTT with in-context learning, finding that explicit gradient-based adaptation outperforms purely in-context inference on ARC-style tasks at the scale tested.^[3]

Within sequence modeling, TTT layers are most directly comparable to selective state space models such as Mamba and Mamba 2, to linear attention variants, and to RWKV-style linear RNNs. The relationship to linear attention is formal (Theorem 1 of the 2024 paper); the relationship to Mamba is empirical and shows up most clearly at context lengths beyond 16k tokens.^[2] The TTT-Video paper compares against Gated DeltaNet and sliding-window attention, both of which are recent alternatives in the linear-time-RNN family.^[4]

References

1. Sun, Yu; Wang, Xiaolong; Liu, Zhuang; Miller, John; Efros, Alexei A.; Hardt, Moritz, "Test-Time Training with Self-Supervision for Generalization under Distribution Shifts", arXiv preprint (ICML 2020), 2019-09-29 (revised 2020-07-01). https://arxiv.org/abs/1909.13231. Accessed 2026-05-20.
2. Sun, Yu; Li, Xinhao; Dalal, Karan; Xu, Jiarui; Vikram, Arjun; Zhang, Genghan; Dubois, Yann; Chen, Xinlei; Wang, Xiaolong; Koyejo, Sanmi; Hashimoto, Tatsunori; Guestrin, Carlos, "Learning to (Learn at Test Time): RNNs with Expressive Hidden States", arXiv preprint, 2024-07-05. https://arxiv.org/abs/2407.04620. Accessed 2026-05-20.
3. Akyurek, Ekin; Damani, Mehul; Zweiger, Adam; Qiu, Linlu; Guo, Han; Pari, Jyothish; Kim, Yoon; Andreas, Jacob, "The Surprising Effectiveness of Test-Time Training for Few-Shot Learning", arXiv preprint, 2024-11-11. https://arxiv.org/abs/2411.07279. Accessed 2026-05-20.
4. Dalal, Karan; Koceja, Daniel; Hussein, Gashon; Xu, Jiarui; Zhao, Yue; Song, Youjin; Han, Shihao; Cheung, Ka Chun; Kautz, Jan; Guestrin, Carlos; Hashimoto, Tatsunori; Koyejo, Sanmi; Choi, Yejin; Sun, Yu; Wang, Xiaolong, "One-Minute Video Generation with Test-Time Training", arXiv preprint, 2025-04-07. https://arxiv.org/abs/2504.05298. Accessed 2026-05-20.
5. Sun, Yu; Wang, Xiaolong; Liu, Zhuang; Miller, John; Efros, Alexei; Hardt, Moritz, "Test-Time Training with Self-Supervision for Generalization Under Distribution Shifts", Proceedings of the 37th International Conference on Machine Learning, PMLR vol. 119, 2020. https://proceedings.mlr.press/v119/sun20b.html. Accessed 2026-05-20.
6. Gandelsman, Yossi; Sun, Yu; Chen, Xinlei; Efros, Alexei A., "Test-Time Training with Masked Autoencoders", arXiv preprint, 2022-09-15. https://arxiv.org/abs/2209.07522. Accessed 2026-05-20.
7. test-time-training, "ttt-lm-pytorch: Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States", GitHub repository, 2024 (MIT License). https://github.com/test-time-training/ttt-lm-pytorch. Accessed 2026-05-20.
8. test-time-training, "ttt-lm-jax: Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States", GitHub repository, 2024. https://github.com/test-time-training/ttt-lm-jax. Accessed 2026-05-20.
9. socialfoundations, "tttlm: Test-time-training on nearest neighbors for large language models", GitHub repository (ICLR 2024 companion code), 2024. https://github.com/socialfoundations/tttlm. Accessed 2026-05-20.
10. Zhao, Yutian and collaborators, "TTT-AdaptNet: Test-time Model Adaptation for Image Reconstruction Using Self-supervised Adaptive Layers (ECCV 2024)", GitHub repository, 2024. https://github.com/yutianzhao-00/TTT-AdaptNet. Accessed 2026-05-20.
11. Zuo, Yuxin et al., "TTRL: Test-Time Reinforcement Learning", arXiv preprint, 2025-04-22 (revised 2025-06-30). https://arxiv.org/abs/2504.16084. Accessed 2026-05-20.
12. Liang, Jian; He, Ran; Tan, Tieniu, "A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts", International Journal of Computer Vision, 133(1):31-64, 2025 (preprint arXiv 2303.15361). https://arxiv.org/abs/2303.15361. Accessed 2026-05-20.
13. test-time-training, "One-Minute Video Generation with Test-Time Training (project page)", 2025. https://test-time-training.github.io/video-dit/. Accessed 2026-05-20.
