# Test-Time Training (TTT)

> Source: https://aiwiki.ai/wiki/test_time_training
> Updated: 2026-07-16
> Categories: AI Inference, Machine Learning, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Test-Time Training (TTT)** is a family of machine learning techniques in which a model updates a subset of its own parameters at inference time, optimizing a self-supervised auxiliary loss derived from the test input itself before producing the final prediction. The approach was introduced by Yu Sun and collaborators in the paper *Test-Time Training with Self-Supervision for Generalization under Distribution Shifts* (ICML 2020), where it was used to improve image-classifier robustness to corruptions and distribution shifts.[^1] In 2024 the same lead author extended the idea into a sequence-modeling primitive, the *TTT layer*, whose hidden state is itself a small neural network trained by gradient descent as tokens stream in, giving an alternative to attention and to linear-state recurrences such as [Mamba](/wiki/mamba).[^2] Subsequent work has applied TTT to abstract reasoning benchmarks such as [ARC-AGI](/wiki/arc_agi), to long-form video generation, and to in-context-learning regimes for [large language models](/wiki/large_language_model).[^3][^4]

Because TTT performs gradient updates per test instance (or per token), it sits between conventional supervised inference, which keeps weights frozen, and full continual learning, which adapts on labeled streaming data. The defining requirement is that the update signal comes from the test input under a self-supervised objective, with no test labels available. This positions TTT as a form of [test-time compute](/wiki/test_time_compute) that trades extra inference-time FLOPs for adaptation rather than for more search or longer chains of thought.

## History

### Background and predecessors

Adapting a model to new distributions at inference time is a long-standing concern in machine learning, encompassing unsupervised [domain adaptation](/wiki/domain_adaptation), transductive learning, and online learning. Classical domain-adaptation methods typically require an unlabeled batch from the target distribution at training time, while transductive methods require all test inputs in advance. TTT departs from both: each test sample defines its own learning problem, and only the current sample (or a short stream) is used to update parameters before prediction.[^1] The conceptual move is to recast inference as a small training loop driven by a [self-supervised](/wiki/self_supervised_learning) task that is the same at training time and at test time, so that any improvement on the auxiliary task is expected to transfer to the main task.

### The 2019 to 2020 ICML paper

The original TTT paper, posted to arXiv in September 2019 and published in the *Proceedings of the 37th [ICML](/wiki/icml)* (2020), was authored by Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A. Efros, and Moritz Hardt, with affiliations spanning [UC Berkeley](/wiki/uc_berkeley) and UC San Diego.[^1][^5] The system trains a Y-shaped network with a shared feature extractor and two heads: a main head for image classification and a self-supervised head that predicts the [rotation](/wiki/rotational_invariance) (0, 90, 180, or 270 degrees) applied to an input image. At test time the shared feature extractor is fine-tuned to minimize the rotation-prediction loss on the current image, and the updated features are then used by the (unchanged) classification head to predict a label.[^1]

The paper reports two variants. *Standard TTT* updates a fresh copy of the model independently for each test image and then discards the update, so the next image starts from the original weights. *Online TTT* keeps the accumulated updates across the test stream, so the model becomes progressively more specialized to the test distribution.[^1] Both variants were evaluated on CIFAR-10-C, [ImageNet](/wiki/imagenet)-C, and other corruption and shift benchmarks designed for measuring robustness; online TTT in particular showed monotonically improving performance as more test samples were processed.[^1]

### TTT for vision after 2020

Subsequent work extended the original recipe to richer self-supervised objectives. *Test-Time Training with Masked Autoencoders*, by Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei A. Efros (arXiv 2209.07522, 2022), replaces rotation prediction with the reconstruction objective of a [Masked Autoencoder](/wiki/masked_autoencoder) (MAE).[^6] Because MAE is a strong self-supervised objective for [Vision Transformers](/wiki/vision_transformer), using it as the inner-loop task allowed TTT to generalize across a wider range of corruptions. The paper also offers a theoretical analysis framed around a bias-variance trade-off.[^6]

### Expansion to sequence modeling: TTT layers (2024)

In July 2024 Yu Sun, with co-authors Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin, posted *Learning to (Learn at Test Time): RNNs with Expressive Hidden States* (arXiv 2407.04620), which generalized TTT from a per-image adaptation procedure into a generic sequence-modeling primitive.[^2] The key reframe is that a [recurrent neural network](/wiki/recurrent_neural_network) can be viewed as a sequence-to-sequence map whose hidden state compresses past context; the TTT layer makes that hidden state itself a small machine-learning model whose parameters are updated by self-supervised gradient descent on each incoming token.[^2] Two instantiations are introduced: TTT-Linear (the hidden state is a linear model) and TTT-MLP (the hidden state is a two-layer multi-layer perceptron).[^2]

### Recent extensions (2024 to 2025)

Two further developments are noteworthy. First, in November 2024 Ekin Akyurek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, and Jacob Andreas (MIT) released *The Surprising Effectiveness of Test-Time Training for Few-Shot Learning*, in which a TTT recipe is applied to an 8 billion-parameter language model to attack the [ARC-AGI](/wiki/arc_agi) benchmark.[^3] Second, in April 2025, Karan Dalal and collaborators released *One-Minute Video Generation with Test-Time Training*, adding TTT layers to a pre-trained 5 billion-parameter [diffusion transformer](/wiki/diffusion_transformer) to generate one-minute Tom-and-Jerry-style cartoons from text storyboards.[^4] These results expanded TTT's footprint from a robustness technique into a generic tool for long-context generation and few-shot reasoning.

## Technical Details

### General recipe

The general TTT recipe involves three ingredients: a main task f trained at training time, a self-supervised auxiliary task g sharing some parameters with f, and an adaptation procedure A that takes the test input x, evaluates the auxiliary loss $$L_{\text{aux}}(g; x)$$, and runs one or more steps of gradient descent on the shared parameters before evaluating f on x.[^1] Because $$L_{\text{aux}}$$ requires no labels, A can be applied at inference. The set of parameters updated is typically restricted to make adaptation cheap: in the original paper, only the shared feature-extractor weights are touched; in TTT layers, only the hidden state, which is itself a small set of model weights, is updated.[^1][^2]
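The single-step case of this recipe can be written compactly. The notation below is illustrative and is not taken from the papers: $$\theta_s$$ denotes the shared parameters, $$\theta_m$$ the main-head parameters, $$\theta_a$$ the auxiliary-head parameters, and $$\eta$$ the test-time learning rate, with $$L_{\text{aux}}$$ written as a function of the parameters that define g.

$$
\theta_s' = \theta_s - \eta \, \nabla_{\theta_s} L_{\text{aux}}(x; \theta_s, \theta_a), \qquad \hat{y} = f(x; \theta_s', \theta_m)
$$

Only $$\theta_s$$ changes during adaptation; $$\theta_m$$ and $$\theta_a$$ stay at their training-time values. Taking several gradient steps simply repeats the first update before f is evaluated.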

### The original rotation-prediction objective

In the ICML 2020 instantiation, the auxiliary task is image-rotation classification. Given a test image x, the procedure samples one of four rotations, asks the rotation head to predict it, and updates the shared encoder by gradient descent on the cross-entropy loss for this prediction. The classification head's weights remain fixed throughout the adaptation step, so the head sees a slightly different feature vector at test time than it did at training, but never an explicitly mislabeled input. Standard TTT resets the encoder weights to their pre-test values after each prediction, while online TTT does not, accumulating updates that can either help or hurt as the test distribution drifts.[^1]
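The difference between the standard and online variants can be sketched with a deliberately tiny stand-in for the real system. Everything in the block below is an assumption made for illustration: the shared "encoder" is a single centering bias `b`, the label-free auxiliary loss is `(mean(x) - b)^2` rather than rotation prediction, and the frozen main head is a fixed threshold rule. None of these names or choices come from the paper.

```python
import copy

def aux_grad(shared, x):
    """Gradient of the toy label-free loss (mean(x) - b)^2 with respect to b."""
    mean_x = sum(x) / len(x)
    return -2.0 * (mean_x - shared["b"])

def main_head(shared, x, threshold=1.2):
    """Frozen main head: returns 1 if any centered feature exceeds the threshold."""
    return int(max(xi - shared["b"] for xi in x) > threshold)

def ttt_predict(shared, x, lr=0.25, steps=1, online=False):
    """Adapt the shared parameters on x's auxiliary loss, then predict on x.

    Standard TTT (online=False) adapts a throwaway copy, so `shared` is unchanged.
    Online TTT (online=True) adapts `shared` in place, so updates accumulate.
    """
    adapted = shared if online else copy.deepcopy(shared)
    for _ in range(steps):
        adapted["b"] -= lr * aux_grad(adapted, x)
    return main_head(adapted, x)

if __name__ == "__main__":
    shifted = [3.0, 3.25, 2.75, 3.0]   # test input whose mean has shifted to 3.0
    params = {"b": 0.0}
    print(ttt_predict(params, shifted), params)               # standard: params untouched
    print(ttt_predict(params, shifted, online=True), params)  # online: b moves toward 3.0
    print(ttt_predict(params, shifted, online=True), params)  # online: b moves further
```

In this toy, standard TTT returns the same prediction every time it sees the shifted input because each call starts from the original bias. Online TTT keeps moving the bias toward the shifted mean, so its prediction changes once enough updates have accumulated.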

### TTT layers as a sequence-modeling primitive

In the 2024 TTT-layers formulation, each token $$x_t$$ in a sequence is processed by a layer whose internal state $$W_t$$ is the parameters of a small machine-learning model $$f(\cdot; W)$$. For TTT-Linear, f is a linear model; for TTT-MLP, f is a two-layer MLP.[^2] Three learnable input projections produce a "training view" $$\theta_K x_t$$, a "label view" $$\theta_V x_t$$, and a "test view" $$\theta_Q x_t$$ per token. The inner-loop self-supervised reconstruction loss is

$$
\ell(W; x_t) = \lVert f(\theta_K x_t; W) - \theta_V x_t \rVert^2
$$

and the layer's update rule is one step of (mini-batch) gradient descent on this loss, giving $$W_t = W_{t-1} - \eta \nabla \ell(W_{t-1}; x_t)$$. The layer's output for position t is $$f(\theta_Q x_t; W_t)$$, the prediction of the updated inner model on the test view.[^2]
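The update and output rules above can be sketched for the linear case, where $$f(x; W) = Wx$$ and the gradient of the loss is $$2 (W \theta_K x_t - \theta_V x_t)(\theta_K x_t)^\top$$. This is a minimal pure-Python sketch under simplifying assumptions: it takes one plain gradient step per token and treats the projections as fixed matrices. The function names are hypothetical and it is not the authors' implementation.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def ttt_linear_layer(xs, theta_K, theta_V, theta_Q, W0, eta):
    """Run a simplified TTT-Linear layer over a sequence of token vectors.

    For each token: build the three views, take one gradient step on
    ||W k - v||^2, then emit W_t q using the updated weights.
    """
    W = [row[:] for row in W0]          # hidden state = weights of the inner model
    outputs = []
    for x in xs:
        k = matvec(theta_K, x)          # training view
        v = matvec(theta_V, x)          # label view
        q = matvec(theta_Q, x)          # test view
        err = [p - t for p, t in zip(matvec(W, k), v)]      # W k - v
        # Gradient of the squared error w.r.t. W is 2 * err * k^T.
        W = [[w - eta * 2.0 * e * kj for w, kj in zip(row, k)]
             for row, e in zip(W, err)]
        outputs.append(matvec(W, q))
    return outputs, W

if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    outs, W_final = ttt_linear_layer(tokens, identity, identity, identity, zeros, eta=0.5)
    print(outs)
    print(W_final)
```

With identity projections, zero initial weights, and a step size of 1/2, the inner model in this example learns to reconstruct each basis vector after seeing it once, so the final state is the identity matrix. The state has a fixed size no matter how many tokens are processed, which is what keeps the per-token cost constant.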

This construction has two important properties. First, like a recurrence, it processes one token at a time, so the per-token cost is constant and the overall complexity is linear in sequence length. Second, because the hidden state is a parameterized function rather than a fixed-size vector, the layer can in principle compress information far richer than a single matrix or vector, addressing a well-known capacity bottleneck of classical RNNs and many state-space models.[^2]

### Connection to linear attention

A central theoretical result of the 2024 paper is that TTT-Linear reduces, under a particular set of choices, to [linear attention](/wiki/linear_attention). Specifically, when f is a linear model, the inner update uses batch gradient descent with learning rate 1/2, and the initial weights $$W_0$$ are zero, the layer's outputs across a sequence are identical to those produced by linear attention, where the accumulated inner-loop weights play the role of a linear-attention KV matrix.[^2] This unifies the RNN and attention perspectives: a TTT-Linear layer can be read either as an RNN that updates a parametric hidden state, or as a generalization of linear attention that supports richer optimizers, non-linear inner models (TTT-MLP), and self-supervised auxiliary losses other than the simple reconstruction described above.[^2]
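The reduction can be checked numerically on a small example. With $$W_0 = 0$$, every batch-gradient term is $$-2\,\theta_V x_s (\theta_K x_s)^\top$$, so a learning rate of 1/2 makes the accumulated weights equal to $$\sum_{s \le t} \theta_V x_s (\theta_K x_s)^\top$$, the unnormalized linear-attention state. The sketch below assumes the three views have already been computed and are passed in as `ks`, `vs`, and `qs`. The function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_attention(ks, vs, qs):
    """Causal, unnormalized linear attention: z_t = sum_{s<=t} v_s * (k_s . q_t)."""
    outputs = []
    for t, q in enumerate(qs):
        z = [0.0] * len(vs[0])
        for s in range(t + 1):
            weight = dot(ks[s], q)
            z = [zi + weight * vi for zi, vi in zip(z, vs[s])]
        outputs.append(z)
    return outputs

def ttt_linear_batch_gd(ks, vs, qs, eta=0.5):
    """TTT-Linear with W_0 = 0 and batch GD: every gradient is evaluated at W_0."""
    d_out, d_in = len(vs[0]), len(ks[0])
    W0 = [[0.0] * d_in for _ in range(d_out)]
    W = [row[:] for row in W0]
    outputs = []
    for k, v, q in zip(ks, vs, qs):
        err = [dot(row, k) - vi for row, vi in zip(W0, v)]   # W_0 k - v = -v
        W = [[w - eta * 2.0 * e * kj for w, kj in zip(row, k)]
             for row, e in zip(W, err)]
        outputs.append([dot(row, q) for row in W])
    return outputs

if __name__ == "__main__":
    ks = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
    vs = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5], [0.0, 3.0, 1.0]]
    qs = [[1.0, 1.0], [0.0, 2.0], [-1.0, 0.5]]
    print(linear_attention(ks, vs, qs))
    print(ttt_linear_batch_gd(ks, vs, qs))
```

The two functions print the same sequence of outputs. Changing `eta`, starting from a non-zero `W0`, or evaluating gradients at the most recent weights instead of at `W0` breaks the equality, which is the sense in which TTT-Linear generalizes linear attention.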

### Mini-batch TTT and dual form

To make TTT layers practical on accelerators, the paper introduces a *mini-batch TTT* algorithm that batches multiple consecutive tokens before computing each inner update, and a *dual form* that re-expresses the inner-loop updates in a way that can be implemented with matrix multiplications matched to GPU tensor cores. Without these systems-level techniques the wall-clock cost of TTT would dominate, but with them TTT-Linear is reported to be competitive in throughput with Mamba and significantly faster than [Transformer](/wiki/attention_is_all_you_need_transformer) models at long context lengths.[^2]
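For a linear inner model, the idea behind the dual form can be sketched in a few lines. Inside one mini-batch all gradients are evaluated at the weights $$W_0$$ from the start of the mini-batch, so the token-by-token outputs can be obtained from a handful of batched products: $$Z = W_0 Q - 2\eta\,(W_0 K - V)\,\text{mask}(K^\top Q)$$, where the mask keeps entries with $$s \le t$$. The code below is a simplified illustration in this article's notation, written with plain loops rather than accelerator kernels. It is not the paper's implementation, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def primal_minibatch(W0, ks, vs, qs, eta):
    """Token-by-token updates inside one mini-batch; gradients are taken at W0."""
    W = [row[:] for row in W0]
    outputs = []
    for k, v, q in zip(ks, vs, qs):
        err = [dot(row, k) - vi for row, vi in zip(W0, v)]       # W0 k - v
        W = [[w - 2.0 * eta * e * kj for w, kj in zip(row, k)]
             for row, e in zip(W, err)]
        outputs.append([dot(row, q) for row in W])
    return outputs, W

def dual_minibatch(W0, ks, vs, qs, eta):
    """Same outputs without materializing the intermediate weights W_1..W_b."""
    b, d_in = len(ks), len(ks[0])
    # E[s] = W0 k_s - v_s, the per-token reconstruction error at W0.
    E = [[dot(row, k) - vi for row, vi in zip(W0, v)] for k, v in zip(ks, vs)]
    # A[s][t] = k_s . q_t for s <= t, zero otherwise (the causal mask).
    A = [[dot(ks[s], qs[t]) if s <= t else 0.0 for t in range(b)] for s in range(b)]
    outputs = []
    for t in range(b):
        base = [dot(row, qs[t]) for row in W0]                   # W0 q_t
        corr = [sum(E[s][i] * A[s][t] for s in range(b)) for i in range(len(W0))]
        outputs.append([bi - 2.0 * eta * ci for bi, ci in zip(base, corr)])
    # Weights handed to the next mini-batch: W0 - 2*eta * sum_s E[s] k_s^T.
    W_end = [[W0[i][j] - 2.0 * eta * sum(E[s][i] * ks[s][j] for s in range(b))
              for j in range(d_in)] for i in range(len(W0))]
    return outputs, W_end

if __name__ == "__main__":
    W0 = [[0.5, -0.2], [0.1, 0.3]]
    ks = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
    vs = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
    qs = [[1.0, 1.0], [0.0, 2.0], [-1.0, 0.5]]
    print(primal_minibatch(W0, ks, vs, qs, eta=0.1))
    print(dual_minibatch(W0, ks, vs, qs, eta=0.1))
```

Both functions return the same outputs and the same end-of-batch weights. The dual version only needs the products `W0 K`, `K^T Q`, and `E K^T`, which is why the same computation maps well onto matrix-multiplication hardware when written with real tensors.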

### TTT for in-context learning of language models

The 2024 MIT TTT-for-ARC paper applies a different style of TTT to an already-trained [large language model](/wiki/large_language_model). Given an ARC task that consists of several input-output example pairs and a held-out test query, the method constructs "leave-one-out" tasks from the given examples, augments them through rule-based transformations such as rotations and color permutations, and uses these augmented tasks to fine-tune small task-specific [LoRA](/wiki/lora) adapters on top of a fine-tuned Llama-3 8B model. Each task gets its own adapter, trained for two epochs with batch size two and AdamW at learning rate around 5e-5 or 1e-4; the LoRA rank is 128 with alpha 16, applied to attention query and value projections, MLP weights, and the output projection.[^3] The adapted model is then run on the held-out query. This procedure is TTT in the sense that the model's parameters are updated from the test input alone (the few-shot examples), via a supervised loss derived without external labels.
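The data-construction step described above can be sketched without any model code. The block below assumes grids are lists of integer rows. It builds leave-one-out tasks from the demonstration pairs and applies each transformation consistently to every grid in a task. The function names and the dictionary layout are hypothetical, and the paper's actual augmentation set and prompt formatting are not reproduced here.

```python
def rotate90(grid):
    """Rotate a grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def permute_colors(grid, mapping):
    """Relabel cell colors according to `mapping`; unmapped colors are kept."""
    return [[mapping.get(cell, cell) for cell in row] for row in grid]

def leave_one_out_tasks(examples):
    """Turn n demonstration pairs into n tasks, each holding one pair out as the query."""
    tasks = []
    for i, (query, target) in enumerate(examples):
        demos = examples[:i] + examples[i + 1:]
        tasks.append({"demos": demos, "query": query, "target": target})
    return tasks

def augment_task(task, transform):
    """Apply the same grid transform to every input and output in a task."""
    return {
        "demos": [(transform(x), transform(y)) for x, y in task["demos"]],
        "query": transform(task["query"]),
        "target": transform(task["target"]),
    }

def build_ttt_dataset(examples, transforms):
    """Cross every leave-one-out task with every transform."""
    return [augment_task(task, t)
            for task in leave_one_out_tasks(examples)
            for t in transforms]

if __name__ == "__main__":
    examples = [
        ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
        ([[0, 2], [0, 0]], [[0, 0], [0, 2]]),
        ([[0, 0], [0, 3]], [[0, 0], [3, 0]]),
    ]
    transforms = [
        lambda g: g,                              # identity
        rotate90,                                 # geometric augmentation
        lambda g: permute_colors(g, {1: 2, 2: 1}),  # color augmentation
    ]
    dataset = build_ttt_dataset(examples, transforms)
    print(len(dataset))
```

Each resulting task would then be serialized into a training sequence and used to fit the per-task adapter. Because the held-out target is itself one of the given demonstrations, the loss is supervised even though no label from outside the test task is used.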

## Variants and Implementations

| Variant | Year | Lead author(s) | Adaptation target | Update signal | Notes |
|---|---|---|---|---|---|
| Standard TTT (vision) | 2020 | Sun et al. | Shared encoder weights | Rotation-prediction loss on the test image | Resets weights per image[^1] |
| Online TTT (vision) | 2020 | Sun et al. | Shared encoder weights | Rotation-prediction loss on streaming images | Accumulates updates across stream[^1] |
| TTT-MAE | 2022 | Gandelsman et al. | ViT encoder, MAE setup | MAE reconstruction loss | Stronger auxiliary task, ViT backbone[^6] |
| TTT-Linear | 2024 | Sun et al. | Inner linear-model hidden state | Per-token reconstruction loss | Linear-time language model layer[^2] |
| TTT-MLP | 2024 | Sun et al. | Inner two-layer MLP hidden state | Per-token reconstruction loss | Higher capacity, more memory I/O[^2] |
| TTT for ARC | 2024 | Akyurek et al. | Task-specific [LoRA](/wiki/lora) adapters | Loss on augmented few-shot examples | 8B model, 53% on ARC public val[^3] |
| TTT-Video | 2025 | Dalal et al. | TTT layers added to a [DiT](/wiki/diffusion_transformer) | Inner-loop reconstruction during finetune | One-minute coherent video[^4] |

### Official open-source releases

The 2024 TTT-layers paper was accompanied by two official open-source repositories:

- *ttt-lm-pytorch*: a PyTorch implementation built on the [Hugging Face Transformers](/wiki/transformers_library) library, distributed under the MIT license. The authors describe it as "a naive implementation of TTT layers for tutorial purposes" and explicitly recommend against using it for serious training because it lacks systems optimization. A "ttt-1b style configuration" is provided via TTTConfig.[^7]
- *ttt-lm-jax*: a [JAX](/wiki/jax) implementation supporting both GPUs and Cloud TPU VMs (Python 3.11), used for the speed and scaling experiments reported in the paper. Separate dependency lists for GPU and TPU are provided.[^8]

Additional code, including custom kernels for the dual-form mini-batch TTT update and a video-generation codebase (ttt-video-dit), was released through the same `test-time-training` GitHub organization.[^7][^8]

### Other implementations

Several independent or derivative implementations have appeared:

- *Test-Time Training on Nearest Neighbors* by Hardt and Sun (arXiv 2305.18466, ICLR 2024) fine-tunes a large language model on the nearest neighbors of each test input, retrieved from an external index of training text. Code is published as the `socialfoundations/tttlm` repository.[^9]
- *TTT-AdaptNet* (ECCV 2024) uses adaptive linear layers for test-time adaptation in image reconstruction.[^10]
- *TTRL: Test-Time Reinforcement Learning* (arXiv 2504.16084, 2025) treats TTT as a [continual learning](/wiki/continual_learning) setup driven by a reward signal at inference time.[^11]
- Comprehensive surveys such as Liang et al., *A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts* (IJCV, 2025), collect dozens of TTT variants and place them in a taxonomy that distinguishes batch, online, and per-sample adaptation regimes.[^12]

## Results

### Vision benchmarks (original TTT, 2020)

On CIFAR-10-C, which applies 15 standard corruption types at five severities to the CIFAR-10 test set, the original TTT recipe improved over standard and pretraining-only baselines without hurting clean-data accuracy. The online variant improved further as more test samples were processed.[^1] On ImageNet-C, the larger-scale counterpart, the paper reports gains that grow with the size of the test stream, consistent with the online model adapting gradually to each corruption as updates accumulate.[^1] The paper also evaluates on additional shift benchmarks including video robustness datasets, and reports consistent improvements over the baselines used.[^1]

### Language modeling on The Pile and Books3 (TTT layers, 2024)

The TTT-layers paper trains models from 125M to 1.3B parameters on [The Pile](/wiki/the_pile) at 2k and 8k context lengths and on Books3 at 32k context. The paper reports that at 2k context, TTT-Linear, [Mamba](/wiki/mamba), and a [Transformer](/wiki/attention_is_all_you_need_transformer) baseline have "mostly overlapping" perplexity curves across scales. At 8k context, both TTT-Linear and TTT-MLP perform "significantly better" than Mamba. In the paper's ablation table, which uses 125M-parameter models, the final TTT-Linear configuration reaches a perplexity of about 11.09, compared with roughly 15.23 for a [linear-attention](/wiki/linear_attention) baseline. At 32k context on Books3, the Transformer and the TTT variants continue reducing perplexity as context grows, while Mamba plateaus after roughly 16k tokens.[^2] TTT-MLP achieves better perplexity than TTT-Linear at every scale tested, but its larger inner MLP creates additional memory-I/O pressure that partially offsets the quality gain.[^2]

### ARC-AGI (TTT for few-shot learning, 2024)

On the [ARC-AGI](/wiki/arc_agi) public validation set, the MIT group reports that applying TTT to a Llama-3 8B instruction-tuned model achieves 53.0% accuracy, an improvement of "nearly 25%" over the previous state of the art for purely neural public approaches, and "up to 6x higher accuracy compared to fine-tuned baselines."[^3] When combined with an existing program-synthesis solver, the joint system reaches 61.9% on ARC public validation, which the authors describe as matching average human performance on the same set.[^3] On [BIG-Bench Hard](/wiki/big_bench) in a 10-shot setting, the same TTT recipe improves accuracy from 50.5% to 57.8%, a 7.3 point gain.[^3]

### Video generation (TTT-Video, 2025)

The TTT-Video work adds TTT-MLP layers into a pre-trained 5 billion-parameter [diffusion transformer](/wiki/diffusion_transformer) (CogVideoX-5B), then finetunes on a custom dataset of more than seven hours of classic Tom-and-Jerry cartoons broken into 3-second annotated segments. Compared to baselines including [Mamba 2](/wiki/mamba_2), Gated DeltaNet, and a [sliding-window-attention](/wiki/sliding_window_attention) variant, the TTT-MLP model leads by 34 Elo points in a 100-video-per-method human evaluation, producing one-minute videos that the authors describe as more temporally coherent.[^4][^13] The paper notes residual visual artifacts and emphasizes that the implementation is not yet efficient enough for serving.[^4]
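To give a sense of scale for the 34-point margin, an Elo gap can be converted into an expected head-to-head preference rate. The sketch below assumes the standard logistic Elo model with a scale of 400. Whether the paper's evaluation uses exactly this parameterization is an assumption, so the result is a rough reading aid rather than a figure from the paper.

```python
def elo_expected_score(rating_gap, scale=400.0):
    """Expected win rate of the higher-rated side under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))

if __name__ == "__main__":
    # A 34-point gap corresponds to roughly a 55% expected preference rate.
    print(round(elo_expected_score(34.0), 3))
```

Under that assumption, a 34-point lead means raters would be expected to prefer the TTT-MLP video in roughly 55 out of 100 pairwise comparisons against the next-best method, a consistent but moderate advantage.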

## Comparison with other sequence-modeling approaches

| Approach | Hidden state | Per-token cost | Long-context behavior | Test-time adaptation |
|---|---|---|---|---|
| [Transformer](/wiki/attention_is_all_you_need_transformer) (self-attention) | Full KV cache, grows with t | Linear in context length t (quadratic over the whole sequence) | Strong, but cost grows | No adaptation; weights frozen |
| [RNN](/wiki/rnn) / [LSTM](/wiki/long_short-term_memory_lstm) | Fixed-size vector | Constant | Fades or saturates over very long contexts | No adaptation |
| [Mamba](/wiki/mamba) / [state-space models](/wiki/state_space_model) | Selective state vector | Constant | Strong up to roughly 16k tokens; plateaus thereafter on Books3 in TTT paper experiments[^2] | No adaptation |
| [Linear attention](/wiki/linear_attention) | Matrix accumulator | Constant per token | Equivalent to TTT-Linear with batch gradient descent, zero $$W_0$$, and learning rate 1/2[^2] | Implicit accumulator, not gradient-based |
| TTT-Linear | Linear inner model trained at test time | Constant per token, plus inner-loop GD step | Reduces perplexity beyond 16k on Books3[^2] | Yes, by inner-loop SSL update |
| TTT-MLP | Two-layer MLP inner model | Constant per token, more memory I/O | Best long-context behavior reported in the paper[^2] | Yes, by inner-loop SSL update |

The qualitative picture from the 2024 TTT-layers paper is that classical fixed-vector RNNs and even modern SSMs face a representational ceiling that grows visible as context length increases. Self-attention avoids that ceiling by simply keeping the entire context in a [KV cache](/wiki/kv_cache), at quadratic cost. TTT layers attempt a third path: keep the per-token cost constant by storing context inside a small parametric model, and increase that model's expressivity by making the inner state itself a neural network. The fact that TTT-Linear collapses to linear attention under a specific reduction makes the relationship to existing methods precise rather than merely analogical.[^2]

## Applications

### Robustness to distribution shifts

The original use case for TTT remains adapting a deployed image classifier to corruptions or covariate shifts that were not present at training. Because the auxiliary task (rotation prediction or MAE reconstruction) does not require labels, the procedure can be applied wherever a modest per-sample inference slowdown is acceptable. Online TTT, with its accumulated updates, is particularly useful when the deployment environment drifts slowly and a continuously adapting model is desired.[^1][^6]

### Long-context language modeling

The TTT-layer formulation suggests an alternative architecture for [large language models](/wiki/large_language_model) in which the model's hidden state can absorb information from arbitrarily long contexts without paying quadratic attention cost. Empirically, TTT-Linear and TTT-MLP keep reducing perplexity beyond 16k tokens on Books3, where Mamba plateaus.[^2] This makes TTT a candidate primitive for long-document, code-base-scale, or multi-turn agent tasks where context windows reach into the hundreds of thousands of tokens. The official open-source releases give an entry point for experimentation, though the PyTorch reference implementation is not optimized for training and is intended as a tutorial.[^7]

### Few-shot reasoning and ARC-AGI

The MIT result on ARC-AGI established TTT as a credible approach to few-shot abstract reasoning. The recipe is naturally suited to ARC-style tasks because each task is presented as a handful of input-output pairs that already form a tiny supervised dataset, which fits TTT's "learn at inference time" framing.[^3] More broadly, the ability to spin up a per-task LoRA adapter at inference suggests applications in personalized assistants, on-device adaptation, and tasks where the user's intent is conveyed through demonstrations rather than instructions.

### Long-form video and other generative settings

The TTT-Video result extends TTT into generative modeling, applying TTT layers inside a pre-trained DiT to handle the very long sequences that one-minute videos require. The reported 34 Elo-point lead over the compared baselines (Mamba 2, Gated DeltaNet, and sliding-window attention) on Tom-and-Jerry video generation suggests that the inner-loop expressivity of TTT can carry over from text to dense pixel sequences.[^4][^13]

## Limitations

Despite the encouraging results, TTT carries several practical limitations.

First, *inference-time cost*. Performing a [gradient-descent](/wiki/gradient_descent) step (or many steps) per test sample or per token can multiply the FLOPs and wall-clock time of inference. The TTT-layers paper introduces mini-batch TTT and a dual form to keep wall-clock time competitive with Mamba on existing accelerators, but the authors note that TTT-MLP still faces memory-I/O bottlenecks and that further systems work is needed.[^2] The TTT-Video paper similarly notes that its implementation is not yet efficient enough for production serving.[^4]

Second, *risk of degradation*. The original TTT paper observes that updating model weights at test time can hurt performance on clean, in-distribution data if the auxiliary task is poorly chosen or if too many adaptation steps are taken. The standard variant resets weights between samples partly to bound this risk. Online TTT, while powerful when the test stream is consistent, can drift if the distribution changes within the stream.[^1]

Third, *auxiliary-task design*. The framework is only as good as the self-supervised loss used in the inner loop. Rotation prediction works for natural images, which have a canonical upright orientation, but is unlikely to help for, say, satellite imagery, where no orientation is privileged and rotational invariance is usually the goal. Per-token reconstruction in TTT layers may have its own systematic failure modes when sequences contain repeated structure or adversarial inputs.[^2][^6]

Fourth, *evaluation maturity*. As of mid-2026, head-to-head comparisons of TTT-style architectures against modern Transformer and [Mamba 2](/wiki/mamba_2) baselines exist mostly at sub-2B parameter scale, on a small set of language-modeling and reasoning benchmarks. The behavior of TTT layers at frontier-model scale, in instruction-tuned settings, and under adversarial inputs is still emerging.[^2][^12]

Fifth, *interaction with existing inference stacks*. Many production systems assume frozen weights at serve time, which simplifies batching, KV-caching, request routing, and quantization. TTT layers, especially those that mutate weights per token, complicate these assumptions and may require new serving infrastructure to deploy widely. Hugging Face Transformers integration in `ttt-lm-pytorch` is described by the authors as primarily for study rather than for high-throughput deployment.[^7]

## Related Work

TTT belongs to a broader family of inference-time adaptation techniques. *Test-time adaptation* (TTA) usually refers to methods that adjust only batch-normalization statistics or a few parameters at test time on a batch of unlabeled samples; surveys such as Liang et al. taxonomize over fifty TTA methods, of which TTT is one branch defined by its use of an explicit auxiliary loss.[^12] *Test-time augmentation* averages predictions over augmented copies of the test input without changing weights and is therefore not TTT.

Other adjacent fields include [meta-learning](/wiki/meta-learning) (which trains models to be quickly adaptable from a few examples), [continual learning](/wiki/continual_learning) (which adapts on labeled streaming data), [in-context learning](/wiki/in-context_learning) (which adapts behavior without weight changes by conditioning on examples in the prompt), and [online learning](/wiki/online_learning). The 2024 TTT-for-ARC paper explicitly compares TTT with in-context learning, finding that explicit gradient-based adaptation outperforms purely in-context inference on ARC-style tasks at the scale tested.[^3]

Within sequence modeling, TTT layers are most directly comparable to selective [state space models](/wiki/state_space_model) such as [Mamba](/wiki/mamba) and [Mamba 2](/wiki/mamba_2), to [linear attention](/wiki/linear_attention) variants, and to [RWKV](/wiki/rwkv)-style linear RNNs. The relationship to linear attention is formal (Theorem 1 of the 2024 paper); the relationship to Mamba is empirical and shows up most clearly at context lengths beyond 16k tokens.[^2] The TTT-Video paper compares against Gated DeltaNet and sliding-window attention, both of which are recent sub-quadratic alternatives to full self-attention.[^4]

## See also

- [Transformer](/wiki/attention_is_all_you_need_transformer)
- [Mamba](/wiki/mamba) and [Mamba 2](/wiki/mamba_2)
- [State space model (deep learning)](/wiki/state_space_model)
- [Linear Attention](/wiki/linear_attention)
- [Recurrent Neural Network](/wiki/recurrent_neural_network)
- [Self-supervised learning](/wiki/self_supervised_learning)
- [Masked Autoencoder](/wiki/masked_autoencoder)
- [Domain adaptation](/wiki/domain_adaptation)
- [Continual learning](/wiki/continual_learning)
- [Meta-Learning](/wiki/meta-learning)
- [In-context learning](/wiki/in-context_learning)
- [LoRA](/wiki/lora)
- [ARC-AGI](/wiki/arc_agi)
- [Test-time compute](/wiki/test_time_compute)
- [The Pile (dataset)](/wiki/the_pile)

## References

[^1]: Sun, Yu; Wang, Xiaolong; Liu, Zhuang; Miller, John; Efros, Alexei A.; Hardt, Moritz, "Test-Time Training with Self-Supervision for Generalization under Distribution Shifts", arXiv preprint (ICML 2020), 2019-09-29 (revised 2020-07-01). https://arxiv.org/abs/1909.13231. Accessed 2026-05-20.
[^2]: Sun, Yu; Li, Xinhao; Dalal, Karan; Xu, Jiarui; Vikram, Arjun; Zhang, Genghan; Dubois, Yann; Chen, Xinlei; Wang, Xiaolong; Koyejo, Sanmi; Hashimoto, Tatsunori; Guestrin, Carlos, "Learning to (Learn at Test Time): RNNs with Expressive Hidden States", arXiv preprint, 2024-07-05. https://arxiv.org/abs/2407.04620. Accessed 2026-05-20.
[^3]: Akyurek, Ekin; Damani, Mehul; Zweiger, Adam; Qiu, Linlu; Guo, Han; Pari, Jyothish; Kim, Yoon; Andreas, Jacob, "The Surprising Effectiveness of Test-Time Training for Few-Shot Learning", arXiv preprint, 2024-11-11. https://arxiv.org/abs/2411.07279. Accessed 2026-05-20.
[^4]: Dalal, Karan; Koceja, Daniel; Hussein, Gashon; Xu, Jiarui; Zhao, Yue; Song, Youjin; Han, Shihao; Cheung, Ka Chun; Kautz, Jan; Guestrin, Carlos; Hashimoto, Tatsunori; Koyejo, Sanmi; Choi, Yejin; Sun, Yu; Wang, Xiaolong, "One-Minute Video Generation with Test-Time Training", arXiv preprint, 2025-04-07. https://arxiv.org/abs/2504.05298. Accessed 2026-05-20.
[^5]: Sun, Yu; Wang, Xiaolong; Liu, Zhuang; Miller, John; Efros, Alexei; Hardt, Moritz, "Test-Time Training with Self-Supervision for Generalization Under Distribution Shifts", Proceedings of the 37th International Conference on Machine Learning, PMLR vol. 119, 2020. https://proceedings.mlr.press/v119/sun20b.html. Accessed 2026-05-20.
[^6]: Gandelsman, Yossi; Sun, Yu; Chen, Xinlei; Efros, Alexei A., "Test-Time Training with Masked Autoencoders", arXiv preprint, 2022-09-15. https://arxiv.org/abs/2209.07522. Accessed 2026-05-20.
[^7]: test-time-training, "ttt-lm-pytorch: Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States", GitHub repository, 2024 (MIT License). https://github.com/test-time-training/ttt-lm-pytorch. Accessed 2026-05-20.
[^8]: test-time-training, "ttt-lm-jax: Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States", GitHub repository, 2024. https://github.com/test-time-training/ttt-lm-jax. Accessed 2026-05-20.
[^9]: socialfoundations, "tttlm: Test-time-training on nearest neighbors for large language models", GitHub repository (ICLR 2024 companion code), 2024. https://github.com/socialfoundations/tttlm. Accessed 2026-05-20.
[^10]: Zhao, Yutian and collaborators, "TTT-AdaptNet: Test-time Model Adaptation for Image Reconstruction Using Self-supervised Adaptive Layers (ECCV 2024)", GitHub repository, 2024. https://github.com/yutianzhao-00/TTT-AdaptNet. Accessed 2026-05-20.
[^11]: Zuo, Yuxin et al., "TTRL: Test-Time Reinforcement Learning", arXiv preprint, 2025-04-22 (revised 2025-06-30). https://arxiv.org/abs/2504.16084. Accessed 2026-05-20.
[^12]: Liang, Jian; He, Ran; Tan, Tieniu, "A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts", International Journal of Computer Vision, 133(1):31-64, 2025 (preprint arXiv 2303.15361). https://arxiv.org/abs/2303.15361. Accessed 2026-05-20.
[^13]: test-time-training, "One-Minute Video Generation with Test-Time Training (project page)", 2025. https://test-time-training.github.io/video-dit/. Accessed 2026-05-20.