Titans (neural architecture)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,547 words
Titans is a family of neural sequence-modeling architectures from Google Research that combines an attention-based "short-term memory" with a deep neural "long-term memory" module that updates its own parameters at test time using online gradient descent on a "surprise" signal. Introduced by Ali Behrouz, Peilin Zhong, and Vahab Mirrokni in the paper "Titans: Learning to Memorize at Test Time" (arXiv:2501.00663, submitted 31 December 2024)[^1][^2], the design is motivated by the limitations of standard Transformer models on extremely long contexts and by the lossy fixed-size state of modern linear recurrent baselines such as Mamba and Mamba 2. Titans treats the parameters of a small multi-layer perceptron as a writable long-term memory that is updated incrementally as tokens stream in, using a momentum term, a learned decay (forget) gate, and an associative key-value loss. The paper proposes three variants that differ in how the neural memory is wired to attention: Memory-as-Context (MAC), Memory-as-Gate (MAG), and Memory-as-Layer (MAL)[^1][^3].
Two families of sequence models dominated language modeling in the years leading up to Titans. Decoder-only Transformers using softmax attention obtain strong accuracy and recall but pay quadratic time and memory in the sequence length, which limits feasible context windows in training[^1][^4]. A second family of modern linear recurrent models, including Mamba, Mamba 2, RetNet, Linear Attention variants, and RWKV, achieves linear-time training and constant-memory inference by compressing the past into a fixed-size hidden state[^1][^4]. That compression sacrifices recall over very long contexts: information not retained in the state cannot be recovered, and the family struggles on tasks where a specific fact placed millions of tokens earlier must be retrieved verbatim[^1].
Behrouz, Zhong, and Mirrokni framed this as a memory problem rather than an architecture problem[^1][^2]. Their thesis is that attention behaves like a precise but short-horizon working memory, modern linear recurrences provide an efficient but lossy summary, and what is missing is a third component, an explicit long-term memory that can be written to during inference and read from again later[^1][^2]. The Titans paper formulates that long-term memory as a neural network whose parameters are themselves the storage medium, with writes implemented by gradient updates on an internal loss[^1]. This connects Titans to a thread of work on test-time learning, including Test-Time Training[^5] and earlier "fast weights" ideas, and to the wider meta-learning view that an inner optimization loop can serve as a memory mechanism[^1][^2].
The paper was first posted to arXiv on 31 December 2024 (with the 2501.00663 identifier reflecting January 2025 assignment), and a companion overview appeared on the Google Research blog under the title "Titans + MIRAS: Helping AI have long-term memory" later in 2025[^1][^2]. The same team subsequently generalized the framework with MIRAS, a design space that describes associative-memory sequence models in terms of four choices, "memory architecture", "attentional bias", "retention gate", and "memory learning algorithm", of which Titans is one instance[^2][^6].
The core innovation in Titans is the Neural Long-Term Memory module, abbreviated LMM or NLM in different parts of the paper[^1]. The module is not a static buffer or a fixed-size hidden state; it is a small multi-layer perceptron whose weights M_t are updated as the model consumes tokens. Reads and writes happen during the forward pass, so stored content persists only for the duration of a single sequence rather than being carried across training updates[^1][^3].
For each incoming token x_t, two learned projections produce a key k_t = W_K x_t and a value v_t = W_V x_t[^1][^3]. The memory M_{t-1} is treated as a function from keys to predicted values, and the internal loss is the squared error of the prediction:
l(M_{t-1}; x_t) = || M_{t-1}(k_t) - v_t ||_2^2
Minimizing this loss makes the memory store the association k_t -> v_t, so that the same key presented later produces the same value[^1][^3]. The loss is a per-token associative-memory objective, and the gradient of the loss with respect to the parameters of M is the surprise signal that drives the write.
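The loss and its gradient can be made concrete with a toy example. The sketch below assumes a linear memory (a single weight matrix, so the gradient has the closed form 2 (W k - v) k^T); the names `read`, `loss`, `surprise`, and the inner step size `step` are illustrative choices, not identifiers from the paper.

```python
# Toy linear memory: M(k) = W k, with W stored as a list of rows.
# Assumption: a linear memory is used only so the gradient is closed-form.

def read(W, k):
    """Predict a value from a key: M(k) = W k."""
    return [sum(w * x for w, x in zip(row, k)) for row in W]

def loss(W, k, v):
    """Associative loss || M(k) - v ||_2^2."""
    return sum((p - t) ** 2 for p, t in zip(read(W, k), v))

def surprise(W, k, v):
    """Gradient of the loss with respect to W: 2 (W k - v) k^T."""
    err = [p - t for p, t in zip(read(W, k), v)]
    return [[2.0 * e * x for x in k] for e in err]

W = [[0.0, 0.0], [0.0, 0.0]]   # empty memory
k = [1.0, 0.0]                 # key for the current token
v = [0.5, -1.0]                # value to associate with that key

before = loss(W, k, v)         # large error: the pair is "surprising"
g = surprise(W, k, v)
step = 0.5                     # hypothetical inner learning rate
W = [[w - step * gi for w, gi in zip(row, g_row)] for row, g_row in zip(W, g)]
after = loss(W, k, v)          # the same key now retrieves the stored value
```

With a unit-norm key and this step size, one gradient step stores the pair exactly; in general a single step only moves the memory toward the association.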
A naive implementation would update memory by a single SGD step per token. Titans replaces that with a momentum-style update that the authors describe as a "surprise" mechanism[^1][^2]. Define an intermediate state S_t that accumulates past surprise with a data-dependent decay eta_t and scales each new gradient by theta_t:
S_t = eta_t * S_{t-1} - theta_t * grad l(M_{t-1}; x_t)
The memory itself then evolves with a learned forget gate alpha_t in [0,1] that acts as adaptive weight decay:
M_t = (1 - alpha_t) * M_{t-1} + S_t
The first term implements a soft erase that lets the model release capacity when long sequences would otherwise saturate the parameters[^1][^3]. The second term injects the accumulated, momentum-smoothed surprise into the parameters. In the limit alpha_t = 1 the previous memory is wiped; in the limit alpha_t = 0 the memory is fully preserved and updates are purely additive[^1][^3]. The paper draws an explicit analogy to human memory: events that produce a large prediction error are stored more strongly, while routine content fades through the decay term[^1][^2].
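A minimal sketch of one write under this rule follows, again assuming a linear memory for simplicity. The gates `eta`, `theta`, and `alpha` are passed as plain constants here, whereas the text describes them as data-dependent; the function names are hypothetical.

```python
def grad(W, k, v):
    """Gradient of || W k - v ||^2 with respect to W (linear memory assumed)."""
    err = [sum(w * x for w, x in zip(row, k)) - t for row, t in zip(W, v)]
    return [[2.0 * e * x for x in k] for e in err]

def write(W, S, k, v, eta, theta, alpha):
    """One memory write:
       S_t = eta * S_{t-1} - theta * grad
       M_t = (1 - alpha) * M_{t-1} + S_t
    """
    g = grad(W, k, v)
    S_new = [[eta * s - theta * gi for s, gi in zip(s_row, g_row)]
             for s_row, g_row in zip(S, g)]
    W_new = [[(1.0 - alpha) * w + s for w, s in zip(w_row, s_row)]
             for w_row, s_row in zip(W, S_new)]
    return W_new, S_new

W0 = [[0.0, 0.0], [0.0, 0.0]]
S0 = [[0.0, 0.0], [0.0, 0.0]]
k, v = [1.0, 0.0], [1.0, 0.0]

# alpha = 0: nothing is forgotten, the update is purely additive.
W1, S1 = write(W0, S0, k, v, eta=0.9, theta=0.25, alpha=0.0)

# alpha = 1: the previous memory is wiped, only the new surprise state remains.
W2, S2 = write(W1, S1, k, v, eta=0.9, theta=0.25, alpha=1.0)
```

In the second call the momentum term still carries part of the earlier surprise forward through `S1`, even though the previous memory contents are erased.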
Because the update is structured as mini-batch gradient descent with momentum and weight decay, the entire memory trajectory over a chunk of tokens can be reformulated as a matrix-multiplication-based associative scan, which the authors exploit for a parallelizable training algorithm on GPUs and TPUs[^1][^3]. The chunk-wise parallelization applies to training; inference still proceeds token by token, at constant cost per token.
Earlier test-time-trained recurrences such as TTT used a linear or low-depth memory, and DeltaNet and Gated DeltaNet effectively use a matrix-shaped state with a fast-weight update rule[^1]. Titans makes the memory a deeper meta-learning object: an MLP with L_M >= 1 layers, where experiments find L_M = 2 or L_M = 4 substantially outperform the L_M = 1 linear baseline on long sequences[^1][^3]. The depth gives the memory a non-linear read map M(k) that can compress more information per parameter than a flat linear projection, at the cost of additional FLOPs per token[^1][^3]. The paper reports roughly a 15 to 20 percent throughput penalty for L_M = 4 versus L_M = 1, with a 0.5 to 1.0 perplexity gain across 170M to 760M parameter scales[^3].
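The difference between a linear and a deep read map can be illustrated with a small forward-pass sketch. It assumes a two-layer memory with a GELU nonlinearity (tanh approximation) between the layers and hand-picked weights; none of the names or values come from the paper.

```python
import math

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def linear_read(W1, W2, k):
    """Two stacked linear maps collapse to a single linear map W2 W1."""
    return matvec(W2, matvec(W1, k))

def deep_read(W1, W2, k):
    """Two-layer memory with a nonlinearity between the layers."""
    return matvec(W2, [gelu(h) for h in matvec(W1, k)])

W1 = [[1.0, 0.0], [0.0, 1.0]]   # illustrative weights
W2 = [[1.0, 1.0]]
k = [1.0, -1.0]

lin = linear_read(W1, W2, k)[0]   # the two key components cancel exactly
deep = deep_read(W1, W2, k)[0]    # the nonlinearity breaks the cancellation
```

The linear read cannot distinguish this key from the zero key, while the deep read can, which is one way to see why depth adds capacity to the read map.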
Titans also includes a third memory type called persistent memory, a set of learnable but data-independent parameters prepended to every input sequence[^1][^3]. These tokens function like task embeddings or memory prompts: they encode knowledge that is fixed at test time but trained end-to-end, separating "what does this task look like" from the per-sequence content stored in the dynamic LMM and the immediate content held in the attention window[^1]. The three components together (attention "core", neural long-term memory, persistent memory) form what the paper calls Titans' "three hyper-heads"[^1].
The Titans paper proposes three concrete ways to couple the long-term memory module to attention, and reports experiments on all three[^1][^3]. The variants differ in where the memory output enters the computation graph relative to the attention layers.
In MAC, the input is split into fixed-length segments. Before processing a new segment with attention, the model queries the long-term memory with the current segment's representation, retrieves a set of memory vectors, and prepends them, together with the persistent memory tokens, to the segment[^1][^3]. Causal attention then runs over the concatenation of persistent tokens, retrieved memory tokens, and current-segment tokens. The retrieved tokens act as compressed "notes" from earlier in the sequence that the attention layer can read alongside fresh content[^1][^3]. After processing the segment, the memory is updated by running the surprise-driven write rule on the segment's keys and values, so subsequent retrieval reflects what the segment contained[^1].
MAC is the variant that the paper reports as strongest on tasks with long-range dependencies, including needle-in-haystack retrieval and BABILong-style reasoning, because the attention layer sees both local context and an explicit summary of distant history in the same softmax operation[^1][^3].
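The MAC data flow can be sketched as a loop over segments. Everything below is a toy stand-in under stated assumptions: tokens are integers, `retrieve`, `attend`, and `write` are hypothetical callbacks, the "memory" is a list of segment sums, and "attention" is replaced by a causal prefix sum. Only the order of operations (retrieve, concatenate, attend, write) mirrors the description above.

```python
def mac_forward(tokens, segment_len, persistent, memory, retrieve, attend, write):
    """Process a sequence segment by segment in the Memory-as-Context layout."""
    outputs = []
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        retrieved = retrieve(memory, segment)        # read long-term memory
        context = persistent + retrieved + segment   # [persistent | retrieved | segment]
        attended = attend(context)                   # causal attention stand-in
        outputs.extend(attended[-len(segment):])     # keep the segment positions only
        memory = write(memory, segment)              # update memory after the segment
    return outputs, memory

# Toy stand-ins (assumptions, not the paper's modules).
def toy_retrieve(memory, segment):
    return [sum(memory)] if memory else []

def toy_attend(context):
    out, running = [], 0
    for x in context:                                # causal: position t sees positions <= t
        running += x
        out.append(running)
    return out

def toy_write(memory, segment):
    return memory + [sum(segment)]

outputs, memory = mac_forward(
    tokens=[1, 2, 3, 4], segment_len=2, persistent=[100], memory=[],
    retrieve=toy_retrieve, attend=toy_attend, write=toy_write,
)
```

In this toy run the outputs for the second segment depend on the first segment only through the retrieved memory token, which is the point of the MAC wiring.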
MAG runs two branches in parallel at each layer. One branch is a sliding window attention over the local context. The other branch is the neural long-term memory, queried over the same input. A learned gate combines the two outputs token by token, letting the model choose whether to rely on the local attention or on retrieval from the long-term memory[^1][^3]. Because the attention branch is windowed, MAG avoids the quadratic cost of full attention, while the LMM branch supplies global context as a separate signal that the gate can weight up when long-range information matters and down otherwise[^1][^3]. The arrangement resembles hybrid architectures such as Samba and Griffin, except that the recurrent branch is an online-trained MLP rather than a fixed state-space recurrence[^1].
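A token-wise gate of this kind can be sketched as follows. The convex combination with a sigmoid gate is an assumption made for illustration; the exact gating function used in the paper may differ, and `mag_combine` and its arguments are hypothetical names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mag_combine(attn_out, mem_out, gate_logits):
    """Mix the two branch outputs per token: g * attention + (1 - g) * memory."""
    mixed = []
    for a_vec, m_vec, z in zip(attn_out, mem_out, gate_logits):
        g = sigmoid(z)   # one scalar gate per token (assumption)
        mixed.append([g * a + (1.0 - g) * m for a, m in zip(a_vec, m_vec)])
    return mixed

attn_out = [[2.0, 4.0], [1.0, 1.0]]   # sliding-window attention branch
mem_out = [[0.0, 0.0], [3.0, 5.0]]    # long-term memory branch
gate_logits = [0.0, -50.0]            # token 0: even mix; token 1: trust memory

mixed = mag_combine(attn_out, mem_out, gate_logits)
```

A strongly negative logit hands the token almost entirely to the memory branch, which is the behavior the text describes for positions where long-range information matters.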
MAL is the simplest variant. The neural memory module is inserted as its own layer in the network, stacked before a sliding-window attention layer, so each block first compresses the sequence through the LMM and then attends locally[^1][^3]. Because the memory and the attention are sequential rather than parallel, MAL has less expressive capacity than MAG and tends to underperform the other two variants on the long-context benchmarks reported in the paper, although it remains the fastest of the three and matches Mamba 2 throughput in the efficiency study[^1][^3].
| Variant | Memory placement | Attention type used | Reported strength |
|---|---|---|---|
| MAC | Retrieved tokens prepended to segment | Full causal attention over [persistent, retrieved, segment] | Best long-range recall and reasoning[^1][^3] |
| MAG | Parallel branch combined via gate | Sliding window | Best efficiency-accuracy trade-off in language modeling[^1][^3] |
| MAL | Sequential layer before attention | Sliding window | Highest throughput, weaker on long-context tasks[^1][^3] |
Titans is evaluated at multiple scales on language modeling, common-sense reasoning, long-context retrieval and reasoning, time-series forecasting, and DNA modeling[^1][^3][^4]. Models are trained with AdamW at a learning rate of 4e-4 on the FineWeb-Edu corpus, with a 4K-token training context, and evaluated at much longer sequence lengths[^3].
The paper reports Titans trained at 170M, 340M, 400M, and 760M parameters on 15B to 30B tokens, and at 1.3B parameters on 100B tokens[^3][^4]. The 1.3B model is the largest reported in the paper and is the comparison point against Mamba, Mamba 2, DeltaNet, Gated DeltaNet, Retentive Network (RetNet), TTT, RWKV, and a strong Transformer++ baseline at matched parameter count[^3][^4].
On language modeling, the three Titans variants improve perplexity over both attention and recurrent baselines at matched scale. At 340M parameters, the paper reports a WikiText perplexity of 25.07 for Titans (MAG) versus 27.01 for Gated DeltaNet, and at 760M parameters Titans (MAC) reaches 19.93 versus 21.18 for Gated DeltaNet[^3]. On a suite of common-sense reasoning tasks including PIQA, HellaSwag, WinoGrande, ARC, SIQA, and BoolQ, Titans variants are reported as the strongest non-Transformer entries and competitive with Transformer++ at matched parameter count[^1][^3].
A separate ablation table in the paper isolates the contribution of individual mechanisms, with weight decay, momentum, persistent memory, and the move from linear to deep memory each producing measurable perplexity improvements[^3].
Long-context tasks are where Titans is reported as clearly differentiated from prior recurrent baselines[^1][^3]. On the S-NIAH subset of the RULER benchmark, a single-key Needle in a Haystack retrieval test, Titans (LMM) is reported at 96.2 percent accuracy on 16K-token contexts, compared with 88.4 percent for TTT, 71.4 percent for DeltaNet, and only 5.4 percent for Mamba 2 at the same sequence length[^4]. Titans (MAC) maintains near-perfect retrieval up to roughly 2 million tokens in the paper's needle-in-haystack curves[^1][^3].
On BABILong, a long-context multi-hop reasoning benchmark, the paper reports that Titans (MAC) outperforms not only the modern recurrent baselines but also GPT-4 and a Llama 3.1 8B model used with Retrieval-Augmented Generation over a roughly seventy-times-larger context, in both few-shot and fine-tuned settings[^1][^3]. This is the result most often cited in secondary coverage[^4][^7].
Beyond language, the paper applies the neural memory module to time-series forecasting on ETT, ECL, Traffic, and Weather, where it improves over Simba, iTransformer, RLinear, PatchTST, Crossformer, and TiDE, and to genomics modeling on the GenomicsBenchmarks, where it is competitive with HyenaDNA and Transformer++[^3]. These domains test whether the memory mechanism transfers beyond text and language modeling losses, and the authors interpret the results as evidence that the surprise plus decay rule is a general-purpose online-learning mechanism rather than a language-specific trick[^1][^3].
A throughput study reports that Titans (MAL) matches the Mamba 2 tokens-per-second curve, that the neural memory module alone is roughly 10 to 15 percent slower than Mamba 2 because of the deeper-MLP inner step, and that all variants scale linearly with sequence length, since attention is confined to fixed-size segments or windows[^3]. Going from L_M = 1 to L_M = 4 reduces throughput by a further 15 to 20 percent while improving long-context accuracy[^3].
The Titans paper places its own contribution in the context of three overlapping research threads.
State space models such as S4 and Mamba, gated linear attentions such as DeltaNet and Gated DeltaNet, and the broader family that includes Retentive Network (RetNet) and RWKV share an architectural pattern: compress the past into a hidden state of fixed size and update that state with a linear recurrence that admits a fast parallel scan at training time[^1][^4]. Titans inherits the linear-time training and constant-time inference of this family, but replaces the linear matrix-shaped state with the parameters of a deep MLP that is updated by gradient descent rather than by a closed-form linear rule[^1][^3]. The MIRAS follow-up paper makes the connection explicit by recasting different members of the recurrent family as different choices of memory architecture, attentional bias, retention gate, and learning algorithm, of which Titans is one configuration[^2][^6].
Titans is closely related to Test-Time Training (TTT), where the model performs a small number of gradient descent updates on a self-supervised loss during inference and uses the updated parameters as memory[^5]. The Titans authors describe TTT as a special case in which the inner memory is small and shallow, and they extend the idea with (a) the deep-MLP memory, (b) the momentum-plus-decay update rule rather than plain SGD, and (c) the explicit coupling to attention through the MAC, MAG, and MAL wirings[^1][^5]. The same connection makes Titans part of the broader test-time training and test-time compute literature, which uses computation at inference rather than additional pretraining data to improve performance[^5].
The "core" branch of every Titans variant is a softmax attention computation: full causal attention for MAC and sliding window attention for MAG and MAL[^1][^3]. The paper positions Titans not as a replacement for the Transformer but as an augmentation that gives the attention layer a separate, persistent, writable memory it can either be conditioned on (MAC), fused with via a gate (MAG), or chained with as a previous layer (MAL)[^1]. Hybrid models such as Samba and Griffin combine attention with Mamba or other recurrent layers, and Titans can be read as a hybrid in the same sense, with a meta-learned memory module replacing the state-space recurrence[^1].
| Family | Representative model | State / memory | Long-context behavior |
|---|---|---|---|
| Attention | Transformer | Full key-value cache | Strong recall, quadratic cost[^1] |
| State space | Mamba, Mamba 2 | Fixed-size linear state | Linear cost, degrades on retrieval[^1][^4] |
| Linear attention / gated | RetNet, DeltaNet, Gated DeltaNet | Matrix state with fast-weight update | Better recall than SSMs, still lossy[^1][^3] |
| Test-time trained | TTT[^5] | Shallow online-updated parameters | Online adaptation, limited capacity[^1][^5] |
| Online deep memory | Titans (this article) | Deep MLP updated with surprise + decay | Linear cost, reported strong long-context recall[^1][^3][^4] |
At the time the Titans paper appeared, the official code was not released, and early reproducibility coverage explicitly noted this as a limitation[^7]. An unofficial PyTorch implementation called titans-pytorch, maintained by Phil Wang (lucidrains) and released under the MIT license, became the most widely cited community port[^8]. It provides a MemoryAsContextTransformer class that exposes the MAC variant together with configurable segment length, persistent-memory token count, and long-term memory token count, and explicitly supports the deep-MLP memory rather than only the linear case[^8].
The Google Research team extended the line of work in 2025 with the MIRAS framework, which generalizes Titans by reframing several recurrent and attention-based architectures as different points in a four-axis design space and introduces three additional models, Moneta, Yaad, and Memora, derived from that space[^2][^6]. The Google Research blog post that introduced MIRAS describes it as the conceptual successor to Titans rather than a replacement, with Titans serving as one concrete instance of the broader recipe[^2]. Independent surveys of long-context language modeling cite Titans alongside Mamba 2, Gated DeltaNet, and TTT as part of a wave of architectures that try to break the attention-versus-recurrence trade-off by adding an online-updated memory component[^4][^7].
Several limitations are noted by the authors or by independent commentary.
The deep neural memory is more expensive per token than a comparable fixed linear state, and the paper reports a measurable throughput penalty when increasing memory depth, so practitioners must trade memory capacity against decoding speed[^3]. The improvements over baselines also depend on task and dataset: while gains are large on retrieval-heavy and reasoning-heavy long-context benchmarks, the relative advantage on short-context language modeling is more modest, and secondary coverage notes that the architecture's effectiveness "depends heavily on the task and dataset"[^7].
The largest Titans model reported in the paper is 1.3B parameters, trained on 100B tokens, which is small compared with current frontier large language models[^3][^4]. Whether the surprise plus decay mechanism continues to outperform attention and modern recurrences when scaled into the tens of billions of parameters and trillions of training tokens is an open empirical question that the paper does not resolve. The MIRAS extension provides a partial answer by showing that several memory-regularization choices, of which Titans is one, can be competitive, but does not extend the scaling envelope[^2][^6].
The initial release without code drew criticism in early community reviews, which complicated independent reproduction; the appearance of the unofficial titans-pytorch port partially addressed this, although results from third-party reproductions remain less extensive than the original paper's tables[^7][^8]. As with other meta-learning systems, the memory's behavior at inference time depends on the input distribution and can be harder to interpret than a fixed key-value cache, since the storage is the parameter vector of an MLP rather than an explicit set of tokens.
Finally, the Titans memory is per-sequence rather than across-sequence: at the end of a sequence the parameters are reset, so the long-term memory does not accumulate across user sessions in the way that an external retrieval store or an agent memory system would[^1][^2]. Combining the inner memory with explicit retrieval, such as Retrieval-Augmented Generation, is left to follow-up work.
The Titans paper is explicit that the inner update rule for the memory is structurally identical to mini-batch stochastic gradient descent with momentum and weight decay applied to the MLP parameters[^1][^3]. This identification has several practical consequences.
First, it lets the authors exploit existing parallelization tricks. If the inner step is M_t = (1 - alpha_t) M_{t-1} + S_t with S_t a momentum-smoothed gradient, then a chunk of tokens can be processed by treating the recurrence as a linear-in-state operation with parameters that depend on the inputs, and the chunk can be evaluated through a matrix-multiplication-heavy associative scan that maps well onto modern accelerators[^1][^3]. The result is that training time per token is comparable to attention-free alternatives even though the inner step is non-trivial[^3].
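The reason a scan applies is that each step M -> (1 - alpha_t) M + S_t is an affine map, and composing affine maps is associative. The sketch below checks this on scalars, assuming the S_t values for the chunk are already available; `combine` and `tree_reduce` are illustrative names, and real implementations operate on matrices rather than floats.

```python
def sequential(M0, alphas, S):
    """Reference: apply M_t = (1 - alpha_t) * M_{t-1} + S_t one step at a time."""
    M = M0
    for a, s in zip(alphas, S):
        M = (1.0 - a) * M + s
    return M

def combine(f, g):
    """Compose two affine steps (apply f, then g); a step (a, b) means M -> a*M + b."""
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)

def tree_reduce(steps):
    """Combine neighbours pairwise; any grouping is valid because combine is associative."""
    while len(steps) > 1:
        paired = [combine(steps[i], steps[i + 1]) for i in range(0, len(steps) - 1, 2)]
        if len(steps) % 2:
            paired.append(steps[-1])   # odd element carried to the next round, order kept
        steps = paired
    return steps[0]

alphas = [0.1, 0.0, 0.5, 0.2]          # forget gates for a 4-token chunk
S = [1.0, -0.5, 0.25, 2.0]             # surprise states, treated as given
M0 = 3.0

a, b = tree_reduce([(1.0 - al, s) for al, s in zip(alphas, S)])
M_scan = a * M0 + b                    # whole chunk applied as one affine map
M_seq = sequential(M0, alphas, S)
```

The pairwise combines within each round are independent of one another, which is what allows the chunk to be evaluated in parallel rather than in a strict left-to-right loop.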
Second, the rule makes hyperparameters of the memory (momentum coefficient, decay rate, learning rate per token) themselves learnable functions of the input rather than fixed scalars, so the model decides per token how strongly to write and how quickly to forget[^1]. The paper explicitly contrasts this with the static decay rates in earlier linear recurrences and reports in an ablation that removing weight decay alone costs roughly one perplexity point at the scales tested[^3].
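One simple way to make the three coefficients input-dependent is to project each token and squash the result into (0, 1). The sketch below assumes one projection vector per gate and a sigmoid squashing function; both are illustrative choices, and the parameter names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def token_gates(x, w_alpha, w_eta, w_theta):
    """Per-token forget gate, momentum decay, and inner learning rate."""
    alpha = sigmoid(dot(w_alpha, x))   # how much old memory to erase
    eta = sigmoid(dot(w_eta, x))       # how much past surprise to keep
    theta = sigmoid(dot(w_theta, x))   # how strongly to write the new gradient
    return alpha, eta, theta

x = [1.0, -2.0]                        # token representation (toy)
neutral = token_gates(x, [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
learned = token_gates(x, [3.0, 0.5], [-1.0, 1.0], [0.2, -0.4])
```

The projection vectors would be trained in the outer loop, so the model learns when to write strongly and when to forget, rather than using one fixed decay for every token.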
Third, framing the long-term memory as the output of an inner online learning process places Titans squarely in the meta-learning tradition: the outer training loop is optimizing the architecture and the projection matrices W_K, W_V, and W_Q (the query projection used when reading from the memory) so that an inner gradient-descent-with-momentum-and-decay procedure produces a useful memory for downstream prediction[^1][^2][^5]. The paper relates this view to a long sequence of earlier ideas including fast weights, hypernetworks, and Test-Time Training[^1][^5].
The paper offers a three-way taxonomy of memory that is reused throughout the architecture descriptions[^1][^3].
The core attention acts as immediate or working memory, holding the current segment in full resolution but with strict capacity bounds set by the attention window length. The neural long-term memory is a per-sequence dynamic store: the MLP parameters change as the sequence is consumed and reset between sequences. The persistent memory consists of a fixed set of learnable parameters that are trained end-to-end and remain unchanged at inference, functioning as task-specific knowledge prepended to every input[^1][^3].
This three-component decomposition lets each memory specialize. Persistent memory absorbs task-level priors, the neural memory absorbs sequence-level facts and dependencies, and attention handles tight local interactions. The split also gives the model an explicit separation between "things I learned during pretraining" and "things I learned from this specific sequence", which is harder to recover from a pure Transformer that mixes both in the same key-value cache.
The arXiv version of the paper documents several implementation choices that are not always reflected in summaries[^1][^3].
The model uses an AdamW optimizer for the outer training loop with a learning rate of 4e-4 and standard warmup-and-decay schedules on the FineWeb-Edu corpus[^3]. Training contexts are kept short (4K tokens) to limit compute, with longer contexts evaluated at inference, so the long-context behavior reported in the paper is in part a generalization claim: the surprise plus decay rule learned at 4K tokens still produces useful memory writes when the sequence stretches to 16K, 128K, or beyond[^1][^3].
The neural memory MLP uses GELU-style nonlinearities between layers, and the paper reports the depth L_M as the single most impactful hyperparameter for long-context tasks: moving from L_M = 1 (a linear map) to L_M = 2 produces most of the gain, with diminishing returns at L_M = 4[^3]. The persistent memory budget is on the order of a few tokens (four in the public titans-pytorch example configuration), while the long-term memory token budget is configurable per sequence and trades capacity against per-step cost[^8].
A separate convolutional pre-projection is reported in the ablation table as worth roughly 0.7 perplexity at the 760M scale, suggesting that a short causal convolution applied to keys, values, and queries before the memory and attention computations helps stabilize the inner gradient signal[^3]. This convolutional component is reminiscent of the short causal convolutions used in Mamba and several other modern recurrent architectures and appears to play a similar role in Titans[^3].
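A short causal convolution of the kind referred to here can be sketched for a single channel as follows. The kernel values and the function name are illustrative assumptions; the text does not specify the kernel size or weights.

```python
def causal_conv1d(x, kernel):
    """y[t] = sum_j kernel[j] * x[t - j], treating x[t - j] as 0 when t - j < 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            if t - j >= 0:             # never look at future positions
                acc += w * x[t - j]
        out.append(acc)
    return out

x = [1.0, 2.0, 3.0, 4.0]               # one channel of a key, value, or query stream
kernel = [0.5, 0.5]                    # hypothetical 2-tap kernel

y = causal_conv1d(x, kernel)

# Causality check: changing the last input leaves earlier outputs untouched.
y_changed = causal_conv1d([1.0, 2.0, 3.0, 99.0], kernel)
```

Each output mixes a token with its immediate predecessors only, which smooths the per-token signal without leaking information from later positions.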
Coverage of Titans in 2025 was largely positive, with several outlets characterizing the paper as one of the more substantive recent attempts to move past the attention-versus-recurrence dichotomy[^4][^7]. The DataCamp explainer described the design as introducing "inference-time learning" and emphasized the alignment with human-style episodic memory, where surprising events are stored more strongly[^4]. The AI Insider noted both the 2-million-token scaling claim and the open question of whether the gains generalize beyond the specific benchmarks reported[^7].
Within the research community, Titans was discussed alongside Mamba 2, Gated DeltaNet, and TTT[^5] as a representative of an emerging class of "online-memory" sequence models[^4][^7]. The release of the unofficial titans-pytorch port within roughly two weeks of the arXiv posting allowed third parties to experiment with the MAC variant without waiting for an official code release[^8].
A follow-up workshop paper, "TPTT: Transforming Pretrained Transformers into Titans", explored whether existing pretrained Transformer checkpoints could be retrofitted with a Titans-style neural memory module, citing the original Titans paper as the source architecture[^9]. The MIRAS extension by the same group then provided a generalization of the surprise plus decay recipe to a four-axis design space, with Titans appearing as one configuration alongside the new Moneta, Yaad, and Memora models[^2][^6].
Titans is significant primarily because it gives a concrete, parallelizable instantiation of the idea that the parameters of a small neural network can serve as the long-term memory of a much larger model, and that those parameters can be updated during inference using a structured online-learning rule[^1][^2]. By recasting the long-context problem as an online associative-memory problem with surprise as the write signal and a learned forget gate as the retention regularizer, the paper provides a vocabulary (surprise, retention, persistent memory) that the subsequent MIRAS framework reuses to organize a much broader family of architectures[^2][^6].
Secondary coverage from 2025 frames Titans as one of the first published designs that combines linear-time training, constant-time inference, deep online-updated memory, and reported long-context results competitive with or exceeding much larger Transformer baselines on BABILong-style tasks[^4][^7]. Whether the design will displace standard attention at frontier scales is unsettled, but it has already influenced both academic follow-up (MIRAS and the Moneta, Yaad, and Memora models) and open-source ports (titans-pytorch), and the MAC, MAG, and MAL taxonomy is now commonly cited as a reference point for how to integrate an online memory module with an attention backbone[^2][^6][^8].