Grokking, also called delayed generalization, is a phenomenon in deep learning where a neural network first memorizes its training data (achieving near-perfect training accuracy but random-level test accuracy), and then, after a prolonged period of additional training with no apparent progress, abruptly transitions to near-perfect generalization on held-out data. The term was introduced by Alethea Power and colleagues at OpenAI in a January 2022 paper titled "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets." The name draws from the science fiction verb "grok," coined by Robert Heinlein in his 1961 novel Stranger in a Strange Land, meaning to understand something so thoroughly that it becomes part of the observer.
Grokking challenges the conventional expectation that training and test performance improve in tandem. In standard supervised learning, a model's ability to generalize typically tracks its training loss closely, with overfitting manifesting as a gradual widening gap between training and validation metrics. Grokking breaks this pattern entirely: the model reaches zero training loss early, shows no improvement on test data for thousands or even millions of additional optimization steps, and then experiences a sudden, sharp jump to perfect test accuracy. This behavior has attracted significant attention from researchers studying generalization, regularization, and the internal mechanisms of neural networks.
Imagine you are learning your multiplication tables. At first, you just memorize each answer one by one: 3 times 4 is 12, 7 times 8 is 56, and so on. You can get the right answer for any problem you have already seen, but if someone asks you a new one you have not practiced, you are stuck.
Then one day, after lots and lots of practice, something clicks. You suddenly understand how multiplication works, not just the individual answers. Now you can figure out any multiplication problem, even ones you have never seen before.
Grokking in machine learning is exactly like that. A computer model first memorizes all the answers it has been shown. Then, after training for a very long time, it suddenly "gets it" and figures out the real rule behind the answers. The surprising part is that this understanding comes much, much later than the memorization, after a long stretch where it seems like nothing is improving at all.
The phenomenon was first documented by Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra, all affiliated with OpenAI at the time. Their paper, first presented at an ICLR 2021 workshop and widely circulated as a January 2022 preprint, systematically studied how small transformer models learn to perform binary operations on finite algebraic structures.
The experimental setup used a decoder-only transformer to predict outputs of equations of the form a ◦ b = c, where each symbol (a, the operator ◦, b, the equals sign, and c) was treated as a separate token. Each token was represented by a randomly initialized, learnable 256-dimensional embedding vector. After the transformer layers, a final linear layer mapped the output to class logits.
The paper tested the model on a range of binary operations over modular arithmetic groups and symmetric groups. These included:
| Operation | Description | Grokking observed |
|---|---|---|
| x + y (mod p) | Modular addition | Yes |
| x - y (mod p) | Modular subtraction | Yes |
| x * y (mod p) | Modular multiplication | Yes |
| x / y (mod p) | Modular division | Yes |
| x^2 + y^2 (mod p) | Sum of squares | Yes |
| x^2 + xy + y^2 (mod p) | Quadratic form (symmetric) | Yes |
| x^2 + xy + y^2 + x (mod p) | Quadratic form (asymmetric) | Yes |
| S_5 composition | Permutation group composition | Yes |
The modular operations were computed modulo a prime number p (commonly p = 97 or p = 113 in subsequent work), while the S_5 task used composition of permutations of five elements. The training set consisted of a fraction of all possible input pairs, with the remainder held out for testing.
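The data construction above can be sketched in a few lines. This toy uses modular addition with plain Python; the function name and split logic are illustrative, not taken from the paper:

```python
import itertools
import random

def make_modular_dataset(p=97, train_frac=0.5, seed=0):
    """Enumerate all input pairs (a, b) with labels (a + b) mod p, then
    split a fixed fraction into a training set; the rest is held out."""
    pairs = [(a, b, (a + b) % p) for a, b in itertools.product(range(p), repeat=2)]
    rng = random.Random(seed)
    rng.shuffle(pairs)
    cut = int(train_frac * len(pairs))
    return pairs[:cut], pairs[cut:]

train, test = make_modular_dataset(p=97, train_frac=0.5)
print(len(train), len(test))  # 4704 and 4705 of the 9409 total pairs
```

Because the full input space has only p^2 pairs, the held-out set is a substantial fraction of all possible problems, which is what makes the delayed jump in test accuracy so easy to observe in this setting.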
The most striking result was that for modular division, validation accuracy began increasing beyond chance level only after approximately 1,000 times more optimization steps than were required for training accuracy to reach near-optimal levels. The paper also documented consistent patterns in this delay: smaller training fractions lengthened the gap before generalization, and interventions such as weight decay substantially reduced it.
A typical grokking training run on modular addition proceeds through several recognizable stages when viewed through the lens of training and test loss curves:
Rapid memorization (early epochs): Training loss drops quickly to near zero. The model memorizes the training examples through a lookup-table-like mechanism, storing specific input-output mappings in its weights. Test loss remains at or near the level expected from random guessing.
Plateau (extended period): Both training loss and test loss remain nearly flat. Training loss stays near zero; test loss stays high. To an observer monitoring only these curves, it appears that the model has fully overfit and that continued training is wasted computation.
Sudden generalization (grokking point): Test loss drops sharply over a relatively small number of additional steps, transitioning from random-chance performance to near-perfect accuracy. Training loss may actually increase slightly during this transition as the model trades memorization-based accuracy for a more general solution.
Post-grokking convergence: Both training and test performance stabilize at high accuracy. The model has learned the underlying algorithmic rule rather than relying on memorized mappings.
The interval between memorization and generalization is sometimes called the "grokking time." This delay can range from hundreds of epochs to millions of optimization steps, depending on factors like the data fraction, weight decay strength, and learning rate.
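The stages above can be quantified directly from logged accuracy curves. The following sketch (function names and thresholds are our own, chosen for illustration) measures the grokking time as the gap between the step where training accuracy saturates and the step where test accuracy does:

```python
def first_crossing(acc_curve, threshold=0.99):
    """Return the first step index at which an accuracy curve exceeds
    the threshold, or None if it never does."""
    for step, acc in enumerate(acc_curve):
        if acc > threshold:
            return step
    return None

def grokking_delay(train_acc, test_acc, threshold=0.99):
    """Steps between (near-)perfect training accuracy and (near-)perfect
    test accuracy -- the 'grokking time' described above."""
    t_train = first_crossing(train_acc, threshold)
    t_test = first_crossing(test_acc, threshold)
    if t_train is None or t_test is None:
        return None
    return t_test - t_train

# Synthetic curves: memorization at step 3, generalization at step 8.
train_acc = [0.2, 0.7, 0.9, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
test_acc  = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 1.0, 1.0]
print(grokking_delay(train_acc, test_acc))  # 5
```

In real runs the same measurement applies, only with the two crossings separated by thousands to millions of steps rather than five.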
The delay between memorization and generalization occurs because the model initially finds a memorization solution that achieves zero training loss. This solution sits in a region of weight space with high parameter norms (large weights). Weight decay steadily applies pressure to reduce these norms, but the memorization solution is a local attractor that resists this pressure for a long time. Eventually, the weight decay destabilizes the memorization solution enough for the model to escape toward a different region of weight space where a lower-norm, generalizing solution exists. The transition, once it begins, proceeds rapidly because the generalizing solution, being simpler, is a strong attractor in the regularized loss landscape.
This explanation aligns with the observation that stronger weight decay shortens the grokking time (the destabilizing pressure is greater) while weaker weight decay lengthens it or prevents grokking entirely.
One of the most detailed studies of grokking's internal mechanisms was conducted by Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Their paper, "Progress Measures for Grokking via Mechanistic Interpretability," presented at ICLR 2023 as an oral presentation, fully reverse-engineered the algorithm learned by a small transformer trained on modular addition.
The study used a one-layer transformer with the following architecture:
| Component | Specification |
|---|---|
| Layers | 1 |
| Attention heads | 4 |
| Embedding dimension (d) | 128 |
| Head dimension | 32 (d/4) |
| MLP hidden units | 512 |
| Layer normalization | None |
| Embed/unembed matrices | Untied |
| Positional embeddings | Learned |
| Task | a + b mod 113 |
| Training data | 30% of all 113^2 input pairs |
| Optimizer | AdamW (lr = 0.001, weight decay = 1) |
| Training duration | 40,000 epochs (full batch) |
Nanda et al. discovered that the trained transformer implements a four-step algorithm based on discrete Fourier transforms and trigonometric identities:
Step 1, embedding: The network maps each input to sine and cosine components at specific frequencies. For inputs a and b, it computes sin(w_k * a), cos(w_k * a), sin(w_k * b), and cos(w_k * b), where w_k = 2 pi k / p for key frequency indices k.
Step 2, computing the sum via trigonometric identities: The attention and MLP layers combine the embedded components using the angle addition formulas, sin(w_k(a + b)) = sin(w_k a) cos(w_k b) + cos(w_k a) sin(w_k b) and cos(w_k(a + b)) = cos(w_k a) cos(w_k b) - sin(w_k a) sin(w_k b), producing sine and cosine components of the sum a + b.
Step 3, output computation: The network computes cos(w_k(a + b - c)) for each candidate output c using another trigonometric identity, effectively rotating back to find the correct answer.
Step 4, interference pattern: The network sums cosine waves across five key frequencies (k in {14, 35, 41, 42, 52} for p = 113). At the correct output c* = (a + b) mod 113, these waves constructively interfere (all producing the value 1), producing a large logit. At all other values of c, the waves destructively interfere, producing small logits.
This is a remarkably elegant algorithm: the transformer has independently discovered that modular addition can be performed by converting to frequency space, applying rotation, and using interference to select the correct answer.
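The interference mechanism can be verified directly in ordinary Python. This toy hard-codes the key frequencies reported by Nanda et al. and implements only the final step (the cosine-sum over candidate outputs), showing that constructive interference alone suffices to compute modular addition:

```python
import math

P = 113
KEY_FREQS = [14, 35, 41, 42, 52]  # key frequencies reported by Nanda et al.

def interference_logits(a, b):
    """Logit for each candidate output c: a sum of cosine waves that all
    equal 1 (constructive interference) only at c = (a + b) mod P."""
    return [
        sum(math.cos(2 * math.pi * k * (a + b - c) / P) for k in KEY_FREQS)
        for c in range(P)
    ]

def modadd(a, b):
    """Answer modular addition by taking the argmax over the logits."""
    logits = interference_logits(a, b)
    return max(range(P), key=lambda c: logits[c])

print(modadd(50, 90), (50 + 90) % P)  # both 27
```

Because p is prime and the key frequencies are nonzero, all five cosines equal 1 only when a + b - c is a multiple of p, so the argmax is always the correct residue; at every other c the waves partially cancel.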
Using their mechanistic understanding, Nanda et al. defined progress measures that revealed three continuous phases in the training process:
| Phase | Epoch range (p = 113) | What happens | Key indicators |
|---|---|---|---|
| Memorization | 0 to ~1,400 | Training loss drops rapidly; model memorizes training data via lookup-table-like circuits | Restricted loss remains high; excluded loss stays low |
| Circuit formation | ~1,400 to ~9,400 | Fourier multiplication circuit gradually forms in the weights; memorization circuits coexist with emerging generalizing circuits | Restricted loss begins declining; weight norm decreases; excluded loss increases |
| Cleanup | ~9,400 to ~14,000 | Weight decay eliminates the remaining memorization circuits; test accuracy jumps sharply (this is the visible "grokking" moment) | Test loss drops suddenly; weight norm drops sharply; Gini coefficient increases (sparsification in Fourier basis) |
Restricted loss was defined as the loss computed after ablating all Fourier components from the model's logits except those corresponding to the five key frequencies. This measures how much of the model's performance comes from the generalizing Fourier circuit.
Excluded loss was defined as the loss computed after removing only the key frequencies, measured on training data. This tracks how much performance relies on memorization rather than the Fourier circuit.
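A schematic toy can make the two ablations concrete. The sketch below is not the paper's procedure (which ablates Fourier components of a trained model's logits); it simply models the logits as a generalizing Fourier part plus a lookup-table spike, and shows which part each ablation isolates:

```python
import math

P = 113
KEY_FREQS = [14, 35, 41, 42, 52]

def fourier_part(target):
    """Cosine waves at the key frequencies, peaking at `target` -- a stand-in
    for the generalizing circuit's contribution to the logits."""
    return [sum(math.cos(2 * math.pi * k * (target - c) / P) for k in KEY_FREQS)
            for c in range(P)]

def memorization_part(memorized_answer, height=3.0):
    """A lookup-table-like spike at an arbitrary memorized output."""
    return [height if c == memorized_answer else 0.0 for c in range(P)]

a, b = 20, 30
correct = (a + b) % P  # 50
full = [f + m for f, m in zip(fourier_part(correct), memorization_part(99))]

# Restricted: keep only the key-frequency contribution.
restricted = fourier_part(correct)
# Excluded: remove the key-frequency contribution, leaving the spike.
excluded = [l - f for l, f in zip(full, fourier_part(correct))]

argmax = lambda xs: max(range(P), key=lambda i: xs[i])
print(argmax(restricted), argmax(excluded))  # 50 (correct) and 99 (memorized)
```

Low restricted loss therefore signals a working Fourier circuit, while low excluded loss on training data signals that memorization is still carrying the answer.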
The central finding was that grokking is not a sudden phase transition but rather the culmination of a gradual process. The generalizing circuit forms continuously during the circuit formation phase, well before any improvement is visible in test accuracy. The apparent "sudden" jump in test performance corresponds to the cleanup phase, when the memorization circuits are finally pruned away and the already-formed generalizing circuit dominates.
Weight decay (or equivalently, L2 regularization) plays a central role in grokking. Multiple studies have converged on a consistent picture of how it operates.
Weight decay adds a penalty proportional to the squared magnitude of the model's weights at each optimization step. In the context of grokking, this penalty has two effects:
Destabilizing the memorization solution: Memorization requires large weights to store lookup tables for specific input-output pairs. Weight decay continuously pushes these weights toward zero, creating ongoing tension with the memorization objective.
Favoring the generalizing solution: The generalizing solution (the Fourier multiplication circuit, in the case of modular addition) can achieve the same training accuracy with much smaller weights. Weight decay makes this solution energetically favorable in the regularized loss landscape.
The interplay between these two effects explains the long delay: the memorization solution is a local minimum that weight decay slowly erodes. The generalizing solution is a global minimum of the regularized loss, but reaching it requires escaping the basin of attraction of the memorization solution.
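The eroding pressure can be seen in miniature with a decoupled weight-decay update (AdamW-style). On the plateau the training-loss gradient is near zero, so the decay term alone contracts the weight norm geometrically; a minimal sketch, with illustrative hyperparameters:

```python
def sgd_step_with_decay(weights, grads, lr=0.1, weight_decay=0.1):
    """One SGD step with decoupled weight decay: the decay term shrinks
    the weights toward zero independently of the loss gradient."""
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]

# With a zero loss gradient (as on the plateau, where training loss is ~0),
# each step multiplies every weight by (1 - lr * weight_decay) = 0.99,
# so the norm decays geometrically toward the low-norm region.
w = [10.0, -8.0, 6.0]  # initial norm ~14.14
for _ in range(100):
    w = sgd_step_with_decay(w, grads=[0.0, 0.0, 0.0])
norm = sum(x * x for x in w) ** 0.5
print(round(norm, 2))  # ~5.18, i.e. 14.14 * 0.99**100
```

In a real run the loss gradient pushes back wherever shrinking the weights would break the memorized mappings, which is precisely the tension that stretches the plateau out.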
The strength of weight decay has a predictable relationship with grokking behavior:
| Weight decay strength | Effect on grokking |
|---|---|
| Zero (no weight decay) | No grokking occurs; model remains in memorization state indefinitely |
| Very low | Extremely delayed grokking; may take millions of steps |
| Moderate | Grokking occurs after a manageable delay |
| High | Grokking occurs faster but the transition is sharper |
| Very high | May prevent memorization entirely; model generalizes from the start (no grokking, but also potentially slower convergence) |
This relationship is consistent with the "LU mechanism" described by Liu, Michaud, and Tegmark in their Omnigrok paper (2022): when training and test losses are plotted against the weight norm, the training loss typically forms an "L" shape (flat at low norm, then rising) while the test loss forms a "U" shape (high at low norm, dropping to a minimum, then rising again). Grokking occurs because the model starts at a high-norm memorization solution and weight decay drives it toward the low-norm generalizing solution.
Several theoretical frameworks have been proposed to explain grokking. These explanations are not mutually exclusive; they highlight different aspects of the same underlying phenomenon.
One class of explanations connects grokking to the transition between two well-studied training regimes in neural network theory:
Lyu et al. (2024), in a paper presented at ICLR 2024, proved that when training homogeneous neural networks with large initialization and small weight decay, the optimization process first gets trapped at a solution corresponding to a kernel predictor (lazy regime) for a long time. Then a sharp transition to a minimum-norm or maximum-margin predictor (rich regime) occurs, causing a dramatic change in test accuracy. This provides a rigorous mathematical explanation for why grokking involves a long delay followed by a sudden transition.
Multiple researchers have drawn connections between grokking and phase transitions in physics. Rubin et al. demonstrated a mapping between grokking and first-order phase transitions, where the state of the network after grokking is analogous to a mixed phase following a first-order transition. This perspective helps explain the sharpness of the transition: just as water does not gradually become ice but rather undergoes a discontinuous change at a critical temperature, the model's transition from memorization to generalization can be abrupt once the critical conditions (driven by weight decay reducing the weight norm) are met.
Another line of work focuses on the simplicity of the generalizing solution relative to the memorizing solution. Neural networks trained with stochastic gradient descent exhibit a well-documented simplicity bias: given multiple solutions that achieve the same training loss, SGD tends to converge to simpler ones. In the context of grokking, the generalizing solution has lower weight norms and lower effective complexity than the memorizing solution. Weight decay amplifies this simplicity bias by explicitly penalizing complexity, eventually making the simpler generalizing solution the preferred minimum.
Research by Yoshida et al. (2023) bridged the lottery ticket hypothesis with grokking. The lottery ticket hypothesis posits that dense neural networks contain sparse subnetworks ("winning tickets") that can match the full network's performance. In the context of grokking, the transition from memorization to generalization corresponds to the identification and amplification of these sparse subnetworks. The memorization phase uses the full network in a distributed way, while the generalizing phase concentrates computation in a small, structured subnetwork. This work suggests that weight norm alone is insufficient to fully explain grokking; the reorganization of the network's internal structure into efficient subnetworks is equally important.
Thilak et al. (2022), in a study from Apple Machine Learning Research, identified an optimization anomaly they called the "slingshot mechanism" that is closely tied to grokking. When using adaptive optimizers like Adam, training at very late stages exhibits repeating cycles of stability and instability. The unstable phases are characterized by extremely large gradients, spiking training loss, and rapid growth of last-layer weights. Grokking was found to occur almost exclusively at the onset of these slingshot events, and was absent without them. This suggests that the instabilities created by adaptive optimizers may play a functional role in escaping the memorization basin, complementing the role of weight decay.
Recent theoretical work (2025) has proposed quantitative explanations for the delay in grokking. This line of research models grokking as a norm-driven representational phase transition: training first converges to a high-norm memorization solution, and the time required for weight decay to contract the weights toward the lower-norm generalizing solution determines the grokking time. The resulting "norm-separation delay law" provides predictions for how grokking time scales with weight decay strength, learning rate, and data fraction.
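A back-of-envelope version of this idea can be written down directly, assuming the delay is dominated by pure norm contraction under decoupled weight decay (this is a toy calculation, not the actual delay law from that work):

```python
import math

def steps_to_norm(w0_norm, target_norm, lr, wd):
    """Steps of pure decoupled weight decay needed to contract the weight
    norm from w0_norm to target_norm, using ||w_t|| = ||w_0||(1 - lr*wd)^t."""
    return math.log(target_norm / w0_norm) / math.log(1 - lr * wd)

# Halving the weight decay roughly doubles the contraction time, mirroring
# the observation that weaker decay lengthens the grokking delay.
t_strong = steps_to_norm(100.0, 10.0, lr=0.001, wd=1.0)
t_weak = steps_to_norm(100.0, 10.0, lr=0.001, wd=0.5)
print(round(t_weak / t_strong, 2))  # ~2.0
```

The same formula predicts that a larger gap between the initialization norm and the generalizing solution's norm (a larger w0_norm/target_norm ratio) lengthens the delay logarithmically.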
Double descent is a related phenomenon where test error follows a non-monotonic curve as model size or training time increases: error first decreases, then increases (the classical overfitting bump), and then decreases again. Grokking can be understood as an extreme form of epoch-wise double descent, where the second descent (the improvement in test performance after the overfitting phase) is delayed by a much larger gap.
Davies, Langosco, and Krueger (2023) formally unified grokking and double descent under a single framework they called pattern learning speeds. Their key insight was that both phenomena arise when a model learns multiple patterns at different speeds:
When the fast patterns are learned first and dominate predictions, test error rises (because the fast patterns overfit). When the slow patterns eventually become strong enough, test error falls. The difference between standard double descent and grokking is the magnitude of the delay: in double descent, the second descent follows relatively quickly; in grokking, it can be separated by orders of magnitude more training steps.
This framework also predicts model-wise grokking, where grokking occurs as a function of model size rather than training time. Davies et al. provided the first empirical demonstration of this effect.
| Phenomenon | What varies | Delay magnitude | Transition sharpness |
|---|---|---|---|
| Epoch-wise double descent | Training epochs | Moderate | Gradual |
| Model-wise double descent | Model size | Moderate | Gradual |
| Grokking (epoch-wise) | Training epochs | Very large (100x to 1,000,000x) | Sharp |
| Model-wise grokking | Model size | Large | Sharp |
A common early criticism of grokking research was that the phenomenon might be an artifact of the artificial, highly structured datasets used in the original experiments. Liu, Michaud, and Tegmark addressed this concern in their 2022 paper "Omnigrok: Grokking Beyond Algorithmic Data." By analyzing neural network loss landscapes and identifying the "LU mechanism" (the characteristic shapes of training and test loss as functions of weight norm), they were able to induce grokking on tasks involving images (MNIST), language, and molecular data. This demonstrated that grokking is a general property of neural network optimization under certain conditions, not a quirk limited to modular arithmetic.
The conditions for grokking in these broader settings typically require an initialization with a large weight norm and a limited amount of training data, so that the model first settles into the high-norm memorization region of the loss landscape before weight decay carries it to the low-norm generalizing one.
Wang et al. (2024), in a paper presented at NeurIPS 2024 titled "Grokked Transformers are Implicit Reasoners," demonstrated that transformers can learn implicit multi-step reasoning through grokking. Small transformers trained on reasoning tasks (composition and comparison) were found to use their feedforward layers to carry out multi-step logical inferences after sufficient grokking. For composition tasks with a large search space, a fully grokked transformer achieved near-perfect accuracy, while GPT-4-Turbo and Gemini-1.5-Pro failed regardless of prompting strategy or retrieval augmentation. This suggests that grokking may be relevant not just for simple arithmetic but for developing genuine reasoning capabilities in neural networks through parametric memory.
While most grokking research has focused on small models and controlled datasets, there is growing evidence that LLM pretraining exhibits grokking-like dynamics. Different capabilities may "grok" at different points during training, with some abilities emerging suddenly after long plateaus. The asynchronous nature of these transitions in large models makes them harder to detect than in the clean, single-task experiments typical of grokking research, but internal dynamics (tracked through probing and mechanistic interpretability) reveal similar patterns of delayed generalization.
The long delay between memorization and generalization is the main practical limitation of grokking. Lee, Kang, Kim, and Lee (2024) addressed this with Grokfast, a simple technique that accelerates grokking by more than 50x with only a few lines of code added to the training loop.
Grokfast treats the sequence of gradients for each parameter over training iterations as a temporal signal and decomposes it into two components: a fast-varying component, associated with overfitting, and a slow-varying component, associated with generalization.
By applying a low-pass filter to the gradient signal and amplifying the slow-varying component before passing it to the optimizer, Grokfast selectively accelerates the learning of generalizing patterns without disrupting the overall training dynamics.
Grokfast reduced grokking time by more than 50x across modular arithmetic tasks and also demonstrated effectiveness on tasks involving images, language, and graphs. Combining Grokfast with weight decay produced a synergistic effect, further reducing training time and improving stability.
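An exponential-moving-average (EMA) variant of such a gradient filter can be sketched in a few lines. The filter values below (alpha and lamb) are illustrative choices, not a claim about the paper's exact defaults:

```python
def grokfast_ema_filter(grads_stream, alpha=0.98, lamb=2.0):
    """Grokfast-style gradient filter (EMA variant): maintain an exponential
    moving average of past gradients (the slow-varying component) and add an
    amplified copy of it back to each raw gradient before the optimizer step."""
    ema = 0.0
    filtered = []
    for g in grads_stream:
        ema = alpha * ema + (1 - alpha) * g
        filtered.append(g + lamb * ema)
    return filtered

# A constant (purely slow) gradient signal is amplified toward (1 + lamb)x,
# while fast sign-flipping noise averages out of the EMA and passes through
# comparatively unamplified.
out = grokfast_ema_filter([1.0] * 500)
print(round(out[-1], 3))  # approaches 1 + lamb = 3.0
```

In practice the same filter runs per-parameter inside the training loop, with the filtered gradient handed to the existing optimizer, which is why the change amounts to only a few added lines.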
Research on grokking has identified four distinct learning outcomes that depend on the interaction between model capacity, data size, regularization strength, and learning rate:
| Regime | Training accuracy | Test accuracy | What happens |
|---|---|---|---|
| Comprehension | High | High | Model learns the rule quickly and generalizes; no delay between memorization and generalization |
| Grokking | High (early) | High (late) | Model memorizes first, then generalizes after a long delay |
| Memorization | High | Low | Model memorizes training data but never generalizes, even with extended training |
| Confusion | Low | Low | Model fails to learn even the training data; underfitting |
The boundaries between these regimes depend on several interacting factors. Higher weight decay and larger training fractions push the model toward comprehension. Lower weight decay, smaller training fractions, and large initial weight norms push toward memorization or grokking. Very high learning rates or very small models push toward confusion.
Grokking has several practical consequences for how researchers and practitioners approach model training:
Early stopping may be premature: The standard practice of halting training when validation loss begins to rise would prevent grokking from occurring. On structured or algorithmic tasks with small datasets, continued training beyond apparent overfitting may lead to dramatically better generalization.
Weight decay is more than a regularizer: In the context of grokking, weight decay does not merely prevent overfitting; it actively drives the model from a memorization solution to a qualitatively different generalizing solution. This suggests that the role of weight decay in neural network training may be more fundamental than the standard framing as a regularization technique.
Training curves can be misleading: A flat or slowly rising test loss does not necessarily mean the model has fully converged to a poor solution. Internal progress (circuit formation) may be occurring even when no external metric reflects it.
Compute considerations: Grokking requires training for far longer than would be needed just to achieve zero training loss. Whether this additional compute is justified depends on the task and whether cheaper alternatives (like increasing the training data fraction) can achieve the same generalization without the long delay.
Curriculum and data design: The grokking time depends heavily on the training data fraction and distribution. Research by Wang et al. (2024) suggests that the ratio between inferred and atomic facts in the training data, rather than absolute data size, determines grokking characteristics. This points to the possibility of designing training curricula that promote faster grokking.
Despite significant progress, several questions about grokking remain open: