Grokking, also called delayed generalization, is a phenomenon in deep learning where a neural network first memorizes its training data (achieving near-perfect training accuracy but random-level test accuracy), and then, after a prolonged period of additional training with no apparent progress, abruptly transitions to near-perfect generalization on held-out data. The term was introduced by Alethea Power and colleagues at OpenAI in a January 2022 paper titled "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets." The name draws from the science fiction verb "grok," coined by Robert Heinlein in his 1961 novel Stranger in a Strange Land, meaning to understand something so thoroughly that it becomes part of the observer.
Grokking challenges the conventional expectation that training and test performance improve in tandem. In standard supervised learning, a model's ability to generalize typically tracks its training loss closely, with overfitting manifesting as a gradual widening gap between training and validation metrics. Grokking breaks this pattern entirely: the model reaches zero training loss early, shows no improvement on test data for thousands or even millions of additional optimization steps, and then experiences a sudden, sharp jump to perfect test accuracy. This behavior has attracted significant attention from researchers studying generalization, regularization, and the internal mechanisms of neural networks.
Imagine you are learning your multiplication tables. At first, you just memorize each answer one by one: 3 times 4 is 12, 7 times 8 is 56, and so on. You can get the right answer for any problem you have already seen, but if someone asks you a new one you have not practiced, you are stuck.
Then one day, after lots and lots of practice, something clicks. You suddenly understand how multiplication works, not just the individual answers. Now you can figure out any multiplication problem, even ones you have never seen before.
Grokking in machine learning is exactly like that. A computer model first memorizes all the answers it has been shown. Then, after training for a very long time, it suddenly "gets it" and figures out the real rule behind the answers. The surprising part is that this understanding comes much, much later than the memorization, after a long stretch where it seems like nothing is improving at all.
The phenomenon was first documented by Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra, all affiliated with OpenAI at the time. Their paper, first presented at an ICLR 2021 workshop and widely circulated as a January 2022 preprint, systematically studied how small transformer models learn to perform binary operations on finite algebraic structures.
The experimental setup used a decoder-only transformer to predict outputs of equations of the form a ◦ b = c, where each symbol (a, the operator ◦, b, the equals sign, and c) was treated as a separate token. Each token was represented by a randomly initialized, learnable 256-dimensional embedding vector. After the transformer layers, a final linear layer mapped the output to class logits.
The paper tested the model on a range of binary operations over modular arithmetic groups and symmetric groups. These included:
| Operation | Description | Grokking observed |
|---|---|---|
| x + y (mod p) | Modular addition | Yes |
| x - y (mod p) | Modular subtraction | Yes |
| x * y (mod p) | Modular multiplication | Yes |
| x / y (mod p) | Modular division | Yes |
| x^2 + y^2 (mod p) | Sum of squares | Yes |
| x^2 + xy + y^2 (mod p) | Quadratic form (symmetric) | Yes |
| x^2 + xy + y^2 + x (mod p) | Quadratic form (asymmetric) | Yes |
| S_5 composition | Permutation group composition | Yes |
The modular operations were computed modulo a prime number p (commonly p = 97 or p = 113 in subsequent work), while the S_5 task used composition of permutations of five elements. The training set consisted of a fraction of all possible input pairs, with the remainder held out for testing.
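The data construction above can be sketched in a few lines. This toy uses modular addition with plain Python; the function name and split logic are illustrative, not taken from the paper:

```python
import itertools
import random

def make_modular_dataset(p=97, train_frac=0.5, seed=0):
    """Enumerate all input pairs (a, b) with labels (a + b) mod p, then
    split a fixed fraction into a training set; the rest is held out."""
    pairs = [(a, b, (a + b) % p) for a, b in itertools.product(range(p), repeat=2)]
    rng = random.Random(seed)
    rng.shuffle(pairs)
    cut = int(train_frac * len(pairs))
    return pairs[:cut], pairs[cut:]

train, test = make_modular_dataset(p=97, train_frac=0.5)
print(len(train), len(test))  # 4704 and 4705 of the 9409 total pairs
```

Because the full input space has only p^2 pairs, the held-out set is a substantial fraction of all possible problems, which is what makes the delayed jump in test accuracy so easy to observe in this setting.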
The most striking result was that for modular division, validation accuracy began increasing beyond chance level only after approximately 1,000 times more optimization steps than were required for training accuracy to reach near-optimal levels. The paper also documented consistent patterns in this delay: smaller training fractions lengthened the gap before generalization, and interventions such as weight decay substantially reduced it.
A typical grokking training run on modular addition proceeds through several recognizable stages when viewed through the lens of training and test loss curves:
Rapid memorization (early epochs): Training loss drops quickly to near zero. The model memorizes the training examples through a lookup-table-like mechanism, storing specific input-output mappings in its weights. Test loss remains at or near the level expected from random guessing.
Plateau (extended period): Both training loss and test loss remain nearly flat. Training loss stays near zero; test loss stays high. To an observer monitoring only these curves, it appears that the model has fully overfit and that continued training is wasted computation.
Sudden generalization (grokking point): Test loss drops sharply over a relatively small number of additional steps, transitioning from random-chance performance to near-perfect accuracy. Training loss may actually increase slightly during this transition as the model trades memorization-based accuracy for a more general solution.
Post-grokking convergence: Both training and test performance stabilize at high accuracy. The model has learned the underlying algorithmic rule rather than relying on memorized mappings.
The interval between memorization and generalization is sometimes called the "grokking time." This delay can range from hundreds of epochs to millions of optimization steps, depending on factors like the data fraction, weight decay strength, and learning rate.
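The stages above can be quantified directly from logged accuracy curves. The following sketch (function names and thresholds are our own, chosen for illustration) measures the grokking time as the gap between the step where training accuracy saturates and the step where test accuracy does:

```python
def first_crossing(acc_curve, threshold=0.99):
    """Return the first step index at which an accuracy curve exceeds
    the threshold, or None if it never does."""
    for step, acc in enumerate(acc_curve):
        if acc > threshold:
            return step
    return None

def grokking_delay(train_acc, test_acc, threshold=0.99):
    """Steps between (near-)perfect training accuracy and (near-)perfect
    test accuracy -- the 'grokking time' described above."""
    t_train = first_crossing(train_acc, threshold)
    t_test = first_crossing(test_acc, threshold)
    if t_train is None or t_test is None:
        return None
    return t_test - t_train

# Synthetic curves: memorization at step 3, generalization at step 8.
train_acc = [0.2, 0.7, 0.9, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
test_acc  = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 1.0, 1.0]
print(grokking_delay(train_acc, test_acc))  # 5
```

In real runs the same measurement applies, only with the two crossings separated by thousands to millions of steps rather than five.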
The delay between memorization and generalization occurs because the model initially finds a memorization solution that achieves zero training loss. This solution sits in a region of weight space with high parameter norms (large weights). Weight decay steadily applies pressure to reduce these norms, but the memorization solution is a local attractor that resists this pressure for a long time. Eventually, the weight decay destabilizes the memorization solution enough for the model to escape toward a different region of weight space where a lower-norm, generalizing solution exists. The transition, once it begins, proceeds rapidly because the generalizing solution, being simpler, is a strong attractor in the regularized loss landscape.
This explanation aligns with the observation that stronger weight decay shortens the grokking time (the destabilizing pressure is greater) while weaker weight decay lengthens it or prevents grokking entirely.
One of the most detailed studies of grokking's internal mechanisms was conducted by Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Their paper, "Progress Measures for Grokking via Mechanistic Interpretability," presented at ICLR 2023 as an oral presentation, fully reverse-engineered the algorithm learned by a small transformer trained on modular addition.
The study used a one-layer transformer with the following architecture:
| Component | Specification |
|---|---|
| Layers | 1 |
| Attention heads | 4 |
| Embedding dimension (d) | 128 |
| Head dimension | 32 (d/4) |
| MLP hidden units | 512 |
| Layer normalization | None |
| Embed/unembed matrices | Untied |
| Positional embeddings | Learned |
| Task | a + b mod 113 |
| Training data | 30% of all 113^2 input pairs |
| Optimizer | AdamW (lr = 0.001, weight decay = 1) |
| Training duration | 40,000 epochs (full batch) |
Nanda et al. discovered that the trained transformer implements a four-step algorithm based on discrete Fourier transforms and trigonometric identities:
Step 1, embedding: The network maps each input to sine and cosine components at specific frequencies. For inputs a and b, it computes sin(w_k * a), cos(w_k * a), sin(w_k * b), and cos(w_k * b), where w_k = 2 pi k / p for key frequency indices k.
Step 2, computing the sum via trigonometric identities: The attention and MLP layers combine the embedded components using the angle addition formulas, sin(w_k(a + b)) = sin(w_k a) cos(w_k b) + cos(w_k a) sin(w_k b) and cos(w_k(a + b)) = cos(w_k a) cos(w_k b) - sin(w_k a) sin(w_k b), producing sine and cosine components of the sum a + b.
Step 3, output computation: The network computes cos(w_k(a + b - c)) for each candidate output c using another trigonometric identity, effectively rotating back to find the correct answer.
Step 4, interference pattern: The network sums cosine waves across five key frequencies (k in {14, 35, 41, 42, 52} for p = 113). At the correct output c* = (a + b) mod 113, these waves constructively interfere (all producing the value 1), producing a large logit. At all other values of c, the waves destructively interfere, producing small logits.
This is a remarkably elegant algorithm: the transformer has independently discovered that modular addition can be performed by converting to frequency space, applying rotation, and using interference to select the correct answer.
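The interference mechanism can be verified directly in ordinary Python. This toy hard-codes the key frequencies reported by Nanda et al. and implements only the final step (the cosine-sum over candidate outputs), showing that constructive interference alone suffices to compute modular addition:

```python
import math

P = 113
KEY_FREQS = [14, 35, 41, 42, 52]  # key frequencies reported by Nanda et al.

def interference_logits(a, b):
    """Logit for each candidate output c: a sum of cosine waves that all
    equal 1 (constructive interference) only at c = (a + b) mod P."""
    return [
        sum(math.cos(2 * math.pi * k * (a + b - c) / P) for k in KEY_FREQS)
        for c in range(P)
    ]

def modadd(a, b):
    """Answer modular addition by taking the argmax over the logits."""
    logits = interference_logits(a, b)
    return max(range(P), key=lambda c: logits[c])

print(modadd(50, 90), (50 + 90) % P)  # both 27
```

Because p is prime and the key frequencies are nonzero, all five cosines equal 1 only when a + b - c is a multiple of p, so the argmax is always the correct residue; at every other c the waves partially cancel.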
Using their mechanistic understanding, Nanda et al. defined progress measures that revealed three continuous phases in the training process:
| Phase | Epoch range (p = 113) | What happens | Key indicators |
|---|---|---|---|
| Memorization | 0 to ~1,400 | Training loss drops rapidly; model memorizes training data via lookup-table-like circuits | Restricted loss remains high; excluded loss stays low |
| Circuit formation | ~1,400 to ~9,400 | Fourier multiplication circuit gradually forms in the weights; memorization circuits coexist with emerging generalizing circuits | Restricted loss begins declining; weight norm decreases; excluded loss increases |
| Cleanup | ~9,400 to ~14,000 | Weight decay eliminates the remaining memorization circuits; test accuracy jumps sharply (this is the visible "grokking" moment) | Test loss drops suddenly; weight norm drops sharply; Gini coefficient increases (sparsification in Fourier basis) |
Restricted loss was defined as the loss computed after ablating all Fourier components from the model's logits except those corresponding to the five key frequencies. This measures how much of the model's performance comes from the generalizing Fourier circuit.
Excluded loss was defined as the loss computed after removing only the key frequencies, measured on training data. This tracks how much performance relies on memorization rather than the Fourier circuit.
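A schematic toy can make the two ablations concrete. The sketch below is not the paper's procedure (which ablates Fourier components of a trained model's logits); it simply models the logits as a generalizing Fourier part plus a lookup-table spike, and shows which part each ablation isolates:

```python
import math

P = 113
KEY_FREQS = [14, 35, 41, 42, 52]

def fourier_part(target):
    """Cosine waves at the key frequencies, peaking at `target` -- a stand-in
    for the generalizing circuit's contribution to the logits."""
    return [sum(math.cos(2 * math.pi * k * (target - c) / P) for k in KEY_FREQS)
            for c in range(P)]

def memorization_part(memorized_answer, height=3.0):
    """A lookup-table-like spike at an arbitrary memorized output."""
    return [height if c == memorized_answer else 0.0 for c in range(P)]

a, b = 20, 30
correct = (a + b) % P  # 50
full = [f + m for f, m in zip(fourier_part(correct), memorization_part(99))]

# Restricted: keep only the key-frequency contribution.
restricted = fourier_part(correct)
# Excluded: remove the key-frequency contribution, leaving the spike.
excluded = [l - f for l, f in zip(full, fourier_part(correct))]

argmax = lambda xs: max(range(P), key=lambda i: xs[i])
print(argmax(restricted), argmax(excluded))  # 50 (correct) and 99 (memorized)
```

Low restricted loss therefore signals a working Fourier circuit, while low excluded loss on training data signals that memorization is still carrying the answer.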
The central finding was that grokking is not a sudden phase transition but rather the culmination of a gradual process. The generalizing circuit forms continuously during the circuit formation phase, well before any improvement is visible in test accuracy. The apparent "sudden" jump in test performance corresponds to the cleanup phase, when the memorization circuits are finally pruned away and the already-formed generalizing circuit dominates.
Weight decay (or equivalently, L2 regularization) plays a central role in grokking. Multiple studies have converged on a consistent picture of how it operates.
Weight decay adds a penalty proportional to the squared magnitude of the model's weights at each optimization step. In the context of grokking, this penalty has two effects:
Destabilizing the memorization solution: Memorization requires large weights to store lookup tables for specific input-output pairs. Weight decay continuously pushes these weights toward zero, creating ongoing tension with the memorization objective.
Favoring the generalizing solution: The generalizing solution (the Fourier multiplication circuit, in the case of modular addition) can achieve the same training accuracy with much smaller weights. Weight decay makes this solution energetically favorable in the regularized loss landscape.
The interplay between these two effects explains the long delay: the memorization solution is a local minimum that weight decay slowly erodes. The generalizing solution is a global minimum of the regularized loss, but reaching it requires escaping the basin of attraction of the memorization solution.
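The eroding pressure can be seen in miniature with a decoupled weight-decay update (AdamW-style). On the plateau the training-loss gradient is near zero, so the decay term alone contracts the weight norm geometrically; a minimal sketch, with illustrative hyperparameters:

```python
def sgd_step_with_decay(weights, grads, lr=0.1, weight_decay=0.1):
    """One SGD step with decoupled weight decay: the decay term shrinks
    the weights toward zero independently of the loss gradient."""
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]

# With a zero loss gradient (as on the plateau, where training loss is ~0),
# each step multiplies every weight by (1 - lr * weight_decay) = 0.99,
# so the norm decays geometrically toward the low-norm region.
w = [10.0, -8.0, 6.0]  # initial norm ~14.14
for _ in range(100):
    w = sgd_step_with_decay(w, grads=[0.0, 0.0, 0.0])
norm = sum(x * x for x in w) ** 0.5
print(round(norm, 2))  # ~5.18, i.e. 14.14 * 0.99**100
```

In a real run the loss gradient pushes back wherever shrinking the weights would break the memorized mappings, which is precisely the tension that stretches the plateau out.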
The strength of weight decay has a predictable relationship with grokking behavior:
| Weight decay strength | Effect on grokking |
|---|---|
| Zero (no weight decay) | No grokking occurs; model remains in memorization state indefinitely |
| Very low | Extremely delayed grokking; may take millions of steps |
| Moderate | Grokking occurs after a manageable delay |
| High | Grokking occurs faster but the transition is sharper |
| Very high | May prevent memorization entirely; model generalizes from the start (no grokking, but also potentially slower convergence) |
This relationship is consistent with the "LU mechanism" described by Liu, Michaud, and Tegmark in their Omnigrok paper (2022): when training and test losses are plotted against the weight norm, the training loss typically forms an "L" shape (flat at low norm, then rising) while the test loss forms a "U" shape (high at low norm, dropping to a minimum, then rising again). Grokking occurs because the model starts at a high-norm memorization solution and weight decay drives it toward the low-norm generalizing solution.
Several theoretical frameworks have been proposed to explain grokking. These explanations are not mutually exclusive; they highlight different aspects of the same underlying phenomenon.
One class of explanations connects grokking to the transition between two well-studied training regimes in neural network theory:
Lyu et al. (2024), in a paper presented at ICLR 2024, proved that when training homogeneous neural networks with large initialization and small weight decay, the optimization process first gets trapped at a solution corresponding to a kernel predictor (lazy regime) for a long time. Then a sharp transition to a minimum-norm or maximum-margin predictor (rich regime) occurs, causing a dramatic change in test accuracy. This provides a rigorous mathematical explanation for why grokking involves a long delay followed by a sudden transition.
Multiple researchers have drawn connections between grokking and phase transitions in physics. Rubin et al. demonstrated a mapping between grokking and first-order phase transitions, where the state of the network after grokking is analogous to a mixed phase following a first-order transition. This perspective helps explain the sharpness of the transition: just as water does not gradually become ice but rather undergoes a discontinuous change at a critical temperature, the model's transition from memorization to generalization can be abrupt once the critical conditions (driven by weight decay reducing the weight norm) are met.
Another line of work focuses on the simplicity of the generalizing solution relative to the memorizing solution. Neural networks trained with stochastic gradient descent exhibit a well-documented simplicity bias: given multiple solutions that achieve the same training loss, SGD tends to converge to simpler ones. In the context of grokking, the generalizing solution has lower weight norms and lower effective complexity than the memorizing solution. Weight decay amplifies this simplicity bias by explicitly penalizing complexity, eventually making the simpler generalizing solution the preferred minimum.
Research by Yoshida et al. (2023) bridged the lottery ticket hypothesis with grokking. The lottery ticket hypothesis posits that dense neural networks contain sparse subnetworks ("winning tickets") that can match the full network's performance. In the context of grokking, the transition from memorization to generalization corresponds to the identification and amplification of these sparse subnetworks. The memorization phase uses the full network in a distributed way, while the generalizing phase concentrates computation in a small, structured subnetwork. This work suggests that weight norm alone is insufficient to fully explain grokking; the reorganization of the network's internal structure into efficient subnetworks is equally important.
Thilak et al. (2022), in a study from Apple Machine Learning Research, identified an optimization anomaly they called the "slingshot mechanism" that is closely tied to grokking. When using adaptive optimizers like Adam, training at very late stages exhibits repeating cycles of stability and instability. The unstable phases are characterized by extremely large gradients, spiking training loss, and rapid growth of last-layer weights. Grokking was found to occur almost exclusively at the onset of these slingshot events, and was absent without them. This suggests that the instabilities created by adaptive optimizers may play a functional role in escaping the memorization basin, complementing the role of weight decay.
Recent theoretical work (2025) has proposed quantitative explanations for the delay in grokking. This line of research models grokking as a norm-driven representational phase transition: training first converges to a high-norm memorization solution, and the time required for weight decay to contract the weights toward the lower-norm generalizing solution determines the grokking time. The resulting "norm-separation delay law" provides predictions for how grokking time scales with weight decay strength, learning rate, and data fraction.
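A back-of-envelope version of this idea can be written down directly, assuming the delay is dominated by pure norm contraction under decoupled weight decay (this is a toy calculation, not the actual delay law from that work):

```python
import math

def steps_to_norm(w0_norm, target_norm, lr, wd):
    """Steps of pure decoupled weight decay needed to contract the weight
    norm from w0_norm to target_norm, using ||w_t|| = ||w_0||(1 - lr*wd)^t."""
    return math.log(target_norm / w0_norm) / math.log(1 - lr * wd)

# Halving the weight decay roughly doubles the contraction time, mirroring
# the observation that weaker decay lengthens the grokking delay.
t_strong = steps_to_norm(100.0, 10.0, lr=0.001, wd=1.0)
t_weak = steps_to_norm(100.0, 10.0, lr=0.001, wd=0.5)
print(round(t_weak / t_strong, 2))  # ~2.0
```

The same formula predicts that a larger gap between the initialization norm and the generalizing solution's norm (a larger w0_norm/target_norm ratio) lengthens the delay logarithmically.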
Double descent is a related phenomenon where test error follows a non-monotonic curve as model size or training time increases: error first decreases, then increases (the classical overfitting bump), and then decreases again. Grokking can be understood as an extreme form of epoch-wise double descent, where the second descent (the improvement in test performance after the overfitting phase) is delayed by a much larger gap.
Davies, Langosco, and Krueger (2023) formally unified grokking and double descent under a single framework they called pattern learning speeds. Their key insight was that both phenomena arise when a model learns multiple patterns at different speeds:
When the fast patterns are learned first and dominate predictions, test error rises (because the fast patterns overfit). When the slow patterns eventually become strong enough, test error falls. The difference between standard double descent and grokking is the magnitude of the delay: in double descent, the second descent follows relatively quickly; in grokking, it can be separated by orders of magnitude more training steps.
This framework also predicts model-wise grokking, where grokking occurs as a function of model size rather than training time. Davies et al. provided the first empirical demonstration of this effect.
| Phenomenon | What varies | Delay magnitude | Transition sharpness |
|---|---|---|---|
| Epoch-wise double descent | Training epochs | Moderate | Gradual |
| Model-wise double descent | Model size | Moderate | Gradual |
| Grokking (epoch-wise) | Training epochs | Very large (100x to 1,000,000x) | Sharp |
| Model-wise grokking | Model size | Large | Sharp |
A common early criticism of grokking research was that the phenomenon might be an artifact of the artificial, highly structured datasets used in the original experiments. Liu, Michaud, and Tegmark addressed this concern in their 2022 paper "Omnigrok: Grokking Beyond Algorithmic Data." By analyzing neural network loss landscapes and identifying the "LU mechanism" (the characteristic shapes of training and test loss as functions of weight norm), they were able to induce grokking on tasks involving images (MNIST), language, and molecular data. This demonstrated that grokking is a general property of neural network optimization under certain conditions, not a quirk limited to modular arithmetic.
The conditions for grokking in these broader settings typically require an initialization with a large weight norm and a limited amount of training data, so that the model first settles into the high-norm memorization region of the loss landscape before weight decay carries it to the low-norm generalizing one.
Wang et al. (2024), in a paper presented at NeurIPS 2024 titled "Grokked Transformers are Implicit Reasoners," demonstrated that transformers can learn implicit multi-step reasoning through grokking. Small transformers trained on reasoning tasks (composition and comparison) were found to use their feedforward layers to carry out multi-step logical inferences after sufficient grokking. For composition tasks with a large search space, a fully grokked transformer achieved near-perfect accuracy, while GPT-4-Turbo and Gemini-1.5-Pro failed regardless of prompting strategy or retrieval augmentation. This suggests that grokking may be relevant not just for simple arithmetic but for developing genuine reasoning capabilities in neural networks through parametric memory.
While most grokking research has focused on small models and controlled datasets, there is growing evidence that LLM pretraining exhibits grokking-like dynamics. Different capabilities may "grok" at different points during training, with some abilities emerging suddenly after long plateaus. The asynchronous nature of these transitions in large models makes them harder to detect than in the clean, single-task experiments typical of grokking research, but internal dynamics (tracked through probing and mechanistic interpretability) reveal similar patterns of delayed generalization.
The long delay between memorization and generalization is the main practical limitation of grokking. Lee, Kang, Kim, and Lee (2024) addressed this with Grokfast, a simple technique that accelerates grokking by more than 50x with only a few lines of code added to the training loop.
Grokfast treats the sequence of gradients for each parameter over training iterations as a temporal signal and decomposes it into two components: a fast-varying component, associated with overfitting, and a slow-varying component, associated with generalization.
By applying a low-pass filter to the gradient signal and amplifying the slow-varying component before passing it to the optimizer, Grokfast selectively accelerates the learning of generalizing patterns without disrupting the overall training dynamics.
Grokfast reduced grokking time by more than 50x across modular arithmetic tasks and also demonstrated effectiveness on tasks involving images, language, and graphs. Combining Grokfast with weight decay produced a synergistic effect, further reducing training time and improving stability.
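An exponential-moving-average (EMA) variant of such a gradient filter can be sketched in a few lines. The filter values below (alpha and lamb) are illustrative choices, not a claim about the paper's exact defaults:

```python
def grokfast_ema_filter(grads_stream, alpha=0.98, lamb=2.0):
    """Grokfast-style gradient filter (EMA variant): maintain an exponential
    moving average of past gradients (the slow-varying component) and add an
    amplified copy of it back to each raw gradient before the optimizer step."""
    ema = 0.0
    filtered = []
    for g in grads_stream:
        ema = alpha * ema + (1 - alpha) * g
        filtered.append(g + lamb * ema)
    return filtered

# A constant (purely slow) gradient signal is amplified toward (1 + lamb)x,
# while fast sign-flipping noise averages out of the EMA and passes through
# comparatively unamplified.
out = grokfast_ema_filter([1.0] * 500)
print(round(out[-1], 3))  # approaches 1 + lamb = 3.0
```

In practice the same filter runs per-parameter inside the training loop, with the filtered gradient handed to the existing optimizer, which is why the change amounts to only a few added lines.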
Research on grokking has identified four distinct learning outcomes that depend on the interaction between model capacity, data size, regularization strength, and learning rate:
| Regime | Training accuracy | Test accuracy | What happens |
|---|---|---|---|
| Comprehension | High | High | Model learns the rule quickly and generalizes; no delay between memorization and generalization |
| Grokking | High (early) | High (late) | Model memorizes first, then generalizes after a long delay |
| Memorization | High | Low | Model memorizes training data but never generalizes, even with extended training |
| Confusion | Low | Low | Model fails to learn even the training data; underfitting |
The boundaries between these regimes depend on several interacting factors. Higher weight decay and larger training fractions push the model toward comprehension. Lower weight decay, smaller training fractions, and large initial weight norms push toward memorization or grokking. Very high learning rates or very small models push toward confusion.
Grokking has several practical consequences for how researchers and practitioners approach model training:
Early stopping may be premature: The standard practice of halting training when validation loss begins to rise would prevent grokking from occurring. On structured or algorithmic tasks with small datasets, continued training beyond apparent overfitting may lead to dramatically better generalization.
Weight decay is more than a regularizer: In the context of grokking, weight decay does not merely prevent overfitting; it actively drives the model from a memorization solution to a qualitatively different generalizing solution. This suggests that the role of weight decay in neural network training may be more fundamental than the standard framing as a regularization technique.
Training curves can be misleading: A flat or slowly rising test loss does not necessarily mean the model has fully converged to a poor solution. Internal progress (circuit formation) may be occurring even when no external metric reflects it.
Compute considerations: Grokking requires training for far longer than would be needed just to achieve zero training loss. Whether this additional compute is justified depends on the task and whether cheaper alternatives (like increasing the training data fraction) can achieve the same generalization without the long delay.
Curriculum and data design: The grokking time depends heavily on the training data fraction and distribution. Research by Wang et al. (2024) suggests that the ratio between inferred and atomic facts in the training data, rather than absolute data size, determines grokking characteristics. This points to the possibility of designing training curricula that promote faster grokking.
Despite significant progress, several questions about grokking remain open: