# Sparse Representation

> Source: https://aiwiki.ai/wiki/sparse_representation
> Updated: 2026-06-23
> Categories: Data & Datasets, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

**Sparse representation** is a data encoding strategy in which most values in a vector, matrix, or tensor are zero (or near zero), with only a small fraction of elements carrying non-zero values. The same principle spans signal processing, [machine learning](/wiki/machine_learning), [natural language processing](/wiki/natural_language_processing), and [deep learning](/wiki/deep_learning): a 10,000-dimensional vector with only 50 non-zero entries is 99.5% sparse, and storing only those non-zero entries can cut memory by thousands of times versus a [dense feature](/wiki/dense_feature) representation. Sparsity is also the central tool in modern AI interpretability, where sparse autoencoders decomposed one layer of a transformer into more than 4,000 human-interpretable features in Anthropic's 2023 "Towards Monosemanticity" study and into 34 million features in the 2024 follow-up on Claude 3 Sonnet.[11][12]

The idea of sparsity is grounded in a simple observation: many real-world signals and datasets contain far more zeros than meaningful values. A document represented as a [bag of words](/wiki/bag_of_words) vector over a 100,000-word vocabulary might activate only a few hundred entries. A social network adjacency matrix with millions of users has connections for only a tiny fraction of all possible pairs. Recognizing and exploiting this structure has been one of the most productive themes in computational science.

## What is a sparse representation?

A representation is called **sparse** when the vast majority of its elements are exactly zero. The **sparsity ratio** of a vector is the proportion of zero-valued entries to the total number of entries. For example, a 10,000-dimensional vector with only 50 non-zero entries has a sparsity ratio of 99.5%.

Formally, given a data vector **x** in R^n, a sparse representation seeks a coefficient vector **a** in R^m (with m often greater than n) such that:

```
x ≈ D * a
```

where **D** is a dictionary (or basis) matrix and **a** has very few non-zero entries. The goal is to minimize the number of active coefficients (the L0 "norm") while keeping the reconstruction error small. Because direct L0 minimization is NP-hard, practical algorithms often substitute the L1 norm as a convex relaxation.

### How do sparse and dense representations differ?

The distinction between [sparse feature](/wiki/sparse_feature) and [dense feature](/wiki/dense_feature) representations is fundamental in machine learning.

| Property | Sparse Representation | Dense Representation |
|---|---|---|
| Dimensionality | High (often thousands to millions) | Low (typically 50 to 1,024) |
| Zero entries | Majority of values are zero | Few or no zero values |
| Storage | Stores only non-zero values and indices | Stores all values |
| Interpretability | High: each dimension maps to a specific [feature](/wiki/feature) | Low: dimensions encode latent patterns |
| Semantic capture | Exact keyword or feature matching | Captures synonymy and analogy |
| Typical examples | [One-hot encoding](/wiki/one-hot_encoding), [TF-IDF](/wiki/tf_idf), [bag of words](/wiki/bag_of_words) | [Word2Vec](/wiki/word2vec), [BERT](/wiki/bert) embeddings |
| Computation | Fast with specialized sparse libraries | Fast with GPU-optimized dense math |

Dense representations compress information into every dimension, capturing latent semantic relationships. Sparse representations preserve explicit feature identity, making them more interpretable but requiring specialized data structures to handle efficiently.

## Sparse Vectors in Natural Language Processing

Some of the most widely used sparse representations come from text processing, where documents must be converted into numerical vectors for machine learning models.

### One-Hot Encoding

[One-hot encoding](/wiki/one-hot_encoding) is the simplest sparse vector scheme. Each word in the vocabulary is represented by a vector of length V (the vocabulary size) containing a single 1 and V-1 zeros. For a vocabulary of 50,000 words, every word becomes a 50,000-dimensional vector that is 99.998% zeros. One-hot vectors are orthogonal to each other, meaning the representation captures no similarity between words: "cat" and "kitten" are just as distant as "cat" and "refrigerator."

### Bag of Words (BoW)

The [bag of words](/wiki/bag_of_words) model represents a document as a single sparse vector of word counts. Each dimension corresponds to a vocabulary term, and its value is the number of times that term appears in the document. While BoW discards word order and grammar, it remains effective for tasks like text classification and [sentiment analysis](/wiki/sentiment_analysis). A typical BoW vector over a large vocabulary is extremely sparse, since any individual document uses only a small subset of the full vocabulary.

### TF-IDF

[Term Frequency-Inverse Document Frequency](/wiki/tf_idf) refines the BoW approach by weighting each term according to both its frequency in the current document (TF) and how rare it is across the entire corpus (IDF). Common words like "the" receive low weights, while distinctive terms receive high weights. The resulting vectors remain sparse but carry richer information than raw counts. TF-IDF vectors served as the backbone of information retrieval systems for decades before the rise of neural methods.

### Learned Sparse Representations (SPLADE)

Recent advances have combined the interpretability and efficiency of sparse vectors with the semantic power of [neural networks](/wiki/neural_network). SPLADE (Sparse Lexical and Expansion Model), introduced at SIGIR 2021 by Thibault Formal, Benjamin Piwowarski, and Stephane Clinchant, uses a [transformer](/wiki/transformer)-based architecture to produce sparse document and query representations.[4] SPLADE leverages the masked language model head of [BERT](/wiki/bert) to predict term importance scores over the full vocabulary, then applies sparse regularization (a log-saturation activation plus a FLOPS penalty) to ensure most scores are zero. The result is a sparse vector that can be stored in a traditional inverted index while capturing semantic expansion (for example, a document about "automobiles" might also receive a non-zero weight for the term "cars"). SPLADE v2 further improved performance through knowledge distillation and refined training objectives, achieving retrieval quality comparable to dense neural models with the latency benefits of sparse inverted-index search.[4]

## Sparse Matrices and Storage Formats

When sparse vectors are collected into a matrix, or when the problem domain itself produces sparse matrices (for instance, [graph](/wiki/graph_neural_network) adjacency matrices or user-item interaction matrices in [recommendation systems](/wiki/recommender_system)), specialized storage formats become essential. Storing a 100,000 x 100,000 matrix densely requires 80 GB of memory (at 8 bytes per entry), but if only 0.01% of entries are non-zero, a sparse format needs only a few megabytes.

### Common Sparse Matrix Formats

| Format | Full Name | Storage Scheme | Best For |
|---|---|---|---|
| COO | Coordinate | Three arrays: row indices, column indices, values | Incremental construction, format conversion |
| CSR | Compressed Sparse Row | Row pointer array, column index array, values array | Row slicing, matrix-vector products |
| CSC | Compressed Sparse Column | Column pointer array, row index array, values array | Column slicing, solving linear systems |
| BSR | Block Sparse Row | Like CSR but stores dense sub-blocks | Finite element matrices with block structure |
| DOK | Dictionary of Keys | Hash map from (row, col) to value | Incremental construction, random access |
| LIL | List of Lists | Each row stored as a list of (column, value) pairs | Row-wise construction |

**COO (Coordinate format)** stores each non-zero entry as a triplet of (row, column, value). It is the simplest format and allows duplicate entries, making it convenient for assembling matrices incrementally. However, arithmetic operations and slicing are slow with COO.

**CSR (Compressed Sparse Row)** replaces the per-element row indices of COO with a compact row-pointer array. For a matrix with N rows, CSR uses an array of length N+1 where entry i gives the starting position in the column-index and value arrays for row i. CSR is the most popular format for matrix-vector multiplication and is the default in SciPy and many solver libraries.

**CSC (Compressed Sparse Column)** is the transpose analog of CSR. It compresses column indices instead of row indices, making column-based operations efficient. CSC is preferred by direct sparse solvers such as UMFPACK and SuperLU.

## Sparse Coding

Sparse coding is a class of unsupervised learning algorithms that seek to find a set of basis vectors (a "dictionary") such that input data can be represented as a sparse linear combination of those bases. The seminal work by Bruno Olshausen and David Field, published in Nature in 1996, showed that applying a sparsity objective to natural image patches yields basis functions that resemble the oriented, bandpass receptive fields of simple cells in the mammalian primary visual cortex (V1).[1] As the authors put it, "a coding strategy that maximizes sparseness is sufficient to account for these properties."[1] This finding provided both a computational motivation for sparsity (efficient coding of natural scenes) and a neurobiological validation (the visual system appears to employ sparse codes).[1]

Key sparse coding algorithms include:

- **Matching Pursuit (MP):** A greedy iterative algorithm that selects the dictionary atom best correlated with the current residual, subtracts its contribution, and repeats until a stopping criterion is met.
- **Orthogonal Matching Pursuit (OMP):** An extension of MP that re-optimizes all selected coefficients at each iteration using least squares, producing orthogonal residuals and faster convergence.
- **Basis Pursuit (BP):** Formulates the sparse coding problem as an L1 minimization subject to an equality or inequality constraint on reconstruction error. This convex program can be solved with linear programming methods.
- **LASSO (Least Absolute Shrinkage and Selection Operator):** Adds an L1 penalty to a least-squares objective, providing a continuous trade-off between reconstruction fidelity and sparsity. LASSO is widely used in [regression](/wiki/regression) and [feature selection](/wiki/feature_selection).
- **K-SVD:** An iterative dictionary learning algorithm that alternates between sparse coding (using OMP) and dictionary update steps (using truncated SVD). K-SVD is popular in image processing applications such as denoising and inpainting.[9]

## Compressed Sensing

Compressed sensing (also called compressive sampling) is a signal acquisition framework that leverages sparsity to recover signals from far fewer measurements than traditional Nyquist-rate sampling requires. The theoretical foundations were established independently by Emmanuel Candes, Justin Romberg, and Terence Tao,[3] and by David Donoho,[2] in a series of papers published around 2006.

The core idea is that if a signal is sparse (or compressible) in some basis, it can be recovered exactly from a small number of random linear measurements by solving an L1 minimization problem.[2] Two key conditions underpin the theory:

1. **Sparsity:** The signal must have a sparse representation in some known basis or dictionary.
2. **Restricted Isometry Property (RIP):** The measurement matrix must approximately preserve the distances between sparse signals, ensuring that the sparse structure is not destroyed during measurement.[3]

Compressed sensing has found applications in MRI acceleration (reducing scan times by a factor of 4 to 8), single-pixel cameras, radar imaging, and astronomical signal processing. It demonstrates that sparsity is not merely a computational convenience but a fundamental property that enables entirely new approaches to data acquisition.

## Sparsity in Neural Networks

Sparsity plays multiple roles in modern [neural network](/wiki/neural_network) architectures, from the activation functions used in individual neurons to the large-scale architecture of [large language models](/wiki/large_language_model).

### Activation Sparsity and ReLU

The Rectified Linear Unit ([ReLU](/wiki/relu)) activation function, defined as f(x) = max(0, x), naturally produces sparse activations by setting all negative pre-activation values to zero. In a typical ReLU network, 50% or more of activations may be exactly zero at any given time. This "activation sparsity" has both computational and representational benefits: sparse activations require fewer multiplications during the forward pass, and they encourage the network to develop disentangled feature representations.[8]

Recent research has revisited ReLU specifically for its sparsity properties in large language models. The paper "ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models" (2023) showed that replacing non-sparse activation functions (such as GELU or SiLU) with ReLU in [transformer](/wiki/transformer) models can yield 50% or greater activation sparsity with minimal accuracy loss, enabling significant inference speedups through sparse matrix operations.[7]

### Weight Pruning

[Neural network pruning](/wiki/pruning) removes weights (connections) from a trained network to create a sparser model. The central insight, articulated in the "Lottery Ticket Hypothesis" by Jonathan Frankle and Michael Carbin (2019), is that dense networks contain sparse subnetworks that can match the full network's performance when trained in isolation.[5] Pruning methods fall into several categories:

- **Magnitude pruning:** Removes weights with the smallest absolute values, under the assumption that small weights contribute little to the output.
- **Structured pruning:** Removes entire neurons, channels, or attention heads rather than individual weights, producing speedups on standard hardware without the need for sparse computation libraries.
- **Iterative pruning:** Alternates between pruning and fine-tuning, gradually increasing sparsity while maintaining accuracy.

Modern pruning techniques can remove 80% to 95% of weights in many architectures with less than 1% accuracy degradation, producing models that are both smaller and faster.

### Sparse Attention

Standard self-attention in [transformers](/wiki/transformer) has O(n^2) complexity with respect to sequence length, which becomes prohibitive for long sequences. Sparse attention mechanisms address this by allowing each token to attend to only a subset of other tokens. Notable approaches include:

- **Longformer:** Combines local sliding-window attention with task-specific global attention on selected tokens.[10]
- **BigBird:** Uses a combination of random, window, and global attention patterns.
- **SeerAttention (2024):** Learns intrinsic sparse attention patterns using a gating mechanism inspired by Mixture of Experts, dynamically selecting which attention blocks to compute.

Sparse attention reduces both memory usage and computation from O(n^2) to O(n * sqrt(n)) or even O(n * log(n)), enabling transformers to process sequences of tens of thousands of tokens.

### Mixture of Experts (MoE)

The [Mixture of Experts](/wiki/mixture_of_experts) architecture introduces sparsity at the model level. Instead of passing every input through every parameter, MoE models contain multiple "expert" sub-networks and a lightweight router (gating network) that selects a small subset of experts for each input token. This decouples model capacity (total parameters) from computational cost (parameters activated per input).

Key MoE models include:

| Model | Year | Total Parameters | Active Parameters | Experts |
|---|---|---|---|---|
| Switch Transformer | 2021 | 1.6T | ~1B | 2,048 (top-1 routing) |
| Mixtral 8x7B | 2024 | 46.7B | ~12.9B | 8 (top-2 routing) |
| DeepSeek-V2 | 2024 | 236B | ~21B | 160 (top-6 routing) |
| Mixtral 8x22B | 2024 | 176B | ~39B | 8 (top-2 routing) |

MoE achieves the quality of very large dense models at a fraction of the inference cost, because only 10% to 25% of total parameters are active for any given input.[6] This form of conditional computation represents one of the most impactful applications of sparsity in modern AI.

## Sparse Autoencoders and Interpretability

The most active modern application of sparse representation is [mechanistic interpretability](/wiki/mechanistic_interpretability): using sparsity to decompose the internal activations of [large language models](/wiki/large_language_model) into human-understandable units. The motivating problem is **polysemanticity**: individual neurons in a trained network typically fire for many unrelated concepts at once, which makes them hard to interpret.

### What is superposition?

The leading explanation for polysemantic neurons is the [superposition](/wiki/superposition) hypothesis. In the 2022 Anthropic paper "Toy Models of Superposition," Nelson Elhage, Christopher Olah, and colleagues studied how small models "can represent more features than they have neurons."[13] When features are sparse (rarely active at the same time), a network can pack many of them into a lower-dimensional space by assigning each feature its own near-orthogonal direction rather than its own dedicated neuron. Superposition makes networks more capacity-efficient, but it also entangles concepts across neurons, which is exactly what sparse-coding methods aim to undo.

### How do sparse autoencoders interpret a model?

A [sparse autoencoder](/wiki/sparse_autoencoder) (SAE) is dictionary learning applied to a model's internal activations. It encodes an activation vector into a much wider hidden layer under an L1 sparsity penalty, so each activation is reconstructed from only a few active "features," then decodes back to the original. Because the hidden layer is overcomplete and sparse, its directions tend to align with single interpretable concepts rather than blended ones.

Anthropic's 2023 study "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning," led by Trenton Bricken and Christopher Olah, applied this idea to a one-layer transformer with a 512-neuron MLP and decomposed it into more than 4,000 features that "separately represent things like DNA sequences, legal language, HTTP requests, Hebrew text, nutrition statements, and much more."[11] The autoencoders were trained on roughly 8 billion activation data points, with dictionary sizes ranging from 512 to about 131,000 features. Human raters judged about 70% of the learned features to be genuinely interpretable, far higher than for individual neurons.[11]

### How well does this scale to production models?

The 2024 follow-up, "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," by Adly Templeton and colleagues, showed the method scales to a production model.[12] Training sparse autoencoders on Claude 3 Sonnet's middle-layer residual stream extracted up to 34 million features. The features are multilingual and multimodal (a single "Golden Gate Bridge" direction fires on English, Japanese, Korean, and Russian text and on images of the bridge), and they can steer behavior: clamping the Golden Gate Bridge feature to roughly 10 times its maximum natural activation made the model begin to "self-identify" as the bridge.[12] This experiment was released publicly as the "Golden Gate Claude" demo and made feature steering a widely discussed interpretability result. Anthropic frames this work as evidence that sparse features are "a step toward mechanistically understanding" frontier models, even though most of the 34 million features were rare or never activated.[12]

## Applications

Sparse representation has been successfully applied across a wide range of domains:

- **Image processing:** Sparse coding is used for image denoising, compression, inpainting (filling missing regions), and super-resolution. The underlying sparsity of natural images in wavelet or learned dictionary bases makes these applications particularly effective.
- **Information retrieval:** Sparse vector models such as [TF-IDF](/wiki/tf_idf), [BM25](/wiki/bm25), and SPLADE power search engines using inverted indexes, where the sparsity of document representations enables sub-millisecond retrieval over billions of documents.
- **Feature extraction and [classification](/wiki/classification):** Sparse representations can isolate the most salient features of high-dimensional data, enabling effective [object recognition](/wiki/object_recognition), text classification, and anomaly detection.
- **[Recommendation systems](/wiki/recommender_system):** User-item interaction matrices are inherently sparse (most users interact with a tiny fraction of available items). Techniques like matrix factorization and sparse autoencoders exploit this structure.
- **Model interpretability:** Sparse autoencoders decompose the activations of [large language models](/wiki/large_language_model) into interpretable features for [mechanistic interpretability](/wiki/mechanistic_interpretability) and AI safety research.[11][12]
- **Medical imaging:** Compressed sensing enables accelerated MRI acquisition, and sparse representations are used in CT reconstruction and microscopy.
- **Network analysis:** Social networks, citation graphs, and biological networks are naturally sparse. Graph-based machine learning methods, including [graph neural networks](/wiki/graph_neural_network), leverage sparse adjacency matrices.

## Implementation and Software Libraries

Several major scientific computing and machine learning libraries provide first-class support for sparse data structures.

### SciPy (Python)

The `scipy.sparse` module offers seven sparse matrix formats (COO, CSR, CSC, BSR, DOK, LIL, DIA) with conversion utilities between them. CSR and CSC are the primary formats for arithmetic operations and linear algebra. SciPy's sparse module integrates with NumPy and provides sparse versions of common operations such as matrix-vector products, element-wise operations, and direct solvers via `scipy.sparse.linalg`.

```python
from scipy.sparse import csr_matrix
import numpy as np

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
sparse_mat = csr_matrix((data, (row, col)), shape=(3, 3))
```

### PyTorch

The `torch.sparse` module supports COO and CSR tensor formats with autograd (automatic differentiation) support. Sparse tensors can participate in matrix multiplications and additions with dense tensors. PyTorch sparse tensors are particularly useful in [graph neural networks](/wiki/graph_neural_network), where adjacency matrices are naturally sparse.

```python
import torch

indices = torch.tensor([[0, 1, 2], [2, 0, 1]])
values = torch.tensor([3.0, 4.0, 5.0])
sparse_tensor = torch.sparse_coo_tensor(indices, values, (3, 3))
```

### TensorFlow

`tf.SparseTensor` represents sparse data using three components: `indices` (a 2D tensor of coordinates), `values` (a 1D tensor of non-zero values), and `dense_shape` (the full tensor shape). TensorFlow provides operations like `tf.sparse.sparse_dense_matmul` and integration with Keras layers through `tf.keras.layers.Dense` with sparse input support.

```python
import tensorflow as tf

sparse_tensor = tf.SparseTensor(
    indices=[[0, 2], [1, 0], [2, 1]],
    values=[3.0, 4.0, 5.0],
    dense_shape=[3, 3]
)
```

### Performance Comparison

| Library | Formats | GPU Support | Autograd | Primary Use Case |
|---|---|---|---|---|
| scipy.sparse | COO, CSR, CSC, BSR, DOK, LIL, DIA | No | No | Scientific computing, data preprocessing |
| torch.sparse | COO, CSR, CSC, BSR | Yes | Yes | Deep learning, graph neural networks |
| tf.SparseTensor | COO-like | Yes | Yes | Deep learning, recommendation systems |

## Memory and Computational Benefits

The practical advantages of sparse representations can be enormous. Consider a user-item matrix in a recommendation system with 1 million users and 500,000 items. Stored densely, this requires 4 TB of memory (at 8 bytes per float64 entry). If only 0.01% of entries are non-zero, a CSR representation requires roughly 800 MB: a reduction of nearly 5,000x.

Computational savings are equally significant. Sparse matrix-vector multiplication (SpMV) has complexity proportional to the number of non-zero entries (nnz) rather than the full matrix dimension. For a matrix with a sparsity ratio of 99.9%, this means a 1,000x reduction in arithmetic operations compared to dense multiplication.

These savings extend to training machine learning models. Sparse input features (such as TF-IDF vectors) can be processed by linear models and [neural networks](/wiki/neural_network) using specialized sparse kernels, avoiding the memory overhead of materializing full dense vectors.

## Explain Like I'm 5 (ELI5)

Imagine you have a huge wall of light switches, with one switch for every word in the dictionary. To describe what a sentence is about, you flip on just a few switches and leave the rest off. That is a sparse representation: almost everything is off (zero), and only a handful of switches are on. The useful information is in which switches you turned on and how bright you set them.

A dense representation is more like mixing paint colors. Instead of flipping individual switches, you blend many colors together into one small blob. The blob is compact, but you cannot easily point to one part and say "this is the 'cat' part."

Sparse representations are great when you want to be precise and fast at looking things up (like a search engine finding exact words), while dense representations are great when you need to understand meaning and similarity (like a chatbot understanding that "puppy" and "dog" mean similar things). Researchers even use sparse "feature detectors" to peek inside an AI's brain and find the single switch for a thing like the Golden Gate Bridge.

## References

1. Olshausen, B. A., & Field, D. J. (1996). "Emergence of simple-cell receptive field properties by learning a sparse code for natural images." *Nature*, 381(6583), 607-609.
2. Donoho, D. L. (2006). "Compressed sensing." *IEEE Transactions on Information Theory*, 52(4), 1289-1306.
3. Candes, E. J., Romberg, J. K., & Tao, T. (2006). "Stable signal recovery from incomplete and inaccurate measurements." *Communications on Pure and Applied Mathematics*, 59(8), 1207-1223.
4. Formal, T., Piwowarski, B., & Clinchant, S. (2021). "SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking." *Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval*.
5. Frankle, J., & Carbin, M. (2019). "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks." *International Conference on Learning Representations (ICLR)*.
6. Fedus, W., Zoph, B., & Shazeer, N. (2022). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." *Journal of Machine Learning Research*, 23(120), 1-39.
7. Mirzadeh, S. I., et al. (2023). "ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models." *arXiv preprint arXiv:2310.04564*.
8. Kurtz, M., et al. (2020). "Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks." *Proceedings of the 37th International Conference on Machine Learning (ICML)*.
9. Aharon, M., Elad, M., & Bruckstein, A. (2006). "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation." *IEEE Transactions on Signal Processing*, 54(11), 4311-4322.
10. Beltagy, I., Peters, M. E., & Cohan, A. (2020). "Longformer: The Long-Document Transformer." *arXiv preprint arXiv:2004.05150*.
11. Bricken, T., Templeton, A., Batson, J., et al. (2023). "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." *Transformer Circuits Thread*, Anthropic. https://transformer-circuits.pub/2023/monosemantic-features
12. Templeton, A., Conerly, T., Marcus, J., et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." *Transformer Circuits Thread*, Anthropic. https://transformer-circuits.pub/2024/scaling-monosemanticity
13. Elhage, N., Hume, T., Olsson, C., et al. (2022). "Toy Models of Superposition." *Transformer Circuits Thread*, Anthropic. https://transformer-circuits.pub/2022/toy_model/index.html

