See also: Machine learning terms
In machine learning, a representation is the format in which an example is encoded before a model sees it. The same data point can be represented as raw pixels, a list of hand-designed measurements, or a vector produced by a deep network, and the choice of representation largely decides which patterns a model can pick up. Bengio, Courville, and Vincent put it bluntly in their 2013 review: the performance of machine learning methods is heavily dependent on the choice of data representation, because different representations can entangle and hide more or less the explanatory factors of variation behind the data [1].
For most of the field's history, getting a good representation meant hand-crafting it: a vision researcher decided that gradient histograms were what mattered for object detection, an audio researcher decided on cepstral coefficients, and a text researcher counted n-grams. The shift to deep learning replaced much of that work with representation learning, where a model discovers its own features directly from raw data. Today's foundation models are essentially giant representation engines: pretrained networks whose intermediate vectors, called embeddings, are reused across countless downstream tasks.
The practical contrast between the two paradigms is large enough to be worth spelling out in a table.
| Aspect | Hand-crafted features | Learned representations |
|---|---|---|
| How they are built | Domain experts design them by hand | Optimised end-to-end during training |
| Examples in vision | SIFT, HOG, SURF, color histograms | Convolutional features, ViT patch embeddings, DINOv2 features |
| Examples in text | Bag-of-words, TF-IDF, n-grams | word2vec, GloVe, BERT, GPT hidden states |
| Examples in audio | MFCC, spectrogram bins, pitch features | wav2vec 2.0, HuBERT, Whisper encoder outputs |
| Required data volume | Works with small datasets | Typically needs large datasets, often unlabeled |
| Interpretability | High; each feature has a clear meaning | Low; individual dimensions rarely correspond to obvious concepts |
| Transferability | Limited to the task and modality they were designed for | Often transfer across tasks and even across modalities |
| Failure mode | Misses patterns the designer did not anticipate | Inherits biases and artifacts from the training data |
Hand-crafted features dominated computer vision and speech until the early 2010s. SIFT and HOG were the backbone of object recognition pipelines, MFCC was almost universal in speech recognition, and TF-IDF weighted bag-of-words was the default text representation for classification and retrieval. These features were the result of years of careful engineering, and they encoded strong inductive biases that worked well when training data was scarce.
The turning point in vision was AlexNet's 2012 ImageNet result, where a convolutional network trained end-to-end beat the best hand-crafted pipelines by a wide margin [2]. Within a few years, learned features had displaced hand-crafted ones in nearly every modality where enough data was available. Feature engineering did not disappear, especially for tabular data and small problems, but for unstructured inputs the field largely moved to letting models discover their own representations.
Bengio and colleagues catalogued a set of properties that a useful representation tends to have [1]. The list is less a strict checklist than a set of design goals that different methods optimise to different degrees.
Smoothness. If two inputs are close in input space, their representations should usually be close as well. This is the prior behind nearest-neighbour methods and behind most distance-based downstream uses of embeddings.
Linear separability of relevant attributes. A good representation makes the things you care about easy to read off with a simple classifier. This is exactly the assumption behind linear probing, where a logistic regression on top of frozen features is used as a quick read of representation quality [3].
Disentanglement. Different latent dimensions should respond to different underlying factors of variation. In an ideal disentangled face representation, one direction would track pose, another lighting, another identity, with minimal cross-talk. Disentangled representations should separate the distinct, independent and informative generative factors of variation in the data [4]. Pure unsupervised disentanglement turns out to be much harder than it looked; Locatello and colleagues showed in 2019 that without inductive biases or weak supervision, the same model can learn very different factorisations across runs [4].
Invariance to nuisance factors. Translations, rotations, lighting changes, paraphrases, and accent should not move the representation much when the task does not depend on them. A face embedding used for identity verification should ideally be invariant to expression and pose.
Hierarchy. Useful structure tends to live at multiple scales. Edges and textures combine into parts; parts combine into objects; objects combine into scenes. Deep networks tend to mirror this hierarchy in their layers, which is one reason features pulled from intermediate layers transfer so well.
Sparsity. Only a small subset of features should be active for any given input. Sparse representations tend to be more interpretable and more robust, and the idea has resurfaced strongly in the modern interpretability literature with sparse autoencoders trained on the activations of large language models.
The field roughly distinguishes three styles of representation, each tied to a particular research lineage.
Distributed representations. Each concept is encoded by a pattern of activity across many units, and each unit takes part in encoding many concepts. The idea goes back to Hinton's 1986 paper on learning distributed representations [5] and is the dominant paradigm in modern neural networks. The geometry of word2vec, where vector arithmetic captures analogies like king minus man plus woman is approximately queen, is the canonical illustration [6].
Sparse representations. Most coordinates of the vector are zero. Sparse coding methods learn an overcomplete dictionary in which each input is a sparse combination of basis elements. Sparse representations are easier to interpret one feature at a time and are central to the recent line of work on extracting monosemantic features from large language models.
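As a concrete illustration of sparse coding, the sketch below learns an overcomplete dictionary with scikit-learn's `DictionaryLearning`; the random data, dictionary size, and sparsity level are placeholder choices for the example, not a recipe.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Toy data: 500 samples of 64-dimensional signals (stand-ins for flattened 8x8 patches).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))

# Learn an overcomplete dictionary (128 atoms for 64-dimensional inputs) and
# encode each sample as a sparse combination of at most 5 atoms.
dl = DictionaryLearning(n_components=128, transform_algorithm="omp",
                        transform_n_nonzero_coefs=5, max_iter=50, random_state=0)
codes = dl.fit_transform(X)

print(codes.shape)                       # (500, 128)
print((codes != 0).sum(axis=1).mean())   # at most 5 non-zero coefficients per sample
```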
Compositional representations. Complex meanings are built up from simpler ones following predictable rules. A fully compositional representation of language would let you take the embedding of "red" and "car" and combine them, in a known way, to get the embedding of "red car." Modern transformer-based models exhibit some compositional structure, but how much, and whether it is enough, is still actively debated.
The last fifteen years produced an unusually clean sequence of milestones in representation learning. The table below picks the ones that mattered most.
| Year | Method | Modality | What it introduced |
|---|---|---|---|
| 2006 | Stacked autoencoders | Generic | Layer-wise unsupervised pretraining of deep nets |
| 2008 | Denoising autoencoder (Vincent et al.) | Generic | Reconstruct clean input from a corrupted version |
| 2012 | AlexNet | Vision | End-to-end learned ConvNet features beat hand-crafted ones on ImageNet |
| 2013 | word2vec (Mikolov et al.) | Text | Skip-gram and CBOW; vector arithmetic on words |
| 2014 | GloVe (Pennington et al.) | Text | Embeddings from global co-occurrence statistics |
| 2014 | Variational autoencoder (Kingma and Welling) | Generic | Probabilistic latent representations |
| 2015 | ResNet (He et al.) | Vision | Residual connections enabled very deep, transferable backbones |
| 2017 | FastText (Bojanowski et al.) | Text | Subword embeddings using character n-grams |
| 2018 | ELMo (Peters et al.) | Text | First widely used contextual word embeddings |
| 2018 | BERT (Devlin et al.) | Text | Masked language modelling for bidirectional contextual embeddings |
| 2019 | SBERT (Reimers and Gurevych) | Text | Sentence embeddings via siamese BERT |
| 2020 | SimCLR (Chen et al.) | Vision | Simple contrastive framework with strong augmentations |
| 2020 | ViT (Dosovitskiy et al.) | Vision | Transformer over 16x16 image patches |
| 2021 | CLIP (Radford et al.) | Vision and language | Joint image-text embeddings via contrastive training on 400M pairs |
| 2021 | DINO (Caron et al.) | Vision | Self-distillation produces strong unsupervised ViT features |
| 2022 | MAE (He et al.) | Vision | Mask 75% of image patches, reconstruct from the rest |
| 2023 | DINOv2 (Oquab et al.) | Vision | Foundation-grade ViT features pretrained on 142M curated images |
A few of these deserve more space.
Autoencoders and their descendants. An autoencoder maps inputs through a bottleneck and back out, and the bottleneck activations form the representation. Denoising autoencoders force the network to undo a known corruption, which makes the representation more robust. Variational autoencoders (VAEs) treat the bottleneck as a probabilistic latent space, which gives a generative model alongside the representation. Masked autoencoders apply the same logic to image patches: hide most of an image and ask the model to fill in the rest [7].
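A minimal sketch of the idea in PyTorch, with arbitrary layer sizes: the bottleneck activation `z` is the representation, and corrupting the input while reconstructing the clean version turns the plain autoencoder into a denoising one.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal dense autoencoder: the 32-dimensional bottleneck is the representation."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)                 # the learned representation
        return self.decoder(z), z

model = Autoencoder()
x = torch.randn(16, 784)                    # a batch of flattened 28x28 images
x_noisy = x + 0.3 * torch.randn_like(x)     # corrupt the input (denoising variant)
x_hat, z = model(x_noisy)
loss = nn.functional.mse_loss(x_hat, x)     # reconstruct the *clean* input
```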
Word embeddings. word2vec trains shallow networks to predict either a word from its context (CBOW) or the context from a word (skip-gram), producing dense vectors where semantic and syntactic relationships show up as linear directions [6]. GloVe takes a different route, factorising a global word-word co-occurrence matrix to produce embeddings with similar properties [8]. FastText extends both ideas to character n-grams, which lets the model produce vectors for out-of-vocabulary words and morphologically rich languages [9].
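The sketch below trains a skip-gram model with gensim on a toy corpus; with only a handful of sentences the analogy query will not return anything meaningful, but the API calls mirror how the technique is used at scale.

```python
from gensim.models import Word2Vec

# Tiny toy corpus; real embeddings need millions of sentences.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "to", "work"],
    ["the", "woman", "walks", "to", "work"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1: skip-gram
vec = model.wv["king"]                       # 50-dimensional word vector
# The classic analogy query; only informative with a large training corpus.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```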
Contextual embeddings. The next jump was making the embedding of a word depend on its sentence. ELMo used a bidirectional LSTM over a language modelling objective. BERT replaced the LSTM with a transformer trained with a masked language model and a next-sentence prediction objective, producing bidirectional contextual representations from unlabeled text. The pretrained model could be fine-tuned with a single output layer to produce state-of-the-art results across eleven NLP benchmarks [10]. GPT-style decoder-only models produce contextual embeddings as a byproduct of next-token prediction, and those hidden states are now used as general-purpose text encoders.
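Extracting contextual embeddings from a pretrained encoder takes only a few lines with the Hugging Face transformers library; the checkpoint name here is just one common choice.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The same word ("bank") gets a different vector depending on its sentence.
sentences = ["The bank raised interest rates.", "They sat on the river bank."]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (2, seq_len, 768) contextual embeddings

print(hidden.shape)
```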
Sentence and document embeddings. Out of the box, BERT's token embeddings are a bad way to compare whole sentences: comparing 10,000 sentences pairwise would take about 65 hours on a single GPU [11]. Sentence-BERT solves this by fine-tuning BERT in a siamese setup with a contrastive or regression loss, producing fixed-length sentence vectors that can be compared with a dot product. The same 10,000-sentence comparison drops from 65 hours to about 5 seconds [11]. Universal Sentence Encoder, OpenAI's text-embedding models, and the more recent Cohere and Voyage embedders all sit in this lineage.
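A small example using the sentence-transformers library; the checkpoint name is illustrative, and any sentence-embedding model in the same family behaves similarly.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")    # one of many sentence-embedding checkpoints

sentences = ["A man is playing a guitar.",
             "Someone is strumming an instrument.",
             "The stock market fell sharply today."]
embeddings = model.encode(sentences)                # (3, 384) fixed-length vectors

print(util.cos_sim(embeddings[0], embeddings[1]))   # paraphrases: high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))   # unrelated topics: low similarity
```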
Self-supervised vision. SimCLR showed that a clean contrastive setup (augmented views of the same image as positives, other images in the batch as negatives) could match supervised ImageNet pretraining when given a large enough batch [12]. MoCo kept a queue of negatives produced by a slowly updated momentum encoder, which let the same idea work at smaller batch sizes. BYOL and SimSiam removed negatives entirely and relied on architectural tricks to avoid collapse. MAE went a different route, masking 75% of image patches and reconstructing them, which trained a ViT-Huge to 87.8% top-1 accuracy on ImageNet using only ImageNet-1K data [7]. DINO and DINOv2 used self-distillation between a student and an exponentially averaged teacher; DINOv2's features, learned without labels on 142 million curated images, work as drop-in inputs to linear probes across many vision tasks [13].
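For readers who want the mechanics, below is a minimal sketch of the NT-Xent contrastive loss used by SimCLR, written from the published description; the temperature value and tensor shapes are illustrative, and a real training loop adds distributed negatives and an augmentation pipeline.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss over two batches of embeddings of the same N images
    under different augmentations (z1[i] and z2[i] are a positive pair)."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)           # (2N, d) unit vectors
    sim = z @ z.T / temperature                                   # pairwise cosine similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool, device=z.device),
                          float("-inf"))                          # ignore self-similarity
    # For row i, the correct "class" is the index of the other view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Example with random stand-in embeddings from the two augmented views.
loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
```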
Multimodal representations. CLIP trained an image encoder and a text encoder jointly so that matching image-caption pairs produced similar embeddings and mismatched pairs produced dissimilar ones. After training on 400 million internet image-text pairs, the model could classify images zero-shot by comparing image embeddings to embeddings of prompts like "a photo of a dog" [14]. LiT froze a pretrained image encoder and trained only the text tower, often beating end-to-end CLIP at lower cost. SigLIP replaced CLIP's softmax over the batch with a sigmoid loss, which scales better and removes the need for huge batches. These shared embedding spaces are now the backbone of most text-to-image generators and many vision-language models.
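Zero-shot classification with a pretrained CLIP checkpoint is short using the Hugging Face wrappers; the checkpoint name, labels, and image path below are placeholders.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = Image.open("example.jpg")   # placeholder local image

# Embed the image and all prompts, then compare them in the shared space.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # similarity of the image to each prompt
print(dict(zip(labels, probs[0].tolist())))
```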
Bommasani and colleagues coined the term foundation model in their 2021 Stanford report to describe the new generation of large pretrained models that get adapted to many downstream tasks. Foundation models are based on standard deep learning and transfer learning, but their scale produces new emergent capabilities, and their effectiveness across so many tasks incentivises homogenisation of approaches across the field [15]. Practically, the report argues that models like BERT, GPT-3, CLIP, and DALL-E function as shared infrastructure: their representations are reused, fine-tuned, or queried by countless downstream applications.
This representation-as-infrastructure framing is now standard. A modern computer vision team rarely trains a backbone from scratch; they pull DINOv2 or a CLIP image encoder and fine-tune. A modern NLP team rarely trains an encoder from scratch; they pull BERT, RoBERTa, or a sentence-transformer. The good and bad of this arrangement both follow from concentration: every downstream system inherits whatever strengths and biases live in the foundation model's representations.
One of the most visible applications of learned representations is semantic search. A document collection is encoded once into vectors, a query is encoded at runtime, and the system returns documents whose vectors are closest to the query. The whole pattern hinges on having a good text embedding model and a fast way to do approximate nearest neighbour search.
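A minimal end-to-end sketch, assuming a sentence-transformers embedding model and a FAISS flat index; the model name and toy documents are placeholders, and production systems would use an approximate index over far larger corpora.

```python
from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer("all-MiniLM-L6-v2")     # example embedding model

docs = ["How to reset a forgotten password",
        "Annual report on revenue growth",
        "Steps for recovering account access"]

# Encode the collection once; unit-normalised vectors make inner product == cosine.
doc_vecs = model.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

# Encode the query at runtime and return the nearest documents.
query_vec = model.encode(["I can't log in to my account"],
                         normalize_embeddings=True).astype("float32")
scores, ids = index.search(query_vec, 2)
print([docs[i] for i in ids[0]])
```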
Vector databases are the storage and indexing systems that make this practical at scale. Common options in 2026 include the ones listed below.
| System | Type | Notable trait |
|---|---|---|
| Pinecone | Managed cloud | Fully hosted; auto-scales index and query layer |
| Weaviate | Open source and managed | Strong hybrid search combining vectors with keyword filters |
| Milvus | Open source | Designed for billions of vectors across distributed clusters |
| Qdrant | Open source and managed | Rust-based; rich payload filtering |
| Chroma | Open source | Lightweight; popular for local prototyping and small RAG apps |
| FAISS | Library | Facebook AI's similarity search library; embedded into many other systems |
| pgvector | Postgres extension | Adds vector search to an existing relational database |
Retrieval-augmented generation (RAG) is the dominant production use case driving vector database adoption. In a RAG system the user's question is embedded, relevant passages are retrieved by vector similarity, and a language model is asked to answer using those passages as context. The quality of the embedding model usually matters more than the choice of vector database for the resulting accuracy.
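In code, the pattern is little more than retrieval followed by prompt assembly. The sketch below reuses the embedding model and index from the semantic-search example above and assumes a hypothetical `generate` function standing in for whatever language model API the application uses.

```python
def answer(question, model, index, docs, generate, k=3):
    """Embed the question, retrieve the k nearest passages, and ask the LLM
    to answer using only those passages as context."""
    q_vec = model.encode([question], normalize_embeddings=True).astype("float32")
    _, ids = index.search(q_vec, k)
    context = "\n\n".join(docs[i] for i in ids[0])
    prompt = ("Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return generate(prompt)    # `generate` is a hypothetical stand-in for an LLM call
```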
If representations are the medium through which models think, the natural question is what is encoded in them. Probing is the standard technique: train a small classifier on frozen representations to predict some property and see how well it does. Alain and Bengio formalised this with linear probes, training a logistic regression on the activations of every layer of a deep network. They observed that the linear separability of features increases monotonically along the depth of the model, which gave a simple way to see how representations evolve [3].
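A linear probe is just a logistic regression on frozen features. The sketch below uses scikit-learn with random placeholder features and labels; in practice the features would come from a chosen layer of a pretrained model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Frozen features extracted from some layer of a pretrained model, plus labels.
# Random placeholders here, with shapes typical of a BERT-sized encoder.
features = np.random.randn(1000, 768)
labels = np.random.randint(0, 10, size=1000)

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))   # near chance on random features
```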
Tenney and colleagues applied probing to BERT in 2019 and found a layered structure that loosely tracked the classical NLP pipeline: lower layers handled surface and part-of-speech features, middle layers handled syntactic phenomena like constituents and dependency relations, and higher layers handled semantic ones such as semantic roles and coreference. Probing has its critics, since a probe might learn to extract information that the model itself never uses, but it remains one of the most direct ways to read what is in a representation.
More recent interpretability work uses sparse autoencoders trained on the hidden states of large language models to extract features that fire on a single concept. Anthropic and others have used this approach to find features for things like Golden Gate Bridge mentions, code-switching, or specific stylistic patterns in production-scale models.
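The common recipe is a wide linear encoder with a ReLU, a linear decoder, and an L1 penalty on the feature activations. The sketch below shows that recipe in PyTorch in its simplest form; real setups add details such as normalised decoder columns, careful initialisation, and much larger dictionaries.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over model activations.

    d_model: width of the hidden states being analysed.
    d_dict:  number of dictionary features, usually much larger than d_model.
    """
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = F.relu(self.encoder(x))       # sparse feature activations
        return self.decoder(f), f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages few active features.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().mean()

sae = SparseAutoencoder(d_model=768, d_dict=8192)
acts = torch.randn(32, 768)               # stand-in batch of LLM hidden states
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
```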
Learned representations can fail in characteristic ways. Two common failure modes are full collapse, where the model maps every input to roughly the same vector, and dimensional collapse, where the embeddings only span a low-dimensional subspace of the available space.
Contrastive methods like SimCLR avoid full collapse by construction, since pulling negatives apart prevents everything from mapping to one point. Non-contrastive methods like BYOL and SimSiam were initially surprising because they have no negatives, and yet they avoid collapse in practice. Subsequent analysis showed that batch normalisation, the predictor head, and the use of an exponentially averaged target encoder all play a role in preventing trivial solutions [16]. Even when full collapse is avoided, dimensional collapse is common: embedding vectors end up spanning only a low-dimensional subspace, which limits how much information they can carry. Whitening losses, projector heads, and rank-based regularisers are the common fixes.
In language models, a related problem called anisotropy shows up: the embeddings of all tokens cluster in a narrow cone, which makes cosine similarity less informative than it should be. Post-hoc whitening and contrastive fine-tuning are the usual remedies.
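Both problems can be spotted by looking at the singular value spectrum of a batch of embeddings, and post-hoc whitening is one of the remedies mentioned above. The sketch below shows both with NumPy; the epsilon is an arbitrary numerical safeguard.

```python
import numpy as np

def singular_value_spectrum(embeddings):
    """Singular values of mean-centred embeddings; a very fast drop-off suggests
    dimensional collapse or a narrow anisotropic cone."""
    centred = embeddings - embeddings.mean(axis=0, keepdims=True)
    return np.linalg.svd(centred, compute_uv=False)

def whiten(embeddings, eps=1e-5):
    """Post-hoc (ZCA) whitening: rotate and rescale so the dimensions are
    decorrelated with roughly unit variance."""
    centred = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = np.cov(centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centred @ W

emb = np.random.randn(1000, 256)          # stand-in embedding matrix
print(singular_value_spectrum(emb)[:5])   # inspect the largest singular values
emb_white = whiten(emb)
```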
Different kinds of data require different choices of input unit before any representation learning happens.
Text is almost universally tokenised first. Byte-pair encoding (BPE), WordPiece, and SentencePiece split text into subword units that balance vocabulary size against the rate of unknown tokens. The choice of tokenizer shapes the representation in surprising ways: unusual languages, code, and numbers all behave differently depending on whether they get split into many tiny pieces or kept as larger units.
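The effect is easy to see by tokenising a few strings with an off-the-shelf tokenizer; the WordPiece checkpoint below is one example, and other tokenizers split differently.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece vocabulary

for text in ["representation", "antidisestablishmentarianism", "3.14159", "def forward(self, x):"]:
    print(text, "->", tokenizer.tokenize(text))
# Common words stay whole; rare words, numbers, and code fragment into many subword pieces.
```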
Images in modern systems are usually split into patches before being fed to a transformer. ViT's choice of 16x16 patches turns a 224x224 image into a sequence of 196 tokens that the transformer processes in much the same way it would a sentence [17]. Convolutional backbones still produce useful features, but ViT-based approaches dominate at large scale.
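The patching itself is just a reshape. The sketch below reproduces the 224x224, 16x16 arithmetic with plain tensor operations; a real ViT then applies a learned linear projection to each flattened patch and adds position embeddings.

```python
import torch

img = torch.randn(1, 3, 224, 224)                   # batch of one RGB image
patches = img.unfold(2, 16, 16).unfold(3, 16, 16)   # (1, 3, 14, 14, 16, 16): 14x14 grid of 16x16 patches
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * 16 * 16)
print(patches.shape)                                # torch.Size([1, 196, 768]): 196 patch tokens
```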
Audio gets converted to log-mel spectrograms, then either fed to a convolutional encoder or, increasingly, a transformer. Self-supervised models like wav2vec 2.0 and HuBERT replace the older MFCC pipelines with learned representations of raw waveforms.
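A typical front end in torchaudio looks like the sketch below; the file path and the choice of 80 mel bins are placeholders.

```python
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("speech.wav")    # placeholder audio file
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=80)(waveform)
log_mel = torch.log(mel + 1e-6)    # (channels, 80 mel bins, time frames), fed to the encoder
print(log_mel.shape)
```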
Graphs and structured data have their own representation literature, with graph neural networks producing node and edge embeddings that capture local neighbourhood structure. Embeddings have also been applied to recommendations, where users and items live in a shared latent space, and to knowledge graphs, where entities and relations are represented so that plausible facts have small reconstruction error.
Learned representations come with their own set of recurring problems.
Bias inheritance. A representation reflects the data it was trained on. Word embeddings trained on web text encode the gender, racial, and cultural biases present in that text, and downstream classifiers that use those embeddings inherit them. CLIP famously associates certain occupations more strongly with certain genders or ethnicities, depending on what its training image-caption pairs contained.
Out-of-distribution failures. A representation that works well on data similar to its training distribution can degrade sharply outside it. ImageNet-trained features struggle on medical images; text encoders trained on web English struggle on legal jargon, archaic English, or code-mixed text without further adaptation.
Lack of grounding. Text-only language models learn representations that capture statistical relationships between tokens. They do not, on their own, ground those representations in perception or action. Multimodal training and tool use are partial answers, but the gap between manipulating text about the world and interacting with the world remains.
Specification gaming and shortcut features. When the easiest way to solve a training task is not the way the designer had in mind, the resulting representation will encode the shortcut. Vision models that latch onto background textures rather than object shapes are a classic example.
Opacity. Even when a representation works, understanding why it works is hard. The dimensions of a learned vector usually do not correspond to clean human concepts, and interpretability methods like probing and sparse coding only partly close the gap.
Think about how you might describe a dog to a friend. You could list facts: four legs, fur, a tail, barks. You could draw it. You could just point. Each of those is a different way to represent a dog, and the way you pick changes what your friend can do with the information. A drawing is good for spotting a similar dog later. A list of facts is good for filling out a form.
Machine learning models are the same way. They cannot look at the world directly; they only see numbers. The way we turn things into numbers is called the representation. In the old days, smart humans would design those numbers by hand, deciding which measurements mattered. These days, the computer figures out its own numbers by looking at a huge number of examples, and it usually does a better job than the humans did. Once it has those numbers, it can use them to find similar things, sort them into groups, or generate new ones, the same way you can use your mental picture of a dog to recognise one across the street.