See also: Machine learning terms
In machine learning, a representation is the format in which an example is encoded before a model sees it. The same data point can be represented as raw pixels, a list of hand-designed measurements, or a vector produced by a deep network, and the choice of representation largely decides which patterns a model can pick up. Bengio, Courville, and Vincent put it bluntly in their 2013 review: the performance of machine learning methods is heavily dependent on the choice of data representation, because different representations can entangle and hide more or less the explanatory factors of variation behind the data [1].
For most of the field's history, getting a good representation meant hand-crafting it: a vision researcher decided that gradient histograms were what mattered for object detection, an audio researcher decided on cepstral coefficients, and a text researcher counted n-grams. The shift to deep learning replaced much of that work with representation learning, where a model discovers its own features directly from raw data. Today's foundation models are essentially giant representation engines: pretrained networks whose intermediate vectors, called embeddings, are reused across countless downstream tasks.
The practical contrast between the two paradigms is large enough to be worth spelling out in a table.
| Aspect | Hand-crafted features | Learned representations |
|---|---|---|
| How they are built | Domain experts design them by hand | Optimised end-to-end during training |
| Examples in vision | SIFT, HOG, SURF, color histograms | Convolutional features, ViT patch embeddings, DINOv2 features |
| Examples in text | Bag-of-words, TF-IDF, n-grams | word2vec, GloVe, BERT, GPT hidden states |
| Examples in audio | MFCC, spectrogram bins, pitch features | wav2vec 2.0, HuBERT, Whisper encoder outputs |
| Required data volume | Works with small datasets | Typically needs large datasets, often unlabeled |
| Interpretability | High; each feature has a clear meaning | Low; individual dimensions rarely correspond to obvious concepts |
| Transferability | Limited to the task and modality they were designed for | Often transfer across tasks and even across modalities |
| Failure mode | Misses patterns the designer did not anticipate | Inherits biases and artifacts from the training data |
Hand-crafted features dominated computer vision and speech until the early 2010s. SIFT and HOG were the backbone of object recognition pipelines, MFCC was almost universal in speech recognition, and TF-IDF weighted bag-of-words was the default text representation for classification and retrieval. These features were the result of years of careful engineering, and they encoded strong inductive biases that worked well when training data was scarce.
The turning point in vision was AlexNet's 2012 ImageNet result, where a convolutional network trained end-to-end beat the best hand-crafted pipelines by a wide margin [2]. Within a few years, learned features had displaced hand-crafted ones in nearly every modality where enough data was available. Feature engineering did not disappear, especially for tabular data and small problems, but for unstructured inputs the field largely moved to letting models discover their own representations.
Bengio and colleagues catalogued a set of properties that a useful representation tends to have [1]. The list is less a strict checklist than a set of design goals that different methods optimise to different degrees.
Smoothness. If two inputs are close in input space, their representations should usually be close as well. This is the prior behind nearest-neighbour methods and behind most distance-based downstream uses of embeddings.
Linear separability of relevant attributes. A good representation makes the things you care about easy to read off with a simple classifier. This is exactly the assumption behind linear probing, where a logistic regression on top of frozen features is used as a quick read of representation quality [3].
Disentanglement. Different latent dimensions should respond to different underlying factors of variation. In an ideal disentangled face representation, one direction would track pose, another lighting, another identity, with minimal cross-talk. Disentangled representations should separate the distinct, independent and informative generative factors of variation in the data [4]. Pure unsupervised disentanglement turns out to be much harder than it looked; Locatello and colleagues showed in 2019 that without inductive biases or weak supervision, the same model can learn very different factorisations across runs [4].
Invariance to nuisance factors. Translations, rotations, lighting changes, paraphrases, and accent should not move the representation much when the task does not depend on them. A face embedding used for identity verification should ideally be invariant to expression and pose.
Hierarchy. Useful structure tends to live at multiple scales. Edges and textures combine into parts; parts combine into objects; objects combine into scenes. Deep networks tend to mirror this hierarchy in their layers, which is one reason features pulled from intermediate layers transfer so well.
Sparsity. Only a small subset of features should be active for any given input. Sparse representations tend to be more interpretable and more robust, and the idea has resurfaced strongly in the modern interpretability literature with sparse autoencoders trained on the activations of large language models.
The field roughly distinguishes three styles of representation, each tied to a particular research lineage.
Distributed representations. Each concept is encoded by a pattern of activity across many units, and each unit takes part in encoding many concepts. The idea goes back to Hinton's 1986 paper on learning distributed representations [5] and is the dominant paradigm in modern neural networks. The geometry of word2vec, where vector arithmetic captures analogies like king minus man plus woman is approximately queen, is the canonical illustration [6].
Sparse representations. Most coordinates of the vector are zero. Sparse coding methods learn an overcomplete dictionary in which each input is a sparse combination of basis elements. Sparse representations are easier to interpret one feature at a time and are central to the recent line of work on extracting monosemantic features from large language models.
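As a concrete illustration of sparse coding, the sketch below learns an overcomplete dictionary with scikit-learn's `DictionaryLearning`; the random data, dictionary size, and sparsity level are placeholder choices for the example, not a recipe.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Toy data: 500 samples of 64-dimensional signals (stand-ins for flattened 8x8 patches).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))

# Learn an overcomplete dictionary (128 atoms for 64-dimensional inputs) and
# encode each sample as a sparse combination of at most 5 atoms.
dl = DictionaryLearning(n_components=128, transform_algorithm="omp",
                        transform_n_nonzero_coefs=5, max_iter=50, random_state=0)
codes = dl.fit_transform(X)

print(codes.shape)                       # (500, 128)
print((codes != 0).sum(axis=1).mean())   # at most 5 non-zero coefficients per sample
```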
Compositional representations. Complex meanings are built up from simpler ones following predictable rules. A fully compositional representation of language would let you take the embedding of "red" and "car" and combine them, in a known way, to get the embedding of "red car." Modern transformer-based models exhibit some compositional structure, but how much, and whether it is enough, is still actively debated.
The last fifteen years produced an unusually clean sequence of milestones in representation learning. The table below picks the ones that mattered most.
| Year | Method | Modality | What it introduced |
|---|---|---|---|
| 2006 | Stacked autoencoders | Generic | Layer-wise unsupervised pretraining of deep nets |
| 2008 | Denoising autoencoder (Vincent et al.) | Generic | Reconstruct clean input from a corrupted version |
| 2012 | AlexNet | Vision | End-to-end learned ConvNet features beat hand-crafted ones on ImageNet |
| 2013 | word2vec (Mikolov et al.) | Text | Skip-gram and CBOW; vector arithmetic on words |
| 2014 | GloVe (Pennington et al.) | Text | Embeddings from global co-occurrence statistics |
| 2014 | Variational autoencoder (Kingma and Welling) | Generic | Probabilistic latent representations |
| 2015 | ResNet (He et al.) | Vision | Residual connections enabled very deep, transferable backbones |
| 2017 | FastText (Bojanowski et al.) | Text | Subword embeddings using character n-grams |
| 2018 | ELMo (Peters et al.) | Text | First widely used contextual word embeddings |
| 2018 | BERT (Devlin et al.) | Text | Masked language modelling for bidirectional contextual embeddings |
| 2019 | SBERT (Reimers and Gurevych) | Text | Sentence embeddings via siamese BERT |
| 2020 | SimCLR (Chen et al.) | Vision | Simple contrastive framework with strong augmentations |
| 2020 | ViT (Dosovitskiy et al.) | Vision | Transformer over 16x16 image patches |
| 2021 | CLIP (Radford et al.) | Vision and language | Joint image-text embeddings via contrastive training on 400M pairs |
| 2021 | DINO (Caron et al.) | Vision | Self-distillation produces strong unsupervised ViT features |
| 2022 | MAE (He et al.) | Vision | Mask 75% of image patches, reconstruct from the rest |
| 2023 | DINOv2 (Oquab et al.) | Vision | Foundation-grade ViT features pretrained on 142M curated images |
A few of these deserve more space.
Autoencoders and their descendants. An autoencoder maps inputs through a bottleneck and back out, and the bottleneck activations form the representation. Denoising autoencoders force the network to undo a known corruption, which makes the representation more robust. Variational autoencoders (VAEs) treat the bottleneck as a probabilistic latent space, which gives a generative model alongside the representation. Masked autoencoders apply the same logic to image patches: hide most of an image and ask the model to fill in the rest [7].
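A minimal sketch of the idea in PyTorch, with arbitrary layer sizes: the bottleneck activation `z` is the representation, and corrupting the input while reconstructing the clean version turns the plain autoencoder into a denoising one.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal dense autoencoder: the 32-dimensional bottleneck is the representation."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)                 # the learned representation
        return self.decoder(z), z

model = Autoencoder()
x = torch.randn(16, 784)                    # a batch of flattened 28x28 images
x_noisy = x + 0.3 * torch.randn_like(x)     # corrupt the input (denoising variant)
x_hat, z = model(x_noisy)
loss = nn.functional.mse_loss(x_hat, x)     # reconstruct the *clean* input
```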
Word embeddings. word2vec trains shallow networks to predict either a word from its context (CBOW) or the context from a word (skip-gram), producing dense vectors where semantic and syntactic relationships show up as linear directions [6]. GloVe takes a different route, factorising a global word-word co-occurrence matrix to produce embeddings with similar properties [8]. FastText extends both ideas to character n-grams, which lets the model produce vectors for out-of-vocabulary words and morphologically rich languages [9].
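The sketch below trains a skip-gram model with gensim on a toy corpus; with only a handful of sentences the analogy query will not return anything meaningful, but the API calls mirror how the technique is used at scale.

```python
from gensim.models import Word2Vec

# Tiny toy corpus; real embeddings need millions of sentences.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "to", "work"],
    ["the", "woman", "walks", "to", "work"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1: skip-gram
vec = model.wv["king"]                       # 50-dimensional word vector
# The classic analogy query; only informative with a large training corpus.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```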
Contextual embeddings. The next jump was making the embedding of a word depend on its sentence. ELMo used a bidirectional LSTM over a language modelling objective. BERT replaced the LSTM with a transformer trained with a masked language model and a next-sentence prediction objective, producing bidirectional contextual representations from unlabeled text. The pretrained model could be fine-tuned with a single output layer to produce state-of-the-art results across eleven NLP benchmarks [10]. GPT-style decoder-only models produce contextual embeddings as a byproduct of next-token prediction, and those hidden states are now used as general-purpose text encoders.
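Extracting contextual embeddings from a pretrained encoder takes only a few lines with the Hugging Face transformers library; the checkpoint name here is just one common choice.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The same word ("bank") gets a different vector depending on its sentence.
sentences = ["The bank raised interest rates.", "They sat on the river bank."]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (2, seq_len, 768) contextual embeddings

print(hidden.shape)
```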
Sentence and document embeddings. Out of the box, BERT's token embeddings are a bad way to compare whole sentences: comparing 10,000 sentences pairwise would take about 65 hours on a single GPU [11]. Sentence-BERT solves this by fine-tuning BERT in a siamese setup with a contrastive or regression loss, producing fixed-length sentence vectors that can be compared with a dot product. The same 10,000-sentence comparison drops from 65 hours to about 5 seconds [11]. Universal Sentence Encoder, OpenAI's text-embedding models, and the more recent Cohere and Voyage embedders all sit in this lineage.
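A small example using the sentence-transformers library; the checkpoint name is illustrative, and any sentence-embedding model in the same family behaves similarly.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")    # one of many sentence-embedding checkpoints

sentences = ["A man is playing a guitar.",
             "Someone is strumming an instrument.",
             "The stock market fell sharply today."]
embeddings = model.encode(sentences)                # (3, 384) fixed-length vectors

print(util.cos_sim(embeddings[0], embeddings[1]))   # paraphrases: high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))   # unrelated topics: low similarity
```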
Self-supervised vision. SimCLR showed that a clean contrastive setup (augmented views of the same image as positives, other images in the batch as negatives) could match supervised ImageNet pretraining when given a large enough batch [12]. MoCo kept a queue of negatives produced by a slowly updated momentum encoder, which let the same idea work at smaller batch sizes. BYOL and SimSiam removed negatives entirely and relied on architectural tricks to avoid collapse. MAE went a different route, masking 75% of image patches and reconstructing them, which trained a ViT-Huge to 87.8% top-1 accuracy on ImageNet using only ImageNet-1K data [7]. DINO and DINOv2 used self-distillation between a student and an exponentially averaged teacher; DINOv2's features, learned without labels on 142 million curated images, work as drop-in inputs to linear probes across many vision tasks [13].
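For readers who want the mechanics, below is a minimal sketch of the NT-Xent contrastive loss used by SimCLR, written from the published description; the temperature value and tensor shapes are illustrative, and a real training loop adds distributed negatives and an augmentation pipeline.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss over two batches of embeddings of the same N images
    under different augmentations (z1[i] and z2[i] are a positive pair)."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)           # (2N, d) unit vectors
    sim = z @ z.T / temperature                                   # pairwise cosine similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool, device=z.device),
                          float("-inf"))                          # ignore self-similarity
    # For row i, the correct "class" is the index of the other view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Example with random stand-in embeddings from the two augmented views.
loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
```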
Multimodal representations. CLIP trained an image encoder and a text encoder jointly so that matching image-caption pairs produced similar embeddings and mismatched pairs produced dissimilar ones. After training on 400 million internet image-text pairs, the model could classify images zero-shot by comparing image embeddings to embeddings of prompts like "a photo of a dog" [14]. LiT froze a pretrained image encoder and trained only the text tower, often beating end-to-end CLIP at lower cost. SigLIP replaced CLIP's softmax over the batch with a sigmoid loss, which scales better and removes the need for huge batches. These shared embedding spaces are now the backbone of most text-to-image generators and many vision-language models.
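Zero-shot classification with a pretrained CLIP checkpoint is short using the Hugging Face wrappers; the checkpoint name, labels, and image path below are placeholders.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = Image.open("example.jpg")   # placeholder local image

# Embed the image and all prompts, then compare them in the shared space.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # similarity of the image to each prompt
print(dict(zip(labels, probs[0].tolist())))
```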
Bommasani and colleagues coined the term foundation model in their 2021 Stanford report to describe the new generation of large pretrained models that get adapted to many downstream tasks. Foundation models are based on standard deep learning and transfer learning, but their scale produces new emergent capabilities, and their effectiveness across so many tasks incentivises homogenisation of approaches across the field [15]. Practically, the report argues that models like BERT, GPT-3, CLIP, and DALL-E function as shared infrastructure: their representations are reused, fine-tuned, or queried by countless downstream applications.
This representation-as-infrastructure framing is now standard. A modern computer vision team rarely trains a backbone from scratch; they pull DINOv2 or a CLIP image encoder and fine-tune. A modern NLP team rarely trains an encoder from scratch; they pull BERT, RoBERTa, or a sentence-transformer. The good and bad of this arrangement both follow from concentration: every downstream system inherits whatever strengths and biases live in the foundation model's representations.
One of the most visible applications of learned representations is semantic search. A document collection is encoded once into vectors, a query is encoded at runtime, and the system returns documents whose vectors are closest to the query. The whole pattern hinges on having a good text embedding model and a fast way to do approximate nearest neighbour search.
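A minimal end-to-end sketch, assuming a sentence-transformers embedding model and a FAISS flat index; the model name and toy documents are placeholders, and production systems would use an approximate index over far larger corpora.

```python
from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer("all-MiniLM-L6-v2")     # example embedding model

docs = ["How to reset a forgotten password",
        "Annual report on revenue growth",
        "Steps for recovering account access"]

# Encode the collection once; unit-normalised vectors make inner product == cosine.
doc_vecs = model.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

# Encode the query at runtime and return the nearest documents.
query_vec = model.encode(["I can't log in to my account"],
                         normalize_embeddings=True).astype("float32")
scores, ids = index.search(query_vec, 2)
print([docs[i] for i in ids[0]])
```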
Vector databases are the storage and indexing systems that make this practical at scale. Common options in 2026 include the ones listed below.
| System | Type | Notable trait |
|---|---|---|
| Pinecone | Managed cloud | Fully hosted; auto-scales index and query layer |
| Weaviate | Open source and managed | Strong hybrid search combining vectors with keyword filters |
| Milvus | Open source | Designed for billions of vectors across distributed clusters |
| Qdrant | Open source and managed | Rust-based; rich payload filtering |
| Chroma | Open source | Lightweight; popular for local prototyping and small RAG apps |
| FAISS | Library | Facebook AI's similarity search library; embedded into many other systems |
| pgvector | Postgres extension | Adds vector search to an existing relational database |
Retrieval-augmented generation (RAG) is the dominant production use case driving vector database adoption. In a RAG system the user's question is embedded, relevant passages are retrieved by vector similarity, and a language model is asked to answer using those passages as context. The quality of the embedding model usually matters more than the choice of vector database for the resulting accuracy.
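In code, the pattern is little more than retrieval followed by prompt assembly. The sketch below reuses the embedding model and index from the semantic-search example above and assumes a hypothetical `generate` function standing in for whatever language model API the application uses.

```python
def answer(question, model, index, docs, generate, k=3):
    """Embed the question, retrieve the k nearest passages, and ask the LLM
    to answer using only those passages as context."""
    q_vec = model.encode([question], normalize_embeddings=True).astype("float32")
    _, ids = index.search(q_vec, k)
    context = "\n\n".join(docs[i] for i in ids[0])
    prompt = ("Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return generate(prompt)    # `generate` is a hypothetical stand-in for an LLM call
```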
If representations are the medium through which models think, the natural question is what is encoded in them. Probing is the standard technique: train a small classifier on frozen representations to predict some property and see how well it does. Alain and Bengio formalised this with linear probes, training a logistic regression on the activations of every layer of a deep network. They observed that the linear separability of features increases monotonically along the depth of the model, which gave a simple way to see how representations evolve [3].
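A linear probe is just a logistic regression on frozen features. The sketch below uses scikit-learn with random placeholder features and labels; in practice the features would come from a chosen layer of a pretrained model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Frozen features extracted from some layer of a pretrained model, plus labels.
# Random placeholders here, with shapes typical of a BERT-sized encoder.
features = np.random.randn(1000, 768)
labels = np.random.randint(0, 10, size=1000)

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))   # near chance on random features
```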
Tenney and colleagues applied probing to BERT in 2019 and found a layered structure that loosely tracked the classical NLP pipeline: lower layers handled surface and part-of-speech features, middle layers handled syntactic phenomena like constituents and dependency relations, and higher layers handled semantic ones such as semantic roles and coreference. Probing has its critics, since a probe might learn to extract information that the model itself never uses, but it remains one of the most direct ways to read what is in a representation.
More recent interpretability work uses sparse autoencoders trained on the hidden states of large language models to extract features that fire on a single concept. Anthropic and others have used this approach to find features for things like Golden Gate Bridge mentions, code-switching, or specific stylistic patterns in production-scale models.
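The common recipe is a wide linear encoder with a ReLU, a linear decoder, and an L1 penalty on the feature activations. The sketch below shows that recipe in PyTorch in its simplest form; real setups add details such as normalised decoder columns, careful initialisation, and much larger dictionaries.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over model activations.

    d_model: width of the hidden states being analysed.
    d_dict:  number of dictionary features, usually much larger than d_model.
    """
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = F.relu(self.encoder(x))       # sparse feature activations
        return self.decoder(f), f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages few active features.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().mean()

sae = SparseAutoencoder(d_model=768, d_dict=8192)
acts = torch.randn(32, 768)               # stand-in batch of LLM hidden states
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
```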
Learned representations can fail in characteristic ways. Two common failure modes are full collapse, where the model maps every input to roughly the same vector, and dimensional collapse, where the embeddings only span a low-dimensional subspace of the available space.
Contrastive methods like SimCLR avoid full collapse by construction, since pulling negatives apart prevents everything from mapping to one point. Non-contrastive methods like BYOL and SimSiam were initially surprising because they have no negatives, and yet they avoid collapse in practice. Subsequent analysis showed that batch normalisation, the predictor head, and the use of an exponentially averaged target encoder all play a role in preventing trivial solutions [16]. Even when full collapse is avoided, dimensional collapse is common: embedding vectors end up spanning only a low-dimensional subspace, which limits how much information they can carry. Whitening losses, projector heads, and rank-based regularisers are the common fixes.
In language models, a related problem called anisotropy shows up: the embeddings of all tokens cluster in a narrow cone, which makes cosine similarity less informative than it should be. Post-hoc whitening and contrastive fine-tuning are the usual remedies.
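Both problems can be spotted by looking at the singular value spectrum of a batch of embeddings, and post-hoc whitening is one of the remedies mentioned above. The sketch below shows both with NumPy; the epsilon is an arbitrary numerical safeguard.

```python
import numpy as np

def singular_value_spectrum(embeddings):
    """Singular values of mean-centred embeddings; a very fast drop-off suggests
    dimensional collapse or a narrow anisotropic cone."""
    centred = embeddings - embeddings.mean(axis=0, keepdims=True)
    return np.linalg.svd(centred, compute_uv=False)

def whiten(embeddings, eps=1e-5):
    """Post-hoc (ZCA) whitening: rotate and rescale so the dimensions are
    decorrelated with roughly unit variance."""
    centred = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = np.cov(centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centred @ W

emb = np.random.randn(1000, 256)          # stand-in embedding matrix
print(singular_value_spectrum(emb)[:5])   # inspect the largest singular values
emb_white = whiten(emb)
```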
Different kinds of data require different choices of input unit before any representation learning happens.
Text is almost universally tokenised first. Byte-pair encoding (BPE), WordPiece, and SentencePiece split text into subword units that balance vocabulary size against the rate of unknown tokens. The choice of tokenizer shapes the representation in surprising ways: unusual languages, code, and numbers all behave differently depending on whether they get split into many tiny pieces or kept as larger units.
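The effect is easy to see by tokenising a few strings with an off-the-shelf tokenizer; the WordPiece checkpoint below is one example, and other tokenizers split differently.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece vocabulary

for text in ["representation", "antidisestablishmentarianism", "3.14159", "def forward(self, x):"]:
    print(text, "->", tokenizer.tokenize(text))
# Common words stay whole; rare words, numbers, and code fragment into many subword pieces.
```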
Images in modern systems are usually split into patches before being fed to a transformer. ViT's choice of 16x16 patches turns a 224x224 image into a sequence of 196 tokens that the transformer processes in much the same way it would a sentence [17]. Convolutional backbones still produce useful features, but ViT-based approaches dominate at large scale.
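The patching itself is just a reshape. The sketch below reproduces the 224x224, 16x16 arithmetic with plain tensor operations; a real ViT then applies a learned linear projection to each flattened patch and adds position embeddings.

```python
import torch

img = torch.randn(1, 3, 224, 224)                   # batch of one RGB image
patches = img.unfold(2, 16, 16).unfold(3, 16, 16)   # (1, 3, 14, 14, 16, 16): 14x14 grid of 16x16 patches
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * 16 * 16)
print(patches.shape)                                # torch.Size([1, 196, 768]): 196 patch tokens
```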
Audio gets converted to log-mel spectrograms, then either fed to a convolutional encoder or, increasingly, a transformer. Self-supervised models like wav2vec 2.0 and HuBERT replace the older MFCC pipelines with learned representations of raw waveforms.
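A typical front end in torchaudio looks like the sketch below; the file path and the choice of 80 mel bins are placeholders.

```python
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("speech.wav")    # placeholder audio file
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=80)(waveform)
log_mel = torch.log(mel + 1e-6)    # (channels, 80 mel bins, time frames), fed to the encoder
print(log_mel.shape)
```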
Graphs and structured data have their own representation literature, with graph neural networks producing node and edge embeddings that capture local neighbourhood structure. Embeddings have also been applied to recommendations, where users and items live in a shared latent space, and to knowledge graphs, where entities and relations are represented so that plausible facts have small reconstruction error.
Learned representations come with their own set of recurring problems.
Bias inheritance. A representation reflects the data it was trained on. Word embeddings trained on web text encode the gender, racial, and cultural biases present in that text, and downstream classifiers that use those embeddings inherit them. CLIP famously associates certain occupations more strongly with certain genders or ethnicities, depending on what its training image-caption pairs contained.
Out-of-distribution failures. A representation that works well on data similar to its training distribution can degrade sharply outside it. ImageNet-trained features struggle on medical images; text encoders trained on web English struggle on legal jargon, archaic English, or code-mixed text without further adaptation.
Lack of grounding. Text-only language models learn representations that capture statistical relationships between tokens. They do not, on their own, ground those representations in perception or action. Multimodal training and tool use are partial answers, but the gap between manipulating text about the world and interacting with the world remains.
Specification gaming and shortcut features. When the easiest way to solve a training task is not the way the designer had in mind, the resulting representation will encode the shortcut. Vision models that latch onto background textures rather than object shapes are a classic example.
Opacity. Even when a representation works, understanding why it works is hard. The dimensions of a learned vector usually do not correspond to clean human concepts, and interpretability methods like probing and sparse coding only partly close the gap.
Think about how you might describe a dog to a friend. You could list facts: four legs, fur, a tail, barks. You could draw it. You could just point. Each of those is a different way to represent a dog, and the way you pick changes what your friend can do with the information. A drawing is good for spotting a similar dog later. A list of facts is good for filling out a form.
Machine learning models are the same way. They cannot look at the world directly; they only see numbers. The way we turn things into numbers is called the representation. In the old days, smart humans would design those numbers by hand, deciding which measurements mattered. These days, the computer figures out its own numbers by looking at a huge number of examples, and it usually does a better job than the humans did. Once it has those numbers, it can use them to find similar things, sort them into groups, or generate new ones, the same way you can use your mental picture of a dog to recognise one across the street.