Feature Extraction Models
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 4,177 words
Feature extraction models are machine learning systems that transform raw inputs such as text, images, or audio into dense numerical vectors known as embeddings or hidden-state representations. Rather than producing a final task prediction, these models expose intermediate activations that downstream applications reuse for retrieval, similarity comparison, clustering, classification, and conditioning generative models. The Hugging Face task taxonomy lists feature-extraction as a distinct category covering any model whose primary output is the encoded representation of an input rather than a task-specific decision.
The field grew from decades of work on hand-engineered descriptors and traditional dimensionality reduction techniques, then accelerated once deep learning showed that representations learned end to end could outperform manual features across computer vision, natural language processing, and speech. Modern feature extractors are usually large neural networks trained with self-supervised learning or contrastive learning on web-scale corpora.
Feature extraction means running an input through a frozen pre-trained model and reading out the activations of one or more layers as a fixed-length vector. The Hugging Face documentation describes the task as "extracting features learned in a model," producing numerical representations that capture the semantic content of the input. A separate downstream system, often a small classifier, similarity index, or vector database, then consumes these vectors.
This differs from end-to-end task models, which are optimized for a specific objective such as sentiment classification or machine translation. A feature extractor is optimized so its representations generalize across many downstream tasks, letting many applications share the cost of a single forward pass.
Cosine similarity or dot product in embedding space approximates semantic similarity, enabling semantic search, retrieval augmented generation, recommendation systems, and zero-shot classification. The same vectors support visualization with t-SNE or UMAP, clustering, and anomaly detection.
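A minimal sketch of similarity search over embeddings, using only the standard library. The three-dimensional vectors and document names are made up for illustration; real extractors emit hundreds or thousands of dimensions, and production systems would use an approximate nearest neighbor index rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical embeddings, chosen by hand for the example.
corpus = {
    "cat_article": [0.9, 0.1, 0.0],
    "dog_article": [0.8, 0.3, 0.1],
    "tax_form": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

results = top_k(query, corpus, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

If the vectors are normalized to unit length ahead of time, the dot product alone gives the same ranking, which is why many vector stores ask for normalized inputs.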
Before deep learning, vision systems relied on hand-designed local descriptors such as the Scale-Invariant Feature Transform (SIFT) introduced by David Lowe in 1999 and Histogram of Oriented Gradients (HOG) introduced by Navneet Dalal and Bill Triggs in 2005. Text systems used bag-of-words, TF-IDF, and Latent Semantic Analysis. These hand-crafted features required domain expertise and rarely transferred across tasks.
The 2012 victory of AlexNet on ImageNet showed that features learned by a convolutional neural network could outperform engineered descriptors. Practitioners discovered that activations from ImageNet-pretrained networks served as strong general features for unrelated vision tasks. Bengio, Courville, and Vincent surveyed this shift in their 2013 review Representation Learning: A Review and New Perspectives, arguing that automatic representation discovery would replace manual feature engineering.[1]
NLP followed a similar path. Distributed word embeddings from word2vec in 2013 and GloVe in 2014 replaced sparse representations. Both produced static, context-free vectors: a word received the same embedding regardless of its surrounding text. ELMo (2018) from AllenAI introduced contextual representations by running a bidirectional LSTM over the full sentence, and the release of BERT later that year extended the idea to transformer-scale contextual representations reusable across tasks. The historical trajectory reveals a clear progression: sparse statistical representations gave way to dense static vectors, which were then superseded by dynamic context-dependent models.[2] Self-supervised pretraining became the dominant route to foundation feature extractors.
Feature extraction models are the applied face of a broader scientific program called representation learning, which asks how to discover the latent structure of data automatically rather than relying on human-designed features. The landmark 2013 review by Bengio, Courville, and Vincent defined the goal as learning representations that expose the explanatory factors of variation behind the data.
The dominant implementation of this program is the frozen backbone plus downstream head pattern. A large model is pretrained on a broad, often unlabeled corpus using a self-supervised objective: masked token prediction for text encoders, masked image patch modeling for vision transformers, contrastive alignment for cross-modal models, or masked waveform prediction for audio encoders. The pretrained backbone captures rich, general representations. A lightweight downstream head, often a single linear layer or a shallow MLP, is then attached and trained on labeled data for the target task, while the backbone parameters remain frozen.
This separation produces several practical benefits. The expensive forward pass is computed once per input during inference, and multiple task heads can share a single backbone without multiplying storage. The frozen backbone also provides a strong inductive bias: training only the head is less likely to overfit on small labeled datasets than fine-tuning the full network. Research by Kumar et al. (2022) found that fine-tuning the entire network can distort pretrained features and underperform on out-of-distribution data compared to linear probing, particularly when the pretrained representations are already high quality.[19] A two-step strategy of linear probing followed by full fine-tuning (LP-FT) often combines the best of both regimes.
An extension called Head2Toe, introduced by Evci et al. at ICML 2022, relaxes the constraint that only the final layer's output is used. Instead, features from all intermediate layers are concatenated and a sparse linear selector is trained on top. The authors report that this matches fine-tuning performance on average while reducing training and storage cost by a hundredfold or more, and that it outperforms fine-tuning on out-of-distribution transfer tasks.[20]
Transfer learning is the broader umbrella under which feature extraction sits. In the strict feature-extraction regime, backbone weights are entirely frozen. In lightweight fine-tuning, only the top few layers are unfrozen. Parameter-efficient fine-tuning methods such as adapters and LoRA (Low-Rank Adaptation) insert small trainable modules into a frozen backbone, approaching full fine-tuning accuracy while training only a small fraction of the total parameters, typically a few percent or less. These methods allow the backbone to remain a universal feature extractor while still adapting its representations to a new domain.
Text feature extraction is dominated by transformer encoders. BERT, released by Google in October 2018, exposes 768-dimensional hidden states in its base configuration and 1024 in the large; many systems pool these states by averaging or by taking the [CLS] token. RoBERTa (Facebook AI, 2019) retrained BERT on more data with longer schedules.
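The two pooling choices mentioned above can be sketched in plain Python. The hidden states here are hand-written four-dimensional stand-ins (BERT-base would produce 768 values per token), and the attention mask convention of 1 for real tokens and 0 for padding is an assumption borrowed from common tokenizer output.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(hidden_states[0])
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cls_pool(hidden_states):
    """Use the vector at position 0, where BERT places the [CLS] token."""
    return list(hidden_states[0])

# One sequence: three real tokens followed by one padding position.
hidden_states = [
    [1.0, 0.0, 2.0, 0.0],  # [CLS]
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding, must not affect the mean
]
attention_mask = [1, 1, 1, 0]

mean_vec = mean_pool(hidden_states, attention_mask)
cls_vec = cls_pool(hidden_states)
print(mean_vec)
print(cls_vec)
```

Ignoring the mask is a common source of silently degraded embeddings, because padding vectors get averaged in with the real tokens.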
Generic BERT hidden states perform poorly for direct sentence comparison. Sentence-BERT, introduced by Nils Reimers and Iryna Gurevych at TU Darmstadt's UKP Lab in 2019, fine-tunes BERT with siamese and triplet losses so semantically related sentences land near each other under cosine similarity.[3] The sentence-transformers library it ships became the default toolkit for off-the-shelf text embeddings. For a deeper treatment of sentence-level text embeddings, see sentence similarity models.
Dedicated embedding APIs followed. OpenAI released text-embedding-3-small and text-embedding-3-large in January 2024, the large variant producing up to 3072-dimensional vectors with variable output dimensions via Matryoshka representation learning.[17] Cohere shipped Embed v3 in late 2023. The open-source BGE family from BAAI and E5 family from Microsoft Research provide free alternatives that frequently top public benchmarks.
Classical image feature extractors are convolutional networks pre-trained on ImageNet. ResNet-50 and ResNet-152, introduced by Kaiming He and colleagues at Microsoft Research Asia in 2015, produce 2048-dimensional features from their penultimate pooled layer.[4] The vision transformer (ViT), introduced by Alexey Dosovitskiy and colleagues at Google Research in October 2020, treats an image as a sequence of patches and scales well with data and compute.[5]
Contrastively trained image-text models reshaped the field. CLIP, released by OpenAI in January 2021, trained a ViT image encoder and a text encoder jointly on 400 million image-caption pairs so matched pairs share a vector space.[6] CLIP image features support strong zero-shot classification and have become the default backbone for many vision applications. SigLIP, introduced by Xiaohua Zhai and colleagues at Google Research in 2023, replaced the softmax contrastive loss with a sigmoid loss and reached 84.5% ImageNet zero-shot accuracy.[8]
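The difference between the two losses can be illustrated on a small similarity matrix, where entry (i, j) is the cosine similarity between image i and caption j and the diagonal holds the matched pairs. This is a sketch under stated assumptions, not the training code of either model: the temperature and bias values are illustrative, and the function names are invented for the example.

```python
import math

def softmax_contrastive_loss(sim, temperature=10.0):
    """Symmetric softmax loss: each image must pick its caption out of the
    whole batch, and each caption must pick its image."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        row = [temperature * sim[i][j] for j in range(n)]  # image i vs all captions
        col = [temperature * sim[j][i] for j in range(n)]  # caption i vs all images
        for logits in (row, col):
            log_norm = math.log(sum(math.exp(v) for v in logits))
            total += log_norm - logits[i]
    return total / (2 * n)

def sigmoid_pairwise_loss(sim, temperature=10.0, bias=-10.0):
    """Sigmoid loss: every (image, caption) pair is an independent binary
    decision, labeled +1 on the diagonal and -1 elsewhere."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = label * (temperature * sim[i][j] + bias)
            total += math.log1p(math.exp(-logit))  # equals -log(sigmoid(logit))
    return total / n

aligned = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
misaligned = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.2], [0.0, 0.2, 0.9]]

for name, sim in (("aligned", aligned), ("misaligned", misaligned)):
    print(name, softmax_contrastive_loss(sim), sigmoid_pairwise_loss(sim))
```

The softmax version needs a normalization over the full batch for every row and column, whereas the sigmoid version sums independent per-pair terms, which is the property that makes it simpler to shard across devices.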
Purely self-supervised vision encoders matured in parallel. DINOv2, released by Meta in April 2023, trains ViTs with self-distillation on a curated corpus of about 142 million images and outperforms OpenCLIP on many benchmarks without any text supervision.[7] EVA from BAAI (2022) scaled masked image modeling to billion-parameter ViTs and powered the EVA-CLIP variants.
Speech and audio embeddings rely on self-supervised models trained on raw waveforms. wav2vec 2.0, introduced by Alexei Baevski and colleagues at Facebook AI in June 2020, masks spans of a latent speech representation and learns to pick the true quantized latent for each masked position out of a set of distractors; it can then be fine-tuned to strong speech recognition with as little as ten minutes of labeled audio.[9] HuBERT, released by the same group in June 2021, replaces the contrastive objective with masked prediction over clustered targets.[10]
The encoder of OpenAI's Whisper, released in September 2022, was trained for automatic speech recognition on 680,000 hours of weakly labeled multilingual audio; its mid-network activations serve as general audio features.[11] Music-specific encoders extend the recipe to musical content. MERT (2023), developed by the M-A-P group, applies masked acoustic modeling with acoustic and musical teacher models to learn representations that generalize across music understanding tasks including beat tracking, pitch detection, genre classification, and melody extraction. MERT-v1-330M outperformed previous music SSL models on most tasks in the MARBLE evaluation suite.
Multimodal extractors output vectors comparable across input types. CLIP and SigLIP align image and text. The visual encoder inside BLIP and BLIP-2 from Salesforce Research, released between 2022 and 2023, exposes representations that feed vision language models and stand alone for retrieval. LLaVA and similar multimodal LLMs use a frozen CLIP-style vision encoder as their feature extractor.
ImageBind, released by Meta in May 2023, binds six modalities (images, text, audio, depth, thermal, and IMU signals) into a single embedding space using only image-paired data.[12] The shared space supports cross-modal retrieval such as searching audio with images. Because image-paired data is abundant for each modality, ImageBind avoids the need for direct pairings between non-image modalities, an insight that generalizes the contrastive pretraining recipe to arbitrarily many input types.
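The choice of extraction strategy affects embedding quality substantially. Common strategies include taking the final hidden state of the [CLS] token, averaging hidden states across tokens, and, as in Head2Toe, combining activations from several intermediate layers.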
For instruction-tuned embedding models, a task description is prepended to the query before encoding: for example, "Represent the question for retrieving relevant documents: {query}." This prompt steering lets a single model behave as a retrieval encoder, a classification encoder, or a clustering encoder without weight changes.
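As a small sketch of that prompt steering, the helper below prepends the task description from the example above before the text would be handed to an encoder. The function name is hypothetical, and the exact template wording is model-specific, so the string here should be treated as a placeholder for whatever a given model card prescribes.

```python
# Template copied from the example in the text; other models expect other wording.
QUERY_TEMPLATE = "Represent the question for retrieving relevant documents: {query}"

def build_query_input(query, template=QUERY_TEMPLATE):
    """Prepend the task description to a query before it is encoded."""
    return template.format(query=query.strip())

text = build_query_input("  how do vaccines work? ")
print(text)
```

Swapping the template for a classification or clustering instruction changes what the same frozen weights emphasize, without any retraining.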
| Model | Released | Organization | Modality | Typical dim | Notes |
|---|---|---|---|---|---|
| ResNet-50 | Dec 2015 | Microsoft Research Asia | Image | 2048 | Penultimate pooled features, ImageNet baseline |
| BERT-base | Oct 2018 | Google | Text | 768 | Hidden states, 110M parameters |
| Sentence-BERT | Aug 2019 | UKP Lab TU Darmstadt | Text | 384 to 1024 | Siamese fine-tune of BERT or RoBERTa |
| ViT-B/16 | Oct 2020 | Google Research | Image | 768 | Patch transformer pre-trained on JFT-300M |
| wav2vec 2.0 base | Jun 2020 | Facebook AI | Audio | 768 | Self-supervised speech |
| CLIP ViT-L/14 | Jan 2021 | OpenAI | Image and text | 768 | 400M image-text pairs |
| HuBERT large | Jun 2021 | Facebook AI | Audio | 1024 | Masked prediction with clustered targets |
| EVA | Nov 2022 | BAAI | Image | up to 1408 | Masked image modeling at scale |
| Whisper encoder | Sep 2022 | OpenAI | Audio | 384 to 1280 | Trained for speech recognition, reused for features |
| DINOv2 ViT-L | Apr 2023 | Meta | Image | 1024 | Self-supervised, no text supervision |
| ImageBind | May 2023 | Meta | Six modalities | 1024 | Image, text, audio, depth, thermal, IMU |
| SigLIP | Mar 2023 | Google Research | Image and text | 768 to 1152 | Sigmoid contrastive loss |
| MERT-v1-330M | Mar 2023 | M-A-P | Audio (music) | 1024 | Self-supervised music understanding |
| BGE-large-en-v1.5 | Aug 2023 | BAAI | Text | 1024 | Open weights, strong MTEB results |
| text-embedding-3-large | Jan 2024 | OpenAI | Text | up to 3072 | Matryoshka, configurable dimensions |
| NV-Embed-v2 | Aug 2024 | NVIDIA | Text | 4096 | LLM-derived, topped MTEB at release |
| gte-Qwen2-7B-instruct | Jun 2024 | Alibaba | Text | 3584 | Instruction-tuned 7B embedding |
| Qwen3-Embedding-8B | Jun 2025 | Alibaba Qwen | Text | 4096 | Ranked 1st on MTEB Multilingual at release |
Embedding sizes range from 128 dimensions for compact mobile models to 4096 or more for LLM-derived embedders. A corpus of one billion vectors at 3072 float32 dimensions occupies roughly twelve terabytes before any indexing structures.
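The storage figure follows from simple arithmetic, sketched below. The function name is invented for the example; the alternative settings (int8 values, truncation to 256 dimensions) are illustrative comparisons rather than recommendations.

```python
def embedding_storage_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw size of a dense vector store, before any index overhead."""
    return num_vectors * dim * bytes_per_value

ONE_BILLION = 1_000_000_000

float32_full = embedding_storage_bytes(ONE_BILLION, 3072)                  # 4 bytes per value
int8_full = embedding_storage_bytes(ONE_BILLION, 3072, bytes_per_value=1)  # 1 byte per value
float32_256 = embedding_storage_bytes(ONE_BILLION, 256)                    # truncated vectors

for name, size in (("float32 x 3072", float32_full),
                   ("int8 x 3072", int8_full),
                   ("float32 x 256", float32_256)):
    print(f"{name}: {size / 1e12:.3f} TB")
```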
Matryoshka Representation Learning, introduced by Aditya Kusupati and colleagues at NeurIPS 2022, trains a single model whose leading k dimensions remain useful on their own for a set of nested sizes k, so the same vector can be truncated to fit storage budgets.[15] Product quantization compresses vectors into byte codes; binary and int8 quantization trade accuracy for size. Approximate nearest neighbor indexes such as HNSW, IVF, and ScaNN make retrieval over hundreds of millions of vectors practical.
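The compression options above can be sketched on a single vector. This is a simplified illustration: truncation only preserves quality when the model was trained with a Matryoshka-style objective, the int8 scheme shown is a basic symmetric per-vector scaling (one of several possible schemes), and product quantization is omitted because it needs a trained codebook.

```python
import math

def truncate_and_renormalize(vec, k):
    """Keep the first k dimensions and rescale to unit length."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def binary_quantize(vec):
    """One bit per dimension: 1 for positive values, 0 otherwise."""
    return [1 if x > 0 else 0 for x in vec]

def int8_quantize(vec):
    """Symmetric per-vector quantization to integers in [-127, 127]."""
    scale = max(abs(x) for x in vec) / 127
    codes = [round(x / scale) for x in vec]
    return codes, scale

vec = [0.5, -0.25, 0.75, 0.1, -0.6, 0.05, 0.3, -0.2]

short = truncate_and_renormalize(vec, 4)
bits = binary_quantize(vec)
codes, scale = int8_quantize(vec)
restored = [c * scale for c in codes]  # approximate reconstruction

print(short)
print(bits)
print(codes)
```

Each step trades some accuracy for size, so the choice is usually validated against a retrieval benchmark on the target data rather than assumed.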
Embedding dimensionality has generally risen alongside model scale, but larger vectors do not guarantee proportionally better retrieval. Marginal quality gains tend to diminish beyond a few thousand dimensions, while storage and indexing cost grow linearly with dimension. The Matryoshka approach eases this tradeoff by letting callers pick an effective dimension after encoding: vectors are truncated to the desired length rather than re-embedded with a different model.
The practical workflow for applying a feature extractor to a new task follows a standard pattern. First, a base model is selected based on modality, size, and whether its pretraining distribution resembles the target domain. Second, input examples from the target dataset are passed through the frozen encoder to produce embeddings, a step often called embedding or featurizing the dataset. Third, a lightweight head is trained on the resulting vectors using a task-specific loss.
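The three steps can be sketched end to end with a toy stand-in for the encoder. Everything here is hypothetical scaffolding: `FROZEN_WEIGHTS` plays the role of a pretrained backbone, the dataset is two synthetic clusters, and the head is logistic regression trained by plain gradient descent. A real pipeline would swap in an actual pretrained model for `frozen_encoder`.

```python
import math
import random

# Step 1: a "selected" backbone. These fixed weights are never updated below.
FROZEN_WEIGHTS = ((0.8, 0.3), (-0.5, 0.9), (0.4, -0.7))

def frozen_encoder(x):
    """Map a 2-D input to a 3-D feature vector using the fixed weights."""
    return [math.tanh(w0 * x[0] + w1 * x[1]) for w0, w1 in FROZEN_WEIGHTS]

def make_dataset(n_per_class=20, seed=0):
    """Two synthetic clusters centered at (-1, -1) and (1, 1)."""
    rng = random.Random(seed)
    data = []
    for label, center in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            point = (center + rng.uniform(-0.3, 0.3), center + rng.uniform(-0.3, 0.3))
            data.append((point, label))
    return data

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_head(features, labels, lr=0.5, epochs=200):
    """Logistic regression on precomputed features; only the head is trained."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for feat, y in zip(features, labels):
            err = sigmoid(sum(w * f for w, f in zip(weights, feat)) + bias) - y
            for i in range(dim):
                grad_w[i] += err * feat[i] / n
            grad_b += err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

dataset = make_dataset()
# Step 2: featurize the dataset once with the frozen encoder.
features = [frozen_encoder(x) for x, _ in dataset]
labels = [y for _, y in dataset]
# Step 3: train a lightweight head on the cached features.
weights, bias = train_linear_head(features, labels)

predictions = [1 if sum(w * f for w, f in zip(weights, feat)) + bias > 0 else 0
               for feat in features]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"training accuracy: {accuracy:.2f}")
```

Because the features are computed once and cached, trying a different head or hyperparameter setting does not require another pass through the encoder.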
Several factors govern when this approach outperforms full fine-tuning. When labeled data is scarce, a frozen backbone with a simple head avoids overfitting: in full fine-tuning the backbone parameters far outnumber the labeled samples, so gradient updates tend to destroy the general structure of the representations. When test inputs drift away from the distribution of the fine-tuning data, linear probing on a strong encoder has been shown to preserve out-of-distribution generalization better than full fine-tuning, because fine-tuning reshapes the feature space toward the narrow fine-tuning distribution.[19] Conversely, when labeled data is abundant and the target task is closely related to the pretraining objective, full fine-tuning or LP-FT typically wins in-distribution.
Parameter-efficient fine-tuning bridges the two regimes. LoRA inserts trainable low-rank matrices into the attention projections of a frozen transformer, typically adding fewer than one percent of the original parameter count while recovering most of the fine-tuning gains. Adapter modules insert small bottleneck layers at each transformer block. These techniques are standard in the LLM embedding literature: models such as NV-Embed-v2 use instruction-following adapters that redirect the frozen LLM's representations toward a retrieval-optimized subspace without altering the core backbone.
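The parameter savings of LoRA come from replacing a full weight update with the product of two thin matrices. The sketch below assumes the usual formulation, in which a frozen weight W of shape d_out x d_in is augmented by B times A with B of shape d_out x r and A of shape r x d_in, scaled by alpha / r. The layer size and rank are illustrative, and the helper names are invented for the example.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA pair: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Frozen path W x plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# Parameter count for one hypothetical 4096 x 4096 projection at rank 8.
full_params = 4096 * 4096
added_params = lora_param_count(4096, 4096, rank=8)
fraction = added_params / full_params
print(f"{added_params} added vs {full_params} frozen ({fraction:.4%})")

# Tiny forward pass: with B initialized to zeros, the output equals the frozen layer.
W = [[1.0, 2.0], [3.0, 4.0]]   # frozen
A = [[0.5, -0.5]]              # trainable, rank 1
B = [[0.0], [0.0]]             # trainable, zero at initialization
x = [1.0, 1.0]
y = lora_forward(W, A, B, x, alpha=2.0, rank=1)
print(y)
```

Starting B at zero means the adapted model initially reproduces the pretrained features exactly, so training begins from the frozen extractor's behavior.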
Head2Toe (Evci et al., ICML 2022) demonstrated that the final layer is not always the most informative extraction point. By selecting features from all intermediate layers and training a sparse linear head over the concatenated activations, the method matched full fine-tuning accuracy on average and outperformed it on out-of-distribution benchmarks. Because the backbone stays frozen, no gradients need to flow through it during adaptation.[20]
| Use case | How features are used |
|---|---|
| Semantic search | Encode queries and documents, retrieve by nearest neighbor |
| Retrieval augmented generation | Fetch relevant passages with embeddings, condition an LLM on them |
| Recommendation | Embed users and items, score by inner product |
| Clustering and visualization | Group vectors with k-means or HDBSCAN, project with UMAP or t-SNE |
| Near-duplicate detection | Threshold cosine similarity to find paraphrases or repost images |
| Anomaly detection | Flag samples far from any cluster centroid in embedding space |
| Transfer learning | Freeze the extractor and train a small head on a new task |
| Zero-shot classification | Compare image features to text features of candidate labels |
| Image search and reverse image lookup | Index images by ViT or CLIP features |
| Audio fingerprinting | Identify songs or speakers from short audio clips |
| Multimodal retrieval | Search text with images, images with audio, or any cross-modal pair |
| Conditioning generative models | Provide vision tokens to multimodal LLMs and diffusion models |
| Domain adaptation | Fine-tune a small adapter on domain-specific data while keeping the backbone frozen |
| Data augmentation assessment | Measure whether augmented samples remain close in embedding space to originals |
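The zero-shot classification row can be made concrete with a short sketch: an image feature is compared against the text features of candidate label prompts, and a softmax over scaled cosine similarities gives label probabilities. The vectors are hand-written stand-ins for the outputs of an image-text model, and the temperature of 100 is an assumed scaling factor rather than a value taken from any particular checkpoint.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_classify(image_vec, label_vecs, temperature=100.0):
    """Softmax over scaled cosine similarities between one image and each label prompt."""
    img = normalize(image_vec)
    logits = {
        label: temperature * sum(a * b for a, b in zip(img, normalize(vec)))
        for label, vec in label_vecs.items()
    }
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - max_logit) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical features in a shared image-text space.
label_vecs = {
    "a photo of a cat": [0.9, 0.2, 0.1],
    "a photo of a dog": [0.1, 0.95, 0.2],
    "a photo of a car": [0.1, 0.1, 0.9],
}
image_vec = [0.2, 0.9, 0.1]

probs = zero_shot_classify(image_vec, label_vecs)
best = max(probs, key=probs.get)
print(best, round(probs[best], 4))
```

No classifier is trained here: adding a new class only requires encoding one more text prompt.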
Text embedding quality is most commonly measured by the Massive Text Embedding Benchmark (MTEB), introduced by Niklas Muennighoff and colleagues in 2022.[14] MTEB aggregates 8 task families across 58 datasets and 112 languages, including retrieval, reranking, clustering, classification, sentence similarity, and pair classification.
For image features, ImageNet linear probing remains a standard signal: freeze the backbone, train a linear classifier on ImageNet labels, and report top-1 accuracy. The VTAB and CLIP zero-shot ImageNet benchmarks extend evaluation to broader transfer settings. Audio extractors are commonly evaluated on LibriSpeech for speech recognition transfer and VoxCeleb for speaker identification. The SUPERB suite covers a range of downstream speech tasks for self-supervised encoders. Music understanding models are evaluated on the MARBLE (Music Audio Representation Benchmark for Universal Evaluation) suite, which spans ten tasks including key detection, beat tracking, pitch detection, and genre classification.
The top of MTEB by mid-2024 was occupied by large language model derived embedders such as E5-Mistral, NV-Embed, SFR-Embedding, and gte-Qwen2-7B-instruct. These models adapt a generative LLM into a bidirectional encoder, then fine-tune with contrastive objectives on curated retrieval pairs.[18] NV-Embed-v2 reached 72.31 on the English MTEB in August 2024.
Instruction tuning has become standard, with callers providing a short description of the retrieval intent so a single model can switch between tasks. Multimodal foundation models serve as universal feature extractors: the same ViT or LLaVA-style backbone may feed retrieval pipelines, classification heads, and generative decoders. Open-weight releases such as BGE-multilingual-gemma2 and Qwen3-Embedding continue to narrow the gap with closed commercial APIs. Qwen3-Embedding-8B, released in June 2025, ranked first on the MTEB Multilingual leaderboard with a score of 70.58, surpassing proprietary offerings including Gemini-Embedding.
The boundary between feature extraction and generation is blurring. The same backbone that serves as a frozen encoder in a retrieval system may also serve as the generative core of a multimodal chat model; what distinguishes the two uses is whether a task head or a generative decoder is attached to it.
Feature extractors inherit biases from their pretraining corpora. Embeddings can encode demographic stereotypes and skew toward majority languages, propagating into downstream retrieval and classification. Domain transfer is uneven: a general-purpose text embedder may struggle on clinical notes or legal contracts without adaptation.
Evaluation is fragile. MTEB scores can be optimized by training on data closely related to its benchmarks, and reproducibility suffers when top models do not disclose data or weights. Beyond a few thousand dimensions, marginal gains in retrieval quality shrink while storage and compute grow.
In retrieval augmented generation, embedding mismatches surface as hallucinations downstream: if a retriever returns a tangentially related document, the generator may treat it as authoritative. Production deployments combine dense retrieval with sparse methods such as BM25 and with reranking models. Generating and indexing billions of vectors at high dimension imposes nontrivial cost, motivating ongoing work on quantization and distillation.
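One common way to combine a dense ranking with a sparse one is reciprocal rank fusion (RRF), sketched below. The text does not name a specific fusion method, so RRF is an assumption chosen for its simplicity; the document IDs are invented, and the constant k = 60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs; each list contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical top-3 results from an embedding retriever and from BM25.
dense_ranking = ["d3", "d1", "d7"]
sparse_ranking = ["d1", "d9", "d3"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Documents that both retrievers rank highly rise to the top, which dampens the effect of a single retriever returning a tangential match. The fused list would typically be passed to a reranking model before reaching the generator.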
Feature extraction also has specific failure modes that fine-tuning can avoid. When the target domain differs substantially from pretraining, even a very deep backbone may not encode the right distinctions. Medical imaging, satellite imagery, and niche industrial inspection tasks have each motivated domain-specific pretraining rather than relying on ImageNet-pretrained or web-text-pretrained backbones. The frozen backbone assumption breaks down when the input distribution is far outside the pretraining manifold: the encoder may map all such inputs into a small dense region, collapsing the geometry that downstream tasks rely on.
Finally, the practical engineering of feature extraction at scale introduces its own concerns. Embedding a dataset of billions of documents requires distributed inference infrastructure. Incremental updates, where new documents are added without re-embedding the entire corpus, can cause index drift if the embedding model is updated, requiring careful versioning of both model and index. These operational challenges are as important as raw embedding quality in production deployments.
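One way to guard against mixing vectors from different model versions is to tag the index with the version of the encoder that produced it and reject anything else. The class below is a hypothetical in-memory sketch of that bookkeeping, not the API of any vector database; the version strings are made up.

```python
class VersionedIndex:
    """Toy index that only accepts vectors from one embedding model version."""

    def __init__(self, model_version):
        self.model_version = model_version
        self.vectors = {}

    def _check(self, model_version):
        if model_version != self.model_version:
            raise ValueError(
                f"index built with {self.model_version!r}, got {model_version!r}; "
                "re-embed the corpus or query with the matching model"
            )

    def add(self, doc_id, vector, model_version):
        self._check(model_version)
        self.vectors[doc_id] = vector

index = VersionedIndex("encoder-v1")
index.add("doc1", [0.1, 0.2], model_version="encoder-v1")

try:
    index.add("doc2", [0.3, 0.4], model_version="encoder-v2")
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```

The same check applies to query vectors: a query encoded by a newer model lives in a different space than the stored documents, even when the dimensions match.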