Feature Extraction Models
Last reviewed
May 11, 2026
Sources
18 citations
Review status
Source-backed
Revision
v2 · 2,496 words
Feature extraction models are machine learning systems that transform raw inputs such as text, images, or audio into dense numerical vectors known as embeddings or hidden-state representations. Rather than producing a final task prediction, these models expose intermediate activations that downstream applications reuse for retrieval, similarity comparison, clustering, classification, and conditioning generative models. The Hugging Face task taxonomy lists feature-extraction as a distinct category covering any model whose primary output is the encoded representation of an input rather than a task-specific decision.
The field grew from decades of work on hand-engineered descriptors and traditional dimensionality reduction techniques, then accelerated once deep learning showed that representations learned end to end could outperform manual features across computer vision, natural language processing, and speech. Modern feature extractors are usually large neural networks trained with self-supervised learning or contrastive learning on web-scale corpora.
Feature extraction means running an input through a frozen pre-trained model and reading out the activations of one or more layers as a fixed-length vector. The Hugging Face documentation describes the task as "extracting features learned in a model," producing numerical representations that capture the semantic content of the input. A separate downstream system, often a small classifier, similarity index, or vector database, then consumes these vectors.
This differs from end-to-end task models, which are optimized for a specific objective such as sentiment classification or machine translation. A feature extractor is optimized so its representations generalize across many downstream tasks, letting many applications share the cost of a single forward pass.
Cosine similarity or dot product in embedding space approximates semantic similarity, enabling semantic search, retrieval augmented generation, recommendation systems, and zero-shot classification. The same vectors support visualization with t-SNE or UMAP, clustering, and anomaly detection.
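As a rough illustration of similarity search in embedding space, the sketch below ranks a tiny corpus by cosine similarity to a query. The three-dimensional vectors, the document titles, and the `top_k` helper are all hypothetical stand-ins; a real system would obtain vectors from a feature extractor and usually delegate the search to a vector index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 3-dimensional embeddings; real models emit hundreds of dimensions.
corpus = {
    "cat care guide": [0.9, 0.1, 0.0],
    "dog training tips": [0.7, 0.3, 0.1],
    "tax filing deadlines": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]  # stand-in for an embedded query about pets

results = top_k(query, corpus, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

When vectors are normalized to unit length in advance, cosine similarity and dot product give the same ranking, which is why many indexes store normalized vectors and use the cheaper dot product.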
Before deep learning, vision systems relied on hand-designed local descriptors such as the Scale-Invariant Feature Transform (SIFT) introduced by David Lowe in 1999 and Histogram of Oriented Gradients (HOG) introduced by Navneet Dalal and Bill Triggs in 2005. Text systems used bag-of-words, TF-IDF, and Latent Semantic Analysis. These hand-crafted features required domain expertise and rarely transferred across tasks.
The 2012 victory of AlexNet on ImageNet showed that features learned by a convolutional neural network could outperform engineered descriptors. Practitioners discovered that activations from ImageNet-pretrained networks served as strong general features for unrelated vision tasks. Bengio, Courville, and Vincent surveyed this shift in their 2013 review Representation Learning: A Review and New Perspectives, arguing that automatic representation discovery would replace manual feature engineering.
NLP followed a similar path. Distributed word embeddings from word2vec in 2013 and GloVe in 2014 replaced sparse representations, and the 2018 releases of ELMo and BERT extended the idea to contextual representations reusable across tasks. Self-supervised pretraining became the dominant route to foundation feature extractors.
Text feature extraction is dominated by transformer encoders. BERT, released by Google in October 2018, exposes 768-dimensional hidden states in its base configuration and 1024-dimensional hidden states in its large configuration; many systems pool these per-token states into a single vector by averaging them or by taking the state of the [CLS] token. RoBERTa (Facebook AI, 2019) retrained BERT on more data with longer schedules.
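The two pooling choices can be sketched on a toy matrix of hidden states. The values, the 3-dimensional width, and the helper names `cls_pool` and `mean_pool` are illustrative assumptions; the attention mask marking padding positions follows the common convention of 1 for real tokens and 0 for padding.

```python
def cls_pool(hidden_states):
    """Use the vector at position 0, where BERT places the [CLS] token."""
    return list(hidden_states[0])

def mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0 (padding)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]

# Toy hidden states: 4 token positions, 3 dimensions (BERT-base would have 768).
hidden_states = [
    [1.0, 0.0, 2.0],  # [CLS]
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
    [9.0, 9.0, 9.0],  # padding position, masked out below
]
attention_mask = [1, 1, 1, 0]

cls_vector = cls_pool(hidden_states)
mean_vector = mean_pool(hidden_states, attention_mask)
print(cls_vector)
print(mean_vector)
```

Skipping masked positions matters in batched inference, where shorter inputs are padded and unmasked averaging would pull every vector toward the padding states.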
Generic BERT hidden states perform poorly for direct sentence comparison. Sentence-BERT, introduced by Nils Reimers and Iryna Gurevych at TU Darmstadt's UKP Lab in 2019, fine-tunes BERT with siamese and triplet losses so semantically related sentences land near each other under cosine similarity. The accompanying sentence-transformers library became the default toolkit for off-the-shelf text embeddings.
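A triplet objective of the kind mentioned above can be sketched as a hinge on distances: the loss is zero once the negative sentence sits at least a margin farther from the anchor than the positive does. The Euclidean distance, the margin of 1.0, and the 2-dimensional vectors below are illustrative assumptions, not a reproduction of the Sentence-BERT training code.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: zero once the negative is at least `margin` farther than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

# Hypothetical sentence embeddings.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]       # paraphrase of the anchor
near_negative = [1.0, 1.0]  # unrelated sentence that is still too close
far_negative = [3.0, 4.0]   # unrelated sentence already far away

violating = triplet_loss(anchor, positive, near_negative)
satisfied = triplet_loss(anchor, positive, far_negative)
print(violating, satisfied)
```

Only the violating triplet contributes a gradient during training, which is why selecting hard negatives tends to matter more than adding easy ones.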
Dedicated embedding APIs followed. OpenAI released text-embedding-3-small and text-embedding-3-large in January 2024; the large variant produces vectors of up to 3072 dimensions and supports shorter outputs through Matryoshka representation learning. Cohere shipped Embed v3 in late 2023. The open-source BGE family from BAAI and E5 family from Microsoft Research provide free alternatives that frequently top public benchmarks.
Classical image feature extractors are convolutional networks pre-trained on ImageNet. ResNet-50 and ResNet-152, introduced by Kaiming He and colleagues at Microsoft Research Asia in 2015, produce 2048-dimensional features from their penultimate pooled layer. The Vision Transformer (ViT), introduced by Alexey Dosovitskiy and colleagues at Google Research in October 2020, treats an image as a sequence of patches and scales well with data and compute.
Contrastively trained image-text models reshaped the field. CLIP, released by OpenAI in January 2021, trained a ViT image encoder and a text encoder jointly on 400 million image-caption pairs so matched pairs share a vector space. CLIP image features support strong zero-shot classification and have become the default backbone for many vision applications. SigLIP, introduced by Xiaohua Zhai and colleagues at Google Research in 2023, replaced the softmax contrastive loss with a sigmoid loss and reached 84.5% ImageNet zero-shot accuracy.
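Zero-shot classification with an image-text model can be sketched as follows: embed the image, embed one text prompt per candidate label, and take a softmax over the scaled cosine similarities. The vectors, the prompts, and the temperature of 0.01 are hypothetical; in a real system both encoders come from the same contrastively trained model and the scale is learned during training.

```python
import math

def normalize(vec):
    """Rescale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_classify(image_vec, label_vecs, temperature=0.01):
    """Softmax over scaled cosine similarities between one image and each label text."""
    image_vec = normalize(image_vec)
    logits = {}
    for label, text_vec in label_vecs.items():
        text_vec = normalize(text_vec)
        logits[label] = sum(a * b for a, b in zip(image_vec, text_vec)) / temperature
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

# Hypothetical embeddings in a shared 3-dimensional image-text space.
image_vec = [0.9, 0.1, 0.2]
label_vecs = {
    "a photo of a dog": [1.0, 0.0, 0.1],
    "a photo of a car": [0.0, 1.0, 0.0],
    "a photo of a tree": [0.1, 0.2, 1.0],
}

probs = zero_shot_classify(image_vec, label_vecs)
prediction = max(probs, key=probs.get)
print(prediction)
```

No classifier is trained here: adding a new class only requires embedding one more text prompt.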
Pure self-supervised vision encoders matured in parallel. DINOv2, released by Meta in April 2023, trains ViTs with self-distillation on LVD-142M, a curated corpus of 142 million images, and outperforms OpenCLIP on many benchmarks without any text supervision. EVA from BAAI (2022) scaled masked image modeling to billion-parameter ViTs and powered the EVA-CLIP variants.
Speech and audio embeddings rely on self-supervised models trained on raw waveforms. wav2vec 2.0, introduced by Alexei Baevski and colleagues at Facebook AI in June 2020, masks portions of a latent speech representation and learns to identify the true quantized latent for each masked time step among distractors; it can then be fine-tuned to strong speech recognition with as little as ten minutes of labeled audio. HuBERT, released by the same group in June 2021, replaces the contrastive objective with masked prediction over clustered targets.
The encoder of OpenAI's Whisper, released in September 2022, was trained for automatic speech recognition on 680,000 hours of weakly labeled multilingual audio; its mid-network activations serve as general audio features. Music-specific encoders such as MERT (2023) extend the recipe to musical content.
Multimodal extractors output vectors comparable across input types. CLIP and SigLIP align image and text. The visual encoders inside BLIP and BLIP-2 from Salesforce Research, released in 2022 and 2023 respectively, expose representations that feed vision-language models and also work on their own for retrieval. LLaVA and similar multimodal LLMs use a frozen CLIP-style vision encoder as their feature extractor.
ImageBind, released by Meta in May 2023, binds six modalities (images, text, audio, depth, thermal, and IMU signals) into a single embedding space using only image-paired data. The shared space supports cross-modal retrieval such as searching audio with images.
Common extraction strategies include:
| Model | Released | Organization | Modality | Typical dim | Notes |
|---|---|---|---|---|---|
| ResNet-50 | Dec 2015 | Microsoft Research Asia | Image | 2048 | Penultimate pooled features, ImageNet baseline |
| BERT-base | Oct 2018 | Google | Text | 768 | Hidden states, 110M parameters |
| Sentence-BERT | Aug 2019 | UKP Lab TU Darmstadt | Text | 384 to 1024 | Siamese fine-tune of BERT or RoBERTa |
| ViT-B/16 | Oct 2020 | Google Research | Image | 768 | Patch transformer pre-trained on JFT-300M |
| wav2vec 2.0 base | Jun 2020 | Facebook AI | Audio | 768 | Self-supervised speech |
| CLIP ViT-L/14 | Jan 2021 | OpenAI | Image and text | 768 | 400M image-text pairs |
| HuBERT large | Jun 2021 | Facebook AI | Audio | 1024 | Masked prediction with clustered targets |
| EVA | Nov 2022 | BAAI | Image | up to 1408 | Masked image modeling at scale |
| Whisper encoder | Sep 2022 | OpenAI | Audio | 384 to 1280 | Trained for speech recognition, reused for features |
| DINOv2 ViT-L | Apr 2023 | Meta | Image | 1024 | Self-supervised, no text supervision |
| ImageBind | May 2023 | Meta | Six modalities | 1024 | Image, text, audio, depth, thermal, IMU |
| SigLIP | Mar 2023 | Google Research | Image and text | 768 to 1152 | Sigmoid contrastive loss |
| BGE-large-en-v1.5 | Aug 2023 | BAAI | Text | 1024 | Open weights, strong MTEB results |
| text-embedding-3-large | Jan 2024 | OpenAI | Text | up to 3072 | Matryoshka, configurable dimensions |
| NV-Embed-v2 | Aug 2024 | NVIDIA | Text | 4096 | LLM-derived, topped MTEB at release |
| gte-Qwen2-7B-instruct | Jun 2024 | Alibaba | Text | 3584 | Instruction-tuned 7B embedding |
Embedding sizes range from 128 dimensions for compact mobile models to 4096 or more for LLM-derived embedders. A corpus of one billion vectors at 3072 float32 dimensions occupies roughly twelve terabytes before any indexing structures.
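The storage figure follows from simple arithmetic, sketched below with hypothetical helper names. The int8 and 256-dimension variants are added only for comparison and assume one byte per value and four bytes per value respectively; index overhead is ignored throughout.

```python
def raw_index_bytes(num_vectors, dims, bytes_per_value):
    """Bytes needed to store the raw vectors, excluding any index structures."""
    return num_vectors * dims * bytes_per_value

def to_terabytes(num_bytes):
    """Convert bytes to decimal terabytes (10**12 bytes)."""
    return num_bytes / 1e12

one_billion = 1_000_000_000

float32_tb = to_terabytes(raw_index_bytes(one_billion, 3072, 4))   # about 12.3 TB
int8_tb = to_terabytes(raw_index_bytes(one_billion, 3072, 1))      # about 3.1 TB
truncated_tb = to_terabytes(raw_index_bytes(one_billion, 256, 4))  # about 1.0 TB

print(float32_tb, int8_tb, truncated_tb)
```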
Matryoshka Representation Learning, introduced by Aditya Kusupati and colleagues at NeurIPS 2022, trains a single model whose first k dimensions remain useful for any k, so the same vector can be truncated to fit storage budgets. Product quantization compresses vectors into byte codes; binary and int8 quantization trade accuracy for size. Approximate nearest neighbor indexes such as HNSW, IVF, and ScaNN make retrieval over hundreds of millions of vectors practical.
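Two of these compression steps can be sketched on a toy vector: truncating to the leading dimensions with renormalization, and symmetric int8 scalar quantization. The 8-dimensional embedding and the helper names are hypothetical, and truncation only preserves quality when the model was trained with a Matryoshka-style objective; applied to an ordinary embedding it would simply discard information.

```python
import math

def truncate_and_renormalize(vec, k):
    """Keep the first k dimensions and rescale to unit length."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero norm")
    return [x / norm for x in head]

def quantize_int8(vec):
    """Symmetric scalar quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in vec]
    return codes, scale

def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original values."""
    return [c * scale for c in codes]

full = [0.6, -0.3, 0.5, 0.1, -0.4, 0.2, 0.3, -0.1]  # hypothetical 8-dim embedding
short = truncate_and_renormalize(full, 4)
codes, scale = quantize_int8(short)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(short, restored))

print(codes)
print(max_error)
```

The per-value reconstruction error is bounded by half the quantization step, so the accuracy cost depends on how finely 255 levels cover the vector's value range.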
| Use case | How features are used |
|---|---|
| Semantic search | Encode queries and documents, retrieve by nearest neighbor |
| Retrieval augmented generation | Fetch relevant passages with embeddings, condition an LLM on them |
| Recommendation | Embed users and items, score by inner product |
| Clustering and visualization | Group vectors with k-means or HDBSCAN, project with UMAP or t-SNE |
| Near-duplicate detection | Threshold cosine similarity to find paraphrases or repost images |
| Anomaly detection | Flag samples far from any cluster centroid in embedding space |
| Transfer learning | Freeze the extractor and train a small head on a new task |
| Zero-shot classification | Compare image features to text features of candidate labels |
| Image search and reverse image lookup | Index images by ViT or CLIP features |
| Audio fingerprinting | Identify songs or speakers from short audio clips |
| Multimodal retrieval | Search text with images, images with audio, or any cross-modal pair |
| Conditioning generative models | Provide vision tokens to multimodal LLMs and diffusion models |
Text embedding quality is most commonly measured by the Massive Text Embedding Benchmark (MTEB), introduced by Niklas Muennighoff and colleagues in 2022. MTEB spans 8 task families, among them retrieval, reranking, clustering, classification, sentence similarity, and pair classification, across 58 datasets and 112 languages.
For image features, ImageNet linear probing remains a standard signal: freeze the backbone, train a linear classifier on ImageNet labels, and report top-1 accuracy. The VTAB and CLIP zero-shot ImageNet benchmarks extend evaluation to broader transfer settings. Audio extractors are commonly evaluated on LibriSpeech for speech recognition transfer and VoxCeleb for speaker identification. The SUPERB suite covers a range of downstream speech tasks for self-supervised encoders.
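The linear-probing protocol can be sketched at toy scale: treat the backbone outputs as fixed inputs and fit only a linear classifier on top. The 2-dimensional features, the binary labels, and the plain gradient-descent logistic regression below are illustrative assumptions; a real probe would use the full feature width and a multi-class classifier.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression weights by batch gradient descent; features stay fixed."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Return class 1 when the linear score is positive, else class 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

# Hypothetical frozen-backbone outputs for two classes (2-dim for readability).
features = [[2.0, 0.5], [1.8, 0.2], [2.2, 0.4], [0.3, 1.9], [0.1, 2.1], [0.4, 1.7]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_linear_probe(features, labels)
accuracy = sum(predict(weights, bias, x) == y for x, y in zip(features, labels)) / len(labels)
print(accuracy)
```

Because the backbone never changes, the probe's accuracy reflects how linearly separable the classes already are in the extracted features, which is the property the benchmark is meant to measure.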
By mid-2024 the top of the MTEB leaderboard was occupied by LLM-derived embedders such as E5-Mistral, NV-Embed, SFR-Embedding, and gte-Qwen2-7B-instruct. These models adapt a generative LLM into a bidirectional encoder, then fine-tune with contrastive objectives on curated retrieval pairs. NV-Embed-v2 reached 72.31 on the English MTEB in August 2024.
Instruction tuning has become standard, with callers providing a short description of the retrieval intent so a single model can switch between tasks. Multimodal foundation models serve as universal feature extractors: the same ViT or LLaVA-style backbone may feed retrieval pipelines, classification heads, and generative decoders. Open-weight releases such as BGE-multilingual-gemma2 and Qwen3-Embedding continue to narrow the gap with closed commercial APIs.
Feature extractors inherit biases from their pretraining corpora. Embeddings can encode demographic stereotypes and skew toward majority languages, propagating into downstream retrieval and classification. Domain transfer is uneven: a general-purpose text embedder may struggle on clinical notes or legal contracts without adaptation.
Evaluation is fragile. MTEB scores can be optimized by training on data closely related to its benchmarks, and reproducibility suffers when top models do not disclose data or weights. Beyond a few thousand dimensions, marginal gains in retrieval quality shrink while storage and compute grow.
In retrieval augmented generation, embedding mismatches surface as hallucinations downstream: if a retriever returns a tangentially related document, the generator may treat it as authoritative. Production deployments combine dense retrieval with sparse methods such as BM25 and with reranking models. Generating and indexing billions of vectors at high dimension imposes nontrivial cost, motivating ongoing work on quantization and distillation.
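One way to combine a dense ranking with a BM25 ranking is reciprocal rank fusion, sketched below. The article does not name a specific fusion method, so this choice, the constant `k=60`, and the document ids are assumptions made for illustration; weighted score interpolation is another common option.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids; each list contributes 1 / (k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense_ranking = ["doc_b", "doc_a", "doc_c"]   # hypothetical embedding-retriever output
sparse_ranking = ["doc_a", "doc_d", "doc_b"]  # hypothetical BM25 output

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Fusion on ranks rather than raw scores avoids having to calibrate cosine similarities against BM25 scores, which live on unrelated scales. A reranking model would then typically rescore only the top of the fused list.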