Feature Extraction Models

Updated May 31, 2026

Feature extraction models are machine learning systems that transform raw inputs such as text, images, or audio into dense numerical vectors known as embeddings or hidden-state representations. Rather than producing a final task prediction, these models expose intermediate activations that downstream applications reuse for retrieval, similarity comparison, clustering, classification, and conditioning generative models. The Hugging Face task taxonomy lists feature-extraction as a distinct category covering any model whose primary output is the encoded representation of an input rather than a task-specific decision.

The field grew from decades of work on hand-engineered descriptors and traditional dimensionality reduction techniques, then accelerated once deep learning showed that representations learned end to end could outperform manual features across computer vision, natural language processing, and speech. Modern feature extractors are usually large neural networks trained with self-supervised learning or contrastive learning on web-scale corpora.

Definition and role

Feature extraction means running an input through a frozen pre-trained model and reading out the activations of one or more layers as a fixed-length vector. The Hugging Face documentation describes the task as "extracting features learned in a model," producing numerical representations that capture the semantic content of the input. A separate downstream system, often a small classifier, similarity index, or vector database, then consumes these vectors.

This differs from end-to-end task models, which are optimized for a specific objective such as sentiment classification or machine translation. A feature extractor is optimized so its representations generalize across many downstream tasks, letting many applications share the cost of a single forward pass.

Cosine similarity or dot product in embedding space approximates semantic similarity, enabling semantic search, retrieval augmented generation, recommendation systems, and zero-shot classification. The same vectors support visualization with t-SNE or UMAP, clustering, and anomaly detection.
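
As a rough illustration of similarity search in embedding space, the sketch below ranks a toy corpus against a query by cosine similarity. The three-dimensional vectors and document names are made up for the example; a real system would obtain vectors from an encoder and use an approximate nearest neighbor index rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, corpus):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional "embeddings" (hypothetical); real encoders emit hundreds of dimensions.
corpus = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.7, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]
ranking = rank_by_similarity(query, corpus)
```

If the vectors are normalized to unit length beforehand, the dot product alone gives the same ordering, which is why many vector databases store normalized embeddings.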

Historical context

Before deep learning, vision systems relied on hand-designed local descriptors such as the Scale-Invariant Feature Transform (SIFT) introduced by David Lowe in 1999 and Histogram of Oriented Gradients (HOG) introduced by Navneet Dalal and Bill Triggs in 2005. Text systems used bag-of-words, TF-IDF, and Latent Semantic Analysis. These hand-crafted features required domain expertise and rarely transferred across tasks.

The 2012 victory of AlexNet on ImageNet showed that features learned by a convolutional neural network could outperform engineered descriptors. Practitioners discovered that activations from ImageNet-pretrained networks served as strong general features for unrelated vision tasks. Bengio, Courville, and Vincent surveyed this shift in their 2013 review Representation Learning: A Review and New Perspectives, arguing that automatic representation discovery would replace manual feature engineering.^[1]

NLP followed a similar path. Distributed word embeddings from word2vec in 2013 and GloVe in 2014 replaced sparse representations. Both produced static, context-free vectors: a word received the same embedding regardless of its surrounding text. ELMo (2018) from AllenAI introduced contextual representations by running a bidirectional LSTM over the full sentence, and the release of BERT later that year extended the idea to transformer-scale contextual representations reusable across tasks. The historical trajectory reveals a clear progression: sparse statistical representations gave way to dense static vectors, which were then superseded by dynamic context-dependent models.^[2] Self-supervised pretraining became the dominant route to foundation feature extractors.

The representation learning paradigm

Feature extraction models are the applied face of a broader scientific program called representation learning, which asks how to discover the latent structure of data automatically rather than relying on human-designed features. The landmark 2013 review by Bengio, Courville, and Vincent defined the goal as learning representations that expose the explanatory factors of variation behind the data.

The dominant implementation of this program is the frozen backbone plus downstream head pattern. A large model is pretrained on a broad, often unlabeled corpus using a self-supervised objective: masked token prediction for text encoders, masked image patch modeling for vision transformers, contrastive alignment for cross-modal models, or masked waveform prediction for audio encoders. The pretrained backbone captures rich, general representations. A lightweight downstream head, often a single linear layer or a shallow MLP, is then attached and trained on labeled data for the target task, while the backbone parameters remain frozen.

This separation produces several practical benefits. The expensive forward pass is computed once per input during inference, and multiple task heads can share a single backbone without multiplying storage. The frozen backbone also provides a strong inductive bias: training only the head is less likely to overfit on small labeled datasets than fine-tuning the full network. Research by Kumar et al. (2022) found that fine-tuning the entire network can distort pretrained features and underperform on out-of-distribution data compared to linear probing, particularly when the pretrained representations are already high quality.^[19] A two-step strategy of linear probing followed by full fine-tuning (LP-FT) often combines the best of both regimes.

An extension called Head2Toe, introduced by Evci et al. at ICML 2022, relaxes the constraint that only the final layer's output is used. Instead, features from all intermediate layers are concatenated and a sparse linear head is trained over a selected subset of them. The authors report that this matches fine-tuning performance on average, outperforms it on out-of-distribution transfer tasks, and reduces training and storage cost by a hundredfold or more.^[20]

Transfer learning is the broader umbrella under which feature extraction sits. In the strict feature-extraction regime, backbone weights are entirely frozen. In lightweight fine-tuning, only the top few layers are unfrozen. Parameter-efficient fine-tuning methods such as adapters and LoRA (Low-Rank Adaptation) insert small trainable modules into a frozen backbone, achieving near-full fine-tuning accuracy while training only a small fraction of the total parameters, typically a few percent at most. These methods allow the backbone to remain a universal feature extractor while still adapting its representations to a new domain.

Modalities and standard extractors

Text

Text feature extraction is dominated by transformer encoders. BERT, released by Google in October 2018, exposes 768-dimensional hidden states in its base configuration and 1024 in the large; many systems pool these states by averaging or by taking the [CLS] token. RoBERTa (Facebook AI, 2019) retrained BERT on more data with longer schedules.

Generic BERT hidden states perform poorly for direct sentence comparison. Sentence-BERT, introduced by Nils Reimers and Iryna Gurevych at TU Darmstadt's UKP Lab in 2019, fine-tunes BERT with siamese and triplet losses so semantically related sentences land near each other under cosine similarity.^[3] The sentence-transformers library it ships became the default toolkit for off-the-shelf text embeddings. For a deeper treatment of sentence-level text embeddings, see sentence similarity models.

Dedicated embedding APIs followed. OpenAI released text-embedding-3-small and text-embedding-3-large in January 2024, the large variant producing up to 3072-dimensional vectors with variable output dimensions via Matryoshka representation learning.^[17] Cohere shipped Embed v3 in late 2023. The open-source BGE family from BAAI and E5 family from Microsoft Research provide free alternatives that frequently top public benchmarks.

Image

Classical image feature extractors are convolutional networks pre-trained on ImageNet. ResNet-50 and ResNet-152, introduced by Kaiming He and colleagues at Microsoft Research Asia in 2015, produce 2048-dimensional features from their penultimate pooled layer.^[4] The vision transformer (ViT), introduced by Alexey Dosovitskiy and colleagues at Google Research in October 2020, treats an image as a sequence of patches and scales well with data and compute.^[5]

Contrastively trained image-text models reshaped the field. CLIP, released by OpenAI in January 2021, trained a ViT image encoder and a text encoder jointly on 400 million image-caption pairs so matched pairs share a vector space.^[6] CLIP image features support strong zero-shot classification and have become the default backbone for many vision applications. SigLIP, introduced by Xiaohua Zhai and colleagues at Google Research in 2023, replaced the softmax contrastive loss with a pairwise sigmoid loss; the paper reports 84.5% ImageNet zero-shot accuracy for a variant trained with a locked image encoder (SigLiT).^[8]
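
To make the difference between the two objectives concrete, here is a minimal sketch that computes a CLIP-style symmetric softmax loss and a SigLIP-style pairwise sigmoid loss over a small image-text similarity matrix. The temperature, scale, and bias values are illustrative assumptions, not the trained values of either model, and the function names are hypothetical.

```python
import math

def softmax_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over an n x n similarity matrix (matched pairs on the diagonal)."""
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]  # negative log-probability of the matched pair
        return total / n

    columns = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))

def sigmoid_pairwise_loss(sim, scale=10.0, bias=-10.0):
    """Treat every image-text pair as an independent binary decision."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            z = label * (scale * sim[i][j] + bias)
            # -log(sigmoid(z)) computed as softplus(-z) in a numerically stable form
            total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / n

aligned = [[0.9, 0.1], [0.2, 0.8]]   # matched pairs score highest
shuffled = [[0.1, 0.9], [0.8, 0.2]]  # mismatched pairs score highest
```

The softmax version normalizes each row and column across the whole batch, whereas the sigmoid version needs no batch-wide normalization, which is the property the SigLIP authors exploit.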

Purely self-supervised vision encoders matured in parallel. DINOv2, released by Meta in April 2023, trains ViTs with self-distillation on a curated corpus of 142 million images (LVD-142M) and outperforms OpenCLIP on many benchmarks without any text supervision.^[7] EVA from BAAI (2022) scaled masked image modeling to billion-parameter ViTs and powered the EVA-CLIP variants.

Audio

Speech and audio embeddings rely on self-supervised models trained on raw waveforms. wav2vec 2.0, introduced by Alexei Baevski and colleagues at Facebook AI in June 2020, masks portions of a latent speech representation and learns to pick out the true quantized latent for each masked position from a set of distractors; it can be fine-tuned to strong speech recognition with as little as ten minutes of labeled audio.^[9] HuBERT, released by the same group in June 2021, replaces the contrastive objective with masked prediction over clustered targets.^[10]

The encoder of OpenAI's Whisper, released in September 2022, was trained for automatic speech recognition on 680,000 hours of weakly labeled multilingual audio; its mid-network activations serve as general audio features.^[11] Music-specific encoders extend the recipe to musical content. MERT (2023), developed by the M-A-P group, applies masked acoustic modeling with acoustic and musical teacher models to learn representations that generalize across music understanding tasks including beat tracking, pitch detection, genre classification, and melody extraction. MERT-v1-330M outperformed previous music SSL models on most tasks in the MARBLE evaluation suite.

Multimodal

Multimodal extractors output vectors comparable across input types. CLIP and SigLIP align image and text. The visual encoder inside BLIP and BLIP-2 from Salesforce Research, released between 2022 and 2023, exposes representations that feed vision language models and stand alone for retrieval. LLaVA and similar multimodal LLMs use a frozen CLIP-style vision encoder as their feature extractor.

ImageBind, released by Meta in May 2023, binds six modalities (images, text, audio, depth, thermal, and IMU signals) into a single embedding space using only image-paired data.^[12] The shared space supports cross-modal retrieval such as searching audio with images. Because image-paired data is abundant for each modality, ImageBind avoids the need for direct pairings between non-image modalities, an insight that generalizes the contrastive pretraining recipe to arbitrarily many input types.

Methods for extracting features

The choice of extraction strategy affects embedding quality substantially. Common strategies include:

Last hidden state mean pooling: Compute the mean of token or patch embeddings from the final transformer block, masking padding positions. This is the most widely used strategy for modern sentence embedding models because it captures information from all tokens equally.
CLS or special token pooling: Use the activation at a designated classification position. BERT prepends a [CLS] token to every input; after passing through all transformer layers, this token's final hidden state is intended to summarize the sequence. CLS pooling works well for models fine-tuned with a pooled-output objective but can be suboptimal for raw encoders.
Last-token pooling: Use the last token's hidden state, natural for decoder-only (causal) LLMs that are adapted into embedding models, since under a causal mask only the final position has attended to the whole sequence. E5-Mistral and gte-Qwen2-7B-instruct use this strategy; NV-Embed instead pools token states with a learned latent-attention layer.^[18]
Penultimate layer features: For convolutional classifiers, take the global average pooling layer output before the classification head. ResNet and EfficientNet features are typically extracted at this point.
Multi-layer concatenation: Concatenate representations from multiple transformer layers. Research suggests that different layers encode different levels of abstraction, so combining them can produce richer features for tasks that span syntax, semantics, and pragmatics.
Mid-network features: Earlier CNN layers encode texture and local patterns while later layers encode object identity and global structure. Selecting the appropriate depth depends on whether the downstream task requires low-level or high-level features.
Contrastively trained encoders: Models trained with InfoNCE-style or sigmoid losses produce embeddings whose cosine similarity is directly calibrated to be semantically meaningful, unlike raw BERT hidden states.
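
The three token-pooling strategies in the list above can be sketched in a few lines. This is a minimal illustration on plain Python lists, assuming right-side padding and an attention mask of 1s and 0s; the function names are hypothetical and real libraries operate on batched tensors instead.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(hidden_states[0])
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def cls_pool(hidden_states):
    """Take the vector at position 0, where BERT-style models place [CLS]."""
    return list(hidden_states[0])

def last_token_pool(hidden_states, attention_mask):
    """Take the last non-padding position (assumes right-side padding)."""
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return list(hidden_states[last_index])

# One sequence of four positions with 2-dimensional states; the last position is padding.
hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

mean_vec = mean_pool(hidden, mask)
cls_vec = cls_pool(hidden)
last_vec = last_token_pool(hidden, mask)
```

Forgetting the mask is a common source of degraded embeddings: the padding vector `[9.0, 9.0]` would otherwise shift the mean and would be returned outright by a naive last-position lookup.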

For instruction-tuned embedding models, a task description is prepended to the query before encoding: for example, "Represent the question for retrieving relevant documents: {query}." This prompt steering lets a single model behave as a retrieval encoder, a classification encoder, or a clustering encoder without weight changes.
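
A minimal sketch of this prompt steering follows. Only the retrieval template comes from the text above; the classification and clustering templates and the helper name are hypothetical, and each real model documents its own expected wording.

```python
# Task-specific prefixes; only the "retrieval" wording is taken from the article.
TASK_PROMPTS = {
    "retrieval": "Represent the question for retrieving relevant documents: ",
    "classification": "Represent the text for topic classification: ",  # hypothetical
    "clustering": "Represent the text for clustering by theme: ",       # hypothetical
}

def build_encoder_input(task, text):
    """Prepend the task description so one encoder can serve several tasks."""
    return TASK_PROMPTS[task] + text

retrieval_input = build_encoder_input("retrieval", "what is HNSW?")
```

The same weights then produce different embeddings for the same text depending on the prefix, so the prefix used at indexing time and at query time must follow the model's documented convention.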

Notable feature extraction models

Model	Released	Organization	Modality	Typical dim	Notes
ResNet-50	Dec 2015	Microsoft Research Asia	Image	2048	Penultimate pooled features, ImageNet baseline
BERT-base	Oct 2018	Google	Text	768	Hidden states, 110M parameters
Sentence-BERT	Aug 2019	UKP Lab TU Darmstadt	Text	384 to 1024	Siamese fine-tune of BERT or RoBERTa
ViT-B/16	Oct 2020	Google Research	Image	768	Patch transformer pre-trained on JFT-300M
wav2vec 2.0 base	Jun 2020	Facebook AI	Audio	768	Self-supervised speech
CLIP ViT-L/14	Jan 2021	OpenAI	Image and text	768	400M image-text pairs
HuBERT large	Jun 2021	Facebook AI	Audio	1024	Masked prediction with clustered targets
EVA	Nov 2022	BAAI	Image	up to 1408	Masked image modeling at scale
Whisper encoder	Sep 2022	OpenAI	Audio	384 to 1280	Trained for speech recognition, reused for features
DINOv2 ViT-L	Apr 2023	Meta	Image	1024	Self-supervised, no text supervision
ImageBind	May 2023	Meta	Six modalities	1024	Image, text, audio, depth, thermal, IMU
SigLIP	Mar 2023	Google Research	Image and text	768 to 1152	Sigmoid contrastive loss
MERT-v1-330M	Mar 2023	M-A-P	Audio (music)	1024	Self-supervised music understanding
BGE-large-en-v1.5	Aug 2023	BAAI	Text	1024	Open weights, strong MTEB results
text-embedding-3-large	Jan 2024	OpenAI	Text	up to 3072	Matryoshka, configurable dimensions
NV-Embed-v2	Aug 2024	NVIDIA	Text	4096	LLM-derived, topped MTEB at release
gte-Qwen2-7B-instruct	Jun 2024	Alibaba	Text	3584	Instruction-tuned 7B embedding
Qwen3-Embedding-8B	Jun 2025	Alibaba Qwen	Text	4096	Ranked 1st on MTEB Multilingual at release

Embedding spaces and dimensionality

Embedding sizes range from 128 dimensions for compact mobile models to 4096 or more for LLM-derived embedders. A corpus of one billion vectors at 3072 float32 dimensions occupies roughly twelve terabytes before any indexing structures.
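
The storage figure above is simple arithmetic, sketched here so other configurations can be checked the same way. The helper names are hypothetical, and the numbers cover raw vectors only, before any index overhead.

```python
BYTES_PER_VALUE = {"float32": 4, "float16": 2, "int8": 1}

def raw_index_bytes(num_vectors, dim, dtype="float32"):
    """Bytes needed to store the vectors alone, with no index structures."""
    return num_vectors * dim * BYTES_PER_VALUE[dtype]

def binary_index_bytes(num_vectors, dim):
    """One bit per dimension, rounded up to whole bytes per vector."""
    return num_vectors * ((dim + 7) // 8)

one_billion = 1_000_000_000
full = raw_index_bytes(one_billion, 3072)                 # float32
int8_size = raw_index_bytes(one_billion, 3072, "int8")    # 4x smaller
binary_size = binary_index_bytes(one_billion, 3072)       # 32x smaller

full_terabytes = full / 1e12  # decimal terabytes
```

At 3072 float32 dimensions each vector takes 12,288 bytes, so a billion of them come to about 12.3 TB, while int8 and binary codes bring that down to roughly 3.1 TB and 0.38 TB respectively.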

Matryoshka Representation Learning, introduced by Aditya Kusupati and colleagues at NeurIPS 2022, trains a single model whose first k dimensions remain useful for any k, so the same vector can be truncated to fit storage budgets.^[15] Product quantization compresses vectors into byte codes; binary and int8 quantization trade accuracy for size. Approximate nearest neighbor indexes such as HNSW, IVF, and ScaNN make retrieval over hundreds of millions of vectors practical.
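
Truncating a Matryoshka-style embedding amounts to keeping a prefix and renormalizing, as in the sketch below. It assumes the vector came from a model trained with a Matryoshka objective; applying the same truncation to an ordinary embedding generally discards information unevenly. The function name is hypothetical.

```python
import math

def truncate_and_normalize(vec, k):
    """Keep the first k dimensions and rescale to unit length for cosine search."""
    prefix = vec[:k]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in prefix]

full_vector = [3.0, 4.0, 12.0, 1.0]          # toy 4-dimensional embedding
short_vector = truncate_and_normalize(full_vector, 2)
```

Renormalizing matters because the prefix of a unit vector is usually shorter than unit length, which would otherwise distort dot-product scores between truncated vectors.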

Embedding dimensionality has risen alongside model scale, but larger is not always better. Marginal gains in retrieval quality tend to shrink beyond a few thousand dimensions, while storage and indexing cost grow linearly with dimension. The Matryoshka approach eases this tradeoff by letting callers truncate vectors to a smaller effective dimension without re-encoding the corpus.

Transfer learning and the frozen backbone paradigm

The practical workflow for applying a feature extractor to a new task follows a standard pattern. First, a base model is selected based on modality, size, and whether its pretraining distribution resembles the target domain. Second, input examples from the target dataset are passed through the frozen encoder to produce embeddings, a step often called embedding or featurizing the dataset. Third, a lightweight head is trained on the resulting vectors using a task-specific loss.
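
The three steps can be sketched end to end with a toy stand-in for the backbone. Everything here is illustrative: the fixed `tanh` projection is a hypothetical substitute for a real pretrained encoder, and the head is a plain logistic regression trained by gradient descent.

```python
import math

# Hypothetical frozen "backbone": a fixed nonlinear map standing in for a pretrained encoder.
FROZEN_WEIGHTS = ((1.0, 0.5), (-0.5, 1.0), (0.3, -0.8))

def encode(x):
    """Step 2: featurize an input without updating any backbone weight."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in FROZEN_WEIGHTS]

def train_linear_head(features, labels, lr=0.5, epochs=200):
    """Step 3: fit a logistic-regression head on the cached features."""
    dim = len(features[0])
    n = len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(features, labels):
            score = sum(w * v for w, v in zip(weights, f)) + bias
            p = 1.0 / (1.0 + math.exp(-score))
            for d in range(dim):
                grad_w[d] += (p - y) * f[d] / n
            grad_b += (p - y) / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

def predict(x, weights, bias):
    score = sum(w * v for w, v in zip(weights, encode(x))) + bias
    return 1 if score > 0 else 0

inputs = [(-2.0, -1.0), (-1.0, -2.0), (-1.5, -0.5), (2.0, 1.0), (1.0, 2.0), (0.5, 1.5)]
labels = [0, 0, 0, 1, 1, 1]
features = [encode(x) for x in inputs]  # computed once, then reused for every epoch
weights, bias = train_linear_head(features, labels)
accuracy = sum(predict(x, weights, bias) == y for x, y in zip(inputs, labels)) / len(inputs)
```

Because the features are cached, the encoder runs once per example regardless of how many epochs or how many different heads are trained, which is where most of the cost saving comes from.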

Several factors govern when this approach outperforms full fine-tuning. When labeled data is scarce, a frozen backbone with a simple head avoids overfitting: the backbone parameters far outnumber the labeled samples, so gradient updates to the backbone would tend to destroy the general structure of its representations. When the target domain differs from the pretraining domain, linear probing on a strong encoder has been shown to preserve out-of-distribution generalization better than full fine-tuning, because fine-tuning reshapes the feature space toward the narrow fine-tuning distribution.^[19] Conversely, when labeled data is abundant and the target task is closely related to the pretraining objective, full fine-tuning or LP-FT typically wins in-distribution.

Parameter-efficient fine-tuning bridges the two regimes. LoRA inserts trainable low-rank matrices into the attention projections of a frozen transformer, typically adding fewer than one percent of the original parameter count while recovering most of the fine-tuning gains. Adapter modules insert small bottleneck layers at each transformer block. These techniques are standard in the LLM embedding literature: NV-Embed, for example, is trained with LoRA on top of a pretrained LLM, steering its representations toward retrieval while leaving the base weights untouched.^[18]
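
The low-rank update and its parameter cost can be sketched as follows. This is a minimal illustration of the LoRA formulation on plain lists, not any library's API; the matrix sizes, rank, and `alpha` are assumptions chosen for the example.

```python
def lora_param_fraction(d_in, d_out, rank):
    """Trainable LoRA parameters as a fraction of the frozen weight matrix."""
    full = d_in * d_out
    added = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return added / full

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen 2 x 2 projection
A = [[0.5, -0.5]]              # rank-1 down-projection
B_zero = [[0.0], [0.0]]        # B starts at zero, so the update starts as a no-op
B_trained = [[1.0], [2.0]]     # illustrative values after some training
x = [2.0, 1.0]

y_initial = lora_forward(W, A, B_zero, x, alpha=2.0, rank=1)
y_adapted = lora_forward(W, A, B_trained, x, alpha=2.0, rank=1)
fraction = lora_param_fraction(4096, 4096, 16)
```

For a 4096 by 4096 projection at rank 16 the added parameters are 1/128 of the original matrix, about 0.78 percent, consistent with the sub-one-percent figure quoted above.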

Head2Toe (Evci et al., ICML 2022) demonstrated that the final layer is not always the most informative extraction point. By selecting features from all intermediate layers and training a sparse linear head over the concatenated activations, the method matched full fine-tuning accuracy on average and outperformed it on out-of-distribution benchmarks, at a small fraction of the training and storage cost.^[20]

Use cases

Use case	How features are used
Semantic search	Encode queries and documents, retrieve by nearest neighbor
Retrieval augmented generation	Fetch relevant passages with embeddings, condition an LLM on them
Recommendation	Embed users and items, score by inner product
Clustering and visualization	Group vectors with k-means or HDBSCAN, project with UMAP or t-SNE
Near-duplicate detection	Threshold cosine similarity to find paraphrases or repost images
Anomaly detection	Flag samples far from any cluster centroid in embedding space
Transfer learning	Freeze the extractor and train a small head on a new task
Zero-shot classification	Compare image features to text features of candidate labels
Image search and reverse image lookup	Index images by ViT or CLIP features
Audio fingerprinting	Identify songs or speakers from short audio clips
Multimodal retrieval	Search text with images, images with audio, or any cross-modal pair
Conditioning generative models	Provide vision tokens to multimodal LLMs and diffusion models
Domain adaptation	Fine-tune a small adapter on domain-specific data while keeping the backbone frozen
Data augmentation assessment	Measure whether augmented samples remain close in embedding space to originals
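
As one example from the table, the anomaly detection row can be sketched as a distance check against cluster centroids. The centroids, vectors, and threshold below are toy values chosen for illustration; in practice the centroids would come from a clustering step such as k-means and the threshold from validation data.

```python
import math

def nearest_centroid_distance(vec, centroids):
    """Euclidean distance from a vector to its closest centroid."""
    return min(math.dist(vec, c) for c in centroids)

def flag_anomalies(vectors, centroids, threshold):
    """Return ids of samples farther than `threshold` from every centroid."""
    return [
        sample_id
        for sample_id, vec in vectors.items()
        if nearest_centroid_distance(vec, centroids) > threshold
    ]

centroids = [[0.0, 0.0], [10.0, 10.0]]   # hypothetical cluster centers
vectors = {
    "a": [0.5, 0.2],     # near the first centroid
    "b": [9.8, 10.1],    # near the second centroid
    "c": [5.0, 5.0],     # far from both
}
anomalies = flag_anomalies(vectors, centroids, threshold=2.0)
```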

Benchmarks

Text embedding quality is most commonly measured by the Massive Text Embedding Benchmark (MTEB), introduced by Niklas Muennighoff and colleagues in 2022.^[14] MTEB aggregates 8 task families across 58 datasets and 112 languages, including retrieval, reranking, clustering, classification, sentence similarity, and pair classification.

For image features, ImageNet linear probing remains a standard signal: freeze the backbone, train a linear classifier on ImageNet labels, and report top-1 accuracy. The VTAB and CLIP zero-shot ImageNet benchmarks extend evaluation to broader transfer settings. Audio extractors are commonly evaluated on LibriSpeech for speech recognition transfer and VoxCeleb for speaker identification. The SUPERB suite covers a range of downstream speech tasks for self-supervised encoders. Music understanding models are evaluated on the MARBLE (Music Audio Representation Benchmark for Universal Evaluation) suite, which spans ten tasks including key detection, beat tracking, pitch detection, and genre classification.

Modern landscape, 2024 to 2026

The top of MTEB by mid-2024 was occupied by large language model derived embedders such as E5-Mistral, NV-Embed, SFR-Embedding, and gte-Qwen2-7B-instruct. These models adapt a generative LLM into an embedding encoder, in some cases removing the causal attention mask so that tokens attend bidirectionally, then fine-tune with contrastive objectives on curated retrieval pairs.^[18] NV-Embed-v2 reached 72.31 on the English MTEB in August 2024.

Instruction tuning has become standard, with callers providing a short description of the retrieval intent so a single model can switch between tasks. Multimodal foundation models serve as universal feature extractors: the same ViT or LLaVA-style backbone may feed retrieval pipelines, classification heads, and generative decoders. Open-weight releases such as BGE-multilingual-gemma2 and Qwen3-Embedding continue to narrow the gap with closed commercial APIs. Qwen3-Embedding-8B, released in June 2025, ranked first on the MTEB Multilingual leaderboard with a score of 70.58, surpassing proprietary offerings including Gemini-Embedding.

The boundary between feature extraction and generation is blurring. The same backbone that serves as a frozen encoder in a retrieval system may also serve as the generative core of a multimodal chat model; what differs is whether a task head or a generative decoder is attached to it.

Limitations and considerations

Feature extractors inherit biases from their pretraining corpora. Embeddings can encode demographic stereotypes and skew toward majority languages, propagating into downstream retrieval and classification. Domain transfer is uneven: a general-purpose text embedder may struggle on clinical notes or legal contracts without adaptation.

Evaluation is fragile. MTEB scores can be optimized by training on data closely related to its benchmarks, and reproducibility suffers when top models do not disclose data or weights. Beyond a few thousand dimensions, marginal gains in retrieval quality shrink while storage and compute grow.

In retrieval augmented generation, embedding mismatches surface as hallucinations downstream: if a retriever returns a tangentially related document, the generator may treat it as authoritative. Production deployments combine dense retrieval with sparse methods such as BM25 and with reranking models. Generating and indexing billions of vectors at high dimension imposes nontrivial cost, motivating ongoing work on quantization and distillation.
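
One common way to combine a dense ranking with a sparse one is reciprocal rank fusion, sketched below. The article does not name a specific fusion method, so this choice is an assumption made for illustration, as are the document ids and the conventional constant `k=60`.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists by summing 1 / (k + rank) for each document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense_ranking = ["d2", "d1", "d3"]    # e.g. from an embedding index
sparse_ranking = ["d1", "d4", "d2"]   # e.g. from BM25
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Because the method uses only ranks, it needs no calibration between cosine scores and BM25 scores; a reranking model can then be applied to the top of the fused list.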

Feature extraction also has specific failure modes that fine-tuning can avoid. When the target domain differs substantially from pretraining, even a very deep backbone may not encode the right distinctions. Medical imaging, satellite imagery, and niche industrial inspection tasks have each motivated domain-specific pretraining rather than relying on ImageNet-pretrained or web-text-pretrained backbones. The frozen backbone assumption breaks down when the input distribution is far outside the pretraining manifold: the encoder may map all such inputs into a small dense region, collapsing the geometry that downstream tasks rely on.

Finally, the practical engineering of feature extraction at scale introduces its own concerns. Embedding a dataset of billions of documents requires distributed inference infrastructure. Incremental updates, where new documents are added without re-embedding the entire corpus, can cause index drift if the embedding model is updated, requiring careful versioning of both model and index. These operational challenges are as important as raw embedding quality in production deployments.
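
A minimal sketch of the versioning idea is to tag every index with the model version that produced its vectors and to reject writes from any other version. The class and the version strings are hypothetical; production systems usually carry the same tag in index metadata.

```python
from dataclasses import dataclass, field

@dataclass
class VersionedIndex:
    """Toy index that refuses vectors produced by a different encoder version."""
    model_version: str
    vectors: dict = field(default_factory=dict)

    def add(self, doc_id, vector, model_version):
        if model_version != self.model_version:
            raise ValueError(
                f"index built with {self.model_version!r}, got vector from {model_version!r}"
            )
        self.vectors[doc_id] = vector

index = VersionedIndex(model_version="encoder-v1")   # hypothetical version tag
index.add("doc-1", [0.1, 0.2], model_version="encoder-v1")

try:
    index.add("doc-2", [0.3, 0.4], model_version="encoder-v2")
    mixed_versions_rejected = False
except ValueError:
    mixed_versions_rejected = True
```

Rejecting mixed versions forces an explicit re-embedding step when the model changes, instead of silently comparing vectors from incompatible spaces.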

Feature extraction models are closely related to several neighboring topics on this wiki:

Sentence similarity models focus specifically on the subproblem of comparing two text sequences and describe the training objectives (contrastive, pairwise) and evaluation metrics (Spearman correlation on STS benchmarks) in depth.
The embedding article covers the mathematical and conceptual foundations of vector representations.
Self-supervised learning describes the pretraining objectives, including masked language modeling, masked image modeling, contrastive learning, and self-distillation, that produce the general representations feature extractors expose.
Transfer learning covers the broader paradigm of reusing a model trained on one task or domain for another.
BERT and CLIP are the two most widely used backbone architectures and have their own detailed articles.

References

1. Bengio, Y., Courville, A., and Vincent, P. (2013). Representation Learning: A Review and New Perspectives. arXiv:1206.5538. https://arxiv.org/abs/1206.5538
2. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805. https://arxiv.org/abs/1810.04805
3. Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv:1908.10084. https://arxiv.org/abs/1908.10084
4. He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep Residual Learning for Image Recognition. arXiv:1512.03385. https://arxiv.org/abs/1512.03385
5. Dosovitskiy, A. et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929. https://arxiv.org/abs/2010.11929
6. Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). arXiv:2103.00020. https://arxiv.org/abs/2103.00020
7. Oquab, M. et al. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv:2304.07193. https://arxiv.org/abs/2304.07193
8. Zhai, X., Mustafa, B., Kolesnikov, A., and Beyer, L. (2023). Sigmoid Loss for Language Image Pre-Training (SigLIP). arXiv:2303.15343. https://arxiv.org/abs/2303.15343
9. Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv:2006.11477. https://arxiv.org/abs/2006.11477
10. Hsu, W.-N. et al. (2021). HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. arXiv:2106.07447. https://arxiv.org/abs/2106.07447
11. Radford, A. et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision (Whisper). arXiv:2212.04356. https://arxiv.org/abs/2212.04356
12. Girdhar, R. et al. (2023). ImageBind: One Embedding Space To Bind Them All. arXiv:2305.05665. https://arxiv.org/abs/2305.05665
13. Xiao, S., Liu, Z., Zhang, P. et al. (2023). C-Pack: Packed Resources For General Chinese Embeddings (BGE). arXiv:2309.07597. https://arxiv.org/abs/2309.07597
14. Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. (2022). MTEB: Massive Text Embedding Benchmark. arXiv:2210.07316. https://arxiv.org/abs/2210.07316
15. Kusupati, A. et al. (2022). Matryoshka Representation Learning. arXiv:2205.13147. https://arxiv.org/abs/2205.13147
16. Hugging Face. Feature Extraction task page. https://huggingface.co/tasks/feature-extraction
17. OpenAI (January 2024). New embedding models and API updates. https://openai.com/index/new-embedding-models-and-api-updates/
18. Lee, C. et al. (2024). NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. arXiv:2405.17428. https://arxiv.org/abs/2405.17428
19. Kumar, A. et al. (2022). Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution. ICLR 2022. arXiv:2202.10054. https://arxiv.org/abs/2202.10054
20. Evci, U., Dumoulin, V., Larochelle, H., and Mozer, M. C. (2022). Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning. ICML 2022. arXiv:2201.03529. https://arxiv.org/abs/2201.03529
21. Li, Y. et al. (2023). MERT: Acoustic Music Understanding Model with Large-Scale Self-Supervised Training. arXiv:2306.00107. https://arxiv.org/abs/2306.00107
22. Qwen Team (2025). Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv:2506.05176. https://arxiv.org/abs/2506.05176
