MetaCLIP
Last reviewed
Jun 3, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,346 words
MetaCLIP (Metadata-Curated Language-Image Pre-training) is a data curation recipe and a family of vision-language models from Meta AI, introduced in the 2023 paper "Demystifying CLIP Data" by Hu Xu, Saining Xie, and collaborators [1]. The project's central argument is that the success of OpenAI's CLIP came primarily from its training data rather than from its model architecture or pre-training objective, and that CLIP's most important contribution, the data, was never disclosed. MetaCLIP reverse-engineers and openly describes a curation algorithm that reproduces a CLIP-style dataset from a raw web pool, then releases both the curation code and the resulting distribution of data over its metadata [2]. Trained on this curated data, MetaCLIP matches or exceeds the original CLIP at the same model size and compute budget.
MetaCLIP should be distinguished from CLIP itself. CLIP is OpenAI's 2021 contrastive learning approach that trains an image encoder and a text encoder to align matching image-text pairs in a shared embedding space, enabling zero-shot image classification. MetaCLIP reuses the same training objective and the same vision transformer encoders; what it contributes is an explicit, auditable account of how to build the multimodal training set that CLIP kept secret.
OpenAI released CLIP's model weights and described the contrastive objective in detail, but said almost nothing about how its roughly 400 million image-text pairs (a private set named WIT) were gathered or filtered, beyond noting that the data was constructed around about 500,000 search queries [1]. Later open efforts such as LAION reproduced large image-text datasets by filtering CommonCrawl with a pre-trained CLIP model, which the MetaCLIP authors note is circular: it uses CLIP to recreate CLIP's data and inherits CLIP's biases. MetaCLIP instead tries to recover the curation process directly from the few hints in the original paper, without using any trained model as a filter [1][2].
MetaCLIP's recipe has two parts: building metadata, and using that metadata to filter and balance a raw pool of image-text pairs.
The metadata is a list of about 500,000 text entries (the "queries" or concepts) that define which captions are worth keeping. The authors reconstruct this list from four publicly available sources, matching the count CLIP reported [1][3]:
| Source | Entries | Selection rule |
|---|---|---|
| WordNet synsets | 86,654 | All synsets |
| Wikipedia unigrams | 251,465 | Appearing at least 100 times in English Wikipedia |
| Wikipedia bigrams | 100,646 | High pointwise mutual information (PMI threshold around 30) |
| Wikipedia article titles | 61,235 | Above a pageview frequency threshold (around 70 views) |
| Total | 500,000 | |
Given a raw pool of image-text pairs scraped from the web, MetaCLIP first applies sub-string matching: a caption is kept only if it contains at least one metadata entry as a sub-string, and the matched entries are recorded for each pair. This step removes noisy or off-topic captions without a hand-tuned filter cascade [1].
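The sub-string matching step can be illustrated with a minimal Python sketch. The function names, the lowercasing, and the brute-force scan over the metadata list are assumptions made for illustration; the released curation code may normalize text differently and would need a faster multi-pattern matcher to handle 500,000 entries at web scale.

```python
def match_entries(caption: str, metadata: list[str]) -> list[int]:
    """Return indices of metadata entries that occur in the caption as sub-strings."""
    text = caption.lower()  # assumption: case-insensitive matching
    return [i for i, entry in enumerate(metadata) if entry in text]


def substring_filter(pairs, metadata):
    """Keep pairs whose caption matches at least one entry; record the matched ids."""
    kept = []
    for url, caption in pairs:
        matched = match_entries(caption, metadata)
        if matched:  # captions with no matching entry are dropped
            kept.append((url, caption, matched))
    return kept


# Toy metadata and pool (hypothetical values, not taken from the paper).
metadata = ["golden retriever", "bridge", "of"]
pairs = [
    ("img1.jpg", "A golden retriever on a bridge"),
    ("img2.jpg", "IMG_0042.JPG"),
    ("img3.jpg", "Photo of the harbour"),
]
curated = substring_filter(pairs, metadata)
```

In this toy pool the filename-like caption matches nothing and is discarded, while the other two pairs are kept together with the indices of the entries they matched. Those recorded indices are what the balancing step consumes.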
The second step is balancing, which is the part the authors emphasize. Web text is dominated by a few extremely common terms, so the matched counts are highly skewed. In one analysis over a 1.6 billion pair pool, the most frequent entries were stop-word-like terms ('of' matched about 120 million pairs, 'in' about 107 million), roughly 114,000 of the 500,000 entries matched nothing at all, and only about 16,000 entries (3.2% of the list) each exceeded 20,000 matches yet together accounted for 94.5% of all matches [1][3]. Balancing caps the contribution of each entry at a threshold t. Entries below the cap (the "tail") keep all their pairs, while entries above it (the "head") are sub-sampled down to t, so that a pair associated with a rare concept has a higher chance of being selected than one tied only to a common term. For the 400 million pair dataset the authors use t = 20,000 [1][3]. The effect is a dataset balanced across concepts rather than dominated by a handful of frequent words, which the paper shows improves the diversity and quality of the resulting embeddings.
A practical feature of the algorithm is that the sampling probability attached to each matched entry, t divided by that entry's match count and capped at 1, can be computed in a single pass over the data, so curation scales to billions of pairs without an expensive model in the loop [2].
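The balancing step can be sketched as follows, assuming each pair already carries the list of metadata entries it matched. The function names, the fixed random seed, and the rule that a pair is kept when any one of its matched entries is sampled are assumptions of this sketch rather than a transcription of the released code.

```python
import random
from collections import Counter


def entry_probabilities(matched_ids_per_pair, num_entries, t):
    """Per-entry sampling probability: 1.0 for tail entries, t / count for head entries."""
    counts = Counter(i for ids in matched_ids_per_pair for i in ids)
    # max(count, t) caps the probability at 1 and avoids division by zero.
    return [t / max(counts.get(i, 0), t) for i in range(num_entries)]


def balance(matched_ids_per_pair, num_entries, t, seed=0):
    """Return the indices of pairs kept after balancing.

    matched_ids_per_pair must be a list (it is traversed twice: once to
    count matches, once to sample).
    """
    rng = random.Random(seed)
    probs = entry_probabilities(matched_ids_per_pair, num_entries, t)
    kept = []
    for idx, ids in enumerate(matched_ids_per_pair):
        # Assumption: a pair survives if any of its matched entries is sampled.
        if any(rng.random() < probs[i] for i in ids):
            kept.append(idx)
    return kept


# Toy pool: entry 0 is a common term (1000 pairs), entry 1 a rare concept (5 pairs).
matched = [[0]] * 1000 + [[1]] * 5
probs = entry_probabilities(matched, num_entries=2, t=10)
kept = balance(matched, num_entries=2, t=10)
```

With t = 10 the common entry gets probability 10 / 1000 = 0.01, so only around ten of its thousand pairs are expected to survive, while every pair tied to the rare entry is kept. At the scale described in the text, an entry such as 'of' with about 120 million matches and t = 20,000 would be sampled with probability of roughly 0.00017.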
Applying the recipe to CommonCrawl yields a 400 million pair dataset comparable in size to CLIP's WIT, plus a larger 2.5 billion pair version. Holding the model architecture and training compute fixed, MetaCLIP matches or beats OpenAI's CLIP on zero-shot ImageNet classification at every model scale tested [1][3]:
| Model scale | Training data | OpenAI CLIP | MetaCLIP |
|---|---|---|---|
| ViT-B/32 | 400M | 63.4% | 65.5% |
| ViT-B/16 | 400M | 68.3% | 70.8% |
| ViT-L/14 | 400M | 75.5% | 76.2% |
| ViT-L/14 | 2.5B | N/A | 79.2% |
| ViT-H/14 | 2.5B | N/A | 80.5% |
| ViT-bigG/14 | 2.5B | N/A | 82.1% |
The comparison is controlled: the encoder sizes and the number of training samples seen are held the same, and only the data changes. The improvement at fixed scale (for example 70.8% versus 68.3% at ViT-B/16), and the further gains from scaling the curated pool to 2.5 billion pairs, support the paper's claim that data curation, not architecture, drives CLIP's performance [1]. "Demystifying CLIP Data" was accepted as a Spotlight at ICLR 2024 [4].
Unlike CLIP, where only weights were public, MetaCLIP released the full curation pipeline. The GitHub repository facebookresearch/MetaCLIP provides the metadata, the sub-string matching and balancing code, the distribution of the curated data over the metadata, and the trained checkpoints (ViT-B/32, ViT-B/16, ViT-L/14, ViT-H/14, and ViT-bigG/14) [2][5]. Because the recipe and the data distribution are open, others can regenerate or audit a CLIP-style dataset rather than treating it as a black box. The code is released primarily under a CC-BY-NC license [2].
A follow-up, "Meta CLIP 2: A Worldwide Scaling Recipe" (arXiv, July 2025), extends the curation recipe from English-only web data to worldwide, multilingual data [6][7]. It is described as the first recipe to train a CLIP model from scratch on worldwide web-scale image-text pairs, and was selected as a NeurIPS 2025 Spotlight [2].
Meta CLIP 2 targets two problems. First, the original recipe had no way to curate non-English data. Second, naively adding many languages tends to drag down English performance, an effect the authors call the "curse of multilinguality." The recipe makes three changes: it scales the metadata to more than 300 languages using Wikipedia and multilingual WordNet; it applies a per-language balancing algorithm with a language-specific threshold so that each language keeps a consistent head-to-tail concept ratio; and it scales model capacity and batch size, training a ViT-H/14 on about 29 billion seen image-text pairs (roughly 44% English and 56% non-English) [6][7].
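One plausible way to read the language-specific threshold is sketched below: measure what share of all matches falls on tail entries for a reference language at its threshold, then pick, for every other language, the smallest threshold that reaches the same share. This interpretation, the function names, and the toy counts are all assumptions made for illustration and are not taken from the paper or its code.

```python
def tail_share(counts, t):
    """Fraction of all matches that belong to tail entries (count below t)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(c for c in counts if c < t) / total


def threshold_for_share(counts, target_share):
    """Smallest threshold whose tail share reaches target_share.

    Quadratic in the number of entries; acceptable for a toy example only.
    """
    for c in sorted(set(counts)):
        t = c + 1  # entries with count <= c become tail at this threshold
        if tail_share(counts, t) >= target_share:
            return t
    return max(counts, default=0) + 1


# Hypothetical per-entry match counts for a large and a small language.
english_counts = [1000, 500, 20, 10, 5]
other_counts = [100, 50, 2, 1, 1]

target = tail_share(english_counts, t=100)          # share set by the reference language
t_other = threshold_for_share(other_counts, target)  # threshold adapted to the smaller pool
```

In this toy setting the smaller language ends up with a much lower threshold than the reference language, which is the intended effect: a single global threshold would leave nearly every entry of a low-resource language in the tail and perform no balancing there.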
The headline finding is that English and non-English data can help each other rather than compete. The worldwide ViT-H/14 reaches 81.3% zero-shot accuracy on ImageNet, above its English-only MetaCLIP counterpart at 80.5% and above the multilingual SigLIP baseline (mSigLIP) at 80.6%, while also posting strong multilingual results, such as 50.2% on Babel-ImageNet and 64.3% image-to-text retrieval on XM3600 [6][7]. In other words, Meta CLIP 2 avoids the usual trade-off in which multilingual CLIP models underperform English-only ones on English tasks.