MetaCLIP

Updated Jul 16, 2026

MetaCLIP (Metadata-Curated Language-Image Pre-training) is a data curation recipe and a family of vision-language models from Meta AI, introduced in the 2023 paper "Demystifying CLIP Data" by Hu Xu, Saining Xie, and collaborators ^[1]. The project's central argument is that the success of OpenAI's CLIP came primarily from its training data rather than from its model architecture or pre-training objective, and that CLIP's most important contribution, the data, was never disclosed. MetaCLIP reverse-engineers and openly describes a curation algorithm that reproduces a CLIP-style dataset from a raw web pool, then releases both the curation code and the resulting distribution of data over its metadata ^[2]. Trained on this curated data, MetaCLIP matches or exceeds the original CLIP at the same model size and compute budget.

MetaCLIP should be distinguished from CLIP itself. CLIP is OpenAI's 2021 contrastive learning approach that trains an image encoder and a text encoder to align matching image-text pairs in a shared embedding space, which enables zero-shot image classification. MetaCLIP reuses the same training objective and the same vision transformer encoders; what it contributes is an explicit, auditable account of how to build the multimodal training set that CLIP kept secret.
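The shared objective can be illustrated with a small sketch. The code below is a minimal, dependency-free illustration of a symmetric contrastive loss over a batch of paired embeddings, not the training code of either project. The function names are hypothetical, and the temperature is fixed at 0.07 for simplicity, whereas CLIP-style models typically learn it.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix.

    Pair i (image i, text i) is the positive; every other combination
    in the batch acts as a negative. The fixed temperature is an
    assumption made for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: row i should put its mass on column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should put its mass on row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class prompt most similar to the image."""
    img = l2_normalize(image_emb)
    scores = [
        sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in class_text_embs
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because MetaCLIP leaves this objective untouched, any difference in downstream accuracy between the two model families is attributable to the training pairs fed into it.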

Motivation

OpenAI released CLIP's model weights and described the contrastive objective in detail, but said almost nothing about how its roughly 400 million image-text pairs (a private set named WIT) were gathered or filtered, beyond noting that the data was constructed around about 500,000 search queries ^[1]. Later open efforts such as LAION reproduced large image-text datasets by filtering CommonCrawl with a pre-trained CLIP model, which the MetaCLIP authors note is circular: it uses CLIP to recreate CLIP's data and inherits CLIP's biases. MetaCLIP instead tries to recover the curation process directly from the few hints in the original paper, without using any trained model as a filter ^[1]^[2].

Data curation algorithm

MetaCLIP's recipe has two parts: building metadata, and using that metadata to filter and balance a raw pool of image-text pairs.

Metadata

The metadata is a list of about 500,000 text entries (the "queries" or concepts) that define which captions are worth keeping. The authors reconstruct this list from four publicly available sources, matching the count CLIP reported ^[1]^[3]:

Source	Entries	Selection rule
WordNet synsets	86,654	All synsets
Wikipedia unigrams	251,465	Appearing at least 100 times in English Wikipedia
Wikipedia bigrams	100,646	High pointwise mutual information (PMI threshold around 30)
Wikipedia article titles	61,235	Above a pageview frequency threshold (around 70 views)
Total	500,000

Sub-string matching and balancing

Given a raw pool of image-text pairs scraped from the web, MetaCLIP first applies sub-string matching: a caption is kept only if it contains at least one metadata entry as a sub-string, and the matched entries are recorded for each pair. This step removes noisy or off-topic captions without a hand-tuned filter cascade ^[1].
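As a rough illustration of the matching step, the sketch below keeps a pair only when its caption contains at least one metadata entry and records which entries matched. It is a simplified reading of the description above rather than the released implementation: the function names are hypothetical, lowercasing the caption is an assumption, and the linear scan over entries is chosen for clarity (with 500,000 entries a multi-pattern matcher such as Aho-Corasick would be the more usual choice).

```python
def substring_match(caption, metadata):
    """Return the indices of metadata entries found in the caption.

    Assumes `metadata` is already lowercased; the caption is lowercased
    here so that matching is case-insensitive (an assumption of this sketch).
    """
    text = caption.lower()
    return [i for i, entry in enumerate(metadata) if entry in text]

def filter_pool(pairs, metadata):
    """Keep (url, caption) pairs whose caption matches at least one entry.

    Each kept item carries its list of matched entry indices, which the
    later balancing step needs.
    """
    kept = []
    for url, caption in pairs:
        matched = substring_match(caption, metadata)
        if matched:
            kept.append((url, caption, matched))
    return kept

# Toy usage with made-up data.
metadata = ["golden retriever", "photo", "of"]
pool = [
    ("a.jpg", "A photo of a Golden Retriever"),
    ("b.jpg", "IMG_0042.JPG"),
]
curated = filter_pool(pool, metadata)
```

In the toy pool, the camera-default file name matches no entry and is dropped, while the descriptive caption is kept together with the entries it matched.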

The second step is balancing, which is the part the authors emphasize. Web text is dominated by a few extremely common terms, so the matched counts are highly skewed. In one analysis over a 1.6 billion pair pool, the most frequent entries were stop-word-like terms ('of' matched about 120 million pairs, 'in' about 107 million), roughly 114,000 of the 500,000 entries matched nothing at all, and only about 16,000 entries (3.2% of the list) each exceeded 20,000 matches yet together accounted for 94.5% of all matches ^[1]^[3].

Balancing caps the contribution of each entry at a threshold t. Entries below the cap (the "tail") keep all their pairs, while entries above it (the "head") are sub-sampled down to t, so that a pair associated with a rare concept has a higher chance of being selected than one tied only to a common term. For the 400 million pair dataset the authors use t = 20,000 ^[1]^[3]. The effect is a dataset balanced across concepts rather than dominated by a handful of frequent words, which the paper shows improves the diversity and quality of the resulting embeddings.

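A practical feature of the algorithm is that sampling needs only the per-entry match counts: each matched entry keeps a pair with probability t divided by that entry's count (capped at 1), and a pair survives if any of its matched entries keeps it. Once the counts are known, this decision can be made for every pair in a single pass over the data, so curation scales to billions of pairs without an expensive model in the loop ^[2].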
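The balancing step can be sketched in a few lines. The code below is an illustrative reading of the description above, not the released code: the function name, the seeded random generator, and the toy numbers are assumptions. Each entry gets probability t / max(count, t), so tail entries always keep their pairs, and a pair is kept as soon as any of its matched entries samples it. With the figures quoted above, 'of' at about 120 million matches and t = 20,000 would have a per-entry probability of roughly 0.00017, or about 1 in 6,000.

```python
import random

def balance(matched_lists, num_entries, t, seed=0):
    """Sub-sample pairs so that no metadata entry contributes far more than t.

    matched_lists[i] holds the metadata indices matched by pair i.
    Returns the indices of kept pairs and the per-entry probabilities.
    """
    rng = random.Random(seed)  # seeded only to make the sketch reproducible

    # Count how many pairs matched each entry.
    counts = [0] * num_entries
    for matched in matched_lists:
        for entry in matched:
            counts[entry] += 1

    # Tail entries (count <= t) get probability 1; head entries get t / count.
    probs = [t / max(count, t) for count in counts]

    kept = []
    for index, matched in enumerate(matched_lists):
        # Keep the pair if any one of its matched entries samples it.
        if any(rng.random() < probs[entry] for entry in matched):
            kept.append(index)
    return kept, probs

# Toy pool: entry 0 is a very common term, entry 1 is a rare concept.
toy_matches = [[0]] * 1000 + [[0, 1]] * 5
kept, probs = balance(toy_matches, num_entries=2, t=10)
```

In the toy pool, the five pairs that mention the rare entry are always kept, while pairs matching only the common entry survive with probability 10/1000, which is the head-versus-tail behaviour the text describes.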

Results versus CLIP

Applying the recipe to CommonCrawl yields a 400 million pair dataset comparable in size to CLIP's WIT, plus a larger 2.5 billion pair version. Holding the model architecture and training compute fixed, MetaCLIP matches or beats OpenAI's CLIP on zero-shot ImageNet classification at every model scale for which an OpenAI CLIP counterpart is listed ^[1]^[3]:

Model scale	Training data	OpenAI CLIP	MetaCLIP
ViT-B/32	400M	63.4%	65.5%
ViT-B/16	400M	68.3%	70.8%
ViT-L/14	400M	75.5%	76.2%
ViT-L/14	2.5B	N/A	79.2%
ViT-H/14	2.5B	N/A	80.5%
ViT-bigG/14	2.5B	N/A	82.1%

The comparison is controlled: same encoder sizes, same number of training samples seen, and the only change is the data. The improvement at fixed scale (for example 70.8% versus 68.3% at ViT-B/16), and the further gains from scaling the curated pool to 2.5 billion pairs, support the paper's claim that data curation, not architecture, drives CLIP's performance ^[1]. "Demystifying CLIP Data" was accepted as a Spotlight at ICLR 2024 ^[4].

Open release

Unlike CLIP, where only weights were public, MetaCLIP released the full curation pipeline. The GitHub repository facebookresearch/MetaCLIP provides the metadata, the sub-string matching and balancing code, the distribution of the curated data over the metadata, and the trained checkpoints (ViT-B/32, ViT-B/16, ViT-L/14, ViT-H/14, and ViT-bigG/14) ^[2]^[5]. Because the recipe and the data distribution are open, others can regenerate or audit a CLIP-style dataset rather than treating it as a black box. The code is released primarily under a CC-BY-NC license ^[2].

MetaCLIP 2

A follow-up, "Meta CLIP 2: A Worldwide Scaling Recipe" (arXiv, July 2025), extends the curation recipe from English-only web data to worldwide, multilingual data ^[6]^[7]. It is described as the first recipe to train a CLIP model from scratch on worldwide web-scale image-text pairs, and was selected as a NeurIPS 2025 Spotlight ^[2].

MetaCLIP 2 targets two problems. First, the original recipe had no way to curate non-English data. Second, naively adding many languages tends to drag down English performance, an effect the authors call the "curse of multilinguality." The recipe makes three changes: it scales the metadata to more than 300 languages using Wikipedia and multilingual WordNet; it applies a per-language balancing algorithm with a language-specific threshold so that each language keeps a consistent head-to-tail concept ratio; and it scales model capacity and batch size, training a ViT-H/14 on about 29 billion seen image-text pairs (roughly 44% English and 56% non-English) ^[6]^[7].
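One way to picture a language-specific threshold is sketched below. This is only a plausible reading of "a consistent head-to-tail concept ratio", not the paper's algorithm as published: the function names, the definition of the tail share (matches belonging to entries with count at or below t), and the toy counts are all assumptions. The idea is to measure the tail share for English at its threshold, then pick for each other language the threshold that gives the closest tail share.

```python
def tail_proportion(counts, t):
    """Share of all matches that belong to tail entries (count <= t).

    Assumes `counts` contains at least one non-zero match count.
    """
    total = sum(counts)
    return sum(c for c in counts if c <= t) / total

def threshold_for_proportion(counts, p):
    """Pick the entry count whose cumulative match share is closest to p.

    Entries are visited from rarest to most common; the returned value
    is used as that language's threshold t.
    """
    ordered = sorted(counts)
    total = sum(ordered)
    best_t, best_gap, running = ordered[0], float("inf"), 0
    for count in ordered:
        running += count
        gap = abs(running / total - p)
        if gap < best_gap:
            best_t, best_gap = count, gap
    return best_t

# Toy per-entry match counts (made up for illustration).
english_counts = [1, 2, 3, 4, 90]
other_counts = [10, 20, 30, 40, 900]

p = tail_proportion(english_counts, t=5)           # English tail share
t_other = threshold_for_proportion(other_counts, p)  # threshold for the other language
```

Under these assumptions a language with ten times more matches per entry ends up with a proportionally larger threshold, so the balance between common and rare concepts stays comparable across languages.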

The headline finding is that English and non-English data can help each other rather than compete. The worldwide ViT-H/14 reaches 81.3% zero-shot accuracy on ImageNet, above its English-only MetaCLIP counterpart at 80.5% and above the multilingual SigLIP baseline (mSigLIP) at 80.6%, while also setting strong multilingual results such as 50.2% on Babel-ImageNet and 64.3% image-to-text retrieval on XM3600 ^[6]^[7]. In other words, MetaCLIP 2 removes the usual trade-off in which multilingual CLIP models underperform English-only ones on English tasks.

References

[1] Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer. "Demystifying CLIP Data." arXiv:2309.16671. https://arxiv.org/abs/2309.16671
[2] facebookresearch/MetaCLIP, GitHub repository. https://github.com/facebookresearch/MetaCLIP
[3] "Demystifying CLIP Data" (HTML version, metadata and balancing details). https://arxiv.org/html/2309.16671v4
[4] "Demystifying CLIP Data," ICLR 2024 (Spotlight), OpenReview. https://openreview.net/forum?id=5BCFlnfE1g
[5] "Demystifying CLIP Data," ICLR 2024 Proceedings (PDF). https://proceedings.iclr.cc/paper_files/paper/2024/file/d1450d6c10c6b6cf1b80964357f5fa08-Paper-Conference.pdf
[6] Yung-Sung Chuang, et al. "Meta CLIP 2: A Worldwide Scaling Recipe." arXiv:2507.22062. https://arxiv.org/abs/2507.22062
[7] "Meta CLIP 2: A Worldwide Scaling Recipe" (HTML version). https://arxiv.org/html/2507.22062v1
