LAION-5B
Last reviewed
May 1, 2026
Sources
18 citations
Review status
Source-backed
Revision
v1 · 3,710 words
LAION-5B is an open-source dataset of 5.85 billion image and text pairs scraped from the public internet, assembled and released by LAION (Large-scale Artificial Intelligence Open Network), a German non-profit research organization. First made public in March 2022, LAION-5B was the largest freely available image-text dataset in existence at the time and quickly became the foundational training corpus for an entire generation of open-source generative vision models, most famously Stable Diffusion. The dataset was withdrawn from public distribution in late December 2023 after researchers at the Stanford Internet Observatory identified thousands of links to suspected child sexual abuse material (CSAM) in the corpus. A cleaned successor, Re-LAION-5B, was published on August 30, 2024 in collaboration with the Internet Watch Foundation, the Canadian Centre for Child Protection, and Stanford itself.
The project's significance goes well beyond raw scale. Before LAION-5B existed, large web-scale image-text datasets were the private property of a handful of well-funded labs (notably OpenAI's WIT-400M for CLIP and Google's JFT-3B). LAION-5B made comparable scale openly downloadable, which directly enabled academics, hobbyists, and small companies to train competitive multimodal and generative models that had previously been infeasible outside of industrial labs. It also kicked off a still-unresolved set of legal and ethical questions about copyright, consent, privacy, and the safety of indiscriminate web scraping for AI training.
LAION-5B is a dataset of approximately 5.85 billion image-URL and alt-text pairs, filtered using OpenAI's CLIP model so that the textual caption and the image are at least loosely semantically related. Technically, the dataset consists of metadata only: LAION distributes Parquet files containing URLs, captions, similarity scores, and safety tags, but does not redistribute the images. End users download the images directly from the original web hosts using the open-source img2dataset tool.
The corpus is divided into three main subsets by language:
| Subset | Pairs | Description |
|---|---|---|
| LAION-2B-en | 2.32 billion | English-language alt text |
| LAION-2B-multi | 2.26 billion | Non-English text spanning 100+ languages |
| LAION-1B-nolang | 1.27 billion | Pairs whose alt text could not be assigned a language |
| Total | ~5.85 billion | Roughly 240 to 250 TB of images if fully downloaded |
A central technical claim of the project is that this scale, combined with a permissive license on the metadata (CC-BY-4.0), enables open replication of large vision-language models that previously existed only as closed industrial artifacts.
LAION-5B was produced by an unusually large and decentralized author list, reflecting the organization's open-source roots. The accompanying NeurIPS paper, "LAION-5B: An open large-scale dataset for training next generation image-text models," lists the following authors: Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev.
Christoph Schuhmann, a German high-school teacher who became LAION's most public face, founded the organization in early 2021 along with Jenia Jitsev (a researcher at the Jülich Supercomputing Centre), Richard Vencu, Robert Kaczmarczyk, Theo Coombes, Mehdi Cherti, Aarush Katta, and Jan Ebert. The group originally coalesced on a Discord server around the goal of replicating OpenAI's CLIP using fully open data. That replication effort produced LAION-400M in August 2021. LAION-5B was its direct sequel, released in March 2022 and accompanied by a longer technical paper that first appeared on arXiv in October 2022 (arXiv:2210.08402).
The paper won the Outstanding Datasets and Benchmarks Track Award at NeurIPS 2022. The project was funded by Hugging Face, Doodlebot, and Stability AI, with compute provided in part by the Jülich Supercomputing Centre's JUWELS Booster system.
The construction process for LAION-5B closely follows the approach piloted in LAION-400M, scaled up roughly 14-fold and extended to multilingual data.
The raw input was Common Crawl, an open archive of web pages collected continuously since 2008, accessed through its WAT metadata files. LAION processed crawls covering several years (the bulk of the data came from snapshots between 2014 and 2021). For each web page, the pipeline parsed every <img> tag, extracting the image URL and the contents of the alt attribute as a candidate caption.
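The extraction step can be illustrated with a minimal sketch using only the Python standard library. It is not LAION's actual code: it parses raw HTML rather than WAT records, and the class name `ImgAltExtractor`, the helper `extract_pairs`, the example URLs, and the choice to skip images with empty alt text are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgAltExtractor(HTMLParser):
    """Collect (image URL, alt text) candidates from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        alt = (attr.get("alt") or "").strip()
        # Assumption: an image with no usable alt text is not a candidate pair.
        if src and alt:
            # Resolve relative URLs against the page URL.
            self.pairs.append((urljoin(self.base_url, src), alt))


def extract_pairs(html, base_url):
    parser = ImgAltExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.pairs


page = """
<html><body>
  <img src="/cats/tabby.jpg" alt="A tabby cat asleep on a sofa">
  <img src="spacer.gif" alt="">
  <img src="https://cdn.example.org/logo.png">
</body></html>
"""
pairs = extract_pairs(page, "https://example.org/blog/post.html")
print(pairs)
```

Only the first image survives here: the second has an empty alt attribute and the third has none at all.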
This stage produced on the order of 50 billion candidate image-text pairs before filtering.
The team then needed to actually fetch each image to verify it existed and to compute embeddings. This was done with img2dataset, an open-source library written by Romain Beaumont that turns lists of image URLs into compressed WebDataset shards while resizing on the fly. For LAION-5B, img2dataset was extended with a distributed PySpark mode; the full 5.85 billion images were downloaded in roughly one week using ten cloud nodes.
The scale of this scraping was disruptive enough that some popular sites began blocking LAION-affiliated traffic, and the tool was later cited in 2023 reporting on AI scraping load on the open web.
For every downloaded pair, the pipeline computed CLIP embeddings for both the image (via the ViT-B/32 vision encoder) and the alt text (via the text encoder), then measured the cosine similarity between them. Pairs whose similarity fell below a threshold were discarded as too noisy. The threshold differed by language: 0.28 for English (using OpenAI's CLIP ViT-B/32) and 0.26 for the multilingual subset (using mCLIP).
This filtering step is the conceptual heart of LAION-5B. Most alt text on the web is junk: generic file names, marketing boilerplate, or empty strings. Cosine similarity in CLIP embedding space is a cheap, model-based proxy for "does this caption actually describe what is in this image," and it removes the bulk of the noise without requiring human annotation.
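A rough sketch of the thresholding logic follows. The embeddings are tiny hand-written vectors standing in for real CLIP outputs, and the names `THRESHOLDS`, `cosine_similarity`, and `keep_pair` are hypothetical; only the 0.28 and 0.26 cutoffs come from the text above.

```python
import math

# Cutoffs quoted in the text; the dictionary keys are illustrative labels.
THRESHOLDS = {"en": 0.28, "multi": 0.26}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_pair(image_emb, text_emb, subset):
    """Keep a pair only if its similarity reaches the subset's cutoff."""
    return cosine_similarity(image_emb, text_emb) >= THRESHOLDS[subset]


# Toy 3-dimensional stand-ins for CLIP embeddings (real ones are far larger).
image = [1.0, 0.0, 0.0]
matching_caption = [0.6, 0.8, 0.0]      # similarity 0.6
unrelated_caption = [0.1, 0.995, 0.0]   # similarity about 0.1
borderline_caption = [0.27, 0.96, 0.0]  # similarity about 0.27

print(keep_pair(image, matching_caption, "en"))
print(keep_pair(image, unrelated_caption, "en"))
print(keep_pair(image, borderline_caption, "en"),
      keep_pair(image, borderline_caption, "multi"))
```

The borderline caption shows the effect of the per-language cutoffs: the same similarity fails the English threshold but passes the multilingual one.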
Each surviving pair was tagged with several additional metadata fields:
- NSFW score: a classifier assigned a p_unsafe score to each image. The classifier achieved roughly 96% accuracy on internal tests. The score was retained in the metadata so downstream users could filter at their own thresholds; the dataset itself was not pre-filtered for NSFW content.

The complete pipeline ran on a heterogeneous mix of CPU nodes for downloading and GPU nodes for inference, including time on the JUWELS Booster supercomputer in Jülich.
Beyond the three primary language splits, LAION released several derived subsets aimed at specific use cases.
| Subset | Approximate size | Purpose |
|---|---|---|
| LAION-2B-en | 2.32 billion pairs | Primary English subset, used for CLIP reproduction |
| LAION-2B-multi | 2.26 billion pairs | 100+ non-English languages |
| LAION-1B-nolang | 1.27 billion pairs | Unlabeled language |
| LAION-Aesthetics v1 | ~120 million pairs | Filtered by aesthetic predictor (LAION-Aesthetics_Predictor V1) |
| LAION-Aesthetics v2 5+ | ~600 million pairs | Aesthetic score >= 5 (used to fine-tune Stable Diffusion 1.4 and 1.5) |
| LAION-Aesthetics v2 6+ | ~12 million pairs | Aesthetic score >= 6 |
| LAION-Aesthetics v2 6.5+ | ~625k pairs | Highest-quality subset |
| LAION-High-Resolution | ~170 million pairs | Images >= 1024 pixels on the short side |
| LAION-COCO | ~600 million pairs | Synthetic captions generated for LAION-2B-en using BLIP and OpenCLIP |
| LAION-Face | ~50 million pairs | Subset detected to contain at least one face |
The LAION-Aesthetics line is particularly important historically. The aesthetic predictor was a small linear head trained on top of CLIP embeddings to estimate human ratings of "how much do you like this image on a scale from 1 to 10," using the Simulacra Aesthetic Captions corpus and other rating datasets. Filtering LAION-2B-en to images with predicted aesthetic score 5 or higher produced the 600M-pair LAION-Aesthetics v2 5+ subset that the CompVis team used to fine-tune Stable Diffusion 1.4 and 1.5.
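As a sketch of how a linear head over embeddings yields an aesthetic filter, the following assumes made-up weights, a made-up bias, and three-dimensional toy embeddings. The function names and row fields are hypothetical, and the real predictor's parameters are not reproduced here.

```python
def aesthetic_score(embedding, weights, bias):
    """Linear head: weighted sum of embedding components plus a bias."""
    return sum(e * w for e, w in zip(embedding, weights)) + bias


def filter_by_aesthetic(rows, weights, bias, min_score):
    """Keep rows whose predicted score is at least min_score."""
    return [
        row for row in rows
        if aesthetic_score(row["embedding"], weights, bias) >= min_score
    ]


# Hypothetical parameters; a trained head would have one weight per
# embedding dimension.
weights = [2.0, -1.0, 0.5]
bias = 4.5

rows = [
    {"url": "https://example.org/a.jpg", "embedding": [0.5, 0.1, 0.2]},  # about 5.5
    {"url": "https://example.org/b.jpg", "embedding": [0.0, 0.6, 0.0]},  # about 3.9
    {"url": "https://example.org/c.jpg", "embedding": [0.9, 0.0, 0.4]},  # about 6.5
]

score_5_plus = filter_by_aesthetic(rows, weights, bias, min_score=5.0)
score_6_plus = filter_by_aesthetic(rows, weights, bias, min_score=6.0)
print([r["url"] for r in score_5_plus])
print([r["url"] for r in score_6_plus])
```

Raising the cutoff from 5 to 6 shrinks the kept set, mirroring how the 6+ and 6.5+ subsets are much smaller than the 5+ subset.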
LAION-5B was distributed as a set of Parquet metadata files hosted on Hugging Face. The metadata, the embeddings, and the various indices are released under the Creative Commons CC-BY-4.0 license. The actual images themselves are not redistributed; LAION's position is that it merely indexes pointers to publicly accessible web content, and users who choose to download the images via img2dataset are bound by whatever licenses apply to the original sources.
Nearest-neighbor indices over the CLIP embeddings were also made available, enabling fast similarity search across billions of pairs through the clip-retrieval and autofaiss tools. LAION provided a public web interface (clip.rom1504.fr) that allowed anyone to query the dataset by text prompt or by uploaded image and inspect the matched results.
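The idea behind embedding-based retrieval can be sketched with an exhaustive cosine-similarity search. This is only an illustration: it does not use clip-retrieval or autofaiss, an exhaustive scan would not be practical at billions of entries, and the toy index, the vectors, and the `top_k` helper are assumptions.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, index, k=2):
    """Return the k (similarity, caption) entries most similar to the query.

    `index` is a list of (caption, embedding) tuples.
    """
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(emb))), caption)
        for caption, emb in index
    )
    return heapq.nlargest(k, scored)


# Toy index of caption/embedding pairs standing in for CLIP embeddings.
index = [
    ("red sports car", [0.9, 0.1, 0.0]),
    ("blue sedan", [0.7, 0.7, 0.0]),
    ("bowl of fruit", [0.0, 0.1, 0.9]),
]

results = top_k([1.0, 0.0, 0.0], index, k=2)
for score, caption in results:
    print(f"{score:.3f}  {caption}")
```

A text query and an image query work the same way once both are embedded in the shared space, which is what makes search by prompt or by uploaded image possible.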
LAION-5B and its subsets quickly became the default training corpus for open-source vision-language and text-to-image work. The following table lists some of the most influential models that depended on it:
| Model | Year | Developer | LAION subset used |
|---|---|---|---|
| Stable Diffusion 1.1 to 1.4 | 2022 | CompVis (LMU Munich), Stability AI, Runway | LAION-2B-en, LAION-High-Resolution, LAION-Aesthetics v2 5+ |
| Stable Diffusion 1.5 | 2022 | Runway | LAION-Aesthetics v2 5+ (595k fine-tuning steps at 512x512) |
| Stable Diffusion 2.0 / 2.1 | 2022 | Stability AI | LAION-5B subset filtered with the NSFW classifier (p_unsafe < 0.1) |
| OpenCLIP ViT-B/32, ViT-L/14, ViT-H/14, ViT-G/14 | 2022 to 2023 | LAION | LAION-2B-en, LAION-5B |
| BLIP-2 | 2023 | Salesforce | LAION-115M (subset) |
| Kandinsky 2.x | 2022 to 2023 | Sber AI | LAION-5B subsets |
| Karlo | 2022 | Kakao Brain | COYO-700M plus LAION-2B-en |
| Würstchen | 2023 | LAION community | LAION-Aesthetics |
| Many community Stable Diffusion fine-tunes (Anything, Realistic Vision, Dreamshaper, ChilloutMix, etc.) | 2022 to present | Various | Inherits LAION via base SD weights |
DALL-E 2, Imagen, Midjourney v1 to v3, and Parti were trained wholly or largely on proprietary corpora rather than on LAION-5B (Imagen's paper reports mixing internal data with the earlier LAION-400M), but LAION-5B is consistently described as the open analogue of the data those systems used.
LAION-5B is fairly described as the dataset that made the open generative AI ecosystem possible. Stable Diffusion, the model that broke text-to-image generation out of API-only walled gardens in August 2022, would not exist without LAION's work; the CompVis paper credits LAION explicitly for assembling the training data. OpenCLIP's reproductions of CLIP, which now serve as the vision encoder in many open multimodal models, were trained on LAION-2B-en.
The paper also demonstrated something that had been widely assumed but never publicly verified: that web-scraped, CLIP-filtered image-text data at multi-billion scale was sufficient to train competitive vision-language models, and that such training was reproducible outside large industrial labs. It directly inspired follow-on dataset projects including DataComp (Gadre et al., 2023), COYO-700M from Kakao Brain (2022), and OBELICS (HuggingFaceM4, 2023).
By 2024 the original LAION-5B paper had accumulated more than a thousand citations, and the dataset was the de facto reference benchmark for any new method that claimed to operate at "web scale."
On December 20, 2023, the Stanford Internet Observatory (SIO) published a report titled "Identifying and Eliminating CSAM in Generative ML Training Data and Models," authored by David Thiel, the chief technologist at SIO. The report applied PhotoDNA hash matching, NCMEC's hash database, and Microsoft's CSAM detection APIs to a sample of LAION-5B and identified 3,226 image entries suspected to be child sexual abuse material. Of those, 1,008 were externally verified by the Canadian Centre for Child Protection.
Thiel characterized the contamination bluntly: anyone who had downloaded the full, unfiltered LAION-5B would "absolutely have CSAM, unless you took some extraordinary measures to stop it." While 1,008 verified images out of nearly 5.9 billion is mathematically tiny (roughly 0.00002% of the dataset), the legal and ethical implications were not. CSAM possession is a strict-liability criminal offense in most jurisdictions regardless of the percentage involved, and downstream models trained on the unfiltered data, particularly Stable Diffusion 1.x, had been exposed to those samples during training.
LAION removed LAION-5B and LAION-400M from public distribution within a day of being notified by SIO. The episode triggered a broader reckoning across the open generative AI community: model hosts began aggressively filtering uploads of older Stable Diffusion checkpoints that lacked safety classifiers, several model marketplaces took down derivative weights, and discussions about the responsibility of dataset publishers entered mainstream tech press. SIO recommended that older Stable Diffusion 1.5 checkpoints be deprecated; some model hubs complied, others did not.
Reporting in 404 Media also surfaced internal Discord conversations indicating that LAION leadership had been aware of CSAM risks in web-scale scraping since at least 2021 but had proceeded with the project anyway, contributing to the public criticism.
On August 30, 2024, LAION released Re-LAION-5B, a cleaned reissue of LAION-5B that the organization described as "the first web-scale, text-link to image pair dataset to be thoroughly cleaned of known links to suspected CSAM." The cleaning was performed in collaboration with the Internet Watch Foundation (IWF), the Canadian Centre for Child Protection (C3P), and the Stanford Internet Observatory.
In total, 2,236 links were removed after matching against hash lists provided by the partners:
| Source of removal hashes | Links removed |
|---|---|
| Stanford Internet Observatory | 1,714 |
| Canadian Centre for Child Protection (C3P) | 1,129 |
| Internet Watch Foundation (IWF) | 18 |
| Human Rights Watch (privacy concerns) | 399 |
| Total unique link removals | 2,236 |
(Counts overlap because the same image can appear in more than one partner's hash list, so the per-source rows sum to more than the unique total. The 2,236 figure subsumes the 1,008 externally verified samples from the December 2023 SIO report.)
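The per-source rows in the table add up to 3,260, well above the 2,236 unique removals. A toy sketch with invented link identifiers and placeholder partner names shows how overlapping lists produce that gap; none of the values below are real data.

```python
# Hypothetical removal lists; identifiers and partner names are placeholders.
removal_lists = {
    "partner_a": {"link-001", "link-002", "link-003"},
    "partner_b": {"link-002", "link-003", "link-004"},
    "partner_c": {"link-004"},
}

# Summing per-source counts counts shared links more than once.
per_source_total = sum(len(links) for links in removal_lists.values())

# A set union counts each link once, however many lists contain it.
unique_links = set().union(*removal_lists.values())

print(per_source_total, len(unique_links))
```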
Re-LAION-5B is published in two configurations:
- One configuration drops samples with p_unsafe > 0.95, eliminating roughly 1.121% of the original (approximately 65 million samples).
- The research-safe configuration drops samples with p_unsafe > 0.45, removing roughly 3.044% (approximately 176 million samples), and is aimed at users who want to avoid pornographic and other explicitly unsafe content entirely.

The final size of Re-LAION-5B is approximately 5.5 billion pairs (5,526,641,167 to be exact). The release also documents a hash-matching procedure that other dataset maintainers can apply, and LAION committed to maintaining the cleaning pipeline going forward.
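The two p_unsafe cutoffs amount to the same filter applied at different strictness. The sketch below assumes metadata rows shaped as dictionaries with hypothetical `url` and `p_unsafe` fields; the real Parquet column names may differ.

```python
def filter_unsafe(rows, max_p_unsafe):
    """Drop rows whose p_unsafe exceeds the cutoff; keep the rest."""
    return [row for row in rows if row["p_unsafe"] <= max_p_unsafe]


# Hypothetical metadata rows for illustration only.
rows = [
    {"url": "https://example.org/landscape.jpg", "p_unsafe": 0.01},
    {"url": "https://example.org/ambiguous.jpg", "p_unsafe": 0.50},
    {"url": "https://example.org/flagged.jpg", "p_unsafe": 0.97},
]

lenient = filter_unsafe(rows, max_p_unsafe=0.95)  # drops only p_unsafe > 0.95
strict = filter_unsafe(rows, max_p_unsafe=0.45)   # drops p_unsafe > 0.45

print(len(lenient), len(strict))
```

Because the score is kept in the metadata, a downstream user can rerun the same filter at any other cutoff without redownloading anything.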
The CSAM episode is the most acute ethical issue, but it is not the only one. LAION-5B has faced sustained criticism on several other axes:
The corpus contains many copyrighted images: stock photography, news photos, artwork, screenshots from films, and so on. LAION's defense is that the dataset distributes only URLs and metadata, not the images themselves, which positions the project as a research index analogous to a search engine. Plaintiffs in lawsuits against Stable Diffusion's makers have not generally accepted that distinction.
Getty Images sued Stability AI in both the US and the UK in early 2023, alleging that 7.3 million Getty-owned images were used in training Stable Diffusion v1 (and 4.4 million in v2) via LAION-5B. The UK case went to trial in 2025 and resulted in most of Getty's primary copyright claims being dismissed, leaving narrower trademark and "passing off" issues. A class-action by artists Sarah Andersen, Kelly McKernan, and Karla Ortiz against Stability AI, Midjourney, and DeviantArt also relies on LAION-5B as the alleged training source.
In April 2023, German photographer Robert Kneschke sued LAION directly, asking that his photographs be removed from the dataset. In September 2024 a German court dismissed the case, ruling in what was described as a "landmark decision" that LAION's scraping qualified for a research exemption under German copyright law.
Researchers including Birhane et al. have shown that LAION-5B contains personal images, faces of identifiable individuals, leaked medical photographs, and even private documents. Because the dataset records the original URL, removed images can sometimes be re-identified after the fact.
LAION's NSFW classifier was retained as metadata rather than used to pre-filter the dataset, meaning the raw release contained substantial pornographic content. Audits have also documented racial, gender, and cultural biases consistent with what is present on the open web.
The Re-LAION-5B research-safe variant addresses NSFW content by filtering at a lower p_unsafe threshold, but bias remains an open problem.
| Dataset | Year | Pairs | Source | License | Languages |
|---|---|---|---|---|---|
| Conceptual Captions (CC3M) | 2018 | 3.3 million | Web alt text, heavily filtered | Custom | English |
| Conceptual Captions 12M (CC12M) | 2021 | 12 million | Web alt text | Custom | English |
| WIT (Wikipedia Image-Text) | 2021 | 37.6 million | Wikipedia | CC BY-SA 3.0 | 108 |
| LAION-400M | 2021 | 400 million | Common Crawl, CLIP-filtered | CC BY-4.0 metadata | English |
| WIT-400M (OpenAI, internal) | 2021 | 400 million | Web | Closed | English |
| LAION-5B | 2022 | 5.85 billion | Common Crawl, CLIP-filtered | CC BY-4.0 metadata | 100+ |
| COYO-700M | 2022 | 747 million | Common Crawl, CLIP-filtered | CC BY-4.0 metadata | English |
| JFT-3B (Google, internal) | 2021 | 3 billion | Closed Google sources | Closed | n/a |
| DataComp-1B | 2023 | 1.4 billion | CommonPool 12.8B, algorithmically filtered | CC BY-4.0 metadata | English |
| Re-LAION-5B | 2024 | 5.5 billion | Cleaned LAION-5B | CC BY-4.0 metadata | 100+ |
DataComp is worth singling out: its 2023 paper showed that a CLIP ViT-L/14 trained on the smaller, better-filtered DataComp-1B outperformed a CLIP ViT-G/14 trained on LAION-2B for three times as long, demonstrating that data curation can substitute for raw scale. That finding has substantially shaped the post-LAION-5B research conversation.
LAION-5B and Re-LAION-5B have been used for:
- Similarity search and retrieval using clip-retrieval and autofaiss-built nearest-neighbor indices.

Since the Re-LAION-5B release in August 2024, several trends have shaped the conversation around web-scale image-text data:
LAION itself has continued to release smaller specialized datasets (such as LAION-SG with structural scene graph annotations in 2024) and remains active in open multimodal research, though its profile is more cautious than in the 2021 to 2023 period.