LAION-5B


Updated Jun 23, 2026


LAION-5B is an open dataset of approximately 5.85 billion CLIP-filtered image and text pairs scraped from the public internet, released by LAION (Large-scale Artificial Intelligence Open Network) on March 31, 2022. ^[1]^[3] Of the 5.85 billion pairs, 2.32 billion carry English-language alt text, 2.26 billion span 100-plus other languages, and roughly 1.27 billion have text that could not be assigned a language. ^[1]^[3] At release it was 14 times larger than the previous record holder, LAION-400M, and the largest freely available image-text dataset in existence; it became the foundational training corpus for an entire generation of open-source generative vision models, most famously Stable Diffusion. ^[1]^[3] LAION distributes only the metadata (image URLs, captions, and embeddings) under a Creative Commons CC-BY-4.0 license, not the images themselves. ^[3]

The dataset was withdrawn from public distribution on December 20, 2023 after researchers at the Stanford Internet Observatory identified 3,226 entries suspected to be child sexual abuse material (CSAM), 1,008 of which were externally validated. ^[6]^[8] A cleaned successor, Re-LAION-5B, was published on August 30, 2024 in collaboration with the Internet Watch Foundation, the Canadian Centre for Child Protection, and Stanford itself. ^[4]^[9]

The project's significance goes well beyond raw scale. Before LAION-5B existed, large web-scale image-text datasets were the private property of a handful of well-funded labs (notably OpenAI's WIT-400M for CLIP and Google's JFT-3B). LAION-5B made comparable scale openly downloadable, which directly enabled academics, hobbyists, and small companies to train competitive multimodal and generative models that had previously been infeasible outside of industrial labs. LAION states its goal plainly: the dataset exists "to democratize research and experimentation around large-scale multi-modal model training." ^[3] It also kicked off a still-unresolved set of legal and ethical questions about copyright, consent, privacy, and the safety of indiscriminate web scraping for AI training.

What is LAION-5B?

LAION-5B is a dataset of approximately 5.85 billion image-URL and alt-text pairs, filtered using OpenAI's CLIP model so that the textual caption and the image are at least loosely semantically related. ^[1] Technically, the dataset is a collection of metadata: LAION distributes Parquet files containing URLs, captions, similarity scores, and safety tags, but does not redistribute the images themselves. ^[3] End users download the images directly from the original web hosts using the open-source img2dataset tool. ^[10]

The corpus is divided into three main subsets by language:

Subset	Pairs	Description
LAION-2B-en	2.32 billion	English-language alt text
LAION-2B-multi	2.26 billion	Non-English text spanning 100+ languages
LAION-1B-nolang	1.27 billion	Pairs whose alt text could not be assigned a language
Total	~5.85 billion	Roughly 240 to 250 TB of images if fully downloaded
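As a quick arithmetic check on the table above, the three subset counts can be summed and the storage estimate converted into an average image size. This is a minimal sketch; taking 240 TB (decimal terabytes) as a single point value for the storage figure is an assumption made only for illustration.

```python
# Subset sizes in billions of pairs, as listed in the table above.
subsets = {
    "LAION-2B-en": 2.32,
    "LAION-2B-multi": 2.26,
    "LAION-1B-nolang": 1.27,
}

total_billion = round(sum(subsets.values()), 2)

# Assumption: 240 TB (decimal, 10**12 bytes) for the fully downloaded images.
storage_bytes = 240 * 10**12
avg_kb_per_image = storage_bytes / (total_billion * 10**9) / 1000

print(f"total pairs: {total_billion} billion")
print(f"average size: {avg_kb_per_image:.0f} kB per image")
```

Under that assumption the average works out to roughly 41 kB per image, which is consistent with images that have been resized and recompressed rather than stored at original resolution.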

A central technical claim of the project is that this scale, combined with a permissive license on the metadata (CC-BY-4.0), enables open replication of large vision-language models that previously existed only as closed industrial artifacts. ^[1]^[3]

Who built LAION-5B and when was it released?

LAION-5B was produced by an unusually large and decentralized author list, reflecting the organization's open-source roots. The accompanying NeurIPS paper, "LAION-5B: An open large-scale dataset for training next generation image-text models," lists the following authors: Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. ^[1]

Christoph Schuhmann, a German high-school teacher who became LAION's most public face, founded the organization in early 2021 along with Jenia Jitsev (a researcher at the Jülich Supercomputing Centre), Richard Vencu, Robert Kaczmarczyk, Theo Coombes, Mehdi Cherti, Aarush Katta, and Jan Ebert. The group originally coalesced on a Discord server around the goal of replicating OpenAI's CLIP using fully open data. That replication effort produced LAION-400M in August 2021. ^[2] LAION-5B was its direct sequel, released on March 31, 2022 and followed by a longer technical paper that first appeared on arXiv in October 2022 (arXiv:2210.08402). ^[1]^[3]

The paper won the Outstanding Datasets paper award in the Datasets and Benchmarks track at NeurIPS 2022. ^[1]^[19] The project was funded by Hugging Face, Doodlebot, and Stability AI, with compute provided in part by the Jülich Supercomputing Centre's JUWELS Booster system. ^[3]

How was LAION-5B built?

The construction process for LAION-5B closely follows the approach piloted in LAION-400M, scaled up roughly 14-fold and extended to multilingual data. ^[1]^[2]

Step 1: Common Crawl ingestion

The raw input consists of Common Crawl WAT files, the metadata records that accompany Common Crawl, an open archive of HTML pages collected continuously since 2008. LAION processed crawls covering several years (the bulk of the data was sourced from snapshots between 2014 and 2021). For each web page, the pipeline parses every <img> tag, extracting the image URL and the contents of the alt attribute as a candidate caption. ^[1]
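The extraction step can be illustrated with a short Python sketch using only the standard library. It operates on raw HTML rather than on WAT records, and the class name `ImgAltExtractor`, the helper `extract_pairs`, and the choice to skip images with empty alt text are assumptions made for illustration, not a description of LAION's actual code.

```python
from html.parser import HTMLParser


class ImgAltExtractor(HTMLParser):
    """Collect (src, alt) pairs from <img> tags that have non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        # A missing or valueless alt attribute is treated as an empty caption.
        alt = (attr_map.get("alt") or "").strip()
        if src and alt:
            self.pairs.append((src, alt))


def extract_pairs(html):
    parser = ImgAltExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


sample_html = """
<html><body>
  <img src="https://example.com/cat.jpg" alt="A tabby cat on a sofa">
  <img src="https://example.com/spacer.gif" alt="">
  <img src="https://example.com/logo.png">
</body></html>
"""

pairs = extract_pairs(sample_html)
print(pairs)
```

Only the first image yields a candidate pair here; the other two have no usable alt text and are dropped before any download is attempted.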

This stage produced on the order of 50 billion candidate image-text pairs before filtering. ^[1]

Step 2: Distributed image download

The team then needed to actually fetch each image to verify it existed and to compute embeddings. This was done with img2dataset, an open-source library written by Romain Beaumont that turns lists of image URLs into compressed WebDataset shards while resizing on the fly. ^[10] For LAION-5B, img2dataset was extended with a distributed PySpark mode; the full 5.85 billion images were downloaded in roughly one week using ten cloud nodes. ^[10]
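The download figures above imply a sustained throughput that is easy to work out. The sketch below treats "roughly one week" as exactly seven days, which is an assumption; the other numbers are taken from the text.

```python
# Figures quoted above; treating "roughly one week" as exactly 7 days is an assumption.
images = 5.85e9
days = 7
nodes = 10

seconds = days * 24 * 60 * 60
total_rate = images / seconds        # images per second across the cluster
per_node_rate = total_rate / nodes   # images per second per node

print(f"{total_rate:,.0f} images/s overall, {per_node_rate:,.0f} images/s per node")
```

That is on the order of ten thousand images per second in aggregate, or about a thousand per node, which gives a sense of why the resizing and sharding had to happen on the fly.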

The scale of this scraping was disruptive enough that some popular sites began blocking LAION-affiliated traffic, and the tool was later cited in 2023 reporting on AI scraping load on the open web.

Step 3: CLIP-based filtering

For every downloaded pair, the pipeline computed CLIP embeddings for both the image (via the ViT-B/32 vision encoder) and the alt text (via the text encoder), then measured the cosine similarity between them. Pairs whose similarity fell below a threshold were discarded as too noisy. The threshold differed by language: 0.28 for English (using OpenAI's CLIP ViT-B/32) and 0.26 for the multilingual subset (using mCLIP). ^[3]

This filtering step is the conceptual heart of LAION-5B. Most alt text on the web is junk: generic file names, marketing boilerplate, or empty strings. Cosine similarity in CLIP embedding space is a cheap, model-based proxy for "does this caption actually describe what is in this image," and it removes the bulk of the noise without requiring human annotation. ^[1]
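A minimal sketch of the thresholding logic, in plain Python, might look like the following. The 0.28 and 0.26 thresholds come from the text above; the function names, the choice of `>=` at the boundary, and the toy three-dimensional vectors (standing in for real CLIP embeddings) are illustrative assumptions.

```python
import math

# Thresholds quoted in the text above.
THRESHOLDS = {"en": 0.28, "multi": 0.26}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_pair(image_emb, text_emb, subset="en"):
    """Keep a pair only if its similarity reaches the subset's threshold."""
    return cosine_similarity(image_emb, text_emb) >= THRESHOLDS[subset]


# Toy vectors standing in for real CLIP embeddings (hypothetical values).
image_emb = [0.9, 0.1, 0.3]
good_caption_emb = [0.8, 0.2, 0.4]
bad_caption_emb = [-0.2, 0.9, -0.4]

print(keep_pair(image_emb, good_caption_emb))
print(keep_pair(image_emb, bad_caption_emb))
```

The first caption points in nearly the same direction as the image vector and is kept; the second has negative similarity and is discarded.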

Step 4: Auxiliary tagging

Each surviving pair was tagged with several additional metadata fields: ^[1]

Language detection: Google's CLD3 was applied to the alt text and used to assign each pair to one of the three subsets above.
NSFW probability: a CLIP-based binary classifier produced a p_unsafe score for each image. The classifier achieved roughly 96% accuracy on internal tests. The score was retained in the metadata so downstream users could filter at their own thresholds; the dataset itself was not pre-filtered for NSFW content. ^[1]
Watermark probability: another CLIP-based classifier flagged the likely presence of a stock-photo watermark. Roughly 4% to 6% of images in each subset were flagged. ^[1]
Resolution and aspect ratio: stored to support resolution-based filtering.
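Because these fields are stored rather than applied, filtering is left to the downstream user. The sketch below shows one way such a filter could be written; the field names (`p_unsafe`, `p_watermark`, `width`, `height`), the sample rows, and the default thresholds are all hypothetical and may not match the column names in the released Parquet files.

```python
# Hypothetical metadata rows; field names and values are illustrative assumptions.
rows = [
    {"url": "https://example.com/a.jpg", "p_unsafe": 0.02, "p_watermark": 0.10, "width": 1024, "height": 768},
    {"url": "https://example.com/b.jpg", "p_unsafe": 0.97, "p_watermark": 0.05, "width": 800, "height": 600},
    {"url": "https://example.com/c.jpg", "p_unsafe": 0.01, "p_watermark": 0.92, "width": 640, "height": 480},
    {"url": "https://example.com/d.jpg", "p_unsafe": 0.03, "p_watermark": 0.20, "width": 200, "height": 150},
]


def filter_rows(rows, max_unsafe=0.1, max_watermark=0.8, min_side=256):
    """Keep rows that pass all three user-chosen thresholds."""
    kept = []
    for row in rows:
        if row["p_unsafe"] > max_unsafe:
            continue  # likely NSFW
        if row["p_watermark"] > max_watermark:
            continue  # likely watermarked
        if min(row["width"], row["height"]) < min_side:
            continue  # too small on the short side
        kept.append(row)
    return kept


kept_urls = [row["url"] for row in filter_rows(rows)]
print(kept_urls)
```

Each of the three rejected rows fails a different check, so loosening any single threshold would admit exactly one more row.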

The complete pipeline ran on a heterogeneous mix of CPU nodes for downloading and GPU nodes for inference, including time on the JUWELS Booster supercomputer in Jülich. ^[3]

Subsets and derivatives

Beyond the three primary language splits, LAION released several derived subsets aimed at specific use cases.

Subset	Approximate size	Purpose
LAION-2B-en	2.32 billion pairs	Primary English subset, used for CLIP reproduction
LAION-2B-multi	2.26 billion pairs	100+ non-English languages
LAION-1B-nolang	1.27 billion pairs	Unlabeled language
LAION-Aesthetics v1	~120 million pairs	Filtered by aesthetic predictor (LAION-Aesthetics_Predictor V1)
LAION-Aesthetics v2 5+	~600 million pairs	Aesthetic score >= 5 (used to fine-tune Stable Diffusion 1.4 and 1.5)
LAION-Aesthetics v2 6+	~12 million pairs	Aesthetic score >= 6
LAION-Aesthetics v2 6.5+	~625k pairs	Highest-quality subset
LAION-High-Resolution	~170 million pairs	Images >= 1024 pixels on the short side
LAION-COCO	~600 million pairs	Synthetic captions generated for LAION-2B-en using BLIP and OpenCLIP
LAION-Face	~50 million pairs	Subset detected to contain at least one face

The LAION-Aesthetics line is particularly important historically. ^[5] The aesthetic predictor was a small linear head trained on top of CLIP embeddings to estimate human ratings of "how much do you like this image on a scale from 1 to 10," using the Simulacra Aesthetic Captions corpus and other rating datasets. Filtering LAION-2B-en to images with predicted aesthetic score 5 or higher produced the 600M-pair LAION-Aesthetics v2 5+ subset that the CompVis team used to fine-tune Stable Diffusion 1.4 and 1.5. ^[5]^[12]
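Following the description above, a linear head over an embedding is just a weighted sum plus a bias, and the subset is whatever scores at or above the cutoff. The sketch below uses made-up four-dimensional weights and embeddings; the function names and all numeric values are hypothetical and are not the trained predictor.

```python
def aesthetic_score(embedding, weights, bias):
    """Linear head: dot product of the embedding with learned weights, plus a bias."""
    return sum(e * w for e, w in zip(embedding, weights)) + bias


def filter_by_aesthetics(items, weights, bias, min_score=5.0):
    """Return the names of items whose predicted score reaches min_score."""
    return [name for name, emb in items if aesthetic_score(emb, weights, bias) >= min_score]


# Hypothetical weights and embeddings; values are made up for illustration.
weights = [2.0, -1.0, 0.5, 1.5]
bias = 4.0
items = [
    ("sunset.jpg", [0.6, 0.1, 0.4, 0.5]),
    ("blurry_scan.jpg", [0.1, 0.8, 0.2, 0.0]),
]

print(filter_by_aesthetics(items, weights, bias))
```

With these toy values the first image scores a little above 6 and the second about 3.5, so only the first survives a cutoff of 5.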

How is LAION-5B distributed and licensed?

LAION-5B was distributed as a set of Parquet metadata files hosted on Hugging Face. The metadata, the embeddings, and the various indices are released under the Creative Commons CC-BY-4.0 license, which LAION describes as posing "no particular restriction." ^[3] The actual images themselves are not redistributed; LAION's position is that it merely indexes pointers to publicly accessible web content, and users who choose to download the images via img2dataset are bound by whatever licenses apply to the original sources. ^[3]^[10]

Nearest-neighbor indices over the CLIP embeddings were also made available, enabling fast similarity search across billions of pairs through the clip-retrieval and autofaiss tools. LAION provided a public web interface (clip.rom1504.fr) that allowed anyone to query the dataset by text prompt or by uploaded image and inspect the matched results.
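Conceptually, a similarity query ranks stored embeddings by cosine similarity to the query embedding. The sketch below does this by brute force over a three-entry toy index; it is not how clip-retrieval or autofaiss work internally (those rely on approximate nearest-neighbor indices to reach billions of vectors), and the function names and vectors are hypothetical.

```python
import math


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, index, k=2):
    """Return the k keys with the highest cosine similarity to the query."""
    q = normalize(query)
    scored = []
    for key, vec in index.items():
        v = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), key))
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]


# Hypothetical toy index: caption -> embedding.
index = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a puppy": [0.8, 0.3, 0.1],
    "a city skyline at night": [0.0, 0.2, 0.9],
}

print(top_k([1.0, 0.2, 0.0], index))
```

Brute force costs one dot product per stored vector for every query, which is the cost that a prebuilt index is meant to avoid at LAION-5B's scale.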

Which models were trained on LAION-5B?

LAION-5B and its subsets quickly became the default training corpus for open-source vision-language and text-to-image work. The following table lists some of the most influential models that depended on it:

Model	Year	Developer	LAION subset used
Stable Diffusion 1.1 to 1.4	2022	CompVis (LMU Munich), Stability AI, Runway	LAION-2B-en, LAION-High-Resolution, LAION-Aesthetics v2 5+
Stable Diffusion 1.5	2022	Runway	LAION-Aesthetics v2 5+ (595k fine-tuning steps at 512x512)
Stable Diffusion 2.0 / 2.1	2022	Stability AI	LAION-5B subset filtered with NSFW classifier (LAION-5B p_unsafe < 0.1)
OpenCLIP ViT-B/32, ViT-L/14, ViT-H/14, ViT-G/14	2022 to 2023	LAION	LAION-2B-en, LAION-5B
BLIP-2	2023	Salesforce	LAION-115M (subset)
Kandinsky 2.x	2022 to 2023	Sber AI	LAION-5B subsets
Karlo	2022	Kakao Brain	COYO-700M plus LAION-2B-en
Würstchen	2023	LAION community	LAION-Aesthetics
Many community Stable Diffusion fine-tunes (Anything, Realistic Vision, Dreamshaper, ChilloutMix, etc.)	2022 to present	Various	Inherits LAION via base SD weights

DALL-E 2, Imagen, Midjourney v1 to v3, and Parti were trained on closed proprietary corpora rather than LAION, but LAION-5B is consistently described as the open analogue of the data those systems used.

Why does LAION-5B matter?

LAION-5B is fairly described as the dataset that made the open generative AI ecosystem possible. Stable Diffusion, the model that broke text-to-image generation out of API-only walled gardens in August 2022, would not exist without LAION's work; the CompVis paper credits LAION explicitly for assembling the training data. ^[11] OpenCLIP's reproductions of CLIP, which serve as the vision encoder in many open multimodal models, were trained on LAION-2B-en. ^[1]

The paper also demonstrated something that had been widely assumed but never publicly verified: that web-scraped, CLIP-filtered image-text data at multi-billion scale was sufficient to train competitive vision-language models, and that such training was reproducible outside large industrial labs. ^[1] It directly inspired follow-on dataset projects including DataComp (Gadre et al., 2023), COYO-700M from Kakao Brain (2022), and OBELICS (HuggingFaceM4, 2023). ^[13]^[14]

By 2024 the original LAION-5B paper had accumulated more than a thousand citations, and the dataset was the de facto reference benchmark for any new method that claimed to operate at "web scale."

What was the LAION-5B CSAM controversy?

On December 20, 2023, the Stanford Internet Observatory (SIO) published a report titled "Identifying and Eliminating CSAM in Generative ML Training Data and Models," authored by David Thiel, the chief technologist at SIO. ^[6] The report applied PhotoDNA hash matching, NCMEC's hash database, and Microsoft's CSAM detection APIs to a sample of LAION-5B and identified 3,226 image entries suspected to be child sexual abuse material. Of those, 1,008 were externally validated by the Canadian Centre for Child Protection. ^[6]^[8]

Thiel characterized the contamination bluntly. "If you have downloaded that full dataset for whatever purpose, for training a model for research purposes, then yes, you absolutely have CSAM," he told 404 Media, "unless you took some extraordinary measures to stop it." ^[8] While 1,008 verified images out of nearly 5.9 billion is mathematically tiny (roughly 0.00002% of the dataset), the legal and ethical implications were not. CSAM possession is a strict-liability criminal offense in most jurisdictions regardless of the percentage involved, and downstream models trained on the unfiltered data, particularly Stable Diffusion 1.x, had been exposed to those samples during training. ^[6]
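The percentage quoted above can be reproduced directly. This short check uses the 1,008 validated entries and takes 5.85 billion as the dataset size.

```python
validated = 1_008
dataset_size = 5.85e9

fraction = validated / dataset_size
print(f"{fraction * 100:.5f}%")               # share of the dataset as a percentage
print(f"about 1 in {round(1 / fraction):,}")  # the same share expressed as odds
```

The result rounds to 0.00002%, or roughly one entry in 5.8 million.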

LAION removed LAION-5B and LAION-400M from public distribution within a day of being notified by SIO. ^[8] The episode triggered a broader reckoning across the open generative AI community: model hosts began aggressively filtering uploads of older Stable Diffusion checkpoints that lacked safety classifiers, several model marketplaces took down derivative weights, and discussions about the responsibility of dataset publishers entered mainstream tech press. SIO recommended that older Stable Diffusion 1.5 checkpoints be deprecated; some model hubs complied, others did not. ^[6]^[8]

Reporting in 404 Media also surfaced internal Discord conversations indicating that LAION leadership had been aware of CSAM risks in web-scale scraping since at least 2021 but had proceeded with the project anyway, contributing to the public criticism. ^[8]

What is Re-LAION-5B?

On August 30, 2024, LAION released Re-LAION-5B, a cleaned reissue of LAION-5B that the organization described as "the first web-scale, text-link to images pair dataset to be thoroughly cleaned of known links to suspected CSAM." ^[4]^[9] The cleaning was performed in collaboration with the Internet Watch Foundation (IWF), the Canadian Centre for Child Protection (C3P), and the Stanford Internet Observatory. ^[4]

In total, 2,236 links were removed after matching against hash lists provided by the partners: ^[4]

Source of removal hashes	Links removed
Stanford Internet Observatory	1,714
Canadian Centre for Child Protection (C3P)	1,129
Internet Watch Foundation (IWF)	18
Human Rights Watch (privacy concerns)	399
Total unique link removals	2,236

(Counts overlap because the same image can appear in more than one partner's hash list. The 2,236 figure subsumes the 1,008 externally validated samples from the December 2023 SIO report.) ^[4]
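The per-source counts in the table sum to 3,260, more than the 2,236 unique removals, because a link flagged by several partners is counted once per partner. A toy Python example of the same effect, using hypothetical link identifiers rather than any real hash list:

```python
# Hypothetical link identifiers; real hash lists are not public and are not reproduced here.
flagged_by = {
    "SIO": {"link-01", "link-02", "link-03"},
    "C3P": {"link-02", "link-03", "link-04"},
    "IWF": {"link-03"},
}

per_source_total = sum(len(links) for links in flagged_by.values())
unique_links = set().union(*flagged_by.values())

print(per_source_total)    # counts each link once per source that flagged it
print(len(unique_links))   # counts each link once overall
```

Here seven per-source flags collapse to four unique links, which is the set that would actually be removed.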

Re-LAION-5B is published in two configurations: ^[4]

Re-LAION-5B research: removes pairs with p_unsafe > 0.95, eliminating roughly 1.121% of the original (approximately 65 million samples).
Re-LAION-5B research-safe: a more conservative cut at p_unsafe > 0.45, removing roughly 3.044% (approximately 176 million samples) and aimed at users who want to avoid pornographic and other explicitly unsafe content entirely.

The final size of Re-LAION-5B is approximately 5.5 billion pairs (5,526,641,167 to be exact). ^[4] The release also documents a hash-matching procedure that other dataset maintainers can apply, and LAION committed to maintaining the cleaning pipeline going forward. ^[4]

What other ethical concerns surround LAION-5B?

The CSAM episode is the most acute ethical issue, but it is not the only one. LAION-5B has faced sustained criticism on several other axes.

Copyright

The corpus contains many copyrighted images: stock photography, news photos, artwork, screenshots from films, and so on. LAION's defense is that the dataset distributes only URLs and metadata, not the images themselves, which positions the project as a research index analogous to a search engine. ^[3] Plaintiffs in lawsuits against Stable Diffusion's makers have not generally accepted that distinction.

Getty Images sued Stability AI in both the US and the UK in early 2023, alleging that 7.3 million Getty-owned images were used in training Stable Diffusion v1 (and 4.4 million in v2) via LAION-5B. ^[17] The UK case was decided on November 4, 2025: Getty abandoned its primary copyright and database-right claims before closing submissions, and the High Court dismissed Getty's secondary copyright claim, ruling that Stable Diffusion does not store reproductions of the training images and so is not an "infringing copy." The court found only a small number of narrow trademark and "passing off" violations relating to generated Getty watermarks. ^[17] A class action brought by artists Sarah Andersen, Kelly McKernan, and Karla Ortiz against Stability AI, Midjourney, and DeviantArt also relies on LAION-5B as the alleged training source.

In April 2023, German photographer Robert Kneschke sued LAION directly, asking that his photographs be removed from the dataset. On September 27, 2024 the Hamburg Regional Court dismissed the case, ruling that LAION's scraping qualified for the text and data mining for scientific research exception under Section 60d of the German Copyright Act. ^[20] On December 10, 2025 the Higher Regional Court of Hamburg dismissed Kneschke's appeal, confirming that the training-data collection was permitted, though it allowed a further appeal to Germany's Federal Court of Justice. ^[21]

Privacy

Researchers including Birhane et al. have shown that LAION-5B contains personal images, faces of identifiable individuals, leaked medical photographs, and even private documents. Because the dataset records the original URL, removed images can sometimes be re-identified after the fact. The 399 links removed at Human Rights Watch's request in Re-LAION-5B were taken out specifically over such privacy concerns. ^[4]

NSFW content and bias

LAION's NSFW classifier was retained as metadata rather than used to pre-filter the dataset, meaning the raw release contained substantial pornographic content. ^[1] Audits have also documented racial, gender, and cultural biases consistent with what is present on the open web.

The Re-LAION-5B research-safe variant addresses NSFW content by filtering at a lower p_unsafe threshold, but bias remains an open problem. ^[4]

How does LAION-5B compare to other image-text datasets?

Dataset	Year	Pairs	Source	License	Languages
Conceptual Captions (CC3M)	2018	3.3 million	Web alt text, heavily filtered	Custom	English
Conceptual Captions 12M (CC12M)	2021	12 million	Web alt text	Custom	English
WIT (Wikipedia Image-Text)	2021	37.6 million	Wikipedia	CC BY-SA 3.0	108
LAION-400M	2021	400 million	Common Crawl, CLIP-filtered	CC BY-4.0 metadata	English
WIT-400M (OpenAI, internal)	2021	400 million	Web	Closed	English
LAION-5B	2022	5.85 billion	Common Crawl, CLIP-filtered	CC BY-4.0 metadata	100+
COYO-700M	2022	747 million	Common Crawl, CLIP-filtered	CC BY-4.0 metadata	English
JFT-3B (Google, internal)	2021	3 billion	Closed Google sources	Closed	n/a
DataComp-1B	2023	1.4 billion	CommonPool 12.8B, algorithmically filtered	CC BY-4.0 metadata	English
Re-LAION-5B	2024	5.5 billion	Cleaned LAION-5B	CC BY-4.0 metadata	100+

DataComp is worth singling out: its 2023 paper showed that a CLIP ViT-L/14 trained on the smaller, better-filtered DataComp-1B outperformed a larger CLIP ViT-g/14 trained on LAION-2B for about three times as long, demonstrating that data curation can substitute for raw scale. ^[13] That finding has substantially shaped the post-LAION-5B research conversation.

What is LAION-5B used for?

LAION-5B and Re-LAION-5B have been used for:

Training open-source latent and pixel diffusion models, including all of Stable Diffusion 1.x and 2.x and most of their fine-tunes. ^[11]^[12]
Reproducing CLIP at scales from ViT-B/32 to ViT-G/14 in the OpenCLIP project. ^[1]
Pretraining vision-language models such as BLIP-2 and various open VQA systems.
Image search and retrieval research using clip-retrieval and autofaiss-built nearest-neighbor indices.
Training NSFW and watermark classifiers, including classifiers later used to clean other datasets. ^[1]
Auditing studies of bias, privacy, and content safety in web-scraped corpora.
Building synthetic-caption datasets such as LAION-COCO, where BLIP-generated captions replace noisy alt text.

Recent developments (2024 to 2026)

Since the Re-LAION-5B release in August 2024, several trends have shaped the conversation around web-scale image-text data:

Industry shift toward licensed data: Adobe Firefly's licensed-only training stance and Getty Images' own "commercially safe" generator illustrate a counter-movement away from indiscriminate scraping. Stability AI has moved Stable Diffusion 3 and later models toward more curated training mixes.
Continued legal motion: The Andersen et al. class action against Stability AI, Midjourney, and DeviantArt is ongoing as of 2026. The UK High Court's November 2025 dismissal of Getty's secondary-copyright claim set partial precedent on the model side ^[17], while the Hamburg courts' first-instance (September 2024) and appellate (December 2025) rulings for LAION in Kneschke v. LAION provide counter-precedent on the data-collection side. ^[20]^[21]
DataComp continues to evolve, with the DataComp-CommonPool corpus (12.8 billion pairs) and the DataComp-1B filtered subset becoming increasingly common alternatives to LAION-2B for CLIP-style training. ^[13]
Provenance and documentation standards: The Datasheets for Datasets framework, Data Cards, and the C2PA content-provenance standard have all gained traction in part as a response to the LAION-5B controversy.
Hash-list cleaning as new norm: Other dataset maintainers have begun adopting the IWF / C3P hash-matching pipeline that Re-LAION-5B pioneered, with similar efforts planned for COYO-700M and other corpora. ^[4]

LAION itself has continued to release smaller specialized datasets (such as LAION-SG with structural scene graph annotations in 2024) and remains active in open multimodal research, though its profile is more cautious than in the 2021 to 2023 period.

References

1. Schuhmann, Christoph; Beaumont, Romain; Vencu, Richard; Gordon, Cade; Wightman, Ross; Cherti, Mehdi; Coombes, Theo; Katta, Aarush; Mullis, Clayton; Wortsman, Mitchell; Schramowski, Patrick; Kundurthy, Srivatsa; Crowson, Katherine; Schmidt, Ludwig; Kaczmarczyk, Robert; Jitsev, Jenia. "LAION-5B: An open large-scale dataset for training next generation image-text models." NeurIPS 2022 Datasets and Benchmarks Track. arXiv:2210.08402. https://arxiv.org/abs/2210.08402
2. Schuhmann, Christoph; Vencu, Richard; Beaumont, Romain; Kaczmarczyk, Robert; Mullis, Clayton; Katta, Aarush; Coombes, Theo; Jitsev, Jenia; Komatsuzaki, Aran. "LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs." 2021. arXiv:2111.02114.
3. LAION. "LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets." Blog post, March 31, 2022. https://laion.ai/blog/laion-5b/
4. LAION. "Releasing Re-LAION-5B: transparent iteration on LAION-5B with additional safety fixes." Blog post, August 30, 2024. https://laion.ai/blog/relaion-5b/
5. LAION. "LAION-Aesthetics." Blog post. https://laion.ai/blog/laion-aesthetics/
6. Thiel, David. "Identifying and Eliminating CSAM in Generative ML Training Data and Models." Stanford Internet Observatory, December 20, 2023. https://purl.stanford.edu/kh752sm9123
7. Stanford Internet Observatory. "Investigation Finds AI Image Generation Models Trained on Child Abuse." Stanford FSI news release, December 2023.
8. Maiberg, Emanuel and Cole, Samantha. "Largest Dataset Powering AI Images Removed After Discovery of Child Sexual Abuse Material." 404 Media, December 20, 2023. https://www.404media.co/laion-datasets-removed-stanford-csam-child-abuse/
9. Wiggers, Kyle. "The org behind the dataset used to train Stable Diffusion claims it has removed CSAM." TechCrunch, August 30, 2024. https://techcrunch.com/2024/08/30/the-org-behind-the-data-set-used-to-train-stable-diffusion-claims-it-has-removed-csam/
10. Beaumont, Romain. "img2dataset." GitHub repository. https://github.com/rom1504/img2dataset
11. Rombach, Robin; Blattmann, Andreas; Lorenz, Dominik; Esser, Patrick; Ommer, Björn. "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR 2022. (Stable Diffusion paper, training data section.)
12. Stable Diffusion v1-5 model card. Hugging Face. https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5
13. Gadre, Samir Yitzhak et al. "DataComp: In search of the next generation of multimodal datasets." NeurIPS 2023. arXiv:2304.14108.
14. Kakao Brain. "COYO-700M: Image-Text Pair Dataset." GitHub. https://github.com/kakaobrain/coyo-dataset
15. Wikipedia contributors. "LAION." English Wikipedia.
16. Wikipedia contributors. "Stable Diffusion." English Wikipedia.
17. Bird & Bird. "Stability AI defeats Getty Images copyright claims in first of its kind dispute before the High Court." 2025. https://www.twobirds.com/en/insights/2025/uk/stability-ai-defeats-getty-images-copyright-claims-in-first-of-its-kind-dispute-before-the-high-cour
18. PetaPixel. "Major AI Image Dataset is Back Online After Being Pulled Over CSAM." September 3, 2024.
19. NeurIPS. "Announcing the NeurIPS 2022 Awards." NeurIPS Blog, November 21, 2022. https://blog.neurips.cc/2022/11/21/announcing-the-neurips-2022-awards/
20. Bird & Bird. "Long-awaited German judgment by the District Court of Hamburg (Kneschke v. LAION) on the text and data mining exception(s)." 2024. https://www.twobirds.com/en/insights/2024/germany/long-awaited-german-judgment-by-the-district-court-of-hamburg-kneschke-v-laion
21. Bird & Bird. "Higher Regional Court Hamburg Confirms AI Training was Permitted (Kneschke v. LAION)." 2025. https://www.twobirds.com/en/insights/2025/germany/higher-regional-court-hamburg-confirms-ai-training-was-permitted-(kneschke-v,-d-,-laion)
