LAION (Large-scale Artificial Intelligence Open Network) is a German non-profit organization dedicated to making large-scale machine learning resources openly available to the public. Founded in 2021, LAION is best known for creating and releasing some of the largest open image-text datasets in existence, including LAION-400M and LAION-5B. These datasets played a central role in the training of influential text-to-image models such as Stable Diffusion and helped catalyze the open-source generative AI movement. The LAION-5B paper received the Outstanding Paper Award in the Datasets and Benchmarks Track at NeurIPS 2022.
LAION was founded in the summer of 2021 in Hamburg, Germany, by a group of volunteers who connected through a Discord server for AI enthusiasts. The driving force behind the organization was Christoph Schuhmann, a high school physics and computer science teacher in Hamburg who holds a master's degree in physics and computer science from the University of Vienna. Schuhmann was motivated by concerns about the centralization of AI resources after OpenAI released DALL-E in January 2021. He feared that if access to large-scale training data remained exclusive to a handful of large technology companies, it would have negative consequences for society and scientific progress.
Schuhmann recruited collaborators through the Discord community, and the founding team included Jenia Jitsev (scientific lead, from Julich Supercomputing Centre), Richard Vencu (engineering lead), Robert Kaczmarczyk (medical AI lead), Theo Coombes, Mehdi Cherti, Aarush Katta, and Jan Ebert. The organization was formally registered as LAION e.V. (eingetragener Verein, or registered association) under German law, headquartered at Marlowweg 26, 22525 Hamburg. As a non-profit, LAION does not pursue commercial goals and makes all of its outputs available free of charge.
Despite the scale of its impact on the AI industry, LAION was built on a remarkably small budget. The team covered its server fees through a combination of crowdfunding, a donation from Hugging Face in 2021, and a donation of roughly $9,000 to $10,000 from Stability AI founder Emad Mostaque. Stability AI later provided additional support, primarily in the form of access to idle GPU compute on its cluster of 4,000 to 5,600 GPUs. Sponsors including Hugging Face, Doodlebot, and Stability AI contributed computing resources that helped produce the LAION-5B dataset. Schuhmann has publicly stated that he has declined job offers from companies in order to keep LAION independent, continuing to work as a teacher while leading the organization.
Christoph Schuhmann serves as the organizational lead and co-founder of LAION. Born in 1982, he studied computer science and physics at the University of Vienna and also spent six years studying acting. After graduating, he worked as an IT administrator and teacher in Hamburg, where he also directed films for children. Before founding LAION, Schuhmann produced "Schools of Trust," a crowdfunded documentary about alternative schools where students learn without grades and fixed curricula. This experience with grassroots organizing and non-profit management informed how he later structured LAION.
Schuhmann's interest in deep learning grew during years of self-study through online courses. When DALL-E was published in early 2021, Schuhmann began experimenting with AI on Google Colab and connected with like-minded researchers on Discord. Within a few weeks, the group had assembled 3 million image-text pairs; after three months they released the 400-million-pair LAION-400M dataset. As of 2023, Schuhmann was still teaching physics and computer science to high school students in Germany while running LAION as a volunteer effort. His work attracted wide media attention, including a Bloomberg feature in April 2023 that highlighted how a high school teacher's free database had come to power some of the most prominent AI systems in the world.
LAION's datasets were constructed using a pipeline that combined web scraping with CLIP-based filtering. The acquisition pipeline can be divided into three major stages: distributed processing of petabyte-scale Common Crawl data to extract candidate URL-text pairs, distributed downloading of images from the shuffled URL lists, and post-processing on GPU nodes.
Source data collection: The starting material came from Common Crawl, a publicly available archive of web page data. Worker machines parsed Common Crawl WAT (Web ARChive Transform) files to extract all HTML image tags that contained alt-text attributes, producing raw URL and text pairs. Language detection was performed on the alt-text to classify pairs as English, another detected language, or "no language" (where detection confidence was below threshold).
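The alt-text extraction step can be sketched with Python's standard-library HTML parser. This is a minimal illustration only: the real workers parse Common Crawl WAT metadata records at scale rather than raw HTML, and the class name here is ours.

```python
from html.parser import HTMLParser

class ImgAltExtractor(HTMLParser):
    """Collect (src, alt) pairs from <img> tags that carry non-empty alt text,
    mirroring the first pipeline stage. Illustrative sketch, not LAION's code."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            # Keep only images that have both a URL and non-empty alt-text.
            if a.get("src") and a.get("alt"):
                self.pairs.append((a["src"], a["alt"]))

    def handle_startendtag(self, tag, attrs):
        # Treat self-closing <img .../> the same as <img ...>.
        self.handle_starttag(tag, attrs)
```

Feeding a page through the extractor yields the raw URL-text candidates that the later filtering stages then prune.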
Image downloading: A distributed cluster of roughly 300 worker machines downloaded images from the extracted URLs in parallel. Each worker pulled batches of 10,000 links from a PostgreSQL server. Because many URLs point to images that have been removed or relocated, a significant fraction of downloads fail at any given time. Samples with fewer than 5 characters of alt-text or an image smaller than 5 KB were dropped, and duplicates were removed using a bloom filter keyed on URL.
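The per-sample drop rules and URL-based deduplication can be sketched as follows. The bloom filter and function names are illustrative, not LAION's production code; a bloom filter trades a small false-positive rate (occasionally dropping a fresh URL) for constant memory at billion-URL scale.

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter for URL dedup (illustrative; sized far below production needs)."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

def keep_sample(alt_text, image_size_bytes, url, seen):
    """Apply the drop rules described above: short alt-text, tiny images, duplicate URLs."""
    if len(alt_text) < 5 or image_size_bytes < 5 * 1024:
        return False
    if url in seen:  # probabilistic check: may very rarely reject a fresh URL
        return False
    seen.add(url)
    return True
```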
CLIP embedding and filtering: Each downloaded image and its associated alt-text were passed through a CLIP model (OpenAI's ViT-B/32 for English content, or multilingual CLIP for non-English content). The model computed embeddings for both the image and the text, and a cosine similarity score was calculated between the two. Image-text pairs with similarity scores below a set threshold were discarded. For LAION-400M, the threshold was 0.3. For the English subset of LAION-5B, it was 0.28, and for the multilingual subset, 0.26.
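The similarity test at the heart of the pipeline reduces to a cosine comparison against a per-subset threshold. In a sketch like the following, the embeddings would come from a CLIP image and text encoder; here they are plain vectors, and the function names are ours.

```python
import math

# Cutoffs reported for the LAION releases: 0.30 for LAION-400M,
# 0.28 for the English subset of LAION-5B, 0.26 for the multilingual subset.
THRESHOLDS = {"laion400m": 0.30, "en": 0.28, "multi": 0.26}

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def keep_pair(image_emb, text_emb, subset="en"):
    """Keep an image-text pair only if its CLIP similarity clears the subset threshold."""
    return cosine_similarity(image_emb, text_emb) >= THRESHOLDS[subset]
```

A well-aligned pair (embeddings pointing in nearly the same direction) passes, while an image whose alt-text describes something unrelated scores near zero and is discarded.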
Additional metadata: Each retained pair was annotated with metadata including image dimensions, a watermark detection probability score, an NSFW (not safe for work) classification score, and the original CLIP embeddings. Nearest-neighbor indices (using FAISS) were also provided to enable efficient similarity search across the dataset.
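What the FAISS indices enable is nearest-neighbor search over the precomputed embeddings. A brute-force version of that query is easy to sketch; FAISS performs the same inner-product search but with approximate index structures that scale to billions of vectors.

```python
def knn_search(query, embeddings, k=3):
    """Brute-force k-nearest-neighbor search over unit-normalized embeddings.
    Illustrative stand-in for a FAISS inner-product index query."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Rank all stored embeddings by similarity to the query, highest first.
    ranked = sorted(range(len(embeddings)), key=lambda i: -dot(query, embeddings[i]))
    return ranked[:k]
```

This is the operation behind tools such as the clip-retrieval demo, which let users find dataset samples similar to an arbitrary query image or caption.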
This filtering process was aggressive. For LAION-5B, roughly 90% of the candidate samples were removed, trimming more than 50 billion raw candidates down to approximately 5.85 billion retained pairs.
It is important to note that LAION datasets do not contain the images themselves. They consist of URLs pointing to images hosted elsewhere on the web, paired with alt-text and metadata. This distinction became legally significant in later copyright proceedings.
The following table summarizes LAION's major dataset releases:
| Dataset | Size | Release Date | Description |
|---|---|---|---|
| LAION-400M | 413 million pairs | August 2021 | English-language image-text pairs filtered with CLIP ViT-B/32 at threshold 0.3 |
| LAION-5B | 5.85 billion pairs | March 2022 | Multilingual image-text pairs across three subsets, filtered with CLIP and multilingual CLIP |
| LAION-Aesthetics v2 | Up to 1.2 billion pairs | August 2022 | Subsets of LAION-5B scored for visual quality by an aesthetics predictor model |
| LAION-High-Resolution | 170 million pairs | 2022 | Subset of LAION-5B with images at 1024x1024 resolution or higher |
| LAION-COCO | 600 million pairs | October 2022 | LAION-2B-en subset with synthetic captions generated by BLIP and selected by CLIP |
| Re-LAION-5B | ~5.5 billion pairs | August 2024 | LAION-5B cleaned of known links to CSAM, released in two variants |
LAION-400M was the organization's first major release, published on August 20, 2021. It contained 413 million English-language image-text pairs filtered from Common Crawl data using OpenAI's CLIP ViT-B/32 model with a cosine similarity threshold of 0.3. The dataset also included precomputed CLIP embeddings and kNN indices for similarity search.
The accompanying paper, "LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs," was authored by Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki, and published as an arXiv preprint (arXiv:2111.02114) in November 2021.
At the time of its release, LAION-400M was the largest openly accessible image-text dataset in the world. It was created to replicate the scale of OpenAI's internal WebImageText (WIT) dataset, which contained 400 million pairs used to train CLIP but was never made public.
LAION-5B was released on March 31, 2022, and represented a massive expansion of the earlier dataset. It contained 5.85 billion CLIP-filtered image-text pairs, making it approximately 14 times larger than LAION-400M. The dataset was organized into three subsets:
| Subset | Size | Description |
|---|---|---|
| LAION-2B-en | 2.32 billion pairs | English-language image-text pairs, filtered with CLIP ViT-B/32 at threshold 0.28 |
| LAION-2B-multi | 2.26 billion pairs | Image-text pairs in over 100 non-English languages, filtered with multilingual CLIP at threshold 0.26 |
| LAION-1B-nolang | 1.27 billion pairs | Pairs where language could not be reliably detected, often containing product images or location photos with captions mixing natural language and SEO keywords |
The most frequently represented non-English languages in the multilingual subset were Russian (10.6%), French (7.4%), German (6.6%), Spanish (6.6%), and Chinese (6.3%).
The LAION-5B paper, "LAION-5B: An open large-scale dataset for training next generation image-text models," was presented at NeurIPS 2022 in the Datasets and Benchmarks Track and received the Outstanding Paper Award. The paper's authors included Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. It was published in Advances in Neural Information Processing Systems, volume 35, pages 25278 to 25294.
LAION-Aesthetics is a collection of subsets of LAION-5B filtered for high visual quality using an aesthetics prediction model. The predictor was a linear model trained on image-rating pairs from the Simulacra Aesthetic Captions (SAC) dataset, using CLIP image embeddings from the OpenAI CLIP ViT-L/14 model. The model was trained to predict the score a human would give when asked "How much do you like this image on a scale from 1 to 10?"
The key LAION-Aesthetics subsets include:
| Subset | Size | Minimum Predicted Score |
|---|---|---|
| LAION-Aesthetics v2 5+ | ~600 million pairs | 5.0 |
| LAION-Aesthetics v2 6+ | ~12 million pairs | 6.0 |
| LAION-Aesthetics v2 6.25+ | ~3 million pairs | 6.25 |
| LAION-Aesthetics v2 6.5+ | ~625,000 pairs | 6.5 |
The LAION-Aesthetics v2 5+ subset, which contained approximately 600 million pairs from LAION-2B-en with predicted aesthetics scores of 5.0 or higher, became especially significant because it served as the primary training dataset for Stable Diffusion v1.
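Because the aesthetics predictor is a linear model over CLIP image embeddings, the scoring and filtering steps amount to a dot product and a cutoff. The weights below are placeholders, not the published predictor's parameters; the real model was fit to SAC human ratings.

```python
def predict_aesthetic_score(embedding, weights, bias=0.0):
    """Linear head over a CLIP image embedding, predicting a 1-10 aesthetics rating.
    Weights/bias are illustrative placeholders."""
    return sum(w * x for w, x in zip(weights, embedding)) + bias

def filter_by_aesthetics(embeddings, weights, bias, min_score):
    """Keep the indices of samples whose predicted score clears the subset cutoff
    (e.g. 5.0 for LAION-Aesthetics v2 5+, 6.5 for v2 6.5+)."""
    return [i for i, e in enumerate(embeddings)
            if predict_aesthetic_score(e, weights, bias) >= min_score]
```

Since LAION-5B ships with CLIP embeddings precomputed, building each aesthetics subset required only this cheap linear pass rather than re-encoding billions of images.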
LAION-High-Resolution is a subset of LAION-5B containing approximately 170 million image-text pairs where the images have a resolution of at least 1024x1024 pixels. This subset is particularly useful for training super-resolution models and high-resolution image generation systems.
LAION-COCO is a dataset of 600 million image-text pairs drawn from LAION-2B-en, where the original web-scraped alt-text captions have been replaced with synthetic captions generated by AI models. The captioning pipeline used an ensemble approach: BLIP L/14 generated 40 candidate captions per image, OpenAI's CLIP ViT-L/14 selected the five best candidates, and then CLIP RN50x64 chose the single best caption from those five. The hyperparameters for the generation process were tuned to match the writing style of MS-COCO captions, as measured by ROUGE scores.
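The two-stage reranking can be sketched generically, with the scoring models injected as functions. In the actual pipeline the first scorer would be CLIP ViT-L/14 similarity and the second CLIP RN50x64 similarity; here both are placeholders.

```python
def select_caption(image_emb, candidates, stage1_score, stage2_score):
    """Two-stage caption reranking (sketch): stage 1 keeps the 5 best of the
    BLIP-generated candidates, stage 2 picks the single best of those five.
    stage1_score / stage2_score stand in for two different CLIP models."""
    top5 = sorted(candidates,
                  key=lambda c: stage1_score(image_emb, c),
                  reverse=True)[:5]
    return max(top5, key=lambda c: stage2_score(image_emb, c))
```

Using two different CLIP variants as judges reduces the risk that a caption is kept merely because it exploits the quirks of a single scoring model.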
Human evaluation showed that in 47.5% of cases, evaluators believed the AI-generated captions had been written by humans. LAION-COCO was released in October 2022 and is available on Hugging Face.
LAION's datasets were foundational to the development of Stable Diffusion, one of the most influential open-source text-to-image generation models. The relationship between LAION and Stable Diffusion involved several layers:
Stable Diffusion was developed by the CompVis group at Ludwig Maximilian University of Munich, led by Robin Rombach and Patrick Esser, with computational resources provided by Stability AI. The model was trained on subsets of LAION-5B in a multi-stage process:
Initial training was performed on LAION-2B-en and LAION-High-Resolution, giving the model broad exposure to a diverse range of image-text relationships.
Fine-tuning was done on LAION-Aesthetics v2 5+, the 600 million pair subset with high aesthetic quality scores. Low-resolution images and images flagged as likely watermarked (with greater than 80% probability by LAION-5B-WatermarkDetection) were excluded.
For the Stable Diffusion v1.5 checkpoint specifically, training involved 595,000 steps at 512x512 resolution on LAION-Aesthetics v2 5+, with 10% text conditioning dropout to improve classifier-free guidance sampling.
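The 10% text-conditioning dropout and the classifier-free guidance it enables can be sketched as follows. This is a minimal illustration using lists in place of tensors; the real implementation operates inside the diffusion training and sampling loops.

```python
import random

def maybe_drop_text(text_embedding, null_embedding, p_drop=0.10, rng=random):
    """Training-time trick: with probability p_drop, replace the caption embedding
    with the unconditional (empty-prompt) embedding, so the model also learns
    to denoise without text conditioning."""
    return null_embedding if rng.random() < p_drop else text_embedding

def cfg_sample(eps_uncond, eps_cond, guidance_scale):
    """Sampling-time classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the conditional one,
    eps = eps_uncond + s * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

A guidance scale above 1 pushes samples toward the prompt at the cost of diversity, which is why the dropout during training matters: it gives the sampler a meaningful unconditional prediction to extrapolate from.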
Stability AI also provided financial support to LAION, making the relationship symbiotic: LAION produced the open datasets, and Stability AI contributed compute resources that both funded LAION's operations and enabled the training of models on LAION's data. Stability AI credited both LAION and EleutherAI as key supporters of the Stable Diffusion project.
Stable Diffusion v2 was later trained on a filtered version of LAION-5B that removed images with NSFW content, using an improved OpenCLIP model rather than OpenAI's original CLIP.
LAION's work is deeply intertwined with OpenAI's CLIP (Contrastive Language-Image Pre-training) model. When OpenAI released CLIP in January 2021, the model weights and inference code were made publicly available, but OpenAI did not release the training dataset (WebImageText, or WIT, consisting of 400 million image-text pairs collected from the internet). This asymmetry motivated the creation of LAION: the goal was to build an open dataset large enough to replicate CLIP-scale training from scratch.
OpenAI's CLIP ViT-B/32 model was used directly in the LAION pipeline as the filtering mechanism. Every candidate image-text pair from Common Crawl was scored by CLIP for semantic alignment between the image and its alt-text, and pairs below the similarity threshold were discarded.
LAION also contributed to the development of OpenCLIP, an open-source reimplementation of OpenAI's CLIP architecture. Using LAION-2B-en as training data, the OpenCLIP project trained several large CLIP models (ViT-L/14, ViT-H/14, ViT-g/14, and ViT-G/14) that matched or approached the performance of OpenAI's original models. The ViT-G/14 model trained on LAION-2B achieved approximately 80.1% zero-shot accuracy on ImageNet, competitive with OpenAI's results. This demonstrated that open data combined with open code could replicate the capabilities of systems built by large corporations.
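Zero-shot classification with a CLIP-style model reduces to comparing an image embedding against the embeddings of text prompts for each class (e.g. "a photo of a dog") and picking the closest one. The sketch below uses plain vectors in place of real encoder outputs.

```python
import math

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class whose text-prompt embedding is most similar
    to the image embedding, as in CLIP/OpenCLIP zero-shot evaluation.
    Embeddings here are illustrative stand-ins for encoder outputs."""

    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)

    return max(range(len(class_text_embs)),
               key=lambda i: cos(image_emb, class_text_embs[i]))
```

No labeled training images are needed for a new task: only the class names, which is what makes the ImageNet numbers above "zero-shot".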
In December 2023, a report by David Thiel at the Stanford Internet Observatory revealed that LAION-5B contained URLs linking to child sexual abuse material (CSAM). The investigation examined more than 32 million data points in the dataset and identified 3,226 suspected CSAM images, of which 1,008 were externally validated using Microsoft's PhotoDNA tool. Because LAION-5B was assembled by scraping alt-text and image URLs from Common Crawl without manually reviewing each image, illegal content from the open web had passed through the automated filtering pipeline.
It is important to note that LAION's datasets store URLs and metadata rather than the images themselves. The CSAM was present on third-party websites, and LAION's dataset pointed to those URLs. Nevertheless, the Stanford researchers noted that anyone who had populated a copy of LAION-5B by downloading the linked images, even in late 2023, would be in possession of thousands of illegal images. The discovery raised serious concerns about the safety of AI training pipelines and the difficulty of moderating datasets at web scale.
LAION responded to the Stanford report on December 19, 2023, by immediately taking down the LAION-5B dataset and all known derivatives. In a public statement, the organization said: "In an abundance of caution, we have taken down the LAION datasets to ensure they are safe before republishing them." The organization cited its "zero tolerance policy for illegal content." The dataset remained offline for approximately eight months while LAION conducted a comprehensive safety review in partnership with the Internet Watch Foundation (IWF), the Canadian Centre for Child Protection (C3P), and the Stanford Internet Observatory.
In August 2024, LAION released Re-LAION-5B, a cleaned version of the original dataset. The organization described it as the first web-scale image-text pair dataset to be thoroughly cleaned of known links to suspected CSAM. In total, 2,236 links were removed after matching against lists of known CSAM URLs and image hashes provided by the partner organizations. These removed links included all 1,008 flagged by the Stanford Internet Observatory report, along with additional matches found through the IWF and C3P databases up to July 2024.
Re-LAION-5B was released in two variants:
| Variant | Description |
|---|---|
| Re-LAION-5B Research | The original dataset with the 2,236 confirmed CSAM-linked entries removed |
| Re-LAION-5B Research-Safe | A more aggressively filtered version that also removes additional NSFW content |
Both variants were made available under gated access on Hugging Face. The cleaned dataset contains approximately 5.5 billion text-image pairs.
The most prominent legal case involving LAION is Kneschke v. LAION, a copyright lawsuit filed on April 27, 2023, by Robert Kneschke, a German stock photographer, against LAION e.V. in the Hamburg Regional Court. Kneschke alleged that LAION had included one of his copyrighted photographs in the LAION-5B dataset without his consent. It was not disputed that LAION had downloaded a copy of his image (available in low resolution and watermarked) as part of the dataset creation process.
On September 27, 2024, the Hamburg Regional Court (Landgericht Hamburg) ruled in favor of LAION, finding that the organization's use of the images fell under the exception for text and data mining (TDM) for scientific research purposes, as codified in Section 60d of the German Copyright Act (Urheberrechtsgesetz, UrhG). This exception implements Article 3 of the EU Directive on Copyright in the Digital Single Market. The court held that LAION qualified as a non-commercial institution conducting scientific research, even though some LAION members also worked at Stability AI, a for-profit company. The court reasoned that this did not give the for-profit company any "decisive influence" over LAION's research.
Kneschke appealed the decision, but on December 10, 2025, the Hanseatic Higher Regional Court (Hanseatisches Oberlandesgericht) upheld the lower court's ruling. The appellate court found that LAION qualified for the scientific research TDM exception and that Kneschke's attempt to opt out of text and data mining was invalid because it was not in a machine-readable format. The court permitted a further appeal to the Federal Court of Justice (Bundesgerichtshof), leaving the possibility of a final ruling at the highest level.
This case is considered a landmark decision for AI training data rights in Europe, as it was the first court ruling to address the EU text and data mining exceptions in the context of AI training datasets.
LAION datasets have also featured in copyright lawsuits against companies that used them for commercial model training. In April 2024, a group of artists (including Jingna Zhang, Sarah Andersen, Hope Larson, and Jessica Fink) filed a lawsuit against Google in the U.S. District Court for the Northern District of California, alleging that Google's Imagen image generator was trained on their copyrighted content without authorization. Google had publicly acknowledged that Imagen used the LAION-400M dataset. While LAION itself was not a defendant in this case, the lawsuit highlighted the tension between open dataset distribution and downstream commercial use.
LAION's relationship with commercial entities, particularly Stability AI, has drawn criticism from some observers who describe the arrangement as "AI data laundering." Developer and writer Andy Baio popularized this term in a September 2022 essay, arguing that outsourcing the collection and curation of training data to non-profit and academic organizations allows corporations to avoid accountability and potential legal liability. Under this framing, a company like Stability AI can commercialize research outputs (for example, through its DreamStudio product) while shifting questions about privacy and copyright onto the nonprofit entities it funded. LAION has disputed this characterization, maintaining that it operates independently and that its relationship with Stability AI is one of mutual support rather than subordination.
| Dataset | Size | Year | Source | Open Access | Key Use Cases |
|---|---|---|---|---|---|
| SBU Captions | 1 million pairs | 2011 | Flickr | Yes | Image captioning |
| MS-COCO | 330,000 images, 1.5M captions | 2014 | Human-annotated | Yes | Object detection, captioning |
| ImageNet | 14.2 million images | 2009 | Web images, hand-labeled | Yes (restricted) | Image classification |
| YFCC100M | 99.2 million images | 2014 | Flickr (Creative Commons) | Yes | Multimodal research |
| Conceptual Captions (CC3M) | 3.3 million pairs | 2018 | Web (alt-text) | Yes | Vision-language pretraining |
| Conceptual 12M (CC12M) | 12.4 million pairs | 2021 | Web (alt-text) | Yes | Vision-language pretraining |
| WIT (Wikipedia ImageText) | 37.6 million pairs | 2021 | Wikipedia | Yes | Multilingual multimodal learning |
| LAION-400M | 413 million pairs | 2021 | Common Crawl + CLIP filtering | Yes | CLIP replication, image generation |
| COYO-700M | 747 million pairs | 2022 | Web (alt-text) | Yes | Vision-language pretraining |
| LAION-5B | 5.85 billion pairs | 2022 | Common Crawl + CLIP filtering | Yes (gated) | Large-scale image generation, CLIP training |
| DataComp CommonPool | 12.8 billion pairs | 2023 | Common Crawl | Yes | Dataset curation benchmarking |
LAION's datasets had a transformative effect on the open-source AI ecosystem. Before LAION-400M, the largest openly available image-text datasets (such as Conceptual Captions with 3.3 million pairs) were orders of magnitude smaller than the proprietary datasets used by large technology companies. OpenAI's internal WIT dataset (400 million pairs used to train CLIP) was never publicly released, and the datasets behind DALL-E 2 and Imagen were similarly proprietary.
By providing an open dataset of comparable scale, LAION enabled independent researchers, startups, and smaller organizations to train competitive models. Stability AI's use of LAION-5B to train Stable Diffusion and then release the model weights publicly was a turning point for the field. It broke the pattern in which only well-funded corporations could produce state-of-the-art generative models. The release of Stable Diffusion triggered a rapid proliferation of derivative models, fine-tuned variants, community tools, and applications that collectively established an open-source AI art ecosystem.
LAION's approach also influenced subsequent dataset curation efforts. Projects like DataComp (which LAION co-organized) built on the idea that large-scale, open datasets are essential infrastructure for AI research. The DataComp benchmark, announced in 2023, provided a framework for systematically evaluating and improving dataset curation strategies, using a CommonPool of 12.8 billion image-text pairs collected from Common Crawl.
The fact that LAION's datasets were assembled by volunteers for roughly $10,000 in total costs demonstrated that large-scale data curation did not require the resources of a major technology company. This had implications beyond image generation, influencing how the broader AI research community thought about data access, open science, and the democratization of AI capabilities.
Beyond its image-text datasets, LAION has contributed to several other projects, including the OpenCLIP model family and the DataComp dataset-curation benchmark.
As of early 2026, LAION continues to operate as a volunteer-driven non-profit. The Re-LAION-5B dataset is available through gated access on Hugging Face, and the organization maintains an active presence on GitHub and its official website (laion.ai). In August 2025, LAION released Open-sci-ref 0.01, a family of dense transformer research models trained on eight different reference datasets, with intermediate checkpoints made publicly available. The organization has also continued work on emotion recognition, releasing an EmoNet-Face benchmark with a 40-category emotion taxonomy that was accepted at NeurIPS 2025.
The Kneschke v. LAION copyright case may still proceed to the German Federal Court of Justice, which could produce a definitive ruling on the legality of web scraping for AI training datasets under European law.
LAION's work remains at the intersection of several ongoing debates in AI policy: the balance between open data and content safety, the rights of creators whose works are included in training datasets, and the role of non-profit organizations in building foundational AI infrastructure. Regardless of how these debates resolve, LAION's datasets have already left a lasting mark on the development of generative AI, demonstrating both the power and the risks of large-scale open data.