LAION (Large-scale Artificial Intelligence Open Network) is a German non-profit organization dedicated to making large-scale machine learning resources openly available to the public. Founded in 2021, LAION is best known for creating and releasing some of the largest open image-text datasets in existence, including LAION-400M and LAION-5B. These datasets played a central role in the training of influential text-to-image models such as Stable Diffusion and helped catalyze the open-source generative AI movement. The LAION-5B paper received the Outstanding Paper Award in the Datasets and Benchmarks Track at NeurIPS 2022.
LAION was founded in the summer of 2021 in Hamburg, Germany, by a group of volunteers who connected through a Discord server for AI enthusiasts. The driving force behind the organization was Christoph Schuhmann, a high school physics and computer science teacher in Hamburg who holds a master's degree in physics and computer science from the University of Vienna. Schuhmann was motivated by concerns about the centralization of AI resources after OpenAI released DALL-E in January 2021. He feared that if access to large-scale training data remained exclusive to a handful of large technology companies, it would have negative consequences for society and scientific progress.
Schuhmann recruited collaborators through the Discord community, and the founding team included Jenia Jitsev (scientific lead, from Julich Supercomputing Centre), Richard Vencu (engineering lead), Robert Kaczmarczyk (medical AI lead), Theo Coombes, Mehdi Cherti, Aarush Katta, and Jan Ebert. The organization was formally registered as LAION e.V. (eingetragener Verein, or registered association) under German law, headquartered at Marlowweg 26, 22525 Hamburg. As a non-profit, LAION does not pursue commercial goals and makes all of its outputs available free of charge.
Despite the scale of its impact on the AI industry, LAION was built on a remarkably small budget. The team covered its server fees through a combination of crowdfunding, a donation from Hugging Face in 2021, and a donation of roughly $9,000 to $10,000 from Stability AI founder Emad Mostaque. Stability AI later provided additional support, primarily in the form of access to idle GPU compute on its cluster of 4,000 to 5,600 GPUs. Sponsors including Hugging Face, Doodlebot, and Stability AI contributed computing resources that helped produce the LAION-5B dataset. Schuhmann has publicly stated that he has declined job offers from companies in order to keep LAION independent, continuing to work as a teacher while leading the organization.
Christoph Schuhmann serves as the organizational lead and co-founder of LAION. Born in 1982, he studied computer science and physics at the University of Vienna and also spent six years studying acting. After graduating, he worked as an IT administrator and teacher in Hamburg, where he also directed films for children. Before founding LAION, Schuhmann produced "Schools of Trust," a crowdfunded documentary about alternative schools where students learn without grades and fixed curricula. This experience with grassroots organizing and non-profit management informed how he later structured LAION.
Schuhmann's interest in deep learning grew during years of self-study through online courses. When DALL-E was published in early 2021, Schuhmann began experimenting with AI on Google Colab and connected with like-minded researchers on Discord. Within a few weeks, the group had assembled 3 million image-text pairs; after three months they released the 400-million-pair LAION-400M dataset. As of 2023, Schuhmann was still teaching physics and computer science to high school students in Germany while running LAION as a volunteer effort. His work attracted wide media attention, including a Bloomberg feature in April 2023 that highlighted how a high school teacher's free database had come to power some of the most prominent AI systems in the world.
LAION's datasets were constructed using a pipeline that combined web scraping with CLIP-based filtering. The acquisition pipeline can be divided into three major stages: distributed processing of petabyte-scale Common Crawl data to extract candidate URL-text pairs, distributed downloading of images from the shuffled URL lists, and post-processing on GPU nodes.
Source data collection: The starting material came from Common Crawl, a publicly available archive of web page data. Worker machines parsed Common Crawl WAT (Web ARChive Transform) files to extract all HTML image tags that contained alt-text attributes, producing raw URL and text pairs. Language detection was performed on the alt-text to classify pairs as English, another detected language, or "no language" (where detection confidence was below threshold).
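The alt-text extraction step can be sketched with Python's standard-library HTML parser. This is a minimal illustration only: the real workers parse Common Crawl WAT metadata records at scale rather than raw HTML, and the class name here is ours.

```python
from html.parser import HTMLParser

class ImgAltExtractor(HTMLParser):
    """Collect (src, alt) pairs from <img> tags that carry non-empty alt text,
    mirroring the first pipeline stage. Illustrative sketch, not LAION's code."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            # Keep only images that have both a URL and non-empty alt-text.
            if a.get("src") and a.get("alt"):
                self.pairs.append((a["src"], a["alt"]))

    def handle_startendtag(self, tag, attrs):
        # Treat self-closing <img .../> the same as <img ...>.
        self.handle_starttag(tag, attrs)
```

Feeding a page through the extractor yields the raw URL-text candidates that the later filtering stages then prune.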
Image downloading: A distributed cluster of roughly 300 worker machines downloaded images from the extracted URLs in parallel. Each worker pulled batches of 10,000 links from a PostgreSQL server. Because many URLs point to images that have been removed or relocated, a significant fraction of downloads fail at any given time. Samples with fewer than 5 characters of alt-text or an image smaller than 5 KB were dropped, and duplicates were removed using a bloom filter keyed on URL.
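The per-sample drop rules and URL-based deduplication can be sketched as follows. The bloom filter and function names are illustrative, not LAION's production code; a bloom filter trades a small false-positive rate (occasionally dropping a fresh URL) for constant memory at billion-URL scale.

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter for URL dedup (illustrative; sized far below production needs)."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

def keep_sample(alt_text, image_size_bytes, url, seen):
    """Apply the drop rules described above: short alt-text, tiny images, duplicate URLs."""
    if len(alt_text) < 5 or image_size_bytes < 5 * 1024:
        return False
    if url in seen:  # probabilistic check: may very rarely reject a fresh URL
        return False
    seen.add(url)
    return True
```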
CLIP embedding and filtering: Each downloaded image and its associated alt-text were passed through a CLIP model (OpenAI's ViT-B/32 for English content, or multilingual CLIP for non-English content). The model computed embeddings for both the image and the text, and a cosine similarity score was calculated between the two. Image-text pairs with similarity scores below a set threshold were discarded. For LAION-400M, the threshold was 0.3. For the English subset of LAION-5B, it was 0.28, and for the multilingual subset, 0.26.
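The similarity test at the heart of the pipeline reduces to a cosine comparison against a per-subset threshold. In a sketch like the following, the embeddings would come from a CLIP image and text encoder; here they are plain vectors, and the function names are ours.

```python
import math

# Cutoffs reported for the LAION releases: 0.30 for LAION-400M,
# 0.28 for the English subset of LAION-5B, 0.26 for the multilingual subset.
THRESHOLDS = {"laion400m": 0.30, "en": 0.28, "multi": 0.26}

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def keep_pair(image_emb, text_emb, subset="en"):
    """Keep an image-text pair only if its CLIP similarity clears the subset threshold."""
    return cosine_similarity(image_emb, text_emb) >= THRESHOLDS[subset]
```

A well-aligned pair (embeddings pointing in nearly the same direction) passes, while an image whose alt-text describes something unrelated scores near zero and is discarded.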
Additional metadata: Each retained pair was annotated with metadata including image dimensions, a watermark detection probability score, an NSFW (not safe for work) classification score, and the original CLIP embeddings. Nearest-neighbor indices (using FAISS) were also provided to enable efficient similarity search across the dataset.
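What the FAISS indices enable is nearest-neighbor search over the precomputed embeddings. A brute-force version of that query is easy to sketch; FAISS performs the same inner-product search but with approximate index structures that scale to billions of vectors.

```python
def knn_search(query, embeddings, k=3):
    """Brute-force k-nearest-neighbor search over unit-normalized embeddings.
    Illustrative stand-in for a FAISS inner-product index query."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Rank all stored embeddings by similarity to the query, highest first.
    ranked = sorted(range(len(embeddings)), key=lambda i: -dot(query, embeddings[i]))
    return ranked[:k]
```

This is the operation behind tools such as the clip-retrieval demo, which let users find dataset samples similar to an arbitrary query image or caption.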
This filtering process was aggressive. For LAION-5B, roughly 90% of the candidate samples were removed, trimming more than 50 billion raw candidates down to approximately 5.85 billion retained pairs.
It is important to note that LAION datasets do not contain the images themselves. They consist of URLs pointing to images hosted elsewhere on the web, paired with alt-text and metadata. This distinction became legally significant in later copyright proceedings.
The following table summarizes LAION's major dataset releases:
| Dataset | Size | Release Date | Description |
|---|---|---|---|
| LAION-400M | 413 million pairs | August 2021 | English-language image-text pairs filtered with CLIP ViT-B/32 at threshold 0.3 |
| LAION-5B | 5.85 billion pairs | March 2022 | Multilingual image-text pairs across three subsets, filtered with CLIP and multilingual CLIP |
| LAION-Aesthetics v2 | Up to 1.2 billion pairs | August 2022 | Subsets of LAION-5B scored for visual quality by an aesthetics predictor model |
| LAION-High-Resolution | 170 million pairs | 2022 | Subset of LAION-5B with images at 1024x1024 resolution or higher |
| LAION-COCO | 600 million pairs | October 2022 | LAION-2B-en subset with synthetic captions generated by BLIP and selected by CLIP |
| Re-LAION-5B | ~5.5 billion pairs | August 2024 | LAION-5B cleaned of known links to CSAM, released in two variants |
LAION-400M was the organization's first major release, published on August 20, 2021. It contained 413 million English-language image-text pairs filtered from Common Crawl data using OpenAI's CLIP ViT-B/32 model with a cosine similarity threshold of 0.3. The dataset also included precomputed CLIP embeddings and kNN indices for similarity search.
The accompanying paper, "LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs," was authored by Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki, and published as an arXiv preprint (arXiv:2111.02114) in November 2021.
At the time of its release, LAION-400M was the largest openly accessible image-text dataset in the world. It was created to replicate the scale of OpenAI's internal WebImageText (WIT) dataset, which contained 400 million pairs used to train CLIP but was never made public.
LAION-5B was released on March 31, 2022, and represented a massive expansion of the earlier dataset. It contained 5.85 billion CLIP-filtered image-text pairs, making it approximately 14 times larger than LAION-400M. The dataset was organized into three subsets:
| Subset | Size | Description |
|---|---|---|
| LAION-2B-en | 2.32 billion pairs | English-language image-text pairs, filtered with CLIP ViT-B/32 at threshold 0.28 |
| LAION-2B-multi | 2.26 billion pairs | Image-text pairs in over 100 non-English languages, filtered with multilingual CLIP at threshold 0.26 |
| LAION-1B-nolang | 1.27 billion pairs | Pairs where language could not be reliably detected, often containing product images or location photos with captions mixing natural language and SEO keywords |
The most frequently represented non-English languages in the multilingual subset were Russian (10.6%), French (7.4%), German (6.6%), Spanish (6.6%), and Chinese (6.3%).
The LAION-5B paper, "LAION-5B: An open large-scale dataset for training next generation image-text models," was presented at NeurIPS 2022 in the Datasets and Benchmarks Track and received the Outstanding Paper Award. The paper's authors included Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. It was published in Advances in Neural Information Processing Systems, volume 35, pages 25278 to 25294.
LAION-Aesthetics is a collection of subsets of LAION-5B filtered for high visual quality using an aesthetics prediction model. The predictor was a linear model trained on image-rating pairs from the Simulacra Aesthetic Captions (SAC) dataset, using CLIP image embeddings from the OpenAI CLIP ViT-L/14 model. The model was trained to predict the score a human would give when asked "How much do you like this image on a scale from 1 to 10?"
The key LAION-Aesthetics subsets include:
| Subset | Size | Minimum Predicted Score |
|---|---|---|
| LAION-Aesthetics v2 5+ | ~600 million pairs | 5.0 |
| LAION-Aesthetics v2 6+ | ~12 million pairs | 6.0 |
| LAION-Aesthetics v2 6.25+ | ~3 million pairs | 6.25 |
| LAION-Aesthetics v2 6.5+ | ~625,000 pairs | 6.5 |
The LAION-Aesthetics v2 5+ subset, which contained approximately 600 million pairs from LAION-2B-en with predicted aesthetics scores of 5.0 or higher, became especially significant because it served as the primary training dataset for Stable Diffusion v1.
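Because the aesthetics predictor is a linear model over CLIP image embeddings, the scoring and filtering steps amount to a dot product and a cutoff. The weights below are placeholders, not the published predictor's parameters; the real model was fit to SAC human ratings.

```python
def predict_aesthetic_score(embedding, weights, bias=0.0):
    """Linear head over a CLIP image embedding, predicting a 1-10 aesthetics rating.
    Weights/bias are illustrative placeholders."""
    return sum(w * x for w, x in zip(weights, embedding)) + bias

def filter_by_aesthetics(embeddings, weights, bias, min_score):
    """Keep the indices of samples whose predicted score clears the subset cutoff
    (e.g. 5.0 for LAION-Aesthetics v2 5+, 6.5 for v2 6.5+)."""
    return [i for i, e in enumerate(embeddings)
            if predict_aesthetic_score(e, weights, bias) >= min_score]
```

Since LAION-5B ships with CLIP embeddings precomputed, building each aesthetics subset required only this cheap linear pass rather than re-encoding billions of images.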
LAION-High-Resolution is a subset of LAION-5B containing approximately 170 million image-text pairs where the images have a resolution of at least 1024x1024 pixels. This subset is particularly useful for training super-resolution models and high-resolution image generation systems.
LAION-COCO is a dataset of 600 million image-text pairs drawn from LAION-2B-en, where the original web-scraped alt-text captions have been replaced with synthetic captions generated by AI models. The captioning pipeline used an ensemble approach: BLIP L/14 generated 40 candidate captions per image, OpenAI's CLIP ViT-L/14 selected the five best candidates, and then CLIP RN50x64 chose the single best caption from those five. The hyperparameters for the generation process were tuned to match the writing style of MS-COCO captions, as measured by ROUGE scores.
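The two-stage reranking can be sketched generically, with the scoring models injected as functions. In the actual pipeline the first scorer would be CLIP ViT-L/14 similarity and the second CLIP RN50x64 similarity; here both are placeholders.

```python
def select_caption(image_emb, candidates, stage1_score, stage2_score):
    """Two-stage caption reranking (sketch): stage 1 keeps the 5 best of the
    BLIP-generated candidates, stage 2 picks the single best of those five.
    stage1_score / stage2_score stand in for two different CLIP models."""
    top5 = sorted(candidates,
                  key=lambda c: stage1_score(image_emb, c),
                  reverse=True)[:5]
    return max(top5, key=lambda c: stage2_score(image_emb, c))
```

Using two different CLIP variants as judges reduces the risk that a caption is kept merely because it exploits the quirks of a single scoring model.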
Human evaluation showed that in 47.5% of cases, evaluators believed the AI-generated captions had been written by humans. LAION-COCO was released in October 2022 and is available on Hugging Face.
LAION's datasets were foundational to the development of Stable Diffusion, one of the most influential open-source text-to-image generation models. The relationship between LAION and Stable Diffusion involved several layers:
Stable Diffusion was developed by the CompVis group at Ludwig Maximilian University of Munich, led by Robin Rombach and Patrick Esser, with computational resources provided by Stability AI. The model was trained on subsets of LAION-5B in a multi-stage process:
Initial training was performed on LAION-2B-en and LAION-High-Resolution, giving the model broad exposure to a diverse range of image-text relationships.
Fine-tuning was done on LAION-Aesthetics v2 5+, the 600 million pair subset with high aesthetic quality scores. Low-resolution images and images flagged as likely watermarked (with greater than 80% probability by LAION-5B-WatermarkDetection) were excluded.
For the Stable Diffusion v1.5 checkpoint specifically, training involved 595,000 steps at 512x512 resolution on LAION-Aesthetics v2 5+, with 10% text conditioning dropout to improve classifier-free guidance sampling.
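The 10% text-conditioning dropout and the classifier-free guidance it enables can be sketched as follows. This is a minimal illustration using lists in place of tensors; the real implementation operates inside the diffusion training and sampling loops.

```python
import random

def maybe_drop_text(text_embedding, null_embedding, p_drop=0.10, rng=random):
    """Training-time trick: with probability p_drop, replace the caption embedding
    with the unconditional (empty-prompt) embedding, so the model also learns
    to denoise without text conditioning."""
    return null_embedding if rng.random() < p_drop else text_embedding

def cfg_sample(eps_uncond, eps_cond, guidance_scale):
    """Sampling-time classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the conditional one,
    eps = eps_uncond + s * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

A guidance scale above 1 pushes samples toward the prompt at the cost of diversity, which is why the dropout during training matters: it gives the sampler a meaningful unconditional prediction to extrapolate from.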
Stability AI also provided financial support to LAION, making the relationship symbiotic: LAION produced the open datasets, and Stability AI contributed compute resources that both funded LAION's operations and enabled the training of models on LAION's data. Stability AI credited both LAION and EleutherAI as key supporters of the Stable Diffusion project.
Stable Diffusion v2 was later trained on a filtered version of LAION-5B that removed images with NSFW content, using an improved OpenCLIP model rather than OpenAI's original CLIP.
LAION's work is deeply intertwined with OpenAI's CLIP (Contrastive Language-Image Pre-training) model. When OpenAI released CLIP in January 2021, the model weights and inference code were made publicly available, but OpenAI did not release the training dataset (WebImageText, or WIT, consisting of 400 million image-text pairs collected from the internet). This asymmetry motivated the creation of LAION: the goal was to build an open dataset large enough to replicate CLIP-scale training from scratch.
OpenAI's CLIP ViT-B/32 model was used directly in the LAION pipeline as the filtering mechanism. Every candidate image-text pair from Common Crawl was scored by CLIP for semantic alignment between the image and its alt-text, and pairs below the similarity threshold were discarded.
LAION also contributed to the development of OpenCLIP, an open-source reimplementation of OpenAI's CLIP architecture. Using LAION-2B-en as training data, the OpenCLIP project trained several large CLIP models (ViT-L/14, ViT-H/14, ViT-g/14, and ViT-G/14) that matched or approached the performance of OpenAI's original models. The ViT-G/14 model trained on LAION-2B achieved approximately 80.1% zero-shot accuracy on ImageNet, competitive with OpenAI's results. This demonstrated that open data combined with open code could replicate the capabilities of systems built by large corporations.
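Zero-shot classification with a CLIP-style model reduces to comparing an image embedding against the embeddings of text prompts for each class (e.g. "a photo of a dog") and picking the closest one. The sketch below uses plain vectors in place of real encoder outputs.

```python
import math

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class whose text-prompt embedding is most similar
    to the image embedding, as in CLIP/OpenCLIP zero-shot evaluation.
    Embeddings here are illustrative stand-ins for encoder outputs."""

    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)

    return max(range(len(class_text_embs)),
               key=lambda i: cos(image_emb, class_text_embs[i]))
```

No labeled training images are needed for a new task: only the class names, which is what makes the ImageNet numbers above "zero-shot".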
In December 2023, a report by David Thiel at the Stanford Internet Observatory revealed that LAION-5B contained URLs linking to child sexual abuse material (CSAM). The investigation examined more than 32 million data points in the dataset and identified 3,226 suspected CSAM images, of which 1,008 were externally validated using Microsoft's PhotoDNA tool. Because LAION-5B was assembled by scraping alt-text and image URLs from Common Crawl without manually reviewing each image, illegal content from the open web had passed through the automated filtering pipeline.
It is important to note that LAION's datasets store URLs and metadata rather than the images themselves. The CSAM was present on third-party websites, and LAION's dataset pointed to those URLs. Nevertheless, the Stanford researchers noted that anyone who had populated a copy of LAION-5B by downloading the linked images, even in late 2023, would be in possession of thousands of illegal images. The discovery raised serious concerns about the safety of AI training pipelines and the difficulty of moderating datasets at web scale.
LAION responded to the Stanford report on December 19, 2023, by immediately taking down the LAION-5B dataset and all known derivatives. In a public statement, the organization said: "In an abundance of caution, we have taken down the LAION datasets to ensure they are safe before republishing them." The organization cited its "zero tolerance policy for illegal content." The dataset remained offline for approximately eight months while LAION conducted a comprehensive safety review in partnership with the Internet Watch Foundation (IWF), the Canadian Centre for Child Protection (C3P), and the Stanford Internet Observatory.
In August 2024, LAION released Re-LAION-5B, a cleaned version of the original dataset. The organization described it as the first web-scale image-text pair dataset to be thoroughly cleaned of known links to suspected CSAM. In total, 2,236 links were removed after matching against lists of known CSAM URLs and image hashes provided by the partner organizations. These removed links included all 1,008 flagged by the Stanford Internet Observatory report, along with additional matches found through the IWF and C3P databases up to July 2024.
Re-LAION-5B was released in two variants:
| Variant | Description |
|---|---|
| Re-LAION-5B Research | The original dataset with the 2,236 confirmed CSAM-linked entries removed |
| Re-LAION-5B Research-Safe | A more aggressively filtered version that also removes additional NSFW content |
Both variants were made available under gated access on Hugging Face. The cleaned dataset contains approximately 5.5 billion text-image pairs.
The most prominent legal case involving LAION is Kneschke v. LAION, a copyright lawsuit filed on April 27, 2023, by Robert Kneschke, a German stock photographer, against LAION e.V. in the Hamburg Regional Court. Kneschke alleged that LAION had included one of his copyrighted photographs in the LAION-5B dataset without his consent. It was not disputed that LAION had downloaded a copy of his image (available in low resolution and watermarked) as part of the dataset creation process.
On September 27, 2024, the Hamburg Regional Court (Landgericht Hamburg) ruled in favor of LAION, finding that the organization's use of the images fell under the exception for text and data mining (TDM) for scientific research purposes, as codified in Section 60d of the German Copyright Act (Urheberrechtsgesetz, UrhG). This exception implements Article 3 of the EU Directive on Copyright in the Digital Single Market. The court held that LAION qualified as a non-commercial institution conducting scientific research, even though some LAION members also worked at Stability AI, a for-profit company. The court reasoned that this did not give the for-profit company any "decisive influence" over LAION's research.
Kneschke appealed the decision, but on December 10, 2025, the Hanseatic Higher Regional Court (Hanseatisches Oberlandesgericht) upheld the lower court's ruling. The appellate court found that LAION qualified for the scientific research TDM exception and that Kneschke's attempt to opt out of text and data mining was invalid because it was not in a machine-readable format. The court permitted a further appeal to the Federal Court of Justice (Bundesgerichtshof), leaving the possibility of a final ruling at the highest level.
This case is considered a landmark decision for AI training data rights in Europe, as it was the first court ruling to address the EU text and data mining exceptions in the context of AI training datasets.
LAION datasets have also featured in copyright lawsuits against companies that used them for commercial model training. In April 2024, a group of artists (including Jingna Zhang, Sarah Andersen, Hope Larson, and Jessica Fink) filed a lawsuit against Google in the U.S. District Court for the Northern District of California, alleging that Google's Imagen image generator was trained on their copyrighted content without authorization. Google had publicly acknowledged that Imagen used the LAION-400M dataset. While LAION itself was not a defendant in this case, the lawsuit highlighted the tension between open dataset distribution and downstream commercial use.
LAION's relationship with commercial entities, particularly Stability AI, has drawn criticism from some observers who describe the arrangement as "AI data laundering." Developer and writer Andy Baio popularized this term in a September 2022 essay, arguing that outsourcing the collection and curation of training data to non-profit and academic organizations allows corporations to avoid accountability and potential legal liability. Under this framing, a company like Stability AI can commercialize research outputs (for example, through its DreamStudio product) while shifting questions about privacy and copyright onto the nonprofit entities it funded. LAION has disputed this characterization, maintaining that it operates independently and that its relationship with Stability AI is one of mutual support rather than subordination.
| Dataset | Size | Year | Source | Open Access | Key Use Cases |
|---|---|---|---|---|---|
| SBU Captions | 1 million pairs | 2011 | Flickr | Yes | Image captioning |
| MS-COCO | 330,000 images, 1.5M captions | 2014 | Human-annotated | Yes | Object detection, captioning |
| ImageNet | 14.2 million images | 2009 | Web images, hand-labeled | Yes (restricted) | Image classification |
| YFCC100M | 99.2 million images | 2014 | Flickr (Creative Commons) | Yes | Multimodal research |
| Conceptual Captions (CC3M) | 3.3 million pairs | 2018 | Web (alt-text) | Yes | Vision-language pretraining |
| Conceptual 12M (CC12M) | 12.4 million pairs | 2021 | Web (alt-text) | Yes | Vision-language pretraining |
| WIT (Wikipedia ImageText) | 37.6 million pairs | 2021 | Wikipedia | Yes | Multilingual multimodal learning |
| LAION-400M | 413 million pairs | 2021 | Common Crawl + CLIP filtering | Yes | CLIP replication, image generation |
| COYO-700M | 747 million pairs | 2022 | Web (alt-text) | Yes | Vision-language pretraining |
| LAION-5B | 5.85 billion pairs | 2022 | Common Crawl + CLIP filtering | Yes (gated) | Large-scale image generation, CLIP training |
| DataComp CommonPool | 12.8 billion pairs | 2023 | Common Crawl | Yes | Dataset curation benchmarking |
LAION's datasets had a transformative effect on the open-source AI ecosystem. Before LAION-400M, the largest openly available image-text datasets (such as Conceptual Captions with 3.3 million pairs) were orders of magnitude smaller than the proprietary datasets used by large technology companies. OpenAI's internal WIT dataset (400 million pairs used to train CLIP) was never publicly released, and the datasets behind DALL-E 2 and Imagen were similarly proprietary.
By providing an open dataset of comparable scale, LAION enabled independent researchers, startups, and smaller organizations to train competitive models. Stability AI's use of LAION-5B to train Stable Diffusion and then release the model weights publicly was a turning point for the field. It broke the pattern in which only well-funded corporations could produce state-of-the-art generative models. The release of Stable Diffusion triggered a rapid proliferation of derivative models, fine-tuned variants, community tools, and applications that collectively established an open-source AI art ecosystem.
LAION's approach also influenced subsequent dataset curation efforts. Projects like DataComp (which LAION co-organized) built on the idea that large-scale, open datasets are essential infrastructure for AI research. The DataComp benchmark, announced in 2023, provided a framework for systematically evaluating and improving dataset curation strategies, using a CommonPool of 12.8 billion image-text pairs collected from Common Crawl.
The fact that LAION's datasets were assembled by volunteers for roughly $10,000 in total costs demonstrated that large-scale data curation did not require the resources of a major technology company. This had implications beyond image generation, influencing how the broader AI research community thought about data access, open science, and the democratization of AI capabilities.
Beyond its image-text datasets, LAION has contributed to several other projects, including the OpenCLIP model family and the DataComp dataset-curation benchmark.
As of early 2026, LAION continues to operate as a volunteer-driven non-profit. The Re-LAION-5B dataset is available through gated access on Hugging Face, and the organization maintains an active presence on GitHub and its official website (laion.ai). In August 2025, LAION released Open-sci-ref 0.01, a family of dense transformer research models trained on eight different reference datasets, with intermediate checkpoints made publicly available. The organization has also continued work on emotion recognition, releasing an EmoNet-Face benchmark with a 40-category emotion taxonomy that was accepted at NeurIPS 2025.
The Kneschke v. LAION copyright case may still proceed to the German Federal Court of Justice, which could produce a definitive ruling on the legality of web scraping for AI training datasets under European law.
LAION's work remains at the intersection of several ongoing debates in AI policy: the balance between open data and content safety, the rights of creators whose works are included in training datasets, and the role of non-profit organizations in building foundational AI infrastructure. Regardless of how these debates resolve, LAION's datasets have already left a lasting mark on the development of generative AI, demonstrating both the power and the risks of large-scale open data.