Snorkel
Last reviewed
Apr 28, 2026
Sources
17 citations
Review status
Source-backed
Revision
v1 · 3,621 words
Snorkel is an open-source software framework and methodology for programmatic data labeling, developed at Stanford University's DAWN lab beginning in 2015. Snorkel implements weak supervision by allowing users to write labeling functions (LFs), small Python functions that capture noisy heuristics, distant supervision signals, and existing knowledge bases. A statistical label model then combines the outputs of these labeling functions, learning their accuracies and correlations without ground truth, to produce probabilistic training labels for downstream models [1][2].
The underlying paradigm, called data programming, was introduced by Alex Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, and Christopher Re at NeurIPS 2016 [1]. The Snorkel system itself was presented at VLDB 2017 [2]. Snorkel triggered a wave of research and tooling around weak supervision and influenced the modern data-centric AI movement, which argues that improving training data is often more impactful than tuning model architectures.
Snorkel is also the namesake of Snorkel AI, a venture-backed company founded in 2019 by core members of the Stanford team. Snorkel AI built the open framework into a commercial platform called Snorkel Flow, and by the early 2020s pivoted toward enterprise data development for large language models, including curated datasets for fine-tuning, RLHF, and evaluation.
Snorkel grew out of the Stanford DAWN lab, a research group co-led by Christopher Re focused on systems for machine learning. The earliest precursor was DeepDive, a knowledge base construction system that Re's group had developed throughout the early 2010s. DeepDive used a combination of distant supervision, rule-based features, and probabilistic inference to extract structured records from text. Building DeepDive applications required heavy engineering, and the group looked for a more general way to express domain knowledge as supervision.
The key conceptual breakthrough came in 2016 with the data programming paper [1]. Rather than asking users to label individual examples, the paper proposed asking them to write functions that label many examples at once, then learn the accuracies of those functions automatically. Each function could be wrong on some examples, and different functions could disagree, but their patterns of agreement and disagreement could be exploited to recover an estimate of the true label distribution.
A year later, the team released Snorkel as an open-source system implementing this idea, described in a VLDB 2017 paper authored by Alex Ratner, Stephen Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Re [2]. The first public version was released on GitHub under the Apache 2.0 license and quickly drew interest from companies that needed large training sets but could not afford to hand-label them. Adoption inside Google was documented in a 2019 SIGMOD paper on Snorkel DryBell [3].
The authorship history matters because Snorkel's later commercialization through Snorkel AI retained continuity with the academic project. Ratner, Ehrenberg, Braden Hancock, Paroma Varma, and Re were among the founders of the company in 2019.
Classical supervised learning treats training labels as fixed inputs. Someone, usually a human annotator, attaches a label to each training example, and a model learns to predict labels from features. This works well when labeled data is abundant and cheap. It fails when labels require domain expertise, when the dataset is large enough that hand-labeling is uneconomical, or when label requirements change frequently.
Data programming reframes labeling as a programming task. Instead of providing per-example labels, a developer writes labeling functions. A labeling function is any callable that takes an example and returns a label or abstains. Common sources of labeling functions include:
LFs are explicitly noisy. Some are accurate but cover only a slice of the data, others are broad but error-prone, and some conflict with each other. Snorkel applies all LFs to a pool of unlabeled data and produces a label matrix of shape (n_examples, n_LFs) where each cell is the LF's vote on that example, encoded as an integer class index or an abstain marker (-1 in the open-source library).
The label model is then trained on the label matrix alone, with no ground-truth labels. It estimates each LF's accuracy and the correlations between LFs by examining their pattern of agreements and disagreements [1]. The intuition is that if many independent, accurate LFs agree on an example, the underlying label is probably what they agree on; if they disagree, their estimated accuracies decide which is more likely correct. The label model outputs a probabilistic label for each example, which can be used directly to train a discriminative end model.
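The intuition about agreement and disagreement can be made concrete with a small sketch. It assumes binary labels, LFs that are conditionally independent given the true label, abstentions that carry no evidence, and accuracies that are already known; the accuracy values and the function name `posterior_positive` are hypothetical and not part of the Snorkel API.

```python
import math

ABSTAIN = -1

def posterior_positive(votes, accuracies, prior=0.5):
    """P(y = 1 | votes) for binary LFs assumed conditionally independent given y."""
    log_odds = math.log(prior / (1.0 - prior))
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue  # an abstaining LF contributes no evidence in this sketch
        weight = math.log(acc / (1.0 - acc))  # more accurate LFs get larger weights
        log_odds += weight if vote == 1 else -weight
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical accuracies: one strong LF and two weak ones.
accuracies = [0.9, 0.6, 0.6]

agree = posterior_positive([1, 1, 1], accuracies)        # all three vote positive
conflict = posterior_positive([1, 0, 0], accuracies)     # strong LF vs. two weak LFs
no_votes = posterior_positive([-1, -1, -1], accuracies)  # every LF abstains
```

Under these assumptions the conflict case comes out at 0.8 in favor of the strong LF even though it is outvoted two to one, which is the behavior a plain majority vote cannot reproduce.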
A crucial empirical observation, reported across multiple Snorkel papers, is that the final discriminative model often outperforms the label model itself, and even outperforms any single LF [2][3]. This happens because the end model is trained on the rich features of the input (text, images, structured fields), so it can generalize beyond the patterns captured by the labeling functions. The label model only sees LF outputs; the end model sees the world.
The Snorkel pipeline is iterative and roughly follows six stages.
Step 1: Define the schema. Specify the classes, label space, and any business constraints. A typical example: classify customer support tickets as billing, technical, or other.
Step 2: Write labeling functions. Each LF is a Python function decorated with @labeling_function() that takes a row and returns a class index or ABSTAIN. A starting set might include a regex for the word "refund" mapping to billing, a keyword list for technical terms, and a rule that abstains on short tickets.
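A minimal sketch of the starting set described in this step, written as plain Python functions so it runs without the library installed. With Snorkel, each function would carry the `@labeling_function()` decorator. The class indices, the keyword list, the five-word threshold, and the choice to have the third rule vote "other" on longer tickets with no cue are all assumptions made for illustration.

```python
import re

ABSTAIN, BILLING, TECHNICAL, OTHER = -1, 0, 1, 2

TECH_TERMS = {"error", "crash", "login", "timeout", "bug"}  # hypothetical keyword list
MIN_WORDS = 5  # hypothetical threshold for a "short" ticket

def lf_refund(ticket):
    """Regex heuristic: a mention of refunds suggests a billing ticket."""
    return BILLING if re.search(r"\brefund", ticket["text"], re.IGNORECASE) else ABSTAIN

def lf_tech_terms(ticket):
    """Keyword heuristic: any technical term suggests a technical ticket."""
    words = set(re.findall(r"[a-z]+", ticket["text"].lower()))
    return TECHNICAL if words & TECH_TERMS else ABSTAIN

def lf_other_fallback(ticket):
    """Abstain on short tickets; otherwise vote OTHER when no other cue fires."""
    if len(ticket["text"].split()) < MIN_WORDS:
        return ABSTAIN
    if lf_refund(ticket) == ABSTAIN and lf_tech_terms(ticket) == ABSTAIN:
        return OTHER
    return ABSTAIN

LFS = [lf_refund, lf_tech_terms, lf_other_fallback]

tickets = [
    {"text": "I was charged twice, please refund my card"},
    {"text": "App shows a timeout error after login"},
    {"text": "Thanks!"},
    {"text": "Can you tell me your office opening hours"},
]

# One row per ticket, one column per labeling function.
votes = [[lf(ticket) for lf in LFS] for ticket in tickets]
```

Note that the fallback rule depends on the other two by construction, so it is not independent of them; that kind of dependency is exactly what the label model's dependency-aware variants are meant to handle.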
Step 3: Apply LFs to unlabeled data. Snorkel's PandasLFApplier (or its parallel and Spark variants) runs every LF on every example, producing the label matrix L. Diagnostic functions in Snorkel report each LF's coverage, overlap with other LFs, conflict rate, and empirical accuracy on a small held-out development set.
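The coverage, overlap, and conflict diagnostics mentioned in this step can be sketched directly from a label matrix. The function `lf_summary` below is a hypothetical stand-in, not Snorkel's own diagnostic code, and the definitions are the usual ones: coverage is the fraction of examples an LF labels, overlap is the fraction where at least one other LF also labels, and conflict is the fraction where at least one other LF disagrees.

```python
ABSTAIN = -1

def lf_summary(L):
    """Per-LF coverage, overlap, and conflict rates for a label matrix L."""
    n_examples, n_lfs = len(L), len(L[0])
    summary = []
    for j in range(n_lfs):
        covered = overlap = conflict = 0
        for row in L:
            if row[j] == ABSTAIN:
                continue
            covered += 1
            # Votes from the other LFs that did not abstain on this example.
            others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
            if others:
                overlap += 1
            if any(v != row[j] for v in others):
                conflict += 1
        summary.append({
            "coverage": covered / n_examples,
            "overlap": overlap / n_examples,
            "conflict": conflict / n_examples,
        })
    return summary

# Toy label matrix: 4 examples (rows) by 3 LFs (columns), binary classes.
L = [
    [0, -1, 0],
    [0, 1, -1],
    [-1, 1, 1],
    [-1, -1, -1],
]
summary = lf_summary(L)
```

On this toy matrix every LF covers half the examples; the first two LFs each conflict on one example in four, and the third never conflicts.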
Step 4: Train the label model. The LabelModel class fits a generative model over the label matrix. The simplest variant assumes LFs are conditionally independent given the true label and learns a single accuracy parameter per LF. More elaborate variants model dependencies between LFs, either user-specified or learned from data [2].
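To illustrate the simplest variant, the sketch below estimates one accuracy parameter per LF from votes alone, using a Dawid-Skene-style expectation-maximization loop. This is a simplified stand-in chosen for readability, not the optimization procedure used by Snorkel's `LabelModel`. It assumes binary labels, a uniform class prior, conditionally independent LFs, and accuracies above 0.5; the synthetic data and all names are hypothetical.

```python
import math
import random

ABSTAIN = -1

def posterior_positive(votes, weights):
    """P(y = 1 | votes) given per-LF log-odds weights and a uniform prior."""
    log_odds = 0.0
    for vote, w in zip(votes, weights):
        if vote != ABSTAIN:
            log_odds += w if vote == 1 else -w
    return 1.0 / (1.0 + math.exp(-log_odds))

def fit_accuracies(L, n_iter=100, init=0.7):
    """Estimate one accuracy per LF with EM, never looking at true labels."""
    n_lfs = len(L[0])
    acc = [init] * n_lfs  # starting above 0.5 encodes "better than random"
    for _ in range(n_iter):
        weights = [math.log(a / (1.0 - a)) for a in acc]
        # E-step: soft guess of each example's label under current accuracies.
        post = [posterior_positive(row, weights) for row in L]
        # M-step: accuracy = expected fraction of an LF's votes that are correct.
        for j in range(n_lfs):
            correct = total = 0.0
            for row, p in zip(L, post):
                if row[j] == ABSTAIN:
                    continue
                total += 1.0
                correct += p if row[j] == 1 else 1.0 - p
            if total:
                acc[j] = min(max(correct / total, 0.01), 0.99)  # keep logs finite
    return acc

# Synthetic votes from three LFs with known accuracies, for checking only.
rng = random.Random(0)
true_acc = [0.85, 0.75, 0.70]
L = []
for _ in range(5000):
    y = rng.randint(0, 1)
    L.append([y if rng.random() < a else 1 - y for a in true_acc])

estimated = fit_accuracies(L)
```

The true labels `y` are used only to generate the votes; `fit_accuracies` sees the label matrix alone, which mirrors the setting the label model operates in.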
Step 5: Train the end model. The probabilistic labels produced by the label model are used to train any standard discriminative model, often a neural text classification network, a gradient-boosted tree, or in modern usage a fine-tuned transformer. Snorkel does not prescribe an end model; users plug in whatever suits the task.
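Training on probabilistic labels usually means minimizing cross-entropy against soft targets rather than hard classes. The sketch below does this for a one-feature logistic regression in plain Python; the feature values and probabilistic labels are made up, and a real end model would be whatever classifier suits the task.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_soft_logreg(X, P, lr=0.1, epochs=2000):
    """Logistic regression fit to probabilistic labels P with gradient descent.

    The loss is cross-entropy against soft targets, so an example labeled 0.7
    pulls the model less strongly toward class 1 than one labeled 0.95.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, p in zip(X, P):
            pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = pred - p  # gradient of soft-target cross-entropy w.r.t. the logit
            for k, xi in enumerate(x):
                grad_w[k] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical single-feature examples with label-model probabilities P(y = 1).
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
P = [0.05, 0.10, 0.30, 0.70, 0.90, 0.95]

w, b = train_soft_logreg(X, P)
low = sigmoid(w[0] * 0.0 + b)   # predicted P(y = 1) at the low end of the feature
high = sigmoid(w[0] * 5.0 + b)  # predicted P(y = 1) at the high end
```

Because the model sees the input features rather than the LF votes, it can score examples on which every LF abstained.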
Step 6: Iterate. The pipeline is meant to be fast to iterate on. Developers inspect labeled examples, find mistakes, write new LFs to cover failure modes, drop or rewrite LFs that hurt performance, and rerun the pipeline. Each cycle takes minutes to hours, compared with weeks for hand-labeling campaigns.
The whole pipeline is built on standard Python infrastructure including NumPy, pandas, Python decorators, and PyTorch for the label model's gradient-based training. There is no separate annotation server, no labeling UI requirement, and no pretrained model dependency.
Snorkel introduced or popularized several technical ideas that outlasted the framework itself.
Generative label model with no labels. The 2016 data programming paper proved that under certain conditions, the accuracies of labeling functions can be estimated consistently from their outputs alone, without ground truth, as long as the LFs are at least slightly better than random and not perfectly correlated [1]. This result generalized prior work on Dawid-Skene models for crowdsourced labels.
Structure learning for LF dependencies. Bach, He, Ratner, and Re (2017) showed how to detect statistical dependencies among labeling functions automatically, since the simplifying assumption of conditional independence often fails when LFs share data sources [2].
MeTaL (Multi-Task Label model). Ratner, Hancock, Dunnmon, Sala, Pandey, and Re (2018) extended Snorkel's label model to multi-task settings where many related labels are produced jointly, exploiting label hierarchies and shared structure [4]. Snorkel MeTaL was released as part of the Snorkel project's later versions.
Transformation functions. Ratner, Ehrenberg, Hussain, Dunnmon, and Re (2017) introduced a parallel idea for data augmentation: write functions that modify examples (paraphrase, crop, perturb) and learn to compose them effectively [5].
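A rough sketch of the idea for text: each transformation function makes one small edit, and augmented copies are produced by applying a short sequence of them. The cited work learns which sequences to apply; the uniform random choice below is a simple stand-in for that learned policy, and the synonym table and function names are hypothetical.

```python
import random

SYNONYMS = {"refund": "reimbursement", "error": "failure"}  # hypothetical table

def tf_swap_synonym(text, rng):
    """Replace one known word with its synonym, if any is present."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    if candidates:
        i = rng.choice(candidates)
        words[i] = SYNONYMS[words[i]]
    return " ".join(words)

def tf_drop_word(text, rng):
    """Delete one random word, leaving very short texts untouched."""
    words = text.split()
    if len(words) > 3:
        del words[rng.randrange(len(words))]
    return " ".join(words)

def tf_swap_adjacent(text, rng):
    """Swap one random pair of neighboring words."""
    words = text.split()
    if len(words) > 1:
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

TFS = [tf_swap_synonym, tf_drop_word, tf_swap_adjacent]

def augment(text, rng, seq_len=2, n_copies=3):
    """Create n_copies variants, each by composing seq_len randomly chosen TFs."""
    copies = []
    for _ in range(n_copies):
        variant = text
        for tf in rng.choices(TFS, k=seq_len):
            variant = tf(variant, rng)
        copies.append(variant)
    return copies

rng = random.Random(0)
augmented = augment("please refund my last payment", rng)
```

Each variant keeps the original label, so the augmented copies can be added to the training set without further annotation.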
Slicing functions. Later versions of Snorkel introduced slicing functions for stratified evaluation. Slicing functions tag subsets of data on which the user cares about model performance, such as edge cases, demographic groups, or rare categories. Slice-aware models can attend to these slices specifically, improving performance on the long tail.
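The evaluation side of slicing can be sketched as a per-slice accuracy report. A slicing function here is just a predicate over an example; the two slices, the toy predictions, and the helper `slice_accuracy` are hypothetical and only cover stratified evaluation, not slice-aware training.

```python
def sf_short(example):
    """Slice: very short tickets, which tend to lack context."""
    return len(example["text"].split()) < 4

def sf_mentions_refund(example):
    """Slice: tickets that mention refunds."""
    return "refund" in example["text"].lower()

def slice_accuracy(examples, preds, golds, slicing_functions):
    """Accuracy overall and on each slice; None when a slice is empty."""
    report = {}
    for sf in slicing_functions:
        members = [i for i, ex in enumerate(examples) if sf(ex)]
        if members:
            hits = sum(preds[i] == golds[i] for i in members)
            report[sf.__name__] = hits / len(members)
        else:
            report[sf.__name__] = None
    report["overall"] = sum(p == g for p, g in zip(preds, golds)) / len(golds)
    return report

examples = [
    {"text": "refund please"},
    {"text": "app crashed"},
    {"text": "I would like a refund for my order"},
    {"text": "The app crashes when I open settings"},
    {"text": "hi"},
]
preds = [0, 0, 0, 1, 2]  # hypothetical model predictions
golds = [0, 1, 0, 1, 2]  # hypothetical ground truth on a small labeled set

report = slice_accuracy(examples, preds, golds, [sf_short, sf_mentions_refund])
```

In this toy report the overall accuracy of 0.8 hides a weaker result on short tickets, which is the kind of gap slicing is meant to surface.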
FlyingSquid. Fu, Chen, Sala, Hooper, Fatahalian, and Re (2020) introduced a closed-form label model that runs orders of magnitude faster than gradient-based variants by exploiting triplet methods, released as a successor system named FlyingSquid [6].
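The triplet idea can be sketched in a few lines. With votes coded as -1 and +1 (here -1 is the negative class, not an abstain) and three conditionally independent LFs with class-symmetric accuracies, the expected product of two LFs' votes equals the product of their individual correlations with the true label. Three pairwise agreement rates therefore determine each LF's accuracy in closed form. The code below is an illustration under those assumptions, not the FlyingSquid implementation; the synthetic data and names are hypothetical.

```python
import math
import random

def mean_product(L, i, j):
    """Empirical E[lambda_i * lambda_j] over the rows of L."""
    return sum(row[i] * row[j] for row in L) / len(L)

def triplet_accuracies(L):
    """Closed-form accuracy estimates for exactly three LFs voting in {-1, +1}.

    Assumes each LF is better than random, which fixes the sign of the root.
    """
    m01 = mean_product(L, 0, 1)
    m02 = mean_product(L, 0, 2)
    m12 = mean_product(L, 1, 2)
    # |E[lambda_i * y]| = sqrt(|M_ij * M_ik / M_jk|) for the triplet (i, j, k).
    mu = [
        math.sqrt(abs(m01 * m02 / m12)),
        math.sqrt(abs(m01 * m12 / m02)),
        math.sqrt(abs(m02 * m12 / m01)),
    ]
    # Convert correlation with the label into accuracy, capping sampling noise.
    return [(1.0 + min(u, 1.0)) / 2.0 for u in mu]

# Synthetic votes from three LFs with known accuracies, for checking only.
rng = random.Random(0)
true_acc = [0.85, 0.75, 0.70]
L = []
for _ in range(20000):
    y = rng.choice([-1, 1])
    L.append([y if rng.random() < a else -y for a in true_acc])

estimated = triplet_accuracies(L)
```

No iterative optimization is involved: the estimate is three averages and three square roots, which is where the speed advantage over gradient-based label models comes from.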
Snorkel AI was founded in 2019 by Alex Ratner, Braden Hancock, Henry Ehrenberg, Christopher Re, and Paroma Varma, spinning out of Stanford. The company is headquartered in the San Francisco Bay Area, with an early emphasis on the Palo Alto and Redwood City corridor. Ratner became chief executive officer; Re served as chief technologist; Hancock, Ehrenberg, and Varma took technical leadership roles.
The company's product is Snorkel Flow, a no-code and low-code platform that operationalizes the Snorkel methodology for enterprise teams. Snorkel Flow adds a graphical interface for writing and inspecting labeling functions, integrations with enterprise data warehouses, model training and evaluation tools, and active-learning-style suggestions. It targets organizations that have unlabeled data in regulated or specialized domains where outsourced crowdsourcing is impractical, including financial services, healthcare, insurance, and government.
Snorkel AI raised capital across several rounds in its first few years [7][8][9][10]:
Publicly disclosed customers and partners over the years have included Pixar for content workflows, BNY Mellon for financial document processing, Chubb in insurance, and a roster of government, life sciences, and telecommunications organizations. Specific deployments are described in case studies and conference talks rather than peer-reviewed papers.
From roughly 2022 onward, Snorkel AI shifted strategic focus toward data development for large language models. Where the original Snorkel pitch was "label your supervised dataset programmatically," the LLM-era pitch is "build the curated SFT, preference, and evaluation datasets that let a foundation model perform on your domain." Snorkel Flow added support for instruction tuning datasets, preference data for RLHF and direct preference optimization, and structured evaluation harnesses. The pivot reflects a broader industry move from training models from scratch to adapting foundation models with high-quality, domain-specific data.
In 2019, a joint Stanford and Google team published Snorkel DryBell at SIGMOD, describing what was then the largest documented industrial deployment of weak supervision [3]. The system was used inside Google to label training data for content classifiers operating on web-scale traffic.
Key findings from the paper [3]:
DryBell validated that weak supervision could work at industrial scale outside academic benchmarks. The paper is often cited as the moment Snorkel moved from a research curiosity to a credible production technique.
Snorkel and its descendants have been applied across many domains. Some recurring categories:
Weak supervision sits in a crowded landscape of techniques for getting more labels for less effort. The table below summarizes how Snorkel relates to common alternatives.
| Approach | How labels are produced | Strengths | Weaknesses |
|---|---|---|---|
| Manual hand-labeling | Annotators label each example | Highest accuracy ceiling, easy to reason about | Expensive, slow, hard to scale, hard to update |
| Crowdsourcing (Mechanical Turk, Scale AI, Surge AI) | Distributed workforce labels examples | Scales to millions of labels, mature pipelines | Quality variance, cost, weak on specialist domains |
| Active learning | Model picks high-value examples for human labels | Reduces total labels needed | Still needs humans, doesn't help when no labels exist |
| Distant supervision | Aligns examples with existing knowledge base | Cheap, leverages existing data | Coverage limited by KB quality, noisy alignments |
| Snorkel weak supervision | LF votes combined by a label model | Captures expert heuristics, fast iteration, no labels needed | Needs LF authors, label quality bounded by LF coverage |
| FlyingSquid | Closed-form label model over LFs | Faster than Snorkel's gradient-based model | Same LF authoring requirement |
| Skweak | Snorkel-style API specialized for NLP | Sequence labeling support | Narrower scope |
| Refinery / Kern.ai | Combined manual labeling and weak supervision UI | Smooth onboarding for new teams | Commercial, smaller community |
| Cleanlab | Detects and fixes label errors in existing data | Improves accuracy of labeled sets | Requires labels to clean |
| Semi-supervised learning (FixMatch, MixMatch) | Pseudolabels from a model on unlabeled data | No human input after a seed | Needs an initial labeled seed, can amplify errors |
| Self-supervised learning | Pretext tasks generate labels automatically | Foundation for modern foundation models | Different paradigm, doesn't target a specific task |
In practice teams often combine these approaches. A common pattern: use Snorkel-style LFs to bootstrap a large noisy training set, hand-label a small validation set, run cleanlab over labeled portions to fix mistakes, and add active learning on examples the model is most uncertain about.
Weak supervision is not a free lunch. Researchers and practitioners have documented several recurring limitations.
The open-source Snorkel repository (snorkel-team/snorkel on GitHub) is licensed under Apache 2.0 and remains available for research and prototyping. Its last major releases were in 2019 and 2020, after which active development slowed substantially as the core team moved to Snorkel AI. Community contributions continued for some time, but the project is not under heavy ongoing investment by the original authors. Many practitioners still use it for academic projects, replication studies, and small in-house pilots.
Snorkel Flow is the commercial product from Snorkel AI. It packages the open framework's ideas in a hosted enterprise platform with a GUI, integrations, role-based access, audit logs, and managed compute. Snorkel Flow is closed source and sold as a subscription, primarily to large enterprises in regulated industries.
Several lines of follow-on work in academia and industry took the ideas in different directions:
Snorkel as a brand now refers to two distinct things: the academic methodology and the commercial company. Both are alive; they share founders and ideas but have diverged in scope and audience.