Snorkel

Updated Jun 23, 2026

Snorkel is an open-source software framework and methodology for programmatic data labeling that started at Stanford University in 2015 and is licensed under Apache 2.0 ^[13]^[18]. Snorkel implements weak supervision by letting users write labeling functions (LFs), small Python functions that capture noisy heuristics, distant supervision signals, and existing knowledge bases. A statistical label model then combines the outputs of these labeling functions, learning their accuracies and correlations without ground truth, to produce probabilistic training data labels for downstream models ^[1]^[2]. The Stanford team authored over sixty peer-reviewed publications on Snorkel and related weak supervision research, and deployed early versions with organizations including Google, Intel, and Stanford Medicine ^[13].

The underlying paradigm, called data programming, was introduced by Alex Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, and Christopher Re at NeurIPS 2016 ^[1]. The Snorkel system itself was presented at VLDB 2017 ^[2]. Snorkel triggered a wave of research and tooling around weak supervision and influenced the modern data-centric AI movement, which argues that improving training data is often more impactful than tuning model architectures.

Snorkel is also the namesake of Snorkel AI, a venture-backed company founded in 2019 by core members of the Stanford team. Snorkel AI built the open framework into a commercial platform called Snorkel Flow and raised $235 million in total venture funding through a May 2025 Series D that valued it at $1.3 billion ^[11]. By the mid-2020s it had pivoted toward enterprise data development for large language models, including curated datasets for fine-tuning, RLHF, and evaluation.

What problem does Snorkel solve?

Classical supervised learning needs a labeled training set, and for many real tasks that set is slow, expensive, or impossible to hand-build. Stanford framed this as the central bottleneck in applied machine learning. As Christopher Re put it, "It was a bottleneck and pain point so many were facing. The data doesn't just 'show up' out of nowhere" ^[14]. Snorkel's answer is to have domain experts encode their knowledge as code rather than as per-example annotations. In one Stanford Hospital collaboration, work that previously took "literal person-years" was completed in "just hours" using the Snorkel approach ^[14].

Origins

Snorkel grew out of the Stanford DAWN lab, a research group co-led by Christopher Re focused on systems for machine learning. The earliest precursor was DeepDive, a knowledge base construction system that Re's group had developed throughout the early 2010s. DeepDive used a combination of distant supervision, rule-based features, and probabilistic inference to extract structured records from text. Building DeepDive applications required heavy engineering, and the group looked for a more general way to express domain knowledge as supervision.

The project formally began in 2015 on a simple bet: that increasingly it would be the training data, not the models, algorithms, or infrastructure, that decided whether a machine learning project succeeded or failed ^[13]. The key conceptual breakthrough came in 2016 with the data programming paper ^[1]. Rather than asking users to label individual examples, the paper proposed asking them to write functions that label many examples at once, then learn the accuracies of those functions automatically. Each function could be wrong on some examples, and different functions could disagree, but their patterns of agreement and disagreement could be exploited to recover an estimate of the true label distribution.

A year later, the team released Snorkel as an open-source system implementing this idea, described in a VLDB 2017 paper authored by Alex Ratner, Stephen Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Re ^[2]. The first public version was released on GitHub under the Apache 2.0 license and quickly drew interest from companies that needed large training sets but could not afford to hand-label them. Adoption inside Google was documented in a 2019 SIGMOD paper on Snorkel DryBell ^[3].

The authorship history matters because the later commercialization through Snorkel AI retained continuity with the academic project: Ratner, Ehrenberg, Braden Hancock, Paroma Varma, and Re were among the company's founders in 2019.

The data programming paradigm

Classical supervised learning treats training labels as fixed inputs. Someone, usually a human annotator, attaches a label to each training example, and a model learns to predict labels from features. This works well when labeled data is abundant and cheap. It fails when labels require domain expertise, when the dataset is large enough that hand-labeling is uneconomical, or when label requirements change frequently.

Data programming reframes labeling as a programming task. Instead of providing per-example labels, a developer writes labeling functions. A labeling function is any callable that takes an example and returns a label or abstains. Common sources of labeling functions include:

Regular expressions and pattern matches that recognize keywords or phrases.
Dictionary or gazetteer lookups against curated lists of entities.
Distant supervision that aligns examples with rows in an existing knowledge base.
Crowd worker outputs treated as noisy votes.
Outputs from existing classifiers, including older models, off-the-shelf APIs, or rule-based systems.
Heuristics from domain experts, encoded as Python conditionals.
Outputs from large language models used as zero-shot labelers, a usage pattern that grew popular after 2022.

LFs are explicitly noisy. Some are accurate but cover only a slice of the data, others are broad but error-prone, and some conflict with each other. Snorkel applies all LFs to a pool of unlabeled data and produces a label matrix of shape (n_examples, n_LFs), where each cell holds one LF's vote on one example. In the open-source library a vote is an integer class index, with -1 reserved as the abstain marker.
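As a rough illustration of the idea, the sketch below builds a label matrix in plain Python rather than with the snorkel package. The label scheme, the three labeling functions, and the `apply_lfs` helper are all hypothetical names chosen for this example.

```python
import re

# Hypothetical label scheme for a support-ticket task; -1 marks an abstain.
ABSTAIN, BILLING, TECHNICAL = -1, 0, 1

def lf_refund(text):
    """Regex heuristic: a mention of refunds suggests a billing ticket."""
    return BILLING if re.search(r"\brefund", text, re.IGNORECASE) else ABSTAIN

def lf_error_terms(text):
    """Keyword list: technical vocabulary suggests a technical ticket."""
    keywords = ("error", "crash", "timeout")
    return TECHNICAL if any(k in text.lower() for k in keywords) else ABSTAIN

def lf_invoice(text):
    """Keyword heuristic: invoices usually belong to billing."""
    return BILLING if "invoice" in text.lower() else ABSTAIN

LFS = [lf_refund, lf_error_terms, lf_invoice]

def apply_lfs(lfs, examples):
    """Return the label matrix: one row per example, one column per LF."""
    return [[lf(x) for lf in lfs] for x in examples]

tickets = [
    "Please refund my last invoice",
    "App crashes with a timeout error",
    "Where is your office?",
]
L = apply_lfs(LFS, tickets)
for row in L:
    print(row)
```

The third ticket receives only abstains, which is the normal case for examples that no heuristic covers.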

The label model is then trained on the label matrix alone, with no ground-truth labels. It estimates each LF's accuracy and the correlations between LFs by examining their pattern of agreements and disagreements ^[1]. The intuition is that if many independent, accurate LFs agree on an example, the underlying label is probably what they agree on; if they disagree, their estimated accuracies decide which is more likely correct. The label model outputs a probabilistic label for each example, which can be used directly to train a discriminative end model.
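The "accuracies decide" intuition can be sketched for a binary task as a log-odds sum. This is a simplification, not Snorkel's implementation: it assumes conditionally independent LFs, and the accuracies are supplied by hand here, whereas the label model estimates them from the label matrix. The function name `posterior_positive` is hypothetical.

```python
import math

ABSTAIN = -1

def posterior_positive(votes, accuracies, prior=0.5):
    """P(y = 1 | votes), assuming LFs are conditionally independent given y.

    votes: one entry per LF, each 0, 1, or ABSTAIN.
    accuracies: P(vote == y) for each LF when it does not abstain.
    """
    log_odds = math.log(prior / (1 - prior))
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue  # an abstain carries no evidence in this simplified model
        weight = math.log(acc / (1 - acc))
        log_odds += weight if vote == 1 else -weight
    return 1 / (1 + math.exp(-log_odds))

# One accurate LF votes 0, two weak LFs vote 1.
accuracies = [0.9, 0.6, 0.6]
p = posterior_positive([0, 1, 1], accuracies)
print(round(p, 3))  # 0.2
```

A plain majority vote would pick class 1 here, while the accuracy-weighted combination favors the single reliable LF and assigns class 1 a probability of only 0.2.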

A crucial empirical observation, reported across multiple Snorkel papers, is that the final discriminative model often outperforms the label model itself, and even outperforms any single LF ^[2]^[3]. This happens because the end model is trained on the rich features of the input (text, images, structured fields), so it can generalize beyond the patterns captured by the labeling functions. The label model only sees LF outputs; the end model sees the world.
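One way an end model can consume probabilistic labels is to minimize cross-entropy against the soft targets instead of hard classes. The sketch below does this for a one-feature logistic regression in plain Python. The data, the hyperparameters, and the function `train_soft_logreg` are illustrative assumptions, not part of Snorkel.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train_soft_logreg(xs, soft_labels, lr=0.5, epochs=200):
    """Fit weight w and bias b by gradient descent on soft-target cross-entropy."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, p in zip(xs, soft_labels):
            # Gradient of the cross-entropy loss with respect to the logit.
            err = sigmoid(w * x + b) - p
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Toy feature values with probabilistic labels P(y = 1), as a label model might emit.
xs = [-2.0, -1.0, 1.0, 2.0]
soft = [0.1, 0.3, 0.7, 0.9]
w, b = train_soft_logreg(xs, soft)
print(round(w, 2), round(sigmoid(w * 3.0 + b), 2))
```

Because the model is a function of the input feature rather than of LF votes, it can score a new value such as 3.0 that no labeling function ever saw.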

How does the Snorkel workflow work?

The Snorkel pipeline is iterative and roughly follows six stages.

Step 1: Define the schema. Specify the classes, label space, and any business constraints. A typical example: classify customer support tickets as billing, technical, or other.

Step 2: Write labeling functions. Each LF is a Python function decorated with @labeling_function() that takes a row and returns a class index or ABSTAIN. A starting set might include a regex for the word "refund" mapping to billing, a keyword list for technical terms, and a rule that abstains on short tickets.

Step 3: Apply LFs to unlabeled data. Snorkel's PandasLFApplier (or its parallel and Spark variants) runs every LF on every example, producing the label matrix L. Diagnostic functions in Snorkel report each LF's coverage, overlap with other LFs, conflict rate, and empirical accuracy on a small held-out development set.

Step 4: Train the label model. The LabelModel class fits a generative model over the label matrix. The simplest variant assumes LFs are conditionally independent given the true label and learns a single accuracy parameter per LF. More elaborate variants model dependencies between LFs, either user-specified or learned from data ^[2].

Step 5: Train the end model. The probabilistic labels produced by the label model are used to train any standard discriminative model, often a neural text classification network, a gradient-boosted tree, or in modern usage a fine-tuned transformer. Snorkel does not prescribe an end model; users plug in whatever suits the task.

Step 6: Iterate. The pipeline is meant to be fast to iterate on. Developers inspect labeled examples, find mistakes, write new LFs to cover failure modes, drop or rewrite LFs that hurt performance, and rerun the pipeline. Each cycle takes minutes to hours, compared with weeks for hand-labeling campaigns.
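The coverage, overlap, and conflict diagnostics from Step 3 drive most of this iteration loop. The sketch below computes them directly from a label matrix in plain Python. The helper `lf_summary` is a hypothetical stand-in for Snorkel's own analysis utilities, and the toy matrix is invented for the example.

```python
ABSTAIN = -1

def lf_summary(L):
    """Per-LF coverage, overlap, and conflict fractions for label matrix L."""
    n, m = len(L), len(L[0])
    stats = []
    for j in range(m):
        covered = overlaps = conflicts = 0
        for row in L:
            if row[j] == ABSTAIN:
                continue
            covered += 1
            # Votes from every other LF that did not abstain on this example.
            others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
            if others:
                overlaps += 1
            if any(v != row[j] for v in others):
                conflicts += 1
        stats.append({
            "coverage": covered / n,
            "overlap": overlaps / n,
            "conflict": conflicts / n,
        })
    return stats

# Toy label matrix: 4 examples, 3 LFs.
L = [
    [0, 0, -1],
    [1, 0, -1],
    [-1, 1, 1],
    [-1, -1, -1],
]
stats = lf_summary(L)
for j, s in enumerate(stats):
    print(j, s)
```

An LF with high conflict and low coverage is a natural candidate to rewrite or drop in the next cycle.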

The whole pipeline is built on standard Python infrastructure including NumPy, pandas, Python decorators, and PyTorch for the label model's gradient-based training. There is no separate annotation server, no labeling UI requirement, and no pretrained model dependency.

Core technical contributions

Snorkel introduced or popularized several technical ideas that outlasted the framework itself.

Generative label model with no labels. The 2016 data programming paper proved that under certain conditions, the accuracies of labeling functions can be estimated consistently from their outputs alone, without ground truth, as long as the LFs are at least slightly better than random and not perfectly correlated ^[1]. This result generalized prior work on Dawid-Skene models for crowdsourced labels.

Structure learning for LF dependencies. Bach, He, Ratner, and Re (2017) showed how to detect statistical dependencies among labeling functions automatically, since the simplifying assumption of conditional independence often fails when LFs share data sources ^[15].

MeTaL (Multi-Task Label model). Ratner, Hancock, Dunnmon, Sala, Pandey, and Re (2018) extended Snorkel's label model to multi-task settings where many related labels are produced jointly, exploiting label hierarchies and shared structure ^[4]. Snorkel MeTaL was released as part of the Snorkel project's later versions.

Transformation functions. Ratner, Ehrenberg, Hussain, Dunnmon, and Re (2017) introduced a parallel idea for data augmentation: write functions that modify examples (paraphrase, crop, perturb) and learn to compose them effectively ^[5].
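A minimal sketch of the transformation-function idea for text follows. It makes one large simplification: the paper learns how to compose transformations, whereas this example samples sequences uniformly at random. All function names and the synonym table are hypothetical.

```python
import random

def tf_swap_synonym(text):
    """Replace a few known words with synonyms (toy dictionary)."""
    synonyms = {"refund": "reimbursement", "error": "fault"}
    return " ".join(synonyms.get(w, w) for w in text.split())

def tf_drop_last_word(text):
    """Drop the final word unless that would empty the text."""
    words = text.split()
    return " ".join(words[:-1]) if len(words) > 1 else text

def tf_lowercase(text):
    return text.lower()

TFS = [tf_swap_synonym, tf_drop_last_word, tf_lowercase]

def augment(text, tfs, sequence_length=2, n_copies=3, seed=0):
    """Apply randomly sampled TF sequences; a learned policy would replace rng.choices."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_copies):
        x = text
        for tf in rng.choices(tfs, k=sequence_length):
            x = tf(x)
        augmented.append(x)
    return augmented

original = "Please issue a refund today"
augmented = augment(original, TFS)
for a in augmented:
    print(a)
```

Each augmented copy keeps the original label, so the transformations must be ones a domain expert considers label-preserving.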

Slicing functions. Later versions of Snorkel introduced slicing functions for stratified evaluation. Slicing functions tag subsets of data on which the user cares about model performance, such as edge cases, demographic groups, or rare categories. Slice-aware models can attend to these slices specifically, improving performance on the long tail.
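The evaluation side of slicing can be sketched as follows: each slicing function is a predicate over examples, and accuracy is reported per slice rather than only overall. The example covers stratified evaluation only, not slice-aware training. The slices, data, and `accuracy_by_slice` helper are assumptions made for illustration.

```python
def sf_short(example):
    """Hypothetical slice: tickets with fewer than four words."""
    return len(example["text"].split()) < 4

def sf_mentions_refund(example):
    """Hypothetical slice: tickets that mention a refund."""
    return "refund" in example["text"].lower()

def accuracy_by_slice(examples, predictions, slicing_functions):
    """Map each slicing function's name to accuracy on its members (None if empty)."""
    report = {}
    for sf in slicing_functions:
        members = [i for i, ex in enumerate(examples) if sf(ex)]
        if not members:
            report[sf.__name__] = None
            continue
        correct = sum(predictions[i] == examples[i]["label"] for i in members)
        report[sf.__name__] = correct / len(members)
    return report

examples = [
    {"text": "Refund please", "label": 0},
    {"text": "App crashed", "label": 1},
    {"text": "I want a refund for my last order", "label": 0},
    {"text": "Login page shows a timeout error again", "label": 1},
]
predictions = [0, 0, 0, 1]
report = accuracy_by_slice(examples, predictions, [sf_short, sf_mentions_refund])
print(report)  # {'sf_short': 0.5, 'sf_mentions_refund': 1.0}
```

Overall accuracy on this toy set is 0.75, which hides the weaker 0.5 on short tickets that the per-slice report exposes.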

FlyingSquid. Fu, Chen, Sala, Hooper, Fatahalian, and Re (2020) introduced a closed-form label model that runs orders of magnitude faster than gradient-based variants by exploiting triplet methods, released as a successor system named FlyingSquid ^[6].
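The core of the triplet idea can be shown in a simplified setting. Assume binary labels in {-1, +1}, exactly three LFs that never abstain, conditional independence given the true label, and better-than-random accuracy. Then E[λ_i λ_j] = a_i a_j with a_i = 2·P(λ_i = y) − 1, so each a_i follows in closed form from three pairwise moments. The sketch below checks this on simulated votes. It is not the FlyingSquid code, and every name in it is hypothetical.

```python
import math
import random

def simulate_votes(n, accuracies, seed=0):
    """Draw hidden labels y in {-1, +1} and noisy LF votes; y is never returned."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.choice((-1, 1))
        rows.append([y if rng.random() < p else -y for p in accuracies])
    return rows

def second_moment(rows, i, j):
    """Empirical E[vote_i * vote_j], i.e. agreement rate minus disagreement rate."""
    return sum(r[i] * r[j] for r in rows) / len(rows)

def triplet_accuracies(rows):
    """Estimate P(LF == y) for three LFs from pairwise moments, without labels."""
    m01 = second_moment(rows, 0, 1)
    m02 = second_moment(rows, 0, 2)
    m12 = second_moment(rows, 1, 2)
    # a_i = E[vote_i * y]; the sign is taken positive (better-than-random LFs).
    a0 = math.sqrt(abs(m01 * m02 / m12))
    a1 = math.sqrt(abs(m01 * m12 / m02))
    a2 = math.sqrt(abs(m02 * m12 / m01))
    return [(1 + a) / 2 for a in (a0, a1, a2)]

true_accuracies = [0.9, 0.8, 0.7]
rows = simulate_votes(20000, true_accuracies)
estimates = triplet_accuracies(rows)
print([round(e, 2) for e in estimates])
```

No gradient steps are involved, which is the source of the speedup: the estimate is a handful of averages and square roots over the label matrix.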

Snorkel AI (the company)

Snorkel AI was founded in 2019 by Alex Ratner, Braden Hancock, Henry Ehrenberg, Christopher Re, and Paroma Varma, spinning out of Stanford after roughly four years of research on weak supervision ^[14]. The company is headquartered in the San Francisco Bay Area and emerged from stealth in July 2020 with $15 million already raised ^[19]. Ratner became chief executive officer; Re served as chief technologist; Hancock, Ehrenberg, and Varma took technical leadership roles. As Ratner described the company's premise, the goal was to "upskill the way that subject matter experts interface with new machine learning technology" ^[14].

The company's product is Snorkel Flow, a no-code and low-code platform that operationalizes the Snorkel methodology for enterprise teams. Snorkel Flow adds a graphical interface for writing and inspecting labeling functions, integrations with enterprise data warehouses, model training and evaluation tools, and active-learning-style suggestions. It targets organizations that hold unlabeled data in regulated or specialized domains where outsourced crowdsourcing is impractical, including financial services, healthcare, insurance, and government.

How much funding has Snorkel AI raised?

Snorkel AI raised $235 million in total venture funding across the rounds listed below in its first six years, reaching unicorn status in 2021 and a $1.3 billion valuation in 2025 ^[9]^[11]:

Round	Amount	Date	Lead investor(s)	Valuation
Seed / pre-stealth	$15M (disclosed on exit from stealth)	2019-2020	Greylock, GV	not disclosed
Series B	$35M	August 2020	Lightspeed Venture Partners	not disclosed
Series C	$85M	August 9, 2021	Addition and BlackRock	$1.0B ^[9]
Series D	$100M	May 29, 2025	Addition	$1.3B ^[11]

The $15 million the company had raised when it emerged from stealth in July 2020 came from Greylock and GV (Google Ventures), with participation from Lightspeed and the SAP.iO fund ^[19]. The $85 million Series C in August 2021 was co-led by Addition and BlackRock, with Greylock, GV, Lightspeed, Nepenthe Capital, and Walden International continuing, and brought total funding to $135 million at a $1 billion valuation ^[9]. The $100 million Series D announced on May 29, 2025 was led by Addition, with participation from Prosperity 7 Ventures, existing investors Greylock and Lightspeed, and strategic investors BNY and QBE Ventures, bringing total funding to $235 million ^[11].

Publicly disclosed customers and partners over the years have included financial services, healthcare, insurance, government, life sciences, and telecommunications organizations; by 2025 Snorkel reported working with 7 of the top 10 US banks, multiple Fortune 500 companies, and federal agencies ^[11]. Specific deployments are described in case studies and conference talks rather than peer-reviewed papers.

The pivot to LLM data development

From roughly 2022 onward, Snorkel AI shifted strategic focus toward data development for large language models. Where the original Snorkel pitch was "label your supervised dataset programmatically," the LLM-era pitch is "build the curated SFT, preference, and evaluation datasets that let a foundation model perform on your domain." Snorkel Flow added support for instruction tuning datasets, preference data for RLHF and direct preference optimization, and structured evaluation harnesses. The pivot reflects a broader industry move from training models from scratch to adapting foundation models with high-quality, domain-specific data.

Alongside the 2025 Series D, Snorkel announced the general availability of two products, Snorkel Evaluate (programmatic, fine-grained evaluation of models and AI agents) and Snorkel Expert Data-as-a-Service (managed expert data curation), aimed at getting enterprise agentic systems into production ^[11]. Alex Ratner framed the strategy this way: "We are seeing a surge of momentum around agentic AI, but specialized enterprise agents aren't ready for production in most settings. Enterprises need domain-specific data and expertise to make this a reality" ^[11].

Snorkel DryBell (Google deployment)

In 2019, a joint Stanford and Google team published Snorkel DryBell at SIGMOD, describing what was then the largest documented industrial deployment of weak supervision ^[3]. The system was used inside Google to label training data for content classifiers operating on web-scale traffic.

Key findings from the paper ^[3]:

DryBell allowed engineers to write LFs that drew on existing organizational signals such as legacy heuristics, knowledge graph entries, and previously trained classifiers, all of which already existed inside Google but were not directly usable as training labels.
For three production classification tasks, DryBell-trained models achieved quality comparable to models trained on hand-labeled data while requiring substantially less human annotation.
The system handled cross-feature LFs, where a labeling function might consult organizational knowledge graphs that the deployed model could not access at serving time.
Engineering throughput improved measurably: writing LFs and iterating on them was faster than running labeling campaigns through annotation vendors.

DryBell validated that weak supervision could work at industrial scale outside academic benchmarks. The paper is often cited as the moment Snorkel moved from a research curiosity to a credible production technique.

What is Snorkel used for?

Snorkel and its descendants have been applied across many domains. Some recurring categories:

Information extraction from unstructured text. The original DeepDive and Snorkel papers focused on extracting relations from biomedical literature, news articles, and crawled web pages. LFs based on syntactic patterns, gazetteers, and distant supervision over knowledge bases are particularly natural here.
Text classification for support tickets, regulatory filings, clinical notes, legal documents, and content moderation. Each domain has rich rule-based heuristics that domain experts know but cannot easily turn into hand-labeled examples.
Image classification with image-aware LFs. Examples include LFs that consult metadata, fire on detected objects, or rely on simple computer vision filters. Stanford collaborators applied Snorkel to medical imaging tasks such as triaging chest X-rays.
Industrial knowledge bases. Building product catalogs, scientific knowledge bases, and entity registries from heterogeneous sources, often the original use case from the DeepDive lineage.
Fraud detection and risk scoring in financial services, where domain experts have decades of accumulated rules but rarely a clean labeled dataset.
Healthcare. Snorkel collaborations with Stanford Medicine and other partners produced systems for cohort selection, phenotype identification, and clinical decision support, where clinical heuristics translate naturally into LFs.
LLM training data curation. Post-2022, the methodology was repurposed to filter, label, and weight examples for instruction tuning and preference learning. Programmatic labeling is well suited to evaluating LLM outputs at scale, since each automated check can be encoded as a function.

How does Snorkel compare with other labeling approaches?

Weak supervision sits in a crowded landscape of techniques for getting more labels for less effort. The table below summarizes how Snorkel relates to common alternatives.

Approach	How labels are produced	Strengths	Weaknesses
Manual hand-labeling	Annotators label each example	Highest accuracy ceiling, easy to reason about	Expensive, slow, hard to scale, hard to update
Crowdsourcing (Mechanical Turk, Scale AI, Surge AI)	Distributed workforce labels examples	Scales to millions of labels, mature pipelines	Quality variance, cost, weak on specialist domains
Active learning	Model picks high-value examples for human labels	Reduces total labels needed	Still needs humans, doesn't help when no labels exist
Distant supervision	Aligns examples with existing knowledge base	Cheap, leverages existing data	Coverage limited by KB quality, noisy alignments
Snorkel weak supervision	LF votes combined by a label model	Captures expert heuristics, fast iteration, no hand labels needed for training	Needs LF authors, label quality bounded by LF coverage and accuracy
FlyingSquid	Closed-form label model over LFs	Faster than Snorkel's gradient-based model	Same LF authoring requirement
Skweak	Snorkel-style API specialized for NLP	Sequence labeling support	Narrower scope
Refinery / Kern.ai	Combined manual labeling and weak supervision UI	Smooth onboarding for new teams	Commercial, smaller community
Cleanlab	Detects and fixes label errors in existing data	Improves accuracy of labeled sets	Requires labels to clean
Semi-supervised learning (FixMatch, MixMatch)	Pseudolabels from a model on unlabeled data	No human input after a seed	Needs an initial labeled seed, can amplify errors
Self-supervised learning	Pretext tasks generate labels automatically	Foundation for modern foundation models	Different paradigm, doesn't target a specific task

In practice teams often combine these approaches. A common pattern: use Snorkel-style LFs to bootstrap a large noisy training set, hand-label a small validation set, run cleanlab over labeled portions to fix mistakes, and add active learning on examples the model is most uncertain about.

Limitations

Weak supervision is not a free lunch. Researchers and practitioners have documented several recurring limitations.

LF authoring requires expertise. Writing useful labeling functions usually demands someone who understands both the domain and basic Python. For tasks where heuristics are obvious (a regulatory keyword, a known medical code), this is easy. For subjective tasks (creative writing quality, complex sentiment), it is hard or impossible.
Coverage matters. If labeling functions collectively cover only 30 percent of the data, the label model has nothing to say about the other 70 percent. The end model trains on the labeled portion and is expected to generalize, but generalization fails when uncovered regions differ structurally.
Bounded by LF accuracy. The label model can correct random errors but cannot fix systematic biases shared across LFs. If every LF makes the same mistake on a subgroup, the trained model inherits the mistake.
Weaker than supervised learning when labels are abundant. Multiple studies have shown that on benchmarks where plentiful gold labels exist, supervised learning outperforms Snorkel-style weak supervision at the same compute budget. The benefit of weak supervision is mostly when labels are scarce or expensive.
Dependency assumptions. The label model's accuracy estimates depend on assumptions about the dependency structure of LFs. When LFs share underlying data sources or correlated errors, naive label models can be miscalibrated.
Not a replacement for evaluation labels. Even teams that label all training data programmatically still need a clean evaluation set, hand-labeled by humans, to measure progress. Without this, weak supervision pipelines can drift toward whatever the LFs happen to capture.
Maintainability. Large LF suites become legacy code. Assumptions drift as data distributions change. MLOps practices help, but weak supervision adds an extra layer of artifacts to maintain.

Is Snorkel open source?

The open-source Snorkel repository (snorkel-team/snorkel on GitHub) is licensed under Apache 2.0 and remains available for research and prototyping; its tagline is "A system for quickly generating training data with weak supervision" ^[13]. Its major releases came in 2019 and 2020, after which development slowed substantially as the core team moved to Snorkel AI, although maintenance releases continued (version 0.10.0 shipped in February 2024) ^[13]. Community contributions continued for some time, but the original authors no longer invest heavily in the project. Many practitioners still use it for academic projects, replication studies, and small in-house pilots.

Snorkel Flow is the commercial product from Snorkel AI. It packages the open framework's ideas in a hosted enterprise platform with a GUI, integrations, role-based access, audit logs, and managed compute. Snorkel Flow is closed source and sold as a subscription, primarily to large enterprises in regulated industries.

Several lines of follow-on work in academia and industry took the ideas in different directions:

FlyingSquid offered a faster label model and remains a useful drop-in alternative.
Skweak specializes the Snorkel API for sequence labeling tasks like named-entity recognition.
Cleanlab focuses on detecting and fixing errors in existing labeled datasets, complementary to Snorkel.
Refinery / Kern.ai combined manual and programmatic labeling in a single open-source tool.
Many large organizations have built private weak-supervision systems internally, inspired by Snorkel papers but tailored to their data infrastructure.

Snorkel as a brand now refers to two distinct things: the academic methodology and the commercial company. Both are alive; they share founders and ideas but have diverged in scope and audience.

References

1. Ratner, A., De Sa, C., Wu, S., Selsam, D., & Re, C. (2016). "Data Programming: Creating Large Training Sets, Quickly." Advances in Neural Information Processing Systems (NeurIPS) 29. https://arxiv.org/abs/1605.07723
2. Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., & Re, C. (2017). "Snorkel: Rapid Training Data Creation with Weak Supervision." Proceedings of the VLDB Endowment, 11(3). https://arxiv.org/abs/1711.10160
3. Bach, S. H., Rodriguez, D., Liu, Y., Luo, C., Shao, H., Xia, C., Sen, S., Ratner, A., Hancock, B., Alborzi, H., Kuchhal, R., Re, C., & Malkin, R. (2019). "Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale." Proceedings of SIGMOD 2019. https://arxiv.org/abs/1812.00417
4. Ratner, A., Hancock, B., Dunnmon, J., Sala, F., Pandey, S., & Re, C. (2018). "Training Complex Models with Multi-Task Weak Supervision." AAAI 2019. https://arxiv.org/abs/1810.02840
5. Ratner, A. J., Ehrenberg, H. R., Hussain, Z., Dunnmon, J., & Re, C. (2017). "Learning to Compose Domain-Specific Transformations for Data Augmentation." NeurIPS 2017. https://arxiv.org/abs/1709.01643
6. Fu, D. Y., Chen, M. F., Sala, F., Hooper, S., Fatahalian, K., & Re, C. (2020). "Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods." ICML 2020. https://arxiv.org/abs/2002.11955
7. Snorkel AI (2020). "Announcing Snorkel's $15M Series A and the Future of Machine Learning." Snorkel AI blog, April 2, 2020. https://snorkel.ai/snorkel-15m-series-a/
8. Snorkel AI (2020). "Snorkel AI raises $35M Series B, led by Lightspeed." Snorkel AI blog, August 18, 2020. https://snorkel.ai/snorkel-ai-raises-35m-series-b/
9. Snorkel AI (2021). "Snorkel AI Raises $85m Series C at $1b Valuation for Data-Centric AI." Snorkel AI blog, August 9, 2021. https://snorkel.ai/blog/85-million-series-c-accelerating-data-centric-ai-enterprise/
10. GlobeNewswire (2021). "Snorkel AI Raises $85 Million at $1 Billion Valuation for Data-Centric AI." August 9, 2021. https://www.globenewswire.com/news-release/2021/08/09/2277249/0/en/Snorkel-AI-Raises-85-Million-at-1-Billion-Valuation-for-Data-Centric-AI.html
11. Snorkel AI / BusinessWire (2025). "Snorkel AI Announces $100 Million Series D and Expanded Platform to Power Next Phase of AI with Expert Data." May 29, 2025. https://www.businesswire.com/news/home/20250529083998/en/Snorkel-AI-Announces-%24100-Million-Series-D-and-Expanded-Platform-to-Power-Next-Phase-of-AI-with-Expert-Data
12. Snorkel AI company website. https://snorkel.ai/
13. snorkel-team/snorkel on GitHub. https://github.com/snorkel-team/snorkel
14. Stanford HAI (2021). "Stanford Spin-Out Snorkel AI Solves a Major Data Problem." Stanford Institute for Human-Centered AI. https://hai.stanford.edu/news/stanford-spin-out-snorkel-ai-solves-major-data-problem
15. Bach, S. H., He, B., Ratner, A., & Re, C. (2017). "Learning the Structure of Generative Models without Labeled Data." ICML 2017. https://arxiv.org/abs/1703.00854
16. Hancock, B., Bringmann, M., Varma, P., Liang, P., Wang, S., & Re, C. (2018). "Training Classifiers with Natural Language Explanations." ACL 2018. https://arxiv.org/abs/1805.03818
17. Re, C., Bach, S., Ratner, A., Selsam, D., Wu, S., et al. "Overton: A Data System for Monitoring and Improving Machine-Learned Products." CIDR 2020. https://arxiv.org/abs/1909.05372
18. Snorkel project website. https://www.snorkel.org/
19. BusinessWire (2020). "Stanford AI Lab Spinout, Snorkel AI Emerges From Stealth With $15M in Funding From Greylock and GV to Make AI Practical." July 14, 2020. https://www.businesswire.com/news/home/20200714005758/en/Stanford-AI-Lab-Spinout-Snorkel-AI-Emerges-From-Stealth-With-%2415M-in-Funding-From-Greylock-and-GV-to-Make-AI-Practical
