# Weak supervision

> Source: https://aiwiki.ai/wiki/weak_supervision
> Updated: 2026-06-24
> Categories: Data & Datasets, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Weak supervision** is a machine learning paradigm in which models are trained from noisy, limited, imprecise, or programmatically generated labels rather than from large, expensively hand annotated datasets. Instead of paying experts to label every example, practitioners encode their knowledge as code, rules, heuristics, knowledge base lookups, or prompts to a large model, then statistically combine these noisy signals into probabilistic training labels for a downstream model. The approach is central to [data-centric AI](/wiki/data-centric_ai) and is most closely identified with [Snorkel](/wiki/snorkel), whose "data programming" lets users write labeling functions whose votes are denoised into training labels [3][4]. Weak supervision spans [machine learning](/wiki/machine_learning) techniques including [semi-supervised learning](/wiki/semi-supervised_learning), distant supervision, multiple instance learning, label noise modeling, and crowdsourced annotation, all sharing the goal of relieving the labeling bottleneck that dominates the cost of building [supervised learning](/wiki/supervised_learning) systems.

The label "weakly supervised learning" gained currency in the natural language processing community in the late 2000s with Mintz, Bills, Snow, and Jurafsky's 2009 paper "Distant supervision for relation extraction without labeled data," which used [Freebase](/wiki/freebase) facts to automatically label sentences [1]. The paradigm was sharpened during the mid-2010s by the Stanford group around [Christopher Re](/wiki/christopher_re), who introduced data programming in 2016 and the Snorkel system in 2017 [3][4]. Zhi-Hua Zhou's 2018 *National Science Review* article "A brief introduction to weakly supervised learning" gave the field a widely cited three-part taxonomy of incomplete, inexact, and inaccurate supervision [6].

## Why does weak supervision matter?

Classical supervised learning rests on the assumption that a large pool of high quality [labeled examples](/wiki/labeled_example) is available. In practice that assumption breaks down. Hand labeling is slow, expensive, and often requires scarce domain expertise. A radiologist annotating chest X-rays, an attorney tagging contract clauses, or a chemist marking active molecules all command rates that make six and seven figure label budgets routine. Even when budgets allow, the time required to build a corpus of millions of [labeled data](/wiki/labeled_data) points conflicts with modern model iteration speed.

Weak supervision is the practitioner's response to this pressure. Instead of paying a person to look at every example, a domain expert encodes their knowledge as code: regular expressions, lookups against [knowledge bases](/wiki/knowledge_base), heuristic rules, prompts to a large model, or bag level annotations on whole groups of examples. Each weak source produces noisy, incomplete, or biased labels. The system models that noise and combines the sources into a probabilistic training signal used to train a downstream discriminative model. The end model can generalize beyond what any single labeling source captured.

Where manual labeling scales linearly with dataset size, programmatic weak supervision scales with the number of labeling functions. A team can move from zero to a working classifier on tens of thousands of examples in days rather than months, then iterate by editing or adding rules. Snorkel AI states that its programmatic approach is "10-100x faster than traditional methods" of manual labeling and curation [20].

## What are the categories of weak supervision?

Zhi-Hua Zhou's 2018 review divides weak supervision into three broad types. The table below adds two further categories: distant or heuristic supervision, which the Snorkel literature treats separately, and programmatic supervision, which emerged with the rise of labeling function frameworks. Zhou frames the three core types as "incomplete supervision, where only a subset of training data is given with labels; inexact supervision, where the training data are given with only coarse-grained labels; and inaccurate supervision, where the given labels are not always ground-truth" [6].

| Category | Description | Representative technique |
| --- | --- | --- |
| Incomplete supervision | Only some training examples carry labels; the rest are unlabeled | [Semi-supervised learning](/wiki/semi-supervised_learning), [active learning](/wiki/active_learning) |
| Inexact supervision | Labels are provided at a coarser granularity than the prediction target | Multiple instance learning |
| Inaccurate (noisy) supervision | Labels are wrong some fraction of the time | Label noise robust training, Confident Learning, Co-teaching |
| Distant or heuristic supervision | Labels are generated automatically from external resources or rules | Distant supervision via knowledge bases, heuristic labeling |
| Programmatic, source based supervision | Labels come from multiple weak labeling functions and are combined by a label model | Data programming, Snorkel |

These categories overlap in practice. A medical imaging system might use bag level labels from radiology reports (inexact), augmented by a heuristic that flags scans from cancer wards (distant), and combined with a small set of expert annotations (incomplete).

## Where do weak labels come from?

Practitioners draw weak labels from a wide range of signals. The following table summarizes common sources and the kinds of noise they introduce.

| Source | Example | Typical noise profile |
| --- | --- | --- |
| Heuristics and rules | Regular expressions, keyword lists, dictionaries | High precision when the rule fires, low recall, many abstentions |
| Crowdsourcing | Amazon Mechanical Turk, Appen, Scale AI | Per-worker accuracy varies; systematic confusion between similar classes |
| Knowledge bases | Freebase, Wikidata, UMLS, DrugBank | High precision facts, missing entries, ambiguous mention alignment |
| Existing classifiers | Off-the-shelf models, legacy systems, transfer learners | Errors correlated with the source model's biases |
| User behavior signals | Clicks, dwell time, conversions | Confounded by ranking and position effects |
| Citation and metadata patterns | PubMed MeSH terms, citation graphs | Coverage limited to the indexed corpus |
| Hashtags, captions, alt text | Instagram tags, image alt text on the web | Self reported, often promotional rather than literal |
| Multi-rater agreement | Inter-annotator labels with majority voting or [Dawid-Skene](/wiki/dawid_skene_model) aggregation | Bias when raters share a common error |
| Large language model prompts | [GPT-4](/wiki/gpt_4) or [Claude](/wiki/claude) asked to label a batch | Hallucinations, prompt sensitivity, refusal patterns |

These sources are usually combined. Mintz et al. used Freebase plus syntactic features [1]. Snorkel users typically mix regex rules, knowledge base lookups, and small classifiers. Snorkel DryBell at Google composed labeling functions over feature stores, knowledge graphs, and existing internal models [5].

## How does the Snorkel framework work?

Snorkel is the canonical programmatic weak supervision system. It originated in Christopher Re's group at Stanford in 2015, was formalized as data programming in Ratner, De Sa, Wu, Selsam, and Re's 2016 NeurIPS paper [3], and was published as an end to end system in Ratner, Bach, Ehrenberg, Fries, Wu, and Re's 2017 VLDB paper "Snorkel: Rapid Training Data Creation with Weak Supervision" (arXiv:1711.10160) [4]. The core abstraction is the labeling function.

A labeling function (LF) is a small Python function that takes an unlabeled example and emits either a class label or an abstain token. Typical LFs encode pattern matches, lookups, or invocations of external services. For spam classification, an LF might check whether a message contains a phone number; for clinical NLP, an LF might mark a note as describing diabetes if a UMLS lookup finds the term. LFs are noisy, conflicting, and overlapping by design.
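
As a rough illustration of the idea, the sketch below writes three labeling functions for the spam example in plain Python. The label constants, function names, regular expression, and keyword list are all hypothetical and chosen for this example; it does not use the Snorkel library's own decorators or classes.

```python
import re

# Label constants for this sketch (hypothetical names, not Snorkel's API).
ABSTAIN, HAM, SPAM = -1, 0, 1

# Rough pattern for US-style phone numbers such as 555-123-4567.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def lf_contains_phone(text: str) -> int:
    """Vote SPAM if the message contains something shaped like a phone number."""
    return SPAM if PHONE_RE.search(text) else ABSTAIN


def lf_prize_keywords(text: str) -> int:
    """Vote SPAM on common prize-scam vocabulary."""
    keywords = ("free", "winner", "prize")
    return SPAM if any(k in text.lower() for k in keywords) else ABSTAIN


def lf_short_reply(text: str) -> int:
    """Vote HAM for very short messages, which are often personal replies."""
    return HAM if len(text.split()) <= 3 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_phone, lf_prize_keywords, lf_short_reply]

messages = [
    "Call 555-123-4567 to claim your FREE prize",
    "ok see you",
    "Meeting moved to Thursday afternoon, please confirm",
]

# One row of votes per message, one column per labeling function.
votes = [[lf(m) for lf in LABELING_FUNCTIONS] for m in messages]
print(votes)
```

The third message receives only abstentions, which is the typical low-coverage behavior of rule-based sources: no single LF is expected to label everything.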

Snorkel proceeds in three stages.

1. **Apply the labeling functions.** Each LF runs on every unlabeled example, producing a sparse label matrix with one column per LF and one row per example. Cells contain class votes or abstentions.
2. **Fit a label model.** A generative model is fit on the label matrix without ground truth. The model estimates each LF's accuracy and the correlations between LFs by exploiting the agreement and disagreement structure across the dataset. The output is a probabilistic label per example.
3. **Train an end model.** A standard discriminative model, such as a [transformer](/wiki/transformer) classifier or a [random forest](/wiki/random_forest), is trained on the probabilistic labels using a noise aware loss. The end model learns features beyond what the LFs explicitly captured, which is what allows it to generalize.
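
The three stages can be sketched end to end in a few dozen lines. This is a toy under stated assumptions, not Snorkel's implementation: the label model is replaced by a plain majority vote (Snorkel fits a generative model instead), the end model is a hand-rolled logistic regression, and the features, LFs, and function names are invented for the example.

```python
import math

ABSTAIN, HAM, SPAM = -1, 0, 1


# Each example is (exclamation_marks, links); the LFs are toy heuristics.
def lf_many_exclamations(x):
    return SPAM if x[0] >= 3 else ABSTAIN


def lf_has_link(x):
    return SPAM if x[1] >= 1 else ABSTAIN


def lf_no_signals(x):
    return HAM if x == (0, 0) else ABSTAIN


def apply_lfs(lfs, examples):
    """Stage 1: label matrix with one row per example and one column per LF."""
    return [[lf(x) for lf in lfs] for x in examples]


def majority_vote_probs(label_matrix, prior=0.5):
    """Stage 2 stand-in: P(SPAM) is the share of non-abstaining votes for SPAM.
    Rows where every LF abstains fall back to the prior."""
    probs = []
    for row in label_matrix:
        votes = [v for v in row if v != ABSTAIN]
        probs.append(sum(votes) / len(votes) if votes else prior)
    return probs


def predict_proba(model, x):
    """Probability of SPAM under a (weights, bias) logistic model."""
    w, b = model
    return 1.0 / (1.0 + math.exp(-(b + sum(wi * xi for wi, xi in zip(w, x)))))


def train_logistic(features, soft_labels, epochs=500, lr=0.5):
    """Stage 3: gradient descent on cross-entropy against soft targets,
    so probabilistic labels are used directly rather than rounded to 0 or 1."""
    w, b, n = [0.0] * len(features[0]), 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, p in zip(features, soft_labels):
            err = predict_proba((w, b), x) - p  # d(loss)/d(logit)
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


examples = [(3, 1), (4, 0), (0, 2), (0, 0), (0, 0), (1, 0)]
label_matrix = apply_lfs([lf_many_exclamations, lf_has_link, lf_no_signals], examples)
soft_labels = majority_vote_probs(label_matrix)
model = train_logistic(examples, soft_labels)
print(soft_labels)
```

Because the end model sees the raw features rather than the LF votes, it can score an example such as `(5, 3)` that no rule was written for, which is the generalization effect described in stage 3.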

The original Snorkel paper reported that in a user study, subject matter experts using Snorkel built models 2.8 times faster and improved predictive performance by an average of 45.5 percent compared with seven hours of hand labeling [4]. Across real world deployments, the paper found that Snorkel "provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets" [4].

The open source Snorkel library is released under the Apache 2.0 license and is hosted at snorkel.org and on GitHub [20]. In 2019, several authors of the original Snorkel papers, including Alexander Ratner, Christopher Re, Braden Hancock, Henry Ehrenberg, and Paroma Varma, founded **Snorkel AI**, a Redwood City, California company that develops Snorkel Flow, a commercial enterprise platform for programmatic data labeling [20][22]. Snorkel AI raised an 85 million dollar Series C in 2021 at a 1 billion dollar valuation, backed by investors including Addition, Greylock, Lightspeed, GV, BlackRock, and In-Q-Tel [22]. Snorkel Flow extends Snorkel with a no code interface, integrated model training, monitoring, and the ability to use [foundation models](/wiki/foundation_model) as labeling functions.

### How was Snorkel deployed at Google (Snorkel DryBell)?

In 2019, Bach et al. published "Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale" at SIGMOD, describing a Snorkel variant deployed inside Google [5]. DryBell ingested existing organizational knowledge such as feature stores, internal classifiers, and knowledge graphs and converted it into labeling functions. On three classification tasks at Google, DryBell created classifiers "of comparable quality to ones trained with tens of thousands of hand-labeled examples" and converted non-servable organizational resources into servable models for an average 52 percent performance improvement, executing over millions of data points in tens of minutes [5]. Snorkel and its descendants have since been used at Apple, Intel, IBM, Stanford Medicine, and several large banks.

## What is distant supervision?

Distant supervision is the oldest and most influential strand of weak supervision in NLP. Mintz, Bills, Snow, and Jurafsky introduced the term in their 2009 ACL paper on relation extraction [1]. The recipe is simple: take a knowledge base such as Freebase that lists pairs of entities related by a known relation, then for every sentence in a large corpus that mentions both entities of a pair, assume the sentence expresses the relation. Use those sentences as positive training examples for a relation classifier.

The Mintz paper used Freebase, "a large semantic database of several thousand relations," and showed that distant supervision could rival fully supervised baselines without manual labeling [1]. The cost is noise. Many sentences mention two related entities without expressing the relation, and many true relation mentions are missed because the entity pair is not in the knowledge base.
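
The alignment heuristic and its characteristic noise can be shown with a toy example. The knowledge base, corpus, and function name below are made up for illustration, and real systems use entity linking rather than substring matching.

```python
# Toy knowledge base: (entity1, entity2) -> relation. Illustrative facts only.
KB = {
    ("Steve Jobs", "Apple"): "founded",
    ("Larry Page", "Google"): "founded",
}


def distant_label(sentences, kb):
    """Label every sentence that mentions both entities of a KB pair
    with that pair's relation, whether or not the sentence expresses it."""
    labeled = []
    for sentence in sentences:
        for (e1, e2), relation in kb.items():
            if e1 in sentence and e2 in sentence:
                labeled.append((sentence, e1, e2, relation))
    return labeled


corpus = [
    "Steve Jobs started Apple in a garage with Steve Wozniak.",
    "Steve Jobs returned to Apple in 1997.",  # mentions the pair, but not 'founded'
    "Larry Page studied at Stanford.",  # only one entity, so no label
]

examples = distant_label(corpus, KB)
for sentence, e1, e2, relation in examples:
    print(f"{relation}({e1}, {e2}) <- {sentence}")
```

The second sentence is labeled `founded` even though it describes a return to the company, which is exactly the false positive noise described above.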

Later work attacked this noise problem directly. Riedel, Yao, and McCallum's 2010 ECML paper recast distant supervision as multi-instance learning, treating all sentences mentioning an entity pair as a bag and assuming at least one expresses the relation [11]. Hoffmann, Zhang, Ling, Zettlemoyer, and Weld's 2011 ACL paper extended this to overlapping relations, allowing multiple relations between the same entity pair (Founded(Jobs, Apple) and CEO-of(Jobs, Apple)) [2]. Surdeanu et al.'s 2012 EMNLP work introduced multi-instance multi-label learning for relation extraction [12]. These papers laid the groundwork for treating noisy automatic labels as a generative process.

## Other weak supervision paradigms

Weak supervision is a broad tent that covers several adjacent learning frameworks.

**Multiple instance learning (MIL).** Introduced by Dietterich, Lathrop, and Lozano-Perez in 1997 for drug activity prediction, MIL treats training data as bags of instances with a single bag level label [10]. A bag is positive if and only if at least one of its instances is positive. MIL is the dominant paradigm for whole slide histopathology image classification.
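
The bag level assumption is compact enough to write down directly. In this minimal sketch the function names are hypothetical, and max pooling is only one common way to turn instance scores into a bag score.

```python
def bag_label(instance_labels):
    """Standard MIL assumption: a bag is positive iff any instance is positive."""
    return int(any(instance_labels))


def bag_score(instance_scores):
    """Max pooling: score a bag by its most suspicious instance."""
    return max(instance_scores)


# A slide with one tumour patch is a positive bag; an all-clear slide is negative.
print(bag_label([0, 0, 1]), bag_label([0, 0, 0]))
print(bag_score([0.1, 0.7, 0.2]))
```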

**Co-training.** Blum and Mitchell's 1998 paper trained two classifiers on different feature views of the data and used each classifier's confident predictions as labels for the other [14]. Co-training works when the two views are conditionally independent given the label.

**Self-training and pseudo-labeling.** A model is trained on a small labeled set, used to label a larger unlabeled set, and retrained on the union. The technique underlies many [semi-supervised learning](/wiki/semi-supervised_learning) systems and modern noisy student [self-training](/wiki/self_training) pipelines.
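
A minimal self-training loop might look like the following. The nearest-centroid classifier on one-dimensional points, the margin-based confidence rule, and all names are simplifying assumptions for this sketch; practical systems use a learned model and a probability threshold.

```python
def fit_centroids(points, labels):
    """Nearest-centroid 'model': the mean of the points in each class."""
    centroids = {}
    for cls in set(labels):
        members = [p for p, y in zip(points, labels) if y == cls]
        centroids[cls] = sum(members) / len(members)
    return centroids


def predict(centroids, point):
    """Return (label, margin), where margin is the gap between the
    distances to the nearest and second-nearest centroid."""
    dists = sorted((abs(point - c), cls) for cls, c in centroids.items())
    return dists[0][1], dists[1][0] - dists[0][0]


def self_train(labeled, unlabeled, min_margin=1.0, rounds=5):
    """Repeatedly pseudo-label confident points and refit on the union."""
    points = [p for p, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    for _ in range(rounds):
        centroids = fit_centroids(points, labels)
        scored = [(p, *predict(centroids, p)) for p in pool]
        accepted = [(p, y) for p, y, margin in scored if margin >= min_margin]
        if not accepted:
            break  # nothing confident enough is left to pseudo-label
        for p, y in accepted:
            points.append(p)
            labels.append(y)
        pool = [p for p, _, margin in scored if margin < min_margin]
    return fit_centroids(points, labels), pool


centroids, still_unlabeled = self_train(
    labeled=[(0.0, 0), (10.0, 1)],
    unlabeled=[1.0, 2.0, 9.0, 5.2],
)
print(centroids, still_unlabeled)
```

The ambiguous point near the midpoint never clears the margin and stays unlabeled, which is the usual safeguard against the model reinforcing its own mistakes.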

**Active learning.** The learner chooses which unlabeled examples to send to a human annotator, prioritizing the examples that would most reduce uncertainty.
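
One simple selection rule is uncertainty sampling for a binary classifier, sketched below. The function name and the distance-from-0.5 criterion are illustrative choices; other acquisition functions exist.

```python
def select_most_uncertain(probabilities, k):
    """Return indices of the k examples whose predicted P(y=1) is closest to 0.5."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]


model_probs = [0.95, 0.52, 0.10, 0.45, 0.70]
to_annotate = select_most_uncertain(model_probs, k=2)
print(to_annotate)
```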

**Curriculum and self-paced learning.** Bengio et al.'s 2009 curriculum learning and Kumar et al.'s 2010 self-paced learning weight examples by an estimated difficulty, so that the model learns from easy or confident examples first.

**Noisy label learning.** Frenay and Verleysen's 2014 IEEE TNNLS survey "Classification in the Presence of Label Noise" gave the field a taxonomy [7]. Han et al.'s 2018 NeurIPS paper introduced **Co-teaching**, training two networks that exchange small loss examples to filter noise [8]. Northcutt, Jiang, and Chuang's 2021 *Journal of Artificial Intelligence Research* paper on **Confident Learning** estimated the joint distribution of noisy and true labels to identify and prune label errors, an approach implemented in the open source [cleanlab](/wiki/cleanlab) library [9]. Other techniques include the bootstrap loss (Reed et al. 2014), the generalized cross entropy (GCE) loss (Zhang and Sabuncu 2018), and DivideMix (Li et al. 2020).
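
The thresholding idea behind Confident Learning can be sketched in simplified form. This is not the cleanlab API and omits the joint estimation and pruning strategies of the paper: it only computes a per-class confidence threshold and flags examples that confidently belong to a class other than their given label. The function name is hypothetical, and the sketch assumes every class has at least one labeled example.

```python
def find_label_issues(pred_probs, given_labels):
    """Flag indices whose given label disagrees with a confident prediction.

    pred_probs[i][j] is a model's (ideally out-of-sample) probability that
    example i belongs to class j.
    """
    n_classes = len(pred_probs[0])
    # Per-class threshold: mean confidence in class j over examples labeled j.
    thresholds = []
    for j in range(n_classes):
        members = [p[j] for p, y in zip(pred_probs, given_labels) if y == j]
        thresholds.append(sum(members) / len(members))

    issues = []
    for idx, (p, y) in enumerate(zip(pred_probs, given_labels)):
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue  # the model is not confident about any class
        likely = max(confident, key=lambda j: p[j])
        if likely != y:
            issues.append(idx)
    return issues


pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.05, 0.95]]
given_labels = [0, 0, 1, 1, 0]  # the last label looks wrong
print(find_label_issues(pred_probs, given_labels))
```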

## How does weak supervision differ from related learning paradigms?

Weak supervision sits in a crowded ecosystem of label efficient learning ideas. The boundaries between these paradigms are fuzzy, and the paradigms themselves overlap considerably.

| Paradigm | Labels available | Typical assumption |
| --- | --- | --- |
| [Supervised learning](/wiki/supervised_learning) | Large set of clean labels | Labels are correct and IID |
| [Semi-supervised learning](/wiki/semi-supervised_learning) | Small labeled set plus a large unlabeled set | Cluster or smoothness assumption holds |
| [Self-supervised learning](/wiki/self_supervised_learning) | No human labels; pretext tasks generate targets | A useful representation can be learned from data structure |
| [Unsupervised learning](/wiki/unsupervised_learning) | No labels at all | Structure exists in the data to be discovered |
| Weak supervision | Noisy, programmatic, or coarse labels | Noise can be modeled and aggregated |
| [Few-shot learning](/wiki/few-shot_learning) | A handful of labeled examples per class | A pretrained model transfers to the new task |
| [Zero-shot learning](/wiki/zero_shot_learning) | No labeled examples for the target task | Auxiliary semantic information bridges classes |
| [Reinforcement learning](/wiki/reinforcement_learning) | Reward signals rather than labels | Sequential decisions affect future rewards |

In practice, modern systems combine these. A weakly supervised text classifier often sits on top of a self-supervised pretrained encoder; a few-shot learner can be bootstrapped with weak labels; and active learning can prioritize which weak labels to spot check.

## What is weak supervision used for?

Weak supervision has been deployed across many domains. The following are representative rather than exhaustive.

- **Information extraction.** [Named entity recognition](/wiki/named_entity_recognition) and relation extraction, especially in biomedical and legal corpora where annotation requires expensive expertise. Snorkel was originally developed in part for biomedical relation extraction.
- **Text classification at scale.** Sentiment analysis, intent detection for voice assistants, content moderation, and topic classification on streams that change too quickly for traditional labeling cycles.
- **Biomedical text mining.** Extracting drug, disease, and gene mentions from PubMed abstracts and clinical notes. Distant supervision against UMLS and DrugBank is standard practice.
- **Image classification.** Mahajan et al.'s 2018 work at Facebook trained convolutional networks to predict hashtags on up to 3.5 billion Instagram images, then transferred them to ImageNet and reported a then state of the art 85.4 percent top-1 accuracy [17]. Joulin et al.'s 2016 ECCV paper trained image classifiers on Flickr photos using surrounding text as weak labels.
- **Webly supervised learning.** Chen and Gupta's 2015 ICCV paper trained vision models on noisy web image search results [18].
- **Speech recognition.** Closed captions, subtitles, and lyric files have been used as weak transcripts to train speech models on YouTube and podcast audio.
- **Search relevance and recommender systems.** Click logs, dwell time, and skip patterns serve as weak labels for ranking. Pinterest, YouTube, and large search engines all rely on this implicit feedback.
- **Anomaly and fraud detection.** Rare events with shifting distributions, where labeled fraud cases are scarce and labeling functions can encode known fraud patterns.
- **Drug discovery.** Multiple instance learning with bag level activity labels and distant supervision against ChEMBL or PubChem.
- **Knowledge base completion.** Relation extraction feeds new facts back into Wikidata and Google's [Knowledge Graph](/wiki/knowledge_graph).

In industry, public references include Google's use of Snorkel DryBell for product classification and intent detection in Google Assistant [5], Apple's use of weak supervision for recommendation and Siri training data, and Stanford Hospital's clinical NLP pipelines.

## What are the theoretical foundations of weak supervision?

The theoretical core of programmatic weak supervision is the label model: how can you estimate the accuracies and correlations of labeling functions when you do not have ground truth? The data programming paper of Ratner et al. 2016 framed this as fitting a generative model whose latent variable is the true label, and showed conditions under which the model parameters are identifiable from the agreement structure of the LFs [3]. Ratner, Hancock, Dunnmon, Sala, Pandey, and Re's 2019 AAAI paper extended the analysis to higher order LF dependencies and proved further identifiability conditions.

The ideas connect to a much older crowdsourcing literature. Dawid and Skene's 1979 *Journal of the Royal Statistical Society* paper introduced a generative model in which each rater has a confusion matrix and the EM algorithm recovers both the latent class labels and the rater accuracies [13]. Modern weak supervision label models can be seen as descendants of Dawid-Skene with relaxed assumptions, additional structure for correlated sources, and scalable inference.
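
A compact version of the Dawid-Skene procedure is sketched below in plain Python. The function signature, the soft majority vote initialization, and the small smoothing constant are assumptions of this sketch rather than details from the 1979 paper, and it assumes every item has at least one vote.

```python
def dawid_skene(votes, n_classes, n_iter=50, smoothing=0.01):
    """EM for rater confusion matrices and latent item labels.

    votes[i][r] is rater r's label for item i, or None if the rater skipped it.
    Returns (posteriors, confusion), where posteriors[i][k] = P(item i is class k)
    and confusion[r][k][l] = P(rater r says l | true class is k).
    """
    classes = range(n_classes)
    n_raters = len(votes[0])

    # Initialise posteriors with per-item vote fractions (soft majority vote).
    post = []
    for row in votes:
        counts = [sum(v == k for v in row) for k in classes]
        post.append([c / sum(counts) for c in counts])

    for _ in range(n_iter):
        # M-step: class prior and one confusion matrix per rater.
        prior = [sum(p[k] for p in post) / len(post) for k in classes]
        confusion = []
        for r in range(n_raters):
            table = [[smoothing] * n_classes for _ in classes]
            for p, row in zip(post, votes):
                if row[r] is not None:
                    for k in classes:
                        table[k][row[r]] += p[k]
            confusion.append([[c / sum(t) for c in t] for t in table])

        # E-step: posterior over the true class of each item.
        new_post = []
        for row in votes:
            scores = []
            for k in classes:
                s = prior[k]
                for r, v in enumerate(row):
                    if v is not None:
                        s *= confusion[r][k][v]
                scores.append(s)
            new_post.append([s / sum(scores) for s in scores])
        post = new_post
    return post, confusion


# Raters 0 and 1 agree with each other; rater 2 often contradicts them.
votes = [[1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
posteriors, confusion = dawid_skene(votes, n_classes=2)
labels = [max(range(2), key=lambda k: p[k]) for p in posteriors]
print(labels)
```

Without any ground truth, the procedure ends up trusting the two raters who agree and assigns the third a low estimated accuracy, which is the same agreement-based reasoning that label models apply to labeling functions.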

Fu, Chen, Sala, Hooper, Fatahalian, and Re's 2020 ICML paper "Fast and Three-rious" gave a closed form solution for the label model under a triplet assumption, implemented in the **flyingsquid** library, trading a small accuracy hit for a large speedup [16]. A 2022 survey by Zhang, Hsieh, Yu, Zhang, and Ratner, "A Survey on Programmatic Weak Supervision," consolidates the field's methods and benchmarks [15].
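
The triplet idea can be illustrated for binary labeling functions that vote in {-1, +1} and never abstain. If three LFs are conditionally independent given the true label Y, then E[λ_i λ_j] = a_i a_j with a_i = E[λ_i Y], so each a_i follows in closed form from three observable agreement rates. The sketch below checks this on synthetic data; it is not the flyingsquid library's API, the function names are hypothetical, and it assumes every LF is better than random so the sign is positive.

```python
import math
import random


def mean_product(label_matrix, i, j):
    """Empirical E[lambda_i * lambda_j] over all rows."""
    return sum(row[i] * row[j] for row in label_matrix) / len(label_matrix)


def triplet_accuracy(label_matrix, i, j, k):
    """Estimate a_i = E[lambda_i * Y] from LF agreement alone, using
    a_i^2 = E[l_i l_j] * E[l_i l_k] / E[l_j l_k]."""
    m_ij = mean_product(label_matrix, i, j)
    m_ik = mean_product(label_matrix, i, k)
    m_jk = mean_product(label_matrix, j, k)
    return math.sqrt(abs(m_ij * m_ik / m_jk))


# Synthetic label matrix: the true labels are generated, then discarded.
random.seed(0)
accuracies = [0.9, 0.8, 0.7]  # P(vote == true label), hidden from the estimator
rows = []
for _ in range(20000):
    y = random.choice([-1, 1])
    rows.append([y if random.random() < p else -y for p in accuracies])

estimates = [
    triplet_accuracy(rows, 0, 1, 2),
    triplet_accuracy(rows, 1, 0, 2),
    triplet_accuracy(rows, 2, 0, 1),
]
# Convert a_i = 2 * p_i - 1 back to an accuracy p_i.
recovered = [(1 + a) / 2 for a in estimates]
print([round(p, 2) for p in recovered])
```

No iterative optimization is involved, which is where the speedup over fitting a generative model by gradient descent or EM comes from.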

## Software libraries

| Library | Focus | License | Maintainer |
| --- | --- | --- | --- |
| Snorkel | Programmatic weak supervision in Python | Apache 2.0 | Snorkel team (Stanford / Snorkel AI) |
| Snorkel Flow | Commercial enterprise platform with no code UI | Proprietary | Snorkel AI |
| Cleanlab | Finding and learning with label errors in noisy data | AGPL-3.0 / commercial | Cleanlab Inc. |
| skweak | Snorkel-style weak supervision specialized for NLP and sequence labeling | MIT | Norsk Regnesentral |
| flyingsquid | Closed form label model with triplet assumptions | Apache 2.0 | Hazy Research, Stanford |
| WRENCH | Benchmark and evaluation suite for weak supervision | Apache 2.0 | Jieyu Zhang et al. |
| Argilla / Rubrix | Data centric NLP platform with weak supervision integrations | Apache 2.0 | Argilla |
| Autodistill | Foundation model labeling pipelines for computer vision | Apache 2.0 | Roboflow |

The original Snorkel library remains the most cited reference implementation, although its active maintenance has slowed as the community shifted toward Snorkel Flow and toward integrating large language models as labeling sources. [Cleanlab](/wiki/cleanlab) has grown into a broader [data-centric AI](/wiki/data-centric_ai) toolkit covering label issues, outliers, near duplicates, and dataset quality scoring.

## What are the limitations of weak supervision?

Weak supervision is not a free lunch. The end model is upper bounded by the quality of the labels it sees, so a poor set of labeling functions caps performance no matter how good the downstream model is. Writing labeling functions still requires domain expertise and engineering taste, and the productivity gains tend to be largest when a domain expert is fluent in code or paired with someone who is. Class imbalance and label coverage are persistent concerns: rare classes may receive few or no LF votes, and the label model can be unstable when LFs disagree heavily.

Vision tasks have historically been harder than NLP for programmatic weak supervision because writing rules over pixels is awkward; the field largely moved to bag level annotations, hashtag pretraining, and foundation model labeling for image data. Complex NLP tasks such as multi-hop question answering or long form summarization also resist easy LF authoring.

The rise of [large language models](/wiki/large_language_model) has changed the competitive landscape. A practitioner can now ask GPT-4 or Claude to label a batch of examples and often beat traditional weak supervision on small benchmarks. Snorkel AI and others have responded by treating LLMs as labeling functions inside the weak supervision framework, an approach Snorkel calls foundation model distillation. The combined pipeline tends to outperform either approach alone, but it raises new concerns about cost, latency, and the propagation of LLM biases into downstream models.

## How is weak supervision changing in the LLM era?

Weak supervision in 2026 looks somewhat different from how it did in the original Snorkel papers. Several trends define the current state of the field.

**LLMs as labeling functions.** A modern Snorkel pipeline might include regex rules, knowledge base lookups, and prompts to a frontier language model. Each prompt becomes a noisy LF whose accuracy and correlations are estimated by the label model. Smith et al.'s 2022 paper "Language Models in the Loop: Incorporating Prompting into Weak Supervision" reported that on the WRENCH benchmark this approach achieves "an average 19.5% reduction in errors" over zero-shot prompting and produces classifiers "with comparable or superior accuracy to those trained from hand-engineered rules" [19].
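
One way to picture a prompted LF is as a thin wrapper around a text-completion callable. Everything in this sketch is hypothetical: `make_prompt_lf`, the prompt template, and the answer mapping are invented for illustration, and `fake_model` is a keyword stub standing in for a real LLM call so that the example runs offline.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def make_prompt_lf(prompt_template, complete, answer_map):
    """Wrap a completion function (str -> str) as a labeling function.
    Answers that are not in answer_map become abstentions."""
    def lf(text):
        answer = complete(prompt_template.format(text=text)).strip().lower()
        return answer_map.get(answer, ABSTAIN)
    return lf


def fake_model(prompt):
    """Offline stand-in for an LLM; a real pipeline would call a model here."""
    review = prompt.split("Review:", 1)[1].lower()
    if "great" in review:
        return "Yes"
    if "awful" in review:
        return "No"
    return "I am not sure."


lf_positive_review = make_prompt_lf(
    "Is this review positive? Answer yes or no.\nReview: {text}",
    complete=fake_model,
    answer_map={"yes": POSITIVE, "no": NEGATIVE},
)

reviews = ["Great phone", "Awful battery", "It arrived on Tuesday"]
print([lf_positive_review(r) for r in reviews])
```

Mapping unparseable or hedged answers to an abstention keeps the prompt's output in the same vote-or-abstain format as rule-based LFs, so a label model can weigh it alongside them.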

**Synthetic data generation.** Beyond labeling existing examples, LLMs are now used to generate [synthetic](/wiki/synthetic_data) training examples that are then weakly labeled or graded. The boundary between data augmentation and weak supervision has blurred.

**Constitutional AI and RLAIF.** [Anthropic's Constitutional AI](/wiki/constitutional_ai) work and several follow ons use one model to critique and label the outputs of another according to a written policy. [Reinforcement learning from AI feedback (RLAIF)](/wiki/rlaif) is a form of weak supervision applied to alignment data, where the critique model substitutes for human raters.

**Foundation model distillation.** Snorkel AI's commercial offering combines LLM labeling with traditional Snorkel style aggregation to produce smaller, faster student models through [knowledge distillation](/wiki/knowledge_distillation). This is now standard practice for shipping production NLP systems where LLM inference costs are prohibitive.

**Regulated industries.** Healthcare, finance, and government remain the strongest markets for weak supervision because manual labeling is constrained by privacy rules, professional licensing, or the sensitivity of the data. A weakly supervised pipeline that runs entirely inside an organization's network can be a compliance friendly alternative to sending data to a labeling vendor or a hosted LLM API.

Weak supervision started as a way to skip hand labeling and has become a general framework for combining heterogeneous, imperfect supervision signals. Its practical impact comes less from any single algorithmic insight than from a workflow that lets domain experts encode what they know in code, then iterate as the [data labeling](/wiki/data_labeling) task evolves.

## References

1. Mintz, M., Bills, S., Snow, R., & Jurafsky, D. (2009). "Distant supervision for relation extraction without labeled data." *Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP*, 1003-1011. https://aclanthology.org/P09-1113/
2. Hoffmann, R., Zhang, C., Ling, X., Zettlemoyer, L., & Weld, D. S. (2011). "Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations." *Proceedings of the 49th Annual Meeting of the ACL*, 541-550. https://aclanthology.org/P11-1055/
3. Ratner, A. J., De Sa, C. M., Wu, S., Selsam, D., & Re, C. (2016). "Data Programming: Creating Large Training Sets, Quickly." *Advances in Neural Information Processing Systems 29 (NeurIPS 2016)*. https://arxiv.org/abs/1605.07723
4. Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., & Re, C. (2017). "Snorkel: Rapid Training Data Creation with Weak Supervision." *Proceedings of the VLDB Endowment*, 11(3), 269-282. https://arxiv.org/abs/1711.10160
5. Bach, S. H., Rodriguez, D., Liu, Y., Luo, C., Shao, H., Xia, C., Sen, S., Ratner, A., Hancock, B., Alborzi, H., Kuchhal, R., Re, C., & Malkin, R. (2019). "Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale." *Proceedings of the 2019 SIGMOD Conference*. https://arxiv.org/abs/1812.00417
6. Zhou, Z.-H. (2018). "A brief introduction to weakly supervised learning." *National Science Review*, 5(1), 44-53. https://academic.oup.com/nsr/article/5/1/44/4093912
7. Frenay, B., & Verleysen, M. (2014). "Classification in the Presence of Label Noise: A Survey." *IEEE Transactions on Neural Networks and Learning Systems*, 25(5), 845-869. https://ieeexplore.ieee.org/document/6685834/
8. Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I. W., & Sugiyama, M. (2018). "Co-teaching: Robust training of deep neural networks with extremely noisy labels." *NeurIPS 2018*. https://proceedings.neurips.cc/paper/2018/hash/a19744e268754fb0148b017647355b7b-Abstract.html
9. Northcutt, C. G., Jiang, L., & Chuang, I. L. (2021). "Confident Learning: Estimating Uncertainty in Dataset Labels." *Journal of Artificial Intelligence Research*, 70, 1373-1411. https://www.jair.org/index.php/jair/article/view/12125
10. Dietterich, T. G., Lathrop, R. H., & Lozano-Perez, T. (1997). "Solving the multiple instance problem with axis-parallel rectangles." *Artificial Intelligence*, 89(1-2), 31-71.
11. Riedel, S., Yao, L., & McCallum, A. (2010). "Modeling Relations and Their Mentions without Labeled Text." *ECML PKDD 2010*.
12. Surdeanu, M., Tibshirani, J., Nallapati, R., & Manning, C. D. (2012). "Multi-instance multi-label learning for relation extraction." *EMNLP-CoNLL 2012*.
13. Dawid, A. P., & Skene, A. M. (1979). "Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm." *Journal of the Royal Statistical Society. Series C (Applied Statistics)*, 28(1), 20-28.
14. Blum, A., & Mitchell, T. (1998). "Combining labeled and unlabeled data with co-training." *COLT 1998*.
15. Zhang, J., Hsieh, C.-Y., Yu, Y., Zhang, C., & Ratner, A. (2022). "A Survey on Programmatic Weak Supervision." arXiv:2202.05433. https://arxiv.org/abs/2202.05433
16. Fu, D. Y., Chen, M. F., Sala, F., Hooper, S., Fatahalian, K., & Re, C. (2020). "Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods." *ICML 2020*. https://arxiv.org/abs/2002.11955
17. Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., & van der Maaten, L. (2018). "Exploring the Limits of Weakly Supervised Pretraining." *ECCV 2018*. https://arxiv.org/abs/1805.00932
18. Chen, X., & Gupta, A. (2015). "Webly Supervised Learning of Convolutional Networks." *ICCV 2015*. https://arxiv.org/abs/1505.01554
19. Smith, R., Fries, J. A., Hancock, B., & Bach, S. H. (2022). "Language Models in the Loop: Incorporating Prompting into Weak Supervision." arXiv:2205.02318. https://arxiv.org/abs/2205.02318
20. Snorkel project website (https://snorkel.org/) and Snorkel AI (https://snorkel.ai/).
21. Lison, P., Barnes, J., & Hubin, A. (2021). "skweak: Weak Supervision Made Easy for NLP." *ACL 2021 System Demonstrations*. https://aclanthology.org/2021.acl-demo.40.pdf
22. "Snorkel AI Raises $85M Series C at $1B Valuation for Data-Centric AI." Snorkel AI (2021). https://snorkel.ai/blog/85-million-series-c-accelerating-data-centric-ai-enterprise/