A proxy label is a substitute for the true target label of a training example. Practitioners use proxy labels when the ground-truth target is too expensive, too slow, or simply impossible to collect at the scale needed to train a model. Instead of paying experts to annotate every example, the team writes rules, taps a knowledge base, mines user behavior, or asks a larger model to predict the label, then trains on the resulting noisy targets.
The idea sits inside the broader area of weak supervision, which Zhi-Hua Zhou (2018) groups into three families: incomplete supervision (few labels), inexact supervision (coarse labels), and inaccurate supervision (noisy labels). Proxy labels are most often a form of inaccurate supervision, although they appear in the other two settings as well. They are sometimes called noisy labels, weak labels, or silver labels to contrast them with the high-quality gold labels produced by careful human annotation.
Proxy labels show up in nearly every modern machine learning pipeline. Search engines train rankers on click logs because there is no oracle telling them which document is most relevant. Radiology models train on text mined out of free-text reports because radiologists do not have time to label millions of images by hand. Large language model fine-tuning pipelines now routinely use a stronger model to label data for a smaller one. The cost savings are real, but so are the risks: optimizing a proxy too hard can drift away from the actual goal, a problem usually attributed to Goodhart's law.
The vocabulary in this area is messy and overlapping. The same idea travels under several names depending on the subfield.
| Term | Common meaning | Typical context |
|---|---|---|
| Proxy label | A substitute target used when the true target is unavailable or expensive | General machine learning |
| Weak label | A label produced by an imperfect process, often heuristic or programmatic | Weak supervision, Snorkel ecosystem |
| Noisy label | A label that is sometimes wrong, modeled as the true label plus noise | Label-noise theory, confident learning |
| Silver label | An automatically derived label used as a cheap stand-in for gold labels | Biomedical NLP, named entity recognition |
| Gold label | A high-quality label produced by careful human annotation, treated as ground truth | Evaluation sets, benchmarks |
| Distant label | A label transferred from a knowledge base via a heuristic mapping | Distant supervision, relation extraction |
| Pseudo-label | A label predicted by a model and then used to train that same model or another model | Pseudo-labeling, self-training |
These terms overlap. A click is a noisy label, a weak label, and a proxy for relevance all at once. The distinctions matter mostly when you are choosing a method to denoise or aggregate the labels.
The phrase "proxy label" actually covers two slightly different ideas that often get blurred together.
The first sense is a substitute label for the same target. You want to train a relevance model and you do not have human relevance judgments, so you use clicks as a noisy stand-in. The label you collect approximates the label you wish you had.
The second sense is a label for a related task that you train on because it is easier to collect, hoping the learned representations transfer to your real task. Recommender systems do this constantly: the team really wants to predict long-term user satisfaction, but they train on next-click prediction because clicks are abundant and satisfaction is not. The proxy task is correlated with the real task but not identical to it.
Keeping these two senses distinct helps when you are debugging why a model trained on proxy labels does worse than expected. Sometimes the labels themselves are noisy. Sometimes the entire objective is wrong.
The practical question is where proxy labels come from. There are six common sources, each with its own strengths and failure modes.
The oldest and simplest source is a hand-written rule. A regex that matches dollar signs followed by digits flags monetary amounts. A keyword list of disease names labels mentions of medical conditions. A rule that says "any review with five stars is positive" turns ratings into sentiment labels. Rules are fast to write, cheap to run, and transparent, but they typically have high precision on the cases they cover and zero coverage everywhere else.
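To make this concrete, here is a minimal sketch of rule-based labelers for the examples above; the function names, constants, and data layout are illustrative, not from any particular library:

```python
import re

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Dollar sign followed by digits flags a monetary amount.
MONEY_RE = re.compile(r"\$\d[\d,]*(?:\.\d{2})?")

def lf_mentions_money(text: str) -> bool:
    """High precision on obvious amounts like $1,299.99; no coverage elsewhere."""
    return bool(MONEY_RE.search(text))

def lf_five_stars(review: dict) -> int:
    """Five-star reviews are labeled positive; everything else abstains."""
    return POSITIVE if review["stars"] == 5 else ABSTAIN

print(lf_mentions_money("The repair cost $1,299.99"))   # True
print(lf_five_stars({"stars": 5, "text": "Loved it"}))  # 1 (positive)
```

The `ABSTAIN` value is the key design choice: a rule that stays silent outside its narrow domain is far more useful to downstream aggregation than one that guesses.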
Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky introduced distant supervision at ACL 2009 in their paper "Distant supervision for relation extraction without labeled data." Their approach: for any pair of entities that already appear together in Freebase under some relation, find every sentence in a large corpus that mentions both entities, and label each such sentence with that relation. The assumption that every co-occurring sentence expresses the relation fails for many individual sentences, but it holds often enough that a classifier trained on the resulting noisy data extracted 10,000 instances of 102 relations at 67.6% precision. Distant supervision became a standard technique for bootstrapping training data in information extraction.
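A toy sketch of the Mintz et al. recipe, with a two-entry stand-in for Freebase; the substring matching is deliberately naive to show exactly where the assumption breaks:

```python
# Toy knowledge base: (entity1, entity2) -> relation. Entries are illustrative.
KB = {
    ("Barack Obama", "Honolulu"): "born_in",
    ("Google", "Mountain View"): "headquartered_in",
}

def distant_labels(sentences):
    """Label every sentence mentioning a KB pair with that pair's relation.
    Core (noisy) assumption: co-occurrence implies the relation is expressed."""
    labeled = []
    for sent in sentences:
        for (e1, e2), relation in KB.items():
            if e1 in sent and e2 in sent:
                labeled.append((sent, relation))
    return labeled

corpus = [
    "Barack Obama was born in Honolulu.",          # correctly labeled born_in
    "Barack Obama spoke at a rally in Honolulu.",  # also labeled born_in: wrong
]
print(distant_labels(corpus))
```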
Services like Amazon Mechanical Turk, Scale AI, and Surge let you buy human labels at much lower cost than expert annotation, but the labels are noisier. Crowd labels are usually treated as proxy labels for the labels a domain expert would have produced. Practitioners typically have multiple workers label each example and aggregate via majority vote, the Dawid-Skene model, or another consensus method.
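The simplest aggregator is plain majority vote over the redundant labels, sketched below; Dawid-Skene goes further by estimating each worker's accuracy and weighting votes accordingly.

```python
from collections import Counter

def majority_vote(worker_labels):
    """Aggregate one example's redundant crowd labels by majority vote.
    Ties resolve arbitrarily in this sketch; real pipelines break ties
    with worker-quality estimates or extra annotations."""
    return Counter(worker_labels).most_common(1)[0][0]

print(majority_vote(["cat", "cat", "dog"]))  # -> "cat"
```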
For consumer products, user actions provide a constant stream of implicit signals. Clicks are proxies for relevance. Dwell time is a proxy for engagement quality. Add-to-cart is a proxy for purchase intent. Purchases themselves are proxies for satisfaction. These signals are abundant, but they are biased in well-documented ways. Position bias means a result shown at the top of a search page gets clicked more whether or not it is the most relevant. Selection bias means you only get clicks on items the system already chose to show.
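One standard correction for position bias is inverse propensity weighting from the unbiased learning-to-rank literature: clicks at positions users rarely examine count for more. A minimal sketch, with made-up examination propensities:

```python
def ipw_click_weights(clicks, positions, propensity):
    """Reweight click labels by the inverse probability that the user
    examined that rank, so clicks at low-attention positions count more."""
    return [c / propensity[p] for c, p in zip(clicks, positions)]

# Hypothetical examination propensities: attention decays quickly with rank.
propensity = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.25, 5: 0.15}
weights = ipw_click_weights(clicks=[1, 1, 1], positions=[1, 3, 5],
                            propensity=propensity)
print(weights)  # a rank-5 click counts ~6.7x more than a rank-1 click
```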
Modern pipelines often label data with a model that already exists. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean's 2015 paper "Distilling the Knowledge in a Neural Network" formalized this as knowledge distillation: a small student model is trained to match the soft probability distribution produced by a larger teacher, treating the teacher's outputs as proxy labels that carry more information than the original hard labels. This pattern dominates current LLM training, where teams routinely use a frontier model to label millions of examples for fine-tuning a smaller production model.
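A sketch of the distillation loss in PyTorch, following the Hinton et al. formulation; the temperature and batch shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened teacher and student
    distributions; the teacher's soft outputs serve as proxy labels.
    The T*T factor keeps gradient magnitudes comparable across temperatures."""
    log_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (T * T)

student_logits = torch.randn(8, 10)   # batch of 8 examples, 10 classes
teacher_logits = torch.randn(8, 10)   # stand-in for a frozen teacher's outputs
loss = distillation_loss(student_logits, teacher_logits)
```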
A model can label its own training data. Dong-Hyun Lee's 2013 ICML workshop paper "Pseudo-Label" introduced the simplest version of this for deep networks: train a model on the small labeled set, predict labels for the unlabeled set, keep the high-confidence predictions, and add them to the training set. This is the workhorse of semi-supervised learning and is closely related to self-training. Lee argued that pseudo-labeling is approximately equivalent to entropy minimization, pushing the decision boundary into low-density regions of the data.
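The loop is simple enough to sketch in a few lines; `model` here is any classifier with a scikit-learn-style `predict_proba`, and the confidence threshold is a typical but arbitrary choice:

```python
import numpy as np

def pseudo_label_round(model, X_unlabeled, threshold=0.95):
    """One round of pseudo-labeling (Lee, 2013): keep only unlabeled
    examples the current model predicts with high confidence."""
    probs = model.predict_proba(X_unlabeled)
    confident = probs.max(axis=1) >= threshold
    return X_unlabeled[confident], probs[confident].argmax(axis=1)

# Typical loop: fit on the small gold set, pseudo-label the unlabeled pool,
# add the confident predictions to the training set, refit, repeat.
```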
A weak supervision framework is a software system for combining many noisy label sources into a single, denoised training set. The core insight, due to Alexander Ratner and collaborators, is that you can model the labeling sources themselves as noisy estimators and learn their accuracies without ever seeing the ground truth, by exploiting their agreements and disagreements.
| Framework | Year | Origin | Main idea |
|---|---|---|---|
| Data programming | 2016 | Ratner et al., NeurIPS | Users write labeling functions; a generative model recovers their accuracies and produces probabilistic training labels |
| Snorkel | 2017 | Ratner et al., VLDB | End-to-end system implementing data programming; in a user study, subject matter experts built models 2.8x faster with 45.5% better predictive performance than seven hours of hand labeling |
| Snorkel DryBell | 2019 | Bach et al., SIGMOD | Industrial deployment at Google showing the approach scales to production weak-supervision pipelines |
| Skweak | 2021 | Lison, Barnes, Hubin, ACL | Python toolkit specialized for NLP, integrated with spaCy, for sequence labeling and text classification |
| Cleanlab | 2019+ | Northcutt, Jiang, Chuang | Library for finding and fixing label errors using confident learning; estimates the joint distribution of noisy and true labels |
| HoloClean | 2017 | Rekatsinas et al., VLDB | Statistical inference framework for cleaning noisy labels and structured data |
Snorkel and its successors take labeling functions written in code, run them across an unlabeled dataset, and aggregate their noisy outputs into probabilistic labels suitable for training a downstream model. Cleanlab takes the complementary angle: given an existing labeled dataset, it finds the labels most likely to be wrong so you can fix or remove them. The two approaches are often used together.
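A compact end-to-end example of that division of labor, written against the Snorkel 0.9-era API as I understand it (the data frame and labeling functions are toy stand-ins):

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEG, POS = -1, 0, 1

@labeling_function()
def lf_five_stars(x):
    return POS if x.stars == 5 else ABSTAIN

@labeling_function()
def lf_refund(x):
    return NEG if "refund" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({
    "text": ["Great product!", "Asked for a refund.", "It works fine."],
    "stars": [5, 1, 4],
})

applier = PandasLFApplier(lfs=[lf_five_stars, lf_refund])
L_train = applier.apply(df_train)  # one vote column per labeling function

# The label model learns each function's accuracy from agreements and
# disagreements alone, then emits probabilistic training labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=0)
probs = label_model.predict_proba(L_train)
```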
Proxy labels appear in essentially every large-scale machine learning system. A few illustrative cases.
| Domain | Real target | Proxy label used | Why the proxy works (and where it fails) |
|---|---|---|---|
| Web search | Document relevance to a query | Click-through rate, dwell time, satisfied click | Clicks correlate with relevance but suffer from position bias and selection bias; modern systems use unbiased learning-to-rank corrections |
| Online ads | Whether the ad is useful to the user | Click-through rate, conversion rate | Clicks are easy to measure but reward attention-grabbing rather than helpful ads; conversion rate is more aligned but sparser |
| Information retrieval | Editorial relevance grade | BM25 ranking, prior model output | Cheap and abundant; can lock in the biases of the existing ranker |
| Medical imaging | Radiologist's diagnosis from the image | Concepts extracted from the free-text radiology report (e.g., the CheXpert labeler over 224,316 Stanford chest X-rays) | Reports are written by radiologists who saw the image, so the labels are clinically grounded; uncertainty in the report leaks through as label noise |
| Self-driving | The right action to take now | The action a human driver actually took, or the future trajectory rolled out from logs | Logs are abundant; humans sometimes act badly, and counterfactual actions are unobserved |
| Recommender systems | Long-term user satisfaction | Next-click prediction, watch time, ratings | Clicks and watch time are dense and immediate; they often reward addictive content over satisfying content |
| Large language model fine-tuning | What a high-quality response looks like | Outputs of a stronger LLM (e.g., GPT-4 labeling data for a smaller model), rationales from chain-of-thought distillation | Cheap and scalable; the student inherits the teacher's biases and mistakes |
| Named entity recognition (biomedical) | Whether a span names a gene, drug, or disease | Dictionary matches against UMLS or other knowledge bases | Excellent precision on in-dictionary terms; misses synonyms, abbreviations, and novel entities |
These examples share a pattern. The team has a target they cannot directly observe at scale and a related signal they can. They train on the related signal and hope the resulting model generalizes to the target. Sometimes it does. Sometimes the proxy gap eats them alive.
Proxy labels are useful precisely because they are imperfect, so a serious training pipeline has to plan for the imperfections. The main failure modes:
If you optimize a proxy hard enough, you will eventually hurt the true objective. This is Goodhart's law in its machine learning form. Leo Gao, John Schulman, and Jacob Hilton's 2023 paper "Scaling Laws for Reward Model Overoptimization" measured this directly for reinforcement learning from human feedback: as a policy optimizes a learned reward model harder, the proxy reward keeps going up while the true reward (from a much larger gold reward model) eventually turns down. Karwowski, Skalse, and collaborators' 2023 paper "Goodhart's Law in Reinforcement Learning" gave a geometric account of why this happens in Markov decision processes and proposed early-stopping rules to avoid the worst of it. The same dynamic shows up in supervised learning whenever the proxy label and the true target diverge.
Proxy labels are often systematically biased rather than randomly noisy. A search ranker trained on clicks will preferentially show items that historically got clicked, perpetuating the position bias and selection bias of the original system. A recidivism predictor trained on rearrest data inherits the policing patterns that produced the rearrests. The trained model can amplify the bias in its own deployment data, creating feedback loops that are hard to detect from inside the system.
Heuristic-based proxies tend to have high precision and low recall. A regex catches obvious cases and misses the rest. A model trained only on the cases the regex labels will systematically underperform on the cases it missed. Practitioners often combine multiple weak labelers with overlapping coverage, then use a Snorkel-style generative model to aggregate their votes.
A whole subfield exists to handle noisy training labels. Benoît Frénay and Michel Verleysen's 2014 IEEE survey "Classification in the Presence of Label Noise" grouped the methods into three buckets: noise-robust algorithms (loss functions that are inherently less sensitive), noise cleansing (find and fix the wrong labels), and noise-tolerant methods (model the noise process explicitly during training). Confident learning, introduced by Curtis Northcutt, Lu Jiang, and Isaac Chuang in their 2021 JAIR paper, falls into the cleansing bucket: estimate the joint distribution of noisy and true labels using out-of-sample predicted probabilities, then prune the examples most likely to be mislabeled. The method is the basis for the cleanlab library.
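In code, the confident-learning workflow looks roughly like this, assuming the cleanlab 2.x API; note the cross-validated probabilities, which keep the estimates out-of-sample:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# Toy data with injected label noise; X and labels are stand-ins.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
labels = (X[:, 0] > 0).astype(int)
labels[:10] = 1 - labels[:10]  # flip a few labels on purpose

# Confident learning needs out-of-sample predicted probabilities.
pred_probs = cross_val_predict(
    LogisticRegression(), X, labels, cv=5, method="predict_proba"
)

# Indices of the examples most likely to be mislabeled, worst first.
issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issues[:10])
```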
The one habit that prevents most disasters is keeping a small, carefully labeled gold evaluation set that the model never trains on. Proxy-trained models can score perfectly on the proxy and still fail on the gold set; this gap is the proxy gap, made visible. If you do not have a gold set, you cannot tell whether your model is learning the proxy or the real task.
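The check itself is tiny once the gold set exists; everything here is illustrative, with `model` exposing a scikit-learn-style `predict`:

```python
def proxy_gap(model, X_proxy, y_proxy, X_gold, y_gold):
    """Accuracy on a proxy-labeled holdout minus accuracy on the gold set.
    A large positive gap means the model learned the proxy, not the task."""
    acc = lambda X, y: (model.predict(X) == y).mean()
    return acc(X_proxy, y_proxy) - acc(X_gold, y_gold)
```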
Proxy labels became more important, not less, with the rise of large models. A few reasons.
First, the absolute scale of training data needed for current language and vision models exceeds anything humans can label by hand. GPT-class models train on trillions of tokens of text scraped from the web. The supervision signal is the next token, which is a proxy for many things at once: factual accuracy, reasoning ability, coherence, style. Next-token loss is the cheapest possible proxy for "good output," and that is largely why pretraining works at the scale it does.
Second, the same large models that benefit from proxy labels at pretraining time can produce proxy labels for downstream training. LLM-generated labels are now a standard ingredient in fine-tuning pipelines. Researchers comparing GPT-4-labeled training sets to human-labeled training sets have found that for many text classification tasks the resulting downstream models are roughly comparable, at a fraction of the cost. Anthropic, OpenAI, and Google all use stronger models to label data for smaller models, often combined with knowledge distillation on soft outputs.
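A hedged sketch of LLM-as-labeler; `complete` stands in for whichever client function your provider exposes and is hypothetical, as is the prompt:

```python
PROMPT = (
    "Classify the sentiment of this review as POSITIVE or NEGATIVE.\n"
    "Review: {text}\n"
    "Answer with exactly one word."
)

def llm_label(texts, complete):
    """Ask a stronger model for labels and treat its answers as proxy labels
    for fine-tuning a smaller model. Unparseable answers become None here;
    a real pipeline might retry them or route them to human review."""
    labels = []
    for text in texts:
        answer = complete(PROMPT.format(text=text)).strip().upper()
        labels.append(answer if answer in {"POSITIVE", "NEGATIVE"} else None)
    return labels
```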
Third, reinforcement learning from human feedback (RLHF) pushed the proxy-label problem to the foreground. The reward model trained on human preferences is a proxy for what humans actually want. The RL policy optimizes the reward model. The bigger the policy gets, the more aggressively it can exploit gaps between the reward model and human preferences. The whole alignment subfield is in part a response to the proxy-label problem at LLM scale.
A practitioner choosing among proxy-label tooling has a small but growing menu.
| Tool | Best for | Notes |
|---|---|---|
| Snorkel | Combining many heuristic labeling functions into a denoised training set | Original framework; commercial successor is Snorkel Flow |
| Skweak | Weak supervision for NLP sequence labeling and classification | Tightly integrated with spaCy |
| Cleanlab | Finding and fixing label errors in existing datasets | Implements confident learning; model-agnostic |
| Weasel | End-to-end weakly supervised learning, integrating label model and end model in a single training loop | PyTorch-based |
| HoloClean | Cleaning noisy labels and structured data via statistical inference | Aimed at databases more than ML |
| Refinery, Rubrix | Annotation interfaces with weak supervision support | Help domain experts iterate on labeling functions |
Many teams build internal tooling rather than adopt a framework, especially when their proxy-label sources are very specific to their product (clicks in their app, logs from their service). The conceptual machinery is the same regardless: define the sources, model their noise, denoise, train, and evaluate on a gold set.
Imagine you want to teach a robot how to recognize different types of animals. To do this, you need to show it lots of pictures of animals with the correct name attached to each picture. The problem is, you don't have time to look at all the pictures and name each animal yourself.
So you come up with a clever idea. You use the names other people have given to similar pictures as a shortcut. Maybe you grab captions from a website where people posted the photos, even though those captions are sometimes wrong. Or you ask a friend who is pretty good at animals (but not perfect) to label them for you. These names might not be perfect, but they are close enough to help the robot learn. This shortcut is what we call a "proxy label" in machine learning. It is not as good as the real, careful label, but it can save a lot of time and still help the robot learn quite well.
The trick is to remember the labels are imperfect. If the robot starts getting weirdly confident about something only because the shortcut labels were biased, you have to catch it. So you keep a small set of really carefully labeled pictures, ones you took the time to check yourself, and use those to test whether the robot is actually learning what you wanted, not just what the shortcut said.