A proxy label is a substitute for the true target label of a training example. Practitioners use proxy labels when the ground-truth target is too expensive, too slow, or simply impossible to collect at the scale needed to train a model. Instead of paying experts to annotate every example, the team writes rules, taps a knowledge base, mines user behavior, or asks a larger model to predict the label, then trains on the resulting noisy targets.
The idea sits inside the broader area of weak supervision, which Zhi-Hua Zhou (2018) groups into three families: incomplete supervision (few labels), inexact supervision (coarse labels), and inaccurate supervision (noisy labels). Proxy labels are most often a form of inaccurate supervision, although they appear in the other two settings as well. They are sometimes called noisy labels, weak labels, or silver labels to contrast them with the high-quality gold labels produced by careful human annotation.
Proxy labels show up in nearly every modern machine learning pipeline. Search engines train rankers on click logs because there is no oracle telling them which document is most relevant. Radiology models train on text mined out of free-text reports because radiologists do not have time to label millions of images by hand. Large language model fine-tuning pipelines now routinely use a stronger model to label data for a smaller one. The cost savings are real, but so are the risks: optimizing a proxy too hard can drift away from the actual goal, a problem usually attributed to Goodhart's law.
The vocabulary in this area is messy and overlapping. The same idea travels under several names depending on the subfield.
| Term | Common meaning | Typical context |
|---|---|---|
| Proxy label | A substitute target used when the true target is unavailable or expensive | General machine learning |
| Weak label | A label produced by an imperfect process, often heuristic or programmatic | Weak supervision, Snorkel ecosystem |
| Noisy label | A label that is sometimes wrong, modeled as the true label plus noise | Label-noise theory, confident learning |
| Silver label | An automatically derived label used as a cheap stand-in for gold labels | Biomedical NLP, named entity recognition |
| Gold label | A high-quality label produced by careful human annotation, treated as ground truth | Evaluation sets, benchmarks |
| Distant label | A label transferred from a knowledge base via a heuristic mapping | Distant supervision, relation extraction |
| Pseudo-label | A label predicted by a model and then used to train that same model or another model | Pseudo-labeling, self-training |
These terms overlap. A click is a noisy label, a weak label, and a proxy for relevance all at once. The distinctions matter mostly when you are choosing a method to denoise or aggregate the labels.
The phrase "proxy label" actually covers two slightly different ideas that often get blurred together.
The first sense is a substitute label for the same target. You want to train a relevance model and you do not have human relevance judgments, so you use clicks as a noisy stand-in. The label you collect approximates the label you wish you had.
The second sense is a label for a related task that you train on because it is easier to collect, hoping the learned representations transfer to your real task. Recommender systems do this constantly: the team really wants to predict long-term user satisfaction, but they train on next-click prediction because clicks are abundant and satisfaction is not. The proxy task is correlated with the real task but not identical to it.
Keeping these two senses distinct helps when you are debugging why a model trained on proxy labels does worse than expected. Sometimes the labels themselves are noisy. Sometimes the entire objective is wrong.
The practical question is where proxy labels come from. There are six common sources, each with its own strengths and failure modes.
The oldest and simplest source is a hand-written rule. A regex that matches dollar signs followed by digits flags monetary amounts. A keyword list of disease names labels mentions of medical conditions. A rule that says "any review with five stars is positive" turns ratings into sentiment labels. Rules are fast to write, cheap to run, and transparent, but they typically have high precision on the cases they cover and zero coverage everywhere else.
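To make this concrete, here is a minimal sketch of rule-based labelers for the examples above; the function names, constants, and data layout are illustrative, not from any particular library:

```python
import re

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Dollar sign followed by digits flags a monetary amount.
MONEY_RE = re.compile(r"\$\d[\d,]*(?:\.\d{2})?")

def lf_mentions_money(text: str) -> bool:
    """High precision on obvious amounts like $1,299.99; no coverage elsewhere."""
    return bool(MONEY_RE.search(text))

def lf_five_stars(review: dict) -> int:
    """Five-star reviews are labeled positive; everything else abstains."""
    return POSITIVE if review["stars"] == 5 else ABSTAIN

print(lf_mentions_money("The repair cost $1,299.99"))   # True
print(lf_five_stars({"stars": 5, "text": "Loved it"}))  # 1 (positive)
```

The `ABSTAIN` value is the key design choice: a rule that stays silent outside its narrow domain is far more useful to downstream aggregation than one that guesses.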
Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky introduced distant supervision at ACL 2009 in their paper "Distant supervision for relation extraction without labeled data." Their approach: for any pair of entities that already appear together in Freebase under some relation, find every sentence in a large corpus that mentions both entities, and label each such sentence with that relation. The assumption that every co-occurring sentence expresses the relation fails for many individual sentences, but it holds often enough that a classifier trained on the resulting noisy data extracted 10,000 instances of 102 relations at 67.6% precision. Distant supervision became a standard technique for bootstrapping training data in information extraction.
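A toy sketch of the Mintz et al. recipe, with a two-entry stand-in for Freebase; the substring matching is deliberately naive to show exactly where the assumption breaks:

```python
# Toy knowledge base: (entity1, entity2) -> relation. Entries are illustrative.
KB = {
    ("Barack Obama", "Honolulu"): "born_in",
    ("Google", "Mountain View"): "headquartered_in",
}

def distant_labels(sentences):
    """Label every sentence mentioning a KB pair with that pair's relation.
    Core (noisy) assumption: co-occurrence implies the relation is expressed."""
    labeled = []
    for sent in sentences:
        for (e1, e2), relation in KB.items():
            if e1 in sent and e2 in sent:
                labeled.append((sent, relation))
    return labeled

corpus = [
    "Barack Obama was born in Honolulu.",          # correctly labeled born_in
    "Barack Obama spoke at a rally in Honolulu.",  # also labeled born_in: wrong
]
print(distant_labels(corpus))
```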
Services like Amazon Mechanical Turk, Scale AI, and Surge let you buy human labels at much lower cost than expert annotation, but the labels are noisier. Crowd labels are usually treated as proxy labels for the labels a domain expert would have produced. Practitioners typically have multiple workers label each example and aggregate via majority vote, the Dawid-Skene model, or another consensus method.
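The simplest aggregator is plain majority vote over the redundant labels, sketched below; Dawid-Skene goes further by estimating each worker's accuracy and weighting votes accordingly.

```python
from collections import Counter

def majority_vote(worker_labels):
    """Aggregate one example's redundant crowd labels by majority vote.
    Ties resolve arbitrarily in this sketch; real pipelines break ties
    with worker-quality estimates or extra annotations."""
    return Counter(worker_labels).most_common(1)[0][0]

print(majority_vote(["cat", "cat", "dog"]))  # -> "cat"
```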
For consumer products, user actions provide a constant stream of implicit signals. Clicks are proxies for relevance. Dwell time is a proxy for engagement quality. Add-to-cart is a proxy for purchase intent. Purchases themselves are proxies for satisfaction. These signals are abundant, but they are biased in well-documented ways. Position bias means a result shown at the top of a search page gets clicked more whether or not it is the most relevant. Selection bias means you only get clicks on items the system already chose to show.
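One standard correction for position bias is inverse propensity weighting from the unbiased learning-to-rank literature: clicks at positions users rarely examine count for more. A minimal sketch, with made-up examination propensities:

```python
def ipw_click_weights(clicks, positions, propensity):
    """Reweight click labels by the inverse probability that the user
    examined that rank, so clicks at low-attention positions count more."""
    return [c / propensity[p] for c, p in zip(clicks, positions)]

# Hypothetical examination propensities: attention decays quickly with rank.
propensity = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.25, 5: 0.15}
weights = ipw_click_weights(clicks=[1, 1, 1], positions=[1, 3, 5],
                            propensity=propensity)
print(weights)  # a rank-5 click counts ~6.7x more than a rank-1 click
```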
Modern pipelines often label data with a model that already exists. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean's 2015 paper "Distilling the Knowledge in a Neural Network" formalized this as knowledge distillation: a small student model is trained to match the soft probability distribution produced by a larger teacher, treating the teacher's outputs as proxy labels that carry more information than the original hard labels. This pattern dominates current LLM training, where teams routinely use a frontier model to label millions of examples for fine-tuning a smaller production model.
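A sketch of the distillation loss in PyTorch, following the Hinton et al. formulation; the temperature and batch shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened teacher and student
    distributions; the teacher's soft outputs serve as proxy labels.
    The T*T factor keeps gradient magnitudes comparable across temperatures."""
    log_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (T * T)

student_logits = torch.randn(8, 10)   # batch of 8 examples, 10 classes
teacher_logits = torch.randn(8, 10)   # stand-in for a frozen teacher's outputs
loss = distillation_loss(student_logits, teacher_logits)
```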
A model can label its own training data. Dong-Hyun Lee's 2013 ICML workshop paper "Pseudo-Label" introduced the simplest version of this for deep networks: train a model on the small labeled set, predict labels for the unlabeled set, keep the high-confidence predictions, and add them to the training set. This is the workhorse of semi-supervised learning and is closely related to self-training. Lee argued that pseudo-labeling is approximately equivalent to entropy minimization, pushing the decision boundary into low-density regions of the data.
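The loop is simple enough to sketch in a few lines; `model` here is any classifier with a scikit-learn-style `predict_proba`, and the confidence threshold is a typical but arbitrary choice:

```python
import numpy as np

def pseudo_label_round(model, X_unlabeled, threshold=0.95):
    """One round of pseudo-labeling (Lee, 2013): keep only unlabeled
    examples the current model predicts with high confidence."""
    probs = model.predict_proba(X_unlabeled)
    confident = probs.max(axis=1) >= threshold
    return X_unlabeled[confident], probs[confident].argmax(axis=1)

# Typical loop: fit on the small gold set, pseudo-label the unlabeled pool,
# add the confident predictions to the training set, refit, repeat.
```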
A weak supervision framework is a software system for combining many noisy label sources into a single, denoised training set. The core insight, due to Alexander Ratner and collaborators, is that you can model the labeling sources themselves as noisy estimators and learn their accuracies without ever seeing the ground truth, by exploiting their agreements and disagreements.
| Framework | Year | Origin | Main idea |
|---|---|---|---|
| Data programming | 2016 | Ratner et al., NeurIPS | Users write labeling functions; a generative model recovers their accuracies and produces probabilistic training labels |
| Snorkel | 2017 | Ratner et al., VLDB | End-to-end system implementing data programming; in a user study, subject matter experts built models 2.8x faster with 45.5% better predictive performance than seven hours of hand labeling |
| Snorkel DryBell | 2019 | Bach et al., SIGMOD | Industrial deployment at Google showing the approach scales to production weak-supervision pipelines |
| Skweak | 2021 | Lison, Barnes, Hubin, ACL | Python toolkit specialized for NLP, integrated with spaCy, for sequence labeling and text classification |
| Cleanlab | 2019+ | Northcutt, Jiang, Chuang | Library for finding and fixing label errors using confident learning; estimates the joint distribution of noisy and true labels |
| HoloClean | 2017 | Rekatsinas et al., VLDB | Statistical inference framework for cleaning noisy labels and structured data |
Snorkel and its successors take labeling functions written in code, run them across an unlabeled dataset, and aggregate their noisy outputs into probabilistic labels suitable for training a downstream model. Cleanlab takes the complementary angle: given an existing labeled dataset, it finds the labels most likely to be wrong so you can fix or remove them. The two approaches are often used together.
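A compact end-to-end example of that division of labor, written against the Snorkel 0.9-era API as I understand it (the data frame and labeling functions are toy stand-ins):

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEG, POS = -1, 0, 1

@labeling_function()
def lf_five_stars(x):
    return POS if x.stars == 5 else ABSTAIN

@labeling_function()
def lf_refund(x):
    return NEG if "refund" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({
    "text": ["Great product!", "Asked for a refund.", "It works fine."],
    "stars": [5, 1, 4],
})

applier = PandasLFApplier(lfs=[lf_five_stars, lf_refund])
L_train = applier.apply(df_train)  # one vote column per labeling function

# The label model learns each function's accuracy from agreements and
# disagreements alone, then emits probabilistic training labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=0)
probs = label_model.predict_proba(L_train)
```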
Proxy labels appear in essentially every large-scale machine learning system. A few illustrative cases.
| Domain | Real target | Proxy label used | Why the proxy works (and where it fails) |
|---|---|---|---|
| Web search | Document relevance to a query | Click-through rate, dwell time, satisfied click | Clicks correlate with relevance but suffer from position bias and selection bias; modern systems use unbiased learning-to-rank corrections |
| Online ads | Whether the ad is useful to the user | Click-through rate, conversion rate | Clicks are easy to measure but reward attention-grabbing rather than helpful ads; conversion rate is more aligned but sparser |
| Information retrieval | Editorial relevance grade | BM25 ranking, prior model output | Cheap and abundant; can lock in the biases of the existing ranker |
| Medical imaging | Radiologist's diagnosis from the image | Concepts extracted from the free-text radiology report (e.g., the CheXpert labeler over 224,316 Stanford chest X-rays) | Reports are written by radiologists who saw the image, so the labels are clinically grounded; uncertainty in the report leaks through as label noise |
| Self-driving | The right action to take now | The action a human driver actually took, or the future trajectory rolled out from logs | Logs are abundant; humans sometimes act badly, and counterfactual actions are unobserved |
| Recommender systems | Long-term user satisfaction | Next-click prediction, watch time, ratings | Clicks and watch time are dense and immediate; they often reward addictive content over satisfying content |
| Large language model fine-tuning | What a high-quality response looks like | Outputs of a stronger LLM (e.g., GPT-4 labeling data for a smaller model), rationales from chain-of-thought distillation | Cheap and scalable; the student inherits the teacher's biases and mistakes |
| Named entity recognition (biomedical) | Whether a span names a gene, drug, or disease | Dictionary matches against UMLS or other knowledge bases | Excellent precision on in-dictionary terms; misses synonyms, abbreviations, and novel entities |
These examples share a pattern. The team has a target they cannot directly observe at scale and a related signal they can. They train on the related signal and hope the resulting model generalizes to the target. Sometimes it does. Sometimes the proxy gap eats them alive.
Proxy labels are useful precisely because they are imperfect, so a serious training pipeline has to plan for the imperfections. The main failure modes:
If you optimize a proxy hard enough, you will eventually hurt the true objective. This is Goodhart's law in its machine learning form. Leo Gao, John Schulman, and Jacob Hilton's 2023 paper "Scaling Laws for Reward Model Overoptimization" measured this directly for reinforcement learning from human feedback: as a policy optimizes a learned reward model harder, the proxy reward keeps going up while the true reward (from a much larger gold reward model) eventually turns down. Karwowski, Skalse, and collaborators' 2023 paper "Goodhart's Law in Reinforcement Learning" gave a geometric account of why this happens in Markov decision processes and proposed early-stopping rules to avoid the worst of it. The same dynamic shows up in supervised learning whenever the proxy label and the true target diverge.
Proxy labels are often systematically biased rather than randomly noisy. A search ranker trained on clicks will preferentially show items that historically got clicked, perpetuating the position bias and selection bias of the original system. A recidivism predictor trained on rearrest data inherits the policing patterns that produced the rearrests. The trained model can amplify the bias in its own deployment data, creating feedback loops that are hard to detect from inside the system.
Heuristic-based proxies tend to have high precision and low recall. A regex catches obvious cases and misses the rest. A model trained only on the cases the regex labels will systematically underperform on the cases it missed. Practitioners often combine multiple weak labelers with overlapping coverage, then use a Snorkel-style generative model to aggregate their votes.
A whole subfield exists to handle noisy training labels. Benoît Frénay and Michel Verleysen's 2014 IEEE survey "Classification in the Presence of Label Noise" grouped the methods into three buckets: noise-robust algorithms (loss functions that are inherently less sensitive), noise cleansing (find and fix the wrong labels), and noise-tolerant methods (model the noise process explicitly during training). Confident learning, introduced by Curtis Northcutt, Lu Jiang, and Isaac Chuang in their 2021 JAIR paper, falls into the cleansing bucket: estimate the joint distribution of noisy and true labels using out-of-sample predicted probabilities, then prune the examples most likely to be mislabeled. The method is the basis for the cleanlab library.
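In code, the confident-learning workflow looks roughly like this, assuming the cleanlab 2.x API; note the cross-validated probabilities, which keep the estimates out-of-sample:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# Toy data with injected label noise; X and labels are stand-ins.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
labels = (X[:, 0] > 0).astype(int)
labels[:10] = 1 - labels[:10]  # flip a few labels on purpose

# Confident learning needs out-of-sample predicted probabilities.
pred_probs = cross_val_predict(
    LogisticRegression(), X, labels, cv=5, method="predict_proba"
)

# Indices of the examples most likely to be mislabeled, worst first.
issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issues[:10])
```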
The one habit that prevents most disasters is keeping a small, carefully labeled gold evaluation set that the model never trains on. Proxy-trained models can score perfectly on the proxy and still fail on the gold set; this gap is the proxy gap, made visible. If you do not have a gold set, you cannot tell whether your model is learning the proxy or the real task.
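The check itself is tiny once the gold set exists; everything here is illustrative, with `model` exposing a scikit-learn-style `predict`:

```python
def proxy_gap(model, X_proxy, y_proxy, X_gold, y_gold):
    """Accuracy on a proxy-labeled holdout minus accuracy on the gold set.
    A large positive gap means the model learned the proxy, not the task."""
    acc = lambda X, y: (model.predict(X) == y).mean()
    return acc(X_proxy, y_proxy) - acc(X_gold, y_gold)
```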
Proxy labels became more important, not less, with the rise of large models. A few reasons.
First, the absolute scale of training data needed for current language and vision models exceeds anything humans can label by hand. GPT-class models train on trillions of tokens of text scraped from the web. The supervision signal is the next token, which is a proxy for many things at once: factual accuracy, reasoning ability, coherence, style. Next-token loss is the cheapest possible proxy for "good output," and that is largely why pretraining works at the scale it does.
Second, the same large models that benefit from proxy labels at pretraining time can produce proxy labels for downstream training. LLM-generated labels are now a standard ingredient in fine-tuning pipelines. Researchers comparing GPT-4-labeled training sets to human-labeled training sets have found that for many text classification tasks the resulting downstream models are roughly comparable, at a fraction of the cost. Anthropic, OpenAI, and Google all use stronger models to label data for smaller models, often combined with knowledge distillation on soft outputs.
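A hedged sketch of LLM-as-labeler; `complete` stands in for whichever client function your provider exposes and is hypothetical, as is the prompt:

```python
PROMPT = (
    "Classify the sentiment of this review as POSITIVE or NEGATIVE.\n"
    "Review: {text}\n"
    "Answer with exactly one word."
)

def llm_label(texts, complete):
    """Ask a stronger model for labels and treat its answers as proxy labels
    for fine-tuning a smaller model. Unparseable answers become None here;
    a real pipeline might retry them or route them to human review."""
    labels = []
    for text in texts:
        answer = complete(PROMPT.format(text=text)).strip().upper()
        labels.append(answer if answer in {"POSITIVE", "NEGATIVE"} else None)
    return labels
```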
Third, reinforcement learning from human feedback (RLHF) pushed the proxy-label problem to the foreground. The reward model trained on human preferences is a proxy for what humans actually want. The RL policy optimizes the reward model. The bigger the policy gets, the more aggressively it can exploit gaps between the reward model and human preferences. The whole alignment subfield is in part a response to the proxy-label problem at LLM scale.
A practitioner choosing among proxy-label tooling has a small but growing menu.
| Tool | Best for | Notes |
|---|---|---|
| Snorkel | Combining many heuristic labeling functions into a denoised training set | Original framework; commercial successor is Snorkel Flow |
| Skweak | Weak supervision for NLP sequence labeling and classification | Tightly integrated with spaCy |
| Cleanlab | Finding and fixing label errors in existing datasets | Implements confident learning; model-agnostic |
| Weasel | End-to-end weakly supervised learning, integrating label model and end model in a single training loop | PyTorch-based |
| HoloClean | Cleaning noisy labels and structured data via statistical inference | Aimed at databases more than ML |
| Refinery, Rubrix | Annotation interfaces with weak supervision support | Help domain experts iterate on labeling functions |
Many teams build internal tooling rather than adopt a framework, especially when their proxy-label sources are very specific to their product (clicks in their app, logs from their service). The conceptual machinery is the same regardless: define the sources, model their noise, denoise, train, and evaluate on a gold set.
Imagine you want to teach a robot how to recognize different types of animals. To do this, you need to show it lots of pictures of animals with the correct name attached to each picture. The problem is, you don't have time to look at all the pictures and name each animal yourself.
So you come up with a clever idea. You use the names other people have given to similar pictures as a shortcut. Maybe you grab captions from a website where people posted the photos, even though those captions are sometimes wrong. Or you ask a friend who is pretty good at animals (but not perfect) to label them for you. These names might not be perfect, but they are close enough to help the robot learn. This shortcut is what we call a "proxy label" in machine learning. It is not as good as the real, careful label, but it can save a lot of time and still help the robot learn quite well.
The trick is to remember the labels are imperfect. If the robot starts getting weirdly confident about something only because the shortcut labels were biased, you have to catch it. So you keep a small set of really carefully labeled pictures, ones you took the time to check yourself, and use those to test whether the robot is actually learning what you wanted, not just what the shortcut said.