See also: Machine learning terms
A rater is a person (or, increasingly, a model) who assigns labels, scores, or judgments to data items so those items can be used to train, evaluate, or align a machine learning system. The rater is the human (or proxy) who decides what counts as the correct answer, and almost every supervised model in production today rests on a pile of rater decisions made at some earlier stage. Without raters there is no labeled data, no benchmark, no RLHF reward model, and no way to tell whether a system actually behaves the way its developers claim.
The word is used loosely. The same role goes by different names depending on the field, the task, and the company doing the hiring. In computer vision the person drawing bounding boxes is usually called a labeler; in NLP, an annotator; in information retrieval, an assessor; in LLM evaluation, a judge. Search teams call them quality raters. The underlying job is the same: look at an item, apply a guideline, and produce a structured response.
Different communities settled on different vocabulary, often before the fields started talking to each other.
| Term | Common usage | Typical setting |
|---|---|---|
| Rater | Generic, plus Google Search and IR | Search ranking, ad quality, RLHF preference data |
| Annotator | NLP, linguistics | Tagging entities, sentiment, syntactic structure |
| Labeler | Computer vision, ML engineering | Bounding boxes, segmentation masks, classification labels |
| Coder | Content analysis, social science | Coding open-ended survey responses or media content |
| Judge | LLM evaluation, debate research | Scoring model outputs, ranking pairwise preferences |
| Assessor | TREC and IR evaluations | Marking documents relevant or non-relevant to a query |
| Grader | Education, exam scoring | Marking student work against a rubric |
| Quality rater | Google Search Quality program | Evaluating search result pages against the Search Quality Rater Guidelines |
The boundaries blur in practice. A Scale AI worker producing preference rankings for a frontier model might be called a rater, an annotator, or a labeler in different documents on the same day.
Not every rater is the same kind of person, and the choice of pool drives both quality and cost.
Domain experts. Radiologists labeling chest X-rays, lawyers tagging contract clauses, biologists identifying species in camera trap photos. Expert raters are slow and expensive but produce labels in domains where lay annotators cannot reliably tell what they are looking at. Medical imaging datasets like CheXpert leaned on board-certified radiologists for the test sets specifically because crowd labels were not trustworthy enough.
Trained crowdworkers. Vetted contractors who go through onboarding and qualification tasks before they are allowed to work on production data. Surge AI, iMerit, and Scale AI all rely heavily on trained pools because frontier-lab work involves nuanced rubrics that cannot be picked up in five minutes.
Untrained crowdworkers. Anyone who picks up a task on an open marketplace like Amazon Mechanical Turk. Cheap and fast, but quality varies enormously and many tasks need redundancy and gold-standard checks to be useful.
End users. Implicit raters who never volunteered for the role. Clicks, dwell time, thumbs-up, star ratings, and the choice of which AI response to copy are all signals that get folded back into ranking models and reward models. End-user data is abundant and cheap but noisy and biased toward whatever the product surfaces.
LLMs as raters. Since 2023 it has become routine to use a strong language model as a stand-in for a human judge. The LLM-as-judge approach trades some quality for orders-of-magnitude lower cost and is now standard in benchmarks like MT-Bench and AlpacaEval.
Most rater work today flows through a handful of platforms. Each has a different posture on quality, pay, and the kind of task it is built for.
| Platform | Founded | Focus | Notes |
|---|---|---|---|
| Amazon Mechanical Turk | 2005 | Open marketplace for micro-tasks | The original; widely used in academic NLP and CV |
| Scale AI | 2016 | Managed labeling for frontier AI labs | ~240,000 contract workers; Meta took a 49% stake in 2025 |
| Surge AI | 2020 | RLHF and high-skill text labeling | ~50,000 vetted contractors; ~$1.2B revenue in 2024 |
| Toloka | 2014 | Global crowd; multilingual | Spun out of Yandex in 2024; ~20,000 monthly contributors |
| Appen | 1996 | Speech, search relevance, multilingual data | ~1M workers across 170+ countries |
| Lionbridge | 1996 | Localization and AI training data | Acquired by TELUS International in 2020 |
| iMerit | 2012 | Domain-expert labeling | Medical, geospatial, autonomous driving |
| Prolific | 2014 | Academic and survey research | Strong on demographic filtering and fair pay |
| Labelbox | 2018 | Labeling tooling and managed services | Software platform plus on-demand workforce |
| Snorkel | 2019 | Programmatic labeling | Reduces reliance on raters via labeling functions |
A handful of rater programs have shaped the systems most people interact with every day.
Google Search Quality Raters. Google has used internal quality raters since around 2005 and published the first complete public version of the Search Quality Rater Guidelines in November 2015. The guidelines run to roughly 180 pages today and define the E-E-A-T framework (Experience, Expertise, Authoritativeness, Trust), which extended the earlier E-A-T in December 2022. As of 2020 Google reported more than 10,000 raters working on the program, and external estimates put the current number near 16,000. Quality raters do not directly change rankings; they evaluate sample result pages, and Google uses the aggregated scores to test ranking changes.
ImageNet labelers. Fei-Fei Li and her team built ImageNet between 2007 and 2010 by routing more than 160 million candidate images through Mechanical Turk. Roughly 49,000 workers in 167 countries did the labeling, and by 2012 ImageNet was the largest academic user of Mechanical Turk in the world. The final dataset had about 14 million labeled images across 22,000 categories, and it is the dataset on which the AlexNet result, and most of the modern computer vision boom, was built.
LMArena human raters. Chatbot Arena (now LMArena) collects pairwise preferences from anonymous public visitors who blind-compare two model outputs and pick a winner. The aggregated Elo ratings have become a de facto leaderboard for frontier LLMs and are widely cited by labs in their model release notes.
Wikipedia editors. Not raters in the formal sense, but Wikipedia's volunteer editors produce one of the largest human-curated text corpora in existence and one that nearly every modern LLM trains on. The community's content policies effectively act as a labeling guideline for the rest of the internet.
RLHF labelers at OpenAI, Anthropic, and Google. The reward models that shape ChatGPT, Claude, and Gemini were initially trained on tens of thousands of human preference comparisons, collected from contracted rater pools. The exact pool composition is rarely disclosed publicly.
Rater output is noisy by default. A few standard techniques are used to push the noise down to a usable level.
Gold-standard questions. Items where the correct answer is already known are sprinkled into the workload. Raters whose accuracy on the golds drops below a threshold get retrained or removed from the pool.
Qualification tasks. A short scored test that workers must pass before they can take paid work on a project. Used aggressively on Mechanical Turk and Toloka because the worker pool is otherwise unfiltered.
Multiple raters per item. The same item is shown to three or five raters and the labels are aggregated by majority vote, by weighted vote, or by an expectation-maximization model that estimates each rater's reliability and the latent true label simultaneously.
Adjudication. When raters disagree past a threshold, a senior rater or domain expert reviews the item and decides the ground truth. Common in medical and legal labeling.
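A minimal sketch of how these pieces fit together, assuming a simple schema of (rater, item, label) records; the field names and the 0.85 / 0.6 thresholds are illustrative rather than drawn from any particular platform:

```python
from collections import defaultdict

GOLD_ACCURACY_FLOOR = 0.85   # raters below this gold accuracy get retrained or removed
AGREEMENT_FLOOR = 0.6        # items below this consensus share go to a senior rater

def rater_reliability(responses, gold_answers):
    """responses: list of (rater_id, item_id, label); gold_answers: {item_id: label}."""
    correct, seen = defaultdict(int), defaultdict(int)
    for rater, item, label in responses:
        if item in gold_answers:               # only the seeded gold items are scored
            seen[rater] += 1
            correct[rater] += int(label == gold_answers[item])
    return {r: correct[r] / seen[r] for r in seen}

def flag_for_retraining(reliability, floor=GOLD_ACCURACY_FLOOR):
    """Raters whose gold accuracy falls below the floor are flagged."""
    return [r for r, acc in reliability.items() if acc < floor]

def aggregate_item(rater_labels, reliability):
    """rater_labels: {rater_id: label} for one item. Returns (label, needs_adjudication)."""
    scores = defaultdict(float)
    for rater, label in rater_labels.items():
        scores[label] += reliability.get(rater, 0.5)   # unseen raters get a neutral weight
    winner = max(scores, key=scores.get)
    consensus = scores[winner] / sum(scores.values())
    return winner, consensus < AGREEMENT_FLOOR          # low consensus -> adjudication queue
```

A Dawid-Skene-style EM model goes further, estimating rater reliabilities and latent true labels jointly rather than anchoring reliability to gold items alone.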
Inter-rater agreement metrics. Quantitative measures of how consistently the pool labels the same items.
| Metric | What it measures | Typical use |
|---|---|---|
| Percent agreement | Raw fraction of items where raters agree | Quick sanity check; ignores chance agreement |
| Cohen's kappa | Two-rater agreement, chance-corrected | Pairs of raters on nominal labels |
| Fleiss' kappa | Multi-rater extension of Cohen's kappa | Fixed number of raters per item |
| Krippendorff's alpha | Multi-rater, handles missing data and ordinal/interval scales | Content analysis, complex annotation projects |
| Scott's pi | Two-rater, slightly different chance correction than kappa | Older content analysis literature |
A Cohen's kappa between 0.61 and 0.80 is usually considered substantial agreement, and 0.81 or higher almost perfect. Krippendorff's alpha at or above 0.80 is the conventional acceptable threshold for drawing conclusions from coded data.
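As a concrete illustration, a minimal Cohen's kappa computation over two raters' label lists looks like the sketch below; production projects would typically reach for tested implementations (for example in scikit-learn or statsmodels) instead.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items, in order."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    # Expected agreement if both raters labeled independently at their own base rates.
    expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(a, b), 3))  # 0.333 on this toy data: observed 0.667, expected 0.5
```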
The people doing the labeling shape the model that learns from their labels, and that effect is bigger than it looks. Mor Geva, Yoav Goldberg, and Jonathan Berant showed in their 2019 EMNLP paper Are We Modeling the Task or the Annotator? that NLU models pick up on the writing patterns of specific annotators in the training set and fail to generalize to examples produced by annotators they have not seen. The headline recommendation was that test-set annotators should be disjoint from training-set annotators, which most benchmarks were not doing at the time.
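A minimal sketch of that recommendation, assuming each example carries an annotator identifier (the `annotator_id` field is illustrative):

```python
import random

def annotator_disjoint_split(examples, test_fraction=0.2, seed=0):
    """Hold out entire annotators so the test set shares no annotators with training."""
    annotators = sorted({ex["annotator_id"] for ex in examples})
    random.Random(seed).shuffle(annotators)
    n_test = max(1, int(len(annotators) * test_fraction))
    test_annotators = set(annotators[:n_test])
    train = [ex for ex in examples if ex["annotator_id"] not in test_annotators]
    test = [ex for ex in examples if ex["annotator_id"] in test_annotators]
    return train, test
```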
The broader lesson is that crowdsourced datasets reflect their crowd. If your annotator pool is concentrated in one country, one age range, or one political demographic, your model will inherit those defaults. RLHF preference data is especially exposed to this because the rater is making subjective calls about helpfulness and tone, not labeling something with a verifiable ground truth.
Using a strong language model as a rater for other model outputs has become standard practice since the publication of Lianmin Zheng et al.'s 2023 NeurIPS paper Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. The paper showed that GPT-4 acting as judge agreed with human raters about 80 percent of the time on MT-Bench and Chatbot Arena questions, which was roughly the same rate at which human raters agreed with each other. That result made it cheap to run automated evaluations at scale, and most modern leaderboards (MT-Bench, AlpacaEval, Arena-Hard) lean heavily on judge models.
The approach has known failure modes. LLM judges show position bias (preferring the first answer presented), verbosity bias (preferring longer answers regardless of quality), and self-enhancement bias (judging outputs from their own family more favorably). Mitigations include randomizing answer order, controlling for length, and using multiple judges from different model families.
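A minimal sketch of the order-swapping mitigation, assuming a `call_judge` wrapper around whatever judge model is in use that returns "A", "B", or "tie" for the answers as presented:

```python
def judge_pair(question, answer_1, answer_2, call_judge):
    """Judge the pair in both orders and only accept a verdict the judge gives consistently."""
    first = call_judge(question, answer_1, answer_2)   # answer_1 presented as "A"
    second = call_judge(question, answer_2, answer_1)  # order swapped
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"  # inconsistent or tied verdicts are treated as a tie
```

Counting only verdicts that survive the swap, and treating everything else as a tie, is in the spirit of how the MT-Bench authors control for position bias.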
RLHF is the most rater-intensive part of modern LLM training. The standard pipeline runs roughly like this: raters write demonstration responses used for supervised fine-tuning; raters then rank or compare several model outputs for each prompt; a reward model is trained on those preference comparisons; and the policy is optimized against the reward model with reinforcement learning, typically PPO (or the preferences are used directly, as in DPO).
The quality of rater output bottlenecks every step downstream. A reward model trained on inconsistent or biased preferences will steer the policy toward whatever the raters happened to prefer, including superficial qualities like response length or formatting flourishes that look authoritative without being correct.
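To make that dependence concrete, the reward model at the heart of the pipeline is usually trained with a pairwise Bradley-Terry style loss over the rater's chosen and rejected responses; in this sketch `reward_model` is a placeholder for a network that maps tokenized prompt-plus-response batches to scalar scores:

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    # Scalar reward per example for the response the rater preferred vs. rejected.
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    # Push the chosen response above the rejected one:
    # loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```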
The rater workforce that powers modern AI is largely contract labor in low-wage countries, and the conditions have repeatedly been the subject of investigative reporting. A January 2023 TIME investigation by Billy Perrigo revealed that OpenAI had contracted with the San Francisco firm Sama to have Kenyan workers label graphic content (including descriptions of child sexual abuse, bestiality, torture, and suicide) so that ChatGPT could be trained to refuse such content. Take-home pay was between roughly $1.32 and $2 per hour. Multiple workers reported lasting psychological harm from the material; Sama cancelled the OpenAI contract in February 2022, eight months ahead of schedule. Kenyan content moderators later petitioned their parliament for stronger protections.
Scale AI has also faced criticism. A 2024 lawsuit alleged that contractors regularly experienced delayed, reduced, or cancelled payments after completing assignments, and Scale's 240,000-strong contractor pool has been described in court filings as poorly insulated from the psychological costs of frontier AI work. The general pattern is that the labeling labor sits at the bottom of the AI value chain and captures very little of the value the labels create.
Frontier AI development as of 2026 leans on raters in a few specific places. Post-training pipelines need preference data for RLHF and direct preference optimization. Safety evaluations need raters to score model behavior against red-team prompts and to grade refusals. Domain-specific deployments (medical, legal, code) need expert raters to build evaluation suites the labs themselves cannot internally produce. Active learning loops use rater attention selectively, queueing only the items the model is least sure about for human review rather than labeling everything.
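That selective use of rater attention is, at its simplest, uncertainty sampling; in this sketch `predict_proba` stands in for whatever model produces per-label probabilities for an unlabeled item:

```python
import math

def queue_for_review(items, predict_proba, budget=100):
    """Send only the `budget` items the model is least sure about to human raters."""
    def entropy(probs):
        return -sum(p * math.log(p) for p in probs.values() if p > 0)
    scored = [(entropy(predict_proba(item)), item) for item in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most uncertain first
    return [item for _, item in scored[:budget]]
```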
The industry has not yet figured out a sustainable model for this work. LLM judges absorb some of the demand, but the human signal at the top of the funnel is still what defines what a good response looks like. Whoever the rater is, in whatever country, on whatever pay scale, is implicitly setting the values that the eventual model will optimize for.