See also: Machine learning terms
A rater is a person (or, increasingly, a model) who assigns labels, scores, or judgments to data items so those items can be used to train, evaluate, or align a machine learning system. The rater is the human (or proxy) who decides what counts as the correct answer, and almost every supervised model in production today rests on a pile of rater decisions made at some earlier stage. Without raters there is no labeled data, no benchmark, no RLHF reward model, and no way to tell whether a system actually behaves the way its developers claim.
The word is used loosely. The same role goes by different names depending on the field, the task, and the company doing the hiring. In computer vision the person drawing bounding boxes is usually called a labeler; in NLP, an annotator; in information retrieval, an assessor; in LLM evaluation, a judge. Search teams call them quality raters. The underlying job is the same: look at an item, apply a guideline, and produce a structured response.
Different communities settled on different vocabulary, often before the fields started talking to each other.
| Term | Common usage | Typical setting |
|---|---|---|
| Rater | Generic, plus Google Search and IR | Search ranking, ad quality, RLHF preference data |
| Annotator | NLP, linguistics | Tagging entities, sentiment, syntactic structure |
| Labeler | Computer vision, ML engineering | Bounding boxes, segmentation masks, classification labels |
| Coder | Content analysis, social science | Coding open-ended survey responses or media content |
| Judge | LLM evaluation, debate research | Scoring model outputs, ranking pairwise preferences |
| Assessor | TREC and IR evaluations | Marking documents relevant or non-relevant to a query |
| Grader | Education, exam scoring | Marking student work against a rubric |
| Quality rater | Google Search Quality program | Evaluating search result pages against the Search Quality Rater Guidelines |
The boundaries blur in practice. A Scale AI worker producing preference rankings for a frontier model might be called a rater, an annotator, or a labeler in different documents on the same day.
Not every rater is the same kind of person, and the choice of pool drives both quality and cost.
Domain experts. Radiologists labeling chest X-rays, lawyers tagging contract clauses, biologists identifying species in camera trap photos. Expert raters are slow and expensive but produce labels in domains where lay annotators cannot reliably tell what they are looking at. Medical imaging datasets like CheXpert leaned on board-certified radiologists for the test sets specifically because crowd labels were not trustworthy enough.
Trained crowdworkers. Vetted contractors who go through onboarding and qualification tasks before they are allowed to work on production data. Surge AI, iMerit, and Scale AI all rely heavily on trained pools because frontier-lab work involves nuanced rubrics that cannot be picked up in five minutes.
Untrained crowdworkers. Anyone who picks up a task on an open marketplace like Amazon Mechanical Turk. Cheap and fast, but quality varies enormously and many tasks need redundancy and gold-standard checks to be useful.
End users. Implicit raters who never volunteered for the role. Clicks, dwell time, thumbs-up, star ratings, and the choice of which AI response to copy are all signals that get folded back into ranking models and reward models. End-user data is abundant and cheap but noisy and biased toward whatever the product surfaces.
LLMs as raters. Since 2023 it has become routine to use a strong language model as a stand-in for a human judge. The LLM-as-judge approach trades some quality for orders-of-magnitude lower cost and is now standard in benchmarks like MT-Bench and AlpacaEval.
Most rater work today flows through a handful of platforms. Each has a different posture on quality, pay, and the kind of task it is built for.
| Platform | Founded | Focus | Notes |
|---|---|---|---|
| Amazon Mechanical Turk | 2005 | Open marketplace for micro-tasks | The original; widely used in academic NLP and CV |
| Scale AI | 2016 | Managed labeling for frontier AI labs | ~240,000 contract workers; Meta took a 49% stake in 2025 |
| Surge AI | 2020 | RLHF and high-skill text labeling | ~50,000 vetted contractors; ~$1.2B revenue in 2024 |
| Toloka | 2014 | Global crowd; multilingual | Spun out of Yandex in 2024; ~20,000 monthly contributors |
| Appen | 1996 | Speech, search relevance, multilingual data | ~1M workers across 170+ countries |
| Lionbridge | 1996 | Localization and AI training data | Acquired by TELUS International in 2020 |
| iMerit | 2012 | Domain-expert labeling | Medical, geospatial, autonomous driving |
| Prolific | 2014 | Academic and survey research | Strong on demographic filtering and fair pay |
| Labelbox | 2018 | Labeling tooling and managed services | Software platform plus on-demand workforce |
| Snorkel | 2019 | Programmatic labeling | Reduces reliance on raters via labeling functions |
A handful of rater programs have shaped the systems most people interact with every day.
Google Search Quality Raters. Google has used internal quality raters since around 2005 and published the first complete public version of the Search Quality Rater Guidelines in November 2015. The guidelines run to roughly 180 pages today and define the E-E-A-T framework (Experience, Expertise, Authoritativeness, Trust), which extended the earlier E-A-T in December 2022. As of 2020 Google reported more than 10,000 raters working on the program, and external estimates put the current number near 16,000. Quality raters do not directly change rankings; they evaluate sample result pages, and Google uses the aggregated scores to test ranking changes.
ImageNet labelers. Fei-Fei Li and her team built ImageNet between 2007 and 2010 by routing more than 160 million candidate images through Mechanical Turk. Roughly 49,000 workers in 167 countries did the labeling, and by 2012 ImageNet was the largest academic user of Mechanical Turk in the world. The final dataset had about 14 million labeled images across 22,000 categories, and it is the dataset on which the AlexNet result, and most of the modern computer vision boom, was built.
LMArena human raters. Chatbot Arena (now LMArena) collects pairwise preferences from anonymous public visitors who blind-compare two model outputs and pick a winner. The aggregated Elo ratings have become a de facto leaderboard for frontier LLMs and are widely cited by labs in their model release notes.
Wikipedia editors. Not raters in the formal sense, but Wikipedia's volunteer editors produce one of the largest human-curated text corpora in existence and one that nearly every modern LLM trains on. The community's content policies effectively act as a labeling guideline for the rest of the internet.
RLHF labelers at OpenAI, Anthropic, and Google. The reward models that shape ChatGPT, Claude, and Gemini were initially trained on tens of thousands of human preference comparisons, collected from contracted rater pools. The exact pool composition is rarely disclosed publicly.
Rater output is noisy by default. A few standard techniques are used to push the noise down to a usable level.
Gold-standard questions. Items where the correct answer is already known are sprinkled into the workload. Raters whose accuracy on the golds drops below a threshold get retrained or removed from the pool.
Qualification tasks. A short scored test that workers must pass before they can take paid work on a project. Used aggressively on Mechanical Turk and Toloka because the worker pool is otherwise unfiltered.
Multiple raters per item. The same item is shown to three or five raters and the labels are aggregated by majority vote, by weighted vote, or by an expectation-maximization model that estimates each rater's reliability and the latent true label simultaneously.
Adjudication. When raters disagree past a threshold, a senior rater or domain expert reviews the item and decides the ground truth. Common in medical and legal labeling.
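A minimal sketch of how these pieces fit together, assuming a simple schema of (rater, item, label) records; the field names and the 0.85 / 0.6 thresholds are illustrative rather than drawn from any particular platform:

```python
from collections import defaultdict

GOLD_ACCURACY_FLOOR = 0.85   # raters below this gold accuracy get retrained or removed
AGREEMENT_FLOOR = 0.6        # items below this consensus share go to a senior rater

def rater_reliability(responses, gold_answers):
    """responses: list of (rater_id, item_id, label); gold_answers: {item_id: label}."""
    correct, seen = defaultdict(int), defaultdict(int)
    for rater, item, label in responses:
        if item in gold_answers:               # only the seeded gold items are scored
            seen[rater] += 1
            correct[rater] += int(label == gold_answers[item])
    return {r: correct[r] / seen[r] for r in seen}

def flag_for_retraining(reliability, floor=GOLD_ACCURACY_FLOOR):
    """Raters whose gold accuracy falls below the floor are flagged."""
    return [r for r, acc in reliability.items() if acc < floor]

def aggregate_item(rater_labels, reliability):
    """rater_labels: {rater_id: label} for one item. Returns (label, needs_adjudication)."""
    scores = defaultdict(float)
    for rater, label in rater_labels.items():
        scores[label] += reliability.get(rater, 0.5)   # unseen raters get a neutral weight
    winner = max(scores, key=scores.get)
    consensus = scores[winner] / sum(scores.values())
    return winner, consensus < AGREEMENT_FLOOR          # low consensus -> adjudication queue
```

A Dawid-Skene-style EM model goes further, estimating rater reliabilities and latent true labels jointly rather than anchoring reliability to gold items alone.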
Inter-rater agreement metrics. Quantitative measures of how consistently the pool labels the same items.
| Metric | What it measures | Typical use |
|---|---|---|
| Percent agreement | Raw fraction of items where raters agree | Quick sanity check; ignores chance agreement |
| Cohen's kappa | Two-rater agreement, chance-corrected | Pairs of raters on nominal labels |
| Fleiss' kappa | Multi-rater extension of Cohen's kappa | Fixed number of raters per item |
| Krippendorff's alpha | Multi-rater, handles missing data and ordinal/interval scales | Content analysis, complex annotation projects |
| Scott's pi | Two-rater, slightly different chance correction than kappa | Older content analysis literature |
A Cohen's kappa between 0.61 and 0.80 is usually considered substantial agreement, and 0.81 or higher almost perfect. Krippendorff's alpha at or above 0.80 is the conventional acceptable threshold for drawing conclusions from coded data.
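As a concrete illustration, a minimal Cohen's kappa computation over two raters' label lists looks like the sketch below; production projects would typically reach for tested implementations (for example in scikit-learn or statsmodels) instead.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items, in order."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    # Expected agreement if both raters labeled independently at their own base rates.
    expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(a, b), 3))  # 0.333 on this toy data: observed 0.667, expected 0.5
```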
The people doing the labeling shape the model that learns from their labels, and that effect is bigger than it looks. Mor Geva, Yoav Goldberg, and Jonathan Berant showed in their 2019 EMNLP paper Are We Modeling the Task or the Annotator? that NLU models pick up on the writing patterns of specific annotators in the training set and fail to generalize to examples produced by annotators they have not seen. The headline recommendation was that test-set annotators should be disjoint from training-set annotators, which most benchmarks were not doing at the time.
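A minimal sketch of that recommendation, assuming each example carries an annotator identifier (the `annotator_id` field is illustrative):

```python
import random

def annotator_disjoint_split(examples, test_fraction=0.2, seed=0):
    """Hold out entire annotators so the test set shares no annotators with training."""
    annotators = sorted({ex["annotator_id"] for ex in examples})
    random.Random(seed).shuffle(annotators)
    n_test = max(1, int(len(annotators) * test_fraction))
    test_annotators = set(annotators[:n_test])
    train = [ex for ex in examples if ex["annotator_id"] not in test_annotators]
    test = [ex for ex in examples if ex["annotator_id"] in test_annotators]
    return train, test
```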
The broader lesson is that crowdsourced datasets reflect their crowd. If your annotator pool is concentrated in one country, one age range, or one political demographic, your model will inherit those defaults. RLHF preference data is especially exposed to this because the rater is making subjective calls about helpfulness and tone, not labeling something with a verifiable ground truth.
Using a strong language model as a rater for other model outputs has become standard practice since the publication of Lianmin Zheng et al.'s 2023 NeurIPS paper Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. The paper showed that GPT-4 acting as judge agreed with human raters about 80 percent of the time on MT-Bench and Chatbot Arena questions, which was roughly the same rate at which human raters agreed with each other. That result made it cheap to run automated evaluations at scale, and most modern leaderboards (MT-Bench, AlpacaEval, Arena-Hard) lean heavily on judge models.
The approach has known failure modes. LLM judges show position bias (preferring the first answer presented), verbosity bias (preferring longer answers regardless of quality), and self-enhancement bias (judging outputs from their own family more favorably). Mitigations include randomizing answer order, controlling for length, and using multiple judges from different model families.
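A minimal sketch of the order-swapping mitigation, assuming a `call_judge` wrapper around whatever judge model is in use that returns "A", "B", or "tie" for the answers as presented:

```python
def judge_pair(question, answer_1, answer_2, call_judge):
    """Judge the pair in both orders and only accept a verdict the judge gives consistently."""
    first = call_judge(question, answer_1, answer_2)   # answer_1 presented as "A"
    second = call_judge(question, answer_2, answer_1)  # order swapped
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"  # inconsistent or tied verdicts are treated as a tie
```

Counting only verdicts that survive the swap, and treating everything else as a tie, is in the spirit of how the MT-Bench authors control for position bias.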
RLHF is the most rater-intensive part of modern LLM training. The standard pipeline runs roughly like this: raters write demonstration responses used for supervised fine-tuning; raters then rank or compare several model outputs for each prompt; a reward model is trained on those preference comparisons; and the policy is optimized against the reward model with reinforcement learning, typically PPO (or the preferences are used directly, as in DPO).
The quality of rater output bottlenecks every step downstream. A reward model trained on inconsistent or biased preferences will steer the policy toward whatever the raters happened to prefer, including superficial qualities like response length or formatting flourishes that look authoritative without being correct.
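To make that dependence concrete, the reward model at the heart of the pipeline is usually trained with a pairwise Bradley-Terry style loss over the rater's chosen and rejected responses; in this sketch `reward_model` is a placeholder for a network that maps tokenized prompt-plus-response batches to scalar scores:

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    # Scalar reward per example for the response the rater preferred vs. rejected.
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    # Push the chosen response above the rejected one:
    # loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```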
The rater workforce that powers modern AI is largely contract labor in low-wage countries, and the conditions have repeatedly been the subject of investigative reporting. A January 2023 TIME investigation by Billy Perrigo revealed that OpenAI had contracted with the San Francisco firm Sama to have Kenyan workers label graphic content (including descriptions of child sexual abuse, bestiality, torture, and suicide) so that ChatGPT could be trained to refuse such content. Take-home pay was between roughly $1.32 and $2 per hour. Multiple workers reported lasting psychological harm from the material; Sama cancelled the OpenAI contract in February 2022, eight months ahead of schedule. Kenyan content moderators later petitioned their parliament for stronger protections.
Scale AI has also faced criticism. A 2024 lawsuit alleged that contractors regularly experienced delayed, reduced, or cancelled payments after completing assignments, and Scale's 240,000-strong contractor pool has been described in court filings as poorly insulated from the psychological costs of frontier AI work. The general pattern is that the labeling labor sits at the bottom of the AI value chain and captures very little of the value the labels create.
Frontier AI development as of 2026 leans on raters in a few specific places. Post-training pipelines need preference data for RLHF and direct preference optimization. Safety evaluations need raters to score model behavior against red-team prompts and to grade refusals. Domain-specific deployments (medical, legal, code) need expert raters to build evaluation suites the labs themselves cannot internally produce. Active learning loops use rater attention selectively, queueing only the items the model is least sure about for human review rather than labeling everything.
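That selective use of rater attention is, at its simplest, uncertainty sampling; in this sketch `predict_proba` stands in for whatever model produces per-label probabilities for an unlabeled item:

```python
import math

def queue_for_review(items, predict_proba, budget=100):
    """Send only the `budget` items the model is least sure about to human raters."""
    def entropy(probs):
        return -sum(p * math.log(p) for p in probs.values() if p > 0)
    scored = [(entropy(predict_proba(item)), item) for item in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most uncertain first
    return [item for _, item in scored[:budget]]
```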
The industry has not yet figured out a sustainable model for this work. LLM judges absorb some of the demand, but the human signal at the top of the funnel is still what defines what a good response looks like. Whoever the rater is, in whatever country, on whatever pay scale, is implicitly setting the values that the eventual model will optimize for.