Data labeling (also called data annotation) is the process of attaching meaningful tags, labels, or metadata to raw data so that machine learning algorithms can interpret and learn from it. The labeled data serves as ground truth during supervised learning, where a model learns to map inputs to outputs by studying examples that humans have already classified. Without labeled data, most modern AI systems, from image recognition models to large language models, cannot be trained effectively.
Data labeling applies to every major data modality: images, text, audio, video, and 3D point clouds. The work ranges from drawing rectangles around objects in photographs to rating which of two chatbot responses sounds more helpful. As AI adoption has accelerated, data labeling has grown into a multi-billion-dollar industry with dedicated platforms, managed workforces, and increasingly automated pipelines.
At its core, data labeling converts unstructured or semi-structured data into a format that statistical models can consume. A raw photograph, for example, contains pixel values but no information about what those pixels represent. A human annotator examines the photograph and marks relevant objects ("car," "pedestrian," "traffic light"), producing structured annotations that a computer vision model can use during training.
The purpose of labeling extends beyond initial model training. Labeled data is also used to evaluate model performance on held-out test sets, to fine-tune pre-trained models on domain-specific tasks, and to generate the preference comparisons needed for reinforcement learning from human feedback (RLHF). In production systems, newly collected data is often labeled on a rolling basis so that models can be retrained as distributions shift over time.
Object detection and image classification are among the most common computer vision tasks, and each requires a different style of annotation.
| Annotation type | Description | Typical use case |
|---|---|---|
| Bounding box | A rectangle drawn tightly around an object, defined by corner coordinates (x_min, y_min, x_max, y_max) | Object detection in autonomous vehicles, retail inventory |
| Polygon | A multi-vertex outline tracing the precise boundary of an object | Instance-level annotation where object shape matters |
| Semantic segmentation | Every pixel in the image is assigned to a class (e.g., road, sidewalk, sky) | Self-driving cars, robotics, medical imaging |
| Instance segmentation | Like semantic segmentation, but separate instances of the same class receive distinct labels | Counting overlapping objects, warehouse logistics |
| Keypoint | Specific landmark points placed on an object to capture its pose or structure | Pose estimation, facial landmark detection, gesture recognition |
| Cuboid (3D bounding box) | A three-dimensional box placed around an object in a 2D or 3D scene | LiDAR data for autonomous driving, robotics |
| Polyline | A series of connected line segments tracing linear features | Lane markings on roads, power lines, cracks in infrastructure |
Bounding boxes are the fastest to produce and remain the default for many object detection pipelines. Semantic and instance segmentation require pixel-level precision and cost more per image but yield richer training signals. Keypoints are standard for human pose estimation tasks, where the annotator places points on joints (shoulders, elbows, wrists, knees) to define a skeletal structure.
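Bounding box annotations stored as corner coordinates are easy to compare programmatically; quality-control pipelines often measure how closely an annotator's box matches a reference box using intersection over union (IoU). A minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (empty if the boxes are disjoint)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0
```

An IoU of 1.0 means the boxes coincide exactly; QA thresholds in the 0.7 to 0.9 range are common choices, though the exact cutoff is project-specific.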
Text annotation supports natural language processing (NLP) tasks. The main varieties include:
| Annotation type | Description | Typical use case |
|---|---|---|
| Named entity recognition (NER) | Identifying and classifying entities (people, organizations, locations, dates, monetary values) within text | Information extraction, knowledge graph construction |
| Sentiment analysis | Tagging text with positive, negative, or neutral sentiment labels | Customer feedback analysis, brand monitoring |
| Text classification | Assigning entire documents or passages to predefined categories | Spam detection, topic categorization, content moderation |
| Relation extraction | Labeling the relationships between identified entities | Biomedical literature mining, legal document analysis |
| Coreference resolution | Linking different mentions that refer to the same entity | Dialogue systems, document summarization |
| Part-of-speech tagging | Labeling each word with its grammatical role (noun, verb, adjective) | Parsing, grammar checking, linguistic research |
For NER, an annotator might read the sentence "Apple announced a $3 billion bond offering in London" and tag "Apple" as an organization, "$3 billion" as a monetary value, and "London" as a location. Sentiment annotation often uses Likert-scale ratings (1 to 5) or binary positive/negative labels.
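Entity annotations like these are commonly stored as character-offset spans over the raw text, a convention used by spaCy and many labeling tools. A minimal sketch using the example sentence above (the label names are illustrative):

```python
text = "Apple announced a $3 billion bond offering in London"

# Character-offset spans over the raw text; end offsets are exclusive
entities = [
    {"start": 0, "end": 5, "label": "ORG"},     # "Apple"
    {"start": 18, "end": 28, "label": "MONEY"}, # "$3 billion"
    {"start": 46, "end": 52, "label": "LOC"},   # "London"
]

def extract_spans(text, entities):
    """Return the surface text and label for each annotated span."""
    return [(text[e["start"]:e["end"]], e["label"]) for e in entities]
```

Storing offsets rather than the matched strings keeps annotations unambiguous when the same word appears more than once in a document.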
Audio annotation prepares data for speech recognition, speaker identification, and sound event detection.
| Annotation type | Description | Typical use case |
|---|---|---|
| Transcription | Converting spoken language into written text with timestamps | Voice assistants, meeting transcription |
| Speaker diarization | Segmenting an audio recording by speaker and labeling each segment ("Speaker A," "Speaker B") | Call center analytics, podcast indexing |
| Sound event detection | Tagging non-speech audio events (glass breaking, dog barking, siren) | Surveillance, environmental monitoring |
| Emotion and intent labeling | Classifying the emotional tone or intent behind a spoken utterance | Customer service routing, voice-based UX research |
| Phonetic annotation | Labeling individual phonemes or prosodic features | Linguistics research, text-to-speech training |
Speaker diarization answers the question "who spoke when?" and is useful in multi-party conversations. Accurate diarization requires at least 30 seconds of uninterrupted speech per speaker for reliable clustering, according to AssemblyAI's documentation.
Video annotation extends image annotation across time. Annotators may label individual frames or track objects across a sequence of frames using interpolation, where the tool automatically estimates object positions between manually annotated keyframes. Video annotation is central to training models for autonomous driving, sports analytics, and security surveillance.
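The interpolation step can be sketched as linear blending of box coordinates between two annotated keyframes; production tools may layer object tracking on top of this, but linear interpolation is the basic mechanism:

```python
def interpolate_box(box_start, box_end, frame, frame_start, frame_end):
    """Linearly interpolate a bounding box between two annotated keyframes."""
    t = (frame - frame_start) / (frame_end - frame_start)
    return tuple(a + t * (b - a) for a, b in zip(box_start, box_end))
```

An annotator who labels frames 0 and 10 gets the intermediate eight frames filled in automatically, which is why interpolation can cut video annotation effort substantially for smoothly moving objects.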
A variety of open-source and commercial platforms exist for managing labeling workflows. The table below compares several widely used options.
| Platform | Type | Supported data | Notable features | License / pricing |
|---|---|---|---|---|
| Label Studio | Open-source / Enterprise | Images, text, audio, video, time series | Configurable templates, REST API, ML backend integration, Python SDK | Apache 2.0 (community); paid Enterprise and Cloud tiers from HumanSignal |
| Labelbox | Commercial | Images, video, text, geospatial, documents | Model-assisted labeling, active learning, API-first design, Alignerr expert marketplace | Paid tiers; founded 2018, raised $189M+ |
| Scale AI | Commercial (managed service) | Images, video, text, LiDAR, documents, RLHF data | Managed human workforce, quality control, government and defense contracts, data engine for LLM evaluation | Enterprise pricing; founded 2016 by Alexandr Wang, valued at ~$13.8B in May 2024 (Series F) |
| Amazon SageMaker Ground Truth | Cloud service | Images, video, text, 3D point clouds | Active learning to reduce labeling cost by up to 70%, integration with Mechanical Turk (500,000+ workers), private or vendor workforce options | Pay-per-object pricing |
| Prodigy | Commercial (local) | Text, images, audio, video | Built by Explosion AI (makers of spaCy), runs entirely on local machines, scriptable recipes, active learning, LLM integration via spacy-llm | One-time license fee; data never leaves the user's machine |
| CVAT | Open-source | Images, video | Originally developed by Intel, now maintained by OpenCV; bounding boxes, polygons, cuboids, keyframes with interpolation, AI-assisted annotation via OpenVINO | MIT license |
| Encord | Commercial | Images, video, DICOM (medical) | Automated labeling, quality metrics, ontology management, HIPAA-compliant workflows | Paid tiers |
| V7 | Commercial | Images, video, documents | Auto-annotation with foundation models, pixel-accurate masks, dataset management | Paid tiers |
| SuperAnnotate | Commercial | Images, video, text, audio | Workforce management, quality assurance dashboards, model-assisted tools | Paid tiers |
| Snorkel AI | Commercial (programmatic labeling) | Text, tabular, images | Weak supervision via labeling functions, data programming paradigm, originated from Stanford research | Enterprise pricing |
Scale AI was founded in 2016 by Alexandr Wang and Lucy Guo through Y Combinator. Wang, who was 19 at the time, had dropped out of MIT after previously working as an engineer at Quora. The company provides data labeling services, model evaluation, and data infrastructure for AI development.
In May 2024, Scale AI raised $1 billion in a Series F round led by Accel, with participation from Amazon, Meta, NVIDIA, Intel Capital, and others. The round valued the company at approximately $13.8 billion, nearly double its previous valuation. Scale's annual recurring revenue tripled in 2023. In June 2025, Meta Platforms invested over $14 billion to acquire a 49% stake, pushing Scale AI's valuation to approximately $29 billion.
Scale AI's clients include major AI labs, the U.S. Department of Defense, and Fortune 500 companies. The company is known for its managed workforce model, where it handles recruiting, training, and quality control for annotators rather than leaving those tasks to the customer.
Label Studio is an open-source data labeling tool released under the Apache 2.0 license. It was created by Heartex (later rebranded as HumanSignal) and supports a wider range of data types than most alternatives, including images, text, audio, video, HTML, and time-series data. Its template system lets teams define custom labeling interfaces using XML-like configuration. HumanSignal offers a managed cloud Starter tier starting around $149 per month and a self-hosted Enterprise tier with SSO, role-based access control, and audit logs.
CVAT (Computer Vision Annotation Tool) was originally developed by Intel for internal use and later open-sourced under the MIT license. It is now maintained as part of the OpenCV project. CVAT specializes in image and video annotation, with built-in support for bounding boxes, polygons, polylines, cuboids, and keypoints. Its interpolation feature lets annotators label objects in a few keyframes and have the tool fill in intermediate frames automatically. CVAT integrates with Intel's OpenVINO toolkit for AI-assisted annotation.
Prodigy is a commercial annotation tool built by Explosion AI, the company behind the spaCy NLP library. Unlike cloud-based platforms, Prodigy runs entirely on the user's local machine, and no data is sent to third-party servers. This makes it suitable for sensitive or regulated data. Prodigy's design emphasizes efficiency: it uses active learning to surface the most informative examples and presents binary decision interfaces (accept/reject) that allow annotators to work quickly. In 2023, Prodigy added LLM integration through the spacy-llm library, enabling users to combine model predictions with human review.
Reinforcement learning from human feedback (RLHF) has become a standard technique for aligning large language models with human preferences. The data labeling requirements for RLHF differ from traditional annotation tasks.
The RLHF pipeline typically involves three stages of data:

1. Demonstration data: human-written example responses to prompts, used for supervised fine-tuning of the base model.
2. Comparison data: human rankings of multiple model responses to the same prompt, used to train a reward model.
3. Reinforcement learning: the reward model scores new outputs during policy optimization, so this stage reuses the human judgments collected in stage 2 rather than requiring fresh labels at every step.
OpenAI's InstructGPT paper (2022) described this pipeline in detail. The process requires both human-generated text (for stage 1) and human preference judgments (for stage 2). Annotators often use Likert-scale ratings or pairwise ranking interfaces.
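Reward models trained on pairwise preference judgments commonly use a Bradley-Terry-style objective, which penalizes the model when the rejected response scores higher than the chosen one. A minimal sketch of the per-pair loss:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise (Bradley-Terry) loss: -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
```

The loss approaches zero as the reward gap in favor of the chosen response grows, and grows without bound when the model prefers the rejected response, which is what pushes the reward model to reproduce the annotators' rankings.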
Preference labeling introduces unique difficulties. Different evaluators may have varying interpretations of quality, helpfulness, or safety. A response that one evaluator considers helpful might strike another as verbose. These disagreements make inter-annotator agreement harder to achieve than in more objective tasks like bounding box annotation. To manage this, RLHF annotation teams use detailed rubrics, calibration sessions, and regular audits of annotator consistency.
The cost of RLHF annotation is also higher than standard labeling. Because evaluators must read and compare full text responses (sometimes multiple paragraphs long), throughput is lower and per-example costs can reach $50 to $100 for complex tasks involving domain expertise.
Organizations that need labeled data face a choice between crowd-sourced annotators and domain experts. Each approach has trade-offs.
Platforms like Amazon Mechanical Turk (MTurk), launched in 2005, pioneered the use of distributed online workers for annotation at scale. At its peak, MTurk had over 500,000 registered workers from more than 190 countries. Work is distributed as Human Intelligence Tasks (HITs), and requesters pay per completed task.
Crowd-sourced labeling is the cheapest option per unit. Hourly rates for offshore annotators typically range from $4 to $12, and simple tasks (image classification, sentiment tagging) can be completed quickly. However, quality control is a persistent challenge. A 2018 academic study found that the median hourly wage for MTurk workers was approximately $2 per hour, which raises questions about worker motivation and attention to detail. Many requesters compensate by collecting multiple annotations per item and using majority voting to determine the final label.
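The majority-voting step can be sketched as follows; returning ties as unresolved so they can be escalated, rather than decided arbitrarily, is a design choice of this sketch, not a universal convention:

```python
from collections import Counter

def majority_vote(labels):
    """Resolve redundant annotations for one item; None means no majority."""
    counts = Counter(labels)
    top, n = counts.most_common(1)[0]
    # A tie between two or more labels is flagged for adjudication
    if sum(1 for c in counts.values() if c == n) > 1:
        return None
    return top
```

With three annotations per item, a single careless annotator is outvoted; with only two, every disagreement becomes a tie, which is one reason odd redundancy counts are popular.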
For tasks requiring specialized knowledge, expert annotators produce higher-quality labels but at significantly greater cost. Medical imaging annotation, where a radiologist must identify tumors or anatomical structures in DICOM images, can cost $50 to $100 per hour. Legal document annotation, genomics data, and financial compliance labeling similarly demand trained professionals.
Managed labeling services (offered by companies like Scale AI, Appen, and Sama) sit between pure crowd-sourcing and in-house expert teams. These services typically charge $6 to $12 per hour for general tasks and handle annotator recruitment, training, and quality assurance on behalf of the client.
| Factor | Crowd-sourced | Expert | Managed service |
|---|---|---|---|
| Cost per hour | $2 to $12 | $25 to $100+ | $6 to $40 |
| Quality (typical) | Variable; requires QA layers | High; fewer errors | Moderate to high |
| Speed to scale | Fast (large worker pool) | Slow (limited supply) | Moderate |
| Best for | Simple, high-volume tasks | Domain-specific, high-stakes tasks | Teams without in-house annotation infrastructure |
| Drawbacks | Inconsistent quality, ethical concerns about low pay | Expensive, hard to recruit | Vendor lock-in, less control |
Label quality directly affects model performance. A model trained on noisy labels will learn noisy patterns. Several techniques are used to measure and maintain annotation quality.
Inter-annotator agreement (IAA) measures how consistently different annotators label the same data. The most common metric is Cohen's kappa, which accounts for agreement that would occur by chance:
κ = (Pr(a) - Pr(e)) / (1 - Pr(e))
where Pr(a) is the observed agreement between annotators and Pr(e) is the probability of agreement expected by chance. Kappa values are typically interpreted using the Landis and Koch scale:
| Kappa value | Interpretation |
|---|---|
| 0.81 to 1.00 | Almost perfect agreement |
| 0.61 to 0.80 | Substantial agreement |
| 0.41 to 0.60 | Moderate agreement |
| 0.21 to 0.40 | Fair agreement |
| 0.00 to 0.20 | Slight agreement |
| Below 0.00 | Less than chance agreement |
For tasks with more than two annotators, Fleiss' kappa generalizes the metric. Krippendorff's alpha is another option that handles missing data and works across different measurement scales (nominal, ordinal, interval, ratio).
Low IAA scores often indicate ambiguous labeling guidelines rather than poor annotator performance. When agreement drops below acceptable thresholds, teams typically revise their annotation instructions, add more examples to the guidelines, or hold calibration sessions.
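The kappa computation for two annotators follows directly from the formula above, estimating the chance-agreement term from each annotator's label frequencies:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels over the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    labels = set(a) | set(b)
    # Expected chance agreement from each annotator's marginal frequencies
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)
```

Note that two annotators can agree on half the items yet score a kappa of zero if that agreement is exactly what chance predicts, which is the point of the correction.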
Gold standard items are data points with known correct labels, verified by senior annotators or domain experts. These items are inserted randomly into each annotator's queue at a rate of roughly 5% to 10%. The annotator does not know which items are gold standards. If an annotator's accuracy on gold items falls below a threshold (often 90% to 95%), their work is flagged for review or they are removed from the task.
This approach provides ongoing quality signals without requiring every annotation to be checked. It also helps identify annotators who are rushing through tasks or misunderstanding the guidelines.
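A gold-standard check reduces to an accuracy computation over the hidden items, with a flagging threshold; the 90% default below mirrors the range mentioned above, and the item-id-to-label dictionaries are an illustrative data layout:

```python
def gold_accuracy(annotations, gold):
    """Annotator accuracy on gold items (both dicts map item_id -> label)."""
    hits = sum(annotations.get(item) == label for item, label in gold.items())
    return hits / len(gold)

def needs_review(annotations, gold, threshold=0.9):
    """Flag an annotator whose gold accuracy falls below the threshold."""
    return gold_accuracy(annotations, gold) < threshold
```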
Many labeling pipelines collect two or three independent annotations per item and resolve disagreements through majority voting or a dedicated adjudicator. The adjudicator (usually a senior annotator or subject matter expert) reviews disputed items and selects the correct label. While this approach improves accuracy, cost scales directly with the number of annotations collected per item.
Active learning is a machine learning technique that reduces the amount of labeled data needed to reach a target level of model performance. Instead of labeling data points at random, active learning selects the examples that are most informative for the model.
The process runs in a loop:

1. Train a model on the data labeled so far.
2. Run the model over the unlabeled pool to score each example.
3. Select the most informative examples according to a query strategy.
4. Send the selected examples to human annotators.
5. Add the new labels to the training set and repeat until the labeling budget is exhausted or model performance plateaus.
Several strategies exist for choosing which examples to label next:
| Strategy | How it works | When to use |
|---|---|---|
| Uncertainty sampling | Select examples where the model is least confident in its prediction | General-purpose; works well with most classifiers |
| Query by committee | Train multiple models and select examples where they disagree most | When ensemble methods are feasible |
| Density-weighted sampling | Combine uncertainty with data density so that selected points are both uncertain and representative | When the unlabeled pool has uneven distributions |
| Expected model change | Select examples that would cause the largest update to model parameters if labeled | Computationally expensive but effective for small budgets |
When applied effectively, active learning can reduce the number of required labels by 30% to 70% compared to random sampling, according to multiple studies. Amazon SageMaker Ground Truth uses active learning to automatically label high-confidence examples and route only uncertain ones to human annotators, which AWS claims reduces labeling costs by up to 70%.
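Least-confidence uncertainty sampling, the simplest strategy in the table above, can be sketched as ranking the unlabeled pool by the model's top class probability and taking the lowest:

```python
def least_confidence(probs, k):
    """Select indices of the k examples whose top class probability is lowest.

    probs: list of per-class probability lists, one per unlabeled example.
    """
    ranked = sorted(range(len(probs)), key=lambda i: max(probs[i]))
    return ranked[:k]
```

The selected indices are the examples sent to human annotators in the next loop iteration; entropy-based scoring is a common drop-in alternative to `max`.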
Synthetic data refers to artificially generated data that mimics the statistical properties of real-world data. It can supplement or partially replace manually labeled data in some training scenarios.
Synthetic data generation eliminates manual annotation costs after the initial setup. Once a generation pipeline is built (using 3D rendering engines, generative adversarial networks, or diffusion models), producing additional labeled examples costs almost nothing per unit. Synthetic data also avoids privacy concerns, since no real individuals appear in the generated examples. This is especially valuable in healthcare and finance, where real data is subject to strict regulations.
Gartner has forecast that by 2030, synthetic data will be more widely used for AI training than real-world datasets.
Synthetic data can introduce distribution gaps. Models trained exclusively on synthetic examples sometimes fail when encountering real-world data that differs from the generated distribution. For this reason, synthetic data works best as a supplement to real labeled data, not a full replacement. Benchmarking against real-world test sets remains necessary to validate performance.
Simulation environments (such as those used for training autonomous driving models) can generate millions of labeled frames, but the visual fidelity and scenario diversity of these simulations must be carefully managed to avoid training models on unrealistic conditions.
The rise of large language models has opened new possibilities for automating parts of the labeling process. Rather than relying solely on human annotators, teams can use LLMs to generate initial labels that humans then review and correct.
A study by Refuel AI found that GPT-4 achieved 88.4% agreement with ground truth labels across a range of text classification datasets, compared to 86.2% for human annotators on the same tasks. GPT-4o delivered the highest combined score for accuracy and efficiency among models tested. These results suggest that LLMs can match or exceed crowd-sourced human annotators for certain text labeling tasks, though expert-level performance on specialized domains still requires human oversight.
Weak supervision, pioneered by the Snorkel project at Stanford University (started in 2015), takes a programmatic approach to labeling. Instead of annotating examples one by one, users write labeling functions: simple programs that apply heuristic rules, keyword matches, or external knowledge bases to assign noisy labels. Snorkel's system then combines the outputs of many labeling functions, automatically learning each function's accuracy and correcting for correlations between them.
This approach is much faster than manual annotation. In evaluations, subject matter experts using Snorkel built models 2.8 times faster and achieved 45.5% higher predictive performance compared to seven hours of hand labeling alone.
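The labeling-function idea can be illustrated with a toy spam-detection example. Note that Snorkel's actual system learns each function's accuracy and the correlations between functions, whereas this sketch combines votes naively; the heuristics themselves are invented for illustration:

```python
ABSTAIN, HAM, SPAM = -1, 0, 1

# Toy labeling functions: each casts a vote or abstains
def lf_keyword(text):
    return SPAM if "free money" in text.lower() else ABSTAIN

def lf_has_link(text):
    return SPAM if "http://" in text else ABSTAIN

def lf_long_message(text):
    return HAM if len(text.split()) > 20 else ABSTAIN

def combine(text, lfs):
    """Naive majority vote over non-abstaining labeling functions."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)
```

Writing a handful of such functions takes minutes and can label an entire corpus, which is the source of the speedup over example-by-example annotation.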
The most effective modern labeling pipelines combine LLM predictions with human review. A common pattern:

1. An LLM generates a candidate label (and often a confidence estimate) for each example.
2. High-confidence labels are accepted automatically.
3. Low-confidence or ambiguous cases are routed to human annotators.
4. Human corrections are used to refine prompts or fine-tune the labeling model.
This hybrid approach can automatically label up to 75% of a dataset, reducing human effort to the remaining difficult cases. The HILTS framework (Human-LLM collaboration for effective data labeling) formalizes this pattern by using active learning to select the most uncertain LLM labels for targeted human review.
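The confidence-based routing at the heart of such hybrid pipelines can be sketched as a simple split; the 0.9 threshold and the `confidence` field name are illustrative assumptions:

```python
def route_by_confidence(examples, threshold=0.9):
    """Split LLM-labeled examples into auto-accept and human-review queues."""
    auto, review = [], []
    for ex in examples:
        (auto if ex["confidence"] >= threshold else review).append(ex)
    return auto, review
```

In practice the threshold is tuned against a held-out set so that the auto-accepted queue meets the project's target accuracy.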
Data labeling costs vary widely depending on task complexity, annotator expertise, data modality, and quality requirements.
| Task type | Cost per annotation | Notes |
|---|---|---|
| Image classification (binary) | $0.01 to $0.05 | Simple yes/no or category assignment |
| Bounding box annotation | $0.05 to $0.50 | Depends on number of objects per image |
| Semantic segmentation | $0.50 to $7.00 | Pixel-level labeling is labor-intensive |
| Text classification | $0.02 to $0.10 | Sentiment, topic, spam detection |
| Named entity recognition | $0.10 to $1.00 | Varies with entity density and domain |
| Audio transcription | $0.50 to $3.00 per minute | Higher for noisy audio or specialized vocabulary |
| RLHF preference comparison | $1.00 to $100 per example | Depends on response length and required expertise |
| Medical image annotation | $2.00 to $50.00 | Requires trained radiologists or pathologists |
A typical computer vision model might require 100,000 labeled images. At $0.50 to $5.00 per image (an image often contains multiple objects, or requires pixel-level masks), the direct labeling cost alone ranges from $50,000 to $500,000. This does not include project management, quality assurance overhead, tool licensing, or iteration costs when labeling guidelines change mid-project.
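A back-of-the-envelope budget can be sketched as follows; the QA overhead fraction and redundancy factor are illustrative assumptions, not industry constants:

```python
def labeling_cost(n_items, cost_per_item, redundancy=1, qa_overhead=0.2):
    """Rough project cost: per-item rate x redundancy, plus a QA overhead fraction."""
    return n_items * cost_per_item * redundancy * (1 + qa_overhead)
```

Collecting three independent annotations per item triples the base cost before QA, which is why redundancy is usually reserved for items where quality matters most.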
The global data labeling market was estimated at approximately $4.9 billion in 2025, according to Mordor Intelligence. Multiple research firms project compound annual growth rates between 25% and 30%, with the market expected to reach $17 billion or more by 2030. Growth is driven by increasing AI adoption in autonomous vehicles, healthcare, financial services, and the ongoing need for human feedback data to train and align large language models.
Data labeling work raises several ethical questions. Much of the labor is performed by workers in lower-income countries who are paid per task at rates that can fall below local minimum wages. The 2018 study of Amazon Mechanical Turk found a median hourly wage of roughly $2 per hour for workers on the platform. Time magazine reported in 2023 on Kenyan workers who labeled harmful content for OpenAI's safety filters at wages of less than $2 per hour.
Annotators who label content moderation data (violent images, hate speech, graphic material) are exposed to psychologically distressing material. Some companies have introduced wellness programs and content rotation policies to mitigate harm, but standards vary across the industry.
Bias in labeled data is another concern. If annotators bring systematic biases to their labeling decisions (for example, associating certain names with particular demographics), those biases will be encoded in the training data and reproduced by the model. Careful guideline design, diverse annotator pools, and bias auditing of completed datasets are standard mitigation strategies.
Several trends are shaping the evolution of data labeling:

- Greater automation, with LLMs and foundation models generating first-pass labels that humans review rather than create from scratch.
- Increased use of synthetic data to supplement manually labeled datasets, particularly in privacy-sensitive domains like healthcare and finance.
- Sustained demand for human preference data to align large language models through RLHF and related techniques.
- Industry consolidation and strategic investment, exemplified by Meta's 2025 stake in Scale AI.
- Growing scrutiny of annotator pay and working conditions, especially for content moderation work.