See also: Machine learning terms, Semi-supervised learning, Bayesian optimization
Active learning is a subfield of machine learning in which a learning algorithm can interactively query an oracle (typically a human annotator) to label new data points. Rather than training on a randomly sampled dataset, the learner selects the most informative instances from a large pool of unlabeled examples and requests their labels. The central goal is to achieve high model accuracy while minimizing the number of labeled examples required, thereby reducing the cost, time, and effort associated with data annotation.
The key insight behind active learning is that not all data points are equally useful for training. A model can learn more efficiently if it chooses which examples to study, focusing on instances where its predictions are uncertain or where new information would most improve its understanding. This stands in contrast to passive supervised machine learning, where training examples are sampled at random and the learner has no control over what it sees. In many real-world settings, unlabeled data is abundant (web pages, sensor readings, medical scans) but obtaining labels is expensive, making active learning a practical and widely studied approach.
Burr Settles' 2009 survey, "Active Learning Literature Survey," remains the most cited work in the field with over 6,000 citations. The survey established a foundational taxonomy of active learning scenarios and query strategies that researchers continue to build upon today.
The theoretical foundations of active learning date back to at least the early 1990s. Cohn, Atlas, and Ladner (1994) published "Improving Generalization with Active Learning" in the journal Machine Learning, introducing a formalism called selective sampling. In selective sampling, a learner receives distribution information from the environment and queries an oracle on parts of the domain it considers useful. The paper demonstrated that, in some situations, active learning is provably more powerful than learning from randomly sampled examples, achieving better generalization for a fixed number of training instances. A preliminary version of this work appeared even earlier, as Cohn et al. (1990).
Another foundational contribution was the query-by-committee (QBC) algorithm, proposed by Seung, Opper, and Sompolinsky (1992). QBC used a committee of models to identify instances on which the models disagreed, and it was shown to reduce the amount of data needed for learning exponentially in certain theoretical settings.
Throughout the 2000s, active learning grew from a primarily theoretical topic into a practical tool applied in natural language processing, information extraction, and text classification. Settles and Craven (2008) provided an influential empirical comparison of active learning strategies for sequence labeling tasks. Then Settles' 2009 survey unified the field by organizing the literature into a coherent taxonomy of scenarios and query strategies, establishing vocabulary and frameworks that remain standard.
The 2010s saw a surge of interest in combining active learning with deep learning, bringing new challenges such as batch selection, calibration, and the cold start problem. Since 2023, researchers have begun exploring how large language models can serve as both the oracle and the selection mechanism in active learning pipelines, opening new frontiers in the field.
The active learning process consists of three core elements: the learning algorithm, the unlabeled data pool, and the selection strategy.
The learning algorithm is responsible for predicting labels based on previously labeled examples. It can be any machine learning algorithm, such as support vector machines, decision trees, or neural networks.
The unlabeled data pool is a large collection of unlabeled data points collected from various sources such as web crawling, sensor networks, or historical records.
The selection strategy (also called the query strategy or acquisition function) is the foundation of active learning, determining which unlabeled data points should be selected for labeling. It is guided by the learning algorithm's uncertainty, the data distribution, or both, and it strives to pick samples that will maximally improve model accuracy.
A typical active learning loop proceeds as follows:

1. Train the model on the current labeled set (initially a small seed set).
2. Apply the selection strategy to score the instances in the unlabeled pool.
3. Query the oracle for labels on the highest-scoring instance or batch.
4. Add the newly labeled data to the training set and retrain the model.
5. Repeat until a stopping criterion is met, such as exhausting the annotation budget or reaching the desired performance.
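The sketch below shows one way this loop can be realized in the pool-based setting, assuming a scikit-learn classifier and least-confidence uncertainty sampling; the dataset, seed size, and budget are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative data: a small labeled seed set and a large unlabeled pool.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
rng = np.random.RandomState(0)
labeled_idx = rng.choice(len(X), size=20, replace=False)
pool_idx = np.setdiff1d(np.arange(len(X)), labeled_idx)

model = LogisticRegression(max_iter=1000)
budget = 100  # number of oracle queries we are willing to pay for

for _ in range(budget):
    # 1. Train on the current labeled set.
    model.fit(X[labeled_idx], y[labeled_idx])

    # 2. Score the pool with least-confidence uncertainty.
    probs = model.predict_proba(X[pool_idx])
    uncertainty = 1.0 - probs.max(axis=1)

    # 3. Select the most uncertain instance and "query the oracle"
    #    (here the true labels are already known, standing in for an annotator).
    query = pool_idx[np.argmax(uncertainty)]

    # 4. Move the newly labeled instance from the pool to the labeled set.
    labeled_idx = np.append(labeled_idx, query)
    pool_idx = pool_idx[pool_idx != query]
```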
Active learning methods are typically organized into three main scenarios based on how unlabeled data is accessed. These scenarios were formalized by Settles (2009) and remain the standard framework in the literature.
Pool-based active learning is the most commonly studied scenario. The learner has access to a large pool of unlabeled instances and evaluates the entire pool (or a subsample of it) to decide which instance to query next. The learner ranks all unlabeled instances according to some informativeness measure and selects the top-ranked instance or batch of instances for labeling.
This approach assumes that collecting unlabeled data is inexpensive but that obtaining labels is costly. Pool-based methods are widely used in practical applications such as text classification, image classification, and named entity recognition.
In stream-based selective sampling (sometimes called sequential active learning), each unlabeled instance is drawn one at a time from the data source. The learner must decide on the spot whether to query the label for each incoming instance or discard it and move on to the next. The decision is typically made by comparing the informativeness of the instance against a threshold.
Stream-based methods are useful when the data arrives continuously (as in a data stream) and it is impractical to store the entire pool. Since instances come from the true underlying distribution, the queries are guaranteed to be realistic examples even when that distribution is non-uniform or unknown.
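A minimal sketch of the stream-based decision rule, assuming a fitted probabilistic classifier; the threshold is an illustrative hyperparameter that trades query cost against informativeness:

```python
import numpy as np

def should_query(model, x, threshold=0.2):
    """Ask the oracle for a label only if the model's least-confidence
    uncertainty on the incoming instance exceeds the threshold."""
    probs = model.predict_proba(np.asarray(x).reshape(1, -1))[0]
    return (1.0 - probs.max()) > threshold

# Inside a stream loop (model, stream, and oracle are assumed to exist):
#     for x in stream:
#         if should_query(model, x):
#             y = oracle(x)   # human annotator labels the instance
#             ...             # add (x, y) to the training set, update the model
```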
In membership query synthesis, the learner generates new instances from the input space and requests their labels from the oracle. Rather than selecting from existing data, the learner synthesizes artificial query points that it expects to be maximally informative.
This scenario is powerful in theory because it allows the learner to explore any region of the feature space. However, synthesized instances can be difficult for human annotators to label because they may not correspond to natural or meaningful inputs. For example, a synthesized image might look like random noise to a human labeler. Membership query synthesis has found more success in domains where an automated oracle (such as a simulation or a physical experiment) can evaluate arbitrary inputs, rather than tasks requiring human judgment.
| Scenario | Data access | Strengths | Limitations |
|---|---|---|---|
| Pool-based | Evaluates a full pool of unlabeled data | Most studied; effective ranking of all candidates | Computationally expensive for very large pools |
| Stream-based | Instances arrive sequentially | Low memory footprint; suits real-time data | Cannot compare candidates globally |
| Membership query synthesis | Learner generates synthetic instances | Can explore any region of feature space | Synthetic queries may be uninterpretable for human annotators |
Active learning utilizes various selection strategies to determine which unlabeled data points are most worth labeling. The choice of query strategy has a large impact on the efficiency of active learning. Below are the major families of query strategies.
Uncertainty sampling is the simplest and most widely employed strategy. It selects the data points about which the current model is least certain. For classification tasks, uncertainty can be measured in several ways:
| Uncertainty measure | Description | Formula (informal) |
|---|---|---|
| Least confidence | Selects the instance whose predicted class has the lowest probability | 1 minus the probability of the most likely class |
| Margin sampling | Selects the instance with the smallest gap between the top two predicted class probabilities | Probability of first class minus probability of second class |
| Entropy sampling | Selects the instance with the highest entropy over the predicted class distribution | Negative sum of p times log(p) across all classes |
For instance, if a binary classifier assigns a 51% probability to class A and 49% to class B, that instance has very high uncertainty under all three measures and would be prioritized for labeling. Uncertainty sampling is computationally cheap and easy to implement, but it considers only the model's current predictions without accounting for the broader data distribution. This can cause the learner to repeatedly query outliers or instances in a narrow region of the feature space.
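These measures can be computed directly from a model's predicted class probabilities; a minimal NumPy sketch with an illustrative probability matrix:

```python
import numpy as np

# Predicted class probabilities for three instances over three classes (illustrative).
probs = np.array([
    [0.51, 0.49, 0.00],   # near the decision boundary
    [0.90, 0.07, 0.03],   # confident prediction
    [0.40, 0.35, 0.25],   # probability mass spread out
])

least_confidence = 1.0 - probs.max(axis=1)

sorted_probs = np.sort(probs, axis=1)[:, ::-1]
margin = sorted_probs[:, 0] - sorted_probs[:, 1]     # smaller margin = more uncertain

entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)

# Uncertainty sampling queries the argmax of least_confidence or entropy,
# or the argmin of margin.
```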
Despite these limitations, large-scale benchmarking studies have found that uncertainty sampling remains a surprisingly strong baseline. An expanded benchmark by Gonsior et al. (2023) on tabular datasets concluded that uncertainty sampling consistently performs at or near the top across a variety of settings, often matching more complex strategies.
Query-by-committee maintains a committee of models (an ensemble), each trained on different subsets or variations of the labeled data. The committee members independently predict labels for each unlabeled instance, and the instances that produce the most disagreement among the committee are selected for labeling.
Disagreement can be quantified using measures such as:

- Vote entropy: the entropy of the distribution of the committee members' votes over the possible labels.
- Average Kullback-Leibler (KL) divergence: the mean divergence between each member's predicted label distribution and the consensus distribution of the committee.
The underlying assumption is that instances causing high disagreement sit in regions of the feature space where the model is uncertain, and labeling those instances will help resolve ambiguities. QBC is related to the idea of version space reduction: each query aims to eliminate as many hypotheses as possible from the set of models consistent with the labeled data.
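A compact sketch of query-by-committee with vote entropy as the disagreement measure, using a hand-rolled bagged committee of decision trees; all names and sizes are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_lab, y_lab, X_pool = X[:50], y[:50], X[50:]

# Committee: each member is trained on a bootstrap resample of the labeled data.
rng = np.random.RandomState(0)
committee = []
for _ in range(7):
    idx = rng.choice(len(X_lab), size=len(X_lab), replace=True)
    committee.append(DecisionTreeClassifier(random_state=0).fit(X_lab[idx], y_lab[idx]))

votes = np.stack([m.predict(X_pool) for m in committee])   # shape (members, n_pool)

def vote_entropy(votes, n_classes):
    # Fraction of committee votes for each class, per pool instance.
    vote_fracs = np.stack([(votes == c).mean(axis=0) for c in range(n_classes)], axis=1)
    return -np.sum(vote_fracs * np.log(vote_fracs + 1e-12), axis=1)

scores = vote_entropy(votes, n_classes=2)
query_idx = np.argmax(scores)   # pool instance with the most disagreement
```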
Expected model change selects the instance that, once labeled, would cause the greatest change to the current model. The "change" is typically measured as the magnitude of the gradient that would be applied to update the model's parameters after training on the new labeled instance. Instances that would induce large gradient updates are considered highly informative.
This strategy was introduced by Settles and Craven (2008) under the name "expected gradient length" (EGL). Because the true label of a candidate is unknown in advance, the expected gradient is computed by summing over all possible labels weighted by their predicted probabilities, with the intuition that the learner prefers instances likely to influence the model the most regardless of which label they ultimately receive. While more principled than uncertainty sampling, expected model change is more computationally expensive because it requires gradient computations for every unlabeled candidate.
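For models where the per-instance gradient has a closed form, EGL can be sketched directly; the example below uses multinomial logistic regression, for which the gradient of the log-loss with respect to the class-c weights is (p_c − [y = c]) times the input. The dataset and split are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_lab, y_lab, X_pool = X[:60], y[:60], X[60:]

model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
probs = model.predict_proba(X_pool)              # shape (n_pool, n_classes)
n_classes = probs.shape[1]

def expected_gradient_length(x, p):
    # Sum over hypothetical labels of p(y|x) * ||gradient of the log-loss||.
    egl = 0.0
    for label in range(n_classes):
        onehot = np.eye(n_classes)[label]
        grad = np.outer(p - onehot, x)           # (n_classes, n_features)
        egl += p[label] * np.linalg.norm(grad)
    return egl

scores = np.array([expected_gradient_length(x, p) for x, p in zip(X_pool, probs)])
query_idx = np.argmax(scores)   # instance expected to change the model the most
```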
Expected error reduction selects the instance that, once labeled, would most reduce the model's expected future error (or loss) on the remaining unlabeled data. For each candidate instance, the strategy simulates adding each possible label, retrains (or approximates retraining) the model, and estimates the resulting generalization error.
This approach is conceptually appealing because it directly targets the quantity we want to minimize. However, it is computationally demanding: for each candidate, the model must be retrained (or approximated) for every possible label, and the expected error must be estimated over the remaining unlabeled pool. This cost is prohibitive for modern deep neural networks, which is why expected error reduction has not been widely adopted in deep active learning. In practice, approximations such as one-step gradient descent or influence functions are used to make the computation feasible for simpler models.
Diversity sampling (also called representativeness-based sampling) selects instances that collectively represent the overall distribution of the unlabeled data. Rather than focusing purely on model uncertainty, diversity sampling aims to ensure that the selected batch of instances covers the feature space broadly.
Common techniques for diversity sampling include:

- Clustering the unlabeled pool (for example with k-means) and selecting a representative from each cluster.
- Core-set or k-center selection, which minimizes the maximum distance from any unlabeled point to the selected set.
- Selecting batches that maximize pairwise distances among the chosen instances in feature space.
Diversity sampling alone does not account for model uncertainty, so it is often combined with uncertainty-based methods in hybrid approaches.
Density-weighted methods combine informativeness scores with information about the underlying data density. The intuition is that an uncertain instance in a dense region of the feature space is more valuable than an equally uncertain instance in a sparse region, because the dense-region instance is more representative of the data the model will encounter at test time.
A common formulation multiplies the informativeness score (such as uncertainty) by a density weight proportional to the average similarity between the candidate and other unlabeled instances. This helps prevent the learner from wasting queries on outliers.
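A minimal sketch of this information-density weighting, using least-confidence uncertainty as the base score and cosine similarity for density; the beta exponent controlling the density term is an illustrative hyperparameter:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def information_density_scores(probs, X_pool, beta=1.0):
    """Weight an uncertainty score by each candidate's average similarity
    to the rest of the unlabeled pool, raised to the power beta."""
    uncertainty = 1.0 - probs.max(axis=1)            # least-confidence uncertainty
    density = cosine_similarity(X_pool).mean(axis=1)
    return uncertainty * density ** beta

# Usage: probs = model.predict_proba(X_pool); query np.argmax of the scores.
```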
| Strategy | Selection criterion | Strengths | Limitations |
|---|---|---|---|
| Uncertainty sampling | Model confidence on individual instances | Simple, fast, easy to implement | May oversample outliers; ignores data distribution |
| Query-by-committee | Disagreement among a committee of models | Reduces version space; considers multiple hypotheses | Requires maintaining multiple models |
| Expected model change | Predicted impact on model parameters | Principled; directly targets model improvement | Computationally expensive (gradient calculations) |
| Expected error reduction | Predicted reduction in future error | Directly minimizes the target objective | Very expensive; requires simulated retraining per candidate |
| Diversity sampling | Representativeness of selected batch | Ensures coverage of the feature space | Does not consider model uncertainty |
| Density-weighted methods | Informativeness weighted by local density | Avoids wasting queries on outliers | Requires density estimation; adds complexity |
Bayesian Active Learning by Disagreement (BALD), introduced by Houlsby et al. (2011), is an information-theoretic acquisition function designed for active learning with Bayesian models. BALD selects data points that maximize the mutual information between the model's predictions and its parameters, given the observed data.
The BALD acquisition function can be expressed as:
BALD(x) = H[p(y|x, D)] - E_{p(w|D)}[H[p(y|x, w)]]
The first term is the entropy of the model's predictive distribution (marginal uncertainty). The second term is the expected entropy of the prediction under individual parameter settings (aleatoric uncertainty). The difference captures epistemic uncertainty: BALD scores are highest for instances where the model is uncertain overall but individual parameter configurations are confident, meaning the parameters "disagree" about the correct prediction.
In practice, the expectation over the posterior p(w|D) is approximated using techniques such as Monte Carlo (MC) dropout (Gal and Ghahramani, 2016), where multiple forward passes with random dropout masks serve as samples from an approximate posterior. Each forward pass produces a different prediction, and the spread of those predictions estimates epistemic uncertainty.
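Once a stack of stochastic predictions is available (for example, T forward passes with dropout left on), the BALD score follows directly from the formula above; a sketch assuming an array of sampled class probabilities:

```python
import numpy as np

def bald_scores(mc_probs, eps=1e-12):
    """mc_probs: array of shape (n_samples, n_instances, n_classes) holding
    class probabilities from repeated stochastic forward passes."""
    mean_probs = mc_probs.mean(axis=0)                        # marginal predictive distribution
    predictive_entropy = -np.sum(mean_probs * np.log(mean_probs + eps), axis=1)
    expected_entropy = -np.mean(
        np.sum(mc_probs * np.log(mc_probs + eps), axis=2), axis=0)
    return predictive_entropy - expected_entropy              # epistemic uncertainty

# Usage: stack the softmax outputs of T dropout-enabled forward passes into
# mc_probs, then query np.argmax(bald_scores(mc_probs)).
```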
BALD has become one of the standard acquisition functions for deep active learning due to its solid theoretical grounding and practical effectiveness. Several extensions have been proposed, most notably BatchBALD (Kirsch et al., 2019), which scores entire batches jointly to avoid selecting redundant instances, as well as variants that approximate the posterior with deep ensembles rather than MC dropout.
Applying active learning to deep learning models introduces several unique challenges that do not arise with traditional machine learning algorithms.
Uncertainty calibration. Deep neural networks are often overconfident in their predictions. The softmax outputs of a neural network are frequently poorly calibrated and do not reliably reflect true prediction uncertainty. This is a problem for uncertainty-based query strategies, which depend on accurate uncertainty estimates. Techniques such as temperature scaling, MC dropout, and Bayesian neural networks are used to improve calibration.
Cold start problem. Deep networks typically require large amounts of labeled data to train effectively. In the initial rounds of active learning, when very few labeled examples are available, the model may be too poorly trained for its uncertainty estimates to be meaningful. This can lead to poor query decisions in early iterations. Several strategies have been developed to mitigate the cold start problem. Transfer learning from pretrained models (such as ImageNet-pretrained CNNs or pretrained language models like BERT) provides a strong initialization so that even the first round of uncertainty estimation is useful. Yuan et al. (2020) proposed cold-start active learning through self-supervised language modeling, using the pretraining loss to identify examples that "surprise" the model and should be labeled first. Clustering-based approaches, such as using contrastive learning features with BIRCH clustering, can also select a diverse initial seed set without relying on model predictions.
Batch mode requirements. Training deep networks is computationally expensive, so it is impractical to retrain after adding a single new labeled instance. Instead, deep active learning operates in batch mode, selecting a batch of instances to label before retraining. Naive application of instance-level query strategies (such as selecting the top-k most uncertain points) can result in redundant batches where all selected instances are similar to each other.
Computational cost. Evaluating query strategies over large unlabeled pools with deep models is expensive. Strategies that require multiple forward passes (such as MC dropout for BALD) or gradient computations (such as expected model change) add significant overhead.
Representation shift. As labeled data grows across active learning rounds, the learned feature representations change. A query strategy that selects informative instances based on the current representation may not remain optimal after the model is retrained with new data.
| Method | Year | Key idea |
|---|---|---|
| Core-Set (Sener and Savarese) | 2018 | Frames active learning as core-set selection in the learned feature space, minimizing the maximum distance from any unlabeled point to the selected set |
| VAAL (Sinha et al.) | 2019 | Uses a variational autoencoder and a discriminator to distinguish labeled from unlabeled data; queries instances that the discriminator classifies as unlabeled |
| BatchBALD (Kirsch et al.) | 2019 | Selects batches by jointly maximizing mutual information, avoiding redundancy inherent in greedy top-k BALD selection |
| BADGE (Ash et al.) | 2020 | Represents instances as gradient embeddings from the last layer, then applies k-means++ in gradient space to select a diverse, uncertain batch |
| CEAL (Wang et al.) | 2017 | Runs two parallel processes: querying uncertain samples for human labeling while assigning pseudo-labels to highly confident samples |
| Loss Prediction (Yoo and Kweon) | 2019 | Attaches an auxiliary "loss prediction module" to the model that predicts the loss for unlabeled samples; queries instances with the highest predicted loss |
In many practical settings, labeling is done in batches rather than one instance at a time. A labeler might annotate a batch of 100 images overnight, or a laboratory might run 96 experiments simultaneously on a well plate. Batch active learning addresses the problem of selecting an entire batch of informative instances at once, rather than selecting them one by one.
The main challenge in batch selection is avoiding redundancy. If the query strategy simply selects the top-k most uncertain instances, those instances may all be very similar, providing overlapping information. Effective batch active learning methods must balance informativeness (each instance should be valuable) with diversity (the instances in the batch should cover different parts of the feature space).
Greedy sequential selection. Select instances one at a time, updating the acquisition scores after each selection to account for the information already captured. This is straightforward but requires re-evaluating the acquisition function k times.
Clustering-based batching. First identify the top candidates by informativeness (for example, the most uncertain instances), then cluster those candidates and select one representative from each cluster. Zhdanov (2019) proposed a method that prefilters the top candidates, clusters them into the desired batch size, and selects the instance nearest each cluster center.
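A simplified sketch in the spirit of this prefilter-then-cluster approach (Zhdanov's method uses uncertainty-weighted k-means; plain k-means is used here for brevity, and the prefilter and batch sizes are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_diverse_batch(X_pool, uncertainty, batch_size=10, prefilter=200, seed=0):
    # 1. Keep only the most uncertain candidates.
    top = np.argsort(uncertainty)[-prefilter:]
    # 2. Cluster them into `batch_size` groups in feature space.
    km = KMeans(n_clusters=batch_size, n_init=10, random_state=seed).fit(X_pool[top])
    # 3. Take the candidate closest to each cluster center.
    batch = []
    for c in range(batch_size):
        members = top[km.labels_ == c]
        dists = np.linalg.norm(X_pool[members] - km.cluster_centers_[c], axis=1)
        batch.append(members[np.argmin(dists)])
    return np.array(batch)
```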
Gradient embedding methods. BADGE (Ash et al., 2020) computes gradient embeddings for each unlabeled instance, capturing both uncertainty (through gradient magnitude) and diversity (through the direction of the gradient). It then applies k-means++ initialization in gradient space to construct a batch that is simultaneously uncertain and diverse, without any hyperparameters to tune.
Submodular optimization. Some methods formulate batch selection as a submodular maximization problem, where the objective function captures both informativeness and diversity. Submodular functions exhibit diminishing returns, which naturally discourages redundancy.
Deciding when to stop the active learning loop is a non-trivial problem with no universal solution. Continuing to label beyond the point of diminishing returns wastes annotation budget, while stopping too early leaves potential performance on the table. Several families of stopping criteria have been proposed.
Budget exhaustion. The simplest approach: stop when the annotation budget (a fixed number of labeled instances or a fixed monetary cost) has been spent. This is practical but does not adapt to how well the model is actually learning.
Performance convergence. Monitor the model's performance on a held-out validation set and stop when performance has not improved by more than a threshold over a specified number of iterations. This approach can detect when additional labels yield diminishing returns.
Confidence-based criteria. Zhu and Hovy (2008) proposed stopping criteria based on the model's confidence over the unlabeled pool. These include stopping when the maximum uncertainty in the pool drops below a threshold (indicating the model is confident about all remaining instances) and stopping when the overall average uncertainty falls below a threshold.
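A small sketch of such confidence-based stopping, assuming pool uncertainties have already been computed; both thresholds are illustrative and would be tuned per application:

```python
import numpy as np

def should_stop(pool_uncertainty, max_threshold=0.1, mean_threshold=0.05):
    """Stop when the model is confident about every remaining pool instance,
    or when the average uncertainty over the pool has dropped low enough."""
    return bool(pool_uncertainty.max() < max_threshold
                or pool_uncertainty.mean() < mean_threshold)
```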
Model stability. Bloodgood and Vijay-Shanker (2009) proposed stopping when the model itself no longer changes significantly with the addition of new training instances. If adding any of the remaining unlabeled examples would produce a change below a given threshold, further labeling is unlikely to help.
Error stability. Ishibashi and Hino (2020) proposed a criterion based on bounding the change in generalization error when a new sample is added. This criterion can be applied to any Bayesian active learning method and provides a more principled guarantee.
In practice, no single stopping criterion dominates across all domains. Optimality depends on the trade-off between accuracy and annotation cost, which varies by application.
Active learning is a core component of human-in-the-loop machine learning, where human annotators and machine learning models collaborate iteratively. Several tools have been developed to operationalize this workflow.
Prodigy, developed by Explosion (the creators of spaCy), is a commercial annotation tool built around active learning. Its text classification recipe implements uncertainty sampling to show the annotator only the examples the model is most unsure about, substantially reducing annotation effort compared to random sampling. Prodigy supports NLP tasks such as named entity recognition, text classification, and span categorization.
Label Studio is an open-source annotation platform that supports text, images, audio, video, and time-series data. It can be connected to a machine learning backend that provides predictions and receives annotations in a continuous loop, enabling active learning workflows. After each annotation, Label Studio sends the data to the ML backend for model updating, and when the user moves to the next task, the updated model provides new predictions.
Encord and other enterprise annotation platforms also support active learning workflows for computer vision tasks, including image classification, object detection, and semantic segmentation.
The human-in-the-loop paradigm provides an additional benefit beyond efficient labeling: by reviewing model-flagged examples, annotators gain insight into the model's uncertainties and potential failure modes. This feedback loop helps practitioners identify systematic errors and data quality issues early in the development cycle.
The rise of large language models has introduced new paradigms for active learning. A 2025 ACL survey categorized LLM-based active learning into two main steps: LLM-based selection (using an LLM to choose or generate instances for annotation) and LLM-based annotation (using an LLM to provide labels).
LLMs as annotators. Instead of using expensive human annotators, LLMs such as GPT-4 can serve as oracles that label selected instances. Research by Kholodna et al. (2024) demonstrated that using GPT-4-Turbo as an annotator within an active learning loop for low-resource languages achieved near-state-of-the-art performance with estimated cost savings of at least 42 times compared to human annotation.
LLMs for instance selection. ActiveLLM (Schroeder et al., 2024) uses large language models to select instances for labeling in few-shot scenarios. Rather than relying on a trained classifier's uncertainty estimates (which may be unreliable with very few examples), the LLM itself evaluates which unlabeled instances would be most informative. This approach outperforms traditional active learning strategies when the labeled set is extremely small.
Hybrid human-LLM frameworks. Recent work has proposed frameworks that route annotation tasks between human annotators and LLMs based on model uncertainty. High-uncertainty instances are sent to human experts for reliable labeling, while lower-uncertainty instances are labeled by the LLM at a fraction of the cost. This hybrid approach balances cost efficiency with annotation quality.
Active learning and semi-supervised learning both address the problem of learning effectively when labeled data is scarce and unlabeled data is plentiful, but they take fundamentally different approaches.
| Aspect | Active learning | Semi-supervised learning |
|---|---|---|
| Core mechanism | Selects which unlabeled instances to have labeled by an oracle | Uses unlabeled data directly (without labeling) to improve the model |
| Role of oracle | Requires an oracle (human or automated) to label selected instances | Does not require additional labeling |
| Use of unlabeled data | Queries specific unlabeled instances for labels | Leverages the distribution of all unlabeled data (for example, through consistency regularization or pseudo-labels) |
| Goal | Minimize the number of labels needed by choosing the most informative ones | Improve model performance by exploiting structure in unlabeled data |
| Label selection | Queries instances in uncertain or informative regions | Assigns pseudo-labels to instances the model is most confident about |
The two approaches are complementary and can be combined. For instance, an active learning system might use semi-supervised techniques to leverage the unlabeled data between query rounds, while using an active query strategy to select which instances to send for human annotation. This combination has been shown to outperform either approach used in isolation.
Annotating text data for NLP tasks is labor-intensive. Tasks like named entity recognition, sentiment analysis, relation extraction, and text classification require domain experts to read and label large volumes of text. Active learning has been applied extensively to reduce this annotation burden.
In clinical NLP, active learning with uncertainty sampling has been shown to achieve an F-measure of 0.80 for named entity recognition while requiring 66% fewer annotated sentences compared to random sampling (Chen et al., 2015). For text classification with deep learning, recent surveys have categorized active learning query strategies into data-based, model-based, and prediction-based approaches, reflecting the diversity of techniques adapted for neural text models.
Active learning is also used for building training sets for large language models and for preference annotation in reinforcement learning from human feedback (RLHF), where annotator time is a major bottleneck.
Annotating medical images is uniquely expensive because it requires trained specialists. A radiologist may spend approximately 60 minutes manually segmenting brain tumors per patient in multi-sequence MRI volumes. A pathologist typically needs 15 to 30 minutes to examine a single histopathology slide under a microscope.
Active learning addresses this bottleneck by selecting the most informative images or image regions for expert annotation. A comprehensive survey by Li et al. (2024) catalogued more than 160 deep active learning works applied to medical image analysis, covering tasks such as tumor segmentation, disease classification, lesion detection, and cell counting. Active learning has been combined with semi-supervised learning and self-supervised learning to further reduce annotation requirements in radiology, pathology, dermatology, and ophthalmology.
Perception systems for autonomous vehicles must recognize objects such as vehicles, pedestrians, cyclists, traffic signs, and lane markings from camera images and LiDAR point clouds. Annotating 3D point clouds is particularly time-consuming: each frame may contain hundreds of thousands of points that need semantic or instance-level labels.
Active learning helps prioritize which frames or scenes to annotate. Rather than labeling every frame captured during test drives, an active learning system can identify the frames where the perception model is most uncertain or where rare objects appear. This is especially valuable for detecting edge cases and rare scenarios (such as unusual weather conditions or uncommon road users) that are underrepresented in the training data but critical for safety.
In drug discovery, the search space of possible molecular candidates is enormous (estimated at 10^60 drug-like molecules), and experimentally evaluating each candidate is expensive and slow. Active learning, often framed as Bayesian optimization, guides the selection of which molecules to synthesize and test next.
A surrogate model (often a Gaussian process or a graph neural network) predicts molecular properties such as binding affinity, toxicity, or solubility. The active learning loop selects molecules that balance exploration (testing molecules in unexplored regions of chemical space) with exploitation (testing molecules predicted to have desirable properties). Batched Bayesian optimization is particularly relevant because laboratory experiments are typically run in batches on well plates.
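A simplified sketch of one selection round in such a loop, using a Gaussian process surrogate and an upper-confidence-bound acquisition to trade off exploitation and exploration; the "fingerprint" features and measured property values are placeholders rather than real molecular data:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Placeholder featurization: each row stands in for a molecular fingerprint.
X_candidates = rng.normal(size=(5000, 64))
X_tested = X_candidates[:32]                 # molecules already assayed
y_tested = rng.normal(size=32)               # measured property (placeholder)

# Surrogate model fitted to the tested molecules.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X_tested, y_tested)

# Upper confidence bound: mean + kappa * std balances exploitation and exploration.
mean, std = gp.predict(X_candidates, return_std=True)
ucb = mean + 2.0 * std

# Next batch to send to the lab, e.g. one 96-well plate.
batch = np.argsort(ucb)[-96:]
```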
Recent work has demonstrated that combining pretrained transformer models (such as molecular BERT variants) with Bayesian active learning can identify toxic compounds with 50% fewer experimental iterations compared to conventional active learning approaches.
Active learning has also been applied in other domains where labels are expensive to obtain, such as speech recognition, anomaly detection, recommender systems, and remote sensing.
Several open-source libraries make it easier to implement active learning workflows in Python.
| Library | Description | Integrations | License |
|---|---|---|---|
| modAL | Modular active learning framework built on top of scikit-learn. Provides flexibility to swap query strategies, models, and other components with custom implementations. | scikit-learn | MIT |
| ALiPy | Comprehensive active learning toolbox implementing over 20 query strategies. Provides modules for data management, strategy invocation, and experiment evaluation. | scikit-learn, NumPy | BSD-3 |
| Baal | Bayesian active learning library focused on uncertainty estimation with deep learning models. Implements MC dropout, BALD, and other Bayesian acquisition functions. | PyTorch, Hugging Face | Apache 2.0 |
| small-text | Active learning library specialized for text classification. Supports scikit-learn classifiers, PyTorch models, and Hugging Face transformers with GPU acceleration. | scikit-learn, PyTorch, Hugging Face Transformers | MIT |
| libact | Pool-based active learning library that implements several classical strategies including uncertainty sampling, QBC, and query synthesis with a unified interface. | scikit-learn | BSD-2 |
| scikit-activeml | Comprehensive library following scikit-learn conventions. Provides a wide range of query strategies for classification, regression, and clustering. | scikit-learn | BSD-3 |
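As one concrete example, a minimal pool-based loop written against modAL's documented ActiveLearner interface might look like the following (the dataset, model, and query budget are illustrative; consult the library's documentation for the current API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_seed, y_seed, X_pool, y_pool = X[:20], y[:20], X[20:], y[20:]

learner = ActiveLearner(
    estimator=RandomForestClassifier(random_state=0),
    query_strategy=uncertainty_sampling,
    X_training=X_seed, y_training=y_seed,
)

for _ in range(50):                                       # 50 oracle queries
    query_idx, _ = learner.query(X_pool)
    learner.teach(X_pool[query_idx], y_pool[query_idx])   # oracle supplies the label
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx)
```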
Cost savings. Active learning can reduce the expense of labeling data by selecting only the most informative samples instead of labeling the entire dataset. This approach is especially advantageous in situations where labeling data is expensive or time-consuming, such as medical diagnosis or image recognition.
Higher accuracy. Active learning can reach a given level of accuracy with fewer labeled data points than random sampling, since the algorithm concentrates annotation effort on the samples most likely to improve the model.
Scalability. Active learning can handle large datasets with a limited labeling budget, allowing the learning algorithm to draw upon an abundance of unlabeled data. This is especially useful in situations where labeled information is scarce.
Faster iteration. By focusing annotation effort on the most valuable instances, active learning enables faster development cycles for machine learning projects, since less time is spent on labeling data that would provide little marginal benefit.
Labeling bias. Selecting data points for labeling may introduce bias into the training data, as certain regions of the input space may be oversampled or undersampled. This sampling bias can cause the model to perform well on queried regions but poorly on underrepresented areas.
Selection strategy design. The success of active learning depends heavily on the selection strategy employed. Crafting an effective selection strategy necessitates a deep understanding of both the problem domain and learning algorithm. A strategy that works well for one model or dataset may perform poorly on another.
Oracle dependency. Active learning relies on an oracle or human to label selected samples, which could introduce errors and increase labeling costs. In practice, human annotators make mistakes, and noisy labels can degrade model performance. Inter-annotator disagreement is a further complication.
Stopping criteria. Deciding when to stop querying is a non-trivial problem. Common heuristics include budget exhaustion, performance convergence, or reaching a target metric, but none of these provide a principled guarantee that further labeling would not help.
Evaluation difficulty. Comparing active learning strategies fairly is difficult because performance depends on the initial seed set, the model architecture, the dataset, and random factors. Small changes in experimental setup can lead to different conclusions about which strategy is best.
Imagine you have a huge box of toys, but you do not know what some of them are called. You could ask a grown-up to name every single toy, but that would take a very long time. Instead, you pick out the toys you are most confused about and ask about those first. After each answer, you get a little smarter about which toys you still need help with. That is what active learning does: instead of asking about everything, it picks the things it is most confused about so it can learn faster with less help.