See also: Machine learning terms, Semi-supervised learning, Bayesian optimization
Active learning is a subfield of machine learning in which a learning algorithm can interactively query an oracle (typically a human annotator) to label new data points. Rather than training on a randomly sampled dataset, the learner selects the most informative instances from a large pool of unlabeled examples and requests their labels. The central goal is to achieve high model accuracy while minimizing the number of labeled examples required, thereby reducing the cost, time, and effort associated with data annotation.
The key insight behind active learning is that not all data points are equally useful for training. A model can learn more efficiently if it chooses which examples to study, focusing on instances where its predictions are uncertain or where new information would most improve its understanding. This stands in contrast to passive supervised machine learning, where training examples are sampled at random and the learner has no control over what it sees. In many real-world settings, unlabeled data is abundant (web pages, sensor readings, medical scans) but obtaining labels is expensive, making active learning a practical and widely studied approach.
Burr Settles' 2009 survey, "Active Learning Literature Survey," remains the most cited work in the field with over 6,000 citations. The survey established a foundational taxonomy of active learning scenarios and query strategies that researchers continue to build upon today.
The theoretical foundations of active learning date back to at least the early 1990s. Cohn, Atlas, and Ladner (1994) published "Improving Generalization with Active Learning" in the journal Machine Learning, introducing a formalism called selective sampling. In selective sampling, a learner receives distribution information from the environment and queries an oracle on parts of the domain it considers useful. The paper demonstrated that, in some situations, active learning is provably more powerful than learning from randomly sampled examples, achieving better generalization for a fixed number of training instances. A preliminary version of this work appeared even earlier, as Cohn et al. (1990).
Another foundational contribution was the query-by-committee (QBC) algorithm, proposed by Seung, Opper, and Sompolinsky (1992). QBC used a committee of models to identify instances on which the models disagreed, and it was shown to reduce the amount of data needed for learning exponentially in certain theoretical settings.
Throughout the 2000s, active learning grew from a primarily theoretical topic into a practical tool applied in natural language processing, information extraction, and text classification. Settles and Craven (2008) provided an influential empirical comparison of active learning strategies for sequence labeling tasks. Then Settles' 2009 survey unified the field by organizing the literature into a coherent taxonomy of scenarios and query strategies, establishing vocabulary and frameworks that remain standard.
The 2010s saw a surge of interest in combining active learning with deep learning, bringing new challenges such as batch selection, calibration, and the cold start problem. Since 2023, researchers have begun exploring how large language models can serve as both the oracle and the selection mechanism in active learning pipelines, opening new frontiers in the field.
The active learning process consists of three core elements: the learning algorithm, the unlabeled data pool, and the selection strategy.
The learning algorithm is responsible for predicting labels based on previously labeled examples. It can be any machine learning algorithm, such as support vector machines, decision trees, or neural networks.
The unlabeled data pool is a large collection of unlabeled data points collected from various sources such as web crawling, sensor networks, or historical records.
The selection strategy (also called the query strategy or acquisition function) is the foundation of active learning, determining which unlabeled data points should be selected for labeling. It is guided by the learning algorithm's uncertainty, the data distribution, or both, and it strives to pick samples that will maximally improve model accuracy.
A typical active learning loop proceeds as follows:

1. Train the model on the current labeled set (initially a small seed set).
2. Apply the selection strategy to score the instances in the unlabeled pool.
3. Query the oracle for labels on the highest-scoring instance or batch.
4. Add the newly labeled data to the training set and retrain the model.
5. Repeat until a stopping criterion is met, such as exhausting the annotation budget or reaching the desired performance.
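The sketch below shows one way this loop can be realized in the pool-based setting, assuming a scikit-learn classifier and least-confidence uncertainty sampling; the dataset, seed size, and budget are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative data: a small labeled seed set and a large unlabeled pool.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
rng = np.random.RandomState(0)
labeled_idx = rng.choice(len(X), size=20, replace=False)
pool_idx = np.setdiff1d(np.arange(len(X)), labeled_idx)

model = LogisticRegression(max_iter=1000)
budget = 100  # number of oracle queries we are willing to pay for

for _ in range(budget):
    # 1. Train on the current labeled set.
    model.fit(X[labeled_idx], y[labeled_idx])

    # 2. Score the pool with least-confidence uncertainty.
    probs = model.predict_proba(X[pool_idx])
    uncertainty = 1.0 - probs.max(axis=1)

    # 3. Select the most uncertain instance and "query the oracle"
    #    (here the true labels are already known, standing in for an annotator).
    query = pool_idx[np.argmax(uncertainty)]

    # 4. Move the newly labeled instance from the pool to the labeled set.
    labeled_idx = np.append(labeled_idx, query)
    pool_idx = pool_idx[pool_idx != query]
```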
Active learning methods are typically organized into three main scenarios based on how unlabeled data is accessed. These scenarios were formalized by Settles (2009) and remain the standard framework in the literature.
Pool-based active learning is the most commonly studied scenario. The learner has access to a large pool of unlabeled instances and evaluates the entire pool (or a subsample of it) to decide which instance to query next. The learner ranks all unlabeled instances according to some informativeness measure and selects the top-ranked instance or batch of instances for labeling.
This approach assumes that collecting unlabeled data is inexpensive but that obtaining labels is costly. Pool-based methods are widely used in practical applications such as text classification, image classification, and named entity recognition.
In stream-based selective sampling (sometimes called sequential active learning), each unlabeled instance is drawn one at a time from the data source. The learner must decide on the spot whether to query the label for each incoming instance or discard it and move on to the next. The decision is typically made by comparing the informativeness of the instance against a threshold.
Stream-based methods are useful when the data arrives continuously (as in a data stream) and it is impractical to store the entire pool. Since instances come from the true underlying distribution, the queries are guaranteed to be realistic examples even when that distribution is non-uniform or unknown.
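A minimal sketch of the stream-based decision rule, assuming a fitted probabilistic classifier; the threshold is an illustrative hyperparameter that trades query cost against informativeness:

```python
import numpy as np

def should_query(model, x, threshold=0.2):
    """Ask the oracle for a label only if the model's least-confidence
    uncertainty on the incoming instance exceeds the threshold."""
    probs = model.predict_proba(np.asarray(x).reshape(1, -1))[0]
    return (1.0 - probs.max()) > threshold

# Inside a stream loop (model, stream, and oracle are assumed to exist):
#     for x in stream:
#         if should_query(model, x):
#             y = oracle(x)   # human annotator labels the instance
#             ...             # add (x, y) to the training set, update the model
```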
In membership query synthesis, the learner generates new instances from the input space and requests their labels from the oracle. Rather than selecting from existing data, the learner synthesizes artificial query points that it expects to be maximally informative.
This scenario is powerful in theory because it allows the learner to explore any region of the feature space. However, synthesized instances can be difficult for human annotators to label because they may not correspond to natural or meaningful inputs. For example, a synthesized image might look like random noise to a human labeler. Membership query synthesis has found more success in domains where an automated oracle (such as a simulation or a physical experiment) can evaluate arbitrary inputs, rather than tasks requiring human judgment.
| Scenario | Data access | Strengths | Limitations |
|---|---|---|---|
| Pool-based | Evaluates a full pool of unlabeled data | Most studied; effective ranking of all candidates | Computationally expensive for very large pools |
| Stream-based | Instances arrive sequentially | Low memory footprint; suits real-time data | Cannot compare candidates globally |
| Membership query synthesis | Learner generates synthetic instances | Can explore any region of feature space | Synthetic queries may be uninterpretable for human annotators |
Active learning utilizes various selection strategies to determine which unlabeled data points are most worth labeling. The choice of query strategy has a large impact on the efficiency of active learning. Below are the major families of query strategies.
Uncertainty sampling is the simplest and most widely employed strategy. It selects the data points about which the current model is least certain. For classification tasks, uncertainty can be measured in several ways:
| Uncertainty measure | Description | Formula (informal) |
|---|---|---|
| Least confidence | Selects the instance whose predicted class has the lowest probability | 1 minus the probability of the most likely class |
| Margin sampling | Selects the instance with the smallest gap between the top two predicted class probabilities | Probability of first class minus probability of second class |
| Entropy sampling | Selects the instance with the highest entropy over the predicted class distribution | Negative sum of p times log(p) across all classes |
For instance, if a binary classifier assigns a 51% probability to class A and 49% to class B, that instance has very high uncertainty under all three measures and would be prioritized for labeling. Uncertainty sampling is computationally cheap and easy to implement, but it considers only the model's current predictions without accounting for the broader data distribution. This can cause the learner to repeatedly query outliers or instances in a narrow region of the feature space.
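These measures can be computed directly from a model's predicted class probabilities; a minimal NumPy sketch with an illustrative probability matrix:

```python
import numpy as np

# Predicted class probabilities for three instances over three classes (illustrative).
probs = np.array([
    [0.51, 0.49, 0.00],   # near the decision boundary
    [0.90, 0.07, 0.03],   # confident prediction
    [0.40, 0.35, 0.25],   # probability mass spread out
])

least_confidence = 1.0 - probs.max(axis=1)

sorted_probs = np.sort(probs, axis=1)[:, ::-1]
margin = sorted_probs[:, 0] - sorted_probs[:, 1]     # smaller margin = more uncertain

entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)

# Uncertainty sampling queries the argmax of least_confidence or entropy,
# or the argmin of margin.
```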
Despite these limitations, large-scale benchmarking studies have found that uncertainty sampling remains a surprisingly strong baseline. An expanded benchmark by Gonsior et al. (2023) on tabular datasets concluded that uncertainty sampling consistently performs at or near the top across a variety of settings, often matching more complex strategies.
Query-by-committee maintains a committee of models (an ensemble), each trained on different subsets or variations of the labeled data. The committee members independently predict labels for each unlabeled instance, and the instances that produce the most disagreement among the committee are selected for labeling.
Disagreement can be quantified using measures such as:

- Vote entropy: the entropy of the distribution of the committee members' votes over the possible labels.
- Average Kullback-Leibler (KL) divergence: the mean divergence between each member's predicted label distribution and the consensus distribution of the committee.
The underlying assumption is that instances causing high disagreement sit in regions of the feature space where the model is uncertain, and labeling those instances will help resolve ambiguities. QBC is related to the idea of version space reduction: each query aims to eliminate as many hypotheses as possible from the set of models consistent with the labeled data.
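A compact sketch of query-by-committee with vote entropy as the disagreement measure, using a hand-rolled bagged committee of decision trees; all names and sizes are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_lab, y_lab, X_pool = X[:50], y[:50], X[50:]

# Committee: each member is trained on a bootstrap resample of the labeled data.
rng = np.random.RandomState(0)
committee = []
for _ in range(7):
    idx = rng.choice(len(X_lab), size=len(X_lab), replace=True)
    committee.append(DecisionTreeClassifier(random_state=0).fit(X_lab[idx], y_lab[idx]))

votes = np.stack([m.predict(X_pool) for m in committee])   # shape (members, n_pool)

def vote_entropy(votes, n_classes):
    # Fraction of committee votes for each class, per pool instance.
    vote_fracs = np.stack([(votes == c).mean(axis=0) for c in range(n_classes)], axis=1)
    return -np.sum(vote_fracs * np.log(vote_fracs + 1e-12), axis=1)

scores = vote_entropy(votes, n_classes=2)
query_idx = np.argmax(scores)   # pool instance with the most disagreement
```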
Expected model change selects the instance that, once labeled, would cause the greatest change to the current model. The "change" is typically measured as the magnitude of the gradient that would be applied to update the model's parameters after training on the new labeled instance. Instances that would induce large gradient updates are considered highly informative.
This strategy was introduced by Settles and Craven (2008) under the name "expected gradient length" (EGL). Because the true label of a candidate is unknown in advance, the expected gradient is computed by summing over all possible labels weighted by their predicted probabilities, with the intuition that the learner prefers instances likely to influence the model the most regardless of which label they ultimately receive. While more principled than uncertainty sampling, expected model change is more computationally expensive because it requires gradient computations for every unlabeled candidate.
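For models where the per-instance gradient has a closed form, EGL can be sketched directly; the example below uses multinomial logistic regression, for which the gradient of the log-loss with respect to the class-c weights is (p_c − [y = c]) times the input. The dataset and split are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_lab, y_lab, X_pool = X[:60], y[:60], X[60:]

model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
probs = model.predict_proba(X_pool)              # shape (n_pool, n_classes)
n_classes = probs.shape[1]

def expected_gradient_length(x, p):
    # Sum over hypothetical labels of p(y|x) * ||gradient of the log-loss||.
    egl = 0.0
    for label in range(n_classes):
        onehot = np.eye(n_classes)[label]
        grad = np.outer(p - onehot, x)           # (n_classes, n_features)
        egl += p[label] * np.linalg.norm(grad)
    return egl

scores = np.array([expected_gradient_length(x, p) for x, p in zip(X_pool, probs)])
query_idx = np.argmax(scores)   # instance expected to change the model the most
```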
Expected error reduction selects the instance that, once labeled, would most reduce the model's expected future error (or loss) on the remaining unlabeled data. For each candidate instance, the strategy simulates adding each possible label, retrains (or approximates retraining) the model, and estimates the resulting generalization error.
This approach is conceptually appealing because it directly targets the quantity we want to minimize. However, it is computationally demanding: for each candidate, the model must be retrained (or approximated) for every possible label, and the expected error must be estimated over the remaining unlabeled pool. This cost is prohibitive for modern deep neural networks, which is why expected error reduction has not been widely adopted in deep active learning. In practice, approximations such as one-step gradient descent or influence functions are used to make the computation feasible for simpler models.
Diversity sampling (also called representativeness-based sampling) selects instances that collectively represent the overall distribution of the unlabeled data. Rather than focusing purely on model uncertainty, diversity sampling aims to ensure that the selected batch of instances covers the feature space broadly.
Common techniques for diversity sampling include:

- Clustering the unlabeled pool (for example with k-means) and selecting a representative from each cluster.
- Core-set or k-center selection, which minimizes the maximum distance from any unlabeled point to the selected set.
- Selecting batches that maximize pairwise distances among the chosen instances in feature space.
Diversity sampling alone does not account for model uncertainty, so it is often combined with uncertainty-based methods in hybrid approaches.
Density-weighted methods combine informativeness scores with information about the underlying data density. The intuition is that an uncertain instance in a dense region of the feature space is more valuable than an equally uncertain instance in a sparse region, because the dense-region instance is more representative of the data the model will encounter at test time.
A common formulation multiplies the informativeness score (such as uncertainty) by a density weight proportional to the average similarity between the candidate and other unlabeled instances. This helps prevent the learner from wasting queries on outliers.
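A minimal sketch of this information-density weighting, using least-confidence uncertainty as the base score and cosine similarity for density; the beta exponent controlling the density term is an illustrative hyperparameter:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def information_density_scores(probs, X_pool, beta=1.0):
    """Weight an uncertainty score by each candidate's average similarity
    to the rest of the unlabeled pool, raised to the power beta."""
    uncertainty = 1.0 - probs.max(axis=1)            # least-confidence uncertainty
    density = cosine_similarity(X_pool).mean(axis=1)
    return uncertainty * density ** beta

# Usage: probs = model.predict_proba(X_pool); query np.argmax of the scores.
```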
| Strategy | Selection criterion | Strengths | Limitations |
|---|---|---|---|
| Uncertainty sampling | Model confidence on individual instances | Simple, fast, easy to implement | May oversample outliers; ignores data distribution |
| Query-by-committee | Disagreement among a committee of models | Reduces version space; considers multiple hypotheses | Requires maintaining multiple models |
| Expected model change | Predicted impact on model parameters | Principled; directly targets model improvement | Computationally expensive (gradient calculations) |
| Expected error reduction | Predicted reduction in future error | Directly minimizes the target objective | Very expensive; requires simulated retraining per candidate |
| Diversity sampling | Representativeness of selected batch | Ensures coverage of the feature space | Does not consider model uncertainty |
| Density-weighted methods | Informativeness weighted by local density | Avoids wasting queries on outliers | Requires density estimation; adds complexity |
Bayesian Active Learning by Disagreement (BALD), introduced by Houlsby et al. (2011), is an information-theoretic acquisition function designed for active learning with Bayesian models. BALD selects data points that maximize the mutual information between the model's predictions and its parameters, given the observed data.
The BALD acquisition function can be expressed as:
BALD(x) = H[p(y|x, D)] - E_{p(w|D)}[H[p(y|x, w)]]
The first term is the entropy of the model's predictive distribution (marginal uncertainty). The second term is the expected entropy of the prediction under individual parameter settings (aleatoric uncertainty). The difference captures epistemic uncertainty: BALD scores are highest for instances where the model is uncertain overall but individual parameter configurations are confident, meaning the parameters "disagree" about the correct prediction.
In practice, the expectation over the posterior p(w|D) is approximated using techniques such as Monte Carlo (MC) dropout (Gal and Ghahramani, 2016), where multiple forward passes with random dropout masks serve as samples from an approximate posterior. Each forward pass produces a different prediction, and the spread of those predictions estimates epistemic uncertainty.
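Once a stack of stochastic predictions is available (for example, T forward passes with dropout left on), the BALD score follows directly from the formula above; a sketch assuming an array of sampled class probabilities:

```python
import numpy as np

def bald_scores(mc_probs, eps=1e-12):
    """mc_probs: array of shape (n_samples, n_instances, n_classes) holding
    class probabilities from repeated stochastic forward passes."""
    mean_probs = mc_probs.mean(axis=0)                        # marginal predictive distribution
    predictive_entropy = -np.sum(mean_probs * np.log(mean_probs + eps), axis=1)
    expected_entropy = -np.mean(
        np.sum(mc_probs * np.log(mc_probs + eps), axis=2), axis=0)
    return predictive_entropy - expected_entropy              # epistemic uncertainty

# Usage: stack the softmax outputs of T dropout-enabled forward passes into
# mc_probs, then query np.argmax(bald_scores(mc_probs)).
```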
BALD has become one of the standard acquisition functions for deep active learning due to its solid theoretical grounding and practical effectiveness. Several extensions have been proposed, most notably BatchBALD (Kirsch et al., 2019), which scores entire batches jointly to avoid selecting redundant instances, as well as variants that approximate the posterior with deep ensembles rather than MC dropout.
Applying active learning to deep learning models introduces several unique challenges that do not arise with traditional machine learning algorithms.
Uncertainty calibration. Deep neural networks are often overconfident in their predictions. The softmax outputs of a neural network are frequently poorly calibrated and do not reliably reflect true prediction uncertainty. This is a problem for uncertainty-based query strategies, which depend on accurate uncertainty estimates. Techniques such as temperature scaling, MC dropout, and Bayesian neural networks are used to improve calibration.
Cold start problem. Deep networks typically require large amounts of labeled data to train effectively. In the initial rounds of active learning, when very few labeled examples are available, the model may be too poorly trained for its uncertainty estimates to be meaningful. This can lead to poor query decisions in early iterations. Several strategies have been developed to mitigate the cold start problem. Transfer learning from pretrained models (such as ImageNet-pretrained CNNs or pretrained language models like BERT) provides a strong initialization so that even the first round of uncertainty estimation is useful. Yuan et al. (2020) proposed cold-start active learning through self-supervised language modeling, using the pretraining loss to identify examples that "surprise" the model and should be labeled first. Clustering-based approaches, such as using contrastive learning features with BIRCH clustering, can also select a diverse initial seed set without relying on model predictions.
Batch mode requirements. Training deep networks is computationally expensive, so it is impractical to retrain after adding a single new labeled instance. Instead, deep active learning operates in batch mode, selecting a batch of instances to label before retraining. Naive application of instance-level query strategies (such as selecting the top-k most uncertain points) can result in redundant batches where all selected instances are similar to each other.
Computational cost. Evaluating query strategies over large unlabeled pools with deep models is expensive. Strategies that require multiple forward passes (such as MC dropout for BALD) or gradient computations (such as expected model change) add significant overhead.
Representation shift. As labeled data grows across active learning rounds, the learned feature representations change. A query strategy that selects informative instances based on the current representation may not remain optimal after the model is retrained with new data.
| Method | Year | Key idea |
|---|---|---|
| Core-Set (Sener and Savarese) | 2018 | Frames active learning as core-set selection in the learned feature space, minimizing the maximum distance from any unlabeled point to the selected set |
| VAAL (Sinha et al.) | 2019 | Uses a variational autoencoder and a discriminator to distinguish labeled from unlabeled data; queries instances that the discriminator classifies as unlabeled |
| BatchBALD (Kirsch et al.) | 2019 | Selects batches by jointly maximizing mutual information, avoiding redundancy inherent in greedy top-k BALD selection |
| BADGE (Ash et al.) | 2020 | Represents instances as gradient embeddings from the last layer, then applies k-means++ in gradient space to select a diverse, uncertain batch |
| CEAL (Wang et al.) | 2017 | Runs two parallel processes: querying uncertain samples for human labeling while assigning pseudo-labels to highly confident samples |
| Loss Prediction (Yoo and Kweon) | 2019 | Attaches an auxiliary "loss prediction module" to the model that predicts the loss for unlabeled samples; queries instances with the highest predicted loss |
In many practical settings, labeling is done in batches rather than one instance at a time. A labeler might annotate a batch of 100 images overnight, or a laboratory might run 96 experiments simultaneously on a well plate. Batch active learning addresses the problem of selecting an entire batch of informative instances at once, rather than selecting them one by one.
The main challenge in batch selection is avoiding redundancy. If the query strategy simply selects the top-k most uncertain instances, those instances may all be very similar, providing overlapping information. Effective batch active learning methods must balance informativeness (each instance should be valuable) with diversity (the instances in the batch should cover different parts of the feature space).
Greedy sequential selection. Select instances one at a time, updating the acquisition scores after each selection to account for the information already captured. This is straightforward but requires re-evaluating the acquisition function k times.
Clustering-based batching. First identify the top candidates by informativeness (for example, the most uncertain instances), then cluster those candidates and select one representative from each cluster. Zhdanov (2019) proposed a method that prefilters the top candidates, clusters them into the desired batch size, and selects the instance nearest each cluster center.
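A simplified sketch in the spirit of this prefilter-then-cluster approach (Zhdanov's method uses uncertainty-weighted k-means; plain k-means is used here for brevity, and the prefilter and batch sizes are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_diverse_batch(X_pool, uncertainty, batch_size=10, prefilter=200, seed=0):
    # 1. Keep only the most uncertain candidates.
    top = np.argsort(uncertainty)[-prefilter:]
    # 2. Cluster them into `batch_size` groups in feature space.
    km = KMeans(n_clusters=batch_size, n_init=10, random_state=seed).fit(X_pool[top])
    # 3. Take the candidate closest to each cluster center.
    batch = []
    for c in range(batch_size):
        members = top[km.labels_ == c]
        dists = np.linalg.norm(X_pool[members] - km.cluster_centers_[c], axis=1)
        batch.append(members[np.argmin(dists)])
    return np.array(batch)
```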
Gradient embedding methods. BADGE (Ash et al., 2020) computes gradient embeddings for each unlabeled instance, capturing both uncertainty (through gradient magnitude) and diversity (through the direction of the gradient). It then applies k-means++ initialization in gradient space to construct a batch that is simultaneously uncertain and diverse, without any hyperparameters to tune.
Submodular optimization. Some methods formulate batch selection as a submodular maximization problem, where the objective function captures both informativeness and diversity. Submodular functions exhibit diminishing returns, which naturally discourages redundancy.
Deciding when to stop the active learning loop is a non-trivial problem with no universal solution. Continuing to label beyond the point of diminishing returns wastes annotation budget, while stopping too early leaves potential performance on the table. Several families of stopping criteria have been proposed.
Budget exhaustion. The simplest approach: stop when the annotation budget (a fixed number of labeled instances or a fixed monetary cost) has been spent. This is practical but does not adapt to how well the model is actually learning.
Performance convergence. Monitor the model's performance on a held-out validation set and stop when performance has not improved by more than a threshold over a specified number of iterations. This approach can detect when additional labels yield diminishing returns.
Confidence-based criteria. Zhu and Hovy (2008) proposed stopping criteria based on the model's confidence over the unlabeled pool. These include stopping when the maximum uncertainty in the pool drops below a threshold (indicating the model is confident about all remaining instances) and stopping when the overall average uncertainty falls below a threshold.
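A small sketch of such confidence-based stopping, assuming pool uncertainties have already been computed; both thresholds are illustrative and would be tuned per application:

```python
import numpy as np

def should_stop(pool_uncertainty, max_threshold=0.1, mean_threshold=0.05):
    """Stop when the model is confident about every remaining pool instance,
    or when the average uncertainty over the pool has dropped low enough."""
    return bool(pool_uncertainty.max() < max_threshold
                or pool_uncertainty.mean() < mean_threshold)
```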
Model stability. Bloodgood and Vijay-Shanker (2009) proposed stopping when the model itself no longer changes significantly with the addition of new training instances. If adding any of the remaining unlabeled examples would produce a change below a given threshold, further labeling is unlikely to help.
Error stability. Ishibashi and Hino (2020) proposed a criterion based on bounding the change in generalization error when a new sample is added. This criterion can be applied to any Bayesian active learning method and provides a more principled guarantee.
In practice, no single stopping criterion dominates across all domains. Optimality depends on the trade-off between accuracy and annotation cost, which varies by application.
Active learning is a core component of human-in-the-loop machine learning, where human annotators and machine learning models collaborate iteratively. Several tools have been developed to operationalize this workflow.
Prodigy, developed by Explosion (the creators of spaCy), is a commercial annotation tool built around active learning. Its text classification recipe implements uncertainty sampling to show the annotator only the examples the model is most unsure about, substantially reducing annotation effort compared to random sampling. Prodigy supports NLP tasks such as named entity recognition, text classification, and span categorization.
Label Studio is an open-source annotation platform that supports text, images, audio, video, and time-series data. It can be connected to a machine learning backend that provides predictions and receives annotations in a continuous loop, enabling active learning workflows. After each annotation, Label Studio sends the data to the ML backend for model updating, and when the user moves to the next task, the updated model provides new predictions.
Encord and other enterprise annotation platforms also support active learning workflows for computer vision tasks, including image classification, object detection, and semantic segmentation.
The human-in-the-loop paradigm provides an additional benefit beyond efficient labeling: by reviewing model-flagged examples, annotators gain insight into the model's uncertainties and potential failure modes. This feedback loop helps practitioners identify systematic errors and data quality issues early in the development cycle.
The rise of large language models has introduced new paradigms for active learning. A 2025 ACL survey categorized LLM-based active learning into two main steps: LLM-based selection (using an LLM to choose or generate instances for annotation) and LLM-based annotation (using an LLM to provide labels).
LLMs as annotators. Instead of using expensive human annotators, LLMs such as GPT-4 can serve as oracles that label selected instances. Research by Kholodna et al. (2024) demonstrated that using GPT-4-Turbo as an annotator within an active learning loop for low-resource languages achieved near-state-of-the-art performance with estimated cost savings of at least 42 times compared to human annotation.
LLMs for instance selection. ActiveLLM (Schroeder et al., 2024) uses large language models to select instances for labeling in few-shot scenarios. Rather than relying on a trained classifier's uncertainty estimates (which may be unreliable with very few examples), the LLM itself evaluates which unlabeled instances would be most informative. This approach outperforms traditional active learning strategies when the labeled set is extremely small.
Hybrid human-LLM frameworks. Recent work has proposed frameworks that route annotation tasks between human annotators and LLMs based on model uncertainty. High-uncertainty instances are sent to human experts for reliable labeling, while lower-uncertainty instances are labeled by the LLM at a fraction of the cost. This hybrid approach balances cost efficiency with annotation quality.
Active learning and semi-supervised learning both address the problem of learning effectively when labeled data is scarce and unlabeled data is plentiful, but they take fundamentally different approaches.
| Aspect | Active learning | Semi-supervised learning |
|---|---|---|
| Core mechanism | Selects which unlabeled instances to have labeled by an oracle | Uses unlabeled data directly (without labeling) to improve the model |
| Role of oracle | Requires an oracle (human or automated) to label selected instances | Does not require additional labeling |
| Use of unlabeled data | Queries specific unlabeled instances for labels | Leverages the distribution of all unlabeled data (for example, through consistency regularization or pseudo-labels) |
| Goal | Minimize the number of labels needed by choosing the most informative ones | Improve model performance by exploiting structure in unlabeled data |
| Label selection | Queries instances in uncertain or informative regions | Assigns pseudo-labels to instances the model is most confident about |
The two approaches are complementary and can be combined. For instance, an active learning system might use semi-supervised techniques to leverage the unlabeled data between query rounds, while using an active query strategy to select which instances to send for human annotation. This combination has been shown to outperform either approach used in isolation.
Annotating text data for NLP tasks is labor-intensive. Tasks like named entity recognition, sentiment analysis, relation extraction, and text classification require domain experts to read and label large volumes of text. Active learning has been applied extensively to reduce this annotation burden.
In clinical NLP, active learning with uncertainty sampling has been shown to achieve an F-measure of 0.80 for named entity recognition while requiring 66% fewer annotated sentences compared to random sampling (Chen et al., 2015). For text classification with deep learning, recent surveys have categorized active learning query strategies into data-based, model-based, and prediction-based approaches, reflecting the diversity of techniques adapted for neural text models.
Active learning is also used for building training sets for large language models and for preference annotation in reinforcement learning from human feedback (RLHF), where annotator time is a major bottleneck.
Annotating medical images is uniquely expensive because it requires trained specialists. A radiologist may spend approximately 60 minutes manually segmenting brain tumors per patient in multi-sequence MRI volumes. A pathologist typically needs 15 to 30 minutes to examine a single histopathology slide under a microscope.
Active learning addresses this bottleneck by selecting the most informative images or image regions for expert annotation. A comprehensive survey by Li et al. (2024) catalogued more than 160 deep active learning works applied to medical image analysis, covering tasks such as tumor segmentation, disease classification, lesion detection, and cell counting. Active learning has been combined with semi-supervised learning and self-supervised learning to further reduce annotation requirements in radiology, pathology, dermatology, and ophthalmology.
Perception systems for autonomous vehicles must recognize objects such as vehicles, pedestrians, cyclists, traffic signs, and lane markings from camera images and LiDAR point clouds. Annotating 3D point clouds is particularly time-consuming: each frame may contain hundreds of thousands of points that need semantic or instance-level labels.
Active learning helps prioritize which frames or scenes to annotate. Rather than labeling every frame captured during test drives, an active learning system can identify the frames where the perception model is most uncertain or where rare objects appear. This is especially valuable for detecting edge cases and rare scenarios (such as unusual weather conditions or uncommon road users) that are underrepresented in the training data but critical for safety.
In drug discovery, the search space of possible molecular candidates is enormous (estimated at 10^60 drug-like molecules), and experimentally evaluating each candidate is expensive and slow. Active learning, often framed as Bayesian optimization, guides the selection of which molecules to synthesize and test next.
A surrogate model (often a Gaussian process or a graph neural network) predicts molecular properties such as binding affinity, toxicity, or solubility. The active learning loop selects molecules that balance exploration (testing molecules in unexplored regions of chemical space) with exploitation (testing molecules predicted to have desirable properties). Batched Bayesian optimization is particularly relevant because laboratory experiments are typically run in batches on well plates.
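A simplified sketch of one selection round in such a loop, using a Gaussian process surrogate and an upper-confidence-bound acquisition to trade off exploitation and exploration; the "fingerprint" features and measured property values are placeholders rather than real molecular data:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Placeholder featurization: each row stands in for a molecular fingerprint.
X_candidates = rng.normal(size=(5000, 64))
X_tested = X_candidates[:32]                 # molecules already assayed
y_tested = rng.normal(size=32)               # measured property (placeholder)

# Surrogate model fitted to the tested molecules.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X_tested, y_tested)

# Upper confidence bound: mean + kappa * std balances exploitation and exploration.
mean, std = gp.predict(X_candidates, return_std=True)
ucb = mean + 2.0 * std

# Next batch to send to the lab, e.g. one 96-well plate.
batch = np.argsort(ucb)[-96:]
```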
Recent work has demonstrated that combining pretrained transformer models (such as molecular BERT variants) with Bayesian active learning can identify toxic compounds with 50% fewer experimental iterations compared to conventional active learning approaches.
Active learning has also been applied in other domains where labels are expensive to obtain, such as speech recognition, anomaly detection, recommender systems, and remote sensing.
Several open-source libraries make it easier to implement active learning workflows in Python.
| Library | Description | Integrations | License |
|---|---|---|---|
| modAL | Modular active learning framework built on top of scikit-learn. Provides flexibility to swap query strategies, models, and other components with custom implementations. | scikit-learn | MIT |
| ALiPy | Comprehensive active learning toolbox implementing over 20 query strategies. Provides modules for data management, strategy invocation, and experiment evaluation. | scikit-learn, NumPy | BSD-3 |
| Baal | Bayesian active learning library focused on uncertainty estimation with deep learning models. Implements MC dropout, BALD, and other Bayesian acquisition functions. | PyTorch, Hugging Face | Apache 2.0 |
| small-text | Active learning library specialized for text classification. Supports scikit-learn classifiers, PyTorch models, and Hugging Face transformers with GPU acceleration. | scikit-learn, PyTorch, Hugging Face Transformers | MIT |
| libact | Pool-based active learning library that implements several classical strategies including uncertainty sampling, QBC, and query synthesis with a unified interface. | scikit-learn | BSD-2 |
| scikit-activeml | Comprehensive library following scikit-learn conventions. Provides a wide range of query strategies for classification, regression, and clustering. | scikit-learn | BSD-3 |
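As one concrete example, a minimal pool-based loop written against modAL's documented ActiveLearner interface might look like the following (the dataset, model, and query budget are illustrative; consult the library's documentation for the current API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_seed, y_seed, X_pool, y_pool = X[:20], y[:20], X[20:], y[20:]

learner = ActiveLearner(
    estimator=RandomForestClassifier(random_state=0),
    query_strategy=uncertainty_sampling,
    X_training=X_seed, y_training=y_seed,
)

for _ in range(50):                                       # 50 oracle queries
    query_idx, _ = learner.query(X_pool)
    learner.teach(X_pool[query_idx], y_pool[query_idx])   # oracle supplies the label
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx)
```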
Cost savings. Active learning can reduce the expense of labeling data by selecting only the most informative samples instead of labeling the entire dataset. This approach is especially advantageous in situations where labeling data is expensive or time-consuming, such as medical diagnosis or image recognition.
Higher accuracy. Active learning can reach a given level of accuracy with fewer labeled data points than random sampling, since the algorithm concentrates annotation effort on the samples most likely to improve the model.
Scalability. Active learning can handle large datasets with a limited labeling budget, allowing the learning algorithm to draw upon an abundance of unlabeled data. This is especially useful in situations where labeled information is scarce.
Faster iteration. By focusing annotation effort on the most valuable instances, active learning enables faster development cycles for machine learning projects, since less time is spent on labeling data that would provide little marginal benefit.
Labeling bias. Selecting data points for labeling may introduce bias into the training data, as certain regions of the input space may be oversampled or undersampled. This sampling bias can cause the model to perform well on queried regions but poorly on underrepresented areas.
Selection strategy design. The success of active learning depends heavily on the selection strategy employed. Crafting an effective selection strategy necessitates a deep understanding of both the problem domain and learning algorithm. A strategy that works well for one model or dataset may perform poorly on another.
Oracle dependency. Active learning relies on an oracle or human to label selected samples, which could introduce errors and increase labeling costs. In practice, human annotators make mistakes, and noisy labels can degrade model performance. Inter-annotator disagreement is a further complication.
Stopping criteria. Deciding when to stop querying is a non-trivial problem. Common heuristics include budget exhaustion, performance convergence, or reaching a target metric, but none of these provide a principled guarantee that further labeling would not help.
Evaluation difficulty. Comparing active learning strategies fairly is difficult because performance depends on the initial seed set, the model architecture, the dataset, and random factors. Small changes in experimental setup can lead to different conclusions about which strategy is best.
Imagine you have a huge box of toys, but you do not know what some of them are called. You could ask a grown-up to name every single toy, but that would take a very long time. Instead, you pick out the toys you are most confused about and ask about those first. After each answer, you get a little smarter about which toys you still need help with. That is what active learning does: instead of asking about everything, it picks the things it is most confused about so it can learn faster with less help.