Scikit-learn (also known as sklearn) is a free, open-source machine learning library for the Python programming language. It provides simple and efficient tools for data mining, data analysis, and predictive modeling, covering a broad range of classical machine learning tasks including classification, regression, clustering, and dimensionality reduction. Built on top of NumPy, SciPy, and matplotlib, scikit-learn has become one of the most widely used machine learning libraries in the world, with over 60,000 stars on GitHub and more than 3,000 contributors [1].
Scikit-learn is designed around a consistent, clean API that makes it straightforward to swap algorithms in and out of a workflow. Whether a practitioner needs to train a support vector machine, fit a random forest, or run k-means clustering, the interface remains uniform: fit(), predict(), and transform(). This design philosophy has made scikit-learn a go-to starting point for students, researchers, and industry professionals alike.
Scikit-learn traces its origins to the 2007 Google Summer of Code, when David Cournapeau, a French computing researcher, created the project under the name scikits.learn. The name reflected its status as a "SciPy Toolkit" (or "scikit"), one of several add-on packages in the SciPy ecosystem. Cournapeau's initial goal was to build a collection of well-tested, accessible machine learning routines that could integrate smoothly with the existing Python scientific stack [2].
Later in 2007, Matthieu Brucher continued development on the project as part of his doctoral thesis. However, the project remained relatively small until 2010, when researchers at INRIA (the French National Institute for Research in Digital Science and Technology) took a decisive role in its future.
In 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, and Vincent Michel, all affiliated with INRIA, assumed leadership of the project and coordinated the first public release on February 1, 2010. This release marked the transformation of scikit-learn from a small community project into a professionally maintained library. The team published the seminal paper "Scikit-learn: Machine Learning in Python" in the Journal of Machine Learning Research (JMLR) in 2011, which has since become one of the most cited papers in machine learning [3].
INRIA's backing was critical. The institute funded multiple full-time developers over the years: Fabian Pedregosa (2010-2012), Jaques Grobler (2012-2013), and Olivier Grisel (2013-2017), among others. This sustained institutional support gave the project the stability it needed during its formative years [4].
In 2019, scikit-learn received the Inria-French Academy of Sciences-Dassault Systèmes Innovation Prize, recognizing it as a landmark success story in free software for machine learning [5]. By this point the library had already become the default choice for classical machine learning in Python, used in industry, academia, and government settings worldwide.
Scikit-learn's popularity rests on several design decisions that set it apart from other machine learning libraries.
Every estimator in scikit-learn follows the same interface pattern. Supervised learning models expose fit(X, y) for training and predict(X) for inference. Unsupervised models use fit(X) and either predict(X) or transform(X). This consistency means that once a user learns how to train one model, switching to another requires minimal code changes. The uniform API also enables powerful abstractions like pipelines and grid search, which are discussed below [6].
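The uniformity can be sketched in a few lines: the same fit/score calls work unchanged for two very different models (the iris dataset and the particular hyperparameters below are illustrative choices, not library requirements):

```python
# Two unrelated classifiers driven through the identical estimator API.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)        # identical training call
    acc = model.score(X_test, y_test)  # identical evaluation call
    print(type(model).__name__, round(acc, 3))
```

Swapping in any other classifier requires changing only the constructor line, which is exactly what makes pipelines and grid search composable.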
Scikit-learn ships with reasonable default hyperparameters for every algorithm. A user can instantiate RandomForestClassifier() without specifying any parameters and still get a functional, competitive model. This lowers the barrier to entry for newcomers while allowing experienced practitioners to fine-tune every setting.
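As a quick illustration, every estimator exposes its (default) settings through get_params(); a bare RandomForestClassifier() already carries a full configuration:

```python
# An estimator instantiated with no arguments is fully configured.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()     # no arguments needed
params = clf.get_params()
print(params["n_estimators"])      # default forest size (100 in recent versions)
print(params["criterion"])         # default split criterion ("gini")
```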
The project is widely praised for its documentation, which includes a detailed User Guide, an API reference, and a gallery of worked examples. Each algorithm page explains the underlying mathematics, provides code snippets, and links to the original research papers. This commitment to documentation has made scikit-learn a de facto educational resource for machine learning.
Scikit-learn is designed to work seamlessly with the broader Python data science ecosystem. Input data is typically provided as NumPy arrays or pandas DataFrames. Outputs integrate smoothly with visualization libraries like matplotlib and seaborn. Recent versions have added support for pandas output via the set_output API, allowing transformed data to retain column names and DataFrame structure [7].
Scikit-learn provides implementations across the major categories of machine learning. The tables below summarize the primary algorithms available in the library, grouped by task.
**Classification**

| Algorithm | Module | Description |
|---|---|---|
| Logistic Regression | sklearn.linear_model | Linear model for binary and multiclass classification |
| Support Vector Machines (SVM) | sklearn.svm | Kernel-based classifier for linear and non-linear boundaries |
| Random Forest | sklearn.ensemble | Ensemble of decision trees using bagging |
| Gradient Boosting | sklearn.ensemble | Sequential ensemble that fits trees to residual errors |
| k-Nearest Neighbors (k-NN) | sklearn.neighbors | Instance-based learning using distance metrics |
| Naive Bayes | sklearn.naive_bayes | Probabilistic classifier based on Bayes' theorem |
| Decision Tree | sklearn.tree | Recursive partitioning of feature space |
| AdaBoost | sklearn.ensemble | Adaptive boosting with weighted weak learners |
| Multi-layer Perceptron | sklearn.neural_network | Simple feedforward neural network |
**Regression**

| Algorithm | Module | Description |
|---|---|---|
| Linear Regression | sklearn.linear_model | Ordinary least squares regression |
| Ridge Regression | sklearn.linear_model | L2-regularized linear regression |
| Lasso | sklearn.linear_model | L1-regularized linear regression (feature selection) |
| ElasticNet | sklearn.linear_model | Combined L1 and L2 regularization |
| SVR (Support Vector Regression) | sklearn.svm | SVM adapted for continuous output |
| Random Forest Regressor | sklearn.ensemble | Ensemble of decision trees for regression |
| Gradient Boosting Regressor | sklearn.ensemble | Sequential boosting for regression tasks |
| Bayesian Ridge | sklearn.linear_model | Bayesian linear regression with automatic regularization |
**Clustering**

| Algorithm | Module | Description |
|---|---|---|
| K-Means | sklearn.cluster | Partition-based clustering into k groups |
| DBSCAN | sklearn.cluster | Density-based spatial clustering |
| Agglomerative Clustering | sklearn.cluster | Hierarchical bottom-up clustering |
| Mean Shift | sklearn.cluster | Centroid-based clustering using kernel density |
| Spectral Clustering | sklearn.cluster | Graph-based clustering using eigenvalues |
| HDBSCAN | sklearn.cluster | Hierarchical density-based clustering (added in v1.3) |
| Birch | sklearn.cluster | Memory-efficient hierarchical clustering |
**Dimensionality Reduction**

| Algorithm | Module | Description |
|---|---|---|
| PCA (Principal Component Analysis) | sklearn.decomposition | Linear projection to maximize variance |
| t-SNE | sklearn.manifold | Non-linear embedding for visualization |
| Truncated SVD | sklearn.decomposition | SVD for sparse matrices (similar to LSA) |
| Non-negative Matrix Factorization | sklearn.decomposition | Parts-based decomposition for non-negative data |
| Independent Component Analysis | sklearn.decomposition | Signal separation assuming statistical independence |
| Isomap | sklearn.manifold | Non-linear dimensionality reduction via geodesic distances |
Before training a model, data often requires preprocessing. Scikit-learn provides a comprehensive toolkit for this stage of the workflow.
Scaling and Normalization includes StandardScaler (zero mean, unit variance), MinMaxScaler (scaling to a specified range), RobustScaler (using median and interquartile range for robustness to outliers), and Normalizer (scaling individual samples to unit norm).
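The difference between scalers is easiest to see on a toy column (the numbers here are illustrative):

```python
# StandardScaler vs MinMaxScaler on the same single-feature column.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
standardized = StandardScaler().fit_transform(X)  # zero mean, unit variance
minmaxed = MinMaxScaler().fit_transform(X)        # rescaled into [0, 1]
print(standardized.ravel())
print(minmaxed.ravel())
```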
Encoding Categorical Variables is handled by OneHotEncoder for nominal categories, OrdinalEncoder for ordinal categories, and LabelEncoder for target variable encoding.
Handling Missing Data is supported through SimpleImputer (mean, median, most frequent, or constant fill), KNNImputer (imputation using k-nearest neighbors), and IterativeImputer (multivariate imputation by chained equations).
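A minimal SimpleImputer sketch on toy data with NaN gaps:

```python
# Fill missing values with the per-column mean.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])
imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(imputed)  # NaNs replaced by the column means (4.0 and 3.0)
```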
Feature Selection methods include SelectKBest (univariate feature selection), RFE (recursive feature elimination), and SelectFromModel (selection based on model-derived feature importances).
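For instance, SelectKBest can shrink the feature matrix by a univariate score (iris and the ANOVA F-test here are illustrative choices):

```python
# Keep the 2 highest-scoring features out of iris's 4.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)
```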
Text Vectorization is provided by CountVectorizer (bag-of-words), TfidfVectorizer (TF-IDF weighted features), and HashingVectorizer (memory-efficient hashing trick).
Two of scikit-learn's most powerful features are Pipeline and GridSearchCV, which together enable reproducible, automated machine learning workflows.
A Pipeline chains multiple processing steps into a single estimator object. For example, a typical workflow might include scaling, dimensionality reduction, and classification. Rather than applying each step manually (and risking data leakage during cross-validation), a Pipeline ensures that transformations are applied correctly at each fold:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# X_train, y_train, X_test are assumed to be predefined arrays
pipe = Pipeline([
    ('scaler', StandardScaler()),   # standardize features
    ('pca', PCA(n_components=50)),  # project onto 50 components
    ('svm', SVC(kernel='rbf'))      # RBF-kernel classifier
])
pipe.fit(X_train, y_train)          # each step is fitted in sequence
predictions = pipe.predict(X_test)  # transforms applied, then predict
```
Pipelines prevent a subtle but critical problem: if scaling is done on the full dataset before splitting into training and test folds, information from the test set leaks into the training process. By wrapping all steps in a Pipeline, scikit-learn guarantees that each transformation is fitted only on training data during cross-validation [8].
GridSearchCV performs an exhaustive search over a specified parameter grid, evaluating each combination using cross-validation. Parameters for individual pipeline steps are addressed using a double-underscore naming convention:
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'pca__n_components': [20, 50, 100],  # addresses the 'pca' step
    'svm__C': [0.1, 1, 10],              # addresses the 'svm' step
    'svm__kernel': ['linear', 'rbf']
}
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```
For larger parameter spaces where exhaustive search is impractical, RandomizedSearchCV samples a fixed number of parameter combinations from specified distributions. Scikit-learn also supports HalvingGridSearchCV and HalvingRandomSearchCV, which use successive halving to speed up the search by progressively eliminating poor-performing configurations; both are still experimental and must be enabled by importing enable_halving_search_cv from sklearn.experimental [9].
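A minimal RandomizedSearchCV sketch; the SVC model, the iris dataset, and the search space are illustrative choices rather than anything prescribed above:

```python
# Sample 10 configurations from continuous and discrete distributions.
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_distributions = {
    "C": loguniform(1e-2, 1e2),    # continuous, log-uniform over 4 decades
    "kernel": ["linear", "rbf"],   # discrete choices sampled uniformly
}
search = RandomizedSearchCV(SVC(), param_distributions,
                            n_iter=10, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```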
Scikit-learn provides a rich set of tools for assessing model performance.
Cross-Validation functions include cross_val_score, cross_val_predict, and cross_validate (which returns training scores, fit times, and score times alongside test scores). Multiple splitting strategies are available: KFold, StratifiedKFold, LeaveOneOut, TimeSeriesSplit, and others.
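As a sketch, cross_val_score can be combined with an explicit splitter (iris and logistic regression here are illustrative choices):

```python
# 5-fold stratified cross-validation with a shuffled, seeded splitter.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean().round(3), "+/-", scores.std().round(3))
```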
Metrics cover a broad range of evaluation needs:
| Task | Metrics |
|---|---|
| Classification | Accuracy, precision, recall, F1-score, ROC-AUC, log loss, Matthews correlation coefficient |
| Regression | Mean squared error, mean absolute error, R-squared, explained variance |
| Clustering | Silhouette score, adjusted Rand index, adjusted mutual information |
| Ranking | NDCG, average precision |
Visualization tools include ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay, and LearningCurveDisplay, which generate publication-ready plots with a single method call.
Scikit-learn occupies a distinct niche in the machine learning ecosystem compared to deep learning frameworks such as PyTorch and TensorFlow.
| Feature | Scikit-learn | PyTorch / TensorFlow |
|---|---|---|
| Primary focus | Classical ML (tabular data, statistical models) | Deep learning (neural networks, GPU computation) |
| Learning curve | Low; beginner-friendly API | Steeper; requires understanding of tensors, autograd |
| Data types | Tabular, text (via vectorizers), small images | Images, audio, video, text, sequences |
| GPU support | Limited (via third-party extensions) | Native GPU/TPU acceleration |
| Model complexity | Shallow models, small ensembles | Arbitrarily deep architectures |
| Training data size | Small to medium datasets (fits in RAM) | Can handle massive datasets via batching |
| Deployment | Pickle/joblib serialization | TorchScript, TensorFlow Serving, ONNX |
| Use cases | Structured data, quick prototyping, baselines | Computer vision, NLP, generative models |
Scikit-learn is typically the right choice when working with structured, tabular data; when the dataset is small to medium-sized; when interpretability matters; or when a strong baseline is needed quickly. Deep learning frameworks become necessary for unstructured data (images, audio, long text), very large datasets, and problems that benefit from representation learning [10].
In practice, many workflows use both: scikit-learn for preprocessing and evaluation, and a deep learning framework for the model itself. Libraries like skorch (scikit-learn compatible PyTorch wrapper) bridge the two ecosystems, allowing PyTorch models to be used inside scikit-learn pipelines.
Scikit-learn is one of the most active open-source machine learning projects. As of early 2026, the repository on GitHub has accumulated over 65,000 stars, more than 26,000 forks, and contributions from over 3,000 individuals [1]. The project maintains a transparent governance model, with decisions made by a core team of maintainers through public discussion and voting.
The project follows a rigorous code review process, and new algorithms are added only after careful evaluation of their scientific merit, implementation quality, and long-term maintenance burden. This conservative approach has kept the library focused and reliable.
Sustaining a large open-source project requires financial support. In 2018, the scikit-learn Consortium was established within the Inria Foundation to provide a formal mechanism for corporate sponsorship. Member companies contribute funding that supports full-time developers, infrastructure costs, and community events [11].
Current and past consortium sponsors include:
| Sponsor | Type |
|---|---|
| INRIA | Founding institutional sponsor |
| Microsoft | Corporate sponsor (funds core developer Andreas Mueller since 2020) |
| NVIDIA | Corporate sponsor |
| BNP Paribas | Corporate sponsor |
| Chanel | Corporate sponsor |
| AXA | Corporate sponsor (founding consortium member) |
| BCG Gamma | Corporate sponsor (founding consortium member) |
| Intel | Corporate sponsor (founding consortium member) |
| Dataiku | Corporate sponsor (founding consortium member) |
In addition to the consortium, the project receives funding from the Chan Zuckerberg Initiative and the Wellcome Trust through the Essential Open Source Software for Science (EOSS) program, which supports developer time and diversity and inclusion initiatives [4].
Probabl, a company that grew out of the consortium ecosystem, now manages the sponsorship program and employs several full-time core maintainers, including Adrin Jalali, Arturo Amor, Guillaume Lemaitre, Olivier Grisel, and others [4].
Scikit-learn continues to see regular releases with meaningful improvements.
Version 1.6.0 introduced experimental support for free-threaded CPython (the "no-GIL" Python initiative), enabling better parallelism on multi-core machines. This release supported Python 3.9 through 3.13 [12].
Version 1.7 expanded free-threaded CPython support and added Python 3.14 compatibility (in the 1.7.2 patch). It supported Python 3.10 through 3.13, with 3.14 support arriving in a point release [12].
The latest stable release, version 1.8.0, was published on December 10, 2025. It supports Python 3.11 through 3.14 and includes full support for free-threaded CPython. This release also brought performance improvements, better model interpretability features, and closer integration points with the broader ecosystem [12].
Active development continues on the main branch, with work on enhanced ensemble methods, improved metadata routing, and further performance optimizations. The project has shown interest in bridging classical and deep learning workflows, though it remains firmly focused on its core mission of providing reliable, well-documented implementations of established machine learning algorithms.
Scikit-learn's consistent API has spawned a rich ecosystem of compatible libraries:
| Library | Purpose |
|---|---|
| imbalanced-learn | Techniques for handling imbalanced datasets |
| scikit-optimize (skopt) | Bayesian optimization for hyperparameter tuning |
| scikit-image | Image processing toolkit from the same SciPy "scikit" family |
| category_encoders | Additional categorical encoding strategies |
| skorch | Scikit-learn compatible PyTorch neural network wrapper |
| HDBSCAN | Hierarchical density-based clustering (now integrated into scikit-learn) |
| XGBoost / LightGBM | Gradient boosting libraries with scikit-learn compatible APIs |
| scikit-learn-intelex | Intel extension for hardware-accelerated scikit-learn |
These libraries follow scikit-learn's estimator interface, meaning they can be dropped into pipelines and grid search workflows without modification.
Scikit-learn can be installed via pip or conda:
```shell
pip install scikit-learn
```

or

```shell
conda install scikit-learn
```
The library requires Python 3.11 or later (as of version 1.8) and depends on NumPy, SciPy, joblib (for parallel computation), and threadpoolctl. Optional dependencies include matplotlib (for visualization) and pandas (for DataFrame support).