Scikit-learn
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v7 · 6,546 words
Scikit-learn (also known as sklearn) is a free, open-source machine learning library for the Python programming language. It provides simple and efficient tools for data mining, data analysis, and predictive modeling, covering a broad range of classical machine learning tasks including classification, regression, clustering, and dimensionality reduction. Built on top of NumPy, SciPy, and matplotlib, and licensed under the permissive BSD 3-Clause license, scikit-learn has become one of the most widely used machine learning libraries in the world. As of December 2025 the GitHub repository has accumulated more than 66,000 stars and over 27,000 forks, with contributions from thousands of individuals across academic and industrial settings[1][2].
Scikit-learn is designed around a consistent, clean API that makes it straightforward to swap algorithms in and out of a workflow. Whether a practitioner needs to train a support vector machine, fit a random forest, or run k-means clustering, the interface remains uniform: fit(), predict(), and transform(). This design philosophy, formally described by the project's developers in 2013, has made scikit-learn a default starting point for students, researchers, and industry professionals working with tabular data[3]. By 2024 the package had been downloaded approximately 2.5 billion times from the Python Package Index, with deployments at companies including Spotify, Booking.com, BNP Paribas, and JP Morgan Chase[4].
Scikit-learn traces its origins to the 2007 Google Summer of Code, when David Cournapeau, a French computing researcher, created the project under the name scikits.learn. The name reflected its status as a "SciPy Toolkit" (or "scikit"), one of several add-on packages in the SciPy ecosystem. Cournapeau's initial goal was to build a collection of well-tested, accessible machine learning routines that could integrate smoothly with the existing Python scientific stack[1][2].
Later in 2007, Matthieu Brucher continued development on the project as part of his doctoral thesis. The project remained relatively small and intermittently maintained until 2010, when researchers at Inria, the French National Institute for Research in Digital Science and Technology, took a decisive role in its future. By that point the project had begun to attract sporadic contributions from across the SciPy community, but it lacked a sustained development effort or a formal release process[2][5].
In 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, and Vincent Michel, all affiliated with Inria, assumed leadership of the project and coordinated the first public release (version 0.1 beta) on February 1, 2010. This release marked the transformation of scikit-learn from a small community project into a professionally maintained library, and development moved to the PARIETAL team at Inria Saclay, near Paris[5][6].
In 2011 the team published the seminal paper "Scikit-learn: Machine Learning in Python" in the Journal of Machine Learning Research (JMLR), Volume 12, pages 2825 to 2830. Written by sixteen authors including Pedregosa, Varoquaux, Gramfort, Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake VanderPlas, Alexandre Passos, Cournapeau, Brucher, Matthieu Perrot, and Edouard Duchesnay, this paper has since become one of the most cited works in machine learning, with tens of thousands of citations by the mid-2020s[7].
A companion paper, "API design for machine learning software: experiences from the scikit-learn project," was presented in 2013 at the workshop on Languages for Data Mining and Machine Learning of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). Written by Lars Buitinck and fourteen co-authors, this paper formalized the estimator, predictor, transformer, and meta-estimator interfaces that distinguish scikit-learn from earlier Python machine learning libraries[3].
Inria's institutional backing was critical to scikit-learn's growth during the 2010s. The institute funded multiple full-time developers across the decade, including Fabian Pedregosa from 2010 to 2012, Jaques Grobler from 2012 to 2013, and Olivier Grisel beginning in 2013, among others. This sustained institutional support gave the project the stability it needed during its formative years, when most equivalent libraries depended on volunteer labor without dedicated maintainers[6].
By 2018 the user base had grown sufficiently that recurrent funding from corporate users became feasible. In late 2018 the Scikit-learn Consortium was formally established within the Fondation Inria to channel corporate sponsorship into developer salaries, infrastructure costs, and community events. Founding consortium members included AXA, BNP Paribas Cardif, BCG Gamma, Dataiku, Intel, Microsoft, and Nvidia[6].
In 2019, scikit-learn received the Inria-French Academy of Sciences-Dassault Systemes Innovation Prize. The award was presented to the five core Inria researchers running the project: Gael Varoquaux, Olivier Grisel, Loic Esteve, Alexandre Gramfort, and Bertrand Thirion. The prize citation noted that, as of 2018, scikit-learn had attracted 1,400 contributors globally, drawn 42 million site visits per year, and become the third most used free software for machine learning in the world. The prize recognized the project as a flagship example of how publicly funded research software can achieve broad industrial adoption[8].
After more than a decade of development across the 0.x series, the project released its first stable version, 1.0.0, on September 24, 2021. The release represented over 2,100 merged pull requests, approximately 800 of which were focused on documentation. The transition to a 1.x version number was a recognition by the maintainers that the API had been stable for years, rather than the addition of any single landmark feature. As the release announcement noted, the project had been production-ready for a long time, and 1.0 was a signal of that stability rather than a fresh start[9].
In 2022 the French government, via Inria, was tasked with ensuring scikit-learn's continued development as part of France's national artificial intelligence strategy. In 2023 a new mission-driven company, Probabl, was founded with Yann Lechelle (a former Scaleway chief executive) as cofounder and chief executive, alongside twelve cofounders drawn primarily from the Inria scikit-learn team. The French state and Inria became shareholders in Probabl, with the explicit mandate that the BSD-licensed open-source codebase would not be changed and that no proprietary fork would be created[4].
In 2025 Probabl announced a 13 million euro seed round, led by the French venture capital firm Serena and the hedge fund Capital Fund Management, with participation from Mozilla Ventures and the state-backed French Tech Souverainete program. Total funding to Probabl reached 18.5 million euros after the round. The company now operates the scikit-learn sponsorship program and employs the full-time core maintainers, including Adrin Jalali, Arturo Amor, Francois Goupil, Guillaume Lemaitre, Jeremie du Boisberranger, Loic Esteve, Olivier Grisel, and Stefanie Senger. Probabl has also released Skore, a commercial product offering enterprise support and tooling on top of scikit-learn[4][6].
Scikit-learn's broad adoption rests on a small number of explicit design decisions that the maintainers have defended for more than a decade.
Every estimator in scikit-learn follows the same interface pattern. The 2013 API design paper by Buitinck and colleagues codifies this around three primary object roles: estimators, which expose fit(X, y) for supervised learning or fit(X) for unsupervised; predictors, which expose predict(X); and transformers, which expose transform(X). Models that produce probabilities additionally expose predict_proba(X), and many estimators also implement score(X, y). Constructor arguments are restricted to hyperparameters, so an estimator instance can be cloned, persisted, and inspected without re-running training[3].
This consistency means that once a user learns how to train one model, switching to another usually requires changing only a single class name. The uniform API also enables higher-order abstractions, including pipelines, column transformers, cross-validation utilities, and grid search, which all rely on the duck-typed estimator interface rather than a deep inheritance hierarchy[3][10].
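The following is a minimal sketch of what the uniform interface buys in practice. The dataset (iris) and the three candidate models are illustrative choices, not taken from the project's documentation; the point is only that the loop body never changes when the class does.

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative data: any (X, y) pair with a classification target would do.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Three unrelated algorithms, all driven through the same fit/score calls.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "svc": SVC(),
    "forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, estimator in candidates.items():
    model = clone(estimator)          # unfitted copy with the same hyperparameters
    model.fit(X_train, y_train)       # identical call for every estimator
    scores[name] = model.score(X_test, y_test)  # mean accuracy for classifiers

print(scores)
```

Because `clone` only copies constructor arguments, it works on any estimator that follows the convention of keeping hyperparameters in `__init__`.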
Scikit-learn ships with reasonable default hyperparameters for every algorithm. A user can instantiate RandomForestClassifier() without specifying any parameters and still obtain a competitive model on most tabular benchmarks. This lowers the barrier to entry for newcomers while allowing experienced practitioners to tune every setting. The maintainers explicitly resist adding novelty algorithms or experimental variants that would expand the public surface without delivering predictable behavior across versions[11].
The project applies an unusually strict inclusion policy. According to the official FAQ, only well-established algorithms are considered for inclusion: a candidate algorithm must typically have been described in a peer-reviewed publication at least three years earlier, accumulated more than 200 citations, demonstrated wide use in practice, fit naturally into the existing API, and offer measurable advantages over existing implementations. Authors are not permitted to propose their own algorithms for inclusion, which deliberately prevents the use of scikit-learn as a marketing platform for new academic work. This conservatism keeps the library focused, predictable, and maintainable, and it explains why fast-moving subfields such as deep learning are deliberately excluded[11].
The implication of this policy is that scikit-learn explicitly chooses breadth and stability over novelty. The library positions itself as a reference implementation of classical machine learning, not as a research vehicle. Variants of algorithms with marginal published advantages are typically rejected, and the maintainers will sometimes wait several years after an algorithm becomes popular before adding it. HDBSCAN, for example, was a widely used external library for nearly a decade before it was finally incorporated as sklearn.cluster.HDBSCAN in version 1.3 in 2023[11].
The project is widely praised for its documentation, which includes a detailed User Guide, an API reference, and a gallery of more than 300 worked examples. Each algorithm page explains the underlying mathematics, provides runnable code snippets, and links to the original research papers. The 1.0 release notes show that documentation accounted for nearly 800 of the 2,100 merged pull requests in that release cycle. The User Guide also doubles as a free educational resource that is regularly assigned in university machine learning courses[9].
Scikit-learn is designed to work seamlessly with the broader Python data science ecosystem. Input data is typically provided as NumPy arrays, SciPy sparse matrices, pandas DataFrames, or (since 2023) Polars DataFrames. Outputs integrate with visualization libraries such as matplotlib and seaborn. Version 1.2 added a set_output(transform="pandas") API that allows transformed data to retain column names and DataFrame structure, simplifying integration with downstream tools and feature stores[11][12].
The library uses four duck-typed object roles. Any class with a fit(X, y=None) method is an estimator. An estimator that also implements predict(X) is a predictor. An estimator that implements transform(X) is a transformer; once fitted, it can be applied to new data using the parameters learned during fit. A meta-estimator wraps another estimator to extend its behavior; examples include Pipeline, GridSearchCV, OneVsRestClassifier, and CalibratedClassifierCV[3][10].
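To make the roles concrete, here is a minimal sketch of a user-defined transformer. `MeanCenterer` is a hypothetical class written for this example, not part of scikit-learn; it skips the input validation a production transformer would need.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that subtracts the per-column training mean."""

    def fit(self, X, y=None):
        # Learned state is stored in an attribute ending with an underscore.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # returning self lets calls be chained

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_


X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# TransformerMixin supplies fit_transform from fit and transform.
centered = MeanCenterer().fit_transform(X)

# Because it exposes fit/transform, the class can be used as a Pipeline step;
# the Pipeline itself is a meta-estimator that exposes fit/predict.
model = make_pipeline(MeanCenterer(), LinearRegression()).fit(X, y)
pred = model.predict(np.array([[5.0]]))
print(centered.ravel(), pred)
```

Nothing here relies on a deep inheritance hierarchy: the pipeline only needs the methods to exist, which is what "duck-typed" means in this context.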
Internally, every estimator records its trained parameters as attributes ending in an underscore (for example, coef_, intercept_, feature_importances_), distinguishing them from the user-supplied hyperparameters set in the constructor. This convention makes introspection of fitted models straightforward and supports persistence with joblib or pickle without bespoke serialization code[3].
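The convention can be seen in a short sketch. The data is synthetic (y = 2x + 1) and the temporary file path is chosen only for the example; joblib is assumed to be installed, as it is a scikit-learn dependency.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression(fit_intercept=True)

# Hyperparameters are available before fitting; learned attributes are not.
print(model.get_params()["fit_intercept"])
has_coef_before = hasattr(model, "coef_")

model.fit(X, y)
print(model.coef_, model.intercept_)  # trailing underscore marks learned state

# Persist and reload the fitted estimator without custom serialization code.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

restored_pred = restored.predict(np.array([[4.0]]))
print(restored_pred)
```

Pickle-based files should only be loaded from trusted sources and, as a general precaution, with the same scikit-learn version that wrote them.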
A Pipeline chains multiple processing steps into a single estimator object. For example, a typical workflow might include scaling, dimensionality reduction, and classification. Rather than applying each step manually (and risking data leakage during cross-validation), a Pipeline ensures that transformations are fitted only on training folds:
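```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# X_train, y_train, and X_test are assumed to be existing arrays with
# at least 50 samples and 50 features (PCA keeps 50 components here).
pipe = Pipeline([
    ('scaler', StandardScaler()),   # standardize each feature
    ('pca', PCA(n_components=50)),  # project onto 50 principal components
    ('svm', SVC(kernel='rbf'))      # classify with an RBF-kernel SVM
])
pipe.fit(X_train, y_train)          # every step is fitted on training data only
predictions = pipe.predict(X_test)  # fitted transforms are reapplied before predicting
```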
Pipelines prevent a subtle but common bug: if scaling or imputation is fitted on the full dataset before splitting into training and test folds, information from the test set leaks into the training process. By wrapping all steps in a Pipeline, scikit-learn guarantees that each transformation is fitted only on training data during cross-validation[10][13].
A related object, ColumnTransformer, applies different transformers to different subsets of columns. This is essential for working with mixed tabular data, where numeric columns may need scaling, categorical columns need one-hot encoding, and text columns need TF-IDF vectorization[14].
Scikit-learn 1.4, released in early 2024, introduced a metadata routing API that resolves a long-standing limitation: how to pass per-sample metadata such as sample_weight or groups through complex meta-estimator chains. The mechanism, codified in SLEP006, divides estimators into routers (meta-estimators such as Pipeline or GridSearchCV) and consumers (estimators that actually use the metadata in their fit or score methods). Each consumer requests the metadata it needs via methods such as set_fit_request(sample_weight=True), and routers forward only requested metadata to each step[14][15].
Scikit-learn provides implementations across the major categories of classical machine learning. The tables below summarize the principal algorithms available in the library as of version 1.8.

Classification:

| Algorithm | Module | Description |
|---|---|---|
| Logistic regression | sklearn.linear_model | Linear model for binary and multiclass classification |
| Support vector machines | sklearn.svm | Kernel-based classifier wrapping LIBSVM and LIBLINEAR |
| Random forest | sklearn.ensemble | Ensemble of decision trees using bagging |
| Histogram gradient boosting | sklearn.ensemble | Fast LightGBM-inspired gradient boosting with native categorical support |
| k-nearest neighbors | sklearn.neighbors | Instance-based learning using distance metrics |
| Naive Bayes | sklearn.naive_bayes | Probabilistic classifier based on Bayes' theorem |
| Decision tree | sklearn.tree | Recursive partitioning of feature space |
| AdaBoost | sklearn.ensemble | Adaptive boosting with weighted weak learners |
| Multi-layer perceptron | sklearn.neural_network | Simple feedforward neural network |
| Quadratic discriminant analysis | sklearn.discriminant_analysis | Class-conditional Gaussian with per-class covariance |

Regression:

| Algorithm | Module | Description |
|---|---|---|
| Linear regression | sklearn.linear_model | Ordinary least squares regression |
| Ridge regression | sklearn.linear_model | L2-regularized linear regression |
| Lasso | sklearn.linear_model | L1-regularized linear regression for feature selection |
| ElasticNet | sklearn.linear_model | Combined L1 and L2 regularization |
| SVR (Support vector regression) | sklearn.svm | SVM adapted for continuous output |
| Random forest regressor | sklearn.ensemble | Ensemble of decision trees for regression |
| Histogram gradient boosting regressor | sklearn.ensemble | Histogram-binned gradient boosting for regression |
| Bayesian ridge | sklearn.linear_model | Bayesian linear regression with automatic regularization |
| Gaussian process regressor | sklearn.gaussian_process | Nonparametric Bayesian regression |

Clustering:

| Algorithm | Module | Description |
|---|---|---|
| K-means | sklearn.cluster | Partition-based clustering into k groups |
| DBSCAN | sklearn.cluster | Density-based spatial clustering |
| Agglomerative clustering | sklearn.cluster | Hierarchical bottom-up clustering |
| Mean shift | sklearn.cluster | Centroid-based clustering using kernel density |
| Spectral clustering | sklearn.cluster | Graph-based clustering using eigenvalues |
| HDBSCAN | sklearn.cluster | Hierarchical density-based clustering (added in v1.3) |
| Birch | sklearn.cluster | Memory-efficient hierarchical clustering |
| Gaussian mixture | sklearn.mixture | Probabilistic clustering with mixture components |

Dimensionality reduction:

| Algorithm | Module | Description |
|---|---|---|
| PCA | sklearn.decomposition | Linear projection to maximize variance |
| t-SNE | sklearn.manifold | Non-linear embedding for visualization |
| Truncated SVD | sklearn.decomposition | SVD for sparse matrices, similar to latent semantic analysis |
| Non-negative matrix factorization | sklearn.decomposition | Parts-based decomposition for non-negative data |
| Independent component analysis | sklearn.decomposition | Signal separation assuming statistical independence |
| Isomap | sklearn.manifold | Non-linear dimensionality reduction via geodesic distances |
| Classical MDS | sklearn.manifold | Eigendecomposition-based multidimensional scaling, added in 1.8 |
Although scikit-learn is most often described as a Python library, the performance-critical code paths are written in Cython, C, and C++. By repository line count the project is roughly 93 percent Python, 5 percent Cython, and 1 percent C++ as of 2025. Support vector machines are implemented by a Cython wrapper around LIBSVM, while logistic regression and linear support vector machines wrap LIBLINEAR; both libraries are vendored in the scikit-learn source tree[1][16].
Beyond LIBSVM and LIBLINEAR, several other lower-level kernels are pre-compiled. Tree-based algorithms (DecisionTreeClassifier, RandomForestClassifier, ExtraTreesClassifier, and the histogram gradient boosting estimators) use hand-written Cython code with manual memory management for the inner splitting loop. k-nearest-neighbor queries rely on ball trees and kd-trees implemented in Cython. Distance computations and pairwise kernels use OpenMP-parallelized Cython inner loops that release the Python GIL during execution, which is one of the prerequisites for the free-threaded CPython work shipped in versions 1.6 through 1.8[11][27].
Linear algebra falls through to whichever BLAS and LAPACK implementations NumPy and SciPy are linked against; on most platforms this is OpenBLAS or the Apple Accelerate framework, and on Intel-managed installations it can be Intel MKL. This means that performance on dense numerical workloads is largely a property of the underlying BLAS library rather than scikit-learn itself, and the project benefits whenever the broader scientific Python stack adopts faster numerical kernels[16].
Scikit-learn provides a broad toolkit for the preprocessing steps that typically dominate real machine learning pipelines.
Scaling and normalization includes StandardScaler (zero mean, unit variance), MinMaxScaler (scaling to a specified range), RobustScaler (using median and interquartile range for robustness to outliers), and Normalizer (scaling individual samples to unit norm).
Encoding of categorical variables is handled by OneHotEncoder for nominal categories, OrdinalEncoder for ordinal categories, and LabelEncoder for the target variable.
Missing data is addressed by SimpleImputer (mean, median, most frequent, or constant fill), KNNImputer (k-nearest neighbor imputation), and IterativeImputer (multivariate imputation by chained equations). The histogram gradient boosting estimators have built-in support for missing values via a learned default direction at each split.
Feature selection methods include SelectKBest (univariate feature selection), RFE (recursive feature elimination), and SelectFromModel (selection based on model-derived feature importances).
Text vectorization is provided by CountVectorizer (bag-of-words), TfidfVectorizer (TF-IDF weighted features), and HashingVectorizer (memory-efficient feature hashing).
GridSearchCV performs an exhaustive search over a specified parameter grid, evaluating each combination using cross-validation. Parameters for individual pipeline steps are addressed using a double-underscore naming convention:
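```python
from sklearn.model_selection import GridSearchCV

# Keys follow the '<step name>__<parameter>' convention and refer to
# the steps of the `pipe` Pipeline defined earlier.
param_grid = {
    'pca__n_components': [20, 50, 100],
    'svm__C': [0.1, 1, 10],
    'svm__kernel': ['linear', 'rbf']
}
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)   # 3 * 3 * 2 = 18 candidates, 5 folds each
print(grid_search.best_params_)
```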
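For larger parameter spaces where exhaustive search is impractical, RandomizedSearchCV samples a fixed number of parameter combinations from specified distributions. Scikit-learn also provides HalvingGridSearchCV and HalvingRandomSearchCV, which apply successive halving: all candidates are first evaluated with a small budget (by default, a small number of training samples), and only the best-performing candidates advance to later rounds with progressively larger budgets. Both classes are marked experimental and must be enabled by importing enable_halving_search_cv from sklearn.experimental[17].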
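A minimal sketch of a randomized search over continuous distributions is shown below. The dataset, the two-step pipeline, and the log-uniform ranges are illustrative assumptions rather than recommended settings.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

# Distributions instead of fixed lists: each candidate draws C and gamma
# from a log-uniform range, addressed with the step__parameter convention.
param_distributions = {
    "svm__C": loguniform(1e-2, 1e2),
    "svm__gamma": loguniform(1e-3, 1e1),
}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=10,        # number of sampled candidates, independent of grid size
    cv=5,
    random_state=0,   # makes the sampled candidates reproducible
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

The cost is controlled directly by `n_iter`, which is the main practical difference from an exhaustive grid.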
Scikit-learn provides a comprehensive set of tools for assessing model performance.
Cross-validation functions include cross_val_score, cross_val_predict, and cross_validate (which returns training scores, fit times, and score times alongside test scores). Multiple splitting strategies are available, including KFold, StratifiedKFold, LeaveOneOut, LeaveOneGroupOut, GroupKFold, TimeSeriesSplit, and RepeatedStratifiedKFold[18].
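As a sketch of how these pieces combine, the example below runs `cross_validate` with an explicit `StratifiedKFold` splitter and two metrics. The dataset and model are illustrative choices; the scaler sits inside the pipeline so that it is refitted on each training fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling happens inside the pipeline, so no test-fold statistics leak in.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio roughly constant in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = cross_validate(
    model,
    X,
    y,
    cv=cv,
    scoring=["accuracy", "roc_auc"],
    return_train_score=True,
)

# One array per key, with one entry per fold.
print(sorted(results))
print(results["test_accuracy"].mean())
```

Comparing `train_accuracy` with `test_accuracy` across folds is a quick way to spot overfitting before any tuning is attempted.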
Metrics cover a broad range of evaluation needs:
| Task | Metrics |
|---|---|
| Classification | Accuracy, precision, recall, F1, ROC-AUC, log loss, Matthews correlation coefficient, Brier score |
| Regression | Mean squared error, mean absolute error, R-squared, explained variance, mean Poisson deviance |
| Clustering | Silhouette score, adjusted Rand index, adjusted mutual information, Davies-Bouldin |
| Ranking | NDCG, DCG, label ranking average precision, coverage error |
| Calibration | Brier score, reliability curves (via calibration_curve and CalibrationDisplay) |
Visualization tools include ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay, DetCurveDisplay, CalibrationDisplay, LearningCurveDisplay, and ValidationCurveDisplay, each generating publication-ready plots with a single method call.
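The metric functions share a common calling convention, true values first and predictions second, as the short sketch below illustrates. The label vectors are invented so that the results can be checked by hand.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    mean_squared_error,
    precision_score,
    recall_score,
)

# Hand-made binary labels: 4 of 6 predictions are correct.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)   # TP / (TP + FP)
rec = recall_score(y_true, y_pred)       # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall
cm = confusion_matrix(y_true, y_pred)    # rows: true class, columns: predicted class

# Regression metrics use the same (y_true, y_pred) argument order.
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])

print(acc, prec, rec, f1, mse)
print(cm)
```

The same functions are available by name through the `scoring` argument of the model selection utilities, for example `scoring="f1"`.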
Scikit-learn occupies a distinct niche in the machine learning ecosystem compared with deep learning frameworks and dedicated gradient boosting libraries.
| Feature | Scikit-learn | PyTorch / TensorFlow / Keras |
|---|---|---|
| Primary focus | Classical ML, tabular data, statistical models | Deep learning, neural networks, GPU computation |
| Learning curve | Low, beginner-friendly API | Steeper, requires understanding of tensors and autograd |
| Data types | Tabular, text via vectorizers, small images | Images, audio, video, text, sequences |
| GPU support | Limited (experimental Array API, third-party extensions) | Native GPU and TPU acceleration |
| Model complexity | Shallow models, small to medium ensembles | Arbitrarily deep architectures |
| Training data size | Small to medium datasets that fit in RAM | Massive datasets via batching |
| Deployment | Pickle, joblib, ONNX | TorchScript, TensorFlow Serving, ONNX |
| Use cases | Structured data, prototyping, strong baselines | Computer vision, NLP, generative models |
Scikit-learn is typically the right choice when working with structured, tabular data, when the dataset is small to medium-sized, when interpretability matters, or when a strong baseline is needed quickly. Deep learning frameworks become necessary for unstructured data (images, audio, long text), very large datasets, and problems that benefit from learned representations[19].
In practice many production workflows use both. Scikit-learn handles preprocessing, cross-validation, and metric reporting, while a deep learning framework trains the model itself. Libraries such as skorch (a scikit-learn compatible PyTorch wrapper) and KerasClassifier bridge the two ecosystems by exposing neural network models through the scikit-learn estimator interface, allowing them to be used inside pipelines and grid search[11].
Three external libraries dominate the gradient boosting niche on tabular data: XGBoost, LightGBM, and CatBoost. All three predate scikit-learn's own HistGradientBoostingClassifier, and all three implement the scikit-learn estimator API so they can be dropped into pipelines and grid search.
| Library | Notable strengths | Notable weaknesses |
|---|---|---|
| Scikit-learn HistGradientBoosting | Built into the core library, no extra dependency, native categorical support, fast | Marginally lower accuracy on some Kaggle-style benchmarks than tuned XGBoost or LightGBM |
| XGBoost | Mature, GPU support, win record on Kaggle | Heavier installation, more hyperparameters |
| LightGBM | Fastest training in many benchmarks, leaf-wise growth | Sensitive to small datasets and overfitting if untuned |
| CatBoost | Strong categorical feature handling without preprocessing | Slower than LightGBM on numeric-only data |
In practice scikit-learn's HistGradientBoostingClassifier and HistGradientBoostingRegressor, introduced in version 0.21 (2019) and inspired by LightGBM's histogram binning, narrow the gap considerably. The scikit-learn implementation is often faster than XGBoost on CPU for medium datasets and matches it within a few percent on accuracy, although XGBoost and LightGBM retain an edge when GPU acceleration is available[20][21].
Scikit-learn is one of the most widely deployed machine learning libraries in production. Companies that have publicly described their use of the library include Spotify (music recommendation systems), Booking.com (hotel and destination recommendation, fraud detection, customer service scheduling), JP Morgan Chase (classification and predictive analytics across the bank), BNP Paribas (banking and risk modeling), Evernote, AWeber, Yhat, Inria's neuroscience and brain-imaging teams, Mars Inc., Amazon, and Microsoft. The official testimonials page on scikit-learn.org collects engineer endorsements from many of these organizations[22].
By 2024 the package had been downloaded approximately 2.5 billion times from PyPI, with current download volumes of roughly 200 million per month. The user base spans research scientists in physics, neuroscience, genomics, and economics, as well as data scientists and machine learning engineers in technology, finance, retail, and government[4][23].
Beyond industry, scikit-learn is the default introductory machine learning library in most modern university curricula, where its consistent API and detailed documentation make it well suited for teaching. The Pedregosa 2011 paper is among the most cited works in machine learning, and the User Guide is frequently assigned as supplementary reading in undergraduate and graduate courses[7][9].
The library is also embedded in many specialized scientific packages. Astropy uses scikit-learn for source classification in astronomical surveys. The MNE-Python neuroscience toolkit uses it for decoding analyses of electroencephalography and magnetoencephalography signals. Nilearn, also developed at Inria, builds on scikit-learn for statistical analysis of brain imaging data. The success of these downstream projects reflects the durability of the estimator API: a domain-specific library can wrap or subclass scikit-learn objects without breaking compatibility across point releases[6][22].
In 2022 the French government formally mandated Inria to ensure scikit-learn's continued development as part of the national artificial intelligence strategy. This was the first time a European government explicitly committed to the long-term maintenance of an open-source machine learning library as a piece of strategic digital infrastructure. The mandate later became one of the motivations for the creation of Probabl in 2023[4].
Scikit-learn follows a formal governance model documented on its website. The model recognizes three roles: contributors (anyone who has contributed in any form, with no voting rights), core contributors (active contributors with full voting rights, recognized via a two-thirds majority of existing core contributors), and the Technical Committee (a small senior body responsible for strategic planning and final tiebreaking)[24].
As of 2026 the Technical Committee comprises Thomas J. Fan, Alexandre Gramfort, Olivier Grisel, Adrin Jalali, Andreas Mueller, Joel Nothman, and Gael Varoquaux. The core contributor group is organized into four working teams: a Maintainers Team responsible for code review and merging, a Documentation Team, a Contributor Experience Team, and a Communication Team[6][24].
Substantive API or governance changes are tracked using Scikit-learn Enhancement Proposals (SLEPs), modeled loosely on the Python Enhancement Proposal (PEP) process. SLEP000 describes the process itself. Notable subsequent SLEPs include SLEP006 (metadata routing, accepted and implemented in version 1.4), SLEP019 (recognizing contributions beyond code), and SLEP020 (simplifying governance changes). A SLEP is accepted only after a one-month voting period and a two-thirds supermajority of cast votes[15][24].
Since 2020, scikit-learn has been a fiscally sponsored project of NumFOCUS, a 501(c)(3) public charity based in Austin, Texas. NumFOCUS handles tax-deductible donations to the project and supports it as part of a portfolio of scientific Python projects that also includes NumPy, SciPy, pandas, Jupyter, and matplotlib[25].
The project receives recurring institutional and corporate support. As of late 2025 the active sponsors listed on the official "About us" page are:
| Tier | Sponsor |
|---|---|
| Founding sponsor | Inria |
| Gold | Chanel |
| Silver | BNP Paribas |
| Bronze | Nvidia (which also employs core maintainer Tim Head) |
In addition, Microsoft funds Andreas Mueller as a core developer (since 2020), Quansight Labs funds Lucy Liu (since 2022), and the Chan Zuckerberg Initiative and Wellcome Trust have supported the project through cycle 6 of the Essential Open Source Software for Science (EOSS) program. Past consortium sponsors that have rotated through the program include Microsoft, Boston Consulting Group, Fujitsu, AP-HP (Assistance Publique des Hopitaux de Paris), Hugging Face, Dataiku, BNP Paribas Cardif, and AXA. Anaconda, CircleCI, GitHub, and Microsoft Azure contribute in-kind storage and continuous integration resources[6].
Probabl now manages the sponsorship program and employs eight full-time core maintainers, providing the most consistent paid development capacity in the project's history[4][6].
Scikit-learn's design philosophy is explicit about what it will and will not do, and these constraints lead to several well-known limitations.
By default scikit-learn runs on CPU only. The maintainers have long argued, in the official FAQ, that adding mandatory GPU support would introduce heavy hardware-specific dependencies, fragment the installation experience, and require reimplementation of every algorithm in CUDA or another GPU language. Many of the library's most-used algorithms (notably the tree-based ensembles) are written in Cython and are fundamentally non-array operations that do not benefit from naive GPU porting[11].
Since 2023 the project has supported an experimental Array API path, allowing a growing list of estimators to accept PyTorch or CuPy arrays and execute on GPU when those backends are configured. Versions 1.7 and 1.8 substantially expanded the list of Array API compatible estimators and metrics, but coverage remains partial[11][26][27].
External libraries fill the gap for users who need GPU acceleration with a scikit-learn compatible API. NVIDIA cuML, part of the RAPIDS suite, provides a zero-code-change GPU acceleration layer that intercepts imports from scikit-learn, umap-learn, and hdbscan and routes them to GPU implementations, with reported speedups of up to 50 times on supported scikit-learn estimators[28]. The Intel-funded scikit-learn-intelex package, now maintained by the UXL Foundation, provides similar accelerations on Intel CPUs and GPUs.
Scikit-learn assumes that training data fits in the memory of a single machine. There is no native distributed-training support analogous to Spark MLlib or to PyTorch's DistributedDataParallel. For datasets larger than memory, users must either subsample, use the partial_fit method of incremental estimators (a subset of the library that includes SGDClassifier, SGDRegressor, MiniBatchKMeans, IncrementalPCA, and several naive Bayes classifiers), or move to a distributed framework. The Dask-ML project provides scikit-learn compatible distributed wrappers, but it is maintained separately from the core library[11][29].
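The incremental route can be illustrated with a short sketch. It assumes the data arrives in batches; here the batches are synthetic and produced by a hypothetical helper, `batch_stream`, which stands in for reading chunks from disk or a database. Only `SGDClassifier.partial_fit` and `score` are scikit-learn calls.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def batch_stream(n_batches=20, batch_size=500, seed=0):
    """Hypothetical data source: yields (X, y) batches of synthetic data.

    In a real out-of-core workflow this would read chunks from files
    or a database instead of generating random numbers.
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 5))
        # The label depends linearly on the first two features.
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y


clf = SGDClassifier(random_state=0)
# partial_fit needs the full set of class labels on the first call,
# because a single batch is not guaranteed to contain every class.
classes = np.array([0, 1])

for X_batch, y_batch in batch_stream():
    clf.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on a separate batch drawn with a different seed.
X_test, y_test = next(batch_stream(n_batches=1, batch_size=2000, seed=123))
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Only one batch is held in memory at a time, which is the point of the pattern. The trade-off is that the model sees each batch once, in order, so results can depend on batch ordering and on the learning-rate schedule.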
The FAQ states explicitly that deep learning and reinforcement learning are out of scope. The maintainers argue that these subfields require rich architecture-specification vocabularies, GPU-first runtime designs, and rapid iteration on algorithms, none of which fit the scikit-learn design constraints. The MLPClassifier and MLPRegressor in sklearn.neural_network are maintained only for bug fixes, not for new feature work, and users needing modern neural network capabilities are directed to PyTorch, TensorFlow, Keras, or JAX[11].
Sequence models (such as hidden Markov models or conditional random fields) and structured prediction models are similarly out of scope. The FAQ recommends pystruct for general structured learning and seqlearn for sequence-focused models, but neither is part of the official scikit-learn project[11].
Until relatively recently, scikit-learn assumed numeric input matrices and required users to preprocess categorical columns manually. The ColumnTransformer meta-estimator (introduced in 0.20) and, in the 1.x series, native pandas and Polars DataFrame inputs, the set_output API, and the categorical_features="from_dtype" option on histogram gradient boosting have softened this. Even so, R's data.frame-aware machine learning packages and specialist tabular deep-learning libraries still offer better ergonomics when categorical handling is paramount[11][14].
A related limitation is that scikit-learn does not provide a strong type system for column metadata. Column names are preserved through the set_output API and through ColumnTransformer, but feature semantics (such as units, allowable ranges, or downstream interpretation) are not part of the estimator interface. This is acceptable for ad hoc data science but can become problematic in regulated production environments such as banking and insurance, where additional schema management has to be added on top by the user[11].
The strict inclusion criteria (three-year publication minimum, 200-citation threshold, no self-proposals, demonstrated existing user demand) keep the library focused, but they also mean that fast-moving subfields rarely make it into scikit-learn. Modern techniques for handling tabular data, such as the TabPFN family or attention-based tabular models, are not included and are unlikely to be considered until they have a much longer track record[11].
Scikit-learn releases roughly every six months, with patch releases as needed for security and bug fixes. Major recent releases include:
| Version | Release date | Notable changes |
|---|---|---|
| 0.20 | September 2018 | ColumnTransformer, formal API stability commitments |
| 0.21 | May 2019 | HistGradientBoostingClassifier and HistGradientBoostingRegressor |
| 1.0 | September 24, 2021 | First 1.x release, keyword-only enforcement, 2,100 merged PRs |
| 1.3 | June 2023 | HDBSCAN added to core, TargetEncoder |
| 1.4 | January 2024 | Metadata routing (SLEP006), broader categorical support |
| 1.6 | December 2024 | Experimental free-threaded CPython wheels, Python 3.13 support |
| 1.7 | June 5, 2025 | Expanded Array API, MLP Poisson loss, ROC curves from CV results |
| 1.8 | December 10, 2025 | Free-threaded wheels for Python 3.14, classical MDS, temperature scaling, faster MAE-criterion trees |
Version 1.6.0 introduced experimental support for free-threaded CPython (the "no-GIL" Python initiative), enabling better parallelism on multi-core machines. The release also added Python 3.13 support[30].
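Whether the free-threaded wheels are relevant depends on the interpreter in use. The following sketch, which relies only on the Python standard library, checks whether the running interpreter is a free-threaded build and whether the GIL is currently active. The `sys._is_gil_enabled()` function exists only on Python 3.13 and later, so the code falls back to assuming the GIL is enabled on older versions.

```python
import sys
import sysconfig

# Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# sys._is_gil_enabled() is available from Python 3.13; on older
# interpreters the GIL is always enabled, so default to True.
gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()

print(f"free-threaded build: {free_threaded_build}")
print(f"GIL currently enabled: {gil_enabled}")
```

A free-threaded build can still run with the GIL enabled, for example when an extension module that does not declare free-threading support is imported, which is why the two values are reported separately.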
Version 1.7, released on June 5, 2025, expanded Array API support across additional metrics (including fbeta_score, precision_score, recall_score, jaccard_score, hamming_loss, and explained_variance_score), added a from_cv_results class method on RocCurveDisplay for plotting multiple ROC curves from cross-validation, allowed HistGradientBoosting models to take an explicit validation set, added Poisson loss and sample_weight support to the multi-layer perceptron, and added a parameter table to the HTML estimator visualization in Jupyter[26].
The current stable release, version 1.8.0, was published on December 10, 2025. It supports Python 3.11 through 3.14, including free-threaded wheels for all supported platforms on Python 3.14. Major improvements include[27][31]:
- […] n_jobs > 1.
- StandardScaler, RidgeCV, RidgeClassifier, RidgeClassifierCV, GaussianMixture, PolynomialFeatures, CalibratedClassifierCV, GaussianNB, and many metrics now accepting Array API arrays.
- […] Lasso and ElasticNet regularization paths.
- DecisionTreeRegressor with criterion="absolute_error" improves from O(n^2) to O(n log n) split complexity, enabling training on millions of rows.
- ClassicalMDS estimator and an arbitrary-metric MDS extension in sklearn.manifold.
- Temperature scaling in CalibratedClassifierCV.
- […] d2_brier_score and confusion_matrix_at_thresholds.
- QuadraticDiscriminantAnalysis gains solver, covariance_estimator, and shrinkage parameters.
- MaxAbsScaler gains a clip parameter, and SplineTransformer gains a handle_missing parameter.
- Deprecation of PassiveAggressiveClassifier and PassiveAggressiveRegressor in favor of SGDClassifier and SGDRegressor with learning_rate="pa1" or "pa2".

The 1.8 release notes credit more than 200 contributors[27].
Active development on the main branch continues toward version 1.9. Ongoing work focuses on broader Array API coverage, further metadata routing rollout to remaining estimators, performance improvements in core compilation paths, and tighter integration with Polars and Arrow. The maintainers have explicitly stated that, despite the pressure of the deep learning ecosystem, scikit-learn intends to remain focused on classical machine learning and to leave neural network research to dedicated frameworks[11][27].
Scikit-learn's consistent API has spawned a large ecosystem of compatible libraries:
| Library | Purpose |
|---|---|
| imbalanced-learn | Resampling and class-imbalance techniques |
| scikit-optimize (skopt) | Bayesian optimization for hyperparameter tuning |
| scikit-image | Image processing, sister project to scikit-learn |
| category_encoders | Additional categorical encoding strategies |
| skorch | Scikit-learn compatible PyTorch neural network wrapper |
| KerasClassifier / KerasRegressor | Scikit-learn compatible wrappers for Keras models |
| Dask-ML | Distributed scikit-learn workflows on Dask clusters |
| scikit-learn-intelex | UXL Foundation extension for hardware-accelerated scikit-learn on Intel hardware |
| cuML (RAPIDS) | GPU-accelerated scikit-learn compatible estimators by NVIDIA |
| XGBoost, LightGBM, CatBoost | Gradient boosting libraries that implement the estimator interface |
| optuna | Hyperparameter optimization with scikit-learn integrations |
| MLflow, Weights and Biases | Experiment tracking with scikit-learn integrations |
Most of these libraries either implement scikit-learn's estimator interface or integrate with it, so their estimators can typically be used inside Pipeline and GridSearchCV workflows with little or no adaptation, illustrating the durability of the 2013 API design[4][11].
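What "following the estimator interface" means in practice can be sketched with a toy third-party estimator. The class below, `NearestCentroidSketch`, is hypothetical and written only for this example (scikit-learn ships its own NearestCentroid); the `shrink` parameter is likewise an invented knob that gives the grid search something to tune. Because the class inherits from `BaseEstimator` and `ClassifierMixin` and implements `fit` and `predict`, it can be placed in a `Pipeline` and tuned with `GridSearchCV` like a built-in estimator.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class NearestCentroidSketch(ClassifierMixin, BaseEstimator):
    """Hypothetical toy classifier: predicts the class of the nearest centroid."""

    def __init__(self, shrink=0.0):
        # Constructor only stores hyperparameters, as the API convention requires.
        self.shrink = shrink

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        overall_mean = X.mean(axis=0)
        centroids = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        # Pull each class centroid toward the overall mean by `shrink`.
        self.centroids_ = (1 - self.shrink) * centroids + self.shrink * overall_mean
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Squared Euclidean distance from every sample to every centroid.
        dists = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[dists.argmin(axis=1)]


# Two well-separated synthetic Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-2.0, size=(100, 4)),
    rng.normal(loc=2.0, size=(100, 4)),
])
y = np.array([0] * 100 + [1] * 100)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", NearestCentroidSketch()),
])
search = GridSearchCV(pipeline, {"clf__shrink": [0.0, 0.5, 0.9]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The inherited `get_params` and `set_params` methods are what allow `GridSearchCV` to clone the pipeline and set `clf__shrink`, and `ClassifierMixin` supplies the default accuracy-based `score` method used for model selection.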
Scikit-learn can be installed via pip, conda, or from source:
pip install scikit-learn
or
conda install scikit-learn
As of version 1.8, the library requires Python 3.11 or later, NumPy 1.24.1 or later, SciPy 1.10.0 or later, and joblib 1.4.0 or later. Plotting functionality additionally requires matplotlib 3.6.1 or later. Optional dependencies include pandas (for DataFrame support), Polars (for the alternative DataFrame backend), and pytest (for running the test suite)[1][27].
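After installation, a quick smoke test can confirm that the library imports and can fit a model. The sketch below prints the installed version and trains a classifier on the bundled iris dataset, which requires no download. The choice of `LogisticRegression` and `max_iter=1000` is arbitrary and only for illustration.

```python
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Report which version was actually installed.
installed_version = sklearn.__version__
print(f"scikit-learn version: {installed_version}")

# Fit a small model on a dataset that ships with the library.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
train_accuracy = model.score(X, y)
print(f"training accuracy on iris: {train_accuracy:.3f}")
```

For a fuller report of the installed dependency versions, `sklearn.show_versions()` prints the Python, NumPy, SciPy, and joblib versions alongside build information.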