See also: Machine learning terms, Python
Scikit-learn (also written as sklearn) is a free, open-source machine learning library for the Python programming language. It provides efficient implementations of a wide range of classical machine learning algorithms for classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing. Built on top of NumPy, SciPy, and Matplotlib, scikit-learn is designed to interoperate with the broader Python scientific computing ecosystem and has become one of the most widely used machine learning frameworks in the world.
Scikit-learn focuses on classical (non-deep-learning) algorithms and is best suited for small to medium-sized tabular datasets. Its hallmark is a consistent, minimal API where every estimator follows the same fit, predict, and transform pattern. This uniformity makes it an excellent entry point for beginners learning machine learning and a productive tool for experienced practitioners building production pipelines.
According to the 2022 Kaggle Machine Learning and Data Science Survey, scikit-learn was identified as the most widely used machine learning framework among respondents. The library has attracted over 1,400 contributors and serves as the foundation for many downstream tools and AutoML systems.
Scikit-learn began as a Google Summer of Code project in 2007, created by David Cournapeau under the original name scikits.learn (a third-party extension to SciPy). Later that year, Matthieu Brucher joined the project and began using it as part of his doctoral thesis work.
The project gained significant momentum in 2010 when researchers from the French National Institute for Research in Digital Science and Technology (INRIA) took leadership. Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, and Vincent Michel drove the first public release, version 0.1 beta, which was published on February 1, 2010. Since then, the project has followed an approximately three-month release cycle.
The following table summarizes major milestones in scikit-learn's history.
| Year | Milestone |
|---|---|
| 2007 | David Cournapeau starts scikits.learn as a Google Summer of Code project |
| 2010 | INRIA researchers take leadership; version 0.1 beta released (February 1) |
| 2011 | Pedregosa et al. publish the landmark JMLR paper describing the library [1] |
| 2013 | Buitinck et al. publish the API design paper at ECML PKDD Workshop [2] |
| 2019 | Project receives the Inria-French Academy of Sciences-Dassault Systemes Innovation Prize |
| 2021 | Version 1.0.0 released (September 24), the first stable release after over 2,100 merged pull requests |
| 2022 | Receives the French Open Science Award for Open Source Research Software |
| 2025 | Version 1.8 released with native Array API support for GPU computations via PyTorch and CuPy |
Scikit-learn is a fiscally sponsored project of NumFOCUS, a 501(c)(3) nonprofit. INRIA holds the copyright over contributions made by INRIA-employed developers. The company Probabl manages the sponsorship program and employs several full-time core maintainers, including Olivier Grisel, Guillaume Lemaitre, and Adrin Jalali. Additional funding comes from sponsors such as Chanel, BNP Paribas, NVIDIA, Microsoft, and the Chan Zuckerberg Initiative [3].
Scikit-learn's API is governed by a small set of design principles that make the library consistent and easy to learn.
Consistency. Every algorithm in the library exposes the same interface. Estimators are instantiated with hyperparameters passed to the constructor. Training uses the fit(X, y) method; prediction uses predict(X), and transformers use transform(X). This means switching from a random forest to a support vector machine requires changing only the class name.
Inspection. All hyperparameters set during construction are stored as public attributes. Learned parameters discovered during fitting are stored with a trailing underscore (for example, coef_ or feature_importances_). The get_params() and set_params() methods allow programmatic access to hyperparameters.
Non-proliferation of classes. Datasets are represented as NumPy arrays or SciPy sparse matrices rather than custom objects. Hyperparameters are standard Python types (strings, integers, floats). This reduces the number of concepts a user must learn.
Composition. Complex workflows are built by combining simple objects. Pipelines chain transformers and estimators. ColumnTransformer applies different preprocessing to different feature groups. Meta-estimators like GridSearchCV wrap any estimator to add hyperparameter search.
Sensible defaults. Every hyperparameter has a reasonable default value so that users can create a working model with minimal configuration.
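To illustrate the consistency and inspection principles above, the following sketch (assuming hypothetical arrays X_train, y_train, X_test, and y_test are already defined) swaps one classifier for another without changing any surrounding code, then inspects hyperparameters and learned attributes:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Any classifier can be dropped into the same three lines of code.
for Model in (LogisticRegression, SVC):
    clf = Model()                      # hyperparameters go to the constructor
    clf.fit(X_train, y_train)          # same training call for every estimator
    print(Model.__name__, clf.score(X_test, y_test))

# Hyperparameters are inspectable before fitting, learned parameters after.
clf = LogisticRegression(C=0.5)
print(clf.get_params()["C"])           # constructor hyperparameter: 0.5
clf.fit(X_train, y_train)
print(clf.coef_.shape)                 # learned coefficients, trailing underscore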
The estimator interface is the central abstraction in scikit-learn. Understanding the three core methods unlocks the entire library.
The fit method trains the estimator on data. The input X is a sample matrix with shape (n_samples, n_features) where rows are observations and columns are features. The target y contains labels for supervised tasks or is omitted for unsupervised tasks. After calling fit, the estimator stores learned parameters as attributes.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
The predict method applies the fitted model to new data and returns predictions. For classifiers, it returns class labels. For regressors, it returns continuous values. Classifiers may also expose predict_proba(X) to return probability estimates for each class.
y_pred = clf.predict(X_test)
probabilities = clf.predict_proba(X_test)
Transformers implement a transform method that takes input data and returns a modified version. Preprocessing steps such as scaling, encoding, and dimensionality reduction are transformers. The fit_transform(X) convenience method combines fitting and transforming in a single call.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
The following table summarizes the key methods and when each applies.
| Method | Purpose | Used by |
|---|---|---|
| fit(X, y) | Learn parameters from training data | All estimators |
| predict(X) | Generate predictions on new data | Classifiers, regressors |
| predict_proba(X) | Return class probability estimates | Classifiers (when supported) |
| transform(X) | Transform data using learned parameters | Transformers (scalers, encoders, PCA) |
| fit_transform(X) | Fit and transform in one step | Transformers |
| score(X, y) | Return default evaluation metric | Classifiers (accuracy), regressors (R-squared) |
| get_params() | Return estimator hyperparameters | All estimators |
| set_params() | Set estimator hyperparameters | All estimators |
Scikit-learn organizes its algorithms into subpackages grouped by task. The following sections describe the most important modules.
Classification algorithms assign input samples to discrete categories. Scikit-learn provides a broad selection of classifiers.
| Algorithm | Class | Typical Use Case |
|---|---|---|
| Logistic Regression | LogisticRegression | Binary and multiclass classification on linearly separable data |
| Support Vector Machine | SVC, LinearSVC | High-dimensional data, text classification |
| Random Forest | RandomForestClassifier | General-purpose classification with feature importance |
| Gradient Boosting | GradientBoostingClassifier, HistGradientBoostingClassifier | Tabular data competitions, high accuracy |
| K-Nearest Neighbors | KNeighborsClassifier | Simple baseline, small datasets |
| Naive Bayes | GaussianNB, MultinomialNB | Text classification, spam filtering |
| Decision Tree | DecisionTreeClassifier | Interpretable models, feature selection |
HistGradientBoostingClassifier, introduced in version 0.21, is inspired by LightGBM and uses histogram-based splitting for faster training on large datasets. It natively supports missing values and categorical features.
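As a brief sketch of the native missing-value handling, the snippet below fits a HistGradientBoostingClassifier on a tiny synthetic matrix containing NaN entries (the data is made up purely for illustration); no imputation step is required:

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# Tiny synthetic dataset with missing values (np.nan) left in place.
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [3.0, 5.0],
              [4.0, np.nan], [5.0, 6.0], [np.nan, 7.0], [6.0, 8.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Missing values are routed to whichever child of a split improves the loss,
# so no SimpleImputer is needed before fitting.
clf = HistGradientBoostingClassifier(max_iter=50, min_samples_leaf=2, random_state=0)
clf.fit(X, y)
print(clf.predict([[2.5, np.nan]]))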
Regression algorithms predict continuous target values.
| Algorithm | Class | Notes |
|---|---|---|
| Linear Regression | LinearRegression | Ordinary least squares |
| Ridge Regression | Ridge | L2 regularization to prevent overfitting |
| Lasso Regression | Lasso | L1 regularization for sparse solutions |
| Elastic Net | ElasticNet | Combines L1 and L2 penalties |
| Random Forest | RandomForestRegressor | Ensemble of decision trees |
| Gradient Boosting | GradientBoostingRegressor, HistGradientBoostingRegressor | High-accuracy regression on tabular data |
| Support Vector Regression | SVR | Kernel-based regression |
Clustering algorithms group unlabeled data points based on similarity.
| Algorithm | Class | Key Characteristics |
|---|---|---|
| K-Means | KMeans, MiniBatchKMeans | Requires number of clusters; fast on large datasets |
| DBSCAN | DBSCAN | Density-based; finds arbitrary-shaped clusters; no need to specify number of clusters |
| Agglomerative Clustering | AgglomerativeClustering | Hierarchical; produces a dendrogram |
| Mean Shift | MeanShift | Kernel-based; automatic number of clusters |
| OPTICS | OPTICS | Density-based generalization of DBSCAN |
| Spectral Clustering | SpectralClustering | Graph-based; handles non-convex clusters |
| HDBSCAN | HDBSCAN | Hierarchical density-based; robust to parameter choices (added in v1.3) |
Dimensionality reduction techniques project high-dimensional data into a lower-dimensional space while preserving important structure.
| Algorithm | Class | Notes |
|---|---|---|
| Principal Component Analysis | PCA | Linear projection; retains maximum variance |
| Truncated SVD | TruncatedSVD | Works on sparse matrices; used in latent semantic analysis |
| t-SNE | TSNE | Nonlinear; visualization of high-dimensional data in 2D or 3D |
| Non-negative Matrix Factorization | NMF | Parts-based representation; useful for topic modeling |
| Independent Component Analysis | FastICA | Separates mixed signals |
Scikit-learn provides tools for selecting the best model and evaluating its performance.
Cross-validation splits the data into multiple folds, repeatedly training on all but one fold and evaluating on the held-out fold to produce a reliable estimate of model performance. The cross_val_score and cross_validate functions automate this process. Stratified variants ensure class proportions are maintained in each fold.
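A minimal sketch of cross-validation, using the bundled iris dataset purely for illustration, might look like this:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation; a stratified splitter is used automatically for classifiers.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # averaged estimate of generalization performance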
Hyperparameter tuning is performed with search objects that wrap an estimator and explore parameter combinations during cross-validation. The two main tools are:
- GridSearchCV: Exhaustively evaluates every combination in a user-specified parameter grid. Best for small search spaces.
- RandomizedSearchCV: Samples a fixed number of parameter settings from specified distributions. More efficient for large search spaces.

Both search objects expose the same fit/predict interface as regular estimators, so they integrate seamlessly into pipelines.
Evaluation metrics in sklearn.metrics include accuracy, precision, recall, F1 score, ROC AUC, mean squared error, mean absolute error, and many others. Custom scoring functions can be created with make_scorer.
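As a short sketch (where y_test, y_pred, clf, X_train, and y_train are placeholder arrays and a classifier from the earlier examples), common metrics and a custom scorer could be used as follows:

from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.model_selection import cross_val_score

print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred, average="macro"))

# Wrap any metric as a scorer so it can be passed to cross-validation or search objects.
f1_scorer = make_scorer(f1_score, average="macro")
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring=f1_scorer)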
Preprocessing transforms raw features into a form suitable for machine learning.
| Transformer | Purpose |
|---|---|
| StandardScaler | Standardizes features to zero mean and unit variance |
| MinMaxScaler | Scales features to a specified range (default 0 to 1) |
| RobustScaler | Scales using statistics robust to outliers (median, interquartile range) |
| OneHotEncoder | Converts categorical variables to binary indicator columns |
| OrdinalEncoder | Converts categorical variables to integer codes |
| LabelEncoder | Encodes target labels as integers |
| TargetEncoder | Encodes categories using the target mean (added in v1.3) |
| SimpleImputer | Fills missing values with mean, median, most frequent, or constant |
| KNNImputer | Fills missing values using K-nearest neighbors |
| PolynomialFeatures | Generates polynomial and interaction features |
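As an illustrative sketch using a small made-up dataframe, an imputer and an encoder from the table above could be applied as follows:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"age": [25.0, np.nan, 40.0], "city": ["Paris", "Lyon", "Paris"]})

# Fill the missing age with the column median.
imputer = SimpleImputer(strategy="median")
age_filled = imputer.fit_transform(df[["age"]])

# Expand the categorical column into binary indicator columns.
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
city_encoded = encoder.fit_transform(df[["city"]])
print(age_filled.shape, city_encoded.shape)   # (3, 1) (3, 2)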
Feature selection methods reduce the number of input variables to those most relevant to the target.
- Univariate selection (SelectKBest, SelectPercentile): Ranks features by statistical tests such as chi-squared or ANOVA F-value.
- Model-based selection (SelectFromModel): Uses the feature importances from a fitted estimator (such as a random forest) to select features.
- Recursive feature elimination (RFE, RFECV): Iteratively removes the least important features and refits the model.

Pipelines are one of scikit-learn's most powerful features. A Pipeline chains multiple processing steps into a single estimator object that supports fit, predict, and transform. This simplifies code, prevents data leakage during cross-validation, and makes the full workflow reproducible.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression())
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
The make_pipeline convenience function automatically generates step names:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression())
ColumnTransformer applies different transformations to different subsets of columns. This is essential for real-world datasets that contain a mix of numerical and categorical features.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
preprocessor = ColumnTransformer([
('num', StandardScaler(), ['age', 'income']),
('cat', OneHotEncoder(), ['gender', 'city'])
])
full_pipe = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier())
])
When using GridSearchCV or RandomizedSearchCV with a pipeline, parameter names use the double-underscore notation to reference nested components. For example, classifier__n_estimators refers to the n_estimators parameter of the classifier step.
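For example, a sketch of searching over nested pipeline parameters (reusing the full_pipe object defined above, with hypothetical X_train and y_train arrays) could look like this:

from sklearn.model_selection import GridSearchCV

# Step name and parameter name are joined by a double underscore; nested steps
# such as the 'num' transformer inside the ColumnTransformer chain further.
param_grid = {
    "preprocessor__num__with_mean": [True, False],
    "classifier__n_estimators": [100, 300],
    "classifier__max_depth": [5, None],
}
search = GridSearchCV(full_pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)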
Selecting optimal hyperparameters is critical for model performance. Scikit-learn provides two automated search strategies.
GridSearchCV evaluates every combination in a parameter grid using cross-validation. It is thorough but can be computationally expensive.
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [100, 200, 500],
'max_depth': [5, 10, 20, None]
}
search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy'
)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)
RandomizedSearchCV samples a fixed number of parameter settings from specified distributions. This is more efficient when the search space is large.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
param_distributions = {
'n_estimators': randint(50, 500),
'max_depth': randint(3, 30),
'min_samples_split': uniform(0.01, 0.2)
}
search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_distributions,
n_iter=50,
cv=5,
scoring='accuracy',
random_state=42
)
search.fit(X_train, y_train)
Both search objects store the best estimator in search.best_estimator_, the best parameters in search.best_params_, and detailed results for every evaluated combination in search.cv_results_.
Scikit-learn is optimized for small to medium-sized datasets that fit in memory. Several design choices affect its performance characteristics.
Implementation languages. Performance-critical code is written in Cython, C, and C++. For example, the SVM implementation wraps the LIBSVM and LIBLINEAR libraries. Tree-based models use Cython for node splitting. These compiled backends keep training times competitive.
Parallelism. Many estimators accept an n_jobs parameter that enables parallel execution. Internally, scikit-learn uses the Joblib library with its default loky backend for multiprocessing. When n_jobs=-1, all available CPU cores are used. Cross-validation, grid search, and ensemble methods (random forests, bagging) all support parallel execution.
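As a small sketch (assuming arrays X and y are already loaded), parallelism is typically enabled just by setting n_jobs:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Train the forest's trees on all available CPU cores, and also run the
# five cross-validation folds in parallel worker processes.
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, n_jobs=-1)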
Memory management. When data exceeds 1 MB during multiprocessing, Joblib creates memory-mapped files that all worker processes can share, avoiding unnecessary data duplication.
Large dataset strategies. For datasets that do not fit in memory, scikit-learn offers partial fitting through the partial_fit method on select estimators (such as SGDClassifier, MiniBatchKMeans). For truly massive datasets, external tools such as Dask-ML can distribute scikit-learn workloads across a cluster.
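A minimal sketch of incremental learning, with a hypothetical iter_batches() generator standing in for an out-of-core data source, could look like this:

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss", random_state=0)
all_classes = np.array([0, 1])  # every class must be declared on the first call

# iter_batches() is a hypothetical generator yielding (X_batch, y_batch) chunks
# read from disk or a stream, so the full dataset never has to fit in memory.
for X_batch, y_batch in iter_batches():
    clf.partial_fit(X_batch, y_batch, classes=all_classes)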
GPU support. Starting with version 1.8, scikit-learn has begun adopting the Python Array API standard, which allows PyTorch tensors and CuPy arrays to be passed directly to supported estimators. This enables GPU-accelerated computation without changing user code. Early benchmarks show up to 10x speedups on GPU versus a single CPU core for supported estimators [4]. Support for free-threaded CPython (Python 3.14) is also available.
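As a heavily hedged sketch (assuming a recent scikit-learn with array API dispatch enabled, a CUDA-capable GPU with PyTorch installed, and that PCA is among the estimators supporting the Array API in the installed version), GPU execution might look like this:

import torch
from sklearn import config_context
from sklearn.decomposition import PCA

# Data lives on the GPU as a PyTorch tensor rather than a NumPy array.
X_gpu = torch.rand(100_000, 50, device="cuda")

with config_context(array_api_dispatch=True):
    pca = PCA(n_components=10, svd_solver="full")
    X_reduced = pca.fit_transform(X_gpu)   # stays a torch tensor on the GPU

print(type(X_reduced), X_reduced.device)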
| Scalability Strategy | Mechanism | When to Use |
|---|---|---|
| n_jobs parameter | Joblib multiprocessing | Multi-core machines, embarrassingly parallel tasks |
| partial_fit method | Incremental / online learning | Data too large for memory, streaming data |
| HistGradientBoosting* | Histogram-based splitting | Datasets with more than 10,000 samples |
| Array API (v1.8+) | GPU via PyTorch/CuPy tensors | GPU-accelerated linear algebra |
| Dask-ML integration | Distributed computing | Cluster-scale datasets |
Scikit-learn and deep learning frameworks such as TensorFlow and PyTorch serve different purposes. The following table highlights when each is most appropriate.
| Factor | Scikit-Learn | TensorFlow / PyTorch |
|---|---|---|
| Algorithm type | Classical ML (trees, SVMs, linear models, clustering) | Deep learning (neural networks, CNNs, RNNs, transformers) |
| Data type | Tabular / structured data | Images, text, audio, video, sequences |
| Dataset size | Small to medium (fits in memory) | Small to very large |
| GPU support | Experimental (Array API in v1.8) | Native, first-class |
| Learning curve | Gentle; minimal API | Steeper; requires understanding of layers, optimizers, loss functions |
| Preprocessing | Built-in scalers, encoders, imputers | Requires external libraries or custom code |
| Hyperparameter search | Built-in GridSearchCV, RandomizedSearchCV | Requires external tools (Optuna, Ray Tune) |
| Deployment | Pickle, ONNX, skops | TensorFlow Serving, TorchServe, ONNX |
| Best for | Rapid prototyping, baselines, tabular ML | Computer vision, NLP, generative models |
In practice, many machine learning workflows use both. Scikit-learn handles data preprocessing, feature engineering, and classical baselines, while deep learning frameworks are used when the task demands representation learning from unstructured data.
Scikit-learn's consistent API has inspired a large ecosystem of compatible libraries. These projects extend scikit-learn's functionality while following the same fit/predict/transform conventions.
| Project | Purpose |
|---|---|
| imbalanced-learn | Resampling techniques (SMOTE, ADASYN) for class-imbalanced datasets |
| category_encoders | Advanced categorical encoding (target encoding, binary encoding, hashing) |
| Feature-engine | Feature engineering transformers (imputation, encoding, outlier handling) |
| XGBoost | Optimized distributed gradient boosting with sklearn-compatible API |
| LightGBM | Fast gradient boosting with histogram-based splitting |
| skorch | PyTorch wrapper with sklearn-compatible interface |
| scikeras | Keras/TensorFlow wrapper for sklearn pipelines |
| auto-sklearn | Automated machine learning using sklearn estimators |
| TPOT | Genetic programming-based AutoML pipeline optimizer |
| mlxtend | Additional estimators, feature selection, and visualization utilities |
| Dask-ML | Distributed and parallel machine learning on large datasets |
| skrub | Encoders for dirty categorical data and dataframe integration |
| yellowbrick | Visual diagnostics for machine learning model selection |
| Intel Extension for Scikit-learn | Hardware-accelerated scikit-learn on Intel processors |
The scikit-learn-contrib organization on GitHub hosts community-maintained projects that follow strict compatibility guidelines with the main library.
Scikit-learn is used across industries and academia for a wide range of applications.
Imagine you have a giant box of LEGO bricks, and each brick is a different tool for teaching a computer to learn things. Some bricks help the computer sort things into groups, like separating red candies from blue candies. Other bricks help the computer guess numbers, like how tall you will be when you grow up. And some bricks help the computer find patterns in a big pile of data, even when nobody tells it what to look for.
Scikit-learn is that box of LEGO bricks. All the bricks snap together the same way (that is the consistent API). You pick a brick, show it some examples (that is "fit"), and then it can make guesses about new things it has never seen before (that is "predict"). You can also snap multiple bricks together in a line (that is a "pipeline") so the computer does many steps automatically. The best part is that the box is free for everyone to use, and thousands of people around the world keep adding new and better bricks to it.