See also: Machine learning terms, Python
Scikit-learn (also written as sklearn) is a free, open-source machine learning library for the Python programming language. It provides efficient implementations of a wide range of classical machine learning algorithms for classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing. Built on top of NumPy, SciPy, and Matplotlib, scikit-learn is designed to interoperate with the broader Python scientific computing ecosystem and has become one of the most widely used machine learning frameworks in the world.
Scikit-learn focuses on classical (non-deep-learning) algorithms and is best suited for small to medium-sized tabular datasets. Its hallmark is a consistent, minimal API where every estimator follows the same fit, predict, and transform pattern. This uniformity makes it an excellent entry point for beginners learning machine learning and a productive tool for experienced practitioners building production pipelines.
According to the 2022 Kaggle Machine Learning and Data Science Survey, scikit-learn was identified as the most widely used machine learning framework among respondents. The library has attracted over 1,400 contributors and serves as the foundation for many downstream tools and AutoML systems.
Scikit-learn began as a Google Summer of Code project in 2007, created by David Cournapeau under the original name scikits.learn (a third-party extension to SciPy). Later that year, Matthieu Brucher joined the project and began using it as part of his doctoral thesis work.
The project gained significant momentum in 2010 when researchers from the French National Institute for Research in Digital Science and Technology (INRIA) took leadership. Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, and Vincent Michel drove the first public release, version 0.1 beta, which was published on February 1, 2010. Since then, the project has followed an approximately three-month release cycle.
The following table summarizes major milestones in scikit-learn's history.
| Year | Milestone |
|---|---|
| 2007 | David Cournapeau starts scikits.learn as a Google Summer of Code project |
| 2010 | INRIA researchers take leadership; version 0.1 beta released (February 1) |
| 2011 | Pedregosa et al. publish the landmark JMLR paper describing the library [1] |
| 2013 | Buitinck et al. publish the API design paper at ECML PKDD Workshop [2] |
| 2019 | Project receives the Inria-French Academy of Sciences-Dassault Systemes Innovation Prize |
| 2021 | Version 1.0.0 released (September 24), the first stable release after over 2,100 merged pull requests |
| 2022 | Receives the French Open Science Award for Open Source Research Software |
| 2025 | Version 1.8 released with native Array API support for GPU computations via PyTorch and CuPy |
Scikit-learn is a fiscally sponsored project of NumFOCUS, a 501(c)(3) nonprofit. INRIA holds the copyright over contributions made by INRIA-employed developers. The company Probabl manages the sponsorship program and employs several full-time core maintainers, including Olivier Grisel, Guillaume Lemaitre, and Adrin Jalali. Additional funding comes from sponsors such as Chanel, BNP Paribas, NVIDIA, Microsoft, and the Chan Zuckerberg Initiative [3].
Scikit-learn's API is governed by a small set of design principles that make the library consistent and easy to learn.
Consistency. Every algorithm in the library exposes the same interface. Estimators are instantiated with hyperparameters passed to the constructor. Training uses the fit(X, y) method; prediction uses predict(X), and transformers use transform(X). This means switching from a random forest to a support vector machine requires changing only the class name.
Inspection. All hyperparameters set during construction are stored as public attributes. Learned parameters discovered during fitting are stored with a trailing underscore (for example, coef_ or feature_importances_). The get_params() and set_params() methods allow programmatic access to hyperparameters.
Non-proliferation of classes. Datasets are represented as NumPy arrays or SciPy sparse matrices rather than custom objects. Hyperparameters are standard Python types (strings, integers, floats). This reduces the number of concepts a user must learn.
Composition. Complex workflows are built by combining simple objects. Pipelines chain transformers and estimators. ColumnTransformer applies different preprocessing to different feature groups. Meta-estimators like GridSearchCV wrap any estimator to add hyperparameter search.
Sensible defaults. Every hyperparameter has a reasonable default value so that users can create a working model with minimal configuration.
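To illustrate the consistency and inspection principles above, the following sketch (assuming hypothetical arrays X_train, y_train, X_test, and y_test are already defined) swaps one classifier for another without changing any surrounding code, then inspects hyperparameters and learned attributes:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Any classifier can be dropped into the same three lines of code.
for Model in (LogisticRegression, SVC):
    clf = Model()                      # hyperparameters go to the constructor
    clf.fit(X_train, y_train)          # same training call for every estimator
    print(Model.__name__, clf.score(X_test, y_test))

# Hyperparameters are inspectable before fitting, learned parameters after.
clf = LogisticRegression(C=0.5)
print(clf.get_params()["C"])           # constructor hyperparameter: 0.5
clf.fit(X_train, y_train)
print(clf.coef_.shape)                 # learned coefficients, trailing underscore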
The estimator interface is the central abstraction in scikit-learn. Understanding the three core methods unlocks the entire library.
The fit method trains the estimator on data. The input X is a sample matrix with shape (n_samples, n_features) where rows are observations and columns are features. The target y contains labels for supervised tasks or is omitted for unsupervised tasks. After calling fit, the estimator stores learned parameters as attributes.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
The predict method applies the fitted model to new data and returns predictions. For classifiers, it returns class labels. For regressors, it returns continuous values. Classifiers may also expose predict_proba(X) to return probability estimates for each class.
y_pred = clf.predict(X_test)
probabilities = clf.predict_proba(X_test)
Transformers implement a transform method that takes input data and returns a modified version. Preprocessing steps such as scaling, encoding, and dimensionality reduction are transformers. The fit_transform(X) convenience method combines fitting and transforming in a single call.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
The following table summarizes the key methods and when each applies.
| Method | Purpose | Used by |
|---|---|---|
| fit(X, y) | Learn parameters from training data | All estimators |
| predict(X) | Generate predictions on new data | Classifiers, regressors |
| predict_proba(X) | Return class probability estimates | Classifiers (when supported) |
| transform(X) | Transform data using learned parameters | Transformers (scalers, encoders, PCA) |
| fit_transform(X) | Fit and transform in one step | Transformers |
| score(X, y) | Return default evaluation metric | Classifiers (accuracy), regressors (R-squared) |
| get_params() | Return estimator hyperparameters | All estimators |
| set_params() | Set estimator hyperparameters | All estimators |
Scikit-learn organizes its algorithms into subpackages grouped by task. The following sections describe the most important modules.
Classification algorithms assign input samples to discrete categories. Scikit-learn provides a broad selection of classifiers.
| Algorithm | Class | Typical Use Case |
|---|---|---|
| Logistic Regression | LogisticRegression | Binary and multiclass classification on linearly separable data |
| Support Vector Machine | SVC, LinearSVC | High-dimensional data, text classification |
| Random Forest | RandomForestClassifier | General-purpose classification with feature importance |
| Gradient Boosting | GradientBoostingClassifier, HistGradientBoostingClassifier | Tabular data competitions, high accuracy |
| K-Nearest Neighbors | KNeighborsClassifier | Simple baseline, small datasets |
| Naive Bayes | GaussianNB, MultinomialNB | Text classification, spam filtering |
| Decision Tree | DecisionTreeClassifier | Interpretable models, feature selection |
HistGradientBoostingClassifier, introduced in version 0.21, is inspired by LightGBM and uses histogram-based splitting for faster training on large datasets. It natively supports missing values and categorical features.
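As a brief sketch of the native missing-value handling, the snippet below fits a HistGradientBoostingClassifier on a tiny synthetic matrix containing NaN entries (the data is made up purely for illustration); no imputation step is required:

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# Tiny synthetic dataset with missing values (np.nan) left in place.
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [3.0, 5.0],
              [4.0, np.nan], [5.0, 6.0], [np.nan, 7.0], [6.0, 8.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Missing values are routed to whichever child of a split improves the loss,
# so no SimpleImputer is needed before fitting.
clf = HistGradientBoostingClassifier(max_iter=50, min_samples_leaf=2, random_state=0)
clf.fit(X, y)
print(clf.predict([[2.5, np.nan]]))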
Regression algorithms predict continuous target values.
| Algorithm | Class | Notes |
|---|---|---|
| Linear Regression | LinearRegression | Ordinary least squares |
| Ridge Regression | Ridge | L2 regularization to prevent overfitting |
| Lasso Regression | Lasso | L1 regularization for sparse solutions |
| Elastic Net | ElasticNet | Combines L1 and L2 penalties |
| Random Forest | RandomForestRegressor | Ensemble of decision trees |
| Gradient Boosting | GradientBoostingRegressor, HistGradientBoostingRegressor | High-accuracy regression on tabular data |
| Support Vector Regression | SVR | Kernel-based regression |
Clustering algorithms group unlabeled data points based on similarity.
| Algorithm | Class | Key Characteristics |
|---|---|---|
| K-Means | KMeans, MiniBatchKMeans | Requires number of clusters; fast on large datasets |
| DBSCAN | DBSCAN | Density-based; finds arbitrary-shaped clusters; no need to specify number of clusters |
| Agglomerative Clustering | AgglomerativeClustering | Hierarchical; produces a dendrogram |
| Mean Shift | MeanShift | Kernel-based; automatic number of clusters |
| OPTICS | OPTICS | Density-based generalization of DBSCAN |
| Spectral Clustering | SpectralClustering | Graph-based; handles non-convex clusters |
| HDBSCAN | HDBSCAN | Hierarchical density-based; robust to parameter choices (added in v1.3) |
Dimensionality reduction techniques project high-dimensional data into a lower-dimensional space while preserving important structure.
| Algorithm | Class | Notes |
|---|---|---|
| Principal Component Analysis | PCA | Linear projection; retains maximum variance |
| Truncated SVD | TruncatedSVD | Works on sparse matrices; used in latent semantic analysis |
| t-SNE | TSNE | Nonlinear; visualization of high-dimensional data in 2D or 3D |
| Non-negative Matrix Factorization | NMF | Parts-based representation; useful for topic modeling |
| Independent Component Analysis | FastICA | Separates mixed signals |
Scikit-learn provides tools for selecting the best model and evaluating its performance.
Cross-validation splits the data into multiple folds, repeatedly training on all but one fold and evaluating on the held-out fold to produce a reliable estimate of model performance. The cross_val_score and cross_validate functions automate this process. Stratified variants ensure class proportions are maintained in each fold.
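A minimal sketch of cross-validation, using the bundled iris dataset purely for illustration, might look like this:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation; a stratified splitter is used automatically for classifiers.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # averaged estimate of generalization performance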
Hyperparameter tuning is performed with search objects that wrap an estimator and explore parameter combinations during cross-validation. The two main tools are:
- GridSearchCV: Exhaustively evaluates every combination in a user-specified parameter grid. Best for small search spaces.
- RandomizedSearchCV: Samples a fixed number of parameter settings from specified distributions. More efficient for large search spaces.

Both search objects expose the same fit/predict interface as regular estimators, so they integrate seamlessly into pipelines.
Evaluation metrics in sklearn.metrics include accuracy, precision, recall, F1 score, ROC AUC, mean squared error, mean absolute error, and many others. Custom scoring functions can be created with make_scorer.
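As a short sketch (where y_test, y_pred, clf, X_train, and y_train are placeholder arrays and a classifier from the earlier examples), common metrics and a custom scorer could be used as follows:

from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.model_selection import cross_val_score

print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred, average="macro"))

# Wrap any metric as a scorer so it can be passed to cross-validation or search objects.
f1_scorer = make_scorer(f1_score, average="macro")
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring=f1_scorer)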
Preprocessing transforms raw features into a form suitable for machine learning.
| Transformer | Purpose |
|---|---|
| StandardScaler | Standardizes features to zero mean and unit variance |
| MinMaxScaler | Scales features to a specified range (default 0 to 1) |
| RobustScaler | Scales using statistics robust to outliers (median, interquartile range) |
| OneHotEncoder | Converts categorical variables to binary indicator columns |
| OrdinalEncoder | Converts categorical variables to integer codes |
| LabelEncoder | Encodes target labels as integers |
| TargetEncoder | Encodes categories using the target mean (added in v1.3) |
| SimpleImputer | Fills missing values with mean, median, most frequent, or constant |
| KNNImputer | Fills missing values using K-nearest neighbors |
| PolynomialFeatures | Generates polynomial and interaction features |
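As an illustrative sketch using a small made-up dataframe, an imputer and an encoder from the table above could be applied as follows:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"age": [25.0, np.nan, 40.0], "city": ["Paris", "Lyon", "Paris"]})

# Fill the missing age with the column median.
imputer = SimpleImputer(strategy="median")
age_filled = imputer.fit_transform(df[["age"]])

# Expand the categorical column into binary indicator columns.
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
city_encoded = encoder.fit_transform(df[["city"]])
print(age_filled.shape, city_encoded.shape)   # (3, 1) (3, 2)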
Feature selection methods reduce the number of input variables to those most relevant to the target.
- Univariate selection (SelectKBest, SelectPercentile): Ranks features by statistical tests such as chi-squared or ANOVA F-value.
- Model-based selection (SelectFromModel): Uses the feature importances from a fitted estimator (such as a random forest) to select features.
- Recursive feature elimination (RFE, RFECV): Iteratively removes the least important features and refits the model.

Pipelines are one of scikit-learn's most powerful features. A Pipeline chains multiple processing steps into a single estimator object that supports fit, predict, and transform. This simplifies code, prevents data leakage during cross-validation, and makes the full workflow reproducible.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression())
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
The make_pipeline convenience function automatically generates step names:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression())
ColumnTransformer applies different transformations to different subsets of columns. This is essential for real-world datasets that contain a mix of numerical and categorical features.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
preprocessor = ColumnTransformer([
('num', StandardScaler(), ['age', 'income']),
('cat', OneHotEncoder(), ['gender', 'city'])
])
full_pipe = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier())
])
When using GridSearchCV or RandomizedSearchCV with a pipeline, parameter names use the double-underscore notation to reference nested components. For example, classifier__n_estimators refers to the n_estimators parameter of the classifier step.
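For example, a sketch of searching over nested pipeline parameters (reusing the full_pipe object defined above, with hypothetical X_train and y_train arrays) could look like this:

from sklearn.model_selection import GridSearchCV

# Step name and parameter name are joined by a double underscore; nested steps
# such as the 'num' transformer inside the ColumnTransformer chain further.
param_grid = {
    "preprocessor__num__with_mean": [True, False],
    "classifier__n_estimators": [100, 300],
    "classifier__max_depth": [5, None],
}
search = GridSearchCV(full_pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)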
Selecting optimal hyperparameters is critical for model performance. Scikit-learn provides two automated search strategies.
GridSearchCV evaluates every combination in a parameter grid using cross-validation. It is thorough but can be computationally expensive.
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [100, 200, 500],
'max_depth': [5, 10, 20, None]
}
search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy'
)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)
RandomizedSearchCV samples a fixed number of parameter settings from specified distributions. This is more efficient when the search space is large.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
param_distributions = {
'n_estimators': randint(50, 500),
'max_depth': randint(3, 30),
'min_samples_split': uniform(0.01, 0.2)
}
search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_distributions,
n_iter=50,
cv=5,
scoring='accuracy',
random_state=42
)
search.fit(X_train, y_train)
Both search objects store the best estimator in search.best_estimator_, the best parameters in search.best_params_, and detailed results for every evaluated combination in search.cv_results_.
Scikit-learn is optimized for small to medium-sized datasets that fit in memory. Several design choices affect its performance characteristics.
Implementation languages. Performance-critical code is written in Cython, C, and C++. For example, the SVM implementation wraps the LIBSVM and LIBLINEAR libraries. Tree-based models use Cython for node splitting. These compiled backends keep training times competitive.
Parallelism. Many estimators accept an n_jobs parameter that enables parallel execution. Internally, scikit-learn uses the Joblib library with its default loky backend for multiprocessing. When n_jobs=-1, all available CPU cores are used. Cross-validation, grid search, and ensemble methods (random forests, bagging) all support parallel execution.
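As a small sketch (assuming arrays X and y are already loaded), parallelism is typically enabled just by setting n_jobs:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Train the forest's trees on all available CPU cores, and also run the
# five cross-validation folds in parallel worker processes.
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, n_jobs=-1)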
Memory management. When data exceeds 1 MB during multiprocessing, Joblib creates memory-mapped files that all worker processes can share, avoiding unnecessary data duplication.
Large dataset strategies. For datasets that do not fit in memory, scikit-learn offers partial fitting through the partial_fit method on select estimators (such as SGDClassifier, MiniBatchKMeans). For truly massive datasets, external tools such as Dask-ML can distribute scikit-learn workloads across a cluster.
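A minimal sketch of incremental learning, with a hypothetical iter_batches() generator standing in for an out-of-core data source, could look like this:

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss", random_state=0)
all_classes = np.array([0, 1])  # every class must be declared on the first call

# iter_batches() is a hypothetical generator yielding (X_batch, y_batch) chunks
# read from disk or a stream, so the full dataset never has to fit in memory.
for X_batch, y_batch in iter_batches():
    clf.partial_fit(X_batch, y_batch, classes=all_classes)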
GPU support. Starting with version 1.8, scikit-learn has begun adopting the Python Array API standard, which allows PyTorch tensors and CuPy arrays to be passed directly to supported estimators. This enables GPU-accelerated computation without changing user code. Early benchmarks show up to 10x speedups on GPU versus a single CPU core for supported estimators [4]. Support for free-threaded CPython (Python 3.14) is also available.
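As a heavily hedged sketch (assuming a recent scikit-learn with array API dispatch enabled, a CUDA-capable GPU with PyTorch installed, and that PCA is among the estimators supporting the Array API in the installed version), GPU execution might look like this:

import torch
from sklearn import config_context
from sklearn.decomposition import PCA

# Data lives on the GPU as a PyTorch tensor rather than a NumPy array.
X_gpu = torch.rand(100_000, 50, device="cuda")

with config_context(array_api_dispatch=True):
    pca = PCA(n_components=10, svd_solver="full")
    X_reduced = pca.fit_transform(X_gpu)   # stays a torch tensor on the GPU

print(type(X_reduced), X_reduced.device)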
| Scalability Strategy | Mechanism | When to Use |
|---|---|---|
| n_jobs parameter | Joblib multiprocessing | Multi-core machines, embarrassingly parallel tasks |
| partial_fit method | Incremental / online learning | Data too large for memory, streaming data |
| HistGradientBoosting* | Histogram-based splitting | Datasets with more than 10,000 samples |
| Array API (v1.8+) | GPU via PyTorch/CuPy tensors | GPU-accelerated linear algebra |
| Dask-ML integration | Distributed computing | Cluster-scale datasets |
Scikit-learn and deep learning frameworks such as TensorFlow and PyTorch serve different purposes. The following table highlights when each is most appropriate.
| Factor | Scikit-Learn | TensorFlow / PyTorch |
|---|---|---|
| Algorithm type | Classical ML (trees, SVMs, linear models, clustering) | Deep learning (neural networks, CNNs, RNNs, transformers) |
| Data type | Tabular / structured data | Images, text, audio, video, sequences |
| Dataset size | Small to medium (fits in memory) | Small to very large |
| GPU support | Experimental (Array API in v1.8) | Native, first-class |
| Learning curve | Gentle; minimal API | Steeper; requires understanding of layers, optimizers, loss functions |
| Preprocessing | Built-in scalers, encoders, imputers | Requires external libraries or custom code |
| Hyperparameter search | Built-in GridSearchCV, RandomizedSearchCV | Requires external tools (Optuna, Ray Tune) |
| Deployment | Pickle, ONNX, skops | TensorFlow Serving, TorchServe, ONNX |
| Best for | Rapid prototyping, baselines, tabular ML | Computer vision, NLP, generative models |
In practice, many machine learning workflows use both. Scikit-learn handles data preprocessing, feature engineering, and classical baselines, while deep learning frameworks are used when the task demands representation learning from unstructured data.
Scikit-learn's consistent API has inspired a large ecosystem of compatible libraries. These projects extend scikit-learn's functionality while following the same fit/predict/transform conventions.
| Project | Purpose |
|---|---|
| imbalanced-learn | Resampling techniques (SMOTE, ADASYN) for class-imbalanced datasets |
| category_encoders | Advanced categorical encoding (target encoding, binary encoding, hashing) |
| Feature-engine | Feature engineering transformers (imputation, encoding, outlier handling) |
| XGBoost | Optimized distributed gradient boosting with sklearn-compatible API |
| LightGBM | Fast gradient boosting with histogram-based splitting |
| skorch | PyTorch wrapper with sklearn-compatible interface |
| scikeras | Keras/TensorFlow wrapper for sklearn pipelines |
| auto-sklearn | Automated machine learning using sklearn estimators |
| TPOT | Genetic programming-based AutoML pipeline optimizer |
| mlxtend | Additional estimators, feature selection, and visualization utilities |
| Dask-ML | Distributed and parallel machine learning on large datasets |
| skrub | Encoders for dirty categorical data and dataframe integration |
| yellowbrick | Visual diagnostics for machine learning model selection |
| Intel Extension for Scikit-learn | Hardware-accelerated scikit-learn on Intel processors |
The scikit-learn-contrib organization on GitHub hosts community-maintained projects that follow strict compatibility guidelines with the main library.
Scikit-learn is used across industries and academia for a wide range of applications.
Imagine you have a giant box of LEGO bricks, and each brick is a different tool for teaching a computer to learn things. Some bricks help the computer sort things into groups, like separating red candies from blue candies. Other bricks help the computer guess numbers, like how tall you will be when you grow up. And some bricks help the computer find patterns in a big pile of data, even when nobody tells it what to look for.
Scikit-learn is that box of LEGO bricks. All the bricks snap together the same way (that is the consistent API). You pick a brick, show it some examples (that is "fit"), and then it can make guesses about new things it has never seen before (that is "predict"). You can also snap multiple bricks together in a line (that is a "pipeline") so the computer does many steps automatically. The best part is that the box is free for everyone to use, and thousands of people around the world keep adding new and better bricks to it.