The Modified National Institute of Standards and Technology (MNIST) database is a large collection of handwritten digit images that has served as one of the most widely used benchmarks in machine learning and computer vision. Created by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges, the dataset was first made available in 1998 alongside work on convolutional neural networks (CNNs). MNIST contains 70,000 grayscale images of handwritten digits (0 through 9), split into 60,000 training examples and 10,000 test examples. Each image is 28 by 28 pixels.
The dataset's simplicity, small size, and ease of use have made it a standard first exercise for students learning deep learning and neural networks. However, because modern algorithms routinely achieve above 99.5% accuracy on MNIST, researchers have increasingly turned to more challenging alternatives such as Fashion-MNIST, EMNIST, and CIFAR-10. The foundational paper associated with the dataset, "Gradient-based learning applied to document recognition" (LeCun et al., 1998), has accumulated over 57,000 citations on Semantic Scholar, making it one of the most cited works in the history of artificial intelligence.[1]
Imagine you have a big box of flashcards, and each flashcard has a number written on it by hand, anywhere from 0 to 9. Some people write their numbers in neat, tidy ways, and others write them all wobbly and messy. There are 70,000 flashcards in the box. Scientists use these flashcards to teach computers how to read handwritten numbers. The computer looks at thousands of flashcards, learns what each number looks like, and then tries to guess the number on flashcards it has never seen before. MNIST is basically that box of flashcards, and it has been the most popular box for testing whether a computer is good at reading numbers.
In the late 1980s, the United States Census Bureau needed automated systems to read handwritten census forms. The Bureau partnered with the National Institute of Standards and Technology (NIST) to develop optical character recognition (OCR) tools, and NIST created several handwriting databases to support this effort.[2]
NIST originally designated SD-3 as a training set and SD-1 as a test set. However, researchers discovered a serious distribution mismatch: SD-3 was written entirely by Census Bureau employees, while SD-1 was written by high school students, whose handwriting was far less uniform. Models trained on SD-3 often saw their error rates jump from under 1% to around 10% when evaluated on SD-1, illustrating the distribution shift problem.[3]
The MNIST dataset was constructed before summer 1994 to address the distribution shift between NIST databases. LeCun and collaborators mixed samples from both SD-3 and SD-1 so that the training and test sets each contained digits from a diverse pool of writers.[3]
The training set was composed of 30,000 patterns from SD-3 and 30,000 patterns from SD-1, for a total of 60,000 images. The test set similarly combined 5,000 patterns from SD-3 and 5,000 patterns from SD-1, totaling 10,000 images. Writers were split so that the training and test sets did not share any writer; approximately 250 writers contributed to each split.
The original NIST binary images were 128x128 pixels. To create MNIST, each digit was processed through several steps:[3]

1. The binary image was size-normalized to fit a 20x20 pixel box while preserving its aspect ratio.
2. The anti-aliasing applied during size normalization introduced intermediate gray levels, converting the binary image to 8-bit grayscale.
3. The 20x20 digit was placed in a 28x28 frame, translated so that the center of mass of its pixels coincided with the center of the frame.

This preprocessing pipeline reduced spatial variance and ensured consistent formatting across all samples.
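The centering step can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not the original NIST/MNIST code: `center_of_mass` and `center_in_frame` are hypothetical helper names, and the real pipeline performed the 20x20 size normalization and anti-aliasing before this translation step.

```python
def center_of_mass(img):
    """Intensity-weighted centroid (row, col) of a 2D grayscale image."""
    total = sum(sum(row) for row in img)
    r = sum(i * sum(row) for i, row in enumerate(img)) / total
    c = sum(j * v for row in img for j, v in enumerate(row)) / total
    return r, c

def center_in_frame(img, size=28):
    """Shift img into a size x size frame so its centroid lands at the center."""
    r, c = center_of_mass(img)
    top = round(size / 2 - 0.5 - r)      # target centroid is (13.5, 13.5)
    left = round(size / 2 - 0.5 - c)
    out = [[0] * size for _ in range(size)]
    for i, row in enumerate(img):
        for j, v in enumerate(row):
            if 0 <= i + top < size and 0 <= j + left < size:
                out[i + top][j + left] = v
    return out
```

Running `center_in_frame` on a small blob placed in one corner returns a 28x28 grid with the blob moved to the middle, mirroring how each normalized digit is positioned by its center of mass rather than its bounding box.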
| Property | Value |
|---|---|
| Total images | 70,000 |
| Training set | 60,000 images |
| Test set | 10,000 images |
| Image dimensions | 28 x 28 pixels |
| Color space | Grayscale (8-bit, values 0-255) |
| Number of classes | 10 (digits 0-9) |
| Source databases | NIST SD-1 and SD-3 |
| Writers (training) | ~250 from SD-1 + ~250 from SD-3 |
| Writers (test) | ~250 from SD-1 + ~250 from SD-3 |
| File format | IDX (custom binary) |
| License | Creative Commons Attribution-Share Alike 3.0 |
The digit classes are roughly balanced, though not perfectly equal. Each pixel value ranges from 0 (white/background) to 255 (black/foreground).
MNIST data is stored in the IDX binary format, a simple format for vectors and multidimensional matrices. The dataset consists of four files:[4]
| File | Contents | Size |
|---|---|---|
| `train-images-idx3-ubyte.gz` | Training set images | ~9.9 MB |
| `train-labels-idx1-ubyte.gz` | Training set labels | ~29 KB |
| `t10k-images-idx3-ubyte.gz` | Test set images | ~1.6 MB |
| `t10k-labels-idx1-ubyte.gz` | Test set labels | ~5 KB |
Each IDX file begins with a magic number header. The first two bytes are always zero. The third byte encodes the data type (0x08 for unsigned byte). The fourth byte indicates the number of dimensions (3 for image files, 1 for label files). Following the header, dimension sizes are stored as 4-byte big-endian integers, and then the raw data follows in row-major order.
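As a sketch of the header layout described above, the following minimal parser reads an IDX byte buffer with Python's `struct` module. `parse_idx` is an illustrative helper, not part of any official tooling; note that the distributed files are gzip-compressed, so a real loader would first call `gzip.decompress` on the file contents.

```python
import struct

def parse_idx(data):
    """Split a raw IDX byte buffer into (dtype code, dimensions, payload)."""
    zero1, zero2, dtype, ndim = struct.unpack(">BBBB", data[:4])
    assert zero1 == 0 and zero2 == 0, "first two magic-number bytes must be zero"
    # Dimension sizes follow the header as 4-byte big-endian unsigned integers.
    dims = struct.unpack(f">{ndim}I", data[4:4 + 4 * ndim])
    payload = data[4 + 4 * ndim:]
    return dtype, dims, payload

# Example on a synthetic buffer shaped like an image file (2 images of 28x28):
header = struct.pack(">BBBB", 0, 0, 0x08, 3) + struct.pack(">3I", 2, 28, 28)
dtype, dims, payload = parse_idx(header + bytes(2 * 28 * 28))
```

On the real `train-images-idx3-ubyte` file, the same parser would report data type `0x08` (unsigned byte) and dimensions `(60000, 28, 28)`.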
MNIST has served as a proving ground for nearly every major classification algorithm developed since the late 1990s. The table below summarizes notable benchmark results reported on the official MNIST test set.[3][5]
| Classifier | Error rate (%) | Year | Notes |
|---|---|---|---|
| Linear classifier (1-layer NN) | 12.0 | 1998 | No preprocessing |
| K-nearest neighbors (L2) | 5.0 | 1998 | Baseline, no preprocessing |
| K-nearest neighbors (non-linear deformation) | 0.52 | 2007 | P2DHMDM deformation model |
| Support vector machine (SVM) | 0.56 | 2002 | Virtual SVM, degree-9 polynomial kernel |
| 2-layer NN, 300 hidden units | 4.7 | 1998 | Standard MLP |
| 2-layer NN, 1000 hidden units | 1.6 | 1998 | Larger MLP |
| LeNet-1 | 1.7 | 1998 | Early CNN architecture |
| LeNet-5 | 0.95 | 1998 | Classic CNN |
| LeNet-5 + boosting | 0.7 | 1998 | LeNet-5 with ensemble boosting |
| LIRA neural classifier | 0.42 | 2004 | Associative neural classifier |
| Deep NN + elastic distortions | 0.39 | 2003 | Data augmentation with elastic deformations |
| Committee of 35 CNNs | 0.23 | 2012 | Multi-column deep neural network (MCDNN) by Ciresan et al. |
| Dropout regularized NN | 0.21 | 2013 | With data augmentation |
| DropConnect ensemble (5 CNNs) | 0.21 | 2013 | Regularization variant |
| Batch-normalized maxout network | 0.24 | 2015 | With affine distortions |
| Ensemble of CNNs with SE-Net | 0.17 | 2018 | Squeeze-and-Excitation networks |
| Ensemble of CNNs + augmentation | 0.13 | 2020 | Rotation and translation augmentation |
| Single CNN (branching/merging) | 0.17 | 2021 | Advanced single-model architecture |
A simple linear classifier can reach about 88% accuracy (12% error) without any feature engineering, while modern deep learning models with ensembles and heavy data augmentation have pushed error rates below 0.2%. The human error rate on MNIST is estimated at around 0.2%.[5]
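To make the linear-baseline claim concrete, here is a minimal softmax-regression (multinomial logistic) classifier of the kind behind the 12%-error entry, trained with plain gradient descent on cross-entropy loss. It uses small synthetic data as a stand-in for MNIST so the sketch stays self-contained; the dimensions (64 features, 3 classes) are arbitrary, and the real setting would use 784-dimensional images and 10 classes.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Synthetic stand-in for MNIST: 300 samples, 64 features, 3 classes.
X = rng.normal(size=(300, 64))
y = rng.integers(0, 3, size=300)
X[np.arange(300), y] += 3.0               # give each class a distinguishing feature

W = np.zeros((64, 3))                     # one weight column per class
onehot = np.eye(3)[y]
for _ in range(200):                      # batch gradient descent on cross-entropy
    p = softmax(X @ W)
    W -= 0.1 * X.T @ (p - onehot) / len(X)

acc = (softmax(X @ W).argmax(axis=1) == y).mean()
```

The model is a single matrix multiply followed by softmax, with no hidden layers or feature engineering, which is why its capacity tops out well below modern CNNs on the real dataset.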
MNIST played a significant role in demonstrating the practical viability of neural networks during a period when the field was largely out of favor. In the late 1990s, support vector machines dominated supervised learning research, and many researchers considered neural networks impractical. LeCun's work on LeNet-5, trained and benchmarked on MNIST, showed that convolutional neural networks could match or outperform SVMs on real-world pattern recognition tasks.[6]
The 1998 paper "Gradient-based learning applied to document recognition" by LeCun, Bottou, Bengio, and Haffner introduced the LeNet-5 architecture and demonstrated end-to-end trainable systems for document recognition. LeNet was adopted commercially for reading handwritten checks at ATMs and recognizing zip codes for the United States Postal Service. These practical deployments helped sustain interest in neural networks during the "AI winter" of the early 2000s.[6]
MNIST's role as a shared benchmark allowed researchers to compare approaches objectively. When deep learning experienced a resurgence after 2012, MNIST remained a standard baseline test for new architectures, optimizers, and regularization techniques. Nearly every major deep learning framework (TensorFlow, PyTorch, Keras) includes MNIST in its introductory tutorials.
Although MNIST is primarily used as a research and educational benchmark, the underlying task of handwritten digit recognition has several real-world applications:

- Postal automation: reading handwritten zip codes to sort mail
- Bank check processing: reading handwritten amounts on checks
- Form digitization: extracting numeric fields from handwritten census forms, surveys, and tax documents
Despite its historical significance, MNIST has faced sustained criticism from the research community for several reasons:
Modern convolutional neural networks routinely achieve above 99.5% accuracy on MNIST. Even simple models, such as logistic regression or a small fully connected network, can reach 97% or higher. Because nearly all architectures perform well on MNIST, the dataset provides little discriminative power for comparing different approaches. Deep learning researcher Ian Goodfellow has argued that the community should move away from MNIST as a benchmarking tool, and François Chollet, the creator of Keras, has similarly argued that MNIST "cannot represent modern computer vision tasks."[7]
The handwriting samples come from a narrow demographic, primarily Census Bureau employees and American high school students. The writing styles are relatively uniform compared to the global diversity of handwriting. Models trained on MNIST may not generalize well to handwritten digits from other populations or written in different contexts.[7]
At 28x28 pixels in grayscale, MNIST images are tiny by modern standards. Real-world digit recognition often involves higher-resolution, color images with complex backgrounds, noise, and varying lighting conditions that MNIST does not capture.
Researchers have documented at least four incorrect labels in the MNIST dataset. While four errors out of 70,000 is a tiny fraction, this has prompted discussion about data quality standards in benchmark datasets.[5]
MNIST digits are pre-segmented, centered, size-normalized, and presented on a clean white background. Real-world digit recognition requires handling segmentation, variable-size inputs, cluttered backgrounds, and connected or overlapping characters. Success on MNIST does not necessarily translate to success in production OCR systems.
The limitations of MNIST have motivated the creation of numerous alternative datasets that follow the same 28x28 grayscale format but present more challenging classification tasks.
Fashion-MNIST, introduced by Zalando Research in 2017, is a drop-in replacement for MNIST. It contains 70,000 grayscale images (60,000 training, 10,000 test) of 10 categories of clothing and accessories: T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. Fashion-MNIST is significantly harder than MNIST; state-of-the-art models achieve roughly 96-97% accuracy compared to 99.7%+ on the original.[8]
The Extended MNIST (EMNIST) dataset, introduced by Cohen et al. in 2017, extends the original MNIST format to include handwritten letters in addition to digits. EMNIST is derived from NIST Special Database 19 and is available in six splits:[9]
| Split | Classes | Total samples | Description |
|---|---|---|---|
| ByClass | 62 | 814,255 | Full set, all digits and letters (unbalanced) |
| ByMerge | 47 | 814,255 | Merged upper/lower case (unbalanced) |
| Balanced | 47 | 131,600 | Equal samples per class |
| Digits | 10 | 280,000 | Digits only (balanced) |
| Letters | 26 | 145,600 | Letters only (merged case, balanced) |
| MNIST | 10 | 70,000 | Direct MNIST equivalent |
Kuzushiji-MNIST (KMNIST) contains 70,000 images of 10 classes of cursive Japanese (Kuzushiji) Hiragana characters, formatted identically to MNIST. KMNIST is considered more challenging than MNIST because multiple visually distinct characters can map to the same class label.[10]
QMNIST, introduced by Chhavi Yadav and Leon Bottou in 2019, is a reconstruction of the MNIST dataset that recovers the full 60,000-image test set originally selected from NIST databases but never distributed. QMNIST traces each digit back to its NIST source image and associated metadata (writer identifier, partition, etc.), enabling investigation of potential overfitting to the original 10,000-image test set over 25 years of repeated benchmarking.[11]
| Dataset | Description | Image format |
|---|---|---|
| notMNIST | Letters A-J rendered in various computer fonts | 28x28 grayscale |
| Kannada-MNIST | Digits in the Kannada script (South Indian language) | 28x28 grayscale |
| MedMNIST | 18 biomedical image classification datasets (12 in 2D, 6 in 3D) | 28x28 (2D) or 28x28x28 (3D) |
| AudioMNIST | 30,000 spoken digit recordings (0-9) from 60 speakers | Audio waveforms |
| 3D MNIST | Volumetric (voxel) representations of digits | 3D voxels |
| affNIST | MNIST digits with random affine transformations | 40x40 grayscale |
| MNIST-1D | A 1D analog of MNIST designed for rapid prototyping | 40-element vectors |
| SVHN | Street View House Numbers from Google Street View | 32x32 color |
| HASYv2 | 168,233 handwritten mathematical symbols across 369 classes | 32x32 grayscale |
The following table compares MNIST with other commonly used image classification benchmarks.
| Dataset | Images | Classes | Resolution | Color | Typical accuracy |
|---|---|---|---|---|---|
| MNIST | 70,000 | 10 | 28x28 | Grayscale | 99.7%+ |
| Fashion-MNIST | 70,000 | 10 | 28x28 | Grayscale | ~96-97% |
| EMNIST (Balanced) | 131,600 | 47 | 28x28 | Grayscale | ~91% |
| KMNIST | 70,000 | 10 | 28x28 | Grayscale | ~98% |
| CIFAR-10 | 60,000 | 10 | 32x32 | RGB | ~96-99% |
| CIFAR-100 | 60,000 | 100 | 32x32 | RGB | ~80-90% |
| SVHN | 630,000+ | 10 | 32x32 | RGB | ~98% |
| ImageNet | 14M+ | 1,000+ | Variable | RGB | ~90% (top-1) |
MNIST is built into most major machine learning frameworks and can be loaded with a single function call.
TensorFlow/Keras:

```python
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
```
PyTorch:

```python
from torchvision import datasets, transforms

train_dataset = datasets.MNIST(root='./data', train=True, download=True,
                               transform=transforms.ToTensor())
test_dataset = datasets.MNIST(root='./data', train=False, download=True,
                              transform=transforms.ToTensor())
```
scikit-learn:

```python
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)
```
The dataset is also available from Yann LeCun's website, Hugging Face Datasets, the UCI Machine Learning Repository, and OpenML.