CIFAR-10 is a labeled dataset of 60,000 small color images, organized into 10 mutually exclusive object categories with 6,000 images per class. The acronym stands for the Canadian Institute For Advanced Research, the funding agency that supported the work behind the dataset. Each image measures 32 by 32 pixels in three color channels. CIFAR-10 was assembled in 2009 by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton at the University of Toronto, and was released as a labeled subset of the much larger 80 Million Tiny Images collection from MIT and NYU [1]. The official split allocates 50,000 images for training and 10,000 images for testing, with the test set built so that each of the 10 classes contributes exactly 1,000 examples. Together with its sister dataset CIFAR-100, CIFAR-10 has been one of the most heavily used benchmarks in computer vision history, and it sits alongside MNIST and ImageNet as one of the canonical small-scale image classification problems that shaped the deep learning revolution.
The dataset gained prominence in the early 2010s as a tractable testbed for convolutional neural network (CNN) research. At only 32 by 32 pixels per image and roughly 170 megabytes in total, CIFAR-10 was small enough to fit in the memory of a single GPU yet large enough to require non-trivial generalization, which made it the preferred sandbox for prototyping new architectures, regularization techniques, and training tricks before scaling to larger datasets like ImageNet. Many ideas that later powered modern computer vision were first validated on CIFAR-10, including dropout variants, residual connections, dense connections, neural architecture search, and a long line of data augmentation methods. By the late 2010s, the test error had been driven so low that CIFAR-10 became a saturated benchmark, and the most informative comparisons shifted to throughput, robustness, and out-of-distribution behavior rather than raw accuracy.
The lineage of CIFAR-10 begins with the 80 Million Tiny Images dataset published by Antonio Torralba, Rob Fergus, and William Freeman in 2008 [2]. That collection was built by querying every non-abstract noun in the WordNet lexical database against several image search engines and downloading the top results, then resizing every result down to 32 by 32 pixels for storage and processing efficiency. The resulting corpus contained roughly 79 million images and was intended for nonparametric scene matching and large-scale visual learning experiments. However, because the labels were derived automatically from the search query that retrieved each image, the per-image label quality was poor, and many images were only loosely related to the noun used to fetch them.
Alex Krizhevsky, then a graduate student at the University of Toronto under Geoffrey Hinton, set out to extract a clean labeled subset from the Tiny Images collection that could serve as a benchmark for unsupervised feature learning and emerging deep architectures. Working with Vinod Nair, Krizhevsky paid student labelers to verify and relabel candidate images, enforcing strict criteria so that each image clearly depicted a single instance of its assigned class and was photographed from a recognizable viewpoint [1]. Two parallel datasets emerged from this effort: CIFAR-10, which collapses the natural and artifact world into 10 broad coarse categories, and CIFAR-100, which uses a finer-grained taxonomy of 100 categories grouped under 20 superclasses.
The 2009 technical report "Learning Multiple Layers of Features from Tiny Images" by Krizhevsky introduced both datasets and used them to evaluate restricted Boltzmann machines and deep belief networks [1]. Within months, CIFAR-10 had become the standard small-image benchmark for the rapidly growing community working on unsupervised pretraining, autoencoders, and convolutional architectures. Krizhevsky himself returned to the dataset in a 2010 manuscript that reported strong results from a convolutional deep belief network with locally connected and globally connected layers, an architecture that prefigured many ideas in his more famous 2012 ImageNet work with Ilya Sutskever and Hinton [3].
In 2020, the parent 80 Million Tiny Images dataset was withdrawn from public distribution after researchers Abeba Birhane and Vinay Prabhu documented that some images and labels in the corpus included racist and misogynistic slurs derived from WordNet, and that the dataset contained offensive imagery [4]. The original creators issued a public apology, asked researchers to delete their copies, and removed the download links. CIFAR-10 itself was not pulled because its 60,000 images had been independently relabeled by humans and its 10 class names contain no slurs or sensitive content. The dataset remains freely downloadable from the University of Toronto and is mirrored by virtually every deep learning framework.
CIFAR-10 is distributed in three packages on the official website: a Python pickle version, a MATLAB version, and a binary version intended for low-level access from C and other languages. All three contain the same 60,000 images and labels, organized into six batches of 10,000 images each. Five of the batches make up the training set and one is the held-out test batch. Each pickle file decodes to a dictionary whose two main entries are a 10,000 by 3,072 unsigned 8-bit integer array of pixel data and a list of 10,000 class labels in the range 0 to 9. The 3,072 numbers per row encode the image as 1,024 red intensities followed by 1,024 green intensities and then 1,024 blue intensities, with each color plane stored in row-major order. Reshaping a row to a (3, 32, 32) array reproduces the original image in channels-first order, and a further transpose to (32, 32, 3) yields the channels-last layout used by Pillow and TensorFlow.
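As a minimal sketch of this layout, the snippet below decodes one batch file with Python's pickle module and reshapes the pixel rows into image arrays; the path `cifar-10-batches-py/data_batch_1` assumes the official Python archive has already been downloaded and extracted into the working directory.

```python
import pickle
import numpy as np

def load_batch(path):
    """Decode one CIFAR-10 batch file into (images, labels).

    Assumes `path` points at an extracted batch file such as
    cifar-10-batches-py/data_batch_1.
    """
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")   # dictionary keys are bytes under Python 3
    data = batch[b"data"]                          # shape (10000, 3072), dtype uint8
    labels = np.array(batch[b"labels"])            # shape (10000,), values 0..9
    # Each row is 1024 red, 1024 green, 1024 blue intensities in row-major order.
    images = data.reshape(-1, 3, 32, 32)           # channels-first (N, C, H, W)
    images = images.transpose(0, 2, 3, 1)          # channels-last (N, H, W, C) for Pillow/TF
    return images, labels

images, labels = load_batch("cifar-10-batches-py/data_batch_1")
print(images.shape, images.dtype, labels[:10])
```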
The test batch contains exactly 1,000 images per class, sampled at random from the available labeled pool. The five training batches collectively contain the remaining 5,000 images per class but in random rather than balanced order, so a particular training batch can over-represent one class and under-represent another. This distinction matters for code that streams the dataset one batch at a time without first reshuffling, since per-batch class statistics are not uniform. Almost all modern training pipelines load the full training set into memory and apply random shuffling each epoch, which makes the within-batch imbalance moot.
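To see the per-batch imbalance directly, one can tally the labels of a single training batch, reusing the `load_batch` helper sketched above; the counts hover around 1,000 per class but are not exactly balanced.

```python
from collections import Counter

# Tally class labels in one training batch. Individual training batches are
# shuffled rather than stratified, so counts are close to but not exactly
# 1,000 per class; the test batch, by contrast, is exactly balanced.
_, train_labels = load_batch("cifar-10-batches-py/data_batch_1")
print(sorted(Counter(train_labels.tolist()).items()))
```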
A design decision that distinguishes CIFAR-10 from many later benchmarks is that the 10 classes are completely mutually exclusive. There is no overlap between automobiles and trucks, for example, because the automobile class contains only sedans, SUVs, and other passenger cars, while the truck class contains only large trucks. Similarly, the airplane class excludes airplane parts, and the ship class excludes boats too small to be classified as ships. This level of curation was unusual for early datasets and was one reason CIFAR-10 became a clean signal of model capability rather than a confused jumble of edge cases.
The 10 CIFAR-10 categories are grouped into two natural macro-classes: animals and vehicles. The table below lists each class with its index, the broad type of object it depicts, and a short description of what counts as an in-class example.
| Index | Class | Type | Description |
|---|---|---|---|
| 0 | Airplane | Vehicle | Fixed-wing aircraft photographed from the air or on the ground, including jets, propeller planes, and gliders |
| 1 | Automobile | Vehicle | Passenger cars including sedans, SUVs, and coupes, but explicitly excluding pickup trucks |
| 2 | Bird | Animal | Wild and domestic birds in flight or perched, including eagles, sparrows, parrots, and chickens |
| 3 | Cat | Animal | Domestic and wild cats including tabbies, kittens, lions, and tigers |
| 4 | Deer | Animal | Deer including stags, does, fawns, elk, and moose photographed in natural settings |
| 5 | Dog | Animal | Domestic dogs of any breed, including puppies, working dogs, and house pets |
| 6 | Frog | Animal | Frogs and toads in close-up or environmental shots |
| 7 | Horse | Animal | Domestic and wild horses, including foals, ponies, and racing horses |
| 8 | Ship | Vehicle | Large watercraft including cargo ships, cruise liners, sailing ships, and naval vessels |
| 9 | Truck | Vehicle | Large trucks including delivery trucks, semi-trailers, dump trucks, and military trucks |
The split between four vehicle classes (airplane, automobile, ship, truck) and six animal classes (bird, cat, deer, dog, frog, horse) creates an interesting natural taxonomy. Several of the most challenging confusion pairs in CIFAR-10 fall within the same macro-class: cats and dogs are visually similar at 32 by 32 resolution, deer and horse profiles can blur together, and automobiles and trucks share many design cues. These confusions also map onto natural categories in human perception, which is one reason CIFAR-10 has been used in cognitive science studies of visual recognition.
CIFAR-10 occupies a distinctive niche in the family of image classification benchmarks. It is small enough to train and evaluate models in minutes on a single GPU, yet diverse enough to require non-trivial visual representations. The table below compares CIFAR-10 with the closest standard benchmarks, highlighting the trade-offs that make each one useful for different research goals.
| Dataset | Image size | Channels | Classes | Train size | Test size | Year | Domain |
|---|---|---|---|---|---|---|---|
| MNIST | 28 by 28 | Grayscale | 10 | 60,000 | 10,000 | 1998 | Handwritten digits |
| Fashion-MNIST | 28 by 28 | Grayscale | 10 | 60,000 | 10,000 | 2017 | Clothing photos |
| SVHN | 32 by 32 | RGB | 10 | 73,257 | 26,032 | 2011 | Street view house numbers |
| CIFAR-10 | 32 by 32 | RGB | 10 | 50,000 | 10,000 | 2009 | Natural objects |
| CIFAR-100 | 32 by 32 | RGB | 100 | 50,000 | 10,000 | 2009 | Natural objects, fine-grained |
| STL-10 | 96 by 96 | RGB | 10 | 5,000 | 8,000 | 2011 | Natural objects, larger |
| Tiny ImageNet | 64 by 64 | RGB | 200 | 100,000 | 10,000 | 2015 | Natural objects |
| ImageNet | Variable, ~256 by 256 | RGB | 1,000 | ~1,281,167 | 50,000 | 2009 | Natural scenes, object-centric |
| ImageNet-21k | Variable | RGB | 21,841 | ~14M | None | 2009 | Full WordNet hierarchy |
MNIST predates CIFAR-10 by roughly a decade and contains 28 by 28 grayscale images of handwritten digits 0 through 9. It is significantly easier than CIFAR-10 because the images depict a single isolated stroke pattern with high contrast, no color information to model, and very low intra-class variation. By the time CIFAR-10 was released, modern CNNs were already achieving better than 99.5 percent accuracy on MNIST, and the remaining errors largely reflected ambiguous handwriting rather than algorithmic limitations. CIFAR-10 added color, natural-image statistics, viewpoint variation, and complex backgrounds, all of which made it a meaningfully harder task that better tracked progress on real-world computer vision.
CIFAR-100 is structurally identical to CIFAR-10 (same image size, same total number of images, same train/test split) but uses 100 fine-grained classes with 600 images each rather than 10 broad classes with 6,000 images each. The fine-grained categories include taxa such as beaver, otter, and seal grouped under an aquatic mammals superclass, and items such as bicycle, bus, motorcycle, pickup truck, and train grouped under a vehicles superclass. The 100-way split makes CIFAR-100 substantially harder, with state-of-the-art error rates running roughly four to five times those of CIFAR-10 throughout the 2010s and 2020s. Many papers report results on both datasets to demonstrate that improvements transfer across difficulty levels.
Compared to ImageNet, CIFAR-10 is roughly 50 to 65 times smaller in pixel count per image at typical ImageNet training resolutions of 224 to 256 pixels, and roughly 25 times smaller in total image count. ImageNet's higher resolution allows networks to learn fine-grained texture and part-based features that are simply invisible at 32 by 32 resolution, while CIFAR-10's small size makes it tractable for ablation studies that would be prohibitively expensive on ImageNet. A common practice in the deep learning literature is to validate a new idea quickly on CIFAR-10, then confirm it scales by re-running the experiment on ImageNet.
CIFAR-10 has been the proving ground for nearly every major computer vision architecture and training innovation of the past 15 years. The headline metric is top-1 test error percentage, which measures the fraction of the 10,000 test images that the model classifies incorrectly when forced to choose a single label. Lower is better, and a model that achieves 1.0 percent error is correctly labeling 9,900 of the 10,000 test images. The arc of progress on CIFAR-10 traces the broader story of image classification models from shallow feature pipelines through deep convolutional architectures and on to vision transformers and large-scale pretraining.
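For concreteness, the metric reduces to the fraction of argmax predictions that disagree with the labels; a short NumPy sketch, assuming `logits` is a (10000, 10) score array and `labels` the ground-truth class indices:

```python
import numpy as np

def top1_error(logits, labels):
    """Fraction of test images whose highest-scoring class is not the true label."""
    predictions = np.argmax(logits, axis=1)
    return float(np.mean(predictions != labels))

# e.g. 100 wrong predictions out of 10,000 test images -> 0.01, i.e. 1.0 percent error
```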
The earliest CIFAR-10 results came from shallow pipelines that combined hand-engineered features with classifiers like k-nearest neighbors, support vector machines, and restricted Boltzmann machines. Krizhevsky's 2009 technical report listed error rates above 35 percent for these baselines [1]. Adam Coates and Andrew Ng demonstrated in 2011 that a single-layer K-means feature extractor with thousands of centroids and a linear SVM could reach roughly 20 percent error, an unexpectedly strong result that suggested simple unsupervised feature learning was competitive with more elaborate methods at this scale. Throughout the period from 2009 to 2011, the broad pattern was that engineering effort spent on feature design produced incremental gains and that no single approach pulled away from the field.
The arrival of GPU-trained CNNs upended the CIFAR-10 leaderboard. Krizhevsky's 2010 manuscript on convolutional deep belief networks with locally connected layers reported around 21 percent error [3]. By 2013, Maxout Networks from Ian Goodfellow and colleagues had pushed the error rate down to 9.38 percent. Network in Network from Min Lin and colleagues, also in 2013, reached 8.8 percent. Sergey Zagoruyko and Nikos Komodakis's Wide Residual Networks (WideResNet) in 2016 brought the error to roughly 4.0 percent by widening the channels of standard ResNet blocks rather than making the network deeper. DenseNet from Gao Huang and colleagues hit 3.46 percent in the same year through dense feature reuse across layers.
The period from 2016 to 2018 saw error rates fall by roughly half through a combination of neural architecture search and increasingly elaborate regularization. Barret Zoph and Quoc Le's NAS with reinforcement learning reached 3.65 percent in 2016. Xavier Gastaldi's Shake-Shake regularization, which randomly mixes parallel residual branches at training time, brought it to 2.86 percent in 2017. Yoshihiro Yamada and colleagues' ShakeDrop pushed it to 2.67 percent. Terrance DeVries and Graham Taylor's Cutout, which randomly zeros out square patches of input images, reached 2.56 percent. Esteban Real and colleagues' Regularized Evolution for architecture search hit 2.13 percent in early 2018.
The AutoAugment paper by Ekin Cubuk and colleagues in 2018 introduced a learned augmentation policy that combined operations such as rotation, color jitter, and shearing with reinforcement-learned probabilities, dropping CIFAR-10 error to 1.48 percent. Later that same year, the GPipe paper from Yanping Huang and colleagues used pipeline parallelism to train a giant AmoebaNet on a TPU pod and reported 1.0 percent error, briefly setting a new standard for what was achievable with sufficient compute and architectural search.
The Big Transfer (BiT) paper from Alexander Kolesnikov and colleagues at Google Brain in 2019 demonstrated that transferring a large CNN pretrained on the JFT-300M private dataset gave roughly 99.4 percent accuracy on CIFAR-10, corresponding to about 0.6 percent error [5]. The result emphasized that scale of pretraining data, more than any architectural choice, had become the dominant factor in benchmark performance.
The Vision Transformer (ViT) from Alexey Dosovitskiy and colleagues, also at Google Brain, was released in late 2020 and reached comparable accuracy by adapting the transformer architecture from natural language processing to image patches. ViT models pretrained on JFT and fine-tuned on CIFAR-10 reached around 99.5 percent test accuracy. This success helped popularize the architecture more broadly and spawned a wave of successors, including DeiT, Swin Transformer, and ConvNeXt, all of which reported strong CIFAR-10 numbers.
By 2023 the headline CIFAR-10 error rate had fallen below 0.5 percent, with results such as the 0.95 percent reported by the Class Activation Uncertainty Reduction paper of May 2023 clustering close behind the leaders. Vision transformer variants combined with self-supervised pretraining and aggressive augmentation routinely exceed 99.5 percent accuracy, and the current top entries on the Papers with Code leaderboard sit at roughly 99.6 percent accuracy, corresponding to about 40 misclassified test images out of 10,000.
A notable subplot is the parallel push for fast training rather than peak accuracy. David Page's CIFAR-10 fast-training experiments showed that 94 percent accuracy could be reached in well under a minute of single-GPU wall-clock time, and Keller Jordan's later "airbench" project drove the same target below 4 seconds by carefully tuning architecture, optimizer, and data pipeline. These efforts have made CIFAR-10 a useful microbenchmark for systems-level work on training throughput rather than raw model quality.
The table below summarizes a selection of representative CIFAR-10 results across the dataset's lifetime. The numbers reported are top-1 test error percentages on the standard 10,000-image test set unless otherwise noted, taken from the original publications.
| Year | Method | Authors | Test error (%) | Notes |
|---|---|---|---|---|
| 2010 | Convolutional deep belief network | Krizhevsky | 21.1 | Early GPU-trained deep model |
| 2011 | K-means features plus linear SVM | Coates and Ng | ~20 | Strong shallow baseline |
| 2013 | Maxout Networks | Goodfellow et al. | 9.38 | Maxout activation function |
| 2013 | Network in Network | Lin et al. | 8.81 | Inception-precursor architecture |
| 2014 | Fractional Max-Pooling | Graham | 3.47 | Fractional stride pooling |
| 2016 | Wide ResNet | Zagoruyko and Komodakis | 4.0 | Wider residual blocks |
| 2016 | DenseNet | Huang et al. | 3.46 | Dense layer-to-layer connections |
| 2016 | NAS with RL | Zoph and Le | 3.65 | First major architecture search win |
| 2017 | Shake-Shake regularization | Gastaldi | 2.86 | Random branch mixing |
| 2017 | Cutout regularization | DeVries and Taylor | 2.56 | Random rectangular masking |
| 2017 | PyramidNet | Han et al. | 3.31 | Gradually increasing channels |
| 2018 | ShakeDrop | Yamada et al. | 2.67 | Probabilistic branch dropping |
| 2018 | Regularized Evolution | Real et al. | 2.13 | Evolutionary architecture search |
| 2018 | AutoAugment | Cubuk et al. | 1.48 | Learned augmentation policy |
| 2018 | GPipe (AmoebaNet-B) | Huang et al. | 1.00 | Pipeline-parallel training |
| 2019 | BiT-L (ResNet152x4) | Kolesnikov et al. | ~0.6 | JFT-300M pretraining transfer |
| 2021 | ViT-H/14 (JFT pretrained) | Dosovitskiy et al. | ~0.5 | Vision transformer transfer |
| 2023 | CAUR | Various | 0.95 | Class activation uncertainty reduction |
| 2024 to 2026 | Modern ViT and hybrid models | Various | ~0.4 | Papers with Code current leaders |
A persistent question for any benchmark that has been driven near its accuracy ceiling is whether the remaining errors reflect genuine limits of the models or imperfections in the labels themselves. In 2021, Curtis Northcutt, Anish Athalye, and Jonas Mueller published a study using their confident learning framework to estimate label errors across 10 widely used test sets, including CIFAR-10 [6]. They found that approximately 0.54 percent of the CIFAR-10 test set, or roughly 54 of 10,000 images, contained labels that human annotators on Mechanical Turk disagreed with after re-examination. Although this is a small fraction in absolute terms, it is on the same order as the residual error of state-of-the-art models, which means that some apparent errors are actually correct predictions on mislabeled images, and conversely that some apparent gains in benchmark accuracy come from a model learning to reproduce the original labelers' mistakes.
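The confident learning approach is available in the authors' open-source cleanlab library; a hedged sketch of how one might flag suspect CIFAR-10 test labels is shown below, where the two `.npy` files holding out-of-sample predicted probabilities and the given labels are hypothetical inputs the reader would need to produce themselves.

```python
import numpy as np
from cleanlab.filter import find_label_issues  # pip install cleanlab

# Hypothetical inputs: out-of-sample predicted probabilities for the 10,000
# test images (e.g., from cross-validated models) and the original labels.
pred_probs = np.load("cifar10_test_pred_probs.npy")  # shape (10000, 10), assumed file
labels = np.load("cifar10_test_labels.npy")          # shape (10000,), assumed file

# Indices of images whose given label conflicts with the model's confident
# prediction, ranked by how likely each one is to be mislabeled.
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"{len(issue_indices)} suspected label errors out of {len(labels)}")
```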
Northcutt and colleagues released corrected versions of the affected test sets and demonstrated that on CIFAR-10 the relative ranking of competing models can flip when labels are cleaned. In one example, a smaller VGG-11 outperformed a larger VGG-19 on the corrected test set even though VGG-19 was better on the original noisy labels. This sensitivity highlights an important methodological point: tiny absolute differences between top-of-leaderboard models on CIFAR-10 are no longer reliable indicators of true model quality and can be dominated by label noise.
A related concern is the influence of near-duplicates between training and test sets. A 2019 paper by Barz and Denzler examined the CIFAR datasets for visual near-duplicates and reported that roughly 3 to 4 percent of the CIFAR-10 test set has very similar counterparts in the training set, which can inflate apparent test accuracy [7]. These two findings together suggest that a 99.5 percent benchmark score on CIFAR-10 should be interpreted with care, since label noise and train-test contamination can each account for several tenths of a percentage point.
Despite its limitations, CIFAR-10 has been one of the most consequential datasets in modern artificial intelligence. It served as the empirical anchor for the convolutional renaissance of the early 2010s, the regularization arms race of the mid-2010s, and the architecture search and pretraining waves of the late 2010s. Many ideas now standard in deep learning, including dropout, batch normalization, residual connections, dense connections, mixup, cutout, AutoAugment, and learning rate scheduling refinements, were validated or extensively ablated on CIFAR-10 before or alongside being scaled up to ImageNet and beyond. The dataset's small size made it possible for researchers without access to industrial compute to participate in the cutting edge, which broadened the community of contributors and accelerated iteration speed.
CIFAR-10 has also functioned as an educational on-ramp for new deep learning practitioners. It is the default first computer vision benchmark in most introductory deep learning courses, including those at Stanford, Carnegie Mellon, and the University of Toronto. The dataset is included as a one-line download in PyTorch (torchvision.datasets.CIFAR10), TensorFlow (tf.keras.datasets.cifar10), JAX-based libraries, and the Hugging Face datasets hub, which means a typical newcomer can train their first CNN end-to-end within an hour. This widespread teaching role has familiarized an entire generation of researchers with the dataset's classes, image statistics, and standard preprocessing.
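For example, the PyTorch mirror reduces acquisition to a couple of lines (the TensorFlow equivalent is `tf.keras.datasets.cifar10.load_data()`); the `./data` directory below is simply an arbitrary local cache location.

```python
import torchvision

# Downloads and verifies the official archive on first use, then reads from ./data.
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False, download=True)
print(len(train_set), len(test_set))  # 50000 10000
```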
CIFAR-10 also plays an outsized role in adversarial robustness research. The RobustBench leaderboard, maintained by a consortium led by Francesco Croce and others, tracks adversarial accuracy on CIFAR-10 against standardized attack budgets and is one of the primary venues for evaluating robust training methods such as adversarial training, randomized smoothing, and certified defenses. Because robustness research often requires many iterations of attack and defense, the dataset's small size keeps experiments tractable while still being meaningful for natural images.
Finally, CIFAR-10 has become a standard benchmark for systems and infrastructure work. The DAWNBench competition launched by Stanford in 2017 used CIFAR-10 as one of its primary tracks for measuring time-to-accuracy and cost-to-accuracy of training pipelines. The fast training community has produced a steady stream of innovations in optimizer scheduling, mixed-precision arithmetic, and data loading that originated as CIFAR-10 micro-optimizations and were later adopted in industrial training stacks.
The primary limitation of CIFAR-10 is the 32 by 32 image resolution. At this size, fine textures, small details, and small objects within scenes are simply unresolvable, which limits how realistic the dataset can be as a stand-in for real-world recognition. Modern phone cameras produce images thousands of times larger in pixel count, and many real applications such as medical imaging, autonomous driving, and document understanding depend on details that vanish at 32 by 32 resolution. Improvements that look impressive on CIFAR-10 do not always carry over to higher-resolution benchmarks, and ablations that show clear effects on CIFAR-10 sometimes fail to replicate at scale.
The second concern is benchmark saturation. With state-of-the-art error rates approaching the noise floor of the labels themselves, CIFAR-10 no longer provides a strong gradient for measuring model capability. New methods often report incremental gains that are within the standard error of repeated runs, and confident comparison between top systems requires careful attention to seeding, training schedule length, and evaluation protocol.
A third issue is that the dataset's coarse 10-way taxonomy does not reflect the long tail of real-world category distributions. ImageNet, with 1,000 classes, captures more of the diversity that production vision systems face, and even ImageNet itself is increasingly seen as too narrow compared to billion-scale web-scraped collections. CIFAR-10 was a clean signal for early algorithmic comparison but is no longer a representative target for modern computer vision systems.
A fourth subtle limitation is the connection to the parent 80 Million Tiny Images collection. Although CIFAR-10 itself contains no offensive content, it inherits the same image-sourcing methodology, and the parent dataset's withdrawal in 2020 has made some researchers cautious about citing CIFAR-10 in contexts that touch on data ethics. Most of the community has continued to use CIFAR-10 freely on the grounds that it was independently relabeled and contains no problematic class names, but the discussion remains open.
A standard CIFAR-10 training pipeline includes per-channel normalization to zero mean and unit variance using the dataset's empirical statistics. The widely cited normalization values are mean (0.4914, 0.4822, 0.4465) and standard deviation (0.2470, 0.2435, 0.2616) for the red, green, and blue channels. These values are computed once over the training set and applied to both training and test images.
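A sketch of how these statistics are derived and applied, assuming `train_images` is a (50000, 32, 32, 3) uint8 array such as the one produced by the loader sketched earlier:

```python
import numpy as np

# Scale pixels to [0, 1] and compute per-channel statistics over the training set only.
x = train_images.astype(np.float32) / 255.0
mean = x.mean(axis=(0, 1, 2))  # approximately (0.4914, 0.4822, 0.4465)
std = x.std(axis=(0, 1, 2))    # approximately (0.2470, 0.2435, 0.2616)

# The same training-set mean and std are applied to training and test images alike.
x_normalized = (x - mean) / std
```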
The canonical data augmentation recipe combines random cropping after 4-pixel padding (zero or reflection padding, depending on the implementation) with random horizontal flipping. Padding the 32 by 32 input by 4 pixels on each side and then randomly cropping back to 32 by 32 produces small translation invariance, while horizontal flipping doubles the effective dataset size for vehicle and animal classes that have natural left-right symmetry. Many modern recipes add Cutout, mixup, AutoAugment, RandAugment, or AugMix on top of the base recipe, and the choice of augmentation can swing test error by one to two percentage points on identical architectures.
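One common way to express this recipe with torchvision transforms is sketched below; the choice of reflection padding and the normalization constants are conventions rather than requirements.

```python
import torchvision.transforms as T

# Base CIFAR-10 training augmentation: pad by 4 pixels, crop back to 32x32,
# flip horizontally half the time, then normalize with training-set statistics.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4, padding_mode="reflect"),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

# Test images are only normalized, never augmented.
test_transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
```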
The standard training loss for CIFAR-10 is cross-entropy between the predicted class probabilities and the one-hot label vector. The optimizer is typically stochastic gradient descent with Nesterov momentum (initial learning rate 0.1, momentum 0.9) for CNN architectures, with a cosine or multi-step learning rate schedule over 100 to 300 epochs. Vision transformer training on CIFAR-10 usually uses AdamW with cosine decay and a longer training schedule, and benefits substantially from large-scale pretraining followed by fine-tuning rather than training from scratch.
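A minimal sketch of this optimization setup in PyTorch follows; the placeholder model, the 200-epoch horizon, and the weight decay value of 5e-4 are illustrative choices rather than anything prescribed by the dataset.

```python
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)  # placeholder; substitute any CIFAR-10 CNN
epochs = 200                              # typical schedules run 100 to 300 epochs

criterion = torch.nn.CrossEntropyLoss()   # cross-entropy against the integer class labels
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, nesterov=True, weight_decay=5e-4
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # One pass over the shuffled training set goes here:
    #   optimizer.zero_grad(); loss = criterion(model(x), y); loss.backward(); optimizer.step()
    scheduler.step()  # cosine-decay the learning rate once per epoch
```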
CIFAR-10 is hosted at the University of Toronto's website at https://www.cs.toronto.edu/~kriz/cifar.html in three formats. The Python pickle version is roughly 163 megabytes compressed; the MATLAB version is similar in size; and the binary version is slightly smaller because it omits some metadata. The dataset is also mirrored by virtually every popular deep learning library: torchvision.datasets.CIFAR10 in PyTorch, tf.keras.datasets.cifar10 in TensorFlow, the uoft-cs/cifar10 dataset on the Hugging Face Hub, and tfds.load('cifar10') in TensorFlow Datasets. No formal license text accompanies the download; the maintainers ask only that users cite the 2009 technical report, and the dataset has been freely available for research and educational purposes since 2009.