# ImageNet

> Source: https://aiwiki.ai/wiki/imagenet
> Updated: 2026-06-20
> Categories: Computer Vision, Deep Learning, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

## Introduction

**ImageNet** is a large-scale image database of more than 14 million hand-annotated photographs organized into over 21,000 categories according to the [WordNet](/wiki/wordnet) noun hierarchy [1][19]. It was created by [Fei-Fei Li](/wiki/fei_fei_li) and her collaborators at Princeton University (later Stanford University) and first presented at the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), where the team described the goal as a database that "aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images" [1]. The full dataset ultimately reached 14,197,122 images across 21,841 categories, making it one of the largest and most influential visual recognition datasets ever assembled [19].

ImageNet is best known for spawning the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual competition held from 2010 to 2017 that used a 1,000-class subset of roughly 1.2 million training images and became the primary benchmark for progress in [image classification](/wiki/image_classification_models) and [object detection](/wiki/object_detection) [19]. The 2012 edition produced what is widely considered the single most important result in the history of modern [deep learning](/wiki/deep_learning): a [convolutional neural network](/wiki/convolutional_neural_network) called [AlexNet](/wiki/alexnet), built by Alex Krizhevsky, [Ilya Sutskever](/wiki/ilya_sutskever), and [Geoffrey Hinton](/wiki/geoffrey_hinton), cut the top-5 classification error rate to 15.3 percent, compared with 26.2 percent for the second-best entry, demonstrating the power of deep neural networks trained on GPUs [4]. This result catalyzed the deep learning revolution that has reshaped [artificial intelligence](/wiki/artificial_intelligence) and transformed industries from healthcare to transportation.

## What is ImageNet?

ImageNet is a research dataset, not a model: it is a collection of labeled images, organized by meaning, that algorithms can be trained and tested on. Each image is tied to a WordNet "synset" (synonym set), a node in a semantic hierarchy that runs from broad concepts such as "mammal" or "vehicle" at the top down to highly specific ones such as "German shepherd" or "convertible" at the bottom [1]. Because the labels follow WordNet rather than an ad hoc category list, ImageNet captures relationships between concepts, not just isolated class names. The dataset's defining characteristics are its scale (14,197,122 images), its breadth (21,841 categories), and its rich annotation, including 1,034,908 images with hand-drawn bounding boxes [19].

## Origins and Development

### Motivation

Fei-Fei Li began conceptualizing ImageNet in 2006, motivated by research in cognitive science suggesting that humans can recognize approximately 30,000 distinct object categories. At the time, most [computer vision](/wiki/computer_vision) research relied on small, carefully curated datasets with a few hundred to a few thousand images across a handful of categories. Li recognized that achieving human-level visual recognition would require training and evaluating algorithms on data that more closely matched the scale and diversity of the visual world.

The prevailing approach in computer vision during the mid-2000s focused on designing better features and classifiers. Li took the opposite stance, arguing that the bottleneck was data, not algorithms. She believed that building a comprehensive, large-scale image database organized by semantic meaning would unlock fundamental advances in visual recognition.

### Construction

The original ImageNet paper lists six authors: Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei [1]. The project began at Princeton, where Li was an assistant professor, and continued after she moved to Stanford University in 2009.

ImageNet's structure follows the WordNet lexical database. WordNet organizes English nouns into a hierarchy of "synonym sets" (synsets), each representing a distinct concept. ImageNet aimed to provide approximately 500 to 1,000 quality-controlled images for each of the more than 80,000 noun synsets in WordNet [1]. The team ultimately populated 21,841 synsets with images [19].

Collecting and labeling millions of images required an enormous annotation effort. The team turned to Amazon Mechanical Turk, using crowd workers to verify whether candidate images (found through internet image search engines) correctly depicted a given concept. For quality control, each image was labeled by multiple workers and accepted only when the annotators agreed. At its peak, the annotation pipeline processed tens of thousands of images per day.
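The redundant-labeling idea can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual pipeline: the function name `accept_image`, the file names, and the thresholds `min_votes` and `min_agreement` are all assumptions made for the example.

```python
def accept_image(votes, min_votes=3, min_agreement=0.8):
    """Decide whether a candidate image is accepted for a synset.

    votes: list of booleans, one per crowd worker (True = "yes, it matches").
    min_votes and min_agreement are hypothetical thresholds for illustration.
    """
    if len(votes) < min_votes:
        return False  # not enough redundant annotations yet
    yes_fraction = sum(votes) / len(votes)
    return yes_fraction >= min_agreement


# Hypothetical candidate images with one vote per worker.
candidates = {
    "img_001.jpg": [True, True, True, True, False],    # 80% say yes
    "img_002.jpg": [True, False, False, True, False],  # 40% say yes
    "img_003.jpg": [True, True],                       # too few votes
}

accepted = [name for name, votes in candidates.items() if accept_image(votes)]
print(accepted)
```

Only the first image passes both checks here. A real system would likely tune the required agreement per category, since some concepts are harder to judge than others.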

The timeline of dataset growth illustrates the scale of effort involved:

| Date | Images | Synsets |
|---|---|---|
| July 2008 | 0 | 0 |
| December 2008 | ~3 million | ~6,000 |
| April 2010 | ~11 million | ~15,000 |
| Final dataset | 14,197,122 | 21,841 |

### Dataset Details

The full ImageNet database has the following characteristics [19]:

| Property | Value |
|---|---|
| Total images | 14,197,122 |
| Total synsets (categories) | 21,841 |
| Images with bounding box annotations | 1,034,908 |
| Images with SIFT features | ~1.2 million |
| Average images per synset | ~650 |
| Average image resolution | Variable (typically 300-500 pixels on longest side) |
| Hierarchy depth | 9 levels (from "entity" to specific breeds/species) |

The categories range from broad concepts at the top of the hierarchy (e.g., "mammal," "vehicle") to highly specific ones at the bottom (e.g., "German shepherd," "convertible"). Each synset can contain multiple synonyms. For example, a synset might include both "kitty" and "young cat" as names for the same concept.
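A WordNet-style hierarchy of this kind can be modeled as a simple tree in which every node holds its synonyms and a link to its parent. The sketch below is illustrative only: the `Synset` class is hypothetical, and the intermediate node "dog" is a simplification that does not reproduce the real WordNet path.

```python
class Synset:
    """One node in a WordNet-style hierarchy: a concept plus its synonyms."""

    def __init__(self, names, parent=None):
        self.names = list(names)  # all synonyms for this concept
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path_to_root(self):
        """First name of each synset from this node up to the top."""
        path, node = [], self
        while node is not None:
            path.append(node.names[0])
            node = node.parent
        return path

    def is_a(self, ancestor):
        """True if `ancestor` is this synset or lies above it."""
        node = self
        while node is not None:
            if node is ancestor:
                return True
            node = node.parent
        return False


# Toy hierarchy built from the examples in the text (simplified).
entity = Synset(["entity"])
mammal = Synset(["mammal"], parent=entity)
dog = Synset(["dog"], parent=mammal)  # hypothetical intermediate node
german_shepherd = Synset(["German shepherd"], parent=dog)
vehicle = Synset(["vehicle"], parent=entity)
convertible = Synset(["convertible"], parent=vehicle)

print(german_shepherd.path_to_root())
print(german_shepherd.is_a(mammal), german_shepherd.is_a(vehicle))
```

Because every label sits in a tree, a classifier's mistake can be judged by how far apart two synsets are, which a flat list of class names cannot express.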

Images were sourced primarily from the internet through search engines like Google, Yahoo, and Flickr. They depict objects in natural settings with varying backgrounds, occlusions, viewpoints, and lighting conditions, making the dataset considerably more challenging than earlier benchmarks that used controlled studio photography.

## The ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

The ILSVRC was organized annually from 2010 to 2017 and became one of the most closely watched and consequential benchmark competitions in AI research. It used a subset of the full ImageNet database: 1,000 object categories with roughly 1.2 million training images, 50,000 validation images, and 100,000 test images [19].

The primary metric was the **top-5 error rate** for the classification task: the fraction of test images for which the correct label was not among the model's five highest-confidence predictions. The challenge also included tasks for object localization (classifying and placing a bounding box around objects) and, in later years, object detection from video. The challenge organizers, writing in the International Journal of Computer Vision, described ILSVRC as "a benchmark in object category classification and detection on hundreds of object categories and millions of images" that had "been run annually from 2010 to present" [19].
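The top-5 error rate defined above can be computed as follows. This is a minimal sketch on made-up scores for a toy problem with seven classes; the function name `top_k_error` and the data are assumptions for illustration, not part of any official evaluation toolkit.

```python
def top_k_error(scores_per_image, true_labels, k=5):
    """Fraction of images whose true label is NOT among the k top-scoring classes."""
    if len(scores_per_image) != len(true_labels):
        raise ValueError("need exactly one true label per image")
    misses = 0
    for scores, label in zip(scores_per_image, true_labels):
        # Class indices ordered from highest to lowest confidence.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(true_labels)


# Hypothetical confidence scores over 7 classes, reused for three images.
scores = [0.05, 0.30, 0.02, 0.20, 0.25, 0.10, 0.08]
predictions = [scores, scores, scores]
labels = [2, 6, 1]  # true class index of each image

top5 = top_k_error(predictions, labels, k=5)
top1 = top_k_error(predictions, labels, k=1)
print(f"top-5 error: {top5:.3f}, top-1 error: {top1:.3f}")
```

In this toy case the second image counts as correct under top-5 (its label is ranked fifth) but as an error under top-1, which shows why top-5 is the more forgiving metric when images contain several plausible objects.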

### ILSVRC Results by Year

| Year | Winner | Top-5 Error (%) | Key Innovation | Reference |
|---|---|---|---|---|
| 2010 | NEC-UIUC | 28.2 | SIFT/LBP features + SVM | Lin et al. [2] |
| 2011 | XRCE | 25.8 | High-dimensional signatures + compressed Fisher vectors | Perronnin et al. [3] |
| 2012 | [AlexNet](/wiki/alexnet) | 15.3 | Deep CNN trained on GPUs, ReLU, dropout | Krizhevsky et al. [4] |
| 2013 | ZFNet | 14.8 | Improved AlexNet with deconvolutional visualization | Zeiler & Fergus [5] |
| 2014 | [GoogLeNet](/wiki/googlenet) (Inception) | 6.67 | Inception modules, 22 layers deep | Szegedy et al. [6] |
| 2014 | [VGGNet](/wiki/vgg) (runner-up) | 7.3 | Very deep networks (16-19 layers) with 3x3 convolutions | Simonyan & Zisserman [7] |
| 2015 | [ResNet](/wiki/resnet) | 3.57 | Residual connections, 152 layers | He et al. [8] |
| 2016 | Trimps-Soushen | 2.99 | Ensemble of Inception and ResNet variants | [9] |
| 2017 | SENet | 2.25 | Squeeze-and-Excitation blocks for channel recalibration | Hu et al. [10] |

The progression from 28.2% error in 2010 to 2.25% in 2017 is a more than twelvefold reduction over eight editions of the challenge. It is especially notable because human performance on the same task was estimated at roughly 5.1%, a figure [Andrej Karpathy](/wiki/andrej_karpathy) obtained in 2014 by labeling images himself [11], so the winning entries from 2015 onward scored below that estimate.

## The AlexNet Moment (2012)

The 2012 ILSVRC competition stands as a watershed event in the history of artificial intelligence. Alex Krizhevsky, a graduate student at the University of Toronto, along with Ilya Sutskever and their advisor Geoffrey Hinton, submitted an entry based on a deep convolutional neural network that they later named AlexNet.

### Architecture and Innovations

AlexNet contained eight learned layers: five [convolutional layers](/wiki/convolutional_neural_network) followed by three fully connected layers, ending in a 1,000-way softmax output. The authors summarized it as follows: "The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax" [4]. Key innovations included:

- **[ReLU](/wiki/relu) activation**: AlexNet used Rectified Linear Units (ReLU) instead of the traditional tanh or sigmoid activations. ReLU trains several times faster than saturating nonlinearities, a critical advantage for large models.
- **GPU training**: The model was trained on two NVIDIA GTX 580 GPUs (3 GB memory each), with the network split across the two GPUs. Training took five to six days. This was one of the first demonstrations that commodity GPUs could accelerate deep learning to practical timescales.
- **[Dropout](/wiki/dropout) [regularization](/wiki/regularization)**: AlexNet applied dropout (randomly zeroing 50% of neuron activations during training) in the fully connected layers, which substantially reduced [overfitting](/wiki/overfitting).
- **Data augmentation**: The team used image translations, horizontal reflections, and color jittering to artificially expand the training set.
- **Local response normalization**: A form of lateral inhibition inspired by neuroscience, though this technique was later abandoned by the community in favor of [batch normalization](/wiki/batch_normalization).
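Two of the ideas in this list, ReLU and dropout, are simple enough to sketch in plain Python. The code below is an illustration on lists of floats, not AlexNet's implementation. It uses the now-common "inverted" dropout, which rescales the surviving activations during training; the original paper instead kept training activations unscaled and multiplied outputs by 0.5 at test time.

```python
import random


def relu(activations):
    """Rectified Linear Unit: negative values become 0, positives pass through."""
    return [max(0.0, a) for a in activations]


def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each value with probability p during training.

    Survivors are scaled by 1 / (1 - p) so the expected value is unchanged.
    At inference time (training=False) the input is returned as-is.
    """
    if not training:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


pre_activations = [-2.0, 0.0, 3.5, 1.0]
hidden = relu(pre_activations)
print(hidden)
print(dropout(hidden, p=0.5, rng=random.Random(0)))  # random mask, seeded
print(dropout(hidden, training=False))               # unchanged at inference
```

Because ReLU does not saturate for positive inputs, its gradient there is a constant 1, which is one common explanation for the faster training the authors reported.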

### Results and Impact

AlexNet achieved a top-5 error rate of 15.3% on the ILSVRC 2012 test set. In the authors' words, the model "achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry" [4]. That gap of 10.9 percentage points was an unprecedented margin of victory. The runner-up relied on traditional hand-engineered features (Fisher vectors with SIFT features), representing the best of the pre-deep-learning approach.

The result sent shockwaves through the computer vision community. Many researchers who had spent decades engineering visual features recognized almost immediately that deep learning had changed the game. Within a year, virtually every competitive entry in the ILSVRC used deep convolutional neural networks. The effect rippled outward into other areas of AI and then into industry, triggering massive investment in deep learning research and GPU hardware.

The paper "ImageNet Classification with Deep Convolutional Neural Networks" was published at NIPS 2012 (now [NeurIPS](/wiki/neurips)) and has been cited over 100,000 times, making it one of the most cited papers in the history of computer science [4].

## Subsequent Challenge Milestones

### ZFNet (2013)

Matthew Zeiler and Rob Fergus of New York University won the 2013 challenge with ZFNet, a refined version of AlexNet. Their key contribution was a deconvolutional network visualization technique that allowed researchers to see what each layer of a CNN had learned, providing interpretability that had previously been lacking [5]. ZFNet reduced the top-5 error to 14.8%, a modest but meaningful improvement.

### GoogLeNet and VGG (2014)

The 2014 competition saw two landmark entries. GoogLeNet (also known as [Inception](/wiki/inception) v1), from a team at Google led by Christian Szegedy, won with a 6.67% top-5 error rate [6]. GoogLeNet introduced the "Inception module," which applies multiple filter sizes (1x1, 3x3, 5x5) in parallel and concatenates the results, allowing the network to capture patterns at multiple scales within a single layer. Despite being 22 layers deep, GoogLeNet used only about 5 million parameters, far fewer than AlexNet, thanks to careful architectural design.
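The effect of running several filter sizes in parallel and concatenating the results can be seen with a little shape arithmetic. The sketch below is a simplification: the input depth and branch widths are hypothetical numbers chosen for illustration, biases are ignored, and it models only the three parallel convolutions named in the text.

```python
def conv_weight_count(kernel_size, in_channels, out_channels):
    """Weights in one square convolution layer (biases ignored for simplicity)."""
    return kernel_size * kernel_size * in_channels * out_channels


def inception_output_channels(branch_widths):
    """Branches are concatenated along the channel axis, so their widths add up."""
    return sum(branch_widths.values())


IN_CHANNELS = 192                       # hypothetical input depth
BRANCH_WIDTHS = {1: 64, 3: 128, 5: 32}  # kernel size -> output channels (hypothetical)

out_channels = inception_output_channels(BRANCH_WIDTHS)
weights = {k: conv_weight_count(k, IN_CHANNELS, c) for k, c in BRANCH_WIDTHS.items()}

print("output channels:", out_channels)
for k, n in weights.items():
    print(f"{k}x{k} branch: {n:,} weights")
print("total:", f"{sum(weights.values()):,}")
```

The arithmetic makes the cost of large filters visible: in this toy configuration the 5x5 branch produces a quarter as many channels as the 3x3 branch yet needs well over half as many weights.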

VGGNet, from the Visual Geometry Group at the University of Oxford (Karen Simonyan and Andrew Zisserman), placed second with a 7.3% error rate [7]. [VGG](/wiki/vgg) demonstrated that depth matters: by stacking many layers of small 3x3 convolution filters, a 16- or 19-layer network could achieve excellent results. The VGG architecture became widely used as a feature extractor in [transfer learning](/wiki/transfer_learning) because of its simplicity and the quality of its learned representations.

### ResNet (2015)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun at [Microsoft](/wiki/microsoft) Research won the 2015 challenge with ResNet (Residual Network), achieving a 3.57% top-5 error rate and falling below the estimated human error rate of 5.1% for the first time [8]. ResNet introduced [residual connections](/wiki/residual_connection) (skip connections) that allow gradients to flow directly through the network by adding the input of a layer to its output. This simple modification made it possible to train extremely deep networks (up to 152 layers in the competition entry, and later 1,000+ layers in experiments) without suffering from the vanishing gradient problem.
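The core operation, adding a layer's input to its output, can be sketched as follows. This is a minimal illustration on plain lists: `residual_block` and the toy `transform` functions are hypothetical stand-ins for the convolutional layers a real ResNet block would contain.

```python
def residual_block(x, transform):
    """Return transform(x) + x, the skip connection at the heart of ResNet."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the input size")
    return [f + xi for f, xi in zip(fx, x)]


def zero_transform(x):
    """A layer that has learned nothing: it outputs all zeros."""
    return [0.0 for _ in x]


def small_update(x):
    """A layer that contributes a modest correction to its input."""
    return [0.5 * xi for xi in x]


signal = [1.0, -2.0]

# Even 152 stacked blocks leave the signal intact when each transform is zero,
# because every block falls back to the identity mapping.
deep = signal
for _ in range(152):
    deep = residual_block(deep, zero_transform)

print(deep)
print(residual_block([1.0, 2.0], small_update))
```

The loop illustrates why depth becomes less risky: a block that learns nothing still passes its input through unchanged instead of degrading it.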

ResNet's impact extended far beyond image classification. Residual connections became a standard component in neural network design, appearing in nearly every major architecture that followed, including the [Transformer](/wiki/transformer).

### Beyond Human-Level Accuracy (2016-2017)

By 2016, the top entries had pushed error rates below 3%. Trimps-Soushen, a team from the Chinese Academy of Sciences, won the 2016 challenge with a 2.99% error using ensembles of Inception and ResNet variants [9]. In 2017, the final year of the competition, Jie Hu, Li Shen, and Gang Sun won with SENet (Squeeze-and-Excitation Networks), achieving 2.25% error through a channel attention mechanism that learned to reweight feature maps based on their informational content [10].
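The channel reweighting performed by a Squeeze-and-Excitation block can be sketched in plain Python. The code follows the usual description of the block (global average pooling, a small two-layer network with ReLU and sigmoid, then per-channel scaling), but it is an illustration only: the feature maps and the weights `w1`, `b1`, `w2`, `b2` are hypothetical values, not trained parameters.

```python
import math


def squeeze(feature_maps):
    """Global average pooling: one scalar summary per channel."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def dense(vec, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, biases)]


def excite(summary, w1, b1, w2, b2):
    """Bottleneck network: dense -> ReLU -> dense -> sigmoid gate per channel."""
    hidden = [max(0.0, h) for h in dense(summary, w1, b1)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w2, b2)]


def se_block(feature_maps, w1, b1, w2, b2):
    """Scale every channel by its learned gate in (0, 1)."""
    gates = excite(squeeze(feature_maps), w1, b1, w2, b2)
    scaled = [[[g * v for v in row] for row in fmap]
              for fmap, g in zip(feature_maps, gates)]
    return scaled, gates


# Two channels of 2x2 activations and a one-unit bottleneck (all hypothetical).
maps = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [0.0, 8.0]]]
w1, b1 = [[0.5, -0.5]], [0.0]        # 2 channels -> 1 hidden unit
w2, b2 = [[2.0], [-2.0]], [0.0, 0.0]  # 1 hidden unit -> 2 gates

scaled, gates = se_block(maps, w1, b1, w2, b2)
print([round(g, 3) for g in gates])
```

With these toy weights the first channel receives a gate above 0.5 and the second a gate below 0.5, so the block amplifies one feature map relative to the other.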

The competition ended after 2017. The organizers concluded that the classification task on the 1,000-class subset had been effectively saturated, with error rates approaching the noise floor of label ambiguity in the dataset itself.

## Impact on the Deep Learning Revolution

ImageNet's influence on AI research and industry is difficult to overstate. The dataset and its associated challenge served multiple roles in driving progress.

### Standardized Benchmarking

Before ImageNet, computer vision research lacked a universally accepted large-scale benchmark. Different papers evaluated on different datasets with different metrics, making it hard to compare results. ILSVRC provided a common yardstick that the entire community could rally around. This standardization accelerated progress by making it immediately clear when a genuinely better approach had been found.

### Demonstrating the Data Hypothesis

ImageNet validated Fei-Fei Li's original thesis: that large, diverse datasets could unlock capabilities that smaller datasets could not. The architectures that succeeded on ImageNet (particularly AlexNet) were not entirely new. Convolutional neural networks had existed since the late 1980s (LeCun et al., 1989) [12], and GPU training had been explored before. What was new was the combination of a sufficiently large and challenging dataset, enough computational power (GPUs), and a few key architectural tricks. ImageNet provided the data that made the rest possible.

### Catalyzing Hardware Investment

The success of GPU-trained neural networks on ImageNet caught the attention of hardware companies, particularly [NVIDIA](/wiki/nvidia). NVIDIA pivoted aggressively toward deep learning, developing specialized hardware (the Tesla K40, K80, P100, V100, A100, H100 series) and software libraries (cuDNN) optimized for neural network training. This hardware investment created a virtuous cycle: better hardware enabled larger models, which achieved better results, which attracted more investment.

### Enabling Transfer Learning

Models trained on ImageNet proved to be excellent starting points for other vision tasks. The features learned in the early layers of an ImageNet-trained CNN (edges, textures, shapes) are broadly useful for visual recognition in general. Researchers discovered that fine-tuning an ImageNet pre-trained model on a smaller target dataset consistently outperformed training from scratch, even when the target domain (e.g., medical imaging, satellite imagery) differed substantially from ImageNet's everyday objects. This transfer learning paradigm became the standard approach in computer vision and later inspired similar approaches in [natural language processing](/wiki/natural_language_processing) (e.g., [BERT](/wiki/bert), [GPT](/wiki/gpt)).

### Inspiring Other Benchmarks and Datasets

ImageNet's success inspired the creation of numerous other large-scale datasets:

- **COCO (Common Objects in Context)**: Developed by a team including Tsung-Yi Lin and others at Microsoft, COCO focuses on object detection, segmentation, and captioning with 330,000+ images [13].
- **Open Images**: A Google dataset with 9 million images and thousands of object classes with bounding box and segmentation annotations.
- **Places**: An MIT dataset focused on scene recognition with 10 million images across 400+ scene categories.

## Controversies

Despite its enormous contributions to AI research, ImageNet has faced significant criticism on several fronts.

### Offensive and Problematic Categories

ImageNet inherited its category structure from WordNet, which includes nouns describing people. Many of these categories contained derogatory, racist, or sexist labels. In 2019, artist Trevor Paglen and AI researcher Kate Crawford created "ImageNet Roulette," a web application that classified uploaded photos of people using ImageNet's person categories. The project revealed that the system labeled people with terms including racial slurs, gendered insults, and other offensive language [14].

The ImageNet team identified 1,593 problematic person categories (approximately 54% of the 2,932 person-related synsets) and removed them from the dataset. In total, approximately 600,000 images from the person subtree were affected.

### Consent and Privacy

ImageNet images were scraped from the internet, primarily from platforms like Flickr, without the explicit consent of the people depicted or the photographers who took the images. Many individuals in the dataset had no idea their photos were being used to train AI systems. This raised questions about data rights and privacy that have only grown more pressing as AI systems trained on such data have been deployed in consequential real-world settings.

In response, the ImageNet team blurred the faces of people in 243,198 images across the dataset. However, critics noted that blurring faces after years of unblurred distribution does not undo the use of the original unblurred images in training systems already deployed.

### Demographic and Geographic Bias

Research has shown that ImageNet images are heavily skewed toward North America and Western Europe. Approximately 45% of images originate from the United States, while China and India (which together represent over a third of the world's population) account for roughly 1% and 2.1% of images respectively [15]. This geographic imbalance means that models trained on ImageNet may perform better on objects, settings, and cultural contexts familiar in Western countries and worse on those from the Global South.

Additionally, the distribution of object categories reflects the biases of English-language WordNet and Western cultural perspectives. Many everyday objects, foods, and activities from non-Western cultures are underrepresented or entirely absent.

### Critical Scholarship

A 2021 paper by Emily Denton, Alex Hanna, and colleagues, "On the Genealogy of Machine Learning Datasets," examined ImageNet as a case study in how training datasets encode historical and social assumptions [16]. The paper traced ImageNet's roots to earlier classification projects and argued that the dataset's construction process, from the choice of WordNet as an organizing principle to the use of crowd workers with minimal context for annotation, embedded particular views about how the visual world should be categorized.

### Summary of Controversies

| Issue | Details | Response |
|---|---|---|
| Offensive labels | Derogatory terms in person categories | Removed 1,593 categories (~600K images) |
| Consent | Images scraped without subjects' permission | Blurred faces in 243,198 images |
| Geographic bias | 45% of images from the US | Acknowledged; no full remediation |
| Demographic bias | Underrepresentation of non-Western cultures | Ongoing research into mitigation |
| Annotation quality | Crowd workers made errors; ambiguous categories | Quality control through redundant labeling |

## Legacy and Current State

ImageNet's legacy is secure as one of the most consequential datasets in the history of computer science. It demonstrated that large-scale data collection, combined with community benchmarking, could drive transformative progress in AI. The ImageNet moment of 2012 is frequently cited as the beginning of the modern deep learning era.

### Is ImageNet still used in 2026?

Yes. As of 2026, the dataset remains available for research through the ImageNet website (image-net.org). The ILSVRC competition concluded after 2017, but ImageNet continues to be used as a standard benchmark for evaluating new architectures, training techniques, and [data augmentation](/wiki/data_augmentation) strategies. Pre-trained ImageNet models remain the default initialization for computer vision tasks in both research and production.

However, the field has also moved beyond ImageNet in important ways:

- **Larger and more diverse datasets**: Modern vision models are increasingly trained on web-scale datasets (e.g., [LAION](/wiki/laion)-5B with 5.85 billion image-text pairs) that dwarf ImageNet in size.
- **Multimodal training**: Models like [CLIP](/wiki/clip) (Contrastive Language-Image Pre-training; [OpenAI](/wiki/openai), 2021) and SigLIP (Google, 2023) learn visual representations from paired image-text data rather than from category labels, achieving strong zero-shot performance across a wide range of visual tasks [17].
- **Self-supervised pre-training**: Methods like DINO, MAE (Masked [Autoencoder](/wiki/autoencoder)), and DINOv2 learn visual representations without any labels, reducing dependence on human-annotated datasets.
- **Vision Transformers**: The [Vision Transformer](/wiki/vision_transformer) (ViT), introduced by Dosovitskiy et al. in 2020, demonstrated that [Transformer](/wiki/transformer) architectures could match or exceed CNNs on ImageNet classification, opening a new chapter in computer vision architecture design [18].
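The zero-shot classification mentioned in the multimodal item above can be sketched as a nearest-neighbor search in a shared embedding space: the image is assigned the label whose text embedding is most similar to the image embedding. The example is a toy illustration. The three-dimensional vectors and prompt strings are made up, and no real CLIP model or API is involved.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings):
    """Pick the prompt whose embedding is closest to the image embedding."""
    return max(text_embeddings,
               key=lambda prompt: cosine_similarity(image_embedding,
                                                    text_embeddings[prompt]))


# Hypothetical embeddings; a real model would produce these with its encoders.
text_embeddings = {
    "a photo of a dog": [1.0, 0.0, 0.1],
    "a photo of a car": [0.0, 1.0, 0.0],
    "a photo of a flower": [0.1, 0.2, 1.0],
}
image_embedding = [0.9, 0.1, 0.2]

print(zero_shot_classify(image_embedding, text_embeddings))
```

No class-specific training is needed in this scheme: supporting a new category only requires embedding a new text prompt, which is what distinguishes the approach from a fixed 1,000-way classifier.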

ImageNet's greatest contribution may be not the dataset itself, but the principle it established: that investing in large, well-organized datasets is just as important as investing in algorithms. This lesson has been applied repeatedly, from the text corpora that power large language models to the protein structure databases that enabled [AlphaFold](/wiki/alphafold). Fei-Fei Li's stubborn conviction that data was the bottleneck, once dismissed by some colleagues as misguided, proved to be one of the most prescient insights in the history of artificial intelligence.

## Explain Like I'm 5 (ELI5)

Imagine you have a giant box of more than 14 million photos, and each photo has a label telling you what is in the picture: "dog," "car," "flower," and thousands of other things. ImageNet is that giant box of labeled photos. Scientists held a yearly contest where teams built computer programs to try to guess what was in photos from the box. In 2012, one team used a special kind of computer program (a deep neural network) that was so much better than anything before it that everyone realized this was the future. That contest and those photos helped start the revolution in AI that gave us things like self-driving cars, photo search on your phone, and much more.

## References

[1] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). "ImageNet: A Large-Scale Hierarchical Image Database." 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248-255. https://ieeexplore.ieee.org/document/5206848

[2] Lin, Y., Lv, F., Zhu, S., Yang, M., Cour, T., Yu, K., Cao, L., & Huang, T. (2011). "Large-Scale Image Classification: Fast Feature Extraction and SVM Training." CVPR 2011.

[3] Perronnin, F., Sanchez, J., & Mensink, T. (2010). "Improving the Fisher Kernel for Large-Scale Image Classification." ECCV 2010.

[4] Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems 25 (NIPS 2012). https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks

[5] Zeiler, M.D. & Fergus, R. (2014). "Visualizing and Understanding Convolutional Networks." ECCV 2014. arXiv:1311.1901. https://arxiv.org/abs/1311.1901

[6] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). "Going Deeper with Convolutions." CVPR 2015. https://arxiv.org/abs/1409.4842

[7] Simonyan, K. & Zisserman, A. (2015). "Very Deep Convolutional Networks for Large-Scale Image Recognition." ICLR 2015. arXiv:1409.1556. https://arxiv.org/abs/1409.1556

[8] He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep Residual Learning for Image Recognition." CVPR 2016. arXiv:1512.03385. https://arxiv.org/abs/1512.03385

[9] ILSVRC 2016 Results. https://image-net.org/challenges/LSVRC/2016/results

[10] Hu, J., Shen, L., & Sun, G. (2018). "Squeeze-and-Excitation Networks." CVPR 2018. arXiv:1709.01507. https://arxiv.org/abs/1709.01507

[11] Karpathy, A. (2014). "What I learned from competing against a ConvNet on ImageNet." https://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/

[12] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., & Jackel, L.D. (1989). "Backpropagation Applied to Handwritten Zip Code Recognition." Neural Computation, 1(4), 541-551.

[13] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., & Zitnick, C.L. (2014). "Microsoft COCO: Common Objects in Context." ECCV 2014. https://arxiv.org/abs/1405.0312

[14] Crawford, K. & Paglen, T. (2019). "Excavating AI: The Politics of Training Sets for Machine Learning." https://excavating.ai

[15] Shankar, S., Halpern, Y., Breck, E., Atwood, J., Wilson, J., & Sculley, D. (2017). "No Classification without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." NIPS 2017 Workshop.

[16] Denton, E., Hanna, A., Amironesei, R., Smart, A., & Nicole, H. (2021). "On the Genealogy of Machine Learning Datasets: A Critical History of ImageNet." Big Data & Society. https://journals.sagepub.com/doi/full/10.1177/20539517211035955

[17] Radford, A., Kim, J.W., Hallacy, C., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." ICML 2021. https://arxiv.org/abs/2103.00020

[18] Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021. arXiv:2010.11929. https://arxiv.org/abs/2010.11929

[19] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., & Fei-Fei, L. (2015). "ImageNet Large Scale Visual Recognition Challenge." International Journal of Computer Vision, 115(3), 211-252. arXiv:1409.0575. https://arxiv.org/abs/1409.0575

