Image Classification Models

Updated Jun 21, 2026

See also: Computer Vision Models and Tasks

Image classification models are machine learning systems that assign one or more category labels to a whole input image, the task that drove the modern wave of deep learning in computer vision. The defining benchmark is top-1 and top-5 accuracy on ImageNet-1K, the 1000 class ILSVRC subset with about 1.28 million training images. The architecture lineage runs from AlexNet (2012, 15.3 percent top-5 error) through VGG, GoogLeNet (Inception), ResNet (2015, 3.57 percent top-5 error), and EfficientNet to the Vision Transformer (2020) and ConvNeXt (2022). The highest fine-tuned ImageNet-1K result cited in this article is 91.0 percent top-1, reported for Google's CoCa.¹

Given a fixed vocabulary of classes, the model scores every class, and evaluation checks either the single highest-scoring label (top-1) or whether the true label appears among the five highest-scoring labels (top-5). The task is distinct from object detection, which localizes multiple objects with bounding boxes, and from semantic segmentation, which assigns a class to every pixel. The ImageNet dataset and its annual ILSVRC competition were the proving ground for whole-image classification from 2010 through 2017.
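
The top-1 and top-5 metrics can be illustrated with a short sketch. The function name `topk_accuracy` and the toy scores are assumptions made for this example, not part of any benchmark toolkit.

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score, truncated to k.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)


# Made-up scores for 3 images over 6 classes.
scores = [
    [0.10, 0.50, 0.20, 0.10, 0.05, 0.05],  # true class 1 is ranked first
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],  # true class 2 is ranked second
    [0.40, 0.30, 0.10, 0.10, 0.06, 0.04],  # true class 5 is ranked last
]
labels = [1, 2, 5]

top1 = topk_accuracy(scores, labels, k=1)
top5 = topk_accuracy(scores, labels, k=5)
print(f"top-1 = {top1:.3f}, top-5 = {top5:.3f}")
```

On this toy data only the first image counts as a top-1 hit, while the first two count as top-5 hits, which is why top-5 accuracy is always at least as high as top-1.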

The field moved through three broad phases. Before 2012, classifiers relied on handcrafted descriptors fed into linear models. From 2012 to 2020, deep convolutional neural networks (CNNs) dominated. Since 2020, vision transformers and large self-supervised or language-image foundation models have largely replaced task-specific classifiers, and frozen features from CLIP and DINOv2 are now the default starting point for downstream classification work.

What came before deep learning (before 2012)?

Before AlexNet, image classification combined two stages: a feature extractor turned pixels into a fixed length descriptor, and a separate classifier such as a support vector machine assigned labels. The widely used descriptors were SIFT (Lowe 2004), HOG (Dalal and Triggs 2005), and the bag of visual words model that aggregated local features into histograms over a learned codebook. Fisher vectors and VLAD encodings refined this pipeline.

The ImageNet dataset, introduced by Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li at CVPR 2009,² expanded training data by several orders of magnitude over earlier benchmarks like Caltech-101 or PASCAL VOC. The ILSVRC competition first ran in 2010 on a 1000 class subset, with about 1.28 million training images.³ Winning 2010 and 2011 entries were SIFT plus Fisher vector pipelines with linear SVMs, hitting top-5 error near 25 to 28 percent. A CNN entry broke that ceiling the next year.

How did the CNN era unfold (2012 to 2020)?

The second phase opened when AlexNet won ILSVRC 2012 with a top-5 error of 15.3 percent, more than ten points ahead of the runner up.⁴ Built by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton at the University of Toronto, AlexNet had eight learned layers (five convolutional and three fully connected), used ReLU activations and dropout, and trained for about a week on two NVIDIA GTX 580 GPUs. With roughly 60 million parameters, it reset the field and made convolutional networks the default.

ZFNet (Zeiler and Fergus, 2013) won ILSVRC 2013 and added visualization techniques for interpreting convolutional layers. VGG (Simonyan and Zisserman, Oxford, arXiv 1409.1556, 2014) showed that stacking small 3x3 convolutions to 16 or 19 weight layers produced strong representations; VGG-16 has about 138 million parameters.⁵ GoogLeNet, the Inception v1 network from Szegedy et al. at Google (arXiv 1409.4842, 2014), won ILSVRC 2014 with only about 7 million parameters thanks to multi branch Inception modules.⁶ The Inception family later included v2, v3, v4, and Inception-ResNet.

ResNet (He, Zhang, Ren, Sun at Microsoft Research Asia, arXiv 1512.03385, December 2015) introduced residual connections that let signals skip over blocks of layers.⁷ The authors wrote that they "explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions," and evaluated nets "with a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity."⁷ An ensemble of these residual nets achieved 3.57 percent top-5 error on the ImageNet test set and won ILSVRC 2015, surpassing the roughly 5 percent estimated human top-5 error.⁷ ResNet-50, with about 25.6 million parameters, remains one of the most widely used backbones a decade later.
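
The residual idea can be sketched in a few lines. This is a simplified illustration, assuming plain linear layers in place of the convolutions and batch normalization used in the actual networks; the helper names are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU between."""
    fx = linear(w2, relu(linear(w1, x)))  # the residual function F(x)
    # The identity shortcut adds the block input back before the final ReLU.
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
out = residual_block(x, zeros, zeros)
print(out)
```

With all weights at zero, F(x) is zero and the block passes its non-negative input through unchanged. That is the sense in which a residual block only has to learn a correction to the identity rather than the whole mapping.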

Later CNNs refined the idea. DenseNet (Huang et al., 2016) connected every layer to every later one inside dense blocks. ResNeXt (2016) added grouped convolutions. SE-Net (Hu, Shen, Sun, 2017) added squeeze-and-excitation channel attention and won ILSVRC 2017, the final year.⁸ MobileNet (Howard et al. at Google, 2017) used depthwise separable convolutions for on device inference, with V2 (2018) and V3 (2019) following.⁹ EfficientNet (Tan and Le at Google, arXiv 1905.11946, ICML 2019) introduced compound scaling across depth, width, and resolution. EfficientNet-B7 reached 84.4 percent top-1 (97.1 percent top-5) on ImageNet while being about 8.4 times smaller and 6.1 times faster on inference than the previous best CNN.¹⁰ EfficientNetV2 (2021) and RegNet (Meta, 2020) refined scaling further.
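
The saving from depthwise separable convolutions comes down to parameter arithmetic, sketched below. The layer sizes are illustrative assumptions, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Illustrative layer: 3 x 3 kernel, 256 input and 256 output channels.
k, c_in, c_out = 3, 256, 256
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
ratio = std / sep
print(std, sep, round(ratio, 2))
```

For this layer the separable version needs roughly 8.7 times fewer weights; the ratio approaches k squared as the number of output channels grows.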

How did transformers change image classification (2020 to present)?

The third phase began when the Vision Transformer (ViT) from Alexey Dosovitskiy et al. at Google Research (arXiv 2010.11929, October 2020) showed that a pure transformer encoder applied to image patches could match or beat the best CNNs when pretrained on enough data.¹¹ The paper argued that "a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks," challenging the assumption that convolutions were necessary.¹¹ ViT splits a 224 by 224 image into 14 by 14 patches of 16 by 16 pixels, embeds each as a token, and runs a standard transformer over the sequence. Pretrained on JFT-300M, ViT-H/14 reached 88.55 percent top-1 on ImageNet.¹¹ Without large scale pretraining, ViT trailed CNNs at equal parameter count because it lacks the locality and translation equivariance built into convolutions.
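
The patch-splitting step can be sketched as follows. This is a minimal illustration on a single-channel image with dummy pixel values; the function name `patchify` is an assumption, and the learned linear projection, class token, and position embeddings that follow in ViT are omitted.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patches in row-major order."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


# Single-channel 224 x 224 image filled with dummy values.
image = [[(r * 224 + c) % 256 for c in range(224)] for r in range(224)]
tokens = patchify(image, 16)
num_tokens = len(tokens)     # 14 * 14 patches
token_dim = len(tokens[0])   # 16 * 16 values per patch and channel
print(num_tokens, token_dim)
```

A 224 by 224 input yields 196 tokens. With three color channels each flattened patch would hold 768 values before projection.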

DeiT (Hugo Touvron et al. at Meta, December 2020) showed that distillation and strong augmentation let ViT-style models train on ImageNet-1K alone. Swin Transformer (Ze Liu et al. at MSRA, arXiv 2103.14030, March 2021) reintroduced hierarchical feature maps and computed attention inside shifted local windows, reaching 87.3 percent top-1 and serving as a general backbone for detection and segmentation.¹² BEiT (Hangbo Bao et al., 2021) applied BERT-style masked image modeling using a discrete tokenizer.

The Masked Autoencoder (MAE) from Kaiming He, Xinlei Chen, and colleagues at Meta (arXiv 2111.06377, November 2021) trained ViT by masking 75 percent of input patches and reconstructing the missing pixels.¹³ A ViT-Huge pretrained with MAE reached 87.8 percent top-1 using only ImageNet-1K. ConvNeXt (Zhuang Liu et al. at Meta, arXiv 2201.03545, January 2022) modernized a ResNet step by step, borrowing design choices from transformers, and also reached 87.8 percent top-1, showing the architectural gap between CNNs and ViTs was smaller than it first appeared.¹⁴ ConvNeXtV2 (Sanghyun Woo et al., 2023) added MAE pretraining.
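
The masking step in MAE-style pretraining can be sketched as a random split of patch indices. This assumes uniform random sampling with a hypothetical helper `random_mask`; the encoder, decoder, and pixel reconstruction loss are not shown.

```python
import random


def random_mask(num_patches, mask_ratio=0.75, seed=0):
    """Return (visible, masked) patch indices for one image."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(order[:num_keep])   # patches the encoder would see
    masked = sorted(order[num_keep:])    # patches to be reconstructed
    return visible, masked


visible, masked = random_mask(196)
print(len(visible), len(masked))
```

For the 196 patches of a 224 by 224 image at patch size 16, a 75 percent ratio leaves 49 visible patches and 147 to reconstruct.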

Can models learn to classify images without labels?

A parallel line of work removed the need for labels altogether. Contrastive methods including SimCLR (Ting Chen et al. at Google, 2020) and MoCo (Kaiming He et al., 2019) trained image encoders to map two augmentations of the same image close in feature space while pushing different images apart. Non contrastive methods like BYOL and DINO (Mathilde Caron et al. at Meta, 2021) showed that self-distillation between teacher and student networks could learn strong features without negative pairs. DINOv2 (Maxime Oquab and 25 coauthors at Meta, arXiv 2304.07193, April 2023) scaled this recipe to 142 million curated images and a ViT-g/14 backbone, producing general purpose features that work well with a simple linear classifier on dozens of tasks without fine-tuning.¹⁵
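
The contrastive objective described above can be sketched as an InfoNCE-style loss. This is a simplified one-directional version under stated assumptions: tiny hand-written embeddings, cosine similarity, and a temperature of 0.1. It is not the exact loss of SimCLR or MoCo.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(view_a, view_b, temperature=0.1):
    """Mean cross-entropy of matching each row of view_a to the same row of view_b."""
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive pair
    return total / len(a)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(anchors, [[1.0, 0.1], [0.1, 1.0]])   # positives are close
shuffled = info_nce(anchors, [[0.1, 1.0], [1.0, 0.1]])  # positives are far apart
print(aligned, shuffled)
```

The loss is near zero when each embedding sits next to its own augmented view, and large when the closest embedding belongs to a different image.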

DINOv3 (Siméoni, Oquab, Vedaldi and colleagues at Meta, arXiv 2508.10104, August 2025) scaled self-supervised learning further, training a nominally 7 billion parameter ViT (about 6.7 billion parameters by count) on the curated LVD-1689M set of 1.7 billion images and distilling it into smaller ViT-S, ViT-B, ViT-L, ViT-H+, and ConvNeXt variants.¹⁶ A new Gram anchoring technique kept dense feature maps from degrading over long training schedules. Meta described it as the first self-supervised model to outperform weakly supervised models on both dense prediction and classification tasks without fine-tuning.¹⁶

Language image pretraining changed the picture again. CLIP (Radford et al. at OpenAI, arXiv 2103.00020, 2021) trained paired image and text encoders on 400 million image text pairs from the web using a contrastive objective.¹⁷ The result can perform zero-shot classification on any class set described in natural language. ALIGN from Google extended this with noisier larger data, and OpenCLIP reproduced CLIP on the open LAION data. EVA-02 (Fang, Sun et al. at BAAI, arXiv 2303.11331, March 2023) reconstructed CLIP features through masked image modeling and reached 90.0 percent fine-tuned top-1 on ImageNet-1K with only 304 million parameters.¹⁸ SigLIP (Zhai et al. at Google, arXiv 2303.15343, March 2023) replaced the softmax contrastive loss with a pairwise sigmoid loss that trains well at smaller batch sizes; its widely used So400m variant became a default open vision encoder.¹⁹ SigLIP 2 (Tschannen et al. at Google, arXiv 2502.14786, February 2025) added captioning and self-supervised objectives and improved zero-shot accuracy at every scale, with the ViT-g/16 variant reaching about 85.0 percent zero-shot top-1 on ImageNet.²⁰ CoCa (Yu et al. at Google, arXiv 2205.01917, May 2022) jointly trained a contrastive and a captioning objective and reported 86.3 percent zero-shot, 90.6 percent frozen, and 91.0 percent fine-tuned top-1 on ImageNet, the highest figure on the benchmark as of 2026.¹
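
Zero-shot classification with a language-image model reduces to a nearest-neighbor lookup between one image embedding and one text embedding per class prompt. The sketch below assumes hypothetical 3-dimensional embeddings written by hand; in practice they would come from the model's image and text encoders.

```python
import math


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, text_embeddings):
    """Return the prompt whose text embedding is most similar to the image embedding."""
    return max(text_embeddings,
               key=lambda prompt: cosine(image_embedding, text_embeddings[prompt]))


# Hypothetical embeddings; real ones have hundreds of dimensions.
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.2, 0.8, 0.1]
prediction = zero_shot_classify(image_embedding, text_embeddings)
print(prediction)
```

Changing the class set only requires new prompts and their text embeddings; the image encoder is left untouched.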

Other foundation models include SAM (Segment Anything, Meta, April 2023) for promptable segmentation and SAM 2 (2024) for video. Apple shipped AIMv1 (2024) and AIMv2 (arXiv 2411.14402, November 2024); the AIMv2-3B encoder reaches 89.5 percent top-1 on ImageNet-1K with a frozen trunk, and the 681 million parameter AIMv2-Huge reaches 87.5 percent.²¹ Google scaled supervised ViT to ViT-22B in 2023.

What is the ImageNet accuracy progression?

The ILSVRC top-5 error trajectory captures progress on the canonical benchmark.

Year	Winning model	Top-5 error	Notes
2011	Fisher vector plus SVM	25.8%	Last pre deep learning winner
2012	AlexNet	15.3%	First CNN winner
2013	ZFNet	11.7%	Visualization driven tuning
2014	GoogLeNet (Inception v1)	6.7%	VGG-16 second at 7.3%
2015	ResNet-152 ensemble	3.57%	Surpassed human level (about 5%)
2016	Trimps-Soushen ensemble	2.99%	CUImage close behind
2017	SE-Net (SENet-154)	2.25%	Final year of the challenge

Later results outside the challenge include EfficientNet-B7 at 84.4 percent top-1 and ViT-H/14 pretrained on JFT-300M at 88.55 percent top-1. Image-text foundation models then pushed past 90 percent: EVA-02 reached 90.0 percent fine-tuned top-1 in 2023,¹⁸ and CoCa holds the standard fine-tuned ImageNet-1K record at 91.0 percent top-1.¹

Notable model summary

Model	Year	Lab	Params	Key idea
LeNet-5	1998	LeCun et al.	60K	First CNN at scale, on digits
AlexNet	2012	Toronto	60M	ReLU, dropout, GPU training
ZFNet	2013	NYU	60M	Visualization, smaller stride
VGG-16	2014	Oxford	138M	Stacked 3x3 convs
GoogLeNet	2014	Google	7M	Inception modules
ResNet-50	2015	MSRA	25.6M	Residual connections
DenseNet-121	2016	Cornell/Tsinghua	8M	Dense feature reuse
MobileNet	2017	Google	4.2M	Depthwise separable convs
SE-Net (SENet-154)	2017	Momenta	145M	Squeeze excitation attention
EfficientNet-B7	2019	Google	66M	Compound scaling
ViT-B/16	2020	Google	86M	Transformer on 16x16 patches
DeiT-B	2020	Meta	86M	Distillation, ImageNet only
Swin-B	2021	MSRA	88M	Hierarchical shifted windows
CLIP ViT-L/14	2021	OpenAI	304M	Language image contrastive
MAE ViT-H	2021	Meta	632M	Masked patch reconstruction
CoAtNet	2021	Google	75M to 2.4B	Conv attention hybrid
ConvNeXt-B	2022	Meta	89M	Modernized CNN
CoCa	2022	Google	2.1B	Contrastive plus captioning, 91.0% top-1
EVA-02	2023	BAAI	304M	MIM of CLIP features, 90.0% top-1
DINOv2 ViT-L	2023	Meta	304M	Self-supervised at scale
AIMv2-3B	2024	Apple	3B	Autoregressive, 89.5% frozen top-1
SigLIP 2 ViT-g/16	2025	Google	1B	Sigmoid loss, 85.0% zero-shot
DINOv3 ViT-7B	2025	Meta	6.7B	Self-supervised at scale

How are image classifiers pretrained?

Modern classifiers rarely train from scratch; instead they fine-tune a pretrained backbone. Pretraining recipes fall into five broad groups.

Strategy	Representative methods or datasets	Data source
Supervised	ImageNet-1K, ImageNet-21K, JFT-300M	Labeled images
Self supervised contrastive	SimCLR, MoCo v1 to v3	Unlabeled images
Self supervised non contrastive	BYOL, DINO, DINOv2	Unlabeled images
Masked image modeling	BEiT, MAE, SimMIM, ConvNeXtV2	Unlabeled images
Language image	CLIP, ALIGN, SigLIP, SigLIP 2, EVA-02, CoCa, AIMv2	Image text pairs from the web

Google's JFT-300M (later expanded to JFT-3B) drove much of the gain on ViT-style scaling, but it has never been released, so open web-scale datasets such as LAION-5B and DataComp are the standard for academic work.

What benchmarks measure image classification?

Benchmark	Images	Classes	Purpose
ImageNet-1K (ILSVRC)	1.28M train, 50K val	1000	Standard accuracy benchmark
ImageNet-21K	about 14M	21,843	Pretraining for transfer
CIFAR-10 / 100	60K	10 / 100	Small image fast iteration
Places365	1.8M	365	Scene classification
iNaturalist	2.7M	10,000+	Fine grained species classification
ImageNet-V2	10K	1000	Distribution shift on the same classes
ImageNet-A	7,500	200	Natural adversarial examples
ImageNet-C	corrupted ImageNet-1K	1000	Common corruption robustness
ImageNet-R	30K	200	Renditions (art, sketches)
JFT-300M / JFT-3B	300M / 3B	tens of thousands	Google internal pretraining

The usual metrics are top-1 and top-5 accuracy. Calibration (expected calibration error, ECE) and robustness across ImageNet-C, ImageNet-A, and ImageNet-R have become standard supplementary numbers because headline ImageNet accuracy is close to saturated: top-5 error fell below the estimated human level of about 5 percent in 2015.
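
Expected calibration error can be sketched as a weighted gap between confidence and accuracy per confidence bin. The version below assumes ten equal-width bins that are closed on the left, which is one common convention among several; the confidences and outcomes are made up.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Made-up predictions: top-class confidence and whether the prediction was right.
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 4))
```

Here the high-confidence bin is right 75 percent of the time at 95 percent confidence, and the mid bin is right half the time at 55 percent confidence, giving an ECE of 0.15.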

What does the landscape look like in 2024 to 2026?

Task specific image classifiers are slowly being replaced by general purpose visual representations. Teams that needed a custom classifier in 2018 now often take a frozen CLIP, SigLIP, DINOv2 or v3, or AIM encoder, extract patch or CLS token features, and fit a small linear or MLP head per task. Zero-shot CLIP or SigLIP often serves as a baseline before any labeled data is collected. As of 2025 the strongest open self-supervised backbone is DINOv3, whose 7 billion parameter teacher and distilled variants set the state of the art on a wide range of probing tasks without fine-tuning.¹⁶ Frontier multimodal models such as GPT-4o, Gemini, and Claude 3.5 Sonnet include vision encoders that classify images directly through natural language prompts, blurring the line between classification and visual question answering. ConvNet research has not disappeared; modernized CNNs still match transformers on many benchmarks while being faster on edge hardware.
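
The frozen-encoder workflow can be sketched as a softmax linear head trained on fixed feature vectors. The features below are hand-written stand-ins for encoder outputs, and the function names, learning rate, and epoch count are assumptions chosen for this toy example.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def logits_for(weights, bias, x):
    return [sum(w * v for w, v in zip(wc, x)) + b for wc, b in zip(weights, bias)]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit a linear head with SGD on cross-entropy; the features stay fixed."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    bias = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(logits_for(weights, bias, x))
            for c in range(num_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)  # dLoss / dlogit_c
                bias[c] -= lr * grad
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
    return weights, bias


def predict(weights, bias, x):
    scores = logits_for(weights, bias, x)
    return max(range(len(scores)), key=lambda c: scores[c])


# Stand-ins for frozen encoder features (hypothetical 2-d vectors, 3 classes).
features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [-1.0, -0.9], [-0.8, -1.0]]
labels = [0, 0, 1, 1, 2, 2]
weights, bias = train_linear_probe(features, labels, num_classes=3)
predictions = [predict(weights, bias, x) for x in features]
print(predictions)
```

Only the head's weights and biases are updated, which is why a probe of this kind is cheap to fit for each new task once features have been extracted.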

What is image classification used for?

Image classification is rarely the end goal in deployed systems, but it sits inside many product pipelines.

Domain	Use case
Medical imaging	Diabetic retinopathy, dermatology triage, chest X-ray screening
Industrial QC	Defect detection on manufacturing lines
Agriculture	Crop disease ID, weed versus crop classification
Content moderation	Flagging unsafe or policy violating images
Autonomous driving	Traffic sign and scene type classification
Satellite imagery	Land cover, deforestation, building footprint typing
Consumer photo apps	Album organization, face and pet tagging
Retail	Product categorization, inventory checks
Document AI	Page type and form field classification
Biodiversity	Camera trap species ID, iNaturalist

What are the limitations of image classification models?

Despite saturated benchmarks, image classification has open problems.

Distribution shift. ImageNet trained models lose accuracy on photos from different cameras or geographies; on ImageNet-V2 alone, top-1 accuracy drops by roughly 11 to 14 percentage points.
Adversarial robustness. Imperceptible pixel level perturbations can flip predictions, a weakness shared by CNNs and transformers.
Fine grained classification. Distinguishing closely related species, products, or vehicle models often needs far more labels per class than ImageNet provides.
Class imbalance and long tails. Real world image distributions follow heavy tails; iNaturalist and LVIS specifically test this.
Dataset bias. ImageNet has documented geographic, racial, and offensive label biases; later cleanups removed about 2,700 problematic person categories.
Interpretability. Saliency maps give partial insight, but classifier decisions remain hard to fully explain.
Energy and cost. Training a ViT-Huge or larger from scratch costs hundreds of GPU days, raising environmental and accessibility concerns.
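
The adversarial robustness item in the list above can be illustrated with the fast gradient sign method (FGSM), a standard attack that the list itself does not name. The sketch assumes a toy logistic-regression classifier with hand-picked weights, so the perturbation is small in absolute terms rather than literally imperceptible.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x, w, b):
    return int(sigmoid(score(x, w, b)) >= 0.5)


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(x, y, w, b, epsilon):
    """Move each input coordinate by epsilon in the direction that raises the loss."""
    p = sigmoid(score(x, w, b))
    grad = [(p - y) * wi for wi in w]  # gradient of binary cross-entropy w.r.t. x
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Hypothetical classifier weights and one input with true label 1.
w, b = [1.0, -1.0, 0.5, -0.5], 0.0
x, y = [0.3, 0.1, 0.2, 0.1], 1
x_adv = fgsm(x, y, w, b, epsilon=0.1)
before, after = predict(x, w, b), predict(x_adv, w, b)
print(before, after)
```

Each coordinate moves by at most 0.1, yet the prediction flips from class 1 to class 0 because every small change pushes the score in the same direction.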

References

1. Yu, J. et al. (2022). "CoCa: Contrastive Captioners are Image-Text Foundation Models." arXiv:2205.01917. https://arxiv.org/abs/2205.01917. Accessed 2026-05-31.
2. Deng, J. et al. (2009). "ImageNet: A Large-Scale Hierarchical Image Database." CVPR. https://ieeexplore.ieee.org/document/5206848. Accessed 2026-05-31.
3. Russakovsky, O. et al. (2015). "ImageNet Large Scale Visual Recognition Challenge." arXiv:1409.0575. https://arxiv.org/abs/1409.0575. Accessed 2026-05-31.
4. Krizhevsky, A., Sutskever, I., Hinton, G. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." NeurIPS. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks. Accessed 2026-05-31.
5. Simonyan, K., Zisserman, A. (2014). "Very Deep Convolutional Networks for Large-Scale Image Recognition." arXiv:1409.1556. https://arxiv.org/abs/1409.1556. Accessed 2026-05-31.
6. Szegedy, C. et al. (2014). "Going Deeper with Convolutions." arXiv:1409.4842. https://arxiv.org/abs/1409.4842. Accessed 2026-05-31.
7. He, K., Zhang, X., Ren, S., Sun, J. (2015). "Deep Residual Learning for Image Recognition." arXiv:1512.03385. https://arxiv.org/abs/1512.03385. Accessed 2026-06-22.
8. Hu, J., Shen, L., Sun, G. (2017). "Squeeze-and-Excitation Networks." arXiv:1709.01507. https://arxiv.org/abs/1709.01507. Accessed 2026-05-31.
9. Howard, A. et al. (2017). "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications." arXiv:1704.04861. https://arxiv.org/abs/1704.04861. Accessed 2026-05-31.
10. Tan, M., Le, Q. (2019). "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks." arXiv:1905.11946. https://arxiv.org/abs/1905.11946. Accessed 2026-06-22.
11. Dosovitskiy, A. et al. (2020). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." arXiv:2010.11929. https://arxiv.org/abs/2010.11929. Accessed 2026-06-22.
12. Liu, Z. et al. (2021). "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows." arXiv:2103.14030. https://arxiv.org/abs/2103.14030. Accessed 2026-05-31.
13. He, K. et al. (2021). "Masked Autoencoders Are Scalable Vision Learners." arXiv:2111.06377. https://arxiv.org/abs/2111.06377. Accessed 2026-05-31.
14. Liu, Z. et al. (2022). "A ConvNet for the 2020s." arXiv:2201.03545. https://arxiv.org/abs/2201.03545. Accessed 2026-05-31.
15. Oquab, M. et al. (2023). "DINOv2: Learning Robust Visual Features without Supervision." arXiv:2304.07193. https://arxiv.org/abs/2304.07193. Accessed 2026-05-31.
16. Siméoni, O. et al. (2025). "DINOv3." arXiv:2508.10104. https://arxiv.org/abs/2508.10104. Model card and specifications: https://ai.meta.com/research/publications/dinov3/ and https://huggingface.co/facebook/dinov3-vit7b16-pretrain-lvd1689m. Accessed 2026-05-31.
17. Radford, A. et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision" (CLIP). arXiv:2103.00020. https://arxiv.org/abs/2103.00020. Accessed 2026-05-31.
18. Fang, Y., Sun, Q. et al. (2023). "EVA-02: A Visual Representation for Neon Genesis." arXiv:2303.11331. https://arxiv.org/abs/2303.11331. Accessed 2026-05-31.
19. Zhai, X. et al. (2023). "Sigmoid Loss for Language Image Pre-Training" (SigLIP). arXiv:2303.15343. https://arxiv.org/abs/2303.15343. Accessed 2026-05-31.
20. Tschannen, M. et al. (2025). "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features." arXiv:2502.14786. https://arxiv.org/abs/2502.14786. Accessed 2026-05-31.
21. Fini, E. et al. (2024). "Multimodal Autoregressive Pre-training of Large Vision Encoders" (AIMv2). arXiv:2411.14402. https://arxiv.org/abs/2411.14402. Accessed 2026-05-31.

Related Articles

Image-to-Image Models

Segment Anything Model and Dataset (SAM and SA-1B)

Unconditional Image Generation Models

Video Classification Models

Visual Question Answering Models

Zero-Shot Image Classification Models
