Machine learning terms/Computer Vision
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,967 words
See also: Machine learning terms
Computer vision (CV) is the subfield of artificial intelligence and machine learning that builds systems capable of extracting information from digital images, video, and other visual inputs. The discipline aims to give machines the ability to identify objects, recognize people, understand scenes, infer geometry, and reason about events. Modern computer vision has converged with deep learning, and most state of the art systems are built on convolutional neural networks, vision transformers, and multimodal foundation models trained on hundreds of millions of image text pairs.
The field traces its origins to early experiments at MIT in the 1960s, including the 1966 Summer Vision Project led by Seymour Papert, which famously underestimated the difficulty of human level visual perception. Six decades of research have produced techniques ranging from hand crafted detectors such as SIFT and HOG to billion parameter foundation models such as CLIP, SAM, and DINOv2. Computer vision now powers face unlock, autonomous driving, medical imaging diagnostics, satellite analysis, content moderation, augmented reality, and generative tools that turn text prompts into photorealistic images.
Computer vision history divides into several broad eras, each defined by the dominant image representation and algorithms.
| Era | Years | Defining ideas | Representative work |
|---|---|---|---|
| Early symbolic | 1960s to 1970s | Block worlds, line drawings, edge reasoning | Roberts (1963), Waltz (1972) |
| Geometric and physics based | 1980s | Stereo vision, shape from shading, optical flow | Marr's Vision (1982), Horn and Schunck (1981) |
| Feature engineering | 1990s to early 2000s | Hand crafted descriptors, statistical learning | SIFT, HOG, Viola Jones |
| Deep learning | 2012 to 2019 | End to end CNNs on GPUs | AlexNet, VGG, ResNet, Faster R-CNN |
| Foundation models | 2020 to present | Attention encoders, multimodal pretraining | ViT, CLIP, SAM, Stable Diffusion |
The transition to deep learning dates to the 2012 ImageNet Challenge, where AlexNet (Krizhevsky, Sutskever, Hinton) cut top 5 error from 26.2 to 15.3 percent, ending the era of hand crafted features almost overnight.
Computer vision is organized around a small number of canonical tasks. Real applications often combine several of them, but treating each as a benchmark has driven steady progress.
Image classification assigns one or more labels to a whole image. The classic benchmark is ImageNet with 1,000 categories and 1.28 million training images. Top 1 accuracy rose from roughly 60 percent for AlexNet in 2012 to about 90 to 91 percent for large pretrained models such as EVA-02 and CoCa.
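Top 1 and top 5 accuracy, the usual ImageNet metrics, can be computed from a matrix of class scores. The following is a minimal NumPy sketch; the function name `top_k_accuracy` and the toy scores are illustrative assumptions, not part of any benchmark toolkit.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Indices of the k largest scores in each row (order inside the top k is irrelevant).
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Made-up scores for 3 images over 4 classes.
scores = np.array([
    [0.1, 0.6, 0.2, 0.1],  # true class 1: correct at top 1
    [0.5, 0.1, 0.3, 0.1],  # true class 2: correct only within the top 2
    [0.4, 0.3, 0.2, 0.1],  # true class 3: missed even within the top 2
])
labels = np.array([1, 2, 3])

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(f"top 1: {top1:.3f}, top 2: {top2:.3f}")
```

On ImageNet the same computation would run over the 50,000 validation images with k set to 1 and 5.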
Object detection localizes and classifies multiple objects, producing bounding boxes with class labels and confidence scores. It is evaluated using intersection over union thresholds and mean average precision (mAP). The standard benchmark is MS COCO (2014), with 80 categories and over 200,000 labeled images.
Segmentation assigns a label to every pixel. Three main variants exist:
| Variant | What it labels | Example output |
|---|---|---|
| Semantic | Each pixel by class only | All pixels belonging to any car share one color |
| Instance | Each pixel by class and instance | Each car is colored separately |
| Panoptic | Class plus instance for things, class only for stuff | Cars individually colored, road one solid color |
The things (countable objects) versus stuff (amorphous regions like sky) distinction predates deep learning; panoptic segmentation, which handles both in a single task, was formalized by Kirillov and colleagues in 2019.
Keypoint detection identifies landmarks such as eye corners or body joints. Keypoints and landmarks are used in face alignment, hand tracking, animal pose, and motion capture. Notable systems include OpenPose (2017) and MediaPipe (2019).
Depth estimation predicts pixel wise distance from the camera. It can use stereo pairs, structured light, time of flight sensors, or a single monocular image. Modern monocular models such as MiDaS (2020) and Depth Anything (2024) generalize across domains.
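For the stereo case, depth follows from disparity through the standard relation Z = f · B / d for a rectified camera pair. The sketch below assumes a focal length in pixels and a baseline in meters; the function name and the numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters from disparity in pixels, for a rectified stereo pair."""
    disparity_px = np.asarray(disparity_px, dtype=float)
    depth = np.full(disparity_px.shape, np.inf)
    valid = disparity_px > 0  # zero disparity corresponds to a point at infinity
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth

# Hypothetical rig: 700 px focal length, 0.5 m baseline.
depth = depth_from_disparity([70.0, 35.0, 0.0], focal_px=700.0, baseline_m=0.5)
print(depth)
```

Halving the disparity doubles the estimated depth, which is why stereo precision degrades quickly for distant objects.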
Optical flow estimates per pixel motion between video frames. Horn and Schunck (1981) and Lucas and Kanade (1981) defined the classical formulation. Deep approaches like FlowNet (2015), PWC-Net (2018), and RAFT (2020) lead modern benchmarks (Sintel, KITTI).
Additional tasks include image generation, image to image translation, super resolution, denoising, image captioning, visual question answering, action recognition, 3D reconstruction, and visual SLAM.
Before deep learning, computer vision relied on hand designed detectors and descriptors. Many remain useful today for limited data or strict latency budgets, and they are essential to robotics and structure from motion pipelines.
| Method | Year | Purpose |
|---|---|---|
| Canny edge detector | 1986 | Step changes in intensity via gradient magnitude and NMS |
| Harris corner detector | 1988 | Points where the image gradient varies in two orthogonal directions |
| SIFT | 1999 | Scale and rotation invariant keypoints, 128 dim descriptors |
| SURF | 2006 | Faster SIFT approximation using integral images |
| HOG | 2005 | Oriented gradients in cells, pedestrian detection (Dalal and Triggs) |
| ORB | 2011 | Binary descriptor for real time matching, patent free |
| Bag of Visual Words | early 2000s | Quantizes local descriptors into a vocabulary histogram |
Other classical operations include Sobel and Prewitt gradients, Laplacian of Gaussian blob detection, RANSAC for robust fitting, and the Hough transform for parametric shape detection.
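As an illustration of the gradient operators mentioned above, the following sketch applies the two 3x3 Sobel kernels to a synthetic step edge using plain NumPy. The helper names are assumptions made for this example; libraries such as OpenCV provide optimized equivalents.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter2d_valid(image, kernel):
    """Slide the kernel over the image (cross-correlation, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + out_h, j:j + out_w]
    return out

def sobel_magnitude(image):
    """Gradient magnitude from horizontal and vertical Sobel responses."""
    gx = filter2d_valid(image, SOBEL_X)
    gy = filter2d_valid(image, SOBEL_Y)
    return np.hypot(gx, gy)

# Synthetic 5x6 image: dark left half, bright right half (a vertical step edge).
image = np.zeros((5, 6))
image[:, 3:] = 1.0
magnitude = sobel_magnitude(image)
print(magnitude)
```

The response is nonzero only in the columns whose 3x3 window straddles the step, which is the behavior an edge detector such as Canny then thins with non maximum suppression.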
The modern era began with convolutional neural networks (CNNs). CNNs stack convolutional layers, ReLU nonlinearities, pooling, and fully connected layers to learn hierarchical features from pixels. Key ideas include weight sharing, translation equivariance of the convolution (with pooling adding approximate translational invariance), and stacking many small convolutional filters to build depth.
| Model | Year | Authors | Key contribution |
|---|---|---|---|
| LeNet-5 | 1998 | LeCun et al. | First successful CNN, recognized digits on MNIST |
| AlexNet | 2012 | Krizhevsky, Sutskever, Hinton | 8 layers, ReLU, dropout, dual GPU, won ImageNet 2012 |
| ZFNet | 2013 | Zeiler and Fergus | Visualization techniques, won ImageNet 2013 |
| VGG-16 and VGG-19 | 2014 | Simonyan and Zisserman, Oxford | Depth with small 3x3 filters improves accuracy |
| GoogLeNet (Inception v1) | 2014 | Szegedy et al., Google | Inception module, 22 layers, won ImageNet 2014 |
| ResNet | 2015 | He et al., Microsoft Research | Residual connections enabled 152 layer networks, won 2015 |
| Inception v3 and v4 | 2015 to 2016 | Szegedy et al. | Factorized convolutions, label smoothing |
| DenseNet | 2016 | Huang et al. | Each layer receives feature maps from all preceding layers |
| Xception | 2016 | Chollet | Depthwise separable convolutions |
| MobileNet v1 and v2 | 2017 to 2018 | Howard et al. (v1), Sandler et al. (v2), Google | Lightweight CNN for mobile devices |
| ResNeXt | 2017 | Xie et al., Facebook | Grouped convolutions, cardinality as a new dimension |
| SENet | 2017 | Hu, Shen, Sun | Squeeze and excitation channel attention, won 2017 |
| EfficientNet | 2019 | Tan and Le, Google | Compound scaling of depth, width, and resolution |
| ConvNeXt | 2022 | Liu et al., Facebook | Modernized ResNet matching transformer accuracy |
Convolutions remain dominant on mobile and edge devices due to hardware efficiency. Pooling and stride parameters control how spatial information is downsampled.
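The effect of kernel size, stride, and padding on spatial resolution follows the usual formula, out = floor((n + 2p - k) / s) + 1. A small sketch, ignoring dilation and using illustrative layer settings rather than those of any specific network:

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (n + 2 * padding - kernel) // stride + 1

# A 3x3 convolution with padding 1 and stride 1 preserves resolution.
same = conv_output_size(224, kernel=3, stride=1, padding=1)
# A 2x2 pooling layer with stride 2 halves it.
halved = conv_output_size(224, kernel=2, stride=2)
# A large strided kernel downsamples aggressively in a single layer.
strided = conv_output_size(224, kernel=11, stride=4, padding=2)
print(same, halved, strided)
```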
Object detection has been one of the most active subfields since 2014. Detectors are typically classified as two stage, single stage, or transformer based.
| Family | Representative models | Approach |
|---|---|---|
| Two stage | R-CNN (2014), Fast R-CNN (2015), Faster R-CNN (2015), Mask R-CNN (2017) | Propose regions, then classify and refine bounding boxes |
| Single stage anchor based | SSD (2015), YOLOv2/v3 (2016 to 2018), RetinaNet (2017) | Predict boxes and classes in one pass over a dense anchor grid |
| Single stage anchor free | CornerNet (2018), FCOS (2019), YOLOv8 (2023), YOLOv10/v11 (2024) | Predict centers or corners directly without predefined anchors |
| Transformer based | DETR (2020), Deformable DETR (2020), DINO (2022), Co-DETR (2023) | Encoder decoder with learned object queries, no NMS |
RetinaNet introduced focal loss to address foreground/background imbalance, reaching 39.1 mAP on COCO in 2017. Modern transformer detectors such as Co-DETR with a ViT-Large backbone reach about 66 mAP. The YOLO family, originated by Joseph Redmon in 2016, has been continued by Ultralytics (YOLOv5, YOLOv8, YOLOv11), Meituan (YOLOv6), and Tsinghua researchers (YOLOv10), and is widely deployed in real time applications.
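Focal loss scales the cross entropy of each example by (1 - p_t)^gamma, so that well classified background anchors contribute very little. The sketch below is a per example binary version in NumPy; the defaults alpha = 0.25 and gamma = 2 follow the values commonly reported for RetinaNet, and the function name and sample probabilities are assumptions for illustration.

```python
import numpy as np

def focal_loss(probs, targets, alpha=0.25, gamma=2.0):
    """Per-example binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-7, 1 - 1e-7)
    targets = np.asarray(targets)
    # p_t is the probability the model assigns to the true class.
    p_t = np.where(targets == 1, probs, 1 - probs)
    alpha_t = np.where(targets == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy background anchor (p = 0.01, label 0) and a hard foreground anchor (p = 0.1, label 1).
losses = focal_loss([0.01, 0.1], [0, 1])
print(losses)
```

With gamma set to 0 the expression reduces to alpha weighted cross entropy, which makes the modulating factor easy to test in isolation.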
| Model | Year | Notes |
|---|---|---|
| FCN | 2015 | Long, Shelhamer, Darrell. Replaced fully connected layers with convolutions |
| U-Net | 2015 | Ronneberger et al. Encoder decoder with skip connections, biomedical roots |
| DeepLab v1 to v3+ | 2014 to 2018 | Chen et al., Google. Atrous convolutions and ASPP |
| Mask R-CNN | 2017 | Extends Faster R-CNN with a mask branch for instance segmentation |
| PSPNet | 2017 | Pyramid pooling module for global context |
| HRNet | 2019 | Maintains high resolution feature maps throughout |
| Mask2Former | 2022 | Unified architecture for semantic, instance, and panoptic segmentation |
| SAM | 2023 | Meta. Promptable segmentation trained on SA-1B (1.1B masks) |
| SAM 2 | 2024 | Meta. Adds streaming video segmentation with memory |
U-Net remains the baseline for medical image segmentation, and U-Net style networks serve as the denoiser in many latent diffusion models, including Stable Diffusion versions before Stable Diffusion 3.
The Vision Transformer paper by Dosovitskiy et al. at Google Brain (2020) showed that the transformer architecture, designed for language, could match or beat CNNs on classification given enough pretraining data. ViT splits the image into patches (typically 16x16), embeds each linearly, adds positional embeddings, and processes the sequence with a standard transformer encoder.
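The patch splitting and embedding steps described above can be sketched in NumPy as follows. The embedding width of 64, the random projection, and the random positional embeddings are placeholders standing in for learned parameters, and the class token used by the original ViT is omitted for brevity.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patch vectors."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))

patches = patchify(image)  # 14 x 14 = 196 patches, each 16 * 16 * 3 = 768 values
embed_dim = 64             # hypothetical width; real ViT models use 768 or more
projection = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
positions = rng.normal(scale=0.02, size=(patches.shape[0], embed_dim))

tokens = patches @ projection + positions  # sequence fed to the transformer encoder
print(patches.shape, tokens.shape)
```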
| Model | Year | Key idea |
|---|---|---|
| ViT | 2020 | Patch embedding plus transformer encoder, scales with data |
| DeiT | 2020 | Distillation token enables ViT results from ImageNet alone |
| Swin Transformer | 2021 | Shifted window attention, linear complexity in image size |
| BEiT | 2021 | Masked image modeling with a discrete tokenizer |
| MAE | 2021 | Reconstructs randomly masked patches, strong self supervision |
| EVA-02 | 2023 | Scaled up masked image modeling backbones |
| DINOv2 | 2023 | Self distillation produces general visual features without labels |
| AIMv2 | 2024 | Apple. Autoregressive vision pretraining at scale |
Vision transformers benefit greatly from pretraining on large datasets such as JFT-300M (Google, 300M images) or LAION-2B (LAION, 2B image text pairs, also used to train Stable Diffusion).
Multimodal models aligning images and text in a shared embedding space have transformed the field since 2021. They enable zero shot classification, open vocabulary detection, image text retrieval, and serve as the visual front end of multimodal LLMs such as GPT-4o, Gemini, and Claude.
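Zero shot classification in a shared embedding space amounts to comparing an image embedding with the embeddings of one text prompt per class. The sketch below uses hand written 3 dimensional vectors in place of real encoder outputs; the prompts, the vectors, and the temperature value are all hypothetical.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Pick the prompt whose embedding has the highest cosine similarity to the image."""
    image_emb = l2_normalize(np.asarray(image_emb, dtype=float))
    text_embs = l2_normalize(np.asarray(text_embs, dtype=float))
    sims = text_embs @ image_emb      # cosine similarity per prompt
    logits = sims / temperature
    logits = logits - logits.max()    # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs

prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
# Placeholder embeddings; a real system would obtain these from the text and image encoders.
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
image_emb = np.array([0.2, 0.9, 0.1])

index, probs = zero_shot_classify(image_emb, text_embs)
print(prompts[index], probs.round(3))
```

Because classes are defined only by their prompts, adding a new category requires no retraining, just one more text embedding.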
| Model | Year | Organization | What it does |
|---|---|---|---|
| CLIP | 2021 | OpenAI | Trains image and text encoders via a contrastive loss on 400 million pairs |
| ALIGN | 2021 | Google | Similar to CLIP, trained on 1.8 billion noisy alt text image pairs |
| DALL-E | 2021 | OpenAI | Autoregressive text to image model on a discrete VAE and transformer |
| GLIDE | 2021 | OpenAI | Diffusion text to image generation with classifier free guidance |
| DALL-E 2 | 2022 | OpenAI | Two stage diffusion using CLIP image embeddings |
| Imagen | 2022 | Google | Cascaded diffusion conditioned on a frozen T5 text encoder |
| Stable Diffusion | 2022 | Stability AI, CompVis | Latent diffusion in a learned VAE space, open weights |
| Flamingo | 2022 | DeepMind | Few shot multimodal LLM with cross attention to image tokens |
| BLIP-2 | 2023 | Salesforce | Bootstrapped vision language pretraining with a Q-Former |
| LLaVA | 2023 | UW Madison, Microsoft | Connects a CLIP encoder to LLaMA via a projection layer |
| DALL-E 3 | 2023 | OpenAI | Improved prompt following via better training captions |
| SDXL | 2023 | Stability AI | 3.5 billion parameter latent diffusion with two text encoders |
| Stable Diffusion 3 | 2024 | Stability AI | Multimodal Diffusion Transformer (MMDiT) replacing the U-Net |
| FLUX.1 | 2024 | Black Forest Labs | Hybrid transformer diffusion model from ex Stable Diffusion authors |
Text to image systems typically build on denoising diffusion (DDPM style training), sample with classifier free guidance, and increasingly replace the U-Net with a diffusion transformer (DiT). Video extensions like Sora (OpenAI, 2024) and Veo (Google DeepMind, 2024) generate multi second clips using spacetime patches as tokens.
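Classifier free guidance combines two noise predictions at every sampling step, one conditioned on the prompt and one unconditional. The sketch below uses the convention found in many open implementations, where a guidance scale of 1 recovers the conditional prediction; the arrays are made-up stand-ins for denoiser outputs.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Placeholder noise predictions for a 2 element latent.
eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 1.0])

guided = guided_noise(eps_uncond, eps_cond, guidance_scale=7.5)
print(guided)
```

Larger scales push samples toward the prompt at the cost of diversity, which is why the scale is usually exposed as a user setting.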
3D computer vision recovers geometric structure from images. Classical pipelines combine multi view stereo, structure from motion (SfM), and bundle adjustment. Modern systems learn 3D representations end to end.
| Method | Year | Description |
|---|---|---|
| Photogrammetry and SfM | 1990s onwards | Camera poses and sparse 3D points from many views |
| Point clouds and PointNet | 2017 | Qi et al. Deep network operating directly on unordered, sparse 3D point sets |
| NeRF | 2020 | Mildenhall et al. Continuous radiance field, volume rendering |
| Instant NGP | 2022 | Hash grid encoding trains NeRFs in seconds |
| Gaussian Splatting | 2023 | Kerbl et al., Inria. Millions of 3D Gaussians, real time rendering |
| DUSt3R, MASt3R | 2024 | Naver Labs. Pixel aligned point maps without calibration |
Depth estimation, photometric stereo, structured light, and time of flight sensors complement learned methods in phones, vehicles, and AR headsets like the Apple Vision Pro and Meta Quest 3.
Foundation models are large pretrained models that adapt to many downstream tasks. The vision community now has several visual and vision language foundation models, often with permissive or open licenses.
| Model | Year | Organization | Pretraining objective |
|---|---|---|---|
| CLIP, OpenCLIP | 2021 to 2023 | OpenAI, LAION | Contrastive image text alignment |
| DINOv2 | 2023 | Meta | Self distillation, 142 million curated images, no labels |
| SAM, SAM 2 | 2023 to 2024 | Meta | Promptable segmentation, image and video |
| SigLIP, SigLIP 2 | 2023 to 2025 | Google | Sigmoid loss for language image pretraining |
| EVA-02 | 2023 | BAAI | Masked image modeling at billion parameter scale |
| AIMv2 | 2024 | Apple | Multimodal autoregressive pretraining |
| Florence-2 | 2024 | Microsoft | Unified vision foundation model on 5B annotations |
These models are used as drop in feature extractors. DINOv2 ViT-L/14 features exceed 86 percent top 1 ImageNet accuracy with a linear probe, and SAM is widely used in data augmentation and labeling pipelines.
Progress is closely tied to large labeled datasets. The most influential are listed below.
| Dataset | Year | Size | Purpose |
|---|---|---|---|
| MNIST | 1998 | 70,000 grayscale digit images | Handwritten digit classification |
| CIFAR-10 and CIFAR-100 | 2009 | 60,000 32x32 color images | Small scale classification |
| ImageNet ILSVRC | 2009 to 2017 | 1.28M training images, 1,000 categories | Large scale classification benchmark |
| Pascal VOC | 2005 to 2012 | 11,540 detection/segmentation images | Early detection and segmentation benchmark |
| MS COCO | 2014 onwards | 330K images, 80 classes, captions, keypoints | Detection, segmentation, captioning |
| OpenImages V7 | 2018 onwards | 9M images, 600 classes, 16M boxes | Open vocabulary detection |
| ADE20K | 2017 | 25,000 images, 150 classes | Scene parsing benchmark |
| Cityscapes | 2016 | 5,000 annotated street scenes | Urban driving segmentation |
| KITTI | 2012 | Stereo, lidar, GPS for driving | Autonomous driving research |
| LAION-5B | 2022 | 5.85B image text pairs | Pretraining for diffusion and CLIP models |
| SA-1B | 2023 | 11M images, 1.1 billion masks | Training data for SAM |
Dataset bias and licensing are persistent concerns. ImageNet's person categories were partly retired after a 2019 audit of offensive and biased labels, and LAION-5B was taken offline in December 2023 after researchers found links to illegal content; a cleaned version, Re-LAION-5B, was released in 2024.
Model scale is bounded by available training and inference hardware.
| Accelerator | Vendor | Typical use |
|---|---|---|
| CUDA GPUs (V100, A100, H100, B200) | Nvidia | Dominant choice for training and cloud inference |
| Instinct MI250X, MI300A, MI300X | AMD | MI250X powers Frontier, MI300A powers El Capitan, and MI300X is increasingly used in cloud |
| TPU v4, v5p, v5e, Trillium | Google | Internal training of Imagen, Gemini, and PaLI |
| Inferentia, Trainium | AWS | Cloud inference and training accelerators |
| Apple Neural Engine | Apple | On device vision in iPhones, iPads, and Macs |
| Hexagon NPU | Qualcomm | Smartphone camera pipelines and on device generative models |
| Edge devices (Jetson, Coral Edge TPU, Hailo) | Various | Robotics, retail analytics, drones |
Quantization, pruning, and knowledge distillation fit large models onto edge accelerators. Frameworks like Core ML, TensorRT, OpenVINO, and ONNX Runtime translate models into device specific instructions.
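Of these techniques, quantization is the simplest to illustrate. The sketch below applies per tensor affine quantization to int8 and then dequantizes; it is a simplified illustration with hypothetical function names, and production toolchains add calibration data, per channel scales, and quantization aware training.

```python
import numpy as np

def quantize_int8(x):
    """Affine per-tensor quantization of a float array to int8."""
    x = np.asarray(x, dtype=np.float32)
    # Include 0.0 in the range so that zero is represented exactly.
    lo = min(float(x.min()), 0.0)
    hi = max(float(x.max()), 0.0)
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any positive scale works
    zero_point = int(round(-128 - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.array([-1.0, 0.0, 0.5, 2.0], dtype=np.float32)
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
print(q, restored)
```

Each weight now occupies one byte instead of four, and the reconstruction error stays within about one quantization step.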
| Benchmark | Task | Primary metric |
|---|---|---|
| ImageNet ILSVRC | Image classification | Top 1 and top 5 accuracy |
| MS COCO detection | Object detection | mAP at IoU 0.50 to 0.95 |
| MS COCO segmentation | Instance segmentation | Mask AP |
| MS COCO panoptic | Panoptic segmentation | Panoptic Quality (PQ) |
| ADE20K | Semantic segmentation | Mean intersection over union (mIoU) |
| KITTI, Sintel | Stereo, flow, detection | End point error, mAP |
| LFW, IJB | Face recognition | True accept rate at fixed FAR |
| LVIS | Long tail detection | mAP over rare, common, frequent classes |
| VQAv2, GQA | Visual reasoning | Answer accuracy |
| MMMU, MMBench | Multimodal LLM evaluation | Multiple choice accuracy |
Intersection over union is the standard geometric metric. The COCO mAP averaged over IoU 0.50 to 0.95 in steps of 0.05 is much stricter than Pascal VOC at IoU 0.50.
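The difference between the two protocols is easy to see on a single prediction. The sketch below computes IoU for axis aligned boxes in (x1, y1, x2, y2) format and counts how many of the ten COCO thresholds the prediction passes; the boxes are made-up values.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

prediction = (0.0, 0.0, 10.0, 10.0)
ground_truth = (2.0, 0.0, 12.0, 10.0)  # same size, shifted right by 2 pixels

iou = box_iou(prediction, ground_truth)
coco_thresholds = np.linspace(0.50, 0.95, 10)  # 0.50, 0.55, ..., 0.95
passed = int((iou >= coco_thresholds).sum())
print(f"IoU = {iou:.3f}, passes {passed} of {len(coco_thresholds)} COCO thresholds")
```

This prediction counts as a full match under the Pascal VOC criterion at IoU 0.50 but matches at only four of the ten COCO thresholds, so small localization errors cost far more under COCO.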
| Concept | Why it matters |
|---|---|
| Convolution | Core CNN operation computing a weighted sum in a sliding window |
| Convolutional filter | A learned kernel applied across spatial locations |
| Pooling and spatial pooling | Reduces spatial resolution while preserving salient information |
| Stride | The step size used when sliding a filter across the input |
| Downsampling and subsampling | Reducing the resolution of feature maps |
| Translational invariance | Recognition regardless of object position in the image |
| Rotational invariance | Robustness to in plane rotation of the input |
| Size invariance | Robustness to changes in apparent object size |
| Data augmentation | Random transforms (flips, crops, jitter, Mixup) that expand the training set |
| Receptive field | The input region that influences a given output activation |
| Batch normalization | Normalizes activations to stabilize and speed up training |
| Non maximum suppression | Removes overlapping detections referring to the same object |
| Feature pyramid | Multiscale feature maps improving small and large object detection |
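Non maximum suppression from the table above is usually implemented greedily: keep the highest scoring box, drop every remaining box that overlaps it beyond a threshold, and repeat. A minimal NumPy sketch with hypothetical boxes in (x1, y1, x2, y2) format follows; detection libraries ship optimized, class aware versions.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the boxes that are kept."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The second box overlaps the first with an IoU of about 0.68 and is suppressed, while the distant third box survives.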
Computer vision is embedded in many consumer and industrial products.
| Domain | Examples |
|---|---|
| Smartphones | Face unlock, computational photography, portrait mode, Magic Eraser |
| Autonomous driving | Lane keeping, AEB, Tesla FSD, Waymo, Mobileye |
| Medical imaging | Diabetic retinopathy screening, mammography, pathology, Aidoc, Viz.ai |
| Manufacturing | Defect detection, robotic bin picking, optical character recognition |
| Agriculture | Crop disease detection, weed identification, yield estimation |
| Retail | Cashierless checkout (Amazon Just Walk Out), shelf monitoring |
| Security | License plate recognition, face recognition, video analytics, surveillance |
| Augmented reality | Apple Vision Pro, Meta Quest passthrough, Snap and Instagram filters |
| Sports | Hawk-Eye line calling, player tracking, broadcast graphics |
| Sciences | Galaxy classification, cryo-EM reconstruction, animal behavior tracking |
| Accessibility | Be My Eyes with GPT-4 Vision, Microsoft Seeing AI |
| Robotics | Visual SLAM, vision language action policies such as RT-2 and OpenVLA |
| Challenge | Description |
|---|---|
| Distribution shift | Models often fail when lighting, weather, or camera optics change |
| Adversarial robustness | Tiny perturbations can flip predictions, a safety critical concern |
| Long tail recognition | Most real world categories have few training examples, hurting accuracy on rare classes |
| 3D understanding | Reasoning about geometry, occlusion, and physics trails human performance |
| Privacy and consent | Web scraped training images raise legal and ethical concerns |
| Bias and fairness | Datasets and models can encode and amplify demographic biases |
| Energy use | Training the largest models consumes megawatt hours to gigawatt hours of electricity |
| Hallucination | Vision LLMs sometimes describe objects not in the image |
| Long video understanding | Most models still process short clips; minute or hour scale reasoning is active research |
The following terms have their own articles, preserved from the original index.