Computer Vision Machine Learning

Updated Jun 28, 2026

See also: Machine learning terms

The key machine learning terms for computer vision describe how neural networks turn pixels into predictions. A convolution slides a small learned filter (kernel) across an image to detect patterns, and a convolutional neural network (CNN) stacks many such filters with pooling to build hierarchical features. Downstream tasks include image recognition (labeling a whole image), object detection (drawing bounding boxes around objects), and segmentation (labeling every pixel). Detector quality is scored with intersection over union (IoU), the area of overlap divided by the area of union between a predicted box and the ground truth.^[21]^[24] Training is made more robust with data augmentation. CNNs gain parameter and compute efficiency from convolutional filters shared across the image and from depthwise separable convolutions, while pooling contributes approximate translational invariance.^[22]^[23]

Computer vision (CV) is the subfield of artificial intelligence and machine learning that builds systems capable of extracting information from digital images, video, and other visual inputs. The discipline aims to give machines the ability to identify objects, recognize people, understand scenes, infer geometry, and reason about events. Modern computer vision has converged with deep learning, and most state of the art systems are built on convolutional neural networks, vision transformers, and multimodal foundation models trained on hundreds of millions of image text pairs.

The field traces its origins to early experiments at MIT in the 1960s, including the 1966 Summer Vision Project led by Seymour Papert, which famously underestimated the difficulty of human level visual perception. Six decades of research have produced techniques ranging from hand crafted detectors such as SIFT and HOG to billion parameter foundation models such as CLIP, SAM, and DINOv2. Computer vision now powers face unlock, autonomous driving, medical imaging diagnostics, satellite analysis, content moderation, augmented reality, and generative tools that turn text prompts into photorealistic images.

What are the key machine learning computer vision terms?

This section defines the core vocabulary of computer vision as it appears in a machine learning glossary, with each term cross linked to its dedicated article. The definitions follow primary sources including Google's Machine Learning glossary and the textbook Deep Learning by Goodfellow, Bengio, and Courville (MIT Press, 2016).^[21]^[22] Each concept also appears in the more detailed sections later in this page.

Term	One line definition
Convolution	A weighted sum computed by sliding a small filter across an input matrix to mix the filter with the input.^[21]
Convolutional filter	A learned matrix (kernel) with the same rank as the input but a smaller shape, applied across spatial locations.^[21]
Convolutional layer	A layer in which a convolutional filter passes along the input matrix via repeated convolutional operations.^[21]
Convolutional neural network	A neural network in which at least one layer is a convolutional layer.^[21]
Pooling	A reduction operation that aggregates a region of a feature map into one value, shrinking resolution.^[22]
Image recognition	Assigning one or more class labels to an entire image (also called image classification).
Object detection	Locating and classifying multiple objects, returning a bounding box and label for each.
Bounding box	The (x, y) coordinates of a rectangle around an area of interest in an image.^[21]
Intersection over union (IoU)	Area of overlap divided by area of union between two boxes; ranges from 0 to 1.^[24]
Data augmentation	Artificially expanding the training set by transforming existing examples (flips, crops, color jitter).^[21]
Translational invariance	Recognizing an object regardless of where it appears in the image.
Depthwise separable convolution	A factorized convolution (depthwise then pointwise) that cuts computation roughly 8 to 9 times.^[23]

What is a convolution?

A convolution is the core mathematical operation of a CNN. Google's Machine Learning glossary defines it as a process that "mixes the convolutional filter and the input matrix in order to train weights."^[21] In practice the operation has two steps: an element wise multiplication of the convolutional filter and a slice of the input matrix, followed by a summation of every value in the resulting product matrix.^[21] Repeating this across the whole image produces a feature map. A convolutional filter, also called a kernel, is "a matrix having the same rank as the input matrix, but a smaller shape" (commonly 3x3 or 5x5), and the same filter is reused at every position, a property known as weight sharing.^[21] The stride sets how many pixels the filter moves between applications, which controls how much the output is downsampled.
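The two steps described above (element wise multiplication, then summation) can be sketched in plain Python. This is a minimal illustration rather than how any library implements it: the 4x4 `image` and the 2x2 `kernel` are hypothetical values chosen by hand, no padding is applied, and the filter is not flipped, which matches the usual deep learning convention (strictly speaking, cross-correlation).

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    feature_map = []
    for top in range(0, ih - kh + 1, stride):
        row = []
        for left in range(0, iw - kw + 1, stride):
            # Step 1: element-wise multiply the filter with the input slice.
            products = [
                image[top + i][left + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            ]
            # Step 2: sum every value of the product matrix.
            row.append(sum(products))
        feature_map.append(row)
    return feature_map


# Hypothetical 4x4 input with a vertical edge between columns 0 and 1.
image = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
]
# Hand-picked 2x2 filter that responds to a dark-to-bright step (not learned).
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, kernel, stride=1)  # 3x3 output
strided_map = conv2d(image, kernel, stride=2)  # 2x2 output
```

With stride 1 the filter responds only in the column where the edge sits; with stride 2 the same filter is applied at fewer positions, so the output shrinks from 3x3 to 2x2.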

What is a convolutional neural network?

A convolutional neural network (CNN) is, in Google's words, "a neural network in which at least one layer is a convolutional layer," typically combining convolutional layers, pooling layers, and dense (fully connected) layers.^[21] CNNs became the dominant approach for vision in 2012, when AlexNet (Krizhevsky, Sutskever, and Hinton) won the ImageNet challenge with a top 5 error rate of 15.3 percent against 26.2 percent for the second best entry, a 10.9 point gap that ended the era of hand crafted features.^[1] Goodfellow, Bengio, and Courville explain that convolution improves a machine learning system through three ideas: "sparse interactions, parameter sharing and equivariant representations."^[22] Sparse interactions come from small kernels touching only a local region, parameter sharing reuses one kernel everywhere, and equivariance to translation means a shifted input produces a correspondingly shifted output.^[22] Stacking small filters deeply, as in VGG-16, lets the network learn edges, then textures, then object parts, then whole objects.
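Equivariance to translation can be checked on a toy example. The sketch below uses a 1D signal for brevity; `signal`, `kernel`, and the helper names are illustrative assumptions, and the equality holds exactly here only because the signal is zero near its borders.

```python
def conv1d(signal, kernel):
    """Valid (no padding) 1D sliding-window weighted sum."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def shift_right(signal, steps):
    """Shift a signal right by `steps`, padding with zeros on the left."""
    return [0] * steps + signal[: len(signal) - steps]


signal = [0, 0, 1, 3, 1, 0, 0, 0]  # hypothetical input with one bump
kernel = [1, 2, 1]                 # one shared kernel reused at every position

# Convolve then shift, versus shift then convolve.
out_then_shift = shift_right(conv1d(signal, kernel), 1)
shift_then_out = conv1d(shift_right(signal, 1), kernel)
```

Both orders give the same result, which is what "a shifted input produces a correspondingly shifted output" means in practice.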

What are pooling and translational invariance?

Pooling is a reduction operation that summarizes a small spatial region of a feature map into a single value, most commonly with max pooling or average pooling. Beyond shrinking resolution, pooling builds robustness to position: Goodfellow, Bengio, and Courville note that "pooling helps to make the representation approximately invariant to small translations of the input."^[22] This is the basis of translational invariance, the ability to recognize an object no matter where it sits in the frame. Related properties include rotational invariance (robustness to in plane rotation) and size invariance (robustness to apparent object size), which are usually encouraged through architecture choices and data augmentation rather than guaranteed by pooling alone.
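A small sketch shows why max pooling gives approximate invariance to small shifts. The two 4x4 feature maps below are hypothetical; the second is the first translated one pixel to the right. Non-overlapping 2x2 windows are an assumption of this example, not a requirement of pooling in general.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling with a square window of side `size`."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(0, width - size + 1, size)
        ]
        for r in range(0, height - size + 1, size)
    ]


original = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 7, 0],
    [0, 0, 0, 0],
]
# The same activations translated one pixel to the right.
shifted = [
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 7],
    [0, 0, 0, 0],
]

pooled_original = max_pool2d(original)
pooled_shifted = max_pool2d(shifted)
```

Both inputs pool to the same 2x2 map because each activation stays inside its window. A larger shift that crosses a window boundary would change the output, which is why the invariance is only approximate.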

What are image recognition and object detection?

Image recognition, also called image classification, assigns one or more labels to a whole image. The canonical benchmark is ImageNet ILSVRC, which contains 1,000 categories and 1,281,167 training images (about 1.28 million), with top 1 accuracy rising from roughly 56 percent for AlexNet in 2012 to over 91 percent for recent models.^[16] Object detection goes further by localizing and classifying multiple objects at once, returning a bounding box and a class label with a confidence score for each. Google's glossary defines a bounding box simply as "the (x, y) coordinates of a rectangle around an area of interest."^[21] The standard detection benchmark is MS COCO (2014), with 80 object categories and over 200,000 labeled images.^[15] Detection quality is reported as mean average precision (mAP), computed across a range of IoU thresholds.

What is intersection over union?

Intersection over union (IoU), sometimes called the Jaccard index, is the standard geometric metric for object detection and segmentation. It is defined as the area of overlap between a predicted box and the ground truth box divided by the area of their union: IoU = area of overlap / area of union.^[24] The value ranges from 0 (no overlap) to 1 (a perfect match), and a detection is usually counted as correct when its IoU exceeds a threshold such as 0.50.^[24] The COCO primary metric averages mAP over IoU thresholds from 0.50 to 0.95 in steps of 0.05, which is much stricter than the Pascal VOC convention of a single 0.50 threshold.^[15] IoU also drives non maximum suppression, which removes duplicate boxes that overlap too heavily.
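The formula and its use in non maximum suppression can be sketched as follows. The `(x_min, y_min, x_max, y_max)` box layout, the function names, and the sample boxes are assumptions made for illustration; the suppression loop is the common greedy variant, not the only one in use.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over (box, score) pairs; highest scores are kept first."""
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in ranked:
        # Keep a box only if it does not overlap a kept box too heavily.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)
overlap = iou(prediction, ground_truth)  # overlap 50, union 150

detections = [
    ((0, 0, 10, 10), 0.9),
    ((1, 0, 11, 10), 0.8),    # near-duplicate of the first box
    ((20, 20, 30, 30), 0.7),  # a separate object
]
kept = nms(detections, iou_threshold=0.5)
```

Here the prediction covers half of the ground truth box, giving an IoU of one third, and the near-duplicate detection is suppressed while the separate object survives.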

What is data augmentation?

Data augmentation is, per Google's glossary, "artificially boosting the range and number of training examples by transforming existing examples to create additional examples."^[21] Typical transforms for images include horizontal flips, random crops, rotations, color jitter, and mixing strategies such as Mixup and CutMix. Augmentation enlarges the effective training set without new labels, reduces overfitting, and teaches the model the invariances (position, scale, lighting) that a classifier should ignore. It is standard practice in nearly every modern vision training pipeline and is especially valuable when labeled data is scarce.
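Two of the transforms named above, horizontal flips and random crops, can be sketched on an image stored as a nested list. The function names, the 4x4 toy image, the 0.5 flip probability, and the fixed seed are assumptions for illustration; real pipelines usually rely on a library and operate on tensors.

```python
import random


def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, rng, flip_prob=0.5, crop_size=3):
    """Random flip followed by a random crop; the label stays unchanged."""
    if rng.random() < flip_prob:
        image = horizontal_flip(image)
    return random_crop(image, crop_size, crop_size, rng)


# Hypothetical 4x4 single-channel image with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
rng = random.Random(0)  # seeded only to make the sketch repeatable
variants = [augment(image, rng) for _ in range(5)]
```

Each call yields a slightly different 3x3 view of the same example, so one labeled image contributes several training inputs.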

What is a depthwise separable convolution?

A depthwise separable convolution (sepCNN) factorizes a standard convolution into two cheaper steps: a depthwise convolution that filters each input channel independently, followed by a pointwise (1x1) convolution that combines channels. Introduced for mobile vision in MobileNets (Howard et al., 2017), this factorization costs roughly 8 to 9 times less computation than a standard 3x3 convolution at the price of a small accuracy reduction.^[23] The technique underpins efficient architectures such as MobileNet and Xception and is one reason CNNs remain competitive on phones and edge accelerators where compute and memory are tightly constrained.
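The "8 to 9 times" figure can be reproduced by counting multiply-adds, following the cost model in the MobileNets paper. The layer shape below (3x3 kernel, 256 input and output channels, 14x14 output) is a hypothetical example chosen for illustration.

```python
def standard_conv_cost(kernel, in_ch, out_ch, out_size):
    """Multiply-adds for a standard kernel x kernel convolution."""
    return kernel * kernel * in_ch * out_ch * out_size * out_size


def separable_conv_cost(kernel, in_ch, out_ch, out_size):
    """Multiply-adds for depthwise (per channel) plus pointwise (1x1) steps."""
    depthwise = kernel * kernel * in_ch * out_size * out_size
    pointwise = in_ch * out_ch * out_size * out_size
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 256 -> 256 channels, 14x14 output map.
std = standard_conv_cost(3, 256, 256, 14)
sep = separable_conv_cost(3, 256, 256, 14)
speedup = std / sep  # roughly 8.7 for this shape
```

The ratio `sep / std` simplifies to `1 / out_ch + 1 / kernel**2`, so for 3x3 kernels the saving approaches 9 times as the number of output channels grows.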

History

Computer vision history divides into several broad eras, each defined by the dominant image representation and algorithms.

Era	Years	Defining ideas	Representative work
Early symbolic	1960s to 1970s	Block worlds, line drawings, edge reasoning	Roberts (1963), Waltz (1972)
Geometric and physics based	1980s	Stereo vision, shape from shading, optical flow	Marr's Vision (1982), Horn and Schunck (1981)
Feature engineering	1990s to early 2000s	Hand crafted descriptors, statistical learning	SIFT, HOG, Viola Jones
Deep learning	2012 to 2019	End to end CNNs on GPUs	AlexNet, VGG, ResNet, Faster R-CNN
Foundation models	2020 to present	Attention encoders, multimodal pretraining	ViT, CLIP, SAM, Stable Diffusion

The transition to deep learning dates to the 2012 ImageNet Challenge, where AlexNet (Krizhevsky, Sutskever, Hinton) reached a top 5 error of 15.3 percent against 26.2 percent for the second best entry, ending the era of hand crafted features almost overnight.^[1]

Core tasks

Computer vision is organized around a small number of canonical tasks. Real applications often combine several of them, but treating each as a benchmark has driven steady progress.

Image classification

Image classification assigns one or more labels to a whole image. The classic benchmark is ImageNet with 1,000 categories and 1.28 million training images.^[16] Top 1 accuracy rose from about 56 percent for AlexNet in 2012 to roughly 90 to 91 percent for large pretrained models such as EVA-02 and CoCa by 2024.

Object detection

Object detection localizes and classifies multiple objects, producing bounding boxes with class labels and confidence scores. It is evaluated using intersection over union thresholds and mean average precision (mAP). The standard benchmark is MS COCO (2014), with 80 categories and over 200,000 labeled images.^[15]
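Average precision for a single class can be sketched from a ranked list of detections. This version assumes each detection has already been marked as a true or false positive at some IoU threshold, and it uses all-point interpolation of the precision-recall curve; COCO's official evaluator samples the curve at fixed recall points instead, so treat this as an approximation of the idea rather than the benchmark code. mAP is the mean of this value over classes.

```python
def average_precision(scored_hits, num_ground_truth):
    """AP from (confidence, is_true_positive) pairs for one class."""
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    true_pos = 0
    for rank, (_, is_hit) in enumerate(ranked, start=1):
        true_pos += int(is_hit)
        precisions.append(true_pos / rank)
        recalls.append(true_pos / num_ground_truth)
    # Interpolate: precision at a recall level becomes the best precision
    # achieved at that recall or any higher one.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical detections for one class with two ground truth objects.
detections = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
ap = average_precision(detections, num_ground_truth=2)
```

The false positive ranked second lowers precision for the second half of the recall range, so the result falls below 1.0 even though both objects are eventually found.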

Image segmentation

Segmentation assigns a label to every pixel. Three main variants exist:

Variant	What it labels	Example output
Semantic	Each pixel by class only	All pixels belonging to any car share one color
Instance	Each pixel by class and instance	Each car is colored separately
Panoptic	Class plus instance for things, class only for stuff	Cars individually colored, road one solid color

The distinction between things (countable objects) and stuff (amorphous regions like sky) was formalized by Kirillov and colleagues in 2019.

Keypoint and pose estimation

Keypoint detection identifies landmarks such as eye corners or body joints. Keypoints and landmarks are used in face alignment, hand tracking, animal pose, and motion capture. Notable systems include OpenPose (2017) and MediaPipe (2019).

Depth estimation

Depth estimation predicts pixel wise distance from the camera. It can use stereo pairs, structured light, time of flight sensors, or a single monocular image. Modern monocular models such as MiDaS (2020) and Depth Anything (2024) generalize across domains.

Optical flow

Optical flow estimates per pixel motion between video frames. Horn and Schunck (1981) and Lucas and Kanade (1981) defined the classical formulation. Deep approaches like FlowNet (2015), PWC-Net (2018), and RAFT (2020) lead modern benchmarks (Sintel, KITTI).

Other tasks

Additional tasks include image generation, image to image translation, super resolution, denoising, image captioning, visual question answering, action recognition, 3D reconstruction, and visual SLAM.

Classical computer vision

Before deep learning, computer vision relied on hand designed detectors and descriptors. Many remain useful today for limited data or strict latency budgets, and they are essential to robotics and structure from motion pipelines.

Method	Year	Purpose
Canny edge detector	1986	Step changes in intensity via gradient magnitude and NMS
Harris corner detector	1988	Points where the image gradient varies in two orthogonal directions
SIFT	1999	Scale and rotation invariant keypoints, 128 dim descriptors
SURF	2006	Faster SIFT approximation using integral images
HOG	2005	Oriented gradients in cells, pedestrian detection (Dalal and Triggs)
ORB	2011	Binary descriptor for real time matching, patent free
Bag of Visual Words	early 2000s	Quantizes local descriptors into a vocabulary histogram

Other classical operations include Sobel and Prewitt gradients, Laplacian of Gaussian blob detection, RANSAC for robust fitting, and the Hough transform for parametric shape detection.

The convolutional neural network era

The modern era began with convolutional neural networks (CNNs). CNNs stack convolutional layers, ReLU nonlinearities, pooling, and fully connected layers to learn hierarchical features from pixels. Key ideas include weight sharing, translational invariance, and stacking small convolutional filters deeply.

Model	Year	Authors	Key contribution
LeNet-5	1998	LeCun et al.	First successful CNN, recognized digits on MNIST
AlexNet	2012	Krizhevsky, Sutskever, Hinton	8 layers, ReLU, dropout, dual GPU, won ImageNet 2012
ZFNet	2013	Zeiler and Fergus	Visualization techniques, won ImageNet 2013
VGG-16 and VGG-19	2014	Simonyan and Zisserman, Oxford	Depth with small 3x3 filters improves accuracy
GoogLeNet (Inception v1)	2014	Szegedy et al., Google	Inception module, 22 layers, won ImageNet 2014
ResNet	2015	He et al., Microsoft Research	Residual connections enabled 152 layer networks, won 2015
Inception v3 and v4	2015 to 2016	Szegedy et al.	Factorized convolutions, label smoothing
DenseNet	2016	Huang et al.	Each layer receives feature maps from all preceding layers
Xception	2016	Chollet	Depthwise separable convolutions
MobileNet v1 and v2	2017 to 2018	Howard et al., Google	Lightweight CNN for mobile devices
ResNeXt	2017	Xie et al., Facebook	Grouped convolutions, cardinality as a new dimension
SENet	2017	Hu, Shen, Sun	Squeeze and excitation channel attention, won 2017
EfficientNet	2019	Tan and Le, Google	Compound scaling of depth, width, and resolution
ConvNeXt	2022	Liu et al., Facebook	Modernized ResNet matching transformer accuracy

Convolutions remain dominant on mobile and edge devices due to hardware efficiency. Pooling and stride parameters control how spatial information is downsampled.
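How stride and pooling shrink a feature map, and how stacked small filters widen what each output sees, can be computed with two short helpers. The formulas are the standard ones for square inputs and kernels; the function names and the example layer settings are illustrative assumptions.

```python
def conv_output_size(input_size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (floor division)."""
    return (input_size + 2 * padding - kernel) // stride + 1


def receptive_field(layers):
    """Input-side receptive field of a stack of (kernel, stride) layers."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the field
        jump *= stride                # stride spaces later layers further apart
    return field


# A 3x3 convolution with padding 1 keeps a 224 pixel side unchanged ...
size_after_conv = conv_output_size(224, kernel=3, stride=1, padding=1)
# ... and a 2x2 pooling layer with stride 2 then halves it.
size_after_pool = conv_output_size(size_after_conv, kernel=2, stride=2)
# Three stacked 3x3 convolutions see the same region as one 7x7 filter.
three_3x3 = receptive_field([(3, 1), (3, 1), (3, 1)])
```

This is the arithmetic behind the VGG design choice in the table above: several small filters cover a large receptive field with fewer weights than one large filter.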

Object detection architectures

Object detection has been one of the most active subfields since 2014. Detectors are typically classified as two stage, single stage, or transformer based.

Family	Representative models	Approach
Two stage	R-CNN (2014), Fast R-CNN (2015), Faster R-CNN (2015), Mask R-CNN (2017)	Propose regions, then classify and refine bounding boxes
Single stage anchor based	SSD (2015), YOLOv2/v3 (2016 to 2018), RetinaNet (2017)	Predict boxes and classes in one pass over a dense anchor grid
Single stage anchor free	CornerNet (2018), FCOS (2019), YOLOv8 (2023), YOLOv10/v11 (2024)	Predict centers or corners directly without predefined anchors
Transformer based	DETR (2020), Deformable DETR (2020), DINO (2022), Co-DETR (2023)	Encoder decoder with learned object queries, no NMS

RetinaNet introduced focal loss to address foreground/background imbalance, hitting 39.1 mAP on COCO in 2017. Modern transformer detectors such as Co-DETR with a ViT-Large backbone reach about 66 mAP.^[20] The YOLO family, originated by Joseph Redmon in 2016, has been continued by Ultralytics (YOLOv5, YOLOv8), Meituan (YOLOv6), and Tsinghua researchers (YOLOv10), and is widely deployed in real time applications.
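Focal loss can be sketched for a single binary prediction. The form below, with `gamma=2.0` and `alpha=0.25` as defaults, follows the settings reported in the RetinaNet paper; the function names and sample probabilities are assumptions for illustration, and a real detector would apply this over every anchor in a batch.

```python
import math


def cross_entropy(prob, target):
    """Binary cross-entropy for one prediction (prob is P(foreground))."""
    p_t = prob if target == 1 else 1.0 - prob
    return -math.log(p_t)


def focal_loss(prob, target, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled down for easy examples."""
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks toward zero when the model is already right.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Background anchor the model already scores as background (easy case).
easy_background = focal_loss(prob=0.01, target=0)
# Background anchor the model wrongly scores as foreground (hard case).
hard_background = focal_loss(prob=0.90, target=0)
```

The easy example contributes almost nothing while the hard one keeps a large loss, so the many easy background anchors no longer dominate the gradient.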

Segmentation architectures

Model	Year	Notes
FCN	2015	Long, Shelhamer, Darrell. Replaced fully connected layers with convolutions
U-Net	2015	Ronneberger et al. Encoder decoder with skip connections, biomedical roots
DeepLab v1 to v3+	2014 to 2018	Chen et al., Google. Atrous convolutions and ASPP
Mask R-CNN	2017	Extends Faster R-CNN with a mask branch for instance segmentation
PSPNet	2017	Pyramid pooling module for global context
HRNet	2019	Maintains high resolution feature maps throughout
Mask2Former	2022	Unified architecture for semantic, instance, and panoptic segmentation
SAM	2023	Meta. Promptable segmentation trained on SA-1B (1.1B masks)
SAM 2	2024	Meta. Adds streaming video segmentation with memory

U-Net remains the baseline for medical image segmentation and is the backbone of the denoiser in latent diffusion models such as Stable Diffusion.^[5]^[10]

Vision transformers

The Vision Transformer paper by Dosovitskiy et al. at Google Brain (2020) showed that the transformer architecture, designed for language, could match or beat CNNs on classification given enough pretraining data.^[7] ViT splits the image into patches (typically 16x16), embeds each linearly, adds positional embeddings, and processes the sequence with a standard transformer encoder.^[7]
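The patch splitting step can be sketched without any library. The code below only cuts an image into flattened patches; the linear embedding, positional embeddings, and transformer encoder are omitted. The nested-list image layout (height, width, channels) and the 32x32 example size are assumptions for illustration.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch row by row, pixel by pixel, channel by channel.
            flat = [
                channel
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
                for channel in image[r][c]
            ]
            patches.append(flat)
    return patches


# Hypothetical 32x32 RGB image filled with zeros, cut into 16x16 patches.
image = [[[0, 0, 0] for _ in range(32)] for _ in range(32)]
patches = patchify(image, patch_size=16)
num_tokens = len(patches)       # (32 / 16) ** 2 = 4 patches
token_length = len(patches[0])  # 16 * 16 * 3 = 768 values per patch
```

By the same arithmetic, a 224x224 image with 16x16 patches yields 14 x 14 = 196 tokens, each a 768 value vector before embedding.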

Model	Year	Key idea
ViT	2020	Patch embedding plus transformer encoder, scales with data
DeiT	2020	Distillation token enables ViT results from ImageNet alone
Swin Transformer	2021	Shifted window attention, linear complexity in image size
BEiT	2021	Masked image modeling with a discrete tokenizer
MAE	2021	Reconstructs randomly masked patches, strong self supervision
EVA-02	2023	Scaled up masked image modeling backbones
DINOv2	2023	Self distillation produces general visual features without labels
AIMv2	2024	Apple. Autoregressive vision pretraining at scale

Vision transformers benefit greatly from pretraining on large datasets such as JFT-300M (Google, 300M images) or LAION-2B (LAION, 2B image text pairs, also used to train Stable Diffusion).

Multimodal vision language models

Multimodal models aligning images and text in a shared embedding space have transformed the field since 2021. They enable zero shot classification, open vocabulary detection, image text retrieval, and serve as the visual front end of multimodal LLMs such as GPT-4o, Gemini, and Claude.^[9]
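Zero shot classification in a shared embedding space reduces to comparing one image vector against one text vector per candidate label. The sketch below uses made-up three dimensional embeddings in place of real encoder outputs, and the `temperature` of 100 is an assumed scaling factor; none of the names correspond to an actual model API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, temperature=100.0):
    """Score each text prompt against the image and normalize to probabilities."""
    labels = list(text_embeddings)
    logits = [
        temperature * cosine(image_embedding, text_embeddings[label])
        for label in labels
    ]
    return dict(zip(labels, softmax(logits)))


# Made-up embeddings standing in for image and text encoder outputs.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.1],
    "a photo of a car": [0.1, 0.1, 1.0],
}
probs = zero_shot_classify(image_embedding, text_embeddings)
best = max(probs, key=probs.get)
```

No classifier is trained for these labels: changing the label set only means encoding a different list of text prompts.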

Model	Year	Organization	What it does
CLIP	2021	OpenAI	Trains image and text encoders via a contrastive loss on 400 million pairs
ALIGN	2021	Google	Similar to CLIP, trained on 1.8 billion noisy alt text image pairs
DALL-E	2021	OpenAI	Autoregressive text to image model on a discrete VAE and transformer
GLIDE	2021	OpenAI	Diffusion text to image generation with classifier free guidance
DALL-E 2	2022	OpenAI	Two stage diffusion using CLIP image embeddings
Imagen	2022	Google	Cascaded diffusion conditioned on a frozen T5 text encoder
Stable Diffusion	2022	Stability AI, CompVis	Latent diffusion in a learned VAE space, open weights
Flamingo	2022	DeepMind	Few shot multimodal LLM with cross attention to image tokens
BLIP-2	2023	Salesforce	Bootstrapped vision language pretraining with a Q-Former
LLaVA	2023	UW Madison, Microsoft	Connects a CLIP encoder to LLaMA via a projection layer
DALL-E 3	2023	OpenAI	Improved prompt following via better training captions
SDXL	2023	Stability AI	3.5 billion parameter latent diffusion with two text encoders
Stable Diffusion 3	2024	Stability AI	Multimodal Diffusion Transformer (MMDiT) replacing the U-Net
FLUX.1	2024	Black Forest Labs	Hybrid transformer diffusion model from ex Stable Diffusion authors

Text to image generation typically uses denoising diffusion (DDPM) with classifier free guidance, increasingly on transformer based backbones (DiT). Video extensions like Sora (OpenAI, 2024) and Veo (Google DeepMind, 2024) generate multi second clips using spacetime patches as tokens.

3D vision

3D computer vision recovers geometric structure from images. Classical pipelines combine multi view stereo, structure from motion (SfM), and bundle adjustment. Modern systems learn 3D representations end to end.

Method	Year	Description
Photogrammetry and SfM	1990s onwards	Camera poses and sparse 3D points from many views
Point clouds and PointNet	2017	Discrete or sparse 3D representations
NeRF	2020	Mildenhall et al. Continuous radiance field, volume rendering
Instant NGP	2022	Hash grid encoding trains NeRFs in seconds
Gaussian Splatting	2023	Kerbl et al., Inria. Millions of 3D Gaussians, real time rendering
DUSt3R, MASt3R	2024	Naver Labs. Pixel aligned point maps without calibration

Depth estimation, photometric stereo, structured light, and time of flight sensors complement learned methods in phones, vehicles, and AR headsets like the Apple Vision Pro and Meta Quest 3.

Visual foundation models

Foundation models are large pretrained models that adapt to many downstream tasks. The vision community now has several visual and vision language foundation models, often with permissive or open licenses.

Model	Year	Organization	Pretraining objective
CLIP, OpenCLIP	2021 to 2023	OpenAI, LAION	Contrastive image text alignment
DINOv2	2023	Meta	Self distillation, 142 million curated images, no labels
SAM, SAM 2	2023 to 2024	Meta	Promptable segmentation, image and video
SigLIP, SigLIP 2	2023 to 2025	Google	Sigmoid loss for language image pretraining
EVA-02	2023	BAAI	Masked image modeling at billion parameter scale
AIMv2	2024	Apple	Multimodal autoregressive pretraining
Florence-2	2024	Microsoft	Unified vision foundation model on 5B annotations

These models are used as drop in feature extractors. DINOv2 ViT-L/14 features exceed 86 percent top 1 ImageNet accuracy with a linear probe, and SAM is widely used in data augmentation and labeling pipelines.^[14]^[13]

Datasets

Progress is closely tied to large labeled datasets. The most influential are listed below.

Dataset	Year	Size	Purpose
MNIST	1998	70,000 grayscale digit images	Handwritten digit classification
CIFAR-10 and CIFAR-100	2009	60,000 32x32 color images	Small scale classification
ImageNet ILSVRC	2009 to 2017	1.28M training images, 1,000 categories	Large scale classification benchmark
Pascal VOC	2005 to 2012	11,530 detection/segmentation images (VOC 2012 train/val)	Earliest detection benchmark
MS COCO	2014 onwards	330K images, 80 classes, captions, keypoints	Detection, segmentation, captioning
OpenImages V7	2018 onwards	9M images, 600 classes, 16M boxes	Open vocabulary detection
ADE20K	2017	25,000 images, 150 classes	Scene parsing benchmark
Cityscapes	2016	5,000 annotated street scenes	Urban driving segmentation
KITTI	2012	Stereo, lidar, GPS for driving	Autonomous driving research
LAION-5B	2022	5.85B image text pairs	Pretraining for diffusion and CLIP models
SA-1B	2023	11M images, 1.1 billion masks	Training data for SAM

Dataset bias and licensing are persistent concerns. ImageNet's person categories were partly retired in 2019, and LAION-5B was withdrawn in December 2023 over problematic content; a filtered version, Re-LAION-5B, was released in 2024.

Hardware

Model scale is bounded by available training and inference hardware.

Accelerator	Vendor	Typical use
CUDA GPUs (V100, A100, H100, B200)	Nvidia	Dominant choice for training and cloud inference
Instinct MI250X, MI300A, MI300X	AMD	Used in Frontier (MI250X) and El Capitan (MI300A), and increasingly in cloud (MI300X)
TPU v4, v5p, v5e, Trillium	Google	Internal training of Imagen, Gemini, and PaLI
Inferentia, Trainium	AWS	Cloud inference and training accelerators
Apple Neural Engine	Apple	On device vision in iPhones, iPads, and Macs
Hexagon NPU	Qualcomm	Smartphone camera pipelines and on device generative models
Edge devices (Jetson, Coral Edge TPU, Hailo)	Various	Robotics, retail analytics, drones

Quantization, pruning, and knowledge distillation fit large models onto edge accelerators. Frameworks like Core ML, TensorRT, OpenVINO, and ONNX Runtime translate models into device specific instructions.
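Of the three techniques, quantization is the simplest to sketch. The example below maps floating point weights to 8 bit unsigned integers with a per tensor scale and zero point, then maps them back. It is a toy affine scheme under assumed names and sample weights, not the procedure of any particular framework named above.

```python
def quantize(values, num_bits=8):
    """Affine quantization of floats to unsigned integers."""
    q_max = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / q_max if hi > lo else 1.0
    zero_point = round(-lo / scale)
    quantized = [
        # Round to the nearest step, offset by the zero point, clamp to range.
        min(q_max, max(0, round(v / scale) + zero_point))
        for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map integers back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # hypothetical layer weights
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte instead of four, at the cost of a rounding error on the order of one quantization step.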

Benchmarks and metrics

Benchmark	Task	Primary metric
ImageNet ILSVRC	Image classification	Top 1 and top 5 accuracy
MS COCO detection	Object detection	mAP at IoU 0.50 to 0.95
MS COCO segmentation	Instance segmentation	Mask AP
MS COCO panoptic	Panoptic segmentation	Panoptic Quality (PQ)
ADE20K	Semantic segmentation	Mean intersection over union (mIoU)
KITTI, Sintel	Stereo, flow, detection	End point error, mAP
LFW, IJB	Face recognition	True accept rate at fixed FAR
LVIS	Long tail detection	mAP over rare, common, frequent classes
VQAv2, GQA	Visual reasoning	Answer accuracy
MMMU, MMBench	Multimodal LLM evaluation	Multiple choice accuracy

Intersection over union is the standard geometric metric. The COCO mAP averaged over IoU 0.50 to 0.95 in steps of 0.05 is much stricter than Pascal VOC at IoU 0.50.
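The difference in strictness can be seen by listing the thresholds and checking which ones a given detection clears. The function names and the sample IoU of 0.72 are hypothetical, and the sketch treats an IoU equal to the threshold as a pass.

```python
def coco_iou_thresholds():
    """The ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou_value):
    """Thresholds at which a detection with this IoU counts as correct."""
    return [t for t in coco_iou_thresholds() if iou_value >= t]


thresholds = coco_iou_thresholds()
passed = thresholds_passed(0.72)  # a hypothetical detection with IoU 0.72
fraction = len(passed) / len(thresholds)
```

A box with IoU 0.72 counts as fully correct under a single 0.50 threshold but passes only half of the COCO thresholds, so loosely localized boxes are penalized in the averaged metric.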

Key concepts

Concept	Why it matters
Convolution	Core CNN operation computing a weighted sum in a sliding window
Convolutional filter	A learned kernel applied across spatial locations
Pooling and spatial pooling	Reduces spatial resolution while preserving salient information
Stride	The step size used when sliding a filter across the input
Downsampling and subsampling	Reducing the resolution of feature maps
Translational invariance	Recognition regardless of object position in the image
Rotational invariance	Robustness to in plane rotation of the input
Size invariance	Robustness to changes in apparent object size
Data augmentation	Random transforms (flips, crops, jitter, Mixup) that expand the training set
Receptive field	The input region that influences a given output activation
Batch normalization	Normalizes activations to stabilize and speed up training
Non maximum suppression	Removes overlapping detections referring to the same object
Feature pyramid	Multiscale feature maps improving small and large object detection

Applications

Computer vision is embedded in many consumer and industrial products.

Domain	Examples
Smartphones	Face unlock, computational photography, portrait mode, Magic Eraser
Autonomous driving	Lane keeping, AEB, Tesla FSD, Waymo, Mobileye
Medical imaging	Diabetic retinopathy screening, mammography, pathology, Aidoc, Viz.ai
Manufacturing	Defect detection, robotic bin picking, optical character recognition
Agriculture	Crop disease detection, weed identification, yield estimation
Retail	Cashierless checkout (Amazon Just Walk Out), shelf monitoring
Security	License plate recognition, face recognition, video analytics, surveillance
Augmented reality	Apple Vision Pro, Meta Quest passthrough, Snap and Instagram filters
Sports	Hawk-Eye line calling, player tracking, broadcast graphics
Sciences	Galaxy classification, cryo-EM reconstruction, animal behavior tracking
Accessibility	Be My Eyes with GPT-4 Vision, Microsoft Seeing AI
Robotics	Visual SLAM, vision language action policies such as RT-2 and OpenVLA

Challenges and open problems

Challenge	Description
Distribution shift	Models often fail when lighting, weather, or camera optics change
Adversarial robustness	Tiny perturbations can flip predictions, a safety critical concern
Long tail recognition	Real categories have few examples, hurting accuracy on rare classes
3D understanding	Reasoning about geometry, occlusion, and physics trails human performance
Privacy and consent	Web scraped training images raise legal and ethical concerns
Bias and fairness	Datasets and models can encode and amplify demographic biases
Energy use	Training the largest models consumes megawatt hours of electricity
Hallucination	Vision LLMs sometimes describe objects not in the image
Long video understanding	Most models still process short clips; minute or hour scale reasoning is active research

References

1. Krizhevsky, Sutskever, Hinton (2012). *ImageNet Classification with Deep Convolutional Neural Networks*. NeurIPS.
2. He, Zhang, Ren, Sun (2016). *Deep Residual Learning for Image Recognition*. CVPR.
3. Ren, He, Girshick, Sun (2015). *Faster R-CNN*. NeurIPS.
4. Long, Shelhamer, Darrell (2015). *Fully Convolutional Networks for Semantic Segmentation*. CVPR.
5. Ronneberger, Fischer, Brox (2015). *U-Net*. MICCAI.
6. He, Gkioxari, Dollár, Girshick (2017). *Mask R-CNN*. ICCV.
7. Dosovitskiy et al. (2021). *An Image is Worth 16x16 Words*. ICLR.
8. Liu et al. (2021). *Swin Transformer*. ICCV.
9. Radford et al. (2021). *Learning Transferable Visual Models From Natural Language Supervision*. ICML.
10. Rombach et al. (2022). *High-Resolution Image Synthesis with Latent Diffusion Models*. CVPR.
11. Mildenhall et al. (2020). *NeRF*. ECCV.
12. Kerbl et al. (2023). *3D Gaussian Splatting*. ACM TOG.
13. Kirillov et al. (2023). *Segment Anything*. ICCV.
14. Oquab et al. (2024). *DINOv2*. TMLR.
15. Lin et al. (2014). *Microsoft COCO*. ECCV.
16. Russakovsky et al. (2015). *ImageNet Large Scale Visual Recognition Challenge*. IJCV.
17. Lowe (2004). *Distinctive Image Features from Scale-Invariant Keypoints*. IJCV.
18. Dalal and Triggs (2005). *Histograms of Oriented Gradients for Human Detection*. CVPR.
19. LeCun, Bottou, Bengio, Haffner (1998). *Gradient-Based Learning Applied to Document Recognition*. Proc. IEEE.
20. Carion et al. (2020). *End-to-End Object Detection with Transformers*. ECCV.
21. Google. *Machine Learning Glossary: Image* (convolution, convolutional filter, convolutional layer, convolutional neural network, convolutional operation, bounding box, data augmentation). developers.google.com/machine-learning/glossary/image.
22. Goodfellow, Bengio, Courville (2016). *Deep Learning*, Chapter 9 (Convolutional Networks). MIT Press.
23. Howard et al. (2017). *MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications*. arXiv:1704.04861.
24. Rezatofighi et al. (2019). *Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression*. CVPR.
