Keypoints, also called interest points, feature points, or salient points, are distinctive locations in an image (or other signal) that can be reliably detected and described across viewpoint, scale, lighting, and noise changes. In computer vision, keypoints serve as anchors for feature matching, tracking, image alignment, and 3D reconstruction. Each keypoint typically carries a 2D pixel coordinate, an optional scale, an orientation, and a descriptor vector that summarizes the local image patch around it.
The term covers two related but distinct uses. Detection-style keypoints are corners, blobs, and other low-level structures that are repeatable under transformation; classical detectors such as Harris, SIFT, SURF, and ORB fall in this category. Semantic keypoints, sometimes called landmarks, are predefined anatomical or object-specific locations, for example the 17 body joints in the COCO format or the 21 landmarks (wrist, finger joints, and fingertips) of a standard hand model. Both share the same basic idea: reduce an image to a small set of meaningful points whose positions and surrounding patches encode useful structure for downstream tasks.
Dense image processing operates on every pixel, which is wasteful when most pixels are uninformative. A flat patch of sky or a stretch of uniform wall provides almost no constraint when matching one image to another. Keypoints select the few pixels that are most informative, usually because the local intensity pattern changes sharply in more than one direction (a corner) or shows a distinctive blob structure at a particular scale.
Reducing an image to a sparse set of keypoints with descriptors makes many problems tractable. Two photographs of the same building taken from different angles can be aligned by finding the same physical points in both images and solving for the geometric transformation that maps one set onto the other. The same idea drives panorama stitching, Structure from Motion, SLAM, object recognition, and visual place recognition.
For pose estimation, the term takes on a more semantic meaning. Rather than detecting whatever points happen to be locally distinctive, the system is trained to localize specific anatomical or object-defined positions: the tip of the nose, the corner of the eye, the top-left corner of a license plate. These keypoints carry meaning that is consistent across instances, which lets downstream models reason about pose, shape, and identity.
The survey literature, notably Tuytelaars and Mikolajczyk's 2008 review of local invariant feature detectors, identifies a now-standard set of desirable properties. A detector should produce points that are:
| Property | Meaning |
|---|---|
| Repeatability | The same physical point is detected in multiple images of the same scene under different viewpoints, scales, lighting, and noise levels. |
| Distinctiveness | The patch around each keypoint is unique enough that its descriptor can be matched against a large database with low ambiguity. |
| Locality | The keypoint covers a small image region, so partial occlusion and small geometric distortions affect only a few keypoints rather than all of them. |
| Quantity | The detector finds enough points to support the task. Image stitching needs only dozens; SfM on a building can need tens of thousands. |
| Accuracy | Sub-pixel localization, both spatially and across scale and orientation, so that geometric estimates such as homographies or fundamental matrices are precise. |
| Efficiency | Detection and description are fast enough for the target application, ranging from offline 3D reconstruction to real-time SLAM on a mobile robot. |
No single detector wins on every axis. Real systems trade off these properties depending on whether they need real-time performance, robustness to severe viewpoint change, or fine localization for high-precision metrology.
Classical keypoint algorithms dominated computer vision from the late 1980s through the mid-2010s and are still widely used today. Most follow a two-stage pipeline: a detector picks salient pixel locations, and a descriptor summarizes the local appearance into a vector that can be compared with descriptors from other images.
| Detector / descriptor | Year | Authors | Detector type | Descriptor | Notes |
|---|---|---|---|---|---|
| Harris corner | 1988 | Harris and Stephens | Corner from local autocorrelation matrix | None (detector only) | Foundational corner detector still used as a building block |
| Shi-Tomasi (Good Features to Track) | 1994 | Shi and Tomasi | Minimum eigenvalue of structure tensor | None | Variant of Harris tuned for KLT tracking |
| MSER | 2002 | Matas et al. | Maximally stable extremal regions | Region descriptor | Detects affinely covariant regions |
| SIFT | 1999 / 2004 | David Lowe | Difference-of-Gaussians blobs across scales | 128-D gradient histogram | Scale and rotation invariant; long-time gold standard |
| SURF | 2006 | Bay et al. | Hessian determinant via integral images | 64-D Haar-wavelet descriptor | Several times faster than SIFT |
| FAST | 2006 | Rosten and Drummond | Bresenham-circle pixel comparison | None | Extremely fast corner detector trained via decision tree |
| BRIEF | 2010 | Calonder et al. | None (descriptor only) | 256-bit binary string from pixel comparisons | Fast Hamming-distance matching |
| BRISK | 2011 | Leutenegger et al. | Scale-space FAST | 512-bit binary descriptor | Scale and rotation invariant binary alternative |
| ORB | 2011 | Rublee et al. | Oriented FAST | Rotated BRIEF | Free for commercial use; the keypoint front end of ORB-SLAM |
| KAZE / AKAZE | 2012 / 2013 | Alcantarilla et al. | Nonlinear scale space | M-LDB binary descriptor | Better edge preservation than Gaussian scale space |
The Harris detector, introduced by Chris Harris and Mike Stephens at the 1988 Alvey Vision Conference, formalized the idea of a corner as a point where the local image autocorrelation surface has high curvature in two orthogonal directions. The Shi-Tomasi variant from 1994 replaced Harris's response function with the minimum eigenvalue of the structure tensor, which gave more uniform corner quality and became the standard for the Kanade-Lucas-Tomasi (KLT) tracker.
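OpenCV exposes the Shi-Tomasi criterion directly through goodFeaturesToTrack; a minimal sketch, with a placeholder image path and illustrative parameter values:

```python
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# Keep up to 500 points whose minimum structure-tensor eigenvalue is at least
# 1% of the strongest response, spaced at least 10 px apart.
corners = cv2.goodFeaturesToTrack(img, maxCorners=500, qualityLevel=0.01, minDistance=10)
```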
David Lowe's Scale-Invariant Feature Transform was the breakthrough that made wide-baseline matching practical. The full algorithm appeared in his 2004 International Journal of Computer Vision paper Distinctive Image Features from Scale-Invariant Keypoints. SIFT detects keypoints as extrema of the Difference-of-Gaussians (DoG) scale space, assigns each one a dominant orientation, and computes a 128-dimensional histogram of gradient orientations over a 16x16 patch around the keypoint. The result is invariant to scale and rotation and robust to small changes in viewpoint and illumination. SIFT was patented until March 2020, after which it was added to OpenCV's main module.
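Since OpenCV 4.4, SIFT can be used without the contrib module; a minimal sketch (the image path is a placeholder):

```python
import cv2

img = cv2.imread("building.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path
sift = cv2.SIFT_create()
kps, desc = sift.detectAndCompute(img, None)

# Each cv2.KeyPoint carries the sub-pixel position, scale, and orientation
# described above; desc has shape (N, 128), one gradient histogram per point.
x, y = kps[0].pt
scale, angle = kps[0].size, kps[0].angle
```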
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool introduced Speeded Up Robust Features at ECCV 2006. SURF approximates SIFT's DoG with the determinant of the Hessian computed from box filters and integral images, which gives a substantial speedup. The descriptor uses Haar-wavelet responses in 4x4 sub-regions and is typically 64-dimensional. SURF was also patented; unlike SIFT, it still ships only in OpenCV's non-free contrib module.
Edward Rosten and Tom Drummond proposed the FAST corner detector at ECCV 2006 in Machine learning for high-speed corner detection. FAST examines a Bresenham circle of 16 pixels around a candidate point and uses a decision tree learned from training data to classify the point as a corner if a contiguous arc (typically at least 9 of the 16 pixels) is sufficiently brighter or darker than the center. The detector runs an order of magnitude faster than Harris while preserving repeatability.
BRIEF, proposed by Michael Calonder and colleagues in 2010, is a binary descriptor that compares pairs of pixels in a smoothed patch and packs the results into a 256-bit string. Matching reduces to Hamming distance, which modern CPUs compute extremely fast.
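To see why binary descriptors match so quickly, note that Hamming distance is just a XOR followed by a bit count; a small NumPy illustration with random 256-bit strings standing in for BRIEF descriptors:

```python
import numpy as np

# Two random 256-bit descriptors packed into 32 bytes each.
rng = np.random.default_rng(0)
d1 = rng.integers(0, 256, 32, dtype=np.uint8)
d2 = rng.integers(0, 256, 32, dtype=np.uint8)

# Hamming distance: XOR the bytes, then count the set bits.
dist = int(np.unpackbits(d1 ^ d2).sum())
```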
Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski combined these ideas in ORB (Oriented FAST and Rotated BRIEF) at ICCV 2011. ORB adds scale-space sampling and orientation estimation to FAST, and applies a learned sampling pattern to BRIEF that maintains discriminability under rotation. Because ORB has no patent restrictions and runs roughly two orders of magnitude faster than SIFT on similar tasks, it became the default keypoint pipeline in many open-source robotics and SLAM systems, including ORB-SLAM.
MSER (Maximally Stable Extremal Regions), proposed by Jiri Matas and colleagues in 2002, finds connected components in the image that remain stable across a range of intensity thresholds, producing affinely covariant region detections. BRISK (Stefan Leutenegger et al., 2011) and AKAZE (Pablo Alcantarilla, 2013) are scale-invariant binary alternatives to ORB; AKAZE in particular uses nonlinear diffusion in its scale space, which preserves edges better than the Gaussian scale spaces used in SIFT and SURF.
By the late 2010s, neural networks had begun to replace hand-crafted detectors. Learned methods can be trained directly to optimize repeatability and matching score on real image pairs, and they often handle large viewpoint and illumination changes better than classical detectors.
| Method | Year | Venue | Architecture | Key idea |
|---|---|---|---|---|
| LIFT | 2016 | ECCV | CNN | First end-to-end learned detector, orientation, and descriptor |
| SuperPoint | 2018 | CVPR Workshop | Fully convolutional, shared encoder | Self-supervised via Homographic Adaptation |
| D2-Net | 2019 | CVPR | VGG-based dense feature map | Detect-and-describe from a single shared map |
| R2D2 | 2019 | NeurIPS | L2-Net backbone | Predicts both repeatability and reliability |
| DISK | 2020 | NeurIPS | U-Net | Trained with reinforcement learning style policy gradient |
| ALIKE / ALIKED | 2022 / 2023 | TMM / arXiv | Deformable convolutions | Sub-pixel keypoint regression |
| SuperGlue | 2020 | CVPR | Graph neural network | Matches two SuperPoint sets jointly via attention and Sinkhorn |
| LoFTR | 2021 | CVPR | Transformer | Detector-free dense matching using attention |
| LightGlue | 2023 | ICCV | Adaptive transformer | Faster and more accurate successor to SuperGlue |
Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich at Magic Leap proposed SuperPoint in SuperPoint: Self-Supervised Interest Point Detection and Description (CVPR Deep Learning for Visual SLAM Workshop, 2018). The architecture is a fully convolutional network with a shared VGG-style encoder and two heads: one outputs a dense interest-point heatmap and the other a dense 256-dimensional descriptor map. Training uses Homographic Adaptation, in which a base detector is bootstrapped by applying many random homographies to MS COCO images and aggregating the resulting detections to produce robust pseudo-labels. SuperPoint became a default learned front end for many SfM and SLAM pipelines.
Paul-Edouard Sarlin and colleagues at Magic Leap published SuperGlue at CVPR 2020. It takes two sets of local features (typically from SuperPoint) and produces a partial assignment between them by formulating matching as a differentiable optimal transport problem. A graph neural network with self-attention and cross-attention layers reasons jointly about the geometric and visual context of the two images. SuperGlue won three CVPR 2020 challenges in visual localization and image matching.
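The optimal-transport step can be sketched in a few lines of log-domain Sinkhorn normalization. This is a simplified illustration that omits SuperGlue's learnable dustbin row and column for unmatched points:

```python
import torch

def sinkhorn(scores: torch.Tensor, iters: int = 20) -> torch.Tensor:
    """Alternately normalize rows and columns of a score matrix in log space,
    yielding a soft assignment between two keypoint sets (no dustbins, so
    every keypoint is forced to match)."""
    log_p = scores
    for _ in range(iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows sum to 1
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # columns sum to 1
    return log_p.exp()
```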
LightGlue, by Philipp Lindenberger, Sarlin, and Marc Pollefeys at ICCV 2023, revisits SuperGlue's design with several improvements: rotary positional encodings, more efficient attention, an early-exit confidence classifier, and pruning of unmatchable points. It is faster, more accurate, and easier to train than SuperGlue, especially for typical settings with up to 2,000 keypoints per image.
LoFTR (Local Feature TRansformer), introduced by Jiaming Sun and colleagues at CVPR 2021, takes a different route: it skips the explicit detection step entirely and produces dense pixel-level matches using a coarse-to-fine transformer. Detector-free methods often perform better in low-texture or repetitive regions where classical detectors find too few or unreliable points.
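Kornia ships a pretrained LoFTR (see the library table below); a minimal sketch, with random tensors standing in for a real grayscale image pair and API details as of recent Kornia versions:

```python
import torch
import kornia.feature as KF

matcher = KF.LoFTR(pretrained="outdoor").eval()

# LoFTR expects grayscale tensors of shape (B, 1, H, W) with values in [0, 1];
# random tensors here stand in for real images.
img0 = torch.rand(1, 1, 480, 640)
img1 = torch.rand(1, 1, 480, 640)
with torch.no_grad():
    out = matcher({"image0": img0, "image1": img1})
mkpts0, mkpts1 = out["keypoints0"], out["keypoints1"]  # matched pixel coordinates
```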
When the term keypoints appears in human or animal pose estimation, object pose, or face analysis, it usually refers to a fixed set of semantic landmarks rather than autonomously detected interest points. The detector is trained, often as a heatmap regression network, to predict the locations of these predefined points.
| Keypoint set | Count | Domain | Used by |
|---|---|---|---|
| MPII Human Pose | 16 | Human body joints | MPII benchmark, classical pose models |
| COCO Keypoints | 17 | Human body joints | Standard for multi-person 2D pose estimation |
| COCO-WholeBody | 133 | Body, face, hands, feet | Whole-body pose models |
| OpenPose BODY_25 | 25 | Body plus feet | OpenPose body detector |
| MediaPipe Pose (BlazePose) | 33 | Full body, with extra hand and foot reference points | MediaPipe Pose, on-device fitness apps |
| MediaPipe Hands | 21 per hand | Wrist, finger joints, fingertips | MediaPipe Hands |
| MediaPipe Face Mesh | 468 | Dense facial surface | MediaPipe Face Mesh, AR filters |
| iBUG 300-W | 68 | Facial landmarks (eyes, nose, mouth, jawline) | dlib, classical face alignment |
| WFLW | 98 | Dense facial landmarks with attribute tags | WFLW benchmark |
| Halpe Full-Body | 136 | Body, face, hands, feet | AlphaPose Halpe model |
The COCO keypoint format from the Microsoft COCO dataset (Lin et al., 2014) defines 17 person keypoints: nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle. Each keypoint is annotated with an (x, y) pixel position and a visibility flag with three values: 0 for not labeled, 1 for labeled but occluded, and 2 for labeled and visible. The COCO keypoints challenge has been the dominant benchmark for 2D human pose estimation since 2016.
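In the dataset JSON, each person annotation stores the 17 keypoints as a flat list of 51 numbers; a minimal parsing sketch (the zero-filled list is a placeholder for a real annotation):

```python
import numpy as np

ann_keypoints = [0] * 51  # placeholder for annotation["keypoints"] in the COCO JSON
kps = np.asarray(ann_keypoints, dtype=float).reshape(17, 3)
xy = kps[:, :2]   # (x, y) pixel positions
vis = kps[:, 2]   # 0 = not labeled, 1 = labeled but occluded, 2 = visible
```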
The MPII Human Pose dataset (Andriluka et al., 2014) uses 16 body joints organized along a kinematic tree rooted at the pelvis. MPII was the standard for single-person pose estimation before COCO, and it is still used for evaluating models with the PCKh metric.
OpenPose, from Cao et al. at CMU (CVPR 2017, journal version 2019), introduced Part Affinity Fields and a real-time bottom-up multi-person system. The full OpenPose pipeline detects 25 body keypoints (BODY_25), 21 keypoints per hand, and 70 facial keypoints, totaling around 135 points per person. Google's MediaPipe Pose, based on the BlazePose model (Bazarevsky et al., 2020), predicts 33 body keypoints designed as a superset of the COCO topology with additional points on the hands and feet, suitable for fitness and wearable applications. AlphaPose and HRNet target the COCO 17-point format with top-down pipelines, while ViTPose (Xu et al., NeurIPS 2022) demonstrates that plain vision transformers can reach 81.1 AP on the COCO test-dev set. RTMPose (Jiang et al., 2023) emphasizes deployment, achieving more than 90 FPS on a CPU at competitive accuracy.
Keypoints also drive 6-DoF object pose estimation, where the goal is to recover the 3D rotation and translation of a known object relative to the camera. A common pipeline detects 2D keypoints corresponding to predefined 3D points on the object (often the eight corners of its 3D bounding box, or a sparse set of surface points), then solves the Perspective-n-Point (PnP) problem to recover the camera pose given those 2D-3D correspondences. Methods such as PVNet (Peng et al., 2019) and KeypointNet learn to vote for object keypoints from local features, and combine well with iterative refinement using the Iterative Closest Point algorithm. This approach underlies many robotic grasping and AR object-tracking systems.
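A self-contained sketch of the PnP step with OpenCV, projecting the corners of a synthetic unit cube with a known pose and then recovering that pose from the 2D-3D correspondences (intrinsics and pose values are illustrative assumptions):

```python
import numpy as np
import cv2

# Eight corners of a unit cube in the object frame, standing in for a
# detected 3D bounding box.
object_pts = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                      dtype=np.float32)
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float32)  # assumed intrinsics

# Simulate keypoint detections by projecting with a known ground-truth pose.
rvec_true = np.array([0.1, -0.2, 0.05], dtype=np.float32)
tvec_true = np.array([0.2, 0.1, 5.0], dtype=np.float32)
image_pts, _ = cv2.projectPoints(object_pts, rvec_true, tvec_true, K, None)

# Recover the pose; rvec/tvec should match the ground truth above.
ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None)
R, _ = cv2.Rodrigues(rvec)  # 3x3 rotation matrix; tvec is the translation
```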
Keypoints appear in almost every classical computer vision pipeline that has to align, match, or reason about geometry.
| Application | How keypoints are used |
|---|---|
| Image stitching and panoramas | Detect keypoints in overlapping photographs, match them, and estimate a homography to align the images. Used by Hugin, AutoStitch, and the iOS Camera panorama mode. |
| Structure from Motion | SfM systems such as COLMAP, OpenMVG, and Bundler use SIFT or a learned alternative to find correspondences across many images, triangulate 3D points, and refine camera poses with bundle adjustment. |
| Visual SLAM | SLAM systems such as ORB-SLAM2 and ORB-SLAM3 maintain a map of keypoint landmarks and the camera trajectory in real time on a CPU. LSD-SLAM and DSO, by contrast, are direct methods that operate on pixel intensities rather than keypoints. |
| Image retrieval and place recognition | Keypoint descriptors are quantized into visual words (Bag of Visual Words, Sivic and Zisserman 2003) or aggregated into compact image descriptors such as VLAD and NetVLAD for retrieval against large databases. |
| Tracking | The Kanade-Lucas-Tomasi (KLT) tracker follows Shi-Tomasi keypoints across video frames using sparse optical flow (see the sketch after this table). Modern visual-inertial odometry systems still use this idea on the front end. |
| Pose estimation | Semantic keypoints encode the configuration of a person, hand, or face for action recognition, animation, fitness coaching, and sign-language analysis. |
| Augmented and virtual reality | Keypoint tracking aligns virtual content with real surfaces; ARKit and ARCore combine keypoint-based visual odometry with inertial sensing. |
| Robotic manipulation | Object keypoints such as the rim of a mug or the handle of a screwdriver provide affordances for grasping policies including kPAM (Manuelli et al., 2019) and Dense Object Nets. |
| Medical imaging | Anatomical landmarks support image registration, growth tracking, and surgical navigation. |
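As referenced in the tracking row above, a sparse optical-flow step with OpenCV's pyramidal Lucas-Kanade implementation looks like this (frame paths are placeholders for consecutive video frames):

```python
import cv2

prev_gray = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)  # placeholder paths
next_gray = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# Detect Shi-Tomasi keypoints in the first frame, then track them into the next.
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=300, qualityLevel=0.01, minDistance=8)
next_pts, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
tracked = next_pts[status.ravel() == 1]  # keep only successfully tracked points
```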
Different keypoint problems use different metrics, but they cluster into two broad families.
For low-level detector-style keypoints, the standard measures come from the Mikolajczyk and Schmid (2005) evaluation protocol. Repeatability rate is the fraction of keypoints detected in one image that are also detected in a transformed version, after compensating for the known transformation. Matching score is the ratio of correct matches to the total number of detected features in the overlap region of two images. Mean Matching Accuracy (MMA) is used in the HPatches benchmark (Balntas et al., 2017) at multiple pixel error thresholds. Pose accuracy, for SfM and SLAM evaluations, is the median rotation and translation error after estimating relative pose from matched keypoints.
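A simplified repeatability computation, given keypoint coordinates from both images and the known homography between them; this sketch ignores the overlap-region and scale checks of the full protocol:

```python
import numpy as np
import cv2

def repeatability(kps1, kps2, H, eps=3.0):
    """Fraction of image-1 keypoints (N x 2 array) with a detection within
    eps pixels in image 2 (M x 2 array) after warping by the homography H."""
    pts = cv2.perspectiveTransform(kps1.reshape(-1, 1, 2).astype(np.float32), H).reshape(-1, 2)
    d = np.linalg.norm(pts[:, None, :] - kps2[None, :, :], axis=2)  # pairwise distances
    return float((d.min(axis=1) < eps).mean())
```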
For semantic pose keypoints, the standard metrics are PCK, OKS, and MPJPE. PCK (Percentage of Correct Keypoints) marks a keypoint as correct if its predicted location is within a fraction of the body or head size from the ground truth; PCKh@0.5, normalized by head size at threshold 0.5, is the MPII standard. OKS (Object Keypoint Similarity) and the corresponding AP based on OKS form the COCO standard, which weights each keypoint by its annotation difficulty and normalizes by person scale. MPJPE (Mean Per Joint Position Error) is the average Euclidean distance between predicted and ground-truth 3D joints, and is the dominant metric for 3D pose estimation on Human3.6M.
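OKS itself fits in a few lines; this sketch follows the published formula, with the per-keypoint difficulty constants k taken from the COCO API:

```python
import numpy as np

def oks(pred, gt, vis, area, k):
    """Object Keypoint Similarity between predicted and ground-truth poses.
    pred, gt: (17, 2) keypoint arrays; vis: ground-truth visibility flags;
    area: person segment area; k: per-keypoint difficulty constants."""
    d2 = ((pred - gt) ** 2).sum(axis=1)
    e = d2 / (2.0 * area * k ** 2)   # scale- and difficulty-normalized error
    labeled = vis > 0
    return float(np.exp(-e[labeled]).mean())
```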
Most keypoint algorithms have well-tested implementations in open-source libraries:
| Library | Language | What it provides |
|---|---|---|
| OpenCV | C++, Python | SIFT, SURF (contrib), ORB, BRISK, AKAZE, FAST, KAZE, MSER, and the KLT/Lucas-Kanade sparse optical flow tracker |
| Kornia | PyTorch | SIFT, SuperPoint, LoFTR, LightGlue, differentiable feature matching |
| MMPose | PyTorch | HRNet, ViTPose, RTMPose, SimCC, top-down and bottom-up pose pipelines |
| MediaPipe | C++, Python, JS | Pose, Hands, Face Mesh, Holistic with on-device inference |
| dlib | C++, Python | 68-point facial landmark detector based on regression trees |
| COLMAP | C++ | SIFT-based Structure from Motion and Multi-View Stereo |
| ORB-SLAM3 | C++ | Real-time visual SLAM with ORB keypoints |
| HLoc | Python | Visual localization toolbox combining SuperPoint, SuperGlue, LightGlue |
| pykitti / PyTorch3D | Python | KITTI dataset loading (pykitti); differentiable cameras, PnP, and 3D transforms (PyTorch3D) |
A typical OpenCV keypoint workflow looks like the following Python snippet (image paths are placeholders):

```python
import cv2

# Load the two images to match (placeholder paths).
img1 = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)

# Detect up to 2000 ORB keypoints per image and compute binary descriptors.
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force Hamming matching with cross-checking, best matches first.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)
```
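Continuing that snippet, the matched keypoints feed directly into robust geometric estimation, for example a RANSAC homography of the kind used in stitching (the 3-pixel inlier threshold is a typical but assumed value):

```python
import numpy as np

# Pixel coordinates of the matched keypoints from the snippet above.
src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# Robustly estimate the homography mapping image 1 onto image 2.
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
```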
The strict separation between sparse keypoints and dense pixel features has eroded as foundation vision models have matured. Self-supervised models such as DINOv2 (Oquab et al., 2024) produce dense per-patch features whose nearest-neighbor matches across images often rival hand-crafted keypoint pipelines, and they generalize across domains where classical methods struggle. Segment Anything (Kirillov et al., 2023) provides a different perspective: instead of keypoints, it offers prompt-driven dense masks that can serve similar roles in alignment and tracking.
At the same time, hybrid pipelines are common. A typical modern SfM stack pairs SuperPoint detection with LightGlue matching, falls back to LoFTR for difficult pairs, and uses COLMAP for triangulation and bundle adjustment. For pose estimation, RTMPose, ViTPose, and MediaPipe each occupy different points on the speed-accuracy frontier, and there is no single dominant model. Keypoints remain useful precisely because they offer a compact, geometric, interpretable summary of an image, even when the rest of the system is dense and learned.