Keypoints, also called interest points, feature points, or salient points, are distinctive locations in an image (or other signal) that can be reliably detected and described across viewpoint, scale, lighting, and noise changes. In computer vision, keypoints serve as anchors for feature matching, tracking, image alignment, and 3D reconstruction. Each keypoint typically carries a 2D pixel coordinate, an optional scale, an orientation, and a descriptor vector that summarizes the local image patch around it.
The term covers two related but distinct uses. Detection-style keypoints are corners, blobs, and other low-level structures that are repeatable under transformation; classical detectors such as Harris, SIFT, SURF, and ORB fall in this category. Semantic keypoints, sometimes called landmarks, are predefined anatomical or object-specific locations, for example the 17 body joints in the COCO format or the 21 landmarks (wrist, finger joints, and fingertips) of a standard hand model. Both share the same basic idea: reduce an image to a small set of meaningful points whose positions and surrounding patches encode useful structure for downstream tasks.
Dense image processing operates on every pixel, which is wasteful when most pixels are uninformative. A flat patch of sky or a stretch of uniform wall provides almost no constraint when matching one image to another. Keypoints select the few pixels that are most informative, usually because the local intensity pattern changes sharply in more than one direction (a corner) or shows a distinctive blob structure at a particular scale.
Reducing an image to a sparse set of keypoints with descriptors makes many problems tractable. Two photographs of the same building taken from different angles can be aligned by finding the same physical points in both images and solving for the geometric transformation that maps one set onto the other. The same idea drives panorama stitching, Structure from Motion, SLAM, object recognition, and visual place recognition.
For pose estimation, the term takes on a more semantic meaning. Rather than detecting whatever points happen to be locally distinctive, the system is trained to localize specific anatomical or object-defined positions: the tip of the nose, the corner of the eye, the top-left corner of a license plate. These keypoints carry meaning that is consistent across instances, which lets downstream models reason about pose, shape, and identity.
The survey literature, notably Tuytelaars and Mikolajczyk's 2008 review of local invariant feature detectors, identifies a now-standard set of desirable properties. A detector should produce points that are:
| Property | Meaning |
|---|---|
| Repeatability | The same physical point is detected in multiple images of the same scene under different viewpoints, scales, lighting, and noise levels. |
| Distinctiveness | The patch around each keypoint is unique enough that its descriptor can be matched against a large database with low ambiguity. |
| Locality | The keypoint covers a small image region, so partial occlusion and small geometric distortions affect only a few keypoints rather than all of them. |
| Quantity | The detector finds enough points to support the task. Image stitching needs only dozens; SfM on a building can need tens of thousands. |
| Accuracy | Sub-pixel localization, both spatially and across scale and orientation, so that geometric estimates such as homographies or fundamental matrices are precise. |
| Efficiency | Detection and description are fast enough for the target application, ranging from offline 3D reconstruction to real-time SLAM on a mobile robot. |
No single detector wins on every axis. Real systems trade off these properties depending on whether they need real-time performance, robustness to severe viewpoint change, or fine localization for high-precision metrology.
Classical keypoint algorithms dominated computer vision from the late 1980s through the mid-2010s and are still widely used today. Most follow a two-stage pipeline: a detector picks salient pixel locations, and a descriptor summarizes the local appearance into a vector that can be compared with descriptors from other images.
| Detector / descriptor | Year | Authors | Detector type | Descriptor | Notes |
|---|---|---|---|---|---|
| Harris corner | 1988 | Harris and Stephens | Corner from local autocorrelation matrix | None (detector only) | Foundational corner detector still used as a building block |
| Shi-Tomasi (Good Features to Track) | 1994 | Shi and Tomasi | Minimum eigenvalue of structure tensor | None | Variant of Harris tuned for KLT tracking |
| MSER | 2002 | Matas et al. | Maximally stable extremal regions | Region descriptor | Detects affinely covariant regions |
| SIFT | 1999 / 2004 | David Lowe | Difference-of-Gaussians blobs across scales | 128-D gradient histogram | Scale and rotation invariant; long-time gold standard |
| SURF | 2006 | Bay et al. | Hessian determinant via integral images | 64-D Haar-wavelet descriptor | Several times faster than SIFT |
| FAST | 2006 | Rosten and Drummond | Bresenham-circle pixel comparison | None | Extremely fast corner detector trained via decision tree |
| BRIEF | 2010 | Calonder et al. | None (descriptor only) | 256-bit binary string from pixel comparisons | Fast Hamming-distance matching |
| BRISK | 2011 | Leutenegger et al. | Scale-space FAST | 512-bit binary descriptor | Scale and rotation invariant binary alternative |
| ORB | 2011 | Rublee et al. | Oriented FAST | Rotated BRIEF | Free for commercial use; the keypoint front end of ORB-SLAM |
| KAZE / AKAZE | 2012 / 2013 | Alcantarilla et al. | Nonlinear scale space | M-LDB binary descriptor | Better edge preservation than Gaussian scale space |
The Harris detector, introduced by Chris Harris and Mike Stephens at the 1988 Alvey Vision Conference, formalized the idea of a corner as a point where the local image autocorrelation surface has high curvature in two orthogonal directions. The Shi-Tomasi variant from 1994 replaced Harris's response function with the minimum eigenvalue of the structure tensor, which gave more uniform corner quality and became the standard for the Kanade-Lucas-Tomasi (KLT) tracker.
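OpenCV exposes the Shi-Tomasi criterion directly through goodFeaturesToTrack; a minimal sketch, with a placeholder image path and illustrative parameter values:

```python
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# Keep up to 500 points whose minimum structure-tensor eigenvalue is at least
# 1% of the strongest response, spaced at least 10 px apart.
corners = cv2.goodFeaturesToTrack(img, maxCorners=500, qualityLevel=0.01, minDistance=10)
```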
David Lowe's Scale-Invariant Feature Transform was the breakthrough that made wide-baseline matching practical. The full algorithm appeared in his 2004 International Journal of Computer Vision paper Distinctive Image Features from Scale-Invariant Keypoints. SIFT detects keypoints as extrema of the Difference-of-Gaussians (DoG) scale space, assigns each one a dominant orientation, and computes a 128-dimensional histogram of gradient orientations over a 16x16 patch around the keypoint. The result is invariant to scale and rotation and robust to small changes in viewpoint and illumination. SIFT was patented until March 2020, after which it was added to OpenCV's main module.
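Since OpenCV 4.4, SIFT can be used without the contrib module; a minimal sketch (the image path is a placeholder):

```python
import cv2

img = cv2.imread("building.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path
sift = cv2.SIFT_create()
kps, desc = sift.detectAndCompute(img, None)

# Each cv2.KeyPoint carries the sub-pixel position, scale, and orientation
# described above; desc has shape (N, 128), one gradient histogram per point.
x, y = kps[0].pt
scale, angle = kps[0].size, kps[0].angle
```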
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool introduced Speeded Up Robust Features at ECCV 2006. SURF approximates SIFT's DoG with the determinant of the Hessian computed from box filters and integral images, which gives a substantial speedup. The descriptor uses Haar-wavelet responses in 4x4 sub-regions and is typically 64-dimensional. SURF was also patented; unlike SIFT, it still ships only in OpenCV's non-free contrib module.
Edward Rosten and Tom Drummond proposed the FAST corner detector at ECCV 2006 in Machine learning for high-speed corner detection. FAST examines a Bresenham circle of 16 pixels around a candidate point and uses a decision tree learned from training data to classify the point as a corner if a contiguous arc (typically at least 9 of the 16 pixels) is sufficiently brighter or darker than the center. The detector runs an order of magnitude faster than Harris while preserving repeatability.
BRIEF, proposed by Michael Calonder and colleagues in 2010, is a binary descriptor that compares pairs of pixels in a smoothed patch and packs the results into a 256-bit string. Matching reduces to Hamming distance, which modern CPUs compute extremely fast.
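To see why binary descriptors match so quickly, note that Hamming distance is just a XOR followed by a bit count; a small NumPy illustration with random 256-bit strings standing in for BRIEF descriptors:

```python
import numpy as np

# Two random 256-bit descriptors packed into 32 bytes each.
rng = np.random.default_rng(0)
d1 = rng.integers(0, 256, 32, dtype=np.uint8)
d2 = rng.integers(0, 256, 32, dtype=np.uint8)

# Hamming distance: XOR the bytes, then count the set bits.
dist = int(np.unpackbits(d1 ^ d2).sum())
```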
Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski combined these ideas in ORB (Oriented FAST and Rotated BRIEF) at ICCV 2011. ORB adds scale-space sampling and orientation estimation to FAST, and applies a learned sampling pattern to BRIEF that maintains discriminability under rotation. Because ORB has no patent restrictions and runs roughly two orders of magnitude faster than SIFT on similar tasks, it became the default keypoint pipeline in many open-source robotics and SLAM systems, including ORB-SLAM.
MSER (Maximally Stable Extremal Regions), proposed by Jiri Matas and colleagues in 2002, finds connected components in the image that remain stable across a range of intensity thresholds, producing affinely covariant region detections. BRISK (Stefan Leutenegger et al., 2011) and AKAZE (Pablo Alcantarilla, 2013) are scale-invariant binary alternatives to ORB; AKAZE in particular uses nonlinear diffusion in its scale space, which preserves edges better than the Gaussian scale spaces used in SIFT and SURF.
By the late 2010s, neural networks had begun to replace hand-crafted detectors. Learned methods can be trained directly to optimize repeatability and matching score on real image pairs, and they often handle large viewpoint and illumination changes better than classical detectors.
| Method | Year | Venue | Architecture | Key idea |
|---|---|---|---|---|
| LIFT | 2016 | ECCV | CNN | First end-to-end learned detector, orientation, and descriptor |
| SuperPoint | 2018 | CVPR Workshop | Fully convolutional, shared encoder | Self-supervised via Homographic Adaptation |
| D2-Net | 2019 | CVPR | VGG-based dense feature map | Detect-and-describe from a single shared map |
| R2D2 | 2019 | NeurIPS | L2-Net backbone | Predicts both repeatability and reliability |
| DISK | 2020 | NeurIPS | U-Net | Trained with reinforcement learning style policy gradient |
| ALIKE / ALIKED | 2022 / 2023 | TMM / arXiv | Deformable convolutions | Sub-pixel keypoint regression |
| SuperGlue | 2020 | CVPR | Graph neural network | Matches two SuperPoint sets jointly via attention and Sinkhorn |
| LoFTR | 2021 | CVPR | Transformer | Detector-free dense matching using attention |
| LightGlue | 2023 | ICCV | Adaptive transformer | Faster and more accurate successor to SuperGlue |
Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich at Magic Leap proposed SuperPoint in SuperPoint: Self-Supervised Interest Point Detection and Description (CVPR Deep Learning for Visual SLAM Workshop, 2018). The architecture is a fully convolutional network with a shared VGG-style encoder and two heads: one outputs a dense interest-point heatmap and the other a dense 256-dimensional descriptor map. Training uses Homographic Adaptation, in which a base detector is bootstrapped by applying many random homographies to MS COCO images and aggregating the resulting detections to produce robust pseudo-labels. SuperPoint became a default learned front end for many SfM and SLAM pipelines.
Paul-Edouard Sarlin and colleagues at Magic Leap published SuperGlue at CVPR 2020. It takes two sets of local features (typically from SuperPoint) and produces a partial assignment between them by formulating matching as a differentiable optimal transport problem. A graph neural network with self-attention and cross-attention layers reasons jointly about the geometric and visual context of the two images. SuperGlue won three CVPR 2020 challenges in visual localization and image matching.
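The optimal-transport step can be sketched in a few lines of log-domain Sinkhorn normalization. This is a simplified illustration that omits SuperGlue's learnable dustbin row and column for unmatched points:

```python
import torch

def sinkhorn(scores: torch.Tensor, iters: int = 20) -> torch.Tensor:
    """Alternately normalize rows and columns of a score matrix in log space,
    yielding a soft assignment between two keypoint sets (no dustbins, so
    every keypoint is forced to match)."""
    log_p = scores
    for _ in range(iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows sum to 1
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # columns sum to 1
    return log_p.exp()
```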
LightGlue, by Philipp Lindenberger, Sarlin, and Marc Pollefeys at ICCV 2023, revisits SuperGlue's design with several improvements: rotary positional encodings, more efficient attention, an early-exit confidence classifier, and pruning of unmatchable points. It is faster, more accurate, and easier to train than SuperGlue, especially for typical settings with up to 2,000 keypoints per image.
LoFTR (Local Feature TRansformer), introduced by Jiaming Sun and colleagues at CVPR 2021, takes a different route: it skips the explicit detection step entirely and produces dense pixel-level matches using a coarse-to-fine transformer. Detector-free methods often perform better in low-texture or repetitive regions where classical detectors find too few or unreliable points.
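Kornia ships a pretrained LoFTR (see the library table below); a minimal sketch, with random tensors standing in for a real grayscale image pair and API details as of recent Kornia versions:

```python
import torch
import kornia.feature as KF

matcher = KF.LoFTR(pretrained="outdoor").eval()

# LoFTR expects grayscale tensors of shape (B, 1, H, W) with values in [0, 1];
# random tensors here stand in for real images.
img0 = torch.rand(1, 1, 480, 640)
img1 = torch.rand(1, 1, 480, 640)
with torch.no_grad():
    out = matcher({"image0": img0, "image1": img1})
mkpts0, mkpts1 = out["keypoints0"], out["keypoints1"]  # matched pixel coordinates
```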
When the term keypoints appears in human or animal pose estimation, object pose, or face analysis, it usually refers to a fixed set of semantic landmarks rather than autonomously detected interest points. The detector is trained, often as a heatmap regression network, to predict the locations of these predefined points.
| Keypoint set | Count | Domain | Used by |
|---|---|---|---|
| MPII Human Pose | 16 | Human body joints | MPII benchmark, classical pose models |
| COCO Keypoints | 17 | Human body joints | Standard for multi-person 2D pose estimation |
| COCO-WholeBody | 133 | Body, face, hands, feet | Whole-body pose models |
| OpenPose BODY_25 | 25 | Body plus feet | OpenPose body detector |
| MediaPipe Pose (BlazePose) | 33 | Full body, with extra hand and foot reference points | MediaPipe Pose, on-device fitness apps |
| MediaPipe Hands | 21 per hand | Wrist, finger joints, fingertips | MediaPipe Hands |
| MediaPipe Face Mesh | 468 | Dense facial surface | MediaPipe Face Mesh, AR filters |
| iBUG 300-W | 68 | Facial landmarks (eyes, nose, mouth, jawline) | dlib, classical face alignment |
| WFLW | 98 | Dense facial landmarks with attribute tags | WFLW benchmark |
| Halpe Full-Body | 136 | Body, face, hands, feet | AlphaPose Halpe model |
The COCO keypoint format from the Microsoft COCO dataset (Lin et al., 2014) defines 17 person keypoints: nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle. Each keypoint is annotated with an (x, y) pixel position and a visibility flag with three values: 0 for not labeled, 1 for labeled but occluded, and 2 for labeled and visible. The COCO keypoints challenge has been the dominant benchmark for 2D human pose estimation since 2016.
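In the dataset JSON, each person annotation stores the 17 keypoints as a flat list of 51 numbers; a minimal parsing sketch (the zero-filled list is a placeholder for a real annotation):

```python
import numpy as np

ann_keypoints = [0] * 51  # placeholder for annotation["keypoints"] in the COCO JSON
kps = np.asarray(ann_keypoints, dtype=float).reshape(17, 3)
xy = kps[:, :2]   # (x, y) pixel positions
vis = kps[:, 2]   # 0 = not labeled, 1 = labeled but occluded, 2 = visible
```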
The MPII Human Pose dataset (Andriluka et al., 2014) uses 16 body joints organized along a kinematic tree rooted at the pelvis. MPII was the standard for single-person pose estimation before COCO, and it is still used for evaluating models with the PCKh metric.
OpenPose, from Cao et al. at CMU (CVPR 2017, journal version 2019), introduced Part Affinity Fields and a real-time bottom-up multi-person system. The full OpenPose pipeline detects 25 body keypoints (BODY_25), 21 keypoints per hand, and 70 facial keypoints, totaling around 135 points per person. Google's MediaPipe Pose, based on the BlazePose model (Bazarevsky et al., 2020), predicts 33 body keypoints designed as a superset of the COCO topology with additional points on the hands and feet, suitable for fitness and wearable applications. AlphaPose and HRNet target the COCO 17-point format with top-down pipelines, while ViTPose (Xu et al., NeurIPS 2022) demonstrates that plain vision transformers can reach 81.1 AP on the COCO test-dev set. RTMPose (Jiang et al., 2023) emphasizes deployment, achieving more than 90 FPS on a CPU at competitive accuracy.
Keypoints also drive 6-DoF object pose estimation, where the goal is to recover the 3D rotation and translation of a known object relative to the camera. A common pipeline detects 2D keypoints corresponding to predefined 3D points on the object (often the eight corners of its 3D bounding box, or a sparse set of surface points), then solves the Perspective-n-Point (PnP) problem to recover the camera pose given those 2D-3D correspondences. Methods such as PVNet (Peng et al., 2019) and KeypointNet learn to vote for object keypoints from local features, and combine well with iterative refinement using the Iterative Closest Point algorithm. This approach underlies many robotic grasping and AR object-tracking systems.
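A self-contained sketch of the PnP step with OpenCV, projecting the corners of a synthetic unit cube with a known pose and then recovering that pose from the 2D-3D correspondences (intrinsics and pose values are illustrative assumptions):

```python
import numpy as np
import cv2

# Eight corners of a unit cube in the object frame, standing in for a
# detected 3D bounding box.
object_pts = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                      dtype=np.float32)
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float32)  # assumed intrinsics

# Simulate keypoint detections by projecting with a known ground-truth pose.
rvec_true = np.array([0.1, -0.2, 0.05], dtype=np.float32)
tvec_true = np.array([0.2, 0.1, 5.0], dtype=np.float32)
image_pts, _ = cv2.projectPoints(object_pts, rvec_true, tvec_true, K, None)

# Recover the pose; rvec/tvec should match the ground truth above.
ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None)
R, _ = cv2.Rodrigues(rvec)  # 3x3 rotation matrix; tvec is the translation
```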
Keypoints appear in almost every classical computer vision pipeline that has to align, match, or reason about geometry.
| Application | How keypoints are used |
|---|---|
| Image stitching and panoramas | Detect keypoints in overlapping photographs, match them, and estimate a homography to align the images. Used by Hugin, AutoStitch, and the iOS Camera panorama mode. |
| Structure from Motion | SfM systems such as COLMAP, OpenMVG, and Bundler use SIFT or a learned alternative to find correspondences across many images, triangulate 3D points, and refine camera poses with bundle adjustment. |
| Visual SLAM | SLAM systems such as ORB-SLAM2 and ORB-SLAM3 maintain a map of keypoint landmarks and the camera trajectory in real time on a CPU. LSD-SLAM and DSO, by contrast, are direct methods that operate on pixel intensities rather than keypoints. |
| Image retrieval and place recognition | Keypoint descriptors are quantized into visual words (Bag of Visual Words, Sivic and Zisserman 2003) or aggregated into compact image descriptors such as VLAD and NetVLAD for retrieval against large databases. |
| Tracking | The Kanade-Lucas-Tomasi (KLT) tracker follows Shi-Tomasi keypoints across video frames using sparse optical flow (see the sketch after this table). Modern visual-inertial odometry systems still use this idea on the front end. |
| Pose estimation | Semantic keypoints encode the configuration of a person, hand, or face for action recognition, animation, fitness coaching, and sign-language analysis. |
| Augmented and virtual reality | Keypoint tracking aligns virtual content with real surfaces; ARKit and ARCore combine keypoint-based visual odometry with inertial sensing. |
| Robotic manipulation | Object keypoints such as the rim of a mug or the handle of a screwdriver provide affordances for grasping policies including kPAM (Manuelli et al., 2019) and Dense Object Nets. |
| Medical imaging | Anatomical landmarks support image registration, growth tracking, and surgical navigation. |
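As referenced in the tracking row above, a sparse optical-flow step with OpenCV's pyramidal Lucas-Kanade implementation looks like this (frame paths are placeholders for consecutive video frames):

```python
import cv2

prev_gray = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)  # placeholder paths
next_gray = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# Detect Shi-Tomasi keypoints in the first frame, then track them into the next.
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=300, qualityLevel=0.01, minDistance=8)
next_pts, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
tracked = next_pts[status.ravel() == 1]  # keep only successfully tracked points
```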
Different keypoint problems use different metrics, but they cluster into two broad families.
For low-level detector-style keypoints, the standard measures come from the Mikolajczyk and Schmid (2005) evaluation protocol. Repeatability rate is the fraction of keypoints detected in one image that are also detected in a transformed version, after compensating for the known transformation. Matching score is the ratio of correct matches to the total number of detected features in the overlap region of two images. Mean Matching Accuracy (MMA) is used in the HPatches benchmark (Balntas et al., 2017) at multiple pixel error thresholds. Pose accuracy, for SfM and SLAM evaluations, is the median rotation and translation error after estimating relative pose from matched keypoints.
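A simplified repeatability computation, given keypoint coordinates from both images and the known homography between them; this sketch ignores the overlap-region and scale checks of the full protocol:

```python
import numpy as np
import cv2

def repeatability(kps1, kps2, H, eps=3.0):
    """Fraction of image-1 keypoints (N x 2 array) with a detection within
    eps pixels in image 2 (M x 2 array) after warping by the homography H."""
    pts = cv2.perspectiveTransform(kps1.reshape(-1, 1, 2).astype(np.float32), H).reshape(-1, 2)
    d = np.linalg.norm(pts[:, None, :] - kps2[None, :, :], axis=2)  # pairwise distances
    return float((d.min(axis=1) < eps).mean())
```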
For semantic pose keypoints, the standard metrics are PCK, OKS, and MPJPE. PCK (Percentage of Correct Keypoints) marks a keypoint as correct if its predicted location is within a fraction of the body or head size from the ground truth; PCKh@0.5, normalized by head size at threshold 0.5, is the MPII standard. OKS (Object Keypoint Similarity) and the corresponding AP based on OKS form the COCO standard, which weights each keypoint by its annotation difficulty and normalizes by person scale. MPJPE (Mean Per Joint Position Error) is the average Euclidean distance between predicted and ground-truth 3D joints, and is the dominant metric for 3D pose estimation on Human3.6M.
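OKS itself fits in a few lines; this sketch follows the published formula, with the per-keypoint difficulty constants k taken from the COCO API:

```python
import numpy as np

def oks(pred, gt, vis, area, k):
    """Object Keypoint Similarity between predicted and ground-truth poses.
    pred, gt: (17, 2) keypoint arrays; vis: ground-truth visibility flags;
    area: person segment area; k: per-keypoint difficulty constants."""
    d2 = ((pred - gt) ** 2).sum(axis=1)
    e = d2 / (2.0 * area * k ** 2)   # scale- and difficulty-normalized error
    labeled = vis > 0
    return float(np.exp(-e[labeled]).mean())
```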
Most keypoint algorithms have well-tested implementations in open-source libraries:
| Library | Language | What it provides |
|---|---|---|
| OpenCV | C++, Python | SIFT, SURF (contrib), ORB, BRISK, AKAZE, FAST, KAZE, MSER, and the KLT/Lucas-Kanade sparse optical flow tracker |
| Kornia | PyTorch | SIFT, SuperPoint, LoFTR, LightGlue, differentiable feature matching |
| MMPose | PyTorch | HRNet, ViTPose, RTMPose, SimCC, top-down and bottom-up pose pipelines |
| MediaPipe | C++, Python, JS | Pose, Hands, Face Mesh, Holistic with on-device inference |
| dlib | C++, Python | 68-point facial landmark detector based on regression trees |
| COLMAP | C++ | SIFT-based Structure from Motion and Multi-View Stereo |
| ORB-SLAM3 | C++ | Real-time visual SLAM with ORB keypoints |
| HLoc | Python | Visual localization toolbox combining SuperPoint, SuperGlue, LightGlue |
| pykitti / PyTorch3D | Python | KITTI dataset loading (pykitti); differentiable cameras, PnP, and 3D transforms (PyTorch3D) |
A typical OpenCV keypoint workflow looks like the following Python snippet (image paths are placeholders):

```python
import cv2

# Load the two images to match (placeholder paths).
img1 = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)

# Detect up to 2000 ORB keypoints per image and compute binary descriptors.
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force Hamming matching with cross-checking, best matches first.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)
```
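Continuing that snippet, the matched keypoints feed directly into robust geometric estimation, for example a RANSAC homography of the kind used in stitching (the 3-pixel inlier threshold is a typical but assumed value):

```python
import numpy as np

# Pixel coordinates of the matched keypoints from the snippet above.
src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# Robustly estimate the homography mapping image 1 onto image 2.
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
```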
The strict separation between sparse keypoints and dense pixel features has eroded as foundation vision models have matured. Self-supervised models such as DINOv2 (Oquab et al., 2024) produce dense per-patch features whose nearest-neighbor matches across images often rival hand-crafted keypoint pipelines, and they generalize across domains where classical methods struggle. Segment Anything (Kirillov et al., 2023) provides a different perspective: instead of keypoints, it offers prompt-driven dense masks that can serve similar roles in alignment and tracking.
At the same time, hybrid pipelines are common. A typical modern SfM stack pairs SuperPoint detection with LightGlue matching, falls back to LoFTR for difficult pairs, and uses COLMAP for triangulation and bundle adjustment. For pose estimation, RTMPose, ViTPose, and MediaPipe each occupy different points on the speed-accuracy frontier, and there is no single dominant model. Keypoints remain useful precisely because they offer a compact, geometric, interpretable summary of an image, even when the rest of the system is dense and learned.