Landmarks are reference points used as anchors in two largely separate areas of machine learning. In manifold learning and dimension reduction, landmarks are a small subset of data points selected to represent the geometry of a much larger dataset, so that expensive operations on a full N x N distance or kernel matrix can be replaced by cheaper operations on an N x L matrix with L much smaller than N. In computer vision, landmarks are predefined semantic points on objects, faces, hands, or bodies, for example the corners of the eyes, the tip of the nose, or the joints of a finger. The two uses share the same underlying intuition, namely that a sparse set of well-chosen anchor points can summarize a structure that is too expensive to model in full, but the algorithms, datasets, and applications are otherwise distinct.
This article covers both meanings. The first half explains landmark-based dimension reduction, including Landmark MDS and the Nystrom approximation. The second half covers facial, body, and hand landmarks in computer vision, the canonical landmark sets used by datasets and libraries, and the major detection methods.
The word landmark shows up in machine learning in three related senses.
| Sense | What it means | Typical context |
|---|---|---|
| Landmarks in dimension reduction | A small subset of data points used to approximate distances or kernel evaluations on the full dataset | Landmark MDS, Landmark Isomap, Nystrom approximation, spectral clustering at scale |
| Anatomical landmarks | Predefined semantic points on a face, body, hand, or organ | dlib face alignment, MediaPipe Face Mesh, medical image registration |
| Object-pose landmarks | Fixed semantic points on a rigid object used for 6-DoF pose estimation | Robot grasping, AR object tracking, satellite docking |
In classical computer vision the difference between keypoints and landmarks matters: keypoints are detector-driven, found wherever the local image structure is distinctive, while landmarks are predefined semantic positions that the model is trained to localize. The two terms do leak into each other, however, and many recent papers use them interchangeably.
Many dimension reduction and clustering algorithms scale poorly with the number of data points. Classical multidimensional scaling, kernel PCA, spectral clustering, Gaussian process regression, and Isomap all require the eigendecomposition of an N x N matrix, which costs O(N^3) time and O(N^2) memory. Even storing the full pairwise distance matrix becomes infeasible once N exceeds a few tens of thousands. Landmark methods solve this by choosing a smaller set of L landmark points, performing the expensive computation on a reduced L x L or L x N matrix, then projecting the remaining points into the same low-dimensional space using cheap linear operations.
The asymptotic gain is large. Doing classical MDS on N points costs O(N^3). Doing Landmark MDS with L landmarks costs roughly O(L^3 + L N d), which is linear in N for fixed L and d. For datasets with millions of points this is the difference between an algorithm that runs in seconds and one that does not run at all.
Landmark Multidimensional Scaling, introduced by Vin de Silva and Joshua B. Tenenbaum in a 2004 Stanford technical report titled Sparse multidimensional scaling using landmark points, was the first algorithm to make the landmark idea explicit for multidimensional scaling. The procedure has three steps. First, pick L landmarks from the N data points. Second, run classical MDS on the L x L distance matrix between landmarks to obtain a d-dimensional embedding of the landmarks. Third, embed each remaining point by a distance-based triangulation, which uses only the L distances from that point to the landmarks.
The triangulation step has a clean linear algebra interpretation. If the landmark embedding has coordinates Y in R^(L x d) and the squared distances from a new point to the landmarks are stored in a vector delta, then the new point embeds at x = -(1/2) pinv(Y) (delta - delta_mean), where delta_mean is the column mean of the squared landmark distance matrix and pinv is the Moore-Penrose pseudoinverse. The cost of embedding a new point is O(L d), so the algorithm scales linearly with N once the landmarks have been chosen and the landmark embedding has been computed.
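The whole pipeline fits in a few lines of NumPy. The sketch below is a minimal illustration under the definitions above, with a hypothetical helper lmds_embed; D_landmarks is the L x L landmark distance matrix and D_new holds, row by row, the distances from each remaining point to the landmarks.

```python
import numpy as np

def lmds_embed(D_landmarks, D_new, d=2):
    """Minimal Landmark MDS sketch: classical MDS on the landmarks,
    then distance-based triangulation for the remaining points."""
    L = D_landmarks.shape[0]
    D2 = D_landmarks ** 2
    # Classical MDS: double-center the squared distances, keep the top-d eigenpairs.
    J = np.eye(L) - np.ones((L, L)) / L
    B = -0.5 * J @ D2 @ J
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:d]
    Y = vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))  # L x d landmark embedding
    # Triangulation: x = -1/2 * pinv(Y) (delta - delta_mean) for each new point.
    delta_mean = D2.mean(axis=0)
    X_new = -0.5 * (D_new ** 2 - delta_mean) @ np.linalg.pinv(Y).T
    return Y, X_new
```

Each additional point costs one length-L dot product per output dimension, which is the O(L d) per-point figure quoted above.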
De Silva and Tenenbaum showed that LMDS reproduces classical MDS exactly when L equals N, and that the approximation error stays small even at small ratios L / N, provided the landmarks span the data well. In practice L on the order of a few hundred to a few thousand is enough for most natural datasets, even when N runs into the millions.
Isomap, the nonlinear dimension reduction algorithm by Tenenbaum, de Silva, and Langford published in Science in 2000, replaces Euclidean distances with geodesic distances measured along a k-nearest-neighbor graph and then applies classical MDS. The bottleneck is again the N x N distance matrix and its eigendecomposition. Landmark Isomap, also called L-Isomap, uses the same landmark trick. It computes geodesic distances only between landmarks and the rest of the dataset, builds an L x N distance matrix instead of N x N, and embeds the remaining points by Landmark MDS. The result is a manifold embedding that scales to datasets where full Isomap would be impractical, with a small accuracy cost in regions where landmarks are sparsely distributed.
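The geodesic step is where the savings come from: running Dijkstra only from the landmarks yields an L x N matrix in L passes over the graph. A sketch using SciPy and scikit-learn, with landmark_idx as a hypothetical array of chosen landmark indices:

```python
from scipy.sparse.csgraph import dijkstra
from sklearn.neighbors import kneighbors_graph

def l_isomap_distances(X, landmark_idx, k=10):
    """Geodesic distances from each landmark to every point along a k-NN graph."""
    G = kneighbors_graph(X, k, mode="distance")             # sparse neighbor graph
    D = dijkstra(G, directed=False, indices=landmark_idx)   # L x N geodesic matrix
    return D
```

The L x L block D[:, landmark_idx] and D.T can then be fed to the LMDS sketch above.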
The Nystrom approximation, brought into machine learning by Christopher Williams and Matthias Seeger in their 2001 NeurIPS paper Using the Nystrom method to speed up kernel machines, is the spectral counterpart to Landmark MDS. Given an N x N kernel matrix K that is too large to store or factorize, the Nystrom method picks L landmark columns of K, computes the small L x L block W from those columns, and forms a low-rank approximation
K approx C W^+ C^T,
where C is the N x L matrix of kernel values between all points and the landmarks and W^+ is the pseudoinverse of W. The approximate eigenvectors of K can then be recovered from the eigendecomposition of W, which is an O(L^3) operation, plus matrix multiplications that scale linearly in N. This brings kernel PCA, spectral clustering, and Gaussian process regression from O(N^3) down to O(L^3 + N L^2), and makes kernel methods practical on datasets with hundreds of thousands of points.
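A NumPy sketch of the factorization, never materializing the full K; the Gaussian kernel and the uniform random landmark choice are illustrative assumptions:

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    # Gaussian kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
idx = rng.choice(len(X), size=200, replace=False)  # L = 200 random landmarks

C = rbf(X, X[idx])                             # N x L cross-kernel block
W = C[idx]                                     # L x L landmark block
vals, vecs = np.linalg.eigh(W)                 # O(L^3), the only eigendecomposition
pos = vals > 1e-8                              # drop numerically null directions
F = C @ (vecs[:, pos] / np.sqrt(vals[pos]))    # N x r factor with K approx F F^T
```

Because C W^+ C^T = F F^T, downstream kernel PCA or spectral clustering becomes ordinary linear algebra on the thin factor F.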
Nystrom and Landmark MDS are closely related. Both choose L anchor points, both build a small matrix on those anchors, and both extend the result to the rest of the data by a linear operation. The main difference is that Nystrom works on a positive semidefinite kernel matrix and approximates its spectrum, while Landmark MDS works on a squared distance matrix and approximates the inner product matrix derived from it.
The quality of any landmark method depends on which points are chosen. The literature has converged on a small set of practical strategies, summarized below.
| Selection strategy | How it works | Notes |
|---|---|---|
| Uniform random sampling | Pick L points uniformly at random from the dataset | The default in Williams and Seeger (2001); cheap and surprisingly competitive |
| MaxMin (farthest-point sampling) | Iteratively add the point farthest from the current set | Ensures landmarks span the data; default in many Isomap implementations |
| K-means centers | Run k-means with L clusters and use the cluster centers as landmarks | Empirically strong; first justified theoretically by Zhang, Tsang, and Kwok (2008) and later by Oglic and Gartner (2017) |
| Leverage-score sampling | Sample columns with probability proportional to their statistical leverage | Provides theoretical error bounds via random matrix theory |
| Greedy column selection | Choose columns that minimize a Frobenius-norm error directly | Used in column subset selection and CUR decompositions |
| Active selection | Pick landmarks that resolve the most current uncertainty in the embedding | Useful when distance computations are expensive |
In practice random sampling is hard to beat for moderate L, k-means landmarks help when the data has clear clusters, and leverage-score sampling shines when theoretical guarantees matter. Most production code for Nystrom-based spectral clustering or Gaussian processes ships with all three options.
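As an illustration of the MaxMin row above, here is a short farthest-point sampler in NumPy; the random seed point and Euclidean metric are assumptions:

```python
import numpy as np

def maxmin_landmarks(X, L, seed=0):
    """Farthest-point sampling: greedily add the point farthest
    from the current landmark set."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]              # random initial landmark
    d = np.linalg.norm(X - X[idx[0]], axis=1)      # distance to nearest landmark
    for _ in range(L - 1):
        nxt = int(d.argmax())                      # current farthest point
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(idx)
```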
The landmark idea has been ported to most other algorithms that depend on N x N matrices: sparse Gaussian processes summarize the training set with a small set of inducing points (Snelson and Ghahramani, 2006), spectral clustering scales up through Nystrom-sampled affinity columns (Fowlkes, Belongie, Chung, and Malik, 2004), and the original t-SNE paper describes a landmark variant for large datasets. These methods do not all use the term landmark explicitly, but they share the same trick: replace an O(N^2) or O(N^3) computation with an O(L N) or O(L^2 N) one by introducing a small set of representative anchors.
In computer vision, landmarks are predefined semantic points on an object whose locations have a fixed meaning across instances. The 31st point in the dlib 68-point face model is always the tip of the nose. The 12th point in the COCO body skeleton is always the left hip. This consistency lets downstream models reason about pose, identity, expression, or shape using a compact, interpretable representation rather than dense pixel data.
The earliest computer-vision landmarks came from anatomy and medical imaging, where radiologists had been marking corresponding points on X-rays for decades. The transition to automated detection started in the 1990s with statistical shape models and matured in the 2010s with deep regression and heatmap networks.
The distinction between keypoints and landmarks is mostly a matter of where the points come from.
| Aspect | Detector-style keypoints | Landmarks |
|---|---|---|
| Origin | Found by a detector wherever the image is locally distinctive | Predefined positions in a labeling protocol |
| Number | Variable, depends on image content | Fixed, defined by the model or dataset |
| Identity | Anonymous, matched by descriptor similarity | Semantic, point i always means the same thing |
| Examples | SIFT, ORB, SuperPoint corners and blobs | 68-point face model, COCO body 17, MediaPipe hand 21 |
| Typical task | Image matching, SfM, SLAM | Face alignment, pose estimation, AR filters |
In modern multi-task networks the boundary blurs. RetinaFace, for example, regresses both an anonymous bounding box and five fixed semantic landmarks in the same forward pass, and many human-pose papers swap landmarks and keypoints in successive sentences.
Facial landmarks are by far the most studied category. The standard sets, in order of granularity, are the following.
| Set | Points | Definition / source | Typical use |
|---|---|---|---|
| 5-point | 5 | Eye centers, nose tip, mouth corners; used by MTCNN, RetinaFace, ArcFace pipelines | Face alignment for face recognition |
| 21-point | 21 | Older face SDKs, AAM-style models | Coarse expression and pose |
| 29-point | 29 | LFPW protocol (Belhumeur et al., 2011) | Cascaded regression research |
| 68-point | 68 | Multi-PIE / iBUG 300-W protocol (Sagonas et al., 2013) | dlib, scikit-image, classical pipelines |
| 98-point | 98 | WFLW dataset (Wu et al., CVPR 2018) | Boundary-aware alignment, occlusion robustness |
| 106-point | 106 | JD AI Grand Challenge dataset, Chinese commercial SDKs | Mobile beauty filters |
| 194-point | 194 | HELEN dataset (Le et al., 2012) | Dense facial parts segmentation |
| 468-point (Face Mesh) | 468 | MediaPipe Face Mesh (Kartynnik et al., 2019) | AR filters, virtual try-on, full mesh fitting |
| 5023-vertex (FLAME) | 5023 | FLAME 3D head model (Li et al., 2017) | 3D face reconstruction, avatar driving |
The 5-point set is the bare minimum needed to align a face for recognition. The eye centers fix in-plane rotation and scale, and the nose tip and mouth corners pin down the remaining degrees of freedom of a similarity transform. Face recognition systems such as ArcFace and AdaFace assume their input has been similarity-warped to a canonical pose using exactly these five points, which is one reason MTCNN and RetinaFace, the dominant face detectors, both regress them.
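That warp is a two-line OpenCV operation once the five points are in hand. In the sketch below, align_face is a hypothetical helper and the template values are the reference coordinates commonly quoted for 112 x 112 ArcFace-style crops:

```python
import cv2
import numpy as np

# Reference layout for a 112 x 112 crop: left eye, right eye,
# nose tip, left mouth corner, right mouth corner.
TEMPLATE = np.array([
    [38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
    [41.5493, 92.3655], [70.7299, 92.2041],
], dtype=np.float32)

def align_face(img, five_points):
    """Similarity-warp a face to the canonical pose from its 5 landmarks."""
    src = np.asarray(five_points, dtype=np.float32)
    M, _ = cv2.estimateAffinePartial2D(src, TEMPLATE)  # rotation + scale + translation
    return cv2.warpAffine(img, M, (112, 112))
```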
The 68-point set is the most influential research protocol. Sagonas, Tzimiropoulos, Zafeiriou, and Pantic introduced it at the iBUG 300 Faces in-the-Wild Challenge held at ICCV 2013, by re-annotating the LFPW, AFW, HELEN, XM2VTS, and FRGC datasets with the 68-point Multi-PIE markup and adding a new 135-image set of difficult faces. The result, often called iBUG 300-W, became the standard benchmark for face alignment for nearly a decade, and the 68-point layout is still the default in dlib, scikit-image, and many academic baselines.
The 98-point WFLW set, from the Look at Boundary paper by Wu, Qian, Yang, Wang, and Loy (CVPR 2018), adds points along the eyebrow, eye, mouth, and jawline contours and tags each face with attributes such as occlusion, blur, and pose. It is the standard for evaluating dense alignment under challenging conditions.
MediaPipe Face Mesh, described by Yury Kartynnik, Artsiom Ablavatski, Ivan Grishchenko, and Matthias Grundmann in Real-time Facial Surface Geometry from Monocular Video on Mobile GPUs (arXiv 2019), takes the dense end of the spectrum. Its 468-point output is not a flat sparse landmark set; it is a triangulated mesh that approximates the full surface of the face, regressed in 3D from a single RGB camera in real time on a phone. The mesh is what makes Snapchat-style AR filters, beauty effects, and virtual try-on work without a depth sensor.
FLAME, by Tianye Li, Timo Bolkart, Michael Black, Hao Li, and Javier Romero (SIGGRAPH Asia 2017), goes further still. It is a 3D morphable model with 5023 vertices, learned from over 33,000 head scans, that parameterizes identity, expression, and pose using a small number of latent codes. Many modern face avatars and reconstruction pipelines fit FLAME parameters to a sparse set of detected landmarks as initialization, then refine against image evidence.
Human body landmarks define a kinematic skeleton. The two dominant labeling conventions are MPII and COCO.
| Skeleton | Joints | Source | Notes |
|---|---|---|---|
| MPII | 16 | MPII Human Pose dataset (Andriluka et al., CVPR 2014) | Tree rooted at pelvis; standard for single-person 2D pose |
| COCO Keypoints | 17 | Microsoft COCO (Lin et al., ECCV 2014) | Dominant benchmark since 2016; OKS metric |
| OpenPose BODY_25 | 25 | OpenPose (Cao et al., CMU, 2017) | Body plus feet |
| MediaPipe Pose (BlazePose) | 33 | BlazePose (Bazarevsky et al., 2020) | Includes wrist and finger reference points |
| COCO-WholeBody | 133 | Jin et al., ECCV 2020 | Body, face, hands, feet in one model |
| Halpe Full-Body | 136 | AlphaPose Halpe model | Used in action recognition pipelines |
COCO 17 is now the de facto standard for multi-person 2D pose. The 17 points are nose, left and right eye, left and right ear, left and right shoulder, elbow, wrist, hip, knee, and ankle. Each point carries an x, y position and a visibility flag. The COCO keypoint challenge has driven most progress in 2D pose estimation since 2016.
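The annotation layout is simple enough to parse by hand: each person carries one flat list of 17 [x, y, v] triplets, where v = 0 means not labeled, 1 labeled but occluded, and 2 visible. The helper below is a hypothetical illustration of that format:

```python
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def parse_coco_keypoints(flat):
    """Turn the flat 51-number keypoint list into a name -> (x, y, visible) dict."""
    triplets = [flat[i:i + 3] for i in range(0, len(flat), 3)]
    return {name: (x, y, v == 2)
            for name, (x, y, v) in zip(COCO_KEYPOINT_NAMES, triplets)
            if v > 0}  # v == 0: point not labeled for this person
```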
Hand pose models have settled on a 21-point skeleton. MediaPipe Hands, from Zhang, Bazarevsky, and colleagues at Google in 2020, predicts 21 3D landmarks per hand: one wrist point and four points per finger (knuckle, two finger joints, and fingertip). The full MediaPipe Holistic model combines 33 body landmarks, 21 per hand, and 468 face mesh landmarks for a total of 543 points per person, all tracked in real time on a mobile device.
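A minimal sketch with MediaPipe's legacy Python solutions API; the file name and parameters are illustrative:

```python
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2)
img = cv2.cvtColor(cv2.imread("hand.jpg"), cv2.COLOR_BGR2RGB)  # MediaPipe expects RGB
results = hands.process(img)
if results.multi_hand_landmarks:
    for hand in results.multi_hand_landmarks:
        # 21 landmarks with normalized x, y and relative depth z; index 0 is
        # the wrist and indices 4, 8, 12, 16, 20 are the fingertips.
        pts = [(lm.x, lm.y, lm.z) for lm in hand.landmark]
```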
Facial and body landmark detection has gone through three major waves: statistical shape models, cascaded regression, and deep heatmap or coordinate regression.
| Method | Year | Authors | Approach |
|---|---|---|---|
| Active Shape Models (ASM) | 1995 | Cootes, Taylor, Cooper, Graham | PCA shape model plus local intensity profiles, iterative fit |
| Active Appearance Models (AAM) | 2001 | Cootes, Edwards, Taylor | Joint shape and texture PCA, fit by minimizing texture residual |
| Constrained Local Models (CLM) | 2008 | Cristinacce and Cootes | Local patch experts plus a shape prior |
| Explicit Shape Regression (ESR) | 2012 | Cao, Wei, Wen, Sun | Cascaded regression on shape-indexed pixel differences |
| Robust Cascaded Pose Regression (RCPR) | 2013 | Burgos-Artizzu, Perona, Dollar | Cascaded regression with occlusion handling |
| Ensemble of Regression Trees (ERT) | 2014 | Kazemi and Sullivan | One-millisecond face alignment used by dlib |
| MTCNN | 2016 | Zhang, Zhang, Li, Qiao | Cascaded CNN for joint face detection and 5-point alignment |
| Face Alignment Network (FAN) | 2017 | Bulat and Tzimiropoulos | Stacked hourglass for 2D and 3D landmarks |
| 3DDFA | 2017 | Zhu, Liu, Lei, Li | Cascaded CNN that fits a dense 3DMM |
| PFLD | 2019 | Guo and colleagues | Lightweight MobileNet backbone for mobile devices |
| Face Mesh | 2019 | Kartynnik et al. | Real-time 468-point 3D mesh on mobile GPU |
| BlazeFace | 2019 | Bazarevsky et al. | Mobile-first single-shot detector with 6 keypoints |
| RetinaFace | 2020 | Deng, Guo, Zhou, Yu, Zafeiriou | Single-shot multi-level detector with built-in 5-point alignment |
Active Shape Models, introduced by Cootes, Taylor, Cooper, and Graham in Computer Vision and Image Understanding (1995), built a Point Distribution Model from PCA on aligned training shapes and fit it to new images by alternating between local intensity-profile search at each landmark and global shape regularization. Active Appearance Models (Cootes, Edwards, and Taylor, 2001) extended this by jointly modeling shape and texture, fitting both at once.
The cascaded regression era started with Cao, Wei, Wen, and Sun's Face alignment by Explicit Shape Regression at CVPR 2012. ESR initializes from a mean shape and refines it through a series of regressors that operate on shape-indexed pixel differences, with no explicit shape model in the loop. Burgos-Artizzu, Perona, and Dollar followed in 2013 with RCPR, which added robust regression that handles occlusion, and Kazemi and Sullivan's 2014 One Millisecond Face Alignment with an Ensemble of Regression Trees gave the field the speed jump that made real-time alignment practical on commodity CPUs. The dlib library's well-known 68-point shape predictor is a direct implementation of this ERT approach, trained on iBUG 300-W.
The deep learning era began with MTCNN by Zhang, Zhang, Li, and Qiao (IEEE Signal Processing Letters, 2016), which cascades three small CNNs (PNet, RNet, ONet) to do face detection and 5-point alignment in one pass. Bulat and Tzimiropoulos's 2017 ICCV paper How far are we from solving the 2D and 3D Face Alignment problem? introduced the Face Alignment Network, a stacked hourglass that regresses landmark heatmaps and ships with both 2D and 3D variants. Their accompanying LS3D-W dataset, with 230,000 3D landmark annotations, became a standard 3D benchmark. PFLD by Guo and colleagues (arXiv 2019) made the same task run at 140 fps on a phone using a MobileNet backbone of just 2.1 megabytes.
For 3D and dense landmarks, Zhu, Liu, Lei, and Li's 3DDFA (TPAMI 2017) fits a 3DMM in full pose range using cascaded CNNs, and the MediaPipe Face Mesh model produces a 468-point 3D mesh in real time. Single-shot multi-task detectors such as RetinaFace and BlazeFace fold landmark regression into the detection head, eliminating the need for a separate alignment stage.
Facial and body landmarks underlie a wide range of applications: face alignment for recognition, expression analysis, AR filters and virtual try-on, avatar driving and 3D reconstruction, pose-based action recognition, and medical image registration.
Most landmark detectors have well-tested open-source implementations.
| Library | Languages | Landmarks provided |
|---|---|---|
| dlib | C++, Python | 68-point face landmarks via the Kazemi-Sullivan ERT model; 5-point variant |
| MediaPipe | C++, Python, JS, Android, iOS | 468-point face mesh, 33-point pose, 21-point hand, holistic |
| face-alignment (1adrianb) | PyTorch | 2D and 3D Face Alignment Network from Bulat and Tzimiropoulos |
| InsightFace | MXNet, PyTorch | RetinaFace detector and 5-point alignment, 106-point dense alignment |
| OpenCV Facemark API | C++, Python | LBF, AAM, and Kazemi facemark models |
| MMPose | PyTorch | HRNet, ViTPose, RTMPose for body, face, and hand landmarks |
| PIPNet | PyTorch | Pixel-in-pixel regression for 68/98/19-point face alignment |
| AlphaPose | PyTorch | Halpe 136-point full-body landmarks |
| OpenPose | C++ | 25-point body, 21-point hand, 70-point face from CMU |
| 3DDFA_V2 | PyTorch | Real-time 3DMM fitting and dense face landmarks |
A minimal dlib face-landmark workflow looks like the following.
```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# Model trained on iBUG 300-W; downloaded separately from dlib.net.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = cv2.imread("face.jpg")
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # dlib expects RGB, OpenCV loads BGR
for det in detector(rgb, 1):                # 1 = upsample once to catch small faces
    shape = predictor(rgb, det)
    points = [(p.x, p.y) for p in shape.parts()]  # 68 (x, y) pixel coordinates
```
The split between detection and landmark regression has narrowed in the last few years. RetinaFace, BlazeFace, and YOLOv8-face all regress face landmarks in the same head as the bounding box, so a single forward pass returns both. On the dense end, neural rendering and 3D head models such as DECA, EMOCA, and SPECTRE replace the small landmark set with full 3D mesh parameters fit directly from images, often supervised by sparse landmarks as auxiliary targets. The 5-point or 68-point landmark output is rarely the final goal anymore; it is usually a stepping stone toward face recognition, expression analysis, avatar driving, or 3D reconstruction.
For body landmarks, models such as ViTPose (Xu et al., NeurIPS 2022) and RTMPose (Jiang et al., 2023) push COCO keypoint AP into the low 80s on test-dev while running at real-time speeds, and dense human-mesh recovery models such as PIXIE, SPIN, and HMR2 fit a full SMPL or SMPL-X body using landmarks as one supervision signal among many.
Face landmarks are biometric identifiers in most legal frameworks, even when no original image is stored. The European Union's General Data Protection Regulation classifies biometric data, including face geometry, as a special category under Article 9, requiring explicit consent or another lawful basis for processing. Illinois's Biometric Information Privacy Act (BIPA, 2008) imposes notice-and-consent requirements and has led to large class-action settlements against Facebook, TikTok, and others. Several U.S. states have followed Illinois with similar statutes, and city-level bans on government use of face recognition (San Francisco 2019, Boston 2020, others) effectively constrain landmark-based identification systems as well. Practitioners building face-landmark systems should assume the resulting embeddings are regulated personal data even when the source pixels are deleted.
The two senses of landmark sit in different branches of machine learning, but they share a clean idea. In dimension reduction, a few well-chosen anchor points let you summarize the geometry of a million-point dataset without paying the full O(N^2) cost. In computer vision, a few well-chosen anchor points on a face let you summarize identity, pose, or expression without working with the full pixel grid. Both uses gain leverage from sparsity: most of the information that a downstream system needs lives at a small number of carefully selected positions, not in the dense interior of the data.