Simultaneous Localization and Mapping (SLAM) is the computational problem of constructing a map of an unknown environment while at the same time estimating the position of a sensor, vehicle, or agent moving through that environment. SLAM is one of the foundational problems in mobile robotics and a key enabling technology behind autonomous vehicles, drones, augmented and virtual reality headsets, planetary rovers, and consumer products such as robot vacuum cleaners. The problem is sometimes described as the chicken-and-egg problem of mobile robotics: an accurate map is needed to localize a robot, yet an accurate pose is needed to build a map. SLAM systems solve both subproblems jointly, treating the map and the trajectory as coupled unknowns that are estimated together from noisy sensor data.
First formulated in the mid-1980s within the probabilistic robotics community, SLAM has progressed from filter-based approaches that worked on dozens of landmarks to graph-based and learning-based systems that build dense, photorealistic, kilometer-scale maps in real time. Modern SLAM intersects with computer vision, signal processing, optimization, and increasingly with deep learning and neural rendering. Within AI Wiki, SLAM is referenced across articles on autonomous driving, mobile robotics, drones, and spatial computing.
In its most general form, SLAM asks an agent equipped with proprioceptive sensors (such as wheel odometry or an inertial measurement unit) and exteroceptive sensors (such as cameras, lidar, radar, or sonar) to estimate two things from a stream of measurements: (1) its own pose, meaning position and orientation in three-dimensional space, and (2) a representation of the surrounding environment. The agent has no prior map and no external position fix such as GPS, although it may have approximate priors.
The problem is hard because mapping and localization depend on each other. To know where a feature in the environment is located, the system must know where the sensor was when it observed that feature. To know where the sensor is, the system must compare its current observations against a map of known features. Neither piece of information is given, and noise in motion and observation makes any naive estimate drift over time. SLAM frameworks formalize this coupling using probability theory: the joint posterior over the trajectory and the map, conditioned on all measurements, is what the algorithm tries to compute or approximate.
A second source of difficulty is the size of the state. Even a modest indoor environment may contain thousands of distinct landmarks, and a city block may contain millions. Naive joint estimation scales cubically with the number of landmarks, which made early systems impractical for large maps. Much of the history of SLAM is the story of finding sparse structure in this estimation problem and exploiting it for efficient computation.
The modern formulation of SLAM grew out of a set of papers presented at the 1986 IEEE Robotics and Automation Conference in San Francisco, where Randall Smith, Matthew Self, and Peter Cheeseman proposed a method for representing and estimating spatial uncertainty using a probabilistic state vector. Smith and Cheeseman's 1986 paper On the Representation and Estimation of Spatial Uncertainty in the International Journal of Robotics Research is widely cited as the origin of the probabilistic SLAM problem. The key insight was that the uncertainty associated with the relationships among objects in the world could be propagated using estimation theory, and that observations of one feature implicitly carry information about others through the shared sensor pose.
Throughout the late 1980s and early 1990s, researchers including Hugh Durrant-Whyte at the University of Oxford and later the University of Sydney developed these ideas into working systems. The acronym SLAM was coined by Durrant-Whyte and John J. Leonard in a 1995 paper at the International Symposium on Robotics Research, where they showed that the problem of building a map and localizing the robot in it was a single estimation problem with a convergent solution under suitable assumptions.
The two-part tutorial by Hugh Durrant-Whyte and Tim Bailey, published in IEEE Robotics and Automation Magazine in June and September 2006 (Simultaneous Localisation and Mapping: Part I The Essential Algorithms and Part II State of the Art), became the standard introduction to the field. These articles described the probabilistic formulation in detail, derived the Kalman Filter and Particle Filter approaches, and surveyed the state of the art at a moment when the community was beginning to shift from filter-based to optimization-based methods.
The survey Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age by Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, Jose Neira, Ian Reid, and John J. Leonard, published in IEEE Transactions on Robotics in December 2016, marked a generational stocktaking. The paper organized the prior thirty years of progress into a classical age (1986 to 2004) dominated by filter-based methods, an algorithmic-analysis age (2004 to 2015) characterized by graph-based optimization and bundle adjustment, and what the authors called the emerging robust-perception age, in which SLAM systems are expected to operate reliably for long durations, in dynamic and changing environments, and to produce semantically meaningful maps.
The Cadena survey identified several open challenges: long-term operation under appearance change, scalability to large environments, semantic mapping that goes beyond geometry, deep learning integration, active SLAM that decides where to move next, and theoretical guarantees on convergence and consistency. Many of these directions have driven research over the subsequent decade.
SLAM systems are commonly classified along several axes: the type of sensor, the type of map representation, the estimation framework, and whether the method is sparse or dense. The most familiar classification is by sensor.
| SLAM type | Primary sensor | Strengths | Weaknesses |
|---|---|---|---|
| Visual SLAM (vSLAM) | Monocular, stereo, or RGB-D camera | Inexpensive sensors; rich photometric information; dense color maps possible | Sensitive to lighting; monocular has scale ambiguity; texture-poor scenes are hard |
| LiDAR SLAM | 2D or 3D laser scanner | High geometric accuracy; works in low light; produces dense point clouds directly | Sensors are expensive; less semantic information; struggles in geometrically degenerate environments such as long corridors |
| Visual-Inertial SLAM (VI-SLAM) | Camera plus inertial measurement unit (IMU) | Resolves monocular scale ambiguity; robust to fast motion; small low-power form factor | Requires precise calibration; gyroscope and accelerometer biases must be estimated online |
| Radar SLAM | Millimeter-wave or scanning radar | Robust to fog, dust, rain, and darkness | Lower angular resolution; sparse returns |
| RGB-D SLAM | Depth camera (Kinect, RealSense, iPhone TrueDepth) | Direct depth measurements simplify mapping; works indoors at short range | Limited outdoor range; depth quality drops on shiny or dark surfaces |
| Acoustic SLAM | Sonar (underwater) or microphone arrays | Works underwater where light cannot travel; can map sound sources | Low resolution; severe multipath in cluttered environments |
A secondary classification distinguishes sparse SLAM, which tracks a relatively small set of point features such as ORB or SIFT keypoints, from dense SLAM, which estimates depth or color at every pixel. Sparse methods are generally faster and run on modest hardware; dense methods produce richer maps that are useful for path planning, telepresence, and content creation but typically require GPUs. A third axis is direct versus indirect: direct methods optimize photometric error on raw pixel intensities, while indirect methods first extract features and then optimize geometric reprojection error.
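The distinction is easiest to see in the residuals the two families minimize. The sketch below contrasts a geometric reprojection residual (indirect) with a photometric residual (direct) for a simple pinhole camera; the helper names, pose conventions, and toy interpolation routine are illustrative, not taken from any particular system.

```python
import numpy as np

def project(K, T_cam_point_frame, X):
    """Pinhole projection of a 3D point X (expressed in some frame) into the
    image of a camera whose pose maps that frame into camera coordinates."""
    X_c = T_cam_point_frame[:3, :3] @ X + T_cam_point_frame[:3, 3]
    uvw = K @ (X_c / X_c[2])
    return uvw[:2]

def bilinear(I, uv):
    """Bilinear interpolation of image I at subpixel location uv = (u, v)."""
    u0, v0 = int(np.floor(uv[0])), int(np.floor(uv[1]))
    a, b = uv[0] - u0, uv[1] - v0
    return ((1 - a) * (1 - b) * I[v0, u0] + a * (1 - b) * I[v0, u0 + 1]
            + (1 - a) * b * I[v0 + 1, u0] + a * b * I[v0 + 1, u0 + 1])

def reprojection_residual(K, T_cam_world, X_world, keypoint_uv):
    """Indirect (feature-based): difference between a detected keypoint and
    the projection of its associated 3D landmark."""
    return keypoint_uv - project(K, T_cam_world, X_world)

def photometric_residual(I_ref, I_cur, K, T_cur_ref, u_ref, inv_depth):
    """Direct: intensity difference between a reference pixel and its
    reprojection in the current image; no feature extraction required."""
    # Back-project the reference pixel at the estimated inverse depth,
    # transform into the current camera frame, and compare raw intensities.
    X_ref = np.linalg.inv(K) @ np.array([u_ref[0], u_ref[1], 1.0]) / inv_depth
    u_cur = project(K, T_cur_ref, X_ref)
    return I_ref[int(u_ref[1]), int(u_ref[0])] - bilinear(I_cur, u_cur)
```

In a real system these residuals are stacked over many features or pixels and minimized over poses (and scene structure) with a robust nonlinear least squares solver.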
The earliest SLAM systems used the Extended Kalman Filter (EKF) to maintain a Gaussian posterior over the joint state of the robot pose and all landmark positions. Each new measurement updates the mean and covariance through a linearization of the nonlinear motion and observation models. EKF-SLAM is conceptually clean and gave the first convergence proofs in the 1990s, but it has two well-known weaknesses. First, the covariance matrix grows quadratically with the number of landmarks, which makes it unsuitable for large maps. Second, the linearization step accumulates errors that cause the filter to become inconsistent over long trajectories.
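As a minimal sketch, the function below performs one EKF-SLAM measurement update for a 2D range-bearing observation of an already-initialized landmark, assuming the usual state layout of robot pose followed by landmark positions; the state layout, noise handling, and helper structure are illustrative, not a production implementation. The quadratic cost of maintaining the full joint covariance is visible in the last two lines.

```python
import numpy as np

def ekf_slam_update(mu, Sigma, landmark_idx, z, R):
    """One EKF-SLAM update for a range-bearing observation z = (range, bearing).

    mu    : state mean [x, y, theta, l1x, l1y, l2x, l2y, ...]
    Sigma : full joint covariance (its size grows quadratically with landmarks)
    R     : 2x2 measurement noise covariance
    """
    px, py, th = mu[0], mu[1], mu[2]
    j = 3 + 2 * landmark_idx
    lx, ly = mu[j], mu[j + 1]

    dx, dy = lx - px, ly - py
    q = dx * dx + dy * dy
    sq = np.sqrt(q)
    z_hat = np.array([sq, np.arctan2(dy, dx) - th])   # predicted measurement

    # Jacobian of the observation model w.r.t. the full state. It is sparse:
    # only the robot pose and the observed landmark have nonzero columns.
    H = np.zeros((2, len(mu)))
    H[:, 0:3] = np.array([[-dx / sq, -dy / sq, 0.0],
                          [ dy / q,  -dx / q, -1.0]])
    H[:, j:j + 2] = np.array([[ dx / sq,  dy / sq],
                              [-dy / q,   dx / q]])

    # Standard EKF update; note the O(n^2) cost of touching the full covariance.
    S = H @ Sigma @ H.T + R
    K = Sigma @ H.T @ np.linalg.inv(S)
    innovation = z - z_hat
    innovation[1] = (innovation[1] + np.pi) % (2 * np.pi) - np.pi   # wrap bearing
    mu = mu + K @ innovation
    Sigma = (np.eye(len(mu)) - K @ H) @ Sigma
    return mu, Sigma
```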
FastSLAM, introduced by Michael Montemerlo, Sebastian Thrun, Daphne Koller, and Ben Wegbreit in 2002 and refined as FastSLAM 2.0 in 2003, replaced the single Gaussian over the full state with a particle filter over robot trajectories. Each particle carries its own factored map of small EKFs, one per landmark. This decomposition exploits a key conditional independence: given the trajectory, individual landmark estimates are independent of one another. FastSLAM scales as O(M log K) per update, where M is the number of particles and K is the number of landmarks, which made it practical for thousands of landmarks. The Rao-Blackwellized particle filter underlying FastSLAM became the basis for many subsequent grid-based mapping systems, including the popular gmapping ROS package.
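A minimal sketch of the Rao-Blackwellized structure follows, assuming the caller supplies the motion-sampling model and the per-landmark EKF update; new-landmark initialization and resampling are omitted for brevity, and the class and function names are illustrative.

```python
import numpy as np

class Particle:
    """One FastSLAM particle: a sampled robot pose plus an independent small
    EKF (mean, covariance) per landmark, conditioned on that trajectory."""
    def __init__(self, pose):
        self.pose = np.asarray(pose, dtype=float)   # sampled (x, y, theta)
        self.landmark_mu = {}                        # landmark id -> 2-vector mean
        self.landmark_Sigma = {}                     # landmark id -> 2x2 covariance
        self.weight = 1.0

def fastslam_step(particles, u, observations, sample_motion, landmark_update):
    """One FastSLAM iteration (new landmarks and resampling omitted).

    sample_motion(pose, u)              -> new noisy pose sample
    landmark_update(pose, mu, Sigma, z) -> (mu', Sigma', measurement likelihood)
    Both are problem-specific models supplied by the caller.
    """
    for p in particles:
        p.pose = sample_motion(p.pose, u)
        for lm_id, z in observations:
            if lm_id in p.landmark_mu:
                # Conditioned on the particle's trajectory, landmarks are
                # independent, so each gets its own tiny 2x2 EKF update.
                mu, Sig, lik = landmark_update(p.pose, p.landmark_mu[lm_id],
                                               p.landmark_Sigma[lm_id], z)
                p.landmark_mu[lm_id], p.landmark_Sigma[lm_id] = mu, Sig
                p.weight *= lik
    total = sum(p.weight for p in particles)
    for p in particles:
        p.weight /= total
    return particles
```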
By the mid-2000s the field had largely shifted to graph-based or factor-graph formulations. In this view, the SLAM problem is encoded as a graph in which nodes represent variables (poses and landmarks) and edges represent constraints derived from sensor measurements. Solving the SLAM problem then amounts to finding the configuration of variables that best satisfies all constraints, which is a nonlinear least squares problem. The solver is usually called the backend, while the frontend is responsible for extracting features, matching them across frames, and forming the constraints.
The nonlinear least squares problem can be tackled with sparse Gauss-Newton or Levenberg-Marquardt methods. Two software libraries dominate practical implementations: g2o and GTSAM, both of which provide sparse factor-graph optimizers; Google's Ceres Solver is also widely used, for example inside Cartographer.
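To make the graph formulation concrete, the toy example below optimizes a one-dimensional pose graph with three odometry edges and one loop-closure edge using Gauss-Newton; the numbers are made up, and real systems work with SE(2) or SE(3) poses and sparse solvers such as the libraries named above.

```python
import numpy as np

# Tiny 1D pose graph: four poses along a line, three odometry constraints and
# one loop-closure constraint back to the start.
# Each edge (i, j, d, w) says "pose_j - pose_i should equal d" with weight w.
edges = [
    (0, 1, 1.0, 1.0),    # odometry
    (1, 2, 1.0, 1.0),    # odometry
    (2, 3, 1.0, 1.0),    # odometry
    (3, 0, -2.4, 10.0),  # loop closure: disagrees with the drifted odometry chain
]

x = np.array([0.0, 1.0, 2.0, 3.0])    # initial guess from raw odometry

for _ in range(10):                   # Gauss-Newton iterations
    H = np.zeros((4, 4))              # approximate Hessian (information matrix)
    b = np.zeros(4)
    for i, j, d, w in edges:
        r = (x[j] - x[i]) - d         # residual of this constraint
        J = np.zeros(4); J[i], J[j] = -1.0, 1.0
        H += w * np.outer(J, J)
        b += w * J * r
    H[0, 0] += 1e6                    # anchor the first pose (fix the gauge freedom)
    dx = np.linalg.solve(H, -b)
    x += dx

print(x)   # the loop closure pulls the drifted chain back into global consistency
```

Because the loop-closure edge disagrees with the drifted odometry chain, the optimizer redistributes the error along the whole trajectory, which is exactly the correction described in the loop-closure discussion later in this article.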
Bundle adjustment (BA), inherited from photogrammetry, is the form of graph optimization used in visual SLAM systems. It jointly refines all camera poses and 3D landmark coordinates so as to minimize the total reprojection error across all images. Local bundle adjustment over a sliding window of recent keyframes runs in real time, while global bundle adjustment over the entire map is reserved for loop-closure events. The Schur complement trick exploits the sparsity of the Hessian to make BA tractable.
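In the block notation commonly used in the bundle adjustment literature (not specific to any one system), the Gauss-Newton normal equations split into a camera block and a point block,

$$
\begin{bmatrix} H_{cc} & H_{cp} \\ H_{cp}^\top & H_{pp} \end{bmatrix}
\begin{bmatrix} \Delta\xi_c \\ \Delta\xi_p \end{bmatrix}
=
-\begin{bmatrix} b_c \\ b_p \end{bmatrix},
$$

and because the point block H_pp is block-diagonal, with one small block per landmark, it can be inverted cheaply to form the reduced camera system

$$
\left( H_{cc} - H_{cp} H_{pp}^{-1} H_{cp}^\top \right) \Delta\xi_c
= -\,b_c + H_{cp} H_{pp}^{-1} b_p ,
$$

which is solved for the camera updates first; the landmark updates then follow by back-substitution.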
Indirect methods extract sparse keypoints and minimize geometric error. Direct methods skip the feature extraction step and instead minimize photometric error on raw image intensities. The advantage is that direct methods can use information from edges and weakly textured surfaces that feature detectors miss. LSD-SLAM (Large-Scale Direct Monocular SLAM), published by Jakob Engel, Thomas Schoeps, and Daniel Cremers at ECCV 2014, was the first direct method to demonstrate large-scale, semi-dense reconstruction from a single moving camera. The paper received the Koenderink Award for Lasting Impact at ECCV 2024, ten years after publication.
Direct Sparse Odometry (DSO), by Engel, Vladlen Koltun, and Cremers, published in IEEE Transactions on Pattern Analysis and Machine Intelligence in 2018, combined the photometric formulation with sparse pixel sampling and joint optimization of geometry, motion, and a full photometric calibration model. DSO outperformed LSD-SLAM in runtime, accuracy, and robustness, and showed that direct methods could match or beat feature-based methods on standard benchmarks.
The table below summarizes a representative set of widely used SLAM systems from the past fifteen years.
| System | Year | Sensors | Method | Notes |
|---|---|---|---|---|
| PTAM | 2007 | Monocular | Indirect, keyframe BA | First to split tracking and mapping into parallel threads; designed for AR |
| LSD-SLAM | 2014 | Monocular | Direct, semi-dense | Won ECCV 2024 Koenderink Award for Lasting Impact |
| ORB-SLAM | 2015 | Monocular | Indirect, ORB features, BA | Robust loop closure with DBoW2 |
| ORB-SLAM2 | 2017 | Mono / stereo / RGB-D | Indirect, BA | Open source, widely benchmarked |
| DSO | 2017 | Monocular | Direct, sparse | Photometric calibration; high accuracy |
| VINS-Mono | 2018 | Monocular + IMU | Tightly coupled VIO | From HKUST Aerial Robotics; popular on drones |
| VINS-Fusion | 2019 | Mono / stereo + IMU | Multi-sensor optimization | Extends VINS-Mono with stereo and GPS fusion |
| Cartographer | 2016 | 2D and 3D LiDAR | Pose-graph SLAM | Open-sourced by Google; robust to environmental change |
| RTAB-Map | 2013 onward | RGB-D, stereo, LiDAR | Appearance-based loop closure with memory management | Long-term operation focus |
| Kimera | 2020 | Stereo + IMU | VIO + 3D mesh + semantics | Real-time metric-semantic mapping |
| OpenVSLAM | 2019 | Mono / stereo / RGB-D | Indirect, modular | Designed for easy extension and integration |
| OpenVINS | 2020 | Camera + IMU | MSCKF sliding-window filter | Research platform from RPNG group |
| ORB-SLAM3 | 2021 | Mono / stereo / RGB-D + IMU | Indirect, multi-map, MAP estimation | First system to combine visual, visual-inertial, and multi-map SLAM with pinhole and fisheye lenses |
ORB-SLAM3, published in IEEE Transactions on Robotics in December 2021 by Carlos Campos, Richard Elvira, Juan J. Gomez Rodriguez, Jose M. M. Montiel, and Juan D. Tardos, has become the most widely used open source visual SLAM system. Its main innovations include a tightly integrated visual-inertial pipeline based on Maximum-a-Posteriori estimation that operates from the moment the IMU initializes, and a multi-map system called Atlas that maintains a set of disconnected sub-maps and merges them when loop closures across maps are detected. The authors report accuracy two to five times better than previous systems on the EuRoC and TUM-VI benchmarks.
Google's Cartographer, open-sourced in 2016, became the de facto standard for 2D LiDAR SLAM in the ROS ecosystem. It uses real-time correlative scan matching to build small submaps, then performs global pose-graph optimization to align submaps and close loops. Kimera, released in 2020 by the SPARK Lab at MIT, was among the first open-source systems to integrate visual-inertial odometry, dense 3D mesh reconstruction, semantic labeling, and pose-graph optimization in a single modular framework.
A core problem in long-running SLAM is loop closure: detecting that the robot has revisited a previously mapped place, so that the accumulated drift in the trajectory can be corrected by adding a constraint between the current pose and the past pose. Without loop closure, the estimated trajectory and map drift apart, and the system loses global consistency.
For more than a decade, the dominant loop-closure technique was bag of visual words. The DBoW2 library by Dorian Galvez-Lopez and Juan D. Tardos, published in IEEE Transactions on Robotics in 2012, quantizes binary BRIEF or ORB features into a vocabulary tree and retrieves candidate matches by their visual word histograms. DBoW2 is fast, memory-efficient, and used inside ORB-SLAM2 and ORB-SLAM3. Its weakness is sensitivity to changes in lighting, viewpoint, and seasonal appearance, since the underlying features are not invariant to such changes.
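The sketch below mirrors the L1 scoring typically used with TF-IDF-weighted bag-of-words vectors; it is not DBoW2's actual API, and a real frontend would add temporal-consistency checks and geometric verification of the candidate matches.

```python
from collections import Counter

def bow_vector(word_ids, idf):
    """L1-normalized TF-IDF histogram over visual-word ids for one image."""
    counts = Counter(word_ids)
    vec = {w: c * idf[w] for w, c in counts.items()}
    norm = sum(abs(v) for v in vec.values())
    return {w: v / norm for w, v in vec.items()}

def bow_similarity(v1, v2):
    """L1 score in [0, 1] between two sparse, L1-normalized BoW vectors."""
    score = 0.0
    for w in set(v1) & set(v2):
        score += abs(v1[w]) + abs(v2[w]) - abs(v1[w] - v2[w])
    return 0.5 * score

def loop_candidates(query_vec, database, threshold=0.3):
    """Return keyframe ids whose BoW vectors score above a threshold,
    best-scoring first."""
    scores = {kf: bow_similarity(query_vec, vec) for kf, vec in database.items()}
    return sorted((kf for kf, s in scores.items() if s > threshold),
                  key=lambda kf: -scores[kf])
```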
Deep learning has produced more robust place recognition descriptors. NetVLAD, by Relja Arandjelovic and colleagues at Inria and Oxford, published at CVPR 2016, learns a global image descriptor by aggregating local CNN features in a differentiable VLAD layer trained with weakly supervised triplet ranking. Patch-NetVLAD extends the idea to multi-scale patch-level matching. AnyLoc, published in IEEE Robotics and Automation Letters in 2024 by Nikhil Keetha and colleagues, uses features from large foundation models such as DINOv2 to perform universal visual place recognition without retraining for each environment. Recent work in 2025 and 2026 integrates NetVLAD and AnyLoc as drop-in replacements for DBoW2 inside ORB-SLAM-style frontends, with reported gains in robustness to illumination and viewpoint change.
The rise of deep learning has reshaped SLAM in two ways. First, neural networks now serve as components inside otherwise classical pipelines: learned feature detectors and descriptors (SuperPoint, R2D2), learned matchers (SuperGlue, LightGlue), learned monocular depth estimators (MiDaS, DepthAnything), and learned place recognition modules (NetVLAD, AnyLoc) all replace handcrafted parts of the pipeline. Second, end-to-end neural SLAM systems represent the map itself as the parameters of a neural network or as a learnable scene representation.
NeRF, or Neural Radiance Fields, introduced in 2020, represents a 3D scene as a continuous function from a 3D position and viewing direction to color and volume density, encoded in the weights of a multi-layer perceptron. Although NeRF was originally an offline scene reconstruction method, it inspired a wave of SLAM systems that incrementally build neural map representations.
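As a deliberately small illustration of the representation (not the full NeRF method, which adds positional encodings, hierarchical sampling, and a volume-rendering loss), the map is literally the weights of a network like the following:

```python
import torch
import torch.nn as nn

class TinyRadianceField(nn.Module):
    """Toy radiance field: an MLP mapping a 3D position and a viewing
    direction to an RGB color and a volume density. Illustrative only."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),    # 3 color channels + 1 density
        )

    def forward(self, xyz, direction):
        out = self.net(torch.cat([xyz, direction], dim=-1))
        rgb = torch.sigmoid(out[..., :3])   # colors in [0, 1]
        sigma = torch.relu(out[..., 3:])    # non-negative volume density
        return rgb, sigma
```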
iMAP, by Edgar Sucar and colleagues at Imperial College London, presented at ICCV 2021, was the first real-time SLAM system to use a single MLP as the entire map of an indoor scene. NICE-SLAM, published at CVPR 2022 by Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin Oswald, and Marc Pollefeys, replaced the single MLP with a hierarchical voxel grid of learned features, which scaled to larger and more detailed indoor scenes. NICER-SLAM extended this to RGB-only input, removing the dependency on depth sensors.
NeRF-SLAM, by Antoni Rosinol, John J. Leonard, and Luca Carlone in 2022, combined a dense monocular SLAM frontend with an Instant-NGP NeRF backend to produce real-time photorealistic maps. GO-SLAM, by Youmin Zhang and colleagues at ICCV 2023, added global pose-graph optimization and loop closure to neural implicit SLAM, addressing one of the most persistent weaknesses of earlier neural SLAM systems.
In 2023 the introduction of Gaussian splatting for real-time radiance field rendering produced a new wave of SLAM systems. 3D Gaussian splatting represents a scene as a collection of anisotropic 3D Gaussians with color, opacity, and covariance parameters, which can be rasterized at interactive rates. The representation is explicit, easy to edit, and amenable to incremental updates, all of which are attractive for SLAM.
SplaTAM (Splat, Track and Map), MonoGS (also called Gaussian Splatting SLAM), and GS-SLAM all appeared at CVPR 2024 within months of each other. They share a common loop: render the current Gaussian map from the estimated camera pose, compare against the live image, backpropagate the photometric error to update both the camera pose and the Gaussian parameters, and add new Gaussians where the rendered image is sparse. Photo-SLAM and Gaussian-SLAM hybridize the approach by coupling 3DGS mapping with ORB-SLAM3 and DROID-SLAM trackers respectively. SGS-SLAM (ECCV 2024) adds semantic labels and RTG-SLAM scales the approach to larger environments using compact Gaussian representations.
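A schematic of that shared loop, written in a PyTorch-like style; the differentiable rasterizer, the map structure, the loss weights, and the plain-tensor pose parameterization (real systems optimize on SE(3)) are all simplifying assumptions rather than any specific system's code.

```python
import torch

def track_and_map_step(render, gaussians, camera_pose, rgb, depth,
                       iters_track=40, iters_map=60, lr_pose=1e-3, lr_map=1e-2):
    """One step in the common 3DGS-SLAM pattern: first optimize the camera
    pose against the frozen map, then optimize the map with the pose fixed.
    `render(gaussians, pose) -> (rgb_hat, depth_hat)` is assumed to be a
    differentiable Gaussian-splatting rasterizer supplied by the caller;
    `gaussians` is assumed to be a dict of parameter tensors."""
    pose = camera_pose.clone().requires_grad_(True)

    # --- Tracking: photometric (and depth) error w.r.t. the pose only ---
    opt = torch.optim.Adam([pose], lr=lr_pose)
    for _ in range(iters_track):
        rgb_hat, depth_hat = render(gaussians, pose)
        loss = (rgb_hat - rgb).abs().mean() + 0.1 * (depth_hat - depth).abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()

    # --- Mapping: same loss, but now the Gaussian parameters are updated ---
    pose = pose.detach()
    params = [g.requires_grad_(True) for g in gaussians.values()]
    opt = torch.optim.Adam(params, lr=lr_map)
    for _ in range(iters_map):
        rgb_hat, depth_hat = render(gaussians, pose)
        loss = (rgb_hat - rgb).abs().mean() + 0.1 * (depth_hat - depth).abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()

    # New Gaussians are typically spawned where the rendering has low opacity
    # or large depth error; that densification step is omitted here.
    return pose, gaussians
```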
The table below contrasts representative neural SLAM systems.
| System | Year | Map representation | Sensors | Notes |
|---|---|---|---|---|
| iMAP | 2021 | Single MLP | RGB-D | First real-time MLP-based scene representation |
| NICE-SLAM | 2022 | Hierarchical feature grid + MLP | RGB-D | Scales to room-scale scenes |
| NICER-SLAM | 2023 | Hierarchical neural fields | RGB only | RGB-only neural SLAM |
| Vox-Fusion | 2022 | Sparse octree of MLPs | RGB-D | Adaptive map growth |
| NeRF-SLAM | 2022 | Instant-NGP NeRF | Monocular | Photorealistic real-time mapping |
| Co-SLAM | 2023 | Joint coordinate and parametric encoding | RGB-D | Improved hole filling |
| GO-SLAM | 2023 | Multi-resolution hash grid | RGB / RGB-D | Adds global pose-graph optimization and loop closure |
| Point-SLAM | 2023 | Neural point cloud | RGB-D | Anchors features at 3D points |
| SplaTAM | 2024 | 3D Gaussians | RGB-D | First mainstream 3DGS SLAM, CVPR 2024 |
| MonoGS | 2024 | 3D Gaussians | Monocular | First monocular 3DGS SLAM, CVPR 2024 |
| GS-SLAM | 2024 | 3D Gaussians | RGB-D | Coarse-to-fine tracking, 386 FPS |
| Photo-SLAM | 2024 | 3DGS map + ORB-SLAM3 tracking | Mono / stereo / RGB-D | Hybrid classical-neural |
| SGS-SLAM | 2024 | Semantic 3D Gaussians | RGB-D | First semantic 3DGS SLAM |
| RTG-SLAM | 2024 | Compact 3DGS | RGB-D | Scales to large environments |
Every SLAM system that estimates motion incrementally accumulates error. Visual feature tracking, IMU integration, and laser scan matching all introduce small per-frame errors that grow without bound along the trajectory. Loop closure provides the only general remedy: when the system recognizes a previously visited place, it adds a constraint to the pose graph and re-optimizes the trajectory to be consistent with the loop. In environments without loop closures, such as a single forward traverse through a forest, drift is unavoidable and is bounded only by the local accuracy of the sensors and the quality of any external priors.
Most SLAM formulations assume the world is static, which is convenient for the math but rarely true. Pedestrians, vehicles, and even slowly drifting objects violate the assumption and contaminate the feature correspondences. Modern systems address dynamics in three ways. The first is robust statistics: M-estimators, robust kernels (Huber, Cauchy), and RANSAC reject outlier correspondences. The second is semantic segmentation: a network labels which pixels belong to known dynamic categories such as people, cars, or animals, and the SLAM frontend ignores them. The third is explicit dynamic-object tracking, in which the system maintains separate trajectories for moving objects in addition to the static map. DynaSLAM, MaskFusion, and Co-Fusion are well-known examples.
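As a small illustration of the first strategy, the Huber kernel can be applied as an iteratively reweighted least squares weight that leaves small residuals untouched and down-weights large ones, such as those produced by features on a passing pedestrian; the threshold value here is arbitrary.

```python
import numpy as np

def huber_weight(residual_norm, delta=1.0):
    """IRLS weight for the Huber kernel: quadratic (weight 1) for small
    residuals, linear (down-weighted) for large ones."""
    r = np.abs(residual_norm)
    return np.where(r <= delta, 1.0, delta / r)

# Residuals from static points stay near zero; points on a moving pedestrian
# produce large residuals and receive correspondingly small weights.
residuals = np.array([0.2, 0.4, 0.1, 8.0, 12.5])
print(huber_weight(residuals, delta=1.0))   # -> [1.  1.  1.  0.125  0.08]
```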
A monocular camera cannot observe absolute scale: a miniature scale model of a building photographed from up close produces the same image as the real building photographed from proportionally farther away. Monocular SLAM systems therefore reconstruct geometry only up to an unknown scale factor, and scale itself drifts slowly as small errors accumulate. The standard remedies are to add a sensor that observes scale (an IMU through gravity and accelerometer integration, a stereo camera through known baseline, an RGB-D sensor through metric depth) or to use a learned monocular depth network as a prior. Visual-inertial SLAM has become the dominant approach for handheld and aerial applications precisely because the IMU resolves scale and provides robust short-term motion estimates.
SLAM has hard real-time constraints: a robot moving at one meter per second cannot wait several seconds for a global optimization pass. Modern systems use a layered architecture in which a fast frontend processes every frame at thirty Hertz or more, a sliding-window local optimizer runs at a few Hertz, and global bundle adjustment or pose-graph optimization runs only when needed (typically on a separate thread, triggered by loop closures). Memory grows with map size, which has motivated work on submapping, hierarchical maps, and map compression.
A SLAM system that runs for hours or days in the same environment must cope with appearance change: lighting shifts, weather, seasonal vegetation, rearranged furniture. Lifelong SLAM systems maintain a map that updates over time, distinguishing transient changes from persistent ones, and managing memory so that the map does not grow unbounded. RTAB-Map's working-memory and long-term-memory split is one approach. Recent research uses learned descriptors that are robust to appearance change, and Bayesian belief updates that down-weight outdated map elements.
Indoor service robots, warehouse robots, hospital delivery robots, and consumer robot vacuum cleaners all rely on SLAM, usually built on 2D LiDAR or low-cost RGB-D sensors. iRobot's Roomba i7 and later models, Roborock and Ecovacs robot vacuums, and Amazon's warehouse robots use variants of pose-graph SLAM with loop closure. Autonomous mobile robots in factories often combine SLAM with fiducial markers to reach industrial reliability levels.
Autonomous driving stacks use SLAM in two roles. Online SLAM provides ego-motion estimation that fuses cameras, IMU, wheel odometry, GPS, and LiDAR, often called odometry in this context. Offline SLAM is used to build the high-definition (HD) maps that many self-driving systems rely on as a prior, with centimeter-level accuracy on lane geometry, curbs, and traffic signs. Waymo, Cruise, Mobileye, Pony.ai, and AutoX all use a combination of LiDAR SLAM and visual-inertial methods. Tesla's vision-only Autopilot uses online visual odometry but not a persistent SLAM map; it relies on neural scene understanding rather than a stored geometric map.
Consumer and professional drones use visual-inertial SLAM (typically VINS-Fusion, OpenVINS, or proprietary derivatives) for indoor flight and GPS-denied operation. Skydio's autonomous filming drones and DJI's obstacle-avoidance systems both depend on real-time SLAM. Outdoor mapping drones use SLAM to register thousands of aerial images for photogrammetric reconstruction.
Inside-out tracking on AR and VR headsets is the consumer success story of visual-inertial SLAM. Microsoft HoloLens, the Meta Quest line (Quest 2, Quest 3, Quest Pro), Apple Vision Pro, PICO 4, and Magic Leap headsets all run SLAM on the device using fisheye cameras and IMUs to estimate head pose at high frequency and low latency. Apple Vision Pro, released in 2024, runs a multi-camera visual-inertial SLAM stack on its dedicated R1 chip, with the M2 chip handling rendering. Smartphone AR frameworks (Apple ARKit, Google ARCore) use visual-inertial SLAM to anchor virtual content to the physical world.
SLAM and visual odometry have flown on every NASA Mars rover since Spirit and Opportunity, which launched in 2003. Curiosity and Perseverance use stereo visual odometry to estimate motion in the absence of GPS, with periodic correction by feature matching against orbital imagery. The Ingenuity helicopter that flew on Mars in 2021 used a downward-facing camera and IMU for visual-inertial navigation.
SLAM is also used in indoor navigation aids for blind users, in archaeological documentation, in firefighting, in subterranean exploration (the DARPA Subterranean Challenge produced SLAM systems robust to dust, smoke, and darkness), in medical endoscopy, in agricultural robots, and in mixed reality games such as Niantic's Pokemon Go, which uses crowdsourced visual SLAM to anchor objects to real-world locations.
The probabilistic SLAM problem can be written as the joint posterior P(x_{1:t}, m | z_{1:t}, u_{1:t}), where x_{1:t} is the trajectory, m is the map, z_{1:t} is the sequence of observations, and u_{1:t} is the sequence of control inputs. Under Gaussian noise and Markovian models this posterior factorizes into a product of motion and observation likelihoods, which is what filters and graph-based optimizers exploit.
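Written out in the notation used above, with P(x_1) a prior on the initial pose, the factorization is

$$
P(x_{1:t}, m \mid z_{1:t}, u_{1:t}) \;\propto\;
P(x_1)\,\prod_{k=2}^{t} P(x_k \mid x_{k-1}, u_k)\;\prod_{k=1}^{t} P(z_k \mid x_k, m),
$$

where the first product contains the motion model and the second the observation model. Taking the negative logarithm under Gaussian noise turns each factor into a quadratic penalty, which is exactly the nonlinear least squares objective that graph-based SLAM and bundle adjustment minimize.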
Key theoretical results include the convergence of EKF-SLAM under known data association (Dissanayake et al. 2001), the sparsity of the information matrix in graph SLAM, and observability analyses that identify unobservable directions of the state. Monocular SLAM is unobservable along the gauge freedoms of similarity transforms; visual-inertial SLAM has four unobservable directions, three for global position and one for global yaw.
Progress in SLAM has been driven by public datasets and benchmarks. The most widely used include the KITTI odometry benchmark for outdoor driving, the EuRoC MAV dataset for visual-inertial flight, the TUM RGB-D and TUM-VI datasets for handheld and indoor sequences, and the Replica and ScanNet datasets for dense and neural RGB-D SLAM.
Standard accuracy metrics include Absolute Trajectory Error (ATE), Relative Pose Error (RPE), and for dense methods the L1 depth error and reconstruction completeness.
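A minimal sketch of the ATE computation as it is commonly implemented in benchmark evaluation scripts; the trajectories are assumed to be time-associated already, and the alignment follows the standard Kabsch/Umeyama closed form (rotation and translation only, so monocular results are usually scale-aligned beforehand).

```python
import numpy as np

def absolute_trajectory_error(est, gt):
    """RMSE Absolute Trajectory Error between estimated and ground-truth
    positions, both given as Nx3 arrays with matching timestamps."""
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    est_c = est - est.mean(axis=0)
    gt_c = gt - gt.mean(axis=0)

    # Optimal rotation from the SVD of the 3x3 cross-covariance matrix.
    H = est_c.T @ gt_c
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))            # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = gt.mean(axis=0) - R @ est.mean(axis=0)

    aligned = (R @ est.T).T + t                        # rigidly aligned estimate
    errors = np.linalg.norm(aligned - gt, axis=1)
    return np.sqrt(np.mean(errors ** 2))
```

RPE is computed analogously, but over relative motions between pose pairs separated by a fixed time or distance, which isolates local drift from globally accumulated error.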
Following the directions identified by Cadena et al. in 2016, the most active research areas in SLAM today include: