Robot navigation is the ability of a mobile robot to determine its own position in its environment, plan a path toward a goal, and execute that path while avoiding obstacles. It is one of the central problems in robotics and autonomous systems, with applications ranging from warehouse logistics and household vacuum cleaners to autonomous driving and planetary exploration. A complete navigation system typically combines localization, mapping, path planning, and motion control into a pipeline that lets the robot move safely and efficiently through the world.
Robot navigation answers three questions that any mobile agent must solve: "Where am I?", "Where do I want to go?", and "How do I get there?" The first question is addressed by localization, the second by goal specification or mission planning, and the third by path planning and trajectory execution. In practice these problems are tightly coupled. A robot that cannot localize itself accurately will plan poor paths; a robot that cannot plan around obstacles will collide, no matter how good its localization is.
Modern navigation systems rely on a layered architecture. A perception layer processes raw sensor data (cameras, LiDAR, IMU, wheel encoders) into a representation of the environment. A planning layer uses that representation to compute a collision-free route. A control layer converts the planned route into motor commands. Between these layers, a localization module continuously estimates the robot's pose (its position and orientation) within a map.
The choice of sensor hardware heavily influences the capabilities and limitations of a navigation system.
| Sensor | Principle | Typical range | Strengths | Weaknesses |
|---|---|---|---|---|
| LiDAR | Laser time-of-flight | 10 m to 200+ m | High accuracy, works in darkness | Expensive, sparse in vertical resolution |
| RGB Camera | Passive light capture | Varies with lens | Low cost, rich texture and color | Sensitive to lighting, no direct depth |
| RGB-D / Depth Camera | Structured light or ToF | 0.3 m to 10 m | Direct depth at low cost | Limited range, poor in sunlight |
| IMU (Inertial Measurement Unit) | Accelerometer + gyroscope | N/A (ego-motion) | High update rate, no external dependency | Drift accumulates over time |
| Wheel Encoders | Rotary counting | N/A (ego-motion) | Simple, inexpensive | Slip on rough or wet surfaces |
| GPS / GNSS | Satellite trilateration | Global | Absolute global position | Poor indoors, meter-level accuracy |
| Ultrasonic | Sound time-of-flight | 0.02 m to 5 m | Very low cost | Narrow beam, low resolution |
| Radar | Radio wave reflection | 1 m to 200+ m | Works in rain, fog, dust | Lower spatial resolution than LiDAR |
Because every sensor has blind spots, most robots use sensor fusion, combining data from multiple sensors to compensate for individual weaknesses. Common fusion combinations include LiDAR with IMU, camera with IMU (visual-inertial odometry), and LiDAR with camera for both geometric and semantic understanding. The Extended Kalman Filter (EKF) is frequently used to merge these heterogeneous measurements into a single, consistent state estimate.
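As a minimal illustration of Kalman-style fusion, the one-dimensional sketch below fuses an odometry-based prediction with an absolute position fix. All state values and noise variances are illustrative; a real EKF tracks a multidimensional state and linearizes nonlinear models with Jacobians.

```python
def kf_predict(x, P, u, Q):
    """Motion update: shift the estimate by the odometry increment u;
    process noise Q inflates the variance."""
    return x + u, P + Q

def kf_update(x, P, z, R):
    """Measurement update: fuse an absolute position fix z with variance R."""
    K = P / (P + R)              # Kalman gain weighs prediction vs. measurement
    x_new = x + K * (z - x)      # correct the estimate toward the measurement
    P_new = (1.0 - K) * P        # fused variance shrinks below both sources
    return x_new, P_new

# Illustrative numbers: prior estimate 10.0 m (variance 3.0), no commanded
# motion, process noise 1.0; then a GPS-like fix of 12.0 m (variance 1.0).
x, P = kf_predict(10.0, 3.0, 0.0, 1.0)
x, P = kf_update(x, P, 12.0, 1.0)
```

Note that the fused variance (0.8) is smaller than both the predicted variance (4.0) and the measurement variance (1.0): combining independent sources always reduces uncertainty.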
SLAM is the computational problem of constructing a map of an unknown environment while simultaneously tracking the robot's position within that map. It is sometimes described as a chicken-and-egg problem: accurate localization requires a good map, but building a good map requires accurate localization. SLAM algorithms solve both problems jointly.
The probabilistic formulation of SLAM originated at the 1986 IEEE Robotics and Automation Conference in San Francisco. Randall Smith and Peter Cheeseman showed that the correlations between landmark position estimates and the robot pose estimate grow without bound as the robot explores, and proposed using the Extended Kalman Filter to maintain a joint state vector of the robot pose and all landmark positions. Hugh Durrant-Whyte, John Leonard, and others developed this idea through the late 1980s and 1990s. The acronym "SLAM" was coined by Durrant-Whyte in 1995.
Major subsequent milestones include FastSLAM (Montemerlo, Thrun, Koller, and Wegbreit, 2002), which used particle filters to factorize the SLAM posterior into per-particle maps, and GraphSLAM (Thrun and Montemerlo, 2006), which recast SLAM as a sparse graph optimization problem.
SLAM can be stated as follows. At each time step, the robot executes a control command u and receives a sensor observation z. The robot's pose at time t is x_t and the map consists of a set of landmark positions m. The goal is to estimate the posterior probability P(x_{1:t}, m | z_{1:t}, u_{1:t}), the joint distribution over the trajectory and map given all controls and observations. Different SLAM algorithms approximate this posterior in different ways.
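The recursive structure of this posterior can be illustrated with a one-dimensional Bayes (histogram) filter. The sketch below covers only the localization slice of the problem, assuming the map (a row of doors and walls) is already known; the sensor and motion probabilities are illustrative.

```python
def predict(bel, move_prob):
    """Motion update: shift belief one cell right with probability move_prob
    (the remainder models wheel slip: the robot stayed put)."""
    n = len(bel)
    new = [0.0] * n
    for i, p in enumerate(bel):
        new[(i + 1) % n] += p * move_prob
        new[i] += p * (1.0 - move_prob)
    return new

def update(bel, world, z, hit_prob):
    """Measurement update: reweight each cell by the likelihood of seeing z
    there, then renormalize to obtain a proper distribution."""
    new = [p * (hit_prob if world[i] == z else 1.0 - hit_prob)
           for i, p in enumerate(bel)]
    s = sum(new)
    return [p / s for p in new]

world = ['door', 'wall', 'door', 'wall', 'wall']   # known map (illustrative)
bel = [0.2] * 5                                    # uniform prior
bel = update(bel, world, 'door', hit_prob=0.9)     # robot observes a door
bel = predict(bel, move_prob=1.0)                  # robot moves one cell
```

After one observation the belief concentrates on the two door cells; the motion update then shifts that belief one cell to the right, which is exactly the predict/correct cycle that full SLAM filters run over the joint trajectory-and-map state.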
The earliest SLAM solutions represented the robot pose and all landmark locations in a single augmented state vector, updated at each step by an Extended Kalman Filter. EKF-SLAM is simple to implement and works well when the number of landmarks is small. It has two main drawbacks: the state covariance matrix grows quadratically with the number of landmarks, making updates impractical for large environments, and linearization errors accumulate over long trajectories, which can render the filter's estimates inconsistent.
FastSLAM, introduced by Michael Montemerlo and Sebastian Thrun in 2002, uses a Rao-Blackwellized particle filter. Each particle carries a sampled robot trajectory and a set of independent landmark filters. This factorization breaks the quadratic scaling of EKF-SLAM, making the algorithm O(M log N) per update where M is the number of particles and N is the number of landmarks.
Graph-based SLAM (also called GraphSLAM or pose-graph SLAM) represents the problem as a factor graph. Nodes correspond to robot poses at different time steps (and optionally landmark positions). Edges encode constraints from odometry measurements between consecutive poses and from observations of landmarks. The full trajectory and map are recovered by solving a nonlinear least-squares optimization problem over the graph. This is the dominant SLAM paradigm as of 2025. Commonly used optimization libraries include g2o (developed by Rainer Kummerle et al., 2011) and GTSAM (developed by Frank Dellaert at Georgia Tech, 2012).
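A toy version of this optimization can be written for one-dimensional poses. The sketch below uses plain gradient descent rather than the Gauss-Newton or Levenberg-Marquardt solvers that g2o and GTSAM employ, and the edge values are illustrative: three odometry steps of +1.0 conflict with a loop closure measuring a total displacement of 2.7, so the optimizer spreads the 0.3 of accumulated drift across the chain.

```python
def optimize_pose_graph(n, edges, iters=500, lr=0.1):
    """Minimize the sum of squared residuals (x_j - x_i - z)^2 over a 1D
    pose graph by gradient descent; pose 0 is fixed as the datum."""
    x = [0.0] * n
    for _ in range(iters):
        grad = [0.0] * n
        for i, j, z in edges:
            r = x[j] - x[i] - z           # residual of constraint i -> j
            grad[j] += 2 * r
            grad[i] -= 2 * r
        grad[0] = 0.0                     # anchor the first pose
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x

# Edges are (i, j, measured displacement). Odometry says each step is +1.0;
# a loop closure observes that x3 - x0 = 2.7, revealing drift.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (0, 3, 2.7)]
x = optimize_pose_graph(4, edges)
```

The least-squares solution places the poses at 0.0, 0.925, 1.85, 2.775: each of the four constraints absorbs an equal 0.075 share of the drift, which is the characteristic behavior of pose-graph optimization.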
Visual SLAM (vSLAM) uses cameras as the primary sensor. It extracts visual features (corners, edges, or learned descriptors) from images and tracks them across frames to estimate motion and build a 3D map. Visual SLAM systems are popular because cameras are inexpensive and provide rich appearance information.
Notable visual SLAM systems include MonoSLAM (Davison et al., 2007), the first real-time monocular SLAM system, built on an EKF; PTAM (Klein and Murray, 2007), which split tracking and mapping into parallel threads; the ORB-SLAM family (Mur-Artal et al., 2015 onward), feature-based systems with loop closure and relocalization whose latest version, ORB-SLAM3, supports monocular, stereo, RGB-D, and visual-inertial operation; and LSD-SLAM (Engel et al., 2014), a direct method that operates on image intensities rather than extracted features.
LiDAR-based SLAM uses laser scanners to produce accurate range measurements. In 2D, algorithms like GMapping (Grisetti, Stachniss, and Burgard, 2007) use Rao-Blackwellized particle filters on occupancy grids. In 3D, point cloud registration methods such as ICP (Iterative Closest Point) and NDT (Normal Distributions Transform) align successive scans. Google Cartographer (Hess et al., 2016) is a widely used open-source system that performs real-time 2D and 3D SLAM using scan matching and loop closure on a pose graph.
LiDAR SLAM produces highly accurate geometric maps and works in complete darkness, but the sensor cost is higher than cameras and the resulting maps lack the color and texture information that visual SLAM provides.
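The core of scan registration can be illustrated with a minimal two-dimensional ICP sketch: alternate nearest-neighbor matching with a closed-form SVD alignment step. Production systems add outlier rejection, downsampling, and point-to-plane error metrics; the point sets and transform below are illustrative.

```python
import numpy as np

def icp_step(src, dst):
    """One ICP iteration: match each source point to its nearest destination
    point, then solve for the rigid transform (R, t) from the SVD of the
    cross-covariance of the matched, mean-centered point sets."""
    # brute-force nearest-neighbor correspondences (fine for small clouds)
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
    matched = dst[d2.argmin(axis=1)]
    mu_s, mu_d = src.mean(0), matched.mean(0)
    H = (src - mu_s).T @ (matched - mu_d)       # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                    # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Illustrative: a small "scan" rotated by 30 degrees and translated
theta = np.radians(30)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
scan = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
target = scan @ R_true.T + np.array([0.5, -0.2])
src = scan.copy()
for _ in range(10):                             # iterate match-and-align
    R, t = icp_step(src, target)
    src = src @ R.T + t
```

Because ICP converges only to a local optimum, real pipelines seed it with an odometry or IMU initial guess before refining the alignment.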
Semantic SLAM adds object detection and scene understanding on top of geometric SLAM. By recognizing objects and assigning semantic labels to map elements, the robot can build maps that encode not just geometry but also the meaning of different regions (e.g., "door", "table", "road"). Advances in deep learning and computer vision have accelerated semantic SLAM research, with systems leveraging convolutional neural networks and, more recently, vision-language models for richer scene representations.
State-of-the-art SLAM systems increasingly fuse data from multiple sensors. Common configurations include LiDAR-IMU SLAM, visual-inertial SLAM (camera + IMU), LiDAR-visual SLAM, and three-way LiDAR-IMU-visual SLAM. Fusion can be loosely coupled (each sensor runs its own front-end and results are merged) or tightly coupled (raw measurements are combined in a single optimization). RTAB-Map (Labbe and Michaud, 2019) is a notable open-source system that can fuse LiDAR and camera data with loop closure detection and multi-session mapping.
| System | Sensor input | Approach | Key feature |
|---|---|---|---|
| ORB-SLAM3 | Mono, stereo, RGB-D camera + optional IMU | Feature-based visual(-inertial) SLAM | Multi-map, loop closure, relocalization |
| LSD-SLAM | Monocular camera | Direct (intensity-based) visual SLAM | Semi-dense depth maps |
| Google Cartographer | 2D/3D LiDAR + IMU | Scan matching + pose graph optimization | Real-time 2D and 3D mapping |
| GMapping | 2D LiDAR | Rao-Blackwellized particle filter | Occupancy grid maps |
| RTAB-Map | LiDAR, stereo, RGB-D | Graph-based with appearance-based loop closure | Multi-session, multi-sensor |
| Hector SLAM | 2D LiDAR | Scan matching (no odometry required) | Fast indoor 2D mapping |
Path planning is the problem of computing a collision-free route from a start position to a goal position, given a map of the environment. Planning algorithms are typically divided into global planners, which compute a full path before execution begins, and local planners (controllers), which adjust the path in real time to respond to new obstacles.
Planning is often formulated in configuration space (C-space), where each point represents a complete specification of the robot's degrees of freedom. The subset of C-space that is free of collisions is called the free space. A path is a continuous curve through free space connecting the start configuration to the goal configuration.
Dijkstra's algorithm (conceived by Edsger Dijkstra in 1956 and published in 1959) finds the shortest path in a weighted graph by expanding nodes in order of increasing cost from the start. It is complete and optimal but explores in all directions, making it slow for large search spaces.
A* (Hart, Nilsson, and Raphael, 1968) improves on Dijkstra's algorithm by adding a heuristic function h(n) that estimates the cost from node n to the goal. A* expands nodes in order of f(n) = g(n) + h(n), where g(n) is the cost from the start. When h(n) is admissible (never overestimates), A* is both complete and optimal. It is one of the most widely used planners in robotics and is the basis for many grid-based navigation systems.
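A compact grid-based A* can be sketched as follows, using the Manhattan distance heuristic, which is admissible on a 4-connected unit-cost grid; the map is illustrative.

```python
import heapq

def astar(grid, start, goal):
    """A* on a 4-connected grid; grid[r][c] == 1 marks an obstacle."""
    rows, cols = len(grid), len(grid[0])
    h = lambda n: abs(n[0] - goal[0]) + abs(n[1] - goal[1])  # admissible
    open_set = [(h(start), 0, start, None)]    # entries: (f, g, node, parent)
    came_from, g_best = {}, {start: 0}
    while open_set:
        f, g, node, parent = heapq.heappop(open_set)
        if node in came_from:
            continue                           # already expanded with lower g
        came_from[node] = parent
        if node == goal:                       # walk parents back to the start
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nxt = (nr, nc)
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < g_best.get(nxt, float('inf')):
                    g_best[nxt] = ng
                    heapq.heappush(open_set, (ng + h(nxt), ng, nxt, node))
    return None                                # goal unreachable

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0],
        [0, 1, 1, 0]]
path = astar(grid, (0, 0), (3, 3))
```

On this map the returned path has cost 6, which matches the Manhattan lower bound, illustrating the optimality guarantee for admissible heuristics.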
D* (Dynamic A*), developed by Anthony Stentz at Carnegie Mellon University in 1994, is designed for environments that change during execution. When the robot discovers new obstacles, D* repairs the existing solution incrementally rather than replanning from scratch.
D* Lite (Koenig and Likhachev, 2002) achieves the same incremental replanning behavior with a simpler algorithm built on top of Lifelong Planning A* (LPA*). It maintains a priority queue of inconsistent nodes and updates only the portion of the search space affected by map changes. D* Lite is the preferred variant in most modern implementations, as it is easier to implement and analyze while producing identical behavior to Focused D*.
Theta* (Nash, Daniel, Koenig, and Felner, 2007) extends A* by allowing any-angle paths. At each node expansion, it checks whether a straight-line path to the parent's parent is collision-free, producing smoother paths than grid-constrained A*.
PRM (Kavraki, Svestka, Latombe, and Overmars, 1996) operates in two phases. In the construction phase, it samples random configurations in free space and connects nearby samples with straight-line paths, building a roadmap graph. In the query phase, it connects the start and goal to the roadmap and finds the shortest path using Dijkstra's algorithm. PRM is probabilistically complete: as the number of samples grows, the probability of finding a solution (if one exists) approaches 1.
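The two phases can be sketched for a point robot in a unit-square workspace with circular obstacles; the sample count, connection radius, and obstacle layout are illustrative. The function returns the length of the shortest roadmap path, or None if the roadmap is disconnected.

```python
import math, random, heapq

def collision_free(p, q, circles):
    """True if the segment p-q avoids all circular obstacles (center, radius)."""
    for (cx, cy), r in circles:
        px, py = p
        dx, dy = q[0] - px, q[1] - py
        L2 = dx * dx + dy * dy
        t = 0.0 if L2 == 0 else max(0.0, min(1.0,
            ((cx - px) * dx + (cy - py) * dy) / L2))
        nx, ny = px + t * dx, py + t * dy       # closest point on the segment
        if (nx - cx) ** 2 + (ny - cy) ** 2 <= r * r:
            return False
    return True

def prm(start, goal, circles, n_samples=200, radius=0.35, seed=1):
    """Construction phase: sample free configurations and link nearby pairs.
    Query phase: run Dijkstra from start (node 0) to goal (node 1)."""
    rng = random.Random(seed)
    nodes = [start, goal]
    while len(nodes) < n_samples + 2:
        p = (rng.random(), rng.random())
        if collision_free(p, p, circles):       # sample itself must be free
            nodes.append(p)
    edges = {i: [] for i in range(len(nodes))}
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            d = math.dist(nodes[i], nodes[j])
            if d <= radius and collision_free(nodes[i], nodes[j], circles):
                edges[i].append((j, d))
                edges[j].append((i, d))
    dist, pq, seen = {0: 0.0}, [(0.0, 0)], set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in seen:
            continue
        seen.add(u)
        if u == 1:
            return d
        for v, w in edges[u]:
            if d + w < dist.get(v, float('inf')):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return None                                 # roadmap disconnected

# Illustrative query: unit square with one circular obstacle in the middle
length = prm((0.05, 0.05), (0.95, 0.95), [((0.5, 0.5), 0.2)])
```

Because the obstacle sits on the straight line between start and goal, any roadmap path must detour around it and is strictly longer than the Euclidean distance.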
RRT (LaValle, 1998) incrementally builds a tree rooted at the start configuration by repeatedly sampling random points and extending the nearest tree node toward them. RRT is efficient at exploring high-dimensional configuration spaces and can handle non-holonomic constraints. Its main weakness is that it does not optimize path quality.
RRT* (Karaman and Frazzoli, 2011) is an asymptotically optimal variant of RRT. After adding a new node, it searches for nearby nodes that could serve as better parents and rewires the tree to reduce total path cost. Given enough time, RRT* converges to the optimal path.
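The basic RRT extension loop can be sketched for a point robot in a unit square with circular obstacles. For brevity this version collision-checks only the new node rather than the full extension segment, and adds a small goal bias to speed convergence; the step size, iteration budget, and obstacle layout are illustrative.

```python
import math, random

def rrt(start, goal, circles, step=0.05, goal_tol=0.05, iters=2000, seed=2):
    """Grow a tree from start: sample a point, find the nearest tree node,
    and extend one step toward the sample if the new node is free."""
    rng = random.Random(seed)
    nodes, parent = [start], {0: None}
    for _ in range(iters):
        # goal biasing: aim directly at the goal 5% of the time
        sample = goal if rng.random() < 0.05 else (rng.random(), rng.random())
        i = min(range(len(nodes)), key=lambda k: math.dist(nodes[k], sample))
        nx, ny = nodes[i]
        d = math.dist((nx, ny), sample)
        if d == 0:
            continue
        new = (nx + step * (sample[0] - nx) / d,
               ny + step * (sample[1] - ny) / d)
        if any(math.dist(new, c) <= r for c, r in circles):
            continue                            # extension hits an obstacle
        parent[len(nodes)] = i
        nodes.append(new)
        if math.dist(new, goal) <= goal_tol:    # reached the goal region
            path, k = [], len(nodes) - 1
            while k is not None:                # walk parents back to root
                path.append(nodes[k])
                k = parent[k]
            return path[::-1]
    return None

path = rrt((0.05, 0.05), (0.95, 0.95), [((0.5, 0.5), 0.2)])
```

The returned path is feasible but typically jagged and far from shortest, which is exactly the weakness that RRT*'s rewiring step addresses.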
| Algorithm | Type | Completeness | Optimality | Handles dynamic environments | Typical use case |
|---|---|---|---|---|---|
| Dijkstra's | Graph search | Complete | Optimal | No (static) | Small graphs, baseline |
| A* | Graph search | Complete | Optimal (admissible h) | No (static) | Grid-based 2D/3D planning |
| D* Lite | Graph search | Complete | Optimal | Yes (incremental) | Unknown or changing environments |
| Theta* | Graph search | Complete | Near-optimal | No (static) | Any-angle grid planning |
| PRM | Sampling-based | Probabilistically complete | Asymptotically optimal (PRM*) | No (static) | Multi-query, high-dimensional |
| RRT | Sampling-based | Probabilistically complete | No | No (static) | Single-query, high-dimensional |
| RRT* | Sampling-based | Probabilistically complete | Asymptotically optimal | No (static) | High-dimensional, quality paths |
Obstacle avoidance is the reactive layer of navigation that prevents collisions during path execution. Even with a planned global path, the robot needs real-time collision avoidance to handle dynamic obstacles (people, other robots) and map inaccuracies.
The Artificial Potential Field method (Khatib, 1986) models the goal as an attractive potential and obstacles as repulsive potentials. The robot follows the negative gradient of the combined field. APF is computationally simple and generates smooth trajectories. Its main limitation is susceptibility to local minima, where attractive and repulsive forces cancel out and the robot gets stuck. Various escape strategies have been proposed, including random walks, virtual obstacles, and harmonic potential functions.
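A single gradient-descent step on the combined field can be sketched as follows, using a quadratic attractive potential and a Khatib-style repulsive potential that is active only within an influence radius d0. The gains, radii, and positions are illustrative; the small lateral offset in the start position breaks the symmetry that would otherwise trap the robot in a local minimum directly behind the obstacle.

```python
import math

def apf_step(pos, goal, obstacles, k_att=1.0, k_rep=0.05, d0=0.4, lr=0.02):
    """Descend the gradient of U_att + U_rep, where
    U_att = 0.5 * k_att * ||pos - goal||^2 and
    U_rep = 0.5 * k_rep * (1/d - 1/d0)^2 for obstacle distance d < d0."""
    gx = k_att * (pos[0] - goal[0])          # attractive gradient
    gy = k_att * (pos[1] - goal[1])
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        d = math.hypot(dx, dy)
        if 0 < d < d0:
            # repulsive gradient points toward the obstacle, so subtracting
            # it pushes the robot away
            coef = k_rep * (1.0 / d - 1.0 / d0) / (d ** 3)
            gx -= coef * dx
            gy -= coef * dy
    return (pos[0] - lr * gx, pos[1] - lr * gy)

pos, goal = (0.0, 0.05), (1.0, 0.0)
for _ in range(400):
    pos = apf_step(pos, goal, obstacles=[(0.5, 0.0)])
```

The repulsive term grows without bound as d approaches zero, so the robot maintains clearance from the obstacle while the attractive term draws it around toward the goal.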
VFH (Borenstein and Koren, 1991) builds a polar histogram of obstacle density around the robot using range sensor data. It identifies openings (valleys) in the histogram that are wide enough for the robot to pass through, and selects the opening closest to the goal direction. VFH+ and VFH* add trajectory simulation and cost functions to handle wider robots and more complex environments.
DWA (Fox, Burgard, and Thrun, 1997) works in velocity space. It considers only velocities reachable within one control cycle (the "dynamic window") and simulates short trajectories for each candidate velocity. An objective function scores each trajectory based on progress toward the goal, clearance from obstacles, and velocity. DWA is widely used in the ROS Navigation Stack.
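A simplified DWA sketch for a differential-drive robot: enumerate a small grid of (v, w) candidates inside the dynamic window, forward-simulate each arc, and score by goal progress, clearance, and speed. The window discretization, cost weights, and scenario below are illustrative; production implementations use finer sampling and footprint-aware collision checks.

```python
import math

def dwa_choose(pose, v_cur, w_cur, goal, obstacles,
               v_max=0.5, w_max=1.5, a_v=0.5, a_w=2.0, dt=0.1, steps=10):
    """Pick the best (v, w) command from the dynamic window by simulating
    short arc trajectories and scoring progress, clearance, and velocity."""
    # dynamic window: velocities reachable within one control cycle
    v_lo, v_hi = max(0.0, v_cur - a_v * dt), min(v_max, v_cur + a_v * dt)
    w_lo, w_hi = max(-w_max, w_cur - a_w * dt), min(w_max, w_cur + a_w * dt)
    vs = [v_lo + i * (v_hi - v_lo) / 4 for i in range(5)]
    ws = [w_lo + i * (w_hi - w_lo) / 6 for i in range(7)]
    best_score, best_cmd = -float('inf'), (0.0, 0.0)
    for v in vs:
        for w in ws:
            x, y, th = pose
            clearance = float('inf')
            for _ in range(steps):              # forward-simulate the arc
                th += w * dt
                x += v * math.cos(th) * dt
                y += v * math.sin(th) * dt
                for ox, oy in obstacles:
                    clearance = min(clearance, math.hypot(x - ox, y - oy))
            if clearance < 0.1:
                continue                        # trajectory collides; skip
            score = (-math.dist((x, y), goal)   # progress toward the goal
                     + 0.2 * min(clearance, 0.5)
                     + 0.1 * v)                 # prefer faster motion
            if score > best_score:
                best_score, best_cmd = score, (v, w)
    return best_cmd

# Robot at the origin facing +x, goal straight ahead, obstacle off to one side
v, w = dwa_choose((0.0, 0.0, 0.0), 0.3, 0.0,
                  goal=(2.0, 0.0), obstacles=[(0.6, 0.15)])
```

Because only velocities inside the window are considered, the resulting command is always dynamically feasible for the robot's acceleration limits.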
MPC formulates trajectory tracking as an optimization problem over a receding time horizon. At each step, it solves for the optimal control sequence that minimizes a cost function (tracking error, control effort, obstacle proximity) subject to dynamics and constraint equations, then executes only the first control command before re-optimizing. MPC can handle complex dynamics and constraints but requires more computation than reactive methods.
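The receding-horizon idea can be sketched without a gradient-based solver by using random shooting: sample candidate control sequences, roll out the dynamics, score each by tracking error plus control effort, and apply only the first control of the best sequence. The one-dimensional double-integrator model, horizon, candidate count, and weights below are illustrative.

```python
import random

def mpc_step(x, v, target, rng, horizon=10, n_candidates=200, dt=0.1, u_max=1.0):
    """Random-shooting MPC for a 1D point mass (state: position x, velocity v).
    Returns only the FIRST control of the best-scoring candidate sequence."""
    best_cost, best_u0 = float('inf'), 0.0
    for _ in range(n_candidates):
        seq = [rng.uniform(-u_max, u_max) for _ in range(horizon)]
        cx, cv, cost = x, v, 0.0
        for u in seq:
            cv += u * dt                 # double-integrator dynamics rollout
            cx += cv * dt
            cost += (cx - target) ** 2 + 0.01 * u ** 2
        if cost < best_cost:
            best_cost, best_u0 = cost, seq[0]
    return best_u0

rng = random.Random(0)
x, v = 0.0, 0.0
for _ in range(80):                      # closed loop: re-optimize every step
    u = mpc_step(x, v, target=1.0, rng=rng)
    v += u * 0.1
    x += v * 0.1
```

Re-optimizing at every step is what makes the scheme robust to disturbances; gradient-based solvers (or sampling methods like MPPI) replace the naive random search in practical controllers.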
Bug algorithms (Lumelsky and Stepanov, 1987) are simple reactive strategies. In Bug1, the robot moves toward the goal until it hits an obstacle, circumnavigates the entire obstacle, and then departs from the point closest to the goal. Bug2 follows a straight line to the goal and, upon hitting an obstacle, follows the boundary until it can rejoin the original line. These algorithms guarantee convergence in 2D but produce long, inefficient paths.
The Robot Operating System (ROS) provides the most widely used open-source navigation framework. The original ROS 1 Navigation Stack, and its successor Nav2 for ROS 2, implement a full pipeline from sensor data to motor commands.
Nav2 (Navigation2) is the production-grade navigation framework for ROS 2, trusted by over 100 companies worldwide as of 2025. Its architecture is built on ROS 2 action servers, lifecycle-managed nodes, and behavior trees (using BehaviorTree.CPP V4) for task orchestration.
The core components of Nav2 include the planner server (global path planning), the controller server (local trajectory following), the behavior server (recovery behaviors such as spinning, backing up, and waiting), the smoother server (path refinement), the behavior-tree navigator that orchestrates these servers, layered costmaps for obstacle representation, AMCL for map-based localization, and the map server. Lifecycle manager nodes provide deterministic startup and shutdown of the whole stack.
Nav2 supports differential-drive, holonomic, Ackermann (car-like), and legged robot kinematics through its plugin architecture.
Indoor navigation presents specific challenges that distinguish it from outdoor operation.
GPS signals are unreliable or unavailable inside buildings due to walls and ceilings, so indoor robots depend entirely on onboard sensors (LiDAR, cameras, wheel odometry) and pre-built maps or SLAM for localization. Indoor environments tend to be geometrically structured (walls, corridors, rooms) but highly dynamic, with people, furniture, and doors that open and close. The robot must navigate socially, yielding to pedestrians and respecting personal space.
Common indoor navigation applications include warehouse autonomous mobile robots (AMRs), hospital delivery robots, cleaning robots (e.g., Roomba and its successors), and service robots in hotels and airports. These systems typically use 2D LiDAR SLAM on an occupancy grid, with A* or Dijkstra for global planning and DWA or TEB for local control.
Indoor positioning can also be augmented with infrastructure-based systems such as ultra-wideband (UWB) beacons, Wi-Fi fingerprinting, or Bluetooth Low Energy (BLE) beacons, which provide absolute position references to correct odometry drift.
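Given ranges to three or more beacons at known positions, the robot's position can be recovered by linearizing the range equations (subtracting the first equation from the others) and solving the resulting least-squares problem. The sketch below solves the 2D case via the normal equations in closed form; the beacon layout and noiseless ranges are illustrative.

```python
import math

def trilaterate(beacons, ranges):
    """Estimate a 2D position from ranges to known beacons. Subtracting the
    first range equation from the others yields a linear system A p = b,
    solved here through the 2x2 normal equations (A^T A) p = A^T b."""
    (x0, y0), r0 = beacons[0], ranges[0]
    A, b = [], []
    for (xi, yi), ri in zip(beacons[1:], ranges[1:]):
        A.append((2 * (xi - x0), 2 * (yi - y0)))
        b.append(r0 ** 2 - ri ** 2 + xi ** 2 - x0 ** 2 + yi ** 2 - y0 ** 2)
    s_xx = sum(ax * ax for ax, ay in A)
    s_xy = sum(ax * ay for ax, ay in A)
    s_yy = sum(ay * ay for ax, ay in A)
    t_x = sum(ax * bi for (ax, ay), bi in zip(A, b))
    t_y = sum(ay * bi for (ax, ay), bi in zip(A, b))
    det = s_xx * s_yy - s_xy * s_xy
    return ((s_yy * t_x - s_xy * t_y) / det,
            (s_xx * t_y - s_xy * t_x) / det)

beacons = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_pos = (3.0, 4.0)
ranges = [math.dist(true_pos, bcn) for bcn in beacons]
est = trilaterate(beacons, ranges)
```

Real UWB or BLE ranges are noisy, so deployed systems feed such fixes into a filter (for example an EKF) alongside odometry rather than trusting a single solve.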
Outdoor navigation introduces challenges related to scale, terrain variability, weather, and GPS availability.
Outdoor robots can use GPS/GNSS for coarse global localization, but centimeter-level accuracy requires RTK (Real-Time Kinematic) corrections or fusion with other sensors. Terrain can vary from paved roads to gravel, grass, mud, and uneven ground, requiring the robot to assess traversability, not just detect obstacles. Lighting conditions change dramatically between sun and shade, and weather (rain, fog, snow) can degrade both camera and LiDAR performance.
Autonomous vehicles are the highest-profile outdoor navigation application. Self-driving cars from companies like Waymo, Cruise, and others combine high-definition maps, LiDAR, cameras, radar, and IMUs in a tightly integrated perception and planning stack. Agricultural robots, delivery robots (e.g., Starship Technologies, Nuro), and planetary rovers (NASA's Mars rovers Curiosity and Perseverance) also rely on outdoor navigation. Unstructured outdoor environments, such as forests and disaster sites, remain among the most challenging domains for autonomous navigation.
Machine learning and deep learning approaches are increasingly used alongside or in place of classical navigation algorithms.
Reinforcement learning (RL) trains a navigation policy by letting the agent interact with an environment (real or simulated) and learn from rewards. The agent receives sensor observations as state input and outputs velocity commands. Deep RL methods such as Deep Q-Networks (DQN), Proximal Policy Optimization (PPO), and Twin Delayed Deep Deterministic Policy Gradient (TD3) have been applied to mapless navigation, where the robot learns to reach goals and avoid obstacles directly from raw sensor data without building an explicit map.
The main advantage of RL-based navigation is adaptability: a well-trained policy can generalize to new environments without manual tuning. The main challenges are sample efficiency (training requires millions of interactions), sim-to-real transfer (policies trained in simulation may fail on real hardware), and safety during training.
End-to-end approaches train a single neural network to map directly from sensor input (e.g., camera images) to control commands. Early work includes ALVINN (Pomerleau, 1989), which trained a neural network to steer a vehicle from camera images, and NVIDIA's PilotNet (Bojarski et al., 2016), which used convolutional neural networks for lane following. These methods bypass the traditional perception-planning-control pipeline but are difficult to interpret and verify.
Visual navigation uses camera images as the primary input for navigation decisions. Recent systems combine computer vision with learned representations to navigate in previously unseen environments. Point-goal navigation ("go to coordinates X, Y") and image-goal navigation ("go to the place shown in this photo") are standard benchmarks in the embodied AI community, evaluated in simulation environments like Habitat (Meta AI) and AI2-THOR (Allen Institute for AI).
Object-goal navigation tasks the robot with finding a specific object category (e.g., "find a refrigerator") in an unexplored environment. This requires the robot to combine exploration, object detection, and commonsense reasoning about where objects are likely to be found. Recent approaches (2024-2025) use scene graphs and large language models for high-level reasoning. CogNav (Cao et al., ICCV 2025) uses LLM-driven cognitive maps combining semantic scene graphs, landmark graphs, and occupancy maps. FOM-Nav (2025) maintains frontier-object maps with 3D object point clouds to guide exploration efficiently. The Open Scene Graph Navigator (2025) organizes spatial information hierarchically for zero-shot generalization across environments and robot platforms.
Vision-language navigation (VLN) requires a robot to follow natural language instructions (e.g., "Go past the kitchen and turn left at the hallway") while navigating through a 3D environment. VLN integrates natural language processing, visual perception, and spatial reasoning. The Room-to-Room (R2R) dataset (Anderson et al., 2018) is a widely used benchmark. Recent work leverages vision-language models and transformer architectures to improve instruction following and generalization.
As of 2025, foundation models are being integrated into robot navigation at multiple levels. Large language models serve as high-level planners that decompose natural language goals into sequences of navigation sub-tasks. Vision-language models provide zero-shot object recognition and scene understanding for semantic navigation. Vision-Language-Action (VLA) models attempt to map visual observations and language instructions directly to robot actions. While these approaches show strong results on benchmarks, transferring them reliably to physical robots in uncontrolled settings remains an active research problem.
When multiple robots share an environment, navigation must account for inter-robot coordination to avoid collisions and deadlocks. Centralized approaches plan paths for all robots jointly but scale poorly. Decentralized approaches let each robot plan independently while following collision avoidance protocols such as ORCA (Optimal Reciprocal Collision Avoidance, van den Berg et al., 2011). Multi-robot SLAM systems extend single-robot SLAM by merging maps built by different robots, requiring robust place recognition and inter-robot communication.
Several trends define the current state of robot navigation research and deployment.
Neural SLAM and learned representations. Neural network-based SLAM systems are replacing hand-crafted features with learned descriptors and end-to-end differentiable pipelines. Systems using the Mamba state-space architecture for efficient visual feature processing, combined with transformer-based attention, represent one frontier.
Event cameras for SLAM. Event-based cameras, which output asynchronous per-pixel brightness changes rather than full frames, offer high dynamic range, microsecond-level temporal resolution, and low power consumption. These properties make them attractive for high-speed navigation and environments with extreme lighting variation.
Multi-sensor fusion as default. Modern production systems almost universally combine LiDAR, cameras, and IMUs in tightly coupled fusion frameworks. The single-sensor SLAM system is increasingly a research prototype rather than a deployed product.
LLM-powered planning. Large language models are being used as high-level task planners and commonsense reasoning engines for navigation. They decompose goals like "bring me a glass of water" into navigation sub-goals and manipulation actions, though reliability in open-ended real-world scenarios remains limited.
Sim-to-real transfer. Photorealistic simulators (Habitat, Isaac Sim, AI2-THOR) allow training navigation policies in simulation and transferring them to real robots. Domain randomization and domain adaptation techniques help bridge the gap between simulated and real sensor data.
Continuous and lifelong mapping. Robots deployed for months or years in the same environment need maps that update incrementally as the environment changes, handling new furniture, seasonal changes, or construction. Lifelong SLAM and experience-based navigation address this requirement.