Sensor fusion
Last reviewed: Apr 30, 2026
Sources: 24 citations
Review status: Source-backed
Revision: v1 · 3,993 words
Sensor fusion is the process of combining data from multiple sensors, often of different types, to produce information that is more accurate, complete, or reliable than what any single sensor could provide on its own. The phrase covers everything from a smartphone blending GPS, Wi-Fi, and accelerometer readings to estimate where you are walking, to an autonomous truck merging camera, LiDAR, radar, and inertial data to decide whether the object 80 metres ahead is a parked car or a pedestrian. Hall and Llinas, in their widely cited 1997 tutorial in the Proceedings of the IEEE, defined data fusion as "the combination of data from multiple sensors, and related information from associated databases, to achieve improved accuracies and more specific inferences than could be achieved by the use of a single sensor alone" (Hall and Llinas, 1997).
The motivation is essentially that no single sensor is good at everything. Cameras give dense appearance information but struggle in low light and cannot directly measure depth. LiDAR provides precise 3D geometry but is expensive and degrades in heavy rain or fog. Radar measures velocity through the Doppler effect and works in poor weather, yet has low angular resolution. Inertial measurement units (IMUs) give high-rate motion data but drift without external aiding. GPS gives global position, but only outdoors and only to a few metres of accuracy without augmentation. Fusion is how a system combines complementary strengths and covers for the gaps.
Almost every embodied AI system depends on sensor fusion. Self-driving cars, mobile robots, drones, augmented and virtual reality headsets, smartphones, fitness wearables, surgical robots, smart-home devices, and military targeting systems all rely on it. Hall and Llinas open their 1997 paper by listing similar motivations: military target tracking, automated target recognition, monitoring of complex machinery, medical diagnosis, and what they called "smart buildings." Mitchell's 2007 textbook Multi-Sensor Data Fusion: An Introduction covers a similar spread, including biometrics, geoscience, and computer vision (Mitchell, 2007).
The KITTI benchmark, introduced by Geiger, Lenz, and Urtasun in 2012, demonstrated that algorithms ranking near the top of laboratory datasets fell to mediocre performance on real-world stereo vision and 3D detection tasks (Geiger et al., 2012). Later datasets like nuScenes (Caesar et al., 2020) and the Waymo Open Dataset (Sun et al., 2020) shipped with the full sensor suite of a commercial autonomous-driving prototype, in part to push the field toward genuine multi-sensor learning rather than vision-only solutions.
A standard way to organise sensor fusion is by the level of abstraction at which data is combined. The Joint Directors of Laboratories (JDL) data fusion model, popularised by Hall and Llinas, distinguishes several levels. In practice, robotics and machine-learning communities tend to talk about three or four broad categories.
| Level | Other names | What is combined | Typical example |
|---|---|---|---|
| Low-level | Raw, signal, pixel-level, early fusion | Raw sensor measurements before any processing | Stitching multiple LiDAR returns, fusing pixel intensities from RGB and thermal cameras |
| Feature-level | Mid-fusion | Features extracted from each sensor (edges, point clusters, embeddings) | Concatenating CNN features from a camera with PointNet features from LiDAR |
| Decision-level | Late fusion, high-level | Independent decisions or detections from per-sensor pipelines | Voting between a camera-only pedestrian detector and a radar-only one |
| State-vector | Track-level | Estimates of state variables (position, velocity) | Combining a GPS fix with an IMU-driven state estimate via a Kalman filter |
Deep-learning literature reframes this as early, mid, or late fusion. Early fusion concatenates raw inputs at the network input. Mid (feature) fusion merges intermediate feature maps. Late fusion combines per-modality outputs. A 2025 review (Liu et al., 2025) notes that feature-level and bird's-eye-view (BEV) fusion dominate state-of-the-art autonomous-driving stacks, while late fusion remains common in safety-critical pipelines because the modalities can be evaluated independently for redundancy.
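As a rough illustration of decision-level (late) fusion, the sketch below combines per-sensor detection probabilities by adding their log-odds. It assumes the sensors' errors are conditionally independent and that every detector was calibrated against the same prior, which real pipelines rarely satisfy exactly. The function name `fuse_detections` is hypothetical.

```python
import math

def fuse_detections(probabilities, prior=0.5):
    """Fuse per-sensor detection probabilities into one posterior.

    Assumption: each value is P(object | one sensor's data) computed with
    the same prior, and sensor errors are conditionally independent.
    """
    prior_logit = math.log(prior / (1.0 - prior))
    logit = prior_logit
    for p in probabilities:
        p = min(max(p, 1e-9), 1.0 - 1e-9)  # clamp to avoid log(0)
        # Each sensor contributes its evidence relative to the shared prior.
        logit += math.log(p / (1.0 - p)) - prior_logit
    return 1.0 / (1.0 + math.exp(-logit))

# Camera-only and radar-only pedestrian detectors that both lean "yes".
print(fuse_detections([0.8, 0.7]))
# Two detectors that disagree symmetrically cancel out.
print(fuse_detections([0.9, 0.1]))
```

Because each modality produces its own probability before the combination step, each can still be evaluated and monitored independently, which is the redundancy argument for late fusion.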
Classical sensor fusion is built on recursive Bayesian estimation. Newer work bolts deep learning on top, but the underlying tools are still standard first-year robotics material.
| Technique | What it does | Strengths | Limitations |
|---|---|---|---|
| Kalman filter (KF) | Optimal recursive estimator for linear systems with Gaussian noise (Kalman, 1960) | Closed-form, fast, optimal under its assumptions | Linearity and Gaussian assumptions rarely hold |
| Extended Kalman filter (EKF) | Linearises the dynamics and measurement models around the current estimate at each step | Works for mildly nonlinear problems; widely deployed | Linearisation can diverge for highly nonlinear systems |
| Unscented Kalman filter (UKF) | Uses a deterministic set of "sigma points" (the unscented transform) to capture mean and covariance through the nonlinearity (Julier and Uhlmann, 1997; Wan and van der Merwe, 2000) | Captures nonlinear effects to higher order than EKF without computing Jacobians | Same computational order as EKF but slightly more expensive |
| Particle filter | Monte Carlo estimator that represents the state distribution by a cloud of weighted samples | Handles arbitrary distributions and nonlinearities | Sample impoverishment in high dimensions; expensive |
| Information filter | Dual of the Kalman filter using inverse covariance | Convenient for combining many independent measurements | Less intuitive for prediction step |
| Complementary filter | Mixes high-pass gyro data with low-pass accelerometer data for attitude | Cheap, easy to tune | Limited to specific sensor combinations |
| Madgwick / Mahony filters | Quaternion-based attitude estimators using gradient descent (Madgwick) or PI feedback (Mahony) | Run on microcontrollers; common in IMU firmware | Less accurate than a full EKF in high-end applications |
| Factor graphs | Probabilistic graphical model where variables and constraints are nodes; solved via nonlinear least squares | Scales to large SLAM and bundle-adjustment problems; flexible | Requires careful initialisation |
| Dempster-Shafer theory | Combines evidence from sources with explicit treatment of ignorance | Useful for decision fusion under uncertainty | Computationally heavy; debate over interpretation |
| Fuzzy logic | Combines linguistic rules and membership functions | Easy to encode expert knowledge | Hard to verify; limited probabilistic guarantees |
| Deep neural fusion | Cross-modal CNNs, transformers, and attention layers | Learns complex correlations from data | Data hungry; weaker formal guarantees |
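The complementary filter in the table is simple enough to sketch in a few lines. The version below estimates pitch only, trusting the integrated gyro at high frequencies and the accelerometer's gravity direction at low frequencies. The blend factor `alpha`, the axis and sign convention, and the function name are illustrative assumptions, not taken from any particular IMU driver.

```python
import math

def complementary_filter(gyro_rates, accel_samples, dt, alpha=0.98, pitch0=0.0):
    """Estimate pitch (rad) from gyro pitch rates and accelerometer samples.

    gyro_rates:    pitch rates in rad/s, one per time step
    accel_samples: (ax, az) pairs in m/s^2; the convention that
                   atan2(ax, az) gives pitch is an assumption
    """
    pitch = pitch0
    history = []
    for rate, (ax, az) in zip(gyro_rates, accel_samples):
        accel_pitch = math.atan2(ax, az)      # noisy but drift-free
        gyro_pitch = pitch + rate * dt        # smooth but drifts over time
        pitch = alpha * gyro_pitch + (1.0 - alpha) * accel_pitch
        history.append(pitch)
    return history

# A stationary sensor tilted 45 degrees: the estimate settles on the
# accelerometer's answer even though the filter starts at zero.
estimates = complementary_filter([0.0] * 500, [(1.0, 1.0)] * 500, dt=0.01)
print(estimates[-1])
```

A larger `alpha` trusts the gyro for longer and suppresses accelerometer noise, at the cost of correcting gyro drift more slowly.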
The Kalman filter, published by Rudolf Kalman in 1960 in the Journal of Basic Engineering, is the most influential sensor-fusion algorithm ever written. Kalman recast filtering and prediction using state-space representations and showed that for linear-Gaussian systems the optimal recursive estimator has a closed form (Kalman, 1960). The extended Kalman filter (EKF), which linearises a nonlinear system about the current estimate, was developed at NASA by Stanley F. Schmidt and others to support Apollo navigation. The Apollo Primary Guidance, Navigation, and Control System (PGNCS) used a Schmidt-Kalman filter to fuse star sightings, IMU readings, and ground-based tracking, and is widely regarded as the first major operational use of Kalman filtering (Cipra, 1993; Grewal and Andrews, 2010).
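The closed form mentioned above fits in two short functions. The following is a minimal sketch of the textbook linear Kalman filter predict and update steps using NumPy; the function names are hypothetical and the code is not drawn from any specific library.

```python
import numpy as np

def kf_predict(x, P, F, Q):
    """Propagate state x and covariance P through the linear model F."""
    x, P, F, Q = (np.asarray(a, dtype=float) for a in (x, P, F, Q))
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q          # uncertainty grows by process noise Q
    return x_pred, P_pred

def kf_update(x, P, z, H, R):
    """Correct the prediction with measurement z = H x + noise (cov R)."""
    x, P, z, H, R = (np.asarray(a, dtype=float) for a in (x, P, z, H, R))
    y = z - H @ x                     # innovation
    S = H @ P @ H.T + R               # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(x.shape[0]) - K @ H) @ P
    return x_new, P_new

# Fusing a prior (mean 0, variance 1) with an equally uncertain
# measurement of 1 lands halfway between them and halves the variance.
x, P = kf_update([0.0], [[1.0]], [1.0], [[1.0]], [[1.0]])
print(x, P)
```

The gain `K` is what makes this a fusion rule: it weights the measurement by how uncertain the prediction is relative to the sensor.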
The unscented Kalman filter, introduced by Simon Julier and Jeffrey Uhlmann at SPIE AeroSense in 1997 and refined by Wan and van der Merwe in 2000, sidesteps EKF linearisation by passing a small set of carefully chosen "sigma points" through the nonlinear function and recovering the posterior mean and covariance directly (Julier and Uhlmann, 1997; Wan and van der Merwe, 2000). For Gaussian inputs, the UKF captures the posterior mean and covariance accurately to the third order of a Taylor expansion for any nonlinearity, and to at least the second order for non-Gaussian inputs. It does so at the same computational order as the EKF, and it has become a default choice for many nonlinear fusion problems.
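The sigma-point idea can be sketched on its own, without the rest of the filter. The code below implements the basic unscented transform with a single scaling parameter `kappa`; the scaled variant with `alpha` and `beta` parameters used in many UKF implementations is omitted for brevity, and the function name is hypothetical.

```python
import numpy as np

def unscented_transform(mean, cov, f, kappa=1.0):
    """Propagate a Gaussian (mean, cov) through a nonlinear function f."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    n = mean.shape[0]
    # Matrix square root: L @ L.T == (n + kappa) * cov
    L = np.linalg.cholesky((n + kappa) * cov)
    # 2n + 1 sigma points: the mean, plus and minus each column of L
    sigma = np.vstack([mean, mean + L.T, mean - L.T])
    weights = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    weights[0] = kappa / (n + kappa)
    # Push every sigma point through the nonlinearity
    ys = np.array([np.asarray(f(s), dtype=float) for s in sigma])
    y_mean = weights @ ys
    d = ys - y_mean
    y_cov = (weights[:, None] * d).T @ d
    return y_mean, y_cov

# Range-bearing measurement converted to Cartesian coordinates.
polar_to_xy = lambda s: np.array([s[0] * np.cos(s[1]), s[0] * np.sin(s[1])])
m, C = unscented_transform([10.0, 0.5], [[0.04, 0.0], [0.0, 0.01]], polar_to_xy)
print(m)
print(C)
```

No Jacobian of `f` is ever computed; the function is only evaluated at the sigma points. For a linear `f` the transform reproduces the exact mean and covariance, which makes a convenient sanity check.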
Particle filters, factor graphs, and graphical models are covered in depth in Probabilistic Robotics by Sebastian Thrun, Wolfram Burgard, and Dieter Fox, which set the modern probabilistic framing of robot perception (Thrun et al., 2005). Frank Dellaert's GTSAM library at Georgia Tech is the canonical factor-graph implementation, supporting incremental smoothing for SLAM and visual-inertial odometry (Dellaert, 2012).
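For intuition about the weighted-sample idea behind particle filters, here is a minimal bootstrap particle filter for a one-dimensional state. The random-walk motion model, the Gaussian sensor model, the noise levels, and the plain multinomial resampling are all simplifying assumptions chosen for the sketch; the function names are hypothetical.

```python
import numpy as np

def particle_filter_step(particles, z, rng, process_std=0.1, meas_std=0.5):
    """One predict / weight / resample cycle for a scalar state."""
    # Predict: diffuse particles with a random-walk motion model.
    particles = particles + rng.normal(0.0, process_std, size=particles.shape)
    # Weight: Gaussian likelihood of measurement z given each particle.
    weights = np.exp(-0.5 * ((z - particles) / meas_std) ** 2) + 1e-300
    weights /= weights.sum()
    # Resample: draw particles in proportion to their weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

def run_demo(true_state=5.0, steps=30, n_particles=1000, seed=0):
    """Track a constant true state from noisy measurements."""
    rng = np.random.default_rng(seed)
    particles = rng.uniform(0.0, 10.0, size=n_particles)  # vague prior
    for _ in range(steps):
        z = true_state + rng.normal(0.0, 0.5)
        particles = particle_filter_step(particles, z, rng)
    return particles.mean()

print(run_demo())
```

The particle cloud starts spread over the whole interval and collapses around the true state as measurements arrive. In higher dimensions the same scheme needs far more particles, which is the sample-impoverishment problem noted in the table.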
A practical fusion stack is shaped by the physics and economics of each sensor it uses. The table below summarises the common ones.
| Sensor | Measures | Strengths | Weaknesses |
|---|---|---|---|
| RGB camera | Light intensity per pixel | Dense semantic information; cheap; high resolution | No depth; sensitive to glare, low light, fog |
| Stereo camera | Two views for disparity-based depth | Passive depth; works outdoors | Range limited by baseline; texture-dependent |
| LiDAR | Time-of-flight 3D point clouds | Accurate geometry to long range | Expensive; degraded in rain, snow, fog |
| Radar | Range and Doppler velocity of reflectors | Works in fog/rain; direct velocity; long range | Low angular resolution; sparse |
| 4D imaging radar | Range, azimuth, elevation, Doppler | Higher resolution than classical radar | Still sparse vs LiDAR |
| IMU (gyros + accelerometers) | Angular rate and linear acceleration | High data rate (100-1000 Hz); cheap | Drifts without external aiding |
| GPS / GNSS | Global position from satellite signals | Absolute position outdoors | No service indoors or in tunnels; metres of error |
| Wheel odometry | Wheel rotation | Cheap; high rate | Slip on ice, grass, sand |
| Ultrasonic | Short-range echo | Cheap; works in close quarters | Very short range; slow update |
| Time-of-flight depth camera | Per-pixel depth via modulated light | Dense depth at short range | Limited outdoor performance |
| Magnetometer | Earth's magnetic field | Heading reference | Distorted by ferromagnetic objects |
| Barometer | Atmospheric pressure | Useful for altitude (e.g. floor changes indoors) | Drifts with weather |
| Microphone array | Sound | Localises sound sources | Noisy in real environments |
| Force / torque sensor | Contact forces | Essential for manipulation | Limited to contact tasks |
Real fusion stacks combine sensors with complementary failure modes. A camera identifies a green spot as a traffic light; a radar reports that the blob 30 metres ahead is moving away at 5 m/s; an IMU tells the pose estimator how the car has rotated in the last 10 ms while it waits for the next GNSS fix.
Inertial navigation systems (INS) integrate accelerometer and gyroscope measurements over time. Pure inertial navigation drifts because sensor errors accumulate through the integration, so a practical INS is aided by external measurements. The standard modern pattern fuses INS with GPS via an extended or unscented Kalman filter, giving smooth high-rate position output between GPS fixes. Variants of this aided-inertial pattern, with different external references (Apollo predates GPS and relied on star sightings and ground-based tracking), appear in Apollo PGNCS, the Boeing 747's INS, the Tomahawk cruise missile, and vehicle stability control.
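A one-dimensional, loosely coupled version of the GPS/INS pattern can be sketched as follows. The accelerometer drives the high-rate prediction and occasional GPS fixes correct the accumulated drift. A linear Kalman filter is enough here because the toy model is linear; real systems use an EKF or UKF over a full 3D pose. The noise values, the `gps` dictionary format, and the function name are illustrative assumptions.

```python
import numpy as np

def gps_ins_filter(accel, gps, dt, accel_var=0.04, gps_var=9.0):
    """Fuse accelerometer samples with sparse GPS position fixes (1D).

    accel: accelerometer readings in m/s^2, one per time step
    gps:   dict mapping step index -> GPS position in metres
    Returns an array of [position, velocity] estimates, one per step.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity transition
    B = np.array([0.5 * dt**2, dt])         # how acceleration enters the state
    Q = accel_var * np.outer(B, B)          # accelerometer noise as process noise
    H = np.array([[1.0, 0.0]])              # GPS observes position only
    x = np.zeros(2)
    P = np.diag([gps_var, 1.0])
    out = []
    for k, a in enumerate(accel):
        # High-rate inertial prediction
        x = F @ x + B * a
        P = F @ P @ F.T + Q
        # Low-rate GPS correction, only when a fix is available
        if k in gps:
            y = gps[k] - H @ x
            S = H @ P @ H.T + gps_var
            K = P @ H.T / S
            x = x + K @ y
            P = (np.eye(2) - K @ H) @ P
        out.append(x.copy())
    return np.array(out)

# 100 Hz accelerometer, 10 Hz GPS, constant 1 m/s^2 acceleration.
dt = 0.01
fixes = {k: 0.5 * ((k + 1) * dt) ** 2 for k in range(9, 100, 10)}
track = gps_ins_filter([1.0] * 100, fixes, dt)
print(track[-1])
```

Between fixes the output is pure dead reckoning, which is why the position covariance grows until the next GPS update pulls it back down.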
SLAM (simultaneous localization and mapping) is a textbook sensor-fusion problem: build a map of an unknown environment while simultaneously locating the robot inside it. Visual SLAM uses a camera, LiDAR SLAM uses scanning lasers, and visual-inertial SLAM adds an IMU to stabilise the visual front end. Systems like LIO-SAM (Shan et al., 2020) and LVI-SAM (Shan et al., 2021) tightly couple LiDAR, IMU, and camera data via factor graphs and incremental smoothing, producing centimetre-accurate trajectories in places where any single sensor would fail.
Multi-sensor target tracking has driven fusion research since the Cold War. Air-traffic control radars fuse primary radar returns with secondary IFF transponder data, and military systems like the Aegis Combat System combine the SPY-1 phased array radar with infrared, ESM, and external sensor feeds.
Autonomous-driving stacks are the most visible modern sensor-fusion systems. A representative autonomous vehicle might carry six to twelve cameras, one to five LiDARs, three to eight radars, an IMU, RTK-GPS, wheel speed sensors, and an HD map prior. The KITTI benchmark (Geiger et al., 2012), nuScenes (Caesar et al., 2020), and the Waymo Open Dataset (Sun et al., 2020) all ship with synchronised camera, LiDAR, and other sensor data and have driven a decade of fusion research.
Major families of deep-learning fusion methods for autonomous driving include:
| Family | Representative works | Fusion strategy |
|---|---|---|
| Object-centric two-stage fusion | MV3D (Chen et al., 2017), AVOD (Ku et al., 2018) | Generate 3D proposals from LiDAR bird's-eye view, fuse with image features at the region-of-interest stage |
| Sequential / point-painting | PointPainting (Vora et al., 2020) | Project LiDAR points into image semantic-segmentation outputs, append class scores to each point |
| Voxel-based fusion | VoxelNet, PointPillars derivatives | Voxelise the point cloud and fuse with image features in 3D |
| Bird's-eye-view (BEV) fusion | BEVFormer (Li et al., 2022), BEVFusion (Liu et al., 2022; Liang et al., 2022) | Lift camera features into a top-down BEV frame and fuse with LiDAR BEV features |
| Cross-attention transformers | DETR3D, FUTR3D, TransFusion | Attend across modalities with learned queries |
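The sequential "painting" strategy in the table is easy to show in miniature. The sketch below assumes the LiDAR points have already been projected to integer pixel coordinates and that a segmentation network has produced a per-pixel score map. Points that fall outside the image receive zero scores here, which is a choice made for this sketch rather than a detail taken from the PointPainting paper. The function name is hypothetical.

```python
import numpy as np

def paint_points(points, pixels, seg_scores):
    """Append per-class segmentation scores to each LiDAR point.

    points:     (N, 3) x, y, z coordinates
    pixels:     (N, 2) integer (u, v) image coordinates of each point
    seg_scores: (H, W, C) per-pixel class scores from an image network
    Returns an (N, 3 + C) array of "painted" points.
    """
    points = np.asarray(points, dtype=float)
    pixels = np.asarray(pixels, dtype=int)
    seg_scores = np.asarray(seg_scores, dtype=float)
    h, w, c = seg_scores.shape
    u, v = pixels[:, 0], pixels[:, 1]
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    painted = np.zeros((points.shape[0], 3 + c))
    painted[:, :3] = points
    # Image arrays are indexed [row, column], i.e. [v, u].
    painted[inside, 3:] = seg_scores[v[inside], u[inside]]
    return painted

scores = np.random.default_rng(0).random((4, 6, 3))   # toy 4x6 image, 3 classes
pts = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(paint_points(pts, [[2, 1], [10, 10]], scores).shape)
```

The painted array can then be fed to any LiDAR-only detector whose input layer accepts the extra channels.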
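BEVFusion is a useful reference point. The MIT version (Liu et al., 2022) unifies camera and LiDAR features in a shared BEV space, and it reported state-of-the-art results on nuScenes 3D detection and BEV map segmentation while reducing computation. In the independently developed framework of Liang et al. (2022), the camera stream does not depend on LiDAR input, so the system still produces sensible outputs when the LiDAR drops out.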
The industrial picture is split. Tesla's production Autopilot and Full Self-Driving stacks moved away from radar in 2021 and operate from cameras only, an approach often called "Tesla Vision." Waymo, Cruise, Motional, Zoox, and Apollo (Baidu) all rely heavily on multi-sensor fusion that includes LiDAR.
Visual-inertial odometry (VIO) fuses one or more cameras with an IMU to estimate motion at high rate. VIO is the workhorse for quadrotor drones, legged robots like Boston Dynamics' Spot, and head-mounted devices including the Meta Quest, Microsoft HoloLens, and Apple Vision Pro. Open-source implementations like ORB-SLAM3, OpenVINS, and VINS-Fusion provide tightly coupled visual-inertial fusion, in several cases with loop closure.
A modern smartphone is a fusion device that happens to also place calls. iOS and Android combine accelerometer, gyroscope, magnetometer, GPS, Wi-Fi, BLE beacon, and barometer data to estimate location, count steps, classify activity, and orient AR overlays. Apple's Core Motion and Google's Activity Recognition APIs expose fused outputs rather than raw sensor data. Continuous glucose monitors, smartwatches, and fall-detection systems use similar fusion techniques over physiological sensors.
Fusion only works if you know which measurements correspond in time and space. Two practical concerns dominate.
Extrinsic calibration finds the rigid transformation between sensor frames. For a camera-LiDAR rig, this means determining the rotation and translation that map LiDAR points into the camera frame so they can be projected onto the image. Intrinsic calibration estimates the per-sensor parameters (camera focal length and distortion, IMU biases, LiDAR ranging offsets). Tools like Kalibr, AprilTag-based calibration, and the open ROS calibration packages are widely used.
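Once the extrinsics and intrinsics are known, projecting LiDAR points onto the image is a couple of matrix products. The sketch below assumes an ideal pinhole camera with no lens distortion and a rotation `R` and translation `t` that map LiDAR coordinates into the camera frame. The function name and the example matrix values are hypothetical.

```python
import numpy as np

def project_lidar_to_image(points_lidar, R, t, K):
    """Project LiDAR points to pixel coordinates.

    points_lidar: (N, 3) points in the LiDAR frame
    R, t:         extrinsics (3x3 rotation, 3-vector), LiDAR -> camera
    K:            3x3 pinhole intrinsic matrix (no distortion assumed)
    Returns (uv, in_front): pixel coordinates of the points in front of
    the camera, and a boolean mask over the N input points.
    """
    pts = np.asarray(points_lidar, dtype=float)
    R, t, K = (np.asarray(a, dtype=float) for a in (R, t, K))
    pts_cam = pts @ R.T + t                  # rigid transform into camera frame
    in_front = pts_cam[:, 2] > 0.0           # drop points behind the image plane
    uvw = pts_cam[in_front] @ K.T            # apply intrinsics
    uv = uvw[:, :2] / uvw[:, 2:3]            # perspective divide
    return uv, in_front

K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
uv, mask = project_lidar_to_image(
    [[0.0, 0.0, 10.0], [1.0, 0.0, 10.0], [0.0, 0.0, -5.0]],
    np.eye(3), np.zeros(3), K)
print(uv, mask)
```

A small error in `R` shifts distant points by many pixels, which is why extrinsic calibration quality directly limits camera-LiDAR fusion accuracy.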
Time synchronisation is the other big issue. A 100 ms misalignment between a camera and an IMU can wreck visual-inertial state estimation. Production rigs often hardware-trigger sensors from a common pulse and timestamp them with PTP (IEEE 1588) or GPS-disciplined clocks. Cheaper rigs use NTP and software timestamping with calibrated offsets. The Waymo Open Dataset paper specifically calls out that its LiDARs and cameras are well synchronised and calibrated, because that is a precondition for anyone training fusion models on the data (Sun et al., 2020).
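On the software side, a common step is resampling a high-rate stream at the timestamps of a slower one. The sketch below linearly interpolates IMU samples at camera timestamps using `numpy.interp`, with an optional calibrated time offset. It assumes both timestamp arrays are already expressed in the same clock apart from that offset; the function name is hypothetical.

```python
import numpy as np

def interpolate_imu(imu_times, imu_values, cam_times, time_offset=0.0):
    """Resample IMU readings at camera timestamps.

    imu_times:   (N,) increasing timestamps in seconds
    imu_values:  (N, D) readings, e.g. three gyro axes
    cam_times:   (M,) camera timestamps in seconds
    time_offset: calibrated camera-to-IMU clock offset in seconds
    Returns an (M, D) array of interpolated readings.
    """
    imu_times = np.asarray(imu_times, dtype=float)
    imu_values = np.asarray(imu_values, dtype=float)
    query = np.asarray(cam_times, dtype=float) + time_offset
    # np.interp handles one channel at a time and clamps queries that
    # fall outside the IMU time range to the first or last sample.
    columns = [np.interp(query, imu_times, imu_values[:, d])
               for d in range(imu_values.shape[1])]
    return np.column_stack(columns)

gyro = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
print(interpolate_imu([0.00, 0.01, 0.02], gyro, [0.005, 0.015]))
```

Interpolation only helps if the timestamps themselves are trustworthy, which is why hardware triggering and a shared clock come first.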
Real sensors fail. Cameras blow out in glare or go dark in shadow. LiDAR returns drop in heavy rain. Radar throws ghost detections off guardrails. IMUs saturate during impacts. A robust fusion system needs to track uncertainty, detect outliers, and degrade gracefully when a sensor drops out.
Classical Bayesian filters represent uncertainty as a covariance matrix and update it as data arrives. Modern deep-learning detectors increasingly include uncertainty heads (predictive variance, evidential deep learning) so that downstream fusion logic can weigh outputs accordingly. Adversarial robustness is an active research area: a small sticker on a stop sign can fool a vision-only detector, but the same attack often fails to fool a LiDAR detector, so multi-modal redundancy provides a basic defence.
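One standard way to use a covariance matrix for outlier detection is a chi-square gate on the Mahalanobis distance of the innovation. The sketch below assumes a two-dimensional measurement and a 95% gate; the threshold 5.991 is the 95% quantile of the chi-square distribution with two degrees of freedom. The function name is hypothetical.

```python
import numpy as np

CHI2_95_2DOF = 5.991  # 95% chi-square quantile for 2 degrees of freedom

def passes_gate(z, z_pred, S, threshold=CHI2_95_2DOF):
    """Check whether a measurement is statistically consistent.

    z:      actual measurement
    z_pred: measurement predicted by the filter
    S:      innovation covariance (predicted uncertainty plus sensor noise)
    Returns (accepted, squared Mahalanobis distance).
    """
    y = np.asarray(z, dtype=float) - np.asarray(z_pred, dtype=float)
    d2 = float(y @ np.linalg.solve(np.asarray(S, dtype=float), y))
    return d2 <= threshold, d2

print(passes_gate([1.0, 1.0], [0.0, 0.0], np.eye(2)))   # plausible
print(passes_gate([3.0, 3.0], [0.0, 0.0], np.eye(2)))   # likely an outlier
```

A rejected measurement is simply skipped for that update, so a sensor producing ghost detections or saturated readings degrades the estimate gradually instead of corrupting it.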
Some of the most-studied failures in the history of automated systems involve sensor fusion either being implemented badly or not being used at all.
The Boeing 737 MAX MCAS accidents (Lion Air 610 in 2018, Ethiopian 302 in 2019) are textbook examples of what happens when a safety-critical control law depends on a single sensor. The Maneuvering Characteristics Augmentation System could push the nose down based on input from only one of the aircraft's two angle-of-attack (AoA) vanes; when that vane fed it erroneously high readings, MCAS activated repeatedly. After the accidents, Boeing reworked MCAS to require both AoA sensors to agree within 5.5 degrees before activating, a decision-fusion fix to a problem that should never have been single-sensor (FAA, 2022).
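The general idea of cross-checking redundant sensors before acting can be sketched in a few lines. This is a generic illustration only and is not Boeing's implementation: the 5.5 degree default comes from the description above, while averaging the two readings, treating NaN as a failed sensor, and the function name are assumptions made for the example.

```python
import math

def cross_check(aoa_a_deg, aoa_b_deg, max_disagreement_deg=5.5):
    """Return a fused reading if two redundant sensors agree, else None.

    None signals that the automated function relying on this value
    should be inhibited rather than act on a single suspect input.
    """
    if math.isnan(aoa_a_deg) or math.isnan(aoa_b_deg):
        return None                      # a missing reading counts as a failure
    if abs(aoa_a_deg - aoa_b_deg) > max_disagreement_deg:
        return None                      # sensors disagree: trust neither
    return 0.5 * (aoa_a_deg + aoa_b_deg)

print(cross_check(4.0, 6.0))    # sensors agree
print(cross_check(4.0, 25.0))   # one vane is reading implausibly high
```

With only two sensors the check can detect a fault but cannot tell which sensor is wrong; identifying the faulty unit requires a third source or a model-based estimate.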
The USS Vincennes shootdown of Iran Air Flight 655 in July 1988 killed 290 people and is a case study in poor decision-level fusion under stress. The Aegis radar tracked Flight 655 correctly, the IFF system reported a Mode III civilian transponder, and the ascent profile was consistent with a commercial airliner. The crew misread altitude trends, suffered confirmation bias, and classified the contact as a hostile descending F-14 (US Navy, 1988). The technology produced the right data; the human-in-the-loop fusion failed.
The Uber ATG fatality in Tempe, Arizona in March 2018 was the first known pedestrian death caused by an automated vehicle. According to the NTSB, Uber's stack first detected the pedestrian (Elaine Herzberg) about six seconds before impact via radar, then via LiDAR. The classifier then cycled between "vehicle," "bicycle," and "unknown," resetting its motion-prediction history each time it changed its mind. The system did not predict that the pedestrian's path would intersect the vehicle until it was too late to brake (NTSB, 2019).
Several Tesla Autopilot crashes show what happens when fusion is weak or absent. The 2016 Williston, Florida crash involved a Model S striking a left-turning semi-trailer; the Autopilot's forward radar and camera both failed to identify the trailer broadside, and NTSB cited overreliance on automation as a contributing factor (NTSB, 2017). Subsequent crashes involving stationary emergency vehicles have prompted ongoing NHTSA investigation.
The ecosystem around sensor fusion is mature. ROS 2 ships tf2 for spatio-temporal transforms and robot_localization for an EKF/UKF that fuses arbitrary numbers of nav_msgs/Odometry, sensor_msgs/Imu, and pose messages into a 15-dimensional state estimate (Moore and Stouch, 2014). GTSAM, written by Frank Dellaert and collaborators at Georgia Tech, is the dominant factor-graph SLAM and smoothing library. Google's Ceres Solver is the workhorse nonlinear least-squares back end for many bundle-adjustment and SLAM systems. OpenVINS, ORB-SLAM3, and VINS-Fusion are widely used open-source visual-inertial odometry packages. Autoware and Apollo (Baidu) are full open-source autonomous-driving stacks. CARLA and NVIDIA Drive Sim synthesise multi-sensor data for training and testing. The KITTI, nuScenes, Waymo Open Dataset, Argoverse, and A2D2 benchmarks provide synchronised multi-modal sensor data for the research community.
| Resource | Type | Maintainer | Typical use |
|---|---|---|---|
| KITTI | Dataset | Karlsruhe Institute of Technology, Toyota Technological Institute | Stereo, optical flow, 3D detection benchmarks (Geiger et al., 2012) |
| nuScenes | Dataset | Motional / nuTonomy | 1000 scenes, full sensor suite (Caesar et al., 2020) |
| Waymo Open Dataset | Dataset | Waymo | 1150 scenes, 5 LiDARs + 5 cameras (Sun et al., 2020) |
| ROS 2 robot_localization | EKF/UKF library | Open Source Robotics Foundation | Wheeled and legged robot pose fusion |
| GTSAM | Factor graph library | Georgia Tech | SLAM, VIO, bundle adjustment |
| Apollo | Driving stack | Baidu | Production-grade self-driving pipeline |
| Autoware | Driving stack | Autoware Foundation | Open-source self-driving |
A few problems keep showing up in research. Long-tail edge cases (a couch falling off a truck, a child in a Halloween costume crossing at night) remain hard because no individual sensor handles all of them and the fusion model must decide from sparse evidence. Adversarial robustness in vision is still an active research area, and multi-sensor fusion is a practical defence. Cost, power, and packaging tradeoffs are real: spinning LiDARs are heavy and expensive; solid-state LiDARs reduce cost but limit field of view; high-resolution radars are improving but still sparse. Learning-based fusion gives strong empirical performance but weaker formal guarantees than classical Kalman methods, which complicates safety certification. Models trained in CARLA or NVIDIA Drive Sim do not always transfer to real sensor data. ISO 21448 (SOTIF) explicitly addresses how to verify perception and fusion systems whose failure modes are statistical rather than deterministic.
Modern stacks combine learned components (BEV fusion transformers, deep detectors with uncertainty heads) with classical back ends (factor graphs, EKF/UKF for pose) because the two sets of tools are good at different things. The interesting frontier is end-to-end learned fusion that still respects the physics, calibration, and safety constraints that classical methods made explicit.