Augmented reality (AR) is a class of display and interaction technologies that overlay computer-generated content onto a user's perception of the physical world. Unlike virtual reality, which replaces the surrounding environment with a synthetic one, AR keeps the real world visible and adds graphics, text, audio, or 3D objects that appear to coexist with it. The user can move around, look at the augmented scene from different angles, and in many systems interact with the virtual content using hands, gaze, or voice.
Modern AR depends heavily on artificial intelligence. Cameras stream raw pixels and inertial sensors stream motion data, and machine learning models turn that input into a structured understanding of the room: where the floor is, which surface is a wall, where the user's hands are, what objects sit on the table, how the lighting falls. None of this works without computer vision, and most of it is now driven by neural networks running on the device. AR is therefore one of the most visible consumer applications of on-device AI.
The field has gone through several waves. It started as a research curiosity in the 1960s, became an industrial tool in the 1990s, broke into mainstream attention with smartphone games in the 2010s, and entered a hardware boom with standalone headsets like the Meta Quest 3 (2023) and the Apple Vision Pro (2024). Glasses-style devices from Meta, Snap, Xreal, and others are now the focus of most consumer R&D.
The most widely cited definition comes from Ronald Azuma's 1997 survey in Presence: Teleoperators and Virtual Environments. Azuma defined an AR system as one that has three properties: it combines real and virtual content, it is interactive in real time, and it is registered in three dimensions. Registration means that virtual objects appear to be at specific physical locations and stay there as the user moves; a virtual lamp that floats off the table whenever the camera turns is, by Azuma's standard, not really augmented reality.
This definition deliberately excludes a lot of things people sometimes call AR, including 2D heads-up displays, simple video overlays, and pre-rendered visual effects. It also leaves room for displays that are not handheld screens: optical see-through glasses, head-up displays in car windshields, and projectors that paint imagery onto real surfaces (the "spatial AR" tradition described in Bimber and Raskar's 2005 textbook).
Mixed reality (MR) is sometimes used as a broader term for any system that blends real and virtual content, with AR as one point on a spectrum. Paul Milgram's 1994 reality-virtuality continuum places the fully real environment at one end and the fully virtual environment at the other, with augmented reality and augmented virtuality in between. Extended reality (XR) is an umbrella term covering all of these, often used by industry to avoid having to pick one.
| Term | What the user sees | Typical hardware |
|---|---|---|
| Augmented reality (AR) | Real world plus overlaid digital content | Phone, tablet, optical glasses |
| Virtual reality (VR) | Fully synthetic environment, real world occluded | Opaque headset like Quest 2 |
| Mixed reality (MR) | Real and virtual blended with mutual interaction | Passthrough headset like Quest 3, HoloLens 2 |
| Extended reality (XR) | Umbrella term covering AR, VR, and MR | Any of the above |
In practice the boundaries are fuzzy. The Quest 3 is marketed as a mixed reality device because it can show a color video feed of the room with virtual content composited in, but the same hardware runs fully immersive VR games. The Vision Pro can switch between a passthrough AR view and a full virtual environment by turning its Digital Crown.
Ivan Sutherland built the first head-mounted display at the University of Utah in 1968, with help from his student Bob Sproull. The headset was so heavy it had to be suspended from the ceiling; the rig earned the nickname "Sword of Damocles." It used see-through optics and head tracking to draw simple wireframe shapes that appeared to sit in the room, which makes it the first AR display by Azuma's definition, even though no one called it that yet.
Through the 1970s and 1980s the field stayed academic. Researchers at NASA, the U.S. Air Force, and university labs worked on flight simulators, head-up displays for pilots, and see-through helmets. The hardware was bulky and the tracking was unreliable, but the basic ideas of registered overlays and head-coupled rendering were established.
The phrase "augmented reality" was coined in 1992 by Thomas Caudell and David Mizell, two engineers at Boeing. They were trying to help workers assemble the wire bundles for the Boeing 777. The standard process involved threading wires along pegs on a 20- to 30-foot board following printed schematics, which was slow and error-prone. Caudell and Mizell prototyped a head-mounted see-through display that drew the correct wiring path directly on top of the board the worker was looking at. They published the system at the Hawaii International Conference on System Sciences in 1992 and used the new term in the title.
The rest of the 1990s produced a string of academic projects (KARMA at Columbia, Studierstube in Vienna, and the Touring Machine at Columbia, which was the first wearable outdoor AR system). In 1999 Hirokazu Kato of the Nara Institute of Science and Technology released ARToolKit, an open-source library that tracked the camera pose from printed black-and-white square markers. ARToolKit lowered the barrier enough that artists, students, and small studios could build AR demos, and it remained the dominant marker-based toolkit for almost a decade.
When Android and iPhone phones got rear cameras, GPS, compasses, and accelerometers, AR became something you could ship to the public. The first wave of AR browsers appeared in 2008 and 2009: Wikitude on Android in late 2008 and Layar in mid-2009. These overlaid points of interest (restaurants, landmarks, Wikipedia entries) on the camera view based on GPS and compass heading. The registration was crude by modern standards, but it was the first time millions of consumers used AR.
Google Glass arrived in 2013 as a $1,500 explorer-program device. It was less an AR system in Azuma's strict sense (the small prism display did not register content in 3D) and more a hands-free heads-up display, but it triggered a public conversation about wearable AR and surveillance that the industry is still having.
Microsoft announced HoloLens in 2015 and shipped the Development Edition in March 2016 at $3,000. HoloLens was the first self-contained holographic computer with optical see-through displays, on-device tracking, and gesture input. The original device had a narrow 30 to 34 degree field of view and was clearly a developer kit, but it set the template for standalone optical AR headsets.
The consumer breakthrough was Pokemon GO, released by Niantic on July 6, 2016. The game used GPS, compass, and a simple camera overlay to make Pokemon appear in the player's real surroundings. It hit 500 million downloads and $600 million in revenue within 90 days, faster than any mobile game before it. Whether the game was "really" AR (the camera mode was optional and not strictly registered in 3D) was debated, but for most people Pokemon GO was their first AR experience.
Snapchat had launched its first face Lenses in 2015, after acquiring the Ukrainian computer vision startup Looksery for around $150 million. Lenses tracked the user's face in real time and pasted on a dog nose, a flower crown, or a face swap. By 2017, AR filters were a default feature in Snapchat, Instagram, and Facebook.
Apple released ARKit at WWDC in June 2017 and shipped it with iOS 11 in September. ARKit gave every iPhone with an A9 chip or newer the ability to do visual-inertial tracking, plane detection, and basic light estimation without any hardware change. Google's competing ARCore came out of beta in February 2018 for high-end Android phones, building on the earlier Project Tango. IKEA Place, launched on iOS 11 release day in 2017, became the showcase ARKit app and let users place 2,000 IKEA products in their rooms with about 98 percent scale accuracy.
Magic Leap, a heavily funded startup that had raised over $2.6 billion before shipping anything, released the Magic Leap One in August 2018 at $2,295. It used a waveguide-based optical display and a separate "Lightpack" puck. Reception was mixed. The device confirmed that the technology could work but did not justify the hype. Magic Leap pivoted to enterprise and shipped the much improved Magic Leap 2 in September 2022.
Google Maps Live View launched in August 2019, overlaying walking directions as floating arrows on the camera feed. It was one of the first AR experiences shipped to a billion-user app and remains a standard utility-AR example.
Meta (then Facebook) acquired Oculus in 2014 and gradually shifted Quest from VR-only toward mixed reality. The Meta Quest 3, released October 10, 2023 at $499, was the first mainstream consumer headset with full color passthrough. Two RGB cameras on the front visor let the headset show a real-time video feed of the user's surroundings with virtual content composited in.
Apple released the Vision Pro in the United States on February 2, 2024 at $3,499. The device runs on a dual-chip design (the M2 for general compute, the R1 for sensor processing) and uses two micro-OLED displays totaling 23 million pixels. The R1 chip ingests data from 12 cameras, 5 sensors, and 6 microphones at 256 GB/s, which Apple says brings passthrough latency below 12 ms. Vision Pro is controlled with a combination of eye tracking, hand pinches, and voice.
Meta's Ray-Ban smart glasses, codeveloped with EssilorLuxottica, launched their second generation in September 2023. The glasses do not have a display but include cameras, microphones, and speakers, and a 2024 update added multimodal Meta AI that can describe what the wearer is looking at. Snap shipped the fifth generation of Spectacles in September 2024 as a $99-per-month developer kit, with a 46 degree FOV, dual Snapdragon processors, and the new Snap OS. Xreal sells consumer birdbath-optic glasses (the Xreal Air series and Xreal One) that act as wearable displays for phones, laptops, and game consoles.
A usable AR experience has to do four things in the right order, every frame, fast enough to feel real. The pipeline is roughly: figure out where the device is, figure out what is around it, draw something convincing on top of that scene, and let the user interact.
Before anything else the system has to know its own pose: where the camera is in 3D space and which way it points. Modern AR uses visual-inertial odometry, which fuses camera images with the inertial measurement unit (the accelerometer and gyroscope) using a Kalman filter or a related estimator. The vision side detects feature points in successive frames and triangulates their motion to recover camera translation and rotation. The IMU side fills in the gaps when the camera is occluded or when the device is moving too fast for stable feature matches. Together they produce a six-degrees-of-freedom pose at 60 to 120 Hz with centimeter-level accuracy over room-scale motion.
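As a rough sketch of the fusion idea (not any vendor's implementation), the example below blends a high-rate position predicted by integrating IMU acceleration with a lower-rate position fix from the visual tracker, using a simple complementary filter. Shipping systems estimate the full 6-DoF pose with an extended Kalman filter or a sliding-window optimizer, but the predict-then-correct structure is the same; all values here are illustrative.

```python
import numpy as np

# Minimal sketch of visual-inertial fusion for position only. Assumes
# gravity-compensated acceleration already expressed in the world frame
# and a visual tracker that delivers an absolute position at a lower rate.

class ComplementaryVIO:
    def __init__(self, blend=0.05):
        self.pos = np.zeros(3)   # meters, world frame
        self.vel = np.zeros(3)   # m/s
        self.blend = blend       # how strongly vision corrects IMU drift

    def predict(self, accel_world, dt):
        """High-rate IMU step (hundreds of Hz): integrate acceleration."""
        self.vel += accel_world * dt
        self.pos += self.vel * dt

    def correct(self, visual_pos):
        """Lower-rate vision step (30-60 Hz): pull toward the visual fix."""
        error = visual_pos - self.pos
        self.pos += self.blend * error
        self.vel += self.blend * error   # damp long-term drift

vio = ComplementaryVIO()
for _ in range(50):                                  # 0.1 s of IMU samples at 500 Hz
    vio.predict(np.array([0.0, 0.2, 0.0]), dt=0.002)
vio.correct(np.array([0.0, 0.001, 0.0]))             # one camera fix arrives
print(vio.pos)
```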
This is essentially SLAM: simultaneous localization and mapping. The system builds a sparse 3D map of the environment as it explores, and uses that map to relocalize when the user comes back to a previously seen area. Closed-source implementations inside ARKit and ARCore are derived from research lines like ORB-SLAM, MSCKF, and VINS-Mono. For room-scale AR this works well; for city-scale persistent AR, devices increasingly query a cloud-side visual positioning system (such as Niantic's Lightship VPS or Google's Geospatial API) that matches the device's view against a precomputed point cloud of the location.
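A hedged sketch of the relocalization step: match the current frame's binary feature descriptors against those stored in the sparse map, here by brute-force Hamming distance with a ratio test. The descriptor sizes, thresholds, and synthetic data are illustrative; a real system feeds the surviving 2D-3D matches into a PnP solver to recover the camera pose.

```python
import numpy as np

# Toy relocalization: match 256-bit (32-byte) binary descriptors, as used
# by ORB-style features, against the map by Hamming distance.

def hamming(a, b):
    """Hamming distance between two packed uint8 descriptors."""
    return np.unpackbits(np.bitwise_xor(a, b)).sum()

def match_to_map(frame_desc, map_desc, ratio=0.75):
    matches = []
    for i, d in enumerate(frame_desc):
        dists = np.array([hamming(d, m) for m in map_desc])
        best, second = np.argsort(dists)[:2]
        if dists[best] < ratio * dists[second]:      # Lowe's ratio test
            matches.append((i, best))                # frame feature i <-> map point best
    return matches

rng = np.random.default_rng(0)
map_desc = rng.integers(0, 256, size=(500, 32), dtype=np.uint8)
frame_desc = map_desc[:20].copy()                    # pretend 20 known points are visible
print(len(match_to_map(frame_desc, map_desc)))       # ~20 confident matches
```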
Once the device knows where it is, it has to understand what it is looking at. This is where most of the AI work happens.
| Task | Common ML approach | What AR uses it for |
|---|---|---|
| Plane detection | Region growing on point clouds, learned plane segmentation | Placing virtual objects on tables and floors |
| Depth estimation | Stereo, time-of-flight LiDAR, monocular depth networks (MiDaS, Depth Anything) | Occlusion, distance to objects |
| Semantic segmentation | Encoder-decoder CNNs, transformer segmenters | Labeling sky, walls, people, furniture |
| 3D mesh reconstruction | TSDF fusion, learned implicit surfaces | Letting virtual objects collide with real geometry |
| Object detection | YOLO, DETR, MobileNet variants | Recognizing products, signs, landmarks |
| Face tracking | ARKit face mesh, MediaPipe Face Mesh | Filters, persona avatars, lip sync |
| Hand tracking | MediaPipe Hands (21 keypoints), Quest hand mesh | Controller-free input |
| Eye tracking | IR cameras with neural gaze estimators | Gaze-based selection on Vision Pro and Quest Pro |
| Body pose estimation | OpenPose, MoveNet, BlazePose | Full-body avatars, fitness coaching |
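To make one row of the table concrete, here is a sketch of the classical, non-learned route to plane detection: a RANSAC plane fit over the sparse point cloud the tracker already maintains. Thresholds, point counts, and the synthetic data are illustrative; production systems add region growing, temporal merging, and boundary estimation on top.

```python
import numpy as np

# Toy RANSAC plane fit over a sparse 3D point cloud: repeatedly sample
# three points, fit a plane, and keep the plane with the most inliers.

def fit_plane_ransac(points, iters=200, inlier_dist=0.02, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    best_inliers, best_plane = None, None
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                       # degenerate (collinear) sample
            continue
        normal /= norm
        dist = np.abs((points - p0) @ normal)  # point-to-plane distance
        inliers = dist < inlier_dist
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (normal, p0)
    return best_plane, best_inliers

# Synthetic cloud: a horizontal "table" at y = 0.7 m plus random clutter.
rng = np.random.default_rng(1)
table = np.column_stack([rng.uniform(-0.5, 0.5, 300),
                         np.full(300, 0.7) + rng.normal(0, 0.005, 300),
                         rng.uniform(-0.5, 0.5, 300)])
clutter = rng.uniform(-1, 1, size=(100, 3))
plane, inliers = fit_plane_ransac(np.vstack([table, clutter]), rng=rng)
print(plane[0], inliers.sum())                # normal ~ (0, +/-1, 0), ~300 inliers
```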
Most of these models are trained off-device on large datasets and then quantized and compiled to run in real time on a mobile NPU. Apple's Neural Engine, Qualcomm's Hexagon, and Google's Tensor blocks all exist partly to power this kind of always-on perception. The push toward edge AI is much stronger in AR than in cloud-first applications, because round-trip latency to a server would break the illusion that virtual content is glued to the world.
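As an illustration of the quantize-and-compile step, the snippet below uses TensorFlow Lite, one of several possible toolchains, to apply post-training quantization to a hypothetical segmentation model. The model path, input shape, and calibration loop are placeholders; real pipelines calibrate on a few hundred representative camera frames.

```python
import tensorflow as tf

# Post-training quantization sketch: shrink a perception model so it can
# run on a mobile NPU or DSP. "segmentation_model/" is a placeholder path.

def representative_data():
    # Stand-in calibration data; real pipelines use captured camera frames.
    for _ in range(100):
        yield [tf.random.uniform([1, 256, 256, 3])]

converter = tf.lite.TFLiteConverter.from_saved_model("segmentation_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
tflite_model = converter.convert()

with open("segmentation_int8.tflite", "wb") as f:
    f.write(tflite_model)
```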
With a known pose and a model of the scene, the renderer can draw virtual content. This sounds like a graphics problem, and it is, but to look convincing it has to respect the geometry and lighting of the real world. Shadows have to fall in the right direction, virtual surfaces have to match the ambient color of the room, and real objects have to occlude virtual ones when they are closer to the camera. Modern AR runtimes do real-time light estimation from the camera image, often producing an environment map that is fed back into the rendering shader.
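A deliberately crude version of light estimation, to show the idea rather than any runtime's actual algorithm: average the camera frame to get an ambient intensity and color cast, then tint the virtual object's base color with it. ARKit and ARCore go considerably further, estimating directional light and full environment maps.

```python
import numpy as np

# Crude ambient light estimation from the camera frame. The frame and
# albedo values below are illustrative.

def estimate_ambient(frame_rgb):
    """frame_rgb: HxWx3 float array in [0, 1]."""
    mean_color = frame_rgb.reshape(-1, 3).mean(axis=0)
    intensity = mean_color.mean()                       # rough scene brightness
    color_correction = mean_color / (intensity + 1e-6)  # color cast of the room
    return intensity, color_correction

def shade_virtual(albedo, intensity, color_correction):
    """Apply the estimated ambient term to a virtual object's base color."""
    return np.clip(albedo * intensity * color_correction, 0.0, 1.0)

frame = np.ones((480, 640, 3)) * np.array([0.8, 0.6, 0.4])   # warm, fairly bright room
intensity, cc = estimate_ambient(frame)
print(shade_virtual(np.array([1.0, 1.0, 1.0]), intensity, cc))  # white object picks up the tint
```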
Neural rendering has become important here. Neural radiance fields, introduced by Ben Mildenhall and colleagues at ECCV 2020, represent a 3D scene as a small neural network that maps a position and viewing direction to a color and density. NeRF can synthesize photorealistic novel views of a scene from a few dozen photos and is now used to capture environments for AR placement and to render persistent virtual locations. The follow-up technique 3D Gaussian Splatting, published by Kerbl and colleagues at SIGGRAPH 2023, represents the scene as millions of explicit 3D Gaussians and renders at over 30 frames per second at 1080p, which is fast enough for real-time AR. Both methods have moved out of research and into commercial AR pipelines (see neural radiance field).
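The accumulation rule NeRF applies along each camera ray is compact enough to write out directly. The sketch below composites hand-picked per-sample densities and colors front to back; in the real method those values come from the neural network, and 3D Gaussian Splatting replaces the ray samples with sorted, projected Gaussians while keeping the same alpha compositing.

```python
import numpy as np

# NeRF-style volume rendering along one ray: each sample has a density
# sigma and a color c; samples are composited front to back with
# transmittance T. Sample values here are made up for illustration.

def composite_ray(sigmas, colors, deltas):
    """sigmas: (N,), colors: (N, 3), deltas: (N,) spacing between samples."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # opacity per segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]   # T_i before each sample
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)                   # final pixel color

sigmas = np.array([0.0, 0.1, 5.0, 5.0])                    # empty space, then a surface
colors = np.array([[0.0, 0.0, 0.0], [0.1, 0.1, 0.1],
                   [0.9, 0.2, 0.2], [0.9, 0.2, 0.2]])
deltas = np.full(4, 0.25)
print(composite_ray(sigmas, colors, deltas))               # mostly the red surface color
```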
Generative AI now also creates the assets themselves. DreamFusion (Google, 2022) introduced score distillation sampling, which optimizes a NeRF using a 2D text-to-image diffusion model as a prior. Newer methods such as GET3D, Magic3D, and Shap-E produce textured 3D meshes in minutes or even seconds rather than hours, which means AR users may eventually be able to ask for an object and get one without finding a 3D artist.
AR systems have settled on a small set of input modalities. Smartphones use touch on the screen and rear-camera point-and-tap. Optical headsets like HoloLens 2 and Vision Pro use hand tracking and gaze, sometimes with voice. Quest 3 supports both hand tracking and physical controllers. Multimodal language models are now showing up as conversational layers on top of these inputs: Apple Intelligence runs on Vision Pro, and Meta AI runs on Ray-Ban smart glasses, both letting the user describe what they want rather than navigate menus. The combination of an LLM and live camera context (a multimodal model acting on the wearer's view) is what people usually mean by an "AI assistant in glasses."
The AR development landscape is dominated by a handful of platforms tied to specific device ecosystems.
| Platform | Vendor | Target devices | First release |
|---|---|---|---|
| ARKit and RealityKit | Apple | iPhone, iPad, Vision Pro | 2017 |
| ARCore and Geospatial API | Google | Android, ARCore-supported devices | 2018 (out of beta) |
| Mixed Reality Toolkit (MRTK) | Microsoft | HoloLens 2, Windows Mixed Reality | 2018 |
| Presence Platform | Meta | Quest 2, Quest 3, Quest Pro | 2021 |
| Lightship ARDK and VPS | Niantic | iOS, Android | 2021 |
| Snap Lens Studio and Snap OS | Snap | Snapchat app, Spectacles | 2017 (Lens Studio) |
| AR Foundation | Unity | Cross-platform wrapper over ARKit, ARCore, others | 2018 |
| Unreal AR Framework | Epic Games | Cross-platform | 2018 |
Unity AR Foundation is by far the most common cross-platform choice, because it abstracts ARKit, ARCore, and HoloLens behind one API. Unreal Engine's AR support is similar in scope and is heavily used in film production for live composited virtual content.
| Device | Vendor | Year | Type | Notes |
|---|---|---|---|---|
| Sword of Damocles | Sutherland (Utah) | 1968 | Tethered see-through HMD | First head-mounted display |
| Google Glass | Google | 2013 | Monocular display glasses | Explorer Edition $1,500 |
| HoloLens 1 | Microsoft | 2016 | Standalone optical see-through | $3,000 dev kit, 30 deg FOV |
| HoloLens 2 | Microsoft | 2019 | Standalone optical see-through | 52 deg FOV, hand and eye tracking |
| Magic Leap One | Magic Leap | 2018 | Tethered waveguide HMD | $2,295, consumer launch |
| Magic Leap 2 | Magic Leap | 2022 | Standalone, enterprise focus | 70 deg FOV, dynamic dimming |
| Quest 3 | Meta | 2023 | Color-passthrough headset | $499, Snapdragon XR2 Gen 2 |
| Vision Pro | Apple | 2024 | Color-passthrough headset | $3,499, M2 plus R1, micro-OLED |
| Ray-Ban Meta (Gen 2) | Meta and EssilorLuxottica | 2023 | Camera and audio glasses, no display | Multimodal Meta AI added 2024 |
| Spectacles 5 | Snap | 2024 | Standalone optical see-through | Developer kit, 46 deg FOV |
| Xreal Air / One | Xreal | 2022 onward | Tethered birdbath display glasses | Consumer-priced, no on-board compute |
AR has found product-market fit unevenly. Some categories (face filters, navigation) have hundreds of millions of daily users. Others (industrial training, medical visualization) have smaller audiences but make a clear business case. The applications below are representative rather than exhaustive.
Pokemon GO is still the canonical example of AR gaming. Niantic followed up with Ingress Prime, Harry Potter: Wizards Unite, and Pikmin Bloom, with mixed commercial results. Snap, Instagram, and TikTok host a constant stream of AR mini-games as effects. Nintendo and Sony experimented with AR on the 3DS and PS Vita. On headsets, Quest 3 launched with mixed reality titles like First Encounters, in which alien creatures break through the user's actual walls, and the tabletop game Demeo.
IKEA Place (2017), Wayfair's View in Room, Amazon's AR View, and Houzz let shoppers preview furniture in their homes. IKEA reported that Place drove a meaningful increase in online sales, although exact figures vary. Sephora Virtual Artist and L'Oreal's ModiFace use face tracking to let customers try on lipstick and eye shadow. Warby Parker and most major eyewear retailers offer virtual glasses try-on. Nike, Gucci, and Adidas have all shipped AR sneaker try-on through Snapchat lenses or their own apps.
Industrial assembly and maintenance is where AR has been quietly successful for years. Boeing returned to its 1992 problem and now uses HoloLens-based wire harness visualization that the company says cuts assembly time significantly. Ford, Airbus, BMW, and Lockheed Martin all use AR for assembly guidance and quality inspection. PTC's Vuforia, Microsoft Dynamics 365 Guides, and TeamViewer Frontline are common enterprise platforms. Remote-expert tools, in which an offsite engineer sees what a field technician sees and draws annotations on their view, became standard during the COVID-19 period.
AccuVein projects a near-infrared map of the patient's veins onto their skin. Several systems including Augmedics' xvision and Brainlab's Mixed Reality Viewer overlay CT and MRI data on the surgical field. Stryker, Medtronic, and others have FDA-cleared AR navigation systems for orthopedic and spine procedures. Medical schools use HoloLens and the Microsoft-Case Western HoloAnatomy app to teach anatomy without cadavers.
Google Maps Live View shows floating arrows on the sidewalk for walking directions. Apple Maps added a similar feature in iOS 15. Inside large indoor venues like airports and stadiums, AR wayfinding apps from companies like Pointr handle the GPS-poor environment with visual localization.
AR textbooks from publishers like Pearson and Houghton Mifflin let students point a phone at a page to see 3D models pop out. Apps like Froggipedia, Complete Anatomy, and Jigspace teach biology and engineering concepts with manipulable 3D content. Google's Expeditions and Merge Cube reached classrooms before the COVID-19 disruption.
Snapchat lenses, Instagram filters, and TikTok effects are a daily AR experience for hundreds of millions of users. Apple's Memoji and Animoji animate a face mesh in real time. FaceTime on Vision Pro creates a "Persona" avatar by scanning the wearer's face once, then driving it with eye and hand tracking during a call. Spatial video, captured on iPhone 15 Pro and viewed on Vision Pro, is the first mass-market 3D video format since 3D TVs failed.
The field has a stable list of unsolved problems that come up at every conference.
Latency is the constant enemy. Any noticeable delay between head motion and image update breaks immersion and can cause discomfort. Apple's R1 chip exists specifically to keep Vision Pro passthrough below 12 ms.
Field of view in optical see-through glasses is still narrow. HoloLens 2 reached 52 degrees and Magic Leap 2 reached 70 degrees, both well below human peripheral vision. Birdbath glasses like the Xreal Air get to roughly 46 degrees but at the cost of being more like wearable monitors than world-registered AR. The waveguide and metalens research aimed at fixing this has been promising for years and slow to ship.
Battery life and thermals limit how much compute can fit on the head. Vision Pro uses an external battery pack and still gets about two hours of use. Snap Spectacles 5 manage about 45 minutes of standalone runtime.
Occlusion accuracy is hard. If the AR system places a virtual cup on the table and the user's hand passes between the cup and the camera, the hand should occlude the cup. Doing this at 60 Hz requires fast, accurate depth sensing, which is why Vision Pro and Quest 3 both ship with depth sensors and high-resolution color cameras.
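A minimal sketch of the per-pixel depth test involved, with illustrative arrays: composite the rendered virtual layer over the camera frame only where the virtual fragment is closer to the camera than the sensed real-world depth. Real pipelines often also soften the mask at depth edges, since sensor depth is noisiest exactly where hands and objects overlap.

```python
import numpy as np

# Per-pixel occlusion test: the virtual cup is drawn only where it is
# closer than the real scene. Depth is in meters, color in [0, 1].

def composite_with_occlusion(camera_rgb, real_depth, virtual_rgb,
                             virtual_alpha, virtual_depth):
    visible = (virtual_alpha > 0) & (virtual_depth < real_depth)
    a = np.where(visible, virtual_alpha, 0.0)[..., None]
    return a * virtual_rgb + (1.0 - a) * camera_rgb

h, w = 480, 640
camera = np.full((h, w, 3), 0.5)                  # grey room
real_depth = np.full((h, w), 1.2)                 # table ~1.2 m away
real_depth[:, :320] = 0.4                         # a hand 0.4 m away covers the left half
virtual = np.zeros((h, w, 3)); virtual[..., 0] = 1.0   # red virtual cup
v_alpha = np.full((h, w), 1.0)
v_depth = np.full((h, w), 0.8)                    # cup sits 0.8 m away
out = composite_with_occlusion(camera, real_depth, virtual, v_alpha, v_depth)
print(out[0, 0], out[0, 639])                     # hidden behind the hand vs. drawn over the table
```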
Lighting and shading mismatches are still common. Algorithms estimate ambient light from the camera, but matching specular highlights and consistent shadows on virtual objects is an active research area.
Privacy is the social challenge. Always-on cameras worn at face height capture bystanders who never opted in, often with footage flowing to cloud services, which has driven backlash against Google Glass, Ray-Ban Meta, and Spectacles. The recording-light convention is widely used but easy to defeat, and regulators in two-party-consent jurisdictions are still working out whether always-on glasses count as recording.
Motion sickness, social acceptance, and the cost of high-end devices remain practical adoption barriers.
As of 2026, the consumer market splits into three rough segments. Smartphone AR through ARKit and ARCore is the mainstream, and almost any iPhone or recent Android phone can run a useful AR app. Camera-and-audio glasses like Ray-Ban Meta have crossed into mass-market style accessories without trying to be displays. Standalone passthrough headsets like Quest 3 and Vision Pro carry the more ambitious AR pitch but remain expensive and bulky.
The research and product trajectory points clearly toward thin, all-day, displayful smart glasses with on-device multimodal AI. Meta showed its Orion prototype glasses in late 2024 as an internal demonstration, Apple is rumored to be working on a lighter Vision device, Google announced its Android XR platform with Samsung in 2024, and Snap plans a consumer Specs launch in 2026. Whether the hardware can meet the price, weight, and battery targets that consumers will accept is still open.