See also: Small language model, Quantization, GPU
Edge AI refers to the practice of running artificial intelligence models directly on edge devices, such as smartphones, laptops, IoT sensors, cameras, vehicles, and industrial equipment, rather than sending data to centralized cloud servers for processing. By performing inference locally, edge AI eliminates the network round trip to a data center, enabling real-time decision-making, preserving data privacy, reducing bandwidth costs, and functioning even without internet connectivity.
The concept builds on the broader trend of edge computing, which moves computation closer to the source of data. What makes edge AI distinct is the deployment of machine learning and deep learning models, particularly neural networks, on hardware that was traditionally too constrained to run such workloads. Advances in model compression, specialized processors, and inference frameworks have changed this equation. By 2026, virtually every flagship chip from Apple, Qualcomm, Intel, AMD, MediaTek, and Samsung includes a dedicated Neural Processing Unit (NPU) designed specifically for on-device AI inference [1].
The edge AI market has seen rapid growth. Estimates for 2025 place the global market between $25 billion and $36 billion depending on the research firm, with projections reaching $100 billion to $143 billion by the early 2030s at compound annual growth rates of 20-30% [2][3]. This growth is driven by the explosion of connected devices (over 18.8 billion IoT devices as of 2025, projected to reach 40 billion by 2030), the increasing capability of on-device hardware, and the growing demand for AI that works without cloud dependency [4].
For many AI applications, the time it takes to send data to a cloud server and receive a response is unacceptable. A self-driving car processing camera feeds cannot wait 100-200 milliseconds for a cloud round trip before deciding whether to brake. A factory quality inspection system needs to classify products on a conveyor belt in real time. A voice assistant that pauses for a second after every command feels sluggish. Edge AI solves these latency problems by running inference locally, often achieving response times in the single-digit millisecond range. Nordic Semiconductor's Axon NPU, for example, runs inference workloads in approximately 6.5 milliseconds [5].
Edge AI keeps sensitive data on the device where it was generated. Medical wearables can analyze heart rhythms locally without uploading patient health data to the cloud. Smart home cameras can detect intruders on-device without streaming video to external servers. Smartphones can summarize messages and emails without exposing their contents to a third-party API. For organizations subject to GDPR, HIPAA, or other data protection regulations, on-device processing simplifies compliance by ensuring data never leaves the user's control.
Cloud inference at scale is expensive. Every API call to a cloud-hosted model incurs compute charges, and for applications with millions of users making frequent requests, those costs add up quickly. Edge AI shifts the compute cost to the device itself, which the user (or device manufacturer) has already paid for. Once the model is deployed on-device, inference is essentially free from an ongoing operational perspective. This is particularly advantageous for consumer applications where per-user cloud costs would erode margins.
Sending raw sensor data (video, audio, high-frequency telemetry) to the cloud consumes significant bandwidth. An autonomous vehicle generates terabytes of sensor data per day. An industrial facility with hundreds of cameras produces continuous video streams. Edge AI processes this data locally, transmitting only the results (alerts, summaries, classifications) rather than the raw inputs. This dramatically reduces bandwidth requirements and network infrastructure costs.
Edge AI works without an internet connection. This is essential for military operations in contested environments, rural healthcare facilities with unreliable connectivity, field service technicians working in basements or remote sites, and aircraft in flight. Any application that must function regardless of network availability benefits from edge deployment.
Edge AI has driven the development of specialized processor architectures designed to accelerate neural network inference within tight power and thermal constraints.
An NPU (also called an AI accelerator, neural engine, or AI processing unit) is a specialized hardware block designed to accelerate the matrix multiplication and convolution operations that dominate neural network inference. NPUs use low-precision arithmetic (INT8 or INT4) and highly parallelized architectures to deliver high throughput at low power consumption. Performance is typically measured in TOPS (trillion operations per second).
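The arithmetic an NPU accelerates can be sketched in a few lines. The pure-Python example below is illustrative only (real NPUs do this in fixed-function hardware, usually with per-channel scales): it quantizes two FP32 vectors to INT8, computes their dot product in integer arithmetic with a wide accumulator, and rescales the result back to real units.

```python
# Sketch of INT8 inference arithmetic as NPUs perform it: quantize FP32
# vectors to INT8 with a per-tensor scale, do the dot product in
# integers (accumulating in INT32), then rescale the result.

def quantize_int8(values):
    """Symmetric per-tensor quantization: map [-max, max] onto [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # guard all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def int8_dot(a, b):
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    acc = sum(x * y for x, y in zip(qa, qb))  # fits in an INT32 accumulator
    return acc * sa * sb                      # rescale back to real units

a = [0.5, -1.2, 0.8, 2.0]
b = [1.0, 0.3, -0.7, 0.25]
print("int8:", int8_dot(a, b), "fp32:", sum(x * y for x, y in zip(a, b)))
```

The wide accumulator is the key detail: summing many INT8 products overflows 8 bits almost immediately, so NPUs accumulate in INT32 and rescale once per output element.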
The table below summarizes key NPU implementations as of early 2026.
| NPU / AI Accelerator | Manufacturer | Device Category | Performance (TOPS) | Key Features |
|---|---|---|---|---|
| Apple Neural Engine (M4) | Apple | Mac, iPad | 38 | 16-core NPU; integrated with Apple silicon; powers Core ML |
| Apple Neural Engine (A18 Pro) | Apple | iPhone | ~35 | Powers Apple Intelligence on-device features |
| Hexagon NPU (Snapdragon 8 Elite Gen 5) | Qualcomm | Smartphones | 45+ | Up to 100x CPU speedup; 46% faster than prior gen; always-on low-power sensing [6] |
| Hexagon NPU (Snapdragon X Elite) | Qualcomm | AI PCs | 45 | Powers on-device LLMs on Windows laptops |
| MediaTek APU (Dimensity 9400) | MediaTek | Smartphones | 46 | Generative AI capable; supports LoRA adapters on-device |
| Intel NPU (Lunar Lake) | Intel | AI PCs | 48 | Integrated into Core Ultra 200V processors |
| AMD Ryzen AI NPU (Ryzen AI 300) | AMD | AI PCs | 55 | XDNA 2 architecture; highest TOPS among PC NPUs |
| Google Coral NPU | Google | Wearables, IoT | Varies | Ultra-low-power; co-designed with DeepMind; always-on edge AI [7] |
| Intel Neural Compute Stick 2 | Intel | USB accelerator | ~4 | VPU-based; plugs into any system via USB for edge inference |
| NVIDIA Jetson Orin Nano | NVIDIA | Robotics, embedded | 40-67 | Full Linux system; supports PyTorch/TensorFlow; GPU-based inference |
| Nordic Axon NPU | Nordic Semiconductor | Ultra-low-power IoT | <1 | ~6.5ms inference; <20 microcoulombs energy per inference [5] |
Before dedicated NPUs became widespread, mobile GPUs handled on-device AI workloads. GPUs remain relevant for larger models and workloads that do not map efficiently to NPU architectures. Qualcomm's Adreno, ARM's Mali, and Apple's integrated GPU all support AI inference through frameworks like Metal Performance Shaders (Apple), Vulkan compute (Android), and OpenCL.
Apple was an early pioneer of on-device AI hardware, introducing the Neural Engine with the A11 Bionic chip in 2017. That first version delivered 0.6 TOPS. By 2026, Apple's Neural Engine in the M4 chip reaches 38 TOPS, a roughly 60x increase in under a decade [1]. The Neural Engine is tightly integrated with Apple's software stack; Core ML automatically routes model layers to the most efficient processor (NPU, GPU, or CPU) based on the operation type.
Apple's on-device foundation model (approximately 3B parameters) runs on the Neural Engine to power Apple Intelligence features including text summarization, rewriting, entity extraction, and notification prioritization [8].
Qualcomm's AI Engine is a heterogeneous compute architecture that coordinates the Hexagon NPU, Adreno GPU, and Kryo CPU for AI workloads. The Hexagon NPU is the primary accelerator, and Qualcomm provides the QNN (Qualcomm Neural Network) SDK for developers to optimize and deploy models. On the Snapdragon 8 Elite Gen 5, over 56 models run inference in under 5 milliseconds on the NPU, compared to only 13 achieving that threshold on the CPU [6]. More than 80% of recent Qualcomm SoCs include an NPU, making it a standard component rather than a premium feature.
MediaTek brands its neural processing cores as the AI Processing Unit (APU). The APU in the Dimensity 9400 supports on-device generative AI workloads, including running small language models and supporting LoRA adapters for model customization without cloud connectivity. MediaTek targets the mid-range smartphone market in addition to flagships, bringing NPU capabilities to a broader range of price points.
Intel's vision for the "AI PC" centers on integrating NPUs into laptop and desktop processors. The Lunar Lake platform (Core Ultra 200V series) includes a 48 TOPS NPU designed for on-device AI workloads like real-time translation, image generation, and local LLM inference. Microsoft requires a minimum of 40 TOPS of NPU performance for its Copilot+ PC designation, establishing a baseline for the AI PC category [1].
The Intel Neural Compute Stick 2 is a USB thumb-drive-sized device containing a Vision Processing Unit (VPU) that can be plugged into any computer to add dedicated AI inference capability. While its performance (roughly 4 TOPS) is modest compared to integrated NPUs, it provided an accessible entry point for edge AI prototyping and remains useful for adding inference capability to legacy hardware.
Deploying AI models on edge devices requires specialized software frameworks that handle model conversion, optimization, and runtime execution across diverse hardware targets.
| Framework | Developer | Primary Targets | Key Features |
|---|---|---|---|
| TensorFlow Lite / LiteRT | Google | Android, iOS, microcontrollers, Linux | Mature ecosystem; 8-bit/16-bit quantization; delegate system for hardware acceleration; recently rebranded as LiteRT |
| ONNX Runtime Mobile | Microsoft | Android, iOS, Windows, Linux | Cross-platform; supports models from PyTorch, TensorFlow, and others via ONNX format; strong CPU/GPU optimization |
| Core ML | Apple | iOS, macOS, watchOS, tvOS | Deep integration with Apple silicon; automatic NPU/GPU/CPU routing; ML model compilation at install time |
| ExecuTorch | Meta | Android, iOS, embedded, microcontrollers | PyTorch-native; hardware support across CPU, GPU, and NPU; lightweight runtime |
| llama.cpp | Open source (ggml-org) | Desktop, mobile, embedded | C/C++ LLM inference; GGUF format; automatic hardware detection; quantization (Q2-Q8); supports 100+ model architectures [9] |
| Qualcomm AI Hub | Qualcomm | Snapdragon devices | Pre-optimized models for Qualcomm hardware; integration with QNN SDK |
| TensorRT | NVIDIA | NVIDIA GPUs, Jetson | Graph optimization; mixed-precision inference; highest throughput on NVIDIA hardware |
| OpenVINO | Intel | Intel CPUs, GPUs, NPUs | Optimized for Intel hardware; supports model compression and quantization |
TensorFlow Lite, recently rebranded as LiteRT, is the most widely deployed edge AI framework. It converts TensorFlow models into a compact flatbuffer format optimized for mobile and embedded inference. The delegate system allows hardware-specific acceleration: the GPU delegate for mobile GPUs, the NNAPI delegate for Android NPUs, the Core ML delegate for Apple devices, and the Coral delegate for Google's Edge TPU. LiteRT supports 8-bit and 16-bit quantization, model pruning, and operator fusion to minimize model size and maximize throughput [10].
Google recently introduced a new LiteRT accelerator for Qualcomm hardware (Qualcomm AI Engine Direct / QNN), enabling high-performance inference on Snapdragon 8 series devices directly through LiteRT [6].
ONNX Runtime, developed by Microsoft, provides a cross-platform inference engine that can run models exported from virtually any training framework through the ONNX (Open Neural Network Exchange) format. The mobile variant is optimized for Android and iOS, with reduced binary size and support for quantized models. Its greatest strength is interoperability: a model trained in PyTorch, TensorFlow, or any ONNX-compatible framework can be deployed on any supported platform without rewriting [11].
Apple's Core ML framework is the gateway to running models on Apple hardware. It automatically partitions model computation across the Neural Engine, GPU, and CPU based on the specific operations and available hardware. Core ML supports model compilation at install time, producing optimized executables specific to the user's device. The framework supports all major model types including transformers, CNNs, and classical ML models. With the Foundation Models framework (September 2025), Apple exposed its on-device language model to third-party developers through Core ML's infrastructure [8].
ExecuTorch is Meta's PyTorch-native inference framework for edge devices. It was designed from the ground up for portability and efficiency, supporting CPU, GPU, and NPU execution across iOS, Android, embedded systems, and microcontrollers. For developers already in the PyTorch ecosystem, ExecuTorch provides the most seamless path from training to on-device deployment. It supports quantization, operator fusion, and memory planning optimizations [12].
llama.cpp is an open-source C/C++ inference engine specifically designed for running large language models on consumer hardware. It supports quantized execution at precisions from 2-bit to 8-bit using the GGUF format, and it can run on CPUs, GPUs, and Apple silicon with automatic hardware detection and optimal execution path selection. The project supports over 100 model architectures including Llama, Mistral, Phi, Gemma, Qwen, DeepSeek, and StableLM [9]. Mobile applications like AnythingLLM and ChatterUI use llama.cpp as their backend for running language models on phones.
Qualcomm AI Hub provides a library of pre-optimized models ready for deployment on Snapdragon-powered devices. It handles the conversion and optimization pipeline, producing models that take advantage of Qualcomm's Hexagon NPU, Adreno GPU, and Kryo CPU. The hub integrates with standard ML frameworks and supports both traditional vision/audio models and generative AI models including small language models.
Edge devices impose strict constraints on memory, compute, power, and thermal budgets. Several optimization techniques bridge the gap between the resource requirements of state-of-the-art models and the capabilities of edge hardware.
Quantization reduces the numerical precision of model weights and activations from high-precision formats (FP32 or FP16) to lower-precision formats (INT8, INT4, or even INT2). This reduces memory footprint proportionally, speeds up inference (because lower-precision arithmetic is cheaper), and reduces power consumption. Most edge NPUs are optimized for INT8 operations, making quantization a natural fit for on-device deployment.
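The core mechanics can be shown directly. This pure-Python sketch (illustrative; real frameworks use calibrated ranges and per-channel scales) round-trips a small weight tensor through symmetric INT8 quantization and reports the memory saving and reconstruction error:

```python
import struct

# Round-trip a weight tensor through symmetric INT8 quantization and
# compare memory footprint and reconstruction error.

def quantize(weights, bits=8):
    qmax = 2 ** (bits - 1) - 1             # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.42, -0.17, 0.93, -0.61, 0.05, 0.77]
q, scale = quantize(weights, bits=8)
restored = dequantize(q, scale)

fp32_bytes = len(weights) * struct.calcsize("f")   # 4 bytes per weight
int8_bytes = len(q) * 1                            # 1 byte per weight
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"memory: {fp32_bytes}B -> {int8_bytes}B, max error: {max_err:.4f}")
```

The worst-case rounding error is half the scale step, which is why quantization error grows as bit width shrinks: INT4 has 16x fewer representable levels than INT8, so each level must cover a 16x wider range of values.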
Two main approaches exist:
| Approach | Description | Trade-offs |
|---|---|---|
| Post-Training Quantization (PTQ) | Quantize after training is complete using a calibration dataset | Simple to apply; some accuracy loss at very low bit widths |
| Quantization-Aware Training (QAT) | Simulate quantization during training so the model learns to compensate | Better accuracy at low precision; requires retraining |
Apple's on-device foundation model uses 2-bit QAT, an aggressively low precision that demonstrates how far quantization can be pushed when the training process is designed around it [8]. The GGUF format used by llama.cpp supports a range of quantization levels from Q2_K (2-bit) through Q8_0 (8-bit), letting users choose their preferred accuracy-efficiency trade-off.
Pruning removes weights, neurons, or entire layers from a model that contribute least to its output. Structured pruning (removing complete attention heads, channels, or layers) produces models that run faster on standard hardware because the remaining structure is regular. Unstructured pruning (zeroing out individual weights) can achieve higher compression ratios but requires sparse computation support from the hardware. NVIDIA's NeMo framework combines pruning with knowledge distillation, using the original unpruned model as a teacher to help the pruned model recover accuracy [13].
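Unstructured magnitude pruning reduces to a few lines. The sketch below is illustrative (production pipelines prune iteratively and fine-tune between rounds, as the NeMo approach above does with distillation): it zeroes the fraction of weights with the smallest magnitudes.

```python
# Unstructured magnitude pruning: zero out the fraction of weights with
# the smallest absolute values. Ties at the threshold may prune slightly
# more than the requested fraction.

def prune(weights, sparsity=0.5):
    k = int(len(weights) * sparsity)       # number of weights to drop
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.03, 0.2, -0.6]
pruned = prune(w, sparsity=0.5)
print(pruned, "sparsity:", pruned.count(0.0) / len(pruned))
```

Note that this produces an irregular zero pattern, which is exactly why unstructured pruning needs sparse-computation support from the hardware to yield real speedups; structured pruning trades compression ratio for regular shapes that any accelerator can exploit.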
Knowledge distillation trains a smaller "student" model to replicate the behavior of a larger "teacher" model. The student learns from the teacher's full output probability distributions rather than just ground-truth labels, capturing nuanced patterns that standard training would miss. This technique is fundamental to the creation of edge-deployable models: Google's Gemma is distilled from Gemini, and Microsoft's Phi-4 uses distillation signals from GPT-4o [14]. The result is a compact model that inherits much of the teacher's capability at a fraction of the inference cost.
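The distillation objective can be written compactly. This sketch computes the classic temperature-softened KL-divergence loss between teacher and student logits; the logit values are invented for illustration.

```python
import math

# Knowledge-distillation objective: the student is trained to match the
# teacher's temperature-softened output distribution via KL divergence,
# instead of (or in addition to) hard one-hot labels.

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [3.0, 1.0, 0.2]
aligned_student = [2.9, 1.1, 0.3]
divergent_student = [0.1, 2.5, 1.0]
print(distillation_loss(teacher, aligned_student),
      distillation_loss(teacher, divergent_student))
```

The temperature softens both distributions so that the teacher's relative preferences among wrong answers (the "dark knowledge") carry gradient signal, rather than just the argmax class.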
Additional optimization methods include operator fusion (merging adjacent layers into a single kernel), memory planning, KV-cache sharing for transformer inference, and low-rank adaptation (LoRA) for on-device model customization.
Smartphones are the highest-volume edge AI platform. Every major smartphone chipset now includes an NPU, and manufacturers are deploying AI features that run entirely on-device.
Apple's on-device foundation model powers Apple Intelligence features across iPhone, iPad, and Mac [8]. Google integrates Gemini Nano into Android for on-device text tasks. On Snapdragon phones, the NexaSDK gives developers a simplified path to deploying on-device AI [6].
The IoT ecosystem, encompassing smart home devices, environmental sensors, industrial monitors, and agricultural sensors, benefits enormously from edge AI. Devices with limited power and connectivity can use tiny ML models (running on microcontrollers with kilobytes of RAM) for anomaly detection, keyword spotting, predictive maintenance, and environmental monitoring. Approximately 47% of IoT applications are expected to be AI-infused by 2027 [4].
Google's Coral NPU platform targets this space, offering ultra-low-power always-on AI for wearables and IoT devices with minimal battery impact [7]. Nordic Semiconductor's Axon NPU brings inference capability to Bluetooth-class microcontrollers, achieving inference in 6.5 milliseconds at less than 20 microcoulombs of energy [5].
Self-driving vehicles are among the most demanding edge AI applications. Cameras, radar, LiDAR, and ultrasonic sensors generate terabytes of data per day, all of which must be processed locally in real time for navigation, obstacle detection, lane keeping, and collision avoidance. The latency requirements are extreme: a vehicle traveling at highway speed covers roughly three centimeters per millisecond of processing delay, which translates to several meters over a typical cloud round trip.
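The latency stakes follow from simple arithmetic. The sketch below computes the distance a vehicle covers while waiting for inference, comparing a local NPU latency against typical cloud round trips (the speeds and latencies are assumed figures for illustration):

```python
# Back-of-envelope latency budget for automotive edge AI: distance a
# vehicle covers while waiting for an inference result.

def distance_m(speed_kmh, latency_ms):
    """Distance traveled, in meters, during the given latency."""
    return speed_kmh / 3.6 * latency_ms / 1000.0  # km/h -> m/s, ms -> s

for latency in (5, 100, 200):  # local NPU vs. typical cloud round trips
    print(f"{latency:>3} ms at 120 km/h -> {distance_m(120, latency):.2f} m")
```

At 120 km/h, a 5 ms on-device inference costs well under a meter of travel, while a 200 ms cloud round trip costs more than a car length.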
Automotive edge AI uses specialized hardware from companies like NVIDIA (Drive platform), Qualcomm (Snapdragon Ride), and Mobileye (EyeQ). McKinsey research indicates that edge AI in the automotive sector is expanding beyond basic driver assistance to include predictive maintenance, in-cabin monitoring, and vehicle-to-infrastructure communication [15]. The automotive and transportation sector currently leads edge AI adoption by revenue.
Factory environments use edge AI for real-time quality inspection, predictive maintenance, process optimization, and safety monitoring. Camera-equipped inspection stations classify products on conveyor belts at speeds that would be impossible with cloud-based inference. Vibration sensors on machinery use local ML models to detect early signs of bearing failure or motor degradation before catastrophic breakdown occurs. The integration of edge AI with industrial IoT enables proactive quality control and supply chain visibility [4].
Medical wearables and portable diagnostic devices increasingly incorporate edge AI. Continuous glucose monitors, cardiac rhythm monitors, pulse oximeters, and fall detection systems use on-device models to analyze physiological signals in real time. Edge processing is essential here for two reasons: clinical alerts must be immediate (no cloud latency), and patient health data must be handled with extreme privacy safeguards. Wearable devices powered by edge AI can detect sudden falls and immediately notify caregivers, or flag irregular heart rhythms for clinical review [4].
The healthcare sector is expected to be the fastest-growing segment of the edge AI market, driven by the combination of privacy requirements and real-time responsiveness [2].
Retail stores use edge AI for shelf monitoring, customer traffic analysis, checkout-free shopping (like Amazon's Just Walk Out technology), and loss prevention. Smart building systems use on-device AI for occupancy detection, energy management, and security. These applications process camera and sensor data locally, transmitting only aggregated insights to central systems.
Edge AI enables intelligent video analytics at the camera level. Rather than streaming full video to a central server for analysis, edge-equipped cameras can perform object detection, facial recognition, license plate reading, and anomaly detection locally, transmitting only relevant clips or alerts. This reduces bandwidth by orders of magnitude and enables faster response times.
The "AI PC" has emerged as a distinct product category defined by the inclusion of a dedicated NPU alongside the traditional CPU and GPU. Microsoft formalized this with the Copilot+ PC specification, requiring a minimum of 40 TOPS of NPU performance [1].
| Processor | NPU TOPS | Manufacturer | Notable Capabilities |
|---|---|---|---|
| Apple M4 | 38 | Apple | Powers Core ML; runs Apple Intelligence; 16-core Neural Engine |
| Qualcomm Snapdragon X Elite | 45 | Qualcomm | Windows on ARM; runs on-device LLMs; Copilot+ certified |
| Intel Core Ultra 200V (Lunar Lake) | 48 | Intel | Windows AI PCs; integrated into low-power laptop processors |
| AMD Ryzen AI 300 | 55 | AMD | XDNA 2 architecture; highest TOPS among consumer PC NPUs |
These NPUs enable local execution of AI features that previously required cloud processing: real-time meeting transcription and translation, AI-powered image editing, local document summarization, and running small language models for offline chat and code assistance. On Snapdragon X Elite laptops, users can run quantized 7B-parameter language models locally with acceptable performance.
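The feasibility of local LLMs comes down to memory arithmetic. The sketch below estimates weight storage for a 7B-parameter model at common quantization levels, ignoring the small per-block scale and metadata overheads that real GGUF files add:

```python
# Rough memory footprint of LLM weights at different quantization
# levels: the arithmetic behind "a quantized 7B model fits on a laptop".

def weight_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (decimal) for a given precision."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits, label in [(16, "FP16"), (8, "Q8"), (4, "Q4"), (2, "Q2")]:
    print(f"7B model at {label}: ~{weight_gb(7, bits):.1f} GB")
```

A 7B model that needs 14 GB at FP16 shrinks to about 3.5 GB at 4-bit, which is what brings it within reach of a laptop with 16 GB of unified memory while leaving room for the KV cache and the operating system.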
The AI PC trend represents a fundamental shift in how personal computers handle AI workloads. Rather than treating every AI task as a cloud API call, the computing industry is moving toward a hybrid model where routine inference runs locally on the NPU and only the most demanding tasks are offloaded to the cloud.
One of the most visible applications of edge AI in 2025-2026 is running language models directly on consumer devices. This was impractical just a few years ago, but advances in model compression and NPU hardware have made it viable.
| Model | Parameters | Developer | Deployment Context |
|---|---|---|---|
| Apple Foundation Model | ~3B | Apple | iPhone, iPad, Mac; powers Apple Intelligence [8] |
| Gemini Nano / Gemma 3n | 1B-2B | Google | Android devices; smart reply, summarization |
| Phi-4-mini | 3.8B | Microsoft | Windows AI PCs; via ONNX Runtime |
| Llama 3.2 | 1B, 3B | Meta | Mobile and edge via ExecuTorch |
| Qwen 2.5 | 0.5B, 1.5B, 3B | Alibaba | Edge devices; via llama.cpp or ONNX |
| SmolLM | 135M-1.7B | Hugging Face | Ultra-constrained devices |
Apple's approach is particularly notable. The on-device foundation model uses 2-bit quantization-aware training and KV-cache sharing to fit a 3B-parameter model within the memory and power constraints of an iPhone. Despite its compact size, it outperforms several larger models (Phi-3-mini, Mistral-7B, Gemma-7B, Llama-3-8B) on its target tasks [8]. With the Foundation Models framework released in September 2025, Apple opened this capability to third-party developers.
Google deploys Gemini Nano within Android for features like smart reply in messaging apps and on-device text summarization. The model runs on the device's NPU through the LiteRT (TensorFlow Lite) framework.
For users who want to run open-weight models on their own hardware, llama.cpp has become the standard tool. It supports quantized versions of most popular models and automatically selects an optimized execution path for the available hardware (CPU SIMD extensions, CUDA GPUs, or the Metal GPU on Apple silicon) [9].
The edge AI market is growing rapidly, though estimates vary across research firms depending on market definition and methodology.
| Source | 2025 Estimate | 2030-2034 Projection | CAGR |
|---|---|---|---|
| Grand View Research | $24.9B | $102B (2030) | 21.7% |
| Fortune Business Insights | $35.8B | N/A | 33.4% |
| Precedence Research | $25.7B | $143B (2034) | 29.0% |
| Technavio | N/A | $61.1B (2029) | ~20% |
The hardware segment (NPUs, AI accelerators, edge servers) accounted for approximately 51.8% of market revenue in 2025, reflecting the significant investment in specialized silicon [2]. The software and services segment is growing faster as frameworks mature and deployment tools simplify the development process.
Key growth drivers include the proliferation of IoT devices, the demand for real-time AI in automotive and industrial settings, the rising importance of data privacy, and the expanding capabilities of on-device hardware. The automotive and transportation sector currently leads adoption, but healthcare is expected to be the fastest-growing segment through 2030 [2].
The edge AI hardware landscape is highly fragmented. Different NPUs, GPUs, and microcontrollers support different precision formats, memory architectures, and instruction sets. A model optimized for Qualcomm's Hexagon NPU may not run efficiently on Intel's NPU or Apple's Neural Engine. This fragmentation complicates deployment and forces developers to maintain multiple model variants or rely on abstraction frameworks like ONNX Runtime that paper over hardware differences.
Achieving peak performance on edge hardware often requires hardware-specific optimization: custom quantization calibration, operator selection, memory layout tuning, and profiling. This is a specialized skill set that many development teams lack. Tools like Qualcomm AI Hub and Apple's Core ML tools aim to automate this process, but significant manual tuning is often still needed for production workloads.
Edge-deployed models are necessarily smaller and less capable than their cloud counterparts. They may struggle with complex reasoning, nuanced language understanding, or tasks requiring broad world knowledge. The gap is narrowing (as demonstrated by models like Phi-4 competing with 70B models on specific benchmarks), but for general-purpose AI tasks, cloud models remain superior. Many production systems adopt a hybrid approach: handle simple queries on-device and route complex ones to the cloud.
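The hybrid pattern described above can be sketched as a simple router. Everything here is a hypothetical placeholder (the complexity heuristic, the threshold, and both handler functions), not a real API:

```python
# Sketch of a hybrid edge/cloud router: prefer the on-device model and
# escalate to the cloud only when a request exceeds a crude capability
# heuristic. Real systems use learned routers or confidence scores.

def route(prompt, on_device, cloud, max_local_tokens=256):
    # Crude heuristic: long prompts, or prompts asking for multi-step
    # reasoning, go to the larger cloud model.
    needs_cloud = (len(prompt.split()) > max_local_tokens
                   or "step by step" in prompt.lower())
    return cloud(prompt) if needs_cloud else on_device(prompt)

# Stand-in handlers for demonstration.
local_model = lambda p: f"[edge] {p[:20]}"
cloud_model = lambda p: f"[cloud] {p[:20]}"

print(route("Summarize this note", local_model, cloud_model))
print(route("Explain step by step how transformers work",
            local_model, cloud_model))
```

The design choice worth noting is that routing happens before any network call, so the common case (a short, simple query) pays zero cloud latency and leaks no data off-device.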
Mobile devices and IoT sensors operate under strict power budgets. Running continuous AI inference drains batteries and generates heat. While NPUs are far more power-efficient than CPUs or GPUs for AI workloads, sustained high-throughput inference still impacts battery life. Always-on AI features (wake-word detection, health monitoring) must be designed with extreme power efficiency in mind, often using dedicated low-power sensing hubs alongside the main NPU [6].
Models deployed on edge devices are physically accessible to users, making them vulnerable to model extraction attacks, adversarial inputs, and reverse engineering. Protecting model intellectual property on-device is harder than protecting it behind a cloud API. Secure enclaves and hardware-based model encryption are emerging as countermeasures, but the field is still maturing.
Edge AI in early 2026 is at an inflection point, transitioning from early adoption to mainstream deployment.
NPUs are ubiquitous. Every major mobile and PC chip includes a dedicated neural processing unit delivering 35-50+ TOPS. This is a 50-80x increase from the 0.6 TOPS of the original A11 Neural Engine in 2017 [1]. The hardware foundation for on-device AI is now firmly in place across billions of devices.
On-device LLMs are shipping products. Apple Intelligence runs a 3B-parameter model on every compatible iPhone. Google integrates Gemini Nano into Android. Qualcomm certifies AI PCs that can run 7B+ parameter models locally. What was a research curiosity in 2023 is a consumer product in 2026.
Frameworks are maturing. TensorFlow Lite (LiteRT), Core ML, ExecuTorch, ONNX Runtime, and llama.cpp have all evolved into production-grade tools with broad hardware support, comprehensive documentation, and active communities. The tooling gap between cloud and edge deployment has narrowed significantly.
Hybrid cloud-edge architectures are the default. Rather than choosing between cloud and edge, most production systems now combine both. Simple, latency-sensitive, or privacy-critical tasks run on-device, while complex tasks that exceed the edge model's capability are routed to the cloud. This pattern maximizes the strengths of both approaches.
The market is booming. With growth rates exceeding 20% annually and projections reaching $100B+ by the early 2030s, edge AI is one of the fastest-growing segments of the AI industry. Investment in edge AI hardware, software, and applications continues to accelerate.
New form factors are emerging. AI-capable wearables (smart glasses, health monitors, earbuds), AI-equipped drones, and smart industrial sensors represent new deployment targets beyond the traditional phone-laptop-server categories. Google's Coral NPU specifically targets always-on wearable AI with ultra-low power requirements [7].
The direction is clear: AI is moving from the cloud to the edge. Not as a replacement for cloud AI, but as a complementary deployment model that brings intelligence to every device, in every environment, with or without a network connection.
Imagine you have a really smart friend who can answer any question, but they live far away. Every time you want to ask something, you have to call them on the phone, wait for them to pick up, and then wait for the answer. That is cloud AI. Now imagine that same smart friend shrinks down and lives right in your pocket, inside your phone. You can ask questions instantly, they answer right away, and you do not even need phone service. That is edge AI. The friend in your pocket might not know quite as many things as the one far away, but for most questions you ask every day, they are just as good, and way faster.