An AI accelerator (also called an AI chip or neural processing unit) is a class of specialized hardware designed to perform artificial intelligence workloads, particularly neural network training and inference, far more efficiently than general-purpose CPUs. These processors achieve their performance advantage by incorporating massively parallel architectures, high-bandwidth memory, and low-precision arithmetic units optimized for the matrix and tensor operations that dominate deep learning computation.
Since the resurgence of deep learning in the early 2010s, AI accelerators have evolved from repurposed graphics cards into a diverse ecosystem of GPUs, tensor processing units, custom ASICs, and FPGAs. The global AI chip market generated an estimated $71 billion in revenue in 2024 according to Gartner, and continues to grow at a rapid pace as demand for AI compute shows no sign of slowing.
Traditional CPUs are designed for serial, general-purpose computation. While they can execute neural network operations, they lack the throughput to train large language models or serve billions of inference requests at acceptable latency. A single transformer training run for a frontier model can require thousands of accelerators working in concert for weeks or months.
AI accelerators address this bottleneck through several architectural strategies: massively parallel compute arrays, high-bandwidth on-package memory, low-precision arithmetic units, and fast chip-to-chip interconnects that allow thousands of devices to operate as a single system.
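The throughput gap between a CPU and an accelerator is easy to observe with a single large matrix multiplication. The sketch below is a minimal illustration, assuming a machine with PyTorch installed and a CUDA-capable GPU available; the exact speedup depends entirely on the hardware.

```python
import time
import torch

def time_matmul(device: str, n: int = 4096, dtype=torch.float16, iters: int = 10) -> float:
    """Return average seconds per n x n matrix multiplication on the given device."""
    a = torch.randn(n, n, device=device, dtype=dtype)
    b = torch.randn(n, n, device=device, dtype=dtype)
    torch.matmul(a, b)  # warm-up so lazy initialization does not pollute the timing
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work to finish
    return (time.perf_counter() - start) / iters

cpu_time = time_matmul("cpu", dtype=torch.float32, iters=3)  # many CPUs lack fast FP16 paths
if torch.cuda.is_available():
    gpu_time = time_matmul("cuda")
    print(f"CPU: {cpu_time*1e3:.1f} ms  GPU: {gpu_time*1e3:.1f} ms  speedup: {cpu_time/gpu_time:.0f}x")
else:
    print(f"CPU only: {cpu_time*1e3:.1f} ms per matmul")
```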
AI accelerators can be grouped into four broad categories, each with distinct trade-offs in flexibility, performance, and power efficiency.
Originally developed for rendering 3D graphics, GPUs became the default platform for deep learning because their thousands of parallel cores naturally fit the matrix-heavy computation of neural networks. NVIDIA pioneered this shift with the release of its CUDA programming platform in 2006, which gave researchers a way to write general-purpose code that ran on GPU hardware.
Today, data center GPUs from NVIDIA and AMD dominate AI training and inference workloads. GPUs offer a balance of programmability and raw throughput, and the mature software ecosystem (frameworks like PyTorch and TensorFlow, plus vendor-specific libraries such as cuDNN and Triton) makes them the most accessible option for practitioners.
Google's Tensor Processing Units are custom ASICs built specifically for neural network workloads. First deployed internally in 2015, TPUs use a systolic array architecture that excels at dense matrix multiplications. Google makes TPUs available through Google Cloud and uses them internally to train models like Gemini and PaLM.
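A systolic array computes a matrix product by streaming operands through a fixed grid of multiply-accumulate units, so each value is fetched from memory once and then reused as it marches across the array. The following sketch is a conceptual, software-only simulation of an output-stationary array, not Google's actual design: processing element (i, j) accumulates output C[i][j], and the k-th operand pair reaches it at cycle i + j + k.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Simulate C = A @ B on an output-stationary N x N systolic array."""
    N = A.shape[0]
    C = np.zeros((N, N))
    # The product finishes after 3N - 2 cycles: operands are skewed in time so
    # that A[i, k] and B[k, j] meet at PE (i, j) on cycle i + j + k.
    for cycle in range(3 * N - 2):
        for i in range(N):
            for j in range(N):
                k = cycle - i - j
                if 0 <= k < N:
                    C[i, j] += A[i, k] * B[k, j]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
assert np.allclose(systolic_matmul(A, B), A @ B)

# Peak throughput scales with the array size: TPU v1's 256x256 array at 700 MHz
# performs 256 * 256 multiply-accumulates (2 ops each) per cycle,
# i.e. 256 * 256 * 2 * 0.7e9 ~= 92e12 ops/s, the 92 TOPS figure cited below.
```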
Beyond Google's TPUs, a growing number of companies have designed application-specific integrated circuits (ASICs) tailored to AI. These include cloud providers building silicon for their own platforms (AWS Trainium, Meta MTIA, Microsoft Maia) and startups pursuing novel architectures (Cerebras, Groq, SambaNova, Tenstorrent). Custom ASICs can deliver the highest performance per watt for a fixed workload but sacrifice the general-purpose programmability of GPUs.
FPGAs are reconfigurable chips whose logic can be reprogrammed after manufacturing. Companies like Intel (through its Altera division) and AMD (through its Xilinx acquisition) produce FPGAs used for AI inference at the edge and in latency-sensitive applications such as autonomous vehicles, medical imaging, and telecommunications. FPGAs can offer lower latency than GPUs at small batch sizes and can be tuned for specific model architectures, but they require specialized hardware design expertise and typically deliver lower peak throughput than GPUs or ASICs.
NVIDIA commands roughly 80 to 90 percent of the data center AI accelerator market by revenue as of 2025. This dominance rests on two pillars: consistently leading hardware and a deeply entrenched software ecosystem.
NVIDIA's data center GPU lineup has progressed through several major architecture generations:
| Architecture | Flagship GPU | Year | Process Node | Transistors | HBM Capacity | Memory Bandwidth | Key FP8/FP16 Performance |
|---|---|---|---|---|---|---|---|
| Ampere | A100 | 2020 | TSMC 7nm | 54.2B | 80 GB HBM2e | 2.0 TB/s | 624 TFLOPS FP16 (sparse) |
| Hopper | H100 | 2022 | TSMC 4N | 80B | 80 GB HBM3 | 3.35 TB/s | 1,979 TFLOPS FP8 |
| Hopper (refresh) | H200 | 2024 | TSMC 4N | 80B | 141 GB HBM3e | 4.8 TB/s | 1,979 TFLOPS FP8 (same compute die as H100) |
| Blackwell | B200 | 2024 | TSMC 4NP | 208B | 192 GB HBM3e | 8.0 TB/s | 4,500 TFLOPS FP8 (dense) |
The A100, launched in 2020, introduced the Ampere architecture with third-generation Tensor Cores and support for sparsity acceleration. It became the workhorse GPU for training large language models during 2021 and 2022.
The H100, based on the Hopper architecture and announced in March 2022, raised memory bandwidth to 3.35 TB/s and introduced the Transformer Engine, which dynamically switches between FP8 and FP16 precision to accelerate transformer-based models. The H100 became the most sought-after chip in the AI industry, with waiting times stretching to months during 2023.
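Because FP8 has a very narrow dynamic range, FP8 execution relies on per-tensor scaling: a tensor is rescaled so its largest magnitude fits the format before casting, and the scale is undone afterwards. The sketch below illustrates the idea with the E4M3 variant (maximum representable magnitude 448), assuming a recent PyTorch build that exposes torch.float8_e4m3fn; it is a simplified illustration of the concept, not the Transformer Engine's actual implementation.

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_roundtrip(x: torch.Tensor) -> torch.Tensor:
    """Scale a tensor into E4M3 range, cast to FP8, then undo the scaling."""
    scale = E4M3_MAX / x.abs().max().clamp(min=1e-12)
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)   # quantize
    return x_fp8.to(torch.float32) / scale        # dequantize

x = torch.randn(1024, 1024) * 0.05                # typical activation magnitude
x_hat = fp8_roundtrip(x)
rel_err = (x - x_hat).abs().mean() / x.abs().mean()
print(f"mean relative error after FP8 round-trip: {rel_err:.4f}")
```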
The H200, shipping from Q2 2024, kept the Hopper compute die but upgraded memory to 141 GB of HBM3e with 4.8 TB/s bandwidth, yielding roughly 2x faster inference for memory-bound large language model workloads.
The B200, part of the Blackwell architecture announced at GTC 2024, uses a dual-die chiplet design containing 208 billion transistors. It delivers up to 9 PFLOPS of dense FP4 performance and 4,500 TFLOPS of dense FP8 performance, roughly 2.3x the H100 in FP8. Head-to-head benchmarks showed B200-based systems delivering 2.2x faster Llama 2 70B fine-tuning and 2x faster GPT-3 175B pre-training compared to H100.
CUDA, introduced in 2006, is NVIDIA's parallel computing platform and programming model. It has grown into the single most important moat in the AI hardware market. With over 4 million developers, more than 3,000 GPU-optimized applications, and native integration into every major machine learning framework, CUDA creates switching costs that competitors struggle to overcome. Practically every deep learning library, from PyTorch to JAX, is optimized for CUDA first, and most research papers assume NVIDIA hardware.
NVIDIA supplements CUDA with specialized libraries: cuDNN for deep learning primitives, TensorRT for inference optimization, cuBLAS for linear algebra, NCCL for multi-GPU communication, and Triton (originally from OpenAI, now NVIDIA-supported) for writing custom GPU kernels in Python.
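In practice these libraries are usually invoked indirectly: a framework such as PyTorch dispatches convolutions to cuDNN and matrix multiplications to cuBLAS without the user writing any GPU code. A minimal check, assuming a CUDA build of PyTorch and an NVIDIA GPU:

```python
import torch
import torch.nn.functional as F

print("CUDA available:", torch.cuda.is_available())
print("cuDNN version:", torch.backends.cudnn.version())  # None on CPU-only builds

if torch.cuda.is_available():
    # Let cuDNN benchmark candidate convolution algorithms and cache the fastest.
    torch.backends.cudnn.benchmark = True
    x = torch.randn(32, 64, 56, 56, device="cuda", dtype=torch.float16)
    w = torch.randn(128, 64, 3, 3, device="cuda", dtype=torch.float16)
    y = F.conv2d(x, w, padding=1)       # executed by a cuDNN kernel under the hood
    z = y.flatten(1) @ y.flatten(1).T   # matmul routed to cuBLAS
    torch.cuda.synchronize()
    print("conv output:", tuple(y.shape), "matmul output:", tuple(z.shape))
```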
Google was the first hyperscaler to design a custom AI chip, deploying TPU v1 internally in 2015. The TPU program has progressed through seven generations.
TPU v1 (2015). An inference-only accelerator featuring a 256x256 systolic array running at 700 MHz. It delivered 92 TOPS for 8-bit integer operations and consumed 75 watts. Google reported it achieved 15 to 30x higher throughput than contemporary GPUs on inference workloads while delivering 30 to 80x better TOPS per watt.
TPU v2 (2017). The first TPU capable of both training and inference. Each chip delivered approximately 45 TFLOPS in bfloat16 with 8 GB of HBM. Google introduced the bfloat16 numerical format with TPU v2, which later became an industry standard adopted by NVIDIA, AMD, and Intel; a short sketch of the format's trade-offs follows this list of generations.
TPU v3 (2018). More than doubled performance over v2, adding liquid cooling to manage increased power density. Each chip provided approximately 123 TFLOPS in bfloat16 with 16 GB of HBM.
TPU v4 (2021). Improved performance by more than 2x over v3, with 275 TFLOPS per chip and 32 GB of HBM per chip. A single v4 pod contained 4,096 chips and introduced optical circuit switches for faster inter-chip communication. Google published a detailed paper on the v4 architecture at ISCA 2023.
TPU v5e (2023). Optimized for cost-efficient inference and fine-tuning. Scaled to 256-chip pods.
TPU v5p (2023). The highest-performance v5 variant, with each chip delivering approximately 459 TFLOPS in bfloat16 and pods scaling to 8,960 chips.
TPU v6e / Trillium (2024). Google's sixth-generation TPU, codenamed Trillium, delivered 4.7x the per-chip compute performance of v5e and doubled HBM capacity and bandwidth. A single Trillium-based cluster can reach 91 exaFLOPS of aggregate compute. Trillium provided up to 2.5x better performance per dollar than v5p for dense LLM training and reached general availability in December 2024.
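The practical significance of bfloat16, introduced with TPU v2 above, is that it keeps float32's 8-bit exponent, and therefore its dynamic range, while cutting the mantissa to 7 bits: gradients rarely overflow, but each value carries far less precision. A small illustrative sketch using PyTorch's dtype metadata:

```python
import torch

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    # bfloat16 matches float32's range (~3.4e38) but keeps only ~2-3 decimal
    # digits of precision; float16 is more precise but overflows past 65,504.
    print(f"{str(dtype):16s} max={info.max:.3e}  eps={info.eps:.3e}")

x = torch.tensor([1e20, 3.14159265])
print(x.to(torch.bfloat16).to(torch.float32))  # range preserved, precision lost
print(x.to(torch.float16).to(torch.float32))   # 1e20 overflows to inf in FP16
```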
AMD is NVIDIA's most significant merchant-silicon competitor in the data center AI market, holding an estimated 5 to 8 percent market share.
Released in late 2023, the MI300X is built on AMD's CDNA 3 architecture using a chiplet design combining 5nm and 6nm dies. It features 192 GB of HBM3 memory with 5.3 TB/s bandwidth, 1,307 TFLOPS of peak FP16 performance, and 2,615 TFLOPS of peak FP8 performance. The MI300X's 192 GB memory capacity (matching the B200 and exceeding the H100's 80 GB) made it competitive for serving very large language models that benefit from fitting more parameters on a single accelerator.
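The link between memory capacity and the largest model a single accelerator can serve is back-of-the-envelope arithmetic: weights dominate, at roughly one byte per parameter in FP8 or two bytes in FP16, plus working space for the KV cache and activations. A rough sketch (the 20 percent overhead reservation is an illustrative assumption, not a vendor figure):

```python
def max_params_billion(memory_gb: float, bytes_per_param: float, overhead: float = 0.20) -> float:
    """Rough upper bound on model size (billions of parameters) that fits in memory,
    reserving a fraction of capacity for KV cache, activations, and buffers."""
    usable_bytes = memory_gb * 1e9 * (1.0 - overhead)
    return usable_bytes / bytes_per_param / 1e9

for name, gb in [("H100 (80 GB)", 80), ("MI300X (192 GB)", 192), ("MI325X (256 GB)", 256)]:
    print(f"{name}: ~{max_params_billion(gb, 2):.0f}B params in FP16, "
          f"~{max_params_billion(gb, 1):.0f}B in FP8")
```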
Launched in 2024, the MI325X upgraded memory to 256 GB of HBM3e with 6 TB/s bandwidth while maintaining the CDNA 3 architecture. It is designed for customers deploying larger LLMs and supporting more concurrent inference sessions.
AMD's open-source ROCm (Radeon Open Compute) platform is the primary alternative to CUDA. ROCm 6.0+ provides native support for PyTorch and TensorFlow, and AMD has invested in HIP (Heterogeneous Interface for Portability), a translation layer that allows many CUDA programs to run on AMD hardware with minimal code changes. However, ROCm's ecosystem remains less mature than CUDA, with fewer optimized libraries and less third-party support.
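Because ROCm builds of PyTorch expose AMD GPUs through the same torch.cuda interface (with HIP underneath), much code written for NVIDIA hardware runs unmodified on AMD accelerators. A small sketch for checking which backend a given PyTorch build targets:

```python
import torch

def gpu_backend() -> str:
    """Report whether this PyTorch build targets CUDA, ROCm/HIP, or CPU only."""
    if torch.version.hip is not None:
        return f"ROCm/HIP {torch.version.hip}"
    if torch.version.cuda is not None:
        return f"CUDA {torch.version.cuda}"
    return "CPU-only build"

print("backend:", gpu_backend())
if torch.cuda.is_available():              # ROCm devices also appear here
    print("device 0:", torch.cuda.get_device_name(0))
    x = torch.randn(2048, 2048, device="cuda")
    print("matmul ok:", (x @ x).shape)     # same code path on either vendor
```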
Intel's Gaudi 3, manufactured on a 5nm process, is the company's flagship AI accelerator. It features 128 GB of HBM2e memory with 3.67 TB/s bandwidth and delivers up to 1,835 TFLOPS in FP8 via its matrix multiplication engines (MMEs). The chip consumes up to 900 watts (with liquid cooling). Intel positioned Gaudi 3 as a cost-competitive alternative to the H100, launching volume production in Q3 2024. While benchmarks show Gaudi 3 trailing the H100 in some workloads, its lower price point targets customers seeking better price-performance ratios.
The largest cloud providers have each invested in designing their own AI chips, primarily to reduce dependence on NVIDIA and to optimize for their specific workloads.
Amazon Web Services has developed two chip families. Inferentia (first generation in 2019, Inferentia2 in 2022) targets inference workloads, with each Inferentia2 chip providing 32 GB of HBM. Trainium targets training. The second-generation Trainium2, which became generally available in December 2024, delivers up to 1.3 PFLOPS of dense FP8 per chip with 96 GB of HBM. A single Trn2 instance combines 16 Trainium2 chips for 20.8 PFLOPS of compute. AWS claims Trn2 instances offer 30 to 40 percent better price-performance than comparable GPU-based EC2 instances. AWS has also announced Trainium3, a 3nm chip expected in late 2025 with 2.52 PFLOPS per chip and 144 GB of HBM3e.
Meta developed its Meta Training and Inference Accelerator (MTIA) for internal use. The first generation (MTIA v1) used a 7nm process and consumed 25 watts. MTIA v2, announced in April 2024, moved to 5nm, features a 64-PE (processing element) array delivering 354 TOPS (INT8) and 177 TFLOPS (FP16), with 128 GB of LPDDR5 memory and a 90-watt power envelope. In March 2025, Meta announced four new generations: the MTIA 300, 400, 450, and 500, scheduled for deployment over the following two years.
Microsoft revealed its first custom AI accelerator, Maia 100, in late 2023. The chip is fabricated on TSMC's 5nm process with a die area of approximately 820 mm². It features 64 GB of HBM2e memory with 1.8 TB/s bandwidth, 500 MB of on-chip cache, and supports up to 700 watts TDP. Maia 100 integrates into Azure data centers and supports PyTorch models through a custom backend. Microsoft designed the chip to handle both training and inference for its Azure OpenAI Service and Copilot products.
Tesla designed the Dojo D1 chip specifically for training its autonomous driving neural networks. Fabricated on TSMC's 7nm process, the D1 contains 50 billion transistors in a 645 mm² die. Each chip delivers 362 TFLOPS at FP16/CFP8 and has 1.25 MB of SRAM per functional unit with no external DRAM. Twenty-five D1 chips are packaged into a water-cooled Training Tile that achieves 9 PFLOPS at BF16 while consuming 15 kilowatts. Tesla has reported working on next-generation Dojo chips manufactured at more advanced process nodes.
Several well-funded startups have introduced novel architectures that challenge the GPU-centric status quo.
Cerebras takes the most radical approach in the industry: building an entire processor from a single silicon wafer. The Wafer-Scale Engine 3 (WSE-3), announced in March 2024, is fabricated on TSMC's 5nm process and contains 4 trillion transistors across 46,225 mm² of silicon, making it the largest chip ever built. It features 900,000 AI-optimized compute cores, 44 GB of on-chip SRAM, and delivers 125 PFLOPS of peak AI performance. On-chip memory bandwidth exceeds 20 PB/s because memory is distributed locally across the tile array rather than accessed through external HBM stacks. The CS-3 system built around the WSE-3 can train models up to 24 trillion parameters. TIME Magazine recognized the WSE-3 as a Best Invention of 2024.
Groq's Language Processing Unit (LPU) is designed specifically for inference, emphasizing deterministic, low-latency execution over raw training throughput. The LPU uses hundreds of megabytes of on-chip SRAM as primary storage (not cache) and employs a compiler-driven, single-core architecture that eliminates the scheduling overhead found in GPUs. In benchmarks, Groq's systems run Llama 3 70B at over 1,660 tokens per second using speculative decoding, or 280 to 300 tokens per second in standard mode. LPUs connect via a plesiochronous protocol that allows hundreds of chips to function as a single logical processor. The system is air-cooled, simplifying data center deployment.
SambaNova's Reconfigurable Dataflow Unit (RDU) uses a dataflow architecture rather than the von Neumann model. The SN40L, fabricated on TSMC's 5nm process, features 1,040 Pattern Compute Units delivering 638 TFLOPS in BF16. Its defining feature is a three-tier memory hierarchy: 520 MB of on-chip SRAM, 64 GB of co-packaged HBM, and up to 1.5 TB of pluggable DDR DRAM. This hierarchy allows a single SN40L system to serve models up to 5 trillion parameters without distributing across multiple nodes. SambaNova raised $350 million in a 2026 funding round with Intel as a backer.
Bristol-based Graphcore developed the Intelligence Processing Unit (IPU), which features a bulk synchronous parallel execution model and large amounts of distributed on-chip SRAM (900 MB in the MK2 GC200). Each GC200 IPU has 1,472 cores running 8,832 parallel threads and delivers 250 TFLOPS at FP16. Graphcore was acquired by SoftBank in July 2024 for approximately $500 million. Under SoftBank's ownership, Graphcore has announced a roadmap toward an "Ultra Intelligence" AI supercomputer.
Led by CEO Jim Keller (previously of Apple, AMD, Intel, and Tesla), Tenstorrent builds RISC-V-based AI accelerators using an open-source philosophy. The Wormhole chip features up to 128 Tensix cores, 24 GB of GDDR6 memory with 576 GB/s bandwidth, and delivers up to 524 TFLOPS at FP8 in the dual-chip n300d configuration. The next-generation Blackhole chip, expected in 2025, moves to a 6nm process with 140 Tensix++ cores, 774 TFLOPS (FP8), 8 channels of GDDR6, and 16 RISC-V CPU cores. Tenstorrent also licenses its RISC-V CPU cores to other chip designers, creating a secondary business alongside its accelerator products.
When comparing AI accelerators, several metrics matter:
TOPS (Tera Operations Per Second). Measures integer operations, typically at INT8 precision. Commonly used for edge and inference-focused chips.
TFLOPS (TeraFLOPS). Measures floating-point operations per second. Reported at various precisions: FP32 for traditional HPC, FP16/BF16 for mixed-precision training, and FP8/FP4 for next-generation inference.
Memory capacity. The total amount of HBM, GDDR, or SRAM available on the accelerator. Larger models require more memory to store weights, activations, and optimizer states.
Memory bandwidth. Measured in TB/s, this determines how quickly data can be fed to the compute units. For inference of large language models (which are often memory-bandwidth-bound), this metric can matter more than raw TFLOPS; a back-of-the-envelope sketch follows this list.
Interconnect bandwidth. The speed of chip-to-chip communication, which determines how efficiently workloads can be distributed across multiple accelerators.
Performance per watt. Increasingly important as data centers face power constraints. Custom ASICs often lead in this metric because their fixed-function designs eliminate wasted transistors.
Total cost of ownership (TCO). Combines chip price, power consumption, cooling costs, and software development effort. A chip with lower peak TFLOPS but better TCO may be the more practical choice.
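For single-stream LLM decoding, every generated token requires reading essentially all model weights from memory, so memory bandwidth sets a hard ceiling on tokens per second regardless of peak TFLOPS. The back-of-the-envelope sketch referenced above, for a hypothetical 70-billion-parameter model served in FP8; it ignores KV-cache traffic, batching, and multi-chip parallelism, so real systems land below these ceilings:

```python
def decode_ceiling_tokens_per_s(params_billion: float, bytes_per_param: float,
                                bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode speed: one full weight read per token."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes

# 70B parameters at 1 byte per parameter (FP8), using bandwidths from the table above.
for name, bw in [("A100", 2.0), ("H100", 3.35), ("H200", 4.8), ("B200", 8.0), ("MI300X", 5.3)]:
    print(f"{name}: <= {decode_ceiling_tokens_per_s(70, 1, bw):.0f} tokens/s per stream")
```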
The following table compares key specifications of major data center AI accelerators available as of early 2025.
| Accelerator | Vendor | Architecture | Process Node | Memory | Memory Bandwidth | Peak FP8 / FP16 | TDP | Year |
|---|---|---|---|---|---|---|---|---|
| A100 SXM | NVIDIA | Ampere | 7nm | 80 GB HBM2e | 2.0 TB/s | 624 TFLOPS FP16 (sparse) | 400W | 2020 |
| H100 SXM | NVIDIA | Hopper | 4nm (4N) | 80 GB HBM3 | 3.35 TB/s | 1,979 TFLOPS FP8 | 700W | 2022 |
| H200 SXM | NVIDIA | Hopper | 4nm (4N) | 141 GB HBM3e | 4.8 TB/s | 1,979 TFLOPS FP8 | 700W | 2024 |
| B200 | NVIDIA | Blackwell | 4nm (4NP) | 192 GB HBM3e | 8.0 TB/s | 4,500 TFLOPS FP8 | 1,000W | 2024 |
| MI300X | AMD | CDNA 3 | 5nm/6nm | 192 GB HBM3 | 5.3 TB/s | 2,615 TFLOPS FP8 | 750W | 2023 |
| MI325X | AMD | CDNA 3 | 5nm/6nm | 256 GB HBM3e | 6.0 TB/s | 1,307 TFLOPS FP16 | 750W | 2024 |
| Gaudi 3 | Intel | Gaudi | 5nm | 128 GB HBM2e | 3.67 TB/s | 1,835 TFLOPS FP8 | 900W | 2024 |
| TPU v4 | Google | TPU | N/A | 32 GB HBM | N/A | 275 TFLOPS BF16 | N/A | 2021 |
| TPU v6e (Trillium) | Google | TPU | N/A | N/A | N/A | 4.7x v5e per chip | N/A | 2024 |
| Trainium2 | AWS | Custom | N/A | 96 GB HBM | 2.9 TB/s | 1,300 TFLOPS FP8 | N/A | 2024 |
| WSE-3 | Cerebras | Wafer-Scale | 5nm | 44 GB SRAM | 20+ PB/s (on-chip) | 125 PFLOPS peak | N/A | 2024 |
| SN40L | SambaNova | Dataflow RDU | 5nm | 64 GB HBM + 1.5 TB DDR | N/A | 638 TFLOPS BF16 | N/A | 2023 |
The AI accelerator market has experienced explosive growth. Gartner forecasted worldwide AI semiconductor revenue of $71 billion for 2024, representing a 33 percent increase over 2023. Within this figure, AI accelerators used in servers accounted for approximately $21 billion, with that segment projected to grow to $33 billion by 2028.
NVIDIA captured the lion's share of this market. The company's data center revenue grew from approximately $15 billion in fiscal 2023 to over $100 billion in fiscal 2025, driven overwhelmingly by AI GPU demand. Various analysts estimate NVIDIA's market share for AI training hardware at 80 to 92 percent, depending on the definition and time period.
Hyperscaler custom silicon represents a growing countervailing force. Google, AWS, Meta, and Microsoft have collectively invested over $50 billion in their custom chip programs. While these chips are not sold on the open market, they reduce the hyperscalers' dependence on NVIDIA and increase their negotiating leverage.
The broader competitive landscape includes AMD (projected to reach $5 to $7 billion in annual data center GPU revenue), Intel (repositioning Gaudi as a value alternative), and numerous startups that collectively raised billions in venture capital during 2023 and 2024.
AI chips have become a focal point of geopolitical competition between the United States and China.
In October 2022, the U.S. Department of Commerce imposed sweeping export controls on advanced semiconductors and semiconductor manufacturing equipment destined for China. These rules targeted chips above certain performance thresholds, effectively banning the export of NVIDIA's A100 and H100 to Chinese customers.
NVIDIA responded by designing export-compliant chips (the A800, H800, and later the H20) with reduced interconnect bandwidth or lower compute performance that fell below the controlled thresholds. In October 2023, the Commerce Department tightened the rules further, closing loopholes and capturing these downgraded chips as well.
The Biden administration introduced the "AI Diffusion Rule" in January 2025, establishing global performance thresholds and a tiered country system ("green zone" allies, restricted countries, and embargoed countries). Under this framework, flagship chips like the H100 and H200 remained blocked for China.
Policy shifted again under the Trump administration. In April 2025, exports of NVIDIA's H20 chips to China were halted. However, the administration reversed course in July 2025, allowing H20 shipments to resume. By December 2025, the Trump administration approved the export of NVIDIA H200 chips to approved Chinese customers under licensing conditions, making the H200 the most powerful AI chip ever cleared for export to China.
These export controls have spurred China's domestic chip industry, with companies like Huawei developing the Ascend 910B and 910C processors as alternatives. However, analysts assess that Chinese chips remain one to two generations behind leading NVIDIA and AMD products in performance, partly because Chinese foundries lack access to the most advanced extreme ultraviolet (EUV) lithography equipment from ASML.
The rising power demands of AI accelerators present significant infrastructure challenges. A single NVIDIA B200 consumes up to 1,000 watts, an eight-GPU server (plus networking and support hardware) draws well over 10 kilowatts, and densely packed AI racks can reach 40 to 120 kilowatts. Data centers designed for traditional server workloads (typically 5 to 15 kW per rack) cannot support this density without major retrofits.
Liquid cooling has become increasingly necessary. NVIDIA's GB200 NVL72 server rack, which combines 72 Blackwell GPUs in a single liquid-cooled enclosure, requires direct-to-chip liquid cooling infrastructure. Google adopted liquid cooling starting with TPU v3, and Tesla's Dojo Training Tiles are water-cooled from the start. Meanwhile, Groq's LPU architecture remains air-cooled, which the company highlights as a deployment advantage.
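The density problem is easy to quantify. A rough budget for the configurations above, using the GPU counts and TDP figures cited in this article and an illustrative assumption of roughly 40 percent additional power for CPUs, switches, fans, and power-conversion losses:

```python
def rack_power_kw(gpu_count: int, gpu_watts: float, overhead_fraction: float = 0.4) -> float:
    """Estimate total power: GPU draw plus an assumed overhead for CPUs,
    networking, cooling fans, and power-conversion losses."""
    return gpu_count * gpu_watts * (1.0 + overhead_fraction) / 1000.0

print(f"8x H100 server:              ~{rack_power_kw(8, 700):.0f} kW")
print(f"72x B200 (NVL72-class rack): ~{rack_power_kw(72, 1000):.0f} kW")
print("Traditional rack budget:      5-15 kW")
```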
Several trends are shaping the next generation of AI accelerators:
Lower-precision arithmetic. NVIDIA's Blackwell architecture introduced FP4 support, and multiple vendors are exploring sub-4-bit formats. Lower precision allows more operations per watt but requires careful quantization to maintain model quality; a small quantization sketch appears at the end of this section.
Chiplet and advanced packaging. The B200's dual-die design and AMD's MI300X chiplet approach both use TSMC's CoWoS (Chip-on-Wafer-on-Substrate) packaging to combine multiple dies into a single package. This trend will continue as monolithic die sizes approach the limits of lithography reticles.
Photonic interconnects. Scaling multi-chip systems requires ever-faster interconnects. Google pioneered optical circuit switches in TPU v4 pods, and several startups (Lightmatter, Ayar Labs) are developing silicon photonics for chip-to-chip communication.
Inference-optimized designs. As AI models move from research labs into production, the ratio of inference to training compute is growing rapidly. Chips like Groq's LPU and AWS Inferentia are designed specifically for low-latency, high-throughput inference rather than training.
RISC-V integration. Tenstorrent and others are incorporating open-source RISC-V CPU cores alongside AI accelerator units, enabling more flexible system-on-chip designs without licensing fees to ARM or x86 vendors.
3nm and beyond. TSMC's 3nm process is being adopted by the next generation of AI chips, including AWS Trainium3 (expected late 2025) and AMD's MI350X. Smaller transistors enable higher compute density and better energy efficiency.
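To illustrate the quantization trade-off noted in the lower-precision trend above, the sketch below applies simple symmetric 4-bit quantization to a random weight matrix and measures the reconstruction error. Production schemes (per-group scaling, GPTQ, AWQ, and similar) are considerably more sophisticated; this is only a minimal illustration of the idea.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor 4-bit quantization: integer levels -7..7 plus one FP scale."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32) * 0.02  # toy weight matrix
q, scale = quantize_int4(w)
w_hat = q.astype(np.float32) * scale  # dequantized weights used at compute time

rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
print(f"storage: {w.nbytes/1e6:.0f} MB (FP32) -> ~{q.nbytes/2/1e6:.0f} MB (packed INT4)")
print(f"mean relative reconstruction error: {rel_err:.3f}")
```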