AI accelerator
Last reviewed
Sources
58 citations
Review status
Source-backed
Revision
v8 · 7,115 words
An AI accelerator (also called an AI chip or neural processing unit) is a class of specialized hardware designed to run artificial intelligence workloads, particularly neural network training and inference, far more efficiently than general-purpose CPUs. AI chips achieve their advantage by combining massively parallel architectures, high-bandwidth memory, and low-precision arithmetic units optimized for the matrix and tensor operations that dominate deep learning computation. The dominant types are GPUs (graphics processing units), TPUs (tensor processing units), custom ASICs, and NPUs.
Since the resurgence of deep learning in the early 2010s, AI accelerators have evolved from repurposed graphics cards into a diverse ecosystem of GPUs, tensor processing units, custom ASICs, and FPGAs. The global AI chip market generated an estimated $71 billion in revenue in 2024, a 33 percent increase over 2023, according to Gartner.[18] Growth then accelerated sharply: Gartner estimated that AI processors alone exceeded $200 billion in sales in 2025, when total worldwide semiconductor revenue reached $793 billion.[50] NVIDIA, which holds roughly 80 to 90 percent of the data center AI accelerator market by revenue, became the first company in history to close a trading day valued above $5 trillion on October 29, 2025.[19][28]
AI chips are used to train and serve neural networks, the two phases that dominate modern AI compute. Training adjusts a model's billions of parameters over many passes through large datasets, a process that for a frontier model can require thousands of accelerators working in concert for weeks or months. Inference runs the finished model to generate answers, classifications, or predictions for end users, often at the scale of billions of requests per day. The same architectural strengths, parallel matrix math and high memory bandwidth, serve both phases, though some chips (such as Google Ironwood, AWS Inferentia, and Groq's LPU) are tuned primarily for inference while others (such as AWS Trainium) target training.
Traditional CPUs are designed for serial, general-purpose computation. While they can execute neural network operations, they lack the throughput to train large language models or serve billions of inference requests at acceptable latency.
AI accelerators address this bottleneck through several architectural strategies, chiefly massively parallel compute units, high-bandwidth memory, and low-precision arithmetic.
AI accelerators can be grouped into four broad categories, each with distinct trade-offs in flexibility, performance, and power efficiency.
Originally developed for rendering 3D graphics, GPUs became the default platform for deep learning because their thousands of parallel cores naturally fit the matrix-heavy computation of neural networks. NVIDIA pioneered this shift with the release of its CUDA programming platform in 2006, which gave researchers a way to write general-purpose code that ran on GPU hardware.
Today, data center GPUs from NVIDIA and AMD dominate AI training and inference workloads. GPUs offer a balance of programmability and raw throughput, and the mature software ecosystem (frameworks like PyTorch and TensorFlow, plus libraries and kernel languages such as cuDNN and Triton) makes them the most accessible option for practitioners.
Google's Tensor Processing Units are custom ASICs built specifically for neural network workloads. First deployed internally in 2015, TPUs use a systolic array architecture that excels at dense matrix multiplications.[7] Google makes TPUs available through Google Cloud and uses them internally to train models like Gemini and PaLM.
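The systolic-array idea can be illustrated with a small simulation. The sketch below is a minimal model of an output-stationary array, in which each processing element (PE) keeps one output accumulator while operands stream past it. It is an assumption-laden teaching aid, not a description of the dataflow Google's TPUs actually use, and the function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing C = A x B.

    PE (i, j) owns the accumulator for C[i][j]. Row i of A enters from the
    left delayed by i cycles and column j of B enters from the top delayed
    by j cycles, so operand pair s reaches PE (i, j) at cycle t = s + i + j.
    """
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions must match")

    c = [[0] * m for _ in range(n)]
    # The last pair (s = k - 1) reaches the far-corner PE (n - 1, m - 1)
    # at cycle n + m + k - 3, so the array needs n + m + k - 2 cycles.
    cycles = n + m + k - 2
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                s = t - i - j  # index of the operand pair arriving this cycle
                if 0 <= s < k:
                    c[i][j] += a[i][s] * b[s][j]  # one multiply-accumulate
    return c, cycles


product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, cycles)
```

In this model every PE performs at most one multiply-accumulate per cycle and only ever needs data from its neighbors, which is the property that makes the layout attractive for dense matrix multiplication in hardware.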
Beyond Google's TPUs, a growing number of companies have designed application-specific integrated circuits (ASICs) tailored to AI. These include cloud providers building silicon for their own platforms (AWS Trainium, Meta MTIA, Microsoft Maia) and startups pursuing novel architectures (Cerebras, Groq, SambaNova, Tenstorrent). Custom ASICs can deliver the highest performance per watt for a fixed workload but sacrifice the general-purpose programmability of GPUs.
FPGAs are reconfigurable chips whose logic can be reprogrammed after manufacturing. Companies like Intel (through its Altera division) and AMD (through its Xilinx acquisition) produce FPGAs used for AI inference at the edge and in latency-sensitive applications such as autonomous vehicles, medical imaging, and telecommunications. FPGAs can deliver lower latency than GPUs at small batch sizes and can be tuned for specific model architectures, but they require specialized hardware design expertise and typically deliver lower peak throughput than GPUs or ASICs.
NVIDIA commands roughly 80 to 90 percent of the data center AI accelerator market by revenue as of 2025.[19] This dominance rests on two pillars: consistently leading hardware and a deeply entrenched software ecosystem. "Generative AI is the defining technology of our time," NVIDIA chief executive Jensen Huang said when announcing the Blackwell platform in March 2024. "Blackwell is the engine to power this new industrial revolution."[2]
NVIDIA's data center GPU lineup has progressed through several major architecture generations:
| Architecture | Flagship GPU | Year | Process Node | Transistors | HBM Capacity | Memory Bandwidth | Key FP8/FP16 Performance |
|---|---|---|---|---|---|---|---|
| Ampere | A100 | 2020 | TSMC 7nm | 54.2B | 80 GB HBM2e | 2.0 TB/s | 624 TFLOPS FP16 (sparse) |
| Hopper | H100 | 2022 | TSMC 4N | 80B | 80 GB HBM3 | 3.35 TB/s | 1,979 TFLOPS FP8 |
| Hopper (refresh) | H200 | 2024 | TSMC 4N | 80B | 141 GB HBM3e | 4.8 TB/s | 1,979 TFLOPS FP8 |
| Blackwell | B200 | 2024 | TSMC 4NP | 208B | 192 GB HBM3e | 8.0 TB/s | 4,500 TFLOPS FP8 (dense) |
The A100, launched in 2020, introduced the Ampere architecture with third-generation Tensor Cores and support for sparsity acceleration.[4] It became the workhorse GPU for training large language models during 2021 and 2022.
The H100, based on the Hopper architecture and announced in March 2022, raised memory bandwidth to 3.35 TB/s (about 1.7x the A100's 2.0 TB/s) and introduced the Transformer Engine, which dynamically switches between FP8 and FP16 precision to accelerate transformer-based models.[1] The H100 became the most sought-after chip in the AI industry, with waiting times stretching to months during 2023.
The H200, shipping from Q2 2024, kept the Hopper compute die but upgraded memory to 141 GB of HBM3e with 4.8 TB/s bandwidth, yielding roughly 2x faster inference for memory-bound large language model workloads.[3]
The B200, part of the Blackwell architecture announced at GTC 2024, uses a dual-die chiplet design containing 208 billion transistors.[2] It delivers up to 9 PFLOPS of dense FP4 performance and 4,500 TFLOPS of dense FP8 performance, roughly 2.3x the H100 in FP8. Head-to-head benchmarks showed B200-based systems delivering 2.2x faster Llama 2 70B fine-tuning and 2x faster GPT-3 175B pre-training compared to H100.
At GTC in March 2025, NVIDIA committed to an annual data center cadence and unveiled Blackwell Ultra (B300), which raises HBM3e capacity to 288 GB per package and delivers roughly 15 PFLOPS of dense FP4, about 50 percent more than the B200; GB300 NVL72 rack systems reached customers in the second half of 2025.[23] In the MLPerf Training v5.1 round published in November 2025, the GB300 NVL72 made its benchmark debut, delivering more than 4x the Llama 3.1 405B pretraining performance of an equal number of Hopper GPUs, and NVIDIA set a 10-minute Llama 3.1 405B time-to-train record using more than 5,000 Blackwell GPUs, the first MLPerf Training results computed in FP4 (NVFP4) precision.[24]
The successor platform, Vera Rubin, pairs the Rubin GPU (roughly 50 PFLOPS of dense FP4 and 288 GB of HBM4 per package) with Vera, an 88-core custom Arm CPU.[23] At CES in January 2026, Jensen Huang said Vera Rubin was in full production with shipments expected in the second half of 2026.[25] At GTC on March 16, 2026, NVIDIA positioned Vera Rubin as the foundation of its next platform generation, and Huang said the company expects roughly $1 trillion in cumulative orders for Blackwell and Rubin systems through 2027.[26] The public roadmap extends to Rubin Ultra in 2027, which moves to an NVL576 rack configuration with approximately 100 PFLOPS of FP4 and 1 TB of HBM4e per package, and a Feynman architecture in 2028.[23]
CUDA, introduced in 2006, is NVIDIA's parallel computing platform and programming model. It has grown into the single most important moat in the AI hardware market. With over 4 million developers, more than 3,000 GPU-optimized applications, and native integration into every major machine learning framework, CUDA creates switching costs that competitors struggle to overcome. Practically every deep learning library, from PyTorch to JAX, is optimized for CUDA first, and most research papers assume NVIDIA hardware.
NVIDIA supplements CUDA with specialized libraries: cuDNN for deep learning primitives, TensorRT for inference optimization, cuBLAS for linear algebra, and NCCL for multi-GPU communication. Triton, an open-source language and compiler originally developed at OpenAI, adds a way to write custom GPU kernels in Python.
Google was the first hyperscaler to design a custom AI chip, deploying TPU v1 internally in 2015.[7] The TPU program has progressed through seven generations.
TPU v1 (2015). An inference-only accelerator featuring a 256x256 systolic array running at 700 MHz.[7] It delivered 92 TOPS for 8-bit integer operations and consumed 75 watts. Google reported it achieved 15 to 30x higher throughput than contemporary GPUs on inference workloads while delivering 30 to 80x better TOPS per watt.
TPU v2 (2017). The first TPU capable of both training and inference.[7] Each chip delivered approximately 45 TFLOPS in bfloat16 with 8 GB of HBM. Google introduced the bfloat16 numerical format with TPU v2, which later became an industry standard adopted by NVIDIA, AMD, and Intel.
TPU v3 (2018). Doubled performance over v2 and added liquid cooling to manage increased power density.[7] Each four-chip board provided approximately 420 TFLOPS in bfloat16, with 16 GB of HBM per chip.
TPU v4 (2021). Improved performance by more than 2x over v3, with 275 TFLOPS per chip and 32 GB of HBM per chip. A single v4 pod contained 4,096 chips and introduced optical circuit switches for faster inter-chip communication.[7] Google published a detailed paper on the v4 architecture at ISCA 2023.
TPU v5e (2023). Optimized for cost-efficient inference and fine-tuning. Scaled to 256-chip pods.
TPU v5p (2023). The highest-performance v5 variant, with pods containing 8,960 chips, each delivering about 459 TFLOPS in bfloat16.
TPU v6e / Trillium (2024). Google's sixth-generation TPU, codenamed Trillium, delivered 4.7x improved compute performance per chip over v5e and doubled HBM capacity and bandwidth.[6] Google says a single Trillium cluster can reach 91 exaFLOPS. Trillium provided up to 2.5x better performance per dollar over v5p for dense LLM training and reached general availability in December 2024.
TPU v7 / Ironwood (2025). Google's seventh-generation TPU, codenamed Ironwood, was unveiled at Cloud Next in April 2025 as the company's first TPU marketed primarily for what Google calls the "age of inference" and became generally available in November 2025.[29] "Ironwood is our most powerful, capable and energy efficient TPU yet," said Amin Vahdat, the Google vice president and general manager who oversees ML, systems, and cloud AI. "And it's purpose-built to power thinking, inferential AI models at scale."[57] Each Ironwood chip delivers 4,614 TFLOPS of peak FP8 compute with 192 GB of HBM3e and 7.37 TB/s of memory bandwidth.[30] Ironwood scales to 9,216-chip superpods connected by a 9.6 Tb/s Inter-Chip Interconnect, yielding 42.5 FP8 exaflops per superpod, and Google states it offers more than 4x better per-chip performance than Trillium for both training and inference.[29] The platform anchored one of the largest AI compute agreements to date: in October 2025, Anthropic said it would expand its use of Google Cloud TPUs to up to one million chips, representing well over a gigawatt of capacity coming online in 2026 in a deal worth tens of billions of dollars.[31] Anthropic chief financial officer Krishna Rao said the "expanded capacity ensures we can meet our exponentially growing demand while keeping our models at the cutting edge of the industry."[58]
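Two of the headline figures above can be reproduced from the stated specifications. The sketch below assumes the common convention that one multiply-accumulate counts as two operations, and that a pod's peak is simply the per-chip peak multiplied by the chip count; the helper names are hypothetical, and real sustained throughput is lower than these peaks.

```python
def mac_array_peak_tops(rows, cols, clock_hz):
    """Peak tera-operations per second of a rows x cols MAC array.

    Assumes each multiply-accumulate counts as 2 operations and that every
    cell completes one MAC per clock cycle.
    """
    return rows * cols * 2 * clock_hz / 1e12


def pod_peak_exaflops(chips, tflops_per_chip):
    """Aggregate peak of a pod in exaFLOPS (1 exaFLOPS = 1e6 TFLOPS)."""
    return chips * tflops_per_chip / 1e6


# TPU v1: 256x256 systolic array at 700 MHz, as described above.
tpu_v1_tops = mac_array_peak_tops(256, 256, 700e6)

# Ironwood: 9,216 chips per superpod at 4,614 TFLOPS of FP8 each.
ironwood_exaflops = pod_peak_exaflops(9216, 4614)

print(f"TPU v1 peak: {tpu_v1_tops:.2f} TOPS")
print(f"Ironwood superpod peak: {ironwood_exaflops:.2f} exaFLOPS")
```

The results, roughly 91.75 TOPS and 42.52 exaFLOPS, line up with the quoted 92 TOPS and 42.5 exaflops once rounded.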
AMD is NVIDIA's most significant merchant-silicon competitor in the data center AI market, holding an estimated 5 to 8 percent market share.
Released in late 2023, the MI300X is built on AMD's CDNA 3 architecture using a chiplet design combining 5nm and 6nm dies. It features 192 GB of HBM3 memory with 5.3 TB/s bandwidth, 1,307 TFLOPS of peak FP16 performance, and 2,615 TFLOPS of peak FP8 performance.[8] The MI300X's 192 GB memory capacity (matching the B200 and exceeding the H100's 80 GB) made it competitive for serving very large language models that benefit from fitting more parameters on a single accelerator.
Launched in 2024, the MI325X upgraded memory to 256 GB of HBM3e with 6 TB/s bandwidth while maintaining the CDNA 3 architecture.[9] It is designed for customers deploying larger LLMs and supporting more concurrent inference sessions.
AMD launched the Instinct MI350X and MI355X at its Advancing AI event on June 12, 2025. Built on the CDNA 4 architecture using TSMC's N3P process with 185 billion transistors, the MI350 series carries 288 GB of HBM3e with 8 TB/s of bandwidth, adds FP4 and FP6 data types, and reaches 10.1 PFLOPS of peak FP4 on the liquid-cooled MI355X; AMD claims up to a 4x generation-on-generation increase in AI compute and a 35x leap in inference performance over the MI300 series, with volume shipments beginning in the third quarter of 2025.[32] At the same event, AMD previewed its MI400 series and the double-wide, rack-scale Helios system planned for 2026, with OpenAI chief executive Sam Altman joining Lisa Su on stage as an announced customer.[33] That relationship was formalized on October 6, 2025, when AMD and OpenAI agreed to a 6-gigawatt, multi-generation deployment of Instinct GPUs beginning with 1 gigawatt of MI450 systems in the second half of 2026; AMD issued OpenAI a warrant for up to 160 million AMD shares that vests as deployment milestones are met, and said the deal is expected to generate tens of billions of dollars in revenue.[34]
AMD's open-source ROCm (Radeon Open Compute) platform is the primary alternative to CUDA. ROCm 6.0+ provides native support for PyTorch and TensorFlow, and AMD has invested in HIP (Heterogeneous-computing Interface for Portability), a CUDA-like C++ programming interface with translation tools that allow many CUDA programs to run on AMD hardware with minimal code changes. However, ROCm's ecosystem remains less mature than CUDA's, with fewer optimized libraries and less third-party support.
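One practical consequence of this portability work is visible from Python. PyTorch's ROCm builds expose AMD GPUs through the same `torch.cuda` interface used for NVIDIA hardware and set `torch.version.hip` on those builds. The sketch below uses that to report which backend is active; the function name `detect_gpu_backend` and its return labels are hypothetical, and the code falls back gracefully when PyTorch is not installed.

```python
def detect_gpu_backend():
    """Return 'rocm', 'cuda', 'cpu', or 'no-torch' for this environment."""
    try:
        import torch
    except ImportError:
        return "no-torch"

    # On both CUDA and ROCm builds, GPU availability is reported here.
    if not torch.cuda.is_available():
        return "cpu"

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    if getattr(torch.version, "hip", None):
        return "rocm"
    return "cuda"


backend = detect_gpu_backend()
print(f"Active backend: {backend}")
```

Because the device string stays `"cuda"` on either vendor's hardware, much existing PyTorch code can run on AMD GPUs unchanged, although kernels written directly against CUDA libraries may still need porting.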
Intel's Gaudi 3, manufactured on a 5nm process, is the company's flagship AI accelerator. It features 128 GB of HBM2e memory with 3.67 TB/s bandwidth and delivers up to 1,835 TFLOPS in FP8 via its matrix multiplication engines (MMEs).[10] The chip consumes up to 900 watts (with liquid cooling). Intel positioned Gaudi 3 as a cost-competitive alternative to the H100, launching volume production in Q3 2024. While benchmarks show Gaudi 3 trailing the H100 in some workloads, its lower price point targets customers seeking better price-performance ratios.
In October 2025, Intel announced Crescent Island, a data center GPU designed exclusively for inference. Based on the Xe3P architecture, the card pairs 160 GB of LPDDR5X memory (sidestepping supply-constrained HBM) with an air-cooled design targeting roughly 350 watts, and Intel plans customer sampling in the second half of 2026.[35]
The largest cloud providers have each invested in designing their own AI chips, primarily to reduce dependence on NVIDIA and to optimize for their specific workloads.
Amazon Web Services has developed two chip families. Inferentia (first generation in 2019, Inferentia2 in 2022) targets inference workloads, with each Inferentia2 chip providing 32 GB of HBM. Trainium targets training. The second-generation Trainium2, which became generally available in December 2024, delivers up to 1.3 PFLOPS of dense FP8 per chip with 96 GB of HBM.[15] A single Trn2 instance combines 16 Trainium2 chips for 20.8 PFLOPS of compute. AWS claims Trn2 instances offer 30 to 40 percent better price-performance than comparable GPU-based EC2 instances.[15]
AWS also announced Trainium3, a 3nm chip then expected in late 2025, with 2.52 PFLOPS per chip and 144 GB of HBM3e.[15] It arrived on schedule: at re:Invent on December 2, 2025, AWS made Trn3 UltraServers generally available, scaling to 144 Trainium3 chips per UltraServer (up to 362 FP8 PFLOPS) and delivering up to 4.4x the compute performance and 4x the energy efficiency of Trn2 UltraServers, while confirming that a Trainium4 chip is in development.[36]
In late October 2025, AWS activated Project Rainier, a cluster of nearly 500,000 Trainium2 chips spread across multiple US data centers that Anthropic uses to train and serve its Claude models; AWS said it expected Anthropic to use more than one million Trainium2 chips by the end of 2025.[37]
Meta developed its Meta Training and Inference Accelerator (MTIA) for internal use. The first generation (MTIA v1) used a 7nm process and consumed 25 watts.[16] MTIA v2, announced in April 2024, moved to 5nm and features a 64-PE (processing element) array delivering 354 TOPS (INT8) and 177 TFLOPS (FP16), with 128 GB of LPDDR5 memory and a 90-watt power envelope.[16] In March 2026, Meta announced four new generations: the MTIA 300, 400, 450, and 500, scheduled for deployment over the following two years.[38] The MTIA 300, optimized for ranking and recommendation models, was already in production for training when the roadmap was disclosed, and the MTIA 400, built around a 72-accelerator scale-up domain for generative AI workloads, had completed lab testing; Meta said the MTIA 450 and 500 are slated for mass deployment in early 2027 and that compute throughput grows 25x and HBM bandwidth 4.5x across the four chips.[38]
Microsoft revealed its first custom AI accelerator, Maia 100, in late 2023. The chip is fabricated on TSMC's 5nm process with a die area of approximately 820 mm². It features 64 GB of HBM2e memory with 1.8 TB/s bandwidth and 500 MB of on-chip cache, and supports up to 700 watts TDP.[17] Maia 100 integrates into Azure data centers and supports PyTorch models through a custom backend. Microsoft designed the chip to handle both training and inference for its Azure OpenAI Service and Copilot products.
Its successor, Maia 200, was announced on January 26, 2026. Fabricated on TSMC's 3nm process with more than 140 billion transistors, each Maia 200 combines 216 GB of HBM3e at 7 TB/s with 272 MB of on-chip SRAM and delivers more than 10 PFLOPS of FP4 (and over 5 PFLOPS of FP8) within a 750-watt power envelope.[39] Microsoft says the inference-focused chip provides three times the FP4 performance of Amazon's Trainium3 and 30 percent better performance per dollar than the newest hardware in its fleet, with initial deployments in US Azure data centers serving OpenAI's GPT-5.2 models, Microsoft Foundry, and Copilot workloads.[39]
Tesla designed the Dojo D1 chip specifically for training its autonomous driving neural networks. Fabricated on TSMC's 7nm process, the D1 contains 50 billion transistors in a 645 mm² die. Each chip delivers 362 TFLOPS at FP16/CFP8 and has 1.25 MB of SRAM per functional unit with no external DRAM. Twenty-five D1 chips are packaged into a water-cooled Training Tile that achieves 9 PFLOPS at BF16 while consuming 15 kilowatts.[22] Tesla had reported working on next-generation Dojo chips manufactured at more advanced process nodes, but that effort ended in August 2025, when Elon Musk confirmed Tesla had shut down the Dojo program and disbanded its team, calling the successor chip Dojo 2 "an evolutionary dead end"; Tesla instead concentrated on its AI5 and AI6 chips, manufactured by TSMC and Samsung respectively, weeks after signing a $16.5 billion chip supply agreement with Samsung in July 2025.[40] Dojo lead Peter Bannon left the company, and roughly 20 team members departed to found the chip and infrastructure startup DensityAI.[41]
Several well-funded startups have introduced novel architectures that challenge the GPU-centric status quo.
Cerebras takes the most radical approach in the industry: building an entire processor from a single silicon wafer. The Wafer-Scale Engine 3 (WSE-3), announced in March 2024, is fabricated on TSMC's 5nm process and contains 4 trillion transistors across 46,225 mm² of silicon, making it the largest chip ever built. It features 900,000 AI-optimized compute cores, 44 GB of on-chip SRAM, and delivers 125 PFLOPS of peak AI performance.[11] On-chip memory bandwidth exceeds 20 PB/s because memory is distributed locally across the tile array rather than accessed through external HBM stacks. The CS-3 system built around the WSE-3 can train models up to 24 trillion parameters.[11] TIME Magazine recognized the WSE-3 as a Best Invention of 2024.
In September 2025, Cerebras raised a $1.1 billion Series G led by Fidelity Management and Research and Atreides Management at an $8.1 billion post-money valuation, and days later withdrew the IPO registration it had originally filed in September 2024, a listing long delayed by a CFIUS review of UAE-based G42's stake in the company.[42] Cerebras filed for an IPO again in April 2026.[43]
Groq's Language Processing Unit (LPU) is designed specifically for inference, emphasizing deterministic, low-latency execution over raw training throughput. The LPU uses hundreds of megabytes of on-chip SRAM as primary storage (not cache) and employs a compiler-driven, single-core architecture that eliminates the scheduling overhead found in GPUs.[12] In benchmarks, Groq's systems run Llama 3 70B at over 1,660 tokens per second using speculative decoding, or 280 to 300 tokens per second in standard mode. LPUs connect via a plesiosynchronous protocol that allows hundreds of chips to function as a single logical processor.[12] The system is air-cooled, simplifying data center deployment.
Groq raised $750 million in September 2025 at a $6.9 billion post-money valuation in a round led by Disruptive, with participation from BlackRock, Neuberger Berman, Deutsche Telekom Capital Partners, Samsung, and Cisco.[44] Three months later, on December 24, 2025, NVIDIA agreed to a non-exclusive license of Groq's inference technology in a transaction CNBC reported was worth about $20 billion, the largest deal in NVIDIA's history; founder Jonathan Ross, president Sunny Madra, and much of the team joined NVIDIA, while Groq continues to operate as an independent company under new chief executive Simon Edwards.[45][46]
SambaNova's Reconfigurable Dataflow Unit (RDU) uses a dataflow architecture rather than the von Neumann model. The SN40L, fabricated on TSMC's 5nm process, features 1,040 Pattern Compute Units delivering 638 TFLOPS in BF16.[13] Its defining feature is a three-tier memory hierarchy: 520 MB of on-chip SRAM, 64 GB of co-packaged HBM, and up to 1.5 TB of pluggable DDR DRAM. This hierarchy allows a single SN40L system to serve models up to 5 trillion parameters without distributing across multiple nodes.[13]
SambaNova raised $350 million in an oversubscribed Series E announced in February 2026. The round was led by Vista Equity Partners and Cambium Capital with participation from Intel Capital, and accompanied a planned inference collaboration with Intel (following earlier reports that Intel had explored acquiring the company) along with the unveiling of SambaNova's next-generation SN50 chip.[47]
Bristol-based Graphcore developed the Intelligence Processing Unit (IPU), which features a bulk synchronous parallel execution model and large amounts of distributed on-chip SRAM (900 MB in the MK2 GC200). Each GC200 IPU has 1,472 cores running 8,832 parallel threads and delivers 250 TFLOPS at FP16.[21] Graphcore was acquired by SoftBank in July 2024 for approximately $500 million. Under SoftBank's ownership, Graphcore has announced a roadmap toward an "Ultra Intelligence" AI supercomputer.
Led by CEO Jim Keller (previously of Apple, AMD, Intel, and Tesla), Tenstorrent builds RISC-V-based AI accelerators using an open-source philosophy. The Wormhole chip features up to 128 Tensix cores, 24 GB of GDDR6 memory with 576 GB/s bandwidth, and delivers up to 524 TFLOPS at FP8 in the dual-chip n300d configuration.[14] The next-generation Blackhole chip, expected in 2025, moves to a 6nm process with 140 Tensix++ cores, 774 TFLOPS (FP8), 8 channels of GDDR6, and 16 RISC-V CPU cores. Tenstorrent also licenses its RISC-V CPU cores to other chip designers, creating a secondary business alongside its accelerator products.
When comparing AI accelerators, several metrics matter:
TOPS (tera operations per second). Counts trillions of operations per second, typically integer operations at INT8 precision. Commonly used for edge and inference-focused chips.
TFLOPS (teraFLOPS). Counts trillions of floating-point operations per second. Reported at various precisions: FP32 for traditional HPC, FP16/BF16 for mixed-precision training, and FP8/FP4 for next-generation inference.
Memory capacity. The total amount of HBM, GDDR, or SRAM available on the accelerator. Larger models require more memory to store weights, activations, and optimizer states.
Memory bandwidth. Measured in TB/s, this determines how quickly data can be fed to the compute units. For inference of large language models (which are often memory-bandwidth-bound), this metric can matter more than raw TFLOPS.
Interconnect bandwidth. The speed of chip-to-chip communication, which determines how efficiently workloads can be distributed across multiple accelerators.
Performance per watt. Increasingly important as data centers face power constraints. Custom ASICs often lead in this metric because their fixed-function designs eliminate wasted transistors.
Total cost of ownership (TCO). Combines chip price, power consumption, cooling costs, and software development effort. A chip with lower peak TFLOPS but better TCO may be the more practical choice.
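The interaction between precision, memory capacity, and memory bandwidth can be made concrete with a back-of-the-envelope estimate. The sketch below assumes a hypothetical 70-billion-parameter model and the simplification that single-stream decoding reads every weight once per generated token, so the decode rate is bounded by bandwidth divided by weight size. It ignores activations, the KV cache, batching, and multi-chip sharding, and the function names are illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_gb(params_billions, precision):
    """Weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * BYTES_PER_PARAM[precision]


def decode_tokens_per_s_bound(params_billions, precision, bandwidth_tb_s):
    """Upper bound on tokens/s if all weights are read once per token."""
    weight_bytes = weight_gb(params_billions, precision) * 1e9
    return bandwidth_tb_s * 1e12 / weight_bytes


fp16_gb = weight_gb(70, "fp16")  # hypothetical 70B model in FP16
fp8_gb = weight_gb(70, "fp8")    # the same model in FP8

# Bandwidth figures as quoted in this article: H100 3.35 TB/s, H200 4.8 TB/s.
h100_bound = decode_tokens_per_s_bound(70, "fp8", 3.35)
h200_bound = decode_tokens_per_s_bound(70, "fp8", 4.8)

print(f"FP16 weights: {fp16_gb:.0f} GB, FP8 weights: {fp8_gb:.0f} GB")
print(f"FP8 decode bound: H100 {h100_bound:.1f} tok/s, H200 {h200_bound:.1f} tok/s")
```

Under these assumptions, halving the precision halves the footprint and doubles the bandwidth-limited decode rate, which is why low-precision formats and faster memory matter so much for inference even when peak TFLOPS is unchanged.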
The following table compares key specifications of major data center AI accelerators available as of early 2025.
| Accelerator | Vendor | Architecture | Process Node | Memory | Memory Bandwidth | Peak FP8 / FP16 | TDP | Year |
|---|---|---|---|---|---|---|---|---|
| A100 SXM | NVIDIA | Ampere | 7nm | 80 GB HBM2e | 2.0 TB/s | 624 TFLOPS FP16 (sparse) | 400W | 2020 |
| H100 SXM | NVIDIA | Hopper | 4nm (4N) | 80 GB HBM3 | 3.35 TB/s | 1,979 TFLOPS FP8 | 700W | 2022 |
| H200 SXM | NVIDIA | Hopper | 4nm (4N) | 141 GB HBM3e | 4.8 TB/s | 1,979 TFLOPS FP8 | 700W | 2024 |
| B200 | NVIDIA | Blackwell | 4nm (4NP) | 192 GB HBM3e | 8.0 TB/s | 4,500 TFLOPS FP8 | 1,000W | 2024 |
| MI300X | AMD | CDNA 3 | 5nm/6nm | 192 GB HBM3 | 5.3 TB/s | 2,615 TFLOPS FP8 | 750W | 2023 |
| MI325X | AMD | CDNA 3 | 5nm/6nm | 256 GB HBM3e | 6.0 TB/s | 1,307 TFLOPS FP16 | 750W | 2024 |
| Gaudi 3 | Intel | Gaudi | 5nm | 128 GB HBM2e | 3.67 TB/s | 1,835 TFLOPS FP8 | 900W | 2024 |
| TPU v4 | Google | TPU | N/A | 32 GB HBM | N/A | 275 TFLOPS BF16 | N/A | 2021 |
| TPU v6e (Trillium) | Google | TPU | N/A | N/A | N/A | 4.7x v5e per chip | N/A | 2024 |
| Trainium2 | AWS | Custom | N/A | 96 GB HBM | 2.9 TB/s | 1,300 TFLOPS FP8 | N/A | 2024 |
| WSE-3 | Cerebras | Wafer-Scale | 5nm | 44 GB SRAM | 20+ PB/s (on-chip) | 125 PFLOPS peak | N/A | 2024 |
| SN40L | SambaNova | Dataflow RDU | 5nm | 64 GB HBM + 1.5 TB DDR | N/A | 638 TFLOPS BF16 | N/A | 2023 |
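As a rough illustration of the performance-per-watt metric, the sketch below divides the peak FP8 figure by the TDP for the rows of the table that list both. These are vendor peak numbers rather than measured efficiency, so the ranking is indicative only; the dictionary and function names are hypothetical.

```python
# (peak FP8 TFLOPS, TDP in watts), taken from the comparison table above.
SPECS = {
    "H100 SXM": (1979, 700),
    "B200": (4500, 1000),
    "MI300X": (2615, 750),
    "Gaudi 3": (1835, 900),
}


def tflops_per_watt(peak_tflops, tdp_watts):
    """Peak TFLOPS delivered per watt of rated board power."""
    return peak_tflops / tdp_watts


# Sort accelerators from most to least efficient on this peak-based measure.
ranked = sorted(
    ((name, tflops_per_watt(*spec)) for name, spec in SPECS.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, value in ranked:
    print(f"{name}: {value:.2f} peak FP8 TFLOPS per watt")
```

A fuller comparison would use sustained throughput on a real workload and wall-plug power including cooling, which is closer to what the total-cost-of-ownership metric captures.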
A second wave of accelerators was introduced or reached availability during 2025 and 2026:
| Accelerator | Vendor | Memory | Headline compute | Availability |
|---|---|---|---|---|
| B300 (Blackwell Ultra) | NVIDIA | 288 GB HBM3e | 15 PFLOPS dense FP4 | 2H 2025 [23] |
| MI355X | AMD | 288 GB HBM3e | 10.1 PFLOPS FP4 | 2H 2025 [32] |
| TPU v7 (Ironwood) | Google | 192 GB HBM3e | 4,614 TFLOPS FP8 | GA November 2025 [30] |
| Trainium3 | AWS | 144 GB HBM3e | 2.52 PFLOPS FP8 | GA December 2025 [36] |
| Maia 200 | Microsoft | 216 GB HBM3e | 10+ PFLOPS FP4 | January 2026 [39] |
| AI200 | Qualcomm | 768 GB LPDDR per card | Rack-scale inference solution | 2026 [49] |
| Rubin | NVIDIA | 288 GB HBM4 | 50 PFLOPS dense FP4 | 2H 2026 [23] |
The AI accelerator market has experienced explosive growth. Gartner forecasted worldwide AI semiconductor revenue of $71 billion for 2024, representing a 33 percent increase over 2023.[18] Within this figure, AI accelerators used in servers accounted for approximately $21 billion, with that segment projected to grow to $33 billion by 2028.[18]
NVIDIA captured the lion's share of this market. The company's data center revenue grew from approximately $15 billion in fiscal 2023 to over $100 billion in fiscal 2025, driven overwhelmingly by AI GPU demand.[27] Various analysts estimate NVIDIA's market share for AI training hardware at 80 to 92 percent, depending on the definition and time period.[19]
Hyperscaler custom silicon represents a growing countervailing force. Google, AWS, Meta, and Microsoft have collectively invested over $50 billion in their custom chip programs. While these chips are not sold on the open market, they reduce the hyperscalers' dependence on NVIDIA and increase their negotiating leverage.
The broader competitive landscape includes AMD (projected to reach $5 to $7 billion in annual data center GPU revenue), Intel (repositioning Gaudi as a value alternative), and numerous startups that collectively raised billions in venture capital during 2023 and 2024.
Growth accelerated sharply through 2025 and into 2026. NVIDIA's data center revenue reached $193.7 billion in fiscal 2026 (the year ended January 2026), up 68 percent year over year, out of $215.9 billion in total company revenue.[27] On October 29, 2025, NVIDIA became the first company in history to close a trading day valued above $5 trillion, roughly three months after first crossing $4 trillion.[28] Gartner estimated that worldwide semiconductor revenue grew 21 percent in 2025 to $793 billion, with AI processors exceeding $200 billion in sales and high-bandwidth memory surpassing $30 billion, and the firm projects total semiconductor revenue above $1.3 trillion in 2026 with AI-related chips contributing roughly 30 percent.[50][51] Gartner also noted that NVIDIA became the first semiconductor vendor to surpass $100 billion in annual chip sales, contributing more than a third of the industry's 2025 growth.[50]
Custom accelerators also moved from internal projects to headline supply deals. On October 13, 2025, OpenAI and Broadcom announced a collaboration to deploy 10 gigawatts of OpenAI-designed AI accelerators and rack systems networked entirely over Broadcom Ethernet, with deployments targeted to begin in the second half of 2026 and complete by the end of 2029.[48] New merchant entrants appeared as well: on October 27, 2025, Qualcomm announced its entry into the data center market with the AI200 (2026) and AI250 (2027), liquid-cooled, rack-scale inference systems built on Hexagon-derived NPUs; the AI200 supports 768 GB of LPDDR memory per accelerator card, the AI250 introduces a near-memory computing architecture that Qualcomm says delivers more than 10x higher effective memory bandwidth, and Saudi Arabia's HUMAIN plans to deploy 200 megawatts of the systems starting in 2026.[49]
AI chips have become a focal point of geopolitical competition between the United States and China.
In October 2022, the U.S. Department of Commerce imposed sweeping export controls on advanced semiconductors and semiconductor manufacturing equipment destined for China. These rules targeted chips above certain performance thresholds, effectively banning the export of NVIDIA's A100 and H100 to Chinese customers.[20]
NVIDIA responded by designing export-compliant chips (the A800, H800, and later the H20) with reduced interconnect bandwidth or lower compute performance that fell below the controlled thresholds. In October 2023, the Commerce Department tightened the rules further, closing loopholes and capturing these downgraded chips as well.[20]
The Biden administration introduced the "AI Diffusion Rule" in January 2025, establishing global performance thresholds and a tiered country system ("green zone" allies, restricted countries, and embargoed countries). Under this framework, flagship chips like the H100 and H200 remained blocked for China.[20] The Trump administration's Commerce Department rescinded the rule in May 2025, days before its compliance deadline, while simultaneously issuing guidance stating that use of Huawei Ascend chips anywhere in the world risks violating US export controls.[52]
Chip-specific export decisions under the Trump administration also shifted repeatedly during 2025. In April 2025, exports of NVIDIA's H20 chips to China were halted.[20] The administration reversed course in July 2025, allowing H20 shipments to resume. In December 2025, it approved the export of NVIDIA H200 chips to approved Chinese customers under licensing conditions, making the H200 the most powerful AI chip ever cleared for export to China.
The reversals carried unusual conditions and provoked countermoves from Beijing. In August 2025, NVIDIA and AMD agreed to pay the US government 15 percent of revenue from sales of H20 and MI308 chips to China in exchange for export licenses, according to reporting first published by the Financial Times.[53] In November 2025, Reuters reported that China had barred foreign AI chips from new state-funded data center projects, requiring facilities less than 30 percent complete to remove or cancel orders for NVIDIA, AMD, and Intel hardware.[54]
These export controls have spurred China's domestic chip industry, with companies like Huawei developing the Ascend 910B and 910C processors as alternatives. However, analysts assess that Chinese chips remain one to two generations behind leading NVIDIA and AMD products in performance, partly because Chinese foundries lack access to the most advanced extreme ultraviolet (EUV) lithography equipment from ASML.
Huawei has nonetheless committed to a long-range plan. At its Huawei Connect conference in September 2025, the company published its first multi-year Ascend roadmap: the Ascend 950PR (first quarter of 2026) and 950DT (fourth quarter of 2026), which introduce Huawei's first self-developed high-bandwidth memory, followed by the Ascend 960 in 2027 and the Ascend 970 in 2028, alongside Atlas SuperPoD clusters designed to scale to 8,192 and eventually 15,488 chips.[55]
The rising power demands of AI accelerators present significant infrastructure challenges. A single NVIDIA B200 consumes up to 1,000 watts, so the GPUs in one eight-GPU server alone draw about 8 kilowatts, and a rack filled with such servers (plus networking and support hardware) can draw 40 to 120 kilowatts. Data centers designed for traditional server workloads (typically 5 to 15 kW per rack) cannot support this density without major retrofits.
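As a rough illustration of how per-GPU power adds up to rack-level draw, the sketch below multiplies it out. Only the 1,000-watt per-GPU figure comes from the text. The server counts and the 50 percent overhead for CPUs, networking, fans, and power conversion are hypothetical values chosen for illustration, not measurements of any real system.

```python
def rack_power_kw(gpu_watts, gpus_per_server, servers_per_rack, overhead_fraction):
    """Estimate rack power draw in kilowatts.

    overhead_fraction is the extra draw from CPUs, networking, fans, and
    power conversion, expressed as a fraction of total GPU power. It is an
    illustrative assumption, not a vendor specification.
    """
    gpu_total_watts = gpu_watts * gpus_per_server * servers_per_rack
    return gpu_total_watts * (1.0 + overhead_fraction) / 1000.0


# Hypothetical configurations: one eight-GPU server versus four per rack.
one_server = rack_power_kw(1000, 8, 1, 0.5)
four_servers = rack_power_kw(1000, 8, 4, 0.5)

print(f"1 server per rack:  {one_server:.1f} kW")
print(f"4 servers per rack: {four_servers:.1f} kW")
```

Under these assumptions, a single eight-GPU server already sits near the top of a traditional 5 to 15 kW rack budget, and a rack holding several of them falls well inside the 40 to 120 kW range.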
Liquid cooling has become increasingly necessary. NVIDIA's GB200 NVL72 server rack, which combines 72 Blackwell GPUs in a single liquid-cooled enclosure, requires direct-to-chip liquid cooling infrastructure.[2] Google adopted liquid cooling starting with TPU v3, and Tesla's Dojo Training Tiles were water-cooled from the start.[7] Groq's LPU architecture, by contrast, remains air-cooled, which the company highlights as a deployment advantage.
Several trends are shaping the next generation of AI accelerators:
Lower-precision arithmetic. NVIDIA's Blackwell architecture introduced FP4 support, and multiple vendors are exploring sub-4-bit formats.[2] Lower precision allows more operations per watt but requires careful quantization to maintain model quality.
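To make the quantization trade-off concrete, the following minimal sketch rounds a small block of values to a 4-bit floating-point grid. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit) and a single unconstrained floating-point scale per block. The function name, the sample weights, and the scaling scheme are illustrative assumptions, not a description of how any particular chip implements FP4.

```python
# Magnitudes representable in a 4-bit E2M1 floating-point format
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block_fp4(values):
    """Round a block of floats to FP4 values using one shared scale.

    Returns (scale, quantized). Each quantized entry is a representable
    FP4 value; multiplying it by scale gives the approximation of the input.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 1.0, [0.0 for _ in values]
    # Map the largest magnitude in the block onto the largest FP4 value.
    scale = max_abs / FP4_MAGNITUDES[-1]
    quantized = []
    for v in values:
        target = abs(v) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        quantized.append(nearest if v >= 0 else -nearest)
    return scale, quantized


# Hypothetical weights, chosen only to show the rounding behavior.
weights = [0.02, -0.75, 0.31, 1.2, -0.09, 0.6]
scale, q = quantize_block_fp4(weights)
restored = [scale * x for x in q]
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("scale:", scale)
print("quantized:", q)
print("max abs error:", max_error)
```

Small values collapse to zero or to the nearest coarse step, which is why the choice of scale and block size matters for model quality. Production formats also constrain the block size and how the scale itself is encoded, which this sketch does not model.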
Chiplet and advanced packaging. The B200's dual-die design and AMD's MI300X chiplet approach both use TSMC's CoWoS (Chip-on-Wafer-on-Substrate) packaging to combine multiple dies into a single package. This trend will continue as monolithic die sizes approach the limits of lithography reticles.
Photonic interconnects. Scaling multi-chip systems requires ever-faster interconnects. Google pioneered optical circuit switches in TPU v4 pods, and several startups (Lightmatter, Ayar Labs) are developing silicon photonics for chip-to-chip communication.[7]
Inference-optimized designs. As AI models move from research labs into production, the ratio of inference to training compute is growing rapidly. Chips like Groq's LPU and AWS Inferentia are designed specifically for low-latency, high-throughput inference rather than training. The trend accelerated through 2025 and 2026: Google marketed Ironwood explicitly for the "age of inference," Microsoft built Maia 200 around FP4 inference throughput, and Qualcomm's AI200 and AI250 target inference exclusively.[29][39][49]
RISC-V integration. Tenstorrent and others are incorporating open-source RISC-V CPU cores alongside AI accelerator units, enabling more flexible system-on-chip designs without licensing fees to ARM or x86 vendors.
3nm and beyond. TSMC's 3nm process is being adopted by the next generation of AI chips, including AWS Trainium3 (expected late 2025) and AMD's MI350X.[15] Smaller transistors enable higher compute density and better energy efficiency.
HBM4 memory. The next high-bandwidth memory generation arrived in September 2025, when SK hynix announced that it had completed development of HBM4 and had become the first supplier ready for mass production. HBM4 doubles the interface width to 2,048 I/O terminals, runs at more than 10 Gbps per pin, and improves power efficiency by roughly 40 percent over the prior generation.[56] NVIDIA's Rubin generation is designed around HBM4, with 288 GB per GPU package.[23]
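The HBM4 interface figures translate into per-stack bandwidth with simple arithmetic, sketched below. The 2,048 I/O count and the 10 Gbps per-pin rate come from the text, where 10 Gbps is a lower bound. The 1,024-wide comparison follows from the statement that HBM4 doubles the interface, and running it at the same pin rate is an assumption made only to isolate the effect of width. The function name is hypothetical.

```python
def stack_bandwidth_gb_per_s(io_count, gbps_per_pin):
    """Peak bandwidth of one memory stack in gigabytes per second."""
    return io_count * gbps_per_pin / 8  # 8 bits per byte


# HBM4 as described in the text: 2,048 I/O terminals at 10 Gbps per pin.
hbm4 = stack_bandwidth_gb_per_s(2048, 10)

# Half-width interface at the same pin rate (an assumption for comparison).
half_width = stack_bandwidth_gb_per_s(1024, 10)

print(f"HBM4 stack:       {hbm4:.0f} GB/s")
print(f"Half-width stack: {half_width:.0f} GB/s")
```

At exactly 10 Gbps per pin this works out to 2,560 GB/s, or about 2.56 TB/s per stack. Because the quoted pin rate is a floor, the real figure would be somewhat higher.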