An AI accelerator (also called an AI chip or neural processing unit) is a class of specialized hardware designed to perform artificial intelligence workloads, particularly neural network training and inference, far more efficiently than general-purpose CPUs. These processors achieve their performance advantage by incorporating massively parallel architectures, high-bandwidth memory, and low-precision arithmetic units optimized for the matrix and tensor operations that dominate deep learning computation.
Since the resurgence of deep learning in the early 2010s, AI accelerators have evolved from repurposed graphics cards into a diverse ecosystem of GPUs, tensor processing units, custom ASICs, and FPGAs. The global AI chip market generated an estimated $71 billion in revenue in 2024 according to Gartner, and continues to grow at a rapid pace as demand for AI compute shows no sign of slowing.
Traditional CPUs are designed for serial, general-purpose computation. While they can execute neural network operations, they lack the throughput to train large language models or serve billions of inference requests at acceptable latency. A single transformer training run for a frontier model can require thousands of accelerators working in concert for weeks or months.
AI accelerators address this bottleneck through several architectural strategies: massively parallel compute arrays, high-bandwidth on-package memory, low-precision arithmetic units, and fast chip-to-chip interconnects that allow thousands of devices to operate as a single system.
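The throughput gap between a CPU and an accelerator is easy to observe with a single large matrix multiplication. The sketch below is a minimal illustration, assuming a machine with PyTorch installed and a CUDA-capable GPU available; the exact speedup depends entirely on the hardware.

```python
import time
import torch

def time_matmul(device: str, n: int = 4096, dtype=torch.float16, iters: int = 10) -> float:
    """Return average seconds per n x n matrix multiplication on the given device."""
    a = torch.randn(n, n, device=device, dtype=dtype)
    b = torch.randn(n, n, device=device, dtype=dtype)
    torch.matmul(a, b)  # warm-up so lazy initialization does not pollute the timing
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work to finish
    return (time.perf_counter() - start) / iters

cpu_time = time_matmul("cpu", dtype=torch.float32, iters=3)  # many CPUs lack fast FP16 paths
if torch.cuda.is_available():
    gpu_time = time_matmul("cuda")
    print(f"CPU: {cpu_time*1e3:.1f} ms  GPU: {gpu_time*1e3:.1f} ms  speedup: {cpu_time/gpu_time:.0f}x")
else:
    print(f"CPU only: {cpu_time*1e3:.1f} ms per matmul")
```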
AI accelerators can be grouped into four broad categories, each with distinct trade-offs in flexibility, performance, and power efficiency.
Originally developed for rendering 3D graphics, GPUs became the default platform for deep learning because their thousands of parallel cores naturally fit the matrix-heavy computation of neural networks. NVIDIA pioneered this shift with the release of its CUDA programming platform in 2006, which gave researchers a way to write general-purpose code that ran on GPU hardware.
Today, data center GPUs from NVIDIA and AMD dominate AI training and inference workloads. GPUs offer a balance of programmability and raw throughput, and the mature software ecosystem (frameworks like PyTorch and TensorFlow, plus vendor-specific libraries such as cuDNN and Triton) makes them the most accessible option for practitioners.
Google's Tensor Processing Units are custom ASICs built specifically for neural network workloads. First deployed internally in 2015, TPUs use a systolic array architecture that excels at dense matrix multiplications. Google makes TPUs available through Google Cloud and uses them internally to train models like Gemini and PaLM.
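A systolic array computes a matrix product by streaming operands through a fixed grid of multiply-accumulate units, so each value is fetched from memory once and then reused as it marches across the array. The following sketch is a conceptual, software-only simulation of an output-stationary array, not Google's actual design: processing element (i, j) accumulates output C[i][j], and the k-th operand pair reaches it at cycle i + j + k.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Simulate C = A @ B on an output-stationary N x N systolic array."""
    N = A.shape[0]
    C = np.zeros((N, N))
    # The product finishes after 3N - 2 cycles: operands are skewed in time so
    # that A[i, k] and B[k, j] meet at PE (i, j) on cycle i + j + k.
    for cycle in range(3 * N - 2):
        for i in range(N):
            for j in range(N):
                k = cycle - i - j
                if 0 <= k < N:
                    C[i, j] += A[i, k] * B[k, j]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
assert np.allclose(systolic_matmul(A, B), A @ B)

# Peak throughput scales with the array size: TPU v1's 256x256 array at 700 MHz
# performs 256 * 256 multiply-accumulates (2 ops each) per cycle,
# i.e. 256 * 256 * 2 * 0.7e9 ~= 92e12 ops/s, the 92 TOPS figure cited below.
```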
Beyond Google's TPUs, a growing number of companies have designed application-specific integrated circuits (ASICs) tailored to AI. These include cloud providers building silicon for their own platforms (AWS Trainium, Meta MTIA, Microsoft Maia) and startups pursuing novel architectures (Cerebras, Groq, SambaNova, Tenstorrent). Custom ASICs can deliver the highest performance per watt for a fixed workload but sacrifice the general-purpose programmability of GPUs.
FPGAs are reconfigurable chips whose logic can be reprogrammed after manufacturing. Companies like Intel (through its Altera division) and AMD (through its Xilinx acquisition) produce FPGAs used for AI inference at the edge and in latency-sensitive applications such as autonomous vehicles, medical imaging, and telecommunications. FPGAs can offer lower latency than GPUs at small batch sizes and can be tuned for specific model architectures, but they require specialized hardware design expertise and typically deliver lower peak throughput than GPUs or ASICs.
NVIDIA commands roughly 80 to 90 percent of the data center AI accelerator market by revenue as of 2025. This dominance rests on two pillars: consistently leading hardware and a deeply entrenched software ecosystem.
NVIDIA's data center GPU lineup has progressed through several major architecture generations:
| Architecture | Flagship GPU | Year | Process Node | Transistors | HBM Capacity | Memory Bandwidth | Key FP8/FP16 Performance |
|---|---|---|---|---|---|---|---|
| Ampere | A100 | 2020 | TSMC 7nm | 54.2B | 80 GB HBM2e | 2.0 TB/s | 624 TFLOPS FP16 (sparse) |
| Hopper | H100 | 2022 | TSMC 4N | 80B | 80 GB HBM3 | 3.35 TB/s | 1,979 TFLOPS FP8 |
| Hopper (refresh) | H200 | 2024 | TSMC 4N | 80B | 141 GB HBM3e | 4.8 TB/s | 1,979 TFLOPS FP8 (same compute die as H100) |
| Blackwell | B200 | 2024 | TSMC 4NP | 208B | 192 GB HBM3e | 8.0 TB/s | 4,500 TFLOPS FP8 (dense) |
The A100, launched in 2020, introduced the Ampere architecture with third-generation Tensor Cores and support for sparsity acceleration. It became the workhorse GPU for training large language models during 2021 and 2022.
The H100, based on the Hopper architecture and announced in March 2022, raised memory bandwidth to 3.35 TB/s and introduced the Transformer Engine, which dynamically switches between FP8 and FP16 precision to accelerate transformer-based models. The H100 became the most sought-after chip in the AI industry, with waiting times stretching to months during 2023.
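Because FP8 has a very narrow dynamic range, FP8 execution relies on per-tensor scaling: a tensor is rescaled so its largest magnitude fits the format before casting, and the scale is undone afterwards. The sketch below illustrates the idea with the E4M3 variant (maximum representable magnitude 448), assuming a recent PyTorch build that exposes torch.float8_e4m3fn; it is a simplified illustration of the concept, not the Transformer Engine's actual implementation.

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_roundtrip(x: torch.Tensor) -> torch.Tensor:
    """Scale a tensor into E4M3 range, cast to FP8, then undo the scaling."""
    scale = E4M3_MAX / x.abs().max().clamp(min=1e-12)
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)   # quantize
    return x_fp8.to(torch.float32) / scale        # dequantize

x = torch.randn(1024, 1024) * 0.05                # typical activation magnitude
x_hat = fp8_roundtrip(x)
rel_err = (x - x_hat).abs().mean() / x.abs().mean()
print(f"mean relative error after FP8 round-trip: {rel_err:.4f}")
```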
The H200, shipping from Q2 2024, kept the Hopper compute die but upgraded memory to 141 GB of HBM3e with 4.8 TB/s bandwidth, yielding roughly 2x faster inference for memory-bound large language model workloads.
The B200, part of the Blackwell architecture announced at GTC 2024, uses a dual-die chiplet design containing 208 billion transistors. It delivers up to 9 PFLOPS of dense FP4 performance and 4,500 TFLOPS of dense FP8 performance, roughly 2.3x the H100 in FP8. Head-to-head benchmarks showed B200-based systems delivering 2.2x faster Llama 2 70B fine-tuning and 2x faster GPT-3 175B pre-training compared to H100.
CUDA, introduced in 2006, is NVIDIA's parallel computing platform and programming model. It has grown into the single most important moat in the AI hardware market. With over 4 million developers, more than 3,000 GPU-optimized applications, and native integration into every major machine learning framework, CUDA creates switching costs that competitors struggle to overcome. Practically every deep learning library, from PyTorch to JAX, is optimized for CUDA first, and most research papers assume NVIDIA hardware.
NVIDIA supplements CUDA with specialized libraries: cuDNN for deep learning primitives, TensorRT for inference optimization, cuBLAS for linear algebra, NCCL for multi-GPU communication, and Triton (originally from OpenAI, now NVIDIA-supported) for writing custom GPU kernels in Python.
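In practice these libraries are usually invoked indirectly: a framework such as PyTorch dispatches convolutions to cuDNN and matrix multiplications to cuBLAS without the user writing any GPU code. A minimal check, assuming a CUDA build of PyTorch and an NVIDIA GPU:

```python
import torch
import torch.nn.functional as F

print("CUDA available:", torch.cuda.is_available())
print("cuDNN version:", torch.backends.cudnn.version())  # None on CPU-only builds

if torch.cuda.is_available():
    # Let cuDNN benchmark candidate convolution algorithms and cache the fastest.
    torch.backends.cudnn.benchmark = True
    x = torch.randn(32, 64, 56, 56, device="cuda", dtype=torch.float16)
    w = torch.randn(128, 64, 3, 3, device="cuda", dtype=torch.float16)
    y = F.conv2d(x, w, padding=1)       # executed by a cuDNN kernel under the hood
    z = y.flatten(1) @ y.flatten(1).T   # matmul routed to cuBLAS
    torch.cuda.synchronize()
    print("conv output:", tuple(y.shape), "matmul output:", tuple(z.shape))
```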
Google was the first hyperscaler to design a custom AI chip, deploying TPU v1 internally in 2015. The TPU program has progressed through seven generations.
TPU v1 (2015). An inference-only accelerator featuring a 256x256 systolic array running at 700 MHz. It delivered 92 TOPS for 8-bit integer operations and consumed 75 watts. Google reported it achieved 15 to 30x higher throughput than contemporary GPUs on inference workloads while delivering 30 to 80x better TOPS per watt.
TPU v2 (2017). The first TPU capable of both training and inference. Each chip delivered approximately 45 TFLOPS in bfloat16 with 8 GB of HBM. Google introduced the bfloat16 numerical format with TPU v2, which later became an industry standard adopted by NVIDIA, AMD, and Intel; a short sketch of the format's trade-offs follows this list of generations.
TPU v3 (2018). More than doubled performance over v2, adding liquid cooling to manage increased power density. Each chip provided approximately 123 TFLOPS in bfloat16 with 16 GB of HBM.
TPU v4 (2021). Improved performance by more than 2x over v3, with 275 TFLOPS per chip and 32 GB of HBM per chip. A single v4 pod contained 4,096 chips and introduced optical circuit switches for faster inter-chip communication. Google published a detailed paper on the v4 architecture at ISCA 2023.
TPU v5e (2023). Optimized for cost-efficient inference and fine-tuning. Scaled to 256-chip pods.
TPU v5p (2023). The highest-performance v5 variant, with each chip delivering approximately 459 TFLOPS in bfloat16 and pods scaling to 8,960 chips.
TPU v6e / Trillium (2024). Google's sixth-generation TPU, codenamed Trillium, delivered 4.7x the per-chip compute performance of v5e and doubled HBM capacity and bandwidth. A single Trillium-based cluster can reach 91 exaFLOPS of aggregate compute. Trillium provided up to 2.5x better performance per dollar than v5p for dense LLM training and reached general availability in December 2024.
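The practical significance of bfloat16, introduced with TPU v2 above, is that it keeps float32's 8-bit exponent, and therefore its dynamic range, while cutting the mantissa to 7 bits: gradients rarely overflow, but each value carries far less precision. A small illustrative sketch using PyTorch's dtype metadata:

```python
import torch

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    # bfloat16 matches float32's range (~3.4e38) but keeps only ~2-3 decimal
    # digits of precision; float16 is more precise but overflows past 65,504.
    print(f"{str(dtype):16s} max={info.max:.3e}  eps={info.eps:.3e}")

x = torch.tensor([1e20, 3.14159265])
print(x.to(torch.bfloat16).to(torch.float32))  # range preserved, precision lost
print(x.to(torch.float16).to(torch.float32))   # 1e20 overflows to inf in FP16
```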
AMD is NVIDIA's most significant merchant-silicon competitor in the data center AI market, holding an estimated 5 to 8 percent market share.
Released in late 2023, the MI300X is built on AMD's CDNA 3 architecture using a chiplet design combining 5nm and 6nm dies. It features 192 GB of HBM3 memory with 5.3 TB/s bandwidth, 1,307 TFLOPS of peak FP16 performance, and 2,615 TFLOPS of peak FP8 performance. The MI300X's 192 GB memory capacity (matching the B200 and exceeding the H100's 80 GB) made it competitive for serving very large language models that benefit from fitting more parameters on a single accelerator.
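The link between memory capacity and the largest model a single accelerator can serve is back-of-the-envelope arithmetic: weights dominate, at roughly one byte per parameter in FP8 or two bytes in FP16, plus working space for the KV cache and activations. A rough sketch (the 20 percent overhead reservation is an illustrative assumption, not a vendor figure):

```python
def max_params_billion(memory_gb: float, bytes_per_param: float, overhead: float = 0.20) -> float:
    """Rough upper bound on model size (billions of parameters) that fits in memory,
    reserving a fraction of capacity for KV cache, activations, and buffers."""
    usable_bytes = memory_gb * 1e9 * (1.0 - overhead)
    return usable_bytes / bytes_per_param / 1e9

for name, gb in [("H100 (80 GB)", 80), ("MI300X (192 GB)", 192), ("MI325X (256 GB)", 256)]:
    print(f"{name}: ~{max_params_billion(gb, 2):.0f}B params in FP16, "
          f"~{max_params_billion(gb, 1):.0f}B in FP8")
```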
Launched in 2024, the MI325X upgraded memory to 256 GB of HBM3e with 6 TB/s bandwidth while maintaining the CDNA 3 architecture. It is designed for customers deploying larger LLMs and supporting more concurrent inference sessions.
AMD's open-source ROCm (Radeon Open Compute) platform is the primary alternative to CUDA. ROCm 6.0+ provides native support for PyTorch and TensorFlow, and AMD has invested in HIP (Heterogeneous Interface for Portability), a translation layer that allows many CUDA programs to run on AMD hardware with minimal code changes. However, ROCm's ecosystem remains less mature than CUDA, with fewer optimized libraries and less third-party support.
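Because ROCm builds of PyTorch expose AMD GPUs through the same torch.cuda interface (with HIP underneath), much code written for NVIDIA hardware runs unmodified on AMD accelerators. A small sketch for checking which backend a given PyTorch build targets:

```python
import torch

def gpu_backend() -> str:
    """Report whether this PyTorch build targets CUDA, ROCm/HIP, or CPU only."""
    if torch.version.hip is not None:
        return f"ROCm/HIP {torch.version.hip}"
    if torch.version.cuda is not None:
        return f"CUDA {torch.version.cuda}"
    return "CPU-only build"

print("backend:", gpu_backend())
if torch.cuda.is_available():              # ROCm devices also appear here
    print("device 0:", torch.cuda.get_device_name(0))
    x = torch.randn(2048, 2048, device="cuda")
    print("matmul ok:", (x @ x).shape)     # same code path on either vendor
```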
Intel's Gaudi 3, manufactured on a 5nm process, is the company's flagship AI accelerator. It features 128 GB of HBM2e memory with 3.67 TB/s bandwidth and delivers up to 1,835 TFLOPS in FP8 via its matrix multiplication engines (MMEs). The chip consumes up to 900 watts (with liquid cooling). Intel positioned Gaudi 3 as a cost-competitive alternative to the H100, launching volume production in Q3 2024. While benchmarks show Gaudi 3 trailing the H100 in some workloads, its lower price point targets customers seeking better price-performance ratios.
The largest cloud providers have each invested in designing their own AI chips, primarily to reduce dependence on NVIDIA and to optimize for their specific workloads.
Amazon Web Services has developed two chip families. Inferentia (first generation in 2019, Inferentia2 in 2022) targets inference workloads, with each Inferentia2 chip providing 32 GB of HBM. Trainium targets training. The second-generation Trainium2, which became generally available in December 2024, delivers up to 1.3 PFLOPS of dense FP8 per chip with 96 GB of HBM. A single Trn2 instance combines 16 Trainium2 chips for 20.8 PFLOPS of compute. AWS claims Trn2 instances offer 30 to 40 percent better price-performance than comparable GPU-based EC2 instances. AWS has also announced Trainium3, a 3nm chip expected in late 2025 with 2.52 PFLOPS per chip and 144 GB of HBM3e.
Meta developed its Meta Training and Inference Accelerator (MTIA) for internal use. The first generation (MTIA v1) used a 7nm process and consumed 25 watts. MTIA v2, announced in April 2024, moved to 5nm, features a 64-PE (processing element) array delivering 354 TOPS (INT8) and 177 TFLOPS (FP16), with 128 GB of LPDDR5 memory and a 90-watt power envelope. In March 2025, Meta announced four new generations: the MTIA 300, 400, 450, and 500, scheduled for deployment over the following two years.
Microsoft revealed its first custom AI accelerator, Maia 100, in late 2023. The chip is fabricated on TSMC's 5nm process with a die area of approximately 820 mm². It features 64 GB of HBM2e memory with 1.8 TB/s bandwidth, 500 MB of on-chip cache, and supports up to 700 watts TDP. Maia 100 integrates into Azure data centers and supports PyTorch models through a custom backend. Microsoft designed the chip to handle both training and inference for its Azure OpenAI Service and Copilot products.
Tesla designed the Dojo D1 chip specifically for training its autonomous driving neural networks. Fabricated on TSMC's 7nm process, the D1 contains 50 billion transistors in a 645 mm² die. Each chip delivers 362 TFLOPS at FP16/CFP8 and has 1.25 MB of SRAM per functional unit with no external DRAM. Twenty-five D1 chips are packaged into a water-cooled Training Tile that achieves 9 PFLOPS at BF16 while consuming 15 kilowatts. Tesla has reported working on next-generation Dojo chips manufactured at more advanced process nodes.
Several well-funded startups have introduced novel architectures that challenge the GPU-centric status quo.
Cerebras takes the most radical approach in the industry: building an entire processor from a single silicon wafer. The Wafer-Scale Engine 3 (WSE-3), announced in March 2024, is fabricated on TSMC's 5nm process and contains 4 trillion transistors across 46,225 mm² of silicon, making it the largest chip ever built. It features 900,000 AI-optimized compute cores, 44 GB of on-chip SRAM, and delivers 125 PFLOPS of peak AI performance. On-chip memory bandwidth exceeds 20 PB/s because memory is distributed locally across the tile array rather than accessed through external HBM stacks. The CS-3 system built around the WSE-3 can train models up to 24 trillion parameters. TIME Magazine recognized the WSE-3 as a Best Invention of 2024.
Groq's Language Processing Unit (LPU) is designed specifically for inference, emphasizing deterministic, low-latency execution over raw training throughput. The LPU uses hundreds of megabytes of on-chip SRAM as primary storage (not cache) and employs a compiler-driven, single-core architecture that eliminates the scheduling overhead found in GPUs. In benchmarks, Groq's systems run Llama 3 70B at over 1,660 tokens per second using speculative decoding, or 280 to 300 tokens per second in standard mode. LPUs connect via a plesiochronous protocol that allows hundreds of chips to function as a single logical processor. The system is air-cooled, simplifying data center deployment.
SambaNova's Reconfigurable Dataflow Unit (RDU) uses a dataflow architecture rather than the von Neumann model. The SN40L, fabricated on TSMC's 5nm process, features 1,040 Pattern Compute Units delivering 638 TFLOPS in BF16. Its defining feature is a three-tier memory hierarchy: 520 MB of on-chip SRAM, 64 GB of co-packaged HBM, and up to 1.5 TB of pluggable DDR DRAM. This hierarchy allows a single SN40L system to serve models up to 5 trillion parameters without distributing across multiple nodes. SambaNova raised $350 million in a 2026 funding round with Intel as a backer.
Bristol-based Graphcore developed the Intelligence Processing Unit (IPU), which features a bulk synchronous parallel execution model and large amounts of distributed on-chip SRAM (900 MB in the MK2 GC200). Each GC200 IPU has 1,472 cores running 8,832 parallel threads and delivers 250 TFLOPS at FP16. Graphcore was acquired by SoftBank in July 2024 for approximately $500 million. Under SoftBank's ownership, Graphcore has announced a roadmap toward an "Ultra Intelligence" AI supercomputer.
Led by CEO Jim Keller (previously of Apple, AMD, Intel, and Tesla), Tenstorrent builds RISC-V-based AI accelerators using an open-source philosophy. The Wormhole chip features up to 128 Tensix cores, 24 GB of GDDR6 memory with 576 GB/s bandwidth, and delivers up to 524 TFLOPS at FP8 in the dual-chip n300d configuration. The next-generation Blackhole chip, expected in 2025, moves to a 6nm process with 140 Tensix++ cores, 774 TFLOPS (FP8), 8 channels of GDDR6, and 16 RISC-V CPU cores. Tenstorrent also licenses its RISC-V CPU cores to other chip designers, creating a secondary business alongside its accelerator products.
When comparing AI accelerators, several metrics matter:
TOPS (Tera Operations Per Second). Measures integer operations, typically at INT8 precision. Commonly used for edge and inference-focused chips.
TFLOPS (TeraFLOPS). Measures floating-point operations per second. Reported at various precisions: FP32 for traditional HPC, FP16/BF16 for mixed-precision training, and FP8/FP4 for next-generation inference.
Memory capacity. The total amount of HBM, GDDR, or SRAM available on the accelerator. Larger models require more memory to store weights, activations, and optimizer states.
Memory bandwidth. Measured in TB/s, this determines how quickly data can be fed to the compute units. For inference of large language models (which are often memory-bandwidth-bound), this metric can matter more than raw TFLOPS; a back-of-the-envelope sketch follows this list.
Interconnect bandwidth. The speed of chip-to-chip communication, which determines how efficiently workloads can be distributed across multiple accelerators.
Performance per watt. Increasingly important as data centers face power constraints. Custom ASICs often lead in this metric because their fixed-function designs eliminate wasted transistors.
Total cost of ownership (TCO). Combines chip price, power consumption, cooling costs, and software development effort. A chip with lower peak TFLOPS but better TCO may be the more practical choice.
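For single-stream LLM decoding, every generated token requires reading essentially all model weights from memory, so memory bandwidth sets a hard ceiling on tokens per second regardless of peak TFLOPS. The back-of-the-envelope sketch referenced above, for a hypothetical 70-billion-parameter model served in FP8; it ignores KV-cache traffic, batching, and multi-chip parallelism, so real systems land below these ceilings:

```python
def decode_ceiling_tokens_per_s(params_billion: float, bytes_per_param: float,
                                bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode speed: one full weight read per token."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes

# 70B parameters at 1 byte per parameter (FP8), using bandwidths from the table above.
for name, bw in [("A100", 2.0), ("H100", 3.35), ("H200", 4.8), ("B200", 8.0), ("MI300X", 5.3)]:
    print(f"{name}: <= {decode_ceiling_tokens_per_s(70, 1, bw):.0f} tokens/s per stream")
```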
The following table compares key specifications of major data center AI accelerators available as of early 2025.
| Accelerator | Vendor | Architecture | Process Node | Memory | Memory Bandwidth | Peak FP8 / FP16 | TDP | Year |
|---|---|---|---|---|---|---|---|---|
| A100 SXM | NVIDIA | Ampere | 7nm | 80 GB HBM2e | 2.0 TB/s | 624 TFLOPS FP16 (sparse) | 400W | 2020 |
| H100 SXM | NVIDIA | Hopper | 4nm (4N) | 80 GB HBM3 | 3.35 TB/s | 1,979 TFLOPS FP8 | 700W | 2022 |
| H200 SXM | NVIDIA | Hopper | 4nm (4N) | 141 GB HBM3e | 4.8 TB/s | 1,979 TFLOPS FP8 | 700W | 2024 |
| B200 | NVIDIA | Blackwell | 4nm (4NP) | 192 GB HBM3e | 8.0 TB/s | 4,500 TFLOPS FP8 | 1,000W | 2024 |
| MI300X | AMD | CDNA 3 | 5nm/6nm | 192 GB HBM3 | 5.3 TB/s | 2,615 TFLOPS FP8 | 750W | 2023 |
| MI325X | AMD | CDNA 3 | 5nm/6nm | 256 GB HBM3e | 6.0 TB/s | 1,307 TFLOPS FP16 | 750W | 2024 |
| Gaudi 3 | Intel | Gaudi | 5nm | 128 GB HBM2e | 3.67 TB/s | 1,835 TFLOPS FP8 | 900W | 2024 |
| TPU v4 | Google | TPU | N/A | 32 GB HBM | N/A | 275 TFLOPS BF16 | N/A | 2021 |
| TPU v6e (Trillium) | Google | TPU | N/A | N/A | N/A | 4.7x v5e per chip | N/A | 2024 |
| Trainium2 | AWS | Custom | N/A | 96 GB HBM | 2.9 TB/s | 1,300 TFLOPS FP8 | N/A | 2024 |
| WSE-3 | Cerebras | Wafer-Scale | 5nm | 44 GB SRAM | 20+ PB/s (on-chip) | 125 PFLOPS peak | N/A | 2024 |
| SN40L | SambaNova | Dataflow RDU | 5nm | 64 GB HBM + 1.5 TB DDR | N/A | 638 TFLOPS BF16 | N/A | 2023 |
The AI accelerator market has experienced explosive growth. Gartner forecasted worldwide AI semiconductor revenue of $71 billion for 2024, representing a 33 percent increase over 2023. Within this figure, AI accelerators used in servers accounted for approximately $21 billion, with that segment projected to grow to $33 billion by 2028.
NVIDIA captured the lion's share of this market. The company's data center revenue grew from approximately $15 billion in fiscal 2023 to over $100 billion in fiscal 2025, driven overwhelmingly by AI GPU demand. Various analysts estimate NVIDIA's market share for AI training hardware at 80 to 92 percent, depending on the definition and time period.
Hyperscaler custom silicon represents a growing countervailing force. Google, AWS, Meta, and Microsoft have collectively invested over $50 billion in their custom chip programs. While these chips are not sold on the open market, they reduce the hyperscalers' dependence on NVIDIA and increase their negotiating leverage.
The broader competitive landscape includes AMD (projected to reach $5 to $7 billion in annual data center GPU revenue), Intel (repositioning Gaudi as a value alternative), and numerous startups that collectively raised billions in venture capital during 2023 and 2024.
AI chips have become a focal point of geopolitical competition between the United States and China.
In October 2022, the U.S. Department of Commerce imposed sweeping export controls on advanced semiconductors and semiconductor manufacturing equipment destined for China. These rules targeted chips above certain performance thresholds, effectively banning the export of NVIDIA's A100 and H100 to Chinese customers.
NVIDIA responded by designing export-compliant chips (the A800, H800, and later the H20) with reduced interconnect bandwidth or lower compute performance that fell below the controlled thresholds. In October 2023, the Commerce Department tightened the rules further, closing loopholes and capturing these downgraded chips as well.
The Biden administration introduced the "AI Diffusion Rule" in January 2025, establishing global performance thresholds and a tiered country system ("green zone" allies, restricted countries, and embargoed countries). Under this framework, flagship chips like the H100 and H200 remained blocked for China.
Policy shifted again under the Trump administration. In April 2025, exports of NVIDIA's H20 chips to China were halted. However, the administration reversed course in July 2025, allowing H20 shipments to resume. By December 2025, the Trump administration approved the export of NVIDIA H200 chips to approved Chinese customers under licensing conditions, making the H200 the most powerful AI chip ever cleared for export to China.
These export controls have spurred China's domestic chip industry, with companies like Huawei developing the Ascend 910B and 910C processors as alternatives. However, analysts assess that Chinese chips remain one to two generations behind leading NVIDIA and AMD products in performance, partly because Chinese foundries lack access to the most advanced extreme ultraviolet (EUV) lithography equipment from ASML.
The rising power demands of AI accelerators present significant infrastructure challenges. A single NVIDIA B200 consumes up to 1,000 watts, an eight-GPU server (plus networking and support hardware) draws well over 10 kilowatts, and densely packed AI racks can reach 40 to 120 kilowatts. Data centers designed for traditional server workloads (typically 5 to 15 kW per rack) cannot support this density without major retrofits.
Liquid cooling has become increasingly necessary. NVIDIA's GB200 NVL72 server rack, which combines 72 Blackwell GPUs in a single liquid-cooled enclosure, requires direct-to-chip liquid cooling infrastructure. Google adopted liquid cooling starting with TPU v3, and Tesla's Dojo Training Tiles are water-cooled from the start. Meanwhile, Groq's LPU architecture remains air-cooled, which the company highlights as a deployment advantage.
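The density problem is easy to quantify. A rough budget for the configurations above, using the GPU counts and TDP figures cited in this article and an illustrative assumption of roughly 40 percent additional power for CPUs, switches, fans, and power-conversion losses:

```python
def rack_power_kw(gpu_count: int, gpu_watts: float, overhead_fraction: float = 0.4) -> float:
    """Estimate total power: GPU draw plus an assumed overhead for CPUs,
    networking, cooling fans, and power-conversion losses."""
    return gpu_count * gpu_watts * (1.0 + overhead_fraction) / 1000.0

print(f"8x H100 server:              ~{rack_power_kw(8, 700):.0f} kW")
print(f"72x B200 (NVL72-class rack): ~{rack_power_kw(72, 1000):.0f} kW")
print("Traditional rack budget:      5-15 kW")
```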
Several trends are shaping the next generation of AI accelerators:
Lower-precision arithmetic. NVIDIA's Blackwell architecture introduced FP4 support, and multiple vendors are exploring sub-4-bit formats. Lower precision allows more operations per watt but requires careful quantization to maintain model quality; a small quantization sketch appears at the end of this section.
Chiplet and advanced packaging. The B200's dual-die design and AMD's MI300X chiplet approach both use TSMC's CoWoS (Chip-on-Wafer-on-Substrate) packaging to combine multiple dies into a single package. This trend will continue as monolithic die sizes approach the limits of lithography reticles.
Photonic interconnects. Scaling multi-chip systems requires ever-faster interconnects. Google pioneered optical circuit switches in TPU v4 pods, and several startups (Lightmatter, Ayar Labs) are developing silicon photonics for chip-to-chip communication.
Inference-optimized designs. As AI models move from research labs into production, the ratio of inference to training compute is growing rapidly. Chips like Groq's LPU and AWS Inferentia are designed specifically for low-latency, high-throughput inference rather than training.
RISC-V integration. Tenstorrent and others are incorporating open-source RISC-V CPU cores alongside AI accelerator units, enabling more flexible system-on-chip designs without licensing fees to ARM or x86 vendors.
3nm and beyond. TSMC's 3nm process is being adopted by the next generation of AI chips, including AWS Trainium3 (expected late 2025) and AMD's MI350X. Smaller transistors enable higher compute density and better energy efficiency.
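To illustrate the quantization trade-off noted in the lower-precision trend above, the sketch below applies simple symmetric 4-bit quantization to a random weight matrix and measures the reconstruction error. Production schemes (per-group scaling, GPTQ, AWQ, and similar) are considerably more sophisticated; this is only a minimal illustration of the idea.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor 4-bit quantization: integer levels -7..7 plus one FP scale."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32) * 0.02  # toy weight matrix
q, scale = quantize_int4(w)
w_hat = q.astype(np.float32) * scale  # dequantized weights used at compute time

rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
print(f"storage: {w.nbytes/1e6:.0f} MB (FP32) -> ~{q.nbytes/2/1e6:.0f} MB (packed INT4)")
print(f"mean relative reconstruction error: {rel_err:.3f}")
```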