A graphics processing unit (GPU) is a specialized electronic circuit originally designed to accelerate the rendering of images, video, and 2D/3D graphics. Over the past decade, GPUs have become the primary hardware platform for training and running artificial intelligence models, particularly deep learning systems. Their massively parallel architecture makes them far more efficient than traditional CPUs at the matrix arithmetic that underpins modern neural networks.
GPUs now sit at the center of every major AI breakthrough, from large language models like GPT-4 and Gemini to diffusion models for image generation. The world's largest AI training clusters contain tens of thousands of GPUs working together, and the companies that design and manufacture these chips, above all NVIDIA, have become some of the most valuable corporations on earth.
Dedicated graphics hardware existed in various forms throughout the 1980s and 1990s, but NVIDIA coined the term "GPU" in 1999 with the launch of the GeForce 256. That chip could process 10 million polygons per second and offloaded transform and lighting calculations from the CPU. Other vendors, including ATI Technologies (later acquired by AMD) and 3dfx Interactive, competed fiercely in the consumer graphics market during this period.
The turning point for scientific and eventually AI computing came in November 2006, when NVIDIA released CUDA (Compute Unified Device Architecture). CUDA gave software developers a C-like programming interface for harnessing the thousands of parallel cores inside NVIDIA GPUs for tasks beyond graphics rendering [1]. Before CUDA, researchers who wanted to run general-purpose computations on GPUs had to disguise their math as graphics shaders, a cumbersome and error-prone process. CUDA removed that barrier and opened GPU computing to the broader scientific community.
The moment that cemented GPUs as AI hardware arrived on September 30, 2012. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton submitted AlexNet to the ImageNet Large Scale Visual Recognition Challenge. AlexNet, a convolutional neural network with 60 million parameters, achieved a top-5 error rate of 15.3%, beating the runner-up by more than 10 percentage points [2]. Krizhevsky trained the model on two NVIDIA GTX 580 consumer GPUs using his custom cuda-convnet library. The victory demonstrated that deep neural networks, combined with large datasets and GPU-accelerated training, could dramatically outperform hand-engineered computer vision methods.
AlexNet's success rested on the convergence of three developments: the availability of large labeled datasets (ImageNet), general-purpose GPU computing (CUDA), and improved training techniques for deep networks (such as ReLU activations and dropout regularization). The result was an explosion of research into deep learning that has continued unabated ever since.
Recognizing the growing AI workload, NVIDIA expanded its Tesla line of data center GPUs (first introduced in 2007) with the Kepler-based Tesla K20 in 2012. Subsequent generations brought increasingly specialized hardware for AI: the Pascal architecture (2016) introduced the P100 with NVLink interconnects; the Volta architecture (2017) debuted Tensor Cores, specialized matrix-multiply units, in the V100; and the Ampere architecture (2020) produced the A100, which became the workhorse of AI training for several years. Each generation roughly doubled the performance available for mixed-precision deep learning workloads.
Several architectural properties make GPUs well suited to the mathematical operations that dominate AI training and inference.
A modern CPU might have 8 to 128 cores, each optimized for sequential, branching workloads. A modern data center GPU contains thousands of smaller cores organized into streaming multiprocessors (SMs). NVIDIA's H100, for instance, has 132 SMs containing a total of 16,896 CUDA cores. This architecture can execute thousands of threads simultaneously, which maps naturally onto the element-wise and matrix operations found in neural network layers.
The fundamental operation in deep learning, both during training and inference, is matrix multiplication (and its close relatives: convolutions, attention computations, and element-wise transforms). Starting with the Volta architecture in 2017, NVIDIA GPUs include Tensor Cores: dedicated hardware units that perform small matrix-multiply-accumulate operations in a single clock cycle. Tensor Cores accelerate operations in half-precision (FP16), bfloat16 (BF16), and lower-precision formats (FP8, FP4) by orders of magnitude compared to general-purpose CUDA cores.
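The numeric behavior that Tensor Cores implement in hardware, low-precision inputs with higher-precision accumulation, can be sketched in a few lines of NumPy. This emulates the numerics only, not the actual hardware instruction:

```python
import numpy as np

def emulated_mma(a_fp16, b_fp16):
    """Software sketch of a Tensor-Core-style matrix-multiply-accumulate:
    FP16 inputs are multiplied and accumulated in FP32. This mimics the
    numerics, not the HMMA hardware instruction itself."""
    return a_fp16.astype(np.float32) @ b_fp16.astype(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal((16, 16)).astype(np.float16)  # a 16x16 tile
b = rng.standard_normal((16, 16)).astype(np.float16)

reference = a.astype(np.float64) @ b.astype(np.float64)  # high-precision baseline
result = emulated_mma(a, b)
```

Accumulating in FP32 is what keeps long dot products from losing precision even though the inputs are stored in half precision; accumulating in FP16 throughout would drift noticeably as matrices grow.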
Training large models requires moving enormous amounts of data between memory and compute units. Data center GPUs use High Bandwidth Memory (HBM), stacked DRAM packages that deliver far more bandwidth than the DDR5 memory used in CPUs. The A100 provides up to 2 TB/s of memory bandwidth; the H100, 3.35 TB/s; and the B200, 8 TB/s [3]. This bandwidth is critical because AI workloads are frequently memory-bound: the compute units can process data faster than memory can supply it, so higher bandwidth translates directly to higher utilization and throughput.
Hardware alone does not explain GPU dominance. NVIDIA's CUDA ecosystem, built over nearly two decades, includes optimized libraries for deep learning (cuDNN), linear algebra (cuBLAS), and communication (NCCL). Every major AI framework, including PyTorch and TensorFlow, is deeply integrated with CUDA. This software moat has proven at least as important as the hardware itself in sustaining NVIDIA's market position.
The basic building block of an NVIDIA GPU is the streaming multiprocessor (SM). Each SM contains a set of CUDA cores for general floating-point and integer arithmetic, Tensor Cores for matrix operations, a register file, shared memory, and an L1 cache. The GPU's scheduler distributes warps (groups of 32 threads) across SMs, hiding memory latency by switching between warps when one stalls on a memory access.
Tensor Cores perform fused matrix-multiply-accumulate operations on small matrices (for example, 4x4 or 16x16 tiles, depending on the data format). First introduced in the Volta V100, they have evolved through each architecture generation, adding support for new data formats (TF32, BF16, FP8, FP4) and structured sparsity.
Modern data center GPUs include a growing list of features specifically designed for AI workloads:
| Feature | Description | First introduced |
|---|---|---|
| Tensor Cores | Dedicated matrix-multiply-accumulate units | Volta (2017) |
| Structured sparsity | 2:4 sparsity pattern support for 2x speedup with minimal accuracy loss | Ampere (2020) |
| Transformer Engine | Dynamic per-layer FP8/FP16 precision management | Hopper (2022) |
| FP8 data type | 8-bit floating point for training and inference | Hopper (2022) |
| FP4 data type | 4-bit floating point for inference | Blackwell (2024) |
| Thread block clusters | Co-scheduled blocks for distributed shared memory | Hopper (2022) |
| TMA (Tensor Memory Accelerator) | Hardware unit for efficient bulk memory transfers | Hopper (2022) |
| NVLink-C2C | Chip-to-chip interconnect for multi-die designs | Grace Hopper (2023) |
| Decompression engine | Hardware-accelerated data decompression | Blackwell (2024) |
| Confidential computing | Hardware-level encryption for sensitive AI workloads | Hopper (2022) |
The Transformer Engine deserves particular attention. For transformer-based models (which include virtually all modern LLMs), the Transformer Engine monitors the statistical properties of activations at each layer and automatically selects the optimal precision (FP8 vs. higher precision) to maximize throughput while maintaining training accuracy. This eliminates the manual tuning previously required for mixed-precision training.
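As an illustration of the idea only (this is not NVIDIA's actual algorithm, and the quantizer below is a crude 256-level linear grid standing in for real FP8), a precision selector can scale each tensor by its absolute maximum, measure the round-trip error, and fall back to FP16 when the tensor is too hard to quantize:

```python
import numpy as np

def choose_precision(x, tol=0.02):
    """Illustrative sketch, not NVIDIA's algorithm: scale by the tensor's
    amax, round onto a 256-level grid as a stand-in for FP8, and keep
    the low-precision path only if the relative round-trip error is small."""
    amax = np.max(np.abs(x))
    scale = 127.0 / amax
    roundtrip = np.round(x * scale).clip(-127, 127) / scale
    rel_err = np.sqrt(np.mean((x - roundtrip) ** 2) / np.mean(x ** 2))
    return "FP8" if rel_err < tol else "FP16"

rng = np.random.default_rng(0)
well_behaved = rng.standard_normal(4096)                     # typical activations
outlier_heavy = np.append(rng.standard_normal(4096), 1000.0) # one huge outlier

print(choose_precision(well_behaved))   # FP8
print(choose_precision(outlier_heavy))  # FP16
```

The outlier case shows why per-tensor monitoring matters: a single extreme activation forces the scale so large that everything else collapses toward zero, so that layer is better left at higher precision.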
Modern AI GPUs use stacked High Bandwidth Memory. The A100 shipped with HBM2e (80 GB, 2 TB/s); the H100 moved to HBM3 (80 GB, 3.35 TB/s); the H200 upgraded to HBM3e (141 GB, 4.8 TB/s); and Blackwell GPUs use HBM3e in higher-capacity configurations (192 GB on B200, 288 GB on B300). The trend is clear: each generation brings both more capacity and more bandwidth, because model parameter counts and context lengths continue to grow faster than memory technology improves.
Memory bandwidth is one of the most critical specifications for AI GPUs, often more important than peak compute throughput. This is because many AI operations, especially LLM inference, are memory-bandwidth-bound rather than compute-bound.
During autoregressive LLM inference, each new token requires reading through the entire model's weights from memory. For a 70-billion-parameter model in FP16, this means reading approximately 140 GB of data for every single token generated. The theoretical maximum token generation rate is therefore constrained by:
Max tokens/second = Memory bandwidth / Model size in memory
| GPU | Memory bandwidth | Max theoretical tok/s (70B FP16) | Max theoretical tok/s (70B INT4) |
|---|---|---|---|
| A100 (80 GB) | 2.0 TB/s | ~14 | ~57 |
| H100 (80 GB) | 3.35 TB/s | ~24 | ~96 |
| H200 (141 GB) | 4.8 TB/s | ~34 | ~137 |
| B200 (192 GB) | 8.0 TB/s | ~57 | ~229 |
| MI300X (192 GB) | 5.3 TB/s | ~38 | ~152 |
These theoretical limits assume perfect memory access patterns and zero compute overhead. Real-world performance is typically 50-70% of the theoretical maximum due to memory access inefficiencies, KV cache overhead, and compute time. However, the table illustrates why memory bandwidth improvements (A100 to B200: 4x bandwidth increase) translate almost directly into inference speed improvements.
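The bandwidth bound above is straightforward to compute directly. The sketch below reproduces the H100 row of the table:

```python
def max_tokens_per_second(bandwidth_tb_s, params_billion, bytes_per_param):
    """Upper bound on autoregressive decode speed: every generated token
    must stream all model weights from HBM once (ignores KV cache reads
    and compute time, so real throughput is lower)."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes

# 70B-parameter model on an H100 (3.35 TB/s, from the table above)
print(round(max_tokens_per_second(3.35, 70, 2)))    # FP16 (2 bytes/param) -> 24
print(round(max_tokens_per_second(3.35, 70, 0.5)))  # INT4 (0.5 bytes/param) -> 96
```

The INT4 column of the table falls out of the same formula: quantizing weights to 4 bits shrinks the bytes read per token by 4x, which is exactly why aggressive quantization is so effective for inference.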
For training, the picture is more nuanced. Training involves both forward and backward passes with larger batch sizes, which increases arithmetic intensity (the ratio of compute operations to memory accesses). This shifts the bottleneck toward compute throughput for well-optimized training workloads, making Tensor Core performance and TDP (which determines sustained clock speeds) more important during training than during inference.
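The compute-vs-memory crossover can be made concrete with the roofline model: an operation is compute-bound when its arithmetic intensity (FLOPs per byte moved) exceeds the GPU's ratio of peak FLOPS to memory bandwidth, the so-called ridge point. A sketch using the H100 figures from the tables in this article:

```python
def gemm_intensity(m, n, k, bytes_per_element=2):
    """Arithmetic intensity of an MxNxK matrix multiply in FLOPs per byte,
    counting one read of each input matrix and one write of the output."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved

# Ridge point for an H100-class GPU: 1,979 TFLOPS FP16 / 3.35 TB/s
ridge = 1979e12 / 3.35e12  # ~591 FLOPs/byte

training_gemm = gemm_intensity(4096, 4096, 4096)  # large training GEMM, ~1365
decode_gemv = gemm_intensity(1, 4096, 4096)       # batch-1 decode, ~1
```

A big square GEMM (intensity ~N/3) sits well above the ridge point and is compute-bound, while batch-1 decode is a matrix-vector product with intensity near 1 FLOP/byte, hundreds of times below the ridge, which is the roofline-model restatement of why inference is memory-bandwidth-bound.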
Within a server, GPUs communicate over NVLink, a proprietary high-speed interconnect. Per-GPU NVLink bandwidth has grown from 300 GB/s (Volta) to 600 GB/s (Ampere), 900 GB/s (Hopper), and 1.8 TB/s (Blackwell NVLink 5). Between servers, clusters use either InfiniBand or high-speed Ethernet. NVIDIA's Quantum-X800 InfiniBand switches deliver 800 Gb/s per port, while the Spectrum-X platform brings similar bandwidth to Ethernet-based networks.
The following table summarizes the key data center GPUs NVIDIA has released for AI workloads from 2020 through 2025.
| GPU | Architecture | Year | FP16/BF16 Tensor TFLOPS | Memory | Memory bandwidth | TDP | Interconnect |
|---|---|---|---|---|---|---|---|
| A100 (SXM) | Ampere | 2020 | 312 | 80 GB HBM2e | 2.0 TB/s | 400W | NVLink 3 (600 GB/s) |
| H100 (SXM) | Hopper | 2022 | 1,979 | 80 GB HBM3 | 3.35 TB/s | 700W | NVLink 4 (900 GB/s) |
| H200 (SXM) | Hopper | 2024 | 1,979 | 141 GB HBM3e | 4.8 TB/s | 700W | NVLink 4 (900 GB/s) |
| B200 | Blackwell | 2024-2025 | 4,500 (FP16); 9,000 (FP8) | 192 GB HBM3e | 8.0 TB/s | 1,000W | NVLink 5 (1,800 GB/s) |
| B300 | Blackwell Ultra | 2025 | ~4,500 (FP16); 15 PFLOPS (FP4) | 288 GB HBM3e | 8.0 TB/s | 1,400W | NVLink 5 (1,800 GB/s) |
Several trends are visible in this progression. FP16 Tensor Core throughput grew roughly 6x from A100 to H100 in a single generation, then doubled again with Blackwell. Memory capacity has nearly quadrupled from 80 GB to 288 GB. TDP has risen from 400W to 1,400W, reflecting the growing power demands of AI chips and driving a shift toward liquid cooling in data centers. The B300 (Blackwell Ultra), announced by NVIDIA CEO Jensen Huang at GTC 2025, integrates 208 billion transistors on a dual-reticle die design and delivers roughly 1.5x the performance of the B200 [4].
The GB300 NVL72 is NVIDIA's rack-scale AI system, combining 72 Blackwell Ultra GPUs and 36 Grace ARM CPUs interconnected by NVLink 5 through NVSwitch into a single unified memory space. A single rack delivers 1.1 exaFLOPS of FP4 compute and requires 132 to 140 kW of power with direct liquid cooling [5]. This represents exascale AI performance in a single rack, a milestone that would have required an entire building-sized supercomputer only a decade ago.
AMD has steadily expanded its AI GPU lineup through the Instinct series, providing an alternative to NVIDIA's products.
The AMD Instinct MI300X, launched in late 2023, uses the CDNA 3 architecture and features 192 GB of HBM3 memory with approximately 5.3 TB/s of bandwidth. Its TDP is 750W. The MI325X followed in Q4 2024 with 288 GB of HBM3e memory and 6 TB/s of bandwidth at 1,000W [6].
The MI350 series, expected in 2025, uses the CDNA 4 architecture built on 3nm process technology. AMD claims up to a 35x improvement in inference performance over the MI300 series. The MI350X will support FP4 and FP6 data types and offer up to 288 GB of HBM3e memory. The higher-end MI355X variant has a 1,400W TDP [7].
Slated for 2026, the MI400 series will use the next-generation CDNA architecture and HBM4 memory. Early specifications suggest 432 GB of HBM4 at 19.6 TB/s of bandwidth, with 40 PFLOPS of FP4 compute and 20 PFLOPS of FP8 compute per chip [8]. If these numbers hold, the MI400 would be competitive with NVIDIA's next-generation products.
AMD's hardware has become increasingly competitive on paper, but the company faces a significant software ecosystem disadvantage. AMD's ROCm software stack, the equivalent of CUDA, has historically lagged in framework support, library optimization, and ease of use. Major AI labs have been reluctant to invest engineering effort in porting and optimizing their codebases for ROCm when CUDA "just works." AMD has made progress on this front, particularly through partnerships with Meta and Microsoft, and ROCm now integrates directly with PyTorch, TensorFlow, and JAX, allowing teams to move models from NVIDIA to AMD hardware by swapping containers and drivers rather than rewriting code. Still, closing the software gap remains AMD's biggest challenge in AI.
GPUs are not the only chips used for AI. Several companies have developed purpose-built accelerators optimized for specific workloads.
Google has designed its own Tensor Processing Units (TPUs) since 2015. TPUs are application-specific integrated circuits (ASICs) optimized for TensorFlow and, more recently, JAX workloads. Google's TPU v6 (Trillium), available in 2024, delivers approximately 926 TFLOPS of BF16 performance per chip with 32 GB of HBM and improved energy efficiency of over 67% compared to TPU v5e [9]. At Cloud Next 2025, Google unveiled TPU v7 (Ironwood), which delivers 4,614 TFLOPS per chip and ships in configurations of 256 and 9,216 chips. Google uses TPUs internally to train its Gemini family of models and offers them to external customers through Google Cloud.
Amazon Web Services developed Trainium, a custom chip for AI training. Trainium2 became generally available in December 2024, delivering 20.8 PFLOPS of FP8 per 16-chip instance. AWS claims 30 to 40 percent better price-performance than H100-based instances. Trainium3, announced in December 2025, is a 3nm chip providing 2.52 PFLOPS of FP8 compute per chip with 144 GB of HBM3e memory [10].
Intel's Gaudi accelerators targeted the AI training and inference market. Gaudi 3 demonstrated competitive performance against the H100 on certain long-output LLM inference tasks. However, Intel confirmed plans to discontinue the Gaudi line in favor of next-generation GPU products expected in 2026-2027 [11].
Microsoft developed the Maia 100 AI accelerator, a custom chip fabricated on TSMC's 5nm process, for use in Azure data centers. Designed to optimize inference for large language models, Maia reflects the broader trend of hyperscale cloud providers developing in-house silicon to reduce dependency on third-party GPU vendors.
Training frontier AI models requires not just individual GPUs but entire clusters of thousands or tens of thousands of GPUs working in concert. Distributing training across multiple GPUs introduces significant complexity in communication, synchronization, and fault tolerance.
Several complementary strategies are used to distribute training workloads:
| Strategy | How it works | Communication pattern | Best for |
|---|---|---|---|
| Data parallelism | Each GPU holds a full copy of the model and processes different data | AllReduce (gradient sync) | Models that fit in single-GPU memory |
| Tensor parallelism | Individual layers are split across GPUs | AllReduce within each layer | Large layers (attention, FFN) |
| Pipeline parallelism | Different layers assigned to different GPUs | Point-to-point between stages | Very deep models |
| Expert parallelism | MoE experts distributed across GPUs | All-to-all routing | Mixture-of-experts models |
| ZeRO optimization | Model states (optimizer, gradients, weights) partitioned across GPUs | Gather/scatter as needed | Memory-efficient training |
| Sequence parallelism | Long sequences split across GPUs | AllGather for attention | Very long context windows |
Modern frontier model training typically uses a hybrid of multiple strategies. For example, Meta's Llama 3 training used a combination of tensor parallelism (within a node), pipeline parallelism (across nodes), and data parallelism (across node groups), with ZeRO-style optimizer state sharding.
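The core idea of data parallelism is easy to demonstrate: averaging per-shard gradients (the AllReduce step) recovers exactly the full-batch gradient. A minimal NumPy simulation with a linear model and four simulated GPUs:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))  # a batch of 64 samples, 8 features
y = rng.standard_normal(64)
w = np.zeros(8)                   # identical model replica on every "GPU"

def grad(Xb, yb, w):
    """Gradient of mean squared error for a linear model on one shard."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Data parallelism: shard the batch across 4 simulated GPUs, compute
# local gradients, then "AllReduce" by averaging them.
shards = zip(np.split(X, 4), np.split(y, 4))
local_grads = [grad(Xb, yb, w) for Xb, yb in shards]
allreduced = np.mean(local_grads, axis=0)  # matches the full-batch gradient
```

Because the shards are equal-sized, the mean of the local gradients equals the gradient over the whole batch, so every replica takes the same optimizer step and the copies stay synchronized.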
The efficiency of multi-GPU training is fundamentally limited by communication overhead. During data-parallel training, all GPUs must synchronize gradients after each backward pass using an AllReduce operation. The time required for this synchronization depends on the size of the gradient buffers, the bandwidth and latency of the interconnect, and the collective algorithm used (typically a ring- or tree-based AllReduce).
For clusters with fast NVLink interconnects within a node, intra-node communication is rarely the bottleneck. The challenge is inter-node communication, where bandwidth is 10-100x lower. This is why large training clusters invest heavily in InfiniBand networking and why NVIDIA's NVLink Network (extending NVLink across nodes) represents a significant advancement.
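A standard first-order model for the bandwidth cost of a ring AllReduce is that each GPU sends and receives 2(N-1)/N times the buffer size. The sketch below uses assumed per-GPU bandwidths, 900 GB/s for NVLink within a node and 100 GB/s (800 Gb/s) for an inter-node fabric, to show why the inter-node case dominates:

```python
def ring_allreduce_seconds(size_bytes, n_gpus, bus_gb_s):
    """Bandwidth term of a ring AllReduce: each GPU transfers
    2*(N-1)/N times the buffer size (latency terms ignored)."""
    traffic = 2 * (n_gpus - 1) / n_gpus * size_bytes
    return traffic / (bus_gb_s * 1e9)

grads = 70e9 * 2  # 70B parameters of FP16 gradients = 140 GB

intra = ring_allreduce_seconds(grads, 8, 900)  # NVLink within a node
inter = ring_allreduce_seconds(grads, 8, 100)  # 800 Gb/s inter-node fabric
```

At the same GPU count, the synchronization time scales inversely with link bandwidth, so the 9x slower inter-node path takes 9x longer, which is why techniques like gradient bucketing, overlap with the backward pass, and hierarchical AllReduce all focus on the inter-node hop.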
GPU failures are common in large clusters. In a 10,000-GPU cluster, hardware faults may occur multiple times per day, requiring checkpoint-and-restart mechanisms and redundancy planning. NCCL 2.27's communicator shrink feature enables dynamic exclusion of failed GPUs during training, allowing jobs to continue with a reduced GPU count rather than restarting entirely.
Training frameworks like Megatron-LM and FSDP (Fully Sharded Data Parallel) include built-in checkpointing that saves model state periodically, enabling recovery from failures with minimal lost compute. The checkpoint frequency is a trade-off: more frequent checkpoints reduce lost work but consume I/O bandwidth and storage.
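A classic way to reason about this trade-off is the Young/Daly approximation, which balances checkpoint cost against expected lost work. The numbers below (per-GPU MTBF and checkpoint write time) are illustrative assumptions, not measured values:

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young/Daly first-order approximation: compute time between
    checkpoints that minimizes expected lost work plus checkpoint cost."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# 10,000-GPU cluster: if each GPU fails ~once every 5 years (assumption),
# the cluster-level MTBF shrinks to hours.
per_gpu_mtbf_s = 5 * 365 * 24 * 3600
cluster_mtbf_s = per_gpu_mtbf_s / 10_000        # ~4.4 hours between failures

interval = optimal_checkpoint_interval(120, cluster_mtbf_s)  # 2-min checkpoint
print(round(interval / 60))  # checkpoint roughly every half hour
```

With a cluster MTBF of a few hours, roughly half-hourly checkpoints are optimal under these assumptions, consistent with the "multiple failures per day" observation above.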
NVIDIA's DGX line packages multiple GPUs into a single server node with high-speed interconnects. The DGX H100 contained eight H100 GPUs connected by NVLink. The DGX B200 and DGX B300, based on Blackwell and Blackwell Ultra respectively, scale this concept further, with NVIDIA quoting 11x faster inference and 4x faster training for Blackwell Ultra systems compared to the Hopper generation [12].
For larger deployments, NVIDIA offers the DGX SuperPOD, a reference architecture for building AI supercomputers. NVIDIA's own Eos supercomputer, built from 18 H100-based SuperPODs, totals 576 DGX H100 systems with 500 Quantum-2 InfiniBand switches, delivering 18 exaFLOPS of FP8 compute [13].
At cluster scale, inter-node communication becomes a critical bottleneck. The two dominant network fabrics are InfiniBand (long the standard for HPC) and increasingly, high-speed Ethernet. NVIDIA's ConnectX-8 SuperNIC provides 800 Gb/s of network connectivity per GPU. NVIDIA's NVLink Network extends the NVLink protocol beyond a single node, enabling GPU-to-GPU communication across servers with lower latency than traditional network fabrics.
The launch of ChatGPT in November 2022 triggered an unprecedented surge in demand for AI-capable GPUs. By the summer of 2023, NVIDIA's H100 GPUs were sold out with lead times extending into Q1 2024. The shortage had several causes.
First, demand exploded. Every major technology company, along with hundreds of startups, scrambled to build or expand AI training infrastructure. Second, supply was constrained by limited packaging capacity at TSMC. The bottleneck was not the fabrication of the GPU die itself but Chip-on-Wafer-on-Substrate (CoWoS) advanced packaging, which is essential for integrating HBM stacks with the GPU die. TSMC was able to fulfill only about 80% of customer demand for CoWoS capacity [14].
NVIDIA and TSMC responded by investing heavily in expanded CoWoS capacity. NVIDIA committed approximately $2.9 billion toward packaging expansion in Taiwan. By mid-2024, the acute shortage had eased, though supply remained tight. Meta CEO Mark Zuckerberg noted publicly that the GPU shortage in data centers was being alleviated, but he identified electrical power supply as the emerging bottleneck [15].
The shortage had lasting effects on the industry: it fueled the rise of specialized GPU cloud providers, accelerated hyperscalers' investments in custom in-house silicon, and made long-term, prepaid supply commitments a standard feature of GPU procurement.
Organizations running AI workloads face a fundamental economic decision: rent GPU capacity in the cloud or purchase hardware for on-premises deployment.
Cloud GPU pricing has become more competitive as supply has increased and more providers have entered the market. The following table compares approximate on-demand pricing for a single GPU instance across major cloud providers as of early 2026.
| Provider | GPU | Approximate on-demand price (per GPU-hour) |
|---|---|---|
| AWS (p5 instances) | H100 | ~$3.90 |
| Google Cloud | H100 | ~$3.00 |
| Microsoft Azure | H100 | ~$6.98 |
| AWS (p4d instances) | A100 | ~$3.67 |
| RunPod (community cloud) | H100 | ~$1.99 |
| GMI Cloud | H100 | ~$2.10 |
Prices are lower for reserved instances (one- or three-year commitments) and spot/preemptible instances. Spot pricing for H100s on AWS and GCP runs approximately $2.00 to $2.50 per GPU-hour. Analysts expect H100 cloud pricing to fall below $2.00 per GPU-hour universally by mid-2026 as Blackwell-based instances become widely available and older hardware is depreciated [16].
Specialized GPU cloud providers such as Lambda, CoreWeave, RunPod, and Together AI often offer lower prices than the major hyperscalers by focusing exclusively on GPU compute and operating with lower overhead.
| Item | Approximate cost (2025-2026) |
|---|---|
| Single NVIDIA H100 GPU | $25,000-$30,000 |
| Single NVIDIA H200 GPU | $25,000-$35,000 |
| 8-GPU DGX H100 system | $300,000-$400,000 |
| InfiniBand networking (per node) | $10,000-$20,000 |
| Rack infrastructure (power, cooling) | $50,000-$100,000 per rack |
| Annual operations (power, maintenance) | 15-25% of hardware cost |
The cloud-vs-on-premises breakeven depends heavily on utilization rate:
| Utilization scenario | Breakeven timeline | Recommendation |
|---|---|---|
| Intermittent (< 30% utilization) | Never reaches breakeven | Cloud |
| Moderate (30-60% utilization) | 12-18 months | Depends on budget and growth plans |
| High (60-90% utilization) | 6-12 months | On-premises or reserved cloud |
| Continuous (90%+ utilization) | < 4 months | On-premises |
A 2025 Lenovo whitepaper on Generative AI TCO found that on-premises infrastructure achieves breakeven in under four months for high-utilization workloads. The main cost drivers for AI infrastructure are: GPU compute (70-80% of total), data storage and transfer (10-15%), engineering personnel (15-20%), and software/tools (5-10%) [23].
For organizations with continuous training needs, data sensitivity concerns, or long-term AI strategies, on-premises infrastructure offers better total cost of ownership and greater control over the training environment. Cloud is better for intermittent training, startups with limited capital, or projects requiring burst capacity.
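A minimal breakeven model makes the utilization sensitivity explicit. Note that at the on-demand rates quoted above, the breakeven comes out longer than vendor TCO studies suggest (those typically assume different hardware discounts and cloud rates); the absolute months matter less than how sharply they depend on utilization:

```python
def breakeven_months(hw_cost, cloud_rate_per_gpu_hr, n_gpus, utilization,
                     ops_frac=0.20):
    """Months until cumulative cloud rental exceeds the purchase price plus
    ongoing operations cost. Illustrative only: quotes, discounts, and
    power prices shift the result substantially."""
    cloud_monthly = cloud_rate_per_gpu_hr * 730 * utilization * n_gpus
    ops_monthly = hw_cost * ops_frac / 12  # ~20%/yr power and maintenance
    return hw_cost / (cloud_monthly - ops_monthly)

# 8x H100 server (~$350k) vs on-demand H100s at ~$3.90/GPU-hr
high = breakeven_months(350_000, 3.90, 8, 0.90)  # ~24 months at these rates
low = breakeven_months(350_000, 3.90, 8, 0.30)   # far longer at 30% utilization
```

Dropping utilization from 90% to 30% pushes the breakeven out by more than an order of magnitude, which is the quantitative version of the table's "intermittent workloads should stay in the cloud."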
The cost of training frontier models has varied enormously depending on scale and efficiency:
| Model | Estimated training cost | Year |
|---|---|---|
| GPT-4 | $100M+ | 2023 |
| Llama 3 405B | ~$25M | 2024 |
| DeepSeek V3 | ~$5.6M | 2024 |
| Llama 3.1 70B | ~$2-5M | 2024 |
The wide range in costs reflects differences in model size, training tokens, hardware efficiency, and engineering optimization. DeepSeek V3's notably low training cost demonstrated that careful engineering and algorithmic efficiency can substantially reduce the compute requirements for competitive models.
As of 2025, NVIDIA commands approximately 85 to 92 percent of the AI accelerator market by revenue, depending on the analyst and the precise market definition used [17]. This dominance rests on several reinforcing factors: the mature CUDA software ecosystem, deep integration with every major AI framework, a consistent cadence of generation-over-generation performance leadership, and a full-stack offering spanning GPUs, NVLink, networking, and complete rack-scale systems.
AMD holds roughly 8 percent of the AI accelerator market, with the remainder split among Intel, Google (TPU, used internally and via Cloud), and various startups. While NVIDIA's percentage share is projected to decline gradually as AMD and custom silicon scale up, NVIDIA's absolute revenue continues to grow because the total AI chip market is expanding rapidly [18].
The proliferation of GPU-powered AI infrastructure has significant environmental implications.
Historically, data center processors ran at 150 to 200 watts per chip. AI GPUs have pushed this dramatically higher: the A100 draws 400W, the H100 draws 700W, the B200 draws 1,000W, and the B300 draws 1,400W. A single GB300 NVL72 rack consumes 132 to 140 kW, comparable to the average electricity draw of roughly 100 American homes.
U.S. data centers consumed 183 terawatt-hours (TWh) of electricity in 2024, representing more than 4% of the country's total electricity consumption. Projections suggest this could grow to 426 TWh by 2030 [19]. Globally, data center electricity consumption was approximately 536 TWh in 2025, with some estimates projecting a doubling to over 1,000 TWh by 2030.
The carbon impact depends heavily on the electricity source. Training a single frontier large language model can consume 50 GWh of energy, roughly equivalent to the annual electricity usage of 4,500 American homes [20]. Estimates suggest AI's annual carbon footprint could reach 32.6 to 79.7 million metric tons of CO2 by 2025. The International Energy Agency estimates that data center emissions will reach about 1% of global CO2 emissions by 2030.
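The homes-equivalent figure can be sanity-checked with one line of arithmetic (the per-home consumption below is an assumed EIA-style average):

```python
TRAINING_RUN_GWH = 50              # frontier-model training estimate from the text
AVG_US_HOME_KWH_PER_YEAR = 10_500  # approximate US average (assumption)

homes_equivalent = TRAINING_RUN_GWH * 1e6 / AVG_US_HOME_KWH_PER_YEAR
print(round(homes_equivalent))  # ~4,760, in line with the ~4,500 homes cited
```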
The shift to higher-TDP chips has accelerated the adoption of liquid cooling in data centers. Air cooling, which was sufficient for 400W chips, becomes impractical at 1,000W and above. NVIDIA's Blackwell-based systems (GB200, GB300) are designed for direct liquid cooling, using cold plates attached to each GPU and CPU with facility water circulated through the rack. This is more energy-efficient than air cooling but requires significant infrastructure investment.
It is worth noting that newer GPUs deliver substantially more computation per watt than their predecessors. Going by the specification table above, the B200 delivers roughly 2.3x the FP16 Tensor performance of the H100 while consuming about 1.4x the power, an improvement of roughly 60% in performance per watt, with larger gains when the new FP8 and FP4 paths are used. NVIDIA argues, with some justification, that upgrading to newer GPU generations is itself an energy efficiency measure, because the same AI workload can be completed with fewer chips in less time.
NVIDIA's Blackwell Ultra products (B300 and GB300 NVL72) began shipping to partners in the second half of 2025. The B300's 288 GB of HBM3e memory and 15 PFLOPS of FP4 compute represent a major step forward for inference workloads, particularly for serving large language models that require enormous amounts of memory for their parameters and key-value caches [21].
NVIDIA has positioned Blackwell Ultra as the platform for "AI reasoning," reflecting the industry's shift toward models that perform multi-step reasoning, chain-of-thought processing, and agentic workflows. These workloads are more inference-heavy and require both high throughput and large memory capacity, which the B300's specifications are designed to address.
NVIDIA has announced the Rubin architecture as the successor to Blackwell, expected in 2026. Rubin is anticipated to use HBM4 memory and a new NVLink generation, continuing the pattern of roughly annual architecture releases. AMD's MI400 series, also expected in 2026 with HBM4, will represent its strongest challenge to NVIDIA to date. Google's TPU v7 (Ironwood), already announced, targets similar performance levels.
The AI GPU market continues to expand rapidly. Capital expenditure on AI infrastructure by the major cloud providers exceeded $200 billion in 2024-2025, with a significant portion directed toward GPU procurement. Sovereign AI initiatives, where governments invest in domestic AI computing capacity, have created additional demand. Meanwhile, the trend toward on-device AI inference is growing, with NVIDIA's DGX Spark (a desktop system based on the GB10 chip, delivering 1 PFLOP of FP4 performance) representing the company's push to bring AI computing to individual researchers and developers [22].
The GPU's transformation from a gaming peripheral to the engine of the AI revolution is one of the most consequential technology shifts of the 21st century. What began as a chip for rendering triangles faster now powers the training of systems that can write code, generate images, translate languages, and reason about complex problems. As AI models continue to scale, the GPU, and the broader accelerator ecosystem it has inspired, will remain at the foundation of the field.