A graphics processing unit (GPU) is a specialized electronic circuit originally designed to accelerate the rendering of images, video, and 2D/3D graphics. Over the past decade, GPUs have become the primary hardware platform for training and running artificial intelligence models, particularly deep learning systems. Their massively parallel architecture makes them far more efficient than traditional CPUs at the matrix arithmetic that underpins modern neural networks.
GPUs now sit at the center of every major AI breakthrough, from large language models like GPT-4 and Gemini to diffusion models for image generation. The world's largest AI training clusters contain tens of thousands of GPUs working together, and the companies that design and manufacture these chips, above all NVIDIA, have become some of the most valuable corporations on earth.
Dedicated graphics hardware existed in various forms throughout the 1980s and 1990s, but NVIDIA coined the term "GPU" in 1999 with the launch of the GeForce 256. That chip could process 10 million polygons per second and offloaded transform and lighting calculations from the CPU. Other vendors, including ATI Technologies (later acquired by AMD) and 3dfx Interactive, competed fiercely in the consumer graphics market during this period.
The turning point for scientific and eventually AI computing came in November 2006, when NVIDIA released CUDA (Compute Unified Device Architecture). CUDA gave software developers a C-like programming interface for harnessing the thousands of parallel cores inside NVIDIA GPUs for tasks beyond graphics rendering [1]. Before CUDA, researchers who wanted to run general-purpose computations on GPUs had to disguise their math as graphics shaders, a cumbersome and error-prone process. CUDA removed that barrier and opened GPU computing to the broader scientific community.
The moment that cemented GPUs as AI hardware arrived on September 30, 2012. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton submitted AlexNet to the ImageNet Large Scale Visual Recognition Challenge. AlexNet, a convolutional neural network with 60 million parameters, achieved a top-5 error rate of 15.3%, beating the runner-up by more than 10 percentage points [2]. Krizhevsky trained the model on two NVIDIA GTX 580 consumer GPUs using his custom cuda-convnet library. The victory demonstrated that deep neural networks, combined with large datasets and GPU-accelerated training, could dramatically outperform hand-engineered computer vision methods.
AlexNet's success rested on the convergence of three developments: the availability of large labeled datasets (ImageNet), general-purpose GPU computing (CUDA), and improved training techniques for deep networks (such as ReLU activations and dropout regularization). The result was an explosion of research into deep learning that has continued unabated ever since.
Recognizing the growing AI workload, NVIDIA expanded its Tesla line of data center GPUs (first introduced in 2007) with the Kepler-based Tesla K20 in 2012. Subsequent generations brought increasingly specialized hardware for AI: the Pascal architecture (2016) introduced the P100 with NVLink interconnects; the Volta architecture (2017) debuted Tensor Cores, specialized matrix-multiply units, in the V100; and the Ampere architecture (2020) produced the A100, which became the workhorse of AI training for several years. Each generation roughly doubled the performance available for mixed-precision deep learning workloads.
Several architectural properties make GPUs well suited to the mathematical operations that dominate AI training and inference.
A modern CPU might have 8 to 128 cores, each optimized for sequential, branching workloads. A modern data center GPU contains thousands of smaller cores organized into streaming multiprocessors (SMs). NVIDIA's H100, for instance, has 132 SMs containing a total of 16,896 CUDA cores. This architecture can execute thousands of threads simultaneously, which maps naturally onto the element-wise and matrix operations found in neural network layers.
The fundamental operation in deep learning, both during training and inference, is matrix multiplication (and its close relatives: convolutions, attention computations, and element-wise transforms). Starting with the Volta architecture in 2017, NVIDIA GPUs include Tensor Cores: dedicated hardware units that perform small matrix-multiply-accumulate operations in a single clock cycle. Tensor Cores accelerate operations in half-precision (FP16), bfloat16 (BF16), and lower-precision formats (FP8, FP4) by orders of magnitude compared to general-purpose CUDA cores.
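The numeric behavior that Tensor Cores implement in hardware, low-precision inputs with higher-precision accumulation, can be sketched in a few lines of NumPy. This emulates the numerics only, not the actual hardware instruction:

```python
import numpy as np

def emulated_mma(a_fp16, b_fp16):
    """Software sketch of a Tensor-Core-style matrix-multiply-accumulate:
    FP16 inputs are multiplied and accumulated in FP32. This mimics the
    numerics, not the HMMA hardware instruction itself."""
    return a_fp16.astype(np.float32) @ b_fp16.astype(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal((16, 16)).astype(np.float16)  # a 16x16 tile
b = rng.standard_normal((16, 16)).astype(np.float16)

reference = a.astype(np.float64) @ b.astype(np.float64)  # high-precision baseline
result = emulated_mma(a, b)
```

Accumulating in FP32 is what keeps long dot products from losing precision even though the inputs are stored in half precision; accumulating in FP16 throughout would drift noticeably as matrices grow.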
Training large models requires moving enormous amounts of data between memory and compute units. Data center GPUs use High Bandwidth Memory (HBM), stacked DRAM packages that deliver far more bandwidth than the DDR5 memory used in CPUs. The A100 provides up to 2 TB/s of memory bandwidth; the H100, 3.35 TB/s; and the B200, 8 TB/s [3]. This bandwidth is critical because AI workloads are frequently memory-bound: the compute units can process data faster than memory can supply it, so higher bandwidth translates directly to higher utilization and throughput.
Hardware alone does not explain GPU dominance. NVIDIA's CUDA ecosystem, built over nearly two decades, includes optimized libraries for deep learning (cuDNN), linear algebra (cuBLAS), and communication (NCCL). Every major AI framework, including PyTorch and TensorFlow, is deeply integrated with CUDA. This software moat has proven at least as important as the hardware itself in sustaining NVIDIA's market position.
The basic building block of an NVIDIA GPU is the streaming multiprocessor (SM). Each SM contains a set of CUDA cores for general floating-point and integer arithmetic, Tensor Cores for matrix operations, a register file, shared memory, and an L1 cache. The GPU's scheduler distributes warps (groups of 32 threads) across SMs, hiding memory latency by switching between warps when one stalls on a memory access.
Tensor Cores perform fused matrix-multiply-accumulate operations on small matrices (for example, 4x4 or 16x16 tiles, depending on the data format). First introduced in the Volta V100, they have evolved through each architecture generation, adding support for new data formats (TF32, BF16, FP8, FP4) and structured sparsity.
Modern data center GPUs include a growing list of features specifically designed for AI workloads:
| Feature | Description | First introduced |
|---|---|---|
| Tensor Cores | Dedicated matrix-multiply-accumulate units | Volta (2017) |
| Structured sparsity | 2:4 sparsity pattern support for 2x speedup with minimal accuracy loss | Ampere (2020) |
| Transformer Engine | Dynamic per-layer FP8/FP16 precision management | Hopper (2022) |
| FP8 data type | 8-bit floating point for training and inference | Hopper (2022) |
| FP4 data type | 4-bit floating point for inference | Blackwell (2024) |
| Thread block clusters | Co-scheduled blocks for distributed shared memory | Hopper (2022) |
| TMA (Tensor Memory Accelerator) | Hardware unit for efficient bulk memory transfers | Hopper (2022) |
| NVLink-C2C | Chip-to-chip interconnect for multi-die designs | Grace Hopper (2023) |
| Decompression engine | Hardware-accelerated data decompression | Blackwell (2024) |
| Confidential computing | Hardware-level encryption for sensitive AI workloads | Hopper (2022) |
The Transformer Engine deserves particular attention. For transformer-based models (which include virtually all modern LLMs), the Transformer Engine monitors the statistical properties of activations at each layer and automatically selects the optimal precision (FP8 vs. higher precision) to maximize throughput while maintaining training accuracy. This eliminates the manual tuning previously required for mixed-precision training.
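As an illustration of the idea only (this is not NVIDIA's actual algorithm, and the quantizer below is a crude 256-level linear grid standing in for real FP8), a precision selector can scale each tensor by its absolute maximum, measure the round-trip error, and fall back to FP16 when the tensor is too hard to quantize:

```python
import numpy as np

def choose_precision(x, tol=0.02):
    """Illustrative sketch, not NVIDIA's algorithm: scale by the tensor's
    amax, round onto a 256-level grid as a stand-in for FP8, and keep
    the low-precision path only if the relative round-trip error is small."""
    amax = np.max(np.abs(x))
    scale = 127.0 / amax
    roundtrip = np.round(x * scale).clip(-127, 127) / scale
    rel_err = np.sqrt(np.mean((x - roundtrip) ** 2) / np.mean(x ** 2))
    return "FP8" if rel_err < tol else "FP16"

rng = np.random.default_rng(0)
well_behaved = rng.standard_normal(4096)                     # typical activations
outlier_heavy = np.append(rng.standard_normal(4096), 1000.0) # one huge outlier

print(choose_precision(well_behaved))   # FP8
print(choose_precision(outlier_heavy))  # FP16
```

The outlier case shows why per-tensor monitoring matters: a single extreme activation forces the scale so large that everything else collapses toward zero, so that layer is better left at higher precision.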
Modern AI GPUs use stacked High Bandwidth Memory. The A100 shipped with HBM2e (80 GB, 2 TB/s); the H100 moved to HBM3 (80 GB, 3.35 TB/s); the H200 upgraded to HBM3e (141 GB, 4.8 TB/s); and Blackwell GPUs use HBM3e in higher-capacity configurations (192 GB on B200, 288 GB on B300). The trend is clear: each generation brings both more capacity and more bandwidth, because model parameter counts and context lengths continue to grow faster than memory technology improves.
Memory bandwidth is one of the most critical specifications for AI GPUs, often more important than peak compute throughput. This is because many AI operations, especially LLM inference, are memory-bandwidth-bound rather than compute-bound.
During autoregressive LLM inference, each new token requires reading through the entire model's weights from memory. For a 70-billion-parameter model in FP16, this means reading approximately 140 GB of data for every single token generated. The theoretical maximum token generation rate is therefore constrained by:
Max tokens/second = Memory bandwidth / Model size in memory
| GPU | Memory bandwidth | Max theoretical tok/s (70B FP16) | Max theoretical tok/s (70B INT4) |
|---|---|---|---|
| A100 (80 GB) | 2.0 TB/s | ~14 | ~57 |
| H100 (80 GB) | 3.35 TB/s | ~24 | ~96 |
| H200 (141 GB) | 4.8 TB/s | ~34 | ~137 |
| B200 (192 GB) | 8.0 TB/s | ~57 | ~229 |
| MI300X (192 GB) | 5.3 TB/s | ~38 | ~152 |
These theoretical limits assume perfect memory access patterns and zero compute overhead. Real-world performance is typically 50-70% of the theoretical maximum due to memory access inefficiencies, KV cache overhead, and compute time. However, the table illustrates why memory bandwidth improvements (A100 to B200: 4x bandwidth increase) translate almost directly into inference speed improvements.
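The bandwidth bound above is straightforward to compute directly. The sketch below reproduces the H100 row of the table:

```python
def max_tokens_per_second(bandwidth_tb_s, params_billion, bytes_per_param):
    """Upper bound on autoregressive decode speed: every generated token
    must stream all model weights from HBM once (ignores KV cache reads
    and compute time, so real throughput is lower)."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes

# 70B-parameter model on an H100 (3.35 TB/s, from the table above)
print(round(max_tokens_per_second(3.35, 70, 2)))    # FP16 (2 bytes/param) -> 24
print(round(max_tokens_per_second(3.35, 70, 0.5)))  # INT4 (0.5 bytes/param) -> 96
```

The INT4 column of the table falls out of the same formula: quantizing weights to 4 bits shrinks the bytes read per token by 4x, which is exactly why aggressive quantization is so effective for inference.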
For training, the picture is more nuanced. Training involves both forward and backward passes with larger batch sizes, which increases arithmetic intensity (the ratio of compute operations to memory accesses). This shifts the bottleneck toward compute throughput for well-optimized training workloads, making Tensor Core performance and TDP (which determines sustained clock speeds) more important during training than during inference.
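The compute-vs-memory crossover can be made concrete with the roofline model: an operation is compute-bound when its arithmetic intensity (FLOPs per byte moved) exceeds the GPU's ratio of peak FLOPS to memory bandwidth, the so-called ridge point. A sketch using the H100 figures from the tables in this article:

```python
def gemm_intensity(m, n, k, bytes_per_element=2):
    """Arithmetic intensity of an MxNxK matrix multiply in FLOPs per byte,
    counting one read of each input matrix and one write of the output."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved

# Ridge point for an H100-class GPU: 1,979 TFLOPS FP16 / 3.35 TB/s
ridge = 1979e12 / 3.35e12  # ~591 FLOPs/byte

training_gemm = gemm_intensity(4096, 4096, 4096)  # large training GEMM, ~1365
decode_gemv = gemm_intensity(1, 4096, 4096)       # batch-1 decode, ~1
```

A big square GEMM (intensity ~N/3) sits well above the ridge point and is compute-bound, while batch-1 decode is a matrix-vector product with intensity near 1 FLOP/byte, hundreds of times below the ridge, which is the roofline-model restatement of why inference is memory-bandwidth-bound.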
Within a server, GPUs communicate over NVLink, a proprietary high-speed interconnect. Per-GPU NVLink bandwidth has grown from 300 GB/s (Volta) to 600 GB/s (Ampere), 900 GB/s (Hopper), and 1.8 TB/s (Blackwell NVLink 5). Between servers, clusters use either InfiniBand or high-speed Ethernet. NVIDIA's Quantum-X800 InfiniBand switches deliver 800 Gb/s per port, while the Spectrum-X platform brings similar bandwidth to Ethernet-based networks.
The following table summarizes the key data center GPUs NVIDIA has released for AI workloads from 2020 through 2025.
| GPU | Architecture | Year | FP16/BF16 Tensor TFLOPS | Memory | Memory bandwidth | TDP | Interconnect |
|---|---|---|---|---|---|---|---|
| A100 (SXM) | Ampere | 2020 | 312 | 80 GB HBM2e | 2.0 TB/s | 400W | NVLink 3 (600 GB/s) |
| H100 (SXM) | Hopper | 2022 | 1,979 | 80 GB HBM3 | 3.35 TB/s | 700W | NVLink 4 (900 GB/s) |
| H200 (SXM) | Hopper | 2024 | 1,979 | 141 GB HBM3e | 4.8 TB/s | 700W | NVLink 4 (900 GB/s) |
| B200 | Blackwell | 2024-2025 | 4,500 (FP16); 9,000 (FP8) | 192 GB HBM3e | 8.0 TB/s | 1,000W | NVLink 5 (1,800 GB/s) |
| B300 | Blackwell Ultra | 2025 | ~4,500 (FP16); 15 PFLOPS (FP4) | 288 GB HBM3e | 8.0 TB/s | 1,400W | NVLink 5 (1,800 GB/s) |
Several trends are visible in this progression. FP16 Tensor Core throughput grew roughly 6x from A100 to H100 in a single generation, then doubled again with Blackwell. Memory capacity has nearly quadrupled from 80 GB to 288 GB. TDP has risen from 400W to 1,400W, reflecting the growing power demands of AI chips and driving a shift toward liquid cooling in data centers. The B300 (Blackwell Ultra), announced by NVIDIA CEO Jensen Huang at GTC 2025, integrates 208 billion transistors on a dual-reticle die design and delivers roughly 1.5x the performance of the B200 [4].
The GB300 NVL72 is NVIDIA's rack-scale AI system, combining 72 Blackwell Ultra GPUs and 36 Grace ARM CPUs interconnected by NVLink 5 through NVSwitch into a single unified memory space. A single rack delivers 1.1 exaFLOPS of FP4 compute and requires 132 to 140 kW of power with direct liquid cooling [5]. This represents exascale AI performance in a single rack, a milestone that would have required an entire building-sized supercomputer only a decade ago.
AMD has steadily expanded its AI GPU lineup through the Instinct series, providing an alternative to NVIDIA's products.
The AMD Instinct MI300X, launched in late 2023, uses the CDNA 3 architecture and features 192 GB of HBM3 memory with approximately 5.3 TB/s of bandwidth. Its TDP is 750W. The MI325X followed in Q4 2024 with 288 GB of HBM3e memory and 6 TB/s of bandwidth at 1,000W [6].
The MI350 series, expected in 2025, uses the CDNA 4 architecture built on 3nm process technology. AMD claims up to a 35x improvement in inference performance over the MI300 series. The MI350X will support FP4 and FP6 data types and offer up to 288 GB of HBM3e memory. The higher-end MI355X variant has a 1,400W TDP [7].
Slated for 2026, the MI400 series will use the next-generation CDNA architecture and HBM4 memory. Early specifications suggest 432 GB of HBM4 at 19.6 TB/s of bandwidth, with 40 PFLOPS of FP4 compute and 20 PFLOPS of FP8 compute per chip [8]. If these numbers hold, the MI400 would be competitive with NVIDIA's next-generation products.
AMD's hardware has become increasingly competitive on paper, but the company faces a significant software ecosystem disadvantage. AMD's ROCm software stack, the equivalent of CUDA, has historically lagged in framework support, library optimization, and ease of use. Major AI labs have been reluctant to invest engineering effort in porting and optimizing their codebases for ROCm when CUDA "just works." AMD has made progress on this front, particularly through partnerships with Meta and Microsoft, and ROCm now integrates directly with PyTorch, TensorFlow, and JAX, allowing teams to move models from NVIDIA to AMD hardware by swapping containers and drivers rather than rewriting code. Still, closing the software gap remains AMD's biggest challenge in AI.
GPUs are not the only chips used for AI. Several companies have developed purpose-built accelerators optimized for specific workloads.
Google has designed its own Tensor Processing Units (TPUs) since 2015. TPUs are application-specific integrated circuits (ASICs) optimized for TensorFlow and, more recently, JAX workloads. Google's TPU v6 (Trillium), available in 2024, delivers approximately 926 TFLOPS of BF16 performance per chip with 32 GB of HBM and improved energy efficiency of over 67% compared to TPU v5e [9]. At Cloud Next 2025, Google unveiled TPU v7 (Ironwood), which delivers 4,614 TFLOPS per chip and ships in configurations of 256 and 9,216 chips. Google uses TPUs internally to train its Gemini family of models and offers them to external customers through Google Cloud.
Amazon Web Services developed Trainium, a custom chip for AI training. Trainium2 became generally available in December 2024, delivering 20.8 PFLOPS of FP8 per 16-chip instance. AWS claims 30 to 40 percent better price-performance than H100-based instances. Trainium3, announced in December 2025, is a 3nm chip providing 2.52 PFLOPS of FP8 compute per chip with 144 GB of HBM3e memory [10].
Intel's Gaudi accelerators targeted the AI training and inference market. Gaudi 3 demonstrated competitive performance against the H100 on certain long-output LLM inference tasks. However, Intel confirmed plans to discontinue the Gaudi line in favor of next-generation GPU products expected in 2026-2027 [11].
Microsoft developed the Maia 100 AI accelerator, a custom chip fabricated on TSMC's 5nm process, for use in Azure data centers. Designed to optimize inference for large language models, Maia reflects the broader trend of hyperscale cloud providers developing in-house silicon to reduce dependency on third-party GPU vendors.
Training frontier AI models requires not just individual GPUs but entire clusters of thousands or tens of thousands of GPUs working in concert. Distributing training across multiple GPUs introduces significant complexity in communication, synchronization, and fault tolerance.
Several complementary strategies are used to distribute training workloads:
| Strategy | How it works | Communication pattern | Best for |
|---|---|---|---|
| Data parallelism | Each GPU holds a full copy of the model and processes different data | AllReduce (gradient sync) | Models that fit in single-GPU memory |
| Tensor parallelism | Individual layers are split across GPUs | AllReduce within each layer | Large layers (attention, FFN) |
| Pipeline parallelism | Different layers assigned to different GPUs | Point-to-point between stages | Very deep models |
| Expert parallelism | MoE experts distributed across GPUs | All-to-all routing | Mixture-of-experts models |
| ZeRO optimization | Model states (optimizer, gradients, weights) partitioned across GPUs | Gather/scatter as needed | Memory-efficient training |
| Sequence parallelism | Long sequences split across GPUs | AllGather for attention | Very long context windows |
Modern frontier model training typically uses a hybrid of multiple strategies. For example, Meta's Llama 3 training used a combination of tensor parallelism (within a node), pipeline parallelism (across nodes), and data parallelism (across node groups), with ZeRO-style optimizer state sharding.
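The core idea of data parallelism is easy to demonstrate: averaging per-shard gradients (the AllReduce step) recovers exactly the full-batch gradient. A minimal NumPy simulation with a linear model and four simulated GPUs:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))  # a batch of 64 samples, 8 features
y = rng.standard_normal(64)
w = np.zeros(8)                   # identical model replica on every "GPU"

def grad(Xb, yb, w):
    """Gradient of mean squared error for a linear model on one shard."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Data parallelism: shard the batch across 4 simulated GPUs, compute
# local gradients, then "AllReduce" by averaging them.
shards = zip(np.split(X, 4), np.split(y, 4))
local_grads = [grad(Xb, yb, w) for Xb, yb in shards]
allreduced = np.mean(local_grads, axis=0)  # matches the full-batch gradient
```

Because the shards are equal-sized, the mean of the local gradients equals the gradient over the whole batch, so every replica takes the same optimizer step and the copies stay synchronized.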
The efficiency of multi-GPU training is fundamentally limited by communication overhead. During data-parallel training, all GPUs must synchronize gradients after each backward pass using an AllReduce operation. The time required for this synchronization depends on the size of the gradient buffers, the bandwidth and latency of the interconnect, and the collective algorithm used (typically a ring- or tree-based AllReduce).
For clusters with fast NVLink interconnects within a node, intra-node communication is rarely the bottleneck. The challenge is inter-node communication, where bandwidth is 10-100x lower. This is why large training clusters invest heavily in InfiniBand networking and why NVIDIA's NVLink Network (extending NVLink across nodes) represents a significant advancement.
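A standard first-order model for the bandwidth cost of a ring AllReduce is that each GPU sends and receives 2(N-1)/N times the buffer size. The sketch below uses assumed per-GPU bandwidths, 900 GB/s for NVLink within a node and 100 GB/s (800 Gb/s) for an inter-node fabric, to show why the inter-node case dominates:

```python
def ring_allreduce_seconds(size_bytes, n_gpus, bus_gb_s):
    """Bandwidth term of a ring AllReduce: each GPU transfers
    2*(N-1)/N times the buffer size (latency terms ignored)."""
    traffic = 2 * (n_gpus - 1) / n_gpus * size_bytes
    return traffic / (bus_gb_s * 1e9)

grads = 70e9 * 2  # 70B parameters of FP16 gradients = 140 GB

intra = ring_allreduce_seconds(grads, 8, 900)  # NVLink within a node
inter = ring_allreduce_seconds(grads, 8, 100)  # 800 Gb/s inter-node fabric
```

At the same GPU count, the synchronization time scales inversely with link bandwidth, so the 9x slower inter-node path takes 9x longer, which is why techniques like gradient bucketing, overlap with the backward pass, and hierarchical AllReduce all focus on the inter-node hop.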
GPU failures are common in large clusters. In a 10,000-GPU cluster, hardware faults may occur multiple times per day, requiring checkpoint-and-restart mechanisms and redundancy planning. NCCL 2.27's communicator shrink feature enables dynamic exclusion of failed GPUs during training, allowing jobs to continue with a reduced GPU count rather than restarting entirely.
Training frameworks like Megatron-LM and FSDP (Fully Sharded Data Parallel) include built-in checkpointing that saves model state periodically, enabling recovery from failures with minimal lost compute. The checkpoint frequency is a trade-off: more frequent checkpoints reduce lost work but consume I/O bandwidth and storage.
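A classic way to reason about this trade-off is the Young/Daly approximation, which balances checkpoint cost against expected lost work. The numbers below (per-GPU MTBF and checkpoint write time) are illustrative assumptions, not measured values:

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young/Daly first-order approximation: compute time between
    checkpoints that minimizes expected lost work plus checkpoint cost."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# 10,000-GPU cluster: if each GPU fails ~once every 5 years (assumption),
# the cluster-level MTBF shrinks to hours.
per_gpu_mtbf_s = 5 * 365 * 24 * 3600
cluster_mtbf_s = per_gpu_mtbf_s / 10_000        # ~4.4 hours between failures

interval = optimal_checkpoint_interval(120, cluster_mtbf_s)  # 2-min checkpoint
print(round(interval / 60))  # checkpoint roughly every half hour
```

With a cluster MTBF of a few hours, roughly half-hourly checkpoints are optimal under these assumptions, consistent with the "multiple failures per day" observation above.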
NVIDIA's DGX line packages multiple GPUs into a single server node with high-speed interconnects. The DGX H100 contained eight H100 GPUs connected by NVLink. The DGX B200 and DGX B300, based on Blackwell and Blackwell Ultra respectively, scale this concept further, with NVIDIA quoting 11x faster inference and 4x faster training for Blackwell Ultra systems compared to the Hopper generation [12].
For larger deployments, NVIDIA offers the DGX SuperPOD, a reference architecture for building AI supercomputers. NVIDIA's own Eos supercomputer, built from 18 H100-based SuperPODs, totals 576 DGX H100 systems with 500 Quantum-2 InfiniBand switches, delivering 18 exaFLOPS of FP8 compute [13].
At cluster scale, inter-node communication becomes a critical bottleneck. The two dominant network fabrics are InfiniBand (long the standard for HPC) and increasingly, high-speed Ethernet. NVIDIA's ConnectX-8 SuperNIC provides 800 Gb/s of network connectivity per GPU. NVIDIA's NVLink Network extends the NVLink protocol beyond a single node, enabling GPU-to-GPU communication across servers with lower latency than traditional network fabrics.
The launch of ChatGPT in November 2022 triggered an unprecedented surge in demand for AI-capable GPUs. By the summer of 2023, NVIDIA's H100 GPUs were sold out with lead times extending into Q1 2024. The shortage had several causes.
First, demand exploded. Every major technology company, along with hundreds of startups, scrambled to build or expand AI training infrastructure. Second, supply was constrained by limited packaging capacity at TSMC. The bottleneck was not the fabrication of the GPU die itself but Chip-on-Wafer-on-Substrate (CoWoS) advanced packaging, which is essential for integrating HBM stacks with the GPU die. TSMC was able to fulfill only about 80% of customer demand for CoWoS capacity [14].
NVIDIA and TSMC responded by investing heavily in expanded CoWoS capacity. NVIDIA committed approximately $2.9 billion toward packaging expansion in Taiwan. By mid-2024, the acute shortage had eased, though supply remained tight. Meta CEO Mark Zuckerberg noted publicly that the GPU shortage in data centers was being alleviated, but he identified electrical power supply as the emerging bottleneck [15].
The shortage had lasting effects on the industry: it fueled the rise of specialized GPU cloud providers, accelerated hyperscalers' investments in custom in-house silicon, and made long-term, prepaid supply commitments a standard feature of GPU procurement.
Organizations running AI workloads face a fundamental economic decision: rent GPU capacity in the cloud or purchase hardware for on-premises deployment.
Cloud GPU pricing has become more competitive as supply has increased and more providers have entered the market. The following table compares approximate on-demand pricing for a single GPU instance across major cloud providers as of early 2026.
| Provider | GPU | Approximate on-demand price (per GPU-hour) |
|---|---|---|
| AWS (p5 instances) | H100 | ~$3.90 |
| Google Cloud | H100 | ~$3.00 |
| Microsoft Azure | H100 | ~$6.98 |
| AWS (p4d instances) | A100 | ~$3.67 |
| RunPod (community cloud) | H100 | ~$1.99 |
| GMI Cloud | H100 | ~$2.10 |
Prices are lower for reserved instances (one- or three-year commitments) and spot/preemptible instances. Spot pricing for H100s on AWS and GCP runs approximately $2.00 to $2.50 per GPU-hour. Analysts expect H100 cloud pricing to fall below $2.00 per GPU-hour universally by mid-2026 as Blackwell-based instances become widely available and older hardware is depreciated [16].
Specialized GPU cloud providers such as Lambda, CoreWeave, RunPod, and Together AI often offer lower prices than the major hyperscalers by focusing exclusively on GPU compute and operating with lower overhead.
| Item | Approximate cost (2025-2026) |
|---|---|
| Single NVIDIA H100 GPU | $25,000-$30,000 |
| Single NVIDIA H200 GPU | $25,000-$35,000 |
| 8-GPU DGX H100 system | $300,000-$400,000 |
| InfiniBand networking (per node) | $10,000-$20,000 |
| Rack infrastructure (power, cooling) | $50,000-$100,000 per rack |
| Annual operations (power, maintenance) | 15-25% of hardware cost |
The cloud-vs-on-premises breakeven depends heavily on utilization rate:
| Utilization scenario | Breakeven timeline | Recommendation |
|---|---|---|
| Intermittent (< 30% utilization) | Never reaches breakeven | Cloud |
| Moderate (30-60% utilization) | 12-18 months | Depends on budget and growth plans |
| High (60-90% utilization) | 6-12 months | On-premises or reserved cloud |
| Continuous (90%+ utilization) | < 4 months | On-premises |
A 2025 Lenovo whitepaper on Generative AI TCO found that on-premises infrastructure achieves breakeven in under four months for high-utilization workloads. The main cost drivers for AI infrastructure are: GPU compute (70-80% of total), data storage and transfer (10-15%), engineering personnel (15-20%), and software/tools (5-10%) [23].
For organizations with continuous training needs, data sensitivity concerns, or long-term AI strategies, on-premises infrastructure offers better total cost of ownership and greater control over the training environment. Cloud is better for intermittent training, startups with limited capital, or projects requiring burst capacity.
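A minimal breakeven model makes the utilization sensitivity explicit. Note that at the on-demand rates quoted above, the breakeven comes out longer than vendor TCO studies suggest (those typically assume different hardware discounts and cloud rates); the absolute months matter less than how sharply they depend on utilization:

```python
def breakeven_months(hw_cost, cloud_rate_per_gpu_hr, n_gpus, utilization,
                     ops_frac=0.20):
    """Months until cumulative cloud rental exceeds the purchase price plus
    ongoing operations cost. Illustrative only: quotes, discounts, and
    power prices shift the result substantially."""
    cloud_monthly = cloud_rate_per_gpu_hr * 730 * utilization * n_gpus
    ops_monthly = hw_cost * ops_frac / 12  # ~20%/yr power and maintenance
    return hw_cost / (cloud_monthly - ops_monthly)

# 8x H100 server (~$350k) vs on-demand H100s at ~$3.90/GPU-hr
high = breakeven_months(350_000, 3.90, 8, 0.90)  # ~24 months at these rates
low = breakeven_months(350_000, 3.90, 8, 0.30)   # far longer at 30% utilization
```

Dropping utilization from 90% to 30% pushes the breakeven out by more than an order of magnitude, which is the quantitative version of the table's "intermittent workloads should stay in the cloud."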
The cost of training frontier models has varied enormously depending on scale and efficiency:
| Model | Estimated training cost | Year |
|---|---|---|
| GPT-4 | $100M+ | 2023 |
| Llama 3 405B | ~$25M | 2024 |
| DeepSeek V3 | ~$5.6M | 2024 |
| Llama 3.1 70B | ~$2-5M | 2024 |
The wide range in costs reflects differences in model size, training tokens, hardware efficiency, and engineering optimization. DeepSeek V3's notably low training cost demonstrated that careful engineering and algorithmic efficiency can substantially reduce the compute requirements for competitive models.
As of 2025, NVIDIA commands approximately 85 to 92 percent of the AI accelerator market by revenue, depending on the analyst and the precise market definition used [17]. This dominance rests on several reinforcing factors: the mature CUDA software ecosystem, deep integration with every major AI framework, a consistent cadence of generation-over-generation performance leadership, and a full-stack offering spanning GPUs, NVLink, networking, and complete rack-scale systems.
AMD holds roughly 8 percent of the AI accelerator market, with the remainder split among Intel, Google (TPU, used internally and via Cloud), and various startups. While NVIDIA's percentage share is projected to decline gradually as AMD and custom silicon scale up, NVIDIA's absolute revenue continues to grow because the total AI chip market is expanding rapidly [18].
The proliferation of GPU-powered AI infrastructure has significant environmental implications.
Historically, data center processors ran at 150 to 200 watts per chip. AI GPUs have pushed this dramatically higher: the A100 draws 400W, the H100 draws 700W, the B200 draws 1,000W, and the B300 draws 1,400W. A single GB300 NVL72 rack consumes 132 to 140 kW, comparable to the average electricity draw of roughly 100 American homes.
U.S. data centers consumed 183 terawatt-hours (TWh) of electricity in 2024, representing more than 4% of the country's total electricity consumption. Projections suggest this could grow to 426 TWh by 2030 [19]. Globally, data center electricity consumption was approximately 536 TWh in 2025, with some estimates projecting a doubling to over 1,000 TWh by 2030.
The carbon impact depends heavily on the electricity source. Training a single frontier large language model can consume 50 GWh of energy, roughly equivalent to the annual electricity usage of 4,500 American homes [20]. Estimates suggest AI's annual carbon footprint could reach 32.6 to 79.7 million metric tons of CO2 by 2025. The International Energy Agency estimates that data center emissions will reach about 1% of global CO2 emissions by 2030.
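The homes-equivalent figure can be sanity-checked with one line of arithmetic (the per-home consumption below is an assumed EIA-style average):

```python
TRAINING_RUN_GWH = 50              # frontier-model training estimate from the text
AVG_US_HOME_KWH_PER_YEAR = 10_500  # approximate US average (assumption)

homes_equivalent = TRAINING_RUN_GWH * 1e6 / AVG_US_HOME_KWH_PER_YEAR
print(round(homes_equivalent))  # ~4,760, in line with the ~4,500 homes cited
```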
The shift to higher-TDP chips has accelerated the adoption of liquid cooling in data centers. Air cooling, which was sufficient for 400W chips, becomes impractical at 1,000W and above. NVIDIA's Blackwell-based systems (GB200, GB300) are designed for direct liquid cooling, using cold plates attached to each GPU and CPU with facility water circulated through the rack. This is more energy-efficient than air cooling but requires significant infrastructure investment.
It is worth noting that newer GPUs deliver substantially more computation per watt than their predecessors. Going by the specification table above, the B200 delivers roughly 2.3x the FP16 Tensor performance of the H100 while consuming about 1.4x the power, an improvement of roughly 60% in performance per watt, with larger gains when the new FP8 and FP4 paths are used. NVIDIA argues, with some justification, that upgrading to newer GPU generations is itself an energy efficiency measure, because the same AI workload can be completed with fewer chips in less time.
NVIDIA's Blackwell Ultra products (B300 and GB300 NVL72) began shipping to partners in the second half of 2025. The B300's 288 GB of HBM3e memory and 15 PFLOPS of FP4 compute represent a major step forward for inference workloads, particularly for serving large language models that require enormous amounts of memory for their parameters and key-value caches [21].
NVIDIA has positioned Blackwell Ultra as the platform for "AI reasoning," reflecting the industry's shift toward models that perform multi-step reasoning, chain-of-thought processing, and agentic workflows. These workloads are more inference-heavy and require both high throughput and large memory capacity, which the B300's specifications are designed to address.
NVIDIA has announced the Rubin architecture as the successor to Blackwell, expected in 2026. Rubin is anticipated to use HBM4 memory and a new NVLink generation, continuing the pattern of roughly annual architecture releases. AMD's MI400 series, also expected in 2026 with HBM4, will represent its strongest challenge to NVIDIA to date. Google's TPU v7 (Ironwood), already announced, targets similar performance levels.
The AI GPU market continues to expand rapidly. Capital expenditure on AI infrastructure by the major cloud providers exceeded $200 billion in 2024-2025, with a significant portion directed toward GPU procurement. Sovereign AI initiatives, where governments invest in domestic AI computing capacity, have created additional demand. Meanwhile, the trend toward on-device AI inference is growing, with NVIDIA's DGX Spark (a desktop system based on the GB10 chip, delivering 1 PFLOP of FP4 performance) representing the company's push to bring AI computing to individual researchers and developers [22].
The GPU's transformation from a gaming peripheral to the engine of the AI revolution is one of the most consequential technology shifts of the 21st century. What began as a chip for rendering triangles faster now powers the training of systems that can write code, generate images, translate languages, and reason about complex problems. As AI models continue to scale, the GPU, and the broader accelerator ecosystem it has inspired, will remain at the foundation of the field.