GPU computing refers to the use of a graphics processing unit (GPU) to perform general-purpose computation that was traditionally handled by the central processing unit (CPU). Also known as general-purpose computing on graphics processing units (GPGPU), this approach exploits the massively parallel architecture of GPUs to accelerate workloads in scientific computing, data analytics, and, most notably, artificial intelligence and machine learning. GPU computing has become the backbone of modern deep learning, powering the training and inference of large language models, computer vision systems, and other neural network applications at scale.
GPUs were originally designed for a single purpose: rendering graphics. Throughout the 1990s, companies like NVIDIA, ATI (later acquired by AMD), and 3dfx developed dedicated hardware to accelerate the pixel-level calculations required for 3D video games and visualization software. These early GPUs contained fixed-function pipelines optimized for transforming vertices and shading pixels, with no provision for arbitrary computation.
A turning point arrived in the early 2000s with the introduction of programmable shaders. Rather than locking developers into a fixed rendering pipeline, GPUs began offering small programmable stages (vertex shaders and pixel shaders) where custom code could execute on each vertex or pixel. Researchers quickly recognized that these programmable stages could be repurposed for non-graphics tasks by encoding data as textures and reading back results from the framebuffer.
In 2003, two research groups independently demonstrated that GPUs could solve general linear algebra problems faster than CPUs, a milestone that drew significant attention from the scientific computing community. The same year, researchers at Stanford University began formalizing the concept of "stream programming" for GPUs.
Ian Buck, a PhD student at Stanford, led the development of BrookGPU, a programming language and compiler that abstracted the graphics-specific details of GPUs into general-purpose programming concepts. Brook allowed scientists to write code for the GPU without understanding 3D graphics APIs like OpenGL or DirectX. Buck's dissertation work demonstrated the potential for GPUs as parallel compute engines and attracted the attention of the industry.
In 2004, Buck joined NVIDIA and, alongside John Nickolls, NVIDIA's director of architecture for GPU computing, began transforming Brook into a production-ready platform. The result was CUDA (Compute Unified Device Architecture), which NVIDIA released publicly in November 2006 alongside its GeForce 8800 GTX GPU based on the Tesla microarchitecture. The initial CUDA SDK became available on February 15, 2007, for Windows and Linux.
In response to CUDA's proprietary nature, the Khronos Group released OpenCL (Open Computing Language) in 2008 as a vendor-neutral standard for parallel programming across GPUs, CPUs, and other accelerators. While OpenCL offered portability, CUDA retained a performance and ecosystem advantage on NVIDIA hardware, and the two frameworks coexisted throughout the 2010s.
In 2009, Stanford researchers Rajat Raina, Anand Madhavan, and Andrew Ng published a seminal paper demonstrating that GPU computing could speed up the training of deep belief networks and other unsupervised learning models by an order of magnitude compared to CPUs. This work, alongside subsequent research by Geoffrey Hinton, Yann LeCun, and others, set the stage for the deep learning revolution. By 2012, when AlexNet won the ImageNet Large Scale Visual Recognition Challenge using two NVIDIA GTX 580 GPUs, GPU-accelerated deep learning had firmly entered the mainstream.
The fundamental advantage of GPUs for AI workloads is their massively parallel architecture. While a modern CPU may contain 8 to 64 high-performance cores, a single data center GPU can contain thousands of smaller cores. The NVIDIA H100, for example, has 16,896 CUDA cores. Each of these cores can execute simple arithmetic operations simultaneously, making GPUs exceptionally well suited for workloads that involve applying the same operation across large datasets.
Neural networks are inherently parallel. During a forward pass, each layer applies a set of weights to input activations through matrix multiplications, and these multiplications can be decomposed into thousands of independent operations that execute concurrently on GPU cores.
NVIDIA GPUs use a Single Instruction, Multiple Threads (SIMT) execution model, a variation of the classical SIMD (Single Instruction, Multiple Data) paradigm. In SIMT, groups of 32 threads called warps execute the same instruction in lockstep across different data elements. This model maps naturally to the regular, data-parallel structure of neural network computations such as matrix multiplications and element-wise activation functions.
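The SIMT idea can be illustrated with a minimal Python sketch (the `warp_execute` and `relu` names are ours, not a real CUDA API): every lane of a 32-thread warp executes the same instruction, each on its own data element.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def relu(x):
    # the single "instruction" that every lane executes
    return x if x > 0.0 else 0.0

def warp_execute(warp_data):
    # SIMT: all 32 lanes apply the same operation to different data elements;
    # on real hardware these run in lockstep, not in a Python loop
    assert len(warp_data) == WARP_SIZE
    return [relu(x) for x in warp_data]

inputs = [float(i - 16) for i in range(WARP_SIZE)]
outputs = warp_execute(inputs)  # negatives clamp to 0.0, positives pass through
```

Element-wise activations like this are the ideal SIMT workload: no lane ever takes a different branch, so the warp never diverges.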
AI training workloads move enormous quantities of data between memory and compute units. GPUs address this with high-bandwidth memory (HBM) technologies that provide far greater throughput than the DDR memory used by CPUs. The NVIDIA A100 offers up to 2,039 GB/s of memory bandwidth using HBM2e, while the H100 reaches 3,350 GB/s with HBM3, and the B200 delivers approximately 8,000 GB/s with HBM3e. This bandwidth is critical for feeding data to the thousands of compute cores fast enough to keep them utilized.
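Memory bandwidth directly bounds inference speed. As a back-of-the-envelope sketch (the helper name is ours; this assumes a purely memory-bound decode where every weight is read once per generated token), bandwidth alone sets a floor on per-token latency:

```python
def min_ms_per_token(params_billions, bytes_per_param, bw_gb_s):
    # lower bound on autoregressive decode latency: all model weights
    # must stream from HBM at least once per generated token
    model_gb = params_billions * bytes_per_param
    return model_gb / bw_gb_s * 1000.0

# a 70B-parameter model in FP16 (2 bytes/param) on an H100 (3,350 GB/s):
# 140 GB / 3,350 GB/s ~= 42 ms per token, before any compute is counted
print(round(min_ms_per_token(70, 2, 3350), 1))
```

This is why each memory-bandwidth jump (HBM2e to HBM3 to HBM3e) translates so directly into faster large-model inference.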
Starting with the Volta architecture in 2017, NVIDIA introduced Tensor Cores, specialized hardware units designed to accelerate matrix multiply-and-accumulate operations at reduced precision (FP16, BF16, INT8, and later FP8 and FP4). Tensor Cores can perform a 4x4 matrix multiply-and-accumulate in a single clock cycle, providing massive speedups for deep learning operations compared to standard CUDA cores. This mixed-precision capability aligns with the observation that many neural network operations tolerate reduced numerical precision without meaningful loss in accuracy.
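The semantics of that fused operation can be emulated in a few lines of Python (a sketch of the math only; real Tensor Cores execute it in hardware at reduced precision, and the function name is ours):

```python
def mma_4x4(a, b, c):
    # the Tensor Core primitive on 4x4 tiles: D = A x B + C
    # (matrix multiply-and-accumulate in one fused step)
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(4))
             for j in range(4)] for i in range(4)]

A = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
Z = [[0.0] * 4 for _ in range(4)]
assert mma_4x4(A, I, Z) == A  # multiplying by the identity returns A
```

Larger matrix products are tiled into many such 4x4 accumulations, which is why Tensor Core throughput dominates deep learning benchmarks.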
General matrix multiplications (GEMMs) are the fundamental building block of deep learning. Fully connected layers, recurrent neural network layers (including LSTMs and GRUs), attention layers in transformers, and even convolutional layers are all implemented internally as matrix multiplications. A GPU can typically compute large matrix products one to two orders of magnitude faster than a CPU, primarily because it can assign each element of the result matrix to its own thread and compute them all in parallel.
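The one-thread-per-output-element decomposition can be sketched in plain Python (function names are ours; a real GPU runs every `(i, j)` pair concurrently rather than looping):

```python
def gemm_thread(A, B, i, j):
    # the work of a single GPU thread: one dot product for one
    # output element C[i][j], independent of all other elements
    return sum(A[i][k] * B[k][j] for k in range(len(B)))

def gemm(A, B):
    # on a GPU, every (i, j) pair is a separate thread executing
    # gemm_thread simultaneously; here we simulate that with loops
    return [[gemm_thread(A, B, i, j) for j in range(len(B[0]))]
            for i in range(len(A))]

C = gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])  # [[19, 22], [43, 50]]
```

Because no output element depends on any other, an M x N result exposes M x N fully independent units of work, exactly the shape of parallelism GPUs are built for.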
Convolutional layers apply learned filters to input feature maps. For computational efficiency, these convolution operations are transformed into matrix multiplications using techniques such as im2col (image to column), which rearranges input patches into rows of a matrix so that the convolution becomes a standard GEMM. This transformation allows GPUs to leverage their highly optimized matrix multiplication routines, such as those in NVIDIA's cuDNN library.
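A minimal single-channel, stride-1 im2col can be written directly (function names are ours; production libraries like cuDNN use far more sophisticated, fused variants of this idea):

```python
def im2col(x, kh, kw):
    # rearrange every kh x kw patch of the input into one row of a matrix
    H, W = len(x), len(x[0])
    return [[x[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            for i in range(H - kh + 1) for j in range(W - kw + 1)]

def conv2d_as_gemm(x, kernel):
    # with patches laid out as rows, convolution reduces to a
    # matrix-vector product against the flattened kernel (a small GEMM)
    kh, kw = len(kernel), len(kernel[0])
    flat_k = [kernel[di][dj] for di in range(kh) for dj in range(kw)]
    return [sum(p * w for p, w in zip(row, flat_k)) for row in im2col(x, kh, kw)]

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
out = conv2d_as_gemm(x, [[1, 0], [0, 1]])  # flattened 2x2 output map
```

The payoff is that the convolution inherits all of the GPU's GEMM optimizations for free, at the cost of some input duplication in the patch matrix.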
During training, GPUs accelerate both the forward pass (computing predictions) and the backward pass (backpropagation, computing gradients). The optimizer step, which updates model weights based on gradients, is also parallelized across GPU cores. Modern training runs for large language models involve thousands of GPUs running in parallel for weeks or months.
During inference, GPUs provide the throughput needed to serve predictions at scale. Techniques such as batching multiple inference requests together and using lower-precision formats (INT8, FP8, FP4) maximize GPU utilization and reduce latency.
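The core of INT8 quantization is a simple rescaling. The sketch below (function names are ours; real deployments use calibrated per-channel scales and fused kernels) shows symmetric per-tensor quantization and the bounded error it introduces:

```python
def quantize_int8(values):
    # symmetric per-tensor quantization: map floats onto [-127, 127]
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale

def dequantize(quants, scale):
    # recover approximate float values from the 8-bit integers
    return [q * scale for q in quants]

weights = [0.5, -1.0, 0.25, 0.9]
quants, scale = quantize_int8(weights)
recovered = dequantize(quants, scale)
# rounding error is bounded by half a quantization step
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, recovered))
```

Storing and multiplying 8-bit integers instead of 16- or 32-bit floats cuts memory traffic and lets the GPU's integer/Tensor Core paths process more elements per cycle.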
The following table summarizes the key architectural differences between CPUs and GPUs that explain their complementary roles in computing.
| Feature | CPU | GPU |
|---|---|---|
| Core count | 4 to 128 high-performance cores | Thousands of simpler cores (e.g., 16,896 CUDA cores on H100) |
| Clock speed | 3.0 to 5.5 GHz | 1.0 to 2.5 GHz |
| Design philosophy | Optimized for low-latency sequential tasks | Optimized for high-throughput parallel tasks |
| Instruction handling | Complex out-of-order execution, branch prediction, speculative execution | SIMT execution; groups of threads run the same instruction |
| Cache per core | Large (MB-scale L2/L3 caches) | Small per-core cache; large shared L2 |
| Memory type | DDR4/DDR5 (tens to a few hundred GB/s per socket) | HBM2e/HBM3/HBM3e (up to ~8,000 GB/s) |
| Memory capacity | Up to several TB (system RAM) | 16 GB to 192 GB per GPU |
| Floating-point units | Fewer but more versatile ALUs | Thousands of specialized FP/INT units and Tensor Cores |
| Power consumption | 65 W to 350 W (typical server CPU) | 250 W to 1,000 W (data center GPU) |
| Best suited for | Operating systems, databases, serial logic, branching code | Matrix math, image processing, neural network training/inference |
CUDA (Compute Unified Device Architecture) is NVIDIA's proprietary parallel computing platform and programming model. First released in 2006, with its SDK following in early 2007, CUDA allows developers to write programs in C, C++, and Fortran (with Python access through libraries such as Numba and CuPy) that execute on NVIDIA GPUs. CUDA abstracts the GPU's hardware into a hierarchy of threads, blocks, and grids, enabling fine-grained control over parallel execution.
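The thread/block/grid hierarchy can be simulated in a few lines of Python (the `launch` and `scale_kernel` names are ours; only the index formula is CUDA's). Each thread computes a global index from its block and thread coordinates and processes one element:

```python
def launch(grid_dim, block_dim, kernel, *args):
    # simulate a 1-D CUDA grid: every (block, thread) pair runs the kernel;
    # on a GPU these all execute concurrently
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

def scale_kernel(block_idx, block_dim, thread_idx, x, out):
    # CUDA's canonical global index: blockIdx.x * blockDim.x + threadIdx.x
    i = block_idx * block_dim + thread_idx
    if i < len(x):  # bounds check, since the grid may overshoot the data
        out[i] = 2 * x[i]

x = list(range(10))
out = [0] * 10
launch(3, 4, scale_kernel, x, out)  # 3 blocks x 4 threads = 12 threads cover 10 elements
```

The grid is sized to cover the data with a whole number of blocks, and the in-kernel bounds check discards the surplus threads, which is the standard CUDA idiom.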
CUDA's dominance in AI is reinforced by a deep software ecosystem:
| Library | Purpose |
|---|---|
| cuDNN | Optimized primitives for deep neural networks (convolution, pooling, normalization, activation) |
| cuBLAS | GPU-accelerated basic linear algebra (GEMM, GEMV) |
| NCCL | Multi-GPU and multi-node collective communication (all-reduce, broadcast) |
| TensorRT | Inference optimization and runtime for deploying trained models |
| Triton Inference Server | Model serving framework for production deployment |
| cuDF / RAPIDS | GPU-accelerated data science and analytics |
| Thrust | High-level parallel algorithms library (sort, scan, reduce) |
All major deep learning frameworks, including PyTorch, TensorFlow, and JAX, rely on CUDA for GPU acceleration on NVIDIA hardware. This ecosystem lock-in, built over nearly two decades, represents one of NVIDIA's most significant competitive advantages.
NVIDIA has released a series of data center GPU architectures, each delivering significant performance improvements for AI workloads.
The original Tesla architecture, powering the GeForce 8800 series and the Tesla C870 data center card, was the first to support CUDA. It introduced unified shaders, replacing the separate vertex and pixel shader units of earlier GPUs with a single pool of programmable processors. While modest by modern standards, it proved the viability of general-purpose GPU computing.
Fermi introduced error-correcting code (ECC) memory, a true L1/L2 cache hierarchy, and improved double-precision floating-point performance, making GPUs viable for scientific high-performance computing (HPC) workloads for the first time.
Kepler improved energy efficiency with its SMX streaming multiprocessor design and introduced Dynamic Parallelism, allowing GPU threads to launch new GPU threads. Maxwell further refined power efficiency and was widely used in early deep learning research.
The Tesla P100, based on the Pascal architecture, was the first NVIDIA GPU to use HBM2 memory, providing 720 GB/s of bandwidth. It introduced NVLink 1.0 for high-speed GPU-to-GPU communication and delivered 10.6 TFLOPS of FP32 performance.
The Tesla V100 was a watershed moment for AI computing. It introduced first-generation Tensor Cores, specialized hardware for mixed-precision matrix operations that delivered up to 120 TFLOPS of deep learning performance (SXM2 variant).
The V100 became the standard GPU for AI research from 2017 to 2020.
The A100 introduced third-generation Tensor Cores with support for TF32, BF16, and structured sparsity (2:4), which could double effective throughput for compatible models.
The H100 brought fourth-generation Tensor Cores and the Transformer Engine, which dynamically selects between FP8 and FP16 precision to maximize throughput for transformer-based models.
The H200, announced in late 2023 and shipping in 2024, retained the Hopper architecture but upgraded to 141 GB of HBM3e memory at 4,800 GB/s, providing significantly more memory capacity and bandwidth for large model inference.
The B200 represents NVIDIA's most powerful single GPU for AI. It uses a novel dual-die design, connecting two reticle-limited dies with a 10 TB/s chip-to-chip interconnect to function as a single unified GPU.
The GB200 is a "Superchip" that pairs two B200 GPUs with one Grace CPU on a single board, providing 384 GB of combined GPU memory. The GB200 NVL72 system connects 72 Blackwell GPUs and 36 Grace CPUs via NVLink 5 and NVSwitch, delivering 130 TB/s of aggregate system bandwidth as a single, unified accelerator.
| GPU | Architecture | Year | Tensor TFLOPS (FP16) | Memory | Bandwidth | TDP |
|---|---|---|---|---|---|---|
| Tesla P100 | Pascal | 2016 | N/A (no Tensor Cores) | 16 GB HBM2 | 720 GB/s | 300 W |
| Tesla V100 | Volta | 2017 | 120 | 16/32 GB HBM2 | 900 GB/s | 300 W |
| A100 | Ampere | 2020 | 312 (624 sparse) | 40/80 GB HBM2e | 2,039 GB/s | 400 W |
| H100 SXM | Hopper | 2022 | 989 (1,979 sparse) | 80 GB HBM3 | 3,350 GB/s | 700 W |
| H200 SXM | Hopper | 2024 | 989 (1,979 sparse) | 141 GB HBM3e | 4,800 GB/s | 700 W |
| B200 | Blackwell | 2024 | ~2,250 (up to ~20 PFLOPS FP4 sparse) | 192 GB HBM3e | ~8,000 GB/s | 1,000 W |
AMD has emerged as NVIDIA's primary competitor in the AI accelerator market with its Instinct series of data center GPUs and the ROCm (Radeon Open Compute) open-source software stack.
| GPU | Architecture | Memory | Memory Bandwidth | Key Features |
|---|---|---|---|---|
| MI250X | CDNA 2 | 128 GB HBM2e | 3,277 GB/s | Dual-die design; 383 TFLOPS FP16 |
| MI300X | CDNA 3 | 192 GB HBM3 | 5,300 GB/s | 8 XCDs on single package; chiplet design |
| MI325X | CDNA 3 | 256 GB HBM3E | 6,000 GB/s | 1.8x the memory capacity of NVIDIA H200 |
| MI350X | CDNA 4 | 288 GB HBM3E | TBD | Launched June 2025; day-zero framework support |
The MI300X has been adopted by major AI companies: AMD reports that seven of the ten largest model builders, among them Meta and Microsoft, run production workloads on Instinct accelerators.
ROCm is AMD's answer to CUDA. It provides an open-source collection of drivers, compilers, libraries, and tools for running HPC and AI workloads on AMD GPUs. ROCm includes HIP (Heterogeneous-compute Interface for Portability), which allows developers to write code that can compile and run on both AMD and NVIDIA GPUs with minimal changes. Key ROCm libraries include rocBLAS (linear algebra), MIOpen (deep learning primitives), and RCCL (collective communications).
ROCm 7, released in 2025, demonstrated up to 3.5x performance improvements in AI inference over ROCm 6.0, with specific gains such as 3.2x faster Llama 3.1 70B training and 3.8x faster DeepSeek R1 inference. Despite these advances, the ROCm ecosystem remains smaller than CUDA's, with some frameworks and libraries offering less mature AMD support.
Google has taken a different approach with its Tensor Processing Units (TPUs), custom-designed ASICs (application-specific integrated circuits) built specifically for machine learning workloads. Unlike GPUs, which are general-purpose parallel processors repurposed for AI, TPUs are designed from the ground up to accelerate matrix multiplications and other neural network operations.
| TPU Version | Year | Key Improvements |
|---|---|---|
| TPU v1 | 2016 | Inference only; deployed internally at Google |
| TPU v2 | 2017 | Added training support; 45 TFLOPS BF16; HBM |
| TPU v3 | 2018 | Liquid cooling; 420 TFLOPS BF16 and 128 GB HBM per 4-chip board |
| TPU v4 | 2021 | 275 TFLOPS BF16; up to 4,096 chips per pod |
| TPU v5e | 2023 | Cost-optimized; 393 TOPS INT8; 2.5x throughput/dollar vs. v4 |
| TPU v5p | 2023 | Performance-optimized; 2x FLOPS and 3x HBM over v4 |
| Trillium (v6e) | 2024 | 4.7x performance over v5e; doubled HBM capacity and bandwidth |
TPUs are available exclusively through Google Cloud and are used internally to train Google's largest models, including Gemini. A TPU v5p pod can connect up to 8,960 chips; at 459 TFLOPS of BF16 per chip, that amounts to roughly 4 exaFLOPS of aggregate compute for large-scale distributed training.
The demand for AI compute has sparked a wave of specialized hardware beyond traditional GPUs and TPUs.
Amazon Web Services develops its own AI chips. Trainium2 is designed for large-scale model training, while Inferentia2 targets cost-efficient inference at 190 TFLOPS of FP16 performance. AWS has announced Trainium3, built on TSMC's 3nm process, which promises double the performance of Trainium2 with 40% better energy efficiency and 2.52 PFLOPS of FP8 per chip.
Intel's Gaudi accelerators (Gaudi 2 and Gaudi 3) were designed as cost-effective alternatives to NVIDIA GPUs for AI training and inference. Gaudi 3, launched in 2024, increased memory capacity for improved LLM efficiency. However, Intel announced plans to discontinue the Gaudi line when its next-generation GPU products launch in 2026-2027.
Cerebras Systems takes a radical approach with its Wafer-Scale Engine (WSE). Rather than cutting a silicon wafer into individual chips, Cerebras uses the entire wafer as a single processor. The WSE-3, announced in 2024, contains approximately 4 trillion transistors, over 900,000 compute cores, and delivers 125 PFLOPS of peak performance. The WSE eliminates the memory bandwidth bottleneck by placing 44 GB of on-chip SRAM directly adjacent to compute cores, achieving approximately 21 PB/s of memory bandwidth.
Groq designs a Language Processing Unit (LPU) optimized for low-latency inference. Each Groq chip contains 230 MB of on-chip SRAM with up to 80 TB/s of on-die memory bandwidth, delivering 750 TOPS at INT8. Groq's deterministic, compiler-driven architecture eliminates the scheduling overhead of traditional GPUs, achieving exceptionally low latency for inference tasks.
SambaNova Systems develops reconfigurable dataflow architecture accelerators. Its SN40L chip and the newer SN50 (unveiled in February 2026) are designed for enterprise AI workloads. The SN50 claims 5x more compute per accelerator and 4x more network bandwidth than its predecessor.
| Accelerator | Vendor | Type | Memory | Peak Performance | Strengths |
|---|---|---|---|---|---|
| H100 SXM | NVIDIA | GPU | 80 GB HBM3 | 3,958 TFLOPS FP8 (sparse) | Ecosystem, Transformer Engine, versatility |
| B200 | NVIDIA | GPU | 192 GB HBM3e | ~20 PFLOPS FP4 (sparse) | Dual-die design, FP4 support, NVLink 5 |
| MI300X | AMD | GPU | 192 GB HBM3 | 1,307 TFLOPS FP16 | High memory capacity, open-source ROCm |
| MI325X | AMD | GPU | 256 GB HBM3E | TBD | Largest memory of any single GPU |
| TPU v5p | Google | ASIC | HBM (per chip) | 459 TFLOPS BF16 | Tight cloud integration, pod scalability |
| Trillium (v6e) | Google | ASIC | Doubled HBM vs. v5e | 4.7x v5e perf | Energy efficiency, cost per FLOP |
| Trainium2 | AWS | ASIC | HBM | Custom benchmarks | AWS ecosystem integration |
| Gaudi 3 | Intel | ASIC | HBM2e | Competitive with H100 | Cost-effective training |
| WSE-3 | Cerebras | Wafer-scale | 44 GB SRAM on-chip | 125 PFLOPS peak | Eliminates memory bandwidth bottleneck |
| LPU | Groq | ASIC | 230 MB SRAM on-chip | 750 TOPS INT8 | Ultra-low latency inference |
Training modern AI models, especially large language models with billions or trillions of parameters, requires distributing computation across multiple GPUs and often multiple servers. Several parallelism strategies have been developed to accomplish this.
Data parallelism is the simplest distributed training strategy. Multiple copies of the entire model are placed on different GPUs, and each GPU processes a different mini-batch of training data. After computing gradients, the GPUs synchronize by performing an all-reduce operation to average the gradients before updating the model weights. PyTorch's DistributedDataParallel (DDP) is the most widely used implementation.
Data parallelism works well when the model fits in the memory of a single GPU. It scales efficiently to hundreds of GPUs, with communication overhead as the primary bottleneck.
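The synchronization step can be sketched in plain Python (the function name is ours; in practice NCCL performs this as a ring or tree all-reduce over NVLink/InfiniBand):

```python
def all_reduce_mean(grads_per_gpu):
    # what DDP does after each backward pass: average gradients
    # element-wise across replicas, so every GPU ends up holding
    # the identical averaged gradient and applies the same update
    world = len(grads_per_gpu)
    avg = [sum(g[i] for g in grads_per_gpu) / world
           for i in range(len(grads_per_gpu[0]))]
    return [avg[:] for _ in range(world)]

# two GPUs computed different gradients on different mini-batches
synced = all_reduce_mean([[1.0, 2.0], [3.0, 4.0]])  # both get [2.0, 3.0]
```

Because every replica applies the same averaged gradient, the model copies stay bit-for-bit synchronized without ever shipping the weights themselves.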
Model parallelism splits the model itself across GPUs rather than the data. This is necessary when a model is too large to fit in a single GPU's memory.
Tensor parallelism splits individual layers (typically large matrix multiplications) across multiple GPUs. For example, a large weight matrix can be divided column-wise across four GPUs, with each GPU computing its portion and then communicating to reconstruct the full result. Tensor parallelism is best suited for GPUs within the same node connected by high-bandwidth NVLink.
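The column-wise split can be sketched as follows (function names are ours; real implementations such as Megatron-LM fuse the communication into optimized collectives):

```python
def split_columns(W, n_gpus):
    # shard a weight matrix column-wise, one shard per GPU
    per = len(W[0]) // n_gpus
    return [[row[g * per:(g + 1) * per] for row in W] for g in range(n_gpus)]

def parallel_matvec(x, shards):
    # each GPU computes its slice of x @ W independently; concatenating
    # the partial outputs (an all-gather) reconstructs the full result
    out = []
    for Wg in shards:
        for j in range(len(Wg[0])):
            out.append(sum(x[i] * Wg[i][j] for i in range(len(x))))
    return out

W = [[1, 2, 3, 4], [5, 6, 7, 8]]
y = parallel_matvec([1, 1], split_columns(W, 2))  # matches the unsharded x @ W
```

Each GPU stores only 1/n of the weight matrix and does 1/n of the multiply work, at the cost of one collective communication per layer.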
Pipeline parallelism assigns different sequential layers (or groups of layers) to different GPUs. GPU 1 computes the first few layers, passes activations to GPU 2 for the next layers, and so on. To minimize idle time ("pipeline bubbles"), micro-batching techniques like GPipe and PipeDream interleave multiple micro-batches through the pipeline.
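The benefit of micro-batching can be quantified with the standard GPipe bubble formula (a simplified model assuming equal per-stage compute time; the function name is ours):

```python
def bubble_fraction(stages, micro_batches):
    # GPipe-style schedule: a batch split into m micro-batches flows
    # through p stages in (m + p - 1) time steps, of which (p - 1)
    # are idle "bubble" steps at each stage
    p, m = stages, micro_batches
    return (p - 1) / (m + p - 1)

# with 4 stages: one monolithic batch idles each GPU 75% of the time,
# while 16 micro-batches shrink the bubble to about 16%
print(bubble_fraction(4, 1), round(bubble_fraction(4, 16), 2))
```

This is why pipeline-parallel training always pairs the stage split with aggressive micro-batching: the bubble shrinks roughly as 1/m.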
Fully Sharded Data Parallel (FSDP) combines the benefits of data and model parallelism. It shards model parameters, gradients, and optimizer states across all GPUs, gathering the full parameters only when needed for forward and backward computation and then immediately re-sharding. FSDP, originally developed at Meta and now integrated into PyTorch, is based on the same principles as DeepSpeed ZeRO Stage 3.
FSDP dramatically reduces per-GPU memory usage, making it possible to train models that would not fit using standard data parallelism, while maintaining competitive throughput.
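The shard/gather cycle at the heart of FSDP can be sketched in a few lines (function names are ours; PyTorch performs the gather as an NCCL all-gather over flattened tensors):

```python
def shard(params, world_size):
    # each rank persistently stores only a 1/world_size slice of the
    # parameters (and likewise of gradients and optimizer states)
    per = (len(params) + world_size - 1) // world_size
    return [params[r * per:(r + 1) * per] for r in range(world_size)]

def all_gather(shards):
    # materialize the full parameter list just-in-time for the forward
    # or backward pass; each rank then immediately frees all but its shard
    return [p for s in shards for p in s]

params = [0.1 * i for i in range(8)]
shards = shard(params, 4)            # per-rank storage: 2 of 8 parameters
assert all_gather(shards) == params  # full weights exist only transiently
```

Steady-state memory per rank drops by roughly the world size, while the transient peak during each layer's gather stays bounded to that layer's parameters.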
With the rise of Mixture of Experts (MoE) architectures, expert parallelism distributes different expert sub-networks across different GPUs. Only a subset of experts is activated for each input token, reducing computation while allowing the total model to have an extremely large parameter count.
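A minimal top-1 routing step looks like this (function names are ours; production MoE routers use learned gating networks, top-2 routing, and capacity limits to balance load across GPUs):

```python
def route_top1(gate_scores, n_experts):
    # top-1 gating: send each token to its highest-scoring expert;
    # under expert parallelism, each expert lives on a different GPU,
    # so this assignment doubles as a communication plan
    assignments = {e: [] for e in range(n_experts)}
    for token, scores in enumerate(gate_scores):
        best = max(range(n_experts), key=lambda e: scores[e])
        assignments[best].append(token)
    return assignments

# three tokens, two experts: tokens 0 and 2 go to expert 1, token 1 to expert 0
plan = route_top1([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]], 2)
```

Since only the selected expert runs per token, compute scales with the active parameters rather than the total parameter count.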
State-of-the-art training systems for the largest models combine data, tensor, and pipeline parallelism simultaneously, an approach commonly called 3D parallelism. Frameworks like Megatron-LM (NVIDIA), DeepSpeed (Microsoft), and Fully Sharded Data Parallel (Meta/PyTorch) provide the tools to orchestrate these complex parallel configurations.
High-bandwidth, low-latency communication between GPUs is critical for efficient distributed training. Three interconnect technologies dominate the landscape.
NVLink is NVIDIA's proprietary high-speed point-to-point interconnect for GPU-to-GPU communication within a node. Each generation has dramatically increased bandwidth:
| NVLink Version | GPU Architecture | Per-GPU Bidirectional Bandwidth |
|---|---|---|
| NVLink 1.0 | Pascal (P100) | 160 GB/s |
| NVLink 2.0 | Volta (V100) | 300 GB/s |
| NVLink 3.0 | Ampere (A100) | 600 GB/s |
| NVLink 4.0 | Hopper (H100) | 900 GB/s |
| NVLink 5.0 | Blackwell (B200) | 1,800 GB/s |
NVSwitch is a dedicated switch chip that enables all-to-all NVLink connectivity among all GPUs within a node. In a DGX system with 8 GPUs and NVSwitch, every GPU can communicate directly with every other GPU at full NVLink bandwidth, avoiding the reduced bandwidth of ring or mesh topologies. The Blackwell-generation DGX B200 connects 8 B200 GPUs via NVLink 5 for up to 14.4 TB/s of aggregate GPU-to-GPU bandwidth per node.
The GB200 NVL72 extends this concept further, using NVSwitch to interconnect 72 GPUs across multiple trays as a single logical accelerator with 130 TB/s of total system bandwidth.
While NVLink connects GPUs within a node, InfiniBand connects nodes across a cluster. InfiniBand is an industry-standard networking protocol designed for high-performance computing, providing low latency and high bandwidth between servers. NVIDIA's Quantum-2 InfiniBand switches support NDR (400 Gb/s per port), and the roadmap includes higher-bandwidth generations. InfiniBand remains the preferred fabric for large-scale AI training clusters, though some deployments use RoCE (RDMA over Converged Ethernet) as an alternative.
The complementary architecture is straightforward: NVLink and NVSwitch handle fast communication within each server node, while InfiniBand connects the nodes for distributed training across the cluster.
Accessing GPU compute for AI training and inference has shifted predominantly to the cloud, where organizations can rent GPU capacity without the capital expenditure of building their own infrastructure.
| Provider | GPU Instances | Key Offerings |
|---|---|---|
| AWS | P5 (H100), P5e (H200), P6 (Blackwell) | SageMaker, EC2 UltraClusters, Trainium instances |
| Azure | ND H100 v5, ND H200 v5 | Azure Machine Learning, Azure AI |
| Google Cloud | A3 (H100), A3 Ultra (H200), TPU pods | Vertex AI, GKE with GPU support |
A new category of "GPU-first" cloud providers (sometimes called "NeoClouds"), including companies such as CoreWeave and Lambda, has emerged to address the specific needs of AI workloads with bare-metal GPU clusters, faster provisioning, and lower prices than the hyperscalers.
GPU cloud pricing has dropped significantly since the peak of the 2023 shortage. H100 GPU instances, which initially commanded premium prices, have seen costs decrease as supply expanded. Specialized providers generally offer 50-70% savings compared to hyperscale cloud providers. AWS cut H100 pricing by approximately 44% in June 2025, and competitive pressure continues to drive costs lower. Spot and preemptible instances can reduce costs by 60-90% below on-demand rates, though with the risk of interruption.
The release of ChatGPT in November 2022 triggered an unprecedented surge in demand for AI compute. Throughout 2023 and 2024, the AI industry experienced a severe GPU shortage that reshaped the economics of AI development.
Spending on GPUs jumped from approximately $30 billion in 2022 to $50 billion in 2023, a 67% increase. By 2024, NVIDIA's data center revenue was expected to reach approximately $85 billion. Hyperscale cloud providers (AWS, Azure, Google Cloud, Oracle) purchased AI GPUs at unprecedented scale, with H100 GPUs sold out through Q1 2024 as of summer 2023. GPUs were often traded at 45-55% above manufacturer's suggested retail price on the secondary market.
Several factors constrained GPU supply, most notably limited capacity for advanced packaging (TSMC's CoWoS) and for high-bandwidth memory production, both of which take years to expand.
The shortage created a visible divide in the AI industry between well-funded organizations with access to large GPU clusters ("GPU rich") and smaller companies, startups, and academic researchers who struggled to secure compute ("GPU poor"). This disparity influenced AI research priorities, with some groups shifting focus to efficiency techniques, smaller models, and inference optimization rather than large-scale training.
The rapid expansion of GPU computing for AI has raised significant concerns about energy consumption and environmental impact.
Each generation of data center GPUs has increased power consumption. The V100 consumed 300 W, the A100 rose to 400 W, the H100 reached 700 W, and the B200 now draws 1,000 W. NVIDIA's roadmap includes GPUs at 1,200 W and 1,500 W in future generations. A single DGX B200 server with 8 GPUs consumes over 8 kW of GPU power alone, not including CPUs, networking, storage, and cooling.
Data centers in the United States consumed approximately 200 terawatt-hours (TWh) of electricity in 2024, with AI-specific servers estimated to account for 53 to 76 TWh. Data centers consumed roughly 4.4% of total U.S. electricity in 2023 and are projected to reach 6.7% to 12% by 2028. As of February 2025, data center firms had requested 40.2 GW of new power connections, nearly double the 21.4 GW requested in July 2024.
The average power density per server rack is expected to increase from 36 kW in 2023 to 50 kW by 2027, driving a shift from air cooling to liquid cooling solutions.
In response to these challenges, the industry is pursuing several approaches, including more energy-efficient accelerators and lower-precision arithmetic, liquid cooling, and siting data centers near abundant or low-carbon power sources.
Water consumption for cooling AI data centers has also drawn scrutiny, particularly in drought-prone regions. Estimates suggest that training a single large language model can consume millions of liters of water when accounting for both direct cooling and the water used in electricity generation.
Photonic computing uses light instead of electrical signals to perform computations. Photonic processors can perform matrix multiplications at the speed of light with extremely low energy consumption. In September 2025, researchers at the University of Shanghai for Science and Technology demonstrated an ultra-compact photonic AI chip, and companies like Lightmatter and Q.ANT are developing commercial photonic accelerators. Q.ANT's NPU 2 processor claims up to 30x lower energy use and 50x higher performance for certain AI and HPC workloads compared to conventional processors. However, photonic computing faces challenges in precision, programmability, and integration with existing digital systems.
Inspired by the structure of biological brains, neuromorphic chips use spiking neural networks and event-driven computation. Intel's Loihi 2 and IBM's NorthPole are research-stage neuromorphic processors that consume orders of magnitude less power than GPUs for certain pattern recognition tasks. Neuromorphic computing is particularly promising for edge AI applications where power budgets are extremely constrained, though it has not yet demonstrated competitiveness with GPUs for training large models.
The trend toward chiplet-based designs, exemplified by AMD's MI300X (which uses multiple compute dies on a single package) and NVIDIA's Blackwell (with its dual-die design), will continue. Advanced packaging technologies like TSMC's CoWoS and its successors allow multiple dies, HBM stacks, and interconnects to be integrated into ever-larger and more powerful composite processors.
While quantum computing is often discussed as a potential successor to classical accelerators for certain AI tasks, practical quantum advantage for machine learning remains years or decades away. Near-term quantum computers lack the qubit counts and error correction needed for useful AI workloads. Hybrid quantum-classical approaches, where a quantum processor handles specific subroutines within a larger classical training pipeline, are an active area of research.