GPU computing refers to the use of a graphics processing unit (GPU) to perform general-purpose computation that was traditionally handled by the central processing unit (CPU). Also known as general-purpose computing on graphics processing units (GPGPU), this approach exploits the massively parallel architecture of GPUs to accelerate workloads in scientific computing, data analytics, and, most notably, artificial intelligence and machine learning. GPU computing has become the backbone of modern deep learning, powering the training and inference of large language models, computer vision systems, and other neural network applications at scale.
GPUs were originally designed for a single purpose: rendering graphics. Throughout the 1990s, companies like NVIDIA, ATI (later acquired by AMD), and 3dfx developed dedicated hardware to accelerate the pixel-level calculations required for 3D video games and visualization software. These early GPUs contained fixed-function pipelines optimized for transforming vertices and shading pixels, with no provision for arbitrary computation.
A turning point arrived in the early 2000s with the introduction of programmable shaders. Rather than locking developers into a fixed rendering pipeline, GPUs began offering small programmable stages (vertex shaders and pixel shaders) where custom code could execute on each vertex or pixel. Researchers quickly recognized that these programmable stages could be repurposed for non-graphics tasks by encoding data as textures and reading back results from the framebuffer.
In 2003, two research groups independently demonstrated that GPUs could solve general linear algebra problems faster than CPUs, a milestone that drew significant attention from the scientific computing community. The same year, researchers at Stanford University began formalizing the concept of "stream programming" for GPUs.
Ian Buck, a PhD student at Stanford, led the development of BrookGPU, a programming language and compiler that abstracted the graphics-specific details of GPUs into general-purpose programming concepts. Brook allowed scientists to write code for the GPU without understanding 3D graphics APIs like OpenGL or DirectX. Buck's dissertation work demonstrated the potential for GPUs as parallel compute engines and attracted the attention of the industry.
In 2004, Buck joined NVIDIA and, alongside John Nickolls, NVIDIA's director of architecture for GPU computing, began transforming Brook into a production-ready platform. The result was CUDA (Compute Unified Device Architecture), which NVIDIA released publicly in November 2006 alongside its GeForce 8800 GTX GPU based on the Tesla microarchitecture. The initial CUDA SDK became available on February 15, 2007, for Windows and Linux.
In response to CUDA's proprietary nature, the Khronos Group released OpenCL (Open Computing Language) in 2008 as a vendor-neutral standard for parallel programming across GPUs, CPUs, and other accelerators. While OpenCL offered portability, CUDA retained a performance and ecosystem advantage on NVIDIA hardware, and the two frameworks coexisted throughout the 2010s.
In 2009, Stanford researchers Rajat Raina, Anand Madhavan, and Andrew Ng published a seminal paper demonstrating that GPU computing could speed up the training of deep belief networks and other unsupervised learning models by an order of magnitude compared to CPUs. This work, alongside subsequent research by Geoffrey Hinton, Yann LeCun, and others, set the stage for the deep learning revolution. By 2012, when AlexNet won the ImageNet Large Scale Visual Recognition Challenge using two NVIDIA GTX 580 GPUs, GPU-accelerated deep learning had firmly entered the mainstream.
The fundamental advantage of GPUs for AI workloads is their massively parallel architecture. While a modern CPU may contain 8 to 64 high-performance cores, a single data center GPU can contain thousands of smaller cores. The NVIDIA H100, for example, has 16,896 CUDA cores. Each of these cores can execute simple arithmetic operations simultaneously, making GPUs exceptionally well suited for workloads that involve applying the same operation across large datasets.
Neural networks are inherently parallel. During a forward pass, each layer applies a set of weights to input activations through matrix multiplications, and these multiplications can be decomposed into thousands of independent operations that execute concurrently on GPU cores.
NVIDIA GPUs use a Single Instruction, Multiple Threads (SIMT) execution model, a variation of the classical SIMD (Single Instruction, Multiple Data) paradigm. In SIMT, groups of 32 threads called warps execute the same instruction in lockstep across different data elements. This model maps naturally to the regular, data-parallel structure of neural network computations such as matrix multiplications and element-wise activation functions.
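The SIMT idea can be illustrated with a minimal Python sketch (the `warp_execute` and `relu` names are ours, not a real CUDA API): every lane of a 32-thread warp executes the same instruction, each on its own data element.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def relu(x):
    # the single "instruction" that every lane executes
    return x if x > 0.0 else 0.0

def warp_execute(warp_data):
    # SIMT: all 32 lanes apply the same operation to different data elements;
    # on real hardware these run in lockstep, not in a Python loop
    assert len(warp_data) == WARP_SIZE
    return [relu(x) for x in warp_data]

inputs = [float(i - 16) for i in range(WARP_SIZE)]
outputs = warp_execute(inputs)  # negatives clamp to 0.0, positives pass through
```

Element-wise activations like this are the ideal SIMT workload: no lane ever takes a different branch, so the warp never diverges.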
AI training workloads move enormous quantities of data between memory and compute units. GPUs address this with high-bandwidth memory (HBM) technologies that provide far greater throughput than the DDR memory used by CPUs. The NVIDIA A100 offers up to 2,039 GB/s of memory bandwidth using HBM2e, while the H100 reaches 3,350 GB/s with HBM3, and the B200 delivers approximately 8,000 GB/s with HBM3e. This bandwidth is critical for feeding data to the thousands of compute cores fast enough to keep them utilized.
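Memory bandwidth directly bounds inference speed. As a back-of-the-envelope sketch (the helper name is ours; this assumes a purely memory-bound decode where every weight is read once per generated token), bandwidth alone sets a floor on per-token latency:

```python
def min_ms_per_token(params_billions, bytes_per_param, bw_gb_s):
    # lower bound on autoregressive decode latency: all model weights
    # must stream from HBM at least once per generated token
    model_gb = params_billions * bytes_per_param
    return model_gb / bw_gb_s * 1000.0

# a 70B-parameter model in FP16 (2 bytes/param) on an H100 (3,350 GB/s):
# 140 GB / 3,350 GB/s ~= 42 ms per token, before any compute is counted
print(round(min_ms_per_token(70, 2, 3350), 1))
```

This is why each memory-bandwidth jump (HBM2e to HBM3 to HBM3e) translates so directly into faster large-model inference.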
Starting with the Volta architecture in 2017, NVIDIA introduced Tensor Cores, specialized hardware units designed to accelerate matrix multiply-and-accumulate operations at reduced precision (FP16, BF16, INT8, and later FP8 and FP4). Tensor Cores can perform a 4x4 matrix multiply-and-accumulate in a single clock cycle, providing massive speedups for deep learning operations compared to standard CUDA cores. This mixed-precision capability aligns with the observation that many neural network operations tolerate reduced numerical precision without meaningful loss in accuracy.
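The semantics of that fused operation can be emulated in a few lines of Python (a sketch of the math only; real Tensor Cores execute it in hardware at reduced precision, and the function name is ours):

```python
def mma_4x4(a, b, c):
    # the Tensor Core primitive on 4x4 tiles: D = A x B + C
    # (matrix multiply-and-accumulate in one fused step)
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(4))
             for j in range(4)] for i in range(4)]

A = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
Z = [[0.0] * 4 for _ in range(4)]
assert mma_4x4(A, I, Z) == A  # multiplying by the identity returns A
```

Larger matrix products are tiled into many such 4x4 accumulations, which is why Tensor Core throughput dominates deep learning benchmarks.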
General matrix multiplications (GEMMs) are the fundamental building block of deep learning. Fully connected layers, recurrent neural network layers (including LSTMs and GRUs), attention layers in transformers, and even convolutional layers are all implemented internally as matrix multiplications. A GPU can typically compute large matrix products one to two orders of magnitude faster than a CPU, primarily because it can assign each element of the result matrix to its own thread and compute them all in parallel.
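The one-thread-per-output-element decomposition can be sketched in plain Python (function names are ours; a real GPU runs every `(i, j)` pair concurrently rather than looping):

```python
def gemm_thread(A, B, i, j):
    # the work of a single GPU thread: one dot product for one
    # output element C[i][j], independent of all other elements
    return sum(A[i][k] * B[k][j] for k in range(len(B)))

def gemm(A, B):
    # on a GPU, every (i, j) pair is a separate thread executing
    # gemm_thread simultaneously; here we simulate that with loops
    return [[gemm_thread(A, B, i, j) for j in range(len(B[0]))]
            for i in range(len(A))]

C = gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])  # [[19, 22], [43, 50]]
```

Because no output element depends on any other, an M x N result exposes M x N fully independent units of work, exactly the shape of parallelism GPUs are built for.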
Convolutional layers apply learned filters to input feature maps. For computational efficiency, these convolution operations are transformed into matrix multiplications using techniques such as im2col (image to column), which rearranges input patches into rows of a matrix so that the convolution becomes a standard GEMM. This transformation allows GPUs to leverage their highly optimized matrix multiplication routines, such as those in NVIDIA's cuDNN library.
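A minimal single-channel, stride-1 im2col can be written directly (function names are ours; production libraries like cuDNN use far more sophisticated, fused variants of this idea):

```python
def im2col(x, kh, kw):
    # rearrange every kh x kw patch of the input into one row of a matrix
    H, W = len(x), len(x[0])
    return [[x[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            for i in range(H - kh + 1) for j in range(W - kw + 1)]

def conv2d_as_gemm(x, kernel):
    # with patches laid out as rows, convolution reduces to a
    # matrix-vector product against the flattened kernel (a small GEMM)
    kh, kw = len(kernel), len(kernel[0])
    flat_k = [kernel[di][dj] for di in range(kh) for dj in range(kw)]
    return [sum(p * w for p, w in zip(row, flat_k)) for row in im2col(x, kh, kw)]

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
out = conv2d_as_gemm(x, [[1, 0], [0, 1]])  # flattened 2x2 output map
```

The payoff is that the convolution inherits all of the GPU's GEMM optimizations for free, at the cost of some input duplication in the patch matrix.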
During training, GPUs accelerate both the forward pass (computing predictions) and the backward pass (backpropagation, computing gradients). The optimizer step, which updates model weights based on gradients, is also parallelized across GPU cores. Modern training runs for large language models involve thousands of GPUs running in parallel for weeks or months.
During inference, GPUs provide the throughput needed to serve predictions at scale. Techniques such as batching multiple inference requests together and using lower-precision formats (INT8, FP8, FP4) maximize GPU utilization and reduce latency.
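The core of INT8 quantization is a simple rescaling. The sketch below (function names are ours; real deployments use calibrated per-channel scales and fused kernels) shows symmetric per-tensor quantization and the bounded error it introduces:

```python
def quantize_int8(values):
    # symmetric per-tensor quantization: map floats onto [-127, 127]
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale

def dequantize(quants, scale):
    # recover approximate float values from the 8-bit integers
    return [q * scale for q in quants]

weights = [0.5, -1.0, 0.25, 0.9]
quants, scale = quantize_int8(weights)
recovered = dequantize(quants, scale)
# rounding error is bounded by half a quantization step
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, recovered))
```

Storing and multiplying 8-bit integers instead of 16- or 32-bit floats cuts memory traffic and lets the GPU's integer/Tensor Core paths process more elements per cycle.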
The following table summarizes the key architectural differences between CPUs and GPUs that explain their complementary roles in computing.
| Feature | CPU | GPU |
|---|---|---|
| Core count | 4 to 128 high-performance cores | Thousands of simpler cores (e.g., 16,896 CUDA cores on H100) |
| Clock speed | 3.0 to 5.5 GHz | 1.0 to 2.5 GHz |
| Design philosophy | Optimized for low-latency sequential tasks | Optimized for high-throughput parallel tasks |
| Instruction handling | Complex out-of-order execution, branch prediction, speculative execution | SIMT execution; groups of threads run the same instruction |
| Cache per core | Large (MB-scale L2/L3 caches) | Small per-core cache; large shared L2 |
| Memory type | DDR4/DDR5 (tens to a few hundred GB/s per socket) | HBM2e/HBM3/HBM3e (up to ~8,000 GB/s) |
| Memory capacity | Up to several TB (system RAM) | 16 GB to 192 GB per GPU |
| Floating-point units | Fewer but more versatile ALUs | Thousands of specialized FP/INT units and Tensor Cores |
| Power consumption | 65 W to 350 W (typical server CPU) | 250 W to 1,000 W (data center GPU) |
| Best suited for | Operating systems, databases, serial logic, branching code | Matrix math, image processing, neural network training/inference |
CUDA (Compute Unified Device Architecture) is NVIDIA's proprietary parallel computing platform and programming model. First released in 2006, with its SDK following in early 2007, CUDA allows developers to write programs in C, C++, and Fortran (with Python access through libraries such as Numba and CuPy) that execute on NVIDIA GPUs. CUDA abstracts the GPU's hardware into a hierarchy of threads, blocks, and grids, enabling fine-grained control over parallel execution.
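The thread/block/grid hierarchy can be simulated in a few lines of Python (the `launch` and `scale_kernel` names are ours; only the index formula is CUDA's). Each thread computes a global index from its block and thread coordinates and processes one element:

```python
def launch(grid_dim, block_dim, kernel, *args):
    # simulate a 1-D CUDA grid: every (block, thread) pair runs the kernel;
    # on a GPU these all execute concurrently
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

def scale_kernel(block_idx, block_dim, thread_idx, x, out):
    # CUDA's canonical global index: blockIdx.x * blockDim.x + threadIdx.x
    i = block_idx * block_dim + thread_idx
    if i < len(x):  # bounds check, since the grid may overshoot the data
        out[i] = 2 * x[i]

x = list(range(10))
out = [0] * 10
launch(3, 4, scale_kernel, x, out)  # 3 blocks x 4 threads = 12 threads cover 10 elements
```

The grid is sized to cover the data with a whole number of blocks, and the in-kernel bounds check discards the surplus threads, which is the standard CUDA idiom.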
CUDA's dominance in AI is reinforced by a deep software ecosystem:
| Library | Purpose |
|---|---|
| cuDNN | Optimized primitives for deep neural networks (convolution, pooling, normalization, activation) |
| cuBLAS | GPU-accelerated basic linear algebra (GEMM, GEMV) |
| NCCL | Multi-GPU and multi-node collective communication (all-reduce, broadcast) |
| TensorRT | Inference optimization and runtime for deploying trained models |
| Triton Inference Server | Model serving framework for production deployment |
| cuDF / RAPIDS | GPU-accelerated data science and analytics |
| Thrust | High-level parallel algorithms library (sort, scan, reduce) |
All major deep learning frameworks, including PyTorch, TensorFlow, and JAX, rely on CUDA for GPU acceleration on NVIDIA hardware. This ecosystem lock-in, built over nearly two decades, represents one of NVIDIA's most significant competitive advantages.
NVIDIA has released a series of data center GPU architectures, each delivering significant performance improvements for AI workloads.
The original Tesla architecture, powering the GeForce 8800 series and the Tesla C870 data center card, was the first to support CUDA. It introduced unified shaders, replacing the separate vertex and pixel shader units of earlier GPUs with a single pool of programmable processors. While modest by modern standards, it proved the viability of general-purpose GPU computing.
Fermi introduced error-correcting code (ECC) memory, a true L1/L2 cache hierarchy, and improved double-precision floating-point performance, making GPUs viable for scientific high-performance computing (HPC) workloads for the first time.
Kepler improved energy efficiency with its SMX streaming multiprocessor design and introduced Dynamic Parallelism, allowing GPU threads to launch new GPU threads. Maxwell further refined power efficiency and was widely used in early deep learning research.
The Tesla P100, based on the Pascal architecture, was the first NVIDIA GPU to use HBM2 memory, providing 720 GB/s of bandwidth. It introduced NVLink 1.0 for high-speed GPU-to-GPU communication and delivered 10.6 TFLOPS of FP32 performance.
The Tesla V100 was a watershed moment for AI computing. It introduced first-generation Tensor Cores, specialized hardware for mixed-precision matrix operations that delivered up to 120 TFLOPS of deep learning performance (SXM2 variant).
The V100 became the standard GPU for AI research from 2017 to 2020.
The A100 introduced third-generation Tensor Cores with support for TF32, BF16, and structured sparsity (2:4), which could double effective throughput for compatible models.
The H100 brought fourth-generation Tensor Cores and the Transformer Engine, which dynamically selects between FP8 and FP16 precision to maximize throughput for transformer-based models.
The H200, announced in late 2023 and shipping in 2024, retained the Hopper architecture but upgraded to 141 GB of HBM3e memory at 4,800 GB/s, providing significantly more memory capacity and bandwidth for large model inference.
The B200 represents NVIDIA's most powerful single GPU for AI. It uses a novel dual-die design, connecting two reticle-limited dies with a 10 TB/s chip-to-chip interconnect to function as a single unified GPU.
The GB200 is a "Superchip" that pairs two B200 GPUs with one Grace CPU on a single board, providing 384 GB of combined GPU memory. The GB200 NVL72 system connects 72 Blackwell GPUs and 36 Grace CPUs via NVLink 5 and NVSwitch, delivering 130 TB/s of aggregate system bandwidth as a single, unified accelerator.
| GPU | Architecture | Year | Tensor TFLOPS (FP16) | Memory | Bandwidth | TDP |
|---|---|---|---|---|---|---|
| Tesla P100 | Pascal | 2016 | N/A (no Tensor Cores) | 16 GB HBM2 | 720 GB/s | 300 W |
| Tesla V100 | Volta | 2017 | 120 | 16/32 GB HBM2 | 900 GB/s | 300 W |
| A100 | Ampere | 2020 | 312 (624 sparse) | 40/80 GB HBM2e | 2,039 GB/s | 400 W |
| H100 SXM | Hopper | 2022 | 989 (1,979 sparse) | 80 GB HBM3 | 3,350 GB/s | 700 W |
| H200 SXM | Hopper | 2024 | 989 (1,979 sparse) | 141 GB HBM3e | 4,800 GB/s | 700 W |
| B200 | Blackwell | 2024 | ~2,250 (up to ~20 PFLOPS FP4 sparse) | 192 GB HBM3e | ~8,000 GB/s | 1,000 W |
AMD has emerged as NVIDIA's primary competitor in the AI accelerator market with its Instinct series of data center GPUs and the ROCm (Radeon Open Compute) open-source software stack.
| GPU | Architecture | Memory | Memory Bandwidth | Key Features |
|---|---|---|---|---|
| MI250X | CDNA 2 | 128 GB HBM2e | 3,277 GB/s | Dual-die design; 383 TFLOPS FP16 |
| MI300X | CDNA 3 | 192 GB HBM3 | 5,300 GB/s | 8 XCDs on single package; chiplet design |
| MI325X | CDNA 3 | 256 GB HBM3E | 6,000 GB/s | 1.8x the memory capacity of NVIDIA H200 |
| MI350X | CDNA 4 | 288 GB HBM3E | TBD | Launched June 2025; day-zero framework support |
The MI300X has been adopted by major AI companies: AMD reports that seven of the ten largest model builders, among them Meta and Microsoft, run production workloads on Instinct accelerators.
ROCm is AMD's answer to CUDA. It provides an open-source collection of drivers, compilers, libraries, and tools for running HPC and AI workloads on AMD GPUs. ROCm includes HIP (Heterogeneous-compute Interface for Portability), which allows developers to write code that can compile and run on both AMD and NVIDIA GPUs with minimal changes. Key ROCm libraries include rocBLAS (linear algebra), MIOpen (deep learning primitives), and RCCL (collective communications).
ROCm 7, released in 2025, demonstrated up to 3.5x performance improvements in AI inference over ROCm 6.0, with specific gains such as 3.2x faster Llama 3.1 70B training and 3.8x faster DeepSeek R1 inference. Despite these advances, the ROCm ecosystem remains smaller than CUDA's, with some frameworks and libraries offering less mature AMD support.
Google has taken a different approach with its Tensor Processing Units (TPUs), custom-designed ASICs (application-specific integrated circuits) built specifically for machine learning workloads. Unlike GPUs, which are general-purpose parallel processors repurposed for AI, TPUs are designed from the ground up to accelerate matrix multiplications and other neural network operations.
| TPU Version | Year | Key Improvements |
|---|---|---|
| TPU v1 | 2016 | Inference only; deployed internally at Google |
| TPU v2 | 2017 | Added training support; 45 TFLOPS BF16; HBM |
| TPU v3 | 2018 | Liquid cooling; 420 TFLOPS BF16 and 128 GB HBM per 4-chip board |
| TPU v4 | 2021 | 275 TFLOPS BF16; up to 4,096 chips per pod |
| TPU v5e | 2023 | Cost-optimized; 393 TOPS INT8; 2.5x throughput/dollar vs. v4 |
| TPU v5p | 2023 | Performance-optimized; 2x FLOPS and 3x HBM over v4 |
| Trillium (v6e) | 2024 | 4.7x performance over v5e; doubled HBM capacity and bandwidth |
TPUs are available exclusively through Google Cloud and are used internally to train Google's largest models, including Gemini. A TPU v5p pod can connect up to 8,960 chips; at 459 TFLOPS of BF16 per chip, that amounts to roughly 4 exaFLOPS of aggregate compute for large-scale distributed training.
The demand for AI compute has sparked a wave of specialized hardware beyond traditional GPUs and TPUs.
Amazon Web Services develops its own AI chips. Trainium2 is designed for large-scale model training, while Inferentia2 targets cost-efficient inference at 190 TFLOPS of FP16 performance. AWS has announced Trainium3, built on TSMC's 3nm process, which promises double the performance of Trainium2 with 40% better energy efficiency and 2.52 PFLOPS of FP8 per chip.
Intel's Gaudi accelerators (Gaudi 2 and Gaudi 3) were designed as cost-effective alternatives to NVIDIA GPUs for AI training and inference. Gaudi 3, launched in 2024, increased memory capacity for improved LLM efficiency. However, Intel announced plans to discontinue the Gaudi line when its next-generation GPU products launch in 2026-2027.
Cerebras Systems takes a radical approach with its Wafer-Scale Engine (WSE). Rather than cutting a silicon wafer into individual chips, Cerebras uses the entire wafer as a single processor. The WSE-3, announced in 2024, contains approximately 4 trillion transistors, over 900,000 compute cores, and delivers 125 PFLOPS of peak performance. The WSE eliminates the memory bandwidth bottleneck by placing 44 GB of on-chip SRAM directly adjacent to compute cores, achieving approximately 21 PB/s of memory bandwidth.
Groq designs a Language Processing Unit (LPU) optimized for low-latency inference. Each Groq chip contains 230 MB of on-chip SRAM with up to 80 TB/s of on-die memory bandwidth, delivering 750 TOPS at INT8. Groq's deterministic, compiler-driven architecture eliminates the scheduling overhead of traditional GPUs, achieving exceptionally low latency for inference tasks.
SambaNova Systems develops reconfigurable dataflow architecture accelerators. Its SN40L chip and the newer SN50 (unveiled in February 2026) are designed for enterprise AI workloads. The SN50 claims 5x more compute per accelerator and 4x more network bandwidth than its predecessor.
| Accelerator | Vendor | Type | Memory | Peak Performance | Strengths |
|---|---|---|---|---|---|
| H100 SXM | NVIDIA | GPU | 80 GB HBM3 | 3,958 TFLOPS FP8 (sparse) | Ecosystem, Transformer Engine, versatility |
| B200 | NVIDIA | GPU | 192 GB HBM3e | ~20 PFLOPS FP4 (sparse) | Dual-die design, FP4 support, NVLink 5 |
| MI300X | AMD | GPU | 192 GB HBM3 | 1,307 TFLOPS FP16 | High memory capacity, open-source ROCm |
| MI325X | AMD | GPU | 256 GB HBM3E | TBD | Largest memory of any single GPU |
| TPU v5p | Google | ASIC | HBM (per chip) | 459 TFLOPS BF16 | Tight cloud integration, pod scalability |
| Trillium (v6e) | Google | ASIC | Doubled HBM vs. v5e | 4.7x v5e perf | Energy efficiency, cost per FLOP |
| Trainium2 | AWS | ASIC | HBM | Custom benchmarks | AWS ecosystem integration |
| Gaudi 3 | Intel | ASIC | HBM2e | Competitive with H100 | Cost-effective training |
| WSE-3 | Cerebras | Wafer-scale | 44 GB SRAM on-chip | 125 PFLOPS peak | Eliminates memory bandwidth bottleneck |
| LPU | Groq | ASIC | 230 MB SRAM on-chip | 750 TOPS INT8 | Ultra-low latency inference |
Training modern AI models, especially large language models with billions or trillions of parameters, requires distributing computation across multiple GPUs and often multiple servers. Several parallelism strategies have been developed to accomplish this.
Data parallelism is the simplest distributed training strategy. Multiple copies of the entire model are placed on different GPUs, and each GPU processes a different mini-batch of training data. After computing gradients, the GPUs synchronize by performing an all-reduce operation to average the gradients before updating the model weights. PyTorch's DistributedDataParallel (DDP) is the most widely used implementation.
Data parallelism works well when the model fits in the memory of a single GPU. It scales efficiently to hundreds of GPUs, with communication overhead as the primary bottleneck.
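The synchronization step can be sketched in plain Python (the function name is ours; in practice NCCL performs this as a ring or tree all-reduce over NVLink/InfiniBand):

```python
def all_reduce_mean(grads_per_gpu):
    # what DDP does after each backward pass: average gradients
    # element-wise across replicas, so every GPU ends up holding
    # the identical averaged gradient and applies the same update
    world = len(grads_per_gpu)
    avg = [sum(g[i] for g in grads_per_gpu) / world
           for i in range(len(grads_per_gpu[0]))]
    return [avg[:] for _ in range(world)]

# two GPUs computed different gradients on different mini-batches
synced = all_reduce_mean([[1.0, 2.0], [3.0, 4.0]])  # both get [2.0, 3.0]
```

Because every replica applies the same averaged gradient, the model copies stay bit-for-bit synchronized without ever shipping the weights themselves.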
Model parallelism splits the model itself across GPUs rather than the data. This is necessary when a model is too large to fit in a single GPU's memory.
Tensor parallelism splits individual layers (typically large matrix multiplications) across multiple GPUs. For example, a large weight matrix can be divided column-wise across four GPUs, with each GPU computing its portion and then communicating to reconstruct the full result. Tensor parallelism is best suited for GPUs within the same node connected by high-bandwidth NVLink.
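The column-wise split can be sketched as follows (function names are ours; real implementations such as Megatron-LM fuse the communication into optimized collectives):

```python
def split_columns(W, n_gpus):
    # shard a weight matrix column-wise, one shard per GPU
    per = len(W[0]) // n_gpus
    return [[row[g * per:(g + 1) * per] for row in W] for g in range(n_gpus)]

def parallel_matvec(x, shards):
    # each GPU computes its slice of x @ W independently; concatenating
    # the partial outputs (an all-gather) reconstructs the full result
    out = []
    for Wg in shards:
        for j in range(len(Wg[0])):
            out.append(sum(x[i] * Wg[i][j] for i in range(len(x))))
    return out

W = [[1, 2, 3, 4], [5, 6, 7, 8]]
y = parallel_matvec([1, 1], split_columns(W, 2))  # matches the unsharded x @ W
```

Each GPU stores only 1/n of the weight matrix and does 1/n of the multiply work, at the cost of one collective communication per layer.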
Pipeline parallelism assigns different sequential layers (or groups of layers) to different GPUs. GPU 1 computes the first few layers, passes activations to GPU 2 for the next layers, and so on. To minimize idle time ("pipeline bubbles"), micro-batching techniques like GPipe and PipeDream interleave multiple micro-batches through the pipeline.
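The benefit of micro-batching can be quantified with the standard GPipe bubble formula (a simplified model assuming equal per-stage compute time; the function name is ours):

```python
def bubble_fraction(stages, micro_batches):
    # GPipe-style schedule: a batch split into m micro-batches flows
    # through p stages in (m + p - 1) time steps, of which (p - 1)
    # are idle "bubble" steps at each stage
    p, m = stages, micro_batches
    return (p - 1) / (m + p - 1)

# with 4 stages: one monolithic batch idles each GPU 75% of the time,
# while 16 micro-batches shrink the bubble to about 16%
print(bubble_fraction(4, 1), round(bubble_fraction(4, 16), 2))
```

This is why pipeline-parallel training always pairs the stage split with aggressive micro-batching: the bubble shrinks roughly as 1/m.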
Fully Sharded Data Parallel (FSDP) combines the benefits of data and model parallelism. It shards model parameters, gradients, and optimizer states across all GPUs, gathering the full parameters only when needed for forward and backward computation and then immediately re-sharding. FSDP, originally developed at Meta and now integrated into PyTorch, is based on the same principles as DeepSpeed ZeRO Stage 3.
FSDP dramatically reduces per-GPU memory usage, making it possible to train models that would not fit using standard data parallelism, while maintaining competitive throughput.
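The shard/gather cycle at the heart of FSDP can be sketched in a few lines (function names are ours; PyTorch performs the gather as an NCCL all-gather over flattened tensors):

```python
def shard(params, world_size):
    # each rank persistently stores only a 1/world_size slice of the
    # parameters (and likewise of gradients and optimizer states)
    per = (len(params) + world_size - 1) // world_size
    return [params[r * per:(r + 1) * per] for r in range(world_size)]

def all_gather(shards):
    # materialize the full parameter list just-in-time for the forward
    # or backward pass; each rank then immediately frees all but its shard
    return [p for s in shards for p in s]

params = [0.1 * i for i in range(8)]
shards = shard(params, 4)            # per-rank storage: 2 of 8 parameters
assert all_gather(shards) == params  # full weights exist only transiently
```

Steady-state memory per rank drops by roughly the world size, while the transient peak during each layer's gather stays bounded to that layer's parameters.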
With the rise of Mixture of Experts (MoE) architectures, expert parallelism distributes different expert sub-networks across different GPUs. Only a subset of experts is activated for each input token, reducing computation while allowing the total model to have an extremely large parameter count.
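A minimal top-1 routing step looks like this (function names are ours; production MoE routers use learned gating networks, top-2 routing, and capacity limits to balance load across GPUs):

```python
def route_top1(gate_scores, n_experts):
    # top-1 gating: send each token to its highest-scoring expert;
    # under expert parallelism, each expert lives on a different GPU,
    # so this assignment doubles as a communication plan
    assignments = {e: [] for e in range(n_experts)}
    for token, scores in enumerate(gate_scores):
        best = max(range(n_experts), key=lambda e: scores[e])
        assignments[best].append(token)
    return assignments

# three tokens, two experts: tokens 0 and 2 go to expert 1, token 1 to expert 0
plan = route_top1([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]], 2)
```

Since only the selected expert runs per token, compute scales with the active parameters rather than the total parameter count.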
State-of-the-art training systems for the largest models combine data, tensor, and pipeline parallelism simultaneously, an approach commonly called 3D parallelism. Frameworks like Megatron-LM (NVIDIA), DeepSpeed (Microsoft), and Fully Sharded Data Parallel (Meta/PyTorch) provide the tools to orchestrate these complex parallel configurations.
High-bandwidth, low-latency communication between GPUs is critical for efficient distributed training. Three interconnect technologies dominate the landscape.
NVLink is NVIDIA's proprietary high-speed point-to-point interconnect for GPU-to-GPU communication within a node. Each generation has dramatically increased bandwidth:
| NVLink Version | GPU Architecture | Per-GPU Bidirectional Bandwidth |
|---|---|---|
| NVLink 1.0 | Pascal (P100) | 160 GB/s |
| NVLink 2.0 | Volta (V100) | 300 GB/s |
| NVLink 3.0 | Ampere (A100) | 600 GB/s |
| NVLink 4.0 | Hopper (H100) | 900 GB/s |
| NVLink 5.0 | Blackwell (B200) | 1,800 GB/s |
NVSwitch is a dedicated switch chip that enables all-to-all NVLink connectivity among all GPUs within a node. In a DGX system with 8 GPUs and NVSwitch, every GPU can communicate directly with every other GPU at full NVLink bandwidth, avoiding the reduced bandwidth of ring or mesh topologies. The Blackwell-generation DGX B200 connects 8 B200 GPUs via NVLink 5 for up to 14.4 TB/s of aggregate GPU-to-GPU bandwidth per node.
The GB200 NVL72 extends this concept further, using NVSwitch to interconnect 72 GPUs across multiple trays as a single logical accelerator with 130 TB/s of total system bandwidth.
While NVLink connects GPUs within a node, InfiniBand connects nodes across a cluster. InfiniBand is an industry-standard networking protocol designed for high-performance computing, providing low latency and high bandwidth between servers. NVIDIA's Quantum-2 InfiniBand switches support NDR (400 Gb/s per port), and the roadmap includes higher-bandwidth generations. InfiniBand remains the preferred fabric for large-scale AI training clusters, though some deployments use RoCE (RDMA over Converged Ethernet) as an alternative.
The complementary architecture is straightforward: NVLink and NVSwitch handle fast communication within each server node, while InfiniBand connects the nodes for distributed training across the cluster.
Accessing GPU compute for AI training and inference has shifted predominantly to the cloud, where organizations can rent GPU capacity without the capital expenditure of building their own infrastructure.
| Provider | GPU Instances | Key Offerings |
|---|---|---|
| AWS | P5 (H100), P5e (H200), P6 (Blackwell) | SageMaker, EC2 UltraClusters, Trainium instances |
| Azure | ND H100 v5, ND H200 v5 | Azure Machine Learning, Azure AI |
| Google Cloud | A3 (H100), A3 Ultra (H200), TPU pods | Vertex AI, GKE with GPU support |
A new category of "GPU-first" cloud providers (sometimes called "NeoClouds"), including companies such as CoreWeave and Lambda, has emerged to address the specific needs of AI workloads with bare-metal GPU clusters, faster provisioning, and lower prices than the hyperscalers.
GPU cloud pricing has dropped significantly since the peak of the 2023 shortage. H100 GPU instances, which initially commanded premium prices, have seen costs decrease as supply expanded. Specialized providers generally offer 50-70% savings compared to hyperscale cloud providers. AWS cut H100 pricing by approximately 44% in June 2025, and competitive pressure continues to drive costs lower. Spot and preemptible instances can reduce costs by 60-90% below on-demand rates, though with the risk of interruption.
The release of ChatGPT in November 2022 triggered an unprecedented surge in demand for AI compute. Throughout 2023 and 2024, the AI industry experienced a severe GPU shortage that reshaped the economics of AI development.
Spending on GPUs jumped from approximately $30 billion in 2022 to $50 billion in 2023, a 67% increase. By 2024, NVIDIA's data center revenue was expected to reach approximately $85 billion. Hyperscale cloud providers (AWS, Azure, Google Cloud, Oracle) purchased AI GPUs at unprecedented scale, with H100 GPUs sold out through Q1 2024 as of summer 2023. GPUs were often traded at 45-55% above manufacturer's suggested retail price on the secondary market.
Several factors constrained GPU supply, most notably limited capacity for advanced packaging (TSMC's CoWoS) and for high-bandwidth memory production, both of which take years to expand.
The shortage created a visible divide in the AI industry between well-funded organizations with access to large GPU clusters ("GPU rich") and smaller companies, startups, and academic researchers who struggled to secure compute ("GPU poor"). This disparity influenced AI research priorities, with some groups shifting focus to efficiency techniques, smaller models, and inference optimization rather than large-scale training.
The rapid expansion of GPU computing for AI has raised significant concerns about energy consumption and environmental impact.
Each generation of data center GPUs has increased power consumption. The V100 consumed 300 W, the A100 rose to 400 W, the H100 reached 700 W, and the B200 now draws 1,000 W. NVIDIA's roadmap includes GPUs at 1,200 W and 1,500 W in future generations. A single DGX B200 server with 8 GPUs consumes over 8 kW of GPU power alone, not including CPUs, networking, storage, and cooling.
Data centers in the United States consumed approximately 200 terawatt-hours (TWh) of electricity in 2024, with AI-specific servers estimated to account for 53 to 76 TWh. Data centers consumed roughly 4.4% of total U.S. electricity in 2023 and are projected to reach 6.7% to 12% by 2028. As of February 2025, data center firms had requested 40.2 GW of new power connections, nearly double the 21.4 GW requested in July 2024.
The average power density per server rack is expected to increase from 36 kW in 2023 to 50 kW by 2027, driving a shift from air cooling to liquid cooling solutions.
In response to these challenges, the industry is pursuing several approaches, including more energy-efficient accelerators and lower-precision arithmetic, liquid cooling, and siting data centers near abundant or low-carbon power sources.
Water consumption for cooling AI data centers has also drawn scrutiny, particularly in drought-prone regions. Estimates suggest that training a single large language model can consume millions of liters of water when accounting for both direct cooling and the water used in electricity generation.
Photonic computing uses light instead of electrical signals to perform computations. Photonic processors can perform matrix multiplications at the speed of light with extremely low energy consumption. In September 2025, researchers at the University of Shanghai for Science and Technology demonstrated an ultra-compact photonic AI chip, and companies like Lightmatter and Q.ANT are developing commercial photonic accelerators. Q.ANT's NPU 2 processor claims up to 30x lower energy use and 50x higher performance for certain AI and HPC workloads compared to conventional processors. However, photonic computing faces challenges in precision, programmability, and integration with existing digital systems.
Inspired by the structure of biological brains, neuromorphic chips use spiking neural networks and event-driven computation. Intel's Loihi 2 and IBM's NorthPole are research-stage neuromorphic processors that consume orders of magnitude less power than GPUs for certain pattern recognition tasks. Neuromorphic computing is particularly promising for edge AI applications where power budgets are extremely constrained, though it has not yet demonstrated competitiveness with GPUs for training large models.
The trend toward chiplet-based designs, exemplified by AMD's MI300X (which uses multiple compute dies on a single package) and NVIDIA's Blackwell (with its dual-die design), will continue. Advanced packaging technologies like TSMC's CoWoS and its successors allow multiple dies, HBM stacks, and interconnects to be integrated into ever-larger and more powerful composite processors.
While quantum computing is often discussed as a potential successor to classical accelerators for certain AI tasks, practical quantum advantage for machine learning remains years or decades away. Near-term quantum computers lack the qubit counts and error correction needed for useful AI workloads. Hybrid quantum-classical approaches, where a quantum processor handles specific subroutines within a larger classical training pipeline, are an active area of research.