NVIDIA A100
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 7,492 words
The NVIDIA A100 Tensor Core GPU is a datacenter graphics processing unit designed by Nvidia for deep learning, high performance computing, and data analytics workloads. It was announced on May 14, 2020 by CEO Jensen Huang during the (virtual) GTC 2020 keynote and was the first product based on the Ampere GA100 die.[^1][^3] The A100 became the workhorse GPU of the early generative AI boom, powering training runs for models including Megatron-Turing NLG, BLOOM, and the systems that produced ChatGPT, before being superseded by the NVIDIA H100 (Hopper) in 2022 and later by the Blackwell generation.[^13][^14]
If any single chip can be said to have built the modern foundation-model industry, the A100 is the front-runner. From mid-2020 through most of 2023 it sat in essentially every frontier training cluster, every hyperscale inference fleet, and a large fraction of the academic and government supercomputing systems retooled to run transformer workloads.[^14] Even now in 2026, with Hopper winding down and Blackwell shipping in volume, the installed base of A100s is still doing useful work in inference clusters, fine-tuning shops, and university labs that bought them on the secondary market for a fraction of the original price.
The A100 is a server-class AI accelerator built on TSMC's 7nm N7 process.[^1] It replaced the V100 (Volta) as Nvidia's flagship datacenter GPU and was sold in two memory variants (40GB HBM2 and 80GB HBM2e) and two physical form factors (the SXM4 mezzanine module used in HGX and DGX systems, and a PCI Express add-in card).[^4][^5] The full GA100 silicon contains 54.2 billion transistors on an 826 mm² die.[^1] Of the 128 streaming multiprocessors (SMs) on the full die, 108 are enabled in the shipping A100 product, giving 6,912 FP32 CUDA cores and 432 third-generation Tensor Cores.[^1][^2]
The A100 introduced several features that defined GPU-based AI infrastructure for the next several years: third-generation Tensor Cores with TF32 and BF16 datatypes, 2:4 structured sparsity, the Multi-Instance GPU (MIG) feature for hardware partitioning, third-generation NVLink at 600 GB/s, and PCI Express 4.0 support.[^1][^2] It also debuted in the DGX A100 server and the DGX SuperPOD reference architecture used by hyperscalers and national labs.[^3][^7]
| Field | Value |
|---|---|
| Type | Data center GPU accelerator |
| Microarchitecture | Ampere |
| Die | GA100 |
| Process node | TSMC 7nm (N7) |
| Transistors | 54.2 billion |
| Die size | 826 mm² |
| SMs (enabled / on die) | 108 / 128 |
| FP32 CUDA cores | 6,912 |
| FP64 CUDA cores | 3,456 |
| Tensor Cores (3rd gen) | 432 |
| Boost clock | 1,410 MHz |
| L2 cache | 40 MB |
| Memory variants | 40 GB HBM2 (May 2020), 80 GB HBM2e (Nov 2020) |
| Peak memory bandwidth | 1,555 GB/s (40 GB), 2,039 GB/s (80 GB SXM), 1,935 GB/s (80 GB PCIe) |
| Memory bus width | 5,120 bits |
| NVLink (3rd gen) | 600 GB/s aggregate bidirectional (12 links) |
| PCI Express | Gen 4 x16 |
| Form factors | SXM4 module, dual-slot PCIe card, A100X converged accelerator |
| TDP | 400 W (SXM4); 250 W (PCIe 40 GB), 300 W (PCIe 80 GB); SXM 80 GB configurable to 500 W |
| MIG | Up to 7 instances per GPU |
| Announced | May 14, 2020 (GTC 2020) |
| Initial shipments | May 2020 |
| 80 GB SKU | November 16, 2020 (SC20) |
| Compute capability | 8.0 |
| Successor | NVIDIA H100 (March 2022) |
| Hardware EOL announced | January 2024 |
| Software support | Phase I (Full Support) under vGPU lifecycle as of March 2026 |
| List price (peak) | $10,000 to $15,000 (40 GB), higher for 80 GB |
| Cloud rental (peak) | $1.00 to $3.00 per GPU-hour widely available by 2024 |
| Used market (2026) | Roughly $5,000 to $10,000 per card |

| Item | A100 40GB SXM4 | A100 80GB SXM4 | A100 80GB PCIe |
|---|---|---|---|
| Architecture | Ampere (GA100) | Ampere (GA100) | Ampere (GA100) |
| Process node | TSMC 7nm (N7) | TSMC 7nm (N7) | TSMC 7nm (N7) |
| Transistors | 54.2 billion | 54.2 billion | 54.2 billion |
| Die size | 826 mm² | 826 mm² | 826 mm² |
| SMs enabled | 108 of 128 | 108 of 128 | 108 of 128 |
| FP32 CUDA cores | 6,912 | 6,912 | 6,912 |
| FP64 CUDA cores | 3,456 | 3,456 | 3,456 |
| Third-gen Tensor Cores | 432 | 432 | 432 |
| Boost clock | 1,410 MHz | 1,410 MHz | 1,410 MHz |
| Memory | 40 GB HBM2 | 80 GB HBM2e | 80 GB HBM2e |
| Memory bandwidth | 1,555 GB/s | 2,039 GB/s | 1,935 GB/s |
| L2 cache | 40 MB | 40 MB | 40 MB |
| NVLink (3rd gen) | 600 GB/s bidirectional | 600 GB/s bidirectional | 600 GB/s bidirectional |
| PCI Express | Gen 4 x16 | Gen 4 x16 | Gen 4 x16 |
| Multi-Instance GPU | up to 7 instances | up to 7 instances | up to 7 instances |
| TDP | 400 W | 400 W (up to 500 W) | 300 W |
Sources: NVIDIA A100 architecture whitepaper, A100 product page, and PCIe/SXM datasheets.[^1][^5][^6]
Nvidia introduced the A100 on May 14, 2020 at GTC 2020, an event held online because of the COVID-19 pandemic.[^3][^24] Jensen Huang delivered the keynote from his kitchen and pulled the first DGX A100 motherboard out of an oven for the camera. The accompanying press release described the chip as containing more than 54 billion transistors and called it "the world's largest 7-nanometer processor" at that time.[^3] Huang framed the A100 as a single accelerator that could handle training, inference, data analytics, and HPC, replacing what had previously required separate clusters of V100 GPUs for training and T4 GPUs for inference.[^2]
Early customers cited at launch included Microsoft, DoorDash, Indiana University, and several US national labs (notably Lawrence Berkeley National Laboratory's NERSC facility).[^3] The DGX A100 server, which packs eight A100 GPUs and dual 64-core AMD EPYC "Rome" 7742 CPUs, shipped immediately at $199,000 per unit.[^7]
The A100 PCIe add-in card variant followed on June 22, 2020, giving OEMs a 250 W dual-slot form factor that could be deployed in standard rack servers without an HGX baseboard.[^25]
At SC20 on November 16, 2020, Nvidia announced the A100 80GB.[^4] The new SKU doubled HBM capacity from 40 GB to 80 GB by switching from HBM2 to HBM2e, and was the first GPU to break 2 TB/s of memory bandwidth (2.039 TB/s on the SXM4 module, 1.935 TB/s on the PCIe card).[^4][^5] Nvidia cited a 3x speedup on the DLRM recommender benchmark and roughly 2x throughput improvements on quantum chemistry workloads compared to the 40GB version, both attributable to the larger working set fitting in HBM.[^4]
The 80 GB part also raised the configurable SXM4 power envelope from 400 W to 500 W in some HGX baseboard configurations, allowing partners to push clocks slightly higher when the cooling solution could keep up.[^1] In practice most production deployments stayed at 400 W to keep thermal headroom for sustained workloads.
Alongside the 80 GB SXM upgrade, Nvidia announced the DGX Station A100, a quiet, workstation-form-factor "AI data-center in a box" with four A100 GPUs, an AMD EPYC 7742 CPU, refrigerant cooling, and a 1,500 W draw that ran on a standard wall outlet.[^26] The 320G configuration (4x 80 GB A100, 320 GB aggregate HBM2e) launched at $149,000; the 160G configuration (4x 40 GB) at $99,000.[^26] Nvidia rated the DGX Station A100 at 2.5 petaFLOPS of AI training throughput and 5 petaOPS of INT8 inference. Like the rackmount DGX A100, it supported MIG partitioning on each of its GPUs, so small teams could share the workstation across multiple researchers.
The A100 remained Nvidia's flagship datacenter GPU for two years.[^14] Hopper, the successor architecture, was announced on March 22, 2022, with the H100 GPU shipping later that year.[^13] Even after the H100 launched, A100 demand remained extremely high through 2023 because supply of H100 was constrained and a large installed base of CUDA software had already been tuned for Ampere. CNBC reported in February 2023 that the A100 had become "the $10,000 chip powering the race for AI," with major foundation-model labs each operating thousands of A100s in their training clusters.[^14]
Nvidia formally announced end-of-life for the A100 product family in January 2024 and began winding down production through OEM channels during 2024.[^27] By the time Blackwell ramped in 2025, A100 production was largely complete and most new buys came through OEM channel inventory or the secondary market. The chip itself remains in Phase I (Full Support) under Nvidia's vGPU software lifecycle policy as of March 2026, meaning A100 deployments still receive new features, security patches, and driver updates.[^28]
The GA100 silicon is organized into 8 graphics processing clusters (GPCs), each containing 8 texture processing clusters (TPCs), with 2 SMs per TPC, for a total of 128 SMs on the full die.[^1] Each SM contains 64 FP32 CUDA cores, 32 FP64 CUDA cores, and four third-generation Tensor Cores.
The shipping A100 enables 108 of the 128 SMs, yielding 6,912 FP32 cores, 3,456 FP64 cores, and 432 Tensor Cores.[^1] The L2 cache is 40 MB, nearly seven times larger than the V100's 6 MB L2, and is split into two partitions to keep latency low. Aggregate L2 bandwidth is roughly 2.3x that of V100 according to Nvidia's Ampere whitepaper.[^1] The A100's compute capability for the CUDA toolkit is 8.0.[^2]
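The unit counts above can be cross-checked with a few lines of arithmetic. The sketch below uses only figures quoted in this article; the per-SM counts are derived by dividing the shipping totals by the 108 enabled SMs, and the variable names are illustrative.

```python
# Unit counts for the full GA100 die and the shipping A100,
# using only figures quoted in the article text.
GPCS = 8            # graphics processing clusters on the full die
TPCS_PER_GPC = 8    # texture processing clusters per GPC
SMS_PER_TPC = 2     # streaming multiprocessors per TPC
ENABLED_SMS = 108   # SMs enabled on the shipping A100

full_die_sms = GPCS * TPCS_PER_GPC * SMS_PER_TPC

# Per-SM counts, derived from the shipping totals.
fp32_per_sm = 6912 // ENABLED_SMS
fp64_per_sm = 3456 // ENABLED_SMS
tensor_cores_per_sm = 432 // ENABLED_SMS

print(full_die_sms, fp32_per_sm, fp64_per_sm, tensor_cores_per_sm)
# prints: 128 64 32 4
```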
The Tensor Cores in GA100 are the third generation since their introduction in Volta. Relative to V100 they added several new capabilities, including the TF32 and BF16 datatypes, FP64 Tensor Core operations, and 2:4 structured sparsity.[^1][^2]
Each Tensor Core in A100 performs 256 FP16 fused-multiply-add operations per clock, four times the V100 rate per core, and there are four Tensor Cores per SM.[^1]
TF32 is a quietly important format choice that did a lot of work for adoption. The format keeps FP32's 8-bit exponent range, which meant existing model code did not need scaling factors or loss-scale tuning, but cuts the mantissa to 10 bits, the same as FP16, so the multiplications can run on the Tensor Cores rather than the FP32 CUDA cores.[^2] For research teams porting models from V100 to A100, TF32 was usually a substantial speedup with few or no code changes: on paper, the 156 TFLOPS TF32 rate is about 8x the A100's own 19.5 TFLOPS FP32 rate, though realized gains depended on how much of the workload was matrix math. BF16 then served the more advanced users who were already running mixed-precision training and wanted the dynamic-range benefit without the headaches of FP16's narrower exponent.
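To make the format concrete, here is a minimal sketch that emulates TF32 precision in software by clearing the low 13 bits of an FP32 mantissa, leaving the sign bit, 8 exponent bits, and 10 mantissa bits. The helper name `to_tf32` is hypothetical, and plain truncation is a simplifying assumption; the rounding the hardware actually applies may differ.

```python
import struct

def to_tf32(x: float) -> float:
    """Truncate a value to TF32 precision (8-bit exponent, 10-bit mantissa).

    Illustrative only: truncation is assumed here, not taken from Nvidia docs.
    """
    # Reinterpret the FP32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # FP32 has 23 mantissa bits; keep the top 10 and clear the low 13.
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# The exponent range is untouched, so very large values survive...
big = to_tf32(1e30)
# ...but increments below 2**-10 relative to the value are lost.
coarse = to_tf32(1.0 + 2**-11)
fine = to_tf32(1.0 + 2**-10)
print(coarse == 1.0, fine == 1.0 + 2**-10)
```

The first check shows a small increment being dropped and the second shows the smallest increment that survives, which is the precision trade the format makes in exchange for Tensor Core throughput.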
| Precision | Dense throughput | With 2:4 sparsity |
|---|---|---|
| FP64 (CUDA core) | 9.7 TFLOPS | n/a |
| FP64 Tensor Core | 19.5 TFLOPS | n/a |
| FP32 (CUDA core) | 19.5 TFLOPS | n/a |
| TF32 Tensor Core | 156 TFLOPS | 312 TFLOPS |
| BF16 Tensor Core | 312 TFLOPS | 624 TFLOPS |
| FP16 Tensor Core | 312 TFLOPS | 624 TFLOPS |
| INT8 Tensor Core | 624 TOPS | 1,248 TOPS |
| INT4 Tensor Core | 1,248 TOPS | 2,496 TOPS |
These are peak rates at the boost clock, as listed in the Ampere whitepaper and on Nvidia's A100 product page.[^1][^5][^6]
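The peak figures in the table follow from the unit counts and the boost clock. The sketch below reproduces three of them, counting each fused multiply-add as two floating-point operations; the function name is illustrative, and the inputs are the figures quoted in this article (1,410 MHz boost, 256 FP16 FMAs per Tensor Core per clock).

```python
BOOST_HZ = 1.410e9  # boost clock from the spec table

def peak_tflops(units: int, fma_per_unit_per_clock: int) -> float:
    """Peak rate in TFLOPS, counting one FMA as two floating-point ops."""
    return units * fma_per_unit_per_clock * 2 * BOOST_HZ / 1e12

fp32_cuda = peak_tflops(6912, 1)    # FP32 CUDA cores, one FMA per clock each
fp64_cuda = peak_tflops(3456, 1)    # FP64 CUDA cores, one FMA per clock each
fp16_tensor = peak_tflops(432, 256) # Tensor Cores, 256 FP16 FMAs per clock each

print(f"{fp32_cuda:.1f} {fp64_cuda:.1f} {fp16_tensor:.0f}")
# prints: 19.5 9.7 312
```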
The 40GB A100 ships with five 8-Hi HBM2 stacks for 40 GB total at 1.55 TB/s.[^1] The 80GB version uses HBM2e and reaches 2.04 TB/s on the SXM module and 1.94 TB/s on the PCIe card.[^4][^5] The memory interface is 5,120 bits wide. The A100 also adds Compute Data Compression, which Nvidia claims can deliver up to 4x effective DRAM and L2 bandwidth and up to 2x effective L2 capacity on workloads with sparse or repeated values.[^1] The compression is transparent to user code; cache lines are tagged with a compression descriptor, and the memory controller and L2 decompress them on read.
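The bus width and the published bandwidth figures together imply a per-pin data rate for each variant. The sketch below derives it; the resulting rates are computed from this article's numbers rather than quoted from a datasheet, and the function name is illustrative.

```python
BUS_WIDTH_BITS = 5120   # memory interface width from the article
ACTIVE_STACKS = 5       # HBM stacks on the shipping part

def per_pin_gbps(bandwidth_gb_s: float) -> float:
    """Implied data rate per pin in Gb/s for a given aggregate GB/s."""
    return bandwidth_gb_s * 8 / BUS_WIDTH_BITS

bits_per_stack = BUS_WIDTH_BITS // ACTIVE_STACKS

for name, gb_s in [("40 GB", 1555), ("80 GB SXM", 2039), ("80 GB PCIe", 1935)]:
    print(f"{name}: {per_pin_gbps(gb_s):.2f} Gb/s per pin")
```

On these inputs the move from HBM2 to HBM2e shows up as a per-pin rate increase from roughly 2.4 to roughly 3.2 Gb/s at an unchanged bus width.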
The 2:4 structured sparsity feature was the first time matrix-sparsity acceleration shipped in a mainstream GPU at this scale.[^29] After pruning, each 4-element vector of a weight matrix is required to contain at most two non-zero values; the hardware stores only the non-zeros plus a 2-bit-per-element index that the Tensor Core uses to gather operands at load time.[^29] In published case studies the workflow was: train a dense network, prune to the 2:4 pattern with cuSPARSELt's automatic compressor, retrain briefly to recover lost accuracy, then deploy through TensorRT for inference. The reported result was roughly 2x dense throughput with negligible accuracy loss on networks like BERT and ResNet-50.[^29] In production, structured sparsity was adopted more in inference than in training because the fine-tune step was cheaper than redoing the full training run, and TensorRT's automatic kernel selection could pick the sparse path without the model owner needing to touch graph code.
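The 2:4 pattern and its compressed storage can be sketched in plain Python. This is an illustration of the data layout described above, not Nvidia's tooling: the function names are hypothetical, and keeping the two largest-magnitude values per group is an assumed pruning criterion.

```python
def prune_2_of_4(row):
    """Keep two values from every group of four; return (values, indices)."""
    if len(row) % 4:
        raise ValueError("row length must be a multiple of 4")
    values, indices = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Pick the positions of the two largest magnitudes (assumed criterion),
        # then restore ascending position order.
        by_magnitude = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = sorted(by_magnitude[:2])
        values.extend(group[i] for i in keep)
        indices.extend(keep)  # each index is 0..3, so it fits in 2 bits
    return values, indices

def expand(values, indices):
    """Rebuild the pruned dense row from compressed values and indices."""
    dense = [0.0] * (len(values) * 2)
    for n, (value, index) in enumerate(zip(values, indices)):
        dense[(n // 2) * 4 + index] = value
    return dense

row = [0.1, -0.9, 0.3, 0.05, 2.0, 0.0, -1.0, 0.5]
values, indices = prune_2_of_4(row)
print(values)                   # [-0.9, 0.3, 2.0, -1.0]
print(indices)                  # [1, 2, 0, 2]
print(expand(values, indices))  # [0.0, -0.9, 0.3, 0.0, 2.0, 0.0, -1.0, 0.0]
```

The compressed form holds half as many values as the dense row plus one small index per kept value, which is where the halved storage and the roughly 2x math rate come from.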
Nvidia shipped the A100 in two memory configurations across three physical form factors, plus an export-restricted derivative for the Chinese market. The table below summarizes the lineup as it actually appeared on order forms and OEM spec sheets.
| Variant | Form factor | Memory | Bandwidth | NVLink | TDP | Notes |
|---|---|---|---|---|---|---|
| A100 40 GB SXM4 | SXM4 mezzanine | 40 GB HBM2 | 1.555 TB/s | 600 GB/s | 400 W | Launch SKU, May 2020. Standard part for HGX A100 baseboards and the original DGX A100.[^1][^3] |
| A100 40 GB PCIe | Dual-slot PCIe Gen 4 | 40 GB HBM2 | 1.555 TB/s | 600 GB/s via 2-card bridge | 250 W | Add-in card for general OEM servers, launched June 22, 2020.[^25] |
| A100 80 GB SXM4 | SXM4 mezzanine | 80 GB HBM2e | 2.039 TB/s | 600 GB/s | 400 W (configurable to 500 W) | Announced November 16, 2020. First GPU to break 2 TB/s memory bandwidth.[^4][^5] Used in updated DGX A100 and HGX A100. |
| A100 80 GB PCIe | Dual-slot PCIe Gen 4 | 80 GB HBM2e | 1.935 TB/s | 600 GB/s via 2-card bridge | 300 W | Released mid-2021. Slightly lower memory bandwidth than the SXM module due to thermal envelope.[^5] |
| A100X (converged accelerator) | Dual-slot PCIe Gen 4 | 80 GB HBM2e | 1.935 TB/s | n/a (DPU-integrated) | 300 W | A100 + BlueField-2 DPU on one card with an onboard PCIe Gen 4 switch; 100 GbE networking; standard and BlueField-X operating modes.[^30] |
| A800 40 / 80 GB | SXM4 / PCIe | 40 GB HBM2 / 80 GB HBM2e | Same as A100 | 400 GB/s (cut from 600) | Same as A100 | China-only export-compliant SKU. In production Q3 2022.[^31][^32] |
The SXM4 module is a mezzanine card that plugs directly into a custom NVLink-enabled baseboard.[^1] SXM4 was the form factor used in every flagship cluster build because it allowed the full 600 GB/s NVLink and the higher 400 to 500 W power envelope that was needed to hold boost clocks under sustained load.
The PCIe variant is a standard dual-slot add-in card that plugs into any PCIe Gen 4 x16 slot.[^25] The PCIe card runs at a lower TDP (250 W on the 40 GB part, 300 W on the 80 GB part) and gives up some sustained throughput in exchange for fitting in a wider variety of server chassis. PCIe A100s could be linked in pairs through an NVLink bridge connector, but full eight-way NVSwitch fabrics required SXM4.
The HGX A100 baseboard comes in two reference designs: HGX A100 4-GPU and HGX A100 8-GPU.[^1] The 4-GPU variant uses direct NVLink between adjacent GPUs in a non-switched topology. The 8-GPU variant adds six second-generation NVSwitch chips that provide all-to-all 600 GB/s connectivity, which is what makes the HGX 8-GPU board the practical building block for large training clusters. OEMs including Supermicro, Dell, HPE, Lenovo, Inspur, and Foxconn all built systems around the HGX A100 baseboard.
The A100X converged accelerator put an A100 80 GB and a BlueField-2 DPU on the same dual-slot PCIe card, with an onboard PCIe Gen 4 switch enabling GPU-to-DPU traffic without touching the host PCIe complex.[^30] In its default "standard" mode the GPU and DPU appear as independent PCIe devices to the host; in "BlueField-X" mode the PCIe switch reconfigures so the GPU is dedicated to the DPU and invisible to the host, which is the architecture used for 5G vRAN and AI-on-5G deployments where the DPU runs the full software stack and the GPU is a pure compute attachment.[^30]
The A800 appeared in Q3 2022 after the first round of US export controls.[^31][^32] The chip was identical silicon to the A100 with NVLink throttled from 600 GB/s to 400 GB/s, just enough to fall under the original BIS interconnect threshold.[^32] The A800 became the bestselling Nvidia datacenter GPU inside China during 2023, until the October 2023 BIS update closed the loophole.
Multi-Instance GPU is a hardware partitioning feature introduced with the A100 and carried forward in later datacenter GPUs.[^1][^2] A single A100 can be split into up to seven independent GPU instances, each with its own dedicated SMs, L2 cache slice, memory controllers, and HBM partition. On the 80GB version each instance in the 7-way configuration gets 10 GB, and larger instances are available when fewer are created.[^5]
MIG was designed for cloud providers and shared inference clusters. Before MIG, sharing a GPU between tenants required time-slicing through a software scheduler, which left memory bandwidth and cache subject to noisy-neighbor interference. With MIG the partitions are physically isolated, so an out-of-memory error in one instance cannot crash another.[^2] On the 80GB part, Nvidia documents layouts such as seven 1g.10gb instances, two 2g.20gb plus one 3g.40gb, and various other mixes.
The table below shows the practical instance profiles that Kubernetes operators and cloud providers most often expose. Each row is a valid configuration of a single 80 GB SXM4 A100.
| Profile | Number of instances | SMs per instance | Memory per instance | Typical use |
|---|---|---|---|---|
| 1g.10gb | 7 | 14 | 10 GB | Small inference replicas, batch=1 LLM serving for sub-7B models |
| 2g.20gb | 3 | 28 | 20 GB | Mid-size inference, fine-tuning small models |
| 3g.40gb | 2 | 42 | 40 GB | Larger inference batches, 13B-class models |
| 4g.40gb | 1 (with 3g.40gb) | 56 | 40 GB | Single workload sharing card with smaller tenant |
| 7g.80gb | 1 | 98 | 80 GB | Whole GPU as a single instance; with MIG enabled, 98 of the 108 SMs are exposed |
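The profile table can be read as a budget of seven compute slices (14 SMs each) and 80 GB of memory. The sketch below checks whether a proposed mix of profiles fits that budget. It is a necessary-condition check only, under the assumption that the budget view is adequate for illustration; real MIG placement has additional rules that this does not model, and the `fits` helper is hypothetical.

```python
# (compute slices, memory in GB) per profile, taken from the table above.
PROFILES = {
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "4g.40gb": (4, 40),
    "7g.80gb": (7, 80),
}
COMPUTE_SLICES = 7
MEMORY_GB = 80
SMS_PER_SLICE = 14

def fits(config):
    """True if the profiles fit the slice and memory budget (not a placement check)."""
    slices = sum(PROFILES[name][0] for name in config)
    memory = sum(PROFILES[name][1] for name in config)
    return slices <= COMPUTE_SLICES and memory <= MEMORY_GB

def sms(profile):
    """SMs exposed by one instance of the given profile."""
    return PROFILES[profile][0] * SMS_PER_SLICE

print(fits(["1g.10gb"] * 7))                       # True
print(fits(["3g.40gb", "4g.40gb"]))                # True
print(fits(["3g.40gb", "3g.40gb", "1g.10gb"]))     # False: 90 GB exceeds 80 GB
print(sms("3g.40gb"))                              # 42
```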
In practice, MIG turned out to be more useful for inference than training. Splitting a card into seven 1g.10gb partitions made for a tidy way to run seven independent inference replicas behind a load balancer, which is exactly the workload shape that cloud providers like AWS, Azure, and GCP wanted to expose to multi-tenant customers. Training jobs almost always wanted the whole GPU.
The A100 implements third-generation NVLink, with each link running at 50 Gb/s per signal pair and 25 GB/s per direction.[^1] The SXM4 A100 has 12 NVLinks, giving 600 GB/s of total bidirectional bandwidth per GPU, double the 300 GB/s offered by V100. On the PCIe variant, NVLink is available only through a bridge that joins two adjacent cards at 600 GB/s; full all-to-all NVSwitch fabrics require the SXM form factor.
In the DGX A100 reference design, six second-generation NVSwitch chips connect all eight A100 GPUs in a non-blocking topology so that any GPU can talk to any other at full 600 GB/s.[^7] The PCIe interface is upgraded to PCI Express 4.0, which doubles host bandwidth to 31.5 GB/s per direction over a PCIe x16 link.
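Both headline interconnect figures can be reproduced from per-link numbers. In the sketch below the NVLink inputs come from this article; the PCIe 4.0 signaling rate (16 GT/s per lane) and 128b/130b line encoding are standard PCIe parameters that the article itself does not state.

```python
# NVLink: 12 links, 25 GB/s per direction per link, counted in both directions.
NVLINK_LINKS = 12
NVLINK_GB_S_PER_DIRECTION = 25
nvlink_total_gb_s = NVLINK_LINKS * NVLINK_GB_S_PER_DIRECTION * 2

# PCIe 4.0 x16: 16 GT/s per lane, 128b/130b encoding, 16 lanes, 8 bits per byte.
PCIE4_GT_S_PER_LANE = 16
PCIE_LANES = 16
pcie4_x16_gb_s = PCIE4_GT_S_PER_LANE * PCIE_LANES * (128 / 130) / 8

print(nvlink_total_gb_s)           # 600
print(round(pcie4_x16_gb_s, 1))    # 31.5
```

The comparison also shows why NVLink matters for multi-GPU training: GPU-to-GPU bandwidth over NVLink is roughly an order of magnitude above what a single PCIe x16 link provides in each direction.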
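The DGX A100 is Nvidia's reference server.[^7] Each unit contains eight A100 GPUs, two 64-core AMD EPYC 7742 CPUs, and six second-generation NVSwitch chips, among other components.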
Nvidia rates the DGX A100 at 5 petaFLOPS of FP16 Tensor Core throughput and 10 petaOPS of INT8 inference, with a launch list price of $199,000.[^7]
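The system-level ratings can be checked against the per-GPU throughput table. The sketch below multiplies the per-GPU figures by eight; on these numbers the 5 petaFLOPS and 10 petaOPS ratings line up with the 2:4 sparsity column rather than the dense column, which is an inference from the arithmetic and not a statement taken from Nvidia's materials.

```python
GPUS_PER_DGX = 8

# Per-GPU peak rates from the throughput table (TFLOPS or TOPS).
FP16_DENSE, FP16_SPARSE = 312, 624
INT8_DENSE, INT8_SPARSE = 624, 1248

def system_peta(per_gpu_tera: float, gpus: int = GPUS_PER_DGX) -> float:
    """Aggregate rate in peta-ops for a system of identical GPUs."""
    return per_gpu_tera * gpus / 1000

print(system_peta(FP16_DENSE))    # 2.496
print(system_peta(FP16_SPARSE))   # 4.992
print(system_peta(INT8_SPARSE))   # 9.984
```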
The DGX SuperPOD is the multi-node reference design built from DGX A100 nodes, fat-tree HDR InfiniBand, and shared parallel storage.[^7] Nvidia's own production cluster, Selene, was built on the DGX A100 SuperPOD and debuted at #7 on the June 2020 TOP500 list with 27.58 PFLOPS of Linpack performance.[^33] Selene was rebuilt with A100 80GB cards and HDR upgrades for SC20 and climbed to #5 on the November 2020 TOP500 list with 63.4 PFLOPS, more than doubling its earlier score.[^34] Selene was used internally for benchmark submissions to MLPerf and to train large research models including Megatron-Turing NLG.[^10]
A "scalable unit" of the SuperPOD reference design was 20 DGX A100 nodes (160 A100 GPUs) wired into a non-blocking InfiniBand fat tree, which could be replicated and joined together to scale up.[^7] Selene started at 280 DGX A100 nodes (2,240 A100s) and grew during its operational life. Customer SuperPOD deployments at this size included clusters at NAVER (South Korea), Cambridge-1 (UK, the first commercial supercomputer dedicated to healthcare AI), and several US national lab installations.
The DGX Station A100, announced alongside the 80 GB SXM upgrade in November 2020, was Nvidia's workstation-class A100 product.[^26] It packed four A100 80 GB cards in a desk-side chassis with refrigerant cooling and a 1,500 W input draw that ran on a standard office wall outlet, sized for a small team that wanted a private AI dev box rather than a shared cluster. The 320G configuration (320 GB aggregate HBM2e, AMD EPYC 7742 CPU, up to 512 GB DDR4, 7.68 TB NVMe data SSD) listed at $149,000; the 160G configuration (4x 40 GB) listed at $99,000.[^26] MIG worked on each of the four cards independently, so a 320G machine could be partitioned into up to 28 independent GPU instances for shared lab use.
The A100's commercial success is inseparable from the CUDA software stack that ships around it. Most of the user-visible features added to that stack between 2020 and 2023 were driven by what the A100 needed.[^2]
It is fair to say that a meaningful share of the A100's lifetime performance gains came from this stack rather than the silicon. Between MLPerf Training v0.7 and v1.1, Nvidia reported up to 3.5x at-scale speedup on identical A100 hardware, all from CUDA, NCCL, cuDNN, and framework updates.[^9] That "chip kept getting faster" pattern has been one of the more durable arguments for why software is the actual moat in AI hardware.
The A100 dominated MLPerf benchmark submissions for the duration of its lifetime as Nvidia's flagship part.
These sustained wins were as much a story about CUDA, cuDNN, and NCCL maturity as about silicon. They were also one of the more credible signals that Nvidia's software stack was a real moat, because the same chip kept getting faster between submissions.[^9]
The original GPT-3 (announced May 2020) was trained on V100 GPUs, not A100s. Microsoft disclosed in May 2020 that it had built a supercomputer for OpenAI with more than 285,000 CPU cores, 10,000 GPUs, and 400 Gb/s of network connectivity per GPU server.[^12] Subsequent OpenAI models, including the GPT-3.5 and ChatGPT family that emerged in late 2022, ran on Microsoft Azure infrastructure that had been migrated to A100 hardware in the intervening period. CNBC's February 2023 reporting on the A100 noted that ChatGPT-class workloads were running on "thousands" of A100s.[^14]
The broader GPT-4 training infrastructure, deployed during 2022 and into 2023, was a mixed A100 and H100 fleet on Azure. Azure's ND A100 v4 series became one of the largest single deployments of A100 capacity outside Nvidia's own Selene system.[^20]
Microsoft and Nvidia trained Megatron-Turing NLG 530B (a 530-billion-parameter transformer) on Selene, the Nvidia DGX A100 SuperPOD.[^10] The training infrastructure consisted of 560 DGX A100 servers, each with eight A100 80GB GPUs (4,480 A100 GPUs total) connected by HDR InfiniBand in a full fat tree.[^10] The published runs used tensor parallelism of 8 within a node, pipeline parallelism of 35 across nodes, and DeepSpeed data parallelism on top, with each model replica spanning 280 A100 GPUs.[^10]
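The parallelism figures above decompose cleanly. The sketch below multiplies the tensor-parallel and pipeline-parallel degrees to get the GPUs per model replica, then computes how many replicas would fit if all 4,480 GPUs were in use at once. That last figure is an upper bound under that assumption, not a reported data-parallel degree.

```python
TENSOR_PARALLEL = 8      # within a node
PIPELINE_PARALLEL = 35   # across nodes
GPUS_PER_NODE = 8
TOTAL_GPUS = 560 * GPUS_PER_NODE

gpus_per_replica = TENSOR_PARALLEL * PIPELINE_PARALLEL
nodes_per_replica = gpus_per_replica // GPUS_PER_NODE

# Upper bound on data-parallel replicas, assuming every GPU is used.
max_replicas = TOTAL_GPUS // gpus_per_replica

print(TOTAL_GPUS, gpus_per_replica, nodes_per_replica, max_replicas)
# prints: 4480 280 35 16
```

Because the tensor-parallel degree equals the GPUs per node, all tensor-parallel traffic stays on NVLink inside a node and only pipeline and data-parallel traffic crosses InfiniBand.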
BLOOM, the 176-billion-parameter open multilingual language model from the BigScience workshop, was trained on the Jean Zay supercomputer in France between March and July 2022.[^11] The training used 384 A100 80GB GPUs (48 nodes of 8 GPUs each), with a further 32 GPUs held in reserve for failure handling, connected by NVLink within each node and Omni-Path between nodes.[^11] Total compute was on the order of 1 million GPU-hours and the project consumed roughly $7 million of publicly funded compute time.[^11]
Meta's first-generation LLaMA models (released February 2023) were trained on A100 80GB clusters. The LLaMA 65B run reportedly used 2,048 A100s for about 21 days, totaling around 1 million GPU-hours.[^22] Llama 2, released in July 2023, was also primarily an A100 training run before later Meta clusters transitioned to H100 for Llama 3.
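The GPU-hour figure is straightforward to reproduce, and the same arithmetic gives a rough feel for rental cost. In the sketch below the GPU count and duration come from the text; the hourly rate is a hypothetical placeholder for illustration only and does not represent what Meta paid.

```python
def gpu_hours(gpus: int, days: float) -> float:
    """Total GPU-hours for a run that keeps every GPU busy for the full duration."""
    return gpus * days * 24

llama_65b_hours = gpu_hours(2048, 21)

# Hypothetical rental rate in dollars per GPU-hour, for illustration only.
HYPOTHETICAL_RATE = 1.50
illustrative_cost = llama_65b_hours * HYPOTHETICAL_RATE

print(f"{llama_65b_hours:,.0f} GPU-hours")       # 1,032,192 GPU-hours
print(f"${illustrative_cost:,.0f} at ${HYPOTHETICAL_RATE:.2f}/GPU-hour")
```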
Stable Diffusion 1.x was trained by Stability AI and partners on a cluster of 256 A100s (32 nodes of 8 GPUs); the published model card lists AWS as the cloud provider.[^23] The original RunwayML and CompVis collaborations on the underlying latent diffusion research used much smaller A100 footprints (single-node and dual-node clusters) before the model scaled up.
AI21's Jurassic-2, Cohere's early production models, Adept's ACT-1, ElevenLabs' first speech-synthesis models, and many of the open Mistral, Falcon, Yi, and Qwen series were either trained or served on A100s during 2021 to 2023. The list of foundation-model labs that did not run on A100 hardware during this period is roughly limited to Google and DeepMind (TPU) and a handful of TPU-leasing startups.
Every major cloud provider added A100 capacity to their lineup between mid-2020 and 2021. The table below summarizes the SKUs that customers could actually order, with peak instance configurations and approximate pricing at the cycle's height.
| Provider | Instance / SKU | GPU configuration | Memory | Notes |
|---|---|---|---|---|
| AWS | p4d.24xlarge | 8x A100 40 GB SXM4 | 1.1 TB system RAM | Launched November 2020 at $32.77/hour list. EFA networking.[^19] |
| AWS | p4de.24xlarge | 8x A100 80 GB SXM4 | 1.1 TB system RAM | Launched 2022. 80 GB upgrade of p4d.[^19] |
| Azure | ND A100 v4 (NDm A100 v4 80 GB) | 8x A100 80 GB SXM4 | 1.9 TB system RAM | InfiniBand HDR. The 80 GB SKU was the workhorse for Azure's OpenAI co-located capacity.[^20] |
| Google Cloud | a2-highgpu-8g (and a2-megagpu-16g) | 8x or 16x A100 40 GB | up to 1.36 TB | Launched July 2020. a2-megagpu paired two HGX boards in one VM.[^21] |
| Google Cloud | a2-ultragpu-8g | 8x A100 80 GB | 1.36 TB | 80 GB SKU launched 2021.[^21] |
| Oracle | BM.GPU.A100-v2.8 | 8x A100 80 GB | 2 TB | Bare-metal SKU on Oracle Cloud, RDMA cluster networking. |
| CoreWeave | A100 80 GB SXM4 instances | 1 to 8x A100 80 GB per node | varies | Specialty neocloud; multi-year reserved deals to OpenAI and others. |
| Lambda Labs | 1x A100 (40 / 80 GB), 8x A100 nodes | 1 to 8x A100 | varies | On-demand and reserved; price competitive with hyperscalers from 2022. |
| Tencent Cloud | A100 / A800 instances | 8x A100 / A800 | varies | Mainland China; transitioned to A800 after October 2022. |
| Alibaba Cloud | gn7e / ebmgn7e | 8x A100 / A800 | varies | Mainland China; transitioned to A800 after October 2022. |
On-demand pricing for an 8-GPU A100 80 GB node in 2021 ran roughly $20 to $30 per hour at hyperscalers, which puts the per-GPU rate around $2.50 to $4.[^19][^20] By late 2023 and into 2024, neoclouds and reserved-instance contracts had pushed the per-GPU rate as low as $1.00 to $1.50 per hour for A100 80 GB capacity. Through 2025 and into 2026 spot prices for A100 instances on neoclouds settled around $0.80 to $1.50 per GPU-hour, far below current H100 and Blackwell rates, which is much of the reason the A100 still has a market.
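Nvidia does not publish list prices for datacenter GPUs and most A100 sales went through OEM partners, but a few public datapoints anchor the range: the DGX A100 launched at $199,000 for an eight-GPU system,[^7] CNBC described the A100 as a roughly $10,000 chip in February 2023,[^14] and used cards trade at roughly $5,000 to $10,000 in 2026.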
These numbers should be treated as approximate. The A100 used market is illiquid, dominated by a handful of brokers and refurbishment houses, and pricing moves with each large cluster decommissioning event. When Microsoft retired a tranche of older A100 capacity in early 2026, broker-channel prices on 80 GB modules dropped roughly 20 percent in two weeks before stabilizing.
The transition from A100 to H100 was one of the largest generational jumps in Nvidia's datacenter line, and the table below shows where the gaps live.[^13] H100 is uniformly faster on Tensor Core math, raised NVLink bandwidth by half (600 to 900 GB/s), added FP8 support and the Transformer Engine, and raised HBM bandwidth by roughly two-thirds (2.0 to 3.35 TB/s). The A100 still wins on price per GPU-hour and on the simpler power and cooling profile.
| Specification | A100 80 GB SXM4 | H100 SXM5 |
|---|---|---|
| Architecture | Ampere (GA100) | Hopper (GH100) |
| Process | TSMC 7nm (N7) | TSMC 4N |
| Transistors | 54.2 billion | 80 billion |
| Die size | 826 mm² | 814 mm² |
| FP32 CUDA cores | 6,912 | 16,896 |
| Tensor Cores | 432 (3rd gen) | 528 (4th gen) |
| Memory | 80 GB HBM2e | 80 GB HBM3 (94 GB on later NVL) |
| Memory bandwidth | 2.0 TB/s | 3.35 TB/s |
| L2 cache | 40 MB | 50 MB |
| FP64 Tensor Core | 19.5 TFLOPS | 67 TFLOPS |
| TF32 Tensor Core (dense) | 156 TFLOPS | 494 TFLOPS |
| BF16 / FP16 Tensor Core (dense) | 312 TFLOPS | 989 TFLOPS |
| FP8 Tensor Core (dense) | not supported | 1,979 TFLOPS |
| INT8 Tensor Core (dense) | 624 TOPS | 1,979 TOPS |
| NVLink bandwidth | 600 GB/s | 900 GB/s |
| PCI Express | Gen 4 | Gen 5 |
| TDP (SXM) | 400 W | 700 W |
| Launch price (per GPU, peak) | $10,000 to $15,000 | $25,000 to $40,000 |
| Cloud price (2024 to 2026) | $1.00 to $2.00 / GPU-hour | $2.00 to $4.00 / GPU-hour |
| Best fit (2026) | Inference, fine-tuning, mid-size training | Frontier training, FP8 inference |
The Transformer Engine is the most important difference for production foundation-model work.[^13] H100's FP8 support can roughly double Tensor Core throughput on transformer architectures relative to FP16, typically with little accuracy loss, and the A100 has no equivalent. For inference of moderately sized models, however, where the math is not the bottleneck, the gap narrows and the A100's lower per-GPU-hour cost often wins on total cost of serving.
The V100, released in 2017 on the Volta architecture, introduced the original Tensor Core and was the workhorse for the first generation of large transformer training (BERT, GPT-2, and the original GPT-3). The A100 delivered roughly two to three times the V100's per-chip performance on FP16/FP32 mixed-precision training, added BF16 and TF32 support, increased memory bandwidth from 900 GB/s to 1.55 TB/s (40GB) or 2.0 TB/s (80GB), and replaced the V100's 6 MB L2 with a 40 MB L2.[^1][^2]
The H100, announced March 22, 2022 and shipping later that year, succeeded the A100 as Nvidia's flagship.[^13] Hopper added a Transformer Engine with FP8 Tensor Cores, fourth-generation NVLink at 900 GB/s per GPU, HBM3 memory at up to 3.35 TB/s, PCIe Gen 5, and confidential computing features. Nvidia rates the H100 as up to 9x faster than A100 for large language model training and up to 30x for inference at large batch sizes, although in mixed real-world workloads the gap was usually closer to 2x to 3x.[^13] Hopper itself was succeeded in 2024 to 2025 by the Blackwell generation, including the B200.
On October 7, 2022, the US Department of Commerce's Bureau of Industry and Security (BIS) issued new export controls targeting advanced computing chips destined for China.[^18] The rules used two simultaneous thresholds: a peak-performance threshold and an interconnect bandwidth threshold of 600 GB/s of aggregate bidirectional chip-to-chip bandwidth.[^18] The A100, with exactly 600 GB/s of NVLink and very high INT8 throughput, fell on the wrong side of both criteria.
Nvidia responded with the A800, a rebinned A100 SKU intended for the China market that throttled NVLink to 400 GB/s while leaving the GPU's compute throughput essentially unchanged.[^31][^32] The A800 entered production in Q3 2022 and quickly became Nvidia's bestselling AI accelerator inside China.[^31] A parallel H800 followed for the Hopper generation. The October 2023 update to the BIS rules added a performance density threshold and effectively closed the A800/H800 loophole, though by then the chips were already in widespread deployment at Alibaba, Baidu, ByteDance, and Tencent.[^17]
It is easy to forget how unusual the A100's commercial lifecycle has been. Most server GPUs have a sharp performance edge for two years and then slide quickly into the secondary market as the next architecture takes over. The A100 has had a longer and softer landing, partly because Hopper was supply-constrained for so long, partly because A100s were bought in such enormous quantities that the depreciated inventory became a real economic asset, and partly because the kinds of workloads people want to run on GPUs in 2026 are not all frontier training.
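A few patterns are visible in the 2026 market: A100s remain in service for inference and fine-tuning, neocloud rental rates sit well below current H100 and Blackwell rates, and university labs and smaller shops buy cards on the secondary market for a fraction of the original price.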
Nvidia formally announced the A100 end-of-life in January 2024 and the product page on nvidia.com began directing new buyers toward H100 and Blackwell.[^27] As of 2026 the A100 is in software lifecycle Phase I (Full Support) for vGPU and remains current on CUDA, cuDNN, TensorRT, and the major frameworks; that support is expected to continue for several more years given the size of the installed base.[^28]
From 2020 to roughly 2023, the A100 was the GPU that the modern AI industry was built on.[^14] It powered the first wave of frontier-scale transformer training, sat in essentially every major hyperscaler's AI datacenter buildout, and made distributed training on hundreds or thousands of GPUs a standard engineering pattern through libraries like Megatron-LM, DeepSpeed, and FSDP.[^10] Multi-Instance GPU normalized the idea that a single accelerator could be cleanly partitioned for inference serving, and the DGX SuperPOD reference architecture established the template for AI clusters that the H100 and Blackwell generations followed.
Even after the H100 became the prestige chip, A100 inventory continued serving inference workloads at scale. Cloud providers were still listing A100 instances in 2025 and 2026, often at prices low enough to make them attractive for fine-tuning runs and steady-state production inference of mid-sized models. The A100 will likely keep showing up in research papers' "hardware used" tables for several more years.