NVIDIA A100

Updated Jun 21, 2026

The NVIDIA A100 Tensor Core GPU is a datacenter graphics processing unit that NVIDIA introduced on May 14, 2020 as the first product built on its Ampere architecture and the GA100 die. It became the dominant chip for training and serving the first wave of large AI and generative-AI models from 2020 through roughly 2023.^[1]^[3] Designed for deep learning, high performance computing, and data analytics, the A100 packs more than 54 billion transistors on a single 826 mm2 die, which NVIDIA called "the world's largest 7-nanometer processor" at launch, and NVIDIA claimed up to 20x the AI performance of the prior V100 generation.^[1]^[3] CEO Jensen Huang announced it during the virtual GTC 2020 keynote.^[1]^[3] The A100 powered training runs for models including Megatron-Turing NLG, BLOOM, LLaMA, and the systems that produced ChatGPT, before being superseded by the NVIDIA H100 (Hopper) in 2022 and later by the Blackwell generation.^[13]^[14]

If any single chip can be said to have built the modern foundation-model industry, the A100 is the front-runner. From mid-2020 through most of 2023 it sat in essentially every frontier training cluster, every hyperscale inference fleet, and a large fraction of the academic and government supercomputing systems retooled to run transformer workloads.^[14] Even now in 2026, with Hopper winding down and Blackwell shipping in volume, the installed base of A100s is still doing useful work in inference clusters, fine-tuning shops, and university labs that bought them on the secondary market for a fraction of the original price.

At the launch, Huang framed the chip in a single line: "NVIDIA A100 GPU is a 20x AI performance leap and an end-to-end machine learning accelerator, from data analytics to training to inference."^[3]

What is the NVIDIA A100?

The A100 is a server-class AI accelerator built on TSMC's 7nm N7 process.^[1] It replaced the V100 (Volta) as Nvidia's flagship datacenter GPU and was sold in two memory variants (40GB HBM2 and 80GB HBM2e) and two physical form factors (the SXM4 mezzanine module used in HGX and DGX systems, and a PCI Express add-in card).^[4]^[5] The full GA100 silicon contains 54.2 billion transistors on an 826 mm2 die.^[1] Of the 128 streaming multiprocessors (SMs) on the full die, 108 are enabled in the shipping A100 product, giving 6,912 FP32 CUDA cores and 432 third-generation Tensor Cores.^[1]^[2]

The A100 introduced several features that defined GPU-based AI infrastructure for the next several years: third-generation Tensor Cores with TF32 and BF16 datatypes, 2:4 structured sparsity, the Multi-Instance GPU (MIG) feature for hardware partitioning, third-generation NVLink at 600 GB/s, and PCI Express 4.0 support.^[1]^[2] It also debuted in the DGX A100 server and the DGX SuperPOD reference architecture used by hyperscalers and national labs.^[3]^[7]

Infobox

Field	Value
Type	Data center GPU accelerator
Microarchitecture	Ampere
Die	GA100
Process node	TSMC 7nm (N7)
Transistors	54.2 billion
Die size	826 mm2
SMs (enabled / on die)	108 / 128
FP32 CUDA cores	6,912
FP64 CUDA cores	3,456
Tensor Cores (3rd gen)	432
Boost clock	1,410 MHz
L2 cache	40 MB
Memory variants	40 GB HBM2 (May 2020), 80 GB HBM2e (Nov 2020)
Peak memory bandwidth	1,555 GB/s (40 GB), 2,039 GB/s (80 GB SXM), 1,935 GB/s (80 GB PCIe)
Memory bus width	5,120 bits
NVLink (3rd gen)	600 GB/s aggregate bidirectional (12 links)
PCI Express	Gen 4 x16
Form factors	SXM4 module, dual-slot PCIe card, A100X converged accelerator
TDP	400 W (SXM4); 250 W (PCIe 40 GB), 300 W (PCIe 80 GB); SXM 80 GB configurable to 500 W
MIG	Up to 7 instances per GPU
Announced	May 14, 2020 (GTC 2020)
Initial shipments	May 2020
80 GB SKU	November 16, 2020 (SC20)
Compute capability	8.0
Successor	NVIDIA H100 (March 2022)
Hardware EOL announced	January 2024
Software support	Phase I (Full Support) under vGPU lifecycle as of March 2026
List price (peak)	$10,000 to $15,000 (40 GB), higher for 80 GB
Cloud rental (peak)	$1.00 to $3.00 per GPU-hour widely available by 2024
Used market (2026)	Roughly $5,000 to $10,000 per card

Key specifications

Item	A100 40GB SXM4	A100 80GB SXM4	A100 80GB PCIe
Architecture	Ampere (GA100)	Ampere (GA100)	Ampere (GA100)
Process node	TSMC 7nm (N7)	TSMC 7nm (N7)	TSMC 7nm (N7)
Transistors	54.2 billion	54.2 billion	54.2 billion
Die size	826 mm2	826 mm2	826 mm2
SMs enabled	108 of 128	108 of 128	108 of 128
FP32 CUDA cores	6,912	6,912	6,912
FP64 CUDA cores	3,456	3,456	3,456
Third-gen Tensor Cores	432	432	432
Boost clock	1,410 MHz	1,410 MHz	1,410 MHz
Memory	40 GB HBM2	80 GB HBM2e	80 GB HBM2e
Memory bandwidth	1,555 GB/s	2,039 GB/s	1,935 GB/s
L2 cache	40 MB	40 MB	40 MB
NVLink (3rd gen)	600 GB/s bidirectional	600 GB/s bidirectional	600 GB/s bidirectional
PCI Express	Gen 4 x16	Gen 4 x16	Gen 4 x16
Multi-Instance GPU	up to 7 instances	up to 7 instances	up to 7 instances
TDP	400 W	400 W (up to 500 W)	300 W

Sources: NVIDIA A100 architecture whitepaper, A100 product page, and PCIe/SXM datasheets.^[1]^[5]^[6]

When was the NVIDIA A100 released?

May 2020 announcement

Nvidia introduced the A100 on May 14, 2020 at GTC 2020, an event held online because of the COVID-19 pandemic.^[3]^[24] Jensen Huang delivered the keynote from his kitchen and pulled the first DGX A100 motherboard out of an oven for the camera. The accompanying press release described the chip as containing more than 54 billion transistors and called it "the world's largest 7-nanometer processor" at that time, and said the architecture would "unify AI training and inference and boost performance by up to 20x over its predecessors."^[3] Huang framed the A100 as a single accelerator that could handle training, inference, data analytics, and HPC, replacing what had previously required separate clusters of V100 GPUs for training and T4 GPUs for inference.^[2]

Early customers cited at launch included Microsoft, DoorDash, Indiana University, and several US national labs (notably Lawrence Berkeley National Laboratory's NERSC facility).^[3] The DGX A100 server, which packs eight A100 GPUs and dual 64-core AMD EPYC "Rome" 7742 CPUs, shipped immediately at $199,000 per unit and was rated at a record 5 petaFLOPS of AI performance.^[7]

The A100 PCIe add-in card variant followed on June 22, 2020, giving OEMs a 250 W dual-slot form factor that could be deployed in standard rack servers without an HGX baseboard.^[25]

November 2020: 80GB variant and DGX Station A100

At SC20 on November 16, 2020, Nvidia announced the A100 80GB.^[4] The new SKU doubled HBM capacity from 40 GB to 80 GB by switching from HBM2 to HBM2e, and was the first GPU to break 2 TB/s of memory bandwidth (2.039 TB/s on the SXM4 module, 1.935 TB/s on the PCIe card).^[4]^[5] Nvidia cited a 3x speedup on the DLRM recommender benchmark and roughly 2x throughput improvements on quantum chemistry workloads compared to the 40GB version, both attributable to the larger working set fitting in HBM.^[4]

The 80 GB part also raised the configurable SXM4 power envelope from 400 W to 500 W in some HGX baseboard configurations, allowing partners to push clocks slightly higher when the cooling solution could keep up.^[1] In practice most production deployments stayed at 400 W to keep thermal headroom for sustained workloads.

Alongside the 80 GB SXM upgrade, Nvidia announced the DGX Station A100, a quiet, workstation-form-factor "AI data-center in a box" with four A100 GPUs, an AMD EPYC 7742 CPU, refrigerant cooling, and a 1,500 W draw that ran on a standard wall outlet.^[26] The 320G configuration (4x 80 GB A100, 320 GB aggregate HBM2e) launched at $149,000; the 160G configuration (4x 40 GB) at $99,000.^[26] Nvidia rated the DGX Station A100 at 2.5 petaFLOPS of AI training throughput and 5 petaOPS of INT8 inference. Like the rackmount DGX A100, it supported MIG partitioning on each of its four GPUs, so small teams could share the workstation across multiple researchers.

Production lifetime and successor

The A100 remained Nvidia's flagship datacenter GPU for two years.^[14] Hopper, the successor architecture, was announced on March 22, 2022, with the H100 GPU shipping later that year.^[13] Even after the H100 launched, A100 demand remained extremely high through 2023 because supply of H100 was constrained and a large installed base of CUDA software had already been tuned for Ampere. CNBC reported in February 2023 that the A100 had become "the $10,000 chip powering the race for AI," with major foundation-model labs each operating thousands of A100s in their training clusters.^[14]

Nvidia formally announced end-of-life for the A100 product family in January 2024 and began winding down production through OEM channels during 2024.^[27] By the time Blackwell ramped in 2025, A100 production was largely complete and most new buys came through OEM channel inventory or the secondary market. The chip itself remains in Phase I (Full Support) under Nvidia's vGPU software lifecycle policy as of March 2026, meaning A100 deployments still receive new features, security patches, and driver updates.^[28]

Architecture

GA100 die and SM layout

The GA100 silicon is organized into 8 graphics processing clusters (GPCs), each containing 8 texture processing clusters (TPCs), with 2 SMs per TPC, for a total of 128 SMs on the full die.^[1] Each SM contains:

64 FP32 CUDA cores (also usable as INT32)^[1]
32 FP64 CUDA cores
4 third-generation Tensor Cores
192 KB of combined L1 data cache and shared memory (1.5x the V100's 128 KB)^[2]

The shipping A100 enables 108 of the 128 SMs, yielding 6,912 FP32 cores, 3,456 FP64 cores, and 432 Tensor Cores.^[1] The L2 cache is 40 MB, nearly seven times larger than the V100's 6 MB L2, and is split into two partitions to keep latency low. Aggregate L2 bandwidth is roughly 2.3x that of V100 according to Nvidia's Ampere whitepaper.^[1] The A100's compute capability for the CUDA toolkit is 8.0.^[2]
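
The unit counts above follow directly from the per-SM figures. A minimal sketch of that arithmetic, using only numbers quoted in this section (the variable names are illustrative):

```python
# Unit counts for the full GA100 die and the shipping A100,
# derived from the per-SM figures quoted in this section.
GPCS = 8
TPCS_PER_GPC = 8
SMS_PER_TPC = 2
ENABLED_SMS = 108

FP32_PER_SM = 64
FP64_PER_SM = 32
TENSOR_CORES_PER_SM = 4

full_die_sms = GPCS * TPCS_PER_GPC * SMS_PER_TPC

a100_units = {
    "fp32_cores": ENABLED_SMS * FP32_PER_SM,
    "fp64_cores": ENABLED_SMS * FP64_PER_SM,
    "tensor_cores": ENABLED_SMS * TENSOR_CORES_PER_SM,
}

print(full_die_sms)  # 128
print(a100_units)    # {'fp32_cores': 6912, 'fp64_cores': 3456, 'tensor_cores': 432}
```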

Third-generation Tensor Cores

The Tensor Cores in GA100 are the third generation since their introduction in Volta. They added several new capabilities relative to V100:^[1]^[2]

FP64 matrix math, accelerated through the Tensor Core data path for HPC kernels such as LU and Cholesky factorization.
TF32, a new 19-bit format (1 sign bit, 8 exponent bits, 10 mantissa bits) that combines the dynamic range of FP32 with the precision of FP16. TF32 is the default math mode for FP32 inputs in cuBLAS and cuDNN on Ampere, so existing FP32 networks can transparently get up to roughly 10x the V100's FP32 throughput with no code changes.
BF16 support at the same throughput as FP16, useful for training workloads that need FP32-like dynamic range without the cost of FP32 storage.
2:4 structured sparsity, which allows two of every four weight values in a matrix to be pruned to zero. The hardware then skips the zeros and roughly doubles dense throughput. Tooling in cuSPARSELt and TensorRT can compress and fine-tune existing dense networks into the 2:4 pattern.^[29]

Each Tensor Core in A100 performs 256 FP16 fused-multiply-add operations per clock, four times the V100 rate per core, and there are four Tensor Cores per SM.^[1]

TF32 was a quietly important format choice for adoption. It kept FP32's exponent range, so existing model code needed no scaling factors or loss-scale tuning, but cut the mantissa to 10 bits so that the multiplier hardware could be much more compact.^[2] For research teams porting models from V100 to A100, TF32 was usually a free 6 to 10x speedup on FP32 matrix math with no code changes at all. BF16 then served the more advanced users who were already running mixed-precision training and wanted FP32's dynamic range without the headaches of FP16's narrower exponent.
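
To make the TF32 trade-off concrete, the sketch below emulates TF32 precision in software by rounding an FP32 bit pattern to 10 mantissa bits while leaving the 8-bit exponent untouched. The helper `to_tf32` is hypothetical, the rounding rule (round half up on the magnitude) is an assumption rather than a statement about the hardware, and NaN and infinity are not handled.

```python
import struct

def to_tf32(x: float) -> float:
    """Round x to TF32 precision: FP32's 8-bit exponent, 10-bit mantissa.

    Illustrative only. The rounding mode is an assumption, and
    NaN/infinity inputs are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # FP32 stores 23 mantissa bits; TF32 keeps the top 10, so the low 13 go.
    # Add half of the dropped range to round, then clear the low 13 bits.
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# A step of 2**-10 is the smallest increment above 1.0 that survives.
print(to_tf32(1 + 2**-10) == 1 + 2**-10)  # True
print(to_tf32(1 + 2**-12))                # 1.0

# The exponent range is unchanged, so a value far beyond FP16's
# maximum (65504) stays finite and close to the input.
print(to_tf32(1e30))
```

The last line is the point of the format: the value loses mantissa precision but does not overflow, which is why FP32 model code could run without loss scaling.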

How fast is the NVIDIA A100? (performance per precision)

Precision	Dense throughput	With 2:4 sparsity
FP64 (CUDA core)	9.7 TFLOPS	n/a
FP64 Tensor Core	19.5 TFLOPS	n/a
FP32 (CUDA core)	19.5 TFLOPS	n/a
TF32 Tensor Core	156 TFLOPS	312 TFLOPS
BF16 Tensor Core	312 TFLOPS	624 TFLOPS
FP16 Tensor Core	312 TFLOPS	624 TFLOPS
INT8 Tensor Core	624 TOPS	1,248 TOPS
INT4 Tensor Core	1,248 TOPS	2,496 TOPS

These are peak rates at the boost clock, as listed in the Ampere whitepaper and on Nvidia's A100 product page.^[1]^[5]^[6] In headline terms, a single A100 delivers 156 TFLOPS of TF32 and 312 TFLOPS of dense BF16/FP16 Tensor Core math, rising to 624 TFLOPS of BF16/FP16 with structured sparsity enabled.^[1]^[5]
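
The headline rates can be reproduced from unit counts and the boost clock. The sketch below assumes each fused multiply-add counts as two floating-point operations and that every unit issues one FMA (or 256 FMAs per Tensor Core) per clock; the TF32 rate is taken as half the FP16 rate, following the ratio in the table rather than a documented per-core figure.

```python
BOOST_CLOCK_HZ = 1.41e9        # 1,410 MHz boost clock
TENSOR_CORES = 432
FP16_FMA_PER_TC_PER_CLOCK = 256
FP32_CORES = 6912
FP64_CORES = 3456

def peak_tflops(units: int, fma_per_unit_per_clock: int) -> float:
    """Peak TFLOPS, counting each FMA as 2 floating-point operations."""
    return units * fma_per_unit_per_clock * 2 * BOOST_CLOCK_HZ / 1e12

fp16_dense = peak_tflops(TENSOR_CORES, FP16_FMA_PER_TC_PER_CLOCK)
fp16_sparse = 2 * fp16_dense   # 2:4 sparsity doubles the dense rate
tf32_dense = fp16_dense / 2    # ratio taken from the table above
fp32 = peak_tflops(FP32_CORES, 1)
fp64 = peak_tflops(FP64_CORES, 1)

for name, value in [("FP16 dense", fp16_dense), ("FP16 sparse", fp16_sparse),
                    ("TF32 dense", tf32_dense), ("FP32", fp32), ("FP64", fp64)]:
    print(f"{name}: {value:.1f} TFLOPS")
```

The results land within rounding of the table (for example about 311.9 against the quoted 312 TFLOPS), which suggests the published figures are simple peak-rate products rather than measured throughput.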

Memory subsystem

The 40GB A100 ships with five 8-Hi HBM2 stacks for 40 GB total at 1.55 TB/s.^[1] The 80GB version uses HBM2e and reaches 2.0 TB/s on the SXM module and 1.94 TB/s on the PCIe card.^[4]^[5] The memory interface is 5,120 bits wide (five stacks at 1,024 bits each). The A100 also adds Compute Data Compression, which Nvidia claims can deliver up to 4x effective DRAM and L2 bandwidth and up to 2x effective L2 capacity on workloads with sparse or repeated values.^[1] The compression is transparent to user code; cache lines are tagged with a compression descriptor, and the memory controller and L2 decompress them on read.
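
Because the bus width is the same 5,120 bits on every variant, the bandwidth differences come down to the per-pin data rate. A rough sketch that backs out the implied rate from the quoted bandwidths (the function name is illustrative, and the results are derived values, not datasheet figures):

```python
BUS_WIDTH_BITS = 5120

# Peak bandwidths in GB/s as quoted in the specification tables.
BANDWIDTH_GB_S = {
    "A100 40GB (HBM2)": 1555,
    "A100 80GB SXM (HBM2e)": 2039,
    "A100 80GB PCIe (HBM2e)": 1935,
}

def per_pin_gbit_s(bandwidth_gb_s: float) -> float:
    """Implied data rate per bus pin in Gb/s: bytes to bits, divided by width."""
    return bandwidth_gb_s * 8 / BUS_WIDTH_BITS

rates = {name: round(per_pin_gbit_s(bw), 2) for name, bw in BANDWIDTH_GB_S.items()}
print(rates)
# {'A100 40GB (HBM2)': 2.43, 'A100 80GB SXM (HBM2e)': 3.19, 'A100 80GB PCIe (HBM2e)': 3.02}
```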

Sparsity in practice

The 2:4 structured sparsity feature was the first time matrix-sparsity acceleration shipped in a mainstream GPU at this scale.^[29] After pruning, each 4-element vector of a weight matrix is required to contain at most two non-zero values; the hardware stores only the non-zeros plus a 2-bit-per-element index that the Tensor Core uses to gather operands at load time.^[29] In published case studies the workflow was: train a dense network, prune to the 2:4 pattern with cuSPARSELt's automatic compressor, retrain briefly to recover lost accuracy, then deploy through TensorRT for inference. The reported result was roughly 2x dense throughput with negligible accuracy loss on networks like BERT and ResNet-50.^[29] In production, structured sparsity was adopted more in inference than in training because the fine-tune step was cheaper than redoing the full training run, and TensorRT's automatic kernel selection could pick the sparse path without the model owner needing to touch graph code.
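
The storage scheme described above can be sketched in a few lines. This is a plain-Python illustration of magnitude-based 2:4 pruning and the values-plus-index layout, not the cuSPARSELt or TensorRT API; the function names are hypothetical, and real tools choose which weights to drop with more care and retrain afterwards.

```python
def compress_2_of_4(weights):
    """Keep the two largest-magnitude values in each group of four.

    Returns (values, indices): the kept values and each one's position
    inside its group. Positions are 0..3, so each fits in 2 bits.
    """
    if len(weights) % 4:
        raise ValueError("length must be a multiple of 4")
    values, indices = [], []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Two largest magnitudes, then restored to ascending position order.
        keep = sorted(sorted(range(4), key=lambda j: -abs(group[j]))[:2])
        values.extend(group[j] for j in keep)
        indices.extend(keep)
    return values, indices

def decompress_2_of_4(values, indices):
    """Rebuild the pruned dense vector from the compressed form."""
    dense = [0.0] * (len(values) * 2)
    for k, (value, pos) in enumerate(zip(values, indices)):
        dense[(k // 2) * 4 + pos] = value
    return dense

w = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
values, indices = compress_2_of_4(w)
print(values)                               # [0.9, -0.7, 0.3, -0.4]
print(indices)                              # [0, 3, 1, 2]
print(decompress_2_of_4(values, indices))   # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

The compressed form holds half as many values as the dense vector, which is where the roughly 2x throughput gain comes from: the Tensor Core only multiplies the kept values, using the indices to pick the matching activations.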

Variants and form factors

Nvidia shipped the A100 in two memory configurations and three physical form factors (SXM4 module, PCIe card, and the A100X converged card), plus an export-restricted derivative for the Chinese market. The table below summarizes the lineup as it appeared on order forms and OEM spec sheets.

Variant	Form factor	Memory	Bandwidth	NVLink	TDP	Notes
A100 40 GB SXM4	SXM4 mezzanine	40 GB HBM2	1.555 TB/s	600 GB/s	400 W	Launch SKU, May 2020. Standard part for HGX A100 baseboards and the original DGX A100.^[1]^[3]
A100 40 GB PCIe	Dual-slot PCIe Gen 4	40 GB HBM2	1.555 TB/s	600 GB/s via 2-card bridge	250 W	Add-in card for general OEM servers, launched June 22, 2020.^[25]
A100 80 GB SXM4	SXM4 mezzanine	80 GB HBM2e	2.039 TB/s	600 GB/s	400 W (configurable to 500 W)	Announced November 16, 2020. First GPU to break 2 TB/s memory bandwidth.^[4]^[5] Used in updated DGX A100 and HGX A100.
A100 80 GB PCIe	Dual-slot PCIe Gen 4	80 GB HBM2e	1.935 TB/s	600 GB/s via 2-card bridge	300 W	Released mid-2021. Slightly lower memory bandwidth than the SXM module due to thermal envelope.^[5]
A100X (converged accelerator)	Dual-slot PCIe Gen 4	80 GB HBM2e	1.935 TB/s	n/a (DPU-integrated)	300 W	A100 + BlueField-2 DPU on one card with an onboard PCIe Gen 4 switch; 100 GbE networking; standard and BlueField-X operating modes.^[30]
A800 40 / 80 GB	SXM4 / PCIe	40 GB HBM2 / 80 GB HBM2e	Same as A100	400 GB/s (cut from 600)	Same as A100	China-only export-compliant SKU. In production Q3 2022.^[31]^[32]

The SXM4 module is a mezzanine card that plugs directly into a custom NVLink-enabled baseboard.^[1] SXM4 was the form factor used in every flagship cluster build because it allowed the full 600 GB/s NVLink and the higher 400 to 500 W power envelope that was needed to hold boost clocks under sustained load.

The PCIe variant is a standard dual-slot add-in card that plugs into any PCIe Gen 4 x16 slot.^[25] The PCIe card runs at a lower TDP (250 W on the 40 GB part, 300 W on the 80 GB part) and gives up some sustained throughput in exchange for fitting in a wider variety of server chassis. PCIe A100s could be linked in pairs through an NVLink bridge connector, but full eight-way NVSwitch fabrics required SXM4.

The HGX A100 baseboard comes in two reference designs: HGX A100 4-GPU and HGX A100 8-GPU.^[1] The 4-GPU variant uses direct NVLink between adjacent GPUs in a non-switched topology. The 8-GPU variant adds six second-generation NVSwitch chips that provide all-to-all 600 GB/s connectivity, which is what makes the HGX 8-GPU board the practical building block for large training clusters. OEMs including Supermicro, Dell, HPE, Lenovo, Inspur, and Foxconn all built systems around the HGX A100 baseboard.

The A100X converged accelerator put an A100 80 GB and a BlueField-2 DPU on the same dual-slot PCIe card, with an onboard PCIe Gen 4 switch enabling GPU-to-DPU traffic without touching the host PCIe complex.^[30] In its default "standard" mode the GPU and DPU appear as independent PCIe devices to the host; in "BlueField-X" mode the PCIe switch reconfigures so the GPU is dedicated to the DPU and invisible to the host, which is the architecture used for 5G vRAN and AI-on-5G deployments where the DPU runs the full software stack and the GPU is a pure compute attachment.^[30]

The A800 appeared in Q3 2022 after the first round of US export controls.^[31]^[32] It used the same silicon as the A100, with NVLink throttled from 600 GB/s to 400 GB/s, just enough to fall under the original BIS interconnect threshold.^[32] The A800 became the bestselling Nvidia datacenter GPU inside China during 2023, until the October 2023 BIS update closed the loophole.

What is Multi-Instance GPU (MIG)?

Multi-Instance GPU is a hardware partitioning feature introduced with the A100 and carried forward in later datacenter GPUs.^[1]^[2] A single A100 can be split into up to seven independent GPU instances, each with its own dedicated SMs, L2 cache slice, memory controllers, and HBM partition. On the 80GB version each instance gets 10 GB in the 7-way configuration, with larger instances available when fewer are created.^[4]^[5]

MIG was designed for cloud providers and shared inference clusters. Before MIG, sharing a GPU between tenants required time-slicing through a software scheduler, which left memory bandwidth and cache subject to noisy-neighbor interference. With MIG the partitions are physically isolated, so an OOM in one instance cannot crash another.^[2] Nvidia documents configurations such as seven 1g.10gb instances, or two 2g.20gb plus one 3g.40gb, along with various other mixes on the 80GB part.

The table below shows the practical instance profiles that Kubernetes operators and cloud providers most often expose. Each row is a valid configuration of a single 80 GB SXM4 A100.

Profile	Number of instances	SMs per instance	Memory per instance	Typical use
1g.10gb	7	14	10 GB	Small inference replicas, batch=1 LLM serving for sub-7B models
2g.20gb	3	28	20 GB	Mid-size inference, fine-tuning small models
3g.40gb	2	42	40 GB	Larger inference batches, 13B-class models
4g.40gb	1 (with 3g.40gb)	56	40 GB	Single workload sharing card with smaller tenant
7g.80gb	1	98 (all MIG-usable SMs)	80 GB	Single full-size instance; 10 of the 108 SMs are not available in MIG mode

In practice, MIG turned out to be more useful for inference than training. Splitting a card into seven 1g.10gb partitions made for a tidy way to run seven independent inference replicas behind a load balancer, which is exactly the workload shape that cloud providers like AWS, Azure, and GCP wanted to expose to multi-tenant customers. Training jobs almost always wanted the whole GPU.
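
A quick way to sanity-check a proposed mix of profiles is to add up its compute slices and memory. The sketch below is a simplified budget check built from the profile table above, assuming 7 compute slices and 80 GB per GPU. It is a necessary condition only: the real driver also enforces placement rules that this does not model, and `fits_on_one_gpu` is a hypothetical helper, not an NVIDIA API.

```python
# name: (compute slices, memory in GB), read off the profile table.
PROFILES = {
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "4g.40gb": (4, 40),
    "7g.80gb": (7, 80),
}
MAX_COMPUTE_SLICES = 7
MAX_MEMORY_GB = 80
SMS_PER_SLICE = 14

def fits_on_one_gpu(instances):
    """True if the instances stay within the slice and memory budget.

    Simplified: ignores the driver's placement constraints.
    """
    compute = sum(PROFILES[name][0] for name in instances)
    memory = sum(PROFILES[name][1] for name in instances)
    return compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_GB

print(fits_on_one_gpu(["1g.10gb"] * 7))                      # True
print(fits_on_one_gpu(["3g.40gb", "2g.20gb", "2g.20gb"]))    # True
print(fits_on_one_gpu(["4g.40gb", "3g.40gb"]))               # True
print(fits_on_one_gpu(["2g.20gb"] * 3 + ["3g.40gb"]))        # False: 9 slices
print(MAX_COMPUTE_SLICES * SMS_PER_SLICE)                    # 98 SMs usable under MIG
```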

NVLink, NVSwitch, and PCIe

The A100 implements third-generation NVLink, with each link running at 50 Gb/s per signal pair and 25 GB/s per direction.^[1] The SXM4 A100 has 12 NVLinks, giving each GPU 600 GB/s of total bidirectional bandwidth, double the 300 GB/s offered by V100. The PCIe variant can be bridged only to one adjacent card: a 600 GB/s NVLink bridge between two cards is available, but full all-to-all NVSwitch fabrics require the SXM form factor.

In the DGX A100 reference design, six second-generation NVSwitch chips connect all eight A100 GPUs in a non-blocking topology so that any GPU can talk to any other at full 600 GB/s.^[7] The PCIe interface is upgraded to PCI Express 4.0, which doubles host bandwidth to 31.5 GB/s per direction over a PCIe x16 link.
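
The interconnect figures in this section are straightforward products. The sketch below reproduces them; the PCIe 4.0 signalling rate (16 GT/s per lane) and 128b/130b line encoding are standard values for that generation but are not stated in the text, so treat them as assumptions of the example.

```python
# NVLink 3: figures quoted in this section.
NVLINK_LINKS = 12
NVLINK_GB_S_PER_DIRECTION = 25
NVLINK_GBIT_S_PER_SIGNAL_PAIR = 50

nvlink_total_gb_s = NVLINK_LINKS * NVLINK_GB_S_PER_DIRECTION * 2
signal_pairs_per_direction = (NVLINK_GB_S_PER_DIRECTION * 8
                              // NVLINK_GBIT_S_PER_SIGNAL_PAIR)

# PCIe 4.0 x16: assumed 16 GT/s per lane with 128b/130b encoding.
PCIE4_GT_S_PER_LANE = 16
PCIE_LANES = 16
PCIE_ENCODING_EFFICIENCY = 128 / 130

pcie4_x16_gb_s = (PCIE4_GT_S_PER_LANE * PCIE_LANES
                  * PCIE_ENCODING_EFFICIENCY / 8)

print(nvlink_total_gb_s)            # 600
print(signal_pairs_per_direction)   # 4
print(round(pcie4_x16_gb_s, 1))     # 31.5
```

The comparison shows why SXM4 systems were preferred for training: GPU-to-GPU traffic over NVLink has roughly ten times the bandwidth per direction of the host PCIe link (300 GB/s against about 31.5 GB/s).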

DGX A100 and DGX SuperPOD

The DGX A100 is Nvidia's reference server.^[7] Each unit contains:

8x A100 SXM4 GPUs (40GB at launch, later 80GB) connected by NVSwitch
2x AMD EPYC 7742 64-core CPUs at 2.25 GHz base
1 TB of system memory (later 2 TB on the 80GB DGX A100)
8 single-port Mellanox ConnectX-6 200 Gb/s HDR InfiniBand adapters for the GPU fabric
1 dual-port ConnectX-6 for storage and Ethernet
15 TB of NVMe storage

Nvidia rates the DGX A100 at 5 petaFLOPS of FP16 Tensor Core throughput and 10 petaOPS of INT8 inference, with a launch list price of $199,000.^[7]
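
The system-level ratings appear to be eight times the per-GPU peak rates with 2:4 sparsity enabled, rather than the dense rates. A short check using the per-GPU figures from the precision table (this reading of the rating is an inference from the arithmetic, not a statement from the datasheet):

```python
GPUS_PER_DGX = 8

# Per-GPU peak rates with 2:4 structured sparsity, from the precision table.
FP16_SPARSE_TFLOPS = 624
INT8_SPARSE_TOPS = 1248

dgx_fp16_pflops = GPUS_PER_DGX * FP16_SPARSE_TFLOPS / 1000
dgx_int8_pops = GPUS_PER_DGX * INT8_SPARSE_TOPS / 1000

print(dgx_fp16_pflops)  # 4.992, marketed as 5 petaFLOPS
print(dgx_int8_pops)    # 9.984, marketed as 10 petaOPS
```

With dense math the same system peaks at half those figures, which is worth keeping in mind when comparing against measured training throughput.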

The DGX SuperPOD is the multi-node reference design built from DGX A100 nodes, fat-tree HDR InfiniBand, and shared parallel storage.^[7] Nvidia's own production cluster, Selene, was built on the DGX A100 SuperPOD and debuted at #7 on the June 2020 TOP500 list with 27.58 PFLOPS of Linpack performance from 280 DGX A100 systems (2,240 A100 GPUs, 35,840 CPU cores), a system NVIDIA assembled in under four weeks.^[33] Selene was rebuilt with A100 80GB cards and HDR upgrades for SC20 and climbed to #5 on the November 2020 TOP500 list with 63.4 PFLOPS, more than doubling its earlier score.^[34] Selene was used internally for benchmark submissions to MLPerf and to train large research models including Megatron-Turing NLG.^[10]
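
The Selene component counts follow from the DGX A100 node configuration (eight GPUs and two 64-core CPUs per node). A small sketch of that arithmetic, plus the ratio between the two TOP500 results:

```python
NODES_JUNE_2020 = 280
GPUS_PER_NODE = 8
CPUS_PER_NODE = 2
CORES_PER_CPU = 64

gpus = NODES_JUNE_2020 * GPUS_PER_NODE
cpu_cores = NODES_JUNE_2020 * CPUS_PER_NODE * CORES_PER_CPU

LINPACK_PFLOPS_JUNE_2020 = 27.58
LINPACK_PFLOPS_NOV_2020 = 63.4
speedup = LINPACK_PFLOPS_NOV_2020 / LINPACK_PFLOPS_JUNE_2020

print(gpus)               # 2240
print(cpu_cores)          # 35840
print(round(speedup, 2))  # 2.3
```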

A "scalable unit" of the SuperPOD reference design was 20 DGX A100 nodes (160 A100 GPUs) wired into a non-blocking InfiniBand fat tree, which could be replicated and joined together to scale up.^[7] Selene started at 280 DGX A100 nodes (2,240 A100s) and grew during its operational life. Customer SuperPOD deployments at this size included clusters at NAVER (South Korea), Cambridge-1 (UK, the first commercial supercomputer dedicated to healthcare AI), and several US national lab installations.

DGX Station A100

The DGX Station A100, announced alongside the 80 GB SXM upgrade in November 2020, was Nvidia's workstation-class A100 product.^[26] It packed four A100 GPUs (80 GB each in the 320G model, 40 GB each in the 160G model) in a desk-side chassis with refrigerant cooling and a 1,500 W input draw that ran on a standard office wall outlet, sized for a small team that wanted a private AI dev box rather than a shared cluster. The 320G configuration (320 GB aggregate HBM2e, AMD EPYC 7742 CPU, up to 512 GB DDR4, 7.68 TB NVMe data SSD) listed at $149,000; the 160G configuration (4x 40 GB) listed at $99,000.^[26] MIG worked on each of the four GPUs independently, so a single machine could be partitioned into up to 28 independent GPU instances for shared lab use.

Software stack

The A100's commercial success is inseparable from the CUDA software stack that ships around it. Most of the user-visible features added to that stack between 2020 and 2023 were driven by what the A100 needed:^[2]

CUDA 11.0, released alongside the A100 in May 2020, added support for compute capability 8.0 (the Ampere SM ISA), TF32, BF16, third-generation Tensor Cores, and 2:4 structured sparsity.^[2] CUDA 11 also extended the Cooperative Groups API, which supports the grid-wide synchronization patterns important for collective kernels.
cuDNN 8 and later integrated TF32 as the default precision for FP32 convolution and matmul calls. The transparent speedup from TF32 made cuDNN 8 a near-mandatory upgrade for any A100 deployment.
NCCL picked up NVLink topology awareness and tree-reduce algorithms tuned for the A100's 600 GB/s links and the second-generation NVSwitch fabric. The NCCL improvements were a quiet but enormous part of why MLPerf training scores kept rising on the same A100 silicon between submission rounds.^[9]
TensorRT added Ampere-specific kernels for INT8 and FP16 inference, plus support for the 2:4 sparse pattern through cuSPARSELt.^[29] TensorRT 7.2 and 8.x kernels were the basis for most production A100 inference deployments through 2022.
Triton Inference Server was the open-source serving runtime that wrapped TensorRT and other backends; it became the default deployment target for GPU inference on cloud providers.
NVIDIA Apex, then Megatron-LM, then DeepSpeed, and later PyTorch FSDP, all evolved during the A100 era to handle tensor, pipeline, and data parallelism at scales of hundreds to thousands of GPUs.^[10]

It is fair to say that a meaningful share of the A100's lifetime performance gains came from this stack rather than the silicon. Between MLPerf Training v0.7 and v1.0, Nvidia reported up to 3.5x at-scale speedup on identical A100 hardware, all from CUDA, NCCL, cuDNN, and framework updates.^[9] That "chip kept getting faster" pattern has been one of the more durable arguments for why software is the actual moat in AI hardware.

MLPerf and benchmark results

The A100 dominated MLPerf benchmark submissions for the duration of its lifetime as Nvidia's flagship part:

MLPerf Training v0.7 (July 2020): Nvidia announced 16 record results on the A100 in commercially available systems, taking the per-accelerator lead on all eight benchmarks and the at-scale lead on every benchmark using a 2,048-GPU DGX SuperPOD configuration.^[8]
MLPerf Training v1.0 (June 2021): Nvidia reported up to 2.1x per-chip improvement and up to 3.5x at-scale improvement over its v0.7 numbers on the same A100 hardware, with the gains coming entirely from software (CUDA, NCCL, cuDNN, and framework updates).^[9]
MLPerf Training v1.1 (December 2021): A100 retained the per-chip lead on all eight workloads (BERT, DLRM, Mask R-CNN, ResNet-50, SSD, RNN-T, 3D U-Net, MiniGo).
MLPerf Training v2.0 and v2.1 (2022): A100 systems remained the reference baseline against which Hopper would be measured the following year.

These sustained wins were as much a story about CUDA, cuDNN, and NCCL maturity as about silicon. They were also one of the more credible signals that Nvidia's software stack was a real moat, because the same chip kept getting faster between submissions.^[9]

What AI models were trained on the A100?

GPT-3 and the Microsoft / OpenAI cluster

The original GPT-3 (announced May 2020) was trained on V100 GPUs, not A100s. Microsoft disclosed in May 2020 that it had built a supercomputer for OpenAI with more than 285,000 CPU cores, 10,000 GPUs, and 400 Gb/s of network connectivity per GPU server.^[12] Subsequent OpenAI models, including the GPT-3.5 and ChatGPT family that emerged in late 2022, ran on Microsoft Azure infrastructure that had been migrated to A100 hardware in the intervening period. CNBC's February 2023 reporting on the A100 noted that ChatGPT-class workloads were running on "thousands" of A100s.^[14]

The broader GPT-4 training infrastructure, deployed during 2022 and into 2023, was a mixed A100 and H100 fleet on Azure. Azure's ND A100 v4 series became one of the largest single deployments of A100 capacity outside Nvidia's own Selene system.^[20]

Megatron-Turing NLG 530B

Microsoft and Nvidia trained Megatron-Turing NLG 530B (a 530-billion-parameter transformer) on Selene, the Nvidia DGX A100 SuperPOD.^[10] The training infrastructure consisted of 560 DGX A100 servers, each with eight A100 80GB GPUs (4,480 A100 GPUs total) connected by HDR InfiniBand in a full fat tree.^[10] The published runs used tensor parallelism of 8 within a node, pipeline parallelism of 35 across nodes, and DeepSpeed data parallelism on top, with each model replica spanning 280 A100 GPUs.^[10]
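
The parallelism figures fit together as a simple factorization of the cluster. A minimal sketch using only the numbers in the paragraph above (the data-parallel degree is derived here, not quoted from the source):

```python
SERVERS = 560
GPUS_PER_SERVER = 8
TENSOR_PARALLEL = 8      # within one node
PIPELINE_PARALLEL = 35   # across nodes

total_gpus = SERVERS * GPUS_PER_SERVER
gpus_per_replica = TENSOR_PARALLEL * PIPELINE_PARALLEL
nodes_per_replica = gpus_per_replica // GPUS_PER_SERVER

# Data parallelism fills whatever is left once one replica's shape is fixed.
data_parallel_replicas = total_gpus // gpus_per_replica

print(total_gpus)              # 4480
print(gpus_per_replica)        # 280
print(nodes_per_replica)       # 35
print(data_parallel_replicas)  # 16
```

Setting tensor parallelism to 8 keeps the most communication-heavy dimension inside a single NVSwitch-connected node, so only pipeline and data-parallel traffic has to cross the InfiniBand fabric.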

BLOOM

BLOOM, the 176-billion-parameter open multilingual language model from the BigScience workshop, was trained on the Jean Zay supercomputer in France between March and July 2022.^[11] The training used 384 A100 80GB GPUs (48 nodes of 8 GPUs each, plus 32 spare GPUs for failure handling), connected by NVLink within each node and Omni-Path between nodes.^[11] Total compute was on the order of 1 million GPU-hours and the project consumed roughly $7 million of publicly funded compute time.^[11]

LLaMA, Stable Diffusion, and the open-model wave

Meta's first-generation LLaMA models (released February 2023) were trained on A100 80GB clusters. The LLaMA 65B run reportedly used 2,048 A100s for about 21 days, totaling around 1 million GPU-hours.^[22] Llama 2, released in July 2023, was also primarily an A100 training run before later Meta clusters transitioned to H100 for Llama 3.
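
The GPU-hour figure for the LLaMA 65B run can be checked directly from the GPU count and duration. This is a rough estimate that assumes all GPUs ran continuously for the full 21 days:

```python
GPUS = 2048
DAYS = 21
HOURS_PER_DAY = 24

gpu_hours = GPUS * DAYS * HOURS_PER_DAY
print(gpu_hours)  # 1032192, i.e. about 1 million GPU-hours
```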

Stable Diffusion 1.x was trained by Stability AI and partners on a cluster of 256 A100s donated through their Lambda Labs partnership.^[23] The original RunwayML and CompVis collaborations on the underlying latent diffusion research used much smaller A100 footprints (single-node and dual-node clusters) before the model scaled up.

Other notable training runs

DeepMind's Chinchilla, AI21's Jurassic-2, Cohere's early production models, Adept's ACT-1, Eleven Labs' first speech-synthesis models, and many of the open Mistral, Falcon, Yi, and Qwen series were either A100 trained or A100 served during 2021 to 2023. The list of foundation-model labs that did not run on A100 hardware during this period is roughly limited to Google (TPU) and a handful of TPU-leasing startups.

Cloud availability

Every major cloud provider added A100 capacity to their lineup between mid-2020 and 2021. The table below summarizes the SKUs that customers could actually order, with peak instance configurations and approximate pricing at the cycle's height.

Provider	Instance / SKU	GPU configuration	Memory	Notes
AWS	p4d.24xlarge	8x A100 40 GB SXM4	1.1 TB system RAM	Launched November 2020 at $32.77/hour list. EFA networking.^[19]
AWS	p4de.24xlarge	8x A100 80 GB SXM4	1.1 TB system RAM	Launched 2022. 80 GB upgrade of p4d.^[19]
Azure	ND A100 v4 (NDm A100 v4 80 GB)	8x A100 80 GB SXM4	1.9 TB system RAM	InfiniBand HDR. The 80 GB SKU was the workhorse for Azure's OpenAI co-located capacity.^[20]
Google Cloud	a2-highgpu-8g (and a2-megagpu-16g)	8x or 16x A100 40 GB	up to 1.36 TB	Launched July 2020. a2-megagpu paired two HGX boards in one VM.^[21]
Google Cloud	a2-ultragpu-8g	8x A100 80 GB	1.36 TB	80 GB SKU launched 2021.^[21]
Oracle	BM.GPU.A100-v2.8	8x A100 80 GB	2 TB	Bare-metal SKU on Oracle Cloud, RDMA cluster networking.
CoreWeave	A100 80 GB SXM4 instances	1 to 8x A100 80 GB per node	varies	Specialty neocloud; multi-year reserved deals to OpenAI and others.
Lambda Labs	1x A100 (40 / 80 GB), 8x A100 nodes	1 to 8x A100	varies	On-demand and reserved; price competitive with hyperscalers from 2022.
Tencent Cloud	A100 / A800 instances	8x A100 / A800	varies	Mainland China; transitioned to A800 after October 2022.
Alibaba Cloud	gn7e / ebmgn7e	8x A100 / A800	varies	Mainland China; transitioned to A800 after October 2022.

On-demand pricing for an 8-GPU A100 80 GB node in 2021 ran roughly $20 to $30 per hour at hyperscalers, which puts the per-GPU rate around $2.50 to $4.^[19]^[20] By late 2023 and into 2024, neoclouds and reserved-instance contracts had pushed the per-GPU rate as low as $1.00 to $1.50 per hour for A100 80 GB capacity. Through 2025 and into 2026 spot prices for A100 instances on neoclouds settled around $0.80 to $1.50 per GPU-hour, far below current H100 and Blackwell rates, which is much of the reason the A100 still has a market.
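
Per-GPU rates are just the node price divided by eight. The sketch below converts the node-level prices quoted in this section; the helper name is illustrative and the inputs are the approximate figures given in the text, not current price quotes.

```python
GPUS_PER_NODE = 8

def per_gpu_hourly(node_price_per_hour: float) -> float:
    """Hourly price per GPU for an 8-GPU node, rounded to cents."""
    return round(node_price_per_hour / GPUS_PER_NODE, 2)

node_range_2021 = (20.0, 30.0)  # approximate hyperscaler on-demand range
p4d_list_price = 32.77          # AWS p4d.24xlarge launch list price

print([per_gpu_hourly(p) for p in node_range_2021])  # [2.5, 3.75]
print(per_gpu_hourly(p4d_list_price))                # 4.1
```

At $0.80 to $1.50 per GPU-hour, the later neocloud rates work out to roughly a fifth to a third of the p4d launch price per GPU.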

Pricing and availability

Nvidia does not publish list prices for datacenter GPUs and most A100 sales went through OEM partners, but a few public datapoints anchor the range:

The launch DGX A100 server (8x A100 40 GB, dual EPYC, 15 TB NVMe) listed at $199,000.^[7]
The DGX Station A100 launched at $99,000 (160G, 4x 40 GB) and $149,000 (320G, 4x 80 GB) in November 2020.^[26]
Individual A100 PCIe cards traded in the $10,000 to $15,000 range over most of 2020 to 2023, with the 80 GB variant carrying a premium.^[14]
AWS launched the p4d.24xlarge instance (8x A100 40 GB) in November 2020 at an on-demand list price of $32.77 per hour.^[19] Google Cloud's a2-highgpu-8g and Azure's ND A100 v4 series sat in similar territory. By late 2023 many specialty cloud providers were offering single A100s for under $2 per GPU-hour.
Cloud capacity for A100s was tight enough through 2022 and most of 2023 that startups frequently signed multi-year reserved-instance commitments to lock in supply.
On the secondary market, A100 80 GB cards trade roughly between $5,000 and $10,000 per card in 2026, with prices that move quickly week to week as buyers pull large lots out of decommissioned hyperscale fleets. The 40 GB SXM4 modules are noticeably cheaper, often under $5,000 per module, because they are harder to redeploy outside a compatible HGX baseboard.

These numbers should be treated as approximate. The A100 used market is illiquid, dominated by a handful of brokers and refurbishment houses, and pricing moves with each large cluster decommissioning event. When Microsoft retired a tranche of older A100 capacity in early 2026, broker-channel prices on 80 GB modules dropped roughly 20 percent in two weeks before stabilizing.

How does the A100 differ from the H100?

The transition from A100 to H100 was the largest single generational jump in Nvidia's datacenter line, and the table below shows where the gaps lie.^[13] The H100 is faster on every Tensor Core precision, raises NVLink bandwidth by 50 percent (600 GB/s to 900 GB/s), adds FP8 support and the Transformer Engine, and raises HBM bandwidth by roughly two-thirds (2.0 TB/s to 3.35 TB/s). The A100 still wins on price per GPU-hour and on its simpler power and cooling profile.

Specification	A100 80 GB SXM4	H100 SXM5
Architecture	Ampere (GA100)	Hopper (GH100)
Process	TSMC 7nm (N7)	TSMC 4N
Transistors	54.2 billion	80 billion
Die size	826 mm2	814 mm2
FP32 CUDA cores	6,912	16,896
Tensor Cores	432 (3rd gen)	528 (4th gen)
Memory	80 GB HBM2e	80 GB HBM3 (94 GB on later NVL)
Memory bandwidth	2.0 TB/s	3.35 TB/s
L2 cache	40 MB	50 MB
FP64 Tensor Core	19.5 TFLOPS	67 TFLOPS
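TF32 Tensor Core (dense)	156 TFLOPS	about 495 TFLOPS (989 with sparsity)
BF16 / FP16 Tensor Core (dense)	312 TFLOPS	about 990 TFLOPS (1,979 with sparsity)
FP8 Tensor Core (dense)	not supported	about 1,979 TFLOPS (3,958 with sparsity)
INT8 Tensor Core (dense)	624 TOPS	about 1,979 TOPS (3,958 with sparsity)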
NVLink bandwidth	600 GB/s	900 GB/s
PCI Express	Gen 4	Gen 5
TDP (SXM)	400 W	700 W
Launch price (per GPU, peak)	$10,000 to $15,000	$25,000 to $40,000
Cloud price (2024 to 2026)	$1.00 to $2.00 / GPU-hour	$2.00 to $4.00 / GPU-hour
Best fit (2026)	Inference, fine-tuning, mid-size training	Frontier training, FP8 inference

The Transformer Engine is the most important difference for production foundation-model work.^[13] The H100's FP8 support can roughly double inference throughput on transformer architectures relative to 16-bit precision, typically with little accuracy loss once the model is calibrated, and the A100 has no FP8 path to match it. For inference on moderately sized models where the math is not the bottleneck, however, the gap narrows and the A100's lower per-GPU-hour cost often wins on total cost of serving.
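The case where "the math is not the bottleneck" can be made concrete with a rough bandwidth-bound estimate. The sketch below assumes single-stream decoding reads every model weight once per generated token, so throughput is capped at memory bandwidth divided by weight size; it ignores KV-cache traffic, batching, and kernel efficiency. The bandwidth figures come from the comparison table, while the hourly prices (midpoints of the table's cloud price ranges) and the 26 GB model size are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    mem_bandwidth_gb_s: float  # HBM bandwidth from the comparison table
    price_per_hour: float      # ASSUMPTION: midpoint of the table's cloud price range


def decode_tokens_per_second(gpu: Gpu, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed when each token reads all weights once."""
    return gpu.mem_bandwidth_gb_s / weights_gb


def cost_per_million_tokens(gpu: Gpu, weights_gb: float) -> float:
    """USD per one million generated tokens at the bandwidth-bound rate."""
    tokens_per_hour = decode_tokens_per_second(gpu, weights_gb) * 3600
    return gpu.price_per_hour / tokens_per_hour * 1_000_000


A100 = Gpu("A100 80 GB SXM4", mem_bandwidth_gb_s=2000.0, price_per_hour=1.50)
H100 = Gpu("H100 SXM5", mem_bandwidth_gb_s=3350.0, price_per_hour=3.00)

WEIGHTS_GB = 26.0  # HYPOTHETICAL: a 13B-parameter model stored in 16-bit precision

speedup = decode_tokens_per_second(H100, WEIGHTS_GB) / decode_tokens_per_second(A100, WEIGHTS_GB)
price_ratio = H100.price_per_hour / A100.price_per_hour

for gpu in (A100, H100):
    print(f"{gpu.name}: ${cost_per_million_tokens(gpu, WEIGHTS_GB):.2f} per million tokens")
print(f"H100 speedup {speedup:.3f}x vs price ratio {price_ratio:.2f}x")
```

Under these assumptions the H100 is about 1.7x faster but costs 2x as much per hour, so the A100 comes out cheaper per token. Large batches or FP8 weights move the workload back toward being compute-bound and shift the balance toward the H100.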

Predecessor and successor context

Volta V100 (predecessor)

The V100, released in 2017 on the Volta architecture, introduced the original Tensor Core and was the workhorse for the first generation of large transformer training (BERT, GPT-2, and the original GPT-3). The A100 delivered roughly two to three times the V100's per-chip performance on FP16/FP32 mixed-precision training, added BF16 and TF32 support, increased memory bandwidth from 900 GB/s to 1.55 TB/s (40 GB) or 2.0 TB/s (80 GB), and replaced the V100's 6 MB L2 cache with a 40 MB L2.^[1]^[2]

Hopper H100 (successor)

The H100, announced March 22, 2022 and shipping later that year, succeeded the A100 as Nvidia's flagship.^[13] Hopper added a Transformer Engine with FP8 Tensor Cores, fourth-generation NVLink at 900 GB/s per GPU, HBM3 memory at up to 3.35 TB/s, PCIe Gen 5, and confidential computing features. Nvidia rates the H100 as up to 9x faster than the A100 for large language model training and up to 30x faster for inference at large batch sizes, although in mixed real-world workloads the gap was usually closer to 2x to 3x.^[13] Hopper itself was succeeded in 2024 to 2025 by the Blackwell generation, led by the B200.

US export controls and the A800

On October 7, 2022, the US Department of Commerce's Bureau of Industry and Security (BIS) issued new export controls targeting advanced computing chips destined for China.^[18] The rules used two simultaneous thresholds: a peak-performance threshold and an interconnect bandwidth threshold of 600 GB/s of aggregate bidirectional chip-to-chip bandwidth.^[18] The A100, with exactly 600 GB/s of NVLink and very high INT8 throughput, fell on the wrong side of both criteria.

Nvidia responded with the A800, a rebinned A100 SKU intended for the China market that throttled NVLink to 400 GB/s while leaving the GPU's compute throughput essentially unchanged.^[31]^[32] The A800 entered production in Q3 2022 and quickly became Nvidia's bestselling AI accelerator inside China.^[31] A parallel H800 followed for the Hopper generation. The October 2023 update to the BIS rules added a performance density threshold and effectively closed the A800/H800 loophole, though by then the chips were already in widespread deployment at Alibaba, Baidu, ByteDance, and Tencent.^[17]
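The two-threshold logic explains why cutting NVLink bandwidth alone was enough to take the A800 out of scope. The Python sketch below is a simplified model of the October 2022 rule as described here: a chip is controlled only if it meets both thresholds. The 600 GB/s interconnect figure is from the text; the performance threshold of 4,800 (TOPS multiplied by operand bit length) is not stated in this article and is included as an assumption drawn from public summaries of the rule. The A800's INT8 figure is assumed equal to the A100's, following the statement that compute was left essentially unchanged.

```python
from dataclasses import dataclass

INTERCONNECT_THRESHOLD_GB_S = 600  # from the text: aggregate bidirectional bandwidth
PERF_THRESHOLD = 4800              # ASSUMPTION: TOPS x bit length, per public rule summaries


@dataclass(frozen=True)
class Chip:
    name: str
    int8_tops: float          # dense INT8 Tensor Core throughput
    interconnect_gb_s: float  # chip-to-chip (NVLink) bandwidth


def performance_score(chip: Chip) -> float:
    """TOPS multiplied by operand bit length (8 bits for INT8)."""
    return chip.int8_tops * 8


def is_controlled(chip: Chip) -> bool:
    """A chip is in scope only when BOTH thresholds are met or exceeded."""
    return (
        performance_score(chip) >= PERF_THRESHOLD
        and chip.interconnect_gb_s >= INTERCONNECT_THRESHOLD_GB_S
    )


A100 = Chip("A100", int8_tops=624, interconnect_gb_s=600)
A800 = Chip("A800", int8_tops=624, interconnect_gb_s=400)  # compute assumed unchanged

for chip in (A100, A800):
    status = "controlled" if is_controlled(chip) else "not controlled"
    print(f"{chip.name}: score {performance_score(chip):.0f}, "
          f"{chip.interconnect_gb_s:.0f} GB/s -> {status}")
```

Because the test is a logical AND, dropping below either threshold was sufficient, which is the gap the October 2023 performance-density criterion later closed.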

Aging gracefully: the A100 in 2026

It is easy to forget how unusual the A100's commercial lifecycle has been. Most server GPUs have a sharp performance edge for two years and then slide quickly into the secondary market as the next architecture takes over. The A100 has had a longer and softer landing, partly because Hopper was supply-constrained for so long, partly because A100s were bought in such enormous quantities that the depreciated inventory became a real economic asset, and partly because the kinds of workloads people want to run on GPUs in 2026 are not all frontier training.

A few patterns are visible in the 2026 market:

Inference workhorses. A100 80 GB cards are still actively serving production inference for 7B, 13B, 30B, and even 70B-class open models, especially at smaller cloud providers and fine-tuning platforms. The lower per-hour cost more than makes up for the lower per-GPU throughput on these workloads. With MIG, a single 80 GB A100 can host two or three independent inference replicas behind a load balancer, which makes it competitive on cost-per-token for many serving patterns.
Fine-tuning and research. A100 cluster time is now cheap enough that academic labs and small companies can run multi-billion parameter fine-tunes that would have been hyperscaler-only as recently as 2022. The 80 GB capacity keeps many of these projects feasible without elaborate model-sharding setups.
Sovereign and regional clouds. Several non-US cloud operators have built out A100-based regional capacity, both because the chips were available outside the immediate H100 supply chokepoint and because the export-control regime around the H100 is more restrictive than the one covering (now mostly grandfathered) A100 deployments.^[17]
Aging out of frontier training. Foundation-model labs have largely retired A100 clusters from new pretraining runs, both because newer architectures' FP8 support gives a real per-watt advantage and because the network fabric matters more at frontier scales. A100 capacity at OpenAI, Anthropic, Meta, and Google has been progressively reassigned to inference and post-training during 2024 to 2026.
Power and cooling pressure. A100 nodes draw less power per rack than Blackwell B200 systems, which is meaningful for older datacenter facilities that cannot easily move from air to liquid cooling. A100 footprint is one of the cleaner ways to add inference capacity in a power-constrained colocation site.
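The "two or three independent inference replicas" figure for MIG follows from the fixed partition sizes the A100 80 GB supports. The sketch below picks the smallest MIG profile that fits a given model footprint and reports how many such instances one GPU can hold. The profile names and instance counts are listed from memory of NVIDIA's MIG documentation and should be checked against it; the helper name and the example footprints are hypothetical.

```python
# (profile name, memory per instance in GB, max instances per A100 80 GB)
# ASSUMPTION: values recalled from NVIDIA's MIG documentation; usable memory
# per instance is slightly below the nominal figure in practice.
MIG_PROFILES = [
    ("1g.10gb", 10, 7),
    ("2g.20gb", 20, 3),
    ("3g.40gb", 40, 2),
    ("7g.80gb", 80, 1),
]


def replicas_for(footprint_gb: float) -> tuple[str, int]:
    """Return the smallest profile that fits the footprint and its replica count."""
    if footprint_gb <= 0:
        raise ValueError("footprint_gb must be positive")
    for name, memory_gb, max_instances in MIG_PROFILES:
        if footprint_gb <= memory_gb:
            return name, max_instances
    raise ValueError("model does not fit on a single A100 80 GB")


# HYPOTHETICAL footprints: weights plus KV cache and runtime overhead.
for label, footprint in [("7B at 16-bit", 16.0), ("13B at 16-bit", 30.0)]:
    profile, count = replicas_for(footprint)
    print(f"{label} ({footprint:.0f} GB): {count} x {profile}")
```

A roughly 16 GB footprint lands on three 2g.20gb instances and a roughly 30 GB footprint on two 3g.40gb instances, matching the two-or-three range; much smaller models can go as high as seven instances per card.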

Nvidia formally announced the A100 end-of-life in January 2024 and the product page on nvidia.com began directing new buyers toward H100 and Blackwell.^[27] As of 2026 the A100 is in software lifecycle Phase I (Full Support) for vGPU and remains current on CUDA, cuDNN, TensorRT, and the major frameworks; that support is expected to continue for several more years given the size of the installed base.^[28]

Legacy

From 2020 to roughly 2023, the A100 was the GPU that the modern AI industry was built on.^[14] It powered the first wave of frontier-scale transformer training, sat in essentially every major hyperscaler's AI datacenter buildout, and made distributed training on hundreds or thousands of GPUs a standard engineering pattern through libraries like Megatron-LM, DeepSpeed, and FSDP.^[10] Multi-Instance GPU normalized the idea that a single accelerator could be cleanly partitioned for inference serving, and the DGX SuperPOD reference architecture established the template for AI clusters that the H100 and Blackwell generations followed.

Even after the H100 became the prestige chip, A100 inventory continued serving inference workloads at scale. Cloud providers were still listing A100 instances in 2025 and 2026, often at prices low enough to make them attractive for fine-tuning runs and steady-state production inference of mid-sized models. The A100 will likely keep showing up in research papers' "hardware used" tables for several more years.

References

NVIDIA A100 Tensor Core GPU Architecture (whitepaper), Nvidia, 2020.
NVIDIA Ampere Architecture In-Depth, Ronny Krashinsky and Olivier Giroux, NVIDIA Developer Blog, May 14, 2020.
NVIDIA's New Ampere Data Center GPU in Full Production, NVIDIA Newsroom, May 14, 2020.
NVIDIA Doubles Down: Announces A100 80GB GPU, NVIDIA Newsroom, November 16, 2020.
NVIDIA A100 Tensor Core GPU product page, Nvidia.
NVIDIA A100 Tensor Core GPU Datasheet (80GB), Nvidia.
NVIDIA DGX A100 Datasheet, Nvidia.
NVIDIA Breaks 16 AI Performance Records in Latest MLPerf Benchmarks, NVIDIA Blog, July 29, 2020.
MLPerf v1.0 Training Benchmarks: Insights into a Record-Setting NVIDIA Performance, NVIDIA Developer Blog, June 2021.
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, Paresh Kharya and Ali Alvi, NVIDIA / Microsoft, October 11, 2021.
The Technology Behind BLOOM Training, Stas Bekman, Hugging Face, July 14, 2022.
Bringing AI supercomputing to customers (OpenAI / Microsoft 10,000 GPU cluster), Microsoft Azure Blog, May 19, 2020.
NVIDIA Hopper Architecture In-Depth, NVIDIA Developer Blog, March 22, 2022.
Nvidia's A100 is the $10,000 chip powering the race for AI, Kif Leswing, CNBC, February 23, 2023.
Ampere (microarchitecture), Wikipedia.
Nvidia DGX, Wikipedia.
TIMELINE: GPU Export Controls, NVIDIA GPU Bans, & AI GPU Black Market, GamersNexus.
Understanding the Biden Administration's Updated Export Controls, Gregory C. Allen, CSIS, October 27, 2022.
Amazon EC2 P4d Instances, Amazon Web Services product page.
Azure ND A100 v4 series, Microsoft Azure documentation.
Google Cloud A2 VMs (with NVIDIA A100), Google Cloud documentation.
LLaMA: Open and Efficient Foundation Language Models, Hugo Touvron et al., Meta AI, February 2023.
High-Resolution Image Synthesis with Latent Diffusion Models, Robin Rombach et al. (Stable Diffusion paper), 2021.
Nvidia begins shipping the A100, its first Ampere-based data center GPU, Frederic Lardinois, TechCrunch, May 14, 2020.
NVIDIA A100 40GB PCIe GPU Accelerator Product Brief, Nvidia, September 2020.
NVIDIA DGX Station A100 Offers Researchers AI Data-Center-in-a-Box, NVIDIA Newsroom, November 16, 2020.
NVIDIA HGX A100, PCIe A100, Tesla T4 Passive GPU end-of-life notice, One Stop Systems, 2024.
NVIDIA Virtual GPU Software Lifecycle on Supported GPUs (March 18, 2026), Nvidia.
Structured Sparsity in the NVIDIA Ampere Architecture and Applications in Search Engines, NVIDIA Developer Blog.
Accelerating Data Center AI with the NVIDIA Converged Accelerator Developer Kit (A100X), NVIDIA Developer Blog.
NVIDIA introduces A800 data-center GPU for China, VideoCardz, November 2022.
Nvidia Devises a New Chip for China That Passes U.S. Export Controls, The Motley Fool, November 11, 2022.
TOP500 List - June 2020: Selene at #7 with 27.58 PFLOPS, TOP500.
Top500: Fugaku Keeps Crown, Nvidia's Selene Climbs to #5 (63.4 PFLOPS), HPCwire, November 16, 2020.
