High Bandwidth Memory (HBM)
Last reviewed
May 1, 2026
Sources
22 citations
Review status
Source-backed
Revision
v1 · 4,296 words
High Bandwidth Memory (HBM) is a JEDEC-standardised, 3D-stacked DRAM technology designed to deliver very high memory bandwidth and good energy efficiency to a logic die that sits in close physical proximity, typically a GPU or AI accelerator. HBM uses through-silicon vias (TSVs) to vertically stack multiple DRAM dies on top of a base/buffer die, then connects to the host processor through a wide parallel interface (1024 bits per stack for HBM1 through HBM3e, doubled to 2048 bits for HBM4) that runs across a silicon interposer or other advanced packaging substrate. The first HBM standard, JESD235, was published by JEDEC in October 2013, and the first commercial HBM-equipped product was AMD's Fiji GPU in the Radeon R9 Fury X, launched on 24 June 2015.
HBM has become the defining memory technology of the modern AI accelerator era. Every flagship training and inference chip from NVIDIA, AMD, Google, Intel, AWS, Huawei, and most other vendors uses HBM, because large language model training and serving are bandwidth-bound at the matrix-multiplication and attention layers. By stacking DRAM dies vertically and using a very wide bus, HBM delivers between 3x and 10x the per-package bandwidth of GDDR or LPDDR alternatives, at lower energy per bit, in exchange for significant cost and packaging complexity.
Large language model inference, especially the autoregressive decode phase that generates one token at a time, is dominated by the cost of streaming model weights and the key/value cache from memory into the compute units. The arithmetic per byte transferred is low, so the accelerator's effective throughput is set by memory bandwidth rather than peak floating point rate. This is the classic memory wall, and it gets worse as models scale.
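As a rough illustration of why decode is bandwidth-bound, the sketch below computes an upper bound on single-sequence token rate under the simplifying assumption that every generated token streams all model weights from memory exactly once. It ignores KV cache traffic, batching, and compute time, and it does not check whether the model fits in the device's memory. The function name and the 70-billion-parameter FP16 model are illustrative choices; the bandwidth values are the ones quoted elsewhere in this article.

```python
def decode_rate_ceiling(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s for one sequence if each token streams all weights once."""
    if weight_bytes <= 0:
        raise ValueError("weight_bytes must be positive")
    return bandwidth_bytes_per_s / weight_bytes


# Illustrative model (assumption): 70e9 parameters stored in FP16, 2 bytes each.
WEIGHT_BYTES = 70e9 * 2

# Memory bandwidth figures quoted in this article, in bytes per second.
BANDWIDTHS = {
    "3.35 TB/s (H100 SXM class)": 3.35e12,
    "4.8 TB/s (H200 class)": 4.8e12,
    "8.0 TB/s (B200 class)": 8.0e12,
}

for label, bandwidth in BANDWIDTHS.items():
    ceiling = decode_rate_ceiling(WEIGHT_BYTES, bandwidth)
    print(f"{label}: at most {ceiling:.1f} tokens/s per sequence")
```

Under these assumptions the ceiling scales linearly with bandwidth and does not depend on peak FLOPS, which is the point the paragraph above makes. Batching several sequences amortises the weight traffic and raises aggregate throughput.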
Training is more compute-bound than decode but still benefits enormously from bandwidth, because activation memory, optimizer state, and weight gradients all live in HBM and must be moved between operators many times per step. Larger HBM capacity per accelerator also reduces the need for tensor or pipeline parallelism across multiple GPUs, which lowers communication overhead and lets a single device serve a larger model.
A practical example: an NVIDIA H200 with 141 GB of HBM3e at 4.8 TB/s can hold the weights of a Llama 3.1 70B parameter model in FP16 (around 140 GB) on a single GPU, although that leaves only about 1 GB for the KV cache and activations. The previous-generation H100 with 80 GB of HBM3 cannot fit that model in FP16 on one GPU, forcing tensor parallelism across two or more devices. The same capacity growth continues with NVIDIA's Blackwell B200 (192 GB) and AMD's MI325X (256 GB) and MI355X (288 GB), where state-of-the-art LLMs at full precision can sit on one or two devices.
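The fit check in this example is simple arithmetic, sketched below. The helper names and the `reserve_gb` parameter (memory set aside for the KV cache and activations) are hypothetical, and the calculation counts weights only, using decimal gigabytes.

```python
import math

# Bytes per parameter for a few common storage formats.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}


def weight_footprint_gb(params_billion: float, dtype: str) -> float:
    """Weight size in GB: (billions of parameters) x (bytes per parameter)."""
    return params_billion * BYTES_PER_PARAM[dtype]


def min_devices(params_billion: float, dtype: str, hbm_gb: float,
                reserve_gb: float = 0.0) -> int:
    """Smallest device count whose combined usable HBM holds the weights."""
    usable_gb = hbm_gb - reserve_gb
    if usable_gb <= 0:
        raise ValueError("reserve_gb must be smaller than hbm_gb")
    return math.ceil(weight_footprint_gb(params_billion, dtype) / usable_gb)


# Capacities quoted in this article.
for name, capacity_gb in [("H100", 80), ("H200", 141), ("B200", 192)]:
    devices = min_devices(70, "fp16", capacity_gb)
    print(f"{name} ({capacity_gb} GB): {devices} device(s) for 70B FP16 weights")
```

Setting aside even a modest reserve changes the answer for the H200: `min_devices(70, "fp16", 141, reserve_gb=20)` returns 2, which shows how tight the single-GPU fit is.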
The table below summarises the JEDEC standards and headline specifications for each HBM generation. Per-stack bandwidth is the JEDEC ceiling at the highest data rate; commercial products often ship at lower rates initially.
| Generation | JEDEC standard | Year (standard) | Data rate per pin | Bus width per stack | Max bandwidth per stack | Max die stack | Max capacity per stack |
|---|---|---|---|---|---|---|---|
| HBM (HBM1) | JESD235 | October 2013 | 1.0 Gbps | 1024-bit | 128 GB/s | 4-Hi | 1 GB |
| HBM2 | JESD235A | January 2016 | 2.0 Gbps | 1024-bit | 256 GB/s | 8-Hi | 8 GB |
| HBM2E | JESD235B/C/D | 2018-2021 | 3.6 Gbps | 1024-bit | 461 GB/s | 12-Hi | 24 GB |
| HBM3 | JESD238 | January 2022 | 6.4 Gbps | 1024-bit | 819 GB/s | 16-Hi | 64 GB |
| HBM3e | JESD238A | 2024 | 9.6-9.8 Gbps | 1024-bit | ~1.2 TB/s | 12-Hi (16-Hi qual.) | 36 GB (commercial), 48 GB (spec ceiling) |
| HBM4 | JESD270-4 | 16 April 2025 | 8.0 Gbps | 2048-bit | 2.0 TB/s | 16-Hi | 64 GB |
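
The bandwidth column follows directly from the data rate and bus width columns: per-stack bandwidth in GB/s is the per-pin rate in Gbps times the bus width in bits, divided by 8 bits per byte. The short sketch below recomputes the column from the table's own figures; the function name is illustrative, and the HBM3e row uses the 9.6 Gbps end of the quoted range.

```python
def stack_bandwidth_gb_s(pin_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak per-stack bandwidth in GB/s: Gbps per pin x pins / 8 bits per byte."""
    return pin_rate_gbps * bus_width_bits / 8


# (generation, data rate per pin in Gbps, bus width in bits), taken from the table.
GENERATIONS = [
    ("HBM1", 1.0, 1024),
    ("HBM2", 2.0, 1024),
    ("HBM2E", 3.6, 1024),
    ("HBM3", 6.4, 1024),
    ("HBM3e", 9.6, 1024),
    ("HBM4", 8.0, 2048),
]

for name, rate, width in GENERATIONS:
    print(f"{name}: {stack_bandwidth_gb_s(rate, width):.1f} GB/s per stack")
```

The HBM4 row shows the effect of the wider bus: a lower pin rate than HBM3e still yields a higher per-stack figure.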
JEDEC published the first HBM standard, JESD235, in October 2013 after a multi-year collaboration that started with a 2010 proposal from AMD and SK hynix. The original engineering work was presented by Lee et al. at the 2014 IEEE International Solid-State Circuits Conference (ISSCC) in the paper "A 1.2 V 8 Gb 8-Channel 128 GB/s High-Bandwidth Memory (HBM) Stacked DRAM with Effective I/O Test Circuits."
HBM1 specified eight independent 128-bit channels per stack, a 1.0 Gbps per-pin data rate, a 1024-bit total interface, and 128 GB/s of bandwidth per stack. Stacks topped out at 4-Hi with 1 GB capacity at 2 Gb per die.
The technology first shipped commercially in AMD's Fiji GPU, sold as the Radeon R9 Fury X, on 24 June 2015. Fury X used four HBM1 stacks for 4 GB of total memory across a 4096-bit aggregate bus, delivering 512 GB/s, the highest of any single GPU at the time. The 4 GB capacity ceiling proved to be Fiji's biggest practical limitation as games started exceeding 4 GB working sets in 2015 and 2016.
JEDEC accepted HBM2 as JESD235A in January 2016. HBM2 doubled the per-pin data rate to 2.0 Gbps, kept the 1024-bit interface, and reached 256 GB/s per stack. Stack heights expanded to 8-Hi and per-stack capacity to 8 GB at 8 Gb per die.
NVIDIA shipped the first HBM2 product, the Tesla P100 (Pascal), with sales beginning on 5 April 2016. P100 carried 16 GB of HBM2 across four 4-Hi stacks, delivering 720 GB/s on TSMC's CoWoS substrate. P100 proved that compute accelerators could be built around HBM as a primary memory tier. The Tesla V100 (Volta, 2017) followed with 16 GB or 32 GB HBM2 and 900 GB/s. AMD's Vega-architecture Radeon Vega 56 and Vega 64 also used HBM2.
HBM2E is the marketing name for JEDEC updates JESD235B (November 2018), JESD235C (January 2020), and JESD235D (February 2021), which together pushed the per-pin rate to 3.2 then 3.6 Gbps and allowed 12-Hi stacks up to 24 GB.
Samsung announced 16 GB Flashbolt HBM2E at 3.2 Gbps on 20 March 2019 (410 GB/s per stack). SK hynix followed on 12 August 2019 with 16 GB HBM2E at 3.6 Gbps (460 GB/s per stack). NVIDIA used HBM2E in the A100 80 GB upgrade in late 2020, reaching roughly 2 TB/s of memory bandwidth (1.94 TB/s PCIe, around 2.04 TB/s SXM). Intel's Xeon CPU Max series (Sapphire Rapids HBM), launched on 10 January 2023, was the first x86 server CPU with on-package HBM, carrying 64 GB of HBM2E for around 1 TB/s.
JEDEC published HBM3 as JESD238 on 27 January 2022. HBM3 raised the per-pin data rate to 6.4 Gbps, double the 3.2 Gbps of first-generation HBM2E, for 819 GB/s per stack. It also doubled the channel count from 8 (HBM2) to 16 with two pseudo-channels each (32 effective channels), and supported 4-Hi through 16-Hi stacks with capacities up to 64 GB at the spec ceiling. Commercial parts shipped at up to 24 GB (12-Hi with 2 GB dies).
NVIDIA's H100 (Hopper, March 2022) was the first HBM3 product, carrying 80 GB across five active stacks (one disabled for yield) and reaching 3.35 TB/s in the SXM variant. AMD's Instinct MI300X, launched in late 2023, used eight HBM3 stacks for 192 GB and 5.3 TB/s, the largest HBM capacity on a single GPU at the time.
HBM3e is the JEDEC JESD238A revision, finalised in 2024, extending HBM3 to 9.6 Gbps per pin (some vendors push to 9.8 Gbps) for roughly 1.2 TB/s per stack. SK hynix began volume production of 8-Hi HBM3e at 24 GB per stack in March 2024, then announced volume production of the world's first 12-Hi HBM3e at 36 GB per stack on 26 September 2024.
NVIDIA's H200 was the first HBM3e product: 141 GB across six active 24 GB stacks at 4.8 TB/s, announced at SC23 in November 2023 and reaching customers in Q2 2024. The B200 Blackwell GPU, launched in 2024, uses eight 24 GB HBM3e stacks for 192 GB and 8 TB/s across an 8192-bit aggregate bus. AMD's MI325X (256 GB at 6 TB/s) and MI355X (288 GB at 8 TB/s, eight 36 GB 12-Hi stacks) also use HBM3e.
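Device-level bandwidth can be turned back into an implied per-pin data rate by dividing by the aggregate bus width. The sketch below does this for the products named above, assuming a 1024-bit interface per active stack and decimal units (1 TB/s = 1000 GB/s). The function name is illustrative, and the results are back-of-envelope values derived from the quoted totals, not vendor-published pin speeds.

```python
def implied_pin_rate_gbps(device_bw_tb_s: float, active_stacks: int,
                          bus_width_bits: int = 1024) -> float:
    """Per-pin rate in Gbps implied by total bandwidth and aggregate bus width."""
    aggregate_bits = active_stacks * bus_width_bits
    return device_bw_tb_s * 1000 * 8 / aggregate_bits


# (product, bandwidth in TB/s, active stacks), as quoted in this article.
DEVICES = [
    ("H100 SXM", 3.35, 5),
    ("MI300X", 5.3, 8),
    ("H200", 4.8, 6),
    ("B200", 8.0, 8),
]

for name, bandwidth, stacks in DEVICES:
    rate = implied_pin_rate_gbps(bandwidth, stacks)
    print(f"{name}: {stacks * 1024}-bit bus, about {rate:.2f} Gbps per pin")
```

Every implied rate lands below the JEDEC ceiling for its generation (6.4 Gbps for HBM3, 9.6 Gbps for HBM3e), consistent with commercial products shipping below the maximum standardised rate.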
JEDEC published the HBM4 standard as JESD270-4 on 16 April 2025. HBM4's most consequential change is the doubling of the per-stack interface from 1024 to 2048 bits, which lets the standard hit 2 TB/s per stack at a relatively conservative 8 Gbps per pin. The number of independent channels per stack doubles from 16 to 32, and HBM4 supports 4-Hi, 8-Hi, 12-Hi, and 16-Hi configurations with 24 Gb or 32 Gb dies, allowing a 64 GB ceiling per stack at 16-Hi. Vendor-specific VDDQ levels of 0.7 V, 0.75 V, 0.8 V, or 0.9 V and VDDC of 1.0 V or 1.05 V give designers room to trade power against speed.
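Per-stack capacity is the die density multiplied by the stack height, converted from gigabits to gigabytes. The sketch below applies that to the HBM4 configurations listed above; the function name is illustrative, and the combinations shown are the arithmetic possibilities rather than a list of parts any vendor has announced.

```python
def stack_capacity_gb(die_density_gbit: int, stack_height: int) -> float:
    """Stack capacity in GB: (Gb per die) x (dies per stack) / 8 bits per byte."""
    return die_density_gbit * stack_height / 8


HBM4_DIE_DENSITIES_GBIT = (24, 32)
HBM4_STACK_HEIGHTS = (4, 8, 12, 16)

for density in HBM4_DIE_DENSITIES_GBIT:
    for height in HBM4_STACK_HEIGHTS:
        capacity = stack_capacity_gb(density, height)
        print(f"{density} Gb dies, {height}-Hi: {capacity:.0f} GB per stack")
```

The same formula reproduces earlier figures in this article, for example 2 Gb dies at 4-Hi giving 1 GB for HBM1 and 24 Gb dies at 12-Hi giving 36 GB for HBM3e.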
The wider bus is partly an admission that further increases in pin speed are running into power and signal-integrity ceilings; widening the bus is more expensive in interposer area but cheaper in joules per bit. HBM4 also opens the door to a custom logic base die fabricated on a leading-edge node such as TSMC N3, which can host more sophisticated controllers, error correction, and (in some vendor proposals) compute-in-memory primitives.
SK hynix delivered the world's first 12-layer HBM4 samples in March 2025 and is targeting mass production in the second half of 2025. First HBM4-equipped accelerators are expected in 2026, with NVIDIA's Rubin generation and AMD's MI400 series among the named consumers.
Industry roadmaps from SK hynix, Samsung, and Micron point to an HBM4E refresh in the 2026-2027 window with higher pin speeds (around 10 Gbps), larger 16-Hi commercial stacks, and continued integration with custom logic base dies. Specific targets cluster around 2.5-3 TB/s and 48-64 GB per stack.
An HBM stack is a vertically integrated structure. At the bottom sits a base die (sometimes called a buffer or logic die) that handles the interface to the host, including the physical layer (PHY), arbitration, refresh control, and built-in self-test. Above the base die sit the DRAM dies, from four to sixteen in the common configurations: 4-Hi (four dies), 8-Hi (eight dies), 12-Hi (twelve dies), or 16-Hi (sixteen dies).
The dies in the stack are interconnected by through-silicon vias (TSVs), vertical conductors that run through holes etched in the silicon and filled with copper. TSVs are the enabling technology that makes HBM possible: traditional wire bonding cannot support the thousands of signals needed to expose a 1024-bit interface, but a 2D arrangement of TSVs can. Microbumps connect adjacent dies.
For HBM1 through HBM3e the per-stack interface is 1024 data bits, organised as 8 channels of 128 bits (HBM1, HBM2) or 16 channels of 64 bits with two pseudo-channels each (HBM3, HBM3e). HBM4 doubles this to 2048 data bits across 32 channels.
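The channel organisation can be written down as a small data structure to check that each layout adds up to the stated interface width. This is a sketch with hypothetical names; pseudo-channel counts are filled in only where the text above states them, and the 64-bit HBM4 channel width is derived here as 2048 / 32 and is not quoted in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StackInterface:
    generation: str
    channels: int
    channel_width_bits: int
    pseudo_channels_per_channel: Optional[int] = None  # None: not stated in the text

    @property
    def total_width_bits(self) -> int:
        """Data width of the whole stack interface."""
        return self.channels * self.channel_width_bits

    @property
    def pseudo_channel_width_bits(self) -> Optional[int]:
        """Width of one pseudo-channel, when the pseudo-channel count is known."""
        if self.pseudo_channels_per_channel is None:
            return None
        return self.channel_width_bits // self.pseudo_channels_per_channel


INTERFACES = [
    StackInterface("HBM1/HBM2", channels=8, channel_width_bits=128),
    StackInterface("HBM3/HBM3e", channels=16, channel_width_bits=64,
                   pseudo_channels_per_channel=2),
    # Derived value (assumption): 2048 bits spread evenly over 32 channels.
    StackInterface("HBM4", channels=32, channel_width_bits=2048 // 32),
]

for iface in INTERFACES:
    print(f"{iface.generation}: {iface.channels} x {iface.channel_width_bits} bits "
          f"= {iface.total_width_bits} bits")
```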
Because the interface is so wide, HBM cannot use a normal printed circuit board. The dominant approach is a passive silicon interposer (TSMC's CoWoS-S, for example), where the GPU die and the HBM stacks sit side by side on top of a large piece of silicon that carries thousands of fine-pitch traces between them. CoWoS-L uses a redistribution layer with embedded silicon bridges to cover larger areas; Intel's EMIB takes a related approach with embedded silicon bridges in an organic substrate.
A modern AI accelerator typically integrates four to eight HBM stacks around a central logic die. NVIDIA's H100 SXM has six HBM3 sites with five active (the sixth disabled for yield), the H200 uses a six-site package with all six HBM3e stacks active, and the B200 has eight HBM3e stacks across two compute chiplets. AMD's MI300X and MI325X have eight HBM stacks each.
HBM is produced by exactly three companies: SK hynix, Samsung Electronics, and Micron Technology. The market structure has been unusually concentrated and unusually dynamic over the last few years.
SK hynix has led through most of the HBM3 and HBM3e ramp. The company shipped the first 8-layer HBM3e to NVIDIA in March 2024, followed by the first 12-layer 36 GB HBM3e in volume production in September 2024. SK hynix delivered the first 12-layer HBM4 samples in March 2025.
Micron entered HBM3e late but qualified its 24 GB 8-Hi HBM3e parts at 9.6 Gbps for the H200 and B200 ramps in 2024 and is now a meaningful supplier of HBM3e and HBM4.
Samsung was historically the largest DRAM maker but stumbled on HBM3e qualification at NVIDIA through 2024, with reports citing thermal and power-consumption issues. Samsung passed NVIDIA's qualification for 12-layer HBM3e in September 2025 and began HBM3e shipments to NVIDIA in Q3 2025. In the interim, Samsung remained the dominant HBM3e supplier to Google's TPU program.
On the packaging side, the binding constraint for the AI accelerator industry has been TSMC's CoWoS capacity, the advanced packaging line that integrates HBM stacks with logic dies on a silicon interposer. TSMC CEO C.C. Wei has stated that CoWoS capacity is sold out through 2025 and into 2026. NVIDIA reportedly secures over 60% of TSMC's CoWoS allocation. TSMC plans to scale CoWoS from roughly 35,000 wafers per month at the end of 2024 to about 130,000 wafers per month by the end of 2026.
Yield is a real constraint. Stacking 12 or 16 thin DRAM dies, drilling thousands of TSVs through each, and bonding them all to a base die without a single short or open is hard. The shift to hybrid bonding, where copper-to-copper bonds replace microbumps, is one path to denser stacks at acceptable yield, and SK hynix and Samsung have both begun that transition for HBM4 and beyond.
The table below lists the major AI and HPC accelerators that use HBM, their HBM generation, and their per-device memory configuration.
| Accelerator | Vendor | HBM generation | Capacity | Bandwidth | Year |
|---|---|---|---|---|---|
| Tesla P100 | NVIDIA | HBM2 | 16 GB | 720 GB/s | 2016 |
| Tesla V100 | NVIDIA | HBM2 | 16 GB / 32 GB | 900 GB/s | 2017 |
| A100 (40 GB) | NVIDIA | HBM2 | 40 GB | 1.55 TB/s | 2020 |
| A100 (80 GB) | NVIDIA | HBM2E | 80 GB | 1.94-2.04 TB/s | 2020 |
| H100 SXM | NVIDIA | HBM3 | 80 GB | 3.35 TB/s | 2022 |
| H200 | NVIDIA | HBM3e | 141 GB | 4.8 TB/s | 2024 |
| B200 (Blackwell) | NVIDIA | HBM3e | 192 GB | 8.0 TB/s | 2024-2025 |
| GB200 NVL72 (per GPU) | NVIDIA | HBM3e | 192 GB | 8.0 TB/s | 2024-2025 |
| MI200 / MI250X | AMD | HBM2E | 128 GB | 3.2 TB/s | 2021-2022 |
| MI300A | AMD | HBM3 | 128 GB | 5.3 TB/s | 2023 |
| MI300X | AMD | HBM3 | 192 GB | 5.3 TB/s | 2023 |
| MI325X | AMD | HBM3e | 256 GB | 6.0 TB/s | 2024 |
| MI355X (MI350 series) | AMD | HBM3e | 288 GB | 8.0 TB/s | 2025 |
| TPU v4 | Google | HBM2 | 32 GB | 1.2 TB/s | 2021 |
| TPU v5p | Google | HBM2E/HBM3 | 96 GB | 2.8 TB/s | 2023 |
| TPU v7 (Ironwood) | Google | HBM3e | 192 GB | 7.37 TB/s | 2025 |
| Trainium2 | AWS | HBM3 | 96 GB | ~2.9 TB/s | 2024 |
| Gaudi 2 | Intel | HBM2E | 96 GB | 2.45 TB/s | 2022 |
| Gaudi 3 | Intel | HBM2E | 128 GB | 3.7 TB/s | 2024 |
| Ascend 910B | Huawei | HBM2E | 64 GB | ~1.6 TB/s | 2023 |
| Dojo D1 | Tesla | HBM | n/a (training tile uses local SRAM and HBM-class peripheral memory) | n/a | 2022 |
| SN40L | SambaNova | HBM3 | 64 GB on-package + 1.5 TB DDR | 1.6 TB/s HBM | 2024 |
| Xeon CPU Max | Intel | HBM2E | 64 GB | ~1.0 TB/s | 2023 |
| A64FX (Fugaku) | Fujitsu | HBM2 | 32 GB | 1.0 TB/s | 2020 |
A notable counterexample is Cerebras's Wafer-Scale Engine (WSE-2, WSE-3), which uses 44 GB of on-die SRAM rather than HBM, trading capacity for bandwidth and avoiding HBM's packaging constraints entirely. Most other large accelerator vendors have concluded that HBM is the right operating point for the foreseeable future.
| Memory type | Typical bandwidth per device | Typical capacity per device | Energy per bit | Packaging | Where it is used |
|---|---|---|---|---|---|
| HBM3e | 1.2 TB/s per stack, ~8 TB/s per GPU with 8 stacks | 24-36 GB per stack, up to 288 GB per accelerator | Lowest per bit of the types listed (rough rule of thumb: about 3-4 pJ/bit including PHY) | 2.5D silicon interposer (CoWoS, EMIB) | Datacenter AI accelerators, top-tier HPC |
| GDDR7 | ~1.5 TB/s on a 384-bit bus | 24-32 GB | Higher than HBM (around 7-8 pJ/bit) | Standard PCB, on-board chips | Consumer GPUs (RTX 5090), some workstation cards |
| GDDR6/6X | ~1 TB/s on a 384-bit bus | 16-24 GB | Higher than HBM | Standard PCB, on-board chips | Consumer GPUs (RTX 4090), inference cards (L40, L40S) |
| LPDDR5/5X | ~100-500 GB/s depending on width | 8-512 GB | Lowest absolute power, but lower bandwidth | LPCAMM or soldered on-board | Apple unified memory (M-series, Mac Studio, Mac Pro), edge AI, mobile, NVIDIA Grace and GB200 CPU complex |
| DDR5 | 50-100 GB/s per CPU socket | up to terabytes per socket via DIMMs | Moderate | DIMM modules, off-board | Server CPUs and general-purpose computing |
The shorthand that captures the trade-off runs as follows. HBM is the fastest and most energy-efficient on a per-bit basis but the most expensive and packaging-constrained. GDDR is the next step down in bandwidth but uses cheap on-board chips and standard PCBs. LPDDR is the lowest-power option and offers large capacities at moderate bandwidth, which is why Apple uses it for unified memory and why NVIDIA pairs LPDDR5X with HBM3e on the GB200 (HBM3e on the Blackwell GPUs, LPDDR5X on the Grace CPU). DDR is the workhorse of CPU memory and stays on standard DIMM form factors.
The core reason AI accelerators rely on HBM is bandwidth per package. During autoregressive decode, transformer inference reads essentially all of the model's weights from memory once per generated token, and the larger the model, the more bytes per token. At a given hardware spend, the only way to push token-generation latency down is to push memory bandwidth up. HBM's 3D-stacked architecture and 1024-bit (now 2048-bit) interface deliver several times the bandwidth per package that GDDR can manage at sane power budgets.
Capacity matters too. HBM stacks have grown from 1 GB (HBM1, 4-Hi) to 36 GB (HBM3e, 12-Hi), letting a single accelerator hold more model weights and a larger KV cache. The H200's 141 GB and the MI355X's 288 GB enable single-device serving of 70B and 200B parameter models without aggressive quantisation.
Energy per bit is the third lever. At the scale of a 100,000-GPU cluster running at 700-1400 W per accelerator, HBM's joules-per-bit advantage over GDDR translates into measurable power savings at the rack and datacenter level.
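To put a number on this, memory interface power is roughly sustained bandwidth times energy per bit. The sketch below uses the rule-of-thumb ranges from the comparison table (about 3-4 pJ/bit for HBM3e and 7-8 pJ/bit for GDDR7) and takes their midpoints as an assumption. The like-for-like comparison is hypothetical, since a GDDR-based device would not reach 8 TB/s in practice, and the function name is illustrative.

```python
def memory_io_power_watts(bandwidth_tb_s: float, pj_per_bit: float) -> float:
    """Power in watts to move data at the given sustained rate.

    bytes/s = TB/s * 1e12, bits/s = bytes/s * 8, joules/bit = pJ/bit * 1e-12.
    The 1e12 and 1e-12 factors cancel, leaving TB/s * 8 * pJ/bit.
    """
    return bandwidth_tb_s * 8.0 * pj_per_bit


BANDWIDTH_TB_S = 8.0     # B200-class figure quoted in this article
HBM_PJ_PER_BIT = 3.5     # assumed midpoint of the 3-4 pJ/bit rule of thumb
GDDR_PJ_PER_BIT = 7.5    # assumed midpoint of the 7-8 pJ/bit rule of thumb

hbm_watts = memory_io_power_watts(BANDWIDTH_TB_S, HBM_PJ_PER_BIT)
gddr_watts = memory_io_power_watts(BANDWIDTH_TB_S, GDDR_PJ_PER_BIT)
print(f"HBM-class:  {hbm_watts:.0f} W at {BANDWIDTH_TB_S} TB/s sustained")
print(f"GDDR-class: {gddr_watts:.0f} W at {BANDWIDTH_TB_S} TB/s sustained")
print(f"Difference: {gddr_watts - hbm_watts:.0f} W per accelerator")
```

These are peak-traffic figures; real workloads rarely sustain full bandwidth continuously, so the average saving per accelerator is smaller than the difference printed here.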
The most-cited drawback is cost. HBM3e parts cost roughly $7-10 per GB at wholesale according to industry analyst estimates, compared to perhaps $2-3 per GB for GDDR6 and well under $1 per GB for commodity DDR5. A B200 GPU with eight 24 GB HBM3e stacks is reported to carry roughly $2,400 of memory cost, more than the logic die itself. HBM is now estimated to account for 30-40% of the manufacturing cost of a flagship AI accelerator.
Advanced packaging is the second drawback. HBM cannot be soldered onto a normal PCB and requires CoWoS, EMIB, or a similar 2.5D substrate. CoWoS capacity at TSMC has been the binding constraint on AI accelerator volumes through 2024 and 2025.
Thermal management is the third issue. Stacking 12 or 16 DRAM dies creates a vertical column with non-trivial thermal resistance, and the dies at the top of the stack run hotter than those at the bottom. The base die also acts as a thermal floor between the DRAM and the logic die, which sits on the same interposer and runs at hundreds of watts.
Finally, IO area on the logic die is a hard constraint. The PHY for an HBM3e stack consumes a non-trivial fraction of the GPU's perimeter, and the number of stacks that fit around a single die is bounded by the reticle limit and bump pitch. NVIDIA's B200 splits the GPU into two reticle-limited dies on a CoWoS-L substrate partly to expose enough perimeter for eight HBM3e stacks.
The HBM market has been one of the most-watched corners of the semiconductor industry through the LLM boom. SK hynix's lead in HBM3 and HBM3e gave it the dominant supply position into NVIDIA's H100, H200, and B200 ramps. TrendForce projected that HBM would account for over 20% of total DRAM market value in 2024 and over 30% in 2025, despite representing under 10% of bit volume.
Micron's HBM3e qualification in 2024 added a meaningful second supplier to NVIDIA. Samsung's qualification stumbles in 2024 cost the company a generation of HBM3e supply at NVIDIA, but Samsung remained the dominant HBM supplier to Google's TPU program (estimates suggest over 60% of TPU HBM3e share) and recovered HBM3e qualification at NVIDIA in September 2025. Reports in late 2025 indicated all three suppliers had sold out 2026 HBM production. The Korean government has formally designated HBM as a strategic technology.
Three threads dominate the HBM roadmap from 2024 through 2027.
First, HBM3e capacity continues to ramp. 8-Hi 24 GB stacks at 9.6 Gbps are in volume production at all three vendors; 12-Hi 36 GB stacks went into volume at SK hynix in September 2024, with Micron and Samsung close behind. 16-Hi HBM3e qualification is targeted for 2025.
Second, HBM4 is now standardised (JESD270-4, April 2025) and approaching commercial production. The doubled 2048-bit interface and the option of a custom logic base die fabricated on a leading-edge process are the two structural changes that distinguish HBM4 from a routine speed bump. SK hynix delivered the first 12-Hi HBM4 samples in March 2025 and is targeting 2H 2025 mass production. NVIDIA's Rubin generation and AMD's MI400 series are widely expected to be HBM4-based, shipping in 2026.
Third, vendors are experimenting with processing-in-memory (PIM). Samsung's HBM2-PIM (Aquabolt-XL), introduced in 2021 at Hot Chips 33, embeds 16-wide SIMD engines inside the DRAM banks, claiming around 2x system-level performance and 60-70% energy reduction on memory-bound workloads. SK hynix and Samsung are cooperating on an LPDDR6-PIM standard at JEDEC. None of these PIM variants has displaced standard HBM in mainstream accelerators yet.
The other structural shift is hybrid bonding, where adjacent dies in the stack are joined by direct copper-to-copper bonds rather than microbumps. Hybrid bonding allows finer pitches, lower thermal resistance, and taller stacks at acceptable yield, and is on the roadmap for 16-Hi HBM and beyond. Custom HBM, where a customer specifies a non-JEDEC-standard base die optimised for its accelerator, is also emerging.