Systolic array
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 3,778 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 3,778 words
Add missing citations, update stale details, or suggest a clearer explanation.
A systolic array is a parallel computer architecture composed of a homogeneous network of tightly coupled processing elements (PEs) arranged in a regular grid, typically one-dimensional, two-dimensional, or hexagonal. Data flows rhythmically through the array under the control of a global clock, with each PE consuming an operand, performing a small computation (most often a multiply-accumulate), and forwarding intermediate results to its neighbours on every cycle. The metaphor that gives the architecture its name comes from the cardiovascular system: like blood pumped from the heart through arteries and capillaries, data is pumped from memory through the PE array, with each cell processing values as they pass by (Kung, 1982).
Systolic arrays were introduced in 1978 by H. T. Kung and Charles Leiserson at Carnegie Mellon University, in a technical report titled "Systolic Arrays (for VLSI)," which appeared in Sparse Matrix Proceedings 1978 (Academic Press, 1979). The concept was popularised in Kung's 1982 IEEE Computer article "Why Systolic Architectures?" and reproduced in Mead and Conway's canonical textbook Introduction to VLSI Systems (1980). Today the systolic array is the central compute fabric inside most modern AI accelerators, including Google's Tensor Processing Unit (TPU), the matrix engines in AWS Trainium and Inferentia, the matrix-multiply units inside Tesla's Dojo D1, and, in a looser sense, NVIDIA Tensor Cores and AMD Matrix Cores. Because deep learning training and inference are dominated by dense matrix multiplications, the systolic array has become arguably the most economically important parallel architecture of the last decade.
In the late 1970s, the cost of designing custom integrated circuits was dropping while the number of transistors that fit on a single die kept rising. Mead and Conway (1980) had argued that VLSI would soon let designers place tens of thousands of gates on a chip, but that the bottleneck would be communication rather than computation: long wires were slow, expensive, and hard to lay out. Kung and Leiserson's 1978 report addressed exactly this problem. They proposed building parallel processors as regular grids of identical cells, with all communication restricted to nearest neighbours. Such a design would scale gracefully in VLSI because long wires were unnecessary, control was uniform, and the floorplan was repetitive. The original report described systolic structures for a wide range of dense linear-algebra operations, including matrix-vector and matrix-matrix multiplication, LU decomposition, banded triangular system solving, convolution, and polynomial evaluation (Kung and Leiserson, 1978).
The term "systolic" was coined by Kung as an analogy to the human heart. In the 1982 IEEE Computer paper, he describes the system as one in which "data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory," likening it to an automobile assembly line: many cars are worked on simultaneously, each at a different station, and the throughput is set by the rate at which they move down the line, not by the time spent at any one station (Kung, 1982). The same paper laid out the design principles that still apply: simple, regular cells; local communication; balanced compute and I/O; high concurrency; and minimal global control.
The earliest hardware to be designed explicitly as a systolic array was Carnegie Mellon's Warp project, which produced a wire-wrapped two-cell prototype in 1985, a printed-circuit version delivered by GE in 1987 (around USD 350,000 per machine), and finally an integrated-circuit version called iWarp built jointly with Intel in the late 1980s. Each iWarp chip carried roughly 700,000 transistors in a 0.9 micron CMOS process and ran at 20 MHz; a typical iWarp system was an 8x8 torus of 64 processors delivering about 1.2 gigaflops peak (Borkar et al., 1988; Annaratone et al., 1987). Around 39 iWarp machines were sold by Intel in 1992 and 1993.
The defining features of a systolic array are easy to state and turn out to be enormously powerful in practice.
Maximum data reuse. Each operand is fetched once from memory and then reused as it ripples through the array. In a 256x256 multiply-accumulate array, a single weight loaded into one PE can contribute to 256 partial sums without any further main-memory traffic. This is critical because, in modern silicon, moving a 32-bit word across a chip costs more energy than the multiply-accumulate that consumes it (Horowitz, 2014).
Regular, local interconnect. PEs talk only to their immediate neighbours. This gives a regular floorplan, simplifies routing, removes long wires from the critical path, and keeps clock distribution tractable. A systolic array is one of the few architectures whose physical implementation matches its logical organisation almost exactly.
Massive parallelism. A single TPU v1 die holds 65,536 8-bit multipliers (a 256x256 grid) operating in lockstep at 700 MHz, giving 92 TOPS peak throughput in a 28 nm process and 28 to 40 W of TDP (Jouppi et al., 2017). Larger and more recent arrays scale this idea by an order of magnitude.
Predictable, pipelined throughput. Once the array is filled, one set of inputs is consumed and one set of outputs is emitted on every cycle. Latency for an N x N matrix multiply on an N x N output-stationary array is O(N) cycles to fill plus O(N) cycles to drain, but throughput is one MAC per PE per cycle, so the total work, N^3 multiplications, is completed in roughly 3N cycles rather than the N^3 cycles a serial machine would need.
Energy efficiency. Because the array minimises off-chip memory accesses and avoids the speculation, branch prediction, and out-of-order machinery of a general-purpose CPU, energy per operation is much lower. Jouppi et al. (2017) reported that the TPU v1 delivered 30 to 80 times higher performance per watt than the contemporary Haswell CPUs and Kepler GPUs on neural-network inference workloads.
Matrix multiplication is the textbook systolic problem. Consider computing C = A * B, where A is M x K, B is K x N, and C is M x N. A weight-stationary systolic array of size M x N proceeds as follows. Each PE (i, j) holds one accumulator for C[i, j] and is loaded with a column of B (or a row, depending on the dataflow). Activations from A are streamed in from the left edge of the array. On each cycle, each PE multiplies its current input by its stored weight, adds the product to its accumulator, and passes the input to its right neighbour. After K cycles, every PE has accumulated the dot product that defines its element of C. The whole multiplication takes K + M + N - 2 cycles in the simplest schedule, with throughput limited only by clock rate and array size.
When the problem is larger than the array, the matrices are tiled and each tile is streamed through the same hardware. This lets a 256x256 array compute matrix multiplications of any size, paying a small overhead in fill and drain cycles. Convolutions are usually mapped to matrix multiplications via the im2col transform (Chellapilla, Puri, and Simard, 2006), which is why the same hardware that does GEMM well also runs convolutional neural networks well.
A "dataflow" describes which operands are stationary in the array and which ones move. The classic taxonomy was formalised by Chen, Krishna, Emer, and Sze in the Eyeriss paper (ISCA 2016), which also introduced the row-stationary scheme. Different dataflows trade off how often each tensor is reloaded from off-chip memory, which directly affects energy.
| Dataflow | What stays | What flows | Typical use |
|---|---|---|---|
| Output-stationary (OS) | Partial sums in PEs | Weights and activations | Classical Kung-Leiserson designs; balanced reuse |
| Weight-stationary (WS) | Weights in PEs | Activations and partial sums | Inference (e.g. TPU v1, AWS Inferentia); weights large and reused |
| Input-stationary (IS) | Activations in PEs | Weights and partial sums | Some training scenarios |
| Row-stationary (RS) | Convolution row in PE | Activations, partials, filters across PEs | Eyeriss; minimises total data movement on CNNs |
| No local reuse (NLR) | Nothing | Everything | Reference baseline; energy-inefficient |
Chen et al. (2016) showed that on AlexNet the row-stationary dataflow used 1.4x to 2.5x less energy than weight-stationary or output-stationary in convolutional layers, because it simultaneously reuses filter weights, input activations, and partial sums.
The table below collects the most influential systolic-array-based machines and accelerators. Sizes refer to the on-chip MAC array, not to multi-chip pods.
| System | Year | Organisation | Array size | Numerics | Peak throughput | Notes |
|---|---|---|---|---|---|---|
| Warp / PC-Warp | 1985 to 1987 | CMU + GE | 10-cell linear | 32-bit float | 100 MFLOPS | First explicit systolic-array machine; sold for ~USD 350,000 |
| iWarp | 1990 to 1993 | CMU + Intel | 8x8 torus typical | 32-bit float | 1.2 GFLOPS | About 39 systems sold; LIW microarchitecture |
| Google TPU v1 | 2015, paper 2017 | 256x256 | INT8 MAC, INT32 accumulate | 92 TOPS | First commercial deployment of a large systolic array; 700 MHz, 28 nm, 28 to 40 W | |
| TPU v2 | 2017 | 128x128 per MXU | bfloat16 multiply, FP32 accumulate | 45 TFLOPS / chip | Two cores per chip; trainable workloads via bfloat16 | |
| TPU v3 | 2018 | 128x128 per MXU (2 per core) | bfloat16, FP32 acc | 123 TFLOPS / chip | Liquid-cooled; doubled MXUs per core | |
| TPU v4 | 2021 | 128x128 per MXU (4 per core) | bfloat16, FP32 acc | 275 TFLOPS / chip | Optical reconfigurable interconnect (Jouppi et al., 2023) | |
| TPU v5p / v6e Trillium | 2023 to 2024 | 128x128 (v5p), 256x256 (v6e) | bfloat16, INT8 | 459 TFLOPS (v5p), ~926 TFLOPS (v6e) | v6e returned to a 256x256 MXU for 4x FLOPs per cycle | |
| Tesla Dojo D1 | 2021 | Tesla | 354 nodes per die, each with a matrix multiplier | BF16 / CFP8 / FP32 | 362 BF16 TFLOPS / die | 7 nm TSMC, 50 billion transistors, 645 mm^2; 18x20 array of nodes |
| Cerebras WSE-2 | 2021 | Cerebras Systems | 850,000 cores in a 2D mesh | FP16 dense / sparse | 7.5 PFLOPS dense FP16 | Wafer-scale; not a classical systolic array but uses the same dataflow principles |
| AWS Inferentia / Trainium | 2019 / 2020 | AWS | 128x128 per NeuronCore Tensor Engine | cFP8 / FP16 / BF16 / TF32 / FP32 | 100+ TFLOPS BF16 / TensorEngine | Each NeuronCore-v2 has tensor, vector, scalar, and GPSIMD engines |
| Groq LPU (TSP) | 2020 to 2024 | Groq | Functionally sliced spatial array | INT8 / FP16 | ~750 TOPS (LPU v1) | Tensor streaming processor; deterministic compiler-controlled execution |
| Sambanova RDU | 2020+ | Sambanova | Reconfigurable dataflow array | BF16 / FP32 | Varies | Coarse-grained reconfigurable array (CGRA), spiritual successor to Warp |
| Intel Habana Gaudi / Gaudi 2 | 2019 / 2022 | Intel | Multiple matrix engines (TPCs + MMEs) | BF16 / FP16 | 432 BF16 TFLOPS (Gaudi 2) | MME is a configurable systolic engine |
| Xilinx Versal AI Engine | 2019 | Xilinx (now AMD) | Array of 400 VLIW SIMD cores | INT8 / BF16 / FP32 | ~133 TOPS INT8 | Adaptive compute acceleration platform; spatial dataflow array on FPGA fabric |
| UC Berkeley Gemmini | 2019 | Berkeley | Configurable, e.g. 16x16 INT8 | INT8 / BF16 / FP32 | Configurable | Open-source RISC-V systolic accelerator generator |
The TPU v1 was, by a large margin, the highest-impact systolic deployment. According to Jouppi et al. (2017), it had been running at production scale in Google data centres since 2015, accelerating workloads such as Search, Translate, Photos, and AlphaGo. Its 256x256 MAC array used 8-bit weights and 8-bit activations with 32-bit accumulators, and it was driven by a CISC-like instruction set with about a dozen high-level instructions, the most important of which was MatrixMultiply (Jouppi et al., 2018).
The TPU v2 onwards switched the matrix unit to bfloat16 multiply with FP32 accumulate to support training, and split the single 256x256 MXU into smaller 128x128 MXUs with multiple per core, which allowed better utilisation when batch sizes were small. TPU v4 (Jouppi et al., 2023) introduced an optical reconfigurable interconnect for TPU pods and added the SparseCore for embedding lookups. With TPU v6e (Trillium, 2024), Google returned to a 256x256 MXU for 4x more FLOPs per cycle. Cloud TPU instances expose this hardware to external customers; a TPU slice or TPU pod bundles many TPU chips over a high-bandwidth fabric.
Tesla's Dojo D1, announced at AI Day 2021, is a 7 nm TSMC die with 50 billion transistors in 645 mm^2. It packs 354 training nodes in an 18x20 grid (a few are reserved for fault tolerance), each node containing a 64-bit superscalar CPU plus a dedicated matrix multiplier and vector unit; the entire die hits 362 BF16 TFLOPS at 2 GHz with 440 MB of distributed SRAM and 8 TB/s of off-die bandwidth across 576 SerDes channels (Bannon et al., 2022; Talpes et al., 2023).
NVIDIA Tensor Cores, introduced with the Volta V100 GPU in 2017, perform a 4x4x4 matrix multiply-accumulate per Tensor Core per cycle, with 8 Tensor Cores per Streaming Multiprocessor and 80 SMs on the V100 die for a total of 125 TFLOPS in mixed-precision FP16 multiply with FP32 accumulate. Whether Tensor Cores are "truly" systolic arrays is a matter of definition; microbenchmarks suggest they are implemented as small spatial arrays of FMA units that share the systolic spirit (lockstep, fixed dataflow, local reuse) without exposing it as cleanly as a TPU MXU. Hopper (H100) and Blackwell (B100/B200) extended the same idea to larger sub-tiles and lower precisions (FP8, FP6, FP4), and AMD's Matrix Cores in CDNA 3 follow the same pattern.
| Property | Strength | Limitation |
|---|---|---|
| Throughput per area | Very high; an N x N array does N^2 MACs per cycle | Underused on small or oddly shaped problems |
| Energy per operation | Minimised by local reuse and lack of speculation | Sparse / irregular data wastes most of the array |
| Compiler model | Fixed, predictable dataflow; matmul scheduling is straightforward | Inflexible for control-heavy or branching workloads |
| Physical design | Regular floorplan; nearest-neighbour wires | Hard to clock at extreme frequencies due to skew across large arrays |
| Memory pressure | On-chip SRAM and HBM can feed the array | Bandwidth wall: if HBM cannot keep up, the array stalls |
| Determinism | Cycle-accurate timing; useful for inference SLAs | Hostile to dynamic shapes, conditional execution |
The inflexibility limitation is the most consequential one. A 256x256 array configured for INT8 GEMM is exquisite at INT8 GEMM and bad at almost everything else. For workloads that are not dense matrix multiplication, such as pointer chasing, embedding-table lookups, or graph traversal, a systolic array contributes very little. This is part of why modern accelerators (TPU, Trainium, Dojo) bolt the systolic core onto a vector unit, a scalar unit, and special-purpose engines for embeddings (such as TPU v4's SparseCore). Sparsity is another open problem: a CNN with 80 percent zero weights still needs 80 percent of its PEs to multiply zero by something unless the array supports zero-skipping, which is non-trivial to add without breaking the regular dataflow.
Research on systolic arrays did not stop in 1990. Active areas include:
Sparse systolic arrays. Schemes such as Sparse-TPU (Sze et al.) and NVIDIA's structured 2:4 sparsity in Ampere and Hopper add a small amount of indexing logic to skip zero operands, recovering roughly 2x throughput on weight-pruned networks.
Coarse-grained reconfigurable arrays (CGRAs). These keep the regular grid of PEs but make each PE programmable, so that the same fabric can implement different dataflows. Sambanova RDU and the Cerebras CS-2 are commercial variants; academic CGRAs go back to RaPiD, PipeRench, and TRIPS.
Compute-in-memory. Mythic, IBM's analogue AI chips (NorthPole and follow-ons), and several startups perform multiply-accumulate inside SRAM or analogue crossbar arrays, treating the memory itself as a systolic substrate. Energy per MAC can drop by an order of magnitude at the cost of accuracy.
Photonic systolic arrays. Lightmatter and Lightelligence build matrix-multiply units out of optical interferometers, where the array is a literal grid of waveguides. Light moves through the array in nanoseconds, but encoding and decoding the optical signals is the dominant cost.
FPGA-hosted systolic accelerators. Long before TPUs, Xilinx and Altera FPGAs were used to host systolic accelerators for radar, finance, and medical imaging. Today, AI accelerators such as the Xilinx Versal AI Engine series provide a tiled array of VLIW SIMD cores arranged as a coarse systolic fabric on the same die as programmable logic.
Deep learning models are dominated by general matrix multiplications. A typical LLM forward pass spends 80 percent or more of its FLOPs in the matrix multiplications of the attention and feed-forward layers; for convolutional vision models the figure is similar after im2col. Because systolic arrays do exactly this operation efficiently, they have become the central design pattern for inference and training silicon. Quantised inference, in particular INT8 or INT4, benefits disproportionately because more multipliers fit in the same silicon area.
The arc from "systolic array as a research idea (1978)" to "systolic array as the engine of the AI revolution (2017 onwards)" took almost forty years. The bridge was Google's decision to deploy the TPU v1 at production scale in 2015. Once that demonstrated 30x performance per watt over CPUs and GPUs on data-centre inference, every major chip company added a systolic-style matrix engine to its roadmap, and the rest of the industry followed.
It is worth being honest about what this means. The systolic array is not a magic architecture for general computing. It is a very good architecture for a very specific kind of computation, and that kind of computation happens to be the rate-limiting step of modern neural networks. If transformer-style models had not turned out to scale the way they did, the systolic renaissance probably would not have happened. But they did, and so chips like the TPU and Trainium, and benchmarks like MLPerf that measure them, now organise a large fraction of global compute spend.
"Systolic" remains a standard term in computer architecture textbooks, including Patterson and Hennessy's Computer Architecture: A Quantitative Approach (sixth edition, chapter 7 on data-flow architectures and warehouse-scale computing). H. T. Kung went on to a prolific career at CMU and Harvard, working on networking, parallel algorithms, and recently on deep learning compilers. Charles Leiserson is best known today as a co-author of the Introduction to Algorithms textbook (CLRS) and as a creator of the Cilk parallel programming language at MIT. The systolic array sits in their joint legacy as a piece of theory that turned out to be exactly the right idea, just thirty years early.
The history is also a useful lesson in how research ideas mature in hardware. Kung and Leiserson's 1978 report described a class of architectures that VLSI would eventually make practical; Mead and Conway's 1980 textbook taught a generation of students how to build them; the Warp and iWarp projects of the 1980s prototyped real machines and discovered the engineering problems; FPGA-based systolic accelerators in the 1990s and 2000s kept the technique alive in industry; and the 2010s rise of deep learning provided a workload that finally made the architecture economically dominant. Each step looked modest at the time. The cumulative effect is that almost every commercial AI chip in 2026 traces a direct architectural line back to a Carnegie Mellon technical report from 1978.