Systolic array

A systolic array is a parallel computer architecture composed of a homogeneous network of tightly coupled processing elements (PEs) arranged in a regular grid, typically one-dimensional, two-dimensional, or hexagonal. Data flows rhythmically through the array under the control of a global clock, with each PE consuming an operand, performing a small computation (most often a multiply-accumulate), and forwarding intermediate results to its neighbours on every cycle. The metaphor that gives the architecture its name comes from the cardiovascular system: like blood pumped from the heart through arteries and capillaries, data is pumped from memory through the PE array, with each cell processing values as they pass by (Kung, 1982).

Systolic arrays were introduced in 1978 by H. T. Kung and Charles Leiserson at Carnegie Mellon University, in a technical report titled "Systolic Arrays (for VLSI)," which appeared in Sparse Matrix Proceedings 1978 (Academic Press, 1979). The concept was popularised in Kung's 1982 IEEE Computer article "Why Systolic Architectures?" and reproduced in Mead and Conway's canonical textbook Introduction to VLSI Systems (1980). Today the systolic array is the central compute fabric inside most modern AI accelerators, including Google's Tensor Processing Unit (TPU), the matrix engines in AWS Trainium and Inferentia, the matrix-multiply units inside Tesla's Dojo D1, and, in a looser sense, NVIDIA Tensor Cores and AMD Matrix Cores. Because deep learning training and inference are dominated by dense matrix multiplications, the systolic array has become arguably the most economically important parallel architecture of the last decade.

Origin and naming

In the late 1970s, the cost of designing custom integrated circuits was dropping while the number of transistors that fit on a single die kept rising. Mead and Conway (1980) had argued that VLSI would soon let designers place tens of thousands of gates on a chip, but that the bottleneck would be communication rather than computation: long wires were slow, expensive, and hard to lay out. Kung and Leiserson's 1978 report addressed exactly this problem. They proposed building parallel processors as regular grids of identical cells, with all communication restricted to nearest neighbours. Such a design would scale gracefully in VLSI because long wires were unnecessary, control was uniform, and the floorplan was repetitive. The original report described systolic structures for a wide range of dense linear-algebra operations, including matrix-vector and matrix-matrix multiplication, LU decomposition, banded triangular system solving, convolution, and polynomial evaluation (Kung and Leiserson, 1978).

The term "systolic" was coined by Kung as an analogy to the human heart. In the 1982 IEEE Computer paper, he describes the system as one in which "data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory," likening it to an automobile assembly line: many cars are worked on simultaneously, each at a different station, and the throughput is set by the rate at which they move down the line, not by the time spent at any one station (Kung, 1982). The same paper laid out the design principles that still apply: simple, regular cells; local communication; balanced compute and I/O; high concurrency; and minimal global control.

The earliest hardware to be designed explicitly as a systolic array was Carnegie Mellon's Warp project, which produced a wire-wrapped two-cell prototype in 1985, a printed-circuit version delivered by GE in 1987 (around USD 350,000 per machine), and finally an integrated-circuit version called iWarp built jointly with Intel in the late 1980s. Each iWarp chip carried roughly 700,000 transistors in a 0.9 micron CMOS process and ran at 20 MHz; a typical iWarp system was an 8x8 torus of 64 processors delivering about 1.2 gigaflops peak (Borkar et al., 1988; Annaratone et al., 1987). Around 39 iWarp machines were sold by Intel in 1992 and 1993.

Architectural principles

The defining features of a systolic array are easy to state and turn out to be enormously powerful in practice.

Maximum data reuse. Each operand is fetched once from memory and then reused as it ripples through the array. In a 256x256 multiply-accumulate array, a single weight loaded into one PE can contribute to 256 partial sums without any further main-memory traffic. This is critical because, in modern silicon, moving a 32-bit word across a chip costs more energy than the multiply-accumulate that consumes it (Horowitz, 2014).

Regular, local interconnect. PEs talk only to their immediate neighbours. This gives a regular floorplan, simplifies routing, removes long wires from the critical path, and keeps clock distribution tractable. A systolic array is one of the few architectures whose physical implementation matches its logical organisation almost exactly.

Massive parallelism. A single TPU v1 die holds 65,536 8-bit multipliers (a 256x256 grid) operating in lockstep at 700 MHz, giving 92 TOPS peak throughput in a 28 nm process and 28 to 40 W of TDP (Jouppi et al., 2017). Larger and more recent arrays scale this idea by an order of magnitude.

Predictable, pipelined throughput. Once the array is filled, one set of inputs is consumed and one set of outputs is emitted on every cycle. Latency for an N x N matrix multiply on an N x N output-stationary array is O(N) cycles to fill plus O(N) cycles to drain, but throughput is one MAC per PE per cycle, so the total work, N^3 multiplications, is completed in roughly 3N cycles rather than the N^3 cycles a serial machine would need.

Energy efficiency. Because the array minimises off-chip memory accesses and avoids the speculation, branch prediction, and out-of-order machinery of a general-purpose CPU, energy per operation is much lower. Jouppi et al. (2017) reported that the TPU v1 delivered 30 to 80 times higher performance per watt than the contemporary Haswell CPUs and Kepler GPUs on neural-network inference workloads.

Canonical example: matrix multiplication

Matrix multiplication is the textbook systolic problem. Consider computing C = A * B, where A is M x K, B is K x N, and C is M x N. A weight-stationary systolic array of size M x N proceeds as follows. Each PE (i, j) holds one accumulator for C[i, j] and is loaded with a column of B (or a row, depending on the dataflow). Activations from A are streamed in from the left edge of the array. On each cycle, each PE multiplies its current input by its stored weight, adds the product to its accumulator, and passes the input to its right neighbour. After K cycles, every PE has accumulated the dot product that defines its element of C. The whole multiplication takes K + M + N - 2 cycles in the simplest schedule, with throughput limited only by clock rate and array size.

When the problem is larger than the array, the matrices are tiled and each tile is streamed through the same hardware. This lets a 256x256 array compute matrix multiplications of any size, paying a small overhead in fill and drain cycles. Convolutions are usually mapped to matrix multiplications via the im2col transform (Chellapilla, Puri, and Simard, 2006), which is why the same hardware that does GEMM well also runs convolutional neural networks well.

Dataflow taxonomies

A "dataflow" describes which operands are stationary in the array and which ones move. The classic taxonomy was formalised by Chen, Krishna, Emer, and Sze in the Eyeriss paper (ISCA 2016), which also introduced the row-stationary scheme. Different dataflows trade off how often each tensor is reloaded from off-chip memory, which directly affects energy.

Dataflow	What stays	What flows	Typical use
Output-stationary (OS)	Partial sums in PEs	Weights and activations	Classical Kung-Leiserson designs; balanced reuse
Weight-stationary (WS)	Weights in PEs	Activations and partial sums	Inference (e.g. TPU v1, AWS Inferentia); weights large and reused
Input-stationary (IS)	Activations in PEs	Weights and partial sums	Some training scenarios
Row-stationary (RS)	Convolution row in PE	Activations, partials, filters across PEs	Eyeriss; minimises total data movement on CNNs
No local reuse (NLR)	Nothing	Everything	Reference baseline; energy-inefficient

Chen et al. (2016) showed that on AlexNet the row-stationary dataflow used 1.4x to 2.5x less energy than weight-stationary or output-stationary in convolutional layers, because it simultaneously reuses filter weights, input activations, and partial sums.

Major implementations

The table below collects the most influential systolic-array-based machines and accelerators. Sizes refer to the on-chip MAC array, not to multi-chip pods.

System	Year	Organisation	Array size	Numerics	Peak throughput	Notes
Warp / PC-Warp	1985 to 1987	CMU + GE	10-cell linear	32-bit float	100 MFLOPS	First explicit systolic-array machine; sold for ~USD 350,000
iWarp	1990 to 1993	CMU + Intel	8x8 torus typical	32-bit float	1.2 GFLOPS	About 39 systems sold; LIW microarchitecture
Google TPU v1	2015, paper 2017	Google	256x256	INT8 MAC, INT32 accumulate	92 TOPS	First commercial deployment of a large systolic array; 700 MHz, 28 nm, 28 to 40 W
TPU v2	2017	Google	128x128 per MXU	bfloat16 multiply, FP32 accumulate	45 TFLOPS / chip	Two cores per chip; trainable workloads via bfloat16
TPU v3	2018	Google	128x128 per MXU (2 per core)	bfloat16, FP32 acc	123 TFLOPS / chip	Liquid-cooled; doubled MXUs per core
TPU v4	2021	Google	128x128 per MXU (4 per core)	bfloat16, FP32 acc	275 TFLOPS / chip	Optical reconfigurable interconnect (Jouppi et al., 2023)
TPU v5p / v6e Trillium	2023 to 2024	Google	128x128 (v5p), 256x256 (v6e)	bfloat16, INT8	459 TFLOPS (v5p), ~926 TFLOPS (v6e)	v6e returned to a 256x256 MXU for 4x FLOPs per cycle
Tesla Dojo D1	2021	Tesla	354 nodes per die, each with a matrix multiplier	BF16 / CFP8 / FP32	362 BF16 TFLOPS / die	7 nm TSMC, 50 billion transistors, 645 mm^2; 18x20 array of nodes
Cerebras WSE-2	2021	Cerebras Systems	850,000 cores in a 2D mesh	FP16 dense / sparse	7.5 PFLOPS dense FP16	Wafer-scale; not a classical systolic array but uses the same dataflow principles
AWS Inferentia / Trainium	2019 / 2020	AWS	128x128 per NeuronCore Tensor Engine	cFP8 / FP16 / BF16 / TF32 / FP32	100+ TFLOPS BF16 / TensorEngine	Each NeuronCore-v2 has tensor, vector, scalar, and GPSIMD engines
Groq LPU (TSP)	2020 to 2024	Groq	Functionally sliced spatial array	INT8 / FP16	~750 TOPS (LPU v1)	Tensor streaming processor; deterministic compiler-controlled execution
Sambanova RDU	2020+	Sambanova	Reconfigurable dataflow array	BF16 / FP32	Varies	Coarse-grained reconfigurable array (CGRA), spiritual successor to Warp
Intel Habana Gaudi / Gaudi 2	2019 / 2022	Intel	Multiple matrix engines (TPCs + MMEs)	BF16 / FP16	432 BF16 TFLOPS (Gaudi 2)	MME is a configurable systolic engine
Xilinx Versal AI Engine	2019	Xilinx (now AMD)	Array of 400 VLIW SIMD cores	INT8 / BF16 / FP32	~133 TOPS INT8	Adaptive compute acceleration platform; spatial dataflow array on FPGA fabric
UC Berkeley Gemmini	2019	Berkeley	Configurable, e.g. 16x16 INT8	INT8 / BF16 / FP32	Configurable	Open-source RISC-V systolic accelerator generator

The TPU v1 was, by a large margin, the highest-impact systolic deployment. According to Jouppi et al. (2017), it had been running at production scale in Google data centres since 2015, accelerating workloads such as Search, Translate, Photos, and AlphaGo. Its 256x256 MAC array used 8-bit weights and 8-bit activations with 32-bit accumulators, and it was driven by a CISC-like instruction set with about a dozen high-level instructions, the most important of which was MatrixMultiply (Jouppi et al., 2018).

The TPU v2 onwards switched the matrix unit to bfloat16 multiply with FP32 accumulate to support training, and split the single 256x256 MXU into smaller 128x128 MXUs with multiple per core, which allowed better utilisation when batch sizes were small. TPU v4 (Jouppi et al., 2023) introduced an optical reconfigurable interconnect for TPU pods and added the SparseCore for embedding lookups. With TPU v6e (Trillium, 2024), Google returned to a 256x256 MXU for 4x more FLOPs per cycle. Cloud TPU instances expose this hardware to external customers; a TPU slice or TPU pod bundles many TPU chips over a high-bandwidth fabric.

Tesla's Dojo D1, announced at AI Day 2021, is a 7 nm TSMC die with 50 billion transistors in 645 mm^2. It packs 354 training nodes in an 18x20 grid (a few are reserved for fault tolerance), each node containing a 64-bit superscalar CPU plus a dedicated matrix multiplier and vector unit; the entire die hits 362 BF16 TFLOPS at 2 GHz with 440 MB of distributed SRAM and 8 TB/s of off-die bandwidth across 576 SerDes channels (Bannon et al., 2022; Talpes et al., 2023).

NVIDIA Tensor Cores, introduced with the Volta V100 GPU in 2017, perform a 4x4x4 matrix multiply-accumulate per Tensor Core per cycle, with 8 Tensor Cores per Streaming Multiprocessor and 80 SMs on the V100 die for a total of 125 TFLOPS in mixed-precision FP16 multiply with FP32 accumulate. Whether Tensor Cores are "truly" systolic arrays is a matter of definition; microbenchmarks suggest they are implemented as small spatial arrays of FMA units that share the systolic spirit (lockstep, fixed dataflow, local reuse) without exposing it as cleanly as a TPU MXU. Hopper (H100) and Blackwell (B100/B200) extended the same idea to larger sub-tiles and lower precisions (FP8, FP6, FP4), and AMD's Matrix Cores in CDNA 3 follow the same pattern.

Strengths and limitations

Property	Strength	Limitation
Throughput per area	Very high; an N x N array does N^2 MACs per cycle	Underused on small or oddly shaped problems
Energy per operation	Minimised by local reuse and lack of speculation	Sparse / irregular data wastes most of the array
Compiler model	Fixed, predictable dataflow; matmul scheduling is straightforward	Inflexible for control-heavy or branching workloads
Physical design	Regular floorplan; nearest-neighbour wires	Hard to clock at extreme frequencies due to skew across large arrays
Memory pressure	On-chip SRAM and HBM can feed the array	Bandwidth wall: if HBM cannot keep up, the array stalls
Determinism	Cycle-accurate timing; useful for inference SLAs	Hostile to dynamic shapes, conditional execution

The inflexibility limitation is the most consequential one. A 256x256 array configured for INT8 GEMM is exquisite at INT8 GEMM and bad at almost everything else. For workloads that are not dense matrix multiplication, such as pointer chasing, embedding-table lookups, or graph traversal, a systolic array contributes very little. This is part of why modern accelerators (TPU, Trainium, Dojo) bolt the systolic core onto a vector unit, a scalar unit, and special-purpose engines for embeddings (such as TPU v4's SparseCore). Sparsity is another open problem: a CNN with 80 percent zero weights still needs 80 percent of its PEs to multiply zero by something unless the array supports zero-skipping, which is non-trivial to add without breaking the regular dataflow.

Modern variants and research directions

Research on systolic arrays did not stop in 1990. Active areas include:

Sparse systolic arrays. Schemes such as Sparse-TPU (Sze et al.) and NVIDIA's structured 2:4 sparsity in Ampere and Hopper add a small amount of indexing logic to skip zero operands, recovering roughly 2x throughput on weight-pruned networks.

Coarse-grained reconfigurable arrays (CGRAs). These keep the regular grid of PEs but make each PE programmable, so that the same fabric can implement different dataflows. Sambanova RDU and the Cerebras CS-2 are commercial variants; academic CGRAs go back to RaPiD, PipeRench, and TRIPS.

Compute-in-memory. Mythic, IBM's analogue AI chips (NorthPole and follow-ons), and several startups perform multiply-accumulate inside SRAM or analogue crossbar arrays, treating the memory itself as a systolic substrate. Energy per MAC can drop by an order of magnitude at the cost of accuracy.

Photonic systolic arrays. Lightmatter and Lightelligence build matrix-multiply units out of optical interferometers, where the array is a literal grid of waveguides. Light moves through the array in nanoseconds, but encoding and decoding the optical signals is the dominant cost.

FPGA-hosted systolic accelerators. Long before TPUs, Xilinx and Altera FPGAs were used to host systolic accelerators for radar, finance, and medical imaging. Today, AI accelerators such as the Xilinx Versal AI Engine series provide a tiled array of VLIW SIMD cores arranged as a coarse systolic fabric on the same die as programmable logic.

Connection to AI and machine learning

Deep learning models are dominated by general matrix multiplications. A typical LLM forward pass spends 80 percent or more of its FLOPs in the matrix multiplications of the attention and feed-forward layers; for convolutional vision models the figure is similar after im2col. Because systolic arrays do exactly this operation efficiently, they have become the central design pattern for inference and training silicon. Quantised inference, in particular INT8 or INT4, benefits disproportionately because more multipliers fit in the same silicon area.

The arc from "systolic array as a research idea (1978)" to "systolic array as the engine of the AI revolution (2017 onwards)" took almost forty years. The bridge was Google's decision to deploy the TPU v1 at production scale in 2015. Once that demonstrated 30x performance per watt over CPUs and GPUs on data-centre inference, every major chip company added a systolic-style matrix engine to its roadmap, and the rest of the industry followed.

It is worth being honest about what this means. The systolic array is not a magic architecture for general computing. It is a very good architecture for a very specific kind of computation, and that kind of computation happens to be the rate-limiting step of modern neural networks. If transformer-style models had not turned out to scale the way they did, the systolic renaissance probably would not have happened. But they did, and so chips like the TPU and Trainium, and benchmarks like MLPerf that measure them, now organise a large fraction of global compute spend.

Educational and historical significance

"Systolic" remains a standard term in computer architecture textbooks, including Patterson and Hennessy's Computer Architecture: A Quantitative Approach (sixth edition, chapter 7 on data-flow architectures and warehouse-scale computing). H. T. Kung went on to a prolific career at CMU and Harvard, working on networking, parallel algorithms, and recently on deep learning compilers. Charles Leiserson is best known today as a co-author of the Introduction to Algorithms textbook (CLRS) and as a creator of the Cilk parallel programming language at MIT. The systolic array sits in their joint legacy as a piece of theory that turned out to be exactly the right idea, just thirty years early.

The history is also a useful lesson in how research ideas mature in hardware. Kung and Leiserson's 1978 report described a class of architectures that VLSI would eventually make practical; Mead and Conway's 1980 textbook taught a generation of students how to build them; the Warp and iWarp projects of the 1980s prototyped real machines and discovered the engineering problems; FPGA-based systolic accelerators in the 1990s and 2000s kept the technique alive in industry; and the 2010s rise of deep learning provided a workload that finally made the architecture economically dominant. Each step looked modest at the time. The cumulative effect is that almost every commercial AI chip in 2026 traces a direct architectural line back to a Carnegie Mellon technical report from 1978.

Systolic array

Origin and naming

Architectural principles

Canonical example: matrix multiplication

Dataflow taxonomies

Major implementations

Strengths and limitations

Modern variants and research directions

Connection to AI and machine learning

Educational and historical significance

References

Improve this article

Origin and naming

Architectural principles

Canonical example: matrix multiplication

Dataflow taxonomies

Major implementations

Strengths and limitations

Modern variants and research directions

Connection to AI and machine learning

Educational and historical significance

References

Origin and naming

Architectural principles

Canonical example: matrix multiplication

Dataflow taxonomies

Major implementations

Strengths and limitations

Modern variants and research directions

Connection to AI and machine learning

Educational and historical significance

References

Improve this article

Related Articles

Arm Holdings

Graphics processing unit

Cerebras Systems

Groq

Edge AI

NVIDIA Digits

Origin and naming

Architectural principles

Canonical example: matrix multiplication

Dataflow taxonomies

Major implementations

Strengths and limitations

Modern variants and research directions

Connection to AI and machine learning

Educational and historical significance

References

Related Articles

Arm Holdings

Graphics processing unit

Cerebras Systems

Groq

Edge AI

NVIDIA Digits