# Systolic array

> Source: https://aiwiki.ai/wiki/systolic_array
> Updated: 2026-06-24
> Categories: AI Hardware
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **systolic array** is a parallel computer architecture in which a regular grid of identical, tightly coupled processing elements (PEs) rhythmically pumps data through the array under a global clock, with each PE consuming an operand, performing a small fixed computation (most often a multiply-accumulate), and forwarding the result to its neighbours every cycle. Because each operand is fetched from memory once and then reused as it ripples across many PEs, the architecture performs dense matrix multiplication and convolution with extremely high arithmetic intensity and minimal off-chip memory traffic, which is exactly why it has become the central compute fabric inside most modern AI accelerators. The most influential example is Google's [Tensor Processing Unit (TPU)](/wiki/tpu), whose first generation built its matrix-multiply unit from a 256x256 grid of 65,536 multiply-accumulators delivering 92 trillion operations per second (TOPS) [1].

The concept was introduced in 1978 by [H. T. Kung](https://www.eecs.harvard.edu/htk/) and Charles Leiserson at Carnegie Mellon University, in a technical report titled "Systolic Arrays (for VLSI)," which appeared in *Sparse Matrix Proceedings 1978* (Academic Press, 1979) [2]. The architecture takes its name from the cardiovascular system: as Kung put it in his 1982 IEEE *Computer* article "Why Systolic Architectures?", "data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory, much as blood circulates to and from the heart" [3]. The idea was reproduced in Carver Mead and Lynn Conway's canonical textbook *Introduction to VLSI Systems* (1980, Section 8.3) [4]. Today the systolic array underpins the matrix engines in [AWS Trainium](/wiki/aws_trainium) and [AWS Inferentia](/wiki/aws_inferentia), the matrix-multiply units inside Tesla's Dojo D1, and, in a looser sense, NVIDIA Tensor Cores and AMD Matrix Cores. Because deep learning training and inference are dominated by dense [matrix multiplication](/wiki/matrix_multiplication), the systolic array has become arguably the most economically important parallel architecture of the last decade.

## Where does the name systolic array come from?

In the late 1970s, the cost of designing custom integrated circuits was dropping while the number of transistors that fit on a single die kept rising. Mead and Conway (1980) had argued that VLSI would soon let designers place tens of thousands of gates on a chip, but that the bottleneck would be communication rather than computation: long wires were slow, expensive, and hard to lay out [4]. Kung and Leiserson's 1978 report addressed exactly this problem. They proposed building parallel processors as regular grids of identical cells, with all communication restricted to nearest neighbours. Such a design would scale gracefully in VLSI because long wires were unnecessary, control was uniform, and the floorplan was repetitive. The original report described systolic structures for a wide range of dense linear-algebra operations, including matrix-vector and matrix-matrix multiplication, LU decomposition, banded triangular system solving, convolution, and polynomial evaluation [2].

The term "systolic" was coined by Kung as an analogy to the human heart, after the systole phase in which the heart muscle contracts to pump blood. In the 1982 IEEE *Computer* paper, he likened the system to an automobile assembly line: many cars are worked on simultaneously, each at a different station, and the throughput is set by the rate at which they move down the line, not by the time spent at any one station [3]. The same paper laid out the design principles that still apply: simple, regular cells; local communication; balanced compute and I/O; high concurrency; and minimal global control.

The earliest hardware to be designed explicitly as a systolic array was Carnegie Mellon's Warp project, funded by DARPA and built with industrial partners G.E., Honeywell, and Intel. It produced a wire-wrapped two-cell prototype in June 1985, a printed-circuit version (PC-Warp) delivered by GE in April 1987 (around USD 350,000 per machine), and finally an integrated-circuit version called iWarp built jointly with Intel, whose first system shipped in March 1990 [5][6]. Each iWarp chip carried roughly 700,000 transistors and ran at 20 MHz; a typical iWarp system was an 8x8 torus of 64 processors delivering about 1.2 gigaflops peak [6][7]. Around 39 iWarp machines were sold by Intel in 1992 and 1993.

## What makes a systolic array efficient?

The defining features of a systolic array are easy to state and turn out to be enormously powerful in practice.

*Maximum data reuse.* Each operand is fetched once from memory and then reused as it ripples through the array. In a 256x256 multiply-accumulate array, a single activation entering one edge is consumed by all 256 PEs along its path, and a weight held in a PE is reused for every input vector that streams past, all without further main-memory traffic. This is critical because, in modern silicon, moving data is far costlier than the arithmetic that consumes it: Horowitz (2014) measured a 45 nm multiply-accumulate at a few picojoules while an off-chip DRAM access costs on the order of 640 picojoules, roughly two orders of magnitude more [8].
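
The effect of reuse on energy can be sketched with a very rough model. The 640 pJ DRAM figure is the one quoted above; the 3 pJ MAC energy is an assumed stand-in for "a few picojoules", and on-chip SRAM and register energy are ignored, so the output is illustrative only.

```python
# Illustrative energy model; the constants are assumptions for this sketch.
DRAM_ACCESS_PJ = 640.0   # off-chip DRAM access, figure quoted in the text
MAC_PJ = 3.0             # assumed value for "a few picojoules" per MAC


def energy_per_mac_pj(reuse: int) -> float:
    """Average energy per MAC when each DRAM fetch feeds `reuse` MACs."""
    if reuse < 1:
        raise ValueError("reuse must be at least 1")
    return MAC_PJ + DRAM_ACCESS_PJ / reuse


for reuse in (1, 16, 256):
    print(f"reuse={reuse:>3}: {energy_per_mac_pj(reuse):7.2f} pJ per MAC")
```

Under these assumptions, the cost per MAC falls from 643 pJ with no reuse to 5.5 pJ when each fetched operand feeds 256 MACs.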

*Regular, local interconnect.* PEs talk only to their immediate neighbours. This gives a regular floorplan, simplifies routing, removes long wires from the critical path, and keeps clock distribution tractable. A systolic array is one of the few architectures whose physical implementation matches its logical organisation almost exactly.

*Massive parallelism.* A single TPU v1 die holds 65,536 8-bit multiply-accumulators (a 256x256 grid) operating in lockstep at 700 MHz, giving 92 TOPS peak throughput in a 28 nm process, with measured power of 28 W idle to 40 W busy against a 75 W TDP [1]. Larger and more recent arrays scale this idea by an order of magnitude.
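
The 92 TOPS figure follows directly from the array size and clock rate, counting each multiply-accumulate as two operations. A minimal check (the helper name is illustrative):

```python
def peak_ops_per_second(num_macs: int, clock_hz: float) -> float:
    """Peak throughput, counting each multiply-accumulate as two operations."""
    return num_macs * 2 * clock_hz


tpu_v1_macs = 256 * 256                       # 65,536 MAC units
tpu_v1_peak = peak_ops_per_second(tpu_v1_macs, 700e6)
print(f"{tpu_v1_peak / 1e12:.1f} TOPS")       # 91.8, usually quoted as 92
```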

*Predictable, pipelined throughput.* Once the array is filled, one set of inputs is consumed and one set of outputs is emitted on every cycle. Latency for an N x N matrix multiply on an N x N output-stationary array is O(N) cycles to fill plus O(N) cycles to drain, but throughput is one MAC per PE per cycle, so the total work, N^3 multiplications, is completed in roughly 3N cycles rather than the N^3 cycles a serial machine would need.
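
The cycle counts can be made concrete for N = 256. The sketch below assumes the simple skewed schedule, in which a single N x N product takes 3N - 2 cycles; the function names are illustrative.

```python
def output_stationary_cycles(n: int) -> int:
    """Cycles for one N x N by N x N product on an N x N array (skewed inputs)."""
    return 3 * n - 2


def single_pass_utilisation(n: int) -> float:
    """Fraction of PE-cycles doing useful MACs when only one product is run."""
    useful_macs = n ** 3
    available_pe_cycles = n * n * output_stationary_cycles(n)
    return useful_macs / available_pe_cycles


n = 256
print(output_stationary_cycles(n))            # 766 cycles on the array
print(n ** 3)                                 # 16777216 serial MAC steps
print(round(single_pass_utilisation(n), 3))   # 0.334
```

A single isolated product keeps the array only about one third busy, because most cycles are spent filling or draining. Accelerators typically stream many tiles back to back so that these phases overlap, which is how sustained throughput approaches one MAC per PE per cycle.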

*Energy efficiency.* Because the array minimises off-chip memory accesses and avoids the speculation, branch prediction, and out-of-order machinery of a general-purpose CPU, energy per operation is much lower. Jouppi et al. (2017) reported that the TPU v1 delivered 30 to 80 times higher performance per watt than the contemporary Intel Haswell CPUs and NVIDIA Kepler [GPU](/wiki/gpu)s on neural-network inference workloads, and was 15 to 30 times faster overall [1].

## How does a systolic array multiply matrices?

Matrix multiplication is the textbook systolic problem. Consider computing C = A * B, where A is M x K, B is K x N, and C is M x N. An output-stationary systolic array of size M x N proceeds as follows. Each PE (i, j) holds one accumulator for C[i, j]. Rows of A are streamed in from the left edge and columns of B from the top edge, with row i delayed by i cycles and column j delayed by j cycles so that matching operands meet in the right PE. On each cycle, each PE multiplies the two operands arriving at its inputs, adds the product to its accumulator, and passes the A operand to its right neighbour and the B operand to the neighbour below. After K multiply-accumulate steps, every PE has accumulated the dot product that defines its element of C. The whole multiplication takes K + M + N - 2 cycles in the simplest schedule, with throughput limited only by clock rate and array size. A weight-stationary array instead preloads the elements of B into the PEs, streams A through, and passes partial sums from PE to PE.
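
The following is a minimal cycle-level sketch of an output-stationary schedule, in which PE (i, j) keeps the accumulator for C[i, j], A operands move left to right, and B operands move top to bottom. It is a software model written for illustration, not a description of any particular chip; the register arrays `a_reg` and `b_reg` and the function name are assumptions of the sketch.

```python
import numpy as np


def systolic_matmul(A: np.ndarray, B: np.ndarray):
    """Simulate an output-stationary M x N systolic array cycle by cycle."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"

    acc = np.zeros((M, N), dtype=np.result_type(A, B))  # one accumulator per PE
    a_reg = np.zeros((M, N), dtype=A.dtype)   # operand moving left to right
    b_reg = np.zeros((M, N), dtype=B.dtype)   # operand moving top to bottom

    total_cycles = K + M + N - 2
    for t in range(total_cycles):
        # Each PE forwards last cycle's operands to its right / lower neighbour.
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()

        # Feed the edges with skewed inputs: row i of A is delayed by i cycles,
        # column j of B by j cycles; zeros pad the fill and drain phases.
        for i in range(M):
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < K else 0
        for j in range(N):
            k = t - j
            b_reg[0, j] = B[k, j] if 0 <= k < K else 0

        # All PEs perform one multiply-accumulate in lockstep.
        acc += a_reg * b_reg

    return acc, total_cycles


rng = np.random.default_rng(0)
A = rng.integers(-4, 5, size=(3, 5))
B = rng.integers(-4, 5, size=(5, 4))
C, cycles = systolic_matmul(A, B)
assert np.array_equal(C, A @ B)
print(cycles)  # 5 + 3 + 4 - 2 = 10
```

With this skew, A[i, k] and B[k, j] both reach PE (i, j) at cycle i + j + k, so the last product lands at cycle (M - 1) + (N - 1) + (K - 1), which gives the K + M + N - 2 total.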

When the problem is larger than the array, the matrices are tiled and each tile is streamed through the same hardware. This lets a 256x256 array compute matrix multiplications of any size, paying a small overhead in fill and drain cycles. [Convolution](/wiki/convolution)s are usually mapped to matrix multiplications via the im2col (unrolled-convolution) transform introduced by Chellapilla, Puri, and Simard (2006), who reported a 2.4x to 3.0x speedup by turning each convolutional layer into a single matrix-matrix product fed to a BLAS routine [9]. This is why the same hardware that does GEMM well also runs convolutional neural networks well.
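
Both ideas can be sketched in a few lines of NumPy. The example assumes a single-channel 2-D input, stride 1, and no padding, and uses a small block product as a stand-in for one pass through the array; the function names and the tile size of 4 are illustrative choices.

```python
import numpy as np


def tiled_matmul(A, B, tile=4):
    """Compute A @ B one tile x tile block at a time."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # Each block product stands in for one pass through the array.
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
                )
    return C


def im2col(x, kh, kw):
    """Unroll every kh x kw patch of a 2-D input into one row."""
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((out_h * out_w, kh * kw), dtype=x.dtype)
    for r in range(out_h):
        for c in range(out_w):
            cols[r * out_w + c] = x[r:r + kh, c:c + kw].ravel()
    return cols, (out_h, out_w)


def conv2d_via_matmul(x, w):
    """Cross-correlation (the CNN 'convolution') as a single matrix product."""
    cols, out_shape = im2col(x, *w.shape)
    return tiled_matmul(cols, w.reshape(-1, 1)).reshape(out_shape)


x = np.arange(25).reshape(5, 5)
w = np.array([[1, 0], [0, -1]])
print(conv2d_via_matmul(x, w))   # every entry is x[r, c] - x[r+1, c+1] = -6
```

The unrolled matrix duplicates overlapping input pixels, so im2col trades extra memory for the ability to reuse an unmodified matrix-multiply engine.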

## How do systolic dataflows differ?

A "dataflow" describes which operands are stationary in the array and which ones move. The classic taxonomy was formalised by Chen, Krishna, Emer, and Sze in the Eyeriss paper (ISCA 2016), which also introduced the row-stationary scheme [10]. Different dataflows trade off how often each tensor is reloaded from off-chip memory, which directly affects energy.

| Dataflow | What stays | What flows | Typical use |
|---|---|---|---|
| Output-stationary (OS) | Partial sums in PEs | Weights and activations | Classical Kung-Leiserson designs; balanced reuse |
| Weight-stationary (WS) | Weights in PEs | Activations and partial sums | Inference (e.g. TPU v1, AWS Inferentia); weights large and reused |
| Input-stationary (IS) | Activations in PEs | Weights and partial sums | Some training scenarios |
| Row-stationary (RS) | Convolution row in PE | Activations, partials, filters across PEs | Eyeriss; minimises total data movement on CNNs |
| No local reuse (NLR) | Nothing | Everything | Reference baseline; energy-inefficient |

Chen et al. (2016) showed that on AlexNet the row-stationary dataflow used 1.4x to 2.5x less energy than weight-stationary or output-stationary in convolutional layers, because it simultaneously reuses filter weights, input activations, and partial sums; the resulting Eyeriss chip ran AlexNet convolutions at 35 frames per second using only 278 mW, about ten times more energy-efficient than contemporary mobile GPUs [10].
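
The stationary/moving distinction in the table can be illustrated as a choice of loop order for the same matrix product: whichever operand is fixed by the outer loops is the one a PE would hold in place. This is a software analogy under that assumption, not a hardware model, and it covers only the three simplest dataflows.

```python
import numpy as np


def output_stationary(A, B):
    """C[i, j] is pinned; operands of A and B stream past it."""
    M, K = A.shape
    N = B.shape[1]
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for i in range(M):
        for j in range(N):
            acc = 0                      # partial sum stays in one place
            for k in range(K):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc
    return C


def weight_stationary(A, B):
    """B[k, j] is pinned; activations stream in and partial sums move on."""
    M, K = A.shape
    N = B.shape[1]
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for k in range(K):
        for j in range(N):
            w = B[k, j]                  # weight loaded once, reused M times
            for i in range(M):
                C[i, j] += A[i, k] * w
    return C


def input_stationary(A, B):
    """A[i, k] is pinned; weights stream in and partial sums move on."""
    M, K = A.shape
    N = B.shape[1]
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for i in range(M):
        for k in range(K):
            x = A[i, k]                  # activation loaded once, reused N times
            for j in range(N):
                C[i, j] += x * B[k, j]
    return C


A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
for dataflow in (output_stationary, weight_stationary, input_stationary):
    assert np.array_equal(dataflow(A, B), A @ B)
```

All three orders perform the same M x K x N multiply-accumulates; they differ only in which tensor is read once and which are re-read, which is what drives the energy differences.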

## Which AI chips use systolic arrays?

The table below collects the most influential systolic-array-based machines and accelerators. Sizes refer to the on-chip MAC array, not to multi-chip pods.

| System | Year | Organisation | Array size | Numerics | Peak throughput | Notes |
|---|---|---|---|---|---|---|
| Warp / PC-Warp | 1985 to 1987 | CMU + GE | 10-cell linear | 32-bit float | 100 MFLOPS | First explicit systolic-array machine; sold for ~USD 350,000 |
| iWarp | 1990 to 1993 | CMU + Intel | 8x8 torus typical | 32-bit float | 1.2 GFLOPS | About 39 systems sold; LIW microarchitecture |
| [Google TPU](/wiki/tpu) v1 | 2015, paper 2017 | Google | 256x256 | INT8 MAC, INT32 accumulate | 92 TOPS | First commercial deployment of a large systolic array; 700 MHz, 28 nm, 28 to 40 W measured (75 W TDP) |
| [TPU](/wiki/tpu_chip) v2 | 2017 | Google | 128x128 per MXU | bfloat16 multiply, FP32 accumulate | 45 TFLOPS / chip | Two cores per chip; trainable workloads via bfloat16 |
| [TPU](/wiki/tpu_chip) v3 | 2018 | Google | 128x128 per MXU (2 per core) | bfloat16, FP32 acc | 123 TFLOPS / chip | Liquid-cooled; doubled MXUs per core |
| [TPU](/wiki/tpu_chip) v4 | 2021 | Google | 128x128 per MXU (4 per core) | bfloat16, FP32 acc | 275 TFLOPS / chip | Optical reconfigurable interconnect [11] |
| [TPU](/wiki/tpu_chip) v5p / v6e Trillium | 2023 to 2024 | Google | 128x128 (v5p), 256x256 (v6e) | bfloat16, INT8 | 459 TFLOPS (v5p), ~926 TFLOPS (v6e) | v6e returned to a 256x256 MXU for 4x FLOPs per cycle |
| Tesla Dojo D1 | 2021 | Tesla | 354 nodes per die, each with a matrix multiplier | BF16 / CFP8 / FP32 | 362 BF16 TFLOPS / die | 7 nm TSMC, 50 billion transistors, 645 mm^2; 18x20 array of nodes |
| [Cerebras](/wiki/cerebras) WSE-2 | 2021 | Cerebras Systems | 850,000 cores in a 2D mesh | FP16 dense / sparse | 7.5 PFLOPS dense FP16 | Wafer-scale; not a classical systolic array but uses the same dataflow principles |
| AWS Inferentia / Trainium | 2019 / 2020 | AWS | 128x128 per NeuronCore Tensor Engine | cFP8 / FP16 / BF16 / TF32 / FP32 | 100+ TFLOPS BF16 / TensorEngine | Each NeuronCore-v2 has tensor, vector, scalar, and GPSIMD engines |
| [Groq](/wiki/groq) LPU (TSP) | 2020 to 2024 | Groq | Functionally sliced spatial array | INT8 / FP16 | ~750 TOPS (LPU v1) | Tensor streaming processor; deterministic compiler-controlled execution |
| [Sambanova](/wiki/sambanova) RDU | 2020+ | SambaNova | Reconfigurable dataflow array | BF16 / FP32 | Varies | Coarse-grained reconfigurable array (CGRA), spiritual successor to Warp |
| Intel Habana Gaudi / Gaudi 2 | 2019 / 2022 | Intel | Multiple matrix engines (TPCs + MMEs) | BF16 / FP16 | 432 BF16 TFLOPS (Gaudi 2) | MME is a configurable systolic engine |
| Xilinx Versal AI Engine | 2019 | Xilinx (now AMD) | Array of 400 VLIW SIMD cores | INT8 / BF16 / FP32 | ~133 TOPS INT8 | Adaptive compute acceleration platform; spatial dataflow array on FPGA fabric |
| UC Berkeley Gemmini | 2019 | Berkeley | Configurable, e.g. 16x16 INT8 | INT8 / BF16 / FP32 | Configurable | Open-source RISC-V systolic accelerator generator |

The TPU v1 was, by a large margin, the highest-impact systolic deployment. According to Jouppi et al. (2017), it had been running at production scale in Google data centres since 2015, accelerating workloads such as Search, Translate, Photos, and AlphaGo [1]. Its 256x256 MAC array used 8-bit weights and 8-bit activations with 32-bit accumulators, and it was driven by a CISC-like instruction set with about a dozen high-level instructions, the most important of which was MatrixMultiply [12].
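
The choice of 32-bit accumulators for 8-bit operands can be checked with simple arithmetic. The sketch below uses NumPy integer types as a stand-in for the hardware datapath; the matrix shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.integers(-128, 128, size=(4, 256), dtype=np.int8)
weights = rng.integers(-128, 128, size=(256, 4), dtype=np.int8)

# Widen before multiplying so products and running sums are held in 32 bits.
acc32 = acts.astype(np.int32) @ weights.astype(np.int32)

# Worst case for a 256-long dot product of int8 values: 256 * (-128 * -128).
worst_case = 256 * 128 * 128
print(worst_case)                      # 4194304
print(np.iinfo(np.int16).max)          # 32767, far too small
print(np.iinfo(np.int32).max)          # 2147483647, ample headroom
```

A single int8 product can already reach 16,384, so a 16-bit accumulator could overflow after two terms, while 32 bits leaves room for dot products several hundred times longer than the array dimension.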

From TPU v2 onwards, Google switched the matrix unit to bfloat16 multiply with FP32 accumulate to support training, and replaced the single 256x256 MXU with smaller 128x128 MXUs (one per core in TPU v2, two in v3, four in v4), which allowed better utilisation when batch sizes were small [13]. TPU v4 (Jouppi et al., 2023) introduced an optical reconfigurable interconnect for [TPU pods](/wiki/tpu_pod) and added the SparseCore for embedding lookups [11]. With TPU v6e (Trillium, announced at Google I/O 2024), Google returned to a 256x256 MXU for 4x more FLOPs per cycle. [Cloud TPU](/wiki/cloud_tpu) instances expose this hardware to external customers; a [TPU slice](/wiki/tpu_slice) or [TPU pod](/wiki/tpu_pod) bundles many [TPU chips](/wiki/tpu_chip) over a high-bandwidth fabric.
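
The bfloat16-multiply, FP32-accumulate arrangement can be emulated in NumPy by discarding the low 16 bits of each float32 input. This is a sketch only: truncation is used for brevity and is an assumption of the example, not a statement about the rounding mode of any TPU, and the function names are illustrative.

```python
import numpy as np


def to_bfloat16(x: np.ndarray) -> np.ndarray:
    """Emulate bfloat16 by zeroing the low 16 bits of each float32 (truncation)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)


def mxu_style_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Inputs reduced to bfloat16 precision; products and sums kept in float32."""
    return to_bfloat16(a) @ to_bfloat16(b)


# bfloat16 keeps 7 explicit mantissa bits: 1 + 2**-7 survives, 1 + 2**-8 does not.
print(to_bfloat16(np.array([1.0078125, 1.00390625])))   # [1.0078125 1.       ]
```

bfloat16 keeps the 8-bit exponent of float32, so the dynamic range is unchanged and only mantissa precision is lost, which is why accumulation in FP32 matters for long dot products.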

Tesla's Dojo D1, announced at AI Day on 19 August 2021, is a 7 nm TSMC die with 50 billion transistors in 645 mm^2 [14]. It packs 354 training nodes in an 18x20 grid (a few are reserved for fault tolerance), each node containing a 64-bit superscalar CPU plus a dedicated matrix multiplier and vector unit; the entire die hits 362 BF16 TFLOPS at 2 GHz with distributed SRAM and high off-die bandwidth across hundreds of SerDes channels [15][16].

NVIDIA Tensor Cores, introduced with the Volta V100 GPU in 2017, perform a 4x4x4 matrix multiply-accumulate per Tensor Core per cycle, with 8 Tensor Cores per Streaming Multiprocessor and 80 SMs on the V100 die for a total of 125 TFLOPS in mixed-precision FP16 multiply with FP32 accumulate. Whether Tensor Cores are "truly" systolic arrays is a matter of definition; microbenchmarks suggest they are implemented as small spatial arrays of FMA units that share the systolic spirit (lockstep, fixed dataflow, local reuse) without exposing it as cleanly as a TPU MXU. Hopper (H100) and Blackwell (B100/B200) extended the same idea to larger sub-tiles and lower precisions (FP8, FP6, FP4), and AMD's Matrix Cores in CDNA 3 follow the same pattern.
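
The 125 TFLOPS figure can be reproduced from the counts above. The 1.53 GHz boost clock is not stated in this article and is an assumption of the sketch.

```python
fma_per_tensor_core = 4 * 4 * 4    # one 4x4x4 multiply-accumulate per cycle
tensor_cores = 8 * 80              # 8 per SM, 80 SMs
boost_clock_hz = 1.53e9            # assumed V100 boost clock, not from the text

# Each fused multiply-add counts as two floating-point operations.
peak_flops = tensor_cores * fma_per_tensor_core * 2 * boost_clock_hz
print(f"{peak_flops / 1e12:.1f} TFLOPS")   # 125.3
```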

## What are the strengths and limitations of systolic arrays?

| Property | Strength | Limitation |
|---|---|---|
| Throughput per area | Very high; an N x N array does N^2 MACs per cycle | Underused on small or oddly shaped problems |
| Energy per operation | Minimised by local reuse and lack of speculation | Sparse / irregular data wastes most of the array |
| Compiler model | Fixed, predictable dataflow; matmul scheduling is straightforward | Inflexible for control-heavy or branching workloads |
| Physical design | Regular floorplan; nearest-neighbour wires | Hard to clock at extreme frequencies due to skew across large arrays |
| Memory pressure | On-chip SRAM and HBM can feed the array | Bandwidth wall: if HBM cannot keep up, the array stalls |
| Determinism | Cycle-accurate timing; useful for inference SLAs | Hostile to dynamic shapes, conditional execution |

The inflexibility limitation is the most consequential one. A 256x256 array configured for INT8 GEMM is exquisite at INT8 GEMM and bad at almost everything else. For workloads that are not dense matrix multiplication, such as pointer chasing, embedding-table lookups, or graph traversal, a systolic array contributes very little. This is part of why modern accelerators (TPU, Trainium, Dojo) bolt the systolic core onto a vector unit, a scalar unit, and special-purpose engines for embeddings (such as TPU v4's SparseCore). Sparsity is another open problem: a CNN with 80 percent zero weights still spends roughly 80 percent of its multiply-accumulate slots multiplying by zero unless the array supports zero-skipping, which is non-trivial to add without breaking the regular dataflow.
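
Two of these losses are easy to quantify. The sketch below assumes a simple model in which an m x n output is covered by whole array-sized tiles and a dense array has no zero-skipping; both helpers are hypothetical and ignore fill and drain cycles.

```python
import math


def array_utilisation(m: int, n: int, array_dim: int = 256) -> float:
    """Fraction of PEs holding useful outputs when an m x n result is tiled."""
    tiles = math.ceil(m / array_dim) * math.ceil(n / array_dim)
    return (m * n) / (tiles * array_dim * array_dim)


def useful_mac_fraction(zero_weight_fraction: float) -> float:
    """Share of MACs doing useful work on a dense array without zero-skipping."""
    return 1.0 - zero_weight_fraction


print(array_utilisation(256, 256))             # 1.0
print(round(array_utilisation(300, 300), 3))   # 0.343
print(round(array_utilisation(64, 64), 4))     # 0.0625
print(round(useful_mac_fraction(0.8), 2))      # 0.2
```

A 300 x 300 output needs four 256 x 256 tiles, so about two thirds of the PE slots sit idle, and a 64 x 64 problem uses one sixteenth of the array.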

## What new directions are systolic arrays taking?

Research on systolic arrays did not stop in 1990. Active areas include:

*Sparse systolic arrays.* Schemes that add a small amount of indexing logic to skip zero operands, such as NVIDIA's structured 2:4 sparsity in Ampere and Hopper, recover roughly 2x throughput on weight-pruned networks.
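
The 2:4 pattern keeps at most two non-zero values in every group of four consecutive weights. A minimal magnitude-based pruning sketch follows; it assumes the last dimension is a multiple of four and is not taken from any vendor library.

```python
import numpy as np


def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude values in every group of four."""
    if weights.shape[-1] % 4 != 0:
        raise ValueError("last dimension must be a multiple of 4")
    groups = weights.reshape(-1, 4)                     # consecutive groups of 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]    # two smallest per group
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0, axis=1)
    return pruned.reshape(weights.shape)


w = np.array([[0.1, -2.0, 0.5, 3.0, 1.0, -1.5, 0.2, 0.0]])
print(prune_2_of_4(w))   # [[ 0.  -2.   0.   3.   1.  -1.5  0.   0. ]]
```

Because exactly half of each group is zero, the hardware can store two values plus small position indices per group and skip the zeros without disturbing the regular dataflow.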

*Coarse-grained reconfigurable arrays (CGRAs).* These keep the regular grid of PEs but make each PE programmable, so that the same fabric can implement different dataflows. SambaNova's RDU and the Cerebras CS-2 are commercial variants; academic CGRAs go back to RaPiD, PipeRench, and TRIPS.

*Compute-in-memory.* Mythic, IBM Research's analogue AI chips, and several startups perform multiply-accumulate inside SRAM or analogue crossbar arrays, treating the memory itself as a systolic substrate; IBM's NorthPole pursues a related idea digitally by interleaving memory with compute. Energy per MAC can drop by an order of magnitude, at the cost of accuracy in the analogue designs.

*Photonic systolic arrays.* Lightmatter and Lightelligence build matrix-multiply units out of optical interferometers, where the array is a literal grid of waveguides. Light moves through the array in nanoseconds, but encoding and decoding the optical signals is the dominant cost.

*FPGA-hosted systolic accelerators.* Long before TPUs, Xilinx and Altera FPGAs were used to host systolic accelerators for radar, finance, and medical imaging. Today, [AI accelerators](/wiki/ai_chips) such as the Xilinx Versal AI Engine series provide a tiled array of VLIW SIMD cores arranged as a coarse systolic fabric on the same die as programmable logic.

## Why is the systolic array central to AI?

[Deep learning](/wiki/deep_learning) models are dominated by general matrix multiplications. A typical [transformer](/wiki/transformer) LLM forward pass spends 80 percent or more of its FLOPs in the matrix multiplications of the [attention](/wiki/attention) and feed-forward layers; for convolutional vision models the figure is similar after im2col. Because systolic arrays do exactly this operation efficiently, they have become the central design pattern for inference and training silicon. Quantised inference, in particular INT8 or INT4, benefits disproportionately because more multipliers fit in the same silicon area.

The arc from "systolic array as a research idea (1978)" to "systolic array as the engine of the AI revolution (2017 onwards)" took almost forty years. The bridge was Google's decision to deploy the TPU v1 at production scale in 2015. Once that demonstrated 30 to 80 times higher performance per watt over CPUs and GPUs on data-centre inference [1], systolic-style matrix engines appeared on the roadmaps of most major chip companies.

It is worth being honest about what this means. The systolic array is not a magic architecture for general computing. It is a very good architecture for a very specific kind of computation, and that kind of computation happens to be the rate-limiting step of modern neural networks. If transformer-style models had not turned out to scale the way they did, the systolic renaissance probably would not have happened. But they did, and so chips like the TPU and Trainium, and benchmarks like [MLPerf](/wiki/mlperf) that measure them, now organise a large fraction of global compute spend.

## Educational and historical significance

"Systolic" remains a standard term in computer architecture textbooks, including Hennessy and Patterson's *Computer Architecture: A Quantitative Approach* (sixth edition, chapter 7 on domain-specific architectures) [17]. H. T. Kung went on to a prolific career at CMU and Harvard, working on networking, parallel algorithms, and recently on deep learning compilers. Charles Leiserson is best known today as a co-author of the *Introduction to Algorithms* textbook (CLRS) and as a creator of the Cilk parallel programming language at MIT. The systolic array sits in their joint legacy as a piece of theory that turned out to be exactly the right idea, just nearly forty years early.

The history is also a useful lesson in how research ideas mature in hardware. Kung and Leiserson's 1978 report described a class of architectures that VLSI would eventually make practical; Mead and Conway's 1980 textbook taught a generation of students how to build them; the Warp and iWarp projects of the 1980s prototyped real machines and discovered the engineering problems; FPGA-based systolic accelerators in the 1990s and 2000s kept the technique alive in industry; and the 2010s rise of deep learning provided a workload that finally made the architecture economically dominant. Each step looked modest at the time. The cumulative effect is that almost every commercial AI chip in 2026 traces a direct architectural line back to a Carnegie Mellon technical report from 1978.

## References

1. Jouppi, N. P., Young, C., Patil, N., Patterson, D., et al. (2017). "In-Datacenter Performance Analysis of a Tensor Processing Unit." *ISCA 2017*. https://arxiv.org/abs/1704.04760
2. Kung, H. T., and Leiserson, C. E. (1978). "Systolic Arrays (for VLSI)." CMU-CS-79-103. *Sparse Matrix Proceedings 1978*, 256 to 282. Academic Press, 1979. https://www.eecs.harvard.edu/htk/static/files/1978-cmu-cs-report-kung-leiserson.pdf
3. Kung, H. T. (1982). "Why Systolic Architectures?" *IEEE Computer*, 15(1), 37 to 46. https://www.eecs.harvard.edu/~htk/publication/1982-kung-why-systolic-architecture.pdf
4. Mead, C. A., and Conway, L. A. (1980). *Introduction to VLSI Systems*, Section 8.3. Addison-Wesley.
5. Wikipedia. "WARP (systolic array)." https://en.wikipedia.org/wiki/WARP_(systolic_array)
6. Borkar, S., Cohn, R., Cox, G., Gleason, S., Gross, T., Kung, H. T., Lam, M., Moore, B., Peterson, C., Pieper, J., Rankin, L., Tseng, P. S., Sutton, J., Urbanski, J., and Webb, J. (1988). "iWarp: an integrated solution to high-speed parallel computing." *Proceedings of Supercomputing '88*, 330 to 339. https://www.eecs.harvard.edu/~htk/publication/1988-supercomputing-borkar-etc.pdf
7. Annaratone, M., Arnould, E., Gross, T., Kung, H. T., Lam, M., Menzilcioglu, O., and Webb, J. A. (1987). "The Warp Computer: Architecture, Implementation, and Performance." *IEEE Transactions on Computers*, C-36(12), 1523 to 1538. https://www.ri.cmu.edu/pub_files/pub3/annaratone_m_1987_1/annaratone_m_1987_1.pdf
8. Horowitz, M. (2014). "Computing's energy problem (and what we can do about it)." *ISSCC 2014*. https://gwern.net/doc/cs/hardware/2014-horowitz-2.pdf
9. Chellapilla, K., Puri, S., and Simard, P. (2006). "High Performance Convolutional Neural Networks for Document Processing." *Tenth International Workshop on Frontiers in Handwriting Recognition*. https://inria.hal.science/inria-00112631/document
10. Chen, Y.-H., Krishna, T., Emer, J., and Sze, V. (2016). "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks." *ISCA 2016*. https://eems.mit.edu/wp-content/uploads/2016/04/eyeriss_isca_2016.pdf
11. Jouppi, N. P., Kurian, G., Li, S., Ma, P., Nagarajan, R., Nai, L., Patil, N., Subramanian, S., Swing, A., Towles, B., Young, C., Zhou, X., Zhou, Z., and Patterson, D. (2023). "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings." *ISCA 2023*. https://arxiv.org/abs/2304.01433
12. Jouppi, N. P., Young, C., Patil, N., and Patterson, D. (2018). "Motivation for and Evaluation of the First Tensor Processing Unit." *IEEE Micro*, 38(3), 10 to 19.
13. Norrie, T., Patil, N., Yoon, D. H., Kurian, G., Li, S., Laudon, J., Young, C., Jouppi, N., and Patterson, D. (2021). "The Design Process for Google's Training Chips: TPUv2 and TPUv3." *IEEE Micro*, 41(2), 56 to 63.
14. CNBC. (2021). "Tesla unveils Dojo D1 chip at AI Day." https://www.cnbc.com/2021/08/19/tesla-unveils-dojo-d1-chip-at-ai-day.html
15. Bannon, P., et al. (2022). "Computer architecture for AI: Tesla Dojo." Hot Chips 34. https://chipsandcheese.com/p/hot-chips-34-teslas-dojo-microarchitecture
16. Talpes, E., et al. (2023). "The Microarchitecture of DOJO, Tesla's Exa-Scale Computer." *IEEE Micro*, 43(3). https://ieeexplore.ieee.org/document/10078146
17. Hennessy, J. L., and Patterson, D. A. (2019). *Computer Architecture: A Quantitative Approach*, 6th ed., chapter 7, "Domain-Specific Architectures." Morgan Kaufmann.
18. Cerebras Systems. (2021). "WSE-2: 2.6 trillion transistors, 850,000 cores." https://www.cerebras.ai/blog/cerebras-architecture-deep-dive-first-look-inside-the-hw-sw-co-design-for-deep-learning
19. Groq. (2024). "LPU Architecture." https://groq.com/lpu-architecture
20. AWS. (2023). "Trainium / Inferentia2 NeuronCore Architecture." https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.0/general/nki/arch/trainium_inferentia2_arch.html
21. Wikipedia. "Systolic array." https://en.wikipedia.org/wiki/Systolic_array
22. Wikipedia. "iWarp." https://en.wikipedia.org/wiki/IWarp
23. Wikipedia. "Tensor Processing Unit." https://en.wikipedia.org/wiki/Tensor_Processing_Unit
24. Wikipedia. "Tesla Dojo." https://en.wikipedia.org/wiki/Tesla_Dojo
