Meta MTIA
Last reviewed
May 31, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,816 words
MTIA (Meta Training and Inference Accelerator) is a family of custom application-specific integrated circuits (ASICs) that Meta designs in-house to run its largest artificial intelligence workloads. The first chips were built for the deep learning recommendation and ranking models that decide what shows up in Facebook and Instagram feeds and ads, and later generations are being extended toward generative AI. MTIA sits alongside graphics processing units in Meta's data centers and runs through a software stack built on PyTorch.[1][2]
Meta announced the first generation, MTIA v1, in May 2023, fabricated on a TSMC 7 nanometer process.[1][3] A next-generation MTIA followed in April 2024 on TSMC's 5 nanometer process, with roughly three and a half times the dense compute of the original and double the on-chip and off-chip memory capacity.[2][4] The work is part of a wider shift among hyperscale operators toward in-house silicon, a trend that also produced Google's TPU and Amazon's Trainium and Inferentia lines.[5][6]
Meta serves recommendation models at a scale that is hard to appreciate from the outside. Ranking the candidate posts, reels, and ads for billions of people happens billions of times a day, and the models behind it have grown steadily larger and more complex.[1][7] These deep learning recommendation models, often shortened to DLRMs, have an unusual shape. They mix dense neural network layers with very large embedding tables, and they are bound as much by memory access and data movement as by raw arithmetic.[1][8]
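The contrast between the two halves of such a model can be made concrete with a rough arithmetic-intensity estimate, meaning the number of operations performed per byte moved. The sketch below is illustrative only: the layer sizes, batch size, number of lookups, and FP16 storage are hypothetical values not taken from Meta's sources, and index traffic and caching are ignored.

```python
"""Rough arithmetic intensity for the dense and sparse halves of a DLRM-style model.

Every size used here is a hypothetical value chosen for illustration.
"""

BYTES_PER_VALUE = 2  # assumption: FP16 storage for weights, activations, embeddings


def dense_layer_intensity(batch: int, in_features: int, out_features: int) -> float:
    """FLOPs per byte moved for one fully connected layer (y = x @ W)."""
    flops = 2 * batch * in_features * out_features  # a multiply and an add per MAC
    values_moved = (
        in_features * out_features  # weights, read once per batch
        + batch * in_features       # input activations
        + batch * out_features      # output activations
    )
    return flops / (values_moved * BYTES_PER_VALUE)


def embedding_bag_intensity(batch: int, lookups_per_sample: int, dim: int) -> float:
    """FLOPs per byte moved for a sum-pooled embedding lookup."""
    rows_read = batch * lookups_per_sample
    flops = rows_read * dim  # roughly one add per gathered element
    values_moved = rows_read * dim + batch * dim  # gathered rows plus pooled output
    return flops / (values_moved * BYTES_PER_VALUE)


dense = dense_layer_intensity(batch=256, in_features=1024, out_features=1024)
sparse = embedding_bag_intensity(batch=256, lookups_per_sample=40, dim=64)

print(f"dense layer:   {dense:7.2f} FLOPs/byte")
print(f"embedding bag: {sparse:7.2f} FLOPs/byte")
```

Under these assumed sizes the dense layer performs on the order of a hundred operations per byte, while the embedding lookup performs less than one. That gap is why the embedding side is limited by the memory system rather than by arithmetic units.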
General-purpose GPUs can run this work, but they are tuned for the dense matrix math of training large neural networks rather than the sparse, memory-heavy access patterns of recommendation inference. For a workload that runs constantly across a fleet of this size, the figures that matter most are performance per watt and performance per total cost of ownership.[1][7] By designing the chip and its compiler together, Meta could target its own model shapes directly, squeeze more useful work out of each watt, and gain more control over its hardware roadmap rather than depending entirely on outside suppliers.[2][7] Reducing reliance on a single GPU vendor, Nvidia, is a recurring theme in how analysts read the program, though Meta has framed MTIA as a complement to GPUs rather than a replacement.[4][6]
The first chip was presented in a paper at the International Symposium on Computer Architecture in 2023, where Meta described the hardware and software as co-designed for recommendation inference and reported improved performance per watt over GPUs on its internal models.[3][8] At the time of the May 2023 announcement, Meta said the silicon was already running in its data centers.[1]
The two announced generations share a family resemblance. Both use an eight by eight grid of processing elements with RISC-V control, on-chip memory measured in hundreds of megabytes, and off-chip LPDDR5. The second generation moves to a denser process node, runs at a higher clock, and raises dense throughput by about three and a half times while doubling memory capacity.[1][2][4]
| Feature | MTIA v1 | Next-generation MTIA |
|---|---|---|
| Announced | May 2023 | April 2024 |
| Process | TSMC 7 nm | TSMC 5 nm |
| Clock frequency | 800 MHz | 1.35 GHz |
| Processing element grid | 8 x 8 (64 PEs) | 8 x 8 (64 PEs) |
| On-chip SRAM | 128 MB | 256 MB |
| Off-chip memory | 64 GB LPDDR5 | 128 GB LPDDR5 |
| LPDDR5 bandwidth | not stated | 204.8 GB/s |
| INT8 compute (dense) | 102.4 TOPS | 354 TOPS |
| INT8 compute (sparse) | not stated | 708 TOPS |
| FP16/BF16 compute (dense) | 51.2 TFLOPS | 177 TFLOPS |
| TDP | 25 W | 90 W |
| Die size | not stated | 421 mm² |
Sources: Meta AI blog and independent reporting.[1][2][4]
MTIA v1 ran at 800 MHz and delivered 102.4 TOPS at INT8 precision and 51.2 TFLOPS at FP16, inside a 25 watt thermal envelope.[1] The next-generation part lifts the clock to 1.35 GHz on a 421 square millimeter die and reports 354 TOPS of dense INT8 throughput, with sparse INT8 reaching 708 TOPS thanks to hardware support for sparsity.[2][4] Independent coverage put the jump at about three and a half times the dense compute and seven times the sparse compute of the original, with on-chip SRAM doubled to 256 MB and LPDDR5 capacity doubled to 128 GB.[4] The higher 90 watt power budget reflects the larger, faster chip.[2][4]
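The ratios quoted above follow directly from the published peak figures. The short sketch below reproduces that arithmetic. It also divides peak dense INT8 throughput by TDP, which is only a crude proxy under the assumption that both chips run at their rated peak and power. It is not the delivered performance per watt on real models that Meta reports.

```python
# Peak figures copied from the comparison table; nothing here is measured.
V1 = {"int8_dense_tops": 102.4, "fp16_dense_tflops": 51.2, "tdp_w": 25}
NEXT_GEN = {
    "int8_dense_tops": 354.0,
    "int8_sparse_tops": 708.0,
    "fp16_dense_tflops": 177.0,
    "tdp_w": 90,
}

# Generation-to-generation ratios. MTIA v1 has no stated sparse figure,
# so the sparse ratio is taken against its dense INT8 number.
dense_ratio = NEXT_GEN["int8_dense_tops"] / V1["int8_dense_tops"]
sparse_ratio = NEXT_GEN["int8_sparse_tops"] / V1["int8_dense_tops"]
power_ratio = NEXT_GEN["tdp_w"] / V1["tdp_w"]


def peak_tops_per_watt(chip: dict) -> float:
    """Peak dense INT8 TOPS divided by TDP (a crude proxy, see the text)."""
    return chip["int8_dense_tops"] / chip["tdp_w"]


print(f"dense INT8 ratio:  {dense_ratio:.2f}x")
print(f"sparse INT8 ratio: {sparse_ratio:.2f}x")
print(f"TDP ratio:         {power_ratio:.2f}x")
print(f"peak TOPS/W, v1:       {peak_tops_per_watt(V1):.2f}")
print(f"peak TOPS/W, next-gen: {peak_tops_per_watt(NEXT_GEN):.2f}")
```

On these numbers, peak dense throughput and rated power grow by a similar factor, so the peak-per-watt proxy stays close to flat between generations. Any efficiency gain on real workloads would have to come from factors this arithmetic does not capture, such as sparsity support and the larger memory system.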
An MTIA accelerator is built around a grid of processing elements connected by an on-chip mesh network, with each generation using an eight by eight layout of 64 elements.[1][2] Each processing element contains two RISC-V cores, one of them carrying a vector extension, together with a set of fixed-function blocks that handle common operations such as matrix multiplication and the data movement around the embeddings.[1][3] Choosing the open RISC-V instruction set for control gave Meta flexibility to shape the cores around its own software rather than around a fixed commercial design.[1]
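Because both generations keep the same 64-element grid, the published peak figures can be broken down into a per-element, per-cycle rate. The sketch below does this under two assumptions that the sources do not state: that peak throughput is spread evenly across all 64 processing elements, and that one multiply-accumulate counts as two operations.

```python
NUM_PES = 64  # 8 x 8 grid in both generations


def ops_per_pe_per_cycle(peak_tops: float, clock_hz: float, num_pes: int = NUM_PES) -> float:
    """Peak INT8 operations each processing element would need to issue per clock cycle."""
    return peak_tops * 1e12 / (clock_hz * num_pes)


v1_ops = ops_per_pe_per_cycle(peak_tops=102.4, clock_hz=800e6)
next_ops = ops_per_pe_per_cycle(peak_tops=354.0, clock_hz=1.35e9)

clock_gain = 1.35e9 / 800e6        # improvement from the higher clock
per_cycle_gain = next_ops / v1_ops  # improvement from wider per-element compute

print(f"MTIA v1:  {v1_ops:7.0f} ops/PE/cycle (about {v1_ops / 2:.0f} MACs)")
print(f"next-gen: {next_ops:7.0f} ops/PE/cycle (about {next_ops / 2:.0f} MACs)")
print(f"clock gain {clock_gain:.2f}x * per-cycle gain {per_cycle_gain:.2f}x "
      f"= {clock_gain * per_cycle_gain:.2f}x")
```

Read this way, the roughly three and a half times gain in dense INT8 throughput splits into about 1.7 times from clock frequency and about 2 times from more work per element per cycle. This is an inference from the headline numbers, not a description of the actual datapath.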
Memory is organized as a hierarchy. Each processing element has a small amount of local storage. The chip carries a large pool of on-chip SRAM (128 MB on the first generation, 256 MB on the second), and off-chip LPDDR5 provides the bulk capacity (64 GB, then 128 GB).[1][2] That structure suits recommendation models well, because keeping more of the frequently used embedding data close to the compute reduces the trips to slower memory that otherwise dominate the workload.[1][8]
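A back-of-the-envelope calculation shows how much embedding data each tier could hold. The sketch below assumes decimal units, a hypothetical embedding width of 64 values stored in FP16, and that the whole tier is available for embeddings. None of these come from Meta's sources, and in practice the SRAM is shared with activations and weights.

```python
MB = 10**6  # assumption: decimal megabytes
GB = 10**9  # assumption: decimal gigabytes

EMBEDDING_DIM = 64   # hypothetical embedding width
BYTES_PER_VALUE = 2  # hypothetical FP16 storage

# Capacities from the comparison table.
TIERS = {
    "MTIA v1": {"sram": 128 * MB, "lpddr5": 64 * GB},
    "next-gen": {"sram": 256 * MB, "lpddr5": 128 * GB},
}


def rows_that_fit(capacity_bytes: int, dim: int = EMBEDDING_DIM,
                  bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Number of whole embedding rows that fit in a memory tier."""
    return capacity_bytes // (dim * bytes_per_value)


for chip, tiers in TIERS.items():
    sram_rows = rows_that_fit(tiers["sram"])
    dram_rows = rows_that_fit(tiers["lpddr5"])
    print(f"{chip}: {sram_rows:,} rows in SRAM, {dram_rows:,} rows in LPDDR5 "
          f"({sram_rows / dram_rows:.2%} of rows can sit on-chip)")

# Time to read the next-generation chip's full LPDDR5 once at the stated 204.8 GB/s.
full_sweep_seconds = (128 * GB) / (204.8 * GB)
print(f"full LPDDR5 sweep at 204.8 GB/s: {full_sweep_seconds:.3f} s")
```

Under these assumptions only a fraction of a percent of the rows can live on-chip, so the hierarchy pays off only when accesses are skewed toward a small hot set of embeddings.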
The chips do not run on their own. The next-generation design is packaged onto boards that go into a rack-scale system. Meta described a configuration of three chassis per rack, twelve boards per chassis, and two accelerators per board, for 72 accelerators in a single system serving production traffic.[2][9]
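The accelerator count follows from multiplying the three levels of the rack layout. The sketch below encodes that layout in a small, hypothetical `RackConfig` class and adds naive rack-level totals from the per-chip figures. Those totals are simple sums of per-chip peaks, not system-level numbers published by Meta, and they exclude host CPUs, networking, and cooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackConfig:
    """Rack layout described for the next-generation MTIA system."""
    chassis_per_rack: int = 3
    boards_per_chassis: int = 12
    accelerators_per_board: int = 2

    @property
    def accelerators(self) -> int:
        return self.chassis_per_rack * self.boards_per_chassis * self.accelerators_per_board


# Per-chip figures for the next-generation part, from the comparison table.
INT8_DENSE_TOPS = 354.0
LPDDR5_GB = 128
TDP_W = 90

rack = RackConfig()
total_tops = rack.accelerators * INT8_DENSE_TOPS
total_lpddr5_gb = rack.accelerators * LPDDR5_GB
total_tdp_w = rack.accelerators * TDP_W

print(f"accelerators per rack: {rack.accelerators}")
print(f"summed peak dense INT8: {total_tops:,.0f} TOPS")
print(f"summed LPDDR5 capacity: {total_lpddr5_gb:,} GB")
print(f"summed accelerator TDP: {total_tdp_w:,} W")
```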
Hardware is only useful if Meta's engineers and models can reach it without rewriting everything, so the MTIA effort has always been paired with a software stack built around PyTorch, Meta's open-source machine learning framework.[1][2] The aim is for model authors to keep working in PyTorch while a compiler and runtime map their code onto the accelerator.[1][7]
For the next generation, Meta leaned on Triton, an open-source language and compiler for writing GPU and accelerator kernels, building a Triton-MTIA backend that turns Triton code into instructions for the chip.[2][9] This lets the same high-level kernels target different hardware and frees engineers from hand-writing low-level code for every operation. Tying the silicon to a widely used framework also smooths the path for moving models between GPUs and MTIA inside the same fleet.[2][7]
MTIA started life squarely as a recommendation and ranking engine, and that remains its main job in production today.[2][4] But the boom in generative AI, including Meta's own Llama family of large language models, reshaped the company's priorities and its appetite for compute.[6][7] Meta has said it intends to broaden MTIA to handle more of these generative workloads over time, extending a chip first built for sparse recommendation math toward the dense transformer math that drives text and image generation.[2][9]
That ambition has limits worth stating plainly. The next-generation MTIA was widely read as an inference and internal-workload part rather than a competitor to Nvidia's top training GPUs such as the H100, and several reviewers stressed that it was not meant to be an Nvidia killer.[4] Meta continued to buy large quantities of GPUs for training even as it expanded MTIA, which fits the picture of custom silicon handling steady, well-understood inference while GPUs cover the frontier of training and experimentation.[4][6]
MTIA is one piece of a larger AI infrastructure that Meta has been building out, which also includes massive GPU clusters and the open hardware designs the company contributes through the Open Compute Project.[2][7] Within that mix, the accelerators are positioned as the efficient option for the recommendation and ranking work that runs constantly across Facebook and Instagram, leaving GPUs to absorb the heaviest training jobs and the fast-moving generative experiments.[2][4]
The broader context is a move by the largest cloud and platform companies to design their own AI chips. Google has shipped successive generations of its TPU since the mid-2010s, Amazon offers Trainium for training and Inferentia for inference, and Microsoft introduced its Maia accelerator in 2023.[5][6] The motivations rhyme across all of them, namely better efficiency on their own workloads, more control over the hardware and software stack, and less exposure to the cost and supply constraints of buying every chip from one vendor.[5][6] MTIA is Meta's entry in that group, distinctive for how tightly it is shaped around recommendation models and the embedding-heavy access patterns they create.[1][8]
MTIA shows that a company with a large and well-characterized internal workload can justify designing its own silicon, and that the design can pay off in efficiency when the chip and its compiler are built together for that workload.[3][8] The generation-to-generation jump, from 7 nm to 5 nm and from roughly 100 to over 350 dense INT8 TOPS, suggests the program is a sustained investment rather than a one-off experiment.[1][2][4]
The limits are equally clear. MTIA is internal hardware, not a product Meta sells, so its reach is bounded by Meta's own needs and its success is hard to measure from outside the company.[4][7] It targets inference and recommendation rather than the most demanding training, which keeps Meta dependent on GPUs for the leading edge of model development.[4][6] And designing custom chips carries real cost and schedule risk, including the long lead times of taping out new silicon and the work of keeping a software stack current with fast-changing models.[6][7] How well MTIA stretches to cover generative AI, and whether later generations close more of the gap with general-purpose accelerators, will shape how large a role it ends up playing in Meta's infrastructure.[2][4]