SambaNova SN40L
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,470 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,470 words
Add missing citations, update stale details, or suggest a clearer explanation.
The SambaNova SN40L is a reconfigurable dataflow AI accelerator designed by SambaNova Systems and unveiled on September 19, 2023.[1][2] Marketed as a fourth-generation Reconfigurable Dataflow Unit (RDU) and the first member of SambaNova's "Cerulean" architectural family, the SN40L combines two TSMC 5 nm logic dies, 64 GB of co-packaged HBM3, and up to 1.5 TB of direct-attached DDR5 in a three-tier memory hierarchy intended to run trillion-parameter generative models on a single 8-socket node.[1][3][4] The "40" in the product name denotes SambaNova's fourth chip generation, while the "L" indicates that the silicon is specifically tuned for large language model workloads.[5]
Where contemporary GPU accelerators such as the NVIDIA H100 rely on a thread-and-kernel programming model, the SN40L instead exposes a spatial mesh of 1,040 Pattern Compute Units (PCUs) and 1,040 Pattern Memory Units (PMUs) that the SambaFlow compiler configures into long, statically-scheduled pipelines.[4][6] Each socket reaches a peak of 638 BF16 TFLOPS and contains 102 billion transistors fabricated on TSMC's N5 node and integrated using 2.5D Chip-on-Wafer-on-Substrate (CoWoS-S) packaging.[7][4] At the system level, a SambaRack SN40L-16 chassis aggregates 16 RDU sockets to support models up to roughly five trillion parameters and a sequence length above 256k tokens, while drawing an average of just 10 kWh during inference, low enough to be air-cooled in conventional 19-inch racks.[3][8][9]
The SN40L has become the workhorse of SambaNova's pivot away from training: it underlies the SambaCloud public inference service launched on September 10, 2024,[10] the SambaNova Suite enterprise stack, and on-premises deployments at customers including SoftBank, Saudi Aramco, Analog Devices, Stanford University research groups, OTP Bank, the RIKEN Center for Computational Science, and U.S. Department of Energy national laboratories such as Argonne, Oak Ridge, and Lawrence Livermore.[11][12][13] The chip's most-cited public benchmark is its September 2024 world record of 132 output tokens per second on Meta's 405-billion-parameter Llama 3.1 model at full 16-bit precision, achieved on a single 16-socket SN40L node.[14][10]
SambaNova Systems was founded in Palo Alto, California, in 2017 by three Stanford-affiliated technologists: Rodrigo Liang, a former senior vice president for SPARC and other server processors at Sun Microsystems and Oracle; Stanford Cadence Design Professor of Electrical Engineering and Computer Science Kunle Olukotun, widely credited as the academic father of the multicore microprocessor through his 1990s work on the Hydra chip-multiprocessor at Stanford; and Christopher Ré, a Stanford computer-science associate professor and MacArthur Fellow whose work in the Stanford DAWN lab on weak supervision and learned systems shaped the company's software direction.[15][16][17] All three remain involved with the company, with Liang serving as chief executive, Olukotun as chief technologist, and Ré as a technical adviser.[15][17]
The company emerged from stealth in March 2018 with a $56 million Series A round led by Walden International and Google Ventures, raised a $150 million Series B led by Intel Capital in April 2019, a $250 million Series C led by BlackRock in February 2020, and a $676 million Series D led by SoftBank Vision Fund 2 in April 2021 that valued the firm at approximately $5.1 billion.[18] Total disclosed venture funding through the SN40L's launch was about $1.1 billion, making SambaNova one of the best-funded AI-silicon startups of its era alongside Cerebras and Graphcore.[18]
The SN40L is the fourth chip in SambaNova's RDU lineage. The first-generation SN10, introduced in 2020, established the dataflow architecture and was followed by the SN20 and the die-to-die SN30, which together carried the "Cardinal" architectural codename, a reference to Stanford's school colour.[5][19] The SN40L, by contrast, is the first member of a new family that SambaNova internally calls "Cerulean," and it is the first SambaNova chip to integrate on-package High Bandwidth Memory.[5][19]
The SN40L is a dataflow accelerator: instead of executing a stream of instructions through a fixed pipeline, it spatially maps each layer of a deep-learning graph onto a physical region of the chip and streams activations through that mapping in a producer-consumer fashion.[20][21] This is the same principle SambaNova's prior RDUs used, but the SN40L expands both the size of the spatial fabric and the reachable working set.
Each SN40L socket exposes a two-dimensional, packet-switched mesh of reconfigurable tiles built from two main element types.[4][6] Pattern Compute Units (PCUs) are configurable matrix engines that can be programmed at compile time to behave as either a systolic array (for general matrix multiplication) or as a pipelined SIMD lane with cross-lane reduction, supporting BF16, FP32 and INT32 numerics together with a tail-section that implements transcendental functions and stochastic rounding.[4] Pattern Memory Units (PMUs) are banked SRAM scratchpads of programmable depth, paired with their own address-generation ALU, banking-and-predication logic and a data-alignment unit that performs transpose and other tensor reshaping in-place without round-tripping through DRAM.[4]
The SN40L socket aggregates 1,040 PCUs and 1,040 PMUs distributed across two identical accelerator dies that are joined edge-to-edge via the 2.5D CoWoS interposer, yielding the published peak of 638 BF16 TFLOPS per socket.[4][7] Compared with the prior-generation SN30, the SN40L contains about 18.8 percent fewer functional units but achieves comparable raw throughput thanks to the 7 nm-to-5 nm shrink and an approximately 12 percent clock-speed lift.[22]
PCUs and PMUs are stitched together by a Reconfigurable Dataflow Network (RDN): a mesh-based packet-switched interconnect with three physically separate fabrics carrying vector data, scalar data, and control packets respectively.[4] The RDN supports multi-cast routing, dynamic packet ordering using sequence identifiers, and is what allows the compiler to fuse hundreds of operators into a single dataflow kernel without committing intermediate tensors to off-chip memory.[4][21]
To bridge the on-chip fabric to the rest of the system, the SN40L incorporates Address Generation and Coalescing Units (AGCUs) that mediate accesses to local HBM and DDR, host memory across PCIe, and to other RDUs in a node.[4] The AGCU implements a peer-to-peer protocol enabling direct, point-to-point reads and writes between sockets without traversing host memory or the DDR tier, which SambaNova reports is essential for sharding very large models across the 16 sockets of a SambaRack.[4]
The architectural payoff of the dataflow model is a programming style SambaNova calls streaming spatial kernel fusion. Because the compiler can lay an entire transformer decoder block (and often whole sequences of blocks) onto the fabric as a single dataflow graph, the chip avoids the per-kernel launch overhead, register-file traffic, and on-chip-cache pressure that limits GPU kernel fusion. SambaNova's own measurements report 2× to 13× speed-ups on individual operator fusion microbenchmarks versus a hand-tuned GPU baseline.[4]
The SN40L's three-tier memory was, in turn, designed to make a model-serving pattern that SambaNova calls Composition of Experts (CoE) practical: large numbers of independently-trained dense models (each typically 7 billion to 70 billion parameters) are stored in DDR, paged into HBM on demand, and routed per-prompt by a lightweight gating model that decides which "expert" model to invoke.[4][23] This is distinct from the more familiar Mixture of Experts (MoE) approach, in which a single trained network contains sparsely-activated expert sublayers; in CoE, the experts are entirely separate, separately-trained models, and SambaNova reports its Samba-1 product packaged more than fifty such experts into a 1.3-trillion-parameter logical model.[23][24]
The SN40L's defining hardware feature is a three-tier memory hierarchy that combines large on-chip SRAM, on-package HBM3, and direct-attached DDR5 within a single socket.[4][25] Earlier SambaNova RDUs had only SRAM and DDR; the addition of HBM was the principal motivation for the "L" variant's existence.[25]
| Tier | Capacity per socket | Implementation | Aggregate bandwidth |
|---|---|---|---|
| On-chip SRAM | 520 MiB | Distributed across 1,040 PMUs | Hundreds of TB/s |
| HBM3 | 64 GiB | Co-packaged via CoWoS-S interposer | ~2 TB/s |
| DDR5 DRAM | up to 1.5 TiB | Pluggable DIMMs, direct-attached | >1 TB/s aggregate |
Source: SambaNova arXiv whitepaper (May 2024).[4]
The 520 MiB of on-die SRAM in each SN40L socket is, at the time of the chip's launch, the largest on-chip memory in any commercial dataflow or AI ASIC other than wafer-scale designs such as the Cerebras WSE-3.[25][4] It is not arranged as a single L1 or L2 cache; instead, each PMU contains a programmable scratchpad bank that the SambaFlow compiler explicitly partitions among intermediate tensors and weights.[4] This high-density, banked SRAM is what makes streaming kernel fusion viable: rather than spilling intermediate activations to HBM between operators (as a GPU must), the SN40L can keep them resident in the local PMU bank that produced them, then forward them via the RDN to the next consuming PCU on the next clock-domain.[4][21]
Above the SRAM sits 64 GiB of HBM3 integrated onto the package via the 2.5D CoWoS-S interposer.[4][7] In SambaNova's own framing, HBM acts as a large L3 cache rather than as primary model storage; the company architects deployments so that HBM holds the activations and currently active expert weights for a workload, while persistent storage of model weights happens in DDR.[22][25] HBM3 was a deliberate choice for the SN40L: SambaNova's earlier chips used larger but slower DDR-only configurations, and the addition of HBM was driven by the bandwidth requirements of large-language-model decode steps in which the entire context tensor must be re-read every token.[25]
The third tier is up to 1.5 TiB of direct-attached DDR5 DRAM per socket in pluggable DIMM form factor.[4][3] This is an unusually large attached-DRAM capacity for any AI accelerator, and it is the architectural pivot point of SambaNova's CoE pitch: at 1.5 TB per socket and twelve sockets per typical eight-RDU DataScale node plus host, a single rack can hold hundreds of full-precision dense models without paging from disk.[4][25] SambaNova reports that the DDR-to-HBM transfer rate exceeds 1 TB/s in aggregate within a single SN40L node, which it cites as the key enabler for sub-millisecond model swap latency in CoE serving.[4]
The hierarchy as a whole is what SambaNova has marketed as a solution to the "AI memory wall": by trading some peak FLOPS density for more on-package and direct-attached capacity, the SN40L can keep the entire weight set of even multi-hundred-billion-parameter dense LLMs within a single node, eliminating the cross-node communication that bottlenecks GPU deployments at similar parameter counts.[4]
The SN40L is fabricated on TSMC's N5 (5 nm) process and integrated using TSMC's 2.5D CoWoS-S advanced packaging.[7][25] Each socket contains two identical 600 mm² accelerator dies plus the HBM3 stacks on a common silicon interposer.[7] SambaNova reports a per-socket transistor count of approximately 102 billion, comparable in transistor density to NVIDIA's contemporaneous data-center GPUs, with 520 MB of on-chip SRAM implemented in high-density memory cells.[7]
The published headline performance of each socket is 638 BF16 TFLOPS, with the same units also citing 640 BF16 TFLOPS and an FP16 figure of approximately 688 TFLOPS depending on operating mode.[4][7][26] The chip notably lacks dedicated low-precision accelerators for INT8 and FP8, an architectural choice consistent with SambaNova's emphasis on inference at "no-quantization" BF16 / FP16 / mixed precision for accuracy-sensitive enterprise workloads.[26][14] Estimated thermal-design power for the accelerator socket is reported at roughly 600 W in third-party coverage, though SambaNova itself emphasises rack-level power: a 16-socket SambaRack averages about 10 kW under inference load, an order of magnitude lower than typical 140-kW NVIDIA HGX racks of similar parameter capacity.[8][26]
The SN40L is not sold as a discrete chip but as part of integrated rack systems originally branded SambaNova DataScale and, since 2024, increasingly branded SambaRack.[27][9]
The canonical building block is the DataScale SN40L-2 module, which integrates two RDUs together with their host components.[27] Four such modules combine into the eight-RDU SN40L node, the company's reference unit for serving a five-trillion-parameter Composition-of-Experts workload.[3][9] The largest standard configuration is the SambaRack SN40L-16, a single 19-inch rack containing 16 RDU sockets across eight SN40L-2 modules, which is the platform SambaNova quotes for its world-record Llama 3.1 405B and DeepSeek-R1 671B inference benchmarks.[9][14][28]
A SambaRack SN40L-16 therefore aggregates approximately 8 GB of on-chip SRAM, 1 TB of HBM3, and up to 24 TB of attached DDR5 across its 16 sockets, sufficient to host hundreds of distinct foundation models simultaneously.[27][4] Air cooling at ~10 kW per rack allows the systems to be deployed in standard enterprise data-center facilities without specialised liquid-cooling retrofits.[8][29]
The SN40L would be unusable without its compiler stack, because there is no public ISA for the RDU; users do not write CUDA-style kernels but rather submit standard model graphs that SambaFlow lowers to RDU configuration bitstreams.[30][31]
SambaFlow is the core compiler and runtime. It ingests model definitions from PyTorch or TensorFlow, traces them into a dataflow graph, and produces a binary "PEF" file that contains the spatial mapping of every operator onto specific PCUs and PMUs as well as the routing configuration for the RDN.[30] Within SambaFlow, the compiler performs operator fusion, tiling, weight partitioning and inter-socket sharding automatically; SambaNova emphasises that an entire transformer decoder layer is typically compiled as a single kernel call, eliminating the per-kernel launch overhead that limits GPU performance on long generation sequences.[30][4]
SambaStudio is a graphical, browser-based platform layered on top of SambaFlow that gives data scientists a model-management workflow: dataset upload, fine-tuning, deployment, and inspection of running model endpoints without leaving the GUI.[30][32]
SambaCloud (originally announced as "SambaNova Cloud") is the public-internet inference service launched on September 10, 2024.[10] It exposes Meta's Llama-family open-source models, DeepSeek-R1, and others via an OpenAI-compatible REST API hosted on SambaRack SN40L-16 nodes.[10][14] SambaCloud comes in Free, Developer, and Enterprise tiers, with the Enterprise tier offering a SambaNova-managed instance of the cloud stack deployable inside a customer's own data centre under what SambaNova calls SambaManaged, which was launched in July 2025.[33] An adjacent product, the original SambaNova Suite announced on February 28, 2023, is the full-stack offering combining DataScale hardware, SambaFlow and a curated set of pre-trained open-source models for enterprise on-premises deployment.[34]
The SN40L's most heavily-publicised benchmarks have been generative-model token-throughput records measured at full 16-bit precision on standard open-source LLMs.
On Llama 3.1 405B, a 16-socket SambaRack SN40L-16 first reached 132 output tokens/second/user at SambaCloud's September 2024 launch, then 129 output tokens/second/user in a later SambaNova publication, and is reported in the company's blog and arXiv papers at peak rates exceeding 100 tokens/s/user across batch sizes up to four concurrent requests.[14][10][4] A separate Artificial Analysis-verified test reported 114 tokens/s on the same model.[14] At the time these results were published, the fastest comparable GPU-based service was measured by Artificial Analysis at approximately 72 tokens/s on the same model.[14]
On Llama 3.1 70B, the same 16-socket node reached 457 to 461 output tokens/second, while on the smaller Llama 3.1 8B SambaNova has cited rates above 1,042 tokens/second.[14][10][35] All three measurements are reported at native BF16 precision without quantisation.[14]
For the Composition of Experts workload pattern that the SN40L was specifically architected to accelerate, SambaNova's arXiv whitepaper reports CoE serving speed-ups of 3.7× over an NVIDIA DGX H100 and 6.6× over an NVIDIA DGX A100 in their own measurements, together with 15× to 31× faster model-switching latency versus the same DGX systems and up to 19× lower machine footprint for the same aggregate parameter count.[4]
The SN40L has not, as of the time of writing, posted official entries to the MLPerf Inference suite; this is in line with SambaNova's stated preference for end-to-end token-rate benchmarks on real LLMs over the historically training-centric MLPerf metrics.
Public customer references for the SN40L span enterprise, sovereign and scientific deployments.
SoftBank has been a multi-generation customer and was, separately, the lead investor in SambaNova's Series D in 2021.[18][11] SoftBank has used SambaNova DataScale systems as part of its Japan-based AI computing platform and was named as the first announced deployment partner for SambaNova's next-generation SN50 RDU.[11]
Saudi Aramco, the Saudi national oil company, signed a memorandum of understanding with SambaNova to deploy on-site SN40L systems for Metabrain, an internal large language model trained on roughly 90 years of Aramco's operational and exploration data and used for industrial AI applications.[12]
Analog Devices announced on January 10, 2024 that it would deploy the SambaNova Suite enterprise-wide to support generative AI applications across the global semiconductor company, making ADI one of the first publicly-named industrial enterprise users of the SN40L platform.[36]
In the U.S. national-laboratory complex, Argonne National Laboratory announced on November 18, 2024 (at the SC24 supercomputing conference in Atlanta) that the Argonne Leadership Computing Facility had deployed a new 16-RDU SambaNova DataScale SN40L cluster as part of its AI Testbed, available to the U.S. scientific community via project allocations and the National AI Research Resource Pilot.[13] Argonne uses the system to support inference for projects including the AuroraGPT foundation model and applications in drug discovery, climate science and brain mapping.[13] Oak Ridge and Lawrence Livermore National Laboratories are also listed by SambaNova as customers, along with Japan's RIKEN Center for Computational Science, OTP Bank, Accenture and NetApp.[12]
OVHcloud, the French sovereign-cloud provider, announced in late 2025 that it had selected SambaNova RDUs to power its AI Endpoints inference service.[37]
SambaNova has consistently declined to publish unit pricing for either the SN40L chip or the DataScale and SambaRack systems built around it; per-system prices are quoted privately and have been described in trade-press reporting as ranging from approximately $500,000 to several million dollars depending on the configuration.[22][38]
The company offers three main commercial constructs:
Beginning in July 2025, SambaManaged added a turnkey hybrid model in which SambaNova operates a private SambaCloud instance inside the customer's own facility, blending the security profile of on-premises hardware with the operational economics of cloud consumption pricing.[33]
The SN40L was launched into a market that subsequently shifted decisively in favour of inference-only workloads dominated by NVIDIA, and SambaNova's business model was restructured in response.
In April 2025, SambaNova laid off 77 employees, about 15 percent of its global workforce, in conjunction with a strategic pivot away from training workloads and toward an inference-cloud business model. The company filed two WARN notices in California and Washington.[39][40] Chief executive Rodrigo Liang described the move in subsequent interviews as a response to the realisation that "the industry has shifted to primarily inference" and that the AI inference market would dwarf the training market in dollar volume.[41]
Subsequent reporting from October and December 2025 indicated that SambaNova had retained an investment bank and entered exclusive acquisition negotiations with Intel at an enterprise value of approximately $1.6 billion including debt - a steep mark-down from the $5.1 billion 2021 valuation.[42][43] The deal ultimately did not close. Instead, on February 24, 2026, SambaNova announced a $350 million Series E funding round co-led by Vista Equity Partners and Cambium Capital with continued participation from Intel Capital, the Qatar Investment Authority, Saudi Arabia's sovereign wealth fund, GV, Battery Ventures, T. Rowe Price and BlackRock.[44][45] The round implies a post-money valuation of approximately $2.2 billion, materially below the 2021 peak.[44]
The same February 2026 announcement introduced the SN50, SambaNova's next-generation RDU and the successor to the SN40L. The SN50 retains the same three-tier memory philosophy with 64 GB of HBM, 432 MB of on-chip SRAM (slightly smaller than the SN40L's 520 MB) and 256 GB to 2 TB of DDR5 per socket, while delivering an advertised 5× peak performance and 4× network bandwidth relative to the SN40L.[45][46] SambaNova has stated that the SN50 will begin shipping in the second half of 2026, with SoftBank named as the first customer for SN50-powered low-latency inference services in Asia-Pacific.[45][46] An Intel collaboration combining Xeon CPUs with SambaNova RDUs was announced simultaneously.[44]
For at least the next product cycle, the SN40L will remain the chip running both SambaCloud's public endpoints and the great majority of installed customer systems, while the SN50 ramps to volume production.
Trade-press reception of the SN40L at launch was broadly positive, with reviewers focusing on three architectural choices that distinguish it from contemporary GPU and ASIC competitors.
First, the three-tier memory hierarchy was widely highlighted as the chip's defining contribution. ServeTheHome described the system as the first commercial accelerator to combine large on-chip SRAM, on-package HBM and direct-attached pluggable DDR in a single coherent address space.[25] The Next Platform noted that this architecture amounts to an explicit rejection of the GPU industry's assumption that AI weight sets must fit in HBM, and observed that the SN40L's eight-socket node can hold approximately 71 separate Llama-2-70B-class models simultaneously without paging.[22]
Second, the streaming dataflow programming model was characterised by analysts as both the chip's central strength and its central commercial risk, since the SambaFlow compiler is the only path to using the silicon and the chip cannot run hand-written CUDA-style kernels.[21][31] SambaNova has argued in response that the compiler-only model is precisely what allows automatic spatial fusion of arbitrarily complex graphs and that customers benefit from being insulated from low-level optimisation work.[4][21]
Third, analysts noted the chip's emphasis on full-precision inference. By focusing the compute units on BF16 and FP32 (with no first-class INT8 or FP8 path), the SN40L positions itself for enterprise and scientific customers who want to deploy open-source models without the accuracy regressions associated with aggressive quantisation, a positioning that has been validated by its choice of Llama 3.1 405B "at full 16-bit precision" as its headline benchmark.[14][26]
Among the SN40L's most direct architectural competitors, the Groq LPU targets the same low-latency inference market with a different trade-off, prioritising deterministic single-stream latency over high on-package memory capacity; the Cerebras WSE-3 takes the opposite extreme to SambaNova by integrating an entire wafer of compute and SRAM with no on-package HBM; and the Etched Sohu takes the most extreme position by hard-wiring transformer inference into the silicon. The SN40L sits between these in design space: a programmable spatial dataflow with a balanced multi-tier memory rather than either fully model-specific silicon or a single-tier memory architecture.