Tensor parallelism (TP) is a distributed training technique that partitions individual weight tensors of a neural network across multiple devices, with each device computing a partial result that is then combined through collective communication operations such as AllReduce, AllGather, or ReduceScatter. Along with data parallelism and pipeline parallelism, tensor parallelism is one of the three foundational forms of parallelism used to train and serve modern large language models. Whereas data parallelism replicates the entire model on each device and shards the input batch, and pipeline parallelism assigns different layers to different devices, tensor parallelism slices a single layer's weight matrices along their rows or columns and distributes those slices across devices that operate on the same input simultaneously [1].
The technique was popularized by the Megatron-LM paper of Shoeybi and colleagues at NVIDIA in 2019, which introduced an elegant scheme for partitioning the linear projections inside transformer attention and feed-forward blocks in a way that requires only a small number of AllReduce operations per layer [1]. Megatron-LM tensor parallelism remains the canonical implementation. Its conceptual descendants appear inside virtually every modern training and inference framework, including DeepSpeed, FairScale, NVIDIA NeMo, PyTorch DTensor, JAX shard_map, vLLM, TensorRT-LLM, and SGLang. Tensor parallelism is the primary lever that makes it possible to fit a single large transformer layer onto multiple GPUs when no individual accelerator has enough memory or compute to host it alone, and it is a critical ingredient in the 3D parallelism strategies used to train trillion-parameter models on tens of thousands of GPUs [2].
The defining tradeoff of tensor parallelism is communication intensity. Because each forward pass and each backward pass through a TP-sharded layer triggers a collective AllReduce on activations whose size scales with batch size, sequence length, and hidden dimension, tensor parallelism only delivers good utilization when the devices share extremely high-bandwidth interconnects. In practice this means tensor parallelism is almost always confined to a single server with NVLink, with 8-way TP being the most common configuration on NVIDIA HGX and DGX nodes equipped with H100, H200, or B200 GPUs. The arrival of NVLink Switch fabrics on the GB200 NVL72 platform extends practical tensor parallelism to 16-way, 32-way, or even 72-way configurations by keeping all GPUs in a single rack on the same NVLink domain at 1.8 TB/s of NVLink bandwidth per GPU [3]. Outside of these high-bandwidth domains, tensor parallelism typically gives way to pipeline parallelism for inter-node sharding and to data parallelism for replication across many nodes.
Tensor parallelism solves the problem that the parameters of a single transformer layer can be too large for one accelerator. A standard transformer decoder layer contains four major weight matrices: the query, key, and value projections (often fused into a single QKV matrix), the output projection, and the two feed-forward network (FFN) matrices that expand and contract the hidden dimension, typically by a factor of four. For a model with hidden dimension 12,288 (the size used in GPT-3 175B), each FFN matrix alone is roughly 12,288 by 49,152, occupying about 1.2 GB in bfloat16 and at least triple that once optimizer state is included. Across many such layers and intermediate buffers, even an 80 GB H100 is quickly exhausted. Tensor parallelism shards each weight matrix across multiple GPUs so no single device holds the full matrix.
Before Megatron-LM, the dominant scaling approach beyond one device was data parallelism, which replicates the entire model on each device. It scales well as long as the model fits on one device but does nothing to address per-device memory. Earlier forms of model parallelism included naive layer-wise partitioning (an unoptimized form of pipeline parallelism that left most devices idle) and ad hoc partitioning of specific operations like embedding tables. None of these gave a clean recipe for sharding the dense linear layers that dominate transformer compute. Megatron-LM's contribution was to recognize that the transformer block's structure lends itself to a column-then-row sharding pattern that minimizes communication points and can be expressed with two simple primitives [1].
The Megatron-LM tensor parallelism scheme partitions both the attention block and the FFN block in a way that requires exactly one AllReduce per block in the forward pass and one in the backward pass. The total communication cost is therefore four AllReduces per transformer layer, two for attention and two for the FFN, which scales linearly with the number of layers and is independent of the tensor-parallel degree at the per-layer level (although AllReduce latency itself grows with TP degree). The scheme uses two operators, conventionally called f and g, that are conjugates of each other [1]. The f operator is the identity in the forward pass and an AllReduce in the backward pass, while the g operator is an AllReduce in the forward pass and the identity in the backward pass. These two operators are inserted at the boundaries of the sharded region; everything in between can be computed locally on each device without communication.
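To make the two operators concrete, they can be written as a pair of autograd functions. The sketch below is a simplified PyTorch rendering of the idea rather than Megatron-LM's actual code; process-group setup is assumed to have happened elsewhere, and the default process group stands in for a dedicated tensor-parallel group.

```python
import torch
import torch.distributed as dist


class CopyToTensorParallelRegion(torch.autograd.Function):
    """The f operator: identity in the forward pass, AllReduce in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        grad = grad_output.clone()
        dist.all_reduce(grad)   # sum the partial input gradients across TP ranks
        return grad


class ReduceFromTensorParallelRegion(torch.autograd.Function):
    """The g operator: AllReduce in the forward pass, identity in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        x = x.clone()
        dist.all_reduce(x)      # sum the partial outputs across TP ranks
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output
```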
The building blocks of Megatron-style tensor parallelism are two specialized linear layers. The column-parallel linear layer shards the weight matrix along its output (column) dimension. If input X is replicated across all TP ranks and weight W is split into column blocks W = [W_1, ..., W_n], each rank computes its own output slice Y_i = X * W_i locally, so the full output Y = [Y_1, ..., Y_n] is sharded along the output dimension and never materialized on any one rank. The column-parallel layer requires no forward-pass communication because each rank only needs X. In the backward pass, each rank produces a partial gradient with respect to the input, and these partial gradients are summed across ranks via AllReduce. This is the f operator: identity forward, AllReduce backward.
The row-parallel linear layer shards the weight matrix along its input (row) dimension. If input X is already sharded along its column dimension (matching the output of a preceding column-parallel layer) and W is split into row blocks, each rank computes Y_i = X_i * W_i locally. The partial outputs must be summed via AllReduce in the forward pass. In the backward pass the AllReduce becomes an identity. This is the g operator: AllReduce forward, identity backward.
The trick that makes Megatron-LM communication-efficient is to chain a column-parallel layer with a row-parallel layer back-to-back. The output of the column-parallel layer is naturally sharded in exactly the layout that the row-parallel layer expects, so any element-wise computation between them runs locally. Only at the end of the row-parallel layer does the result need to be combined via AllReduce.
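A minimal sketch of the two layers follows, reusing the f and g operator classes from the previous example; bias terms, weight initialization, and explicit tensor-parallel process groups are omitted, and the class names simply echo the Megatron-LM convention.

```python
import torch
import torch.nn as nn
import torch.distributed as dist


class ColumnParallelLinear(nn.Module):
    """Shards the weight along the output (column) dimension; no forward communication."""

    def __init__(self, in_features, out_features):
        super().__init__()
        tp = dist.get_world_size()
        assert out_features % tp == 0
        self.weight = nn.Parameter(torch.empty(out_features // tp, in_features))

    def forward(self, x):
        x = CopyToTensorParallelRegion.apply(x)   # f: identity forward, AllReduce backward
        return x @ self.weight.t()                # output is sharded along its last dimension


class RowParallelLinear(nn.Module):
    """Shards the weight along the input (row) dimension; AllReduce on the output."""

    def __init__(self, in_features, out_features):
        super().__init__()
        tp = dist.get_world_size()
        assert in_features % tp == 0
        self.weight = nn.Parameter(torch.empty(out_features, in_features // tp))

    def forward(self, x_shard):
        partial = x_shard @ self.weight.t()                    # local partial sum
        return ReduceFromTensorParallelRegion.apply(partial)   # g: AllReduce forward
```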
The transformer FFN consists of two linear layers with a nonlinearity (GeLU, ReLU, or SwiGLU) between them, projecting the hidden dimension up to an intermediate dimension (typically four times the hidden size) and back down. Megatron-LM applies the column-parallel pattern to the up-projection and the row-parallel pattern to the down-projection. Each rank holds a column slice of the up-projection weight, and the element-wise nonlinearity runs locally because it is independent across columns. The down-projection weight is row-sharded to match the column-sharded activation, and a single AllReduce at the end combines the partial outputs into the full FFN output [1]. The forward pass needs one AllReduce (the g operator at the end of the down-projection); the backward pass needs one (the f operator at the start of the up-projection).
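Composing the two layers gives the tensor-parallel FFN just described. The sketch below assumes the ColumnParallelLinear and RowParallelLinear classes from the previous example and an illustrative 4x expansion factor.

```python
import torch.nn as nn
import torch.nn.functional as F


class ParallelMLP(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.up = ColumnParallelLinear(hidden, 4 * hidden)   # column-sharded up-projection
        self.down = RowParallelLinear(4 * hidden, hidden)    # row-sharded down-projection

    def forward(self, x):
        h = F.gelu(self.up(x))   # element-wise GeLU runs locally on the sharded activation
        return self.down(h)      # the single forward AllReduce happens inside `down`
```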
The attention block follows the same column-then-row pattern, with the natural unit of sharding being the attention head. Megatron-LM partitions the QKV projections in column-parallel fashion, with each rank receiving a complete subset of attention heads. Each rank then performs the entire attention computation for its own heads locally, including softmax, the attention-by-value matmul, and any per-head bias. The output projection W_O is row-parallel, matching the column-sharded layout of concatenated head outputs. A single AllReduce at the end combines the partial outputs.
This head-aligned partitioning has a useful consequence: as long as the number of attention heads is divisible by the TP degree, no communication is needed inside the attention computation. The softmax never crosses a device boundary because each head lives entirely on one rank. This is why head-aligned TP scales gracefully up to TP=8 or TP=16 in modern transformers, which typically have 32, 64, 96, or 128 heads. For grouped-query attention (GQA) and multi-query attention (MQA) variants, where multiple query heads share fewer KV heads, the partitioning is more delicate: KV heads must be replicated across some TP ranks if there are fewer KV heads than TP ranks.
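The divisibility bookkeeping can be made explicit with a few lines of arithmetic; the helper below is purely illustrative and the head counts in the examples are hypothetical.

```python
def heads_per_rank(num_q_heads: int, num_kv_heads: int, tp: int):
    """Return (query heads per rank, KV heads per rank, KV replication factor)."""
    assert num_q_heads % tp == 0, "query heads must divide evenly across TP ranks"
    q_per_rank = num_q_heads // tp
    if num_kv_heads >= tp:
        assert num_kv_heads % tp == 0
        return q_per_rank, num_kv_heads // tp, 1
    # Fewer KV heads than ranks: each KV head is replicated on tp // num_kv_heads ranks.
    assert tp % num_kv_heads == 0
    return q_per_rank, 1, tp // num_kv_heads


print(heads_per_rank(64, 8, 8))    # (8, 1, 1): GQA with exactly one KV head per rank
print(heads_per_rank(64, 8, 16))   # (4, 1, 2): each KV head replicated on two ranks
```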
The input embedding table and the output projection to vocabulary logits are the other large parameter matrices that benefit from tensor parallelism. Megatron-LM partitions the embedding table along the vocabulary dimension, with each rank holding a vocabulary slice. The forward pass for the embedding lookup is a local lookup followed by an AllReduce to combine results across ranks. The output logits are computed using the same column-parallel matmul, and the cross-entropy loss is implemented in a sharded form that avoids materializing the full vocabulary distribution on any one rank.
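A forward-pass-only sketch of a vocabulary-sharded embedding lookup is shown below, under the assumption that each rank owns a contiguous slice of the vocabulary; the conjugate backward communication and the sharded cross-entropy loss are omitted.

```python
import torch
import torch.nn as nn
import torch.distributed as dist


class VocabParallelEmbedding(nn.Module):
    """Each rank owns rows [vocab_start, vocab_start + shard) of the embedding table."""

    def __init__(self, vocab_size, hidden):
        super().__init__()
        tp = dist.get_world_size()
        assert vocab_size % tp == 0
        self.shard = vocab_size // tp
        self.vocab_start = dist.get_rank() * self.shard
        self.weight = nn.Parameter(torch.empty(self.shard, hidden))

    def forward(self, token_ids):
        local_ids = token_ids - self.vocab_start
        owned = (local_ids >= 0) & (local_ids < self.shard)   # tokens this rank owns
        rows = self.weight[local_ids.clamp(0, self.shard - 1)]
        rows = rows * owned.unsqueeze(-1)                      # zero out foreign tokens
        dist.all_reduce(rows)                                  # other ranks fill them in
        return rows
```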
The original Megatron-LM scheme leaves several operations unparallelized: the layer-norms before and after each block, the dropouts, and the residual additions. These operations are computationally cheap, so the FLOPs cost of replicating them is negligible, but they require activations of full sequence length and full hidden dimension on every TP rank, contributing significantly to activation memory. Korthikanti and colleagues at NVIDIA introduced sequence parallelism in 2022 as an extension that parallelizes these previously replicated regions along the sequence dimension [4]. Layer-norm, dropout, and residual addition are independent across sequence positions, so partitioning the activation tensor along the sequence dimension lets each rank work on a different chunk without any extra communication for these operations themselves.
Sequence parallelism modifies the f and g operators: f becomes an AllGather along the sequence dimension in the forward pass and a ReduceScatter in the backward pass, while g becomes a ReduceScatter forward and an AllGather backward. The total communication volume is unchanged because an AllReduce decomposes into a ReduceScatter followed by an AllGather; the new scheme simply splits each AllReduce into its two components and places them at different points in the dataflow [4]. The result is that the layer-norm, dropout, and residual regions hold a sequence-sharded activation that is 1/N the size of the original. For long-context training where activations dominate memory, sequence parallelism, combined with the selective activation recomputation introduced in the same work, reduces activation memory by approximately 5x with no change in communication cost and cuts the time spent in activation recomputation by over 90 percent [4]. Most modern Megatron-derived frameworks enable sequence parallelism by default when tensor parallelism is in use.
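The decomposition that sequence parallelism relies on can be checked numerically. The snippet below is a standalone sanity check (launched with torchrun, one process per GPU, single node assumed), not part of any framework.

```python
import torch
import torch.distributed as dist

# Run with: torchrun --nproc_per_node=<N> this_script.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())   # single-node assumption: rank == local device index
world = dist.get_world_size()

x = torch.randn(world * 4, 8, device="cuda")

# Path 1: a single AllReduce.
a = x.clone()
dist.all_reduce(a)

# Path 2: ReduceScatter into a 1/world-sized shard, then AllGather the shards back.
shard = torch.empty(4, 8, device="cuda")
dist.reduce_scatter_tensor(shard, x)
b = torch.empty_like(x)
dist.all_gather_into_tensor(b, shard)

torch.testing.assert_close(a, b)   # both paths produce the same summed tensor
```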
The table below summarizes the four most common forms of parallelism used in modern LLM training, highlighting where each shines and what its main bottleneck is.
| Parallelism type | What is sharded | What is replicated | Communication per layer | Bandwidth requirement | Typical scope |
|---|---|---|---|---|---|
| Data parallelism | Input batch | Full model on every device | One AllReduce of gradients per training step (not per layer) | Moderate, tolerates Ethernet or InfiniBand across nodes | Across many nodes, often hundreds to thousands |
| Tensor parallelism | Weight matrices within a layer | Input batch and full sequence | Two AllReduces per transformer layer in forward and two in backward | Very high, NVLink within a server or NVLink Switch within a rack | Within a single node, typically TP=2, 4, 8, or 16 |
| Pipeline parallelism | Layers across devices | None (each device hosts a different layer subset) | Point-to-point send/receive of activations between adjacent stages | Moderate, tolerates InfiniBand across nodes | Across nodes, with micro-batching to hide stage latency |
| Sequence parallelism | Activations along sequence dimension in non-TP regions | Weights are still TP-sharded | AllGather and ReduceScatter (replaces AllReduce of TP) | Same as TP, requires NVLink | Always combined with TP |
Data parallelism is the simplest to implement and scales well across loosely coupled nodes because gradient AllReduce only happens once per training step and can be overlapped with the backward pass [2]. Its limitation is that the model and optimizer state must fit on a single device, which is impractical for models above roughly 10 to 20 billion parameters. Sharded data parallelism (FSDP, ZeRO-3, HSDP) addresses this by partitioning optimizer state, gradients, and parameters across data-parallel ranks, but pays for the savings with additional AllGather operations.
Pipeline parallelism partitions the model by layer, with each device holding a contiguous block. Its attractions are low communication volume and tolerance for slower inter-node links. Its drawback is the pipeline bubble, the time during which some stages are idle waiting for upstream activations or downstream gradients. Schedules such as 1F1B, interleaved 1F1B, and zero-bubble pipelines reduce but do not eliminate the bubble. Pipeline parallelism is the standard choice for sharding across nodes when intra-node bandwidth has been saturated by tensor parallelism.
Tensor parallelism complements both. Within a high-bandwidth node it cuts per-device memory and compute by the TP degree, allowing models too large for one device to run at full per-rank utilization. It benefits more than the other strategies from interconnect upgrades. The general design pattern for very large models is to use tensor parallelism within a node, pipeline parallelism across nodes, and data parallelism across pipeline replicas, a configuration known as 3D parallelism [2].
Tensor parallelism is more sensitive to interconnect bandwidth than any other common parallelism strategy. Each transformer layer triggers two AllReduces in the forward pass and two in the backward pass, with payload sizes proportional to batch_size * sequence_length * hidden_dimension * bytes_per_element. For a model with hidden dimension 8,192 running with batch size 4 and sequence length 8,192 in bfloat16, each AllReduce moves 512 MB. Across 80 layers and four AllReduces per layer, that is 160 GB of TP communication per iteration, easily enough to dominate iteration time on a slow interconnect.
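The arithmetic behind these figures is simple enough to reproduce directly; the snippet below just evaluates it for the hypothetical configuration quoted above.

```python
batch, seq, hidden, bytes_per_el = 4, 8192, 8192, 2   # bfloat16 activations
layers, allreduces_per_layer = 80, 4

payload = batch * seq * hidden * bytes_per_el         # bytes per AllReduce
total = payload * layers * allreduces_per_layer       # TP traffic per iteration

print(payload / 2**20)   # 512.0  -> the 512 MB per-AllReduce payload quoted above
print(total / 2**30)     # 160.0  -> the 160 GB of TP communication per iteration
```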
The table below maps tensor-parallel degree to typical hardware substrates and rough bandwidth budgets in 2026.
| Hardware substrate | Bandwidth per GPU | Practical TP degree | Notes |
|---|---|---|---|
| PCIe Gen 4 / Gen 5 | 64 / 128 GB/s | TP=2, marginal at TP=4 | Insufficient for serious LLM training, mostly used for inference of small models |
| NVLink 3 (A100, 12 links) | 600 GB/s aggregate per GPU | TP=8 within DGX A100 | Standard for GPT-3 era training, dominant baseline for tensor parallelism |
| NVLink 4 (H100/H200, 18 links) | 900 GB/s aggregate per GPU | TP=8 within HGX H100/H200 | Doubles A100 bandwidth, enables longer sequences and larger micro-batches |
| NVLink 5 (B200, 18 links) | 1.8 TB/s aggregate per GPU | TP=8 within HGX B200 | Doubles H100 bandwidth again, supports trillion-parameter dense models |
| NVLink Switch (GB200 NVL72) | 1.8 TB/s per GPU across 72 GPUs | TP=16, 32, 64, or 72 within rack | First substrate to extend TP cleanly beyond 8 GPUs, critical for very large MoE models |
| InfiniBand NDR (400 Gb/s) | 50 GB/s per port | TP not recommended across nodes | Fine for data parallelism and pipeline parallelism, too slow for TP collectives |
| Ethernet 200/400 GbE | 25 to 50 GB/s | TP not recommended across nodes | Suitable only for inter-node DP and PP |
The practical implication is that tensor parallelism scales cleanly only up to the size of an NVLink island. On HGX H100 and H200 systems that island is 8 GPUs, so 8-way TP is the common upper bound. On GB200 NVL72 racks the island grows to 72 GPUs and TP becomes practical at 16, 32, or higher. Projects that need TP beyond NVLink boundaries either accept the slowdown or use techniques like Flash Communication, kernel-level overlap, or compressed AllReduces to mitigate the cost [3].
A wide range of training and inference frameworks implement tensor parallelism, mostly following the Megatron-LM scheme with framework-specific abstractions. The table below summarizes the main options as of 2026.
| Framework | Underlying primitive | Typical TP degree | Notes |
|---|---|---|---|
| Megatron-LM (NVIDIA) | Custom CUDA collectives, Megatron Core building blocks | 1 to 8 within node, up to 64 on NVL72 | Reference implementation, the basis for most other frameworks |
| DeepSpeed (Microsoft) | Megatron-style TP combined with ZeRO data parallelism | 1 to 8 | The DP/TP interaction is engineered carefully via ZeRO-DP partitioning |
| FairScale (Meta) | Megatron-style TP with checkpoint_wrapper integration | 1 to 8 | Largely superseded by PyTorch DTensor and FSDP2 |
| PyTorch DTensor + parallelize_module | DistributedTensor with Colwise/Rowwise/SequenceParallel styles | 1 to 8 within node, 16+ on NVL72 | Native PyTorch API since 2.0, became production-ready in 2024 |
| TorchTitan | DTensor + FSDP2 + selective TP | 1 to 16 | PyTorch reference recipe for combined parallelism on H100/B200 |
| JAX shard_map and pjit | Manual SPMD with named axes | 1 to 16 | Used heavily on TPU pods with ICI interconnect, also on GPU |
| NVIDIA NeMo Framework | Wraps Megatron Core, exposes TP/PP/DP/EP/CP via config | 1 to 64 | Production framework for NVIDIA hardware, especially Blackwell |
| ColossalAI | Megatron-style TP plus 1D, 2D, 2.5D, and 3D variants | 1 to 16 | Academic framework, popular for research on alternative TP layouts |
| vLLM | Megatron-style TP with NCCL collectives | 1 to 8 within node | Standard inference runtime for open-source LLMs |
| TensorRT-LLM | Custom kernels with fused TP communication | 1 to 8 | NVIDIA's high-performance inference runtime |
| SGLang | NCCL-based TP with graph-level optimization | 1 to 8 | Inference runtime focused on throughput |
| Hugging Face Accelerate / Transformers | Wraps DTensor, DeepSpeed, or Megatron under the hood | 1 to 8 | Glue for popular open-source workflows |
Megatron-LM remains the canonical reference and the source of the building blocks reused everywhere else. Its core abstractions (ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding) appear under almost identical names in DeepSpeed, FairScale, and earlier PyTorch distributed code [1]. The modern PyTorch story has consolidated around DTensor, which represents tensor sharding as first-class metadata on every parameter and reuses the same machinery for tensor parallelism, sequence parallelism, FSDP2, and HSDP. The user expresses a parallelism plan as a mapping from module names to ParallelStyle objects, and DTensor inserts the appropriate AllReduces, AllGathers, and ReduceScatters automatically.
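A sketch of what such a plan looks like with the DTensor API is shown below; the toy module and the plan keys are hypothetical stand-ins, and a real model would supply its own module names.

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    parallelize_module, ColwiseParallel, RowwiseParallel,
)


class ToyBlock(nn.Module):
    """Hypothetical transformer block with just enough structure to name the plan keys."""

    def __init__(self, hidden=4096):
        super().__init__()
        self.qkv_proj = nn.Linear(hidden, 3 * hidden, bias=False)
        self.out_proj = nn.Linear(hidden, hidden, bias=False)
        self.up_proj = nn.Linear(hidden, 4 * hidden, bias=False)
        self.down_proj = nn.Linear(4 * hidden, hidden, bias=False)


tp_mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("tp",))

plan = {
    "qkv_proj": ColwiseParallel(),    # column-parallel: outputs stay sharded
    "out_proj": RowwiseParallel(),    # row-parallel: AllReduce on the output
    "up_proj": ColwiseParallel(),
    "down_proj": RowwiseParallel(),
}

block = parallelize_module(ToyBlock(), tp_mesh, plan)   # DTensor inserts the collectives
```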
The DeepSpeed approach has historically combined Megatron tensor parallelism with its own ZeRO-style sharded data parallelism. ZeRO-3 partitions optimizer state, gradients, and parameters along the data-parallel axis but keeps each rank holding the full tensor for any given operation by AllGathering the parameters just before they are used. ZeRO-3 communicates parameter shards on demand, while TP communicates activation shards on demand; the two compose naturally [2]. JAX and TPU workflows use a different idiom: programs declare a logical mesh with named axes, and the user annotates each tensor with a partition spec. The XLA compiler then derives the necessary collective communications automatically. The shard_map primitive added in JAX 0.4 exposes a more explicit form where the user writes per-shard code directly.
For models large enough that no single parallelism strategy scales them efficiently, the standard solution is to combine all three. The Megatron-LM-DeepSpeed collaboration that trained the 530-billion-parameter Megatron-Turing NLG model in 2022 used 8-way tensor parallelism within each DGX A100 node, 35-way pipeline parallelism across nodes within a pipeline replica, and 24-way data parallelism across replicas, for a total of 8 * 35 * 24 = 6,720 GPUs running in lockstep [5]. Narayanan et al. (2021), scaling Megatron-LM to a trillion parameters, reported 163 teraFLOP/s per GPU and aggregate throughput of 502 petaFLOP/s on 3,072 A100 GPUs using a similar 3D configuration [2].
The assignment of devices to the three axes follows the bandwidth hierarchy: tensor parallelism, with the highest communication intensity per layer, goes on the highest-bandwidth axis (NVLink within a node); pipeline parallelism, with low communication that can hide behind scheduling, goes on the intermediate-bandwidth axis (InfiniBand within a pod); data parallelism, with one gradient AllReduce per step, goes on the lowest-bandwidth axis (the broader cluster fabric). Newer schemes layer additional axes on top: expert parallelism for mixture-of-experts models, context parallelism for very long sequences, and zero-bubble pipeline scheduling to reduce idle time.
The collective communication libraries that tensor parallelism relies on, primarily NVIDIA's NCCL but also AMD's RCCL and Intel's oneCCL, implement AllReduce, AllGather, and ReduceScatter as ring or tree algorithms tuned for the underlying topology. An AllReduce of N bytes on a ring of K GPUs requires roughly 2 * (K-1)/K * N bytes of data to be sent and received per GPU, which for large N approaches 2 * N. This linear scaling with payload size, combined with the per-layer communication frequency of tensor parallelism, makes raw NVLink bandwidth the single most important hardware factor for TP performance.
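The ring cost model is easy to evaluate; the helper below applies the formula above to the 512 MB activation payload from the earlier example on an 8-GPU ring.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: int, k: int) -> float:
    # The ReduceScatter and AllGather phases each move (k - 1) / k of the payload.
    return 2 * (k - 1) / k * payload_bytes


print(ring_allreduce_bytes_per_gpu(512 * 2**20, 8) / 2**20)   # 896.0 MB sent (and received) per GPU
```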
A recurring optimization theme is overlap of communication with computation. The backward pass of a TP-sharded layer can begin its AllReduce as soon as partial results are available and run in parallel with the next layer's backward computation. Megatron-LM, DeepSpeed, and PyTorch DTensor all implement some form of overlap, typically by issuing the collective on a separate CUDA stream. More aggressive techniques such as kernel-level fusion of the matmul and the AllReduce (Flux, Async-TP, NCCL-symm) aim to hide communication behind compute of the same layer. These techniques are especially valuable on Blackwell and beyond, where per-FLOP compute cost has dropped faster than per-byte communication cost.
Tensor parallelism is also widely used at inference time. The benefits are the same as at training: distributing weights lets a model run when no single GPU has enough memory, and distributing the matmul compute lowers per-token latency. The cost is also the same: every layer triggers an AllReduce, and slow interconnects can erase per-rank compute savings. The standard recipe is 8-way TP within a high-bandwidth node, with pipeline or data parallelism for sharding beyond the node.
Inference engines such as vLLM, TensorRT-LLM, and SGLang all support tensor parallelism through NCCL-based AllReduce kernels. TensorRT-LLM uses fused custom kernels for higher throughput, vLLM layers paged KV cache and continuous batching on top of TP, and SGLang focuses on graph-level scheduling of TP collectives. For dense 70B models on H100 nodes, TP=8 inside a single node is the gold-standard configuration for low-latency inference. The arrival of GB200 NVL72 has shifted the optimal point: a 70B model can now run with TP=72 entirely within a single rack with all communication on NVLink, cutting per-token communication latency from the hundreds of microseconds typical of cross-node InfiniBand to a few microseconds [3]. A subtle inference consideration is that the KV cache, which stores keys and values from previous tokens, is naturally aligned with the head dimension and is automatically partitioned across TP ranks when QKV projections are column-parallel. No extra communication is needed during attention for cached tokens, only for the new tokens being generated.
Tensor parallelism's main limitation is its high communication intensity. The forward and backward AllReduces happen at every layer and grow linearly with model dimension, batch size, and sequence length. On hardware where per-FLOP compute has improved faster than per-byte communication (which is true of every NVIDIA generation from Volta onwards), the relative cost of TP communication grows over time. This has motivated a range of communication-reduction techniques: lower-precision collectives (FP8 or even INT8 AllReduces), sparsified or low-rank approximations of the activation gradients, kernel-level overlap of communication with compute, and topology-aware collective algorithms that exploit the specific layout of NVLink Switch fabrics [3].
A second limitation is the head-divisibility constraint. The cleanest tensor-parallel partitioning of multi-head attention requires the number of heads to be divisible by the TP degree. Models that need to scale beyond the head count (such as MQA models with 1 KV head per layer) require either KV head replication, additional communication for KV gathering, or a hybrid layout that combines head-level TP with hidden-dimension TP.
A third limitation is that tensor parallelism does not by itself reduce optimizer state memory. Each TP rank holds its sharded weights and corresponding sharded optimizer state, but the per-rank optimizer footprint is still proportional to the per-rank parameter footprint. To shard optimizer state further, TP must be combined with FSDP, ZeRO, or a similar sharded data parallelism scheme. The PyTorch FSDP2 implementation, built on DTensor, provides a clean composition where FSDP shards along the data-parallel axis and TP shards along an orthogonal mesh axis, with the appropriate collectives derived automatically.
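A sketch of that composition using a two-dimensional device mesh is shown below, assuming a recent PyTorch release where FSDP2's fully_shard and the DTensor parallelism APIs are importable as written (the exact import paths have moved between releases); the model and plan are placeholders.

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard                      # FSDP2 entry point
from torch.distributed.tensor.parallel import (
    parallelize_module, ColwiseParallel, RowwiseParallel,
)

# 64 GPUs arranged as 8 data-parallel groups of 8 tensor-parallel ranks each.
mesh = init_device_mesh("cuda", (8, 8), mesh_dim_names=("dp", "tp"))

model = nn.Sequential(
    nn.Linear(4096, 16384, bias=False),   # "0": up-projection
    nn.GELU(),
    nn.Linear(16384, 4096, bias=False),   # "2": down-projection
)

# Tensor parallelism shards the weights along the "tp" mesh axis...
model = parallelize_module(model, mesh["tp"],
                           {"0": ColwiseParallel(), "2": RowwiseParallel()})
# ...and FSDP2 shards the resulting parameters along the orthogonal "dp" axis.
fully_shard(model, mesh=mesh["dp"])
```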
Research on alternatives to the basic Megatron scheme is active. ColossalAI's 2D, 2.5D, and 3D tensor parallelism variants partition each weight matrix along multiple axes simultaneously, reducing per-AllReduce volume at the cost of additional collectives. Async tensor parallelism breaks each AllReduce into a chain of point-to-point sends overlapped with compute. Mixture-of-experts models introduce expert parallelism, which adds a separate AllToAll pattern orthogonal to TP. Context parallelism shards along the sequence dimension across nodes, enabling million-token context windows. Each of these techniques compounds with rather than replaces tensor parallelism, and the design space of combined parallelism strategies continues to expand.