See also: Machine learning terms
In machine learning, an operation (often abbreviated as op) is a basic computational unit that manipulates data, typically tensors, during the training or execution of a model. Operations can be arithmetic, logical, relational, or higher level numerical routines such as matrix multiplication, convolution, and softmax. They are the building blocks of machine learning models: a deep neural network is, at the lowest level of description, just a sequence (or graph) of ops applied to tensors of weights and activations.
In modern deep learning frameworks such as TensorFlow, PyTorch, and JAX, an op has a precise technical meaning. It is a primitive registered with the framework that takes one or more input tensors, produces one or more output tensors, and usually has a corresponding gradient (backward) implementation so it can be used inside automatic differentiation. Each op is then bound to one or more device-specific implementations (kernels) for CPU, GPU, TPU, or other accelerators.
The word "operation" is one of the most overloaded terms in ML systems. It is used loosely for any computation, formally for a node in a computational graph, and informally for the whole pipeline ("the matmul op", "the attention op", "a fused op"). Most of this article uses the framework definition: an op is a registered primitive with a name, a type signature, and one or more device kernels.
Three closely related concepts often get conflated. Keeping them straight makes the rest of this article easier to follow.
| Concept | Meaning | Example |
|---|---|---|
| Op | The abstract computational primitive defined by the framework. It has a name, input and output type signatures, attributes, and a gradient rule. | aten::matmul, tf.raw_ops.Conv2D, jnp.add |
| Node | An instance of an op inside a specific computational graph, bound to particular input tensors. | The matmul node that multiplies activations of layer 7 by its weight matrix |
| Kernel | The concrete device-specific implementation that actually executes the op on hardware. | A CUDA kernel for matmul on NVIDIA GPUs, an OpenMP loop on CPU, an MPS kernel on Apple silicon |
A single op can have many kernels: float32 vs float16, CPU vs GPU, NCHW vs NHWC layout, with or without cuDNN. When the framework executes a node, it dispatches to the kernel that matches the input device, dtype, and layout. TensorFlow calls this the OpKernel system; PyTorch calls it the dispatcher.
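As a small sketch of dispatch in PyTorch (the same idea applies to TensorFlow's OpKernel registry), the following calls all hit the same op but run different kernels:

```python
import torch

x32 = torch.randn(4, 4)                       # float32 on CPU
x16 = torch.randn(4, 4, dtype=torch.float16)  # float16 on CPU
xg = x32.cuda() if torch.cuda.is_available() else x32  # float32 on GPU, if present

# All three calls go through the same op (aten::relu); the dispatcher
# selects a kernel keyed on each input's device and dtype.
torch.relu(x32)
torch.relu(x16)
torch.relu(xg)
```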
Every framework has its own conventions for how ops are defined, registered, and called. The differences matter when you write custom ops or read framework internals.
TensorFlow was built around an explicit op registry. Each op is registered in C++ with a name, an input/output signature, and any attributes, and is then bound to one or more OpKernel implementations. The Python API exposes the raw registry through tf.raw_ops, while higher-level functions (tf.matmul, tf.nn.conv2d) wrap those raw ops with shape inference, broadcasting, and gradient rules. Gradients are registered separately via RegisterGradient. TensorFlow ships with on the order of a thousand ops, and calling an op on a device that has no registered kernel produces the well-known "OpKernel not registered" error.
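For illustration, the same matrix multiply can be reached through the friendly wrapper or the raw registered op (a sketch assuming a standard TensorFlow 2.x install):

```python
import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

# tf.linalg.matmul is the high-level wrapper; tf.raw_ops.MatMul calls the
# registered op directly. Both dispatch to the same MatMul kernel.
print(tf.linalg.matmul(a, b))
print(tf.raw_ops.MatMul(a=a, b=b))
```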
PyTorch keeps its operator library in the ATen ("A TENsor library") namespace, exposed in Python as torch.ops.aten. Operator schemas are listed in aten/src/ATen/native/native_functions.yaml, and a code generator (torchgen) processes that file to emit C++ bindings, autograd glue, and Python entry points. The framework ships with more than two thousand ops covering everything from aten::add to aten::scaled_dot_product_attention. At runtime, a component called the dispatcher decides which kernel to run based on the input device, dtype, autograd state, autocast state, and any active vmap or functorch transform. PyTorch also exposes torch.nn.functional (stateless functional ops like F.relu, F.conv2d) and torch.nn modules that wrap them with parameters.
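The layering is visible from Python; a minimal sketch showing that the functional wrapper and the raw ATen operator reach the same kernel:

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3)
w = torch.randn(4, 3)
b = torch.randn(4)

# Two entry points, one op: the functional wrapper used by nn.Linear and
# the raw ATen operator exposed through torch.ops.aten.
y1 = F.linear(x, w, b)
y2 = torch.ops.aten.linear(x, w, b)
assert torch.allclose(y1, y2)
```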
JAX takes a different approach. Users write code in jax.numpy (a NumPy-like API), and jax.jit traces the Python function into an intermediate representation called a jaxpr, which is then lowered to StableHLO and handed to the XLA compiler. The set of "ops" a JAX program ultimately runs is the set of HLO primitives that XLA understands, not a fixed framework-side registry. This makes JAX much smaller in op surface area but heavily dependent on XLA for performance.
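The traced primitives are easy to inspect; a minimal sketch using jax.make_jaxpr:

```python
import jax
import jax.numpy as jnp

def layer(x, w):
    return jax.nn.relu(jnp.dot(x, w))

# make_jaxpr shows the primitives (dot_general, max, ...) that jit will
# lower to StableHLO and hand to XLA; there is no framework-side op registry.
print(jax.make_jaxpr(layer)(jnp.ones((2, 3)), jnp.ones((3, 4))))
```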
ONNX (Open Neural Network Exchange) defines a portable, versioned op set that acts as an interchange format between frameworks. An ONNX model is a graph of standardized ops such as Conv, MatMul, Relu, and Softmax, grouped into opset versions (opset 1 through opset 22+ as of 2025). Two domains exist: ai.onnx for the core neural network ops and ai.onnx.ml for classical ML ops like decision trees. Inference engines like ONNX Runtime, TensorRT, and OpenVINO consume these models and dispatch each op to a hardware-specific implementation.
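As a sketch, exporting a tiny PyTorch model shows how framework ops map onto the standardized set (the file name and opset version here are arbitrary):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU())

# The exported graph contains standardized ai.onnx ops (Gemm, Relu),
# pinned to a specific opset version, that any ONNX runtime can execute.
torch.onnx.export(model, (torch.randn(1, 4),), "tiny.onnx", opset_version=17)
```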
| Framework | Op namespace | Where ops live | Default execution model |
|---|---|---|---|
| TensorFlow | tf.raw_ops | C++ OpKernel registry | Graph (tf.function) or eager |
| PyTorch | torch.ops.aten | native_functions.yaml and aten/src/ATen/native | Eager by default, graph via torch.compile |
| JAX | jax.numpy (traced to HLO) | XLA HLO primitives | Traced and JIT-compiled with XLA |
| ONNX | ai.onnx, ai.onnx.ml | Versioned opsets | Interchange format consumed by runtimes |
Almost every modern framework groups ops into a handful of broad categories. The exact names differ, but the categories are stable across TensorFlow, PyTorch, JAX, and ONNX.
| Category | What it does | Common examples |
|---|---|---|
| Element-wise | Apply a function to each tensor element independently | add, mul, relu, sigmoid, tanh, gelu, exp, log |
| Reduction | Collapse one or more axes by summing, averaging, or selecting | sum, mean, max, min, argmax, argmin, var |
| Linear algebra | Matrix and tensor contractions | matmul, conv1d, conv2d, conv3d, einsum, bmm |
| Indexing and selection | Pick elements from a tensor by index or mask | gather, scatter, slice, masked_select, index_select |
| Shape manipulation | Change tensor shape without changing data | reshape, transpose, permute, stack, concat, squeeze, expand |
| Random | Draw samples from a distribution | normal, uniform, bernoulli, dropout |
| Communication | Move tensors between devices in distributed training | all_reduce, all_gather, broadcast, reduce_scatter |
| I/O | Read and write tensors to and from storage | tf.io.read_file, torch.load, torch.save |
| Normalization and pooling | Standard neural network building blocks | batch_norm, layer_norm, max_pool, avg_pool |
| Activation | Non-linear functions applied between layers | relu, gelu, silu, softmax, log_softmax |
A typical transformer forward pass consists almost entirely of ops from this table: matmul for the projections, softmax and reductions inside attention, layer_norm for normalization, and gelu or silu as the activation. The attention operation, infamously "only six lines of math", expands into dozens of ops once you actually trace it through PyTorch.
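A quick way to see that expansion is to trace a bare-bones attention function with torch.fx (a sketch that ignores masking, dropout, and multiple heads):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, scale):
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    return torch.matmul(F.softmax(scores, dim=-1), v)

# Symbolic tracing turns each op into a separate graph node: matmul,
# transpose, mul, softmax, matmul. A real multi-head implementation adds
# reshapes, permutes, dropout, and the output projection on top of these.
print(torch.fx.symbolic_trace(attention).graph)
```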
Launching a GPU kernel has overhead, and reading and writing to GPU global memory (HBM) is slow compared to on-chip SRAM. If the network executes one tiny op per kernel launch, both effects dominate. Op fusion combines several adjacent ops into a single kernel that does the work in one pass, keeping intermediate values in registers or shared memory.
The canonical inference fusion is conv-bn-relu. During inference, batch normalization reduces to a per-channel scale and shift, which can be folded directly into the preceding convolution's weights and bias. The activation can then be applied in the same kernel. The result is one kernel launch instead of three, no intermediate tensor written to memory, and faster execution. Reported speedups range from roughly 1.5x on edge microcontrollers to nearly 3x on individual layers, with no loss of accuracy.
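The folding itself is a few lines of arithmetic; a minimal sketch for a Conv2d followed by a BatchNorm2d in eval mode:

```python
import torch

@torch.no_grad()
def fold_bn_into_conv(conv: torch.nn.Conv2d, bn: torch.nn.BatchNorm2d) -> torch.nn.Conv2d:
    """Return a single conv whose output matches bn(conv(x)) at inference time."""
    # In eval mode BN computes y = gamma * (x - mean) / sqrt(var + eps) + beta,
    # a per-channel scale and shift that folds into the conv's weight and bias.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused = torch.nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                            stride=conv.stride, padding=conv.padding, bias=True)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused
```

The ReLU can then be applied in the same kernel as the fused convolution, which is what inference engines and graph compilers do automatically.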
The headline modern example is FlashAttention (Dao et al., 2022). Standard attention computes the full N x N matrix of query-key scores, applies softmax, and multiplies by the value matrix, materializing an O(N^2) intermediate that quickly dominates memory at long context lengths. FlashAttention fuses the matmul, the softmax, and the second matmul into a single tiled kernel that streams blocks of queries, keys, and values through on-chip SRAM and never writes the full attention matrix to HBM. The original paper reported attention speedups of up to 7.6x on GPT-2 and reduced memory use from quadratic to linear in sequence length. FlashAttention-2 (2023) improved work partitioning and parallelism, and FlashAttention-3 (2024) added asynchronous, low-precision execution on Hopper GPUs.
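In PyTorch the fused path is exposed through torch.nn.functional.scaled_dot_product_attention, which dispatches to a FlashAttention-style kernel when the device, dtype, and shapes allow it (a sketch assuming a CUDA build):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)

# One fused op replaces matmul -> softmax -> matmul; the 2048 x 2048 score
# matrix is never materialized in GPU global memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```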
| Fused op | What it combines | Where it shows up |
|---|---|---|
| Linear + bias + activation | matmul, add, relu/gelu | MLP blocks, often in cuBLASLt or Triton |
| RMSNorm | element-wise square, mean, rsqrt, multiply | LLaMA-style transformer blocks |
| Fused Adam | parameter update combining moments and weight decay | Optimizer step in mixed precision training |
| Fused softmax + cross entropy | softmax then negative log likelihood | Classification training loops |
| Fused conv-bn-relu | convolution, batch norm, ReLU | CNN inference |
| FlashAttention | matmul, scale, softmax, matmul, masking | Transformer attention |
Triton and NVIDIA's CUTLASS are the two most popular ways to write custom fused ops today. Triton is a Python-embedded DSL introduced by Tillet, Kung, and Cox in 2019 that lets researchers write tile-based GPU kernels without learning CUDA; it can produce matmul and convolution kernels competitive with cuBLAS and cuDNN. CUTLASS is NVIDIA's C++ template library of building blocks for the same job.
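As a small illustration of the style, a fused add-plus-ReLU kernel in Triton looks roughly like this (a sketch; the block size and launch grid are illustrative):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized tile. The add and the
    # ReLU happen in registers, so no intermediate tensor is written to HBM.
    offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```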
Fusion in modern systems is usually not done by hand for every shape. It is done by a compiler that takes a graph of high level ops and rewrites it into a smaller graph of fused, hardware-specific kernels. The intermediate representations (IRs) that compilers use have become an active subfield of ML systems.
| Compiler / IR | Used by | What it does |
|---|---|---|
| XLA (HLO, StableHLO) | TensorFlow, JAX, PyTorch/XLA | Lowers framework ops into HLO primitives, then performs target-independent passes (CSE, fusion, buffer analysis) and target-specific passes for GPU, CPU, and TPU |
| TorchInductor | PyTorch 2.x via torch.compile | Lowers PyTorch FX graphs into a compact ~50-op IR and generates Triton kernels for GPU and OpenMP code for CPU |
| TVM | Apache TVM | Separates compute from schedule using a tensor expression language; uses a learned cost model to search for fast schedules across CPU, mobile GPU, and server GPU |
| MLIR | LLVM project, used inside XLA, IREE, Mojo | Provides a multi-level IR with user-defined dialects, allowing tensor programs to be progressively lowered through several abstraction levels in a single compilation pipeline |
XLA was originally a TensorFlow project; it now lives in the cross-vendor OpenXLA initiative and accepts StableHLO as its frontend op set. TorchInductor, the default backend for torch.compile, was designed to be implemented in Python so researchers can extend it without touching C++. TVM came from the Chen et al. OSDI 2018 paper and pioneered the idea of treating schedule search as a learning problem. MLIR (Lattner et al.) generalizes the LLVM idea of a fixed IR into a system of stacked dialects, and now underpins XLA, IREE, the Mojo language, and increasing parts of LLVM itself.
When a researcher needs an op that the framework does not provide, or a much faster version of one it does, they write a custom op. The general pattern is similar across frameworks: declare the op's signature (schema), implement one or more device kernels, register a gradient if the op must be differentiable, and expose it through the Python API.
In TensorFlow this means writing an OpKernel subclass and calling REGISTER_KERNEL_BUILDER. In PyTorch it means using TORCH_LIBRARY and registering kernels with the dispatcher (m.impl("my_op", &my_kernel)), then writing a Python autograd Function for the backward pass. JAX users write a primitive with jax.core.Primitive and register lowering rules to HLO or CUDA. Custom ops are how nearly every state of the art kernel (FlashAttention, paged attention in vLLM, fused MoE kernels) actually reaches users.
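A minimal Python-level sketch of the PyTorch pattern, assuming PyTorch 2.4+ (which provides torch.library.custom_op) and a made-up namespace mylib; production ops register hand-written C++ or CUDA kernels instead:

```python
import torch

@torch.library.custom_op("mylib::scaled_relu", mutates_args=())
def scaled_relu(x: torch.Tensor, scale: float) -> torch.Tensor:
    # The "kernel": plain eager PyTorch here, but this could just as well
    # call into a hand-written CUDA or Triton implementation.
    return torch.relu(x) * scale

@scaled_relu.register_fake
def _(x, scale):
    # Shape/dtype inference so the op also works under torch.compile and export.
    return torch.empty_like(x)

out = torch.ops.mylib.scaled_relu(torch.randn(4), 2.0)
```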
Even without writing a brand new op, there is a lot of performance to be had at the op level. Kernel selection picks among dozens of cuBLAS or CUTLASS variants for matmul, parameterized by tile size, split-k, and Tensor Core usage; frameworks autotune on first call and cache the choice. Layout rewrites convert convolutions to NHWC because Tensor Cores prefer it. Mixed precision selects float16, bfloat16, or FP8 variants of an op and inserts casts in the right places. SIMD dispatch sends CPU ops to AVX2, AVX-512, NEON, or AMX kernels depending on the host. Memory management choices like pinned host memory and zero-copy transfers also live at the op layer.
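Mixed precision, for instance, is visible directly at the op level (a sketch assuming a CUDA device):

```python
import torch

x = torch.randn(1024, 1024, device="cuda")
w = torch.randn(1024, 1024, device="cuda")

# Under autocast the same matmul op dispatches to a float16 Tensor Core
# kernel; ops that need full precision stay in float32 and casts are
# inserted automatically at the boundaries.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = x @ w
print(y.dtype)  # torch.float16
```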
Large language model serving has put an unusual amount of pressure on op-level engineering. Inference engines such as vLLM, Hugging Face Text Generation Inference (TGI), and llama.cpp ship heavily fused custom ops because every microsecond per token matters at scale.
A few representative examples: Paged attention (vLLM, 2023) treats the KV cache like virtual memory pages, with custom CUDA kernels that gather keys and values from non-contiguous cache blocks. Grouped-query attention (GQA) and multi-query attention (MQA) are op-level rewrites that share key and value heads across multiple query heads, cutting KV-cache size and memory traffic and enabling faster decoding for models like Llama 3 and Mistral. Fused MoE kernels combine the routing softmax, top-k selection, and expert matmuls into one kernel for mixture-of-experts models. Quantized matmul ops (INT8, INT4, FP8) are written specifically for GEMV-shaped single-token decode workloads, which look very different from the GEMM-shaped training workloads classical libraries were tuned for.
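A toy sketch of the GQA idea (the shapes are illustrative): eight query heads share two stored KV heads, shrinking the KV cache and its memory traffic fourfold before the usual attention ops run.

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 128, 64)  # (batch, query_heads, seq_len, head_dim)
k = torch.randn(1, 2, 128, 64)  # only 2 KV heads are kept in the cache
v = torch.randn(1, 2, 128, 64)

# Expand each KV head to serve 4 query heads, then run ordinary attention.
# Fused GQA kernels avoid this copy and index the shared heads directly.
k = k.repeat_interleave(4, dim=1)
v = v.repeat_interleave(4, dim=1)
out = F.scaled_dot_product_attention(q, k, v)
```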
On the training side, torch.compile in PyTorch 2.x rewrites the user's model into a smaller set of fused Triton kernels, often eliminating dozens of small ops in a transformer block. The performance improvement can be 30 to 80 percent on a typical training step, mostly from op fusion and better kernel selection rather than algorithmic changes.
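In user code the whole machinery is a one-line change (a sketch; actual speedups depend on the model, shapes, and hardware):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 512)
)

# torch.compile captures the model as an FX graph, hands it to TorchInductor,
# and swaps the eager ops for fused Triton (GPU) or C++/OpenMP (CPU) kernels
# the first time the compiled model is called.
compiled = torch.compile(model)
out = compiled(torch.randn(8, 512))
```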
Large op libraries are a mixed blessing. PyTorch's two-thousand-plus ops make the framework expressive but also make it heavy: every new hardware backend has to either implement, decompose, or fall back for each one. New chips like Google TPU v5, AMD MI300, and NVIDIA B200 each require months of kernel porting before they can run state-of-the-art models efficiently. The compiler stacks (XLA, TorchInductor, TVM) try to soften this by reducing the surface area to a small set of "core" ops (PrimTorch in PyTorch, the core ATen op set in ExecuTorch, HLO in XLA) that backends must implement; everything else is decomposed into those core ops.
There is also a tension between having one big fused op (fast but inflexible) and many small ops (slower but easier to compose). FlashAttention, for example, is a single huge kernel; if you want to insert a custom mask or score modifier, you need a new variant. Generic compilers can in principle generate the fused kernel on demand, but in practice hand-tuned kernels still win on the hottest workloads.
Imagine you are playing with building blocks to create a tower. Each block is a basic task you need to do, like adding numbers, comparing them, or stretching them into a longer row. In machine learning, those basic tasks are called operations, or ops. A complicated AI model is just a tower of these ops stacked on top of each other.
Sometimes, instead of using three separate small blocks, you can glue them together into one bigger block that does the same job faster. That is called fusion, and it is a big part of why modern AI models can run on phones and laptops at all.