See also: Machine learning terms
In machine learning, an operation (often abbreviated as op) is a basic computational unit that manipulates data, typically tensors, during the training or execution of a model. Operations can be arithmetic, logical, relational, or higher level numerical routines such as matrix multiplication, convolution, and softmax. They are the building blocks of machine learning models: a deep neural network is, at the lowest level of description, just a sequence (or graph) of ops applied to tensors of weights and activations.
In modern deep learning frameworks such as TensorFlow, PyTorch, and JAX, an op has a precise technical meaning. It is a primitive registered with the framework that takes one or more input tensors, produces one or more output tensors, and usually has a corresponding gradient (backward) implementation so it can be used inside automatic differentiation. Each op is then bound to one or more device-specific implementations (kernels) for CPU, GPU, TPU, or other accelerators.
The word "operation" is one of the most overloaded terms in ML systems. It is used loosely for any computation, formally for a node in a computational graph, and informally for the whole pipeline ("the matmul op", "the attention op", "a fused op"). Most of this article uses the framework definition: an op is a registered primitive with a name, a type signature, and one or more device kernels.
Three closely related concepts often get conflated. Keeping them straight makes the rest of this article easier to follow.
| Concept | Meaning | Example |
|---|---|---|
| Op | The abstract computational primitive defined by the framework. It has a name, input and output type signatures, attributes, and a gradient rule. | aten::matmul, tf.raw_ops.Conv2D, jnp.add |
| Node | An instance of an op inside a specific computational graph, bound to particular input tensors. | The matmul node that multiplies activations of layer 7 by its weight matrix |
| Kernel | The concrete device-specific implementation that actually executes the op on hardware. | A CUDA kernel for matmul on NVIDIA GPUs, an OpenMP loop on CPU, an MPS kernel on Apple silicon |
A single op can have many kernels: float32 vs float16, CPU vs GPU, NCHW vs NHWC layout, with or without cuDNN. When the framework executes a node, it dispatches to the kernel that matches the input device, dtype, and layout. TensorFlow calls this the OpKernel system; PyTorch calls it the dispatcher.
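As a small sketch of dispatch in PyTorch (the same idea applies to TensorFlow's OpKernel registry), the following calls all hit the same op but run different kernels:

```python
import torch

x32 = torch.randn(4, 4)                       # float32 on CPU
x16 = torch.randn(4, 4, dtype=torch.float16)  # float16 on CPU
xg = x32.cuda() if torch.cuda.is_available() else x32  # float32 on GPU, if present

# All three calls go through the same op (aten::relu); the dispatcher
# selects a kernel keyed on each input's device and dtype.
torch.relu(x32)
torch.relu(x16)
torch.relu(xg)
```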
Every framework has its own conventions for how ops are defined, registered, and called. The differences matter when you write custom ops or read framework internals.
TensorFlow was built around an explicit op registry. Each op is registered in C++ with a name, an input/output signature, and any attributes, and is then bound to one or more OpKernel implementations. The Python API exposes the raw registry through tf.raw_ops, while higher-level functions (tf.matmul, tf.nn.conv2d) wrap those raw ops with shape inference, broadcasting, and gradient rules. Gradients are registered separately via RegisterGradient. TensorFlow ships with on the order of a thousand ops, and calling an op on a device that has no registered kernel produces the well-known "OpKernel not registered" error.
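For illustration, the same matrix multiply can be reached through the friendly wrapper or the raw registered op (a sketch assuming a standard TensorFlow 2.x install):

```python
import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

# tf.linalg.matmul is the high-level wrapper; tf.raw_ops.MatMul calls the
# registered op directly. Both dispatch to the same MatMul kernel.
print(tf.linalg.matmul(a, b))
print(tf.raw_ops.MatMul(a=a, b=b))
```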
PyTorch keeps its operator library in the ATen ("A TENsor library") namespace, exposed in Python as torch.ops.aten. Operator schemas are listed in aten/src/ATen/native/native_functions.yaml, and a code generator (torchgen) processes that file to emit C++ bindings, autograd glue, and Python entry points. The framework ships with more than two thousand ops covering everything from aten::add to aten::scaled_dot_product_attention. At runtime, a component called the dispatcher decides which kernel to run based on the input device, dtype, autograd state, autocast state, and any active vmap or functorch transform. PyTorch also exposes torch.nn.functional (stateless functional ops like F.relu, F.conv2d) and torch.nn modules that wrap them with parameters.
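The layering is visible from Python; a minimal sketch showing that the functional wrapper and the raw ATen operator reach the same kernel:

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3)
w = torch.randn(4, 3)
b = torch.randn(4)

# Two entry points, one op: the functional wrapper used by nn.Linear and
# the raw ATen operator exposed through torch.ops.aten.
y1 = F.linear(x, w, b)
y2 = torch.ops.aten.linear(x, w, b)
assert torch.allclose(y1, y2)
```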
JAX takes a different approach. Users write code in jax.numpy (a NumPy-like API), and jax.jit traces the Python function into an intermediate representation called a jaxpr, which is then lowered to StableHLO and handed to the XLA compiler. The set of "ops" a JAX program ultimately runs is the set of HLO primitives that XLA understands, not a fixed framework-side registry. This makes JAX much smaller in op surface area but heavily dependent on XLA for performance.
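The traced primitives are easy to inspect; a minimal sketch using jax.make_jaxpr:

```python
import jax
import jax.numpy as jnp

def layer(x, w):
    return jax.nn.relu(jnp.dot(x, w))

# make_jaxpr shows the primitives (dot_general, max, ...) that jit will
# lower to StableHLO and hand to XLA; there is no framework-side op registry.
print(jax.make_jaxpr(layer)(jnp.ones((2, 3)), jnp.ones((3, 4))))
```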
ONNX (Open Neural Network Exchange) defines a portable, versioned op set that acts as an interchange format between frameworks. An ONNX model is a graph of standardized ops such as Conv, MatMul, Relu, and Softmax, grouped into opset versions (opset 1 through opset 22+ as of 2025). Two domains exist: ai.onnx for the core neural network ops and ai.onnx.ml for classical ML ops like decision trees. Inference engines like ONNX Runtime, TensorRT, and OpenVINO consume these models and dispatch each op to a hardware-specific implementation.
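As a sketch, exporting a tiny PyTorch model shows how framework ops map onto the standardized set (the file name and opset version here are arbitrary):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU())

# The exported graph contains standardized ai.onnx ops (Gemm, Relu),
# pinned to a specific opset version, that any ONNX runtime can execute.
torch.onnx.export(model, (torch.randn(1, 4),), "tiny.onnx", opset_version=17)
```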
| Framework | Op namespace | Where ops live | Default execution model |
|---|---|---|---|
| TensorFlow | tf.raw_ops | C++ OpKernel registry | Graph (tf.function) or eager |
| PyTorch | torch.ops.aten | native_functions.yaml and aten/src/ATen/native | Eager by default, graph via torch.compile |
| JAX | jax.numpy (traced to HLO) | XLA HLO primitives | Traced and JIT-compiled with XLA |
| ONNX | ai.onnx, ai.onnx.ml | Versioned opsets | Interchange format consumed by runtimes |
Almost every modern framework groups ops into a handful of broad categories. The exact names differ, but the categories are stable across TensorFlow, PyTorch, JAX, and ONNX.
| Category | What it does | Common examples |
|---|---|---|
| Element-wise | Apply a function to each tensor element independently | add, mul, relu, sigmoid, tanh, gelu, exp, log |
| Reduction | Collapse one or more axes by summing, averaging, or selecting | sum, mean, max, min, argmax, argmin, var |
| Linear algebra | Matrix and tensor contractions | matmul, conv1d, conv2d, conv3d, einsum, bmm |
| Indexing and selection | Pick elements from a tensor by index or mask | gather, scatter, slice, masked_select, index_select |
| Shape manipulation | Change tensor shape without changing data | reshape, transpose, permute, stack, concat, squeeze, expand |
| Random | Draw samples from a distribution | normal, uniform, bernoulli, dropout |
| Communication | Move tensors between devices in distributed training | all_reduce, all_gather, broadcast, reduce_scatter |
| I/O | Read and write tensors to and from storage | tf.io.read_file, torch.load, torch.save |
| Normalization and pooling | Standard neural network building blocks | batch_norm, layer_norm, max_pool, avg_pool |
| Activation | Non-linear functions applied between layers | relu, gelu, silu, softmax, log_softmax |
A typical transformer forward pass consists almost entirely of ops from this table: matmul for the projections, softmax and reductions inside attention, layer_norm for normalization, and gelu or silu as the activation. The attention operation, infamously "only six lines of math", expands into dozens of ops once you actually trace it through PyTorch.
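A quick way to see that expansion is to trace a bare-bones attention function with torch.fx (a sketch that ignores masking, dropout, and multiple heads):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, scale):
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    return torch.matmul(F.softmax(scores, dim=-1), v)

# Symbolic tracing turns each op into a separate graph node: matmul,
# transpose, mul, softmax, matmul. A real multi-head implementation adds
# reshapes, permutes, dropout, and the output projection on top of these.
print(torch.fx.symbolic_trace(attention).graph)
```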
Launching a GPU kernel has overhead, and reading and writing to GPU global memory (HBM) is slow compared to on-chip SRAM. If the network executes one tiny op per kernel launch, both effects dominate. Op fusion combines several adjacent ops into a single kernel that does the work in one pass, keeping intermediate values in registers or shared memory.
The canonical inference fusion is conv-bn-relu. During inference, batch normalization reduces to a per-channel scale and shift, which can be folded directly into the preceding convolution's weights and bias. The activation can then be applied in the same kernel. The result is one kernel launch instead of three, no intermediate tensor written to memory, and faster execution. Reported speedups range from roughly 1.5x on edge microcontrollers to nearly 3x on individual layers, with no loss of accuracy.
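The folding itself is a few lines of arithmetic; a minimal sketch for a Conv2d followed by a BatchNorm2d in eval mode:

```python
import torch

@torch.no_grad()
def fold_bn_into_conv(conv: torch.nn.Conv2d, bn: torch.nn.BatchNorm2d) -> torch.nn.Conv2d:
    """Return a single conv whose output matches bn(conv(x)) at inference time."""
    # In eval mode BN computes y = gamma * (x - mean) / sqrt(var + eps) + beta,
    # a per-channel scale and shift that folds into the conv's weight and bias.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused = torch.nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                            stride=conv.stride, padding=conv.padding, bias=True)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused
```

The ReLU can then be applied in the same kernel as the fused convolution, which is what inference engines and graph compilers do automatically.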
The headline modern example is FlashAttention (Dao et al., 2022). Standard attention computes the full N x N matrix of query-key scores, applies softmax, and multiplies by the value matrix, materializing an O(N^2) intermediate that quickly dominates memory at long context lengths. FlashAttention fuses the matmul, the softmax, and the second matmul into a single tiled kernel that streams blocks of queries, keys, and values through on-chip SRAM and never writes the full attention matrix to HBM. The original paper reported attention speedups of up to 7.6x on GPT-2 and reduced memory use from quadratic to linear in sequence length. FlashAttention-2 (2023) improved work partitioning and parallelism, and FlashAttention-3 (2024) added asynchronous, low-precision execution on Hopper GPUs.
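In PyTorch the fused path is exposed through torch.nn.functional.scaled_dot_product_attention, which dispatches to a FlashAttention-style kernel when the device, dtype, and shapes allow it (a sketch assuming a CUDA build):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)

# One fused op replaces matmul -> softmax -> matmul; the 2048 x 2048 score
# matrix is never materialized in GPU global memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```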
| Fused op | What it combines | Where it shows up |
|---|---|---|
| Linear + bias + activation | matmul, add, relu/gelu | MLP blocks, often in cuBLASLt or Triton |
| RMSNorm | element-wise square, mean, rsqrt, multiply | LLaMA-style transformer blocks |
| Fused Adam | parameter update combining moments and weight decay | Optimizer step in mixed precision training |
| Fused softmax + cross entropy | softmax then negative log likelihood | Classification training loops |
| Fused conv-bn-relu | convolution, batch norm, ReLU | CNN inference |
| FlashAttention | matmul, scale, softmax, matmul, masking | Transformer attention |
Triton and NVIDIA's CUTLASS are the two most popular ways to write custom fused ops today. Triton is a Python-embedded DSL introduced by Tillet, Kung, and Cox in 2019 that lets researchers write tile-based GPU kernels without learning CUDA; it can produce matmul and convolution kernels competitive with cuBLAS and cuDNN. CUTLASS is NVIDIA's C++ template library of building blocks for the same job.
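As a small illustration of the style, a fused add-plus-ReLU kernel in Triton looks roughly like this (a sketch; the block size and launch grid are illustrative):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized tile. The add and the
    # ReLU happen in registers, so no intermediate tensor is written to HBM.
    offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```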
Fusion in modern systems is usually not done by hand for every shape. It is done by a compiler that takes a graph of high level ops and rewrites it into a smaller graph of fused, hardware-specific kernels. The intermediate representations (IRs) that compilers use have become an active subfield of ML systems.
| Compiler / IR | Used by | What it does |
|---|---|---|
| XLA (HLO, StableHLO) | TensorFlow, JAX, PyTorch/XLA | Lowers framework ops into HLO primitives, then performs target-independent passes (CSE, fusion, buffer analysis) and target-specific passes for GPU, CPU, and TPU |
| TorchInductor | PyTorch 2.x via torch.compile | Lowers PyTorch FX graphs into a compact ~50-op IR and generates Triton kernels for GPU and OpenMP code for CPU |
| TVM | Apache TVM | Separates compute from schedule using a tensor expression language; uses a learned cost model to search for fast schedules across CPU, mobile GPU, and server GPU |
| MLIR | LLVM project, used inside XLA, IREE, Mojo | Provides a multi-level IR with user-defined dialects, allowing tensor programs to be progressively lowered through several abstraction levels in a single compilation pipeline |
XLA was originally a TensorFlow project; it now lives in the cross-vendor OpenXLA initiative and accepts StableHLO as its frontend op set. TorchInductor, the default backend for torch.compile, was designed to be implemented in Python so researchers can extend it without touching C++. TVM came from the Chen et al. OSDI 2018 paper and pioneered the idea of treating schedule search as a learning problem. MLIR (Lattner et al.) generalizes the LLVM idea of a fixed IR into a system of stacked dialects, and now underpins XLA, IREE, the Mojo language, and increasing parts of LLVM itself.
When a researcher needs an op that the framework does not provide, or a much faster version of one it does, they write a custom op. The general pattern is similar across frameworks: declare the op's signature (schema), implement one or more device kernels, register a gradient if the op must be differentiable, and expose it through the Python API.
In TensorFlow this means writing an OpKernel subclass and calling REGISTER_KERNEL_BUILDER. In PyTorch it means using TORCH_LIBRARY and registering kernels with the dispatcher (m.impl("my_op", &my_kernel)), then writing a Python autograd Function for the backward pass. JAX users write a primitive with jax.core.Primitive and register lowering rules to HLO or CUDA. Custom ops are how nearly every state of the art kernel (FlashAttention, paged attention in vLLM, fused MoE kernels) actually reaches users.
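A minimal Python-level sketch of the PyTorch pattern, assuming PyTorch 2.4+ (which provides torch.library.custom_op) and a made-up namespace mylib; production ops register hand-written C++ or CUDA kernels instead:

```python
import torch

@torch.library.custom_op("mylib::scaled_relu", mutates_args=())
def scaled_relu(x: torch.Tensor, scale: float) -> torch.Tensor:
    # The "kernel": plain eager PyTorch here, but this could just as well
    # call into a hand-written CUDA or Triton implementation.
    return torch.relu(x) * scale

@scaled_relu.register_fake
def _(x, scale):
    # Shape/dtype inference so the op also works under torch.compile and export.
    return torch.empty_like(x)

out = torch.ops.mylib.scaled_relu(torch.randn(4), 2.0)
```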
Even without writing a brand new op, there is a lot of performance to be had at the op level. Kernel selection picks among dozens of cuBLAS or CUTLASS variants for matmul, parameterized by tile size, split-k, and Tensor Core usage; frameworks autotune on first call and cache the choice. Layout rewrites convert convolutions to NHWC because Tensor Cores prefer it. Mixed precision selects float16, bfloat16, or FP8 variants of an op and inserts casts in the right places. SIMD dispatch sends CPU ops to AVX2, AVX-512, NEON, or AMX kernels depending on the host. Memory management choices like pinned host memory and zero-copy transfers also live at the op layer.
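Mixed precision, for instance, is visible directly at the op level (a sketch assuming a CUDA device):

```python
import torch

x = torch.randn(1024, 1024, device="cuda")
w = torch.randn(1024, 1024, device="cuda")

# Under autocast the same matmul op dispatches to a float16 Tensor Core
# kernel; ops that need full precision stay in float32 and casts are
# inserted automatically at the boundaries.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = x @ w
print(y.dtype)  # torch.float16
```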
Large language model serving has put an unusual amount of pressure on op-level engineering. Inference engines such as vLLM, Hugging Face Text Generation Inference (TGI), and llama.cpp ship heavily fused custom ops because every microsecond per token matters at scale.
A few representative examples: Paged attention (vLLM, 2023) treats the KV cache like virtual memory pages, with custom CUDA kernels that gather keys and values from non-contiguous cache blocks. Grouped-query attention (GQA) and multi-query attention (MQA) are op-level rewrites that share key and value heads across multiple query heads, cutting KV-cache size and memory traffic and enabling faster decoding for models like Llama 3 and Mistral. Fused MoE kernels combine the routing softmax, top-k selection, and expert matmuls into one kernel for mixture-of-experts models. Quantized matmul ops (INT8, INT4, FP8) are written specifically for GEMV-shaped single-token decode workloads, which look very different from the GEMM-shaped training workloads classical libraries were tuned for.
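A toy sketch of the GQA idea (the shapes are illustrative): eight query heads share two stored KV heads, shrinking the KV cache and its memory traffic fourfold before the usual attention ops run.

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 128, 64)  # (batch, query_heads, seq_len, head_dim)
k = torch.randn(1, 2, 128, 64)  # only 2 KV heads are kept in the cache
v = torch.randn(1, 2, 128, 64)

# Expand each KV head to serve 4 query heads, then run ordinary attention.
# Fused GQA kernels avoid this copy and index the shared heads directly.
k = k.repeat_interleave(4, dim=1)
v = v.repeat_interleave(4, dim=1)
out = F.scaled_dot_product_attention(q, k, v)
```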
On the training side, torch.compile in PyTorch 2.x rewrites the user's model into a smaller set of fused Triton kernels, often eliminating dozens of small ops in a transformer block. The performance improvement can be 30 to 80 percent on a typical training step, mostly from op fusion and better kernel selection rather than algorithmic changes.
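In user code the whole machinery is a one-line change (a sketch; actual speedups depend on the model, shapes, and hardware):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 512)
)

# torch.compile captures the model as an FX graph, hands it to TorchInductor,
# and swaps the eager ops for fused Triton (GPU) or C++/OpenMP (CPU) kernels
# the first time the compiled model is called.
compiled = torch.compile(model)
out = compiled(torch.randn(8, 512))
```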
Large op libraries are a mixed blessing. PyTorch's two-thousand-plus ops make the framework expressive but also make it heavy: every new hardware backend has to either implement, decompose, or fall back for each one. New chips like Google TPU v5, AMD MI300, and NVIDIA B200 each require months of kernel porting before they can run state-of-the-art models efficiently. The compiler stacks (XLA, TorchInductor, TVM) try to soften this by reducing the surface area to a small set of "core" ops (PrimTorch in PyTorch, the core ATen op set in ExecuTorch, HLO in XLA) that backends must implement; everything else is decomposed into those core ops.
There is also a tension between having one big fused op (fast but inflexible) and many small ops (slower but easier to compose). FlashAttention, for example, is a single huge kernel; if you want to insert a custom mask or score modifier, you need a new variant. Generic compilers can in principle generate the fused kernel on demand, but in practice hand-tuned kernels still win on the hottest workloads.
Imagine you are playing with building blocks to create a tower. Each block is a basic task you need to do, like adding numbers, comparing them, or stretching them into a longer row. In machine learning, those basic tasks are called operations, or ops. A complicated AI model is just a tower of these ops stacked on top of each other.
Sometimes, instead of using three separate small blocks, you can glue them together into one bigger block that does the same job faster. That is called fusion, and it is a big part of why modern AI models can run on phones and laptops at all.