# Operation (op)

> Source: https://aiwiki.ai/wiki/operation_op
> Updated: 2026-06-02
> Categories: Developer Tools, MLOps
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## Introduction

In machine learning, an **operation** (often abbreviated as **op**) is a basic computational unit that manipulates data, typically [tensors](/wiki/tensor), during the training or execution of a model. Operations can be arithmetic, logical, relational, or higher-level numerical routines such as matrix multiplication, convolution, and softmax. They are the building blocks of machine learning models: a deep neural network is, at the lowest level of description, just a sequence (or graph) of ops applied to tensors of weights and activations.

In modern deep learning frameworks such as [TensorFlow](/wiki/tensorflow), [PyTorch](/wiki/pytorch), and [JAX](/wiki/jax), an op has a precise technical meaning. It is a primitive registered with the framework that takes one or more input tensors, produces one or more output tensors, and usually has a corresponding gradient (backward) implementation so it can be used inside [automatic differentiation](/wiki/automatic_differentiation).[2] Each op is then bound to one or more device-specific implementations (kernels) for CPU, [GPU](/wiki/gpu), TPU, or other accelerators.

The word "operation" is one of the most overloaded terms in ML systems. It is used loosely for any computation, formally for a node in a [computational graph](/wiki/computational_graph), and informally for anything from a single primitive to a whole fused routine ("the matmul op", "the attention op", "a fused op"). Most of this article uses the framework definition: an op is a registered primitive with a name, a type signature, and one or more device kernels.

## Op vs node vs kernel

Three closely related concepts often get conflated. Keeping them straight makes the rest of this article easier to follow.

| Concept | Meaning | Example |
|---|---|---|
| Op | The abstract computational primitive defined by the framework. It has a name, input and output type signatures, attributes, and a gradient rule. | `aten::matmul`, `tf.raw_ops.Conv2D`, `jnp.add` |
| Node | An instance of an op inside a specific [computational graph](/wiki/computational_graph), bound to particular input tensors. | The `matmul` node that multiplies activations of layer 7 by its weight matrix |
| [Kernel](/wiki/machine_learning) | The concrete device-specific implementation that actually executes the op on hardware. | A [CUDA](/wiki/cuda) kernel for `matmul` on NVIDIA GPUs, an OpenMP loop on CPU, an MPS kernel on Apple silicon |

A single op can have many kernels: float32 vs float16, CPU vs GPU, NCHW vs NHWC layout, with or without [cuDNN](/wiki/cudnn). When the framework executes a node, it dispatches to the kernel that matches the input device, dtype, and layout. TensorFlow calls this the OpKernel system; PyTorch calls it the dispatcher.
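
The relationship between an op, its kernels, and dispatch can be sketched in a few lines of plain Python. This is a toy illustration, not any framework's real API: the `KERNELS` registry, the `register_kernel` decorator, and the `dispatch` function are hypothetical names, and lists stand in for tensors.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy registry, not a framework API:
# (op name, device, dtype) -> kernel function.
KERNELS: Dict[Tuple[str, str, str], Callable] = {}


def register_kernel(op: str, device: str, dtype: str):
    """Bind the decorated function as the kernel for one (op, device, dtype)."""
    def decorator(fn: Callable) -> Callable:
        KERNELS[(op, device, dtype)] = fn
        return fn
    return decorator


@register_kernel("add", "cpu", "float32")
def add_cpu_float32(a: List[float], b: List[float]) -> List[float]:
    return [float(x) + float(y) for x, y in zip(a, b)]


@register_kernel("add", "cpu", "int32")
def add_cpu_int32(a: List[int], b: List[int]) -> List[int]:
    return [int(x) + int(y) for x, y in zip(a, b)]


def dispatch(op: str, device: str, dtype: str, *inputs):
    """Look up the kernel that matches the node's device and dtype, then run it."""
    key = (op, device, dtype)
    if key not in KERNELS:
        raise NotImplementedError(f"no kernel registered for {key}")
    return KERNELS[key](*inputs)


# One op ("add"), two kernels; the inputs' metadata picks between them.
float_result = dispatch("add", "cpu", "float32", [1.0, 2.0], [3.0, 4.0])
int_result = dispatch("add", "cpu", "int32", [1, 2], [3, 4])
```

Asking this toy dispatcher for `("add", "gpu", "float32")` raises `NotImplementedError`, which mirrors the missing-kernel errors that real frameworks report. Real dispatchers also key on layout, autograd state, and other properties.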

## Op systems in major frameworks

Every framework has its own conventions for how ops are defined, registered, and called. The differences matter when you write custom ops or read framework internals.

### TensorFlow

[TensorFlow](/wiki/tensorflow) was built around an explicit op registry. Each op is registered in C++ with a name, an input/output signature, and any attributes, and is then bound to one or more `OpKernel` implementations. The Python API exposes the raw registry through `tf.raw_ops`, while higher-level functions (`tf.matmul`, `tf.nn.conv2d`) wrap those raw ops with shape inference, broadcasting, and gradient rules. Gradients are registered separately via `RegisterGradient`.[1] TensorFlow ships with on the order of a thousand ops, and running an op on a device or dtype for which no kernel is registered produces the well-known "No OpKernel was registered to support Op" error.

### PyTorch

[PyTorch](/wiki/pytorch) keeps its operator library in the `aten` namespace (ATen is short for "A Tensor Library"), exposed in Python as `torch.ops.aten`. Operator schemas are listed in `aten/src/ATen/native/native_functions.yaml`, and a code generator (`torchgen`) processes that file to emit C++ bindings, autograd glue, and Python entry points.[5] The framework ships with more than two thousand ops covering everything from `aten::add` to `aten::scaled_dot_product_attention`. At runtime, a component called the dispatcher decides which kernel to run based on the input device, dtype, autograd state, autocast state, and any active vmap or functorch transform.[4] PyTorch also exposes `torch.nn.functional` (stateless functional ops like `F.relu`, `F.conv2d`) and `torch.nn` modules that wrap them with parameters.

### JAX

[JAX](/wiki/jax) takes a different approach. Users write code in `jax.numpy` (a NumPy-like API), and `jit` traces the Python function into an intermediate representation called jaxpr, which is then lowered to StableHLO and handed to the [XLA](/wiki/xla) compiler. The set of "ops" a JAX program ultimately runs is the set of HLO primitives that XLA understands, not a fixed framework-side registry. This makes JAX much smaller in op surface area but heavily dependent on XLA for performance.
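
The tracing idea can be illustrated with a small sketch in plain Python. It is not JAX and does not produce real jaxpr: the `Tracer` class, the `make_trace` helper, and the textual program format are hypothetical stand-ins that only show how running a function on placeholder objects records a list of ops instead of computing values.

```python
class Tracer:
    """Placeholder for an array: records each op applied to it."""

    def __init__(self, name, trace):
        self.name = name
        self.trace = trace  # shared list of recorded instructions

    def _record(self, op, other):
        out = Tracer(f"v{len(self.trace)}", self.trace)
        rhs = other.name if isinstance(other, Tracer) else repr(other)
        self.trace.append(f"{out.name} = {op} {self.name} {rhs}")
        return out

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)


def make_trace(fn, num_args):
    """Run fn once on tracers and return the recorded program."""
    trace = []
    args = [Tracer(f"arg{i}", trace) for i in range(num_args)]
    out = fn(*args)
    trace.append(f"return {out.name}")
    return trace


def f(x, y):
    return x * y + 2.0


program = make_trace(f, 2)
# program == ["v0 = mul arg0 arg1", "v1 = add v0 2.0", "return v1"]
```

The recorded program no longer depends on Python, which is what allows a compiler to optimize and lower it as a whole.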

### ONNX

[ONNX](/wiki/onnx) (Open Neural Network Exchange) defines a portable, versioned op set that acts as an interchange format between frameworks. An ONNX model is a graph of standardized ops such as `Conv`, `MatMul`, `Relu`, and `Softmax`,[6] grouped into opset versions (opset 1 through opset 22+ as of 2025).[7] Two domains exist: `ai.onnx` for the core neural network ops and `ai.onnx.ml` for classical ML ops like decision trees. Inference engines like ONNX Runtime, TensorRT, and OpenVINO consume these models and dispatch each op to a hardware-specific implementation.

### Comparison

| Framework | Op namespace | Where ops live | Default execution model |
|---|---|---|---|
| [TensorFlow](/wiki/tensorflow) | `tf.raw_ops` | C++ `OpKernel` registry | Graph (tf.function) or eager |
| [PyTorch](/wiki/pytorch) | `torch.ops.aten` | `native_functions.yaml` and `aten/src/ATen/native` | Eager by default, graph via [torch.compile](/wiki/torch_compile) |
| [JAX](/wiki/jax) | `jax.numpy` (traced to HLO) | XLA HLO primitives | Traced and JIT-compiled with [XLA](/wiki/xla) |
| [ONNX](/wiki/onnx) | `ai.onnx`, `ai.onnx.ml` | Versioned opsets | Interchange format consumed by runtimes |

## Categories of ops

Almost every modern framework groups ops into a handful of broad categories. The exact names differ, but the categories are stable across TensorFlow, PyTorch, JAX, and ONNX.

| Category | What it does | Common examples |
|---|---|---|
| Element-wise | Apply a function to each tensor element independently | [add](/wiki/add), mul, relu, sigmoid, tanh, gelu, exp, log |
| Reduction | Collapse one or more axes by summing, averaging, or selecting | sum, mean, max, min, argmax, argmin, var |
| Linear algebra | Matrix and tensor contractions | [matmul](/wiki/matmul), conv1d, conv2d, conv3d, einsum, bmm |
| Indexing and selection | Pick elements from a tensor by index or mask | gather, scatter, slice, masked_select, index_select |
| Shape manipulation | Change tensor shape without changing data | reshape, transpose, permute, stack, concat, squeeze, expand |
| Random | Draw samples from a distribution | normal, uniform, bernoulli, dropout |
| Communication | Move tensors between devices in distributed training | all_reduce, all_gather, broadcast, reduce_scatter |
| I/O | Read and write tensors to and from storage | tf.io.read_file, torch.load, jax.device_put |
| Normalization and pooling | Standard neural network building blocks | batch_norm, layer_norm, max_pool, avg_pool |
| Activation | Non-linear functions applied between layers | relu, gelu, silu, softmax, log_softmax |

A typical transformer forward pass consists almost entirely of ops from this table: matmul for the projections, softmax and reductions inside attention, layer_norm for normalization, and gelu or silu as the activation. Attention is often described as only a few lines of math, but it expands into dozens of ops once it is traced through PyTorch.

## Op fusion

Launching a GPU kernel has overhead, and reading and writing to GPU global memory (HBM) is slow compared to on-chip SRAM. If the network executes one tiny op per kernel launch, both effects dominate. Op fusion combines several adjacent ops into a single kernel that does the work in one pass, keeping intermediate values in registers or shared memory.
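
A rough sketch of the difference, using Python lists in place of GPU buffers. The function names and the `buffers_written` counter are illustrative assumptions; the point is only that the fused version produces the same result while writing one output buffer instead of three.

```python
def scale_bias_relu_unfused(x, scale, bias):
    """Three separate 'kernels'; each one writes a full intermediate buffer."""
    buffers_written = 0
    scaled = [v * scale for v in x]
    buffers_written += 1
    shifted = [v + bias for v in scaled]
    buffers_written += 1
    out = [max(v, 0.0) for v in shifted]
    buffers_written += 1
    return out, buffers_written


def scale_bias_relu_fused(x, scale, bias):
    """One 'kernel': intermediates live only in local variables,
    which play the role of registers here."""
    out = [max(v * scale + bias, 0.0) for v in x]
    return out, 1


x = [-2.0, -0.5, 0.0, 1.5, 3.0]
unfused_out, unfused_writes = scale_bias_relu_unfused(x, 2.0, 1.0)
fused_out, fused_writes = scale_bias_relu_fused(x, 2.0, 1.0)
```

On real hardware the saving comes from fewer kernel launches and less HBM traffic rather than from fewer Python lists, but the structure of the rewrite is the same.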

### Conv-BN-ReLU

The canonical inference fusion is conv-bn-relu. During inference, [batch normalization](/wiki/batch_normalization) reduces to a per-channel scale and shift, which can be folded directly into the preceding convolution's weights and bias.[19] The activation can then be applied in the same kernel. The result is one kernel launch instead of three, no intermediate tensor written to memory, and faster execution. Reported speedups range from roughly 1.5x on edge microcontrollers to nearly 3x on individual layers, with no loss of accuracy.[20]
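
The folding step is simple arithmetic. With per-channel scale `s = gamma / sqrt(var + eps)`, the folded weights are `w * s` and the folded bias is `(b - mean) * s + beta`. The sketch below checks this on a single-channel 1-D convolution in plain Python; the helper names and the numeric values are illustrative assumptions, not taken from any library.

```python
import math


def conv1d(x, w, b):
    """Valid 1-D cross-correlation with one input and one output channel."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) + b
            for i in range(len(x) - k + 1)]


def batch_norm_inference(z, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch norm: a fixed scale and shift."""
    s = gamma / math.sqrt(var + eps)
    return [s * (v - mean) + beta for v in z]


def relu(z):
    return [max(v, 0.0) for v in z]


def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Return conv weights and bias that already include the batch norm."""
    s = gamma / math.sqrt(var + eps)
    return [wj * s for wj in w], (b - mean) * s + beta


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w, b = [0.5, -1.0, 0.25], 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 2.0

# Three ops: conv, then batch norm, then ReLU.
reference = relu(batch_norm_inference(conv1d(x, w, b), gamma, beta, mean, var))

# Two ops after folding; a fused kernel would also apply ReLU in the same pass.
w_folded, b_folded = fold_bn_into_conv(w, b, gamma, beta, mean, var)
folded = relu(conv1d(x, w_folded, b_folded))
```

The two results agree up to floating-point rounding, which is why this fusion does not change model accuracy.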

### FlashAttention

The headline modern example is FlashAttention (Dao et al., 2022). Standard attention computes the full N x N matrix of query-key scores, applies softmax, and multiplies by the value matrix, materializing an O(N^2) intermediate that quickly dominates memory at long context lengths. FlashAttention fuses the matmul, the softmax, and the second matmul into a single tiled kernel that streams blocks of queries, keys, and values through on-chip SRAM and never writes the full attention matrix to HBM. The original paper reported up to a 7.6x speedup of the attention computation on GPT-2 and reduced memory use from quadratic to linear in sequence length.[11] FlashAttention-2 (2023)[12] and FlashAttention-3 (2024)[13] extended the idea with better parallelism and asynchronous, low-precision execution on Hopper GPUs.
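
The key trick that makes the tiling possible is an online softmax: a running maximum and running sum are updated block by block, and earlier partial results are rescaled when the maximum changes. The sketch below shows this for a single query vector in plain Python. It is a simplified illustration of the idea only, not the FlashAttention kernel; the function names and the `block_size` default are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Materializes every score before the softmax (the O(N) row of the N x N matrix)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[j] for e, v in zip(exps, values)) / total for j in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Streams keys and values in blocks; keeps only O(block_size) scores at a time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized weighted sum of values
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [dot(q, k) * scale for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (exp(-inf) is 0.0, so the first block starts from zero).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            p = math.exp(s - new_max)
            running_sum += p
            for j in range(dim):
                acc[j] += p * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.1, 0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [3.0, -1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
expected = naive_attention(q, keys, values)
streamed = tiled_attention(q, keys, values, block_size=2)
```

Both functions return the same output up to rounding, which is why the paper describes the method as exact attention rather than an approximation.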

### Other common fusions

| Fused op | What it combines | Where it shows up |
|---|---|---|
| Linear + bias + activation | matmul, add, relu/gelu | MLP blocks, often in cuBLASLt or [Triton](/wiki/nvidia_triton) |
| RMSNorm | element-wise square, mean, rsqrt, multiply | LLaMA-style transformer blocks |
| Fused Adam | parameter update combining moments and weight decay | Optimizer step in mixed precision training |
| Fused softmax + cross entropy | softmax then negative log likelihood | Classification training loops |
| Fused conv-bn-relu | convolution, batch norm, ReLU | CNN inference |
| FlashAttention | matmul, scale, softmax, matmul, masking | Transformer attention |
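
As one concrete case from the table, fused softmax plus cross entropy can be written with the log-sum-exp identity, so the probability vector is never materialized: the loss equals `logsumexp(logits) - logits[target]`. The plain-Python sketch below compares the two formulations; the function names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_unfused(logits, target):
    """Two ops: a full softmax, then a negative log of one entry."""
    probs = softmax(logits)
    return -math.log(probs[target])


def cross_entropy_fused(logits, target):
    """One pass: log-sum-exp minus the target logit, no probability vector."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


logits = [2.0, -1.0, 0.5, 3.0]
loss_unfused = cross_entropy_unfused(logits, target=3)
loss_fused = cross_entropy_fused(logits, target=3)
```

The fused form is also better behaved numerically, because it avoids taking the log of a probability that may have underflowed to zero.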

[Triton](/wiki/nvidia_triton) and NVIDIA's CUTLASS are the two most popular ways to write custom fused ops today. Triton is a Python-embedded DSL introduced by Tillet, Kung, and Cox in 2019 that lets researchers write tile-based GPU kernels without learning [CUDA](/wiki/cuda);[14] it can produce matmul and convolution kernels competitive with cuBLAS and [cuDNN](/wiki/cudnn).[15] CUTLASS is NVIDIA's C++ template library of building blocks for the same job.

## Compilers and intermediate representations

Fusion in modern systems is usually not done by hand for every shape. It is done by a compiler that takes a graph of high level ops and rewrites it into a smaller graph of fused, hardware-specific kernels. The intermediate representations (IRs) that compilers use have become an active subfield of ML systems.

| Compiler / IR | Used by | What it does |
|---|---|---|
| [XLA](/wiki/xla) (HLO, StableHLO) | TensorFlow, [JAX](/wiki/jax), PyTorch/XLA | Lowers framework ops into HLO primitives, then performs target-independent passes (CSE, fusion, buffer analysis) and target-specific passes for GPU, CPU, and TPU |
| TorchInductor | [PyTorch](/wiki/pytorch) 2.x via [torch.compile](/wiki/torch_compile) | Lowers PyTorch FX graphs into a compact ~50-op IR and generates [Triton](/wiki/nvidia_triton) kernels for GPU and C++/OpenMP code for CPU |
| [TVM](/wiki/tvm) | Apache TVM | Separates compute from schedule using a tensor expression language; uses a learned cost model to search for fast schedules across CPU, mobile GPU, and server GPU |
| MLIR | LLVM project, used inside XLA, IREE, Mojo | Provides a multi-level IR with user-defined dialects, allowing tensor programs to be progressively lowered through several abstraction levels in a single compilation pipeline |

XLA was originally a TensorFlow project; it now lives in the cross-vendor OpenXLA initiative[8] and accepts StableHLO as its frontend op set.[9] TorchInductor, the default backend for `torch.compile`, was designed to be implemented in Python so researchers can extend it without touching C++.[10] TVM came from the Chen et al. OSDI 2018 paper and pioneered the idea of treating schedule search as a learning problem.[16] MLIR (Lattner et al.) generalizes the LLVM idea of a fixed IR into a system of stacked dialects,[17] and now underpins XLA, IREE, the Mojo language, and a growing number of other compilers in the LLVM ecosystem.[18]
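
A compiler fusion pass can be sketched as a rewrite over a list of ops. The toy pass below groups runs of consecutive element-wise ops into one fused kernel. It assumes a purely linear chain where each op consumes the previous op's output, whereas real compilers work on a full dataflow graph; the `ELEMENTWISE` set, the function names, and the `fused(...)` naming are hypothetical.

```python
ELEMENTWISE = {"add", "mul", "relu", "gelu", "exp"}


def _flush(run, kernels):
    """Emit the pending run of element-wise ops as one kernel."""
    if len(run) > 1:
        kernels.append("fused(" + "+".join(run) + ")")
    elif run:
        kernels.append(run[0])


def fuse_elementwise(ops):
    """Rewrite a linear chain of ops into a shorter chain of kernels."""
    kernels, run = [], []
    for op in ops:
        if op in ELEMENTWISE:
            run.append(op)
        else:
            _flush(run, kernels)
            run = []
            kernels.append(op)  # non-element-wise ops keep their own kernel
    _flush(run, kernels)
    return kernels


mlp_block = ["matmul", "add", "gelu", "matmul", "add", "layer_norm"]
kernels = fuse_elementwise(mlp_block)
# kernels == ["matmul", "fused(add+gelu)", "matmul", "add", "layer_norm"]
```

Six ops become five kernels here. Production compilers go further, for example by fusing element-wise epilogues into the preceding matmul.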

## Custom ops

When a researcher needs an op that the framework does not provide, or a much faster version of one it does, they write a custom op. The general pattern is similar across frameworks:

1. Implement the forward computation in C++/CUDA, [Triton](/wiki/nvidia_triton), or another GPU language.
2. Register the op with the framework, giving it a name, an input/output signature, and binding the kernel to a device.
3. Register a gradient rule so the op composes with [automatic differentiation](/wiki/automatic_differentiation).
4. Optionally provide shape inference so graph compilers can reason about it.

In TensorFlow this means writing an `OpKernel` subclass and calling `REGISTER_KERNEL_BUILDER`.[1] In PyTorch it means using `TORCH_LIBRARY` and registering kernels with the dispatcher (`m.impl("my_op", &my_kernel)`), then writing a Python autograd `Function` for the backward pass.[3] JAX users define a primitive with `jax.core.Primitive` (exposed as `jax.extend.core.Primitive` in recent releases) and register lowering rules to HLO or CUDA. Custom ops are how nearly every state-of-the-art kernel (FlashAttention, paged attention in vLLM, fused MoE kernels) actually reaches users.
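
The four steps above can be mirrored in a framework-free sketch. Everything here is hypothetical: the `OPS` registry, `register_op`, and the op name `my_silu` are illustrative and do not correspond to any real framework's registration API. SiLU is used as the example op because its gradient has a simple closed form.

```python
import math

# Hypothetical registry: op name -> forward, backward, and shape functions.
OPS = {}


def register_op(name, forward, backward, infer_shape):
    OPS[name] = {"forward": forward, "backward": backward, "shape": infer_shape}


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


# Step 1: forward computation, SiLU(x) = x * sigmoid(x), element-wise.
def silu_forward(x):
    return [v * sigmoid(v) for v in x]


# Step 3: gradient rule, d/dx SiLU(x) = s * (1 + x * (1 - s)) with s = sigmoid(x).
def silu_backward(x, grad_out):
    return [g * sigmoid(v) * (1.0 + v * (1.0 - sigmoid(v)))
            for v, g in zip(x, grad_out)]


# Step 4: shape inference, an element-wise op preserves the input shape.
def silu_shape(input_shape):
    return input_shape


# Step 2: register the op under a name.
register_op("my_silu", silu_forward, silu_backward, silu_shape)


def finite_difference(name, x, eps=1e-5):
    """Central-difference estimate of the element-wise derivative."""
    fwd = OPS[name]["forward"]
    return [(fwd([v + eps])[0] - fwd([v - eps])[0]) / (2.0 * eps) for v in x]


x = [-2.0, -0.5, 0.0, 1.0, 3.0]
analytic = OPS["my_silu"]["backward"](x, [1.0] * len(x))
numeric = finite_difference("my_silu", x)
```

Comparing the analytic gradient with a finite-difference estimate is a common sanity check for a hand-written backward pass; frameworks provide their own versions of this check.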

## Op-level optimizations

Even without writing a brand new op, there is a lot of performance to be had at the op level. Kernel selection picks among dozens of cuBLAS or CUTLASS variants for matmul, parameterized by tile size, split-k, and Tensor Core usage; frameworks autotune on first call and cache the choice. Layout rewrites convert convolutions to NHWC because Tensor Cores prefer it. Mixed precision selects float16, bfloat16, or FP8 variants of an op and inserts casts in the right places. SIMD dispatch sends CPU ops to AVX2, AVX-512, NEON, or AMX kernels depending on the host. Memory placement choices such as pinned memory and zero-copy transfers also live at the op layer.

## Modern context: LLM inference

Large language model serving has put an unusual amount of pressure on op-level engineering. Inference engines such as vLLM, Hugging Face Text Generation Inference (TGI), and llama.cpp ship heavily fused custom ops because every microsecond per token matters at scale.

A few representative examples follow. Paged attention (vLLM, 2023) treats the KV cache like virtual memory pages, with custom CUDA kernels that read keys and values from non-contiguous blocks.[21] Grouped-query attention (GQA) and multi-query attention (MQA) are op-level rewrites that share key and value heads across multiple query heads, shrinking the KV cache and the memory traffic per decoded token and enabling faster decoding for models like Llama 3 and Mistral.[22] Fused MoE kernels combine the routing softmax, top-k selection, and expert matmul into one kernel for mixture of experts models. Quantized matmul ops (INT8, INT4, FP8) are written specifically for GEMV-shaped single-token decode workloads, which look very different from the GEMM-shaped training workloads classical libraries were tuned for.
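
The paging idea can be illustrated with a small block-table sketch in plain Python. It is not vLLM's implementation: the `PagedKVCache` class, its method names, and the `BLOCK_SIZE` value are hypothetical, and tuples stand in for key and value tensors.

```python
BLOCK_SIZE = 4  # tokens per physical block (hypothetical value)


class PagedKVCache:
    """Toy KV cache: each sequence maps logical positions to physical blocks."""

    def __init__(self, num_blocks):
        self.blocks = [[None] * BLOCK_SIZE for _ in range(num_blocks)]
        self.free = list(range(num_blocks - 1, -1, -1))  # pop() hands out block 0 first
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of cached tokens

    def append(self, seq_id, kv):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:
            # No block yet, or the last block is full: allocate a new one.
            table.append(self.free.pop())
        self.blocks[table[n // BLOCK_SIZE]][n % BLOCK_SIZE] = kv
        self.lengths[seq_id] = n + 1

    def gather(self, seq_id):
        """Read the sequence's entries in logical order via its block table."""
        table = self.block_tables[seq_id]
        return [self.blocks[table[i // BLOCK_SIZE]][i % BLOCK_SIZE]
                for i in range(self.lengths[seq_id])]


cache = PagedKVCache(num_blocks=4)
for t in range(5):
    # Two sequences decode in lockstep, so their blocks interleave in memory.
    cache.append("a", ("a", t))
    cache.append("b", ("b", t))

table_a = cache.block_tables["a"]   # [0, 2]: non-contiguous physical blocks
tokens_a = cache.gather("a")
```

Sequence `a` ends up in physical blocks 0 and 2, yet `gather` still returns its tokens in order. An attention kernel that supports paging performs this indirection on the GPU while it reads keys and values.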

On the training side, [torch.compile](/wiki/torch_compile) in PyTorch 2.x rewrites the user's model into a smaller set of fused [Triton](/wiki/nvidia_triton) kernels, often eliminating dozens of small ops in a transformer block. Results vary with model and hardware, but improvements of roughly 30 to 80 percent on a training step are possible, mostly from op fusion and better kernel selection rather than algorithmic changes.

## Limitations and tradeoffs

Large op libraries are a mixed blessing. PyTorch's two-thousand-plus ops make the framework expressive but also make it heavy: every new hardware backend has to either implement, decompose, or fall back for each one. New chips like Google TPU v5, AMD MI300, and NVIDIA B200 each require months of kernel porting before they can run state-of-the-art models efficiently. The compiler stacks (XLA, TorchInductor, [TVM](/wiki/tvm)) try to soften this by reducing the surface area to a small set of "core" ops (PrimTorch in PyTorch, the core ATen op set in ExecuTorch, HLO in XLA) that backends must implement; everything else is decomposed into those core ops.
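
The decomposition approach can be sketched with a toy backend that implements only a handful of core ops and expands everything else into them. The op names, the `CORE_OPS` and `DECOMPOSITIONS` tables, and the `run` function are hypothetical and far simpler than PrimTorch or HLO; lists stand in for tensors.

```python
import math

# The only ops this toy backend implements natively.
CORE_OPS = {
    "exp": lambda x: [math.exp(v) for v in x],
    "max": lambda x: max(x),
    "sum": lambda x: sum(x),
    "sub_scalar": lambda x, c: [v - c for v in x],
    "div_scalar": lambda x, c: [v / c for v in x],
}


def softmax_decomposition(call, x):
    """softmax(x) written purely in terms of core ops."""
    shifted = call("sub_scalar", x, call("max", x))
    exps = call("exp", shifted)
    return call("div_scalar", exps, call("sum", exps))


# Composite ops: name -> function that expands the op into core ops.
DECOMPOSITIONS = {"softmax": softmax_decomposition}


def run(op, *args):
    if op in CORE_OPS:
        return CORE_OPS[op](*args)
    if op in DECOMPOSITIONS:
        return DECOMPOSITIONS[op](run, *args)
    raise NotImplementedError(f"op {op!r} has no kernel and no decomposition")


probs = run("softmax", [1.0, 2.0, 3.0])
```

A new backend then only needs kernels for the entries in `CORE_OPS`; the cost is that a decomposed op may run slower than a dedicated kernel unless a compiler fuses the pieces back together.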

There is also a tension between having one big fused op (fast but inflexible) and many small ops (slower but easier to compose). FlashAttention, for example, is a single huge kernel; if you want to insert a custom mask or score modifier, you need a new variant. Generic compilers can in principle generate the fused kernel on demand, but in practice hand-tuned kernels still win on the hottest workloads.

## Explain like I'm 5 (ELI5)

Imagine you are playing with building blocks to create a tower. Each block is a basic task you need to do, like adding numbers, comparing them, or stretching them into a longer row. In machine learning, those basic tasks are called operations, or ops. A complicated AI model is just a tower of these ops stacked on top of each other.

Sometimes, instead of using three separate small blocks, you can glue them together into one bigger block that does the same job faster. That is called fusion, and it is a big part of why modern AI models can run on phones and laptops at all.

## See also

- [TensorFlow](/wiki/tensorflow)
- [PyTorch](/wiki/pytorch)
- [JAX](/wiki/jax)
- [XLA](/wiki/xla)
- [ONNX](/wiki/onnx)
- [Tensor](/wiki/tensor)
- [Computational graph](/wiki/computational_graph)
- [Kernel (machine learning)](/wiki/machine_learning)
- [Triton](/wiki/nvidia_triton)
- [TVM](/wiki/tvm)
- [cuDNN](/wiki/cudnn)
- [CUDA](/wiki/cuda)
- [Automatic differentiation](/wiki/automatic_differentiation)
- [torch.compile](/wiki/torch_compile)

## References

1. TensorFlow documentation. "Create an op." https://www.tensorflow.org/guide/create_op
2. TensorFlow API reference. "tf.Operation." https://www.tensorflow.org/api_docs/python/tf/Operation
3. PyTorch documentation. "Operator Registration." https://docs.pytorch.org/docs/stable/accelerator/operators.html
4. PyTorch wiki. "PyTorch dispatcher walkthrough." https://github.com/pytorch/pytorch/wiki/PyTorch-dispatcher-walkthrough
5. PyTorch GitHub. "aten/src/ATen/native/README.md." https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/README.md
6. ONNX documentation. "ONNX Operators." https://onnx.ai/onnx/operators/
7. ONNX documentation. "ONNX Versioning." https://onnx.ai/onnx/repo-docs/Versioning.html
8. OpenXLA Project. "XLA architecture." https://openxla.org/xla/architecture
9. OpenXLA Project. "StableHLO." https://github.com/openxla/stablehlo
10. PyTorch Developer Mailing List. "TorchInductor: a PyTorch-native Compiler with Define-by-Run IR and Symbolic Shapes." https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747
11. Dao, T., Fu, D., Ermon, S., Rudra, A., and Ré, C. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. https://arxiv.org/pdf/2205.14135
12. Dao, T. "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." 2023. https://tridao.me/publications/flash2/flash2.pdf
13. Shah, J. et al. "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision." NeurIPS 2024. https://tridao.me/publications/flash3/flash3.pdf
14. Tillet, P., Kung, H. T., and Cox, D. "Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations." MAPL 2019. https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf
15. OpenAI. "Introducing Triton: Open-source GPU programming for neural networks." https://openai.com/index/triton/
16. Chen, T. et al. "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning." OSDI 2018. https://www.usenix.org/system/files/osdi18-chen.pdf
17. MLIR project. "MLIR Language Reference." https://mlir.llvm.org/docs/LangRef/
18. TensorFlow Blog. "MLIR: A new intermediate representation and compiler framework." https://blog.tensorflow.org/2019/04/mlir-new-intermediate-representation.html
19. PyTorch tutorials. "Building a Convolution/Batch Norm fuser with torch.compile." https://docs.pytorch.org/tutorials/intermediate/torch_compile_conv_bn_fuser.html
20. Lei Mao. "Neural Network Batch Normalization Fusion." https://leimao.github.io/blog/Neural-Network-Batch-Normalization-Fusion/
21. Kwon, W. et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. https://arxiv.org/abs/2309.06180
22. Ainslie, J. et al. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." EMNLP 2023.

