ONNX (Open Neural Network Exchange) is an open standard format for representing machine learning models. Announced in September 2017 by Facebook (now Meta) and Microsoft, ONNX defines a common set of operators and a file format that enables AI developers to move trained models between different frameworks, tools, runtimes, and hardware platforms. A model trained in PyTorch can be exported to ONNX format and then deployed using a completely different runtime, on a different hardware platform, without rewriting the model code. This interoperability has made ONNX the de facto standard for model exchange in the machine learning ecosystem.
ONNX originated from a project called Toffee, developed by the PyTorch team at Facebook. In September 2017, Facebook and Microsoft jointly announced ONNX as an open standard for ML model interoperability [1]. The initial motivation was straightforward: researchers often train models in one framework but need to deploy them in a different environment. At the time, moving a model from PyTorch to a production serving system, or from TensorFlow to a mobile runtime, required painful and error-prone manual conversion.
In December 2017, ONNX version 1.0 was released with additional support from Amazon Web Services and other partners, marking its transition to a production-ready standard [2]. The project quickly attracted broad industry backing. IBM, Huawei, Intel, AMD, Arm, and Qualcomm all announced support for the initiative.
In November 2019, ONNX was accepted as a graduated project under the Linux Foundation AI & Data (LF AI & Data) umbrella, giving it vendor-neutral governance and ensuring that no single company controls the standard's direction [3].
Since then, ONNX has evolved through regular releases on a roughly four-month cadence. The standard has expanded from its original focus on deep learning models (particularly convolutional neural networks and recurrent neural networks) to support a much broader range of ML models, including traditional machine learning algorithms, transformer architectures, and diffusion models.
At its core, ONNX is a specification for representing machine learning models as computation graphs. Understanding ONNX requires understanding three key concepts: the graph format, operators and opsets, and the serialization format.
An ONNX model represents a computation as a directed acyclic graph (DAG). Each node in the graph represents a call to an operator (such as a matrix multiplication, a convolution, or a ReLU activation). Edges represent the flow of data (tensors) between operators. The graph has defined inputs (the data that flows in, such as an image or a text embedding) and defined outputs (the model's predictions).
The graph is structured as a topologically sorted list of nodes, meaning that every node appears after the nodes whose outputs it depends on. This ordering ensures that the graph can be executed sequentially from top to bottom, or that a runtime can determine a valid execution order without additional analysis.
Beyond the computation graph, an ONNX model file also contains metadata about the model (its name, domain, version, description) and a list of trained parameters (weights and biases stored as tensor constants).
ONNX defines a standard set of operators that cover the building blocks of machine learning models. These include:
| Category | Example operators |
|---|---|
| Tensor operations | Reshape, Transpose, Concat, Gather, Slice |
| Math operations | MatMul, Add, Mul, Div, Sqrt, Exp, Log |
| Neural network layers | Conv, BatchNormalization, LSTM, GRU |
| Activation functions | Relu, Sigmoid, Tanh, Softmax, GELU |
| Normalization | LayerNormalization, GroupNormalization, InstanceNormalization |
| Pooling | MaxPool, AveragePool, GlobalAveragePool |
| Attention (newer opsets) | Attention, RotaryEmbedding |
Operators are grouped into operator sets (opsets), where each opset is identified by a domain and a version number. The default domain (ai.onnx) contains the standard operators. The opset version is a monotonically increasing integer; when an operator is added, removed, or modified, the opset version increases. This versioning mechanism ensures backward compatibility: a model exported with opset 17 will continue to work even as newer opsets are released, because runtimes can support multiple opset versions simultaneously.
As of early 2026, the latest ONNX release is version 1.20.1 (January 2026), with version 1.21.0 in release candidate stage [4]. Recent opsets have added operators for modern architectures, including RotaryEmbedding, along with improved broadcasting support for LayerNormalization and RMSNormalization.
ONNX models are serialized using Protocol Buffers (protobuf), Google's language-neutral, platform-neutral serialization format. The protobuf schema is defined in files within the ONNX repository (onnx/*.proto). The top-level construct is a ModelProto, which contains:
- ir_version: the version of the ONNX intermediate representation
- opset_import: which opset versions the model uses
- graph: the computation graph (GraphProto), which contains nodes, inputs, outputs, and initializers (trained weights)

The protobuf format provides compact binary serialization, which is important because ONNX model files can be very large (billions of parameters in modern large language models). The format also provides well-defined schemas and cross-language support, making it straightforward to read and write ONNX files from Python, C++, C#, Java, and other languages.
Getting a trained model into ONNX format requires an export step that traces the model's computation graph and translates it into ONNX operators. The major ML frameworks each provide their own export mechanisms.
PyTorch provides torch.onnx.export() as the primary export mechanism. Starting with PyTorch 2.5, there are two export backends:
- TorchScript-based exporter (the legacy default): traces the model with the TorchScript tracer to capture the computation graph.
- Dynamo-based exporter (torch.onnx.export(..., dynamo=True)): the recommended approach, which leverages torch.export and Torch FX for graph capture. This method produces cleaner ONNX graphs and handles dynamic control flow better than the legacy approach.

A typical PyTorch-to-ONNX export looks like this:
```python
import torch

model = MyModel()
model.eval()  # ensure inference behavior for layers like dropout and batch norm
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx", dynamo=True)
```
The export process traces the model's execution with the dummy input to capture the computation graph, then maps each PyTorch operation to the corresponding ONNX operator.
TensorFlow models are converted to ONNX using the tf2onnx tool, maintained as part of the ONNX organization on GitHub. tf2onnx supports TensorFlow 2.x, Keras, TensorFlow.js, and TFLite models. It can be used as a command-line tool or as a Python API [6].
```shell
python -m tf2onnx.convert --saved-model ./my_model --output model.onnx
```
tf2onnx supports ONNX opsets 14 through 18, covering the vast majority of TensorFlow operations.
ONNX converters exist for most popular ML frameworks:
| Source framework | Converter tool | Notes |
|---|---|---|
| PyTorch | torch.onnx.export (built-in) | Recommended: dynamo=True |
| TensorFlow/Keras | tf2onnx | Supports SavedModel, Keras, TFLite |
| scikit-learn | sklearn-onnx (skl2onnx) | Converts traditional ML models |
| XGBoost/LightGBM | onnxmltools | Tree-based model conversion |
| JAX | jax2onnx / ndonnx | Emerging support via Array API standard |
| PaddlePaddle | paddle2onnx | Baidu's framework |
| MATLAB | Built-in Deep Learning Toolbox | exportONNXNetwork function |
The breadth of converter support reflects ONNX's position as the common interchange format for the ML ecosystem.
While ONNX itself is just a file format specification, ONNX Runtime (ORT) is the high-performance inference engine developed by Microsoft for executing ONNX models. ONNX Runtime is the most widely used ONNX execution engine and is a separate open-source project from the ONNX standard itself [7].
ONNX Runtime is designed as a cross-platform inference engine with a pluggable architecture. At its core is a graph execution engine that loads an ONNX model, applies optimizations, and dispatches operations to one or more Execution Providers (EPs). Each EP implements the ONNX operators for a specific hardware backend:
| Execution Provider | Hardware target | Typical use case |
|---|---|---|
| CPU | x86/ARM CPUs | Default fallback, development |
| CUDA | NVIDIA GPUs | Training and inference on NVIDIA hardware |
| TensorRT | NVIDIA GPUs | Optimized inference with NVIDIA TensorRT |
| DirectML | Windows GPUs | GPU inference on Windows (any vendor) |
| OpenVINO | Intel CPUs/GPUs/VPUs | Optimized inference on Intel hardware |
| CoreML | Apple Silicon | Inference on Mac and iOS |
| NNAPI | Android NPUs | On-device inference on Android |
| ROCm | AMD GPUs | Inference on AMD hardware |
| QNN | Qualcomm NPUs | Inference on Qualcomm AI accelerators |
| WebNN | Browser NPUs | Browser-based NPU-accelerated inference |
This EP architecture means that the same ONNX model can run on vastly different hardware simply by selecting a different Execution Provider, without any changes to the model itself.
Before executing a model, ONNX Runtime applies a series of graph-level optimizations that can significantly improve inference performance. These optimizations are applied in multiple passes:
Operator fusion combines multiple sequential operations into a single optimized kernel. For example, a Conv followed by BatchNormalization followed by ReLU can be fused into a single ConvBnRelu kernel that reads the input once, performs all three operations, and writes the output once. This eliminates intermediate memory reads and writes, which are often the performance bottleneck. For transformer models, attention-specific fusions combine multiple matrix multiplications, softmax, and masking operations into a single fused multi-head attention kernel [8].
Constant folding pre-computes operations whose inputs are all constants (known at model load time). For example, if a model contains a reshape operation on a constant tensor, ONNX Runtime computes the result during initialization and replaces the operation with its output.
Layout optimization transforms tensor memory layouts to match what the hardware and kernels prefer. For example, some GPU kernels perform better with NCHW (batch, channels, height, width) layout while others prefer NHWC.
Memory planning analyzes the lifetime of intermediate tensors and allocates memory efficiently, reusing buffers when tensors' lifetimes do not overlap.
ONNX Runtime provides three optimization levels (basic, extended, and all) that users can configure when creating an inference session. Higher levels apply more aggressive optimizations that take longer at session creation but yield faster inference.
Quantization is one of the most impactful optimizations for inference deployment. ONNX Runtime supports multiple quantization approaches, including dynamic quantization (weights are quantized ahead of time while activation quantization parameters are computed at runtime) and static quantization (a calibration dataset is used to pre-compute activation ranges).
Quantization typically converts FP32 models to INT8, providing 2-4x speedup with minimal accuracy loss for most models. For large language models, ONNX Runtime also supports INT4 quantization through techniques like GPTQ and AWQ [9].
Beyond server-side inference, ONNX Runtime has expanded to support edge and client-side deployment through specialized variants.
ONNX Runtime Web (ORT Web) enables running ONNX models directly in web browsers. It provides three execution backends: WebAssembly (wasm) for CPU execution, WebGPU for GPU acceleration, and WebNN for access to client NPUs.
ORT Web is distributed as an npm package (onnxruntime-web) and provides a JavaScript/TypeScript API. Running ML models in the browser offers several benefits: it eliminates server-client communication latency, protects user privacy by keeping data on-device, and provides a cross-platform, install-free experience.
ONNX Runtime Mobile is optimized for deployment on Android and iOS devices. It uses the same API as the server-side runtime, allowing developers to use a consistent interface across deployment targets. Mobile-specific optimizations include reduced binary size (custom builds that compile in only the operators a given model needs) and the compact ORT model format, designed for fast loading on constrained devices.
Developers can integrate ONNX Runtime Mobile into Android (Java/Kotlin), iOS (Objective-C/Swift), React Native, and MAUI/Xamarin applications [11].
ONNX has become deeply embedded in the machine learning ecosystem, serving as a bridge between training frameworks and deployment environments.
Virtually every major hardware vendor supports ONNX as an input format for their inference tools:
| Vendor | ONNX integration |
|---|---|
| NVIDIA | TensorRT can ingest ONNX models directly for GPU-optimized inference |
| Intel | OpenVINO uses ONNX as a primary input format |
| Qualcomm | AI Hub and QNN SDK accept ONNX models for mobile/edge deployment |
| AMD | ROCm and Vitis AI support ONNX model ingestion |
| Apple | Core ML Tools can convert ONNX models for Apple Silicon |
| AWS | SageMaker Neo compiles ONNX models for various hardware targets |
This broad vendor support means that exporting a model to ONNX effectively makes it deployable on nearly any hardware platform, which is precisely the interoperability that ONNX was designed to achieve.
All major cloud providers offer ONNX Runtime as a deployment option; Azure Machine Learning in particular supports ONNX models as a first-class deployment target.
The ONNX Model Zoo is a collection of pre-trained, state-of-the-art models in ONNX format. It includes models for computer vision (image classification, object detection, segmentation), natural language processing (language models, text classification), speech processing, and other domains. The Model Zoo serves both as a reference for ONNX compatibility testing and as a resource for developers who need pre-trained models for deployment.
Despite its broad adoption, ONNX has several known limitations.
Not every operation in every framework has a corresponding ONNX operator. When a model uses framework-specific operations that are not covered by the ONNX standard, the export process may fail or require custom operator definitions. This is particularly common with cutting-edge model architectures that use novel operations not yet standardized in ONNX.
ONNX models historically required fixed tensor shapes, which was problematic for models that process variable-length inputs (like text sequences of different lengths). While recent ONNX versions have improved dynamic shape support, some runtimes still perform better with fixed shapes. Complex dynamic control flow (if/else branches that depend on runtime data) can also be difficult to represent in ONNX's static graph format.
Very large models (billions of parameters) can produce ONNX files that are many gigabytes in size. While the protobuf format handles this, some tools and runtimes may struggle with loading and processing such large files. The ONNX community has worked on external data support (storing large tensors in separate files rather than embedded in the protobuf) to address this.
ONNX was originally designed for inference (deploying trained models), not training. While ONNX Runtime does include training capabilities (ONNX Runtime Training, or ORT Training), the training ecosystem is less mature than the inference ecosystem. Most users continue to train models in their preferred framework and export to ONNX only for deployment.
ONNX continues to evolve as the standard interchange format for machine learning models. Several trends define its current trajectory.
As large language models have come to dominate the AI landscape, ONNX and ONNX Runtime have expanded their support for LLM-specific features. ONNX Runtime includes optimized kernels for attention mechanisms, rotary positional embeddings, grouped-query attention, and other transformer building blocks. Microsoft has invested in making ONNX Runtime competitive for LLM inference, including INT4 quantization for models like Llama, Phi, and Mistral [13].
ONNX Runtime's cross-platform nature makes it well-suited for the growing edge AI market. With Execution Providers for NPUs from Qualcomm, Intel, and others, ONNX Runtime can take advantage of dedicated AI hardware on laptops, phones, and IoT devices. Microsoft's partnership with hardware vendors to integrate NPU accelerators into ONNX Runtime positions it as a key enabler of on-device AI inference.
The maturation of WebGPU and WebNN browser APIs has opened new possibilities for ONNX Runtime Web. WebGPU provides GPU access with lower overhead than WebGL, while WebNN enables direct access to NPUs on client devices. Together, these APIs make it feasible to run increasingly sophisticated models, including small language models, directly in the browser.
The ONNX ecosystem has reached a level of maturity where it is a default part of most ML deployment pipelines. The standard is supported by all major frameworks, all major hardware vendors, and all major cloud providers. While alternatives exist (such as TensorFlow's SavedModel format or PyTorch's TorchScript/torch.export), ONNX's vendor neutrality and broad support make it the most portable option for cross-platform deployment.
As of early 2026, ONNX 1.20.1 is the latest stable release, with version 1.21.0 in development. ONNX Runtime 1.22 is the latest runtime release, with ongoing work on performance optimization, expanded hardware support, and improved support for emerging model architectures [14].