ONNX (Open Neural Network Exchange) is an open standard format for representing machine learning models. Announced in September 2017 by Facebook (now Meta) and Microsoft, ONNX defines a common set of operators and a file format that enables AI developers to move trained models between different frameworks, tools, runtimes, and hardware platforms. A model trained in PyTorch can be exported to ONNX format and then deployed using a completely different runtime, on a different hardware platform, without rewriting the model code. This interoperability has made ONNX the de facto standard for model exchange in the machine learning ecosystem.
ONNX originated from a project called Toffee, developed by the PyTorch team at Facebook. In September 2017, Facebook and Microsoft jointly announced ONNX as an open standard for ML model interoperability [1]. The initial motivation was straightforward: researchers often train models in one framework but need to deploy them in a different environment. At the time, moving a model from PyTorch to a production serving system, or from TensorFlow to a mobile runtime, required painful and error-prone manual conversion.
In December 2017, ONNX version 1.0 was released with additional support from Amazon Web Services and other partners, marking its transition to a production-ready standard [2]. The project quickly attracted broad industry backing. IBM, Huawei, Intel, AMD, Arm, and Qualcomm all announced support for the initiative.
In November 2019, ONNX was accepted as a graduated project under the Linux Foundation AI & Data (LF AI & Data) umbrella, giving it vendor-neutral governance and ensuring that no single company controls the standard's direction [3].
Since then, ONNX has evolved through regular releases on a roughly four-month cadence. The standard has expanded from its original focus on deep learning models (particularly convolutional neural networks and recurrent neural networks) to support a much broader range of ML models, including traditional machine learning algorithms, transformer architectures, and diffusion models.
At its core, ONNX is a specification for representing machine learning models as computation graphs. Understanding ONNX requires understanding three key concepts: the graph format, operators and opsets, and the serialization format.
An ONNX model represents a computation as a directed acyclic graph (DAG). Each node in the graph represents a call to an operator (such as a matrix multiplication, a convolution, or a ReLU activation). Edges represent the flow of data (tensors) between operators. The graph has defined inputs (the data that flows in, such as an image or a text embedding) and defined outputs (the model's predictions).
The graph is structured as a topologically sorted list of nodes, meaning that every node appears after the nodes whose outputs it depends on. This ordering ensures that the graph can be executed sequentially from top to bottom, or that a runtime can determine a valid execution order without additional analysis.
Beyond the computation graph, an ONNX model file also contains metadata about the model (its name, domain, version, description) and a list of trained parameters (weights and biases stored as tensor constants).
ONNX defines a standard set of operators that cover the building blocks of machine learning models. These include:
| Category | Example operators |
|---|---|
| Tensor operations | Reshape, Transpose, Concat, Gather, Slice |
| Math operations | MatMul, Add, Mul, Div, Sqrt, Exp, Log |
| Neural network layers | Conv, BatchNormalization, LSTM, GRU |
| Activation functions | Relu, Sigmoid, Tanh, Softmax, GELU |
| Normalization | LayerNormalization, GroupNormalization, InstanceNormalization |
| Pooling | MaxPool, AveragePool, GlobalAveragePool |
| Attention (newer opsets) | Attention, RotaryEmbedding |
Operators are grouped into operator sets (opsets), where each opset is identified by a domain and a version number. The default domain (ai.onnx) contains the standard operators. The opset version is a monotonically increasing integer; when an operator is added, removed, or modified, the opset version increases. This versioning mechanism ensures backward compatibility: a model exported with opset 17 will continue to work even as newer opsets are released, because runtimes can support multiple opset versions simultaneously.
As of early 2026, the latest ONNX release is version 1.20.1 (January 2026), with version 1.21.0 in release candidate stage [4]. Recent opsets have added operators for modern architectures, including RotaryEmbedding, along with improved broadcasting support for LayerNormalization and RMSNormalization.
ONNX models are serialized using Protocol Buffers (protobuf), Google's language-neutral, platform-neutral serialization format. The protobuf schema is defined in files within the ONNX repository (onnx/*.proto). The top-level construct is a ModelProto, which contains:
- ir_version: the version of the ONNX intermediate representation
- opset_import: which opset versions the model uses
- graph: the computation graph (GraphProto), which contains nodes, inputs, outputs, and initializers (trained weights)

The protobuf format provides compact binary serialization, which is important because ONNX model files can be very large (billions of parameters in modern large language models). The format also provides well-defined schemas and cross-language support, making it straightforward to read and write ONNX files from Python, C++, C#, Java, and other languages.
Getting a trained model into ONNX format requires an export step that traces the model's computation graph and translates it into ONNX operators. The major ML frameworks each provide their own export mechanisms.
PyTorch provides torch.onnx.export() as the primary export mechanism. Starting with PyTorch 2.5, there are two export backends:
- TorchScript-based exporter (the legacy default): traces the model with the TorchScript tracer to capture the computation graph.
- Dynamo-based exporter (torch.onnx.export(..., dynamo=True)): the recommended approach, which leverages torch.export and Torch FX for graph capture. This method produces cleaner ONNX graphs and handles dynamic control flow better than the legacy approach.

A typical PyTorch-to-ONNX export looks like this:
```python
import torch

model = MyModel()
model.eval()  # ensure inference behavior for layers like dropout and batch norm
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx", dynamo=True)
```
The export process traces the model's execution with the dummy input to capture the computation graph, then maps each PyTorch operation to the corresponding ONNX operator.
TensorFlow models are converted to ONNX using the tf2onnx tool, maintained as part of the ONNX organization on GitHub. tf2onnx supports TensorFlow 2.x, Keras, TensorFlow.js, and TFLite models. It can be used as a command-line tool or as a Python API [6].
```shell
python -m tf2onnx.convert --saved-model ./my_model --output model.onnx
```
tf2onnx supports ONNX opsets 14 through 18, covering the vast majority of TensorFlow operations.
ONNX converters exist for most popular ML frameworks:
| Source framework | Converter tool | Notes |
|---|---|---|
| PyTorch | torch.onnx.export (built-in) | Recommended: dynamo=True |
| TensorFlow/Keras | tf2onnx | Supports SavedModel, Keras, TFLite |
| scikit-learn | sklearn-onnx (skl2onnx) | Converts traditional ML models |
| XGBoost/LightGBM | onnxmltools | Tree-based model conversion |
| JAX | jax2onnx / ndonnx | Emerging support via Array API standard |
| PaddlePaddle | paddle2onnx | Baidu's framework |
| MATLAB | Built-in Deep Learning Toolbox | exportONNXNetwork function |
The breadth of converter support reflects ONNX's position as the common interchange format for the ML ecosystem.
While ONNX itself is just a file format specification, ONNX Runtime (ORT) is the high-performance inference engine developed by Microsoft for executing ONNX models. ONNX Runtime is the most widely used ONNX execution engine and is a separate open-source project from the ONNX standard itself [7].
ONNX Runtime is designed as a cross-platform inference engine with a pluggable architecture. At its core is a graph execution engine that loads an ONNX model, applies optimizations, and dispatches operations to one or more Execution Providers (EPs). Each EP implements the ONNX operators for a specific hardware backend:
| Execution Provider | Hardware target | Typical use case |
|---|---|---|
| CPU | x86/ARM CPUs | Default fallback, development |
| CUDA | NVIDIA GPUs | Training and inference on NVIDIA hardware |
| TensorRT | NVIDIA GPUs | Optimized inference with NVIDIA TensorRT |
| DirectML | Windows GPUs | GPU inference on Windows (any vendor) |
| OpenVINO | Intel CPUs/GPUs/VPUs | Optimized inference on Intel hardware |
| CoreML | Apple Silicon | Inference on Mac and iOS |
| NNAPI | Android NPUs | On-device inference on Android |
| ROCm | AMD GPUs | Inference on AMD hardware |
| QNN | Qualcomm NPUs | Inference on Qualcomm AI accelerators |
| WebNN | Browser NPUs | Browser-based NPU-accelerated inference |
This EP architecture means that the same ONNX model can run on vastly different hardware simply by selecting a different Execution Provider, without any changes to the model itself.
Before executing a model, ONNX Runtime applies a series of graph-level optimizations that can significantly improve inference performance. These optimizations are applied in multiple passes:
Operator fusion combines multiple sequential operations into a single optimized kernel. For example, a Conv followed by BatchNormalization followed by ReLU can be fused into a single ConvBnRelu kernel that reads the input once, performs all three operations, and writes the output once. This eliminates intermediate memory reads and writes, which are often the performance bottleneck. For transformer models, attention-specific fusions combine multiple matrix multiplications, softmax, and masking operations into a single fused multi-head attention kernel [8].
Constant folding pre-computes operations whose inputs are all constants (known at model load time). For example, if a model contains a reshape operation on a constant tensor, ONNX Runtime computes the result during initialization and replaces the operation with its output.
Layout optimization transforms tensor memory layouts to match what the hardware and kernels prefer. For example, some GPU kernels perform better with NCHW (batch, channels, height, width) layout while others prefer NHWC.
Memory planning analyzes the lifetime of intermediate tensors and allocates memory efficiently, reusing buffers when tensors' lifetimes do not overlap.
ONNX Runtime provides three optimization levels (basic, extended, and all) that users can configure when creating an inference session. Higher levels apply more aggressive optimizations that take longer at session creation but yield faster inference.
Quantization is one of the most impactful optimizations for inference deployment. ONNX Runtime supports multiple quantization approaches, including dynamic quantization (weights are quantized ahead of time while activation quantization parameters are computed at runtime) and static quantization (a calibration dataset is used to pre-compute activation ranges).
Quantization typically converts FP32 models to INT8, providing 2-4x speedup with minimal accuracy loss for most models. For large language models, ONNX Runtime also supports INT4 quantization through techniques like GPTQ and AWQ [9].
Beyond server-side inference, ONNX Runtime has expanded to support edge and client-side deployment through specialized variants.
ONNX Runtime Web (ORT Web) enables running ONNX models directly in web browsers. It provides three execution backends: WebAssembly (wasm) for CPU execution, WebGPU for GPU acceleration, and WebNN for access to client NPUs.
ORT Web is distributed as an npm package (onnxruntime-web) and provides a JavaScript/TypeScript API. Running ML models in the browser offers several benefits: it eliminates server-client communication latency, protects user privacy by keeping data on-device, and provides a cross-platform, install-free experience.
ONNX Runtime Mobile is optimized for deployment on Android and iOS devices. It uses the same API as the server-side runtime, allowing developers to use a consistent interface across deployment targets. Mobile-specific optimizations include reduced binary size (custom builds that compile in only the operators a given model needs) and the compact ORT model format, designed for fast loading on constrained devices.
Developers can integrate ONNX Runtime Mobile into Android (Java/Kotlin), iOS (Objective-C/Swift), React Native, and MAUI/Xamarin applications [11].
ONNX has become deeply embedded in the machine learning ecosystem, serving as a bridge between training frameworks and deployment environments.
Virtually every major hardware vendor supports ONNX as an input format for their inference tools:
| Vendor | ONNX integration |
|---|---|
| NVIDIA | TensorRT can ingest ONNX models directly for GPU-optimized inference |
| Intel | OpenVINO uses ONNX as a primary input format |
| Qualcomm | AI Hub and QNN SDK accept ONNX models for mobile/edge deployment |
| AMD | ROCm and Vitis AI support ONNX model ingestion |
| Apple | Core ML Tools can convert ONNX models for Apple Silicon |
| AWS | SageMaker Neo compiles ONNX models for various hardware targets |
This broad vendor support means that exporting a model to ONNX effectively makes it deployable on nearly any hardware platform, which is precisely the interoperability that ONNX was designed to achieve.
All major cloud providers offer ONNX Runtime as a deployment option; Azure Machine Learning in particular supports ONNX models as a first-class deployment target.
The ONNX Model Zoo is a collection of pre-trained, state-of-the-art models in ONNX format. It includes models for computer vision (image classification, object detection, segmentation), natural language processing (language models, text classification), speech processing, and other domains. The Model Zoo serves both as a reference for ONNX compatibility testing and as a resource for developers who need pre-trained models for deployment.
Despite its broad adoption, ONNX has several known limitations.
Not every operation in every framework has a corresponding ONNX operator. When a model uses framework-specific operations that are not covered by the ONNX standard, the export process may fail or require custom operator definitions. This is particularly common with cutting-edge model architectures that use novel operations not yet standardized in ONNX.
ONNX models historically required fixed tensor shapes, which was problematic for models that process variable-length inputs (like text sequences of different lengths). While recent ONNX versions have improved dynamic shape support, some runtimes still perform better with fixed shapes. Complex dynamic control flow (if/else branches that depend on runtime data) can also be difficult to represent in ONNX's static graph format.
Very large models (billions of parameters) can produce ONNX files that are many gigabytes in size. While the protobuf format handles this, some tools and runtimes may struggle with loading and processing such large files. The ONNX community has worked on external data support (storing large tensors in separate files rather than embedded in the protobuf) to address this.
ONNX was originally designed for inference (deploying trained models), not training. While ONNX Runtime does include training capabilities (ONNX Runtime Training, or ORT Training), the training ecosystem is less mature than the inference ecosystem. Most users continue to train models in their preferred framework and export to ONNX only for deployment.
ONNX continues to evolve as the standard interchange format for machine learning models. Several trends define its current trajectory.
As large language models have come to dominate the AI landscape, ONNX and ONNX Runtime have expanded their support for LLM-specific features. ONNX Runtime includes optimized kernels for attention mechanisms, rotary positional embeddings, grouped-query attention, and other transformer building blocks. Microsoft has invested in making ONNX Runtime competitive for LLM inference, including INT4 quantization for models like Llama, Phi, and Mistral [13].
ONNX Runtime's cross-platform nature makes it well-suited for the growing edge AI market. With Execution Providers for NPUs from Qualcomm, Intel, and others, ONNX Runtime can take advantage of dedicated AI hardware on laptops, phones, and IoT devices. Microsoft's partnership with hardware vendors to integrate NPU accelerators into ONNX Runtime positions it as a key enabler of on-device AI inference.
The maturation of WebGPU and WebNN browser APIs has opened new possibilities for ONNX Runtime Web. WebGPU provides GPU access with lower overhead than WebGL, while WebNN enables direct access to NPUs on client devices. Together, these APIs make it feasible to run increasingly sophisticated models, including small language models, directly in the browser.
The ONNX ecosystem has reached a level of maturity where it is a default part of most ML deployment pipelines. The standard is supported by all major frameworks, all major hardware vendors, and all major cloud providers. While alternatives exist (such as TensorFlow's SavedModel format or PyTorch's TorchScript/torch.export), ONNX's vendor neutrality and broad support make it the most portable option for cross-platform deployment.
As of early 2026, ONNX 1.20.1 is the latest stable release, with version 1.21.0 in development. ONNX Runtime 1.22 is the latest runtime release, with ongoing work on performance optimization, expanded hardware support, and improved support for emerging model architectures [14].