OpenVINO
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,346 words
OpenVINO (Open Visual Inference and Neural Network Optimization) is an open-source toolkit developed by Intel for optimizing and deploying deep learning inference across heterogeneous hardware targets, including x86 and Arm CPUs, Intel integrated and discrete GPUs, and Intel Neural Processing Units (NPUs).[1] Originally launched in May 2018 as a computer-vision toolkit for edge devices, the project has expanded into a general inference runtime that supports models from PyTorch, TensorFlow, ONNX and other frameworks and formats, and is now positioned as Intel's principal software vehicle for generative AI on client and edge silicon.[2][3] The toolkit is distributed under the Apache License 2.0 and developed in the open at github.com/openvinotoolkit/openvino.[4] Beyond the core runtime, the OpenVINO ecosystem includes the Neural Network Compression Framework (NNCF) for quantization and pruning, an OpenVINO Tokenizers extension, the OpenVINO Model Server (OVMS), and an OpenVINO GenAI library aimed at large language model (LLM) and vision-language workloads.[5][6][7][8]
| Field | Value |
|---|---|
| Original release | May 16, 2018 (Intel Distribution of OpenVINO Toolkit) |
| Developer | Intel Corporation |
| License | Apache License 2.0 |
| Source repository | github.com/openvinotoolkit/openvino |
| Latest release (at writing) | 2026.1.0 (April 7, 2026) |
| Languages | C++ core, Python and C bindings |
| Supported devices | Intel CPU (x86, Arm), Intel GPU (integrated, Arc discrete), Intel NPU (Core Ultra) |
| Model intake | OpenVINO IR, ONNX, PyTorch, TensorFlow, TensorFlow Lite, PaddlePaddle, JAX |
| Hugging Face integration | optimum-intel |
The Intel Distribution of OpenVINO toolkit was first announced and released on May 16, 2018, when Intel introduced it as a tool to "give developers improved neural network performance on a variety of Intel processors" for vision-centric edge workloads in cameras, IoT devices, and embedded systems.[2] The initial release framed OpenVINO as the successor to the older Intel Computer Vision SDK, bundling a Model Optimizer that consumed models from popular open-source frameworks such as Caffe, TensorFlow and MXNet, an Inference Engine, and pre-trained models from the Open Model Zoo. It targeted Intel CPUs and integrated GPUs along with specialized accelerators including FPGA cards and Movidius vision processing units (VPUs).[2][9]
The Movidius VPU branch entered the lineage through Intel's September 2016 acquisition of the Irish chip designer Movidius, co-founded in 2005 in Dublin by Sean Mitchell, David Moloney and Val Muresan. Movidius produced the Myriad 2 and Myriad X VPUs, low-power chips for computer vision that Intel later folded into OpenVINO as the MYRIAD plugin and offered as Neural Compute Stick devices. The Myriad 2 chip reached about 1 to 1.5 trillion operations per second; the 2017-era Myriad X added a dedicated Neural Compute Engine and pushed sustained throughput to roughly 4 trillion operations per second.[9] The same architectural team's "Keem Bay" / Vision Processing Unit work, detailed publicly in 2019, eventually evolved into the Neural Processing Unit (NPU) shipped in Intel Core Ultra client processors.[10]
The acronym OpenVINO expands to Open Visual Inference and Neural Network Optimization; the original 2018 framing was explicitly visual, with Intel announcing the toolkit alongside DevCloud for the Edge, a hosted prototyping environment that exposed combinations of Xeon processors, integrated graphics, FPGAs and Movidius VPUs so developers could try the same toolkit across different hardware without acquiring it physically.[2]
Across 2019, 2020 and 2021 Intel iterated the toolkit on a roughly quarterly cadence, with releases such as 2019 R1, 2020.1, 2021.1 and 2021.4 progressively broadening framework coverage and Linux distribution support.[11] An important transition during this period was the open-sourcing of the code: the openvinotoolkit/openvino GitHub repository was published in 2020 under the Apache License 2.0, replacing the previous binary-only Intel Distribution as the canonical source of truth and inviting external contributions.[1][4]
The 2022.1 release, shipped in March 2022, was a substantial rewrite that introduced the IRv11 intermediate representation and dropped support for the older IRv10 in the Post-Training Optimization Tool (POT); it expanded natural-language model coverage with more than 70 additional INT8-quantized models, retired support for Ubuntu 18.04 in favor of Ubuntu 20.04, and unified the device-portability model so that the same compiled graph could move more cleanly between CPU, GPU and accelerator targets.[12] The 2022.x line also added or improved frontends for TensorFlow Lite and PaddlePaddle, broadening the "intake" side of the conversion pipeline beyond the traditional TensorFlow, Caffe, MXNet and ONNX lineup of the 2018 release.[11]
In late 2022, Intel designated 2022.3 LTS as a Long-Term Support release with extended maintenance, and 2022.3.1 LTS in early 2023 became the final OpenVINO version to support the Neural Compute Stick 2 / Movidius Myriad X. From the 2023.0 release onward, support for VPU accelerators based on Intel Movidius was canceled, marking the end of the Myriad VPU line in OpenVINO.[13]
OpenVINO 2023.0 introduced direct conversion from PyTorch model objects into OpenVINO Intermediate Representation without the intermediate step of ONNX export, an option that has since become the recommended path for PyTorch users.[14] In the same period, the Model Optimizer command-line tool was officially deprecated in favor of the new OpenVINO Model Converter (OVC) Python API; Model Optimizer was supported through the 2025.0 release and then removed, alongside the separate openvino-dev package, in subsequent 2025 releases.[15]
The 2024.0 release shipped on March 6, 2024 and brought the preview NPU plugin for the integrated Neural Processing Unit in Intel Core Ultra processors (codename Meteor Lake) into the main openvino PyPI package; the same release tightened the integration with the Hugging Face ecosystem, including INT4 weight quantization for popular LLMs.[3] Subsequent 2024.x and 2025.x releases extended NPU coverage to the Core Ultra Series 2 ("Lunar Lake") generation, which features a more capable six-engine NPU (compared with two compute engines on the original Meteor Lake NPU), and introduced NormalFloat 4-bit (NF4) weight quantization as the first data type supported only on the newer NPU.[16] OpenVINO GenAI, the dedicated generative-AI library, was first published as a separate repository in 2024 and reached a 2026.1.0.0 release on April 7, 2026; the 2026.x line also extended the supported model surface to include preview Text-to-Video generation via the LTX-Video model.[7][21]
A condensed timeline of headline releases is summarized below:
| Release | Year | Notable change |
|---|---|---|
| 2018 (initial) | 2018 | Original public release; CPU, integrated GPU, FPGA, Movidius VPU |
| 2019 R1 | 2019 | Refreshed Model Optimizer; expanded Open Model Zoo |
| 2021.1 / 2021.4 | 2021 | Broader Linux distribution coverage |
| 2022.1 | 2022 | IRv11; rewritten Inference Engine API; +70 INT8 models |
| 2022.3 LTS | 2022 | Long-term support designation |
| 2022.3.1 LTS | 2023 | Final release supporting Movidius VPU / NCS2 |
| 2023.0 | 2023 | Direct PyTorch frontend; Movidius VPU support removed |
| 2024.0 | 2024 | NPU plugin in main PyPI package; Hugging Face INT4 quantization |
| 2025.0 | 2025 | Model Optimizer and openvino-dev package removed; OVC API only |
| 2026.0 / 2026.1 | 2026 | Enhanced GenAI; preview Text2Video via LTX-Video; OVMS 2026.1 |
OpenVINO is organized around a core C++ runtime with Python and C bindings, plus pluggable per-device backends. A typical workflow has three stages: convert a trained model from a source framework (PyTorch, TensorFlow, ONNX, PaddlePaddle, JAX) into the OpenVINO format, compile it for a target device using core.compile_model, and then run inference through the resulting CompiledModel and inference requests.[14][17] The same Python or C++ application can dispatch the same compiled graph to different physical devices by changing a device string such as CPU, GPU, or NPU, and the runtime supports heterogeneous execution that splits a graph across multiple devices.[17]
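The read, compile and infer stages described above can be sketched in Python roughly as follows. This is a minimal illustration, assuming the `openvino` package is installed; `choose_device`, `hetero_device` and `run_once` are helper names invented for this sketch rather than part of the OpenVINO API, and the model path passed to `run_once` is whatever IR or ONNX file the reader already has.

```python
from typing import Iterable, Sequence


def choose_device(available: Iterable[str],
                  preferred: Sequence[str] = ("NPU", "GPU", "CPU")) -> str:
    """Return the first preferred device family that the runtime reports.

    The runtime may report numbered instances such as "GPU.0", so only the
    part before the dot is compared.
    """
    families = {name.split(".")[0] for name in available}
    for device in preferred:
        if device in families:
            return device
    raise RuntimeError("none of the preferred devices are available")


def hetero_device(devices: Sequence[str]) -> str:
    """Build a heterogeneous-execution device string, e.g. "HETERO:GPU,CPU"."""
    return "HETERO:" + ",".join(devices)


def run_once(model_path: str, inputs, preferred=("NPU", "GPU", "CPU")):
    """Read a model, compile it for one device, and run a single inference."""
    import openvino as ov  # imported lazily so the helpers above work without it

    core = ov.Core()
    device = choose_device(core.available_devices, preferred)
    model = core.read_model(model_path)            # IR (.xml/.bin) or ONNX file
    compiled = core.compile_model(model, device)   # returns a CompiledModel
    return compiled(inputs)                        # synchronous inference
```

Only the device string changes between targets, which is why the selection logic can live in a small helper that is independent of the model itself.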
OpenVINO's native graph format is the Intermediate Representation (IR), a pair of files: an .xml file describing the graph topology and a .bin file holding the weights. IR is produced either by the legacy Model Optimizer (the standard tool in 2022 and earlier releases) or by the OpenVINO Model Converter API (openvino.convert_model / ovc, from 2023 onward).[14][15] IRv11, introduced in 2022.1, is the current major IR version; legacy IRv10 graphs are no longer accepted by NNCF or the POT toolchain.[12]
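To make the XML half of the IR pair more concrete, the following sketch parses a tiny hand-written topology using only the Python standard library. The sample document is an assumption made for illustration: it follows the element and attribute names commonly seen in IR files (`net`, `layer`, `edge`, `from-layer`, `to-layer`) but omits the port shapes and the offsets into the .bin file that real IR files carry. `summarize_ir` is a hypothetical helper, not an OpenVINO function.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for an IR .xml file (illustration only).
SAMPLE_IR_XML = """
<net name="tiny" version="11">
  <layers>
    <layer id="0" name="input" type="Parameter" version="opset1"/>
    <layer id="1" name="relu" type="Relu" version="opset1"/>
    <layer id="2" name="output" type="Result" version="opset1"/>
  </layers>
  <edges>
    <edge from-layer="0" from-port="0" to-layer="1" to-port="0"/>
    <edge from-layer="1" from-port="1" to-layer="2" to-port="0"/>
  </edges>
</net>
"""


def summarize_ir(xml_text: str) -> dict:
    """Extract the IR version, the layer types, and the edges between them."""
    root = ET.fromstring(xml_text)
    # Map each layer id to its operation type, preserving document order.
    layer_types = {layer.get("id"): layer.get("type") for layer in root.iter("layer")}
    edges = [
        (layer_types[edge.get("from-layer")], layer_types[edge.get("to-layer")])
        for edge in root.iter("edge")
    ]
    return {
        "ir_version": int(root.get("version")),
        "layer_types": list(layer_types.values()),
        "edges": edges,
    }


summary = summarize_ir(SAMPLE_IR_XML)
```

In practice the runtime's own model-reading API is the supported way to inspect a graph; parsing the XML directly is shown here only to illustrate what the file contains.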
In addition to its own IR, the OpenVINO runtime can directly load and execute ONNX graphs through an ONNX frontend without an intermediate IR conversion, and from 2023.0 onward it can directly ingest a PyTorch nn.Module via the PyTorch frontend, internally building an OpenVINO graph through tracing rather than first exporting to ONNX. The PyTorch path is now considered the primary conversion route for PyTorch models, and converting via ONNX is documented as a fallback when direct conversion fails.[14]
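The "direct first, ONNX as fallback" order described above might be wired up as in the sketch below. It assumes `openvino` and `torch` are installed when the default converters are used; `convert_pytorch`, `_direct`, `_via_onnx` and the `fallback.onnx` file name are hypothetical names chosen for this example. The converters are injectable so the control flow can be exercised without either library.

```python
from typing import Any, Callable, Tuple


def _direct(module: Any, example_input: Any) -> Any:
    """Trace the PyTorch module straight into an OpenVINO model."""
    import openvino as ov
    return ov.convert_model(module, example_input=example_input)


def _via_onnx(module: Any, example_input: Any,
              onnx_path: str = "fallback.onnx") -> Any:
    """Export to an ONNX file first, then convert that file."""
    import openvino as ov
    import torch
    torch.onnx.export(module, (example_input,), onnx_path)
    return ov.convert_model(onnx_path)


def convert_pytorch(module: Any, example_input: Any,
                    direct: Callable[[Any, Any], Any] = _direct,
                    via_onnx: Callable[[Any, Any], Any] = _via_onnx
                    ) -> Tuple[Any, str]:
    """Try direct conversion; fall back to the ONNX route if it raises.

    Returns the converted model together with the name of the route used.
    Catching every Exception is deliberately broad for a sketch; production
    code would likely log the error and narrow the exception types.
    """
    try:
        return direct(module, example_input), "direct"
    except Exception:
        return via_onnx(module, example_input), "onnx"
```

Returning the route alongside the model makes it easy to notice when a model silently depends on the fallback path.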
OpenVINO exposes hardware through a plugin architecture; the principal first-party plugins are:
| Plugin / device string | Hardware |
|---|---|
| CPU | Intel x86 CPUs (AVX2, AVX-512, AMX) and Arm CPUs (NEON, SVE) |
| GPU | Intel integrated GPUs and discrete GPUs, including the Intel Arc family |
| NPU | Intel NPU integrated into Core Ultra (Meteor Lake) and Core Ultra Series 2 (Lunar Lake) processors |
| MYRIAD (legacy) | Movidius Myriad VPUs and Neural Compute Stick 2 (through 2022.3.x LTS only) |
| FPGA (legacy) | Intel Arria/Stratix FPGAs via bitstreams shipped through the 2018 "with FPGA Support" packages |
The Movidius VPU plugin was retired with the cancellation of all VPU accelerator support after the 2022.3.1 LTS release.[13] FPGA support, present in the 2018 "OpenVINO Toolkit for Linux with FPGA Support" SKUs that shipped bitstreams for Arria 10 cards, was discontinued in subsequent releases as Intel reorganized its programmable-logic strategy.[18]
The NPU plugin shipped initially as a preview in 2023.3 and became part of the main pip install openvino distribution starting with 2024.0; on Core Ultra Series 2 (Lunar Lake) the same plugin exposes the larger six-engine NPU and supports additional low-precision data types including NF4 weight quantization that are not available on the original Meteor Lake NPU.[3][16]
Model compression in OpenVINO is delegated to the Neural Network Compression Framework, an open-source Python package hosted at github.com/openvinotoolkit/nncf and licensed under Apache 2.0.[5] NNCF was introduced in an Intel-authored arXiv paper, "Neural Network Compression Framework for fast model inference," submitted on February 20, 2020 by Alexander Kozlov, Ivan Lazarevich, Vasily Shamporov, Nikolay Lyalyushkin and Yury Gorbachev, and last revised in December of that year.[19] The paper describes NNCF as a framework for compressing neural networks with optional fine-tuning, implementing sparsity, quantization and binarization methods designed to produce models that run efficiently on general-purpose hardware (CPU, GPU) and on specialized deep-learning accelerators while preserving the original accuracy.[19]
The framework implements both post-training algorithms (8-bit Post-Training Quantization, weights compression for LLMs, activation sparsity) and training-time algorithms (Quantization-Aware Training, weight-only quantization with LoRA and Neural Low-rank Search, structured pruning), and it accepts models in PyTorch, TorchFX, ONNX and OpenVINO formats with OpenVINO as the preferred backend for post-training quantization.[5] NNCF's PTQ workflow requires only a model and a small calibration dataset on the order of 300 samples to apply 8-bit symmetric quantization.[5]
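As a rough illustration of what 8-bit symmetric quantization with a calibration set involves, the sketch below derives a single scale from calibration samples and maps values to the integer range [-127, 127]. This is a simplified per-tensor scheme written for exposition, not NNCF's actual algorithm, which also handles per-channel scales, activation statistics and accuracy control. The final helper shows how the same step is typically delegated to NNCF, assuming the `nncf` package is installed; `quantize_with_nncf` and the other function names are hypothetical.

```python
from typing import Iterable, List, Sequence


def symmetric_scale(calibration: Iterable[Sequence[float]], qmax: int = 127) -> float:
    """Estimate one scale from the largest magnitude seen in calibration data."""
    peak = max(abs(value) for sample in calibration for value in sample)
    return peak / qmax if peak else 1.0


def quantize(values: Sequence[float], scale: float, qmax: int = 127) -> List[int]:
    """Round to the nearest integer step and clamp to the symmetric range."""
    return [max(-qmax, min(qmax, round(value / scale))) for value in values]


def dequantize(codes: Sequence[int], scale: float) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [code * scale for code in codes]


def quantize_with_nncf(model, samples, subset_size: int = 300):
    """Delegate post-training quantization to NNCF (requires `nncf`)."""
    import nncf  # imported lazily so the arithmetic above runs without it
    return nncf.quantize(model, nncf.Dataset(samples), subset_size=subset_size)


scale = symmetric_scale([[1.27, -0.5], [0.3, 0.0]])
codes = quantize([1.27, -1.27, 0.0, 5.0], scale)
```

Values outside the calibrated range, such as the 5.0 above, saturate at the largest code, which is why the calibration samples need to be representative of real inputs.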
For LLM weight compression specifically, NNCF supports data-free INT8 and INT4 weight quantization as well as data-aware methods such as AWQ (Activation-aware Weight Quantization), GPTQ, quantization scale estimation, and mixed-precision quantization; on Core Ultra Series 2 NPUs it additionally supports the NormalFloat 4-bit (NF4) data type.[6][16] These algorithms are exposed both directly through NNCF and indirectly through the optimum-intel Hugging Face integration, which uses NNCF as its compression backend for OpenVINO IR and ONNX models.[6] In a representative reference benchmark, applying NNCF YOLOv8 PTQ produces an INT8 OpenVINO IR variant with a small (single-digit percent) drop in mean average precision relative to the original FP32 model while delivering substantially better throughput on Intel CPUs.[5]
The PyPI package is published as nncf. The latest release at writing is v3.1.0, dated April 8, 2026; the project history shows 34 releases since the initial public version.[5]
OpenVINO Tokenizers is a separately maintained extension at github.com/openvinotoolkit/openvino_tokenizers that adds text-processing operators to the runtime so that tokenization and detokenization can be expressed as OpenVINO graphs without external dependencies. The extension converts Hugging Face tokenizers into OpenVINO models, supports BPE, WordPiece, SentencePiece, Tiktoken and RWKV tokenizer types, and can be fused with downstream LLM graphs to produce a single end-to-end pipeline. Once exported, both the tokenizer and detokenizer become first-class OpenVINO models that can be read, compiled and saved with the standard runtime APIs, and they can also be combined with the model itself into a single graph; the documentation calls out classifiers and RAG embedders as cases where this fused form is preferred because both tokenizer and model are invoked once per inference. The extension additionally provides a "greedy decoding" pipeline component for simple text-generation use cases.[20]
The package serves as the tokenizer backbone for OpenVINO GenAI: when openvino_genai is installed, the five-step convert-and-validate flow for a tokenizer runs automatically as part of model loading. The latest 2026.1.0.0 release is dated April 7, 2026, and the project is licensed under Apache 2.0.[20]
OpenVINO GenAI is a higher-level library, distributed from github.com/openvinotoolkit/openvino.genai, that runs on top of the OpenVINO Runtime and implements generative-AI pipelines for LLMs, vision-language models, image generation, speech recognition and speech synthesis. It exposes both C++ and Python APIs and implements supporting infrastructure such as continuous batching, prefix caching and KV-cache management. The library's stated motivation is that it offers "better performance than Python-based runtimes" by minimizing the per-token overhead in the inference loop, while preserving a Python-friendly façade for rapid prototyping.[7]
Concrete supported workloads include LLM chat (Llama, Phi, Mistral and Qwen families), VLM image and video analysis (LLaVA, MiniCPM-V, Qwen2-VL, Phi-3.5-Vision, InternVL2), image generation with Stable Diffusion and Flux, Whisper-based speech recognition, SpeechT5 text-to-speech, semantic-search embedding generation and text reranking for retrieval workflows.[7] GenAI also provides a backend for llama.cpp enabling optimized inference on Intel CPUs, GPUs and NPUs, validated against GGUF builds of models such as Llama-3.2-1B-Instruct and Mistral-7B-Instruct.[21] Recent OpenVINO Model Server work layered on GenAI has added a unified VLM chatbot demo with video-file support and interactive model switching across Qwen3-VL, Qwen2.5-VL and LLaVa-NeXT-Video.[21]
GenAI also implements the Speculative Decoding pipeline as a serving-side optimization, exposing draft-and-target model pairs through OVMS for higher effective throughput on LLM workloads where draft acceptance rates are favorable.[23]
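The draft-and-target idea behind speculative decoding can be shown with a toy greedy version. The sketch below is not the OpenVINO GenAI implementation: the "models" are plain functions from a token list to the next token, all names are invented for the example, and a real runtime would score the drafted tokens in a single batched pass of the target model rather than one call per token. With greedy verification the output matches what the target alone would produce; the draft only affects how many tokens are accepted per round.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_target(context: List[int]) -> int:
    """Toy target model: always predicts the previous token plus one, mod 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[int]) -> int:
    """Toy draft model: agrees with the target except right after token 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[int], float]:
    """Greedy speculative decoding; returns new tokens and the acceptance rate."""
    tokens = list(prompt)
    accepted = proposed = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        budget = min(k, max_new_tokens - (len(tokens) - len(prompt)))
        drafted: List[int] = []
        for _ in range(budget):
            drafted.append(draft(tokens + drafted))
        proposed += len(drafted)
        # 2. The target verifies the proposals in order. Its own token is
        #    always the one kept, so a mismatch costs only the rest of the draft.
        for token in drafted:
            expected = target(tokens)
            tokens.append(expected)
            if token != expected:
                break
            accepted += 1
    rate = accepted / proposed if proposed else 0.0
    return tokens[len(prompt):], rate


generated, acceptance_rate = speculative_decode(toy_target, toy_draft, [0], 8)
```

The acceptance rate is the quantity that determines whether the technique pays off: when the draft is frequently wrong, the work spent drafting is wasted and throughput can fall below plain decoding.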
The OpenVINO Model Server is a standalone serving system written in C++ that exposes OpenVINO-optimized models through network APIs; its source is hosted at github.com/openvinotoolkit/model_server under Apache 2.0.[8] It is API-compatible with TensorFlow Serving's gRPC interface and with the KServe v2 (REST and gRPC) inference protocol, exposing a unified Predict and ModelMetadata surface that lets existing clients of those ecosystems run inference on OpenVINO-optimized models without code changes.[8][22] Beyond classical predict APIs, OVMS exposes an OpenAI-compatible HTTP API for LLM serving and integrates the continuous-batching pipeline from OpenVINO GenAI to serve LLM and VLM workloads.[23] OVMS supports loading models stored locally, on object storage, or pulled from the Hugging Face Hub, and is distributed as Docker images for both bare-metal and Kubernetes deployments. The latest release shown at writing is 2026.1, dated April 7, 2026.[8]
The server's documentation notes that the gRPC interface is generally recommended for raw inference latency because of more efficient input-deserialization, while the REST API is preferred when minimizing client-side dependencies is the higher priority.[22]
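A REST call in the KServe v2 style can be assembled with the Python standard library alone, which reflects the low client-side dependency footprint mentioned above. The sketch below builds the request without sending it; the host, port, model name and input tensor name are placeholders, and `kserve_infer_request` and `infer` are hypothetical helpers. The URL path and body layout follow the KServe v2 inference protocol as generally documented, so they should be checked against the deployed server's version.

```python
import json
from typing import Sequence
from urllib import request


def kserve_infer_request(base_url: str, model_name: str, input_name: str,
                         shape: Sequence[int], data: Sequence[float],
                         datatype: str = "FP32") -> request.Request:
    """Build (but do not send) a KServe v2 REST inference request."""
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}/infer"
    body = {
        "inputs": [{
            "name": input_name,
            "shape": list(shape),
            "datatype": datatype,
            "data": list(data),  # flattened tensor contents
        }]
    }
    return request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def infer(req: request.Request, timeout: float = 10.0) -> dict:
    """Send the request to a running server and decode the JSON response."""
    with request.urlopen(req, timeout=timeout) as response:
        return json.load(response)


# Placeholder endpoint and tensor; nothing is sent until infer() is called.
example = kserve_infer_request("http://localhost:8000/", "resnet", "input",
                               [1, 3], [0.1, 0.2, 0.3])
```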
optimum-intel is a Hugging Face-maintained Python package that integrates Intel acceleration toolkits, including OpenVINO, the Intel Neural Compressor and Intel Extension for PyTorch, with the Transformers and Diffusers libraries. The Intel and Hugging Face collaboration was announced in mid-2022, and the initial OpenVINO integration into optimum-intel landed on November 2, 2022, based on OpenVINO 2022.2.[24] The package introduces OVModelForXxx wrapper classes that mirror Transformers' AutoModelForXxx classes but execute inference through OpenVINO Runtime, and it provides convenience APIs for exporting Hugging Face Hub models to OpenVINO IR and for invoking NNCF-backed quantization.[25] In an early benchmark cited by Hugging Face, applying NNCF quantization through optimum-intel to a Vision Transformer reduced memory footprint by 3.8x (344MB to 90MB) and improved per-sample latency by 2.4x (98ms to 41ms), with negligible accuracy change (the quantized and original models both reached 87.6% accuracy on the same task).[6]
The integration was deliberately staged: the November 2022 launch focused on encoder-style models (BERT, DistilBERT and their kin) where post-training static quantization and quantization-aware training are well understood, and subsequent releases broadened to encoder-decoder, decoder-only and diffusion architectures. More recent integrations cover weight-only INT4 and NF4 quantization of LLMs and pipelines that combine optimum-intel export with OpenVINO GenAI inference for end-to-end deployment, including the option to push the resulting compressed model back to the Hugging Face Hub for sharing.[6][24]
Hugging Face's documentation positions optimum-intel as a "single-line" path to Intel acceleration: users replace from transformers import AutoModelForXxx with from optimum.intel import OVModelForXxx, change the model class name, and the rest of the standard Transformers and pipelines API continues to work, with OpenVINO performing inference under the hood on Intel CPU, GPU or NPU targets.[25]
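The class-name substitution described above is mechanical enough to express as a small helper. The sketch below assumes the `optimum-intel` package (with its OpenVINO extras) is installed when `load_ov_model` is called; `ov_class_name` and `load_ov_model` are names invented for this example, and not every `AutoModelForXxx` class necessarily has an `OVModelForXxx` counterpart, in which case the attribute lookup fails.

```python
import importlib


def ov_class_name(transformers_class: str) -> str:
    """Map an AutoModelForXxx class name to its OVModelForXxx counterpart."""
    prefix = "AutoModelFor"
    if not transformers_class.startswith(prefix):
        raise ValueError(f"expected a name starting with {prefix!r}")
    return "OVModelFor" + transformers_class[len(prefix):]


def load_ov_model(model_id: str,
                  transformers_class: str = "AutoModelForSequenceClassification"):
    """Load a Hub model through the matching optimum-intel wrapper class.

    export=True asks optimum-intel to convert the original checkpoint to
    OpenVINO IR on the fly instead of expecting pre-converted files.
    """
    optimum_intel = importlib.import_module("optimum.intel")
    ov_class = getattr(optimum_intel, ov_class_name(transformers_class))
    return ov_class.from_pretrained(model_id, export=True)
```

In ordinary application code the swap is simply a changed import line; the dynamic lookup here exists only to make the naming rule explicit.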
OpenVINO publishes a regularly updated set of platform benchmarks through the OpenVINO Model Hub, covering throughput and latency for representative computer-vision and generative workloads across Intel CPUs, GPUs and NPUs. The benchmark_app tool, distributed with the runtime, lets users reproduce these measurements on local hardware.[26]
In standardized industry benchmarks, Intel reported on May 5, 2025 that its Core Ultra Series 2 processors were the first to achieve full Neural Processing Unit support in the MLPerf Client v0.6 benchmark from MLCommons (see MLPerf); the submission used the Llama 2 7B model and recorded a first-token latency of 1.09 seconds and an NPU throughput of 18.55 tokens per second. Intel attributed the result to joint optimization between its NPU hardware teams and the OpenVINO software stack.[27] Independent coverage of MLPerf Client results has reported the integrated Intel Arc GPU delivering 93.5 tokens per second on Llama 3.1 8B, behind only AMD's Radeon Pro W7900, with a 0.12 second time-to-first-token that beat both AMD and NVIDIA cards in the same comparison.[28]
For Lunar Lake (Core Ultra Series 2), OpenVINO documentation notes that systems may need more than 16 GB of RAM to process prompts longer than 1024 tokens with 7B-class models such as Llama-2-7B, Mistral-0.2-7B and Qwen-2-7B, and recommends INT4 weight quantization as the operating point that fits most 3B to 7B models into NPU memory.[16]
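A back-of-the-envelope calculation shows why INT4 is the practical operating point for 7B-class models on memory-constrained client systems. The sketch below counts weight storage only; it ignores quantization scales and zero points, the KV cache, activations and runtime overhead, all of which add to the real footprint. `weight_footprint_gib` is a helper defined here for illustration.

```python
def weight_footprint_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given parameter count and width."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


fp16_7b = weight_footprint_gib(7e9, 16)  # roughly 13.0 GiB
int8_7b = weight_footprint_gib(7e9, 8)   # roughly 6.5 GiB
int4_7b = weight_footprint_gib(7e9, 4)   # roughly 3.3 GiB
```

On a 16 GB system shared with the operating system and other applications, only the 4-bit variant leaves substantial headroom for long prompts.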
OpenVINO is widely deployed as an inference backend for Edge AI applications because it supports the Intel CPUs and integrated GPUs that are common in industrial PCs, retail kiosks, security cameras and consumer laptops. Documented uses include medical imaging at the edge, for example the real-time inferencing pipelines in CT, ultrasound and MRI workflows surveyed by the Edge AI and Vision Alliance, where vendors describe replacing custom CPU code paths with OpenVINO to obtain consistent throughput across multiple Intel device generations.[29] On the consumer side, QNAP ships OpenVINO as a built-in deep-learning runtime on its network-attached storage appliances so that on-NAS image classification, face recognition and similar Convolutional Neural Network-based features can run without relying on external GPUs.[30]
In the generative-AI era, OpenVINO has been positioned as the principal runtime for "AI PC" workloads on Intel client silicon, integrating with the llama.cpp ecosystem and exposing both LLM and VLM pipelines through OpenVINO GenAI; the 2026.0 release added a preview Text2Video pipeline using the LTX-Video model.[21] On the server side, OVMS is used to serve OpenVINO-optimized models in Kubernetes and OpenShift clusters; Intel documents reference architectures for scaling OVMS pods behind a load balancer with autoscaling driven by per-model throughput.[22]
NNCF, which is part of the OpenVINO toolkit ecosystem, is itself the subject of an Intel-authored research paper (arXiv:2002.08679), establishing a citable primary source for downstream academic work that uses OpenVINO-compatible compression methods.[19] Beyond that foundational paper, OpenVINO appears as an inference backend in published benchmarks of edge deep-learning systems: independent studies of resource-efficient medical image classification for edge devices and of analog in-memory computing for medical-imaging segmentation cite OpenVINO-compatible deployment paths when discussing how to evaluate quantized neural networks on commodity hardware, and Intel collaborators have used OpenVINO to characterize quantization tradeoffs for DNN inference on edge devices.[19][28]
OpenVINO has historically been most performant on Intel hardware, and while CPU support extends to Arm processors, performance optimization on non-Intel platforms is less mature than on Intel-specific paths. The retirement of Movidius VPU support after the 2022.3.x LTS line removed a class of dedicated low-power vision accelerators that some industrial customers had built systems around, leaving those users to either freeze on the LTS branch or migrate to NPU-equipped Core Ultra processors.[13] FPGA support was likewise discontinued after the 2018 generation of "with FPGA Support" packages.[18]
On the NPU side, real-world reports note that the integrated NPU in some Core Ultra configurations can be slower than the same machine's CPU for certain LLM workloads, with users filing performance issues against the openvinotoolkit/openvino repository documenting these cases; appropriate device selection between CPU, GPU and NPU therefore remains workload-dependent.[31] The end-of-life of the standalone openvino-dev package and the Model Optimizer CLI in the 2025 release line is a backwards-incompatible change that has required users to migrate scripts to the new openvino.convert_model API.[15]
| Toolkit | Primary vendor | Native graph format | Hardware focus | License |
|---|---|---|---|---|
| OpenVINO | Intel | OpenVINO IR (XML + bin) | Intel CPU, GPU, NPU | Apache 2.0 |
| TensorRT | NVIDIA | TensorRT engine | NVIDIA GPU (CUDA) | Proprietary |
| ONNX Runtime | Microsoft and community | ONNX | CPU, GPU, NPU via execution providers | MIT |
| TensorFlow Lite (LiteRT) | Google | TFLite FlatBuffer | Mobile and embedded (Arm CPU, NPU, GPU) | Apache 2.0 |
| llama.cpp | Open-source community | GGUF | CPU, plus GPU/NPU backends | MIT |
Compared with NVIDIA's TensorRT, OpenVINO targets a broader set of CPU-class devices and is open source, but it does not run on NVIDIA GPUs and has historically lagged TensorRT in raw throughput on data-center accelerators.[1] Compared with ONNX Runtime, OpenVINO can act as an ONNX Runtime execution provider but also exposes its own IR and runtime APIs that allow more aggressive Intel-specific optimizations.[14] On the serving side, OVMS competes with TensorFlow Serving, the KServe stack, BentoML and Ray Serve for the "production model server" role; its differentiator is that the bundled runtime is OpenVINO and it is API-compatible with both TensorFlow Serving and KServe.[22]