OpenVINO

AI Inference Developer Tools Open Source AI

22 min read

Updated Jun 28, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 28, 2026

Fact-checked

In review queue

Sources

31 citations

Revision

v3 · 4,450 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

OpenVINO (Open Visual Inference and Neural Network Optimization) is an open-source toolkit developed by Intel for optimizing and deploying deep learning inference across a heterogeneous set of Intel hardware targets, including x86 and Arm CPUs, integrated and discrete GPUs, and Neural Processing Units (NPUs).^[1] First released on May 16, 2018 as a computer-vision toolkit for edge devices, OpenVINO has expanded into a general-purpose inference runtime that supports models trained in PyTorch, TensorFlow, ONNX and other frameworks, and is now Intel's principal software vehicle for generative AI on client and edge silicon.^[2]^[3] The official repository describes it succinctly as an "Open-source software toolkit for optimizing and deploying deep learning models," with documented use across computer vision, automatic speech recognition, generative AI, and natural language processing.^[1] The toolkit is distributed under the Apache License 2.0 and developed in the open at github.com/openvinotoolkit/openvino.^[4] Beyond the core runtime, the OpenVINO ecosystem includes the Neural Network Compression Framework (NNCF) for quantization and pruning, an OpenVINO Tokenizers extension, the OpenVINO Model Server (OVMS), and an OpenVINO GenAI library targeted at LLM and vision-language workloads.^[5]^[6]^[7]^[8]

Infobox

Field	Value
Original release	May 16, 2018 (Intel Distribution of OpenVINO Toolkit)
Developer	Intel Corporation
License	Apache License 2.0
Source repository	github.com/openvinotoolkit/openvino
Latest release (at writing)	2026.1.0 (April 7, 2026)
Languages	C++ core, Python and C bindings
Supported devices	Intel CPU (x86, Arm), Intel GPU (integrated, Arc discrete), Intel NPU (Core Ultra)
Model intake	OpenVINO IR, ONNX, PyTorch, TensorFlow, TensorFlow Lite, PaddlePaddle, JAX
Hugging Face integration	optimum-intel

What does OpenVINO stand for?

The acronym OpenVINO expands to Open Visual Inference and Neural Network Optimization, reflecting the toolkit's original 2018 framing as a visual, computer-vision product. Intel announced the toolkit alongside DevCloud for the Edge, a hosted prototyping environment that exposed combinations of Xeon processors, integrated graphics, FPGAs and Movidius VPUs so developers could try the same toolkit across different hardware without acquiring it physically.^[2] Despite the "Visual" in the name, OpenVINO has since grown well beyond vision into general LLM, speech and generative-AI inference.

When was OpenVINO released?

Origins and the 2018 launch

The Intel Distribution of OpenVINO toolkit was first announced and released on May 16, 2018, when Intel introduced it as a tool to "give developers improved neural network performance on a variety of Intel processors" for vision-centric edge workloads in cameras, IoT devices, and embedded systems.^[2] The initial release framed OpenVINO as the successor to the older Intel Computer Vision SDK, bundling a Model Optimizer that consumed models from popular open-source frameworks such as Caffe, TensorFlow and MXNet, an Inference Engine, and pre-trained models from the Open Model Zoo. It targeted Intel CPUs and integrated GPUs along with specialized accelerators including FPGA cards and Movidius vision processing units (VPUs).^[2]^[9]

The Movidius VPU branch entered the lineage through Intel's September 2016 acquisition of the Irish chip designer Movidius, co-founded in 2005 in Dublin by Sean Mitchell, David Moloney and Val Muresan. Movidius produced the Myriad 2 and Myriad X VPUs, low-power chips for computer vision that Intel later folded into OpenVINO as the MYRIAD plugin and offered as Neural Compute Stick devices. The Myriad 2 chip reached about 1 to 1.5 trillion operations per second; the 2017-era Myriad X added a dedicated Neural Compute Engine and pushed sustained throughput to roughly 4 trillion operations per second.^[9] The same architectural team's "Keem Bay" / Vision Processing Unit work, detailed publicly in 2019, eventually evolved into the Neural Processing Unit (NPU) shipped in Intel Core Ultra client processors.^[10]

2019 to 2022: stabilization and LTS cadence

Across 2019, 2020 and 2021 Intel iterated the toolkit on a roughly quarterly cadence, with releases such as 2019 R1, 2020.1, 2021.1 and 2021.4 progressively broadening framework coverage and Linux distribution support.^[11] An important transition during this period was the open-sourcing of the code: the openvinotoolkit/openvino GitHub repository was published in 2020 under the Apache License 2.0, replacing the previous binary-only Intel Distribution as the canonical source of truth and inviting external contributions.^[1]^[4]

The 2022.1 release, shipped in March 2022, was a substantial rewrite that introduced the IRv11 intermediate representation and dropped support for the older IRv10 in the Post-Training Optimization Tool (POT); it expanded natural-language model coverage with more than 70 additional INT8-quantized models, retired support for Ubuntu 18.04 in favor of Ubuntu 20.04, and unified the device-portability model so that the same compiled graph could move more cleanly between CPU, GPU and accelerator targets.^[12] The 2022.x line also added or improved frontends for TensorFlow Lite and PaddlePaddle, broadening the "intake" side of the conversion pipeline beyond the traditional TensorFlow, Caffe, MXNet and ONNX lineup of the 2018 release.^[11]

In late 2022, Intel designated 2022.3 LTS as a Long-Term Support release with extended maintenance, and 2022.3.1 LTS in early 2023 became the final OpenVINO version to support the Neural Compute Stick 2 / Movidius Myriad X. From the 2023.0 release onward, support for VPU accelerators based on Intel Movidius was canceled, marking the end of the Myriad VPU line in OpenVINO.^[13]

2023 to 2026: PyTorch frontend, NPU, and generative AI

OpenVINO 2023.0 introduced direct conversion from PyTorch model objects into OpenVINO Intermediate Representation without the intermediate step of ONNX export, an option that has since become the recommended path for PyTorch users.^[14] In the same period, the Model Optimizer command-line tool was officially deprecated in favor of the new OpenVINO Model Converter (OVC) Python API; Model Optimizer was supported through the 2025.0 release and then removed, alongside the separate openvino-dev package, in subsequent 2025 releases.^[15]

The 2024.0 release shipped on March 6, 2024 and added a preview NPU plugin for the integrated Neural Processor Unit in Intel Core Ultra processors (codename Meteor Lake), distributed through the main openvino PyPI package; the same release tightened the integration with the Hugging Face ecosystem, including INT4 weight quantization for popular LLMs.^[3] Subsequent 2024.x and 2025.x releases extended NPU coverage to the Core Ultra Series 2 ("Lunar Lake") generation, which features a more capable six-engine NPU (compared with two compute engines on the original Meteor Lake NPU) and introduced NormalFloat 4-bit (NF4) weight quantization as the first data type supported only on the newer NPU.^[16] OpenVINO GenAI, the dedicated generative-AI library, was first published as a separate repository in 2024 and reached a 2026.1.0.0 release on April 7, 2026; subsequent 2026.x releases extended the supported model surface to include preview Text-to-Video generation via the LTX-Video model.^[7]^[21]

A condensed timeline of headline releases is summarized below:

Release	Year	Notable change
2018 (initial)	2018	Original public release; CPU, integrated GPU, FPGA, Movidius VPU
2019 R1	2019	Refreshed Model Optimizer; expanded Open Model Zoo
2021.1 / 2021.4	2021	Broader Linux distribution coverage
2022.1	2022	IRv11; rewritten Inference Engine API; +70 INT8 models
2022.3 LTS	2022	Long-term support designation
2022.3.1 LTS	2023	Final release supporting Movidius VPU / NCS2
2023.0	2023	Direct PyTorch frontend; Movidius VPU support removed
2024.0	2024	NPU plugin in main PyPI package; Hugging Face INT4 quantization
2025.0	2025	Model Optimizer and `openvino-dev` package removed; OVC API only
2026.0 / 2026.1	2026	Enhanced GenAI; preview Text2Video via LTX-Video; OVMS 2026.1

How does OpenVINO work?

Architecture

OpenVINO is organized around a core C++ runtime with Python and C bindings, plus pluggable per-device backends. A typical workflow has three stages: convert a trained model from a source framework (PyTorch, TensorFlow, ONNX, PaddlePaddle, JAX) into the OpenVINO format, compile it for a target device using core.compile_model, and then run inference through the resulting CompiledModel and inference requests.^[14]^[17] The same Python or C++ application can dispatch the same compiled graph to different physical devices by changing a device string such as CPU, GPU, or NPU, and the runtime supports heterogeneous execution that splits a graph across multiple devices.^[17]

What is the OpenVINO Intermediate Representation (IR)?

OpenVINO's native graph format is the Intermediate Representation (IR), a pair of files comprising an XML description of the graph topology and a binary .bin file holding weights. IR is produced either by the Model Optimizer (in 2022 and earlier) or by the OpenVINO Model Converter API (openvino.convert_model / ovc from 2023 onward).^[14]^[15] IRv11, introduced in 2022.1, is the current major IR version; legacy IRv10 graphs are no longer accepted by NNCF or the POT toolchain.^[12]

In addition to its own IR, the OpenVINO runtime can directly load and execute ONNX graphs through an ONNX frontend without an intermediate IR conversion, and from 2023.0 onward it can directly ingest a PyTorch nn.Module via the PyTorch frontend, internally building an OpenVINO graph through tracing rather than first exporting to ONNX. The PyTorch path is now considered the primary conversion route for PyTorch models, and converting via ONNX is documented as a fallback when direct conversion fails.^[14]

What hardware does OpenVINO support?

OpenVINO exposes hardware through a plugin architecture; the principal first-party plugins are:

Plugin / device string	Hardware
`CPU`	Intel x86 CPUs (AVX2, AVX-512, AMX) and Arm CPUs (NEON, SVE)
`GPU`	Intel integrated GPUs and discrete GPUs, including the Intel Arc family
`NPU`	Intel NPU integrated into Core Ultra (Meteor Lake) and Core Ultra Series 2 (Lunar Lake) processors
`MYRIAD` (legacy)	Movidius Myriad VPUs and Neural Compute Stick 2 (through 2022.3.x LTS only)
FPGA (legacy)	Intel Arria/Stratix FPGAs via bitstreams shipped through the 2018 "with FPGA Support" packages

The Movidius VPU plugin was retired with the cancellation of all VPU accelerator support after the 2022.3.1 LTS release.^[13] FPGA support, present in the 2018 "OpenVINO Toolkit for Linux with FPGA Support" SKUs that shipped bitstreams for Arria 10 cards, was discontinued in subsequent releases as Intel reorganized its programmable-logic strategy.^[18]

The NPU plugin shipped initially as a preview in 2023.3 and became part of the main pip install openvino distribution starting with 2024.0; on Core Ultra Series 2 (Lunar Lake) the same plugin exposes the larger six-engine NPU and supports additional low-precision data types including NF4 weight quantization that are not available on the original Meteor Lake NPU.^[3]^[16]

How does OpenVINO optimize models? Quantization and the NNCF

Model compression in OpenVINO is delegated to the Neural Network Compression Framework, an open-source Python package hosted at github.com/openvinotoolkit/nncf and licensed under Apache 2.0.^[5] NNCF was introduced in an Intel-authored arXiv paper, "Neural Network Compression Framework for fast model inference," submitted on February 20, 2020 by Alexander Kozlov, Ivan Lazarevich, Vasily Shamporov, Nikolay Lyalyushkin and Yury Gorbachev, and last revised in December of that year.^[19] The paper describes NNCF as a framework for compressing neural networks with optional fine-tuning, implementing sparsity, quantization and binarization methods designed to produce models that run efficiently on general-purpose hardware (CPU, GPU) and on specialized deep-learning accelerators while preserving the original accuracy.^[19]

The framework implements both post-training algorithms (8-bit Post-Training Quantization, weights compression for LLMs, activation sparsity) and training-time algorithms (Quantization-Aware Training, weight-only quantization with LoRA and Neural Low-rank Search, structured pruning), and it accepts models in PyTorch, TorchFX, ONNX and OpenVINO formats with OpenVINO as the preferred backend for post-training quantization.^[5] NNCF's PTQ workflow requires only a model and a small calibration dataset on the order of 300 samples to apply 8-bit symmetric quantization.^[5]

For LLM weight compression specifically, NNCF supports data-free INT8 and INT4 weight quantization as well as data-aware methods such as AWQ (Activation-aware Weight Quantization), GPTQ, quantization scale estimation, and mixed-precision quantization; on Core Ultra Series 2 NPUs it additionally supports the NormalFloat 4-bit (NF4) data type.^[6]^[16] These algorithms are exposed both directly through NNCF and indirectly through the optimum-intel Hugging Face integration, which uses NNCF as its compression backend for OpenVINO IR and ONNX models.^[6] In a representative reference benchmark, applying NNCF YOLOv8 PTQ produces an INT8 OpenVINO IR variant with a small (single-digit percent) drop in mean average precision relative to the original FP32 model while delivering substantially better throughput on Intel CPUs.^[5]

The PyPI package is published as nncf. The latest release at writing is v3.1.0, dated April 8, 2026; the project history shows 34 releases since the initial public version.^[5]

What are OpenVINO Tokenizers?

OpenVINO Tokenizers is a separately maintained extension at github.com/openvinotoolkit/openvino_tokenizers that adds text-processing operators to the runtime so that tokenization and detokenization can be expressed as OpenVINO graphs without external dependencies. The extension converts Hugging Face tokenizers into OpenVINO models, supports BPE, WordPiece, SentencePiece, Tiktoken and RWKV tokenizer types, and can be fused with downstream LLM graphs to produce a single end-to-end pipeline. Once exported, both the tokenizer and detokenizer become first-class OpenVINO models that can be read, compiled and saved with the standard runtime APIs, and they can also be combined with the model itself into a single graph; the documentation calls out classifiers and RAG embedders as cases where this fused form is preferred because both tokenizer and model are invoked once per inference. The extension additionally provides a "greedy decoding" pipeline component for simple text-generation use cases.^[20]

The package serves as the tokenizer backbone for OpenVINO GenAI: when openvino_genai is installed, the five-step convert-and-validate flow for a tokenizer runs automatically as part of model loading. The latest 2026.1.0.0 release is dated April 7, 2026, and the project is licensed under Apache 2.0.^[20]

What is OpenVINO GenAI?

OpenVINO GenAI is a higher-level library, distributed from github.com/openvinotoolkit/openvino.genai, that runs on top of the OpenVINO Runtime and implements generative-AI pipelines for LLMs, vision-language models, image generation, speech recognition and speech synthesis. It exposes both C++ and Python APIs and implements supporting infrastructure such as continuous batching, prefix caching and KV-cache management. The library's stated motivation is that it offers "better performance than Python-based runtimes" by minimizing the per-token overhead in the inference loop, while preserving a Python-friendly facade for rapid prototyping.^[7]

Concrete supported workloads include LLM chat (Llama, Phi, Mistral and Qwen families), VLM image and video analysis (LLaVA, MiniCPM-V, Qwen2-VL, Phi-3.5-Vision, InternVL2), image generation with Stable Diffusion and Flux, Whisper-based speech recognition, SpeechT5 text-to-speech, semantic-search embedding generation and text reranking for retrieval workflows.^[7] GenAI also provides a backend for llama.cpp enabling optimized inference on Intel CPUs, GPUs and NPUs, validated against GGUF builds of models such as Llama-3.2-1B-Instruct and Mistral-7B-Instruct.^[21] Recent OpenVINO Model Server work layered on GenAI has added a unified VLM chatbot demo with video-file support and interactive model switching across Qwen3-VL, Qwen2.5-VL and LLaVa-NeXT-Video.^[21]

GenAI also implements the Speculative Decoding pipeline as a serving-side optimization, exposing draft-and-target model pairs through OVMS for higher effective throughput on LLM workloads where draft acceptance rates are favorable.^[23]

What is the OpenVINO Model Server (OVMS)?

The OpenVINO Model Server is a standalone serving system written in C++ that hosts OpenVINO-optimized models behind network APIs, hosted at github.com/openvinotoolkit/model_server under Apache 2.0.^[8] It is API-compatible with TensorFlow Serving's gRPC interface and with the KServe v2 (REST and gRPC) inference protocol, exposing a unified Predict and ModelMetadata surface that lets existing clients of those ecosystems run inference on OpenVINO-optimized models without code changes.^[8]^[22] Beyond classical predict APIs, OVMS exposes an OpenAI-compatible HTTP API for LLM serving and integrates the continuous-batching pipeline from OpenVINO GenAI to serve LLM and VLM workloads.^[23] OVMS supports loading models stored locally, on object storage, or pulled from the Hugging Face Hub, and is distributed as Docker images for both bare-metal and Kubernetes deployments. The latest release shown at writing is 2026.1, dated April 7, 2026.^[8]

The server's documentation notes that the gRPC interface is generally recommended for raw inference latency because of more efficient input-deserialization, while the REST API is preferred when minimizing client-side dependencies is the higher priority.^[22]

How does OpenVINO integrate with Hugging Face (optimum-intel)?

optimum-intel is a Hugging Face-maintained Python package that integrates Intel acceleration toolkits, including OpenVINO, the Intel Neural Compressor and Intel Extension for PyTorch, with the Transformers and Diffusers libraries. The Intel and Hugging Face collaboration was announced in mid-2022, and the initial OpenVINO integration into optimum-intel landed on November 2, 2022, based on OpenVINO 2022.2.^[24] The package introduces OVModelForXxx wrapper classes that mirror Transformers' AutoModelForXxx classes but execute inference through OpenVINO Runtime, and it provides convenience APIs for exporting Hugging Face Hub models to OpenVINO IR and for invoking NNCF-backed quantization.^[25] In an early benchmark cited by Hugging Face, applying NNCF quantization through optimum-intel to a Vision Transformer reduced memory footprint by 3.8x (344MB to 90MB) and improved per-sample latency by 2.4x (98ms to 41ms), with negligible accuracy change (the quantized and original models both reached 87.6% accuracy on the same task).^[6]

The integration was deliberately staged: the November 2022 launch focused on encoder-style models (BERT, DistilBERT and their kin) where post-training static quantization and quantization-aware training are well understood, and subsequent releases broadened to encoder-decoder, decoder-only and diffusion architectures. More recent integrations cover weight-only INT4 and NF4 quantization of LLMs and pipelines that combine optimum-intel export with OpenVINO GenAI inference for end-to-end deployment, including the option to push the resulting compressed model back to the Hugging Face Hub for sharing.^[6]^[24]

Hugging Face's documentation positions optimum-intel as a "single-line" path to Intel acceleration: users replace from transformers import AutoModelForXxx with from optimum.intel import OVModelForXxx, change the model class name, and the rest of the standard Transformers and pipelines API continues to work, with OpenVINO performing inference under the hood on Intel CPU, GPU or NPU targets.^[25]

How fast is OpenVINO? Benchmarks and performance

OpenVINO publishes a regularly updated set of platform benchmarks through the OpenVINO Model Hub, covering throughput and latency for representative computer-vision and generative workloads across Intel CPUs, GPUs and NPUs. The internal benchmark_app tool, distributed with the runtime, lets users reproduce these measurements on local hardware.^[26]

In standardized industry benchmarks, Intel reported on May 5, 2025 that its Core Ultra Series 2 processors were the first to achieve full Neural Processing Unit support in the MLPerf Client v0.6 benchmark from MLCommons (see MLPerf); the submission used the Llama 2 7B model and recorded a first-token latency of 1.09 seconds and an NPU throughput of 18.55 tokens per second. Intel attributed the result to joint optimization between its NPU hardware teams and the OpenVINO software stack.^[27] Independent coverage of MLPerf Client results has reported the integrated Intel Arc GPU delivering 93.5 tokens per second on Llama 3.1 8B, behind only AMD's Radeon Pro W7900, with a 0.12 second time-to-first-token that beat both AMD and NVIDIA cards in the same comparison.^[28]

For Lunar Lake (Core Ultra Series 2), OpenVINO documentation notes that systems may need more than 16 GB of RAM to process prompts longer than 1024 tokens with 7B-class models such as Llama-2-7B, Mistral-0.2-7B and Qwen-2-7B, and recommends INT4 weight quantization as the operating point that fits most 3B to 7B models into NPU memory.^[16]

What is OpenVINO used for? Applications

OpenVINO is widely deployed as an inference backend for Edge AI applications because of its support for Intel-class CPUs and integrated GPUs that are common in industrial PCs, retail kiosks, security cameras and consumer laptops. Documented uses include medical imaging at the edge: for example, real-time inferencing pipelines in CT, ultrasound and MRI workflows surveyed by the Edge AI and Vision Alliance, where vendors describe replacing custom CPU code paths with OpenVINO to obtain consistent throughput across multiple Intel device generations.^[29] On the consumer side, QNAP ships OpenVINO as a built-in deep-learning runtime on its network-attached storage appliances so that on-NAS image classification, face recognition and similar Convolutional Neural Network-based features can run without relying on external GPUs.^[30]

In the generative-AI era, OpenVINO has been positioned as the principal runtime for "AI PC" workloads on Intel client silicon, integrating with the llama.cpp ecosystem and exposing both LLM and VLM pipelines through OpenVINO GenAI; the 2026.0 release added a preview Text2Video pipeline using the LTX-Video model.^[21] On the server side, OVMS is used to serve OpenVINO-optimized models in Kubernetes and OpenShift clusters; Intel documents reference architectures for scaling OVMS pods behind a load balancer with autoscaling driven by per-model throughput.^[22]

How is OpenVINO used in academic research?

NNCF, which is part of the OpenVINO toolkit ecosystem, is itself the subject of an Intel-authored research paper (arXiv:2002.08679), establishing a citable primary source for downstream academic work that uses OpenVINO-compatible compression methods.^[19] Beyond that foundational paper, OpenVINO appears as an inference backend in published benchmarks of edge deep-learning systems: independent studies of resource-efficient medical image classification for edge devices and of analog in-memory computing for medical-imaging segmentation cite OpenVINO-compatible deployment paths when discussing how to evaluate quantized neural networks on commodity hardware, and Intel collaborators have used OpenVINO to characterize quantization tradeoffs for DNN inference on edge devices.^[19]^[28]

What are the limitations of OpenVINO?

OpenVINO has historically been most performant on Intel hardware, and while CPU support extends to Arm processors, performance optimization on non-Intel platforms is less mature than on Intel-specific paths. The retirement of Movidius VPU support after the 2022.3.x LTS line removed a class of dedicated low-power vision accelerators that some industrial customers had built systems around, leaving those users to either freeze on the LTS branch or migrate to NPU-equipped Core Ultra processors.^[13] FPGA support was likewise discontinued after the 2018 generation of "with FPGA Support" packages.^[18]

On the NPU side, real-world reports note that the integrated NPU in some Core Ultra configurations can be slower than the same machine's CPU for certain LLM workloads, with users filing performance issues against the openvinotoolkit/openvino repository documenting these cases; appropriate device selection between CPU, GPU and NPU therefore remains workload-dependent.^[31] The end-of-life of the standalone openvino-dev package and the Model Optimizer CLI in the 2025 release line is a backwards-incompatible change that has required users to migrate scripts to the new openvino.convert_model API.^[15]

How does OpenVINO compare to TensorRT and ONNX Runtime?

Toolkit	Primary vendor	Native graph format	Hardware focus	License
OpenVINO	Intel	OpenVINO IR (XML + bin)	Intel CPU, GPU, NPU	Apache 2.0
TensorRT	NVIDIA	TensorRT engine	NVIDIA GPU (CUDA)	Proprietary
ONNX Runtime	Microsoft and community	ONNX	CPU, GPU, NPU via execution providers	MIT
TensorFlow Lite (LiteRT)	Google	TFLite FlatBuffer	Mobile and embedded (Arm CPU, NPU, GPU)	Apache 2.0
llama.cpp	Open-source community	GGUF	CPU, plus GPU/NPU backends	MIT

Compared with NVIDIA's TensorRT, OpenVINO targets a broader set of CPU-class devices and is open source, but it does not run on NVIDIA GPUs and has historically lagged TensorRT in raw throughput on data-center accelerators.^[1] Compared with ONNX Runtime, OpenVINO can act as an ONNX Runtime execution provider but also exposes its own IR and runtime APIs that allow more aggressive Intel-specific optimizations.^[14] On the serving side, OVMS competes with TensorFlow Serving, the KServe stack, BentoML and Ray Serve for the "production model server" role; its differentiator is that the bundled runtime is OpenVINO and it is API-compatible with both TensorFlow Serving and KServe.^[22]

References

openvinotoolkit, "openvino: OpenVINO is an open source toolkit for optimizing and deploying AI inference", GitHub, 2026-04-07. https://github.com/openvinotoolkit/openvino. Accessed 2026-05-21. ↩
Khari Johnson, "Intel launches OpenVINO computer vision toolkit for edge computing", VentureBeat, 2018-05-16. https://venturebeat.com/ai/intel-launches-openvino-computer-vision-toolkit-for-edge-computing. Accessed 2026-05-21. ↩
Intel Corporation, "Release Notes for Intel Distribution of OpenVINO Toolkit 2024.0", Intel Developer Zone, 2024-03-06. https://www.intel.com/content/www/us/en/developer/articles/release-notes/openvino/2024-0.html. Accessed 2026-05-21. ↩
openvinotoolkit, "openvino/LICENSE at master", GitHub, 2024. https://github.com/openvinotoolkit/openvino/blob/master/LICENSE. Accessed 2026-05-21. ↩
openvinotoolkit, "nncf: Neural Network Compression Framework for enhanced OpenVINO inference", GitHub, 2026-04-08. https://github.com/openvinotoolkit/nncf. Accessed 2026-05-21. ↩
Hugging Face, "Accelerate your models with Optimum Intel and OpenVINO", Hugging Face Blog, 2022-11-02. https://huggingface.co/blog/openvino. Accessed 2026-05-21. ↩
openvinotoolkit, "openvino.genai: Run Generative AI models with simple C++/Python API and using OpenVINO Runtime", GitHub, 2026-04-07. https://github.com/openvinotoolkit/openvino.genai. Accessed 2026-05-21. ↩
openvinotoolkit, "model_server: A scalable inference server for models optimized with OpenVINO", GitHub, 2026-04-07. https://github.com/openvinotoolkit/model_server. Accessed 2026-05-21. ↩
Wikipedia contributors, "Movidius", Wikipedia, 2024. https://en.wikipedia.org/wiki/Movidius. Accessed 2026-05-21. ↩
Kyle Wiggers, "Intel details Keem Bay, a next-gen Movidius Myriad VPU for computer vision at the edge", VentureBeat, 2019-11-13. https://venturebeat.com/technology/intel-details-keem-bay-a-movidius-processor-for-computer-vision-at-the-edge. Accessed 2026-05-21. ↩
openvinotoolkit, "Releases openvinotoolkit/openvino", GitHub, 2026-04-07. https://github.com/openvinotoolkit/openvino/releases. Accessed 2026-05-21. ↩
Intel Corporation, "Release Notes for Intel Distribution of OpenVINO Toolkit 2022", Intel Developer Zone, 2022. https://www.intel.com/content/www/us/en/developer/articles/release-notes/openvino/2022-2.html. Accessed 2026-05-21. ↩
Intel Community, "NEW LTS RELEASE: Intel Distribution of OpenVINO toolkit 2022.3.1 LTS", Intel Community, 2023. https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/NEW-LTS-RELEASE-Intel-Distribution-of-OpenVINO-toolkit-2022-3-1/m-p/1499215. Accessed 2026-05-21. ↩
OpenVINO documentation, "Convert a PyTorch Model to OpenVINO IR", docs.openvino.ai, 2024. https://docs.openvino.ai/2024/notebooks/pytorch-to-openvino-with-output.html. Accessed 2026-05-21. ↩
Intel Corporation, "Release Notes for Intel Distribution of OpenVINO Toolkit 2025.0", Intel Developer Zone, 2025. https://www.intel.com/content/www/us/en/developer/articles/release-notes/openvino/2025-0.html. Accessed 2026-05-21. ↩
OpenVINO documentation, "OpenVINO GenAI on NPU", docs.openvino.ai, 2025. https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-genai/inference-with-genai-on-npu.html. Accessed 2026-05-21. ↩
OpenVINO documentation, "NPU Device", docs.openvino.ai, 2024. https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/npu-device.html. Accessed 2026-05-21. ↩
intel-iot-devkit, "openvino-with-fpga-hello-world-face-detection", GitHub, 2018. https://github.com/intel-iot-devkit/openvino-with-fpga-hello-world-face-detection. Accessed 2026-05-21. ↩
Alexander Kozlov, Ivan Lazarevich, Vasily Shamporov, Nikolay Lyalyushkin, Yury Gorbachev, "Neural Network Compression Framework for fast model inference", arXiv:2002.08679, 2020-02-20. https://arxiv.org/abs/2002.08679. Accessed 2026-05-21. ↩
openvinotoolkit, "openvino_tokenizers: OpenVINO Tokenizers extension", GitHub, 2026-04-07. https://github.com/openvinotoolkit/openvino_tokenizers. Accessed 2026-05-21. ↩
OpenVINO toolkit, "OpenVINO 2026.0: New Models, Enhanced GenAI, and Smarter Compression", Medium / OpenVINO toolkit blog, 2026. https://medium.com/openvino-toolkit/openvino-2026-0-new-models-enhanced-genai-and-smarter-compression-bf846a59cda8. Accessed 2026-05-21. ↩
OpenVINO documentation, "OpenVINO Model Server: What is OpenVINO Model Server", docs.openvino.ai, 2025. https://docs.openvino.ai/2025/model-server/ovms_what_is_openvino_model_server.html. Accessed 2026-05-21. ↩
OpenVINO documentation, "Efficient LLM Serving", docs.openvino.ai, 2025. https://docs.openvino.ai/2025/model-server/ovms_docs_llm_reference.html. Accessed 2026-05-21. ↩
Hugging Face, "Optimize and deploy with Optimum-Intel and OpenVINO GenAI", Hugging Face Blog, 2024. https://huggingface.co/blog/deploy-with-openvino. Accessed 2026-05-21. ↩
huggingface, "optimum-intel: Optimum Intel: Accelerate inference with Intel optimization tools", GitHub, 2024. https://github.com/huggingface/optimum-intel. Accessed 2026-05-21. ↩
OpenVINO toolkit, "Introducing OpenVINO Model Hub: Benchmark AI Inference with Ease", Medium / OpenVINO toolkit blog, 2024. https://medium.com/openvino-toolkit/introducing-openvino-model-hub-benchmark-ai-inference-with-ease-2cd7ad8f5e4d. Accessed 2026-05-21. ↩
Intel Newsroom, "Intel Achieves First, Only Full NPU Support in MLPerf Client v0.6 Benchmark", Intel, 2025-05-05. https://newsroom.intel.com/client-computing/intel-achieves-first-only-full-npu-support-mlperf-client-v0-6-benchmark. Accessed 2026-05-21. ↩
Jon Peddie Research, "Intel the first company to attain NPU support in MLPerf Client v0.6 benchmark", JPR, 2025. https://www.jonpeddie.com/news/intel-the-first-company-to-attain-npu-support-in-mlperf-client-v0-6-benchmark/. Accessed 2026-05-21. ↩
Edge AI and Vision Alliance, "Real-time Medical Imaging Using OpenVINO", Edge AI and Vision Alliance, 2022-04. https://www.edge-ai-vision.com/2022/04/real-time-medical-imaging-using-openvino/. Accessed 2026-05-21. ↩
QNAP Systems, "OpenVINO on NAS: Accelerate and verify deep learning inference", QNAP, 2024. https://www.qnap.com/en/solution/openvino. Accessed 2026-05-21. ↩
openvinotoolkit, "[Performance]: 265k NPU much slower than CPU at LLM inference (Issue #32306)", GitHub, 2025. https://github.com/openvinotoolkit/openvino/issues/32306. Accessed 2026-05-21. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributor · full history

Suggest edit

What links here

Intel Keras NVIDIA Triton Inference Server torch.compile

Infobox

What does OpenVINO stand for?

When was OpenVINO released?

Origins and the 2018 launch

2019 to 2022: stabilization and LTS cadence

2023 to 2026: PyTorch frontend, NPU, and generative AI

How does OpenVINO work?

Architecture

What is the OpenVINO Intermediate Representation (IR)?

What hardware does OpenVINO support?

How does OpenVINO optimize models? Quantization and the NNCF

What are OpenVINO Tokenizers?

What is OpenVINO GenAI?

What is the OpenVINO Model Server (OVMS)?

How does OpenVINO integrate with Hugging Face (optimum-intel)?

How fast is OpenVINO? Benchmarks and performance

What is OpenVINO used for? Applications

How is OpenVINO used in academic research?

What are the limitations of OpenVINO?

How does OpenVINO compare to TensorRT and ONNX Runtime?

See also

References

Improve this article

Related Articles

NVIDIA Dynamo

ExLlamaV2 (EXL2)

Optimum-Quanto

Text Generation Inference (TGI)

NVIDIA Triton Inference Server

TensorFlow Serving

What links here

Related Articles

NVIDIA Dynamo

ExLlamaV2 (EXL2)

Optimum-Quanto

Text Generation Inference (TGI)

NVIDIA Triton Inference Server

TensorFlow Serving

What links here