NVIDIA NIM

AI Inference AI Infrastructure Developer Tools Enterprise AI NVIDIA

31 min read

Updated Jun 23, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 23, 2026

Fact-checked

In review queue

Sources

26 citations

Revision

v4 · 6,232 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

NVIDIA NIM (NVIDIA Inference Microservices) is a set of containerized, prebuilt-and-optimized model-serving microservices from NVIDIA that package an AI model, an optimized inference engine, and an OpenAI-compatible REST API into a single Docker container for deployment across cloud environments, data centers, workstations, and edge hardware. NVIDIA announced NIM at GTC on March 18, 2024, and built it on its own inference software, including NVIDIA Triton Inference Server and NVIDIA TensorRT-LLM, with NVIDIA stating the prebuilt containers "enable developers to reduce deployment times from weeks to minutes."^[1] Enterprises can pull a NIM from the NGC container registry, run it on any NVIDIA GPU infrastructure, and begin serving requests within minutes rather than weeks.^[1]^[3]

NIM is distributed as part of NVIDIA AI Enterprise and is optimized to run across what NVIDIA describes as its "CUDA installed base of hundreds of millions of GPUs across clouds, data centers, workstations and PCs."^[1] By tying optimized inference to NVIDIA's GPU lineup and offering production access through an NVIDIA AI Enterprise license, NVIDIA positions NIM as both a developer convenience layer and an enterprise software subscription.

Jensen Huang introduced NIM during his GTC 2024 keynote at the San Jose Convention Center, framing it as a way to turn NVIDIA's GPU installed base into a production AI deployment platform. "Established enterprise platforms are sitting on a goldmine of data that can be transformed into generative AI copilots," Huang said in the launch announcement.^[1] The announcement came alongside a set of healthcare-specific NIMs under the BioNeMo brand, covering drug discovery, medical imaging, and genomics.^[10] NVIDIA described NIM as part of its broader strategy to sell software and services on top of its GPU hardware business, with NVIDIA AI Enterprise providing the commercial licensing framework.

What problem does NIM solve?

Before NIM, deploying an open-weight language model to production required assembling several moving parts manually. A team would download model weights from a repository such as Hugging Face, select and configure an inference server (such as vLLM, Text Generation Inference, or NVIDIA Triton Inference Server), compile TensorRT engines for the target GPU architecture, write a serving wrapper that exposed an HTTP endpoint, and then containerize the whole stack. For a team without deep MLOps experience, that process could consume weeks of engineering time. For teams who did have that experience, the work was repeatable but not portable: an engine compiled for an A100 cluster would not run on an H100 without recompilation.

The problem was worse at scale. Running multiple models simultaneously meant maintaining separate container images, separate version pins, and separate optimization profiles. Updating a model from one version to the next triggered a full revalidation cycle. Fine-tuned adapters added another layer of bookkeeping.

NVIDIA's response was to pre-solve all of that for its own GPU ecosystem. Each NIM ships with multiple pre-compiled inference profiles, and the container selects the right profile at launch based on the detected hardware. The operator does not need to know whether TensorRT or vLLM is running under the hood; the container handles it.

The timing of NIM's release reflected industry dynamics in early 2024. The open-weight model ecosystem had matured significantly: Meta's Llama 2 and Mistral 7B had demonstrated that capable models could be self-hosted, and demand for on-premises inference was growing among enterprises that could not or would not send proprietary data to third-party API providers. At the same time, the tooling to actually deploy those models remained fragmented and difficult to validate. NIM addressed that gap by treating inference deployment as a product rather than a DIY engineering problem.

How is a NIM container built?

Container structure

A NIM container is a Docker image stored in NVIDIA's NGC container registry at nvcr.io. The image includes:

The base model weights (or a pointer to fetch them from NGC at runtime)
One or more compiled TensorRT-LLM engine profiles, each tuned for a specific GPU architecture and optimization objective
A vLLM or SGLang runtime as a fallback for GPUs that do not have a pre-compiled TensorRT profile
An OpenAI-compatible HTTP API server that exposes /v1/chat/completions, /v1/completions, and /v1/embeddings endpoints
CUDA libraries and domain-specific extensions
An enterprise base layer with security patches and support tooling

The container exposes port 8000 by default. To pull and start a NIM for Llama 3.1 8B, an operator authenticates to the NGC registry with an API key, then runs a single docker run command with the image path and a mount point for a local model cache directory.

Inference engine selection

NIM uses a profile system to select the appropriate inference backend. On first startup, the container inspects the available GPU hardware, reads the model's profile manifest, and loads the best matching engine.^[3] Profiles are tagged by:

GPU architecture (Ampere, Hopper, Ada Lovelace, Blackwell)
Optimization objective (latency vs. throughput)
Quantization format (FP16, BF16, FP8, INT4, INT8)
Tensor parallelism degree (single GPU, multi-GPU)

For high-end data center GPUs with a pre-compiled TensorRT-LLM profile, the container loads that engine directly. For other NVIDIA GPUs or architectures without a compiled profile, it falls back to vLLM. SGLang is available for some models as a third option.

TensorRT-LLM backend

When a TensorRT-LLM profile is available, it provides the highest throughput of the three backends. TensorRT-LLM compiles the model graph into a GPU-specific execution plan, fuses operations, and applies continuous batching along with paged KV-cache management.^[14] NVIDIA publishes two TensorRT-LLM profile variants per model: a latency profile that minimizes time-to-first-token and inter-token latency for interactive workloads, and a throughput profile that maximizes tokens generated per second for batch workloads.

On H100 SXM hardware, NVIDIA benchmarks show approximately 2x throughput improvement for Llama 3.1 8B compared to a default deployment: roughly 1,200 tokens per second with TensorRT-LLM optimization versus around 610 tokens per second without it.^[2]

OpenAI-compatible API layer

The HTTP server inside each NIM exposes endpoints that are structurally identical to OpenAI's API.^[3] A developer who has code calling client.chat.completions.create() from the OpenAI Python SDK can point that code at a locally running NIM by changing a single base URL, with no other modifications. NVIDIA also exposes its own extended endpoints for features that have no OpenAI equivalent, such as per-request LoRA adapter selection.

The /v1/models endpoint returns metadata about the loaded model, and the OpenAI-compatible format allows direct integration with frameworks such as LangChain, LlamaIndex, and most other LLM orchestration libraries.

PEFT and LoRA adapter support

NIM supports dynamic loading of Low-Rank Adaptation (LoRA) adapters without restarting the container. Adapters trained with the NVIDIA NeMo framework or the Hugging Face PEFT library can be stored in a local directory, and the NIM loads them into a multi-tier cache at runtime.^[7] When a client submits an inference request, it can specify a named LoRA adapter; the server pulls the corresponding weights and applies them to the base model computation.

This multi-LoRA serving mode lets a single NIM container serve dozens of fine-tuned variants of one base model simultaneously, without duplicating the base weights on disk or in GPU memory. NVIDIA's documentation refers to deploying a large collection of adapters this way as a "swarm of LoRA adapters."^[8]

Kubernetes and orchestration

NVIDIA ships Helm charts for deploying NIM on Kubernetes clusters. The NVIDIA NIM Operator (available as a separate component) automates the lifecycle of NIM pods, handling version upgrades, health checks, and GPU resource scheduling. NIM integrates with NVIDIA's GPU Operator for Kubernetes, which manages CUDA drivers and device plugins across cluster nodes.

For multi-NIM deployments, NVIDIA provides the NVIDIA Inference Manager (part of NVIDIA AI Enterprise), which routes requests across multiple running NIMs and provides load balancing.

NIM Operator

The NVIDIA NIM Operator is a Kubernetes operator designed to automate NIM deployment and management within a cluster. It handles the full pod lifecycle: pulling updated container images from NGC, draining connections before rolling restarts, restoring previous versions on failure, and reporting health status to Kubernetes cluster management tools. The operator reads a custom resource definition (CRD) that specifies which NIM to run, the GPU resource limits, and the model cache mount path. Once the CRD is applied, the operator handles the rest, including authentication to the NGC registry using a Kubernetes secret.

For air-gapped deployments (environments with no internet access), NVIDIA supports pre-pulling NIM container images to an internal container registry and mounting pre-downloaded model weights from a local network storage volume, so that NIMs never need to contact NVIDIA's infrastructure after initial setup.

Observability and monitoring

NIM containers expose Prometheus-compatible metrics at a /metrics endpoint. Standard metrics include request count, request latency histograms, token throughput (input tokens per second and output tokens per second), batch size distribution, KV-cache utilization, and GPU memory usage. These metrics integrate with Grafana dashboards for production monitoring. NVIDIA provides reference dashboard configurations alongside its Helm charts.

Request-level logging can be configured to emit structured JSON logs, which can be forwarded to centralized logging systems such as Elasticsearch or Splunk. Log fields include request ID, model name, prompt token count, completion token count, latency, and the selected inference profile.

NVIDIA Dynamo and distributed serving

At GTC 2025 on March 18, 2025, NVIDIA introduced NVIDIA Dynamo, an open-source distributed inference framework aimed at serving large reasoning models across many GPUs in what NVIDIA calls an AI factory.^[16] Dynamo builds on the modular architecture of NVIDIA Triton Inference Server and supports multiple backends, including PyTorch, SGLang, TensorRT-LLM, and vLLM.^[17] Its core innovations are disaggregated serving (running the compute-bound prefill phase and the memory-bound decode phase on separate GPUs), a smart router that tracks the KV cache across a GPU fleet to avoid redundant recomputation, a GPU planner that adds and removes GPUs as demand shifts, and a distributed KV cache manager that offloads less-frequently-accessed cache to CPU memory, local storage, or networked object storage.^[16]^[17] NVIDIA reported that Dynamo boosts throughput on the DeepSeek-R1 671B model running on a Blackwell GB200 NVL72 system by 30x, and more than doubles performance on Llama 70B running on Hopper GPUs.^[16] For enterprises, NVIDIA stated that Dynamo would be included with NVIDIA NIM microservices as part of NVIDIA AI Enterprise, and the framework is published openly in the ai-dynamo/dynamo GitHub repository.^[16]^[17]

What are NIM Agent Blueprints?

In August 2024, NVIDIA extended the NIM platform with NIM Agent Blueprints, a catalog of pre-built, end-to-end generative AI application workflows.^[4] A Blueprint packages multiple NIM containers, orchestration logic, sample code, and deployment charts for a specific enterprise use case. Developers can take a Blueprint, plug in their own data and branding, and deploy a functioning application rather than building each component from scratch.^[5]

The initial Blueprint releases covered three use cases:

A digital human workflow for customer service, creating a real-time 3D animated avatar interface driven by a speech recognition NIM, an LLM NIM, and a text-to-speech NIM
A generative virtual screening workflow for drug discovery, combining molecular generation models with protein structure prediction to accelerate candidate identification
A multimodal PDF data extraction workflow for enterprise retrieval-augmented generation (RAG), using NVIDIA NeMo Retriever NIMs to process PDF documents and build high-accuracy retrieval pipelines

NVIDIA announced plans for monthly Blueprint releases covering customer service, content generation, software engineering, retail shopping advisors, and R&D applications.^[4] Partners including Accenture, Deloitte, Cisco, Dell Technologies, Hewlett Packard Enterprise, Lenovo, SoftServe, and World Wide Technology were among the first systems integrators offering Blueprint-based solutions to enterprise customers.^[4]

Blueprints are published on build.nvidia.com alongside documentation, sample code, and Helm charts. Each Blueprint is designed around what NVIDIA calls a "data-driven generative AI flywheel": user interactions with the deployed application generate new data, which can be used to fine-tune LoRA adapters, which are then loaded back into the NIM without restarting the service.^[5] NVIDIA NeMo handles the fine-tuning step, and NVIDIA AI Foundry provides the production environment for managing adapter versions.

The customer service digital human Blueprint integrates several NIM types in sequence. An audio NIM transcribes the user's voice. An LLM NIM generates a text response. A text-to-speech NIM converts the text to audio. An Avatar NIM (from NVIDIA ACE, the Avatar Cloud Engine platform) drives the facial animation of a rendered character. All four components communicate over the local network using the same OpenAI-compatible API format, which makes it possible to swap any component for an alternative NIM without rewriting the orchestration logic.

NVIDIA later rebranded NIM Agent Blueprints simply as NVIDIA Blueprints, expanding the catalog beyond agentic workflows to include reference architectures for video analytics, document processing, and scientific computing.

At GTC 2025, NVIDIA added the AI-Q Blueprint, a reference workflow for connecting enterprise knowledge sources to teams of autonomous AI agents, which it scheduled for availability in April 2025.^[18] Alongside it, NVIDIA released an open-source agent toolkit (the Agent Intelligence toolkit, later known as NeMo Agent toolkit) on GitHub for orchestrating multi-agent systems built on NIM and NeMo microservices.^[18]

What models does NIM support?

The NIM catalog at build.nvidia.com lists over 100 models across multiple domains.^[3] The catalog is organized into several categories.

Large language models

The LLM catalog includes models from multiple providers packaged as NIMs:

Model family	Examples
Meta Llama	Llama 3.1 8B, 70B, 405B; Llama 3.2 1B, 3B, 11B, 90B; Llama 4 Scout, Llama 4 Maverick
Mistral / Mixtral	Mistral 7B, Mistral NeMo 12B, Mixtral 8x7B, Mixtral 8x22B, Mistral Large
Google	Gemma 2 2B, 9B, 27B
Microsoft	Phi-3 Mini, Phi-3 Medium, Phi-3.5
NVIDIA Nemotron	Nemotron-4 340B, Nemotron-Mini 4B, Llama 3.1 Nemotron 70B; Llama Nemotron Nano/Super/Ultra; Nemotron 3 Nano
DeepSeek	DeepSeek-R1 (preview January 2025; generally available January 30, 2025)
OpenAI open models	gpt-oss-120b, gpt-oss-20b
Code-focused	Code Llama 70B, Qwen2.5 Coder

Llama 3.1 405B, one of the largest openly licensed models available, requires multiple H100 GPUs and is served with tensor parallelism enabled across those GPUs.

NVIDIA also packages its own fine-tuned Nemotron models. Llama 3.1 Nemotron 70B is a Nemotron post-trained version of Llama 3.1 70B that NVIDIA released as an instruction-following model with particularly high scores on MT-Bench. The Nemotron-Mini 4B model is tuned for edge and workstation deployment on RTX hardware.

The model catalog is updated regularly as new open-weight models are released. NVIDIA typically adds a NIM for a major new model within weeks of the model's public release, though the timeline depends on the GPU compatibility work required to produce TensorRT-LLM profiles for it. Several high-profile 2025 frontier and open models were packaged as NIM microservices, including Meta's Llama 4 Scout and Llama 4 Maverick (released April 5, 2025) and OpenAI's gpt-oss-120b and gpt-oss-20b open-weight models (released August 5, 2025).^[19]^[20]

Embedding models

NIM includes a set of text and multimodal embedding NIMs for search and retrieval applications:

Model	Description
NV-EmbedQA-E5-v5	Text embedding for question-answering retrieval
NV-EmbedCode-v1	7B Mistral-based model for code retrieval
NV-CLIP	Multimodal image and text embedding
Llama 3.2 NeMo Retriever Embedding	1.6B multimodal embedding for RAG pipelines

Vision language models

Vision language model NIMs accept image plus text inputs and generate text outputs:

Model	Description
Llama 3.2 Vision 11B / 90B	Meta's multimodal Llama models
Mistral NeMo VLM	12B vision-language model
Nemotron Nano 12B v2 VL	Multi-image and video understanding
NVIDIA NVLM-D-72B	High-parameter vision model
NEVA-22B	NVIDIA's own vision-language model

Speech models

NVIDIA's Riva speech platform is packaged as NIM containers for automatic speech recognition (ASR), text-to-speech synthesis (TTS), neural machine translation (NMT), and speaker diarization. Riva NIMs include both streaming and batch inference modes, with the streaming ASR NIM able to return partial transcriptions with low latency, suitable for live transcription applications.

Specialized domain models

For healthcare and life sciences, NVIDIA offers BioNeMo NIMs covering protein structure prediction (ESMFold), generative molecular design (MolMIM), molecular docking (DiffDock), and genomics analysis. These launched alongside the initial NIM announcement at GTC 2024, with NVIDIA describing them as more than two dozen healthcare-specific microservices.^[10]

ESMFold is a protein language model from Meta that predicts protein 3D structure from amino acid sequences. Packaged as a NIM, it exposes a REST API that accepts a FASTA-format amino acid string and returns structure predictions in PDB format, allowing downstream molecular visualization and docking tools to consume the output directly.

MolMIM (Molecular Multimodal Information Model) is a generative chemistry model that proposes new drug-like molecules with specified physicochemical properties. The MolMIM NIM accepts a SMILES string and a set of property constraints, and returns novel candidate molecules ranked by predicted property values.

DiffDock is a diffusion model for protein-ligand docking. Given a protein structure and a small molecule, it predicts the 3D binding pose, which is useful for evaluating whether a candidate compound is likely to bind to a therapeutic target.

Visual generative AI models

NIM also covers image and video generation, including Stable Diffusion and other diffusion-based models, available through the visual generative AI NIM catalog. These NIMs accept text prompts (and optionally input images) and return generated images via a REST API. Enterprise deployments use these for content creation pipelines, product visualization, and synthetic data generation for training computer vision models.

OpenUSD and 3D world models

At SIGGRAPH in July 2024, NVIDIA introduced a set of NIM microservices built around OpenUSD (Universal Scene Description) to accelerate the development of digital twins, robotics simulation, and industrial 3D workflows that run in NVIDIA Omniverse.^[21] These include the USD Code NIM (which generates OpenUSD Python code from text prompts), the USD Search NIM (natural-language and image search across 3D asset libraries), and the USD Validate NIM (which checks file compatibility and renders RTX preview images), along with additional services for scene layout, material application, mesh generation from point-cloud data, physics super-resolution, and large-scale neural radiance fields.^[21] Foxconn and WPP were named among the first adopters, using the microservices for factory digital twins and content generation pipelines respectively.^[21]

Media and broadcast models

NVIDIA's AI for Media stack (formerly NVIDIA Maxine) packages real-time audio and video enhancement models as NIM microservices for media and entertainment workflows. A 2026 product update introduced a lineup of SMPTE ST 2110-compliant NIM microservices designed to drop directly into live, IP-based broadcast pipelines, including LipSync, Active Speaker Detection, Studio Voice (speech enhancement), and Video Super Resolution.^[22] The update also added the NVIDIA Maxine Synthetic Video Detector, a NIM microservice that estimates the probability that a video clip was generated by AI diffusion models using an ensemble of DINOv2 and DINOv3 features with TensorRT optimization.^[23] NVIDIA reported that the detector reaches roughly 92 percent accuracy on uncompressed video and can process frames in as little as 22 milliseconds, and noted that it was offered in private access at the time of the update.^[22]^[23]

How much does NIM cost?

NIM has two access tiers with distinct terms.

Development access

Any developer with a free NVIDIA Developer Program account can generate an NGC API key and pull NIM containers for development and testing. The hosted inference API at build.nvidia.com provides free access to NIM endpoints (backed by NVIDIA DGX Cloud infrastructure) for evaluation, limited by a rate cap per key. This tier does not require an NVIDIA AI Enterprise license and is intended for prototyping.

Production access

Deploying NIM in production requires an NVIDIA AI Enterprise license.^[6] NVIDIA AI Enterprise is a commercial software subscription that includes NIM along with other NVIDIA AI frameworks such as NeMo, Riva, and RAPIDS.

NVIDIA AI Enterprise pricing (as of 2024-2025):

License type	Price per GPU
1-year subscription	$4,500
2-year subscription	$9,000
3-year subscription	$13,500
5-year subscription	$18,000
Perpetual license	$22,500 (includes 5 years of support)

The entry production license is $4,500 per GPU per year, or approximately $1 per GPU per hour for cloud deployments.^[6] Cloud deployments on AWS, Google Cloud, Azure, and Oracle Cloud can use that pay-as-you-go model, billed on top of the cloud provider's instance costs.

EDU institutions and startups in NVIDIA's Inception program receive a 75% discount on list prices. Startups can purchase up to 64 one-year subscriptions at reduced rates through the Inception program; companies in the Connect program for software developers can buy up to 16.

Pricing is per GPU rather than per NIM or per model. A server with eight H100 GPUs running a single NIM costs the same license fee as one running eight NIMs on the same hardware.

NVIDIA AI Enterprise also includes a 90-day free evaluation license for organizations that want to test production NIM deployment before committing to a subscription.^[6] The evaluation license covers the same features as the paid subscription and can be activated through the NVIDIA AI Enterprise portal.

NVIDIA AI Enterprise bundles more than just NIM. The subscription includes NVIDIA NeMo (for fine-tuning and training), NVIDIA Riva (for speech AI), NVIDIA Merlin (for recommendation systems), NVIDIA cuOpt (for route optimization), and NVIDIA RAPIDS (for GPU-accelerated data science). This bundling means that organizations already using other NVIDIA AI software may find that NIM production access comes included in a license they already hold.

How is NIM deployed and accessed?

NIM is accessible through three primary routes:

Hosted API (build.nvidia.com): Free endpoints hosted on DGX Cloud hardware, returning completions via the OpenAI-compatible format. Suitable for prototyping and evaluation. Rate-limited per API key.

Self-hosted containers: Pull from nvcr.io using an NGC API key, run on any supported NVIDIA hardware. Requires NVIDIA AI Enterprise for production. Operators have full control over infrastructure, data residency, and model versions.

Cloud marketplace deployments: AWS SageMaker, Google Cloud Vertex AI, Microsoft Azure AI, and Oracle Cloud all offer NIM through their respective AI model catalogs.^[11] On these platforms, the NIM container runs on GPU instances managed by the cloud provider, and the NVIDIA AI Enterprise license can be included in the cloud billing.

NVIDIA also supports deployment on NVIDIA-Certified Systems, a program covering validated server hardware from Cisco, Dell Technologies, Hewlett Packard Enterprise, Lenovo, and Supermicro. For on-premises deployments, validated Kubernetes distributions include Red Hat OpenShift, VMware Tanzu, and Canonical Charmed Kubernetes.

How does NIM compare to vLLM, Triton, and other inference platforms?

The enterprise inference deployment space includes several alternatives to NIM, each with different tradeoffs.

Platform	Deployment model	Hardware requirement	API compatibility	License
NVIDIA NIM	Self-hosted container or hosted API	NVIDIA GPUs only	OpenAI-compatible	NVIDIA AI Enterprise (production)
Hugging Face Inference Endpoints	Managed cloud	Multi-vendor GPUs	OpenAI-compatible (partial)	Per-hour billing
vLLM	Self-hosted, open source	NVIDIA + AMD GPUs	OpenAI-compatible	Apache 2.0
Modal (platform)	Serverless cloud	NVIDIA GPUs	Custom API	Per-second billing
NVIDIA Triton Inference Server	Self-hosted, open source	Multi-hardware	gRPC / HTTP	Apache 2.0
Replicate	Managed cloud	NVIDIA GPUs	Custom API	Per-second billing
Together AI	Managed cloud / self-hosted	NVIDIA GPUs	OpenAI-compatible	Per-token billing

NIM vs. vLLM: vLLM is the closest open-source analog to NIM. Both use continuous batching, paged attention, and OpenAI-compatible APIs. vLLM supports AMD GPUs and Intel GPUs in addition to NVIDIA, and it is fully open source under the Apache 2.0 license. NIM wraps vLLM as one of its backends and adds pre-compiled TensorRT-LLM profiles (unavailable in plain vLLM), multi-LoRA dynamic serving, enterprise support contracts, and a curated model catalog vetted by NVIDIA.^[13]^[14] For teams already comfortable with vLLM who do not need TensorRT-LLM profiles or enterprise SLAs, NIM adds overhead without clear benefit. For teams that want a supported, pre-optimized stack, NIM reduces the configuration burden.

NIM vs. Hugging Face Inference Endpoints: Hugging Face Inference Endpoints provides a managed service for deploying models from the Hugging Face Hub. The service supports multiple GPU providers and does not require an NVIDIA-specific license. It runs TGI or vLLM under the hood and charges per hour of GPU time. NIM offers more GPU-specific optimization (TensorRT-LLM profiles) and a richer model catalog outside the Hub (domain models, speech models, molecular biology models), but requires an NVIDIA AI Enterprise contract for production self-hosting. In June 2024, Hugging Face and NVIDIA announced a collaboration to offer NVIDIA-optimized NIM endpoints through Hugging Face's Serverless Inference API, blending the Hub's model discovery with NVIDIA's inference stack.^[12]

NIM vs. Modal: Modal (platform) offers serverless GPU compute billed per second, with containers that scale to zero when idle. It does not provide a curated model catalog; operators bring their own code and containers. NIM requires persistent infrastructure (or cloud instances) that are always running, which adds cost for sporadic workloads but reduces latency for steady-state production traffic. Modal's flexibility suits research and variable workloads; NIM targets steady-state enterprise production traffic.

What is NIM used for?

Enterprise deployments of NIM cluster around several patterns.

Customer-facing chat and copilots: Organizations integrate NIM-served LLMs into internal helpdesks, customer service interfaces, and productivity copilots. ServiceNow, SAP, and Box were among the enterprise software vendors announced at GTC 2024 as early NIM adopters, integrating NIM-powered inference into their platforms.^[1]

Retrieval-augmented generation pipelines: NIM provides both embedding models (for indexing and query encoding) and LLMs (for answer generation) within the same container ecosystem, making it straightforward to build RAG pipelines where both components are served through the same API format. The NeMo Retriever NIM microservices are specifically tuned for retrieval tasks.

Code generation and software development: Code-specific models like Code Llama 70B are available as NIMs. Enterprises deploy these behind internal developer tools or IDEs, serving code completions and explanations within their security perimeter rather than routing code to third-party APIs.

Healthcare and life sciences: Pharmaceutical companies use BioNeMo NIMs for in silico drug screening, where molecular generation models propose candidate compounds and structure prediction models (ESMFold, DiffDock) filter them. More than 200 organizations were integrating BioNeMo NIMs into their workflows within a year of the platform's launch.

Edge and workstation inference: NIM supports deployment on workstations with RTX 4080 and RTX 4090 GPUs, enabling inference at the edge for applications where sending data to a cloud endpoint is not acceptable due to latency, cost, or data privacy constraints. NVIDIA RTX AI PCs are an explicit supported platform.

Financial services and security: CrowdStrike was among the GTC 2024 launch partners, integrating NIM into its Falcon platform for security analytics.^[1] Financial institutions use NIM-hosted LLMs for document analysis, compliance review, and fraud detection, running inference within their own data centers to satisfy regulatory requirements around data residency.

Manufacturing and digital twins: Siemens, one of the named NIM adopters, explored integrating NIM with its industrial AI toolchain for predictive maintenance and process optimization. NVIDIA's Omniverse platform, which is used for industrial simulation and digital twins, integrates with NIM for language and reasoning tasks within simulation workflows.

Document processing at scale: The multimodal PDF extraction Blueprint lets organizations process large archives of PDFs, extracting text, tables, charts, and figures into structured formats suitable for downstream analysis or RAG pipelines. This use case targets legal, financial, and compliance document workflows where structured extraction from unstructured documents is a persistent bottleneck.

Adoption and ecosystem

At the GTC 2024 announcement, NVIDIA named Adobe, Cadence, CrowdStrike, Getty Images, SAP, ServiceNow, and Shutterstock as early NIM adopters.^[1] By August 2024, when NIM Agent Blueprints launched, the named enterprise customer list had grown to include Lowe's, Siemens, Cohesity, Dropbox, NetApp, and Glean.^[4]

Cloud providers integrated NIM into their AI model catalogs: AWS SageMaker, Google Cloud Vertex AI, and Microsoft Azure AI all offered NIM containers through their platforms by mid-2024.^[11] Oracle Cloud added NIM to Oracle Cloud Infrastructure in a more limited capacity.

Hardware partners including Dell, HPE, Lenovo, and Supermicro validated NIM on their NVIDIA-Certified server lines and began offering NIM as a pre-installed option on AI factory deployments. Cisco validated NIM on its UCS server platforms.

Red Hat OpenShift published guidance for running NIM on OpenShift AI in May 2025, reflecting growing interest in running NIMs within Kubernetes environments that use Red Hat's enterprise Linux distribution.^[15]

NIM 1.4, released in December 2024, achieved 2.4x faster inference than the 1.0 release across several benchmark configurations. DeepSeek-R1 was added as a preview NIM in January 2025 following the model's public release.

Cloudera reported 36x performance improvements after integrating NIM into its data platform, compared to previous inference configurations. This figure reflects both the TensorRT-LLM optimization and the removal of per-request engine compilation overhead that had characterized their earlier setup.

NVIDIA announced that as of late 2024, hundreds of enterprises had deployed NIM in production, with the catalog expanding from the initial 20+ models at GTC 2024 to over 100 models by the end of the year.^[1]^[3] The build.nvidia.com platform accumulated millions of API calls from developers using the free hosted endpoints for evaluation.

The NIM ecosystem also expanded through NVIDIA's AI Foundry service, a managed offering that combines NIM with NeMo fine-tuning and model management. AI Foundry targets organizations that want to customize a foundation model with proprietary data and deploy it as a NIM, without managing the fine-tuning infrastructure themselves.

2025-2026 developments

Through 2025 and into 2026, NVIDIA positioned NIM as the deployment layer for a new wave of reasoning and agentic models while expanding the catalog beyond language into media, 3D, and broadcast workflows.

DeepSeek-R1 general availability

The DeepSeek-R1 reasoning model, which had been added as a preview in January 2025, reached general availability as a NIM microservice on build.nvidia.com on January 30, 2025.^[24] NVIDIA reported that the 671-billion-parameter model could deliver up to 3,872 tokens per second on a single NVIDIA HGX H200 system when served through the NIM microservice, and stated that "the DeepSeek-R1 NIM microservice simplifies deployments with support for industry-standard APIs."^[24]

Llama Nemotron reasoning models

At GTC 2025 on March 18, 2025, NVIDIA introduced the Llama Nemotron family of open reasoning models, post-trained from Meta's Llama models to improve multistep math, coding, and complex decision-making for agentic applications.^[18] The models are offered as NIM microservices in three sizes: Nano (tuned for accuracy on PCs and edge devices), Super (highest throughput on a single GPU), and Ultra (maximum agentic accuracy on multi-GPU servers).^[18] NVIDIA stated that the post-training process raises accuracy by up to 20 percent over the base model and improves inference speed by up to 5x relative to other leading open reasoning models.^[18] Early enterprise users building on the Llama Nemotron NIM microservices included Microsoft (Azure AI Foundry and the Azure AI Agent Service), SAP (the Joule copilot and ABAP code completion), ServiceNow, Accenture (its AI Refinery platform), and Deloitte (its Zora AI agentic platform).^[18]

Blackwell-optimized open models

Several major open models released in 2025 were packaged as NIM microservices with optimization for the Blackwell architecture. Meta's Llama 4 Scout and Llama 4 Maverick, announced on April 5, 2025, were made available as NIM microservices; NVIDIA reported that on Blackwell B200 GPUs the optimized Llama 4 Scout exceeds 40,000 tokens per second and Llama 4 Maverick exceeds 30,000 tokens per second, with the B200 delivering 3.4x higher throughput and 2.6x better cost per token than the H200.^[19] On August 5, 2025, OpenAI's open-weight gpt-oss-120b and gpt-oss-20b models launched as NIM microservices, with NVIDIA reporting up to 1.5 million tokens per second for gpt-oss-120b on a single Blackwell GB200 NVL72 rack-scale system.^[20]

Nemotron 3

On December 15, 2025, NVIDIA announced the Nemotron 3 family of open models, built on a hybrid latent mixture-of-experts (MoE) architecture and offered for secure on-premises deployment as NIM microservices.^[25] The family spans three sizes: Nemotron 3 Nano (around 30 billion parameters, activating up to about 3 billion per token), Nemotron 3 Super (around 100 billion parameters), and Nemotron 3 Ultra (around 500 billion parameters).^[25] NVIDIA stated that Nemotron 3 Nano delivers about 4x higher token throughput than Nemotron 2 Nano, reduces reasoning-token generation by up to 60 percent to lower inference cost, and supports a context window of up to 1 million tokens.^[25] Nemotron 3 Nano was made available at announcement, with the Super and Ultra variants expected in the first half of 2026, distributed through Hugging Face and a range of inference providers in addition to NVIDIA NIM microservices.^[25]

GTC 2026 and the agentic stack

At GTC 2026, held in San Jose from March 16 to March 19, 2026, NVIDIA continued to frame NIM as a core part of its agentic AI stack, with NIM microservices and Nemotron open models cited as the software foundation for autonomous agent and digital-human deployments shown during the event.^[26]

Media and broadcast NIM microservices

In a 2026 update to its AI for Media (Maxine) product line, NVIDIA released a set of SMPTE ST 2110-compliant NIM microservices, namely LipSync, Active Speaker Detection, Studio Voice, and Video Super Resolution, designed to run inside live IP-based broadcast pipelines, together with a Synthetic Video Detector NIM for flagging AI-generated footage in newsroom and content-platform workflows.^[22]^[23]

What are NIM's limitations?

NIM has several meaningful constraints for organizations evaluating it.

NVIDIA hardware exclusivity. NIM containers only run on NVIDIA GPUs. Organizations with AMD or Intel GPU investments, or those looking to hedge against hardware vendor concentration, cannot use NIM on non-NVIDIA hardware. vLLM and Triton Inference Server both support AMD ROCm GPUs.

Production licensing cost. The $4,500 per GPU per year subscription cost is substantial for small organizations or projects with limited GPU counts.^[6] For a small four-GPU inference server, the annual license fee alone is $18,000, before accounting for hardware and cloud costs. Open-source alternatives like vLLM have no licensing cost.

Limited to NVIDIA's model catalog. While the catalog covers most popular open-weight models, it does not include every model available on Hugging Face. Less popular or recently released models may not yet have a NIM, requiring teams to fall back to other serving frameworks for those models.

Closed optimization profiles. The pre-compiled TensorRT-LLM profiles are compiled artifacts distributed by NVIDIA; they are not open source. Users cannot inspect or modify the compilation process, and the profiles are only available for GPU architectures and models that NVIDIA has chosen to optimize.

Dependency on NVIDIA infrastructure. Pulling container images requires authenticating to NVIDIA's NGC registry. If NGC has downtime, or if NVIDIA changes access terms, pulling updated images or deploying to new infrastructure becomes harder. Organizations with strict supply-chain requirements may find this dependency difficult to accept.

References

NVIDIA Newsroom. "NVIDIA Launches Generative AI Microservices for Developers." March 18, 2024. https://nvidianews.nvidia.com/news/generative-ai-microservices-for-developers ↩
NVIDIA Technical Blog. "NVIDIA NIM Offers Optimized Inference Microservices for Deploying AI Models at Scale." 2024. https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/ ↩
NVIDIA Documentation. "Overview of NVIDIA NIM for Large Language Models." https://docs.nvidia.com/nim/large-language-models/latest/introduction.html ↩
NVIDIA Newsroom. "NVIDIA and Global Partners Launch NIM Agent Blueprints for Enterprises to Make Their Own AI." August 27, 2024. https://nvidianews.nvidia.com/news/nvidia-and-global-partners-launch-nim-agent-blueprints-for-enterprises-to-make-their-own-ai ↩
NVIDIA Blog. "NIM Agent Blueprints Fast-Forward Next Wave of Enterprise Generative AI." 2024. https://blogs.nvidia.com/blog/nim-agent-blueprints/ ↩
NVIDIA Documentation. "NVIDIA AI Enterprise Pricing." https://docs.nvidia.com/ai-enterprise/planning-resource/licensing-guide/latest/pricing.html ↩
NVIDIA Documentation. "Fine-Tuning with LoRA." https://docs.nvidia.com/nim/large-language-models/latest/peft.html ↩
NVIDIA Technical Blog. "Seamlessly Deploying a Swarm of LoRA Adapters with NVIDIA NIM." https://developer.nvidia.com/blog/seamlessly-deploying-a-swarm-of-lora-adapters-with-nvidia-nim/ ↩
TechCrunch. "Nvidia launches NIM to make it smoother to deploy AI models into production." March 18, 2024. https://techcrunch.com/2024/03/18/nvidia-launches-a-set-of-microservices-for-optimized-inferencing/
NVIDIA Newsroom. "NVIDIA Healthcare Launches Generative AI Microservices to Advance Drug Discovery, MedTech and Digital Health." 2024. https://nvidianews.nvidia.com/news/healthcare-generative-ai-microservices ↩
AWS Blog. "Get Started with NVIDIA NIM Inference Microservices on Amazon SageMaker." https://aws.amazon.com/blogs/machine-learning/get-started-with-nvidia-nim-inference-microservices-on-amazon-sagemaker/ ↩
Hugging Face Blog. "Serverless Inference with Hugging Face and NVIDIA NIM." https://huggingface.co/blog/inference-dgx-cloud ↩
NVIDIA Developer Forums. "vLLM vs NVIDIA NIM." https://forums.developer.nvidia.com/t/vllm-vs-nvidia-nim/357164 ↩
Dell Technologies. "Tailoring LLM Inference with NVIDIA NIM using Key Features of TensorRT-LLM and vLLM." https://infohub.delltechnologies.com/p/tailoring-llm-inference-with-nvidia-nim-using-key-features-of-tensorrt-llm-and-vllm/ ↩
Red Hat Developer. "How to set up NVIDIA NIM on Red Hat OpenShift AI." May 2025. https://developers.redhat.com/articles/2025/05/08/how-set-nvidia-nim-red-hat-openshift-ai ↩
NVIDIA Newsroom. "NVIDIA Dynamo Open-Source Library Accelerates and Scales AI Reasoning Models." March 18, 2025. https://nvidianews.nvidia.com/news/nvidia-dynamo-open-source-library-accelerates-and-scales-ai-reasoning-models ↩
NVIDIA Technical Blog. "Introducing NVIDIA Dynamo, A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models." March 2025. https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/ ↩
NVIDIA Newsroom. "NVIDIA Launches Family of Open Reasoning AI Models for Developers and Enterprises to Build Agentic AI Platforms." March 18, 2025. https://nvidianews.nvidia.com/news/nvidia-launches-family-of-open-reasoning-ai-models-for-developers-and-enterprises-to-build-agentic-ai-platforms ↩
NVIDIA Technical Blog. "NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick." April 5, 2025. https://developer.nvidia.com/blog/nvidia-accelerates-inference-on-meta-llama-4-scout-and-maverick/ ↩
NVIDIA Blog. "OpenAI and NVIDIA Propel AI Innovation With New Open Models Optimized for the World's Largest AI Inference Infrastructure." August 5, 2025. https://blogs.nvidia.com/blog/openai-gpt-oss/ ↩
NVIDIA Newsroom. "NVIDIA Announces Generative AI Models and NIM Microservices for OpenUSD Language, Geometry, Physics and Materials." July 29, 2024. https://nvidianews.nvidia.com/news/nvidia-announces-generative-ai-models-and-nim-microservices-for-openusd ↩
NVIDIA Developer Forums. "AI for Media Product Update: New Real-Time AI Models and Infrastructure Microservices." June 2, 2026. https://forums.developer.nvidia.com/t/ai-for-media-product-update-new-real-time-ai-models-and-infrastructure-microservices/372024 ↩
NVIDIA Documentation. "Overview, NVIDIA NIM Maxine Synthetic Video Detector." https://docs.nvidia.com/nim/maxine/synthetic-video-detector/latest/overview.html ↩
NVIDIA Blog. "DeepSeek-R1 Now Live With NVIDIA NIM." January 30, 2025. https://blogs.nvidia.com/blog/deepseek-r1-nim-microservice/ ↩
NVIDIA Newsroom. "NVIDIA Debuts Nemotron 3 Family of Open Models." December 15, 2025. https://nvidianews.nvidia.com/news/nvidia-debuts-nemotron-3-family-of-open-models ↩
NVIDIA Blog. "NVIDIA GTC 2026: Live Updates on What's Next in AI." March 2026. https://blogs.nvidia.com/blog/gtc-2026-news/ ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

3 revisions by 1 contributors · full history

Suggest edit

What links here

Lepton AI Llama Nemotron Llama-3.1-Nemotron-70B-Instruct Mistral Medium 3.5 NVIDIA AI Enterprise NVIDIA B200 NVIDIA Cosmos 3 NVIDIA DGX Cloud NVIDIA Dynamo NVIDIA NeMo NVIDIA Picasso NVIDIA TensorRT-LLM Nemotron Nemotron-4 Nvidia Phi-3 Stability AI

What problem does NIM solve?

How is a NIM container built?

Container structure

Inference engine selection

TensorRT-LLM backend

OpenAI-compatible API layer

PEFT and LoRA adapter support

Kubernetes and orchestration

NIM Operator

Observability and monitoring

NVIDIA Dynamo and distributed serving

What are NIM Agent Blueprints?

What models does NIM support?

Large language models

Embedding models

Vision language models

Speech models

Specialized domain models

Visual generative AI models

OpenUSD and 3D world models

Media and broadcast models

How much does NIM cost?

Development access

Production access

How is NIM deployed and accessed?

How does NIM compare to vLLM, Triton, and other inference platforms?

What is NIM used for?

Adoption and ecosystem

2025-2026 developments

DeepSeek-R1 general availability

Llama Nemotron reasoning models

Blackwell-optimized open models

Nemotron 3

GTC 2026 and the agentic stack

Media and broadcast NIM microservices

What are NIM's limitations?

See also

References

Improve this article

Related Articles

NVIDIA Dynamo

NVIDIA AI Enterprise

NVIDIA Picasso

NVIDIA Triton Inference Server

CUDA

CUTLASS

What links here

Related Articles

NVIDIA Dynamo

NVIDIA AI Enterprise

NVIDIA Picasso

NVIDIA Triton Inference Server

CUDA

CUTLASS

What links here