NVIDIA NeMo

Updated Jun 23, 2026

NVIDIA NeMo is an open, end-to-end framework from NVIDIA for building, customizing, and deploying generative AI models, described by NVIDIA as "a scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI." ^[1]^[2] It started in 2019 as a toolkit for speech and conversational AI and grew into a broader platform that covers large language model training, multimodal models, automatic speech recognition (ASR), and text-to-speech (TTS). NeMo runs on NVIDIA GPUs and ties together the pieces a team needs to take a model from raw data through pretraining, fine-tuning, alignment, and serving. ^[1]^[2]^[3]

A common point of confusion is the difference between NeMo and Nemotron. NeMo is the software framework you use to train and customize models. Nemotron is a family of actual models that NVIDIA built and released, and NVIDIA uses NeMo internally to train them. One is the workshop, the other is what comes out of it. ^[3]

When was NeMo released and where did it come from?

NVIDIA introduced NeMo in 2019. The name is short for Neural Modules, and the project was first described in a paper, "NeMo: a toolkit for building AI applications using Neural Modules," posted to arXiv on September 14, 2019. ^[12] The early design centered on reusable building blocks that took typed inputs and produced typed outputs, which you could connect to assemble a model. The first release leaned into conversational AI, with collections for automatic speech recognition, natural language processing, and speech synthesis. For several years NeMo was best known in the speech community as a way to train and fine-tune ASR and TTS models. ^[3]^[4]^[12]

As the field shifted toward large transformer models trained on huge clusters, NeMo expanded to match. It picked up dedicated collections for language models and multimodal models, adopted distributed training techniques built for thousands of GPUs, and added tooling for the stages around training such as data preparation, alignment, and safety. The project now describes itself as a cloud-native generative AI framework that lets users "efficiently create, customize, and deploy new generative AI models by leveraging existing code and pretrained model checkpoints." ^[1]^[2]

How is NeMo built?

NeMo is written in Python and built on PyTorch. For large model training it sits on top of two NVIDIA libraries that do the heavy lifting. The first is Megatron Core, an open-source library of GPU-optimized building blocks and system-level techniques for training transformer models at scale. The second is Transformer Engine, which handles mixed precision and FP8 training on NVIDIA Hopper and newer GPUs. NeMo wraps both behind higher-level APIs so teams can configure a training run without reimplementing the low-level kernels themselves. ^[1]^[2]^[5]

A model with hundreds of billions of parameters does not fit on one GPU, so NeMo supports several forms of model parallelism that can be combined. Tensor parallelism splits the math inside each layer across GPUs. Pipeline parallelism puts different layers on different GPUs and streams batches through them. Data parallelism replicates the model and feeds each copy a different slice of the batch. Sequence and context parallelism split work along the length of the input, which matters for long contexts, and expert parallelism distributes the experts in a mixture-of-experts model. NeMo also supports fully sharded data parallelism, which shards the parameters and optimizer state to save memory. The point of combining these is to spread a single model and its data across many GPUs and many nodes while keeping the hardware busy. It all runs on NVIDIA's CUDA platform and the GPU libraries built on top of it. ^[2]
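
How these sizes combine can be made concrete with a little arithmetic. The sketch below assumes a simplified layout in which the total GPU count is the product of the tensor, pipeline, context, and data parallel sizes, so the data parallel size is whatever remains once the model parallel dimensions are chosen. The function name and the example numbers are illustrative and are not taken from NeMo; expert parallelism and fully sharded data parallelism are left out for simplicity.

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    """Return the number of model replicas left after model parallelism.

    Assumes world_size = tp * pp * cp * dp, a simplified layout that
    ignores expert parallelism.
    """
    if min(world_size, tp, pp, cp) < 1:
        raise ValueError("all sizes must be positive")
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel


# Hypothetical example: 64 GPUs, 8-way tensor and 4-way pipeline parallelism.
dp = data_parallel_size(world_size=64, tp=8, pp=4)
print(dp)  # 2 replicas, each spread over 32 GPUs
```

In this example each replica of the model occupies 32 GPUs, and the global batch is split between the two replicas.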

Mixed precision and FP8 are part of why the framework targets recent hardware. Transformer Engine lets NeMo train in FP8 on Hopper and newer GPUs, which cuts memory use and speeds up training compared with higher-precision formats, as long as the GPU supports it. Older or non-NVIDIA hardware does not get that path. ^[1]^[2]^[5]
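
The memory side of this is simple arithmetic: an FP32 value takes 4 bytes, a BF16 or FP16 value takes 2, and an FP8 value takes 1. The sketch below estimates storage for the weights alone at each precision. The helper name and the 70 billion parameter figure are illustrative assumptions; real training memory also includes gradients, optimizer state, and activations, and FP8 training generally keeps some tensors in higher precision.

```python
BYTES_PER_VALUE = {"fp32": 4, "bf16": 2, "fp16": 2, "fp8": 1}


def weight_gib(num_params: int, dtype: str) -> float:
    """Storage for the weights alone, in GiB, at the given precision."""
    return num_params * BYTES_PER_VALUE[dtype] / 2**30


params = 70_000_000_000  # hypothetical 70B-parameter model
for dtype in ("fp32", "bf16", "fp8"):
    print(f"{dtype}: {weight_gib(params, dtype):.1f} GiB")
```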

Because NeMo is a deep learning framework rather than a single model, it ships with pretrained checkpoints and recipes you can start from instead of training a neural network from scratch. The recipes capture the parallelism layout, the data configuration, and the hyperparameters for a given model size, so a team can reproduce a known-good setup rather than tuning everything by hand. NeMo also reads and writes formats that line up with the wider community, so checkpoints can move between NeMo and projects like Hugging Face. ^[1]

What are the main components of NeMo?

Over time NeMo has grown into a set of components that each handle one stage of the model lifecycle. Several of them are available both as open source libraries and as managed NeMo microservices, which are containerized versions meant to run on Kubernetes in production. The table below lists the main pieces. ^[1]^[6]^[7]

Component	Role	Notes
NeMo Curator	Data curation	GPU accelerated download, cleaning, filtering, and deduplication of text, image, and video data for pretraining and fine-tuning. Uses RAPIDS and Dask to scale across nodes. In one NVIDIA benchmark it deduplicated 1.96 trillion tokens in about 0.5 hours on 32 NVIDIA H100 GPUs, and NVIDIA reports GPU deduplication running roughly 20x faster and 5x cheaper than a CPU pipeline. ^[8]^[13]
NeMo Customizer	Fine-tuning service	Microservice for fine-tuning and adapting models, with methods such as LoRA, P-tuning, and supervised fine-tuning. ^[6]^[7]
NeMo Aligner and NeMo-RL	Post-training and alignment	Toolkits for model alignment, including reinforcement learning from human feedback, DPO, and GRPO. NeMo-RL is the newer library and scales from a single GPU to thousand-GPU runs using backends like Megatron Core and vLLM with Ray. ^[9]
NeMo Guardrails	Safety and control	Open source toolkit for adding programmable rails to LLM applications, covering input, output, dialog, retrieval, and execution rails. ^[10]
NeMo Retriever	Retrieval and RAG	Microservices for embedding, reranking, and multimodal extraction that power retrieval augmented generation over enterprise data. NVIDIA reports up to 15x faster multimodal data extraction and 3x better embedding throughput versus open alternatives, with multilingual support across 26 languages. ^[11]^[14]
NeMo Evaluator	Evaluation	Microservice for benchmarking models on academic and custom datasets, including LLM-as-a-judge methods. ^[6]^[7]
NeMo Data Store	Artifact storage	Stores and manages datasets, models, and related artifacts across the workflow. ^[6]^[7]
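
To make the deduplication step in the Curator row concrete, here is a minimal CPU-only sketch of exact deduplication by content hash. It is not Curator's API or its GPU implementation; the function name and the normalization rule (lowercasing and collapsing whitespace) are assumptions chosen for illustration.

```python
import hashlib


def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing by content hash."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        # Assumed normalization: lowercase and collapse runs of whitespace.
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


corpus = ["Hello  world", "hello world", "Another document"]
print(exact_dedup(corpus))  # ['Hello  world', 'Another document']
```

Exact hashing only catches documents that are identical after normalization. Catching near-duplicates requires fuzzy techniques, which are considerably more involved than this sketch.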

The components are designed to chain together into one workflow. A team can curate a dataset with Curator, train or fine-tune with the core framework and Customizer, align the result with NeMo-RL or Aligner, attach Guardrails, connect it to private data through Retriever, and check quality with Evaluator. The microservice versions share an API surface and run as containers, which lets them be wired into a continuous loop where a deployed model is evaluated, then fine-tuned again on fresh data, then redeployed. NVIDIA calls this loop a data flywheel, and it is how the company pitches the microservices for agentic applications that need to keep improving after launch. ^[6]^[7]^[15]

What does NeMo Guardrails do?

Guardrails is worth a closer look because safety sits at a different layer from training. NVIDIA open-sourced NeMo Guardrails on April 25, 2023, as a toolkit for keeping generative AI chatbots accurate and on-topic. ^[16] The companion research paper describes the approach this way: "using a runtime inspired from dialogue management, NeMo Guardrails allows developers to add programmable rails to LLM applications, these are user-defined, independent of the underlying LLM, and interpretable." ^[17]

Rather than baking behavior into the weights, it inserts programmable rails between the application and the model. Input rails can reject or rewrite what a user sends, for example masking sensitive data. Dialog rails shape how the model is prompted and can hold it to a defined conversation flow. Retrieval rails act on the chunks pulled into a retrieval augmented generation pipeline. Execution rails govern the tools the model is allowed to call. Output rails check or alter the response before it reaches the user, covering things like content moderation, fact-checking, and jailbreak detection. Because it is provider-agnostic, Guardrails can sit in front of many different models rather than only NVIDIA's own, and it integrates with toolkits such as LangChain. ^[10]^[16]^[17]
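
The layering can be illustrated without the toolkit itself. The sketch below wraps a stand-in model function with one input rail and one output rail. Every name in it (`mask_emails`, `block_banned_terms`, `guarded_call`, `fake_llm`) is hypothetical and is not part of the NeMo Guardrails API, which defines rails through its own configuration and runtime.

```python
import re
from typing import Callable

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def mask_emails(text: str) -> str:
    """Input rail: mask email addresses before the model sees them."""
    return EMAIL.sub("[EMAIL]", text)


def block_banned_terms(text: str, banned: set[str]) -> str:
    """Output rail: replace the response if it contains a banned term."""
    lowered = text.lower()
    if any(term in lowered for term in banned):
        return "Sorry, I can't help with that."
    return text


def guarded_call(llm: Callable[[str], str], user_message: str) -> str:
    """Run the input rail, call the model, then run the output rail."""
    prompt = mask_emails(user_message)
    response = llm(prompt)
    return block_banned_terms(response, banned={"password"})


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: echoes the prompt it received."""
    return f"You said: {prompt}"


print(guarded_call(fake_llm, "Contact me at jane@example.com"))
# You said: Contact me at [EMAIL]
```

Because the rails only see text going in and coming out, the same wrapper works with any function that maps a prompt to a response, which is the sense in which this kind of design is independent of the underlying model.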

What are the Parakeet and Canary speech models?

The speech roots are still part of NeMo. It keeps dedicated collections for automatic speech recognition and text to speech, with model families and pretrained checkpoints for transcription, speech synthesis, and related tasks such as speaker identification and spoken language understanding. NeMo has been a common starting point for state-of-the-art ASR work, since the collections include strong baseline architectures and recipes that researchers can fine-tune on their own audio rather than training from scratch. ^[1]^[4]

NVIDIA ships two flagship ASR families built with NeMo. Parakeet is a line of high-throughput English and multilingual transcription models, and Canary is a multilingual recognition and translation line. Parakeet TDT 0.6B v2, a 600 million parameter FastConformer-TDT model released in 2025, posted a 6.05% average word error rate on the Hugging Face Open ASR Leaderboard, which placed it at the top of that public leaderboard at the time. Its RTFx of 3,386 means it transcribes audio roughly 3,386 times faster than real time at batch size 128. ^[18] In the Canary line, Canary-Qwen-2.5B reported a 5.63% word error rate on the English Open ASR Leaderboard in 2025, and in August 2025 NVIDIA released Parakeet and Canary updates covering 25 European languages. ^[19] These results are a reminder that NeMo was a serious ASR and TTS toolkit before the language model era and did not drop that focus when it broadened. The same modular design that shaped the original Neural Modules still shows up here, where audio, language, and acoustic pieces can be assembled into a pipeline. ^[1]^[4]
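
Both metrics quoted above have simple definitions. Word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. RTFx is the duration of the audio divided by the time taken to process it. The sketch below is a plain Python illustration with made-up inputs; it is not the leaderboard's evaluation code, which may also normalize text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
# One hour of audio processed in one second.
print(rtfx(audio_seconds=3600.0, processing_seconds=1.0))  # 3600.0
```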

How does NeMo fit with Nemotron, NIM, and NVIDIA AI Enterprise?

NeMo sits inside a larger NVIDIA software stack, and the parts are easy to mix up, so it helps to separate them. NeMo is the framework for building and customizing models. Nemotron is a family of NVIDIA models that are built with that framework. NVIDIA NIM, short for NVIDIA Inference Microservices, packages models as optimized containers for serving, and a model customized in NeMo can be deployed through NIM. NeMo models can also be exported to TensorRT-LLM for optimized inference. ^[1]^[3]

The whole collection is offered commercially through NVIDIA AI Enterprise, which adds support, security updates, and a supported path to production for the NeMo framework and the NeMo microservices. The NeMo microservices reached general availability on April 23, 2025, with launch partners including SAP and ServiceNow; NVIDIA cited Amdocs reporting a 64% improvement in average handling time on a telecom support agent built with the microservices. ^[15]^[20] So a developer can use the open source NeMo libraries for free, or adopt the supported microservices and NIM containers under NVIDIA AI Enterprise when running in production. ^[1]^[6]^[7]

Who uses NeMo?

NeMo is aimed at organizations that train or heavily customize their own models on NVIDIA hardware rather than only calling a hosted API. That includes research labs publishing new models, enterprises building domain-specific assistants and agents, and teams that need to fine-tune a foundation model on private data and keep it inside their own environment for privacy or compliance reasons. NVIDIA itself uses NeMo to produce the Nemotron models, which makes the framework a working example of its own use rather than a tool NVIDIA only ships to others. The audience splits roughly in two. Researchers and individual developers tend to reach for the open source libraries directly. Larger companies that want a supported path lean on the microservices and the NVIDIA AI Enterprise packaging so they get vendor support and security updates instead of maintaining the stack on their own. ^[1]^[6]^[15]

Limitations

NeMo is built for NVIDIA GPUs and the CUDA platform, so it is not a portable, hardware-neutral framework. Teams without access to capable NVIDIA hardware, especially the Hopper and newer GPUs needed for FP8 training, get less out of it. The framework is also large and aimed at people doing real training and customization, which makes it heavier to learn than a simple inference library or an API client. The full value shows up most for groups that actually pretrain, fine-tune, or align models at scale, while a team that only needs to call an existing model through an endpoint will find most of NeMo unnecessary. The split between the open source libraries and the supported microservices under NVIDIA AI Enterprise can also be confusing, since the same name can refer to either a free library or a commercial service depending on context. ^[1]^[6]

References

[1] NVIDIA. "NVIDIA/NeMo." GitHub. https://github.com/NVIDIA/NeMo
[2] NVIDIA. "NeMo Framework Overview." NVIDIA NeMo Framework User Guide. https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
[3] NVIDIA. "NeMo Framework." developer.nvidia.com. https://developer.nvidia.com/nemo-framework
[4] NVIDIA. "NeMo Toolkit." NVIDIA NeMo Framework User Guide. https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/index.html
[5] NVIDIA. "Megatron-Core." NVIDIA Docs. https://docs.nvidia.com/megatron-core/index.html
[6] NVIDIA. "NeMo Microservices." NVIDIA Docs. https://docs.nvidia.com/nemo/microservices/latest/about/index.html
[7] NVIDIA. "NVIDIA NeMo Microservices." developer.nvidia.com. https://developer.nvidia.com/nemo-microservices
[8] NVIDIA. "NVIDIA-NeMo/Curator." GitHub. https://github.com/NVIDIA-NeMo/Curator
[9] NVIDIA. "NVIDIA-NeMo/RL." GitHub. https://github.com/NVIDIA-NeMo/RL
[10] NVIDIA. "NVIDIA/NeMo-Guardrails." GitHub. https://github.com/NVIDIA/NeMo-Guardrails
[11] NVIDIA. "NVIDIA NeMo Retriever." developer.nvidia.com. https://developer.nvidia.com/nemo-retriever
[12] Kuchaiev, Oleksii, et al. "NeMo: a toolkit for building AI applications using Neural Modules." arXiv:1909.09577, September 14, 2019. https://arxiv.org/abs/1909.09577
[13] NVIDIA. "Curating Trillion-Token Datasets: Introducing NVIDIA NeMo Data Curator." NVIDIA Technical Blog. https://developer.nvidia.com/blog/curating-trillion-token-datasets-introducing-nemo-data-curator/
[14] NVIDIA. "NVIDIA NeMo Retriever Delivers Accurate Multimodal PDF Data Extraction 15x Faster." NVIDIA Technical Blog. https://developer.nvidia.com/blog/nvidia-nemo-retriever-delivers-accurate-multimodal-pdf-data-extraction-15x-faster/
[15] NVIDIA. "Enhance Your AI Agent with Data Flywheels Using NVIDIA NeMo Microservices." NVIDIA Technical Blog. https://developer.nvidia.com/blog/enhance-your-ai-agent-with-data-flywheels-using-nvidia-nemo-microservices/
[16] NVIDIA. "Right on Track: NVIDIA Open-Source Software Helps Developers Add Guardrails to AI Chatbots." NVIDIA Blog, April 25, 2023. https://blogs.nvidia.com/blog/ai-chatbot-guardrails-nemo/
[17] Rebedea, Traian, et al. "NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails." arXiv:2310.10501, October 16, 2023. https://arxiv.org/abs/2310.10501
[18] NVIDIA. "nvidia/parakeet-tdt-0.6b-v2." Hugging Face model card. https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
[19] NVIDIA. "NVIDIA Speech AI Models Deliver Industry-Leading Accuracy and Performance." NVIDIA Technical Blog. https://developer.nvidia.com/blog/nvidia-speech-ai-models-deliver-industry-leading-accuracy-and-performance/
[20] CIO. "Nvidia says NeMo microservices now generally available." April 23, 2025. https://www.cio.com/article/3968114/nvidia-says-nemo-microservices-now-generally-available.html
