# NVLM

> Source: https://aiwiki.ai/wiki/nvlm
> Updated: 2026-07-17
> Categories: Large Language Models, Multimodal AI, NVIDIA
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**NVLM** (short for **NVIDIA Vision Language Model**), released as **NVLM 1.0**, is a family of open multimodal large language models developed by [Nvidia](/wiki/nvidia). Introduced in September 2024, NVLM 1.0 was presented as a set of "frontier-class" vision-language models that the authors reported as competitive with the leading proprietary systems of the period, such as [GPT-4o](/wiki/gpt_4o), as well as with strong open-access models such as Llama 3-V and [InternVL](/wiki/internvl).[1] The work is notable for two contributions beyond raw benchmark scores: a systematic comparison of decoder-only and cross-attention multimodal architectures that yielded three model variants, and the finding that NVLM models maintained, and in several cases improved, their text-only performance after multimodal training, a regime in which many comparable models regress.[1]

The models, training methodology, and datasets are described in the paper "NVLM: Open Frontier-Class Multimodal LLMs" (arXiv:2409.11402), authored by Wenliang Dai, Nayeon Lee, Boxin Wang, Wei Ping, and colleagues at NVIDIA.[1] NVIDIA released the flagship weights publicly on [Hugging Face](/wiki/hugging_face) and stated that the training code would be open-sourced in [Megatron-LM](/wiki/megatron_lm) (Megatron-Core).[5]

## Release

The NVLM paper was first submitted to arXiv on 17 September 2024 (version 1), with a revised version 2 following on 22 October 2024.[1] Alongside the paper, NVIDIA published a project page through its ADLR research group and released the decoder-only flagship, NVLM-D-72B, on Hugging Face.[3][5] The model weights were distributed under a CC-BY-NC-4.0 (Attribution-NonCommercial) license, with the underlying vision encoder carried under its own MIT license.[5] The release positioned NVLM within a broader pattern of NVIDIA contributing open model families, alongside efforts such as [Nemotron](/wiki/nemotron_3) and [Llama Nemotron](/wiki/llama_nemotron) on the language side.

A distinguishing feature of the release was its emphasis on reproducibility and openness of process. In addition to weights, the authors documented the composition of their multimodal pretraining and supervised fine-tuning datasets in detail, and stated an intent to open-source the training code in Megatron-Core so that the wider community could reproduce production-grade multimodal training.[1]

## Architectural variants

A central thesis of the NVLM work is a head-to-head comparison between the two dominant families of multimodal LLM architecture: decoder-only designs (exemplified by LLaVA), in which image features are projected into the same token space the language model consumes, and cross-attention designs (exemplified by Flamingo), in which image features are attended to through dedicated cross-attention layers rather than entering the main token sequence. Rather than declaring one approach superior, NVLM 1.0 ships three variants that occupy different points on the efficiency-versus-reasoning trade-off.[1]

| Variant | Architecture | How images are processed | Added trainable modules |
| --- | --- | --- | --- |
| NVLM-D | Decoder-only | Image tokens are projected into the LLM and processed as ordinary input tokens | 2-layer MLP projector |
| NVLM-X | Cross-attention | Image features are read by the LLM through gated cross-attention layers, keeping them out of the self-attention sequence | 1-layer MLP plus gated cross-attention layers |
| NVLM-H | Hybrid | A low-resolution thumbnail and text are processed jointly in self-attention, while finer high-resolution tile details are routed through cross-attention | 2-layer MLP plus gated cross-attention layers |

NVLM-D treats image tokens the same way it treats text token embeddings, which simplifies the design and unifies how modalities are handled; the paper reports it as the strongest of the three on OCR-related tasks.[1] NVLM-X uses gated cross-attention to process high-resolution images more efficiently, which gives faster training and inference; the paper reports it as the strongest among cross-attention-based models.[1] NVLM-H is presented as the novel contribution that combines the two: by handling the thumbnail and text in the language model's self-attention layers while delegating fine detail to cross-attention, it aims to retain multimodal reasoning quality while improving efficiency, and it posts the family's best scores on reasoning-oriented benchmarks such as MMMU and MathVista.[1]
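
The routing differences in the table can be made concrete by counting where tokens end up. The sketch below is illustrative only. It assumes 256 tokens per tile and up to six regular tiles plus one thumbnail (the figures this article gives for the dynamic high-resolution scheme), assumes NVLM-X sends the thumbnail through cross-attention along with the other tiles, and ignores any tag or separator tokens. `TokenRouting` and `route_tokens` are hypothetical names, not part of any NVLM code.

```python
from dataclasses import dataclass

TOKENS_PER_TILE = 256   # per-tile token count after pixel shuffle, as given in this article
MAX_REGULAR_TILES = 6   # regular high-resolution tiles; one global thumbnail comes on top


@dataclass(frozen=True)
class TokenRouting:
    self_attention: int   # tokens that sit in the LLM's self-attention sequence
    cross_attention: int  # image tokens read only through cross-attention layers


def route_tokens(variant: str, text_tokens: int,
                 regular_tiles: int = MAX_REGULAR_TILES) -> TokenRouting:
    """Count where text and image tokens end up for each variant (illustrative)."""
    if not 0 <= regular_tiles <= MAX_REGULAR_TILES:
        raise ValueError("regular_tiles must be between 0 and 6")
    tile_tokens = regular_tiles * TOKENS_PER_TILE
    thumbnail_tokens = TOKENS_PER_TILE
    if variant == "NVLM-D":   # decoder-only: every image token joins the sequence
        return TokenRouting(text_tokens + thumbnail_tokens + tile_tokens, 0)
    if variant == "NVLM-X":   # cross-attention: only text stays in the sequence (assumption)
        return TokenRouting(text_tokens, thumbnail_tokens + tile_tokens)
    if variant == "NVLM-H":   # hybrid: thumbnail joins the text, tiles go to cross-attention
        return TokenRouting(text_tokens + thumbnail_tokens, tile_tokens)
    raise ValueError(f"unknown variant: {variant}")


for name in ("NVLM-D", "NVLM-X", "NVLM-H"):
    print(name, route_tokens(name, text_tokens=100))
```

Under these assumptions, a prompt of 100 text tokens with a full set of tiles places 1,892 tokens in self-attention for the decoder-only layout, 356 for the hybrid, and 100 for the cross-attention layout, which is the efficiency argument for routing tiles through cross-attention.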

## Backbone components

All three variants share a common vision pathway and language backbone.

- **Text backbone:** Qwen2-72B-Instruct, the 72-billion-parameter instruction-tuned model from the [Qwen](/wiki/qwen) series. (An earlier preprint configuration also used Nous-Hermes-2-Yi-34B, among other models, as the backbone, but the flagship 72B models build on Qwen2.)[1]
- **Vision encoder:** InternViT-6B, a roughly 6-billion-parameter vision transformer that operates on 448x448 image inputs. The released NVLM-D-72B uses the InternViT-6B-448px-V1-2 checkpoint, while the paper's main experiments reference the V1-5 variant.[5]

The released NVLM-D-72B totals approximately 79 billion parameters once the 72B language model, the 6B vision encoder, and the projection modules are combined, and supports a context length of up to 128K tokens.[5]

To handle high-resolution images, NVLM uses a tile-based dynamic high-resolution (DHR) scheme. An input image is split into tiles, with up to six regular tiles plus one global thumbnail tile. Each 448x448 tile passes through InternViT-6B and yields 1,024 output tokens, which are compressed to 256 tokens through a pixel-shuffle operation before reaching the language model. A key methodological contribution is a **1-D tile-tagging** design: textual tags are inserted to mark tile boundaries and positions in the flattened sequence, which the authors report significantly improves performance on multimodal reasoning and OCR tasks compared with feeding tiles without positional tags.[1]
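
The token reduction and the tagging step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the 1,024 encoder outputs form a 32 x 32 grid, that pixel shuffle folds each 2 x 2 block of neighbouring tokens into one token with four times the channels (which gives the 1,024 to 256 reduction), and that tags precede each tile with the thumbnail first. The tag strings, the `<img_token>` placeholder, and both function names are hypothetical.

```python
from typing import List

Grid = List[List[List[float]]]  # indexed as [row][col][channel]


def pixel_shuffle_downsample(grid: Grid, factor: int = 2) -> Grid:
    """Fold each factor x factor block of tokens into one token with more channels."""
    height, width = len(grid), len(grid[0])
    if height % factor or width % factor:
        raise ValueError("grid sides must be divisible by factor")
    out: Grid = []
    for r in range(0, height, factor):
        out_row = []
        for c in range(0, width, factor):
            merged: List[float] = []
            for dr in range(factor):
                for dc in range(factor):
                    # Concatenate the channels of the neighbouring tokens.
                    merged.extend(grid[r + dr][c + dc])
            out_row.append(merged)
        out.append(out_row)
    return out


def tagged_sequence(regular_tiles: int, tokens_per_tile: int = 256) -> List[str]:
    """Flatten tiles into one sequence with a text tag in front of each tile."""
    # Tag names and thumbnail-first ordering are assumptions made for illustration.
    tags = ["<tile_global_thumbnail>"]
    tags += [f"<tile_{i}>" for i in range(1, regular_tiles + 1)]
    sequence: List[str] = []
    for tag in tags:
        sequence.append(tag)
        sequence.extend(["<img_token>"] * tokens_per_tile)
    return sequence


# A 32 x 32 grid of 1-channel tokens stands in for one tile's 1,024 encoder outputs.
tile = [[[float(r * 32 + c)] for c in range(32)] for r in range(32)]
small = pixel_shuffle_downsample(tile)
print(len(small) * len(small[0]), "tokens with", len(small[0][0]), "channels each")

seq = tagged_sequence(regular_tiles=6)
print(len(seq), "sequence entries for 6 tiles plus thumbnail")
```

With six regular tiles and a thumbnail, the sketch yields 7 x 256 = 1,792 image tokens plus seven tags. The tags give the language model an explicit marker of which tile each run of image tokens belongs to, since the flattened sequence otherwise carries no tile boundary information.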

## Training

NVLM follows a two-stage recipe of multimodal pretraining followed by supervised fine-tuning (SFT). The authors place unusual emphasis on data curation, reporting that dataset quality and task diversity matter more than sheer scale, even during pretraining, and that this held across all three architectures.[1]

The most consequential training choice for the headline text-only result was the deliberate integration of a high-quality text-only dataset into multimodal training, combined with a substantial volume of multimodal mathematics and reasoning data. The intent was to develop "production-grade" multimodality: models that excel at vision-language tasks without paying the usual tax of degraded text-only ability. The inclusion of math and reasoning data is credited with enhanced mathematics and coding capability across both modalities.[1]

## Benchmark results

On vision-language benchmarks, NVLM 1.0 was reported as competitive with frontier proprietary and open models. The decoder-only NVLM-D-72B in particular was highlighted as achieving leading scores on OCRBench and VQAv2 at the time, and as matching or surpassing GPT-4o on MathVista, OCRBench, and ChartQA and coming within 0.2 points of it on DocVQA, while trailing on MMMU.[1] The following table compares NVLM-D-72B against representative competitors using figures reported in the paper (higher is better; OCRBench is scored out of 1000; `--` marks a figure not listed here).[1]

| Benchmark | NVLM-D 72B | GPT-4o | Llama 3-V 405B | InternVL-2-Pro |
| --- | --- | --- | --- | --- |
| MMMU (val) | 59.7 | 69.1 | 64.5 | 58.9 |
| MathVista | 65.2 | 63.8 | -- | 66.3 |
| OCRBench | 853 | 736 | -- | 837 |
| ChartQA | 86.0 | -- | 84.8 | -- |
| DocVQA | 92.6 | 92.8 | 92.6 | 95.1 |
| TextVQA | 82.1 | -- | 83.4 | 87.3 |
| AI2D (no mask) | 94.2 | -- | 80.2 | -- |
| VQAv2 | 85.4 | -- | -- | -- |
| RealWorldQA | 69.7 | -- | -- | -- |

Within the NVLM family, the three variants trade places depending on the task. NVLM-H leads on reasoning benchmarks, while NVLM-D leads on OCR-heavy and document tasks, consistent with the architectural design goals.[1]

| Benchmark | NVLM-D 72B | NVLM-X 72B | NVLM-H 72B |
| --- | --- | --- | --- |
| MMMU (val) | 59.7 | 57.4 | 60.2 |
| MathVista | 65.2 | 64.6 | 66.6 |
| OCRBench | 853 | 828 | 831 |
| ChartQA | 86.0 | 82.9 | 83.3 |
| AI2D (test) | 85.2 | 84.2 | 83.8 |
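
The split between reasoning and OCR-heavy tasks can be checked directly from the table. The short script below copies the scores above and picks the top variant per benchmark; the dictionary layout and the `best_variant` helper are illustrative names only.

```python
# Scores copied from the table above (all 72B variants).
scores = {
    "MMMU (val)": {"NVLM-D": 59.7, "NVLM-X": 57.4, "NVLM-H": 60.2},
    "MathVista": {"NVLM-D": 65.2, "NVLM-X": 64.6, "NVLM-H": 66.6},
    "OCRBench": {"NVLM-D": 853, "NVLM-X": 828, "NVLM-H": 831},
    "ChartQA": {"NVLM-D": 86.0, "NVLM-X": 82.9, "NVLM-H": 83.3},
    "AI2D (test)": {"NVLM-D": 85.2, "NVLM-X": 84.2, "NVLM-H": 83.8},
}


def best_variant(row: dict) -> str:
    """Return the variant with the highest score in one table row."""
    return max(row, key=row.get)


leaders = {bench: best_variant(row) for bench, row in scores.items()}
for bench, leader in leaders.items():
    print(f"{bench}: {leader}")
```

NVLM-H comes out on top for MMMU and MathVista, and NVLM-D for OCRBench, ChartQA, and AI2D; NVLM-X does not lead any row in this table.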

## Text-only performance after multimodal training

The result most frequently cited from the NVLM work concerns text-only capability. Many multimodal models degrade on pure language benchmarks once vision data is introduced during training; the paper specifically notes that the leading InternVL 2 model shows significant degradation on text-only benchmarks including MMLU, GSM8K, MATH, and HumanEval.[1] NVLM was designed to avoid this regression, and the authors report that its 72B model instead improved on text-only mathematics and coding benchmarks after multimodal training, with average accuracy rising by 4.3 points relative to the Qwen2-72B-Instruct backbone.[1]

The official model card publishes a before-and-after comparison for NVLM-D-72B against its backbone, with the largest gains on mathematics, illustrating the same effect:[5]

| Task | Qwen2-72B-Instruct (backbone) | NVLM-D 72B (after MM training) | Change |
| --- | --- | --- | --- |
| MMLU | 82.3 | 81.7 | -0.6 |
| GSM8K | 91.1 | 93.2 | +2.1 |
| MATH | 59.7 | 73.1 | +13.4 |
| HumanEval | 86.0 | 89.0 | +3.0 |
| Average | 79.8 | 84.3 | +4.5 |

Exact figures vary slightly between the model card and the paper's internal tables owing to differing evaluation setups, but both report a net average gain of roughly 4 to 5 points and a large jump on MATH.[1][5] The authors attribute this outcome to the deliberate blending of high-quality text-only data and multimodal math and reasoning data into the training mixture, rather than to scale alone.[1]
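
The averages in the table follow from the four task scores. The check below recomputes them with Python's `decimal` module so that half-way values round up as in the table; the variable and helper names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-task scores copied from the table above, kept as strings for exact decimals.
backbone = {"MMLU": "82.3", "GSM8K": "91.1", "MATH": "59.7", "HumanEval": "86.0"}
nvlm_d = {"MMLU": "81.7", "GSM8K": "93.2", "MATH": "73.1", "HumanEval": "89.0"}


def mean(scores: dict) -> Decimal:
    """Exact arithmetic mean of the scores."""
    return sum(Decimal(v) for v in scores.values()) / len(scores)


def one_decimal(value: Decimal) -> Decimal:
    """Round to one decimal place, with halves rounded up."""
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


deltas = {task: Decimal(nvlm_d[task]) - Decimal(backbone[task]) for task in backbone}
avg_before = mean(backbone)
avg_after = mean(nvlm_d)
avg_gain = avg_after - avg_before

# Mean change over the three tasks other than MATH, to show how much MATH contributes.
gain_without_math = sum(d for task, d in deltas.items() if task != "MATH") / 3

print(deltas)
print(one_decimal(avg_before), one_decimal(avg_after), one_decimal(avg_gain))
print(one_decimal(gain_without_math))
```

The unrounded averages are 79.775 and 84.25, a gain of 4.475 that rounds to the 4.5 shown in the table. Leaving MATH out, the mean change across MMLU, GSM8K, and HumanEval is 1.5 points, so most of the average gain comes from the MATH result.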

## Open release and significance

NVLM was significant less for any single benchmark record than for the combination of openness and methodology it offered. By releasing competitive frontier-class multimodal weights and documenting the data composition, NVIDIA provided the research community with a reproducible reference point at a time when the strongest multimodal systems were largely closed. The work also clarified the practical trade-offs between decoder-only and cross-attention multimodal architectures and offered the hybrid NVLM-H as a concrete design that sought the best of both, alongside the 1-D tile-tagging technique for high-resolution inputs.

The demonstration that multimodal training need not degrade, and can in fact enhance, text-only reasoning informed subsequent thinking about how to build vision-language models without sacrificing language ability. NVLM sits within NVIDIA's wider portfolio of open and research models; it is distinct from **Eagle**, a separate NVIDIA (NVlabs) vision-language line that explores a mixture-of-vision-encoders design, and from the company's language-focused Nemotron and Llama Nemotron releases.[8]

## References

1. Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping. "NVLM: Open Frontier-Class Multimodal LLMs." arXiv:2409.11402, September 2024. https://arxiv.org/abs/2409.11402
2. NVLM 1.0 full text (HTML). arXiv. https://arxiv.org/html/2409.11402v1
3. "NVLM: Open Frontier-Class Multimodal LLMs." NVIDIA ADLR project page. https://research.nvidia.com/labs/adlr/NVLM-1/
4. "Introducing NVLM 1.0." NVLM project site. https://nvlm-project.github.io/
5. NVLM-D-72B model card. Hugging Face / NVIDIA. https://huggingface.co/nvidia/NVLM-D-72B
6. "NVLM 1.0: NVIDIA's Open-Source Multimodal AI Model." Encord. https://encord.com/blog/nvlm-nvidia-open-source-multimodal-ai-model/
7. "Introducing NVLM 1.0: NVIDIA's Approach to Multimodal LLMs." Analytics Vidhya, October 2024. https://www.analyticsvidhya.com/blog/2024/10/nvidia-nvlm-1/
8. Min Shi, et al. "Eagle: Exploring the Design Space for Multimodal LLMs with Mixture of Encoders." arXiv:2408.15998. https://arxiv.org/abs/2408.15998