NVLM
Last reviewed
Jun 3, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,779 words
NVLM (short for NVIDIA Vision Language Model), released as NVLM 1.0, is a family of open multimodal large language models developed by NVIDIA. Introduced in September 2024, NVLM 1.0 was presented as a set of "frontier-class" vision-language models that the authors reported as competitive with the leading proprietary systems of the period, such as GPT-4o, and with strong open-access models such as Llama 3-V and InternVL. The work is notable for two contributions beyond raw benchmark scores. The first is a systematic comparison of decoder-only and cross-attention multimodal architectures, which yielded three model variants. The second is the finding that NVLM models maintained, and in several cases improved, their text-only performance after multimodal training, whereas many comparable models regress.
The models, training methodology, and datasets are described in the paper "NVLM: Open Frontier-Class Multimodal LLMs" (arXiv:2409.11402), authored by Wenliang Dai, Nayeon Lee, Boxin Wang, Wei Ping, and colleagues at NVIDIA. NVIDIA released the flagship weights publicly on Hugging Face and stated that the training code would be open-sourced in Megatron-LM (Megatron-Core).
The NVLM paper was first submitted to arXiv on 17 September 2024 (version 1), with a revised version 2 following on 22 October 2024. Alongside the paper, NVIDIA published a project page through its ADLR research group and released the decoder-only flagship, NVLM-D-72B, on Hugging Face. The model weights were distributed under a CC-BY-NC-4.0 (Attribution-NonCommercial) license, with the underlying vision encoder carried under its own MIT license. The release positioned NVLM within a broader pattern of NVIDIA contributing open model families, alongside efforts such as Nemotron and Llama Nemotron on the language side.
A distinguishing feature of the release was its emphasis on reproducibility and openness of process. In addition to weights, the authors documented the composition of their multimodal pretraining and supervised fine-tuning datasets in detail, and stated an intent to open-source the training code in Megatron-Core so that the wider community could reproduce production-grade multimodal training.
A central thesis of the NVLM work is a head-to-head comparison between the two dominant families of multimodal LLM architecture: decoder-only designs (exemplified by LLaVA), in which image features are projected into the same token space the language model consumes, and cross-attention designs (exemplified by Flamingo), in which image features are attended to through dedicated cross-attention layers rather than entering the main token sequence. Rather than declaring one approach superior, NVLM 1.0 ships three variants that occupy different points on the efficiency-versus-reasoning trade-off.
| Variant | Architecture | How images are processed | Added trainable modules |
|---|---|---|---|
| NVLM-D | Decoder-only | Image tokens are projected into the LLM and processed as ordinary input tokens | 2-layer MLP projector |
| NVLM-X | Cross-attention | Image features are read by the LLM through gated cross-attention layers, keeping them out of the self-attention sequence | 1-layer MLP plus gated cross-attention layers |
| NVLM-H | Hybrid | A low-resolution thumbnail and text are processed jointly in self-attention, while finer high-resolution tile details are routed through cross-attention | 2-layer MLP plus gated cross-attention layers |
NVLM-D handles image tokens in the same way as text token embeddings, which simplifies the design and unifies how the modalities are processed; the paper reports it as the strongest of the three on OCR-related tasks. NVLM-X uses cross-attention to improve computational efficiency on high-resolution images, giving faster training and inference, and the paper reports it as best-in-class among cross-attention models. NVLM-H is presented as the novel contribution that combines the two: by handling the thumbnail and text in the language model's self-attention layers while delegating fine detail to cross-attention, it aims to retain multimodal reasoning quality while improving efficiency, and it posts the family's best scores on reasoning-oriented benchmarks such as MMMU and MathVista.
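One way to see the trade-off is to count where image tokens end up in each variant. The sketch below is an illustrative accounting only, not NVIDIA's implementation. It assumes every variant receives 256 tokens per tile (the post-compression figure given in the description of the tiling scheme), that one thumbnail accompanies each image, and that tile-tag tokens can be ignored. The class and function names are hypothetical.

```python
from dataclasses import dataclass

# 256 tokens per tile after compression, as stated in the tiling description.
# Assumption: the same figure applies to all three variants.
TOKENS_PER_TILE = 256


@dataclass(frozen=True)
class ImageTokenBudget:
    self_attention: int   # image tokens placed in the LLM's main token sequence
    cross_attention: int  # image tokens read only through cross-attention layers


def image_token_budget(variant: str, num_tiles: int) -> ImageTokenBudget:
    """Count image tokens per pathway for one image.

    `num_tiles` is the number of regular high-resolution tiles; one global
    thumbnail tile is always added on top.
    """
    tile_tokens = num_tiles * TOKENS_PER_TILE
    thumbnail_tokens = TOKENS_PER_TILE
    if variant == "NVLM-D":   # decoder-only: everything enters self-attention
        return ImageTokenBudget(tile_tokens + thumbnail_tokens, 0)
    if variant == "NVLM-X":   # cross-attention: nothing enters self-attention
        return ImageTokenBudget(0, tile_tokens + thumbnail_tokens)
    if variant == "NVLM-H":   # hybrid: thumbnail in self-attention, tiles in cross-attention
        return ImageTokenBudget(thumbnail_tokens, tile_tokens)
    raise ValueError(f"unknown variant: {variant}")


for name in ("NVLM-D", "NVLM-X", "NVLM-H"):
    print(name, image_token_budget(name, num_tiles=6))
```

Under these assumptions, a six-tile image adds 1,792 image tokens to the self-attention sequence in the decoder-only layout, 256 in the hybrid layout, and none in the cross-attention layout, which is where the efficiency difference comes from.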
All three variants share a common vision pathway and language backbone: the InternViT-6B vision encoder and the Qwen2-72B-Instruct language model.
The released NVLM-D-72B totals approximately 79 billion parameters once the 72B language model, the 6B vision encoder, and the projection modules are combined, and supports a context length of up to 128K tokens.
To handle high-resolution images, NVLM uses a tile-based dynamic high-resolution (DHR) scheme. An input image is split into tiles, with up to six regular tiles plus one global thumbnail tile. Each 448x448 tile passes through InternViT-6B and yields 1,024 output tokens, which are compressed to 256 tokens through a pixel-shuffle operation before reaching the language model. A key methodological contribution is a 1-D tile-tagging design: textual tags are inserted to mark tile boundaries and positions in the flattened sequence, which the authors report significantly improves performance on multimodal reasoning and OCR tasks compared with feeding tiles without positional tags.
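The token arithmetic and the flattened, tagged layout described above can be sketched as follows. This is a minimal illustration rather than the released code. The patch size of 14 and the 2x2 merge factor are assumptions chosen because they reproduce the 1,024 and 256 token counts in the text. The tag strings, the thumbnail-first ordering, and all function names are hypothetical.

```python
TILE_SIZE = 448      # pixels per tile side, from the text
PATCH_SIZE = 14      # assumption: gives 32 x 32 = 1,024 tokens per tile
SHUFFLE_FACTOR = 2   # assumption: each 2 x 2 block of tokens is merged into one
MAX_TILES = 6        # regular tiles per image, from the text


def tokens_per_tile() -> tuple[int, int]:
    """Return (encoder output tokens, tokens after pixel shuffle) for one tile."""
    side = TILE_SIZE // PATCH_SIZE
    raw = side * side
    return raw, raw // (SHUFFLE_FACTOR ** 2)


def pixel_shuffle_downsample(grid, factor=SHUFFLE_FACTOR):
    """Merge each factor x factor block of token vectors by concatenating channels.

    `grid` is a list of rows; each cell is a list of channel values. The token
    count shrinks by factor**2 while the channel width grows by factor**2.
    """
    height, width = len(grid), len(grid[0])
    if height % factor or width % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    merged_grid = []
    for i in range(0, height, factor):
        row = []
        for j in range(0, width, factor):
            merged = []
            for di in range(factor):
                for dj in range(factor):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        merged_grid.append(row)
    return merged_grid


def tagged_layout(num_tiles: int) -> list[tuple[str, int]]:
    """Return (tag, image-token count) segments in flattened 1-D order.

    Tag strings and the thumbnail-first ordering are illustrative assumptions.
    """
    if not 1 <= num_tiles <= MAX_TILES:
        raise ValueError(f"num_tiles must be between 1 and {MAX_TILES}")
    _, compressed = tokens_per_tile()
    layout = [("<tile_global_thumbnail>", compressed)]
    layout += [(f"<tile_{k}>", compressed) for k in range(1, num_tiles + 1)]
    return layout


raw, compressed = tokens_per_tile()
print(raw, compressed)
print(tagged_layout(3))
```

Each text tag precedes the block of image tokens for its tile, so the language model can tell where one tile ends and the next begins even though the 2-D arrangement has been flattened into a single sequence.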
NVLM follows a two-stage recipe of multimodal pretraining followed by supervised fine-tuning (SFT). The authors place unusual emphasis on data curation, reporting that dataset quality and task diversity matter more than sheer scale, even during pretraining, and that this held across all three architectures.
The most consequential training choice for the headline text-only result was the deliberate integration of a high-quality text-only dataset into multimodal training, combined with a substantial volume of multimodal mathematics and reasoning data. The intent was to develop "production-grade" multimodality: models that excel at vision-language tasks without paying the usual tax of degraded text-only ability. The inclusion of math and reasoning data is credited with enhanced mathematics and coding capability across both modalities.
On vision-language benchmarks, NVLM 1.0 was reported as competitive with frontier proprietary and open models. The decoder-only NVLM-D-72B in particular was highlighted as achieving leading scores on OCRBench and VQAv2 at the time, and as matching or surpassing GPT-4o on several key benchmarks including MathVista, OCRBench, ChartQA, and DocVQA, while trailing on MMMU. The following table compares NVLM-D-72B against representative competitors using figures reported in the paper (higher is better; OCRBench is scored out of 1000).
| Benchmark | NVLM-D 72B | GPT-4o | Llama 3-V 405B | InternVL-2-Pro |
|---|---|---|---|---|
| MMMU (val) | 59.7 | 69.1 | 64.5 | 58.9 |
| MathVista | 65.2 | 63.8 | -- | 66.3 |
| OCRBench | 853 | 736 | -- | 837 |
| ChartQA | 86.0 | -- | 84.8 | -- |
| DocVQA | 92.6 | 92.8 | 92.6 | 95.1 |
| TextVQA | 82.1 | -- | 83.4 | 87.3 |
| AI2D (test, no mask) | 94.2 | -- | 80.2 | -- |
| VQAv2 | 85.4 | -- | -- | -- |
| RealWorldQA | 69.7 | -- | -- | -- |
Within the NVLM family, the three variants trade places depending on the task. NVLM-H leads on reasoning benchmarks, while NVLM-D leads on OCR-heavy and document tasks, consistent with the architectural design goals.
| Benchmark | NVLM-D 72B | NVLM-X 72B | NVLM-H 72B |
|---|---|---|---|
| MMMU (val) | 59.7 | 57.4 | 60.2 |
| MathVista | 65.2 | 64.6 | 66.6 |
| OCRBench | 853 | 828 | 831 |
| ChartQA | 86.0 | 82.9 | 83.3 |
| AI2D (test) | 85.2 | 84.2 | 83.8 |
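The claim that the variants trade places by task can be checked directly against the table. The short sketch below copies the scores from the table above and picks the leading variant per benchmark; the dictionary layout and function name are illustrative only.

```python
# Scores copied from the variant comparison table above.
SCORES = {
    "MMMU (val)": {"NVLM-D": 59.7, "NVLM-X": 57.4, "NVLM-H": 60.2},
    "MathVista": {"NVLM-D": 65.2, "NVLM-X": 64.6, "NVLM-H": 66.6},
    "OCRBench": {"NVLM-D": 853, "NVLM-X": 828, "NVLM-H": 831},
    "ChartQA": {"NVLM-D": 86.0, "NVLM-X": 82.9, "NVLM-H": 83.3},
    "AI2D (test)": {"NVLM-D": 85.2, "NVLM-X": 84.2, "NVLM-H": 83.8},
}


def leaders(scores):
    """Map each benchmark to the variant with the highest score."""
    return {bench: max(row, key=row.get) for bench, row in scores.items()}


for bench, variant in leaders(SCORES).items():
    print(f"{bench}: {variant}")
```

On these figures NVLM-H leads the two reasoning benchmarks (MMMU and MathVista), while NVLM-D leads OCRBench, ChartQA, and AI2D.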
The result most frequently cited from the NVLM work concerns text-only capability. Many multimodal models degrade on pure language benchmarks once vision data is introduced during training; the paper specifically notes that the leading InternVL 2 model shows significant degradation on text-only benchmarks including MMLU, GSM8K, MATH, and HumanEval. NVLM was designed to avoid this regression, and the authors report that its 72B model instead improved on text-only mathematics and coding benchmarks after multimodal training, with average accuracy rising by 4.3 points relative to the Qwen2-72B-Instruct backbone.
The official model card publishes a before-and-after comparison for NVLM-D-72B against its backbone, with the largest gains on mathematics, illustrating the same effect:
| Task | Qwen2-72B-Instruct (backbone) | NVLM-D 72B (after MM training) | Change |
|---|---|---|---|
| MMLU | 82.3 | 81.7 | -0.6 |
| GSM8K | 91.1 | 93.2 | +2.1 |
| MATH | 59.7 | 73.1 | +13.4 |
| HumanEval | 86.0 | 89.0 | +3.0 |
| Average | 79.8 | 84.3 | +4.5 |
(Exact figures vary slightly between the model card and the paper's internal tables owing to differing evaluation setups, but both report a net average gain of roughly 4 to 5 points and a large jump on MATH.) The authors attribute this outcome to the deliberate blending of high-quality text-only data and multimodal math and reasoning data into the training mixture, rather than to scale alone.
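The per-task changes and the average gain in the model-card table can be re-derived with a few lines of arithmetic. The sketch below copies the scores from that table and uses `Decimal` to avoid floating-point rounding noise; the variable and function names are illustrative.

```python
from decimal import Decimal

# (backbone score, NVLM-D 72B score), copied from the model-card table above.
TEXT_ONLY = {
    "MMLU": (Decimal("82.3"), Decimal("81.7")),
    "GSM8K": (Decimal("91.1"), Decimal("93.2")),
    "MATH": (Decimal("59.7"), Decimal("73.1")),
    "HumanEval": (Decimal("86.0"), Decimal("89.0")),
}


def summarize(table):
    """Return per-task deltas plus the mean before, mean after, and mean gain."""
    deltas = {task: after - before for task, (before, after) in table.items()}
    n = len(table)
    mean_before = sum(before for before, _ in table.values()) / n
    mean_after = sum(after for _, after in table.values()) / n
    return deltas, mean_before, mean_after, mean_after - mean_before


deltas, mean_before, mean_after, mean_gain = summarize(TEXT_ONLY)
for task, delta in deltas.items():
    print(f"{task}: {delta:+}")
print(f"mean before={mean_before}, mean after={mean_after}, gain={mean_gain}")
```

The unrounded means are 79.775 and 84.25, giving a gain of 4.475 points, which the table reports as +4.5 after rounding. Almost all of that gain comes from MATH, the only task with a double-digit change.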
NVLM was significant less for any single benchmark record than for the combination of openness and methodology it offered. By releasing competitive frontier-class multimodal weights and documenting the data composition, NVIDIA provided the research community with a reproducible reference point at a time when the strongest multimodal systems were largely closed. The work also clarified the practical trade-offs between decoder-only and cross-attention multimodal architectures and offered the hybrid NVLM-H as a concrete design that sought the best of both, alongside the 1-D tile-tagging technique for high-resolution inputs.
The demonstration that multimodal training need not degrade, and can in fact enhance, text-only reasoning informed subsequent thinking about how to build vision-language models without sacrificing language ability. NVLM sits within NVIDIA's wider portfolio of open and research models; it is distinct from Eagle, a separate NVIDIA (NVlabs) vision-language line that explores a mixture-of-vision-encoders design, and from the company's language-focused Nemotron and Llama Nemotron releases.