# QvQ

> Source: https://aiwiki.ai/wiki/qvq
> Updated: 2026-06-09
> Categories: Chinese AI, Multimodal AI, Reasoning Models
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

# QvQ

QvQ (styled QVQ) is a family of experimental visual reasoning models from the [Qwen](/wiki/qwen) team at [Alibaba](/wiki/alibaba). The models extend the [large language model](/wiki/large_language_model) reasoning approach to images: instead of answering a question about a picture in a single step, a QVQ model produces a long internal chain of thought and reasons about the visual content before giving a final answer. The first release, QVQ-72B-Preview, appeared on 25 December 2024 as the visual counterpart to the text-only reasoning model [QwQ](/wiki/qwq) (QwQ-32B-Preview), which the team had published the previous month. [1][2] A second, redesigned model, QVQ-Max, was announced on 28 March 2025. [3]

The name follows the team's habit of pairing a "Q" for Qwen with a playful repetition: where QwQ is the text reasoner, QVQ adds a "V" for vision. [1]

## Relationship to QwQ and Qwen2-VL

QVQ-72B-Preview is built on [Qwen2-VL](/wiki/qwen2_vl)-72B, Alibaba's open-weight multimodal model, rather than being trained from scratch. The Hugging Face model card lists the base model as Qwen/Qwen2-VL-72B and tags the model for image-text-to-text use. [4] In effect QVQ takes the perception stack of Qwen2-VL and layers a long reasoning behaviour on top, the same idea QwQ-32B-Preview applied to text-only problems. [1][2] The team described both QwQ and QVQ as exploratory research releases rather than finished products. [1]

QVQ is positioned as a [reasoning_model](/wiki/reasoning_model) for the [multimodal](/wiki/multimodal) setting. The Qwen blog framed the goal in terms of combining language and vision so a model can "see the world with wisdom," and likened the intended behaviour to a person who works through a hard physics diagram step by step instead of guessing. [1]

## QVQ-72B-Preview

QVQ-72B-Preview is the 72-billion-parameter preview model released on 25 December 2024 under the title "QVQ: To See the World with Wisdom." [1] It was published with open weights on Hugging Face alongside an online demo space, and shares the repository lineage of [Qwen2-VL](/wiki/qwen2_vl). [1][4]

The model has several practical constraints that the developers stated plainly. It supports only single-round dialogue and image inputs: it does not accept video, and it is not designed for multi-turn conversation. [4] In day-to-day recognition tasks such as identifying people, animals, or plants, the team reported that QVQ does not improve meaningfully over Qwen2-VL-72B-Instruct, so it is not a drop-in replacement for the base model. [4]

## Benchmarks

The Qwen team evaluated QVQ-72B-Preview on four multimodal reasoning benchmarks: [MMMU](/wiki/mmmu) (a university-level, multidisciplinary multimodal exam), MathVista, MathVision, and OlympiadBench. The headline result is a score of 70.3 on MMMU, with larger gains on the math-heavy sets relative to the Qwen2-VL-72B-Instruct baseline. The team compared the preview against OpenAI's o1 (the 17 December 2024 build), GPT-4o (13 May 2024), and Claude 3.5 Sonnet (22 October 2024). [1][4]

| Benchmark | QVQ-72B-Preview | o1-2024-12-17 | GPT-4o-2024-05-13 | Claude 3.5 Sonnet-20241022 | Qwen2-VL-72B-Instruct |
|-----------|-----------------|---------------|-------------------|----------------------------|-----------------------|
| MMMU (val) | 70.3 | 77.3 | 69.1 | 70.4 | 64.5 |
| MathVista (mini) | 71.4 | 71.0 | 63.8 | 65.3 | 70.5 |
| MathVision (full) | 35.9 | N/A | 30.4 | 35.6 | 25.9 |
| OlympiadBench | 20.4 | N/A | 25.9 | N/A | 11.2 |

Source: Qwen blog and the QVQ-72B-Preview Hugging Face card. [1][4] On MathVista the preview slightly edged o1 in the reported figures, while on MMMU it trailed o1 but landed close to Claude 3.5 Sonnet and ahead of GPT-4o. The team characterised the overall result as closing much of the gap with leading reasoning systems on these visual tasks. [1]

## Limitations noted by the developers

The Qwen team published an unusually candid list of weaknesses for the preview, grouped under four headings. [1][4]

Language mixing and code-switching: the model can occasionally mix languages or switch between them, which can hurt the clarity of an answer. [4]

Recursive reasoning loops: the long chain of thought can fall into circular patterns, producing very long responses that sometimes never reach a final answer. [4]

Safety and ethical considerations: the team stated that robust safety measures are still needed and advised caution when deploying the model. [4]

Performance and benchmark limits: QVQ does not fully replace Qwen2-VL-72B-Instruct, and during multi-step visual reasoning it may gradually lose track of the image content and hallucinate. It also shows no significant gain over the base model on basic recognition. [4]

These caveats, together with the "Preview" label, frame QVQ-72B as a research artifact meant to demonstrate visual chain-of-thought rather than a production system. [1]

## Licensing

QVQ-72B-Preview is released under the Qwen license. The Hugging Face metadata records the license as "other" with the license name "qwen" and a link to the model's LICENSE file, the same custom community license Alibaba uses for its larger Qwen weights. [4] The weights, an online demo, and the model card are hosted on Hugging Face. [1][4]

## QVQ-Max

On 28 March 2025 the Qwen team announced QVQ-Max under the title "QVQ-Max: Think with Evidence," describing it as the first version of a redesigned visual reasoning model that supersedes the December preview. [3] Unlike the preview, QVQ-Max accepts both images and video, and it is offered through the Qwen Chat interface, where users upload an image or video, ask a question, and press a "Thinking" button to watch the model reason before it answers. [3]

The blog presented QVQ-Max less as a benchmark leader and more as a practical tool, with worked examples spanning visual math problems, charts and diagrams, everyday questions, code, and creative tasks such as helping with illustration or design. [3] One quantitative point the team highlighted is that accuracy on MathVision rises as the maximum length of the model's thinking process is increased, an effect analogous to test-time scaling in text reasoners. [3] The post did not publish a full comparison table of the kind released for the 72B preview. [3]

For the future of the line, the team listed three directions: improving observational accuracy through grounding (so the model can verify the visual evidence behind a claim), building visual agents that carry out multi-step tasks such as operating a phone or computer or playing games, and broadening interaction beyond text. [3] The earlier QVQ-72B-Preview weights remain available on Hugging Face, while QVQ-Max has been distributed through Qwen Chat and Alibaba Cloud's interface rather than as a clearly documented open-weight checkpoint. [3][4]

QVQ sits alongside Alibaba's broader multimodal line, including [Qwen2.5-VL](/wiki/qwen2_5_vl), which the team continued to develop in parallel as the general-purpose vision-language family while QVQ pursued the narrower goal of long-form visual reasoning. [3]

## See also

- [Skywork-R1V](/wiki/skywork_r1)

## References

1. "QVQ: To See the World with Wisdom," Qwen (Alibaba Cloud) blog, 25 December 2024. https://qwenlm.github.io/blog/qvq-72b-preview/
2. "QwQ: Reflect Deeply on the Boundaries of the Unknown," Qwen (Alibaba Cloud) blog. https://qwenlm.github.io/blog/qwq-32b-preview/
3. "QVQ-Max: Think with Evidence," Qwen (Alibaba Cloud) blog, 28 March 2025. https://qwenlm.github.io/blog/qvq-max-preview/
4. "Qwen/QVQ-72B-Preview," Hugging Face model card. https://huggingface.co/Qwen/QVQ-72B-Preview

