QvQ
Last reviewed
Jun 3, 2026
Sources
4 citations
Review status
Source-backed
Revision
v1 · 1,133 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 3, 2026
Sources
4 citations
Review status
Source-backed
Revision
v1 · 1,133 words
Add missing citations, update stale details, or suggest a clearer explanation.
QvQ (styled QVQ) is a family of experimental visual reasoning models from the Qwen team at Alibaba. The models extend the large language model reasoning approach to images: instead of answering a question about a picture in a single step, a QVQ model produces a long internal chain of thought and reasons about the visual content before giving a final answer. The first release, QVQ-72B-Preview, appeared on 25 December 2024 as the visual counterpart to the text-only reasoning model QwQ (QwQ-32B-Preview), which the team had published the previous month. [1][2] A second, redesigned model, QVQ-Max, was announced on 28 March 2025. [3]
The name follows the team's habit of pairing a "Q" for Qwen with a playful repetition: where QwQ is the text reasoner, QVQ adds a "V" for vision. [1]
QVQ-72B-Preview is built on Qwen2-VL-72B, Alibaba's open-weight multimodal model, rather than being trained from scratch. The Hugging Face model card lists the base model as Qwen/Qwen2-VL-72B and tags the model for image-text-to-text use. [4] In effect QVQ takes the perception stack of Qwen2-VL and layers a long reasoning behaviour on top, the same idea QwQ-32B-Preview applied to text-only problems. [1][2] The team described both QwQ and QVQ as exploratory research releases rather than finished products. [1]
QVQ is positioned as a reasoning_model for the multimodal setting. The Qwen blog framed the goal in terms of combining language and vision so a model can "see the world with wisdom," and likened the intended behaviour to a person who works through a hard physics diagram step by step instead of guessing. [1]
QVQ-72B-Preview is the 72-billion-parameter preview model released on 25 December 2024 under the title "QVQ: To See the World with Wisdom." [1] It was published with open weights on Hugging Face alongside an online demo space, and shares the repository lineage of Qwen2-VL. [1][4]
The model has several practical constraints that the developers stated plainly. It supports only single-round dialogue and image inputs: it does not accept video, and it is not designed for multi-turn conversation. [4] In day-to-day recognition tasks such as identifying people, animals, or plants, the team reported that QVQ does not improve meaningfully over Qwen2-VL-72B-Instruct, so it is not a drop-in replacement for the base model. [4]
The Qwen team evaluated QVQ-72B-Preview on four multimodal reasoning benchmarks: MMMU (a university-level, multidisciplinary multimodal exam), MathVista, MathVision, and OlympiadBench. The headline result is a score of 70.3 on MMMU, with larger gains on the math-heavy sets relative to the Qwen2-VL-72B-Instruct baseline. The team compared the preview against OpenAI's o1 (the 17 December 2024 build), GPT-4o (13 May 2024), and Claude 3.5 Sonnet (22 October 2024). [1][4]
| Benchmark | QVQ-72B-Preview | o1-2024-12-17 | GPT-4o-2024-05-13 | Claude 3.5 Sonnet-20241022 | Qwen2-VL-72B-Instruct |
|---|---|---|---|---|---|
| MMMU (val) | 70.3 | 77.3 | 69.1 | 70.4 | 64.5 |
| MathVista (mini) | 71.4 | 71.0 | 63.8 | 65.3 | 70.5 |
| MathVision (full) | 35.9 | N/A | 30.4 | 35.6 | 25.9 |
| OlympiadBench | 20.4 | N/A | 25.9 | N/A | 11.2 |
Source: Qwen blog and the QVQ-72B-Preview Hugging Face card. [1][4] On MathVista the preview slightly edged o1 in the reported figures, while on MMMU it trailed o1 but landed close to Claude 3.5 Sonnet and ahead of GPT-4o. The team characterised the overall result as closing much of the gap with leading reasoning systems on these visual tasks. [1]
The Qwen team published an unusually candid list of weaknesses for the preview, grouped under four headings. [1][4]
Language mixing and code-switching: the model can occasionally mix languages or switch between them, which can hurt the clarity of an answer. [4]
Recursive reasoning loops: the long chain of thought can fall into circular patterns, producing very long responses that sometimes never reach a final answer. [4]
Safety and ethical considerations: the team stated that robust safety measures are still needed and advised caution when deploying the model. [4]
Performance and benchmark limits: QVQ does not fully replace Qwen2-VL-72B-Instruct, and during multi-step visual reasoning it may gradually lose track of the image content and hallucinate. It also shows no significant gain over the base model on basic recognition. [4]
These caveats, together with the "Preview" label, frame QVQ-72B as a research artifact meant to demonstrate visual chain-of-thought rather than a production system. [1]
QVQ-72B-Preview is released under the Qwen license. The Hugging Face metadata records the license as "other" with the license name "qwen" and a link to the model's LICENSE file, the same custom community license Alibaba uses for its larger Qwen weights. [4] The weights, an online demo, and the model card are hosted on Hugging Face. [1][4]
On 28 March 2025 the Qwen team announced QVQ-Max under the title "QVQ-Max: Think with Evidence," describing it as the first version of a redesigned visual reasoning model that supersedes the December preview. [3] Unlike the preview, QVQ-Max accepts both images and video, and it is offered through the Qwen Chat interface, where users upload an image or video, ask a question, and press a "Thinking" button to watch the model reason before it answers. [3]
The blog presented QVQ-Max less as a benchmark leader and more as a practical tool, with worked examples spanning visual math problems, charts and diagrams, everyday questions, code, and creative tasks such as helping with illustration or design. [3] One quantitative point the team highlighted is that accuracy on MathVision rises as the maximum length of the model's thinking process is increased, an effect analogous to test-time scaling in text reasoners. [3] The post did not publish a full comparison table of the kind released for the 72B preview. [3]
For the future of the line, the team listed three directions: improving observational accuracy through grounding (so the model can verify the visual evidence behind a claim), building visual agents that carry out multi-step tasks such as operating a phone or computer or playing games, and broadening interaction beyond text. [3] The earlier QVQ-72B-Preview weights remain available on Hugging Face, while QVQ-Max has been distributed through Qwen Chat and Alibaba Cloud's interface rather than as a clearly documented open-weight checkpoint. [3][4]
QVQ sits alongside Alibaba's broader multimodal line, including Qwen2.5-VL, which the team continued to develop in parallel as the general-purpose vision-language family while QVQ pursued the narrower goal of long-form visual reasoning. [3]