QvQ

Chinese AI Multimodal AI Reasoning Models

6 min read

Updated Jul 16, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 16, 2026

Fact-checked

In review queue

Sources

4 citations

Revision

v3 · 1,136 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

QvQ (styled QVQ) is a family of experimental visual reasoning models from the Qwen team at Alibaba. The models extend the large language model reasoning approach to images: instead of answering a question about a picture in a single step, a QVQ model produces a long internal chain of thought and reasons about the visual content before giving a final answer. The first release, QVQ-72B-Preview, appeared on 25 December 2024 as the visual counterpart to the text-only reasoning model QwQ (QwQ-32B-Preview), which the team had published the previous month. ^[1]^[2] A second, redesigned model, QVQ-Max, was announced on 28 March 2025. ^[3]

The name follows the team's habit of pairing a "Q" for Qwen with a playful repetition: where QwQ is the text reasoner, QVQ adds a "V" for vision. ^[1]

Relationship to QwQ and Qwen2-VL

QVQ-72B-Preview is built on Qwen2-VL-72B, Alibaba's open-weight multimodal model, rather than being trained from scratch. The Hugging Face model card lists the base model as Qwen/Qwen2-VL-72B and tags the model for image-text-to-text use. ^[4] In effect QVQ takes the perception stack of Qwen2-VL and layers a long reasoning behaviour on top, the same idea QwQ-32B-Preview applied to text-only problems. ^[1]^[2] The team described both QwQ and QVQ as exploratory research releases rather than finished products. ^[1]

QVQ is positioned as a reasoning_model for the multimodal setting. The Qwen blog framed the goal in terms of combining language and vision so a model can "see the world with wisdom," and likened the intended behaviour to a person who works through a hard physics diagram step by step instead of guessing. ^[1]

QVQ-72B-Preview

QVQ-72B-Preview is the 72-billion-parameter preview model released on 25 December 2024 under the title "QVQ: To See the World with Wisdom." ^[1] It was published with open weights on Hugging Face alongside an online demo space, and shares the repository lineage of Qwen2-VL. ^[1]^[4]

The model has several practical constraints that the developers stated plainly. It supports only single-round dialogue and image inputs: it does not accept video, and it is not designed for multi-turn conversation. ^[4] In day-to-day recognition tasks such as identifying people, animals, or plants, the team reported that QVQ does not improve meaningfully over Qwen2-VL-72B-Instruct, so it is not a drop-in replacement for the base model. ^[4]

Benchmarks

The Qwen team evaluated QVQ-72B-Preview on four multimodal reasoning benchmarks: MMMU (a university-level, multidisciplinary multimodal exam), MathVista, MathVision, and OlympiadBench. The headline result is a score of 70.3 on MMMU, with larger gains on the math-heavy sets relative to the Qwen2-VL-72B-Instruct baseline. The team compared the preview against OpenAI's o1 (the 17 December 2024 build), GPT-4o (13 May 2024), and Claude 3.5 Sonnet (22 October 2024). ^[1]^[4]

Benchmark	QVQ-72B-Preview	o1-2024-12-17	GPT-4o-2024-05-13	Claude 3.5 Sonnet-20241022	Qwen2-VL-72B-Instruct
MMMU (val)	70.3	77.3	69.1	70.4	64.5
MathVista (mini)	71.4	71.0	63.8	65.3	70.5
MathVision (full)	35.9	N/A	30.4	35.6	25.9
OlympiadBench	20.4	N/A	25.9	N/A	11.2

Source: Qwen blog and the QVQ-72B-Preview Hugging Face card. ^[1]^[4] On MathVista the preview slightly edged o1 in the reported figures, while on MMMU it trailed o1 but landed close to Claude 3.5 Sonnet and ahead of GPT-4o. The team characterised the overall result as closing much of the gap with leading reasoning systems on these visual tasks. ^[1]

Limitations noted by the developers

The Qwen team published an unusually candid list of weaknesses for the preview, grouped under four headings. ^[1]^[4]

Language mixing and code-switching: the model can occasionally mix languages or switch between them, which can hurt the clarity of an answer. ^[4]

Recursive reasoning loops: the long chain of thought can fall into circular patterns, producing very long responses that sometimes never reach a final answer. ^[4]

Safety and ethical considerations: the team stated that robust safety measures are still needed and advised caution when deploying the model. ^[4]

Performance and benchmark limits: QVQ does not fully replace Qwen2-VL-72B-Instruct, and during multi-step visual reasoning it may gradually lose track of the image content and hallucinate. It also shows no significant gain over the base model on basic recognition. ^[4]

These caveats, together with the "Preview" label, frame QVQ-72B as a research artifact meant to demonstrate visual chain-of-thought rather than a production system. ^[1]

Licensing

QVQ-72B-Preview is released under the Qwen license. The Hugging Face metadata records the license as "other" with the license name "qwen" and a link to the model's LICENSE file, the same custom community license Alibaba uses for its larger Qwen weights. ^[4] The weights, an online demo, and the model card are hosted on Hugging Face. ^[1]^[4]

QVQ-Max

On 28 March 2025 the Qwen team announced QVQ-Max under the title "QVQ-Max: Think with Evidence," describing it as the first version of a redesigned visual reasoning model that supersedes the December preview. ^[3] Unlike the preview, QVQ-Max accepts both images and video, and it is offered through the Qwen Chat interface, where users upload an image or video, ask a question, and press a "Thinking" button to watch the model reason before it answers. ^[3]

The blog presented QVQ-Max less as a benchmark leader and more as a practical tool, with worked examples spanning visual math problems, charts and diagrams, everyday questions, code, and creative tasks such as helping with illustration or design. ^[3] One quantitative point the team highlighted is that accuracy on MathVision rises as the maximum length of the model's thinking process is increased, an effect analogous to test-time scaling in text reasoners. ^[3] The post did not publish a full comparison table of the kind released for the 72B preview. ^[3]

For the future of the line, the team listed three directions: improving observational accuracy through grounding (so the model can verify the visual evidence behind a claim), building visual agents that carry out multi-step tasks such as operating a phone or computer or playing games, and broadening interaction beyond text. ^[3] The earlier QVQ-72B-Preview weights remain available on Hugging Face, while QVQ-Max has been distributed through Qwen Chat and Alibaba Cloud's interface rather than as a clearly documented open-weight checkpoint. ^[3]^[4]

QVQ sits alongside Alibaba's broader multimodal line, including Qwen2.5-VL, which the team continued to develop in parallel as the general-purpose vision-language family while QVQ pursued the narrower goal of long-form visual reasoning. ^[3]

References

"QVQ: To See the World with Wisdom," Qwen (Alibaba Cloud) blog, 25 December 2024. https://qwenlm.github.io/blog/qvq-72b-preview/ ↩
"QwQ: Reflect Deeply on the Boundaries of the Unknown," Qwen (Alibaba Cloud) blog. https://qwenlm.github.io/blog/qwq-32b-preview/ ↩
"QVQ-Max: Think with Evidence," Qwen (Alibaba Cloud) blog, 28 March 2025. https://qwenlm.github.io/blog/qvq-max-preview/ ↩
"Qwen/QVQ-72B-Preview," Hugging Face model card. https://huggingface.co/Qwen/QVQ-72B-Preview ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributors · full history

Suggest edit

What links here

QwQ

Relationship to QwQ and Qwen2-VL

QVQ-72B-Preview

Benchmarks

Limitations noted by the developers

Licensing

QVQ-Max

See also

References

Improve this article

Related Articles

Skywork-R1V

Muse Spark

DeepSeek-R1

GRPO

DeepSeek-R1-Distill

DeepSeek V3.1