QwQ
Last reviewed
May 17, 2026
Sources
13 citations
Review status
Source-backed
Revision
v2 · 6,131 words
| QwQ | |
|---|---|
| Developer | Alibaba Cloud Qwen team |
| Series | Qwen |
| First release | November 28, 2024 (QwQ-32B-Preview) |
| Full release | March 5, 2025 (QwQ-32B) |
| Type | Large language model (reasoning model) |
| Architecture | Dense Transformer decoder |
| Parameters | 32.5 billion (dense) |
| Base model | Qwen2.5-32B |
| Context length | 32,768 tokens (Preview); 131,072 tokens (QwQ-32B) |
| Training method | Cold-start fine-tuning plus reinforcement learning with rule-based and general-purpose rewards |
| License | Apache 2.0 |
| Pronunciation | "kwuh" (sometimes spelled "quee-ew"); short for "Qwen with Questions" |
| Distribution | Hugging Face, ModelScope, Qwen Chat, Alibaba Cloud Model Studio |
QwQ is a series of open-weight reasoning models developed by the Qwen team at Alibaba Cloud. The series consists of two public releases: QwQ-32B-Preview, an experimental preview shipped on November 28, 2024, and QwQ-32B, the full release on March 5, 2025. Both models were released under the Apache 2.0 license, distributed via Hugging Face and ModelScope, and pitched as open competitors to OpenAI's o1 and the contemporaneous DeepSeek-R1.[1][2][3] The name QwQ stands for "Qwen with Questions," pronounced loosely as "kwuh," and is paired with a playful, anime-styled mascot in the team's blog posts.[1]
The two checkpoints share a 32.5-billion-parameter dense decoder built on the Qwen2.5-32B base model and use the same tokenizer as the rest of the Qwen 2.5 family. Where the Preview version was a research artefact released to gather feedback, QwQ-32B was trained with a multi-stage reinforcement learning pipeline. The pipeline combined rule-based rewards for mathematics and code with general reward models for instruction following, formatting, and tool use. Alibaba reported that QwQ-32B reached parity with DeepSeek-R1, a model with roughly twenty times more total parameters, on several reasoning benchmarks. The release was followed by an 8.39% jump in Alibaba's Hong Kong-listed shares the same week, briefly pushing the stock to a 52-week high.[2][3][4][11]
QwQ sits within Alibaba's broader push into reasoning systems, alongside QvQ (a vision-reasoning preview from December 2024) and the later thinking variants in the Qwen 3 family. It is widely cited as one of the first small, openly licensed models to match much larger proprietary reasoning systems on math and code benchmarks, and as one of the first widely circulated examples of a visible, "thinking-aloud" chain-of-thought trace from a Chinese lab.[5][6]
Qwen, short for Tongyi Qianwen (通义千问), is the family of large language models developed by Alibaba Cloud's Qwen team since 2023. By the time QwQ-32B-Preview shipped, the team had already released the Qwen 1, Qwen 1.5, and Qwen 2 generations, followed by Qwen 2.5 in September 2024. The Qwen 2.5 family covered seven dense sizes from 0.5B to 72B parameters, with context windows of up to 128K tokens and a shared tokenizer with a vocabulary of roughly 150K entries. It also included two specialized branches, Qwen2.5-Coder and Qwen2.5-Math, each tuned with extra code and math data on top of the same base.[7][8]
QwQ was the team's first reasoning-focused entry. The Qwen 2.5 base models had already shown strong math performance for their size, especially in the 32B and 72B variants. The reasoning track took the 32B base as the starting point because, in the team's own framing, 32B was the smallest size at which long chain-of-thought traces remained coherent under their training recipe, while still being small enough to run on a single high-end consumer GPU after quantization.[1][3]
The immediate context for QwQ was the September 2024 release of OpenAI's o1, which popularized the idea of a model that spends extra inference compute on a hidden chain of thought before answering. Within a few weeks several labs began publishing their own reasoning experiments. The Qwen team's blog post for the Preview, titled "Reflect Deeply on the Boundaries of the Unknown," framed QwQ as both a research preview and an open invitation, releasing the weights, the system prompt, and a small set of example traces so that the community could study the failure modes directly.[1]
The team also positioned the model against DeepSeek-R1, which was released in January 2025, about two months after the Preview. DeepSeek-R1 is a 671-billion-parameter Mixture of Experts model with 37B active parameters, far larger than QwQ-32B. Alibaba's argument with the March 2025 release was that a much smaller dense model, when trained with the right RL recipe, could approach R1's results on math and coding benchmarks while running on far cheaper hardware.[2][3][4]
Both QwQ-32B-Preview and QwQ-32B were released under the permissive Apache 2.0 license, which allows commercial use, modification, and redistribution without a separate Alibaba license. This contrasted with some earlier Qwen releases, whose largest checkpoints were governed by the more restrictive Tongyi Qianwen license. The Apache 2.0 choice mattered because it let downstream teams ship QwQ inside commercial products and run their own RL fine-tunes without negotiating additional terms.[2][9]
The two QwQ checkpoints differ less in size than in maturity. The Preview was a research drop with limited training and a shorter context, while QwQ-32B was the production release with a full RL pipeline and the same 131,072-token window as the underlying Qwen2.5-32B base.
| Model | Release date | Parameters | Context length | Status | License |
|---|---|---|---|---|---|
| QwQ-32B-Preview | November 28, 2024 | 32.5B dense | 32,768 tokens | Experimental research preview | Apache 2.0 |
| QwQ-32B | March 5, 2025 | 32.5B dense | 131,072 tokens | General release | Apache 2.0 |
Alibaba also lists the model on its commercial endpoints. QwQ-32B is available through Alibaba Cloud Model Studio's DashScope API and through the free Qwen Chat web interface at chat.qwen.ai, where it appears as a selectable thinking model. The same weights are mirrored on Hugging Face under Qwen/QwQ-32B-Preview and Qwen/QwQ-32B, and on ModelScope under the equivalent identifiers.[1][2][9]
On DashScope, the production endpoint is exposed as qwq-plus and qwq-plus-latest, with date-pinned snapshots such as qwq-plus-2025-03-05 available so applications can lock to a specific training cut. These hosted variants accept the same chat-completions schema as the rest of the Qwen API and route to the same 32B weights, but they operate in a thinking-only mode where the reasoning trace is always emitted before the final answer.[11][12]
QwQ is a dense decoder-only Transformer, not a Mixture of Experts model. The team kept the architecture identical to Qwen2.5-32B so that the Preview and the final QwQ-32B checkpoint could load into the same inference stack as the rest of the Qwen 2.5 family.[1][3]
Key architectural details published by the team or recoverable from the Hugging Face configuration files include:[1][7][8]
The choice to keep the architecture unchanged was deliberate. By using the same backbone as Qwen2.5-32B, the team could reuse all of the inference and fine-tuning tooling the community had already built for that model, including vLLM, SGLang, Ollama, and llama.cpp paths.[3][9]
Both QwQ checkpoints start from the Qwen2.5-32B base. The Preview was produced with a relatively short post-training pipeline: a curated supervised fine-tune on long chain-of-thought traces drawn from competition math, programming, and scientific reasoning sources, followed by a limited RL stage. The team did not publish a full technical report for the Preview, and described the run in their blog as a "first attempt."[1]
For QwQ-32B, Alibaba gave more public detail. The training pipeline used two main RL stages on top of an initial cold-start fine-tune.[3]
| Stage | Goal | Reward signal | Data |
|---|---|---|---|
| Cold start | Teach the base model to produce long, structured reasoning traces. | Supervised loss on curated traces. | Hand-filtered competition math, code, and science problems with worked solutions. |
| RL stage 1 | Improve math and code accuracy. | Rule-based: numerical answer match for math, unit-test pass for code. | Math word problems with verifiable final answers; programming problems with hidden test cases. |
| RL stage 2 | Improve general instruction following, format compliance, tool use, and human preference. | General reward model plus rule-based checks (for example, IFEval-style format constraints). | Mixed instruction-following, agent, and chat data. |
The rule-based design for stage 1 follows the same family of "verifiable rewards" used by DeepSeek-R1 and described in subsequent open papers. Instead of a learned reward model, the system grades each rollout against a deterministic checker: for math, the model's final answer must match the gold answer after normalization; for code, the generated solution must compile and pass every hidden test case. This sidesteps many of the reward hacking failure modes that plagued earlier reinforcement learning from human feedback systems: with no learned reward model to exploit, the main remaining loophole is a gap in the checker itself, such as a weak test suite.[3][6]
The second RL stage is closer to standard RLHF. A general reward model, trained on Qwen team preference data, is used to score outputs on broader axes: helpfulness, instruction following, formatting, refusal behavior, and tool-call correctness. The team reported that this stage was kept short, because longer runs began to erode the math and code gains from the first stage.[3]
A central piece of the QwQ-32B pipeline that the team described in more depth than most contemporaries was the pair of automated graders that score rollouts during stage 1. The math accuracy verifier is a normalizer plus equality checker. It strips formatting from the model's final-answer span, applies arithmetic and symbolic normalization (for example, reducing fractions, canonicalizing surds, and parsing LaTeX), and compares against the gold answer. Partial credit is not given. Either the answers match after normalization, in which case the rollout earns a positive reward, or they do not, in which case the reward is zero.[2][3][6]
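The math verifier described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Qwen team's implementation: the names `normalize_answer` and `math_reward` are hypothetical, and the normalization handles only plain numbers, decimals, simple fractions, and `\frac{a}{b}`. A production checker would also need to canonicalize surds and parse general LaTeX.

```python
import re
from fractions import Fraction


def normalize_answer(text: str) -> str:
    """Reduce a final-answer string to a canonical form (hypothetical helper)."""
    s = text.strip().strip("$").replace(" ", "")
    # Drop thousands separators such as the comma in 1,000 (but keep "(1,2)").
    s = re.sub(r"(?<=\d),(?=\d{3}\b)", "", s)
    # Rewrite \frac{a}{b} or \dfrac{a}{b} as a/b so Fraction can parse it.
    s = re.sub(r"\\d?frac\{(-?\d+)\}\{(-?\d+)\}", r"\1/\2", s)
    try:
        # Fraction reduces 2/4 to 1/2 and parses decimals such as 0.5 exactly.
        return str(Fraction(s))
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers fall back to case-insensitive string comparison.
        return s.lower()


def math_reward(model_answer: str, gold_answer: str) -> float:
    """Binary reward with no partial credit: 1.0 on a match, otherwise 0.0."""
    same = normalize_answer(model_answer) == normalize_answer(gold_answer)
    return 1.0 if same else 0.0
```

With this sketch, `math_reward(r"\frac{2}{4}", "0.5")` returns `1.0` because both sides normalize to `1/2`, while `math_reward("3", "4")` returns `0.0`.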
The code execution server is conceptually similar but runs against a hidden test suite. Each programming problem ships with a battery of public and hidden test cases. Generated solutions are compiled, sandboxed, and executed against the full suite. A rollout earns reward only if every hidden test passes within the time limit. Edge cases, off-by-one errors, and floating-point tolerance issues are treated as failures, which forces the policy to produce robust code rather than code that merely passes a single example.[2][3][6]
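The all-or-nothing grading rule for code can be illustrated the same way. The sketch below is an assumption-laden toy, not the actual code execution server: `code_reward` is a hypothetical name, solutions are assumed to be Python programs that read stdin and write stdout, and a plain subprocess stands in for a real sandbox, which would need proper isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def code_reward(solution_src: str,
                tests: list[tuple[str, str]],
                time_limit_s: float = 5.0) -> float:
    """Return 1.0 only if the program passes every (stdin, expected_stdout) case."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "solution.py"
        path.write_text(solution_src)
        for stdin_text, expected in tests:
            try:
                result = subprocess.run(
                    [sys.executable, str(path)],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=time_limit_s,
                )
            except subprocess.TimeoutExpired:
                return 0.0  # exceeding the time limit counts as a failure
            if result.returncode != 0 or result.stdout.strip() != expected.strip():
                return 0.0  # a crash or a single wrong output fails the rollout
    return 1.0
```

A solution that happens to pass one case but fails another still earns zero, which is the property that pushes the policy toward robust code.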
What made this setup unusual for early 2025 was the scale at which it ran. Alibaba's blog post described "continuous RL scaling" on top of these verifiers, meaning the team trained the policy for many more steps than was standard in earlier supervised pipelines. The headline AIME and LiveCodeBench gains from Preview to full release are attributed largely to this extended RL training rather than to changes in the underlying architecture or the cold-start data.[2][3]
A notable addition in QwQ-32B over the Preview was support for function calling and agentic tool use. The released checkpoint can emit structured tool calls, observe tool outputs, and continue reasoning, all inside the same chain of thought. The team trained this capability during the second RL stage, using rollouts where the policy is given access to a sandboxed Python interpreter and a small set of search and retrieval tools. Reported BFCL (Berkeley Function Calling Leaderboard) scores placed QwQ-32B above the contemporaneous Qwen2.5-32B-Instruct and roughly comparable to several closed-weight reasoning systems.[3]
The choice of verifiable rewards in stage 1 reflects a wider shift in how reasoning models were being trained at the start of 2025. Earlier RLHF-style pipelines relied on a learned reward model that predicted human preferences. These reward models tended to overfit, especially at the long-tail end of difficulty where they had little training signal, and the resulting policies often gamed the reward instead of solving the underlying task. Verifiable rewards remove that loop. A math problem with a known answer either gets the right answer or it does not. A programming problem either passes the hidden tests or it does not. The reward is binary, deterministic, and hard to game without actually solving the problem, provided the checker itself is sound.[3][6]
The trade-off is data coverage. Verifiable rewards only work where you can write a checker, which restricts stage 1 to math, code, and a narrow slice of formal science. Open-ended writing, persuasion, summarization, and most chat tasks have no checker, so the second RL stage has to fall back on a learned reward model. This is one reason QwQ is much stronger on quantitative tasks than on creative or open-ended ones, and it is also why the Qwen team reports a sharper improvement on AIME and LiveCodeBench than on subjective benchmarks like Arena-Hard.[3][6]
QwQ produces visible chain-of-thought traces by default. Unlike o1, which hides its internal reasoning from end users, QwQ surfaces its thinking inline. A typical trace begins with the model restating the problem, sketches a plan, works through intermediate steps, sometimes catches and corrects its own mistakes, and finishes with a clearly delimited final answer. The team's example traces in the Preview blog post explicitly highlight self-correction lines such as "Wait, let me check this again," "Actually, that step is wrong," or "Hmm, I should reconsider."[1]
The "thinking-aloud" voice is one of QwQ's most distinctive features. It feels closer to a student working a problem out loud than to the polished, hidden reasoning of a closed system. This made QwQ popular as a teaching artefact: educators and researchers could read the trace and see exactly where the model decided what. It also made the model's mistakes more legible, since wrong intermediate steps usually appear in plain text rather than being summarized away.[5][6]
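Because the trace and the final answer arrive in one stream, applications usually separate them before display or grading. The sketch below assumes the reasoning is closed by a `</think>` tag, which is the convention in the QwQ-32B chat template as far as we are aware; the Preview does not necessarily follow it, and the helper name `split_trace` is hypothetical.

```python
def split_trace(output: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, final_answer).

    Assumes the reasoning ends at a </think> tag. If the tag is missing,
    the output is treated as an unfinished trace with no final answer.
    """
    reasoning, sep, answer = output.partition("</think>")
    # The opening tag may or may not be present in the generated text.
    reasoning = reasoning.strip().removeprefix("<think>").strip()
    if not sep:
        return reasoning, ""
    return reasoning, answer.strip()
```

Returning an empty answer when the closing tag is absent is a design choice: it lets callers detect traces that ran out of budget before committing to an answer.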
QwQ-32B is tuned primarily for three domains:[1][3]
The full QwQ-32B release added function calling capability. A user can pass a list of tool schemas, and the model will produce JSON tool calls inside its trace, then incorporate the tool outputs into the next step. This makes QwQ usable as the backbone of an agent loop, including coding agents that compile and run their own output.[3]
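A minimal sketch of one turn of such an agent loop follows. Everything here is illustrative: the `add` tool, the `REGISTRY` dispatcher, and `run_tool_calls` are hypothetical, the schema uses the JSON-schema style common to chat-completions APIs, and the `<tool_call>` wrapper is an assumption based on the Qwen 2.5 chat template rather than something the article's sources specify.

```python
import json
import re

# Hypothetical tool schema passed to the model alongside the conversation.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "add",
        "description": "Add two integers.",
        "parameters": {
            "type": "object",
            "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
            "required": ["a", "b"],
        },
    },
}]

# Local implementations keyed by tool name.
REGISTRY = {"add": lambda a, b: a + b}

# Assumed wrapper format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def run_tool_calls(model_output: str) -> list[dict]:
    """Execute every tool call found in the output and return tool messages."""
    messages = []
    for raw in TOOL_CALL_RE.findall(model_output):
        call = json.loads(raw)
        result = REGISTRY[call["name"]](**call["arguments"])
        messages.append(
            {"role": "tool", "name": call["name"], "content": json.dumps(result)}
        )
    return messages
```

The returned messages would be appended to the conversation so the model can incorporate the tool outputs in its next step.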
QwQ's traces tend to be longer and more conversational than DeepSeek-R1's, which lean toward dense, near-formal step-by-step derivations. They are also more visible than OpenAI o1's, which by default returns a hidden chain of thought and a short summary. In side-by-side comparisons posted by community evaluators in early 2025, QwQ would frequently restart a calculation when it noticed an error, sometimes producing two or three full attempts at the same problem. R1, by contrast, more often reached the answer in one pass with shorter self-correction inserts. The two styles are different products of similar training pipelines and reflect different choices about how to seed the cold-start data.[5][6]
The two checkpoints were evaluated on broadly the same benchmarks but at different points in time. The numbers below come from Alibaba's own blog posts and accompanying Hugging Face model cards. As with most reasoning model benchmarks in this period, exact scores varied across community evaluations, sometimes by several points, depending on temperature, sampling strategy, and answer-extraction rules.[1][3][6]
| Benchmark | QwQ-32B | Notes |
|---|---|---|
| AIME 2024 | 79.5 | Pass@1; competition math |
| MATH-500 | 90.6 | Reasoning over 500 hand-curated problems |
| GPQA Diamond | 65.2 | Graduate-level science multiple choice |
| LiveCodeBench (v5) | 63.4 | Programming contest problems with hidden tests |
| BFCL | 66.4 | Berkeley Function Calling Leaderboard |
| IFEval | 83.9 | Instruction-following format compliance |
| LiveBench | 73.1 | Aggregated reasoning, math, and coding subsets |
Readers should treat these numbers as the publisher's reported figures rather than independently verified ones. Several of the AIME and GPQA scores in particular were re-evaluated by the community using different extraction rules and landed within a few points of the reported numbers, but with non-trivial spread. The MATH-500 figure of 90.6 is the most widely cited and has been roughly reproduced by downstream evaluators.[1][3][6]
The Preview model, released about three months earlier, reported lower scores on the two benchmarks that moved most. Alibaba's Preview blog reported around 50.0 on AIME 2024, 90.6 on MATH-500 (the same number, since this benchmark was central to the cold-start data), 65.2 on GPQA, and 50.0 on LiveCodeBench. The headline gains in the full release were on AIME (50.0 to 79.5) and LiveCodeBench (50.0 to 63.4), both of which improved sharply with the longer RL run.[1][3]
The table below puts QwQ-32B alongside other reasoning systems that were either current or close in time at the March 2025 release. The numbers are drawn from each lab's own release post, not from a unified evaluation; comparisons should be read accordingly.
| Model | Developer | Type | AIME 2024 | MATH-500 | GPQA Diamond | LiveCodeBench |
|---|---|---|---|---|---|---|
| QwQ-32B | Alibaba | 32B dense, open | 79.5 | 90.6 | 65.2 | 63.4 |
| DeepSeek-R1 | DeepSeek | 671B MoE (37B active), open | ~79.8 | ~97.3 | ~71.5 | ~65.9 |
| OpenAI o1 (full) | OpenAI | Proprietary | ~79.2 | ~96.4 | ~75.7 | ~63.4 |
| OpenAI o1-mini | OpenAI | Proprietary, smaller | ~63.6 | ~90.0 | ~60.0 | ~53.8 |
| OpenAI o3-mini (high) | OpenAI | Proprietary | ~83.6 | ~97.9 | ~77.0 | ~66.3 |
| Claude 3.7 Sonnet (extended thinking) | Anthropic | Proprietary | ~61.3 | not reported | ~78.2 | not directly reported |
| Gemini 2.5 Pro | Google | Proprietary | ~92.0 | not reported | ~84.0 | not directly reported |
The interpretation that received the most attention in coverage at the time was that QwQ-32B sat in the same neighborhood as DeepSeek-R1 on AIME, GPQA, and LiveCodeBench despite having roughly one-twentieth of the total parameters and needing far less inference memory. On MATH-500, R1 retained a clear lead, and on GPQA Diamond the closed reasoning models from OpenAI and Anthropic remained ahead.[3][4][6]
QwQ received broad attention in both English and Chinese AI media. Coverage in VentureBeat, SiliconANGLE, and The Decoder framed the March 2025 release as a major proof point for small, open-weight reasoning models. The headline running through most stories was that a 32B dense model could match a 671B MoE on competition math while being cheap enough to run on a single workstation GPU after quantization.[2][4][5]
On the financial side, Alibaba's Hong Kong-listed shares closed up 8.39% on March 6, 2025, the trading day after the QwQ-32B announcement, briefly touching a 52-week high. The company's New York-listed ADR rose roughly 2.5% in pre-market trading on the same news. Bloomberg, CNBC, and Reuters tied the rally to an ongoing rotation into Chinese AI stocks that had begun with the DeepSeek-R1 release in January 2025, sometimes called the "DeepSeek shock." Bernstein analysts argued at the time that the QwQ release positioned Alibaba's cloud and AI earnings on what they called a "more upwardly-pointing trajectory," and a Hang Seng Tech sub-index of Chinese tech names climbed in sympathy. QwQ-32B was widely read as Alibaba's direct answer to the DeepSeek shock, demonstrating that the Qwen team could ship a reasoning model competitive with DeepSeek's without the heavy parameter count.[4][11]
Within the open-source community, QwQ became a standard baseline for fine-tuning experiments. By mid-2025, Hugging Face listed thousands of QwQ-derived checkpoints, including domain-specific math fine-tunes, agent fine-tunes, function-calling adaptations, and quantized GGUF builds for local inference under Ollama and llama.cpp. The model's combination of permissive licensing, manageable size, and strong baseline reasoning made it an easy starting point for further RL experiments.[5][6]
Quantization providers shipped their own packagings within days. The Qwen team published an official AWQ 4-bit checkpoint at Qwen/QwQ-32B-AWQ and an official GGUF build at Qwen/QwQ-32B-GGUF. Independent maintainers including the LM Studio community, Unsloth, Bartowski, and Mungert mirrored the same weights across additional quantization formats and bit widths, with most builds preserving the full 131K context window. By mid-2025, the cluster of QwQ-related repositories on Hugging Face was among the most downloaded model families in the Qwen organization, behind only the main Qwen 2.5 chat and base lines.[9][10]
A recurring point in coverage was that QwQ-32B is small enough to run on a single high-end consumer GPU after INT4 quantization. The 4-bit version of the 32.5B-parameter model requires roughly 20 GB of VRAM for inference, which fits on a single RTX 4090 (24 GB) or RTX 5090 (32 GB). Apple Silicon machines with 32 GB or more of unified memory can run the same quantized weights through llama.cpp's Metal backend. This is in sharp contrast to DeepSeek-R1, which requires a multi-GPU server even at INT4, and to OpenAI o1, which is not available locally at all. For many small teams, QwQ was the first reasoning model they could run on their own hardware without a cloud bill.[3][5][9]
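The memory figures above follow from simple arithmetic on parameter count and bits per weight. The sketch below computes weight storage only; the gap between the 16.25 GB it gives for 4-bit QwQ-32B and the roughly 20 GB quoted above is assumed to be KV cache, activations, quantization metadata, and runtime overhead, all of which vary by stack and context length. The function name is hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9.
    return params_billion * bits_per_weight / 8


qwq_int4 = weight_memory_gb(32.5, 4)    # 16.25 GB: fits a 24 GB consumer card
qwq_bf16 = weight_memory_gb(32.5, 16)   # 65.0 GB: needs an 80 GB data-center card
r1_int4 = weight_memory_gb(671, 4)      # 335.5 GB: multi-GPU even at 4 bits

print(qwq_int4, qwq_bf16, r1_int4)
```

The same calculation shows why DeepSeek-R1 needs a multi-GPU server even at INT4: its weights alone exceed the capacity of any single accelerator named in this article.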
For higher-throughput serving, the Qwen team and community converged on three main inference stacks:
| Stack | Typical hardware | Notes |
|---|---|---|
| vLLM | Single A100/H100 80 GB (BF16) or 24-32 GB consumer card (INT4) | Uses PagedAttention and continuous batching; supports tensor parallelism for multi-GPU shards |
| SGLang | Single A100/H100 or multi-GPU cluster | Optimized for structured outputs and tool-call streaming; commonly used for agent workloads |
| Ollama and llama.cpp | Consumer GPU, Apple Silicon, or CPU with enough RAM | GGUF format with INT4 to INT8 quantization; preferred for local desktop and edge deployment |
A standard vllm serve configuration for QwQ-32B uses --tensor-parallel-size 1 on an 80 GB card or --tensor-parallel-size 2 across two 48 GB cards, with --max-model-len 131072 to expose the full context window. YaRN extension must be enabled for prompts longer than 32K tokens, since the base RoPE configuration is calibrated for the shorter window.[9][12]
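As a rough illustration of the configuration described above, the snippet below derives the YaRN scaling factor from the two window sizes and prints a `rope_scaling` fragment plus a launch command. The `rope_scaling` key names follow the convention shown in Qwen model cards as we understand it, and should be checked against the current Qwen and vLLM documentation before use; only the two flags named in the text appear in the command.

```python
import json

NATIVE_WINDOW = 32_768    # window the base RoPE configuration is calibrated for
TARGET_WINDOW = 131_072   # full QwQ-32B context length

# The YaRN scaling factor is the ratio of the target to the native window.
rope_scaling = {
    "type": "yarn",
    "factor": TARGET_WINDOW / NATIVE_WINDOW,
    "original_max_position_embeddings": NATIVE_WINDOW,
}

# Fragment to merge into the checkpoint's config.json (assumed key names).
print(json.dumps({"rope_scaling": rope_scaling}, indent=2))

# Launch command for a single 80 GB card, using the flags named in the text.
command = [
    "vllm", "serve", "Qwen/QwQ-32B",
    "--tensor-parallel-size", "1",
    "--max-model-len", str(TARGET_WINDOW),
]
print(" ".join(command))
```

For two 48 GB cards, the only change in this sketch would be `--tensor-parallel-size 2`.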
On Alibaba Cloud Model Studio, QwQ-32B is exposed through the DashScope chat-completions API under the qwq-plus family of identifiers. Pricing is metered separately for input and output tokens, with output tokens charged at a premium that reflects the longer reasoning traces typical of the model. The same endpoint backs the free Qwen Chat web interface at chat.qwen.ai, which uses a daily message quota rather than per-token billing. Several third-party inference providers, including SiliconFlow, OpenRouter, Fireworks AI, Together AI, and Groq, added hosted QwQ-32B endpoints in the weeks after the open-weights release, sometimes offering lower per-token prices than the official Alibaba endpoint by trading off context length or response speed.[11][12]
In academic papers published through 2025, QwQ-32B appeared as a baseline in studies of reasoning faithfulness, chain-of-thought robustness, and reward modeling. Researchers used the model partly because the weights and tokenizer are open, partly because the visible chain of thought makes it easier to instrument, and partly because the training recipe is documented enough to attempt reproductions. Several follow-up papers reported reproducing the AIME and MATH-500 numbers within a few points using their own evaluations.[6][10]
Reception was not uniformly positive. Practical reviewers raised three recurring criticisms.[5][10]
QwQ-32B's open weights and Apache 2.0 license led to a substantial derivative ecosystem on Hugging Face within months of release. The most common derivative families include:[6][9]
The Qwen team did not publish an official QwQ-Coder variant. Coding capabilities were instead carried forward through Qwen2.5-Coder and, later, the Qwen3-Coder line.[7][9]
The QwQ models inherit most of the limitations of their Qwen2.5 base, plus a few that are specific to reasoning models. The team's Hugging Face model card and blog posts list several explicit caveats.[1][3][9]
The most discussed quirk is mid-trace language switching. The Preview model would sometimes begin reasoning in English, drift into Chinese for a stretch (especially on math problems with Chinese training sources), and then return to English for the final answer. Some traces switched several times within a single response. The QwQ-32B release reduced this behavior by adding format-compliance rewards in the second RL stage, but reviewers and academic write-ups continued to find examples through 2025. The team frames this as an artefact of the bilingual cold-start data, not a bug to be fully eliminated, since the underlying reasoning is often correct regardless of which language carried it.[1][5][10]
Like most early reasoning models, QwQ can fall into reasoning loops. On harder problems the trace may repeat a near-identical line of analysis several times, occasionally hitting the maximum generation budget without committing to a final answer. This was less common in QwQ-32B than in the Preview, but it is still a documented failure mode and a frequent source of wasted tokens at inference time.[5][10]
Reasoning models in this generation produce long traces by design, but QwQ leans long. Default outputs commonly run 2,000 to 8,000 tokens for moderately hard problems, and can run longer on AIME-style questions. This shifts cost from training to inference and changes the economics of deployment compared with standard chat models. Quantized local inference reduces the dollar cost but not the wall-clock time, which can stretch into many seconds even on modern hardware.[3][6]
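To make the wall-clock point concrete, the sketch below converts trace length into decode time. The decode rate of 40 tokens per second is a purely hypothetical figure chosen for illustration, not a measured QwQ-32B throughput; real rates depend on hardware, quantization, and batch size.

```python
def decode_seconds(trace_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a trace at a steady decode rate, ignoring prefill."""
    return trace_tokens / tokens_per_second


ASSUMED_RATE = 40.0  # hypothetical tokens per second, for illustration only

short_trace = decode_seconds(2_000, ASSUMED_RATE)   # 50.0 seconds
long_trace = decode_seconds(8_000, ASSUMED_RATE)    # 200.0 seconds

print(short_trace, long_trace)
```

Under that assumed rate, the 2,000 to 8,000 token range quoted above corresponds to roughly one to three minutes per response, before any longer AIME-style traces are considered.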
A limitation that became more visible only after the launch of Qwen 3 is that QwQ-32B does not expose a user-adjustable thinking budget. The model always emits a full reasoning trace before its answer, and there is no API parameter to cap the length or to switch the model into a non-thinking, direct-answer mode. Later Qwen 3 thinking variants added a controllable thinking budget (in some configurations up to roughly 38K tokens) that lets callers trade latency against reasoning depth. Applications built on QwQ that need faster, shorter responses on easy queries have to fall back on stop-token tricks or move to a different model.[12][13]
QwQ does not eliminate hallucinations. The visible chain of thought sometimes makes errors easier to catch, but it also produces confident-looking intermediate steps that turn out to be wrong. Standard AI safety caveats apply: the model should not be relied on for high-stakes factual questions without external verification, and the trace should not be read as a faithful audit of the underlying decision process. Several papers in 2025 examined chain-of-thought faithfulness on QwQ specifically and found the trace and the final answer were sometimes loosely coupled, especially under adversarial prompting.[6][10]
Like other open-weight models from Chinese labs, QwQ inherits some refusal behavior on politically sensitive topics related to China. The model will often decline to discuss specific historical events or political figures, or produce a sanitized response. Several community-built "abliterated" variants strip these refusals to suit different use cases. Conversely, on technical and scientific topics, the model is broadly forthcoming, with refusals concentrated mostly around explicit harm-enabling content. The base behavior is shaped by Chinese regulatory requirements that apply to publicly distributed models in mainland China and is consistent with the safety post-training applied to other Qwen 2.5 checkpoints.[5][9][10]
QwQ is the first link in a chain of reasoning-flavored Qwen releases that ran through 2025 and into 2026.
| Model | Release | Domain | Approach |
|---|---|---|---|
| QwQ-32B-Preview | Nov 28, 2024 | Text reasoning (math, code, science) | Cold-start fine-tune plus short RL on a 32B dense base |
| QvQ-72B-Preview | Dec 24, 2024 | Vision reasoning | Reasoning post-training on top of Qwen2-VL-72B for image-grounded math and science |
| QwQ-32B | Mar 5, 2025 | Text reasoning | Two-stage RL with rule-based and general rewards; adds tool use |
| Qwen 3 Thinking variants | Apr 28, 2025 | General | Hybrid "thinking / non-thinking" mode in a single model, controllable per request |
| Qwen3-Max-Thinking | Sep 2025 | Frontier reasoning | Hosted-only flagship with extended thinking budget |
The trajectory from QwQ to the Qwen 3 thinking variants shows a clear consolidation. QwQ was a separate product line, with its own checkpoint and its own behavior. By the time Qwen 3 launched in April 2025, the team had folded thinking into the same checkpoints as the standard chat behavior, controllable through a flag. QwQ-32B continued to be maintained on Hugging Face but was no longer the team's flagship reasoning model from mid-2025 onward.[7][9]
The Qwen 3 release in April 2025 effectively absorbed the QwQ direction. Qwen3-32B, the dense 32-billion-parameter member of the Qwen 3 family, ships with a hybrid thinking mode that toggles between fast, non-thinking replies and deliberate reasoning traces, controlled by an enable_thinking flag (and later by a thinking-budget parameter on the hosted API). On Qwen3-32B in thinking mode, the team reports higher scores than QwQ-32B on AIME 2024, AIME 2025, LiveCodeBench, BFCL, and LiveBench, while preserving QwQ-style chain-of-thought traces.[12][13]
The Qwen 3 technical report frames this as a clean superset: anything a caller previously did with QwQ-32B can be done with Qwen3-32B in thinking mode, plus a non-thinking mode for cheap, fast turns. Alibaba did not formally deprecate the QwQ checkpoints. As of mid-2026, both QwQ-32B and QwQ-32B-Preview remain available on Hugging Face and on the DashScope API under the qwq-plus and qwq-32b model IDs, with date-pinned snapshots for reproducibility. They are listed in Alibaba Cloud's documentation under thinking-only models, alongside Qwen 3 thinking variants.[11][12]
In practice, most new builds in late 2025 and 2026 use Qwen 3 thinking variants or the larger Qwen3-Max-Thinking flagship rather than QwQ-32B directly. QwQ-32B's lasting role has been as a reproducible reasoning baseline: a checkpoint at a known training cut, with a documented recipe, that researchers can pull and rerun against new benchmarks without worrying that the model has been silently updated underneath them.[6][13]
On the vision side, QvQ-72B-Preview, released December 24, 2024, applied a similar reasoning training recipe to the Qwen2-VL-72B base for image-grounded reasoning. QvQ is a separate model from QwQ but shares the Qwen team's broader bet that reasoning post-training generalizes across modalities. The two models are sometimes confused because of their similar names; QwQ is text-only, while QvQ is multimodal.[7]
In practice QwQ was deployed as a reasoning backbone for a fairly narrow set of tasks where its strengths matter: research and analysis assistants that handle long math or scientific reasoning, coding agents that need to iterate against test results, and educational tools that benefit from a visible thinking trace. The model is generally a poor choice for high-throughput chat, customer support, or any setting where short, confident replies are expected; non-thinking models like Qwen2.5-32B-Instruct are usually a better fit there.[5][9]
On the agent side, several open-source coding agent frameworks (such as community ports of OpenHands and SWE-agent) shipped QwQ-32B presets, taking advantage of the function calling capability and the long context window. For mathematics tutoring and contest preparation, communities around platforms such as Project Euler and competitive programming sites began sharing QwQ-driven workflows where the model not only solved problems but produced study notes from its own traces.[5][6][10]
The name QwQ is a play on the Qwen brand and on a popular Asian internet kaomoji, where "QwQ" represents a tearful, slightly overwhelmed face. The Qwen team's blog post for the Preview opens with this kaomoji and uses it as a visual anchor throughout. The expanded form, "Qwen with Questions," is the team's official gloss; the pronunciation "kwuh" is used in their video material and in talks. Some English-language coverage rendered it as "quee-ew" or simply spelled it out as "Q-W-Q."[1][5]
The playful branding stood out among the more austere house styles of frontier labs. Where OpenAI named its reasoning system "o1" and DeepSeek used the more clinical "R1," Alibaba leaned into a deliberately informal mascot, signaling that the model itself was supposed to feel more like a curious student than a polished assistant. This matched the visible-chain-of-thought design choice and was widely commented on in launch coverage.[1][5]
A reader looking for a single-paragraph framing might compare QwQ-32B to the rest of the early reasoning model wave like this. OpenAI o1 and the later o3 and GPT-5 thinking modes are the closed reference points. DeepSeek-R1 is the open heavyweight, much larger and slightly stronger on math, but expensive to host. Claude 3.7 Sonnet and Claude Opus 4 extended thinking modes prioritize reliability and tool use. Gemini 2.5 Pro and Deep Think push the upper end of frontier benchmarks. QwQ-32B is the small, open, single-GPU reasoning model that punches above its weight, and that, more than anything else, is what made it interesting to the community.[3][4][5][6]