QwQ

Chinese AI Large Language Models Open Source AI Reasoning Models

33 min read

Updated Jun 24, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 24, 2026

Fact-checked

In review queue

Sources

15 citations

Revision

v4 · 6,503 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

QwQ
Developer	Alibaba Cloud Qwen team
Series	Qwen
First release	November 28, 2024 (QwQ-32B-Preview)
Full release	March 5, 2025 (QwQ-32B)
Type	Large language model (reasoning model)
Architecture	Dense Transformer decoder
Parameters	32.5 billion (dense)
Base model	Qwen2.5-32B
Context length	32,768 tokens (Preview); 131,072 tokens (QwQ-32B)
Training method	Cold-start fine-tuning plus reinforcement learning with rule-based and general-purpose rewards
License	Apache 2.0
Pronunciation	"kwuh" (sometimes spelled "quee-ew"); short for "Qwen with Questions"
Distribution	Hugging Face, ModelScope, Qwen Chat, Alibaba Cloud Model Studio

QwQ is a family of open-weight reasoning models from the Qwen team at Alibaba Cloud, built to compete with OpenAI's o1 and DeepSeek-R1 at a fraction of their size. Its flagship checkpoint, QwQ-32B, is a 32.5-billion-parameter dense model that Alibaba reported reaches "performance comparable to DeepSeek-R1, which boasts 671 billion parameters (with 37 billion activated)," a roughly 20-fold reduction in parameter count for similar math and coding results.^[2] The series has two public releases under the Apache 2.0 license: QwQ-32B-Preview, an experimental research preview shipped on November 28, 2024, and QwQ-32B, the full release on March 5, 2025, both distributed via Hugging Face and ModelScope.^[1]^[2]^[3] The name QwQ stands for "Qwen with Questions," pronounced loosely as "kwuh," and is paired with a playful, anime-styled mascot in the team's blog posts.^[1]

The two checkpoints share a 32.5-billion-parameter dense decoder built on the Qwen2.5-32B base model and use the same tokenizer as the rest of the Qwen 2.5 family. Where the Preview version was a research artefact released to gather feedback, QwQ-32B was trained with a multi-stage reinforcement learning pipeline. The pipeline combined rule-based rewards for mathematics and code with general reward models for instruction following, formatting, and tool use. Alibaba reported that QwQ-32B reached parity with DeepSeek-R1, a model with roughly twenty times more total parameters, on several reasoning benchmarks. The release was followed by an 8.39% jump in Alibaba's Hong Kong-listed shares the same week, briefly pushing the stock to a 52-week high.^[2]^[3]^[4]^[11]

QwQ sits within Alibaba's broader push into reasoning systems, alongside QvQ (a vision-reasoning preview from December 2024) and the later thinking variants in the Qwen 3 family. It is widely cited as one of the first small, openly licensed models to match much larger proprietary reasoning systems on math and code benchmarks, and as one of the first widely circulated examples of a visible, "thinking-aloud" chain-of-thought trace from a Chinese lab.^[5]^[6]

What is the Qwen series QwQ comes from?

Qwen, short for Tongyi Qianwen (通义千问), is the family of large language models developed by Alibaba Cloud's Qwen team since 2023. By the time QwQ-32B-Preview shipped, the team had already released the Qwen 1, Qwen 1.5, and Qwen 2 generations, followed by Qwen 2.5 in September 2024. The Qwen 2.5 family covered seven dense sizes from 0.5B to 72B parameters, all sharing a 128K context window and a 150K-vocabulary tokenizer. It also included two specialized branches: Qwen2.5-Coder and Qwen2.5-Math, each tuned with extra code and math data on top of the same base.^[7]^[8]

QwQ was the team's first reasoning-focused entry. The Qwen 2.5 base models had already shown strong math performance for their size, especially in the 32B and 72B variants. The reasoning track took the 32B base as the starting point because, in the team's own framing, 32B was the smallest size at which long chain-of-thought traces remained coherent under their training recipe, while still being small enough to run on a single high-end consumer GPU after quantization.^[1]^[3]

Why did the Qwen team build a reasoning variant?

The immediate context for QwQ was the September 2024 release of OpenAI's o1, which popularized the idea of a model that spends extra inference compute on a hidden chain of thought before answering. Within a few weeks several labs began publishing their own reasoning experiments. The Qwen team's blog post for the Preview, titled "QwQ: Reflect Deeply on the Boundaries of the Unknown," framed QwQ as both a research preview and an open invitation, releasing the weights, the system prompt, and a small set of example traces so that the community could study the failure modes directly.^[1]

The team also positioned the model against DeepSeek-R1, released two months later in January 2025. DeepSeek-R1 was a 671-billion-parameter Mixture of Experts model with 37B active parameters, vastly larger than QwQ-32B. Alibaba's argument with the March 2025 release was that a much smaller dense model, when trained with the right RL recipe, could approach R1's results on math and coding benchmarks while running on far cheaper hardware.^[2]^[3]^[4]

Is QwQ open source?

Both QwQ-32B-Preview and QwQ-32B were released under the permissive Apache 2.0 license, allowing commercial use, modification, and redistribution without requiring a separate Alibaba license. This contrasted with some earlier Qwen releases, which were governed by the more restrictive Tongyi Qianwen license for the largest checkpoints. The Apache 2.0 choice mattered because it let downstream teams ship QwQ inside commercial products and run their own RL fine-tunes without legal review.^[2]^[9]

When was QwQ released, and what are the variants?

The two QwQ checkpoints differ less in size than in maturity. The Preview was a research drop with limited training and a shorter context, while QwQ-32B was the production release with a full RL pipeline and the same 131,072-token window as the underlying Qwen 2.5-32B base.

Model	Release date	Parameters	Context length	Status	License
QwQ-32B-Preview	November 28, 2024	32.5B dense	32,768 tokens	Experimental research preview	Apache 2.0
QwQ-32B	March 5, 2025	32.5B dense	131,072 tokens	General release	Apache 2.0

Alibaba also lists the model on its commercial endpoints. QwQ-32B is available through Alibaba Cloud Model Studio's DashScope API and through the free Qwen Chat web interface at chat.qwen.ai, where it appears as a selectable thinking model. The same weights are mirrored on Hugging Face under Qwen/QwQ-32B-Preview and Qwen/QwQ-32B, and on ModelScope under the equivalent identifiers.^[1]^[2]^[9]

On DashScope, the production endpoint is exposed as qwq-plus and qwq-plus-latest, with date-pinned snapshots such as qwq-plus-2025-03-05 available so applications can lock to a specific training cut. These hosted variants accept the same chat-completions schema as the rest of the Qwen API and route to the same 32B weights, but they operate in a thinking-only mode where the reasoning trace is always emitted before the final answer.^[11]^[12]

What is QwQ's architecture?

QwQ is a dense decoder-only Transformer, not a Mixture of Experts model. The team kept the architecture identical to Qwen2.5-32B so that the Preview and the final QwQ-32B checkpoint could load into the same inference stack as the rest of the Qwen 2.5 family.^[1]^[3]

Key architectural details published by the team or recoverable from the Hugging Face configuration files include:^[1]^[7]^[8]

64 transformer layers with grouped-query attention (40 query heads, 8 key-value heads).
Hidden size 5,120 with a 27,648 feed-forward dimension and SwiGLU activations.
Rotary position embeddings (RoPE) with a base frequency tuned for long context, plus YaRN-style extension to support the 131K window in QwQ-32B.
Tokenizer: the same byte-pair tokenizer used across Qwen 2.5, with a vocabulary of around 152K tokens.
BF16 weights at release; the team published an INT4 quantized variant alongside the main checkpoint that runs on a single 24 GB GPU.

The choice to keep the architecture unchanged was deliberate. By using the same backbone as Qwen2.5-32B, the team could reuse all of the inference and fine-tuning tooling the community had already built for that model, including vLLM, SGLang, Ollama, and llama.cpp paths.^[3]^[9]

How was QwQ trained?

Starting point and data

Both QwQ checkpoints start from the Qwen2.5-32B base. The Preview was produced with a relatively short post-training pipeline: a curated supervised fine-tune on long chain-of-thought traces drawn from competition math, programming, and scientific reasoning sources, followed by a limited RL stage. The team did not publish a full technical report for the Preview, and described the run in their blog as a "first attempt."^[1]

For QwQ-32B, Alibaba gave more public detail. The team summarized the recipe directly: "We began with a cold-start checkpoint and implemented a reinforcement learning (RL) scaling approach driven by outcome-based rewards."^[2] The training pipeline used two main RL stages on top of that initial cold-start fine-tune.^[3]

Stage	Goal	Reward signal	Data
Cold start	Teach the base model to produce long, structured reasoning traces.	Supervised loss on curated traces.	Hand-filtered competition math, code, and science problems with worked solutions.
RL stage 1	Improve math and code accuracy.	Rule-based: numerical answer match for math, unit-test pass for code.	Math word problems with verifiable final answers; programming problems with hidden test cases.
RL stage 2	Improve general instruction following, format compliance, tool use, and human preference.	General reward model plus rule-based checks (for example, IFEval-style format constraints).	Mixed instruction-following, agent, and chat data.

The rule-based design for stage 1 follows the same family of "verifiable rewards" used by DeepSeek-R1 and described in subsequent open papers. Instead of a learned reward model, the system grades each rollout against a deterministic checker: for math, the model's final answer must match the gold answer after normalization; for code, the generated solution must compile and pass every hidden test case. This avoids the reward hacking failure modes that plagued earlier reinforcement learning from human feedback systems, since there is nothing to reward-hack except the actual task.^[3]^[6]

The second RL stage is closer to standard RLHF. A general reward model, trained on Qwen team preference data, is used to score outputs on broader axes: helpfulness, instruction following, formatting, refusal behavior, and tool-call correctness. The team reported that this stage was kept short, because longer runs began to erode the math and code gains from the first stage.^[3]

The accuracy verifier and code execution server

A central piece of the QwQ-32B pipeline that the team described in more depth than most contemporaries was the pair of automated graders that score rollouts during stage 1. As the Qwen team put it, "rather than relying on traditional reward models, we used an accuracy verifier for math problems to ensure the correctness of final solutions and a code execution server to assess whether the generated codes successfully pass predefined test cases."^[2] The math accuracy verifier is a normalizer plus equality checker. It strips formatting from the model's final-answer span, applies arithmetic and symbolic normalization (for example, reducing fractions, canonicalizing surds, and parsing LaTeX), and compares against the gold answer. Partial credit is not given. Either the answers match after normalization, in which case the rollout earns a positive reward, or they do not, in which case the reward is zero.^[2]^[3]^[6]

The code execution server is conceptually similar but runs against a hidden test suite. Each programming problem ships with a battery of public and hidden test cases. Generated solutions are compiled, sandboxed, and executed against the full suite. A rollout earns reward only if every hidden test passes within the time limit. Edge cases, off-by-one errors, and floating-point tolerance issues are treated as failures, which forces the policy to produce robust code rather than code that merely passes a single example.^[2]^[3]^[6]

What made this setup unusual for early 2025 was the scale at which it ran. The team argued that "scaling Reinforcement Learning (RL) has the potential to enhance model performance beyond conventional pretraining and post-training methods," and described "continuous RL scaling" on top of these verifiers, meaning the policy was trained for many more steps than was standard in earlier supervised pipelines. The headline AIME and LiveCodeBench gains from Preview to full release are attributed largely to this extended RL training rather than to changes in the underlying architecture or the cold-start data.^[2]^[3]

Tool use and function calling

A notable addition in QwQ-32B over the Preview was support for function calling and agentic tool use. The released checkpoint can emit structured tool calls, observe tool outputs, and continue reasoning, all inside the same chain of thought. The team trained this capability during the second RL stage, using rollouts where the policy is given access to a sandboxed Python interpreter and a small set of search and retrieval tools. Reported BFCL (Berkeley Function Calling Leaderboard) scores placed QwQ-32B above the contemporaneous Qwen2.5-32B-Instruct and roughly comparable to several closed-weight reasoning systems.^[3]

Why did rule-based rewards matter?

The choice of verifiable rewards in stage 1 reflects a wider shift in how reasoning models were being trained at the start of 2025. Earlier RLHF-style pipelines relied on a learned reward model that predicted human preferences. These reward models tended to overfit, especially at the long-tail end of difficulty where they had little training signal, and the resulting policies often gamed the reward instead of solving the underlying task. Verifiable rewards remove that loop. A math problem with a known answer either gets the right answer or it does not. A programming problem either passes the hidden tests or it does not. The reward is binary, deterministic, and impossible to game without actually solving the problem.^[3]^[6]

The trade-off is data coverage. Verifiable rewards only work where you can write a checker, which restricts stage 1 to math, code, and a narrow slice of formal science. Open-ended writing, persuasion, summarization, and most chat tasks have no checker, so the second RL stage has to fall back on a learned reward model. This is one reason QwQ is much stronger on quantitative tasks than on creative or open-ended ones, and it is also why the Qwen team reports a sharper improvement on AIME and LiveCodeBench than on subjective benchmarks like Arena-Hard.^[3]^[6]

How does QwQ reason, and what can it do?

QwQ produces visible chain-of-thought traces by default. Unlike o1, which hides its internal reasoning from end users, QwQ surfaces its thinking inline. A typical trace begins with the model restating the problem, sketches a plan, works through intermediate steps, sometimes catches and corrects its own mistakes, and finishes with a clearly delimited final answer. The team's example traces in the Preview blog post explicitly highlight self-correction lines such as "Wait, let me check this again," "Actually, that step is wrong," or "Hmm, I should reconsider."^[1]

The "thinking-aloud" voice is one of QwQ's most distinctive features. It feels closer to a student working a problem out loud than to the polished, hidden reasoning of a closed system. This made QwQ popular as a teaching artefact: educators and researchers could read the trace and see exactly where the model decided what. It also made the model's mistakes more legible, since wrong intermediate steps usually appear in plain text rather than being summarized away.^[5]^[6]

Domains

QwQ-32B is tuned primarily for three domains:^[1]^[3]

Mathematics: AIME-style competition problems, MATH-500, GSM-style word problems, and Olympiad-level questions. The model uses rule-based verification during training, which biases the trace toward producing a clearly extractable numeric or algebraic answer.
Code: LiveCodeBench-style programming contest problems, with hidden tests as the supervision signal. The model often writes out a brief plan in natural language, then a code block, then a worked-through example to sanity-check the code.
Science and general reasoning: GPQA-Diamond questions, LiveBench reasoning subsets, and instruction following. Performance here is strong but more uneven than on math and code, in part because the rewards in stage 2 are noisier.

Function calling and agents

The full QwQ-32B release added function calling capability. A user can pass a list of tool schemas, and the model will produce JSON tool calls inside its trace, then incorporate the tool outputs into the next step. This makes QwQ usable as the backbone of an agent loop, including coding agents that compile and run their own output.^[3]

How does QwQ's reasoning style compare with peers?

QwQ's traces tend to be longer and more conversational than DeepSeek-R1's, which lean toward dense, near-formal step-by-step derivations. They are also more visible than OpenAI o1's, which by default returns a hidden chain of thought and a short summary. In side-by-side comparisons posted by community evaluators in early 2025, QwQ would frequently restart a calculation when it noticed an error, sometimes producing two or three full attempts at the same problem. R1, by contrast, more often reached the answer in one pass with shorter self-correction inserts. The two styles are different products of similar training pipelines and reflect different choices about how to seed the cold-start data.^[5]^[6]

How does QwQ-32B perform on benchmarks?

The two checkpoints were evaluated on broadly the same benchmarks but at different points in time. The numbers below come from Alibaba's own blog posts and accompanying Hugging Face model cards. As with most reasoning model benchmarks in this period, exact scores varied across community evaluations, sometimes by several points, depending on temperature, sampling strategy, and answer-extraction rules.^[1]^[3]^[6]

QwQ-32B benchmark scores (March 2025)

Benchmark	QwQ-32B	Notes
AIME 2024	79.5	Pass@1; competition math
MATH-500	90.6	Reasoning over 500 hand-curated problems
GPQA Diamond	65.2	Graduate-level science multiple choice
LiveCodeBench (v5)	63.4	Programming contest problems with hidden tests
BFCL	66.4	Berkeley Function Calling Leaderboard
IFEval	83.9	Instruction-following format compliance
LiveBench	73.1	Aggregated reasoning, math, and coding subsets

Readers should treat these numbers as the publisher's reported figures rather than independently verified ones. Several of the AIME and GPQA scores in particular were re-evaluated by the community using different extraction rules and landed within a few points of the reported numbers, but with non-trivial spread. The independent evaluation firm Artificial Analysis, for example, measured QwQ-32B at 59.5% on GPQA Diamond in its own run, several points below the 65.2 Alibaba reported, while broadly agreeing that the model approached DeepSeek-R1's level. The MATH-500 figure of 90.6 is the most widely cited and has been roughly reproduced by downstream evaluators.^[1]^[3]^[6]^[14]

The Preview model, released four months earlier, reported lower scores on most of these benchmarks. Alibaba's Preview blog reported 50.0 on AIME 2024, 90.6 on MATH-500 (the same number, since this benchmark was central to the cold-start data), 65.2 on GPQA, and 50.0 on LiveCodeBench. The headline gain from the full release was on AIME (50.0 to 79.5) and LiveCodeBench (50.0 to 63.4), both of which improved sharply with the longer RL run.^[1]^[3]

Comparison with peer reasoning models

The table below puts QwQ-32B alongside other reasoning systems that were either current or close in time at the March 2025 release. The numbers are drawn from each lab's own release post, not from a unified evaluation; comparisons should be read accordingly.

Model	Developer	Type	AIME 2024	MATH-500	GPQA Diamond	LiveCodeBench
QwQ-32B	Alibaba	32B dense, open	79.5	90.6	65.2	63.4
DeepSeek-R1	DeepSeek	671B MoE (37B active), open	~79.8	~97.3	~71.5	~65.9
OpenAI o1 (full)	OpenAI	Proprietary	~79.2	~96.4	~75.7	~63.4
OpenAI o1-mini	OpenAI	Proprietary, smaller	~63.6	~90.0	~60.0	~53.8
OpenAI o3-mini (high)	OpenAI	Proprietary	~83.6	~97.9	~77.0	~66.3
Claude 3.7 Sonnet (extended thinking)	Anthropic	Proprietary	~61.3	not reported	~78.2	not directly reported
Gemini 2.5 Pro	Google	Proprietary	~92.0	not reported	~84.0	not directly reported

The interpretation that received the most attention from coverage at the time was that QwQ-32B sat in the same neighborhood as DeepSeek-R1 on AIME, GPQA, and LiveCodeBench despite using roughly twenty times fewer total parameters and far less inference memory. By Alibaba's own accounting, QwQ-32B outperformed R1 on three of the five benchmarks the company used to compare the two models. On MATH-500, R1 retained a clear lead, and on GPQA Diamond the closed reasoning models from OpenAI and Anthropic remained ahead.^[3]^[4]^[6]

How was QwQ received and adopted?

QwQ received broad attention in both English and Chinese AI media. Coverage in VentureBeat, SiliconANGLE, and The Decoder framed the March 2025 release as a major proof point for small, open-weight reasoning models. The headline running through most stories was that a 32B dense model could match a 671B MoE on competition math while being cheap enough to run on a single workstation GPU after quantization.^[2]^[4]^[5]

Market reaction

On the financial side, Alibaba's Hong Kong-listed shares closed up 8.39% on March 6, 2025, the trading day after the QwQ-32B announcement, briefly touching a 52-week high. The company's New York-listed ADR rose roughly 2.5% in pre-market trading on the same news. In its own release, Alibaba said the model "rivals cutting-edge reasoning models, e.g., DeepSeek-R1." Bloomberg, CNBC, and Reuters tied the rally to an ongoing rotation into Chinese AI stocks that had begun with the DeepSeek-R1 release in January 2025, sometimes called the "DeepSeek shock." Bernstein analysts argued at the time that the QwQ release positioned Alibaba's cloud and AI earnings on what they called a "more upwardly-pointing trajectory," and a Hang Seng Tech sub-index of Chinese tech names climbed in sympathy. QwQ-32B was widely read as Alibaba's direct answer to that moment, demonstrating that the Qwen team could ship a competitive reasoning model on a footing comparable to DeepSeek without the heavy parameter count.^[2]^[4]^[11]

Community uptake

Within the open-source community, QwQ became a standard baseline for fine-tuning experiments. By mid-2025, Hugging Face listed thousands of QwQ-derived checkpoints, including domain-specific math fine-tunes, agent fine-tunes, function-calling adaptations, and quantized GGUF builds for local inference under Ollama and llama.cpp. The model's combination of permissive licensing, manageable size, and strong baseline reasoning made it an easy starting point for further RL experiments.^[5]^[6]

Quantization providers shipped their own packagings within days. The Qwen team published an official AWQ 4-bit checkpoint at Qwen/QwQ-32B-AWQ and an official GGUF build at Qwen/QwQ-32B-GGUF. Independent maintainers including the LM Studio community, Unsloth, Bartowski, and Mungert mirrored the same weights across additional quantization formats and bit widths, with most builds preserving the full 131K context window. By mid-2025, the cluster of QwQ-related repositories on Hugging Face was among the most downloaded reasoning-model families in the Qwen organization, behind only the main Qwen 2.5 chat and base lines.^[9]^[10]

What hardware does QwQ-32B need?

A recurring point in coverage was that QwQ-32B is small enough to run on a single high-end consumer GPU after INT4 quantization. The 4-bit version of the 32.5B-parameter model requires roughly 20 GB of VRAM for inference, which fits on a single RTX 4090 (24 GB) or RTX 5090 (32 GB). Apple Silicon machines with 32 GB or more of unified memory can run the same quantized weights through llama.cpp's Metal backend. This is in sharp contrast to DeepSeek-R1, which by some estimates requires more than 1,500 GB of memory across roughly 16 NVIDIA A100 GPUs to serve in full, and to OpenAI o1, which is not available locally at all. For many small teams, QwQ was the first reasoning model they could run on their own hardware without a cloud bill.^[3]^[5]^[9]^[15]

For higher-throughput serving, the Qwen team and community converged on three main inference stacks:

Stack	Typical hardware	Notes
vLLM	Single A100/H100 80 GB (BF16) or 24-32 GB consumer card (INT4)	Uses PagedAttention and continuous batching; supports tensor parallelism for multi-GPU shards
SGLang	Single A100/H100 or multi-GPU cluster	Optimized for structured outputs and tool-call streaming; commonly used for agent workloads
Ollama and llama.cpp	Consumer GPU, Apple Silicon, or CPU with enough RAM	GGUF format with INT4 to INT8 quantization; preferred for local desktop and edge deployment

A standard vllm serve configuration for QwQ-32B uses --tensor-parallel-size 1 on an 80 GB card or --tensor-parallel-size 2 across two 48 GB cards, with --max-model-len 131072 to expose the full context window. YaRN extension must be enabled for prompts longer than 32K tokens, since the base RoPE configuration is calibrated for the shorter window.^[9]^[12]

Pricing and hosted access

On Alibaba Cloud Model Studio, QwQ-32B is exposed through the DashScope chat-completions API under the qwq-plus family of identifiers. Pricing is metered separately for input and output tokens, with output tokens charged at a premium that reflects the longer reasoning traces typical of the model. The same endpoint backs the free Qwen Chat web interface at chat.qwen.ai, which uses a daily message quota rather than per-token billing. Several third-party inference providers, including SiliconFlow, OpenRouter, Fireworks AI, Together AI, and Groq, added hosted QwQ-32B endpoints in the weeks after the open-weights release, sometimes offering lower per-token prices than the official Alibaba endpoint by trading off context length or response speed.^[11]^[12]

Academic uptake

In academic papers published through 2025, QwQ-32B appeared as a baseline in studies of reasoning faithfulness, chain-of-thought robustness, and reward modeling. Researchers used the model partly because the weights and tokenizer are open, partly because the visible chain of thought makes it easier to instrument, and partly because the training recipe is documented enough to attempt reproductions. Several follow-up papers reported reproducing the AIME and MATH-500 numbers within a few points using their own evaluations.^[6]^[10]

Reception was not uniformly positive. Practical reviewers raised three recurring criticisms.^[5]^[10]

The model is verbose. Default traces often run several thousand tokens for problems where a non-thinking model would answer in fifty. This makes inference more expensive in practice, and it makes API rate limiting bite harder.
The chain of thought sometimes loops. Reviewers documented examples in which the model repeats a near-identical analysis several times before committing to an answer, particularly on harder GPQA-style questions.
The model occasionally switches languages mid-trace. English prompts sometimes get partially-Chinese reasoning, and vice versa. The team described this as a known limitation and partially mitigated it in QwQ-32B by adding format constraints to the second RL stage, but it was never fully eliminated.

Distillations and derivatives

QwQ-32B's open weights and Apache 2.0 license led to a substantial derivative ecosystem on Hugging Face within months of release. The most common derivative families include:^[6]^[9]

GGUF and EXL2 quantizations: Community-built INT4, INT5, INT6, and INT8 quantizations from groups such as Bartowski and TheBloke-style maintainers, sized for single-GPU and Apple Silicon inference.
Domain fine-tunes: Math-focused, code-focused, and agent-focused fine-tunes that further train on competition or coding data on top of the QwQ base.
Uncensored or "abliterated" variants: Community releases that strip the safety post-training, similar to those produced for other open-weight models.
Distillations onto smaller bases: Some teams produced reasoning-style fine-tunes of smaller Qwen 2.5 sizes (7B, 14B) using QwQ-32B traces as teacher data, broadly analogous to the DeepSeek-R1 distill series.

The Qwen team did not publish an official QwQ-Coder variant. Coding capabilities were instead carried forward through Qwen2.5-Coder and, later, the Qwen3-Coder line.^[7]^[9]

What are QwQ's limitations and quirks?

The QwQ models inherit most of the limitations of their Qwen2.5 base, plus a few that are specific to reasoning models. The team's Hugging Face model card and blog posts list several explicit caveats, and warn that "the model requires enhanced safety measures to ensure reliable and secure performance, and users should exercise caution when deploying it."^[1]^[3]^[9]

Language switching

The most discussed quirk is mid-trace language switching. The Qwen team listed it as a known limitation in the Preview model card: "the model may mix languages or switch between them unexpectedly, affecting response clarity."^[1] In practice the Preview would sometimes begin reasoning in English, drift into Chinese for a stretch (especially on math problems with Chinese training sources), and then return to English for the final answer. Some traces switched several times within a single response. The QwQ-32B release reduced this behavior by adding format-compliance rewards in the second RL stage, but reviewers and academic write-ups continued to find examples through 2025. The team frames this as an artefact of the bilingual cold-start data, not a bug to be fully eliminated, since the underlying reasoning is often correct regardless of which language carried it.^[1]^[5]^[10]

Repetitive thinking

Like most early reasoning models, QwQ can fall into reasoning loops. The Preview model card warns that "the model may enter circular reasoning patterns, leading to lengthy responses without a conclusive answer."^[1] On harder problems the trace may repeat a near-identical line of analysis several times, occasionally hitting the maximum generation budget without committing to a final answer. This was less common in QwQ-32B than in the Preview, but it is still a documented failure mode and a frequent source of wasted tokens at inference time.^[1]^[5]^[10]

Token-heavy inference

Reasoning models in this generation produce long traces by design, but QwQ leans long. Default outputs commonly run 2,000 to 8,000 tokens for moderately hard problems, and can run longer on AIME-style questions. This shifts cost from training to inference and changes the economics of deployment compared with standard chat models. Quantized local inference reduces the dollar cost but not the wall-clock time, which can stretch into many seconds even on modern hardware.^[3]^[6]

No thinking budget control

A limitation that became more visible only after the launch of Qwen 3 is that QwQ-32B does not expose a user-adjustable thinking budget. The model always emits a full reasoning trace before its answer, and there is no API parameter to cap the length or to switch the model into a non-thinking, direct-answer mode. Later Qwen 3 thinking variants added a controllable thinking budget (in some configurations up to roughly 38K tokens) that lets callers trade latency against reasoning depth. Applications built on QwQ that need faster, shorter responses on easy queries have to fall back on stop-token tricks or move to a different model.^[12]^[13]

Hallucinations and safety

QwQ does not eliminate hallucinations. The visible chain of thought sometimes makes errors easier to catch, but it also produces confident-looking intermediate steps that turn out to be wrong. Standard AI safety caveats apply: the model should not be relied on for high-stakes factual questions without external verification, and the trace should not be read as a faithful audit of the underlying decision process. Several papers in 2025 examined chain-of-thought faithfulness on QwQ specifically and found the trace and the final answer were sometimes loosely coupled, especially under adversarial prompting.^[6]^[10]

Bias and refusals

Like other open-weight models from Chinese labs, QwQ inherits some refusal behavior on politically sensitive topics related to China. The model will often decline to discuss specific historical events or political figures, or produce a sanitized response. Several community-built "abliterated" variants strip these refusals to suit different use cases. Conversely, on technical and scientific topics, the model is broadly forthcoming, with refusals concentrated mostly around explicit harm-enabling content. The base behavior is shaped by Chinese regulatory requirements that apply to publicly distributed models in mainland China and is consistent with the safety post-training applied to other Qwen 2.5 checkpoints.^[5]^[9]^[10]

Where does QwQ fit in Alibaba's reasoning lineup?

QwQ is the first link in a chain of reasoning-flavored Qwen releases that ran through 2025 and into 2026.

Model	Release	Domain	Approach
QwQ-32B-Preview	Nov 28, 2024	Text reasoning (math, code, science)	Cold-start fine-tune plus short RL on a 32B dense base
QvQ-72B-Preview	Dec 24, 2024	Vision reasoning	Reasoning post-training on top of Qwen2-VL-72B for image-grounded math and science
QwQ-32B	Mar 5, 2025	Text reasoning	Two-stage RL with rule-based and general rewards; adds tool use
Qwen 3 Thinking variants	Apr 28, 2025	General	Hybrid "thinking / non-thinking" mode in a single model, controllable per request
Qwen3-Max-Thinking	Sep 2025	Frontier reasoning	Hosted-only flagship with extended thinking budget

The trajectory from QwQ to the Qwen 3 thinking variants shows a clear consolidation. QwQ was a separate product line, with its own checkpoint and its own behavior. By the time Qwen 3 launched in April 2025, the team had folded thinking back into the same dense model as the standard chat behavior, controllable through a flag. QwQ-32B continued to be maintained on Hugging Face but was no longer the team's flagship reasoning model from mid-2025 onward.^[7]^[9]

How did Qwen 3 succeed QwQ?

The Qwen 3 release in April 2025 effectively absorbed the QwQ direction. Qwen3-32B, the dense 32-billion-parameter member of the Qwen 3 family, ships with a hybrid thinking mode that toggles between fast, non-thinking replies and deliberate reasoning traces, controlled by an enable_thinking flag (and later by a thinking-budget parameter on the hosted API). On Qwen3-32B in thinking mode, the team reports higher scores than QwQ-32B on AIME 2024, AIME 2025, LiveCodeBench, BFCL, and LiveBench, while preserving QwQ-style chain-of-thought traces.^[12]^[13]

The Qwen 3 technical report frames this as a clean superset: anything a caller previously did with QwQ-32B can be done with Qwen3-32B in thinking mode, plus a non-thinking mode for cheap, fast turns. Alibaba did not formally deprecate the QwQ checkpoints. As of mid-2026, both QwQ-32B and QwQ-32B-Preview remain available on Hugging Face and on the DashScope API under the qwq-plus and qwq-32b model IDs, with date-pinned snapshots for reproducibility. They are listed in Alibaba Cloud's documentation under thinking-only models, alongside Qwen 3 thinking variants.^[11]^[12]

In practice, most new builds in late 2025 and 2026 use Qwen 3 thinking variants or the larger Qwen3-Max-Thinking flagship rather than QwQ-32B directly. QwQ-32B's lasting role has been as a reproducible reasoning baseline: a checkpoint at a known training cut, with a documented recipe, that researchers can pull and rerun against new benchmarks without worrying that the model has been silently updated underneath them.^[6]^[13]

On the vision side, QvQ-72B-Preview, released December 24, 2024, applied a similar reasoning training recipe to the Qwen2-VL-72B base for image-grounded reasoning. QvQ is a separate model from QwQ but shares the Qwen team's broader bet that reasoning post-training generalizes across modalities. The two models are sometimes confused because of their similar names; QwQ is text-only, while QvQ is multimodal.^[7]

What is QwQ used for?

In practice QwQ was deployed as a reasoning backbone for a fairly narrow set of tasks where its strengths matter: research and analysis assistants that handle long math or scientific reasoning, coding agents that need to iterate against test results, and educational tools that benefit from a visible thinking trace. The model is generally a poor choice for high-throughput chat, customer support, or any setting where short, confident replies are expected; non-thinking models like Qwen2.5-32B-Instruct are usually a better fit there.^[5]^[9]

On the agent side, several open-source coding agent frameworks (such as community ports of OpenHands and SWE-agent) shipped QwQ-32B presets, taking advantage of the function calling capability and the long context window. For mathematics tutoring and contest preparation, communities around platforms such as Project Euler and competitive programming sites began sharing QwQ-driven workflows where the model not only solved problems but produced study notes from its own traces.^[5]^[6]^[10]

Why is the model called QwQ?

The name QwQ is a play on the Qwen brand and on a popular Asian internet kaomoji, where "QwQ" represents a tearful, slightly overwhelmed face. The Qwen team's blog post for the Preview opens with this kaomoji and uses it as a visual anchor throughout. The expanded form, "Qwen with Questions," is the team's official gloss; the pronunciation "kwuh" is used in their video material and in talks. Some English-language coverage rendered it as "quee-ew" or simply spelled it out as "Q-W-Q."^[1]^[5]

The playful branding stood out among the more austere house styles of frontier labs. Where OpenAI named its reasoning system "o1" and DeepSeek used the more clinical "R1," Alibaba leaned into a deliberately informal mascot, signaling that the model itself was supposed to feel more like a curious student than a polished assistant. This matched the visible-chain-of-thought design choice and was widely commented on in launch coverage.^[1]^[5]

Comparison summary

A reader looking for a single-paragraph framing might compare QwQ-32B to the rest of the early reasoning model wave like this. OpenAI o1 and the later o3 and GPT-5 thinking modes are the closed reference points. DeepSeek-R1 is the open heavyweight, much larger and slightly stronger on math, but expensive to host. Claude 3.7 Sonnet and Claude Opus 4 extended thinking modes prioritize reliability and tool use. Gemini 2.5 Pro and Deep Think push the upper end of frontier benchmarks. QwQ-32B is the small, open, single-GPU reasoning model that punches above its weight, and that, more than anything else, is what made it interesting to the community.^[3]^[4]^[5]^[6]

References

"QwQ: Reflect Deeply on the Boundaries of the Unknown." Qwen Team Blog, November 28, 2024. https://qwenlm.github.io/blog/qwq-32b-preview/ ↩
"QwQ-32B: Embracing the Power of Reinforcement Learning." Qwen Team Blog, March 5, 2025. https://qwenlm.github.io/blog/qwq-32b/ ↩
"Qwen/QwQ-32B model card." Hugging Face. https://huggingface.co/Qwen/QwQ-32B ↩
"Alibaba's Qwen team releases QwQ-32B reasoning model, shares jump." SiliconANGLE, March 6, 2025. https://siliconangle.com/2025/03/06/alibaba-shares-jump-new-open-source-qwq-32b-reasoning-model/ ↩
"Alibaba's new open source QwQ-32B reasoning model is small enough to run on your computer." VentureBeat, March 5, 2025. https://venturebeat.com/ai/alibabas-new-open-source-model-qwq-32b-matches-deepseek-r1-with-way-smaller-compute-requirements ↩
"State of reasoning models, 2025." Hugging Face Blog, mid-2025. ↩
"Qwen2.5: A Party of Foundation Models!" Qwen Team Blog, September 2024. https://qwenlm.github.io/blog/qwen2.5/ ↩
"Qwen2.5 Technical Report." arXiv:2412.15115, December 2024. https://arxiv.org/abs/2412.15115 ↩
"Qwen/QwQ-32B-Preview model card." Hugging Face. https://huggingface.co/Qwen/QwQ-32B-Preview ↩
"QwQ GitHub repository." https://github.com/QwenLM/QwQ ↩
"Alibaba shares soar after Chinese tech giant unveils DeepSeek rival QwQ-32B." CNBC, March 6, 2025. https://www.cnbc.com/2025/03/06/alibaba-shares-soar-after-chinese-tech-giant-unveils-deepseek-rival-qwq-32b.html ↩
"Using deep thinking models." Alibaba Cloud Model Studio documentation, 2025. https://www.alibabacloud.com/help/en/model-studio/deep-thinking ↩
"Qwen3 Technical Report." arXiv:2505.09388, May 2025. https://arxiv.org/abs/2505.09388 ↩
Artificial Analysis. "Alibaba launches QwQ-32B, an open weights reasoning model that may approach DeepSeek R1's level of intelligence." March 2025. https://artificialanalysis.ai/models/qwq-32b ↩
"QwQ-32B: Features, Access, DeepSeek-R1 Comparison & More." DataCamp, March 2025. https://www.datacamp.com/blog/qwq-32b ↩

External links

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

3 revisions by 1 contributors · full history

Suggest edit

What links here

AI Model Release Timeline (2022-2026)DiLoCo Extended thinking QvQ Qwen Qwen3-Max Reasoning (artificial intelligence)Skywork-R1V vLLM

What is the Qwen series QwQ comes from?

Why did the Qwen team build a reasoning variant?

Is QwQ open source?

When was QwQ released, and what are the variants?

What is QwQ's architecture?

How was QwQ trained?

Starting point and data

The accuracy verifier and code execution server

Tool use and function calling

Why did rule-based rewards matter?

How does QwQ reason, and what can it do?

Domains

Function calling and agents

How does QwQ's reasoning style compare with peers?

How does QwQ-32B perform on benchmarks?

QwQ-32B benchmark scores (March 2025)

Comparison with peer reasoning models

How was QwQ received and adopted?

Market reaction

Community uptake

What hardware does QwQ-32B need?

Pricing and hosted access

Academic uptake

Distillations and derivatives

What are QwQ's limitations and quirks?

Language switching

Repetitive thinking

Token-heavy inference

No thinking budget control

Hallucinations and safety

Bias and refusals

Where does QwQ fit in Alibaba's reasoning lineup?

How did Qwen 3 succeed QwQ?

What is QwQ used for?

Why is the model called QwQ?

Comparison summary

See also

References

External links

Improve this article

Related Articles

DeepSeek-R1-Distill

DeepSeek V3.1

Kimi K2 Thinking

Marco-o1

DeepSeek-R1

MiniMax M1

What links here

Related Articles

DeepSeek-R1-Distill

DeepSeek V3.1

Kimi K2 Thinking

Marco-o1

DeepSeek-R1

MiniMax M1

What links here