Group Sequence Policy Optimization (GSPO)

Updated Jul 11, 2026

Overview

Group Sequence Policy Optimization (GSPO) is a reinforcement learning algorithm for training large language models, introduced by the Qwen team at Alibaba in July 2025 ^[1]^[2]. Its defining feature is that the importance sampling ratio, and with it clipping, rewarding, and optimization, is computed at the level of the whole generated sequence rather than individual tokens. This contrasts with GRPO (Group Relative Policy Optimization), the widely used algorithm it is designed to replace, which operates token by token.

The authors argue that token-level importance ratios are statistically ill-founded for language-model RL and are the root cause of the training instability and occasional "model collapse" seen when GRPO is run for many steps or applied to large mixture-of-experts (MoE) models. By matching the granularity of the importance ratio to the granularity of the reward, which is normally assigned to a complete response, GSPO produces lower-variance gradients, trains more stably, and removes the need for the "Routing Replay" workaround previously required to keep MoE RL from diverging ^[1]^[2]. The Qwen team credits GSPO with contributing to improvements in its Qwen3 models, several of which are large MoE systems ^[1]^[2].

GSPO was described in the paper "Group Sequence Policy Optimization" (arXiv:2507.18071) by Chujie Zheng, Shixuan Liu, Mingze Li, and colleagues, and in an accompanying Qwen blog post published on 27 July 2025 ^[1]^[2].

Background: GRPO and its instability

Proximal Policy Optimization (PPO) is the standard policy-gradient method behind RLHF ^[4]. PPO trains a separate value network (a critic) to estimate advantages and optimizes a clipped, token-level surrogate objective. GRPO, introduced in DeepSeekMath and popularized by DeepSeek-R1, removes the critic: instead of a learned value function, it samples a group of $G$ responses for each query and computes each response's advantage from the group's reward statistics ^[3]. For response $y_i$ with scalar reward $r_i$, the advantage is

A_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)}.

This same advantage is broadcast to every token in the response. GRPO then reuses PPO's token-level machinery: for token $t$ of response $i$ it forms the per-token importance ratio

w_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})},

clips it to the interval $[1 - \epsilon, 1 + \epsilon]$, multiplies by $A_i$, and averages the clipped surrogate over all tokens and all responses.
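The GRPO recipe described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `group_advantages` and `grpo_surrogate` are hypothetical, the use of the population standard deviation is an assumption, and so is the choice to average tokens within each response before averaging over the group.

```python
import math
import statistics


def group_advantages(rewards):
    """Group-relative advantage A_i = (r_i - mean) / std over the G responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption of this sketch
    if std == 0.0:
        return [0.0 for _ in rewards]  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]


def grpo_surrogate(new_logps, old_logps, advantages, eps=0.2):
    """Token-level clipped surrogate.

    new_logps / old_logps: one list of per-token log-probabilities per response.
    Tokens are averaged within each response, then responses are averaged
    (an assumed aggregation order).
    """
    per_response = []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        terms = []
        for n, o in zip(new_lp, old_lp):
            w = math.exp(n - o)  # per-token importance ratio w_{i,t}
            w_clipped = min(max(w, 1.0 - eps), 1.0 + eps)
            terms.append(min(w * adv, w_clipped * adv))
        per_response.append(sum(terms) / len(terms))
    return sum(per_response) / len(per_response)


# Hypothetical group of four responses with binary rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)
print(adv)
```

Every token of a response receives the same `adv` value, while the ratio `w` is recomputed from a single sampled token at each position.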

The GSPO authors identify a conceptual flaw in this design ^[1]^[2]. Importance sampling is meant to correct for the mismatch between the distribution that generated the data (the old policy) and the distribution being optimized (the new policy) by reweighting many samples. At each token position, however, GRPO has only a single sample, so the per-token ratio cannot perform the averaging that makes an importance weight valid; it simply injects high-variance noise into the gradient. Because that noise enters at every token position and is amplified rather than damped by clipping, it accumulates over long responses and can destabilize the policy, producing the irreversible collapse observed in extended training runs.

The problem is sharpest for MoE models. After each gradient update the expert-routing network can change which experts a token activates: the Qwen team reports that for a 48-layer MoE roughly 10 percent of the activated experts differ from one step to the next ^[1]^[2]. That makes the per-token likelihoods, and therefore the token-level ratios, swing wildly even when the model as a whole has barely changed. The previous remedy, "Routing Replay," caches the old policy's routing decisions and forces the new policy to reuse them so the ratios stay meaningful. This works but adds memory and communication overhead and constrains the model's usable capacity ^[1]^[2].

How GSPO works

Sequence-level importance ratio

GSPO keeps GRPO's critic-free, group-relative advantage $A_i$ but replaces the token-level ratio with a single ratio for the entire response. Because the raw likelihood ratio of a whole sequence is a product of many per-token factors and would grow or shrink exponentially with length, GSPO normalizes it by the response length $|y_i|$:

\begin{aligned} s_i(\theta) &= \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)} \right)^{1 / |y_i|} \\ &= \exp\left( \frac{1}{|y_i|} \sum_t \log\left[ \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})} \right] \right). \end{aligned}

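The exponent $1 / |y_i|$ turns the sequence ratio into the geometric mean of the per-token ratios, or equivalently the exponential of the arithmetic mean of the per-token log-ratios. It keeps $s_i$ close to 1, controls variance, and places responses of different lengths on a common numerical scale ^[1]^[2].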
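The effect of length normalization is easy to see numerically. The sketch below compares the raw sequence likelihood ratio with the length-normalized $s_i$. The function names and the constant per-token shift of 0.01 are illustrative assumptions.

```python
import math


def raw_sequence_ratio(new_logps, old_logps):
    """Unnormalized ratio pi_new(y|x) / pi_old(y|x): the product of token ratios."""
    return math.exp(sum(n - o for n, o in zip(new_logps, old_logps)))


def sequence_ratio(new_logps, old_logps):
    """Length-normalized ratio s_i = exp(mean over tokens of the log-ratio)."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


# Hypothetical responses in which every token's log-probability rose by 0.01.
for length in (10, 1000):
    old = [-2.0] * length
    new = [-1.99] * length
    print(length, raw_sequence_ratio(new, old), sequence_ratio(new, old))
```

With these inputs the raw ratio grows from about 1.1 at 10 tokens to more than 20,000 at 1,000 tokens, while the normalized ratio stays near 1.01 at both lengths.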

Objective and clipping

The GSPO objective applies one clipped term per response rather than per token:

J_{\text{GSPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_i \min\left( s_i(\theta) A_i, \mathrm{clip}(s_i(\theta), 1 - \epsilon, 1 + \epsilon) A_i \right) \right].

Clipping now acts on the whole sequence: if a response's sequence ratio leaves the trust region, the entire response, not just one token, is excluded from the gradient update. Because $s_i$ is length-normalized and therefore tightly concentrated near 1, the useful clipping range is far narrower than GRPO's; in the team's MoE experiment the clip thresholds were about 3e-4 on the left and 4e-4 on the right, versus roughly 0.2 for GRPO ^[1]. Counterintuitively, GSPO clips a much larger fraction of tokens, around two orders of magnitude more, yet still trains more efficiently, because the gradient signal it keeps is cleaner ^[1]^[2].
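A minimal sketch of the sequence-level objective follows, under stated assumptions. The function name `gspo_surrogate` is hypothetical. The separate `eps_low` and `eps_high` parameters are an assumption that mirrors the asymmetric thresholds quoted above, whereas the formula uses a single $\epsilon$. The "clipped token fraction" counts every token of a response whose clipped branch is selected.

```python
import math


def gspo_surrogate(new_logps, old_logps, advantages, eps_low=3e-4, eps_high=4e-4):
    """One clipped term per response, averaged over the group.

    Returns (surrogate value, fraction of tokens in clipped responses).
    """
    terms, clipped_tokens, total_tokens = [], 0, 0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        n_tok = len(new_lp)
        # Length-normalized sequence ratio s_i.
        s = math.exp(sum(n - o for n, o in zip(new_lp, old_lp)) / n_tok)
        s_clipped = min(max(s, 1.0 - eps_low), 1.0 + eps_high)
        unclipped_term = s * adv
        clipped_term = s_clipped * adv
        terms.append(min(unclipped_term, clipped_term))
        # The clipped branch does not depend on the policy parameters, so when
        # it is selected every token of this response drops out of the gradient.
        if clipped_term < unclipped_term:
            clipped_tokens += n_tok
        total_tokens += n_tok
    return sum(terms) / len(terms), clipped_tokens / total_tokens


# Hypothetical group of two responses.
old = [[-1.0] * 4, [-1.0] * 6]
new = [[-0.999] * 4, [-0.9999] * 6]  # per-token log-ratios of 1e-3 and 1e-4
value, clipped_fraction = gspo_surrogate(new, old, advantages=[1.0, -1.0])
print(value, clipped_fraction)
```

In this example the first response has $s_i \approx 1.001$, which exceeds the right threshold while its advantage is positive, so all four of its tokens are clipped together. The second response stays inside the range and contributes its full gradient.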

GSPO-token variant

For settings that need per-token credit assignment, such as multi-turn agentic RL, the paper introduces GSPO-token. It restores a per-token advantage $A_{i,t}$ while keeping the sequence ratio's stability through a stop-gradient construction:

s_{i,t}(\theta) = \mathrm{sg}[s_i(\theta)] \cdot \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\mathrm{sg}[\pi_\theta(y_{i,t} \mid x, y_{i,<t})]},

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient (detach) operation. When all per-token advantages equal the sequence advantage, GSPO-token is numerically identical to GSPO and yields the same gradients; its only added value is the flexibility to shape advantages token by token ^[1].
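The equivalence can be checked directly from the definitions above. In value, the second factor of $s_{i,t}(\theta)$ equals 1, so $s_{i,t}(\theta) = s_i(\theta)$ numerically and every token of a response shares the same clipping decision. For the gradient, only the numerator $\pi_\theta$ is differentiated:

\nabla_\theta s_{i,t}(\theta) = \mathrm{sg}[s_i(\theta)] \cdot \frac{\nabla_\theta \pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\mathrm{sg}[\pi_\theta(y_{i,t} \mid x, y_{i,<t})]} = s_i(\theta) \, \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t}).

Assume the per-token terms are averaged within each response with weight $1 / |y_i|$; this is an assumption about the form of the GSPO-token objective, which is not written out here. Setting $A_{i,t} = A_i$ for an unclipped response then gives

\frac{1}{|y_i|} \sum_t A_i \, \nabla_\theta s_{i,t}(\theta) = A_i \, s_i(\theta) \cdot \frac{1}{|y_i|} \sum_t \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t}) = A_i \, \nabla_\theta s_i(\theta),

where the last step uses $\nabla_\theta s_i = s_i \nabla_\theta \log s_i$ together with the definition of $s_i$. Under that assumption, the per-token construction reproduces the gradient of the sequence-level term.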

Comparison to GRPO and PPO

Aspect	PPO	GRPO	GSPO
Critic / value network	Learned critic	None (group baseline)	None (group baseline)
Advantage estimate	GAE from critic	Group-relative, normalized reward	Group-relative, normalized reward
Importance ratio unit	Token	Token	Sequence (length-normalized)
Clipping level	Per token	Per token	Per sequence
Typical clip range	about 0.1 to 0.2	about 0.2	about 3e-4 to 4e-4
Reward vs. importance-sampling (IS) granularity	matched (token critic)	mismatched (token IS, sequence reward)	matched (sequence IS, sequence reward)
MoE RL stability	typically applied to dense models	needs Routing Replay	stable without Routing Replay

The conceptual through-line is that GSPO makes the unit of optimization match the unit of reward: rewards in RLHF and RLVR (reinforcement learning with verifiable rewards) are assigned to complete sequences, so the off-policy correction should be applied to complete sequences too. PPO sidesteps the per-token sampling issue by learning a value function, but at the cost of training and serving a separate critic; GRPO drops the critic but keeps the token-level ratio, which GSPO argues is exactly the part that breaks ^[1]^[2].

Applications

GSPO was developed for and deployed in Qwen's own training pipeline, and the team states it contributed to improvements in its Qwen3 models, which include large MoE systems ^[1]^[2]. The reported experiments fine-tuned a cold-start checkpoint derived from Qwen3-30B-A3B-Base, a 48-layer MoE model, and evaluated on reasoning and coding benchmarks including AIME 2024, LiveCodeBench, and CodeForces, where GSPO reached higher training accuracy and benchmark scores than GRPO under equal compute ^[1].

Beyond raw scores, the authors emphasize two practical infrastructure benefits. First, GSPO trains MoE models stably without Routing Replay, simplifying the system and freeing model capacity ^[1]^[2]. Second, because GSPO depends only on a sequence-level likelihood and is insensitive to small per-token numerical differences, it tolerates the precision gap between the optimized inference engines used to generate rollouts and the training engine. That makes it practical to optimize directly on inference-engine likelihoods, which is attractive for partial rollouts, multi-turn RL, and disaggregated train-and-infer deployments ^[1]^[2]. Since its release, GSPO has been discussed and adapted in follow-up work on RLVR and on policy-optimization methods that mix token-level and sequence-level signals ^[5]^[6].

Limitations

GSPO is recent and was largely validated by its originating lab, so independent, large-scale reproductions are still accumulating. Several tradeoffs are noted by the authors and by subsequent papers:

Coarser clipping. Because clipping is all or nothing at the sequence level, a single out-of-range response is discarded as a whole, so GSPO removes a far larger fraction of generated tokens from each update than GRPO does ^[1]. Follow-up analyses report that group-based schemes like GSPO can discard more than 10 percent of responses that carry non-zero advantage, which wastes rollout compute ^[5].
Length sensitivity. The $1 / |y_i|$ normalization stabilizes the ratio but couples the objective to response length; later work argues this leaves a residual length bias and proposes length-unbiased variants ^[5].
Token versus sequence granularity. Pure sequence-level optimization can blur the fine-grained, token-level credit assignment that some tasks need; hybrid methods that combine token-level and sequence-level signals have since been proposed ^[6]. The GSPO-token variant only partially addresses this.
Inherited group-RL costs. GSPO still relies on sampling a group of responses per prompt and on a verifiable or learned reward, so it carries the rollout cost and reward-design burden common to group-based RLVR methods.

References

1. Zheng, Chujie; Liu, Shixuan; Li, Mingze; Chen, Xiong-Hui; Yu, Bowen; Gao, Chang; Dang, Kai; Liu, Yuqiong; Men, Rui; Yang, An; Zhou, Jingren; Lin, Junyang. "Group Sequence Policy Optimization." arXiv:2507.18071, July 2025. https://arxiv.org/abs/2507.18071
2. Qwen Team. "GSPO: Towards Scalable Reinforcement Learning for Language Models." Qwen blog, 27 July 2025. https://qwenlm.github.io/blog/gspo/
3. Shao, Zhihong; et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300, 2024. https://arxiv.org/abs/2402.03300
4. Schulman, John; Wolski, Filip; Dhariwal, Prafulla; Radford, Alec; Klimov, Oleg. "Proximal Policy Optimization Algorithms." arXiv:1707.06347, 2017. https://arxiv.org/abs/1707.06347
5. "Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR." arXiv:2602.05261, 2026. https://arxiv.org/abs/2602.05261
6. "Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR." arXiv:2601.05607, 2026. https://arxiv.org/abs/2601.05607
