Group Sequence Policy Optimization (GSPO)
Last reviewed
Jun 8, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,687 words
Group Sequence Policy Optimization (GSPO) is a reinforcement learning algorithm for training large language models, introduced by the Qwen team at Alibaba in July 2025 [1][2]. Its defining feature is that the importance-sampling ratio, clipping, rewarding, and optimization all operate at the level of the whole generated sequence rather than individual tokens. This contrasts with GRPO (Group Relative Policy Optimization), the widely used algorithm it is designed to replace, which operates token by token.
The authors argue that token-level importance ratios are statistically ill-founded for language-model RL and are the root cause of the training instability and occasional "model collapse" seen when GRPO is run for many steps or applied to large mixture-of-experts (MoE) models. By matching the granularity of the importance ratio to the granularity of the reward, which is normally assigned to a complete response, GSPO produces lower-variance gradients, trains more stably, and removes the need for the "Routing Replay" workaround previously required to keep MoE RL from diverging [1][2]. The Qwen team credits GSPO with contributing to improvements in its Qwen3 models, several of which are large MoE systems [1][2].
GSPO was described in the paper "Group Sequence Policy Optimization" (arXiv:2507.18071) by Chujie Zheng, Shixuan Liu, Mingze Li, and colleagues, and in an accompanying Qwen blog post published on 27 July 2025 [1][2].
Proximal Policy Optimization (PPO) is the standard policy-gradient method behind RLHF. PPO trains a separate value network (a critic) to estimate advantages and optimizes a clipped, token-level surrogate objective. GRPO, introduced in DeepSeekMath and popularized by DeepSeek-R1, removes the critic: instead of a learned value function it samples a group of G responses for each query and computes each response's advantage from the group's reward statistics [3]. For response y_i with scalar reward r_i, the advantage is
A_i = (r_i - mean(r_1 .. r_G)) / std(r_1 .. r_G).
This same advantage is broadcast to every token in the response. GRPO then reuses PPO's token-level machinery: for token t of response i it forms the per-token importance ratio
w_{i,t}(theta) = pi_theta(y_{i,t} | x, y_{i,<t}) / pi_theta_old(y_{i,t} | x, y_{i,<t}),
clips it to the interval [1 - eps, 1 + eps], multiplies by A_i, and averages the clipped surrogate over all tokens and all responses.
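The group-relative advantage and GRPO's token-level clipped surrogate can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited papers: the function names, the use of the population standard deviation, the small constant guarding against a zero denominator, and the choice to average within each response before averaging across responses are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, tiny=1e-8):
    """Group-relative advantage: (r_i - mean) / std over the G responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; `tiny` avoids division by zero
    return [(r - mu) / (sigma + tiny) for r in rewards]


def grpo_surrogate(new_logps, old_logps, advantages, eps=0.2):
    """Clipped token-level surrogate for one group of responses.

    new_logps / old_logps: one list of per-token log-probabilities per response.
    advantages: one scalar A_i per response, broadcast to all of its tokens.
    """
    per_response = []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        terms = []
        for lp_new, lp_old in zip(new_lp, old_lp):
            w = math.exp(lp_new - lp_old)              # per-token ratio w_{i,t}
            w_clipped = min(max(w, 1 - eps), 1 + eps)  # clip to [1 - eps, 1 + eps]
            terms.append(min(w * adv, w_clipped * adv))
        per_response.append(mean(terms))               # average over tokens
    return mean(per_response)                          # average over responses


# Toy group of G = 4 responses with binary rewards (made-up numbers).
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Each response's tokens all share the same advantage, while every token gets its own ratio and its own clipping decision.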
The GSPO authors identify a conceptual flaw in this design [1][2]. Importance sampling is meant to correct for the mismatch between the distribution that generated the data (the old policy) and the distribution being optimized (the new policy) by reweighting many samples. At each token position, however, GRPO has only a single sample, so the per-token ratio cannot perform the averaging that makes an importance weight valid; it simply injects high-variance noise. Because that noise enters the gradient at every position and is exacerbated by clipping, it accumulates over long responses and can destabilize the policy, producing the irreversible collapse observed in extended training runs.
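The role of averaging in importance sampling can be seen in a toy example that is unrelated to any language model. The two-outcome distributions `p` and `q`, the function `f`, and the helper `is_estimate` are hypothetical and chosen only so the true answer is easy to check by hand.

```python
import random

# Toy target distribution p and sampling distribution q over two outcomes.
# All numbers are made up for illustration.
p = [0.9, 0.1]
q = [0.5, 0.5]
f = [1.0, 0.0]  # we want E_p[f], which is 0.9 here


def is_estimate(n, rng):
    """Importance-sampling estimate of E_p[f] from n samples drawn from q."""
    total = 0.0
    for _ in range(n):
        x = 0 if rng.random() < q[0] else 1
        total += f[x] * p[x] / q[x]  # reweight each sample by p(x) / q(x)
    return total / n


rng = random.Random(0)
many = is_estimate(100_000, rng)                             # close to 0.9
singles = sorted({is_estimate(1, rng) for _ in range(50)})   # only ever 0.0 or 1.8
print(many, singles)
```

With many samples the weighted average settles near the true value; a single-sample estimate can only take the values 0.0 or 1.8, neither of which is close to 0.9. The per-token ratio in GRPO is in the single-sample situation.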
The problem is sharpest for MoE models. After each gradient update the expert-routing network can change which experts a token activates: the Qwen team reports that for a 48-layer MoE roughly 10 percent of the activated experts differ from one step to the next [1][2]. That makes the per-token likelihoods, and therefore the token-level ratios, fluctuate sharply even when the model as a whole has barely changed. The previous remedy, "Routing Replay," caches the old policy's routing decisions and forces the new policy to reuse them so the ratios stay meaningful. This works but adds memory and communication overhead and constrains the model's usable capacity [1][2].
GSPO keeps GRPO's critic-free, group-relative advantage A_i but replaces the token-level ratio with a single ratio for the entire response. Because the raw likelihood ratio of a whole sequence is a product of many per-token factors and would grow or shrink exponentially with length, GSPO normalizes it by the response length |y_i|:
s_i(theta) = ( pi_theta(y_i | x) / pi_theta_old(y_i | x) ) ^ (1 / |y_i|) = exp( (1 / |y_i|) * sum_t log[ pi_theta(y_{i,t} | x, y_{i,<t}) / pi_theta_old(y_{i,t} | x, y_{i,<t}) ] ).
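The exponent 1 / |y_i| turns the sequence ratio into the geometric mean of the per-token ratios, which is the same as exponentiating the arithmetic mean of the token log-ratios. It keeps s_i close to 1, controls variance, and places responses of different lengths on a common numerical scale [1][2].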
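A short Python sketch makes the effect of the length normalization concrete. The function names and the toy log-probabilities are assumptions for illustration; only the formula for s_i comes from the text above.

```python
import math


def sequence_ratio(new_logps, old_logps):
    """Length-normalized sequence ratio s_i: exp of the mean token log-ratio."""
    log_ratios = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(log_ratios) / len(log_ratios))


def raw_sequence_ratio(new_logps, old_logps):
    """Unnormalized sequence likelihood ratio: the product of all token ratios."""
    return math.exp(sum(n - o for n, o in zip(new_logps, old_logps)))


# Hypothetical case: every token's log-probability rises by 0.01 under the new policy.
for length in (10, 100, 1000):
    old = [-2.0] * length
    new = [-1.99] * length
    print(length, raw_sequence_ratio(new, old), sequence_ratio(new, old))
```

In this toy case the raw ratio grows from roughly 1.1 at 10 tokens to roughly 22,000 at 1,000 tokens, while the normalized ratio stays at about 1.01 for every length, so a single clipping range can be applied to short and long responses alike.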
The GSPO objective applies one clipped term per response rather than per token:
J_GSPO(theta) = E[ (1 / G) * sum_i min( s_i(theta) * A_i, clip(s_i(theta), 1 - eps, 1 + eps) * A_i ) ].
Clipping now acts on the whole sequence: if a response's sequence ratio leaves the trust region, it is the entire response, not one token, that is held out of the update. Because s_i is length-normalized and therefore tightly concentrated near 1, the useful clipping range is far narrower than GRPO's; in the team's MoE experiment the clip thresholds were about 3e-4 on the left and 4e-4 on the right, versus roughly 0.2 for GRPO [1]. Counterintuitively, GSPO clips a much larger fraction of tokens, around two orders of magnitude more, yet still trains more efficiently, because the gradient signal it keeps is cleaner [1][2].
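The sequence-level objective can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the separate left and right thresholds default to the approximate values reported for the MoE experiment, and the returned flags simply mark responses for which the clipped branch of the min is the active one (and which therefore contribute no gradient).

```python
import math


def gspo_objective(new_logps, old_logps, advantages, eps_low=3e-4, eps_high=4e-4):
    """Sequence-level clipped surrogate: one min(...) term per response.

    Returns (objective value, list of flags); a flag is True when the whole
    response is clipped out of the update.
    """
    terms, clipped = [], []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        mean_log_ratio = sum(n - o for n, o in zip(new_lp, old_lp)) / len(new_lp)
        s = math.exp(mean_log_ratio)                     # sequence ratio s_i
        s_clip = min(max(s, 1 - eps_low), 1 + eps_high)  # clip the whole sequence
        unclipped_term, clipped_term = s * adv, s_clip * adv
        terms.append(min(unclipped_term, clipped_term))
        clipped.append(clipped_term < unclipped_term)
    return sum(terms) / len(terms), clipped


# Toy group of three two-token responses (made-up log-probabilities).
old = [[-1.0, -1.0], [-1.0, -1.0], [-1.0, -1.0]]
new = [[-1.0, -1.0], [-0.99, -0.99], [-0.99, -0.99]]
value, flags = gspo_objective(new, old, advantages=[1.0, 1.0, -1.0])
print(value, flags)
```

In the toy group, the second response has a positive advantage and a ratio above the upper threshold, so it is clipped as a whole. The third has the same ratio but a negative advantage, so the min keeps the unclipped term and the response still contributes to the update.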
For settings that need per-token credit assignment, such as multi-turn agentic RL, the paper introduces GSPO-token. It restores a per-token advantage A_{i,t} while keeping the sequence ratio's stability through a stop-gradient construction:
s_{i,t}(theta) = sg[s_i(theta)] * pi_theta(y_{i,t} | x, y_{i,<t}) / sg[ pi_theta(y_{i,t} | x, y_{i,<t}) ],
where sg[.] denotes the stop-gradient (detach) operation. When all per-token advantages equal the sequence advantage, GSPO-token is numerically identical to GSPO and yields the same gradients; its only added value is the flexibility to shape advantages token by token [1].
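The equivalence can be checked directly from the definitions. The derivation below is a sketch that ignores clipping and assumes the GSPO-token objective averages its per-token terms with weight 1 / |y_i|, an averaging convention not spelled out in the text above.

```latex
\begin{aligned}
% Value: the two stop-gradient factors cancel numerically.
s_{i,t}(\theta) &= s_i(\theta) \quad \text{(numerically)} \\
% Gradient: only the numerator \pi_\theta carries a gradient.
\nabla_\theta s_{i,t}(\theta)
  &= s_i(\theta)\,
     \frac{\nabla_\theta \pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}
   = s_i(\theta)\,\nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t}) \\
% Sequence ratio: differentiate the exponential of the mean log-ratio.
\nabla_\theta s_i(\theta)
  &= s_i(\theta)\cdot\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}
     \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t}) \\
% Token-averaged GSPO-token term with per-token advantages.
\nabla_\theta \Big[\frac{1}{|y_i|}\sum_{t=1}^{|y_i|} s_{i,t}(\theta)\,A_{i,t}\Big]
  &= s_i(\theta)\cdot\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}
     A_{i,t}\,\nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t})
\end{aligned}
```

Setting A_{i,t} = A_i for every t turns the last line into A_i times the gradient of s_i, which is the gradient of the unclipped GSPO term s_i(theta) * A_i.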
| Aspect | PPO | GRPO | GSPO |
|---|---|---|---|
| Critic / value network | Learned critic | None (group baseline) | None (group baseline) |
| Advantage estimate | GAE from critic | Group-relative, normalized reward | Group-relative, normalized reward |
| Importance ratio unit | Token | Token | Sequence (length-normalized) |
| Clipping level | Per token | Per token | Per sequence |
| Typical clip range | About 0.1 to 0.2 | About 0.2 | About 3e-4 to 4e-4 |
| Reward vs IS granularity | Matched (token critic) | Mismatched (token IS, sequence reward) | Matched (sequence IS, sequence reward) |
| MoE RL stability | Typically applied to dense models | Needs Routing Replay | Stable without Routing Replay |
The conceptual through-line is that GSPO makes the unit of optimization match the unit of reward: rewards in RLHF and RLVR (reinforcement learning with verifiable rewards) are assigned to complete sequences, so the off-policy correction should be applied to complete sequences too. PPO sidesteps the per-token sampling issue by learning a value function, but at the cost of training and serving a separate critic; GRPO drops the critic but keeps the token-level ratio, which GSPO argues is exactly the part that breaks [1][2].
GSPO was developed for and deployed in Qwen's own training pipeline, and the team states it contributed to improvements in its Qwen3 models, which include large MoE systems [1][2]. The reported experiments fine-tuned a cold-start checkpoint derived from Qwen3-30B-A3B-Base, a 48-layer MoE model, and evaluated on reasoning and coding benchmarks including AIME 2024, LiveCodeBench, and CodeForces, where GSPO reached higher training accuracy and benchmark scores than GRPO under equal compute [1].
Beyond raw scores, the authors emphasize two practical infrastructure benefits. First, GSPO trains MoE models stably without Routing Replay, simplifying the system and freeing model capacity [1][2]. Second, because GSPO depends only on a sequence-level likelihood and is insensitive to small per-token numerical differences, it tolerates the precision gap between the optimized inference engines used to generate rollouts and the training engine. That makes it practical to optimize directly on inference-engine likelihoods, which is attractive for partial rollouts, multi-turn RL, and disaggregated train-and-infer deployments [1][2]. Since its release, GSPO has been discussed and adapted in follow-up work on RLVR and on policy-optimization methods that mix token-level and sequence-level signals [5][6].
GSPO is recent and was largely validated by its originating lab, so independent, large-scale reproductions are still accumulating. Several tradeoffs are noted by the authors and by subsequent papers: