Fine-tuning, Post-training & RL
IC5 · IC6

PPO, GRPO, DPO & the Variant Zoo

Policy-gradient RL for LLMs is one question asked a dozen ways — how do you turn a scalar reward into a stable, low-variance gradient on a 100k-vocab autoregressive policy? PPO answers with a clipped surrogate and a learned critic; GRPO drops the critic and lets the group mean be the baseline; DPO skips sampling entirely. Then Dr.GRPO, DAPO, GSPO, and CISPO are a precise sequence of bug-fixes to GRPO's biases. Derive each, and know exactly what it fixes and what it breaks.

18 min read · 15 sections
Prerequisites: the training loop (forward/loss/backward/step), policy gradients & REINFORCE intuition, KL divergence & cross-entropy
Runnable: ai-eng-wiki/examples/rl/grpo_min.py

1. Quick anchor

Every RL post-training algorithm in this lesson is an answer to one engineering question: how do you turn a scalar reward into a low-variance, stable gradient on a 100k-vocab autoregressive policy? You sample completions, score them, and nudge the policy to make above-average completions more likely. Everything else — critics, clipping, group baselines, importance ratios — is machinery for making that nudge not blow up.

Three families own the landscape:

  • PPO (Proximal Policy Optimization): actor-critic. A clipped surrogate objective plus a learned value function (the critic) as a baseline. Stable, principled, expensive — you hold up to four model copies in memory.
  • GRPO (Group Relative Policy Optimization): drop the critic. Sample G completions per prompt and let the group mean be the baseline. This is what trained DeepSeek-R1, and it cut RL infrastructure cost enough to make frontier reasoning RL accessible.
  • DPO (Direct Preference Optimization): skip RL entirely. A closed-form offline loss on preference pairs whose implicit reward is r = β·log(π_θ/π_ref). No sampling, no reward model, no rollout loop.

Then the variant zoo — Dr.GRPO, DAPO, GSPO, CISPO, and a dozen cousins — is not random proliferation. It is a precise, dated sequence of bug-fixes to GRPO's two original sins: a length bias and a difficulty bias baked into how it normalizes the advantage. Know the bugs and the zoo organizes itself.

2. Why interviewers probe this

This is the load-bearing lesson of the RL pillar. An interviewer is checking whether you can derive these objectives, not recite acronyms. The tell at each level:

  • IC5 — Can you write PPO's clipped surrogate and explain why the min? Can you state GRPO's group-relative advantage from memory and say what dropping the critic buys and costs? Do you understand DPO's implicit reward and why β is the KL knob?
  • IC6 — Can you walk the variant zoo and, for each method, name exactly what it changes, the bias it fixes, and the new failure mode it introduces? Can you reason about the std-normalization debate, token-vs-sequence-level importance sampling, and why GRPO destabilizes MoE training? This is where staff-level RL candidates separate from people who ran a TRL script once.

A candidate who says "GRPO is PPO without the critic" and stops has named the title of the chapter. The job is the chapter.

3. Concept build-up

Beginner explainer (new here?)

The words first.

  • Policy — the model itself, seen as a dice-roller that assigns a probability to each possible next token (word-piece).
  • Sample (rollout) — one full answer the policy generates for a prompt.
  • Reward — a single number scoring how good an answer is (e.g. 1 if a math answer is correct, 0 if not).
  • Baseline — the reward you'd expect on average; a reference point to compare against.
  • Advantage: reward − baseline, i.e. how much better (positive) or worse (negative) a sample did than expected.
  • Policy gradient — the update rule that nudges probabilities: push up the tokens from good samples, push down the tokens from bad ones.
  • Critic (value network) — in PPO, a second trained model whose only job is to predict the baseline.
  • Group — in GRPO, several samples drawn for the same prompt, used together to compute the baseline.

Step by step.

  1. Pick a prompt. Let the policy generate a group of answers (say, 8).
  2. Score each answer with the reward function.
  3. Work out a baseline — the "par" score to measure against.
  4. Subtract: advantage = reward − baseline. Positive means above par.
  5. Apply the policy gradient: raise the probability of tokens in positive-advantage answers, lower it for negative ones — by an amount proportional to the advantage.
  6. PPO trains a separate critic to guess the baseline. GRPO skips it: the baseline is just the group's mean reward (usually scaled by the group's spread, (r − mean) / std).
  7. Move to the next prompt and repeat; over many steps the model drifts toward higher-reward behavior.

Remember this: you train a model by scoring its own answers and nudging it toward the above-average ones — and GRPO's trick is letting a group of answers be their own "average," so no extra critic network is needed.

3.1 The shared skeleton: policy gradient + a baseline

Every method here descends the same gradient. The policy π_θ defines a distribution over completions; you want to maximize expected reward J(θ) = E_{y~π_θ}[r(y)]. The policy gradient theorem gives:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta}\big[\, r(y)\, \nabla_\theta \log \pi_\theta(y) \,\big]$$

In words: push up the log-probability of completions weighted by their reward. This is REINFORCE. It is unbiased but has murderous variance: r(y) for an LLM completion can be anything, and you are estimating an expectation over an astronomically large output space from a handful of samples.

Policy Gradient Theorem — on real numbers

Let's name every part. The left side ∇_θ J(θ) is "the direction to step the model weights to maximize expected reward." The right side has three pieces: E means "average over many samples"; r(y) is the scalar reward for one completion (e.g. 1 for correct, 0 for wrong); ∇_θ log π_θ(y) is the gradient of log-probability — it points in the direction that makes completion y more likely.

Here's a tiny concrete example. Suppose we sample 4 completions with rewards [1.0, 0.0, 1.0, 0.0]. Each completion contributes reward × its log-prob gradient, and the estimate is the sum over all 4 samples divided by 4. The first and third completions (reward = 1) contribute their full log-prob gradients; the second and fourth (reward = 0) contribute nothing. Stepping along this estimate makes the good completions more likely. The bad ones are never pushed down directly: they lose probability only because the distribution must still sum to 1.

What just happened: we moved the policy toward higher-reward behavior by weighting each completion's learning signal by how good it was. The catch is variance — with only 4 samples in a 100k-vocab space, that average is a very noisy estimate of the true expectation.
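The estimator in this example can be sketched in a few lines. This is a minimal illustration, assuming a toy softmax policy over three discrete "completions"; the logits, sampled actions, and rewards are made up, and the helper names are not from any library.

```python
import math

def softmax(logits):
    z = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(l - z) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def grad_log_prob(probs, action):
    """Gradient of log pi(action) w.r.t. the logits of a softmax policy."""
    return [(1.0 if j == action else 0.0) - p for j, p in enumerate(probs)]

def reinforce_gradient(logits, actions, rewards):
    """Monte Carlo estimate: mean of r(y) * grad log pi(y) over ALL samples."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, r in zip(actions, rewards):
        g = grad_log_prob(probs, a)
        for j in range(len(logits)):
            grad[j] += r * g[j] / len(actions)
    return grad

# Hypothetical batch: actions 0 and 2 earned reward 1, action 1 earned 0 (twice).
logits = [0.0, 0.0, 0.0]
actions = [0, 1, 2, 1]
rewards = [1.0, 0.0, 1.0, 0.0]
grad = reinforce_gradient(logits, actions, rewards)
```

The zero-reward samples add nothing to the sum, yet the gradient for action 1's logit is still negative: raising the other two logits takes probability mass away from it.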

The fix is a baseline b. Subtract any quantity that doesn't depend on the action and the estimator stays unbiased but its variance drops:

$$\nabla_\theta J(\theta) = \mathbb{E}\big[\,(r(y) - b)\, \nabla_\theta \log \pi_\theta(y)\,\big]$$

The quantity A = r(y) − b is the advantage: "how much better than baseline was this completion?" Every algorithm in this lesson is a different choice of baseline b and a different way to safely reuse off-policy samples. PPO learns b with a neural network. GRPO sets b to the group mean. That single design decision is the whole story.
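To see why a baseline leaves the mean alone but shrinks the variance, the sketch below enumerates a three-action bandit exactly instead of sampling. The policy, the rewards, and the choice of the expected reward as baseline are all illustrative assumptions.

```python
def grad_component_stats(probs, rewards, baseline, j=0):
    """Exact mean and variance of (r - b) * d log pi(a) / d logit_j for a ~ probs."""
    vals = [(rewards[a] - baseline) * ((1.0 if a == j else 0.0) - probs[j])
            for a in range(len(probs))]
    mean = sum(p * v for p, v in zip(probs, vals))
    var = sum(p * (v - mean) ** 2 for p, v in zip(probs, vals))
    return mean, var

probs = [1 / 3, 1 / 3, 1 / 3]      # hypothetical uniform policy
rewards = [10.0, 11.0, 12.0]       # hypothetical rewards sharing a large offset
expected_reward = sum(p * r for p, r in zip(probs, rewards))

mean_raw, var_raw = grad_component_stats(probs, rewards, baseline=0.0)
mean_b, var_b = grad_component_stats(probs, rewards, baseline=expected_reward)
```

Both estimators have the same mean (the true gradient component), but subtracting the baseline removes the shared offset that dominated the variance.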

One more shared primitive: importance sampling. Generating rollouts is the expensive part, so you want to take several gradient steps on the same batch of completions. But after one step, π_θ no longer equals the π_old that generated the data, so the batch is now off-policy. You correct with the importance ratio r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t). The catch: the ratio for a whole sequence is the product of its per-token ratios, so over a long completion it can swing by orders of magnitude and the gradient variance diverges. Controlling that ratio is the central design axis of the entire variant zoo.
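A quick numeric sketch of how fast the product of per-token ratios grows. The 5% per-token drift and the 200-token length are invented for illustration.

```python
import math

def sequence_ratio(token_log_ratios):
    """Product of per-token ratios pi_theta / pi_old, computed in log space."""
    return math.exp(sum(token_log_ratios))

def geometric_mean_ratio(token_log_ratios):
    """Length-normalized ratio: the geometric mean of the per-token ratios."""
    return math.exp(sum(token_log_ratios) / len(token_log_ratios))

# Hypothetical drift: every token became 5% more likely under the new policy.
log_ratios = [math.log(1.05)] * 200
full = sequence_ratio(log_ratios)             # 1.05 ** 200, on the order of 1e4
per_token = geometric_mean_ratio(log_ratios)  # stays at 1.05
```

A modest, uniform per-token shift compounds into a sequence-level weight in the tens of thousands, which is why every method in this lesson clips, normalizes, or otherwise bounds the ratio.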

3.2 PPO: the clipped surrogate and the critic

PPO makes off-policy reuse safe by refusing to trust the importance ratio too far. Its objective is the clipped surrogate:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\;\; \text{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big]$$

PPO's Clipped Surrogate — on real numbers

Symbol names: r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t) is the importance ratio — "how much more (or less) likely is this token under the new policy than the old one?" If the new policy makes a token twice as likely, r_t = 2.0. If it makes it half as likely, r_t = 0.5. Â_t is the advantage estimate (good action positive, bad action negative). ε ≈ 0.2 is the clip width. clip(x, 1−ε, 1+ε) clamps a number to the range [0.8, 1.2].

Concrete walk-through: Suppose Â_t = 1.0 (a good token) and after one gradient step the ratio rose to r_t = 1.5 (50% more likely). The two terms are:

  • Unclipped: 1.5 × 1.0 = 1.5
  • Clipped: clip(1.5, 0.8, 1.2) × 1.0 = 1.2 × 1.0 = 1.2
  • Min of the two: min(1.5, 1.2) = 1.2

So the objective takes the clipped value (1.2). The clipped term is constant in θ, so this token's gradient is already zero at r_t = 1.5, and it stays zero if the ratio keeps rising to r_t = 2.0: you can't over-commit to one batch. Now suppose Â_t = −1.0 (a bad token) and r_t = 0.1 (now 90% less likely). The unclipped term is 0.1 × (−1.0) = −0.1, the clipped term is clip(0.1, 0.8, 1.2) × (−1.0) = 0.8 × (−1.0) = −0.8, and the min is −0.8. The min again selects the clipped term, so the gradient is zero here too: the token has already been pushed down past the band, and this batch may not push it further. The asymmetry shows up when the ratio has moved the wrong way. Take Â_t = −1.0 with r_t = 1.5 (a bad token that became more likely): the unclipped term is −1.5, the clipped term is 1.2 × (−1.0) = −1.2, and the min is −1.5. Now the unclipped term wins, so the gradient is live and pulls the probability back down.

What just happened: PPO avoids over-committing to old data by clamping how far the ratio can swing, but the min() keeps the gradient alive for corrections. It's a cheap trust region.

In the objective above, r_t(θ) is the per-token importance ratio, Â_t is the advantage estimate, and ε ≈ 0.2 is the clip width. Why the min? This is the question. Consider the two cases:

  • Â_t > 0 (good action): the unclipped term wants to raise r_t without bound. The clip caps the objective at (1+ε)·Â_t, so once you've moved the probability up by ~20% on this batch, the gradient flattens to zero: a dead zone. You can't over-commit to a single batch.
  • Â_t < 0 (bad action): symmetric. The gradient flattens once you've pushed the probability down by ~20%.

The min makes the bound pessimistic: it always takes the smaller (less optimistic) of the clipped and unclipped objectives. The subtle payoff is asymmetric. If a token is already far outside the trust region in the wrong direction (ratio already pushed too high on a bad action), the min selects the unclipped term — so the gradient is still live and can pull it back. The clip only kills the gradient for moves that would push further in the improving direction. This is a cheap, first-order trust region — unlike TRPO, which enforces a hard D_KL(π_old‖π_new) ≤ δ constraint via second-order conjugate-gradient optimization. PPO drops the guarantee, keeps the stability, and scales to LLMs.
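The four (ratio, advantage) quadrants can be checked mechanically. The sketch below is a per-token illustration in plain Python, not a training loop; `gradient_is_live` is a hypothetical helper that reports whether the selected branch still depends on θ.

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))

def ppo_term(ratio, adv, eps=0.2):
    """Per-token clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    return min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv)

def gradient_is_live(ratio, adv, eps=0.2):
    """True when the min selects a branch that still depends on theta.

    If the clipped branch wins while the clip is active, the term is a
    constant for this token and its gradient is zero.
    """
    inside_band = (1 - eps) <= ratio <= (1 + eps)
    unclipped_selected = ratio * adv <= clip(ratio, 1 - eps, 1 + eps) * adv
    return inside_band or unclipped_selected

cases = [
    (1.5, +1.0),   # good token, already raised past the band
    (0.1, -1.0),   # bad token, already pushed down past the band
    (1.5, -1.0),   # bad token whose probability went UP: needs correcting
    (0.5, +1.0),   # good token whose probability went DOWN: needs correcting
]
for ratio, adv in cases:
    print(ratio, adv, ppo_term(ratio, adv), gradient_is_live(ratio, adv))
```

The first two cases are dead zones (the move already went far enough in the improving direction); the last two keep a live gradient because the ratio moved the wrong way.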

Where does Â_t come from? A learned critic V(s) — a value head the same size as the policy — estimates expected future reward, and GAE (Generalized Advantage Estimation) blends multi-step TD residuals:

$$\hat{A}^{\text{GAE}}_t = \sum_{l=0}^{\infty}(\gamma\lambda)^l\,\delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

δ_t is the TD residual (reward plus discounted next-state value minus current value), γ is the discount, and λ ∈ [0,1] is the bias-variance knob: λ=0 uses only the 1-step estimate (low variance, high bias from a wrong critic), λ=1 is full Monte Carlo (unbiased, high variance). Practice lives at λ ∈ [0.9, 0.99]. For LLMs the reward is usually a single terminal scalar (correct/incorrect at the end), so GAE mostly propagates that one signal backward through the critic's value estimates.
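A minimal sketch of GAE on a single completion with one terminal reward, using the standard backward recursion A_t = δ_t + γλ·A_{t+1}. The rewards and critic values are made up, and the value after the last token is assumed to be 0.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation via backward recursion.

    rewards[t] is the per-step reward, values[t] is the critic's V(s_t),
    and V(terminal) is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    next_value, next_adv = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

# Hypothetical 3-token completion with a single terminal reward of 1.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]                 # made-up critic predictions

adv_td = gae(rewards, values, lam=0.0)   # [0.0, 0.0, 0.5]: one-step TD residuals
adv_mc = gae(rewards, values, lam=1.0)   # [0.5, 0.5, 0.5]: return minus V(s_t)
```

With λ = 0 only the final token sees the reward, and earlier tokens depend entirely on the critic; with λ = 1 every token gets the full return minus its value estimate.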

The cost. PPO keeps four models resident: the policy (training), the critic (training), the frozen reference (for the KL penalty), and the reward model — plus optimizer states for the two trainable ones. The critic is the killer: it's policy-sized, must be trained alongside, and a badly-fit critic silently poisons every advantage. That overhead is exactly what GRPO deletes.

3.3 GRPO: the group is the baseline

GRPO (introduced in DeepSeekMath, Shao et al. 2024; scaled in DeepSeek-R1) makes one observation: if you're going to sample anyway, sample a group of G completions for the same prompt and use the group's own statistics as the baseline. No critic needed. The advantage for completion i is:

$$A_{i} = \frac{r_i - \text{mean}(r_1,\dots,r_G)}{\text{std}(r_1,\dots,r_G)}$$

GRPO's Group-Relative Advantage — on real numbers

Symbol names: r_i is the reward (a single number) for completion i — say, 1 if a math problem was solved, 0 otherwise. mean(r_1, ..., r_G) is the average reward across the group of G completions. std(...) is the standard deviation — a measure of how spread-out the rewards are.

Concrete walk-through: Suppose G = 4 completions have rewards [3.0, 1.0, 2.0, 0.0]. The mean is (3+1+2+0)/4 = 1.5. The variance is ((3−1.5)² + (1−1.5)² + (2−1.5)² + (0−1.5)²)/4 = (2.25 + 0.25 + 0.25 + 2.25)/4 = 1.25, so std = √1.25 ≈ 1.118. Now compute A for each:

  • A_1 = (3.0 − 1.5) / 1.118 ≈ 1.34 (above average by ~1.3 std)
  • A_2 = (1.0 − 1.5) / 1.118 ≈ −0.45 (below average)
  • A_3 = (2.0 − 1.5) / 1.118 ≈ 0.45 (slightly above)
  • A_4 = (0.0 − 1.5) / 1.118 ≈ −1.34 (far below)

What just happened: completion 1 gets a strong positive advantage (its tokens' probabilities are raised) and completion 4 a strong negative one (its tokens' probabilities are lowered). Completions 2 and 3 sit closer to the group mean, so at ±0.45 they are pushed about a third as hard. The whole group's statistics become the baseline; no critic needed.

r_i is the scalar reward for completion i, the mean over the group is the baseline b, and dividing by the group std normalizes scale. The same A_i is assigned to every token in completion i — there is no per-token credit assignment, because the reward model emits one scalar per completion and per-token credit is impractical. Above-average completions get reinforced, below-average suppressed, and as the whole group converges in quality the advantages shrink toward zero automatically — so you don't need PPO's adaptive KL scheduler.

The full token-level loss adds an explicit KL penalty:

$$\mathcal{L} = -\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big[A_i\,\log\pi_\theta(o_{i,t}\mid\cdot) - \beta\, D_{KL}^{(t)}\Big]$$

|o_i| is the token length of completion i, β weights the KL, and D_KL uses the k3 unbiased, non-negative estimator D_KL = π_ref/π_θ − log(π_ref/π_θ) − 1 (always ≥ 0, lower variance than the naive form). Note the key difference from PPO: GRPO applies KL as a direct loss term, not folded into the reward. PPO bakes it into the scalar (total_reward = rm(x,y) − β·KL); GRPO penalizes per-token deviation from the reference directly.
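The two properties claimed for the k3 estimator (non-negative per sample, unbiased in expectation) can be checked exactly on a toy vocabulary. The distributions below are invented for illustration.

```python
import math

def k3(p_theta, p_ref):
    """Per-sample k3 estimate of KL(pi_theta || pi_ref) for a token drawn from pi_theta."""
    ratio = p_ref / p_theta
    return ratio - math.log(ratio) - 1.0

def exact_kl(pi_theta, pi_ref):
    return sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))

# Hypothetical next-token distributions over a 3-token vocabulary.
pi_theta = [0.7, 0.2, 0.1]
pi_ref = [0.5, 0.3, 0.2]

per_token = [k3(p, q) for p, q in zip(pi_theta, pi_ref)]
# Expectation under pi_theta, computed by enumeration rather than sampling.
expected_k3 = sum(p * k for p, k in zip(pi_theta, per_token))
```

Every per-token value is non-negative because x − log x − 1 ≥ 0 for all x > 0, and the expectation matches the exact KL because the ratio term averages to 1 under π_θ.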

The payoff is infrastructural: GRPO holds 2–3 model copies (policy, reference, optional reward model) versus PPO's four, cutting RL memory/compute roughly in half and removing the fragile critic-training loop. That cost reduction is why large-scale reasoning RL became practical.

Interactive: GRPO, the group is the baseline (group of 8 samples, mean reward 0.55)
reward r:     0.90   0.20   0.70   0.10   0.85   0.40   0.95   0.30
advantage A:  1.10  -1.10   0.47  -1.42   0.95  -0.47   1.26  -0.79

Advantage A = (r − μ) / σ with μ=0.55, σ=0.32. Bars above the mean get a positive advantage (reinforced); below get negative (suppressed). No value network — the other samples in the group are the baseline. That's the whole trick that makes GRPO so much cheaper than PPO.

3.4 The std-normalization debate and GRPO's two biases

The denominator std(r_1..r_G) is where the bodies are buried, and IC6 interviews go straight here.

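Failure mode: exploding gradients on easy/solved groups. When every completion in a group gets the same reward (all correct or all wrong, common in RLVR once the model masters a prompt), std = 0 and so is every r_i − mean: an unguarded divide gives 0/0 (NaN), and a guarded one gives zero advantage, so the group contributes no learning signal. The amplification happens just short of that point: when rewards are nearly identical, a tiny std inflates negligible differences into unit-scale advantages, producing large, noisy gradients. This std → 0 regime is the dominant instability in sparse-reward reasoning RL. Mitigations: clamp the denominator (std + 1e-6), or filter degenerate groups entirely (DAPO's dynamic sampling), or drop the /std (Dr.GRPO).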

Difficulty bias. Std-normalization gives prompts with low reward variance (very easy or very hard) systematically larger-magnitude advantages than medium-difficulty prompts, so the optimizer over-weights problems it has nothing to learn from.

Length bias. The 1/|o_i| per-response normalization means a wrong-but-long completion gets each of its tokens penalized less than a wrong-but-short one, and a correct-but-short one rewarded more per token than a correct-but-long one. Net effect: GRPO has a structural preference for shorter correct answers and longer incorrect answers, which is exactly backwards for reasoning, where it can collapse useful chain-of-thought.
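Both biases show up in a few lines of arithmetic. The sketch assumes binary rewards and made-up response lengths; `per_token_weight` is an illustrative helper, not part of any library.

```python
import math

def grpo_advantages(rewards, eps=1e-6):
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def per_token_weight(advantage, length):
    """Weight each token receives when the loss is averaged per response (1/|o_i|)."""
    return advantage / length

# Length bias: two wrong answers with the same advantage but different lengths.
adv = grpo_advantages([1.0, 1.0, 0.0, 0.0])          # roughly [1, 1, -1, -1]
short_wrong = per_token_weight(adv[2], length=100)   # about -0.01 per token
long_wrong = per_token_weight(adv[3], length=1000)   # about -0.001 per token

# Difficulty bias: the same +1 reward is scaled up in a low-variance group.
hard_group = grpo_advantages([1.0] + [0.0] * 7)      # 1 of 8 correct
mid_group = grpo_advantages([1.0] * 4 + [0.0] * 4)   # 4 of 8 correct
```

The long wrong answer is penalized ten times less per token than the short one, and the lone correct answer in the hard group gets an advantage near √7 ≈ 2.65 versus about 1.0 in the evenly split group.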

Hold these two biases in your head. The entire next subsection is "who removed which normalization, and what broke."

3.5 DPO: preference optimization with no RL loop

DPO (Rafailov et al., 2023) is the odd one out: no sampling, no rollouts, no reward model, no critic. Given preference pairs (x, y_w, y_l) — prompt, chosen, rejected — it derives a closed-form loss by inverting the Bradley-Terry preference model. The trick is recognizing that the RLHF-optimal policy has a closed form, which lets you define an implicit reward:

$$r(x,y) = \beta\,\log\frac{\pi_\theta(y\mid x)}{\pi_{ref}(y\mid x)}$$

and the loss becomes a simple binary classification of which response is preferred:

$$\mathcal{L}_{DPO} = -\,\mathbb{E}_{(x,y_w,y_l)}\Big[\log\sigma\Big(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{ref}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{ref}(y_l\mid x)}\Big)\Big]$$

σ is the logistic function; β (typically in [0.1, 0.5]) scales the implicit reward and sets the KL regularization strength: a large β penalizes drift from the reference heavily and keeps the policy close to it, while a small β loosens the leash and lets the policy fit the preferences more aggressively. DPO is cheap, stable, and fully offline, which is why it's the default for preference tuning.
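For a single preference pair the loss reduces to a softplus of the implicit-reward margin. This is a sketch on plain floats, assuming the four sequence log-probabilities have already been summed over tokens; the example values are invented.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair from sequence log-probs."""
    reward_w = beta * (logp_w - ref_logp_w)   # implicit reward of the chosen response
    reward_l = beta * (logp_l - ref_logp_l)   # implicit reward of the rejected response
    margin = reward_w - reward_l
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), written in a stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# At initialization the policy equals the reference, so the margin is 0.
loss_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)    # log(2), about 0.693
# After a hypothetical update: chosen more likely, rejected less likely.
loss_after = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

Only the difference of the two log-ratio terms enters the loss, which is why it can keep falling even when both log-probabilities move in the same direction.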

Two failure modes interviewers love:

  • Likelihood displacement. During DPO training the log-prob of the chosen response often decreases — just more slowly than the rejected one. The margin grows while both probabilities erode. Your preference accuracy looks great and your model's calibration quietly rots, breaking anything downstream that expects sane probabilities.
  • Reward hacking. The loss can be minimized by driving the rejected log-prob to −∞ (over-suppression) rather than improving the chosen response. The implicit reward gap explodes while actual chosen quality plateaus or degrades. Fixes (POWER, PG-DPO, DPO-Shift) bound the preference score or add a regularizer that protects chosen-response likelihood.
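Since preference accuracy alone hides both failure modes, one option is to log the chosen log-prob alongside the margin. The helper below is a hypothetical diagnostic with made-up numbers, not something from a DPO library.

```python
def displacement_report(before, after):
    """Compare (chosen_logp, rejected_logp) pairs before and after training.

    Flags likelihood displacement: the margin grew while the chosen
    log-prob itself went down.
    """
    chosen_delta = after[0] - before[0]
    margin_delta = (after[0] - after[1]) - (before[0] - before[1])
    return {
        "chosen_delta": chosen_delta,
        "margin_delta": margin_delta,
        "displaced": margin_delta > 0 and chosen_delta < 0,
    }

# Hypothetical log-probs: both responses became less likely, the rejected one far more so.
report = displacement_report(before=(-10.0, -12.0), after=(-14.0, -30.0))
```

Here the margin grows by 14 nats while the chosen response loses 4, so a margin-only dashboard would report steady progress.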

The honest senior framing: DPO is offline contextual-bandit imitation of a preference dataset; GRPO is online exploration. DPO can only re-rank responses it was shown; it cannot discover a better completion the way an online sampler can. That's why frontier reasoning models use GRPO-family RL, not DPO, for capability gains.

3.6 The variant zoo — what each fixes, and what it breaks

Read this as a changelog against §3.4's two biases and §3.1's importance-ratio problem.

Dr.GRPO (Understanding R1-Zero, 2025) — removes the normalizations. Drops the 1/|o_i| length normalization (replacing it with a global constant) and the /std difficulty normalization, and removes KL. Fixes both of GRPO's biases at the source. Breaks: removing per-response length scaling can introduce the reverse bias — it now mildly incentivizes verbosity — and training is a bit less stable. (Learnable-aggregation variants like λ-GRPO try to get the best of both by learning the per-token weight instead of fixing it.)
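The effect of swapping the 1/|o_i| factor for a constant can be seen by computing the weight a single token receives under each scheme. The advantages, lengths, and the `max_len` constant (assumed here to be a generation budget) are illustrative.

```python
def grpo_token_weights(advantages, lengths):
    """GRPO: each response is averaged over its own length, then over the group."""
    G = len(advantages)
    return [a / (G * n) for a, n in zip(advantages, lengths)]

def dr_grpo_token_weights(advantages, max_len):
    """Dr.GRPO-style: divide by a constant, so token weight ignores response length."""
    G = len(advantages)
    return [a / (G * max_len) for a in advantages]

# Hypothetical group: rewards [1, 1, 0, 0], mean-centered but NOT divided by std.
advantages = [0.5, 0.5, -0.5, -0.5]
lengths = [200, 400, 100, 1000]

grpo_w = grpo_token_weights(advantages, lengths)
dr_w = dr_grpo_token_weights(advantages, max_len=1024)
```

Under GRPO the 1000-token wrong answer is penalized ten times less per token than the 100-token one; under the constant normalizer both get the same per-token weight, so total pressure on a response now scales with its length.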

DAPO (ByteDance, 2025) — four orthogonal fixes. (a) Clip-Higher: asymmetric clip bounds (a higher ceiling than floor) so the policy can still raise probability on promising new tokens — prevents the entropy collapse that symmetric clipping causes. (b) Dynamic Sampling: discard all-correct and all-wrong groups (zero gradient signal, or the std→0 blowup) and resample, keeping every batch informative. (c) Token-level loss: aggregate over tokens rather than averaging per-sequence, so long chains-of-thought get proportionate signal. (d) Overlong reward shaping: an explicit soft penalty for excessive length instead of GRPO's indirect, biased length normalization. Removes KL entirely. Result: 50 points on AIME 2024 with Qwen2.5-32B in ~half the steps of DeepSeek-R1-Zero. Breaks: four new things to tune, dynamic sampling adds rollout overhead, and credit assignment is still per-response-uniform.
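Two of DAPO's fixes are small enough to sketch directly: the group filter behind dynamic sampling and the asymmetric clip. The bounds 0.2 and 0.28 are the values I recall from the DAPO paper; treat them as assumptions and tune them for your own run.

```python
def is_informative(rewards):
    """Dynamic sampling keeps a group only if its rewards are not all identical."""
    return max(rewards) > min(rewards)

def clip_higher_term(ratio, adv, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate with a wider upper bound than lower bound."""
    clipped = max(1 - eps_low, min(1 + eps_high, ratio))
    return min(ratio * adv, clipped * adv)

# Hypothetical groups of binary rewards.
groups = {
    "solved": [1.0, 1.0, 1.0, 1.0],     # all correct: no learning signal
    "hopeless": [0.0, 0.0, 0.0, 0.0],   # all wrong: no learning signal
    "mixed": [1.0, 0.0, 0.0, 1.0],      # kept
}
kept = [name for name, r in groups.items() if is_informative(r)]

# A token at ratio 1.25 is still inside the raised ceiling (1.28), so its
# gradient stays live; a symmetric 0.2 clip would have capped it at 1.2.
term = clip_higher_term(1.25, 1.0)
```

In a real loop the discarded groups would be replaced by freshly sampled prompts until the batch is full, which is where the extra rollout cost comes from.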

GSPO (Qwen, 2025): lifts importance sampling to the sequence level. Instead of a noisy per-token ratio it uses a single length-normalized sequence ratio, the geometric mean of the per-token ratios, s_i = exp((1/|o_i|)·Σ_t log(π_θ(o_{i,t}|·)/π_old(o_{i,t}|·))), and clips the whole sequence on or off. Fixes: token-level IS is theoretically broken for single samples and its ratios swing >10x; sequence-level averaging cancels that noise. Critically, it stabilizes MoE RL without Routing Replay (see §3.7). Breaks: binary all-or-nothing clipping is coarse; one outlier off-policy token can discard an otherwise good sequence's entire gradient. (SAPO replaces the hard clip with a soft sigmoid gate; SSPO clips at sub-sentence granularity as a middle ground.)
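A small sketch of the sequence-level ratio next to the per-token ratios it replaces. The log-probabilities are invented: nine tokens are unchanged and one outlier became 10x more likely.

```python
import math

def token_ratios(logp_new, logp_old):
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def gspo_sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence ratio: exp(mean of per-token log-ratios)."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

# Hypothetical 10-token completion with a single outlier token.
logp_old = [-2.0] * 10
logp_new = [-2.0] * 9 + [-2.0 + math.log(10.0)]

per_token = token_ratios(logp_new, logp_old)     # last entry is about 10.0
seq = gspo_sequence_ratio(logp_new, logp_old)    # 10 ** 0.1, about 1.26
```

The outlier is damped from 10x to about 1.26x. If that value still falls outside the clip range, the whole sequence is clipped, which is the coarseness described above.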

CISPO (MiniMax-M1, 2025): clips the importance weight, not the token update. Standard PPO/GRPO clipping zeroes the gradient for high-ratio tokens, which silently kills the rare "reflection" tokens ("Wait", "However", "Let me recheck") that are high-IS precisely because they're pivotal. Once clipped, they never contribute to later off-policy updates. CISPO instead uses sg(clip(ratio)) · A · log π(token), where sg is stop-gradient: the clipped ratio acts as a constant weight, so it bounds the magnitude of every token's gradient but never sets it to zero, and reflection tokens keep learning across 512-step reasoning rollouts. Result: enabled MiniMax-M1's full RL run (512 H800s, 3 weeks, ~$534K) with stable convergence. Breaks: far-off-policy tokens still have their update magnitude bounded; it's a softer clip, not a free lunch.
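The difference is easiest to see as the coefficient that multiplies ∇ log π(token) under each rule. This is a scalar sketch with hypothetical symmetric bounds; the clip range used in the actual CISPO run may differ.

```python
def ppo_grad_coeff(ratio, adv, eps=0.2):
    """Coefficient on grad log pi(token) under PPO/GRPO clipping.

    The gradient of ratio * adv is ratio * adv * grad log pi; once the
    clipped branch is selected the term is constant and the coefficient is 0.
    """
    clipped = max(1 - eps, min(1 + eps, ratio))
    unclipped_selected = ratio * adv <= clipped * adv
    return ratio * adv if unclipped_selected else 0.0

def cispo_grad_coeff(ratio, adv, eps_low=0.2, eps_high=0.2):
    """CISPO-style: stop_gradient(clip(ratio)) * adv, never zero for adv != 0."""
    weight = max(1 - eps_low, min(1 + eps_high, ratio))
    return weight * adv

# Hypothetical "reflection" token whose probability tripled, with positive advantage.
ppo = ppo_grad_coeff(3.0, 1.0)       # 0.0: the token stops learning on this batch
cispo = cispo_grad_coeff(3.0, 1.0)   # 1.2: bounded but still live
```

Inside the band the two rules agree; they differ only for tokens that PPO-style clipping would have silenced.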

The rest of the zoo, in one line each: MaxRL reweights toward Pass@k instead of expected reward (7.9–19x more sample-efficient for multi-sample success, but needs verifiable binary rewards); SimKO adds entropy regularization to fight the probability-concentration that kills Pass@k; λ-GRPO makes the token-aggregation weight learnable and unifies GRPO/Dr.GRPO/DAPO as special cases; DPPO replaces heuristic ratio-clipping with a direct (top-k or binary) policy-divergence estimate. None of them solves per-token credit assignment — that remains the open problem (see §7).

3.7 Why this matters for MoE (the IC6 trap)

DeepSeek-R1, Qwen3, and most frontier models are Mixture-of-Experts. Here token-level GRPO has a specific, vicious failure. The inference engine (vLLM/SGLang) and the training engine (FSDP/Megatron) route tokens to experts independently; even with identical weights, ~10% of routers disagree per forward pass and 94% of tokens differ in expert assignment in at least one layer. After each policy update the routing shifts again. The result: importance ratios spike chaotically, PPO-style clipping triggers unpredictably, and training collapses. GSPO's sequence-level ratio sidesteps this — if routing diverges in layer k and re-aligns by k+2, the sequence-level average washes it out, so you don't need the "Rollout Routing Replay" workaround that token-level methods require. This is the cleanest example in the field of why the token-vs-sequence axis is not academic. (Full treatment in RL Infrastructure.)

4. Minimal implementation

The two ideas that make GRPO work — group-relative advantage and an advantage-weighted policy-gradient step — fit in a few lines with no critic, no reward model, and no GPU. The full runnable version (a 4-action bandit the policy learns to solve from relative rewards alone) is in examples/rl/grpo_min.py.

import math, random
 
def group_relative_advantages(rewards, eps=1e-6, normalize_std=True):
    """A_i = (r_i - mean) / (std + eps), the GRPO baseline.
 
    The group mean is the baseline a PPO critic would have learned — here it
    falls out of sampling for free. `eps` guards the denominator: when every
    completion scores the same, std == 0 and a bare divide fails (0/0), and
    when scores are nearly identical a tiny std inflates the advantages.
    That is THE std-normalization failure mode in sparse-reward RL.
    Dr.GRPO drops the /std entirely (normalize_std=False) to kill the
    difficulty bias it introduces.
    """
    G = len(rewards)
    mean = sum(rewards) / G
    if not normalize_std:                      # Dr.GRPO style
        return [r - mean for r in rewards]
    var = sum((r - mean) ** 2 for r in rewards) / G
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
 
def grpo_step(logits, reward_table, G=16, lr=0.5):
    """One GRPO update on a softmax policy over discrete actions.
 
    Sample a GROUP, standardize its rewards, then ascend the
    advantage-weighted log-prob gradient. For softmax policies,
    d/d(logit_j) log pi(a) = (1[j==a] - pi_j). We accumulate
    A_i * that over the group and step UP — the group-mean baseline is
    already baked into A, so there is no separate value network anywhere.
    """
    z = max(logits)
    exps = [math.exp(l - z) for l in logits]
    s = sum(exps)
    probs = [e / s for e in exps]                       # softmax(logits)
 
    actions = [_sample(probs) for _ in range(G)]        # G "completions"
    rewards = [reward_table[a] for a in actions]        # one scalar each
    adv = group_relative_advantages(rewards)            # the whole trick
 
    grad = [0.0] * len(logits)
    for a, A in zip(actions, adv):
        for j in range(len(logits)):
            grad[j] += A * ((1.0 if j == a else 0.0) - probs[j])
    return [l + lr * g / G for l, g in zip(logits, grad)]  # /G averages the gradient over the group
 
def _sample(probs):
    u, c = random.random(), 0.0
    for i, p in enumerate(probs):
        c += p
        if u <= c:
            return i
    return len(probs) - 1

What's honest about this and what's not: the advantage math is exactly GRPO's, and the group-mean baseline enters through A_i itself; the /G only averages the gradient over the group. What's missing for a real run is the importance ratio and clipping (this does one on-policy step, so π_θ == π_old and the ratio is 1), the KL-to-reference penalty, and a real reward model or verifier. Add those and you have the loss in §3.3. For a production loop, use verl or TRL (see RL Infrastructure).

group_relative_advantages() and grpo_step() — section by section

group_relative_advantages(rewards, eps=1e-6, normalize_std=True): This function takes a list of reward scalars (one per completion in the group) and returns the advantage for each. It computes the mean and std exactly as written in §3.3: mean sums the rewards and divides by G, var averages the squared deviations, and std takes the square root. The eps=1e-6 is added to the denominator, so the division is by std + eps and never by zero, even when every reward in the group is identical. The normalize_std=False flag drops the /std division entirely (Dr.GRPO style), so you get just A_i = r_i − mean without the scaling. Result: a list of advantages, one per completion, all measured against the group's own statistics.

grpo_step(logits, reward_table, G=16, lr=0.5): This is the full GRPO update on a softmax policy. The first four statements compute the softmax probabilities from the input logits (subtracting z = max(logits) first to avoid overflow). The actions list samples G actions from the policy (those are the "completions"), and rewards looks each one up in reward_table. The call to group_relative_advantages() turns those rewards into advantages: the whole trick. The nested loop computes the policy gradient: for each sampled action and its advantage, it loops over all logits and accumulates A * (1[j==a] − pi_j), which is the softmax log-prob gradient. The return statement steps the logits up by this gradient, scaled by the learning rate and averaged over the group (/G). Result: updated logits that make above-average actions slightly more likely.

What both accomplish together: The group statistics (mean, std) replace a learned critic; the advantage-weighted gradient is the policy-gradient theorem in action; and each sampled action's log-prob gradient is scaled by one advantage for the whole completion (no per-token credit assignment, just per-completion). This is GRPO in miniature.

5. Production tradeoffs

| Method | Critic? | Baseline / advantage | KL handling | Models in memory | Best for | Main failure mode |
|---|---|---|---|---|---|---|
| PPO | Yes (learned V) | GAE on critic values | folded into reward | 4 (policy, critic, ref, RM) | Principled, dense rewards | critic mis-fit poisons advantages; 4x memory |
| GRPO | No | group mean, /std | explicit loss term | 2–3 | Cheap reasoning RL | length + difficulty bias; std→0 blowup |
| Dr.GRPO | No | group mean, no /std | removed | 2 | Removing GRPO bias | reverse verbosity bias; less stable |
| DAPO | No | group mean, dynamic sampling | removed | 2 | Long-CoT, fast convergence | 4 new knobs; sampling overhead |
| GSPO | No | group mean, seq-level IS | seq-level clip | 2 | MoE stability, low variance | binary clip discards good sequences |
| CISPO | No | group mean, clip the IS weight | bounded weight | 2 | Long reflective CoT | still bounds far-off-policy magnitude |
| DPO | No | implicit reward β·log(π/π_ref) | via β | 2 (policy, ref) | Cheap offline preference tuning | likelihood displacement; reward hacking |

Two senior instincts to say out loud: (1) The critic is PPO's biggest liability, not its KL term — a wrong value function corrupts every advantage silently, which is why critic-free methods won the reasoning era. (2) Dropping KL (DAPO, Dr.GRPO) is fine for RLVR but dangerous for RLHF — with a verifiable reward you can't game the reference into gibberish, but against a learned reward model the KL leash is what stops reward hacking. The right answer is reward-design-dependent, and saying so signals you've shipped one.

6. How it's asked

[IC5] "Derive PPO's clipped surrogate. Why the min?" L = E[min(r_t·Â_t, clip(r_t, 1−ε, 1+ε)·Â_t)] where r_t = π_θ/π_old. The min makes the bound pessimistic: for Â>0 the clip caps how much you can raise the probability on one batch (a dead zone past 1+ε), preventing over-commitment; for Â<0, symmetric. The asymmetry that matters: if a token is already past the band in the wrong direction, the min picks the unclipped term so the gradient stays live and pulls it back. It's a cheap first-order trust region — TRPO's hard KL constraint without the second-order solve.
[IC5] "How does GRPO drop the critic, and what does that cost?" Sample G completions per prompt; advantage A_i = (r_i − mean)/std uses the group mean as the baseline a critic would have learned. Saves ~half the memory (2–3 models vs 4) and the fragile critic loop. Cost: the /std blows up gradients when all G rewards match (std→0) — the dominant instability in sparse reasoning RL — plus a difficulty bias toward low-variance prompts, and the 1/|o_i| length normalization biases toward short-correct/long-incorrect. Clamp the std, filter degenerate groups, or drop the normalizations (Dr.GRPO).
[IC6] "Dr.GRPO, DAPO, GSPO, CISPO — what does each fix and break?" Dr.GRPO removes the length and std normalizations (kills both biases; risks reverse verbosity bias). DAPO adds clip-higher (stops entropy collapse), dynamic sampling (drops zero-signal groups), token-level loss (long CoT), overlong shaping (conciseness); removes KL — but adds four knobs. GSPO lifts importance sampling to sequence level (kills token-noise, stabilizes MoE without routing replay) but its binary clip can discard a good sequence over one outlier token. CISPO clips the IS weight not the token update, preserving rare reflection-token gradients — but still bounds far-off-policy magnitude. None solves per-token credit assignment.
[IC6] "Why does GRPO destabilize MoE RL, and how does GSPO fix it?" In MoE, inference and training engines route independently; ~10% of routers disagree per forward pass and routing shifts after every update. Token-level importance ratios then spike chaotically and clipping fires unpredictably → collapse. GSPO's sequence-level ratio averages over the whole completion, so transient per-layer routing divergence washes out — no Rollout Routing Replay needed. It's the canonical proof that token-vs-sequence-level IS is a real engineering decision, not theory.
[IC5] "No reward model, no sampling — why ever use GRPO over DPO?" DPO is offline imitation of a fixed preference set: it can only re-rank responses it was shown, and suffers likelihood displacement (chosen log-prob drops too) and reward hacking (over-suppressing rejected). GRPO explores online — it can discover completions better than anything in your dataset, which is where reasoning capability gains come from. DPO for cheap alignment/style; GRPO-family for capability.

7. Pitfalls & flashcards

  • Forgetting the std→0 clamp. All-correct or all-wrong groups have std = 0, so an unguarded divide returns NaN, and near-identical rewards get inflated into large, noisy advantages. Clamp the denominator or filter the group (DAPO).
  • Leaving GRPO's length normalization in for reasoning. The 1/|o_i| term structurally prefers short-correct and long-incorrect — it can quietly collapse chain-of-thought. Know that Dr.GRPO/DAPO removed it on purpose.
  • Symmetric clipping causing entropy collapse. A symmetric ε slowly starves exploration; clip-higher (asymmetric bounds) is the fix.
  • Token-level IS on an MoE without sequence-level aggregation. Routing volatility makes per-token ratios diverge — use GSPO or routing replay.
  • Trusting DPO's preference accuracy as a quality signal. Likelihood displacement means both chosen and rejected log-probs can fall while the margin grows; calibration rots invisibly.
  • Removing KL against a learned reward model. Fine for verifiable rewards; an invitation to reward-hack a reward model.

Flashcard. Every method is a choice of baseline + a way to tame the importance ratio. PPO: learned critic + per-token clipped ratio. GRPO: group-mean baseline (no critic) + KL as a loss term — but it ships a length bias (1/|o_i|) and a difficulty bias (/std, blows up at std→0). Dr.GRPO removes both normalizations; DAPO adds clip-higher + dynamic sampling + token-level loss; GSPO clips at the sequence level (fixes MoE); CISPO clips the IS weight not the update (saves reflection tokens). DPO skips RL: implicit reward β·log(π/π_ref), offline, prone to likelihood displacement.

8. Further reading

Next: RL Infrastructure — how these objectives actually run at scale (rollout/training split, async RL, MoE routing replay, vLLM vs SGLang), then the RL interview benchmark to pressure-test all of it.
