Policy-gradient RL for LLMs is one question asked a dozen ways — how do you turn a scalar reward into a stable, low-variance gradient on a 100k-vocab autoregressive policy? PPO answers with a clipped surrogate and a learned critic; GRPO drops the critic and lets the group mean be the baseline; DPO skips sampling entirely. Then Dr.GRPO, DAPO, GSPO, and CISPO are a precise sequence of bug-fixes to GRPO's biases. Derive each, and know exactly what it fixes and what it breaks.
Every RL post-training algorithm in this lesson is an answer to one engineering question: how do you turn a scalar reward into a low-variance, stable gradient on a 100k-vocab autoregressive policy? You sample completions, score them, and nudge the policy to make above-average completions more likely. Everything else (critics, clipping, group baselines, importance ratios) is machinery for making that nudge not blow up.
Three families own the landscape:
- PPO: a clipped surrogate objective plus a learned critic that supplies the baseline.
- GRPO: drop the critic, sample G completions per prompt, and let the group mean be the baseline. This is what trained DeepSeek-R1, and it cut RL infrastructure cost enough to make frontier reasoning RL accessible.
- DPO: skip sampling entirely and fit preference pairs through an implicit reward r = β·log(π_θ/π_ref). No sampling, no reward model, no rollout loop.

Then the variant zoo (Dr.GRPO, DAPO, GSPO, CISPO, and a dozen cousins) is not random proliferation. It is a precise, dated sequence of bug-fixes to GRPO's two original sins: a length bias and a difficulty bias baked into how it normalizes the advantage. Know the bugs and the zoo organizes itself.
This is the load-bearing lesson of the RL pillar. An interviewer is checking whether you can derive these objectives, not recite acronyms. The tell at each level:
β is the KL knob?

A candidate who says "GRPO is PPO without the critic" and stops has named the title of the chapter. The job is the chapter.
The words first.
- Reward: a scalar score for one completion (e.g. 1 if a math answer is correct, 0 if not).
- Advantage: reward − baseline, i.e. how much better (positive) or worse (negative) a sample did than expected.

Step by step.

- advantage = reward − baseline. Positive means above par.
- GRPO additionally rescales by the group's spread ((r − mean) / std).

Remember this: you train a model by scoring its own answers and nudging it toward the above-average ones, and GRPO's trick is letting a group of answers be their own "average," so no extra critic network is needed.
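Every method here descends the same gradient. The policy π_θ defines a distribution over completions; you want to maximize expected reward J(θ) = E_{y~π_θ}[r(y)]. The policy gradient theorem gives:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta}\big[\, r(y)\, \nabla_\theta \log \pi_\theta(y) \,\big]$$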
In words: push up the log-probability of completions weighted by their reward. This is REINFORCE. It is unbiased but has murderous variance — r(y) for an LLM completion can be anything, and you are estimating an expectation over an astronomically large output space from a handful of samples.
Let's name every part. The left side ∇_θ J(θ) is "the direction to step the model weights to maximize expected reward." The right side has three pieces: E means "average over many samples"; r(y) is the scalar reward for one completion (e.g. 1 for correct, 0 for wrong); ∇_θ log π_θ(y) is the gradient of log-probability — it points in the direction that makes completion y more likely.
Here's a tiny concrete example. Suppose we sample 4 completions with rewards [1.0, 0.0, 1.0, 0.0]. The estimate is the average over all 4 samples of reward × log-prob gradient, so only the first and third completions (reward = 1) contribute; the other two are multiplied by zero. Stepping along that average makes the good completions more likely, and because the policy's probabilities must sum to 1, the zero-reward completions lose probability mass as a side effect even though nothing pushes on them directly. One step: each completion's effect on the weights is reward × its-log-prob-gradient.
What just happened: we moved the policy toward higher-reward behavior by weighting each completion's learning signal by how good it was. The catch is variance — with only 4 samples in a 100k-vocab space, that average is a very noisy estimate of the true expectation.
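The fix is a baseline b. Subtract any quantity that doesn't depend on the action and the estimator stays unbiased but its variance drops:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta}\big[\, (r(y) - b)\, \nabla_\theta \log \pi_\theta(y) \,\big]$$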
The quantity A = r(y) − b is the advantage: "how much better than baseline was this completion?" Every algorithm in this lesson is a different choice of baseline b and a different way to safely reuse off-policy samples. PPO learns b with a neural network. GRPO sets b to the group mean. That single design decision is the whole story.
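The unbiasedness and variance claims can be checked exactly on a toy problem. The sketch below assumes a 4-action softmax bandit with uniform probabilities and the rewards [1, 0, 1, 0] from the example above; the function names are illustrative and not taken from any library. It enumerates every action instead of sampling, so the mean and variance of the estimator are exact.

```python
def softmax_score(action, probs):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs."""
    return [(1.0 if j == action else 0.0) - p for j, p in enumerate(probs)]


def gradient_mean_and_variance(probs, rewards, baseline=0.0):
    """Exact mean and total variance of (r(a) - b) * grad log pi(a)."""
    n = len(probs)
    # One gradient vector per possible action.
    per_action = [
        [(rewards[a] - baseline) * g for g in softmax_score(a, probs)]
        for a in range(n)
    ]
    # Expectation over actions drawn from the policy.
    mean = [sum(probs[a] * per_action[a][j] for a in range(n)) for j in range(n)]
    # Total variance: expected squared distance from the mean vector.
    variance = sum(
        probs[a] * sum((per_action[a][j] - mean[j]) ** 2 for j in range(n))
        for a in range(n)
    )
    return mean, variance


probs = [0.25, 0.25, 0.25, 0.25]
rewards = [1.0, 0.0, 1.0, 0.0]
mean_plain, var_plain = gradient_mean_and_variance(probs, rewards)
mean_base, var_base = gradient_mean_and_variance(probs, rewards, baseline=0.5)
print("no baseline:  ", mean_plain, var_plain)
print("baseline 0.5: ", mean_base, var_base)
```

In this toy setting the two mean vectors coincide while the variance with the baseline is smaller, which is the whole argument for subtracting b.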
One more shared primitive: importance sampling. Generating rollouts is the expensive part, so you want to take several gradient steps on the same batch of completions. But after one step, π_θ ≠ the π_old that generated the data — the batch is now off-policy. You correct with the importance ratio r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t). The catch: over a long sequence these ratios are a product of per-token ratios and can swing by orders of magnitude, so the gradient variance diverges. Controlling that ratio is the central design axis of the entire variant zoo.
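To see why a product of per-token ratios is dangerous, here is a minimal sketch. It assumes a hypothetical rollout in which every token is only 1% more likely under the new policy; the helper name and the numbers are illustrative.

```python
import math


def sequence_ratio(logp_new, logp_old):
    """Full-sequence importance ratio: the product of per-token ratios,
    computed in log space for numerical stability."""
    return math.exp(sum(n - o for n, o in zip(logp_new, logp_old)))


# Hypothetical rollout: each token is 1% more likely under the new policy.
length = 1000
logp_old = [math.log(0.5)] * length
logp_new = [math.log(0.5 * 1.01)] * length

short_ratio = sequence_ratio(logp_new[:10], logp_old[:10])
long_ratio = sequence_ratio(logp_new, logp_old)
print("10 tokens:  ", short_ratio)
print("1000 tokens:", long_ratio)
```

A per-token drift that looks harmless compounds into a ratio in the tens of thousands over a long completion, which is why every method below bounds or reshapes the ratio in some way.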
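PPO makes off-policy reuse safe by refusing to trust the importance ratio too far. Its objective is the clipped surrogate:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]$$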
Symbol names: r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t) is the importance ratio — "how much more (or less) likely is this token under the new policy than the old one?" If the new policy makes a token twice as likely, r_t = 2.0. If it makes it half as likely, r_t = 0.5. Â_t is the advantage estimate (good action positive, bad action negative). ε ≈ 0.2 is the clip width. clip(x, 1−ε, 1+ε) clamps a number to the range [0.8, 1.2].
Concrete walk-through: Suppose Â_t = 1.0 (a good token) and after one gradient step the ratio rose to r_t = 1.5 (50% more likely). The two terms are:

- Unclipped: 1.5 × 1.0 = 1.5
- Clipped: clip(1.5, 0.8, 1.2) × 1.0 = 1.2 × 1.0 = 1.2
- Objective: min(1.5, 1.2) = 1.2

So the objective uses the clipped version (1.2), which is a constant in θ, so this token's gradient is zero. If the ratio kept rising to r_t = 2.0, the clipped term would still be capped at 1.2: you can't over-commit to one batch. Now suppose Â_t = −1.0 (a bad token) that we want to push down, and r_t = 0.1 (now 90% less likely). The unclipped term is 0.1 × (−1.0) = −0.1, the clipped term is clip(0.1, 0.8, 1.2) × (−1.0) = 0.8 × (−1.0) = −0.8, and the min is −0.8. The min again selects the clipped constant, so the gradient is zero here too: the token has already been pushed down past the band and PPO stops pushing. The asymmetry shows up in the opposite case. If that bad token's ratio had instead risen to r_t = 1.5, the unclipped term is −1.5, the clipped term is −1.2, and the min picks the unclipped −1.5, so the gradient stays live and pulls the probability back down. The clip only gates moves that go further in the improving direction.
What just happened: PPO avoids over-committing to old data by clamping how far the ratio can swing, but the min() keeps the gradient alive for corrections. It's a cheap trust region.
In the clipped objective, r_t(θ) is the per-token importance ratio above, Â_t is the advantage estimate, and ε ≈ 0.2 is the clip width. Why the min? This is the question. Consider the two cases:
- Â_t > 0 (good action): the unclipped term wants to raise r_t without bound. The clip caps the reward of doing so at 1+ε, so once you've moved the probability up by ~20% on this batch, the gradient flattens to zero (a dead zone). You can't over-commit to a single batch.
- Â_t < 0 (bad action): symmetric. The gradient flattens once you've pushed the probability down by ~20%.

The min makes the bound pessimistic: it always takes the smaller (less optimistic) of the clipped and unclipped objectives. The subtle payoff is asymmetric. If a token is already far outside the trust region in the wrong direction (ratio already pushed too high on a bad action), the min selects the unclipped term, so the gradient is still live and can pull it back. The clip only kills the gradient for moves that would push further in the improving direction. This is a cheap, first-order trust region, unlike TRPO, which enforces a hard D_KL(π_old‖π_new) ≤ δ constraint via second-order conjugate-gradient optimization. PPO drops the guarantee, keeps the stability, and scales to LLMs.
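The case analysis can be made mechanical with a small sketch. It treats a single token with a scalar ratio and advantage, and reports both the objective value and its slope with respect to the ratio (zero slope means the token contributes no gradient). The function names are illustrative, not a library API.

```python
def ppo_clip_term(ratio, adv, eps=0.2):
    """Per-token clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv, clipped_ratio * adv)


def ppo_clip_slope(ratio, adv, eps=0.2):
    """Derivative of the term w.r.t. the ratio.

    It equals adv when the unclipped branch is selected and 0 when the
    clipped (constant) branch is selected.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    if ratio * adv <= clipped_ratio * adv:
        return adv
    return 0.0


cases = [
    (1.5, 1.0),   # good token, already raised past the band: gradient dead
    (0.5, 1.0),   # good token that became LESS likely: gradient live
    (0.1, -1.0),  # bad token, already pushed down past the band: gradient dead
    (1.5, -1.0),  # bad token that became MORE likely: gradient live
]
for ratio, adv in cases:
    print(ratio, adv, ppo_clip_term(ratio, adv), ppo_clip_slope(ratio, adv))
```

The two "live" rows are the correction cases: the ratio has moved the wrong way relative to the advantage, so the min keeps the unclipped term and the update can pull it back.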
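Where does Â_t come from? A learned critic V(s), a value head the same size as the policy, estimates expected future reward, and GAE (Generalized Advantage Estimation) blends multi-step TD residuals:

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = R_t + \gamma V(s_{t+1}) - V(s_t)$$

Here R_t is the reward received at step t (capitalized to keep it distinct from the importance ratio r_t(θ)).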
δ_t is the TD residual (reward plus discounted next-state value minus current value), γ is the discount, and λ ∈ [0,1] is the bias-variance knob: λ=0 uses only the 1-step estimate (low variance, high bias from a wrong critic), λ=1 is full Monte Carlo (unbiased, high variance). Practice lives at λ ∈ [0.9, 0.99]. For LLMs the reward is usually a single terminal scalar (correct/incorrect at the end), so GAE mostly propagates that one signal backward through the critic's value estimates.
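A minimal sketch of the GAE recursion, assuming a finished episode (the value after the last step is taken to be 0) and a single terminal reward as in the LLM setting described above. The critic values here are made-up numbers for illustration.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one finished episode.

    rewards[t] and values[t] are aligned per step; the value after the
    final step is assumed to be 0 because the episode has ended.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # discounted sum of residuals
        advantages[t] = running
    return advantages


rewards = [0.0, 0.0, 1.0]   # one terminal reward (e.g. answer was correct)
values = [0.5, 0.5, 0.5]    # hypothetical critic estimates

print(gae_advantages(rewards, values, lam=1.0))  # Monte Carlo: return minus value
print(gae_advantages(rewards, values, lam=0.0))  # 1-step TD residuals only
print(gae_advantages(rewards, values, lam=0.5))  # in between
```

With λ = 1 every step sees the full terminal reward minus its own value estimate; with λ = 0 only the last step sees it, and earlier steps depend entirely on the critic.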
The cost. PPO keeps four models resident: the policy (training), the critic (training), the frozen reference (for the KL penalty), and the reward model — plus optimizer states for the two trainable ones. The critic is the killer: it's policy-sized, must be trained alongside, and a badly-fit critic silently poisons every advantage. That overhead is exactly what GRPO deletes.
GRPO (introduced in DeepSeekMath, Shao et al. 2024; scaled in DeepSeek-R1) makes one observation: if you're going to sample anyway, sample a group of G completions for the same prompt and use the group's own statistics as the baseline. No critic needed. The advantage for completion i is:

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$$
Symbol names: r_i is the reward (a single number) for completion i — say, 1 if a math problem was solved, 0 otherwise. mean(r_1, ..., r_G) is the average reward across the group of G completions. std(...) is the standard deviation — a measure of how spread-out the rewards are.
Concrete walk-through: Suppose G = 4 completions have rewards [3.0, 1.0, 2.0, 0.0]. The mean is (3+1+2+0)/4 = 1.5. The variance is ((3−1.5)² + (1−1.5)² + (2−1.5)² + (0−1.5)²)/4 = (2.25 + 0.25 + 0.25 + 2.25)/4 = 1.25, so std = √1.25 ≈ 1.118. Now compute A for each:
- A_1 = (3.0 − 1.5) / 1.118 ≈ 1.34 (above average by ~1.3 std)
- A_2 = (1.0 − 1.5) / 1.118 ≈ −0.45 (below average)
- A_3 = (2.0 − 1.5) / 1.118 ≈ 0.45 (slightly above)
- A_4 = (0.0 − 1.5) / 1.118 ≈ −1.34 (far below)

What just happened: completion 1 gets a strong positive advantage (its tokens' probability is raised), completion 4 gets a strong negative one (its tokens are lowered). Completions 2 and 3 get much smaller advantages (±0.45) because they're close to the group mean, so the policy pushes them only weakly. The whole group's statistics become the baseline; no critic needed.
r_i is the scalar reward for completion i, the mean over the group is the baseline b, and dividing by the group std normalizes scale. The same A_i is assigned to every token in completion i: there is no per-token credit assignment, because the reward model emits one scalar per completion and per-token credit is impractical. Above-average completions get reinforced and below-average ones suppressed. As the whole group converges in quality the centered rewards r_i − mean shrink toward zero (and vanish exactly when all G rewards match), so the signal fades on its own and you don't need PPO's adaptive KL scheduler.
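The full token-level loss adds an explicit KL penalty:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(r_{i,t}(\theta)\,A_i,\ \operatorname{clip}(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon)\,A_i\big) - \beta\, D_{\text{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big]\Big)\right]$$

Here r_{i,t}(θ) is the per-token importance ratio for token t of completion o_i.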
|o_i| is the token length of completion i, β weights the KL, and D_KL uses the k3 unbiased, non-negative estimator D_KL = π_ref/π_θ − log(π_ref/π_θ) − 1 (always ≥ 0, lower variance than the naive form). Note the key difference from PPO: GRPO applies KL as a direct loss term, not folded into the reward. PPO bakes it into the scalar (total_reward = rm(x,y) − β·KL); GRPO penalizes per-token deviation from the reference directly.
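The two properties claimed for the k3 estimator (non-negative per sample, unbiased in expectation under π_θ) can be checked on a small categorical distribution. This is a sketch with made-up probabilities; `k3_estimate` and `exact_kl` are illustrative names.

```python
import math


def k3_estimate(logp_theta, logp_ref):
    """Per-token k3 estimator: ratio - log(ratio) - 1 with ratio = pi_ref / pi_theta."""
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0


def exact_kl(p, q):
    """Exact KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


pi_theta = [0.5, 0.3, 0.2]  # hypothetical next-token distribution of the policy
pi_ref = [0.4, 0.4, 0.2]    # hypothetical reference distribution

per_token = [k3_estimate(math.log(p), math.log(q)) for p, q in zip(pi_theta, pi_ref)]
# Expectation under tokens drawn from pi_theta.
expected_k3 = sum(p * k for p, k in zip(pi_theta, per_token))

print("per-token k3:", per_token)
print("E[k3]:", expected_k3, " exact KL:", exact_kl(pi_theta, pi_ref))
```

Every per-token value is at least zero, and the expectation over tokens sampled from π_θ matches the exact KL(π_θ‖π_ref) for these distributions.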
The payoff is infrastructural: GRPO holds 2–3 model copies (policy, reference, optional reward model) versus PPO's four, cutting RL memory/compute roughly in half and removing the fragile critic-training loop. That cost reduction is why large-scale reasoning RL became practical.
Advantage A = (r − μ) / σ with μ=0.55, σ=0.32. Bars above the mean get a positive advantage (reinforced); below get negative (suppressed). No value network — the other samples in the group are the baseline. That's the whole trick that makes GRPO so much cheaper than PPO.
The denominator std(r_1..r_G) is where the bodies are buried, and IC6 interviews go straight here.
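Failure mode: degenerate and exploding advantages on easy/solved groups. When every completion in a group gets the same reward (all correct or all wrong, common in RLVR once the model masters a prompt), the numerator and the std are both exactly zero: unclamped that is 0/0, and clamped the group yields no learning signal at all. When rewards are nearly identical, std → 0 while the numerator is not quite zero, so the division amplifies tiny, noisy differences into full-size advantages and massively amplified, noisy gradients. This is the dominant instability in sparse-reward reasoning RL. Mitigations: clamp the denominator (std + 1e-6), or filter degenerate groups entirely (DAPO's dynamic sampling), or drop the /std (Dr.GRPO).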
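A short sketch of what the /std does near the degenerate case. The reward lists are made up: one group with a clear signal, one where a reward differs from the rest by only 0.0001 (think of a noisy continuous reward model), and one where all rewards match exactly.

```python
import math


def normalized_advantages(rewards, eps=0.0):
    """(r - mean) / (std + eps), using the population std over the group."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]


clear_signal = normalized_advantages([0.0, 0.0, 0.0, 1.0])
tiny_noise = normalized_advantages([1.0, 1.0, 1.0, 1.0001])
# eps is required here: with eps=0.0 this group would be a 0/0 division.
all_equal = normalized_advantages([1.0, 1.0, 1.0, 1.0], eps=1e-6)

print("clear signal:", clear_signal)
print("tiny noise:  ", tiny_noise)
print("all equal:   ", all_equal)
```

Std-normalization is scale-invariant, so a 0.0001 reward difference produces essentially the same advantages as a full correct-versus-wrong gap, while the exactly-tied group contributes nothing once the denominator is clamped.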
Difficulty bias. Std-normalization gives prompts with low reward variance (very easy or very hard) systematically larger-magnitude advantages than medium-difficulty prompts, so the optimizer over-weights problems it has nothing to learn from.

Length bias. The 1/|o_i| per-response normalization means a wrong-but-long completion gets each of its tokens penalized less than a wrong-but-short one, and a correct-but-short one rewarded more per token than a correct-but-long one. Net effect: GRPO has a structural preference for shorter correct answers and longer incorrect answers, which is exactly backwards for reasoning, where it can collapse useful chain-of-thought.
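The length bias is just arithmetic on the per-token weight, which the following sketch makes explicit. The lengths and the constant normalizer (`max_len`, a hypothetical generation budget) are assumptions for illustration; the exact constant used by any given implementation may differ.

```python
def per_token_weight_grpo(advantage, length):
    """GRPO-style: each token of a response is weighted by A_i / |o_i|."""
    return advantage / length


def per_token_weight_constant(advantage, max_len):
    """Constant-normalizer style: each token is weighted by A_i / C,
    where C does not depend on the response's own length."""
    return advantage / max_len


wrong_short = per_token_weight_grpo(-1.0, length=100)
wrong_long = per_token_weight_grpo(-1.0, length=1000)
print("GRPO, wrong short:", wrong_short)  # larger per-token penalty
print("GRPO, wrong long: ", wrong_long)   # 10x smaller per-token penalty

const_short = per_token_weight_constant(-1.0, max_len=1000)
const_long = per_token_weight_constant(-1.0, max_len=1000)
print("constant normalizer:", const_short, const_long)  # identical per token
```

Under the per-response normalizer, padding a wrong answer with more tokens dilutes the penalty on each of them; with a length-independent constant, every token of a wrong answer is penalized equally.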
Hold these two biases in your head. The entire next subsection is "who removed which normalization, and what broke."
DPO (Rafailov et al., 2023) is the odd one out: no sampling, no rollouts, no reward model, no critic. Given preference pairs (x, y_w, y_l), i.e. prompt, chosen, rejected, it derives a closed-form loss by inverting the Bradley-Terry preference model. The trick is recognizing that the RLHF-optimal policy has a closed form, which lets you define an implicit reward:

$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$$

and the loss becomes a simple binary classification of which response is preferred:

$$L_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

σ is the logistic function; β (typically in [0.1, 0.5]) scales both the implicit reward magnitude and the KL regularization strength: large β keeps you close to the reference, small β lets the policy drift further from it to fit the preferences. It's cheap, stable, and fully offline, which is why it's the default for preference tuning.
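For a single preference pair the loss is a few lines of arithmetic. The sketch below assumes you already have summed sequence log-probabilities under the policy and the frozen reference; the argument names and example numbers are hypothetical.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities: logp_* under the policy being
    trained, ref_logp_* under the frozen reference.
    """
    reward_w = beta * (logp_w - ref_logp_w)  # implicit reward of the chosen response
    reward_l = beta * (logp_l - ref_logp_l)  # implicit reward of the rejected response
    margin = reward_w - reward_l
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(at_init, improved)
```

Note that the loss only sees the margin between the two implicit rewards, so it can fall even if the chosen response's own log-probability does not rise.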
Two failure modes interviewers love:
- Likelihood displacement: the loss can be lowered by driving the rejected response's log-probability toward −∞ (over-suppression) rather than improving the chosen response. The implicit reward gap explodes while actual chosen quality plateaus or degrades. Fixes (POWER, PG-DPO, DPO-Shift) bound the preference score or add a regularizer that protects chosen-response likelihood.

The honest senior framing: DPO is offline contextual-bandit imitation of a preference dataset; GRPO is online exploration. DPO can only re-rank responses it was shown; it cannot discover a better completion the way an online sampler can. That's why frontier reasoning models use GRPO-family RL, not DPO, for capability gains.
Read this as a changelog against §3.4's two biases and §3.1's importance-ratio problem.
Dr.GRPO (Understanding R1-Zero, 2025) — removes the normalizations. Drops the 1/|o_i| length normalization (replacing it with a global constant) and the /std difficulty normalization, and removes KL. Fixes both of GRPO's biases at the source. Breaks: removing per-response length scaling can introduce the reverse bias — it now mildly incentivizes verbosity — and training is a bit less stable. (Learnable-aggregation variants like λ-GRPO try to get the best of both by learning the per-token weight instead of fixing it.)
DAPO (ByteDance, 2025) — four orthogonal fixes. (a) Clip-Higher: asymmetric clip bounds (a higher ceiling than floor) so the policy can still raise probability on promising new tokens — prevents the entropy collapse that symmetric clipping causes. (b) Dynamic Sampling: discard all-correct and all-wrong groups (zero gradient signal, or the std→0 blowup) and resample, keeping every batch informative. (c) Token-level loss: aggregate over tokens rather than averaging per-sequence, so long chains-of-thought get proportionate signal. (d) Overlong reward shaping: an explicit soft penalty for excessive length instead of GRPO's indirect, biased length normalization. Removes KL entirely. Result: 50 points on AIME 2024 with Qwen2.5-32B in ~half the steps of DeepSeek-R1-Zero. Breaks: four new things to tune, dynamic sampling adds rollout overhead, and credit assignment is still per-response-uniform.
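Two of the four fixes are easy to sketch in isolation: dynamic sampling as a filter over groups, and Clip-Higher as an asymmetric clamp. The prompt names, the dict layout, and the bounds 0.2 / 0.28 below are illustrative choices for this sketch, not a claim about any particular implementation.

```python
def keep_informative_groups(groups):
    """Dynamic-sampling filter: drop prompts whose sampled rewards are all
    identical (all correct or all wrong), since they carry no relative signal."""
    return {
        prompt: rewards
        for prompt, rewards in groups.items()
        if max(rewards) != min(rewards)
    }


def clip_higher(ratio, eps_low=0.2, eps_high=0.28):
    """Asymmetric clip: the ceiling (1 + eps_high) sits further from 1
    than the floor (1 - eps_low)."""
    return min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)


groups = {
    "prompt_a": [1.0, 1.0, 1.0, 1.0],  # mastered: filtered out
    "prompt_b": [0.0, 1.0, 0.0, 1.0],  # informative: kept
    "prompt_c": [0.0, 0.0, 0.0, 0.0],  # hopeless for now: filtered out
}
kept = keep_informative_groups(groups)
print(sorted(kept))
print(clip_higher(1.5), clip_higher(0.5))
```

In a real loop the filtered-out prompts would be replaced by freshly sampled ones until the batch is full, which is where the extra rollout cost comes from.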
GSPO (Qwen, 2025) — lifts importance sampling to the sequence level. Instead of a noisy per-token ratio it uses a single length-normalized sequence ratio (geometric mean of per-token ratios, via (1/|o_i|)·log(π_θ/π_old)), and clips the whole sequence on or off. Fixes: token-level IS is theoretically broken for single samples and its ratios swing >10x; sequence-level averaging cancels that noise. Critically, it stabilizes MoE RL without Routing Replay — see §3.7. Breaks: binary all-or-nothing clipping is coarse; one outlier off-policy token can discard an otherwise good sequence's entire gradient. (SAPO replaces the hard clip with a soft sigmoid gate; SSPO clips at sub-sentence granularity as a middle ground.)
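The length-normalized sequence ratio is the geometric mean of the per-token ratios, which a few lines can illustrate. The rollout below is hypothetical: 100 tokens, 99 of them unchanged and one whose probability jumped 10x.

```python
import math


def gspo_sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence ratio: exp of the mean per-token log-ratio,
    i.e. the geometric mean of the per-token importance ratios."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


# Hypothetical 100-token completion with a single 10x outlier token.
logp_old = [math.log(0.05)] * 100
logp_new = list(logp_old)
logp_new[40] = math.log(0.5)

token_ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
seq_ratio = gspo_sequence_ratio(logp_new, logp_old)
print("largest token ratio:", max(token_ratios))
print("sequence ratio:     ", seq_ratio)
```

The single outlier that would blow straight through a token-level clip band moves the sequence-level ratio by only a couple of percent in this example.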
CISPO (MiniMax-M1, 2025): clips the importance weight, not the token update. Standard PPO/GRPO clipping zeroes the gradient for high-ratio tokens, which silently kills the rare "reflection" tokens ("Wait", "However", "Let me recheck") that are high-IS precisely because they're pivotal. Once clipped, they never contribute to later off-policy updates. CISPO instead writes log π(token) · clip(ratio), with the clipped ratio treated as a constant weight: it bounds the magnitude of every token's gradient but never sets it to zero, so reflection tokens keep learning across 512-step reasoning rollouts. Result: enabled MiniMax-M1's full RL run (512 H800s, 3 weeks, ~$534K) with stable convergence. Breaks: far-off-policy tokens still have their update magnitude bounded; it's a softer clip, not a free lunch.
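The difference is easiest to see as the coefficient that multiplies ∇ log π(token) in each scheme. This is a scalar sketch under stated assumptions: symmetric illustrative bounds, and the clipped ratio in the CISPO-style term treated as a constant (stop-gradient) weight. The function names are not from any library.

```python
def ppo_grad_coeff(ratio, adv, eps=0.2):
    """Coefficient on grad log pi(token) under PPO/GRPO clipping.

    When the unclipped branch of the min is active the coefficient is
    ratio * adv; when the clipped branch is active it is exactly 0.
    """
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    live = ratio * adv <= clipped * adv
    return ratio * adv if live else 0.0


def cispo_grad_coeff(ratio, adv, eps_low=0.2, eps_high=0.2):
    """CISPO-style coefficient: the clipped ratio is a bounded constant
    weight on adv * grad log pi(token), so it is never zeroed out."""
    weight = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return weight * adv


# A pivotal token whose probability tripled, in a good completion.
print(ppo_grad_coeff(3.0, 1.0), cispo_grad_coeff(3.0, 1.0))
# A token still inside the band: both schemes agree.
print(ppo_grad_coeff(1.1, 1.0), cispo_grad_coeff(1.1, 1.0))
```

For the high-ratio token the PPO-style coefficient is zero while the CISPO-style one is capped but nonzero, which is the sense in which the token "keeps learning".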
The rest of the zoo, in one line each: MaxRL reweights toward Pass@k instead of expected reward (7.9–19x more sample-efficient for multi-sample success, but needs verifiable binary rewards); SimKO adds entropy regularization to fight the probability-concentration that kills Pass@k; λ-GRPO makes the token-aggregation weight learnable and unifies GRPO/Dr.GRPO/DAPO as special cases; DPPO replaces heuristic ratio-clipping with a direct (top-k or binary) policy-divergence estimate. None of them solves per-token credit assignment — that remains the open problem (see §7).
DeepSeek-R1, Qwen3, and most frontier models are Mixture-of-Experts. Here token-level GRPO has a specific, vicious failure. The inference engine (vLLM/SGLang) and the training engine (FSDP/Megatron) route tokens to experts independently; even with identical weights, ~10% of routers disagree per forward pass and 94% of tokens differ in expert assignment in at least one layer. After each policy update the routing shifts again. The result: importance ratios spike chaotically, PPO-style clipping triggers unpredictably, and training collapses. GSPO's sequence-level ratio sidesteps this — if routing diverges in layer k and re-aligns by k+2, the sequence-level average washes it out, so you don't need the "Rollout Routing Replay" workaround that token-level methods require. This is the cleanest example in the field of why the token-vs-sequence axis is not academic. (Full treatment in RL Infrastructure.)
The two ideas that make GRPO work — group-relative advantage and an advantage-weighted policy-gradient step — fit in a few lines with no critic, no reward model, and no GPU. The full runnable version (a 4-action bandit the policy learns to solve from relative rewards alone) is in examples/rl/grpo_min.py.
```python
import math
import random


def group_relative_advantages(rewards, eps=1e-6, normalize_std=True):
    """A_i = (r_i - mean) / (std + eps), the GRPO baseline.

    The group mean is the baseline a PPO critic would have learned; here it
    falls out of sampling for free. `eps` clamps the denominator: when every
    completion scores the same (std -> 0) the naive divide explodes the
    gradient. That is THE std-normalization failure mode in sparse-reward RL.
    Dr.GRPO drops the /std entirely (normalize_std=False) to kill the
    difficulty bias it introduces.
    """
    G = len(rewards)
    mean = sum(rewards) / G
    if not normalize_std:  # Dr.GRPO style
        return [r - mean for r in rewards]
    var = sum((r - mean) ** 2 for r in rewards) / G
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def grpo_step(logits, reward_table, G=16, lr=0.5):
    """One GRPO update on a softmax policy over discrete actions.

    Sample a GROUP, standardize its rewards, then ascend the
    advantage-weighted log-prob gradient. For softmax policies,
    d/d(logit_j) log pi(a) = (1[j==a] - pi_j). We accumulate
    A_i * that over the group and step UP; the group-mean baseline is
    already baked into A, so there is no separate value network anywhere.
    """
    z = max(logits)
    exps = [math.exp(l - z) for l in logits]
    s = sum(exps)
    probs = [e / s for e in exps]                 # softmax(logits)
    actions = [_sample(probs) for _ in range(G)]  # G "completions"
    rewards = [reward_table[a] for a in actions]  # one scalar each
    adv = group_relative_advantages(rewards)      # the whole trick
    grad = [0.0] * len(logits)
    for a, A in zip(actions, adv):
        for j in range(len(logits)):
            grad[j] += A * ((1.0 if j == a else 0.0) - probs[j])
    # /G averages the accumulated gradient over the group.
    return [l + lr * g / G for l, g in zip(logits, grad)]


def _sample(probs):
    """Draw one index from a categorical distribution by inverse CDF."""
    u, c = random.random(), 0.0
    for i, p in enumerate(probs):
        c += p
        if u <= c:
            return i
    return len(probs) - 1  # guard against floating-point shortfall in the cumsum
```

What's honest about this and what's not: the advantage math is exactly GRPO's, and the group-mean baseline enters through the mean subtraction inside group_relative_advantages (the /G in the update only averages the gradient over the group). What's missing for a real run is the importance ratio and clipping (this does one on-policy step, so π_θ == π_old and the ratio is 1), the KL-to-reference penalty, and a real reward model or verifier. Add those and you have the loss in §3.3. For a production loop, use verl or TRL; see RL Infrastructure.
group_relative_advantages(rewards, eps=1e-6, normalize_std=True): This function takes a list of reward scalars (one per completion in the group) and returns the advantage for each. It computes the mean and std exactly as written in §3.3: it sums the rewards and divides by G, computes the population variance, and takes the square root. The eps=1e-6 clamps the denominator: the divisor is std + eps, so it can never be smaller than 1e-6 even when std is zero. The normalize_std=False flag lets you drop the /std division entirely (Dr.GRPO style), so you get just A_i = r_i − mean without the scaling. Result: a list of advantages, one per completion, all normalized by the group's own statistics.

grpo_step(logits, reward_table, G=16, lr=0.5): This is one GRPO update on a softmax policy. The first four statements compute the softmax probabilities from the input logits (subtracting z = max(logits) to avoid overflow). The function then samples G actions from the policy (those are the "completions"), looks up the reward for each action in the reward_table, and calls group_relative_advantages() to get the advantage for each action, which is the whole trick. The nested loop computes the policy gradient: for each sampled action and its advantage, it loops over all logits and accumulates A * (1[j==a] − pi_j), which is the softmax log-prob gradient. The return statement steps the logits up by this gradient, scaled by the learning rate and averaged over the group (/G). Result: updated logits that make above-average actions slightly more likely.

What both accomplish together: The group statistics (mean, std) replace a learned critic; the advantage-weighted gradient is the policy-gradient theorem in action; the /G averages the per-sample gradients, and each sampled action carries a single advantage for the whole completion (no per-token credit assignment, just per-completion). This is GRPO in miniature.
| Method | Critic? | Baseline / advantage | KL handling | Models in memory | Best for · main failure mode |
|---|---|---|---|---|---|
| PPO | Yes (learned V) | GAE on critic values | folded into reward | 4 (policy, critic, ref, RM) | Principled, dense rewards · critic mis-fit poisons advantages; 4x memory |
| GRPO | No | group mean, /std | explicit loss term | 2–3 | Cheap reasoning RL · length + difficulty bias; std→0 blowup |
| Dr.GRPO | No | group mean, no /std | removed | 2 | Removing GRPO bias · reverse verbosity bias; less stable |
| DAPO | No | group mean, dynamic sampling | removed | 2 | Long-CoT, fast convergence · 4 new knobs; sampling overhead |
| GSPO | No | group mean, seq-level IS | seq-level clip | 2 | MoE stability, low variance · binary clip discards good sequences |
| CISPO | No | group mean, clip the IS weight | bounded weight | 2 | Long reflective CoT · still bounds far-off-policy magnitude |
| DPO | No | implicit reward β·log(π/π_ref) | via β | 2 (policy, ref) | Cheap offline preference tuning · likelihood displacement; reward hacking |
Two senior instincts to say out loud: (1) The critic is PPO's biggest liability, not its KL term — a wrong value function corrupts every advantage silently, which is why critic-free methods won the reasoning era. (2) Dropping KL (DAPO, Dr.GRPO) is fine for RLVR but dangerous for RLHF — with a verifiable reward you can't game the reference into gibberish, but against a learned reward model the KL leash is what stops reward hacking. The right answer is reward-design-dependent, and saying so signals you've shipped one.
L = E[min(r_t·Â_t, clip(r_t, 1−ε, 1+ε)·Â_t)] where r_t = π_θ/π_old. The min makes the bound pessimistic: for Â>0 the clip caps how much you can raise the probability on one batch (a dead zone past 1+ε), preventing over-commitment; for Â<0, symmetric. The asymmetry that matters: if a token is already past the band in the wrong direction, the min picks the unclipped term so the gradient stays live and pulls it back. It's a cheap first-order trust region: TRPO's hard KL constraint without the second-order solve.

Sample G completions per prompt; advantage A_i = (r_i − mean)/std uses the group mean as the baseline a critic would have learned. Saves ~half the memory (2–3 models vs 4) and the fragile critic loop. Cost: the /std blows up gradients when all G rewards match (std→0), the dominant instability in sparse reasoning RL, plus a difficulty bias toward low-variance prompts, and the 1/|o_i| length normalization biases toward short-correct/long-incorrect. Clamp the std, filter degenerate groups, or drop the normalizations (Dr.GRPO).

- The std→0 clamp. All-correct or all-wrong groups divide the advantage by ~zero and explode the gradient. Clamp the denominator or filter the group (DAPO).
- The 1/|o_i| term structurally prefers short-correct and long-incorrect, so it can quietly collapse chain-of-thought. Know that Dr.GRPO/DAPO removed it on purpose.
- ε slowly starves exploration; clip-higher (asymmetric bounds) is the fix.

Flashcard. Every method is a choice of baseline + a way to tame the importance ratio. PPO: learned critic + per-token clipped ratio. GRPO: group-mean baseline (no critic) + KL as a loss term, but it ships a length bias (1/|o_i|) and a difficulty bias (/std, blows up at std→0). Dr.GRPO removes both normalizations; DAPO adds clip-higher + dynamic sampling + token-level loss; GSPO clips at the sequence level (fixes MoE); CISPO clips the IS weight not the update (saves reflection tokens). DPO skips RL: implicit reward β·log(π/π_ref), offline, prone to likelihood displacement.
Next: RL Infrastructure — how these objectives actually run at scale (rollout/training split, async RL, MoE routing replay, vLLM vs SGLang), then the RL interview benchmark to pressure-test all of it.