Xiuyu Li's 35-question RL interview benchmark, answered in depth — Algorithm and Infrastructure. The reference answers the original deliberately leaves out.
A benchmark of 35 reinforcement-learning interview questions compiled by Xiuyu Li (@sheriyuo) (UC Berkeley) from real RL interview write-ups on Zhihu — CN original here. The source deliberately ships no reference answers ("memorizing interview questions is not enough — deep understanding matters far more"). These are ours: rigorous, current to June 2026, grounded in the primary methods. The line between LLM-RL and Agentic-RL is intentionally loose — several questions answer differently depending on the setting. Treat each as a launch point, then go deeper.
Pairs with the pillar: RL post-training · PPO, GRPO & the variant zoo · RL infrastructure.
Remember this: RL for language models is fundamentally a variance-reduction problem. You want to learn from a reward signal that is noisy and sparse, so the machinery (baselines, importance weighting, multi-epoch reuse, KL constraints) exists to make training stable without requiring impossibly large batch sizes.
A "pure Critic" (value-based) method such as Q-learning learns Q(s,a) and acts greedily via argmax over actions. For LLMs this is intractable: the action space is the entire vocabulary at each step and the policy is a sequence over that space, so argmax decoding has no closed form and cannot represent the stochastic, high-entropy distributions needed for exploration and sampling. Actor-Critic instead keeps an explicit parameterized policy (the actor) that directly emits token distributions, while the critic only supplies a value baseline V(s) to reduce gradient variance.
The deeper motivation is the bias-variance tradeoff. Pure policy gradient (actor-only REINFORCE) is unbiased but high-variance; variance scales with trajectory length and reward sparsity, demanding huge batches and long rollouts. Adding a critic gives the advantage A(s,a)=R(s,a)-V(s), and Generalized Advantage Estimation generalizes it to A_GAE_t=Σ_l (γλ)^l δ_{t+l} with δ_t=r_t+γV(s_{t+1})-V(s_t), where λ moves between low-variance/high-bias one-step TD (λ=0) and high-variance/low-bias Monte Carlo (λ=1). The bootstrapped critic dramatically improves sample efficiency at the cost of some bias.
So Actor-Critic is the practical middle ground: the actor handles the combinatorial, stochastic action space that a pure critic cannot, and the critic tames the variance that a pure actor cannot. The tradeoff is infrastructure cost — PPO carries policy + reference + value + optimizer states (~4x memory). This is precisely why critic-free variants like GRPO replace the learned critic with an empirical group-mean baseline, recovering variance reduction without a value network while keeping the explicit actor. The lesson: you always need the actor; the critic is an optional, swappable variance-reduction device.
Let's walk through why the critic saves compute. Imagine a simple problem: the policy generates a 4-token completion on a single prompt, yielding a trajectory with returns [0.2, 0.4, 0.5, 0.6] (normalized cumulative reward at each step).
Pure policy gradient (REINFORCE): gradient ∝ return × ∇log π(token). Each token sees its full return value: token 1 gets 0.2, token 2 gets 0.4, etc. If returns have std ≈ 0.15, gradient variance is high.
With a learned value baseline: suppose the critic estimates V(s) ≈ [0.3, 0.35, 0.4, 0.45] at each state. Advantage = [0.2−0.3, 0.4−0.35, 0.5−0.4, 0.6−0.45] = [−0.1, 0.05, 0.1, 0.15]. These advantages are smaller in both magnitude (mean absolute value 0.1 vs 0.425) and spread (std ≈ 0.09 vs 0.15), so the gradient estimate keeps the same expectation but has lower variance: the critic's estimate absorbed most of the raw return's scale, leaving only the relative signal (how much better or worse than expected). That's the win: an unchanged expected gradient with less noise, so you need fewer samples.
The cost: the critic has to be trained too, and a bad critic can be worse than nothing.
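The arithmetic in the walkthrough can be checked with a few lines of Python. This is only a sketch that reuses the returns and critic estimates from the example; the helper `mean_abs` is an illustrative name, and population standard deviation is an assumption about how the spread was measured.

```python
from statistics import mean, pstdev

# Numbers from the walkthrough: per-token returns and critic estimates.
returns = [0.2, 0.4, 0.5, 0.6]
values = [0.3, 0.35, 0.4, 0.45]

# Advantage = return minus the critic's baseline at that state.
advantages = [g - v for g, v in zip(returns, values)]

def mean_abs(xs):
    """Average magnitude of the per-token gradient weights (illustrative helper)."""
    return mean(abs(x) for x in xs)

print("advantages:", [round(a, 2) for a in advantages])
print("population std:", round(pstdev(returns), 3), "->", round(pstdev(advantages), 3))
print("mean |weight|:", round(mean_abs(returns), 3), "->", round(mean_abs(advantages), 3))
```

The weights that multiply ∇log π shrink in both magnitude and spread once the baseline is subtracted, which is the variance reduction the text describes.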
They are three views of the same objective. For a fixed data distribution P and model Q, the identity D_KL(P||Q)=H(P,Q)-H(P) holds, where H(P,Q)=-Σ P(x) log Q(x) is cross-entropy and H(P) is the (parameter-independent) entropy of the data. Since H(P) is constant in θ, minimizing cross-entropy is exactly minimizing the forward KL D_KL(P||Q).
MLE closes the loop: maximizing log-likelihood Σ log Q(x_i) over data equals minimizing the empirical cross-entropy between the data distribution and the model, hence minimizing forward KL from data to model. So "train an LM by MLE" = "minimize cross-entropy" = "minimize forward KL." Forward KL is mode-covering: it penalizes Q for assigning low probability where P has mass, encouraging broad coverage.
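A quick numeric check of the identity D_KL(P||Q) = H(P,Q) - H(P) makes the equivalence concrete. The two distributions below are made-up numbers chosen only for illustration.

```python
import math

P = [0.5, 0.3, 0.2]   # "data" distribution (made-up numbers)
Q = [0.4, 0.4, 0.2]   # model distribution (made-up numbers)

entropy = -sum(p * math.log(p) for p in P)                    # H(P), independent of the model
cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))   # H(P, Q)
forward_kl = sum(p * math.log(p / q) for p, q in zip(P, Q))   # D_KL(P || Q)

print(f"H(P,Q) - H(P) = {cross_entropy - entropy:.6f}")
print(f"D_KL(P||Q)    = {forward_kl:.6f}")
```

Because H(P) does not depend on Q, any change to Q that lowers the cross-entropy lowers the forward KL by exactly the same amount.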
In RL the same machinery reappears as a constraint rather than the loss. Cross-entropy remains the base supervised objective, while KL measures deviation from a reference policy and acts as regularization. RLHF makes it explicit: total_reward = r(x,y) - (β/|y|)·D_KL(π_θ||π_ref). PPO's ratio clipping adds a second, implicit constraint, but against the previous policy π_old rather than π_ref; DPO folds the KL into β scaling the implicit reward; GRPO adds an explicit β·D_KL penalty term. Note the directionality flip: MLE uses forward KL D_KL(P_data||π_θ), whereas RL penalties typically use reverse KL D_KL(π_θ||π_ref), which is mode-seeking and keeps the policy from drifting into degenerate high-reward shortcuts.
The practical takeaway: pretraining and RL share one currency — KL/cross-entropy — but use opposite KL directions and opposite roles (objective vs. constraint), which is why balancing "maximize reward" against "stay near reference" is the central tension in post-training.
Reward design follows the verifiability and supervision available. Four paradigms dominate.
RLHF (reward models): train a separate model on human preference pairs and use its scalar output as reward. Flexible, scales to subjective judgments (helpfulness, tone), but annotation is expensive, the reward model generalizes poorly out-of-distribution, and RL exploits its failures — classic reward hacking.
RLVR (verifiable rewards): rewards come from deterministic oracles such as unit tests, formal proof checkers, or exact answer matching. The signal is typically binary, r∈{0,1}. There is no reward-model training step, the signal is far harder to game than a learned reward model, and it is empirically strong on math/code (DeepSeek-R1, R1-Zero). Limitation: it only works where an automated oracle exists, and it gives no partial credit.
RLAIF (AI feedback): a larger teacher LLM replaces human annotators, judging student outputs. Scales annotation cheaply and handles complex domains, but inherits teacher bias and noise, and risks distribution collapse if teacher and student are similar. By 2025 it is comparable to RLHF for many tasks.
Process rewards: assign credit to intermediate reasoning steps rather than only the final answer, rewarding correct reasoning process, not lucky answers. Stronger signal for reasoning emergence but more labels per sample and harder to verify.
Design principles cut across these: prefer verifiable signals when an oracle exists (this removes most of the reward-model attack surface); shape rewards to counter known biases (DAPO's overlong-reward shaping penalizes verbose CoT; length normalization avoids length bias); and watch the std-normalization trap, where sparse rewards with tiny std blow up advantages. The dominant tradeoff is fidelity vs. coverage: RLVR is correct but narrow; RLHF/RLAIF are broad but gameable. Production systems increasingly combine them, using verifiable rewards for reasoning and model-based rewards for open-ended quality and safety.
All three underpin off-policy estimation in modern RL. Importance sampling (IS) lets us reuse trajectories from a behavior policy π_old to estimate expectations under π_new via the ratio w_t=π_new(a_t|s_t)/π_old(a_t|s_t). This is what makes PPO's multi-epoch reuse of a rollout batch valid, and what makes asynchronous/stale-data training (AReaL, 1-5 versions old) possible. The danger: over long LLM sequences the product of token ratios is heavy-tailed, so gradient variance diverges. Mitigations include Truncated IS (clip ratios to ~10-20, trading variance for bias), Masked IS (drop high-variance samples), and PPO's clip(w,1-ε,1+ε) which is itself implicit truncation. GSPO lifts IS to the sequence level to average out token-level noise.
Rejection sampling enters two ways. As a variance tool, Rejection-Gated Policy Optimization provably bounds gradient variance under heavy-tailed ratios by rejecting samples with extreme log-prob mismatches, guaranteeing finite variance — complementing PPO clipping with theory. As a data tool, it powers DeepSeek-R1's stage 3: sample top-K completions from the stage-2 policy, reject incorrect/unreadable ones, and SFT on the survivors.
Broader Monte Carlo methods are everywhere: GRPO's group-mean baseline is a pure MC estimate of the value baseline (sample G completions, average rewards); REINFORCE uses full MC returns; GAE interpolates between MC (λ=1, low bias/high variance) and bootstrapped TD (λ=0). MaxRL even uses IS with Pass@k weighting P(k successes|N)/P(y|prompt) plus a Maclaurin expansion to directly optimize the multi-sample metric.
The unifying theme: RL for LLMs is fundamentally about cheaply estimating on-policy gradients from off-policy or finite samples, and IS/rejection/MC are the variance-control knobs that make this tractable.
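The bias-for-variance trade of truncated importance sampling can be seen on a toy three-action problem. This is a minimal sketch: the policies, rewards, and the cap of 2.0 are hypothetical values chosen so the effect is visible, not settings from any of the systems named above.

```python
import random

random.seed(0)
actions = [0, 1, 2]
pi_old = [0.6, 0.3, 0.1]   # behavior policy that generated the data (hypothetical)
pi_new = [0.2, 0.3, 0.5]   # policy we want to evaluate (hypothetical)
reward = [1.0, 0.0, 2.0]   # reward per action (hypothetical)

# Ground truth: expected reward under pi_new.
exact = sum(p * r for p, r in zip(pi_new, reward))

# Off-policy data: actions sampled from pi_old only.
samples = random.choices(actions, weights=pi_old, k=50_000)

def is_estimate(cap=None):
    """Importance-sampling estimate of E_{pi_new}[r]; optionally truncate the weights."""
    total = 0.0
    for a in samples:
        w = pi_new[a] / pi_old[a]
        if cap is not None:
            w = min(w, cap)   # truncated IS: bounded weight, biased estimate
        total += w * reward[a]
    return total / len(samples)

plain = is_estimate()
truncated = is_estimate(cap=2.0)
print(f"exact={exact:.3f} plain IS={plain:.3f} truncated IS={truncated:.3f}")
```

The untruncated estimate lands near the exact value but relies on a weight of 5 for the rare action; capping the weight tames that term and systematically underestimates the target.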
PPO uses a learned critic with GAE: A_GAE_t = Σ_l (γλ)^l δ_{t+l}, where δ_t = r_t + γV(s_{t+1}) - V(s_t). λ∈[0.9,0.99] trades bias (low λ, 1-step) against variance (high λ, Monte Carlo). GRPO is critic-free: it samples G completions per prompt and sets a group-relative advantage A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G), assigning that scalar to every token in completion i.
Generalized Advantage Estimation is a recipe: at each step t, you ask "how good is this step?" by blending one-step TD (fast but biased) with N-step rollouts (slower but less biased). The parameter λ ∈ [0, 1] controls the blend.
Symbol meanings: δ_t is the TD residual (one-step surprise: what you actually got minus what the critic predicted). Σ (γλ)^l δ_{t+l} weights each step's surprise with geometric decay (γλ)^l.
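Numeric example: let's say γ=0.99, λ=0.95, and we have a sequence of TD residuals δ = [0.1, 0.05, 0.02]. Then γλ = 0.9405 and A_0 = 0.1 + 0.9405·0.05 + 0.9405²·0.02 ≈ 0.1 + 0.0470 + 0.0177 ≈ 0.1647, A_1 = 0.05 + 0.9405·0.02 ≈ 0.0688, and A_2 = 0.02.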
With λ=0.95, δ_t itself carries weight 1.0 and later surprises are smoothed in with geometrically shrinking weights. With λ=1.0 future residuals are discounted by γ alone, which recovers the Monte Carlo return minus the baseline (high variance, low bias); with λ=0 you'd use only δ_t (low variance, high bias). This single knob sets the trade-off across the entire trajectory.
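The GAE sum is usually computed with a backward recursion, A_t = δ_t + γλ·A_{t+1}. A minimal sketch on the residuals from the example; the function name `gae` is illustrative and the episode is assumed to end after the last residual.

```python
def gae(deltas, gamma, lam):
    """A_t = sum_l (gamma*lam)^l * delta_{t+l}, computed by a backward recursion."""
    advantages = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

deltas = [0.1, 0.05, 0.02]   # TD residuals from the numeric example
for lam in (0.0, 0.95, 1.0):
    print(f"lambda={lam}:", [round(a, 4) for a in gae(deltas, gamma=0.99, lam=lam)])
```

Setting λ=0 returns the residuals unchanged, and raising λ folds progressively more of the future residuals into each step's advantage.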
Subtracting a baseline reduces variance without introducing bias. The policy-gradient identity E[∇log π · b] = 0 for any action-independent baseline b means subtracting the mean reward leaves the gradient's expectation unchanged while shrinking its variance — actions are reinforced by how much better they are than the policy's average, not by raw return magnitude. In PPO the baseline is V(s); in GRPO it is the empirical group mean, which conveniently shrinks advantages to zero automatically as all outputs converge in quality, removing the need for an adaptive KL scheduler.
Std normalization is more contested. Dividing by std makes the effective learning rate invariant to reward scale and sparsity, stabilizing training across tasks. But it is not free. When the group reward distribution has very small std (sparse/hard problems where correct solutions are rare), the tiny denominator massively amplifies gradients, a key LLM-RL divergence mode. It also induces a difficulty bias: std-normalization over-weights too-easy and too-hard prompts relative to medium-difficulty ones. Dr.GRPO therefore drops the /std term entirely, using A_i = r_i - mean(r). Practical fixes when keeping it: clamp std to a floor (a tiny epsilon such as 1e-8 only prevents division by zero; limiting amplification needs a larger floor), use the inter-quartile range, or use percentile-based adaptive normalization. So std normalization is helpful but not strictly necessary, and it can be actively harmful in sparse-reward reasoning.
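The difficulty bias is easy to reproduce with binary rewards. The sketch below compares mean-centering alone with mean-and-std normalization on two hypothetical groups of 16 rollouts; using the population std and an epsilon of 1e-6 are assumptions of this example, not the settings of any particular codebase.

```python
from statistics import mean, pstdev

def group_advantages(rewards, use_std=True, eps=1e-6):
    """Group-relative advantages; use_std=False gives the mean-centered-only variant."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not use_std:
        return centered
    sigma = pstdev(rewards)
    return [c / (sigma + eps) for c in centered]

hard = [1.0] + [0.0] * 15        # 1 of 16 rollouts correct (hypothetical hard prompt)
medium = [1.0] * 8 + [0.0] * 8   # 8 of 16 correct (hypothetical medium prompt)

for name, group in (("hard", hard), ("medium", medium)):
    with_std = group_advantages(group)[0]
    without = group_advantages(group, use_std=False)[0]
    print(f"{name}: advantage of a correct rollout = {with_std:.3f} (with std), {without:.3f} (mean only)")
```

With std normalization the lone success on the hard prompt receives an advantage close to four times that of a success on the medium prompt, whereas mean-centering alone keeps the two within a factor of two.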
They explore along orthogonal axes. Training-phase exploration is exploration in parameter/policy space: the policy samples trajectories from its evolving distribution, observes successes and failures via reward, and updates weights so that good solution paths become more probable. Over a long run this builds a durable "reasoning repertoire" — the model permanently acquires strategies (reflection, backtracking, hypothesis testing) it can later deploy. Diversity here is driven by sampling temperature, entropy regularization (SimKO guards against probability concentration collapsing pass@k), and reward shaping.
Test-time scaling is exploration in compute/inference space at fixed weights: given more inference budget, the model explores more of the solution space per query via extended chain-of-thought (more tokens), tree/beam search over multiple hypotheses, iterative generate→verify→regenerate refinement, and reflection/backtracking on prior steps. The policy is frozen; the search is what scales.
The crucial coupling, highlighted by the e3 work, is that test-time scaling is not automatic. A model trained with passive, fixed-per-sample-length RL does not know how to spend extra inference compute usefully — it plateaus. To get test-time extrapolation, you must train the model to perform in-context exploration: use variable-length episodes and reward strategic exploration, so the policy learns how to allocate a budget it hasn't seen. In other words, training-time exploration shapes the policy, but it must specifically install the meta-skill of inference-time exploration for test-time scaling to pay off.
The two are complementary and yield dual scaling laws: months of train-time RL compute plus per-inference test-time budget. Neither alone reaches frontier reasoning — train-time RL discovers strategies, test-time scaling deploys and amplifies them, and the bridge is training the model to explore in-context.
PPO optimizes the clipped surrogate L = min(r_t·A_t, clip(r_t, 1-ε, 1+ε)·A_t), where r_t = π_θ(a_t|s_t)/π_old(a_t|s_t) and ε≈0.2. The clip caps how far the probability ratio can move, and the min makes the cap one-directional in an advantage-aware way.
The minimum enforces asymmetric gradient saturation. When r_t is inside [1-ε,1+ε], the unclipped term is active and gradients flow normally. When the update would push r_t too far in the direction the advantage favors (e.g., A>0 and r_t>1+ε), the clipped term becomes constant, its gradient is zero, and a "dead zone" halts further updates. But if the ratio moves the wrong way, the unclipped term — being smaller — is selected, so the policy is still corrected. The net effect: PPO focuses effort where the two objectives disagree, signaling instability, and refuses to reward already-overshot moves. It is a heuristic first-order trust region — cheap, no conjugate gradient — but with no hard KL guarantee.
Without clipping, the surrogate r_t·A_t can be driven arbitrarily large by a single update; the policy makes huge off-policy jumps, importance ratios over long sequences explode in variance, and training collapses (mode collapse to high-reward shortcuts, distribution degeneration).
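The asymmetric saturation can be verified numerically. This sketch evaluates the per-token clipped term and a finite-difference derivative with respect to the ratio; the function names are illustrative and ε=0.2 follows the value quoted above.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-token clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

def grad_wrt_ratio(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite difference of the surrogate with respect to the ratio."""
    up = ppo_clip_term(ratio + h, advantage, eps)
    down = ppo_clip_term(ratio - h, advantage, eps)
    return (up - down) / (2 * h)

for r, a in [(1.0, 1.0), (1.5, 1.0), (0.5, 1.0), (0.5, -1.0), (1.5, -1.0)]:
    print(f"r={r}, A={a:+}: term={ppo_clip_term(r, a):+.2f}, d/dr={grad_wrt_ratio(r, a):+.2f}")
```

The derivative is zero only when the ratio has already overshot in the direction the advantage favors (r=1.5 with A>0, r=0.5 with A<0); in the opposite cases the unclipped term is selected and the gradient still flows.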
CISPO (MiniMax-M1) keeps a bound but moves it: instead of PPO's min(r·A, clip(r)·A), whose gradient vanishes for clipped tokens, it weights the REINFORCE term A·log π_θ by sg(clip(r)), a clipped importance weight treated as a stop-gradient constant, so every token still contributes gradient. The motivation: standard clipping silences rare high-IS "reflective" tokens ("However", "Wait", "Recheck"); once clipped out they never rejoin subsequent off-policy updates, which is fatal for long-CoT. CISPO only bounds gradient magnitude and never zeroes a token's contribution, enabling stable 512-step reasoning RL. Its limit: far-off-policy update magnitude is still discarded; SAPO's soft gating is smoother.
GRPO includes an explicit KL term to prevent the policy from drifting too far from the reference (base/SFT) model — guarding against degeneration, language mixing, mode collapse to reward shortcuts, and gradient-variance blowups. Unlike PPO, which folds KL into the reward as r - (β/|y|)·D_KL, GRPO applies β·D_KL as a direct per-token penalty inside the loss: L = E[(1/G)Σ_i (1/|o_i|)Σ_t (A_i·log π_θ - β·D_KL^(t))].
The KL is computed with the unbiased, guaranteed-non-negative k3 estimator: D_KL^(t) = (π_ref(o_{i,t})/π_θ(o_{i,t})) - log(π_ref/π_θ) - 1. This form is always ≥0 and has lower variance than the naive log-ratio estimator, which matters because it is evaluated per token from single samples.
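The naive single-sample estimator of D_KL(π_θ||π_ref), log π_θ(token) − log π_ref(token) with the token drawn from π_θ, can be negative whenever the sampled token has π_θ < π_ref, even though the true KL never is. The k3 form fixes this by design.
Numeric example: suppose π_ref(token) = 0.1 and π_θ(token) = 0.2. The naive estimate is log(0.2/0.1) ≈ 0.693, and k3 = 0.5 − log(0.5) − 1 ≈ 0.193. Swap the two probabilities (π_θ = 0.1, π_ref = 0.2) and the naive estimate becomes −0.693, while k3 = 2 − log(2) − 1 ≈ 0.307 stays positive.
The k3 rearrangement (ratio minus log-ratio minus 1) is non-negative for every input pair because x − 1 ≥ log x for all x > 0, which makes it safe for single-sample KL estimation in GRPO. It remains unbiased and, when the two policies are close, has lower variance than the naive estimator, so the extra term costs essentially nothing.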
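Both estimators can be compared against an exact KL on a toy vocabulary. This is a sketch with made-up three-token distributions; `k1` is used here as a label for the naive log-ratio estimator log(π_θ/π_ref), with tokens sampled from π_θ.

```python
import math
import random

def k1(p_theta, p_ref):
    """Naive single-sample estimate of KL(pi_theta || pi_ref): log(pi_theta / pi_ref)."""
    return math.log(p_theta / p_ref)

def k3(p_theta, p_ref):
    """k3 estimate: ratio - log(ratio) - 1 with ratio = pi_ref / pi_theta; never negative."""
    ratio = p_ref / p_theta
    return ratio - math.log(ratio) - 1.0

# Single-token behavior, including the token from the numeric example.
print("pi_theta=0.2, pi_ref=0.1:", round(k1(0.2, 0.1), 3), round(k3(0.2, 0.1), 3))
print("pi_theta=0.1, pi_ref=0.2:", round(k1(0.1, 0.2), 3), round(k3(0.1, 0.2), 3))

# Monte Carlo check that both estimators average to the true KL (made-up distributions).
random.seed(0)
pi_theta = [0.5, 0.3, 0.2]
pi_ref = [0.4, 0.4, 0.2]
true_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))
tokens = random.choices(range(3), weights=pi_theta, k=100_000)
k1_mean = sum(k1(pi_theta[t], pi_ref[t]) for t in tokens) / len(tokens)
k3_mean = sum(k3(pi_theta[t], pi_ref[t]) for t in tokens) / len(tokens)
print(f"true KL={true_kl:.4f} k1 mean={k1_mean:.4f} k3 mean={k3_mean:.4f}")
```

Individual k1 samples swing between positive and negative values, while every k3 sample is non-negative and the average still matches the true KL.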
DAPO, Dr.GRPO, and GSPO remove the KL term for several reasons. First, in RLVR reasoning the reward is verifiable (binary correctness), so the hacking that KL guards against is far less likely — the oracle cannot be gamed the way a learned reward model can. Second, the goal of long-horizon reasoning RL is to expand the policy beyond the base distribution; anchoring tightly to π_ref caps exactly the capability gains you want (ProRL's expansion needs room to move). Third, dropping β eliminates a sensitive hyperparameter and the cost of keeping a frozen reference model in memory. GSPO additionally derives its stability from sequence-level importance sampling rather than reference anchoring, so the KL becomes redundant. The tradeoff is that without KL the policy can drift, lose calibration, or over-optimize verifiable proxies; this is acceptable when rewards are trustworthy and broad-domain harmlessness is handled in a separate later stage. There is no consensus — KL is task-dependent regularization, valuable in RLHF, often dispensable in RLVR.
You get a gradient-scaling bug: gradients end up off by an exact integer factor equal to the number of redundant reductions (or, in sequence/tensor parallelism, by the parallel group size). The tell-tale symptom is gradients that differ by precisely seq_parallel_size — e.g., with seq_parallel_size=4, gradients are 4x too large (or, depending on where the double-reduce happens, 4x too small). The root cause is that torch.distributed.all_reduce does not carry a backprop wrapper, so reductions applied to the loss/gradients are not properly accounted for, and summing across processes multiplies the effective gradient.
The practical consequence is a silently mis-scaled effective learning rate. If gradients are inflated, you get exploding gradients, loss spikes, NaNs/Infs, and divergence; if deflated, you get pathologically slow or stalled learning that looks like a bad LR or a dead optimizer. Because the model often still "trains," the bug is insidious — curves look plausible but final quality is wrong and results are irreproducible. Related variants make it worse: fp16 transmitted as bf16 during communication can produce infinite gradients, and quantized all-reduce (THC, Terngrad) can overflow during partial-sum aggregation in multi-hop topologies.
The fix is to make the reduction backprop-aware and to reduce exactly once. Use torch.distributed.nn.all_reduce (which provides the correct gradient wrapper) instead of raw all_reduce, or, if you must use the raw primitive, explicitly rescale gradients by 1/num_processes (or 1/seq_parallel_size). More broadly, audit gradient-accumulation and DDP/sequence-parallel code paths so the loss is averaged, not summed-then-summed, across the parallel dimension. This class of bug is universal in distributed setups and is a prime suspect whenever gradient norms jump by a clean integer factor.
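The integer-factor symptom can be reproduced without any GPUs. The sketch below simulates a SUM all-reduce in plain Python; `all_reduce_sum` is a stand-in written for this example, not a PyTorch API, and the per-rank gradients are hypothetical.

```python
def all_reduce_sum(per_rank_values):
    """Stand-in for a SUM all-reduce: every rank ends up holding the global sum."""
    total = sum(per_rank_values)
    return [total] * len(per_rank_values)

world_size = 4
local_grads = [0.5, 0.7, 0.6, 0.6]            # hypothetical per-rank gradients of one parameter
true_grad = sum(local_grads) / world_size     # gradient of the globally averaged loss

# Correct: reduce once, then divide by the group size.
reduced_once = [g / world_size for g in all_reduce_sum(local_grads)]

# Bug A: the division is forgotten, so the loss is summed instead of averaged.
summed_not_averaged = all_reduce_sum(local_grads)

# Bug B: a second, redundant SUM all-reduce on values that were already averaged.
reduced_twice = all_reduce_sum(reduced_once)

for name, grads in [("correct", reduced_once), ("bug A", summed_not_averaged), ("bug B", reduced_twice)]:
    print(f"{name}: gradient is {grads[0] / true_grad:.1f}x the true value")
```

Both bugs inflate the gradient by exactly the group size, which is why a clean integer ratio between expected and observed gradient norms points at the reduction path rather than at the optimizer.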
DPO has no explicit reward model; it defines an implicit reward by inverting the Bradley-Terry preference model: r(x,y) = β·log(π_θ(y|x)/π_ref(y|x)), with β∈[0.1,0.5] controlling KL-regularization strength. For a preference pair the loss depends only on the margin r(x,y_w)-r(x,y_l) = β·log(π_θ(y_w|x)/π_θ(y_l|x)) + β·log(π_ref(y_l|x)/π_ref(y_w|x)). Larger β stays closer to the reference (stronger regularization); smaller β allows sharper, more aggressive divergence. β-DPO (NeurIPS 2024) adapts β per batch from the average implicit margin.
β is the KL weight: it sets how much you care about staying near the reference policy versus fitting the preferences. Think of it as an inverse temperature on the policy shift: large β = conservative, small β = aggressive.
Numeric example: on a single preference pair, suppose π_θ(chosen) = 0.6, π_θ(rejected) = 0.2, π_ref(chosen) = 0.3, π_ref(rejected) = 0.4. The log-probability margins are log(π_θ(chosen)/π_θ(rejected)) = log(0.6/0.2) ≈ 1.099 and log(π_ref(rejected)/π_ref(chosen)) = log(0.4/0.3) ≈ 0.288.
With β=0.1: implicit reward margin = 0.1 × (1.099 + 0.288) ≈ 0.139, so the loss still sees the pair as nearly tied and keeps pushing the policy away from the reference. With β=0.5: implicit reward margin = 0.5 × (1.099 + 0.288) ≈ 0.693, so the same policy shift already registers as a clear separation and the gradient tapers off sooner.
Small β shrinks the margin that a given log-ratio shift produces, so the optimizer has to move the policy further from the reference to satisfy the preferences; large β saturates the loss after a small shift and keeps the policy near the reference. With trustworthy, verifiable rewards (RLVR) a smaller KL weight is often acceptable because less regularization is needed; in RLHF with learned rewards, larger β is safer to guard against reward-model noise.
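The margin arithmetic and the resulting per-pair loss can be reproduced directly. A minimal sketch using the four probabilities from the numeric example; the function names are illustrative, and the loss is the standard DPO form −log σ(margin).

```python
import math

def dpo_margin(beta, pi_w, pi_l, ref_w, ref_l):
    """Implicit reward margin r(x, y_w) - r(x, y_l) with r = beta * log(pi_theta / pi_ref)."""
    return beta * (math.log(pi_w / ref_w) - math.log(pi_l / ref_l))

def dpo_loss(margin):
    """Per-pair DPO loss: -log sigmoid(margin) = log(1 + exp(-margin))."""
    return math.log1p(math.exp(-margin))

for beta in (0.1, 0.5):
    m = dpo_margin(beta, pi_w=0.6, pi_l=0.2, ref_w=0.3, ref_l=0.4)
    print(f"beta={beta}: margin={m:.3f}, loss={dpo_loss(m):.3f}")
```

For the same policy, the larger β yields the larger margin and the smaller loss, so the optimizer has less incentive to keep moving away from the reference.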
Reward hacking absolutely occurs. Because the implicit reward only depends on the log-probability ratio, the model can maximize the margin by over-suppressing the rejected completion's log-prob rather than improving the chosen one. The pathological mode: log π_θ(y_l) crashes while chosen-response quality plateaus or even degrades, so the margin grows for the wrong reason. The closely related likelihood displacement phenomenon is worse — during DPO training the chosen response's log-prob also decreases, just slower than the rejected one's. Both classes lose probability mass, eroding calibration and breaking downstream uses that rely on well-behaved probabilities (e.g., sampling, confidence).
Mitigations (2025) target keeping chosen quality up and bounding the suppression. POWER provably prevents Type-I reward hacking via bounded preference scores. Policy-Guided DPO (PG-DPO) combines Adaptive Rejection Scaling with Implicit Preference Regularization to preserve chosen-response quality while still separating pairs. DPO-Shift adds a parameter function to the rejected term in the Bradley-Terry model to directly counter likelihood displacement. Adaptive-β (β-DPO) prevents over-divergence on noisy batches. More generally, anchoring with an explicit SFT/NLL term on the chosen response, regularizing absolute log-probs (not just the ratio), and monitoring chosen log-prob during training are standard defenses. The core lesson: optimizing a ratio is not the same as improving the winner.
The mismatch is router shift. The inference engine (e.g., vLLM) and training engine (e.g., FSDP) route independently; even with identical router weights, ~10% of routers disagree per forward pass and 94% of tokens differ in expert assignment in at least one MoE layer. After each policy update, routing probabilities and even the selected experts shift, so the importance ratio IS_t = π_θ/π_old spikes. The result is extreme token-level spikes that cascade across layers, chaotic clip triggering, and training collapse. Five solutions:
R3 (Rollout Routing Replay): record the routing masks the inference engine used during rollout, and replay those exact masks into the training forward pass. This forces π_train_router to match π_inference_router, killing the discrepancy at its source — cutting train/inference routing KL ~50% (dense-model parity) and reducing extreme tokens 10x; collapsing GRPO runs become stable.
RSPO (Router-Shift Policy Optimization): no replay; instead compute a per-token router-shift correction — detect when expert_old ≠ expert_new and rescale the importance weight by a stop-gradient, lower-bounded factor, aggregated at sequence level before clipping. Works from log-prob recomputation alone.
TIS (Truncated Importance Sampling), here in its two-sided masked form: zero the gradient when IS_t exceeds τ_upper or falls below τ_lower (e.g., [0.5,1.5]); plain truncation would cap the weight instead of dropping the token. Simple but coarse; ignores sequence coherence and expert-transition structure.
GSPO (Group Sequence Policy Optimization): lift IS to sequence level, IS_seq = (π_θ(response)/π_old(response))^(1/len) = exp((1/len)·Σ_t log(π_θ(o_t)/π_old(o_t))), with sequence-level clipping. Routing fluctuations that diverge in layer k and re-align by layer k+2 average out, so no replay is needed; adopted in Qwen3 production.
VESPO: sequence-level soft importance weighting via variational bounds, smoothly down-weighting off-policy samples rather than hard-zeroing gradients. The trend is clear: move from token-level patches (TIS/R3) toward sequence-level and soft methods that are intrinsically robust to routing volatility.
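To see why a sequence-level ratio is less sensitive to token-level noise, compare per-token ratios with a length-normalized (geometric-mean) sequence ratio, exp(mean_t(log π_θ − log π_old)). The log-probabilities below are hypothetical and chosen so that two tokens shift in opposite directions.

```python
import math

def token_ratios(logp_new, logp_old):
    """Per-token importance ratios pi_theta / pi_old."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence ratio: exp of the mean per-token log-ratio."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

logp_old = [-1.0, -2.0, -0.5, -1.5]   # hypothetical rollout-time log-probs
logp_new = [-1.0, -0.8, -0.5, -2.7]   # tokens 2 and 4 shift in opposite directions

print("token ratios:", [round(r, 2) for r in token_ratios(logp_new, logp_old)])
print("sequence ratio:", round(sequence_ratio(logp_new, logp_old), 3))
```

Two of the token ratios fall far outside a [0.8, 1.2] clip range, yet the sequence-level ratio stays at 1 because the opposing shifts cancel in the mean.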
Group size G (GRPO): G controls the quality of the empirical baseline. Larger G gives a lower-variance group mean/std and more stable advantages, but cost scales linearly in rollouts. Practically G≈8-16 is common; too small makes mean(r)/std(r) noisy (and risks degenerate all-correct/all-wrong groups with zero variance), too large wastes compute. Prefix Grouper amortizes shared-prompt FLOPs (~1/G), enabling larger groups. Dynamic sampling (DAPO) filters uninformative groups to keep effective signal high.
Learning rate: RL post-training uses small LRs (often ~1e-6 to 1e-5 for the policy) because the trust region is fragile — PPO clipping and advantage normalization are very sensitive to LR, batch size, and reward scale. Too high and the ratio routinely exceeds the clip bound, the policy jumps off-policy, and importance variance explodes. Watch gradient norm and clip fraction as guides.
PPO epochs (reuse of a rollout batch): few epochs (typically 1-4). Each extra epoch moves π_θ further from π_old, inflating the ratio, increasing the clipped fraction, and adding off-policy bias; beyond a handful of epochs the surrogate stops approximating the true objective. One epoch is effectively on-policy and safest; more epochs trade sample efficiency for staleness.
Generation length: longer rollouts are needed for reasoning/long-CoT, but bring length bias (GRPO's 1/|o_i| favors short-correct/long-incorrect), compute cost quadratic in attention, and the long-tail rollout problem (the batch stalls on the longest sequence, idling GPUs). Set max length to task needs, use overlong-reward shaping (DAPO) or Dr.GRPO/λ-GRPO to neutralize length bias, and address tail latency with RollPacker tail batching, partial rollouts (APRIL/CoPRIS), or async training. The unifying principle: every knob trades sample/compute efficiency against off-policy drift and gradient variance.
Each patches a GRPO weakness. Dr.GRPO removes per-response length normalization (1/|o_i|) and std normalization in the advantage, eliminating length and difficulty biases and reducing verbosity; limitation: dropping length norm can reverse-bias toward verbosity and is less stable than adaptive λ-GRPO. DAPO adds four pieces: Clip-Higher (asymmetric clip bounds to prevent entropy collapse and enable upward exploration), Dynamic Sampling (filter uninformative rollouts), token-level policy-gradient loss (fine-grained CoT credit), and Overlong reward shaping (penalize excessive length), plus KL removal; it reaches 50 points on AIME 2024 with Qwen2.5-32B in half the training steps of DeepSeek-R1-Zero-Qwen-32B. Limits: asymmetric clip needs tuning, sampling overhead, credit still token-approximate.
GSPO moves importance sampling to sequence level (geometric-mean ratio, length-normalized), clipping whole sequences — killing token-level noise and stabilizing MoE without routing replay; limit: binary on/off clipping is coarse, one outlier token can block an entire sequence. CISPO clips the importance weight, not the token update (log π·clip(r)), so reflective tokens keep contributing across off-policy steps; limit: far-off-policy magnitude still discarded. SAPO replaces hard clipping with temperature-controlled sigmoid gating — sequence-coherent yet token-adaptive, preserving near-on-policy tokens while down-weighting only true outliers; limit: temperature tuning, marginal cost.
DPPO replaces noisy single-sample ratio clipping with a principled divergence estimate (binary-Bernoulli or top-k TV/KL), correctly constraining both low- and high-probability token shifts; limit: approximation error, divergence-estimation overhead. MaxRL optimizes Pass@k directly via importance weighting P(k|N)/P(y) and a Maclaurin expansion interpolating standard-RL to exact MLE, 7.9-19.2x more sample-efficient for equivalent Pass@k; limit: needs binary rewards and a low-probability regime. SimKO targets probability-concentration bias (top-1 monopolizing mass, collapsing pass@k) via entropy regularization to preserve alternative candidates; limit: verifiable-reward-only and tuning-sensitive. Recurring tensions: token vs. sequence granularity, hard vs. soft gating, and unsolved per-token credit assignment.
TRPO enforces a hard trust region. It maximizes E[(π_new/π_old)·A] subject to an explicit constraint D_KL(π_old||π_new) ≤ δ. It solves this constrained problem with second-order optimization — conjugate gradient to approximate the natural-gradient direction (using the Fisher/Hessian-vector product) plus a backtracking line search to ensure both the KL bound and improvement hold. This yields a monotonic-improvement guarantee, the principled gold standard, but the second-order machinery is expensive and scales poorly to LLM-sized models — which is why PPO replaced it with heuristic first-order clipping that has no hard KL bound but is far cheaper.
DPPO (Divergence Proximal Policy Optimization) keeps the trust-region spirit but fixes PPO's core flaw: PPO's ratio clip is a single-sample Monte Carlo estimate of policy divergence, and for ~100k-token vocabularies single-token samples are extremely noisy, so low-probability good tokens get over-penalized and high-probability bad tokens under-constrained. DPPO instead directly approximates the true policy divergence (KL or total variation) and constrains on that. Because exact divergence over the full vocabulary is costly, it uses lightweight approximations: Binary Divergence collapses the categorical to a Bernoulli (sampled token vs. all others) and computes TV; Top-K Divergence estimates divergence on the k most probable tokens. This is a theoretically grounded constraint rather than a heuristic ratio, correctly bounding both low- and high-probability shifts; the cost is approximation error.
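The contrast between a ratio test and a divergence test shows up on two tokens. The sketch below uses the fact that the total variation between Bernoulli(p_old) and Bernoulli(p_new) is |p_new − p_old|; the probabilities are hypothetical, and this illustrates only the binary-divergence idea, not the paper's implementation.

```python
def ratio(p_new, p_old):
    """Importance ratio of the sampled token, the quantity PPO clips."""
    return p_new / p_old

def binary_tv(p_new, p_old):
    """Total variation between Bernoulli(p_old) and Bernoulli(p_new):
    the sampled token versus all other tokens lumped together."""
    return abs(p_new - p_old)

# (p_old, p_new) for two hypothetical sampled tokens.
cases = {"rare token": (0.001, 0.003), "common token": (0.60, 0.70)}

for name, (p_old, p_new) in cases.items():
    print(f"{name}: ratio={ratio(p_new, p_old):.2f}, binary TV={binary_tv(p_new, p_old):.3f}")
```

With ε=0.2 the ratio test would clip the rare token (ratio 3.0) even though the distribution barely moved, and would pass the common token (ratio about 1.17) even though a tenth of the probability mass shifted; the divergence view ranks the two the other way round.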
AReaL enforces its trust region under asynchrony. Fully decoupling generation from training means rollout data is 1-5 (sometimes up to 20+) policy versions stale, which would normally violate the on-policy assumption. AReaL uses a staleness-enhanced PPO variant — modified importance sampling that reweights and bounds stale trajectories, plus workload balancing to keep staleness small and controllable. The effective trust region is thus maintained by staleness-aware IS correction (related A-3PO approximates the proximal policy via log-prob interpolation π_proxy≈α·π_behavior+(1-α)·π_target), keeping stale updates near the current policy while delivering ~2.57x speedup.
This is the central open debate — "elicitation vs. expansion" — and the honest answer is: it depends, but the evidence increasingly supports genuine expansion under the right conditions.
The elicitation position holds that RL merely reshapes the output distribution, concentrating probability on solution paths already reachable by the base model. Its strongest evidence is large-k pass@k analysis: at high k, base models eventually solve as many or more unique problems than their RL-tuned versions, suggesting RL is an amplifier of latent ability, not a creator — it trades breadth (pass@k) for sharpness (pass@1).
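The breadth-versus-sharpness pattern is easiest to see with the standard unbiased pass@k estimator, 1 − C(n−c, k)/C(n, k) for n samples of which c are correct. The per-problem counts below are invented purely to illustrate the crossover; they are not measurements of any model.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0   # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

n = 256
base_correct = [2, 1, 3, 1]     # hypothetical base model: rarely right, but on every problem
rl_correct = [200, 180, 0, 0]   # hypothetical RL model: sharp on two problems, zero on two

def average_pass(correct_counts, k):
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

for k in (1, 16, 256):
    print(f"k={k}: base={average_pass(base_correct, k):.3f}, rl={average_pass(rl_correct, k):.3f}")
```

In this toy setup the RL-tuned counts win clearly at k=1 and lose at k=256, which is the shape of curve the elicitation argument points to.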
The expansion counter, anchored by ProRL (NeurIPS 2025), shows the opposite under prolonged training: RL-trained models outperform base across the whole pass@k curve (k=1 to 100+), and crucially succeed on problems where the base model fails entirely even under extensive sampling. Theory (expressiveness work) shows RL after next-token pretraining can install qualitatively new computation that next-token training alone cannot reach tractably. This is genuine frontier expansion, not redistribution.
The emerging nuanced consensus reconciles them via three factors. Task properties: expansion happens when the task is novel or under-represented in pretraining (real headroom for search); elicitation dominates when the task is already heavily covered (little new space to explore). RL duration and diversity: short RL ≈ elicitation; long RL with a diverse task suite → expansion. Reward design: RLVR (verifiable) forces discovery of new strategies, whereas RLHF (learned reward) is more prone to gaming existing paths.
So RL can fundamentally expand the frontier, but not unconditionally: it requires sufficient compute, task novelty/diversity, exploration preservation (entropy/diversity controls to avoid pass@k collapse), and verifiable rewards. The unresolved question is the scaling law: whether expansion scales sub-linearly, linearly, or super-linearly with compute remains unknown.
ProRL (Prolonged RL, NeurIPS 2025) reframes the question from "does RL help" to "what does sustained RL unlock," and its methodology is the template. Three ingredients are essential to scale boundaries rather than collapse them. First, KL divergence control — stay close enough to the base to avoid degeneration, but loosely enough to permit genuine exploration. Second, reference-policy resetting — periodically re-anchor the reference to prevent the slow distribution collapse and entropy decay that otherwise stall long runs. Third, a diverse task suite — exploration headroom comes from breadth, so heterogeneous, verifiable tasks keep discovering new solution-space regions over weeks of compute. ProRL's finding is that reasoning-boundary improvement correlates with both task competence and training duration: the frontier keeps moving with more compute when the policy is kept from collapsing.
How to think about scaling, then: the binding constraint is not raw compute but exploration preservation. Naively scaling steps drives probability concentration (top-1 monopolizes mass, pass@k collapses), so boundary-scaling requires explicit diversity maintenance — entropy regularization (SimKO), asymmetric Clip-Higher to prevent entropy collapse (DAPO), and objectives that optimize multi-sample success directly (MaxRL/SimKO target pass@k rather than expected reward). Reward verifiability matters: RLVR forces new strategies and resists hacking, so it scales boundaries more reliably than RLHF.
The open questions temper optimism. The compute-to-frontier scaling law is unknown (linear? sublinear? superlinear?), RL remains highly sample-inefficient (millions of samples to reach frontier), and generalization across domains is mixed. The practical synthesis emerging in 2026: combine prolonged, diverse, verifiable RL with reference resets and exploration controls, and pair train-time scaling with test-time scaling and on-policy distillation (which reaches comparable reasoning at ~10x fewer GPU hours) to push boundaries more efficiently than RL compute alone.
On-Policy Distillation (OPD) has a student learn from a teacher on trajectories the student itself samples, rather than on static teacher-generated data. The loop: the student generates a response to a query (on-policy), the teacher evaluates that response (not an idealized trajectory), and the student updates to fix its own errors. This directly attacks exposure bias — the covariate shift between SFT training (always conditioned on teacher/gold prefixes) and inference (conditioned on the student's own, possibly erroneous, tokens). In off-policy SFT the student never learns to recover from its own mistakes, so errors compound autoregressively at test time; OPD trains the student to predict distributions matching its own outputs, nearly eliminating that covariate shift.
Versus traditional RL, OPD is far more sample- and compute-efficient. RLVR gives a sparse scalar reward per trajectory and is notoriously sample-hungry; OPD instead receives a dense, per-token supervisory signal from the teacher's distribution over the student's own rollouts. The reported result is that OPD surpasses pure RL on reasoning while using roughly 10x fewer GPU hours — it combines on-policy correction (RL's strength) with dense token-level targets (SFT's strength), without a reward model or the high-variance credit-assignment problem.
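The dense per-token signal can be sketched as a divergence between the student's and the teacher's next-token distributions, evaluated at each position of the student's own rollout. The choice of reverse KL (student relative to teacher) is an assumption here, picked because it is a common option; the text above does not fix the divergence, and the toy distributions are invented.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one token position."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )


def opd_loss(student_dists, teacher_dists):
    """Mean per-token divergence over positions of a student-sampled rollout."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(per_token) / len(per_token), per_token


# Two positions of one student rollout over a 3-token vocabulary.
student_dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
teacher_dists = [[0.6, 0.3, 0.1], [0.6, 0.3, 0.1]]

loss, per_token = opd_loss(student_dists, teacher_dists)
print(loss, per_token)
```

Every position contributes its own gradient signal, in contrast to a single scalar reward per trajectory; the second position, where the student disagrees sharply with the teacher, dominates the loss.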
Applications (2025-2026) are broad and production-scale: Qwen3, DeepSeek-V4, Gemma 2, and MiMo-V2-Flash adopt OPD as a core post-training mechanism. Variants extend reach: Black-Box OPD removes the need for teacher logits, using discriminator-based rewards on student rollouts (enabling distillation from API-only teachers). It is especially powerful for small models — a 0.6B student reaches 26.13% on AIME 2025 via effective OPD transfer, and DeepSeek-V3.2 uses on-policy distillation alongside its RL stack. Limitations: it needs a stronger teacher (it cannot exceed the teacher's reachable distribution the way frontier RL expansion can), and black-box variants depend on discriminator quality.
Reasoning emerges in distinct phases over an RL run, not at a single moment. Phase 1 (early, ~0-20% of training) is procedural reliability: the model masters low-level execution — formatting, arithmetic, basic logic, following the answer protocol. There is essentially no reflection or backtracking yet; gains come from learning to execute steps correctly. Phase 2 (mid, ~20-80%) is where the bottleneck shifts from execution to strategic planning, and this phase drives most of the gains on hard problems. Here checking/reflection behaviors, iterative refinement, and hypothesis generation-and-testing emerge; concretely, spontaneous self-correction and backtracking appear around 30-40% of training, when the model first begins revisiting and reconsidering prior reasoning. Phase 3 (late, 80%+) is consolidation and generalization — continued but slower improvement as strategies stabilize.
Two factors strongly modulate this timeline. Curriculum learning (scheduling tasks easy→hard) accelerates emergence across reasoning depths and makes long-CoT behavior appear at consistent stages; without a curriculum, emergence is more erratic and the onset of reflection is less predictable. Reward type matters too: verifiable rewards (RLVR) on genuinely hard problems are what trigger the spontaneous development of self-correction, since the model must explore to find rare correct solutions.
The dramatic existence proof is DeepSeek-R1-Zero: pure RL (GRPO on V3-Base, no SFT cold-start) produced emergent extended chain-of-thought, spontaneous self-correction, and "aha moments" of internal verification, lifting AIME-2024 pass@1 from 15.6% to 71.0% — demonstrating that sophisticated reasoning emerges from reward-signal optimization alone, without human reasoning-trajectory supervision. The practical implication for training: don't expect reasoning at the start; budget enough mid-training compute on appropriately hard, curriculum-ordered, verifiable tasks for the strategic-planning phase where reflection and the bulk of hard-problem gains actually materialize.
DeepSeek-R1 (Jan 2025) established the template: GRPO (critic-free, group-relative advantage A_{i,t}=(r_i-mean(r))/std(r)) plus RLVR (binary verifiable rewards r∈{0,1} from answer matching for math/code), trained on V3-Base (671B/37B MoE, MLA). Its 4-stage pipeline — cold-start SFT, large-scale reasoning RL, rejection-sampling+SFT, then harmlessness RL — and the R1-Zero pure-RL result proved reasoning emerges from reward alone.
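The group-relative advantage is simple enough to compute by hand. The sketch below applies the formula above to one group of binary rewards; using the population standard deviation and a small epsilon in the denominator are assumptions made for numerical safety.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """A_i = (r_i - mean(r)) / (std(r) + eps), shared by every token of response i."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, G = 8 sampled responses, binary verifier rewards.
group = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
adv = grpo_advantages(group)
print(adv)

# A group where every response gets the same reward carries no signal.
flat = grpo_advantages([1.0, 1.0, 1.0, 1.0])
print(flat)
```

The second call shows why prompts that are always solved or never solved contribute nothing to the gradient: every advantage in the group is zero.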
The algorithmic trajectory then addressed GRPO's instability: GSPO (Qwen3, Jul 2025) lifted importance sampling to sequence level, fixing MoE instability without routing replay and reducing variance. V3.2 (Dec 2025) introduced DeepSeek Sparse Attention (DSA) for efficient long context and used on-policy distillation alongside its RL stack. In parallel, the VAPO RL framework (from ByteDance Seed, not a DeepSeek release) reintroduced a state-value critic combined with policy optimization, achieving 60.4 AIME with zero crashes in 40% fewer RL steps than DAPO. This signals value-based methods maturing back into scaled RL. Future V4 systems are reported to adopt OPD as a core mechanism, reaching comparable reasoning at ~10x fewer GPU hours.
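A small sketch shows what lifting the importance ratio to sequence level does. It assumes the length-normalized form, the exponential of the mean per-token log-ratio, which is the geometric mean of the token ratios; the log-probabilities are invented.

```python
import math


def token_ratios(logp_new, logp_old):
    """Per-token importance ratios, as used by token-level GRPO."""
    return [math.exp(a - b) for a, b in zip(logp_new, logp_old)]


def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence-level ratio: exp(mean of per-token log-ratios)."""
    diffs = [a - b for a, b in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


# One token's log-prob jumps (for example after an expert-routing change).
logp_old = [-1.0, -2.0, -0.5, -1.5]
logp_new = [-1.0, -0.5, -0.5, -1.5]

tok = token_ratios(logp_new, logp_old)
seq = sequence_ratio(logp_new, logp_old)
print(tok, seq)
```

A single token-level spike of about 4.5 is damped to roughly 1.45 at sequence level, so one volatile token no longer pushes the whole update into the clipped regime.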
RL differs fundamentally in MoE because routing creates train-inference mismatch. Inference (vLLM) and training (FSDP) route independently: ~10% of routers disagree per pass and 94% of tokens differ in expert assignment in at least one layer; each policy update further shifts routing, spiking importance ratios and cascading expert disagreement across layers, triggering bursty clipping and collapse. Token-level GRPO is therefore unstable on MoE. Remedies — R3 (replay inference routing into training), RSPO (router-shift IS correction), TIS, and sequence-level GSPO/VESPO — restore stability, with GSPO reducing train/inference routing KL to dense-model parity. Helpfully, V3's auxiliary-loss-free load balancing (bias terms on routing scores, not gating weights) is RL-compatible: bias adjustments don't couple with expert-output gradients, keeping GRPO/GSPO updates clean.
In synchronous GRPO you keep 2–3 full model copies resident: the policy being optimized, the frozen reference model used for the KL penalty, and optionally a reward model (eliminated under RLVR/function rewards, since the verifier is deterministic). This is already a major win over PPO's 4-way footprint (policy + reference + value/critic + reward), because GRPO's group-relative baseline removes the critic — saving roughly 30–50% of memory and compute. Note that beyond the "copies," the policy alone carries Adam optimizer states (~2× params in fp32 plus master weights) and a separate weight buffer lives in the rollout engine (vLLM/SGLang), often colocated and parameter-synced rather than a true extra copy.
Let's inventory the tensors on a single GPU during a GRPO forward+backward pass on a 1B-parameter model.
Total without LoRA/offload: ~2 + 8 + 2 = 12 GB of resident memory. Scaled linearly to a 70B-class model, that is roughly 840 GB, far beyond one A100 (80 GB HBM). Even sharded evenly with FSDP across 8 A100s it is still about 105 GB per GPU, so full-parameter training at this size needs more GPUs, optimizer offload, or a smaller footprint. This is why large RL pushes toward LoRA (2 small rank matrices instead of full reference copy, saving ~80%) or dropping the reference entirely (Dr.GRPO, DAPO).
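The accounting can be written as a small calculator. It reads the three terms of the total as BF16 policy weights (2 bytes/param), two fp32 Adam moments (8 bytes/param), and a BF16 reference copy (2 bytes/param). That reading is an assumption, and gradients, fp32 master weights, activations, and the rollout engine's KV cache are deliberately left out.

```python
def grpo_resident_gb(params_billion, with_reference=True, fsdp_world_size=1):
    """Rough resident memory in GB per GPU under even sharding (weights + Adam only)."""
    n = params_billion * 1e9
    parts = {
        "policy_bf16": 2 * n,         # trainable policy weights
        "adam_moments_fp32": 8 * n,   # first and second moments, 4 bytes each
        "reference_bf16": 2 * n if with_reference else 0.0,
    }
    gb = {name: size / 1e9 / fsdp_world_size for name, size in parts.items()}
    gb["total"] = sum(gb.values())
    return gb


one_b = grpo_resident_gb(1)
no_ref = grpo_resident_gb(1, with_reference=False)
print(one_b)
print(no_ref)
```

Every term is linear in parameter count, so the same function can be called with a larger size and a larger `fsdp_world_size` to check whether a given cluster leaves headroom for activations.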
Optimizations: dropping KL entirely (DAPO, Dr.GRPO) removes the reference copy. LoRA-based schemes (PERL, LoRASA) freeze a shared backbone so policy and reference differ only by small low-rank adapters, collapsing two copies into one backbone plus adapters. Prefix Grouper shares encoded representations of the long common prompt prefix across the G group samples, cutting FLOPs/activation memory by roughly a 1/G factor and enabling larger groups. Activation offloading plus gradient checkpointing recomputes activations on the backward pass for ~80% activation-memory savings at ~20% throughput cost. FSDP/ZeRO shards params, grads, and optimizer states across data-parallel ranks. Combined, these can take a naive 3-copy + optimizer-state footprint down to roughly a single sharded backbone with adapters.
KV cache is the dominant memory and bandwidth cost in rollout. vLLM's PagedAttention stores the cache in fixed-size blocks placed anywhere in GPU memory, cutting fragmentation waste from 60–80% to under 4% and freeing blocks immediately on completion. SGLang's RadixAttention extends this with a radix-tree prefix index, reusing KV for shared system prompts/few-shot/documents (6.4× on RAG-style prefix-heavy loads). Architecturally, DeepSeek's MLA compresses K/V into a low-rank latent (5–13% of standard KV), reconstructed per-head at decode.
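A back-of-the-envelope calculation shows why the cache dominates. The model shape below (32 layers, 8 KV heads, head dimension 128, BF16) and the block size of 16 tokens are illustrative assumptions, not the configuration of any model named above.

```python
import math


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers."""
    # The factor 2 covers the separate K and V tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def paged_blocks(seq_len, block_size=16):
    """Number of fixed-size blocks a sequence needs under paged allocation."""
    return math.ceil(seq_len / block_size)


per_token = kv_bytes_per_token(32, 8, 128)  # bytes per token
rollout_bytes = per_token * 8192 * 64       # 64 sequences of 8192 tokens
print(per_token, rollout_bytes / 2**30)     # bytes, GiB

# Paged allocation wastes at most block_size - 1 token slots per sequence.
wasted_slots = paged_blocks(1000) * 16 - 1000
print(wasted_slots)
```

With block-granular allocation, the waste per sequence is bounded by one partially filled block rather than by a worst-case contiguous reservation.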
Transfer/compression: pruning (SnapKV, H2O) drops low-attention tokens for 40–75% compression; quantization (KIVI, KVQuant) takes KV to 4–8 bits; KVzip exploits cross-layer redundancy; LayerKV caches only critical layers. In disaggregated prefill/decode serving, attention heads are sharded across GPUs and each GPU caches K/V for its heads; the KV transfer between prefill and decode stages is the network bottleneck. LMCache provides an enterprise KV layer, and a key trick is prefetching KV during prefill to overlap with decode-stage communication.
Multi-GPU comm strategies: tensor parallelism needs a per-layer all-reduce, demanding NVLink; pipeline parallelism passes activations across stages; ring attention arranges GPUs in a logical ring so GPU i streams its K/V block to i+1 while receiving from i−1, overlapping transfer with attention compute for long context; MoE expert parallelism uses all-to-all to dispatch/combine tokens. Tradeoff: RadixAttention's LRU prefix cache consumes GPU memory and is wasteful on low-overlap workloads, while aggressive KV quantization/pruning trades a small quality loss for higher batch sizes and longer contexts.
FP8 (E4M3/E5M2) is floating point: it keeps an exponent field, so it has wide dynamic range and degrades gracefully on the activation outliers that pervade LLMs. INT8 is uniform fixed-point: it has more precision within a calibrated range but no exponent, so it needs careful per-channel/per-tensor scaling (GPTQ/AWQ-style calibration) and is outlier-sensitive — a few large activations blow up the scale and crush the rest. FP8 is natively accelerated in Hopper/Blackwell Transformer Engine tensor cores; INT8 is supported on a wider range of (including older) hardware.
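The outlier sensitivity of INT8 is easy to reproduce. This is a minimal sketch using a single per-tensor absmax scale on synthetic activations; it is not GPTQ or AWQ, and real calibration schemes exist precisely to avoid the effect shown.

```python
def int8_absmax_roundtrip(xs):
    """Quantize to INT8 with one per-tensor absmax scale, then dequantize."""
    scale = max(abs(x) for x in xs) / 127.0
    codes = [max(-127, min(127, round(x / scale))) for x in xs]
    return [c * scale for c in codes]


def mean_abs_error(xs, ys, count):
    """Mean absolute error over the first `count` elements."""
    return sum(abs(x - y) for x, y in zip(xs[:count], ys[:count])) / count


small = [0.01 * i for i in range(-50, 51)]  # 101 activations in [-0.5, 0.5]
with_outlier = small + [60.0]               # same values plus one large outlier

err_clean = mean_abs_error(small, int8_absmax_roundtrip(small), len(small))
err_outlier = mean_abs_error(
    with_outlier, int8_absmax_roundtrip(with_outlier), len(small)
)
print(err_clean, err_outlier)
```

One outlier stretches the scale so far that most of the small activations round to zero. A floating-point format spends exponent bits to avoid exactly this, at the price of fewer mantissa bits.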
Inference: FP8 is the gold standard on Hopper — typically 0.1–0.3% perplexity increase, ~33% faster than FP16 with ~8.5% lower latency, and preserves 99–100% of benchmark performance up to 405B+. INT8 (W8A8) remains attractive on non-Hopper GPUs but requires more calibration effort and risks outlier-induced quality loss. So: prefer FP8 inference where hardware supports it, INT8 as the portable fallback.
Training: BF16 remains the preferred default for stability. FP8 training is emerging (e.g., μ-unit scaling): ~30–50% memory reduction and up to +34% throughput vs BF16 at trillion-token scale, but the reduced 4–5 exponent bits make it fragile across seeds, hyperparameters, and datasets — instabilities are worse than at inference, so it is used selectively, not as production default. INT8 training is rare because gradients span too large a dynamic range. For RL specifically, running rollout in FP8 while training in BF16 introduces a precision-driven train–inference mismatch that inflates importance-sampling ratio variance.
In synchronous RL the training step cannot start until the slowest generation in the batch finishes. If 32 prompts average 100 tokens but one runs to 800, the entire batch (and all GPUs) idle waiting on that straggler — generation length is heavy-tailed, so a handful of long completions dominate wall-clock and tank throughput.
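The cost of the straggler can be quantified with a simple utilization estimate. The sketch assumes a static batch in which every slot decodes one token per step, and simplifies the example to 31 completions of 100 tokens plus one of 800.

```python
def sync_decode_utilization(lengths):
    """Fraction of decode slot-steps that produce a useful token in a static batch."""
    # The batch runs for max(lengths) steps; finished sequences leave their slot empty.
    return sum(lengths) / (len(lengths) * max(lengths))


lengths = [100] * 31 + [800]
util = sync_decode_utilization(lengths)
print(util)
```

Under these assumptions only about 15% of the slot-steps do useful work; the rest is spent waiting on one long completion.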
Solutions fall into scheduling and asynchrony. RollPacker (tail batching) groups prompts by expected length: most steps are "short rounds" of balanced short rollouts, and long-tail responses are consolidated into separate "long rounds," reducing idle time while staying synchronous. Asynchronous RL decouples rollout from training — rollout workers continuously generate into a buffer while training workers consume a steady stream of scored trajectories, yielding 1.53×–2.24× speedups at the cost of staleness that must be importance-corrected.
Partial-rollout methods attack the tail directly. APRIL (Active Partial Rollouts) generates up to max length or a time budget, then ships the incomplete sequence to training and resumes it later, importance-weighting the partial trajectory; it reports +22.5% throughput across GRPO/DAPO/GSPO and +2.1% accuracy from faster convergence. CoPRIS adds concurrency control — bounding the number of in-flight rollouts (not too few, not all) and combining partial generation with importance sampling for better sample efficiency and stability. Continuous batching at the engine level (vLLM/SGLang) further removes head-of-line blocking so finished sequences free their KV immediately and new prompts backfill. The recurring tradeoff: partial/async approaches recover GPU utilization but introduce off-policy data that requires importance weighting and staleness control.
Continuous (iteration-level) batching admits and retires requests every decode step instead of waiting for a fixed batch window, eliminating head-of-line blocking and overlapping prefill with decode. The catch for RL is determinism and on-policyness. Because the batch composition and size change every iteration, batch-size-dependent kernels (RMSNorm, matmul split-K, attention split reductions) pick different reduction orders, so the rollout logprobs are not bitwise reproducible. That breaks the assumption that the logprobs you train against equal the logprobs used to sample, inflating importance-sampling ratio variance and making policy-version comparisons and reward-model consistency unreliable. It also blurs "which policy generated this token" when generation spans weight updates.
Traditional batching: fill a batch of 32 prompts, generate until all finish, then start a new batch. While waiting for the 31st prompt to finish token 512, the 32nd prompt (already done at token 256) idles — head-of-line blocking.
Continuous batching (vLLM, SGLang): the moment any sequence finishes (EOS or max length), retire it and backfill its slot with a waiting prompt. Slots no longer sit idle behind the longest sequence, and throughput climbs (typically 1.5–2× vs fixed batching).
The RL problem: a GRPO sampling engine uses continuous batching to maximize GPU utilization, then ships logprobs to the trainer. But because batch composition changes every iteration, kernels like matmul and attention see different batch sizes and pick different reduction strategies (split-K counts, tile shapes). Floating-point addition is non-associative, so the logprobs differ by batch size. You sample with batch size 8, then later recompute at batch size 32 in the trainer, and the logprob of token 5 under the same weights comes out different, so the importance ratio π_θ/π_old departs from 1 before any policy update has happened.
The fix: enforce batch-invariant kernels: fixed reduction order (Thinking Machines approach in SGLang), deterministic seeded sampling, pinned kernel selection. Cost: ~34% slowdown, part of which CUDA graphs win back (a reported ~2.8× speedup in deterministic mode). This is why RL rollout engines often run in a slightly slower batch-invariant mode while standard serving keeps the faster batch-size-dependent kernels.
vLLM: PagedAttention block KV management plus continuous batching with prefill/decode overlap; simple, high-concurrency, predictable. SGLang: adds RadixAttention prefix caching via a radix tree (+29% throughput, up to 6.4× on prefix-heavy workloads, 30–50% lower latency), but the LRU prefix cache consumes GPU memory and means an identical prompt may hit cache or recompute, which can perturb numerics unless determinism is enforced. SGLang has shipped verified batch-invariant kernels (the Thinking Machines approach): fixed reduction order independent of batch size, chunked-prefill alignment to integer multiples of split_kv_size, and seeded multinomial (Gumbel) sampling. The reported cost is roughly a 34% slowdown, part of which CUDA graphs win back (a reported ~2.8× speedup in deterministic mode). The practical RL guidance: run the rollout engine in a deterministic/batch-invariant mode so continuous batching's variable batch geometry doesn't poison the importance ratios, accepting the throughput cost.
Both engines expose Prometheus-style /metrics. In vLLM the key gauges are gpu_cache_usage_perc (fraction of KV blocks in use), num_requests_running vs num_requests_waiting (queueing), prefix-cache hit rate, time-to-first-token / inter-token latency, preemption/swap counts, and throughput in tokens/sec. SGLang reports analogous scheduler stats: token usage, running/queued requests, RadixAttention cache hit rate, and throughput. At the cluster level you compute MFU (model FLOPs utilization) — achieved FLOPs over peak — as the single most honest efficiency number; Megatron long-context setups target >55% MFU even at 4M tokens.
KV-cache utilization is gpu_cache_usage_perc = used blocks / total allocatable blocks; PagedAttention is what pushes effective utilization up by cutting fragmentation waste from 60–80% to under 4%. During RL training you watch this gauge over the rollout phase together with the preemption/recompute count: a high cache-usage figure means high concurrency but, past a threshold, the scheduler starts preempting and recomputing sequences (especially with long-tail generations), which silently destroys throughput. In colocated setups the rollout KV cache contends with training weights/optimizer state and activations for the same HBM, so you size gpu_memory_utilization to leave headroom. Practical evaluation: track achieved decode batch size, cache-usage percentage, preemption events, prefix-cache hit rate, and tokens/sec across a rollout, and correlate them — rising preemptions at high cache usage signal over-subscription, while a low hit rate on a prefix-heavy workload means RadixAttention isn't paying off. The tradeoff is always concurrency (high KV utilization) versus preemption risk.
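The correlation described above can be sketched as a tiny scraper for Prometheus text output. The sample payloads are invented, the `vllm:`-prefixed metric names are assumptions that vary between engine versions, and the 0.9 threshold is a hypothetical starting point rather than a recommended value.

```python
def parse_metrics(metrics_text, wanted):
    """Pull selected metric values out of Prometheus text-format output."""
    out = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in wanted:
            out[name] = float(value)
    return out


def oversubscribed(prev, cur, usage_threshold=0.9):
    """Flag rising preemptions while the KV cache is nearly full."""
    rising = cur["vllm:num_preemptions_total"] > prev["vllm:num_preemptions_total"]
    return cur["vllm:gpu_cache_usage_perc"] >= usage_threshold and rising


WANTED = {"vllm:gpu_cache_usage_perc", "vllm:num_preemptions_total"}

scrape_1 = """# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.71
vllm:num_preemptions_total{model_name="demo"} 3
"""
scrape_2 = """# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.96
vllm:num_preemptions_total{model_name="demo"} 9
"""

prev = parse_metrics(scrape_1, WANTED)
cur = parse_metrics(scrape_2, WANTED)
print(prev, cur, oversubscribed(prev, cur))
```

In a real loop the two payloads would come from consecutive reads of the engine's /metrics endpoint during a rollout phase, and the flag would prompt lowering concurrency or leaving more KV headroom.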
Backprop runs over the standard parallelism stack. Data parallelism via FSDP/ZeRO shards parameters, gradients, and optimizer states across ranks and uses reduce-scatter to combine gradients into shards; tensor parallelism (Megatron) splits each matmul and inserts all-reduces in both forward and backward; pipeline parallelism partitions layers across stages with a 1F1B micro-batch schedule; context/sequence parallelism splits the sequence for long context; expert parallelism routes MoE tokens via all-to-all. Forward computes activations (gradient checkpointing trades ~20% compute to recompute them in backward and save ~80% activation memory), then the backward pass produces gradients that are synchronized and applied with BF16 compute plus fp32 master weights and gradient accumulation.
In RL the graph is narrower than pretraining: gradients flow only through the policy on the sampled rollout tokens; the reference model is frozen (no grad, used only for KL), and rewards are scalars injected as advantages, not differentiated. The loss is the policy-gradient surrogate (importance ratio × advantage, clipped) plus optional KL, reduced per token with length normalization that must be applied consistently across micro-batches and DP ranks.
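A minimal sketch of the per-token clipped surrogate, and of why the reduction must be chosen deliberately, follows. The two reduction names are illustrative labels rather than any framework's API, and the KL term is omitted.

```python
import math


def clipped_token_losses(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-token PPO/GRPO-style loss: -min(ratio * A, clip(ratio) * A)."""
    losses = []
    for ln, lo in zip(logp_new, logp_old):
        ratio = math.exp(ln - lo)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        losses.append(-min(ratio * advantage, clipped * advantage))
    return losses


def reduce_loss(batch, mode):
    """batch is a list of per-sequence token-loss lists."""
    if mode == "sequence_mean":
        # Mean over tokens within each sequence, then mean over sequences.
        return sum(sum(seq) / len(seq) for seq in batch) / len(batch)
    if mode == "token_mean":
        # Mean over all tokens in the batch, so long sequences weigh more.
        return sum(sum(seq) for seq in batch) / sum(len(seq) for seq in batch)
    raise ValueError(f"unknown mode: {mode}")


on_policy = clipped_token_losses([0.0], [0.0], advantage=1.0)
capped = clipped_token_losses([math.log(2.0)], [0.0], advantage=1.0)
print(on_policy, capped)

batch = [[1.0, 1.0], [0.0] * 8]  # a short sequence and a long one
seq_mean = reduce_loss(batch, "sequence_mean")
tok_mean = reduce_loss(batch, "token_mean")
print(seq_mean, tok_mean)
```

The two reductions give different losses for the same tokens. With token-level averaging, the denominator has to be the global token count across micro-batches and DP ranks; averaging per-rank means would silently reweight sequences.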
A notorious pitfall is gradient scaling under sequence/tensor parallelism: torch.distributed.all_reduce has no autograd backward wrapper, so gradients come out off by exactly the sequence-parallel size (e.g., 4× too large/small at seq_parallel_size=4). The fix is torch.distributed.nn.all_reduce or explicitly scaling by 1/num_processes. Frameworks like VeRL and slime abstract this by delegating to FSDP/FSDP2 or Megatron backends, while keeping the rollout engine (vLLM/SGLang) on a separately synced weight copy.
The core bottleneck in synchronous RL is that generation and training alternate on the same hardware: while you train, the inference engine is idle, and while you generate, the trainer is idle — compounded by long-tail stragglers and a weight-broadcast barrier between every step. Async frameworks decouple these.
AReaL fully decouples generation from training: rollout workers continuously generate trajectories into a buffer with interruptible rollout and dynamic batching, while training workers fetch batches asynchronously, achieving up to 2.57× speedup, with a staleness-enhanced PPO variant and a parallel reward service to handle the off-policy data. A-3PO targets the cost of the proximal-policy forward pass in decoupled PPO by approximating it via log-probability interpolation (π_proxy ≈ α·π_behavior + (1−α)·π_target) with staleness-aware weighting that favors fresher data, for ~1.8× speedup at comparable quality. LlamaRL uses a distributed actor/learner architecture where many actors generate concurrently and a learner updates shared parameters without global locks, scaling to billion-parameter models. Laminar pipelines the generation, reward, and optimization stages so they run concurrently, using chain-based broadcasting to distribute rewards/state and reduce stage-level idle time. slime supports both sync and async modes over its SGLang+Megatron stack.
Concretely, these solve: (1) generation–training serialization (GPU idle in the off-phase), (2) long-tail straggler stalls, (3) the per-step weight-sync barrier, and (4) load imbalance between rollout and trainer. Async generally buys 1.53×–2.24× over sync. The price is staleness — data is several policy versions old — requiring importance correction, truncation/clipping, or staleness-aware weighting to stay stable.
The right distinction is between the token sequence and the KV tensors. What partial-rollout frameworks preserve and resume is the generated token prefix, not the KV cache as a numerically valid artifact. AReaL's interruptible rollout workers stop a long generation at a time/length budget, ship the incomplete sequence to training, and later continue it; APRIL does the same and importance-weights the incomplete sequence. The already-emitted tokens are fixed text, but once the policy takes an optimizer step the weights change, so the KV cache that would correspond to that prefix under the new weights is different from what the old policy produced. In practice the KV tensors are tied to a specific policy version and are generally recomputed (re-prefilled) under the current weights rather than reused across a weight update; reusing stale KV directly would silently mix two policies inside one forward pass.
The off-policyness this creates is handled at the algorithm level, not by keeping caches: the segment generated under the older policy is treated as off-policy data and corrected with importance sampling (and staleness weighting), which is precisely why these frameworks pair partial rollout with IS. For MoE models there is an extra wrinkle — even with re-prefill, the inference and training routers disagree (~10% of routers per pass, ~94% of tokens differ in at least one layer), so what is "preserved" to keep training and inference aligned is the routing decisions, via Rollout Routing Replay (R3): record the inference-engine routing masks and replay them in the trainer, cutting routing KL ~50%. So: token prefixes preserved and resumed; KV tensors generally not reused across policy versions; staleness and MoE routing handled by IS/replay.
Expert parallelism shards the experts of each MoE layer across GPUs/nodes. Because each token is routed to only its top-K experts (DeepSeek-V3: 256 routed + 1 shared expert, 8 active per token, 37B of 671B params activated), EP requires two all-to-all collectives per MoE layer: dispatch tokens to the GPUs holding their chosen experts, then combine the results back. That all-to-all is the dominant cost and makes EP communication-bound, so it demands high-bandwidth interconnect (NVLink intra-node, InfiniBand inter-node). When the all-to-all overlaps well with expert compute, EP delivers large gains: wide expert parallelism reports ~1.8× per-GPU throughput and MegaScale-Infer ~1.90×, which is why MoE now powers 60%+ of open-source releases.
The main throughput killers are load imbalance and traffic. If routing concentrates on a few hot experts, those GPUs become stragglers; DeepSeek's auxiliary-loss-free bias-term balancing (negative bias to overloaded experts, positive to idle ones, applied to routing scores not gating weights) keeps utilization even without degrading expert outputs. Traffic-aware schemes like NETMOE and MoNTA reorder/route communication to reduce all-to-all volume, and expert duplication improves locality at the cost of extra memory.
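The bias-term mechanism can be sketched in a few lines. The update rule (a fixed step `gamma` toward the mean load) and the function names are simplifying assumptions; the key property illustrated is that the bias affects which experts are selected but not the gating weights.

```python
def route_top_k(scores, bias, k):
    """Select experts by biased score; compute gates from the unbiased scores."""
    ranked = sorted(
        range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True
    )
    chosen = ranked[:k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def update_bias(bias, load, gamma=0.01):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    target = sum(load) / len(load)
    new_bias = []
    for b, tokens in zip(bias, load):
        if tokens > target:
            new_bias.append(b - gamma)
        elif tokens < target:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias


scores = [0.50, 0.45, 0.05]  # one token's affinity to three experts
bias = [0.0, 0.0, 0.0]
before = route_top_k(scores, bias, k=1)

# Expert 0 was overloaded in the last step, so its bias is pushed down.
bias = update_bias(bias, load=[10, 2, 0], gamma=0.1)
after = route_top_k(scores, bias, k=1)
gates_k2 = route_top_k(scores, bias, k=2)
print(before, after, gates_k2)
```

After the update, the token is steered to expert 1, while the top-2 gates are still proportional to the original affinities 0.45 and 0.50.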
In an RL loop EP interacts badly with policy updates: routing is volatile, so importance-sampling ratios spike when train- and inference-side routers diverge, destabilizing GRPO. Mitigations are GSPO's sequence-level importance ratio (which averages routing fluctuations across a response and trains MoE stably without routing replay) or R3 routing replay. So EP raises capacity and per-GPU throughput but is gated by all-to-all bandwidth, balanced routing, and, in RL, routing-shift stability.
At 1M–4M tokens attention becomes I/O-bound: GPUs stall waiting on KV exchange across context-parallel ranks. The design goal is to hide that communication behind compute. Ring Attention arranges the context-parallel GPUs in a logical ring so each sends its K/V block to neighbor i+1 while receiving from i−1, overlapping the transfer with the local attention computation. ISO does the overlap at sequence granularity rather than layer-wise, which loosens applicability constraints and yields ~35% prefill reduction on a 4090 and ~15% on an A800. DistCA uses a ping-pong scheme with in-place attention servers to fully overlap communication with compute, sustaining >55% MFU at 4M tokens; Megatron's context parallelism with block-wise chunking (MTraining/FPDT) hits the same >55% target. The general recipe: chunk the sequence, double-buffer the K/V transfers, and schedule the next chunk's comm to run under the current chunk's matmuls.
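The ring schedule works because softmax attention can be accumulated block by block with a running maximum and normalizer. The following single-process sketch simulates that on toy data: it is non-causal, uses one query per rank, and models the transfer as a list rotation, so it illustrates the arithmetic rather than any real communication overlap.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference: softmax attention of one query over all keys at once."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z
            for d in range(len(values[0]))]


def online_update(state, q, k_block, v_block):
    """Fold one K/V block into the running (max, normalizer, accumulator)."""
    m, l, acc = state
    scores = [dot(q, k) for k in k_block]
    m_new = max(m, max(scores))
    scale = math.exp(m - m_new)  # rescales earlier partial sums; 0.0 on first block
    w = [math.exp(s - m_new) for s in scores]
    l = l * scale + sum(w)
    acc = [a * scale + sum(wi * v[d] for wi, v in zip(w, v_block))
           for d, a in enumerate(acc)]
    return m_new, l, acc


def ring_attention(queries, k_blocks, v_blocks):
    """Each rank keeps its query; K/V blocks rotate one hop per step."""
    ranks = len(k_blocks)
    dim = len(v_blocks[0][0])
    states = [(-math.inf, 0.0, [0.0] * dim) for _ in range(ranks)]
    held = list(range(ranks))  # held[i] = index of the block currently on rank i
    for _ in range(ranks):
        for i in range(ranks):
            b = held[i]
            states[i] = online_update(states[i], queries[i], k_blocks[b], v_blocks[b])
        held = [held[(i - 1) % ranks] for i in range(ranks)]  # receive from i-1
    return [[a / l for a in acc] for (_, l, acc) in states]


queries = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
k_blocks = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [-1.0, 0.2]], [[0.3, -0.7], [2.0, 1.0]]]
v_blocks = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]], [[9.0, 1.0], [2.0, 3.0]]]

flat_k = [k for block in k_blocks for k in block]
flat_v = [v for block in v_blocks for v in block]
ring_out = ring_attention(queries, k_blocks, v_blocks)
ref_out = [full_attention(q, flat_k, flat_v) for q in queries]
max_err = max(abs(a - b) for r, f in zip(ring_out, ref_out) for a, b in zip(r, f))
print(max_err)
```

Because each rank only ever needs one remote block at a time, the next block's transfer can in principle run while the current block's scores are being computed, which is the overlap the paragraph above describes.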
Megatron vs FSDP: Megatron-LM is built around tensor parallelism (splitting matmuls, all-reduce per layer) and pipeline parallelism, heavily tuned for NVIDIA NVLink/InfiniBand; it gives SOTA throughput and fine-grained control over the parallelism layout but has a steeper learning curve. FSDP is ZeRO-style sharding of parameters, gradients, and optimizer states with all-gather on the forward pass and reduce-scatter on the backward, natively integrated in PyTorch, more portable across hardware, and the 2024–2025 industry default; its overlap story is prefetching the next layer's parameter all-gather under the current layer's compute. FSDP's communication volume can be higher for very large models, which is why hybrid FSDP+TP (and adding context parallelism for long sequences) is now standard. VeRL and slime expose both Megatron and FSDP backends precisely so you can pick per hardware and context length.
Deterministic execution means fixing every nondeterministic source: seed all RNG, fix reduction orders, and pin kernel selection so a given input always yields the same bits. Batch invariance is the stronger property that a single element's output is identical regardless of the batch size or composition it is run in. It is the missing piece for RL reproducibility, because rollout and training run the same model at different batch sizes, and if logprobs differ by batch you cannot compare policy versions or trust importance ratios.
The cause is not hardware fate. Performance kernels (RMSNorm, matmul, attention) choose different reduction orders, tile shapes, or split-K counts depending on batch size, and floating-point addition is non-associative, so different orders give different bits. The Thinking Machines fix (verified in SGLang) is to fix the reduction order independent of batch size: deterministic RMSNorm reduction, fixed matmul reduction block size, and attention backends pinned to fixed split sizes (FlashInfer fixed split_kv, FlashAttention-3 num-splits=1, Triton), plus chunked-prefill alignment to integer multiples of split_kv_size and seeded multinomial (Gumbel-hash) sampling. The reported cost is a ~34% slowdown, part of which CUDA graphs win back (a reported ~2.8× speedup in deterministic mode).
Is atomic add involved? Sometimes. Kernels that accumulate via atomicAdd (some split-K matmuls, scatter-style ops, some attention split reductions) complete their additions in a nondeterministic order, which is one source of run-to-run nondeterminism. But atomic add is not the root cause of batch variance. Removing atomics (using deterministic reduction trees) buys run-to-run determinism yet still does not give batch invariance, because the reduction strategy still changes with batch size. So atomic add neither fully causes nor solves the problem: you must fix a batch-size-independent reduction order; eliminating atomics is at most a complementary step, not a sufficient one.
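The distinction between run-to-run determinism and batch invariance fits in a few lines of plain Python. Both reductions below are fully deterministic, with no atomics or threads, yet they disagree because the split count changes the order of additions. The split count stands in for a kernel heuristic that depends on batch size, which is an illustrative simplification.

```python
def sequential_sum(xs):
    """Left-to-right accumulation in double precision."""
    acc = 0.0
    for x in xs:
        acc += x
    return acc


def split_sum(xs, num_splits):
    """Sum contiguous chunks separately, then combine the partial sums."""
    size = -(-len(xs) // num_splits)  # ceiling division
    partials = [sequential_sum(xs[i:i + size]) for i in range(0, len(xs), size)]
    return sequential_sum(partials)


xs = [1e16, 1.0, -1e16, 1.0]

one_split = split_sum(xs, 1)   # same order as a plain left-to-right sum
two_splits = split_sum(xs, 2)  # each chunk absorbs its 1.0 into the large term
print(one_split, two_splits)
print(split_sum(xs, 2) == two_splits)  # repeating the same strategy is reproducible
```

Pinning the split strategy, rather than merely removing nondeterministic accumulation, is what makes the result independent of how the work was batched.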
They diagnose the bottleneck differently and therefore build different systems. AReaL frames the rollout bottleneck as the synchronization coupling between generation and training plus long-tail stragglers: GPUs idle whenever the two phases alternate, and the slowest completion gates the step. Its answer is full asynchrony — a bespoke decoupled engine with rollout workers that continuously generate into a buffer, interruptible rollout, dynamic batching, and a parallel reward service, with a staleness-enhanced PPO to absorb the resulting off-policy data (up to 2.57× speedup). The philosophy is algorithm/system co-design: accept and correct for staleness in exchange for never letting either side idle.
slime frames the bottleneck as inference-engine efficiency and the weight-sync bridge between trainer and sampler. Rather than building a custom async engine, it treats rollout as a first-class serving problem and leans on a mature stack: SGLang (RadixAttention prefix caching, router load balancing via sgl-router for multi-turn rollout) for generation and Megatron-LM for training. The engineering effort is concentrated on efficient weight synchronization between the two and on exposing all of Megatron's parallelism settings (TP/PP/EP/CP) and SGLang's serving parameters unchanged. It supports both sync and async modes and is deliberately a lightweight glue layer, proven in production on the GLM-4.5/4.6/4.7/5/5.1 line.
So the contrast is: AReaL views the bottleneck as synchronization itself and removes it with async decoupling and staleness handling (a research-grade async framework); slime views the bottleneck as serving throughput and the train↔rollout weight transfer, and solves it by integrating best-in-class inference (SGLang) and training (Megatron) while keeping the orchestration thin and flexible.
Staleness is the off-policyness budget: how many policy versions old the data being trained on is. Synchronous RL has staleness 0, since every sample comes from the current policy. Moderate async (AReaL-style) keeps data roughly 1–3 updates / 1–5 versions old, and fully decoupled high-async setups run 5–20+ versions old, which is only workable with importance weighting. The right mental model is a trust region in version space: as staleness grows, the gap between the behavior policy that generated a token and the target policy you are updating widens, the importance-sampling ratio π_θ/π_behavior drifts from 1, and gradient variance climbs. Past some point clipping fires chaotically and training becomes biased or collapses.
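The version-space view can be sketched as a buffer whose samples carry the policy version that generated them. The `Trajectory` type, the `admit` helper, and the `max_staleness` parameter are illustrative names chosen here, not the API of any framework.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    prompt: str
    behavior_version: int  # policy version that generated this sample


def staleness(traj, current_version):
    """How many policy versions old the sample is relative to the trainer."""
    return current_version - traj.behavior_version


def admit(buffer, current_version, max_staleness):
    """Keep only samples whose staleness is within the budget."""
    return [t for t in buffer if staleness(t, current_version) <= max_staleness]


buffer = [Trajectory("a", 10), Trajectory("b", 8), Trajectory("c", 3)]
fresh = admit(buffer, current_version=10, max_staleness=4)
print([(t.prompt, staleness(t, 10)) for t in fresh])
```

With `max_staleness=0` this degenerates to synchronous training; raising it trades off-policyness for throughput.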
You therefore treat staleness as a tunable you actively bound, not a free byproduct of throughput. AReaL deliberately balances rollout and training workloads to cap staleness and uses a staleness-enhanced PPO variant; A-3PO weights fresher data more heavily and approximates the proximal policy by log-prob interpolation; truncated importance sampling caps per-token ratios at a fixed bound, and its masked variant drops tokens whose ratios exceed it. In practice the sweet spot is ~1–5 versions: enough asynchrony to keep both engines busy (1.53×–2.24× speedup) without pushing IS ratios into the heavy-tailed regime. You instrument it by monitoring the IS-ratio distribution and clip fraction, and tighten the buffer/concurrency when they blow up. For MoE, add router shift on top of weight staleness: even one version of drift moves expert routing, so the effective staleness budget is smaller unless you apply R3 (Rollout Routing Replay) or GSPO's sequence-level ratio.
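The monitoring described above can be sketched in a few lines. The function names, the clip range `eps=0.2`, and the cap of `2.0` are assumptions chosen for illustration; both a capped and a masked treatment of large ratios are shown so the difference is visible.

```python
import math


def importance_ratios(target_logprobs, behavior_logprobs):
    """Per-token ratio pi_theta / pi_behavior, computed from log-probabilities."""
    return [math.exp(t - b) for t, b in zip(target_logprobs, behavior_logprobs)]


def clip_fraction(ratios, eps=0.2):
    """Share of tokens whose ratio falls outside [1 - eps, 1 + eps]."""
    outside = sum(1 for r in ratios if r < 1 - eps or r > 1 + eps)
    return outside / len(ratios)


def cap_ratios(ratios, cap=2.0):
    """Capped variant: large ratios are truncated to `cap`."""
    return [min(r, cap) for r in ratios]


def mask_ratios(ratios, cap=2.0):
    """Masked variant: tokens with ratios above `cap` contribute nothing."""
    return [r if r <= cap else 0.0 for r in ratios]


target = [-1.0, -1.0, -1.0, -1.0]      # log-probs under the current policy
behavior = [-1.0, -1.0, -2.0, -0.5]    # log-probs recorded at rollout time
ratios = importance_ratios(target, behavior)
print(clip_fraction(ratios), max(ratios))
print(cap_ratios(ratios), mask_ratios(ratios))
```

A rising clip fraction or a growing maximum ratio across steps is the signal to reduce concurrency or shrink the buffer.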
slime is a thin orchestration layer over Megatron-LM (training) and SGLang (inference), coupled by sgl-router. The loop: the Megatron training engine holds the canonical policy weights; those weights are synchronized into the SGLang rollout engine; SGLang generates rollouts — router-backed and multi-turn, benefiting from RadixAttention prefix caching and load balancing across replicas; completions are scored by the reward path (verifiable/RLVR rewards for math/code, or a reward model), capturing the rollout logprobs; the scored trajectories are handed back to Megatron, which recomputes current-policy logprobs in a forward pass, computes the loss, runs the optimizer step, and then re-syncs the updated weights to SGLang. This weight synchronization between trainer and sampler is the load-bearing bridge, and slime supports both sync and async modes around it.
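The synchronous form of that loop can be written as a toy with stand-in classes. `ToyTrainer`, `ToySampler`, and `toy_reward` are hypothetical placeholders for the Megatron engine, the SGLang engine, and the reward path; none of this is slime's actual API, and no real generation or optimization happens.

```python
class ToyTrainer:
    """Stand-in for the training engine; holds the canonical weight version."""

    def __init__(self):
        self.version = 0

    def recompute_and_step(self, batch):
        # A real trainer would recompute log-probs, build the loss, and step the optimizer.
        self.version += 1


class ToySampler:
    """Stand-in for the rollout engine; generates with the weights last synced."""

    def __init__(self):
        self.version = None

    def sync_weights(self, trainer):
        self.version = trainer.version

    def generate(self, prompts):
        return [{"prompt": p, "completion": p.upper(), "policy_version": self.version}
                for p in prompts]


def toy_reward(sample):
    """Hypothetical verifiable reward: 1.0 if the completion is non-empty."""
    return 1.0 if sample["completion"] else 0.0


def run_sync_loop(prompts, num_steps):
    trainer, sampler = ToyTrainer(), ToySampler()
    staleness_log = []
    for _ in range(num_steps):
        sampler.sync_weights(trainer)              # trainer -> sampler weight sync
        batch = sampler.generate(prompts)          # rollout
        for sample in batch:
            sample["reward"] = toy_reward(sample)  # scoring
        staleness_log.extend(trainer.version - s["policy_version"] for s in batch)
        trainer.recompute_and_step(batch)          # loss and optimizer step
    return trainer.version, staleness_log


final_version, staleness_log = run_sync_loop(["q1", "q2"], num_steps=3)
print(final_version, staleness_log)
```

Because the sync happens before every rollout, each sample is trained on at staleness 0. An async variant would let `generate` run while the trainer steps, which is where nonzero staleness comes from.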
Megatron integration is by design a seamless pass-through: all Megatron parallelism knobs — tensor (TP), pipeline (PP), expert (EP), and context (CP) parallelism — and all SGLang serving parameters are exposed directly, so the same configuration scales from pretraining to RL without re-plumbing. The codebase stays lightweight, which is why it ports cleanly across the GLM model line.
Loss computation is the GRPO-family policy-gradient objective. For each prompt slime samples a group of completions, computes group-relative advantages from the rewards (A_i = (r_i − mean)/std, with mean and std taken over the group's rewards, or a Dr. GRPO/DAPO-style aggregation), forms the per-token importance ratio between the recomputed Megatron logprobs and the SGLang rollout logprobs, applies the clipped surrogate, and optionally adds a KL term against a reference model. Everything is reduced with the configured token-level or sequence-level length normalization. Megatron handles the sharded forward/backward and optimizer step; slime supplies advantages, ratios, and reward signals.
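A minimal scalar sketch of those two pieces, group-relative advantages and the clipped per-token surrogate, in plain Python. The function names, the stabilizer `eps`, and the clip range `0.2` are assumptions for illustration; the KL term and the length normalization are left out, and this is not slime's code.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """A_i = (r_i - mean) / (std + eps), with statistics taken over one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_token_loss(train_logprob, rollout_logprob, advantage, clip_eps=0.2):
    """Negative PPO-style clipped surrogate for a single token."""
    ratio = math.exp(train_logprob - rollout_logprob)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return -min(ratio * advantage, clipped * advantage)


rewards = [1.0, 0.0, 0.0, 1.0]          # one group of four completions
advantages = group_advantages(rewards)
print(advantages)

# On-policy token: ratio is 1, so the loss is just the negated advantage.
print(clipped_token_loss(-1.0, -1.0, advantage=1.0))
# Ratio well above 1 with positive advantage: the clip limits the incentive.
print(clipped_token_loss(0.0, -1.0, advantage=1.0))
# Ratio well above 1 with negative advantage: the unclipped penalty applies.
print(clipped_token_loss(0.0, -1.0, advantage=-1.0))
```

The asymmetry in the last two cases is the point of the clipped surrogate: it limits how much a single update can gain from an already favored token, but does not soften the penalty for a disfavored one.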
For my actual context — frontier, large-scale, MoE RL on NVIDIA clusters — I would default to slime, with VeRL as the close second and the others reserved for narrower roles. slime fits the MoE/long-context regime exactly: native Megatron-LM training with full TP/PP/EP/CP pass-through (essential for 671B-class DeepSeek/GLM-style MoE), SGLang rollout with RadixAttention and sgl-router for efficient multi-turn generation, clean weight sync, and both sync and async modes. It is battle-tested in production on the GLM-4.5/4.6/4.7/5/5.1 line and stays a lightweight codebase, so the path from pretraining to RL to deployment is short.
I would pick VeRL instead when I need algorithmic breadth or backend flexibility: it supports PPO, GRPO, RLOO, DAPO, GSPO, REINFORCE++ and more, runs on FSDP/FSDP2 or Megatron for training and vLLM or SGLang for rollout, scales to DeepSeek-671B and MoE, and has SOTA results (DAPO on Qwen2.5-32B hitting 50 on AIME 2024). It is the most general production framework and the safer default if I'm exploring multiple RL algorithms or non-NVIDIA backends.
The others are role-specific. AReaL is what I'd reach for to research fully asynchronous, high-staleness training (2.57× speedup, staleness-enhanced PPO) where the contribution is the async engine itself. Unsloth wins single-GPU, memory-constrained work (≈7× longer context than TRL at equal batch via runtime patching of TRL trainers). TRL is the best on-ramp — superb data preprocessing and ecosystem integration — for prototyping and small/mid-scale runs, though its ~1K-token context ceiling at comparable batch sizes and historically synchronous loop make it weak for frontier scale. Net: slime for MoE production, VeRL for breadth.