Fine-tuning, Post-training & RL
Levels: IC4 · IC5 · IC6

RL Post-Training — Why We Optimize Rewards After SFT

Pretraining teaches a model what is likely, SFT teaches it what to imitate, and RL teaches it what is good — by optimizing a reward signal while a KL leash holds the policy near its trusted SFT prior. This lesson frames the whole pillar — reward sources (RLHF / RLVR / RLAIF), the RM→PPO pipeline, the KL penalty and why it exists, reward hacking as the central danger, and where DPO, GRPO and DeepSeek-R1's emergent reasoning fit.

13 min read · 15 sections
Prerequisites: how cross-entropy / MLE training works (see /ml-foundations/how-models-learn), what SFT is (see /finetuning)

1. Quick anchor

A frontier model is built in three stages, and each optimizes a different objective:

  1. Pretraining — next-token prediction on trillions of tokens. Learns what is likely. Objective: cross-entropy / MLE.
  2. SFT (supervised fine-tuning) — imitate curated demonstrations. Learns what to imitate. Same objective (cross-entropy), narrower data.
  3. RL post-training — optimize a scalar reward on the model's own samples, while a KL penalty tethers it to the SFT model. Learns what is good.

The defining object of the whole pillar is one equation — the KL-regularized reward objective:

$$
J(\theta) = \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)} \Big[\, r(x,y) \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big) \Big]
$$

where $\pi_\theta$ is the policy we are training, $\pi_{\mathrm{ref}}$ is the frozen reference (the SFT checkpoint we started from), $r(x,y)$ is a scalar reward for completion $y$ on prompt $x$, $\beta$ is the KL penalty weight, and $\mathcal{D}$ is a prompt distribution. Everything else in this pillar (PPO, DPO, GRPO and its zoo of variants, RLVR, the infra) is a particular choice of where $r$ comes from and how you take the gradient of this objective without blowing up. Keep this equation in your head; we will return to it repeatedly.

The KL-regularized RL objective — on real numbers

Name every symbol. Left side: $J(\theta)$ is the objective we maximize. Inside: $\mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(\cdot\mid x)}$ means sample a prompt $x$ and then sample a completion $y$ from the current policy, so the RL loop is on-policy. $r(x,y)$ is the reward: a single number for this prompt-completion pair. $\beta$ is the KL weight: big $\beta$ ⇒ stay close to reference; small $\beta$ ⇒ chase reward aggressively. $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ is the KL divergence: how far the policy has drifted from the frozen SFT checkpoint in its token-probability distribution.

Walk a concrete example. Suppose $r(x,y) = 10.0$ (excellent completion) and the KL divergence is $0.5$ nats, with $\beta = 0.1$. Then the objective is 10.0 - 0.1 * 0.5 = 10.0 - 0.05 = 9.95. If instead the policy had drifted further ($D_{\mathrm{KL}} = 2.0$), we'd get 10.0 - 0.1 * 2.0 = 9.8: the reward is good and the penalty is still small. But if $\beta = 1.0$ (tighter leash) and $D_{\mathrm{KL}} = 2.0$, we get 10.0 - 1.0 * 2.0 = 8.0: the same drift now costs a fifth of the reward. This single equation balances two forces: climb the reward, but don't wander from the trusted SFT prior.
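The arithmetic in the walk-through above can be reproduced in a few lines. This is only a sketch of the per-sample quantity inside the expectation; the function name `kl_regularized_objective` is made up for illustration and the numbers are the ones from the example.

```python
def kl_regularized_objective(reward: float, kl_nats: float, beta: float) -> float:
    """Per-sample value of r(x, y) - beta * D_KL for one prompt-completion pair."""
    return reward - beta * kl_nats


# (reward, KL in nats, beta) for the three cases discussed in the text.
cases = [
    (10.0, 0.5, 0.1),  # small drift, loose leash
    (10.0, 2.0, 0.1),  # larger drift, loose leash
    (10.0, 2.0, 1.0),  # larger drift, tight leash
]

for reward, kl, beta in cases:
    value = kl_regularized_objective(reward, kl, beta)
    print(f"r={reward:.1f}  KL={kl:.1f}  beta={beta:.1f}  ->  J={value:.2f}")
```

Holding the drift fixed and changing only `beta` is the quickest way to see that the leash strength, not the drift alone, decides how much reward is given back.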

Illustration: the post-training pipeline

A pretrained model is supervised-fine-tuned, then aligned with reinforcement learning against a reward signal — the RLHF/RLVR pipeline.

2. Why interviewers probe this (per level)

  • IC4 — Can you draw the pipeline and justify each stage? Can you state the objective, explain why the KL term exists, and implement a basic RLHF or RLVR loop without hand-waving? The tell: you treat RL as "optimize reward minus drift," not "optimize reward."
  • IC5 — Can you reason about tradeoffs between reward sources (RM vs verifier vs AI judge), diagnose reward hacking / reward-model overoptimization, and choose DPO vs GRPO vs PPO for a concrete project under a concrete budget? You should be able to name the failure mode before the symptom shows up.
  • IC6 / staff — Can you set strategy? When does the org invest in RLVR (oracles, verifiable domains) vs RLHF (reward models, fuzzy preferences)? Where's the reward-model quality ceiling? Do you believe RL expands capability or merely elicits it — and what does that imply for compute allocation between train-time RL and test-time scaling?

3. Concept build-up

Beginner explainer. New here? The words first.

  • Policy — the model itself, viewed as a decision-maker: given a prompt, it chooses the next token (word-piece). "Improving the policy" just means updating the model's weights.
  • Reward — a single number scoring how good a finished response is. Higher = better.
  • Reward model — a separate model trained on human preference data to predict that reward number, so we can score millions of responses without asking a human each time.
  • SFT (Supervised Fine-Tuning) — the earlier stage: show the model prompt → good answer examples and have it imitate them.
  • RL (Reinforcement Learning) — training by trial: the model generates its own answer, gets a reward, and nudges itself toward higher-reward behavior.
  • Reference model — a frozen copy of the post-SFT model; the "trusted starting point" we don't want to drift far from.
  • KL divergence — a number measuring how different two probability distributions are. Here: how far the policy's token choices have moved from the reference's.

Step by step.

  1. Start from the SFT model (already decent at imitation). Freeze a copy as the reference.
  2. Feed the policy a prompt; it samples a full response.
  3. Score that response with the reward (usually from the reward model).
  4. Measure how far the policy drifted from the reference (KL), and subtract a penalty proportional to it — the "leash."
  5. Update the policy's weights to raise reward − KL penalty: make good answers likelier.
  6. Repeat over many prompts.

Remember this: RL post-training lets an already-decent model chase a reward it can optimize but couldn't be shown by example, while the KL leash keeps it tethered to its trusted SFT starting point so it improves without going off the rails.

3.1 Why RL at all — the structural limits of SFT

SFT is behavioral cloning: maximum likelihood on (prompt, gold-response) pairs. Minimizing cross-entropy is minimizing the forward KL $D_{\mathrm{KL}}(\text{data}\,\|\,\text{model})$, which is mode-covering: it spreads probability to cover every demonstration but has no notion of "this answer is better than that one." Three structural consequences:

  • You can only be as good as your demonstrations. SFT's ceiling is the labeler. It cannot discover a solution no demonstration contained.
  • Exposure bias. SFT only ever conditions on gold prefixes. At inference the model conditions on its own tokens, so errors compound autoregressively (covariate shift) and the model never learned to recover from mistakes it was never shown making.
  • No comparative signal. "Helpful, not sycophantic," "correct, not just plausible" are relative judgments. MLE has no channel for them.

RL fixes all three by changing what you optimize and what distribution you optimize over. You sample from $\pi_\theta$ itself (on-policy), score those samples with a reward, and push probability toward high-reward regions. Because training data is now the model's own output, RL directly attacks exposure bias, and because the signal is a comparative scalar, it can push past the demonstrators. This is also the through-line to on-policy distillation (train the student on its own rollouts judged by a teacher), which several 2025–26 models (Qwen3, DeepSeek-V3.2) lean on as a cheaper cousin of RL.

SFT moves the model to a good region of policy space cheaply and stably; RL searches within and beyond that region under a reward. You almost always want SFT first as a strong, fluent initialization — pure RL from base is possible (R1-Zero, §3.6) but fragile.

3.2 The objective, term by term — and why the KL leash

Return to $J(\theta) = \mathbb{E}[\,r(x,y) - \beta\,D_{\mathrm{KL}}(\pi_\theta\|\pi_{\mathrm{ref}})\,]$. The reward term is obvious: climb it. The KL term is the part interviewers actually care about. In practice it is applied per token and length-normalized, $\frac{\beta}{|y|}\sum_t D_{\mathrm{KL}}^{(t)}$, and there are two common ways to wire it in. They target the same penalty but use different estimators:

  • In-reward (classic PPO/RLHF): fold the penalty into the reward, $\tilde r_t = r_t - \beta\,\big(\log\pi_\theta(y_t) - \log\pi_{\mathrm{ref}}(y_t)\big)$, then run vanilla advantage estimation on $\tilde r$.
  • As an explicit loss term (GRPO): keep $r$ clean and add $\beta\,D_{\mathrm{KL}}$ to the loss directly, using a guaranteed-non-negative low-variance estimator (Schulman's "k3"): $D_{\mathrm{KL}}^{(t)} = \frac{\pi_{\mathrm{ref}}(y_t)}{\pi_\theta(y_t)} - \log\frac{\pi_{\mathrm{ref}}(y_t)}{\pi_\theta(y_t)} - 1 \ge 0$.
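To make the two estimators in the list above concrete, here is a small sketch on a toy 3-token vocabulary. The distributions are made-up numbers and the names `k1`, `k3`, and `exact_kl` are illustrative. It averages each per-token estimate under the policy by exact enumeration rather than sampling, so both averages should match the true KL, while only k3 stays non-negative token by token.

```python
import math


def k1(logp_policy: float, logp_ref: float) -> float:
    """Log-ratio used by the in-reward form: log pi_theta(y) - log pi_ref(y). Can be negative."""
    return logp_policy - logp_ref


def k3(logp_policy: float, logp_ref: float) -> float:
    """k3 estimate: rho - log(rho) - 1 with rho = pi_ref(y) / pi_theta(y). Never negative."""
    log_rho = logp_ref - logp_policy
    return math.exp(log_rho) - log_rho - 1.0


def exact_kl(policy, ref) -> float:
    """KL(policy || ref) for two categorical distributions."""
    return sum(p * math.log(p / q) for p, q in zip(policy, ref))


# Toy next-token distributions (made-up numbers).
policy = [0.7, 0.2, 0.1]
ref = [0.5, 0.3, 0.2]

per_token_k1 = [k1(math.log(p), math.log(q)) for p, q in zip(policy, ref)]
per_token_k3 = [k3(math.log(p), math.log(q)) for p, q in zip(policy, ref)]

# Expectation over y ~ policy, computed exactly.
avg_k1 = sum(p * v for p, v in zip(policy, per_token_k1))
avg_k3 = sum(p * v for p, v in zip(policy, per_token_k3))

print("per-token k1:", [round(v, 3) for v in per_token_k1])
print("per-token k3:", [round(v, 3) for v in per_token_k3])
print("avg k1, avg k3, exact KL:", avg_k1, avg_k3, exact_kl(policy, ref))
```

In a real training loop the average is taken over sampled tokens, so the two estimators differ in variance even though they agree in expectation.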

Why is it there at all? Three reasons, all the same reason:

  1. The reward is only valid near $\pi_{\mathrm{ref}}$. A reward model was trained on outputs that looked like SFT outputs. Off-distribution, its scores are noise, and RL is an adversary that will happily march into that noise. The KL leash keeps the policy in the region where $r$ still means something.
  2. Prevent mode collapse / degeneration. Unconstrained, the policy collapses onto a few high-reward strings (repetition, fixed templates), losing the diversity and fluency pretraining bought you.
  3. It implicitly defines the optimum. The closed-form maximizer of the KL-regularized objective is $\pi^*(y\mid x) \propto \pi_{\mathrm{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big)$, a tilted version of the reference. Small $\beta$ ⇒ aggressive tilt (more reward, more drift, more hacking risk); large $\beta$ ⇒ stay close to SFT (safer, weaker). That same identity is what DPO inverts (§3.6).

The closed-form optimum: a tilted reference

This formula says the best policy is the SFT reference distribution, reweighted (tilted) by the exponential of the reward scaled by $1/\beta$. Name the symbols. $\pi^*$ is the optimal policy. $\pi_{\mathrm{ref}}$ is the SFT model we started from; it gets multiplied by a tilt factor. $\exp(r(x,y)/\beta)$ is that tilt: high reward ⇒ higher tilt ⇒ more probability; low reward ⇒ lower tilt. The $\propto$ means "proportional to": we normalize afterward.

Walk it on concrete numbers. Suppose the reference assigns probability [0.5, 0.3, 0.2] to three continuations, and their rewards are [5.0, 2.0, 1.0] with $\beta = 1.0$. The exponentials are exp(5.0) ≈ 148, exp(2.0) ≈ 7.4, exp(1.0) ≈ 2.7. The tilted (unnormalized) probabilities are [0.5 × 148, 0.3 × 7.4, 0.2 × 2.7] ≈ [74.2, 2.2, 0.54]. Normalize: the sum is ≈ 77.0, so the final probabilities are ≈ [0.96, 0.03, 0.01]. The high-reward option jumped from 50% to 96%; the low-reward option collapsed. Now shrink $\beta$ to 0.1: the exponentials become [exp(50), exp(20), exp(10)] ≈ [5e21, 5e8, 22k], and the highest-reward option dominates so completely that its probability is nearly 1.0. Grow $\beta$ to 5.0: the exponentials become [exp(1), exp(0.4), exp(0.2)] ≈ [2.7, 1.5, 1.2], and the tilted probabilities [1.36, 0.45, 0.24] normalize to ≈ [0.66, 0.22, 0.12], a modest change from the reference. That's the tradeoff: small $\beta$ ⇒ aggressive, high-variance tilt toward reward; large $\beta$ ⇒ conservative, stable, stays near reference.
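The same tilt can be checked numerically. The sketch below is illustrative only (the helper name `tilted_policy` is an assumption, not part of any library). It works in log space and subtracts the maximum before exponentiating, so the small-β case does not overflow.

```python
import math


def tilted_policy(ref_probs, rewards, beta):
    """Return pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), normalized to sum to 1."""
    # log pi_ref(y) + r(y) / beta, shifted by the max so exp() stays finite at small beta.
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


ref_probs = [0.5, 0.3, 0.2]
rewards = [5.0, 2.0, 1.0]

for beta in (0.1, 1.0, 5.0):
    probs = tilted_policy(ref_probs, rewards, beta)
    print(f"beta={beta}: {[round(p, 3) for p in probs]}")
```

Sweeping `beta` over a finer grid is a cheap way to build intuition for how quickly the optimum moves from "almost the reference" to "almost argmax of the reward".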

What breaks at $\beta=0$? You get pure reward maximization with no anchor: the policy drifts out of the RM's valid region, reward-hacks, collapses entropy, and "wins" the proxy while the true objective craters. $\beta=0$ is the canonical way to cause reward hacking on purpose.

3.3 Where the reward comes from — RLHF, RLVR, RLAIF

The single biggest design decision in post-training is the source of $r$. Three families:

  • RLHF: learned reward model. Collect human preference pairs $(y_w \succ y_l)$, train a reward model $r_\phi$ with the Bradley–Terry loss $\mathcal{L} = -\log\sigma\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big)$, then run RL against $r_\phi$'s scalar output. Strength: captures fuzzy, subjective quality (helpfulness, tone, safety). Weakness: labels are expensive, and the RM generalizes poorly out-of-distribution, which is the root cause of reward hacking. The RM is a proxy, and Goodhart's law is waiting.
  • RLVR: verifiable rewards. Replace the learned model with a deterministic oracle: unit tests for code, exact-match / symbolic check for math, a proof checker, a fact lookup. Reward is usually binary, $r \in \{0,1\}$ (a ternary variant adds one more discrete level). Strength: there is no reward model to hack, so you can train hard and long without overoptimization in the RM sense. Empirically dominant on math/code (DeepSeek-R1, §3.6). Weakness: only exists where you can automate verification; no partial credit / nuance; verifiers can still be gamed at the task level (e.g. tests that pass for wrong reasons).
  • RLAIF / Constitutional AI: AI feedback. A stronger LLM (or the model judging itself against a written "constitution") provides the preference signal instead of humans. Strength: scales annotation cheaply, handles domains humans find tedious. Weakness: inherits the teacher's biases and blind spots; risk of correlated errors when teacher and student are similar.

A fourth axis cuts across these: outcome rewards (score only the final answer) vs process rewards / PRMs (score intermediate reasoning steps). Process rewards give denser, earlier signal for reasoning but cost far more labels and introduce their own hacking surface (rewarding plausible-looking steps).
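The Bradley–Terry loss named in the RLHF bullet above is small enough to write out by hand. This is a minimal sketch for a single preference pair given two scalar reward-model scores; the function names are illustrative, and a real reward model would compute the scores with a network and average the loss over a batch.

```python
import math


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected) for one preference pair."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch so exp() never sees a large positive input.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """Model's probability that the chosen response beats the rejected one."""
    return math.exp(-bradley_terry_loss(r_chosen, r_rejected))


print(bradley_terry_loss(0.0, 0.0))      # no preference yet: log(2)
print(bradley_terry_loss(2.0, 0.0))      # agrees with the label: small loss
print(bradley_terry_loss(0.0, 2.0))      # disagrees with the label: large loss
print(preference_probability(2.0, 0.0))  # sigmoid(2)
```

Only the difference of the two scores enters the loss, so the reward model's absolute scale is not pinned down by preference data. That is one reason reward normalization matters downstream.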

3.4 The classic RLHF pipeline: RM → PPO (and why it's heavy)

The original InstructGPT-style recipe, still the reference architecture:

  1. SFT the base model on demonstrations → $\pi_{\mathrm{ref}}$.
  2. Train the reward model $r_\phi$ on preference pairs (Bradley–Terry).
  3. PPO against $r_\phi$ with the KL penalty to $\pi_{\mathrm{ref}}$.

PPO is an actor-critic method, and the cost shows up in memory: you hold four models (policy, frozen reference, reward model, and a learned critic/value network), roughly four times the weight footprint of a single model when all four are the same size (see /finetuning/rl-infrastructure). The critic exists to reduce variance via the advantage $A_t = R_t - V(s_t)$, usually estimated with GAE ($A^{\mathrm{GAE}}_t = \sum_l (\gamma\lambda)^l\,\delta_{t+l}$, trading bias for variance via $\lambda\in[0.9,0.99]$). The core PPO update is the clipped surrogate:

$$
\mathcal{L}^{\mathrm{PPO}} = \mathbb{E}_t\Big[\min\big(\rho_t\,A_t,\;\mathrm{clip}(\rho_t,\,1-\epsilon,\,1+\epsilon)\,A_t\big)\Big], \qquad \rho_t = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\mathrm{old}}(a_t\mid s_t)}
$$

The $\min$ creates an asymmetric trust region: once the importance ratio $\rho_t$ leaves $[1-\epsilon, 1+\epsilon]$ in the helpful direction (above $1+\epsilon$ for a positive advantage, below $1-\epsilon$ for a negative one), the gradient flatlines (a "dead zone"), so a single batch can't yank the policy too far. It's a first-order heuristic: cheaper than TRPO's hard KL constraint but with no monotonic-improvement guarantee and notorious sensitivity to reward normalization, batch size, and learning rate. PPO works, but it's the expensive, fiddly baseline, which is exactly why DPO and GRPO exist.
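A scalar sketch can make the dead zone and the GAE recursion easier to see. Both functions below are illustrative: the names are made up, and the TD residual uses the standard definition δ_t = r_t + γ·V(s_{t+1}) − V(s_t), which the text above does not spell out. `values` is assumed to carry one extra bootstrap entry (0.0 at the end of an episode).

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO objective for one token: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimates; len(values) must be len(rewards) + 1."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # sum of (gamma*lam)^l * delta_{t+l}
        advantages[t] = running
    return advantages


# Positive advantage: pushing the ratio past 1 + eps earns nothing extra (flat objective).
print(clipped_surrogate(1.5, 1.0), clipped_surrogate(2.0, 1.0))
# Negative advantage: a ratio above 1 + eps is NOT clipped, so the penalty keeps growing.
print(clipped_surrogate(1.5, -1.0), clipped_surrogate(2.0, -1.0))
# Negative advantage: once the ratio is already below 1 - eps, the objective is flat again.
print(clipped_surrogate(0.5, -1.0), clipped_surrogate(0.3, -1.0))

# Sparse terminal reward, constant value estimate of 0.5, bootstrap value 0.0.
print(gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.0], gamma=1.0, lam=1.0))
print(gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.0], gamma=1.0, lam=0.0))
```

With `lam=1.0` every step sees the full return minus its value estimate (low bias, high variance). With `lam=0.0` only the one-step TD residual survives (the reverse trade).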

3.5 Reward hacking — the central danger

If you remember one risk from this pillar, make it this. Reward hacking is the policy exploiting flaws in the reward proxy to score highly without the intended behavior — Goodhart's law in a gradient loop. Concretely: length bias (longer = higher RM score), sycophancy (agree with the user to win approval), format gaming, and exploiting RM blind spots out-of-distribution. The signature plot is reward-model overoptimization: as KL distance grows, the proxy reward climbs monotonically while true quality traces an inverted U — up, peak, then down. Your dashboard says you're winning while your users say you're losing.

Mitigations, roughly in order of leverage:

  • The KL leash itself (raise $\beta$) plus early stopping on a held-out human/oracle eval, not on proxy reward.
  • Move to RLVR where you can — an oracle has no exploitable generalization gap in the RM sense.
  • RM ensembles / uncertainty penalties, fresher preference data, periodic RM retraining on the policy's current outputs.
  • Reference resetting (periodically update $\pi_{\mathrm{ref}}$) for long runs, as used in ProRL (§3.6).
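The first mitigation in the list above (early-stop on a held-out eval, not on proxy reward) comes down to which column you take the argmax over. The sketch below uses made-up numbers shaped like the overoptimization curve; `pick_checkpoint` and the tuple layout are assumptions for illustration only.

```python
def pick_checkpoint(history, column):
    """Return the training step whose entry in `column` is highest.

    history: list of (step, proxy_reward, heldout_score) tuples.
    column:  1 to select on proxy reward, 2 to select on the held-out eval.
    """
    best = max(history, key=lambda row: row[column])
    return best[0]


# Made-up run: proxy reward keeps climbing, held-out quality peaks and then falls.
history = [
    (100, 0.20, 0.55),
    (200, 0.45, 0.63),
    (300, 0.70, 0.66),
    (400, 0.90, 0.61),
    (500, 1.10, 0.52),
]

print("select on proxy reward :", pick_checkpoint(history, column=1))
print("select on held-out eval:", pick_checkpoint(history, column=2))
```

The two selections disagree exactly when the run has entered the overoptimized regime, which makes the gap between them a useful alarm in its own right.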

DPO has its own hacking pathology worth naming: likelihood displacement. Optimizing the preference margin often lowers the log-prob of the chosen response while lowering the rejected one faster — the margin grows but both probabilities erode, breaking calibration and sometimes quality. Fixes (DPO-Shift, PG-DPO, β-DPO) all target this.

3.6 Where DPO and GRPO fit (preview), and RLVR's emergent reasoning

DPO (Direct Preference Optimization) skips the reward model and the RL loop. It inverts the closed-form optimum from §3.2: if $\pi^*\propto\pi_{\mathrm{ref}}\exp(r/\beta)$, then the implicit reward is $r(x,y)=\beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$, up to a prompt-only term $\beta\log Z(x)$ that cancels when two completions of the same prompt are compared. Substitute into Bradley–Terry and you get a plain classification loss on preference pairs:

$$
\mathcal{L}^{\mathrm{DPO}} = -\log\sigma\!\Big(\beta\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Big)
$$

It's offline, stable, and cheap (two models, no rollouts, no critic) — but offline means no exploration (you're bounded by your preference dataset), and it's prone to likelihood displacement.
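As a sketch of the DPO loss above for one preference pair, assume the four inputs are summed sequence log-probabilities that have already been computed; the numbers are made up and `dpo_loss` is an illustrative name. The second call shows likelihood displacement in miniature: the chosen response's log-prob falls, the rejected one falls faster, and the loss still improves.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]) for one pair."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Reference log-probs of the chosen (w) and rejected (l) completions (made-up values).
ref_w, ref_l = -12.0, -11.0

# At initialization the policy equals the reference, so the margin is 0 and the loss is log(2).
at_init = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# Both log-probs dropped (chosen: -12 -> -14, rejected: -11 -> -20), yet the loss went down.
displaced = dpo_loss(-14.0, -20.0, ref_w, ref_l)

print(at_init, displaced)
```

Because the loss only sees the difference of the two log-ratios, it cannot tell "chosen went up" apart from "rejected went down faster". Monitoring the chosen log-prob separately is a simple diagnostic.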

GRPO (Group Relative Policy Optimization) keeps the on-policy RL loop but deletes the critic. For each prompt it samples a group of $G$ completions and uses the group's own statistics as the baseline: $A_i = \frac{r_i - \mathrm{mean}(r_{1..G})}{\mathrm{std}(r_{1..G})}$. No value network ⇒ ~50% less memory and far simpler infra ⇒ RL at scale. GRPO + RLVR is the modern workhorse for reasoning, and it spawned a whole family (Dr.GRPO, DAPO, GSPO, CISPO, …) addressing its length/difficulty biases and token-level noise, which is the entire subject of /finetuning/ppo-grpo-and-variants.
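The group-relative advantage is easy to compute by hand for binary rewards. The sketch below assumes the sample standard deviation (the same default as `torch.Tensor.std`) and a small epsilon in the denominator; the name `group_advantages` is illustrative.

```python
from statistics import mean, stdev


def group_advantages(rewards, eps=1e-6):
    """A_i = (r_i - mean(r)) / (std(r) + eps) within one prompt's group of G completions."""
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample std; needs at least two completions per group
    return [(r - mu) / (sigma + eps) for r in rewards]


hard_prompt = [1.0, 0.0, 0.0, 0.0]  # one success out of four
easy_prompt = [1.0, 1.0, 1.0, 0.0]  # one failure out of four

print([round(a, 2) for a in group_advantages(hard_prompt)])
print([round(a, 2) for a in group_advantages(easy_prompt)])
```

In this toy case the rare success on the hard prompt gets the largest positive advantage and the rare failure on the easy prompt gets the largest negative one. The baseline is the group itself, with no learned value network involved.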

The headline result that made all this matter: DeepSeek-R1. R1-Zero applied GRPO with purely verifiable rewards directly to a base model, with no SFT cold start, and reasoning emerged from reward alone: long chain-of-thought, spontaneous self-correction, "aha" backtracking. AIME 2024 pass@1 went 15.6% → 71.0% (86.7% with majority vote). The shipped R1 added a small SFT cold-start for readability plus a 4-stage pipeline (cold-start SFT → reasoning RL → rejection-sampling SFT → safety RL). This reframed RL post-training from "alignment polish" to "a capability lever." The open debate (the elicitation vs expansion question, very alive in 2026): does RL install genuinely new reasoning, or just amplify what the base could already do under enough sampling? ProRL argues expansion with long, diverse training; large-$k$ analyses argue base models catch up (elicitation). Emerging consensus: it depends on task headroom, RL duration, and whether rewards are verifiable.

3.7 The map of this pillar

This lesson is the frame. From here: /finetuning/ppo-grpo-and-variants (the algorithms and their tradeoffs), /finetuning/rl-infrastructure (rollout engines, async RL, memory, MoE stability), and /finetuning/rl-interview-benchmark (timed drills). RL post-training sits downstream of SFT in /finetuning, leans on serving stacks from /inference, is architecturally grounded in /transformers, and is only as trustworthy as your /evals.

4. Minimal implementation

A single RLVR + GRPO training step, honest about what production omits. This is the shortest path from §3 to code; the variants lesson refines every line of it.

"""
Illustrative GRPO + verifiable-reward (RLVR) step. NOT production.
Assumed to exist upstream: rollouts already sampled, with the response mask
and old-policy logprobs cached at sample time (batch.mask, batch.old_logp).
extract_answer and token_logprobs are helpers you must supply.
Omitted on purpose: distributed rollout on a separate inference engine
(vLLM/SGLang), KV-cache reuse, grad accumulation, and sequence packing.
See /finetuning/rl-infrastructure for the parts that actually cost money.
"""
import torch
 
def verifiable_reward(completion: str, gold: str) -> float:
    # RLVR: a deterministic oracle. No reward model => nothing to hack at this layer.
    return 1.0 if extract_answer(completion) == gold else 0.0
 
def grpo_step(policy, ref, batch, *, G, beta=0.04, eps_lo=0.2, eps_hi=0.28):
    # batch.tokens   : [B, T]  completions (B == num_prompts * G)
    # batch.old_logp : [B, T]  logprob under the sampling (behavior) policy
    # batch.rewards  : [B]     scalar reward per completion (from verifiable_reward)
    # batch.mask     : [B, T]  1 on response tokens, 0 on prompt/pad
    logp = token_logprobs(policy, batch.tokens)            # current policy, grad ON
    with torch.no_grad():
        ref_logp = token_logprobs(ref, batch.tokens)       # frozen reference
 
    # --- group-relative advantage: each completion vs. its prompt's group ---
    r = batch.rewards.view(-1, G)                          # [num_prompts, G]
    adv = (r - r.mean(1, keepdim=True)) / (r.std(1, keepdim=True) + 1e-6)
    adv = adv.reshape(-1, 1)                               # same A_i for every token
 
    # --- PPO-style clipped surrogate on the importance ratio ---
    ratio = torch.exp(logp - batch.old_logp)              # pi_theta / pi_old, per token
    # asymmetric "clip-higher" (a DAPO trick) curbs entropy collapse vs symmetric clip
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps_lo, 1 + eps_hi) * adv)
 
    # --- KL-to-reference as an explicit penalty (Schulman k3 estimator, >= 0) ---
    log_rho = ref_logp - logp
    kl = torch.exp(log_rho) - log_rho - 1.0
 
    per_tok = -(surrogate - beta * kl)                    # maximize surrogate, pay KL
    loss = (per_tok * batch.mask).sum() / batch.mask.sum().clamp(min=1)
    return loss

Two honesty notes. (1) std-normalizing the advantage is the standard trick that makes the update roughly invariant to reward scale, but it misbehaves on degenerate groups (a too-easy or too-hard prompt). When every reward in the group is identical, the numerator is zero as well, so every advantage is 0 and the group contributes no gradient: wasted rollouts. When rewards differ only slightly (near-ties from a continuous reward), dividing by a near-zero std inflates noise into large advantages. Clamp the std or filter such groups (Dr.GRPO removes this term entirely). (2) The same scalar advantage is broadcast to every token in a completion, which is the unsolved credit-assignment problem at the heart of all these methods.
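The degenerate-group behaviour in note (1) can be checked with a few lines of plain Python. This is a sketch under stated assumptions: sample standard deviation, the same 1e-6 epsilon as the listing, made-up reward values, and a filtering threshold `min_std` that is purely illustrative.

```python
from statistics import mean, stdev


def group_advantages(rewards, eps=1e-6):
    """(r_i - mean) / (std + eps) within one prompt's group of completions."""
    mu = mean(rewards)
    sigma = stdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_informative(groups, min_std=1e-3):
    """Drop groups whose reward spread is too small to trust (threshold is an assumption)."""
    return [g for g in groups if stdev(g) >= min_std]


all_wrong = [0.0, 0.0, 0.0, 0.0]             # too-hard prompt, binary reward
near_tie = [0.9000, 0.9001, 0.9000, 0.9001]  # continuous reward, differences of 1e-4
mixed = [1.0, 0.0, 0.0, 1.0]                 # informative group

print(group_advantages(all_wrong))  # every advantage is 0.0: no learning signal
print(group_advantages(near_tie))   # tiny reward gaps become advantages of roughly +/-0.85
print(keep_informative([all_wrong, near_tie, mixed]))
```

Filtering of this kind trades throughput for signal quality: dropped groups cost rollouts that produce no update, which is part of why prompt-difficulty curation matters for RLVR runs.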

The GRPO step, section by section

Break the code into three phases.

Phase 1: Compute policy and reference logprobs. The two token_logprobs calls at the top of grpo_step compute the log-probability of each token under the current policy (with grad enabled) and under the frozen reference (inside torch.no_grad()). We need both: the policy's logprobs carry the gradient, and the reference is the anchor we measure drift from.

Phase 2: Group-relative advantage. The block under the "group-relative advantage" comment reshapes rewards from flat [B] (where B = num_prompts × G, G completions per prompt) to [num_prompts, G], so each row is one prompt's group of G completions. It then subtracts the group mean from each reward, divides by the group std, and reshapes to [B, 1] so the same advantage broadcasts across every token of a completion. This normalizes the signal within each prompt's group: whether a prompt is easy (most completions score high) or hard (most score low), its advantages are centered at 0, so only relative quality within the group matters. The + 1e-6 prevents division by zero. A group with identical rewards gets all-zero advantages (no signal), while a near-tied group can get inflated ones. This per-group baseline is GRPO's key trick: no learned value network, just the group's own statistics.

Phase 3: PPO update with KL penalty. The block under the "PPO-style clipped surrogate" comment computes the importance ratio π_θ / π_old, clamps it asymmetrically so one big step can't overshoot, and takes the minimum of the clipped and unclipped ratio times the advantage. The block under the "KL-to-reference" comment computes the per-token KL term using Schulman's k3 estimator exp(log_ρ) - log_ρ - 1, which is always ≥ 0 and low-variance. The per-token loss is -(surrogate - β * KL): maximize the clipped surrogate, pay a penalty proportional to drift. The final line applies batch.mask to zero out prompt and padding tokens and normalizes by the number of response tokens, so training signal flows only through the generated response.

All together: sample completions, score them with a reward, group-normalize the advantage to zero out easy/hard variation, apply PPO's clipped importance-ratio trick to prevent one step from drifting too far, and subtract KL-to-reference to keep the policy tethered to the SFT prior. The result is a stable, memory-cheap on-policy RL update.
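The listing in this section calls an `extract_answer` helper that it never defines. One hypothetical way to fill it in, assuming completions follow the common convention of putting the final answer in `\boxed{...}`, is sketched below. The regex, the last-match rule, and the plain string comparison are all assumptions, not the method of any particular system.

```python
import re
from typing import Optional

# Matches \boxed{...} with no nested braces inside (an assumption that keeps the sketch short).
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_answer(completion: str) -> Optional[str]:
    """Return the content of the last boxed answer in the completion, or None if absent."""
    matches = _BOXED.findall(completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion: str, gold: str) -> float:
    """Binary RLVR reward: 1.0 on exact string match with the gold answer, else 0.0."""
    return 1.0 if extract_answer(completion) == gold.strip() else 0.0


print(verifiable_reward("First I got \\boxed{41}, wait, recheck: \\boxed{42}", "42"))
print(verifiable_reward("The answer is 42.", "42"))
print(verifiable_reward("So the result is \\boxed{42.0}", "42"))
```

The last call shows the oracle's bluntness: "42.0" and "42" are scored as different answers under string matching, so a production verifier would normalize or compare symbolically. That is the "no partial credit / nuance" weakness from §3.3 in practice.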

5. Production tradeoffs

| Method | Reward source | Models in memory | Exploration | Stability | Best for | Primary failure mode |
| SFT | gold demos (none) | 1 | none | very high | bootstrapping a fluent prior | ceiling = labeler; exposure bias |
| DPO | offline preference pairs | 2 (policy + ref) | none (offline) | high | cheap preference alignment, no infra | likelihood displacement; bounded by data |
| PPO-RLHF | learned reward model | 4 (policy, ref, RM, critic) | on-policy | medium (fiddly) | fuzzy human-pref objectives | RM overoptimization / reward hacking |
| GRPO-RLVR | verifiable oracle | 2–3 (policy, ref, +opt. RM) | on-policy, group | medium-high | math/code reasoning at scale | length/difficulty bias; std blow-up; MoE router drift |
| RLAIF / CAI | LLM judge / constitution | 2–4 | on-policy | medium | scaling annotation cheaply | teacher bias; correlated errors |

Rule of thumb: verifiable domain → RLVR + GRPO; fuzzy human preference, small budget → DPO; fuzzy preference, you need exploration and can afford infra → PPO/GRPO with a reward model. Cost ordering is roughly SFT < DPO ≪ GRPO < PPO.

6. How it's asked

Q (IC4). "Why not just do more SFT instead of RL?" SFT is MLE on demonstrations — it can only imitate, so its ceiling is the labeler, and it suffers exposure bias because it only ever sees gold prefixes. RL optimizes a comparative reward on the model's own samples, so it can exceed demonstrators and learn to recover from its own errors. The price is instability and a far heavier loop, which is why we SFT first for a strong, fluent initialization, then RL.

Q (IC5). "What exactly does the KL term buy you, and what does β trade off?" It keeps $\pi_\theta$ near the trusted SFT reference, where (a) the reward proxy is still valid, (b) the model stays fluent and diverse (no mode collapse), and (c) the optimum is well-defined: $\pi^*\propto\pi_{\mathrm{ref}}\,e^{r/\beta}$. Small β ⇒ aggressive reward-chasing, more drift, more hacking; large β ⇒ safe but weak. At β=0 the policy walks off the RM's valid region and reward-hacks. Bonus: PPO folds KL into the reward; GRPO adds it as an explicit non-negative loss term.

Q (IC5). "RM vs verifier — when do you reach for each?" Verifier (RLVR) when an automated oracle exists (code tests, math answer-matching) — there's no reward model to overoptimize, so you can train hard; this is what powered DeepSeek-R1. Reward model (RLHF) when quality is subjective (tone, helpfulness, safety) and no oracle exists — at the cost of expensive labels and reward-hacking risk from the RM's OOD generalization gap. Many pipelines use both: RLVR for reasoning, an RM/RLAIF pass for helpfulness and safety.

Q (IC6). "Reward climbs, human eval peaks early then drops. Diagnose and fix." Classic reward-model overoptimization: the proxy and true objective diverge as KL grows (inverted-U on true quality). First, stop trusting proxy reward — gate on a held-out human/oracle eval and early-stop at its peak. Then, in order: raise β (tighten the leash); check for the usual hacks (length, sycophancy, format); add RM-ensemble/uncertainty penalties and refresh the RM on current-policy outputs; reset the reference periodically for long runs; and where feasible, swap the learned RM for a verifiable reward to remove the exploitable gap entirely.

Q (IC6). "Does RL create new capability or just surface latent base ability?" Open in 2026. Expansion camp (ProRL) shows RL-trained models beating base across pass@k with reference-resetting and long, diverse RLVR — evidence of genuinely new strategies. Elicitation camp shows base models catching up at large k — RL as a sampler-sharpener. Defensible synthesis: expansion when the task has headroom under-covered by pretraining and RL is long/diverse with verifiable rewards; elicitation when the task is already well-covered or RL is short. The practical corollary is a dual scaling law — train-time RL and test-time compute are complementary, not substitutes.

7. Pitfalls & flashcards

  • Optimizing proxy reward instead of the true objective. Reward is a proxy; gate and early-stop on real evals, never on RM score alone.
  • Dropping or mis-tuning the KL term. β=0 is a reward-hacking generator; too-large β makes RL a no-op. It's the single most important hyperparameter after the reward itself.
  • std-normalizing through degenerate groups. All-correct or all-wrong groups yield zero advantage (no signal), and near-tied groups give a ~0 denominator that inflates noise into large advantages. Clamp the std or filter the group.
  • Forgetting the reference is frozen. $\pi_{\mathrm{ref}}$ is fixed at the SFT checkpoint (unless you deliberately reset it); training it is a common bug.
  • Confusing "RL" with "PPO." PPO is one instantiation. DPO removes the loop; GRPO removes the critic; RLVR removes the reward model.
  • Treating credit assignment as solved. Every method here broadcasts one sequence-level reward across all tokens. It isn't solved — it's the open frontier the variants chase.

Flashcard. Pretrain learns likely, SFT learns imitate, RL learns good, by maximizing $\mathbb{E}[\,r(x,y) - \beta\,D_{\mathrm{KL}}(\pi_\theta\|\pi_{\mathrm{ref}})\,]$ on the policy's own samples. The KL leash keeps you where the reward is valid (kill it and you reward-hack into mode collapse). Reward comes from a learned model (RLHF, hackable), a verifier (RLVR: no reward model to overoptimize, though the verifier itself can still be gamed; it powered R1's emergent reasoning), or an AI judge (RLAIF). DPO skips the loop; GRPO skips the critic; the central, ever-present danger is reward hacking: optimizing the proxy while true quality traces an inverted U.

8. Further reading

Next: PPO, GRPO & the Variant Zoo → — now that you know why we optimize reward-minus-KL, see exactly how the modern algorithms take that gradient: critic-free advantages, token vs sequence importance sampling, clip-higher, and the fixes for GRPO's length, difficulty, and MoE-routing pathologies. Then wire it to real hardware in /finetuning/rl-infrastructure and drill it under time in /finetuning/rl-interview-benchmark.
