Pretraining teaches a model what is likely, SFT teaches it what to imitate, and RL teaches it what is good — by optimizing a reward signal while a KL leash holds the policy near its trusted SFT prior. This lesson frames the whole pillar — reward sources (RLHF / RLVR / RLAIF), the RM→PPO pipeline, the KL penalty and why it exists, reward hacking as the central danger, and where DPO, GRPO and DeepSeek-R1's emergent reasoning fit.
A frontier model is built in three stages, and each optimizes a different objective:
The defining object of the whole pillar is one equation, the KL-regularized reward objective:

$$J(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)$$

where $\pi_\theta$ is the policy we are training, $\pi_{\mathrm{ref}}$ is the frozen reference (the SFT checkpoint we started from), $r(x, y)$ is a scalar reward for completion $y$ on prompt $x$, $\beta$ is the KL penalty weight, and $\mathcal{D}$ is a prompt distribution. Everything else in this pillar (PPO, DPO, GRPO and its zoo of variants, RLVR, the infra) is a particular choice of where $r$ comes from and how you take the gradient of this objective without blowing up. Keep this equation in your head; we will return to it repeatedly.
Name every symbol. Left side: $J(\theta)$ is the objective we maximize. Inside: $\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}$ means sample a prompt and then sample a completion from the current policy, so the RL loop is on-policy. $r(x, y)$ is the reward: a single number for this prompt-completion pair. $\beta$ is the KL weight: big β ⇒ stay close to reference; small β ⇒ chase reward aggressively. $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ is the KL divergence: how far the policy has drifted from the frozen SFT checkpoint in its token-probability distribution.
Walk a concrete example. Suppose $r = 10.0$ (excellent completion) and the KL divergence is $0.5$ nats, with $\beta = 0.1$. Then the objective is: 10.0 - 0.1 * 0.5 = 10.0 - 0.05 = 9.95. If instead the policy had drifted further ($\mathrm{KL} = 2.0$), we'd get 10.0 - 0.1 * 2.0 = 9.8: the reward is good and the penalty is still small. But if $\beta = 1.0$ (tighter leash) and $\mathrm{KL} = 2.0$, we get 10.0 - 1.0 * 2.0 = 8.0, and the KL cost suddenly dominates. This single equation balances two forces: "climb the reward" against "don't wander from the trusted SFT prior."
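The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `kl_regularized_objective` is made up for illustration, and it scores a single prompt-completion pair rather than an expectation over a batch.

```python
def kl_regularized_objective(reward: float, kl_nats: float, beta: float) -> float:
    """J = reward - beta * KL for one prompt-completion pair."""
    return reward - beta * kl_nats


# (reward, KL in nats, beta) for the three cases walked through in the text.
cases = [(10.0, 0.5, 0.1), (10.0, 2.0, 0.1), (10.0, 2.0, 1.0)]
values = [kl_regularized_objective(r, kl, beta) for r, kl, beta in cases]

for (r, kl, beta), j in zip(cases, values):
    print(f"r={r}  KL={kl}  beta={beta}  ->  J={j:.2f}")
```

Only the third case, with the larger β, loses a meaningful share of the reward to the penalty.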
A pretrained model is supervised-fine-tuned, then aligned with reinforcement learning against a reward signal — the RLHF/RLVR pipeline.
The words first.
token (word-piece). "Improving the policy" just means updating the model's weights.

prompt → good answer examples and have it imitate them.

Step by step.

reference.

KL), and subtract a penalty proportional to it: the "leash."

reward − KL penalty: make good answers likelier.

Remember this: RL post-training lets an already-decent model chase a reward it can optimize but couldn't be shown by example, while the KL leash keeps it tethered to its trusted SFT starting point so it improves without going off the rails.
SFT is behavioral cloning: maximum likelihood on (prompt, gold-response) pairs. Minimizing cross-entropy is minimizing the forward KL $\mathrm{KL}(p_{\text{data}} \,\|\, \pi_\theta)$, which is mode-covering: it spreads probability to cover every demonstration but has no notion of "this answer is better than that one." Three structural consequences:
RL fixes all three by changing what you optimize and what distribution you optimize over. You sample from $\pi_\theta$ itself (on-policy), score those samples with a reward, and push probability toward high-reward regions. Because training data is now the model's own output, RL directly attacks exposure bias, and because the signal is a comparative scalar, it can push past the demonstrators. This is also the through-line to on-policy distillation (train the student on its own rollouts judged by a teacher), which several 2025–26 models (Qwen3, DeepSeek-V3.2) lean on as a cheaper cousin of RL.
SFT moves the model to a good region of policy space cheaply and stably; RL searches within and beyond that region under a reward. You almost always want SFT first as a strong, fluent initialization — pure RL from base is possible (R1-Zero, §3.6) but fragile.
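Return to $J(\theta)$. The reward term is obvious: climb it. The KL term is the part interviewers actually care about. In practice it is applied per token and length-normalized, $\frac{1}{|y|} \sum_{t=1}^{|y|} \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}$, and there are two equivalent ways to wire it in: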
Why is it there at all? Three reasons, all the same reason:
$$\pi^*(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(r(x, y)/\beta\big)$$

This formula says the best policy is the SFT reference distribution, reweighted (tilted) by the exponential of the reward scaled by $1/\beta$. Name the symbols. $\pi^*$ is the optimal policy. $\pi_{\mathrm{ref}}$ is the SFT model we started from; it gets multiplied by a tilt factor. $\exp(r(x, y)/\beta)$ is that tilt: high reward ⇒ higher tilt ⇒ more probability; low reward ⇒ lower tilt. The $\propto$ means "proportional to": we normalize afterward.
Walk it on concrete numbers. Suppose the reference assigns probability [0.5, 0.3, 0.2] to three continuations, and their rewards are [5.0, 2.0, 1.0] with $\beta = 1$. The exponentials are exp(5.0) ≈ 148, exp(2.0) ≈ 7.4, exp(1.0) ≈ 2.7. The tilted (unnormalized) probabilities are [0.5 × 148, 0.3 × 7.4, 0.2 × 2.7] = [74, 2.2, 0.54]. Normalize: the sum is 76.74, so the final probs are [74/76.74, 2.2/76.74, 0.54/76.74] ≈ [0.96, 0.03, 0.01]. The high-reward option jumped from 50% to 96%; the low-reward options collapsed. Now shrink $\beta$ to 0.1: the tilt factors become [exp(50), exp(20), exp(10)] ≈ [5e21, 5e8, 22k], and the highest-reward option dominates so completely that its probability is nearly 1.0. Grow $\beta$ to 5.0: the tilt factors become [exp(1), exp(0.4), exp(0.2)] ≈ [2.7, 1.5, 1.2], so the tilted probabilities [1.35, 0.45, 0.24] normalize to ≈ [0.66, 0.22, 0.12], a modest change from the reference. That's the tradeoff: small $\beta$ ⇒ aggressive, high-variance tilt toward reward; large $\beta$ ⇒ conservative, stable, stays near reference.
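The same tilt can be reproduced in code. The sketch below assumes a small discrete set of continuations; `tilt` is an illustrative name, and subtracting the largest scaled reward is a numerical-stability step added here (it cancels in the normalization and is not part of the formula).

```python
import math


def tilt(ref_probs, rewards, beta):
    """Return pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), normalized."""
    scaled = [r / beta for r in rewards]
    shift = max(scaled)  # stability only: the shift cancels when we normalize
    weights = [p * math.exp(s - shift) for p, s in zip(ref_probs, scaled)]
    total = sum(weights)
    return [w / total for w in weights]


ref = [0.5, 0.3, 0.2]
rewards = [5.0, 2.0, 1.0]
for beta in (0.1, 1.0, 5.0):
    print(beta, [round(p, 3) for p in tilt(ref, rewards, beta)])
```

Sweeping `beta` this way shows the leash directly: the smaller it is, the more of the probability mass ends up on the single highest-reward continuation.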
What breaks at $\beta = 0$? You get pure reward maximization with no anchor: the policy drifts out of the RM's valid region, reward-hacks, collapses entropy, and "wins" the proxy while the true objective craters. β=0 is the canonical way to cause reward hacking on purpose.
The single biggest design decision in post-training is the source of $r$. Three families:
A fourth axis cuts across these: outcome rewards (score only the final answer) vs process rewards / PRMs (score intermediate reasoning steps). Process rewards give denser, earlier signal for reasoning but cost far more labels and introduce their own hacking surface (rewarding plausible-looking steps).
The original InstructGPT-style recipe, still the reference architecture:
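PPO is an actor-critic method, and the cost shows up in memory: you hold four models (policy, frozen reference, reward model, and a learned critic/value network), roughly 4× the footprint of a single model (see /finetuning/rl-infrastructure). The critic exists to reduce variance via the advantage $A_t = Q(s_t, a_t) - V(s_t)$, usually estimated with GAE ($\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l\, \delta_{t+l}$ with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, trading bias for variance via $\lambda$). The core PPO update is the clipped surrogate:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big], \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$

The $\min$ creates an asymmetric trust region: once the importance ratio $\rho_t$ leaves $[1-\epsilon,\, 1+\epsilon]$ in the helpful direction, the gradient flatlines (a "dead zone"), so a single batch can't yank the policy too far. It's a first-order heuristic: cheaper than TRPO's hard KL constraint but with no monotonic-improvement guarantee and notorious sensitivity to reward normalization, batch size, and learning rate. PPO works, but it's the expensive, fiddly baseline, which is exactly why DPO and GRPO exist.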
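To make the two PPO ingredients concrete, here is a small pure-Python sketch of GAE and the clipped surrogate for scalar inputs. The function names, the default `gamma`/`lam`/`eps` values, and the convention that `values` carries one extra bootstrap entry are assumptions for illustration; real implementations work on batched tensors.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards: per-step rewards, length T.
    values:  critic estimates, length T + 1 (last entry is the bootstrap value).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped objective for one token: min(rho * A, clip(rho) * A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


# A > 0 and the ratio is already above 1 + eps: the value is capped (dead zone).
print(clipped_surrogate(1.5, 1.0))
# A > 0 but the ratio fell below 1 - eps: the unclipped branch wins, gradient flows.
print(clipped_surrogate(0.5, 1.0))
# lam = 1 spreads a terminal reward over every step; lam = 0 keeps it local.
print(gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], lam=1.0))
print(gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], lam=0.0))
```

The asymmetry is visible in the first two calls: the objective stops rewarding further movement in the direction that already helped, but it never hides a move in the wrong direction.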
If you remember one risk from this pillar, make it this. Reward hacking is the policy exploiting flaws in the reward proxy to score highly without the intended behavior — Goodhart's law in a gradient loop. Concretely: length bias (longer = higher RM score), sycophancy (agree with the user to win approval), format gaming, and exploiting RM blind spots out-of-distribution. The signature plot is reward-model overoptimization: as KL distance grows, the proxy reward climbs monotonically while true quality traces an inverted U — up, peak, then down. Your dashboard says you're winning while your users say you're losing.
Mitigations, roughly in order of leverage:
DPO has its own hacking pathology worth naming: likelihood displacement. Optimizing the preference margin often lowers the log-prob of the chosen response while lowering the rejected one faster — the margin grows but both probabilities erode, breaking calibration and sometimes quality. Fixes (DPO-Shift, PG-DPO, β-DPO) all target this.
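DPO (Direct Preference Optimization) skips the reward model and the RL loop. It inverts the closed-form optimum from §3.2: if $\pi^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\exp(r(x, y)/\beta)$, then the implicit reward is $r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$. Substitute into Bradley–Terry and you get a plain classification loss on preference pairs:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]$$

where $y_w$ and $y_l$ are the chosen and rejected responses, and the partition term $Z(x)$ cancels in the difference.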
It's offline, stable, and cheap (two models, no rollouts, no critic) — but offline means no exploration (you're bounded by your preference dataset), and it's prone to likelihood displacement.
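A per-pair version of the DPO loss fits in a few lines. This is a sketch under stated assumptions: the inputs are summed sequence log-probabilities that you have already computed, `dpo_loss` is an illustrative name, and the default `beta=0.1` is just a placeholder value.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_w) - (logp_l - ref_l)]) for one pair."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # softplus(-margin) equals -log(sigmoid(margin)); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Likelihood displacement: the chosen log-prob fell by 1 and the rejected by 3,
# yet the loss still went down because only the margin matters.
print(dpo_loss(-11.0, -13.0, -10.0, -10.0))
```

The second call is the pathology in miniature: the loss only sees the gap between the two log-ratios, so it is satisfied even when the chosen response became less likely in absolute terms.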
GRPO (Group Relative Policy Optimization) keeps the on-policy RL loop but deletes the critic. For each prompt it samples a group of $G$ completions and uses the group's own statistics as the baseline: $\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)}$. No value network ⇒ ~50% less memory and far simpler infra ⇒ RL at scale. GRPO + RLVR is the modern workhorse for reasoning, and it spawned a whole family (Dr.GRPO, DAPO, GSPO, CISPO, …) addressing its length/difficulty biases and token-level noise, the entire subject of /finetuning/ppo-grpo-and-variants.
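The group-relative baseline is simple enough to sketch with the standard library. The function name, the `eps` guard, and the choice of population standard deviation (`pstdev`) are assumptions for illustration; implementations differ on sample vs. population std.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize one prompt's rewards against its own group (no critic)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One correct completion out of four: it gets a large positive advantage.
print(group_advantages([1.0, 0.0, 0.0, 0.0]))
# Degenerate group (all correct): every advantage is zero, so no learning signal.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))
```

Because the baseline is recomputed per prompt, an easy prompt and a hard prompt produce advantages on the same scale, which is what lets GRPO drop the value network.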
The headline result that made all this matter: DeepSeek-R1. R1-Zero applied GRPO with purely verifiable rewards directly to a base model (no SFT cold start) and reasoning emerged from reward alone: long chain-of-thought, spontaneous self-correction, "aha" backtracking. AIME 2024 pass@1 went 15.6% → 71.0% (86.7% with majority vote). The shipped R1 added a small SFT cold-start for readability plus a 4-stage pipeline (cold-start SFT → reasoning RL → rejection-sampling SFT → safety RL). This reframed RL post-training from "alignment polish" to "a capability lever." The open debate (the elicitation vs expansion question, very alive in 2026): does RL install genuinely new reasoning, or just amplify what the base could already do under enough sampling? ProRL argues expansion with long, diverse training; large-$k$ pass@k analyses argue base models catch up (elicitation). Emerging consensus: it depends on task headroom, RL duration, and whether rewards are verifiable.
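Since the elicitation-vs-expansion debate is argued in terms of pass@k, it helps to see how that number is usually computed. The sketch below uses the widely used unbiased combinatorial estimator (draw n samples, count c correct, estimate the chance that at least one of k is correct); the function name is illustrative and nothing here is specific to the DeepSeek-R1 evaluation setup.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# A problem solved by 1 of 10 samples looks weak at k=1 and much better at k=5.
for k in (1, 5, 10):
    print(k, pass_at_k(10, 1, k))
```

This is why large-k comparisons favor base models: a rare but nonzero success rate is enough to push pass@k toward 1 once k is big.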
This lesson is the frame. From here: /finetuning/ppo-grpo-and-variants (the algorithms and their tradeoffs), /finetuning/rl-infrastructure (rollout engines, async RL, memory, MoE stability), and /finetuning/rl-interview-benchmark (timed drills). RL post-training sits downstream of SFT in /finetuning, leans on serving stacks from /inference, is architecturally grounded in /transformers, and is only as trustworthy as your /evals.
A single RLVR + GRPO training step, honest about what production omits. This is the shortest path from §3 to code; the variants lesson refines every line of it.
```python
"""
Illustrative GRPO + verifiable-reward (RLVR) step. NOT production.
Omitted on purpose: distributed rollout on a separate inference engine
(vLLM/SGLang), KV-cache reuse, grad accumulation, sequence packing,
building the prompt/pad loss mask, and old-policy logprob caching at sample time.
See /finetuning/rl-infrastructure for the parts that actually cost money.
"""
import torch
import torch.nn.functional as F


def extract_answer(completion: str) -> str:
    # Hypothetical parser (assumption): treat the last non-empty line as the answer.
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def token_logprobs(model, tokens: torch.Tensor) -> torch.Tensor:
    # Assumption: model(tokens) returns an object with .logits of shape [B, T, V]
    # (the Hugging Face causal-LM convention). Output is [B, T]; position 0 has
    # no prediction, so it is zero-padded and must be masked out by the caller.
    logits = model(tokens).logits[:, :-1]
    logp_all = torch.log_softmax(logits, dim=-1)
    picked = logp_all.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)
    return F.pad(picked, (1, 0))


def verifiable_reward(completion: str, gold: str) -> float:
    # RLVR: a deterministic oracle. No reward model => nothing to hack at this layer.
    return 1.0 if extract_answer(completion) == gold else 0.0


def grpo_step(policy, ref, batch, *, G, beta=0.04, eps_lo=0.2, eps_hi=0.28):
    # batch.tokens   : [B, T] completions (B == num_prompts * G)
    # batch.old_logp : [B, T] logprob under the sampling (behavior) policy
    # batch.rewards  : [B]    scalar reward per completion (from verifiable_reward)
    # batch.mask     : [B, T] 1 on response tokens, 0 on prompt/pad
    logp = token_logprobs(policy, batch.tokens)       # current policy, grad ON
    with torch.no_grad():
        ref_logp = token_logprobs(ref, batch.tokens)  # frozen reference

    # --- group-relative advantage: each completion vs. its prompt's group ---
    r = batch.rewards.view(-1, G)                     # [num_prompts, G]
    adv = (r - r.mean(1, keepdim=True)) / (r.std(1, keepdim=True) + 1e-6)
    adv = adv.reshape(-1, 1)                          # same A_i for every token

    # --- PPO-style clipped surrogate on the importance ratio ---
    ratio = torch.exp(logp - batch.old_logp)          # pi_theta / pi_old, per token
    # asymmetric "clip-higher" (a DAPO trick) curbs entropy collapse vs symmetric clip
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps_lo, 1 + eps_hi) * adv)

    # --- KL-to-reference as an explicit penalty (Schulman k3 estimator, >= 0) ---
    log_rho = ref_logp - logp
    kl = torch.exp(log_rho) - log_rho - 1.0

    per_tok = -(surrogate - beta * kl)                # maximize surrogate, pay KL
    loss = (per_tok * batch.mask).sum() / batch.mask.sum().clamp(min=1)
    return loss
```

Two honesty notes. (1) std-normalizing the advantage is the standard trick that makes the learning rate invariant to reward scale, but it misbehaves on degenerate prompt groups (too easy or too hard). When every completion in a group gets the same reward, the centered rewards are exactly zero and the group contributes no gradient; when rewards differ only slightly, the denominator collapses and small differences are inflated into exploding gradients. Clamp the std or filter degenerate groups (Dr.GRPO removes this term entirely). (2) The same scalar advantage is broadcast to every token in a completion: the unsolved credit-assignment problem at the heart of all these methods.
Break the code into three phases.
Phase 1: Compute policy and reference logprobs. The two `token_logprobs` calls at the top of `grpo_step` compute the log-probability of each token under the current policy (with grad enabled) and under the frozen reference (inside `torch.no_grad()`). We need both: the policy tells us what the learning policy currently thinks, and the reference is the anchor we measure drift from.

Phase 2: Group-relative advantage. The three statements that build `r` and `adv` reshape rewards from flat [B] (where B = num_prompts × G, G completions per prompt) to [num_prompts, G], so each row is one prompt's group of G completions. Then: subtract the group mean from each reward, divide by the group std, and broadcast back to [B, 1]. This normalizes the signal within each prompt's group: if a prompt is easy (all completions score 0.9), rewards are centered near 0 and the std is small; if hard (all score 0.1), same story. The `+ 1e-6` prevents division by zero but does not rescue degenerate groups: identical rewards give exactly zero advantage (no learning signal), and nearly identical rewards are divided by a tiny std and can blow up. This per-group baseline is GRPO's key trick: no learned value network, just the group's own statistics.

Phase 3: PPO update with KL penalty. The remaining statements compute the PPO clipped surrogate (`ratio` and `surrogate`): the importance ratio π_θ / π_old is clamped asymmetrically so one big step can't overshoot, then multiplied by the advantage. In parallel (`log_rho` and `kl`), we compute the KL divergence term per token using Schulman's k3 estimator exp(log_ρ) - log_ρ - 1, which is always ≥0 and low-variance. The per-token loss (`per_tok`) is -(surrogate - β * KL): maximize the clipped surrogate, pay a penalty proportional to drift. The final `loss` line masks out prompt tokens and normalizes by response tokens only, so training signal flows only through the generated response.
All together: sample completions, score them with a reward, group-normalize the advantage to zero out easy/hard variation, apply PPO's clipped importance-ratio trick to prevent one step from drifting too far, and subtract KL-to-reference to keep the policy tethered to the SFT prior. The result is a stable, memory-cheap on-policy RL update.
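The k3 estimator used in the KL step is worth checking in isolation. The sketch below compares it with the naive log-ratio (often called k1) on a made-up three-token distribution; `p`, `q`, and the function names are illustrative assumptions, with `p` playing the policy and `q` the reference.

```python
import math


def k1(logp, ref_logp):
    """Naive per-sample estimate of KL(policy || ref): can be negative."""
    return logp - ref_logp


def k3(logp, ref_logp):
    """Schulman's k3 estimate: exp(log_rho) - log_rho - 1, never negative."""
    log_rho = ref_logp - logp
    return math.exp(log_rho) - log_rho - 1.0


p = [0.5, 0.3, 0.2]  # policy probabilities (hypothetical)
q = [0.4, 0.4, 0.2]  # reference probabilities (hypothetical)

true_kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
# Exact expectation of each estimator under samples from the policy p.
mean_k1 = sum(pi * k1(math.log(pi), math.log(qi)) for pi, qi in zip(p, q))
mean_k3 = sum(pi * k3(math.log(pi), math.log(qi)) for pi, qi in zip(p, q))
per_token_k1 = [k1(math.log(pi), math.log(qi)) for pi, qi in zip(p, q)]
per_token_k3 = [k3(math.log(pi), math.log(qi)) for pi, qi in zip(p, q)]

print(true_kl, mean_k1, mean_k3)
print(per_token_k1)  # mixed signs
print(per_token_k3)  # all >= 0
```

Both estimators agree with the true KL in expectation, but only k3 is non-negative token by token, which is what makes it safe to use as a per-token penalty.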
| Method | Reward source | Models in memory | Exploration | Stability | Best for | Primary failure mode |
|---|---|---|---|---|---|---|
| SFT | gold demos (none) | 1 | none | very high | bootstrapping a fluent prior | ceiling = labeler; exposure bias |
| DPO | offline preference pairs | 2 (policy + ref) | none (offline) | high | cheap preference alignment, no infra | likelihood displacement; bounded by data |
| PPO-RLHF | learned reward model | 4 (policy, ref, RM, critic) | on-policy | medium (fiddly) | fuzzy human-pref objectives | RM overoptimization / reward hacking |
| GRPO-RLVR | verifiable oracle | 2–3 (policy, ref, +opt. RM) | on-policy, group | medium-high | math/code reasoning at scale | length/difficulty bias; std blow-up; MoE router drift |
| RLAIF / CAI | LLM judge / constitution | 2–4 | on-policy | medium | scaling annotation cheaply | teacher bias; correlated errors |
Rule of thumb: verifiable domain → RLVR + GRPO; fuzzy human preference, small budget → DPO; fuzzy preference, you need exploration and can afford infra → PPO/GRPO with a reward model. Cost ordering is roughly SFT < DPO ≪ GRPO < PPO.
Q (IC4). "Why not just do more SFT instead of RL?" SFT is MLE on demonstrations — it can only imitate, so its ceiling is the labeler, and it suffers exposure bias because it only ever sees gold prefixes. RL optimizes a comparative reward on the model's own samples, so it can exceed demonstrators and learn to recover from its own errors. The price is instability and a far heavier loop, which is why we SFT first for a strong, fluent initialization, then RL.
Q (IC5). "What exactly does the KL term buy you, and what does β trade off?" It keeps $\pi_\theta$ near the trusted SFT reference, where (a) the reward proxy is still valid, (b) the model stays fluent and diverse (no mode collapse), and (c) the optimum is well-defined: $\pi^* \propto \pi_{\mathrm{ref}} \exp(r/\beta)$. Small β ⇒ aggressive reward-chasing, more drift, more hacking; large β ⇒ safe but weak. At β=0 the policy walks off the RM's valid region and reward-hacks. Bonus: PPO folds KL into the reward; GRPO adds it as an explicit non-negative loss term.
Q (IC5). "RM vs verifier — when do you reach for each?" Verifier (RLVR) when an automated oracle exists (code tests, math answer-matching) — there's no reward model to overoptimize, so you can train hard; this is what powered DeepSeek-R1. Reward model (RLHF) when quality is subjective (tone, helpfulness, safety) and no oracle exists — at the cost of expensive labels and reward-hacking risk from the RM's OOD generalization gap. Many pipelines use both: RLVR for reasoning, an RM/RLAIF pass for helpfulness and safety.
Q (IC6). "Reward climbs, human eval peaks early then drops. Diagnose and fix." Classic reward-model overoptimization: the proxy and true objective diverge as KL grows (inverted-U on true quality). First, stop trusting proxy reward — gate on a held-out human/oracle eval and early-stop at its peak. Then, in order: raise β (tighten the leash); check for the usual hacks (length, sycophancy, format); add RM-ensemble/uncertainty penalties and refresh the RM on current-policy outputs; reset the reference periodically for long runs; and where feasible, swap the learned RM for a verifiable reward to remove the exploitable gap entirely.
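The "gate on a held-out eval and early-stop at its peak" step can be sketched as a small checkpoint selector. Everything here is hypothetical: the function name, the `patience` rule, and the scores in `history` are invented for illustration and are not measurements.

```python
def early_stop_step(history, patience=2):
    """Pick the checkpoint with the best held-out score.

    history: list of (step, heldout_score) in training order.
    Scanning stops once the score has failed to improve `patience` times in a row.
    """
    best_step, best_score, bad_evals = None, float("-inf"), 0
    for step, score in history:
        if score > best_score:
            best_step, best_score, bad_evals = step, score, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break
    return best_step


# Made-up held-out scores: rising, peaking at step 300, then declining.
history = [(100, 0.52), (200, 0.58), (300, 0.61), (400, 0.59), (500, 0.55)]
chosen = early_stop_step(history)
print(chosen)
```

The selector never looks at proxy reward, which is the point of the diagnosis: the reward model's score keeps climbing past the step where true quality peaks.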
Q (IC6). "Does RL create new capability or just surface latent base ability?" Open in 2026. Expansion camp (ProRL) shows RL-trained models beating base across pass@k with reference-resetting and long, diverse RLVR — evidence of genuinely new strategies. Elicitation camp shows base models catching up at large k — RL as a sampler-sharpener. Defensible synthesis: expansion when the task has headroom under-covered by pretraining and RL is long/diverse with verifiable rewards; elicitation when the task is already well-covered or RL is short. The practical corollary is a dual scaling law — train-time RL and test-time compute are complementary, not substitutes.
std-normalizing through degenerate groups. All-correct or all-wrong groups have zero spread, so they contribute no gradient, and near-degenerate groups give a ~0 denominator that turns tiny reward differences into exploding gradients. Clamp the std or filter the group.

Flashcard. Pretrain learns likely, SFT learns imitate, RL learns good, by maximizing $\mathbb{E}[r] - \beta\,\mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ on the policy's own samples. The KL leash keeps you where the reward is valid (kill it and you reward-hack into mode collapse). Reward comes from a learned model (RLHF, hackable), a verifier (RLVR, much harder to hack; it powered R1's emergent reasoning), or an AI judge (RLAIF). DPO skips the loop; GRPO skips the critic; the central, ever-present danger is reward hacking: optimizing the proxy while true quality traces an inverted U.
Next: PPO, GRPO & the Variant Zoo. Now that you know why we optimize reward-minus-KL, see exactly how the modern algorithms take that gradient: critic-free advantages, token vs sequence importance sampling, clip-higher, and the fixes for GRPO's length, difficulty, and MoE-routing pathologies. Then wire it to real hardware in /finetuning/rl-infrastructure and drill it under time in /finetuning/rl-interview-benchmark.