RLHF/RLVR, PPO → GRPO and the variant zoo (DAPO, GSPO, Dr.GRPO…), RL infrastructure, and the 35-question RL interview benchmark, answered.
Pretraining teaches a model what is likely, SFT teaches it what to imitate, and RL teaches it what is good — by optimizing a reward signal while a KL leash holds the policy near its trusted SFT prior. This lesson frames the whole pillar — reward sources (RLHF / RLVR / RLAIF), the RM→PPO pipeline, the KL penalty and why it exists, reward hacking as the central danger, and where DPO, GRPO and DeepSeek-R1's emergent reasoning fit.
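As a rough illustration of the KL leash, the sketch below subtracts a scaled KL estimate from a sequence-level reward. The function name `kl_shaped_reward`, the default `beta`, and the per-token log-probability inputs are assumptions made for this example, not any particular library's API; it uses the simple sampled estimate (sum of log-prob differences on the sampled tokens), which is only one of several estimators in use.

```python
def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Return reward minus beta times a sampled KL estimate.

    reward      -- scalar score for the whole response (RM or verifier)
    logp_policy -- per-token log-probs of the sampled tokens under the policy
    logp_ref    -- per-token log-probs of the same tokens under the frozen
                   reference (SFT) model
    beta        -- hypothetical KL coefficient; larger means a tighter leash
    """
    # Sampled estimate of KL(policy || ref) along this one trajectory.
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return reward - beta * kl_estimate


if __name__ == "__main__":
    # The policy assigns higher log-prob than the reference to both tokens,
    # so the shaped reward is pulled below the raw reward of 1.0.
    print(kl_shaped_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1))
```

When the policy matches the reference on every sampled token, the penalty is zero and the shaped reward equals the raw reward; the further the policy drifts, the more reward it must earn to justify the drift.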
Policy-gradient RL for LLMs is one question asked a dozen ways: how do you turn a scalar reward into a stable, low-variance gradient on a 100k-vocab autoregressive policy? PPO answers with a clipped surrogate and a learned critic; GRPO drops the critic and uses the group mean as the baseline; DPO skips online sampling entirely and trains directly on preference pairs. Dr.GRPO, DAPO, GSPO, and CISPO then form a sequence of targeted fixes to GRPO's biases. Derive each, and know exactly what it fixes and what it breaks.
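A minimal sketch of the two ideas named above, in plain Python: a group-relative advantage (the group mean stands in for the critic) and the PPO-style clipped surrogate for a single probability ratio. The function names, the `eps` stabilizer, and the `clip` default are assumptions chosen for illustration. Dividing by the group standard deviation follows the original GRPO formulation; some of the variants listed drop that normalization.

```python
import math


def group_advantages(rewards, eps=1e-6):
    """Advantages for one prompt's group of sampled responses.

    The group mean acts as the baseline, so no learned critic is needed.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps (an assumed stabilizer) avoids division by zero when every
    # response in the group gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip=0.2):
    """PPO clipped objective for one sample (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); clip is the assumed clip range.
    """
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    # Taking the min makes the objective pessimistic: no extra credit
    # for moving the ratio beyond the clip range in the favorable direction.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    advs = group_advantages([1.0, 0.0, 1.0, 0.0])
    print(advs)
    print(clipped_surrogate(1.5, 1.0))
    print(clipped_surrogate(0.5, -1.0))
```

Note the degenerate case: if every response in a group receives the same reward, all advantages are zero and the group contributes no gradient.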
The half of RL hiring that people forget. How many model copies GRPO actually holds, why your GPUs idle on long-tail rollouts, when async beats sync (and what staleness costs you), VeRL vs TRL vs slime vs AReaL, Megatron vs FSDP, why RL training is nondeterministic, and the MoE train-inference router mismatch that silently collapses runs.
Xiuyu Li's 35-question RL interview benchmark, answered in depth, covering both Algorithm and Infrastructure. The reference answers the original deliberately leaves out.