Pillar · session 4 · live

Fine-tuning, Post-training & RL

RLHF/RLVR, PPO → GRPO and the variant zoo (DAPO, GSPO, Dr.GRPO…), RL infrastructure, and the 35-question RL interview benchmark, answered.

IC4 · IC5 · IC6

RL Post-Training — Why We Optimize Rewards After SFT

Pretraining teaches a model what is likely, SFT teaches it what to imitate, and RL teaches it what is good — by optimizing a reward signal while a KL leash holds the policy near its trusted SFT prior. This lesson frames the whole pillar — reward sources (RLHF / RLVR / RLAIF), the RM→PPO pipeline, the KL penalty and why it exists, reward hacking as the central danger, and where DPO, GRPO and DeepSeek-R1's emergent reasoning fit.

13 min read
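As a rough illustration of the KL leash mentioned in this lesson, the sketch below subtracts a scaled KL estimate from a scalar reward. It assumes the common formulation in which the penalty is `beta` times the sum of per-token log-probability differences between the policy and the SFT reference on the sampled tokens. The function name, the `beta` value, and the toy numbers are hypothetical and are not taken from the lesson itself.

```python
def kl_shaped_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Return the reward minus beta times a sampled KL estimate.

    The estimate is sum_t [log pi(y_t) - log pi_ref(y_t)] over the sampled
    tokens, i.e. a single-sample Monte Carlo estimate of KL(pi || pi_ref).
    The value of beta here is an assumption for illustration only.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must have the same length")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate


# Toy usage: the policy assigns more probability to its own sample than the
# reference does, so the shaped reward is lower than the raw reward.
shaped = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
```

A larger `beta` holds the policy closer to the reference; a `beta` of zero removes the leash entirely, which is where reward hacking tends to become a concern.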
IC5 · IC6

PPO, GRPO, DPO & the Variant Zoo

Policy-gradient RL for LLMs is one question asked a dozen ways — how do you turn a scalar reward into a stable, low-variance gradient on a 100k-vocab autoregressive policy? PPO answers with a clipped surrogate and a learned critic; GRPO drops the critic and lets the group mean be the baseline; DPO skips sampling entirely. Then Dr.GRPO, DAPO, GSPO, and CISPO are a precise sequence of bug-fixes to GRPO's biases. Derive each, and know exactly what it fixes and what it breaks.

18 min read
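To make "the group mean is the baseline" concrete, here is a minimal sketch of group-relative advantages for several sampled completions of one prompt. It assumes one scalar reward per completion; the optional division by the group standard deviation follows the usual GRPO description, and switching it off corresponds to the Dr.GRPO-style change that removes that normalization. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize_std=True, eps=1e-6):
    """Compute one advantage per completion within a single prompt's group.

    The baseline is the group mean reward, so no learned critic is needed.
    With normalize_std=True the centered rewards are also divided by the
    group standard deviation (population std is an assumption here).
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        return centered
    scale = pstdev(rewards) + eps  # eps avoids division by zero
    return [c / scale for c in centered]


# Toy usage: four completions of one prompt, two judged correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every completion in a group receives the same reward, all advantages are zero and the group contributes no gradient.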
IC5 · IC6

RL Infrastructure: Rollouts, Async, and the MoE Mismatch

The half of RL hiring that people forget. How many model copies GRPO actually holds, why your GPUs idle on long-tail rollouts, when async beats sync (and what staleness costs you), VeRL vs TRL vs slime vs AReaL, Megatron vs FSDP, why RL training is nondeterministic, and the MoE train-inference router mismatch that silently collapses runs.

12 min read
IC5 · IC6

RL Interview Benchmark — 35 Questions, Answered

Xiuyu Li's 35-question RL interview benchmark, answered in depth — Algorithm and Infrastructure. The reference answers the original deliberately leaves out.

42 min read