The full math of the classic three-stage pipeline: a reward model from preferences, a KL-regularized RL objective, and the PPO update that optimizes it.
| Symbol | Description |
|---|---|
| \(x,\; y\) | Prompt and a full completion |
| \(y_w,\; y_l\) | Preferred (winner) and dispreferred (loser) completion in a comparison |
| \(\pi_\theta\) | Policy being trained (the LM), parameters \(\theta\) |
| \(\pi_{\text{ref}}\) | Frozen reference policy (the SFT model) |
| \(r_\phi(x,y)\) | Learned reward model, parameters \(\phi\); outputs a scalar score |
| \(\beta\) | KL penalty coefficient |
| \(s_t,\; a_t\) | State (prompt + tokens so far) and action (next token) at step \(t\) |
| \(V_\psi(s)\) | Value function (critic), parameters \(\psi\) |
| \(\hat A_t\) | Estimated advantage at step \(t\) |
| \(\gamma,\;\lambda\) | Discount and GAE smoothing factors |
| \(\sigma\) | Logistic sigmoid \(1/(1+e^{-t})\) |
RLHF (as in InstructGPT) aligns a pretrained language model to human preferences in three sequential stages. Each stage produces the input for the next.
This page focuses on the math of Stages 2 and 3. Stage 1 is ordinary cross-entropy fine-tuning — see the cross-entropy page.
We can't ask humans for absolute scores, only comparisons. The reward model is a scalar head on a transformer that maps \((x,y)\) to a single number, trained so that preferred completions score higher. Preferences are modeled with Bradley–Terry:
\[ p(y_w \succ y_l \mid x) = \frac{\exp r_\phi(x,y_w)}{\exp r_\phi(x,y_w) + \exp r_\phi(x,y_l)} = \sigma\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big) \]

Bradley–Terry gives the probability of a single comparison. To turn it into a training loss we fit \(\phi\) by maximum likelihood over the whole comparison dataset \(\mathcal{D}\), in three steps.
Each example in \(\mathcal{D}\) is already labeled so that \(y_w\) is the human-preferred answer. Assuming examples are independent, the probability of observing all the labels is the product of the per-comparison probabilities:
\[ \mathcal{P}(\phi) = \prod_{(x,y_w,y_l)\in\mathcal{D}} p(y_w \succ y_l \mid x) = \prod_{(x,y_w,y_l)\in\mathcal{D}} \sigma\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big) \]

Maximizing \(\mathcal{P}\) is the same as maximizing \(\log\mathcal{P}\) (the log is monotonic), and the log turns the product into a sum, which is easier to optimize and numerically stable:
\[ \log \mathcal{P}(\phi) = \sum_{(x,y_w,y_l)\in\mathcal{D}} \log \sigma\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big) \]

Flip the sign to turn "maximize log-likelihood" into "minimize a loss," and average over the dataset (the \(\tfrac{1}{|\mathcal{D}|}\sum\) is written as an expectation \(\mathbb{E}_{\mathcal{D}}\)):

\[ \mathcal{L}_{\text{RM}}(\phi) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\Big[\log \sigma\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big)\Big] \]
Only reward differences matter, so \(r_\phi\) is identified up to an additive constant; implementations usually normalize it to zero mean. This is the same Bradley–Terry likelihood that DPO later reuses — the difference is where the reward lives (a separate network here, the policy itself in DPO).
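As a rough illustration of the loss and of the additive-constant invariance, here is a minimal sketch in plain Python. The score pairs are made-up numbers standing in for \(r_\phi(x,y_w)\) and \(r_\phi(x,y_l)\), and the helper names `log_sigmoid` and `rm_loss` are hypothetical, not taken from any library.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign so that exp() never receives a large positive argument.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def rm_loss(pairs):
    """Mean negative log-likelihood over (r_w, r_l) score pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)

# Hypothetical reward-model scores (winner, loser) for three comparisons.
pairs = [(2.0, 0.5), (0.1, 0.3), (1.2, 1.2)]
loss = rm_loss(pairs)

# Shifting every score by the same constant leaves the loss unchanged,
# because only the difference r_w - r_l enters the sigmoid.
shifted = [(r_w + 10.0, r_l + 10.0) for r_w, r_l in pairs]
loss_shifted = rm_loss(shifted)

print(loss, loss_shifted)
```

A tied pair contributes exactly \(\log 2\) to the loss, and a pair ranked the wrong way round (such as the second one above) contributes more than that.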
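With \(r_\phi\) frozen, Stage 3 trains the policy to maximize reward while staying close to the reference model. The KL leash stops the policy from drifting into degenerate text that fools the reward model (reward hacking):

\[ \max_{\theta}\;\; \mathbb{E}_{x}\Big[\, \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\text{ref}}(\cdot\mid x)\big) \Big] \]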
This is the same objective DPO starts from — DPO solves it in closed form, whereas standard RLHF optimizes it directly with reinforcement learning. The rest of this page is how that direct optimization works.
In practice the KL is not computed exactly, because written as a sum, \(\mathrm{KL} = \sum_y \pi_\theta(y\mid x)\log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}\), it ranges over every possible completion \(y\) — \(|\text{vocab}|^{\text{length}}\) sequences, far too many to enumerate (the same wall as \(Z(x)\) on the DPO page).
But the very same quantity is an expectation, \(\mathrm{KL} = \mathbb{E}_{y\sim\pi_\theta}\big[\log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}\big]\), and RLHF is already sampling \(y\sim\pi_\theta\) to generate rollouts. So the log-ratio on that one sampled sequence is a free, unbiased Monte-Carlo estimate of the KL. Absorbing it into the reward gives the effective reward:
\[ R(x,y) = r_\phi(x,y) \;-\; \beta \log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} \]

Because generation is autoregressive, \(\pi(y\mid x) = \prod_t \pi(a_t\mid s_t)\), and \(\log\) of a product is a sum, so the sequence log-ratio splits exactly into per-token log-ratios, \(\log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} = \sum_t \log\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{ref}}(a_t\mid s_t)}\). The reward model then contributes a single terminal reward at the final token, while the KL penalty becomes a dense per-token shaping reward:
\[ R_t = \underbrace{\mathbf{1}[t = T]\, r_\phi(x,y)}_{\text{terminal RM reward}} \;-\; \underbrace{\beta\, \log\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{ref}}(a_t\mid s_t)}}_{\text{per-token KL penalty}} \]

How do you maximize an expected reward when the thing you sample from, \(\pi_\theta\), is what you're differentiating? The policy gradient theorem (the log-derivative / REINFORCE trick) gives the answer. A trajectory is \(\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)\) with return \(R(\tau)\), and the objective is the expected return:
\[ J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)] = \sum_\tau \pi_\theta(\tau)\, R(\tau) \]

Only the trajectory probability depends on \(\theta\) (the return is just a number once \(\tau\) is fixed), so:

\[ \nabla_\theta J(\theta) = \sum_\tau \nabla_\theta \pi_\theta(\tau)\, R(\tau) \]

This is not yet an expectation: the sum is weighted by \(\nabla_\theta \pi_\theta(\tau)\) rather than by \(\pi_\theta(\tau)\), so it can't be estimated by sampling. We must get it back to the form \(\sum_\tau \pi_\theta(\tau)[\cdots]\).
From \(\nabla_\theta \log \pi_\theta(\tau) = \tfrac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)}\), rearrange to \(\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\) and substitute:
\[ \nabla_\theta J(\theta) = \sum_\tau \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\,R(\tau) = \mathbb{E}_{\tau\sim\pi_\theta}\!\big[R(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\big] \]

This is an expectation under \(\pi_\theta\) again, and we already sample \(\tau\sim\pi_\theta\) during rollouts, so it is now estimable.
The trajectory probability factorizes into the initial state, the per-step policy, and the environment dynamics:
\[ \pi_\theta(\tau) = p(s_0)\prod_{t} \pi_\theta(a_t\mid s_t)\, P(s_{t+1}\mid s_t, a_t) \]

Taking \(\log\) (product → sum) then \(\nabla_\theta\): the initial-state term \(p(s_0)\) and the dynamics \(P(s_{t+1}\mid s_t,a_t)\) do not depend on \(\theta\) (the agent controls neither), so they vanish, leaving only the policy:

\[ \nabla_\theta \log \pi_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t) \]

Combining gives the policy gradient theorem (REINFORCE form):

\[ \nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\;R(\tau)\right] \]

Each action's score function \(\nabla\log\pi_\theta(a_t\mid s_t)\) points in the direction that makes \(a_t\) more likely; weighting it by the return reinforces actions from good trajectories and suppresses those from bad ones.
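The identity can be checked numerically on a toy problem. The sketch below assumes a one-step episode with a softmax policy over three actions (a bandit, chosen only for illustration; the logits and rewards are made up). It computes the score-function expectation exactly by enumerating actions and compares it with a finite-difference gradient of \(J(\theta)\).

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_return(theta, rewards):
    """J(theta) = sum_a pi(a) R(a) for a one-step episode."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def score_function_gradient(theta, rewards):
    """E_a[ R(a) * grad log pi(a) ], with the expectation taken exactly."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p_a, r_a) in enumerate(zip(pi, rewards)):
        for k in range(len(theta)):
            # d/d theta_k of log softmax(theta)[a] is 1[a == k] - pi[k].
            grad[k] += p_a * r_a * ((1.0 if a == k else 0.0) - pi[k])
    return grad

def finite_difference_gradient(theta, rewards, h=1e-6):
    """Central-difference approximation of grad J, used only as a reference."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += h
        down[k] -= h
        grad.append((expected_return(up, rewards) - expected_return(down, rewards)) / (2 * h))
    return grad

theta = [0.2, -0.5, 1.0]      # illustrative logits
rewards = [1.0, 0.0, 3.0]     # illustrative per-action returns
g_score = score_function_gradient(theta, rewards)
g_fd = finite_difference_gradient(theta, rewards)
print(g_score)
print(g_fd)
```

In real training the expectation is replaced by an average over sampled rollouts, which is where the variance discussed next comes from.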
This estimator is correct but high-variance — it credits every action with the whole trajectory's return, including luck it had nothing to do with. Three refinements sharpen it, each keeping the gradient unbiased.
An action \(a_t\) cannot affect rewards collected before it: \(\mathbb{E}[\nabla\log\pi_\theta(a_t\mid s_t)\, r_{t'}] = 0\) for \(t' < t\). Dropping those past rewards leaves the expectation unchanged but cuts variance. Replace \(R(\tau)\) with the reward-to-go \(G_t = \sum_{t'\ge t} r_{t'}\):
\[ \nabla_\theta J = \mathbb{E}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\;G_t\right] \]

For any \(b(s_t)\) depending on the state but not the action, its contribution is exactly zero (the policy sums to 1, so its gradient sums to 0):

\[ \mathbb{E}_{a_t\sim\pi_\theta}\!\big[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,b(s_t)\big] = b(s_t)\,\nabla_\theta\!\!\sum_{a_t}\pi_\theta(a_t\mid s_t) = b(s_t)\,\nabla_\theta 1 = 0 \]

So we may subtract a baseline for free (unbiased), and a good one slashes variance: \(\nabla_\theta J = \mathbb{E}[\sum_t \nabla\log\pi_\theta(a_t\mid s_t)\,(G_t - b(s_t))]\).
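A quick numerical check of the zero-expectation claim, again assuming a softmax policy over three actions in a single state (an illustrative setup; the logits and the baseline value are arbitrary):

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def baseline_term(theta, b):
    """E_a[ grad log pi(a) * b ], computed exactly by enumerating actions."""
    pi = softmax(theta)
    n = len(theta)
    return [
        sum(pi[a] * ((1.0 if a == k else 0.0) - pi[k]) * b for a in range(n))
        for k in range(n)
    ]

# Any action-independent baseline contributes nothing in expectation.
term = baseline_term([0.2, -0.5, 1.0], b=7.0)
print(term)  # every component is zero up to floating-point rounding
```

The baseline changes individual sampled gradient terms, only their average stays the same, which is why it can reduce variance without introducing bias.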
The natural baseline is the expected return from the state, the value function \(V(s_t) = \mathbb{E}[G_t\mid s_t]\). Then \(G_t - V(s_t)\) measures how much better \(a_t\) did than average from \(s_t\) — exactly the advantage, since \(Q(s_t,a_t) = \mathbb{E}[G_t\mid s_t,a_t]\):
\[ \hat A_t \approx G_t - V(s_t), \qquad A_t = Q(s_t,a_t) - V(s_t) \]

Substituting yields the form used throughout RLHF:

\[ \nabla_\theta J(\theta) = \mathbb{E}\!\left[ \sum_{t} \hat A_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \]

Intuition: push up the log-probability of actions that did better than expected (\(\hat A_t > 0\)), push down those that did worse. Subtracting \(V\) strips out the "this state was just generically good/bad" component, leaving a signal about the action. The advantage, not the raw reward, is the teaching signal, and §5 shows how \(\hat A_t\) is actually estimated.
§4 left us needing to estimate \(\hat A_t\). We have a learned value function \(V_\psi(s)\) — a second head (the critic) predicting expected return from \(s\) — and sampled rewards. The whole design is a bias–variance trade-off: how much to trust real sampled rewards versus the imperfect bootstrap \(V\).
Monte Carlo uses the actual return-to-go, \(\hat A_t = G_t - V(s_t)\) with \(G_t = \sum_{l\ge 0}\gamma^l r_{t+l}\): unbiased (only real rewards) but high-variance (a long sum of random rewards). One-step TD bootstraps after a single step:
\[ \hat A_t^{(1)} = r_t + \gamma\, V_\psi(s_{t+1}) - V_\psi(s_t) \;\triangleq\; \delta_t \]

Here \(r_t\) is the per-step reward, which in RLHF is the per-token \(R_t\) of §3. This estimate has low variance (one random reward) but is biased (it leans entirely on \(V_\psi\)). The TD error \(\delta_t\) is the building block for everything below.
Between the extremes, take \(n\) real rewards then bootstrap. Larger \(n\) means less bias, more variance:
\[ \hat A_t^{(n)} = \sum_{l=0}^{n-1}\gamma^l r_{t+l} + \gamma^n V_\psi(s_{t+n}) - V_\psi(s_t) \]

The intermediate value terms telescope, so the \(n\)-step advantage is exactly a discounted sum of TD errors. For \(n=2\), the \(+\gamma V(s_{t+1})\) and \(-\gamma V(s_{t+1})\) cancel:

\[ \delta_t + \gamma\delta_{t+1} = r_t + \gamma r_{t+1} + \gamma^2 V_\psi(s_{t+2}) - V_\psi(s_t) = \hat A_t^{(2)} \]

In general \(\hat A_t^{(n)} = \sum_{l=0}^{n-1}\gamma^l\,\delta_{t+l}\).
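The telescoping identity is easy to verify on numbers. The following sketch uses a made-up four-step trajectory (the rewards, value predictions, and \(\gamma = 0.9\) are illustrative assumptions) and compares the direct \(n\)-step advantage with the discounted sum of TD errors.

```python
def td_errors(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); values has one more entry than rewards."""
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]

def n_step_advantage(rewards, values, gamma, t, n):
    """n real rewards, then bootstrap from V(s_{t+n}), minus the baseline V(s_t)."""
    ret = sum(gamma ** l * rewards[t + l] for l in range(n))
    return ret + gamma ** n * values[t + n] - values[t]

# Illustrative trajectory: 4 rewards and 5 value predictions V(s_0) .. V(s_4).
rewards = [0.0, -0.1, 0.5, 1.0]
values = [0.4, 0.3, 0.8, 0.9, 0.0]
gamma = 0.9

deltas = td_errors(rewards, values, gamma)
t, n = 0, 3
direct = n_step_advantage(rewards, values, gamma, t, n)
from_deltas = sum(gamma ** l * deltas[t + l] for l in range(n))
print(direct, from_deltas)  # the two agree up to rounding
```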
Rather than pick one \(n\), Generalized Advantage Estimation takes an exponentially-weighted average of every \(n\)-step estimator with weight \((1-\lambda)\lambda^{n-1}\):
\[ \hat A_t^{\text{GAE}(\gamma,\lambda)} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\,\hat A_t^{(n)} \]

Collect the coefficient of each \(\delta_{t+l}\): it appears in every estimator with \(n>l\), giving \((1-\lambda)\gamma^l(\lambda^l+\lambda^{l+1}+\cdots) = (\gamma\lambda)^l\), so the whole thing collapses to a single discounted sum:

\[ \hat A_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty}(\gamma\lambda)^l\,\delta_{t+l} \]
The infinite sum is never formed directly — it satisfies a one-line recursion run backward from the end of the trajectory, an \(O(T)\) pass:
\[ \hat A_t = \delta_t + \gamma\lambda\,\hat A_{t+1} \]

The same pass yields the critic's regression target, the TD(\(\lambda\)) return \(\hat R_t = \hat A_t + V_\psi(s_t)\), so policy and critic train together:
\[ \mathcal{L}_{\text{VF}}(\psi) = \mathbb{E}_t\!\left[ \big(V_\psi(s_t) - \hat R_t\big)^2 \right] \]

Vanilla policy gradients are unstable: one large step can collapse the policy. PPO keeps each update close to the policy that collected the data, \(\pi_{\theta_{\text{old}}}\), via a clipped importance ratio. Define the probability ratio:
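\[ \rho_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)} \]

The clipped surrogate objective, to be maximized, is:

\[ \mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\big[\mathcal{L}^{\text{CLIP}}_t\big], \qquad \mathcal{L}^{\text{CLIP}}_t = \min\!\big(\rho_t(\theta)\,\hat A_t,\;\; \mathrm{clip}\big(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat A_t\big) \]

The \(\min\) of the clipped and unclipped terms means: once the ratio moves more than \(\epsilon\) (e.g. 0.2) in the direction that would help, the objective flattens, removing the incentive to take a huge step. It is a cheap, first-order way to enforce a trust region.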
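To make the flattening concrete, here is a minimal per-token sketch of the clipped term \(\min(\rho_t\hat A_t,\ \mathrm{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\,\hat A_t)\). The function name `clipped_surrogate` and the inputs are illustrative, and the ratio is formed from log-probabilities, which is an assumption about how the inputs are stored.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Per-token PPO term: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: raising the probability beyond 1 + eps earns nothing extra.
up_good = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)     # capped at 1.2 * A
# Positive advantage, probability lowered: not clipped, the full penalty applies.
down_good = clipped_surrogate(math.log(0.5), 0.0, advantage=1.0)   # 0.5 * A
# Negative advantage: lowering the probability below 1 - eps earns nothing extra.
down_bad = clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0)   # capped at 0.8 * A
# Negative advantage, probability raised: not clipped.
up_bad = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)     # 1.5 * A

print(up_good, down_good, down_bad, up_bad)
```

The clip is one-sided in effect: it only removes the gain from moving further in the helpful direction, while moves in the harmful direction are still penalized in full.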
The complete per-step loss adds the value-function regression and an entropy bonus \(S\) that encourages exploration:
\[ \mathcal{L}_{\text{PPO}} = \mathbb{E}_t\!\left[ -\,\mathcal{L}^{\text{CLIP}}_t \;+\; c_1\, \big(V_\psi(s_t) - \hat R_t\big)^2 \;-\; c_2\, S\big[\pi_\theta\big](s_t) \right] \]

with coefficients \(c_1, c_2\). (Signs are written for minimization: maximize the clipped surrogate and entropy, minimize value error.)
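Stage 3 repeats the following until the policy converges or the KL budget is exhausted:

1. Sample a batch of prompts \(x\) and generate completions \(y\sim\pi_\theta(\cdot\mid x)\).
2. Score each completion with \(r_\phi\) and form the per-token rewards \(R_t\) (terminal RM reward minus the per-token KL penalty, §3).
3. Evaluate \(V_\psi\) on every state and compute \(\hat A_t\) and \(\hat R_t\) with GAE (§5).
4. Take several gradient steps on \(\mathcal{L}_{\text{PPO}}\), then set \(\theta_{\text{old}} \leftarrow \theta\).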
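A minimal sketch of the reward and advantage bookkeeping inside one such iteration, for a single three-token completion. All numbers are made up, the model calls that would produce the log-probabilities, RM score, and value predictions are omitted, and \(\beta = 0.1\), \(\gamma = 1\), \(\lambda = 0.95\) are illustrative assumptions rather than recommended settings.

```python
def per_token_rewards(rm_score, logp_policy, logp_ref, beta):
    """R_t = 1[t = T] * r_phi(x, y) - beta * (log pi_theta - log pi_ref) per token."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score  # the reward-model score lands on the final token only
    return rewards

def gae(rewards, values, gamma, lam):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1}.

    values holds V(s_0) .. V(s_T), one more entry than rewards;
    the last entry is the value after the final token (0 for a finished episode).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Illustrative per-token log-probs of the sampled tokens under each model.
logp_policy = [-1.0, -0.5, -2.0]
logp_ref = [-1.2, -0.9, -1.5]
rewards = per_token_rewards(rm_score=1.5, logp_policy=logp_policy,
                            logp_ref=logp_ref, beta=0.1)

values = [0.2, 0.4, 0.7, 0.0]  # illustrative critic outputs, terminal value 0
advantages = gae(rewards, values, gamma=1.0, lam=0.95)
returns = [a + v for a, v in zip(advantages, values[:-1])]  # critic targets R_hat_t

print(rewards)
print(advantages)
print(returns)
```

Summing the per-token rewards recovers the sequence-level effective reward \(r_\phi - \beta\log\frac{\pi_\theta}{\pi_{\text{ref}}}\), and setting \(\lambda = 0\) or \(\lambda = 1\) in `gae` recovers the one-step TD and Monte Carlo estimates respectively.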
Standard RLHF (PPO) and DPO optimize the identical KL-regularized objective of §3. They differ in how.
| | Standard RLHF (PPO) | DPO |
|---|---|---|
| Reward model | Explicit network \(r_\phi\) | Implicit: \(\beta\log\frac{\pi_\theta}{\pi_{\text{ref}}}\) |
| Optimization | Online RL, sampling rollouts | Offline supervised loss on pairs |
| Models in memory | Policy, critic, reference, RM | Policy + reference |
| Objective solved | Directly, via PPO | In closed form (no RL loop) |
| Stability | Sensitive; many knobs | Stable; a single loss |
| Trade-off | Can use online data & any reward | Tied to the preference dataset |
See the DPO derivation for how the explicit reward and the PPO loop are algebraically eliminated.
| Quantity | Formula |
|---|---|
| Reward-model loss | \(-\mathbb{E}[\log\sigma(r_w - r_l)]\) |
| RLHF objective | \(\max\,\mathbb{E}[r_\phi] - \beta\,\mathrm{KL}(\pi_\theta\Vert\pi_{\text{ref}})\) |
| Effective reward | \(r_\phi - \beta\log\frac{\pi_\theta}{\pi_{\text{ref}}}\) |
| Policy gradient | \(\mathbb{E}[\sum_t \hat A_t \nabla\log\pi_\theta(a_t\mid s_t)]\) |
| GAE advantage | \(\hat A_t = \sum_l (\gamma\lambda)^l \delta_{t+l}\) |
| PPO clipped surrogate | \(\mathbb{E}[\min(\rho_t\hat A_t,\,\mathrm{clip}(\rho_t)\hat A_t)]\) |

| Stage | What it learns | Loss / objective |
|---|---|---|
| 1 · SFT | \(\pi_{\text{ref}}\) | Cross-entropy on demonstrations |
| 2 · Reward model | \(r_\phi\) | \(-\mathbb{E}[\log\sigma(r_\phi(x,y_w) - r_\phi(x,y_l))]\) |
| 3 · RL (PPO) | \(\pi_\theta\), \(V_\psi\) | \(-\mathcal{L}^{\text{CLIP}} + c_1\mathcal{L}_{\text{VF}} - c_2 S\) |