RLHF — The Math of Reinforcement Learning from Human Feedback

Notation

Symbol	Description
\(x,\; y\)	Prompt and a full completion
\(y_w,\; y_l\)	Preferred (winner) and dispreferred (loser) completion in a comparison
\(\pi_\theta\)	Policy being trained (the LM), parameters \(\theta\)
\(\pi_{\text{ref}}\)	Frozen reference policy (the SFT model)
\(r_\phi(x,y)\)	Learned reward model, parameters \(\phi\); outputs a scalar score
\(\beta\)	KL penalty coefficient
\(s_t,\; a_t\)	State (prompt + tokens so far) and action (next token) at step \(t\)
\(V_\psi(s)\)	Value function (critic), parameters \(\psi\)
\(\hat A_t\)	Estimated advantage at step \(t\)
\(\gamma,\;\lambda\)	Discount and GAE smoothing factors
\(\sigma\)	Logistic sigmoid \(\sigma(z) = 1/(1+e^{-z})\)

Unsure about any symbol below — \(\mathbb{E}\), \(\sim\), \(\mid\), \(\mathrm{KL}\), \(\nabla\)? See the math notation reference.

1. The Three-Stage Pipeline

RLHF (as in InstructGPT) aligns a pretrained language model to human preferences in three sequential stages. Each stage produces the input for the next.

Stage 1 · SFT: Supervised fine-tuning on demonstrations. Produces \(\pi_{\text{ref}}\).
→ Stage 2 · Reward Model: Fit \(r_\phi\) to human preference comparisons (§2).
→ Stage 3 · RL (PPO): Optimize \(\pi_\theta\) against \(r_\phi\) with a KL leash (§3–7).

This page focuses on the math of Stages 2 and 3. Stage 1 is ordinary cross-entropy fine-tuning — see the cross-entropy page.

2. Reward Model

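Absolute quality scores from human raters tend to be noisy and poorly calibrated, whereas pairwise comparisons are easier to give consistently, so the training data consists of comparisons. The reward model is a scalar head on a transformer that maps \((x,y)\) to a single number, trained so that preferred completions score higher. Preferences are modeled with Bradley–Terry: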

\[ p(y_w \succ y_l \mid x) = \frac{\exp r_\phi(x,y_w)}{\exp r_\phi(x,y_w) + \exp r_\phi(x,y_l)} = \sigma\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big) \]

Bradley–Terry gives the probability of a single comparison. To turn it into a training loss we fit \(\phi\) by maximum likelihood over the whole comparison dataset \(\mathcal{D}\), in three steps.

Step 1 — Likelihood of the dataset

Each example in \(\mathcal{D}\) is already labeled so that \(y_w\) is the human-preferred answer. Assuming examples are independent, the probability of observing all the labels is the product of the per-comparison probabilities:

\[ \mathcal{P}(\phi) = \prod_{(x,y_w,y_l)\in\mathcal{D}} p(y_w \succ y_l \mid x) = \prod_{(x,y_w,y_l)\in\mathcal{D}} \sigma\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big) \]

Step 2 — Take the log

Maximizing \(\mathcal{P}\) is the same as maximizing \(\log\mathcal{P}\) (the log is monotonic), and the log turns the product into a sum — easier to optimize and numerically stable:

\[ \log \mathcal{P}(\phi) = \sum_{(x,y_w,y_l)\in\mathcal{D}} \log \sigma\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big) \]

Step 3 — Negate and average

Flip the sign to turn "maximize log-likelihood" into "minimize a loss," and average over the dataset (the \(\tfrac{1}{|\mathcal{D}|}\sum\) is written as an expectation \(\mathbb{E}_{\mathcal{D}}\)):

Reward model loss

\[ \mathcal{L}_{\text{RM}}(\phi) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}} \Big[\, \log \sigma\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big) \Big] \]

Reading the loss. The 2-way softmax above is exactly a sigmoid of the reward gap, \(\sigma(r_\phi(x,y_w) - r_\phi(x,y_l))\), so this is just binary logistic regression with the gap as its single logit and "the winner wins" as the label. Taking \(-\log\) of that probability and averaging over the dataset is maximum likelihood: minimizing it drives \(r_\phi(x,y_w)\) up and \(r_\phi(x,y_l)\) down until the preferred completion is confidently ranked higher. The gradient with respect to the gap has magnitude \(1 - \sigma(\text{gap})\), so a pair already ranked correctly by a wide margin contributes almost no gradient, while a pair ranked backwards contributes a large one.

Only reward differences matter, so \(r_\phi\) is identified up to an additive constant; implementations usually normalize it to zero mean. This is the same Bradley–Terry likelihood that DPO later reuses — the difference is where the reward lives (a separate network here, the policy itself in DPO).
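The loss and the two remarks above can be checked numerically. The following is a minimal pure-Python sketch, not a training implementation: the function names (`log_sigmoid`, `rm_loss`, `gap_gradient`) and all reward values are made up for illustration, with plain floats standing in for the outputs of \(r_\phi\).

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def rm_loss(pairs):
    """Mean of -log sigma(r_w - r_l) over (r_w, r_l) reward pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)

def gap_gradient(r_w: float, r_l: float) -> float:
    """Derivative of -log sigma(gap) w.r.t. the gap: -(1 - sigma(gap))."""
    gap = r_w - r_l
    return -(1.0 - 1.0 / (1.0 + math.exp(-gap)))

# Hypothetical reward-model outputs for three comparisons.
pairs = [(2.0, -1.0), (0.5, 0.0), (-1.0, 1.5)]
loss = rm_loss(pairs)

# Adding the same constant to every reward leaves the loss unchanged.
shifted = [(r_w + 7.0, r_l + 7.0) for r_w, r_l in pairs]
loss_shifted = rm_loss(shifted)

# A confidently correct pair gives a tiny gradient, a reversed pair a large one.
g_correct = gap_gradient(4.0, -4.0)
g_reversed = gap_gradient(-4.0, 4.0)
print(loss, loss_shifted, g_correct, g_reversed)
```

The shift invariance is why the reward scale is only pinned down by a separate normalization step.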

3. The RL Objective

With \(r_\phi\) frozen, Stage 3 trains the policy to maximize reward while staying close to the reference model. The KL leash stops the policy from drifting into degenerate text that fools the reward model (reward hacking):

KL-regularized RL objective

\[ \max_{\pi_\theta}\; \mathbb{E}_{x\sim\mathcal{D}} \Big[\, \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)} \big[\, r_\phi(x,y) \,\big] \;-\; \beta\, \mathrm{KL}\!\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big) \Big] \]

This is the same objective DPO starts from — DPO solves it in closed form, whereas standard RLHF optimizes it directly with reinforcement learning. The rest of this page is how that direct optimization works.

Folding the KL into a per-token reward

In practice the KL is not computed exactly, because written as a sum, \(\mathrm{KL} = \sum_y \pi_\theta(y\mid x)\log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}\), it ranges over every possible completion \(y\) — \(|\text{vocab}|^{\text{length}}\) sequences, far too many to enumerate (the same wall as \(Z(x)\) on the DPO page).

But the very same quantity is an expectation, \(\mathrm{KL} = \mathbb{E}_{y\sim\pi_\theta}\big[\log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}\big]\), and RLHF is already sampling \(y\sim\pi_\theta\) to generate rollouts. So the log-ratio on that one sampled sequence is a free, unbiased Monte-Carlo estimate of the KL. Absorbing it into the reward gives the effective reward:

\[ R(x,y) = r_\phi(x,y) \;-\; \beta \log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} \]

Because generation is autoregressive, \(\pi(y\mid x) = \prod_t \pi(a_t\mid s_t)\), and \(\log\) of a product is a sum — so the sequence log-ratio splits exactly into per-token log-ratios, \(\log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} = \sum_t \log\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{ref}}(a_t\mid s_t)}\). The reward model then contributes a single terminal reward at the final token, while the KL penalty becomes a dense per-token shaping reward:

\[ R_t = \underbrace{\mathbf{1}[t = T]\, r_\phi(x,y)}_{\text{terminal RM reward}} \;-\; \underbrace{\beta\, \log\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{ref}}(a_t\mid s_t)}}_{\text{per-token KL penalty}} \]

Framed as RL: each generated token is an action \(a_t\), the state \(s_t\) is the prompt plus tokens generated so far, and an episode is one full completion. The reward is sparse (one RM score at the end) plus a dense KL penalty at every token. A dense penalty also aids credit assignment — it flags which tokens drifted, instead of one lump sum at the end.
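As a small illustration of the per-token reward \(R_t\), the sketch below builds the reward sequence for one sampled completion and checks that it sums back to the sequence-level \(R(x,y)\). The helper name `shaped_rewards` and all numbers are hypothetical; the inputs are assumed to be the log-probabilities of the sampled tokens under each model.

```python
def shaped_rewards(rm_score, logp_policy, logp_ref, beta):
    """Per-token rewards: KL penalty at every token, RM score added at the last."""
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-prob lists must have the same length")
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score  # terminal reward at t = T
    return rewards

# Hypothetical log-probs of the four sampled tokens under each model.
logp_policy = [-0.5, -1.0, -0.2, -2.0]
logp_ref = [-0.7, -1.0, -0.9, -1.5]
beta, rm_score = 0.1, 1.3

rewards = shaped_rewards(rm_score, logp_policy, logp_ref, beta)

# Summing R_t recovers the sequence-level effective reward R(x, y).
total = sum(rewards)
seq_log_ratio = sum(logp_policy) - sum(logp_ref)
effective = rm_score - beta * seq_log_ratio
print(rewards, total, effective)
```

Tokens where the policy assigns more probability than the reference are penalized, tokens where it assigns less receive a small bonus, and tokens where the two agree contribute nothing.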

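The single-sample log-ratio estimate of the KL is unbiased, but it is high-variance and can be negative for an individual sample, which a true KL cannot be. GRPO swaps in the non-negative estimator \(u-\log u-1\), with \(u = \pi_{\text{ref}}(a_t\mid s_t)/\pi_\theta(a_t\mid s_t)\), which is also unbiased under samples from \(\pi_\theta\) and typically has lower variance.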
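To make the comparison concrete, the sketch below evaluates both estimators exactly on a toy three-outcome distribution instead of sampling. The distributions `p` (playing the role of \(\pi_\theta\)) and `q` (playing the role of \(\pi_{\text{ref}}\)) are invented for the example, and \(u\) is taken to be \(q/p\).

```python
import math

def expectation(p, values):
    """Exact expectation of `values` under the discrete distribution p."""
    return sum(pi * v for pi, v in zip(p, values))

p = [0.5, 0.3, 0.2]  # toy stand-in for the policy
q = [0.4, 0.4, 0.2]  # toy stand-in for the reference

# Plain log-ratio estimator: log(p/q) evaluated at each outcome.
k1 = [math.log(pi / qi) for pi, qi in zip(p, q)]
# u - log(u) - 1 with u = q/p, the non-negative alternative.
k3 = [(qi / pi) - math.log(qi / pi) - 1.0 for pi, qi in zip(p, q)]

kl = expectation(p, k1)        # the true KL(p || q)
mean_k3 = expectation(p, k3)   # matches the true KL as well
var_k1 = expectation(p, [v * v for v in k1]) - kl ** 2
var_k3 = expectation(p, [v * v for v in k3]) - mean_k3 ** 2
print(kl, mean_k3, var_k1, var_k3)
```

On this particular pair of distributions both estimators have the same mean, the log-ratio takes a negative value on one outcome, and the second estimator has the smaller variance. That ordering is observed here, not guaranteed for every pair of distributions.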

4. Policy Gradient Foundations

How do you maximize an expected reward when the thing you sample from, \(\pi_\theta\), is what you're differentiating? The policy gradient theorem (the log-derivative / REINFORCE trick) gives the answer. A trajectory is \(\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)\) with return \(R(\tau)\), and the objective is the expected return:

\[ J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)] = \sum_\tau \pi_\theta(\tau)\, R(\tau) \]

Step 1 — Differentiate

Only the trajectory probability depends on \(\theta\) (the return is just a number once \(\tau\) is fixed), so:

\[ \nabla_\theta J(\theta) = \sum_\tau \nabla_\theta \pi_\theta(\tau)\, R(\tau) \]

This is not yet an expectation: the sum is weighted by \(\nabla_\theta \pi_\theta(\tau)\) rather than by \(\pi_\theta(\tau)\), so it cannot be estimated by averaging over sampled trajectories. We must get it back to the form \(\sum_\tau \pi_\theta(\tau)[\cdots]\).

Step 2 — The log-derivative trick

From \(\nabla_\theta \log \pi_\theta(\tau) = \tfrac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)}\), rearrange to \(\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\) and substitute:

\[ \nabla_\theta J(\theta) = \sum_\tau \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\,R(\tau) = \mathbb{E}_{\tau\sim\pi_\theta}\!\big[R(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\big] \]

An expectation under \(\pi_\theta\) again — and we already sample \(\tau\sim\pi_\theta\) during rollouts, so it is now estimable.

Step 3 — Only the policy terms survive

The trajectory probability factorizes into the initial state, the per-step policy, and the environment dynamics:

\[ \pi_\theta(\tau) = p(s_0)\prod_{t} \pi_\theta(a_t\mid s_t)\, P(s_{t+1}\mid s_t, a_t) \]

Taking \(\log\) (product → sum) then \(\nabla_\theta\): the initial-state term \(p(s_0)\) and the dynamics \(P(s_{t+1}\mid s_t,a_t)\) do not depend on \(\theta\) — the agent controls neither — so they vanish, leaving only the policy:

\[ \nabla_\theta \log \pi_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t) \]

Combining gives the policy gradient theorem (REINFORCE form):

\[ \nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\;R(\tau)\right] \]

Each action's score function \(\nabla\log\pi_\theta(a_t\mid s_t)\) points in the direction that makes \(a_t\) more likely; weighting it by the return reinforces actions from good trajectories and suppresses those from bad ones.
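The theorem can be sanity-checked on the smallest possible case. The sketch below assumes a one-step episode (a bandit) with a softmax policy over three actions, so the expectation can be summed exactly rather than sampled, and compares the score-function gradient with a finite-difference gradient of \(J\). The logits, rewards and function names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_return(logits, rewards):
    """J(theta) = sum_a pi(a) * R(a) for a one-step episode."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def score_function_gradient(logits, rewards):
    """E_a[ R(a) * grad log pi(a) ], summed exactly over the actions."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for k in range(len(logits)):
            # For a softmax policy: d log pi(a) / d logit_k = 1[a == k] - pi(k)
            dlogp = (1.0 if a == k else 0.0) - probs[k]
            grad[k] += p_a * r_a * dlogp
    return grad

def finite_difference_gradient(logits, rewards, eps=1e-6):
    grad = []
    for k in range(len(logits)):
        up, down = list(logits), list(logits)
        up[k] += eps
        down[k] -= eps
        diff = expected_return(up, rewards) - expected_return(down, rewards)
        grad.append(diff / (2 * eps))
    return grad

logits = [0.2, -0.4, 1.0]
rewards = [1.0, 0.0, 2.5]
pg = score_function_gradient(logits, rewards)
fd = finite_difference_gradient(logits, rewards)
print(pg, fd)
```

In RLHF the sum over actions is replaced by an average over sampled completions, which is where the variance discussed next comes from.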

From return to advantage

This estimator is correct but high-variance — it credits every action with the whole trajectory's return, including luck it had nothing to do with. Three refinements sharpen it, each keeping the gradient unbiased.

A — Reward-to-go (causality)

An action \(a_t\) cannot affect rewards collected before it: \(\mathbb{E}[\nabla\log\pi_\theta(a_t\mid s_t)\, r_{t'}] = 0\) for \(t' < t\). Dropping those past rewards leaves the expectation unchanged but cuts variance. Replace \(R(\tau)\) with the reward-to-go \(G_t = \sum_{t'\ge t} r_{t'}\):

\[ \nabla_\theta J = \mathbb{E}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\;G_t\right] \]

B — Subtract a baseline

For any \(b(s_t)\) depending on the state but not the action, its contribution is exactly zero (the policy sums to 1, so its gradient sums to 0):

\[ \mathbb{E}_{a_t\sim\pi_\theta}\!\big[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,b(s_t)\big] = b(s_t)\,\nabla_\theta\!\!\sum_{a_t}\pi_\theta(a_t\mid s_t) = b(s_t)\,\nabla_\theta 1 = 0 \]

So we may subtract a baseline for free (unbiased), and a good one slashes variance: \(\nabla_\theta J = \mathbb{E}[\sum_t \nabla\log\pi_\theta(a_t\mid s_t)\,(G_t - b(s_t))]\).
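A toy calculation shows both halves of this claim. The sketch below again assumes a one-step episode with a softmax policy, here uniform over three actions with rewards that share a large common offset, and computes the exact mean and total variance of the per-sample gradient estimate with and without a baseline. All names and numbers are illustrative.

```python
def score(a, probs):
    """Gradient of log pi(a) w.r.t. softmax logits: one-hot(a) - probs."""
    return [(1.0 if k == a else 0.0) - p for k, p in enumerate(probs)]

def gradient_stats(probs, rewards, baseline):
    """Exact mean and total variance of the estimator score(a) * (r_a - b)."""
    mean = [0.0] * len(probs)
    second_moment = 0.0
    for a, p_a in enumerate(probs):
        g = [s * (rewards[a] - baseline) for s in score(a, probs)]
        mean = [m + p_a * gk for m, gk in zip(mean, g)]
        second_moment += p_a * sum(gk * gk for gk in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

probs = [1 / 3, 1 / 3, 1 / 3]
rewards = [10.0, 11.0, 12.0]
value = sum(p * r for p, r in zip(probs, rewards))  # V = E[r], the baseline

mean_raw, var_raw = gradient_stats(probs, rewards, baseline=0.0)
mean_base, var_base = gradient_stats(probs, rewards, baseline=value)
print(mean_raw, var_raw)
print(mean_base, var_base)
```

The two means agree, so the baseline adds no bias, while the variance drops by more than two orders of magnitude in this example because the baseline removes the common offset of about 11 from every reward.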

C — Baseline = value ⟹ advantage

The natural baseline is the expected return from the state, the value function \(V(s_t) = \mathbb{E}[G_t\mid s_t]\). Then \(G_t - V(s_t)\) measures how much better \(a_t\) did than average from \(s_t\). Since \(Q(s_t,a_t) = \mathbb{E}[G_t\mid s_t,a_t]\), its expectation given \((s_t,a_t)\) is exactly the advantage \(A_t\), so it serves as a single-sample estimate \(\hat A_t\):

\[ \hat A_t = G_t - V(s_t), \qquad A_t = \mathbb{E}\big[\hat A_t \mid s_t, a_t\big] = Q(s_t,a_t) - V(s_t) \]

Substituting yields the form used throughout RLHF:

\[ \nabla_\theta J(\theta) = \mathbb{E}\!\left[ \sum_{t} \hat A_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \]

Intuition: push up the log-probability of actions that did better than expected (\(\hat A_t > 0\)), push down those that did worse. Subtracting \(V\) strips out the "this state was just generically good/bad" component, leaving a signal about the action. The advantage, not the raw reward, is the teaching signal — and §5 is how \(\hat A_t\) is actually estimated.

5. Value Function and Advantage

§4 left us needing to estimate \(\hat A_t\). We have a learned value function \(V_\psi(s)\) — a second head (the critic) predicting expected return from \(s\) — and sampled rewards. The whole design is a bias–variance trade-off: how much to trust real sampled rewards versus the imperfect bootstrap \(V\).

The two extremes

Monte Carlo uses the actual return-to-go, \(\hat A_t = G_t - V(s_t)\) with \(G_t = \sum_{l\ge 0}\gamma^l r_{t+l}\): unbiased (only real rewards) but high-variance (a long sum of random rewards). One-step TD bootstraps after a single step:

\[ \hat A_t^{(1)} = r_t + \gamma\, V_\psi(s_{t+1}) - V_\psi(s_t) \;\triangleq\; \delta_t \]

Low variance (one random reward) but biased (leans entirely on \(V\)). Here \(r_t\) is the per-step reward, which in RLHF is the shaped per-token reward \(R_t\) of §3. This TD error \(\delta_t\) is the building block for everything below.

Step 1 — The n-step ladder

Between the extremes, take \(n\) real rewards then bootstrap. Larger \(n\) means less bias, more variance:

\[ \hat A_t^{(n)} = \sum_{l=0}^{n-1}\gamma^l r_{t+l} + \gamma^n V_\psi(s_{t+n}) - V_\psi(s_t) \]

Step 2 — n-step advantage = sum of TD errors

The intermediate value terms telescope, so the \(n\)-step advantage is exactly a discounted sum of TD errors. For \(n=2\), the \(+\gamma V(s_{t+1})\) and \(-\gamma V(s_{t+1})\) cancel:

\[ \delta_t + \gamma\delta_{t+1} = r_t + \gamma r_{t+1} + \gamma^2 V_\psi(s_{t+2}) - V_\psi(s_t) = \hat A_t^{(2)} \]

In general \(\hat A_t^{(n)} = \sum_{l=0}^{n-1}\gamma^l\,\delta_{t+l}\).
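The telescoping identity is easy to confirm on a short trajectory. In the sketch below, the rewards and critic values are arbitrary numbers chosen for illustration, and `values` is assumed to carry one extra entry for the state after the last reward.

```python
def td_errors(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); len(values) == len(rewards) + 1."""
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]

def n_step_advantage(rewards, values, gamma, t, n):
    """n real rewards from step t, then bootstrap from V(s_{t+n})."""
    ret = sum(gamma ** l * rewards[t + l] for l in range(n))
    return ret + gamma ** n * values[t + n] - values[t]

rewards = [0.0, -0.1, 0.3, 1.0]
values = [0.5, 0.4, 0.9, 0.7, 0.0]  # V(s_0)..V(s_4); terminal state valued at 0
gamma = 0.99
deltas = td_errors(rewards, values, gamma)

# For every n, the direct n-step estimate equals the discounted sum of TD errors.
for n in range(1, len(rewards) + 1):
    direct = n_step_advantage(rewards, values, gamma, t=0, n=n)
    from_deltas = sum(gamma ** l * deltas[l] for l in range(n))
    assert abs(direct - from_deltas) < 1e-12
print(deltas)
```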

Step 3 — GAE averages over all n

Rather than pick one \(n\), Generalized Advantage Estimation takes an exponentially-weighted average of every \(n\)-step estimator with weight \((1-\lambda)\lambda^{n-1}\):

\[ \hat A_t^{\text{GAE}(\gamma,\lambda)} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\,\hat A_t^{(n)} \]

Collecting the coefficient of each \(\delta_{t+l}\) — it appears in every estimator with \(n>l\), giving \((1-\lambda)\gamma^l(\lambda^l+\lambda^{l+1}+\cdots) = (\gamma\lambda)^l\) — the whole thing collapses to a single discounted sum:

GAE

\[ \hat A_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l} \]

\(\lambda = 0\) keeps only the \(l=0\) term, recovering one-step TD \(\hat A_t = \delta_t\) (low variance, higher bias); \(\lambda = 1\) gives \(\sum_l\gamma^l\delta_{t+l}\), which telescopes to the Monte-Carlo \(G_t - V(s_t)\) (unbiased, high variance). \(\gamma\) is the reward discount (part of the problem); \(\lambda\) is the variance-reduction knob. RLHF typically uses \(\lambda \approx 0.95,\ \gamma \approx 1\).
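The collapse from a weighted average of \(n\)-step estimators to a single discounted sum can also be verified numerically. The sketch below assumes a finite episode in which TD errors past the end are zero, truncates the infinite average at a large `n_max`, and uses made-up TD errors.

```python
def gae_as_average(deltas, gamma, lam, n_max=2000):
    """(1 - lam) * sum_n lam^(n-1) * A^(n) at t = 0, truncated at n_max.

    A^(n) is accumulated from TD errors; past the end of the episode the
    TD errors are taken to be zero, so A^(n) stops changing.
    """
    total, a_n = 0.0, 0.0
    for n in range(1, n_max + 1):
        if n <= len(deltas):
            a_n += gamma ** (n - 1) * deltas[n - 1]
        total += (1.0 - lam) * lam ** (n - 1) * a_n
    return total

def gae_as_sum(deltas, gamma, lam):
    """sum_l (gamma * lam)^l * delta_l at t = 0."""
    return sum((gamma * lam) ** l * d for l, d in enumerate(deltas))

deltas = [0.2, -0.5, 0.1, 0.7]  # hypothetical TD errors delta_0..delta_3
gamma, lam = 0.99, 0.9

avg_form = gae_as_average(deltas, gamma, lam)
sum_form = gae_as_sum(deltas, gamma, lam)
one_step = gae_as_sum(deltas, gamma, 0.0)      # lambda = 0: just delta_0
monte_carlo = gae_as_sum(deltas, gamma, 1.0)   # lambda = 1: sum of gamma^l * delta_l
print(avg_form, sum_form, one_step, monte_carlo)
```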

Computing it, and training the critic

The infinite sum is never formed directly. It satisfies a one-line recursion run backward from the end of the trajectory, starting from \(\hat A_{T+1} = 0\), in a single \(O(T)\) pass:

\[ \hat A_t = \delta_t + \gamma\lambda\,\hat A_{t+1} \]

The same pass yields the critic's regression target, the TD(\(\lambda\)) return \(\hat R_t = \hat A_t + V_\psi(s_t)\), so policy and critic train together:

\[ \mathcal{L}_{\text{VF}}(\psi) = \mathbb{E}_t\!\left[ \big(V_\psi(s_t) - \hat R_t\big)^2 \right] \]
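A minimal sketch of the backward pass and the critic targets follows. It assumes a single finished episode, with `values` holding the critic's predictions for each state plus a trailing 0 for the terminal state; the function names and the numbers are illustrative only.

```python
def gae(rewards, values, gamma, lam):
    """Backward pass: A_t = delta_t + gamma * lam * A_{t+1}, with A = 0 past the end.

    Returns (advantages, returns) where returns[t] = advantages[t] + values[t].
    """
    advantages = [0.0] * len(rewards)
    next_adv = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

def value_loss(values, returns):
    """Mean squared error between critic predictions and their targets."""
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / len(returns)

rewards = [-0.02, 0.0, -0.07, 1.35]  # e.g. shaped per-token rewards
values = [0.8, 0.9, 1.0, 1.2, 0.0]   # hypothetical critic outputs, then terminal 0

adv, ret = gae(rewards, values, gamma=1.0, lam=0.95)
adv_mc, _ = gae(rewards, values, gamma=1.0, lam=1.0)  # Monte-Carlo limit
adv_td, _ = gae(rewards, values, gamma=1.0, lam=0.0)  # one-step TD limit
print(adv, ret, value_loss(values, ret))
```

In a real training loop the targets `ret` are treated as constants when the critic is updated, so gradients flow only through \(V_\psi(s_t)\).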

6. The PPO Clipped Objective

Vanilla policy gradients are unstable: one large step can collapse the policy. PPO keeps each update close to the policy that collected the data, \(\pi_{\theta_{\text{old}}}\), via a clipped importance ratio. Define the probability ratio:

\[ \rho_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)} \]

PPO clipped surrogate

\[ \mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\Big( \rho_t(\theta)\,\hat A_t,\;\; \mathrm{clip}\big(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat A_t \Big) \right] \]

The \(\min\) of the clipped and unclipped terms means that once the ratio has moved more than \(\epsilon\) (e.g. 0.2) away from 1 in the direction that would improve the objective, the objective flattens and the gradient for that sample vanishes, removing the incentive to take a huge step. Moves in the harmful direction are never clipped. It is a cheap, first-order approximation to a trust region.
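The behaviour of the clipped surrogate is easiest to see by evaluating it at a few ratios. The sketch below is a per-sample scalar version with \(\epsilon = 0.2\); the function name and test values are chosen for illustration.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: raising the ratio past 1 + eps earns nothing extra.
up_small = clipped_surrogate(1.1, advantage=2.0)  # inside the clip range
up_large = clipped_surrogate(1.5, advantage=2.0)  # capped at (1 + eps) * A
up_huge = clipped_surrogate(3.0, advantage=2.0)   # same cap

# Negative advantage: lowering the ratio below 1 - eps earns nothing extra.
down_large = clipped_surrogate(0.5, advantage=-2.0)  # capped at (1 - eps) * A

# Moving the wrong way is not clipped, so the objective keeps getting worse.
wrong_way = clipped_surrogate(1.5, advantage=-2.0)
print(up_small, up_large, up_huge, down_large, wrong_way)
```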

Full PPO loss

The complete per-step loss adds the value-function regression and an entropy bonus \(S\) that encourages exploration:

\[ \mathcal{L}_{\text{PPO}} = \mathbb{E}_t\!\left[ -\,\mathcal{L}^{\text{CLIP}}_t \;+\; c_1\, \big(V_\psi(s_t) - \hat R_t\big)^2 \;-\; c_2\, S\big[\pi_\theta\big](s_t) \right] \]

with coefficients \(c_1, c_2\). (Signs are written for minimization: maximize the clipped surrogate and entropy, minimize value error.)
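Putting the three terms together, a scalar sketch of the full loss might look like the following. The dictionary keys, the function names, and the coefficient values `c1=0.5` and `c2=0.01` are assumptions made for this example, not values prescribed by the text; each sample stands for one token position.

```python
import math

def entropy(probs):
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ppo_loss(samples, eps=0.2, c1=0.5, c2=0.01):
    """Average of -L_CLIP + c1 * value error - c2 * entropy over token samples."""
    total = 0.0
    for s in samples:
        ratio = math.exp(s["logp_new"] - s["logp_old"])
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        surrogate = min(ratio * s["advantage"], clipped * s["advantage"])
        value_error = (s["value"] - s["ret"]) ** 2
        total += -surrogate + c1 * value_error - c2 * entropy(s["probs"])
    return total / len(samples)

# One hypothetical token: policy unchanged since rollout, critic exactly on target.
sample = {
    "logp_new": -1.0, "logp_old": -1.0, "advantage": 1.0,
    "value": 0.3, "ret": 0.3, "probs": [0.5, 0.5],
}
on_policy_loss = ppo_loss([sample])

# The same token with a critic that is off by 1.0 adds c1 * 1.0 to the loss.
worse_critic = ppo_loss([dict(sample, value=1.3)])
print(on_policy_loss, worse_critic)
```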

7. The Full RLHF Loop

Stage 3 repeats the following until the policy converges or the KL budget is exhausted:

Rollout — sample prompts \(x\sim\mathcal{D}\), generate completions \(y\sim\pi_\theta(\cdot\mid x)\).
Score — compute the terminal reward \(r_\phi(x,y)\) and the per-token KL penalty to form \(R_t\) (§3).
Estimate — run the critic \(V_\psi\) and compute advantages \(\hat A_t\) via GAE (§5).
Optimize — take several PPO epochs of minibatch updates on \(\mathcal{L}_{\text{PPO}}\) (§6), then refresh \(\pi_{\theta_{\text{old}}}\).

This requires four models in memory: the policy \(\pi_\theta\), the critic \(V_\psi\), the frozen reference \(\pi_{\text{ref}}\), and the reward model \(r_\phi\). That cost and instability are precisely the pain DPO removes.

8. RLHF vs. DPO

Both optimize the identical KL-regularized objective of §3. They differ in how.

	Standard RLHF (PPO)	DPO
Reward model	Explicit network \(r_\phi\)	Implicit: \(\beta\log\frac{\pi_\theta}{\pi_{\text{ref}}}\)
Optimization	Online RL, sampling rollouts	Offline supervised loss on pairs
Models in memory	Policy, critic, reference, RM	Policy + reference
Objective solved	Directly, via PPO	In closed form (no RL loop)
Stability	Sensitive; many knobs	Stable; a single loss
Trade-off	Can use online data & any reward	Tied to the preference dataset

See the DPO derivation for how the explicit reward and the PPO loop are algebraically eliminated.

9. Summary

Quantity	Formula
RM loss	\(-\mathbb{E}[\log\sigma(r_w - r_l)]\)
RL objective	\(\max\,\mathbb{E}[r_\phi] - \beta\,\mathrm{KL}(\pi_\theta\|\pi_{\text{ref}})\)
Effective reward	\(r_\phi - \beta\log\frac{\pi_\theta}{\pi_{\text{ref}}}\)
Policy gradient	\(\mathbb{E}[\sum_t \hat A_t \nabla\log\pi_\theta(a_t\mid s_t)]\)
Advantage (GAE)	\(\hat A_t = \sum_l (\gamma\lambda)^l \delta_{t+l}\)
PPO surrogate	\(\mathbb{E}[\min(\rho_t\hat A_t,\,\mathrm{clip}(\rho_t,\,1-\epsilon,\,1+\epsilon)\hat A_t)]\)

Stage	What it learns	Loss / objective
1 · SFT	\(\pi_{\text{ref}}\)	Cross-entropy on demonstrations
2 · Reward model	\(r_\phi\)	\(-\mathbb{E}[\log\sigma(r_\phi(x,y_w) - r_\phi(x,y_l))]\)
3 · RL (PPO)	\(\pi_\theta\), \(V_\psi\)	\(-\mathcal{L}^{\text{CLIP}} + c_1\mathcal{L}_{\text{VF}} - c_2 S\) (minimized)
