How DPO turns the RLHF reward-modeling-plus-RL pipeline into a single supervised classification loss on preference pairs — and why the reward model and partition function quietly disappear.
| Symbol | Description |
|---|---|
| \(x\) | Prompt / context |
| \(y_w,\; y_l\) | Preferred (winner) and dispreferred (loser) completions for \(x\) |
| \(\pi_\theta(y\mid x)\) | Policy being trained (the language model), parameters \(\theta\) |
| \(\pi_{\text{ref}}(y\mid x)\) | Frozen reference policy (typically the SFT model) |
| \(r(x,y)\) | Latent reward / scoring function |
| \(\beta\) | Temperature controlling deviation from \(\pi_{\text{ref}}\) (KL strength) |
| \(Z(x)\) | Partition function normalizing the optimal policy |
| \(\sigma(t)\) | Logistic sigmoid \(1/(1+e^{-t})\) |
| \(\mathcal{D}\) | Dataset of preference triples \((x, y_w, y_l)\) |
Classic RLHF trains a reward model \(r(x,y)\) from human preferences, then optimizes the policy to maximize reward while staying close to the reference model. The KL penalty prevents reward hacking and keeps generations fluent:
\[ \max_{\pi}\; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi(\cdot\mid x)}\big[\,r(x,y)\,\big] \;-\; \beta\, \mathrm{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big) \]
DPO's key observation: this objective has a closed-form optimum, and that optimum lets us express the reward in terms of the policy itself, eliminating the separate RL loop.
Writing the objective per-prompt and expanding the KL term:
\[ \max_{\pi}\; \mathbb{E}_{y\sim\pi}\!\left[\, r(x,y) - \beta \log\frac{\pi(y\mid x)}{\pi_{\text{ref}}(y\mid x)} \right] \]
We now massage this into a single KL divergence in four small steps.
Scaling the objective by the positive constant \(1/\beta\) does not change which \(\pi\) is optimal. Dividing through, then negating to turn the \(\max\) into a \(\min\):
\[ \min_{\pi}\; \mathbb{E}_{y\sim\pi}\!\left[\, \log\frac{\pi(y\mid x)}{\pi_{\text{ref}}(y\mid x)} - \frac{1}{\beta}\,r(x,y) \right] \]
Write the reward term as a logarithm, \(\tfrac{1}{\beta} r(x,y) = \log e^{\,r(x,y)/\beta}\), and combine the two logs into one ratio:
\[ \min_{\pi}\; \mathbb{E}_{y\sim\pi}\!\left[\, \log\frac{\pi(y\mid x)}{\pi_{\text{ref}}(y\mid x)\,e^{\,r(x,y)/\beta}} \right] \]
The denominator \(\pi_{\text{ref}}\,e^{r/\beta}\) is not yet a probability distribution (it doesn't sum to 1). Define the partition function that normalizes it, \(Z(x) = \sum_y \pi_{\text{ref}}(y\mid x)\,e^{r(x,y)/\beta}\), and divide the denominator by it. Since \(\pi_{\text{ref}}\,e^{r/\beta} = Z(x)\cdot\big[\tfrac{1}{Z(x)}\pi_{\text{ref}}\,e^{r/\beta}\big]\), this introduces a \(-\log Z(x)\) term, which leaves the expectation because it does not depend on \(y\):
\[ \min_{\pi}\; \mathbb{E}_{y\sim\pi}\!\left[ \log\frac{\pi(y\mid x)}{\tfrac{1}{Z(x)}\pi_{\text{ref}}(y\mid x)\,e^{r(x,y)/\beta}} \right] - \log Z(x) \]
Name the denominator \(\pi^*(y\mid x) \triangleq \tfrac{1}{Z(x)}\pi_{\text{ref}}(y\mid x)\,e^{r(x,y)/\beta}\). Two facts make Step 4 work.
(a) \(\pi^*\) is a genuine probability distribution — a Gibbs (Boltzmann) distribution. It is non-negative (a probability \(\pi_{\text{ref}}\ge 0\) times a positive exponential), and it sums to 1 precisely because \(Z(x)\) was defined to normalize it:
\[ \sum_y \pi^*(y\mid x) = \frac{1}{Z(x)}\sum_y \pi_{\text{ref}}(y\mid x)\,e^{r(x,y)/\beta} = \frac{Z(x)}{Z(x)} = 1 \]
That is the entire role of the partition function: it turns the un-normalized weight \(\pi_{\text{ref}}\,e^{r/\beta}\) (the reference distribution exponentially re-weighted by reward) into something that sums to one.
(b) The bracketed term matches the definition of KL exactly. Recall \(\mathrm{KL}(p\,\|\,q) = \mathbb{E}_{y\sim p}\!\big[\log\tfrac{p(y)}{q(y)}\big]\). Our bracket is an expectation over \(y\sim\pi\) of \(\log\tfrac{\pi}{\pi^*}\) — the same template with \(p=\pi\), \(q=\pi^*\):
\[ \mathbb{E}_{y\sim\pi}\!\left[\log\frac{\pi(y\mid x)}{\pi^*(y\mid x)}\right] = \mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi^*(\cdot\mid x)\big) \]
Since \(\log Z(x)\) depends only on \(x\) (not on \(\pi\)), it is a constant for the minimization and drops out, leaving:
\[ \min_{\pi}\; \mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi^*(\cdot\mid x)\big) \]
A KL divergence satisfies \(\mathrm{KL}(p\,\|\,q)\ge 0\) with equality if and only if \(p=q\). So the minimizer is simply \(\pi = \pi^*\): the closest \(\pi\) can get to the Gibbs distribution is to be it. This gives the optimal policy:
\[ \pi^*(y\mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y\mid x)\,e^{r(x,y)/\beta} \]
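The closed form \(\pi^* \propto \pi_{\text{ref}}\,e^{r/\beta}\) can be sanity-checked numerically. The sketch below is illustrative only: it assumes a toy space of three completions with made-up reference probabilities and rewards, builds the Gibbs distribution, and compares the KL-regularized objective at \(\pi^*\) against randomly drawn policies. The function names `gibbs_policy` and `objective` are hypothetical helpers, not part of any library.

```python
import math
import random

def gibbs_policy(ref, rewards, beta):
    """Return pi*(y) = ref(y) * exp(r(y)/beta) / Z together with Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

def objective(pi, ref, rewards, beta):
    """KL-regularized objective for one prompt: E_pi[r] - beta * KL(pi || ref)."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, ref) if p > 0)
    return expected_reward - beta * kl

# Toy numbers, assumed purely for illustration.
ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
beta = 0.5

pi_star, z = gibbs_policy(ref, rewards, beta)
best = objective(pi_star, ref, rewards, beta)

# Draw random policies over the same three completions and record how far
# each one falls short of the objective value achieved by pi*.
rng = random.Random(0)
gaps = []
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in ref]
    total = sum(raw)
    pi = [v / total for v in raw]
    gaps.append(best - objective(pi, ref, rewards, beta))

print("pi* =", [round(p, 4) for p in pi_star])
print("objective at pi* =", best)
print("beta * log Z     =", beta * math.log(z))
print("smallest gap to a random policy =", min(gaps))
```

Under these assumptions every gap is non-negative (it equals \(\beta\,\mathrm{KL}(\pi\,\|\,\pi^*)\)), and the objective at \(\pi^*\) coincides with \(\beta\log Z(x)\), which is the constant that dropped out of the minimization.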
Human preferences are modeled with Bradley–Terry: the probability that \(y_w\) is preferred over \(y_l\) is a sigmoid of the reward difference.
\[ p(y_w \succ y_l \mid x) = \frac{\exp r(x,y_w)}{\exp r(x,y_w) + \exp r(x,y_l)} = \sigma\big(r(x,y_w) - r(x,y_l)\big) \]
A standard reward model is fit by maximizing the likelihood of the observed preferences. DPO instead substitutes a reward expressed through the policy.
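A small sketch can confirm that the two forms of the Bradley–Terry probability agree, and that only the reward difference matters. The reward values and the shift of 7.0 are arbitrary illustrative assumptions; `bt_softmax` and `bt_sigmoid` are hypothetical helper names.

```python
import math

def sigmoid(t):
    """Logistic sigmoid 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def bt_softmax(r_w, r_l):
    """Bradley-Terry probability written as a two-way softmax over rewards."""
    return math.exp(r_w) / (math.exp(r_w) + math.exp(r_l))

def bt_sigmoid(r_w, r_l):
    """The same probability written as a sigmoid of the reward difference."""
    return sigmoid(r_w - r_l)

# Illustrative rewards for a winner and a loser completion.
r_w, r_l = 1.3, 0.4

p_soft = bt_softmax(r_w, r_l)
p_sig = bt_sigmoid(r_w, r_l)

# Adding the same constant to both rewards leaves the probability unchanged,
# because only r_w - r_l enters the sigmoid.
p_shift = bt_sigmoid(r_w + 7.0, r_l + 7.0)

print(p_soft, p_sig, p_shift)
```

The shift invariance shown in the last call is the property that later makes any reward term depending only on \(x\) irrelevant to the preference probability.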
Solve the optimal-policy equation for the reward. Taking \(\log\) of both sides of \(\pi^*(y\mid x) = \tfrac{1}{Z(x)}\pi_{\text{ref}}(y\mid x)\,e^{r(x,y)/\beta}\) gives:
\[ \log \pi^*(y\mid x) = \log \pi_{\text{ref}}(y\mid x) + \frac{r(x,y)}{\beta} - \log Z(x) \]
Solve for \(r(x,y)\):
\[ r(x,y) = \beta \log\frac{\pi^*(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta \log Z(x) \]
We do not know the true reward, so we make the model's own policy \(\pi_\theta\) play the role of \(\pi^*\). Define the implicit reward as the log-ratio against the reference:
\[ r_\theta(x,y) = \beta \log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta \log Z(x) \]
Training \(\pi_\theta\) to satisfy the preferences is then equivalent to fitting this reward, but with no separate reward network.
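The reparameterization can be checked as a round trip on a toy example: start from a reward, build the Gibbs policy, then recover the reward from the log-ratio plus \(\beta\log Z\). The three-completion setup and all numbers below are assumptions made for illustration.

```python
import math

# Toy setup, assumed for illustration only.
beta = 0.5
ref = [0.5, 0.3, 0.2]        # reference policy over three completions
rewards = [1.0, 2.0, 0.0]    # "true" reward for each completion

# Forward direction: reward -> optimal (Gibbs) policy.
weights = [p * math.exp(r / beta) for p, r in zip(ref, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]

# Backward direction: policy -> reward, using
#   r(y) = beta * log(pi*(y) / ref(y)) + beta * log Z.
log_ratio_only = [beta * math.log(p / q) for p, q in zip(pi_star, ref)]
recovered = [lr + beta * math.log(z) for lr in log_ratio_only]

# Without the Z term, the log-ratio differs from the reward by the same
# constant for every completion of this prompt.
offsets = [r - lr for r, lr in zip(rewards, log_ratio_only)]

print("recovered rewards:", recovered)
print("offsets (each should equal beta * log Z):", offsets)
print("beta * log Z =", beta * math.log(z))
```

In this sketch the log-ratio alone determines the reward up to one per-prompt constant, which is why a policy can stand in for a reward model whenever only reward differences are needed.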
Substitute \(r_\theta\) into the Bradley–Terry model. Only the difference of rewards appears, and the \(\beta\log Z(x)\) term is identical for both \(y_w\) and \(y_l\) (it depends only on \(x\)), so it cancels:
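\[ \begin{aligned} r_\theta(x,y_w) - r_\theta(x,y_l) &= \beta \log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} + \cancel{\beta \log Z(x)} \\[6pt] &\quad - \beta \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} - \cancel{\beta \log Z(x)} \\[6pt] &= \beta \log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \end{aligned} \]
Plug the reward difference into \(p(y_w\succ y_l\mid x)=\sigma(\cdot)\) and take the negative log-likelihood over the preference dataset:
\[ \mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta \log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right] \]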
This is just binary classification: a logistic loss that pushes the implicit reward of the winner above that of the loser. It needs only the four log-probabilities per pair — two from the trainable policy and two from the frozen reference — and no sampling, no reward model, and no RL.
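As a minimal sketch of the loss on plain floats, the function below takes the four log-probabilities for a pair and returns \(-\log\sigma(\hat r_w - \hat r_l)\). It assumes each log-probability is the total log-probability of the completion under the given model (for a language model, typically the sum of token log-probabilities); the batch values and the names `dpo_pair_loss` and `dpo_loss` are made up for illustration.

```python
import math

def log_sigmoid(t):
    """Numerically stable log(sigmoid(t)) for any real t."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_w, logp_l:         log pi_theta(y_w | x), log pi_theta(y_l | x)
    ref_logp_w, ref_logp_l: log pi_ref(y_w | x),   log pi_ref(y_l | x)
    """
    r_w = beta * (logp_w - ref_logp_w)   # implicit reward of the winner
    r_l = beta * (logp_l - ref_logp_l)   # implicit reward of the loser
    return -log_sigmoid(r_w - r_l)

def dpo_loss(batch, beta=0.1):
    """Mean DPO loss over (logp_w, logp_l, ref_logp_w, ref_logp_l) tuples."""
    return sum(dpo_pair_loss(*pair, beta=beta) for pair in batch) / len(batch)

# Hypothetical log-probabilities for two preference pairs.
batch = [
    (-12.0, -15.0, -13.0, -14.0),  # policy favors the winner more than the reference does
    (-20.0, -18.0, -19.0, -19.0),  # policy favors the loser more than the reference does
]

for pair in batch:
    print(pair, "->", dpo_pair_loss(*pair))
print("mean loss:", dpo_loss(batch))
```

When the policy equals the reference, both implicit rewards are zero and the per-pair loss is \(\log 2\), the loss of an uninformed binary classifier. In a real training loop the two reference log-probabilities would be computed without gradient tracking.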
Let \(\hat r_w = \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)}\) and \(\hat r_l = \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\) be the implicit rewards. The loss is \(-\log\sigma(\hat r_w - \hat r_l)\): it decreases toward \(0\) as the margin \(\hat r_w - \hat r_l \to +\infty\), equals \(\log 2\) at zero margin, and is large when the model ranks the pair wrong.
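Using \(\frac{d}{dt}\log\sigma(t) = \sigma(-t) = 1-\sigma(t)\) and the chain rule, differentiate the loss for a single pair. Let \(u = \hat r_w - \hat r_l\):
\[ \nabla_\theta\big[-\log\sigma(u)\big] = -\,\sigma(-u)\,\nabla_\theta u = -\,\sigma(\hat r_l - \hat r_w)\,\nabla_\theta u \]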
The reference terms are constant in \(\theta\), so the implicit reward gradients reduce to log-prob gradients:
\[ \nabla_\theta u = \beta\Big(\nabla_\theta \log\pi_\theta(y_w\mid x) - \nabla_\theta \log\pi_\theta(y_l\mid x)\Big) \]
The update raises the log-probability of the winner and lowers that of the loser. The scalar weight \(\sigma(\hat r_l - \hat r_w)\) is the model's error: it is near \(1\) when the implicit reward currently ranks the pair backwards (large correction) and near \(0\) once the ordering is already correct (vanishing update).
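The gradient formula can be checked against finite differences on a toy policy. The sketch assumes the policy is a softmax over three completions parameterized directly by logits \(\theta\) (so \(\nabla_\theta\log\pi_\theta(y) = e_y - \pi_\theta\)), with a uniform reference; this parameterization and all numbers are illustrative assumptions, not how a language model is parameterized.

```python
import math

def log_softmax(logits):
    """Log-probabilities of a softmax over the given logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def margin(theta, ref_logp, w, l, beta):
    """u = r_hat_w - r_hat_l for winner index w and loser index l."""
    logp = log_softmax(theta)
    return beta * ((logp[w] - ref_logp[w]) - (logp[l] - ref_logp[l]))

def pair_loss(theta, ref_logp, w, l, beta):
    return -math.log(sigmoid(margin(theta, ref_logp, w, l, beta)))

def analytic_grad(theta, ref_logp, w, l, beta):
    """-beta * sigmoid(r_hat_l - r_hat_w) * (grad log pi(y_w) - grad log pi(y_l))."""
    probs = [math.exp(v) for v in log_softmax(theta)]
    weight = sigmoid(-margin(theta, ref_logp, w, l, beta))
    grad = []
    for k in range(len(theta)):
        dlogp_w = (1.0 if k == w else 0.0) - probs[k]  # d log pi(y_w) / d theta_k
        dlogp_l = (1.0 if k == l else 0.0) - probs[k]  # d log pi(y_l) / d theta_k
        grad.append(-beta * weight * (dlogp_w - dlogp_l))
    return grad

def numeric_grad(theta, ref_logp, w, l, beta, eps=1e-6):
    """Central finite differences of the pair loss with respect to theta."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        diff = pair_loss(up, ref_logp, w, l, beta) - pair_loss(down, ref_logp, w, l, beta)
        grad.append(diff / (2 * eps))
    return grad

theta = [0.2, -0.5, 1.0]                  # policy logits (illustrative)
ref_logp = log_softmax([0.0, 0.0, 0.0])   # uniform reference policy
w, l, beta = 0, 2, 0.5                    # completion 0 preferred over completion 2

g_analytic = analytic_grad(theta, ref_logp, w, l, beta)
g_numeric = numeric_grad(theta, ref_logp, w, l, beta)
print("analytic:", g_analytic)
print("numeric: ", g_numeric)
```

In this toy parameterization the loss gradient is negative on the winner's logit and positive on the loser's, so a gradient-descent step raises the former and lowers the latter, while the logit of the uninvolved completion receives no gradient.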
Larger \(\beta\) penalizes deviations from \(\pi_{\text{ref}}\) more strongly (smaller effective steps in the log-ratio); smaller \(\beta\) lets the policy move farther from the reference to satisfy preferences. Typical values are \(\beta \in [0.1, 0.5]\).
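One way to see the role of \(\beta\) is through the optimal policy \(\pi^* \propto \pi_{\text{ref}}\,e^{r/\beta}\) rather than through training dynamics. The sketch below, on an assumed toy reward and reference over three completions, computes \(\mathrm{KL}(\pi^*\,\|\,\pi_{\text{ref}})\) for several values of \(\beta\); the specific \(\beta\) grid is arbitrary and chosen only to make the trend visible.

```python
import math

def gibbs_policy(ref, rewards, beta):
    """pi*(y) proportional to ref(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy setup, assumed for illustration only.
ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
betas = [0.05, 0.1, 0.5, 2.0]

kls = [kl(gibbs_policy(ref, rewards, b), ref) for b in betas]
for b, d in zip(betas, kls):
    print(f"beta={b:<5} KL(pi* || pi_ref) = {d:.4f}")
```

For this toy problem the divergence shrinks as \(\beta\) grows: with small \(\beta\) the optimal policy concentrates on the highest-reward completion, and with large \(\beta\) it stays close to the reference.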
Optimal policy: \(\pi^* \propto \pi_{\text{ref}}\,e^{r/\beta}\)
Implicit reward: \(\hat r = \beta\log\frac{\pi_\theta}{\pi_{\text{ref}}}\,(+\,\beta\log Z)\)
Preference probability: \(p(y_w\succ y_l)=\sigma(\hat r_w-\hat r_l)\)
Per-pair loss: \(-\log\sigma(\hat r_w-\hat r_l)\)
Gradient weight: \(\sigma(\hat r_l-\hat r_w)\), high when the pair is ranked wrong
Partition function \(Z(x)\): cancels in the reward difference
| Quantity | Formula |
|---|---|
| RLHF objective | \(\max_\pi \mathbb{E}[r] - \beta\,\mathrm{KL}(\pi\,\|\,\pi_{\text{ref}})\) |
| Optimal policy | \(\pi^*(y\mid x) = \tfrac{1}{Z(x)}\pi_{\text{ref}}(y\mid x)\,e^{r(x,y)/\beta}\) |
| Reward (reparam.) | \(r_\theta(x,y) = \beta\log\tfrac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta\log Z(x)\) |
| Reward difference | \(\hat r_w - \hat r_l = \beta\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\) |
| DPO loss | \(-\mathbb{E}\big[\log\sigma(\hat r_w - \hat r_l)\big]\) |
| DPO gradient | \(-\beta\,\mathbb{E}\big[\sigma(\hat r_l-\hat r_w)(\nabla\log\pi_\theta(y_w) - \nabla\log\pi_\theta(y_l))\big]\) |