
Direct Preference Optimization

How DPO turns the RLHF reward-modeling-plus-RL pipeline into a single supervised classification loss on preference pairs — and why the reward model and partition function quietly disappear.

Notation

| Symbol | Description |
|---|---|
| \(x\) | Prompt / context |
| \(y_w,\; y_l\) | Preferred (winner) and dispreferred (loser) completions for \(x\) |
| \(\pi_\theta(y\mid x)\) | Policy being trained (the language model), parameters \(\theta\) |
| \(\pi_{\text{ref}}(y\mid x)\) | Frozen reference policy (typically the SFT model) |
| \(r(x,y)\) | Latent reward / scoring function |
| \(\beta\) | Temperature controlling deviation from \(\pi_{\text{ref}}\) (KL strength) |
| \(Z(x)\) | Partition function normalizing the optimal policy |
| \(\sigma(t)\) | Logistic sigmoid \(1/(1+e^{-t})\) |
| \(\mathcal{D}\) | Dataset of preference triples \((x, y_w, y_l)\) |

1. The RLHF Objective

Classic RLHF trains a reward model \(r(x,y)\) from human preferences, then optimizes the policy to maximize reward while staying close to the reference model. The KL penalty discourages reward hacking and helps keep generations fluent:

\[ \max_{\pi}\; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi(\cdot\mid x)}\big[\,r(x,y)\,\big] \;-\; \beta\, \mathrm{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big) \]
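To make the trade-off concrete, here is a minimal sketch that evaluates the objective for a single prompt with only three possible completions. The numbers in `pi_ref` and `reward` and the function name `kl_regularized_objective` are invented for illustration; a real language model has far too many completions to enumerate like this.

```python
import math

def kl_regularized_objective(pi, pi_ref, reward, beta):
    """E_{y~pi}[r(y)] - beta * KL(pi || pi_ref) for one prompt with finitely many y."""
    expected_reward = sum(p * r for p, r in zip(pi, reward))
    # Terms with pi(y) = 0 contribute nothing to the KL sum (0 * log 0 is taken as 0).
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0.0)
    return expected_reward - beta * kl

# Hypothetical toy prompt with three candidate completions.
pi_ref = [0.5, 0.3, 0.2]
reward = [0.0, 1.0, 2.0]
beta = 0.5

# Policy A: do not move at all (zero KL, average reward under pi_ref).
stay_put = kl_regularized_objective(pi_ref, pi_ref, reward, beta)
# Policy B: put all mass on the highest-reward completion (max reward, large KL).
all_in = kl_regularized_objective([0.0, 0.0, 1.0], pi_ref, reward, beta)

print("stay at reference:", stay_put)
print("greedy on reward: ", all_in)
```

Neither extreme is generally the best choice: the first ignores the reward, the second pays the full KL cost.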

DPO's key observation: this objective has a closed-form optimum, and that optimum lets us express the reward in terms of the policy itself — eliminating the separate RL loop.

Closed-form means the optimum can be written as an explicit formula you evaluate directly, rather than one you must reach by an iterative search (as PPO does). Crucially, because the optimum is an explicit equation, we can later rearrange it algebraically to express the reward through the policy.

Closed-form optimal policy

Writing the objective per-prompt and expanding the KL term:

\[ \max_{\pi}\; \mathbb{E}_{y\sim\pi}\!\left[\, r(x,y) - \beta \log\frac{\pi(y\mid x)}{\pi_{\text{ref}}(y\mid x)} \right] \]

We now massage this into a single KL divergence in four small steps.

Step 1 — Divide by \(\beta\) and minimize

Scaling the objective by the positive constant \(1/\beta\) does not change which \(\pi\) is optimal. Dividing through, then negating to turn the \(\max\) into a \(\min\):

\[ \min_{\pi}\; \mathbb{E}_{y\sim\pi}\!\left[\, \log\frac{\pi(y\mid x)}{\pi_{\text{ref}}(y\mid x)} - \frac{1}{\beta}\,r(x,y) \right] \]
Step 2 — Fold the reward into the log

Write the reward term as a logarithm, \(\tfrac{1}{\beta} r(x,y) = \log e^{\,r(x,y)/\beta}\), and combine the two logs into one ratio:

\[ \min_{\pi}\; \mathbb{E}_{y\sim\pi}\!\left[\, \log\frac{\pi(y\mid x)}{\pi_{\text{ref}}(y\mid x)\,e^{\,r(x,y)/\beta}} \right] \]
Step 3 — Normalize the denominator

The denominator \(\pi_{\text{ref}}\,e^{r/\beta}\) is not yet a probability distribution (it doesn't sum to 1). Define the partition function that normalizes it, \(Z(x) = \sum_y \pi_{\text{ref}}(y\mid x)\,e^{r(x,y)/\beta}\), and divide the denominator by it. Since \(\pi_{\text{ref}}\,e^{r/\beta} = Z(x)\cdot\big[\tfrac{1}{Z(x)}\pi_{\text{ref}}\,e^{r/\beta}\big]\), this introduces a \(-\log Z(x)\) term, which leaves the expectation because it does not depend on \(y\):

\[ \min_{\pi}\; \mathbb{E}_{y\sim\pi}\!\left[ \log\frac{\pi(y\mid x)}{\tfrac{1}{Z(x)}\pi_{\text{ref}}(y\mid x)\,e^{r(x,y)/\beta}} \right] - \log Z(x) \]
Step 4 — Recognize the KL divergence

Name the denominator \(\pi^*(y\mid x) \triangleq \tfrac{1}{Z(x)}\pi_{\text{ref}}(y\mid x)\,e^{r(x,y)/\beta}\). Two facts make Step 4 work.

(a) \(\pi^*\) is a genuine probability distribution — a Gibbs (Boltzmann) distribution. It is non-negative (a probability \(\pi_{\text{ref}}\ge 0\) times a positive exponential), and it sums to 1 precisely because \(Z(x)\) was defined to normalize it:

\[ \sum_y \pi^*(y\mid x) = \frac{1}{Z(x)}\sum_y \pi_{\text{ref}}(y\mid x)\,e^{r(x,y)/\beta} = \frac{Z(x)}{Z(x)} = 1 \]

That is the entire role of the partition function: it turns the un-normalized weight \(\pi_{\text{ref}}\,e^{r/\beta}\), the reference distribution exponentially re-weighted by reward, into something that sums to one.

(b) The bracketed term matches the definition of KL exactly. Recall \(\mathrm{KL}(p\,\|\,q) = \mathbb{E}_{y\sim p}\!\big[\log\tfrac{p(y)}{q(y)}\big]\). Our bracket is an expectation over \(y\sim\pi\) of \(\log\tfrac{\pi}{\pi^*}\) — the same template with \(p=\pi\), \(q=\pi^*\):

\[ \mathbb{E}_{y\sim\pi}\!\left[\log\frac{\pi(y\mid x)}{\pi^*(y\mid x)}\right] = \mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi^*(\cdot\mid x)\big) \]

Since \(\log Z(x)\) depends only on \(x\) (not on \(\pi\)), it is a constant for the minimization and drops out, leaving:

\[ \min_{\pi}\; \mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi^*(\cdot\mid x)\big) \]
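As a sanity check on Steps 1 to 4, undoing the division by \(\beta\) and the sign flip from Step 1 rewrites the original per-prompt objective in terms of \(\pi^*\):

\[ \mathbb{E}_{y\sim\pi}\!\left[\, r(x,y) - \beta \log\frac{\pi(y\mid x)}{\pi_{\text{ref}}(y\mid x)} \right] = \beta \log Z(x) \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi^*(\cdot\mid x)\big) \]

The first term on the right does not involve \(\pi\), so maximizing the left-hand side is the same problem as minimizing the KL term.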

A KL divergence satisfies \(\mathrm{KL}(p\,\|\,q)\ge 0\) with equality if and only if \(p=q\). So the minimizer is simply \(\pi = \pi^*\) — the closest \(\pi\) can get to the Gibbs distribution is to be it. This gives the optimal policy:

Optimal policy
\[ \pi^*(y\mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y\mid x)\,\exp\!\Big(\tfrac{1}{\beta} r(x,y)\Big) \]
\(Z(x)\) sums over all possible completions \(y\), so it is intractable to compute. This is exactly why naive RLHF cannot just sample from \(\pi^*\) directly and instead resorts to PPO. DPO sidesteps \(Z(x)\) entirely — watch it cancel in §3.
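On a toy problem where \(Z(x)\) can be summed exactly, the closed form can be checked numerically. The sketch below assumes a hypothetical prompt with three completions (all values invented), builds \(\pi^*\) from the formula, and compares its objective value against randomly drawn policies.

```python
import math
import random

def gibbs_policy(pi_ref, reward, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z, with Z summed explicitly."""
    weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
    z = sum(weights)
    return [w / z for w in weights], z

def objective(pi, pi_ref, reward, beta):
    """E_{y~pi}[r(y)] - beta * KL(pi || pi_ref)."""
    expected_reward = sum(p * r for p, r in zip(pi, reward))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0.0)
    return expected_reward - beta * kl

# Hypothetical toy values; only feasible because there are three completions.
pi_ref = [0.5, 0.3, 0.2]
reward = [0.0, 1.0, 2.0]
beta = 0.5

pi_star, z = gibbs_policy(pi_ref, reward, beta)
best = objective(pi_star, pi_ref, reward, beta)

rng = random.Random(0)

def random_policy(k):
    draws = [rng.random() + 1e-9 for _ in range(k)]  # strictly positive
    total = sum(draws)
    return [d / total for d in draws]

# Smallest advantage of pi* over 1000 random competitors (should never be negative).
worst_gap = min(
    best - objective(random_policy(3), pi_ref, reward, beta) for _ in range(1000)
)

print("pi* =", pi_star)
print("objective at pi*:", best, " beta*log Z:", beta * math.log(z))
print("smallest gap vs random policies:", worst_gap)
```

The objective at \(\pi^*\) coincides with \(\beta\log Z(x)\), which is what the derivation predicts when the KL term to \(\pi^*\) is zero.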

2. The Bradley–Terry Preference Model

Human preferences are modeled with Bradley–Terry: the probability that \(y_w\) is preferred over \(y_l\) is a sigmoid of the reward difference.

\[ p(y_w \succ y_l \mid x) = \frac{\exp r(x,y_w)}{\exp r(x,y_w) + \exp r(x,y_l)} = \sigma\big(r(x,y_w) - r(x,y_l)\big) \]
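The equality between the two-way softmax form and the sigmoid form can be verified directly. This is a small sketch with made-up reward values; the function names are illustrative only.

```python
import math

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def bt_softmax_form(r_w, r_l):
    """exp(r_w) / (exp(r_w) + exp(r_l)), shifted by the max for stability."""
    m = max(r_w, r_l)
    e_w, e_l = math.exp(r_w - m), math.exp(r_l - m)
    return e_w / (e_w + e_l)

def bt_sigmoid_form(r_w, r_l):
    """sigma(r_w - r_l)."""
    return sigmoid(r_w - r_l)

pairs = [(2.0, 0.5), (0.0, 0.0), (-1.0, 3.0)]
for r_w, r_l in pairs:
    print(r_w, r_l, bt_softmax_form(r_w, r_l), bt_sigmoid_form(r_w, r_l))

# Adding the same constant to both rewards leaves the probability unchanged.
shifted = bt_sigmoid_form(2.0 + 7.0, 0.5 + 7.0)
```

The shift invariance in the last line is the property that later lets a prompt-only term drop out of the preference probability.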

A standard reward model is fit by maximizing the likelihood of the observed preferences. DPO instead substitutes a reward expressed through the policy.

3. Reparameterizing the Reward

Solve the optimal-policy equation for the reward by taking logs of both sides and rearranging:

Step A — Invert the optimal policy

Take \(\log\) of \(\pi^*(y\mid x) = \tfrac{1}{Z(x)}\pi_{\text{ref}}(y\mid x)\,e^{r(x,y)/\beta}\):

\[ \log \pi^*(y\mid x) = \log \pi_{\text{ref}}(y\mid x) + \frac{r(x,y)}{\beta} - \log Z(x) \]

Solve for \(r(x,y)\):

\[ r(x,y) = \beta \log\frac{\pi^*(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta \log Z(x) \]
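A quick numerical round trip illustrates the inversion: start from a known reward, build \(\pi^*\), then recover the reward from the formula above. The three-completion setup and its values are assumptions made for the example.

```python
import math

# Hypothetical toy prompt with three completions and a known reward.
pi_ref = [0.5, 0.3, 0.2]
reward = [0.0, 1.0, 2.0]
beta = 0.5

# Forward direction: reward -> optimal policy.
weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
z = sum(weights)
pi_star = [w / z for w in weights]

# Inverse direction: optimal policy -> reward, using
# r = beta * log(pi* / pi_ref) + beta * log Z.
recovered = [
    beta * math.log(p / q) + beta * math.log(z) for p, q in zip(pi_star, pi_ref)
]

print("original reward: ", reward)
print("recovered reward:", recovered)
```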
Step B — Identify reward with the trained policy

We do not know the true reward, so we let the model's own policy \(\pi_\theta\) play the role of \(\pi^*\). Define the implicit reward as \(\beta\) times the log-ratio against the reference, plus the same prompt-only normalizer term:

\[ r_\theta(x,y) = \beta \log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta \log Z(x) \]

Training \(\pi_\theta\) to satisfy the preferences is then equivalent to fitting this reward — but with no separate reward network.

Step C — The partition function cancels

Substitute \(r_\theta\) into the Bradley–Terry model. Only the difference of rewards appears, and the \(\beta\log Z(x)\) term is identical for both \(y_w\) and \(y_l\) (it depends only on \(x\)), so it cancels:

\[ \begin{aligned} r_\theta(x,y_w) - r_\theta(x,y_l) &= \beta \log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} + \cancel{\beta \log Z(x)} \\[6pt] &\quad - \beta \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} - \cancel{\beta \log Z(x)} \\[6pt] &= \beta \log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \end{aligned} \]
The cancellation of the intractable \(Z(x)\) is the crux of DPO. Because Bradley–Terry only ever sees reward differences within the same prompt, the per-prompt normalizer never has to be computed.
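The cancellation can be seen numerically on a toy example. In the sketch below (hypothetical values, three completions), the log-ratio is computed without the \(\beta\log Z(x)\) term; it differs from the true reward by one constant per prompt, and every pairwise difference still matches.

```python
import math

# Hypothetical toy prompt; Z is computed here only to construct pi*.
pi_ref = [0.5, 0.3, 0.2]
reward = [0.0, 1.0, 2.0]
beta = 0.5

weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
z = sum(weights)
pi_star = [w / z for w in weights]

# Log-ratio reward WITHOUT the beta * log Z term.
implicit = [beta * math.log(p / q) for p, q in zip(pi_star, pi_ref)]

# The gap to the true reward is the same for every completion: beta * log Z.
offsets = [r - h for r, h in zip(reward, implicit)]

# Pairwise differences agree, so Bradley-Terry probabilities agree too.
pairs = [(2, 0), (2, 1), (1, 0)]
true_diffs = [reward[w] - reward[l] for w, l in pairs]
implicit_diffs = [implicit[w] - implicit[l] for w, l in pairs]

print("offsets:", offsets, " beta*log Z:", beta * math.log(z))
print("true differences:    ", true_diffs)
print("implicit differences:", implicit_diffs)
```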

4. The DPO Loss

Plug the reward difference into \(p(y_w\succ y_l\mid x)=\sigma(\cdot)\) and take the negative log-likelihood over the preference dataset:

DPO objective
\[ \mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}} \left[ \log \sigma\!\left( \beta \log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \right) \right] \]

This is just binary classification: a logistic loss that pushes the implicit reward of the winner above that of the loser. It needs only the four log-probabilities per pair — two from the trainable policy and two from the frozen reference — and no sampling, no reward model, and no RL.

Reading the loss

Let \(\hat r_w = \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)}\) and \(\hat r_l = \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\) be the implicit rewards. The loss is \(-\log\sigma(\hat r_w - \hat r_l)\): minimized as the margin \(\hat r_w - \hat r_l \to +\infty\), large when the model ranks the pair wrong.
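A minimal sketch of the per-pair loss in plain Python follows. It assumes each argument is the total log-probability of a completion given the prompt (for an autoregressive model, the sum of its token log-probabilities); the function names and example numbers are invented for illustration and do not come from any particular library.

```python
import math

def log_sigmoid(t):
    # Stable log(sigma(t)): avoids overflow for large |t|.
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def dpo_pair_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigma(r_w - r_l) with implicit rewards r = beta * (log pi_theta - log pi_ref)."""
    r_w = beta * (policy_logp_w - ref_logp_w)
    r_l = beta * (policy_logp_l - ref_logp_l)
    return -log_sigmoid(r_w - r_l)

def dpo_batch_loss(batch, beta=0.1):
    """Mean loss over (policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l) tuples."""
    return sum(dpo_pair_loss(*pair, beta=beta) for pair in batch) / len(batch)

# Policy identical to the reference: margin is 0, so the loss is log 2.
at_init = dpo_pair_loss(-13.0, -14.0, -13.0, -14.0)

# Policy has raised the winner and lowered the loser relative to the reference.
ranked_right = dpo_pair_loss(-12.0, -15.0, -13.0, -14.0)

# Policy has moved the wrong way.
ranked_wrong = dpo_pair_loss(-14.0, -13.0, -13.0, -14.0)

mean_loss = dpo_batch_loss(
    [(-12.0, -15.0, -13.0, -14.0), (-14.0, -13.0, -13.0, -14.0)]
)
print(at_init, ranked_right, ranked_wrong, mean_loss)
```

Note that the loss depends on the raw log-probabilities only through how far the policy has moved from the reference on each completion, not on their absolute values.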

5. Gradient

Using \(\frac{d}{dt}\log\sigma(t) = \sigma(-t) = 1-\sigma(t)\) and the chain rule, differentiate the loss for a single pair. Let \(u = \hat r_w - \hat r_l\):

Step A — Differentiate the log-sigmoid
\[ \nabla_\theta\big[-\log\sigma(u)\big] = -\,\sigma(-u)\,\nabla_\theta u = -\,\sigma(\hat r_l - \hat r_w)\,\nabla_\theta u \]

The reference terms are constant in \(\theta\), so the implicit reward gradients reduce to log-prob gradients:

\[ \nabla_\theta u = \beta\Big(\nabla_\theta \log\pi_\theta(y_w\mid x) - \nabla_\theta \log\pi_\theta(y_l\mid x)\Big) \]
DPO gradient
\[ \nabla_\theta \mathcal{L}_{\text{DPO}} = -\,\beta\, \mathbb{E}_{\mathcal{D}}\Big[\, \underbrace{\sigma(\hat r_l - \hat r_w)}_{\text{weight}} \big(\nabla_\theta \log\pi_\theta(y_w\mid x) - \nabla_\theta \log\pi_\theta(y_l\mid x)\big)\Big] \]

The update raises the log-probability of the winner and lowers that of the loser. The scalar weight \(\sigma(\hat r_l - \hat r_w)\) is the model's error: it is near \(1\) when the implicit reward currently ranks the pair backwards (large correction) and near \(0\) once the ordering is already correct (vanishing update).
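The gradient formula can be checked against finite differences on a deliberately tiny model. The sketch assumes a softmax policy over three completions whose parameters \(\theta\) are just the logits, so that \(\nabla_\theta \log\pi_\theta(y)\) is the one-hot vector of \(y\) minus the probability vector. This stand-in policy and all numbers are assumptions for illustration, not a language model.

```python
import math

def log_softmax(theta):
    m = max(theta)
    lse = m + math.log(sum(math.exp(t - m) for t in theta))
    return [t - lse for t in theta]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def pair_loss(theta, ref_logp, w, l, beta):
    """Returns (-log sigma(u), u) with u = r_w - r_l."""
    logp = log_softmax(theta)
    u = beta * ((logp[w] - ref_logp[w]) - (logp[l] - ref_logp[l]))
    return -math.log(sigmoid(u)), u

def analytic_grad(theta, ref_logp, w, l, beta):
    """-beta * sigma(r_l - r_w) * (grad log pi(y_w) - grad log pi(y_l))."""
    probs = [math.exp(v) for v in log_softmax(theta)]
    _, u = pair_loss(theta, ref_logp, w, l, beta)
    weight = sigmoid(-u)  # sigma(r_l - r_w)
    grad = []
    for k in range(len(theta)):
        dlogp_w = (1.0 if k == w else 0.0) - probs[k]
        dlogp_l = (1.0 if k == l else 0.0) - probs[k]
        grad.append(-beta * weight * (dlogp_w - dlogp_l))
    return grad

def numeric_grad(theta, ref_logp, w, l, beta, eps=1e-6):
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        diff = pair_loss(up, ref_logp, w, l, beta)[0] - pair_loss(down, ref_logp, w, l, beta)[0]
        grad.append(diff / (2.0 * eps))
    return grad

theta = [0.2, -0.1, 0.4]                 # hypothetical policy logits
ref_logp = log_softmax([0.0, 0.0, 0.0])  # uniform reference
w, l, beta = 0, 2, 0.5

g_analytic = analytic_grad(theta, ref_logp, w, l, beta)
g_numeric = numeric_grad(theta, ref_logp, w, l, beta)
max_err = max(abs(a - b) for a, b in zip(g_analytic, g_numeric))
print(g_analytic, g_numeric, max_err)
```

In this toy setting the loss gradient is negative on the winner's logit and positive on the loser's, so a gradient-descent step raises the former and lowers the latter.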

This adaptive weighting helps keep DPO training stable. Pairs the model already orders correctly contribute almost no gradient, so optimization concentrates on the pairs it still gets wrong, much like the \(p-y\) residual in the gradient of the cross-entropy loss.

Role of \(\beta\)

Larger \(\beta\) penalizes deviations from \(\pi_{\text{ref}}\) more strongly (smaller effective steps in the log-ratio); smaller \(\beta\) lets the policy move farther from the reference to satisfy preferences. Typical values are \(\beta \in [0.1, 0.5]\).
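One way to see the effect of \(\beta\) is through the closed-form optimum \(\pi^* \propto \pi_{\text{ref}}\,e^{r/\beta}\) from §1. The sketch below uses the same kind of hypothetical three-completion prompt (values invented) and measures how far \(\pi^*\) sits from the reference as \(\beta\) grows. It illustrates the idealized optimum, not the behavior of any particular training run.

```python
import math

def gibbs_policy(pi_ref, reward, beta):
    # Subtracting the max reward avoids overflow; the constant cancels on normalization.
    m = max(reward)
    weights = [q * math.exp((r - m) / beta) for q, r in zip(pi_ref, reward)]
    z = sum(weights)
    return [w / z for w in weights]

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

pi_ref = [0.5, 0.3, 0.2]
reward = [0.0, 1.0, 2.0]

betas = [0.1, 0.5, 2.0, 10.0]
kls = [kl(gibbs_policy(pi_ref, reward, b), pi_ref) for b in betas]

for b, d in zip(betas, kls):
    print(f"beta={b:5.1f}  pi*={gibbs_policy(pi_ref, reward, b)}  KL(pi*||pi_ref)={d:.4f}")
```

With small \(\beta\) the optimum concentrates on the highest-reward completion; with large \(\beta\) it stays close to \(\pi_{\text{ref}}\).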

6. Summary

Optimal policy: \(\pi^* \propto \pi_{\text{ref}}\,e^{r/\beta}\)

Implicit reward: \(\hat r = \beta\log\frac{\pi_\theta}{\pi_{\text{ref}}}\,(+\,\beta\log Z)\)

Preference model: \(p(y_w\succ y_l)=\sigma(\hat r_w-\hat r_l)\)

Loss: \(-\log\sigma(\hat r_w-\hat r_l)\)

Gradient weight: \(\sigma(\hat r_l-\hat r_w)\) (high when ranked wrong)

Why no \(Z(x)\): it cancels in the reward difference

| Quantity | Formula |
|---|---|
| RLHF objective | \(\max_\pi \mathbb{E}[r] - \beta\,\mathrm{KL}(\pi\,\Vert\,\pi_{\text{ref}})\) |
| Optimal policy | \(\pi^*(y\mid x) = \tfrac{1}{Z(x)}\pi_{\text{ref}}(y\mid x)\,e^{r(x,y)/\beta}\) |
| Reward (reparam.) | \(r(x,y) = \beta\log\tfrac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta\log Z(x)\) |
| Reward difference | \(\hat r_w - \hat r_l = \beta\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\) |
| DPO loss | \(-\mathbb{E}\big[\log\sigma(\hat r_w - \hat r_l)\big]\) |
| DPO gradient | \(-\beta\,\mathbb{E}\big[\sigma(\hat r_l-\hat r_w)(\nabla\log\pi_\theta(y_w) - \nabla\log\pi_\theta(y_l))\big]\) |