The single principle behind cross entropy, the language-model token loss, DPO, and least squares: make the observed data as probable as possible — then take the negative log so it becomes a loss to minimize.
Most supervised losses are not separate inventions. They are the same recipe — maximize the probability the model assigns to what actually happened — applied to different output distributions. That recipe is the negative log-likelihood (NLL). Learn it once and cross entropy, binary log-loss, mean-squared error, and the DPO loss all become the same object wearing different clothes.
| Symbol | Description |
|---|---|
| \(x_i\) | Input / context of the \(i\)-th example |
| \(y_i\) | Observed (true) output of the \(i\)-th example |
| \(\theta\) | Model parameters being trained |
| \(p_\theta(y\mid x)\) | Probability the model assigns to output \(y\) given \(x\) |
| \(L(\theta)\) | Likelihood — probability of the whole dataset under the model |
| \(\mathcal{L}_{\text{NLL}}(\theta)\) | Negative log-likelihood — the loss we minimize |
| \(n\) | Number of training examples |
Start with one number: how probable does the model think the observed dataset is? Assuming the \(n\) examples are independent, the probability of seeing all of them together is the product of the per-example probabilities — this product is the likelihood:
\[ L(\theta) = \prod_{i=1}^{n} p_\theta(y_i \mid x_i) \]
A good model is one that makes the data it actually saw look probable, so we want to maximize \(L\). Three small moves turn this into a loss a gradient optimizer can chew on.
A product of thousands of probabilities (each \(< 1\)) underflows to zero in floating point and is awkward to differentiate. The log turns the product into a sum (see the notation page), which is numerically stable and has simple term-by-term derivatives:
\[ \log L(\theta) = \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i) \]
The log is strictly increasing, so whatever \(\theta\) maximizes \(L\) also maximizes \(\log L\): we have not changed the answer, only the arithmetic.
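The underflow problem is easy to reproduce. The sketch below is only an illustration: the 2000 examples and the probability 0.5 on each are made-up numbers, chosen so the product drops below the smallest representable double.

```python
import math

# Hypothetical data: 2000 examples, each given probability 0.5 by the model.
probs = [0.5] * 2000

# Naive likelihood: the running product falls below the smallest float64
# and becomes exactly 0.0.
likelihood = 1.0
for p in probs:
    likelihood *= p

# Log-likelihood: the same quantity written as a sum of logs stays finite.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)        # 0.0
print(log_likelihood)    # roughly -1386.29, i.e. 2000 * log(0.5)
```

Once the product is 0.0 no information about \(\theta\) is left in it, while the sum of logs still distinguishes a better model from a worse one.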
Optimizers minimize by convention. Flip the sign so "maximize log-likelihood" becomes "minimize negative log-likelihood":
\[ \mathcal{L}_{\text{NLL}}(\theta) = -\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i) \]
Summing makes the loss scale with dataset size; dividing by \(n\) gives a per-example number that is comparable across batch sizes:
\[ \mathcal{L}_{\text{NLL}}(\theta) = -\frac{1}{n}\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i) \]
"For each example, look up the probability the model gave to the answer that actually occurred, take its log, negate it, and average. Driving this down forces the model to put high probability on the truth."
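The averaged formula translates almost word for word into code. This is a minimal sketch: the function name `mean_nll` and the two probability lists are invented for illustration, and the input is assumed to already hold the probability the model put on each observed answer.

```python
import math

def mean_nll(probs_on_truth):
    """Average negative log-likelihood, given p_theta(y_i | x_i) for each example."""
    return -sum(math.log(p) for p in probs_on_truth) / len(probs_on_truth)

# Hypothetical probabilities two models assigned to the observed answers.
confident = [0.9, 0.8, 0.95]
hesitant = [0.4, 0.3, 0.5]

print(mean_nll(confident))   # smaller: more mass on what actually happened
print(mean_nll(hesitant))    # larger
```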
Look at the cost of a single example, \(-\log p\), where \(p\) is the probability the model placed on the correct answer. The shape of \(-\log\) does something a plain "probability of error" never could:
| Model's prob on truth \(p\) | Cost \(-\log p\) | Reading |
|---|---|---|
| \(1.0\) | \(0\) | Certain and correct — no penalty |
| \(0.5\) | \(0.69\) | A coin-flip's worth of doubt |
| \(0.1\) | \(2.30\) | Truth was unlikely under the model |
| \(0.01\) | \(4.61\) | Confidently wrong — heavily punished |
| \(\to 0\) | \(\to \infty\) | Ruled the truth out entirely — unbounded loss |
The penalty is zero only when the model is fully confident and right, and grows without bound as the model assigns vanishing probability to what actually happened. That asymmetry is the whole point: being confidently wrong is punished far more harshly than being unsure. A loss linear in error would let a model shrug off an occasional catastrophic miss; \(-\log p\) never lets it.
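To see the asymmetry in numbers, the sketch below compares \(-\log p\) with one possible linear penalty, \(1 - p\). The linear form is an assumption made for this comparison (the text does not pin down a specific linear loss), and the probabilities are arbitrary.

```python
import math

def linear_penalty(p):
    # Assumed linear alternative: penalize the missing probability mass.
    # It can never exceed 1, however badly the model misses.
    return 1.0 - p

def log_penalty(p):
    # The per-example NLL cost, unbounded as p approaches 0.
    return -math.log(p)

for p in [0.5, 0.1, 0.01, 1e-6]:
    print(f"p={p:g}  linear={linear_penalty(p):.4f}  -log p={log_penalty(p):.2f}")
```

Going from \(p = 0.01\) to \(p = 10^{-6}\) barely moves the linear penalty, while \(-\log p\) triples.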
A 3-class problem where the true class is #2. The model outputs the distribution \(\mathbf{p} = [0.1,\ 0.7,\ 0.2]\). Only the probability on the observed class enters the loss:
\[ \mathcal{L}_{\text{NLL}} = -\log p_2 = -\log(0.7) \approx 0.357 \]
Had the model been less sure of the truth, say \(\mathbf{p} = [0.1,\ 0.2,\ 0.7]\) (true class got only \(0.2\)):
\[ \mathcal{L}_{\text{NLL}} = -\log(0.2) \approx 1.609 \]
Same correct class, but a model that buried the truth under a confident wrong guess pays roughly \(4.5\times\) the cost. The other entries (\(0.1\), \(0.7\)) never appear directly; they matter only through the constraint that the distribution sums to 1, which is what lifting \(p_2\) competes against.
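The worked example can be replayed in a few lines. The helper `nll_single` is a name made up for this sketch, and the third distribution, `[0.4, 0.2, 0.4]`, is an extra case added here to show that reshuffling mass among the wrong classes leaves the loss unchanged.

```python
import math

def nll_single(dist, true_index):
    # Only the entry at the observed class enters the loss.
    return -math.log(dist[true_index])

TRUE_CLASS = 1  # class #2, using zero-based indexing

loss_a = nll_single([0.1, 0.7, 0.2], TRUE_CLASS)
loss_b = nll_single([0.1, 0.2, 0.7], TRUE_CLASS)
# Same p_2 = 0.2, but the remaining mass is split differently.
loss_c = nll_single([0.4, 0.2, 0.4], TRUE_CLASS)

print(round(loss_a, 3), round(loss_b, 3))   # 0.357 1.609
print(loss_b / loss_a)                      # about 4.5
print(loss_b == loss_c)                     # True
```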
NLL is a template. Pick the distribution \(p_\theta(y\mid x)\) that matches your output type, plug it in, and a familiar named loss falls out.
If \(y\) is one of \(V\) classes and the model outputs a categorical distribution (via softmax), then \(p_\theta(y_i\mid x_i) = p_{i,k_i}\), the probability on the true class \(k_i\). The NLL is exactly cross entropy:
\[ \mathcal{L}_{\text{NLL}} = -\frac{1}{n}\sum_{i=1}^{n}\log p_{i,k_i} \]
For a single example with one-hot label \(y\), the label zeroes every term except the true class, so cross entropy's sum over classes \(-\sum_{c=1}^{V} y_c\log p_c\) collapses to \(-\log p_{k}\), the NLL of a single categorical draw.
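Here is a small check that the one-hot cross-entropy sum and \(-\log p_k\) are the same number. It is a sketch in plain Python: the logits are invented, and the log-softmax uses the common subtract-the-max trick for stability, which is a choice of this sketch and is not discussed in the text.

```python
import math

def log_softmax(logits):
    # Subtract the max before exponentiating so exp() cannot overflow.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def cross_entropy_one_hot(logits, one_hot):
    # Full sum over classes: -sum_c y_c * log p_c.
    log_p = log_softmax(logits)
    return -sum(y * lp for y, lp in zip(one_hot, log_p))

def nll_true_class(logits, k):
    # NLL of the observed class only: -log p_k.
    return -log_softmax(logits)[k]

logits = [1.0, 3.0, 0.5]   # hypothetical model scores for 3 classes
print(cross_entropy_one_hot(logits, [0, 1, 0]))
print(nll_true_class(logits, 1))   # same value
```

Deep-learning libraries typically take the class index directly instead of a one-hot vector, which amounts to the second function.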
For a yes/no label with the model predicting \(p_\theta = \hat p\) for "yes", the Bernoulli probability of the observed \(y\in\{0,1\}\) is \(\hat p^{\,y}(1-\hat p)^{1-y}\). Its NLL is binary cross entropy:
\[ \mathcal{L}_{\text{NLL}} = -\big[\,y\log\hat p + (1-y)\log(1-\hat p)\,\big] \]
For a real-valued target modeled as \(y \sim \mathcal{N}(\mu_\theta(x),\sigma^2)\) with fixed \(\sigma\), the density is \(\tfrac{1}{\sqrt{2\pi\sigma^2}}e^{-(y-\mu_\theta(x))^2/(2\sigma^2)}\). Taking \(-\log\) leaves the squared error plus a constant that does not depend on \(\theta\):
\[ \mathcal{L}_{\text{NLL}} = \frac{1}{2\sigma^2}\,(y-\mu_\theta(x))^2 + \text{const} \]
So least-squares regression is just NLL under a Gaussian assumption: the reason MSE is the "default" regression loss is that it is the maximum-likelihood loss for Gaussian noise.
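Both closed forms above, Bernoulli and Gaussian, can be checked numerically. The sketch below uses invented values (\(\hat p = 0.8\), \(y = 2\), \(\sigma = 1.5\)) and function names chosen for illustration. The Gaussian check compares two predictions so that the constant cancels and only the scaled squared errors remain.

```python
import math

def bernoulli_nll(y, p_hat):
    # -log of p_hat^y * (1 - p_hat)^(1 - y), for y in {0, 1}.
    return -(y * math.log(p_hat) + (1 - y) * math.log(1.0 - p_hat))

def gaussian_nll(y, mu, sigma):
    # -log of the normal density N(y; mu, sigma^2).
    return (y - mu) ** 2 / (2 * sigma ** 2) + 0.5 * math.log(2 * math.pi * sigma ** 2)

# Bernoulli: the label selects which log term survives.
print(bernoulli_nll(1, 0.8), -math.log(0.8))
print(bernoulli_nll(0, 0.8), -math.log(0.2))

# Gaussian: the NLL gap between two predictions equals the gap between their
# squared errors divided by 2 * sigma^2; the constant drops out.
y, sigma = 2.0, 1.5
gap_nll = gaussian_nll(y, 0.0, sigma) - gaussian_nll(y, 1.0, sigma)
gap_sq = ((y - 0.0) ** 2 - (y - 1.0) ** 2) / (2 * sigma ** 2)
print(gap_nll, gap_sq)
```

Because the two gaps agree, ranking predictions by Gaussian NLL with fixed \(\sigma\) and ranking them by squared error give the same order.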
A language model factorizes the probability of a sequence into per-token conditionals, so its training loss is the NLL averaged over the \(T\) positions \(t\):
\[ \mathcal{L}_{\text{NLL}} = -\frac{1}{T}\sum_{t=1}^{T}\log p_\theta(\text{token}_t \mid \text{token}_{<t}) \]
When the observation is "\(y_w\) was preferred over \(y_l\)," the Bradley–Terry model gives the probability of that preference as \(\sigma(\hat r_w - \hat r_l)\). Its NLL is the DPO loss (and the RLHF reward-model loss):
\[ \mathcal{L}_{\text{NLL}} = -\,\mathbb{E}\big[\log\sigma(\hat r_w - \hat r_l)\big] \]
Same template (negative log of the probability the model assigns to the observed outcome), with the "outcome" being a human preference rather than a class or a token.
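The token-level and preference cases can be sketched side by side. Everything here is illustrative: the per-token probabilities and the scores `r_w`, `r_l` are made-up inputs, and how those scores are produced (a reward model, or DPO's own score) is deliberately left outside the sketch. The `log_sigmoid` helper is written in a numerically stable form, which is a choice of this example.

```python
import math

def sequence_nll(token_probs):
    # token_probs[t] is the probability the model gave the observed token at
    # position t, conditioned on the tokens before it.
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def log_sigmoid(x):
    # Stable log(sigma(x)): never exponentiates a large positive number.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def preference_nll(r_w, r_l):
    # Bradley-Terry: P(y_w preferred over y_l) = sigma(r_w - r_l).
    return -log_sigmoid(r_w - r_l)

print(sequence_nll([0.5, 0.25, 0.125]))   # 2 * log(2), about 1.386
print(preference_nll(2.0, 0.5))           # preferred answer scored higher: small loss
print(preference_nll(0.5, 2.0))           # ranking reversed: larger loss
```

Equal scores give \(\sigma(0) = 0.5\) and a loss of \(\log 2\), the same "coin-flip" cost as in the single-example table.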
| Output type | Distribution \(p_\theta(y\mid x)\) | NLL becomes |
|---|---|---|
| One of \(V\) classes | Categorical (softmax) | Cross entropy |
| Yes / no | Bernoulli (sigmoid) | Binary log-loss |
| Real number | Gaussian, fixed \(\sigma\) | Mean-squared error |
| Next token | Categorical over vocab | LM cross-entropy |
| Preference \(y_w \succ y_l\) | Bradley–Terry (sigmoid) | DPO / RM loss |
Averaging the NLL over the true data distribution \(p_{\text{data}}\) rather than a finite sample gives the cross entropy between data and model:
\[ \mathbb{E}_{y\sim p_{\text{data}}}\!\big[-\log p_\theta(y)\big] = H(p_{\text{data}}, p_\theta) = \underbrace{H(p_{\text{data}})}_{\text{fixed}} + \underbrace{\mathrm{KL}\big(p_{\text{data}}\,\|\,p_\theta\big)}_{\text{what training reduces}} \]
The entropy \(H(p_{\text{data}})\) does not depend on \(\theta\), so minimizing the expected NLL is exactly minimizing the forward KL \(\mathrm{KL}(p_{\text{data}}\,\|\,p_\theta)\) from the data distribution to the model. That is why maximum-likelihood training is mass-covering: the model is pushed to place probability everywhere the data does. (The forward/reverse KL distinction is on the notation page.)
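For discrete distributions the decomposition can be verified directly. The two distributions below are invented for the check, and natural logs are used throughout, so the values are in nats.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_data = [0.5, 0.3, 0.2]     # hypothetical data distribution
p_model = [0.4, 0.4, 0.2]    # hypothetical model distribution

print(cross_entropy(p_data, p_model))            # expected NLL under the model
print(entropy(p_data) + kl(p_data, p_model))     # same number, split in two
print(cross_entropy(p_data, p_data))             # the floor: H(p_data)
```

The last line shows why the loss cannot reach zero on noisy data: even a perfect model, with \(p_\theta = p_{\text{data}}\), still pays the entropy term.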
\(L(\theta)=\prod_i p_\theta(y_i\mid x_i)\)
\(-\tfrac{1}{n}\sum_i \log p_\theta(y_i\mid x_i)\)
\(-\log p\): 0 when right, \(\to\infty\) when confidently wrong
Maximizing likelihood is minimizing its negative log
Categorical→CE, Bernoulli→BCE, Gaussian→MSE, BT→DPO
Min NLL ⇔ min \(\mathrm{KL}(p_{\text{data}}\,\|\,p_\theta)\)
| When you see… | Recognize it as |
|---|---|
| "trained by maximum likelihood" | An NLL loss in the code |
| \(-\log p_k\) / cross entropy | NLL of a categorical draw |
| \(\tfrac12(y-\hat y)^2\) / MSE | NLL under a Gaussian with fixed variance |
| \(-\log\sigma(\cdot)\) | NLL of a Bernoulli / Bradley–Terry outcome |
| A loss that won't reach zero | Its floor is the data entropy \(H(p_{\text{data}})\) |