Negative Log-Likelihood

The single principle behind cross entropy, the language-model token loss, DPO, and least squares: make the observed data as probable as possible — then take the negative log so it becomes a loss to minimize.

Most supervised losses are not separate inventions. They are the same recipe — maximize the probability the model assigns to what actually happened — applied to different output distributions. That recipe is the negative log-likelihood (NLL). Learn it once and cross entropy, binary log-loss, mean-squared error, and the DPO loss all become the same object wearing different clothes.

Notation

| Symbol | Description |
| --- | --- |
| \(x_i\) | Input / context of the \(i\)-th example |
| \(y_i\) | Observed (true) output of the \(i\)-th example |
| \(\theta\) | Model parameters being trained |
| \(p_\theta(y\mid x)\) | Probability the model assigns to output \(y\) given \(x\) |
| \(L(\theta)\) | Likelihood: probability of the whole dataset under the model |
| \(\mathcal{L}_{\text{NLL}}(\theta)\) | Negative log-likelihood: the loss we minimize |
| \(n\) | Number of training examples |

1. From "probability of the data" to a loss

Start with one number: how probable does the model think the observed dataset is? Assuming the \(n\) examples are independent, the probability of seeing all of them together is the product of the per-example probabilities — this product is the likelihood:

\[ L(\theta) = \prod_{i=1}^{n} p_\theta(y_i \mid x_i) \]

A good model is one that makes the data it actually saw look probable, so we want to maximize \(L\). Three small moves turn this into a loss a gradient optimizer can chew on.

Step 1 — Take the log

A product of thousands of probabilities (each \(< 1\)) underflows to zero in floating point and is awkward to differentiate. The log turns the product into a sum (see the notation page), which is numerically stable and has simple term-by-term derivatives:

\[ \log L(\theta) = \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i) \]

The log is monotonic, so whatever \(\theta\) maximizes \(L\) also maximizes \(\log L\) — we have not changed the answer, only the arithmetic.
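To see the underflow concretely, here is a minimal sketch using only Python's standard library. The 1,000 probabilities of 0.01 are made-up illustrative values, not data from this page.

```python
import math

# Hypothetical per-example probabilities: 1,000 examples, each given p = 0.01.
probs = [0.01] * 1000

# Direct product: the true value is 1e-2000, far below the smallest
# positive double (about 5e-324), so the running product collapses to 0.0.
likelihood = 1.0
for p in probs:
    likelihood *= p

# Sum of logs: the same information, stored as an ordinary-sized number.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)       # 0.0
print(log_likelihood)   # about -4605.17
```

The product has lost everything, while the log-likelihood still distinguishes this model from one that assigns, say, 0.02 to each example.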

Step 2 — Negate to get a loss

Optimizers minimize by convention. Flip the sign so "maximize log-likelihood" becomes "minimize negative log-likelihood":

\[ \mathcal{L}_{\text{NLL}}(\theta) = -\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i) \]

Step 3 — Average over the batch

Summing makes the loss scale with dataset size; dividing by \(n\) gives a per-example number that is comparable across batch sizes:

\[ \mathcal{L}_{\text{NLL}}(\theta) = -\frac{1}{n}\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i) \]
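The averaged loss above can be sketched in a few lines of plain Python. The function name `mean_nll` and the three probabilities in `batch` are assumptions chosen for illustration; in a real training loop the probabilities would come from the model.

```python
import math

def mean_nll(probs_on_truth):
    """Average negative log-likelihood.

    probs_on_truth[i] is p_theta(y_i | x_i): the probability the model
    assigned to the output that was actually observed for example i.
    """
    n = len(probs_on_truth)
    return -sum(math.log(p) for p in probs_on_truth) / n

# Illustrative values only: the model is fairly sure, unsure, and
# nearly wrong about the observed output of three examples.
batch = [0.9, 0.5, 0.1]
loss = mean_nll(batch)
print(round(loss, 3))   # 1.034
```

Note that the third example contributes about 2.30 of the 3.10 total before averaging, so one poorly predicted example dominates the batch loss.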

In plain English

"For each example, look up the probability the model gave to the answer that actually occurred, take its log, negate it, and average. Driving this down forces the model to put high probability on the truth."

"Maximum likelihood estimation" (MLE) and "minimizing NLL" are the same procedure — one stated as a max over likelihood, the other as a min over its negative log. Whenever a paper says a model is "trained by maximum likelihood," the loss in the code is an NLL.

2. Why the \(-\log p\) shape is exactly right

Look at the cost of a single example, \(-\log p\), where \(p\) is the probability the model placed on the correct answer. The shape of \(-\log\) does something a plain "probability of error" never could:

| Model's probability on truth \(p\) | Cost \(-\log p\) | Reading |
| --- | --- | --- |
| \(1.0\) | \(0\) | Certain and correct: no penalty |
| \(0.5\) | \(0.69\) | A coin-flip's worth of doubt |
| \(0.1\) | \(2.30\) | Truth was unlikely under the model |
| \(0.01\) | \(4.61\) | Confidently wrong: heavily punished |
| \(\to 0\) | \(\to \infty\) | Ruled the truth out entirely: unbounded loss |

The penalty is zero only when the model is fully confident and right, and grows without bound as the model assigns vanishing probability to what actually happened. That asymmetry is the whole point: being confidently wrong is punished far more harshly than being unsure. A loss linear in error would let a model shrug off an occasional catastrophic miss; \(-\log p\) never lets it.

The log also makes the gradient well-behaved: \(\nabla_\theta\big[-\log p_\theta\big] = -\tfrac{1}{p_\theta}\nabla_\theta p_\theta\). The \(1/p\) factor amplifies the update precisely when the model gave the truth low probability, so the largest corrections land where they are most needed. As \(p_\theta \to 1\) the factor shrinks to 1, and the update fades because \(\nabla_\theta p_\theta\) itself vanishes once the model is already right.
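As a quick sanity check on the \(1/p\) factor, the sketch below compares a finite-difference slope of \(-\log p\) with the analytic value \(-1/p\). It differentiates with respect to \(p\) itself rather than \(\theta\); the chain rule then supplies the \(\nabla_\theta p_\theta\) term. The helper names `cost` and `numeric_slope` are hypothetical.

```python
import math

def cost(p):
    """Per-example cost for probability p on the observed answer."""
    return -math.log(p)

def numeric_slope(p, h=1e-6):
    """Central-difference estimate of d(cost)/dp."""
    return (cost(p + h) - cost(p - h)) / (2 * h)

for p in (0.9, 0.5, 0.1, 0.01):
    # Columns: p, cost, numeric slope, analytic slope -1/p.
    print(p, round(cost(p), 2), round(numeric_slope(p), 2), round(-1 / p, 2))
```

The slope magnitude grows from about 1.1 at \(p = 0.9\) to 100 at \(p = 0.01\), which is the amplification described above.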

3. Worked example

A 3-class problem where the true class is #2. The model outputs the distribution \(\mathbf{p} = [0.1,\ 0.7,\ 0.2]\). Only the probability on the observed class enters the loss:

\[ \mathcal{L}_{\text{NLL}} = -\log p_2 = -\log(0.7) \approx 0.357 \]

Had the model been less sure of the truth, say \(\mathbf{p} = [0.1,\ 0.2,\ 0.7]\) (true class got only \(0.2\)):

\[ \mathcal{L}_{\text{NLL}} = -\log(0.2) \approx 1.609 \]

Same correct class, but a model that buried the truth under a confident wrong guess pays roughly \(4.5\times\) the cost. The other entries (\(0.1\) and \(0.7\) in the second case) never appear in the loss directly. They matter only through the constraint that the distribution sums to 1: raising \(p_2\) has to take probability away from them.
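The two cases in this worked example can be reproduced directly. This is a small sketch; `nll_single` is a hypothetical helper, and it takes a 1-based class number so that it matches "class #2" in the text.

```python
import math

def nll_single(dist, true_class):
    """NLL of one example; true_class is 1-based to match 'class #2'."""
    return -math.log(dist[true_class - 1])

confident_right = [0.1, 0.7, 0.2]
confident_wrong = [0.1, 0.2, 0.7]

loss_a = nll_single(confident_right, true_class=2)
loss_b = nll_single(confident_wrong, true_class=2)

print(round(loss_a, 3))           # 0.357
print(round(loss_b, 3))           # 1.609
print(round(loss_b / loss_a, 1))  # 4.5
```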

4. The same loss across model types

NLL is a template. Pick the distribution \(p_\theta(y\mid x)\) that matches your output type, plug it in, and a familiar named loss falls out.

Categorical output → cross entropy

If \(y\) is one of \(V\) classes and the model outputs a categorical distribution (via softmax), then \(p_\theta(y_i\mid x_i) = p_{i,k_i}\), the probability on the true class \(k_i\). The NLL is exactly cross entropy:

\[ \mathcal{L}_{\text{NLL}} = -\frac{1}{n}\sum_{i=1}^{n}\log p_{i,k_i} \]

The one-hot label zeroes every term except the true class, so for a single example the cross-entropy sum over classes, \(-\sum_{c=1}^{V} y_c\log p_c\), collapses to \(-\log p_{k}\), which is the NLL of a single categorical draw.

These are two names for one loss. The cross-entropy page derives the softmax mechanics and the clean \(\mathbf{p}-\mathbf{y}\) gradient; this page is the probabilistic reason that loss is the right one to use.
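A minimal sketch of that equivalence, starting from raw model scores (logits). The logits are made-up values, and the log-sum-exp form of `log_softmax` is one common way to compute it stably, not necessarily how any particular library does it.

```python
import math

def log_softmax(logits):
    """Log of the softmax probabilities, computed via log-sum-exp."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def cross_entropy(logits, one_hot):
    """Full cross-entropy sum over classes against a one-hot label."""
    log_p = log_softmax(logits)
    return -sum(y * lp for y, lp in zip(one_hot, log_p))

def categorical_nll(logits, true_index):
    """NLL of the observed class (0-based index)."""
    return -log_softmax(logits)[true_index]

logits = [2.0, 0.5, -1.0]   # hypothetical model scores for V = 3 classes
print(cross_entropy(logits, [0, 1, 0]))
print(categorical_nll(logits, 1))   # same value as the line above
```

Indexing the true class directly skips the multiplications by zero, which is why implementations usually take a class index rather than a one-hot vector.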

Binary output → log-loss (BCE)

For a yes/no label with the model predicting \(p_\theta = \hat p\) for "yes", the Bernoulli probability of the observed \(y\in\{0,1\}\) is \(\hat p^{\,y}(1-\hat p)^{1-y}\). Its NLL is binary cross entropy:

\[ \mathcal{L}_{\text{NLL}} = -\big[\,y\log\hat p + (1-y)\log(1-\hat p)\,\big] \]
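The sketch below checks that the binary cross-entropy formula and the negative log of the Bernoulli probability agree. It assumes \(\hat p\) lies strictly between 0 and 1 (the logs are undefined at the endpoints), and the function names and sample values are illustrative.

```python
import math

def bernoulli_prob(y, p_hat):
    """Probability of the observed label y in {0, 1} when P(yes) = p_hat."""
    return (p_hat ** y) * ((1 - p_hat) ** (1 - y))

def bce(y, p_hat):
    """Binary cross entropy for one example."""
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

for y, p_hat in [(1, 0.8), (0, 0.8), (1, 0.3)]:
    # The last two columns agree up to floating-point rounding.
    print(y, p_hat, bce(y, p_hat), -math.log(bernoulli_prob(y, p_hat)))
```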

Gaussian output → mean-squared error

For a real-valued target modeled as \(y \sim \mathcal{N}(\mu_\theta(x),\sigma^2)\) with fixed \(\sigma\), the density is \(\tfrac{1}{\sqrt{2\pi\sigma^2}}e^{-(y-\mu_\theta)^2/2\sigma^2}\). Taking \(-\log\) turns the normalizer into an additive constant that does not depend on \(\theta\) and leaves the squared error:

\[ \mathcal{L}_{\text{NLL}} = \frac{1}{2\sigma^2}\,(y-\mu_\theta(x))^2 + \text{const} \]

So least-squares regression is just NLL under a Gaussian assumption — the reason MSE is the "default" regression loss is that it is the maximum-likelihood loss for Gaussian noise.
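To make the "plus a constant" claim concrete, the sketch below evaluates the full Gaussian NLL and the scaled squared error for a few predictions and shows that their gap never changes. The target, noise scale, and predictions are hypothetical values.

```python
import math

def gaussian_nll(y, mu, sigma):
    """-log of the N(mu, sigma^2) density evaluated at y."""
    return (y - mu) ** 2 / (2 * sigma ** 2) + 0.5 * math.log(2 * math.pi * sigma ** 2)

def scaled_squared_error(y, mu, sigma):
    """The theta-dependent part of the Gaussian NLL."""
    return (y - mu) ** 2 / (2 * sigma ** 2)

y, sigma = 3.0, 1.5            # hypothetical target and fixed noise scale
const = 0.5 * math.log(2 * math.pi * sigma ** 2)

for mu in (0.0, 2.0, 3.0):     # hypothetical predictions
    gap = gaussian_nll(y, mu, sigma) - scaled_squared_error(y, mu, sigma)
    print(mu, gaussian_nll(y, mu, sigma), gap)   # gap equals const up to rounding
```

Because the gap is the same for every prediction, the prediction that minimizes the squared error also minimizes the NLL.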

Next-token prediction → the LM loss

A language model factorizes the probability of a sequence into per-token conditionals, so its training loss is the NLL averaged over the \(T\) positions \(t\):

\[ \mathcal{L}_{\text{NLL}} = -\frac{1}{T}\sum_{t=1}^{T}\log p_\theta(\text{token}_t \mid \text{token}_{<t}) \]

"Predict the next token" is literally "maximize the likelihood of the token that actually came next." Each position is a categorical NLL, i.e. a cross entropy over the vocabulary.
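A toy illustration of the factorization, assuming we already have the conditional probability some model gave to each token that actually occurred. The four values in `token_probs` are invented for the example.

```python
import math

# Hypothetical conditionals p(token_t | token_<t) that a model assigned
# to the tokens that actually occurred in a 4-token sequence.
token_probs = [0.20, 0.50, 0.05, 0.90]

# Chain rule: the sequence probability is the product of the conditionals,
# so the sequence log-probability is the sum of their logs.
sequence_prob = math.prod(token_probs)
sequence_log_prob = sum(math.log(p) for p in token_probs)

# LM loss: average per-token NLL over the T positions.
T = len(token_probs)
lm_loss = -sequence_log_prob / T

print(sequence_prob, math.exp(sequence_log_prob))  # both about 0.0045
print(lm_loss)
```

The third token, with probability 0.05, contributes more to the loss than the other three combined, mirroring the single-example behavior of \(-\log p\).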

Preference pairs → the DPO / RLHF reward loss

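When the observation is "\(y_w\) was preferred over \(y_l\)," the Bradley–Terry model gives the probability of that preference as \(\sigma(\hat r_w - \hat r_l)\). Here \(\sigma\) is the logistic sigmoid (not the Gaussian standard deviation used above), and \(\hat r_w\), \(\hat r_l\) are the scalar scores the model gives the two responses: a learned reward in RLHF, the implicit reward \(\beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}\) in DPO. Its NLL is the DPO loss (and the RLHF reward-model loss):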

\[ \mathcal{L}_{\text{NLL}} = -\,\mathbb{E}\big[\log\sigma(\hat r_w - \hat r_l)\big] \]

Same template — negative log of the probability the model assigns to the observed outcome — with the "outcome" being a human preference rather than a class or a token.
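A minimal sketch of the per-pair preference loss. The scores passed in are hypothetical, and the rewritten form of \(-\log\sigma(z)\) used here is one standard way to avoid overflow for large negative margins, offered as an assumption rather than the implementation any particular library uses.

```python
import math

def neg_log_sigmoid(z):
    """-log(sigmoid(z)) = log(1 + exp(-z)), written to avoid overflow."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def preference_nll(r_w, r_l):
    """Bradley-Terry NLL for one pair: the preferred response scored r_w."""
    return neg_log_sigmoid(r_w - r_l)

# Hypothetical scores for (preferred, rejected) responses.
print(preference_nll(2.0, 0.0))   # model agrees with the preference: small loss
print(preference_nll(0.0, 0.0))   # no opinion: log(2)
print(preference_nll(0.0, 2.0))   # model disagrees: large loss
```

Only the margin \(\hat r_w - \hat r_l\) enters the loss, so adding the same constant to both scores leaves it unchanged.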

| Output type | Distribution \(p_\theta(y\mid x)\) | NLL becomes |
| --- | --- | --- |
| One of \(V\) classes | Categorical (softmax) | Cross entropy |
| Yes / no | Bernoulli (sigmoid) | Binary log-loss |
| Real number | Gaussian, fixed \(\sigma\) | Mean-squared error |
| Next token | Categorical over vocab | LM cross-entropy |
| Preference \(y_w \succ y_l\) | Bradley–Terry (sigmoid) | DPO / RM loss |

5. Relationship to entropy and KL

Averaging the NLL over the true data distribution \(p_{\text{data}}\) rather than a finite sample gives the cross entropy between data and model:

\[ \mathbb{E}_{y\sim p_{\text{data}}}\!\big[-\log p_\theta(y)\big] = H(p_{\text{data}}, p_\theta) = \underbrace{H(p_{\text{data}})}_{\text{fixed}} + \underbrace{\mathrm{KL}\big(p_{\text{data}}\,\|\,p_\theta\big)}_{\text{what training reduces}} \]

The entropy \(H(p_{\text{data}})\) does not depend on \(\theta\), so minimizing NLL is exactly minimizing the forward KL from the data distribution to the model. That is why maximum-likelihood training is mass-covering: the model is pushed to place probability everywhere the data does. (The forward/reverse KL distinction is on the notation page.)

This also explains the floor on the loss: the smallest achievable cross entropy is \(H(p_{\text{data}})\), reached only when \(p_\theta = p_{\text{data}}\). For discrete outputs, the expected NLL can never be driven to zero unless the data is perfectly deterministic; its irreducible part is the data's own entropy.
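The decomposition and the floor can be checked numerically on a small discrete example. Both distributions below are hypothetical, and the three helper functions are straightforward transcriptions of the definitions for the discrete case.

```python
import math

def entropy(p):
    """H(p) for a discrete distribution, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q): expected -log q under samples from p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """Forward KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_data = [0.5, 0.3, 0.2]     # hypothetical true distribution
p_model = [0.4, 0.4, 0.2]    # hypothetical model distribution

print(cross_entropy(p_data, p_model))
print(entropy(p_data) + kl(p_data, p_model))   # same value up to rounding
print(cross_entropy(p_data, p_data))           # the floor, H(p_data)
```

The first two numbers match, and the third is strictly smaller: the gap between them is exactly the KL term that training can still remove.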

6. Summary

Likelihood

\(L(\theta)=\prod_i p_\theta(y_i\mid x_i)\)

NLL loss

\(-\tfrac{1}{n}\sum_i \log p_\theta(y_i\mid x_i)\)

Cost shape

\(-\log p\): 0 when right, \(\to\infty\) when confidently wrong

MLE = min NLL

Maximizing likelihood is minimizing its negative log

One template

Categorical→CE, Bernoulli→BCE, Gaussian→MSE, BT→DPO

= forward KL

Min NLL ⇔ min \(\mathrm{KL}(p_{\text{data}}\,\|\,p_\theta)\)

| When you see… | Recognize it as |
| --- | --- |
| "trained by maximum likelihood" | An NLL loss in the code |
| \(-\log p_k\) / cross entropy | NLL of a categorical draw |
| \(\tfrac12(y-\hat y)^2\) / MSE | NLL under a Gaussian with fixed variance |
| \(-\log\sigma(\cdot)\) | NLL of a Bernoulli / Bradley–Terry outcome |
| A loss that won't reach zero | Its floor is the data entropy \(H(p_{\text{data}})\) |