A practical, symbol-by-symbol reference for the probability and optimization notation that shows up across machine-learning papers — with a worked example at the end.
Most ML equations reuse the same small vocabulary of symbols. Once you can read each piece on its own, even dense objectives become a sentence you can translate. This page builds that vocabulary one symbol at a time, then applies all of it to a real objective in §9.
Before any operation, know what kind of thing a symbol is.
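A subscript means one of two different things depending on what sits there: a parameter name, as in \(\pi_\theta\) (a model whose weights are \(\theta\)), or an index, as in \(x_i\) (the \(i\)-th item in a collection).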
The two can coexist without clashing because they look different: in \(r_\phi(x, o_i)\), the Greek \(\phi\) names the reward model's parameters while the running integer \(i\) indexes which output. Likewise \(\mathcal{L}(\theta)\) — parameters in the argument slot — signals "this loss is a function of the weights \(\theta\)," the knobs gradient descent turns.
The capital sigma \(\sum\) means "add up"; the capital pi \(\prod\) means "multiply together." The decorations tell you the index variable and its range.
\[ \sum_{i=1}^{n} a_i \;=\; a_1 + a_2 + \cdots + a_n \qquad\qquad \prod_{i=1}^{n} a_i \;=\; a_1 \cdot a_2 \cdots a_n \]
Read the subscript as "start here" and the superscript as "stop here." The body to the right is evaluated once per index value. A subscript with a condition, like \(\sum_{i \,:\, y_i = 1}\), means "sum over every \(i\) that satisfies the condition."
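The same three patterns map directly onto code. The sketch below is only an illustration with made-up numbers: the lists `a` and `y` are assumptions, not values from this page.

```python
import math

a = [2, 3, 4]   # a_1, a_2, a_3 (Python counts from 0, the math from 1)
y = [1, 0, 1]   # hypothetical labels used for the conditional sum

total = sum(a)            # sum_{i=1}^{n} a_i
product = math.prod(a)    # prod_{i=1}^{n} a_i

# sum over every i that satisfies y_i = 1
conditional = sum(a_i for a_i, y_i in zip(a, y) if y_i == 1)

print(total, product, conditional)
```

The loop variable plays the role of the index, and the `if` clause plays the role of the condition written under the sigma.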
In a probability like \(p(y \mid x)\), the vertical bar reads "given": "the probability of \(y\) given \(x\)." Everything left of the bar is what you're asking about; everything right is treated as fixed, known background.
\[ \underbrace{p}_{\text{a distribution}}(\,\underbrace{y}_{\text{asked about}} \mid \underbrace{x}_{\text{held fixed}}\,) \]
Example: for a language model, \(p(y \mid x)\) is how likely response \(y\) is once the prompt \(x\) is already fixed. The same bar appears inside expectations and divergences to signal conditioning.
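One way to make "held fixed" concrete is to start from a joint table and renormalise the rows that match the fixed value, using the standard identity \(p(y \mid x) = p(x, y) / p(x)\). The table below is hypothetical, and the helper name `conditional` is an assumption introduced for this sketch.

```python
# Hypothetical joint distribution p(x, y) over a prompt x and a response y.
joint = {
    ("hello", "hi"): 0.3,
    ("hello", "bye"): 0.1,
    ("thanks", "hi"): 0.06,
    ("thanks", "welcome"): 0.54,
}

def conditional(joint, x):
    """Return p(. | x) as a dict: keep entries whose prompt is x, then renormalise."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)   # marginal p(x)
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

p_given_hello = conditional(joint, "hello")
print(p_given_hello)   # probabilities of each response once x = "hello" is fixed
```

Everything to the right of the bar becomes an ordinary function argument; everything to the left is what the returned distribution ranges over.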
A probability distribution assigns a probability to every possible outcome. Two notational habits matter:
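A script letter such as \(\mathcal{D}\) names a distribution as a whole, e.g. "the data distribution." You usually don't evaluate \(\mathcal{D}\) at a point; you draw samples from it (§5). Lowercase \(p,\,q\) name distributions you do evaluate, as in \(p(x)\). A dot in an argument slot, as in \(p(\cdot \mid x)\), likewise stands for the whole distribution over that slot rather than its value at one outcome.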
\(x \sim \mathcal{D}\) reads "\(x\) is drawn from (distributed according to) \(\mathcal{D}\)." It signals that \(x\) is random and names the distribution governing how often each value appears.
Whenever you see \(\sim\), ask: "which symbol is the random one, and what's its distribution?" Everything else is, for the moment, held fixed.
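In code, \(x \sim \mathcal{D}\) is a call to a sampler. The sketch below assumes a small made-up distribution over three outcomes and checks that the observed frequencies approach the stated probabilities.

```python
import random
from collections import Counter

rng = random.Random(0)   # fixed seed so the run is repeatable

# A hypothetical distribution D over three outcomes.
outcomes = ["a", "b", "c"]
probs = [0.5, 0.3, 0.2]

# x ~ D, repeated 10,000 times
samples = rng.choices(outcomes, weights=probs, k=10_000)

freq = {o: c / len(samples) for o, c in Counter(samples).items()}
print(freq)   # frequencies should land near 0.5, 0.3 and 0.2
```

Here `samples` holds the random quantity, while `outcomes` and `probs` together are the distribution that governs it.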
\(\mathbb{E}\) (blackboard-bold "E") is the expected value — a probability-weighted average. The subscript says what is random; the brackets hold the quantity being averaged.
\[ \mathbb{E}_{x\sim p}\big[\, f(x) \,\big] \;=\; \sum_{x} \underbrace{p(x)}_{\text{how likely } x \text{ is}} \cdot \underbrace{f(x)}_{\text{value at } x} \qquad\Big(\text{or } \int p(x)\,f(x)\,dx \text{ if continuous}\Big) \]
In words: run \(f\) on every possible \(x\), weight each result by how probable that \(x\) is, and add them up. Likely values count more; rare values count less.
A fair die, \(f(x)=x\) (the face value). Each face has probability \(1/6\):
\[ \mathbb{E}[x] = \tfrac16(1) + \tfrac16(2) + \cdots + \tfrac16(6) = 3.5 \]
The "expected" value 3.5 is the long-run average, even though you can never roll a 3.5. Expectation is an average, not a prediction of any single outcome.
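The die example can be computed two ways: exactly, by the weighted sum in the definition, and approximately, by averaging many simulated rolls. This is a minimal sketch; the helper name `expectation` is an assumption made for illustration.

```python
import random

faces = [1, 2, 3, 4, 5, 6]
p = {x: 1 / 6 for x in faces}   # fair die: p(x) = 1/6

def expectation(p, f):
    """E_{x~p}[f(x)] = sum over x of p(x) * f(x)."""
    return sum(p[x] * f(x) for x in p)

exact = expectation(p, lambda x: x)   # 3.5 up to floating-point rounding

# Monte Carlo estimate: average f over many draws x ~ p.
rng = random.Random(0)
draws = rng.choices(faces, weights=[p[x] for x in faces], k=100_000)
estimate = sum(draws) / len(draws)

print(exact, estimate)
```

The sampled average is how expectations are usually estimated in practice, when summing over every possible \(x\) is not feasible.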
Subscripts can stack to express several sources of randomness, evaluated as a nested process — draw the first variable, then the next, then average:
\[ \mathbb{E}_{x\sim\mathcal{D},\; y\sim p(\cdot\mid x)}\big[\, f(x,y) \,\big] \]
\(\max_{z}\, g(z)\) is "the largest value \(g\) takes as \(z\) varies"; \(\min_z\) is the smallest. The subscript names what you're allowed to change. \(\arg\max_z\, g(z)\) is the \(z\) that achieves that largest value: the winner rather than the score.
The variable need not be a number. If \(\pi\) is a whole function (e.g. a policy or distribution), \(\max_\pi\) searches over the entire space of functions for the best one.
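The difference between the best score (\(\max\)) and the input that achieves it (\(\arg\max\)) is easy to see in code. In this sketch the function `g`, the candidate set, and the three toy policies are all hypothetical.

```python
def g(z):
    return -(z - 2) ** 2 + 5   # hypothetical score, peaks at z = 2

candidates = [0, 1, 2, 3, 4]   # the values z is allowed to take

best_value = max(g(z) for z in candidates)   # max_z g(z): the score
best_z = max(candidates, key=g)              # arg max_z g(z): the winner

# The variable can be a whole function: search over a small set of policies.
policies = {
    "always_0": lambda x: 0,
    "identity": lambda x: x,
    "double": lambda x: 2 * x,
}

def score(policy):
    """Hypothetical objective: how closely the policy matches 2*x on x = 1, 2, 3."""
    return -sum((policy(x) - 2 * x) ** 2 for x in (1, 2, 3))

best_policy = max(policies, key=lambda name: score(policies[name]))
print(best_value, best_z, best_policy)
```

Real objectives search a far larger space of functions, usually by gradient methods rather than enumeration, but the notation means the same thing.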
The double bar \(\|\) separates two distributions being compared. \(\mathrm{KL}(p \,\|\, q)\) measures how different \(p\) is from \(q\) — a kind of asymmetric "distance."
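\[ \mathrm{KL}(p \,\|\, q) \;=\; \mathbb{E}_{x\sim p}\!\left[\log \frac{p(x)}{q(x)}\right] \;=\; \sum_x p(x)\,\log\frac{p(x)}{q(x)} \]
Properties worth memorizing: \(\mathrm{KL}(p \,\|\, q) \ge 0\), with equality exactly when \(p = q\); and it is asymmetric, so \(\mathrm{KL}(p \,\|\, q) \ne \mathrm{KL}(q \,\|\, p)\) in general.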
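The definition can be checked numerically on two small discrete distributions. The sketch below uses made-up probabilities and assumes \(q(x) > 0\) wherever \(p(x) > 0\); it illustrates that the value is non-negative, is zero when both arguments match, and changes when the arguments are swapped.

```python
import math

def kl(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)) for dicts over the same outcomes.

    Terms with p(x) = 0 contribute 0 by the usual convention.
    Assumes q(x) > 0 wherever p(x) > 0.
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Hypothetical distributions over three outcomes.
p = {"a": 0.7, "b": 0.2, "c": 0.1}
q = {"a": 0.4, "b": 0.4, "c": 0.2}

print(kl(p, q))   # positive
print(kl(q, p))   # also positive, but a different number
print(kl(p, p))   # zero: a distribution does not differ from itself
```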
Because KL is asymmetric, swapping the arguments gives a genuinely different quantity. When fitting a model \(q_\theta\) to a target \(p\), the two orderings have names and distinct behaviors:
| | Forward KL | Reverse KL |
|---|---|---|
| Written | \(\mathrm{KL}(p \,\|\, q_\theta)\) | \(\mathrm{KL}(q_\theta \,\|\, p)\) |
| Expectation under | the target \(p\) | the model \(q_\theta\) |
| Behavior | Mass-covering: \(q\) spreads to cover every mode of \(p\) | Mode-seeking: \(q\) tends to lock onto one mode and ignore the rest |
| Penalizes | \(q\!\approx\!0\) where \(p\!>\!0\) (missing mass) | \(q\!>\!0\) where \(p\!\approx\!0\) (spurious mass) |
The intuition is in which distribution weights the log-ratio. Forward KL averages \(\log\tfrac{p}{q}\) over \(p\), so wherever \(p\) has mass but \(q\) doesn't, the ratio blows up — \(q\) is forced to cover all of \(p\). Reverse KL averages \(\log\tfrac{q}{p}\) over \(q\), so \(q\) is only penalized where it puts mass; it can safely ignore parts of \(p\) by placing no mass there, concentrating on a single high-probability region.
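A small grid search makes the two behaviors visible. This is a toy sketch under stated assumptions: the target \(p\) is a made-up two-peaked distribution on the integers 0 to 10, the model family \(q_\theta\) is a single discretised bell curve with \(\theta = (\mu, \sigma)\), and the fit is done by brute force rather than gradient descent.

```python
import math

xs = range(11)   # outcomes 0..10

def bump(mu, sigma):
    """Discretised bell curve on xs, normalised to sum to 1."""
    w = [math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for x in xs]
    z = sum(w)
    return [wi / z for wi in w]

def kl(p, q):
    """KL(p || q) for two probability lists over xs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical bimodal target p: equal mixture of bumps at 2 and 8.
p = [0.5 * a + 0.5 * b for a, b in zip(bump(2, 0.7), bump(8, 0.7))]

# Model family q_theta: one bump, theta = (mu, sigma), searched on a grid.
thetas = [(mu, s) for mu in xs for s in (0.7, 1.5, 3.0)]

forward = min(thetas, key=lambda t: kl(p, bump(*t)))   # argmin KL(p || q_theta)
reverse = min(thetas, key=lambda t: kl(bump(*t), p))   # argmin KL(q_theta || p)

print("forward KL picks", forward)   # expect a wide bump sitting between the modes
print("reverse KL picks", reverse)   # expect a narrow bump on one of the modes
```

Because the model can only express one bump, it cannot match \(p\) exactly, and the choice of KL direction decides which compromise it makes.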
| Symbol | Reads as | Note |
|---|---|---|
| \(\propto\) | "is proportional to" | Equal up to a constant factor (often a normalizer) |
| \(\triangleq\) or \(:=\) | "is defined as" | Introduces a definition, not a derived equality |
| \(\in\) | "is an element of" | \(x \in \mathbb{R}^n\): \(x\) lives in that set |
| \(\nabla_\theta f\) | "gradient of \(f\) w.r.t. \(\theta\)" | Vector of partial derivatives; points uphill |
| \(\|\mathbf{x}\|\) | "the norm (length) of \(\mathbf{x}\)" | \(\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}\) |
| \(\mathbf{1}[\,\cdot\,]\) | "1 if true, else 0" | Indicator function; turns a condition into a number |
| \(\odot\) | "elementwise product" | Multiply matching entries (Hadamard product) |
| \(\hat{x}\) | "estimate / predicted \(x\)" | A hat usually marks a model's guess of a true quantity |
| \(x^*\) | "the optimal \(x\)" | A star marks the best / solution value, e.g. \(\pi^*\) is the optimal policy |
| \(a \succ b\) | "\(a\) is preferred over \(b\)" | Preference ordering; \(y_w \succ y_l\) means the winner beats the loser |
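Several rows of the table have one-line equivalents in code. The sketch below uses plain Python lists and made-up numbers; the function `f` used for the starred value is hypothetical.

```python
import math

x = [3.0, 4.0]
w = [0.5, 2.0]

norm = math.sqrt(sum(xi ** 2 for xi in x))          # ||x||_2
hadamard = [xi * wi for xi, wi in zip(x, w)]        # elementwise (Hadamard) product
count_big = sum(1 if xi > 3.5 else 0 for xi in x)   # sum_i 1[x_i > 3.5]

# "Proportional to": unnormalised scores become probabilities after
# dividing by their sum (the normaliser).
scores = [1.0, 3.0]
probs = [s / sum(scores) for s in scores]

# A star marks the optimiser: x_star is the entry of x that maximises f.
f = lambda v: -(v - 4.0) ** 2
x_star = max(x, key=f)

print(norm, hadamard, count_big, probs, x_star)
```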
Here is a real, dense objective — the KL-constrained RLHF objective from the DPO page. Nothing in it is new; it is just the pieces above stacked together.
\[ \underbrace{\max_{\pi}}_{\text{§7}}\; \underbrace{\mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi(\cdot\mid x)}}_{\text{§4,5,6}} \big[\,\underbrace{r(x,y)}_{\text{§1}}\,\big] \;-\; \underbrace{\beta}_{\text{§1}}\, \underbrace{\mathrm{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big)}_{\text{§3,8}} \]
Translate left to right:
"Find the policy \(\pi\) that makes its average reward as large as possible — over prompts from the data and responses the policy generates — while not drifting too far from the reference model, with \(\beta\) controlling how strongly that drift is penalized."
Two competing forces, traded off by \(\beta\): the expectation term pulls toward high-reward behavior; the KL term tethers the policy to the reference so it stays well-behaved. Read this way, the equation is a sentence — which is the whole point of learning the notation.
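The trade-off can be seen by evaluating the objective on a tiny discrete problem. Everything in this sketch is hypothetical: two prompts, two responses, made-up rewards, and a uniform reference policy. It also assumes the KL term is averaged over the same prompts \(x \sim \mathcal{D}\), which the displayed formula leaves implicit.

```python
import math

# Hypothetical toy setup: two prompts, two responses each.
D = {"x1": 0.5, "x2": 0.5}                       # x ~ D
r = {("x1", "a"): 1.0, ("x1", "b"): 0.0,         # reward r(x, y)
     ("x2", "a"): 0.0, ("x2", "b"): 2.0}
pi_ref = {"x1": {"a": 0.5, "b": 0.5}, "x2": {"a": 0.5, "b": 0.5}}

def kl(p, q):
    """KL(p || q) for dicts over the same responses."""
    return sum(py * math.log(py / q[y]) for y, py in p.items() if py > 0)

def objective(pi, beta):
    """E_{x~D, y~pi(.|x)}[r(x, y)] - beta * E_{x~D}[KL(pi(.|x) || pi_ref(.|x))]."""
    total = 0.0
    for x, px in D.items():                                       # outer draw: x ~ D
        reward = sum(py * r[(x, y)] for y, py in pi[x].items())   # inner: y ~ pi(.|x)
        total += px * (reward - beta * kl(pi[x], pi_ref[x]))
    return total

# A policy that almost always picks the highest-reward response.
greedy = {"x1": {"a": 0.99, "b": 0.01}, "x2": {"a": 0.01, "b": 0.99}}

for beta in (0.1, 5.0):
    print(beta, objective(pi_ref, beta), objective(greedy, beta))
```

With a small \(\beta\) the greedy policy scores higher than the reference; with a large \(\beta\) its drift from the reference costs more than the extra reward is worth.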
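| Symbol | Meaning |
|---|---|
| \(f(x)\), \(\mathbf{x} \in \mathbb{R}^n\) | Function applied to \(x\); \(\mathbf{x}\) is an \(n\)-vector |
| \(\sum\), \(\prod\) | Add up / multiply over an index range |
| \(p(y \mid x)\) | Probability of \(y\) given \(x\) |
| \(p(\cdot \mid x)\) | The whole distribution, not one value |
| \(x \sim p\) | \(x\) is randomly drawn from \(p\) |
| \(\mathbb{E}_{x\sim p}[f(x)]\) | Probability-weighted average of \(f\) |
| \(\max\) vs. \(\arg\max\) | Best value vs. the winner that achieves it |
| \(\mathrm{KL}(p \,\|\, q)\) | How far \(p\) is from \(q\); \(\ge 0\), asymmetric |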
| When you see… | Ask yourself |
|---|---|
| A bold/uppercase letter or \(\mathbb{R}^n\) | Scalar, vector, or matrix — what shape? |
| A subscript on \(\sum\), \(\prod\), or \(\mathbb{E}\) | What's the index / what is random — and over what range? |
| A vertical bar \(\mid\) | What's asked about (left) vs. held fixed (right)? |
| A dot \(\cdot\) in an argument slot | Whole distribution, or single value? |
| A tilde \(\sim\) | Which symbol is the random draw, from what? |
| \(\max\) / \(\arg\max\) with a subscript | What can I vary — and do I want the score or the winner? |
| A double bar \(\|\) | Two distributions compared; order matters. |