
Reading ML Math Notation

A practical, symbol-by-symbol reference for the probability and optimization notation that shows up across machine-learning papers — with a worked example at the end.

Most ML equations reuse the same small vocabulary of symbols. Once you can read each piece on its own, even a dense objective becomes a sentence you can translate. This page builds that vocabulary one symbol at a time, then applies all of it to a real objective in §10.

1. The basic objects: scalars, vectors, functions

Before any operation, know what kind of thing a symbol is.

The blackboard-bold \(\mathbb{R}\) means "the real numbers." So \(f:\mathbb{R}^n \to \mathbb{R}\) is shorthand for "\(f\) takes an \(n\)-vector and returns one number." The arrow \(\to\) describes the function's type, not a limit.
Letter conventions are just habits, not rules: \(\alpha,\beta,\gamma,\lambda\) tend to be scalar weights; \(\theta,\phi,\mathbf{w}\) tend to be learnable parameters; \(\pi,\,p,\,q\) tend to be distributions; \(\sigma\) is often the sigmoid or a standard deviation. Always confirm from context.
In RL and alignment papers, \(\pi\) almost always denotes a policy: a conditional distribution \(\pi(y\mid x)\) that, given an input \(x\) (a prompt or state), assigns a probability to each possible output \(y\) (a response or action). "Running the policy" means sampling \(y\sim\pi(\cdot\mid x)\). A language model is a policy in this sense, and training (RLHF, DPO, GRPO) reshapes which outputs it makes likely. You'll meet it as \(\pi_\theta\) (the trainable policy), \(\pi_{\text{ref}}\) (a frozen reference), and \(\pi^*\) (the optimal one).

Subscripts: parameters vs. indices

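A subscript means one of two different things depending on what sits there: a parameter symbol (usually Greek, as in \(\pi_\theta\) or \(r_\phi\)) says which weights define the object, while a running integer (as in \(a_i\) or \(o_i\)) picks out one item from a collection.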

The two can coexist without clashing because they look different: in \(r_\phi(x, o_i)\), the Greek \(\phi\) names the reward model's parameters while the running integer \(i\) indexes which output. Likewise \(\mathcal{L}(\theta)\) — parameters in the argument slot — signals "this loss is a function of the weights \(\theta\)," the knobs gradient descent turns.
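The two kinds of subscript map onto two different things in code. The sketch below is only an illustration: the linear "reward model" `r`, its parameter names `w` and `b`, and the sample outputs are all made up for the example.

```python
def r(phi, x, o):
    # Hypothetical reward model: phi holds its parameters (the subscript in r_phi).
    # The prompt x is accepted but unused in this toy scoring rule.
    return phi["w"] * len(o) + phi["b"]

phi = {"w": 0.5, "b": 1.0}      # subscript-as-parameters: which weights define r
outputs = ["a", "bbb", "cc"]    # o_1, o_2, o_3

# Subscript-as-index: i runs over the outputs while phi stays the same throughout.
rewards = [r(phi, "prompt", outputs[i]) for i in range(len(outputs))]
print(rewards)  # [1.5, 2.5, 2.0]
```

Changing `phi` gives a different function; changing `i` only selects a different input to the same function.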

2. Sums and products: \(\;\sum\;\) and \(\;\prod\;\)

The capital sigma \(\sum\) means "add up"; the capital pi \(\prod\) means "multiply together." The decorations tell you the index variable and its range.

\[ \sum_{i=1}^{n} a_i \;=\; a_1 + a_2 + \cdots + a_n \qquad\qquad \prod_{i=1}^{n} a_i \;=\; a_1 \cdot a_2 \cdots a_n \]

Read the subscript as "start here" and the superscript as "stop here." The body to the right is evaluated once per index value. A subscript with a condition, like \(\sum_{i \,:\, y_i = 1}\), means "sum over every \(i\) that satisfies the condition."
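Each of these notations is a loop in disguise. A minimal sketch, assuming a small hypothetical list `a` and label list `y`:

```python
import math

a = [2.0, 3.0, 4.0]

# sum_{i=1}^{n} a_i : add the body once per index value
total = sum(a[i] for i in range(len(a)))

# prod_{i=1}^{n} a_i : multiply the body once per index value
product = math.prod(a[i] for i in range(len(a)))

# sum_{i : y_i = 1} a_i : keep only the indices that satisfy the condition
y = [1, 0, 1]
conditional_total = sum(a[i] for i in range(len(a)) if y[i] == 1)

print(total, product, conditional_total)  # 9.0 24.0 6.0
```

Note that the math indexes from 1 while Python indexes from 0; the range is the same set of items either way.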

Logs turn products into sums: \(\log \prod_i a_i = \sum_i \log a_i\). This is why log-likelihoods (sums) are preferred over raw likelihoods (products) — they avoid numerical underflow and are easier to differentiate.
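The underflow problem is easy to reproduce. The sketch below assumes 2000 hypothetical per-token probabilities of 0.5 each; the raw product falls below the smallest representable double, while the sum of logs stays an ordinary number.

```python
import math

# 2000 hypothetical per-token probabilities, each 0.5
probs = [0.5] * 2000

likelihood = math.prod(probs)                     # raw product of probabilities
log_likelihood = sum(math.log(p) for p in probs)  # same information, as a sum of logs

print(likelihood)      # 0.0 (underflow: 0.5**2000 is too small for a float)
print(log_likelihood)  # about -1386.29
```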

3. The conditional bar \(\;\mid\;\)

In a probability like \(p(y \mid x)\), the vertical bar reads "given": "the probability of \(y\) given \(x\)." Everything left of the bar is what you're asking about; everything right is treated as fixed, known background.

\[ \underbrace{p}_{\text{a distribution}}(\,\underbrace{y}_{\text{asked about}} \mid \underbrace{x}_{\text{held fixed}}\,) \]

Example: for a language model, \(p(y \mid x)\) is how likely response \(y\) is once the prompt \(x\) is already fixed. The same bar appears inside expectations and divergences to signal conditioning.

4. Distributions and the \(\;\cdot\;\) placeholder

A probability distribution assigns a probability to every possible outcome. Two notational habits matter:

Named distributions (often calligraphic)

A script letter such as \(\mathcal{D}\) names a distribution as a whole — e.g. "the data distribution." You usually don't evaluate \(\mathcal{D}\) at a point; you draw samples from it (§5). Lowercase \(p,\,q\) name distributions you do evaluate, as in \(p(x)\).

The dot \(\cdot\) means "the whole distribution"

Picture \(p(\cdot \mid x)\) as the full bar chart of probabilities over every outcome, while \(p(y\mid x)\) is the height of one single bar. You need the whole chart whenever you sample from it (§5) or compare it against another chart (§8).
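In code, the difference between the whole chart and a single bar is the difference between a lookup table and one entry in it. A small sketch with a made-up conditional distribution (the prompts, responses, and probabilities are invented for illustration):

```python
# Hypothetical toy model: for each prompt x, a distribution over responses y.
p = {
    "hello": {"hi": 0.5, "hey": 0.25, "bye": 0.25},
    "thanks": {"welcome": 0.75, "sure": 0.25},
}

x = "hello"
whole_distribution = p[x]   # p(. | x): the full bar chart for this prompt
one_bar = p[x]["hi"]        # p(y | x) with y = "hi": the height of one bar

# With x held fixed, the probabilities over y sum to 1.
row_total = sum(whole_distribution.values())
print(one_bar, row_total)  # 0.5 1.0
```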

5. Sampling: the tilde \(\;\sim\;\)

\(x \sim \mathcal{D}\) reads "\(x\) is drawn from (distributed according to) \(\mathcal{D}\)." It signals that \(x\) is random and names the distribution governing how often each value appears.

Whenever you see \(\sim\), ask: "which symbol is the random one, and what's its distribution?" Everything else is, for the moment, held fixed.
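A quick way to internalize \(x \sim \mathcal{D}\) is to draw many samples and check how often each value appears. This sketch assumes a hypothetical three-outcome distribution and uses the standard library's `random.choices`:

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

# A hypothetical distribution D over three outcomes.
outcomes = ["a", "b", "c"]
weights = [0.5, 0.25, 0.25]

# x ~ D, repeated 10,000 times
draws = random.choices(outcomes, weights=weights, k=10_000)

# Empirical frequencies should land close to the weights above.
freq = {o: c / len(draws) for o, c in Counter(draws).items()}
print(freq)
```

The frequencies approach the weights as the number of draws grows, which is what "distributed according to \(\mathcal{D}\)" means operationally.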

6. Expectation \(\;\mathbb{E}\;\)

\(\mathbb{E}\) (blackboard-bold "E") is the expected value — a probability-weighted average. The subscript says what is random; the brackets hold the quantity being averaged.

\[ \mathbb{E}_{x\sim p}\big[\, f(x) \,\big] \;=\; \sum_{x} \underbrace{p(x)}_{\text{how likely } x \text{ is}} \cdot \underbrace{f(x)}_{\text{value at } x} \qquad\Big(\text{or } \int p(x)\,f(x)\,dx \text{ if continuous}\Big) \]

In words: run \(f\) on every possible \(x\), weight each result by how probable that \(x\) is, and add them up. Likely values count more; rare values count less.

Concrete example

A fair die, \(f(x)=x\) (the face value). Each face has probability \(1/6\):

\[ \mathbb{E}[x] = \tfrac16(1) + \tfrac16(2) + \cdots + \tfrac16(6) = 3.5 \]

The "expected" value 3.5 is the long-run average — even though you can never roll a 3.5. Expectation is an average, not a prediction of any single outcome.
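The die example can be checked two ways: exactly, by carrying out the weighted sum, and approximately, by averaging many simulated rolls. The simulation size and seed below are arbitrary choices for illustration.

```python
import random
from fractions import Fraction

faces = range(1, 7)
p = {x: Fraction(1, 6) for x in faces}  # fair die

# Exact: sum_x p(x) * f(x) with f(x) = x
exact = sum(p[x] * x for x in faces)
print(exact)  # 7/2

# Monte Carlo: average f over many draws x ~ p
random.seed(0)
rolls = [random.randint(1, 6) for _ in range(100_000)]
estimate = sum(rolls) / len(rolls)
print(estimate)  # close to 3.5, but not exactly equal
```

The second form matters in practice: when the sum over all \(x\) is too large to enumerate, expectations are estimated by sampling.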

Subscripts can stack to express several sources of randomness, evaluated as a nested process — draw the first variable, then the next, then average:

\[ \mathbb{E}_{x\sim\mathcal{D},\; y\sim p(\cdot\mid x)}\big[\, f(x,y) \,\big] \]
A subscript is often dropped when "obvious from context," e.g. \(\mathbb{E}[f(x)]\). When in doubt, restore it — naming the distribution removes most ambiguity in a derivation.
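A stacked subscript becomes a nested sum: the outer sum runs over \(x\), the inner one over \(y\) given that \(x\). The sketch below uses a made-up two-prompt, two-response setup and an arbitrary score `f`.

```python
# Hypothetical toy setup: two prompts, two responses.
D = {"x1": 0.5, "x2": 0.5}              # x ~ D
p = {"x1": {"y1": 0.75, "y2": 0.25},    # y ~ p(. | x)
     "x2": {"y1": 0.25, "y2": 0.75}}

def f(x, y):
    # Arbitrary score for illustration: 1 when the response "matches" the prompt.
    return 1.0 if (x, y) in {("x1", "y1"), ("x2", "y2")} else 0.0

# E_{x~D, y~p(.|x)}[f(x, y)]: outer sum over x, inner sum over y given that x.
expectation = sum(
    D[x] * sum(p[x][y] * f(x, y) for y in p[x])
    for x in D
)
print(expectation)  # 0.75
```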

7. Optimization: \(\;\max,\; \min,\; \arg\max\;\)

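\(\max_{z}\, g(z)\) is "the largest value \(g\) takes as \(z\) varies"; \(\min_z\, g(z)\) is the smallest. \(\arg\max_z\, g(z)\) returns the \(z\) that achieves that largest value rather than the value itself. The subscript names what you're allowed to change.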

The variable need not be a number. If \(\pi\) is a whole function (e.g. a policy or distribution), \(\max_\pi\) searches over the entire space of functions for the best one.

In practice the optimized object is a neural network with parameters \(\theta\), so "vary \(\pi\)" really means "adjust \(\theta\) by gradient steps." The notation states the goal; training is how we approximately reach it.
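The distinction between \(\max\) (the best score) and \(\arg\max\) (the input that earns it) shows up directly in Python's built-in `max`. A small sketch over a hypothetical objective `g` and a finite candidate set:

```python
def g(z):
    # Hypothetical objective: a downward parabola peaking at z = 2.
    return -(z - 2) ** 2 + 5

candidates = [0, 1, 2, 3, 4]

best_value = max(g(z) for z in candidates)  # max_z g(z): the score
best_z = max(candidates, key=g)             # argmax_z g(z): the winner

print(best_value, best_z)  # 5 2
```

With a neural network the candidate set is far too large to enumerate, which is why training replaces this exhaustive search with gradient steps on \(\theta\).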

8. Comparing distributions: KL divergence \(\;\mathrm{KL}(p \,\|\, q)\;\)

The double bar \(\|\) separates two distributions being compared. \(\mathrm{KL}(p \,\|\, q)\) measures how different \(p\) is from \(q\) — a kind of asymmetric "distance."

\[ \mathrm{KL}(p \,\|\, q) \;=\; \mathbb{E}_{x\sim p}\!\left[\log \frac{p(x)}{q(x)}\right] \;=\; \sum_x p(x)\,\log\frac{p(x)}{q(x)} \]

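Properties worth memorizing: \(\mathrm{KL}(p\,\|\,q) \ge 0\), with equality exactly when \(p = q\); it is asymmetric, so in general \(\mathrm{KL}(p\,\|\,q) \ne \mathrm{KL}(q\,\|\,p)\); and for that reason it is not a true distance metric.

KL is the standard way to penalize a trained model for straying from a reference, and it underlies cross entropy: since \(H(p, q) = H(p) + \mathrm{KL}(p\,\|\,q)\) and the label entropy \(H(p)\) does not depend on the model, minimizing cross entropy between labels \(p\) and predictions \(q\) is equivalent to minimizing \(\mathrm{KL}(p\,\|\,q)\).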
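The definition translates almost line for line into code. The sketch below assumes two hypothetical discrete distributions over the same outcomes, with \(q(x) > 0\) wherever \(p(x) > 0\); it also checks the standard identity that cross entropy equals entropy plus KL.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as dicts over the same outcomes."""
    # Terms with p(x) = 0 contribute 0 by convention; assumes q(x) > 0 where p(x) > 0.
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}

print(kl(p, q))  # about 0.511
print(kl(q, p))  # about 0.368 (a different number: KL is asymmetric)
print(kl(p, p))  # 0.0

# Cross entropy H(p, q) = H(p) + KL(p || q)
entropy = -sum(px * math.log(px) for px in p.values())
cross_entropy = -sum(px * math.log(q[x]) for x, px in p.items())
print(abs(cross_entropy - (entropy + kl(p, q))) < 1e-12)  # True
```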

Forward vs. reverse KL

Because KL is asymmetric, swapping the arguments gives a genuinely different quantity. When fitting a model \(q_\theta\) to a target \(p\), the two orderings have names and distinct behaviors:

| | Forward KL | Reverse KL |
|---|---|---|
| Written | \(\mathrm{KL}(p \,\Vert\, q_\theta)\) | \(\mathrm{KL}(q_\theta \,\Vert\, p)\) |
| Expectation under | the target \(p\) | the model \(q_\theta\) |
| Behavior | Mass-covering: \(q\) spreads to cover every mode of \(p\) | Mode-seeking: \(q\) locks onto one mode and ignores the rest |
| Penalizes | \(q\!\approx\!0\) where \(p\!>\!0\) (missing mass) | \(q\!>\!0\) where \(p\!\approx\!0\) (spurious mass) |

The intuition is in which distribution weights the log-ratio. Forward KL averages \(\log\tfrac{p}{q}\) over \(p\), so wherever \(p\) has mass but \(q\) doesn't, the ratio blows up — \(q\) is forced to cover all of \(p\). Reverse KL averages \(\log\tfrac{q}{p}\) over \(q\), so \(q\) is only penalized where it puts mass; it can safely ignore parts of \(p\) by placing no mass there, concentrating on a single high-probability region.

Maximum-likelihood / cross-entropy training minimizes the forward KL, which makes it mass-covering and is one reason language models can be over-broad. Variational inference and the RL KL penalties on the RLHF / GRPO pages use the reverse KL \(\mathrm{KL}(\pi_\theta\,\|\,\pi_{\text{ref}})\), where the model's own samples weight the penalty.
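The mass-covering versus mode-seeking behavior can be seen in a toy fit. The sketch below is an illustration under invented assumptions: a two-mode target on the integers 0 to 20, a single-bump model family parameterized by \((\mu, \sigma)\), and a brute-force grid search in place of gradient descent.

```python
import math

xs = range(21)  # outcomes 0..20

def normalize(w):
    z = sum(w)
    return [v / z for v in w]

def bump(mu, sd):
    """Discretized, renormalized Gaussian-shaped distribution on xs."""
    return normalize([math.exp(-((x - mu) ** 2) / (2 * sd ** 2)) for x in xs])

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Target p: two equal modes at 5 and 15 (a toy example).
p = [0.5 * a + 0.5 * b for a, b in zip(bump(5, 1), bump(15, 1))]

# Model family q_theta with theta = (mu, sd), searched on a small grid.
grid = [(mu, sd) for mu in range(21) for sd in (1, 2, 3, 4, 5, 6)]

forward = min(grid, key=lambda t: kl(p, bump(*t)))  # argmin KL(p || q_theta)
reverse = min(grid, key=lambda t: kl(bump(*t), p))  # argmin KL(q_theta || p)

print("forward KL picks (mu, sd) =", forward)  # wide, centered between the modes
print("reverse KL picks (mu, sd) =", reverse)  # narrow, sitting on a single mode
```

The forward fit spreads out to cover both modes at the cost of putting mass in the empty middle; the reverse fit commits to one mode and ignores the other. Which mode the reverse fit picks here is just a tie-break of the search order.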

9. A few more symbols you'll meet

| Symbol | Reads as | Note |
|---|---|---|
| \(\propto\) | "is proportional to" | Equal up to a constant factor (often a normalizer) |
| \(\triangleq\) or \(:=\) | "is defined as" | Introduces a definition, not a derived equality |
| \(\in\) | "is an element of" | \(x \in \mathbb{R}^n\): \(x\) lives in that set |
| \(\nabla_\theta f\) | "gradient of \(f\) w.r.t. \(\theta\)" | Vector of partial derivatives; points uphill |
| \(\Vert\mathbf{x}\Vert\) | "the norm (length) of \(\mathbf{x}\)" | \(\Vert\mathbf{x}\Vert_2 = \sqrt{\sum_i x_i^2}\) |
| \(\mathbf{1}[\,\cdot\,]\) | "1 if true, else 0" | Indicator function; turns a condition into a number |
| \(\odot\) | "elementwise product" | Multiply matching entries (Hadamard product) |
| \(\hat{x}\) | "estimate / predicted \(x\)" | A hat usually marks a model's guess of a true quantity |
| \(x^*\) | "the optimal \(x\)" | A star marks the best / solution value, e.g. \(\pi^*\) is the optimal policy |
| \(a \succ b\) | "\(a\) is preferred over \(b\)" | Preference ordering; \(y_w \succ y_l\) means the winner beats the loser |
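Several of these symbols are one-liners in code. A minimal sketch with made-up vectors `x` and `w` and made-up unnormalized scores:

```python
import math

x = [3.0, 4.0]
w = [2.0, 0.5]

norm = math.sqrt(sum(xi ** 2 for xi in x))       # ||x||_2
hadamard = [xi * wi for xi, wi in zip(x, w)]     # x ⊙ w, entry by entry
indicator = [1 if xi > 3.5 else 0 for xi in x]   # 1[x_i > 3.5]

# "Proportional to": unnormalized scores become probabilities once divided by their sum.
scores = [1.0, 3.0]
probs = [s / sum(scores) for s in scores]

print(norm, hadamard, indicator, probs)  # 5.0 [6.0, 2.0] [0, 1] [0.25, 0.75]
```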

10. Worked example: the RLHF objective

Here is a real, dense objective — the KL-constrained RLHF objective from the DPO page. Nothing in it is new; it is just the pieces above stacked together.

\[ \underbrace{\max_{\pi}}_{\text{§7}}\; \underbrace{\mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi(\cdot\mid x)}}_{\text{§4,5,6}} \big[\,\underbrace{r(x,y)}_{\text{§1}}\,\big] \;-\; \underbrace{\beta}_{\text{§1}}\, \underbrace{\mathrm{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big)}_{\text{§3,8}} \]

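Translate left to right: \(\max_\pi\) says to search over policies for the best one; the expectation averages over prompts \(x\) drawn from the data \(\mathcal{D}\) and responses \(y\) sampled from the policy itself; \(r(x,y)\) is the reward for that response; \(\beta\) is a scalar weight; and the KL term measures how far the policy's distribution over responses to \(x\) has moved from the reference's.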

In plain English

"Find the policy \(\pi\) that makes its average reward as large as possible — over prompts from the data and responses the policy generates — while not drifting too far from the reference model, with \(\beta\) controlling how strongly that drift is penalized."

Two competing forces, traded off by \(\beta\): the expectation term pulls toward high-reward behavior; the KL term tethers the policy to the reference so it stays well-behaved. Read this way, the equation is a sentence — which is the whole point of learning the notation.
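The trade-off can be made concrete by evaluating the objective on a tiny discrete problem. Everything below is hypothetical: one prompt, three responses, invented rewards and reference probabilities. The sketch also assumes the usual reading in which the KL term is averaged over prompts \(x \sim \mathcal{D}\) along with the reward.

```python
import math

# Toy setup (all values hypothetical): one prompt, three candidate responses.
D = {"x": 1.0}
reward = {"x": {"good": 1.0, "ok": 0.5, "bad": 0.0}}
pi_ref = {"x": {"good": 0.25, "ok": 0.5, "bad": 0.25}}

def kl(p, q):
    return sum(p[y] * math.log(p[y] / q[y]) for y in p if p[y] > 0)

def objective(pi, beta):
    """E_{x~D}[ E_{y~pi(.|x)}[r(x, y)] - beta * KL(pi(.|x) || pi_ref(.|x)) ]."""
    total = 0.0
    for x, px in D.items():
        expected_reward = sum(pi[x][y] * reward[x][y] for y in pi[x])
        total += px * (expected_reward - beta * kl(pi[x], pi_ref[x]))
    return total

# A policy that puts all its mass on the highest-reward response.
greedy = {"x": {"good": 1.0, "ok": 0.0, "bad": 0.0}}

for beta in (0.1, 5.0):
    print(beta, objective(pi_ref, beta), objective(greedy, beta))
```

With a small \(\beta\) the greedy policy scores higher than staying at the reference; with a large \(\beta\) the KL penalty dominates and the reference wins. The reference itself always scores its plain expected reward, since its KL to itself is zero.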

Cheat sheet

| Symbol | Meaning |
|---|---|
| \(f(x)\), \(\mathbf{x}\in\mathbb{R}^n\) | Function applied to \(x\); \(\mathbf{x}\) is an \(n\)-vector |
| \(\sum\), \(\prod\) | Add up / multiply over an index range |
| \(p(y\mid x)\) | Probability of \(y\) given \(x\) |
| \(p(\cdot \mid x)\) | The whole distribution, not one value |
| \(x \sim p\) | \(x\) is randomly drawn from \(p\) |
| \(\mathbb{E}_{x\sim p}[f]\) | Probability-weighted average of \(f\) |
| \(\max,\,\arg\max\) | Best value vs. the winner that achieves it |
| \(\mathrm{KL}(p\,\Vert\,q)\) | How far \(p\) is from \(q\); \(\ge 0\), asymmetric |

| When you see… | Ask yourself |
|---|---|
| A bold/uppercase letter or \(\mathbb{R}^n\) | Scalar, vector, or matrix: what shape? |
| A subscript on \(\sum\), \(\prod\), or \(\mathbb{E}\) | What's the index / what is random, and over what range? |
| A vertical bar \(\mid\) | What's asked about (left) vs. held fixed (right)? |
| A dot \(\cdot\) in an argument slot | Whole distribution, or single value? |
| A tilde \(\sim\) | Which symbol is the random draw, from what? |
| \(\max\) / \(\arg\max\) with a subscript | What can I vary, and do I want the score or the winner? |
| A double bar \(\Vert\) | Two distributions compared; order matters. |