Reading ML Math Notation — A Practical Reference

Most ML equations reuse the same small vocabulary of symbols. Once you can read each piece on its own, even dense objectives become a sentence you can translate. This page builds that vocabulary one symbol at a time, then applies all of it to a real objective in §10.

1. The basic objects: scalars, vectors, functions

Before any operation, know what kind of thing a symbol is.

Scalar — a single number, usually lowercase italic: \(a,\, x,\, \beta\). A lone Greek letter with no parentheses is typically a fixed constant (a hyperparameter knob).
Vector — an ordered list of numbers, often bold or arrowed: \(\mathbf{x}\), \(\vec{x}\). "\(\mathbf{x}\in\mathbb{R}^n\)" reads "\(\mathbf{x}\) is a vector of \(n\) real numbers."
Matrix — a 2-D grid, usually uppercase: \(W \in \mathbb{R}^{m\times n}\) ("\(m\) rows, \(n\) columns").
Function — a rule mapping inputs to an output: \(f(x)\) is "\(f\) applied to \(x\)." It can take several arguments, \(r(x,y)\), and return a scalar, vector, or anything else.

The blackboard-bold \(\mathbb{R}\) means "the real numbers." So \(f:\mathbb{R}^n \to \mathbb{R}\) is shorthand for "\(f\) takes an \(n\)-vector and returns one number." The arrow \(\to\) describes the function's type, not a limit.

Letter conventions are just habits, not rules: \(\alpha,\beta,\gamma,\lambda\) tend to be scalar weights; \(\theta,\phi,\mathbf{w}\) tend to be learnable parameters; \(\pi,\,p,\,q\) tend to be distributions; \(\sigma\) is often the sigmoid or a standard deviation. Always confirm from context.

In reinforcement learning and LLM fine-tuning, \(\pi\) almost always denotes a policy: a conditional distribution \(\pi(y\mid x)\) that, given an input \(x\) (a prompt or state), assigns a probability to each possible output \(y\) (a response or action). "Running the policy" means sampling \(y\sim\pi(\cdot\mid x)\). A language model is a policy in this sense, and training (RLHF, DPO, GRPO) reshapes which outputs it makes likely. You'll meet it as \(\pi_\theta\) (the trainable policy), \(\pi_{\text{ref}}\) (a frozen reference), and \(\pi^*\) (the optimal one).

Subscripts: parameters vs. indices

A subscript means one of two different things depending on what sits there:

Parameter subscript (usually a Greek letter): \(f_\theta\) reads "the function \(f\) parameterized by \(\theta\)." Here \(\theta\) is the whole vector of learnable weights, so \(f_\theta\) is typically a neural network and the subscript names what training adjusts. E.g. \(\pi_\theta\) (policy), \(r_\phi\) (reward model), \(V_\psi\) (value network). A word-subscript like \(\pi_{\text{ref}}\) is just a label (the "reference" model), not parameters.
Index subscript (usually a Latin letter or an integer counter): \(a_i\), \(x_t\), \(r_i\) pick out the \(i\)-th / \(t\)-th element of a collection or sequence. This is the same role the subscript plays under \(\sum\) and \(\prod\) (§2).

The two can coexist without clashing because they look different: in \(r_\phi(x, o_i)\), the Greek \(\phi\) names the reward model's parameters while the running integer \(i\) indexes which output. Likewise \(\mathcal{L}(\theta)\) — parameters in the argument slot — signals "this loss is a function of the weights \(\theta\)," the knobs gradient descent turns.
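The two subscript roles map onto two different things in code. The sketch below is only an illustration: the linear scorer `f`, the parameter values, and the list `outputs` are made-up stand-ins, not anything defined on this page.

```python
# Parameter subscript: theta is the whole vector of learnable weights.
theta = [0.5, -1.0]

def f(theta, x):
    """f_theta(x): a toy linear scorer, standing in for a neural network."""
    return sum(t * x_j for t, x_j in zip(theta, x))

# Index subscript: o_i picks out the i-th item of a collection.
# Note that Python counts from 0, so o_1 is outputs[0].
outputs = [[1.0, 2.0], [3.0, 0.0], [0.0, 1.0]]   # o_1, o_2, o_3

# f_theta(o_i) for every i: same parameters, different indexed input.
scores = [f(theta, o_i) for o_i in outputs]
```

Changing `theta` changes the function itself; changing `i` only changes which input the same function sees.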

2. Sums and products: \(\;\sum\;\) and \(\;\prod\;\)

The capital sigma \(\sum\) means "add up"; the capital pi \(\prod\) means "multiply together." The decorations tell you the index variable and its range.

\[ \sum_{i=1}^{n} a_i \;=\; a_1 + a_2 + \cdots + a_n \qquad\qquad \prod_{i=1}^{n} a_i \;=\; a_1 \cdot a_2 \cdots a_n \]

Read the subscript as "start here" and the superscript as "stop here." The body to the right is evaluated once per index value. A subscript with a condition, like \(\sum_{i \,:\, y_i = 1}\), means "sum over every \(i\) that satisfies the condition."

Logs turn products into sums: \(\log \prod_i a_i = \sum_i \log a_i\). This is why log-likelihoods (sums) are preferred over raw likelihoods (products) — they avoid numerical underflow and are easier to differentiate.
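A short Python sketch of the same ideas, using made-up numbers: a sum, a product, a conditional sum, and the reason log-likelihoods are preferred in practice.

```python
import math

a = [0.5, 0.25, 0.125]
y = [1, 0, 1]

total = sum(a)            # sum_{i=1}^{3} a_i
product = math.prod(a)    # prod_{i=1}^{3} a_i

# Conditional sum: add a_i only for those i where y_i == 1.
conditional = sum(a_i for a_i, y_i in zip(a, y) if y_i == 1)

# log of a product equals the sum of the logs.
log_of_product = math.log(product)
sum_of_logs = sum(math.log(a_i) for a_i in a)

# Multiplying many small probabilities underflows to 0.0 in floating point,
# while the sum of their logs stays a perfectly ordinary number.
probs = [0.01] * 1000
raw_likelihood = math.prod(probs)
log_likelihood = sum(math.log(p) for p in probs)
```

Here `raw_likelihood` is numerically zero, so it carries no usable information, whereas `log_likelihood` is roughly -4605 and can still be compared or differentiated.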

3. The conditional bar \(\;\mid\;\)

In a probability like \(p(y \mid x)\), the vertical bar reads "given": "the probability of \(y\) given \(x\)." Everything left of the bar is what you're asking about; everything right is treated as fixed, known background.

\[ \underbrace{p}_{\text{a distribution}}(\,\underbrace{y}_{\text{asked about}} \mid \underbrace{x}_{\text{held fixed}}\,) \]

Example: for a language model, \(p(y \mid x)\) is how likely response \(y\) is once the prompt \(x\) is already fixed. The same bar appears inside expectations and divergences to signal conditioning.

4. Distributions and the \(\;\cdot\;\) placeholder

A probability distribution assigns a probability to every possible outcome. Two notational habits matter:

Named distributions (often calligraphic)

A script letter such as \(\mathcal{D}\) names a distribution as a whole — e.g. "the data distribution." You usually don't evaluate \(\mathcal{D}\) at a point; you draw samples from it (§5). Lowercase \(p,\,q\) name distributions you do evaluate, as in \(p(x)\).

The dot \(\cdot\) means "the whole distribution"

\(p(y \mid x)\) — a single number: the probability of one specific \(y\).
\(p(\cdot \mid x)\) — the entire distribution over all possible \(y\). The dot is a blank standing for "any value," so this is "the whole probability function, for every \(y\), given \(x\)."

Picture \(p(\cdot \mid x)\) as the full bar chart of probabilities over every outcome, while \(p(y\mid x)\) is the height of one single bar. You need the whole chart whenever you sample from it (§5) or compare it against another chart (§8).

5. Sampling: the tilde \(\;\sim\;\)

\(x \sim \mathcal{D}\) reads "\(x\) is drawn from (distributed according to) \(\mathcal{D}\)." It signals that \(x\) is random and names the distribution governing how often each value appears.

\(x \sim \mathcal{D}\) — pick \(x\) at random from the data distribution.
\(y \sim p(\cdot\mid x)\) — having fixed \(x\), draw \(y\) from the conditional distribution (note the dot: you sample from the whole distribution, not a single value).

Whenever you see \(\sim\), ask: "which symbol is the random one, and what's its distribution?" Everything else is, for the moment, held fixed.
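The distinction between \(p(y\mid x)\), \(p(\cdot\mid x)\), and \(y\sim p(\cdot\mid x)\) can be sketched with a dictionary. The prompts, responses, and probabilities here are invented for illustration, and `sample` is a hypothetical helper built on the standard library's `random.choices`.

```python
import random

# One "bar chart" per prompt x: p[x] plays the role of p(. | x).
p = {
    "hello": {"hi": 0.7, "hey": 0.2, "bye": 0.1},
    "thanks": {"welcome": 0.9, "bye": 0.1},
}

def sample(dist, rng):
    """Draw one outcome from dist, a mapping outcome -> probability."""
    outcomes = list(dist)
    weights = list(dist.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]

rng = random.Random(0)     # seeded so the run is repeatable
x = "hello"                # held fixed (right of the bar)

whole_chart = p[x]         # p(. | x): the entire distribution
one_bar = p[x]["hi"]       # p(y | x) for y = "hi": a single number
y = sample(whole_chart, rng)   # y ~ p(. | x): one random draw

# Over many draws, the observed frequency approaches the bar height.
draws = [sample(whole_chart, rng) for _ in range(10_000)]
freq_hi = draws.count("hi") / len(draws)
```

Sampling needs `whole_chart`, not `one_bar`: a single probability says nothing about what the other outcomes are.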

6. Expectation \(\;\mathbb{E}\;\)

\(\mathbb{E}\) (blackboard-bold "E") is the expected value — a probability-weighted average. The subscript says what is random; the brackets hold the quantity being averaged.

\[ \mathbb{E}_{x\sim p}\big[\, f(x) \,\big] \;=\; \sum_{x} \underbrace{p(x)}_{\text{how likely } x \text{ is}} \cdot \underbrace{f(x)}_{\text{value at } x} \qquad\Big(\text{or } \int p(x)\,f(x)\,dx \text{ if continuous}\Big) \]

In words: run \(f\) on every possible \(x\), weight each result by how probable that \(x\) is, and add them up. Likely values count more; rare values count less.

Concrete example

A fair die, \(f(x)=x\) (the face value). Each face has probability \(1/6\):

\[ \mathbb{E}[x] = \tfrac16(1) + \tfrac16(2) + \cdots + \tfrac16(6) = 3.5 \]

The "expected" value 3.5 is the long-run average — even though you can never roll a 3.5. Expectation is an average, not a prediction of any single outcome.
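The die example can be checked two ways: exactly, by the weighted sum in the definition, and approximately, by averaging many simulated rolls. This is a minimal sketch; the helper name `expectation` is an assumption of the example.

```python
import random
from fractions import Fraction

# A fair die: each face x has probability 1/6.
p = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(p, f):
    """E_{x~p}[f(x)] as a probability-weighted sum over all outcomes."""
    return sum(p[x] * f(x) for x in p)

exact = expectation(p, lambda x: x)    # exact arithmetic with fractions

# Monte Carlo estimate: the long-run average of actual rolls.
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(100_000)]
estimate = sum(rolls) / len(rolls)
```

`exact` is the fraction 7/2, and `estimate` lands close to 3.5 even though no individual roll ever equals it.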

Subscripts can stack to express several sources of randomness, evaluated as a nested process — draw the first variable, then the next, then average:

\[ \mathbb{E}_{x\sim\mathcal{D},\; y\sim p(\cdot\mid x)}\big[\, f(x,y) \,\big] \]

A subscript is often dropped when "obvious from context," e.g. \(\mathbb{E}[f(x)]\). When in doubt, restore it — naming the distribution removes most ambiguity in a derivation.
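A stacked subscript corresponds to a nested weighted sum: the outer sum runs over \(x\), the inner over \(y\) given that \(x\). The distributions and the function `f` below are toy assumptions chosen so the arithmetic is easy to follow.

```python
# Outer distribution D over x, and a conditional p(. | x) for each x.
D = {"x1": 0.5, "x2": 0.5}
p = {
    "x1": {"a": 0.8, "b": 0.2},
    "x2": {"a": 0.1, "b": 0.9},
}

def f(x, y):
    """Toy quantity being averaged: 1 when y is 'a', else 0."""
    return 1.0 if y == "a" else 0.0

# E_{x~D, y~p(.|x)}[f(x, y)] = sum_x D(x) * sum_y p(y|x) * f(x, y)
nested = sum(
    D[x] * sum(p[x][y] * f(x, y) for y in p[x])
    for x in D
)
```

With these numbers the result is 0.5 * 0.8 + 0.5 * 0.1 = 0.45.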

7. Optimization: \(\;\max,\; \min,\; \arg\max\;\)

\(\max_{z}\, g(z)\) is "the largest value \(g\) takes as \(z\) varies"; \(\min_z\) is the smallest. The subscript names what you're allowed to change.

\(\max_z g(z)\) / \(\min_z g(z)\) — return the best value of \(g\).
\(\arg\max_z g(z)\) / \(\arg\min_z g(z)\) — return the \(z\) that achieves it (the winner, not the score).

The variable need not be a number. If \(\pi\) is a whole function (e.g. a policy or distribution), \(\max_\pi\) searches over the entire space of functions for the best one.

In practice the optimized object is a neural network with parameters \(\theta\), so "vary \(\pi\)" really means "adjust \(\theta\) by gradient steps." The notation states the goal; training is how we approximately reach it.
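A small sketch of score versus winner, followed by a few gradient steps that approximate the same answer. The function `g`, the candidate list, the step size, and the step count are all illustrative choices.

```python
def g(z):
    """A toy objective with its peak at z = 2."""
    return -(z - 2) ** 2 + 5

candidates = [0, 1, 2, 3, 4]

best_value = max(g(z) for z in candidates)    # max_z g(z): the score
best_z = max(candidates, key=g)               # argmax_z g(z): the winner
worst_value = min(g(z) for z in candidates)   # min_z g(z)

def grad_g(z):
    """Derivative of g with respect to z."""
    return -2 * (z - 2)

# Gradient ascent: repeatedly nudge z uphill instead of searching a list.
z = 0.0
for _ in range(100):
    z += 0.1 * grad_g(z)
```

The exhaustive search returns the exact winner from a finite list; the gradient loop only approaches it, which mirrors how training approximately reaches the stated goal.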

8. Comparing distributions: KL divergence \(\;\mathrm{KL}(p \,\|\, q)\;\)

The double bar \(\|\) separates two distributions being compared. \(\mathrm{KL}(p \,\|\, q)\) measures how different \(p\) is from \(q\) — a kind of asymmetric "distance."

\[ \mathrm{KL}(p \,\|\, q) \;=\; \mathbb{E}_{x\sim p}\!\left[\log \frac{p(x)}{q(x)}\right] \;=\; \sum_x p(x)\,\log\frac{p(x)}{q(x)} \]

Properties worth memorizing:

\(\mathrm{KL}(p\,\|\,q) \ge 0\) always, and \(= 0\) exactly when \(p\) and \(q\) are identical.
It is asymmetric: \(\mathrm{KL}(p\,\|\,q) \neq \mathrm{KL}(q\,\|\,p)\) in general — order matters.
It is itself an expectation (§6) of a log-ratio, so it combines several pieces from this page at once.

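KL is the standard way to penalize a trained model for straying from a reference, and it underlies cross entropy. Writing \(H(p,q) = -\sum_x p(x)\log q(x)\) for the cross entropy and \(H(p) = -\sum_x p(x)\log p(x)\) for the entropy of the labels, we have \(H(p,q) = H(p) + \mathrm{KL}(p\,\|\,q)\). Since \(H(p)\) does not depend on the model \(q\), minimizing cross entropy between predictions and labels is equivalent to minimizing \(\mathrm{KL}(p\,\|\,q)\).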
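The properties above can be verified numerically for small discrete distributions. This is a sketch under the assumption that \(q(x) > 0\) wherever \(p(x) > 0\); the two distributions are arbitrary examples, and `entropy` and `cross_entropy` are helper names introduced here.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions stored as dicts.
    Assumes q[x] > 0 wherever p[x] > 0."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

def entropy(p):
    return -sum(p[x] * math.log(p[x]) for x in p if p[x] > 0)

def cross_entropy(p, q):
    return -sum(p[x] * math.log(q[x]) for x in p if p[x] > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}

kl_pq = kl(p, q)    # about 0.51
kl_qp = kl(q, p)    # about 0.37: a different number, so order matters
kl_pp = kl(p, p)    # zero: identical distributions

# Cross entropy = entropy of p + KL(p || q).
gap = cross_entropy(p, q) - (entropy(p) + kl_pq)
```

Because `entropy(p)` is fixed by the labels, any change in `cross_entropy(p, q)` as `q` varies comes entirely from the KL term.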

Forward vs. reverse KL

Because KL is asymmetric, swapping the arguments gives a genuinely different quantity. When fitting a model \(q_\theta\) to a target \(p\), the two orderings have names and distinct behaviors:

	Forward KL	Reverse KL
Written	\(\mathrm{KL}(p \,\|\, q_\theta)\)	\(\mathrm{KL}(q_\theta \,\|\, p)\)
Expectation under	the target \(p\)	the model \(q_\theta\)
Behavior	Mass-covering — \(q\) spreads to cover every mode of \(p\)	Mode-seeking — \(q\) locks onto one mode and ignores the rest
Penalizes	\(q\!\approx\!0\) where \(p\!>\!0\) (missing mass)	\(q\!>\!0\) where \(p\!\approx\!0\) (spurious mass)

The intuition is in which distribution weights the log-ratio. Forward KL averages \(\log\tfrac{p}{q}\) over \(p\), so wherever \(p\) has mass but \(q\) doesn't, the ratio blows up — \(q\) is forced to cover all of \(p\). Reverse KL averages \(\log\tfrac{q}{p}\) over \(q\), so \(q\) is only penalized where it puts mass; it can safely ignore parts of \(p\) by placing no mass there, concentrating on a single high-probability region.

Maximum-likelihood / cross-entropy training minimizes the forward KL, which is mass-covering; this is one reason language models can spread probability over too broad a set of outputs. Variational inference and the RL KL penalties on the RLHF / GRPO pages use the reverse KL \(\mathrm{KL}(\pi_\theta\,\|\,\pi_{\text{ref}})\), where the model's own samples weight the penalty.
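The mass-covering versus mode-seeking contrast shows up even in a tiny hand-built case. The sketch below does not optimize anything: it only compares two hand-picked candidates against a made-up bimodal target, so treat it as an illustration of the tendency rather than a general result.

```python
import math

def kl(p, q):
    """KL(p || q) for distributions given as equal-length lists."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Target p with two modes, at the first and last outcome.
p = [0.45, 0.04, 0.02, 0.04, 0.45]

# Two candidate models: one spread out, one locked onto a single mode.
q_broad = [0.2, 0.2, 0.2, 0.2, 0.2]
q_peak = [0.9, 0.04, 0.02, 0.02, 0.02]

forward = {"broad": kl(p, q_broad), "peak": kl(p, q_peak)}   # KL(p || q)
reverse = {"broad": kl(q_broad, p), "peak": kl(q_peak, p)}   # KL(q || p)

forward_choice = min(forward, key=forward.get)
reverse_choice = min(reverse, key=reverse.get)
```

Forward KL scores the broad candidate better (about 0.56 versus 1.12), because the peaked one nearly misses the second mode. Reverse KL scores the peaked candidate better (about 0.55 versus 0.78), because it puts little mass where the target has little.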

9. A few more symbols you'll meet

Symbol	Reads as	Note
\(\propto\)	"is proportional to"	Equal up to a constant factor (often a normalizer)
\(\triangleq\) or \(:=\)	"is defined as"	Introduces a definition, not a derived equality
\(\in\)	"is an element of"	\(x \in \mathbb{R}^n\): \(x\) lives in that set
\(\nabla_\theta f\)	"gradient of \(f\) w.r.t. \(\theta\)"	Vector of partial derivatives; points uphill
\(\|\mathbf{x}\|\)	"the norm (length) of \(\mathbf{x}\)"	\(\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}\)
\(\mathbf{1}[\,\cdot\,]\)	"1 if true, else 0"	Indicator function; turns a condition into a number
\(\odot\)	"elementwise product"	Multiply matching entries (Hadamard product)
\(\hat{x}\)	"estimate / predicted \(x\)"	A hat usually marks a model's guess of a true quantity
\(x^*\)	"the optimal \(x\)"	A star marks the best / solution value, e.g. \(\pi^*\) is the optimal policy
\(a \succ b\)	"\(a\) is preferred over \(b\)"	Preference ordering; \(y_w \succ y_l\) means the winner beats the loser
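Several of the symbols in this table have direct one-line counterparts in code. The values are arbitrary, and `indicator` is a helper name chosen for this sketch.

```python
import math

# Norm: ||x||_2 = sqrt(sum_i x_i^2)
x = [3.0, 4.0]
norm2 = math.sqrt(sum(x_i ** 2 for x_i in x))

# Indicator: 1 if the condition is true, else 0.
def indicator(condition):
    return 1 if condition else 0

y = [1, 0, 1, 1]
count_ones = sum(indicator(y_i == 1) for y_i in y)

# Elementwise (Hadamard) product: multiply matching entries.
a = [1, 2, 3]
b = [4, 5, 6]
hadamard = [a_i * b_i for a_i, b_i in zip(a, b)]

# Proportional to: probs is proportional to scores, and dividing by the
# normalizer Z turns the scores into a distribution that sums to 1.
scores = [1.0, 3.0]
Z = sum(scores)
probs = [s / Z for s in scores]
```

Summing an indicator is the usual way a condition such as "how many labels equal 1" gets written as an equation.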

10. Worked example: the RLHF objective

Here is a real, dense objective — the KL-constrained RLHF objective from the DPO page. Nothing in it is new; it is just the pieces above stacked together.

\[ \underbrace{\max_{\pi}}_{\text{§7}}\; \underbrace{\mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi(\cdot\mid x)}}_{\text{§4,5,6}} \big[\,\underbrace{r(x,y)}_{\text{§1}}\,\big] \;-\; \underbrace{\beta}_{\text{§1}}\, \underbrace{\mathrm{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big)}_{\text{§3,8}} \]

Translate left to right:

\(\max_\pi\) — choose the policy \(\pi\) that makes the rest as large as possible (§7).
\(\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot\mid x)}[\,r(x,y)\,]\) — the policy's average reward: draw a prompt \(x\) from the data, draw a response \(y\) from the policy given \(x\), score it with \(r\), and average (§4–6).
\(-\,\beta\,\mathrm{KL}(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x))\) — minus a penalty for how far the policy's distribution has drifted from the reference model's, scaled by the knob \(\beta\) (§1, 3, 8).

In plain English

"Find the policy \(\pi\) that makes its average reward as large as possible — over prompts from the data and responses the policy generates — while not drifting too far from the reference model, with \(\beta\) controlling how strongly that drift is penalized."

Two competing forces, traded off by \(\beta\): the expectation term pulls toward high-reward behavior; the KL term tethers the policy to the reference so it stays well-behaved. Read this way, the equation is a sentence — which is the whole point of learning the notation.
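To see the trade-off concretely, the objective can be evaluated for a tiny discrete setup. This sketch rests on stated assumptions: the KL term is averaged over the same prompts \(x\sim\mathcal{D}\) as the reward (a common reading of the formula), and the prompts, rewards, and candidate policies are invented for illustration.

```python
import math

D = {"x1": 0.5, "x2": 0.5}                      # prompt distribution
reward = {("x1", "a"): 1.0, ("x1", "b"): 0.0,   # r(x, y)
          ("x2", "a"): 0.0, ("x2", "b"): 1.0}
pi_ref = {"x1": {"a": 0.5, "b": 0.5},           # frozen reference policy
          "x2": {"a": 0.5, "b": 0.5}}

def kl(p, q):
    """KL(p || q) for dict distributions; assumes q > 0 wherever p > 0."""
    return sum(p[y] * math.log(p[y] / q[y]) for y in p if p[y] > 0)

def objective(pi, beta):
    """Average reward minus beta times the average drift from pi_ref."""
    total = 0.0
    for x, d_x in D.items():
        avg_reward = sum(pi[x][y] * reward[(x, y)] for y in pi[x])
        drift = kl(pi[x], pi_ref[x])
        total += d_x * (avg_reward - beta * drift)
    return total

# A candidate policy that chases reward almost greedily.
greedy = {"x1": {"a": 0.99, "b": 0.01},
          "x2": {"a": 0.01, "b": 0.99}}

j_ref = objective(pi_ref, beta=0.1)          # no drift, average reward 0.5
j_greedy_small = objective(greedy, beta=0.1)
j_greedy_large = objective(greedy, beta=5.0)
```

With a small \(\beta\) the greedy policy scores higher than the reference (about 0.93 versus 0.5); with a large \(\beta\) the drift penalty dominates and it scores lower (about -2.2), which is the tether described above.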

Cheat sheet

Symbol	Meaning
\(f(x)\), \(\mathbf{x}\in\mathbb{R}^n\)	Function applied to \(x\); \(\mathbf{x}\) is an \(n\)-vector
\(\sum\), \(\prod\)	Add up / multiply over an index range
\(p(y\mid x)\)	Probability of \(y\) given \(x\)
\(p(\cdot \mid x)\)	The whole distribution, not one value
\(x \sim p\)	\(x\) is randomly drawn from \(p\)
\(\mathbb{E}_{x\sim p}[f]\)	Probability-weighted average of \(f\)
\(\max,\,\arg\max\)	Best value vs. the winner that achieves it
\(\mathrm{KL}(p\,\|\,q)\)	How far \(p\) is from \(q\); \(\ge 0\), asymmetric

When you see…	Ask yourself
A bold/uppercase letter or \(\mathbb{R}^n\)	Scalar, vector, or matrix — what shape?
A subscript on \(\sum\), \(\prod\), or \(\mathbb{E}\)	What's the index / what is random — and over what range?
A vertical bar \(\mid\)	What's asked about (left) vs. held fixed (right)?
A dot \(\cdot\) in an argument slot	Whole distribution, or single value?
A tilde \(\sim\)	Which symbol is the random draw, from what?
\(\max\) / \(\arg\max\) with a subscript	What can I vary — and do I want the score or the winner?
A double bar \(\|\)	Two distributions compared; order matters.