ML Math

AI Learning Notes

Worked derivations for core ML building blocks — forward passes, gradients, and intuitions.

Loss Functions

Cross Entropy
Cross Entropy — Forward & Backward Pass

Softmax probabilities, numerically stable forward pass, and the clean p − y gradient derivation through the combined softmax + loss.

Read derivation →
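
As a rough illustration of what this derivation covers, here is a minimal pure-Python sketch for a single example with an integer class label. The function names are illustrative assumptions, not taken from the linked notes.

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating so exp() cannot overflow.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Loss = logsumexp(logits) - logits[target], computed stably.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def cross_entropy_grad(logits, target):
    # Gradient w.r.t. the logits through softmax + loss: p - y (y is one-hot).
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]
```

Because the probabilities and the one-hot target each sum to 1, the gradient entries sum to zero, which makes a quick sanity check.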

Normalization

LayerNorm
Layer Normalization — Gradient Derivation

Full forward and backward pass including the three-term dx formula through mean, variance, and normalized input.

Read derivation →
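
The three-term dx formula can be sketched for a single feature vector as below. This is a hedged illustration that assumes the biased variance and an eps inside the square root; all names are illustrative and not taken from the linked notes.

```python
import math

def layernorm_forward(x, gamma, beta, eps=1e-5):
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n  # biased variance (assumption)
    inv_std = 1.0 / math.sqrt(var + eps)
    xhat = [(v - mu) * inv_std for v in x]
    y = [g * h + b for g, h, b in zip(gamma, xhat, beta)]
    return y, xhat, inv_std

def layernorm_backward(dy, xhat, inv_std, gamma):
    n = len(dy)
    dxhat = [d * g for d, g in zip(dy, gamma)]
    mean_dxhat = sum(dxhat) / n
    mean_dxhat_xhat = sum(d * h for d, h in zip(dxhat, xhat)) / n
    # Three terms: direct path, path through the mean, path through the variance.
    dx = [inv_std * (d - mean_dxhat - h * mean_dxhat_xhat)
          for d, h in zip(dxhat, xhat)]
    dgamma = [d * h for d, h in zip(dy, xhat)]
    dbeta = list(dy)
    return dx, dgamma, dbeta
```

For a single vector, dgamma and dbeta are just the per-element contributions; in a batched setting they would be summed over all positions that share the same parameters.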

RMSNorm
RMS Normalization — Gradient Derivation

Simpler two-term dx derivation, since RMSNorm skips mean subtraction. Includes a side-by-side comparison with LayerNorm and a feature table.

Read derivation →
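
A minimal sketch of the two-term dx formula for one feature vector follows. It assumes eps is added inside the square root and that there is no bias term; the function names are illustrative, not from the linked notes.

```python
import math

def rmsnorm_forward(x, gamma, eps=1e-6):
    n = len(x)
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / n + eps)
    xhat = [v * inv_rms for v in x]
    y = [g * h for g, h in zip(gamma, xhat)]
    return y, xhat, inv_rms

def rmsnorm_backward(dy, xhat, inv_rms, gamma):
    n = len(dy)
    dxhat = [d * g for d, g in zip(dy, gamma)]
    mean_dxhat_xhat = sum(d * h for d, h in zip(dxhat, xhat)) / n
    # Two terms: the direct path and the path through the RMS statistic.
    dx = [inv_rms * (d - h * mean_dxhat_xhat) for d, h in zip(dxhat, xhat)]
    dgamma = [d * h for d, h in zip(dy, xhat)]
    return dx, dgamma
```

Compared with the LayerNorm formula, the mean-of-gradient term is absent because no mean is subtracted in the forward pass.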

BatchNorm
Batch Normalization — Gradient Derivation

Full backward pass for normalization across the batch dimension, covering training-time batch statistics vs. inference-time running statistics, plus a comparison with LayerNorm.

Read derivation →
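
The training vs. inference distinction can be sketched for a single feature as below. This is a simplified illustration: the class name, the momentum convention, and the use of the biased variance for the running estimate are all assumptions (some libraries use the unbiased variance there).

```python
import math

class BatchNormFeature:
    """Batch normalization for one scalar feature (hypothetical helper)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training):
        if training:
            # Training: normalize with statistics of the current batch.
            n = len(x)
            mu = sum(x) / n
            var = sum((v - mu) ** 2 for v in x) / n
            m = self.momentum
            # Exponential moving averages, kept for use at inference time.
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference: use the stored running statistics, not the batch.
            mu, var = self.running_mean, self.running_var
        inv_std = 1.0 / math.sqrt(var + self.eps)
        return [self.gamma * (v - mu) * inv_std + self.beta for v in x]
```

In inference mode the output for one sample does not depend on the other samples in the batch, so a batch of size 1 works without special handling.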

Activations & FFN

SwiGLU
SwiGLU — Forward & Backward Pass

Swish activation derivative, gated hidden state gradients, and weight gradients for all three projection matrices.

Read derivation →
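
A small sketch of the Swish derivative and the elementwise gate gradients is given below. The names `a` (gate pre-activation) and `b` (value branch) are assumptions for illustration, and the three projection matrices are left out to keep the example short.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(z):
    # Swish (SiLU) with beta = 1: z * sigmoid(z).
    return z * sigmoid(z)

def swish_grad(z):
    # d/dz [z * s(z)] = s(z) + z * s(z) * (1 - s(z)).
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))

def gate_forward(a, b):
    # Gated hidden state: h = swish(a) * b, elementwise.
    return [swish(ai) * bi for ai, bi in zip(a, b)]

def gate_backward(dh, a, b):
    # Product rule through the gate for each branch.
    da = [d * bi * swish_grad(ai) for d, ai, bi in zip(dh, a, b)]
    db = [d * swish(ai) for d, ai in zip(dh, a)]
    return da, db
```

The weight gradients for the projections would then follow from `da`, `db`, and the upstream gradient by the usual matrix-multiply backward rules.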

Positional Encoding

RoPE
Rotary Position Embedding — Gradient Derivation

Per-pair rotation forward pass, the relative-position dot-product property, and the symmetric backward pass via transposed rotation.

Read derivation →
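
The per-pair rotation and its transposed backward pass can be sketched as follows. This assumes adjacent dimensions form each pair and the common base of 10000; some implementations pair the two halves of the vector instead, and the function names here are illustrative.

```python
import math

def rope(x, pos, base=10000.0, inverse=False):
    # Rotate each adjacent pair (x[i], x[i+1]) by angle pos * base**(-i/d).
    d = len(x)  # assumed even
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        if inverse:
            theta = -theta
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

def rope_backward(dy, pos, base=10000.0):
    # A rotation's transpose is its inverse, so dx = R(-theta) applied to dy.
    return rope(dy, pos, base, inverse=True)
```

With this setup, the dot product of a rotated query at position m and a rotated key at position n depends only on m - n, which is the relative-position property mentioned above.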