Forward and backward pass derivation. Rotation matrices, the relative-position dot-product property, and the elegant symmetry between forward and backward.
| Symbol | Description |
|---|---|
| \(\mathbf{x} \in \mathbb{R}^d\) | Input vector (query or key) at some position; \(d\) must be even |
| \(m\) | Token position index (integer) |
| \(\theta_k\) | Frequency for dimension pair \(k\): \(\theta_k = \mathrm{base}^{-2k/d}\), \(k = 0,\ldots,\tfrac{d}{2}-1\) |
| \(\varphi_{m,k}\) | Rotation angle for pair \(k\) at position \(m\): \(\varphi_{m,k} = m\,\theta_k\) |
| \(R_m\) | Block-diagonal rotation matrix for position \(m\) |
| \(\mathbf{y}\) | RoPE output: \(\mathbf{y} = R_m\,\mathbf{x}\) |
| \(d\mathbf{y},\; d\mathbf{x}\) | Upstream gradient \(\partial L/\partial \mathbf{y}\) and target gradient \(\partial L/\partial \mathbf{x}\) |
| \(\mathbf{c}_m,\; \mathbf{s}_m\) | Element-wise cosine/sine vectors (each frequency repeated twice, shape \(d\)) |
| \(J(\cdot)\) | Pair-rotation operator: \(J(\mathbf{v})_{2k} = -v_{2k+1},\; J(\mathbf{v})_{2k+1} = v_{2k}\) |
RoPE uses a geometric sequence of frequencies, one per dimension pair:
\[ \theta_k = \mathrm{base}^{-2k/d}, \qquad k = 0, 1, \ldots, \tfrac{d}{2}-1 \]
With \(\mathrm{base}=10000\), the wavelengths \(2\pi/\theta_k\) range from \(2\pi\) (pair 0, highest frequency) up to \(2\pi \cdot 10000^{(d-2)/d}\) (last pair, lowest frequency), which approaches \(2\pi \cdot 10000\) for large \(d\). The rotation angle applied to pair \(k\) at position \(m\) is simply:
\[ \varphi_{m,k} = m\,\theta_k \]
The two broadcast vectors used throughout are:
\[ \mathbf{c}_m = \bigl[\cos\varphi_{m,0},\;\cos\varphi_{m,0},\;\cos\varphi_{m,1},\;\cos\varphi_{m,1},\;\ldots\bigr] \]
\[ \mathbf{s}_m = \bigl[\sin\varphi_{m,0},\;\sin\varphi_{m,0},\;\sin\varphi_{m,1},\;\sin\varphi_{m,1},\;\ldots\bigr] \]
Each consecutive pair of dimensions \((x_{2k},\, x_{2k+1})\) is rotated by angle \(\varphi_{m,k}\):
\[ \begin{pmatrix} y_{2k} \\ y_{2k+1} \end{pmatrix} = \begin{pmatrix} \cos\varphi_{m,k} & -\sin\varphi_{m,k} \\ \sin\varphi_{m,k} & \phantom{-}\cos\varphi_{m,k} \end{pmatrix} \begin{pmatrix} x_{2k} \\ x_{2k+1} \end{pmatrix} \]
Expanding:
\[ y_{2k} = x_{2k}\cos\varphi_{m,k} - x_{2k+1}\sin\varphi_{m,k} \]
\[ y_{2k+1} = x_{2k}\sin\varphi_{m,k} + x_{2k+1}\cos\varphi_{m,k} \]
Stacking all pairs, \(R_m\) is block-diagonal:
\[ R_m = \mathrm{diag}\!\Bigl( \underbrace{\begin{pmatrix}\cos\varphi_{m,0} & -\sin\varphi_{m,0} \\ \sin\varphi_{m,0} & \cos\varphi_{m,0}\end{pmatrix}}_{k=0},\; \underbrace{\begin{pmatrix}\cos\varphi_{m,1} & -\sin\varphi_{m,1} \\ \sin\varphi_{m,1} & \cos\varphi_{m,1}\end{pmatrix}}_{k=1},\; \ldots \Bigr) \]
\[ \mathbf{y} = R_m\,\mathbf{x} \]
Define the pair-rotation operator \(J\), which rotates each pair 90° counterclockwise:
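\[ J(\mathbf{v})_{2k} = -v_{2k+1}, \qquad J(\mathbf{v})_{2k+1} = v_{2k} \]
Then the forward pass collapses to two element-wise operations:
\[ \mathbf{y} = \mathbf{x}\odot\mathbf{c}_m + J(\mathbf{x})\odot\mathbf{s}_m \]
Verification for pair \(k\):
\[ y_{2k} = x_{2k}\cos\varphi_{m,k} + (-x_{2k+1})\sin\varphi_{m,k}, \qquad y_{2k+1} = x_{2k+1}\cos\varphi_{m,k} + x_{2k}\sin\varphi_{m,k} \]
Both components agree with the expanded pair-wise rotation.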
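As a numerical sanity check of the forward pass, the sketch below computes the angles \(\varphi_{m,k}\), applies the pair-wise rotation, and compares it with the element-wise form \(\mathbf{x}\odot\mathbf{c}_m + J(\mathbf{x})\odot\mathbf{s}_m\). The function names (`rope_angles`, `rope_forward_pairs`, `rope_forward_vectorized`) and the sample vector are illustrative assumptions, and plain Python lists are used only to keep the example dependency-free.

```python
import math

def rope_angles(m, d, base=10000.0):
    """Angles phi_{m,k} = m * base**(-2k/d) for k = 0 .. d/2 - 1."""
    assert d % 2 == 0, "d must be even"
    return [m * base ** (-2.0 * k / d) for k in range(d // 2)]

def rope_forward_pairs(x, m, base=10000.0):
    """Rotate each pair (x[2k], x[2k+1]) by phi_{m,k}."""
    y = [0.0] * len(x)
    for k, phi in enumerate(rope_angles(m, len(x), base)):
        c, s = math.cos(phi), math.sin(phi)
        y[2 * k] = x[2 * k] * c - x[2 * k + 1] * s
        y[2 * k + 1] = x[2 * k] * s + x[2 * k + 1] * c
    return y

def J(v):
    """Pair-rotation operator: J(v)[2k] = -v[2k+1], J(v)[2k+1] = v[2k]."""
    out = [0.0] * len(v)
    for k in range(len(v) // 2):
        out[2 * k] = -v[2 * k + 1]
        out[2 * k + 1] = v[2 * k]
    return out

def rope_forward_vectorized(x, m, base=10000.0):
    """y = x * c_m + J(x) * s_m, with each cos/sin value repeated twice."""
    phis = rope_angles(m, len(x), base)
    c = [math.cos(p) for p in phis for _ in range(2)]
    s = [math.sin(p) for p in phis for _ in range(2)]
    jx = J(x)
    return [xi * ci + ji * si for xi, ci, ji, si in zip(x, c, jx, s)]

x = [1.0, 2.0, 3.0, 4.0]  # arbitrary sample input, d = 4
y_pairs = rope_forward_pairs(x, m=5)
y_vec = rope_forward_vectorized(x, m=5)

# Largest disagreement between the two formulations (rounding error only).
max_diff = max(abs(a - b) for a, b in zip(y_pairs, y_vec))
print(max_diff)
```

In an array library the same element-wise form would typically act on whole tensors at once; the loop version here simply mirrors the index formulas.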
The key motivation for RoPE: the inner product of a rotated query and key depends only on their relative position \(n-m\), not on absolute positions.
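The relative-position claim can be checked numerically. The following is a minimal sketch with arbitrary sample vectors and a hypothetical helper `rope`: it compares the attention score at positions \((3, 7)\), at positions shifted by 100, and the score obtained by rotating only the key by \(n-m = 4\).

```python
import math

def rope(x, m, base=10000.0):
    """Apply R_m to x by rotating each pair (x[2k], x[2k+1]) by m * theta_k."""
    d = len(x)
    y = []
    for k in range(d // 2):
        phi = m * base ** (-2.0 * k / d)
        c, s = math.cos(phi), math.sin(phi)
        y += [x[2 * k] * c - x[2 * k + 1] * s,
              x[2 * k] * s + x[2 * k + 1] * c]
    return y

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

query = [0.3, -1.2, 0.7, 2.0]   # arbitrary sample values
key = [1.5, 0.4, -0.6, 0.9]

score_a = dot(rope(query, 3), rope(key, 7))      # m = 3,   n = 7
score_b = dot(rope(query, 103), rope(key, 107))  # m = 103, n = 107
score_c = dot(query, rope(key, 4))               # q^T R_{n-m} k with n - m = 4

# All three agree up to floating-point rounding.
print(score_a, score_b, score_c)
```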
Because each \(2{\times}2\) rotation block is orthogonal, so is the full \(R_m\), which gives \(R_m^\top R_m = I\) and \(R_m^\top = R_{-m}\). Rotations of the same pair compose by adding angles, so \(R_a R_b = R_{a+b}\), and therefore:
\[ (R_m\,\mathbf{q})^\top (R_n\,\mathbf{k}) = \mathbf{q}^\top R_m^\top R_n\,\mathbf{k} = \mathbf{q}^\top R_{n-m}\,\mathbf{k} \]
The forward pass is a linear map, \(\mathbf{y} = R_m\,\mathbf{x}\), so its Jacobian is \(\partial \mathbf{y}/\partial \mathbf{x} = R_m\). By the chain rule:
\[ \frac{\partial L}{\partial \mathbf{x}} = R_m^\top\,\frac{\partial L}{\partial \mathbf{y}} = R_m^\top\,d\mathbf{y} \]
Because \(R_m\) is orthogonal, \(R_m^\top = R_m^{-1} = R_{-m}\). The backward pass is just rotating the upstream gradient by \(-m\,\theta_k\) per pair.
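To illustrate that the backward pass is a rotation by the negated position, the sketch below reuses one hypothetical `rotate` function for both directions. It checks the adjoint identity \(\langle R_m\mathbf{x},\, d\mathbf{y}\rangle = \langle \mathbf{x},\, R_m^\top d\mathbf{y}\rangle\) and the round trip \(R_{-m}R_m\mathbf{x} = \mathbf{x}\); the vectors and the position are arbitrary sample values.

```python
import math

def rotate(v, m, base=10000.0):
    """Apply R_m to v; calling it with -m applies R_m^T = R_{-m}."""
    d = len(v)
    out = []
    for k in range(d // 2):
        phi = m * base ** (-2.0 * k / d)
        c, s = math.cos(phi), math.sin(phi)
        out += [v[2 * k] * c - v[2 * k + 1] * s,
                v[2 * k] * s + v[2 * k + 1] * c]
    return out

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5]    # arbitrary input, d = 6
dy = [1.0, 0.2, -0.3, 0.8, 0.1, -1.1]     # arbitrary upstream gradient
m = 9

y = rotate(x, m)       # forward:  y  = R_m x
dx = rotate(dy, -m)    # backward: dx = R_m^T dy = R_{-m} dy

lhs = dot(y, dy)       # <R_m x, dy>
rhs = dot(x, dx)       # <x, R_m^T dy>
x_back = rotate(y, -m) # R_{-m} R_m x should recover x

print(lhs, rhs)
```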
Applying \(R_{-m}\) to \(d\mathbf{y}\) pair by pair (rotation by \(-\varphi_{m,k}\)):
\[ dx_{2k} = dy_{2k}\cos\varphi_{m,k} + dy_{2k+1}\sin\varphi_{m,k} \]
\[ dx_{2k+1} = -dy_{2k}\sin\varphi_{m,k} + dy_{2k+1}\cos\varphi_{m,k} \]
Compare with the forward: the only change is the sign on \(\sin\). This is exactly what negating the angle does: \(\cos(-\varphi)=\cos\varphi\), \(\sin(-\varphi)=-\sin\varphi\).
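The pair-wise backward formulas can be compared against a finite-difference gradient. This is a sketch under stated assumptions: the scalar loss \(L = \sum_i w_i\,y_i\) with fixed weights `w` is hypothetical and chosen only so that \(\partial L/\partial \mathbf{y} = \mathbf{w}\); the helper names are likewise illustrative.

```python
import math

def angles(m, d, base=10000.0):
    return [m * base ** (-2.0 * k / d) for k in range(d // 2)]

def forward(x, m):
    """y = R_m x, pair by pair."""
    y = [0.0] * len(x)
    for k, phi in enumerate(angles(m, len(x))):
        c, s = math.cos(phi), math.sin(phi)
        y[2 * k] = x[2 * k] * c - x[2 * k + 1] * s
        y[2 * k + 1] = x[2 * k] * s + x[2 * k + 1] * c
    return y

def backward(dy, m):
    """dx = R_m^T dy: same as forward, with the sign of sin flipped."""
    dx = [0.0] * len(dy)
    for k, phi in enumerate(angles(m, len(dy))):
        c, s = math.cos(phi), math.sin(phi)
        dx[2 * k] = dy[2 * k] * c + dy[2 * k + 1] * s
        dx[2 * k + 1] = -dy[2 * k] * s + dy[2 * k + 1] * c
    return dx

# Hypothetical loss L(x) = sum_i w[i] * y[i], so the upstream gradient dy is w.
w = [0.7, -0.2, 1.3, 0.4]

def loss(x, m):
    return sum(wi * yi for wi, yi in zip(w, forward(x, m)))

x = [1.0, -2.0, 0.5, 3.0]
m = 6
dx = backward(w, m)

# Central finite differences, one coordinate at a time.
eps = 1e-6
dx_fd = []
for i in range(len(x)):
    x_plus = list(x)
    x_minus = list(x)
    x_plus[i] += eps
    x_minus[i] -= eps
    dx_fd.append((loss(x_plus, m) - loss(x_minus, m)) / (2 * eps))

max_err = max(abs(a - b) for a, b in zip(dx, dx_fd))
print(max_err)
```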
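Define the inverse pair-rotation \(J^\top\), the transpose of \(J\) (a 90° clockwise rotation of each pair):
\[ J^\top(\mathbf{v})_{2k} = v_{2k+1}, \qquad J^\top(\mathbf{v})_{2k+1} = -v_{2k} \]
With it, the backward pass takes the same element-wise form as the forward pass:
\[ d\mathbf{x} = d\mathbf{y}\odot\mathbf{c}_m + J^\top(d\mathbf{y})\odot\mathbf{s}_m \]
Verification: \(J^\top(d\mathbf{y})_{2k} = dy_{2k+1}\) and \(J^\top(d\mathbf{y})_{2k+1} = -dy_{2k}\), which matches the pair-wise backward expressions above.
The structure is identical to the forward pass with two substitutions: \(\mathbf{x} \to d\mathbf{y}\) and \(J \to J^\top\).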
If rotate_half computes \(J(\cdot)\), then because \(J^\top = -J\) the backward component is -rotate_half(dy) (equivalently rotate_half(-dy)). Many frameworks implement this by negating \(\mathbf{s}_m\) instead of the rotated vector; both produce the same result.
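The equivalence of the two implementation choices can be sketched as follows. Here `rotate_half` is written for the interleaved pairing used in this derivation, which is an assumption of the example: some implementations give a function of the same name a different layout, pairing dimension \(i\) with \(i + d/2\). The other helper names are hypothetical.

```python
import math

def rotate_half(v):
    """Interleaved pair rotation J: each pair (a, b) becomes (-b, a)."""
    out = []
    for k in range(len(v) // 2):
        out += [-v[2 * k + 1], v[2 * k]]
    return out

def cos_sin(m, d, base=10000.0):
    """Broadcast vectors c_m and s_m, each frequency repeated twice."""
    phis = [m * base ** (-2.0 * k / d) for k in range(d // 2)]
    c = [math.cos(p) for p in phis for _ in range(2)]
    s = [math.sin(p) for p in phis for _ in range(2)]
    return c, s

def backward_neg_rotate(dy, c, s):
    """dx = dy * c + (-rotate_half(dy)) * s, using J^T = -J."""
    r = rotate_half(dy)
    return [g * ci - ri * si for g, ci, ri, si in zip(dy, c, r, s)]

def backward_neg_sin(dy, c, s):
    """Reuse the forward formula unchanged, but pass -s instead of s."""
    neg_s = [-si for si in s]
    r = rotate_half(dy)
    return [g * ci + ri * nsi for g, ci, ri, nsi in zip(dy, c, r, neg_s)]

dy = [0.4, -1.0, 2.5, 0.3]   # arbitrary upstream gradient, d = 4
c, s = cos_sin(m=11, d=len(dy))

dx_a = backward_neg_rotate(dy, c, s)
dx_b = backward_neg_sin(dy, c, s)

# rotate_half(-dy) equals -rotate_half(dy) because J is linear.
neg_in = rotate_half([-g for g in dy])
neg_out = [-r for r in rotate_half(dy)]

print(dx_a, dx_b)
```

Negating \(\mathbf{s}_m\) lets the same kernel serve both passes, which is one plausible reason for that choice; the result is the same either way.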
\(\varphi_{m,k} = m \cdot \mathrm{base}^{-2k/d}\)
\(\mathbf{y} = \mathbf{x}\odot\mathbf{c}_m + J(\mathbf{x})\odot\mathbf{s}_m\)
\((R_m\mathbf{q})^\top(R_n\mathbf{k}) = \mathbf{q}^\top R_{n-m}\,\mathbf{k}\)
\(d\mathbf{x} = d\mathbf{y}\odot\mathbf{c}_m + J^\top(d\mathbf{y})\odot\mathbf{s}_m\)
| Pass | Operation on pairs | Vectorized |
|---|---|---|
| Forward | \(y_{2k} = x_{2k}\cos\varphi - x_{2k+1}\sin\varphi\), \(y_{2k+1} = x_{2k}\sin\varphi + x_{2k+1}\cos\varphi\) | \(\mathbf{x}\odot\mathbf{c}_m + J(\mathbf{x})\odot\mathbf{s}_m\) |
| Backward | \(dx_{2k} = dy_{2k}\cos\varphi + dy_{2k+1}\sin\varphi\), \(dx_{2k+1} = -dy_{2k}\sin\varphi + dy_{2k+1}\cos\varphi\) | \(d\mathbf{y}\odot\mathbf{c}_m + J^\top(d\mathbf{y})\odot\mathbf{s}_m\) |