RoPE — Gradient Derivation

Notation

Symbol	Description
\(\mathbf{x} \in \mathbb{R}^d\)	Input vector (query or key) at some position; \(d\) must be even
\(m\)	Token position index (integer)
\(\theta_k\)	Frequency for dimension pair \(k\): \(\theta_k = \mathrm{base}^{-2k/d}\), \(k = 0,\ldots,\tfrac{d}{2}-1\)
\(\varphi_{m,k}\)	Rotation angle for pair \(k\) at position \(m\): \(\varphi_{m,k} = m\,\theta_k\)
\(R_m\)	Block-diagonal rotation matrix for position \(m\)
\(\mathbf{y}\)	RoPE output: \(\mathbf{y} = R_m\,\mathbf{x}\)
\(d\mathbf{y},\; d\mathbf{x}\)	Upstream gradient \(\partial L/\partial \mathbf{y}\) and target gradient \(\partial L/\partial \mathbf{x}\)
\(\mathbf{c}_m,\; \mathbf{s}_m\)	Element-wise cosine/sine vectors (each frequency repeated twice, shape \(d\))
\(J(\cdot)\)	Pair-rotation operator: \(J(\mathbf{v})_{2k} = -v_{2k+1},\; J(\mathbf{v})_{2k+1} = v_{2k}\)

1. Angle Schedule

RoPE uses a geometric sequence of frequencies, one per dimension pair:

\[ \theta_k = \mathrm{base}^{-2k/d}, \qquad k = 0, 1, \ldots, \tfrac{d}{2}-1 \]

With \(\mathrm{base}=10000\), the wavelengths \(2\pi/\theta_k\) range from \(2\pi\) (pair 0, highest frequency) to \(2\pi \cdot 10000^{1-2/d}\) (last pair, lowest frequency), which approaches \(2\pi \cdot 10000\) as \(d\) grows. The rotation angle applied to pair \(k\) at position \(m\) is:

\[ \varphi_{m,k} = m\,\theta_k \]

The two broadcast vectors used throughout are:

\[ \mathbf{c}_m = \bigl[\cos\varphi_{m,0},\;\cos\varphi_{m,0},\;\cos\varphi_{m,1},\;\cos\varphi_{m,1},\;\ldots\bigr] \] \[ \mathbf{s}_m = \bigl[\sin\varphi_{m,0},\;\sin\varphi_{m,0},\;\sin\varphi_{m,1},\;\sin\varphi_{m,1},\;\ldots\bigr] \]
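The schedule above can be sketched in plain Python (standard library only). The function name rope_angles and the example values m=3, d=8 are illustrative assumptions, not part of the original derivation.

```python
import math

def rope_angles(m, d, base=10000.0):
    """Return (theta, c_m, s_m) for position m and even dimension d."""
    if d % 2 != 0:
        raise ValueError("d must be even")
    # theta_k = base^(-2k/d): one frequency per dimension pair
    theta = [base ** (-2.0 * k / d) for k in range(d // 2)]
    c, s = [], []
    for t in theta:
        phi = m * t                   # rotation angle phi_{m,k}
        c += [math.cos(phi)] * 2      # repeated for both members of the pair
        s += [math.sin(phi)] * 2
    return theta, c, s

theta, c_m, s_m = rope_angles(m=3, d=8)
# Wavelength of pair k in positions: 2*pi / theta_k
wavelengths = [2 * math.pi / t for t in theta]
```

For d=8 the last pair has \(\theta_3 = 10000^{-3/4}\), so its wavelength is \(2\pi \cdot 1000\) rather than \(2\pi \cdot 10000\); the upper limit is only approached for large \(d\).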

2. Forward Pass

Per-pair rotation

Each consecutive pair of dimensions \((x_{2k},\, x_{2k+1})\) is rotated by angle \(\varphi_{m,k}\):

\[ \begin{pmatrix} y_{2k} \\ y_{2k+1} \end{pmatrix} = \begin{pmatrix} \cos\varphi_{m,k} & -\sin\varphi_{m,k} \\ \sin\varphi_{m,k} & \phantom{-}\cos\varphi_{m,k} \end{pmatrix} \begin{pmatrix} x_{2k} \\ x_{2k+1} \end{pmatrix} \]

Expanding:

\[ y_{2k} = x_{2k}\cos\varphi_{m,k} - x_{2k+1}\sin\varphi_{m,k} \] \[ y_{2k+1} = x_{2k}\sin\varphi_{m,k} + x_{2k+1}\cos\varphi_{m,k} \]
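As a small worked instance of the per-pair formulas, the following plain-Python sketch rotates a 4-dimensional vector at position 1. The helper names rotate_pair and rope_forward_pairs and the sample vector are assumptions made for illustration.

```python
import math

def rotate_pair(x_even, x_odd, phi):
    """Rotate the 2-vector (x_even, x_odd) counterclockwise by phi."""
    y_even = x_even * math.cos(phi) - x_odd * math.sin(phi)
    y_odd = x_even * math.sin(phi) + x_odd * math.cos(phi)
    return y_even, y_odd

def rope_forward_pairs(x, m, base=10000.0):
    """Apply the rotation pair by pair; len(x) must be even."""
    d = len(x)
    y = [0.0] * d
    for k in range(d // 2):
        phi = m * base ** (-2.0 * k / d)   # phi_{m,k} = m * theta_k
        y[2 * k], y[2 * k + 1] = rotate_pair(x[2 * k], x[2 * k + 1], phi)
    return y

x = [1.0, 0.0, 0.0, 2.0]
y = rope_forward_pairs(x, m=1)
# Pair 0 is rotated by 1 radian, pair 1 by 0.01 radian (theta_1 = 10000^(-1/2)).
norm_sq_in = sum(v * v for v in x)
norm_sq_out = sum(v * v for v in y)   # rotations preserve the squared norm
```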

Matrix form

Stacking all pairs, \(R_m\) is block-diagonal:

\[ R_m = \mathrm{diag}\!\Bigl( \underbrace{\begin{pmatrix}\cos\varphi_{m,0} & -\sin\varphi_{m,0} \\ \sin\varphi_{m,0} & \cos\varphi_{m,0}\end{pmatrix}}_{k=0},\; \underbrace{\begin{pmatrix}\cos\varphi_{m,1} & -\sin\varphi_{m,1} \\ \sin\varphi_{m,1} & \cos\varphi_{m,1}\end{pmatrix}}_{k=1},\; \ldots \Bigr) \] \[ \mathbf{y} = R_m\,\mathbf{x} \]
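For small \(d\), the block-diagonal matrix can be built explicitly, which is convenient for checking identities numerically. This is only a sketch in plain Python; rope_matrix, matvec and transpose are hypothetical helpers, and a dense \(d \times d\) matrix would not be used in a real implementation.

```python
import math

def rope_matrix(m, d, base=10000.0):
    """Dense d x d block-diagonal R_m, stored as a list of rows."""
    R = [[0.0] * d for _ in range(d)]
    for k in range(d // 2):
        phi = m * base ** (-2.0 * k / d)
        c, s = math.cos(phi), math.sin(phi)
        i = 2 * k
        R[i][i], R[i][i + 1] = c, -s
        R[i + 1][i], R[i + 1][i + 1] = s, c
    return R

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

R = rope_matrix(m=2, d=4)
y = matvec(R, [1.0, 0.0, 0.0, 1.0])   # y = R_m x
R_T = transpose(R)
R_neg = rope_matrix(m=-2, d=4)        # R_{-m}, which should equal R_m^T
```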

Vectorized form

Define the pair-rotation operator \(J\), which rotates each pair 90° counterclockwise:

\[ J(\mathbf{v})_{2k} = -v_{2k+1}, \qquad J(\mathbf{v})_{2k+1} = v_{2k} \]

Then the forward pass collapses to two element-wise products and one sum:

Forward Pass (vectorized)

\[ \mathbf{y} = \mathbf{x} \odot \mathbf{c}_m \;+\; J(\mathbf{x}) \odot \mathbf{s}_m \]

Verification for pair \(k\):

Index \(2k\): \(\;x_{2k}\cos\varphi_{m,k} + (-x_{2k+1})\sin\varphi_{m,k}\) ✓
Index \(2k+1\): \(\;x_{2k+1}\cos\varphi_{m,k} + x_{2k}\sin\varphi_{m,k}\) ✓
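The vectorized form can be sketched as follows, assuming the interleaved pair layout used in this derivation. The names rotate_pairs, cos_sin and rope_forward are illustrative; plain Python lists stand in for tensors.

```python
import math

def rotate_pairs(v):
    """J(v): map each pair (a, b) to (-b, a)."""
    out = [0.0] * len(v)
    for k in range(len(v) // 2):
        out[2 * k] = -v[2 * k + 1]
        out[2 * k + 1] = v[2 * k]
    return out

def cos_sin(m, d, base=10000.0):
    """Broadcast vectors c_m and s_m, each of length d."""
    c, s = [], []
    for k in range(d // 2):
        phi = m * base ** (-2.0 * k / d)
        c += [math.cos(phi)] * 2
        s += [math.sin(phi)] * 2
    return c, s

def rope_forward(x, m, base=10000.0):
    """y = x * c_m + J(x) * s_m, all element-wise."""
    c, s = cos_sin(m, len(x), base)
    jx = rotate_pairs(x)
    return [xi * ci + ji * si for xi, ci, ji, si in zip(x, c, jx, s)]

x = [0.5, -1.0, 2.0, 3.0]
y = rope_forward(x, m=7)
```

With array libraries the same computation is typically two multiplies and an add over whole tensors, with c_m and s_m precomputed once per position.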

3. Relative-Position Property

The key motivation for RoPE: the inner product of a query rotated at position \(m\) and a key rotated at position \(n\) depends on the positions only through the offset \(n-m\), not on \(m\) and \(n\) individually.

Because each \(2{\times}2\) rotation block is orthogonal, so is the full \(R_m\): \(R_m^\top R_m = I\), hence \(R_m^\top = R_m^{-1} = R_{-m}\). Therefore:

\[ (R_m\,\mathbf{q})^\top (R_n\,\mathbf{k}) = \mathbf{q}^\top R_m^\top R_n\,\mathbf{k} = \mathbf{q}^\top R_{n-m}\,\mathbf{k} \]

The last equality uses the homomorphism \(R_m^\top R_n = R_{-m} R_n = R_{n-m}\), which holds because rotation matrices compose by adding angles. This lets the model learn relative distances without explicit relative-position bias tables.
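The relative-position property is easy to check numerically. The sketch below uses arbitrary sample vectors and positions (all assumed for illustration) and compares the score at positions (5, 9), the score at the shifted positions (105, 109), and the closed form \(\mathbf{q}^\top R_{n-m}\,\mathbf{k}\) with \(n-m=4\).

```python
import math

def rope(x, m, base=10000.0):
    """Apply R_m to x using the per-pair rotation."""
    d = len(x)
    y = []
    for k in range(d // 2):
        phi = m * base ** (-2.0 * k / d)
        a, b = x[2 * k], x[2 * k + 1]
        y += [a * math.cos(phi) - b * math.sin(phi),
              a * math.sin(phi) + b * math.cos(phi)]
    return y

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q_vec = [0.3, -1.2, 0.7, 2.0]
k_vec = [1.5, 0.4, -0.6, 0.9]

score_a = dot(rope(q_vec, 5), rope(k_vec, 9))       # m = 5, n = 9
score_b = dot(rope(q_vec, 105), rope(k_vec, 109))   # both shifted by 100
score_rel = dot(q_vec, rope(k_vec, 4))              # q^T R_{n-m} k
```

All three scores agree up to floating-point rounding, since only the offset 4 enters the result.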

4. Backward Pass — Gradient for \(\mathbf{x}\)

Step A — Jacobian is an orthogonal matrix

The forward pass is a linear map, \(\mathbf{y} = R_m\,\mathbf{x}\), so its Jacobian is \(\partial \mathbf{y}/\partial \mathbf{x} = R_m\), independent of \(\mathbf{x}\). By the chain rule:

\[ \frac{\partial L}{\partial \mathbf{x}} = R_m^\top\,\frac{\partial L}{\partial \mathbf{y}} = R_m^\top\,d\mathbf{y} \]

Because \(R_m\) is orthogonal, \(R_m^\top = R_m^{-1} = R_{-m}\). The backward pass is just rotating the upstream gradient by \(-m\,\theta_k\) per pair.
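The transpose in the chain rule can be seen by writing it out in components. With \(y_i = \sum_j (R_m)_{ij}\,x_j\), we have \(\partial y_i/\partial x_j = (R_m)_{ij}\), so:

\[ \frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial y_i}\,\frac{\partial y_i}{\partial x_j} = \sum_i (R_m)_{ij}\,dy_i = \bigl(R_m^\top\,d\mathbf{y}\bigr)_j \]

The sum runs over the row index of \(R_m\), which is exactly multiplication by \(R_m^\top\).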

Step B — Per-pair gradient

Applying \(R_{-m}\) to \(d\mathbf{y}\) pair by pair (rotation by \(-\varphi_{m,k}\)):

\[ dx_{2k} = dy_{2k}\cos\varphi_{m,k} + dy_{2k+1}\sin\varphi_{m,k} \] \[ dx_{2k+1} = -dy_{2k}\sin\varphi_{m,k} + dy_{2k+1}\cos\varphi_{m,k} \]

Compare with the forward: the only change is the sign on \(\sin\). This is exactly what negating the angle does — \(\cos(-\varphi)=\cos\varphi\), \(\sin(-\varphi)=-\sin\varphi\).
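The per-pair gradient can be checked against central finite differences. In the sketch below the scalar loss \(L = \sum_i \sin(y_i)\) is a hypothetical choice made only so that \(\partial L/\partial \mathbf{y}\) is easy to write down; the function names and sample values are likewise assumptions.

```python
import math

def angles(m, d, base=10000.0):
    return [m * base ** (-2.0 * k / d) for k in range(d // 2)]

def rope_forward(x, m):
    y = []
    for k, phi in enumerate(angles(m, len(x))):
        a, b = x[2 * k], x[2 * k + 1]
        y += [a * math.cos(phi) - b * math.sin(phi),
              a * math.sin(phi) + b * math.cos(phi)]
    return y

def rope_backward(dy, m):
    """Rotate each gradient pair by -phi (Step B)."""
    dx = []
    for k, phi in enumerate(angles(m, len(dy))):
        g0, g1 = dy[2 * k], dy[2 * k + 1]
        dx += [g0 * math.cos(phi) + g1 * math.sin(phi),
               -g0 * math.sin(phi) + g1 * math.cos(phi)]
    return dx

def loss(x, m):
    # Hypothetical loss for testing only: L = sum_i sin(y_i)
    return sum(math.sin(v) for v in rope_forward(x, m))

x, m, eps = [0.4, -0.9, 1.3, 0.2], 6, 1e-6
dy = [math.cos(v) for v in rope_forward(x, m)]   # dL/dy for the loss above
dx = rope_backward(dy, m)                        # analytic dL/dx

# Central finite differences on each input coordinate
dx_fd = []
for i in range(len(x)):
    x_plus = list(x)
    x_minus = list(x)
    x_plus[i] += eps
    x_minus[i] -= eps
    dx_fd.append((loss(x_plus, m) - loss(x_minus, m)) / (2 * eps))
```

The analytic and numerical gradients should agree to well within the finite-difference error.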

Step C — Vectorized result

Define the inverse pair-rotation \(J^\top\), which is \(J\) transposed (clockwise 90°):

\[ J^\top(\mathbf{v})_{2k} = v_{2k+1}, \qquad J^\top(\mathbf{v})_{2k+1} = -v_{2k} \]

Verification: \(J^\top(d\mathbf{y})_{2k} = dy_{2k+1}\) and \(J^\top(d\mathbf{y})_{2k+1} = -dy_{2k}\), which are exactly the coefficients of \(\sin\varphi_{m,k}\) in Step B above.

Backward Pass (vectorized)

\[ d\mathbf{x} = d\mathbf{y} \odot \mathbf{c}_m \;+\; J^\top(d\mathbf{y}) \odot \mathbf{s}_m \]

The structure is identical to the forward pass with two substitutions:

\(\mathbf{x} \to d\mathbf{y}\) (gradient takes the place of input)
\(J \to J^\top\) (clockwise instead of counterclockwise pair-rotation)

Implementation shortcut: because \(J\) is linear and skew-symmetric, \(J^\top = -J\), so \(J^\top(\mathbf{v}) = -J(\mathbf{v})\). In code, if rotate_half computes \(J(\cdot)\), the sine term of the backward pass uses -rotate_half(dy) (equivalently rotate_half(-dy)). An implementation can instead negate \(\mathbf{s}_m\) and reuse the forward kernel unchanged; both produce the same result.
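A minimal sketch of the shortcut, in plain Python. Here rotate_half is assumed to implement the interleaved \(J\) of this derivation; functions with the same name in some codebases operate on a split-halves layout instead, so the name alone should not be relied on. The helper names and sample values are illustrative.

```python
import math

def rotate_half(v):
    """J(v) for the interleaved layout: each pair (a, b) becomes (-b, a)."""
    out = []
    for k in range(len(v) // 2):
        out += [-v[2 * k + 1], v[2 * k]]
    return out

def cos_sin(m, d, base=10000.0):
    c, s = [], []
    for k in range(d // 2):
        phi = m * base ** (-2.0 * k / d)
        c += [math.cos(phi)] * 2
        s += [math.sin(phi)] * 2
    return c, s

def rope_apply(v, c, s):
    """Forward kernel: v * c + J(v) * s, element-wise."""
    jv = rotate_half(v)
    return [vi * ci + ji * si for vi, ci, ji, si in zip(v, c, jv, s)]

def rope_backward(dy, c, s):
    """dx = dy * c + J^T(dy) * s, using J^T(dy) = -J(dy)."""
    jdy = rotate_half(dy)
    return [gi * ci - ji * si for gi, ci, ji, si in zip(dy, c, jdy, s)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

c, s = cos_sin(m=11, d=4)
x = [0.2, 1.1, -0.7, 0.5]
dy = [1.0, -2.0, 0.5, 3.0]

dx = rope_backward(dy, c, s)
dx_neg_sin = rope_apply(dy, c, [-si for si in s])        # forward kernel, negated sine
x_roundtrip = rope_backward(rope_apply(x, c, s), c, s)   # R_m^T R_m x = x
```

Two quick sanity checks follow from the derivation: dx and dx_neg_sin coincide, and the adjoint identity \((R_m\mathbf{x})^\top d\mathbf{y} = \mathbf{x}^\top (R_m^\top d\mathbf{y})\) holds, that is, dot(rope_apply(x, c, s), dy) equals dot(x, dx) up to rounding.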

5. Summary

Angle schedule

\(\varphi_{m,k} = m \cdot \mathrm{base}^{-2k/d}\)

Forward

\(\mathbf{y} = \mathbf{x}\odot\mathbf{c}_m + J(\mathbf{x})\odot\mathbf{s}_m\)

Key property

\((R_m\mathbf{q})^\top(R_n\mathbf{k}) = \mathbf{q}^\top R_{n-m}\,\mathbf{k}\)

Backward

\(d\mathbf{x} = d\mathbf{y}\odot\mathbf{c}_m + J^\top(d\mathbf{y})\odot\mathbf{s}_m\)

Pass	Operation on pairs	Vectorized
Forward	\(y_{2k} = x_{2k}\cos\varphi - x_{2k+1}\sin\varphi\) \(y_{2k+1} = x_{2k}\sin\varphi + x_{2k+1}\cos\varphi\)	\(\mathbf{x}\odot\mathbf{c}_m + J(\mathbf{x})\odot\mathbf{s}_m\)
Backward	\(dx_{2k} = dy_{2k}\cos\varphi + dy_{2k+1}\sin\varphi\) \(dx_{2k+1} = -dy_{2k}\sin\varphi + dy_{2k+1}\cos\varphi\)	\(d\mathbf{y}\odot\mathbf{c}_m + J^\top(d\mathbf{y})\odot\mathbf{s}_m\)
