Alignment

GRPO — Group Relative Policy Optimization

The RL algorithm behind DeepSeek-R1: drop PPO's value network and use a group of sampled answers as the baseline, scoring each one relative to its peers.

Notation

| Symbol | Description |
| --- | --- |
| \(q\) | Prompt / question (the same role as \(x\) elsewhere) |
| \(o_i\) | The \(i\)-th sampled output (completion) in the group; \(\lvert o_i\rvert\) is its token length |
| \(o_{i,t}\) | Token \(t\) of output \(i\); \(o_{i,<t}\) the tokens before it |
| \(G\) | Group size: number of outputs sampled per prompt |
| \(\pi_\theta\) | Policy being trained; \(\pi_{\theta_{\text{old}}}\) is the policy that generated the group; \(\pi_{\text{ref}}\) is the frozen reference |
| \(r_i\) | Reward of output \(o_i\) (from a reward model or a verifier) |
| \(\hat A_{i,t}\) | Advantage assigned to token \(t\) of output \(i\) |
| \(\rho_{i,t}(\theta)\) | Importance ratio \(\pi_\theta(o_{i,t}\mid\cdot)/\pi_{\theta_{\text{old}}}(o_{i,t}\mid\cdot)\) |
| \(\epsilon,\;\beta\) | Clip range and KL penalty coefficient |

New to \(\mathbb{E}\), \(\sim\), \(\mathrm{KL}\), or \(\nabla\)? See the math notation reference. GRPO assumes the RLHF setup; read that first for PPO, advantages, and the clipped surrogate.

1. Why GRPO? Killing the critic

PPO needs a learned value function \(V_\psi(s)\), the critic, to compute advantages \(\hat A_t = \big(\text{return}\big) - V_\psi(s_t)\). For LLMs the critic is typically another model as large as the policy, which roughly doubles the memory spent on trainable weights and optimizer state and adds a second hard-to-train objective.

GRPO's insight: the only thing the critic provides is a baseline to subtract from the reward (to reduce gradient variance). For a fixed prompt you can estimate that baseline directly — just sample several answers and use their average reward. No value network required.

Subtracting any baseline that doesn't depend on the action leaves the policy gradient unbiased, and a baseline close to the expected reward also lowers its variance. PPO learns the baseline; GRPO estimates it on the fly from a group of samples, i.e. a Monte-Carlo baseline.
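To make the baseline claim concrete, here is a small sketch that computes the exact mean and variance of a single-sample policy-gradient estimate for a three-action softmax policy, with and without a baseline. The logits, rewards, and the helper names `softmax` and `grad_stats` are illustrative assumptions, not part of GRPO itself.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_stats(logits, rewards, baseline):
    """Exact mean and variance of the single-sample estimate
    g(a) = (r_a - baseline) * d/dz_0 log pi(a), taken w.r.t. the first logit."""
    probs = softmax(logits)
    # For a softmax policy, d/dz_0 log pi(a) = 1[a == 0] - pi(0).
    samples = [(rewards[a] - baseline) * ((1.0 if a == 0 else 0.0) - probs[0])
               for a in range(len(logits))]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var

# Illustrative numbers (assumed for this example only).
logits = [0.2, -0.1, 0.4]
rewards = [1.0, 0.0, 0.5]
probs = softmax(logits)
mean_reward = sum(p * r for p, r in zip(probs, rewards))

mean_raw, var_raw = grad_stats(logits, rewards, baseline=0.0)
mean_base, var_base = grad_stats(logits, rewards, baseline=mean_reward)
print(mean_raw, mean_base)  # the two means agree up to rounding
print(var_raw, var_base)    # the baselined estimate has lower variance here
```

The expected-reward baseline used here is the quantity that GRPO approximates with the group mean.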

2. Group Sampling

For each prompt \(q\), sample a group of \(G\) outputs from the current (pre-update) policy \(\pi_{\theta_{\text{old}}}\), then score each with the reward model (or a rule-based verifier, as in math/code RL):

[Diagram: prompt \(q\) → outputs \(o_1, o_2, \ldots, o_G\) → rewards \(r_1, r_2, \ldots, r_G\)]
\[ \{o_1, o_2, \ldots, o_G\} \sim \pi_{\theta_{\text{old}}}(\,\cdot \mid q\,), \qquad r_i = \text{reward}(q, o_i) \]
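The sampling and scoring step can be sketched as follows. This is a toy illustration: `toy_policy_sample` is a hypothetical stand-in for \(\pi_{\theta_{\text{old}}}\) that picks from a few canned strings, and `exact_match_reward` is an assumed rule-based verifier, not a specific library function.

```python
import random

def sample_group(policy_sample, prompt, group_size, rng):
    """Draw `group_size` completions for one prompt from the behaviour policy."""
    return [policy_sample(prompt, rng) for _ in range(group_size)]

def exact_match_reward(prompt, output, answers):
    """Rule-based verifier: 1.0 if the stripped output equals the reference answer."""
    return 1.0 if output.strip() == answers[prompt] else 0.0

def toy_policy_sample(prompt, rng):
    # Hypothetical stand-in for pi_theta_old: chooses one of a few canned completions.
    return rng.choice(["4", "5", "4 ", "22"])

answers = {"What is 2 + 2?": "4"}
rng = random.Random(0)
q = "What is 2 + 2?"
group = sample_group(toy_policy_sample, q, group_size=8, rng=rng)
rewards = [exact_match_reward(q, o, answers) for o in group]
print(list(zip(group, rewards)))
```

In a real setup the sampler would be an LLM decoding call and the reward could equally come from a learned reward model.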

3. Group-Relative Advantage

Instead of a learned baseline, GRPO standardizes each reward against its own group — a z-score. With outcome supervision (one reward per whole output), every token of output \(i\) receives the same advantage:

Group-relative advantage (outcome supervision)
\[ \hat A_{i,t} = \tilde r_i = \frac{r_i - \operatorname{mean}(\{r_1,\ldots,r_G\})}{\operatorname{std}(\{r_1,\ldots,r_G\})} \qquad \text{for all } t \]

Subtracting the group mean is the baseline; dividing by the group standard deviation normalizes scale, so prompts of differing difficulty contribute comparable gradients. An answer better than its peers gets \(\hat A_{i,t} > 0\) (reinforce it); a below-average answer gets \(\hat A_{i,t} < 0\) (suppress it).
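A minimal sketch of the z-score computation above. Two details are assumptions of this example rather than something the text specifies: the population standard deviation is used, and a small `eps` is added to the denominator so that a group with identical rewards yields zero advantages instead of a division by zero.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Z-score rewards within one group; eps guards against a zero std."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)  # population std (assumed)
    return [(r - mean) / (std + eps) for r in rewards]

mixed = group_advantages([1.0, 0.0, 0.0, 1.0])    # two good answers, two bad ones
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])  # identical rewards
print(mixed)    # roughly +1 for the good answers, -1 for the bad ones
print(uniform)  # all zeros: the group carries no learning signal
```

With outcome supervision, each value would then be copied to every token of the corresponding output.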

This matches how reward models are trained — on comparisons between outputs of the same prompt. GRPO's group-relative scoring uses the reward model in exactly the comparative regime it was fit on.

Process supervision

If rewards are available at the end of each reasoning step (not just the final answer), normalize every step reward across the group, then set a token's advantage to the sum of normalized rewards from all steps ending at or after it:

\[ \hat A_{i,t} = \sum_{\text{step } j \,:\, \text{index}(j)\, \ge\, t} \tilde r_{i,j} \]

This credits each token with the (normalized) reward of every step it helped produce.
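The process-supervision rule can be sketched like this. The data layout is an assumption made for illustration: `step_ends[i][j]` is the 1-based index of the last token of step \(j\) in output \(i\), and step rewards are normalized with a mean and standard deviation pooled over every step of every output in the group.

```python
import math

def process_advantages(step_rewards, step_ends, lengths, eps=1e-8):
    """Per-token advantages from per-step rewards.

    step_rewards[i][j]: reward of step j of output i
    step_ends[i][j]:    1-based index of the last token of that step
    lengths[i]:         token length of output i
    """
    flat = [r for rs in step_rewards for r in rs]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((r - mean) ** 2 for r in flat) / len(flat))
    advantages = []
    for rs, ends, n in zip(step_rewards, step_ends, lengths):
        norm = [(r - mean) / (std + eps) for r in rs]
        # Token t collects every normalized step reward whose step ends at or after t.
        advantages.append([
            sum(nr for nr, end in zip(norm, ends) if end >= t)
            for t in range(1, n + 1)
        ])
    return advantages

adv = process_advantages(
    step_rewards=[[1.0, 0.0], [0.0, 1.0]],
    step_ends=[[2, 4], [1, 3]],
    lengths=[4, 3],
)
print(adv)
```

In the first output, tokens 1 and 2 see both steps (a good one and a bad one) while tokens 3 and 4 only see the bad second step, so the later tokens end up with the lower advantage.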

4. The GRPO Objective

GRPO keeps PPO's clipped surrogate but plugs in the group-relative advantage and adds the KL penalty directly to the loss (rather than folding it into the reward, as PPO does). The importance ratio is per-token:

\[ \rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})} \]

GRPO objective
\[ \mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{\,q,\,\{o_i\}_{i=1}^{G}} \left[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Big\{\, \mathcal{C}_{i,t}(\theta) \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big]_{i,t} \,\Big\} \right] \] \[ \mathcal{C}_{i,t}(\theta) = \min\!\Big( \rho_{i,t}(\theta)\,\hat A_{i,t},\;\; \mathrm{clip}\big(\rho_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat A_{i,t} \Big) \]

Reading it: average over the group \((\tfrac1G\sum_i)\), average over tokens \((\tfrac{1}{|o_i|}\sum_t)\), and at each token take the PPO clipped term \(\mathcal{C}_{i,t}\) minus a per-token KL penalty toward the reference. Maximizing \(\mathcal{J}_{\text{GRPO}}\) raises the probability of tokens in better-than-average answers while clipping bounds the step size and the KL term keeps the policy near \(\pi_{\text{ref}}\).
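The reading above translates fairly directly into code. The sketch below evaluates a Monte-Carlo estimate of \(\mathcal{J}_{\text{GRPO}}\) for one prompt from per-token log-probabilities. The function names, the nested-list layout, and the default values `clip_eps=0.2` and `beta=0.04` are assumptions chosen for illustration; the per-token KL term uses the \(u - \log u - 1\) form described in §5.

```python
import math

def clipped_term(ratio, advantage, clip_eps):
    """PPO clipped surrogate C_{i,t} for one token."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)

def k3_from_logps(logp, logp_ref):
    """Per-token KL estimate u - log u - 1 with u = pi_ref / pi_theta."""
    log_u = logp_ref - logp
    return math.exp(log_u) - log_u - 1.0

def grpo_objective(logp, logp_old, logp_ref, advantages, clip_eps=0.2, beta=0.04):
    """Estimate of J_GRPO for one group.
    Each argument is a list (one entry per output) of per-token lists."""
    total = 0.0
    for lp_i, lo_i, lr_i, adv_i in zip(logp, logp_old, logp_ref, advantages):
        per_token = [
            clipped_term(math.exp(lp - lo), a, clip_eps) - beta * k3_from_logps(lp, lr)
            for lp, lo, lr, a in zip(lp_i, lo_i, lr_i, adv_i)
        ]
        total += sum(per_token) / len(per_token)  # (1 / |o_i|) * sum over tokens
    return total / len(logp)                      # (1 / G) * sum over outputs

# Two outputs of different lengths; all three policies agree, so rho = 1 and KL = 0.
logps = [[-1.0, -2.0], [-0.5]]
advs = [[1.0, 1.0], [-1.0]]
value = grpo_objective(logps, logps, logps, advs)
print(value)
```

When the three policies coincide, the estimate reduces to the average advantage, which is zero for a z-scored group. In training, an autodiff framework would differentiate this quantity with respect to the parameters behind `logp`.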

5. The Unbiased KL Estimator

The KL term uses a low-variance, non-negative single-sample estimator (the "k3" estimator), evaluated per token on the sampled \(o_{i,t}\):

KL estimator
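\[ \mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big]_{i,t} = \frac{\pi_{\text{ref}}(o_{i,t}\mid q, o_{i,<t})}{\pi_\theta(o_{i,t}\mid q, o_{i,<t})} - \log \frac{\pi_{\text{ref}}(o_{i,t}\mid q, o_{i,<t})}{\pi_\theta(o_{i,t}\mid q, o_{i,<t})} - 1 \]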

Writing \(u = \pi_{\text{ref}}/\pi_\theta\), this is \(u - \log u - 1\), which is \(\ge 0\) for all \(u>0\) and zero only at \(u=1\) (where the policies agree). Its expectation over \(o\sim\pi_\theta\) equals the true \(\mathrm{KL}(\pi_\theta\,\|\,\pi_{\text{ref}})\), so it is an unbiased estimate that never goes negative — unlike the raw log-ratio.
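Both properties can be checked exactly on a small categorical distribution. The two three-way distributions below are arbitrary illustrative values, and `k3_from_probs` is a name chosen for this sketch.

```python
import math

def k3_from_probs(p_theta, p_ref):
    """k3 estimate u - log u - 1 with u = pi_ref / pi_theta, for one sampled token."""
    u = p_ref / p_theta
    return u - math.log(u) - 1.0

# Illustrative next-token distributions (assumed values).
pi_theta = [0.7, 0.2, 0.1]
pi_ref = [0.5, 0.3, 0.2]

true_kl = sum(p * math.log(p / r) for p, r in zip(pi_theta, pi_ref))
per_sample_k3 = [k3_from_probs(p, r) for p, r in zip(pi_theta, pi_ref)]
per_sample_log_ratio = [math.log(p / r) for p, r in zip(pi_theta, pi_ref)]
expected_k3 = sum(p * k for p, k in zip(pi_theta, per_sample_k3))

print(true_kl, expected_k3)   # equal up to rounding: the estimator is unbiased
print(per_sample_k3)          # every entry is non-negative
print(per_sample_log_ratio)   # the raw log-ratio is negative for some tokens
```

The expectation is taken under \(\pi_\theta\), which matches how the tokens are sampled only when \(\pi_\theta\) and \(\pi_{\theta_{\text{old}}}\) coincide.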

Two contrasts with PPO-style RLHF: the baseline comes from the group, not a critic; and the KL appears as an explicit loss term with this estimator, rather than as a per-token penalty mixed into the reward signal.

6. Gradient

Differentiating in the unclipped region gives a clean policy-gradient form. At the start of each update \(\pi_\theta=\pi_{\theta_{\text{old}}}\), so \(\rho_{i,t}=1\) and the clip is inactive. The KL estimator contributes the extra \(\beta\big(\tfrac{\pi_{\text{ref}}}{\pi_\theta}-1\big)\) term:

\[ \nabla_\theta \mathcal{J}_{\text{GRPO}} = \mathbb{E}\!\left[ \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \left( \hat A_{i,t} + \beta\Big(\frac{\pi_{\text{ref}}(o_{i,t}\mid\cdot)}{\pi_\theta(o_{i,t}\mid\cdot)} - 1\Big) \right) \nabla_\theta \log \pi_\theta(o_{i,t}\mid q, o_{i,<t}) \right] \]

The coefficient on each token's score-function gradient \(\nabla_\theta\log\pi_\theta\) is its advantage, nudged by the KL pull toward the reference. The familiar \(\nabla\log\pi_\theta\) factor is the same log-derivative trick from the RLHF policy-gradient section.
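As a sanity check on the coefficient, the sketch below differentiates the unclipped per-token term \(\rho\,\hat A - \beta\,(u - \log u - 1)\) numerically with respect to \(\log\pi_\theta\) at \(\pi_\theta=\pi_{\theta_{\text{old}}}\) and compares it with \(\hat A + \beta(u-1)\). The probabilities, advantage, and \(\beta\) are arbitrary illustrative values.

```python
import math

def per_token_objective(logp, logp_old, logp_ref, advantage, beta):
    """Unclipped per-token term: rho * A - beta * (u - log u - 1)."""
    rho = math.exp(logp - logp_old)
    log_u = logp_ref - logp
    return rho * advantage - beta * (math.exp(log_u) - log_u - 1.0)

# Illustrative values (assumed): pi_old = 0.3, pi_ref = 0.2 for the sampled token.
logp_old, logp_ref = math.log(0.3), math.log(0.2)
advantage, beta = 1.5, 0.04

h = 1e-6  # step for the central finite difference in log pi_theta
numeric = (
    per_token_objective(logp_old + h, logp_old, logp_ref, advantage, beta)
    - per_token_objective(logp_old - h, logp_old, logp_ref, advantage, beta)
) / (2 * h)

u = math.exp(logp_ref - logp_old)        # pi_ref / pi_theta at theta = theta_old
analytic = advantage + beta * (u - 1.0)  # coefficient from the gradient formula
print(numeric, analytic)
```

The two numbers agree to within finite-difference error, which is what the chain rule predicts: multiplying this coefficient by \(\nabla_\theta\log\pi_\theta\) gives the per-token gradient.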

7. The GRPO Loop

  1. Sample a group — for each prompt \(q\), draw \(G\) outputs from \(\pi_{\theta_{\text{old}}}\).
  2. Score — reward each output: a reward model, or a rule-based verifier (exact-match for math, unit tests for code).
  3. Normalize — compute group-relative advantages \(\hat A_{i,t}\) by z-scoring the rewards within the group (§3).
  4. Optimize — take several epochs of minibatch updates on \(\mathcal{J}_{\text{GRPO}}\) (§4), then refresh \(\pi_{\theta_{\text{old}}}\).
Models in memory: policy \(\pi_\theta\), reference \(\pi_{\text{ref}}\), and a reward model if used — but no critic. With a rule-based verifier (DeepSeek-R1-Zero), even the reward model disappears, leaving just the policy and reference.
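The four steps can be exercised end to end on a toy problem. The sketch below is a deliberately simplified illustration, not a training recipe: the "policy" is a softmax over four single-token answers, the verifier marks one of them as correct, and each group gets exactly one gradient step, so \(\rho_{i,t}=1\), clipping never activates, and the update uses the coefficient \(\hat A + \beta(\pi_{\text{ref}}/\pi_\theta - 1)\) from §6. All hyperparameters and helper names are assumptions.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def z_scores(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def grpo_step(logits, ref_probs, correct, group_size, beta, lr, rng):
    probs = softmax(logits)  # pi_theta_old equals pi_theta for this single step
    # 1. Sample a group of single-token "outputs".
    group = rng.choices(range(len(logits)), weights=probs, k=group_size)
    # 2. Score with a rule-based verifier.
    rewards = [1.0 if a == correct else 0.0 for a in group]
    # 3. Normalize within the group.
    advantages = z_scores(rewards)
    # 4. One gradient-ascent step with coefficient A + beta * (pi_ref / pi_theta - 1).
    grad = [0.0] * len(logits)
    for a, adv in zip(group, advantages):
        coef = adv + beta * (ref_probs[a] / probs[a] - 1.0)
        for k in range(len(logits)):
            # d/dz_k log softmax(z)[a] = 1[k == a] - probs[k]
            grad[k] += coef * ((1.0 if k == a else 0.0) - probs[k]) / group_size
    return [z + lr * g for z, g in zip(logits, grad)]

rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]
ref_probs = softmax(logits)  # frozen reference = initial policy
p_before = softmax(logits)[2]
for _ in range(200):
    logits = grpo_step(logits, ref_probs, correct=2, group_size=8,
                       beta=0.04, lr=0.5, rng=rng)
p_after = softmax(logits)[2]
print(p_before, p_after)  # probability of the correct answer rises
```

Groups in which every sample gets the same reward produce zero advantages, so only the KL term moves the policy on those steps. A full implementation would instead take several clipped minibatch updates per group on an LLM's token log-probabilities.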

8. GRPO vs. PPO vs. DPO

| | PPO (RLHF) | GRPO | DPO |
| --- | --- | --- | --- |
| Baseline / advantage | Learned critic \(V_\psi\) | Group mean reward (z-score) | None |
| Critic network | Yes (policy-sized) | No | No |
| Reward signal | Reward model | Reward model or verifier | Implicit (preference pairs) |
| KL handling | Folded into reward | Explicit loss term (k3) | Baked into closed form |
| Online sampling | Yes | Yes (a group per prompt) | No (offline) |
| Notable use | InstructGPT, ChatGPT | DeepSeekMath, DeepSeek-R1 | Lightweight alignment |

GRPO sits between the two: it keeps online RL and a reward signal like PPO, but removes the critic for a sampling-based baseline — much of DPO's simplicity without giving up on-policy exploration.

9. Summary

Core idea: a group of samples replaces the critic baseline

Advantage: \(\hat A_{i,t} = \frac{r_i - \operatorname{mean}(r)}{\operatorname{std}(r)}\)

Surrogate: \(\min(\rho_{i,t}\hat A_{i,t},\,\mathrm{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat A_{i,t})\)

KL term: \(u - \log u - 1,\; u=\frac{\pi_{\text{ref}}}{\pi_\theta}\), always \(\ge 0\)

No critic: removes the policy-sized value model, and its optimizer state, that PPO requires

Works with: reward models or rule-based verifiers

| Quantity | Formula |
| --- | --- |
| Group | \(\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot\mid q)\) |
| Advantage | \(\hat A_{i,t} = (r_i - \operatorname{mean}(r))/\operatorname{std}(r)\) |
| Ratio | \(\rho_{i,t} = \pi_\theta(o_{i,t}\mid\cdot)/\pi_{\theta_{\text{old}}}(o_{i,t}\mid\cdot)\) |
| Objective | \(\mathbb{E}\big[\tfrac1G\sum_i\tfrac{1}{\lvert o_i\rvert}\sum_t (\mathcal{C}_{i,t} - \beta\,\mathbb{D}_{\mathrm{KL}})\big]\) |
| KL estimator | \(\tfrac{\pi_{\text{ref}}}{\pi_\theta} - \log\tfrac{\pi_{\text{ref}}}{\pi_\theta} - 1\) |