The RL algorithm behind DeepSeek-R1: drop PPO's value network and use a group of sampled answers as the baseline, scoring each one relative to its peers.
| Symbol | Description |
|---|---|
| \(q\) | Prompt / question (the same role as \(x\) elsewhere) |
| \(o_i\) | The \(i\)-th sampled output (completion) in the group; \(|o_i|\) its token length |
| \(o_{i,t}\) | Token \(t\) of output \(i\); \(o_{i,<t}\) the tokens preceding it |
| \(G\) | Group size — number of outputs sampled per prompt |
| \(\pi_\theta\) | Policy being trained; \(\pi_{\theta_{\text{old}}}\) the policy that generated the group; \(\pi_{\text{ref}}\) frozen reference |
| \(r_i\) | Reward of output \(o_i\) (from a reward model or a verifier) |
| \(\hat A_{i,t}\) | Advantage assigned to token \(t\) of output \(i\) |
| \(\rho_{i,t}(\theta)\) | Importance ratio \(\pi_\theta(o_{i,t}\mid\cdot)/\pi_{\theta_{\text{old}}}(o_{i,t}\mid\cdot)\) |
| \(\epsilon,\;\beta\) | Clip range and KL penalty coefficient |
PPO needs a learned value function \(V_\psi(s)\), the critic, to compute advantages \(\hat A_t = \big(\text{return}\big) - V_\psi(s_t)\). For LLMs the critic is typically another model as large as the policy, which roughly doubles the memory needed for trainable models and adds a second hard-to-train objective.
GRPO's insight: the only thing the critic provides is a baseline to subtract from the reward (to reduce gradient variance). For a fixed prompt you can estimate that baseline directly — just sample several answers and use their average reward. No value network required.
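For each prompt \(q\), sample a group of \(G\) outputs from the sampling policy \(\pi_{\theta_{\text{old}}}\) (the current policy before the update), then score each with the reward model (or a rule-based verifier, as in math/code RL):
\[ \{o_1, \dots, o_G\} \sim \pi_{\theta_{\text{old}}}(\cdot\mid q), \qquad r_i = r(q, o_i), \quad i = 1, \dots, G \]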
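The sampling-and-scoring step can be sketched as follows. This is a toy illustration, not the pipeline used for any real model: `toy_policy_sample` is a hypothetical stand-in for an LLM sampler, and `exact_match_reward` is an assumed rule-based verifier that compares the last line of an output with a known answer.

```python
import random

_rng = random.Random(0)  # fixed seed so the toy example is repeatable


def toy_policy_sample(prompt):
    """Hypothetical stand-in for sampling one completion from the policy."""
    return _rng.choice(["2 + 2 = 4\n4", "2 + 2 = 5\n5", "4"])


def sample_group(prompt, policy_sample, group_size):
    """Draw `group_size` completions for a single prompt."""
    return [policy_sample(prompt) for _ in range(group_size)]


def exact_match_reward(output, gold_answer):
    """Assumed verifier: 1.0 if the final line equals the gold answer, else 0.0."""
    lines = output.strip().splitlines()
    final = lines[-1].strip() if lines else ""
    return 1.0 if final == gold_answer.strip() else 0.0


group = sample_group("What is 2 + 2?", toy_policy_sample, group_size=4)
rewards = [exact_match_reward(o, "4") for o in group]
```

With a learned reward model, `exact_match_reward` would be replaced by a model call that returns a scalar; the rest of the loop stays the same.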
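Instead of a learned baseline, GRPO standardizes each reward against its own group, i.e. it computes a z-score. With outcome supervision (one reward per whole output), every token of output \(i\) receives the same advantage:
\[ \hat A_{i,t} = \frac{r_i - \operatorname{mean}(r)}{\operatorname{std}(r)}, \qquad r = \{r_1, \dots, r_G\} \]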
Subtracting the group mean is the baseline; dividing by the group standard deviation normalizes scale, so prompts of differing difficulty contribute comparable gradients. An answer better than its peers gets \(\hat A_{i,t} > 0\) (reinforce it); a below-average answer gets \(\hat A_{i,t} < 0\) (suppress it).
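A minimal sketch of the group-relative advantage, under two assumptions that the text does not specify: the population standard deviation is used, and a small `eps` is added to the denominator so that a group with identical rewards yields zero advantages instead of a division by zero.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Z-score each reward against its own group (eps is an assumed safeguard)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, not from the text
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, lengths):
    """Outcome supervision: every token of output i gets the same advantage."""
    return [[a] * n for a, n in zip(advantages, lengths)]


rewards = [1.0, 0.0, 0.0, 1.0]          # e.g. two correct and two wrong answers
advantages = group_advantages(rewards)   # approximately [1, -1, -1, 1]
token_advantages = broadcast_to_tokens(advantages, lengths=[3, 5, 2, 4])
```

Note the degenerate case: if every output in the group gets the same reward, all advantages are zero and that prompt contributes no policy-gradient signal.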
If rewards are available at the end of each reasoning step (not just the final answer), normalize every step reward across the group, then set a token's advantage to the sum of normalized rewards from all steps ending at or after it:
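\[ \hat A_{i,t} = \sum_{\text{step } j \,:\, \text{index}(j)\, \ge\, t} \tilde r_{i,j} \]
Here \(\tilde r_{i,j}\) is the normalized reward of step \(j\) in output \(i\), and \(\text{index}(j)\) is the token position at which step \(j\) ends. This credits each token with the (normalized) reward of every step it helped produce.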
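The process-supervision rule can be sketched as a suffix sum over steps. The sketch assumes that step rewards are normalized with the mean and standard deviation of all step rewards in the group, that step end positions are 1-based token indices, and that `eps` guards against a zero standard deviation; the function and argument names are illustrative.

```python
from statistics import mean, pstdev


def process_advantages(step_rewards, step_ends, lengths, eps=1e-8):
    """
    step_rewards[i][j]: reward of step j in output i
    step_ends[i][j]:    1-based token index at which step j of output i ends
    lengths[i]:         number of tokens in output i
    """
    flat = [r for rs in step_rewards for r in rs]
    mu, sigma = mean(flat), pstdev(flat)
    result = []
    for rs, ends, n in zip(step_rewards, step_ends, lengths):
        normed = [(r - mu) / (sigma + eps) for r in rs]
        # Token t is credited with every step that ends at or after t.
        adv = [sum(nr for nr, e in zip(normed, ends) if e >= t)
               for t in range(1, n + 1)]
        result.append(adv)
    return result


adv = process_advantages(
    step_rewards=[[1.0, 0.0], [0.0, 0.0]],
    step_ends=[[2, 4], [1, 3]],
    lengths=[4, 3],
)
```

In this toy group, the first two tokens of output 0 belong to a rewarded step and therefore receive a larger advantage than its last two tokens.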
GRPO keeps PPO's clipped surrogate but plugs in the group-relative advantage and adds the KL penalty directly to the loss (rather than folding it into the reward, as PPO does). The importance ratio is per-token:
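\[ \rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})} \]
The clipped term at each token is
\[ \mathcal{C}_{i,t} = \min\!\Big(\rho_{i,t}\,\hat A_{i,t},\; \operatorname{clip}\big(\rho_{i,t},\,1-\epsilon,\,1+\epsilon\big)\,\hat A_{i,t}\Big) \]
and the full objective is
\[ \mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\!\left[ \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \Big( \mathcal{C}_{i,t} - \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta\,\|\,\pi_{\text{ref}}\big] \Big) \right] \]
Reading it: average over the group \((\tfrac1G\sum_i)\), average over tokens \((\tfrac{1}{|o_i|}\sum_t)\), and at each token take the PPO clipped term \(\mathcal{C}_{i,t}\) minus a per-token KL penalty toward the reference. Maximizing \(\mathcal{J}_{\text{GRPO}}\) raises the probability of tokens in better-than-average answers while clipping bounds the step size and the KL term keeps the policy near \(\pi_{\text{ref}}\).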
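The clipped term and the two levels of averaging can be sketched on plain lists of per-token log-probabilities. The KL penalty is left out here to isolate the clipping behavior, and `eps=0.2` is an assumed placeholder for the clip range, not a value taken from the text.

```python
import math


def clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate for a single token."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def grpo_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean over outputs of the per-output token mean (KL term omitted)."""
    per_output = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        terms = [clipped_term(a, b, adv, eps) for a, b in zip(lp_new, lp_old)]
        per_output.append(sum(terms) / len(terms))
    return sum(per_output) / len(per_output)


# Ratio 1.5 with a positive advantage is capped at 1 + eps;
# with a negative advantage the unclipped (more pessimistic) value is kept.
gain_capped = clipped_term(math.log(1.5), 0.0, advantage=1.0)
loss_uncapped = clipped_term(math.log(1.5), 0.0, advantage=-1.0)
```

The asymmetry is the point of the `min`: the objective never rewards moving the ratio beyond the clip range in the favorable direction, but it still penalizes moves in the unfavorable one.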
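The KL term uses a low-variance, non-negative single-sample estimator (the "k3" estimator), evaluated per token on the sampled \(o_{i,t}\):
\[ \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta\,\|\,\pi_{\text{ref}}\big] = \frac{\pi_{\text{ref}}(o_{i,t}\mid q, o_{i,<t})}{\pi_\theta(o_{i,t}\mid q, o_{i,<t})} - \log\frac{\pi_{\text{ref}}(o_{i,t}\mid q, o_{i,<t})}{\pi_\theta(o_{i,t}\mid q, o_{i,<t})} - 1 \]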
Writing \(u = \pi_{\text{ref}}/\pi_\theta\), this is \(u - \log u - 1\), which is \(\ge 0\) for all \(u>0\) and zero only at \(u=1\) (where the policies agree). Its expectation over \(o\sim\pi_\theta\) equals the true \(\mathrm{KL}(\pi_\theta\,\|\,\pi_{\text{ref}})\), so it is an unbiased estimate that never goes negative — unlike the raw log-ratio.
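Both properties can be checked numerically on a small categorical distribution. The sketch below computes the expectation of the estimator exactly (a weighted sum over the vocabulary instead of sampling); the two three-token distributions are made-up values for illustration.

```python
import math


def k3(logp_theta, logp_ref):
    """Per-token estimator u - log(u) - 1 with u = pi_ref / pi_theta."""
    log_u = logp_ref - logp_theta
    return math.exp(log_u) - log_u - 1.0


def exact_kl(p, q):
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


pi_theta = [0.7, 0.2, 0.1]   # illustrative policy distribution over 3 tokens
pi_ref = [0.5, 0.3, 0.2]     # illustrative reference distribution

per_token = [k3(math.log(p), math.log(q)) for p, q in zip(pi_theta, pi_ref)]
expected_k3 = sum(p * v for p, v in zip(pi_theta, per_token))
true_kl = exact_kl(pi_theta, pi_ref)
```

Here `expected_k3` and `true_kl` agree up to floating-point error, and every entry of `per_token` is non-negative, whereas the raw log-ratio `log(pi_theta / pi_ref)` is negative for the second and third tokens.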
Differentiating (in the unclipped region, \(\rho_{i,t}=1\) at the start of each update since \(\pi_\theta=\pi_{\theta_{\text{old}}}\)) gives a clean policy-gradient form. The KL estimator contributes the extra \(\beta\big(\tfrac{\pi_{\text{ref}}}{\pi_\theta}-1\big)\) term:
\[ \nabla_\theta \mathcal{J}_{\text{GRPO}} = \mathbb{E}\!\left[ \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \left( \hat A_{i,t} + \beta\Big(\frac{\pi_{\text{ref}}(o_{i,t}\mid\cdot)}{\pi_\theta(o_{i,t}\mid\cdot)} - 1\Big) \right) \nabla_\theta \log \pi_\theta(o_{i,t}\mid q, o_{i,<t}) \right] \]
| | PPO (RLHF) | GRPO | DPO |
|---|---|---|---|
| Baseline / advantage | Learned critic \(V_\psi\) | Group mean reward (z-score) | — |
| Critic network | Yes (policy-sized) | No | No |
| Reward signal | Reward model | Reward model or verifier | Implicit (preference pairs) |
| KL handling | Folded into reward | Explicit loss term (k3) | Baked into closed form |
| Online sampling | Yes | Yes (a group per prompt) | No (offline) |
| Notable use | InstructGPT, ChatGPT | DeepSeekMath, DeepSeek-R1 | Lightweight alignment |
GRPO sits between PPO and DPO: it keeps online RL and a reward signal like PPO, but replaces the critic with a sampling-based baseline, gaining much of DPO's simplicity without giving up on-policy exploration.
Group of samples replaces the critic baseline
\(\hat A_{i,t} = \frac{r_i - \operatorname{mean}(r)}{\operatorname{std}(r)}\)
\(\min\big(\rho_{i,t}\hat A_{i,t},\,\mathrm{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat A_{i,t}\big)\)
\(u - \log u - 1,\; u=\frac{\pi_{\text{ref}}}{\pi_\theta}\) — always \(\ge 0\)
Roughly halves trainable-model memory vs. PPO (no policy-sized critic)
Reward models or rule-based verifiers
| Quantity | Formula |
|---|---|
| Group | \(\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot\mid q)\) |
| Advantage | \(\hat A_{i,t} = (r_i - \operatorname{mean}(r))/\operatorname{std}(r)\) |
| Ratio | \(\rho_{i,t} = \pi_\theta(o_{i,t}\mid\cdot)/\pi_{\theta_{\text{old}}}(o_{i,t}\mid\cdot)\) |
| Objective | \(\mathbb{E}\big[\tfrac1G\sum_i\tfrac{1}{|o_i|}\sum_t (\mathcal{C}_{i,t} - \beta\,\mathbb{D}_{\mathrm{KL}})\big]\) |
| KL estimator | \(\tfrac{\pi_{\text{ref}}}{\pi_\theta} - \log\tfrac{\pi_{\text{ref}}}{\pi_\theta} - 1\) |
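The rows of the summary table can be combined into one function that evaluates the objective for a single prompt. This is a sketch for outcome supervision on plain Python lists, not a training implementation: `clip_eps`, `beta` and `std_eps` are placeholder values, and the log-probabilities are assumed to be supplied per token by the caller.

```python
import math
from statistics import mean, pstdev


def grpo_objective(logp_new, logp_old, logp_ref, rewards,
                   clip_eps=0.2, beta=0.04, std_eps=1e-8):
    """
    logp_*[i][t]: log-probability of token t of output i under the policy
                  being trained, the sampling policy, and the reference.
    rewards[i]:   scalar reward of output i.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    total = 0.0
    for lp_new, lp_old, lp_ref, r in zip(logp_new, logp_old, logp_ref, rewards):
        adv = (r - mu) / (sigma + std_eps)       # group-relative advantage
        acc = 0.0
        for a, b, c in zip(lp_new, lp_old, lp_ref):
            ratio = math.exp(a - b)              # importance ratio
            clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
            surrogate = min(ratio * adv, clipped * adv)
            log_u = c - a                        # log(pi_ref / pi_theta)
            kl = math.exp(log_u) - log_u - 1.0   # k3 estimator
            acc += surrogate - beta * kl
        total += acc / len(lp_new)               # mean over tokens
    return total / len(rewards)                  # mean over the group


same = [[-1.0, -2.0], [-0.5]]
shifted_ref = [[-1.5, -1.0], [-0.9]]
obj_no_kl = grpo_objective(same, same, same, rewards=[1.0, 0.0])
obj_with_kl = grpo_objective(same, same, shifted_ref, rewards=[1.0, 0.0])
```

When all three policies coincide, the ratio is 1 and the KL term vanishes, so the objective reduces to the mean advantage, which is zero by construction. Moving the reference away from the policy lowers the objective through the penalty.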