DeepSeek-R1 successfully utilizes Group Relative Policy Optimization (GRPO) to improve LLM reasoning capability. This article dives deeper into GRPO, studying its motivation and mechanisms. It also summarizes interesting DeepSeek findings on applying RL to LLMs.

DeepSeek's Group Relative Policy Optimization (GRPO) and findings/thoughts on RL

In both DeepSeekMath (https://arxiv.org/abs/2402.03300) and DeepSeek-R1 (https://arxiv.org/abs/2501.12948), Group Relative Policy Optimization, or GRPO, has been utilized to improve LLM reasoning capability in long responses. Especially on math and coding tasks, GRPO has demonstrated convincing performance, enabling DeepSeek models to obtain performance on par with GPT-4 and Sonnet-3.5. This article offers a brief analysis of GRPO and a review of DeepSeek's findings on utilizing RL.

What is GRPO and why is it preferred?

First, let's briefly review PPO. Let $q$ be a query, $P(Q)$ be the query distribution, $\pi_{\text{ref}}$ be a reference model, and $o \sim \pi_{\theta_{\text{old}}}(O \mid q)$ be a response sampled from the old policy $\pi_{\theta_{\text{old}}}$ (the policy before the current update). The PPO objective is defined as

(figure: the PPO objective)
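A reconstruction of that objective, following the formulation in DeepSeekMath (treat the exact notation as my paraphrase rather than a quote; $\varepsilon$ is the clipping hyper-parameter and $\pi_{\theta_{\text{old}}}$ is the policy that generated the roll-outs):

$$
\mathcal{J}_{\text{PPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, o \sim \pi_{\theta_{\text{old}}}(O \mid q)} \left[ \frac{1}{|o|} \sum_{t=1}^{|o|} \min\left( \frac{\pi_{\theta}(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})} A_t,\ \text{clip}\left( \frac{\pi_{\theta}(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})},\, 1-\varepsilon,\, 1+\varepsilon \right) A_t \right) \right]
$$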

Here, $A_t$ is the advantage, which assigns an advantage value to each output token $o_t$; it is typically computed with Generalized Advantage Estimation (GAE) from the per-token rewards $r_{\geq t}$ and a learned value function $V_{\phi}$, which is usually a model of the same size as $\pi_{\theta}$. So overall, PPO needs at least three models: the policy $\pi_{\theta}$, the reference model $\pi_{\text{ref}}$ (used for the KL penalty on the reward), and the value model $V_{\phi}$, plus potentially a reward model $r$. This makes PPO memory-intensive.
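As a minimal illustration (my own sketch, not DeepSeek's code; the tensor names `logp_new`, `logp_old`, `advantages`, and `mask` are assumed), the per-token clipped PPO loss can be written as:

```python
import torch

def ppo_token_loss(logp_new, logp_old, advantages, mask, clip_eps=0.2):
    """Clipped PPO surrogate loss, averaged over response tokens.

    logp_new:   log pi_theta(o_t | q, o_<t), shape (B, T)
    logp_old:   log pi_theta_old(o_t | q, o_<t), shape (B, T), no grad
    advantages: per-token advantages A_t, shape (B, T)
    mask:       1.0 for response tokens, 0.0 for padding, shape (B, T)
    """
    ratio = torch.exp(logp_new - logp_old)                        # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token = torch.min(unclipped, clipped)                     # pessimistic (clipped) bound
    # PPO maximizes the surrogate objective; return its negation as a loss.
    return -(per_token * mask).sum() / mask.sum()
```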

The motivation for GRPO is to reduce this memory footprint, which it does by removing the value model $V_{\phi}$. The difference between PPO and GRPO is shown below.

(figure: PPO vs. GRPO overview)

Specifically, instead of using a learned value function to estimate the value-to-go and compute the advantage, GRPO builds the baseline from the rewards of a group of roll-out trajectories sampled for the same query. Let $\{o_1, o_2, \ldots, o_G\}$ be the roll-out trajectories (responses) and $\mathbf{R} = \{\{r_1^{\text{index}(1)}, \ldots, r_1^{\text{index}(K_1)}\}, \ldots, \{r_G^{\text{index}(1)}, \ldots, r_G^{\text{index}(K_G)}\}\}$ the step-level rewards, where $\text{index}(j)$ is the end-token index of the $j$-th reasoning step and $K_i$ is the number of steps in the $i$-th trajectory. Each reward is normalized within the group, \(\hat r_i^{\text{index}(j)} = \frac{r_i^{\text{index}(j)} - \text{mean}(\mathbf{R})}{\text{std}(\mathbf{R})}\), and the advantage of token $t$ in the $i$-th trajectory is the sum of the normalized rewards of all following steps: \(\hat A_{i, t} = \sum_{\text{index}(j) \geq t} \hat r_i^{\text{index}(j)}\). With outcome supervision (a single reward per response, as in DeepSeek-R1), every token of $o_i$ simply takes the group-normalized response reward as its advantage.
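Here is a minimal sketch of the group-relative advantage computation for the outcome-supervised case (my own illustration with assumed names such as `grpo_outcome_advantages` and `seq_lens`, not DeepSeek's code):

```python
import torch

def grpo_outcome_advantages(rewards, seq_lens, eps=1e-6):
    """Group-relative advantages for one query with outcome supervision.

    rewards:  list of G scalar rewards, one per sampled response
    seq_lens: list of G response lengths (in tokens)
    Returns a list of G per-token advantage tensors.
    """
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    # The group mean acts as the baseline, replacing PPO's learned value model.
    norm = (rewards - rewards.mean()) / (rewards.std() + eps)
    # With a single outcome reward, every token of a response shares
    # the same normalized reward as its advantage.
    return [torch.full((int(L),), norm[i].item()) for i, L in enumerate(seq_lens)]

# Example: 4 roll-outs for one query, reward 1.0 if the final answer is correct.
advantages = grpo_outcome_advantages(rewards=[1.0, 0.0, 0.0, 1.0], seq_lens=[12, 8, 30, 15])
```

Because the baseline comes from group statistics alone, no extra value network has to be trained or kept in GPU memory, which is exactly the saving GRPO is after.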

A unified view of RL (for LLMs)

In the DeepSeekMath work, the authors consider popular LLM fine-tuning algorithms, including SFT, rejection sampling fine-tuning (RFT), DPO, online RFT, PPO, and GRPO, and provide a unified formulation of these algorithms:

(figure: a unified paradigm for LLM fine-tuning methods)
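As I recall the formulation from DeepSeekMath (so treat the exact symbols as a reconstruction rather than a quote), every method's gradient is written as a weighted log-likelihood gradient over data drawn from a source $\mathcal{D}$:

$$
\nabla_{\theta} \mathcal{J}_{\mathcal{A}}(\theta) = \mathbb{E}_{(q, o) \sim \mathcal{D}} \left[ \frac{1}{|o|} \sum_{t=1}^{|o|} GC_{\mathcal{A}}(q, o, t, \pi_{rf}) \, \nabla_{\theta} \log \pi_{\theta}(o_t \mid q, o_{<t}) \right]
$$

where the gradient coefficient $GC_{\mathcal{A}}$ is determined by the algorithm $\mathcal{A}$ and the reward function $\pi_{rf}$.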

Here, the data source $\mathcal{D}$ determines the training data (online or offline); the reward function $\pi_{rf}$ provides the reward signal (rule-based or model-based); and the algorithm $\mathcal{A}$ mostly corresponds to the loss function, which is responsible for computing the gradient signal based on the training data and the reward signals. Their summary of the popular algorithms is as follows:

(table: the data source, reward function, and gradient coefficient of each algorithm)
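To make this concrete, here is a small sketch (my own illustration, with assumed tensor names) of the unified loss: a per-token log-likelihood weighted by a gradient coefficient. Setting the coefficient to 1 everywhere recovers SFT, while plugging in group-relative advantages gives a GRPO-style update (ignoring the KL term).

```python
import torch

def unified_loss(token_logps, grad_coef, mask):
    """Unified view: a log-likelihood loss weighted per token by a gradient coefficient.

    token_logps: log pi_theta(o_t | q, o_<t), shape (B, T), requires grad
    grad_coef:   per-token gradient coefficient GC_A, shape (B, T)
                 (1.0 everywhere ~ SFT; group-relative advantages ~ GRPO, KL term omitted)
    mask:        1.0 for response tokens, 0.0 for padding, shape (B, T)
    """
    weighted = grad_coef.detach() * token_logps * mask
    # Negate so that maximizing the weighted log-likelihood becomes a standard loss.
    return -weighted.sum() / mask.sum()

# SFT as the special case GC = 1 on every response token:
# loss = unified_loss(token_logps, torch.ones_like(token_logps), mask)
```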

Based on this unified view, different tasks, such as instruction tuning, distillation, alignment, and reasoning, can be viewed as utilizing different RL algorithms to learn from the data.

Lessons learned when using different RL algorithms

These are some key takeaways:

(Figure 6)

What to improve?