Author: Hongyu Zang

Summary

Training large language models with reinforcement learning (RL) on reasoning tasks has emerged as a promising paradigm for mastering complex problem-solving skills. Recent advances, most notably DeepSeek's R1-Zero [1], have demonstrated impressive training-time scaling. This has heightened interest in the RL algorithm behind it, Group Relative Policy Optimization (GRPO), an adaptation of the Proximal Policy Optimization (PPO) framework.

GRPO is characterized by two principal innovations:

1. It drops PPO's separate value (critic) model: the advantage of each sampled response is estimated from the rewards of a group of responses generated for the same query, normalized within that group.
2. It moves the KL penalty against the reference policy out of the reward and adds it to the loss directly as a regularization term.

Importantly, the KL penalty that sits inside the reward cannot simply be moved over to serve as a regularization loss term, especially in off-policy settings: doing so correctly requires computing the KL term over the entire vocabulary rather than estimating it from the sampled trajectory alone.
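
To make this concrete, here is a minimal PyTorch-style sketch. The function names, tensor shapes, and the assumption that we have full logits for both $\pi_\theta$ and $\pi_\text{ref}$ are mine, not from [1] or [2]. The first function is the usual Monte Carlo estimate evaluated on the sampled tokens; the second computes the KL exactly by summing over the whole vocabulary, which stays valid even when the samples came from $\pi_{\theta_\text{old}}$ rather than $\pi_\theta$.

```python
import torch
import torch.nn.functional as F

def sampled_kl_estimate(logits_theta, logits_ref, sampled_ids):
    """Per-token log-ratio log pi_theta - log pi_ref evaluated only on the
    sampled token ids. As a KL estimate this is only unbiased when the ids
    were actually drawn from pi_theta."""
    logp_theta = F.log_softmax(logits_theta, dim=-1)
    logp_ref = F.log_softmax(logits_ref, dim=-1)
    lp_t = torch.gather(logp_theta, -1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    lp_r = torch.gather(logp_ref, -1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    return lp_t - lp_r                      # shape: (batch, seq_len)

def full_vocab_kl(logits_theta, logits_ref):
    """Exact KL(pi_theta || pi_ref) at each position, summed over the whole
    vocabulary. Valid no matter which policy generated the trajectory."""
    logp_theta = F.log_softmax(logits_theta, dim=-1)
    logp_ref = F.log_softmax(logits_ref, dim=-1)
    return (logp_theta.exp() * (logp_theta - logp_ref)).sum(dim=-1)
```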

Theory recap

To highlight this crucial detail, we first need to differentiate between Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO). Using the formulas provided in the GRPO paper [2], we can briefly outline the differences.

The loss function in PPO is defined as follows:

$$ \mathcal{J}_{\text{PPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, o \sim \pi_{\theta_\text{old}}(O \mid q)} \left[ \frac{1}{|o|} \sum_{t=1}^{|o|} \min\left( \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_\text{old}}(o_t \mid q, o_{<t})} A_t,\; \text{clip}\left( \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_\text{old}}(o_t \mid q, o_{<t})},\, 1-\varepsilon,\, 1+\varepsilon \right) A_t \right) \right] $$
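
As a reference point, a per-token implementation of this clipped surrogate might look like the sketch below; the tensor names and the convention of returning a loss to be minimized are my assumptions.

```python
import torch

def ppo_token_loss(logp_theta, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate, averaged over response tokens.
    logp_theta, logp_old: per-token log pi(o_t | q, o_<t) under the current
    and the sampling (old) policy; advantages: per-token A_t."""
    ratio = torch.exp(logp_theta - logp_old)               # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negate because the objective J_PPO is maximized, while optimizers minimize.
    return -torch.min(unclipped, clipped).mean()
```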

The loss of GRPO is:

$$ \mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_\text{old}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t},\; \text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid q, o_{i,<t})},\, 1-\varepsilon,\, 1+\varepsilon \right) \hat{A}_{i,t} \right) - \beta\, \text{KL}\left( \pi_\theta, \pi_\text{ref} \right) \right) \right] $$
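
In code, the group-relative part amounts to normalizing the scalar rewards within each group of $G$ responses and broadcasting the result to every token, while the KL to $\pi_\text{ref}$ is subtracted in the loss rather than folded into the reward. The sketch below assumes per-token tensors of shape (G, T) and a precomputed `kl_to_ref` term (e.g., the full-vocabulary KL shown earlier); the names and default coefficients are illustrative, not taken from [2].

```python
import torch

def grpo_loss(logp_theta, logp_old, kl_to_ref, group_rewards,
              clip_eps=0.2, beta=0.04):
    """GRPO-style loss for one query with G sampled responses.
    logp_theta, logp_old, kl_to_ref: (G, T) per-token tensors.
    group_rewards: (G,) scalar reward for each sampled response."""
    # Group-relative advantage: normalize rewards within the group and give
    # every token of response i the same advantage A_hat_i.
    adv = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
    adv = adv.unsqueeze(-1).expand_as(logp_theta)          # (G, T)

    ratio = torch.exp(logp_theta - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    surrogate = torch.min(unclipped, clipped)

    # The KL penalty enters the loss directly instead of shaping the reward.
    return -(surrogate - beta * kl_to_ref).mean()
```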

It is noteworthy that, irrespective of the chosen RL algorithm, both sample trajectories (query-response pairs) from the old policy $\pi_{\theta_\text{old}}$. This means both approaches are inherently off-policy: the distribution from which samples are drawn differs from the policy $\pi_\theta$ being optimized. Keep this off-policy characteristic in mind, as it underpins the subsequent derivation.

Now, let us examine the KL divergence terms within both algorithms.

In the PPO setting, the KL divergence between $\pi_{\theta_\text{old}}$ and $\pi_\text{ref}$ is incorporated into the reward function, represented as:

$$ R = r(q,o) - \text{KL}(\pi_{\theta_\text{old}}, \pi_\text{ref}) $$
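
In implementation terms, this means the per-token KL estimate is subtracted from the reward before advantages are computed. A minimal sketch follows; the tensor names and the coefficient are assumptions on my part.

```python
import torch

def shape_rewards_with_kl(task_reward, logp_old, logp_ref, kl_coef=0.1):
    """Fold the KL penalty into the reward, PPO-style.
    task_reward: (B, T) rewards, often zero except at the final token.
    logp_old, logp_ref: (B, T) log-probs of the sampled tokens under
    pi_theta_old and pi_ref."""
    # Monte Carlo per-token KL estimate on the sampled trajectory. This is
    # legitimate here precisely because the trajectory was sampled from
    # pi_theta_old, the first argument of the KL term.
    kl_est = logp_old - logp_ref
    return task_reward - kl_coef * kl_est
```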