Policy Optimization: From Vanilla Policy Gradients to GRPO

Ziyi Zhu

March 09, 2025

In the world of reinforcement learning, policy gradient methods have become the cornerstone of many state-of-the-art algorithms. In this blog post, we'll explore the evolution from vanilla policy gradients to modern approaches like Proximal Policy Optimization (PPO) and the more recent Group Relative Policy Optimization (GRPO). These methods have proven particularly effective for fine-tuning language models through reinforcement learning from human feedback (RLHF).

The Foundation: Vanilla Policy Gradient

To understand modern policy optimization methods, we need to start with the basics. The core idea of policy gradient methods is simple: we want to optimize a parameterized policy $\pi_{\theta}$ to maximize the expected return $J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)]$.

Brief Derivation

The policy gradient theorem gives us an analytical expression for $\nabla_{\theta} J(\pi_{\theta})$, which we can use for gradient ascent:

$$\theta_{k+1} = \theta_k + \alpha \left. \nabla_{\theta} J(\pi_{\theta}) \right|_{\theta_k}$$

Let's derive the simplest form of the policy gradient. First, we note that the probability of a trajectory $\tau = (s_0, a_0, \dots, s_{T+1})$ under policy $\pi_{\theta}$ is:

$$P(\tau|\theta) = \rho_0(s_0) \prod_{t=0}^{T} P(s_{t+1}|s_t, a_t)\, \pi_{\theta}(a_t|s_t)$$

Using the log-derivative trick and noting that the environment dynamics don't depend on $\theta$, we get:

$$\begin{aligned} \nabla_{\theta} J(\pi_{\theta}) &= \nabla_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)] \\ &= \nabla_{\theta} \int_{\tau} P(\tau|\theta) R(\tau) \\ &= \int_{\tau} \nabla_{\theta} P(\tau|\theta) R(\tau) \\ &= \int_{\tau} P(\tau|\theta) \nabla_{\theta} \log P(\tau|\theta) R(\tau) \\ &= \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\nabla_{\theta} \log P(\tau|\theta) R(\tau)\right] \\ &= \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) R(\tau)\right] \end{aligned}$$

This gives us the vanilla policy gradient, which we can estimate with a sample mean from collected trajectories:

$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) R(\tau)$$
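
As a concrete illustration, here is a minimal PyTorch sketch of this estimator; the `policy` network, the trajectory format, and the helper name `reinforce_loss` are assumptions made for the example, not part of any particular library:

```python
import torch

def reinforce_loss(policy, trajectories):
    """Vanilla policy gradient (REINFORCE) surrogate loss.

    `trajectories` is assumed to be a list of (states, actions, total_return)
    tuples collected by rolling out the current policy.
    Minimizing this loss performs gradient ascent on J(pi_theta).
    """
    losses = []
    for states, actions, total_return in trajectories:
        dist = torch.distributions.Categorical(logits=policy(states))
        log_probs = dist.log_prob(actions)                 # log pi_theta(a_t | s_t)
        losses.append(-(log_probs.sum() * total_return))   # R(tau) weights the whole trajectory
    return torch.stack(losses).mean()                      # average over the |D| trajectories
```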

Improving the Vanilla Policy Gradient

The policy gradient has a general form:

$$\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\, \Phi_t\right]$$

where $\Phi_t$ is a weighting term that could be any of:

  1. Total trajectory return: $\Phi_t = R(\tau)$
  2. Future return from time $t$: $\Phi_t = \sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1})$
  3. Future return minus a baseline: $\Phi_t = \sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1}) - b(s_t)$ (see the sketch after this list)
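
As an illustration of option 3, here is a minimal PyTorch sketch that computes the reward-to-go and subtracts a baseline; the function name and the choice of the mean as a fallback baseline are assumptions made for the example:

```python
import torch

def reward_to_go_advantages(rewards, values=None):
    """Compute Phi_t = sum_{t' >= t} r_{t'} - b(s_t) for one trajectory.

    `rewards` is a 1-D tensor of per-step rewards; `values` is an optional
    per-step baseline b(s_t) (e.g. a learned value estimate). If omitted,
    a constant baseline (the mean reward-to-go) is used for illustration.
    """
    # Cumulative sum from the end gives the future return at each step.
    reward_to_go = torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])
    baseline = values if values is not None else reward_to_go.mean()
    return reward_to_go - baseline
```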

Using Advantage Functions

Two particularly important choices for $\Phi_t$ are:

  1. On-Policy Action-Value Function: $\Phi_t = Q^{\pi_{\theta}}(s_t, a_t)$
  2. Advantage Function: $\Phi_t = A^{\pi_{\theta}}(s_t, a_t) = Q^{\pi_{\theta}}(s_t, a_t) - V^{\pi_{\theta}}(s_t)$

The advantage function $A^{\pi_{\theta}}(s_t, a_t)$ is particularly useful as it describes how much better (or worse) an action is compared to the average action in that state. Using the advantage function reduces variance in the policy gradient estimate without introducing bias.

Proximal Policy Optimization (PPO)

PPO has become one of the most popular policy gradient methods due to its simplicity and effectiveness. It requires three main components:

  1. Policy ($\pi_{\theta}$): The model we're optimizing (in RLHF, this is the language model)
  2. Reward model ($R_{\phi}$): A trained network that assigns a scalar reward to a completed response
  3. Critic ($V_{\gamma}$): A value function that estimates the expected final reward from partial states

Generalized Advantage Estimation (GAE)

A key ingredient in PPO is Generalized Advantage Estimation (GAE), which balances bias and variance in advantage estimation:

$$A^{GAE}_K = \delta_0 + \lambda\delta_1 + \lambda^2\delta_2 + \dots + \lambda^{K-1}\delta_{K-1} = \sum_{t=0}^{K-1}\lambda^t\delta_t$$

where $\delta_t$ is the temporal difference (TD) error (in token-level RLHF the intermediate rewards are typically zero, so the reward term only contributes at the final token):

$$\delta_t = R(s_t, a_t, s_{t+1}) + V_{\gamma}(s_{t+1}) - V_{\gamma}(s_t)$$

The parameter $\lambda$ controls the trade-off between bias and variance:

  • When $\lambda = 0$, GAE reduces to single-step TD (lower variance, higher bias)
  • When $\lambda = 1$, GAE becomes Monte Carlo estimation (higher variance, lower bias)
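
A minimal PyTorch sketch of this estimator, assuming per-step rewards and critic values for a single trajectory (with one extra bootstrap value) and following the undiscounted formulas above:

```python
import torch

def gae_advantages(rewards, values, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: tensor of shape [T]      -- per-step rewards
    values:  tensor of shape [T + 1]  -- critic estimates V(s_0 .. s_T)
    lam:     the lambda controlling the bias/variance trade-off
    """
    deltas = rewards + values[1:] - values[:-1]   # TD errors delta_t
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # A_t = delta_t + lam * A_{t+1}; A_0 equals the K-step sum above with K = T
        gae = deltas[t] + lam * gae
        advantages[t] = gae
    return advantages
```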

For a deeper dive into value function approximation and GAE techniques, check out this explainer on TD(λ).

PPO Objective Components

The PPO objective consists of several components at the token level:

1. Clipped Surrogate Objective

$$L_{clip}(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\min\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}A^{GAE}_t,\; \text{clip}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\right)A^{GAE}_t\right)$$

The clipped surrogate objective is a critical innovation in PPO. It limits the size of policy updates to prevent excessive changes that could destabilize training. The ratio $\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ measures how much the policy has changed since the last update. When this ratio exceeds $1+\epsilon$ (for positive advantages) or falls below $1-\epsilon$ (for negative advantages), the clipping function restricts the objective value, effectively discouraging further changes in that direction. This prevents the notorious "policy collapse" problem where large updates can irreparably damage a previously well-performing policy.
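
A minimal PyTorch sketch of the clipped term, assuming the per-token log-probabilities of the sampled actions have already been gathered from both the current and the rollout policy:

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate loss (to be minimized).

    logp_new:   log pi_theta(a_t|s_t) under the current policy, shape [T]
    logp_old:   log pi_theta_old(a_t|s_t) from the rollout policy, shape [T]
    advantages: A_t^GAE per token, shape [T]
    """
    ratio = torch.exp(logp_new - logp_old.detach())       # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```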

2. KL Divergence Penalty

$$\text{KL}(\theta) = \frac{1}{T}\sum_{t=1}^{T} D_{KL}\left(\pi_{\theta_{ref}}(\cdot|s_t)\,\|\,\pi_\theta(\cdot|s_t)\right)$$

The KL divergence penalty serves as a soft constraint to keep the updated policy close to the original reference policy. This is particularly important in RLHF contexts where we want to improve the model's behavior according to human preferences without losing the knowledge and capabilities acquired during pretraining. The KL term measures the "distance" between probability distributions, penalizing the model when it diverges substantially from its original behavior. This helps preserve the model's linguistic capabilities while still allowing targeted improvements in specific behaviors.
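
A minimal PyTorch sketch of this penalty, assuming access to the reference and current models' logits at each token position and computing the full KL over the vocabulary:

```python
import torch
import torch.nn.functional as F

def kl_penalty(ref_logits, new_logits):
    """Mean KL(pi_ref || pi_theta) over token positions.

    ref_logits, new_logits: shape [T, vocab_size]
    """
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    new_logp = F.log_softmax(new_logits, dim=-1)
    # KL(p || q) = sum_a p(a) * (log p(a) - log q(a))
    kl_per_token = (ref_logp.exp() * (ref_logp - new_logp)).sum(dim=-1)
    return kl_per_token.mean()
```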

3. Entropy Bonus

$$H(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{a \sim \pi_\theta(\cdot|s_t)}\left[\log\pi_\theta(a|s_t)\right]$$

The entropy bonus encourages exploration by rewarding policies that maintain diversity in their action distributions. In language models, this prevents the policy from becoming overly deterministic and always choosing the same tokens in similar contexts. Higher entropy means the model assigns meaningful probability to a wider range of tokens, maintaining creativity and diversity in outputs. This is especially important in early training stages to prevent premature convergence to suboptimal policies.
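
A minimal PyTorch sketch of the entropy term, computed from the current policy's logits (the function name is an assumption for the example):

```python
import torch
import torch.nn.functional as F

def entropy_bonus(new_logits):
    """Mean entropy of pi_theta(.|s_t) over token positions.

    new_logits: shape [T, vocab_size]
    """
    logp = F.log_softmax(new_logits, dim=-1)
    entropy_per_token = -(logp.exp() * logp).sum(dim=-1)   # H = -sum_a pi(a) log pi(a)
    return entropy_per_token.mean()
```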

4. Value Function Approximation

The critic loss component of the PPO objective serves to train the value function approximator $V_{\gamma}$, which is crucial for accurately estimating advantages. This component is given by:

$$L(\gamma) = \mathbb{E}_t\left[\left(V_\gamma(s_t) - \text{sg}(R_\phi(s_T))\right)^2\right]$$

where $V_\gamma(s_t)$ is the critic's prediction of the expected return from state $s_t$, $R_\phi(s_T)$ is the reward model's output for the completed trajectory, and $\text{sg}$ represents the stop-gradient operation.

This squared-error loss trains the critic to accurately predict the final reward that will be assigned by the reward model, but using only partial information (the state at time $t$ rather than the complete trajectory). As the policy $\pi_\theta$ changes during training, the expected returns from each state also change, requiring the critic to be continuously updated alongside the policy.
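
A minimal PyTorch sketch of this critic loss, assuming per-token value predictions and a single scalar reward tensor from the reward model:

```python
import torch

def critic_loss(values, final_reward):
    """Squared-error loss for the value head V_gamma.

    values:       V_gamma(s_t) for each token position, shape [T]
    final_reward: scalar tensor with the reward model's score R_phi(s_T)
    """
    target = final_reward.detach()          # stop-gradient on the reward target
    return ((values - target) ** 2).mean()
```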

The Full PPO Objective

Combining the components above gives a single loss to minimize:

$$L_{PPO}(\theta, \gamma) = L_{clip}(\theta) - w_1 H(\theta) + w_2 \text{KL}(\theta) + w_3 L(\gamma)$$

Minimizing $L_{PPO}$ maximizes the clipped surrogate objective and the entropy bonus while penalizing divergence from the reference policy and inaccurate value predictions.
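
Assembling the pieces, and reusing the helper sketches above, a hypothetical training step might combine them like this (the weights shown are placeholders, not recommended values):

```python
def ppo_loss(logp_new, logp_old, advantages, ref_logits, new_logits,
             values, final_reward, w1=0.01, w2=0.1, w3=0.5, eps=0.2):
    """Full PPO loss to minimize: clipped surrogate, minus the entropy bonus,
    plus the KL penalty and the critic loss (see the sketches above)."""
    return (clipped_surrogate_loss(logp_new, logp_old, advantages, eps)
            - w1 * entropy_bonus(new_logits)
            + w2 * kl_penalty(ref_logits, new_logits)
            + w3 * critic_loss(values, final_reward))
```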

Group Relative Policy Optimization (GRPO)

GRPO simplifies PPO by taking a different approach to advantage estimation. Instead of using a critic network, GRPO estimates advantages by comparing rewards across multiple responses sampled from the same prompt.

Advantage Calculation in GRPO

For each prompt $q$, GRPO samples $G$ completions. The advantage for each completion $i$ and token $t$ is calculated as:

$$A_{i,t} = \frac{r_i - \text{mean}(r)}{\text{std}(r)}$$

where $r_i$ is the reward for completion $i$, and $\text{mean}(r)$ and $\text{std}(r)$ are the mean and standard deviation of rewards across all $G$ completions in the group. Importantly, while the advantage is calculated at the completion level (a single value per completion), it is applied to each token $t$ in the completion during the optimization process.
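
A minimal PyTorch sketch of this computation; the small `eps` term guards against a zero standard deviation when every completion in the group receives the same reward, and is an implementation detail rather than part of the formula above:

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO advantages for a group of G completions of one prompt.

    rewards: tensor of shape [G], one scalar reward per completion.
    Returns one advantage per completion; during training this value is
    broadcast to every token of that completion.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```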

GRPO Objective Components

The GRPO objective consists of components similar to PPO but with a group-relative approach:

1. Clipped Surrogate Objective

$$L_{clip}(\theta) = -\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i}\min\left(\frac{\pi_\theta(a_{i,t}|s_{i,t})}{\pi_{\theta_{old}}(a_{i,t}|s_{i,t})}A_{i,t},\; \text{clip}\left(\frac{\pi_\theta(a_{i,t}|s_{i,t})}{\pi_{\theta_{old}}(a_{i,t}|s_{i,t})}, 1-\epsilon, 1+\epsilon\right)A_{i,t}\right)$$

Like in PPO, this clipping prevents excessive policy updates that could destabilize training. The clipping is applied at the token level, while the advantages are computed once per completion and shared across all of its tokens, balancing local (per-token) optimization against sequence-level performance.

2. KL Divergence Penalty

$$\text{KL}(\theta) = \frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i} D_{KL}\left[\pi_\theta(\cdot|s_{i,t})\,\|\,\pi_{ref}(\cdot|s_{i,t})\right]$$

The KL divergence is estimated using the approximator introduced by Schulman (2020), evaluated at the sampled token:

$$D_{KL}\left[\pi_\theta\,\|\,\pi_{ref}\right] = \frac{\pi_{ref}(a_{i,t}|s_{i,t})}{\pi_\theta(a_{i,t}|s_{i,t})} - \log\frac{\pi_{ref}(a_{i,t}|s_{i,t})}{\pi_\theta(a_{i,t}|s_{i,t})} - 1$$
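
A minimal PyTorch sketch of this estimator, evaluated at the sampled tokens; the argument names are assumptions made for the example:

```python
import torch

def kl_estimate(logp_new, logp_ref):
    """Per-token estimate of KL(pi_theta || pi_ref) using the approximator above.

    logp_new: log pi_theta(a_{i,t}|s_{i,t}) for the sampled tokens
    logp_ref: log pi_ref(a_{i,t}|s_{i,t}) for the same tokens
    """
    log_ratio = logp_ref - logp_new            # log(pi_ref / pi_theta)
    return torch.exp(log_ratio) - log_ratio - 1.0
```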

The Full GRPO Objective

The GRPO objective combines these components:

$$L_{GRPO}(\theta) = L_{clip}(\theta) + \beta\,\text{KL}(\theta)$$

As with PPO, this loss is minimized, so the KL term acts as a penalty on divergence from the reference policy.

Note that unlike PPO, GRPO doesn't include an entropy bonus term or a critic loss component, making it significantly simpler.
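
Assuming the `kl_estimate` and group-advantage sketches above, a hypothetical GRPO loss for one group might look like the following. For brevity, this flattens all tokens of the group into single tensors and averages at the token level, whereas the formulas above average per completion first; the default values of `beta` and `eps` are illustrative placeholders:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, beta=0.04, eps=0.2):
    """GRPO loss for one group of completions, flattened to the token level.

    logp_new, logp_old, logp_ref: per-token log-probs of the sampled tokens
    advantages: the group-relative advantage of each token's completion,
                broadcast to every token of that completion
    """
    ratio = torch.exp(logp_new - logp_old.detach())
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    clip_term = -torch.min(ratio * advantages, clipped * advantages).mean()
    kl_term = kl_estimate(logp_new, logp_ref).mean()
    return clip_term + beta * kl_term
```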

Comparison Between PPO and GRPO

PPO and GRPO represent two different approaches to policy optimization, with GRPO offering a simpler alternative to the more complex PPO framework. The fundamental difference lies in how advantages are estimated: PPO relies on a critic network through GAE to estimate token-level advantages, requiring a complex training setup with both actor and critic components. In contrast, GRPO takes a more direct approach by estimating advantages from the relative performance of multiple completions sampled for the same prompt, eliminating the need for a critic network altogether.

The simplification offered by GRPO extends to the objective function as well. While PPO incorporates an entropy bonus term to encourage exploration alongside the clipped surrogate objective and KL penalty, GRPO relies only on the latter two components, trusting in the natural variance from group sampling to provide sufficient exploration. Despite these simplifications, GRPO maintains the core stability mechanisms of PPO, including the clipping of probability ratios to prevent excessive policy updates.