Policy Optimization: From Vanilla Policy Gradients to GRPO

Ziyi Zhu / March 09, 2025
8 min read
In the world of reinforcement learning, policy gradient methods have become the cornerstone of many state-of-the-art algorithms. In this blog post, we'll explore the evolution from vanilla policy gradients to modern approaches like Proximal Policy Optimization (PPO) and the more recent Group Relative Policy Optimization (GRPO). These methods have proven particularly effective for fine-tuning language models through reinforcement learning from human feedback (RLHF).
The Foundation: Vanilla Policy Gradient
To understand modern policy optimization methods, we need to start with the basics. The core idea of policy gradient methods is simple: we want to optimize a parameterized policy $\pi_\theta$ to maximize the expected return $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$.
Brief Derivation
The policy gradient theorem gives us an analytical expression for $\nabla_\theta J(\pi_\theta)$, which we can use for gradient ascent:
$$\theta_{k+1} = \theta_k + \alpha\, \nabla_\theta J(\pi_{\theta_k})$$
Let's derive the simplest form of the policy gradient. First, we note that the probability of a trajectory $\tau = (s_0, a_0, \ldots, s_{T+1})$ under policy $\pi_\theta$ is:
$$P(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$
Using the log-derivative trick and noting that the environment dynamics don't depend on $\theta$, we get:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$
This gives us the vanilla policy gradient, which we can estimate with a sample mean over a set of collected trajectories $\mathcal{D}$:
$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$$
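To make this concrete, here is a minimal PyTorch-style sketch of the estimator, assuming we already have per-step log-probabilities and total returns for a batch of collected trajectories (the tensor shapes and names are illustrative, not from any specific library):

```python
import torch

def vanilla_pg_loss(log_probs, returns):
    """Surrogate loss whose gradient is the vanilla policy gradient estimate.

    log_probs: tensor of shape (num_trajectories, T) with log pi_theta(a_t | s_t)
    returns:   tensor of shape (num_trajectories,) with the total return R(tau)
    """
    # Weight each trajectory's summed log-probabilities by its return, then
    # average over trajectories; the negation suits a minimizing optimizer.
    return -(log_probs.sum(dim=1) * returns).mean()
```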
Improving the Vanilla Policy Gradient
The policy gradient has a general form:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Phi_t\right]$$
where $\Phi_t$ is a weighting term that could be any of:
- Total trajectory return: $\Phi_t = R(\tau) = \sum_{t'=0}^{T} r_{t'}$
- Future return from time $t$: $\Phi_t = \sum_{t'=t}^{T} r_{t'}$
- Future return minus a baseline: $\Phi_t = \sum_{t'=t}^{T} r_{t'} - b(s_t)$
Using Advantage Functions
Two particularly important choices for $\Phi_t$ are:
- On-Policy Action-Value Function: $\Phi_t = Q^{\pi_\theta}(s_t, a_t)$
- Advantage Function: $\Phi_t = A^{\pi_\theta}(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)$
The advantage function is particularly useful as it describes how much better (or worse) an action is compared to the average action in that state. Using the advantage function reduces variance in the policy gradient estimate without introducing bias.
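As a rough illustration, a baselined weighting term can be computed from the reward-to-go minus a learned value estimate; the helper names and the use of a value network as the baseline are assumptions for this sketch:

```python
import torch

def reward_to_go(rewards, gamma=1.0):
    """Future return from each timestep for a single trajectory."""
    rtg = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def advantage_weights(rewards, values):
    """Reward-to-go minus a value baseline as the weighting term Phi_t."""
    # Detach the baseline so the policy gradient does not flow into the critic here.
    return reward_to_go(rewards) - values.detach()
```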
Proximal Policy Optimization (PPO)
PPO has become one of the most popular policy gradient methods due to its simplicity and effectiveness. It requires three main components:
- Policy ($\pi_\theta$): The model we're optimizing (in RLHF, this is the language model)
- Reward model ($R_\phi$): A trained network that provides scalar rewards
- Critic ($V_\gamma$): A value function that predicts rewards from partial states
Generalized Advantage Estimation (GAE)
A key innovation in PPO is the use of Generalized Advantage Estimation (GAE), which balances bias and variance in advantage estimation:
$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$
where $\delta_t$ is the temporal difference (TD) error:
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
The parameter $\lambda$ controls the trade-off between bias and variance:
- When $\lambda = 0$, GAE reduces to the single-step TD error (lower variance, higher bias)
- When $\lambda = 1$, GAE becomes Monte Carlo estimation (higher variance, lower bias)
For a deeper dive into value function approximation and GAE techniques, check out this explainer on TD(λ).
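Below is a hedged sketch of how GAE is typically computed using the recursive form $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$; the tensor shapes and the default values of $\gamma$ and $\lambda$ are illustrative:

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards: tensor of shape (T,)
    values:  tensor of shape (T + 1,) -- includes a bootstrap value for the final state
    """
    values = values.detach()  # advantages are treated as constants w.r.t. the critic
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```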
PPO Objective Components
The PPO objective consists of several components at the token level:
1. Clipped Surrogate Objective
The clipped surrogate objective is a critical innovation in PPO. It limits the size of policy updates to prevent excessive changes that could destabilize training:
$$\mathcal{L}_{\text{clip}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
The ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ measures how much the policy has changed since the last update. When this ratio exceeds $1+\epsilon$ (for positive advantages) or falls below $1-\epsilon$ (for negative advantages), the clipping function restricts the objective value, effectively discouraging further changes in that direction. This prevents the notorious "policy collapse" problem, where large updates can irreparably damage a previously well-performing policy.
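A minimal sketch of this term, assuming per-token log-probabilities from the current and old policies (names and the default $\epsilon$ are illustrative):

```python
import torch

def clipped_surrogate(log_probs_new, log_probs_old, advantages, eps=0.2):
    """PPO-style clipped surrogate objective (to be maximized)."""
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old, computed in log space.
    ratio = torch.exp(log_probs_new - log_probs_old.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The elementwise minimum makes the bound pessimistic in both directions.
    return torch.min(unclipped, clipped).mean()
```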
2. KL Divergence Penalty
The KL divergence penalty serves as a soft constraint to keep the updated policy close to the original reference policy. This is particularly important in RLHF contexts where we want to improve the model's behavior according to human preferences without losing the knowledge and capabilities acquired during pretraining. The KL term measures the "distance" between probability distributions, penalizing the model when it diverges substantially from its original behavior. This helps preserve the model's linguistic capabilities while still allowing targeted improvements in specific behaviors.
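As a sketch, a simple per-token estimate of this penalty can be formed from the log-probability gap between the current policy and the frozen reference model; this uses the basic log-ratio estimator, and implementations vary:

```python
import torch

def kl_penalty(log_probs_policy, log_probs_ref):
    """Per-token estimate of KL(pi_theta || pi_ref).

    Both inputs hold log-probabilities of the sampled tokens under each model;
    averaging log(pi_theta / pi_ref) over tokens gives a simple KL estimate.
    """
    return (log_probs_policy - log_probs_ref.detach()).mean()
```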
3. Entropy Bonus
The entropy bonus encourages exploration by rewarding policies that maintain diversity in their action distributions. In language models, this prevents the policy from becoming overly deterministic and always choosing the same tokens in similar contexts. Higher entropy means the model assigns meaningful probability to a wider range of tokens, maintaining creativity and diversity in outputs. This is especially important in early training stages to prevent premature convergence to suboptimal policies.
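A small sketch of an entropy bonus computed from the model's next-token logits (shapes and names are illustrative):

```python
import torch

def entropy_bonus(logits):
    """Mean per-token entropy of the policy's next-token distribution."""
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # H = -sum_a p(a) log p(a), averaged over all token positions.
    return -(probs * log_probs).sum(dim=-1).mean()
```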
4. Value Function Approximation
The critic loss component of the PPO objective serves to train the value function approximator $V_\gamma$, which is crucial for accurately estimating advantages. This component is given by:
$$\mathcal{L}_{\text{value}}(\gamma) = \left(V_\gamma(s_t) - \text{sg}\!\left[R_\phi(\tau)\right]\right)^2$$
where $V_\gamma(s_t)$ is the critic's prediction of the expected return from state $s_t$, $R_\phi(\tau)$ is the reward model's output for the completed trajectory, and $\text{sg}[\cdot]$ represents the stop gradient operation.
This squared-error loss trains the critic to accurately predict the final reward that will be assigned by the reward model, but using only partial information (the state at time $t$ rather than the complete trajectory). As the policy changes during training, the expected returns from each state also change, requiring the critic to be continuously updated alongside the policy.
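A minimal sketch of this critic loss, assuming per-token value predictions and a single scalar reward-model score for the finished completion:

```python
import torch

def critic_loss(values, final_reward):
    """Squared-error loss for the value head.

    values:       tensor of shape (T,) -- V_gamma(s_t) predicted at each token position
    final_reward: scalar tensor       -- reward-model score for the completed trajectory
    """
    # Stop-gradient on the target: only the critic's parameters receive this gradient.
    target = final_reward.detach()
    return ((values - target) ** 2).mean()
```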
The Full PPO Objective
The full PPO objective combines the components above:
$$\mathcal{L}_{\text{PPO}}(\theta, \gamma) = \mathcal{L}_{\text{clip}}(\theta) - w_1\, \mathcal{D}_{KL}(\theta) + w_2\, \mathcal{H}(\theta) - w_3\, \mathcal{L}_{\text{value}}(\gamma)$$
where $\mathcal{D}_{KL}$ is the KL divergence penalty, $\mathcal{H}$ is the entropy bonus, and $w_1$, $w_2$, $w_3$ are weighting coefficients.
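Putting the pieces together, a hedged sketch of the combined objective might look like the following; the weighting coefficients are illustrative hyperparameters, not values from this post:

```python
def ppo_objective(clip_term, kl, entropy, value_loss,
                  w_kl=0.1, w_entropy=0.01, w_value=0.5):
    """Weighted combination of the PPO components sketched above (to be maximized).

    The arguments are the scalar outputs of clipped_surrogate, kl_penalty,
    entropy_bonus, and critic_loss; the weights are illustrative.
    """
    return clip_term - w_kl * kl + w_entropy * entropy - w_value * value_loss
```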
Group Relative Policy Optimization (GRPO)
GRPO simplifies PPO by taking a different approach to advantage estimation. Instead of using a critic network, GRPO estimates advantages by comparing rewards across multiple responses sampled from the same prompt.
Advantage Calculation in GRPO
For each prompt $q$, GRPO samples $G$ completions $\{o_1, o_2, \ldots, o_G\}$. The advantage for each completion $i$ and token $t$ is calculated as:
$$\hat{A}_{i,t} = \frac{r_i - \text{mean}(\{r_1, r_2, \ldots, r_G\})}{\text{std}(\{r_1, r_2, \ldots, r_G\})}$$
where $r_i$ is the reward for completion $o_i$, and $\text{mean}(\cdot)$ and $\text{std}(\cdot)$ are the mean and standard deviation of rewards across all completions in the group. Importantly, while the advantage is calculated at the completion level (a single value per completion), it is applied to each token in the completion during the optimization process.
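A minimal sketch of the group-relative advantage computation, assuming one scalar reward per completion (the epsilon term is a common numerical safeguard, not part of the formula above):

```python
import torch

def grpo_advantages(rewards):
    """Group-relative advantages for G completions of the same prompt.

    rewards: tensor of shape (G,) -- one scalar reward per sampled completion.
    Returns a tensor of shape (G,); each value is then broadcast to every token
    of the corresponding completion.
    """
    mean = rewards.mean()
    std = rewards.std()
    # Small epsilon guards against a zero std when all rewards in the group are equal.
    return (rewards - mean) / (std + 1e-8)
```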
GRPO Objective Components
The GRPO objective consists of components similar to PPO but with a group-relative approach:
1. Clipped Surrogate Objective
Like in PPO, this clipping prevents excessive policy updates that could destabilize training. The clipping is applied at the token level but with advantages calculated at the sequence level, providing a unique balance between local optimization and global performance.
2. KL Divergence Penalty
KL divergence is estimated using the approximator introduced by Schulman (2020). The approximator is defined as follows:
$$\mathbb{D}_{KL}\!\left[\pi_\theta \,\|\, \pi_{ref}\right] = \frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log\frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1$$
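A small sketch of this approximator computed from per-token log-probabilities (names are illustrative):

```python
import torch

def kl_approx(log_probs_policy, log_probs_ref):
    """Per-token KL approximator (always non-negative) used in GRPO.

    Computes pi_ref / pi_theta - log(pi_ref / pi_theta) - 1 from log-probabilities
    of the sampled tokens, then averages over tokens.
    """
    log_ratio = log_probs_ref - log_probs_policy
    return (torch.exp(log_ratio) - log_ratio - 1.0).mean()
```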
The Full GRPO Objective
The GRPO objective combines these components:
$$\mathcal{L}_{\text{GRPO}}(\theta) = \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left[\min\!\left(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \text{clip}\!\left(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_{i,t}\right) - \beta\,\mathbb{D}_{KL}\!\left[\pi_\theta \,\|\, \pi_{ref}\right]\right]$$
where $r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}$ is the per-token probability ratio and $\beta$ weights the KL penalty.
Note that unlike PPO, GRPO doesn't include an entropy bonus term or a critic loss component, making it significantly simpler.
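For completeness, here is a hedged sketch that combines the clipped surrogate and the KL approximator into a single per-token loss, assuming sequence-level advantages broadcast across tokens and an illustrative KL weight $\beta$ (padding/masking is omitted for brevity):

```python
import torch

def grpo_loss(log_probs_new, log_probs_old, log_probs_ref, advantages,
              eps=0.2, beta=0.04):
    """Per-token GRPO loss (negated objective, for a minimizing optimizer).

    The log-prob tensors share shape (G, T); advantages has shape (G, 1) and is
    broadcast across tokens. eps and beta are illustrative hyperparameters.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    surrogate = torch.min(unclipped, clipped)
    # KL approximator from the previous section, computed per token.
    log_ratio_ref = log_probs_ref - log_probs_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0
    return -(surrogate - beta * kl).mean()
```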
Comparison Between PPO and GRPO
PPO and GRPO represent two different approaches to policy optimization, with GRPO offering a simpler alternative to the more complex PPO framework. The fundamental difference lies in how advantages are estimated: PPO relies on a critic network through GAE to estimate token-level advantages, requiring a complex training setup with both actor and critic components. In contrast, GRPO takes a more direct approach by estimating advantages from the relative performance of multiple completions sampled for the same prompt, eliminating the need for a critic network altogether.
The simplification offered by GRPO extends to the objective function as well. While PPO incorporates an entropy bonus term to encourage exploration alongside the clipped surrogate objective and KL penalty, GRPO relies only on the latter two components, trusting in the natural variance from group sampling to provide sufficient exploration. Despite these simplifications, GRPO maintains the core stability mechanisms of PPO, including the clipping of probability ratios to prevent excessive policy updates.