Policy Optimization: From Vanilla Policy Gradients to GRPO

Ziyi Zhu

March 09, 2025

In the world of reinforcement learning, policy gradient methods have become the cornerstone of many state-of-the-art algorithms. In this blog post, we'll explore the evolution from vanilla policy gradients to modern approaches like Proximal Policy Optimization (PPO) and the more recent Group Relative Policy Optimization (GRPO). These methods have proven particularly effective for fine-tuning language models through reinforcement learning from human feedback (RLHF).

The Foundation: Vanilla Policy Gradient

To understand modern policy optimization methods, we need to start with the basics. The core idea of policy gradient methods is simple: we want to optimize a parameterized policy $\pi_{\theta}$ to maximize the expected return $J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)]$.

Brief Derivation

The policy gradient theorem gives us an analytical expression for $\nabla_{\theta} J(\pi_{\theta})$, which we can use for gradient ascent:

$$\theta_{k+1} = \theta_k + \alpha \left. \nabla_{\theta} J(\pi_{\theta}) \right|_{\theta_k}$$

Let's derive the simplest form of the policy gradient. First, we note that the probability of a trajectory $\tau = (s_0, a_0, \dots, s_{T+1})$ under policy $\pi_{\theta}$ is:

$$P(\tau|\theta) = \rho_0(s_0) \prod_{t=0}^{T} P(s_{t+1}|s_t, a_t)\, \pi_{\theta}(a_t|s_t)$$

Using the log-derivative trick and noting that the environment dynamics don't depend on $\theta$, we get:

$$\begin{aligned} \nabla_{\theta} J(\pi_{\theta}) &= \nabla_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)] \\ &= \nabla_{\theta} \int_{\tau} P(\tau|\theta) R(\tau) \\ &= \int_{\tau} \nabla_{\theta} P(\tau|\theta) R(\tau) \\ &= \int_{\tau} P(\tau|\theta) \nabla_{\theta} \log P(\tau|\theta) R(\tau) \\ &= \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\nabla_{\theta} \log P(\tau|\theta) R(\tau)\right] \\ &= \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) R(\tau)\right] \end{aligned}$$

This gives us the vanilla policy gradient, which we can estimate with a sample mean from collected trajectories:

$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) R(\tau)$$
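
As a concrete illustration, here is a minimal PyTorch sketch of this estimator; the `policy` network, the trajectory format, and the helper name `reinforce_loss` are assumptions made for the example, not part of any particular library:

```python
import torch

def reinforce_loss(policy, trajectories):
    """Vanilla policy gradient (REINFORCE) surrogate loss.

    `trajectories` is assumed to be a list of (states, actions, total_return)
    tuples collected by rolling out the current policy.
    Minimizing this loss performs gradient ascent on J(pi_theta).
    """
    losses = []
    for states, actions, total_return in trajectories:
        dist = torch.distributions.Categorical(logits=policy(states))
        log_probs = dist.log_prob(actions)                 # log pi_theta(a_t | s_t)
        losses.append(-(log_probs.sum() * total_return))   # R(tau) weights the whole trajectory
    return torch.stack(losses).mean()                      # average over the |D| trajectories
```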

Improving the Vanilla Policy Gradient

The policy gradient has a general form:

$$\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\, \Phi_t\right]$$

where $\Phi_t$ is a weighting term that could be any of:

  1. Total trajectory return: $\Phi_t = R(\tau)$
  2. Future return from time $t$: $\Phi_t = \sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1})$
  3. Future return minus a baseline: $\Phi_t = \sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1}) - b(s_t)$ (see the sketch after this list)
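
As an illustration of option 3, here is a minimal PyTorch sketch that computes the reward-to-go and subtracts a baseline; the function name and the choice of the mean as a fallback baseline are assumptions made for the example:

```python
import torch

def reward_to_go_advantages(rewards, values=None):
    """Compute Phi_t = sum_{t' >= t} r_{t'} - b(s_t) for one trajectory.

    `rewards` is a 1-D tensor of per-step rewards; `values` is an optional
    per-step baseline b(s_t) (e.g. a learned value estimate). If omitted,
    a constant baseline (the mean reward-to-go) is used for illustration.
    """
    # Cumulative sum from the end gives the future return at each step.
    reward_to_go = torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])
    baseline = values if values is not None else reward_to_go.mean()
    return reward_to_go - baseline
```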

Using Advantage Functions

Two particularly important choices for $\Phi_t$ are:

  1. On-Policy Action-Value Function: $\Phi_t = Q^{\pi_{\theta}}(s_t, a_t)$
  2. Advantage Function: $\Phi_t = A^{\pi_{\theta}}(s_t, a_t) = Q^{\pi_{\theta}}(s_t, a_t) - V^{\pi_{\theta}}(s_t)$

The advantage function $A^{\pi_{\theta}}(s_t, a_t)$ is particularly useful as it describes how much better (or worse) an action is compared to the average action in that state. Using the advantage function reduces variance in the policy gradient estimate without introducing bias.

Proximal Policy Optimization (PPO)

PPO has become one of the most popular policy gradient methods due to its simplicity and effectiveness. It requires three main components:

  1. Policy ($\pi_{\theta}$): The model we're optimizing (in RLHF, this is the language model)
  2. Reward model ($R_{\phi}$): A trained network that assigns a scalar reward to a completed response
  3. Critic ($V_{\gamma}$): A value function that estimates the expected final reward from partial states

Generalized Advantage Estimation (GAE)

A key ingredient in PPO is Generalized Advantage Estimation (GAE), which balances bias and variance in advantage estimation:

$$A^{GAE}_K = \delta_0 + \lambda\delta_1 + \lambda^2\delta_2 + \dots + \lambda^{K-1}\delta_{K-1} = \sum_{t=0}^{K-1}\lambda^t\delta_t$$

where $\delta_t$ is the temporal difference (TD) error (in token-level RLHF the intermediate rewards are typically zero, so the reward term only contributes at the final token):

$$\delta_t = R(s_t, a_t, s_{t+1}) + V_{\gamma}(s_{t+1}) - V_{\gamma}(s_t)$$

The parameter $\lambda$ controls the trade-off between bias and variance:

  • When $\lambda = 0$, GAE reduces to single-step TD (lower variance, higher bias)
  • When $\lambda = 1$, GAE becomes Monte Carlo estimation (higher variance, lower bias)
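
A minimal PyTorch sketch of this estimator, assuming per-step rewards and critic values for a single trajectory (with one extra bootstrap value) and following the undiscounted formulas above:

```python
import torch

def gae_advantages(rewards, values, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: tensor of shape [T]      -- per-step rewards
    values:  tensor of shape [T + 1]  -- critic estimates V(s_0 .. s_T)
    lam:     the lambda controlling the bias/variance trade-off
    """
    deltas = rewards + values[1:] - values[:-1]   # TD errors delta_t
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # A_t = delta_t + lam * A_{t+1}; A_0 equals the K-step sum above with K = T
        gae = deltas[t] + lam * gae
        advantages[t] = gae
    return advantages
```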

For a deeper dive into value function approximation and GAE techniques, check out this explainer on TD(λ).

PPO Objective Components

The PPO objective consists of several components at the token level:

1. Clipped Surrogate Objective

$$L_{clip}(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\min\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}A^{GAE}_t,\; \text{clip}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\right)A^{GAE}_t\right)$$

The clipped surrogate objective is a critical innovation in PPO. It limits the size of policy updates to prevent excessive changes that could destabilize training. The ratio $\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ measures how much the policy has changed since the last update. When this ratio exceeds $1+\epsilon$ (for positive advantages) or falls below $1-\epsilon$ (for negative advantages), the clipping function restricts the objective value, effectively discouraging further changes in that direction. This prevents the notorious "policy collapse" problem where large updates can irreparably damage a previously well-performing policy.
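
A minimal PyTorch sketch of the clipped term, assuming the per-token log-probabilities of the sampled actions have already been gathered from both the current and the rollout policy:

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate loss (to be minimized).

    logp_new:   log pi_theta(a_t|s_t) under the current policy, shape [T]
    logp_old:   log pi_theta_old(a_t|s_t) from the rollout policy, shape [T]
    advantages: A_t^GAE per token, shape [T]
    """
    ratio = torch.exp(logp_new - logp_old.detach())       # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```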

2. KL Divergence Penalty

$$\text{KL}(\theta) = \frac{1}{T}\sum_{t=1}^{T} D_{KL}\left(\pi_{\theta_{ref}}(\cdot|s_t)\,\|\,\pi_\theta(\cdot|s_t)\right)$$

The KL divergence penalty serves as a soft constraint to keep the updated policy close to the original reference policy. This is particularly important in RLHF contexts where we want to improve the model's behavior according to human preferences without losing the knowledge and capabilities acquired during pretraining. The KL term measures the "distance" between probability distributions, penalizing the model when it diverges substantially from its original behavior. This helps preserve the model's linguistic capabilities while still allowing targeted improvements in specific behaviors.
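
A minimal PyTorch sketch of this penalty, assuming access to the reference and current models' logits at each token position and computing the full KL over the vocabulary:

```python
import torch
import torch.nn.functional as F

def kl_penalty(ref_logits, new_logits):
    """Mean KL(pi_ref || pi_theta) over token positions.

    ref_logits, new_logits: shape [T, vocab_size]
    """
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    new_logp = F.log_softmax(new_logits, dim=-1)
    # KL(p || q) = sum_a p(a) * (log p(a) - log q(a))
    kl_per_token = (ref_logp.exp() * (ref_logp - new_logp)).sum(dim=-1)
    return kl_per_token.mean()
```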

3. Entropy Bonus

$$H(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{a \sim \pi_\theta(\cdot|s_t)}\left[\log\pi_\theta(a|s_t)\right]$$

The entropy bonus encourages exploration by rewarding policies that maintain diversity in their action distributions. In language models, this prevents the policy from becoming overly deterministic and always choosing the same tokens in similar contexts. Higher entropy means the model assigns meaningful probability to a wider range of tokens, maintaining creativity and diversity in outputs. This is especially important in early training stages to prevent premature convergence to suboptimal policies.
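
A minimal PyTorch sketch of the entropy term, computed from the current policy's logits (the function name is an assumption for the example):

```python
import torch
import torch.nn.functional as F

def entropy_bonus(new_logits):
    """Mean entropy of pi_theta(.|s_t) over token positions.

    new_logits: shape [T, vocab_size]
    """
    logp = F.log_softmax(new_logits, dim=-1)
    entropy_per_token = -(logp.exp() * logp).sum(dim=-1)   # H = -sum_a pi(a) log pi(a)
    return entropy_per_token.mean()
```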

4. Value Function Approximation

The critic loss component of the PPO objective serves to train the value function approximator $V_{\gamma}$, which is crucial for accurately estimating advantages. This component is given by:

$$L(\gamma) = \mathbb{E}_t\left[\left(V_\gamma(s_t) - \text{sg}(R_\phi(s_T))\right)^2\right]$$

where $V_\gamma(s_t)$ is the critic's prediction of the expected return from state $s_t$, $R_\phi(s_T)$ is the reward model's output for the completed trajectory, and $\text{sg}$ represents the stop-gradient operation.

This squared-error loss trains the critic to accurately predict the final reward that will be assigned by the reward model, but using only partial information (the state at time $t$ rather than the complete trajectory). As the policy $\pi_\theta$ changes during training, the expected returns from each state also change, requiring the critic to be continuously updated alongside the policy.
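
A minimal PyTorch sketch of this critic loss, assuming per-token value predictions and a single scalar reward tensor from the reward model:

```python
import torch

def critic_loss(values, final_reward):
    """Squared-error loss for the value head V_gamma.

    values:       V_gamma(s_t) for each token position, shape [T]
    final_reward: scalar tensor with the reward model's score R_phi(s_T)
    """
    target = final_reward.detach()          # stop-gradient on the reward target
    return ((values - target) ** 2).mean()
```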

The Full PPO Objective

Combining the components above gives a single loss to minimize:

$$L_{PPO}(\theta, \gamma) = L_{clip}(\theta) - w_1 H(\theta) + w_2 \text{KL}(\theta) + w_3 L(\gamma)$$

Minimizing $L_{PPO}$ maximizes the clipped surrogate objective and the entropy bonus while penalizing divergence from the reference policy and inaccurate value predictions.
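
Assembling the pieces, and reusing the helper sketches above, a hypothetical training step might combine them like this (the weights shown are placeholders, not recommended values):

```python
def ppo_loss(logp_new, logp_old, advantages, ref_logits, new_logits,
             values, final_reward, w1=0.01, w2=0.1, w3=0.5, eps=0.2):
    """Full PPO loss to minimize: clipped surrogate, minus the entropy bonus,
    plus the KL penalty and the critic loss (see the sketches above)."""
    return (clipped_surrogate_loss(logp_new, logp_old, advantages, eps)
            - w1 * entropy_bonus(new_logits)
            + w2 * kl_penalty(ref_logits, new_logits)
            + w3 * critic_loss(values, final_reward))
```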

Group Relative Policy Optimization (GRPO)

GRPO simplifies PPO by taking a different approach to advantage estimation. Instead of using a critic network, GRPO estimates advantages by comparing rewards across multiple responses sampled from the same prompt.

Advantage Calculation in GRPO

For each prompt $q$, GRPO samples $G$ completions. The advantage for each completion $i$ and token $t$ is calculated as:

$$A_{i,t} = \frac{r_i - \text{mean}(r)}{\text{std}(r)}$$

where $r_i$ is the reward for completion $i$, and $\text{mean}(r)$ and $\text{std}(r)$ are the mean and standard deviation of rewards across all $G$ completions in the group. Importantly, while the advantage is calculated at the completion level (a single value per completion), it is applied to each token $t$ in the completion during the optimization process.
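
A minimal PyTorch sketch of this computation; the small `eps` term guards against a zero standard deviation when every completion in the group receives the same reward, and is an implementation detail rather than part of the formula above:

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO advantages for a group of G completions of one prompt.

    rewards: tensor of shape [G], one scalar reward per completion.
    Returns one advantage per completion; during training this value is
    broadcast to every token of that completion.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```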

GRPO Objective Components

The GRPO objective consists of components similar to PPO but with a group-relative approach:

1. Clipped Surrogate Objective

$$L_{clip}(\theta) = -\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i}\min\left(\frac{\pi_\theta(a_{i,t}|s_{i,t})}{\pi_{\theta_{old}}(a_{i,t}|s_{i,t})}A_{i,t},\; \text{clip}\left(\frac{\pi_\theta(a_{i,t}|s_{i,t})}{\pi_{\theta_{old}}(a_{i,t}|s_{i,t})}, 1-\epsilon, 1+\epsilon\right)A_{i,t}\right)$$

Like in PPO, this clipping prevents excessive policy updates that could destabilize training. The clipping is applied at the token level, while the advantages are computed once per completion and shared across all of its tokens, balancing local (per-token) optimization against sequence-level performance.

2. KL Divergence Penalty

$$\text{KL}(\theta) = \frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i} D_{KL}\left[\pi_\theta(\cdot|s_{i,t})\,\|\,\pi_{ref}(\cdot|s_{i,t})\right]$$

The KL divergence is estimated using the approximator introduced by Schulman (2020), evaluated at the sampled token:

$$D_{KL}\left[\pi_\theta\,\|\,\pi_{ref}\right] = \frac{\pi_{ref}(a_{i,t}|s_{i,t})}{\pi_\theta(a_{i,t}|s_{i,t})} - \log\frac{\pi_{ref}(a_{i,t}|s_{i,t})}{\pi_\theta(a_{i,t}|s_{i,t})} - 1$$
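
A minimal PyTorch sketch of this estimator, evaluated at the sampled tokens; the argument names are assumptions made for the example:

```python
import torch

def kl_estimate(logp_new, logp_ref):
    """Per-token estimate of KL(pi_theta || pi_ref) using the approximator above.

    logp_new: log pi_theta(a_{i,t}|s_{i,t}) for the sampled tokens
    logp_ref: log pi_ref(a_{i,t}|s_{i,t}) for the same tokens
    """
    log_ratio = logp_ref - logp_new            # log(pi_ref / pi_theta)
    return torch.exp(log_ratio) - log_ratio - 1.0
```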

The Full GRPO Objective

The GRPO objective combines these components:

$$L_{GRPO}(\theta) = L_{clip}(\theta) + \beta\,\text{KL}(\theta)$$

As with PPO, this loss is minimized, so the KL term acts as a penalty on divergence from the reference policy.

Note that unlike PPO, GRPO doesn't include an entropy bonus term or a critic loss component, making it significantly simpler.
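
Assuming the `kl_estimate` and group-advantage sketches above, a hypothetical GRPO loss for one group might look like the following. For brevity, this flattens all tokens of the group into single tensors and averages at the token level, whereas the formulas above average per completion first; the default values of `beta` and `eps` are illustrative placeholders:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, beta=0.04, eps=0.2):
    """GRPO loss for one group of completions, flattened to the token level.

    logp_new, logp_old, logp_ref: per-token log-probs of the sampled tokens
    advantages: the group-relative advantage of each token's completion,
                broadcast to every token of that completion
    """
    ratio = torch.exp(logp_new - logp_old.detach())
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    clip_term = -torch.min(ratio * advantages, clipped * advantages).mean()
    kl_term = kl_estimate(logp_new, logp_ref).mean()
    return clip_term + beta * kl_term
```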

Comparison Between PPO and GRPO

PPO and GRPO represent two different approaches to policy optimization, with GRPO offering a simpler alternative to the more complex PPO framework. The fundamental difference lies in how advantages are estimated: PPO relies on a critic network through GAE to estimate token-level advantages, requiring a complex training setup with both actor and critic components. In contrast, GRPO takes a more direct approach by estimating advantages from the relative performance of multiple completions sampled for the same prompt, eliminating the need for a critic network altogether.

The simplification offered by GRPO extends to the objective function as well. While PPO incorporates an entropy bonus term to encourage exploration alongside the clipped surrogate objective and KL penalty, GRPO relies only on the latter two components, trusting in the natural variance from group sampling to provide sufficient exploration. Despite these simplifications, GRPO maintains the core stability mechanisms of PPO, including the clipping of probability ratios to prevent excessive policy updates.