Group Relative Policy Optimization

GRPO

TL;DR

A reinforcement learning algorithm that aligns large language models by sampling a group of responses per prompt and scoring each one relative to the rest of the group, which removes the need for a separate critic (value) model.

In depth

Developed by DeepSeek, Group Relative Policy Optimization simplifies the Proximal Policy Optimization (PPO) framework by removing the critic model, which in PPO is a second network, often comparable in size to the policy itself, trained to predict state values. Instead of predicting absolute values, GRPO samples a group of candidate completions for each prompt and normalizes each completion's reward against the group's mean and standard deviation, so the group itself serves as the baseline. With no critic to store or update, GPU memory consumption during training drops substantially. DeepSeek also reports that models trained this way developed emergent reasoning behaviors such as self-reflection and chain-of-thought generation.
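The group normalization described above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation: the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for the sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion relative to its own group (hypothetical helper).

    rewards: scalar rewards for completions sampled from ONE prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    # Population standard deviation is an assumption here; an
    # implementation could use the sample standard deviation instead.
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four completions of one prompt:
# two judged correct (1.0) and two judged incorrect (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average get a positive advantage and the rest get a negative one, so the policy update needs no learned value estimate. When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.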

Why this matters for your business

By cutting the memory and compute needed for reinforcement learning, GRPO lowers training costs and hardware requirements, which puts the development of high-performance reasoning models within reach of smaller organizations.