Introduction
Welcome! This lecture explores the critical challenge of aligning Large Language Models (LLMs). While modern LLMs can write code, compose poetry, and answer complex questions, their raw, pre-trained versions may not always behave as we'd like. They can sometimes generate incorrect, biased, or unhelpful content. Alignment is the process of fine-tuning these models to be helpful, harmless, and consistent with human values.
The most common framework for this is Reinforcement Learning from Human Feedback (RLHF). At its core, RLHF is a sophisticated process for teaching an AI to understand and adopt human preferences. It typically involves three key stages:
- Supervised Fine-Tuning (SFT): A pre-trained LLM is first taught to mimic human-written examples for specific tasks, creating a solid baseline model (often called the reference policy, \(\pi_{\text{ref}}\)).
- Reward Modeling: Humans are asked to rank different model responses to the same prompt. This preference data (\(y_w \succ y_l\), where response \(y_w\) is preferred over \(y_l\)) is used to train a separate "reward model" that learns to score responses based on how much a human would like them.
- RL Fine-Tuning: The SFT model is further optimized using reinforcement learning. It generates responses, which are scored by the reward model. This score (or "reward") is used to update the LLM's policy, encouraging it to produce more high-reward outputs.
In this lecture, we will journey through three key algorithms that tackle the RL fine-tuning stage, each with a unique approach to this complex optimization problem: PPO, the classic workhorse; DPO, a clever and more direct method; and GRPO, an efficient, critic-free alternative pioneered by recent research.
1. The Classic Approach: Proximal Policy Optimization (PPO)
The traditional RLHF pipeline relies on PPO for its final and most critical phase. It carefully updates the language model based on feedback from a separately trained reward model. Below is an interactive walkthrough of its three core phases.
Mathematical Foundations: Policy Gradient Methods
PPO builds on the Policy Gradient Theorem, which provides the mathematical foundation for optimizing policies in reinforcement learning.
Policy Gradient Theorem: The Foundation of Learning
The Big Picture: Imagine teaching a student to write better essays. When they write something good, you encourage more of that behavior. When they write something bad, you discourage it. That's exactly what policy gradients do for LLMs!
The Mathematical Heart:
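In standard notation, the policy-gradient estimator being described is:
\[ \nabla_\theta J(\theta) \;=\; \hat{\mathbb{E}}_t \Big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \Big] \]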
Let's break this down piece by piece:
- \(\theta\): The model's "brain" - billions of numbers that determine how it thinks
- \(\nabla_\theta\): "Which direction should we adjust the brain to get better?"
- \(\pi_\theta(a_t | s_t)\): "How likely is the model to choose token \(a_t\) given context \(s_t\)?"
- \(\log \pi_\theta\): We use log because it makes math easier and more stable
- \(\hat{A}_t\): "Was this choice better (+) or worse (-) than average?"
- \(\hat{\mathbb{E}}_t\): Average over all the examples we've seen
The Learning Rule (Simplified):
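In update form, with learning rate \(\alpha\):
\[ \theta \;\leftarrow\; \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \]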
Think of it like this:
- If \(\hat{A}_t = +2\): "This token choice was really good! Make it 2x more likely next time."
- If \(\hat{A}_t = 0\): "This choice was average. Don't change anything."
- If \(\hat{A}_t = -1.5\): "This was a bad choice. Make it less likely next time."
Real Example: If the model generates "The sky is purple" (bad, \(\hat{A}_t = -2\)), we decrease the probability of "purple" after "sky is". If it generates "The sky is blue" (good, \(\hat{A}_t = +1.5\)), we increase the probability of "blue".
Detailed LLM Example: Teaching Better Explanations
Scenario: Training ChatGPT to give better physics explanations
User Prompt: "Explain quantum physics to a beginner"
Current Context: "Quantum physics is"
Token Choice: "fascinating" vs "complicated"
Step 1: Current Policy Probabilities
• \(\pi_\theta(\text{"fascinating"}|\text{context}) = 0.7\) (70% chance)
• \(\pi_\theta(\text{"complicated"}|\text{context}) = 0.3\) (30% chance)
Step 2: Human Feedback Analysis
• Complete response with "fascinating": "Quantum physics is fascinating because it reveals how particles behave in surprising ways..."
• Human rating: 4.5/5 (engaging, accessible)
• Advantage: \(\hat{A}_t = +2.3\) (much better than average)
Step 3: Policy Gradient Calculation
\(L^{PG} = \log(0.7) \times 2.3\)
\(L^{PG} = -0.356 \times 2.3 = -0.82\)
Step 4: Learning Effect
• Gradient pushes model to increase \(\pi_\theta(\text{"fascinating"})\) from 0.7 → 0.75
• Model learns: "When explaining complex topics, use engaging language"
• Future responses become more accessible and interesting
Counter-example: If response was confusing (\(\hat{A}_t = -1.5\)), model would decrease probability of "complicated" language
Trust Region Policy Optimization (TRPO): The Predecessor
PPO was designed to achieve TRPO's stability with a simpler first-order optimization approach. Understanding TRPO helps explain PPO's design choices.
TRPO: Learning with Training Wheels
The Catastrophic Problem: Imagine you're learning to drive. If you make huge steering corrections, you'll crash! Similarly, if we make huge changes to an LLM's "brain," it might forget how to speak English entirely and start generating gibberish.
TRPO's Ingenious Solution: "Let's improve the model, but with safety guardrails!"
The Main Objective (What We Want to Maximize):
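\[ \text{maximize}_{\theta}\;\; \hat{\mathbb{E}}_t \left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t \right] \]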
Let's decode this step by step:
- \(\pi_\theta(a_t|s_t)\): New model's probability of choosing token \(a_t\)
- \(\pi_{\theta_{old}}(a_t|s_t)\): Old model's probability of choosing the same token
- The ratio \(\frac{\pi_\theta}{\pi_{\theta_{old}}}\): "How much more/less likely is this choice now?"
- If ratio = 2: New model is 2x more likely to choose this token
- If ratio = 0.5: New model is half as likely to choose this token
- If ratio = 1: No change in probability
The Logic: If \(\hat{A}_t > 0\) (good choice), we want ratio > 1. If \(\hat{A}_t < 0\) (bad choice), we want ratio < 1. Multiply them together and we get a score to maximize!
But Here's the Safety Constraint (The Training Wheels):
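\[ \hat{\mathbb{E}}_t \Big[ \mathbb{D}_{KL}\big( \pi_{\theta_{old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \big) \Big] \;\le\; \delta \]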
KL Divergence Explained Like You're 5:
- KL = 0: "The new model thinks exactly like the old model" (identical twins)
- KL = 0.01: "Very similar thinking, just tiny differences" (siblings)
- KL = 0.1: "Noticeably different but still related" (cousins)
- KL = 1.0: "Very different thinking" (strangers)
The Constraint Says: "You can improve the model, but don't let it become too different from what it was. Keep KL ≤ δ (usually 0.01)."
Why This is Hard: This creates a complex optimization problem that requires second-order derivatives (like computing the curvature of a mountain while climbing it). It's mathematically elegant but computationally expensive!
Detailed LLM Example: Poetry Generation Constraints
Scenario: Training an LLM to write better poetry using TRPO
User Prompt: "Write a haiku about artificial intelligence"
Old Policy Response:
"Machines that can think
Processing information
Future is here now"
• Probability: \(\pi_{old}(\text{this poem}|\text{prompt}) = 0.0023\)
Proposed New Policy Response:
"Silicon neurons
Dancing through digital dreams
Consciousness blooms bright"
• Probability: \(\pi_{new}(\text{this poem}|\text{prompt}) = 0.0089\)
• Human rating: Much more creative and poetic
TRPO Constraint Check:
• KL divergence limit: \(\delta = 0.01\) (stay close to old policy)
• Actual KL: \(KL[\pi_{old}, \pi_{new}] = 0.023\) (too big a change!)
• TRPO decision: Reject this update, it's too dramatic
• Required action: Use smaller learning rate, make gradual improvements
The Problem: TRPO's conservative approach means the model learns very slowly to be more creative, taking many small steps instead of one good leap
PPO's Advantage: Clipping allows bigger improvements while staying stable
PPO Core Idea: Clipped Surrogate Objective
PPO's main innovation is replacing TRPO's complex constrained optimization with a simple clipping mechanism that prevents destructive policy updates.
PPO's Brilliant Solution: Smart Clipping
The Genius Insight: "What if instead of complex math constraints, we just put a simple 'speed limit' on how much the model can change?" This is PPO's revolutionary idea!
Step 1: Calculate the Change Ratio
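\[ r_t(\theta) \;=\; \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \]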
This ratio tells us everything:
- r = 1.0: "No change - new model acts exactly like old model"
- r = 1.5: "New model is 50% more likely to choose this token"
- r = 0.7: "New model is 30% less likely to choose this token"
- r = 3.0: "DANGER! New model is 3x more likely - too big a change!"
- r = 0.1: "DANGER! New model almost never chooses this - too big a change!"
Step 2: PPO's Clipping Magic
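\[ L^{CLIP}(\theta) \;=\; \hat{\mathbb{E}}_t \Big[ \min\big( r_t(\theta)\, \hat{A}_t,\;\; \text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \big) \Big] \]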
The Clipping Function Explained:
- If \(\epsilon = 0.2\): We clip ratios to the range [0.8, 1.2]
- If r = 0.5: clip(0.5, 0.8, 1.2) = 0.8 (bring it up to minimum)
- If r = 1.0: clip(1.0, 0.8, 1.2) = 1.0 (no change needed)
- If r = 2.5: clip(2.5, 0.8, 1.2) = 1.2 (bring it down to maximum)
The "min" Operation - PPO's Safety Net:
We compute TWO objectives: one with the original ratio, one with the clipped ratio. Then we take the minimum (most pessimistic) of the two.
Why This Works:
- If the model wants to make a good change (positive advantage): Clipping prevents it from getting too excited and making huge changes
- If the model wants to make a bad change (negative advantage): Clipping prevents it from making huge destructive changes
- The result: Steady, stable learning without catastrophic failures!
Brilliant Simplicity: Instead of TRPO's complex second-order optimization, PPO just says "don't change by more than 20% in any direction" and it works amazingly well!
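As a minimal sketch (PyTorch-style, assuming per-token log-probabilities and advantages are already available), the clipping rule looks like this; the numbers reproduce Case 1 of the worked example below:
```python
import torch

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Per-token PPO clipped surrogate (the quantity being maximized)."""
    ratio = torch.exp(logp_new - logp_old)                       # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped)                         # pessimistic choice

# Case 1 below: pi_old = 0.40, pi_new = 0.65, advantage = +1.8
obj = ppo_clip_objective(torch.log(torch.tensor([0.65])),
                         torch.log(torch.tensor([0.40])),
                         torch.tensor([1.8]))
print(obj)  # tensor([2.1600]) -- the ratio 1.625 was clipped to 1.2
```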
Detailed LLM Example: Learning Programming Advice
Scenario: Training ChatGPT to give better programming advice
User Question: "How do I learn programming effectively?"
Context Generated So Far: "The best way to learn programming is"
Token Decision: "practice" vs "reading"
PPO Settings: \(\epsilon = 0.2\) (clipping range [0.8, 1.2])
Case 1: High-Reward Token (Practice-Focused Response)
• Complete response: "The best way to learn programming is practice - build projects, solve coding challenges, and write code daily"
• Human feedback: Very helpful, actionable advice
• Advantage: \(\hat{A}_t = +1.8\) (much better than average)
• Old policy: \(\pi_{old}(\text{"practice"}|\text{context}) = 0.4\) (40% chance)
• New policy: \(\pi_{new}(\text{"practice"}|\text{context}) = 0.65\) (65% chance)
• Ratio: \(r_t = 0.65/0.4 = 1.625\) (62.5% increase!)
• Clipped ratio: \(\min(1.625, 1.2) = 1.2\) (capped at 20% increase)
• Final objective: \(\min(1.625 \times 1.8, 1.2 \times 1.8) = \min(2.925, 2.16) = 2.16\)
Case 2: Alternative Token (Reading-Focused)
• Complete response: "The best way to learn programming is reading documentation and tutorials thoroughly"
• Human feedback: Somewhat helpful but less actionable
• Advantage: \(\hat{A}_t = -0.9\) (below average)
• Clipping prevents over-penalization of this reasonable alternative
Learning Result: Model learns to favor hands-on, practical advice while not completely dismissing other approaches
PPO Algorithm: High-Level Structure
PPO follows a simple iterative process: collect trajectories, compute advantages, and optimize the clipped surrogate objective for multiple epochs.
PPO Algorithm: The Complete Recipe for Success
Think of PPO like a chef perfecting a recipe through careful experimentation:
Phase 1: Setup Your Kitchen (Initialization)
- \(\pi_{\theta_0}\): Your "chef" (the language model that generates responses)
- \(V_{\phi_0}\): Your "food critic" (predicts how good a response will be)
- Both start as apprentices and will get better together!
Phase 2: The Learning Cycle (Repeated Thousands of Times)
Step 1: Practice Cooking (Data Collection)
- Give the model 512 different prompts (like "Explain gravity")
- Let it generate responses using current policy \(\pi_{\theta_k}\)
- This is expensive! Each response takes GPU time and energy
- Result: 512 prompt-response pairs to learn from
Step 2: Get Expert Feedback (Reward & Advantage Computation)
- Feed each response to the reward model (the "expert judge")
- Get scores like: "This explanation is 8/10" or "This is confusing, 3/10"
- Compute advantages \(\hat{A}_t\): "Was this response better or worse than expected?"
- Positive advantage = "Surprisingly good!" Negative = "Disappointing"
Step 3: Learn Intensively (The 4-Epoch Training)
Here's PPO's secret sauce: Instead of throwing away the data after one use, we learn from it 4 times!
- Epoch 1: "Here's what we did and how good it was..."
- Epoch 2: "Let me think about this again..."
- Epoch 3: "I'm starting to see the patterns..."
- Epoch 4: "Now I really understand!"
- Clipping ensures: We don't "over-learn" and break the model
Step 4: Graduation (Policy Update)
After 4 epochs of learning, \(\theta_k\) becomes \(\theta_{k+1}\) - a smarter, better model!
The Economics: Data collection is expensive (like hiring a chef to cook), but learning from existing data is cheap (like studying recipes). PPO maximizes learning per dollar spent!
Detailed LLM Example: Customer Service Training
Scenario: Training an AI assistant for customer service
Training Batch Example:
• 512 customer complaints processed simultaneously
• Each response up to 2048 tokens (full conversation)
• Learning rate: 1e-6 (very careful updates)
• Clipping: \(\epsilon = 0.2\) (20% maximum change)
• KL penalty: \(\beta = 0.02\) (stay professional)
Sample Customer Complaint:
"My order was delayed and I'm very frustrated. This is unacceptable!"
Response A (High Reward):
"I completely understand your frustration, and I sincerely apologize for the delay. Let me check your order status and see how we can make this right..."
• Empathetic, professional, solution-focused
• Reward score: 4.2/5.0
Response B (Low Reward):
"Orders sometimes get delayed. Please be patient."
• Dismissive, unhelpful
• Reward score: 1.1/5.0
Training Effect:
• Model learns to prioritize empathy and problem-solving
• Increases probability of helpful phrases like "I understand" and "Let me help"
• Decreases probability of dismissive language
• After 10,000 iterations: Customer satisfaction increases from 3.2/5 to 4.1/5
The Complete PPO Objective Function
In practice, PPO combines the clipped policy loss with value function training and an entropy bonus to create a robust, multi-objective optimization.
PPO's Triple Optimization: Juggling Three Goals
Imagine training a student who must: (1) Give better answers, (2) Predict test scores accurately, and (3) Stay creative. PPO does all three simultaneously!
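The combined objective (maximized with respect to both \(\theta\) and \(\phi\)) is
\[ L_t(\theta, \phi) \;=\; \hat{\mathbb{E}}_t \Big[ L_t^{CLIP}(\theta) \;-\; c_1\, L_t^{VF}(\phi) \;+\; c_2\, S[\pi_\theta](s_t) \Big] \]
with the three terms unpacked below.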
🎯 Goal 1: Better Token Choices (Policy Loss)
\(L_t^{CLIP}(\theta)\) = "Make the model choose better words/tokens"
- This is the clipped objective we learned about
- Positive when model improves, negative when it gets worse
- The main driver of better responses
🔮 Goal 2: Better Predictions (Value Function Loss)
The Value Function is Like a Fortune Teller:
- \(V_\phi(s_t)\): "I predict this response will get a reward of 2.3"
- \(V_t^{targ}\): "Actually, it got a reward of 2.8"
- Error: (2.3 - 2.8)² = 0.25
- Goal: Make the fortune teller more accurate over time
- Why we need this: Good predictions → better advantage estimates → better learning
🎨 Goal 3: Stay Creative (Entropy Bonus)
Entropy Explained with a Coin Flip Analogy:
- Fair coin (50/50): High entropy = maximum uncertainty/creativity
- Biased coin (99/1): Low entropy = very predictable/boring
- For LLMs: High entropy = considers many word choices
- Low entropy problem: Model always says "The answer is..." (repetitive!)
- Solution: Small bonus for staying diverse and creative
⚖️ Balancing Act:
- \(c_1 = 0.5\): Value function training is half as important as policy
- \(c_2 = 0.01\): Entropy gets a tiny bonus (don't want too much randomness)
- The art: Tuning these coefficients for optimal performance
🧠 Why This Works: Like teaching a student to be accurate AND creative AND self-aware, PPO creates well-rounded AI that doesn't just memorize but truly understands!
Detailed LLM Example: Teaching Children About AI
Scenario: Training ChatGPT to explain complex topics to children
Parent's Request: "Explain machine learning to my 5-year-old"
Generated Response:
"Machine learning is like teaching a computer to recognize patterns, just like how you learn to recognize different animals. When you see a dog, you know it's a dog because you've seen many dogs before and learned what makes them special - four legs, fur, wagging tail. Computers can learn the same way!"
Step 1: Reward Model Evaluation
• Age-appropriateness: 4.5/5 (uses simple analogies)
• Accuracy: 4.2/5 (correct but simplified)
• Engagement: 4.0/5 (relatable examples)
• Overall reward: \(r = +2.1\) (much better than average)
Step 2: PPO Loss Components
• Clipped policy loss: \(L^{CLIP} = -0.73\) (encourages this response)
• Value function prediction: \(V = 1.8\) ("I think this will get ~1.8 reward")
• Actual target: \(V^{targ} = 2.1\) (it got 2.1, better than predicted!)
• Value loss: \(L^{VF} = (1.8-2.1)^2 = 0.09\) (improve prediction)
• Entropy: \(S = 3.2\) (good word variety, not repetitive)
Step 3: Combined Optimization
\(L = -0.73 - 0.5 \times 0.09 + 0.01 \times 3.2 = -0.743\)
Learning Outcome:
• Model learns to use animal analogies for children
• Value function learns that simple explanations get higher rewards
• Future responses become more child-friendly and engaging
Generalized Advantage Estimation (GAE)
The quality of advantage estimation is critical for PPO performance. GAE provides a sophisticated method to balance bias and variance in advantage estimates.
GAE: Computing "How Good Was This Choice?"
Step 1: Temporal Difference (TD) Error
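\[ \delta_t \;=\; r_t + \gamma\, V(s_{t+1}) - V(s_t) \]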
Intuition: Did we get more reward than expected? \(r_t\) = actual reward, \(V(s_t)\) = what we expected, \(\gamma V(s_{t+1})\) = discounted future value.
Step 2: GAE Advantage (Weighted Average of TD Errors)
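\[ \hat{A}_t^{GAE(\gamma, \lambda)} \;=\; \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l} \]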
What this does: Look at TD error now (\(\delta_t\)) and future TD errors (\(\delta_{t+1}, \delta_{t+2}...\)), but weight future errors less (\(\gamma\lambda < 1\)).
Practical Recursive Form:
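\[ \hat{A}_t \;=\; \delta_t + \gamma\lambda\, \hat{A}_{t+1} \]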
\(\lambda\) parameter: \(\lambda = 0\) = only use immediate TD error. \(\lambda = 1\) = use all future TD errors equally. \(\lambda = 0.95\) = good balance.
Detailed LLM Example: Poetry Generation Analysis
Scenario: Training an LLM to write better poetry
User Request: "Write a haiku about artificial intelligence"
Generated Haiku (token by token):
"Silicon minds dream
Processing data streams
Consciousness awakening"
Key Tokens Analyzed: ["Silicon", "minds", "awakening"]
• \(\gamma = 0.99\) (future rewards matter), \(\lambda = 0.95\) (balance bias/variance)
Step 1: Immediate Rewards
• "Silicon": 0.5 (technical but cold)
• "minds": 1.2 (adds humanity)
• "awakening": 2.1 (powerful, evocative ending)
Step 2: Value Function Predictions
• After "Silicon": \(V = 0.8\) ("this poem might be okay")
• After "minds": \(V = 1.5\) ("getting better, more engaging")
• After "awakening": \(V = 1.9\) ("strong finish expected")
Step 3: TD Error Calculations
• \(\delta_0 = 0.5 + 0.99 \times 1.5 - 0.8 = 1.185\) (better than expected)
• \(\delta_1 = 1.2 + 0.99 \times 1.9 - 1.5 = 1.581\) (much better)
• \(\delta_2 = 2.1 + 0 - 1.9 = 0.2\) (slightly better than predicted)
Step 4: GAE Advantages (Credit Assignment)
• \(\hat{A}_2 = 0.2\) ("awakening" was good)
• \(\hat{A}_1 = 1.581 + 0.99 \times 0.95 \times 0.2 = 1.769\) ("minds" set up the great ending)
• \(\hat{A}_0 = 1.185 + 0.99 \times 0.95 \times 1.769 = 2.849\) ("Silicon" started a winning poem)
Learning Outcome: Model learns that technical terms can work if they lead to emotional payoffs
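A small sketch (plain Python, assuming the episode ends after the last token) that reproduces the numbers above:
```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}."""
    advantages = [0.0] * len(rewards)
    next_value, next_adv = 0.0, 0.0   # no value or advantage beyond the final token
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]    # TD error
        next_adv = delta + gamma * lam * next_adv              # GAE recursion
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

print([round(a, 3) for a in gae([0.5, 1.2, 2.1], [0.8, 1.5, 1.9])])
# [2.849, 1.769, 0.2]
```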
PPO Training Loop: Implementation Details
The PPO training loop efficiently implements the clipped objective with multiple epochs and minibatch updates to maximize sample efficiency.
PPO Training Loop: The Full Recipe
Think of it like learning to cook from practice:
- 1. Practice: Generate responses to prompts (collect data)
- 2. Get feedback: Reward model scores responses, compute advantages
- 3. Learn (4 times from same practice session):
- • Shuffle: Mix up the data randomly
- • Small bites: Process data in small batches (64-512 examples)
- • Improve policy: Update model weights using clipped objective
- • Improve critic: Train value function to predict rewards better
Why repeat 4 times? Generating data is expensive (requires running the LLM), so we squeeze maximum learning from each batch. Clipping prevents us from "over-learning" and ruining the model.
Detailed LLM Example: Medical Q&A Training Progress
Scenario: Training an AI medical assistant over 10,000 training steps
Sample Question: "What should I do about persistent headaches?"
Early Training Response (Step 100):
"Headaches can be caused by many things. Try drinking water."
• Reward: 0.2/5.0 (too brief, not helpful)
• Length: 45 tokens (too short)
Late Training Response (Step 10,000):
"Persistent headaches can have various causes including dehydration, stress, or underlying conditions. I recommend: 1) Keep a headache diary to track triggers, 2) Ensure adequate hydration and sleep, 3) Consider stress management techniques, and 4) Consult a healthcare provider if headaches persist or worsen, especially if accompanied by other symptoms."
• Reward: 1.8/5.0 (comprehensive, actionable)
• Length: 120 tokens (appropriately detailed)
Training Metrics Evolution:
• Policy loss: -0.05 → -0.23 (learning to maximize helpful responses)
• Value loss: 0.8 → 0.3 (better at predicting response quality)
• KL penalty: 0.02 (staying close to medical training base)
• Clip fraction: 0.18 (healthy learning rate)
Quality Improvements:
• Helpfulness: 3.2/5 → 4.1/5 (more actionable advice)
• Harmlessness: 4.8/5 → 4.9/5 (appropriate medical disclaimers)
• Honesty: 3.9/5 → 4.3/5 (acknowledges limitations)
Safety Check: If KL > 0.05, model might start giving inappropriate medical advice
PPO Variants and Practical Considerations
PPO has several variants and important implementation considerations that affect its performance in practice.
PPO Alternative: Adaptive KL Penalty
Instead of clipping, use a "speed limit" that adapts:
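\[ L^{KLPEN}(\theta) \;=\; \hat{\mathbb{E}}_t \Big[ r_t(\theta)\, \hat{A}_t \;-\; \beta\, \mathbb{D}_{KL}\big( \pi_{\theta_{old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \big) \Big] \]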
How it works: Maximize reward (first term) but subtract a penalty for changing too much (second term). β controls how strict the penalty is.
Adaptive β (Smart Speed Limit):
- Going too fast? If KL > 1.5 × target: β ← 2β (stricter penalty)
- Going too slow? If KL < target/1.5: β ← β/2 (relax penalty)
Common Failure Modes: Reward hacking (model finds loopholes), distribution shift (model forgets original training), training instability
Detailed LLM Example: Creative Writing Assistant Tuning
Scenario: Fine-tuning hyperparameters for a creative writing assistant
Test Prompt: "Write an engaging opening paragraph for a mystery novel"
KL Penalty β Experiments:
β = 0.001 (Too Permissive):
"The dark mysterious shadow crept through the dark mysterious alley of mysterious darkness with mysterious intent..."
• Result: Repetitive, incoherent after 1K steps
• Problem: No constraint on staying reasonable
β = 0.02 (Just Right):
"Detective Sarah Chen had seen enough crime scenes to know when something didn't add up, but the pristine living room with a single drop of blood on the white sofa told a story that made her stomach turn."
• Result: 4.1/5 human preference, engaging and coherent
• Sweet spot: Creative but grounded
β = 0.1 (Too Conservative):
"There was a crime. The detective investigated. It was mysterious."
• Result: 3.2/5 preference (boring, generic)
• Problem: Too afraid to deviate from training data
Learning Rate Experiments:
• 5e-6: Fast improvement but unstable - starts writing nonsense
• 1e-6: Gradual, stable improvement in creativity
• 5e-7: Too slow, still generic after 20K steps
Reward Hacking Example: When β too low, model learns to repeat high-scoring phrases like "compelling narrative" and "riveting mystery" regardless of context
Key Takeaways
- SFT creates a solid foundation by teaching the model good examples
- Reward Modeling learns human preferences from comparative rankings
- PPO optimizes the policy to maximize rewards while staying stable
Why this works: Each stage builds on the previous one, creating a robust alignment pipeline that scales human feedback effectively.
Implementation Notes
- Data Requirements: SFT needs ~10K examples, RM needs ~50K preference pairs
- Computational Cost: RL stage is most expensive (requires multiple model copies)
- Hyperparameters: \(\beta\) controls the strength of the KL penalty, i.e., how far the policy may drift from the reference model
Common Issues: Reward hacking, distribution shift, and instability during RL training.
2. A More Direct Path: Direct Preference Optimization (DPO)
Training a separate reward model can be complex. Direct Preference Optimization (DPO) bypasses it by leveraging a mathematical relationship to directly optimize the policy on preference data. Below is an interactive walkthrough of DPO's key innovations and mathematical foundations.
PPO vs. DPO at a Glance
- PPO (Explicit Reward): preferences → train a reward model → optimize the policy with RL.
- DPO (Implicit Reward): preferences → optimize the policy directly via the DPO loss.
The Bradley-Terry Preference Model
DPO builds on the Bradley-Terry model, which converts human preferences into mathematical probabilities that can be optimized.
Bradley-Terry: The Foundation of Preference Learning
The Big Picture: Imagine you're a judge in a cooking competition. When you taste two dishes, you prefer one over the other. But how do we turn that preference into math? That's what Bradley-Terry does!
The Core Assumption: Every response has a hidden "quality score" \(r^*(x,y)\) that we can't see directly.
The Mathematical Heart:
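\[ p^*(y_w \succ y_l \mid x) \;=\; \frac{\exp\big(r^*(x, y_w)\big)}{\exp\big(r^*(x, y_w)\big) + \exp\big(r^*(x, y_l)\big)} \;=\; \sigma\big( r^*(x, y_w) - r^*(x, y_l) \big) \]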
Let's break this down piece by piece:
- \(r^*(x, y)\): The true, hidden quality of response \(y\) to prompt \(x\)
- \(\exp(r^*)\): Convert scores to positive numbers (like turning grades into probabilities)
- Softmax: \(\frac{\exp(r_w)}{\exp(r_w) + \exp(r_l)}\) normalizes so probabilities sum to 1
- \(\sigma(r_w - r_l)\): Sigmoid of the difference - elegant shorthand!
The Intuition:
- If \(r_w - r_l = +3\): Winner is much better → \(\sigma(3) = 95\%\) chance of preference
- If \(r_w - r_l = 0\): Both equally good → \(\sigma(0) = 50\%\) chance (coin flip)
- If \(r_w - r_l = -2\): Winner is actually worse → \(\sigma(-2) = 12\%\) chance
The Magic: We don't need to know the exact scores \(r^*\) - we only need their difference! This is why preference learning works so well.
Detailed LLM Example: Science Education Assistant
Scenario: Training an AI tutor to explain scientific concepts
Student Question: "Can you explain how photosynthesis works?"
Response A (Winner - Detailed & Engaging):
"Photosynthesis is like a plant's kitchen! Plants use chlorophyll (the green stuff in leaves) as their cooking equipment to combine sunlight, water from roots, and carbon dioxide from air. The 'recipe' produces glucose (plant food) and oxygen (which we breathe). The chemical equation is: 6CO₂ + 6H₂O + light energy → C₆H₁₂O₆ + 6O₂. This process happens in tiny structures called chloroplasts and is why plants are green and why we have oxygen to breathe!"
• Educational value: High (uses analogy + science)
• Engagement: High (relatable examples)
• Accuracy: High (correct equation and facts)
Response B (Loser - Oversimplified):
"Plants eat sunlight and make food."
• Educational value: Low (no real explanation)
• Engagement: Low (boring, unhelpful)
• Accuracy: Misleading (plants don't "eat" sunlight)
Hidden Quality Assessment:
• \(r^*(x, y_w) = 2.5\) (comprehensive, accurate, engaging)
• \(r^*(x, y_l) = -1.0\) (unhelpful, potentially confusing)
Bradley-Terry Preference Calculation:
• Score difference: \(2.5 - (-1.0) = 3.5\)
• Preference probability: \(\sigma(3.5) = \frac{1}{1+e^{-3.5}} = 97.1\%\)
Learning Insight: 97.1% of humans would prefer the detailed explanation, so DPO will strongly encourage similar comprehensive responses
Traditional RLHF: The Three-Stage Pipeline
Before DPO, alignment required three complex stages. Understanding this helps appreciate DPO's elegant simplification.
RLHF: The Traditional Way (Complex but Effective)
Stage 1: Supervised Fine-Tuning (SFT)
Start with a base model and teach it to follow instructions using high-quality examples. This creates our reference policy \(\pi_{\text{ref}}\).
Stage 2: Reward Modeling
Train a separate neural network \(r_\phi(x,y)\) to predict human preferences using the Bradley-Terry model:
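\[ \mathcal{L}_R(\phi) \;=\; -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big] \]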
This is a binary classification problem: "Is response A better than response B?"
Stage 3: RL Fine-Tuning (The Hard Part)
Use reinforcement learning to optimize the policy using the reward model:
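\[ \max_{\pi_\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] \;-\; \beta\, \mathbb{D}_{KL}\big( \pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x) \big) \]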
Why This is Hard:
- Three separate models: Base model, reward model, and policy model
- Complex RL algorithms: PPO, TRPO, or other policy gradient methods
- Instability: RL training can be unstable and sensitive to hyperparameters
- Memory intensive: Need to store multiple large models simultaneously
- Reward hacking: Model might find ways to "cheat" the reward function
The KL Term Explained: \(\beta \mathbb{D}_{KL}\) prevents the model from drifting too far from the original SFT model, maintaining coherent language generation.
Detailed LLM Example: Traditional RLHF Challenges
Scenario: Training a customer service chatbot using traditional RLHF
The Multi-Model Complexity:
1. Base Model: General language understanding
2. SFT Model: Learns customer service examples
3. Reward Model: Learns to score responses like humans
4. Policy Model: Final optimized chatbot
Training Timeline & Challenges:
Stage 1 - SFT (3 days):
• Teach basic customer service responses
• Example: "Thank you for contacting us. How can I help?"
• Challenge: Limited to training examples only
Stage 2 - Reward Modeling (1 day):
• Train separate model to score responses
• Example: "I understand your frustration" = 4.2/5
• Challenge: Reward model can overfit to training preferences
Stage 3 - PPO Training (5 days):
• Use reward model to improve policy
• Challenge: PPO is sensitive to hyperparameters
• Common failure: Model starts generating gibberish to "hack" rewards
Real Problems Encountered:
• Reward model gives high scores to nonsensical responses
• Policy collapse: "Thank you thank you thank you..." (repetitive)
• Hyperparameter sensitivity: Small changes break everything
• Complex pipeline: Three separate training processes to manage
DPO's Key Insight: The Optimal Policy-Reward Relationship
DPO's breakthrough comes from a mathematical relationship that allows us to express rewards directly in terms of policies.
The Mathematical Breakthrough: Eliminating the Reward Model
Step 1: The Optimal Policy Formula
For any reward function \(r(x,y)\), the optimal policy that maximizes the RLHF objective has this exact form:
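\[ \pi_r(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\Big( \frac{1}{\beta}\, r(x, y) \Big) \]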
What each part means:
- \(\pi_r(y|x)\): The optimal policy for reward function \(r\)
- \(\pi_{\text{ref}}(y|x)\): The reference policy (our starting point)
- \(\exp(\frac{1}{\beta}r(x,y))\): Exponential of scaled reward (higher reward = higher probability)
- \(Z(x)\): Normalization constant (ensures probabilities sum to 1)
Step 2: The Brilliant Inversion
Instead of finding the optimal policy for a given reward, we can solve for the reward given a policy:
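\[ r(x, y) \;=\; \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \;+\; \beta \log Z(x) \]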
The Magic Cancellation:
When we compute the reward difference for two responses, the \(Z(x)\) terms cancel out perfectly:
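\[ r(x, y_w) - r(x, y_l) \;=\; \beta \log \frac{\pi_r(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_r(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \]
The \(\beta \log Z(x)\) terms, which appear in both rewards, subtract away.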
This is revolutionary because:
- We can compute reward differences using only policy probabilities
- No need for a separate reward model
- The intractable partition function \(Z(x)\) disappears
- Direct optimization becomes possible
Detailed LLM Example: Poetry Assistant Breakthrough
Scenario: Training an AI poetry assistant using DPO's implicit rewards
User Request: "Write a poem about artificial intelligence"
Good Poem (Winner):
"Silicon dreams awaken,
Algorithms dance with thought,
Mind meets machine at last."
• Creative, evocative, follows haiku structure
• Current policy: \(\pi_{\theta}(y_w|x) = 0.15\) (15% chance)
• Reference policy: \(\pi_{\text{ref}}(y_w|x) = 0.10\) (10% chance)
Bad Poem (Loser):
"AI is good technology,
Computers are smart,
The end."
• Boring, prosaic, not really poetry
• Current policy: \(\pi_{\theta}(y_l|x) = 0.05\) (5% chance)
• Reference policy: \(\pi_{\text{ref}}(y_l|x) = 0.08\) (8% chance)
DPO's Magic: Implicit Reward Calculation
• \(\beta = 0.1\) (temperature parameter)
• Good poem log ratio: \(\log(0.15/0.10) = 0.405\)
• Bad poem log ratio: \(\log(0.05/0.08) = -0.470\)
• Implicit reward difference: \(0.1 \times (0.405 - (-0.470)) = 0.0875\)
The Breakthrough: Without ever training a reward model, DPO automatically knows the current policy values the good poem 0.0875 points higher than the bad poem. This drives learning toward more creative, poetic responses!
The DPO Loss Function: Putting It All Together
By substituting the policy-based reward into the Bradley-Terry model, we get a simple classification loss that can be optimized directly.
DPO Loss: The Final Elegant Formula
Starting from Bradley-Terry: We know human preferences follow:
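\[ p^*(y_w \succ y_l \mid x) \;=\; \sigma\big( r^*(x, y_w) - r^*(x, y_l) \big) \]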
Substituting our policy-based reward:
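\[ p_\theta(y_w \succ y_l \mid x) \;=\; \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \]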
The DPO Loss Function:
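\[ \mathcal{L}_{DPO}(\pi_\theta; \pi_{\text{ref}}) \;=\; -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right] \]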
What this loss does:
- Increases \(\pi_\theta(y_w|x)\): Make winning responses more likely
- Decreases \(\pi_\theta(y_l|x)\): Make losing responses less likely
- Relative to \(\pi_{\text{ref}}\): Stay close to the reference policy
- Automatic weighting: Focus more on examples where the model is wrong
Why this is beautiful:
- ✅ Simple: Just a classification loss, no RL needed
- ✅ Stable: No complex RL dynamics or hyperparameter sensitivity
- ✅ Memory efficient: Only need two models (policy + reference)
- ✅ Theoretically grounded: Provably optimal under Bradley-Terry assumptions
- ✅ Direct: Optimize exactly what you want (human preferences)
Detailed LLM Example: Complete DPO Training Step
Scenario: Training an AI tutor to explain machine learning concepts
Student Question: "Can you explain what machine learning is?"
Winner Response (Detailed & Clear):
"Machine learning is like teaching a computer to recognize patterns by showing it lots of examples. Instead of programming specific rules, we let the computer figure out patterns on its own. For example, to recognize cats in photos, we show it thousands of cat pictures and it learns what features make a cat - whiskers, pointy ears, etc."
• \(\log \pi_{\text{ref}}(y_w|x) = -1.2\) (reference model probability)
• \(\log \pi_{\theta}(y_w|x) = -1.0\) (current model is more likely to generate this)
Loser Response (Vague & Unhelpful):
"Machine learning is when computers learn things automatically."
• \(\log \pi_{\text{ref}}(y_l|x) = -1.5\) (reference model probability)
• \(\log \pi_{\theta}(y_l|x) = -2.0\) (current model is less likely to generate this)
DPO Loss Calculation (\(\beta = 0.1\)):
1. Log ratio differences:
• Winner: \((-1.0) - (-1.2) = +0.2\) (model improved on good response)
• Loser: \((-2.0) - (-1.5) = -0.5\) (model worsened on bad response)
2. Combined difference: \(0.1 \times (0.2 - (-0.5)) = 0.07\)
3. Preference probability: \(\sigma(0.07) = 0.517\) (51.7% confident in ranking)
4. Loss: \(-\log(0.517) = 0.66\)
Learning Effect: The model will increase probability of detailed explanations and decrease probability of vague responses
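A minimal sketch of this calculation (PyTorch-style; the four log-probabilities are the ones listed above):
```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (winner, loser) pair of summed log-probabilities."""
    chosen_logratio = policy_chosen - ref_chosen            # +0.2 in the example
    rejected_logratio = policy_rejected - ref_rejected      # -0.5 in the example
    logits = beta * (chosen_logratio - rejected_logratio)   # 0.07
    return -F.logsigmoid(logits)                            # -log sigma(0.07)

print(dpo_loss(torch.tensor(-1.0), torch.tensor(-2.0),
               torch.tensor(-1.2), torch.tensor(-1.5)))
# tensor(0.6589)  ~ the 0.66 above
```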
DPO Gradient: Understanding the Learning Dynamics
The DPO gradient has an elegant form that automatically focuses on misranked examples and provides stable learning.
DPO Gradient: How the Model Actually Learns
The Gradient Formula:
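\[ \nabla_\theta \mathcal{L}_{DPO} \;=\; -\,\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \sigma\big( \hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w) \big)\, \big[ \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big] \Big] \]
where \(\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\) is the implicit reward.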
Breaking down the components:
- \([\nabla_\theta \log\pi_\theta(y_w|x) - \nabla_\theta \log\pi_\theta(y_l|x)]\): The update direction
- \(\sigma(\hat{r}_\theta(y_l) - \hat{r}_\theta(y_w))\): The adaptive weighting factor
- \(-\beta\): Scale factor (negative because we minimize loss)
The Adaptive Weighting Magic:
- When model is wrong: \(\hat{r}_\theta(y_l) > \hat{r}_\theta(y_w)\) → weight ≈ 1.0 → large updates
- When model is right: \(\hat{r}_\theta(y_w) > \hat{r}_\theta(y_l)\) → weight ≈ 0.0 → small updates
- When model is uncertain: \(\hat{r}_\theta(y_w) \approx \hat{r}_\theta(y_l)\) → weight ≈ 0.5 → medium updates
Why this is brilliant:
- 🎯 Automatic focus: Spends more effort on examples the model gets wrong
- 🛡️ Stability: Reduces updates when model is already correct
- ⚖️ Self-regulating: No need to manually tune learning rates for different examples
- 🎪 Efficient: Doesn't waste time on easy examples
Comparison to Standard Classification: Unlike regular classification which applies uniform updates, DPO automatically adapts the learning intensity based on the model's current performance on each example.
Detailed LLM Example: Adaptive Learning in Action
Scenario: Training a coding assistant to give better programming advice
Coding Question: "How should I handle errors in my Python code?"
Scenario 1: Model is Wrong (Needs Big Updates)
• Winner: "Use try-except blocks to catch and handle specific exceptions gracefully"
• Loser: "Just ignore errors, they usually don't matter"
• Model currently prefers the loser! \(\hat{r}_w = 0.1\), \(\hat{r}_l = 0.3\)
• Gradient weight: \(\sigma(0.3 - 0.1) = 0.55\) (55% of maximum)
• Result: Strong updates to fix this dangerous misconception
Scenario 2: Model is Right (Moderate Updates)
• Winner: "Use descriptive error messages and log exceptions for debugging"
• Loser: "Print 'error occurred' when something goes wrong"
• Model correctly prefers winner: \(\hat{r}_w = 0.4\), \(\hat{r}_l = 0.1\)
• Gradient weight: \(\sigma(0.1 - 0.4) = 0.43\) (43% of maximum)
• Result: Moderate reinforcement of good practices
Scenario 3: Model is Very Confident and Right (Small Updates)
• Winner: "Use specific exception types like FileNotFoundError instead of bare except"
• Loser: "Use bare except: pass to ignore all errors"
• Model strongly prefers winner: \(\hat{r}_w = 0.8\), \(\hat{r}_l = -0.2\)
• Gradient weight: \(\sigma(-0.2 - 0.8) = 0.27\) (27% of maximum)
• Result: Small updates, model already knows this well
Adaptive Learning Benefit: DPO automatically focuses effort where the model needs it most!
DPO Implementation: From Theory to Code
Understanding how DPO is implemented in practice helps bridge the gap between mathematical theory and real-world training.
DPO Training Loop: The Practical Implementation
Data Preparation: Each training example contains \((x, y_w, y_l)\):
- Prompt tokens: \(x = [\text{token}_1, \text{token}_2, ..., \text{token}_n]\)
- Chosen response: \(y_w = [\text{token}_{n+1}, ..., \text{token}_{n+m}]\)
- Rejected response: \(y_l = [\text{token}_{n+1}, ..., \text{token}_{n+k}]\)
Forward Pass Efficiency Trick:
Instead of two separate forward passes, concatenate chosen and rejected responses:
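A rough sketch of the trick (illustrative only; it assumes a Hugging Face-style causal LM, that both sequences are padded to the same length and share the same prompt length, and it ignores padding masks for brevity):
```python
import torch

def concat_forward(model, chosen_ids, rejected_ids, prompt_len):
    """One forward pass over [chosen; rejected], then split the summed log-probs."""
    batch = torch.cat([chosen_ids, rejected_ids], dim=0)          # (2B, T)
    logits = model(input_ids=batch).logits                        # (2B, T, V)
    logps = torch.log_softmax(logits[:, :-1], dim=-1)             # predicts tokens 1..T-1
    token_logps = logps.gather(-1, batch[:, 1:].unsqueeze(-1)).squeeze(-1)
    response_logps = token_logps[:, prompt_len - 1:].sum(dim=-1)  # response tokens only
    chosen_logps, rejected_logps = response_logps.chunk(2, dim=0)
    return chosen_logps, rejected_logps
```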
Log Probability Computation:
- Policy model: \(\log \pi_\theta(y|x) = \sum_{t} \log \pi_\theta(y_t \mid x, y_{<t})\)
- Reference model: \(\log \pi_{\text{ref}}(y|x)\) (with gradients disabled)
- Efficiency: Can use same model with/without adapters for reference
Memory Optimization Strategies:
- PEFT (LoRA): Only train small adapter layers, not full model
- Gradient checkpointing: Trade compute for memory
- Reference-free mode: Skip reference model entirely (less stable)
- Mixed precision: Use fp16/bf16 for memory savings
Training Stability Tips:
- Label smoothing: Mix target with uniform distribution
- Beta scheduling: Start with higher \(\beta\), decay over time
- Learning rate warmup: Gradual increase to prevent instability
Detailed LLM Example: Email Assistant Training
Scenario: Training an AI email assistant using DPO
Training Configuration:
• Base model: Llama-2 7B (business communication specialist)
• Method: LoRA adapters (efficient fine-tuning)
• Beta: 0.1 (balance between exploration and exploitation)
• Learning rate: 5e-4 (faster learning than traditional RLHF)
• Batch size: 64 email preference pairs per update
• Sequence length: 2048 tokens (full email conversations)
Sample Training Pair:
Prompt: "Write a professional email declining a meeting request"
Winner: "Thank you for the meeting invitation. Unfortunately, I have a scheduling conflict and won't be able to attend. Could we explore alternative dates next week? I'm available Tuesday-Thursday afternoons."
• Professional, specific, offers alternatives
Loser: "Can't make it."
• Too brief, unprofessional, unhelpful
Training Efficiency Benefits:
• Training time: 8-12 hours (vs 3-5 days for RLHF)
• Simpler pipeline: No separate reward model training
• Stable learning: No PPO hyperparameter sensitivity
• Direct optimization: Learns exactly what humans prefer
Result: Email assistant learns professional communication patterns 10x faster than traditional RLHF
3. An Efficient RL Approach: Group Relative Policy Optimization (GRPO)
As pioneered by DeepSeek-AI, Group Relative Policy Optimization (GRPO) is an efficient RL algorithm that eliminates the need for a separate, expensive critic model (common in PPO). Instead, it normalizes rewards by comparing a "group" of different responses generated for the same prompt. Below is an interactive walkthrough of GRPO's key innovations and mathematical foundations.
This makes it highly effective for improving reasoning, where a simple, rule-based reward (e.g., "is the final answer correct?") can be used. The "group" allows the model to assign relative credit to different attempts, even if they all lead to the same binary outcome.
GRPO Workflow
GRPO Core Innovation: Critic-Free Learning
GRPO eliminates the need for a separate value function (critic) by using group-based reward normalization to create learning signals.
The Problem with Traditional RL: Expensive Critics
PPO's Challenge: Imagine you're training a student, but you need a separate teacher to constantly evaluate how good each answer is. That's expensive and complex!
PPO Requires:
- Policy Network \(\pi_\theta\): The student (generates responses)
- Value Network \(V_\phi\): The teacher (predicts how good responses will be)
- Both need training: Double the computational cost
- Complex interactions: Teacher and student must learn together
GRPO's Brilliant Solution:
"What if we don't need a separate teacher? What if students can learn by comparing their work with classmates?"
Group-Based Learning:
- Generate multiple responses: Like having 4 students solve the same problem
- Compare within the group: See who did better or worse
- Learn from comparisons: "I should do more like Sarah, less like Bob"
- No external judge needed: The group IS the teacher
Key Insight: You don't need to know absolute quality scores. You just need to know "this response is better than that response" within the same context.
Benefits:
- 🚀 Faster: No critic training needed
- 💰 Cheaper: Half the computational cost
- 🎯 Simpler: One model, one objective
- 🎪 Effective: Works especially well for reasoning tasks
Detailed LLM Example: Math Competition Scenario
Scenario: Training an AI math tutor using GRPO vs PPO
Problem: "Solve: 3x + 7 = 22"
PPO Approach (Complex):
1. Policy generates response: "3x = 15, so x = 5"
2. Critic evaluates: "This looks like it will get reward 0.8"
3. Actual reward: 1.0 (correct)
4. Update both policy AND critic
5. Critic learns to predict better, policy learns from critic
GRPO Approach (Simple):
1. Generate 4 responses simultaneously:
• "x = 5" (correct)
• "x = 5" (correct, different method)
• "x = 7" (wrong)
• "x = 3" (wrong)
2. Compare within group: 2 correct, 2 wrong
3. Learn directly: "Do more like the correct ones"
4. No critic needed!
Result: GRPO achieves similar learning with half the complexity and computational cost
Group Generation: Creating Diverse Responses
GRPO's first step is generating a diverse group of responses to the same prompt, creating a rich comparison set for learning.
Step 1: Sampling Strategy for Diverse Groups
The Goal: Generate \(G\) different responses that explore various approaches to the same problem.
Sampling Parameters:
Where:
- \(x\): The input prompt (same for all responses)
- \(o_i\): The i-th generated response
- \(G\): Group size (typically 4-8 responses)
- \(\pi_\theta\): Current policy (the model being trained)
Diversity Techniques:
- Temperature sampling: \(T > 1\) for more creativity
- Top-k sampling: Consider multiple high-probability tokens
- Multiple random seeds: Ensure different reasoning paths
Why Diversity Matters:
- Exploration: Discover new solution strategies
- Comparison: Need both good and bad examples to learn
- Robustness: Learn to handle different approaches
- Signal strength: More variance = stronger learning signal
Group Size Trade-offs:
- Small groups (G=2-3): Fast, but limited comparison
- Medium groups (G=4-6): Good balance of speed and signal
- Large groups (G=8+): Rich comparisons, but expensive
Detailed LLM Example: Physics Problem Solving
Scenario: Training an AI physics tutor with diverse reasoning approaches
Physics Problem: "A ball is thrown upward with initial velocity 20 m/s. How high does it go?"
GRPO Group Generation (G=4):
Response 1 (Kinematic Equations):
"Using v² = u² + 2as, where final velocity v=0 at max height, initial velocity u=20 m/s, acceleration a=-9.8 m/s². So 0 = 400 + 2(-9.8)s, giving s = 400/19.6 = 20.4 meters."
Response 2 (Energy Conservation):
"Using energy conservation: initial kinetic energy = final potential energy. ½mv² = mgh, so ½(20)² = 9.8h, giving h = 400/19.6 = 20.4 meters."
Response 3 (Time-based Approach):
"Time to reach max height: v = u + at, so 0 = 20 - 9.8t, giving t = 2.04 seconds. Height: s = ut + ½at² = 20(2.04) + ½(-9.8)(2.04)² = 20.4 meters."
Response 4 (Common Error):
"Using s = ut + ½at², with t=2 seconds: s = 20(2) + ½(-9.8)(4) = 40 - 19.6 = 20.4 meters."
(Error: assumed t=2 without calculating)
Diversity Achieved: Three valid methods plus one common mistake, providing rich learning opportunities
Reward Evaluation: Simple Rule-Based Scoring
GRPO works best with simple, objective reward functions that can be easily computed for each response in the group.
Step 2: Computing Individual Rewards
Reward Function Design: GRPO excels with simple, rule-based rewards that are easy to compute and objective.
Common Reward Types:
- Accuracy: \(r_i = 1\) if correct, \(0\) if wrong
- Format compliance: \(r_i = 1\) if follows required format
- Code execution: \(r_i = 1\) if code runs without errors
- Verification: \(r_i = 1\) if answer can be verified
Mathematical Formulation:
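\[ r_i \;=\; R(x, o_i), \qquad i = 1, \dots, G \]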
Where:
- \(R(\cdot)\): Reward function (often rule-based)
- \(x\): Input prompt
- \(o_i\): i-th generated response
- \(r_i\): Scalar reward for response i
Example Reward Functions:
- Math problems: \(R = 1\) if final answer matches ground truth
- Code generation: \(R = 1\) if code passes all test cases
- Reasoning tasks: \(R = 1\) if conclusion is logically valid
- Format tasks: \(R = 1\) if uses required tags (<think>, <answer>)
Why Simple Rewards Work:
- Objective: No human judgment needed
- Fast: Can be computed automatically
- Clear signal: Binary rewards give strong learning signals
- Scalable: Works for thousands of examples
Composite Rewards:
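When several criteria matter at once, the per-response reward can simply be a sum (or weighted sum) of components, as in the coding example below:
\[ r_i \;=\; r_i^{\text{accuracy}} + r_i^{\text{format}} + r_i^{\text{style}} \]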
Detailed LLM Example: Coding Assistant Evaluation
Scenario: Training an AI coding assistant with multi-component rewards
Programming Task: "Write a Python function to find the factorial of a number"
Group Responses with Reward Breakdown:
Response 1 (Excellent):
```python
def factorial(n):
    if n == 0: return 1
    return n * factorial(n-1)
```
• Accuracy: 1.0 (correct algorithm)
• Format: 1.0 (proper Python syntax)
• Style: 1.0 (handles edge case)
• Total Reward: 3.0
Response 2 (Good):
```python
def factorial(n):
    result = 1
    for i in range(1, n+1):
        result *= i
    return result
```
• Accuracy: 1.0 (correct iterative approach)
• Format: 1.0 (proper syntax)
• Style: 0.5 (handles n=0 only implicitly, no explicit edge-case check)
• Total Reward: 2.5
Response 3 (Syntax Error):
```python
def factorial(n)
    return n * factorial(n-1)
```
• Accuracy: 0.0 (missing colon, infinite recursion)
• Format: 0.0 (syntax error)
• Style: 0.0 (no base case)
• Total Reward: 0.0
Response 4 (Wrong Algorithm):
```python
def factorial(n):
    return n * n
```
• Accuracy: 0.0 (wrong algorithm)
• Format: 1.0 (valid syntax)
• Style: 0.0 (not factorial)
• Total Reward: 1.0
Advantage Calculation: Group-Based Normalization
The core of GRPO: converting raw rewards into normalized advantages by comparing each response against the group average.
Step 3: The GRPO Advantage Formula
The Heart of GRPO: Transform raw rewards into learning signals by normalizing within each group.
Group Statistics:
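\[ \mu_G \;=\; \frac{1}{G} \sum_{i=1}^{G} r_i, \qquad \sigma_G \;=\; \sqrt{ \frac{1}{G} \sum_{i=1}^{G} (r_i - \mu_G)^2 } \]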
GRPO Advantage:
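\[ A_i \;=\; \frac{r_i - \mu_G}{\sigma_G + \epsilon} \]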
What This Does:
- \(A_i > 0\): Response i is better than group average
- \(A_i = 0\): Response i is exactly average for this group
- \(A_i < 0\): Response i is worse than group average
- \(\epsilon\): Small constant (1e-8) to prevent division by zero
Why Normalization is Brilliant:
- Relative learning: "Be more like the good responses in this group"
- Automatic scaling: Works whether rewards are 0-1 or 0-100
- Variance handling: High variance groups get smaller updates
- Zero-sum property: \(\sum A_i = 0\) (advantages balance out)
Edge Cases:
- All rewards equal: \(\sigma_G = 0\), so \(A_i = 0\) (no learning)
- High variance: Large \(\sigma_G\) reduces advantage magnitude
- Binary rewards: Creates clear +/- advantages
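A minimal sketch of the normalization, using the binary-reward case mentioned above:
```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: A_i = (r_i - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(grpo_advantages([1, 1, 0, 1, 0, 0]))
# [ 1.  1. -1.  1. -1. -1.]  -- advantages within a group sum to zero
```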
Detailed LLM Example: Creative Writing Assessment
Scenario: Training an AI creative writer with GRPO advantage calculation
Writing Prompt: "Write a compelling opening sentence for a mystery novel"
Group Responses (G=4) with Quality Scores:
Response 1: "Detective Sarah Chen knew the moment she saw the pristine kitchen that someone had died here."
• Quality score: \(r_1 = 4.2\) (engaging, mysterious)
Response 2: "The murder weapon lay in plain sight, which was exactly what made it invisible."
• Quality score: \(r_2 = 4.8\) (clever paradox, intriguing)
Response 3: "There was a dead body in the room."
• Quality score: \(r_3 = 2.1\) (bland, uninspiring)
Response 4: "The clock struck midnight as the mystery began to unfold."
• Quality score: \(r_4 = 3.5\) (clichéd but functional)
GRPO Advantage Calculation:
• Group mean: \(\mu_G = (4.2 + 4.8 + 2.1 + 3.5) / 4 = 3.65\)
• Group std dev: \(\sigma_G \approx 1.01\)
• Advantages:
- \(A_1 = (4.2 - 3.65) / 1.01 \approx +0.54\) (above average)
- \(A_2 = (4.8 - 3.65) / 1.01 \approx +1.14\) (much above average)
- \(A_3 = (2.1 - 3.65) / 1.01 \approx -1.53\) (well below average)
- \(A_4 = (3.5 - 3.65) / 1.01 \approx -0.15\) (slightly below average)
Learning Signal: Model learns to strongly favor Response 2's style, moderately favor Response 1, avoid Response 3's blandness, and slightly discourage Response 4's clichés
GRPO Objective Function: Policy Updates
GRPO uses the computed advantages to update the policy, encouraging responses with positive advantages and discouraging those with negative advantages.
Step 4: The Complete GRPO Objective
GRPO Loss Function: Combines PPO-style clipping with group-based advantages
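In the notation of the breakdown below, the per-group objective is
\[ \mathcal{J}_{GRPO}(\theta) \;=\; \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\big( r_i(\theta)\, A_i,\;\; \text{clip}\big(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_i \big) \right] \;-\; \beta\, \mathbb{D}_{KL}\big( \pi_\theta \,\|\, \pi_{\text{ref}} \big) \]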
Component Breakdown:
- \(r_i(\theta) = \frac{\pi_\theta(o_i|x)}{\pi_{\theta_{old}}(o_i|x)}\): Importance sampling ratio
- \(A_i\): Group-normalized advantage (our key innovation)
- \(\text{clip}(\cdot, 1-\epsilon, 1+\epsilon)\): PPO clipping for stability
- \(\beta \mathbb{D}_{KL}\): KL penalty to stay close to reference policy
How Learning Works:
- Positive advantage (\(A_i > 0\)): Increase probability of generating \(o_i\)
- Negative advantage (\(A_i < 0\)): Decrease probability of generating \(o_i\)
- Clipping: Prevents too-large policy updates
- KL penalty: Maintains coherent language generation
Key Differences from PPO:
- No critic network: Advantages come from group comparison
- Batch-based learning: All group members updated together
- Relative rewards: Focus on "better than peers" not "absolutely good"
- Simpler pipeline: One model, one objective function
Gradient Intuition:
For each response \(o_i\):
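\[ \nabla_\theta \mathcal{J} \;\propto\; A_i\, \nabla_\theta \log \pi_\theta(o_i \mid x) \]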
This pushes the model toward generating responses similar to high-advantage examples and away from low-advantage ones.
Detailed LLM Example: Code Debugging Assistant
Scenario: Training an AI to debug Python code using GRPO
Bug Report: "My code gives 'IndexError: list index out of range'"
Buggy Code:
nums = [1,2,3]; print(nums[5])
Group Responses (G=4):
Response 1 (Excellent Fix + Explanation):
"The error occurs because you're trying to access index 5 in a list that only has indices 0-2. Fix: Check if index exists:
if 5 < len(nums): print(nums[5])
"• Reward: 3.0, Advantage: +1.2
Response 2 (Good Fix, Less Explanation):
"Index 5 doesn't exist. Use
print(nums[2])
for the last element."• Reward: 2.5, Advantage: +0.6
Response 3 (Wrong Solution):
"Try using
nums.append(0, 0)
to add more elements."• Reward: 0.5, Advantage: -1.4
Response 4 (Partial Understanding):
"There's an index error. Check your list length."
• Reward: 1.5, Advantage: -0.4
GRPO Update Calculation:
• Response 1: Large positive update (encourage detailed explanations)
• Response 2: Moderate positive update (encourage correct fixes)
• Response 3: Large negative update (discourage wrong solutions)
• Response 4: Small negative update (discourage vague answers)
Learning Outcome: Model learns to provide detailed, accurate debugging advice with clear explanations and correct code fixes
GRPO Training Loop: Complete Implementation
Understanding the full GRPO training loop from data collection through policy updates, including batch processing and efficiency optimizations.
Step 5: The Complete GRPO Training Algorithm
Training Loop Overview: GRPO follows a streamlined process that's simpler than PPO but highly effective for reasoning tasks.
Algorithm Structure:
- 1. Batch Collection: Sample B prompts from training set
- 2. Group Generation: For each prompt, generate G responses
- 3. Reward Evaluation: Score all B×G responses
- 4. Advantage Computation: Normalize rewards within each group
- 5. Policy Update: Single gradient step using all advantages
- 6. Repeat: Continue for desired number of iterations
Batch Processing Efficiency:
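In each iteration, all \(B \times G\) responses are generated and scored together as one batch (e.g., \(64 \times 6 = 384\) responses in the example below), so the group structure adds no extra forward passes beyond ordinary batched sampling.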
Key Implementation Details:
- Parallel generation: All G responses generated simultaneously
- Vectorized rewards: Batch compute all rewards at once
- Memory efficiency: Process groups independently
- Gradient accumulation: Sum gradients across all groups
Hyperparameter Sensitivity:
- Group size G: 4-8 optimal for most tasks
- Batch size B: 32-128 prompts per iteration
- Learning rate: 1e-6 to 5e-6 (similar to PPO)
- KL coefficient β: 0.01-0.05 for stability
Convergence Properties: GRPO typically converges 2-3x faster than PPO on reasoning tasks due to the direct group comparison providing stronger learning signals.
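A sketch of one such iteration (the helpers `policy.generate`, `reward_fn`, and `policy.update` are hypothetical placeholders for a real generation, scoring, and optimization stack):
```python
import numpy as np

def grpo_iteration(policy, prompts, reward_fn, group_size=6, eps=1e-8):
    """One GRPO step: sample groups, score them, normalize, update the policy."""
    batch = []
    for x in prompts:                                     # B prompts
        group = policy.generate(x, num_samples=group_size, temperature=1.2)
        rewards = np.array([reward_fn(x, o) for o in group])
        advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
        batch.extend(zip([x] * group_size, group, advantages))
    # single gradient step over all B*G responses, with clipping and KL penalty
    policy.update(batch)
```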
Detailed LLM Example: Algorithm Implementation
Scenario: Training a math solver with GRPO implementation details
Training Configuration:
• Batch size B = 64 prompts
• Group size G = 6 responses per prompt
• Total responses per iteration = 384
• Learning rate = 2e-6
• KL penalty β = 0.02
Single Training Iteration:
Step 1: Batch Collection (0.1s)
• Sample 64 math problems from training set
• Problems range from algebra to calculus
Step 2: Parallel Generation (2.3s)
• Generate 6 solutions per problem simultaneously
• Use temperature T=1.2 for diversity
• Total: 384 mathematical solutions
Step 3: Reward Computation (0.8s)
• Check final answers against ground truth
• Verify mathematical reasoning steps
• Assign binary rewards (1.0 or 0.0)
Step 4: Advantage Calculation (0.1s)
• Normalize rewards within each group of 6
• Example group: rewards [1,1,0,1,0,0] → advantages [+1.0, +1.0, -1.0, +1.0, -1.0, -1.0]
Step 5: Policy Update (1.2s)
• Compute gradients for all 384 responses
• Apply clipping and KL penalty
• Single optimizer step
Total Time: 4.5s per iteration (vs 8-12s for PPO)
Advanced GRPO Techniques: Curriculum and Adaptive Sampling
Sophisticated techniques to enhance GRPO performance, including curriculum learning, adaptive group sizes, and dynamic reward weighting.
Advanced Optimization Strategies
1. Curriculum Learning with GRPO:
Start with easier problems and gradually increase difficulty as the model improves.
2. Adaptive Group Size:
Adjust group size based on model performance - larger groups when model is uncertain, smaller when confident.
3. Temperature Scheduling:
Start with high temperature for exploration, gradually reduce for exploitation.
4. Multi-Objective Rewards:
Combine multiple reward signals for richer learning.
5. Dynamic KL Weighting:
Adjust KL penalty based on training progress.
6. Rejection Sampling Integration:
Use high-quality GRPO outputs to create SFT data for the next training phase.
Detailed LLM Example: Advanced Training Pipeline
Scenario: Training an advanced mathematical reasoning system
Phase 1: Curriculum Introduction (Weeks 1-2)
• Start with basic arithmetic: "What is 15 + 27?"
• Group size: G = 8 (high exploration)
• Temperature: T = 1.5 (very diverse responses)
• Success rate: 60% → 85%
Phase 2: Intermediate Problems (Weeks 3-4)
• Move to algebra: "Solve 2x² - 5x + 2 = 0"
• Group size: G = 6 (moderate exploration)
• Temperature: T = 1.2 (balanced diversity)
• Success rate: 40% → 75%
Phase 3: Advanced Reasoning (Weeks 5-8)
• Complex proofs: "Prove the fundamental theorem of calculus"
• Group size: G = 4 (focused generation)
• Temperature: T = 0.9 (more deterministic)
• Success rate: 20% → 70%
Multi-Objective Reward Example:
Problem: "Prove that √2 is irrational"
• Accuracy: 1.0 (proof is correct)
• Elegance: 0.8 (uses contradiction method elegantly)
• Efficiency: 0.9 (concise, no unnecessary steps)
• Explanation: 0.7 (clear but could be more beginner-friendly)
• Total reward: 0.4×1.0 + 0.3×0.8 + 0.2×0.9 + 0.1×0.7 = 0.89
Adaptive Learning: As model improves, automatically increase problem difficulty and reduce group size for efficiency
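The weighted combination in the multi-objective example can be expressed as a tiny helper; the 0.4/0.3/0.2/0.1 weights are just the illustrative ones from the example above.

```python
def multi_objective_reward(accuracy, elegance, efficiency, explanation,
                           weights=(0.4, 0.3, 0.2, 0.1)):
    """Combine several reward signals into one scalar (weights are illustrative)."""
    scores = (accuracy, elegance, efficiency, explanation)
    return sum(w * s for w, s in zip(weights, scores))

# Reproduces the worked example: 0.4*1.0 + 0.3*0.8 + 0.2*0.9 + 0.1*0.7 = 0.89
assert abs(multi_objective_reward(1.0, 0.8, 0.9, 0.7) - 0.89) < 1e-9
```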
GRPO vs PPO: Comprehensive Comparison
Detailed analysis of when and why GRPO outperforms traditional PPO, with empirical results and theoretical insights.
Comprehensive Performance Analysis
Computational Efficiency Comparison:
- PPO Memory: Policy (7B) + Critic (7B) = 14B parameters
- GRPO Memory: Policy (7B) only = 7B parameters
- Training Speed: GRPO 2-3x faster on reasoning tasks
- Convergence: GRPO requires 50-70% fewer iterations
Task-Specific Performance:
- Mathematical Reasoning: GRPO +15% accuracy vs PPO
- Code Generation: GRPO +12% pass rate vs PPO
- Logic Puzzles: GRPO +18% success rate vs PPO
- Creative Writing: PPO +8% human preference vs GRPO
Theoretical Advantages of GRPO: the group-relative advantage requires no learned value function, so there is no critic approximation error, and normalizing within each group keeps the learning signal well-scaled regardless of the absolute reward level.
When to Choose GRPO:
- ✅ Objective rewards: Clear right/wrong answers
- ✅ Resource constraints: Limited computational budget
- ✅ Reasoning tasks: Mathematical, logical, coding problems
- ✅ Fast iteration: Need quick experimental cycles
When to Choose PPO:
- ✅ Subjective rewards: Human preference-based tasks
- ✅ Complex environments: Multi-step reasoning with delayed rewards
- ✅ Conversational AI: Open-ended dialogue systems
- ✅ Safety-critical: When stability is more important than efficiency
Hybrid Approaches: Some systems use GRPO for initial reasoning training, then PPO for fine-tuning conversational abilities.
Detailed LLM Example: Empirical Comparison Study
Scenario: Head-to-head comparison on mathematical olympiad problems
Experimental Setup:
• Dataset: 10,000 competition math problems
• Base model: Llama-2 7B
• Compute budget: 8x A100 GPUs, up to 7 days of training per method
• Evaluation: 1,000 held-out problems
PPO Results:
• Training time: 168 hours (7 days)
• Memory usage: 28GB per GPU (policy + critic)
• Final accuracy: 67.3%
• Training cost: $2,400 (cloud compute)
• Convergence: 15,000 iterations
GRPO Results:
• Training time: 72 hours (3 days)
• Memory usage: 14GB per GPU (policy only)
• Final accuracy: 73.8% (+6.5% improvement)
• Training cost: $1,200 (50% savings)
• Convergence: 8,000 iterations
Key Insights:
• GRPO's group comparison naturally creates curriculum learning
• No critic bias - learns directly from task performance
• Better sample efficiency for objective tasks
• Simpler hyperparameter tuning
Failure Mode Analysis:
• GRPO struggles when all group responses are equally bad/good
• PPO's critic provides more stable learning for subjective tasks
• GRPO requires good reward function design
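A simple mitigation for the first failure mode is to drop groups whose rewards are identical before the update, since their normalized advantages would all be zero anyway. A minimal sketch, assuming rewards are already computed per group:

```python
import numpy as np

def filter_degenerate_groups(groups, eps=1e-8):
    """Keep only groups whose rewards actually differ; uniform groups carry no signal."""
    kept = []
    for responses, rewards in groups:
        rewards = np.asarray(rewards, dtype=float)
        if rewards.std() > eps:
            kept.append((responses, rewards))
    return kept

# Example: the second group is all-correct and contributes nothing, so it is dropped.
groups = [(["a", "b"], [1.0, 0.0]), (["c", "d"], [1.0, 1.0])]
assert len(filter_degenerate_groups(groups)) == 1
```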
DeepSeek-R1 Case Study: GRPO in Practice
Real-world application of GRPO in DeepSeek-R1, showing how the algorithm scales to state-of-the-art reasoning capabilities.
DeepSeek-R1's GRPO Implementation
Scale and Configuration:
- Base Model: DeepSeek-V3 (671B parameters)
- Training Data: Millions of reasoning problems
- Group Size: G = 16 for maximum diversity
- Batch Size: B = 512 prompts per iteration
- Total Responses: 8,192 per training step
Multi-Stage GRPO Pipeline:
- Stage 1: Pure GRPO on base model (DeepSeek-R1-Zero)
- Stage 2: GRPO after initial SFT
- Stage 3: GRPO with rejection sampling data
- Stage 4: Final GRPO with multi-objective rewards
Reward Function Evolution: rewards started as simple rule-based accuracy and format checks and grew into richer multi-objective signals over the later stages.
Key Innovations:
- Self-verification rewards: Bonus for checking own work
- Reasoning chain rewards: Points for clear step-by-step logic
- Format consistency: Proper use of <think> tags
- Adaptive difficulty: Harder problems as model improves
Emergent Behaviors:
- Chain-of-thought reasoning: Emerged naturally from group comparison
- Self-correction: Model learns to catch and fix its own errors
- Multiple approaches: Tries different solution methods
- Verification habits: Always checks final answers
Training Efficiency: DeepSeek-R1 achieved GPT-4 level reasoning with 3x less compute than traditional RLHF approaches.
Detailed LLM Example: DeepSeek-R1 Training Pipeline
Scenario: Training DeepSeek-R1 on complex mathematical reasoning
Problem Example: "Find all real solutions to the equation x⁴ - 5x² + 6 = 0"
Stage 1: Base Model GRPO (R1-Zero)
Group of 16 responses generated, including:
• Response 1: Correct substitution method → x = ±√2, ±√3
• Response 8: Attempted factoring but made algebraic error
• Response 12: Recognized as quadratic in x² but didn't complete
• Response 16: Complete gibberish due to poor initialization
Advantage calculation rewards systematic approaches
Stage 2: Post-SFT GRPO
Now with better baseline, group responses include:
• Multiple correct solution methods (substitution, factoring)
• Proper mathematical notation and formatting
• Verification steps checking solutions in original equation
• Clear explanation of solution process
Stage 3: Rejection Sampling Integration
• Keep only responses with reward ≥ 0.8
• Use these for next SFT dataset
• Creates virtuous cycle of improvement
Final Performance Metrics:
• Mathematical accuracy: 94.2% on competition problems
• Reasoning clarity: 89.7% human preference score
• Self-verification rate: 96.8% (almost always checks work)
• Format compliance: 99.1% (proper tag usage)
Breakthrough Achievement: competition-level mathematical reasoning that emerged primarily from RL training rather than large-scale supervised reasoning data
GRPO Implementation: Practical Considerations
Technical details for implementing GRPO in practice, including code structure, debugging tips, and common pitfalls to avoid.
Implementation Best Practices
Code Structure Overview:
- Group Generator: Parallel sampling with temperature control
- Reward Computer: Vectorized evaluation of all responses
- Advantage Calculator: Group-wise normalization
- Policy Updater: Clipped objective with KL penalty
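One possible way to organize these four components in Python is sketched below; class and method names (e.g. `model.sample`) are illustrative assumptions, not a reference implementation.

```python
import statistics

class GroupGenerator:
    """Samples G responses per prompt, with temperature control."""
    def __init__(self, model, group_size=8, temperature=1.2):
        self.model, self.group_size, self.temperature = model, group_size, temperature

    def __call__(self, prompts):
        # `model.sample` is a hypothetical generation API returning a list of strings.
        return [self.model.sample(p, n=self.group_size, temperature=self.temperature)
                for p in prompts]

class RewardComputer:
    """Evaluates all B*G responses with an external reward function."""
    def __init__(self, reward_fn):
        self.reward_fn = reward_fn

    def __call__(self, prompts, groups):
        return [[self.reward_fn(p, r) for r in group] for p, group in zip(prompts, groups)]

class AdvantageCalculator:
    """Group-wise normalization: A_i = (r_i - mean) / (std + eps)."""
    def __call__(self, group_rewards, eps=1e-8):
        out = []
        for rewards in group_rewards:
            mu = statistics.fmean(rewards)
            sigma = statistics.pstdev(rewards)
            out.append([(r - mu) / (sigma + eps) for r in rewards])
        return out

class PolicyUpdater:
    """Clipped surrogate with KL penalty; the gradient step itself is framework-specific."""
    def __init__(self, optimizer, clip_eps=0.2, beta=0.02):
        self.optimizer, self.clip_eps, self.beta = optimizer, clip_eps, beta

    def step(self, log_ratios, advantages, kl_to_ref):
        raise NotImplementedError("see the objective sketch in the Mathematical Deep Dive")
```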
Memory Management:
Peak activation memory scales roughly with B × G × L total generated tokens, where L is the sequence length. Use gradient checkpointing and mixed precision to reduce memory usage.
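For a PyTorch policy loaded through Hugging Face Transformers, gradient checkpointing and fp16 mixed precision can be enabled roughly as follows; the checkpoint name is a placeholder, and the exact flags should be checked against the library versions you use.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("my-policy-checkpoint")  # placeholder name
model.gradient_checkpointing_enable()        # trade extra compute for activation memory

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-6)
scaler = torch.cuda.amp.GradScaler()         # loss scaling for fp16 training

def train_step(batch_loss_fn, batch):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):   # mixed-precision forward pass
        loss = batch_loss_fn(model, batch)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                           # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()
```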
Common Implementation Pitfalls:
- Division by zero: Add ε = 1e-8 to standard deviation
- Gradient explosion: Clip gradients to max norm 1.0
- Memory leaks: Clear intermediate tensors after each group
- Numerical instability: Use log-space computations for probabilities
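The log-space point is worth showing explicitly: sum per-token log-probabilities rather than multiplying probabilities, and guard the group normalization with ε. A PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(logits, token_ids):
    """Sum of per-token log-probabilities, computed in log-space for stability.

    logits:    [seq_len, vocab_size] model outputs for one response
    token_ids: [seq_len] long tensor of the tokens actually generated
    """
    log_probs = F.log_softmax(logits, dim=-1)                        # never softmax().log()
    token_log_probs = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum()

def normalized_advantages(rewards, eps=1e-8):
    """Group-wise z-scores with an epsilon guard against zero variance."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    return (rewards - rewards.mean()) / (rewards.std(unbiased=False) + eps)
```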
Debugging Strategies:
- Monitor advantage distribution: Should be roughly normal with mean ≈ 0
- Track reward variance: Low variance indicates need for harder problems
- Check clipping frequency: 10-30% clipping is healthy
- Validate group independence: Ensure no data leakage between groups
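These health checks are cheap to log every iteration. The sketch below computes them from quantities a GRPO step already has; the thresholds in the comments are the rules of thumb from the list above, not hard requirements.

```python
import torch

def training_health_stats(advantages, ratios, approx_kl, clip_eps=0.2):
    """Summary statistics worth logging every GRPO iteration."""
    advantages = torch.as_tensor(advantages, dtype=torch.float32)
    ratios = torch.as_tensor(ratios, dtype=torch.float32)
    clip_frac = ((ratios - 1.0).abs() > clip_eps).float().mean().item()
    return {
        "adv_mean": advantages.mean().item(),   # should stay near 0
        "adv_std": advantages.std().item(),     # low values suggest problems are too easy
        "clip_fraction": clip_frac,             # roughly 0.1-0.3 is healthy
        "approx_kl": approx_kl,                 # keep below ~0.1 for stability
    }
```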
Hyperparameter Tuning Guide:
- Start conservative: Small learning rate, moderate group size
- Increase gradually: Scale up as training stabilizes
- Monitor KL divergence: Keep below 0.1 for stability
- Adjust temperature: Higher for exploration, lower for exploitation
Detailed LLM Example: Production Implementation
Scenario: Implementing GRPO for a production coding assistant
System Architecture:
• Model: CodeLlama 13B
• Infrastructure: 4x A100 GPUs
• Group size: G = 8
• Batch size: B = 32
• Sequence length: 2048 tokens
Memory Optimization:
• Gradient checkpointing: 40% memory reduction
• Mixed precision (fp16): 50% memory reduction
• Sequence packing: 20% efficiency improvement
• Total memory usage: 18GB per GPU
Performance Monitoring:
• Training throughput: 1.2 iterations/minute
• Code correctness: Tracked via automated testing
• Advantage variance: Monitored for curriculum adjustment
• KL divergence: Kept below 0.05 for stability
Common Issues Encountered:
1. Low reward variance: Problems too easy → increased difficulty
2. High KL divergence: Learning rate too high → reduced to 1e-6
3. Memory overflow: Batch size too large → reduced to 16
4. Poor code quality: Added style and efficiency rewards
Production Results:
• Code correctness: 87% → 94% after GRPO training
• User satisfaction: 7.2/10 → 8.6/10
• Training time: 2 weeks vs 6 weeks for PPO
• Cost savings: 65% reduction in compute costs
Mathematical Deep Dive
For each prompt, GRPO samples a group of \(G\) outputs. The reward for each output \(o_i\) is \(r_i\). The key innovation is calculating the Advantage \(A_i\) by normalizing the rewards within the group:
$$ A_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\}) + \epsilon} $$
Detailed LLM Example: Math Reasoning Assistant
Scenario: Training an AI math tutor using GRPO to solve algebra problems
Student Question: "Solve for x: 2x + 3 = 11"
GRPO Group Generation: The model generates 3 different reasoning approaches simultaneously:
Response 1 (Detailed Step-by-Step):
"Let me solve this equation step by step. First, I'll subtract 3 from both sides: 2x + 3 - 3 = 11 - 3, which gives me 2x = 8. Then I'll divide both sides by 2: 2x ÷ 2 = 8 ÷ 2, so x = 4. Let me verify: 2(4) + 3 = 8 + 3 = 11 ✓"
• Clear methodology, shows verification
• Reward: \(r_1 = 1.0\) (Correct answer + excellent explanation)
Response 2 (Concise but Correct):
"I need to isolate x. 2x + 3 = 11, so 2x = 11 - 3 = 8, therefore x = 8 ÷ 2 = 4."
• Efficient, correct steps
• Reward: \(r_2 = 1.0\) (Correct answer + good reasoning)
Response 3 (Common Student Error):
"To solve 2x + 3 = 11, I'll add 3 to both sides: 2x = 11 + 3 = 14, so x = 14 ÷ 2 = 7."
• Shows common algebraic mistake (adding instead of subtracting)
• Reward: \(r_3 = 0.0\) (Incorrect answer due to sign error)
GRPO Advantage Calculation:
- Group Statistics:
- Mean reward: \((1.0 + 1.0 + 0.0) ÷ 3 = 0.67\)
- Standard deviation: \(\sqrt{\frac{(1.0-0.67)^2 + (1.0-0.67)^2 + (0.0-0.67)^2}{3}} = 0.47\)
- Normalized Advantages:
- \(A_1 = (1.0 - 0.67) ÷ 0.47 = +0.70\) (Response 1: above average)
- \(A_2 = (1.0 - 0.67) ÷ 0.47 = +0.70\) (Response 2: above average)
- \(A_3 = (0.0 - 0.67) ÷ 0.47 = -1.43\) (Response 3: well below average)
Learning Outcome:
• Model strongly reinforces correct algebraic procedures (both detailed and concise)
• Model learns to avoid the common sign error in equation solving
• Future responses will favor systematic approaches and verification steps
• The group comparison helps the model understand that multiple correct approaches exist
Key GRPO Insight: By comparing responses within the same group, the model learns relative quality without needing a separate critic model. The normalization keeps the signal well-scaled whenever quality varies within a group; if every response receives the same reward, the advantages all become zero and that group contributes no update (the low-variance failure mode noted earlier).
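The numbers in this worked example can be reproduced in a few lines (using the population standard deviation, as in the text):

```python
import math

rewards = [1.0, 1.0, 0.0]
mu = sum(rewards) / len(rewards)                                        # 0.67
sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))   # 0.47
advantages = [(r - mu) / sigma for r in rewards]
print([round(a, 2) for a in advantages])
# [0.71, 0.71, -1.41]; the text's +0.70 / -1.43 come from rounding the mean and std first.
```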
Summary Comparison Table
Aspect | PPO (Proximal Policy Optimization) | DPO (Direct Preference Optimization) | GRPO (Group Relative Policy Optimization) |
---|---|---|---|
Core Idea | Maximize a reward signal from a learned model while staying close to the original policy. | Directly optimize the policy on preference pairs, bypassing an explicit reward model. | Normalize rewards within a group of responses to create a learning signal without a critic model. |
Objective Function | $$ L^{PPO} = \mathbb{E}[\min(r_t \hat{A}_t, \text{clip}(...) \hat{A}_t) - c_1 L^{VF} + c_2 S] $$ | $$ L^{DPO} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right] $$ | $$ \mathcal{J}^{GRPO} = \mathbb{E} \left[ \frac{1}{G}\sum_{i=1}^G \left( \min(r_i A_i, \text{clip}(...)A_i) - \beta \mathbb{D}_{KL} \right) \right] $$ |
Reward / Preference Model | Requires an explicit, separately trained reward model \(r_\phi(x,y)\) that predicts human scores. | Implicitly models preferences using the Bradley-Terry model. No separate reward model is trained. | Uses any external reward function \(R(x,y)\), which is often simple and rule-based (e.g., correct/incorrect). |
Advantage Calculation | GAE: \( \hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l} \), where \(\delta_t\) is the TD error from the critic. | Implicit reward diff: \( \hat{r}_w - \hat{r}_l = \beta \log \frac{\pi(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \dots \) | Group-wise normalization: \( A_i = \frac{r_i - \mu_G}{\sigma_G + \epsilon} \). Critic-free. |
Key Innovation | Clipped surrogate objective provides stability of TRPO with first-order optimization. | Analytically maps the RLHF objective to a simple classification loss on preference data. | Replaces the expensive critic with cheap, in-batch normalization of rewards. |
Data Requirement | Prompts and responses for RL sampling. Reward model requires preference pairs \((x, y_w, y_l)\). | Requires a dataset of preference triplets: \((x, y_w, y_l)\). | Requires prompts and a reward function. No human preference data is strictly necessary. |
Computational Cost | Very High: Trains policy + critic + reward models. Requires expensive online sampling. | Low: Simple offline training on a fixed dataset. No sampling loop. | Medium: Cheaper than PPO (no critic), but requires online sampling which is costly. |
Hyperparameter Tuning | Difficult: Very sensitive to learning rates, clipping epsilon, GAE params, and KL coefficient. | Easy: Primarily sensitive to \(\beta\) and learning rate. Very stable. | Medium: Sensitive to learning rate, group size G, and KL coefficient. More stable than PPO. |
Common Failure Modes | Reward Hacking: Finds loopholes in the reward model. Instability: Policy can collapse or generate gibberish. | Overfitting: Can overfit to the specific preferences in the dataset. Stale Reference: Performance depends on a good \(\pi_{ref}\). | Low Variance Collapse: If all responses in a group are identical, the learning signal becomes zero. |
Scalability | Scales well but is very resource-intensive. Used for large models like ChatGPT. | Scales very well; simple loss is efficient for large datasets and models. | Extremely scalable for reasoning tasks, as shown by DeepSeek-R1. Efficiency is a key advantage. |
Best Use Case | General-purpose alignment where rewards are complex and subjective (e.g., conversational quality). | Stable and efficient fine-tuning when high-quality preference pairs are available. | Improving reasoning on tasks with clear, objective rewards (e.g., math, coding, logic). |
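For completeness alongside the table, the DPO objective from the middle column can be written as a short PyTorch loss. It assumes the summed log-probabilities of each chosen and rejected response under the policy and the frozen reference model have already been computed.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities log pi(y|x) for the
    chosen (w) or rejected (l) response under the policy or reference model.
    """
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()
```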
4. DeepSeek-R1: A Case Study in Advanced Reasoning
The DeepSeek-R1 paper showcases a masterclass in applying these alignment techniques to achieve state-of-the-art reasoning capabilities. Their approach involves multiple, carefully orchestrated stages that build on each other, moving from pure, unguided reinforcement learning to highly refined, multi-faceted alignment.
The Full DeepSeek-R1 Pipeline
- Stage 1: DeepSeek-R1-Zero (powerful but messy)
- Stage 2: Cold Start SFT
- Stage 3 & 4: Iterative RL and SFT
Stage 1: DeepSeek-R1-Zero - Pure Reinforcement Learning
The first experiment, `DS-R1-Zero`, tested a bold hypothesis: can an LLM develop reasoning abilities through pure RL without any initial supervised fine-tuning? The answer was a resounding yes.
- Method: They applied GRPO directly to the base DeepSeek-V3 model.
- Rewards: The reward functions were simple and rule-based:
- Accuracy Reward: Was the final answer, extracted from a specific format (e.g., a box), mathematically or logically correct? (Reward = 1.0 or 0.0).
- Format Reward: Did the model correctly use the `<think>` and `</think>` tags to wrap its reasoning process? (A Python sketch of such rule-based rewards follows this list.)
- Outcome: `DS-R1-Zero` achieved remarkable reasoning scores, proving that complex behaviors like self-verification and generating long chains of thought can emerge autonomously from a simple RL setup. However, its outputs were often messy and hard for humans to read.
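A sketch of what such rule-based rewards could look like in Python; the boxed-answer convention and the single-`<think>`-block check are assumptions for illustration, not the exact rules used by DeepSeek.

```python
import re

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the boxed final answer matches the ground truth, else 0.0 (format assumed)."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if match and match.group(1).strip() == ground_truth.strip() else 0.0

def format_reward(response: str) -> float:
    """1.0 if the reasoning is wrapped in a <think>...</think> block, else 0.0."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0
```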
Mathematical Deep Dive: GRPO
The core of GRPO is its critic-free objective function. For each prompt \(q\), a group of outputs \(\{o_1, ..., o_G\}\) are sampled. The advantage \(A_i\) for each output is calculated by normalizing its reward against the group's statistics.
The full GRPO objective function \(\mathcal{J}_{GRPO}\) then maximizes the expected advantage, clipped to prevent unstable updates, while regularized by a KL-divergence term:
$$ \mathcal{J}_{GRPO}(\theta) = \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G} \left( \min\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i,\ \text{clip}\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\ 1-\epsilon,\ 1+\epsilon \right) A_i \right) - \beta\, \mathbb{D}_{KL}\left( \pi_\theta \,\|\, \pi_{ref} \right) \right) \right] $$
- \(\pi_\theta\) is the policy being trained.
- \(\pi_{\theta_{old}}\) is the fixed policy used for sampling the outputs.
- The ratio \(\frac{\pi_\theta}{\pi_{\theta_{old}}}\) is the importance sampling weight.
- \(\text{clip}(...)\) is the standard PPO clipping function.
- \(\mathbb{D}_{KL}\) is a KL-divergence penalty keeping \(\pi_\theta\) close to a reference policy.
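A compact PyTorch sketch of this objective, written as a loss to minimize for a single group; it assumes per-response log-probabilities and a scalar KL estimate are already available.

```python
import torch

def grpo_loss(logp_new, logp_old, advantages, kl_to_ref, clip_eps=0.2, beta=0.02):
    """Negative GRPO objective for one group of G responses.

    logp_new:   log pi_theta(o_i | q) for each response (requires grad)
    logp_old:   log pi_theta_old(o_i | q) from the sampling policy (detached)
    advantages: group-normalized advantages A_i
    kl_to_ref:  scalar estimate of KL(pi_theta || pi_ref)
    """
    advantages = torch.as_tensor(advantages, dtype=logp_new.dtype)
    ratio = torch.exp(logp_new - logp_old)                  # importance sampling weight
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    return -(surrogate.mean() - beta * kl_to_ref)           # minimize the negative objective
```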
Stage 2: The Multi-Stage Pipeline for DeepSeek-R1
To address the shortcomings of `DS-R1-Zero` and push performance even further, a more structured, multi-stage pipeline was introduced for `DS-R1`.
A. Cold Start SFT
Instead of starting with pure RL, this process begins by giving the model a "cold start"—fine-tuning it on a small, high-quality dataset of ideal reasoning examples. This initial SFT helps the model learn a clean, readable, and effective reasoning format from the outset.
Detailed LLM Example: Cold Start Data Creation for Reasoning
Scenario: Creating high-quality reasoning examples for DeepSeek-R1's initial SFT phase
Target Skill: Mathematical reasoning with clear thought processes
Sample Problem: "What is 15% of 60?"
Few-Shot Prompting Strategy: Show the model examples of ideal reasoning format
Example 1:
Q: What is 20% of 50?
A: <think>I need to calculate 20% of 50. First, I'll convert the percentage to a decimal: 20% = 0.20. Now I multiply: 0.20 × 50 = 10. Let me double-check: 20% means 20 out of 100, so 20/100 × 50 = 1000/100 = 10. ✓</think><answer>10</answer>
Example 2:
Q: What is 25% of 80?
A: <think>25% of 80. I can think of this as 1/4 of 80 since 25% = 25/100 = 1/4. So 80 ÷ 4 = 20. Alternatively, 0.25 × 80 = 20. Both methods give the same answer.</think><answer>20</answer>
Now the target problem:
Q: What is 15% of 60?
A:
Generated High-Quality Response:
<think>I need to find 15% of 60. Converting the percentage to a decimal: 15% = 0.15, and 0.15 × 60 = 9. As a check, 10% of 60 is 6 and 5% of 60 is 3, so 15% = 6 + 3 = 9. Both methods agree. ✓</think><answer>9</answer>
Quality Features of This Example:
- Clear reasoning process: Shows step-by-step calculation
- Multiple methods: Demonstrates both decimal and fraction approaches
- Verification: Checks the answer using an alternative method
- Proper formatting: Uses <think> tags for reasoning, <answer> for final result
- Educational value: Explains the mathematical concepts clearly
SFT Training Impact:
This example teaches the model to:
• Always show detailed reasoning in <think> tags
• Use multiple solution approaches when possible
• Verify answers through alternative methods
• Structure responses clearly with proper formatting
• Explain mathematical concepts in an educational manner
Scaling Up: Thousands of such high-quality examples across various mathematical topics (algebra, geometry, calculus, word problems) form the foundation for DeepSeek-R1's reasoning capabilities.
B. Iterative RL and SFT
This is the core of the `DS-R1` training, involving several cycles of RL and data curation.
- Reasoning-Oriented RL: The SFT model is trained using GRPO, similar to `DS-R1-Zero`, but now with an additional reward for language consistency to ensure readable outputs.
- Rejection Sampling & SFT Data Curation: After the first RL stage, the now-powerful model is used to generate a large amount of reasoning data. This is where rejection sampling comes in.
Rejection Sampling: For each prompt, generate many possible responses. Keep only the ones that are verified to be correct. This creates a new, high-quality dataset for the next stage.
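A minimal sketch of this filtering loop, assuming hypothetical `generate` and `verify` functions (a sampler and a correctness checker such as an answer matcher):

```python
def build_sft_dataset(prompts, generate, verify, samples_per_prompt=16):
    """Rejection sampling: keep only responses that pass verification.

    `generate(prompt, n)` and `verify(prompt, response)` are hypothetical stand-ins
    for the model's sampling code and a correctness checker.
    """
    dataset = []
    for prompt in prompts:
        for response in generate(prompt, n=samples_per_prompt):
            if verify(prompt, response):                 # keep only verified-correct outputs
                dataset.append({"prompt": prompt, "response": response})
    return dataset
```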
Detailed LLM Example: Rejection Sampling for Logic Puzzles
Scenario: Creating high-quality reasoning data through rejection sampling
Classic Logic Problem: "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?"
GRPO Group Generation (4 responses):
Response 1 (Common Intuitive Error):
• Reward: 0 (Incorrect - falls into the common intuitive trap)
Response 2 (Correct Algebraic Solution):
• Reward: 1 (Correct with proper algebraic reasoning and verification)
Response 3 (Correct Answer, Format Issue):
• Reward: 0 (Correct reasoning but format inconsistency - "5 cents" vs "$0.05")
Response 4 (Detailed Step-by-Step):
• Reward: 1 (Excellent detailed reasoning with both individual and total answers)
GRPO Advantage Calculation:
- Rewards: [0, 1, 0, 1]
- Mean: 0.5, Standard deviation: 0.5
- Advantages: [-1.0, +1.0, -1.0, +1.0]
Rejection Sampling Result:
Only responses 2 and 4 are kept for the next SFT dataset because they received reward = 1. This creates a high-quality dataset containing only successful reasoning patterns.
Learning Benefits:
- Eliminates common errors: Model learns to avoid the intuitive $0.10 trap
- Reinforces systematic approach: Algebraic setup → solve → verify
- Maintains format consistency: Standardizes on dollar notation
- Encourages verification: Always check the final answer
Scaling Impact: This process, repeated across thousands of logic puzzles, math problems, and reasoning tasks, creates a curated dataset of only the highest-quality reasoning chains for subsequent SFT training.
This new dataset, now containing hundreds of thousands of high-quality reasoning samples, is combined with general-purpose data (for conversation, writing, etc.) and safety data. A second SFT is performed on the base model with this enriched dataset.
C. Distillation
The final step mentioned in the paper is distillation. The massive, powerful `DeepSeek-R1` model is used as a "teacher" to train smaller, more efficient "student" models. The student models (e.g., a 7B parameter model) are trained via SFT to mimic the high-quality reasoning outputs of the teacher. This makes state-of-the-art reasoning accessible without requiring enormous computational resources, a key contribution to the community.