Training Overview

🧠

DeepSeek R1 Training Philosophy

DeepSeek R1 represents a revolutionary approach to training reasoning-capable language models. Rather than training from scratch, the methodology builds upon existing foundation models through sophisticated reinforcement learning techniques.

Key Innovation

The entire training process leverages different reinforcement learning strategies applied to their base model (DeepSeek V3), creating a reasoning specialist through iterative improvement rather than ground-up training.

Training Pipeline Architecture

Complete Training Flow

🏗️ Foundation Phase

Starting Point: Pre-trained Base Model

  • DeepSeek V3 (Original) / Qwen 2.5-0.5B (Our Implementation)
  • General language understanding capabilities
  • Basic reasoning but inconsistent structure
  • No specialized reasoning training
📊 Model Specifications
Model: Qwen/Qwen2.5-0.5B-Instruct
Parameters: ~494M
Vocabulary: 151,665 tokens
Max Length: 131,072 tokens
Architecture: Transformer-based

Why this approach? Starting with a capable foundation allows focusing on reasoning enhancement rather than basic language learning.
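
To sanity-check these specifications locally, here is a small sketch using the Transformers config and tokenizer APIs (the exact printed values depend on the model revision you download):

# Quick check of the base model's key specifications
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

print(f"Tokenizer size: {len(tokenizer):,}")            # vocabulary size (incl. special tokens)
print(f"Max length: {tokenizer.model_max_length:,}")    # maximum sequence length
print(f"Hidden size: {config.hidden_size}, layers: {config.num_hidden_layers}")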

⚡ R1 Zero: Pure RL Experiment

Objective: Test if reasoning emerges naturally through RL

  • GRPO (Group Relative Policy Optimization)
  • Multiple reward functions for evaluation
  • Structured output with <think> and <answer> tags
  • No supervised examples, pure exploration
🎯 Results & Challenges

Successes:

  • Strong performance on reasoning benchmarks (AIME 2024)
  • Comparable to OpenAI-o1-0912 on some tasks
  • Demonstrated RL potential for reasoning

Problems:

  • Messy, hard-to-follow reasoning in <think> tags
  • Language mixing in multilingual contexts
  • Inconsistent reasoning structure
❄️ Cold Start Data Generation

Purpose: Create high-quality reasoning examples

Methods:

  • Few-shot Prompting: Show examples of good reasoning
  • Direct Prompting: Explicitly request step-by-step solutions
  • Post-processing: Human refinement of R1 Zero outputs
📝 Data Quality Examples
Before (R1 Zero):
<think> ummm... multiply 3 and 4... get 12... then add 2...</think>
<answer> 14 </answer>

After (Refined):
<think>
To solve 2 + 3 × 4, I need to follow order of operations.
Step 1: Multiply 3 × 4 = 12
Step 2: Add 2 + 12 = 14
</think>
<answer> 14 </answer>
📚 Supervised Fine-Tuning Stage 1

Goal: Teach structured reasoning patterns

Process:

  • Train on cold start data using cross-entropy loss
  • Learn to format reasoning clearly
  • Establish consistent language usage
  • Improve reasoning step organization
🔧 Training Configuration
Learning Rate: 2e-5
Batch Size: 8 per device
Gradient Accumulation: 2 steps
Max Sequence Length: 4096
Data Packing: Enabled
Optimizer: AdamW with warmup

Outcome: Model with improved reasoning structure but still needs refinement for consistency and quality.

🎯 Reasoning-Oriented Reinforcement Learning

Enhanced Objectives:

  • Language consistency rewards
  • Reasoning quality assessment
  • Improved accuracy evaluation
  • Structured output enforcement
🏆 Reward System Enhancement

New Reward Components:

  • Language Consistency: Same language for question, reasoning, and answer
  • Reasoning Depth: Encourage detailed step-by-step explanations
  • Accuracy Plus: Correct answers with clear justification

This stage fixes the language mixing issues from R1 Zero while maintaining reasoning capabilities.

🎓 Final Training Stages

Rejection Sampling:

  • Generate multiple reasoning examples
  • Filter for highest quality using evaluation metrics
  • Keep only the best examples for further training

SFT Stage 2:

  • Train on filtered high-quality data
  • Add helpfulness and harmlessness objectives
  • Balance reasoning with general AI assistant capabilities
🚀 Final Model Capabilities

DeepSeek R1 Achievements:

  • Clear, structured reasoning in <think> tags
  • Consistent language usage
  • High accuracy on mathematical reasoning
  • Helpful and safe AI assistant behavior
  • Suitable for real-world deployment

Distillation: Knowledge transfer to smaller, more efficient models for wider accessibility.

Environment Setup

⚙️

Development Environment

Repository Structure

train-deepseek-r1/
├── code.ipynb         # Complete implementation notebook
├── requirements.txt   # Python dependencies
└── r1_for_dummies.md  # Beginner-friendly explanation

Installation Commands

# Clone the repository
git clone https://github.com/FareedKhan-dev/train-deepseek-r1.git
cd train-deepseek-r1

# Install dependencies
pip install -r requirements.txt

Required Dependencies

Install the essential libraries for DeepSeek R1 training:

# Install math verification library
pip install math-verify

# Install LaTeX to SymPy converter
pip install latex2sympy2-extended

# Install TRL (Transformers Reinforcement Learning)
pip install trl

Import Essential Libraries

# Import necessary libraries
import logging
import os
import sys
import re
import math
from dataclasses import dataclass, field
from typing import List, Optional

# Import PyTorch and Hugging Face Transformers
import torch
import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    HfArgumentParser,
    TrainingArguments,
    set_seed,
    TrainerCallback,
    TrainerControl,
    TrainerState,
)
from transformers.trainer_utils import get_last_checkpoint

# Import dataset utilities
import datasets
from datasets import load_dataset

# Import libraries from TRL (Transformers Reinforcement Learning)
from trl import (
    AutoModelForCausalLMWithValueHead,
    PPOConfig,
    PPOTrainer,
    GRPOTrainer,
    GRPOConfig,
    SFTTrainer
)

# Import math-related utilities
from latex2sympy2_extended import NormalizationConfig
from math_verify import LatexExtractionConfig, parse, verify

📊

Training Datasets

Primary Datasets

Dataset             | Purpose                  | Size         | Content Type
NuminaMath-TIR      | R1 Zero Training         | 70K problems | Mathematical reasoning with CoT
Bespoke-Stratos-17k | R1 Training (Cold Start) | 17K problems | Math and coding challenges

Dataset Loading Example

# Load NuminaMath-TIR for R1 Zero training
math_dataset = load_dataset("AI-MO/NuminaMath-TIR", "default")
print(f"Training samples: {len(math_dataset['train'])}")
print(f"Test samples: {len(math_dataset['test'])}")

# Sample structure
sample = math_dataset['train'][0]
print("Fields:", list(sample.keys()))
# Output: ['problem', 'solution', 'messages']
🤖

Base Model Selection

Model Choice Rationale

While DeepSeek started from its 671B-parameter DeepSeek-V3 model (roughly 685 GB of weights), we use the far more accessible Qwen 2.5-0.5B-Instruct (about 0.9 GB) for demonstration purposes. The methodology remains the same regardless of model size.

Model Initialization

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    padding_side="right"
)

# Set padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

print(f"Model parameters: {model.num_parameters():,}")
# Output: Model parameters: 494,032,768

R1 Zero: Pure Reinforcement Learning

GRPO Algorithm Foundation

R1 Zero is the initial experiment in reasoning emergence through pure reinforcement learning. Unlike traditional RL approaches that require a separate critic model, GRPO (Group Relative Policy Optimization) eliminates that computational overhead by deriving its baseline directly from a group of sampled responses.

Reinforcement Learning Framework

RL Components in Language Model Training

  • Agent: The base language model (Qwen 2.5-0.5B)
  • Environment: Mathematical reasoning tasks
  • Action: Generated reasoning and answer sequences
  • Reward: Multi-faceted evaluation of response quality
  • Policy: Model's strategy for generating responses

GRPO Innovation

Traditional RL doubles computational cost with separate actor-critic architectures. GRPO eliminates the critic by computing advantage estimates directly from group rewards, making training more efficient while maintaining learning effectiveness.
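
The core idea can be sketched in a few lines: for each prompt, the policy samples a group of responses, each response is scored by the reward functions, and a response's advantage is its reward normalized against the group mean and standard deviation. The snippet below is a simplified illustration of that group-relative baseline, not the full GRPO objective:

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Normalize each response's reward against the statistics of its group.

    rewards: tensor of shape (group_size,) with the total reward of every
    response sampled for the same prompt.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled responses for one prompt, already scored by the reward functions
rewards = torch.tensor([1.8, 0.2, 1.1, 0.4])
print(group_relative_advantages(rewards))  # positive = better than the group average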

Prompt Template Design

Structured Reasoning Format

SYSTEM_PROMPT = """
A conversation between User and Assistant. The user asks a question, 
and the Assistant solves it. The assistant first thinks about the 
reasoning process in the mind and then provides the user with the answer. 
The reasoning process and answer are enclosed within <think> </think> 
and <answer> </answer> tags, respectively.

Format: <think> reasoning process here </think><answer> answer here </answer>
"""

This template establishes clear boundaries between internal reasoning and final answers, enabling targeted evaluation and reward assignment for different aspects of the response.
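
For illustration, here is how a single training prompt looks once the system prompt is combined with a question through the tokenizer's chat template (a small sketch that reuses the tokenizer loaded earlier):

# Assemble one prompt under the reasoning template
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is 2 + 3 * 4?"},
]

prompt_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt_text)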

Multi-Dimensional Reward System

Five-Component Reward Architecture

🎯 Accuracy Reward

Mathematical Foundation:

$$R_{accuracy} = \begin{cases} 1.0 & \text{if } verify(answer_{parsed}, solution_{parsed}) = True \\ 0.0 & \text{if } verify(answer_{parsed}, solution_{parsed}) = False \\ 0.5 & \text{if parsing fails} \end{cases}$$

Implementation Process:

  1. Parse ground truth solution using latex2sympy2
  2. Extract and normalize model's answer
  3. Use math_verify for semantic equivalence checking
  4. Assign binary reward based on mathematical correctness
🧮 Mathematical Verification Example
Problem: "What is 2 + 3 × 4?"
Ground Truth: "14"

Model Response: "<think>Following order of operations...</think><answer>14</answer>"

Verification Process:
1. Parse ground truth: 14 → symbolic representation
2. Extract model answer: "14" → symbolic representation  
3. Mathematical equivalence: 14 ≡ 14 → True
4. Reward: 1.0

Alternative Model Response: "<answer>20</answer>"
1. Parse: 20 → symbolic representation
2. Mathematical equivalence: 20 ≡ 14 → False  
3. Reward: 0.0

Why This Matters: Pure mathematical correctness ensures the model learns actual problem-solving rather than pattern matching on text similarity.

📋 Format Reward

Regex Pattern Matching:

$$R_{format} = \begin{cases} 1.0 & \text{if the response matches the format pattern} \\ 0.0 & \text{otherwise} \end{cases}$$

Format pattern (regex):

^<think>.*?</think>\s*<answer>.*?</answer>$

Pattern Requirements:

  • Must start with <think> tag
  • Reasoning content within think tags
  • Must end with <answer> tag
  • Final answer within answer tags
  • No additional content outside structure
✅ Format Compliance Examples
✅ CORRECT FORMAT (Reward: 1.0):
"<think>I need to solve 2 + 3 × 4. Order of operations says multiply first: 3 × 4 = 12, then add: 2 + 12 = 14</think><answer>14</answer>"

❌ INCORRECT FORMATS (Reward: 0.0):
"The answer is 14" (no tags)
"<answer>14</answer>" (missing think tag)
"<think>Calculate...</think> The final answer is 14" (content outside tags)
"<think>Step 1...<answer>14</answer></think>" (wrong tag order)

Training Impact: Strict format enforcement teaches the model to consistently separate reasoning from conclusions, making outputs more interpretable and debuggable.

🔍 Reasoning Steps Reward

Step Detection Formula:

$$R_{reasoning} = \min\left(1.0, \frac{\text{count}(\text{reasoning indicators})}{3}\right)$$

Reasoning Indicators Pattern:

(Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,)

Reward Scaling:

  • 0 indicators → 0.0 reward
  • 1 indicator → 0.33 reward
  • 2 indicators → 0.67 reward
  • 3+ indicators → 1.0 reward
📝 Step-by-Step Reasoning Examples
HIGH REWARD EXAMPLE (Score: 1.0):
"<think>
Step 1: Identify the operation order (PEMDAS)
Step 2: Calculate 3 × 4 = 12
Step 3: Add 2 + 12 = 14
</think><answer>14</answer>"
→ Found 3 "Step X:" patterns = 1.0 reward

MEDIUM REWARD EXAMPLE (Score: 0.67):
"<think>
First, I'll multiply 3 × 4 = 12
Second, I'll add 2 + 12 = 14
</think><answer>14</answer>"
→ Found 2 transition words = 0.67 reward

LOW REWARD EXAMPLE (Score: 0.0):
"<think>The answer is 14</think><answer>14</answer>"
→ Found 0 reasoning indicators = 0.0 reward

Implementation Code:

def reasoning_steps_reward(completions, **kwargs):
    """Reward function to encourage clear step-by-step reasoning."""
    pattern = r"(Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,)"
    
    completion_contents = [completion[0]["content"] for completion in completions]
    
    matches = [len(re.findall(pattern, content, re.MULTILINE))
               for content in completion_contents]
    
    # Reward proportional to reasoning steps, maxing at 1.0
    return [min(1.0, count / 3) for count in matches]
📏 Cosine Scaled Reward

Length-Aware Reward Formula:

$$R_{cosine} = \text{min\_value} + 0.5 \cdot (\text{max\_value} - \text{min\_value}) \cdot \left(1 + \cos\left(\pi \cdot \frac{\text{length}}{\text{max\_length}}\right)\right)$$

Adaptive Scaling Logic:

  • Correct Answers: Shorter responses get higher rewards
  • Incorrect Answers: Longer responses get less penalty
  • Cosine Function: Smooth transition from 1.0 (short) to -1.0 (long)

Parameter Ranges:

  • Correct: [0.8, 1.0] reward range
  • Incorrect: [-0.5, -0.1] penalty range
  • Max length: 1000 characters
📊 Length-Based Reward Examples
CORRECT ANSWER SCENARIOS:
Short (100 chars): cos(π × 0.1) ≈ 0.95 → Reward ≈ 0.99
Medium (500 chars): cos(π × 0.5) = 0.0 → Reward = 0.9  
Long (1000 chars): cos(π × 1.0) = -1.0 → Reward = 0.8

INCORRECT ANSWER SCENARIOS (min/max are swapped, so longer wrong answers are penalized less):
Short (100 chars): cos(π × 0.1) ≈ 0.95 → Penalty ≈ -0.49
Medium (500 chars): cos(π × 0.5) = 0.0 → Penalty = -0.3
Long (1000 chars): cos(π × 1.0) = -1.0 → Penalty = -0.1

Implementation Code:

def get_cosine_scaled_reward(min_value_wrong=-0.5, max_value_wrong=-0.1,
                            min_value_correct=0.8, max_value_correct=1.0,
                            max_len=1000):
    def cosine_scaled_reward(completions, solution, accuracy_rewards, **kwargs):
        contents = [completion[0]["content"] for completion in completions]
        rewards = []
        
        for content, sol, acc_reward in zip(contents, solution, accuracy_rewards):
            gen_len = len(content)
            progress = gen_len / max_len
            cosine = math.cos(progress * math.pi)
            
            if acc_reward > 0.5:  # Correct answer
                min_val, max_val = min_value_correct, max_value_correct
            else:  # Incorrect answer
                min_val, max_val = max_value_wrong, min_value_wrong
            
            reward = min_val + 0.5 * (max_val - min_val) * (1.0 + cosine)
            rewards.append(float(reward))
        return rewards
    return cosine_scaled_reward
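
A quick check of the scaling behavior, passing pre-computed accuracy flags as this sketch's signature expects, reproduces the numbers in the examples above:

# Compare rewards for short, medium, and long responses
cosine_reward = get_cosine_scaled_reward()
completions = [[{"content": "x" * n}] for n in (100, 500, 1000)]
solutions = ["14", "14", "14"]

print(cosine_reward(completions, solutions, [1.0, 1.0, 1.0]))  # correct:   ≈ [0.99, 0.9, 0.8]
print(cosine_reward(completions, solutions, [0.0, 0.0, 0.0]))  # incorrect: ≈ [-0.49, -0.3, -0.1]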
🔄 Repetition Penalty Reward

N-gram Diversity Formula:

$$R_{repetition} = \text{scaling} \times \text{max\_penalty}$$

$$\text{scaling} = 1 - \frac{\text{unique\_ngrams}}{\text{total\_ngrams}}$$

Diversity Measurement:

  • N-gram Size: 3 (trigrams) for context sensitivity
  • Scaling Range: [0, 1] where 0 = no repetition, 1 = maximum repetition
  • Penalty Range: [0, -0.1] negative rewards for repetition
🎯 Repetition Detection Examples
HIGH DIVERSITY (Low Penalty):
"Step 1: multiply first. Step 2: add second. Step 3: verify result."
→ Trigrams: ["Step 1: multiply", "1: multiply first.", "multiply first. Step", ...]
→ Unique: 10, Total: 10 → Scaling: 0.0 → Penalty: 0.0

MEDIUM REPETITION (Medium Penalty):
"I think I think I need to check the answer I think"
→ The trigram "I think I" appears twice
→ Unique: 9, Total: 10 → Scaling: 0.1 → Penalty: -0.01

HIGH REPETITION (High Penalty):
"The answer is the answer is the answer is 14"
→ "the answer is", "answer is the", and "is the answer" all repeat
→ Unique: 4, Total: 8 → Scaling: 0.5 → Penalty: -0.05

Implementation Code:

def get_repetition_penalty_reward(ngram_size=3, max_penalty=-0.1):
    def zipngram(text, ngram_size):
        """Generate n-grams from text."""
        words = text.lower().split()
        return zip(*[words[i:] for i in range(ngram_size)])
    
    def repetition_penalty_reward(completions, **kwargs):
        contents = [completion[0]["content"] for completion in completions]
        rewards = []
        
        for completion in contents:
            if len(completion.split()) < ngram_size:
                rewards.append(0.0)
                continue
            
            ngrams = set()
            total = 0
            for ng in zipngram(completion, ngram_size):
                ngrams.add(ng)
                total += 1
            
            scaling = 1 - len(ngrams) / total
            reward = scaling * max_penalty
            rewards.append(reward)
        return rewards
    return repetition_penalty_reward
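
Running the function on a repetitive and a diverse response confirms the penalties (a short usage sketch):

# Trigram-diversity penalties for a repetitive vs. a varied completion
repetition_reward = get_repetition_penalty_reward()
completions = [
    [{"content": "The answer is the answer is the answer is 14"}],
    [{"content": "Step 1: multiply first. Step 2: add second. Step 3: verify result."}],
]
print(repetition_reward(completions))  # approximately [-0.05, 0.0]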
🎯 Complete Accuracy Reward Implementation

Mathematical Verification Pipeline:

  1. Parse ground truth using LaTeX extraction
  2. Extract model answer with normalization
  3. Perform semantic mathematical comparison
  4. Handle parsing failures gracefully
🔧 Full Implementation Code
def accuracy_reward(completions, **kwargs):
    """
    Reward function to check if the model's response is mathematically
    equivalent to the ground truth solution.
    """
    contents = [completion[0]["content"] for completion in completions]
    rewards = []
    solutions = kwargs.get("solution")
    
    for content, sol in zip(contents, solutions):
        # Parse the ground truth solution
        gold_parsed = parse(sol, extraction_mode="first_match",
                          extraction_config=[LatexExtractionConfig()])
        
        if gold_parsed:
            # Parse the model's answer with relaxed normalization
            answer_parsed = parse(
                content,
                extraction_config=[
                    LatexExtractionConfig(
                        normalization_config=NormalizationConfig(
                            nits=False,
                            malformed_operators=False,
                            basic_latex=True,
                            equations=True,
                            boxed="all",
                            units=True,
                        ),
                        boxed_match_priority=0,
                        try_extract_without_anchor=False,
                    )
                ],
                extraction_mode="first_match",
            )
            
            # Reward 1.0 if correct, 0.0 if incorrect
            reward = float(verify(answer_parsed, gold_parsed))
        else:
            # Neutral reward if ground truth cannot be parsed
            reward = 0.5
            print("Warning: Failed to parse gold solution:", sol)
        
        rewards.append(reward)
    return rewards
📋 Complete Format Reward Implementation

Regex Pattern Validation:

Ensures strict adherence to the <think>...</think><answer>...</answer> format using comprehensive pattern matching.

🔧 Full Implementation Code
def format_reward(completions, **kwargs):
    """
    Reward function to check if the completion has the correct format:
    <think>...</think> <answer>...</answer>.
    """
    # Define the regex pattern for the desired format
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    
    # Extract the content from each completion
    completion_contents = [completion[0]["content"] for completion in completions]
    
    # Check if each completion matches the pattern
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE)
               for content in completion_contents]
    
    # Reward 1.0 for correct format, 0.0 otherwise
    return [1.0 if match else 0.0 for match in matches]
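
All of these reward functions consume the same conversational completion structure the trainer passes them: one list per prompt, each holding a message dict. A toy batch makes the interface concrete:

# Toy batch: one well-formatted completion, one bare answer
sample_completions = [
    [{"role": "assistant",
      "content": "<think>Step 1: 3 × 4 = 12. Step 2: 2 + 12 = 14.</think><answer>14</answer>"}],
    [{"role": "assistant", "content": "The answer is 14"}],
]

print(format_reward(sample_completions))           # [1.0, 0.0]
print(reasoning_steps_reward(sample_completions))  # ≈ [0.67, 0.0] (two "Step N:" indicators)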

Data Preprocessing Pipeline

🔄

Dataset Transformation

Conversation Format Conversion

# Function to structure the training data
def make_conversation(example):
    """Convert dataset examples into conversation format."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }

# Load and prepare dataset
def load_math_dataset():
    """Load and prepare the mathematics dataset."""
    dataset = load_dataset(
        "AI-MO/NuminaMath-TIR",
        name="default",
        split=['train', 'test']
    )
    
    # Convert splits into dictionary
    dataset = {
        'train': dataset[0],
        'test': dataset[1]
    }
    
    # Apply conversation format
    for split in dataset:
        dataset[split] = dataset[split].map(make_conversation)
        
        # Remove 'messages' column if exists
        if "messages" in dataset[split].column_names:
            dataset[split] = dataset[split].remove_columns("messages")
    
    return dataset

# Load our training dataset
dataset = load_math_dataset()
print(f"Train set size: {len(dataset['train'])}")
print(f"Test set size: {len(dataset['test'])}")

Dataset Validation

def validate_dataset(dataset):
    """Perform basic validation checks on the dataset."""
    required_fields = ["problem", "prompt"]
    
    for split in ['train', 'test']:
        print(f"\nValidating {split} split:")
        
        fields = dataset[split].column_names
        missing = [field for field in required_fields if field not in fields]
        
        if missing:
            print(f"Warning: Missing fields: {missing}")
        else:
            print("✓ All required fields present")
        
        sample = dataset[split][0]
        messages = sample['prompt']
        
        if (len(messages) >= 2 and
            messages[0]['role'] == 'system' and
            messages[1]['role'] == 'user'):
            print("✓ Prompt format is correct")
        else:
            print("Warning: Incorrect prompt format")

# Validate dataset
validate_dataset(dataset)

    GRPO Training Configuration

    Training Hyperparameters

    @dataclass
    class GRPOScriptArguments:
        reward_funcs: List[str] = field(default_factory=lambda: [
            "accuracy", "format", "reasoning_steps", "cosine", "repetition_penalty"
        ])
        cosine_min_value_wrong: float = -0.5
        cosine_max_value_wrong: float = -0.1  
        cosine_min_value_correct: float = 0.8
        cosine_max_value_correct: float = 1.0
        cosine_max_len: int = 1000
        repetition_n_grams: int = 3
        repetition_max_penalty: float = -0.1
    
    training_args = TrainingArguments(
        output_dir="./qwen-grpo-training",
        num_train_epochs=1,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,
        learning_rate=5e-5,
        warmup_ratio=0.1,
        weight_decay=0.01,
        logging_steps=1,
        save_strategy="steps",
        save_steps=5,
        save_total_limit=2,
        dataloader_num_workers=2,
        seed=42,
        bf16=True,
        gradient_checkpointing=True,
        report_to="none",
        remove_unused_columns=False,
    )

    GRPO Training Loop

    # Assemble the reward functions defined earlier
    reward_functions = [
        accuracy_reward,
        format_reward,
        reasoning_steps_reward,
        get_cosine_scaled_reward(),
        get_repetition_penalty_reward(),
    ]

    # Initialize GRPO Trainer
    grpo_config = GRPOConfig(**training_args.to_dict())

    grpo_trainer = GRPOTrainer(
        model=model,
        reward_funcs=reward_functions,
        args=grpo_config,
        train_dataset=dataset['train'],
        eval_dataset=dataset['test'],
        callbacks=[],  # optionally add TrainerCallback instances (e.g., for custom logging)
    )
    
    # Start training
    print("Starting GRPO training...")
    train_result = grpo_trainer.train()
    
    # Save the trained model
    TRAINED_MODEL_PATH = "data/Qwen-GRPO-training"
    tokenizer.save_pretrained(TRAINED_MODEL_PATH)
    grpo_trainer.save_model(TRAINED_MODEL_PATH)
    print(f"GRPO trained model saved to {TRAINED_MODEL_PATH}")
    
    # Test the trained model
    trained_model = grpo_trainer.model
    device = next(trained_model.parameters()).device

    def test_trained_model(user_input: str):
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input}
        ]

        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer(text, return_tensors="pt").to(device)

        outputs = trained_model.generate(
            **inputs,
            max_new_tokens=200,
            do_sample=True,
            temperature=0.7
        )

        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        return response
    
    # Test example
    test_input = "What is 2 + 3 * 4?"
    response = test_trained_model(test_input)
    print(f"Input: {test_input}")
    print(f"Response: {response}")

    Cold Start Data Generation

    ❄️

    Addressing R1 Zero Limitations

    R1 Zero Problems

    • Messy Reasoning: Hard-to-follow thought processes in <think> tags
    • Language Mixing: Inconsistent language usage in multilingual contexts
    • Structural Issues: Inconsistent reasoning organization

    Few-shot Prompting with Long CoT

    # Generate response function
    def generate_response(prompt_text):
        messages = [
            {"role": "system", "content": "You are a helpful assistant that provides step-by-step solutions."},
            {"role": "user", "content": prompt_text}
        ]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        return response.split("<|im_start|>assistant\n")[-1].strip()
    
    # Few-shot examples
    few_shot_prompt = """
    Problem: What's the square root of 9 plus 5?
    Solution: <|special_token|> First, find the square root of 9, which is 3. Then, add 5 to 3. 3 + 5 equals 8. <|special_token|> Summary: The answer is 8.
    
    Problem: Train travels at 60 mph for 2 hours, how far?
    Solution: <|special_token|> Use the formula: Distance = Speed times Time. Speed is 60 mph, Time is 2 hours. Distance = 60 * 2 = 120 miles. <|special_token|> Summary: Train travels 120 miles.
    
    Problem: What is 2 + 3 * 4?
    Solution:
    """
    
    # Generate a structured response (the target problem is already at the end of the few-shot prompt)
    model_response = generate_response(few_shot_prompt)
    print("Few-shot CoT Response:")
    print(model_response)

    Direct Prompting

    # Direct prompting approach
    direct_prompt = """
    Problem: Solve this, show reasoning step-by-step, and verify:
    What is 2 + 3 * 4?
    """
    
    direct_response = generate_response(direct_prompt)
    print("Direct Prompting Response:")
    print(direct_response)

    Post-Processing Refinement

    # Refine messy R1 Zero outputs
    def refine_output(messy_text):
        """Refine messy reasoning output into structured format."""
        try:
            think_content = messy_text.split("<think>")[1].split("</think>")[0].strip()
            answer_content = messy_text.split("<answer>")[1].split("</answer>")[0].strip()
            
            # Clean up the reasoning
            cleaned_reasoning = think_content.replace('ummm...', '').replace('...', '').strip()
            
            refined_text = f"""<|special_token|> Reasoning: {cleaned_reasoning.capitalize()}.
    <|special_token|> Summary: The answer is {answer_content}."""
            return refined_text
        except:
            return messy_text
    
    # Example refinement
    messy_output = "<think> ummm... multiply 3 and 4... get 12... then add 2...</think>\n<answer> 14 </answer>"
    refined = refine_output(messy_output)
    
    print("Before refinement:")
    print(messy_output)
    print("\nAfter refinement:")
    print(refined)

    Supervised Fine-Tuning (SFT)

    📚

    SFT Stage 1: Structured Reasoning

    SFT Configuration

    # SFT Training Configuration
    OUTPUT_DIR_SFT = "data/Qwen-SFT-training"
    os.makedirs(OUTPUT_DIR_SFT, exist_ok=True)
    
    sft_training_args = TrainingArguments(
        output_dir=OUTPUT_DIR_SFT,
        overwrite_output_dir=True,
        num_train_epochs=1,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=16,
        gradient_accumulation_steps=2,
        learning_rate=2e-5,  # Lower learning rate for SFT
        warmup_ratio=0.1,
        weight_decay=0.01,
        logging_steps=10,
        evaluation_strategy="no",
        save_strategy="steps",
        save_steps=50,
        save_total_limit=2,
        dataloader_num_workers=2,
        seed=42,
        bf16=True,
        push_to_hub=False,
        gradient_checkpointing=True,
        report_to="none",
    )

    SFT Training Loop

    # Load high-quality reasoning dataset
    dataset_sft = load_dataset("bespokelabs/Bespoke-Stratos-17k", split='train')
    
    # Initialize model for SFT
    model_sft = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16
    )
    
    # Initialize SFT Trainer
    sft_trainer = SFTTrainer(
        model=model_sft,
        train_dataset=dataset_sft,
        tokenizer=tokenizer,
        args=sft_training_args,
    )
    
    # Start SFT training
    print("Starting SFT training...")
    sft_result = sft_trainer.train()
    
    # Save SFT model
    TRAINED_SFT_PATH = "data/Qwen-SFT-training"
    tokenizer.save_pretrained(TRAINED_SFT_PATH)
    sft_trainer.save_model(TRAINED_SFT_PATH)
    print(f"SFT trained model saved to {TRAINED_SFT_PATH}")

    Advanced Training Stages

    🎯

    Reasoning-Oriented RL & Final Stages

    Reasoning-Oriented Reinforcement Learning

    After SFT, the model undergoes additional RL training with enhanced reward functions:

    Enhanced Reward Components
    • Language Consistency: Ensures reasoning and answers use the same language as the question
    • Reasoning Quality: Evaluates the depth and clarity of step-by-step explanations
    • Improved Accuracy: More sophisticated mathematical verification
    # Conceptual implementation of language consistency reward
    def language_consistency_reward(completions, questions, **kwargs):
        """Reward function to ensure consistent language usage."""
        rewards = []
        
        for completion, question in zip(completions, questions):
            content = completion[0]["content"]
            
            # Detect languages (simplified)
            question_lang = detect_language(question)
            response_lang = detect_language(content)
            
            # Reward consistency
            if question_lang == response_lang:
                rewards.append(1.0)
            else:
                rewards.append(0.0)  # Penalty for language mixing
        
        return rewards
    
    # Enhanced GRPO training with language consistency
    enhanced_reward_functions = [
        accuracy_reward,
        format_reward,
        reasoning_steps_reward,
        language_consistency_reward,
        get_cosine_scaled_reward(),
        get_repetition_penalty_reward()
    ]

    Rejection Sampling

    High-quality reasoning data is generated through rejection sampling:

    # Conceptual rejection sampling implementation
    def rejection_sampling(model, tokenizer, problems, num_samples=10, quality_threshold=0.8):
        """Generate high-quality reasoning examples through rejection sampling."""
        high_quality_examples = []
        
        for problem in problems:
            best_response = None
            best_score = 0
            
            # Generate multiple responses
            for _ in range(num_samples):
                response = generate_response(problem)
                
                # Evaluate quality (simplified)
                score = evaluate_response_quality(response, problem)
                
                if score > best_score and score >= quality_threshold:
                    best_response = response
                    best_score = score
            
            if best_response:
                high_quality_examples.append({
                    'problem': problem,
                    'response': best_response,
                    'quality_score': best_score
                })
        
        return high_quality_examples
    
    # Use rejection sampling to create refined dataset
    refined_data = rejection_sampling(model, tokenizer, sample_problems)

    SFT Stage 2: Helpfulness & Harmlessness

    The final training stage balances reasoning capabilities with general AI assistant behavior:

    # Final stage training configuration
    final_training_args = TrainingArguments(
        output_dir="data/Qwen-R1-final",
        num_train_epochs=1,
        per_device_train_batch_size=4,  # Smaller batch for diverse data
        learning_rate=1e-5,  # Very low learning rate for fine-tuning
        warmup_ratio=0.05,
        weight_decay=0.01,
        logging_steps=5,
        save_strategy="steps",
        save_steps=25,
        evaluation_strategy="steps",
        eval_steps=25,
        seed=42,
        bf16=True,
        gradient_checkpointing=True,
        report_to="none",
    )
    
    # Multi-objective reward function for final stage
    def helpfulness_harmlessness_reward(completions, **kwargs):
        """Reward function balancing helpfulness and harmlessness."""
        rewards = []
        
        for completion in completions:
            content = completion[0]["content"]
            
            # Evaluate helpfulness (simplified)
            helpfulness_score = evaluate_helpfulness(content)
            
            # Evaluate harmlessness (simplified)
            harmlessness_score = evaluate_harmlessness(content)
            
            # Combine scores
            combined_reward = 0.6 * helpfulness_score + 0.4 * harmlessness_score
            rewards.append(combined_reward)
        
        return rewards

    Knowledge Distillation

    Create smaller, more efficient models through knowledge distillation:

    # Knowledge distillation setup
    import torch.nn.functional as F

    def distillation_training(teacher_model, student_model, dataset, temperature=3.0):
        """Distill knowledge from large teacher to smaller student model."""
        
        class DistillationTrainer(SFTTrainer):
            def compute_loss(self, model, inputs, return_outputs=False):
                # Student forward pass
                student_outputs = model(**inputs)
                student_logits = student_outputs.logits
                
                # Teacher forward pass (no gradient)
                with torch.no_grad():
                    teacher_outputs = teacher_model(**inputs)
                    teacher_logits = teacher_outputs.logits
                
                # Distillation loss
                distill_loss = F.kl_div(
                    F.log_softmax(student_logits / temperature, dim=-1),
                    F.softmax(teacher_logits / temperature, dim=-1),
                    reduction='batchmean'
                ) * (temperature ** 2)
                
                # Standard cross-entropy loss
                ce_loss = F.cross_entropy(
                    student_logits.view(-1, student_logits.size(-1)),
                    inputs['labels'].view(-1)
                )
                
                # Combined loss
                total_loss = 0.7 * distill_loss + 0.3 * ce_loss
                
                return (total_loss, student_outputs) if return_outputs else total_loss
        
        # Initialize distillation trainer
        distill_trainer = DistillationTrainer(
            model=student_model,
            train_dataset=dataset,
            tokenizer=tokenizer,
            args=final_training_args,
        )
        
        return distill_trainer
    
    # Example usage
    # teacher_model = load_trained_r1_model()
    # student_model = load_smaller_base_model()
    # distill_trainer = distillation_training(teacher_model, student_model, dataset)
    # distill_trainer.train()

    Results & Evaluation

    🏆

    Training Outcomes

    DeepSeek R1 Achievements

    Key Improvements Over R1 Zero
    • Clear Reasoning: Structured, readable thought processes in <think> tags
    • Language Consistency: Unified language usage throughout responses
    • Mathematical Accuracy: High performance on reasoning benchmarks
    • Assistant Behavior: Helpful, harmless, and honest responses
    • Scalability: Knowledge distillation enables deployment of smaller models

    Evaluation Metrics

    # Comprehensive evaluation function
    def evaluate_r1_model(model, tokenizer, test_dataset):
        """Evaluate the trained R1 model on multiple metrics."""
        
        results = {
            'accuracy': 0,
            'format_compliance': 0,
            'reasoning_quality': 0,
            'language_consistency': 0,
            'response_length': [],
            'reasoning_steps': []
        }
        
        for example in test_dataset:
            # Generate response
            response = generate_response(example['problem'])
            
            # Evaluate accuracy
            accuracy = evaluate_mathematical_accuracy(response, example['solution'])
            results['accuracy'] += accuracy
            
            # Evaluate format compliance
            format_score = evaluate_format_compliance(response)
            results['format_compliance'] += format_score
            
            # Evaluate reasoning quality
            reasoning_score = evaluate_reasoning_quality(response)
            results['reasoning_quality'] += reasoning_score
            
            # Track response metrics
            results['response_length'].append(len(response))
            results['reasoning_steps'].append(count_reasoning_steps(response))
        
        # Calculate averages
        n_examples = len(test_dataset)
        results['accuracy'] /= n_examples
        results['format_compliance'] /= n_examples
        results['reasoning_quality'] /= n_examples
        
        return results
    
    # Example evaluation
    # evaluation_results = evaluate_r1_model(trained_model, tokenizer, test_dataset)
    # print("Evaluation Results:", evaluation_results)

    Conclusion

    🎓

    Training Pipeline Summary

    Complete Implementation Achieved

    This guide provides a comprehensive, end-to-end implementation of the DeepSeek R1 training methodology, including:

    • Multi-dimensional reward system with 5 specialized functions
    • GRPO algorithm for efficient reinforcement learning
    • Cold start data generation techniques
    • Supervised fine-tuning for structured reasoning
    • Advanced RL stages with language consistency
    • Knowledge distillation for model deployment

    Key Takeaways

    1. Iterative Improvement: R1 training is a multi-stage process, each addressing specific limitations
    2. Reward Engineering: Sophisticated reward functions are crucial for shaping desired behaviors
    3. Data Quality: High-quality reasoning examples are essential for effective learning
    4. Computational Efficiency: GRPO reduces training costs compared to traditional RL approaches
    5. Scalability: Knowledge distillation enables practical deployment of reasoning models

    Future Directions

    Potential Improvements
    • Integration with larger base models (7B, 13B, 70B parameters)
    • Domain-specific reasoning specialization (code, mathematics, science)
    • Multi-modal reasoning capabilities
    • Improved evaluation metrics for reasoning quality
    • Real-time reasoning optimization

    Important Considerations

    • Computational Requirements: Training requires significant GPU resources
    • Data Quality: Results heavily depend on training data quality
    • Hyperparameter Sensitivity: Careful tuning of reward function parameters is crucial
    • Evaluation Complexity: Reasoning quality assessment remains challenging

    R1 Zero Limitations Discovered

    Performance Achievements: R1 Zero demonstrated impressive reasoning capabilities, achieving performance comparable to OpenAI-o1-0912 on mathematical benchmarks like AIME 2024.

    Critical Issues Identified:

    • Messy, hard-to-follow reasoning inside <think> tags
    • Language mixing in multilingual contexts
    • Inconsistent reasoning structure and formatting

    These limitations motivated the development of the full R1 training pipeline with supervised fine-tuning stages.

    Cold Start Data Generation

    ❄️

    High-Quality Reasoning Examples

    To address R1 Zero's limitations, the research team developed sophisticated methods for generating high-quality reasoning examples. This "cold start" data serves as the foundation for supervised fine-tuning, teaching the model proper reasoning structure and consistency.

    Three-Pronged Data Generation Strategy

    Cold Start Methodologies

    🎯 Few-Shot Prompting with Long CoT

    Methodology:

    • Provide 2-3 exemplar problems with detailed solutions
    • Demonstrate desired reasoning structure and depth
    • Use special tokens to delineate reasoning sections
    • Show step-by-step problem decomposition

    Template Structure:

    Problem: [Example 1]
    Solution: <|special_token|> [Detailed reasoning] <|special_token|> Summary: [Answer]
    
    Problem: [Example 2]  
    Solution: <|special_token|> [Detailed reasoning] <|special_token|> Summary: [Answer]
    
    Problem: [Target Problem]
    Solution:
    📚 Few-Shot Example Implementation
    Problem: What's the square root of 9 plus 5?
    Solution: <|special_token|> First, find the square root of 9, which is 3. Then, add 5 to 3. 3 + 5 equals 8. <|special_token|> Summary: The answer is 8.
    
    Problem: Train travels at 60 mph for 2 hours, how far?
    Solution: <|special_token|> Use the formula: Distance = Speed × Time. Speed is 60 mph, Time is 2 hours. Distance = 60 × 2 = 120 miles. <|special_token|> Summary: The answer is 120 miles.
    
    Problem: What is 2 + 3 × 4?
    Solution: <|special_token|> To solve 2 + 3 × 4, we need to follow the order of operations (PEMDAS/BODMAS). Multiplication should be performed before addition.
    Step 1: Multiply 3 by 4, which equals 12.
    Step 2: Add 2 to the result from Step 1: 2 + 12 = 14.
    <|special_token|> Summary: The answer is 14.

    Learning Outcome: The model learns to structure responses with clear reasoning boundaries and systematic problem-solving approaches.

    🎪 Direct Prompting Strategy

    Explicit Instruction Approach:

    • Directly request step-by-step reasoning
    • Ask for verification of the solution
    • Encourage explanation of methodology
    • Prompt for self-reflection on answer quality

    Prompt Engineering Principles:

    • Clear directive language
    • Explicit formatting requirements
    • Verification step inclusion
    • Reasoning transparency emphasis
    🎯 Direct Prompting Example
    Prompt:
    "Problem: Solve this, show reasoning step-by-step, and verify:
    What is 2 + 3 × 4?"
    
    Model Response:
    "Reasoning: To solve 2 + 3 × 4, I need to follow the order of operations, which states that multiplication should be done before addition.
    
    Step 1: Multiply 3 by 4, which equals 12.
    Step 2: Add 2 to the result from Step 1: 2 + 12 = 14.
    
    Verification: To verify the answer, I can double-check the order of operations and the calculations. Multiplication is indeed performed before addition, and the calculations are correct.
    
    Summary: The answer is 14."

    Advantage: Direct prompting produces comprehensive outputs with built-in verification, teaching the model to be self-critical and thorough in its reasoning process.

    🔧 Post-Processing Refinement

    Human-in-the-Loop Enhancement:

    • Use R1 Zero outputs as raw material
    • Human annotators clean and structure responses
    • Correct mathematical errors and logical gaps
    • Standardize formatting and language consistency

    Refinement Process:

    1. Extract reasoning content from <think> tags
    2. Identify and correct errors or unclear steps
    3. Restructure into clear, logical progression
    4. Apply consistent formatting standards
    ✨ Refinement Transformation
    Before (Raw R1 Zero Output):
    "<think> ummm... multiply 3 and 4... get 12... then add 2...</think>
    <answer> 14 </answer>"
    
    After (Human Refined):
    "<think>
    To solve this problem, I need to apply the order of operations (PEMDAS).
    
    Step 1: Identify operations present
    - Addition: 2 + [result]
    - Multiplication: 3 × 4
    
    Step 2: Perform multiplication first
    3 × 4 = 12
    
    Step 3: Perform addition
    2 + 12 = 14
    
    Verification: Following PEMDAS correctly gives us 14.
    </think>
    <answer> 14 </answer>"

    Quality Improvement: Human refinement transforms messy, incomplete reasoning into clear, educational examples that serve as excellent training data for supervised fine-tuning.

    Dataset Preparation Pipeline

    Cold Start Data Processing

    def prepare_cold_start_data():
        """
        Comprehensive pipeline for cold start data preparation
        """
        # Load base dataset
        dataset = load_dataset("bespokelabs/Bespoke-Stratos-17k", "default")
        
        # Apply conversation formatting
        def format_conversation(example):
            return {
                "prompt": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": example["problem"]},
                ],
                "completion": example["refined_solution"]  # Human-refined solutions
            }
        
        # Process dataset
        formatted_dataset = dataset.map(format_conversation)
        
        # Quality filtering
        def quality_filter(example):
            # Check for required reasoning indicators
            reasoning_indicators = ["Step", "First", "Then", "Finally", "Because"]
            has_reasoning = any(indicator in example["completion"] for indicator in reasoning_indicators)
            
            # Check format compliance
            has_proper_format = "<think>" in example["completion"] and "<answer>" in example["completion"]
            
            return has_reasoning and has_proper_format
        
        filtered_dataset = formatted_dataset.filter(quality_filter)
        
        return filtered_dataset

    Cold Start Data Impact

    The multi-faceted approach to cold start data generation creates a diverse, high-quality training corpus that addresses the specific weaknesses observed in R1 Zero while maintaining its reasoning strengths. This foundation enables effective supervised fine-tuning in the subsequent training stages.

    Supervised Fine-Tuning Training

    📚

    Learning from High-Quality Examples

    Supervised Fine-Tuning (SFT) transforms the raw reasoning potential of the base model into structured, consistent behavior. By training on carefully curated cold start data, the model learns to produce clear, well-formatted reasoning that addresses the critical limitations observed in R1 Zero.

    SFT Training Mechanics

    Cross-Entropy Loss Optimization

    SFT employs supervised learning principles where the model learns to predict the next token in high-quality reasoning sequences. The training process optimizes the cross-entropy loss between predicted and target tokens:

    Mathematical Foundation

    $$\mathcal{L}_{SFT} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \log P(y_t^{(i)} | y_{\lt t}^{(i)}, x^{(i)}; \theta)$$

    Where:

    • $N$ = number of training examples
    • $T$ = sequence length
    • $y_t^{(i)}$ = target token at position $t$ for example $i$
    • $x^{(i)}$ = input problem for example $i$
    • $\theta$ = model parameters

    Training Process Flow

    1. Input Processing: Problem prompts are tokenized and formatted with system instructions
    2. Target Preparation: High-quality reasoning sequences serve as training targets
    3. Forward Pass: Model generates token probabilities for each position
    4. Loss Calculation: Cross-entropy loss measures prediction accuracy
    5. Backpropagation: Gradients update model parameters to minimize loss
    6. Parameter Update: Optimizer (AdamW) applies gradient-based updates
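
    The flow above condenses to a standard causal language modeling step. A minimal sketch, assuming a tokenized batch and an AdamW optimizer created alongside the configuration below:

    # One SFT optimization step; Hugging Face computes the shifted cross-entropy when labels are given
    import torch

    def sft_step(model, optimizer, input_ids, attention_mask):
        # In practice, prompt tokens are usually masked to -100 in the labels so
        # only the target reasoning/answer tokens contribute to the loss.
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
        loss = outputs.loss          # token-level cross-entropy averaged over the batch
        loss.backward()              # backpropagate gradients through the model
        optimizer.step()             # AdamW parameter update
        optimizer.zero_grad()
        return loss.item()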

    SFT Configuration and Implementation

    Training Configuration

    # SFT Training Arguments
    training_args = TrainingArguments(
        output_dir="./qwen-sft-training",
        overwrite_output_dir=True,
        num_train_epochs=1,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=16,
        gradient_accumulation_steps=2,
        learning_rate=2e-5,           # Lower than GRPO for stability
        warmup_ratio=0.1,
        weight_decay=0.01,
        logging_steps=10,
        evaluation_strategy="no",
        save_strategy="steps",
        save_steps=50,
        save_total_limit=2,
        dataloader_num_workers=2,
        seed=42,
        bf16=True,
        push_to_hub=False,
        gradient_checkpointing=True,
        report_to="none",
    )
    # Note: packing and max_seq_length are SFTTrainer options (see below),
    # not TrainingArguments fields, so they are passed to the trainer instead.

    SFT Trainer Implementation

    # Initialize SFT Trainer
    sft_trainer = SFTTrainer(
        model=model_sft,                     # Base model for fine-tuning
        train_dataset=cold_start_dataset,    # High-quality reasoning examples
        tokenizer=tokenizer,                 # Tokenizer for text processing
        args=training_args,                  # Training configuration
        dataset_text_field="conversations",  # Field containing conversation data
        packing=True,                        # Enable data packing for efficiency
        max_seq_length=4096                 # Maximum sequence length
    )
    
    # Execute training
    sft_train_result = sft_trainer.train()
    
    # Save the fine-tuned model
    sft_trainer.save_model("./qwen-sft-trained")

    SFT Training Outcomes

    Behavioral Improvements

    Aspect               | Before SFT (R1 Zero)           | After SFT (R1 Stage 1)
    Reasoning Structure  | Messy, inconsistent formatting | Clear step-by-step organization
    Language Consistency | Mixed languages in responses   | Consistent language usage
    Format Compliance    | Irregular tag usage            | Reliable <think>/<answer> structure
    Reasoning Quality    | Implicit, hard to follow       | Explicit, educational explanations

    SFT Stage 1 Achievements

    The first SFT stage successfully addresses the primary issues identified in R1 Zero. The model now consistently produces well-structured reasoning with clear language usage, setting the foundation for advanced reasoning-oriented reinforcement learning in subsequent stages.

    Advanced Reasoning Optimization

    🎯

    Reasoning-Oriented Reinforcement Learning

    After establishing structured reasoning through SFT, the training pipeline applies advanced RL techniques to further refine reasoning quality, consistency, and alignment with human preferences. This stage introduces sophisticated reward systems that go beyond basic accuracy.

    Enhanced Reward Architecture

    Language Consistency Rewards

    A critical addition to the reward system addresses the language mixing issues observed in R1 Zero:

    def language_consistency_reward(completions, input_language, **kwargs):
        """
        Reward function ensuring consistent language usage throughout the response.
        """
        contents = [completion[0]["content"] for completion in completions]
        rewards = []
        
        for content in contents:
            # Detect language of reasoning section
            reasoning_lang = detect_language(extract_thinking_content(content))
            
            # Detect language of answer section  
            answer_lang = detect_language(extract_answer_content(content))
            
            # Check consistency with input language
            input_consistent = (reasoning_lang == input_language)
            internal_consistent = (reasoning_lang == answer_lang)
            
            if input_consistent and internal_consistent:
                reward = 1.0  # Perfect consistency
            elif internal_consistent:
                reward = 0.7  # Internal consistency but wrong language
            else:
                reward = 0.0  # Language mixing detected
                
            rewards.append(reward)
        
        return rewards

    Reasoning Quality Assessment

    Advanced evaluation of reasoning depth and logical coherence:

    Multi-Dimensional Quality Metrics
    • Logical Flow: Coherent progression from premises to conclusions
    • Step Completeness: No missing intermediate steps in reasoning
    • Assumption Clarity: Explicit statement of underlying assumptions
    • Error Detection: Self-correction and verification behaviors
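
    These dimensions are hard to measure exactly; a lightweight heuristic proxy (an illustrative sketch, not DeepSeek's actual scorer) is often enough for filtering candidates:

    import re

    def heuristic_reasoning_quality(response: str) -> float:
        """Rough 0-1 proxy combining step structure, verification cues, and length."""
        thinking = response.split("<think>")[-1].split("</think>")[0]

        step_score = min(1.0, len(re.findall(r"Step \d+:|First,|Then,|Finally,", thinking)) / 3)
        verify_score = 1.0 if re.search(r"verif|check|confirm", thinking, re.IGNORECASE) else 0.0
        length_score = min(1.0, len(thinking.split()) / 50)  # favor non-trivial explanations

        return 0.5 * step_score + 0.3 * verify_score + 0.2 * length_score

    # Example usage
    print(heuristic_reasoning_quality(
        "<think>Step 1: multiply 3 × 4 = 12. Step 2: add 2 + 12 = 14. Verify: 14 is correct.</think><answer>14</answer>"
    ))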

    Rejection Sampling for Quality Control

    High-Quality Data Curation

    Rejection sampling filters generated responses to retain only the highest-quality reasoning examples:

    def rejection_sampling_pipeline(model, problems, quality_threshold=0.85):
        """
        Generate multiple responses and select only high-quality examples.
        """
        high_quality_examples = []
        
        for problem in problems:
            # Generate multiple candidate responses
            candidates = []
            for _ in range(10):  # Generate 10 candidates per problem
                response = model.generate(problem, temperature=0.8)
                candidates.append(response)
            
            # Evaluate each candidate
            scored_candidates = []
            for candidate in candidates:
                scores = {
                    'accuracy': evaluate_accuracy(candidate, problem.solution),
                    'reasoning_quality': evaluate_reasoning_quality(candidate),
                    'language_consistency': evaluate_language_consistency(candidate),
                    'format_compliance': evaluate_format_compliance(candidate)
                }
                
                # Compute composite quality score
                composite_score = (
                    scores['accuracy'] * 0.4 +
                    scores['reasoning_quality'] * 0.3 +
                    scores['language_consistency'] * 0.2 +
                    scores['format_compliance'] * 0.1
                )
                
                scored_candidates.append((candidate, composite_score))
            
            # Select best candidate if it meets threshold
            best_candidate, best_score = max(scored_candidates, key=lambda x: x[1])
            if best_score >= quality_threshold:
                high_quality_examples.append((problem, best_candidate))
        
        return high_quality_examples

    SFT Stage 2: Comprehensive Alignment

    Helpfulness and Harmlessness Integration

    The final supervised fine-tuning stage incorporates broader AI alignment objectives:

    Expanded Training Objectives
    • Helpfulness: Responses provide useful, actionable information
    • Harmlessness: Outputs avoid harmful, biased, or dangerous content
    • Honesty: Model acknowledges uncertainty and limitations
    • Reasoning Excellence: Maintains high-quality step-by-step thinking

    Balancing Multiple Objectives

    The challenge in Stage 2 SFT lies in maintaining reasoning capabilities while incorporating broader alignment goals. Careful dataset curation and training techniques prevent degradation of reasoning quality during alignment training.
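
    One simple way to operationalize this balance, sketched below with toy stand-in datasets, is to interleave the curated reasoning corpus with a general assistant corpus at a fixed ratio during Stage 2 SFT (the datasets, ratio, and schema here are assumptions for illustration):

    from datasets import Dataset, interleave_datasets

    # Toy stand-ins; in practice these would be the curated reasoning set and a
    # general helpfulness/harmlessness set sharing the same "messages" schema.
    reasoning_data = Dataset.from_list([
        {"messages": [{"role": "user", "content": "What is 2 + 3 * 4?"},
                      {"role": "assistant", "content": "<think>3 × 4 = 12, then 2 + 12 = 14.</think><answer>14</answer>"}]},
    ])
    assistant_data = Dataset.from_list([
        {"messages": [{"role": "user", "content": "Suggest a title for my blog post."},
                      {"role": "assistant", "content": "How about 'Reasoning Models in Practice'?"}]},
    ])

    # Sample roughly 70% reasoning and 30% general assistant examples
    stage2_mixture = interleave_datasets(
        [reasoning_data, assistant_data],
        probabilities=[0.7, 0.3],
        seed=42,
        stopping_strategy="all_exhausted",
    )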

    Knowledge Distillation

    🏗️

    Scaling Reasoning Capabilities

    To make advanced reasoning capabilities accessible across different computational constraints, DeepSeek employs knowledge distillation to transfer the reasoning expertise of the full R1 model to smaller, more efficient variants.

    Distillation Methodology

    Teacher-Student Framework

    Distillation Process
    1. Teacher Model: Full DeepSeek R1 with complete reasoning capabilities
    2. Student Models: Smaller architectures (various parameter counts)
    3. Knowledge Transfer: Student learns to mimic teacher's reasoning patterns
    4. Efficiency Optimization: Maintain reasoning quality with reduced computation
    # Knowledge Distillation Loss Function
    def distillation_loss(student_logits, teacher_logits, target_tokens, temperature=3.0, alpha=0.7):
        """
        Combined loss function for knowledge distillation.
        """
        # Soft target loss (knowledge from teacher)
        soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            soft_targets,
            reduction='batchmean'
        ) * (temperature ** 2)
        
        # Hard target loss (ground truth)
        hard_loss = F.cross_entropy(student_logits, target_tokens)
        
        # Combined loss
        total_loss = alpha * soft_loss + (1 - alpha) * hard_loss
        return total_loss
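
    A toy check on random logits confirms that the loss combines both terms and backpropagates through the student only (illustrative shapes; real training operates on full sequence logits):

    import torch
    import torch.nn.functional as F

    # 4 token positions, vocabulary of 10; only the student logits carry gradients
    student_logits = torch.randn(4, 10, requires_grad=True)
    teacher_logits = torch.randn(4, 10)
    target_tokens = torch.randint(0, 10, (4,))

    loss = distillation_loss(student_logits, teacher_logits, target_tokens, temperature=3.0, alpha=0.7)
    loss.backward()
    print(loss.item(), student_logits.grad.shape)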

    Multi-Scale Distillation Strategy

    Model Size | Parameters | Use Case                        | Reasoning Retention
    R1-Large   | 70B+       | High-performance reasoning      | 95-98%
    R1-Medium  | 14-32B     | Balanced performance/efficiency | 85-92%
    R1-Small   | 1.5-7B     | Edge deployment                 | 70-80%
    R1-Tiny    | 0.5-1.5B   | Mobile/embedded systems         | 60-70%

    Implementation Results

    Distillation Achievements

    The distillation process successfully creates a family of reasoning-capable models that maintain the core structural and logical reasoning abilities of the full R1 model while offering significant computational savings. This democratizes access to advanced reasoning capabilities across diverse deployment scenarios.

    Performance Benchmarks

    Distilled models demonstrate remarkable retention of reasoning capabilities:

    • Mathematical Reasoning: 85-95% of teacher performance across model sizes
    • Code Generation: Maintained logical structure and correctness
    • Scientific Problem Solving: Preserved step-by-step analytical approach
    • Language Consistency: Retained multilingual reasoning coherence

    Training Pipeline Summary

    🎓

    Complete DeepSeek R1 Methodology

    The DeepSeek R1 training methodology represents a comprehensive approach to developing reasoning-capable language models through iterative improvement and multi-stage optimization.

    Key Innovations and Contributions

    🔬 Technical Innovations
    • GRPO Algorithm: Critic-free reinforcement learning for efficient training
    • Multi-Dimensional Rewards: Comprehensive evaluation beyond simple accuracy
    • Cold Start Data Generation: Systematic creation of high-quality reasoning examples
    • Iterative Refinement: Progressive improvement through multiple training stages
    🎯 Methodological Insights
    • Structured Reasoning: Clear separation of thinking and conclusion phases
    • Language Consistency: Addressing multilingual reasoning challenges
    • Quality Control: Rejection sampling for training data curation
    • Scalability: Knowledge distillation for diverse deployment scenarios

    Implementation Pathway

    This guide provides a complete roadmap for implementing DeepSeek R1-style training:

    1. Environment Setup: Configure development environment and dependencies
    2. Base Model Selection: Choose appropriate foundation model for your scale
    3. R1 Zero Training: Implement GRPO with multi-dimensional rewards
    4. Cold Start Generation: Create high-quality reasoning examples
    5. SFT Training: Supervised fine-tuning for structured reasoning
    6. Advanced RL: Reasoning-oriented reinforcement learning
    7. Distillation: Scale to multiple model sizes

    Future Directions

    The DeepSeek R1 methodology opens several avenues for future research and development:

    • Domain Specialization: Adapting the pipeline for specific reasoning domains
    • Multimodal Reasoning: Extending to visual and audio reasoning tasks
    • Efficiency Optimization: Further reducing computational requirements
    • Evaluation Frameworks: Developing comprehensive reasoning assessment tools