Rewards API
This module provides reward functions for evaluating model outputs during training.
Reward System Overview
The reward system evaluates model-generated responses and provides numerical feedback to guide training. Rewards are computed for each generated response and used to:
Compute policy-gradient updates
Guide GRPO optimization
Track training progress
Reward computation:
prompt → model → response → [reward functions] → reward score
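The pipeline above can be sketched in a few lines. This is a minimal illustration, not the module's implementation: `score_responses`, the `RewardFn` signature, and both toy reward functions are hypothetical names introduced here.

```python
from typing import Callable, Sequence

# Assumed signature: each reward function maps (response, ground_truth) to a float.
RewardFn = Callable[[str, str], float]


def score_responses(
    responses: Sequence[str],
    ground_truths: Sequence[str],
    reward_fns: Sequence[RewardFn],
) -> list[float]:
    """Sum the output of every reward function for each response."""
    return [
        sum(fn(response, truth) for fn in reward_fns)
        for response, truth in zip(responses, ground_truths)
    ]


# Toy reward functions used only for this illustration.
def ends_with_period(response: str, truth: str) -> float:
    return 1.0 if response.strip().endswith(".") else 0.0


def contains_truth(response: str, truth: str) -> float:
    return 1.0 if truth in response else 0.0


scores = score_responses(
    ["The answer is 4.", "I am not sure"],
    ["4", "4"],
    [ends_with_period, contains_truth],
)
print(scores)  # [2.0, 0.0]
```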
Built-in Reward Functions
Format Reward
Checks if response matches expected format:
match_format_exactly(response, expected_format)
Rewards:
1.0: Perfect format match
0.5: Partial format match
0.0: No format match
Example:
response = "The answer is 4."
expected_format = "The answer is [NUM]."
reward = match_format_exactly(response, expected_format)
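To make the three reward levels concrete, here is one way a template check of this kind could work. It is a sketch under assumptions, not the code behind `match_format_exactly`: the function name `format_reward_sketch`, the regex used for `[NUM]`, and the rule that "partial" means the template appears with extra text around it are all illustrative choices.

```python
import re


def format_reward_sketch(response: str, expected_format: str) -> float:
    """Hypothetical stand-in for match_format_exactly (not the library code)."""
    # Escape the template, then turn the [NUM] placeholder into a number pattern.
    number = r"-?\d+(?:\.\d+)?"
    pattern = re.escape(expected_format).replace(r"\[NUM\]", number)
    if re.fullmatch(pattern, response.strip()):
        return 1.0  # the whole response matches the template
    if re.search(pattern, response):
        return 0.5  # assumption: template present, but with extra text around it
    return 0.0


template = "The answer is [NUM]."
print(format_reward_sketch("The answer is 4.", template))
print(format_reward_sketch("2 + 2 = 4. The answer is 4.", template))
print(format_reward_sketch("2 plus 2 equals 4.", template))
```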
Correctness Reward
Checks if answer is mathematically correct:
check_answer(response, ground_truth)
Rewards:
1.0: Exact match
0.5: Partial credit (within 10% of correct value)
0.0: Incorrect
Example:
response = "2 + 2 = 4. The answer is 4."
ground_truth = "4"
reward = check_answer(response, ground_truth)
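The 10% partial-credit rule can be illustrated as follows. This is a sketch, not the code behind `check_answer`: the name `correctness_reward_sketch` is hypothetical, and it assumes the last number in the response is the model's final answer.

```python
import re


def correctness_reward_sketch(response: str, ground_truth: str) -> float:
    """Hypothetical stand-in for check_answer (not the library code)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0  # no number to compare
    predicted = float(numbers[-1])  # assumption: the final number is the answer
    target = float(ground_truth)
    if predicted == target:
        return 1.0
    # Partial credit when the prediction is within 10% of the correct value.
    if abs(predicted - target) <= 0.1 * abs(target):
        return 0.5
    return 0.0


print(correctness_reward_sketch("2 + 2 = 4. The answer is 4.", "4"))
print(correctness_reward_sketch("The answer is 4.3", "4"))
print(correctness_reward_sketch("The answer is 5", "4"))
```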
Number Extraction
Extracts and validates numbers from responses:
check_numbers(response, expected_numbers)
Returns:
Extracted numbers from response
Validation results
Matching scores
Example:
response = "2 + 2 = 4. The answer is 4."
numbers = check_numbers(response, [4])
# Returns: {"extracted": [4], "matches": [True], "score": 1.0}
Combined Reward
Total reward is typically a combination:
total_reward = alpha * format_reward + beta * correctness_reward
Default weighting:
alpha = 0.3 # Format importance
beta = 0.7 # Correctness importance
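A short worked example shows how the default weights play out. The helper name `combine_rewards` is hypothetical and only wraps the formula above.

```python
def combine_rewards(
    format_reward: float,
    correctness_reward: float,
    alpha: float = 0.3,
    beta: float = 0.7,
) -> float:
    """Weighted sum of the format and correctness rewards."""
    return alpha * format_reward + beta * correctness_reward


# Correct answer in the right format, correct answer in the wrong format,
# and a well-formatted wrong answer.
for fmt, correct in [(1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]:
    print(fmt, correct, round(combine_rewards(fmt, correct), 2))
```

With these weights a correct but badly formatted answer (0.7) still outscores a well-formatted wrong one (0.3).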
Custom Reward Functions
Create a custom reward function in src/agent_tunix/rewards.py:
def custom_reward_function(response: str, prompt: str, metadata: dict) -> float:
    """
    Compute reward for a response.
    Args:
        response: Model-generated response
        prompt: Original input prompt
        metadata: Additional context (answer, format, etc.)
    Returns:
        Reward score (typically 0.0 to 1.0)
    """
    # Extract expected answer from metadata
    expected = metadata.get("answer", "")
    # Compute components
    format_score = evaluate_format(response, metadata.get("format"))
    correctness_score = evaluate_correctness(response, expected)
    completeness_score = evaluate_completeness(response)
    # Combine with weights
    reward = (
        0.3 * format_score +
        0.5 * correctness_score +
        0.2 * completeness_score
    )
    return reward
Using Custom Rewards
Set the reward function in conf/training/default.yaml:
training:
  reward_function: custom_reward_function
Or pass during training:
python run_training.py +training.reward_function=custom_reward_function
Reward Design Patterns
Binary Reward (Correct/Incorrect)
The simplest approach returns 1.0 for a correct answer and 0.0 otherwise:
def binary_reward(response, ground_truth):
    return 1.0 if is_correct(response, ground_truth) else 0.0
Best for: Clear right/wrong answers
Partial Credit Reward (Graduated)
Award points for partial correctness:
def graduated_reward(response, ground_truth):
    if is_correct(response, ground_truth):
        return 1.0
    elif is_partially_correct(response, ground_truth):
        return 0.5
    else:
        return 0.0
Best for: Tasks with multiple acceptable answers
Continuous Reward (Magnitude-based)
Reward proportional to answer quality:
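def continuous_reward(response, ground_truth):
    # Convert ground_truth to a number; it is passed as a string elsewhere in this guide
    error = abs(extract_number(response) - float(ground_truth))
    max_error = 100  # errors at or beyond this value score 0.0
    return max(0.0, 1.0 - (error / max_error))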
Best for: Numerical tasks where closer is better
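A self-contained version of the continuous pattern is sketched below so the numbers can be checked by hand. `extract_last_number` is a hypothetical regex-based helper standing in for `extract_number`, and the sketch assumes a response without any number should score 0.0.

```python
import re
from typing import Optional


def extract_last_number(text: str) -> Optional[float]:
    """Hypothetical helper: return the last number in the text, or None."""
    found = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(found[-1]) if found else None


def continuous_reward_sketch(
    response: str, ground_truth: str, max_error: float = 100.0
) -> float:
    predicted = extract_last_number(response)
    if predicted is None:
        return 0.0  # assumption: no number means no credit
    error = abs(predicted - float(ground_truth))
    # Reward falls linearly from 1.0 (exact) to 0.0 (error >= max_error).
    return max(0.0, 1.0 - error / max_error)


for response in ["The answer is 42.", "The answer is 37.", "The answer is 500.", "No idea"]:
    print(response, round(continuous_reward_sketch(response, "42"), 2))
```

An error of 5 against `max_error = 100` yields 0.95, while an error of 458 is clamped to 0.0. `max_error` should reflect the scale of the task, since a value that is too large makes all rewards cluster near 1.0.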
Multi-aspect Reward (Composite)
Combine multiple evaluation aspects:
def composite_reward(response, prompt, ground_truth):
    # Evaluate different aspects
    relevance = evaluate_relevance(response, prompt)
    correctness = evaluate_correctness(response, ground_truth)
    clarity = evaluate_clarity(response)
    conciseness = evaluate_conciseness(response)
    # Weighted combination
    reward = (
        0.4 * correctness +
        0.3 * relevance +
        0.2 * clarity +
        0.1 * conciseness
    )
    return reward
Best for: Complex tasks requiring multiple quality dimensions
Reward Shaping
Reward shaping guides learning by providing intermediate signals:
def shaped_reward(response, ground_truth):
    """Add shaping to guide model behavior."""
    base_reward = check_correctness(response, ground_truth)
    # Shape 1: Penalize very long responses
    length_penalty = -0.1 * len(response.split()) / 100
    # Shape 2: Reward attempting reasoning steps
    reasoning_bonus = 0.1 if has_reasoning_steps(response) else 0.0
    # Shape 3: Penalize hallucination
    hallucination_penalty = -0.2 if has_hallucination(response) else 0.0
    return base_reward + length_penalty + reasoning_bonus + hallucination_penalty
Guidelines:
Keep shaping rewards relatively small compared to primary reward
Ensure shaping aligns with task objectives
Monitor reward distribution during training
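One way to enforce the first guideline is to clip the combined shaping terms to a fixed budget. The sketch below is illustrative only: `bounded_shaped_reward` and the `max_shaping` budget of 0.2 are assumptions, not part of the module.

```python
def bounded_shaped_reward(
    base_reward: float,
    shaping_terms: list[float],
    max_shaping: float = 0.2,
) -> float:
    """Clip the summed shaping terms so they cannot outweigh the base reward."""
    shaping = sum(shaping_terms)
    shaping = max(-max_shaping, min(max_shaping, shaping))
    return base_reward + shaping


# A very long, hallucinating response: raw shaping sums to -0.6,
# but only -0.2 is applied.
print(round(bounded_shaped_reward(1.0, [-0.5, 0.1, -0.2]), 2))
```

Clipping keeps an unbounded term, such as a length penalty on an extremely long response, from dominating the correctness signal.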
Reward Debugging
Inspect rewards during training:
# Enable verbose reward logging
python run_training.py training.log_rewards=true
This logs:
Reward distribution for each batch
Min/max/mean rewards
Reward statistics over training
Analyze reward patterns:
# Save reward analysis
python run_training.py training.save_reward_analysis=true
Outputs analysis of:
Which response types get high/low rewards
Reward distribution skewness
Reward variance
Common failure patterns
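The same statistics can be computed offline from a list of logged rewards. The following is a minimal sketch using only the standard library; `summarize_rewards` is a hypothetical helper and does not reflect the format of the logs the training script writes.

```python
import statistics


def summarize_rewards(rewards: list[float]) -> dict:
    """Return min, max, mean, variance, and skewness of a batch of rewards."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std > 0:
        # Population skewness: the mean cubed z-score.
        skewness = sum(((r - mean) / std) ** 3 for r in rewards) / len(rewards)
    else:
        skewness = 0.0  # all rewards identical
    return {
        "min": min(rewards),
        "max": max(rewards),
        "mean": mean,
        "variance": statistics.pvariance(rewards),
        "skewness": skewness,
    }


summary = summarize_rewards([0.0, 0.0, 0.0, 1.0])
print(summary)
```

A strongly positive skewness, as in this batch, indicates that most responses score low and only a few score high, which is the sparse-reward situation described in the next section.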
Common Reward Issues
Reward Always Near 0 or 1
Issue: Sparse or binary rewards don’t guide learning well
Solution: Use graduated rewards with intermediate values:
def improved_reward(response, ground_truth):
    if exact_match(response, ground_truth):
        return 1.0
    elif close_match(response, ground_truth, tolerance=0.1):
        return 0.5
    else:
        return 0.0
Rewards Too Noisy
Issue: High variance in rewards prevents consistent learning
Solution: Smooth and normalize rewards. The simplest step is a reward floor, so a failed sample never scores exactly zero:
def stable_reward(response, ground_truth):
    base = check_correctness(response, ground_truth)
    # Add minimum reward to avoid exact zeros
    return max(base, 0.1)
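The normalization half of this solution can be sketched separately. GRPO-style training standardizes rewards within the group of responses sampled for one prompt, subtracting the group mean and dividing by the group standard deviation. The helper name `normalize_group_rewards` and the `eps` value are assumptions for illustration, not the module's API.

```python
import statistics


def normalize_group_rewards(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


normalized = normalize_group_rewards([1.0, 0.0, 0.5, 0.5])
print([round(x, 3) for x in normalized])
```

After normalization, responses that beat the group average get positive values and the rest get negative values, regardless of the raw reward scale.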
Reward Hacking
Issue: Model learns to game the reward instead of solving the task
Solution: Include format/style constraints:
def robust_reward(response, ground_truth, expected_format):
    correctness = check_correctness(response, ground_truth)
    format_match = check_format(response, expected_format)
    # Both must be good: gate correctness on a minimum format score
    if format_match < 0.8:
        return 0.0
    return correctness
Testing Rewards
Test reward functions on sample outputs:
from agent_tunix.rewards import check_answer, match_format_exactly
# Test data
test_cases = [
    {
        "response": "2 + 2 = 4. The answer is 4.",
        "ground_truth": "4",
        "expected_format": "The answer is [NUM].",
        "expected_reward": 1.0
    },
    {
        "response": "2 plus 2 equals 4.",
        "ground_truth": "4",
        "expected_format": "The answer is [NUM].",
        "expected_reward": 0.7  # Correct but wrong format: 0.3 * 0.0 + 0.7 * 1.0
    }
]
# Evaluate
for case in test_cases:
    format_reward = match_format_exactly(
        case["response"],
        case["expected_format"]
    )
    correctness_reward = check_answer(
        case["response"],
        case["ground_truth"]
    )
    total = 0.3 * format_reward + 0.7 * correctness_reward
    print(f"Expected: {case['expected_reward']}, Got: {total}")
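Printing expected and actual values works for a quick look, but a check that fails loudly is easier to run repeatedly. The sketch below compares floats with a tolerance rather than `==`. `find_failures` and `toy_reward` are hypothetical names, and the toy reward exists only so the example runs on its own.

```python
import math


def find_failures(cases, reward_fn, abs_tol=1e-6):
    """Return (response, expected, got) for every case that misses its target."""
    failures = []
    for case in cases:
        got = reward_fn(case["response"], case["ground_truth"])
        if not math.isclose(got, case["expected_reward"], abs_tol=abs_tol):
            failures.append((case["response"], case["expected_reward"], got))
    return failures


# Toy reward used only to exercise the helper.
def toy_reward(response, ground_truth):
    return 1.0 if ground_truth in response else 0.0


cases = [
    {"response": "The answer is 4.", "ground_truth": "4", "expected_reward": 1.0},
    {"response": "The answer is 5.", "ground_truth": "4", "expected_reward": 0.0},
]
failures = find_failures(cases, toy_reward)
assert not failures, failures
```

In practice `toy_reward` would be replaced by the reward function under test.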
Advanced Topics
See Custom Reward Functions for:
Reward normalization and scaling
Multi-task rewards
Curriculum learning with rewards
Reward model training
Next Steps
Custom Reward Functions - Detailed custom reward guide
Training API - Training API reference
Training Guide - Training guide