Rewards API

This module provides reward functions for evaluating model outputs during training.

Reward System Overview

The reward system evaluates model-generated responses and provides numerical feedback to guide training. Rewards are computed for each generated response and used to:

  1. Update policy gradients

  2. Guide GRPO optimization

  3. Track training progress

Reward computation:

prompt → model → response → [reward functions] → reward score

Built-in Reward Functions

Format Reward

Checks if response matches expected format:

match_format_exactly(response, expected_format)

Rewards:

  • 1.0: Perfect format match

  • 0.5: Partial format match

  • 0.0: No format match

Example:

response = "The answer is 4."
expected_format = "The answer is [NUM]."
reward = match_format_exactly(response, expected_format)

Correctness Reward

Checks if answer is mathematically correct:

check_answer(response, ground_truth)

Rewards:

  • 1.0: Exact match

  • 0.5: Partial credit (within 10% of correct value)

  • 0.0: Incorrect

Example:

response = "2 + 2 = 4. The answer is 4."
ground_truth = "4"
reward = check_answer(response, ground_truth)

Number Extraction

Extracts and validates numbers from responses:

check_numbers(response, expected_numbers)

Returns:

  • Extracted numbers from response

  • Validation results

  • Matching scores

Example:

response = "2 + 2 = 4. The answer is 4."
numbers = check_numbers(response, [4])
# Returns: {"extracted": [4], "matches": [True], "score": 1.0}

Combined Reward

Total reward is typically a combination:

total_reward = alpha * format_reward + beta * correctness_reward

Default weighting:

alpha = 0.3    # Format importance
beta = 0.7     # Correctness importance

Custom Reward Functions

Create custom reward function in src/agent_tunix/rewards.py:

def custom_reward_function(response: str, prompt: str, metadata: dict) -> float:
    """
    Compute reward for a response.

    Args:
        response: Model-generated response
        prompt: Original input prompt
        metadata: Additional context (answer, format, etc.)

    Returns:
        Reward score (typically 0.0 to 1.0)
    """
    # Extract expected answer from metadata
    expected = metadata.get("answer", "")

    # Compute components
    format_score = evaluate_format(response, metadata.get("format"))
    correctness_score = evaluate_correctness(response, expected)
    completeness_score = evaluate_completeness(response)

    # Combine with weights
    reward = (
        0.3 * format_score +
        0.5 * correctness_score +
        0.2 * completeness_score
    )

    return reward

Using Custom Rewards

Update reward function in conf/training/default.yaml:

training:
  reward_function: custom_reward_function

Or pass during training:

python run_training.py +training.reward_function=custom_reward_function

Reward Design Patterns

Binary Reward (Correct/Incorrect)

Simplest approach, 1.0 for correct, 0.0 for incorrect:

def binary_reward(response, ground_truth):
    return 1.0 if is_correct(response, ground_truth) else 0.0

Best for: Clear right/wrong answers

Partial Credit Reward (Graduated)

Award points for partial correctness:

def graduated_reward(response, ground_truth):
    if is_correct(response, ground_truth):
        return 1.0
    elif is_partially_correct(response, ground_truth):
        return 0.5
    else:
        return 0.0

Best for: Tasks with multiple acceptable answers

Continuous Reward (Magnitude-based)

Reward proportional to answer quality:

def continuous_reward(response, ground_truth):
    error = abs(extract_number(response) - ground_truth)
    max_error = 100
    return max(0.0, 1.0 - (error / max_error))

Best for: Numerical tasks where closer is better

Multi-aspect Reward (Composite)

Combine multiple evaluation aspects:

def composite_reward(response, prompt, ground_truth):
    # Evaluate different aspects
    relevance = evaluate_relevance(response, prompt)
    correctness = evaluate_correctness(response, ground_truth)
    clarity = evaluate_clarity(response)
    conciseness = evaluate_conciseness(response)

    # Weighted combination
    reward = (
        0.4 * correctness +
        0.3 * relevance +
        0.2 * clarity +
        0.1 * conciseness
    )
    return reward

Best for: Complex tasks requiring multiple quality dimensions

Reward Shaping

Reward shaping guides learning by providing intermediate signals:

def shaped_reward(response, ground_truth):
    """Add shaping to guide model behavior."""
    base_reward = check_correctness(response, ground_truth)

    # Shape 1: Penalize very long responses
    length_penalty = -0.1 * len(response.split()) / 100

    # Shape 2: Reward attempting reasoning steps
    reasoning_bonus = 0.1 if has_reasoning_steps(response) else 0.0

    # Shape 3: Penalize hallucination
    hallucination_penalty = -0.2 if has_hallucination(response) else 0.0

    return base_reward + length_penalty + reasoning_bonus + hallucination_penalty

Guidelines:

  • Keep shaping rewards relatively small compared to primary reward

  • Ensure shaping aligns with task objectives

  • Monitor reward distribution during training

Reward Debugging

Inspect rewards during training:

# Enable verbose reward logging
python run_training.py training.log_rewards=true

This logs:

  • Reward distribution for each batch

  • Min/max/mean rewards

  • Reward statistics over training

Analyze reward patterns:

# Save reward analysis
python run_training.py training.save_reward_analysis=true

Outputs analysis of:

  • Which response types get high/low rewards

  • Reward distribution skewness

  • Reward variance

  • Common failure patterns

Common Reward Issues

Reward Always Near 0 or 1

Issue: Sparse or binary rewards don’t guide learning well

Solution: Use graduated rewards with intermediate values:

def improved_reward(response, ground_truth):
    if exact_match(response, ground_truth):
        return 1.0
    elif close_match(response, ground_truth, tolerance=0.1):
        return 0.5
    else:
        return 0.0

Rewards Too Noisy

Issue: High variance in rewards prevents consistent learning

Solution: Smooth and normalize rewards:

def stable_reward(response, ground_truth):
    base = check_correctness(response, ground_truth)
    # Add minimum reward to avoid exact zeros
    return max(base, 0.1)

Reward Hacking

Issue: Model learns to game the reward instead of solving the task

Solution: Include format/style constraints:

def robust_reward(response, ground_truth):
    correctness = check_correctness(response, ground_truth)
    format_match = check_format(response, expected_format)

    # Both must be good
    if format_match < 0.8:
        return 0.0

    return correctness

Testing Rewards

Test reward functions on sample outputs:

from agent_tunix.rewards import check_answer, match_format_exactly

# Test data
test_cases = [
    {
        "response": "2 + 2 = 4. The answer is 4.",
        "ground_truth": "4",
        "expected_format": "The answer is [NUM].",
        "expected_reward": 1.0
    },
    {
        "response": "2 plus 2 equals 4.",
        "ground_truth": "4",
        "expected_format": "The answer is [NUM].",
        "expected_reward": 0.5  # Correct but wrong format
    }
]

# Evaluate
for case in test_cases:
    format_reward = match_format_exactly(
        case["response"],
        case["expected_format"]
    )
    correctness_reward = check_answer(
        case["response"],
        case["ground_truth"]
    )
    total = 0.3 * format_reward + 0.7 * correctness_reward
    print(f"Expected: {case['expected_reward']}, Got: {total}")

Advanced Topics

See Custom Reward Functions for:

  • Reward normalization and scaling

  • Multi-task rewards

  • Curriculum learning with rewards

  • Reward model training

Next Steps