Custom Reward Functions

Creating custom reward functions allows you to tailor the training signal to your specific task requirements.

Basic Reward Function Structure

A reward function takes a response and computes a numerical score:

def my_reward_function(response: str, **kwargs) -> float:
    """
    Compute reward for a model response.

    Args:
        response: The model-generated response text
        **kwargs: Additional context (prompt, answer, etc.)

    Returns:
        Reward score, typically in range [0.0, 1.0]
    """
    # Your evaluation logic here
    reward = evaluate_response(response, kwargs)
    return reward

Argument Patterns

A reward function can accept different arguments depending on how much context it needs:

# Minimal
def simple_reward(response: str) -> float:
    return 1.0 if is_correct(response) else 0.0

# With context
def contextual_reward(response: str, prompt: str = "", answer: str = "") -> float:
    return evaluate(response, answer)

# Using kwargs for flexibility
def flexible_reward(response: str, **metadata) -> float:
    answer = metadata.get("answer", "")
    prompt = metadata.get("prompt", "")
    return evaluate(response, answer, prompt)

Implementing Reward Functions

Example 1: Math Problem Correctness

For math problems, extract and check numerical answers:

import re

def math_reward(response: str, answer: str = "") -> float:
    """Reward correct mathematical answers."""
    if not answer:
        return 0.0

    try:
        # Extract number from response (last number mentioned)
        numbers = re.findall(r'-?\d+\.?\d*', response)
        if not numbers:
            return 0.0

        predicted = float(numbers[-1])
        expected = float(answer)

        # Exact match
        if predicted == expected:
            return 1.0

        # Partial credit for close answers (within 10%)
        if expected == 0:
            return 0.0  # Relative error is undefined for a zero target
        error_pct = abs(predicted - expected) / abs(expected)
        if error_pct <= 0.1:
            return 0.5

        return 0.0

    except (ValueError, IndexError):
        return 0.0

Usage:

from agent_tunix.rewards import register_reward_function
register_reward_function("math", math_reward)

python run_training.py training.reward_function=math

Example 2: Format and Content Combined

Reward both format compliance and correctness:

def format_and_content_reward(
    response: str,
    answer: str = "",
    format_template: str = "The answer is [NUM]."
) -> float:
    """Reward responses that match format and are correct."""

    # Check format (0.0 to 1.0)
    format_score = evaluate_format(response, format_template)

    # Check correctness (0.0 to 1.0)
    content_score = evaluate_correctness(response, answer)

    # Require both format and correctness
    if format_score < 0.8:
        return 0.0  # Strict format requirement

    # Combine scores
    return content_score * format_score

def evaluate_format(response, template):
    """Check if response matches format template."""
    # Example: check if contains "The answer is [NUMBER]."
    if "The answer is" in response and "." in response:
        return 1.0
    elif "answer" in response.lower():
        return 0.5
    return 0.0

def evaluate_correctness(response, answer):
    """Check if answer is correct."""
    import re
    numbers = re.findall(r'-?\d+\.?\d*', response)
    if numbers and float(numbers[-1]) == float(answer):
        return 1.0
    return 0.0

Example 3: Length-Penalized Reward

Reward shorter, more concise correct answers:

def conciseness_reward(response: str, answer: str = "") -> float:
    """Reward correct but concise answers."""

    # Base correctness
    correctness = evaluate_correctness(response, answer)

    # Penalize verbosity
    words = len(response.split())
    length_penalty = max(0.0, 1.0 - (words / 100))  # Falls linearly from 1.0 at 0 words to 0.0 at 100+ words

    # Combine
    return correctness * (0.7 + 0.3 * length_penalty)

Example 4: Multi-Aspect Evaluation

Evaluate multiple dimensions of quality:

def multi_aspect_reward(response: str, **kwargs) -> float:
    """Evaluate response on multiple dimensions."""

    answer = kwargs.get("answer", "")
    prompt = kwargs.get("prompt", "")

    # Aspect 1: Correctness (0-1)
    correctness = evaluate_correctness(response, answer)

    # Aspect 2: Clarity (0-1)
    clarity = evaluate_clarity(response)

    # Aspect 3: Completeness (0-1)
    completeness = evaluate_completeness(response, prompt)

    # Aspect 4: Efficiency (0-1)
    efficiency = evaluate_efficiency(response)

    # Weighted combination
    reward = (
        0.5 * correctness +
        0.2 * clarity +
        0.15 * completeness +
        0.15 * efficiency
    )

    return reward

def evaluate_clarity(response):
    """Check if response is clear and well-structured."""
    # Heuristics for clarity
    lines = len(response.splitlines())
    # Count non-empty segments so a trailing period does not add a sentence
    sentences = len([s for s in response.split('.') if s.strip()])

    if sentences > 1:  # Multiple sentences
        return 0.8
    elif lines > 1:  # Multiple lines
        return 0.6
    return 0.4

def evaluate_completeness(response, prompt):
    """Check if response fully addresses prompt."""
    # Simple heuristic: longer responses usually more complete
    if len(response) > 50:
        return 1.0
    elif len(response) > 20:
        return 0.5
    return 0.0

def evaluate_efficiency(response):
    """Score response efficiency (short but complete)."""
    word_count = len(response.split())

    if word_count < 30:
        return 1.0
    elif word_count < 100:
        return 0.8
    elif word_count < 200:
        return 0.5
    else:
        return 0.2

Registering Custom Rewards

Create a custom reward file:

# src/agent_tunix/custom_rewards.py

def my_reward_v1(response: str, answer: str = "") -> float:
    """Custom reward function v1."""
    score = 0.0  # Placeholder: replace with your scoring logic
    return score

def my_reward_v2(response: str, answer: str = "") -> float:
    """Custom reward function v2."""
    score = 0.0  # Placeholder: replace with your scoring logic
    return score

Reference the function in the training config:

# conf/training/default.yaml
training:
  reward_function: custom_rewards.my_reward_v1

Or pass it as a command-line override:

python run_training.py training.reward_function=custom_rewards.my_reward_v1

Iterating on Rewards

Testing Process

Create a test suite for your reward functions:

# test_rewards.py
from agent_tunix.custom_rewards import my_reward_v1 as my_reward

test_cases = [
    {
        "response": "The answer is 4.",
        "answer": "4",
        "expected": 1.0
    },
    {
        "response": "Four",
        "answer": "4",
        "expected": 0.5
    },
    {
        "response": "The answer is 5.",
        "answer": "4",
        "expected": 0.0
    }
]

for case in test_cases:
    reward = my_reward(case["response"], answer=case["answer"])
    print(f"Expected: {case['expected']}, Got: {reward}")
    assert abs(reward - case["expected"]) < 0.01

Run tests:

python test_rewards.py

Analyzing Reward Distribution

Log rewards during training to analyze their distribution:

python run_training.py training.log_reward_stats=true

Monitor:

  • Mean reward per batch

  • Reward variance

  • Min/max rewards

  • Reward histogram
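If the logged rewards are available as a plain list of floats, the statistics above can be computed offline with the standard library. The following is a minimal sketch under that assumption; `summarize_rewards` and its `num_bins` parameter are hypothetical names and are not part of agent_tunix.

```python
import statistics
from collections import Counter

def summarize_rewards(rewards, num_bins=4):
    """Return mean, variance, min, max, and a coarse histogram for one batch."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    histogram = Counter()
    for r in rewards:
        # Clamp to [0, 1], then map to an equal-width bin; 1.0 goes in the last bin
        clamped = min(1.0, max(0.0, r))
        bin_index = min(num_bins - 1, int(clamped * num_bins))
        histogram[bin_index] += 1
    return {
        "mean": statistics.fmean(rewards),
        "variance": statistics.pvariance(rewards),
        "min": min(rewards),
        "max": max(rewards),
        "histogram": [histogram[i] for i in range(num_bins)],
    }

batch = [0.0, 0.0, 1.0, 0.5, 1.0, 0.0]
stats = summarize_rewards(batch)
print(stats)
```

A histogram with most of its mass in the first and last bins, as in this toy batch, is the sparse pattern discussed next.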

If rewards are skewed:

# Too sparse (mostly 0 or 1)
→ Add graduated levels (0.25, 0.5, 0.75)

# Too clustered
→ Add more dimensions (format, content, efficiency)

# High variance
→ Smooth with normalization
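As an illustration of the first remedy, a binary math reward can be given graduated levels. This is a sketch only: the function name, the thresholds (1% and 10% relative error), and the rule for the 0.25 level are assumptions chosen for the example, not values prescribed by agent_tunix.

```python
import re

def graduated_math_reward(response: str, answer: str = "") -> float:
    """Hypothetical graduated reward with levels 1.0, 0.75, 0.5, 0.25, 0.0."""
    try:
        expected = float(answer)
    except ValueError:
        return 0.0
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", response)]
    if not numbers:
        return 0.0
    final = numbers[-1]
    if final == expected:
        return 1.0
    # Relative error, falling back to absolute error when the target is 0
    scale = abs(expected) if expected != 0 else 1.0
    error = abs(final - expected) / scale
    if error <= 0.01:
        return 0.75
    if error <= 0.1:
        return 0.5
    # The right number appears somewhere, but not as the final answer
    if expected in numbers:
        return 0.25
    return 0.0

print(graduated_math_reward("It is roughly 4.3", answer="4"))
```

Intermediate levels give the optimizer a signal for near misses, at the cost of rewarding answers that are still wrong, so the thresholds deserve the same testing as the reward itself.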

Reward Normalization

Normalize rewards for stable training:

def normalize_reward(reward: float, mean: float = 0.5, std: float = 0.2) -> float:
    """Normalize reward to stable distribution."""
    # Clip outliers
    reward = max(-3 * std + mean, min(3 * std + mean, reward))

    # Standardize to a z-score (roughly [-3, 3] after clipping, not [0, 1])
    normalized = (reward - mean) / (std + 1e-8)
    return normalized

Apply in your reward function:

def normalized_reward(response: str, answer: str = "") -> float:
    raw_reward = evaluate(response, answer)
    return normalize_reward(raw_reward)
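The fixed defaults `mean=0.5` and `std=0.2` only fit if the raw rewards actually follow that distribution. An alternative, sketched here under the assumption that you can see all raw rewards for a batch at once (this page does not say whether the training loop exposes them), is to estimate both statistics from the batch itself. `normalize_batch` is a hypothetical helper.

```python
import statistics

def normalize_batch(rewards, eps=1e-8):
    """Standardize a batch of rewards using its own mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every reward in the batch is identical
    return [(r - mean) / (std + eps) for r in rewards]

normalized = normalize_batch([0.0, 0.5, 1.0])
print(normalized)
```

With this approach a batch in which every response earns the same reward normalizes to all zeros, which makes a collapsed reward signal easy to spot.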

Reward Shaping for Better Training

Guidance Rewards

Add small bonuses to guide model behavior:

def shaped_reward(response: str, **kwargs) -> float:
    """Reward with training guidance."""

    base_reward = evaluate_correctness(response, kwargs.get("answer", ""))

    # Shape 1: Reward explicit reasoning
    if has_reasoning_steps(response):
        base_reward += 0.05

    # Shape 2: Penalize repetition
    if has_repetition(response):
        base_reward -= 0.05

    # Shape 3: Encourage specific format
    if follows_template(response):
        base_reward += 0.1

    return max(0.0, min(1.0, base_reward))

def has_reasoning_steps(response):
    keywords = ["because", "therefore", "first", "then", "step"]
    return any(kw in response.lower() for kw in keywords)

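def has_repetition(response, max_repeat_ratio=0.5):
    # Flag only heavy repetition; ordinary prose reuses words like "the"
    words = response.lower().split()
    if not words:
        return False
    return 1 - len(set(words)) / len(words) > max_repeat_ratio

def follows_template(response):
    return "The answer is" in response and response.strip().endswith(".")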

Curriculum Learning

Start with easy rewards and progress to harder ones:

def curriculum_reward(response: str, step: int, **kwargs) -> float:
    """Progressive reward based on training progress."""

    if step < 1000:
        # Early: Just check format
        return 1.0 if follows_template(response) else 0.0

    elif step < 5000:
        # Middle: Format + basic correctness
        format_score = evaluate_format(response, kwargs.get("format", ""))
        content_score = evaluate_correctness(response, kwargs.get("answer", ""))
        return 0.3 * format_score + 0.7 * content_score

    else:
        # Late: Full evaluation
        return multi_aspect_reward(response, **kwargs)

Common Issues and Solutions

Reward Always 1.0

Issue: Reward function too lenient

Solution:

# Before
def loose_reward(response):
    return 1.0 if "answer" in response else 0.0

# After
def strict_reward(response, answer=""):
    return 1.0 if extract_answer(response) == answer else 0.0
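The `extract_answer` helper used above is not defined on this page. One possible implementation for numeric answers is sketched below; the regular expressions and the preference for an explicit "The answer is" statement are assumptions made for illustration.

```python
import re

def extract_answer(response: str) -> str:
    """Hypothetical helper: pull the final numeric answer out of a response."""
    # Prefer an explicit "The answer is X" statement when present
    match = re.search(r"[Tt]he answer is\s*(-?\d+(?:\.\d+)?)", response)
    if match:
        return match.group(1)
    # Otherwise fall back to the last number mentioned
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else ""

print(extract_answer("First 3, but the answer is 7, not 9"))
```

Because the helper returns a string, the comparison `extract_answer(response) == answer` is textual: "4.0" and "4" would not match unless both sides are converted to numbers first.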

Reward Always 0.0

Issue: Reward function too strict

Solution:

# Before
def strict_reward(response, answer=""):
    return 1.0 if response == answer else 0.0

# After
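def flexible_reward(response, answer=""):
    extracted = extract_answer(response)
    if extracted == answer:
        return 1.0
    # Partial credit only when the answer appears somewhere in the response
    return 0.5 if answer and answer in response else 0.0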

Model Collapses to Single Output

Issue: Reward doesn’t differentiate outputs well

Solution: Diversify the reward signal:

def diverse_reward(response: str, **kwargs) -> float:
    correctness = evaluate_correctness(response, kwargs.get("answer", ""))
    style = evaluate_style_diversity(response)
    return 0.8 * correctness + 0.2 * style

Debugging Reward Functions

Trace reward computation:

def debug_reward(response: str, **kwargs) -> float:
    """Reward with debug output."""

    print(f"Response: {response[:50]}...")

    correctness = evaluate_correctness(response, kwargs.get("answer", ""))
    print(f"Correctness: {correctness}")

    format_match = evaluate_format(response, kwargs.get("format", ""))
    print(f"Format: {format_match}")

    reward = 0.7 * correctness + 0.3 * format_match
    print(f"Final reward: {reward}\n")

    return reward

Run with debug enabled:

python run_training.py training.reward_function=debug_reward training.log_level=debug
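Printing from inside the reward function can flood the console over a long run. A possible alternative, sketched here with Python's standard `logging` and `functools` modules, is a decorator that logs each call at DEBUG level so the output can be switched off without editing the reward. The names `log_reward` and `exact_match_reward` are hypothetical, and this page does not say whether `training.log_level` affects loggers created this way.

```python
import functools
import logging

logger = logging.getLogger("reward_debug")

def log_reward(fn):
    """Wrap a reward function so each call is logged at DEBUG level."""
    @functools.wraps(fn)
    def wrapper(response: str, **kwargs) -> float:
        reward = fn(response, **kwargs)
        logger.debug("%s(%r) -> %.3f", fn.__name__, response[:50], reward)
        return reward
    return wrapper

@log_reward
def exact_match_reward(response: str, answer: str = "") -> float:
    # Toy reward used only to demonstrate the wrapper
    return 1.0 if response.strip() == answer else 0.0

logging.basicConfig(level=logging.DEBUG)
exact_match_reward("4", answer="4")
exact_match_reward("5", answer="4")
```

Raising the level passed to `logging.basicConfig` to `logging.INFO` silences the trace while leaving the reward function unchanged.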

Best Practices

  1. Start simple: Begin with basic correctness reward

  2. Test first: Create test cases before using in training

  3. Analyze distribution: Plot reward histograms during training

  4. Iterate gradually: Add one aspect at a time

  5. Balance components: Ensure no single component dominates

  6. Document design: Record why you chose specific weights

  7. Version experiments: Keep records of reward versions used
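For practice 5, one rough way to check balance is to measure how much of the total weighted reward each component contributes over a sample of responses. The sketch below assumes per-component scores have already been collected as dictionaries; `contribution_shares` is a hypothetical helper, and the weights are the ones from the multi-aspect example.

```python
def contribution_shares(component_scores, weights):
    """Share of the total weighted reward contributed by each component."""
    totals = {name: 0.0 for name in weights}
    for scores in component_scores:
        for name, weight in weights.items():
            totals[name] += weight * scores[name]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in weights}
    return {name: total / grand_total for name, total in totals.items()}

weights = {"correctness": 0.5, "clarity": 0.2, "completeness": 0.15, "efficiency": 0.15}
samples = [
    {"correctness": 1.0, "clarity": 0.8, "completeness": 1.0, "efficiency": 1.0},
    {"correctness": 0.0, "clarity": 0.8, "completeness": 1.0, "efficiency": 0.8},
]
shares = contribution_shares(samples, weights)
print(shares)
```

If a low-weight component ends up with a large share because the model almost always maxes it out, that component is contributing little useful signal and its weight or scoring may need revisiting.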

Next Steps