Custom Reward Functions
=======================

Creating custom reward functions allows you to tailor the training signal to your specific task requirements.

Basic Reward Function Structure
-------------------------------

A reward function takes a response and computes a numerical score::

    def my_reward_function(response: str, **kwargs) -> float:
        """
        Compute reward for a model response.

        Args:
            response: The model-generated response text
            **kwargs: Additional context (prompt, answer, etc.)

        Returns:
            Reward score, typically in range [0.0, 1.0]
        """
        # Your evaluation logic here
        reward = evaluate_response(response, kwargs)
        return reward

Argument Patterns
~~~~~~~~~~~~~~~~~

Use whichever signature fits your needs::

    # Minimal
    def simple_reward(response: str) -> float:
        return 1.0 if is_correct(response) else 0.0

    # With context
    def contextual_reward(response: str, prompt: str = "", answer: str = "") -> float:
        return evaluate(response, answer)

    # Using kwargs for flexibility
    def flexible_reward(response: str, **metadata) -> float:
        answer = metadata.get("answer", "")
        prompt = metadata.get("prompt", "")
        return evaluate(response, answer, prompt)
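
How the trainer passes context to these signatures is not specified in this guide. One way to support all three patterns, sketched under the assumption that context arrives as a plain dict, is to forward only the keyword arguments a function declares. ``call_reward`` is a hypothetical helper, not part of agent_tunix:

```python
import inspect
from typing import Any, Callable, Dict


def call_reward(fn: Callable[..., float], response: str, context: Dict[str, Any]) -> float:
    """Call ``fn`` with only the context keys its signature accepts (hypothetical helper)."""
    # Never pass "response" twice, even if the context happens to contain it
    context = {k: v for k, v in context.items() if k != "response"}
    params = inspect.signature(fn).parameters
    accepts_var_kw = any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())
    if accepts_var_kw:
        # The function takes **kwargs, so every context key is safe to pass
        return float(fn(response, **context))
    accepted = {k: v for k, v in context.items() if k in params}
    return float(fn(response, **accepted))


def minimal(response: str) -> float:
    return 1.0 if response.strip() else 0.0


def with_answer(response: str, answer: str = "") -> float:
    return 1.0 if response.strip() == answer else 0.0


def with_kwargs(response: str, **metadata) -> float:
    return 1.0 if metadata.get("answer") == response.strip() else 0.0


context = {"prompt": "What is 2 + 2?", "answer": "4"}
scores = [call_reward(fn, "4", context) for fn in (minimal, with_answer, with_kwargs)]
```

Filtering by signature lets a minimal function ignore context it never asked for, while a ``**kwargs`` function still receives everything.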

Implementing Reward Functions
-----------------------------

Example 1: Math Problem Correctness
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For math problems, extract and check numerical answers::

    import re

    def math_reward(response: str, answer: str = "") -> float:
        """Reward correct mathematical answers."""
        if not answer:
            return 0.0

        try:
            # Extract number from response (last number mentioned)
            numbers = re.findall(r'-?\d+\.?\d*', response)
            if not numbers:
                return 0.0

            predicted = float(numbers[-1])
            expected = float(answer)

            # Exact match
            if predicted == expected:
                return 1.0

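            # Partial credit for close answers (within 10%).
            # Use abs(expected) so negative answers work, and skip the
            # relative check when expected is 0 to avoid ZeroDivisionError.
            if expected != 0 and abs(predicted - expected) / abs(expected) <= 0.1:
                return 0.5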

            return 0.0

        except (ValueError, IndexError):
            return 0.0
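
Comparing floats with ``==`` can reject answers that differ only by rounding, and a relative error is undefined when the expected value is zero. The following is a minimal sketch of a tolerance-based comparison using ``math.isclose``. The helper name ``score_numeric`` and the tolerance values are illustrative assumptions:

```python
import math


def score_numeric(predicted: float, expected: float, close_rel_tol: float = 0.1) -> float:
    """Return 1.0 for a match, 0.5 for a near miss, 0.0 otherwise (illustrative helper)."""
    # Treat values equal up to floating-point rounding as an exact match
    if math.isclose(predicted, expected, rel_tol=1e-9, abs_tol=1e-12):
        return 1.0
    # Relative error is undefined for expected == 0, so give no partial credit there
    if expected == 0:
        return 0.0
    # abs(expected) keeps the ratio non-negative for negative answers
    if abs(predicted - expected) / abs(expected) <= close_rel_tol:
        return 0.5
    return 0.0
```

With this helper, ``0.1 + 0.2`` scores as a match for ``0.3``, which a plain ``==`` comparison would reject.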

Usage::

    from agent_tunix.rewards import register_reward_function
    register_reward_function("math", math_reward)

Then select the registered function when launching training::

    python run_training.py training.reward_function=math


Example 2: Format and Content Combined
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Reward both format compliance and correctness::

    def format_and_content_reward(
        response: str,
        answer: str = "",
        format_template: str = "The answer is [NUM]."
    ) -> float:
        """Reward responses that match format and are correct."""

        # Check format (0.0 to 1.0)
        format_score = evaluate_format(response, format_template)

        # Check correctness (0.0 to 1.0)
        content_score = evaluate_correctness(response, answer)

        # Require both format and correctness
        if format_score < 0.8:
            return 0.0  # Strict format requirement

        # Combine scores
        return content_score * format_score

    def evaluate_format(response, template):
        """Check if response matches format template."""
        # Example: check if contains "The answer is [NUMBER]."
        if "The answer is" in response and "." in response:
            return 1.0
        elif "answer" in response.lower():
            return 0.5
        return 0.0

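    def evaluate_correctness(response, answer):
        """Check if the last number in the response equals the expected answer."""
        import re
        try:
            expected = float(answer)
        except (TypeError, ValueError):
            return 0.0  # Missing or non-numeric reference answer
        numbers = re.findall(r'-?\d+\.?\d*', response)
        if numbers and float(numbers[-1]) == expected:
            return 1.0
        return 0.0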
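
The substring checks in ``evaluate_format`` accept any response that contains "The answer is" and a period anywhere in the text. A stricter variant, sketched here with a regular expression, requires the number to sit inside the sentence. The name ``strict_format_score`` and the pattern are assumptions for illustration:

```python
import re

# "The answer is <number>." with an optional sign and decimal part
ANSWER_PATTERN = re.compile(r"The answer is (-?\d+(?:\.\d+)?)\.")


def strict_format_score(response: str) -> float:
    """Return 1.0 for the full template, 0.5 for a loose mention, else 0.0."""
    if ANSWER_PATTERN.search(response):
        return 1.0
    if "answer" in response.lower():
        return 0.5
    return 0.0
```

The capture group also gives direct access to the number, so the same match could feed the correctness check.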


Example 3: Length-Penalized Reward
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Reward shorter, more concise correct answers::

    def conciseness_reward(response: str, answer: str = "") -> float:
        """Reward correct but concise answers."""

        # Base correctness
        correctness = evaluate_correctness(response, answer)

        # Brevity factor: 1.0 for an empty response, falling linearly to 0.0 at 100 words
        words = len(response.split())
        length_penalty = max(0.0, 1.0 - (words / 100))  # Higher means shorter

        # Combine
        return correctness * (0.7 + 0.3 * length_penalty)
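
To see how the combination behaves, the sketch below evaluates the same formula for a correct answer at several lengths. ``conciseness_factor`` is a stand-in name that isolates the multiplier from the example above:

```python
def conciseness_factor(word_count: int) -> float:
    """Multiplier applied to correctness: 1.0 at 0 words, 0.7 at 100 words or more."""
    brevity = max(0.0, 1.0 - word_count / 100)
    return 0.7 + 0.3 * brevity


# Print the multiplier for a few response lengths
for n in (0, 50, 100, 250):
    print(n, round(conciseness_factor(n), 3))
```

A correct answer never scores below 0.7, so verbosity costs at most 30% of the reward, and responses longer than 100 words are not penalized any further.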


Example 4: Multi-Aspect Evaluation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Evaluate multiple dimensions of quality::

    def multi_aspect_reward(response: str, **kwargs) -> float:
        """Evaluate response on multiple dimensions."""

        answer = kwargs.get("answer", "")
        prompt = kwargs.get("prompt", "")

        # Aspect 1: Correctness (0-1)
        correctness = evaluate_correctness(response, answer)

        # Aspect 2: Clarity (0-1)
        clarity = evaluate_clarity(response)

        # Aspect 3: Completeness (0-1)
        completeness = evaluate_completeness(response, prompt)

        # Aspect 4: Efficiency (0-1)
        efficiency = evaluate_efficiency(response)

        # Weighted combination
        reward = (
            0.5 * correctness +
            0.2 * clarity +
            0.15 * completeness +
            0.15 * efficiency
        )

        return reward

    def evaluate_clarity(response):
        """Check if response is clear and well-structured."""
        # Heuristics for clarity
        lines = [ln for ln in response.split('\n') if ln.strip()]
        # Count non-empty segments so a trailing period does not add a sentence
        sentences = [s for s in response.split('.') if s.strip()]

        if len(sentences) > 1:  # Multiple sentences
            return 0.8
        elif len(lines) > 1:  # Multiple lines
            return 0.6
        return 0.4

    def evaluate_completeness(response, prompt):
        """Check if response fully addresses prompt."""
        # Simple heuristic: longer responses usually more complete
        if len(response) > 50:
            return 1.0
        elif len(response) > 20:
            return 0.5
        return 0.0

    def evaluate_efficiency(response):
        """Score response efficiency (short but complete)."""
        word_count = len(response.split())

        if word_count < 30:
            return 1.0
        elif word_count < 100:
            return 0.8
        elif word_count < 200:
            return 0.5
        else:
            return 0.2


Registering Custom Rewards
---------------------------

Create a custom reward file::

    # src/agent_tunix/custom_rewards.py

    def my_reward_v1(response: str, answer: str = "") -> float:
        """Custom reward function v1."""
        score = 0.0  # Replace with your implementation
        return score

    def my_reward_v2(response: str, answer: str = "") -> float:
        """Custom reward function v2."""
        score = 0.0  # Replace with your implementation
        return score

Reference it in the training config::

    # conf/training/default.yaml
    training:
      reward_function: custom_rewards.my_reward_v1

Or override it on the command line::

    python run_training.py training.reward_function=custom_rewards.my_reward_v1
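
How agent_tunix resolves a value such as ``custom_rewards.my_reward_v1`` is not described in this guide. A common approach for dotted paths, shown here as a sketch with the hypothetical helper ``resolve_dotted``, relies on ``importlib``. The example resolves a standard-library function so that it runs anywhere:

```python
import importlib
from typing import Callable


def resolve_dotted(path: str) -> Callable:
    """Import ``module.attr`` from a dotted string and return the attribute."""
    module_name, _, attr = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.function', got {path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Stand-in for a reward function path; math.sqrt is used only for illustration
sqrt = resolve_dotted("math.sqrt")
```

If a lookup like this fails in practice, checking that the module is importable from the training process's working directory is a reasonable first step.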

Iterating on Rewards
--------------------

Testing Process
~~~~~~~~~~~~~~~

Create a test suite for your reward functions::

    # test_rewards.py
    from agent_tunix.custom_rewards import my_reward_v1 as my_reward

    test_cases = [
        {
            "response": "The answer is 4.",
            "answer": "4",
            "expected": 1.0
        },
        {
            "response": "Four",
            "answer": "4",
            "expected": 0.5
        },
        {
            "response": "The answer is 5.",
            "answer": "4",
            "expected": 0.0
        }
    ]

    for case in test_cases:
        reward = my_reward(case["response"], answer=case["answer"])
        print(f"Expected: {case['expected']}, Got: {reward}")
        assert abs(reward - case["expected"]) < 0.01

Run tests::

    python test_rewards.py

Analyzing Reward Distribution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Log rewards during training to analyze their distribution::

    python run_training.py training.log_reward_stats=true

Monitor:

- Mean reward per batch
- Reward variance
- Min/max rewards
- Reward histogram
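
What ``training.log_reward_stats`` writes is not specified here. The same statistics can be computed offline from a list of rewards with the standard library. ``summarize_rewards`` and the bin count are illustrative, and the histogram assumes rewards in [0, 1]:

```python
import statistics
from collections import Counter
from typing import Dict, List


def summarize_rewards(rewards: List[float], bins: int = 4) -> Dict[str, object]:
    """Mean, variance, min/max, and a fixed-width histogram over [0, 1]."""
    if not rewards:
        raise ValueError("rewards must not be empty")
    # Clamp the bin index so a reward of exactly 1.0 falls in the last bin
    counts = Counter(min(int(r * bins), bins - 1) for r in rewards)
    return {
        "mean": statistics.fmean(rewards),
        "variance": statistics.pvariance(rewards),
        "min": min(rewards),
        "max": max(rewards),
        "histogram": [counts.get(i, 0) for i in range(bins)],
    }


stats = summarize_rewards([0.0, 0.0, 1.0, 1.0, 0.5, 1.0])
```

A histogram with nearly all mass in the first and last bins is the sparse pattern, while a single dominant bin indicates clustering.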

If rewards are skewed:

- **Too sparse** (mostly 0 or 1): add graduated levels (0.25, 0.5, 0.75).
- **Too clustered**: add more dimensions (format, content, efficiency).
- **High variance**: smooth with normalization.

Reward Normalization
--------------------

Normalize rewards for stable training::

    def normalize_reward(reward: float, mean: float = 0.5, std: float = 0.2) -> float:
        """Clip a reward to mean ± 3 std, then standardize it (z-score)."""
        # Clip outliers
        reward = max(mean - 3 * std, min(mean + 3 * std, reward))

        # Standardize; the result lies in roughly [-3, 3], not [0, 1]
        normalized = (reward - mean) / (std + 1e-8)
        return normalized

Apply in your reward function::

    def normalized_reward(response: str, answer: str = "") -> float:
        raw_reward = evaluate(response, answer)
        return normalize_reward(raw_reward)
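
The fixed ``mean`` and ``std`` defaults in ``normalize_reward`` are guesses about the reward distribution. An alternative, sketched here as an assumption rather than as agent_tunix behavior, standardizes each batch with its own statistics:

```python
import statistics
from typing import List


def normalize_batch(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards to zero mean and unit variance within one batch."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every reward in the batch is identical
    return [(r - mean) / (std + eps) for r in rewards]


normalized = normalize_batch([0.0, 0.5, 1.0])
```

When every reward in a batch is identical, all normalized values come out as zero, so that batch carries no preference between responses.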

Reward Shaping for Better Training
-----------------------------------

Guidance Rewards
~~~~~~~~~~~~~~~~

Add small bonuses to guide model behavior::

    def shaped_reward(response: str, **kwargs) -> float:
        """Reward with training guidance."""

        base_reward = evaluate_correctness(response, kwargs.get("answer", ""))

        # Shape 1: Reward explicit reasoning
        if has_reasoning_steps(response):
            base_reward += 0.05

        # Shape 2: Penalize repetition
        if has_repetition(response):
            base_reward -= 0.05

        # Shape 3: Encourage specific format
        if follows_template(response):
            base_reward += 0.1

        return max(0.0, min(1.0, base_reward))

    def has_reasoning_steps(response):
        keywords = ["because", "therefore", "first", "then", "step"]
        return any(kw in response.lower() for kw in keywords)

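    def has_repetition(response):
        # Flag a repeated 3-word sequence; a single repeated word ("the") is normal
        words = response.lower().split()
        trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
        return len(trigrams) != len(set(trigrams))

    def follows_template(response):
        # strip() so trailing whitespace or a newline does not fail the check
        return "The answer is" in response and response.strip().endswith(".")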

Curriculum Learning
~~~~~~~~~~~~~~~~~~~~

Start with easy rewards and progress to harder ones::

    def curriculum_reward(response: str, step: int, **kwargs) -> float:
        """Progressive reward based on training progress."""

        if step < 1000:
            # Early: Just check format
            return 1.0 if follows_template(response) else 0.0

        elif step < 5000:
            # Middle: Format + basic correctness
            format_score = evaluate_format(response, kwargs.get("format", ""))
            content_score = evaluate_correctness(response, kwargs.get("answer", ""))
            return 0.3 * format_score + 0.7 * content_score

        else:
            # Late: Full evaluation
            return multi_aspect_reward(response, **kwargs)

Common Issues and Solutions
---------------------------

**Reward Always 1.0**

Issue: The reward function is too lenient.

Solution::

    # Before
    def loose_reward(response):
        return 1.0 if "answer" in response else 0.0

    # After
    def strict_reward(response, answer=""):
        return 1.0 if extract_answer(response) == answer else 0.0

**Reward Always 0.0**

Issue: The reward function is too strict.

Solution::

    # Before
    def strict_reward(response, answer=""):
        return 1.0 if response == answer else 0.0

    # After
    def flexible_reward(response, answer=""):
        # Extract the answer first so surrounding wording is not penalized
        extracted = extract_answer(response)
        return 1.0 if extracted == answer else 0.0
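
Both fixes rely on ``extract_answer``, which is not defined in this guide. A minimal sketch, assuming numeric answers and the convention of taking the last number in the response:

```python
import re
from typing import Optional

# Optional sign, digits, and an optional decimal part
NUMBER_PATTERN = re.compile(r"-?\d+(?:\.\d+)?")


def extract_answer(response: str) -> Optional[str]:
    """Return the last number in the response as a string, or None if there is none."""
    matches = NUMBER_PATTERN.findall(response)
    return matches[-1] if matches else None
```

Because the result is a string, ``"4.0"`` and ``"4"`` compare unequal. Converting both sides with ``float`` before comparing avoids that mismatch.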

**Model Collapses to Single Output**

Issue: The reward does not differentiate well between outputs.

Solution: Diversify reward signal::

    def diverse_reward(response: str, **kwargs) -> float:
        correctness = evaluate_correctness(response, kwargs.get("answer", ""))
        style = evaluate_style_diversity(response)
        return 0.8 * correctness + 0.2 * style
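
``evaluate_style_diversity`` is not defined in this guide. One possible stand-in, offered as an assumption, is the fraction of distinct words in the response (a type-token ratio). It measures variety within a single response only, not variety across responses:

```python
def evaluate_style_diversity(response: str) -> float:
    """Fraction of distinct words in the response (type-token ratio), in [0, 1]."""
    words = response.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)
```

Detecting collapse across a batch would need a signal that compares responses with each other, which a per-response function like this cannot provide.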

Debugging Reward Functions
---------------------------

Trace reward computation::

    def debug_reward(response: str, **kwargs) -> float:
        """Reward with debug output."""

        print(f"Response: {response[:50]}...")

        correctness = evaluate_correctness(response, kwargs.get("answer", ""))
        print(f"Correctness: {correctness}")

        format_match = evaluate_format(response, kwargs.get("format", ""))
        print(f"Format: {format_match}")

        reward = 0.7 * correctness + 0.3 * format_match
        print(f"Final reward: {reward}\n")

        return reward
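
The ``print`` calls in ``debug_reward`` run on every sample and cannot be switched off. A variant using the standard ``logging`` module emits the same trace only at debug level. Whether ``training.log_level`` configures Python's logging is not confirmed here, and the logger name and the two stand-in scoring functions are assumptions:

```python
import logging

logger = logging.getLogger("reward_debug")  # Hypothetical logger name


def _correctness(response: str, answer: str) -> float:
    # Stand-in for evaluate_correctness: substring match only
    return 1.0 if answer and answer in response else 0.0


def _format(response: str) -> float:
    # Stand-in for evaluate_format
    return 1.0 if response.startswith("The answer is") else 0.0


def logged_reward(response: str, **kwargs) -> float:
    """Same weighting as debug_reward, with the trace sent to the logger."""
    correctness = _correctness(response, kwargs.get("answer", ""))
    format_match = _format(response)
    reward = 0.7 * correctness + 0.3 * format_match
    # Lazy %-formatting: the message is only built when DEBUG is enabled;
    # %.50s truncates the response to its first 50 characters
    logger.debug("response=%.50s correctness=%s format=%s reward=%s",
                 response, correctness, format_match, reward)
    return reward
```

Calling ``logging.basicConfig(level=logging.DEBUG)`` before training makes the trace visible, and leaving the level at its default keeps it silent.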

Run with debug enabled::

    python run_training.py training.reward_function=debug_reward training.log_level=debug

Best Practices
--------------

1. **Start simple**: Begin with basic correctness reward
2. **Test first**: Create test cases before using in training
3. **Analyze distribution**: Plot reward histograms during training
4. **Iterate gradually**: Add one aspect at a time
5. **Balance components**: Ensure no single component dominates
6. **Document design**: Record why you chose specific weights
7. **Version experiments**: Keep records of reward versions used

Next Steps
----------

- :doc:`../api/rewards` - Reward API reference
- :doc:`../guide/training` - Training guide
- :doc:`../guide/hyperparameter_tuning` - Tuning strategies