Rewards API
===========

.. py:module:: agent_tunix.rewards
   :noindex:

This module provides reward functions for evaluating model outputs during training.

Reward System Overview
----------------------

The reward system evaluates model-generated responses and provides numerical feedback to guide training. Rewards are computed for each generated response and used to:

1. Compute policy-gradient updates
2. Guide GRPO optimization
3. Track training progress

Reward computation::

    prompt → model → response → [reward functions] → reward score
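
The flow above can be sketched in a few lines of Python. This is an illustration only: ``score_responses`` and the two toy reward functions are hypothetical names, not part of ``agent_tunix.rewards``, and summing the scores is just one possible way to combine them::

    from typing import Callable, List, Sequence

    # A reward function maps one response string to a numeric score.
    RewardFn = Callable[[str], float]

    def score_responses(responses: Sequence[str],
                        reward_fns: Sequence[RewardFn]) -> List[float]:
        """Return one total score per response by summing all reward functions."""
        return [sum(fn(response) for fn in reward_fns) for response in responses]

    # Toy reward functions, for illustration only.
    def ends_with_period(response: str) -> float:
        return 1.0 if response.strip().endswith(".") else 0.0

    def mentions_answer(response: str) -> float:
        return 1.0 if "answer" in response.lower() else 0.0

    scores = score_responses(
        ["The answer is 4.", "four"],
        [ends_with_period, mentions_answer],
    )

Each response ends up with a single number, which is the quantity the training loop consumes.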

Built-in Reward Functions
--------------------------

**Format Reward**

Checks whether the response matches the expected format::

    match_format_exactly(response, expected_format)

Rewards:

- 1.0: Perfect format match
- 0.5: Partial format match
- 0.0: No format match

Example::

    response = "The answer is 4."
    expected_format = "The answer is [NUM]."
    reward = match_format_exactly(response, expected_format)
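
To make the three reward levels concrete, here is a minimal sketch of how a template such as ``"The answer is [NUM]."`` could be scored. It is not the implementation of ``match_format_exactly``: the function name ``sketch_format_reward``, the regular expression for numbers, and the rule that "partial" means the template appears inside a longer response are all assumptions::

    import re

    # Assumed number syntax: optional minus sign, digits, optional decimal part.
    NUMBER_PATTERN = r"-?\d+(?:\.\d+)?"

    def sketch_format_reward(response: str, expected_format: str) -> float:
        # Escape the template, then swap the escaped placeholder for a number pattern.
        pattern = re.escape(expected_format).replace(re.escape("[NUM]"), NUMBER_PATTERN)
        text = response.strip()
        if re.fullmatch(pattern, text):
            return 1.0  # The whole response follows the template
        if re.search(pattern, text):
            return 0.5  # The template appears, surrounded by extra text
        return 0.0      # The template does not appear at all

    exact = sketch_format_reward("The answer is 4.", "The answer is [NUM].")
    partial = sketch_format_reward("2 + 2 = 4. The answer is 4.", "The answer is [NUM].")
    missing = sketch_format_reward("four", "The answer is [NUM].")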

**Correctness Reward**

Checks whether the answer is mathematically correct::

    check_answer(response, ground_truth)

Rewards:

- 1.0: Exact match
- 0.5: Partial credit (within 10% of correct value)
- 0.0: Incorrect

Example::

    response = "2 + 2 = 4. The answer is 4."
    ground_truth = "4"
    reward = check_answer(response, ground_truth)
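
The scoring tiers above can be illustrated with a short sketch. It is not the implementation of ``check_answer``: the names ``last_number`` and ``sketch_check_answer`` are hypothetical, and treating the last number in the response as the model's answer is an assumption::

    import re
    from typing import Optional

    def last_number(text: str) -> Optional[float]:
        """Return the last number found in the text, or None if there is none."""
        matches = re.findall(r"-?\d+(?:\.\d+)?", text)
        return float(matches[-1]) if matches else None

    def sketch_check_answer(response: str, ground_truth: str) -> float:
        predicted = last_number(response)
        if predicted is None:
            return 0.0  # No number to compare
        target = float(ground_truth)
        if predicted == target:
            return 1.0  # Exact match
        if abs(predicted - target) <= 0.1 * abs(target):
            return 0.5  # Within 10% of the correct value
        return 0.0

    exact = sketch_check_answer("2 + 2 = 4. The answer is 4.", "4")
    close = sketch_check_answer("The answer is about 4.3", "4")
    wrong = sketch_check_answer("The answer is 7", "4")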

**Number Extraction**

Extracts and validates numbers from responses::

    check_numbers(response, expected_numbers)

Returns a dictionary with:

- ``extracted``: numbers extracted from the response
- ``matches``: validation result for each expected number
- ``score``: overall matching score

Example::

    response = "2 + 2 = 4. The answer is 4."
    numbers = check_numbers(response, [4])
    # Returns: {"extracted": [4], "matches": [True], "score": 1.0}

Combined Reward
---------------

The total reward is typically a weighted sum of the individual rewards::

    total_reward = alpha * format_reward + beta * correctness_reward

Default weighting::

    alpha = 0.3    # Format importance
    beta = 0.7     # Correctness importance
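
A short worked example shows how the default weights trade format against correctness. ``combine_rewards`` is a hypothetical helper written for this illustration, not a function exported by the module::

    ALPHA = 0.3  # Format weight (default from this section)
    BETA = 0.7   # Correctness weight (default from this section)

    def combine_rewards(format_reward: float, correctness_reward: float,
                        alpha: float = ALPHA, beta: float = BETA) -> float:
        """Weighted sum of the format and correctness rewards."""
        return alpha * format_reward + beta * correctness_reward

    # Correct answer, wrong format: 0.3 * 0.0 + 0.7 * 1.0
    wrong_format = combine_rewards(0.0, 1.0)

    # Right format, wrong answer: 0.3 * 1.0 + 0.7 * 0.0
    wrong_answer = combine_rewards(1.0, 0.0)

With these weights a correct answer in the wrong format scores 0.7 and a well-formatted wrong answer scores 0.3, so correctness dominates the total.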

Custom Reward Functions
-----------------------

Create a custom reward function in ``src/agent_tunix/rewards.py``::

    def custom_reward_function(response: str, prompt: str, metadata: dict) -> float:
        """
        Compute reward for a response.

        Args:
            response: Model-generated response
            prompt: Original input prompt
            metadata: Additional context (answer, format, etc.)

        Returns:
            Reward score (typically 0.0 to 1.0)
        """
        # Extract expected answer from metadata
        expected = metadata.get("answer", "")

        # Compute components
        format_score = evaluate_format(response, metadata.get("format"))
        correctness_score = evaluate_correctness(response, expected)
        completeness_score = evaluate_completeness(response)

        # Combine with weights
        reward = (
            0.3 * format_score +
            0.5 * correctness_score +
            0.2 * completeness_score
        )

        return reward

Using Custom Rewards
~~~~~~~~~~~~~~~~~~~~

Set the reward function in ``conf/training/default.yaml``::

    training:
      reward_function: custom_reward_function

Or override it on the command line when launching training::

    python run_training.py +training.reward_function=custom_reward_function

Reward Design Patterns
----------------------

**Binary Reward** (Correct/Incorrect)

The simplest approach: 1.0 for a correct answer, 0.0 otherwise::

    def binary_reward(response, ground_truth):
        return 1.0 if is_correct(response, ground_truth) else 0.0

Best for: Clear right/wrong answers

**Partial Credit Reward** (Graduated)

Award points for partial correctness::

    def graduated_reward(response, ground_truth):
        if is_correct(response, ground_truth):
            return 1.0
        elif is_partially_correct(response, ground_truth):
            return 0.5
        else:
            return 0.0

Best for: Tasks with multiple acceptable answers

**Continuous Reward** (Magnitude-based)

Reward proportional to answer quality::

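    def continuous_reward(response, ground_truth, max_error=100.0):
        # Assumes extract_number returns None when no number is found
        predicted = extract_number(response)
        if predicted is None:
            return 0.0
        # ground_truth may arrive as a string such as "4"
        error = abs(predicted - float(ground_truth))
        return max(0.0, 1.0 - error / max_error)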

Best for: Numerical tasks where closer is better

**Multi-aspect Reward** (Composite)

Combine multiple evaluation aspects::

    def composite_reward(response, prompt, ground_truth):
        # Evaluate different aspects
        relevance = evaluate_relevance(response, prompt)
        correctness = evaluate_correctness(response, ground_truth)
        clarity = evaluate_clarity(response)
        conciseness = evaluate_conciseness(response)

        # Weighted combination
        reward = (
            0.4 * correctness +
            0.3 * relevance +
            0.2 * clarity +
            0.1 * conciseness
        )
        return reward

Best for: Complex tasks requiring multiple quality dimensions

Reward Shaping
---------------

Reward shaping guides learning by providing intermediate signals::

    def shaped_reward(response, ground_truth):
        """Add shaping to guide model behavior."""
        base_reward = check_correctness(response, ground_truth)

        # Shape 1: Penalize very long responses
        length_penalty = -0.1 * len(response.split()) / 100

        # Shape 2: Reward attempting reasoning steps
        reasoning_bonus = 0.1 if has_reasoning_steps(response) else 0.0

        # Shape 3: Penalize hallucination
        hallucination_penalty = -0.2 if has_hallucination(response) else 0.0

        return base_reward + length_penalty + reasoning_bonus + hallucination_penalty

Guidelines:

- Keep shaping rewards relatively small compared to primary reward
- Ensure shaping aligns with task objectives
- Monitor reward distribution during training
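
One way to enforce the first guideline is to cap the combined shaping contribution. In ``shaped_reward`` above the length penalty is unbounded: a 2,000-word response contributes -2.0, which outweighs any base reward in the 0.0 to 1.0 range. The sketch below is a hypothetical helper; the cap of 0.3 is an arbitrary illustrative value::

    from typing import Sequence

    def clamp(value: float, low: float, high: float) -> float:
        """Restrict value to the closed interval [low, high]."""
        return max(low, min(high, value))

    def bounded_shaped_reward(base_reward: float,
                              shaping_terms: Sequence[float],
                              max_shaping: float = 0.3) -> float:
        # Sum bonuses and penalties, then cap their combined magnitude
        # so the primary reward always dominates.
        shaping = clamp(sum(shaping_terms), -max_shaping, max_shaping)
        return base_reward + shaping

    # Penalties of -0.5 and -0.2 plus a bonus of 0.1 sum to -0.6, capped at -0.3.
    capped = bounded_shaped_reward(1.0, [-0.5, 0.1, -0.2])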

Reward Debugging
----------------

Inspect rewards during training::

    # Enable verbose reward logging
    python run_training.py training.log_rewards=true

This logs:

- Reward distribution for each batch
- Min/max/mean rewards
- Reward statistics over training

Analyze reward patterns::

    # Save reward analysis
    python run_training.py training.save_reward_analysis=true

Outputs analysis of:

- Which response types get high/low rewards
- Reward distribution skewness
- Reward variance
- Common failure patterns
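
The same kind of statistics can be computed by hand on any list of rewards, for example when inspecting logged values in a notebook. This is a standalone sketch using only the standard library; ``summarize_rewards`` is a hypothetical name and the output format of the built-in analysis may differ::

    import statistics
    from typing import Dict, Sequence

    def summarize_rewards(rewards: Sequence[float]) -> Dict[str, float]:
        """Compute basic distribution statistics for a batch of rewards."""
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards)  # Population standard deviation
        # Skewness is the mean cubed z-score; define it as 0.0 when all rewards are equal.
        if std == 0:
            skewness = 0.0
        else:
            skewness = sum(((r - mean) / std) ** 3 for r in rewards) / len(rewards)
        return {
            "min": min(rewards),
            "max": max(rewards),
            "mean": mean,
            "variance": statistics.pvariance(rewards),
            "skewness": skewness,
        }

    summary = summarize_rewards([0.0, 0.0, 1.0, 1.0])

A variance near zero means every response in the batch received about the same reward, which leaves little signal to learn from.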

Common Reward Issues
--------------------

**Reward Always Near 0 or 1**

Issue: Sparse or binary rewards don't guide learning well

Solution: Use graduated rewards with intermediate values::

    def improved_reward(response, ground_truth):
        if exact_match(response, ground_truth):
            return 1.0
        elif close_match(response, ground_truth, tolerance=0.1):
            return 0.5
        else:
            return 0.0

**Rewards Too Noisy**

Issue: High variance in rewards prevents consistent learning

Solution: Reduce the spread of rewards. A simple first step is to floor the reward so that no response scores exactly zero::

    def stable_reward(response, ground_truth):
        base = check_correctness(response, ground_truth)
        # Add minimum reward to avoid exact zeros
        return max(base, 0.1)
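
Standardizing rewards within a group of responses is another common way to keep their scale consistent; GRPO-style methods typically compute advantages this way. Whether the training loop here already does so is not stated in this document, so treat the sketch below as an illustration. ``normalize_group`` and the ``eps`` default are assumptions::

    import statistics
    from typing import List, Sequence

    def normalize_group(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
        """Shift rewards to zero mean and scale them to roughly unit variance."""
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards)
        # eps avoids division by zero when every reward in the group is identical.
        return [(r - mean) / (std + eps) for r in rewards]

    normalized = normalize_group([1.0, 0.0, 0.0, 1.0])
    constant = normalize_group([0.5, 0.5, 0.5])

When all rewards in a group are equal, every normalized value is zero, so that group contributes no learning signal.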

**Reward Hacking**

Issue: Model learns to game the reward instead of solving the task

Solution: Include format/style constraints::

    def robust_reward(response, ground_truth, expected_format):
        correctness = check_correctness(response, ground_truth)
        format_match = check_format(response, expected_format)

        # Both must be good
        if format_match < 0.8:
            return 0.0

        return correctness

Testing Rewards
---------------

Test reward functions on sample outputs::

    from agent_tunix.rewards import check_answer, match_format_exactly

    # Test data
    test_cases = [
        {
            "response": "2 + 2 = 4. The answer is 4.",
            "ground_truth": "4",
            "expected_format": "The answer is [NUM].",
            "expected_reward": 1.0
        },
        {
            "response": "2 plus 2 equals 4.",
            "ground_truth": "4",
            "expected_format": "The answer is [NUM].",
            # Correct but wrong format: 0.3 * 0.0 + 0.7 * 1.0
            "expected_reward": 0.7
        }
    ]

    # Evaluate
    for case in test_cases:
        format_reward = match_format_exactly(
            case["response"],
            case["expected_format"]
        )
        correctness_reward = check_answer(
            case["response"],
            case["ground_truth"]
        )
        total = 0.3 * format_reward + 0.7 * correctness_reward
        print(f"Expected: {case['expected_reward']}, Got: {total}")
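
Printed comparisons are easy to misread, so it can help to collect mismatches programmatically. The helper below is a sketch that works with any reward function taking ``(response, ground_truth)``; ``find_failures`` and ``toy_reward`` are hypothetical names, and the toy reward exists only to exercise the helper::

    import math

    def find_failures(reward_fn, cases, tol=1e-6):
        """Return (case, actual) pairs whose reward differs from expected_reward."""
        failures = []
        for case in cases:
            actual = reward_fn(case["response"], case["ground_truth"])
            # Compare with a tolerance because weighted sums are floating point.
            if not math.isclose(actual, case["expected_reward"], abs_tol=tol):
                failures.append((case, actual))
        return failures

    # Toy reward: 1.0 if the ground truth string appears in the response.
    def toy_reward(response, ground_truth):
        return 1.0 if ground_truth in response else 0.0

    cases = [
        {"response": "The answer is 4.", "ground_truth": "4", "expected_reward": 1.0},
        {"response": "The answer is 5.", "ground_truth": "4", "expected_reward": 0.0},
    ]
    failures = find_failures(toy_reward, cases)

An empty ``failures`` list means every case produced its expected reward.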

Advanced Topics
---------------

See :doc:`../advanced/custom_rewards` for:

- Reward normalization and scaling
- Multi-task rewards
- Curriculum learning with rewards
- Reward model training

Next Steps
----------

- :doc:`../advanced/custom_rewards` - Detailed custom reward guide
- :doc:`train` - Training API reference
- :doc:`../guide/training` - Training guide