Evaluation API
==============

.. py:module:: agent_tunix.evaluate
   :noindex:

This module provides evaluation utilities for assessing model performance.

Main Entry Point
----------------

.. autofunction:: evaluate

Evaluation Functions
--------------------

.. autofunction:: create_sampler

.. autofunction:: evaluate_with_config

Configuration Classes
---------------------

Evaluation configuration structure::

    evaluation:
      checkpoint_dir: ./checkpoints/ckpts/
      step: null                           # null for latest
      inference_config: greedy
      num_passes: 1
      verbose: true

Inference Configurations
------------------------

Three predefined inference strategies:

**Greedy**

Deterministic: always selects the highest-probability token::

    inference_config: greedy
    # temperature: 1e-4
    # top_k: 1
    # top_p: 1.0

Best for: Reproducible results, benchmarking, production inference

**Standard**

Balanced sampling with moderate diversity::

    inference_config: standard
    # temperature: 0.7
    # top_k: 50
    # top_p: 0.95

Best for: Reasonable diversity while maintaining coherence

**Liberal**

High diversity, creative responses::

    inference_config: liberal
    # temperature: 0.85
    # top_k: 2000
    # top_p: 1.0

Best for: Exploring model capabilities, creative tasks
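
To make the three parameters concrete, the sketch below applies temperature scaling, top-k truncation, and top-p (nucleus) truncation to a toy set of logits using only the standard library. It illustrates the general technique and is not the sampler that ``create_sampler`` builds; the function name ``filtered_probs``, the ``PRESETS`` dictionary, and the toy logits are hypothetical, while the parameter values are copied from the presets above.

.. code-block:: python

    import math

    def filtered_probs(logits, temperature, top_k, top_p):
        """Return {token_index: probability} after temperature, top-k and top-p."""
        # Temperature scaling: lower values sharpen the distribution.
        scaled = [x / temperature for x in logits]
        # Numerically stable softmax.
        peak = max(scaled)
        exps = [math.exp(x - peak) for x in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Top-k: rank tokens by probability and keep at most top_k of them.
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        ranked = ranked[:top_k]
        # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
        kept, mass = [], 0.0
        for i in ranked:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        # Renormalise over the surviving tokens.
        return {i: probs[i] / mass for i in kept}

    # Values taken from the three presets documented above.
    PRESETS = {
        "greedy": dict(temperature=1e-4, top_k=1, top_p=1.0),
        "standard": dict(temperature=0.7, top_k=50, top_p=0.95),
        "liberal": dict(temperature=0.85, top_k=2000, top_p=1.0),
    }

    toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical four-token vocabulary
    for name, params in PRESETS.items():
        print(name, filtered_probs(toy_logits, **params))

With these toy logits, the greedy preset keeps a single token, the standard preset drops the least likely token, and the liberal preset keeps all four.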

Evaluation Metrics
------------------

The framework computes three main metrics:

- **Accuracy**: Percentage of answers that are exactly correct (the number matches the expected output)
- **Partial Accuracy**: Percentage of answers within 10% of the correct value
- **Format Accuracy**: Percentage of responses matching expected format structure

Example output::

    Evaluation Results
    ==================
    Correct: 125/500
    Accuracy: 25.00%
    Partial Accuracy: 45.20%
    Format Accuracy: 78.50%
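
The three definitions can be sketched as follows. This is a minimal illustration and not the framework's implementation: the function name ``compute_metrics`` and the record layout are hypothetical, and "within 10%" is assumed to mean a relative error of at most 0.1. The result keys mirror those used under Programmatic Usage.

.. code-block:: python

    def compute_metrics(records):
        """Compute the three percentages from (predicted, expected, format_ok) tuples.

        ``predicted`` is a float, or None when no number could be extracted
        from the response. This record layout is an assumption of the sketch.
        """
        records = list(records)
        total = len(records)
        correct = partial = well_formed = 0
        for predicted, expected, format_ok in records:
            if format_ok:
                well_formed += 1
            if predicted is None:
                continue
            if predicted == expected:
                correct += 1
            # Assumed reading of "within 10%": relative error of at most 0.1.
            if abs(predicted - expected) <= 0.1 * abs(expected):
                partial += 1

        def pct(count):
            return 100.0 * count / total if total else 0.0

        return {
            "accuracy": pct(correct),
            "partial_accuracy": pct(partial),
            "format_accuracy": pct(well_formed),
        }

    records = [
        (10.0, 10.0, True),   # exact match, well formatted
        (10.5, 10.0, True),   # within 10%, well formatted
        (20.0, 10.0, False),  # wrong answer, bad format
        (None, 10.0, False),  # no number extracted
    ]
    print(compute_metrics(records))

Under this reading an exact match also counts toward partial accuracy, so partial accuracy is never lower than accuracy.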

Checkpoint Selection
--------------------

Evaluate latest checkpoint::

    python evaluate.py

Evaluate specific step::

    python evaluate.py step=1000

Use custom checkpoint directory::

    python evaluate.py checkpoint_dir=/path/to/checkpoints/

List available checkpoints::

    ls -la checkpoints/ckpts/actor/
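
The same listing can be done from Python. The sketch below assumes that each checkpoint is stored in a subdirectory named after its integer step (for example ``checkpoints/ckpts/actor/500``); that layout is an assumption and is not confirmed by this page, and the helper names ``available_steps`` and ``latest_step`` are hypothetical.

.. code-block:: python

    from pathlib import Path

    def available_steps(actor_dir):
        """Return the sorted step numbers found as subdirectory names."""
        path = Path(actor_dir)
        if not path.is_dir():
            return []
        # Assumption: one subdirectory per checkpoint, named by its step.
        return sorted(
            int(entry.name)
            for entry in path.iterdir()
            if entry.is_dir() and entry.name.isdecimal()
        )

    def latest_step(actor_dir):
        """Return the highest step number, or None when nothing is found."""
        steps = available_steps(actor_dir)
        return steps[-1] if steps else None

    print(latest_step("checkpoints/ckpts/actor"))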

Advanced Configuration
----------------------

Create a custom evaluation config in ``conf/evaluation/custom.yaml``::

    checkpoint_dir: ./checkpoints/ckpts/
    step: 500
    inference_config: greedy
    num_passes: 5
    verbose: true

Select it through the ``evaluation`` config group::

    python evaluate.py evaluation=custom

Multiple Passes
---------------

Run multiple generation passes per question for uncertainty estimation::

    python evaluate.py num_passes=3

Useful for:

- Understanding model consistency
- Finding the best response from multiple attempts
- Estimating confidence/uncertainty
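
One way to turn several passes into the signals listed above is to aggregate the answers per question. The sketch below is illustrative only: the function name ``summarize_passes`` and the returned keys are hypothetical, and how the framework itself reports multi-pass results is not specified on this page.

.. code-block:: python

    from collections import Counter

    def summarize_passes(answers, expected):
        """Summarise the answers from all passes for a single question.

        ``answers`` must be a non-empty list with one entry per pass.
        """
        counts = Counter(answers)
        majority, votes = counts.most_common(1)[0]
        return {
            "majority_answer": majority,
            # Fraction of passes agreeing with the majority: a rough
            # consistency (confidence) signal.
            "agreement": votes / len(answers),
            # True when at least one pass produced the expected answer.
            "any_correct": expected in answers,
            "majority_correct": majority == expected,
        }

    print(summarize_passes(["42", "42", "41"], expected="42"))

A low ``agreement`` value with ``any_correct`` set suggests the model can reach the answer but does so inconsistently.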

Batch Evaluation
----------------

Evaluate multiple checkpoints::

    for step in 100 200 300 400 500; do
        python evaluate.py step=$step >> results.txt
    done

Or with Hydra sweeps::

    python evaluate.py --multirun step=100,200,300,400,500
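
After a loop like the one above, ``results.txt`` holds one report per checkpoint. The sketch below extracts the three percentages from such text, assuming each run prints a report in the format shown under Evaluation Metrics. The function name ``parse_reports`` is hypothetical, and the reports are matched to steps purely by order.

.. code-block:: python

    import re

    # Matches the three percentage lines of the report format.
    METRIC_LINE = re.compile(
        r"^[ \t]*(Accuracy|Partial Accuracy|Format Accuracy):[ \t]*([0-9.]+)%",
        re.MULTILINE,
    )

    def parse_reports(text):
        """Return one dict of metrics per report found in ``text``."""
        reports, current = [], {}
        for name, value in METRIC_LINE.findall(text):
            key = name.lower().replace(" ", "_")
            if key in current:  # a repeated metric marks the next report
                reports.append(current)
                current = {}
            current[key] = float(value)
        if current:
            reports.append(current)
        return reports

    sample = """\
    Evaluation Results
    ==================
    Correct: 125/500
    Accuracy: 25.00%
    Partial Accuracy: 45.20%
    Format Accuracy: 78.50%
    """
    steps = [100]  # same order as the evaluation loop
    for step, report in zip(steps, parse_reports(sample)):
        print(step, report)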

Interpreting Results
--------------------

**High Accuracy, Low Format Accuracy**

The model produces correct answers but in the wrong format. May indicate:

- Reward function not properly calibrated
- Format specification unclear to model
- Need for stricter format enforcement during training

**Low Accuracy, High Format Accuracy**

The model follows the format but its answers are incorrect. May need:

- More training data
- Better reward signal
- Longer training duration
- Different hyperparameters

**Low Accuracy, Low Format Accuracy**

Fundamental training issue. Check:

- Data quality and completeness
- Model size adequacy for task complexity
- Training configuration correctness
- Learning rate appropriateness
- Sufficient training steps
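
The three cases can be encoded as a small triage helper for use in scripts. This is a sketch only: the function name ``diagnose``, the returned labels, and the 50% cut-off between "low" and "high" are arbitrary assumptions, since this page does not define thresholds.

.. code-block:: python

    def diagnose(accuracy, format_accuracy, threshold=50.0):
        """Map two percentages onto the cases described above.

        ``threshold`` is a hypothetical cut-off between "low" and "high".
        """
        high_accuracy = accuracy >= threshold
        high_format = format_accuracy >= threshold
        if high_accuracy and high_format:
            return "ok"
        if high_accuracy:
            return "high accuracy, low format accuracy"
        if high_format:
            return "low accuracy, high format accuracy"
        return "low accuracy, low format accuracy"

    # Values from the example output under Evaluation Metrics.
    print(diagnose(accuracy=25.0, format_accuracy=78.5))

Choose the threshold to suit the task; a harder benchmark may justify a much lower bar for "high" accuracy.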

Troubleshooting
---------------

**No checkpoint found**

Verify training completed and checkpoints exist::

    ls -la checkpoints/ckpts/actor/

**CUDA out of memory**

Reduce batch size::

    python evaluate.py training.micro_batch_size=1

**Evaluation takes too long**

- Use greedy inference (faster)
- Reduce evaluation set size
- Evaluate fewer test batches

**Metric discrepancies**

Make sure evaluation uses the same inference configuration as training::

    python evaluate.py inference_config=greedy

Programmatic Usage
------------------

Use evaluation functions in custom scripts::

    from agent_tunix.evaluate import create_sampler, evaluate_with_config

    # Create sampler for generation
    sampler = create_sampler(model, tokenizer, model_config, 256, 512)

    # Evaluate with predefined config
    results = evaluate_with_config(test_dataset, sampler, "greedy")

    print(f"Accuracy: {results['accuracy']:.2f}%")
    print(f"Partial Accuracy: {results['partial_accuracy']:.2f}%")
    print(f"Format Accuracy: {results['format_accuracy']:.2f}%")

Example Evaluation Workflow
----------------------------

1. Evaluate latest checkpoint with greedy inference::

    python evaluate.py

2. Compare different inference strategies::

    for config in greedy standard liberal; do
        echo "=== $config ==="
        python evaluate.py inference_config=$config
    done

3. Get uncertainty estimates with multiple passes::

    python evaluate.py num_passes=5

4. Compare multiple checkpoints::

    python evaluate.py --multirun step=100,200,300,400,500

Next Steps
----------

- :doc:`../guide/evaluation` - Detailed evaluation guide
- :doc:`train` - Training API reference
- :doc:`models` - Model architecture reference