Evaluation Guide

Model Evaluation

Evaluate a trained model on the test set:

python evaluate.py

This will:

  1. Load the trained model with LoRA weights

  2. Create a sampler for text generation

  3. Generate responses for test set questions

  4. Compute evaluation metrics

Evaluation Metrics

The framework computes:

  • Accuracy: Percentage of exactly correct answers

  • Partial Accuracy: Answers within 10% of correct value

  • Format Accuracy: Responses matching expected format

Example output:

Evaluation Results
==================
Correct: 125/500
Accuracy: 25.00%
Partial Accuracy: 45.20%
Format Accuracy: 78.50%

Configuration

Evaluation settings are in conf/evaluation/.

Key parameters:

checkpoint_dir: ./checkpoints/ckpts/    # Model checkpoint directory
step: null                              # Checkpoint step (null for latest)
inference_config: greedy                # Inference strategy
num_passes: 1                           # Generations per question
verbose: true                           # Show progress

Override from command line:

# Use specific checkpoint step
python evaluate.py step=500

# Different checkpoint directory
python evaluate.py checkpoint_dir=/path/to/checkpoints/

Inference Strategies

Three predefined inference configurations:

Greedy

Deterministic generation, always choose highest probability token:

python evaluate.py inference_config=greedy

Temperature: 1e-4, top_k: 1, top_p: 1.0

Standard

Balanced sampling with reasonable diversity:

python evaluate.py inference_config=standard

Temperature: 0.7, top_k: 50, top_p: 0.95

Liberal

More diverse, creative responses:

python evaluate.py inference_config=liberal

Temperature: 0.85, top_k: 2000, top_p: 1.0

Multiple Passes

Run multiple generation passes per question:

python evaluate.py num_passes=3

Useful for:

  • Understanding model consistency

  • Finding best response from multiple attempts

  • Estimating uncertainty

Checkpoint Selection

Latest checkpoint (default):

python evaluate.py

Specific step:

python evaluate.py step=1000

Custom directory:

python evaluate.py checkpoint_dir=./custom/checkpoints/

Finding Checkpoint Steps

List available checkpoints:

ls -la checkpoints/ckpts/actor/

Output shows directories like:

0/
50/
100/
150/
...

These correspond to training steps.

Advanced Configuration

Custom evaluation configuration in conf/evaluation/custom.yaml:

checkpoint_dir: ./checkpoints/ckpts/
step: 500
inference_config: greedy
num_passes: 5
verbose: true

Use it:

python evaluate.py --config custom

Batch Evaluation

Evaluate multiple checkpoints:

for step in 100 200 300 400 500; do
    python evaluate.py step=$step >> results.txt
done

Or with hyperparameter sweep:

python evaluate.py --multirun step=100,200,300,400,500

Interpreting Results

High Accuracy, Low Format Accuracy

Model produces correct answers but in wrong format. Check reward function calibration.

Low Accuracy, High Format Accuracy

Model follows format but answers are incorrect. May need:

  • More training data

  • Better reward signal

  • Longer training

Low Accuracy, Low Format Accuracy

Fundamental training issue. Check:

  • Data quality

  • Model size adequacy

  • Training configuration

  • Learning rate

Troubleshooting Evaluation

No checkpoint found

Ensure training has completed and checkpoints exist:

ls -la checkpoints/ckpts/actor/

CUDA out of memory during eval

Reduce batch size or model size:

python evaluate.py training.micro_batch_size=1

Evaluation takes too long

  • Use greedy inference (faster)

  • Reduce evaluation set size

  • Use fewer test batches

Metric discrepancies

Ensure using same inference configuration as training:

python evaluate.py inference_config=greedy

Example Evaluation Workflow

1. Evaluate specific checkpoint:

python evaluate.py step=500

2. Try different inference strategies:

for config in greedy standard liberal; do
    echo "=== $config ==="
    python evaluate.py inference_config=$config
done

3. Multiple passes for uncertainty:

python evaluate.py num_passes=5

4. Compare checkpoints:

python evaluate.py --multirun step=100,200,300,400,500

Programmatic Evaluation

Use evaluation functions directly:

from agent_tunix.evaluate import evaluate, create_sampler, evaluate_with_config

# Create sampler
sampler = create_sampler(model, tokenizer, model_config, 256, 512)

# Evaluate with predefined config
results = evaluate_with_config(test_dataset, sampler, "greedy")

print(f"Accuracy: {results['accuracy']:.2f}%")

See API Reference for details.

Next Steps