Evaluation Guide

Model Evaluation

Evaluate a trained model on the test set:

python evaluate.py

This will:

Load the trained model with LoRA weights
Create a sampler for text generation
Generate responses for test set questions
Compute evaluation metrics

Evaluation Metrics

The framework computes:

Accuracy: Percentage of exactly correct answers
Partial Accuracy: Answers within 10% of correct value
Format Accuracy: Responses matching expected format
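
The three metrics above can be sketched in a few lines of Python. This is an illustrative reading, not the framework's implementation: the <answer>...</answer> format, the score_responses name, and the interpretation of "within 10%" as relative error against the target are all assumptions made for the example.

```python
import re

# Hypothetical expected format: the final answer wrapped in <answer>...</answer>.
ANSWER_RE = re.compile(r"<answer>\s*(-?\d+(?:\.\d+)?)\s*</answer>")


def score_responses(responses, targets, tolerance=0.10):
    """Return accuracy, partial accuracy and format accuracy as percentages."""
    exact = partial = well_formed = 0
    for response, target in zip(responses, targets):
        match = ANSWER_RE.search(response)
        if match is None:
            continue  # Unparseable responses count against all three metrics.
        well_formed += 1
        value = float(match.group(1))
        if value == target:
            exact += 1
        # Assumption: "within 10%" means relative error against the target.
        if abs(value - target) <= tolerance * abs(target):
            partial += 1
    total = len(targets)
    return {
        "accuracy": 100.0 * exact / total,
        "partial_accuracy": 100.0 * partial / total,
        "format_accuracy": 100.0 * well_formed / total,
    }


scores = score_responses(
    ["<answer>42</answer>", "<answer>45</answer>", "the answer is 7", "<answer>10</answer>"],
    [42.0, 44.0, 7.0, 20.0],
)
print(scores)
```

In this toy run one answer is exact, a second is close enough to count as partial, and a third is unparseable, so the three metrics come out different from one another.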

Example output:

Evaluation Results
==================
Correct: 125/500
Accuracy: 25.00%
Partial Accuracy: 45.20%
Format Accuracy: 78.50%

Configuration

Evaluation settings are in conf/evaluation/.

Key parameters:

checkpoint_dir: ./checkpoints/ckpts/    # Model checkpoint directory
step: null                              # Checkpoint step (null for latest)
inference_config: greedy                # Inference strategy
num_passes: 1                           # Generations per question
verbose: true                           # Show progress

Override from command line:

# Use specific checkpoint step
python evaluate.py step=500

# Different checkpoint directory
python evaluate.py checkpoint_dir=/path/to/checkpoints/

Inference Strategies

Three predefined inference configurations:

Greedy

Effectively deterministic generation that always chooses the highest-probability token:

python evaluate.py inference_config=greedy

Temperature: 1e-4, top_k: 1, top_p: 1.0

Standard

Balanced sampling with reasonable diversity:

python evaluate.py inference_config=standard

Temperature: 0.7, top_k: 50, top_p: 0.95

Liberal

More diverse, creative responses:

python evaluate.py inference_config=liberal

Temperature: 0.85, top_k: 2000, top_p: 1.0
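
To see how the three parameters interact, here is a minimal standard-library sketch of temperature, top-k and top-p sampling over a toy logit vector. The sample_token function is hypothetical and only illustrates the general technique; the framework's sampler may order or implement these filters differently.

```python
import math
import random


def sample_token(logits, temperature, top_k, top_p, rng):
    """Sample one token index using temperature, top-k and top-p filtering."""
    # Temperature scaling: small values sharpen the distribution toward argmax.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # Subtract the max for stability.
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # random.choices renormalises the remaining weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0]
greedy = [sample_token(logits, 1e-4, 1, 1.0, rng) for _ in range(5)]
standard = [sample_token(logits, 0.7, 50, 0.95, rng) for _ in range(5)]
print(greedy)    # Always index 0, the highest logit.
print(standard)  # May vary across tokens that survive the top-p cut.
```

With the greedy settings, top_k of 1 alone already forces the argmax; the tiny temperature makes the same choice near-certain even without the top-k filter.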

Multiple Passes

Run multiple generation passes per question:

python evaluate.py num_passes=3

Useful for:

Understanding model consistency
Finding best response from multiple attempts
Estimating uncertainty
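
As a rough illustration of these three uses, the sketch below summarises several generations for a single question. The summarize_passes helper and its output keys are hypothetical and are not part of the framework; it assumes answers have already been extracted as comparable strings.

```python
from collections import Counter


def summarize_passes(answers, target):
    """Summarise several generations for one question."""
    counts = Counter(answers)
    majority_answer, majority_votes = counts.most_common(1)[0]
    return {
        # Fraction of passes that agree with the most common answer.
        "consistency": majority_votes / len(answers),
        # Whether the most common answer is the target.
        "majority_correct": majority_answer == target,
        # True if at least one pass produced the target ("best of n").
        "any_correct": target in answers,
        # Fraction of passes that produced the target.
        "pass_rate": answers.count(target) / len(answers),
    }


summary = summarize_passes(["12", "12", "15"], target="12")
print(summary)
```

A low consistency value with a high any_correct rate suggests the model can reach the answer but does so unreliably. Note that with greedy inference every pass should produce nearly identical output, so multiple passes are mainly informative with the standard or liberal configurations.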

Checkpoint Selection

Latest checkpoint (default):

python evaluate.py

Specific step:

python evaluate.py step=1000

Custom directory:

python evaluate.py checkpoint_dir=./custom/checkpoints/

Finding Checkpoint Steps

List available checkpoints:

ls -la checkpoints/ckpts/actor/

Output shows directories like:

0/
50/
100/
150/
...

These correspond to training steps.
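
The same lookup can be done from Python, which is convenient when scripting evaluations. This is a small sketch that assumes the directory layout shown above; list_checkpoint_steps is a hypothetical helper, and treating the largest step as the "latest" checkpoint is an assumption about how step: null is resolved.

```python
from pathlib import Path


def list_checkpoint_steps(actor_dir):
    """Return the numeric step directories under actor_dir, sorted ascending."""
    root = Path(actor_dir)
    if not root.is_dir():
        return []
    return sorted(
        int(entry.name)
        for entry in root.iterdir()
        # Keep only directories whose names are plain ASCII digits.
        if entry.is_dir() and entry.name.isascii() and entry.name.isdigit()
    )


steps = list_checkpoint_steps("checkpoints/ckpts/actor")
print(steps[-1] if steps else "no checkpoints found")
```

Sorting as integers avoids the lexical ordering that a plain directory listing produces, where 100 sorts before 50.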

Advanced Configuration

Create a custom evaluation configuration in conf/evaluation/custom.yaml:

checkpoint_dir: ./checkpoints/ckpts/
step: 500
inference_config: greedy
num_passes: 5
verbose: true

Use it:

python evaluate.py evaluation=custom

Batch Evaluation

Evaluate multiple checkpoints:

for step in 100 200 300 400 500; do
    python evaluate.py step=$step >> results.txt
done

Or as a single multirun sweep over checkpoint steps:

python evaluate.py --multirun step=100,200,300,400,500
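
Once several reports have been appended to results.txt, they can be compared with a short script. The sketch below assumes each run prints a report in the format shown under "Example output" and that reports appear in the same order as the steps; parse_reports is a hypothetical helper, and the second report in the sample text is made up for illustration.

```python
import re

METRIC_RE = re.compile(
    r"^(Accuracy|Partial Accuracy|Format Accuracy):\s*([\d.]+)%\s*$",
    re.MULTILINE,
)


def parse_reports(text):
    """Split concatenated reports and return one metrics dict per report."""
    reports = []
    for chunk in text.split("Evaluation Results")[1:]:
        metrics = {name: float(value) for name, value in METRIC_RE.findall(chunk)}
        if metrics:
            reports.append(metrics)
    return reports


# Illustrative text only; in practice read it with open("results.txt").read().
sample = """Evaluation Results
==================
Correct: 125/500
Accuracy: 25.00%
Partial Accuracy: 45.20%
Format Accuracy: 78.50%
Evaluation Results
==================
Correct: 150/500
Accuracy: 30.00%
Partial Accuracy: 50.00%
Format Accuracy: 80.00%
"""

steps = [100, 200]  # Must match the order in which the runs were appended.
reports = parse_reports(sample)
best_step, best = max(zip(steps, reports), key=lambda pair: pair[1]["Accuracy"])
print(best_step, best["Accuracy"])
```

If verbose progress output is interleaved with the reports, the pattern still matches because it only looks at lines that consist of a metric name and a percentage.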

Interpreting Results

High Accuracy, Low Format Accuracy

Model produces correct answers but in the wrong format. Check that the reward function gives enough weight to format compliance.

Low Accuracy, High Format Accuracy

Model follows format but answers are incorrect. May need:

More training data
Better reward signal
Longer training

Low Accuracy, Low Format Accuracy

Fundamental training issue. Check:

Data quality
Model size adequacy
Training configuration
Learning rate

Troubleshooting Evaluation

No checkpoint found

Ensure training has completed and checkpoints exist:

ls -la checkpoints/ckpts/actor/

CUDA out of memory during eval

Reduce the batch size first; if that is not enough, evaluate a smaller model:

python evaluate.py training.micro_batch_size=1

Evaluation takes too long

Use greedy inference (faster)
Reduce evaluation set size
Use fewer test batches

Metric discrepancies

Ensure you are using the same inference configuration that was used during training:

python evaluate.py inference_config=greedy

Example Evaluation Workflow

1. Evaluate specific checkpoint:

python evaluate.py step=500

2. Try different inference strategies:

for config in greedy standard liberal; do
    echo "=== $config ==="
    python evaluate.py inference_config=$config
done

3. Multiple passes for uncertainty:

python evaluate.py num_passes=5

4. Compare checkpoints:

python evaluate.py --multirun step=100,200,300,400,500

Programmatic Evaluation

Use evaluation functions directly:

from agent_tunix.evaluate import evaluate, create_sampler, evaluate_with_config

# Create sampler
sampler = create_sampler(model, tokenizer, model_config, 256, 512)

# Evaluate with predefined config
results = evaluate_with_config(test_dataset, sampler, "greedy")

print(f"Accuracy: {results['accuracy']:.2f}%")

See API Reference for details.