Evaluation API
This module provides evaluation utilities for assessing model performance.
Main Entry Point
Evaluation Functions
Configuration Classes
Evaluation configuration structure:
evaluation:
checkpoint_dir: ./checkpoints/ckpts/
step: null # null for latest
inference_config: greedy
num_passes: 1
verbose: true
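For scripts that build this configuration in code, the same fields can be mirrored in a small dataclass. This is a minimal sketch: the class name `EvaluationConfig` and its validation rules are assumptions made for illustration, not the project's actual configuration class. Only the field names, defaults, and preset names are taken from the structure above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvaluationConfig:
    """Hypothetical mirror of the `evaluation` block (not the project's own class)."""

    checkpoint_dir: str = "./checkpoints/ckpts/"
    step: Optional[int] = None  # None (YAML null) means "use the latest checkpoint"
    inference_config: str = "greedy"
    num_passes: int = 1
    verbose: bool = True

    def __post_init__(self) -> None:
        # Preset names come from the documentation; the checks themselves are illustrative.
        if self.inference_config not in {"greedy", "standard", "liberal"}:
            raise ValueError(f"unknown inference_config: {self.inference_config!r}")
        if self.num_passes < 1:
            raise ValueError("num_passes must be at least 1")


cfg = EvaluationConfig(step=500, num_passes=5)
print(cfg)
```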
Inference Configurations
Three predefined inference strategies:
Greedy
Effectively deterministic; always selects the highest-probability token:
inference_config: greedy
# temperature: 1e-4
# top_k: 1
# top_p: 1.0
Best for: Reproducible results, benchmarking, production inference
Standard
Balanced sampling with moderate diversity:
inference_config: standard
# temperature: 0.7
# top_k: 50
# top_p: 0.95
Best for: Reasonable diversity while maintaining coherence
Liberal
High diversity, creative responses:
inference_config: liberal
# temperature: 0.85
# top_k: 2000
# top_p: 1.0
Best for: Exploring model capabilities, creative tasks
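To see how the three parameters interact, the sketch below applies temperature scaling, top-k, and top-p filtering to a toy list of logits. It illustrates the general sampling technique in plain Python and is not the framework's sampler. The function name `filtered_distribution` and the toy logits are hypothetical; the preset values are copied from the settings above.

```python
import math

# Preset values copied from the commented settings in this section.
PRESETS = {
    "greedy": {"temperature": 1e-4, "top_k": 1, "top_p": 1.0},
    "standard": {"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    "liberal": {"temperature": 0.85, "top_k": 2000, "top_p": 1.0},
}


def filtered_distribution(logits, temperature, top_k, top_p):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # toy values for a four-token vocabulary
for name, params in PRESETS.items():
    dist = filtered_distribution(logits, **params)
    print(name, {i: round(p, 3) for i, p in dist.items()})
```

With these toy logits, greedy keeps a single token, standard drops the least likely token through the top-p cutoff, and liberal keeps all four.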
Evaluation Metrics
The framework computes three main metrics:
Accuracy: Percentage of answers whose number exactly matches the expected output
Partial Accuracy: Percentage of answers within 10% of the correct value
Format Accuracy: Percentage of responses matching expected format structure
Example output:
Evaluation Results
==================
Correct: 125/500
Accuracy: 25.00%
Partial Accuracy: 45.20%
Format Accuracy: 78.50%
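The sketch below shows one way these three percentages could be computed from per-question records. It rests on assumptions: the function name `compute_metrics` and the record layout are hypothetical, "within 10%" is read as relative error against the expected value, exact matches are counted as partial matches too, and the format check is left to the caller because the expected format is not specified here.

```python
def compute_metrics(records, tolerance=0.10):
    """Compute the three metrics from (predicted, expected, format_ok) records.

    `predicted` is a float, or None when no number could be extracted.
    `format_ok` is a bool supplied by the caller; the format check is not shown.
    """
    total = len(records)
    if total == 0:
        raise ValueError("no records to evaluate")

    correct = partial = formatted = 0
    for predicted, expected, format_ok in records:
        if format_ok:
            formatted += 1
        if predicted is None:
            continue
        if predicted == expected:
            correct += 1
        # Assumed reading of "within 10%": relative error against the expected value.
        if abs(predicted - expected) <= tolerance * abs(expected):
            partial += 1

    return {
        "correct": correct,
        "total": total,
        "accuracy": 100.0 * correct / total,
        "partial_accuracy": 100.0 * partial / total,
        "format_accuracy": 100.0 * formatted / total,
    }


records = [
    (42.0, 42.0, True),    # exact match
    (105.0, 100.0, True),  # within 10% but not exact
    (7.0, 20.0, True),     # well-formed but wrong
    (None, 3.0, False),    # no number extracted, format broken
]
results = compute_metrics(records)
print(f"Correct: {results['correct']}/{results['total']}")
print(f"Accuracy: {results['accuracy']:.2f}%")
print(f"Partial Accuracy: {results['partial_accuracy']:.2f}%")
print(f"Format Accuracy: {results['format_accuracy']:.2f}%")
```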
Checkpoint Selection
Evaluate latest checkpoint:
python evaluate.py
Evaluate specific step:
python evaluate.py step=1000
Use custom checkpoint directory:
python evaluate.py checkpoint_dir=/path/to/checkpoints/
List available checkpoints:
ls -la checkpoints/ckpts/actor/
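The same lookup can be done from Python. The sketch below assumes that each saved step appears as a numerically named entry inside the actor directory; that layout, and the helper names `available_steps` and `resolve_step`, are assumptions for illustration. The demo runs against a temporary directory so it does not depend on real checkpoints.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def available_steps(actor_dir: str) -> List[int]:
    """Return sorted step numbers, assuming one numerically named entry per step."""
    root = Path(actor_dir)
    if not root.is_dir():
        return []
    return sorted(int(p.name) for p in root.iterdir() if p.name.isdigit())


def resolve_step(actor_dir: str, step: Optional[int] = None) -> int:
    """Mimic `step: null`: None selects the highest step found on disk."""
    steps = available_steps(actor_dir)
    if not steps:
        raise FileNotFoundError(f"no checkpoints found under {actor_dir}")
    if step is None:
        return steps[-1]
    if step not in steps:
        raise FileNotFoundError(f"step {step} not found; available: {steps}")
    return step


# Demo on a throwaway directory that imitates the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("100", "200", "1000", "notes"):
        (Path(tmp) / name).mkdir()
    steps_found = available_steps(tmp)
    latest = resolve_step(tmp)
    print(steps_found)  # [100, 200, 1000]
    print(latest)       # 1000
```

Sorting the names as integers matters here: a plain string sort would place 1000 before 200.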
Advanced Configuration
Create a custom evaluation config in conf/evaluation/custom.yaml:
checkpoint_dir: ./checkpoints/ckpts/
step: 500
inference_config: greedy
num_passes: 5
verbose: true
Use it:
python evaluate.py --config custom
Multiple Passes
Run multiple generation passes per question for uncertainty estimation:
python evaluate.py num_passes=3
Useful for:
Understanding model consistency
Finding best response from multiple attempts
Estimating confidence/uncertainty
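Once several answers per question are available, they can be summarized along these lines. The helpers below are a hypothetical sketch, not functions provided by the framework: `summarize_passes` takes a majority vote and reports the agreement rate as a rough consistency score, and `any_pass_correct` checks whether any attempt matched the expected answer.

```python
from collections import Counter


def summarize_passes(answers):
    """Summarize the answers from several passes on one question.

    Returns the majority answer, the fraction of passes that agree with it
    (a rough consistency score), and the number of distinct answers.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    majority, votes = counts.most_common(1)[0]
    return {
        "majority": majority,
        "agreement": votes / len(answers),
        "distinct": len(counts),
    }


def any_pass_correct(answers, expected):
    """True if at least one pass produced the expected answer (best of k)."""
    return any(a == expected for a in answers)


answers = ["12", "12", "15"]  # toy answers, as if collected with num_passes=3
summary = summarize_passes(answers)
print(summary)
print(any_pass_correct(answers, "12"))
```

Note that with greedy inference every pass should return nearly the same answer, so the agreement rate is mainly informative for the standard and liberal presets.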
Batch Evaluation
Evaluate multiple checkpoints:
for step in 100 200 300 400 500; do
    python evaluate.py step=$step >> results.txt
done
Or with Hydra sweeps:
python evaluate.py --multirun step=100,200,300,400,500
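To compare checkpoints after a batch run, the appended reports can be parsed back into numbers. The sketch below assumes that each report has the layout shown under Example output and that reports appear in the same order as the steps; `parse_results` is a hypothetical helper, and the second report in the sample text uses made-up values purely for illustration.

```python
import re

# Matches metric lines such as "Partial Accuracy: 45.20%".
METRIC_LINE = re.compile(r"^(Accuracy|Partial Accuracy|Format Accuracy):\s*([\d.]+)%$")


def parse_results(text):
    """Split concatenated reports into one dict of metrics per evaluation run."""
    runs, current = [], {}
    for line in text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if not match:
            continue
        name, value = match.group(1), float(match.group(2))
        if name in current:  # a repeated metric name means a new report started
            runs.append(current)
            current = {}
        current[name] = value
    if current:
        runs.append(current)
    return runs


# Sample text; the second report contains invented numbers.
sample = """Evaluation Results
==================
Correct: 125/500
Accuracy: 25.00%
Partial Accuracy: 45.20%
Format Accuracy: 78.50%
Evaluation Results
==================
Correct: 150/500
Accuracy: 30.00%
Partial Accuracy: 50.00%
Format Accuracy: 80.00%
"""

steps = [100, 200]  # same order as the shell loop that produced the file
runs = parse_results(sample)
for step, run in zip(steps, runs):
    print(step, run)

best_step, best_run = max(zip(steps, runs), key=lambda pair: pair[1]["Accuracy"])
print("best step:", best_step)
```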
Interpreting Results
High Accuracy, Low Format Accuracy
Model produces correct answers but in wrong format. May indicate:
Reward function not properly calibrated
Format specification unclear to model
Need for stricter format enforcement during training
Low Accuracy, High Format Accuracy
Model follows format but answers are incorrect. May need:
More training data
Better reward signal
Longer training duration
Different hyperparameters
Low Accuracy, Low Format Accuracy
Fundamental training issue. Check:
Data quality and completeness
Model size adequacy for task complexity
Training configuration correctness
Learning rate appropriateness
Sufficient training steps
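When scanning many runs, the three cases above can be flagged automatically. The sketch below is hypothetical: the `diagnose` helper and its 50% cutoff between "high" and "low" are arbitrary choices for illustration, not values defined by the framework.

```python
def diagnose(accuracy, format_accuracy, threshold=50.0):
    """Map a pair of metrics (in percent) to one of the cases described above.

    `threshold` is an arbitrary illustrative cutoff, not a framework setting.
    """
    high_acc = accuracy >= threshold
    high_fmt = format_accuracy >= threshold
    if high_acc and not high_fmt:
        return "High Accuracy, Low Format Accuracy"
    if not high_acc and high_fmt:
        return "Low Accuracy, High Format Accuracy"
    if not high_acc and not high_fmt:
        return "Low Accuracy, Low Format Accuracy"
    return "High Accuracy, High Format Accuracy"


label = diagnose(25.0, 78.5)
print(label)  # Low Accuracy, High Format Accuracy
```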
Troubleshooting
No checkpoint found
Verify training completed and checkpoints exist:
ls -la checkpoints/ckpts/actor/
CUDA out of memory
Reduce batch size:
python evaluate.py training.micro_batch_size=1
Evaluation takes too long
Use greedy inference (faster)
Reduce evaluation set size
Evaluate fewer test batches
Metric discrepancies
Ensure you are using the same inference configuration as was used during training:
python evaluate.py inference_config=greedy
Programmatic Usage
Use evaluation functions in custom scripts:
from agent_tunix.evaluate import create_sampler, evaluate_with_config
# Create sampler for generation
sampler = create_sampler(model, tokenizer, model_config, 256, 512)
# Evaluate with predefined config
results = evaluate_with_config(test_dataset, sampler, "greedy")
print(f"Accuracy: {results['accuracy']:.2f}%")
print(f"Partial Accuracy: {results['partial_accuracy']:.2f}%")
print(f"Format Accuracy: {results['format_accuracy']:.2f}%")
Example Evaluation Workflow
Evaluate latest checkpoint with greedy inference:
python evaluate.py
Compare different inference strategies:
for config in greedy standard liberal; do
    echo "=== $config ==="
    python evaluate.py inference_config=$config
done
Get uncertainty estimates with multiple passes:
python evaluate.py num_passes=5
Compare multiple checkpoints:
python evaluate.py --multirun step=100,200,300,400,500
Next Steps
Evaluation Guide - Detailed evaluation guide
Training API - Training API reference
Models API - Model architecture reference