Troubleshooting
Common issues and solutions for training and evaluation.
Installation Issues
CUDA Not Found
Error:
RuntimeError: CUDA is not available
Solution:
# Verify CUDA installation
python -m agent_tunix.utils check-gpu
# Check NVIDIA drivers
nvidia-smi
# Install CUDA if missing (see installation guide)
JAX Backend Issues
Error:
ModuleNotFoundError: No module named 'jax'
Solution:
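# Reinstall JAX with CUDA 11 support (older JAX releases; quote the extra so the shell does not expand the brackets)
pip install "jax[cuda11_cudnn82]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
# Or with CUDA 12 (current JAX releases)
pip install -U "jax[cuda12]"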
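After reinstalling, it can help to confirm which backend JAX actually selected. The following is a minimal sketch, not part of the project; the helper name `describe_jax_backend` is hypothetical, and it relies only on `jax.devices()` and `jax.default_backend()`.

```python
import importlib.util


def describe_jax_backend() -> str:
    """Return a one-line summary of the JAX install and the devices it can see."""
    if importlib.util.find_spec("jax") is None:
        return "jax is not installed in this environment"
    import jax  # imported lazily so a missing install is reported, not raised

    devices = ", ".join(str(d) for d in jax.devices())
    return f"jax {jax.__version__} | backend={jax.default_backend()} | devices=[{devices}]"


print(describe_jax_backend())
```

If the reported backend is `cpu` on a GPU machine, the CUDA-enabled wheels were most likely not installed.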
GPU Out of Memory on Import
Error:
RuntimeError: CUDA out of memory
Solution:
# Set XLA to only allocate needed memory
export XLA_PYTHON_CLIENT_PREALLOCATE=false
# Or force the CPU backend to confirm the failure is related to GPU memory
export JAX_PLATFORM_NAME=cpu # Use CPU for testing
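The same settings can be applied from Python, provided this happens before the first `import jax`. This is a sketch under that assumption; `configure_xla_memory` is a hypothetical helper, while `XLA_PYTHON_CLIENT_PREALLOCATE` and `XLA_PYTHON_CLIENT_MEM_FRACTION` are the environment variables JAX reads for GPU memory allocation.

```python
import os
from typing import Optional


def configure_xla_memory(preallocate: bool = False,
                         mem_fraction: Optional[float] = None) -> dict:
    """Set XLA client memory variables; call this before the first `import jax`."""
    settings = {"XLA_PYTHON_CLIENT_PREALLOCATE": "true" if preallocate else "false"}
    if mem_fraction is not None:
        if not 0.0 < mem_fraction <= 1.0:
            raise ValueError("mem_fraction must be in (0, 1]")
        # The fraction only takes effect when preallocation is enabled.
        settings["XLA_PYTHON_CLIENT_MEM_FRACTION"] = f"{mem_fraction:.2f}"
    os.environ.update(settings)
    return settings


applied = configure_xla_memory(preallocate=False)
print(applied)
```

Calling `configure_xla_memory(preallocate=True, mem_fraction=0.5)` instead keeps preallocation but caps it at half of the GPU memory.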
Training Issues
Training Starts But Stops Immediately
Error:
ValueError: Data loading failed
Solution:
1. Check data availability (gsm8k requires a config name such as 'main'):
python -c "from datasets import load_dataset; load_dataset('gsm8k', 'main')"
2. Verify tokenizer exists:
python run_training.py --cfg job | grep tokenizer
3. Check configuration:
python run_training.py --cfg job
Loss is NaN
Symptoms:
loss: nan
Training crashes after a few steps
Causes and solutions:
Learning rate too high:
python run_training.py optimizer.learning_rate=1e-7
Gradient overflow:
python run_training.py optimizer.max_grad_norm=0.01
Batch size too large:
python run_training.py training.micro_batch_size=1
Bad data example:
# Check and clean data
python -c "from agent_tunix.data import load_dataset; ds, _ = load_dataset(); print(ds[0])"
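To make the gradient-overflow case concrete, here is a minimal pure-Python sketch of global-norm clipping. It assumes `optimizer.max_grad_norm` refers to this kind of clipping, which the source does not confirm; `clip_by_global_norm` below is an illustrative function, not the project's implementation.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if not math.isfinite(norm):
        # A NaN or inf norm cannot be repaired by clipping; the batch itself is suspect.
        raise FloatingPointError("gradient norm is NaN or inf; inspect the batch that produced it")
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads], norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Clipping bounds large but finite gradients. Once a NaN appears, lowering the clip threshold does not help, which is why checking the data is listed as a separate step.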
Loss Not Decreasing
Symptoms:
Loss stays constant or increases
Causes and solutions:
Learning rate too low:
python run_training.py optimizer.learning_rate=1e-4
Model not training (frozen weights):
# Check if parameters are trainable (cfg is your loaded configuration)
python -c "from agent_tunix.train import create_model; model = create_model(cfg); print(model.trainable_params())"
Data too small:
# Use more training data or reduce num_batches
python run_training.py training.num_batches=10000
LoRA not properly configured:
python run_training.py model.lora_rank=64
Memory Error During Training
Error:
RuntimeError: CUDA out of memory
Solutions (in order of impact):
Reduce batch size:
python run_training.py training.micro_batch_size=1
Reduce model size:
python run_training.py model=gemma3_270m
Reduce LoRA rank:
python run_training.py model.lora_rank=8
Reduce sequence length:
python run_training.py \
  generation.max_prompt_length=128 \
  generation.max_generation_steps=256
Reduce generations per prompt:
python run_training.py grpo.num_generations=2
Use gradient accumulation (keeps the effective batch size when paired with a smaller micro-batch):
python run_training.py training.gradient_accumulation_steps=4
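The overrides above interact, so a quick calculation can show what a given combination changes. This sketch rests on two assumptions that are not confirmed by the source: each prompt is expanded into `grpo.num_generations` sequences, and accumulation multiplies the number of prompts per optimizer update. `batch_plan` is a hypothetical helper.

```python
def batch_plan(micro_batch_size, gradient_accumulation_steps, num_generations=1):
    """Summarize how the memory-related overrides trade off against each other."""
    return {
        # Sequences held on the device for one forward/backward pass (drives peak memory).
        "sequences_per_pass": micro_batch_size * num_generations,
        # Prompts contributing to each optimizer update (drives gradient quality).
        "prompts_per_update": micro_batch_size * gradient_accumulation_steps,
    }


before = batch_plan(micro_batch_size=4, gradient_accumulation_steps=1, num_generations=4)
after = batch_plan(micro_batch_size=1, gradient_accumulation_steps=4, num_generations=2)
print(before)
print(after)
```

Under these assumptions, the second setting holds far fewer sequences in memory per pass while each update still sees the same number of prompts.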
Slow Training
Check GPU utilization:
# Monitor in another terminal
watch -n 1 nvidia-smi
Solutions if utilization is low:
Data loading bottleneck:
# Increase number of data loading workers
python run_training.py training.num_workers=8
Model too small:
python run_training.py model=gemma3_1b
Check for CPU bottleneck:
# Monitor CPU usage
top
I/O bottleneck:
# Move data to faster storage (SSD)
# Or use memory-mapped datasets
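One way to separate a data-loading bottleneck from a compute bottleneck is to iterate over the input pipeline on its own and time it. The sketch below is generic; `measure_throughput` and `slow_batches` are hypothetical names, and `slow_batches` merely stands in for the real data iterator.

```python
import time


def measure_throughput(batches, max_batches=50):
    """Iterate over `batches` without training and return (count, batches per second)."""
    start = time.perf_counter()
    count = 0
    for _ in batches:
        count += 1
        if count >= max_batches:
            break
    elapsed = time.perf_counter() - start
    return count, (count / elapsed if elapsed > 0 else float("inf"))


def slow_batches(n):
    """Stand-in for a real data iterator: each batch takes about 1 ms to produce."""
    for i in range(n):
        time.sleep(0.001)
        yield i


count, rate = measure_throughput(slow_batches(20))
print(f"{count} batches at {rate:.0f} batches/s")
```

If the pipeline alone delivers fewer batches per second than the training loop consumes, more workers or faster storage are the likely fixes.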
Checkpoints Not Being Saved
Error:
No checkpoint directory created
Solutions:
Check checkpoint directory permissions:
ls -la checkpoints/ckpts/
Verify checkpoint configuration:
python run_training.py --cfg job | grep checkpoint
Create directory if missing:
mkdir -p checkpoints/ckpts/
Use absolute path:
python run_training.py checkpoint_dir=/path/to/checkpoints/ckpts/
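The directory checks above can also be scripted. This is a minimal sketch using only the standard library; `check_checkpoint_dir` is a hypothetical helper and the path is the one used in this section.

```python
import os
from pathlib import Path


def check_checkpoint_dir(path):
    """Report whether a checkpoint directory exists and is writable."""
    p = Path(path).expanduser().resolve()
    exists = p.is_dir()
    return {
        "path": str(p),  # absolute path, useful when relative paths resolve unexpectedly
        "exists": exists,
        "writable": exists and os.access(p, os.W_OK),
    }


print(check_checkpoint_dir("checkpoints/ckpts/"))
```

Printing the resolved absolute path is deliberate: a relative `checkpoint_dir` may resolve against a different working directory than expected.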
Model Not Improving on Validation
Symptoms:
Validation accuracy flat
Training loss decreases but validation stagnates
Causes and solutions:
Overfitting:
# Add validation set diversity
# Reduce model capacity
python run_training.py model.lora_rank=8
Wrong reward signal:
# Check reward function
python -c "
from agent_tunix.rewards import check_answer
print(check_answer('The answer is 4.', '4'))
"
Validation set too small:
# Use larger validation set
# Or fewer validation steps
python run_training.py training.eval_interval_steps=1000
Distribution mismatch:
# Ensure test set matches training data
# Or fine-tune on test distribution
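A single spot check of the reward function can miss systematic errors, so a small table of cases may be more informative. In the sketch below, `run_reward_cases` is a hypothetical harness and `toy_check_answer` is a stand-in written for illustration; the real `check_answer` may have a different return type, which is why results are compared by truthiness.

```python
def run_reward_cases(check_fn, cases):
    """Return the cases where check_fn(response, gold) disagrees with the expected result."""
    failures = []
    for response, gold, expected in cases:
        got = check_fn(response, gold)
        if bool(got) != expected:
            failures.append((response, gold, expected, got))
    return failures


def toy_check_answer(response, gold):
    """Hypothetical stand-in: compares the last word of the response with the gold answer."""
    return response.rstrip(". ").split()[-1] == gold


cases = [
    ("The answer is 4.", "4", True),
    ("The answer is 5.", "4", False),
    ("I think it is 14", "4", False),  # guards against substring matching
]
print(run_reward_cases(toy_check_answer, cases))
```

Passing the project's own reward function as `check_fn` would show whether it rewards wrong answers or rejects correct ones.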
Evaluation Issues
No Checkpoints Found
Error:
No checkpoint found in: checkpoints/ckpts/actor/
Solutions:
Verify training completed:
ls -la outputs/tunix-grpo/*/
Check for checkpoints:
find . -name "*actor*" -type d
Use absolute path:
python evaluate.py checkpoint_dir=/absolute/path/to/checkpoints/ckpts/
Train if not done:
python run_training.py +experiment=quick_test
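To see which checkpoint steps exist, the actor directory can be listed programmatically. This sketch assumes checkpoints are stored as numerically named step subdirectories, which is a common layout but is not confirmed by the source; `list_checkpoint_steps` is a hypothetical helper.

```python
from pathlib import Path


def list_checkpoint_steps(actor_dir):
    """Return the sorted step numbers found as numeric subdirectories of actor_dir."""
    root = Path(actor_dir)
    if not root.is_dir():
        return []
    return sorted(int(p.name) for p in root.iterdir() if p.is_dir() and p.name.isdigit())


steps = list_checkpoint_steps("checkpoints/ckpts/actor")
print("latest step:", steps[-1] if steps else "none found")
```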
CUDA Memory During Evaluation
Error:
RuntimeError: CUDA out of memory during evaluation
Solutions:
Reduce batch size:
python evaluate.py training.micro_batch_size=1
Reduce sequence length:
python evaluate.py generation.max_generation_steps=256
Use CPU:
python evaluate.py device=cpu
Evaluation Takes Too Long
Solutions:
Use greedy decoding (faster):
python evaluate.py inference_config=greedy
Reduce evaluation samples:
python evaluate.py evaluation.num_samples=100
Reduce number of passes:
python evaluate.py evaluation.num_passes=1
Use smaller model checkpoint:
python evaluate.py step=100
Metric Results Don’t Match Training
Causes:
Different inference config:
# Use same as training
python evaluate.py inference_config=greedy
Different checkpoint:
# Use latest
python evaluate.py
Different data:
# Ensure same dataset
python evaluate.py --cfg job | grep dataset
Randomness:
# Set seed
python evaluate.py seed=42
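When several of these causes are possible at once, comparing the resolved training and evaluation configurations key by key narrows things down. The following is a sketch on plain nested dictionaries; the keys and values in the example are hypothetical, and `flatten` and `diff_configs` are illustrative helpers.

```python
def flatten(cfg, prefix=""):
    """Flatten a nested dict into dotted keys, e.g. {'a': {'b': 1}} becomes {'a.b': 1}."""
    flat = {}
    for key, value in cfg.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, dotted + "."))
        else:
            flat[dotted] = value
    return flat


def diff_configs(train_cfg, eval_cfg):
    """Return {dotted_key: (train_value, eval_value)} for every key that differs."""
    a, b = flatten(train_cfg), flatten(eval_cfg)
    return {k: (a.get(k), b.get(k)) for k in sorted(a.keys() | b.keys()) if a.get(k) != b.get(k)}


# Hypothetical values for illustration only.
train_cfg = {"seed": 42, "generation": {"temperature": 0.9, "max_generation_steps": 256}}
eval_cfg = {"seed": 42, "generation": {"temperature": 0.0, "max_generation_steps": 256}}
print(diff_configs(train_cfg, eval_cfg))
```

The dictionaries could be obtained by loading the output of `--cfg job --resolve` for each script.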
Configuration Issues
Invalid Configuration
Error:
ConfigError: Could not find 'model/custom.yaml'
Solutions:
List available configs:
# Hydra's --help output lists the configuration groups and their options
python run_training.py --help | grep model
Check file exists:
ls conf/model/
Use default if custom missing:
python run_training.py model=gemma3_1b
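The available options can also be read straight from the config directory. This sketch assumes the layout implied above, with one subdirectory per group under `conf/` and one YAML file per option; `list_config_groups` is a hypothetical helper.

```python
from pathlib import Path


def list_config_groups(conf_dir="conf"):
    """Map each subdirectory of conf_dir to the YAML option names it contains."""
    root = Path(conf_dir)
    if not root.is_dir():
        return {}
    groups = {}
    for group in sorted(p for p in root.iterdir() if p.is_dir()):
        options = sorted(f.stem for f in group.glob("*.yaml"))
        if options:
            groups[group.name] = options
    return groups


for name, options in list_config_groups("conf").items():
    print(f"{name}: {', '.join(options)}")
```

A value such as `model=custom` is only valid if `custom` appears in the options printed for the `model` group.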
Conflicting Overrides
Error:
ConfigCompositionException: Could not override
Solutions:
Check config hierarchy:
python run_training.py --info defaults-tree
Use correct path:
# Correct
python run_training.py optimizer.learning_rate=1e-5
# Wrong
python run_training.py learning_rate=1e-5
Use force override if needed (++ adds the key if it is missing, otherwise overrides it):
python run_training.py ++optimizer.new_param=value
Missing Configuration Group
Error:
ConfigCompositionException: Could not load group
Solutions:
List defaults:
# Hydra's --help output lists the configuration groups and their options
python run_training.py --help
Create missing config:
touch conf/scheduler/custom.yaml
Update config.yaml defaults:
# conf/config.yaml
defaults:
  - scheduler: custom
Distributed Training Issues
NCCL Errors
Error:
RuntimeError: NCCL operation failed
Solutions:
Check GPU connectivity:
nvidia-smi -L
Enable NCCL debugging:
export NCCL_DEBUG=INFO
Increase timeout:
export NCCL_P2P_CONNECT_TIMEOUT=300
Use single GPU for testing:
# Quote the override so the shell does not strip the inner quotes or expand the brackets
python run_training.py 'model.mesh_shape=[[1,1],["fsdp","tp"]]'
Device Mismatch
Error:
RuntimeError: Devices are not homogeneous
Causes:
Different GPU types in cluster
Different compute capabilities
Solution:
Use same GPU type across all nodes
Or use compatible GPUs
Communication Timeout
Error:
TimeoutError: Communication timed out
Solutions:
Check network:
ping <other-node>
Increase timeout:
export NCCL_P2P_CONNECT_TIMEOUT=600
Pin NCCL to a specific network interface:
export NCCL_SOCKET_IFNAME=eth0 # Specific network interface
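The interface name `eth0` is only an example; the names available on a node can be listed with the standard library. This is a minimal sketch, and `list_network_interfaces` is a hypothetical helper built on `socket.if_nameindex()`.

```python
import socket


def list_network_interfaces():
    """Return the interface names known to the OS, as candidates for NCCL_SOCKET_IFNAME."""
    return [name for _index, name in socket.if_nameindex()]


print(list_network_interfaces())
```

Run it on every node, since interface names can differ between machines in the same cluster.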
Uneven GPU Utilization
Issue: Some GPUs finish faster than others
Solutions:
Check loads:
# p = power/temperature, u = utilization, m = memory
nvidia-smi dmon -s pum
Adjust batch size:
python run_training.py training.micro_batch_size=8
Balance data distribution:
python run_training.py training.data_seed=42
Debugging Tips
Enable Verbose Logging
python run_training.py training.log_level=DEBUG
Profile Training
python run_training.py training.profile=true
Check profile output in logs.
Inspect Configuration
python run_training.py --cfg job --resolve
Shows all interpolations resolved.
Dry Run Test
python run_training.py +experiment=quick_test --dry-run
Validates configuration without training.
Check Versions
python -c "
import jax
import flax
import transformers
import hydra
print(f'JAX: {jax.__version__}')
print(f'Flax: {flax.__version__}')
print(f'Transformers: {transformers.__version__}')
print(f'Hydra: {hydra.__version__}')
"
Getting Help
When reporting issues, include:
Complete error message and traceback
Configuration used (python run_training.py --cfg job)
GPU information (nvidia-smi)
Environment info (Python version, package versions)
Minimal reproducible example
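The environment details for a report can be gathered without importing the packages, which is useful when an import itself is what fails. The sketch below uses `importlib.metadata`; `collect_versions` is a hypothetical helper, and the package list is an assumption based on the libraries named in this guide (Hydra is published on PyPI as `hydra-core`).

```python
import platform
from importlib import metadata

# Distribution names as published on PyPI.
PACKAGES = ["jax", "jaxlib", "flax", "transformers", "hydra-core"]


def collect_versions(packages=PACKAGES):
    """Return {package: version or 'not installed'} without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


print(f"Python: {platform.python_version()}")
for name, version in collect_versions().items():
    print(f"{name}: {version}")
```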
Common Patterns
Testing Fix Before Full Run
# Test configuration and data loading
python run_training.py +experiment=quick_test --dry-run
# Run 10 steps to verify
python run_training.py +experiment=quick_test
# If successful, run full training
python run_training.py +experiment=full_training
Incremental Memory Reduction
# Start here
python run_training.py
# If OOM, reduce batch size
python run_training.py training.micro_batch_size=2
# If still OOM, use smaller model
python run_training.py model=gemma3_270m training.micro_batch_size=1
# If still OOM, reduce LoRA rank
python run_training.py model=gemma3_270m model.lora_rank=8 training.micro_batch_size=1
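The ladder above can be automated by trying each set of overrides in order until one run succeeds. This is a sketch, not a project utility: `run_training` and `first_working_overrides` are hypothetical helpers, and any non-zero exit code is treated as a reason to step down, not only out-of-memory errors.

```python
import subprocess
import sys

# The override ladder from the steps above, least to most aggressive.
FALLBACKS = [
    [],
    ["training.micro_batch_size=2"],
    ["model=gemma3_270m", "training.micro_batch_size=1"],
    ["model=gemma3_270m", "model.lora_rank=8", "training.micro_batch_size=1"],
]


def run_training(overrides):
    """Run one attempt; returns True when the process exits with status 0."""
    cmd = [sys.executable, "run_training.py", *overrides]
    return subprocess.run(cmd).returncode == 0


def first_working_overrides(fallbacks=FALLBACKS, runner=run_training):
    """Try each override set in order and return the first one that succeeds, else None."""
    for overrides in fallbacks:
        if runner(overrides):
            return overrides
    return None
```

Calling `first_working_overrides()` launches the real runs. Passing a different `runner`, for example one that adds `+experiment=quick_test`, keeps each attempt short while searching for a configuration that fits in memory.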
Next Steps
Training Guide
Training API
Configuration Guide - Configuration reference