Troubleshooting

Common issues and solutions for training and evaluation.

Installation Issues

CUDA Not Found

Error:

RuntimeError: CUDA is not available

Solution:

# Verify CUDA installation
python -m agent_tunix.utils check-gpu

# Check NVIDIA drivers
nvidia-smi

# Install CUDA if missing (see installation guide)
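
The same checks can be scripted when a shell session is not convenient. The following is a minimal sketch and not part of ``agent_tunix``: the helper name ``gpu_report`` is an assumption, and it relies only on ``shutil.which`` and, when JAX is importable, ``jax.devices()``.

```python
import shutil


def gpu_report():
    """Collect basic facts about GPU visibility (hypothetical helper)."""
    report = {
        # True when the NVIDIA driver tools are on PATH.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "jax_installed": False,
        "jax_devices": [],
    }
    try:
        import jax
    except ImportError:
        return report
    report["jax_installed"] = True
    try:
        # Lists the devices of JAX's default backend (GPU if one is usable).
        report["jax_devices"] = [str(d) for d in jax.devices()]
    except RuntimeError as exc:
        report["jax_devices"] = [f"backend error: {exc}"]
    return report


for key, value in gpu_report().items():
    print(f"{key}: {value}")
```

If ``nvidia_smi_found`` is true but ``jax_devices`` lists only CPU devices, the driver is probably fine and the JAX installation is the more likely culprit.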

JAX Backend Issues

Error:

ModuleNotFoundError: No module named 'jax'

Solution:

# Install JAX with CUDA 12 support (current JAX releases)
pip install -U "jax[cuda12]"

# Older JAX releases targeting CUDA 11 used versioned extras and the releases index
pip install "jax[cuda11_cudnn82]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

GPU Out of Memory on Import

Error:

RuntimeError: CUDA out of memory

Solution:

# Stop JAX from preallocating most of the GPU memory at startup
export XLA_PYTHON_CLIENT_PREALLOCATE=false

# Or fall back to CPU while testing
export JAX_PLATFORMS=cpu
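
When launching from Python rather than a shell, these variables must be set before JAX initializes its GPU backend, so the safest place is before the first ``import jax``. A minimal sketch: ``configure_xla_memory`` is a hypothetical helper name, while ``XLA_PYTHON_CLIENT_PREALLOCATE`` and ``XLA_PYTHON_CLIENT_MEM_FRACTION`` are standard JAX environment variables.

```python
import os


def configure_xla_memory(preallocate=False, mem_fraction=None):
    """Set JAX GPU memory variables; call before the first `import jax`."""
    os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "true" if preallocate else "false"
    if mem_fraction is not None:
        if not 0.0 < mem_fraction <= 1.0:
            raise ValueError("mem_fraction must be in (0, 1]")
        # Fraction of GPU memory JAX may preallocate when preallocation is on.
        os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = str(mem_fraction)
    return {k: v for k, v in os.environ.items() if k.startswith("XLA_PYTHON_CLIENT_")}


settings = configure_xla_memory(preallocate=False)
print(settings)
```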

Training Issues

Training Starts But Stops Immediately

Error:

ValueError: Data loading failed

Solution:

1. Check data availability::

    python -c "from datasets import load_dataset; load_dataset('gsm8k')"

2. Verify tokenizer exists::

    python run_training.py --cfg job | grep tokenizer

3. Check configuration::

    python run_training.py --cfg job

Loss is NaN

Symptoms:

loss: nan
Training crashes after a few steps

Causes and solutions:

  1. Learning rate too high:

    python run_training.py optimizer.learning_rate=1e-7
    
  2. Gradient overflow:

    python run_training.py optimizer.max_grad_norm=0.01
    
  3. Batch size too large:

    python run_training.py training.micro_batch_size=1
    
  4. Bad data example:

    # Check and clean data
    python -c "from agent_tunix.data import load_dataset; ds, _ = load_dataset(); print(ds[0])"
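
To see what a gradient-norm cap does, the sketch below clips a list of gradient values by their global norm in plain Python. It assumes that ``optimizer.max_grad_norm`` means global-norm clipping, which this guide does not confirm; ``clip_by_global_norm`` here is an illustrative function, not the project's implementation.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if not math.isfinite(norm):
        # A NaN or inf gradient cannot be repaired by clipping.
        raise ValueError("non-finite gradient norm")
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads], norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # approximately [0.6, 0.8]
```

Clipping cannot rescue a step whose gradients are already NaN, so a cap of this kind helps only if it is in place before the overflow occurs.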
    

Loss Not Decreasing

Symptoms:

Loss stays constant or increases

Causes and solutions:

  1. Learning rate too low:

    python run_training.py optimizer.learning_rate=1e-4
    
  2. Model not training (frozen weights):

    # Check whether parameters are trainable.
    # cfg is not defined in this one-liner; build it from your Hydra config first.
    python -c "from agent_tunix.train import create_model; model = create_model(cfg); print(model.trainable_params())"
    
  3. Data too small:

    # Use more training data, or adjust num_batches to match the data available
    python run_training.py training.num_batches=10000
    
  4. LoRA not properly configured:

    python run_training.py model.lora_rank=64
    

Memory Error During Training

Error:

RuntimeError: CUDA out of memory

Solutions (in order of impact):

  1. Reduce batch size:

    python run_training.py training.micro_batch_size=1
    
  2. Reduce model size:

    python run_training.py model=gemma3_270m
    
  3. Reduce LoRA rank:

    python run_training.py model.lora_rank=8
    
  4. Reduce sequence length:

    python run_training.py \
        generation.max_prompt_length=128 \
        generation.max_generation_steps=256
    
  5. Reduce generations per prompt:

    python run_training.py grpo.num_generations=2
    
  6. Use gradient accumulation:

    python run_training.py training.gradient_accumulation_steps=4
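
The settings above multiply together, which is why each one helps. A rough sketch of the arithmetic, assuming that every prompt in a micro-batch is expanded into ``grpo.num_generations`` sequences of up to ``max_prompt_length + max_generation_steps`` tokens. The function name and the "before" values are hypothetical, and real memory use also depends on model size and LoRA rank.

```python
def tokens_per_micro_step(micro_batch_size, num_generations,
                          max_prompt_length, max_generation_steps):
    """Upper bound on tokens held in one forward/backward pass."""
    sequences = micro_batch_size * num_generations
    return sequences * (max_prompt_length + max_generation_steps)


# Illustrative values only; the project's defaults are not stated here.
before = tokens_per_micro_step(4, 4, 256, 768)
after = tokens_per_micro_step(1, 2, 128, 256)
print(before, after)  # 16384 768
```

Gradient accumulation complements this: it lowers the per-pass figure while keeping the effective batch, ``micro_batch_size * gradient_accumulation_steps``, unchanged.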
    

Slow Training

Check GPU utilization:

# Monitor in another terminal
watch -n 1 nvidia-smi

Solutions if utilization is low:

  1. Data loading bottleneck:

    # Increase number of data loading workers
    python run_training.py training.num_workers=8
    
  2. Model too small:

    python run_training.py model=gemma3_1b
    
  3. Check for CPU bottleneck:

    # Monitor CPU usage
    top
    
  4. I/O bottleneck:

    # Move data to faster storage (SSD)
    # Or use memory-mapped datasets
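
For a record that can be logged alongside training, ``nvidia-smi`` can emit CSV through ``--query-gpu``. The sketch below parses that output; the function names, the sample rows, and the 50% threshold are assumptions chosen for illustration.

```python
# Produce the input with:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV rows into dicts of integers."""
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = (int(x) for x in line.split(","))
        rows.append({"index": index, "util_pct": util,
                     "mem_used_mib": used, "mem_total_mib": total})
    return rows


def underused(rows, threshold_pct=50):
    """Return indices of GPUs below the utilization threshold."""
    return [r["index"] for r in rows if r["util_pct"] < threshold_pct]


# Made-up sample output for two GPUs.
sample = "0, 97, 30000, 40960\n1, 12, 512, 40960\n"
print(underused(parse_gpu_csv(sample)))  # [1]
```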
    

Checkpoints Not Being Saved

Error:

No checkpoint directory created

Solutions:

  1. Check checkpoint directory permissions:

    ls -la checkpoints/ckpts/
    
  2. Verify checkpoint configuration:

    python run_training.py --cfg job | grep checkpoint
    
  3. Create directory if missing:

    mkdir -p checkpoints/ckpts/
    
  4. Use absolute path:

    python run_training.py checkpoint_dir=/path/to/checkpoints/ckpts/
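
Directory listings do not always reveal read-only mounts or quota problems, so it can help to attempt a real write. A minimal sketch with a hypothetical helper name; it creates the directory if it is missing, which mirrors the ``mkdir -p`` step above.

```python
import tempfile
from pathlib import Path


def check_writable_dir(path):
    """Create the directory if needed and confirm a file can be written in it."""
    directory = Path(path).expanduser().resolve()
    try:
        directory.mkdir(parents=True, exist_ok=True)
        # Writing a real temp file catches read-only mounts that mode bits can hide.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
    except OSError as exc:
        return directory, f"not writable: {exc}"
    return directory, "ok"


print(check_writable_dir("checkpoints/ckpts"))
```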
    

Model Not Improving on Validation

Symptoms:

Validation accuracy flat
Training loss decreases but validation stagnates

Causes and solutions:

  1. Overfitting:

    # Add training data diversity
    # Reduce model capacity
    python run_training.py model.lora_rank=8
    
  2. Wrong reward signal:

    # Check reward function
    python -c "
    from agent_tunix.rewards import check_answer
    print(check_answer('The answer is 4.', '4'))
    "
    
  3. Validation set too small:

    # Use a larger validation set
    # Or evaluate less frequently
    python run_training.py training.eval_interval_steps=1000
    
  4. Distribution mismatch:

    # Ensure test set matches training data
    # Or fine-tune on test distribution
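
A reward function is easy to spot-check with a small table of cases. The harness below accepts any callable with the ``(response, expected)`` shape used in the reward check above. ``toy_check_answer`` is a stand-in written for this example and is not the project's ``check_answer``, whose exact behavior and return type are not documented here.

```python
import re


def toy_check_answer(response, expected):
    """Stand-in reward: 1.0 if the last number in the response equals expected."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return 1.0 if numbers and float(numbers[-1]) == float(expected) else 0.0


def spot_check(reward_fn, cases):
    """Return the cases where the reward disagrees with the expected label."""
    failures = []
    for response, expected, should_pass in cases:
        passed = bool(reward_fn(response, expected))
        if passed != should_pass:
            failures.append((response, expected))
    return failures


CASES = [
    ("The answer is 4.", "4", True),
    ("2 + 2 = 5", "4", False),
    ("Total: 1,200 apples", "1200", True),
    ("I am not sure.", "4", False),
]
print(spot_check(toy_check_answer, CASES))  # [] when every case behaves as labeled
```

Passing the real reward function in place of ``toy_check_answer`` shows which kinds of response it scores unexpectedly.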
    

Evaluation Issues

No Checkpoints Found

Error:

No checkpoint found in: checkpoints/ckpts/actor/

Solutions:

  1. Verify training completed:

    ls -la outputs/tunix-grpo/*/
    
  2. Check for checkpoints:

    find . -name "*actor*" -type d
    
  3. Use absolute path:

    python evaluate.py checkpoint_dir=/absolute/path/to/checkpoints/ckpts/
    
  4. Train if not done:

    python run_training.py +experiment=quick_test
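
An empty ``actor`` directory produces the same error as a missing one, so it is worth listing what each candidate directory contains. A minimal sketch with a hypothetical helper name; the layout inside a checkpoint directory is not assumed.

```python
from pathlib import Path


def find_actor_dirs(root="."):
    """Return directories whose name contains 'actor', newest first."""
    hits = [p for p in Path(root).rglob("*actor*") if p.is_dir()]
    return sorted(hits, key=lambda p: p.stat().st_mtime, reverse=True)


for path in find_actor_dirs("."):
    entries = sorted(child.name for child in path.iterdir())
    print(path, entries or "(empty)")
```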
    

CUDA Memory During Evaluation

Error:

RuntimeError: CUDA out of memory during evaluation

Solutions:

  1. Reduce batch size:

    python evaluate.py training.micro_batch_size=1
    
  2. Reduce sequence length:

    python evaluate.py generation.max_generation_steps=256
    
  3. Use CPU:

    python evaluate.py device=cpu
    

Evaluation Takes Too Long

Solutions:

  1. Use greedy decoding (faster):

    python evaluate.py inference_config=greedy
    
  2. Reduce evaluation samples:

    python evaluate.py evaluation.num_samples=100
    
  3. Reduce number of passes:

    python evaluate.py evaluation.num_passes=1
    
  4. Use smaller model checkpoint:

    python evaluate.py step=100
    

Metric Results Don’t Match Training

Causes:

  1. Different inference config:

    # Use the same inference config as training
    python evaluate.py inference_config=greedy
    
  2. Different checkpoint:

    # Use the latest checkpoint (default)
    python evaluate.py
    
  3. Different data:

    # Ensure same dataset
    python evaluate.py --cfg job | grep dataset
    
  4. Randomness:

    # Set seed
    python evaluate.py seed=42
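
Most of these causes come down to a difference between the two resolved configurations. If the output of ``--cfg job --resolve`` from each script is loaded into dictionaries (for example with a YAML parser, which is an assumption about your tooling), a small diff pinpoints the mismatch. The keys and values below are hypothetical.

```python
def flatten(cfg, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. {'a': {'b': 1}} -> {'a.b': 1}."""
    flat = {}
    for key, value in cfg.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, dotted + "."))
        else:
            flat[dotted] = value
    return flat


def diff_configs(train_cfg, eval_cfg):
    """Map each differing dotted key to its (training, evaluation) values."""
    a, b = flatten(train_cfg), flatten(eval_cfg)
    return {k: (a.get(k), b.get(k)) for k in sorted(a.keys() | b.keys())
            if a.get(k) != b.get(k)}


# Hypothetical values for illustration only.
train = {"seed": 42, "generation": {"max_generation_steps": 768, "temperature": 0.9}}
evaluate = {"seed": 7, "generation": {"max_generation_steps": 256, "temperature": 0.9}}
print(diff_configs(train, evaluate))
# {'generation.max_generation_steps': (768, 256), 'seed': (42, 7)}
```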
    

Configuration Issues

Invalid Configuration

Error:

ConfigError: Could not find 'model/custom.yaml'

Solutions:

  1. List available configs:

    python run_training.py --info config-groups | grep model
    
  2. Check file exists:

    ls conf/model/
    
  3. Use default if custom missing:

    python run_training.py model=gemma3_1b
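
The lookup behind this error is a file search: ``model=custom`` resolves to ``custom.yaml`` inside the ``model`` group directory. The sketch below reproduces that check with the standard library, assuming the ``conf/`` layout used in this guide; the helper names are hypothetical.

```python
from pathlib import Path


def list_group_options(conf_dir, group):
    """Return option names (YAML file stems) available in a config group."""
    group_dir = Path(conf_dir) / group
    if not group_dir.is_dir():
        return []
    return sorted(p.stem for p in group_dir.iterdir()
                  if p.suffix in {".yaml", ".yml"})


def explain_missing(conf_dir, group, option):
    """Build a readable message when group=option cannot be found."""
    options = list_group_options(conf_dir, group)
    if option in options:
        return f"{group}={option} exists"
    return f"{group}={option} not found; available: {', '.join(options) or 'none'}"


print(explain_missing("conf", "model", "custom"))
```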
    

Conflicting Overrides

Error:

ConfigCompositionException: Could not override

Solutions:

  1. Check config hierarchy:

    python run_training.py --info defaults-tree
    
  2. Use correct path:

    # Correct
    python run_training.py optimizer.learning_rate=1e-5
    
    # Wrong
    python run_training.py learning_rate=1e-5
    
  3. Use ++ to add the key, or override it if it already exists:

    python run_training.py ++optimizer.new_param=value
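
Hydra distinguishes three override forms: plain ``key=value`` changes an existing key, ``+key=value`` adds a new one, and ``++key=value`` adds the key or overrides it if present. The toy model below imitates those rules on a nested dictionary to show why a path such as ``learning_rate=1e-5`` fails. It is a simplified illustration, not Hydra's implementation, and it keeps every value as a string.

```python
def apply_override(cfg, override):
    """Apply one 'key=value', '+key=value' or '++key=value' to a nested dict."""
    target, value = override.split("=", 1)
    prefix = "++" if target.startswith("++") else "+" if target.startswith("+") else ""
    *parents, leaf = target[len(prefix):].split(".")
    node = cfg
    for name in parents:
        if name not in node:
            if prefix == "":
                raise KeyError(f"Could not override '{target}': no such key")
            node[name] = {}
        node = node[name]
    exists = leaf in node
    if prefix == "" and not exists:
        raise KeyError(f"Could not override '{target}': no such key")
    if prefix == "+" and exists:
        raise KeyError(f"Could not add '{leaf}': key already exists")
    node[leaf] = value
    return cfg


cfg = {"optimizer": {"learning_rate": "1e-6"}}
apply_override(cfg, "optimizer.learning_rate=1e-5")  # overrides the existing key
apply_override(cfg, "++optimizer.new_param=value")   # adds the missing key
print(cfg)
```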
    

Missing Configuration Group

Error:

ConfigCompositionException: Could not load group

Solutions:

  1. List defaults:

    python run_training.py --info config-groups
    
  2. Create missing config:

    touch conf/scheduler/custom.yaml
    
  3. Update config.yaml defaults:

    # conf/config.yaml
    defaults:
      - scheduler: custom
    

Distributed Training Issues

NCCL Errors

Error:

RuntimeError: NCCL operation failed

Solutions:

  1. Check GPU connectivity:

    nvidia-smi -L
    
  2. Enable NCCL debugging:

    export NCCL_DEBUG=INFO
    
  3. Increase timeout:

    export NCCL_P2P_CONNECT_TIMEOUT=300
    
  4. Use single GPU for testing:

    python run_training.py 'model.mesh_shape=[[1,1],["fsdp","tp"]]'
    

Device Mismatch

Error:

RuntimeError: Devices are not homogeneous

Causes:

  • Different GPU types in cluster

  • Different compute capabilities

Solution:

  • Use the same GPU type across all nodes

  • Or use GPUs with the same compute capability

Communication Timeout

Error:

TimeoutError: Communication timed out

Solutions:

  1. Check network:

    ping <other-node>
    
  2. Increase timeout:

    export NCCL_P2P_CONNECT_TIMEOUT=600
    
  3. Pin NCCL to a specific network interface:

    export NCCL_SOCKET_IFNAME=eth0  # Specific network interface
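
``ping`` only confirms ICMP reachability, while a firewall can still block the TCP port that the nodes use to find each other. A minimal sketch of a TCP check with the standard library; the helper name is hypothetical, and the host and port in the comment are placeholders for whatever your launcher uses.

```python
import socket


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call, with placeholder host and port:
# can_connect("other-node.example", 1234)
```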
    

Uneven GPU Utilization

Issue: Some GPUs finish faster than others

Solutions:

  1. Check loads:

    nvidia-smi dmon -s pm
    
  2. Adjust batch size:

    python run_training.py training.micro_batch_size=8
    
  3. Balance data distribution:

    python run_training.py training.data_seed=42
    

Debugging Tips

Enable Verbose Logging

python run_training.py training.log_level=DEBUG

Profile Training

python run_training.py training.profile=true

Check the logs for the profiler output.

Inspect Configuration

python run_training.py --cfg job --resolve

Prints the job configuration with all interpolations resolved.

Dry Run Test

python run_training.py +experiment=quick_test --dry-run

Validates configuration without training.

Check Versions

python -c "
import jax
import flax
import transformers
import hydra
print(f'JAX: {jax.__version__}')
print(f'Flax: {flax.__version__}')
print(f'Transformers: {transformers.__version__}')
print(f'Hydra: {hydra.__version__}')
"

Getting Help

When reporting issues, include:

  1. Complete error message and traceback

  2. Configuration used (python run_training.py --cfg job)

  3. GPU information (nvidia-smi)

  4. Environment info (Python version, package versions)

  5. Minimal reproducible example
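
The environment details for a report can be gathered in one step. The sketch below uses ``importlib.metadata`` so that a missing package is reported rather than raising an error. The helper name and the package list are assumptions (``hydra-core`` is the distribution name under which Hydra is published).

```python
import platform
from importlib import metadata


def environment_report(packages=("jax", "jaxlib", "flax", "transformers", "hydra-core")):
    """Collect Python, OS and package versions for an issue report."""
    report = {"python": platform.python_version(), "platform": platform.platform()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```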

Common Patterns

Testing Fix Before Full Run

# Test configuration and data loading
python run_training.py +experiment=quick_test --dry-run

# Run 10 steps to verify
python run_training.py +experiment=quick_test

# If successful, run full training
python run_training.py +experiment=full_training

Incremental Memory Reduction

# Start here
python run_training.py

# If OOM, reduce batch size
python run_training.py training.micro_batch_size=2

# If still OOM, use smaller model
python run_training.py model=gemma3_270m training.micro_batch_size=1

# If still OOM, reduce LoRA rank
python run_training.py model=gemma3_270m model.lora_rank=8 training.micro_batch_size=1

Next Steps