Troubleshooting

Common issues and solutions for training and evaluation.

Installation Issues

CUDA Not Found

Error:

RuntimeError: CUDA is not available

Solution:

# Verify CUDA installation
python -m agent_tunix.utils check-gpu

# Check NVIDIA drivers
nvidia-smi

# Install CUDA if missing (see installation guide)

JAX Backend Issues

Error:

ModuleNotFoundError: No module named 'jax'

Solution:

# Reinstall JAX with CUDA 11 support (older JAX releases only)
pip install "jax[cuda11_cudnn82]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

# Or with CUDA 12
pip install -U "jax[cuda12]"

GPU Out of Memory on Import

Error:

RuntimeError: CUDA out of memory

Solution:

# Set XLA to only allocate needed memory
export XLA_PYTHON_CLIENT_PREALLOCATE=false

# Or force the CPU backend to test without using GPU memory
export JAX_PLATFORMS=cpu  # Use CPU for testing
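
The preallocation setting can also be applied from Python, provided it is set before JAX initializes its GPU backend. The sketch below is a minimal illustration; the helper name `xla_client_settings` is hypothetical and only reports what is in the environment.

```python
import os

# Set this before `import jax` so the GPU backend sees it when it starts.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"  # allocate GPU memory on demand

def xla_client_settings():
    """Return the XLA client memory settings currently in the environment."""
    prefix = "XLA_PYTHON_CLIENT_"
    return {k: v for k, v in os.environ.items() if k.startswith(prefix)}

print(xla_client_settings())
```

If preallocation should stay enabled, `XLA_PYTHON_CLIENT_MEM_FRACTION` is the related XLA setting that lowers the preallocated share instead.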

Training Issues

Training Starts But Stops Immediately

Error:

ValueError: Data loading failed

Solution:

1. Check data availability:

    python -c "from datasets import load_dataset; load_dataset('gsm8k', 'main')"

2. Verify the tokenizer is configured:

    python run_training.py --cfg job | grep tokenizer

3. Check configuration:

    python run_training.py --cfg job

Loss is NaN

Symptoms:

loss: nan
Training crashes after a few steps

Causes and solutions:

Learning rate too high:

python run_training.py optimizer.learning_rate=1e-7

Gradient overflow:

python run_training.py optimizer.max_grad_norm=0.01

Batch size too large:

python run_training.py training.micro_batch_size=1

Bad data example:

# Check and clean data
python -c "from agent_tunix.data import load_dataset; ds, _ = load_dataset(); print(ds[0])"
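
Printing one example rarely finds the bad record. A minimal sketch of a full scan follows, assuming each record is a dict with GSM8K-style `question` and `answer` fields; the project's own dataset may use different keys, and `find_bad_examples` is a hypothetical helper.

```python
def find_bad_examples(records, required_fields=("question", "answer")):
    """Return (index, reason) pairs for records with missing or empty fields."""
    problems = []
    for i, record in enumerate(records):
        for field in required_fields:
            value = record.get(field)
            if value is None:
                problems.append((i, f"missing '{field}'"))
            elif not str(value).strip():
                problems.append((i, f"empty '{field}'"))
    return problems

# Illustrative records only; replace with the real dataset.
sample = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "", "answer": "7"},
    {"question": "What is 3 * 3?"},
]
print(find_bad_examples(sample))
```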

Loss Not Decreasing

Symptoms:

Loss stays constant or increases

Causes and solutions:

Learning rate too low:

python run_training.py optimizer.learning_rate=1e-4

Model not training (frozen weights):

# Check if parameters are trainable
# (cfg must be your loaded training config; it is not defined by this one-liner)
python -c "from agent_tunix.train import create_model; model = create_model(cfg); print(model.trainable_params())"

Data too small:

# Use more training data, or change the number of training batches
python run_training.py training.num_batches=10000

LoRA not properly configured:

python run_training.py model.lora_rank=64

Memory Error During Training

Error:

RuntimeError: CUDA out of memory

Solutions (in order of impact):

Reduce batch size:

python run_training.py training.micro_batch_size=1

Reduce model size:

python run_training.py model=gemma3_270m

Reduce LoRA rank:

python run_training.py model.lora_rank=8

Reduce sequence length:

python run_training.py \
    generation.max_prompt_length=128 \
    generation.max_generation_steps=256

Reduce generations per prompt:

python run_training.py grpo.num_generations=2

Use gradient accumulation (pair it with a smaller micro batch to keep the effective batch size):

python run_training.py training.gradient_accumulation_steps=4
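
The batch-related overrides above interact with each other. The sketch below shows the arithmetic, assuming each prompt expands into `grpo.num_generations` completions that are processed in the same pass. That accounting and the `batch_budget` helper are illustrative, not taken from the project.

```python
def batch_budget(micro_batch_size, gradient_accumulation_steps, num_generations):
    """Summarize how the batch settings trade device memory against update size."""
    return {
        # Sequences held on the device during one forward/backward pass.
        "sequences_per_pass": micro_batch_size * num_generations,
        # Prompts contributing to one optimizer update.
        "prompts_per_update": micro_batch_size * gradient_accumulation_steps,
    }

before = batch_budget(micro_batch_size=4, gradient_accumulation_steps=1, num_generations=4)
after = batch_budget(micro_batch_size=1, gradient_accumulation_steps=4, num_generations=4)
print(before)
print(after)
```

Under these assumptions the second setting keeps four prompts per update while holding a quarter as many sequences in memory per pass.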

Slow Training

Check GPU utilization:

# Monitor in another terminal
watch -n 1 nvidia-smi

Solutions if utilization is low:

Data loading bottleneck:

# Increase number of data loading workers
python run_training.py training.num_workers=8

Model too small:
```
python run_training.py model=gemma3_1b
```

Check for CPU bottleneck:
```
# Monitor CPU usage
top
```

I/O bottleneck:

# Move data to faster storage (SSD)
# Or use memory-mapped datasets

Checkpoints Not Being Saved

Error:

No checkpoint directory created

Solutions:

Check checkpoint directory permissions:
```
ls -la checkpoints/ckpts/
```

Verify checkpoint configuration:

python run_training.py --cfg job | grep checkpoint

Create directory if missing:
```
mkdir -p checkpoints/ckpts/
```

Use absolute path:

python run_training.py checkpoint_dir=/path/to/checkpoints/ckpts/
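
The permission and directory checks above can be combined into one script. A minimal sketch follows; `check_checkpoint_dir` is a hypothetical helper, and the path is the one used in the commands above.

```python
import os
import tempfile

def check_checkpoint_dir(path):
    """Create `path` if needed and confirm that a file can be written there."""
    path = os.path.abspath(path)
    try:
        os.makedirs(path, exist_ok=True)
        # Writing a real temporary file is more reliable than reading mode bits.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        return False, f"{path} is not writable: {exc}"
    return True, f"{path} is writable"

ok, message = check_checkpoint_dir("checkpoints/ckpts")
print(message)
```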

Model Not Improving on Validation

Symptoms:

Validation accuracy flat
Training loss decreases but validation stagnates

Causes and solutions:

Overfitting:

# Add training set diversity
# Reduce model capacity
python run_training.py model.lora_rank=8

Wrong reward signal:

# Check reward function
python -c "
from agent_tunix.rewards import check_answer
print(check_answer('The answer is 4.', '4'))
"

Validation set too small:

# Use a larger validation set
# Or evaluate less often
python run_training.py training.eval_interval_steps=1000

Distribution mismatch:

# Ensure the test set matches the training data distribution
# Or fine-tune on data drawn from the test distribution (not the test set itself)
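
A single call is a weak check of the reward signal. The sketch below runs a small table of labelled cases through a reward function and lists the disagreements. Here `toy_check_answer` is a hypothetical stand-in for `agent_tunix.rewards.check_answer`, whose real return type and matching rules may differ.

```python
import re

def toy_check_answer(completion, expected):
    """Hypothetical stand-in: 1.0 if the last number in the completion equals `expected`."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == expected else 0.0

def audit_reward(reward_fn, cases):
    """Return the cases where the reward disagrees with the expected label."""
    failures = []
    for completion, expected, should_pass in cases:
        rewarded = reward_fn(completion, expected) > 0
        if rewarded != should_pass:
            failures.append((completion, expected))
    return failures

cases = [
    ("The answer is 4.", "4", True),
    ("2 + 2 = 5", "4", False),
    ("I think it is 14", "4", False),  # a substring match would wrongly pass this
    ("", "4", False),
]
print(audit_reward(toy_check_answer, cases))
```

Passing the project's own reward function as `reward_fn` should give an empty list if it handles these edge cases.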

Evaluation Issues

No Checkpoints Found

Error:

No checkpoint found in: checkpoints/ckpts/actor/

Solutions:

Verify training completed:
```
ls -la outputs/tunix-grpo/*/
```

Check for checkpoints:
```
find . -name "*actor*" -type d
```

Use absolute path:

python evaluate.py checkpoint_dir=/absolute/path/to/checkpoints/ckpts/

Train if not done:

python run_training.py +experiment=quick_test
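
To see which checkpoint steps actually exist, the actor directory can be listed from Python. This sketch assumes one numerically named subdirectory per saved step, which is an assumption about the checkpoint layout, and `latest_checkpoint_step` is a hypothetical helper.

```python
import os

def latest_checkpoint_step(actor_dir):
    """Return the highest numeric subdirectory name in `actor_dir`, or None."""
    if not os.path.isdir(actor_dir):
        return None
    steps = [
        int(name)
        for name in os.listdir(actor_dir)
        if name.isdigit() and os.path.isdir(os.path.join(actor_dir, name))
    ]
    return max(steps, default=None)

print(latest_checkpoint_step("checkpoints/ckpts/actor"))
```

A result of `None` means the directory is missing or holds no step folders, which matches the error above.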

CUDA Memory During Evaluation

Error:

RuntimeError: CUDA out of memory during evaluation

Solutions:

Reduce batch size:

python evaluate.py training.micro_batch_size=1

Reduce sequence length:

python evaluate.py generation.max_generation_steps=256

Use CPU:
```
python evaluate.py device=cpu
```

Evaluation Takes Too Long

Solutions:

Use greedy decoding (faster):

python evaluate.py inference_config=greedy

Reduce evaluation samples:

python evaluate.py evaluation.num_samples=100

Reduce number of passes:

python evaluate.py evaluation.num_passes=1

Evaluate an earlier checkpoint (selected by training step, not model size):
```
python evaluate.py step=100
```

Metric Results Don’t Match Training

Causes:

Different inference config:

# Use same as training
python evaluate.py inference_config=greedy

Different checkpoint:
```
# Use latest
python evaluate.py
```

Different data:

# Ensure same dataset
python evaluate.py --cfg job | grep dataset

Randomness:
```
# Set seed
python evaluate.py seed=42
```
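
When several of these causes are possible at once, comparing the two resolved configurations directly is quicker than checking them one by one. A minimal sketch over plain nested dicts follows; the keys in the example are illustrative, not the project's actual schema.

```python
def diff_configs(train_cfg, eval_cfg, prefix=""):
    """Yield (key, train_value, eval_value) for every setting that differs."""
    for key in sorted(set(train_cfg) | set(eval_cfg)):
        path = f"{prefix}{key}"
        a, b = train_cfg.get(key), eval_cfg.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            # Recurse into nested groups and report dotted paths.
            yield from diff_configs(a, b, prefix=f"{path}.")
        elif a != b:
            yield path, a, b

# Hypothetical values for illustration.
train = {"seed": 42, "generation": {"temperature": 0.9, "max_generation_steps": 256}}
evaluation = {"seed": 0, "generation": {"temperature": 0.0, "max_generation_steps": 256}}
for row in diff_configs(train, evaluation):
    print(row)
```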

Configuration Issues

Invalid Configuration

Error:

ConfigError: Could not find 'model/custom.yaml'

Solutions:

List available configs:

python run_training.py --help | grep model

Check file exists:
```
ls conf/model/
```

Use default if custom missing:
```
python run_training.py model=gemma3_1b
```

Conflicting Overrides

Error:

ConfigCompositionException: Could not override

Solutions:

Check config hierarchy:

python run_training.py --info defaults-tree

Use correct path:

# Correct
python run_training.py optimizer.learning_rate=1e-5

# Wrong
python run_training.py learning_rate=1e-5

Use `++` to add the key, or override it if it already exists:

python run_training.py ++optimizer.new_param=value
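
The three override forms behave differently depending on whether the key already exists. The sketch below is a toy model of those rules over a nested dict. It is not Hydra's implementation, values stay strings, and `apply_override` is a hypothetical helper.

```python
def apply_override(config, override):
    """Apply one `key=value`, `+key=value`, or `++key=value` override to a nested dict."""
    key, value = override.split("=", 1)
    mode = "override"  # plain form: the key must already exist
    if key.startswith("++"):
        key, mode = key[2:], "force"  # add the key or override it
    elif key.startswith("+"):
        key, mode = key[1:], "add"  # the key must not exist yet

    *parents, leaf = key.split(".")
    node = config
    for part in parents:
        if part not in node:
            if mode == "override":
                raise KeyError(f"Could not override '{key}': no such key")
            node[part] = {}
        node = node[part]

    exists = leaf in node
    if mode == "override" and not exists:
        raise KeyError(f"Could not override '{key}': no such key")
    if mode == "add" and exists:
        raise KeyError(f"Could not add '{key}': key already exists")
    node[leaf] = value
    return config

cfg = {"optimizer": {"learning_rate": "1e-6"}}
apply_override(cfg, "optimizer.learning_rate=1e-5")  # existing key: plain form works
apply_override(cfg, "++optimizer.new_param=value")   # new key: needs + or ++
print(cfg)
```

In this model `learning_rate=1e-5` fails because no top-level `learning_rate` key exists, which mirrors the "Wrong" example above.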

Missing Configuration Group

Error:

ConfigCompositionException: Could not load group

Solutions:

List defaults:

python run_training.py --info defaults

Create missing config:
```
touch conf/scheduler/custom.yaml
```

Update config.yaml defaults:

# conf/config.yaml
defaults:
  - scheduler: custom

Distributed Training Issues

NCCL Errors

Error:

RuntimeError: NCCL operation failed

Solutions:

Check GPU connectivity:
```
nvidia-smi -L
```

Enable NCCL debugging:
```
export NCCL_DEBUG=INFO
```

Increase timeout:
```
export NCCL_P2P_CONNECT_TIMEOUT=300
```

Use single GPU for testing:

python run_training.py 'model.mesh_shape=[[1,1],["fsdp","tp"]]'
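
The mesh value pairs a list of axis sizes with a list of axis names, and the product of the sizes is the number of devices the mesh uses. A minimal sketch of that check follows, assuming this reading of the `[[sizes], [names]]` format; `describe_mesh` is a hypothetical helper, and in practice the device count would come from `jax.device_count()`.

```python
import math

def describe_mesh(mesh_shape, available_devices):
    """Check a [[sizes...], [axis names...]] mesh spec against the available devices."""
    sizes, axis_names = mesh_shape
    if len(sizes) != len(axis_names):
        raise ValueError(f"{len(sizes)} sizes but {len(axis_names)} axis names")
    needed = math.prod(sizes)
    if needed > available_devices:
        raise ValueError(f"mesh {sizes} needs {needed} devices, only {available_devices} available")
    return {"axes": dict(zip(axis_names, sizes)), "devices_used": needed}

print(describe_mesh([[1, 1], ["fsdp", "tp"]], available_devices=1))
```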

Device Mismatch

Error:

RuntimeError: Devices are not homogeneous

Causes:

Different GPU types in cluster
Different compute capabilities

Solution:

Use the same GPU type across all nodes
Or use GPUs with the same compute capability

Communication Timeout

Error:

TimeoutError: Communication timed out

Solutions:

Check network:
```
ping <other-node>
```

Increase timeout:
```
export NCCL_P2P_CONNECT_TIMEOUT=600
```

Pin NCCL to a specific network interface:

export NCCL_SOCKET_IFNAME=eth0  # Specific network interface

Uneven GPU Utilization

Issue: Some GPUs finish faster than others

Solutions:

Check loads:
```
nvidia-smi dmon -s pum
```

Adjust batch size:

python run_training.py training.micro_batch_size=8

Balance data distribution:

python run_training.py training.data_seed=42

Debugging Tips

Enable Verbose Logging

python run_training.py training.log_level=DEBUG

Profile Training

python run_training.py training.profile=true

Check the logs for the profile output.

Inspect Configuration

python run_training.py --cfg job --resolve

Shows the job configuration with all interpolations resolved.

Dry Run Test

python run_training.py +experiment=quick_test --dry-run

Validates configuration without training.

Check Versions

python -c "
import jax
import flax
import transformers
import hydra
print(f'JAX: {jax.__version__}')
print(f'Flax: {flax.__version__}')
print(f'Transformers: {transformers.__version__}')
print(f'Hydra: {hydra.__version__}')
"

Getting Help

When reporting issues, include:

Complete error message and traceback
Configuration used (python run_training.py --cfg job)
GPU information (nvidia-smi)
Environment info (Python version, package versions)
Minimal reproducible example
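
The environment details for a report can be collected without importing the packages, which helps when the import itself is what fails. A minimal sketch using the standard library follows. The distribution names are the usual PyPI names (`hydra-core` for Hydra), and `environment_report` is a hypothetical helper.

```python
import platform
from importlib.metadata import PackageNotFoundError, version

def environment_report(distributions=("jax", "jaxlib", "flax", "transformers", "hydra-core")):
    """Map 'python' and each distribution name to its version, or 'not installed'."""
    report = {"python": platform.python_version()}
    for name in distributions:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report

for name, ver in environment_report().items():
    print(f"{name}: {ver}")
```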

Common Patterns

Testing Fix Before Full Run

# Test configuration and data loading
python run_training.py +experiment=quick_test --dry-run

# Run 10 steps to verify
python run_training.py +experiment=quick_test

# If successful, run full training
python run_training.py +experiment=full_training

Incremental Memory Reduction

# Start here
python run_training.py

# If OOM, reduce batch size
python run_training.py training.micro_batch_size=2

# If still OOM, use smaller model
python run_training.py model=gemma3_270m training.micro_batch_size=1

# If still OOM, reduce LoRA rank
python run_training.py model=gemma3_270m model.lora_rank=8 training.micro_batch_size=1
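
The same ladder can be walked by a script. The sketch below tries each override set in order and stops at the first run that exits cleanly. It treats any non-zero exit code as a failure, so it cannot tell an out-of-memory error from other crashes. `first_working_config` and `run_training` are hypothetical helpers.

```python
import subprocess
import sys

# Override sets from the steps above, ordered from most to least memory-hungry.
LADDER = [
    [],
    ["training.micro_batch_size=2"],
    ["model=gemma3_270m", "training.micro_batch_size=1"],
    ["model=gemma3_270m", "model.lora_rank=8", "training.micro_batch_size=1"],
]

def run_training(overrides):
    """Run one training attempt; True if the process exits cleanly."""
    cmd = [sys.executable, "run_training.py", *overrides]
    return subprocess.run(cmd).returncode == 0

def first_working_config(ladder, attempt=run_training):
    """Return the first override set for which `attempt` succeeds, else None."""
    for overrides in ladder:
        if attempt(overrides):
            return overrides
    return None
```

Calling `first_working_config(LADDER)` launches the real training commands, so a short experiment such as `+experiment=quick_test` can be appended to each override set to keep the attempts brief.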

Next Steps

Training Guide
Training API
Configuration Guide - Configuration reference