Training Configuration Reference
This section describes the options that control the training process: batch size, run length, checkpointing, evaluation, logging, seeding, and device selection.
Training Configuration File
Located at: conf/training/default.yaml
micro_batch_size: 4
num_batches: 3738
num_epochs: 1
checkpoint_dir: ./checkpoints/ckpts/
save_interval_steps: 100
eval_interval_steps: 500
log_interval_steps: 10
seed: 42
device: cuda
Training Configuration Parameters
micro_batch_size
Type: integer
Range: 1 to 16 (depending on GPU)
Default: 4
Batch size per GPU device.
Memory impact: roughly linear in activation memory (2× batch size ≈ 2× activation memory). Model weights are a fixed cost regardless of batch size.
Guidance by GPU:
- 11GB GPU (RTX 2080 Ti): 1
- 24GB GPU (RTX A5000): 2-4
- 48GB GPU (RTX A6000): 4-8
- 80GB GPU (H100): 8-16
Training dynamics:
- Larger batches: more stable gradients, but each step takes longer
- Smaller batches: noisier gradients, but each step completes faster
Typical ranges:
# Memory constrained
python run_training.py training.micro_batch_size=1
# Balanced
python run_training.py training.micro_batch_size=4
# High capacity
python run_training.py training.micro_batch_size=8
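The GPU guidance above can be encoded as a small lookup. The sketch below is illustrative only: `suggest_micro_batch_size` and `GUIDANCE` are hypothetical names, not part of the project, and the helper returns the lower bound of each documented range as a conservative starting point.

```python
# Hypothetical helper (not part of the project): maps GPU memory in GB to a
# conservative starting micro_batch_size, using the lower bound of each
# range from the guidance list.
GUIDANCE = [
    (80, 8),  # 80GB GPU: 8-16
    (48, 4),  # 48GB GPU: 4-8
    (24, 2),  # 24GB GPU: 2-4
    (11, 1),  # 11GB GPU: 1
]


def suggest_micro_batch_size(gpu_memory_gb: float) -> int:
    """Return a conservative micro_batch_size for the given GPU memory."""
    for min_gb, batch_size in GUIDANCE:
        if gpu_memory_gb >= min_gb:
            return batch_size
    # Below the smallest documented tier: fall back to the minimum.
    return 1
```

Start from the suggested value and raise it only if memory headroom remains after a few steps.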
num_batches
Type: integer
Range: 10 to 100000
Default: 3738
Total number of training batches (steps).
Training duration:
duration ≈ num_batches / (batches_per_hour)
Typical speeds:
- 270M model: ~500 steps/hour
- 1B model: ~300 steps/hour
- 4B model: ~150 steps/hour
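The duration formula and the typical speeds above combine into a quick estimate. This is a minimal sketch: `estimate_hours` and `TYPICAL_STEPS_PER_HOUR` are hypothetical names, and the throughput figures are the approximate ones from the list, so actual times will vary with hardware and settings.

```python
# Approximate throughput copied from the "Typical speeds" list (steps/hour).
TYPICAL_STEPS_PER_HOUR = {"270M": 500, "1B": 300, "4B": 150}


def estimate_hours(num_batches: int, steps_per_hour: float) -> float:
    """duration ≈ num_batches / batches_per_hour"""
    if steps_per_hour <= 0:
        raise ValueError("steps_per_hour must be positive")
    return num_batches / steps_per_hour


for model, rate in TYPICAL_STEPS_PER_HOUR.items():
    hours = estimate_hours(3738, rate)
    print(f"{model} model: about {hours:.1f} h for the default 3738 steps")
```

With the default 3738 steps this gives roughly 7.5, 12.5, and 24.9 hours for the three model sizes.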
Examples:
# Quick test (10 minutes)
training.num_batches=50
# Full training (varies)
training.num_batches=3738
# Long training (1-2 days)
training.num_batches=10000
Relationship with num_epochs:
total_steps = (dataset_size / batch_size) × num_epochs
# But num_batches directly specifies total steps
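To pick a num_batches value that corresponds to a given number of epochs, the relationship above can be computed directly. This is a sketch under an assumption: the final partial batch of each epoch is kept, so steps per epoch are rounded up. `steps_for_epochs` is a hypothetical helper and the dataset size in the usage note is made up for illustration.

```python
import math


def steps_for_epochs(dataset_size: int, batch_size: int, num_epochs: int = 1) -> int:
    """total_steps = (dataset_size / batch_size) × num_epochs.

    Assumes the last partial batch of each epoch is kept (rounds up).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    return steps_per_epoch * num_epochs
```

For a hypothetical dataset of 1000 examples with a batch size of 4, two epochs correspond to `steps_for_epochs(1000, 4, 2)`, which is 500 steps.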
num_epochs
Type: integer
Range: 1 to 10
Default: 1
Number of complete passes through the dataset.
Typical:
- Most tasks: 1 epoch
- Small datasets: 3-5 epochs
- Large datasets: 1 epoch
Usually keep at 1 and adjust num_batches instead:
# Good: directly specify steps
training.num_batches=3738
# Less common: use epochs
training.num_epochs=2
checkpoint_dir
Type: string (path)
Default: ./checkpoints/ckpts/
Directory where model checkpoints are saved.
Must be a writable directory. It is created if it doesn't exist.
Absolute vs relative:
# Relative to project root
training.checkpoint_dir=./checkpoints/ckpts/
# Absolute path
training.checkpoint_dir=/full/path/to/checkpoints/ckpts/
Example:
python run_training.py training.checkpoint_dir=~/my_models/checkpoints/
save_interval_steps
Type: integer
Range: 1 to 1000
Default: 100
Save checkpoint every N steps.
Trade-offs:
- Frequent saves (50): more disk, better coverage, can resume from recent step
- Infrequent saves (1000): less disk, fewer checkpoints, coarser resume points
Disk usage estimate:
disk = (checkpoint_size) × (num_batches / save_interval_steps)
For 1B model (≈4GB checkpoint):
- save_interval_steps=50: 4GB × (3738/50) ≈ 300GB
- save_interval_steps=100: 4GB × (3738/100) ≈ 150GB
- save_interval_steps=500: 4GB × (3738/500) ≈ 30GB
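The disk estimate can be scripted before launching a long run. The sketch below assumes every checkpoint is kept and that only whole checkpoints are written (integer division), so its results fall slightly under the rounded figures above. `estimate_checkpoint_disk_gb` is a hypothetical helper.

```python
def estimate_checkpoint_disk_gb(
    checkpoint_size_gb: float, num_batches: int, save_interval_steps: int
) -> float:
    """disk = checkpoint_size × (num_batches / save_interval_steps).

    Counts only whole checkpoints and assumes none are deleted.
    """
    if save_interval_steps <= 0:
        raise ValueError("save_interval_steps must be positive")
    num_checkpoints = num_batches // save_interval_steps
    return checkpoint_size_gb * num_checkpoints


for interval in (50, 100, 500):
    total = estimate_checkpoint_disk_gb(4.0, 3738, interval)
    print(f"save every {interval} steps: about {total:.0f} GB")
```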
Example:
python run_training.py training.save_interval_steps=200
eval_interval_steps
Type: integer
Range: 100 to 5000 (typical; smaller values such as 10 are used for quick tests)
Default: 500
Evaluate on validation set every N steps.
Lower = more frequent evaluation:
- Every 100 steps: frequent feedback, slower training
- Every 500 steps: good balance (default)
- Every 1000 steps: less frequent, faster training
Example:
python run_training.py training.eval_interval_steps=1000
log_interval_steps
Type: integer
Range: 1 to 100
Default: 10
Log metrics every N steps.
Determines frequency of logged stats:
- Every 1 step: very verbose, can slow training
- Every 10 steps: good visibility (default)
- Every 100 steps: less detailed
Example:
python run_training.py training.log_interval_steps=5
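The three interval settings determine which actions run at a given step. The sketch below assumes steps are counted from 1 and that an action fires whenever the step number is a multiple of its interval. The source does not state the framework's exact trigger rule, so treat this as an illustration. `events_at_step` is a hypothetical helper.

```python
def events_at_step(
    step: int,
    save_interval_steps: int = 100,
    eval_interval_steps: int = 500,
    log_interval_steps: int = 10,
) -> list:
    """Return the actions due at a 1-based training step (assumed rule)."""
    events = []
    if step % log_interval_steps == 0:
        events.append("log")
    if step % save_interval_steps == 0:
        events.append("save")
    if step % eval_interval_steps == 0:
        events.append("eval")
    return events
```

With the defaults, step 500 triggers all three actions, step 100 triggers logging and a checkpoint save, and step 7 triggers nothing.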
seed
Type: integer
Default: 42
Random seed for reproducibility.
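The same seed gives reproducible results across runs, provided the hardware, software versions, and configuration are also the same.
To get different results across runs, change the seed: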
python run_training.py training.seed=42
python run_training.py training.seed=43
python run_training.py training.seed=44
Reproducibility:
python run_training.py training.seed=42 # Run 1
python run_training.py training.seed=42 # Run 2 (identical to Run 1)
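The effect of a fixed seed can be seen with Python's standard `random` module. This sketch only illustrates the principle. How the framework actually seeds its own random number generators is not shown in this section, and `sample_order` is a hypothetical helper.

```python
import random


def sample_order(seed: int, n: int = 5) -> list:
    """Shuffle the indices 0..n-1 with an isolated, seeded generator."""
    rng = random.Random(seed)  # separate generator, does not touch global state
    order = list(range(n))
    rng.shuffle(order)
    return order


# Two "runs" with the same seed produce the same data order.
run_1 = sample_order(seed=42)
run_2 = sample_order(seed=42)
print(run_1 == run_2)
```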
device
Type: string
Options: cuda, cpu
Default: cuda
Training device.
GPU (recommended):
python run_training.py training.device=cuda
CPU (very slow, for testing only):
python run_training.py training.device=cpu
Memory Optimization Parameters
gradient_accumulation_steps
Type: integer
Default: 1
Number of micro-batches whose gradients are accumulated before each optimizer update.
Effectively increases batch size without more GPU memory:
effective_batch_size = micro_batch_size × gradient_accumulation_steps
Example:
# Effective batch size of 8 (2 × 4) on a memory-constrained GPU
training.micro_batch_size=2
training.gradient_accumulation_steps=4
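The sketch below shows why accumulation is equivalent to a larger batch. For a mean-reduced loss and equal-sized micro-batches, averaging the micro-batch gradients gives the same gradient as one large batch. It uses a toy one-parameter model in plain Python. `mse_grad` and `accumulated_grad` are hypothetical names and this is not the framework's implementation.

```python
def mse_grad(w: float, batch: list) -> float:
    """Gradient of mean squared error for the model y ≈ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(
    w: float, data: list, micro_batch_size: int, gradient_accumulation_steps: int
) -> float:
    """Average the gradients of several micro-batches before one update."""
    total = 0.0
    for i in range(gradient_accumulation_steps):
        start = i * micro_batch_size
        micro_batch = data[start:start + micro_batch_size]
        total += mse_grad(w, micro_batch)
    return total / gradient_accumulation_steps


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 examples, true w = 3
full = mse_grad(0.5, data)  # one batch of 8
accum = accumulated_grad(0.5, data, micro_batch_size=2, gradient_accumulation_steps=4)
print(full, accum)  # the two gradients match
```

Only one micro-batch of activations is held in memory at a time, which is where the memory saving comes from.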
max_grad_norm (in optimizer config)
Type: float
Default: 0.1
Gradient clipping threshold.
Larger value = less clipping, more aggressive updates:
- 0.01: strong clipping, stable but slow
- 0.1: moderate clipping (default)
- 1.0: weak clipping, aggressive
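A minimal sketch of how such a threshold is typically applied, assuming max_grad_norm refers to clipping by global L2 norm (the source does not say which clipping variant the optimizer uses). `clip_by_global_norm` here is a plain-Python illustration over a flat list of gradient values, not the framework's function.

```python
import math


def clip_by_global_norm(grads: list, max_grad_norm: float) -> list:
    """Rescale gradients so their global L2 norm is at most max_grad_norm."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm == 0.0 or global_norm <= max_grad_norm:
        return list(grads)  # already within the threshold
    scale = max_grad_norm / global_norm
    return [g * scale for g in grads]


clipped = clip_by_global_norm([3.0, 4.0], max_grad_norm=0.1)  # norm 5.0 -> 0.1
unchanged = clip_by_global_norm([0.03, 0.04], max_grad_norm=0.1)  # norm 0.05
```

Clipping preserves the gradient direction and only shrinks its magnitude, which is why a smaller threshold trades speed for stability.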
Complete Training Configuration Example
# conf/training/default.yaml
micro_batch_size: 4
num_batches: 3738
num_epochs: 1
checkpoint_dir: ./checkpoints/ckpts/
save_interval_steps: 100
eval_interval_steps: 500
log_interval_steps: 10
seed: 42
device: cuda
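For scripts that build or check a configuration outside the training entry point, the defaults can be mirrored in a typed structure. This is a sketch only: `TrainingConfig` is a hypothetical class, not the project's loader, and its checks cover just positive counts and intervals plus the documented device options.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    """Hypothetical mirror of conf/training/default.yaml."""

    micro_batch_size: int = 4
    num_batches: int = 3738
    num_epochs: int = 1
    checkpoint_dir: str = "./checkpoints/ckpts/"
    save_interval_steps: int = 100
    eval_interval_steps: int = 500
    log_interval_steps: int = 10
    seed: int = 42
    device: str = "cuda"

    def validate(self) -> None:
        """Raise ValueError for non-positive counts or an unknown device."""
        positive_fields = (
            "micro_batch_size", "num_batches", "num_epochs",
            "save_interval_steps", "eval_interval_steps", "log_interval_steps",
        )
        for name in positive_fields:
            value = getattr(self, name)
            if value < 1:
                raise ValueError(f"{name} must be >= 1, got {value}")
        if self.device not in ("cuda", "cpu"):
            raise ValueError(f"device must be 'cuda' or 'cpu', got {self.device!r}")


TrainingConfig().validate()  # the defaults pass
```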
Custom training config:
# conf/training/aggressive.yaml
micro_batch_size: 8
num_batches: 1000
num_epochs: 1
checkpoint_dir: ./checkpoints/ckpts/
save_interval_steps: 50
eval_interval_steps: 100
log_interval_steps: 5
seed: 42
device: cuda
Use:
python run_training.py training=aggressive
Common Configuration Patterns
Quick Test (10 steps)
python run_training.py \
training.micro_batch_size=2 \
training.num_batches=10 \
training.save_interval_steps=10 \
training.eval_interval_steps=10
Or use experiment:
python run_training.py +experiment=quick_test
Memory Constrained (11GB GPU)
python run_training.py \
model=gemma3_270m \
training.micro_batch_size=1 \
training.num_batches=3738
High Performance (H100 GPU)
python run_training.py \
model=gemma3_4b \
training.micro_batch_size=8 \
training.num_batches=10000 \
training.eval_interval_steps=1000
Production Training
python run_training.py \
model=gemma3_1b \
training.micro_batch_size=4 \
training.num_batches=10000 \
training.save_interval_steps=100
Resuming Training
Training automatically resumes from the latest checkpoint found in checkpoint_dir:
python run_training.py training.checkpoint_dir=./checkpoints/ckpts/
Or from a specific directory:
python run_training.py training.checkpoint_dir=/path/to/checkpoints/ckpts/
The framework:
1. Finds the latest checkpoint in the directory
2. Loads the model weights
3. Continues training from that step
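The first of these steps can be sketched as follows, assuming each checkpoint lives in a subdirectory named after its step number. That layout is an assumption for illustration since the actual on-disk format is not described here, and `find_latest_checkpoint` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the step-numbered subdirectory with the highest step, if any."""
    root = Path(checkpoint_dir).expanduser()
    if not root.is_dir():
        return None
    # Assumed layout: <checkpoint_dir>/<step>/, e.g. .../100/, .../200/
    step_dirs = [p for p in root.iterdir() if p.is_dir() and p.name.isdigit()]
    if not step_dirs:
        return None
    # Compare numerically so that 1000 sorts after 200.
    return max(step_dirs, key=lambda p: int(p.name))
```

A return value of `None` corresponds to starting training from scratch.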
Training Dynamics
How configuration affects training:
Learning Rate + Batch Size:
- Larger batch → can use higher learning rate
- Smaller batch → need lower learning rate
Warmup + Learning Rate:
- Longer warmup (higher warmup_ratio) → more stable
- Short warmup → faster convergence but less stable
Number of Batches + Evaluation Interval:
- More batches → longer training, more progress
- Less frequent eval → faster training but less monitoring
LoRA Rank + Learning Rate:
- Higher rank → more parameters, may need lower LR
- Lower rank → fewer parameters, can use higher LR
Checkpoint Management
Disk Space Required
total_disk ≈ checkpoint_size × (num_batches / save_interval_steps)
For 1B model (≈4GB):
3738 steps, save every 100 steps
total_disk ≈ 4GB × (3738/100) ≈ 150GB
Keeping Only Important Checkpoints
Save less frequently:
python run_training.py training.save_interval_steps=500
Or manually delete old checkpoints:
# Keep only last 5 checkpoints
ls -dt checkpoints/ckpts/actor/*/ | tail -n +6 | xargs rm -rf
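The same cleanup can be done from Python, which avoids parsing `ls` output and allows a dry run first. This is a sketch: `prune_checkpoints` is a hypothetical helper that, like the shell command, treats every subdirectory as one checkpoint and orders them by modification time.

```python
import shutil
from pathlib import Path


def prune_checkpoints(checkpoint_root: str, keep: int = 5, dry_run: bool = True) -> list:
    """Delete all but the `keep` most recently modified subdirectories.

    Returns the stale paths. With dry_run=True nothing is deleted.
    """
    root = Path(checkpoint_root)
    dirs = sorted(
        (p for p in root.iterdir() if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    stale = dirs[keep:]
    if not dry_run:
        for path in stale:
            shutil.rmtree(path)
    return stale
```

Call it once with the default `dry_run=True` to review the list, for example `prune_checkpoints("checkpoints/ckpts/actor")`, then again with `dry_run=False` to delete.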
Finding Checkpoint Sizes
du -sh checkpoints/ckpts/actor/*/
Monitoring Training Progress
Check Training Loss
tail -f outputs/tunix-grpo/YYYY-MM-DD/HH-MM-SS/train.log
Use Weights & Biases
Enabled by default. View at https://wandb.ai
Use TensorBoard
make tensorboard
# Open http://localhost:6006
Integration with Other Configs
Training settings interact with:
Model: Larger models need smaller batch sizes
GRPO: num_generations multiplies memory usage
Optimizer: Learning rate should match batch size
Scheduler: warmup_ratio affects convergence
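The GRPO interaction can be made concrete with a small calculation. The sketch assumes each prompt in a micro-batch produces num_generations completions that are all held in memory together. That is an assumption about the framework, and `sequences_per_micro_step` is a hypothetical helper.

```python
def sequences_per_micro_step(micro_batch_size: int, num_generations: int) -> int:
    """Sequences resident per micro-step, assuming every prompt yields
    num_generations completions that are processed together."""
    if micro_batch_size < 1 or num_generations < 1:
        raise ValueError("both arguments must be >= 1")
    return micro_batch_size * num_generations


small = sequences_per_micro_step(micro_batch_size=1, num_generations=2)
large = sequences_per_micro_step(micro_batch_size=8, num_generations=4)
print(small, large)
```

Under this assumption, lowering either setting reduces memory by the same factor, so the two can be traded against each other.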
Coordinated tuning example:
# For 4GB GPU
python run_training.py \
model=gemma3_270m \
model.lora_rank=8 \
training.micro_batch_size=1 \
grpo.num_generations=2 \
optimizer.learning_rate=1e-5
# For 80GB GPU
python run_training.py \
model=gemma3_4b \
model.lora_rank=64 \
training.micro_batch_size=8 \
grpo.num_generations=4 \
optimizer.learning_rate=3e-6
Next Steps
Configuration Overview - Configuration overview
Training Guide - Training guide
Hyperparameter Tuning - Tuning strategies
Configuration Guide - Configuration guide