Training Configuration Reference

This section describes the configuration options that control the training process.

Training Configuration File

Located at: conf/training/default.yaml

micro_batch_size: 4
num_batches: 3738
num_epochs: 1
checkpoint_dir: ./checkpoints/ckpts/
save_interval_steps: 100
eval_interval_steps: 500
log_interval_steps: 10
seed: 42
device: cuda

Training Configuration Parameters

micro_batch_size

Type: integer

Range: 1 to 16 (depending on GPU)

Default: 4

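Number of samples processed per GPU in each forward and backward pass.

Memory impact: roughly linear in activation memory (2× batch size ≈ 2× activation memory). Model weights and optimizer state do not grow with batch size.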

Guidance by GPU:

- 11GB GPU (RTX 2080 Ti): 1
- 24GB GPU (RTX A5000): 2-4
- 48GB GPU (RTX A6000): 4-8
- 80GB GPU (H100): 8-16
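
The guidance above can be encoded as a small lookup for picking a starting value. This is a minimal sketch: the function name `suggest_micro_batch_size` is hypothetical and not part of the framework, the thresholds simply mirror the list, and treating every GPU below 24GB as the batch-size-1 tier is a conservative assumption.

```python
def suggest_micro_batch_size(gpu_memory_gb: float) -> tuple[int, int]:
    """Return a (low, high) starting range for micro_batch_size.

    The tiers mirror the guidance list. They are starting points only,
    since model size and sequence length also affect memory use.
    """
    # (minimum GPU memory in GB, suggested range), largest tier first
    tiers = [
        (80, (8, 16)),
        (48, (4, 8)),
        (24, (2, 4)),
    ]
    for min_gb, batch_range in tiers:
        if gpu_memory_gb >= min_gb:
            return batch_range
    # Assumption: anything below 24 GB is treated like the 11 GB tier.
    return (1, 1)


low, high = suggest_micro_batch_size(48)
print(f"training.micro_batch_size={low}  # try up to {high}")
```

Start at the low end of the range and increase only if training runs without out-of-memory errors.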

Training dynamics:

- Larger batches: more stable gradient estimates, but each step takes longer
- Smaller batches: noisier gradient estimates, but each step completes faster

Typical ranges:

# Memory constrained
python run_training.py training.micro_batch_size=1

# Balanced
python run_training.py training.micro_batch_size=4

# High capacity
python run_training.py training.micro_batch_size=8

num_batches

Type: integer

Range: 10 to 100000

Default: 3738

Total number of training batches (steps).

Training duration:

duration (hours) ≈ num_batches / steps_per_hour

Typical speeds:

- 270M model: ~500 steps/hour
- 1B model: ~300 steps/hour
- 4B model: ~150 steps/hour
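
As a worked example of the duration formula, the sketch below applies the typical speeds listed above to the default `num_batches`. The helper name `estimate_hours` is illustrative only. Actual throughput depends on hardware, sequence length, and batch size.

```python
# Approximate throughput in steps per hour, copied from the list above.
STEPS_PER_HOUR = {"270M": 500, "1B": 300, "4B": 150}


def estimate_hours(num_batches: int, steps_per_hour: float) -> float:
    """Apply duration ≈ num_batches / steps_per_hour."""
    if steps_per_hour <= 0:
        raise ValueError("steps_per_hour must be positive")
    return num_batches / steps_per_hour


for model, speed in STEPS_PER_HOUR.items():
    hours = estimate_hours(3738, speed)
    print(f"{model}: about {hours:.1f} hours for 3738 steps")
```

At these speeds the default 3738 steps take roughly 7.5, 12.5, and 24.9 hours for the 270M, 1B, and 4B models respectively.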

Examples:

# Quick test (10 minutes)
training.num_batches=50

# Full training (varies)
training.num_batches=3738

# Long training (roughly 1-3 days, depending on model size)
training.num_batches=10000

Relationship with num_epochs:

total_steps = (dataset_size / batch_size) × num_epochs
# But num_batches directly specifies total steps
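
To pick a `num_batches` value that corresponds to a given number of epochs, the relationship above can be computed directly. This is a sketch under two assumptions: the final partial batch is kept (hence the ceiling), and `batch_size` means the number of samples consumed per step. The dataset size used below is a made-up figure for illustration.

```python
import math


def steps_for_epochs(dataset_size: int, batch_size: int, num_epochs: int = 1) -> int:
    """total_steps = ceil(dataset_size / batch_size) * num_epochs"""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    return steps_per_epoch * num_epochs


# Hypothetical dataset of 14952 samples with a batch size of 4.
print(steps_for_epochs(14952, 4, num_epochs=1))
```

The result can then be passed as `training.num_batches=<value>`.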

num_epochs

Type: integer

Range: 1 to 10

Default: 1

Number of complete passes through the dataset.

Typical:

- Most tasks: 1 epoch
- Small datasets: 3-5 epochs
- Large datasets: 1 epoch

Usually keep at 1 and adjust num_batches instead:

# Good: directly specify steps
training.num_batches=3738

# Less common: use epochs
training.num_epochs=2

checkpoint_dir

Type: string (path)

Default: ./checkpoints/ckpts/

Directory where model checkpoints are saved.

The path must be writable. The directory is created if it doesn't exist.

Absolute vs relative:

# Relative to project root
training.checkpoint_dir=./checkpoints/ckpts/

# Absolute path
training.checkpoint_dir=/full/path/to/checkpoints/ckpts/

Example:

python run_training.py training.checkpoint_dir="$HOME/my_models/checkpoints/"

save_interval_steps

Type: integer

Range: 1 to 1000

Default: 100

Save checkpoint every N steps.

Trade-offs:

- Frequent saves (50): more disk, better coverage, can resume from recent step
- Infrequent saves (1000): less disk, fewer checkpoints, coarser resume points

Disk usage estimate:

disk = (checkpoint_size) × (num_batches / save_interval_steps)

For 1B model (≈4GB checkpoint):

- save_interval_steps=50: 4GB × (3738/50) ≈ 300GB
- save_interval_steps=100: 4GB × (3738/100) ≈ 150GB
- save_interval_steps=500: 4GB × (3738/500) ≈ 30GB
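
The estimate above is easy to recompute for other settings. The sketch below applies the same formula. It assumes every checkpoint is kept and that all checkpoints are about the same size; the helper name is illustrative.

```python
def estimate_checkpoint_disk_gb(
    checkpoint_size_gb: float, num_batches: int, save_interval_steps: int
) -> float:
    """disk ≈ checkpoint_size × (num_batches / save_interval_steps)"""
    if save_interval_steps <= 0:
        raise ValueError("save_interval_steps must be positive")
    return checkpoint_size_gb * (num_batches / save_interval_steps)


# 1B model with checkpoints of about 4 GB, default num_batches.
for interval in (50, 100, 500):
    disk = estimate_checkpoint_disk_gb(4.0, 3738, interval)
    print(f"save_interval_steps={interval}: about {disk:.0f} GB")
```

Compare the result against free disk space before starting a long run.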

Example:

python run_training.py training.save_interval_steps=200

eval_interval_steps

Type: integer

Range: 100 to 5000 (quick tests may use smaller values, such as 10)

Default: 500

Evaluate on validation set every N steps.

A lower value means more frequent evaluation:

- Every 100 steps: frequent feedback, slower training
- Every 500 steps: good balance (default)
- Every 1000 steps: less frequent, faster training
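
To gauge the cost of a given interval, count how many evaluation runs it produces. The sketch below assumes one evaluation at every multiple of `eval_interval_steps`. The `seconds_per_eval` argument is a hypothetical measurement you would take from your own runs.

```python
def num_evaluations(num_batches: int, eval_interval_steps: int) -> int:
    """Count evaluation runs, assuming one at every multiple of the interval."""
    if eval_interval_steps <= 0:
        raise ValueError("eval_interval_steps must be positive")
    return num_batches // eval_interval_steps


def eval_overhead_minutes(
    num_batches: int, eval_interval_steps: int, seconds_per_eval: float
) -> float:
    """Total time spent evaluating, in minutes."""
    return num_evaluations(num_batches, eval_interval_steps) * seconds_per_eval / 60.0


for interval in (100, 500, 1000):
    runs = num_evaluations(3738, interval)
    print(f"eval_interval_steps={interval}: {runs} evaluation runs")
```

With the default 3738 steps, intervals of 100, 500, and 1000 give 37, 7, and 3 evaluation runs.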

Example:

python run_training.py training.eval_interval_steps=1000

log_interval_steps

Type: integer

Range: 1 to 100

Default: 10

Log metrics every N steps.

Controls how often training statistics are written to the log:

- Every 1 step: very verbose, can slow training
- Every 10 steps: good visibility (default)
- Every 100 steps: less detailed

Example:

python run_training.py training.log_interval_steps=5

seed

Type: integer

Default: 42

Random seed for reproducibility.

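With the same seed, configuration, hardware, and software versions, runs should produce the same results. Some GPU operations are nondeterministic, so small differences between runs can still occur.

To compare independent runs, change the seed:

python run_training.py training.seed=42
python run_training.py training.seed=43
python run_training.py training.seed=44

Reproducibility:

python run_training.py training.seed=42  # Run 1
python run_training.py training.seed=42  # Run 2 (same seed as Run 1)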

device

Type: string

Options: cuda, cpu

Default: cuda

Training device.

GPU (recommended):

python run_training.py training.device=cuda

CPU (very slow, for testing only):

python run_training.py training.device=cpu

Memory Optimization Parameters

gradient_accumulation_steps

Type: integer

Default: 1

Number of micro-batches whose gradients are accumulated before each optimizer update.

This increases the effective batch size without increasing GPU memory use:

effective_batch_size = micro_batch_size × gradient_accumulation_steps

Example:

# Effective batch size of 8 (2 × 4) on a memory-limited GPU
training.micro_batch_size=2
training.gradient_accumulation_steps=4
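
To see why accumulation behaves like a larger batch, the sketch below compares the gradient of one batch of 8 with the averaged gradients of 4 micro-batches of 2. It uses a toy one-parameter model in plain Python and is not the framework's implementation. It also assumes equal-sized micro-batches and a mean-reduced loss.

```python
def mse_gradient(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_gradient(
    w: float,
    xs: list[float],
    ys: list[float],
    micro_batch_size: int,
    accumulation_steps: int,
) -> float:
    """Average the gradients of `accumulation_steps` consecutive micro-batches."""
    total = 0.0
    for i in range(accumulation_steps):
        start = i * micro_batch_size
        stop = start + micro_batch_size
        # Dividing by accumulation_steps turns the running sum into an average.
        total += mse_gradient(w, xs[start:stop], ys[start:stop]) / accumulation_steps
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # data generated by the "true" weight 2.0
w = 0.5

full = mse_gradient(w, xs, ys)                 # one batch of 8
accum = accumulated_gradient(w, xs, ys, 2, 4)  # 4 micro-batches of 2
print(full, accum)
```

Both values agree, while only 2 samples need to be held in memory at a time.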

max_grad_norm (in optimizer config)

Type: float

Default: 0.1

Gradient clipping threshold.

A larger value means less clipping and more aggressive updates:

- 0.01: strong clipping, stable but slow
- 0.1: moderate clipping (default)
- 1.0: weak clipping, aggressive
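
The parameter name suggests clipping by global gradient norm. The sketch below shows that technique in plain Python. It is an assumption about the clipping style and not the optimizer's actual code.

```python
import math


def clip_by_global_norm(grads: list[float], max_grad_norm: float) -> list[float]:
    """Rescale grads so their L2 norm does not exceed max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_grad_norm:
        return list(grads)  # already within the threshold, leave unchanged
    scale = max_grad_norm / norm
    return [g * scale for g in grads]


grads = [3.0, 4.0]  # global norm is 5.0
for threshold in (0.01, 0.1, 1.0):
    clipped = clip_by_global_norm(grads, threshold)
    print(threshold, [round(g, 4) for g in clipped])
```

The direction of the gradient is preserved and only its length is capped, which is why a smaller threshold leads to smaller, more conservative updates.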

Complete Training Configuration Example

# conf/training/default.yaml
micro_batch_size: 4
num_batches: 3738
num_epochs: 1
checkpoint_dir: ./checkpoints/ckpts/
save_interval_steps: 100
eval_interval_steps: 500
log_interval_steps: 10
seed: 42
device: cuda

Custom training config:

# conf/training/aggressive.yaml
micro_batch_size: 8
num_batches: 1000
num_epochs: 1
checkpoint_dir: ./checkpoints/ckpts/
save_interval_steps: 50
eval_interval_steps: 100
log_interval_steps: 5
seed: 42
device: cuda

Use:

python run_training.py training=aggressive
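
Before launching with a custom config, it can help to check its values against the ranges documented in this reference. The sketch below works on a plain dictionary and is not part of the framework. Whether the framework itself enforces these ranges is not stated here, so the results are best treated as warnings.

```python
# Documented ranges from the parameter reference: name -> (min, max).
RANGES = {
    "micro_batch_size": (1, 16),
    "num_batches": (10, 100000),
    "num_epochs": (1, 10),
    "save_interval_steps": (1, 1000),
    "eval_interval_steps": (100, 5000),
    "log_interval_steps": (1, 100),
}
DEVICES = {"cuda", "cpu"}


def validate_training_config(cfg: dict) -> list[str]:
    """Return a list of warnings; an empty list means no issues were found."""
    warnings = []
    for key, (low, high) in RANGES.items():
        value = cfg.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            warnings.append(f"{key}: expected an integer, got {value!r}")
        elif not low <= value <= high:
            warnings.append(f"{key}: {value} is outside {low} to {high}")
    if cfg.get("device") not in DEVICES:
        warnings.append(f"device: expected one of {sorted(DEVICES)}")
    return warnings


# Values copied from conf/training/aggressive.yaml above.
aggressive = {
    "micro_batch_size": 8,
    "num_batches": 1000,
    "num_epochs": 1,
    "checkpoint_dir": "./checkpoints/ckpts/",
    "save_interval_steps": 50,
    "eval_interval_steps": 100,
    "log_interval_steps": 5,
    "seed": 42,
    "device": "cuda",
}
print(validate_training_config(aggressive))
```

The aggressive config produces no warnings, since each value sits inside its documented range.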

Common Configuration Patterns

Quick Test (10 steps)

python run_training.py \
    training.micro_batch_size=2 \
    training.num_batches=10 \
    training.save_interval_steps=10 \
    training.eval_interval_steps=10

Or use experiment:

python run_training.py +experiment=quick_test

Memory Constrained (11GB GPU)

python run_training.py \
    model=gemma3_270m \
    training.micro_batch_size=1 \
    training.num_batches=3738

High Performance (H100 GPU)

python run_training.py \
    model=gemma3_4b \
    training.micro_batch_size=8 \
    training.num_batches=10000 \
    training.eval_interval_steps=1000

Production Training

python run_training.py \
    model=gemma3_1b \
    training.micro_batch_size=4 \
    training.num_batches=10000 \
    training.save_interval_steps=100

Resuming Training

Training resumes automatically from the latest checkpoint in checkpoint_dir:

python run_training.py training.checkpoint_dir=./checkpoints/ckpts/

Or from a specific directory:

python run_training.py training.checkpoint_dir=/path/to/checkpoints/ckpts/

The framework:

  1. Finds latest checkpoint in directory

  2. Loads model weights

  3. Continues training from that step
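
The first step above can be sketched as follows. This assumes each checkpoint is a sub-directory named after its step number, which is a guess at the on-disk layout and not something this reference confirms. The function name is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(checkpoint_dir: Path) -> Optional[Path]:
    """Return the sub-directory with the highest numeric name, or None."""
    if not checkpoint_dir.is_dir():
        return None
    step_dirs = [
        p for p in checkpoint_dir.iterdir() if p.is_dir() and p.name.isdigit()
    ]
    if not step_dirs:
        return None
    # Compare as integers so that step 1000 ranks above step 200.
    return max(step_dirs, key=lambda p: int(p.name))


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for step in (100, 200, 1000):
        (root / str(step)).mkdir()
    latest = find_latest_checkpoint(root)
    print(latest.name)  # prints 1000
```

Comparing names as integers matters: sorted as strings, "200" would rank above "1000".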

Training Dynamics

How configuration affects training:

  1. Learning Rate + Batch Size:

    Larger batch → can use higher learning rate
    Smaller batch → need lower learning rate
    
  2. Warmup + Learning Rate:

    Longer warmup (higher warmup_ratio) → more stable
    Short warmup → faster convergence but less stable
    
  3. Number of Batches + Evaluation Interval:

    More batches → longer training, more progress
    Less frequent eval → faster training but less monitoring
    
  4. LoRA Rank + Learning Rate:

    Higher rank → more parameters, may need lower LR
    Lower rank → fewer parameters, can use higher LR
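
To make the warmup interaction concrete, the sketch below shows one common way a warmup ratio maps onto steps. It assumes `warmup_ratio` is a fraction of `num_batches` and that warmup is linear. The framework's scheduler may differ, and both function names are hypothetical.

```python
def warmup_steps(num_batches: int, warmup_ratio: float) -> int:
    """Number of warmup steps for a given ratio of total steps."""
    if not 0.0 <= warmup_ratio <= 1.0:
        raise ValueError("warmup_ratio must be between 0 and 1")
    return int(num_batches * warmup_ratio)


def linear_warmup_lr(step: int, base_lr: float, num_warmup_steps: int) -> float:
    """Ramp the learning rate linearly up to base_lr during warmup."""
    if num_warmup_steps <= 0 or step >= num_warmup_steps:
        return base_lr
    return base_lr * (step + 1) / num_warmup_steps


n_warmup = warmup_steps(3738, 0.1)
print(n_warmup)
for step in (0, n_warmup // 2, n_warmup):
    print(step, linear_warmup_lr(step, 1e-5, n_warmup))
```

A higher ratio stretches the ramp over more steps, so early updates stay small for longer.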
    

Checkpoint Management

Disk Space Required

total_disk ≈ checkpoint_size × (num_batches / save_interval_steps)

For 1B model (≈4GB):

3738 steps, save every 100 steps
total_disk ≈ 4GB × (3738/100) ≈ 150GB

Keeping Only Important Checkpoints

Save less frequently:

python run_training.py training.save_interval_steps=500

Or manually delete old checkpoints:

# Keep only last 5 checkpoints
ls -dt checkpoints/ckpts/actor/*/ | tail -n +6 | xargs rm -rf
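
The shell one-liner above deletes immediately and can misbehave on paths containing spaces. A Python equivalent with a dry-run mode is sketched below. It is not a framework utility. It keeps the most recently modified sub-directories, matching the `ls -dt` ordering.

```python
import shutil
from pathlib import Path


def prune_checkpoints(root: Path, keep: int = 5, dry_run: bool = True) -> list[Path]:
    """Remove all but the `keep` most recently modified sub-directories.

    Returns the stale directories. With dry_run=True nothing is deleted.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    dirs = sorted(
        (p for p in root.iterdir() if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first, like `ls -dt`
    )
    stale = dirs[keep:]
    if not dry_run:
        for path in stale:
            shutil.rmtree(path)
    return stale


root = Path("checkpoints/ckpts/actor")
if root.is_dir():
    for path in prune_checkpoints(root, keep=5, dry_run=True):
        print("would delete:", path)
```

Review the dry-run output first, then call the function again with `dry_run=False` to delete.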

Finding Checkpoint Sizes

du -sh checkpoints/ckpts/actor/*/

Monitoring Training Progress

Check Training Loss

tail -f outputs/tunix-grpo/YYYY-MM-DD/HH-MM-SS/train.log

Use Weights & Biases

Weights & Biases logging is enabled by default. View runs at https://wandb.ai

Use TensorBoard

make tensorboard
# Open http://localhost:6006

Integration with Other Configs

Training settings interact with:

  1. Model: Larger models need smaller batch sizes

  2. GRPO: num_generations multiplies memory usage

  3. Optimizer: Learning rate should be tuned together with batch size

  4. Scheduler: warmup_ratio affects convergence

Coordinated tuning example:

# For 4GB GPU
python run_training.py \
    model=gemma3_270m \
    model.lora_rank=8 \
    training.micro_batch_size=1 \
    grpo.num_generations=2 \
    optimizer.learning_rate=1e-5

# For 80GB GPU
python run_training.py \
    model=gemma3_4b \
    model.lora_rank=64 \
    training.micro_batch_size=8 \
    grpo.num_generations=4 \
    optimizer.learning_rate=3e-6
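
A rough way to compare the memory pressure of the two setups is to count the sequences handled per micro-batch. The sketch assumes GRPO samples `num_generations` completions for each prompt in the micro-batch, so the two settings multiply. It ignores model size, LoRA rank, and sequence length.

```python
def sequences_per_micro_batch(micro_batch_size: int, num_generations: int) -> int:
    """Sequences processed together, assuming one group per prompt."""
    return micro_batch_size * num_generations


# Settings taken from the two coordinated tuning examples above.
small_gpu = sequences_per_micro_batch(micro_batch_size=1, num_generations=2)
large_gpu = sequences_per_micro_batch(micro_batch_size=8, num_generations=4)
print(small_gpu, large_gpu)
```

Under this assumption the large-GPU example handles 32 sequences at once, compared with 2 for the small-GPU example.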

Next Steps