Hyperparameter Tuning

Overview

Hyperparameter tuning is critical for model performance. This guide covers common tuning strategies.

Key Hyperparameters

Learning Rate

Controls update step size. Too high causes divergence, too low causes slow convergence.

Range: 1e-7 to 1e-4

Default: 3e-6

Sweep:

python run_training.py --multirun optimizer.learning_rate=1e-7,1e-6,1e-5,1e-4

Batch Size

Larger batches provide better gradient estimates but require more memory.

Range: 1 to 8 (depending on GPU)

Default: 4

Sweep:

python run_training.py --multirun training.micro_batch_size=1,2,4

LoRA Rank

Higher rank provides more capacity but requires more memory and computation.

Range: 4 to 64

Default: 32

Sweep:

python run_training.py --multirun model.lora_rank=8,16,32,64

Number of Generations

More generations provide better reward signal but increase computation.

Range: 1 to 8

Default: 4

Sweep:

python run_training.py --multirun grpo.num_generations=2,4,8

KL Beta

Strength of KL divergence penalty. Higher values keep closer to reference model.

Range: 0.01 to 1.0

Default: 0.08

Sweep:

python run_training.py --multirun grpo.beta=0.01,0.05,0.1,0.5

Tuning Strategies

1. Learning Rate Search

Find optimal learning rate first:

python run_training.py --multirun optimizer.learning_rate=1e-7,1e-6,1e-5,1e-4

Monitor loss curves and pick best one.

2. Batch Size vs Learning Rate

Larger batches often allow higher learning rates:

python run_training.py --multirun training.micro_batch_size=1,2,4 optimizer.learning_rate=1e-6,3e-6,1e-5

3. Model Capacity

Test different model sizes:

python run_training.py --multirun model=gemma3_270m,gemma3_1b

4. Algorithm Parameters

Tune GRPO-specific parameters:

python run_training.py --multirun grpo.num_generations=2,4,8 grpo.beta=0.01,0.1,1.0

Monitoring During Tuning

1. Watch logs:

tail -f outputs/tunix-grpo/YYYY-MM-DD/HH-MM-SS/train.log

2. Use Weights & Biases

Compare runs at: https://wandb.ai

3. Check tensorboard:

make tensorboard

Common Tuning Issues

Loss Not Decreasing

Learning rate too low: increase to 1e-5
Batch size too small: increase to 2 or 4
Model too small: try 1b model

Loss Diverging (NaN)

Learning rate too high: reduce to 1e-7
Gradient clipping insufficient: reduce max_grad_norm
Batch size too large: reduce

Slow Convergence

Learning rate too low
Batch size too small
Not enough generations

Mode Collapse

Model stops improving. Try:

Increase diversity: higher temperature
Modify reward function
Change LoRA rank

Efficient Search

Grid Search

Systematic search over parameter combinations:

python run_training.py --multirun \
    optimizer.learning_rate=1e-6,3e-6,1e-5 \
    training.micro_batch_size=1,2,4

Random Search

Random sampling of parameter space:

python run_training.py --multirun \
    optimizer.learning_rate='log_uniform(1e-7,1e-4)' \
    training.micro_batch_size='choice(1,2,4)'

Early Stopping

Stop unpromising runs early:

# Quick test first
python run_training.py +experiment=quick_test model=gemma3_1b

# Then full training only for promising configs
python run_training.py model=gemma3_1b

Memory-Aware Tuning

For 11GB GPU (RTX 2080 Ti):

model.lora_rank: 8-16
training.micro_batch_size: 1
model: gemma3_270m
grpo.num_generations: 2

For 48GB GPU (RTX A6000):

model.lora_rank: 32-64
training.micro_batch_size: 4
model: gemma3_1b
grpo.num_generations: 4

For 80GB GPU (H100):

model.lora_rank: 64
training.micro_batch_size: 8
model: gemma3_4b
grpo.num_generations: 8

Example Tuning Workflow

Phase 1: Learning Rate Search (1 hour):

python run_training.py +experiment=quick_test --multirun \
    optimizer.learning_rate=1e-7,1e-6,1e-5,1e-4

Phase 2: Batch Size Tuning (2 hours):

python run_training.py --multirun \
    training.micro_batch_size=1,2,4 \
    optimizer.learning_rate=3e-6

Phase 3: Model Size (varies):

python run_training.py --multirun \
    model=gemma3_270m,gemma3_1b \
    training.micro_batch_size=1

Phase 4: Final Training

Use best parameters from previous phases:

python run_training.py \
    model=gemma3_1b \
    optimizer.learning_rate=1e-5 \
    training.micro_batch_size=2

Best Practices

Start simple: Tune one parameter at a time
Use short runs: Test with +experiment=quick_test
Log everything: Enable W&B logging
Save results: Document best configurations
Reproduce winners: Verify best configs on full runs
Monitor hardware: Check GPU memory and utilization

Hyperparameter Tuning

Overview

Key Hyperparameters

Tuning Strategies

Monitoring During Tuning

Common Tuning Issues

Efficient Search

Memory-Aware Tuning

Example Tuning Workflow

Best Practices

Next Steps