Hyperparameter Tuning

Overview

Hyperparameter tuning is critical for model performance. This guide covers common tuning strategies.

Key Hyperparameters

Learning Rate

Controls update step size. Too high causes divergence, too low causes slow convergence.

Range: 1e-7 to 1e-4

Default: 3e-6

Sweep:

python run_training.py --multirun optimizer.learning_rate=1e-7,1e-6,1e-5,1e-4

Batch Size

Larger batches provide better gradient estimates but require more memory.

Range: 1 to 8 (depending on GPU)

Default: 4

Sweep:

python run_training.py --multirun training.micro_batch_size=1,2,4

LoRA Rank

Higher rank provides more capacity but requires more memory and computation.

Range: 4 to 64

Default: 32

Sweep:

python run_training.py --multirun model.lora_rank=8,16,32,64

Number of Generations

More generations provide better reward signal but increase computation.

Range: 1 to 8

Default: 4

Sweep:

python run_training.py --multirun grpo.num_generations=2,4,8

KL Beta

Strength of KL divergence penalty. Higher values keep closer to reference model.

Range: 0.01 to 1.0

Default: 0.08

Sweep:

python run_training.py --multirun grpo.beta=0.01,0.05,0.1,0.5

Tuning Strategies

1. Learning Rate Search

Find optimal learning rate first:

python run_training.py --multirun optimizer.learning_rate=1e-7,1e-6,1e-5,1e-4

Monitor loss curves and pick best one.

2. Batch Size vs Learning Rate

Larger batches often allow higher learning rates:

python run_training.py --multirun training.micro_batch_size=1,2,4 optimizer.learning_rate=1e-6,3e-6,1e-5

3. Model Capacity

Test different model sizes:

python run_training.py --multirun model=gemma3_270m,gemma3_1b

4. Algorithm Parameters

Tune GRPO-specific parameters:

python run_training.py --multirun grpo.num_generations=2,4,8 grpo.beta=0.01,0.1,1.0

Monitoring During Tuning

1. Watch logs:

tail -f outputs/tunix-grpo/YYYY-MM-DD/HH-MM-SS/train.log

2. Use Weights & Biases

Compare runs at: https://wandb.ai

3. Check tensorboard:

make tensorboard

Common Tuning Issues

Loss Not Decreasing

  • Learning rate too low: increase to 1e-5

  • Batch size too small: increase to 2 or 4

  • Model too small: try 1b model

Loss Diverging (NaN)

  • Learning rate too high: reduce to 1e-7

  • Gradient clipping insufficient: reduce max_grad_norm

  • Batch size too large: reduce

Slow Convergence

  • Learning rate too low

  • Batch size too small

  • Not enough generations

Mode Collapse

Model stops improving. Try:

  • Increase diversity: higher temperature

  • Modify reward function

  • Change LoRA rank

Memory-Aware Tuning

For 11GB GPU (RTX 2080 Ti):

model.lora_rank: 8-16
training.micro_batch_size: 1
model: gemma3_270m
grpo.num_generations: 2

For 48GB GPU (RTX A6000):

model.lora_rank: 32-64
training.micro_batch_size: 4
model: gemma3_1b
grpo.num_generations: 4

For 80GB GPU (H100):

model.lora_rank: 64
training.micro_batch_size: 8
model: gemma3_4b
grpo.num_generations: 8

Example Tuning Workflow

Phase 1: Learning Rate Search (1 hour):

python run_training.py +experiment=quick_test --multirun \
    optimizer.learning_rate=1e-7,1e-6,1e-5,1e-4

Phase 2: Batch Size Tuning (2 hours):

python run_training.py --multirun \
    training.micro_batch_size=1,2,4 \
    optimizer.learning_rate=3e-6

Phase 3: Model Size (varies):

python run_training.py --multirun \
    model=gemma3_270m,gemma3_1b \
    training.micro_batch_size=1

Phase 4: Final Training

Use best parameters from previous phases:

python run_training.py \
    model=gemma3_1b \
    optimizer.learning_rate=1e-5 \
    training.micro_batch_size=2

Best Practices

  1. Start simple: Tune one parameter at a time

  2. Use short runs: Test with +experiment=quick_test

  3. Log everything: Enable W&B logging

  4. Save results: Document best configurations

  5. Reproduce winners: Verify best configs on full runs

  6. Monitor hardware: Check GPU memory and utilization

Next Steps