Optimizer Configuration Reference
This section details optimizer configuration options and strategies.
Available Optimizers
AdamW (Adam with Decoupled Weight Decay)
The default optimizer for Agent-Tunix.
Configuration file: conf/optimizer/adamw.yaml
optimizer_name: adamw
learning_rate: 3e-6
weight_decay: 0.01
betas: [0.9, 0.999]
eps: 1e-8
warmup_ratio: 0.1
max_grad_norm: 0.1
Use:
python run_training.py optimizer=adamw
Why AdamW?
Adaptive learning rates per parameter
Decoupled weight decay (applied directly to the weights instead of being folded into the gradient as an L2 penalty)
Good convergence properties
Standard in modern deep learning
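To make the "decoupled" part concrete, here is a minimal sketch of one AdamW update for a single scalar parameter, using the defaults from adamw.yaml. The function name adamw_step is hypothetical, and this is the textbook update rule, not necessarily the exact code path Agent-Tunix runs.

```python
import math

def adamw_step(param, grad, m, v, step, lr=3e-6, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One textbook AdamW update for a scalar parameter (illustrative only)."""
    beta1, beta2 = betas
    # Update the biased first and second moment estimates.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias-correct the moments (step counts from 1).
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    # Decoupled weight decay: shrink the parameter directly,
    # rather than adding weight_decay * param to the gradient.
    param = param - lr * weight_decay * param
    # Adaptive step, scaled per parameter by the second-moment estimate.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

p, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)
print(p, m, v)
```

Because the decay term never passes through the division by sqrt(v_hat), its strength does not depend on the gradient history of the parameter.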
Optimizer Configuration Parameters
optimizer_name
Type: string
Default: adamw
Optimizer algorithm name.
Example:
optimizer_name: adamw
learning_rate
Type: float
Range: 1e-7 to 1e-4 (typical)
Default: 3e-6
Controls the size of each optimizer step. This is a critical hyperparameter: tune it before any other setting on this page.
Guidance by model size:
- 270M model: 1e-5 to 3e-5
- 1B model: 1e-6 to 1e-5
- 4B model: 1e-6 to 3e-6
Too high (divergence):
loss → NaN or ∞
Solution: reduce by 10× (e.g., 1e-5 → 1e-6)
Too low (slow training):
loss decreases very slowly
Solution: increase by 10× (e.g., 1e-7 → 1e-6)
Example:
python run_training.py optimizer.learning_rate=1e-5
weight_decay
Type: float
Range: 0.0 to 0.1
Default: 0.01
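Weight decay strength. Shrinks weights toward zero at every step, which penalizes large weights. With AdamW the decay is decoupled from the gradient update, so it acts like L2 regularization without being identical to an L2 penalty on the loss.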
Effects:
Higher: stronger regularization, less overfitting
Lower: more flexibility, potential overfitting
Typical values:
- No regularization needed: 0.0
- Standard: 0.01
- Strong regularization: 0.05-0.1
Example:
python run_training.py optimizer.weight_decay=0.05
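As a rough feel for these magnitudes, the sketch below computes how much decoupled weight decay alone would shrink a weight over many steps. It assumes the decay is multiplied by the learning rate at each step (as in common AdamW implementations) and holds the learning rate constant at the default, ignoring the schedule; decay_shrink_factor is a hypothetical helper, not part of Agent-Tunix.

```python
def decay_shrink_factor(learning_rate, weight_decay, num_steps):
    """Cumulative multiplicative shrink from decoupled weight decay alone.

    Assumes each step multiplies the weight by (1 - learning_rate * weight_decay).
    """
    return (1.0 - learning_rate * weight_decay) ** num_steps

for wd in (0.0, 0.01, 0.05, 0.1):
    factor = decay_shrink_factor(3e-6, wd, num_steps=10_000)
    print(f"weight_decay={wd}: weights scaled by {factor:.6f}")
```

Under these assumptions the shrink is small at learning rates around 3e-6, so changes to weight_decay may take many steps to show up in the loss curve.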
betas
Type: list[float, float]
Default: [0.9, 0.999]
Exponential moving average coefficients for gradient moments.
Format: [beta1, beta2]
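beta1 (momentum): decay rate of the first-moment (gradient mean) moving average
beta2 (second moment): decay rate of the second-moment (squared gradient) moving average
Higher values average over more past steps (smoother, slower to react); lower values track recent gradients more closely.
Typical values:
- Standard: [0.9, 0.999]
- Smoother momentum, faster-adapting second moment: [0.95, 0.99]
- More responsive momentum: [0.8, 0.999]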
Default usually works well. Change only if:
Training unstable → try [0.95, 0.99]
Converging too slowly → try [0.8, 0.999]
Example:
python run_training.py 'optimizer.betas=[0.95,0.99]'
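One way to compare beta settings is the rule of thumb that an exponential moving average with coefficient beta averages over roughly 1 / (1 - beta) recent steps. The sketch below applies that approximation to the three settings listed above; ema_horizon is a hypothetical helper name.

```python
def ema_horizon(beta):
    """Approximate number of recent steps an EMA with coefficient beta averages over."""
    return 1.0 / (1.0 - beta)

for beta1, beta2 in [(0.9, 0.999), (0.95, 0.99), (0.8, 0.999)]:
    print(f"betas=[{beta1}, {beta2}]: "
          f"first moment ~{ema_horizon(beta1):.0f} steps, "
          f"second moment ~{ema_horizon(beta2):.0f} steps")
```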
eps
Type: float
Default: 1e-8
Small value to prevent division by zero in adaptive learning rates.
Rarely needs adjustment. Increase it only if you observe numerical instability:
python run_training.py optimizer.eps=1e-6
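The sketch below shows where eps enters the standard Adam-style step and why a larger value damps updates for parameters with very small gradients. It is a simplified scalar illustration with a hypothetical function name, not the project's implementation.

```python
import math

def adaptive_step(m_hat, v_hat, lr=3e-6, eps=1e-8):
    """Size of the Adam-style update for bias-corrected moments m_hat and v_hat."""
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# A parameter whose gradient is consistently tiny: m_hat ~ g, v_hat ~ g**2.
tiny = 1e-12
step_default = adaptive_step(tiny, tiny ** 2, eps=1e-8)
step_larger_eps = adaptive_step(tiny, tiny ** 2, eps=1e-6)
print(step_default, step_larger_eps)
```

With the larger eps the denominator is dominated by eps rather than by the noisy second-moment estimate, so the step for near-zero gradients shrinks.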
warmup_ratio
Type: float
Range: 0.0 to 1.0
Default: 0.1
Fraction of training steps devoted to warmup.
Effect: Learning rate gradually increases from 0 to peak during warmup.
Benefits:
- Stabilizes early training
- Reduces the risk of exploding gradients
- Often improves final model quality
Common values:
- No warmup: 0.0
- Light warmup (5%): 0.05
- Standard (10%): 0.1
- Strong warmup (20%): 0.2
For short training (quick_test):
python run_training.py +experiment=quick_test optimizer.warmup_ratio=0.0
Example:
python run_training.py optimizer.warmup_ratio=0.2
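For a sense of what a given ratio means in steps, here is a small sketch that converts warmup_ratio into a warmup step count. The helper name and the choice to round to the nearest step are assumptions; the project may truncate instead.

```python
def warmup_steps_from_ratio(total_steps, warmup_ratio=0.1):
    """Number of warmup steps implied by warmup_ratio (rounded to nearest step)."""
    if not 0.0 <= warmup_ratio <= 1.0:
        raise ValueError("warmup_ratio must be between 0.0 and 1.0")
    return round(total_steps * warmup_ratio)

for ratio in (0.0, 0.05, 0.1, 0.2):
    print(f"warmup_ratio={ratio}: {warmup_steps_from_ratio(1000, ratio)} of 1000 steps")
```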
max_grad_norm
Type: float
Range: 0.01 to 1.0
Default: 0.1
Gradient clipping threshold. Limits gradient magnitude to prevent exploding gradients.
Effect: If ||gradient|| > max_grad_norm, the gradient is rescaled so that its norm equals max_grad_norm. Gradients at or below the threshold pass through unchanged.
When needed:
- NaN/Inf loss → reduce to 0.01
- Unstable training → reduce to 0.05
- Smooth training → 0.1 (default)
Example:
python run_training.py optimizer.max_grad_norm=0.05
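The clipping rule can be sketched in a few lines. This version clips by the global norm over a flat list of gradient values, which is an assumption about how the norm is computed; the function name is hypothetical.

```python
import math

def clip_by_global_norm(grads, max_grad_norm=0.1):
    """Rescale grads so their combined L2 norm does not exceed max_grad_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_grad_norm:
        # Already within the threshold: leave gradients untouched.
        return list(grads), total_norm
    scale = max_grad_norm / total_norm
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([0.3, 0.4], max_grad_norm=0.1)
print(norm_before, clipped)
```

Clipping changes only the length of the gradient vector, not its direction, so the relative size of each component is preserved.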
Learning Rate Scheduling
The optimizer is combined with a scheduler configuration:
# conf/scheduler/warmup_cosine.yaml
scheduler_name: warmup_cosine
warmup_steps: null # Computed from warmup_ratio
total_steps: null # Computed from num_batches
lr_min: 1e-7
Default schedule: Warmup → Cosine decay
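Warmup phase (t < warmup_steps):
lr(t) = learning_rate × (t / warmup_steps)
Cosine decay phase (t ≥ warmup_steps):
lr(t) = lr_min + 0.5 × (learning_rate - lr_min) × (1 + cos(π × progress))
Here learning_rate is the peak rate and progress rises from 0 at the end of warmup to 1 at total_steps, so the rate falls from learning_rate to lr_min.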
This provides:
Stability in early training (warmup)
Gradual cooling for convergence (cosine)
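The two phases can be combined into one function. The sketch below follows the warmup and cosine formulas given in this section and assumes progress is measured linearly between warmup_steps and total_steps; the function name is hypothetical.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, lr_peak=3e-6, lr_min=1e-7):
    """Learning rate at a given step: linear warmup, then cosine decay to lr_min."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return lr_peak * step / warmup_steps
    # Fraction of the decay phase completed, clamped to 1.0.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return lr_min + 0.5 * (lr_peak - lr_min) * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 100, 550, 1000):
    print(step, warmup_cosine_lr(step, total_steps=1000, warmup_steps=100))
```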
Optimizer Tuning Workflow
Step 1: Find Baseline Learning Rate
Quick search with small dataset:
python run_training.py +experiment=quick_test \
--multirun optimizer.learning_rate=1e-7,1e-6,1e-5,1e-4
Compare the training loss curves and pick the learning rate whose loss decreases fastest without diverging.
Step 2: Fine-tune Around Best Learning Rate
Sweep a narrower range around the best value from Step 1:
python run_training.py --multirun \
optimizer.learning_rate=1e-6,3e-6,1e-5,3e-5
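If you prefer to generate the candidate values rather than type them, a log-spaced grid is one option. The helper below is a hypothetical convenience, not part of Agent-Tunix; it only builds the comma-separated value string to paste after optimizer.learning_rate=.

```python
def log_grid(low, high, num):
    """Return num values spaced evenly on a log scale from low to high inclusive."""
    if num < 2:
        raise ValueError("num must be at least 2")
    ratio = (high / low) ** (1.0 / (num - 1))
    return [low * ratio ** i for i in range(num)]

values = log_grid(1e-6, 3e-5, 4)
# Format as a comma-separated override value.
print("optimizer.learning_rate=" + ",".join(f"{v:.1e}" for v in values))
```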
Step 3: Tune Warmup Ratio
Try different warmup values:
python run_training.py \
optimizer.learning_rate=3e-6 \
--multirun optimizer.warmup_ratio=0.05,0.1,0.2
Step 4: Tune Weight Decay
Increase if overfitting, reduce if underfitting:
python run_training.py \
optimizer.learning_rate=3e-6 \
--multirun optimizer.weight_decay=0.0,0.01,0.05
Step 5: Full Training
Use best parameters:
python run_training.py \
optimizer.learning_rate=3e-6 \
optimizer.warmup_ratio=0.1 \
optimizer.weight_decay=0.01
Common Configuration Examples
Conservative (stable, slow)
optimizer:
learning_rate: 1e-6
warmup_ratio: 0.2
weight_decay: 0.05
max_grad_norm: 0.05
Use:
python run_training.py \
optimizer.learning_rate=1e-6 \
optimizer.warmup_ratio=0.2 \
optimizer.weight_decay=0.05 \
optimizer.max_grad_norm=0.05
Balanced (recommended)
optimizer:
learning_rate: 3e-6
warmup_ratio: 0.1
weight_decay: 0.01
max_grad_norm: 0.1
Use:
python run_training.py optimizer=adamw # Uses defaults
Aggressive (fast, risky)
optimizer:
learning_rate: 1e-5
warmup_ratio: 0.05
weight_decay: 0.0
max_grad_norm: 0.1
Use:
python run_training.py \
optimizer.learning_rate=1e-5 \
optimizer.warmup_ratio=0.05 \
optimizer.weight_decay=0.0
Diagnosing Optimizer Issues
Loss Diverging (NaN/Inf)
Cause: Learning rate too high
Solution: lower the learning rate, or tighten gradient clipping:
python run_training.py optimizer.learning_rate=1e-7
python run_training.py optimizer.max_grad_norm=0.01
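To find where a run diverged, it can help to locate the first non-finite loss in the logged values. The sketch below assumes you already have the per-step losses as a Python list; how you extract them from your logs is left open, and the function name is hypothetical.

```python
import math

def first_nonfinite_step(losses):
    """Return the index of the first NaN or Inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None

history = [2.31, 2.05, 1.98, float("inf"), float("nan")]
print(first_nonfinite_step(history))
```

Knowing the step helps separate divergence during warmup from divergence near the peak learning rate.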
Loss Not Decreasing
Cause: Learning rate too low, or the model's parameters are not being updated
Solution:
python run_training.py optimizer.learning_rate=1e-4
Oscillating Loss (high variance)
Cause: Learning rate close to the stability limit, or warmup too short
Solution:
python run_training.py \
optimizer.learning_rate=1e-6 \
optimizer.warmup_ratio=0.2
Slow Convergence
Cause: Learning rate too low or weight decay too high
Solution:
python run_training.py \
optimizer.learning_rate=1e-5 \
optimizer.weight_decay=0.0
Complete Optimizer Configuration Example
# conf/optimizer/custom.yaml
optimizer_name: adamw
learning_rate: 5e-6
weight_decay: 0.02
betas: [0.9, 0.999]
eps: 1e-8
warmup_ratio: 0.15
max_grad_norm: 0.1
Use:
python run_training.py optimizer=custom
Or override directly:
python run_training.py \
optimizer.learning_rate=5e-6 \
optimizer.warmup_ratio=0.15
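Before launching a long run, it can be useful to check a custom configuration against the ranges documented on this page. The dataclass below is a hypothetical sketch that mirrors the fields of conf/optimizer/custom.yaml; it is not the project's own config class, and the bounds are the "typical" ranges listed above, not hard limits.

```python
from dataclasses import dataclass

@dataclass
class OptimizerConfig:
    """Mirror of the optimizer YAML fields, with this page's defaults."""
    optimizer_name: str = "adamw"
    learning_rate: float = 3e-6
    weight_decay: float = 0.01
    betas: tuple = (0.9, 0.999)
    eps: float = 1e-8
    warmup_ratio: float = 0.1
    max_grad_norm: float = 0.1

    def problems(self):
        """Return warnings for values outside the documented ranges."""
        found = []
        if not 1e-7 <= self.learning_rate <= 1e-4:
            found.append("learning_rate outside typical range 1e-7 to 1e-4")
        if not 0.0 <= self.weight_decay <= 0.1:
            found.append("weight_decay outside range 0.0 to 0.1")
        if len(self.betas) != 2 or not all(0.0 <= b < 1.0 for b in self.betas):
            found.append("betas must be two values in [0, 1)")
        if self.eps <= 0.0:
            found.append("eps must be positive")
        if not 0.0 <= self.warmup_ratio <= 1.0:
            found.append("warmup_ratio outside range 0.0 to 1.0")
        if not 0.01 <= self.max_grad_norm <= 1.0:
            found.append("max_grad_norm outside range 0.01 to 1.0")
        return found

custom = OptimizerConfig(learning_rate=5e-6, weight_decay=0.02, warmup_ratio=0.15)
print(custom.problems())
```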
Interaction with Other Settings
Optimizer settings interact with:
Model size: Larger models generally need lower learning rates
Batch size: Larger batches can usually tolerate higher learning rates
LoRA rank: Doesn't directly affect learning rate
Data size: Smaller datasets often call for lower learning rates
Example adjustment:
# Large model, small batch → lower LR
python run_training.py \
model=gemma3_4b \
training.micro_batch_size=1 \
optimizer.learning_rate=1e-6
# Small model, large batch → higher LR
python run_training.py \
model=gemma3_270m \
training.micro_batch_size=8 \
optimizer.learning_rate=1e-4
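When changing batch size, two common heuristics for a starting learning rate are linear scaling and square-root scaling with the batch-size ratio. The sketch below implements both as a hypothetical helper. Neither rule comes from Agent-Tunix, so treat the result as a starting point for a sweep, not a recommended value.

```python
import math

def scale_lr(base_lr, base_batch_size, new_batch_size, rule="sqrt"):
    """Suggest a starting learning rate after a batch-size change (heuristic)."""
    ratio = new_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule}")

print(scale_lr(1e-6, base_batch_size=1, new_batch_size=4, rule="sqrt"))
print(scale_lr(1e-6, base_batch_size=1, new_batch_size=4, rule="linear"))
```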
Next Steps
Configuration Overview - Configuration overview
Hyperparameter Tuning - Tuning strategies
Configuration Guide - Configuration guide
Training API - Training API reference