Optimizer Configuration Reference
==================================

This section details optimizer configuration options and strategies.

Available Optimizers
--------------------

AdamW (Adam with Decoupled Weight Decay)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The default optimizer for Agent-Tunix.

Configuration file: ``conf/optimizer/adamw.yaml``

::

    optimizer_name: adamw
    learning_rate: 3e-6
    weight_decay: 0.01
    betas: [0.9, 0.999]
    eps: 1e-8
    warmup_ratio: 0.1
    max_grad_norm: 0.1

Use::

    python run_training.py optimizer=adamw

**Why AdamW?**

- Adaptive learning rates per parameter
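- Decoupled weight decay: decay is applied directly to the weights instead of being added to the gradient as an L2 penalty, so it is not rescaled by the adaptive step size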
- Good convergence properties
- Standard in modern deep learning
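
The properties above can be illustrated with a plain-Python sketch of a single AdamW update on a flat list of floats. This is not Agent-Tunix's actual implementation: the function name ``adamw_step`` is hypothetical, and the sketch assumes the decay term is scaled by the learning rate, as in common AdamW implementations.

.. code-block:: python

    import math


    def adamw_step(params, grads, m, v, step, lr=3e-6, weight_decay=0.01,
                   betas=(0.9, 0.999), eps=1e-8):
        """Apply one AdamW update to flat lists of floats.

        ``step`` is the 1-based update count used for bias correction.
        Returns the updated (params, m, v) as new lists.
        """
        beta1, beta2 = betas
        new_params, new_m, new_v = [], [], []
        for p, g, m_i, v_i in zip(params, grads, m, v):
            m_i = beta1 * m_i + (1 - beta1) * g      # first-moment average
            v_i = beta2 * v_i + (1 - beta2) * g * g  # second-moment average
            m_hat = m_i / (1 - beta1 ** step)        # bias correction
            v_hat = v_i / (1 - beta2 ** step)
            adaptive = m_hat / (math.sqrt(v_hat) + eps)
            # Decoupled decay: applied to the weight itself, so it is not
            # divided by sqrt(v_hat) like the gradient term.
            p = p - lr * (adaptive + weight_decay * p)
            new_params.append(p)
            new_m.append(m_i)
            new_v.append(v_i)
        return new_params, new_m, new_v


    # Zero gradient: only the decay term moves the weight.
    params, m, v = adamw_step([1.0], [0.0], [0.0], [0.0], step=1)

With a zero gradient the weight still shrinks by ``lr × weight_decay × weight``, which is the decoupled decay term acting on its own.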

Optimizer Configuration Parameters
-----------------------------------

**optimizer_name**

Type: ``string``

Default: ``adamw``

Optimizer algorithm name.

Example::

    optimizer_name: adamw

**learning_rate**

Type: ``float``

Range: ``1e-7`` to ``1e-4`` (typical)

Default: ``3e-6``

Controls step size in gradient descent. Critical hyperparameter.

Guidance by model size::

    - 270M model: 1e-5 to 3e-5
    - 1B model: 1e-6 to 1e-5
    - 4B model: 1e-6 to 3e-6

Too high (divergence)::

    loss → NaN or ∞
    Solution: reduce by 10× (e.g., 1e-5 → 1e-6)

Too low (slow training)::

    loss decreases very slowly
    Solution: increase by 10× (e.g., 1e-7 → 1e-6)

Example::

    python run_training.py optimizer.learning_rate=1e-5

**weight_decay**

Type: ``float``

Range: ``0.0`` to ``0.1``

Default: ``0.01``

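Decoupled weight decay coefficient. Each update shrinks the weights toward zero, which penalizes large weights. In AdamW the decay is applied directly to the weights, not as an L2 term added to the gradient.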

Effects:

- Higher: stronger regularization, less overfitting
- Lower: more flexibility, potential overfitting

Typical values::

    - No regularization needed: 0.0
    - Standard: 0.01
    - Strong regularization: 0.05-0.1

Example::

    python run_training.py optimizer.weight_decay=0.05
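
To get a feel for how strong a given value is, the sketch below estimates how much a weight shrinks from decay alone. It assumes each step multiplies the weight by ``(1 - learning_rate × weight_decay)``, as in common AdamW implementations, and ignores the gradient term and the learning rate schedule. The helper name and the 10,000-step run length are hypothetical.

.. code-block:: python

    def decay_shrink_factor(learning_rate, weight_decay, num_steps):
        """Fraction of a weight's magnitude left after decay alone.

        Assumes each step multiplies the weight by
        (1 - learning_rate * weight_decay) at a constant learning rate.
        """
        return (1.0 - learning_rate * weight_decay) ** num_steps


    # Default and "strong" settings from this page, over a hypothetical run.
    default_factor = decay_shrink_factor(3e-6, 0.01, 10_000)
    strong_factor = decay_shrink_factor(3e-6, 0.1, 10_000)

Under these assumptions both factors stay above 0.99, so at learning rates this small the cumulative effect of weight decay may be subtler than the same values would produce at larger learning rates.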

**betas**

Type: ``list[float, float]``

Default: ``[0.9, 0.999]``

Exponential moving average coefficients for gradient moments.

Format: ``[beta1, beta2]``

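- **beta1** (momentum): decay rate of the first-moment average (running mean of gradients); higher values smooth the update direction over more steps
- **beta2** (second moment): decay rate of the second-moment average (running mean of squared gradients); lower values let the per-parameter step size adapt faster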

Typical values::

    - Standard: [0.9, 0.999]
    - Smoother momentum, faster second-moment adaptation: [0.95, 0.99]
    - More responsive momentum: [0.8, 0.999]

The default usually works well. Change it only if:

- Training unstable → try [0.95, 0.99]
- Converging too slowly → try [0.8, 0.999]

Example::

    python run_training.py 'optimizer.betas=[0.95,0.99]'
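
One way to read these values is through the approximate number of recent steps each moving average covers, roughly ``1 / (1 - beta)``. The sketch below computes that horizon for the settings listed above; ``ema_horizon`` is a hypothetical helper, not part of Agent-Tunix.

.. code-block:: python

    def ema_horizon(beta):
        """Approximate number of recent steps an EMA with coefficient beta spans."""
        if not 0.0 <= beta < 1.0:
            raise ValueError("beta must be in [0, 1)")
        return 1.0 / (1.0 - beta)


    # Print the horizon of each moment estimate for the settings on this page.
    for betas in ([0.9, 0.999], [0.95, 0.99], [0.8, 0.999]):
        beta1, beta2 = betas
        print(betas, round(ema_horizon(beta1)), round(ema_horizon(beta2)))

With the defaults, the first moment averages over about 10 steps and the second moment over about 1000, so lowering ``beta2`` to 0.99 shortens the second-moment memory to about 100 steps.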

**eps**

Type: ``float``

Default: ``1e-8``

Small value to prevent division by zero in adaptive learning rates.

Rarely needs adjustment. Increase it only if you observe numerical instability::

    python run_training.py optimizer.eps=1e-6

**warmup_ratio**

Type: ``float``

Range: ``0.0`` to ``1.0``

Default: ``0.1``

Fraction of training steps devoted to warmup.

Effect: the learning rate increases linearly from 0 to its peak value (``learning_rate``) during warmup.

Benefits::

    - Stabilizes early training
    - Reduces the risk of exploding gradients
    - Often improves final model quality

Common values::

    - No warmup: 0.0
    - Light warmup (5%): 0.05
    - Standard (10%): 0.1
    - Strong warmup (20%): 0.2

For short training (quick_test)::

    python run_training.py +experiment=quick_test optimizer.warmup_ratio=0.0

Example::

    python run_training.py optimizer.warmup_ratio=0.2
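
The ratio translates into a step count that depends on the length of the run. The sketch below shows the arithmetic; the helper name is hypothetical, the step counts are made up for illustration, and rounding to the nearest step is an assumption about how the conversion is done.

.. code-block:: python

    def warmup_steps_from_ratio(total_steps, warmup_ratio):
        """Number of warmup steps implied by warmup_ratio.

        Rounding to the nearest integer is an assumption of this sketch.
        """
        if not 0.0 <= warmup_ratio <= 1.0:
            raise ValueError("warmup_ratio must be between 0.0 and 1.0")
        return round(total_steps * warmup_ratio)


    long_run = warmup_steps_from_ratio(total_steps=1000, warmup_ratio=0.1)
    short_run = warmup_steps_from_ratio(total_steps=20, warmup_ratio=0.1)

A 20-step run at the default ratio would spend only 2 steps warming up, which is one reason to set ``warmup_ratio=0.0`` for very short test runs.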

**max_grad_norm**

Type: ``float``

Range: ``0.01`` to ``1.0``

Default: ``0.1``

Gradient clipping threshold. Limits gradient magnitude to prevent exploding gradients.

Effect: if the gradient norm exceeds ``max_grad_norm``, the gradient is rescaled so that its norm equals the threshold; its direction is unchanged.

When needed::

    - NaN/Inf loss → reduce to 0.01
    - Unstable training → reduce to 0.05
    - Smooth training → 0.1 (default)

Example::

    python run_training.py optimizer.max_grad_norm=0.05
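
The clipping rule can be sketched in a few lines of plain Python. This version clips by the global L2 norm over a flat list of gradient values; whether Agent-Tunix clips globally or per tensor is an assumption here, and ``clip_by_global_norm`` is an illustrative helper defined in this sketch.

.. code-block:: python

    import math


    def clip_by_global_norm(grads, max_grad_norm):
        """Rescale grads so their L2 norm is at most max_grad_norm.

        Returns (clipped_grads, original_norm).
        """
        norm = math.sqrt(sum(g * g for g in grads))
        if norm <= max_grad_norm or norm == 0.0:
            return list(grads), norm  # already within the threshold
        scale = max_grad_norm / norm
        return [g * scale for g in grads], norm


    # A gradient of norm 5.0 is scaled down to norm 1.0, direction preserved.
    clipped, original_norm = clip_by_global_norm([3.0, 4.0], max_grad_norm=1.0)

Logging the pre-clipping norm alongside the loss can help show how often the threshold is actually reached.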

Learning Rate Scheduling
------------------------

The optimizer is combined with the ``scheduler`` configuration::

    # conf/scheduler/warmup_cosine.yaml
    scheduler_name: warmup_cosine
    warmup_steps: null  # Computed from warmup_ratio
    total_steps: null   # Computed from num_batches
    lr_min: 1e-7

Default schedule: Warmup → Cosine decay

Warmup phase::

    lr(t) = learning_rate × (t / warmup_steps)

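Cosine decay phase, where ``lr_peak`` is ``learning_rate`` and ``progress`` runs from 0 at the end of warmup to 1 at ``total_steps``::

    lr(t) = lr_min + 0.5 × (lr_peak - lr_min) × (1 + cos(π × progress))
    progress = (t - warmup_steps) / (total_steps - warmup_steps)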

This provides:

- Stability in early training (warmup)
- Gradual cooling for convergence (cosine)
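
The two phases can be combined into one function. The sketch below is a plain-Python reading of the formulas in this section, assuming ``lr_peak`` equals ``learning_rate`` and ``progress`` is the fraction of post-warmup steps completed; the function name is hypothetical and the real scheduler may differ in details such as step indexing.

.. code-block:: python

    import math


    def warmup_cosine_lr(step, learning_rate, warmup_steps, total_steps,
                         lr_min=1e-7):
        """Learning rate at a given step: linear warmup, then cosine decay."""
        if warmup_steps > 0 and step < warmup_steps:
            return learning_rate * step / warmup_steps
        decay_steps = max(total_steps - warmup_steps, 1)
        progress = min((step - warmup_steps) / decay_steps, 1.0)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return lr_min + (learning_rate - lr_min) * cosine


    # Sample the schedule at a few points of a hypothetical 1000-step run.
    schedule = [warmup_cosine_lr(t, 3e-6, 100, 1000) for t in (0, 50, 100, 1000)]

The sampled values start at 0, reach ``learning_rate`` at the end of warmup, and finish at ``lr_min``.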

Optimizer Tuning Workflow
--------------------------

**Step 1: Find Baseline Learning Rate**

Quick search with small dataset::

    python run_training.py +experiment=quick_test \
        --multirun optimizer.learning_rate=1e-7,1e-6,1e-5,1e-4

Compare the training loss curves and pick the learning rate that reaches the lowest loss without diverging.
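
If the final loss of each run is collected by hand or from logs, the selection can be sketched as follows. The helper name and the loss values are purely illustrative (they are not results from any real run), and how the losses are gathered from the sweep output is left open.

.. code-block:: python

    import math


    def pick_best_learning_rate(final_losses):
        """Return the learning rate with the lowest finite final loss.

        final_losses maps learning rate -> final training loss; runs that
        diverged (NaN or infinite loss) are ignored.
        """
        finite = {lr: loss for lr, loss in final_losses.items()
                  if math.isfinite(loss)}
        if not finite:
            raise ValueError("every run diverged; try lower learning rates")
        return min(finite, key=finite.get)


    # Illustrative placeholder values, not measured results.
    example = {1e-7: 2.31, 1e-6: 1.90, 1e-5: 1.42, 1e-4: float("nan")}
    best = pick_best_learning_rate(example)

Final loss alone can be misleading for noisy runs, so it is still worth inspecting the full curves before settling on a value.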

**Step 2: Fine-tune Around Best Learning Rate**

Search a narrower range around the best value from step 1::

    python run_training.py --multirun \
        optimizer.learning_rate=1e-6,3e-6,1e-5,3e-5

**Step 3: Tune Warmup Ratio**

Try different warmup values::

    python run_training.py \
        optimizer.learning_rate=3e-6 \
        --multirun optimizer.warmup_ratio=0.05,0.1,0.2

**Step 4: Tune Weight Decay**

Increase if overfitting, reduce if underfitting::

    python run_training.py \
        optimizer.learning_rate=3e-6 \
        --multirun optimizer.weight_decay=0.0,0.01,0.05

**Step 5: Full Training**

Use best parameters::

    python run_training.py \
        optimizer.learning_rate=3e-6 \
        optimizer.warmup_ratio=0.1 \
        optimizer.weight_decay=0.01

Common Configuration Examples
-----------------------------

**Conservative (stable, slow)**

::

    optimizer:
      learning_rate: 1e-6
      warmup_ratio: 0.2
      weight_decay: 0.05
      max_grad_norm: 0.05

Use::

    python run_training.py \
        optimizer.learning_rate=1e-6 \
        optimizer.warmup_ratio=0.2 \
        optimizer.weight_decay=0.05 \
        optimizer.max_grad_norm=0.05

**Balanced (recommended)**

::

    optimizer:
      learning_rate: 3e-6
      warmup_ratio: 0.1
      weight_decay: 0.01
      max_grad_norm: 0.1

Use::

    python run_training.py optimizer=adamw  # Uses defaults

**Aggressive (fast, risky)**

::

    optimizer:
      learning_rate: 1e-5
      warmup_ratio: 0.05
      weight_decay: 0.0
      max_grad_norm: 0.1

Use::

    python run_training.py \
        optimizer.learning_rate=1e-5 \
        optimizer.warmup_ratio=0.05 \
        optimizer.weight_decay=0.0

Diagnosing Optimizer Issues
----------------------------

**Loss Diverging (NaN/Inf)**

Cause: Learning rate too high

Solution (try a lower learning rate or tighter gradient clipping)::

    python run_training.py optimizer.learning_rate=1e-7
    python run_training.py optimizer.max_grad_norm=0.01

**Loss Not Decreasing**

Cause: Learning rate too low or model not training

Solution::

    python run_training.py optimizer.learning_rate=1e-4

**Oscillating Loss (high variance)**

Cause: Learning rate borderline, warmup insufficient

Solution::

    python run_training.py \
        optimizer.learning_rate=1e-6 \
        optimizer.warmup_ratio=0.2

**Slow Convergence**

Cause: Learning rate too low or weight decay too high

Solution::

    python run_training.py \
        optimizer.learning_rate=1e-5 \
        optimizer.weight_decay=0.0
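
The first three symptoms can be roughly detected from a logged loss curve. The sketch below uses simple heuristics with arbitrary thresholds (``min_rel_drop`` and ``noise_tol`` are assumptions, not tuned values), and the function is a hypothetical helper rather than part of Agent-Tunix. The suggested overrides are the ones listed in this section.

.. code-block:: python

    import math
    import statistics

    SUGGESTED_OVERRIDES = {
        "diverging": "optimizer.learning_rate=1e-7 or optimizer.max_grad_norm=0.01",
        "not_decreasing": "optimizer.learning_rate=1e-4",
        "oscillating": "optimizer.learning_rate=1e-6 optimizer.warmup_ratio=0.2",
    }


    def diagnose_loss_curve(losses, min_rel_drop=0.01, noise_tol=0.1):
        """Classify per-step losses as diverging, oscillating, not_decreasing, or ok."""
        if any(not math.isfinite(x) for x in losses):
            return "diverging"
        if len(losses) < 4:
            raise ValueError("need at least 4 loss values")
        diffs = [b - a for a, b in zip(losses, losses[1:])]
        # Large step-to-step swings relative to the loss level suggest oscillation.
        if statistics.stdev(diffs) > noise_tol * abs(statistics.fmean(losses)):
            return "oscillating"
        half = len(losses) // 2
        early = statistics.fmean(losses[:half])
        late = statistics.fmean(losses[half:])
        if early - late < min_rel_drop * abs(early):
            return "not_decreasing"
        return "ok"


    label = diagnose_loss_curve([2.0, 1.0, 2.0, 1.0, 2.0, 1.0])
    hint = SUGGESTED_OVERRIDES.get(label, "no change suggested")

Treat the label as a prompt to look at the curve, not as a verdict; the thresholds will likely need adjusting for a given task and logging interval.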

Complete Optimizer Configuration Example
----------------------------------------

::

    # conf/optimizer/custom.yaml
    optimizer_name: adamw
    learning_rate: 5e-6
    weight_decay: 0.02
    betas: [0.9, 0.999]
    eps: 1e-8
    warmup_ratio: 0.15
    max_grad_norm: 0.1

Use::

    python run_training.py optimizer=custom

Or override directly::

    python run_training.py \
        optimizer.learning_rate=5e-6 \
        optimizer.warmup_ratio=0.15
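
Before launching a long run, a custom configuration can be checked against the ranges documented on this page. The sketch below works on a plain dictionary; the helper name is hypothetical, loading the YAML file is left out, and the ranges are the typical ones listed above rather than hard limits enforced by the tool.

.. code-block:: python

    # Typical ranges taken from the parameter descriptions on this page.
    DOCUMENTED_RANGES = {
        "learning_rate": (1e-7, 1e-4),
        "weight_decay": (0.0, 0.1),
        "warmup_ratio": (0.0, 1.0),
        "max_grad_norm": (0.01, 1.0),
    }


    def check_optimizer_config(cfg):
        """Return a list of warnings for values outside the documented ranges."""
        warnings = []
        for key, (low, high) in DOCUMENTED_RANGES.items():
            if key in cfg and not low <= cfg[key] <= high:
                warnings.append(f"{key}={cfg[key]} is outside [{low}, {high}]")
        betas = cfg.get("betas")
        if betas is not None and (
            len(betas) != 2 or not all(0.0 <= b < 1.0 for b in betas)
        ):
            warnings.append(f"betas={betas} should be two values in [0, 1)")
        return warnings


    custom = {
        "optimizer_name": "adamw",
        "learning_rate": 5e-6,
        "weight_decay": 0.02,
        "betas": [0.9, 0.999],
        "eps": 1e-8,
        "warmup_ratio": 0.15,
        "max_grad_norm": 0.1,
    }
    problems = check_optimizer_config(custom)

An empty list means every checked value falls inside its documented range; a value outside the range is not necessarily wrong, but it is worth a second look.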

Interaction with Other Settings
--------------------------------

Optimizer settings interact with:

1. **Model size**: Larger models need lower learning rates
2. **Batch size**: Larger batches can use higher learning rates
3. **LoRA rank**: Doesn't directly affect learning rate
4. **Data size**: Smaller datasets need lower learning rates

Example adjustment::

    # Large model, small batch → lower LR
    python run_training.py \
        model=gemma3_4b \
        training.micro_batch_size=1 \
        optimizer.learning_rate=1e-6

    # Small model, large batch → higher LR
    python run_training.py \
        model=gemma3_270m \
        training.micro_batch_size=8 \
        optimizer.learning_rate=1e-4
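
When only the batch size changes, two common heuristics for adjusting the learning rate are linear and square-root scaling. Neither comes from Agent-Tunix itself; the sketch below states them so a starting point can be computed, and the result should still be confirmed with a short sweep. The function name is hypothetical.

.. code-block:: python

    def scale_learning_rate(base_lr, base_batch_size, new_batch_size, rule="sqrt"):
        """Scale a learning rate for a new batch size using a common heuristic.

        rule="linear" multiplies by the batch-size ratio; rule="sqrt"
        multiplies by its square root.
        """
        if base_batch_size <= 0 or new_batch_size <= 0:
            raise ValueError("batch sizes must be positive")
        ratio = new_batch_size / base_batch_size
        if rule == "linear":
            return base_lr * ratio
        if rule == "sqrt":
            return base_lr * ratio ** 0.5
        raise ValueError(f"unknown rule: {rule!r}")


    # Going from batch size 1 to 4 with a base learning rate of 1e-6.
    sqrt_lr = scale_learning_rate(1e-6, 1, 4, rule="sqrt")
    linear_lr = scale_learning_rate(1e-6, 1, 4, rule="linear")

The square-root rule is the more cautious of the two and is often a reasonable first guess for adaptive optimizers such as AdamW.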

Next Steps
----------

- :doc:`overview` - Configuration overview
- :doc:`../guide/hyperparameter_tuning` - Tuning strategies
- :doc:`../getting_started/configuration` - Configuration guide
- :doc:`../api/train` - Training API reference