Training Configuration Reference
=================================

This section describes the options that control the training run: batch size, run length, checkpointing, evaluation and logging intervals, seeding, and device selection.

Training Configuration File
---------------------------

Located at: ``conf/training/default.yaml``

::

    micro_batch_size: 4
    num_batches: 3738
    num_epochs: 1
    checkpoint_dir: ./checkpoints/ckpts/
    save_interval_steps: 100
    eval_interval_steps: 500
    log_interval_steps: 10
    seed: 42
    device: cuda

Training Configuration Parameters
---------------------------------

**micro_batch_size**

Type: ``integer``

Range: ``1`` to ``16`` (depending on GPU)

Default: ``4``

Batch size per GPU device.

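Memory impact: activation memory grows roughly linearly with batch size (2× batch size ≈ 2× activation memory). Memory for model weights and optimizer state does not depend on batch size.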

Guidance by GPU::

    - 11GB GPU (RTX 2080 Ti): 1
    - 24GB GPU (RTX A5000): 2-4
    - 48GB GPU (RTX A6000): 4-8
    - 80GB GPU (H100): 8-16

Training dynamics::

    - Larger batches: more stable gradients, but each step takes longer
    - Smaller batches: noisier gradients, but each step finishes sooner

Typical ranges::

    # Memory constrained
    python run_training.py training.micro_batch_size=1

    # Balanced
    python run_training.py training.micro_batch_size=4

    # High capacity
    python run_training.py training.micro_batch_size=8

**num_batches**

Type: ``integer``

Range: ``10`` to ``100000``

Default: ``3738``

Total number of training batches (steps).

Training duration::

    duration ≈ num_batches / (batches_per_hour)

Typical speeds::

    - 270M model: ~500 steps/hour
    - 1B model: ~300 steps/hour
    - 4B model: ~150 steps/hour

Examples::

    # Quick test (10 minutes)
    training.num_batches=50

    # Full training (varies)
    training.num_batches=3738

    # Long training (1-2 days)
    training.num_batches=10000

Relationship with num_epochs::

    total_steps = (dataset_size / batch_size) × num_epochs
    # But num_batches directly specifies total steps
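
The two formulas above can be sketched in Python. This is only an illustration: the helper names and the dataset size are hypothetical, and rounding up the last partial batch is an assumption, since this reference does not say how an incomplete final batch is handled.

```python
import math


def steps_per_epoch(dataset_size: int, batch_size: int) -> int:
    """Steps needed for one full pass (assumes a partial last batch is kept)."""
    return math.ceil(dataset_size / batch_size)


def estimated_hours(num_batches: int, steps_per_hour: float) -> float:
    """Wall-clock estimate: duration ≈ num_batches / steps_per_hour."""
    return num_batches / steps_per_hour


# Default run length on a 1B model, using the ~300 steps/hour figure above.
print(f"{estimated_hours(3738, 300):.1f} hours")

# Hypothetical dataset of 14,952 examples at batch size 4.
print(steps_per_epoch(14_952, 4), "steps per epoch")
```

Because ``num_batches`` sets the total step count directly, the epoch-based figure is mainly useful for choosing a ``num_batches`` value that corresponds to a whole number of passes.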

**num_epochs**

Type: ``integer``

Range: ``1`` to ``10``

Default: ``1``

Number of complete passes through the dataset.

Typical::

    - Most tasks: 1 epoch
    - Small datasets: 3-5 epochs
    - Large datasets: 1 epoch

Usually keep at 1 and adjust ``num_batches`` instead::

    # Good: directly specify steps
    training.num_batches=3738

    # Less common: use epochs
    training.num_epochs=2

**checkpoint_dir**

Type: ``string`` (path)

Default: ``./checkpoints/ckpts/``

Directory where model checkpoints are saved.

Must be a writable directory. It is created if it does not exist.

Absolute vs relative::

    # Relative to project root
    training.checkpoint_dir=./checkpoints/ckpts/

    # Absolute path
    training.checkpoint_dir=/full/path/to/checkpoints/ckpts/

Example::

    python run_training.py training.checkpoint_dir=$HOME/my_models/checkpoints/

**save_interval_steps**

Type: ``integer``

Range: ``1`` to ``1000``

Default: ``100``

Save checkpoint every N steps.

Trade-offs::

    - Frequent saves (50): more disk, better coverage, can resume from recent step
    - Infrequent saves (1000): less disk, fewer checkpoints, coarser resume points

Disk usage estimate::

    disk = (checkpoint_size) × (num_batches / save_interval_steps)

For 1B model (≈4GB checkpoint)::

    - save_interval_steps=50: 4GB × (3738/50) ≈ 300GB
    - save_interval_steps=100: 4GB × (3738/100) ≈ 150GB
    - save_interval_steps=500: 4GB × (3738/500) ≈ 30GB
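
The estimate can be checked with a few lines of Python. This is a minimal sketch with a hypothetical helper name. It assumes a checkpoint is written at every multiple of ``save_interval_steps`` and that no checkpoint is ever deleted. Because it counts only whole checkpoints, its results come out slightly below the rounded figures above.

```python
def checkpoint_disk_gb(checkpoint_gb: float, num_batches: int,
                       save_interval_steps: int) -> float:
    """Disk used if every periodic checkpoint is kept (assumption)."""
    num_checkpoints = num_batches // save_interval_steps
    return checkpoint_gb * num_checkpoints


# 1B model with a checkpoint of roughly 4 GB, default run length.
for interval in (50, 100, 500):
    total = checkpoint_disk_gb(4.0, 3738, interval)
    print(f"save_interval_steps={interval}: about {total:.0f} GB")
```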

Example::

    python run_training.py training.save_interval_steps=200

**eval_interval_steps**

Type: ``integer``

Range: ``100`` to ``5000``

Default: ``500``

Evaluate on validation set every N steps.

Lower = more frequent evaluation::

    - Every 100 steps: frequent feedback, slower training
    - Every 500 steps: good balance (default)
    - Every 1000 steps: less frequent, faster training

Example::

    python run_training.py training.eval_interval_steps=1000

**log_interval_steps**

Type: ``integer``

Range: ``1`` to ``100``

Default: ``10``

Log metrics every N steps.

Determines frequency of logged stats::

    - Every 1 step: very verbose, can slow training
    - Every 10 steps: good visibility (default)
    - Every 100 steps: less detailed

Example::

    python run_training.py training.log_interval_steps=5
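
The three interval parameters follow the same "every N steps" pattern. The sketch below shows one common way such a schedule is implemented. It assumes 1-based step numbers and that an action fires on exact multiples of its interval; the framework's actual trigger logic is not documented here and may differ.

```python
def is_due(step: int, interval: int) -> bool:
    """True when a periodic action fires at this 1-based step (assumption)."""
    return step % interval == 0


def count_events(num_batches: int, interval: int) -> int:
    """How many times a periodic action fires during the whole run."""
    return num_batches // interval


# Default configuration: 3738 steps.
defaults = {"save": 100, "eval": 500, "log": 10}
for name, interval in defaults.items():
    print(f"{name}: {count_events(3738, interval)} times")
```

Counting events this way gives a quick sense of how much overhead a shorter interval adds before starting a run.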

**seed**

Type: ``integer``

Default: ``42``

Random seed for reproducibility.

Using the same seed makes runs repeatable. Some GPU operations are nondeterministic, so results may still differ slightly between runs.

For different runs::

    python run_training.py training.seed=42
    python run_training.py training.seed=43
    python run_training.py training.seed=44

Reproducibility::

    python run_training.py training.seed=42  # Run 1
    python run_training.py training.seed=42  # Run 2 (same seed as Run 1)
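
The principle behind seeding can be illustrated with the standard library alone. This sketch does not show how the framework seeds its own generators; ``sample_order`` is a hypothetical stand-in for any seeded operation, such as shuffling the dataset.

```python
import random


def sample_order(seed: int, n: int = 8) -> list[int]:
    """Shuffle the indices 0..n-1 with a generator created from `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    order = list(range(n))
    rng.shuffle(order)
    return order


# Two runs with the same seed visit the data in the same order.
print(sample_order(42) == sample_order(42))
```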

**device**

Type: ``string``

Options: ``cuda``, ``cpu``

Default: ``cuda``

Training device.

GPU (recommended)::

    python run_training.py training.device=cuda

CPU (very slow, for testing)::

    python run_training.py training.device=cpu
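
The parameters above can be summarized as a typed structure. The dataclass below is a hypothetical sketch, not the framework's own config class. It mirrors the documented defaults and checks only that the integer options are positive and that ``device`` is one of the documented values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical mirror of conf/training/default.yaml."""

    micro_batch_size: int = 4
    num_batches: int = 3738
    num_epochs: int = 1
    checkpoint_dir: str = "./checkpoints/ckpts/"
    save_interval_steps: int = 100
    eval_interval_steps: int = 500
    log_interval_steps: int = 10
    seed: int = 42
    device: str = "cuda"

    def __post_init__(self) -> None:
        positive_ints = (
            "micro_batch_size", "num_batches", "num_epochs",
            "save_interval_steps", "eval_interval_steps", "log_interval_steps",
        )
        for name in positive_ints:
            value = getattr(self, name)
            if not isinstance(value, int) or value < 1:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")
        if self.device not in ("cuda", "cpu"):
            raise ValueError(f"device must be 'cuda' or 'cpu', got {self.device!r}")


config = TrainingConfig(micro_batch_size=8, num_batches=1000)
print(config)
```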

Memory Optimization Parameters
------------------------------

**gradient_accumulation_steps**

Type: ``integer``

Default: ``1``

Number of micro-batches whose gradients are accumulated before each optimizer update.

Effectively increases batch size without more GPU memory::

    effective_batch_size = micro_batch_size × gradient_accumulation_steps

Example::

    # Effective batch size of 8 (2 × 4) on a memory-constrained GPU
    training.micro_batch_size=2
    training.gradient_accumulation_steps=4
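
The following sketch shows why accumulation is equivalent to a larger batch. It uses a toy one-parameter model in plain Python rather than the framework's training loop, and all names are hypothetical. Averaging the gradients of four micro-batches of size 2 gives the same gradient as one batch of size 8.

```python
def grad_mse(w: float, batch: list[tuple[float, float]]) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, data: list[tuple[float, float]],
                     micro_batch_size: int, accumulation_steps: int) -> float:
    """Average the micro-batch gradients before a single update."""
    total = 0.0
    for i in range(accumulation_steps):
        micro = data[i * micro_batch_size:(i + 1) * micro_batch_size]
        total += grad_mse(w, micro) / accumulation_steps
    return total


data = [(float(x), 3.0 * x + 1.0) for x in range(8)]
full = grad_mse(0.5, data)  # one batch of 8
accum = accumulated_grad(0.5, data, micro_batch_size=2, accumulation_steps=4)
print(abs(full - accum) < 1e-9)
```

Only one micro-batch of activations is held in memory at a time, which is where the memory saving comes from. The cost is that each update needs several forward and backward passes.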

**max_grad_norm** (in optimizer config)

Type: ``float``

Default: ``0.1``

Gradient clipping threshold.

Larger value = less clipping, more aggressive updates::

    - 0.01: strong clipping, stable but slow
    - 0.1: moderate clipping (default)
    - 1.0: weak clipping, aggressive
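
A minimal sketch of clipping follows, assuming that ``max_grad_norm`` is applied to the global L2 norm of all gradients. This reference does not state which clipping variant the optimizer uses, so treat the function as an illustration of the threshold's effect only.

```python
import math


def clip_by_global_norm(grads: list[float], max_grad_norm: float) -> list[float]:
    """Rescale gradients so their L2 norm does not exceed max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_grad_norm:
        return list(grads)  # already within the threshold
    scale = max_grad_norm / norm
    return [g * scale for g in grads]


grads = [3.0, 4.0]  # L2 norm is 5.0
clipped = clip_by_global_norm(grads, max_grad_norm=0.1)
print(clipped)
```

Clipping changes the length of the gradient vector but not its direction, which is why a small threshold slows training without changing where each update points.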

Complete Training Configuration Example
---------------------------------------

::

    # conf/training/default.yaml
    micro_batch_size: 4
    num_batches: 3738
    num_epochs: 1
    checkpoint_dir: ./checkpoints/ckpts/
    save_interval_steps: 100
    eval_interval_steps: 500
    log_interval_steps: 10
    seed: 42
    device: cuda

Custom training config::

    # conf/training/aggressive.yaml
    micro_batch_size: 8
    num_batches: 1000
    num_epochs: 1
    checkpoint_dir: ./checkpoints/ckpts/
    save_interval_steps: 50
    eval_interval_steps: 100
    log_interval_steps: 5
    seed: 42
    device: cuda

Use::

    python run_training.py training=aggressive

Common Configuration Patterns
-----------------------------

**Quick Test (10 steps)**

::

    python run_training.py \
        training.micro_batch_size=2 \
        training.num_batches=10 \
        training.save_interval_steps=10 \
        training.eval_interval_steps=10

Or use experiment::

    python run_training.py +experiment=quick_test

**Memory Constrained (11GB GPU)**

::

    python run_training.py \
        model=gemma3_270m \
        training.micro_batch_size=1 \
        training.num_batches=3738

**High Performance (H100 GPU)**

::

    python run_training.py \
        model=gemma3_4b \
        training.micro_batch_size=8 \
        training.num_batches=10000 \
        training.eval_interval_steps=1000

**Production Training**

::

    python run_training.py \
        model=gemma3_1b \
        training.micro_batch_size=4 \
        training.num_batches=10000 \
        training.save_interval_steps=100

Resuming Training
-----------------

Training resumes automatically from the latest checkpoint found in ``checkpoint_dir``::

    python run_training.py training.checkpoint_dir=./checkpoints/ckpts/

To resume from a different directory::

    python run_training.py training.checkpoint_dir=/path/to/checkpoints/ckpts/

The framework:

1. Finds latest checkpoint in directory
2. Loads model weights
3. Continues training from that step
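
Step 1 can be sketched as follows. The sketch assumes that each checkpoint is a subdirectory named after its step number. That layout is a guess for illustration, and ``latest_checkpoint`` is a hypothetical helper, not part of the framework.

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the step-numbered subdirectory with the highest step, if any."""
    root = Path(checkpoint_dir).expanduser()
    if not root.is_dir():
        return None
    step_dirs = [p for p in root.iterdir() if p.is_dir() and p.name.isdigit()]
    # Compare as integers so that step 1000 ranks above step 200.
    return max(step_dirs, key=lambda p: int(p.name), default=None)


with tempfile.TemporaryDirectory() as tmp:
    for name in ("100", "200", "1000", "notes"):
        (Path(tmp) / name).mkdir()
    newest = latest_checkpoint(tmp)
    print(newest.name)  # prints 1000
```

Sorting by the numeric step rather than by name matters, because a plain string sort would place ``200`` after ``1000``.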

Training Dynamics
-----------------

How configuration affects training:

1. **Learning Rate + Batch Size**::

       Larger batch → can use higher learning rate
       Smaller batch → need lower learning rate

2. **Warmup + Learning Rate**::

       Longer warmup (higher warmup_ratio) → more stable
       Short warmup → faster convergence but less stable

3. **Number of Batches + Evaluation Interval**::

       More batches → longer training, more progress
       Less frequent eval → faster training but less monitoring

4. **LoRA Rank + Learning Rate**::

       Higher rank → more parameters, may need lower LR
       Lower rank → fewer parameters, can use higher LR
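
To make the warmup interaction in item 2 concrete, here is a sketch of a linear warmup schedule. It assumes that ``warmup_ratio`` is the fraction of ``num_batches`` spent ramping up and that the rate then stays at its peak. The scheduler's real shape, including any decay, is not described in this section.

```python
def warmup_lr(step: int, peak_lr: float, num_batches: int,
              warmup_ratio: float) -> float:
    """Learning rate at a 0-based step under linear warmup (assumed shape)."""
    warmup_steps = int(num_batches * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


# 1000 steps with a 10% warmup: the rate ramps up over the first 100 steps.
for step in (0, 49, 99, 500):
    print(step, warmup_lr(step, peak_lr=1e-5, num_batches=1000, warmup_ratio=0.1))
```

A higher ``warmup_ratio`` keeps early updates small for longer, which is the stability effect described above.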

Checkpoint Management
---------------------

**Disk Space Required**

::

    total_disk ≈ checkpoint_size × (num_batches / save_interval_steps)

For 1B model (≈4GB)::

    3738 steps, save every 100 steps
    total_disk ≈ 4GB × (3738/100) ≈ 150GB

**Keeping Only Important Checkpoints**

Save less frequently::

    python run_training.py training.save_interval_steps=500

Or manually delete old checkpoints::

    # Keep only last 5 checkpoints
    ls -dt checkpoints/ckpts/actor/*/ | tail -n +6 | xargs rm -rf
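
The same cleanup can be written in Python, which avoids problems with unusual directory names in a shell pipeline. This is a sketch with a hypothetical helper. Like the command above, it ranks checkpoint directories by modification time and deletes everything except the newest ``keep`` entries, so try it on a copy first.

```python
import os
import shutil
import tempfile
from pathlib import Path


def prune_checkpoints(root: str, keep: int = 5) -> list[Path]:
    """Delete all but the `keep` most recently modified subdirectories."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    dirs = sorted(
        (p for p in Path(root).iterdir() if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    stale = dirs[keep:]
    for path in stale:
        shutil.rmtree(path)
    return stale


# Demonstration in a throwaway directory with seven fake checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    for step in range(1, 8):
        ckpt = Path(tmp) / str(step * 100)
        ckpt.mkdir()
        os.utime(ckpt, (step, step))  # earlier steps get older timestamps
    removed = prune_checkpoints(tmp, keep=5)
    remaining = sorted(int(p.name) for p in Path(tmp).iterdir())
    print(remaining)
```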

**Finding Checkpoint Sizes**

::

    du -sh checkpoints/ckpts/actor/*/

Monitoring Training Progress
-----------------------------

**Check Training Loss**

::

    tail -f outputs/tunix-grpo/YYYY-MM-DD/HH-MM-SS/train.log

**Use Weights & Biases**

Enabled by default. View at https://wandb.ai

**Use TensorBoard**

::

    make tensorboard
    # Open http://localhost:6006

Integration with Other Configs
------------------------------

Training settings interact with:

1. **Model**: Larger models need smaller batch sizes to fit in memory
2. **GRPO**: ``num_generations`` multiplies memory usage
3. **Optimizer**: The learning rate should be tuned together with the batch size
4. **Scheduler**: ``warmup_ratio`` affects convergence

Coordinated tuning example::

    # For 4GB GPU
    python run_training.py \
        model=gemma3_270m \
        model.lora_rank=8 \
        training.micro_batch_size=1 \
        grpo.num_generations=2 \
        optimizer.learning_rate=1e-5

    # For 80GB GPU
    python run_training.py \
        model=gemma3_4b \
        model.lora_rank=64 \
        training.micro_batch_size=8 \
        grpo.num_generations=4 \
        optimizer.learning_rate=3e-6

Next Steps
----------

- :doc:`overview` - Configuration overview
- :doc:`../guide/training` - Training guide
- :doc:`../guide/hyperparameter_tuning` - Tuning strategies
- :doc:`../getting_started/configuration` - Configuration guide