Model Configuration Reference
=============================

This section details all model configuration options and available models.

Available Models
----------------

Gemma3 270M
~~~~~~~~~~~

Lightweight model for constrained environments.

Configuration file: ``conf/model/gemma3_270m.yaml``

::

    model_family: gemma3
    model_size: 270m
    lora_rank: 32
    lora_alpha: 32.0
    lora_module_path: ".*q_einsum|.*kv_einsum|.*gate_proj|.*down_proj|.*up_proj|.*attn_vec_einsum"
    mesh_shape: [[1, 1], ["fsdp", "tp"]]

Use::

    python run_training.py model=gemma3_270m

**Specifications**:

- Parameters: 270 million
- Memory: ~11GB with batch size 1
- Recommended GPU: RTX 2080 Ti, RTX A4000
- LoRA rank: 8-32
- Training speed: ~500 steps/hour

**Good for**:

- Testing setups
- Running on limited GPUs
- Quick prototyping
- Small datasets

Gemma3 1B
~~~~~~~~~

Standard model balancing performance and efficiency.

Configuration file: ``conf/model/gemma3_1b.yaml``

::

    model_family: gemma3
    model_size: 1b
    lora_rank: 32
    lora_alpha: 32.0
    lora_module_path: ".*q_einsum|.*kv_einsum|.*gate_proj|.*down_proj|.*up_proj|.*attn_vec_einsum"
    mesh_shape: [[1, 1], ["fsdp", "tp"]]

Use::

    python run_training.py model=gemma3_1b

**Specifications**:

- Parameters: 1 billion
- Memory: ~24GB with batch size 2, ~48GB with batch size 4
- Recommended GPU: RTX A6000, L40S, A100-40GB
- LoRA rank: 16-64
- Training speed: ~300 steps/hour

**Good for**:

- Production training
- Balanced quality/speed
- Most use cases
- Benchmarking

Gemma3 4B
~~~~~~~~~

Larger model for higher capacity tasks.

Configuration file: ``conf/model/gemma3_4b.yaml``

::

    model_family: gemma3
    model_size: 4b
    lora_rank: 64
    lora_alpha: 64.0
    lora_module_path: ".*q_einsum|.*kv_einsum|.*gate_proj|.*down_proj|.*up_proj|.*attn_vec_einsum"
    mesh_shape: [[1, 1], ["fsdp", "tp"]]

Use::

    python run_training.py model=gemma3_4b

**Specifications**:

- Parameters: 4 billion
- Memory: ~80GB with batch size 8 (single GPU)
- Recommended: H100 or multiple A100s
- LoRA rank: 32-128
- Training speed: ~150 steps/hour

**Good for**:

- Complex reasoning tasks
- Large datasets
- Multi-GPU setups
- High-quality models

Model Configuration Parameters
------------------------------

**model_family**

Type: ``string``

Default: ``gemma3``

Model architecture family. Currently only ``gemma3`` supported.

Example::

    model_family: gemma3

**model_size**

Type: ``string``

Options: ``270m``, ``1b``, ``4b``

Default: ``270m``

Model size variant.

Example::

    model_size: 1b

**lora_rank**

Type: ``integer``

Range: ``4`` to ``128``

Default: ``32``

Rank of LoRA matrices. Higher rank = more capacity but more memory.

Recommended values::

    - 270M model: 8-32
    - 1B model: 16-64
    - 4B model: 32-128

Memory impact::

    Memory ≈ baseline_memory × (1 + 2 × lora_rank / hidden_dim)

For 1B model with hidden_dim=2048::

    - rank 16: +1.6% memory
    - rank 32: +3.1% memory
    - rank 64: +6.3% memory

Example::

    lora_rank: 64

**lora_alpha**

Type: ``float``

Default: equals lora_rank

Scaling factor for LoRA. Usually equals ``lora_rank``.

Affects training dynamics:

- Higher alpha: stronger LoRA updates
- Lower alpha: weaker LoRA updates

Typically::

    lora_alpha: ${model.lora_rank}

Or set manually::

    lora_alpha: 16.0

**lora_module_path**

Type: ``string`` (regex pattern)

Regular expression matching layer names to apply LoRA.

Default for Gemma3::

    lora_module_path: ".*q_einsum|.*kv_einsum|.*gate_proj|.*down_proj|.*up_proj|.*attn_vec_einsum"

This applies LoRA to:

- Attention query/key/value projections
- MLP gate and projection layers

To apply LoRA to all layers (not recommended, memory intensive)::

    lora_module_path: ".*"

To apply LoRA only to attention::

    lora_module_path: ".*einsum"

To apply LoRA only to MLP::

    lora_module_path: ".*proj"

**mesh_shape**

Type: ``list[list]`` with dimension names

Default: ``[[1, 1], ["fsdp", "tp"]]``

Parallelism configuration for distributed training.

Format: ``[[num_devices_fsdp, num_devices_tp], ["fsdp", "tp"]]``

Where:

- **num_devices_fsdp**: GPUs for fully sharded data parallelism
- **num_devices_tp**: GPUs for tensor parallelism

Single GPU::

    mesh_shape: [[1, 1], ["fsdp", "tp"]]

Data parallel (4 GPUs)::

    mesh_shape: [[4, 1], ["fsdp", "tp"]]

Tensor parallel (4 GPUs)::

    mesh_shape: [[1, 4], ["fsdp", "tp"]]

Hybrid (8 GPUs, 2 data × 4 tensor)::

    mesh_shape: [[2, 4], ["fsdp", "tp"]]

See :doc:`../advanced/distributed_training` for details.

Complete Model Configuration Example
-------------------------------------

::

    # conf/model/custom.yaml
    model_family: gemma3
    model_size: 1b
    lora_rank: 64
    lora_alpha: 64.0
    lora_module_path: ".*q_einsum|.*kv_einsum|.*gate_proj|.*down_proj|.*up_proj|.*attn_vec_einsum"
    mesh_shape: [[1, 1], ["fsdp", "tp"]]

Use::

    python run_training.py model=custom

Memory Requirements by Configuration
-------------------------------------

RTX 2080 Ti (11GB VRAM)
~~~~~~~~~~~~~~~~~~~~~~~~

::

    model=gemma3_270m
    model.lora_rank: 8
    training.micro_batch_size: 1

RTX A6000 (48GB VRAM)
~~~~~~~~~~~~~~~~~~~~~

::

    model=gemma3_1b
    model.lora_rank: 32
    training.micro_batch_size: 4

H100 (80GB VRAM)
~~~~~~~~~~~~~~~~

::

    model=gemma3_4b
    model.lora_rank: 64
    training.micro_batch_size: 8

Multi-GPU (4× A100 80GB)
~~~~~~~~~~~~~~~~~~~~~~~~

::

    model=gemma3_4b
    model.lora_rank: 128
    model.mesh_shape: [[2, 2], ["fsdp", "tp"]]
    training.micro_batch_size: 8

Tuning LoRA Rank
----------------

**Finding Right Rank**

Start with default (32 for 1B) and adjust based on:

1. **Memory constraints**::

       # If OOM
       model.lora_rank=16

2. **Training quality**::

       # If poor performance
       model.lora_rank=64

3. **Speed/memory trade-off**::

       # Balance training speed and capacity
       model.lora_rank=32  # Default good balance

**Testing Different Ranks**

Create experiment to sweep ranks::

    # conf/experiment/rank_sweep.yaml
    # @package _global_
    training:
      num_batches: 100

Run::

    python run_training.py +experiment=rank_sweep \
        --multirun model.lora_rank=8,16,32,64

Compare metrics to find optimal rank.

Creating Custom Models
---------------------

To add support for a new model:

1. Create config file ``conf/model/newmodel.yaml``::

       model_family: newmodel
       model_size: 1b
       lora_rank: 32
       lora_alpha: 32.0
       lora_module_path: ".*pattern_matching_layers"
       mesh_shape: [[1, 1], ["fsdp", "tp"]]

2. Update code to load model (if needed)
3. Update tokenizer path if different

Then use::

    python run_training.py model=newmodel

Integration with Training
-------------------------

Model config integrates with training via:

1. **LoRA**: Only ``lora_rank``, ``lora_alpha``, ``lora_module_path`` matter for fine-tuning
2. **Distributed training**: ``mesh_shape`` controls parallelism
3. **Memory**: ``model_size`` + ``lora_rank`` determine memory usage

Optimal configuration depends on:

- Available GPU memory
- Training data size
- Time constraints
- Target model quality

Next Steps
----------

- :doc:`overview` - Configuration overview
- :doc:`../getting_started/configuration` - Configuration guide
- :doc:`../api/models` - Model API reference
- :doc:`../advanced/distributed_training` - Distributed training setup