Model Configuration Reference

This section details all model configuration options and available models.

Available Models

Gemma3 270M

Lightweight model for constrained environments.

Configuration file: conf/model/gemma3_270m.yaml

model_family: gemma3
model_size: 270m
lora_rank: 32
lora_alpha: 32.0
lora_module_path: ".*q_einsum|.*kv_einsum|.*gate_proj|.*down_proj|.*up_proj|.*attn_vec_einsum"
mesh_shape: [[1, 1], ["fsdp", "tp"]]

Use:

python run_training.py model=gemma3_270m

Specifications:

  • Parameters: 270 million

  • Memory: ~11GB with batch size 1

  • Recommended GPU: RTX 2080 Ti, RTX A4000

  • LoRA rank: 8-32

  • Training speed: ~500 steps/hour

Good for:

  • Testing setups

  • Running on limited GPUs

  • Quick prototyping

  • Small datasets

Gemma3 1B

Standard model balancing performance and efficiency.

Configuration file: conf/model/gemma3_1b.yaml

model_family: gemma3
model_size: 1b
lora_rank: 32
lora_alpha: 32.0
lora_module_path: ".*q_einsum|.*kv_einsum|.*gate_proj|.*down_proj|.*up_proj|.*attn_vec_einsum"
mesh_shape: [[1, 1], ["fsdp", "tp"]]

Use:

python run_training.py model=gemma3_1b

Specifications:

  • Parameters: 1 billion

  • Memory: ~24GB with batch size 2, ~48GB with batch size 4

  • Recommended GPU: RTX A6000, L40S, A100-40GB

  • LoRA rank: 16-64

  • Training speed: ~300 steps/hour

Good for:

  • Production training

  • Balanced quality/speed

  • Most use cases

  • Benchmarking

Gemma3 4B

Larger model for higher capacity tasks.

Configuration file: conf/model/gemma3_4b.yaml

model_family: gemma3
model_size: 4b
lora_rank: 64
lora_alpha: 64.0
lora_module_path: ".*q_einsum|.*kv_einsum|.*gate_proj|.*down_proj|.*up_proj|.*attn_vec_einsum"
mesh_shape: [[1, 1], ["fsdp", "tp"]]

Use:

python run_training.py model=gemma3_4b

Specifications:

  • Parameters: 4 billion

  • Memory: ~80GB with batch size 8 (single GPU)

  • Recommended: H100 or multiple A100s

  • LoRA rank: 32-128

  • Training speed: ~150 steps/hour

Good for:

  • Complex reasoning tasks

  • Large datasets

  • Multi-GPU setups

  • High-quality models

Model Configuration Parameters

model_family

Type: string

Default: gemma3

Model architecture family. Currently only gemma3 supported.

Example:

model_family: gemma3

model_size

Type: string

Options: 270m, 1b, 4b

Default: 270m

Model size variant.

Example:

model_size: 1b

lora_rank

Type: integer

Range: 4 to 128

Default: 32

Rank of LoRA matrices. Higher rank = more capacity but more memory.

Recommended values:

- 270M model: 8-32
- 1B model: 16-64
- 4B model: 32-128

Memory impact:

Memory ≈ baseline_memory × (1 + 2 × lora_rank / hidden_dim)

For 1B model with hidden_dim=2048:

- rank 16: +1.6% memory
- rank 32: +3.1% memory
- rank 64: +6.3% memory

Example:

lora_rank: 64

lora_alpha

Type: float

Default: equals lora_rank

Scaling factor for LoRA. Usually equals lora_rank.

Affects training dynamics:

  • Higher alpha: stronger LoRA updates

  • Lower alpha: weaker LoRA updates

Typically:

lora_alpha: ${model.lora_rank}

Or set manually:

lora_alpha: 16.0

lora_module_path

Type: string (regex pattern)

Regular expression matching layer names to apply LoRA.

Default for Gemma3:

lora_module_path: ".*q_einsum|.*kv_einsum|.*gate_proj|.*down_proj|.*up_proj|.*attn_vec_einsum"

This applies LoRA to:

  • Attention query/key/value projections

  • MLP gate and projection layers

To apply LoRA to all layers (not recommended, memory intensive):

lora_module_path: ".*"

To apply LoRA only to attention:

lora_module_path: ".*einsum"

To apply LoRA only to MLP:

lora_module_path: ".*proj"

mesh_shape

Type: list[list] with dimension names

Default: [[1, 1], ["fsdp", "tp"]]

Parallelism configuration for distributed training.

Format: [[num_devices_fsdp, num_devices_tp], ["fsdp", "tp"]]

Where:

  • num_devices_fsdp: GPUs for fully sharded data parallelism

  • num_devices_tp: GPUs for tensor parallelism

Single GPU:

mesh_shape: [[1, 1], ["fsdp", "tp"]]

Data parallel (4 GPUs):

mesh_shape: [[4, 1], ["fsdp", "tp"]]

Tensor parallel (4 GPUs):

mesh_shape: [[1, 4], ["fsdp", "tp"]]

Hybrid (8 GPUs, 2 data × 4 tensor):

mesh_shape: [[2, 4], ["fsdp", "tp"]]

See Distributed Training for details.

Complete Model Configuration Example

# conf/model/custom.yaml
model_family: gemma3
model_size: 1b
lora_rank: 64
lora_alpha: 64.0
lora_module_path: ".*q_einsum|.*kv_einsum|.*gate_proj|.*down_proj|.*up_proj|.*attn_vec_einsum"
mesh_shape: [[1, 1], ["fsdp", "tp"]]

Use:

python run_training.py model=custom

Memory Requirements by Configuration

RTX 2080 Ti (11GB VRAM)

model=gemma3_270m
model.lora_rank: 8
training.micro_batch_size: 1

RTX A6000 (48GB VRAM)

model=gemma3_1b
model.lora_rank: 32
training.micro_batch_size: 4

H100 (80GB VRAM)

model=gemma3_4b
model.lora_rank: 64
training.micro_batch_size: 8

Multi-GPU (4× A100 80GB)

model=gemma3_4b
model.lora_rank: 128
model.mesh_shape: [[2, 2], ["fsdp", "tp"]]
training.micro_batch_size: 8

Tuning LoRA Rank

Finding Right Rank

Start with default (32 for 1B) and adjust based on:

  1. Memory constraints:

    # If OOM
    model.lora_rank=16
    
  2. Training quality:

    # If poor performance
    model.lora_rank=64
    
  3. Speed/memory trade-off:

    # Balance training speed and capacity
    model.lora_rank=32  # Default good balance
    

Testing Different Ranks

Create experiment to sweep ranks:

# conf/experiment/rank_sweep.yaml
# @package _global_
training:
  num_batches: 100

Run:

python run_training.py +experiment=rank_sweep \
    --multirun model.lora_rank=8,16,32,64

Compare metrics to find optimal rank.

Creating Custom Models

To add support for a new model:

  1. Create config file conf/model/newmodel.yaml:

    model_family: newmodel
    model_size: 1b
    lora_rank: 32
    lora_alpha: 32.0
    lora_module_path: ".*pattern_matching_layers"
    mesh_shape: [[1, 1], ["fsdp", "tp"]]
    
  2. Update code to load model (if needed)

  3. Update tokenizer path if different

Then use:

python run_training.py model=newmodel

Integration with Training

Model config integrates with training via:

  1. LoRA: Only lora_rank, lora_alpha, lora_module_path matter for fine-tuning

  2. Distributed training: mesh_shape controls parallelism

  3. Memory: model_size + lora_rank determine memory usage

Optimal configuration depends on:

  • Available GPU memory

  • Training data size

  • Time constraints

  • Target model quality

Next Steps