Model Configuration Reference

This section details all model configuration options and available models.

Available Models

Gemma3 270M

Lightweight model for constrained environments.

Configuration file: conf/model/gemma3_270m.yaml

model_family: gemma3
model_size: 270m
lora_rank: 32
lora_alpha: 32.0
lora_module_path: ".*q_einsum|.*kv_einsum|.*gate_proj|.*down_proj|.*up_proj|.*attn_vec_einsum"
mesh_shape: [[1, 1], ["fsdp", "tp"]]

Use:

python run_training.py model=gemma3_270m

Specifications:

Parameters: 270 million
Memory: ~11GB with batch size 1
Recommended GPU: RTX 2080 Ti, RTX A4000
LoRA rank: 8-32
Training speed: ~500 steps/hour

Good for:

Testing setups
Running on limited GPUs
Quick prototyping
Small datasets

Gemma3 1B

Standard model balancing performance and efficiency.

Configuration file: conf/model/gemma3_1b.yaml

model_family: gemma3
model_size: 1b
lora_rank: 32
lora_alpha: 32.0
lora_module_path: ".*q_einsum|.*kv_einsum|.*gate_proj|.*down_proj|.*up_proj|.*attn_vec_einsum"
mesh_shape: [[1, 1], ["fsdp", "tp"]]

Use:

python run_training.py model=gemma3_1b

Specifications:

Parameters: 1 billion
Memory: ~24GB with batch size 2, ~48GB with batch size 4
Recommended GPU: RTX A6000, L40S, A100-40GB
LoRA rank: 16-64
Training speed: ~300 steps/hour

Good for:

Production training
Balanced quality/speed
Most use cases
Benchmarking

Gemma3 4B

Larger model for higher-capacity tasks.

Configuration file: conf/model/gemma3_4b.yaml

model_family: gemma3
model_size: 4b
lora_rank: 64
lora_alpha: 64.0
lora_module_path: ".*q_einsum|.*kv_einsum|.*gate_proj|.*down_proj|.*up_proj|.*attn_vec_einsum"
mesh_shape: [[1, 1], ["fsdp", "tp"]]

Use:

python run_training.py model=gemma3_4b

Specifications:

Parameters: 4 billion
Memory: ~80GB with batch size 8 (single GPU)
Recommended GPU: H100, or multiple A100s
LoRA rank: 32-128
Training speed: ~150 steps/hour

Good for:

Complex reasoning tasks
Large datasets
Multi-GPU setups
High-quality models

Model Configuration Parameters

model_family

Type: string

Default: gemma3

Model architecture family. Currently only gemma3 is supported.

Example:

model_family: gemma3

model_size

Type: string

Options: 270m, 1b, 4b

Default: 270m

Model size variant.

Example:

model_size: 1b

lora_rank

Type: integer

Range: 4 to 128

Default: 32

Rank of the LoRA adapter matrices. A higher rank gives the adapters more capacity but uses more memory.

Recommended values:

- 270M model: 8-32
- 1B model: 16-64
- 4B model: 32-128

Approximate memory impact:

Memory ≈ baseline_memory × (1 + 2 × lora_rank / hidden_dim)

For example, with hidden_dim=2048 (substitute your model's actual hidden size):

- rank 16: +1.6% memory
- rank 32: +3.1% memory
- rank 64: +6.3% memory
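The heuristic above can be evaluated directly for other ranks or hidden sizes. This is a minimal sketch that implements only the formula as written; lora_memory_overhead and estimated_memory are hypothetical helper names, and the result is a rough estimate, not a measurement of what the trainer allocates.

```python
def lora_memory_overhead(lora_rank: int, hidden_dim: int) -> float:
    """Fractional overhead from the heuristic 2 * lora_rank / hidden_dim."""
    if lora_rank <= 0 or hidden_dim <= 0:
        raise ValueError("lora_rank and hidden_dim must be positive")
    return 2 * lora_rank / hidden_dim


def estimated_memory(baseline_memory_gb: float, lora_rank: int, hidden_dim: int) -> float:
    """Memory ≈ baseline_memory × (1 + 2 × lora_rank / hidden_dim)."""
    return baseline_memory_gb * (1 + lora_memory_overhead(lora_rank, hidden_dim))


if __name__ == "__main__":
    # Reproduce the worked example above (hidden_dim=2048).
    for rank in (16, 32, 64):
        pct = 100 * lora_memory_overhead(rank, hidden_dim=2048)
        print(f"rank {rank}: +{pct:.2f}% memory")
```

The printed percentages carry two decimals, so they show the unrounded values behind the +1.6%, +3.1%, and +6.3% figures.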

Example:

lora_rank: 64

lora_alpha

Type: float

Default: equals lora_rank

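Scaling factor for the LoRA update. In the standard LoRA formulation the low-rank update is multiplied by lora_alpha / lora_rank, so the usual choice of lora_alpha equal to lora_rank gives a scale of 1.0.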

Affects training dynamics:

Higher alpha: stronger LoRA updates
Lower alpha: weaker LoRA updates

Typically, tie it to the rank with a Hydra interpolation:

lora_alpha: ${model.lora_rank}

Or set manually:

lora_alpha: 16.0
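To make the effect of lora_alpha concrete, here is a pure-Python sketch of a LoRA-adapted linear layer, y = W x + (lora_alpha / lora_rank) · B (A x), using the scaling from the original LoRA paper. It illustrates the technique and is not this trainer's code; the helper names are hypothetical, and whether the trainer scales by exactly lora_alpha / lora_rank is an assumption.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector,
                 lora_rank: int, lora_alpha: float) -> Vector:
    """y = W x + (lora_alpha / lora_rank) * B (A x).

    W is the frozen d_out x d_in weight, A is rank x d_in, B is d_out x rank.
    """
    scale = lora_alpha / lora_rank
    base = matvec(w, x)                 # frozen path
    update = matvec(b, matvec(a, x))    # low-rank path
    return [y0 + scale * dy for y0, dy in zip(base, update)]


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 identity weight
    a = [[1.0, 1.0]]               # rank-1 down-projection (1 x 2)
    b = [[0.5], [0.5]]             # rank-1 up-projection (2 x 1)
    x = [1.0, 2.0]
    # alpha == rank gives scale 1.0; doubling alpha doubles the LoRA contribution.
    print(lora_forward(w, a, b, x, lora_rank=1, lora_alpha=1.0))  # [2.5, 3.5]
    print(lora_forward(w, a, b, x, lora_rank=1, lora_alpha=2.0))  # [4.0, 5.0]
```

Only the low-rank term is scaled; the frozen W x term is unaffected, which is why a higher alpha means a stronger LoRA update.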

lora_module_path

Type: string (regex pattern)

Regular expression matched against layer names; LoRA adapters are added to every layer whose name matches.

Default for Gemma3:

lora_module_path: ".*q_einsum|.*kv_einsum|.*gate_proj|.*down_proj|.*up_proj|.*attn_vec_einsum"

This applies LoRA to:

Attention query, key/value, and output projections (q_einsum, kv_einsum, attn_vec_einsum)
MLP gate, up, and down projections (gate_proj, up_proj, down_proj)

To apply LoRA to all layers (not recommended, memory intensive):

lora_module_path: ".*"

To apply LoRA only to attention:

lora_module_path: ".*einsum"

To apply LoRA only to MLP:

lora_module_path: ".*proj"
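To check which layers a pattern selects before launching a run, you can test it against layer names. This sketch assumes the trainer matches each full layer path with Python's re.fullmatch; if it uses re.search instead, a pattern can also match in the middle of a name. The layer paths below are hypothetical examples, not names read from a real Gemma3 checkpoint.

```python
import re
from typing import List

DEFAULT_PATTERN = (
    ".*q_einsum|.*kv_einsum|.*gate_proj|.*down_proj|.*up_proj|.*attn_vec_einsum"
)

# Hypothetical layer paths for one transformer block plus the embedder.
LAYER_NAMES = [
    "layer_0/attn/q_einsum",
    "layer_0/attn/kv_einsum",
    "layer_0/attn/attn_vec_einsum",
    "layer_0/mlp/gate_proj",
    "layer_0/mlp/up_proj",
    "layer_0/mlp/down_proj",
    "layer_0/pre_attention_norm",
    "embedder",
]


def matching_layers(pattern: str, names: List[str]) -> List[str]:
    """Return the names the whole pattern matches (re.fullmatch semantics)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


if __name__ == "__main__":
    for pattern in (DEFAULT_PATTERN, ".*einsum", ".*proj", ".*"):
        print(f"{pattern!r}: {matching_layers(pattern, LAYER_NAMES)}")
```

With these names, the default pattern selects the six attention and MLP projections, ".*einsum" and ".*proj" split them into the attention and MLP halves, and ".*" also picks up the norm and the embedder.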

mesh_shape

Type: list of two lists: [axis sizes, axis names]

Default: [[1, 1], ["fsdp", "tp"]]

Parallelism configuration for distributed training.

Format: [[num_devices_fsdp, num_devices_tp], ["fsdp", "tp"]]

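Where:

num_devices_fsdp: number of GPUs for fully sharded data parallelism (FSDP)
num_devices_tp: number of GPUs for tensor parallelism (TP)

The product num_devices_fsdp × num_devices_tp is the total number of GPUs in the mesh.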

Single GPU:

mesh_shape: [[1, 1], ["fsdp", "tp"]]

Fully sharded data parallel (4 GPUs):

mesh_shape: [[4, 1], ["fsdp", "tp"]]

Tensor parallel (4 GPUs):

mesh_shape: [[1, 4], ["fsdp", "tp"]]

Hybrid (8 GPUs, 2 data × 4 tensor):

mesh_shape: [[2, 4], ["fsdp", "tp"]]

See Distributed Training for details.
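A quick sanity check before launching: the product of the axis sizes is the number of devices the mesh spans, so it should equal the number of GPUs you intend to use. The sketch below is a hypothetical helper, not part of run_training.py. In JAX, for example, the two lists line up with the axis_shapes and axis_names arguments of jax.make_mesh; whether this trainer builds its mesh that way is an assumption.

```python
import math
from typing import List, Sequence, Union

MeshShape = List[Sequence[Union[int, str]]]


def validate_mesh_shape(mesh_shape: MeshShape, num_gpus: int) -> int:
    """Check a [[sizes...], [names...]] mesh spec and return its device count.

    Raises ValueError if the spec is malformed or does not match num_gpus.
    """
    if len(mesh_shape) != 2:
        raise ValueError("mesh_shape must be [axis_sizes, axis_names]")
    sizes, names = mesh_shape
    if len(sizes) != len(names):
        raise ValueError("need exactly one axis name per axis size")
    if any(not isinstance(s, int) or s < 1 for s in sizes):
        raise ValueError("axis sizes must be positive integers")
    devices = math.prod(sizes)
    if devices != num_gpus:
        raise ValueError(f"mesh spans {devices} devices but {num_gpus} GPUs are in use")
    return devices


if __name__ == "__main__":
    print(validate_mesh_shape([[1, 1], ["fsdp", "tp"]], num_gpus=1))  # single GPU
    print(validate_mesh_shape([[2, 4], ["fsdp", "tp"]], num_gpus=8))  # hybrid
```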

Complete Model Configuration Example

# conf/model/custom.yaml
model_family: gemma3
model_size: 1b
lora_rank: 64
lora_alpha: 64.0
lora_module_path: ".*q_einsum|.*kv_einsum|.*gate_proj|.*down_proj|.*up_proj|.*attn_vec_einsum"
mesh_shape: [[1, 1], ["fsdp", "tp"]]

Use:

python run_training.py model=custom

Memory Requirements by Configuration

RTX 2080 Ti (11GB VRAM)

python run_training.py model=gemma3_270m model.lora_rank=8 training.micro_batch_size=1

RTX A6000 (48GB VRAM)

python run_training.py model=gemma3_1b model.lora_rank=32 training.micro_batch_size=4

H100 (80GB VRAM)

python run_training.py model=gemma3_4b model.lora_rank=64 training.micro_batch_size=8

Multi-GPU (4× A100 80GB)

python run_training.py model=gemma3_4b model.lora_rank=128 \
    'model.mesh_shape=[[2,2],[fsdp,tp]]' training.micro_batch_size=8
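The single-GPU starting points above can be encoded as a small lookup that picks the largest listed configuration fitting a given amount of VRAM and renders the matching command. The thresholds are simply the VRAM figures from this section; suggest_config and to_command are hypothetical helpers, and the multi-GPU entry is left out because it depends on the device count as well as per-GPU memory.

```python
from typing import Dict, List, Tuple

# (VRAM per GPU in GB, Hydra overrides), copied from the single-GPU
# configurations listed in this section, smallest first.
SINGLE_GPU_CONFIGS: List[Tuple[int, Dict[str, object]]] = [
    (11, {"model": "gemma3_270m", "model.lora_rank": 8, "training.micro_batch_size": 1}),
    (48, {"model": "gemma3_1b", "model.lora_rank": 32, "training.micro_batch_size": 4}),
    (80, {"model": "gemma3_4b", "model.lora_rank": 64, "training.micro_batch_size": 8}),
]


def suggest_config(vram_gb: float) -> Dict[str, object]:
    """Return the largest listed single-GPU config whose VRAM figure fits."""
    fitting = [cfg for need, cfg in SINGLE_GPU_CONFIGS if vram_gb >= need]
    if not fitting:
        raise ValueError(f"no listed configuration fits in {vram_gb} GB")
    return fitting[-1]


def to_command(overrides: Dict[str, object]) -> str:
    """Render overrides as a run_training.py command line."""
    args = " ".join(f"{key}={value}" for key, value in overrides.items())
    return f"python run_training.py {args}"


if __name__ == "__main__":
    # python run_training.py model=gemma3_1b model.lora_rank=32 training.micro_batch_size=4
    print(to_command(suggest_config(48)))
```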

Tuning LoRA Rank

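Finding the Right Rank

Start with the default (32 for the 1B model) and adjust based on:

Memory constraints:

# If you run out of memory (OOM), lower the rank
python run_training.py model=gemma3_1b model.lora_rank=16

Training quality:

# If quality is poor, raise the rank
python run_training.py model=gemma3_1b model.lora_rank=64

Speed/memory trade-off:

# The default balances training speed and capacity
python run_training.py model=gemma3_1b model.lora_rank=32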
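The adjustment rules above (halve the rank on OOM, double it when quality is poor) can be sketched as a helper that stays within the recommended ranges from the lora_rank section. The halving/doubling step mirrors the 32 → 16 and 32 → 64 examples; next_lora_rank is a hypothetical name, not a feature of the training script.

```python
# Recommended LoRA rank ranges per model_size, copied from the lora_rank section.
RECOMMENDED_RANKS = {"270m": (8, 32), "1b": (16, 64), "4b": (32, 128)}


def next_lora_rank(model_size: str, current_rank: int,
                   oom: bool = False, poor_quality: bool = False) -> int:
    """Suggest the next lora_rank to try, clamped to the recommended range.

    OOM takes precedence over quality, since a run that does not fit in
    memory cannot be evaluated. If the rank is already at the bottom of the
    range, the value is returned unchanged; lowering
    training.micro_batch_size or using a smaller model is the next lever.
    """
    low, high = RECOMMENDED_RANKS[model_size]
    if oom:
        candidate = current_rank // 2
    elif poor_quality:
        candidate = current_rank * 2
    else:
        candidate = current_rank
    return max(low, min(high, candidate))


if __name__ == "__main__":
    print(next_lora_rank("1b", 32, oom=True))           # 16
    print(next_lora_rank("1b", 32, poor_quality=True))  # 64
```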

Testing Different Ranks

Create an experiment config that keeps each sweep run short (100 batches):

# conf/experiment/rank_sweep.yaml
# @package _global_
training:
  num_batches: 100

Run:

python run_training.py --multirun +experiment=rank_sweep \
    model.lora_rank=8,16,32,64

Hydra launches one run per rank; compare the runs' metrics to choose a rank.
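One thing to watch in a rank sweep: the shipped 270M and 1B model configs set lora_alpha to a fixed 32.0, so overriding only model.lora_rank also changes the ratio lora_alpha / lora_rank. Assuming the trainer uses the standard LoRA scaling alpha / rank, this sketch shows the effective scale for each swept rank; effective_scales is a hypothetical helper. A sweep that holds the scale at 1.0 would also need lora_alpha to follow the rank, for example via the ${model.lora_rank} interpolation from the lora_alpha section.

```python
from typing import Dict, Iterable


def effective_scales(ranks: Iterable[int], lora_alpha: float) -> Dict[int, float]:
    """Map each LoRA rank to the scale lora_alpha / rank applied to its update."""
    return {rank: lora_alpha / rank for rank in ranks}


if __name__ == "__main__":
    # Fixed alpha of 32.0 while sweeping model.lora_rank=8,16,32,64.
    for rank, scale in effective_scales([8, 16, 32, 64], lora_alpha=32.0).items():
        print(f"lora_rank={rank}: scale={scale}")
```

Under that assumption the scale falls from 4.0 at rank 8 to 0.5 at rank 64, so differences between runs mix the effect of capacity with the effect of update strength.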

Creating Custom Models

To add support for a new model:

Create a config file, conf/model/newmodel.yaml:

model_family: newmodel
model_size: 1b
lora_rank: 32
lora_alpha: 32.0
lora_module_path: ".*pattern_matching_layers"
mesh_shape: [[1, 1], ["fsdp", "tp"]]

Update the model-loading code if the new model family is not already supported.
Update the tokenizer path if the new model uses a different tokenizer.

Then use:

python run_training.py model=newmodel
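Before running with a new config, it can help to check it against the constraints documented in this section. The sketch below validates a config already loaded into a plain dict; loading the YAML itself (for example with PyYAML or OmegaConf) is left out so the example has no dependencies. validate_model_config is a hypothetical helper; the 4 to 128 rank range and the [[sizes], [names]] mesh format come from the parameter reference above.

```python
import re
from typing import Any, Dict, List

REQUIRED_KEYS = ("model_family", "model_size", "lora_rank", "lora_alpha",
                 "lora_module_path", "mesh_shape")


def validate_model_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a model config (empty if none)."""
    errors = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in cfg]
    if errors:
        return errors

    rank = cfg["lora_rank"]
    if not isinstance(rank, int) or not 4 <= rank <= 128:
        errors.append("lora_rank must be an integer between 4 and 128")

    if not isinstance(cfg["lora_alpha"], (int, float)):
        errors.append("lora_alpha must be a number")

    try:
        re.compile(cfg["lora_module_path"])
    except (re.error, TypeError):
        errors.append("lora_module_path is not a valid regular expression")

    mesh = cfg["mesh_shape"]
    ok = (isinstance(mesh, list) and len(mesh) == 2
          and all(isinstance(part, list) for part in mesh)
          and len(mesh[0]) == len(mesh[1]))
    if not ok:
        errors.append("mesh_shape must be [[axis sizes], [axis names]] of equal length")

    return errors


if __name__ == "__main__":
    new_model = {
        "model_family": "newmodel",
        "model_size": "1b",
        "lora_rank": 32,
        "lora_alpha": 32.0,
        "lora_module_path": ".*pattern_matching_layers",
        "mesh_shape": [[1, 1], ["fsdp", "tp"]],
    }
    print(validate_model_config(new_model))  # []
```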

Integration with Training

The model config affects training in three ways:

LoRA: lora_rank, lora_alpha, and lora_module_path define the LoRA adapters that are fine-tuned
Distributed training: mesh_shape controls parallelism
Memory: model_size and lora_rank, together with the micro batch size, determine memory usage

The optimal configuration depends on:

Available GPU memory
Training data size
Time constraints
Target model quality

Next Steps

Configuration Overview - Configuration overview
Configuration Guide - Configuration guide
Models API - Model API reference
Distributed Training - Distributed training setup