Glossary

Accuracy: Percentage of exact matches between model outputs and expected answers.
Activation: Output of a neural network layer; intermediate representation passed to next layer.
Adaptation: Process of modifying a pre-trained model for a specific task. LoRA is an adaptation technique.
AdamW: Adaptive Moment Estimation optimizer with decoupled weight decay. Standard optimizer used in Agent-Tunix.
Attention Mask: Binary mask indicating which tokens should be attended to (1) and which are padding (0).
Baseline: Reference value used in reward computation to normalize/center rewards.
Batch Size: Number of samples processed together in one training step. Larger batches = more stable gradients.
Beam Search: Decoding strategy that tracks multiple hypothesis sequences and keeps the best ones.
Benchmark: Set of standard problems used to evaluate model performance.
Beta (β): In GRPO, weight controlling KL divergence penalty. Higher β keeps model closer to reference.
Bias (in neural networks): Learnable parameters added to layer outputs; enables modeling non-linear relationships.
Bias (statistical): Systematic errors in model predictions; different from variance.
Checkpoint: Saved model weights at a training step, allowing resumption and model selection.
Clipping (Gradient): Limiting gradient magnitudes to prevent exploding gradients during backpropagation.
Cluster Config: Configuration specifying how multiple GPUs/nodes are arranged for training.
Computation Graph: DAG (directed acyclic graph) representing mathematical operations and their dependencies.
Conditioning: Process of providing context to influence model output; e.g., prompt conditioning.
Configuration (Hydra): YAML-based specification of training parameters, model settings, and experiment details.
Convergence: Training state where loss stops decreasing, model has reached a local optimum.
Cross-Entropy Loss: Standard loss function for classification/language modeling tasks.
CUDA: Compute Unified Device Architecture; NVIDIA’s parallel computing platform for GPUs.
Curriculum Learning: Training strategy starting with easy examples, progressing to harder ones.
Data Parallelism: Distributing different data batches across multiple devices while replicating the model.
Decoding: Process of generating text from model logits (scores) using sampling or greedy selection.
Divergence: When training loss increases over time; indicates learning rate too high or data issue.
Dropout: Regularization technique randomly disabling neurons during training to prevent overfitting.
FSDP: Fully Sharded Data Parallel; JAX distributed training strategy sharding model and data.
Embedding: Vector representation of discrete tokens or concepts learned during training.
Entropy: Measure of randomness/uncertainty in a probability distribution.
Epsilon (ε): In PPO/GRPO, clipping range for policy updates; controls maximum gradient step.
Epoch: One complete pass through the entire training dataset.
Evaluation: Process of assessing model performance on held-out test data using metrics.
Example (training): Single data point consisting of input prompt and target output.
Fine-tuning: Training a pre-trained model on task-specific data; adapts general knowledge to specific task.
Flax: Neural network library for JAX providing layer abstractions and utilities.
Forward Pass: Computing network output given inputs; first stage of training step.
Frozen Weights: Model parameters that are not updated during training; held constant as reference.
Generation (text): Process of producing new text sequences conditioned on prompt input.
Gradient: Direction and magnitude of loss change with respect to parameters; used to update weights.
Gradient Accumulation: Computing gradients over multiple mini-batches before updating weights; simulates larger batch.
Gradient Descent: Optimization algorithm updating parameters by moving in negative gradient direction.
Greedy Decoding: Selecting highest-probability token at each step; deterministic, fast generation.
Group Relative Policy Optimization (GRPO): Reinforcement learning algorithm generating K responses per prompt and computing group-relative rewards.
Hallucination: Model generating plausible-sounding but false information not supported by training data.
Hyperparameter: Configuration setting controlling training dynamics (learning rate, batch size, etc.); not learned.
Hydra: Configuration management framework enabling YAML-based parametrization and composition.
Input IDs: Numeric token indices representing text input to neural network.
Inference: Using trained model to generate predictions on new data.
Interpolation (Hydra): Referencing other config values using ${path.to.value} syntax.
JAX: Array computation library from Google enabling GPU-accelerated numerical computing.
KL Divergence: Measure of distance between two probability distributions; used to constrain policy deviation.
Layer: Distinct processing unit in neural network; applies transformation to input.
Learning Rate: Hyperparameter controlling step size in gradient descent; critical for training stability.
Learning Rate Scheduler: Strategy for adjusting learning rate during training (warmup, cosine decay, etc.).
Log (training): Record of metrics (loss, accuracy, etc.) computed during training for monitoring progress.
Logits: Raw, unnormalized output scores from neural network before softmax/sampling.
LoRA: Low-Rank Adaptation; parameter-efficient fine-tuning adding small trainable matrices to frozen model.
LoRA Rank: Dimension of low-rank matrices in LoRA; higher rank = more capacity but more parameters.
Loss Function: Mathematical function quantifying difference between model predictions and targets; guided by gradients.
Mask (in attention): Binary indicator controlling which tokens interact; prevents attending to future tokens (causal mask).
Memory (GPU): High-speed storage on GPU holding model weights, activations, and gradients; limited resource.
Mesh Shape: Configuration specifying how devices arranged for distributed training (FSDP × TP dimensions).
Metric: Quantitative measure of model performance (accuracy, loss, F1, etc.).
Mini-batch: Small subset of data processed together; typical size 1-256 examples.
Mixed Precision: Training using both float32 (high precision) and float16 (lower precision) for speed/memory trade-off.
Model: Neural network architecture with learnable parameters (weights, biases, embeddings, etc.).
Model Family: Category of architectures (Gemma3, LLaMA, etc.); defines structure and behavior.
Momentum: Accumulation of previous gradients; helps optimization converge faster and escape local minima.
Multi-run: Running same experiment multiple times with different hyperparameter values (parameter sweep).
NaN (Not a Number): Invalid floating-point value indicating computation failure; training becomes undefined.
Normalization: Rescaling values to standard range (usually 0-1 or mean 0, std 1) for stable training.
Nucleus Sampling (Top-p): Decoding strategy selecting from highest-probability tokens summing to threshold p.
Optimizer: Algorithm updating model weights based on gradients (Adam, SGD, AdamW, etc.).
Overrides (configuration): Command-line changes to config parameters without modifying YAML files.
Parameter Sharing: Reusing same weights across multiple positions/layers to reduce memory and improve efficiency.
Perplexity: Inverse probability of ground truth sequence; lower is better for language models.
Policy: Model trained using reinforcement learning to maximize expected reward.
PPO (Proximal Policy Optimization): Reinforcement learning algorithm with clipped objective preventing large policy updates.
Prompt: Input text conditioning model output; text input to language model.
Prompt Engineering: Designing effective prompts to elicit desired model behavior.
Pruning: Removing small-weight connections from neural network to reduce size/computation.
Quantization: Reducing precision of weights/activations (float32 → int8) to save memory.
Rank (LoRA): See LoRA Rank.
Recall: Fraction of positive examples correctly identified; useful for imbalanced problems.
Reference Model: Original frozen model used as baseline; policy model trained relative to reference.
Regularization: Technique preventing overfitting by penalizing complex models (dropout, L2, etc.).
Reinforcement Learning (RL): Learning paradigm where agent optimizes behavior to maximize cumulative reward signal.
Reward Function: Function evaluating model responses and returning numerical score guiding training.
Reward Shaping: Adding intermediate signals to guide learning beyond primary reward.
Sampling (decoding): Stochastic generation selecting tokens from probability distribution.
Scheduler (learning rate): Strategy for adjusting learning rate during training for better convergence.
Seed (random): Initial value for random number generator; same seed = reproducible randomness.
Softmax: Normalization function converting logits to probability distribution.
Stable Training: Training where loss smoothly decreases without spikes, divergence, or NaN errors.
Step (training): Single gradient update; one mini-batch processed and weights updated.
Temperature (sampling): Parameter controlling randomness of decoding (0 = deterministic, ∞ = uniform random).
Tensor Parallel: Distributing model tensors across multiple devices; suits very large models.
Tokenization: Process of converting text into token indices; inverse is detokenization.
Token: Discrete unit of text (word, subword, character); basic unit of language models.
Top-k Sampling: Decoding selecting from k highest-probability tokens.
Training: Process of updating model parameters to minimize loss on training data.
Validation: Evaluating model on held-out data to monitor generalization during training.
Warmup (learning rate): Initial training phase with gradually increasing learning rate; improves stability.
Warmup Ratio: Fraction of training devoted to warmup phase (typical 0.05-0.1).
Weights: Learnable parameters of neural network; updated during training via gradients.
Weight Decay: Regularization penalizing large weights; encourages sparse solutions.
Weights & Biases (W&B): Platform for tracking, visualizing, and comparing machine learning experiments.
Zero-shot: Model performing task without seeing examples; relies on pre-training knowledge.

Next Steps

Training Guide - Training guide
Frequently Asked Questions - Frequently asked questions
Configuration Guide - Configuration reference