Glossary ======== **Accuracy** Percentage of exact matches between model outputs and expected answers. **Activation** Output of a neural network layer; intermediate representation passed to next layer. **Adaptation** Process of modifying a pre-trained model for a specific task. LoRA is an adaptation technique. **AdamW** Adaptive Moment Estimation optimizer with decoupled weight decay. Standard optimizer used in Agent-Tunix. **Attention Mask** Binary mask indicating which tokens should be attended to (1) and which are padding (0). **Baseline** Reference value used in reward computation to normalize/center rewards. **Batch Size** Number of samples processed together in one training step. Larger batches = more stable gradients. **Beam Search** Decoding strategy that tracks multiple hypothesis sequences and keeps the best ones. **Benchmark** Set of standard problems used to evaluate model performance. **Beta (β)** In GRPO, weight controlling KL divergence penalty. Higher β keeps model closer to reference. **Bias (in neural networks)** Learnable parameters added to layer outputs; enables modeling non-linear relationships. **Bias (statistical)** Systematic errors in model predictions; different from variance. **Checkpoint** Saved model weights at a training step, allowing resumption and model selection. **Clipping (Gradient)** Limiting gradient magnitudes to prevent exploding gradients during backpropagation. **Cluster Config** Configuration specifying how multiple GPUs/nodes are arranged for training. **Computation Graph** DAG (directed acyclic graph) representing mathematical operations and their dependencies. **Conditioning** Process of providing context to influence model output; e.g., prompt conditioning. **Configuration (Hydra)** YAML-based specification of training parameters, model settings, and experiment details. **Convergence** Training state where loss stops decreasing, model has reached a local optimum. **Cross-Entropy Loss** Standard loss function for classification/language modeling tasks. **CUDA** Compute Unified Device Architecture; NVIDIA's parallel computing platform for GPUs. **Curriculum Learning** Training strategy starting with easy examples, progressing to harder ones. **Data Parallelism** Distributing different data batches across multiple devices while replicating the model. **Decoding** Process of generating text from model logits (scores) using sampling or greedy selection. **Divergence** When training loss increases over time; indicates learning rate too high or data issue. **Dropout** Regularization technique randomly disabling neurons during training to prevent overfitting. **FSDP** Fully Sharded Data Parallel; JAX distributed training strategy sharding model and data. **Embedding** Vector representation of discrete tokens or concepts learned during training. **Entropy** Measure of randomness/uncertainty in a probability distribution. **Epsilon (ε)** In PPO/GRPO, clipping range for policy updates; controls maximum gradient step. **Epoch** One complete pass through the entire training dataset. **Evaluation** Process of assessing model performance on held-out test data using metrics. **Example (training)** Single data point consisting of input prompt and target output. **Fine-tuning** Training a pre-trained model on task-specific data; adapts general knowledge to specific task. **Flax** Neural network library for JAX providing layer abstractions and utilities. **Forward Pass** Computing network output given inputs; first stage of training step. **Frozen Weights** Model parameters that are not updated during training; held constant as reference. **Generation (text)** Process of producing new text sequences conditioned on prompt input. **Gradient** Direction and magnitude of loss change with respect to parameters; used to update weights. **Gradient Accumulation** Computing gradients over multiple mini-batches before updating weights; simulates larger batch. **Gradient Descent** Optimization algorithm updating parameters by moving in negative gradient direction. **Greedy Decoding** Selecting highest-probability token at each step; deterministic, fast generation. **Group Relative Policy Optimization (GRPO)** Reinforcement learning algorithm generating K responses per prompt and computing group-relative rewards. **Hallucination** Model generating plausible-sounding but false information not supported by training data. **Hyperparameter** Configuration setting controlling training dynamics (learning rate, batch size, etc.); not learned. **Hydra** Configuration management framework enabling YAML-based parametrization and composition. **Input IDs** Numeric token indices representing text input to neural network. **Inference** Using trained model to generate predictions on new data. **Interpolation (Hydra)** Referencing other config values using ${path.to.value} syntax. **JAX** Array computation library from Google enabling GPU-accelerated numerical computing. **KL Divergence** Measure of distance between two probability distributions; used to constrain policy deviation. **Layer** Distinct processing unit in neural network; applies transformation to input. **Learning Rate** Hyperparameter controlling step size in gradient descent; critical for training stability. **Learning Rate Scheduler** Strategy for adjusting learning rate during training (warmup, cosine decay, etc.). **Log (training)** Record of metrics (loss, accuracy, etc.) computed during training for monitoring progress. **Logits** Raw, unnormalized output scores from neural network before softmax/sampling. **LoRA** Low-Rank Adaptation; parameter-efficient fine-tuning adding small trainable matrices to frozen model. **LoRA Rank** Dimension of low-rank matrices in LoRA; higher rank = more capacity but more parameters. **Loss Function** Mathematical function quantifying difference between model predictions and targets; guided by gradients. **Mask (in attention)** Binary indicator controlling which tokens interact; prevents attending to future tokens (causal mask). **Memory (GPU)** High-speed storage on GPU holding model weights, activations, and gradients; limited resource. **Mesh Shape** Configuration specifying how devices arranged for distributed training (FSDP × TP dimensions). **Metric** Quantitative measure of model performance (accuracy, loss, F1, etc.). **Mini-batch** Small subset of data processed together; typical size 1-256 examples. **Mixed Precision** Training using both float32 (high precision) and float16 (lower precision) for speed/memory trade-off. **Model** Neural network architecture with learnable parameters (weights, biases, embeddings, etc.). **Model Family** Category of architectures (Gemma3, LLaMA, etc.); defines structure and behavior. **Momentum** Accumulation of previous gradients; helps optimization converge faster and escape local minima. **Multi-run** Running same experiment multiple times with different hyperparameter values (parameter sweep). **NaN (Not a Number)** Invalid floating-point value indicating computation failure; training becomes undefined. **Normalization** Rescaling values to standard range (usually 0-1 or mean 0, std 1) for stable training. **Nucleus Sampling (Top-p)** Decoding strategy selecting from highest-probability tokens summing to threshold p. **Optimizer** Algorithm updating model weights based on gradients (Adam, SGD, AdamW, etc.). **Overrides (configuration)** Command-line changes to config parameters without modifying YAML files. **Parameter Sharing** Reusing same weights across multiple positions/layers to reduce memory and improve efficiency. **Perplexity** Inverse probability of ground truth sequence; lower is better for language models. **Policy** Model trained using reinforcement learning to maximize expected reward. **PPO (Proximal Policy Optimization)** Reinforcement learning algorithm with clipped objective preventing large policy updates. **Prompt** Input text conditioning model output; text input to language model. **Prompt Engineering** Designing effective prompts to elicit desired model behavior. **Pruning** Removing small-weight connections from neural network to reduce size/computation. **Quantization** Reducing precision of weights/activations (float32 → int8) to save memory. **Rank (LoRA)** See LoRA Rank. **Recall** Fraction of positive examples correctly identified; useful for imbalanced problems. **Reference Model** Original frozen model used as baseline; policy model trained relative to reference. **Regularization** Technique preventing overfitting by penalizing complex models (dropout, L2, etc.). **Reinforcement Learning (RL)** Learning paradigm where agent optimizes behavior to maximize cumulative reward signal. **Reward Function** Function evaluating model responses and returning numerical score guiding training. **Reward Shaping** Adding intermediate signals to guide learning beyond primary reward. **Sampling (decoding)** Stochastic generation selecting tokens from probability distribution. **Scheduler (learning rate)** Strategy for adjusting learning rate during training for better convergence. **Seed (random)** Initial value for random number generator; same seed = reproducible randomness. **Softmax** Normalization function converting logits to probability distribution. **Stable Training** Training where loss smoothly decreases without spikes, divergence, or NaN errors. **Step (training)** Single gradient update; one mini-batch processed and weights updated. **Temperature (sampling)** Parameter controlling randomness of decoding (0 = deterministic, ∞ = uniform random). **Tensor Parallel** Distributing model tensors across multiple devices; suits very large models. **Tokenization** Process of converting text into token indices; inverse is detokenization. **Token** Discrete unit of text (word, subword, character); basic unit of language models. **Top-k Sampling** Decoding selecting from k highest-probability tokens. **Training** Process of updating model parameters to minimize loss on training data. **Validation** Evaluating model on held-out data to monitor generalization during training. **Warmup (learning rate)** Initial training phase with gradually increasing learning rate; improves stability. **Warmup Ratio** Fraction of training devoted to warmup phase (typical 0.05-0.1). **Weights** Learnable parameters of neural network; updated during training via gradients. **Weight Decay** Regularization penalizing large weights; encourages sparse solutions. **Weights & Biases (W&B)** Platform for tracking, visualizing, and comparing machine learning experiments. **Zero-shot** Model performing task without seeing examples; relies on pre-training knowledge. Next Steps ---------- - :doc:`../guide/training` - Training guide - :doc:`faq` - Frequently asked questions - :doc:`../getting_started/configuration` - Configuration reference