Glossary
- Accuracy
Percentage of exact matches between model outputs and expected answers.
- Activation
Output of a neural network layer; intermediate representation passed to next layer.
- Adaptation
Process of modifying a pre-trained model for a specific task. LoRA is an adaptation technique.
- AdamW
Adaptive Moment Estimation optimizer with decoupled weight decay. Standard optimizer used in Agent-Tunix.
- Attention Mask
Binary mask indicating which tokens should be attended to (1) and which are padding (0).
- Baseline
Reference value used in reward computation to normalize/center rewards.
- Batch Size
Number of samples processed together in one training step. Larger batches give lower-variance gradient estimates but need more memory.
- Beam Search
Decoding strategy that tracks multiple hypothesis sequences and keeps the best ones.
- Benchmark
Set of standard problems used to evaluate model performance.
- Beta (β)
In GRPO, weight controlling KL divergence penalty. Higher β keeps model closer to reference.
- Bias (in neural networks)
Learnable offset added to a layer's weighted sum; lets the layer shift its output independently of its input. Non-linearity comes from the activation function, not the bias.
- Bias (statistical)
Systematic error in model predictions caused by overly simple or wrong assumptions; distinct from variance, which is sensitivity to the particular training sample.
- Checkpoint
Saved model weights at a training step, allowing resumption and model selection.
- Clipping (Gradient)
Limiting gradient magnitudes to prevent exploding gradients during backpropagation.
- Cluster Config
Configuration specifying how multiple GPUs/nodes are arranged for training.
- Computation Graph
DAG (directed acyclic graph) representing mathematical operations and their dependencies.
- Conditioning
Process of providing context to influence model output; e.g., prompt conditioning.
- Configuration (Hydra)
YAML-based specification of training parameters, model settings, and experiment details.
- Convergence
Training state where loss stops decreasing meaningfully; the model has typically settled near a local optimum or plateau.
- Cross-Entropy Loss
Negative log-probability assigned to the correct class or token; standard loss function for classification/language modeling tasks.
- CUDA
Compute Unified Device Architecture; NVIDIA’s parallel computing platform for GPUs.
- Curriculum Learning
Training strategy starting with easy examples, progressing to harder ones.
- Data Parallelism
Distributing different data batches across multiple devices while replicating the model.
- Decoding
Process of generating text from model logits (scores) using sampling or greedy selection.
- Divergence
When training loss increases over time instead of decreasing; often indicates a learning rate that is too high or a data issue.
- Dropout
Regularization technique randomly disabling neurons during training to prevent overfitting.
- FSDP
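Fully Sharded Data Parallel; distributed training strategy (applied here in JAX) that shards model parameters, gradients, and optimizer state across devices while each device processes a different slice of the data.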
- Embedding
Vector representation of discrete tokens or concepts learned during training.
- Entropy
Measure of randomness/uncertainty in a probability distribution.
- Epsilon (ε)
In PPO/GRPO, clipping range for the policy probability ratio; bounds how far a single update can move the policy away from the one that generated the samples.
- Epoch
One complete pass through the entire training dataset.
- Evaluation
Process of assessing model performance on held-out test data using metrics.
- Example (training)
Single data point consisting of input prompt and target output.
- Fine-tuning
Training a pre-trained model on task-specific data; adapts general knowledge to specific task.
- Flax
Neural network library for JAX providing layer abstractions and utilities.
- Forward Pass
Computing network output given inputs; first stage of training step.
- Frozen Weights
Model parameters that are not updated during training, such as the base model under LoRA or the reference model in GRPO.
- Generation (text)
Process of producing new text sequences conditioned on prompt input.
- Gradient
Direction and magnitude of loss change with respect to parameters; used to update weights.
- Gradient Accumulation
Computing gradients over multiple mini-batches before updating weights; simulates larger batch.
- Gradient Descent
Optimization algorithm updating parameters by moving in negative gradient direction.
- Greedy Decoding
Selecting highest-probability token at each step; deterministic, fast generation.
- Group Relative Policy Optimization (GRPO)
Reinforcement learning algorithm generating K responses per prompt and scoring each one relative to the other responses in its group (group-relative rewards), so the group itself serves as the baseline.
- Hallucination
Model generating plausible-sounding but false information not supported by training data.
- Hyperparameter
Configuration setting controlling training dynamics (learning rate, batch size, etc.); not learned.
- Hydra
Configuration management framework enabling YAML-based parametrization and composition.
- Input IDs
Numeric token indices representing text input to neural network.
- Inference
Using trained model to generate predictions on new data.
- Interpolation (Hydra)
Referencing other config values using ${path.to.value} syntax.
- JAX
Array computation library from Google with automatic differentiation and JIT compilation; enables GPU/TPU-accelerated numerical computing.
- KL Divergence
Asymmetric measure of how much one probability distribution differs from another (not a true distance); used to constrain policy deviation.
- Layer
Distinct processing unit in neural network; applies transformation to input.
- Learning Rate
Hyperparameter controlling step size in gradient descent; critical for training stability.
- Learning Rate Scheduler
Strategy for adjusting learning rate during training (warmup, cosine decay, etc.).
- Log (training)
Record of metrics (loss, accuracy, etc.) computed during training for monitoring progress.
- Logits
Raw, unnormalized output scores from neural network before softmax/sampling.
- LoRA
Low-Rank Adaptation; parameter-efficient fine-tuning adding small trainable matrices to frozen model.
- LoRA Rank
Dimension of low-rank matrices in LoRA; higher rank = more capacity but more parameters.
- Loss Function
Mathematical function quantifying difference between model predictions and targets; its gradients guide the weight updates.
- Mask (in attention)
Binary indicator controlling which tokens interact; a causal mask, for example, prevents attending to future tokens.
- Memory (GPU)
High-speed storage on GPU holding model weights, activations, and gradients; limited resource.
- Mesh Shape
Configuration specifying how devices are arranged for distributed training (FSDP × TP dimensions).
- Metric
Quantitative measure of model performance (accuracy, loss, F1, etc.).
- Mini-batch
Small subset of data processed together; typical size 1-256 examples.
- Mixed Precision
Training using both float32 (high precision) and a 16-bit format such as float16 or bfloat16 (lower precision) for speed/memory trade-off.
- Model
Neural network architecture with learnable parameters (weights, biases, embeddings, etc.).
- Model Family
Category of architectures (Gemma3, LLaMA, etc.); defines structure and behavior.
- Momentum
Exponentially weighted accumulation of previous gradients; smooths updates, often speeds convergence, and can help move past shallow local minima and plateaus.
- Multi-run
Running same experiment multiple times with different hyperparameter values (parameter sweep).
- NaN (Not a Number)
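Invalid floating-point value indicating computation failure (e.g., 0/0 or overflow followed by inf − inf); once it reaches the loss or weights it propagates and training can no longer make progress.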
- Normalization
Rescaling values to standard range (usually 0-1 or mean 0, std 1) for stable training.
- Nucleus Sampling (Top-p)
Decoding strategy sampling from the smallest set of highest-probability tokens whose cumulative probability reaches threshold p.
- Optimizer
Algorithm updating model weights based on gradients (Adam, SGD, AdamW, etc.).
- Overrides (configuration)
Command-line changes to config parameters without modifying YAML files.
- Parameter Sharing
Reusing same weights across multiple positions/layers to reduce memory and improve efficiency.
- Perplexity
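Exponential of the average per-token negative log-likelihood of the ground truth sequence (equivalently, inverse probability normalized by sequence length); lower is better for language models.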
- Policy
Mapping from an input (state or prompt) to a distribution over actions (tokens); here, the model trained using reinforcement learning to maximize expected reward.
- PPO (Proximal Policy Optimization)
Reinforcement learning algorithm with clipped objective preventing large policy updates.
- Prompt
Input text given to a language model; conditions the generated output.
- Prompt Engineering
Designing effective prompts to elicit desired model behavior.
- Pruning
Removing small-weight connections from neural network to reduce size/computation.
- Quantization
Reducing precision of weights/activations (float32 → int8) to save memory.
- Rank (LoRA)
See LoRA Rank.
- Recall
Fraction of positive examples correctly identified; useful for imbalanced problems.
- Reference Model
Original frozen model used as baseline; policy model trained relative to reference.
- Regularization
Technique preventing overfitting by penalizing complex models (dropout, L2, etc.).
- Reinforcement Learning (RL)
Learning paradigm where agent optimizes behavior to maximize cumulative reward signal.
- Reward Function
Function evaluating model responses and returning numerical score guiding training.
- Reward Shaping
Adding intermediate signals to guide learning beyond primary reward.
- Sampling (decoding)
Stochastic generation selecting tokens from probability distribution.
- Scheduler (learning rate)
See Learning Rate Scheduler.
- Seed (random)
Initial value for random number generator; same seed = reproducible randomness.
- Softmax
Normalization function converting logits to probability distribution.
- Stable Training
Training where loss smoothly decreases without spikes, divergence, or NaN errors.
- Step (training)
Single gradient update; one mini-batch (or several, when using gradient accumulation) processed and weights updated.
- Temperature (sampling)
Parameter dividing logits before softmax to control randomness of decoding (near 0 = approaches greedy, deterministic; ∞ = uniform random).
- Tensor Parallel
Splitting individual weight tensors across multiple devices so each device computes part of a layer; suits models too large for one device.
- Tokenization
Process of converting text into token indices; inverse is detokenization.
- Token
Discrete unit of text (word, subword, character); basic unit of language models.
- Top-k Sampling
Decoding strategy sampling from the k highest-probability tokens after renormalizing their probabilities.
- Training
Process of updating model parameters to minimize loss on training data.
- Validation
Evaluating model on held-out data to monitor generalization during training.
- Warmup (learning rate)
Initial training phase with gradually increasing learning rate; improves stability.
- Warmup Ratio
Fraction of training devoted to warmup phase (typical 0.05-0.1).
- Weights
Learnable parameters of neural network; updated during training via gradients.
- Weight Decay
Regularization penalizing large weights; shrinks weights toward zero to reduce overfitting (it does not by itself produce sparse solutions, unlike an L1 penalty).
- Weights & Biases (W&B)
Platform for tracking, visualizing, and comparing machine learning experiments.
- Zero-shot
Model performing task without seeing examples; relies on pre-training knowledge.
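Worked example: several entries above (Baseline, Epsilon, GRPO, PPO) fit together in one small calculation. The sketch below follows the commonly published GRPO formulation, in which rewards are centered and scaled within each group of K responses and the update uses a PPO-style clipped term. It is an illustration only and makes no claim about how Agent-Tunix implements this. The function names, the 0.2 default for ε, and the small stabilizing constant are assumptions chosen for the example, and the β-weighted KL penalty is left out for brevity.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale the rewards of K responses to the same prompt.

    The group mean plays the role of the baseline; `eps` (an assumed
    constant) avoids division by zero when all rewards are equal.
    """
    k = len(rewards)
    mean = sum(rewards) / k
    variance = sum((r - mean) ** 2 for r in rewards) / k
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_objective(ratio, advantage, epsilon=0.2):
    """PPO-style clipped surrogate term for a single sample.

    `ratio` is the new-policy probability divided by the old-policy
    probability; `epsilon` (assumed default 0.2) is the clipping range.
    """
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    # Taking the minimum removes any incentive to push the ratio
    # outside [1 - epsilon, 1 + epsilon] in the favorable direction.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Four responses to one prompt: two judged correct, two incorrect.
    rewards = [1.0, 0.0, 0.0, 1.0]
    advantages = group_relative_advantages(rewards)
    print(advantages)
    # A ratio of 1.5 with a positive advantage is clipped at 1 + epsilon.
    print(clipped_objective(1.5, advantages[0]))
```

Responses that score above the group mean receive positive advantages and the rest negative ones, so no separate value model is needed to supply a baseline.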
Next Steps
Training Guide
Frequently Asked Questions
Configuration Guide - Configuration reference