Data API
========

.. py:module:: agent_tunix.data
   :noindex:

This module provides data loading and preprocessing utilities.

Dataset Support
---------------

**GSM8K (Grade School Math)**

Default dataset for math reasoning tasks::

    # Auto-loaded with default configuration

Dataset specifications:

- Task: Grade school math word problems
- Format: Question → Answer with step-by-step reasoning
- Size: ~7,500 training examples
- Data source: Hugging Face datasets

Example::

    Question: Natalia sold clips to 48 of her friends in April, and then
    she sold clips to 42 of her friends in May. If she got $2.50 for each
    clip, how much money did she earn in total?

    Answer: Natalia sold clips to 48 + 42 = 90 friends in total.
    If she got $2.50 for each clip, she earned 90 * $2.50 = $225.

Data Loading
------------

The framework automatically handles:

1. **Downloading**: Fetches the dataset from its source if it is not cached
2. **Splitting**: Creates train/validation/test splits
3. **Tokenization**: Converts text to token IDs
4. **Batching**: Creates mini-batches for training
5. **Padding**: Handles variable-length sequences

Custom Datasets
---------------

To use a custom dataset, create a data loading function in
``src/agent_tunix/data.py``::

    def load_custom_dataset(dataset_path, tokenizer, max_length=512):
        """Load a custom dataset and tokenize it."""
        # 1. Load data from your source
        #    (load_data is a placeholder for your own loader)
        examples = load_data(dataset_path)

        # 2. Format as (prompt, answer) pairs
        formatted = [
            {"prompt": ex["question"], "answer": ex["answer"]}
            for ex in examples
        ]

        # 3. Tokenize; padding is needed so that return_tensors="np"
        #    can build a rectangular array from variable-length texts
        tokenized = tokenizer(
            [f"{ex['prompt']} {ex['answer']}" for ex in formatted],
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="np",
        )

        # 4. Return the formatted examples and their tokenized form
        return formatted, tokenized

Data Format Requirements
------------------------

Minimum required fields::

    {
        "prompt": "What is 2 + 2?",
        "answer": "The answer is 4."
    }

For math problems, we recommend a step-by-step format::

    {
        "prompt": "What is 2 + 2?",
        "answer": "2 + 2 = 4. The answer is 4."
    }

Tokenization
------------

The framework uses the model's tokenizer::

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/gemma-1b")

    # Tokenize input
    tokens = tokenizer("What is 2 + 2?", return_tensors="pt")

    # Token IDs
    print(tokens['input_ids'])

    # Attention masks (1 for real tokens, 0 for padding)
    print(tokens['attention_mask'])

Preprocessing Pipeline
----------------------

Standard preprocessing steps::

    1. Load raw data
    2. Split into train/val/test
    3. Tokenize with model tokenizer
    4. Pad sequences to max_length
    5. Create attention masks
    6. Create PyTorch/JAX datasets
    7. Define data loaders for batching

Data Splits
-----------

Default splits::

    Train: 80% of data
    Validation: 10% of data
    Test: 10% of data

Customize in configuration::

    training:
      train_split: 0.8
      val_split: 0.1
      test_split: 0.1

Batch Processing
----------------

Mini-batch configuration::

    training:
      micro_batch_size: 4  # Batch per device

    grpo:
      num_generations: 4  # Responses per prompt

Processing flow::

    1. Load batch of prompts
    2. For each prompt, generate K responses (num_generations)
    3. Compute rewards for each response
    4. Update model based on rewards

Sequence Lengths
----------------

Configure sequence lengths::

    generation:
      max_prompt_length: 256  # Maximum prompt tokens
      max_generation_steps: 512  # Maximum response tokens

Practical limits::

    - Longer prompts: more context but less generation space
    - Shorter sequences: faster training but less capacity
    - Balance based on your task requirements

Advanced: Custom Data Loading
-----------------------------

For advanced use cases, create custom data loaders::

    import json

    class CustomDataLoader:
        def __init__(self, data_path, tokenizer, batch_size):
            # Expects a JSON file holding a list of
            # {"prompt": ..., "answer": ...} records
            with open(data_path) as f:
                self.data = json.load(f)
            self.tokenizer = tokenizer
            self.batch_size = batch_size

        def __iter__(self):
            # Yield one processed batch per batch_size-sized slice
            for i in range(0, len(self.data), self.batch_size):
                batch = self.data[i:i+self.batch_size]
                yield self._process_batch(batch)

        def _process_batch(self, batch):
            prompts = [ex["prompt"] for ex in batch]
            answers = [ex["answer"] for ex in batch]

            # Tokenize prompts and answers separately,
            # padding each to the longest item in the batch
            inputs = self.tokenizer(
                prompts,
                padding=True,
                truncation=True,
                return_tensors="np"
            )

            outputs = self.tokenizer(
                answers,
                padding=True,
                truncation=True,
                return_tensors="np"
            )

            return {
                "input_ids": inputs["input_ids"],
                "attention_mask": inputs["attention_mask"],
                "target_ids": outputs["input_ids"],
                "target_mask": outputs["attention_mask"]
            }

Data Validation
---------------

Verify data quality::

    # Check dataset statistics
    python -c "
    from agent_tunix.data import load_dataset
    dataset, tokenizer = load_dataset()
    print(f'Dataset size: {len(dataset)}')
    print(f'Example: {dataset[0]}')
    "

Troubleshooting
---------------

**Data loading too slow**

- Use a smaller max_length
- Reduce the batch size temporarily
- Pre-tokenize and cache data

**Out of memory during batching**

- Reduce micro_batch_size
- Reduce max sequence lengths
- Use gradient accumulation instead of larger batches

**Tokenization mismatches**

- Use the same tokenizer as the model
- Check special token handling
- Verify padding configuration

Next Steps
----------

- :doc:`../guide/training` - Training guide
- :doc:`train` - Training API reference
- :doc:`../advanced/custom_rewards` - Custom reward functions for your data
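
As a supplement to the Data Splits section, the default 80/10/10 split can be sketched in plain Python. This is only an illustration under stated assumptions, not the framework's implementation: the helper name ``split_dataset`` and its ``seed`` parameter are hypothetical, and the dataset is assumed to be an in-memory list of ``{"prompt", "answer"}`` records.

```python
import random


def split_dataset(examples, train_split=0.8, val_split=0.1, test_split=0.1, seed=0):
    """Shuffle examples and split them into train/val/test lists.

    Hypothetical helper for illustration; not part of agent_tunix.
    """
    total = train_split + val_split + test_split
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"splits must sum to 1.0, got {total}")

    # Copy before shuffling so the caller's list is left untouched
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_split)
    n_val = int(len(shuffled) * val_split)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    # The test split takes whatever remains, so no example is dropped
    test = shuffled[n_train + n_val:]
    return train, val, test
```

Using a fixed seed keeps the split reproducible across runs, which matters when validation metrics are compared between experiments.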