Validate datasets for Unsloth fine-tuning. Use when the user wants to check a dataset, analyze tokens, calculate Chinchilla optimality, or prepare data for training.
This skill inherits all available tools. When active, it can use any tool Claude has access to.

Additional assets for this skill: `DATA_FORMATS.md` and `scripts/validate_dataset.py`.

Validate datasets before fine-tuning with Unsloth.
For automated validation, use the script:

```bash
python scripts/validate_dataset.py --dataset "dataset-id" --model llama-3.1-8b --lora-rank 16
```
Ask for an HF dataset ID (e.g., `mlabonne/FineTome-100k`) or a local path (e.g., `./data.jsonl`).
Auto-detect the format from the dataset's structure; see DATA_FORMATS.md for details. A detection sketch follows the table.
| Format | Detection | Key Fields |
|---|---|---|
| Raw | `text` field only | `text` |
| Alpaca | `instruction` + `output` | `instruction`, `output` |
| ShareGPT | `conversations` array | `from`, `value` |
| ChatML | `messages` array | `role`, `content` |
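A minimal detection sketch, assuming the first record is representative of the dataset; the function name `detect_format` and the error message are illustrative, not part of the skill's script:

```python
def detect_format(example: dict) -> str:
    """Map a single record to one of the formats in the table above."""
    if "conversations" in example:
        return "sharegpt"  # turns use {"from": ..., "value": ...}
    if "messages" in example:
        return "chatml"    # turns use {"role": ..., "content": ...}
    if "instruction" in example and "output" in example:
        return "alpaca"
    if "text" in example:
        return "raw"
    raise ValueError(f"Unknown format; fields present: {sorted(example)}")
```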
Check required fields exist. Report issues with fix suggestions.
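A companion sketch for the field check; `REQUIRED_FIELDS` mirrors the table above, and the fix messages are illustrative:

```python
REQUIRED_FIELDS = {
    "raw": ["text"],
    "alpaca": ["instruction", "output"],
    "sharegpt": ["conversations"],
    "chatml": ["messages"],
}

def check_required_fields(example: dict, fmt: str) -> list[str]:
    """Return one fix suggestion per missing field."""
    return [
        f"missing '{field}': add it or rename an existing column"
        for field in REQUIRED_FIELDS[fmt]
        if field not in example
    ]
```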
Display 2-3 examples for visual verification.
Report statistics: total tokens, min/max/mean/median sequence length.
Flag concerns (e.g., sequences longer than the target model's context window, or empty and near-empty examples).
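A minimal sketch of the statistics-and-flagging pass, assuming a Hugging Face tokenizer; the model name, stand-in records, and `max_seq_length` value are illustrative:

```python
from statistics import mean, median
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
records = [{"text": "Short example."}, {"text": "A somewhat longer second example."}]

lengths = [len(tokenizer.encode(r["text"])) for r in records]
print(f"total tokens: {sum(lengths)}")
print(f"min/max:     {min(lengths)}/{max(lengths)}")
print(f"mean/median: {mean(lengths):.1f}/{median(lengths)}")

# Flag examples that would be truncated at the training sequence length.
max_seq_length = 2048
too_long = sum(1 for n in lengths if n > max_seq_length)
if too_long:
    print(f"{too_long} example(s) exceed max_seq_length={max_seq_length}")
```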
Ask for the target model and LoRA rank, then calculate the Chinchilla fraction (a sketch follows the table):
| Chinchilla Fraction | Interpretation |
|---|---|
| < 0.5x | Dataset may be too small |
| 0.5x - 2.0x | Good range |
| > 2.0x | Large dataset, may take longer |
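A rough sketch of the calculation, assuming the common heuristic of ~20 optimal training tokens per trainable parameter and a crude per-layer LoRA parameter estimate; counting LoRA parameters rather than full-model parameters is an assumption here, and the skill's exact formula may differ:

```python
def lora_trainable_params(hidden: int, layers: int, rank: int,
                          targets_per_layer: int = 7) -> int:
    # Each adapted weight (approximated as hidden x hidden) gains two
    # low-rank factors: A (rank x hidden) and B (hidden x rank).
    return layers * targets_per_layer * 2 * rank * hidden

def chinchilla_fraction(total_tokens: int, trainable_params: int,
                        tokens_per_param: int = 20) -> float:
    return total_tokens / (tokens_per_param * trainable_params)

# Illustrative shape roughly matching an 8B Llama with rank 16.
params = lora_trainable_params(hidden=4096, layers=32, rank=16)
print(f"trainable params: {params:,}")
print(f"fraction: {chinchilla_fraction(15_000_000, params):.2f}x")
```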
Based on the analysis, suggest next steps:

- `standardize_sharegpt()` for ShareGPT data (usage sketch below)
- Offer to upload local datasets to the Hub
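A short usage sketch for Unsloth's ShareGPT converter, using the example dataset from earlier in this document:

```python
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt

dataset = load_dataset("mlabonne/FineTome-100k", split="train")
# Converts {"from": ..., "value": ...} turns to {"role": ..., "content": ...}.
dataset = standardize_sharegpt(dataset)
```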
Pass context to unsloth-train:

```yaml
dataset_id: "mlabonne/FineTome-100k"
format_type: "sharegpt"
total_tokens: 15000000
target_model: "llama-3.1-8b"
use_lora: true
lora_rank: 16
chinchilla_fraction: 1.2
```