Document chunking implementations and benchmarking tools for RAG pipelines including fixed-size, semantic, recursive, and sentence-based strategies. Use when implementing document processing, optimizing chunk sizes, comparing chunking approaches, benchmarking retrieval performance, or when user mentions chunking, text splitting, document segmentation, RAG optimization, or chunk evaluation.
This skill is limited to using specific tools.

Additional assets for this skill:

- README.md
- SKILL_SUMMARY.md
- examples/chunk-code.py
- examples/chunk-markdown.py
- examples/chunk-pdf.py
- scripts/benchmark-chunking.py
- scripts/chunk-fixed-size.py
- scripts/chunk-recursive.py
- scripts/chunk-semantic.py
- templates/chunking-config.yaml
- templates/custom-splitter.py

Purpose: Provide production-ready document chunking implementations, benchmarking tools, and strategy selection guidance for RAG pipelines.
Activation Triggers: implementing document processing, optimizing chunk sizes, comparing chunking approaches, benchmarking retrieval performance, or any mention of chunking, text splitting, document segmentation, RAG optimization, or chunk evaluation.
Key Resources:
- scripts/chunk-fixed-size.py - Fixed-size chunking implementation
- scripts/chunk-semantic.py - Semantic chunking with paragraph preservation
- scripts/chunk-recursive.py - Recursive chunking for hierarchical documents
- scripts/benchmark-chunking.py - Benchmark and compare chunking strategies
- templates/chunking-config.yaml - Chunking configuration template
- templates/custom-splitter.py - Template for custom chunking logic
- examples/chunk-markdown.py - Markdown-specific chunking
- examples/chunk-code.py - Source code chunking
- examples/chunk-pdf.py - PDF document chunking

Chunking Strategies (each detailed below):

- Fixed-Size Chunking: splits text into equal character windows with a configurable overlap; fast and predictable.
- Semantic Chunking: splits on paragraph boundaries and merges paragraphs up to a maximum size, preserving context.
- Recursive Chunking: splits with a hierarchy of separators (headers, paragraphs, lines, words); suited to structured documents and code.
- Sentence-Based Chunking: splits on sentence boundaries; suited to Q&A and FAQ content where small, focused chunks work best.
Script: scripts/chunk-fixed-size.py
Usage:
python scripts/chunk-fixed-size.py \
--input document.txt \
--chunk-size 1000 \
--overlap 200 \
--output chunks.json
Parameters:
- chunk-size: Number of characters per chunk (default: 1000)
- overlap: Character overlap between chunks (default: 200)
- split-on: Split on sentences, words, or characters (default: sentences)

Best Practices: keep overlap at roughly 10-25% of the chunk size and choose chunk sizes by content type (see the chunk size selection table below).
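For orientation, here is a minimal sketch of the fixed-size approach (character windows with overlap); the function name and details are illustrative, not the script's actual implementation:

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows of chunk_size, repeating `overlap` characters between neighbors."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```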
Script: scripts/chunk-semantic.py
Usage:
python scripts/chunk-semantic.py \
--input document.txt \
--max-chunk-size 1500 \
--output chunks.json
How it works: the document is split on paragraph boundaries, and consecutive paragraphs are merged until the maximum chunk size is reached, so related sentences stay in the same chunk.
Best for: Articles, blog posts, documentation, books
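A rough sketch of this paragraph-merging approach (simplified; the actual script may also honor minimum sizes and section headers, per the configuration template):

```python
def semantic_chunks(text: str, max_chunk_size: int = 1500) -> list[str]:
    """Merge consecutive paragraphs until adding the next one would exceed max_chunk_size."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para  # an oversized paragraph becomes its own chunk
    if current:
        chunks.append(current)
    return chunks
```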
Script: scripts/chunk-recursive.py
Usage:
python scripts/chunk-recursive.py \
--input document.md \
--chunk-size 1000 \
--separators '["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]' \
--output chunks.json
How it works: the text is split on the first separator in the hierarchy; any piece that is still larger than the chunk size is re-split with the next separator, falling back through paragraphs, lines, and finally single spaces.
Separator hierarchy examples:
["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]["\\nclass ", "\\ndef ", "\\n\\n", "\\n", " "]["\\n\\n", "\\n", ". ", " "]Best for: Structured documents, source code, technical manuals
Script: scripts/benchmark-chunking.py
Usage:
python scripts/benchmark-chunking.py \
--input document.txt \
--strategies fixed,semantic,recursive \
--chunk-sizes 500,1000,1500 \
--output benchmark-results.json
Metrics Evaluated:
- Chunking time in milliseconds (time_ms)
- Chunk count (chunk_count)
- Average chunk size (avg_size)
- Chunk size variance (size_variance)
- Context preservation score (context_score)
Output:
```json
{
  "fixed-1000": {
    "time_ms": 45,
    "chunk_count": 127,
    "avg_size": 982,
    "size_variance": 12.3,
    "context_score": 0.72
  },
  "semantic-1000": {
    "time_ms": 156,
    "chunk_count": 114,
    "avg_size": 1087,
    "size_variance": 234.5,
    "context_score": 0.91
  }
}
```
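To act on the results, a short snippet like the following can pick the strongest configuration (this assumes the benchmark-results.json format shown above):

```python
import json

with open("benchmark-results.json") as f:
    results = json.load(f)

# Rank strategy/size combinations by context score (higher is better).
best_name, best_metrics = max(results.items(), key=lambda kv: kv[1]["context_score"])
print(f"Best configuration: {best_name} (context_score={best_metrics['context_score']})")
```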
Template: templates/chunking-config.yaml
Complete configuration:
```yaml
chunking:
  # Global defaults
  default_strategy: semantic
  default_chunk_size: 1000
  default_overlap: 200

  # Strategy-specific configs
  strategies:
    fixed_size:
      chunk_size: 1000
      overlap: 200
      split_on: sentence  # sentence, word, character

    semantic:
      max_chunk_size: 1500
      min_chunk_size: 200
      preserve_paragraphs: true
      add_headers: true  # Include section headers

    recursive:
      chunk_size: 1000
      overlap: 100
      separators:
        markdown: ["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]
        code: ["\\nclass ", "\\ndef ", "\\n\\n", "\\n", " "]
        text: ["\\n\\n", ". ", " "]

  # Document type mappings
  document_types:
    ".md": semantic
    ".py": recursive
    ".txt": fixed_size
    ".pdf": semantic
```
Template: templates/custom-splitter.py
Create your own chunking logic:
```python
from typing import Dict, List, Optional
import re


class CustomChunker:
    def __init__(self, chunk_size: int = 1000, overlap: int = 200):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, text: str, metadata: Optional[Dict] = None) -> List[Dict]:
        """
        Implement custom chunking logic here.

        Returns:
            List of chunks with metadata:
            [
                {
                    "text": "chunk content",
                    "metadata": {
                        "chunk_id": 0,
                        "source": "document.txt",
                        "start_char": 0,
                        "end_char": 1000
                    }
                }
            ]
        """
        metadata = metadata or {}  # guard against None so .get() below is safe
        chunks = []

        # Your custom chunking logic here
        # Example: split on a custom pattern
        sections = self._split_sections(text)

        for i, section in enumerate(sections):
            chunks.append({
                "text": section,
                "metadata": {
                    "chunk_id": i,
                    "source": metadata.get("source", "unknown"),
                    "chunk_size": len(section)
                }
            })

        return chunks

    def _split_sections(self, text: str) -> List[str]:
        # Implement your splitting logic (e.g. with a re pattern)
        raise NotImplementedError
```
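A hypothetical usage of the template once _split_sections is implemented (file name and parameters are placeholders):

```python
with open("document.txt") as f:
    document_text = f.read()

chunker = CustomChunker(chunk_size=800, overlap=100)
chunks = chunker.chunk(document_text, metadata={"source": "document.txt"})
print(f"Produced {len(chunks)} chunks")
```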
Example: examples/chunk-markdown.py
Features:
Usage:
python examples/chunk-markdown.py README.md --output readme-chunks.json
Example: examples/chunk-code.py
Features:
Supported languages: Python, JavaScript, TypeScript, Java, Go
Usage:
python examples/chunk-code.py src/main.py --language python --output code-chunks.json
Example: examples/chunk-pdf.py
Features:
Dependencies: pypdf, pdfminer.six
Usage:
python examples/chunk-pdf.py research-paper.pdf --strategy semantic --output pdf-chunks.json
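For a rough picture of the extract-then-chunk flow, text can be pulled from a PDF with pypdf (one of the listed dependencies) and handed to any strategy above; the actual example script may differ:

```python
from pypdf import PdfReader

def pdf_to_text(path: str) -> str:
    """Concatenate the extracted text of every page, separated by blank lines."""
    reader = PdfReader(path)
    return "\n\n".join((page.extract_text() or "") for page in reader.pages)

text = pdf_to_text("research-paper.pdf")
# Feed `text` into the semantic or recursive chunker, as in the earlier examples.
```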
General recommendations:
| Content Type | Chunk Size | Overlap | Strategy |
|---|---|---|---|
| Q&A / FAQs | 200-400 | 50 | Sentence |
| Articles | 500-1000 | 100-200 | Semantic |
| Documentation | 1000-1500 | 200-300 | Recursive |
| Books | 1000-2000 | 300-400 | Semantic |
| Source code | 500-1000 | 100 | Recursive |
Test with your data: Use benchmark-chunking.py to find optimal settings
Why overlap matters: without overlap, information that spans a chunk boundary is split across two chunks and neither may retrieve well on its own; overlap repeats the boundary text in both chunks. For example, with a chunk size of 1000 and overlap of 200, each chunk repeats the last 200 characters of the previous one.
Overlap guidelines: roughly 10-25% of the chunk size works well for most content; see the chunk size selection table above for per-content-type values.
Fast chunking (large documents):
# Use fixed-size for speed
python scripts/chunk-fixed-size.py --input large-doc.txt --chunk-size 1000
Quality chunking (smaller documents):
# Use semantic for better context
python scripts/chunk-semantic.py --input article.txt --max-chunk-size 1500
Batch processing:
# Process multiple files
for file in documents/*.txt; do
python scripts/chunk-semantic.py --input "$file" --output "chunks/$(basename $file .txt).json"
done
Run the benchmark on a representative document:
python scripts/benchmark-chunking.py \
--input sample-document.txt \
--strategies fixed,semantic,recursive \
--chunk-sizes 500,1000,1500
Review metrics: compare chunking time, chunk count, average size, size variance, and context score across strategies in benchmark-results.json.
Compare retrieval quality:
Use configuration file:
import yaml
from chunking_strategies import get_chunker
config = yaml.safe_load(open('chunking-config.yaml'))
chunker = get_chunker(config['chunking']['default_strategy'], config)
chunks = chunker.chunk(document_text)
Issue: Chunks too small or too large
Adjust the chunk_size parameter (or max_chunk_size for semantic chunking).
Issue: Lost context at boundaries
Increase the overlap setting, or switch to semantic chunking so paragraphs stay intact.
Issue: Slow processing
Use fixed-size chunking; it is the fastest strategy for large documents.
Issue: Poor retrieval quality
Benchmark with scripts/benchmark-chunking.py and try semantic or recursive chunking at different chunk sizes.
Core libraries:
pip install tiktoken # Token counting
pip install nltk # Sentence splitting
pip install spacy # Advanced NLP (optional)
For PDF support:
pip install pypdf pdfminer.six
For benchmarking:
pip install pandas numpy scikit-learn
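For instance, token counting and sentence splitting with these libraries look roughly like this (the encoding name is an assumption; pick one that matches your embedding model):

```python
import nltk
import tiktoken

nltk.download("punkt", quiet=True)  # one-time download of the sentence tokenizer

enc = tiktoken.get_encoding("cl100k_base")  # encoding name is an assumption
text = "Chunking splits documents into retrievable pieces. Overlap preserves boundary context."

print(len(enc.encode(text)))     # token count, useful for sizing chunks by tokens
print(nltk.sent_tokenize(text))  # sentence boundaries for sentence-based chunking
```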
Supported Strategies: Fixed-Size, Semantic, Recursive, Sentence-Based, Custom
Output Format: JSON with text and metadata
Version: 1.0.0