# Split documents into optimal chunks — strategies, trade-offs, and best practices
**Rule of thumb:** start at 500 tokens with a 50-token overlap.
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Split every N chars | Simple, predictable |
| Sentence-based | Split at boundaries, merge | Articles, docs |
| Semantic | Detect topic shifts | Complex documents |
| Recursive | Multiple delimiter levels | Mixed content |
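The simplest row in the table is fixed-size splitting. A minimal sketch, using the 500/50 rule of thumb above (character-based here rather than token-based, purely to keep the example dependency-free):

```python
def fixed_chunk(text, size=500, overlap=50):
    """Slide a window of `size` characters over `text`, stepping
    back by `overlap` so adjacent chunks share boundary context."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

Overlap trades a little storage for robustness: a sentence cut at a chunk boundary still appears intact in the neighboring chunk.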
Always preserve metadata with each chunk:

```python
chunk = {
    'text': '...refund policy states...',
    'metadata': {
        'source': 'terms_of_service.pdf',
        'page': 12,
        'section': 'Refunds',
        'chunk_index': 4
    }
}
```

Implement sentence chunking:
```python
import re

def sentence_chunk(text, max_chars=500):
    # Split at sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current = [], []
    current_len = 0
    for s in sentences:
        # Flush the current chunk before it would exceed max_chars
        # (a single over-long sentence still becomes its own chunk).
        if current_len + len(s) > max_chars and current:
            chunks.append(' '.join(current))
            current, current_len = [], 0
        current.append(s)
        current_len += len(s) + 1  # +1 for the joining space
    if current:
        chunks.append(' '.join(current))
    return chunks
```
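The recursive strategy from the table can be sketched in the same spirit: try the coarsest separator first, and recurse with finer ones only for pieces that are still too long. This is a simplified illustration, not a drop-in implementation — it drops the separators themselves and does not merge small neighboring pieces back together:

```python
def recursive_chunk(text, max_chars=500, seps=("\n\n", "\n", ". ", " ")):
    """Split `text` by progressively finer separators until every
    piece fits in max_chars (or no separators remain)."""
    if len(text) <= max_chars or not seps:
        # Base case: small enough, or nothing left to split on —
        # an unsplittable piece may still exceed max_chars.
        return [text]
    chunks = []
    for part in text.split(seps[0]):
        if len(part) <= max_chars:
            if part:
                chunks.append(part)
        else:
            chunks.extend(recursive_chunk(part, max_chars, seps[1:]))
    return chunks
```

Paragraph breaks are tried before line breaks, then sentences, then words, so well-structured documents split along their natural boundaries first.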