# Split documents into optimal chunks — strategies, trade-offs, and best practices
**Rule of thumb:** start at 500 tokens with a 50-token overlap.
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Split every N chars | Simple, predictable |
| Sentence-based | Split at boundaries, merge | Articles, docs |
| Semantic | Detect topic shifts | Complex documents |
| Recursive | Multiple delimiter levels | Mixed content |
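The simplest row in the table is fixed-size splitting. A minimal sketch, using the 500/50 rule of thumb above (character-based here rather than token-based, purely to keep the example dependency-free):

```python
def fixed_chunk(text, size=500, overlap=50):
    """Slide a window of `size` characters over `text`, stepping
    back by `overlap` so adjacent chunks share boundary context."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

Overlap trades a little storage for robustness: a sentence cut at a chunk boundary still appears intact in the neighboring chunk.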
Always preserve metadata with each chunk:

```python
chunk = {
    'text': '...refund policy states...',
    'metadata': {
        'source': 'terms_of_service.pdf',
        'page': 12,
        'section': 'Refunds',
        'chunk_index': 4
    }
}
```

Implement sentence chunking:
```python
import re

def sentence_chunk(text, max_chars=500):
    # Split at sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current = [], []
    current_len = 0
    for s in sentences:
        # Flush the current chunk before it would exceed max_chars
        # (a single over-long sentence still becomes its own chunk).
        if current_len + len(s) > max_chars and current:
            chunks.append(' '.join(current))
            current, current_len = [], 0
        current.append(s)
        current_len += len(s) + 1  # +1 for the joining space
    if current:
        chunks.append(' '.join(current))
    return chunks
```
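The recursive strategy from the table can be sketched in the same spirit: try the coarsest separator first, and recurse with finer ones only for pieces that are still too long. This is a simplified illustration, not a drop-in implementation — it drops the separators themselves and does not merge small neighboring pieces back together:

```python
def recursive_chunk(text, max_chars=500, seps=("\n\n", "\n", ". ", " ")):
    """Split `text` by progressively finer separators until every
    piece fits in max_chars (or no separators remain)."""
    if len(text) <= max_chars or not seps:
        # Base case: small enough, or nothing left to split on —
        # an unsplittable piece may still exceed max_chars.
        return [text]
    chunks = []
    for part in text.split(seps[0]):
        if len(part) <= max_chars:
            if part:
                chunks.append(part)
        else:
            chunks.extend(recursive_chunk(part, max_chars, seps[1:]))
    return chunks
```

Paragraph breaks are tried before line breaks, then sentences, then words, so well-structured documents split along their natural boundaries first.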