# Chunking Strategies
How documents are split into chunks significantly impacts search quality.
## Default Strategy

If no chunking configuration is provided, the following defaults apply:
```python
{
    "strategy": "recursive",
    "chunk_size": 512,     # tokens
    "chunk_overlap": 64,   # tokens
    "separators": ["\n\n", "\n", ". ", " "]
}
```
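If you prefer to set these values explicitly rather than rely on the defaults, the same configuration can be passed through the `chunking` parameter of `client.documents.upload` (a sketch; the values simply restate the defaults above):

```python
# Sketch: the defaults above, passed explicitly.
client.documents.upload(
    file,
    chunking={
        "strategy": "recursive",
        "chunk_size": 512,
        "chunk_overlap": 64,
        "separators": ["\n\n", "\n", ". ", " "],
    },
)
```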
## Strategies

### Recursive (Default)
Splits on natural boundaries (paragraphs, sentences, words).
```python
client.documents.upload(
    file,
    chunking={"strategy": "recursive"}
)
```
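To make the recursive behaviour concrete, here is a minimal, illustrative sketch of the idea, not the library's actual implementation: try each separator in order and only fall back to the next, finer one when a piece is still too long. Lengths are counted in characters here for simplicity (the real chunker counts tokens), and a production splitter would also re-attach separators and merge undersized pieces.

```python
def recursive_split(text, separators, max_len=512):
    """Illustrative only: split on the first separator, recursing into
    pieces that are still too long with the remaining separators."""
    if len(text) <= max_len or not separators:
        return [text]
    first, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(first):
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, rest, max_len))
    return chunks

sample = "First paragraph.\n\nSecond paragraph. It has two sentences."
print(recursive_split(sample, ["\n\n", "\n", ". ", " "], max_len=30))
```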
### Fixed Size

Equal-sized chunks, regardless of content structure.
```python
chunking={"strategy": "fixed", "chunk_size": 256}
```
### Semantic

Groups related content together using embeddings.
```python
chunking={"strategy": "semantic", "similarity_threshold": 0.7}
```
### By Header

Splits on document headers (H1, H2, etc.).
```python
chunking={"strategy": "header", "max_levels": 2}
```
## Chunk Size Guidelines

| Document Type | Recommended Size |
|---|---|
| Technical docs | 256-512 tokens |
| Legal documents | 512-1024 tokens |
| Conversations | 128-256 tokens |
| Code | 256-512 tokens |
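For example, to follow the table for a long legal document you would raise the chunk size when uploading; a sketch using the same upload call as above (the overlap value here is an assumption that keeps roughly the default 12.5% ratio):

```python
# Sketch: larger chunks for legal documents, per the guidelines above.
client.documents.upload(
    file,
    chunking={"strategy": "recursive", "chunk_size": 1024, "chunk_overlap": 128},
)
```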
## Overlap
Overlap ensures context isn't lost at chunk boundaries:
```python
chunking={
    "chunk_size": 512,
    "chunk_overlap": 64  # 12.5% overlap
}
```
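Mechanically, overlap means consecutive chunks start `chunk_size - chunk_overlap` tokens apart, so the last 64 tokens of one chunk reappear at the start of the next. An illustrative sliding-window sketch:

```python
def overlapping_windows(tokens, chunk_size=512, chunk_overlap=64):
    # Illustrative only: step by chunk_size - chunk_overlap so adjacent
    # windows share chunk_overlap tokens.
    step = chunk_size - chunk_overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

windows = overlapping_windows(list(range(1200)))
print(len(windows), windows[1][:3])  # 3 windows; the second starts at token 448
```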
## Custom Metadata

Add context to chunks:
```python
chunking={
    "include_metadata": True,
    "metadata_fields": ["title", "section", "page_number"]
}
```
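Assuming the metadata options can be combined with the strategy settings in the same `chunking` dictionary (a sketch, using only the parameters shown above):

```python
client.documents.upload(
    file,
    chunking={
        "strategy": "recursive",
        "chunk_size": 512,
        "chunk_overlap": 64,
        "include_metadata": True,
        "metadata_fields": ["title", "section", "page_number"],
    },
)
```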