Recursive Chunker
Splits text by paragraphs, falls back to sentence grouping, and finally token chunking when needed.
Quick Start
import { RecursiveChunker } from "@voltagent/rag";
const text = "Para one.\n\nPara two is longer and may be split.";
const chunks = new RecursiveChunker().chunk(text, { maxTokens: 15, overlapTokens: 5 });
// Example output (content shortened):
// [
// { content: "Para one.", metadata: { sourceType: "paragraph", paragraphIndex: 0 } },
// { content: "Para two is longer", metadata: { sourceType: "sentence", paragraphIndex: 1 } },
// { content: "longer ...", metadata: { sourceType: "sentence-token", paragraphIndex: 1 } },
// ]
Options
maxTokens(number): limit per chunk, default300overlapTokens(number): token overlap between chunks, default0tokenizer
tokenizer must implement { tokenize(text): Token[]; countTokens(text): number }. Default is tiktoken (cl100k_base).
Provide tokenizer per-call to override the constructor setting.
Output
metadata.sourceType:paragraph,sentence, orsentence-tokenmetadata.paragraphIndex- Overlap chunks are labeled with
*-overlap.
import { RecursiveChunker } from "@voltagent/rag";
const chunker = new RecursiveChunker();
const chunks = chunker.chunk("Paragraph one.\n\nParagraph two...", {
maxTokens: 150,
overlapTokens: 10,
});
When to Use
- Mixed-length content with natural paragraph boundaries.
- Need graceful fallback to smaller units when paragraphs exceed token limits.