Chunkers
Chunkers split raw content into smaller units with metadata so you can index and retrieve context. VoltAgent ships multiple chunkers for different formats and a StructuredDocument helper to orchestrate extractors and chunking strategies. An auto heuristic uses format detection (JSON/HTML/LaTeX/code/table/markdown signals) to pick a chunker when you do not know the format ahead of time.
- StructuredDocument: wraps raw text and applies extractors before chunking.
- Token, Sentence, Recursive: general-purpose text chunking.
- Table, Code, Markdown, Semantic Markdown, HTML, JSON, LaTeX: format-aware chunkers.
- Semantic, Late, Neural, Slumber: semantic or post-processing chunkers.
- Parser Registry: register parsers for language-aware code chunking.
Each chunker returns an array of chunks with content, positional metadata, and optional labels. Pass configuration through constructor options or per-call options.
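To make the idea of an auto heuristic more concrete, here is a minimal sketch of signal-based format detection. It is an illustration only and not VoltAgent's implementation: the detectFormat function, the Format type, and every regular expression below are assumptions chosen for this example, and a real detector would likely weigh more signals.

```typescript
// Hypothetical format labels; the real library may use different names.
type Format = "json" | "html" | "latex" | "table" | "markdown" | "code" | "text";

// Illustrative heuristic: check cheap, distinctive signals in a fixed order.
function detectFormat(raw: string): Format {
  const text = raw.trim();
  if (text.length === 0) return "text";

  // JSON: starts like an object/array and actually parses.
  const first = text.charAt(0);
  if (first === "{" || first === "[") {
    try {
      JSON.parse(text);
      return "json";
    } catch (_err) {
      // Not valid JSON; keep checking the other signals.
    }
  }

  // HTML: at least one tag and at least one closing tag.
  if (/<[a-z][a-z0-9]*[^<>]*>/i.test(text) && /<\/[a-z][a-z0-9]*\s*>/i.test(text)) {
    return "html";
  }

  // LaTeX: common structural commands.
  if (/\\(begin|section|documentclass)\{/.test(text)) return "latex";

  // Table: a pipe-delimited row plus a separator row such as | --- | --- |.
  if (/^\|.+\|\s*$/m.test(text) && /^\|[\s:|-]+\|\s*$/m.test(text)) return "table";

  // Markdown: an ATX heading or an inline link.
  if (/^#{1,6}\s+\S/m.test(text) || /\[[^\]]+\]\([^)]+\)/.test(text)) return "markdown";

  // Code: a line that begins with a common keyword.
  if (/^\s*(import|export|function|class|const|let|def)\s/m.test(text)) return "code";

  return "text";
}

console.log(detectFormat("# Title\n\nBody text"));
console.log(detectFormat("<div><p>Hi</p></div>"));
```

The order of the checks matters in a sketch like this: strict signals (parseable JSON, paired tags) run first so that looser ones (a heading marker, a keyword) only decide the ambiguous cases.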
Quick Example
import { RecursiveChunker } from "@voltagent/rag";
const text = "First paragraph.\n\nSecond paragraph that is longer.";
const chunks = new RecursiveChunker().chunk(text, { maxTokens: 120 });
// Output:
// [
// { content: "First paragraph.", metadata: { sourceType: "paragraph", paragraphIndex: 0 } },
// { content: "Second paragraph that is longer.", metadata: { sourceType: "paragraph", paragraphIndex: 1 } },
// ]
Common Configuration
- Tokenizer: Any object with { tokenize: (text: string) => { value: string; start: number; end: number }[], countTokens: (text: string) => number }. Default is tiktoken (cl100k_base). Override on the constructor or per-call:
  import { TokenChunker, createTikTokenizer } from "@voltagent/rag";
  const tokenizer = createTikTokenizer({ model: "gpt-4o-mini" });
  const chunker = new TokenChunker(tokenizer);
  const chunks = chunker.chunk(text, { maxTokens: 256, overlap: 16, tokenizer });
- Per-call overrides: Options passed to chunk(text, options) (e.g., maxTokens, overlap, label, parser) override constructor defaults.
- Positions: Most chunkers include metadata.position with line/column start/end. Code fences additionally expose fencePosition.
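Because the tokenizer contract is structural, any object with the two methods listed in the Tokenizer item should fit. The sketch below shows one possible custom tokenizer that splits on whitespace. The Token and Tokenizer type names and the whitespaceTokenizer object are hypothetical, and treating end as an exclusive offset is an assumption of this example rather than something the text above specifies.

```typescript
// Shapes copied from the Tokenizer item; the type names are made up for this sketch.
interface Token {
  value: string;
  start: number; // offset of the token's first character
  end: number; // offset just past the last character (assumed exclusive)
}

interface Tokenizer {
  tokenize(text: string): Token[];
  countTokens(text: string): number;
}

// Each run of non-whitespace characters becomes one token.
function tokenizeOnWhitespace(text: string): Token[] {
  const result: Token[] = [];
  const pattern = /\S+/g;
  let match: RegExpExecArray | null = pattern.exec(text);
  while (match !== null) {
    result.push({
      value: match[0],
      start: match.index,
      end: match.index + match[0].length,
    });
    match = pattern.exec(text);
  }
  return result;
}

const whitespaceTokenizer: Tokenizer = {
  tokenize: tokenizeOnWhitespace,
  countTokens: (text) => tokenizeOnWhitespace(text).length,
};

const tokens = whitespaceTokenizer.tokenize("First paragraph.\n\nSecond one.");
console.log(tokens.map((t) => t.value));

// Following the per-call form described above, an object like this could be
// supplied as: chunker.chunk(text, { tokenizer: whitespaceTokenizer })
```

A word-level tokenizer like this counts far fewer tokens than tiktoken for the same text, so a maxTokens value tuned for the default would need to be revisited.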
What Is Not Included
- Chunkers do not handle embedding or vector index creation; pipe chunk outputs into your own embedding/vector store flow.
- Inline AST fidelity is limited: markdown/html inline elements and attributes are not preserved beyond text content and basic paths.
- Code chunk start/end offsets refer to the fenced body; use position/fencePosition for absolute line/column context.
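Since embedding and indexing are left to the caller, the following sketch shows one way chunk outputs might be piped into a downstream flow. Everything here is a stand-in: the Chunk shape is inferred from the Quick Example output, toyEmbed is a placeholder letter-count embedder rather than a real model, and the in-memory array plays the role of a vector store.

```typescript
// Chunk shape assumed from the Quick Example output: text plus free-form metadata.
interface Chunk {
  content: string;
  metadata: Record<string, unknown>;
}

interface IndexedChunk extends Chunk {
  vector: number[];
}

// Placeholder embedder (hypothetical): counts a-z letters into a 26-dimension vector.
// In practice this would be a call to a real embedding model.
function toyEmbed(text: string): number[] {
  const vector: number[] = [];
  for (let i = 0; i < 26; i++) vector.push(0);
  const lower = text.toLowerCase();
  for (let i = 0; i < lower.length; i++) {
    const slot = lower.charCodeAt(i) - 97;
    if (slot >= 0 && slot < 26) vector[slot] += 1;
  }
  return vector;
}

// Cosine similarity between two equal-length vectors; 0 if either is all zeros.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Attach a vector to every chunk, keeping content and metadata untouched.
function indexChunks(chunks: Chunk[]): IndexedChunk[] {
  return chunks.map((chunk) => ({
    content: chunk.content,
    metadata: chunk.metadata,
    vector: toyEmbed(chunk.content),
  }));
}

// Return the topK indexed chunks most similar to the query.
function search(index: IndexedChunk[], query: string, topK: number): IndexedChunk[] {
  const queryVector = toyEmbed(query);
  return index
    .map((entry) => ({ entry, score: cosine(entry.vector, queryVector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((scored) => scored.entry);
}

const exampleChunks: Chunk[] = [
  { content: "First paragraph.", metadata: { sourceType: "paragraph", paragraphIndex: 0 } },
  { content: "Second paragraph that is longer.", metadata: { sourceType: "paragraph", paragraphIndex: 1 } },
];
const best = search(indexChunks(exampleChunks), "longer", 1);
console.log(best[0].content);
```

Carrying metadata through the index unchanged means a retrieved chunk can still be traced back to its source paragraph or position after the search step.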