Chunkers

Chunkers split raw content into smaller units with metadata so you can index and retrieve context. VoltAgent ships multiple chunkers for different formats and a StructuredDocument helper to orchestrate extractors and chunking strategies. An auto heuristic uses format detection (JSON/HTML/LaTeX/code/table/markdown signals) to pick a chunker when you do not know the format ahead of time.

StructuredDocument: wraps raw text and applies extractors before chunking.
Token, Sentence, Recursive: general-purpose text chunking.
Table, Code, Markdown, Semantic Markdown, HTML, JSON, LaTeX: format-aware chunkers.
Semantic, Late, Neural, Slumber: semantic or post-processing chunkers.
Parser Registry: register parsers for language-aware code chunking.
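
The auto heuristic mentioned in the introduction can be pictured as a series of cheap signal checks. The sketch below is not VoltAgent's implementation: the detectFormat function, the Format labels, the check order, and every regular expression are assumptions chosen for illustration.

```typescript
// Hypothetical format labels; the real heuristic may use different names.
type Format = "json" | "html" | "latex" | "code" | "table" | "markdown" | "text";

// Illustrative signal checks, ordered roughly from most to least specific.
function detectFormat(input: string): Format {
  const text = input.trim();
  if (text.length === 0) return "text";

  // JSON: starts with { or [ and parses cleanly.
  if (/^[{[]/.test(text)) {
    try {
      JSON.parse(text);
      return "json";
    } catch {
      // Not valid JSON; fall through to the other signals.
    }
  }

  // HTML: a doctype, or an opening tag with a matching closing tag.
  if (/^<!doctype html/i.test(text) || /<([a-z][a-z0-9]*)\b[^>]*>[\s\S]*<\/\1>/i.test(text)) {
    return "html";
  }

  // LaTeX: a document class or a \begin{...} environment.
  if (/\\documentclass|\\begin\{[a-z*]+\}/i.test(text)) return "latex";

  const lines = text.split("\n");

  // Table: at least two lines that look like pipe-delimited rows.
  const pipeRows = lines.filter((line) => /^\s*\|.*\|\s*$/.test(line)).length;
  if (pipeRows >= 2) return "table";

  // Markdown: headings, fenced code markers, or list markers at line start.
  if (lines.some((line) => /^(#{1,6}\s|`{3}|[-*+]\s|\d+\.\s)/.test(line))) return "markdown";

  // Code: at least half the lines start with a common keyword or end in ; { }.
  const codeLines = lines.filter(
    (line) => /^\s*(import|export|const|let|function|class|def|return)\b/.test(line) || /[;{}]\s*$/.test(line)
  ).length;
  if (codeLines / lines.length >= 0.5) return "code";

  return "text";
}
```

The order matters in a scheme like this: strict checks (parseable JSON) run before loose ones (keyword counting), so an ambiguous input falls through to plain text rather than being misclassified early.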

Each chunker returns an array of chunks with content, positional metadata, and optional labels. Pass configuration through constructor options or per-call options.
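
To make that shape concrete, the sketch below types a chunk as described and filters results by label. The Position and Chunk interfaces, the placement of label at the top level, and the filterByLabel helper are all assumptions for illustration; check the package's exported types for the real definitions.

```typescript
// Hypothetical shapes inferred from the description, not the package's exported types.
interface Position {
  start: { line: number; column: number };
  end: { line: number; column: number };
}

interface Chunk {
  content: string;
  label?: string; // assumed location of the optional label
  metadata: { position?: Position; [key: string]: unknown };
}

// Keep only the chunks that carry a given label.
function filterByLabel(chunks: Chunk[], label: string): Chunk[] {
  return chunks.filter((chunk) => chunk.label === label);
}

const sampleChunks: Chunk[] = [
  { content: "First paragraph.", label: "intro", metadata: { sourceType: "paragraph", paragraphIndex: 0 } },
  { content: "Second paragraph.", metadata: { sourceType: "paragraph", paragraphIndex: 1 } },
];

const introChunks = filterByLabel(sampleChunks, "intro");
```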

Quick Example

import { RecursiveChunker } from "@voltagent/rag";

const text = "First paragraph.\n\nSecond paragraph that is longer.";
const chunks = new RecursiveChunker().chunk(text, { maxTokens: 120 });

// Output:
// [
//   { content: "First paragraph.", metadata: { sourceType: "paragraph", paragraphIndex: 0 } },
//   { content: "Second paragraph that is longer.", metadata: { sourceType: "paragraph", paragraphIndex: 1 } },
// ]

Common Configuration

Tokenizer: Any object with { tokenize: (text: string) => { value: string; start: number; end: number }[], countTokens: (text: string) => number }. Default is tiktoken (cl100k_base). Override on the constructor or per-call:

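import { TokenChunker, createTikTokenizer } from "@voltagent/rag";

const text = "First paragraph.\n\nSecond paragraph that is longer.";

// Constructor default: used whenever a call does not supply its own tokenizer.
const tokenizer = createTikTokenizer({ model: "gpt-4o-mini" });
const chunker = new TokenChunker(tokenizer);

// Per-call override: options passed here take precedence over constructor defaults.
const chunks = chunker.chunk(text, { maxTokens: 256, overlap: 16, tokenizer });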
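
Because the tokenizer contract is described structurally, any object with those two methods should satisfy it. Below is a minimal whitespace tokenizer written against the shape stated above. The Token and Tokenizer type names and the whitespaceTokenizer constant are illustrative rather than exports of @voltagent/rag, and treating end as an exclusive offset is an assumption.

```typescript
// Local types mirroring the documented tokenizer shape (names are illustrative).
interface Token {
  value: string;
  start: number;
  end: number; // assumed to be exclusive
}

interface Tokenizer {
  tokenize: (text: string) => Token[];
  countTokens: (text: string) => number;
}

// Each run of non-whitespace characters becomes one token.
function tokenizeByWhitespace(text: string): Token[] {
  const tokens: Token[] = [];
  const pattern = /\S+/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    tokens.push({ value: match[0], start: match.index, end: match.index + match[0].length });
  }
  return tokens;
}

const whitespaceTokenizer: Tokenizer = {
  tokenize: tokenizeByWhitespace,
  countTokens: (text) => tokenizeByWhitespace(text).length,
};
```

An object like this could then be supplied wherever a chunker accepts a tokenizer, for example as the tokenizer option in a chunk(text, options) call. Word counts differ from model token counts, so a maxTokens budget measured this way is only approximate.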

Per-call overrides: Options passed to chunk(text, options) (e.g., maxTokens, overlap, label, parser) override constructor defaults.
Positions: Most chunkers include metadata.position with line/column start/end. Code fences additionally expose fencePosition.
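
Line and column positions of this kind can be derived from character offsets. The helper below sketches that conversion and is not the library's code: offsetToLineColumn is a hypothetical name, and 1-based lines and columns are an assumption, since the numbering base is not stated here.

```typescript
interface LineColumn {
  line: number;
  column: number;
}

// Convert a 0-based character offset into a 1-based line and column.
function offsetToLineColumn(text: string, offset: number): LineColumn {
  // Clamp so out-of-range offsets map to the start or end of the text.
  const clamped = Math.max(0, Math.min(offset, text.length));
  let line = 1;
  let lineStart = 0;
  for (let i = 0; i < clamped; i++) {
    if (text[i] === "\n") {
      line += 1;
      lineStart = i + 1;
    }
  }
  return { line, column: clamped - lineStart + 1 };
}

const sampleText = "First paragraph.\n\nSecond paragraph that is longer.";
const secondStart = offsetToLineColumn(sampleText, sampleText.indexOf("Second"));
// secondStart is { line: 3, column: 1 }
```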

What Is Not Included

Chunkers do not handle embedding or vector index creation; pipe chunk outputs into your own embedding/vector store flow.
Inline AST fidelity is limited: Markdown and HTML inline elements and attributes are not preserved beyond their text content and basic paths.
Code chunk start/end offsets refer to the fenced body; use position/fencePosition for absolute line/column context.
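
Since embedding is left to the caller, a common next step is to batch chunk contents and hand them to an embedding function of your choice. In the sketch below, the Chunk shape, EmbedFn, toBatches, embedChunks, and fakeEmbed are all hypothetical; fakeEmbed is a stand-in that returns a trivial two-number vector and would be replaced by a real embedding client.

```typescript
interface Chunk {
  content: string;
  metadata: Record<string, unknown>;
}

interface EmbeddedChunk extends Chunk {
  embedding: number[];
}

// Caller-supplied function: one vector per input string, in the same order.
type EmbedFn = (inputs: string[]) => Promise<number[][]>;

// Split a list into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed chunks batch by batch and attach each vector to its chunk.
async function embedChunks(chunks: Chunk[], embed: EmbedFn, batchSize = 32): Promise<EmbeddedChunk[]> {
  const result: EmbeddedChunk[] = [];
  for (const batch of toBatches(chunks, batchSize)) {
    const vectors = await embed(batch.map((chunk) => chunk.content));
    if (vectors.length !== batch.length) {
      throw new Error("embed() must return one vector per input");
    }
    batch.forEach((chunk, j) => result.push({ ...chunk, embedding: vectors[j] }));
  }
  return result;
}

// Stand-in embedder for demonstration only: [character count, vowel count].
const fakeEmbed: EmbedFn = async (inputs) =>
  inputs.map((s) => [s.length, (s.match(/[aeiou]/gi) ?? []).length]);
```

The resulting records keep each chunk's metadata next to its vector, so positions and labels remain available when you write to a vector store.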
