transform.text-chunker
Split long text into chunks with regex-anchored boundaries, overlap, and header preservation. Accepts raw text or a parsed-document object; chunks carry source page indexes when pages are provided.

Configuration

Configuration goes inside the step’s with: block.
input
string | record<string, unknown>
required
Either raw text or a parsed-document object { pages: [{ pageIndex, text }] } (e.g. {{ steps.parse.output }}). When pages are supplied, each chunk records the page indexes it came from.
maxChars
integer
required
Target chunk size in characters. The hard ceiling per chunk is 1.5 × maxChars.
overlap
integer
default:"0"
Number of characters repeated across adjacent chunk boundaries. Must be less than maxChars / 2.
splitOn
array<string>
Ordered list of regexes; the first that matches near the chunk boundary wins. Falls back to a plain character cut when none match. Tip: list the narrowest pattern first (e.g. /\d+\.\d+\s+/ before /\n\n+/).
maxChunks
integer
default:"64"
Safety cap on the number of chunks; chunks beyond the cap are dropped and summary.truncated is set to true.
preserveHeader
integer
default:"0"
Prepend the first N characters of the input to every chunk (good for “always include the contract title”).
minChunkChars
integer
default:"0"
Trailing chunks shorter than this are merged into the previous chunk.
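
To make the interaction between maxChars, overlap, splitOn, minChunkChars, and maxChunks concrete, here is a minimal Python sketch of the documented behaviour. It is not the step's actual implementation. The function names, the search window (from half of maxChars up to the 1.5× ceiling), cutting at the start of a regex match, and merging the short tail before applying the cap are all assumptions made for illustration. preserveHeader and page provenance are left out.

```python
import re
from typing import List, Sequence, Tuple


def validate(max_chars: int, overlap: int = 0) -> None:
    """Check the documented constraint: overlap must be < maxChars / 2."""
    if max_chars <= 0:
        raise ValueError("maxChars must be positive")
    if overlap < 0 or overlap >= max_chars / 2:
        raise ValueError("overlap must be >= 0 and < maxChars / 2")


def pick_cut(text: str, start: int, max_chars: int, split_on: Sequence[str]) -> int:
    """Return the index where the chunk beginning at `start` should end.

    Assumption: "near the chunk boundary" is modelled as the window from
    about 0.5x to 1.5x maxChars after `start`; the real window is not documented.
    """
    target = start + max_chars
    if target >= len(text):
        return len(text)  # the rest of the text fits in one chunk
    lo = start + (max_chars + 1) // 2  # never cut earlier than half the target
    hi = min(len(text), start + (max_chars * 3) // 2)  # hard ceiling of 1.5x
    for pattern in split_on:  # ordered list: the first pattern that matches wins
        hits = [lo + m.start() for m in re.finditer(pattern, text[lo:hi])]
        if hits:
            # Among this pattern's matches, take the one closest to the target.
            return min(hits, key=lambda pos: abs(pos - target))
    return target  # no pattern matched: plain character cut


def chunk_text(
    text: str,
    max_chars: int,
    overlap: int = 0,
    split_on: Sequence[str] = (),
    max_chunks: int = 64,
    min_chunk_chars: int = 0,
) -> Tuple[List[str], bool]:
    """Split `text` and return (chunks, truncated)."""
    validate(max_chars, overlap)
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = pick_cut(text, start, max_chars, split_on)
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap  # the next chunk repeats the last `overlap` chars

    # Merge a too-short trailing chunk into the previous one, skipping the
    # overlap characters it already shares with that chunk.
    if len(chunks) > 1 and len(chunks[-1]) < min_chunk_chars:
        tail = chunks.pop()
        chunks[-1] += tail[overlap:]

    # Safety cap: drop later chunks and report the truncation.
    truncated = len(chunks) > max_chunks
    return chunks[:max_chunks], truncated
```

For example, chunk_text("aaaa\n\nbbbb\n\ncccc", max_chars=6, split_on=[r"\n\n+"]) cuts at each blank line instead of in the middle of a paragraph, while the same call without split_on falls back to fixed 6-character cuts.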

Output