ai.parse turns a document into text. It is almost always the first step in a document-understanding workflow: point it at an uploaded file, get back clean text plus per-page content, then feed that into ai.extract, ai.split, or a transform.script.

When to use it

  • You have a PDF, image, or Office document and need its text. ai.parse handles PDFs, images (PNG/JPG), and Office formats (DOCX, PPTX, XLSX, ODT, and more), normalizing them to one text representation.
  • You need per-page boundaries. The pages array carries page-scoped text so downstream steps can cite evidence by page or split a document into sections.

How parsing is chosen

The step picks a strategy based on the document and your config:
  • Native text (nativeText: true) pulls embedded text straight from a PDF. It is the fastest strategy and uses no credits. When the PDF has no text layer (a scan), the step falls back to OCR/VLM.
  • OCR (ocrModel) runs optical character recognition for scanned PDFs and images.
  • Vision LLM (llmModel) reads page images with a vision model. Best for complex layouts where OCR struggles. pagesPerBatch and maxConcurrency tune how page images are batched across requests.
Office formats are converted to PDF first (via headless LibreOffice) so they get true per-page boundaries instead of a single blob.
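As a rough sketch of the vision strategy described above, the step below sets llmModel and tunes batching. The step name and the model ID are placeholders, not real identifiers, and the sketch assumes that setting llmModel is enough to select the vision path.

```yaml
# Sketch only: vision-LLM parsing for a layout-heavy document.
- name: parse_complex_layout          # hypothetical step name
  type: ai.parse
  with:
    input: '{{ input.document }}'
    llmModel: your-vision-model-id    # placeholder, substitute a real provider ID
    pagesPerBatch: 4                  # page images sent in each VLM request
    maxConcurrency: 2                 # VLM batch requests allowed in flight at once
```

If batching works the way the option descriptions suggest, a 20-page document with these settings would be split into 5 requests (20 / 4), with at most 2 running at the same time.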

Example

- name: parse
  type: ai.parse
  with:
    input: '{{ input.document }}'
    nativeText: true
    outputFormat: markdown
markdown output keeps headings and tables, which makes the downstream ai.extract prompt far more reliable than unstyled plain text.

Configuration

Configuration goes inside the step’s with: block.
  • input (string, required): Storage reference or template expression for the document.
  • ocrModel (string): OCR provider ID for PDF/image parsing.
  • llmModel (string): LLM provider ID for vision-based parsing.
  • maxConcurrency (number, default: 3): Maximum number of concurrent VLM batch requests.
  • pagesPerBatch (number, default: 5): Number of page images per VLM request.
  • prompt (string): Custom extraction prompt.
  • languages (array<string>): OCR language hints.
  • outputFormat ("plain" | "markdown" | "djot" | "html", default: "markdown"): Format for the extracted text. markdown keeps structure and is best for LLM extraction; plain is unstyled text; djot and html preserve more layout. Only the native (Kreuzberg) parser respects this setting; OCR and VLM parsing always emit markdown.
  • nativeText (boolean, default: false): Extract native (embedded) text from PDFs without OCR or a VLM. Faster and uses no credits. Falls back to OCR/VLM if the PDF has no embedded text.
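
To illustrate how these options combine, here is a minimal sketch for a mixed batch of digital and scanned PDFs: it tries the embedded text layer first and supplies an OCR provider for scans. The step name and provider ID are placeholders, and the language code format shown is an assumption; check which codes your OCR provider expects.

```yaml
# Sketch only: native text first, OCR for pages without a text layer.
- name: parse_scans                   # hypothetical step name
  type: ai.parse
  with:
    input: '{{ input.document }}'
    nativeText: true                  # use embedded text when the PDF has it
    ocrModel: your-ocr-provider-id    # placeholder, substitute a real provider ID
    languages: ['en', 'de']           # assumed code format for OCR language hints
```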

Output

The step returns:
  • Extracted text content, combined from all pages
  • Per-page content (the pages array)
  • Document metadata