Parse Document

ai.parse turns a document into text. It is almost always the first step in a document-understanding workflow: point it at an uploaded file, get back clean text plus per-page content, then feed that into ai.extract, ai.split, or a transform.script.

When to use it

You have a PDF, image, or Office document and need its text. ai.parse handles PDFs, images (PNG/JPG), and Office formats (DOCX, PPTX, XLSX, ODT, and more), normalizing them to one text representation.
You need per-page boundaries. The pages array carries page-scoped text so downstream steps can cite evidence by page or split a document into sections.

How parsing is chosen

The step picks a strategy based on the document and your config:

Native text (nativeText: true) pulls embedded text straight from a PDF. It is the fastest option and uses no credits. When the PDF has no text layer (a scan), the step falls back to OCR/VLM.
OCR (ocrModel) runs optical character recognition for scanned PDFs and images.
Vision LLM (llmModel) reads page images with a vision model. Best for complex layouts where OCR struggles. pagesPerBatch and maxConcurrency tune how page images are batched across requests.

Office formats are converted to PDF first (via headless LibreOffice) so they get true per-page boundaries instead of a single blob.
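
The OCR and vision strategies above can be sketched as two alternative steps. This is a minimal sketch rather than a verified configuration: the step names and both provider IDs are placeholders, and it assumes that setting ocrModel or llmModel is enough to select that strategy.

```yaml
# Sketch only: provider IDs are hypothetical placeholders, not real IDs.

# Scanned PDF or image: run OCR.
- name: parse_scan
  type: ai.parse
  with:
    input: '{{ input.document }}'
    ocrModel: my-ocr-provider        # hypothetical OCR provider ID

# Complex layout: read page images with a vision LLM.
- name: parse_complex_layout
  type: ai.parse
  with:
    input: '{{ input.document }}'
    llmModel: my-vision-provider     # hypothetical LLM provider ID
    pagesPerBatch: 5                 # page images per VLM request
    maxConcurrency: 3                # concurrent VLM batch requests
```

With these values, and assuming every batch is filled, a 40-page document would go out as 8 requests of 5 page images each, with at most 3 requests in flight at a time.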

Example

- name: parse
  type: ai.parse
  with:
    input: '{{ input.document }}'
    nativeText: true
    outputFormat: markdown

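markdown output keeps headings and tables, which gives a downstream ai.extract step structure to work with and makes extraction more reliable than it is on unstyled plain text. markdown is also the default, so the outputFormat line above is optional.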

Configuration

Configuration goes inside the step’s with: block.

input (string, required)
Storage reference or template expression for the document.

ocrModel (string)
OCR provider ID for PDF/image parsing.

llmModel (string)
LLM provider ID for vision-based parsing.

maxConcurrency (number, default: 3)
Max concurrent VLM batch requests.

pagesPerBatch (number, default: 5)
Number of page images per VLM request.

number, default: 1
Scale factor for rendering PDF pages before VLM parsing. Higher values produce sharper images at larger payload sizes.

integer, default: 85
JPEG quality for rendered PDF page images sent to VLM parsing. Higher values reduce compression artifacts at larger payload sizes.

string
Custom extraction prompt.

array<string>
OCR language hints.

outputFormat ("plain" | "markdown" | "djot" | "html", default: "markdown")
Format for extracted text. markdown (default) keeps structure and is best for LLM extraction; plain is unstyled text; djot/html preserve more layout. Only the native (Kreuzberg) parser respects this; OCR/VLM always emit markdown.

nativeText (boolean, default: false)
Extract native/embedded text from PDFs without OCR/VLM. Faster and uses no credits. Falls back to OCR/VLM if the PDF has no embedded text.

describeFigures (boolean)
Opt-in (default off). After text extraction, the step detects which pages contain figures using an in-worker layout model, captions those pages with a vision model, and appends <figure>description</figure> to their text, so image-only pages (property photos, signatures, charts) become findable by text-based steps like ai.split. Note: the layout scan runs over all pages, and the caption step and its vision calls are billed. Skipped for plaintext.

string
Custom instruction for the figure-description pass, e.g. “Describe each figure; label a handwritten signature as <figure>signature</figure> and a stamp as <figure>stamp</figure>; for property photos note the room or exterior shown.” Applied only when describeFigures runs.
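
As a rough illustration of the figure-description option, the step below turns it on alongside native text extraction. It is a sketch under assumptions: the step name is made up, and whether a vision model must also be configured for the caption pass is not stated here, so none is shown. The custom figure instruction is left out as well.

```yaml
# Sketch only: step name is illustrative.
- name: parse_with_figures
  type: ai.parse
  with:
    input: '{{ input.document }}'
    nativeText: true        # embedded text first; falls back to OCR/VLM for scans
    describeFigures: true   # opt-in; the caption step and its vision calls are billed
```

Because the layout scan runs over every page, this is best reserved for documents where image-only pages actually need to be found by text-based steps such as ai.split.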

Output

The step returns three things:

Extracted text content, combined from all pages.
The pages array, holding per-page content. Each entry carries a 0-based page index, the extracted text for that page, a page/sheet name (e.g., an Excel sheet name), and an overall page confidence.
Document metadata.
