Eval-first development

This guide walks through one workflow: extracting transactions from a bank statement. The same pattern applies to other document types. A workflow defines behavior; an evaluation defines whether that behavior matches the expected output. Build both before treating the workflow as ready.

1. Build the workflow

Parse the document, then extract the fields you need. Keep it simple to start.

eigenpal init workflow extract-statement
# edit workflow.yaml: ai.parse -> ai.extract (account, holder, transactions)
eigenpal workflow push

2. Define what correct means

Collect a handful of real statements and write down the right answer for each. That is your dataset: an input document plus the expected output. Then add evaluators that score a run against that expected output. For structured extraction, exact-diff checks that the transactions and account fields match exactly. For the fuzzy parts, an llm-judge can grade a question like “does this capture recurring salary credits and exclude refunds?” Evaluators are weighted and roll up to one score, which is compared against a pass threshold.

# evaluators.yaml
evaluators:
  - type: exact-diff
    name: fields-match
    weight: 0.7
  - type: llm-judge
    name: salary-credits
    weight: 0.3
    passThreshold: 0.9

eigenpal workflow dataset push <workflow-id>
eigenpal workflow evaluators push <workflow-id>
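To make the weighted roll-up concrete, here is a minimal Python sketch of how two evaluator scores might combine into one number and be checked against a threshold. It assumes a plain weighted average and a threshold applied to the combined score; the guide does not state the exact formula the platform uses, and the names EvaluatorResult, combined_score, and passes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvaluatorResult:
    name: str
    weight: float
    score: float  # assumed to be in the range 0.0 to 1.0


def combined_score(results):
    """Weighted average of evaluator scores (an assumed roll-up rule)."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(r.weight * r.score for r in results) / total_weight


def passes(results, threshold):
    """True when the combined score reaches the threshold."""
    return combined_score(results) >= threshold


# Illustrative scores only: the fields matched exactly, the judge gave 0.8.
results = [
    EvaluatorResult("fields-match", weight=0.7, score=1.0),
    EvaluatorResult("salary-credits", weight=0.3, score=0.8),
]
score = combined_score(results)
print(f"combined score: {score:.2f}, passes at 0.9: {passes(results, 0.9)}")
```

Under this assumption, a perfect exact-diff result can carry a slightly weaker judge score over the bar, which is why the weights deserve as much thought as the threshold.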

3. Tune until it passes

Run an experiment across the whole dataset. Adjust the prompt, schema, or model and re-run until the score clears your threshold.

eigenpal workflow experiment run <workflow-id> --wait
eigenpal workflow experiment results <workflow-id> <experiment-id>

Some workflow CLI command help still says batchId. It is the same identifier that the REST API and SDKs call experimentId.

4. Ship it

Run it for real. Every run is stored with its inputs, outputs, model versions, and trace, which is what makes the next step possible. Experiment runs additionally store evaluator scores.

eigenpal run workflows.extract-statement --input-file document=statement.pdf

5. Catch a miss

A statement comes in with a layout you never tested, and the workflow drops one transaction. You find it by inspecting the run.

eigenpal runs trace <run-id>
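Once you know what the statement should have produced, a small diff makes the dropped transaction explicit. The Python sketch below is an illustration only, not the platform's exact-diff evaluator: the field names date, description, and amount and the sample values are hypothetical, and the comparison is a simple multiset difference so that duplicate transactions are counted correctly.

```python
from collections import Counter


def missing_transactions(expected, actual):
    """Return expected transactions that do not appear in actual."""
    def key(tx):
        # Hypothetical field names; amounts are strings to avoid float drift.
        return (tx["date"], tx["description"], tx["amount"])

    remaining = Counter(key(tx) for tx in actual)
    missing = []
    for tx in expected:
        k = key(tx)
        if remaining[k] > 0:
            remaining[k] -= 1  # matched one occurrence in the run output
        else:
            missing.append(tx)
    return missing


# Made-up example data for illustration.
expected = [
    {"date": "2024-03-01", "description": "SALARY", "amount": "3200.00"},
    {"date": "2024-03-04", "description": "GROCERIES", "amount": "-54.10"},
    {"date": "2024-03-09", "description": "RENT", "amount": "-1100.00"},
]
actual = [
    {"date": "2024-03-01", "description": "SALARY", "amount": "3200.00"},
    {"date": "2024-03-09", "description": "RENT", "amount": "-1100.00"},
]

missing = missing_transactions(expected, actual)
for tx in missing:
    print("dropped:", tx)
```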

6. Make the miss a permanent test

Instead of patching and moving on, capture the corrected output for that document and add it to the evaluation dataset.

# write the corrected output to corrected-output.json, then promote the reviewed run
eigenpal runs reviews update <run-id> \
  --verdict incorrect \
  --corrected-json-file corrected-output.json
eigenpal runs promote <run-id> --name layout-edge

If you are authoring an example before a run exists, put the file under the dataset archive layout and push the dataset instead of promoting a run. Either way, re-run the experiment afterward. From then on, no change can reintroduce that mistake without an evaluation flagging it. See Review production runs for the full human-review loop around verdicts, statuses, reruns, and monitoring.

eigenpal workflow experiment run <workflow-id> --wait

7. Compare before deploying

Because evaluations define the quality bar, you can compare changes directly. Swap to a cheaper model, re-run, and keep it only if it still passes (see Optimize cost). Push a new version and compare it to the last one before deploying.

eigenpal workflow experiment compare <old-experiment-id> <new-experiment-id>
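The idea behind a comparison can be sketched in a few lines of Python: given per-case scores from two experiments, list the cases that passed before and fail now. This is an assumption about what a useful comparison looks for, not a description of the compare command's output; find_regressions, the case names, and the scores are hypothetical.

```python
def find_regressions(old_scores, new_scores, threshold):
    """Return (case, old, new) for cases that passed before and fail now."""
    regressions = []
    for case, old in sorted(old_scores.items()):
        new = new_scores.get(case)
        if new is None:
            continue  # case is absent from the new experiment; nothing to compare
        if old >= threshold and new < threshold:
            regressions.append((case, old, new))
    return regressions


# Made-up per-case scores for two experiments.
old_scores = {"statement-01": 0.95, "statement-02": 0.92, "layout-edge": 0.91}
new_scores = {"statement-01": 0.96, "statement-02": 0.93, "layout-edge": 0.72}

regressions = find_regressions(old_scores, new_scores, threshold=0.9)
for case, old, new in regressions:
    print(f"{case}: {old:.2f} -> {new:.2f}")
```

Looking at individual cases rather than only the aggregate matters here: in this example the average barely moves, yet the layout-edge case promoted in step 6 has regressed.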

The dataset grows with real issues, and each workflow change is checked against the same cases before it ships.
