This is the loop EigenPal is built for, walked through on a single workflow: extracting transactions from a bank statement. The same pattern applies to any document type. The idea in one line: a workflow defines behavior, and an evaluation defines whether that behavior is correct. Build the evaluation as seriously as you build the workflow.

1. Build the workflow

Parse the document, then extract the fields you need. Keep it simple to start.
eigenpal init workflow extract-statement
# edit workflow.yaml: ai.parse -> ai.extract (account, holder, transactions)
eigenpal workflow push

2. Define what correct means

Collect a handful of real statements and write down the right answer for each. That is your dataset: an input document plus the expected output. Then add evaluators that score a run against that expected output. For structured extraction, exact-diff checks that the transactions and account fields match exactly. For the fuzzy parts, an llm-judge can grade a question such as “does this capture recurring salary credits and exclude refunds?” Evaluators are weighted and roll up to one score, which is compared against a pass threshold.
# evaluators.yaml
evaluators:
  - type: exact-diff
    name: fields-match
    weight: 0.7
  - type: llm-judge
    name: salary-credits
    weight: 0.3
    passThreshold: 0.9
eigenpal workflow dataset push <workflow-id>
eigenpal workflow evaluators push <workflow-id>
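
To make the roll-up concrete, here is a minimal Python sketch of how weighted evaluator scores could combine into one number. It is an illustration only, not EigenPal's implementation: the function names `exact_diff` and `weighted_score` are hypothetical, the sample account data is made up, and the weight-normalised average is an assumption, since the exact aggregation rule is not stated here.

```python
# Hypothetical sketch: none of these names are part of the EigenPal API.

def exact_diff(actual: dict, expected: dict) -> float:
    """Score 1.0 only when the two outputs are identical, else 0.0."""
    return 1.0 if actual == expected else 0.0


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-evaluator scores, normalised by total weight (assumed rule)."""
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total


# Made-up example data for one statement.
expected = {
    "account": "12345678",
    "transactions": [{"date": "2024-01-31", "amount": 2500.0}],
}
actual = dict(expected)  # a run that reproduced the expected output

scores = {
    "fields-match": exact_diff(actual, expected),
    "salary-credits": 0.8,  # stand-in for a grade returned by an llm-judge
}
weights = {"fields-match": 0.7, "salary-credits": 0.3}

overall = weighted_score(scores, weights)
passed = overall >= 0.9  # same threshold as in evaluators.yaml
print(f"overall={overall:.2f} passed={passed}")
```

With these weights, a perfect structural match cannot carry a poor judge grade on its own: the run passes only while the weighted total stays at or above the threshold.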

3. Tune until it passes

Run an experiment across the whole dataset. You now have a number, not an opinion. Adjust the prompt, the schema, or the model and re-run until the score clears your threshold.
eigenpal workflow experiment run <workflow-id> --wait
eigenpal workflow experiment results <workflow-id> <batchId>
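
As a rough illustration of what "a number, not an opinion" means, the sketch below summarises per-example scores from one experiment. The `summarise` helper, the example names, and the rule that every example must clear the threshold are assumptions made for this sketch; the source does not say how an experiment aggregates its results.

```python
from statistics import mean

# Hypothetical helper: not part of the EigenPal CLI or API.
def summarise(example_scores: dict[str, float], pass_threshold: float) -> dict:
    """Return the mean score plus the examples that fall below the threshold."""
    failing = sorted(
        name for name, score in example_scores.items() if score < pass_threshold
    )
    return {
        "mean": mean(example_scores.values()),
        "failing": failing,
        "passed": not failing,  # assumed rule: every example must clear the bar
    }


# Made-up scores for three statements in the dataset.
summary = summarise(
    {"statement-01": 1.0, "statement-02": 0.94, "statement-03": 0.7},
    pass_threshold=0.9,
)
print(summary)
```

Listing the failing examples by name, rather than reporting only the mean, tells you which documents to look at before the next prompt, schema, or model change.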

4. Ship it

Run it for real. Every run is stored with its inputs, outputs, model versions, and scores, which is what makes the next step possible.
eigenpal run workflows.extract-statement -F document=@statement.pdf

5. Catch a miss

A statement comes in with a layout you never tested, and the workflow drops one transaction. You find it by inspecting the run.
eigenpal runs trace <run-id>

6. Make the miss a permanent test

This is the part that compounds. Instead of patching and moving on, capture the corrected output for that document and add it to the evaluation dataset. The production failure is now a test case.
# capture the corrected expected output for the failing run, then add it as an example
eigenpal runs expected pull <run-id>      # edit to the correct answer
eigenpal workflow dataset example create <workflow-id> --name layout-edge -F document=@statement.pdf
Re-run the experiment. From now on, no change can reintroduce that mistake without an evaluation flagging it.
eigenpal workflow experiment run <workflow-id> --wait
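
The mechanism is ordinary regression testing. The following Python sketch shows the idea with stand-in workflows. The `Example` type, the `failures` helper, and the simplified transaction counts are all hypothetical and only illustrate why a captured failure keeps guarding against the same mistake.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    name: str
    document: str   # path to the input document
    expected: dict  # the corrected, known-good output


def failures(workflow, dataset: list[Example]) -> list[str]:
    """Names of the examples whose output differs from the expected output."""
    return [ex.name for ex in dataset if workflow(ex.document) != ex.expected]


dataset = [Example("baseline", "statement-01.pdf", {"transactions": 3})]

# The production miss becomes a permanent example with its corrected answer.
dataset.append(Example("layout-edge", "statement.pdf", {"transactions": 5}))


def old_workflow(document: str) -> dict:
    # Stand-in for the version that drops one transaction on the new layout.
    return {"transactions": 4} if document == "statement.pdf" else {"transactions": 3}


def fixed_workflow(document: str) -> dict:
    # Stand-in for the version that handles the new layout.
    return {"transactions": 5} if document == "statement.pdf" else {"transactions": 3}


print(failures(old_workflow, dataset))
print(failures(fixed_workflow, dataset))
```

Any later version that behaves like `old_workflow` on this document fails the `layout-edge` example again, which is what makes the fix durable.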

7. Improve with confidence

Because the quality bar is fixed by evaluations, you can change things freely and let the experiment tell you whether you helped or hurt. Swap to a cheaper model, re-run, and keep it only if it still passes (see Optimize cost). Push a new version and compare it to the last one before deploying.
eigenpal workflow experiment compare <oldBatchId> <newBatchId>
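
Conceptually, comparing two batches comes down to finding the examples whose score moved. The sketch below is one way to express that in Python. The `compare` function, its `tolerance` parameter, and the sample scores are assumptions for illustration, not a description of what `experiment compare` reports.

```python
# Hypothetical sketch: not the output format of `experiment compare`.
def compare(old: dict[str, float], new: dict[str, float], tolerance: float = 0.0) -> dict:
    """Split the examples present in both batches into regressions and improvements."""
    shared = sorted(old.keys() & new.keys())
    return {
        "regressions": {n: (old[n], new[n]) for n in shared if new[n] < old[n] - tolerance},
        "improvements": {n: (old[n], new[n]) for n in shared if new[n] > old[n] + tolerance},
    }


# Made-up per-example scores, for example before and after swapping to a cheaper model.
old_batch = {"statement-01": 1.0, "statement-02": 0.7, "layout-edge": 0.9}
new_batch = {"statement-01": 1.0, "statement-02": 0.95, "layout-edge": 0.6}

report = compare(old_batch, new_batch)
print(report)
```

A change that raises the average can still break an individual example, so a per-example view like this is a useful check before keeping the new version.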
That is the whole loop. The dataset grows with every issue you have ever seen, and quality becomes visible, measurable, and auditable instead of a matter of spot checks.