Evaluate a workflow

Once you have a dataset, you can score a workflow across every example and compare versions. This is how you tell whether a change made the workflow better or worse.

1. Define evaluators

Evaluators score actual output against each example’s expected output. Put them in evaluators.yaml:

evaluators:
  - type: exact-diff
    name: fields-match
  - type: llm-judge
    name: summary-quality
    passThreshold: 0.85
    prompt: |
      Score how well the summary captures the invoice totals and vendor.

Pick the evaluator type that fits the check:

exact-diff: for output that should match the expected output exactly.
llm-judge: for fuzzy correctness, scored against a passThreshold.
custom-script: for correctness you can compute in typed TypeScript.
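
To make the custom-script case concrete, here is a minimal sketch of the kind of check such a script might compute: the fraction of expected fields that the actual output matches. The function name scoreFields, the Fields and EvalResult shapes, and the way inputs are passed in are all assumptions for illustration. This page does not describe the real custom-script interface, so treat the block as the scoring logic only.

```typescript
// Hypothetical shapes for illustration; the real custom-script interface may differ.
type Fields = Record<string, string | number | null>;

interface EvalResult {
  score: number;        // fraction of expected fields matched, between 0 and 1
  passed: boolean;      // true when score reaches passThreshold
  mismatches: string[]; // names of expected fields that did not match
}

// Compare each expected field with the actual output and score the match rate.
function scoreFields(actual: Fields, expected: Fields, passThreshold = 1): EvalResult {
  const keys = Object.keys(expected);
  // A field missing from the actual output counts as a mismatch.
  const mismatches = keys.filter((k) => actual[k] !== expected[k]);
  // An example with no expected fields is treated as a full match.
  const score = keys.length === 0 ? 1 : (keys.length - mismatches.length) / keys.length;
  return { score, passed: score >= passThreshold, mismatches };
}

const result = scoreFields(
  { vendor: "Acme", total: 120.5, currency: "USD" },
  { vendor: "Acme", total: 120.5, currency: "EUR" },
);
console.log(result.mismatches, result.passed);
```

Returning the mismatched field names alongside the score makes a failing example easier to debug than a bare number.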

Validate and push:

eigenpal workflow evaluators validate
eigenpal workflow evaluators push <workflow-id>

2. Run an experiment

An experiment runs the workflow across the whole dataset and collects every score in one batch:

eigenpal workflow experiment run <workflow-id> --wait
eigenpal workflow experiment results <workflow-id> <experiment-id>

--wait blocks until the experiment finishes and exits non-zero if any example failed, so it works in CI. Some CLI help still calls the experiment id a batchId; it is the same identifier.
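
Because --wait reports failure through the exit code, a CI script only needs to run the command and check for a zero status. The sketch below shows that gating logic with the command runner passed in as a parameter, so the logic can be exercised without the CLI installed. The Runner type and both function names are hypothetical helpers, not part of eigenpal.

```typescript
// A runner executes a command and returns its exit code (null if killed by a signal).
// Hypothetical helper type; not part of the eigenpal CLI.
type Runner = (cmd: string, args: string[]) => number | null;

// Build the argument list for the blocking experiment run shown above.
function experimentRunArgs(workflowId: string): string[] {
  return ["workflow", "experiment", "run", workflowId, "--wait"];
}

// Returns true only when the CLI exited with status 0.
function experimentPassed(workflowId: string, run: Runner): boolean {
  const status = run("eigenpal", experimentRunArgs(workflowId));
  return status === 0;
}

// In Node, a runner could wrap spawnSync from "node:child_process":
//   (cmd, args) => spawnSync(cmd, args, { stdio: "inherit" }).status
// A CI script would then exit with 0 or 1 depending on experimentPassed(...).
```

Treating a null status (process killed by a signal) as a failure keeps the gate conservative.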

3. Compare versions

Change the workflow, push a new version, run a second experiment, then diff the two experiments:

eigenpal workflow experiment compare <old-experiment-id> <new-experiment-id>

The comparison shows which examples improved, regressed, or stayed the same.
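
Conceptually, the comparison is a per-example diff of scores between two experiments. The sketch below illustrates that idea; it is not the CLI's implementation. The Scores shape (example id mapped to a numeric score) and the tolerance parameter are assumptions for illustration.

```typescript
type Verdict = "improved" | "regressed" | "unchanged";

// Hypothetical input: example id -> score from one experiment.
type Scores = Record<string, number>;

// Classify every example that appears in both experiments.
// Score changes no larger than `tolerance` count as unchanged.
function compareExperiments(
  oldScores: Scores,
  newScores: Scores,
  tolerance = 0,
): Record<string, Verdict> {
  const verdicts: Record<string, Verdict> = {};
  for (const id of Object.keys(oldScores)) {
    if (!(id in newScores)) continue; // skip examples missing from the new experiment
    const delta = newScores[id] - oldScores[id];
    if (delta > tolerance) verdicts[id] = "improved";
    else if (delta < -tolerance) verdicts[id] = "regressed";
    else verdicts[id] = "unchanged";
  }
  return verdicts;
}

const verdicts = compareExperiments(
  { "inv-1": 0.5, "inv-2": 1, "inv-3": 0.75 },
  { "inv-1": 1, "inv-2": 0.5, "inv-3": 0.75 },
);
console.log(verdicts);
```

A small non-zero tolerance is one way to keep llm-judge noise from showing up as spurious improvements or regressions.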

Tips

llm-judge scores vary from run to run. Average across several runs, or raise passThreshold deliberately, rather than chasing single-run noise.
Keep expected outputs in the dataset so exact-diff and llm-judge have ground truth to compare against.
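
The averaging tip can be sketched as follows: collect the llm-judge score for the same example over several runs, then compare the mean against passThreshold instead of any single run. The function names and the plain number[] input are assumptions for illustration; how you gather per-run scores from experiment results is up to you.

```typescript
// Mean of repeated judge scores for one example.
function meanScore(scores: number[]): number {
  if (scores.length === 0) throw new Error("need at least one score");
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Sample standard deviation, as a rough measure of run-to-run noise.
function sampleStdDev(scores: number[]): number {
  if (scores.length < 2) return 0;
  const mean = meanScore(scores);
  const sumSq = scores.reduce((acc, s) => acc + (s - mean) * (s - mean), 0);
  return Math.sqrt(sumSq / (scores.length - 1));
}

// Pass or fail on the average rather than on a single run.
function passesOnAverage(scores: number[], passThreshold: number): boolean {
  return meanScore(scores) >= passThreshold;
}

// The first run alone (0.75) would fail a 0.85 threshold; the mean (0.875) passes.
const runs = [0.75, 1, 0.875];
console.log(meanScore(runs), sampleStdDev(runs), passesOnAverage(runs, 0.85));
```

If the standard deviation is large relative to the gap between the mean and passThreshold, more runs are needed before reading anything into a pass or fail.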
