Once you have a dataset, you can score a workflow across every example and compare versions. This is how you tell whether a change made the workflow better or worse, instead of eyeballing one run.

1. Define evaluators

Evaluators score actual output against each example’s expected output. Put them in evaluators.yaml:
evaluators:
  - type: exact-diff
    name: fields-match
  - type: llm-judge
    name: summary-quality
    passThreshold: 0.85
    prompt: |
      Score how well the summary captures the invoice totals and vendor.
  • exact-diff: for output that must match the expected output exactly.
  • llm-judge: for fuzzy correctness, judged against a passThreshold.
  • custom-script: for correctness you can compute in JavaScript.
Validate and push:
eigenpal workflow evaluators validate
eigenpal workflow evaluators push <workflow-id>

2. Run an experiment

An experiment runs the workflow across the whole dataset and collects every score in one batch:
eigenpal workflow experiment run <workflow-id> --wait
eigenpal workflow experiment results <workflow-id> <batchId>
--wait blocks until the batch finishes and exits non-zero if any example failed, so it works in CI.
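In a CI job, the exit code of the run command is what gates the build. Here is a minimal Python sketch, assuming the eigenpal CLI is on the PATH and using only the run command shown above. The helper names build_run_command and run_experiment_gate are hypothetical, and the runner parameter exists only so the function can be exercised without the CLI installed:

```python
import subprocess


def build_run_command(workflow_id: str) -> list[str]:
    """Return the argv for the experiment run shown above, including --wait."""
    return ["eigenpal", "workflow", "experiment", "run", workflow_id, "--wait"]


def run_experiment_gate(workflow_id: str, runner=subprocess.run) -> int:
    """Run the experiment and hand back the CLI's exit code unchanged.

    The source states that --wait exits non-zero if any example failed,
    so a CI step can simply fail when this return value is not 0.
    """
    completed = runner(build_run_command(workflow_id))
    return completed.returncode


# Hypothetical usage in a CI step:
#   import sys
#   sys.exit(run_experiment_gate("<workflow-id>"))
```

Passing the exit code through untouched keeps the CI step's pass/fail status identical to what the CLI reports.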

3. Compare versions

Change the workflow, push a new version, run a second experiment, then diff the two batches:
eigenpal workflow experiment compare <batchIdA> <batchIdB>
The comparison shows which examples improved, regressed, or stayed the same, so a change is a measured decision, not a guess.
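Conceptually, the comparison is a per-example classification of score changes. The sketch below is not the CLI's implementation. It assumes each batch can be reduced to a mapping from example ID to one numeric score where higher is better; the tolerance parameter, the example IDs, and the scores are all hypothetical:

```python
from collections import Counter


def classify_changes(scores_a, scores_b, tolerance=0.0):
    """Label each example present in both batches as improved, regressed, or unchanged."""
    labels = {}
    for example_id in sorted(scores_a.keys() & scores_b.keys()):
        delta = scores_b[example_id] - scores_a[example_id]
        if delta > tolerance:
            labels[example_id] = "improved"
        elif delta < -tolerance:
            labels[example_id] = "regressed"
        else:
            labels[example_id] = "unchanged"
    return labels


# Made-up scores for two batches over the same three examples.
baseline = {"inv-001": 1.0, "inv-002": 0.5, "inv-003": 0.75}
candidate = {"inv-001": 1.0, "inv-002": 0.75, "inv-003": 0.5}

labels = classify_changes(baseline, candidate)
summary = Counter(labels.values())
print(labels)
print(dict(summary))
```

A non-zero tolerance is one way to keep small llm-judge fluctuations from being counted as improvements or regressions.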

Tips

  • llm-judge scores vary from run to run. Average across several runs, or raise passThreshold deliberately, rather than reacting to single-run noise.
  • Keep expected outputs in the dataset so exact-diff and llm-judge have ground truth to compare against.
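
The averaging tip can be sketched as follows: collect one llm-judge evaluator's score for the same example over several runs, then apply the threshold to the mean. This assumes you have gathered the per-run scores yourself, and that a score passes when it is at least passThreshold (this page does not state whether the comparison is inclusive). The function name and the numbers are made up for illustration:

```python
from statistics import mean, stdev


def passes_on_average(run_scores, pass_threshold):
    """Return (mean score, spread, passed) for one example scored over several runs."""
    avg = mean(run_scores)
    # Sample standard deviation needs at least two data points.
    spread = stdev(run_scores) if len(run_scores) > 1 else 0.0
    return avg, spread, avg >= pass_threshold


# Four hypothetical runs of the same example.
scores = [0.75, 1.0, 0.875, 0.875]
avg, spread, passed = passes_on_average(scores, pass_threshold=0.85)
print(f"mean={avg:.3f} spread={spread:.3f} passed={passed}")
```

In this made-up data, the single run that scored 0.75 would have failed a 0.85 threshold on its own, while the four-run mean of 0.875 passes.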