A workflow that works on one document is not done. Evaluations let you test a workflow against many examples and score the results, the same way tests guard code. Three pieces work together: datasets, evaluators, and experiments.

Datasets

A dataset is a set of examples. Each example is an input (the arguments and files the workflow receives) plus, optionally, the expected output (the ground-truth result). Datasets are portable folder archives, so you can build them locally and push them, or pull and edit them.
examples/<name>/input/arguments.json      scalar/object args
examples/<name>/input/<arg-name>/<file>   one folder per file argument
examples/<name>/expected/output.json      optional ground-truth output
See Build a dataset for the full walkthrough.
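
As a rough illustration of the layout above, the following Node.js sketch writes one example to a temporary folder. The helper writeExample, the example name invoice-001, and the argument names and values are all hypothetical; only the folder structure comes from this page.

```javascript
// Sketch: write one dataset example to disk in the layout shown above.
// writeExample, the example name, and all argument values are hypothetical.
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

function writeExample(root, name, { args, files = {}, expected }) {
  const inputDir = path.join(root, "examples", name, "input");
  fs.mkdirSync(inputDir, { recursive: true });
  fs.writeFileSync(
    path.join(inputDir, "arguments.json"),
    JSON.stringify(args, null, 2)
  );

  // One folder per file argument, holding that argument's file.
  for (const [argName, file] of Object.entries(files)) {
    const argDir = path.join(inputDir, argName);
    fs.mkdirSync(argDir, { recursive: true });
    fs.writeFileSync(path.join(argDir, file.filename), file.contents);
  }

  // The expected output is optional, so only write it when provided.
  if (expected !== undefined) {
    const expectedDir = path.join(root, "examples", name, "expected");
    fs.mkdirSync(expectedDir, { recursive: true });
    fs.writeFileSync(
      path.join(expectedDir, "output.json"),
      JSON.stringify(expected, null, 2)
    );
  }
  return path.join(root, "examples", name);
}

const root = fs.mkdtempSync(path.join(os.tmpdir(), "dataset-"));
const exampleDir = writeExample(root, "invoice-001", {
  args: { currency: "EUR" },
  files: { invoice: { filename: "invoice.txt", contents: "Total: 42.00 EUR" } },
  expected: { total: 42, currency: "EUR" },
});
console.log(exampleDir);
```

Leaving out expected produces an input-only example, which matches the optional ground truth described above.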

Evaluators

An evaluator scores a run’s actual output against the example’s expected output and returns either a pass/fail result or a numeric score. Define evaluators in evaluators.yaml. Three types are built in:
  • exact-diff does a JSON deep-diff against the expected output. Use it when the output should match exactly.
  • llm-judge uses an LLM to score the output against a rubric, with a pass threshold. Use it when “correct” is fuzzy (summaries, free text).
  • custom-script runs a JavaScript scoring function in the sandbox. Use it when correctness is a calculation you can write.
evaluators:
  - type: exact-diff
    name: fields-match
  - type: llm-judge
    name: summary-quality
    passThreshold: 0.85
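
To make the custom-script idea concrete, here is a minimal sketch of a scoring function for the case where correctness is a calculation. The function name score, its (actual, expected) parameters, and its 0-to-1 return value are assumptions; this page does not specify the signature the sandbox expects. The threshold comparison at the end only illustrates how a numeric score could become a pass or fail.

```javascript
// Hypothetical scoring function: the real custom-script signature and
// return shape are not given on this page, so treat both as assumptions.
function score(actual, expected) {
  const keys = Object.keys(expected);
  if (keys.length === 0) return 1;

  let matched = 0;
  for (const key of keys) {
    const a = actual[key];
    const e = expected[key];
    if (typeof a === "number" && typeof e === "number") {
      // Treat numbers within 0.01 of each other as equal.
      if (Math.abs(a - e) <= 0.01) matched += 1;
    } else if (JSON.stringify(a) === JSON.stringify(e)) {
      matched += 1;
    }
  }
  // Fraction of expected fields that the actual output got right.
  return matched / keys.length;
}

const expected = { total: 42.0, currency: "EUR", vendor: "Acme" };
const actual = { total: 42.004, currency: "EUR", vendor: "ACME Inc." };

const result = score(actual, expected);
const passThreshold = 0.85; // illustrative value, borrowed from the config above
console.log(result.toFixed(2), result >= passThreshold ? "pass" : "fail");
// prints "0.67 fail"
```

Two of the three fields match (the total is within tolerance, the vendor differs), so the score is about 0.67. An exact-diff evaluator would reject this output outright because the total is not identical.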

Experiments

An experiment runs a workflow version across the whole dataset and collects the scores in one place. Run two experiments on two versions and compare them to see what a change improved or broke.
eigenpal workflow experiment run <workflow-id>
eigenpal workflow experiment results <workflow-id> <batchId>
eigenpal workflow experiment compare <batchIdA> <batchIdB>
See Evaluate a workflow for the end-to-end flow.
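
Conceptually, comparing two experiments means lining up the scores for each example and looking at which direction they moved. The sketch below shows that idea on made-up data. The per-example score maps and the compareScores helper are hypothetical and do not reflect the CLI's actual results format.

```javascript
// Hypothetical per-example scores from two experiments. The real results
// format of the CLI is not shown on this page.
const batchA = { "invoice-001": 1.0, "invoice-002": 0.6, "invoice-003": 0.9 };
const batchB = { "invoice-001": 1.0, "invoice-002": 0.9, "invoice-003": 0.7 };

function compareScores(a, b) {
  const improved = [];
  const regressed = [];
  const unchanged = [];
  for (const name of Object.keys(a)) {
    if (!(name in b)) continue; // only compare examples present in both runs
    if (b[name] > a[name]) improved.push(name);
    else if (b[name] < a[name]) regressed.push(name);
    else unchanged.push(name);
  }
  return { improved, regressed, unchanged };
}

const diff = compareScores(batchA, batchB);
console.log(diff);
```

Listing regressions per example, rather than comparing only an average, shows what a change broke even when the overall score went up.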