Evaluations

A workflow defines behavior. An evaluation defines whether that behavior is correct for a dataset. Models, prompts, and steps change; the evaluation is the check that tells you whether the output still matches the expected result. A workflow that works on one document is not done until it passes against a representative dataset. Three pieces work together: datasets, evaluators, and experiments. Used end to end, they turn quality from a spot check into a repeatable score. See Eval-first development.

Datasets

A dataset is a set of examples. Each example is an input (the arguments and files the workflow receives) plus, optionally, the expected output (the ground-truth result). Datasets are portable folder archives, so you can build them locally and push them, or pull and edit them.

examples/<name>/input.json                full run input object
examples/<name>/input/<file>              files referenced as { "$file": "input/<file>" }
examples/<name>/expected.json             optional expected output
examples/<name>/expected/<file>           expected files referenced from expected.json

Workflows and agents both use this automation dataset archive format. See Build a dataset for the full walkthrough.
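To make the layout concrete, here is a minimal TypeScript sketch that lists the archive-relative paths one example would occupy. The helper names (`examplePaths`, `fileRef`), the `ExampleSpec` shape, and the sample field names (`document`, `locale`, `total`) are hypothetical and only illustrate the folder convention above; they are not part of any eigenpal SDK.

```typescript
// Hypothetical description of one dataset example (not an official type).
interface ExampleSpec {
  name: string;
  input: Record<string, unknown>;
  inputFiles?: string[]; // bare file names stored under input/
  expected?: Record<string, unknown>; // optional ground truth
  expectedFiles?: string[]; // bare file names stored under expected/
}

// Builds the { "$file": "input/<file>" } reference used inside input.json.
function fileRef(file: string): { $file: string } {
  return { $file: `input/${file}` };
}

// Lists the paths an example occupies, following the layout shown above.
function examplePaths(spec: ExampleSpec): string[] {
  const root = `examples/${spec.name}`;
  const paths = [`${root}/input.json`];
  for (const file of spec.inputFiles ?? []) {
    paths.push(`${root}/input/${file}`);
  }
  if (spec.expected !== undefined) {
    paths.push(`${root}/expected.json`);
    for (const file of spec.expectedFiles ?? []) {
      paths.push(`${root}/expected/${file}`);
    }
  }
  return paths;
}

// Illustrative example; the field names are assumptions.
const spec: ExampleSpec = {
  name: "invoice-001",
  input: { document: fileRef("invoice.pdf"), locale: "en-US" },
  inputFiles: ["invoice.pdf"],
  expected: { total: 1250 },
};

console.log(examplePaths(spec));
```

An example with no expected output simply omits `expected.json`, which matches the "optional" wording in the layout.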

Evaluators

An evaluator scores a run’s actual output against the example’s expected output and returns either a pass/fail result or a numeric score. Define evaluators in evaluators.yaml. Three types are built in:

exact-diff does a JSON deep-diff against the expected output. Use it when the output should match exactly.
llm-judge uses an LLM to score the output against a rubric, with a pass threshold. Use it when “correct” is fuzzy (summaries, free text).
custom-script runs a typed TypeScript scoring function in the sandbox. Use it when correctness is a calculation you can write. The function must declare a : number return type annotation, the same requirement as transform.script.
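As a rough illustration of what a JSON deep-diff does, the sketch below walks two JSON values and returns the paths where they differ. This is not the exact-diff implementation; the function name, the `$`-rooted path format, and the rule that any difference counts as a mismatch are assumptions made for the example.

```typescript
type JsonValue =
  | null
  | boolean
  | number
  | string
  | JsonValue[]
  | { [key: string]: JsonValue };

function isObject(v: JsonValue): v is { [key: string]: JsonValue } {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Returns the paths at which `actual` differs from `expected`.
// An empty result means the two values match exactly.
function deepDiff(expected: JsonValue, actual: JsonValue, path = "$"): string[] {
  if (Array.isArray(expected) && Array.isArray(actual)) {
    const diffs: string[] = [];
    const length = Math.max(expected.length, actual.length);
    for (let i = 0; i < length; i++) {
      if (i >= expected.length || i >= actual.length) {
        diffs.push(`${path}[${i}]`); // element missing on one side
      } else {
        diffs.push(...deepDiff(expected[i], actual[i], `${path}[${i}]`));
      }
    }
    return diffs;
  }
  if (isObject(expected) && isObject(actual)) {
    const keys = new Set([...Object.keys(expected), ...Object.keys(actual)]);
    const diffs: string[] = [];
    for (const key of Array.from(keys).sort()) {
      if (!(key in expected) || !(key in actual)) {
        diffs.push(`${path}.${key}`); // key missing on one side
      } else {
        diffs.push(...deepDiff(expected[key], actual[key], `${path}.${key}`));
      }
    }
    return diffs;
  }
  // Primitives, null, or mismatched kinds (e.g. array vs object).
  return expected === actual ? [] : [path];
}

console.log(deepDiff({ a: 1, b: [1, 2] }, { a: 1, b: [1, 3] }));
```

Because object keys are compared by name rather than position, key order does not affect the result in this sketch.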

evaluators:
  - type: exact-diff
    name: fields-match
  - type: llm-judge
    name: summary-quality
    passThreshold: 0.85
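The configuration above covers exact-diff and llm-judge. For custom-script, a scoring function might look like the following sketch. Only the required `: number` return annotation comes from this page; the function name, the `(actual, expected)` parameter shape, and the fraction-of-matching-fields rule are assumptions chosen for illustration.

```typescript
// Hypothetical signature: this page does not specify the arguments a
// custom-script function receives, so `actual` and `expected` are assumed.
type JsonObject = Record<string, unknown>;

// Scores a run as the fraction of expected top-level fields that the
// actual output reproduces. Note the explicit `: number` return annotation.
function score(actual: JsonObject, expected: JsonObject): number {
  const keys = Object.keys(expected);
  if (keys.length === 0) {
    return 1; // nothing to check, so treat as a full match
  }
  let matched = 0;
  for (const key of keys) {
    // Simple structural comparison; nested objects must also have the
    // same key order for JSON.stringify to produce equal strings.
    if (JSON.stringify(actual[key]) === JSON.stringify(expected[key])) {
      matched += 1;
    }
  }
  return matched / keys.length;
}

console.log(score({ vendor: "Acme", total: 100 }, { vendor: "Acme", total: 120 }));
```

A partial-credit score like this suits outputs where some fields matter independently; when every field must match, exact-diff is the simpler choice.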

Experiments

An experiment runs a workflow version across the whole dataset and collects the scores in one place. Run two experiments on two versions and compare them to see what a change improved or broke.

eigenpal workflow experiment run <workflow-id>
eigenpal workflow experiment results <workflow-id> <experiment-id>
eigenpal workflow experiment compare <old-experiment-id> <new-experiment-id>

Some CLI output and older command help call this identifier a batchId. The REST API and SDKs call the same value an experimentId. See Evaluate a workflow for the end-to-end flow. When production review uncovers a new miss, use Review production runs to correct it and promote it into the dataset.
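To show the idea behind comparing two experiments, the sketch below classifies each example as improved, regressed, or unchanged from two sets of per-example scores. The `Scores` shape and the `compareScores` function are hypothetical; this page does not describe the output format of `experiment compare` or the structure of experiment results.

```typescript
// Hypothetical shape: example name -> numeric score from one experiment.
type Scores = Record<string, number>;

interface Comparison {
  improved: string[];
  regressed: string[];
  unchanged: string[];
}

// Compares per-example scores from an old and a new experiment.
// Examples that appear in only one experiment are skipped.
function compareScores(oldScores: Scores, newScores: Scores): Comparison {
  const result: Comparison = { improved: [], regressed: [], unchanged: [] };
  for (const name of Object.keys(oldScores).sort()) {
    if (!(name in newScores)) {
      continue;
    }
    const delta = newScores[name] - oldScores[name];
    if (delta > 0) {
      result.improved.push(name);
    } else if (delta < 0) {
      result.regressed.push(name);
    } else {
      result.unchanged.push(name);
    }
  }
  return result;
}

const before: Scores = { "invoice-001": 0.5, "invoice-002": 1, "invoice-003": 0.7 };
const after: Scores = { "invoice-001": 0.9, "invoice-002": 0.8, "invoice-003": 0.7 };

console.log(compareScores(before, after));
```

Looking at per-example movement rather than only the average makes it visible when a change fixes one case while breaking another.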
