1. Define evaluators
Evaluators score actual output against each example’s expected output. Put them in evaluators.yaml:
- exact-diff for output that should match exactly.
- llm-judge for fuzzy correctness, with a passThreshold.
- custom-script for correctness you can compute in JavaScript.
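As a rough illustration of the kind of check a custom-script evaluator could compute, the sketch below scores an actual output by the fraction of expected fields it reproduces. The function name `scoreFields`, its `(actual, expected)` arguments, and the `{ score, mismatched }` return shape are all assumptions made for this example; the contract the tool actually expects from a script is not shown here.

```javascript
// Hypothetical scorer: name, arguments, and return shape are illustrative
// assumptions, not the tool's documented script contract.
function scoreFields(actual, expected) {
  const got = actual ?? {};
  const keys = Object.keys(expected);
  // Nothing to check means nothing can mismatch.
  if (keys.length === 0) return { score: 1, mismatched: [] };

  // Compare each expected field by its JSON serialization so nested
  // values are compared by content rather than by reference.
  const mismatched = keys.filter(
    (key) => JSON.stringify(got[key]) !== JSON.stringify(expected[key])
  );
  return {
    score: (keys.length - mismatched.length) / keys.length,
    mismatched,
  };
}

const result = scoreFields(
  { city: "Paris", population: 2100000, country: "FR" },
  { city: "Paris", population: 2200000, country: "FR" }
);
console.log(result.score.toFixed(2), result.mismatched);
```

A graded score like this sits between the two built-in options: stricter than a judge's opinion, more forgiving than an exact match.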
2. Run an experiment
An experiment runs the workflow across the whole dataset and collects every score in one batch.
The --wait flag blocks until the batch finishes and exits non-zero if any example failed, so it works in CI.
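The gating rule described for --wait (exit non-zero if any example failed) can be sketched as a small function. This is only a conceptual model: the `{ id, passed }` result shape and the `exitCodeFor` name are invented for illustration and are not the tool's real output format or implementation.

```javascript
// Hypothetical batch result shape: { id, passed }. Not the tool's real output.
function exitCodeFor(results) {
  const failed = results.filter((r) => !r.passed);
  // Report each failure so a CI log shows which examples broke the build.
  for (const r of failed) console.error(`FAILED: ${r.id}`);
  // Zero only when every example passed; any failure yields a non-zero code.
  return failed.length === 0 ? 0 : 1;
}

const batch = [
  { id: "example-1", passed: true },
  { id: "example-2", passed: false },
];
const code = exitCodeFor(batch);
// A CI wrapper script would typically hand this value to process.exit(code).
```

Because CI systems treat any non-zero exit status as a failed step, a single failing example is enough to stop the pipeline.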
3. Compare versions
Change the workflow, push a new version, run a second experiment, then diff the two batches.
Tips
- llm-judge scores have variance. Average across runs or raise passThreshold deliberately rather than chasing single-run noise.
- Keep expected outputs in the dataset so exact-diff and llm-judge have ground truth to compare against.
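The averaging advice for llm-judge can be made concrete with a short sketch. It assumes each run yields one numeric score per example on the same scale as passThreshold; the `summarize` helper and the sample scores are hypothetical and not part of the tool.

```javascript
// Assumption: one numeric llm-judge score per run, on the same scale as
// passThreshold. Helper names and sample values are illustrative only.
function mean(values) {
  if (values.length === 0) throw new Error("need at least one score");
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function summarize(scoresByRun, passThreshold) {
  const avg = mean(scoresByRun);
  // Population standard deviation, as a rough measure of run-to-run noise.
  const variance = mean(scoresByRun.map((s) => (s - avg) ** 2));
  return {
    mean: avg,
    stdDev: Math.sqrt(variance),
    passes: avg >= passThreshold,
  };
}

// Three runs of the same example against a threshold of 0.7.
const summary = summarize([0.5, 0.75, 1.0], 0.7);
console.log(summary);
```

Here the first run alone (0.5) would fail the threshold while the averaged score passes, which is the single-run noise the tip warns about. A large standard deviation is a signal to collect more runs before drawing conclusions from a version comparison.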