Datasets
A dataset is a set of examples. Each example is an input (the arguments and files the workflow receives) plus, optionally, the expected output (the ground-truth result). Datasets are portable folder archives, so you can build them locally and push them, or pull and edit them.
Evaluators
An evaluator scores a run’s actual output against the example’s expected output and returns either a pass/fail result or a numeric score. Define evaluators in evaluators.yaml. Three types are built in:
exact-diff does a JSON deep-diff against the expected output. Use it when the output should match exactly.
llm-judge uses an LLM to score the output against a rubric, with a pass threshold. Use it when “correct” is fuzzy (summaries, free text).
custom-script runs a JavaScript scoring function in the sandbox. Use it when correctness is a calculation you can write.
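To make the difference between a pass/fail check and a numeric score concrete, here is a minimal sketch in plain JavaScript. It is an illustration only, not the product's implementation. The example shape, the field names, and the functions jsonDeepEqual, exactMatch, and fieldAccuracy are all hypothetical. The signature that custom-script actually expects is not given here, so treat the function shapes as assumptions.

```javascript
// Hypothetical shape of one dataset example: an input plus an optional expected output.
// The field names are made up for illustration.
const example = {
  input: { args: { invoiceId: "inv-001" }, files: ["invoice.pdf"] },
  expected: { total: 120.5, currency: "USD", lineItems: [{ sku: "A1", qty: 2 }] },
};

// Deep equality over JSON values (objects, arrays, primitives, null).
// Object key order is ignored; array order is significant.
function jsonDeepEqual(a, b) {
  if (a === b) return true;
  if (a === null || b === null) return false;
  if (typeof a !== "object" || typeof b !== "object") return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    (k) => Object.prototype.hasOwnProperty.call(b, k) && jsonDeepEqual(a[k], b[k])
  );
}

// Pass/fail check in the spirit of an exact comparison:
// passes only when the whole output matches the expected output.
function exactMatch(actual, expected) {
  return { pass: jsonDeepEqual(actual, expected) };
}

// Numeric score in the spirit of a custom calculation:
// the fraction of expected top-level fields that the actual output reproduces.
function fieldAccuracy(actual, expected) {
  const keys = Object.keys(expected);
  if (keys.length === 0) return 1;
  if (actual === null || typeof actual !== "object") return 0;
  const matched = keys.filter((k) => jsonDeepEqual(actual[k], expected[k])).length;
  return matched / keys.length;
}

// A run whose output gets the currency wrong.
const actual = { total: 120.5, currency: "EUR", lineItems: [{ sku: "A1", qty: 2 }] };
console.log(exactMatch(actual, example.expected));
console.log(fieldAccuracy(actual, example.expected));
```

The exact comparison fails this run outright, while the field-level score gives partial credit for the two fields that match. That trade-off is the usual reason to choose between an all-or-nothing check and a calculated score.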