Use the evaluation APIs to manage datasets, run examples, start experiments, and inspect evaluator scores from Python.

Dataset examples

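# `client` is assumed to be an already configured SDK client.
# List the dataset examples stored for "workflows.extract-invoice".
examples = client.automations.examples.list("workflows.extract-invoice")

# Create a named example with an input and its expected JSON output.
example = client.automations.examples.create(
    "workflows.extract-invoice",
    {
        "name": "acme-invoice",
        "input": {"language": "en"},
        "expected": {"vendor": "Acme"},
    },
)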
Examples contain input, expected JSON output, expected files, metadata, and optional overrides. For file-heavy datasets, prefer archive import/export so the folder layout stays portable.

Dataset archives use the canonical examples/<name>/input.json layout with file references such as { "$file": "input/contract.pdf" }. Legacy archives with manifest.json, input/arguments.json, expected/output.json, or expected/error.json are rejected; export a fresh ZIP before re-importing.
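The archive layout described above can be sketched locally with the standard library. This is a minimal illustration, not the SDK's import/export code: build_archive and find_legacy_entries are hypothetical helper names, the placement of referenced files under the example folder is an assumption, and the legacy check is only a heuristic pre-check based on the file names listed above.

```python
import io
import json
import zipfile

# File names the text above lists as legacy; archives containing them are rejected.
LEGACY_MARKERS = (
    "manifest.json",
    "input/arguments.json",
    "expected/output.json",
    "expected/error.json",
)


def build_archive(name: str, input_json: dict, files: dict) -> bytes:
    """Build an in-memory ZIP using the examples/<name>/input.json layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f"examples/{name}/input.json", json.dumps(input_json, indent=2))
        for rel_path, data in files.items():
            # Assumption: "$file" paths resolve relative to the example folder.
            zf.writestr(f"examples/{name}/{rel_path}", data)
    return buf.getvalue()


def find_legacy_entries(archive: bytes) -> list:
    """Return member paths whose names match a legacy-layout file (heuristic)."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return [
            member
            for member in zf.namelist()
            if any(member == m or member.endswith("/" + m) for m in LEGACY_MARKERS)
        ]


archive = build_archive(
    "acme-invoice",
    {"language": "en", "contract": {"$file": "input/contract.pdf"}},
    {"input/contract.pdf": b"%PDF-1.4 placeholder"},
)
legacy = find_legacy_entries(archive)  # empty list means no legacy markers found
```

Running a check like this before upload only saves a round trip; the service remains the authority on which archives it accepts.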

Experiments

# Start an experiment over the example created above.
experiment = client.automations.experiments.create(
    "workflows.extract-invoice",
    {"examples": [example["id"]]},
)

# Fetch the experiment by its ID.
detail = client.automations.experiments.get(
    "workflows.extract-invoice",
    experiment["id"],
)
An experiment runs dataset examples and records automated evaluator scores. Older CLI docs may call the same ID a batchId; API and SDK methods call it an experimentId.
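When scripts consume records from both older CLI output and the API, the two names for the ID can be normalized in one place. The sketch below assumes such records are plain dictionaries carrying either key; experiment_id_from is a hypothetical helper, not an SDK function.

```python
def experiment_id_from(record: dict) -> str:
    """Return the experiment ID from a record that uses either naming.

    Assumption: the record is a dict keyed by "experimentId" (API/SDK naming)
    or "batchId" (older CLI naming). The newer name wins if both are present.
    """
    for key in ("experimentId", "batchId"):
        value = record.get(key)
        if value:
            return str(value)
    raise KeyError("record has neither 'experimentId' nor 'batchId'")


new_style = experiment_id_from({"experimentId": "exp_123"})
old_style = experiment_id_from({"batchId": "exp_123"})
```

Both calls return the same string, so downstream code can pass the result to experiments.get without caring where the record came from.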

Scores vs feedback

Evaluator score values are automated results. Run feedback rating values are human review verdicts (pass, fail, or partial). Use the run feedback endpoints when a human corrects a real run or promotes it into the dataset.
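One way to keep the two concepts apart in client code is to model them as separate types. The following is a sketch under assumptions: the class and field names are hypothetical, and treating a score as a float is a guess; only the three rating values come from the text above.

```python
from dataclasses import dataclass

# The three human verdicts named above.
ALLOWED_RATINGS = frozenset({"pass", "fail", "partial"})


@dataclass(frozen=True)
class EvaluatorScore:
    """Automated result produced by an evaluator (field names are hypothetical)."""
    evaluator: str
    value: float  # assumption: scores are numeric


@dataclass(frozen=True)
class RunFeedback:
    """Human review verdict attached to a real run (field names are hypothetical)."""
    run_id: str
    rating: str

    def __post_init__(self) -> None:
        # Reject anything outside the documented verdicts.
        if self.rating not in ALLOWED_RATINGS:
            raise ValueError(f"rating must be one of {sorted(ALLOWED_RATINGS)}")


score = EvaluatorScore(evaluator="vendor-match", value=1.0)
feedback = RunFeedback(run_id="run_1", rating="partial")
```

Separate types make it harder to report a human verdict as if it were an automated score, or the reverse.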