This guide covers what happens after a workflow passes evaluation: integrating it into your application, safely introducing production traffic, and keeping humans in the loop until you are confident enough to automate fully. The same flow applies to agents; both run through the unified automation surface. Throughout this guide, “automation” refers to a deployed workflow or agent running in production. You author the two differently, but you integrate, run, and review them the same way. The goal is not a one-time launch. It is a continuous improvement loop that turns production failures into regression tests, so the workflow becomes more reliable over time instead of drifting.

The lifecycle

1. Pass the eval. Do not deploy a workflow you have not measured. Start from a passing experiment against a representative dataset.
2. Integrate behind your app. Enable the API trigger and call POST /api/v1/runs from your application. Nothing acts on the output yet.
3. Route production traffic. Shadow first, then ramp a growing fraction of real requests, or cut straight to 100% if the eval and a short shadow period hold up.
4. Sample runs for review. Sample 10 to 15% of production runs for human review. Lower the rate as confidence grows, down to 5%, then to full automation.
5. Turn misses into test cases. Every reviewed miss becomes a corrected dataset example. Fix the workflow, confirm the example passes, and the loop tightens.

There is no requirement to reach full automation. Some teams keep a permanent review step for high-risk workflows; others automate 100% of traffic once the dataset is broad enough. Pick the end state that matches your risk tolerance.

Prerequisite: a workflow you trust

Production starts from a passing evaluation, not from a workflow that happened to work on a few examples. If you have not built a dataset and evaluators yet, do that first. A workflow is ready to integrate when an experiment across the dataset clears your pass threshold and you have compared the current version against the previous one.
eigenpal workflow experiment run <workflow-id> --wait
eigenpal workflow experiment compare <old-experiment-id> <new-experiment-id>

1. Integrate behind your application

Enable the API trigger

By default a workflow accepts runs from the dashboard only. Runs started from your own code, whether through the API or the CLI, require the API trigger to be enabled: when it is off, POST /api/v1/runs returns 403 with the error code api_trigger_disabled. Enable the API trigger in the workflow settings in the dashboard.
Trigger state is readable from the API (Get automation triggers), but turning a trigger on or off is a dashboard action today, not a public API mutation.

Start a run

Target the automation by workflows.<slug> (or agents.<slug>). For short jobs, ask the server to hold the request until the run finishes; for longer jobs, start the run and poll.
// Short job: wait inline (seconds, not minutes).
const result = await client.run('workflows.extract-statement', input, {
  waitForCompletion: 60,
});

// Longer job: start and poll.
const { id } = await client.run('workflows.extract-statement', input);
let run;
do {
  await new Promise((r) => setTimeout(r, 2000));
  run = await client.runs.get(id);
} while (!run.finished);
Every run is stored with its inputs, outputs, model versions, and trace, whether or not anything downstream consumes the result. That record is what makes the next two stages possible. Reference: Start a run and List runs, or the runs SDK reference for the typed client.
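The polling loop above never gives up, so a stuck run would block the caller indefinitely. A minimal sketch of a bounded poll follows. It assumes only that a run exposes the finished flag used above; the helper name pollUntilFinished, the RunLike shape, and the default interval and timeout are illustrative and not part of the SDK.

```typescript
// Hypothetical minimal run shape: only `finished` appears in this guide.
interface RunLike {
  finished: boolean;
}

// Poll `getRun` until the run reports finished, or throw once `timeoutMs` would be exceeded.
async function pollUntilFinished<R extends RunLike>(
  getRun: () => Promise<R>,
  opts: { intervalMs?: number; timeoutMs?: number } = {},
): Promise<R> {
  const intervalMs = opts.intervalMs ?? 2000; // same 2 s cadence as the loop above
  const timeoutMs = opts.timeoutMs ?? 10 * 60 * 1000; // assumed default: 10 minutes
  const deadline = Date.now() + timeoutMs;

  for (;;) {
    const run = await getRun();
    if (run.finished) return run;
    // Stop before sleeping if the next poll would land past the deadline.
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`run did not finish within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

With the client from the example above, the call would look like pollUntilFinished(() => client.runs.get(id)). Passing the getter as a function keeps the helper independent of the client, which also makes it easy to test with a fake.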

2. Route production traffic

EigenPal does not split or ramp traffic for you. Your application decides which requests reach the automation and what happens to the output. There are three common patterns, in increasing order of trust. Most teams start with shadow mode, then gradually ramp traffic as confidence grows.
  • Shadow. Send real inputs to the workflow but ignore the output in your product. Compare it against whatever process you run today (manual entry, an existing system, a previous model). Because every run is stored, you can review shadow output later with the same tools you use for live runs. Nothing is at risk while you build confidence on production-shaped data.
  • Ramp. Route a growing fraction of real requests to the automation and act on its output: 1%, then 5%, 25%, 100%. Keep the split behind a feature flag so you can roll back instantly if reviews uncover a regression.
  • Cut over. If the eval is strong and a short shadow period looks clean, route 100% from the start. This is reasonable for low-risk workflows or where a human reviews output before it is used anyway.
Whichever pattern you pick, keep a review sample running. At 100% automation, sampled review is your only signal that the workflow still matches reality.
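Because the split lives in your application, one way to implement a ramp is deterministic bucketing: hash a stable request key into a fixed range and compare it with the current ramp percentage. The sketch below is an illustration under assumptions, not an EigenPal feature. The names fnv1a, bucketOf, chooseRoute, and the Route values are hypothetical, and rampPercent would normally come from your feature flag system.

```typescript
type Route = "existing" | "shadow" | "automation";

// FNV-1a-style 32-bit hash over UTF-16 code units: small and stable, not cryptographic.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Map a stable key (customer id, document id, ...) to a bucket in [0, 10000).
function bucketOf(key: string): number {
  return fnv1a(key) % 10000;
}

// Decide what to do with one request.
// - "automation": call the automation and act on its output.
// - "shadow": call the automation but keep using the existing process.
// - "existing": do not call the automation at all.
function chooseRoute(key: string, rampPercent: number, shadowRest: boolean): Route {
  const clamped = Math.min(100, Math.max(0, rampPercent));
  if (bucketOf(key) < clamped * 100) return "automation";
  return shadowRest ? "shadow" : "existing";
}
```

Since a key always maps to the same bucket, raising the percentage from 5 to 25 only adds keys to the automation group and never moves an already-ramped key back. Setting the percentage to 0 is the rollback.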

3. Sample runs for review

Keep a sampled percentage of production runs for human review. Review turns previously unseen production inputs into dataset examples instead of letting failures become silent errors.
Why review production runs? Evaluations tell you whether the workflow still passes the cases you already know about. Human review uncovers failure modes that are not yet represented in your dataset.
Run the review as a sampling rate you lower over time:
  • Start at 10 to 15%. Enough volume to catch layout and edge-case misses early, while the workflow is newest in production.
  • Lower to 5% once a few review cycles pass without surprises.
  • Go to full automation when the dataset covers the long tail and review turns up nothing new. Keep a small ongoing sample if the input distribution drifts (new customers, new document formats).
The mechanics: a reviewed run carries feedback, which has a rating (the human verdict: pass, fail, or partial) and a status (the review lifecycle: open, resolved, or ignored). A sampled run with no verdict yet is unreviewed. A run you flag for follow-up is open, and it moves to resolved once the underlying miss is fixed and the example passes again. That open-to-resolved transition is what forces continuous improvement: a failing run stays on the board until the workflow handles it.
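The rating and status values above can be modeled as a small type, which helps when you track the backlog in your own tooling. This is a sketch under assumptions: the SampledRun and Feedback shapes and the reviewState and backlog helpers are hypothetical, and it treats a run with no feedback object as unreviewed, which may differ from how the API represents that state.

```typescript
type Rating = "pass" | "fail" | "partial";
type Status = "open" | "resolved" | "ignored";

// Hypothetical local shapes; field names mirror the terms used in this guide.
interface Feedback {
  rating?: Rating;
  status: Status;
}

interface SampledRun {
  id: string;
  feedback?: Feedback;
}

type ReviewState = "unreviewed" | Status;

// Assumption: a sampled run without any feedback has not been reviewed yet.
function reviewState(run: SampledRun): ReviewState {
  return run.feedback ? run.feedback.status : "unreviewed";
}

// Runs that still need attention: nobody has looked, or a miss is not fixed yet.
function backlog(runs: SampledRun[]): SampledRun[] {
  return runs.filter((run) => {
    const state = reviewState(run);
    return state === "unreviewed" || state === "open";
  });
}
```

A run with rating fail and status open stays in the backlog until its status moves to resolved, which matches the open-to-resolved transition described above.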

4. Review runs

You can review runs from the API (good for routing review into your own tools or an internal queue) or from the dashboard.

Via the API

Find runs that still need review. Filter the run list by feedback state. To pull completed runs that no one has reviewed yet:
GET /api/v1/runs?type=workflow&source=workflows.extract-statement\
&status=completed&hasFeedback=false&limit=100
The run list accepts feedback filters so you can build a review queue: hasFeedback, noFeedback, feedbackStatus=open|resolved|ignored, feedbackRating=pass|fail|partial|none, hasExpected, promotedToExample, and sinceLastResolved (only runs created after the most recent resolved review). Feedback filters use offset pagination, not cursors.
# The open review backlog for one workflow.
GET /api/v1/runs?source=workflows.extract-statement&feedbackStatus=open
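If you assemble these queries in code, building the URL from a typed object avoids hand-concatenating filters. The following sketch uses only parameters named in this section, except offset, whose name is an assumption inferred from the statement that feedback filters use offset pagination. The ReviewQueueQuery type and buildRunsUrl helper are hypothetical.

```typescript
interface ReviewQueueQuery {
  source: string;
  type?: string;
  status?: string;
  hasFeedback?: boolean;
  feedbackStatus?: "open" | "resolved" | "ignored";
  feedbackRating?: "pass" | "fail" | "partial" | "none";
  limit?: number;
  offset?: number; // assumed parameter name; check the List runs reference
}

// Build a List runs URL, skipping filters that were not set.
function buildRunsUrl(baseUrl: string, query: ReviewQueueQuery): string {
  const params = new URLSearchParams();
  for (const key of Object.keys(query) as (keyof ReviewQueueQuery)[]) {
    const value = query[key];
    if (value !== undefined) params.set(key, String(value));
  }
  return `${baseUrl}/api/v1/runs?${params.toString()}`;
}

// The open review backlog for one workflow, as in the example above.
const openBacklogUrl = buildRunsUrl("https://studio.eigenpal.com", {
  source: "workflows.extract-statement",
  feedbackStatus: "open",
});
```

The resulting URL can be passed to any HTTP client together with the Authorization: Bearer header used in the curl examples in this guide.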
Reference: List runs for the full set of query parameters.
Record a pass. When the output is correct, mark it resolved so it leaves the queue.
curl -X PUT https://studio.eigenpal.com/api/v1/runs/$RUN_ID/feedback \
  -H "Authorization: Bearer $EIGENPAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"rating":"pass","status":"resolved"}'
Reference: Update run feedback.
Record a fail and correct it. When the output is wrong, set the rating to fail, leave the status open, and attach the corrected output. expected is the corrected JSON; for corrected files, post them to the expected-artifacts endpoint.
# Corrected structured output.
curl -X PUT https://studio.eigenpal.com/api/v1/runs/$RUN_ID/feedback \
  -H "Authorization: Bearer $EIGENPAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"rating":"fail","status":"open","body":"Dropped one transaction","expected":{...}}'

# Corrected file (multipart). Or copy an existing run output file as expected.
curl -X POST https://studio.eigenpal.com/api/v1/runs/$RUN_ID/feedback/expected \
  -H "Authorization: Bearer $EIGENPAL_API_KEY" \
  -F 'file=@corrected.csv'
Reference: Update run feedback for the JSON verdict and expected output, and Add expected file for corrected files.
Promote the corrected run into the dataset. This is the step that turns a miss into a permanent test. Promotion copies the run’s input, output, and the expected output and files from feedback into a dataset example on the same automation.
curl -X POST https://studio.eigenpal.com/api/v1/runs/$RUN_ID/promote \
  -H "Authorization: Bearer $EIGENPAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name":"statement-layout-edge"}'
Reference: Promote run to example.
Attach the corrected output before you promote. Promotion copies whatever expected output and files are on the run’s feedback at that moment; promoting before correcting produces an example with no ground truth.
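When review is routed through your own tooling, the ordering rule above can be enforced in code by only promoting after the feedback update has succeeded. The sketch below assumes the two endpoints and request bodies shown in the curl examples. The Http function type and the correctAndPromote helper are hypothetical; in practice Http would wrap your HTTP client and add the Authorization header.

```typescript
// Hypothetical transport: sends a JSON request and reports the HTTP outcome.
type Http = (
  method: "PUT" | "POST",
  path: string,
  body: unknown,
) => Promise<{ ok: boolean; status: number }>;

// Record a failed review with its corrected output, then promote the run.
// Promotion is skipped if the correction could not be saved.
async function correctAndPromote(
  http: Http,
  runId: string,
  correction: { note: string; expected: unknown },
  exampleName: string,
): Promise<void> {
  const feedback = await http("PUT", `/api/v1/runs/${runId}/feedback`, {
    rating: "fail",
    status: "open",
    body: correction.note,
    expected: correction.expected,
  });
  if (!feedback.ok) {
    throw new Error(`feedback update failed with HTTP ${feedback.status}`);
  }

  const promoted = await http("POST", `/api/v1/runs/${runId}/promote`, {
    name: exampleName,
  });
  if (!promoted.ok) {
    throw new Error(`promotion failed with HTTP ${promoted.status}`);
  }
}
```

Throwing before the promote call means a failed correction can never produce a dataset example without ground truth.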

In the dashboard

The Runs view provides the same review workflow as a queue, without building your own review tooling. Scope it to one automation, turn on Sample at a review rate (for example, 15%), and the list flags a sampled fraction of runs for review. Each run carries a review chip: Unreviewed, Open, Passed, or Failed. Filter the list by that chip to work through what still needs a verdict.
Figure: Runs review view with sampling, review status chips, and inline output correction.
Select a run to review its output. Mark it pass (thumbs up) or fail (thumbs down). On a fail, correct the output in place: edit a field in the JSON output (the old value is struck through, your correction is highlighted), add an optional note, and use Upload on an output file to attach a corrected file. Saving the correction sets the run’s expected output, which is what a dataset example needs.
Promote the corrected run to add it to the automation’s dataset. From then on the miss is a regression test: change the workflow, and the example must pass before the run leaves the Open state.
The dashboard and the API in the previous section operate on the same objects. The pass/fail thumbs set feedback rating, the review chips reflect feedback status, editing the JSON or uploading a file sets the expected output, and promoting adds it to the dataset. Use whichever fits your team.

5. Close the loop

Reviewing without fixing issues only creates backlog. Every open review should lead to a workflow improvement that is verified against the dataset, the same dataset the corrected run just joined.
1. Reproduce. Run eigenpal runs trace <run-id> and inspect the trace of the failing run to find which step produced the wrong output.
2. Fix. Adjust the prompt, schema, model, or step logic. Push a new workflow version.
3. Confirm. Re-run the experiment. The promoted example is now part of the dataset, so a passing experiment means the specific miss cannot recur unnoticed.
4. Resolve. Mark the run’s review resolved. It leaves the open queue.
eigenpal runs trace <run-id>
# fix workflow.yaml, then:
eigenpal workflow push
eigenpal workflow experiment run <workflow-id> --wait
eigenpal runs feedback update <run-id> --status resolved
This is the same improvement loop described in How it works and Eval-first development, now driven by sampled production traffic instead of a fixed test set. With each cycle the dataset gets closer to the real input distribution, so the review rate can drop and you can automate a larger share of production traffic.

Before you ship

1. The eval passes. An experiment clears your threshold and beats the previous version.
2. The API trigger is enabled. The automation accepts API and CLI runs.
3. Your integration handles the run lifecycle. It calls POST /api/v1/runs and waits for or polls the result.
4. A traffic pattern is chosen. Shadow, ramp, or cut over, with a feature flag you can roll back.
5. A review sample is running. Start at 10 to 15% of production runs, with a plan to lower it.
6. Misses are corrected, not just flagged. Reviewed failures are corrected and promoted into the dataset, then driven to resolved through workflow fixes.