
Test Runs

A test run executes all enabled test cases in a test suite and produces a detailed report of results. Use test runs to validate agent behavior before deploying prompt changes.

Running a Test Suite

  1. Navigate to Testing and open a test suite
  2. Click Run Tests
  3. The test run begins immediately

Progress View

While the suite is running, you can watch progress in real time:

  • Each test case is listed with its current status: Pending, Running, or Complete
  • Cases execute sequentially, one after another
  • The currently running case shows a live progress indicator
  • As each case finishes, it displays a Pass or Fail badge

You can leave the page and return later. The run continues in the background, and results are saved when complete.

Reviewing Results

After the run finishes, the results page shows:

Summary

  • Total cases — Number of test cases that were executed
  • Passed — Number of cases that met their success criteria
  • Failed — Number of cases that did not meet their success criteria
  • Pass rate — Percentage of cases that passed (e.g., “8/10 — 80%”)
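The summary numbers above can be reproduced from raw pass/fail results, which is handy when post-processing exported data. A minimal sketch; the `summarize` helper is hypothetical, not part of the product:

```python
# Hypothetical helper: compute the pass-rate summary shown above from a
# list of per-case pass/fail results. Illustrative only, not a real API.
def summarize(results: list[bool]) -> str:
    """Return a pass-rate string like '8/10 — 80%'."""
    total = len(results)
    passed = sum(results)
    rate = round(100 * passed / total) if total else 0
    return f"{passed}/{total} — {rate}%"

print(summarize([True] * 8 + [False] * 2))  # → 8/10 — 80%
```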

Per-Case Results

Each test case shows:

  • Status — Pass or Fail
  • Score — A numerical score from 0-100 assigned by the AI evaluator
  • Transcript — Full conversation between the simulated caller and your agent
  • Evaluation Notes — The AI evaluator’s explanation of why the case passed or failed
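If you export results for further analysis, the per-case fields above map naturally onto a small record type. A hypothetical sketch; the `CaseResult` name and field names are illustrative, not a real export format:

```python
from dataclasses import dataclass

@dataclass
class CaseResult:
    # Mirrors the per-case fields listed above; illustrative only.
    status: str            # "Pass" or "Fail"
    score: int             # 0-100, assigned by the AI evaluator
    transcript: str        # full caller/agent conversation
    evaluation_notes: str  # evaluator's explanation of the outcome

result = CaseResult(
    status="Pass",
    score=92,
    transcript="...",
    evaluation_notes="Met all success criteria.",
)
print(result.status, result.score)  # → Pass 92
```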

Failed Case Details

When a test case fails, the evaluation notes explain specifically what went wrong:

  • Which success criteria were not met
  • Where in the conversation the issue occurred
  • What the agent said versus what was expected

This detail makes it straightforward to identify what to fix in the agent’s prompt.

Filtering Results

Use the filter options above the results list:

  • All — Show all test cases
  • Passed — Show only cases that passed
  • Failed — Show only cases that failed

When triaging a run with many failures, filter to “Failed” to focus on what needs attention.

Comparing Runs

To track improvements over time, compare results across runs:

  1. Open the test suite
  2. Scroll to the Run History section
  3. Each row shows the run date, pass rate, and a link to the full results

By comparing consecutive runs, you can see whether prompt changes improved, worsened, or had no effect on test outcomes. If a case that previously passed is now failing, you have a regression that needs investigation.
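The regression check described above is easy to automate once you have two runs’ results in hand. A minimal sketch, assuming each run is available as a mapping of case name to pass/fail; `find_regressions` is a hypothetical helper, not a product API:

```python
# Hypothetical sketch: find regressions between two runs, given each run
# as a mapping of case name -> passed (True/False). Names are illustrative.
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Cases that passed in the previous run but fail in the current one."""
    return sorted(
        name for name, passed in previous.items()
        if passed and current.get(name) is False
    )

prev = {"greeting": True, "refund": True, "escalation": False}
curr = {"greeting": True, "refund": False, "escalation": True}
print(find_regressions(prev, curr))  # → ['refund']
```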

Interpreting Scores

The AI evaluator assigns each test case a score from 0 to 100:

  • 90-100 — Excellent. The agent handled the scenario precisely as intended.
  • 70-89 — Good. The agent mostly handled it well but had minor issues.
  • 50-69 — Fair. The agent completed the scenario but with noticeable problems.
  • Below 50 — Poor. The agent failed to handle the scenario adequately.

The pass/fail status is determined by the success criteria, not the score alone. A case can pass its keyword_match criteria but still receive a mediocre score if the conversation quality was poor.
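The score bands above amount to a simple threshold mapping, which can be useful when bucketing exported scores. A hypothetical sketch; `score_label` is illustrative, not part of the product:

```python
# Hypothetical mapping of the score bands documented above to their labels.
def score_label(score: int) -> str:
    if score >= 90:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Fair"
    return "Poor"

print(score_label(85))  # → Good
```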

Best Practices

Run After Every Prompt Change

Prompt changes can have unexpected side effects. A tweak that fixes one scenario may break another. Always re-run the full suite after modifications.

Investigate Flaky Tests

If a test case passes sometimes and fails other times without any prompt changes, the case may be too sensitive to natural conversation variation. Consider:

  • Making the success criteria less strict
  • Using custom criteria that allow for reasonable variation
  • Increasing the max turns to give the agent more room
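One way to confirm a case is flaky is to repeat it several times and measure how often its outcome deviates from the majority. A hedged sketch, where `run_case` stands in for however your setup executes a single test case:

```python
# Hypothetical sketch: estimate flakiness by repeating one case several
# times. `run_case` is a stand-in callable returning True (pass) or False.
def flakiness(run_case, attempts: int = 5) -> float:
    """Fraction of attempts whose outcome disagrees with the majority."""
    outcomes = [run_case() for _ in range(attempts)]
    majority = outcomes.count(True) >= outcomes.count(False)
    return outcomes.count(not majority) / attempts

# Simulated case that fails on one of five attempts.
calls = iter([True, True, False, True, True])
print(flakiness(lambda: next(calls)))  # → 0.2
```

A flakiness of 0 means the case is stable; anything above 0 suggests the criteria are too sensitive to natural conversation variation.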

Keep a Baseline

When your agent is performing well, note the run date and pass rate as your baseline. After future changes, compare back to this baseline to make sure you have not regressed.

💡 Read the transcript of failed cases carefully. Often the issue is not that the agent said something wrong, but that the simulated caller’s scenario triggered an unexpected conversation path. You may need to adjust the test case, the agent’s prompt, or both.
