
Test Runs

A test run executes all enabled test cases in a test suite and produces a detailed report of results. Use test runs to validate agent behavior before deploying prompt changes.

Running a Test Suite

  1. Navigate to Testing and open a test suite
  2. Click Run Tests
  3. The test run begins immediately

Progress View

While the suite is running, you can watch progress in real time:

  • Each test case is listed with its current status: Pending, Running, or Complete
  • Cases execute sequentially, one after another
  • The currently running case shows a live progress indicator
  • As each case finishes, it displays a Pass or Fail badge

You can leave the page and return later. The run continues in the background, and results are saved when complete.

Reviewing Results

After the run finishes, the results page shows:

Summary

  • Total cases — Number of test cases that were executed
  • Passed — Number of cases that met their success criteria
  • Failed — Number of cases that did not meet their success criteria
  • Pass rate — Percentage of cases that passed (e.g., “8/10 — 80%”)
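The summary numbers above can be reproduced from raw pass/fail results, which is handy when post-processing exported data. A minimal sketch; the `summarize` helper is hypothetical, not part of the product:

```python
# Hypothetical helper: compute the pass-rate summary shown above from a
# list of per-case pass/fail results. Illustrative only, not a real API.
def summarize(results: list[bool]) -> str:
    """Return a pass-rate string like '8/10 — 80%'."""
    total = len(results)
    passed = sum(results)
    rate = round(100 * passed / total) if total else 0
    return f"{passed}/{total} — {rate}%"

print(summarize([True] * 8 + [False] * 2))  # → 8/10 — 80%
```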

Per-Case Results

Each test case shows:

  • Status — Pass or Fail
  • Score — A numerical score from 0-100 assigned by the AI evaluator
  • Transcript — Full conversation between the simulated caller and your agent
  • Evaluation Notes — The AI evaluator’s explanation of why the case passed or failed
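If you export results for further analysis, the per-case fields above map naturally onto a small record type. A hypothetical sketch; the `CaseResult` name and field names are illustrative, not a real export format:

```python
from dataclasses import dataclass

@dataclass
class CaseResult:
    # Mirrors the per-case fields listed above; illustrative only.
    status: str            # "Pass" or "Fail"
    score: int             # 0-100, assigned by the AI evaluator
    transcript: str        # full caller/agent conversation
    evaluation_notes: str  # evaluator's explanation of the outcome

result = CaseResult(
    status="Pass",
    score=92,
    transcript="...",
    evaluation_notes="Met all success criteria.",
)
print(result.status, result.score)  # → Pass 92
```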

Failed Case Details

When a test case fails, the evaluation notes explain specifically what went wrong:

  • Which success criteria were not met
  • Where in the conversation the issue occurred
  • What the agent said versus what was expected

This detail makes it straightforward to identify what to fix in the agent’s prompt.

Filtering Results

Use the filter options above the results list:

  • All — Show all test cases
  • Passed — Show only cases that passed
  • Failed — Show only cases that failed

When triaging a run with many failures, filter to “Failed” to focus on what needs attention.

Comparing Runs

To track improvements over time, compare results across runs:

  1. Open the test suite
  2. Scroll to the Run History section
  3. Each row shows the run date, pass rate, and a link to the full results

By comparing consecutive runs, you can see whether prompt changes improved, worsened, or had no effect on test outcomes. If a case that previously passed is now failing, you have a regression that needs investigation.
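The regression check described above is easy to automate once you have two runs’ results in hand. A minimal sketch, assuming each run is available as a mapping of case name to pass/fail; `find_regressions` is a hypothetical helper, not a product API:

```python
# Hypothetical sketch: find regressions between two runs, given each run
# as a mapping of case name -> passed (True/False). Names are illustrative.
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Cases that passed in the previous run but fail in the current one."""
    return sorted(
        name for name, passed in previous.items()
        if passed and current.get(name) is False
    )

prev = {"greeting": True, "refund": True, "escalation": False}
curr = {"greeting": True, "refund": False, "escalation": True}
print(find_regressions(prev, curr))  # → ['refund']
```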

Interpreting Scores

The AI evaluator assigns each test case a score from 0 to 100:

  • 90-100 — Excellent. The agent handled the scenario precisely as intended.
  • 70-89 — Good. The agent mostly handled it well but had minor issues.
  • 50-69 — Fair. The agent completed the scenario but with noticeable problems.
  • Below 50 — Poor. The agent failed to handle the scenario adequately.

The pass/fail status is determined by the success criteria, not the score alone. A case can pass its keyword_match criteria but still receive a mediocre score if the conversation quality was poor.
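The score bands above amount to a simple threshold mapping, which can be useful when bucketing exported scores. A hypothetical sketch; `score_label` is illustrative, not part of the product:

```python
# Hypothetical mapping of the score bands documented above to their labels.
def score_label(score: int) -> str:
    if score >= 90:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Fair"
    return "Poor"

print(score_label(85))  # → Good
```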

Best Practices

Run After Every Prompt Change

Prompt changes can have unexpected side effects. A tweak that fixes one scenario may break another. Always re-run the full suite after modifications.

Investigate Flaky Tests

If a test case passes sometimes and fails other times without any prompt changes, the case may be too sensitive to natural conversation variation. Consider:

  • Making the success criteria less strict
  • Using custom criteria that allow for reasonable variation
  • Increasing the max turns to give the agent more room
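One way to confirm a case is flaky is to repeat it several times and measure how often its outcome deviates from the majority. A hedged sketch, where `run_case` stands in for however your setup executes a single test case:

```python
# Hypothetical sketch: estimate flakiness by repeating one case several
# times. `run_case` is a stand-in callable returning True (pass) or False.
def flakiness(run_case, attempts: int = 5) -> float:
    """Fraction of attempts whose outcome disagrees with the majority."""
    outcomes = [run_case() for _ in range(attempts)]
    majority = outcomes.count(True) >= outcomes.count(False)
    return outcomes.count(not majority) / attempts

# Simulated case that fails on one of five attempts.
calls = iter([True, True, False, True, True])
print(flakiness(lambda: next(calls)))  # → 0.2
```

A flakiness of 0 means the case is stable; anything above 0 suggests the criteria are too sensitive to natural conversation variation.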

Keep a Baseline

When your agent is performing well, note the run date and pass rate as your baseline. After future changes, compare back to this baseline to make sure you have not regressed.

💡 Read the transcript of failed cases carefully. Often the issue is not that the agent said something wrong, but that the simulated caller’s scenario triggered an unexpected conversation path. You may need to adjust the test case, the agent’s prompt, or both.
