
A/B Experiments

A/B Experiments let you test different versions of your agent’s prompt against live caller traffic. Instead of guessing which prompt performs better, you can measure it with real data.

A/B Experiments require a Growth plan or higher.

How Experiments Work

  1. You create an experiment with two or more prompt variants
  2. Live calls to the agent are randomly distributed across variants based on traffic weights you define
  3. Each variant’s performance is tracked independently
  4. After enough calls, you compare results and apply the winning variant

The agent’s callers are unaware that an experiment is running. Each caller interacts with one variant for the duration of their call.
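The platform handles variant routing internally, but conceptually it is weighted random selection. Here is a minimal sketch, assuming illustrative variant names and weights (these are not a real API):

```python
import random

# Hypothetical traffic split: variant name -> percentage of calls.
# The control keeps 50% while two test variants share the rest.
variants = {"control": 50, "friendly_opener": 30, "concise_pitch": 20}

def pick_variant(variants: dict[str, int]) -> str:
    """Pick one variant for an incoming call, weighted by traffic split.

    Each caller is assigned exactly once and stays on that variant
    for the duration of the call.
    """
    names = list(variants)
    weights = list(variants.values())
    return random.choices(names, weights=weights, k=1)[0]
```

Over many calls, the observed distribution converges to the configured weights, which is why each variant needs enough traffic before its metrics mean anything.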

Creating an Experiment

  1. Navigate to Testing → Experiments in the sidebar
  2. Click Create Experiment
  3. Fill in the details:
| Field | Description |
| --- | --- |
| **Name** | A descriptive name (e.g., “Sales opener - friendly vs professional”) |
| **Description** | What you are testing and why |
| **Agent** | The agent to run the experiment on |
| **Goal** | The metric you want to optimize. See Goals below. |

  4. Click Create

Adding Variants

After creating the experiment, add the prompt variations you want to test.

Control Variant

The control variant is created automatically and uses the agent’s current prompt. This is your baseline for comparison.

Adding Test Variants

  1. Click Add Variant
  2. Enter a Variant Name (e.g., “Friendly opener”, “Concise pitch”)
  3. Modify the Prompt — this is a copy of the agent’s current prompt that you can edit. Change only the part you want to test.
  4. Click Save

You can add multiple test variants to compare more than two options at once.

💡 For clean results, change only one thing per variant. If you change the greeting AND the closing in the same variant, you will not know which change caused the difference in performance.

Traffic Weights

Traffic weights control what percentage of calls go to each variant. They must total 100%.

  1. On the experiment page, find the Traffic Split section
  2. Adjust the percentage sliders for each variant
  3. Click Save

Common Splits

  • 50/50 — Equal split between control and one variant. Best for simple A/B tests.
  • 33/33/34 — Equal split across three variants. Useful for testing multiple options.
  • 80/20 — Send most traffic to the control while testing a risky change with a small percentage. Useful for minimizing exposure to a potentially worse prompt.
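Because the weights must total exactly 100%, a quick sanity check before saving a split can catch mistakes. This is an illustrative helper, not part of the product:

```python
def validate_weights(weights: dict[str, float]) -> None:
    """Raise ValueError if traffic weights do not total 100%."""
    total = sum(weights.values())
    if abs(total - 100) > 1e-9:
        raise ValueError(f"Traffic weights must total 100%, got {total}%")

# Each of the common splits above passes validation:
validate_weights({"control": 50, "variant_a": 50})
validate_weights({"control": 33, "variant_a": 33, "variant_b": 34})
validate_weights({"control": 80, "variant_a": 20})
```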

Goals

The goal defines which metric determines the winner. Available goal types:

conversion_rate

Measures the percentage of calls that result in a successful outcome (as determined by your post-call workflow or sentiment analysis).

Best for: Sales agents, lead qualification agents, appointment booking agents.

sentiment

Measures the average caller sentiment score across calls.

Best for: Customer support agents, reception agents, any agent where caller satisfaction matters most.

duration

Measures the average call duration. Depending on your use case, shorter or longer may be better.

Best for: Agents where you want to optimize for efficiency (shorter) or engagement (longer).

custom

Define your own goal metric using a natural language description. The AI evaluates each call against your custom criteria.

Example: “The agent successfully identifies the caller’s budget range and preferred timeline without being pushy.”

Best for: Complex scenarios where standard metrics do not capture what matters.

Running the Experiment

  1. After configuring variants and traffic weights, click Start Experiment
  2. The experiment begins routing live calls according to the traffic split
  3. Results update as calls come in

During the Experiment

The experiment page shows real-time results:

  • Per-variant call count
  • Per-variant goal metric (e.g., conversion rate, average sentiment)
  • A statistical significance indicator showing whether the difference between variants is meaningful or could be due to chance

Let the experiment run until you have enough data for statistical significance. As a rough guideline, each variant should receive at least 30-50 calls before drawing conclusions. The significance indicator on the experiment page will tell you when the results are reliable.
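The significance indicator is built in, but the underlying idea can be sketched. One standard approach for a conversion-rate goal is a two-proportion z-test: a low p-value means the observed difference between variants is unlikely to be chance. This is a simplified illustration, not the platform's actual calculation:

```python
import math

def significance(conversions_a: int, calls_a: int,
                 conversions_b: int, calls_b: int) -> float:
    """Two-sided p-value from a two-proportion z-test.

    Compares the conversion rates of two variants; values below
    ~0.05 are conventionally treated as statistically significant.
    """
    rate_a = conversions_a / calls_a
    rate_b = conversions_b / calls_b
    pooled = (conversions_a + conversions_b) / (calls_a + calls_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / calls_a + 1 / calls_b))
    if se == 0:
        return 1.0  # no variance: identical all-or-nothing outcomes
    z = (rate_a - rate_b) / se
    # Normal-approximation two-sided p-value from the z score
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

With only 30 calls per variant, even a sizable difference in rates can yield a high p-value, which is why the 30-50 call minimum is a floor, not a target.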

Monitoring

While the experiment runs, you can:

  • View per-variant metrics on the experiment page
  • Click into individual calls to see which variant handled them
  • Adjust traffic weights if needed (e.g., pause a clearly underperforming variant)

Stopping the Experiment

  1. Click Stop Experiment
  2. All traffic returns to the agent’s current (control) prompt
  3. Final results are locked and saved

Applying the Winner

After stopping the experiment, if a non-control variant won:

  1. Click Apply Winning Variant
  2. The winning variant’s prompt replaces the agent’s current prompt
  3. Confirm the change

This is optional. You can also manually copy the winning prompt and make further adjustments before applying it.

Best Practices

Test One Variable at a Time

Changing too many things between variants makes it impossible to know which change caused the difference. Isolate the variable you are testing.

Run Long Enough

Premature conclusions lead to incorrect decisions. Wait for the statistical significance indicator to show confidence before stopping.

Document Your Experiments

Use the description field to record what you expected to happen and why. When you review past experiments, this context helps you understand the reasoning behind each test.

Start with High-Impact Changes

Test changes that are likely to produce a measurable difference. Small wording tweaks may not produce statistically significant results without very large call volumes.

Use Experiments to Settle Debates

When your team disagrees about the best approach for an agent’s prompt, run an experiment instead of debating. Let the data decide.
