A/B Experiments
A/B Experiments let you test different versions of your agent’s prompt against live caller traffic. Instead of guessing which prompt performs better, you can measure it with real data.
A/B Experiments require a Growth plan or higher.
How Experiments Work
- You create an experiment with two or more prompt variants
- Live calls to the agent are randomly distributed across variants based on traffic weights you define
- Each variant’s performance is tracked independently
- After enough calls, you compare results and apply the winning variant
The agent’s callers are unaware that an experiment is running. Each caller interacts with one variant for the duration of their call.
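Conceptually, the routing described above is weighted random sampling. Here is a minimal Python sketch of the idea (the `assign_variant` helper and its signature are illustrative assumptions, not part of the platform's API — the platform performs this routing for you):

```python
import random

def assign_variant(weights, rng=random):
    """Pick a variant for an incoming call.

    `weights` maps variant name -> traffic percentage (must total 100).
    Illustrative sketch only; the platform does this routing for you.
    """
    if sum(weights.values()) != 100:
        raise ValueError("traffic weights must total 100%")
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Simulate 10,000 calls under a 50/50 split.
split = {"control": 50, "friendly-opener": 50}
counts = {name: 0 for name in split}
for _ in range(10_000):
    counts[assign_variant(split)] += 1
# Each variant ends up with roughly half of the calls.
```

Because assignment is random per call, small call volumes can drift noticeably from the configured split; the percentages hold in aggregate.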
Creating an Experiment
- Navigate to Testing → Experiments in the sidebar
- Click Create Experiment
- Fill in the details:
| Field | Description |
|---|---|
| Name | A descriptive name (e.g., “Sales opener - friendly vs professional”) |
| Description | What you are testing and why |
| Agent | The agent to run the experiment on |
| Goal | The metric you want to optimize. See Goals below. |
- Click Create
Adding Variants
After creating the experiment, add the prompt variations you want to test.
Control Variant
The control variant is created automatically and uses the agent’s current prompt. This is your baseline for comparison.
Adding Test Variants
- Click Add Variant
- Enter a Variant Name (e.g., “Friendly opener”, “Concise pitch”)
- Modify the Prompt — this is a copy of the agent’s current prompt that you can edit. Change only the part you want to test.
- Click Save
You can add multiple test variants to compare more than two options at once.
For clean results, change only one thing per variant. If you change the greeting AND the closing in the same variant, you will not know which change caused the difference in performance.
Traffic Weights
Traffic weights control what percentage of calls go to each variant. They must total 100%.
- On the experiment page, find the Traffic Split section
- Adjust the percentage sliders for each variant
- Click Save
Common Splits
- 50/50 — Equal split between control and one variant. Best for simple A/B tests.
- 33/33/34 — Equal split across three variants. Useful for testing multiple options.
- 80/20 — Send most traffic to the control while testing a risky change with a small percentage. Useful for minimizing exposure to a potentially worse prompt.
Goals
The goal defines which metric determines the winner. Available goal types:
conversion_rate
Measures the percentage of calls that result in a successful outcome (as determined by your post-call workflow or sentiment analysis).
Best for: Sales agents, lead qualification agents, appointment booking agents.
sentiment
Measures the average caller sentiment score across calls.
Best for: Customer support agents, reception agents, any agent where caller satisfaction matters most.
duration
Measures the average call duration. Depending on your use case, shorter or longer may be better.
Best for: Agents where you want to optimize for efficiency (shorter) or engagement (longer).
custom
Define your own goal metric using a natural language description. The AI evaluates each call against your custom criteria.
Example: “The agent successfully identifies the caller’s budget range and preferred timeline without being pushy.”
Best for: Complex scenarios where standard metrics do not capture what matters.
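Whichever goal you choose, each variant ultimately gets one metric value and the best value wins. A minimal sketch of that comparison (the `pick_winner` helper and `higher_is_better` flag are assumptions for illustration; for a duration goal, the direction depends on whether you are optimizing for efficiency or engagement):

```python
def pick_winner(results, higher_is_better=True):
    """Return the variant whose goal metric is best.

    `results` maps variant name -> metric value (e.g. conversion rate,
    average sentiment score, or average call duration in seconds).
    """
    best = max if higher_is_better else min
    return best(results, key=results.get)

# Conversion-rate goal: higher is better.
pick_winner({"control": 0.18, "friendly": 0.22})  # -> "friendly"

# Duration goal optimized for efficiency: shorter is better.
pick_winner({"control": 310.0, "concise": 245.0}, higher_is_better=False)  # -> "concise"
```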
Running the Experiment
- After configuring variants and traffic weights, click Start Experiment
- The experiment begins routing live calls according to the traffic split
- Results update as calls come in
During the Experiment
The experiment page shows real-time results:
- Per-variant call count
- Per-variant goal metric (e.g., conversion rate, average sentiment)
- A statistical significance indicator showing whether the difference between variants is meaningful or could be due to chance
Let the experiment run until results reach statistical significance. As a rough guideline, each variant should receive at least 30–50 calls before you draw conclusions; the significance indicator on the experiment page shows when the results are reliable.
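The significance indicator handles this comparison for you, but the underlying idea for a conversion-rate goal can be sketched with a standard two-proportion z-test (the function below is an illustration, not the platform's exact calculation):

```python
from math import erf, sqrt

def significance(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test.

    conv_*: number of converted calls; n_*: total calls per variant.
    Returns the p-value; values below 0.05 are conventionally treated
    as statistically significant.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # identical results; no detectable difference
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

significance(12, 100, 15, 100)   # small gap, small sample: not significant
significance(30, 200, 60, 200)   # 15% vs 30% over 400 calls: significant
```

The same gap in conversion rate can be noise at 100 calls and decisive at 400, which is why waiting for the indicator matters.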
Monitoring
While the experiment runs, you can:
- View per-variant metrics on the experiment page
- Click into individual calls to see which variant handled them
- Adjust traffic weights if needed (e.g., pause a clearly underperforming variant)
Stopping the Experiment
- Click Stop Experiment
- All traffic returns to the agent’s current (control) prompt
- Final results are locked and saved
Applying the Winner
After stopping the experiment, if a non-control variant won:
- Click Apply Winning Variant
- Confirm the change
- The winning variant’s prompt replaces the agent’s current prompt
This is optional. You can also manually copy the winning prompt and make further adjustments before applying it.
Best Practices
Test One Variable at a Time
Changing too many things between variants makes it impossible to know which change caused the difference. Isolate the variable you are testing.
Run Long Enough
Premature conclusions lead to incorrect decisions. Wait for the statistical significance indicator to show confidence before stopping.
Document Your Experiments
Use the description field to record what you expected to happen and why. When you review past experiments, this context helps you understand the reasoning behind each test.
Start with High-Impact Changes
Test changes that are likely to produce a measurable difference. Small wording tweaks may not produce statistically significant results without very large call volumes.
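To get a feel for why small tweaks need large call volumes, here is a rough per-variant sample-size estimate for a conversion-rate goal, using the standard power-analysis formula for comparing two proportions (the helper name and the defaults of 5% significance and 80% power are assumptions, not platform settings):

```python
import math
from statistics import NormalDist

def calls_per_variant(p_base, p_target, alpha=0.05, power=0.8):
    """Rough minimum calls per variant to detect a conversion-rate
    change from p_base to p_target with a two-sided test."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(power)
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_base - p_target) ** 2)

# A large expected jump needs a few hundred calls per variant...
big = calls_per_variant(0.15, 0.25)
# ...while a one-point wording tweak needs tens of thousands.
small = calls_per_variant(0.15, 0.16)
```

In other words, a change you expect to move conversion from 15% to 25% is testable in days, while a change from 15% to 16% may never reach significance at typical call volumes.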
Use Experiments to Settle Debates
When your team disagrees about the best approach for an agent’s prompt, run an experiment instead of debating. Let the data decide.