How to read the output
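- p-value: the probability, under the two-proportion z-test, of seeing a difference at least this large if the two conversion rates were truly equal. Smaller values are stronger evidence against equal rates.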
- Confidence interval: plausible range for Variant B minus Control A in percentage points.
- Bayesian probability: a posterior estimate of how often B's conversion rate is above A's rate under the Beta-Binomial model.
- Power: not calculated here. If the result is inconclusive, plan the next test with power analysis.
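As a rough illustration of the p-value reading above, here is a minimal sketch of a pooled two-proportion z-test using only the Python standard library. The function name, argument names, and example counts are hypothetical, and the calculator's own implementation may differ in details such as rounding.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled two-proportion z-test. Returns (z, p_value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate: the single shared rate assumed under the equal-rate model.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 200 of 5000 convert on A, 250 of 5000 on B.
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.3f}, two-sided p = {p:.4f}")
```

The sketch assumes at least one conversion and one non-conversion overall. Otherwise the pooled standard error is zero and the division fails.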
Related statistics tools
For planning before a test, use sample size or power analysis. For a one-sample exact proportion check, use the binomial test calculator. For a general confidence-interval workflow, use the CI & hypothesis test wizard. For clinical or risk-difference framing, use the risk difference confidence interval calculator.
FAQ
Which formula does this calculator use?
It uses a two-proportion z-test with a pooled standard error for the p-value. The difference interval can use Newcombe-Wilson or Wald, and the Bayesian readout samples Beta(1 + conversions, 1 + non-conversions) posteriors, which corresponds to a uniform Beta(1, 1) prior on each rate.
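To make the two interval options concrete, the sketch below computes a Wald interval and a Newcombe-Wilson (hybrid score) interval for Variant B minus Control A. It assumes the standard textbook forms of both methods. The function names and example counts are hypothetical and are not taken from the calculator.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(conv, n, z):
    """Wilson score interval for a single proportion."""
    p = conv / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def difference_intervals(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (wald, newcombe) intervals for p_b - p_a as (low, high) tuples."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Wald: unpooled standard error around the observed difference.
    wald_se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    wald = (diff - z * wald_se, diff + z * wald_se)

    # Newcombe: combine the Wilson limits of each arm.
    l_a, u_a = wilson_interval(conv_a, n_a, z)
    l_b, u_b = wilson_interval(conv_b, n_b, z)
    newcombe = (
        diff - sqrt((p_b - l_b) ** 2 + (u_a - p_a) ** 2),
        diff + sqrt((u_b - p_b) ** 2 + (p_a - l_a) ** 2),
    )
    return wald, newcombe


# Hypothetical counts; results are printed in percentage points.
wald, newcombe = difference_intervals(200, 5000, 250, 5000)
print(f"Wald:     {wald[0] * 100:.2f} to {wald[1] * 100:.2f} pp")
print(f"Newcombe: {newcombe[0] * 100:.2f} to {newcombe[1] * 100:.2f} pp")
```

With large samples the two intervals are usually close. They tend to diverge when counts are small or rates sit near 0% or 100%, and in those cases the Wald limits can fall outside the possible range for a difference.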
Does statistical significance mean I should ship the variant?
No. A significant p-value says that a difference at least as large as the one observed would be unlikely if the two rates were equal. A launch decision also needs effect size, cost, guardrail metrics, segment behavior, and business context.
Should I choose a one-sided or two-sided test?
Use the two-sided test unless you committed, before the experiment started, to acting only on a lift in the variant direction. Two-sided is the safer default because it can detect movement in either direction.
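The difference between the two choices can be sketched from a single z statistic. The helper below is hypothetical and assumes the one-sided alternative is "variant rate is higher than control".

```python
from statistics import NormalDist


def p_values_from_z(z):
    """Return (one_sided, two_sided) p-values for a z statistic.

    The one-sided value tests only for a lift (variant above control).
    """
    std_normal = NormalDist()
    one_sided = 1 - std_normal.cdf(z)
    two_sided = 2 * (1 - std_normal.cdf(abs(z)))
    return one_sided, two_sided


# Hypothetical z statistic from a test where the variant looks better.
one_sided, two_sided = p_values_from_z(1.8)
print(f"one-sided p = {one_sided:.4f}, two-sided p = {two_sided:.4f}")

# The same magnitude in the other direction: the one-sided test cannot flag it.
one_sided_neg, two_sided_neg = p_values_from_z(-1.8)
print(f"one-sided p = {one_sided_neg:.4f}, two-sided p = {two_sided_neg:.4f}")
```

For a positive z the one-sided p-value is half the two-sided one, so the same data can cross a 0.05 threshold under one choice and not the other. That is why the choice has to be fixed before looking at the data.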
How should I read small-sample results?
When visitors or expected conversions are very low, the normal approximation behind the z-test can be unreliable. Treat the result as a rough signal and compare it with an exact binomial workflow or run a larger test.
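One way to sanity-check a small-sample result is to compare the z-test with an exact test. The sketch below uses Fisher's exact test, a common exact method for two groups. It is swapped in here for illustration and is not necessarily the exact workflow this page links to. The function name and counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(conv_a, n_a, conv_b, n_b):
    """Two-sided Fisher exact p-value for a 2x2 conversion table.

    Sums the hypergeometric probabilities of every table that is no more
    likely than the observed one, holding the row and column totals fixed.
    """
    total_conv = conv_a + conv_b
    total_n = n_a + n_b
    denom = comb(total_n, n_b)

    def prob(k):
        # Probability that exactly k of the conversions land in group B.
        return comb(total_conv, k) * comb(total_n - total_conv, n_b - k) / denom

    k_min = max(0, total_conv - n_a)
    k_max = min(total_conv, n_b)
    observed = prob(conv_b)
    # Small relative tolerance so equally likely tables are not dropped
    # because of floating-point rounding.
    total = sum(
        p for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, total)


# Hypothetical small test: 1 of 20 convert on A, 5 of 20 on B.
p_exact = fisher_exact_two_sided(conv_a=1, n_a=20, conv_b=5, n_b=20)
print(f"Fisher exact two-sided p = {p_exact:.4f}")
```

For these counts the pooled z-test gives a noticeably smaller p-value than the exact test. A gap of that kind is a signal to treat the z-test result with caution.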
Why can the frequentist and Bayesian readings look different?
The p-value asks how surprising the data would be if both rates were equal. The Bayesian probability estimates how often the variant rate beats the control rate under the chosen posterior model, so the numbers answer different questions.
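The Bayesian reading can be approximated with a short Monte Carlo sketch. It assumes independent Beta(1 + conversions, 1 + non-conversions) posteriors for each arm, as described in this FAQ. The function name, draw count, and seed are hypothetical choices.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Estimate P(rate_B > rate_A) by sampling both Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical counts: 200 of 5000 convert on A, 250 of 5000 on B.
prob = prob_b_beats_a(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"P(B > A) is roughly {prob:.3f}")
```

The output is a probability that B is better, not a p-value. A value near 0.5 means the data barely separate the arms, while a value near 1 means nearly every posterior draw favors B. Because the estimate is sampled, it shifts slightly with the seed and the number of draws.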
Can I check the result many times while the test is running?
Repeated peeking increases false positives for ordinary fixed-horizon tests. Plan the sample size and analysis rule first, or use a sequential method that is designed for interim checks.
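The effect of peeking can be seen in a small simulation. The sketch below runs A/A tests, where both arms share the same true rate, and compares how often a pooled z-test is flagged as significant at any interim look versus only at the final look. All parameter values and names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test p-value; 1.0 when the SE is zero."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def simulate_peeking(rate=0.10, batch=100, looks=10, trials=1000,
                     alpha=0.05, seed=1):
    """Return (false-positive rate with peeking, rate at the final look only)."""
    rng = random.Random(seed)
    flagged_any_look = 0
    flagged_final_only = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        ever_significant = False
        significant = False
        for look in range(1, looks + 1):
            # Add one batch of visitors per arm, then test the running totals.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n = look * batch
            significant = two_sided_p(conv_a, n, conv_b, n) < alpha
            ever_significant = ever_significant or significant
        flagged_any_look += ever_significant
        flagged_final_only += significant
    return flagged_any_look / trials, flagged_final_only / trials


any_look_rate, final_rate = simulate_peeking()
print(f"flagged at any look:   {any_look_rate:.3f}")
print(f"flagged at final look: {final_rate:.3f}")
```

Since there is no real difference between the arms, every flag is a false positive. The final-look rate should stay near the nominal 5%, while stopping at the first significant look pushes the rate well above it.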
Why might my result differ from another A/B calculator?
Calculators can differ by one-sided versus two-sided tests, pooled versus unpooled standard errors, Wald versus Newcombe intervals, prior choices, rounding, and whether they apply sequential or multiple-testing adjustments.
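As one example of these differences, the sketch below runs the same z-test with a pooled and an unpooled standard error. The function name and counts are hypothetical, and other calculators may combine these choices in other ways.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_values(conv_a, n_a, conv_b, n_b):
    """Return two-sided p-values as (pooled_se_version, unpooled_se_version)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Pooled: one shared rate, as assumed under the equal-rate hypothesis.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

    # Unpooled: each arm contributes its own variance estimate.
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

    def two_sided(z):
        return 2 * (1 - NormalDist().cdf(abs(z)))

    return two_sided(diff / se_pooled), two_sided(diff / se_unpooled)


# Hypothetical counts: 30 of 200 convert on A, 45 of 200 on B.
p_pooled, p_unpooled = z_test_p_values(30, 200, 45, 200)
print(f"pooled SE:   p = {p_pooled:.4f}")
print(f"unpooled SE: p = {p_unpooled:.4f}")
```

The gap is small here, but near a decision threshold even a small gap can change which side of the threshold a result falls on.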