See also: Machine learning terms, Data analysis, Statistical hypothesis testing
A/B testing (also called split testing or bucket testing) is a controlled experiment in which two or more variants of a product, webpage, feature, or model are compared by randomly assigning users to each variant and measuring the effect on a predefined metric. The "A" variant typically represents the existing version (the control), while the "B" variant represents a proposed change (the treatment). By comparing observed outcomes between groups using statistical hypothesis testing, practitioners can determine whether the treatment causes a meaningful change in user behavior or system performance.
A/B testing has become a cornerstone of data-driven decision making across technology companies, e-commerce platforms, and machine learning deployment pipelines. Companies such as Google, Microsoft, Netflix, and Airbnb collectively run hundreds of thousands of experiments each year to optimize everything from user interface layouts to search ranking algorithms. Google conducted its first A/B test on search results in the year 2000, and by 2011 the company was running over 7,000 A/B tests annually [1].
The method traces its conceptual roots to randomized controlled trials in medicine and agriculture. The first randomized double-blind trial was conducted in 1835 to test a homeopathic drug. In the early 1900s, marketer Claude Hopkins used promotional coupons to compare the effectiveness of different advertising campaigns. Statistician William Sealy Gosset developed the Student's t-test in 1908, providing the mathematical foundation that modern A/B testing relies on [1].
Imagine you run a lemonade stand and you want to know whether a yellow sign or a blue sign attracts more customers. You put up the yellow sign on Monday and count how many people stop by. On Tuesday, you put up the blue sign and count again. After a few rounds of switching, you compare the numbers to see which sign works better.
A/B testing works the same way, except instead of switching signs on different days, you show the yellow sign to half the people walking by and the blue sign to the other half at the same time. That way you can be sure the difference in customers is because of the sign color, not because one day was sunnier than the other.
A/B testing is rooted in the framework of frequentist hypothesis testing. The core procedure involves formulating a null hypothesis and an alternative hypothesis, then collecting data to evaluate whether the observed difference between groups is statistically significant.
The null hypothesis (H0) states that there is no difference between the control and treatment groups. For example, if the metric of interest is conversion rate, H0 asserts that both variants produce the same conversion rate. The alternative hypothesis (H1) states that a difference does exist. A/B tests may use one-sided alternatives (the treatment is better) or two-sided alternatives (the treatment is different in either direction).
The p-value is the probability of observing results at least as extreme as the measured outcome, assuming the null hypothesis is true. Before running the experiment, practitioners choose a significance level (alpha), commonly set at 0.05. If the computed p-value falls below alpha, the null hypothesis is rejected and the result is deemed statistically significant. A significance level of 0.05 means there is a 5% chance of concluding that a difference exists when there is actually none (a Type I error, or false positive).
The power of a test is the probability of correctly rejecting the null hypothesis when a true effect exists. Power equals 1 minus the Type II error rate (beta). Industry standard practice sets power at 80%, meaning there is a 20% chance of failing to detect a real effect (a false negative). Higher power requires larger sample sizes.
| Error type | Definition | Common threshold |
|---|---|---|
| Type I error (false positive) | Rejecting H0 when it is true | alpha = 0.05 (5%) |
| Type II error (false negative) | Failing to reject H0 when H1 is true | beta = 0.20 (20%) |
A confidence interval provides a range of plausible values for the true treatment effect. A 95% confidence interval means that if the experiment were repeated many times, 95% of the constructed intervals would contain the true parameter value. Confidence intervals are often more informative than p-values alone because they communicate both the direction and the magnitude of the effect.
The choice of statistical test depends on the type of metric being measured.
| Metric type | Distribution | Recommended test | Example metric |
|---|---|---|---|
| Continuous | Gaussian | Welch's t-test | Average revenue per user |
| Binary | Binomial | Fisher's exact test or z-test for proportions | Click-through rate, conversion rate |
| Count | Poisson | E-test | Transactions per user |
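A minimal sketch of the two most common cases from the table above, assuming the scipy and statsmodels libraries; the conversion counts and simulated revenue figures are illustrative, not real experimental data.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

# Binary metric (conversion rate): z-test for two proportions
conversions = np.array([480, 530])        # conversions in control, treatment
exposures = np.array([10_000, 10_000])    # users per group
z_stat, p_value = proportions_ztest(conversions, exposures)
rates = conversions / exposures
diff = rates[1] - rates[0]
se = np.sqrt((rates * (1 - rates) / exposures).sum())
print(f"z={z_stat:.2f}, p={p_value:.4f}, "
      f"95% CI for lift: ({diff - 1.96 * se:.4f}, {diff + 1.96 * se:.4f})")

# Continuous metric (revenue per user): Welch's t-test (unequal variances)
rng = np.random.default_rng(42)
revenue_a = rng.gamma(2.0, 10.0, size=10_000)   # simulated control revenue
revenue_b = rng.gamma(2.0, 10.5, size=10_000)   # simulated treatment revenue
t_stat, p_value = stats.ttest_ind(revenue_b, revenue_a, equal_var=False)
print(f"Welch's t: t={t_stat:.2f}, p={p_value:.4f}")
```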
Determining the correct sample size before launching an experiment is critical. Running a test with too few observations risks missing a real effect (underpowered test), while running with too many wastes traffic and delays decisions.
The standard formula for the minimum sample size per group in a two-sample proportion test is:
n = 2 * (Z_(1-alpha/2) + Z_(1-beta))^2 * p(1-p) / (MDE)^2
Where:

- n is the required number of users per group,
- Z_(1-alpha/2) and Z_(1-beta) are the standard normal quantiles corresponding to the chosen significance level and statistical power,
- p is the baseline rate of the metric (so p(1-p) estimates its variance),
- MDE is the minimum detectable effect, the smallest absolute difference the test should be able to detect.
Key relationships govern sample size requirements. Detecting half the effect size requires four times the sample, because sample size scales with the inverse square of the MDE. Lower baseline variance or a larger MDE reduces the required sample size. As a general guideline, an MDE of 2 to 5 percent relative change is considered reasonable for most product experiments [2].
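The formula above translates directly into a few lines of code. This is a minimal sketch assuming scipy; the 5% baseline rate and the MDE values are hypothetical.

```python
import math
from scipy.stats import norm

def sample_size_per_group(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Minimum users per group for a two-sided, two-proportion test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # Z_(1-alpha/2)
    z_beta = norm.ppf(power)            # Z_(1-beta)
    p = baseline_rate
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / mde_abs ** 2)

# 5% baseline conversion, detect an absolute lift of 0.5 percentage points
print(sample_size_per_group(0.05, 0.005))    # roughly 30,000 users per group
# halving the MDE requires roughly four times the sample
print(sample_size_per_group(0.05, 0.0025))   # roughly 120,000 users per group
```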
A/B tests require a clearly defined primary metric, sometimes called the Overall Evaluation Criterion (OEC). The OEC is often a proxy metric for the desired business outcome rather than a direct measurement.
| Metric | Description | Typical use case |
|---|---|---|
| Conversion rate | Percentage of users who complete a desired action | E-commerce purchases, sign-ups |
| Click-through rate (CTR) | Percentage of users who click on a specific element | Ad performance, email campaigns |
| Revenue per user | Average revenue generated per user during the test period | Pricing experiments, upsell features |
| Engagement rate | Time on page, pages per session, or interactions per visit | Content optimization, app features |
| Retention rate | Percentage of users who return after a set period | Onboarding flows, notification strategies |
| Bounce rate | Percentage of visitors who leave after viewing one page | Landing page design |
Practitioners typically also track guardrail metrics alongside the primary metric. Guardrail metrics represent outcomes that must not degrade (for example, page load time or error rate) even if the primary metric improves.
A well-designed A/B test follows a structured process:

1. Formulate a hypothesis and choose a primary metric, along with guardrail metrics.
2. Choose the significance level, power, and minimum detectable effect, and compute the required sample size.
3. Randomly assign users to control and treatment groups.
4. Run the experiment for the predetermined duration without changing the traffic allocation.
5. Check for sample ratio mismatch, then analyze the results with the appropriate statistical test.
6. Decide whether to ship the treatment, iterate on it, or abandon the change.
Proper randomization is the foundation of causal inference in A/B testing. Users are typically assigned to groups using a hash function applied to a stable user identifier, ensuring consistent group assignment across sessions. Good randomization ensures that confounding variables (such as device type, geography, or time of day) are evenly distributed across groups.
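A minimal sketch of hash-based assignment; the experiment name, salt format, and 50/50 split are hypothetical, and production systems typically add layers for mutually exclusive experiments and holdout groups.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to a variant for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # roughly uniform value in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

# the same user always lands in the same group for this experiment
print(assign_variant("user-12345", "checkout-button-color"))
```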
Two major statistical paradigms underpin A/B testing: the frequentist approach and the Bayesian approach.
The frequentist approach defines probability as the long-run frequency of events across repeated trials. It requires choosing a fixed significance level before the experiment begins, collecting data for a predetermined sample size, and then making a binary decision (reject or fail to reject H0). The frequentist framework is mathematically rigorous and well-understood, but it does not produce a probability that the treatment is better. Instead, it provides a p-value and a confidence interval.
The Bayesian approach treats probability as a degree of belief and incorporates prior knowledge through a prior distribution. As data is collected, the prior is updated to form a posterior distribution, which directly answers the question: "What is the probability that the treatment is better than the control?" Bayesian A/B testing enables continuous monitoring without inflating false positive rates, provides intuitive probability statements (for example, "there is a 95% probability that variant B increases conversion by 2 to 5%"), and allows the incorporation of prior knowledge from past experiments. However, it requires specifying a prior, which introduces subjectivity, and posterior computation can be more complex.
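A minimal Beta-Binomial sketch of the Bayesian calculation; the flat Beta(1, 1) prior and the conversion counts are illustrative assumptions rather than recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
control = (480, 10_000)      # (conversions, users) in control
treatment = (530, 10_000)    # (conversions, users) in treatment

# posterior of each conversion rate: Beta(1 + conversions, 1 + non-conversions)
post_a = rng.beta(1 + control[0], 1 + control[1] - control[0], size=100_000)
post_b = rng.beta(1 + treatment[0], 1 + treatment[1] - treatment[0], size=100_000)

lift = post_b - post_a
print(f"P(B > A) = {(lift > 0).mean():.3f}")
print(f"95% credible interval for the lift: "
      f"({np.percentile(lift, 2.5):.4f}, {np.percentile(lift, 97.5):.4f})")
```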
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Definition of probability | Long-run frequency | Degree of belief |
| Prior information | Not used | Incorporated via prior distribution |
| Output | p-value, confidence interval | Posterior distribution, credible interval |
| Monitoring | Fixed sample; peeking inflates error | Continuous monitoring is valid |
| Interpretation | "Reject or fail to reject H0" | "Probability that B is better than A" |
| Common criticism | Arbitrary significance threshold | Choice of prior is subjective |
Multi-armed bandits offer an alternative to traditional A/B testing that balances exploration (learning which variant is best) with exploitation (directing more traffic to the currently best-performing variant). Named after the problem of a gambler choosing between multiple slot machines, bandit algorithms dynamically allocate traffic based on observed performance.
In a standard A/B test, traffic is split evenly for the duration of the experiment, which means significant traffic is sent to underperforming variants. Bandit algorithms such as Thompson sampling (also called Bayesian bandits), Upper Confidence Bound (UCB), and epsilon-greedy reduce this opportunity cost by progressively shifting traffic toward the winning variant.
Bandit approaches are best suited for scenarios where the cost of showing a suboptimal variant is high and where continuous optimization matters more than precise causal estimation. However, they provide weaker statistical guarantees about the treatment effect size compared to a properly run A/B test, and they can converge prematurely if the initial signal is noisy [3].
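A minimal Thompson sampling sketch for two variants with Bernoulli rewards, using Beta posteriors; the "true" conversion rates are simulated assumptions, since in practice they are unknown.

```python
import numpy as np

rng = np.random.default_rng(1)
true_rates = [0.048, 0.053]          # unknown in a real experiment
successes = np.zeros(2)
failures = np.zeros(2)

for _ in range(10_000):
    # sample a plausible rate for each arm from its Beta posterior, play the best
    sampled = rng.beta(successes + 1, failures + 1)
    arm = int(np.argmax(sampled))
    reward = rng.random() < true_rates[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

print("traffic per arm:", (successes + failures).astype(int))
print("estimated rates:", np.round(successes / (successes + failures), 4))
```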
A/A testing involves running an experiment where both groups receive the identical experience. The purpose is not to detect a treatment effect (there is none) but to validate the experimentation infrastructure. A/A tests verify that the randomization mechanism is unbiased, the metric computation pipeline is correct, the statistical test produces the expected false positive rate (approximately alpha), and that there is no sample ratio mismatch (SRM) between groups.
If an A/A test consistently shows statistically significant differences, it signals a problem with the testing platform, such as biased user assignment, logging errors, or metric calculation bugs. Running A/A tests before launching a new experimentation platform or after making infrastructure changes is considered a best practice [4].
Multivariate testing extends A/B testing by comparing more than two variants simultaneously. In an A/B/n test, users are randomly assigned to one of n groups, each receiving a different variant. This approach allows practitioners to test multiple ideas at once, reducing the calendar time needed compared to running separate sequential A/B tests.
The tradeoff is that each additional variant reduces the number of users per group, requiring a larger overall sample to maintain statistical power. Additionally, testing multiple comparisons increases the risk of false positives, which must be addressed through correction methods such as the Bonferroni correction or the Benjamini-Hochberg procedure.
Full factorial multivariate testing goes further by testing multiple factors simultaneously (for example, both button color and headline text), allowing detection of interaction effects between factors. However, the number of required variants grows multiplicatively with the number of factors, making full factorial designs impractical beyond a small number of variables.
Traditional fixed-horizon A/B tests require practitioners to wait until the planned sample size is reached before analyzing results. In practice, teams often want to monitor experiments in real time and stop early if results are clearly positive or negative. Checking results repeatedly without correction (known as "peeking") inflates the Type I error rate far beyond the nominal alpha level.
Several methods address this problem:

- Group sequential designs, which pre-specify a limited number of interim analyses and spend the overall alpha budget across them (for example, using O'Brien-Fleming or Pocock boundaries).
- Always valid inference, such as mixture sequential probability ratio tests and confidence sequences, which produce p-values and intervals that remain valid under continuous monitoring.
- Bayesian monitoring, which frames stopping decisions in terms of posterior probabilities rather than repeated significance tests.
Group sequential designs and always valid inference methods are now supported by modern experimentation platforms, enabling teams to stop experiments early, whether for futility or for a clearly decisive result, while maintaining statistical rigor.
A/B testing is deceptively simple in concept but fraught with methodological traps.
Checking experiment results repeatedly and stopping as soon as significance is reached dramatically increases the false positive rate. With continuous monitoring and a nominal alpha of 0.05, the actual false positive rate can exceed 30%. The solution is to use sequential testing methods or commit to a fixed sample size before analysis.
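A minimal simulation of this inflation under the null hypothesis, assuming statsmodels and an arbitrary choice of checkpoints and sample size; the exact inflated rate depends on how often results are checked.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
n_experiments, n_per_group, checkpoint = 500, 20_000, 1_000
false_positives = 0

for _ in range(n_experiments):
    # both groups share the same 5% conversion rate, so any "win" is a false positive
    a = np.cumsum(rng.random(n_per_group) < 0.05)   # running conversions, control
    b = np.cumsum(rng.random(n_per_group) < 0.05)   # running conversions, treatment
    for n in range(checkpoint, n_per_group + 1, checkpoint):
        _, p = proportions_ztest([a[n - 1], b[n - 1]], [n, n])
        if p < 0.05:            # stop at the first "significant" peek
            false_positives += 1
            break

print(f"false positive rate with peeking: {false_positives / n_experiments:.2f}")
```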
Testing many metrics or many variants simultaneously increases the chance of finding at least one spurious significant result. If 20 independent metrics are tested at alpha = 0.05, the probability of at least one false positive is approximately 64%. Correction methods (Bonferroni, Holm, or Benjamini-Hochberg for controlling the false discovery rate) should be applied when evaluating multiple hypotheses.
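A minimal sketch of these corrections using statsmodels' multipletests; the 20 p-values stand in for hypothetical per-metric test results.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# hypothetical p-values for 20 metrics from a single experiment
p_values = np.array([0.001, 0.012, 0.031, 0.048] + [0.2 + 0.04 * i for i in range(16)])

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method:>10}: {reject.sum()} of {len(p_values)} metrics remain significant")
```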
Simpson's paradox occurs when a trend observed in aggregated data reverses or disappears when the data is segmented by a confounding variable. In A/B testing, this often arises when the traffic allocation ratio changes mid-experiment or when subgroups have very different sample sizes. Practitioners should monitor segment-level results and avoid changing allocation ratios during a running test.
Users may react differently to a new variant simply because it is new (novelty effect) or may prefer the familiar version out of habit (primacy effect). These effects create temporary biases that diminish over time. Running experiments for a sufficient duration (typically at least one to two full business cycles) helps mitigate these effects.
SRM occurs when the observed ratio of users between groups deviates significantly from the expected ratio. For example, if the experiment is designed as a 50/50 split but one group ends up with 52% of users, the randomization or logging system may be compromised. SRM checks should be performed automatically at the start of every analysis.
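A minimal sketch of an SRM check as a chi-square goodness-of-fit test against the planned split, assuming scipy; the observed counts are hypothetical.

```python
from scipy.stats import chisquare

observed = [50_912, 49_088]              # users actually logged per group
expected = [sum(observed) / 2] * 2       # planned 50/50 allocation
stat, p_value = chisquare(observed, f_exp=expected)

# a very small p-value (commonly < 0.001) signals a likely sample ratio mismatch
print(f"chi-square={stat:.1f}, p={p_value:.2e}")
```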
If the analysis only includes users who completed certain actions (for example, users who reached the checkout page), the results may be biased by the treatment's effect on who survives to that stage. Intent-to-treat analysis, which includes all users originally assigned to each group, avoids this problem.
Standard A/B testing relies on the Stable Unit Treatment Value Assumption (SUTVA), which states that one user's outcome is unaffected by another user's treatment assignment. This assumption is violated when users interact with each other, a situation called interference or spillover.
Network effects are prevalent in social media platforms, two-sided marketplaces, and messaging applications. For example, if a new feature in a messaging app is rolled out to the treatment group, control group users who communicate with treated users may also be affected. This spillover can bias treatment effect estimates, typically causing underestimation of the true effect.
Several experimental design strategies address interference:

- Cluster randomization, which assigns whole groups of connected users (for example, geographic regions or graph clusters) to the same variant so that most interactions stay within a single treatment condition.
- Switchback experiments, which alternate the treatment over time for an entire market or region rather than across individual users.
- Ego-network randomization, which treats a user together with their immediate connections as the unit of assignment and analysis.
Airbnb, LinkedIn, and other companies with strong network effects have invested heavily in developing specialized experimental designs to handle interference [6].
Large-scale experimentation platforms employ variance reduction methods to increase the sensitivity of experiments, enabling faster detection of smaller effects without increasing sample size.
Controlled-experiment Using Pre-Experiment Data (CUPED), introduced by Microsoft researchers in 2013, uses pre-experiment metric values as a covariate to reduce variance in the treatment effect estimator [7]. The key insight is that much of the variation in user-level metrics is driven by persistent individual differences rather than the treatment itself. By regressing out the component explained by pre-experiment behavior, CUPED can reduce variance by 50% or more, effectively halving the time needed to reach significance.
The higher the correlation between pre-experiment and post-experiment metric values, the greater the variance reduction. In practice, a pre-experiment observation window of one to two weeks is typically optimal. CUPED has been widely adopted by companies including Netflix, Booking.com, Meta, Airbnb, and DoorDash.
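A minimal CUPED sketch on simulated data: the adjustment coefficient theta is the covariance of the in-experiment and pre-experiment metric divided by the variance of the pre-experiment metric. The simulated spend distributions are assumptions for illustration only, not CUPED as implemented at any particular company.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
pre = rng.gamma(2.0, 10.0, size=n)               # pre-experiment spend per user
post = 0.8 * pre + rng.normal(0, 5, size=n)      # in-experiment spend per user
post[: n // 2] += 0.5                            # small simulated treatment effect

theta = np.cov(post, pre)[0, 1] / np.var(pre)    # optimal adjustment coefficient
post_cuped = post - theta * (pre - pre.mean())   # variance-reduced metric

print(f"variance reduction: {1 - post_cuped.var() / post.var():.1%}")
```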
Stratified sampling ensures that important subgroups (such as new versus returning users, or mobile versus desktop users) are evenly represented in each experimental group. This reduces the variance attributable to between-group differences in subgroup composition.
A/B testing plays a critical role in machine learning deployment, serving as the final validation step before a new model replaces an existing one in production. While offline evaluation using held-out test sets and cross-validation provides initial model comparisons, it cannot capture the full complexity of real-world user interactions.
In the ML context, A/B testing compares a candidate model (challenger) against the current production model (champion) by routing a fraction of live traffic to each. Metrics such as accuracy, precision, recall, revenue impact, and user engagement are tracked in real time. The candidate model is promoted to production only if it demonstrates statistically significant improvement on the primary metric without degrading guardrail metrics.
A typical deployment pipeline proceeds through several stages:

1. Offline evaluation on held-out data to screen candidate models.
2. Shadow deployment, in which the challenger scores live traffic but its predictions are not shown to users.
3. Canary release to a small fraction of traffic to catch operational or latency issues.
4. An A/B test comparing the challenger against the champion on live traffic with predefined success criteria and guardrail metrics.
5. Gradual rollout to all users if the challenger wins, with continued monitoring.
For ranking and recommendation systems, interleaving offers a more sensitive alternative to standard A/B testing. Instead of showing different groups entirely different ranked lists, interleaving merges results from two models into a single list and measures which model's results users prefer based on clicks. Airbnb reported that interleaving achieved 50x greater sensitivity than traditional A/B testing for their search ranking experiments [8].
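A minimal sketch of team-draft interleaving, one common interleaving scheme (not necessarily the exact variant Airbnb used); the rankings and clicks are hypothetical.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, seed=None):
    """Merge two rankings, remembering which model contributed each item."""
    rng = random.Random(seed)
    a, b = list(ranking_a), list(ranking_b)
    merged, team = [], {}
    while a or b:
        # each round, both models place their best remaining item in random order
        for name, ranking in rng.sample([("A", a), ("B", b)], k=2):
            while ranking and ranking[0] in team:
                ranking.pop(0)           # skip items the other model already placed
            if ranking:
                item = ranking.pop(0)
                merged.append(item)
                team[item] = name
    return merged, team

merged, team = team_draft_interleave(["r1", "r2", "r3", "r4"],
                                     ["r3", "r1", "r5", "r6"], seed=0)
clicks = ["r3", "r5"]                    # items the user clicked
credit = {"A": 0, "B": 0}
for item in clicks:
    credit[team[item]] += 1
print(merged)
print(credit)                            # the model with more credited clicks is preferred
```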
Leading technology companies have built sophisticated experimentation platforms to support thousands of concurrent experiments.
| Company | Scale | Notable practices |
|---|---|---|
| Microsoft | Hundreds of thousands of experiments per year across products | Built the ExP (Experimentation Platform); pioneered CUPED for variance reduction; runs automated SRM and guardrail checks |
| Netflix | Thousands of experiments annually | Tests UI changes, recommendation algorithms, and content thumbnails; built a custom experimentation platform documented in engineering blog posts |
| Airbnb | Hundreds of concurrent experiments | Developed interleaving for search ranking; built frameworks for handling marketplace interference |
| Booking.com | Thousands of concurrent experiments | Deeply embedded experimentation culture; every product change is tested |
| Google | Tens of thousands of experiments annually | Runs A/B tests on search, ads, and Maps; tested 41 shades of blue for hyperlink color |
In 2018, representatives from thirteen organizations (Airbnb, Amazon, Booking.com, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, Yandex, and Stanford University) convened the first Practical Online Controlled Experiments Summit to share best practices and identify common challenges [9].
Modern experimentation platforms typically provide:

- Randomized assignment integrated with feature flags, using consistent hashing so users keep the same variant across sessions.
- Automated metric computation pipelines for primary and guardrail metrics.
- Built-in statistical analysis (frequentist, Bayesian, or sequential), with automatic sample ratio mismatch checks.
- Variance reduction techniques such as CUPED.
- Segmentation, targeting, and dashboards for monitoring running experiments.
A range of commercial and open-source platforms support A/B testing.
| Platform | Focus | Key features |
|---|---|---|
| Optimizely | Enterprise experimentation | Visual editor, full-stack SDKs, advanced targeting, Bayesian statistics engine |
| VWO | Marketing optimization | Visual A/B testing, heatmaps, session recordings, behavioral analytics |
| Statsig | Product experimentation | Built-in analytics, warehouse-native mode, Bayesian and frequentist support, session replay |
| LaunchDarkly | Feature flag management | Feature flags, progressive delivery, limited built-in experimentation |
| GrowthBook | Open-source experimentation | Warehouse-native, Bayesian engine, CUPED support, free self-hosted tier |
| Google Optimize | Web experimentation | (Sunset in September 2023) Was a free tool integrated with Google Analytics |
| Eppo | Warehouse-native experimentation | Connects directly to data warehouses, CUPED, sequential testing |
| Split (Harness) | Feature delivery and experimentation | Feature flags, traffic management, integrations with CI/CD |
While A/B testing is most associated with software and web applications, it has been applied across diverse domains: