See also: Machine learning terms, Data analysis, Statistical hypothesis testing
A/B testing (also called split testing or bucket testing) is a controlled experiment in which two or more variants of a product, webpage, feature, or model are compared by randomly assigning users to each variant and measuring the effect on a predefined metric. The "A" variant typically represents the existing version (the control), while the "B" variant represents a proposed change (the treatment). By comparing observed outcomes between groups using statistical hypothesis testing, practitioners can determine whether the treatment causes a meaningful change in user behavior or system performance.
A/B testing has become a cornerstone of data-driven decision making across technology companies, e-commerce platforms, and machine learning deployment pipelines. Companies such as Google, Microsoft, Netflix, and Airbnb collectively run hundreds of thousands of experiments each year to optimize everything from user interface layouts to search ranking algorithms. Google conducted its first A/B test on search results in the year 2000, and by 2011 the company was running over 7,000 A/B tests annually [1].
The method traces its conceptual roots to randomized controlled trials in medicine and agriculture. The first randomized double-blind trial was conducted in 1835 to test a homeopathic drug. In the early 1900s, marketer Claude Hopkins used promotional coupons to compare the effectiveness of different advertising campaigns. Statistician William Sealy Gosset developed the Student's t-test in 1908, providing the mathematical foundation that modern A/B testing relies on [1].
Imagine you run a lemonade stand and you want to know whether a yellow sign or a blue sign attracts more customers. You put up the yellow sign on Monday and count how many people stop by. On Tuesday, you put up the blue sign and count again. After a few rounds of switching, you compare the numbers to see which sign works better.
A/B testing works the same way, except instead of switching signs on different days, you show the yellow sign to half the people walking by and the blue sign to the other half at the same time. That way you can be sure the difference in customers is because of the sign color, not because one day was sunnier than the other.
A/B testing is rooted in the framework of frequentist hypothesis testing. The core procedure involves formulating a null hypothesis and an alternative hypothesis, then collecting data to evaluate whether the observed difference between groups is statistically significant.
The null hypothesis (H0) states that there is no difference between the control and treatment groups. For example, if the metric of interest is conversion rate, H0 asserts that both variants produce the same conversion rate. The alternative hypothesis (H1) states that a difference does exist. A/B tests may use one-sided alternatives (the treatment is better) or two-sided alternatives (the treatment is different in either direction).
The p-value is the probability of observing results at least as extreme as the measured outcome, assuming the null hypothesis is true. Before running the experiment, practitioners choose a significance level (alpha), commonly set at 0.05. If the computed p-value falls below alpha, the null hypothesis is rejected and the result is deemed statistically significant. A significance level of 0.05 means there is a 5% chance of concluding that a difference exists when there is actually none (a Type I error, or false positive).
The power of a test is the probability of correctly rejecting the null hypothesis when a true effect exists. Power equals 1 minus the Type II error rate (beta). Industry standard practice sets power at 80%, meaning there is a 20% chance of failing to detect a real effect (a false negative). Higher power requires larger sample sizes.
| Error type | Definition | Common threshold |
|---|---|---|
| Type I error (false positive) | Rejecting H0 when it is true | alpha = 0.05 (5%) |
| Type II error (false negative) | Failing to reject H0 when H1 is true | beta = 0.20 (20%) |
A confidence interval provides a range of plausible values for the true treatment effect. A 95% confidence interval means that if the experiment were repeated many times, 95% of the constructed intervals would contain the true parameter value. Confidence intervals are often more informative than p-values alone because they communicate both the direction and the magnitude of the effect.
The choice of statistical test depends on the type of metric being measured.
| Metric type | Distribution | Recommended test | Example metric |
|---|---|---|---|
| Continuous | Gaussian | Welch's t-test | Average revenue per user |
| Binary | Binomial | Fisher's exact test or z-test for proportions | Click-through rate, conversion rate |
| Count | Poisson | E-test | Transactions per user |
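A minimal sketch of the two most common cases from the table above, assuming the scipy and statsmodels libraries; the conversion counts and simulated revenue figures are illustrative, not real experimental data.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

# Binary metric (conversion rate): z-test for two proportions
conversions = np.array([480, 530])        # conversions in control, treatment
exposures = np.array([10_000, 10_000])    # users per group
z_stat, p_value = proportions_ztest(conversions, exposures)
rates = conversions / exposures
diff = rates[1] - rates[0]
se = np.sqrt((rates * (1 - rates) / exposures).sum())
print(f"z={z_stat:.2f}, p={p_value:.4f}, "
      f"95% CI for lift: ({diff - 1.96 * se:.4f}, {diff + 1.96 * se:.4f})")

# Continuous metric (revenue per user): Welch's t-test (unequal variances)
rng = np.random.default_rng(42)
revenue_a = rng.gamma(2.0, 10.0, size=10_000)   # simulated control revenue
revenue_b = rng.gamma(2.0, 10.5, size=10_000)   # simulated treatment revenue
t_stat, p_value = stats.ttest_ind(revenue_b, revenue_a, equal_var=False)
print(f"Welch's t: t={t_stat:.2f}, p={p_value:.4f}")
```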
Determining the correct sample size before launching an experiment is critical. Running a test with too few observations risks missing a real effect (underpowered test), while running with too many wastes traffic and delays decisions.
The standard formula for the minimum sample size per group in a two-sample proportion test is:
n = 2 * (Z_(1-alpha/2) + Z_(1-beta))^2 * p(1-p) / (MDE)^2
Where:

- n is the required number of users per group,
- Z_(1-alpha/2) and Z_(1-beta) are the standard normal quantiles corresponding to the chosen significance level and statistical power,
- p is the baseline rate of the metric (so p(1-p) estimates its variance),
- MDE is the minimum detectable effect, the smallest absolute difference the test should be able to detect.
Key relationships govern sample size requirements. Detecting half the effect size requires four times the sample, because sample size scales with the inverse square of the MDE. Lower baseline variance or a larger MDE reduces the required sample size. As a general guideline, an MDE of 2 to 5 percent relative change is considered reasonable for most product experiments [2].
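The formula above translates directly into a few lines of code. This is a minimal sketch assuming scipy; the 5% baseline rate and the MDE values are hypothetical.

```python
import math
from scipy.stats import norm

def sample_size_per_group(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Minimum users per group for a two-sided, two-proportion test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # Z_(1-alpha/2)
    z_beta = norm.ppf(power)            # Z_(1-beta)
    p = baseline_rate
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / mde_abs ** 2)

# 5% baseline conversion, detect an absolute lift of 0.5 percentage points
print(sample_size_per_group(0.05, 0.005))    # roughly 30,000 users per group
# halving the MDE requires roughly four times the sample
print(sample_size_per_group(0.05, 0.0025))   # roughly 120,000 users per group
```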
A/B tests require a clearly defined primary metric, sometimes called the Overall Evaluation Criterion (OEC). The OEC is often a proxy metric for the desired business outcome rather than a direct measurement.
| Metric | Description | Typical use case |
|---|---|---|
| Conversion rate | Percentage of users who complete a desired action | E-commerce purchases, sign-ups |
| Click-through rate (CTR) | Percentage of users who click on a specific element | Ad performance, email campaigns |
| Revenue per user | Average revenue generated per user during the test period | Pricing experiments, upsell features |
| Engagement rate | Time on page, pages per session, or interactions per visit | Content optimization, app features |
| Retention rate | Percentage of users who return after a set period | Onboarding flows, notification strategies |
| Bounce rate | Percentage of visitors who leave after viewing one page | Landing page design |
Practitioners typically also track guardrail metrics alongside the primary metric. Guardrail metrics represent outcomes that must not degrade (for example, page load time or error rate) even if the primary metric improves.
A well-designed A/B test follows a structured process:

1. Formulate a hypothesis and choose a primary metric, along with guardrail metrics.
2. Choose the significance level, power, and minimum detectable effect, and compute the required sample size.
3. Randomly assign users to control and treatment groups.
4. Run the experiment for the predetermined duration without changing the traffic allocation.
5. Check for sample ratio mismatch, then analyze the results with the appropriate statistical test.
6. Decide whether to ship the treatment, iterate on it, or abandon the change.
Proper randomization is the foundation of causal inference in A/B testing. Users are typically assigned to groups using a hash function applied to a stable user identifier, ensuring consistent group assignment across sessions. Good randomization ensures that confounding variables (such as device type, geography, or time of day) are evenly distributed across groups.
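A minimal sketch of hash-based assignment; the experiment name, salt format, and 50/50 split are hypothetical, and production systems typically add layers for mutually exclusive experiments and holdout groups.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to a variant for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # roughly uniform value in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

# the same user always lands in the same group for this experiment
print(assign_variant("user-12345", "checkout-button-color"))
```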
Two major statistical paradigms underpin A/B testing: the frequentist approach and the Bayesian approach.
The frequentist approach defines probability as the long-run frequency of events across repeated trials. It requires choosing a fixed significance level before the experiment begins, collecting data for a predetermined sample size, and then making a binary decision (reject or fail to reject H0). The frequentist framework is mathematically rigorous and well-understood, but it does not produce a probability that the treatment is better. Instead, it provides a p-value and a confidence interval.
The Bayesian approach treats probability as a degree of belief and incorporates prior knowledge through a prior distribution. As data is collected, the prior is updated to form a posterior distribution, which directly answers the question: "What is the probability that the treatment is better than the control?" Bayesian A/B testing enables continuous monitoring without inflating false positive rates, provides intuitive probability statements (for example, "there is a 95% probability that variant B increases conversion by 2 to 5%"), and allows the incorporation of prior knowledge from past experiments. However, it requires specifying a prior, which introduces subjectivity, and posterior computation can be more complex.
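A minimal Beta-Binomial sketch of the Bayesian calculation; the flat Beta(1, 1) prior and the conversion counts are illustrative assumptions rather than recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
control = (480, 10_000)      # (conversions, users) in control
treatment = (530, 10_000)    # (conversions, users) in treatment

# posterior of each conversion rate: Beta(1 + conversions, 1 + non-conversions)
post_a = rng.beta(1 + control[0], 1 + control[1] - control[0], size=100_000)
post_b = rng.beta(1 + treatment[0], 1 + treatment[1] - treatment[0], size=100_000)

lift = post_b - post_a
print(f"P(B > A) = {(lift > 0).mean():.3f}")
print(f"95% credible interval for the lift: "
      f"({np.percentile(lift, 2.5):.4f}, {np.percentile(lift, 97.5):.4f})")
```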
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Definition of probability | Long-run frequency | Degree of belief |
| Prior information | Not used | Incorporated via prior distribution |
| Output | p-value, confidence interval | Posterior distribution, credible interval |
| Monitoring | Fixed sample; peeking inflates error | Continuous monitoring is valid |
| Interpretation | "Reject or fail to reject H0" | "Probability that B is better than A" |
| Common criticism | Arbitrary significance threshold | Choice of prior is subjective |
Multi-armed bandits offer an alternative to traditional A/B testing that balances exploration (learning which variant is best) with exploitation (directing more traffic to the currently best-performing variant). Named after the problem of a gambler choosing between multiple slot machines, bandit algorithms dynamically allocate traffic based on observed performance.
In a standard A/B test, traffic is split evenly for the duration of the experiment, which means significant traffic is sent to underperforming variants. Bandit algorithms such as Thompson sampling (also called Bayesian bandits), Upper Confidence Bound (UCB), and epsilon-greedy reduce this opportunity cost by progressively shifting traffic toward the winning variant.
Bandit approaches are best suited for scenarios where the cost of showing a suboptimal variant is high and where continuous optimization matters more than precise causal estimation. However, they provide weaker statistical guarantees about the treatment effect size compared to a properly run A/B test, and they can converge prematurely if the initial signal is noisy [3].
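A minimal Thompson sampling sketch for two variants with Bernoulli rewards, using Beta posteriors; the "true" conversion rates are simulated assumptions, since in practice they are unknown.

```python
import numpy as np

rng = np.random.default_rng(1)
true_rates = [0.048, 0.053]          # unknown in a real experiment
successes = np.zeros(2)
failures = np.zeros(2)

for _ in range(10_000):
    # sample a plausible rate for each arm from its Beta posterior, play the best
    sampled = rng.beta(successes + 1, failures + 1)
    arm = int(np.argmax(sampled))
    reward = rng.random() < true_rates[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

print("traffic per arm:", (successes + failures).astype(int))
print("estimated rates:", np.round(successes / (successes + failures), 4))
```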
A/A testing involves running an experiment where both groups receive the identical experience. The purpose is not to detect a treatment effect (there is none) but to validate the experimentation infrastructure. A/A tests verify that the randomization mechanism is unbiased, the metric computation pipeline is correct, the statistical test produces the expected false positive rate (approximately alpha), and that there is no sample ratio mismatch (SRM) between groups.
If an A/A test consistently shows statistically significant differences, it signals a problem with the testing platform, such as biased user assignment, logging errors, or metric calculation bugs. Running A/A tests before launching a new experimentation platform or after making infrastructure changes is considered a best practice [4].
Multivariate testing extends A/B testing by comparing more than two variants simultaneously. In an A/B/n test, users are randomly assigned to one of n groups, each receiving a different variant. This approach allows practitioners to test multiple ideas at once, reducing the calendar time needed compared to running separate sequential A/B tests.
The tradeoff is that each additional variant reduces the number of users per group, requiring a larger overall sample to maintain statistical power. Additionally, testing multiple comparisons increases the risk of false positives, which must be addressed through correction methods such as the Bonferroni correction or the Benjamini-Hochberg procedure.
Full factorial multivariate testing goes further by testing multiple factors simultaneously (for example, both button color and headline text), allowing detection of interaction effects between factors. However, the number of required variants grows multiplicatively with the number of factors, making full factorial designs impractical beyond a small number of variables.
Traditional fixed-horizon A/B tests require practitioners to wait until the planned sample size is reached before analyzing results. In practice, teams often want to monitor experiments in real time and stop early if results are clearly positive or negative. Checking results repeatedly without correction (known as "peeking") inflates the Type I error rate far beyond the nominal alpha level.
Several methods address this problem:

- Group sequential designs, which pre-specify a limited number of interim analyses and spend the overall alpha budget across them (for example, using O'Brien-Fleming or Pocock boundaries).
- Always valid inference, such as mixture sequential probability ratio tests and confidence sequences, which produce p-values and intervals that remain valid under continuous monitoring.
- Bayesian monitoring, which frames stopping decisions in terms of posterior probabilities rather than repeated significance tests.
Group sequential designs and always valid inference methods are now supported by modern experimentation platforms, enabling teams to stop experiments early, whether for futility or for a clearly decisive result, while maintaining statistical rigor.
A/B testing is deceptively simple in concept but fraught with methodological traps.
Checking experiment results repeatedly and stopping as soon as significance is reached dramatically increases the false positive rate. With continuous monitoring and a nominal alpha of 0.05, the actual false positive rate can exceed 30%. The solution is to use sequential testing methods or commit to a fixed sample size before analysis.
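A minimal simulation of this inflation under the null hypothesis, assuming statsmodels and an arbitrary choice of checkpoints and sample size; the exact inflated rate depends on how often results are checked.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
n_experiments, n_per_group, checkpoint = 500, 20_000, 1_000
false_positives = 0

for _ in range(n_experiments):
    # both groups share the same 5% conversion rate, so any "win" is a false positive
    a = np.cumsum(rng.random(n_per_group) < 0.05)   # running conversions, control
    b = np.cumsum(rng.random(n_per_group) < 0.05)   # running conversions, treatment
    for n in range(checkpoint, n_per_group + 1, checkpoint):
        _, p = proportions_ztest([a[n - 1], b[n - 1]], [n, n])
        if p < 0.05:            # stop at the first "significant" peek
            false_positives += 1
            break

print(f"false positive rate with peeking: {false_positives / n_experiments:.2f}")
```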
Testing many metrics or many variants simultaneously increases the chance of finding at least one spurious significant result. If 20 independent metrics are tested at alpha = 0.05, the probability of at least one false positive is approximately 64%. Correction methods (Bonferroni, Holm, or Benjamini-Hochberg for controlling the false discovery rate) should be applied when evaluating multiple hypotheses.
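A minimal sketch of these corrections using statsmodels' multipletests; the 20 p-values stand in for hypothetical per-metric test results.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# hypothetical p-values for 20 metrics from a single experiment
p_values = np.array([0.001, 0.012, 0.031, 0.048] + [0.2 + 0.04 * i for i in range(16)])

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method:>10}: {reject.sum()} of {len(p_values)} metrics remain significant")
```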
Simpson's paradox occurs when a trend observed in aggregated data reverses or disappears when the data is segmented by a confounding variable. In A/B testing, this often arises when the traffic allocation ratio changes mid-experiment or when subgroups have very different sample sizes. Practitioners should monitor segment-level results and avoid changing allocation ratios during a running test.
Users may react differently to a new variant simply because it is new (novelty effect) or may prefer the familiar version out of habit (primacy effect). These effects create temporary biases that diminish over time. Running experiments for a sufficient duration (typically at least one to two full business cycles) helps mitigate these effects.
SRM occurs when the observed ratio of users between groups deviates significantly from the expected ratio. For example, if the experiment is designed as a 50/50 split but one group ends up with 52% of users, the randomization or logging system may be compromised. SRM checks should be performed automatically at the start of every analysis.
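A minimal sketch of an SRM check as a chi-square goodness-of-fit test against the planned split, assuming scipy; the observed counts are hypothetical.

```python
from scipy.stats import chisquare

observed = [50_912, 49_088]              # users actually logged per group
expected = [sum(observed) / 2] * 2       # planned 50/50 allocation
stat, p_value = chisquare(observed, f_exp=expected)

# a very small p-value (commonly < 0.001) signals a likely sample ratio mismatch
print(f"chi-square={stat:.1f}, p={p_value:.2e}")
```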
If the analysis only includes users who completed certain actions (for example, users who reached the checkout page), the results may be biased by the treatment's effect on who survives to that stage. Intent-to-treat analysis, which includes all users originally assigned to each group, avoids this problem.
Standard A/B testing relies on the Stable Unit Treatment Value Assumption (SUTVA), which states that one user's outcome is unaffected by another user's treatment assignment. This assumption is violated when users interact with each other, a situation called interference or spillover.
Network effects are prevalent in social media platforms, two-sided marketplaces, and messaging applications. For example, if a new feature in a messaging app is rolled out to the treatment group, control group users who communicate with treated users may also be affected. This spillover can bias treatment effect estimates, typically causing underestimation of the true effect.
Several experimental design strategies address interference:

- Cluster randomization, which assigns whole groups of connected users (for example, geographic regions or graph clusters) to the same variant so that most interactions stay within a single treatment condition.
- Switchback experiments, which alternate the treatment over time for an entire market or region rather than across individual users.
- Ego-network randomization, which treats a user together with their immediate connections as the unit of assignment and analysis.
Airbnb, LinkedIn, and other companies with strong network effects have invested heavily in developing specialized experimental designs to handle interference [6].
Large-scale experimentation platforms employ variance reduction methods to increase the sensitivity of experiments, enabling faster detection of smaller effects without increasing sample size.
Controlled-experiment Using Pre-Experiment Data (CUPED), introduced by Microsoft researchers in 2013, uses pre-experiment metric values as a covariate to reduce variance in the treatment effect estimator [7]. The key insight is that much of the variation in user-level metrics is driven by persistent individual differences rather than the treatment itself. By regressing out the component explained by pre-experiment behavior, CUPED can reduce variance by 50% or more, effectively halving the time needed to reach significance.
The higher the correlation between pre-experiment and post-experiment metric values, the greater the variance reduction. In practice, a pre-experiment observation window of one to two weeks is typically optimal. CUPED has been widely adopted by companies including Netflix, Booking.com, Meta, Airbnb, and DoorDash.
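A minimal CUPED sketch on simulated data: the adjustment coefficient theta is the covariance of the in-experiment and pre-experiment metric divided by the variance of the pre-experiment metric. The simulated spend distributions are assumptions for illustration only, not CUPED as implemented at any particular company.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
pre = rng.gamma(2.0, 10.0, size=n)               # pre-experiment spend per user
post = 0.8 * pre + rng.normal(0, 5, size=n)      # in-experiment spend per user
post[: n // 2] += 0.5                            # small simulated treatment effect

theta = np.cov(post, pre)[0, 1] / np.var(pre)    # optimal adjustment coefficient
post_cuped = post - theta * (pre - pre.mean())   # variance-reduced metric

print(f"variance reduction: {1 - post_cuped.var() / post.var():.1%}")
```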
Stratified sampling ensures that important subgroups (such as new versus returning users, or mobile versus desktop users) are evenly represented in each experimental group. This reduces the variance attributable to between-group differences in subgroup composition.
A/B testing plays a critical role in machine learning deployment, serving as the final validation step before a new model replaces an existing one in production. While offline evaluation using held-out test sets and cross-validation provides initial model comparisons, it cannot capture the full complexity of real-world user interactions.
In the ML context, A/B testing compares a candidate model (challenger) against the current production model (champion) by routing a fraction of live traffic to each. Metrics such as accuracy, precision, recall, revenue impact, and user engagement are tracked in real time. The candidate model is promoted to production only if it demonstrates statistically significant improvement on the primary metric without degrading guardrail metrics.
A typical deployment pipeline proceeds through several stages:

1. Offline evaluation on held-out data to screen candidate models.
2. Shadow deployment, in which the challenger scores live traffic but its predictions are not shown to users.
3. Canary release to a small fraction of traffic to catch operational or latency issues.
4. An A/B test comparing the challenger against the champion on live traffic with predefined success criteria and guardrail metrics.
5. Gradual rollout to all users if the challenger wins, with continued monitoring.
For ranking and recommendation systems, interleaving offers a more sensitive alternative to standard A/B testing. Instead of showing different groups entirely different ranked lists, interleaving merges results from two models into a single list and measures which model's results users prefer based on clicks. Airbnb reported that interleaving achieved 50x greater sensitivity than traditional A/B testing for their search ranking experiments [8].
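A minimal sketch of team-draft interleaving, one common interleaving scheme (not necessarily the exact variant Airbnb used); the rankings and clicks are hypothetical.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, seed=None):
    """Merge two rankings, remembering which model contributed each item."""
    rng = random.Random(seed)
    a, b = list(ranking_a), list(ranking_b)
    merged, team = [], {}
    while a or b:
        # each round, both models place their best remaining item in random order
        for name, ranking in rng.sample([("A", a), ("B", b)], k=2):
            while ranking and ranking[0] in team:
                ranking.pop(0)           # skip items the other model already placed
            if ranking:
                item = ranking.pop(0)
                merged.append(item)
                team[item] = name
    return merged, team

merged, team = team_draft_interleave(["r1", "r2", "r3", "r4"],
                                     ["r3", "r1", "r5", "r6"], seed=0)
clicks = ["r3", "r5"]                    # items the user clicked
credit = {"A": 0, "B": 0}
for item in clicks:
    credit[team[item]] += 1
print(merged)
print(credit)                            # the model with more credited clicks is preferred
```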
Leading technology companies have built sophisticated experimentation platforms to support thousands of concurrent experiments.
| Company | Scale | Notable practices |
|---|---|---|
| Microsoft | Hundreds of thousands of experiments per year across products | Built the ExP (Experimentation Platform); pioneered CUPED for variance reduction; runs automated SRM and guardrail checks |
| Netflix | Thousands of experiments annually | Tests UI changes, recommendation algorithms, and content thumbnails; built a custom experimentation platform documented in engineering blog posts |
| Airbnb | Hundreds of concurrent experiments | Developed interleaving for search ranking; built frameworks for handling marketplace interference |
| Booking.com | Thousands of concurrent experiments | Deeply embedded experimentation culture; every product change is tested |
| Google | Tens of thousands of experiments annually | Runs A/B tests on search, ads, and Maps; tested 41 shades of blue for hyperlink color |
In 2018, representatives from thirteen organizations (Airbnb, Amazon, Booking.com, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, Yandex, and Stanford University) convened the first Practical Online Controlled Experiments Summit to share best practices and identify common challenges [9].
Modern experimentation platforms typically provide:

- Randomized assignment integrated with feature flags, using consistent hashing so users keep the same variant across sessions.
- Automated metric computation pipelines for primary and guardrail metrics.
- Built-in statistical analysis (frequentist, Bayesian, or sequential), with automatic sample ratio mismatch checks.
- Variance reduction techniques such as CUPED.
- Segmentation, targeting, and dashboards for monitoring running experiments.
A range of commercial and open-source platforms support A/B testing.
| Platform | Focus | Key features |
|---|---|---|
| Optimizely | Enterprise experimentation | Visual editor, full-stack SDKs, advanced targeting, Bayesian statistics engine |
| VWO | Marketing optimization | Visual A/B testing, heatmaps, session recordings, behavioral analytics |
| Statsig | Product experimentation | Built-in analytics, warehouse-native mode, Bayesian and frequentist support, session replay |
| LaunchDarkly | Feature flag management | Feature flags, progressive delivery, limited built-in experimentation |
| GrowthBook | Open-source experimentation | Warehouse-native, Bayesian engine, CUPED support, free self-hosted tier |
| Google Optimize | Web experimentation | (Sunset in September 2023) Was a free tool integrated with Google Analytics |
| Eppo | Warehouse-native experimentation | Connects directly to data warehouses, CUPED, sequential testing |
| Split (Harness) | Feature delivery and experimentation | Feature flags, traffic management, integrations with CI/CD |
While A/B testing is most associated with software and web applications, it has been applied across diverse domains: