A/B testing explained: how it works and what it can (and cannot) prove

Most conversion decisions are guesses dressed up as strategy. A/B testing is the tool that replaces the guess with evidence, but only if you understand what it actually measures, and what it can never tell you.

This article covers the mechanics of a proper test, the causal logic that makes it trustworthy, and the boundaries you must respect before you act on a result. If you own a conversion number, this is the foundation everything else rests on.

How a test actually works

An A/B test (also called a split test) runs two or more versions of a page, element, or flow at the same time. One group of visitors sees the control (A, the current version); another sees the variant (B, the changed version). At the end, you compare one metric (usually conversion rate) across the groups.

The structure is simple. The discipline is in keeping it clean.

One traffic stream, split randomly and concurrently into A and B, each measured on the same metric, then compared.

The word concurrently carries most of the weight. If you show version A on Monday and version B on Tuesday, you are not testing the change. You are testing the change plus every difference between Monday and Tuesday: traffic-source mix, time of day, a competitor’s promotion, a shift in your ad budget. Those signals cannot be separated.

Concurrent assignment removes that problem. Because both groups are live at the same moment, any external factor (weather, a news cycle, seasonal mood) hits both equally. The only systematic difference left between the groups is the change you made.
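One common way to get a random-looking, concurrent split is to hash a stable visitor identifier. The sketch below is an illustration only: the function name, the experiment name, and the visitor IDs are hypothetical, and your testing tool may bucket traffic differently.

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically map a visitor to a variant for one experiment."""
    # Salting with the experiment name keeps buckets independent across tests.
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Read the hex digest as an integer and spread it over the variants.
    return variants[int(digest, 16) % len(variants)]

# Hypothetical visitor IDs: the same ID always lands in the same group,
# and both groups fill up at the same time as traffic arrives.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"visitor-{i}", "pricing-page-test")] += 1
print(counts)
```

Because the assignment depends only on the ID and the experiment name, a returning visitor sees the same version on every visit, as long as the ID itself survives (a cleared cookie produces a new ID).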

Why randomisation proves causation

Random assignment is what elevates A/B testing above almost every other method available to a growth team.

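When visitors are assigned at random to A or B, the two groups are statistically equivalent in expectation. Any single sample will differ a little by chance, but on average the groups carry the same proportion of new vs returning visitors, the same device mix, the same traffic sources, and the same intent distribution, and the chance gaps shrink as the sample grows. There is no selection bias and no survivorship bias: the only systematic difference between the groups is the variant.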

Randomisation is the mechanism that lets you say the variant caused this, rather than the variant happened alongside this.

That is the line between an experiment and a correlation. If your analytics show that visitors who saw a new pricing page closed at a higher rate, that could mean the page is better, or that a higher-intent traffic source happened to land on it. A randomised test rules that out by design.

Contrast a before/after analysis: you change the page and measure the next 30 days. If conversion rises, was it the change, a new email campaign, seasonality, or a blog post that drove high-intent traffic? You cannot know. Before/after is fine for spotting an anomaly; it is unreliable for proving what caused it.
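A small simulation can make the contrast concrete. The sketch below assumes a variant with no true effect at all, two made-up intent segments with assumed conversion rates, and a campaign that doubles the share of high-intent traffic in the second period. All numbers are illustrative, not measurements.

```python
import random

# Assumed conversion rates per intent segment, for illustration only.
CONVERSION_RATE = {"high": 0.10, "low": 0.02}

def simulate(n, high_intent_share, rng, assign):
    """Return {variant: [visitors, conversions]} for n simulated visitors."""
    stats = {"A": [0, 0], "B": [0, 0]}
    for _ in range(n):
        intent = "high" if rng.random() < high_intent_share else "low"
        variant = assign(rng)
        # The variant has NO true effect: conversion depends on intent only.
        stats[variant][0] += 1
        stats[variant][1] += int(rng.random() < CONVERSION_RATE[intent])
    return stats

def rate(cell):
    visitors, conversions = cell
    return conversions / visitors

def coin_flip(rng):
    return "A" if rng.random() < 0.5 else "B"

rng = random.Random(42)

# Before/after: A runs in period 1 (20% high intent), B in period 2 (40%).
before = simulate(50_000, 0.20, rng, lambda r: "A")
after = simulate(50_000, 0.40, rng, lambda r: "B")
before_after_gap = rate(after["B"]) - rate(before["A"])

# Randomised and concurrent: both versions live in both periods.
combined = {"A": [0, 0], "B": [0, 0]}
for share in (0.20, 0.40):
    period = simulate(50_000, share, rng, coin_flip)
    for variant in combined:
        combined[variant][0] += period[variant][0]
        combined[variant][1] += period[variant][1]
randomised_gap = rate(combined["B"]) - rate(combined["A"])

print(f"before/after gap: {before_after_gap:+.4f}")
print(f"randomised gap:   {randomised_gap:+.4f}")
```

Under these assumptions the before/after comparison is expected to show a gap of about 0.016 (0.052 minus 0.036) that comes entirely from the traffic mix, while the randomised comparison should hover near zero.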

What a test can (and cannot) prove

A well-run test gives you one durable finding: the relative effect on a defined metric, for this audience, over this window. Precise, valuable, and narrow. Hold both halves of that sentence in mind: the “valuable” and the “narrow.”

The table below draws the line. The left column is inside scope; the right column is where teams overreach.

A clean A/B test CAN show	A clean A/B test CANNOT show
That B outperformed A on one chosen metric, for the traffic served during the test	Why it won: the mechanism behind visitor behaviour
That the difference is unlikely to be chance (see statistical significance)	The long-term effect: churn, novelty wearing off, downstream revenue
A reasonable expectation of similar performance if you ship, for similar traffic	A trustworthy answer on a low-traffic page where significance is months away
A confident ship / do not ship decision on the change as tested	Which element drove a full redesign that changed many things at once
What happens on your page, with your current visitors	Whether you are attracting the right visitors, or chasing the right metric at all

What a test proves: one version beat another on one metric, for your current visitors, during the test window. Everything past that is inference: useful, but inference.

A few of the “cannot” rows deserve a sentence each, because they cause the most expensive mistakes.

The “why” is invisible to the test. Imagine a SaaS pricing page where the variant (which moved the most popular plan to the left column) wins. Was it visual anchoring? Less eye travel? A copy tweak you shipped alongside it? The test cannot say. For the mechanism you need qualitative tools: session replay, heatmaps, and on-site surveys. Tests measure outcomes; they do not explain decisions.

Long-term effects hide outside the window. A two-week test captures two weeks of behaviour. If the variant adds friction that new users tolerate but that churns them after 60 days, the test stays silent. Novelty cuts the other way too: an eye-catching element can lift clicks in week one and lose the lift once visitors habituate.

Low traffic and big redesigns break the method. A B2B form with a handful of submissions a month might need months of runtime to detect a meaningful lift, by which point your traffic and market have moved on (the sample size and runtime maths makes this concrete). And a full page rebuild (new layout, copy, images, and flow at once) creates an interaction problem: you learn that something worked, but nothing reusable about what. Validate radical changes with usability testing, then monitor them with a before/after comparison once shipped, treating that comparison as a sanity check rather than proof of cause.

Running a test correctly

Knowing the theory is step one. Here is the sequence that keeps a result trustworthy from launch to read-out.

Pick one primary metric, up front: the single conversion event the test is designed to move. Commit before launch; choosing the metric after seeing data is a form of p-hacking that manufactures false positives.
Calculate sample size in advance: decide the minimum effect worth detecting and the runtime needed to detect it at your confidence threshold, before a single visitor is bucketed. (See sample size and runtime.)
Run one change per test: multi-variable tests need far more traffic and cannot tell you which change drove the result. Sequential single-variable tests build usable knowledge faster.
Check assignment integrity: visitors who clear cookies, switch devices, or share a household can land in both groups. Know how your tool buckets and de-duplicates traffic.
Run to completion, then read once: do not stop because B is “winning.” Early data is noisy, confidence intervals are wider than they look, and every extra look gives chance another opportunity to cross the significance line. Read the result a single time, at your pre-set sample size.
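The sample-size step in the sequence above can be sketched with the standard normal-approximation formula for comparing two proportions. The baseline rate, lift, and daily traffic used here are assumptions chosen for illustration, and the function names are hypothetical; a dedicated calculator or your testing tool may use a slightly different formula.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed in EACH group to detect the lift (two-sided test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

def runtime_days(n_per_arm, daily_visitors, arms=2):
    """Days needed if all daily visitors are split evenly across the arms."""
    return math.ceil(n_per_arm * arms / daily_visitors)

# Assumed inputs: 3% baseline conversion, 20% relative lift worth detecting,
# 1,000 eligible visitors per day.
n = sample_size_per_arm(baseline=0.03, relative_lift=0.20)
days = runtime_days(n, daily_visitors=1_000)
print(f"{n} visitors per arm, about {days} days")
```

Halving the lift you want to detect roughly quadruples the required sample, which is why the minimum effect has to be chosen before launch rather than after the data arrives.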

The same discipline, framed as the habit to keep and the trap it replaces:

Do this

Define the primary metric before launch and let it decide the winner.
Set sample size and a minimum runtime (at least one or two full business cycles) up front.
Test one variable so a win teaches you something reusable.
Treat an inconclusive result as a real finding: any effect here is too small for this test to detect.

Not this

Pick the metric that happens to look good after the data lands.
Peek daily and stop the moment significance flickers green.
Change headline, image, and CTA at once, then guess which one mattered.
Re-run the same test until it finally “works,” banking a false positive.

Rule of thumb: pre-commit to your metric, your sample size, and your runtime before the test goes live. Anything decided after looking at the data is a rationalisation, not a decision.
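To see why peeking is costly, the sketch below simulates A/A tests, where both groups share the same true conversion rate, so every "significant" result is a false positive. It compares a reader who checks after every batch with one who reads once at the end. The conversion rate, batch size, and number of looks are assumptions for illustration.

```python
import math
import random

def z_score(conv_a, conv_b, n):
    """Two-proportion z statistic for equal-sized arms of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se

def run_aa_tests(simulations=1000, looks=10, batch=200, rate=0.10, seed=7):
    """Return (false-positive rate when peeking, rate when reading once)."""
    rng = random.Random(seed)
    peeking_hits = 0
    single_read_hits = 0
    for _sim in range(simulations):
        conv_a = conv_b = n = 0
        flagged_at_any_look = False
        for _look in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            if abs(z_score(conv_a, conv_b, n)) > 1.96:
                # A daily peeker would have stopped and declared a winner here.
                flagged_at_any_look = True
        peeking_hits += int(flagged_at_any_look)
        # The disciplined reader looks only once, at the final sample size.
        single_read_hits += int(abs(z_score(conv_a, conv_b, n)) > 1.96)
    return peeking_hits / simulations, single_read_hits / simulations

peeking_rate, single_read_rate = run_aa_tests()
print(f"false positives when peeking:      {peeking_rate:.1%}")
print(f"false positives when reading once: {single_read_rate:.1%}")
```

The single read should land near the nominal 5% error rate, while stopping at the first significant look pushes the false-positive rate several times higher, even though no real difference exists.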

Reading the result honestly

A test ends. What you do with the number is where programmes are won or lost.

If the variant wins at your significance threshold and you have hit your pre-set sample size, you have enough to ship with confidence: not certainty, but the kind of confidence good decisions are built on.
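One common way to check whether a difference clears a significance threshold is a two-proportion z-test. The sketch below uses only the Python standard library; the visitor and conversion counts are invented for illustration, and your testing tool may use a different method (for example, a Bayesian or sequential approach).

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pool the two groups under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Invented counts: A converts 420 of 14,000 (3.0%), B converts 504 of 14,000 (3.6%).
z, p_value = two_proportion_z_test(420, 14_000, 504, 14_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A p-value below your pre-set threshold, read once at the planned sample size, is the condition described above; the same number read after repeated peeking does not carry the same meaning.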

If the result is inconclusive (neither version clearly ahead), that is also a result. It tells you that any effect of this change on this metric, for this audience, is probably smaller than the minimum effect the test was sized to detect, which frees you to test something that might move the number. Do not call a 55/45 flicker a winner because you are impatient, and do not re-run the test until it cooperates. Both habits quietly poison the next decision. The deeper treatment of significance covers confidence intervals, power, and the specific ways results get misread.
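A confidence interval on the difference shows what an inconclusive result does and does not rule out. The sketch below uses a simple normal-approximation (Wald) interval with invented counts; it is one reasonable choice among several interval methods, not the only one.

```python
import math
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation interval for (rate of B) minus (rate of A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error, since we are estimating the difference itself.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Invented counts: A converts 300 of 10,000 (3.00%), B converts 318 of 10,000 (3.18%).
low, high = diff_confidence_interval(300, 10_000, 318, 10_000)
print(f"difference in conversion rate: between {low:+.4f} and {high:+.4f}")
```

With these numbers the interval spans zero, so the data are compatible with B being slightly worse, no different, or modestly better. What the interval does exclude is a large effect in either direction, and that is the usable finding.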

From running tests to building a programme

A single test is a data point. A programme is a compounding asset.

The teams that extract real value treat each test as one node in a learning system: they write the hypothesis before testing, record the outcome regardless of direction (losses teach as much as wins), and use each result to shape the next question. The CRO process in five steps lays this out as a repeatable loop (research, hypothesise, prioritise, test, learn) rather than a string of one-off guesses.

Convert more, guess less. That is not a slogan. It is what a disciplined testing programme delivers: a steady reduction in the revenue you leave on the table by acting on assumptions instead of evidence.

Frequently asked questions

What is the difference between A/B testing and before/after analysis?

A/B testing shows the control and variant to randomly assigned visitors at the same time, so external factors hit both groups equally and the only systematic difference is your change. A before/after analysis compares two different time periods, which confounds the change with seasonality, campaigns, and traffic shifts, so it can flag that something moved, but not prove what caused it.

How many variants can I test at once?

You can run more than two, but each additional variant splits your traffic further and lengthens the runtime to reach significance, and testing several changes inside one variant means a win tells you nothing about which change mattered. For most teams, one variable per test produces faster, more reusable learning.

Can A/B testing tell me why a variant won?

No. A test measures the outcome (that B beat A on one metric), not the reason behind it. To understand the mechanism, pair the test with qualitative research: session replay, heatmaps, and on-site surveys that surface what visitors were actually thinking and doing.

Is A/B testing worth it on a low-traffic page?

Often not. If reaching significance would take many months, your traffic and market will have changed before the test resolves, and the answer is stale on arrival. On low-traffic pages, qualitative research and usability testing usually beat a formally inconclusive experiment.

