Statistical significance without fooling yourself

Most teams running A/B tests are not short on data. They are short on an accurate reading of it. A result comes back at p = 0.04, someone calls it a win, the variant ships, and revenue stays flat. The math was not wrong. The interpretation was.

This article is about the gap between what statistical significance actually means and what most operators quietly assume it means. Close that gap and you make fewer expensive mistakes, trust your experiments more, and ship only the changes that genuinely move the number.

What a p-value actually says

Start here, because almost everyone gets this wrong.

A p-value does not tell you the probability that your variant is better. It does not tell you the probability your result was a fluke. It answers a narrower, more specific question: assuming there is no real difference between the two versions, how likely is it that you would see a gap this large (or larger) by chance alone?

That is a statement about your data inside a hypothetical world where the variant does nothing. It is not a statement about which version wins.

A p-value of 0.05 means: if there were truly no effect, you would still see a result this extreme about 5% of the time from random variation alone.
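To make that definition concrete, here is a minimal sketch of how a p-value for two conversion rates can be computed, assuming a standard two-sided, pooled two-proportion z-test. Your testing tool may use a different test, and the function name and the visitor and conversion counts are illustrative, not taken from any real experiment.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the gap between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the no-effect assumption both arms share one underlying rate,
    # so the data from both arms is pooled to estimate it.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Chance of a gap at least this large, in either direction,
    # if the variant truly did nothing.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: control 500/10,000, variant 560/10,000.
p = two_proportion_p_value(500, 10_000, 560, 10_000)
print(f"p = {p:.3f}")
```

With these made-up counts the variant converts 12% better in relative terms, yet the p-value lands just above 0.05. Note that every step of the calculation happens inside the no-effect world: nothing in it estimates the chance that the variant wins.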

So a 95% confidence threshold (p < 0.05) is really a decision: I will accept a 5% chance of crowning a false winner whenever the change truly does nothing. That risk adds up across a programme. Run twenty independent tests on changes that do nothing, and you should still expect roughly one to cross p < 0.05 by luck.

This is not a flaw in the method. It is the method working as designed. Which is exactly why discipline matters as much as the threshold you pick.

Figure: two 95% confidence intervals on the same effect axis. The zero line marks no difference. An interval that straddles zero is inconclusive; an interval sitting entirely on one side is a result you can act on.

Significance vs practical significance

A result can be statistically significant and practically meaningless at the same time. Confusing the two is a reliable way to burn engineering capacity on micro-optimisations.

Statistical significance tells you the observed difference is unlikely to be random noise alone. Practical significance tells you whether it is large enough to matter to the business.

Imagine a SaaS pricing page where a button-colour test lands a tiny lift at 95% confidence. With enough traffic, that can be a genuine, non-random difference, and still not worth the deploy. If your baseline rate barely moves, the engineering time, QA, and risk may cost more than the lift returns.

The tool for thinking about this is the confidence interval, not the p-value alone. The interval gives you a plausible range for the true effect. A tight range around a small lift tells a very different story from a wide range that barely clears zero, even when both report the same p-value.

So decide the threshold before you start: what is the smallest lift that would actually be worth shipping? That is a business question, not a statistical one. If a lift below your line is not worth the trouble, do not call a result a win unless the interval’s lower bound clears that line. Sample size and runtime covers how this minimum detectable effect drives your traffic plan.
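The rule above can be sketched as a small decision function. This is one possible encoding, not a standard: the function name, the labels, and the example numbers are hypothetical, and lifts are assumed to be absolute differences in conversion rate (0.005 means half a percentage point).

```python
def read_interval(ci_low, ci_high, min_lift):
    """Judge a confidence interval against a pre-agreed minimum worthwhile lift."""
    if ci_low >= min_lift:
        # Even the pessimistic end of the range is worth shipping.
        return "win"
    if ci_low > 0:
        # Probably a real lift, but it may be below the line you set.
        return "too small to call"
    if ci_high < 0:
        # The whole range sits below zero: the variant is likely worse.
        return "loss"
    # The range straddles zero.
    return "inconclusive"

# Hypothetical minimum worthwhile lift of 0.5 percentage points.
print(read_interval(0.006, 0.014, min_lift=0.005))   # whole interval above the line
print(read_interval(0.001, 0.009, min_lift=0.005))   # above zero, below the line
print(read_interval(-0.002, 0.010, min_lift=0.005))  # straddles zero
```

The point of writing it down is that `min_lift` is an input you fix before the test, so nobody can move the line after seeing the result.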

The peeking problem

Here is the most common way good operators produce bad results: they watch the dashboard daily and stop the test the moment it looks like a winner.

This is peeking, and it systematically inflates false positives, often dramatically.

The reason is subtle but decisive. A fixed-horizon p-value is only valid when calculated once, at a single pre-planned point. Every time you glance at accumulating data and apply the threshold, you are running another implicit test. The more you look, the more chances you hand randomness to cross p < 0.05 for a moment. Peek often enough at a change that does nothing, and you will eventually catch a “significant” reading. Stop there and you have shipped noise.

The dashboard is not lying when it shows 95% confidence. You are simply applying that number to a situation it was never built for.

The fix for a fixed-horizon test is blunt: decide the sample size in advance, run until you hit it, then read the result once. No early stops, no daily verdicts.

Rule of thumb: set your sample size before you start, run to completion, and read the result exactly once. Every early peek-and-stop is a coin flip with the odds quietly tilted against you.
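A quick simulation shows the inflation. The sketch below is a simplified model, not a replay of any real test: it assumes the variant does nothing, treats each look as adding one equal-sized batch of data, and models the test statistic as a running sum of standard normal noise. The function name, the number of looks, and the seed are arbitrary choices for illustration.

```python
import math
import random
from statistics import NormalDist

def false_positive_rate(looks, peek, sims=5000, seed=7):
    """Share of no-effect tests that get declared a winner at p < 0.05."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold
    false_wins = 0
    for _ in range(sims):
        total = 0.0
        crossed = False
        for k in range(1, looks + 1):
            total += rng.gauss(0.0, 1.0)   # one more batch of pure noise
            z = total / math.sqrt(k)       # test statistic after k batches
            if peek and abs(z) > z_crit:
                crossed = True             # stop the moment it "looks significant"
                break
        if not peek:
            crossed = abs(z) > z_crit      # single read at the planned end
        false_wins += crossed
    return false_wins / sims

print("read once:", false_positive_rate(looks=20, peek=False))
print("peek at every look:", false_positive_rate(looks=20, peek=True))
```

Under these assumptions the single read stays near the nominal 5%, while stopping at the first significant look out of twenty produces false winners several times as often.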

Fixed-horizon vs sequential testing

There are two legitimate ways to decide when a test is allowed to end. Picking the wrong one (or mixing them) is where the false positives sneak in.

	Fixed-horizon	Sequential (always-valid)
When you read it	Once, at the pre-set sample size	Continuously, as data arrives
Can you stop early?	No: early stops invalidate the p-value	Yes: the method accounts for repeated looks
False-positive control	Valid only if you read once	Stays controlled across many looks
Sample size needed	Lower for a single planned read	Generally higher for the same power
Best fit	Most CRO tests; the simple default	When you need to react to data mid-flight
Main risk	Discipline slips and someone peeks	Tool implements it incorrectly

Fixed-horizon is the right default for most teams. Its one real cost: you cannot legitimately peek, so reacting to a clearly catastrophic variant means either eating the loss to term or accepting an invalid read on that test.

Sequential testing (also sold as always-valid inference or continuous monitoring) uses methods that mathematically absorb repeated looks. You can check daily, stop early when evidence is strong, and keep the false-positive rate in check. The trade-off is larger samples, more statistical machinery, and the fact that not every testing tool implements it correctly.
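To get a feel for what "absorbing repeated looks" means, here is a deliberately crude stand-in: split the 5% error budget evenly across a fixed number of planned looks (a Bonferroni-style correction). This is not how always-valid tools work, and real sequential methods are more efficient, but it shows the basic trade: a stricter bar at each look in exchange for permission to look. The simulation model, the function name, and the numbers are the same kind of illustrative assumptions as any toy example.

```python
import math
import random
from statistics import NormalDist

def corrected_false_positive_rate(looks, alpha=0.05, sims=5000, seed=11):
    """False-positive rate when each of `looks` planned reads uses alpha / looks."""
    rng = random.Random(seed)
    # Stricter two-sided threshold: each look only gets a slice of alpha.
    z_crit = NormalDist().inv_cdf(1 - (alpha / looks) / 2)
    false_wins = 0
    for _ in range(sims):
        total = 0.0
        for k in range(1, looks + 1):
            total += rng.gauss(0.0, 1.0)          # batch of noise, no real effect
            if abs(total / math.sqrt(k)) > z_crit:
                false_wins += 1                   # early stop on a false winner
                break
    return false_wins / sims

print("20 corrected looks:", corrected_false_positive_rate(looks=20))
```

In this model the overall false-positive rate stays below 5% even with daily stops allowed. The price is visible in the threshold: each look needs far stronger evidence than a single planned read would, which is why sequential designs generally need more traffic for the same power.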

Do this

Pick fixed-horizon or sequential before launch, and commit.
On fixed-horizon, treat the single-read rule as a hard process rule, not a guideline.
Use a sequential mode only if your tool genuinely supports always-valid inference.
Add a minimum runtime on top of sample size to cover weekly cycles.

Not this

Run fixed-horizon, then stop the day it crosses 95%.
Treat a tool’s live “confidence” number as permission to peek.
Assume a daily-updating dashboard is automatically always-valid.
Call a test the moment you hit sample size, mid-week, ignoring day-of-week swings.

One practical guardrail either way: set a minimum runtime alongside the sample size. Even if you hit the target in three days, run at least one to two full weeks to capture day-of-week variation. Tuesday behaviour rarely matches Saturday behaviour, and a short test that over-weights one day skews both the estimate and the variance.
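As a sketch, the guardrail can be written as "whichever is longer, rounded up to whole weeks". The function name, the two-week default, and the traffic figures are assumptions for illustration.

```python
import math

def planned_runtime_days(sample_per_arm, daily_visitors_per_arm, min_weeks=2):
    """Runtime in days: enough for the sample, never less than min_weeks,
    and always a whole number of weeks."""
    days_for_sample = math.ceil(sample_per_arm / daily_visitors_per_arm)
    # Round up to full weeks so every day of the week is covered equally.
    weeks_for_sample = math.ceil(days_for_sample / 7)
    return max(weeks_for_sample, min_weeks) * 7

# Sample target reached in 3 days: the minimum runtime takes over.
print(planned_runtime_days(30_000, 10_000))
# Sample target needs 20 days: round up to three full weeks.
print(planned_runtime_days(100_000, 5_000))
```

Fixing this number before launch, next to the sample size, gives the test exactly one planned end date.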

Confidence intervals: read the range, not just the flag

When a tool reports “statistically significant,” it has collapsed a whole distribution into a yes/no. That binary hides the information you actually need to make the call.

A 95% confidence interval around a lift means: if you reran this exact experiment many times, about 95% of the intervals you computed would contain the true effect. For day-to-day decisions, treat it as a plausible range for the real lift.
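For reference, a minimal sketch of one common way to compute such an interval for the absolute difference between two conversion rates, assuming a normal-approximation (Wald) interval. Your tool may use a different method, and the function name and counts below are hypothetical.

```python
import math
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, level=0.95):
    """Normal-approximation interval for (variant rate - control rate)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Each arm contributes its own sampling variance to the difference.
    std_err = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    diff = rate_b - rate_a
    return diff - z * std_err, diff + z * std_err

# Hypothetical counts: control 500/10,000, variant 560/10,000.
low, high = diff_confidence_interval(500, 10_000, 560, 10_000)
print(f"lift between {low:+.4f} and {high:+.4f} (absolute)")
```

With these made-up counts the observed lift is 0.6 percentage points, but the interval reaches slightly below zero, so the honest reading is "inconclusive", whatever the headline lift suggests.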

Three readings worth slowing down for:

Narrow, entirely above zero: strong evidence of a real positive effect, and the width tells you how precisely you have pinned it down.
Wide, barely clears zero: even at p < 0.05, an interval whose lower bound sits just above zero should make you cautious; there may be an effect, but you have not measured its size well.
Narrow but tiny: the effect is real, but whether it is worth shipping is a business call, not a statistical one.

Most tools surface the interval if you look. Build the habit of reading the range before the verdict. A/B testing explained covers what the test structure itself can and cannot guarantee, useful context before you interpret any single number.

The multiple comparisons trap

Every metric you track is another shot at a false positive. Run one test and stare at ten metrics (conversion rate, bounce, scroll depth, click-through, revenue per visitor, and so on) and, if those metrics were independent, there is roughly a 40% chance that at least one crosses p < 0.05 by chance, even if the variant does nothing.

This is the multiple comparisons problem, and it compounds fast when you also run many variants or many tests at once.
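The arithmetic is short enough to check yourself. The sketch below assumes every comparison is independent and uses the same 5% threshold; real metrics are usually correlated, which tends to pull the figure down, so treat the output as a rough guide. The function name is illustrative.

```python
def chance_of_any_false_positive(comparisons, alpha=0.05):
    """Probability that at least one of several no-effect comparisons
    crosses the threshold, assuming they are independent."""
    return 1 - (1 - alpha) ** comparisons

for m in (1, 5, 10, 20):
    expected = m * 0.05  # average number of false positives
    chance = chance_of_any_false_positive(m)
    print(f"{m:>2} comparisons: expect {expected:.2f} false positives, "
          f"{chance:.0%} chance of at least one")
```

Comparisons multiply: ten metrics across three variants is thirty shots, not ten.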

The discipline: name a single primary metric before the test runs. That metric alone decides win or lose. Secondary metrics are diagnostic: they help explain why, not whether. If a secondary metric moves, treat it as a hypothesis for the next test, not as a retroactive redefinition of success.

If you genuinely must judge several primary outcomes together, look into correction methods (Bonferroni is simple but conservative; Benjamini-Hochberg controls the false discovery rate instead and keeps more power for larger sets). Most standard CRO tests do not need that. They need the discipline of one primary metric, chosen up front.
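For readers who do need a correction, here is a minimal sketch of both procedures in plain Python. The function names and the p-values are made up for illustration; in practice a maintained statistics library is a safer choice than hand-rolled code.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag a result only if its p-value beats alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * q.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff_rank = rank
    # Everything ranked at or below the cutoff is flagged.
    flagged = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            flagged[i] = True
    return flagged

# Hypothetical p-values for ten outcomes from one test.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print("uncorrected:", sum(p < 0.05 for p in p_values))
print("Bonferroni:", sum(bonferroni(p_values)))
print("Benjamini-Hochberg:", sum(benjamini_hochberg(p_values)))
```

On this made-up set, five outcomes clear an uncorrected 0.05, Bonferroni keeps one, and Benjamini-Hochberg keeps two, which is the power difference described above.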

Rule of thumb: pick the primary metric before launch. Secondary metrics are for learning, never for declaring a winner after the fact.

A discipline checklist

Statistical validity is less about the formula and more about the habits around it. Run this before you call any result:

#	Check	Why it matters
1	Primary metric pre-specified?	Choosing it after seeing results makes the p-value unreliable.
2	Sample size pre-specified?	Stopping when it “looked good” inflates the false-positive rate.
3	Ran at least one to two full weeks?	A test that missed the weekend is not representative.
4	Interval’s lower bound above your minimum effect?	If not, the lift may be real but not worth shipping.
5	Reading the primary metric, not a convenient one?	Pivoting metrics after the fact is outcome-switching.
6	Checked for novelty effects?	New designs get a temporary bump; segment new vs returning to sense-check.
7	Anything change externally mid-test?	A pricing shift, traffic-source change, or seasonal spike can corrupt the read.

A clean pass on all seven means you have a result worth acting on. A failure on any one means you have a hypothesis worth re-running properly. This maps to the CRO process in five steps: the “learn” stage only works if the test was clean enough to learn from.

Frequently asked questions

Does p < 0.05 mean there is a 95% chance my variant is better?

No. This is the single most common misreading. A p-value below 0.05 means that if the variant truly had no effect, you would see a gap this large less than 5% of the time. It is a statement about the data under a no-effect assumption, not the probability the variant wins. The probability that the variant is better is a different quantity that the p-value does not give you.

Why can I not just stop the test once it hits significance?

Because a fixed-horizon p-value is only valid at one pre-planned read. Every extra look is another roll of the dice, and peeking until you see p < 0.05 will eventually produce a “winner” on a change that does nothing. Either commit to a sample size and read once, or use a sequential method built to handle repeated looks.

What is the difference between statistical and practical significance?

Statistical significance says the observed difference is unlikely to be noise alone. Practical significance says it is large enough to be worth shipping. A test can clear 95% confidence on a lift so small the engineering and risk cost more than it returns. Decide your minimum worthwhile lift in advance and judge the confidence interval against it.

How long should a test run if I already hit my sample size early?

Keep going until you have covered at least one to two full calendar weeks, and stop only on a whole-week boundary. Conversion behaviour varies by day of week, so a test that hits its number on a Thursday and stops there over-weights the days it happened to cover. Sample size and runtime walks through setting both a sample target and a minimum runtime.

Running fewer experiments with tighter discipline beats running many with sloppy reads. Convert more, guess less. Start by being honest about what your p-values are, and are not, telling you.
