How to calculate the sample size and runtime you need before you start
Most tests fail before they start, killed by a sample size nobody calculated. Here is the ten-minute planning step that fixes it.
Most A/B tests do not fail because the idea was wrong. They fail because the test never had a real chance to give a reliable answer. You call a winner after two days, ship it, and the lift evaporates. Or you run six weeks on a thin-traffic page and still land on no result. Either way, the time is gone.
Sample size planning is the fix, and the core logic takes about ten minutes to understand. This piece explains the mechanics intuitively, shows how runtime falls out of them, and walks one hypothetical example end to end. By the end you will know exactly what to feed a calculator and what to do with the number it returns.
The four inputs that decide everything
Before you can calculate anything you need four numbers. Every sample size formula is just a precise way of combining them. Get them right and the calculator does the rest.
Baseline conversion rate is the rate you are at today, before any change. If 3% of visitors who land on your pricing page start a trial, your baseline is 3%. You read this from analytics; you do not guess it.
Minimum detectable effect (MDE) is the smallest lift you actually care about. This is the input most people set wrong. It is not the lift you hope for. It is the smallest improvement that would still be worth shipping given the effort and risk. On a 3% baseline, a 20% relative MDE means you want to reliably catch a move to 3.6% or higher.
Statistical significance is the false-positive rate you will tolerate, the odds of calling a winner when nothing really changed. The industry default is 95% confidence, a 5% false-positive rate. Tightening to 99% costs you a larger sample.
Statistical power is the chance of catching a real effect when one genuinely exists. At 80% power you miss a real win one time in five. At 90% you catch more, but you pay for it in sample. Use 80% as the default and 90% for high-stakes decisions.
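To make the cost of confidence and power concrete, here is a minimal sketch that turns both settings into the z-values a calculator works with. It assumes a two-sided test and the simplified normal-approximation view in which required sample scales with the square of the two z-values added together; the function names are illustrative, not taken from any particular tool.

```python
from statistics import NormalDist

def z_multipliers(confidence=0.95, power=0.80):
    """Return (z_alpha, z_beta) for a two-sided test."""
    alpha = 1 - confidence                         # tolerated false-positive rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the power
    return z_alpha, z_beta

def relative_sample_cost(confidence, power):
    """(z_alpha + z_beta)^2: proportional to required sample in the
    simplified formula (an assumption of this sketch)."""
    z_alpha, z_beta = z_multipliers(confidence, power)
    return (z_alpha + z_beta) ** 2

default_cost = relative_sample_cost(0.95, 0.80)
# How much more sample a stricter setting costs relative to 95% / 80%.
print(relative_sample_cost(0.99, 0.80) / default_cost)
print(relative_sample_cost(0.95, 0.90) / default_cost)
```

Under this approximation, tightening confidence from 95% to 99% raises the requirement by roughly half, and raising power from 80% to 90% raises it by roughly a third.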
These four numbers do not negotiate. Change one and the required sample size moves with it, often dramatically.
The diagram below shows how these four feed forward into a sample size, and how traffic turns that into a runtime.
Why baseline and MDE interact so sharply
The relationship between these two inputs is the single most important thing to internalise, because it is what makes most low-traffic tests infeasible.
A 1-point absolute lift on a 2% baseline is a 50% relative move; the same 1-point lift on a 40% baseline is only a 2.5% relative move. Those are completely different jobs. For a given relative lift, the lower the baseline, the smaller the absolute gap you are hunting and the noisier each visitor is relative to that signal. You simply need more observations to see through that noise.
The MDE pushes in the same direction. The smaller the relative lift you want to catch, the more data it takes. As a rule, halving your MDE roughly quadruples the required sample, because the sample scales with the inverse square of the effect. Wanting to detect a 5% relative lift instead of a 20% one is not a small ask; it cuts the MDE to a quarter, which makes for a roughly sixteen-times-bigger experiment.
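The scaling rule can be checked with a short sketch. It assumes one common textbook form of the two-proportion sample size formula (two-sided test, equal-sized arms); the function name and defaults are illustrative, and a given calculator may use a slightly different variant.

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde, confidence=0.95, power=0.80):
    """Visitors needed per arm under the normal approximation (a sketch)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)   # rate you want to be able to detect
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# 3% baseline, as in the pricing-page illustration; MDEs of 20%, 10%, 5%.
reference = sample_size_per_variation(0.03, 0.20)
for mde in (0.20, 0.10, 0.05):
    n = sample_size_per_variation(0.03, mde)
    print(f"MDE {mde:.0%}: {n} per variation, {n / reference:.1f}x the 20% case")
```

The ratios come out close to, but not exactly, 4x and 16x, because the variance terms shift slightly as the target rate moves.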
This is why teams with modest traffic must set honest MDEs. There is no point powering a test for a 5% lift if it would take fourteen months to run. Set an MDE your traffic can actually deliver, or accept that the page is not testable right now and put your energy somewhere it pays off.
Rule of thumb: if you cannot reach the required sample in four to six weeks, the test is not viable on that page. Move to a higher-traffic surface or accept a larger MDE.
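The rule of thumb can also be run in reverse: given your traffic and a runtime ceiling, find the smallest MDE the page can actually support. The sketch below assumes the same textbook two-proportion formula, a six-week ceiling, and a simple 1%-step scan; all names and the 350-visitors-a-day figure are illustrative.

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde, confidence=0.95, power=0.80):
    """Visitors needed per arm under the normal approximation (a sketch)."""
    p1, p2 = baseline, baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

def smallest_feasible_mde(baseline, daily_traffic, max_weeks=6, variations=2):
    """Smallest relative MDE (in 1% steps) reachable within max_weeks."""
    budget_per_arm = daily_traffic * max_weeks * 7 / variations
    for percent in range(1, 201):
        mde = percent / 100
        if baseline * (1 + mde) >= 1:    # target rate must stay below 100%
            break
        if sample_size_per_variation(baseline, mde) <= budget_per_arm:
            return mde
    return None                          # not testable within the ceiling

# Hypothetical page: 4% baseline, 350 qualifying visitors a day.
print(smallest_feasible_mde(0.04, 350))
```

If the function returns a lift larger than anything you could plausibly achieve, that is the signal to move the test to a higher-traffic surface.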
How to get the number
Use any established A/B test sample size calculator. Evan Miller’s is the most widely used and the math behind it is sound. You enter the four inputs above and it returns the visitors required per variation. For a standard two-arm test (control plus one challenger), double it for the total you need.
The engine under every such calculator is the normal approximation to the binomial distribution. You do not need the formula to use the output correctly. You do need to treat that output as a floor, not a suggestion. It is the minimum at which your chosen power and confidence actually hold.
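For readers who want to see it, one common textbook form of that normal-approximation formula, for a two-sided test with equal-sized arms, is sketched below. The notation is introduced here for illustration and is not the article's: p_1 is the baseline rate, p_2 = p_1(1 + MDE) is the rate you want to detect, alpha is one minus the confidence level, and 1 - beta is the power. Individual calculators use slightly different variance terms, so their outputs will not match this form exactly.

```latex
n_{\text{per variation}}
  = \frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}\,(1-\bar{p})}
      \;+\; z_{1-\beta}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^{2}}
         {(p_2 - p_1)^{2}},
  \qquad \bar{p} = \frac{p_1 + p_2}{2}
```

The squared difference in the denominator is why halving the MDE roughly quadruples the sample.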
- Pull the baseline: read the current conversion rate for the exact page and audience you will test, straight from analytics.
- Set the MDE: decide the smallest lift worth shipping, then sanity-check it against your traffic, not your hopes.
- Lock confidence and power: 95% and 80% by default; tighten only when the decision is expensive to get wrong.
- Read the per-variation sample: multiply by the number of variations for the total you must collect.
- Divide by qualifying traffic: turn that total into days, then round up to whole weeks.
For the why behind 95% and 80%, and what those thresholds actually promise, see statistical significance without fooling yourself. For where this planning step sits in the wider workflow, see the CRO process in five steps.
From sample size to runtime: the simple math
Once you have the per-variation sample, runtime is pure division:
Runtime in days = (visitors per variation x number of variations) / qualifying traffic per day
Qualifying traffic is not total site visits. It is the visitors who actually enter the experiment, the ones who hit the tested page and meet any targeting you set. If a pricing page sees 500 visitors a day but the test targets only logged-out desktop users, qualifying traffic might be 200.
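The division is simple enough to sketch in a few lines. The per-variation sample of 5,000 below is a made-up placeholder, and the function name is illustrative; the 500 and 200 daily figures echo the pricing-page illustration above.

```python
def runtime_days(visitors_per_variation, variations, qualifying_traffic_per_day):
    """Raw runtime in days: total required sample divided by daily qualifying traffic."""
    if qualifying_traffic_per_day <= 0:
        raise ValueError("qualifying traffic per day must be positive")
    return visitors_per_variation * variations / qualifying_traffic_per_day

# Hypothetical requirement of 5,000 visitors per variation, two arms.
print(runtime_days(5000, 2, 500))  # using all page visitors: 20.0 days
print(runtime_days(5000, 2, 200))  # using only targeted visitors: 50.0 days
```

Counting total page traffic instead of qualifying traffic would understate the runtime by a factor of 2.5 in this illustration.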
That division gives you a floor, not a finish line. One constraint sits on top of it, and it is the one teams skip most often.
Whole business cycles, not just enough days
Traffic and intent are not flat across a week. B2B SaaS often sees engagement sag on weekends; e-commerce frequently peaks on them. Run from Tuesday to the following Monday and you have a clean full week. Stop on Sunday instead and you have skipped part of the cycle, and the sample is skewed by day-of-week effects.
The rule is simple: run whole weeks, minimum one full week, ideally two or more. Two full weeks is the practical standard, because it puts at least two of every weekday into the sample and smooths the weekly rhythm.
Monthly patterns matter too. If your product spikes at month-end, a test covering only week one will misrepresent that behaviour. For anything on a known monthly cycle, cover the full cycle before you call it.
And do not stop early just because you hit the sample mid-week. Stopping on a Thursday because the counter ticked over reintroduces exactly the day-of-week bias you were avoiding. Keep the test running until the current week is complete. This is data hygiene, not bureaucracy.
Rule of thumb: run whole calendar weeks, ideally at least two, and never stop mid-week even if the sample target is technically met.
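The whole-week rule is easy to encode. The sketch below rounds a raw runtime up to complete weeks and computes the last day of the test; the two-week floor is the practical standard described above, set here as an adjustable default, and the start date is chosen only for illustration.

```python
import math
from datetime import date, timedelta

def planned_weeks(raw_runtime_days, min_weeks=2):
    """Round a raw runtime up to whole weeks, never below min_weeks."""
    return max(min_weeks, math.ceil(raw_runtime_days / 7))

def end_date(start, weeks):
    """Last day of a test that covers `weeks` complete 7-day cycles."""
    return start + timedelta(days=7 * weeks - 1)

start = date(2024, 1, 2)        # a Tuesday, picked only for illustration
weeks = planned_weeks(17.5)     # 17.5 raw days rounds up to 3 whole weeks
print(weeks, end_date(start, weeks))
```

A test that starts on a Tuesday ends on a Monday, so every weekday appears the same number of times in the sample.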
A worked example
Imagine a SaaS company with a free-trial signup page. (The numbers below are illustrative, chosen to show the mechanics, not measured from a real account.)
| Input | Value | Why |
|---|---|---|
| Baseline rate | 4% | the page’s current trial-start rate |
| MDE | 25% relative (4% to 5%) | the smallest lift worth the build |
| Confidence | 95% | standard false-positive tolerance |
| Power | 80% | standard catch rate for a real effect |
Feed those into a standard calculator and it returns roughly 6,000 to 7,000 visitors per variation (the exact figure depends on the formula variant the calculator uses), so about 12,000 to 14,000 across the two arms. Suppose the signup page draws about 350 qualifying visitors a day (logged-out, desktop). Dividing the two-arm total by 350 lands the raw runtime at roughly five to six weeks, which you then round up to the next whole week.
Now change one input. Tighten the MDE to a 10% relative lift, ambitious but plausible, and the required sample grows roughly sixfold, pushing runtime past seven months at the same traffic. Nothing else changed; the smaller target alone did it.
That is the entire point of planning. Knowing this before launch, the team makes a deliberate call: accept the larger MDE, test a bolder change more likely to move the needle, or run on a higher-traffic page first. You are not filling in a form. You are deciding what is actually testable.
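To reproduce the example yourself, the sketch below runs the table's inputs through one common textbook form of the two-proportion formula and converts the result into whole weeks. The function names are illustrative, and a specific calculator may return somewhat different figures because it uses a different variance term.

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde, confidence=0.95, power=0.80):
    """Visitors needed per arm under the normal approximation (a sketch)."""
    p1, p2 = baseline, baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

def plan(relative_mde, baseline=0.04, daily_qualifying=350, variations=2):
    """Return (per-variation sample, total sample, raw days, whole weeks)."""
    n = sample_size_per_variation(baseline, relative_mde)
    total = n * variations
    raw_days = total / daily_qualifying
    return n, total, raw_days, math.ceil(raw_days / 7)

for mde in (0.25, 0.10):
    n, total, raw_days, weeks = plan(mde)
    print(f"MDE {mde:.0%}: {n} per variation, {total} total, "
          f"{raw_days:.1f} raw days, {weeks} whole weeks")
```

Swapping in your own baseline, MDE, and qualifying traffic gives a quick feasibility check before you open a calculator.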
Do this
- Set the MDE from what is worth shipping, then check it against your real traffic.
- Calculate the sample, then commit to running the full whole-week window.
- Treat the calculator output as a minimum, and read the result once at the end.
Not this
- Pick an MDE from the lift you are hoping for and launch on optimism.
- Eyeball the dashboard daily and stop the moment it looks like a winner.
- Run a low-traffic page for a fixed couple of weeks and trust whatever it shows.
Why underpowered tests are worse than no tests
An underpowered test does not just return an inconclusive result. It actively misleads you. When a thin-sampled test does cross nominal significance, that win is disproportionately likely to be a false positive. The statistics on this are well established. You ship noise dressed as a winner, then wonder why the lift never shows up in revenue.
The quieter failure is the test that does not reach significance. You run two weeks on a thin page, see no winner, and conclude the idea failed. But the test was never capable of detecting the effect in the first place. You have burned two weeks and drawn a wrong conclusion from a result that meant nothing.
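You can put a number on "never capable" by computing the power a test actually had. The sketch below inverts the same normal-approximation formula for a two-sided test with equal arms; the inputs (4% baseline, 25% relative MDE, 1,000 visitors per variation) are hypothetical, and the function name is illustrative.

```python
import math
from statistics import NormalDist

def achieved_power(baseline, relative_mde, n_per_variation, confidence=0.95):
    """Approximate chance of detecting the stated lift with this sample."""
    p1, p2 = baseline, baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard errors of the difference under "no effect" and under the lift.
    se_null = math.sqrt(2 * p_bar * (1 - p_bar) / n_per_variation)
    se_alt = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variation)
    return NormalDist().cdf((abs(p2 - p1) - z_alpha * se_null) / se_alt)

# Hypothetical thin page: only 1,000 visitors reached each arm.
print(f"{achieved_power(0.04, 0.25, 1000):.0%}")
```

A result well under 80% means a "no winner" outcome says almost nothing about whether the idea worked.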
Treat underpowered tests as non-events. Do not ship them, do not conclude from them, do not file them in your learning library. A test with no pre-calculated power guarantee is not a test; it is noise generation.
If you are building a real testing program, working a prioritised backlog and compounding what you learn, every result has to be one you can trust. See how to prioritise experiments with ICE for choosing what to run first, and A/B testing explained for what the test structure can and cannot prove. Pick the right test, power it correctly, read it correctly. That is how you convert more and guess less.
Frequently asked questions
What sample size do I need for an A/B test?
There is no universal number. It depends on four inputs: your baseline conversion rate, the minimum detectable effect you want to catch, your confidence level (usually 95%), and your power (usually 80%). Feed those into an established calculator and it returns the visitors required per variation. Lower baselines and smaller target lifts both push the requirement up sharply.
How long should an A/B test run?
Long enough to collect your required sample and long enough to cover whole business cycles. Take your total sample, divide by qualifying traffic per day, then round up to whole weeks, a minimum of one full week and ideally two or more. Whichever is longer wins. Never stop mid-week even if the sample target is already met.
Can I stop a test early once it hits significance?
Not in a standard fixed-horizon test. Peeking and stopping early inflates your false-positive rate, and stopping mid-week reintroduces day-of-week bias. Decide the sample size and runtime up front, run to the end of the whole-week window, then read the result once. See statistical significance without fooling yourself.
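A small simulation illustrates the peeking problem. The sketch below runs A/A tests, where both arms share the same true rate so every "winner" is a false positive, and compares checking significance every day against reading once at the end. All parameters (400 tests, 14 days, 100 visitors per arm per day, a 10% rate, the random seed) are arbitrary choices for illustration.

```python
import math
import random
from statistics import NormalDist

def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic; 0.0 if there is no variance yet."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se

def simulate_aa_tests(n_tests=400, days=14, per_arm_per_day=100,
                      rate=0.10, confidence=0.95, seed=7):
    """Return (false-positive rate with daily peeking, rate with one final read)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    peeking_hits = fixed_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        crossed_at_any_look = False
        for _day in range(days):
            conv_a += sum(rng.random() < rate for _ in range(per_arm_per_day))
            conv_b += sum(rng.random() < rate for _ in range(per_arm_per_day))
            n += per_arm_per_day
            if abs(z_stat(conv_a, n, conv_b, n)) > z_crit:
                crossed_at_any_look = True   # a peeker would have stopped here
        peeking_hits += crossed_at_any_look
        fixed_hits += abs(z_stat(conv_a, n, conv_b, n)) > z_crit
    return peeking_hits / n_tests, fixed_hits / n_tests

peeking_rate, fixed_rate = simulate_aa_tests()
print(f"daily peeking: {peeking_rate:.1%}, single final read: {fixed_rate:.1%}")
```

The single final read should land near the nominal 5%, while stopping at the first significant daily look produces false winners several times as often.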
What if my traffic is too low to reach the sample?
Then the test is not viable on that page as specified, and that is useful to know before you waste weeks. Your options are to accept a larger MDE (which shrinks the required sample), test a bolder change likely to produce a bigger effect, or run on a higher-traffic page first. Running it underpowered anyway is the one option that wastes time for nothing.
OptiWolf
OptiWolf is CRO and lead-generation software: A/B testing, personalization, and lead-capture popups on one measurement spine. The CRO Academy is where we share the playbooks. Convert more, guess less.
