The ad test significance calculator
Most teams call winners on noise. Plug in baseline conversion, the lift you'd care about, and your traffic - get sample size and time-to-significance.
Why ad tests are mostly fake
Imagine flipping two coins. After 10 flips, one has 6 heads and the other has 4. Did you discover that the first coin is "better"? No - you flipped them too few times. The difference is just luck.
Most ad tests work the same way. Teams declare a winner after 3 days because the numbers look different, but they haven't run long enough to know if the difference is real or just luck.
Real winners come from real tests. The calculator below tells you how long you actually need to run.
In one line: if your test runs less than a week with normal traffic, you're probably rewarding noise.
80% power - the right default for ad tests
90% - the realistic confidence threshold for creative
max test runtime under Andromeda fatigue
MDE most ad tests can realistically detect
Plug in your numbers. See if the test reads.
Two-proportion z-test, 80% power, two-tailed. Adjust baseline conversion rate, the lift you'd care about, your confidence threshold, and daily traffic - get sample size and time-to-significance.
The smallest relative lift you'd want to detect. Smaller MDE = bigger sample needed.
95% is the academic standard; 90% is the realistic creative-testing standard. 99% is overkill for ad creative.
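The calculation behind these numbers can be sketched in a few lines. The sketch below assumes a 2.5% baseline conversion rate (chosen to roughly reproduce the example figures on this page) and uses the unpooled-variance form of the two-proportion formula; calculators differ on pooled vs. unpooled variance and on rounding the z-scores, so exact counts vary by a few hundred.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, rel_mde, confidence=0.95, power=0.80):
    """Per-variant sample size for a two-tailed two-proportion z-test.

    baseline - control conversion rate, e.g. 0.025 for 2.5%
    rel_mde  - smallest relative lift worth detecting, e.g. 0.15 for 15%
    """
    p1 = baseline
    p2 = baseline * (1 + rel_mde)          # treatment rate at the target lift
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-tailed
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)  # unpooled variance sum
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Assumed 2.5% baseline, 15% MDE, 95% confidence, 80% power:
print(sample_size_per_variant(0.025, 0.15))  # close to the 29,160 shown below
```

Note how the MDE dominates: because the effect size is squared in the denominator, halving the MDE roughly quadruples the required sample.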
Sample per variant
29,160
conversions or visitors
Total sample
58,320
across both variants
Time to significance
At 2,000 visitors per day split across both variants (1,000/variant/day), this test hits 95% confidence with a 15% MDE in roughly 30 days.
Verdict
Too slow
Test takes longer than the fatigue cycle. Either widen your MDE (look for bigger lifts), drop confidence to 90%, or accept that you can't read this reliably with current traffic.
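The verdict reduces to a comparison between time-to-significance and your fatigue window. A minimal sketch, assuming a 21-day fatigue window as a placeholder (this page doesn't state the actual cycle length - substitute your own account's number):

```python
def verdict(n_per_variant, visitors_per_variant_per_day, fatigue_days=21):
    """Compare time-to-significance against an assumed creative-fatigue window.

    fatigue_days is a placeholder assumption, not a number from the calculator.
    """
    days = n_per_variant / visitors_per_variant_per_day
    if days <= fatigue_days:
        return "Readable", days
    return "Too slow", days

# The sample size above at 1,000 visitors/variant/day:
print(verdict(29_160, 1_000))  # ('Too slow', 29.16)
```

Widening the MDE or dropping confidence shrinks the required sample, which is the only lever that moves `days` if traffic is fixed.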
Anti-patterns
Five testing mistakes
The common ways ad teams produce confident-looking but meaningless test results.
Test more concepts, test them properly.
Shuttergen's variation engine produces 25 ship-ready variants per concept, which means you can test concepts head-to-head at the level of structural difference, where 30%+ MDEs are realistic, rather than headline tweaks, where true lifts hover around 3% and are effectively untestable at normal traffic.
Lineage tracking from inspiration → variation → result also means wins compound: when a concept beats baseline, the engine generates more variations along its winning structural axes, refining the library with each cycle instead of starting over.
The playbook
Eight rules for honest creative testing
Sources
What we read to build this
Stop calling tests on three days of variance.
Shuttergen produces enough structural variation that tests have meaningful MDEs - and lineage tracking compounds wins instead of resetting them.
Get started free