Your data team keeps rejecting your tests. Here is what they are actually looking at.

You spent weeks on the campaign. Briefed the designers, aligned stakeholders, got the green light. Then the data team looked at the results and said the numbers were not statistically significant. Here is what that actually means, and how to use it to make better decisions.

Picture this. You run an A/B test on two ad creatives. After two weeks, variant B shows a 12% higher click-through rate. You want to roll it out. The data team says no. The result is not statistically significant.

This conversation happens in marketing teams everywhere. And most of the frustration comes from one thing: marketers and analysts are answering different questions. The marketer sees a 12% difference and asks “is this real?” The analyst sees a p-value of 0.12 and answers “probably not.”

Understanding what sits between those two positions does not make you a statistician. It makes you a sharper decision-maker. And it changes every conversation you have with your data team.

What statistical significance is actually telling you

Every A/B test starts with two assumptions. The first is the null hypothesis: there is no real difference between variant A and variant B. The second is the alternative hypothesis: there is a real difference. Your test exists to figure out which one holds up.

Null hypothesis: there is no real difference between variant A and variant B. Any difference you see is just random variation in the data.

Alternative hypothesis: there is a real difference between variant A and variant B. The effect you observed is unlikely to be due to chance alone.

Statistical significance tells you one thing: how easily your result could have happened by chance alone. Not whether your result is good. Not whether it matters for the business. Just whether the observed difference is likely to be real or likely to be noise.

Your data team expresses this with a p-value: the probability of seeing a difference at least as large as yours if there were truly no difference between the variants. A p-value of 0.05 means a no-difference world would produce a result like yours only 5% of the time. A p-value of 0.20 means it would happen 20% of the time. The lower the p-value, the harder your result is to explain away as chance.
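
To make that concrete, here is a minimal sketch of the calculation behind a two-proportion z-test, the test typically used for click-through rates. The traffic numbers are hypothetical, chosen so that a 12% relative lift still comes out around p = 0.12, like the opening example.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under the null hypothesis both variants share one true rate,
    # so we pool the data to estimate it.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a gap at least this large if the null were true
    return 2 * norm.sf(abs(z))

# Hypothetical data: 17,500 users per variant, 2.0% vs 2.24% CTR
# (a 12% relative lift)
print(two_proportion_p_value(350, 17_500, 392, 17_500))  # ~0.12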

Which threshold to use, and when to bend the rules

Most people use 0.05 as their default significance threshold. It means you accept a 5% chance of acting on a false positive. That is a reasonable starting point, but it is not a law. The right threshold depends entirely on what is at stake.

p < 0.20 (exploratory): you accept a 20% false-positive risk. Low-cost tests only.
p < 0.05 (standard): a 5% false-positive risk. Most A/B tests. The default.
p < 0.01 (high stakes): a 1% false-positive risk. Large budget decisions. Be very sure.


The question to ask before every test is not “what p-value do I need?” It is “what is the cost of being wrong?” The higher the cost of a false positive, the stricter your threshold should be.

Statistical significance is not the same as practical significance

This is the most important distinction in this entire article. And almost nobody talks about it clearly.

A result can clear your significance threshold and still be completely useless. Imagine you run a test with 500,000 users. Variant B shows a statistically significant 0.3% improvement in conversion rate. Your p-value is 0.02. The data team calls it significant. But 0.3% on your current conversion volume generates an extra €800 per month. The engineering work to implement the change costs €15,000. No sane business makes that call.

The metric you are missing is effect size: the magnitude of the difference, not just the reliability of it. A result worth acting on needs to be both statistically significant and large enough to move your business.

Statistical significance asks: is the result real or random? It is measured by the p-value, and it can be achieved with a tiny effect on a huge sample. Example: p = 0.02 on a 0.3% lift in CTR.

Practical significance asks: is the result large enough to act on? It is measured by effect size, and it requires business context to evaluate. Example: a 0.3% lift that generates €800/month against €15k to implement.
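
If it helps to see the practical side as arithmetic, here is the €800-versus-€15k judgement as a back-of-the-envelope check. All figures are the hypothetical ones from the example above.

```python
# Practical significance as a quick break-even check,
# using the hypothetical figures from the example above.
monthly_value_of_lift = 800    # EUR/month generated by the 0.3% lift
implementation_cost = 15_000   # EUR, one-off engineering work

payback_months = implementation_cost / monthly_value_of_lift
print(f"Payback period: {payback_months:.0f} months")  # ~19 months

# A statistically significant result can still fail this check:
# p = 0.02 says the lift is probably real, not that it is worth building.
```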

Three ways tests go wrong, and how to avoid them

1. Checking results too early. You check daily and stop the test the moment it looks significant. Three weeks later the lift disappears. Decide your test duration before you start and do not look until it ends. The simulation after this list shows how badly daily peeking inflates false positives.

2. Running on too few people. Small samples produce unreliable results in both directions. Run a power analysis before the test starts: it tells you exactly how many users you need to detect a real effect.

3. Treating 0.05 as a hard wall. A p-value of 0.06 is not a failure. It is a signal. Interpret p-values on a continuum and combine them with effect size and business context before deciding what to do.
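
Here is the simulation promised above: a sketch, with made-up traffic numbers, of what daily peeking does when there is no real difference at all. Both variants share the same true rate, so every "significant" result it finds is a false positive by construction.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
TRUE_RATE = 0.02       # both variants identical: the null hypothesis is true
DAILY_USERS = 1_000    # hypothetical traffic per variant per day
DAYS = 21              # planned three-week test
N_EXPERIMENTS = 2_000

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * norm.sf(abs(z))

peeked_false_positives = 0   # "significant" on at least one daily check
planned_false_positives = 0  # significant at the planned end date only

for _ in range(N_EXPERIMENTS):
    conv_a = conv_b = users = 0
    dipped_below_threshold = False
    for _day in range(DAYS):
        conv_a += rng.binomial(DAILY_USERS, TRUE_RATE)
        conv_b += rng.binomial(DAILY_USERS, TRUE_RATE)
        users += DAILY_USERS
        if p_value(conv_a, users, conv_b, users) < 0.05:
            dipped_below_threshold = True
    peeked_false_positives += dipped_below_threshold
    planned_false_positives += p_value(conv_a, users, conv_b, users) < 0.05

print(f"Stop at the first 'significant' peek: {peeked_false_positives / N_EXPERIMENTS:.0%} false positives")
print(f"Wait for the planned end date:       {planned_false_positives / N_EXPERIMENTS:.0%} false positives")
```

The planned-end strategy stays near the 5% you signed up for. The peeking strategy crosses 0.05 at some point far more often, with 21 daily looks typically somewhere in the 20-30% range. That is the entire case for fixing the duration up front.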

Three things to do differently starting Monday

1. Before the test: set success criteria before it runs. Define your threshold, sample size, minimum detectable effect and duration. Write them down. Lock them in. Do not revisit them once the test is live. Do this once and you eliminate most test failures.

2. Reading results: always ask “significant, and how big?” When your analyst presents a significant result, your next question is always about effect size. How large is the difference in revenue or conversions at your current volume? Asking this every time stops costly overreactions.

3. Every test: use a sample size calculator. Tools like Evan Miller’s A/B test calculator are free, and the sketch after this list does the same calculation in code. Running a test without one is measuring with a broken ruler and trusting the result. Two minutes here saves weeks of wasted testing.
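
For readers who prefer a script to a web calculator, this is a sketch of the same sample-size calculation using statsmodels. The baseline rate and minimum detectable effect are hypothetical inputs; swap in your own.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical inputs: a 2.0% baseline CTR, and the smallest lift
# you would actually act on is 12% relative (2.0% -> 2.24%).
baseline = 0.02
minimum_detectable = baseline * 1.12

# Cohen's h: the standardised effect size for two proportions
effect = proportion_effectsize(minimum_detectable, baseline)

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,      # significance threshold
    power=0.80,      # 80% chance of detecting an effect this size
    alternative="two-sided",
)
print(f"Users needed per variant: {n_per_variant:,.0f}")
```

With these inputs it comes out around 28,000 users per variant, noticeably more than the 17,500 in the earlier p-value sketch, which is exactly how a promising-looking test ends up underpowered.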

What it costs to ignore this

Marketing teams that do not understand statistical significance make two expensive mistakes. They kill good ideas too early because a result looks flat before the sample is large enough. And they roll out bad ideas with confidence because a result looked significant before it was stable.

Both mistakes cost money. The first one kills the kind of incremental optimisation that compounds into real competitive advantage over time. The second one sends budget into changes that do not work, and sometimes actively makes things worse.

Getting this right does not require becoming a statistician. It requires asking better questions before a test starts, and resisting the urge to act on results before they are ready to be acted on.

The bottom line

Statistical significance is not a barrier your data team puts up. It is the standard that makes your wins defensible.

When a result is statistically significant and practically meaningful, you can walk into any budget conversation and defend the decision with evidence. When it is not, acting anyway is just expensive guessing with extra steps.

Set your criteria before the test. Check them after. Make the call with both numbers in hand.

Jelle Casper van Santen
Marketing data analyst with an MSc in Marketing & Business Analytics. Interested in all things related to attribution, marketing mix modelling, and experimentation.