Your data team keeps rejecting your tests. Here is what they are actually looking at.

You spent weeks on the campaign. Briefed the designers, aligned stakeholders, got the green light. Then the data team looked at the results and said the numbers were not statistically significant. Here is what that actually means, and how to use it to make better decisions.

Picture this. You run an A/B test on two ad creatives. After two weeks, variant B shows a 12% higher click-through rate. You want to roll it out. The data team says no. The result is not statistically significant.

This conversation happens in marketing teams everywhere. And most of the frustration comes from one thing: marketers and analysts are answering different questions. The marketer sees a 12% difference and asks “is this real?” The analyst sees a p-value of 0.12 and answers “probably not.”

Understanding what sits between those two positions does not make you a statistician. It makes you a sharper decision-maker. And it changes every conversation you have with your data team.

What statistical significance is actually telling you

Every A/B test starts with two assumptions. The first is the null hypothesis: there is no real difference between variant A and variant B. The second is the alternative hypothesis: there is a real difference. Your test exists to figure out which one holds up.

Null hypothesis: there is no real difference between variant A and variant B. Any difference you see is just random variation in the data.

Alternative hypothesis: there is a real difference between variant A and variant B. The effect you observed is unlikely to be due to chance alone.

Statistical significance tells you one thing: how easily your result could have happened by chance alone. Not whether your result is good. Not whether it matters for the business. Just whether the observed difference is likely to be real or likely to be noise.

Your data team expresses this with a p-value: the probability of seeing a difference at least as large as yours if there were truly no difference between the variants. A p-value of 0.05 means a no-difference world would produce a result like yours only 5% of the time. A p-value of 0.20 means it would happen 20% of the time. The lower the p-value, the harder your result is to explain away as chance.
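
To make that concrete, here is a minimal sketch of the calculation behind a two-proportion z-test, the test typically used for click-through rates. The traffic numbers are hypothetical, chosen so that a 12% relative lift still comes out around p = 0.12, like the opening example.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under the null hypothesis both variants share one true rate,
    # so we pool the data to estimate it.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a gap at least this large if the null were true
    return 2 * norm.sf(abs(z))

# Hypothetical data: 17,500 users per variant, 2.0% vs 2.24% CTR
# (a 12% relative lift)
print(two_proportion_p_value(350, 17_500, 392, 17_500))  # ~0.12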

Which threshold to use, and when to bend the rules

Most people use 0.05 as their default significance threshold. It means you accept a 5% chance of acting on a false positive. That is a reasonable starting point, but it is not a law. The right threshold depends entirely on what is at stake.

p < 0.20 (exploratory): you accept a 20% false-positive risk. Low-cost tests only.
p < 0.05 (standard): a 5% false-positive risk. Most A/B tests. The default.
p < 0.01 (high stakes): a 1% false-positive risk. Large budget decisions. Be very sure.


The question to ask before every test is not “what p-value do I need?” It is “what is the cost of being wrong?” The higher the cost of a false positive, the stricter your threshold should be.

Statistical significance is not the same as practical significance

This is the most important distinction in this entire article. And almost nobody talks about it clearly.

A result can clear your significance threshold and still be completely useless. Imagine you run a test with 500,000 users. Variant B shows a statistically significant 0.3% improvement in conversion rate. Your p-value is 0.02. The data team calls it significant. But 0.3% on your current conversion volume generates an extra €800 per month. The engineering work to implement the change costs €15,000. No sane business makes that call.

The metric you are missing is effect size: the magnitude of the difference, not just the reliability of it. A result worth acting on needs to be both statistically significant and large enough to move your business.

Statistical significance asks: is the result real or random? It is measured by the p-value, and it can be achieved with a tiny effect on a huge sample. Example: p = 0.02 on a 0.3% lift in CTR.

Practical significance asks: is the result large enough to act on? It is measured by effect size, and it requires business context to evaluate. Example: a 0.3% lift that generates €800/month against €15k to implement.
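
If it helps to see the practical side as arithmetic, here is the €800-versus-€15k judgement as a back-of-the-envelope check. All figures are the hypothetical ones from the example above.

```python
# Practical significance as a quick break-even check,
# using the hypothetical figures from the example above.
monthly_value_of_lift = 800    # EUR/month generated by the 0.3% lift
implementation_cost = 15_000   # EUR, one-off engineering work

payback_months = implementation_cost / monthly_value_of_lift
print(f"Payback period: {payback_months:.0f} months")  # ~19 months

# A statistically significant result can still fail this check:
# p = 0.02 says the lift is probably real, not that it is worth building.
```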

Three ways tests go wrong, and how to avoid them

1. Checking results too early. You check daily and stop the test the moment it looks significant. Three weeks later the lift disappears. Decide your test duration before you start and do not look until it ends. The simulation after this list shows how badly daily peeking inflates false positives.

2. Running on too few people. Small samples produce unreliable results in both directions. Run a power analysis before the test starts: it tells you exactly how many users you need to detect a real effect.

3. Treating 0.05 as a hard wall. A p-value of 0.06 is not a failure. It is a signal. Interpret p-values on a continuum and combine them with effect size and business context before deciding what to do.
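
Here is the simulation promised above: a sketch, with made-up traffic numbers, of what daily peeking does when there is no real difference at all. Both variants share the same true rate, so every "significant" result it finds is a false positive by construction.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
TRUE_RATE = 0.02       # both variants identical: the null hypothesis is true
DAILY_USERS = 1_000    # hypothetical traffic per variant per day
DAYS = 21              # planned three-week test
N_EXPERIMENTS = 2_000

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * norm.sf(abs(z))

peeked_false_positives = 0   # "significant" on at least one daily check
planned_false_positives = 0  # significant at the planned end date only

for _ in range(N_EXPERIMENTS):
    conv_a = conv_b = users = 0
    dipped_below_threshold = False
    for _day in range(DAYS):
        conv_a += rng.binomial(DAILY_USERS, TRUE_RATE)
        conv_b += rng.binomial(DAILY_USERS, TRUE_RATE)
        users += DAILY_USERS
        if p_value(conv_a, users, conv_b, users) < 0.05:
            dipped_below_threshold = True
    peeked_false_positives += dipped_below_threshold
    planned_false_positives += p_value(conv_a, users, conv_b, users) < 0.05

print(f"Stop at the first 'significant' peek: {peeked_false_positives / N_EXPERIMENTS:.0%} false positives")
print(f"Wait for the planned end date:       {planned_false_positives / N_EXPERIMENTS:.0%} false positives")
```

The planned-end strategy stays near the 5% you signed up for. The peeking strategy crosses 0.05 at some point far more often, with 21 daily looks typically somewhere in the 20-30% range. That is the entire case for fixing the duration up front.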

Three things to do differently starting Monday

1. Before the test: set success criteria before it runs. Define your threshold, sample size, minimum detectable effect and duration. Write them down. Lock them in. Do not revisit them once the test is live. Do this once and you eliminate most test failures.

2. Reading results: always ask “significant, and how big?” When your analyst presents a significant result, your next question is always about effect size. How large is the difference in revenue or conversions at your current volume? Asking this every time stops costly overreactions.

3. Every test: use a sample size calculator. Tools like Evan Miller’s A/B test calculator are free, and the sketch after this list does the same calculation in code. Running a test without one is measuring with a broken ruler and trusting the result. Two minutes here saves weeks of wasted testing.
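
For readers who prefer a script to a web calculator, this is a sketch of the same sample-size calculation using statsmodels. The baseline rate and minimum detectable effect are hypothetical inputs; swap in your own.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical inputs: a 2.0% baseline CTR, and the smallest lift
# you would actually act on is 12% relative (2.0% -> 2.24%).
baseline = 0.02
minimum_detectable = baseline * 1.12

# Cohen's h: the standardised effect size for two proportions
effect = proportion_effectsize(minimum_detectable, baseline)

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,      # significance threshold
    power=0.80,      # 80% chance of detecting an effect this size
    alternative="two-sided",
)
print(f"Users needed per variant: {n_per_variant:,.0f}")
```

With these inputs it comes out around 28,000 users per variant, noticeably more than the 17,500 in the earlier p-value sketch, which is exactly how a promising-looking test ends up underpowered.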

What it costs to ignore this

Marketing teams that do not understand statistical significance make two expensive mistakes. They kill good ideas too early because a result looks flat before the sample is large enough. And they roll out bad ideas with confidence because a result looked significant before it was stable.

Both mistakes cost money. The first one kills the kind of incremental optimisation that compounds into real competitive advantage over time. The second one sends budget into changes that do not work, and sometimes actively makes things worse.

Getting this right does not require becoming a statistician. It requires asking better questions before a test starts, and resisting the urge to act on results before they are ready to be acted on.

The bottom line

Statistical significance is not a barrier your data team puts up. It is the standard that makes your wins defensible.

When a result is statistically significant and practically meaningful, you can walk into any budget conversation and defend the decision with evidence. When it is not, acting anyway is just expensive guessing with extra steps.

Set your criteria before the test. Check them after. Make the call with both numbers in hand.

Jelle Casper van Santen
Marketing data analyst with an MSc in Marketing & Business Analytics. Interested in all things related to attribution, marketing mix modelling, and experimentation.