Marketing and Research Consulting for a Brave New World

We’re told that A/B tests are a gold standard for revealing causality.  But marketers might not be learning as much as they think.  Here are four ways that “test and learn” can go bad and what to do about it.

1. Your ad lift test is flawed

It is really easy to design a flawed test, and I have seen more that are flawed than not.  While your control cell (either pre-matched or post-hoc) can easily be matched on demos, those are not the big variables.  The big things you must match on are prior brand propensity and media consumption propensity. In particular, I see prior brand propensity left out, usually because such data are often not available to a particular provider, but it is a killer to do so. Prior brand propensity is the single most predictive variable of conversion during a campaign period, and when you leave it out of the balancing, you will conclude that your advertising treatment is working better than it really is. There are two reasons for this: 1) because targeting models are always in play, high-propensity consumers are more prevalent in the exposed group, and they would have been more likely to convert even without ad exposure, and 2) more favorable consumers are more responsive to advertising (the Movable Middle effect).  So, there’s your double whammy of bad testing.

I have seen numerous test vs. control results from media companies, AdTech, agencies, and research providers who do not control for brand propensity, so yes, this happens. Even though properly designed RCTs, including Google’s ghost advertising approach, minimize the probability of a mismatch, it is still a possible outcome.
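To see the double whammy in numbers, here is a minimal simulation (all rates and the targeting skew are hypothetical, chosen only for illustration) in which the ad has zero true effect, yet the naive exposed-vs-control comparison shows a healthy lift. Stratifying by prior brand propensity makes the phantom lift disappear.

```python
import random

random.seed(0)

# Hypothetical setup: the ad does NOTHING, but targeting models put
# high-propensity consumers into the exposed group more often.
N = 100_000
consumers = []
for _ in range(N):
    high_propensity = random.random() < 0.3                # prior brand propensity
    exposed = random.random() < (0.7 if high_propensity else 0.3)
    base_rate = 0.10 if high_propensity else 0.02          # conversion depends only on propensity
    converted = random.random() < base_rate
    consumers.append((high_propensity, exposed, converted))

def conv_rate(rows):
    return sum(c for _, _, c in rows) / len(rows)

exposed = [r for r in consumers if r[1]]
control = [r for r in consumers if not r[1]]
naive_lift = conv_rate(exposed) - conv_rate(control)

# Balance on prior propensity: compute lift within each stratum,
# then weight the strata back together.
strat_lift = 0.0
for hp in (True, False):
    e = [r for r in exposed if r[0] == hp]
    c = [r for r in control if r[0] == hp]
    strat_lift += (len(e) + len(c)) / N * (conv_rate(e) - conv_rate(c))

print(f"naive lift:      {naive_lift:.4f}")   # spuriously positive
print(f"stratified lift: {strat_lift:.4f}")   # near zero -- the truth
```

The naive comparison credits the ad with lift that is really just the targeting model finding people who were going to convert anyway.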


Plan A: work with ad lift providers who can balance on prior brand propensity, like DISQO (disclosure: I consult for them).

Plan B: ensure that you can measure pre-campaign conversion rates for test vs. control.  If they are different, use a counterfactual adjustment (a tricky model, but one that can be created). If the matching is not very good, build into the contract that the provider will repeat the experiment, or at least bootstrap and take the bootstrapped sample where the match is better.
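One simple, illustrative stand-in for that counterfactual adjustment is a difference-in-differences calculation: subtract each cell's own pre-campaign baseline before comparing. This sketch uses hypothetical conversion rates and is much cruder than a full counterfactual model, but it shows the mechanics of the pre-campaign check.

```python
import random

random.seed(2)

def simulate(rate, n=50_000):
    """Simulate n consumers converting independently at the given rate."""
    return sum(random.random() < rate for _ in range(n)) / n

# Mismatched cells (hypothetical): the test cell skews higher-propensity
# and already converts better BEFORE the campaign.
test_pre,    test_post    = simulate(0.050), simulate(0.065)  # ad adds ~1.5 pts
control_pre, control_post = simulate(0.030), simulate(0.030)  # no ad, flat

# Step 1 of Plan B: the pre-campaign gap flags a matching problem.
print(f"pre-campaign gap: {test_pre - control_pre:.4f}")

naive_lift = test_post - control_post                          # inflated by the mismatch
did_lift = (test_post - test_pre) - (control_post - control_pre)

print(f"naive lift: {naive_lift:.4f}")   # overstates the ad effect
print(f"DiD lift:   {did_lift:.4f}")     # closer to the true ~1.5 pts
```

The difference-in-differences estimate is only valid if both cells would have trended in parallel absent the ad, which is exactly why the author calls the real adjustment model tricky.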

2. No generalizable finding is revealed

When you test, say, using programmatic display targeted via frequent shopper data and separately test Facebook advertising targeted by interest groups, are you ready to generalize?  Will the generalization be to the brand, or to all of your brands as a general principle? Or is the result merely parochial to the campaign alone?  If you are not prepared to make generalizations, and all you learned about was this specific campaign, you just didn’t learn very much.

Strategy: design the test with a principle in mind that will lead to generalization.

3. A paradox leads to wrong conclusions

Does smoking cause liver damage? Data in the wild might suggest “yes” when this might not be the right answer about causality IF heavy smokers are also heavy drinkers. 

Here is one place this could happen in marketing. Suppose we see a good response to an ad treatment in a test and want to trace out frequency curves.  We might see that high ad frequency leads to LOWER lift. Is this truly because we turned off consumers with excessive frequency?  Perhaps.  But it is also possible that those who saw the ad many times differ in a meaningful way from those who saw it with more modest frequency.

Strategy: Treat each group of consumers seeing a certain range of ad frequency exposure as a cohort and have a proper control cell (or counterfactual adjustment) for each.  In general, I prefer twinning as a way of developing a control, rather than matching important variables in the aggregate, since twinning easily allows you to break apart your data and have a control for each sub-group.
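A small simulation makes the paradox concrete. The cohort mix and rates below are hypothetical: heavy media users rack up high ad frequency but convert less to begin with, so the aggregate frequency curve bends down even though frequency does no harm within any cohort.

```python
import random

random.seed(3)

def conv(rate, n):
    """Simulate n exposed consumers; return (conversions, n)."""
    return sum(random.random() < rate for _ in range(n)), n

# Hypothetical cells: within each cohort, low- and high-frequency
# consumers convert at the SAME rate -- frequency is harmless here.
cells = {
    ("light", "low"):  conv(0.070, 30_000),
    ("light", "high"): conv(0.070, 5_000),
    ("heavy", "low"):  conv(0.025, 5_000),
    ("heavy", "high"): conv(0.025, 30_000),
}

def rate(pairs):
    return sum(c for c, _ in pairs) / sum(n for _, n in pairs)

# Aggregate view: high frequency looks much worse...
agg_low  = rate([v for (_, f), v in cells.items() if f == "low"])
agg_high = rate([v for (_, f), v in cells.items() if f == "high"])
print(f"aggregate: low {agg_low:.4f} vs high {agg_high:.4f}")

# ...but the per-cohort view shows no frequency effect at all.
within = {}
for cohort in ("light", "heavy"):
    lo, hi = rate([cells[(cohort, "low")]]), rate([cells[(cohort, "high")]])
    within[cohort] = hi - lo
    print(f"{cohort}: low {lo:.4f} vs high {hi:.4f}")
```

The aggregate comparison mixes cohorts with different base rates; the per-cohort comparison, with its own control for each sub-group, is the one that answers the causal question.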

4. Your test is underpowered

While researchers typically focus on Type I error, there is also Type II error: the failure to spot a difference when there really is one.  If the difference you expect to see that would trigger marketing action is small, you may need a larger sample size than thinking only about Type I error would indicate.  This is a common pitfall with product testing, for example.  You degrade the product in a ‘small way’ for margin improvement, hoping consumers won’t notice.  If the sample size is too small, a meaningful level of consumer disappointment will not register as statistically significant at the 95% level, and the marketer will think the change wasn’t noticed…but it was.  A larger sample size would have better managed Type II error, and a mistake would have been avoided.
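A quick power calculation shows how easily this happens. The sketch below uses the standard normal approximation for a two-proportion z-test and hypothetical acceptance rates: a "small" reformulation drops product acceptance from 80% to 77%, a difference a modest test will almost never detect.

```python
import math

def power(p1, p2, n, z_alpha=1.96):
    """Approximate power of a two-sided two-proportion z-test, n per cell
    (normal approximation; z_alpha = 1.96 for a 95% confidence level)."""
    p_bar = (p1 + p2) / 2
    se0 = math.sqrt(2 * p_bar * (1 - p_bar) / n)            # SE under H0
    se1 = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)  # SE under H1
    z = (abs(p1 - p2) - z_alpha * se0) / se1
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))           # standard normal CDF

# Hypothetical 3-point drop in acceptance (80% -> 77%) after the change:
for n in (200, 500, 2000, 5000):
    print(f"n per cell = {n:>5}: power = {power(0.80, 0.77, n):.2f}")
```

With a few hundred respondents per cell, power sits well below 50%: the test will usually miss a real 3-point drop, and the marketer will wrongly conclude consumers didn't notice. Sizing the sample for the smallest difference that would trigger action manages the Type II risk.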

A/B testing is valuable for adjudicating a conflict between MMM and MTA results.  It is a gold standard for addressing a critical controversy within the marketing team about directing ad investment.  But it has to be done right or it becomes fake news.
