Are You Reading Test Results Wrong?
Will Moyle
In digital marketing, everyone loves a big, flashy test result — especially in organizations with a strong testing culture. Being able to say “our test showed that the use of conditional content boosted email engagement by 25%” is extremely gratifying. It demonstrates significant impact to the client and provides us with clear direction for crafting future content.
A finding like this is even better when combined with one of my favorite phrases: “The results were 99% statistically significant.” Few phrases are more satisfying for an analyst to say.
But what about tests that don’t generate statistically significant results? It may seem counterintuitive, but these results may still contain a huge amount of value — as long as you’re reading them in the correct way.
Many of the tests we run on email, web, or social media don’t show any difference in performance between variations. One’s first reaction when seeing these results might be disappointment: These results tend to be viewed as a failure and can often be swept under the rug in favor of the more “successful” tests.
This practice is not just problematic; I believe it can be actively damaging to a digital communications program.
In academic circles, this is known as publication bias: When studies don’t produce significant results, they are much less likely to be published, which skews the available evidence toward positive findings and, some believe, may have serious consequences for scientific progress.
When a well-designed test with a large sample size does not generate significant results, the interpretation shouldn’t always be, “We don’t know if version A is better than version B.” An equally valid interpretation could be, “Version A is just as good as version B.” The difference may be subtle, but it’s vital.
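One way to tell “we don’t know” apart from “just as good” is to look at a confidence interval for the difference between the two versions rather than only at a significance flag. The sketch below is a minimal illustration, not the method used in our tests: it assumes a simple normal approximation for two conversion rates, and the function name and all of the numbers are hypothetical.

```python
import math


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """Approximate 95% confidence interval for (rate of B - rate of A).

    Uses the normal approximation for two independent proportions,
    which is reasonable for large samples. z=1.96 corresponds to 95%.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    diff = rate_b - rate_a
    # Standard error of the difference between two independent proportions
    se = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    return diff, diff - z * se, diff + z * se


# Hypothetical test: 50,000 recipients per version, about 2% converting in each
diff, low, high = diff_confidence_interval(1000, 50_000, 1010, 50_000)
print(f"difference: {diff:.2%}, 95% interval: {low:.2%} to {high:.2%}")
```

If the interval includes zero and is narrow enough that even its endpoints would not change a decision, “Version A is just as good as version B” is a reasonable reading. If the interval includes zero but is wide, the honest answer is still “we don’t know,” which usually points to a sample that was too small.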
Of course, if neither version performs particularly well, our creative teams are pushed to find a new approach. But if both tactics prove to be engaging and there is no statistical difference in performance, our creative teams now have twice the number of tools in their toolbox when working on new content.
For example, during a recent cold weather emergency fundraising campaign for our client Covenant House, our designer created two separate email templates: one built around a photo and one without a photo.
Although these styles were markedly different, there was no difference in fundraising performance between the two emails. One might be tempted to dismiss the test as a failure and move on. Instead, our conclusions were:
The success of these emergency campaigns is less affected by template design than that of other campaigns, indicating that our audience is influenced more by the urgency of the ask itself than by the way in which it is presented.
The variation with the photo is closer to where we want the email brand to be; where possible, we will follow this style for similar campaigns in the future.
However, the variation without the photo was quicker to produce; therefore, in rapid response moments, we know that we can use this approach and not risk sacrificing potential revenue from the campaign for the sake of speed.
One additional warning: Don’t keep repeating the same test until you get a significant result. If we run 10 email tests comparing static and GIF images, and only the 10th test produces a statistically significant result, that doesn’t mean that one approach is better than the other. As long as each test has an appropriate sample size, your 10 tests indicate that 90% of the time there is no significant difference in performance between the two variations, and you should use whichever graphical style is most appropriate for the content.
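The arithmetic behind this warning is easy to check. The sketch below assumes a conventional 5% significance threshold and independent tests, neither of which is specified above, and asks how often at least one test would come back “significant” even if the two variations truly performed the same. The function name is hypothetical.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` independent tests
    is significant purely by chance when there is no real difference.

    alpha is the significance threshold (0.05 is assumed here).
    """
    return 1 - (1 - alpha) ** num_tests


for n in (1, 5, 10):
    print(f"{n} tests: {chance_of_false_positive(n):.1%}")
```

Under these assumptions, repeating the test 10 times gives roughly a 40% chance of at least one spurious “win,” so a single significant result at the end of a long run of repeats deserves skepticism rather than celebration.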
As a digital analyst, I sometimes worry that my role is to restrict the creative freedom of my colleagues: “We know the exact combination of sender, subject line, and graphical treatment that works best for this client. Do not deviate from this.” So it’s refreshing when I can say, “Feel free to choose whichever approach you feel is right for this content.”
This article was originally published on bluestate.co and has been recreated here and on LinkedIn with permission from Blue State.