I’ve been asked if I agree with Regina Nuzzo’s recent note on p-values [i]. I don’t want to be nit-picky, but one very small addition to Nuzzo’s helpful tips for communicating statistical significance can make it a great deal more helpful. Here’s my friendly amendment. She writes:
Basics to remember
What’s most important to keep in mind? That we use p-values to alert us to surprising data results, not to give a final answer on anything. (Or at least that’s what we should be doing). And that results can get flagged as “statistically surprising” with a small p-value for a number of reasons:
- There was a fluke. Something unusual happened in the data just by chance.
- Something was violated. By this I mean there was a mismatch between what was actually done in the data analysis and what needed to be done for the p-value to be a valid indicator. One little-known requirement, for example, is that the data analysis be planned before looking at the data. Another is that all analyses and results be presented, no matter the outcome. Yes, these seem like strange, nit-picky rules, but they’re part of the deal when using p-values. A small p-value might simply be a sign that data analysis rules were broken.
- There was a real but tiny relationship, so tiny that we shouldn’t really care about it. A large trial can detect a true effect that is too small to matter at all, but a p-value will still flag it as being surprising.
- There was a relationship that is worth more study. There’s more to be done. Can the result be replicated? Is the effect big enough to matter? How does it relate to other studies?
Or any combination of the above. (Nuzzo)
My tiny addition is to the next-to-last sentence of #2, “the dark horse that we can’t ignore”. I suggest replacing it with (something like):
One little-known requirement, for example, is that the data analysis be planned before looking at the data. Another is that all analyses and results be presented, no matter the outcome. Yes, these seem like strange, nit-picky rules, but they’re part of the deal in making statistical inferences (or the like).
Why is that tiny addition (“in making statistical inferences” in place of “when using p-values”) so important? Because without it many people suppose that other statistical methods don’t have to worry about post-data selection, selective reporting and the like. With all the airplay it has been receiving (the acknowledged value of preregistration, the “21 word solution”[ii] and many other current reforms), hopefully the requirement to avoid ad hoc moves along the “forking paths” (Gelman and Loken 2014) in collecting and interpreting data is no longer “little-known”. But it’s very important to emphasize that the danger of being misled is not unique to statistical significance tests. The same P-hacked hypothesis can find its way into likelihood ratios, Bayes factors, posterior probabilities and credible regions. In their recent Significance article, “Cargo-cult statistics”, Stark and Saltelli emphasize:
“The misuse of p-values, hypothesis tests, and confidence intervals might be deemed frequentist cargo-cult statistics. There is also Bayesian cargo-cult statistics. While a great deal of thought has been given to methods for eliciting priors, in practice, priors are often chosen for convenience or out of habit; perhaps worse, some practitioners choose the prior after looking at the data, trying several priors, and looking at the results – in which case Bayes’ rule no longer applies!”
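To make the point concrete, here is a minimal simulation sketch of selective reporting hitting a p-value and a Bayes factor alike. Everything in it is an illustrative assumption rather than anything from Nuzzo or from Stark and Saltelli: each outcome is summarized by a standard normal z-statistic with the null true, the Bayes factor compares H0: mu = 0 with H1: mu ~ N(0, tau^2) using n * tau^2 = 1, and the cutoffs (p < 0.05, Bayes factor > 3) and the count of 20 outcomes are arbitrary. The function names are hypothetical.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


def bayes_factor_10(z, g=1.0):
    """Bayes factor for H1: mu ~ N(0, tau^2) against H0: mu = 0.

    g = n * tau^2 is the prior variance relative to the sampling variance
    of the mean; g = 1 is an assumed choice made only for illustration.
    """
    return math.sqrt(1.0 / (1.0 + g)) * math.exp(0.5 * z * z * g / (1.0 + g))


def flag_rates(n_outcomes, n_studies=20000, alpha=0.05, bf_cut=3.0, seed=1):
    """Simulate studies in which every null hypothesis is true.

    Each study measures n_outcomes independent outcomes and reports only
    the most extreme one. Returns the share of studies flagged by
    p < alpha and the share flagged by BF10 > bf_cut.
    """
    rng = random.Random(seed)
    p_hits = 0
    bf_hits = 0
    for _ in range(n_studies):
        # Selective reporting: keep only the outcome with the largest |z|.
        best_z = max((rng.gauss(0.0, 1.0) for _ in range(n_outcomes)), key=abs)
        p_hits += two_sided_p(best_z) < alpha
        bf_hits += bayes_factor_10(best_z) > bf_cut
    return p_hits / n_studies, bf_hits / n_studies


if __name__ == "__main__":
    for k in (1, 20):
        p_rate, bf_rate = flag_rates(k)
        print(f"outcomes tried: {k:2d}  p<0.05: {p_rate:.3f}  BF10>3: {bf_rate:.3f}")
```

Under these assumptions, reporting the best of 20 null outcomes pushes the rate of p < 0.05 from about 0.05 to about 1 - 0.95^20, roughly 0.64, and the rate of Bayes factors above 3 climbs as well, from under 0.02 to over 0.25. The particular numbers depend entirely on the assumed prior and cutoffs; the point is only that the selection effect does not care which summary is computed afterward.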
[i] I’m always asked these things nowadays on Twitter, and I’m keen to bring people back to my blog as a far more appropriate place to actually have a discussion.
[ii] “We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study” (Simmons, Nelson, and Simonsohn 2012, p. 4).
References not linked to
Gelman, A. and Loken, E. (2014). ‘The Statistical Crisis in Science’, American Scientist 102(6), 460–5.
Simmons, J., Nelson, L., and Simonsohn, U. (2012). ‘A 21 word solution’, Dialogue: The Official Newsletter of the Society for Personality and Social Psychology 26(2), 4–7.