I’ve been asked whether I agree with Regina Nuzzo’s recent note on p-values [i]. I don’t want to be nit-picky, but one very small addition to Nuzzo’s tips for communicating statistical significance would make them a great deal more helpful. Here’s my friendly amendment. She writes:
Basics to remember
What’s most important to keep in mind? That we use p-values to alert us to surprising data results, not to give a final answer on anything. (Or at least that’s what we should be doing). And that results can get flagged as “statistically surprising” with a small p-value for a number of reasons:
- There was a fluke. Something unusual happened in the data just by chance.
- Something was violated. By this I mean there was a mismatch between what was actually done in the data analysis and what needed to be done for the p-value to be a valid indicator. One little-known requirement, for example, is that the data analysis be planned before looking at the data. Another is that all analyses and results be presented, no matter the outcome. Yes, these seem like strange, nit-picky rules, but they’re part of the deal when using p-values. A small p-value might simply be a sign that data analysis rules were broken.
- There was a real but tiny relationship, so tiny that we shouldn’t really care about it. A large trial can detect a true effect that is too small to matter at all, but a p-value will still flag it as being surprising.
- There was a relationship that is worth more study. There’s more to be done. Can the result be replicated? Is the effect big enough to matter? How does it relate to other studies?
Or any combination of the above. (Nuzzo)
My tiny addition is to the next-to-last sentence of #2, “the dark horse that we can’t ignore”. I suggest replacing it with (something like):
One little-known requirement, for example, is that the data analysis be planned before looking at the data. Another is that all analyses and results be presented, no matter the outcome. Yes, these seem like strange, nit-picky rules, but they’re part of the deal **in making statistical inferences** (or the like).
Why is that tiny addition (in bold) so important? Because without it, many people suppose that other statistical methods don’t have to worry about post-data selections, selective reporting, and the like. With all the airplay these problems have been receiving (the acknowledged value of preregistration, the “21 word solution”[ii], and many other current reforms), hopefully the danger of ad hoc moves in the “forking paths” (Gelman and Loken 2014) of collecting and interpreting data is no longer a “little-known requirement”. But it’s very important to emphasize that the threat of being misled is not unique to statistical significance tests. The same P-hacked hypothesis can find its way into likelihood ratios, Bayes factors, posterior probabilities and credible regions. In their recent Significance article, “Cargo-cult statistics”, Stark and Saltelli emphasize:
“The misuse of p-values, hypothesis tests, and confidence intervals might be deemed frequentist cargo-cult statistics. There is also Bayesian cargo-cult statistics. While a great deal of thought has been given to methods for eliciting priors, in practice, priors are often chosen for convenience or out of habit; perhaps worse, some practitioners choose the prior after looking at the data, trying several priors, and looking at the results – in which case Bayes’ rule no longer applies!”
As I note in my comment, principle 4 of the ASA P-value document states that P-values become spurious with multiple testing and other selection effects. It is far from clear whether this rule is taken to hold for methods outside the error statistical school. Nuzzo seems to be implying that such “strange, nit-picky” rules do not have to be followed by Bayesians and others. Perhaps I’m reading into what she wrote, but if the requirement does apply to them too, then she should say so.
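To see the worry in numbers, here is a minimal simulation sketch (my own illustration; the sample size, the number of candidate outcomes, and the evidence cutoffs are arbitrary choices, not anything from Nuzzo, Stark and Saltelli, or the ASA): the data are pure noise, but the analyst tries twenty outcomes and reports only the most impressive one. The selection inflates the rate of small p-values, and the very same selection inflates the maximized likelihood ratio, so the distortion is not confined to significance tests.

```python
# Minimal sketch: hunting through k outcomes and reporting the "winner" distorts
# both p-values and likelihood ratios, even though every null hypothesis is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n, k, sims = 30, 20, 5_000        # observations per outcome, candidate outcomes, trials
small_p = 0
big_lr = 0

for _ in range(sims):
    data = rng.normal(loc=0.0, scale=1.0, size=(k, n))   # true effect is exactly zero
    xbar = data.mean(axis=1)
    se = data.std(axis=1, ddof=1) / np.sqrt(n)
    p = 2 * stats.t.sf(np.abs(xbar / se), df=n - 1)      # two-sided one-sample t-tests
    best = int(np.argmin(p))                             # report only the best-looking outcome
    small_p += p[best] < 0.05
    # Likelihood ratio for the winner: H1 fitted at the observed mean vs H0: mu = 0
    # (known-sigma normal approximation, purely for illustration)
    big_lr += np.exp(0.5 * n * xbar[best] ** 2) > 8      # "fairly strong" evidence cutoff

print(f"share of trials with reported p < 0.05 : {small_p / sims:.2f} (nominal 0.05)")
print(f"share of trials with reported LR > 8   : {big_lr / sims:.2f} (data are pure noise)")
```

The point of the sketch is only the comparison: the same hidden search that makes the reported p-value misleading makes the reported likelihood ratio misleading, and any posterior computed from the selected hypothesis inherits the problem.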
[i]I’m always asked these things nowadays on twitter, and I’m keen to bring people back to my blog as a far more appropriate place to actually have a discussion.
[ii] “We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study” (Simmons, Nelson and Simonsohn 2012, p. 4).
References not linked to
Gelman, A. and Loken, E. (2014). ‘The Statistical Crisis in Science’, American Scientist 102(6), 460–5.
Simmons, J., Nelson, L., and Simonsohn, U. (2012). ‘A 21 word solution’, Dialogue: The Official Newsletter of the Society for Personality and Social Psychology 26(2), 4–7.
By failing to add this point, Nuzzo, like many others (whether unwittingly or not), encourages the conception that p-values are scary and quirky, requiring odd pieces of information that other accounts do not. Even if different accounts must appeal to different resources, none is free of concerns about selective reporting when conducting statistical inference*. Significance tests, like all error statistical methods, have formal ways to pick up on the consequences of biasing selection effects: such effects alter, and possibly invalidate, a method’s error probabilities. In so doing, they alter the method’s probative capacity, and thus the severity with which claims are warranted. Even if there are many different ways to adjust error probabilities to take this into account, what matters is having a grounded way to be alerted that reported numbers may be spurious.
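As a back-of-the-envelope sketch of what such an adjustment looks like (my own toy numbers, not anything from the ASA document): if the best of k independent tests is reported as though it were the only test performed, the chance of a spurious flag is 1 − (1 − α)^k rather than α; a Šidák-corrected per-test threshold restores the intended family-wise error rate.

```python
# Toy arithmetic: the error probability when the search over k tests is ignored,
# and the Sidak-corrected threshold that accounts for it.
k, alpha = 20, 0.05

fwer_unadjusted = 1 - (1 - alpha) ** k          # chance of some p < 0.05 among k true nulls
sidak_threshold = 1 - (1 - alpha) ** (1 / k)    # per-test cutoff that accounts for the search
fwer_adjusted = 1 - (1 - sidak_threshold) ** k  # equals alpha by construction

print(f"error rate, best of {k} tests at 0.05   : {fwer_unadjusted:.2f}")   # about 0.64
print(f"Sidak per-test threshold               : {sidak_threshold:.4f}")    # about 0.0026
print(f"error rate with the adjusted threshold : {fwer_adjusted:.2f}")      # 0.05
```

Whether one uses this particular correction or another, the adjustment does exactly what the paragraph above describes: it registers that the selection changed the method’s error probabilities.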
Admittedly, statistical accounts outside the error-statistical family (classical likelihood accounts and Bayesian accounts obeying the Likelihood Principle**) need other ways to take account of selection effects, and that’s fine: just tell users what they are, and don’t let statistics communicators suggest that selection effects are no problem for them.
The fourth of the ASA’s principles on p-values states: “Proper inference requires full reporting and transparency”. Is this meant only for p-values, or for all statistical inferences? The ASA statement on p-values doesn’t say. Of course “full reporting” and “transparency” alone aren’t specific enough to point to the need to report sampling plans, data-dependent subgroups, multiple testing and the like, but we would have less to worry about if an inference account granted at least this much, rather than suggesting that significance tests embody strange rules, such as needing to know whether the researcher selectively reported.
*There are contexts wherein error probabilities are not influenced by selection effects, e.g., explaining a known effect with instruments of known reliability.
**Some Bayesians (e.g., Gelman) have come to reject the LP, and this may be a promising way to bridge the different statistical schools.
Thank you for the suggested modification to the Nuzzo piece, Mayo.
Another of her pieces is this 2014 bit, which you commented on previously (March 22, 2014).
https://www.nature.com/news/scientific-method-statistical-errors-1.14700
“According to one widely used calculation, a P value of 0.01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect; a P value of 0.05 raises that chance to at least 29%.”
Loose discussion of statistical principles yields muddled understanding of how to assess error statistical evidence. Nuzzo needs to read your book!
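(For readers wondering where those “11%” and “29%” figures come from: one widely cited calculation that reproduces them, and presumably the one behind Nuzzo’s figure, though that is my assumption, is the Sellke, Bayarri and Berger bound, taking the prior probability of a real effect to be one half.)

```python
# The Sellke-Bayarri-Berger bound: for p < 1/e the Bayes factor in favour of the
# null is at least -e * p * ln(p); with a 50:50 prior, the posterior probability
# that a flagged result is a false alarm is at least bound / (1 + bound).
import math

for p in (0.01, 0.05):
    bound = -math.e * p * math.log(p)      # minimum Bayes factor for H0
    false_alarm = bound / (1 + bound)      # lower bound on P(H0 | data) with prior 0.5
    print(f"p = {p:.2f}: false-alarm probability >= {false_alarm:.0%}")
# p = 0.01: false-alarm probability >= 11%
# p = 0.05: false-alarm probability >= 29%
```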