It seems like every week something of excitement in statistics comes down the pike. Last week I was contacted by Richard Harris (and 2 others) about the recommendation to stop saying the data reach “significance level p” but rather simply say “the p-value is p”.
(For links, see my previous post.) Friday, he wrote to ask if I would comment on a proposed restriction (?) on saying a test had high power! I agreed that we shouldn’t say a test has high power, but only that it has a high power to detect a specific alternative, but I wasn’t aware of any rulings from those in power on power. He explained it was an upshot of a reexamination by a joint group of the boards of statistical associations in the U.S. and UK of the full panoply of statistical terms. Something like that. I agreed to speak with him yesterday. He emailed me the proposed ruling on power:
Do not say a test has high power. Don’t believe that if a test has high power to produce a low p-value when an alternative H’ is true, that finding a low p-value is good evidence for H’. This is wrong. Any effect, no matter how tiny, can produce a small p-value if the power of the test is high enough.
Recommendation: Report the complement of the power in relation to H’: the probability of a type II error β(H’). For instance, instead of saying the power of the test against H’ is .8, say “β(H’) = 0.2.”
“So what do you think?” he began the conversation. Giggling just a little, I told him I basically felt the same way about this as the ban on significance/significant. I didn’t see why people couldn’t just stop abusing power, and especially stop using it in the backwards fashion that is now common (and is actually encouraged by using power as a kind of likelihood). I spend the entire Excursion 5 on power in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). Readers of this blog can merely search for power to find quite a lot! That said, I told him the recommendation seemed OK, but noted that you need a minimal threshold (for declaring evidence against a test hypothesis) in order to compute power.
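To make that last point concrete, here is a minimal sketch in Python. It assumes a one-sided z-test of H0: mu = 0 with known standard deviation; the function name `power_one_sided_z` and all the numbers are illustrative assumptions, not anything taken from the proposed ruling. It shows that power against H’ cannot be computed until a threshold (alpha) is fixed, that β(H’) is simply its complement, and that even a tiny discrepancy gets high power once n is large enough.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(delta, sigma, n, alpha=0.05):
    """Power of the one-sided z-test of H0: mu = 0 against H': mu = delta.

    Assumes known sigma and rejection when the z statistic exceeds the
    (1 - alpha) quantile of the standard normal. All names are hypothetical.
    """
    z = NormalDist()
    cutoff = z.inv_cdf(1 - alpha)      # the minimal threshold; power is undefined without it
    shift = delta * sqrt(n) / sigma    # how far H' moves the z statistic's mean
    return 1 - z.cdf(cutoff - shift)


# A tiny discrepancy (delta = 0.01 standard deviations) at growing sample sizes.
for n in (100, 10_000, 1_000_000):
    power = power_one_sided_z(delta=0.01, sigma=1.0, n=n)
    beta = 1 - power                   # the type II error probability beta(H')
    print(f"n={n:>9}  power={power:.3f}  beta={beta:.3f}")
```

Changing `alpha` changes every number in the output, which is why a bare “β(H’) = 0.2” is only meaningful alongside the threshold used to compute it.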
After talking about power, we moved on to some other statistical notions under review. He’d told me last time he lacked a statistical background, and asked me to point out any flagrant mistakes (in his sum-ups, not the Board’s items, which were still under embargo, whatever that means.) I was glad to see that, apparently, the joint committees were subjecting some other notions to scrutiny (for once). According to his draft, the Board doesn’t want people saying a hypothesis or model receives a high probability, say .95, because it is invariably equivocal.
Do not say a hypothesis or model has a high posterior probability (e.g., 0.95) given the data. The statistical and ordinary language meanings of “probability” are now so hopelessly confused; the term should be avoided for that reason alone.
Don’t base your scientific conclusion or practical decision solely on whether a claim gets a high posterior probability. That a hypothesis is given a .95 posterior does not by itself mean it has a “truth probability” of 0.95, nor that H is practically certain (while it’s very improbable that H is false), nor that the posterior was arrived at by a method that is correct 95% of the time, nor that it is rational to bet that the probability of H is 0.95, nor that H will replicate 95% of the time, nor that the falsity of H would have produced a lower posterior on H with probability .95. The posterior can reflect empirical priors, default or data dominant priors, priors from an elicitation of beliefs, conjugate priors, regularisation, prevalence of true effects in a field, or many, many others.
A Bayesian posterior report doesn’t tell you how uncertain that report is.
A posterior of .95 depends on just one way of exhausting the space of possible hypotheses or models (invariably excluding those not thought of). This can considerably distort the scientific process which is always open ended.
Recommendation: If you’re doing a Bayesian posterior assessment, just report “a posterior on H is .95” (or other value), or a posterior distribution over parameters. Don’t say probable.
At this point I began to wonder if he was for real. Was he the Richard Harris who wrote that article last week? I was approached by 3 different journals, and never questioned them. Was this some kind of a backlash to the p-value pronouncements from Stat Report Watch? Or maybe he was that spoofer Justin Smith (whom I don’t know in the least) who recently started that blog on the P-value police. My caller assured me he was on the level, and he did have the official NPR logo. So we talked for around 2 hours!
Comparative measures don’t get off scot-free, according to this new report of the joint Boards:
Don’t say one hypothesis H is more likely than another H’ because this is likely to be interpreted as H is more probable than H’.
Don’t believe that because H is more likely than H’, given data x, that H is probable, well supported or plausible, while H’ is not. This is wrong. It just means H makes the data x more probable than does H’. A high likelihood ratio LR can occur when both H and H’ are highly unlikely, and when some other hypothesis H” is even more likely. Two incompatible hypotheses can both be maximally likely. Being incompatible, they cannot both be highly probable. Don’t believe a high LR in favor of H over H’ means the effect size is large or practically important.
Recommendation: Report “the value of the LR (of H over H’) = k” rather than “H is k times as likely as H'”. As likelihoods enter the computation as a ratio, the word “likelihood” is not necessary and should be dropped wherever possible. The LR level can be reported. The statistical and ordinary language meanings of “likely” are sufficiently confused to avoid the term.
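The claim that a high LR can arise when both hypotheses fit the data terribly is easy to check numerically. The following is a rough sketch, assuming a normal model for a sample mean with known standard deviation; the data summary (`xbar`, `sigma`, `n`), the three point hypotheses, and the helper `likelihood` are all made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def likelihood(mu, xbar, sigma, n):
    """Likelihood of the point hypothesis mu given the observed sample mean.

    Assumes a normal model with known sigma, so xbar ~ N(mu, sigma^2 / n).
    """
    return NormalDist(mu, sigma / sqrt(n)).pdf(xbar)


# Hypothetical data summary: observed mean 3.0 from n = 25 with sigma = 1.
xbar, sigma, n = 3.0, 1.0, 25

lik_H = likelihood(1.0, xbar, sigma, n)    # H:   mu = 1
lik_H1 = likelihood(0.0, xbar, sigma, n)   # H':  mu = 0
lik_H2 = likelihood(3.0, xbar, sigma, n)   # H'': mu = 3, the best-fitting value

lr_H_over_H1 = lik_H / lik_H1              # enormous, yet H itself fits very badly
lr_H2_over_H = lik_H2 / lik_H              # H'' beats H by a huge factor as well

print(f"LR(H over H')  = {lr_H_over_H1:.3g}")
print(f"LR(H'' over H) = {lr_H2_over_H:.3g}")
print(f"likelihood of H alone = {lik_H:.3g}")
```

Here the LR of H over H’ is astronomically large even though the observed mean sits 10 standard errors from H, which is the sense in which the LR is “just a comparison”.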
Odds ratios and Bayes Factors (BFs), surprisingly, are treated almost the same way as the LR. (Don’t say H is more probable than H’. Just report BFs and prior odds. A BF doesn’t tell you if an effect size is scientifically or practically important. There’s no BF value between H and H’ that tells you there’s good evidence for H.)
But maybe I shouldn’t be so surprised. As in the initial ASA statement, the newest Report avers
Nothing in this statement is new. Statisticians and others have been sounding the alarm about these matters for decades,
In support of their standpoint on posteriors as well as on Bayes Factors they cite Andrew Gelman:
“I do not trust Bayesian induction over the space of models because the posterior probability of a continuous-parameter model depends crucially on untestable aspects of its prior distribution” (Gelman 2011, p. 70). He’s also cited as regards their new rule on Bayes Factors. “To me, Bayes factors correspond to a discrete view of the world, in which we must choose between models A, B, or C” (Gelman 2011, p. 74) or a weighted average of them as in Madigan and Raftery (1994).
Also familiar is the mistake in taking a high BF in favor of a null or test hypothesis H over an alternative H’ as if it supplies evidence in favor of H. It’s always just a comparison; there’s never a falsification, unlike statistical tests (unless supplemented with a falsification rule[i]).
The Board warns: “Don’t believe a Bayes Factor in favor of H over H’, using a ‘default’ Bayesian prior, means the results are neutral, uninformative, or bias free.” Here the report quotes Uri Simonsohn:
“Saying a Bayesian test ‘supports the null’ in absolute terms seems as fallacious to me as interpreting the p-value as the probability that the null is false.”
“What they actually ought to write is ‘the data support the null more than they support one mathematically elegant alternative hypothesis I compared it to’”
The default Bayes factor test “means the Bayesian test ends up asking: ‘is the effect zero, or is it biggish?’ When the effect is neither, when it’s small, the Bayesian test ends up concluding (erroneously) it’s zero” with high probability.
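The “zero or biggish” complaint can be illustrated with a toy calculation. This is only a sketch under stated assumptions: a normal mean with known sigma, a point null H0: mu = 0, and a normal prior mu ~ N(0, tau^2) under the alternative. That prior is a stand-in chosen because it gives a closed form; it is not the default prior Simonsohn discusses, and the function `bf01` and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def bf01(xbar, sigma, n, tau):
    """Bayes factor of H0: mu = 0 over H1: mu ~ N(0, tau^2).

    Assumes xbar ~ N(mu, sigma^2 / n) with known sigma. Under H1 the
    marginal distribution of xbar is N(0, tau^2 + sigma^2 / n).
    """
    se = sigma / sqrt(n)
    marginal_h0 = NormalDist(0.0, se).pdf(xbar)
    marginal_h1 = NormalDist(0.0, sqrt(tau**2 + se**2)).pdf(xbar)
    return marginal_h0 / marginal_h1


# Hypothetical small observed effect: 0.1 sigma from n = 100 (one standard error from zero).
xbar, sigma, n = 0.1, 1.0, 100

wide = bf01(xbar, sigma, n, tau=1.0)     # alternative expects "biggish" effects
narrow = bf01(xbar, sigma, n, tau=0.1)   # alternative expects small effects

print(f"BF01 with tau = 1.0: {wide:.2f}")
print(f"BF01 with tau = 0.1: {narrow:.2f}")
```

With the wide prior the comparison favors the null by several to one, while the narrow prior leaves it close to even on the very same data, so the verdict reflects which alternative was put up against the null.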
Scanning the rest of Harris’ article, which was merely in rough draft yesterday, I could see that next in line to face the axe are: confidence, credible, coherent, and probably other honorifics. Maybe now people can spend more time thinking, or so they tell you [ii]!
[i] For example, the rule might be: falsify H in favor of H’ whenever H’ is k times as likely or probable as H, or whenever the posterior of H’ exceeds .95, or whenever the p-value against H is less than .05.