It seems like every week something exciting in statistics comes down the pike. Last week I was contacted by Richard Harris (and 2 others) about the recommendation to stop saying the data reach “significance level p” but rather simply say

“the p-value is p”.

(For links, see my previous post.) On Friday, he wrote to ask if I would comment on a proposed restriction (?) on saying a test had high power! I agreed that we shouldn’t say a test has high power, only that it has high power to detect a specific alternative, but I wasn’t aware of any rulings from those in power on power. He explained it was an upshot of a reexamination, by a joint group of the boards of statistical associations in the U.S. and U.K., of the full panoply of statistical terms. Something like that. I agreed to speak with him yesterday. He emailed me the proposed ruling on power:

Do not say a test has high power. Don’t believe that if a test has high power to produce a low p-value when an alternative H’ is true, then finding a low p-value is good evidence for H’. This is wrong. Any effect, no matter how tiny, can produce a small p-value if the power of the test is high enough.

Recommendation: Report the complement of the power in relation to H’: the probability of a type II error β(H’). For instance, instead of saying the power of the test against H’ is .8, say “β(H’) = 0.2.”

“So what do you think?” he began the conversation. Giggling just a little, I told him I basically felt the same way about this as the ban on significance/significant. I didn’t see why people couldn’t just stop abusing power, and especially stop using it in the backwards fashion that is now common (and is actually encouraged by using power as a kind of likelihood). I spend the entire Excursion 5 on power in my *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* 2018, CUP. Readers of this blog can merely search for power to find quite a lot! That said, I told him the recommendation seemed OK, but noted that you need a minimal threshold (for declaring evidence against a test hypothesis) in order to compute power.
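To make the threshold point concrete, here is a minimal sketch (my own toy numbers, not anything from the Report) of computing power against a specific alternative in a one-sided z-test with known sigma. Note that the cut-off for declaring evidence against the test hypothesis is exactly the minimal threshold the computation requires:

```python
from math import sqrt
from statistics import NormalDist

# Toy illustration: one-sided z-test of H0: mu = 0 vs alternative H': mu = mu_alt,
# with known sigma. All numbers are hypothetical.
def power_against(mu_alt, sigma=1.0, n=25, alpha=0.05):
    z_crit = NormalDist().inv_cdf(1 - alpha)   # the minimal threshold for "evidence against H0"
    se = sigma / sqrt(n)
    cutoff = z_crit * se                       # reject H0 when the sample mean exceeds this
    # Power: P(sample mean > cutoff | mu = mu_alt)
    return 1 - NormalDist(mu_alt, se).cdf(cutoff)

mu_alt = 0.5
pw = power_against(mu_alt)
beta = 1 - pw                                  # the Report's recommended beta(H')
print(f"power against mu' = {mu_alt}: {pw:.3f}; beta(H') = {beta:.3f}")
```

With these toy values the power comes out near .8, so β(H’) is near .2, matching the Report’s own example; change `alpha` and the power changes too, which is why you can’t compute power without fixing a threshold.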

After talking about power, we moved on to some other statistical notions under review. He’d told me last time that he lacked a statistical background, and asked me to point out any flagrant mistakes (in his sum-ups, *not* the Board’s items, which were still under embargo, whatever that means). I was glad to see that, apparently, the joint committees were subjecting some other notions to scrutiny (*for once*). According to his draft, the Board doesn’t want people saying a hypothesis or model receives a high probability, say .95, because such claims are invariably equivocal.

Do not say a hypothesis or model has a high posterior probability (e.g., 0.95) given the data. The statistical and ordinary language meanings of “probability” are now hopelessly confused; the term should be avoided for that reason alone.

Don’t base your scientific conclusion or practical decision solely on whether a claim gets a high posterior probability. That a hypothesis is given a .95 posterior does not by itself mean it has a “truth probability” of 0.95, nor that H is practically certain (while it’s very improbable that H is false), nor that the posterior was arrived at by a method that is correct 95% of the time, nor that it is rational to bet that the probability of H is 0.95, nor that H will replicate 95% of the time, nor that the falsity of H would have produced a lower posterior on H with probability .95. The posterior can reflect empirical priors, default or data dominant priors, priors from an elicitation of beliefs, conjugate priors, regularisation, prevalence of true effects in a field, or many, many others.

A Bayesian posterior report doesn’t tell you how uncertain that report is.

A posterior of .95 depends on just one way of exhausting the space of possible hypotheses or models (invariably excluding those not thought of). This can considerably distort the scientific process, which is always open-ended.

Recommendation: If you’re doing a Bayesian posterior assessment, just report “a posterior on H is .95” (or other value), or a posterior distribution over parameters. Don’t say probable.
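For illustration only (a toy example of my own, not the Board’s): a bare posterior number underdetermines how it was arrived at. The same binomial data yield quite different posteriors under different priors, even on the simplest two-hypothesis space:

```python
from math import comb

# Hypothetical example: 7 successes in 10 binomial trials, and a choice between
# two point hypotheses about the success probability theta.
def posterior_H(prior_H, k=7, n=10, theta_H=0.5, theta_alt=0.7):
    lik = lambda t: comb(n, k) * t**k * (1 - t)**(n - k)  # binomial likelihood
    num = prior_H * lik(theta_H)
    return num / (num + (1 - prior_H) * lik(theta_alt))   # Bayes' theorem

for p in (0.5, 0.9):
    print(f"prior P(H) = {p}: posterior P(H | data) = {posterior_H(p):.3f}")
```

The same 7-out-of-10 data put H around .3 under one prior and around .8 under another, which is the point about posteriors reflecting empirical priors, default priors, elicited beliefs, or whatever else went in.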

At this point I began to wonder if he was for real. Was he the Richard Harris who wrote that article last week? I was approached by 3 different journalists, and never questioned them. Was this some kind of a backlash to the p-value pronouncements from Stat Report Watch? Or maybe he was that spoofer Justin Smith (whom I don’t know in the least) who recently started that blog on the P-value police. My caller assured me he was on the level, and he did have the official NPR logo. So we talked for around 2 hours!

Comparative measures don’t get off scot-free, according to this new report of the joint Boards:

Don’t say one hypothesis H is more likely than another H’ because this is likely to be interpreted as H is more probable than H’.

Don’t believe that because H is more likely than H’, given data x, H is probable, well supported, or plausible, while H’ is not. This is wrong. It just means H makes the data x more probable than does H’. A high likelihood ratio LR can occur when both H and H’ are highly unlikely, and when some other hypothesis H” is even more likely. Two incompatible hypotheses can both be maximally likely. Being incompatible, they cannot both be highly probable. Don’t believe a high LR in favor of H over H’ means the effect size is large or practically important.

Recommendation: Report “the value of the LR (of H over H’) = k” rather than “H is k times as likely as H'”. As likelihoods enter the computation as a ratio, the word “likelihood” is not necessary and should be dropped wherever possible. The LR level can be reported. The statistical and ordinary language meanings of “likely” are sufficiently confused that the term should be avoided.
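A tiny numerical sketch (toy values of my own, not the Board’s) of the point that a huge LR for H over H’ says nothing about H being probable or even fitting the data well:

```python
from math import exp, pi, sqrt

def norm_pdf(x, mu, sigma=1.0):
    """Normal density; the likelihood of a point hypothesis mu given observation x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

x = 10.0                           # the observed value (hypothetical)
lik_H  = norm_pdf(x, mu=6.0)       # H  : mu = 6  -- fits x badly
lik_H2 = norm_pdf(x, mu=2.0)       # H' : mu = 2  -- fits x even worse
lik_H3 = norm_pdf(x, mu=10.0)      # H'': mu = 10 -- the maximally likely value

LR = lik_H / lik_H2
print(f"LR(H over H') = {LR:.3g}")                       # astronomically large
print(f"H'' is {lik_H3 / lik_H:.3g} times as likely as H")
```

The LR of H over H’ is astronomically large here, yet both hypotheses sit far from the data and a third hypothesis H” is thousands of times more likely than either.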

Odds ratios and Bayes Factors (BFs), surprisingly, are treated almost the same way as the LR. (Don’t say H is more probable than H’. Just report BFs and prior odds. A BF doesn’t tell you if an effect size is scientifically or practically important. There’s no BF value between H and H’ that tells you there’s good evidence for H.)

But maybe I shouldn’t be so surprised. As in the initial ASA statement, the newest Report avers:

Nothing in this statement is new. Statisticians and others have been sounding the alarm about these matters for decades.

In support of their standpoint on posteriors as well as on Bayes Factors they cite Andrew Gelman:

“I do not trust Bayesian induction over the space of models because the posterior probability of a continuous-parameter model depends crucially on untestable aspects of its prior distribution” (Gelman 2011, p. 70). He’s also cited as regards their new rule on Bayes Factors. “To me, Bayes factors correspond to a discrete view of the world, in which we must choose between models A, B, or C” (Gelman 2011, p. 74) or a weighted average of them as in Madigan and Raftery (1994).

Also familiar is the mistake of taking a high BF in favor of a null or test hypothesis H over an alternative H’ as if it supplies evidence in favor of H. It’s always just a comparison; there’s never a falsification, unlike statistical tests (unless supplemented with a falsification rule[i]).

The Board warns: “Don’t believe a Bayes Factor in favor of H over H’, using a ‘default’ Bayesian prior, means the results are neutral, uninformative, or bias free.” Here the report quotes Uri Simonsohn:

“Saying a Bayesian test ‘supports the null’ in absolute terms seems as fallacious to me as interpreting the p-value as the probability that the null is false.”

“What they actually ought to write is ‘the data support the null more than they support one mathematically elegant alternative hypothesis I compared it to’”

The default Bayes factor test “means the Bayesian test ends up asking: ‘*is the effect zero, or is it biggish*?’ When the effect is neither, when it’s small, the Bayesian test ends up concluding (erroneously) it’s zero” with high probability.

Scanning the rest of Harris’s article, which was merely in rough draft yesterday, I could see that next in line to face the axe are: confidence, credible, coherent, and probably other honorifics. Maybe now people can spend more time thinking, or so they tell you [ii]!

**Check date!**

[i] For example, the rule might be: falsify H in favor of H’ whenever H’ is k times as likely or probable as H, or whenever the posterior of H’ exceeds .95, or whenever the p-value against H is less than .05.

Deborah:

I know you’re kidding in the above post, but I do think that it’s a bad idea to conclude that the posterior probability is X that a model is true. Also I think it’s a bad idea to say that a study has 80% power or whatever, as: (a) the power depends on the effect size, which is something you don’t know, and (b) I don’t think “power” answers any relevant question, as power = probability of attaining a statistically significant result, and I don’t think statistical significance is relevant to discovery or problem solving (except in the “sociological” sense that lots of people use this rule so we should understand it).

I think it’s fine when people report probabilistic inferences conditional on a model, but the model should make some sense. See for example the golf model discussed here: this model clearly is an oversimplification and, when you feed it enough data, problems arise with it, but we can still learn from it. Similarly for proportional hazard models in survival analysis, or regression models in policy analysis, or whatever. Probabilistic inference within the context of a model. My problem with saying that the posterior probability that a model is true is X is that this is typically expressed within the context of a model which is not just false (all models are false) but also makes no sense. My problem with p-values is slightly different: the issue is more that the p-value is not an answer to any scientifically- or decision-relevant question.

My posts for this purpose are always true. But I disagree about the p-value. It’s quite useful and important to learn of an indication of a genuine effect, or of when there’s no such indication (despite a test with high power to find one).

Anyone who denies the relevance of the small but important roles p-values play absolutely should stop using them. But then to retain them as useful is disingenuous. If those who reject the role of p-values would just stop using them, they could stop torturing everyone else with rules they don’t really mean, and stop wrecking a key tool for the preliminary analysis of drugs and environmental risks.

Except that you use simple significance tests and their p-values to check models, to determine the presence of fraud, to distinguish spurious from genuine effects, and to ascertain if a replication is successful.

I actually like to pay attention to what exactly is said, and most of the cited parts make sense to me. To me your posting comes across as “anti speech bans” in the first place (I may get your irony and maybe meta-irony wrong though), but if these things aren’t taken as bans but rather as recommendations, they seem mostly fine to me. People who don’t get these recommendations at first and try to understand them can learn quite a bit.

I have paid attention to what is said. All I’ve said here mimics that, and thus they too should be warned against if not banned. Check date of course.