# Diary For Statistical War Correspondents on the Latest Ban on Speech

When science writers, especially “statistical war correspondents”, contact you to weigh in on some article, they may talk to you until they get something spicy, and then they may or may not include the background context. So a few writers contacted me this past week regarding this article (“Retire Statistical Significance”)–a teaser, I now suppose, to advertise the ASA collection growing out of that conference “A world beyond P ≤ .05” way back in Oct 2017, where I gave a paper*. I jotted down some points, since Richard Harris from NPR needed them immediately, and I had just gotten off a plane when he emailed. He let me follow up with him, which is rare and greatly appreciated. So I streamlined the first set of points, and dropped any points he deemed technical. I sketched the third set for a couple of other journals who contacted me, who may or may not use them. Here’s Harris’ article, which includes a couple of my remarks.

First set.

1. We agree about the age-old fallacy of non-rejection of a null hypothesis: a non-statistically significant result at level P is not evidence for the null because a test may have low probability of rejecting a null even if it’s false (i.e., it might have low power to detect a particular alternative).

The solution in the severity interpretation of tests is to take a result that is not statistically significant at a small level, i.e., a large P-value, as ruling out given discrepancies from the null or other reference value:

The data indicate that discrepancies from the null are less than those parametric values the test had a high probability of detecting, if present. See p. 351 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). [i]

This is akin to the use of power analysis, except that it is sensitive to the actual outcome. It is very odd that this paper makes no mention of power analysis, since that is the standard way to interpret non-significant results.
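To make the contrast with power analysis concrete, here is a minimal sketch, assuming a one-sided Normal test of mu ≤ mu0 against mu > mu0 with known sigma. The function names and the numbers in the usage lines are illustrative assumptions, not taken from the paper under discussion. For a non-significant result, the severity for the claim mu ≤ mu1 is computed as Pr(X̄ > observed mean; mu = mu1), so it changes with the actual outcome, whereas power depends only on the cut-off.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, used for all tail areas


def severity_upper(xbar, mu1, sigma, n):
    """Severity for 'mu <= mu1' after a non-significant result:
    Pr(Xbar > observed xbar; mu = mu1)."""
    se = sigma / sqrt(n)
    return 1.0 - STD_NORMAL.cdf((xbar - mu1) / se)


def power(mu0, mu1, sigma, n, alpha=0.025):
    """Power of the one-sided level-alpha test against mu = mu1.
    Depends on the cut-off only, not on the observed mean."""
    se = sigma / sqrt(n)
    cutoff = mu0 + STD_NORMAL.inv_cdf(1.0 - alpha) * se
    return 1.0 - STD_NORMAL.cdf((cutoff - mu1) / se)


# Hypothetical numbers: mu0 = 0, sigma = 1, n = 100 (standard error 0.1).
# Both observed means below are non-significant at the 0.025 level.
for xbar in (0.05, 0.19):
    print(xbar, round(severity_upper(xbar, 0.2, 1.0, 100), 3),
          round(power(0.0, 0.2, 1.0, 100), 3))
```

In this sketch the power against mu1 = 0.2 is the same for both outcomes, while the severity for "mu ≤ 0.2" is much higher when the observed mean is 0.05 than when it is 0.19.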

Using non-significant results (“moderate” P-values) to set upper bounds is done throughout the sciences and is highly informative. This paper instead urges us to read into any observed difference found to be in the welcome direction, to potentially argue for an effect.

2. I agree that one shouldn’t mechanically use P< .05. Ironically, they endorse a .95 confidence interval CI. They should actually use several levels, as is done with a severity assessment.
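As a rough illustration of reporting several levels rather than a single .95 interval, the sketch below computes two-sided Normal confidence intervals at a handful of levels. It assumes a known sigma; the function name, the chosen levels, and the data values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def confidence_intervals(xbar, sigma, n, levels=(0.5, 0.8, 0.9, 0.95, 0.99)):
    """Two-sided Normal-theory confidence intervals at several levels."""
    se = sigma / sqrt(n)
    intervals = {}
    for level in levels:
        # Critical value leaving (1 - level) / 2 in each tail.
        z = NormalDist().inv_cdf(0.5 + level / 2.0)
        intervals[level] = (xbar - z * se, xbar + z * se)
    return intervals


# Hypothetical data: observed mean 0.15, sigma = 1, n = 100.
for level, (lo, hi) in confidence_intervals(0.15, 1.0, 100).items():
    print(f"{level:.2f}: ({lo:.3f}, {hi:.3f})")
```

Reading the nested intervals together shows which parameter values are excluded at which level, instead of a single in-or-out verdict at .95.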

I have objections to their interpretation of CIs, but I will mainly focus my objections on the ban of the words “significance” or “significant”. It’s not too hard to report that results are significant at level .001 or whatever. Assuming researchers invariably use an unthinking cut-off, rather than reporting the significance level attained by the data, they want to ban words. They (Greenland at least) claim this is a political fight, and so arguing by an appeal to numbers (who sign on to their paper) is appropriate for science. I think many will take this as yet one more round of significance test bashing–even though, amazingly, it is opposite to the most popular of today’s statistical wars. I explain in #3. (The actual logic of significance testing is lost in both types of criticisms.)

3. The most noteworthy feature of this criticism of statistical significance tests is that it is opposite to the most well-known and widely circulated current criticisms of significance tests.

In other words, the big move in the statistics wars these days is to fight irreplication by making it harder to reject, and find evidence against, a null hypothesis. The most well known Bayesian reforms being bandied about do this by giving a point prior–a lump of prior probability–to a point null hypothesis. (There’s no mention of this in the paper.)

These Bayesians argue that small P-values are consistent with strong evidence for the null hypothesis. They conclude that P-values exaggerate the evidence against the null hypothesis. Never mind for now that they are insisting P-values be measured against a standard that is radically different from what the P-value means. All of the criticisms invoke reasoning at odds with statistical significance tests. I want to point out the inconsistency between those reforms and the current one. I will call them Group A and Group B:

Group A: “Make it harder to find evidence against the null”: a P-value of .05 (i.e. a statistically significant result) should not be taken as evidence against the null, it may often be evidence for the null.

Group B (“Retire Stat Sig”): “Make it easier to find evidence against the null”: a P-value > .05 (i.e., a non-statistically significant result) should not be taken as evidence for the null, it may often be evidence against the null.
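The kind of calculation behind Group A can be sketched as follows. This is an illustrative setup, not one taken from any particular reform paper: a Normal mean with known sigma, a lump of prior probability on the point null mu = 0, and a Normal(0, tau²) prior under the alternative. The function name and the default values of `tau` and `prior_null` are assumptions. Holding the result fixed at z = 1.96 (two-sided P of about .05), the posterior probability of the null grows with n, a result often called the Jeffreys-Lindley effect.

```python
from math import sqrt
from statistics import NormalDist


def posterior_prob_null(z, n, sigma=1.0, tau=1.0, prior_null=0.5):
    """Posterior probability of the point null mu = 0 under a spike prior.

    z          : observed standardized difference, xbar / (sigma / sqrt(n))
    tau        : prior standard deviation of mu under the alternative (assumed)
    prior_null : lump of prior probability placed on the point null (assumed)
    """
    se = sigma / sqrt(n)
    xbar = z * se
    # Density of the observed mean under the point null.
    like_null = NormalDist(0.0, se).pdf(xbar)
    # Marginal density under the alternative: mu ~ N(0, tau^2), Xbar | mu ~ N(mu, se^2).
    like_alt = NormalDist(0.0, sqrt(tau ** 2 + se ** 2)).pdf(xbar)
    bf01 = like_null / like_alt  # Bayes factor in favor of the null
    return prior_null * bf01 / (prior_null * bf01 + (1.0 - prior_null))


for n in (10, 100, 10_000):
    print(n, round(posterior_prob_null(1.96, n), 3))
```

The numbers depend entirely on the assumed spike and on tau. That dependence is the sense in which the P-value is being measured against a standard different from what the P-value means.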

A proper use and interpretation of statistical tests (as set out in my SIST) interprets P-values correctly in both cases and avoids fallacies of rejection (inferring a magnitude of discrepancy larger than warranted) and fallacies of non-rejection (inferring the absence of an effect smaller than warranted).

The fact that we shouldn’t use thresholds unthinkingly does not mean we don’t need thresholds for lousy and terrible evidence! When data provide lousy evidence, when little if anything has been done to rule out known flaws in a claim, it’s not a little bit of evidence (on my account). The most serious concern with the “Retire” argument to ban thresholds for significance is that it is likely to encourage the practice whereby researchers spin their non-significant results by P-hacking or data dredging. It’s bad enough that they do this. Read Goldacre [ii]

Note their saying the researcher should discuss the observed difference. This opens the door to spinning it convincingly to the uninitiated reader.

4. What about selection effects? The really important question that is not mentioned in this paper is whether the researcher is allowed to search for endpoints post-data.

My own account replaces P-values with reports of how severely tested various claims are, whether formal or informal. If we are in a context reporting P-values, the phrase “statistically significant” at the observed P-value is important because the significance level is invalidated by multiple testing, optional stopping, data-dependent subgroups, and data dredging. Everyone knows that. (A P-value, by contrast, if detached from corresponding & testable claims about significance levels, is sometimes seen as a mere relationship between data and a hypothesis.) Getting rid of the term is just what is wanted by those who think the researcher should be free to scour the data in search of impressive-looking effects, or interpret data according to what they believe. Some aver that their very good judgment allows them to determine post-data what the pre-registered endpoints really are or were or should have been. (Goldacre calls this “trust the trialist”). The paper mentions pre-registration fleetingly, but these days we see nods to it that actually go hand in hand with flouting it.
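A small sketch of why searching invalidates the reported level, under simplifying assumptions: k independent tests, all nulls true, and continuous test statistics so that each P-value is uniform on (0, 1). The function names and the choice k = 20 are illustrative.

```python
import random


def familywise_error(alpha, k):
    """Probability that at least one of k independent true-null tests
    reaches nominal significance at level alpha."""
    return 1.0 - (1.0 - alpha) ** k


def simulate_best_of_k(alpha, k, trials=20_000, seed=0):
    """Monte Carlo check: report only the smallest of k null P-values
    and count how often it falls at or below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        smallest_p = min(rng.random() for _ in range(k))
        if smallest_p <= alpha:
            hits += 1
    return hits / trials


print(round(familywise_error(0.05, 20), 3))
print(round(simulate_best_of_k(0.05, 20), 3))
```

With 20 looks, a nominal .05 result turns up roughly 64% of the time even when nothing is going on, so reporting "significant at .05" for the selected endpoint misstates the actual error probability.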

The ASA P-value Guide very pointedly emphasizes that selection effects invalidate P-values. But it does not say that selection effects need to be taken into account by any of the “alternative measures of evidence”, including Bayesian and Likelihoodist. Are they free from Principle 4 on transparency, or not? Whether or when to take account of multiple testing and data dredging are known to be key points on which those accounts differ from significance tests (at least all those who hold to the Likelihood Principle, as with Bayes Factors and Likelihood Ratios).

5. A few asides:

They should really be doing one-sided tests and do away with the point null altogether, except for special cases. (I agree with D.R. Cox, who suggests doing two 1-sided tests.) With 1-sided tests, the test hypothesis and alternative hypothesis are symmetrical, as with N-P tests.
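One simple way to read "two 1-sided tests" is to report the P-value in each direction. The sketch below does that for a Normal mean with known sigma. The function name and the numbers are assumptions for illustration, and this is not claimed to be Cox's exact formulation.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_values(xbar, mu0, sigma, n):
    """Return (p_greater, p_less) for the two one-sided Normal tests.

    p_greater tests 'mu <= mu0' against 'mu > mu0';
    p_less    tests 'mu >= mu0' against 'mu < mu0'.
    """
    z = (xbar - mu0) / (sigma / sqrt(n))
    p_greater = 1.0 - NormalDist().cdf(z)
    p_less = NormalDist().cdf(z)
    return p_greater, p_less


# Hypothetical data: observed mean 0.196, mu0 = 0, sigma = 1, n = 100.
print(one_sided_p_values(0.196, 0.0, 1.0, 100))
```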

The authors seem to view a test as a report on parameter values that merely fit or are compatible with data. This misses testing reasoning! Granted the points within a CI aren’t far enough away to reject the null at level .05–but that doesn’t mean there’s evidence for them. In other words, they commit the same fallacy they are on about, but regarding members of the CI. In fact there is fairly good evidence the parameter value is less than those values close to the upper confidence limit. Yet this paper calls them compatible, even where there’s rather strong evidence against them, as with an upper .9 level bound, say.

[Using one-sided tests and letting the null assert: a positive effect exists, the recommended account is tantamount to taking the non-significant result as evidence for this null.]

Second Set (to briefly give the minimal non-technical points):

I do think we should avoid the fallacy of going from a large P-value to evidence for a point null hypothesis: inferring evidence of no effect.

CIs at the .95 level are more dichotomous than reporting attained P-values for various hypotheses.

The fact that we shouldn’t use thresholds unthinkingly does not mean we don’t need thresholds for lousy and terrible evidence!

The most serious concern with the argument to ban thresholds for significance is that it encourages researchers to spin their non-significant results by P-hacking, data dredging, multiple testing, and outcome-switching.

I would like to see some attention paid to how easy it is to misinterpret results with Bayesian and Likelihoodist methods. Obeying the LP, there is no onus to take account of selection effects, and priors are very often data-dependent, giving even more flexibility.

Third Set (for different journals)

Banning the word “significance” may well free researchers from being held accountable when they downplay negative results and search the data for impressive-looking subgroups.

It’s time for some attention to be paid to how easy it is to misinterpret results on various (subjective, default) Bayesian methods–if there is even agreement on one to examine. The brouhaha is all about a method that plays a small role in an overarching methodology that is able to bound the probabilities of seriously misleading interpretations of data. These are called error probabilities. Their role is just a first indication of whether results could readily be produced by chance variability alone.

Rival schools of statistics (the ASA Guide’s “alternative accounts of evidence”) have never shown their worth in controlling error probabilities of methods. (Without this, we cannot assess their capability for having probed mistaken interpretations of data).

Until those alternative methods are subject to scrutiny for the same or worse abuses–biasing selection effects–we should be wary of ousting these methods and the proper speech that goes with them.

One needs to consider a statistical methodology as a whole–not one very small piece. That full methodology may be called error statistics. (Focusing on the simple significance test, with a point null & no alternative or power consideration, as in the ASA Guide, hardly does justice to the overall error statistical methodology. Error statistics is known to be a piecemeal account–it’s highly distorting to focus on an artificial piece of it.)

Those who use these methods with integrity never recommend using a single test to move from statistical significance to a substantive scientific claim. Once a significant effect is found, they move on to estimating its effect size & exploring properties of the phenomenon. I don’t favor existing testing methodologies but rather reinterpret tests as a way to infer discrepancies that are well or poorly indicated. I described this account over 25 years ago.

On the other hand, simple significance tests are important for testing assumptions of statistical models. Bayesians, if they test their assumptions, use them as well, so they could hardly ban them entirely. But what are P-values measuring? OOPS! you’re not allowed to utter the term s____ance level that was coined for this purpose. Big Brother has dictated! (Look at how strange it is to rewrite Goldacre’s claim below without it. [ii])

I’m very worried that the lead editorial in the new “world after P ≤ 0.05” collection warns us that even if scientists repeatedly show statistically significant increases (p< 0.01 or 0.001) in lead poisoning among children in City F, we mustn’t “conclude anything about scientific or practical importance” such as the water is causing lead poisoning.

“Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof)” (p.1, editorial for the Special Issue).

Following this rule, and note the qualification that had been in the ASA Guide is missing, would mean never inferring risks of concern when there was uncertainty (among much else that would go by the wayside). Risks have to be so large and pervasive that no statistics is needed! Statistics is just window dressing, with no actual upshot about the world. Menopausal women would still routinely be taking and dying from hormone replacement therapy because “real world” observational results are compatible with HRT staving off age-related diseases.

Welcome to the brave new world after abandoning error control.

See also my post “Deconstructing ‘A World Beyond P-values’” on the 2017 conference.

[i] Mayo, D. (2018). Statistical Inference as Severe Testing: How To Get Beyond the Statistics Wars, Cambridge: Cambridge University Press.

[ii] Should we replace the offending terms with “moderate or non-small P-values”? The required level for “significance” is separately reported.

Misleading reporting by presenting a study in a more positive way than the actual results reflect constitutes ‘spin’. Authors of an analysis of 72 trials with non-significant results reported it was a common phenomenon, with 40% of the trials containing some form of spin. Strategies included reporting on statistically significant results for within-group comparisons, secondary outcomes, or subgroup analyses and not the primary outcome, or focussing the reader on another study objective away from the statistically non-significant result. (Goldacre)

[added March 25] To be clear, I have no objection to recommending people not use “statistical significance” routinely in that it may be confused with “important”. But the same warnings about equivocation would have to be given to the use of claims: H is more likely than H’. H is more probable than H’. H has probability p. What I object to is mandating a word ban, along with derogating statistical tests in general, while raising no qualms or questions about alternative methods. It doesn’t suffice to say “all methods have problems” either. Let’s look at them.

In the time people have spent repeating old criticisms of significance tests, different ways to deal with data-dependent selection effects could have been developed and experimented with. I know there is considerable work in this area, but I haven’t seen it in the pop discussions of significance tests and p-values.

Gelman’s blog (post on April 12, 2019): Reviews and Discussions of Mayo’s New Book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.

Papers/Articles

Ronald L. Wasserstein, Allen L. Schirm, and Nicole A. Lazar, “Editorial: Moving to a World Beyond ‘p < 0.05’,” 73 Am. Statistician S1, S2 (2019).

Valentin Amrhein, Sander Greenland, and Blake McShane, “Retire Statistical Significance,” 567 Nature 305 (2019).

John P. A. Ioannidis, “Retiring statistical significance would give bias a free pass,” 567 Nature 461 (2019).

John P. A. Ioannidis, “Do Not Abandon Statistical Significance”  (Nature, April 4, 2019)


### 42 thoughts on “Diary For Statistical War Correspondents on the Latest Ban on Speech”

1. Christian Hennig

Thanks for this; thanks to my new position and new organisation of teaching, I’m currently teaching statistical tests again after a long time. I point my students to “Abandon Statistical Significance” and am happy to have a source for a different point of view that I can use for them to see the other side of things (rather than having to make all that effort myself).

Actually, although I agree with much that you write, I encourage my students, when talking with non-statisticians, indeed to not use the term “significant” because this will be misinterpreted with high probability. I rather prefer “there’s strong/moderate/weak/no evidence in the data against…”. I don’t think that in the examples I’m discussing much is lost if the term “significant” is not used. “Evidence language” should actually be enough to make clear that some claim that somebody makes is only rather weakly, if at all, backed up by the data. I agree with the authors of “Abandon…” (or was it “Retire…”? There seem to be more than one version of this flying around) that unless a binary decision is in fact needed, it isn’t helpful to present an essentially non-binary situation as if it was a binary one.

“They should really be doing one-sided tests and do away with the point null altogether.” I disagree. Obviously we wouldn’t believe that a point null is precisely true, but then the same holds for the normal distribution assumption, i.i.d., and so on. There are many situations in which researchers are a priori interested in deviations in both/all directions from the null, and it is still informative if data indicate that the data are actually compatible with the point null, i.e. they cannot be used as argument in favour of a deviation in any direction (knowing of course that this doesn’t make the point null true).

2. Welcome to Italy.
Agree one should always indicate the level at which a result is significant, but that doesn’t kill the word “significance”. A warning to properly qualify the level attained–and note that I also use “power attained”–is very different from a heavy-handed decree, along with denials that one can infer anything from statistically significant results, and with no warnings that, for example, finding one hypothesis more likely than another doesn’t mean there’s good evidence for either of them. I can think of 20 or 30 warnings for Bayesian statistics as well. It’s the constant persecution of statistical tests, without ever a voice of defense, that is more than a little irksome and unfair.
I tend to use “reject at level p” or fail to reject at the level.

3. Deborah:

You write: “the big move in the statistics wars these days is to fight irreplication by making it harder to reject…”

I don’t know what the big moves are, but my own perspective, and I think that of the three authors of the recent article being discussed, is that we should not be “rejecting” at all, that we should move beyond the idea that the purpose of statistics is to reject the null hypothesis of zero effect and zero systematic error.

I don’t want to ban speech, and I don’t think the authors of that article do, either. I’m on record that I’d like to see everything published, including Bem’s ESP paper and various other silly research. My problem is with the idea that rejecting the null hypothesis tells us anything useful.

• I used “reject” (understood to be at a given p-value) to follow the ban on using “significance”, but also, as I always mean it, to find evidence against a hypothesis. Of course, the tests we’re presented with are tools for falsifying statistically, where the inference can take any number of forms (e.g., data indicate a discrepancy from H) or not. As a falsificationist, I take it that you do also falsify.
As for the “big moves”, I was alluding to things like “redefine significance” by making the standard cut-off be 0.005.

Although I’m being a bit jokey in the title, only a little bit. I am really struck by the admonition “don’t say ___”. If that’s not to ban speech, I don’t know what is. Do you know of other fields where this is done? Why take cases of abuses, bad science, and unthinking applications, and punish everyone by cutting off words–not to mention taking away words that help to hold those who commit QRPs accountable.

People are perfectly capable of reporting the observed significance level, the p-value, various CIs and avoiding fallacies of small and large p-values (I can’t say significant and non-significant results).

As written, however, the rules delineated in the editorial will get just about every CEO or researcher off the hook who has reported on post-data selection effects, data dredging, multiple testing as “a statistically significant improvement p = whatever”. I expect the lawyers are already lined up for this purpose. It’s made worse by the omission of “only” from the statement that we should not make scientific inferences from statistically significant results. If you think we cannot learn from statistically significant risks, then you think most of the hazards/benefits we claim to find out about in medicine and the environment are illicit. (I didn’t say use a point null).
I think we do find things out about the world, and we need to study how to do it better. The knowledge is almost never infallible. In some contexts, formal statistics is involved, as with clinical trials, vaccines, EPA regulations, ecology, etc., but also physics, engineering, and any number of sciences. Statistical and non-statistical learning in science often blend seamlessly, as I try to point out in my recent book. In most other cases, formal statistics isn’t involved but we still probe for mistaken interpretations of data, and build storehouses of canonical ways we may be led astray in inquiry. We construct tools that are capable of unearthing these. We don’t just falsify but learn the sources of anomalies, often pointing the way toward new theories, or toward finding out which directions will not be fruitful.

4. There’s a very serious conceptual error people fall into regarding drawing distinctions in continuous cases. The lack of a precise dividing line does NOT mean we cannot identify extreme cases on either side. When does observational become theoretical? When is a risk acceptable? How much extra weight makes one obese? When does science slip into pseudoscience? When does flirting become harassment? The point is that vagueness does not preclude identifying extreme cases.

5. Gelman himself will use simple significance tests to test his models. See, for example, this paper: Article_Gelman.pdf

Now if he will go from a small P-value to a brand new model that fits the data better, then he is clearly presuming to find something out from the small P-value. So his own practice would be at odds with claiming we can learn nothing of scientific or practical relevance from small P-values. I don’t think the people who wrote the editorial really mean to say that, but that is what’s written. It can be cured by adding “only” to the claim.

6. Deborah:

The problem with null hypothesis significance testing is that rejection of straw-man hypothesis B is used as evidence in favor of preferred alternative A. This is a disaster. As I wrote a few years ago here:

“In confirmationist reasoning, a researcher starts with hypothesis A (for example, that the menstrual cycle is linked to sexual display), then as a way of confirming hypothesis A, the researcher comes up with null hypothesis B (for example, that there is a zero correlation between date during cycle and choice of clothing in some population). Data are found which reject B, and this is taken as evidence in support of A.

In falsificationist reasoning, it is the researcher’s actual hypothesis A that is put to the test.

How do these two forms of reasoning differ? In confirmationist reasoning, the research hypothesis of interest does not need to be stated with any precision. It is the null hypothesis that needs to be specified, because that is what is being rejected. In falsificationist reasoning, there is no null hypothesis, but the research hypothesis must be precise.”

I’m a big fan of researchers trying to poke holes in their own models. I’m not a fan of researchers rejecting straw-man nulls and then acting as this represents evidence to support their favored alternatives.

• I think I’ve written a whole book on this. Are you saying you would no longer falsify models because some people will move from falsifying a model to their favorite alternative theory that fits the data? That’s crazy. You don’t give up on correct logic because some people use illogic.

Let’s talk about your poking holes in your theory or model. How does the logic go? I assume it is statistical & of a p-value sort of reasoning. We’ve discussed this many times before. What are you entitled to infer from finding misfits with your model? Can you infer merely that there’s something wrong somewhere? That’s a start and already is in conflict with the claim that you can learn nothing from finding low p-values, especially if you repeatedly find them & especially if you begin to learn about the mechanism or factors responsible.

• Andrew Gelman

Deborah:

You ask, “Are you saying you would no longer falsify models. . . ?”

Considering that, just above, I wrote, “I’m a big fan of researchers trying to poke holes in their own models. I’m not a fan of researchers rejecting straw-man nulls and then acting as this represents evidence to support their favored alternatives,” I’d have to say, NO, I’m NOT saying I would no longer falsify models! I’m very interested in finding flaws in models. What I’m not interested in is finding flaws in straw-man nulls and then acting as this represents etc.

• Good. And one of the ways to properly poke holes is by means of a falsification that is not infallible–as essentially none are– but that is by means of a proper statistical falsification: an argument from coincidence that shows a genuine anomaly with the claim or model. Of course there are still generally many ways to explain or account for the anomaly, and one then needs to worry if it is well probed. Isn’t it better to point out the age-old fallacy of inferring from a computed small p-value, x, to a theory or claim that purports to explain x, than to discredit methods for poking holes?

• We simply infer the denial of the null or test hypothesis. NP tests allow also inferring discrepancies (in the direction of the alt) well or poorly indicated.

7. Statistical significance tests are being blamed because they can be abused–especially if you use artificial NHST. The two main abuses are: (a) going from a difference that reaches small p-values to scientific claims and (b) illicit p-values due to selection effects, data dredging, cherry-picking, outcome switching, etc. In the meantime, crooked interpretations of clinical trials are being condoned because it will be said (as it already often is) that the entire method is discredited because some social scientists & others commit fallacies.

Consider (a). The problem with moving from the statistically significant (SS) result to H is that H makes claims that haven’t been probed by the experiment at hand. It might not even be measuring the concepts in H. Anyone who moves from the SS result to H is performing statistical affirming the consequent (invalid statistically as well as deductively):
If H then the SS result is probable.
SS result.
Therefore H.
Error statistical testing doesn’t allow this. However, for statistical inferences that take the form of a Bayes boost, statistical affirming the consequent does count as at least some confirmation for H.
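The "Bayes boost" point can be seen in a toy calculation. All of the probabilities below are made-up numbers chosen only for illustration, and the function name is hypothetical. Whenever the result x is more probable under H than under not-H, Bayes' theorem raises the probability of H, even if x would very often occur were H false.

```python
def posterior(prior_h, pr_x_given_h, pr_x_given_not_h):
    """Pr(H | x) from Bayes' theorem for a hypothesis H and its denial."""
    numerator = pr_x_given_h * prior_h
    denominator = numerator + pr_x_given_not_h * (1.0 - prior_h)
    return numerator / denominator


# Hypothetical inputs: H makes the SS result probable (0.9), but the result
# would also turn up 60% of the time if H were false.
prior = 0.1
post = posterior(prior, pr_x_given_h=0.9, pr_x_given_not_h=0.6)
print(round(post, 3), post > prior)
```

Here the posterior exceeds the prior, so x counts as some confirmation of H on a boost account, although a procedure that yields x 60% of the time when H is false has done little to probe H.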

8. How can we in good conscience derogate statistical falsification in carefully controlled double-blind trials on grounds that psychologists or others may perform highly artificial experiments and move fallaciously from computed (not even actual) low p-values to theories (in violation of the logic of testing)? I say that we cannot in good conscience do this. But endorsing a policy that prevents researchers from poking holes in a CEO’s money-making claims to have a great and safe drug is precisely to do this.

Would critics of statistical significance tests use a drug that resulted in statistically significant increased risks in patients time and again, or even once or twice with very low p for primary outcomes? Would they recommend it to members of their family? If the answer to these questions is “no”, then they cannot at the same time deny that anything can be learned from finding statistical significance.

There’s a reason that even the harshest critics of statistical tests don’t want to relinquish p-values. They should figure out why.

9. I added this to my post today:
[added March 25] To be clear, I have no objection to recommending people not use “statistical significance” routinely in that it may be confused with “important”. But the same warnings about equivocation would have to be given to the use of claims: H is more likely than H’. H is more probable than H’. H has probability p. What I object to is mandating a word ban, along with derogating statistical tests in general, while raising no qualms or questions about alternative methods. It doesn’t suffice to say “all methods have problems” either. Let’s look at them.

In the time people have spent repeating old criticisms of significance tests, different ways to deal with data-dependent selection effects could have been developed and experimented with. I know there is considerable work in this area, but I haven’t seen it in the pop discussions of significance tests and p-values.

10. The Statistics Police have been expanding their ranks lately: http://www.dichotomania.com

Looks like they are about to start book burning,

Justin

• You think my “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars” will be one of the first? Someone on twitter just opined that he thought my book had something to do w/ the latest round of hysteria, but I’m fairly sure that’s not so. This was all part of an intended dramatic unfolding from 2015 or before, only it didn’t turn out to be as simple as was hoped. They couldn’t agree on any alternative methodology, but at most they found a scapegoat to persecute.

• They will try to ban your book soon, I am sure of it. I believe the Statistics Police are first going to go for existing copies of Venn’s “The Logic of Chance” and von Mises’s “Probability, Statistics and Truth”.

Justin

• I loved your stat police site. It’s the kind of thing I’ve been mulling about the last week. I think I should be given an important role in this, don’t you? Being the Frequentist in Exile, Elbagrease and all that.

• Justin:

I think it’s rude to accuse people of book burning when no books are being burned. In addition to being a Godwin’s Law violation, your statement is also inappropriate because people such as myself who are opposing “statistical significance” are not trying to suppress anything; indeed, I’m on record as saying that all scientific findings should be published. There should be no “p less than 0.05” gate, and results that have “p less than 0.05” attached to them should not be considered as special.

This is the opposite of book burning.

• I think it’s very possible that those endorsing the banning of words because the corresponding methods might be misinterpreted & abused aren’t aware of the feeling of condescension & persecution associated with the move; they’re quite convinced it’s all for the good and to promote better science. But the fact is that other statistical terms are ambiguous, being attached to formal notions in statistics as well as being associated with informal meanings: probability, likelihood, confidence and credible come to mind. I’m opposed to banning words and in favor of avoiding statistical fallacies. The thing is we know full well how to avoid the fallacies. Some people evidently did not know it wasn’t OK to move from a computed statistically significant difference at a low level to a scientific theory that entails the statistical effect (probabilistic affirming the consequent)*. Others weren’t aware that multiple testing, data dredging and ignoring stopping rules could invalidate and blow up their error probabilities. Now that they know, many have put into place procedures (e.g., preregistration) to avoid them. But instead of endorsing their improved ways, there are heavy-handed bans on some words in favor of other words–even though those other words can be abused as well.

*Notice that if your account views confirmation as raising the probability of H, i.e., x confirms H if Pr(H|x) > Pr(H), then it endorses the move (from H fits x to H). Scientific claims may also make their appearance in likelihood ratios, Bayes Factors, Bayesian updating, and in the “diagnostic model” of tests that computes so-called “positive predictive values”. The same bad consequences result: methods with a high probability of erroneous inferences. But the alternative methods of evidence are never made the focus of official criticisms, let alone word bans.
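The remark above that multiple testing can blow up error probabilities can be checked with a small sketch. It assumes k independent tests of true null hypotheses, each reported at level alpha; the function names and the choice of k are illustrative only, not taken from any of the papers under discussion.

```python
import random

def familywise_error_rate(alpha: float, k: int) -> float:
    """Chance of at least one p <= alpha among k independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** k

def simulate_fwer(alpha: float, k: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo check: under a true null a valid p-value is uniform on (0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Report "an effect" if any one of the k p-values reaches the threshold.
        if any(rng.random() <= alpha for _ in range(k)):
            hits += 1
    return hits / trials

for k in (1, 5, 20):
    print(k, round(familywise_error_rate(0.05, k), 3), round(simulate_fwer(0.05, k), 3))
```

With 20 looks at the data, the nominal 0.05 is no longer the actual error probability, which is the reason preregistration and adjustments for selection matter.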

• Mark

It strikes me as funny that many of the folks who would call for banning “significant” would at the same time refer to non-statistical and ill-defined concepts like “representative samples”, “the general population”, and “best estimates” to support their own modeling choices. At least *I* know exactly what I mean if I use the word “significant”!

• Of course they are superior and innocent of the fallacies they deplore in others. Where is Goldacre when we need him?

• Hi Andrew. Parody/satire site.
Should check out the names of some of the signatories the Statistics Police gathered though! My favorite is “Faken Ames”

Cheers,
Justin

• I don’t know Justin, but my position is that satires and parodies are very important, precisely when one is under the power of the institutional lawmakers. What is a term for those who do X, or want to use X (for their ends), while declaring that the term associated with X (for decades) is very, very bad?

• Of course no one is banning books, only individual words & perhaps methods. I don’t see the connection between being willing to have everything published somewhere, and not unfairly ruling on whether one of many honorific statistical terms needs banning.
Further, we may agree that a rigid cut-off like .05 needs to be interpreted non-rigidly–as does the .95 confidence interval which is open to the same binary abuse–whether using confidence distributions or other assessments. But a small p-value in well controlled trials, especially if there’s more than 1, is indeed special in indicating a genuine effect. Whether it matters, and what ought to be done, are content specific questions, but that does not mean there’s nothing of scientific importance learned from low p-values in well-controlled studies of drugs and the like. The quickest turnabout that comes to mind (though there are many others) was based on the HRT controlled trials of 2002. The increased risks, while very small, were found to be statistically significant, overturning decades of observational studies that seemed to show otherwise.

http://www.nbcnews.com/id/16206352/ns/health-cancer/t/breast-cancer-drop-tied-less-hormone-therapy/#.XJr6W2RKi2x

• Paul Chaffee

Andrew:

You found that rude–yet you felt it appropriate to reference this statement by someone so closed to debate that he didn’t even read the book he was “reviewing,” where you “appreciate his honesty”:

“I cannot comment on the contents of this book, because doing so would require me to read it, and extensive prior knowledge suggests that I will violently disagree with almost every claim that is being made. Hence I will solely review the book’s title, and state my prediction that the “statistics wars” will not be over until the last Fisherian is strung up by the entrails of the last Neyman-Pearsonite, and all who remain have been happily assimilated by the Bayesian Borg. When exactly this event will transpire I don’t know, but I fear I shall not be around to witness it. In my opinion, the only long-term hope for vague concepts such as the “severity” of a test is to embed them within a rational (i.e., Bayesian) framework, but I suspect that this is not the route that the author wishes to pursue. Perhaps this book is comforting to those who have neither the time nor the desire to learn Bayesian inference, in a similar way that homeopathy provides comfort to patients with a serious medical condition.”

11. Huw Llewelyn

It seems to me that the root of the problem is the concept and definition of the P-value: “If some ‘point’ null hypothesis were true (with a prior probability approaching zero), then the probability of the precise value that has actually been observed or some other more extreme value that has not been observed is P”. This strange concept creates confusion among non-statisticians and endless argument between statisticians.

I understand the situation by having shown that under some conditions (especially when distributions are symmetrical, e.g. Gaussian), the probability of ultimate replication of a result falling within a range of hypothetical parameters less extreme than the null hypothesis is approximately 1-P. Ultimate ‘replication’ only becomes possible after continuing or repeating a study with a near infinite number of observations. As this is not possible we have to be content with only calculating its probability.

The result of this calculation can be a ‘significant’ starting point (e.g. if the probability of ultimate replication less extreme than the null hypothesis is greater than 0.95). It leads to a nuanced analysis of a study result that involves ‘severe’ scientific hypothesis testing based on showing that some hypotheses are improbable based on the data and thus making the other hypotheses (including some that we have not yet considered) more probable.

This reasoning is based on regarding a Bayesian prior probability distribution as a posterior probability distribution. It is regarded as the result of combining a uniform prior distribution of hypothetical parameters with a Bayesian or ‘subjectively estimated’ likelihood distribution. This in turn can be combined with a likelihood distribution based on the observed result of a new formal study to give a second Bayesian posterior probability. This therefore involves two statistically independent likelihood distributions and a uniform prior probability of hypothetical parameters. A frequentist would of course only use one or more likelihood distributions based on observed data. This is explained with detailed examples in my recent PLOS ONE paper: https://doi.org/10.1371/journal.pone.0212302.

• Huw: I’m sorry but I don’t know how to begin to respond to your comment and am in a mad rush to prepare a talk and class for tomorrow. The only thing I can say is: it’s nothing like that. If you don’t have a copy of my book, email me your address and I’ll send you one.

You should begin by focusing on a 1-sided test, say Ho: mu ≤ 0 (for 2-sided, you can do two 1-sided tests). We don’t have good evidence against Ho if we’d observe even larger differences from 0, with high probability, under the assumption we were in a world where only chance variability is operating. So if the p-value is high, we don’t have evidence against Ho. That should get you started.
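A minimal sketch of that 1-sided reasoning, assuming a Normal sample mean with known sigma (the numbers and the function name are hypothetical, chosen only for illustration): the p-value is the probability of a difference at least as large as the one observed, computed under mu = 0.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p_value(xbar: float, sigma: float, n: int, mu0: float = 0.0) -> float:
    """P(sample mean >= xbar), computed under mu = mu0, for Ho: mu <= mu0 (sigma known)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    return 1.0 - NormalDist().cdf(z)

# Small observed difference: even larger ones would occur often by chance alone.
print(one_sided_p_value(xbar=0.05, sigma=1.0, n=25))

# Larger observed difference: such differences would rarely occur under Ho.
print(one_sided_p_value(xbar=0.5, sigma=1.0, n=25))
```

The first call gives a high p-value (about 0.4), so there is no evidence against Ho; the second gives a small one (about 0.006).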

• Huw Llewelyn

Thank you for your prompt and generous reply. I do have a ‘Kindle’ version of your book. A ‘proper’ copy would be nice on my shelf, but it is not essential, of course, if someone else could benefit more. My vocabulary and understanding of probability theory is based on the needs of diagnostic thinking and clinical decision making. This is reflected by the reasoning in my PLOS ONE paper. In order to understand your book even more clearly I will have to continue to ‘translate’ your vocabulary and concepts into those more familiar to me. However it seems to me so far that we are very much in agreement.

• No you won’t, the main points are set out simply. SKIP ANYTHING THAT ISN’T CLEAR, AND YOU’LL CATCH IT IN PLAIN ENGLISH ELSEWHERE. The Kindle version has horrendous symbols, so send your address by email. Thanks for your interest.

• Christian Hennig

Huw: I’d have liked to see this start with a proper and straight formal definition of what is done, rather than with a supposedly “illustrative” example that takes quite some space and has at least three kinds of different boxes, one of which is called “mystery”. I have a hard time following this, and if this is meant to reduce confusion with p-values, good luck with that.

I certainly did not manage to understand your argument why “The prior probabilities of true outcomes for scientific replication have to be uniform by definition”. Actually it is not even clear to me whether by this you want to say that the prior distribution has to be uniform or rather whether you’re talking about a distribution that has some kind of prior probabilities as outcome.

In my understanding, in frequentism there’s no such thing as a prior for a parameter being true. I somehow think that you may agree with this and you want your “prior” to be seen as some kind of frequentist construction, however, I wasn’t able to get my head around that in a reasonable (i.e. maybe admittedly too short) amount of time. What about the objection that you may look at a transformation of the parameter and what is uniform for one way of expressing the parameter is no longer uniform for another?
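The transformation objection can be illustrated with a short simulation. This is only a sketch under an assumed example: a proportion theta in (0, 1) and the re-expression phi = theta², neither of which comes from the paper. A prior that is uniform in theta and one that is uniform in phi give different probabilities to the same claim, theta < 0.5.

```python
import random

def prob_theta_below_half(uniform_on: str, draws: int = 100000, seed: int = 1) -> float:
    """Estimate P(theta < 0.5) when the 'uniform' prior is placed on theta or on phi = theta**2."""
    rng = random.Random(seed)
    count = 0
    for _ in range(draws):
        u = rng.random()
        # If phi is uniform on (0, 1), the implied theta is sqrt(phi).
        theta = u if uniform_on == "theta" else u ** 0.5
        if theta < 0.5:
            count += 1
    return count / draws

print(prob_theta_below_half("theta"))  # close to 0.5
print(prob_theta_below_half("phi"))    # close to 0.25
```

Both priors are "uniform", yet they disagree about the same statement, so any appeal to uniformity has to say on which scale.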

• Thank you Christian, I agree with everything you wrote.

12. Huw Llewelyn

Christian and Deborah

Let me first say that I am trying to simplify matters by regarding P-values as being approximately equal to the probability of the true result being more extreme than the null hypothesis. This means that the probability of long term replication less extreme than the null hypothesis (after making a nearly infinite number of observations in order to get to know the true value of parameter) is therefore 1-P. There is nothing complicated about that I hope.

I think that this makes more sense than the tortuous concept of the P-value which is the probability of the actual study observation or some other more extreme observation that has not actually been seen conditional on a null hypothesis.

I think that I support your view Deborah by calling this the ‘idealistic’ probability of replication that can be turned into a ‘realistic’ probability of replication after severe testing by going through a checklist of possible reasons why the study was not conducted in an impeccably consistent way (and hopefully showing that such flaws are improbable). After establishing the reliability of the study finding, one then uses it to test a scientific hypothesis, again by severe testing in an attempt to show that rival explanations are improbable.

The approach need not be tied to a null hypothesis (e.g. of no difference between an intervention and control). We can specify any range of possible true results that is scientifically important. It may well be narrower however so that the probability of replication within such a narrower range is lower. (In this sense replication based on trivial differences near a null hypothesis may exaggerate a claim). If it is a range that would produce an equal or smaller P-value next time (or the same or higher probability of replication less extreme than the null hypothesis next time) then the probability of the true result falling into such a range will be about 0.5 (i.e. only by chance). In other words the probability of replication is 0% above that expected by chance alone.

It is a well known Bayesian principle that the above is true if we can assume uniform prior probabilities for all possible true values. My point is that we need not assume this as it is a consequence of selecting regular spaces between all the possible true values and all the possible sampling values. (This is where the sample ‘boxes’ come into play in section 2 of the paper.) This also means that the odds in a posterior probability distribution and the likelihood ratios in a likelihood distribution have the same values, and either can be used in calculations of the posterior probability of a true result or a range of true results. I offered to provide a mathematical proof but the referees felt that the best way of making the point was through more extensive examples. Unfortunately this may make for a time consuming read. The principle behind the maths is as follows:

When the prior probability of H is p(H|E1) and if we have a second item of evidence E2 with a likelihood ratio p(E2|Ĥ)/p(E2|H), then by Bayes’ rule and assuming statistical independence between p(E1|Ĥ)/p(E1|H) and p(E2|Ĥ)/p(E2|H):

p(H|E1^E2) = 1/{1+p(Ĥ|E1) / p(H|E1) * p(E2|Ĥ) / p(E2|H)}

However, if U is the universal set such that we have uniform priors p(H|U) = p(Ĥ|U) so that the likelihood ratio p(E1|Ĥ)/p(E1|H) equals the odds p(Ĥ|E1)/p(H|E1), then by Bayes’ rule when there is statistical independence between E1 and E2 with respect to H and Ĥ (as in random sampling):

p(H|E1^E2) = 1/{1+p(Ĥ|U) / p(H|U) * p(E1|Ĥ) / p(E1|H)* p(E2|Ĥ) / p(E2|H)}

This is the basis for the way I model random sampling where by definition the marginal prior probabilities are uniform because the scientist creates regular scales for the parameters and consequently the samples.
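The two odds-form expressions above can be checked numerically. The sketch below uses made-up likelihoods (0.8, 0.2, 0.6, 0.3 are hypothetical values, not taken from the paper) and only verifies the arithmetic of Bayes’ rule under the stated independence and uniform-prior assumptions; it does not address whether those assumptions are warranted.

```python
def posterior_from_odds(prior_odds_against: float, likelihood_ratios: list[float]) -> float:
    """p(H | evidence) = 1 / (1 + prior odds against H * product of p(E|not-H)/p(E|H))."""
    odds = prior_odds_against
    for lr in likelihood_ratios:
        odds *= lr
    return 1.0 / (1.0 + odds)

# Hypothetical likelihoods for two independent items of evidence.
p_e1_h, p_e1_not_h = 0.8, 0.2
p_e2_h, p_e2_not_h = 0.6, 0.3
lr1 = p_e1_not_h / p_e1_h
lr2 = p_e2_not_h / p_e2_h

# Second expression: uniform priors p(H|U) = p(not-H|U), so the prior odds are 1.
via_uniform = posterior_from_odds(1.0, [lr1, lr2])

# First expression: start from p(H|E1), whose odds against H equal lr1 under uniform priors.
p_h_e1 = posterior_from_odds(1.0, [lr1])
via_two_steps = posterior_from_odds((1.0 - p_h_e1) / p_h_e1, [lr2])

# Direct enumeration of the joint probabilities as an independent check.
direct = (0.5 * p_e1_h * p_e2_h) / (0.5 * p_e1_h * p_e2_h + 0.5 * p_e1_not_h * p_e2_not_h)

print(via_uniform, via_two_steps, direct)
```

All three routes agree, which is what the identity between the odds p(Ĥ|E1)/p(H|E1) and the likelihood ratio p(E1|Ĥ)/p(E1|H) requires when the priors are uniform.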

This principle of combining priors and likelihood distributions is discussed in sections 3.8 and 3.9 of the paper with respect to probability distributions. As far as the population boxes and big sample boxes of Figure 1 are concerned the bell shaped distribution could represent a prior distribution (see column C in sheet ‘Figure 1’ in the excel spreadsheet) or a likelihood distribution (see column B in sheet ‘Figure 1’ in the excel spreadsheet). The spreadsheet is found in OSF: https://osf.io/3utkj/?view_only=e0cc5791ed9b4a0899424a20e4611ccf

Does this make sense?

Huw

13. Huw:
You wrote:
“Let me first say that I am trying to simplify matters by regarding P-values as being approximately equal to the probability of the true result being more extreme than the null hypothesis. This means that the probability of long term replication less extreme than the null hypothesis (after making a nearly infinite number of observations in order to get to know the true value of parameter) is therefore 1-P. There is nothing complicated about that I hope.”

Maybe there’s nothing complicated here to you, but I’m sorry, I can’t respond because none of this makes any sense at all to me.

May I suggest you put to one side everything you’ve said and thought about P-values and start over? In fact, ignore formal statistics and consider the example pp 14-16 of the first Tour of SIST (which I’ve linked to on my blog): my weight & arguing from coincidence

Click to access 0-ex1-ti_sist_mayo_share.pdf

• Huw Llewelyn

Thank you Deborah

I have read pages 14 to 16 as suggested. This is a familiar problem to someone like me who has spent years trying to interpret patients’ weights in clinics! You talk about three weighing scales with similar precision. You are implying that from single samples on three weighing scales without holding books, after holding books and on return from the UK (3×3=9 single measurements) that if for each of the 9 situations you had weighed yourself a large number of times, you would get a mean weight with a small standard deviation and be able to plot the narrow distribution of your weights. High precision means that when the weighing is performed repeatedly the probability of replication of weights within a narrow range is high.

By doing this for the three weighing scales and getting the same mean for each you would think it improbable that there was bias (unless all three were biased in the same way) so that the three were therefore probably accurate (i.e. not biased) as well as precise. You also establish that the accuracy applies to differences in weight because your weight increased by an extra 3 pounds when you are carrying three books weighing 3lbs (as measured on another 4th weighing machine presumably). In order to avoid measurement bias due to poor operator methodology (not the fault of the machine) it would be important also to wear the same items of clothing at home and in the medical centre, not to change them randomly when testing the weighing machine and to weigh at the same time of day and in relation to meals at each of the 9 weighing sessions. This is analogous to going through a checklist as part of severe testing to consider whether a scientific study was conducted in a consistent way if it were to be repeated as described.

When you return from the UK, you weigh an extra 4.c lb (c being a constant >0 and <1 that you did not specify). Your scientific hypothesis was that your body weight has increased. However, this scientific inference needs to undergo severe testing by excluding the rival hypotheses, e.g. that after leaving the USA in the summer and maybe returning in the winter you were wearing heavier clothing or had forgotten to take heavy boots off, etc., when using each of the 3 weighing scales at home and the medical centre after your return. I assume that you would have taken care to wear the same clothing etc. to exclude this possibility as part of severe testing of your hypothesis that your body mass had increased! There are other hypotheses too (e.g. was the increased weight due to increased body fat or fluid etc.). These are considerations that I have had to make often in medical clinics!

This process of assessing the precision of the weighing scales (e.g. if you had measured your weight many times on each scale instead of once) is the same process that I followed in the PLOS ONE paper that addresses instead probability of replication of study data. The realistic probability of study replication would also depend on evidence for the absence of probable methodological inconsistencies that would reduce precision and increase bias. This is a point that you have made strongly by applying severe testing according to my understanding. Am I right in thinking this?

14. Christian Hennig

Huw: I’m not sure what “true result” means (a result in my book would be something observed and what is true about it is that it was observed, no more and no less), however I think I understand the approach as a whole. Obviously this kind of thing can be done if you have a prior. I am not convinced by your arguments though that the uniform would be some kind of natural prior here. You didn’t respond to the transformation issue, which is a very old argument against using the uniform as default prior (or even default distribution in classical probability theory).

By the way, personally I find p-values not tortuous at all, for me they’re just fine. It just drives me up the wall that so many people want them to be something more than they actually are. Actually I believe that this will happen as well with whatever is proposed to take their place.

15. Steven McKinney

The definition of “science” can be as contentious as the current bizarre mish-mash of ideas about what statistics is. I’ll offer this definition, offered by a group at the National Academies of Science in the United States in 1999, during a period of contentious political and academic debate about certain topics within science, and science in general.

“Science is a particular way of knowing about the world. In science, explanations are limited to those based on observations and experiments that can be substantiated by other scientists. Explanations that cannot be based on empirical evidence are not a part of science.

In the quest for understanding, science involves a great deal of careful observation that eventually produces an elaborate written description of the natural world. Scientists communicate their findings and conclusions to other scientists through publications, talks at conferences, hallway conversations, and many other means. Other scientists then test those ideas and build on preexisting work. In this way, the accuracy and sophistication of descriptions of the natural world tend to increase with time, as subsequent generations of scientists correct and extend the work done by their predecessors.”

I still regard statistics, even in its current incomplete form, as an invaluable cornerstone for the scientific method, as a technique for understanding the rate at which we make erroneous conclusions as we continuously re-evaluate previous assertions in our quest to correct and extend the work of our predecessors. Statistical conclusions are not always correct, but error statistical methods allow us to understand and manage the rate at which we make errors.

When I read (with considerable dismay) the statement

” . . . generalizations from single studies are rarely if ever warranted.”

I wonder if the authors of this paper

https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1543137

“Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis if We Don’t Expect Replication”

have read the NAS proffered definition of science, and if not, what on earth their definition of science is.

I hope that Sander Greenland, a frequent contributor at this blog site, who offered this excellent discussion about statistical methods in the same journal issue

https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529625

“Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values”

can discuss why he penned his name alongside Amrhein and Trafimow in making such contra-indicated assertions about the scientific method and the statistical assessment of repeatable phenomena. If a single study is not generalizable, its report is an anecdote and not a part of the corpus of science.

• Steven: Great to hear from you. We do get some apt post-modern bumper-stickers such as:
“There Is No Replication Crisis if We Don’t Expect Replication”.
I don’t know if the editorial paper is intended as the official ASA position as with the ASA P-value Statement (2016). There’s a definite danger in encouraging a view that statistics embraces post modernism, radical skepticism, or scientific anarchy.

In Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018), I recommend that researchers seek to falsify some types of inquiries (claims about measurements and experiments), when a purported effect fails to be replicated over and over again, and survives only by propping results up with ad hockeries.

16. Huw Llewelyn

Christian

What I mean by a ‘true result’ during a random sampling process is the result (mean or proportion) obtained after a very large or infinitely large number of observations. It is also important to understand that there are two kinds of prior probability being used in my paper (i.e. Llewelyn H (2019) Replacing P-values with frequentist posterior probabilities of replication—When possible parameter values must have uniform marginal prior probabilities. PLoS ONE 14(2): e0212302. https://doi.org/10.1371/journal.pone.0212302).

The first type is the ‘natural’ prior (e.g. a Bayesian prior) that you are talking about which is beset with problems when an attempt is made to use it as a default uniform prior, exemplified by the transformation problem as you point out.

The second type of prior probability that I use in my paper is quite different. It is imposed on the data rather like placing a frame around a picture. The universal set on which the ‘natural’ non-uniform prior is based then becomes a subset (inside the ‘frame’) of this ‘imposed’ artificial universal set of parameters with uniform prior probabilities. This means that the odds of the non-uniform ‘natural’ individual prior probabilities within the ‘frame’ become equal to the likelihood ratio of the corresponding likelihood distribution. I give a more detailed explanation in the OSF supplement to my paper: https://osf.io/s6qgy/.

Once this is done it becomes apparent that if a P-value calculation is based on a symmetrical distribution (e.g. a Gaussian distribution) it is then equal to the posterior probability of the null hypothesis or something more extreme from the observed result (which could be a proportion or mean). This range of ‘true’ results after making a very large number of observations does not contain the observed result and therefore fails to replicate it. The ‘complement’ to this range (i.e. all true results less extreme than the null hypothesis) does contain the observed result and thus replicates it in the long run. The probability of replication within this range is therefore 1-P. However, I called this the ‘idealistic’ probability of replication as it assumes impeccable methodology. If no fault can be found after a rigorous examination (Mayo calls it ‘severe testing’) it can be regarded as a ‘realistic’ probability of replication.
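The numerical coincidence claimed above can be checked for one simple case. The sketch assumes a Gaussian sample mean with known standard error and a flat prior over a wide grid of parameter values; the observed mean, the standard error, and the function names are all hypothetical. It verifies only the arithmetic, not the interpretation of the result as a probability of replication.

```python
from math import exp
from statistics import NormalDist

def one_sided_p(xbar: float, se: float, mu0: float = 0.0) -> float:
    """P(sample mean >= xbar), computed under mu = mu0."""
    return 1.0 - NormalDist().cdf((xbar - mu0) / se)

def flat_prior_posterior_at_or_below(xbar: float, se: float, mu0: float = 0.0,
                                     half_width: float = 10.0, steps: int = 200001) -> float:
    """Grid approximation of P(mu <= mu0 | xbar) with equal prior weight on every grid value."""
    lo = xbar - half_width * se
    step = (2.0 * half_width * se) / (steps - 1)
    total = 0.0
    below = 0.0
    for i in range(steps):
        mu = lo + i * step
        # Gaussian likelihood of the observed mean given mu; the flat prior adds no extra factor.
        weight = exp(-0.5 * ((xbar - mu) / se) ** 2)
        total += weight
        if mu <= mu0:
            below += weight
    return below / total

print(one_sided_p(0.4, 0.2))
print(flat_prior_posterior_at_or_below(0.4, 0.2))
```

The two numbers agree to several decimal places here, but the agreement depends on the flat prior and on the scale on which it is flat, which is the point disputed elsewhere in this thread.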

All this allows the P-value to be understood in a logical way with a similar reasoning process to that advocated by Bayesians. I think that this explains why the P-value is intuitively useful. However, the P-value is actually an arbitrary index and not even a probability (not a likelihood probability, prior or posterior probability). However, it has the superficial appearance of a probability, which leads to endless confusion and errors. Many think erroneously that it is a false positive rate (or 1 – specificity) that has to be used with ‘lump’ prior probabilities. However, I think its strength is that it is a measure of non-replication as outlined above.

Huw