**Brian Haig, Professor Emeritus**

Department of Psychology

University of Canterbury

Christchurch, New Zealand

The American Statistical Association’s (ASA) recent effort to advise the statistical and scientific communities on how they should think about statistics in research is ambitious in scope. It is concerned with an initial attempt to depict what empirical research might look like in “a world beyond p<0.05” (*The American Statistician, 2019, 73*, S1, 1-401). Quite surprisingly, the main recommendation of the lead editorial article in the Special Issue of *The American Statistician* devoted to this topic (Wasserstein, Schirm, & Lazar, 2019; hereafter, ASA II) is that “it is time to stop using the term ‘statistically significant’ entirely” (p. 2). ASA II acknowledges the controversial nature of this directive and anticipates that it will be subject to critical examination. Indeed, in a recent post, Deborah Mayo began her evaluation of ASA II by making constructive amendments to three recommendations that appear early in the document (‘Error Statistics Philosophy’, June 17, 2019). These amendments have received numerous endorsements, and I record mine here. In this short commentary, I briefly state a number of general reservations that I have about ASA II.

*1. The proposal that we should stop using the expression “statistical significance” is given a weak justification*

ASA II proposes a superficial linguistic reform that is unlikely to overcome the widespread misconceptions and misuse of the *concept* of significance testing. A more reasonable, and common-sense, strategy would be to diagnose the reasons for the misconceptions and misuse and take ameliorative action through the provision of better statistics education, much as ASA I did with *p* values. Interestingly, ASA II references Mayo’s recent book, *Statistical Inference as Severe Testing* (2018), when mentioning the “statistics wars”. However, it refrains from considering the fact that her error-statistical perspective provides an informed justification for continuing to use tests of significance, along with the expression, “statistically significant”. Further, ASA II reports cases where some of the Special Issue authors thought that use of a *p*-value threshold might be acceptable. However, it makes no effort to consider how these cases might challenge its main recommendation.

*2. The claimed benefits of abandoning talk of statistical significance are hopeful conjectures.*

ASA II makes a number of claims about the benefits that it thinks will follow from abandoning talk of statistical significance. It says, “researchers will see their results more easily replicated – and, even when not, they will better understand *why*”. “[We] will begin to see fewer false alarms [and] fewer overlooked discoveries …”. And, “As ‘statistical significance’ is used less, statistical thinking will be used more.” (p. 1) I do not believe that any of these claims are likely to follow from retirement of the expression, “statistical significance”. Unfortunately, no justification is provided for the plausibility of any of the alleged benefits. To take two of these claims: First, removal of the common expression, “significance testing” will make little difference to the success rate of replications. It is well known that successful replications depend on a number of important factors, including research design, data quality, effect size, and study power, along with the multiple criteria often invoked in ascertaining replication success. Second, it is just implausible to suggest that refraining from talk about statistical significance will appreciably help overcome mechanical decision-making in statistical practice, and lead to a greater engagement with statistical thinking. Such an outcome will require, among other things, the implementation of science education reforms that centre on the conceptual foundations of statistical inference.

*3. ASA II’s main recommendation is not a majority view.*

ASA II bases its main recommendation to stop using the language of “statistical significance” in good part on its review of the articles in the Special Issue. However, an inspection of the Special Issue reveals that this recommendation is at variance with the views of many of the 40-odd articles it contains. Those articles range widely in the topics they cover and in their attitudes to the usefulness of tests of significance. By my reckoning, only two of the articles advocate banning talk of significance testing. To be fair, ASA II acknowledges the diversity of views held about the nature of tests of significance. However, I think that this diversity should have prompted it to take proper account of the fact that its recommendation is only one of a number of alternative views about significance testing. At the very least, ASA II should have tempered its strong recommendation not to speak of statistical significance any more.

*4. The claim for continuity between ASA I and ASA II is misleading.*

There is no evidence in ASA I (Wasserstein & Lazar, 2016) for the assertion made in ASA II that the earlier document stopped just short of recommending that claims of “statistical significance” should be eliminated. In fact, ASA II marks a clear departure from ASA I, which was essentially concerned with how to better understand and use *p*-values. There is nothing in the earlier document to suggest that abandoning talk of statistical significance might be the next appropriate step forward in the ASA’s efforts to guide our statistical thinking.

*5. Nothing is said about scientific method, and little is said about science.*

The announcement of the ASA’s 2017 Symposium on Statistical Inference stated that the Symposium would “focus on specific approaches for advancing scientific methods in the 21st century”. However, the Symposium, and the resulting Special Issue of *The American Statistician*, showed little interest in matters to do with scientific method. This is regrettable because the myriad insights about scientific inquiry contained in contemporary scientific methodology have the potential to greatly enrich statistical science. The post-*p* < 0.05 world depicted by ASA II is not an informed scientific world. It is an important truism that statistical inference plays a major role in scientific reasoning. However, for this role to be properly conveyed, ASA II would have to employ an informative conception of the nature of scientific inquiry.

*6. Scientists who speak of statistical significance do embrace uncertainty.*

I think that it is uncharitable, indeed incorrect, of ASA II to depict many researchers who use the language of significance testing as being engaged in a quest for certainty. John Dewey, Charles Peirce, and Karl Popper taught us quite some time ago that we are fallible, error-prone creatures, and that we must embrace uncertainty. Further, despite their limitations, our science education efforts frequently instruct learners to think of uncertainty as an appropriate epistemic attitude to hold in science. This fact, combined with the oft-made claim that statistics employs ideas about probability in order to quantify uncertainty, requires from ASA II a factually-based justification for its claim that many scientists who employ tests of significance do so in a quest for certainty.

Under the heading, “Has the American Statistical Association Gone Post-Modern?”, the legal scholar, Nathan Schachtman, recently stated:

The ASA may claim to be agnostic in the face of the contradictory recommendations, but there is one thing we know for sure: over-reaching litigants and their expert witnesses will exploit the real or apparent chaos in the ASA’s approach. The lack of coherent, consistent guidance will launch a thousand litigation ships, with no epistemic compass. (‘Schachtman’s Law’, March 24, 2019)

I suggest that, with appropriate adjustment, the same can fairly be said about researchers and statisticians, who might look to ASA II as an informative guide to a better understanding of tests of significance, and the many misconceptions about them that need to be corrected.

**References**

Haig, B. D. (2019). Stats: Don’t retire significance testing. *Nature, 569,* 487.

Mayo, D. G. (2019). The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean (Some Recommendations)(ii), blog post on Error Statistics Philosophy Blog, June 17, 2019.

Mayo, D. G. (2018). *Statistical inference as severe testing: How to get beyond the statistics wars.* New York, NY: Cambridge University Press.

Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on *p*-values: Context, process, and purpose. *The American Statistician, 70*, 129-133.

Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Editorial: Moving to a world beyond “*p*<0.05”. *The American Statistician, 73*, S1, 1-19.

I thank Brian for taking me up on my offer in my recent post:

https://errorstatistics.com/2019/06/17/the-2019-asa-guide-to-p-values-and-statistical-significance-dont-say-what-you-dont-mean-some-recommendations/

“A broad, open, critical debate is sorely needed. Still, we can only debate something when there is a degree of clarity as to what “it” is. I will be very happy to post reader’s meanderings on ASA II (~1000 words) if you send them to me.”

Please send me your views for potential guest posting.

The question of just what ASA II is stating, however, remains a major source of unclarity. I assumed several of the statements were slips, written in the hasty exuberance of taking such a strong line. Now I am much less sure, since there has, to my knowledge, been no move to modify some of the statements, at least to cohere with ASA I, which ASA II now does not:

♦ Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof). (p. 1)

♦ No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p.2)

♦ Whether a p-value passes any arbitrary threshold should not be considered at all when deciding which results to present or highlight. (p. 2, my emphasis)

It is not just that these are at odds with the ASA principle 1 (as discussed in my post), they would also be at odds with other accounts to which the ASA appears to give its blessing. That is, outcomes attaining a given small p-value threshold map 1-1 onto outcomes leading to corresponding inferences based on CIs, Bayes Factors, posteriors, etc. (even if it’s only when we can take the assumptions as holding). You can’t say the very same requirement for an inference holds using one word and not another word.
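
The correspondence between a p-value threshold and a confidence interval can be illustrated with a small sketch. It assumes a two-sided z-test of a zero null with a known standard error, which is a simplifying assumption and not something specified in the post; the function names are hypothetical. Under those assumptions, p < 0.05 holds exactly when the 95% interval excludes zero.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (stdlib only)


def two_sided_p(estimate, se):
    """Two-sided p-value for H0: effect = 0 under an assumed z-test."""
    z = abs(estimate / se)
    return 2 * (1 - STD_NORMAL.cdf(z))


def confidence_interval(estimate, se, level=0.95):
    """Symmetric z-based confidence interval at the given level."""
    z_crit = STD_NORMAL.inv_cdf(1 - (1 - level) / 2)
    return estimate - z_crit * se, estimate + z_crit * se


def ci_excludes_zero(estimate, se, level=0.95):
    """True when the interval lies entirely above or entirely below zero."""
    lower, upper = confidence_interval(estimate, se, level)
    return lower > 0 or upper < 0


# Illustrative estimates (made-up values, standard error fixed at 1.0).
for estimate in (0.5, 1.5, 1.9, 2.0, 2.5, -2.2):
    p = two_sided_p(estimate, se=1.0)
    print(estimate, round(p, 4), p < 0.05, ci_excludes_zero(estimate, se=1.0))
```

For every estimate in the loop the last two columns agree, which is the sense in which dropping the word "significant" while keeping the interval does not remove the threshold.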

If science should be humble and self-critical, as ASA II rightly suggests, then meta-science must be as well. The ASA should show itself as an exemplar of willingness to find flaws in its standpoint, especially when they appear to be at odds with best practices in controlled studies. In responding to Amrhein et al. (2019), Cook et al. (2019) write (in the journal Clinical Trials):

“The carefully designed clinical trial based on a traditional statistical testing framework has served as the benchmark for many decades. It enjoys broad support in both the academic and policy communities. There is no competing paradigm that has to date achieved such broad support. The proposals for abandoning p-values altogether often suggest adopting the exclusive use of Bayesian methods. For these proposals to be convincing, it is essential their presumed superior attributes be demonstrated without sacrificing the clear merits of the traditional framework. Many of us have dabbled with Bayesian approaches and find them to be useful for certain aspects of clinical trial design and analysis, but still tend to default to the conventional approach notwithstanding its limitations. While attractive in principle, the reality of regularly using Bayesian approaches on important clinical trials has been substantially less appealing – hence their lack of widespread uptake.”

https://errorstatistics.files.wordpress.com/2019/06/cook-there-is-still-a-place-for-significance-testing-in-clinical-trials-2019-4.pdf

Haig cites lawyer Schachtman’s concern, and it is one I entirely share; namely, that ASA II will be a ready resource to free researchers from culpability for spinning their interpretations of results that can readily be produced by (or in other cases, scarcely produced by) chance variability. http://schachtmanlaw.com/has-the-american-statistical-association-gone-post-modern/

People would rightly refuse to take part in clinical trials were it to become clear that new rules allow so much latitude for spinning unwelcome results as purportedly showing evidence of the presence, or of the absence, of a genuine effect of concern. Putting a high/low degree of belief on the claim will not somehow make it more fair-minded.

What is to happen to testing altogether if even minimal thresholds are abandoned?

In this connection, it is very important to have a clarification of whether Principle 4 from ASA I still holds in ASA II. This was the principle requiring the reporting of things like data-dependent endpoints, stopping rules and other biasing selection effects. As Cook et al. observe:

“By removing the prespecified significance level, typically 5%, interpretation could become completely arbitrary. It will also not stop data-dredging, selective reporting, or the numerous other ways in which data analytic strategies can result in grossly misleading conclusions.”

P-values are directly invalidated by these gambits; it’s not clear that alternative methods are.

The bottom line is that no new formal method can turn troubled scientific practices into sound science. What is required for scientific inquiry with integrity is no big secret. By giving so much weight to the position that the problems with today’s industrialized science are the fault of a very artificial version of error statistical tests, there is a real and present danger that we lose the ability to hold accountable the formal “experts” who have so much control over our lives.

At the risk of being repetitive, let me share some thoughts on the above comments of Haig and follow ups by Mayo. Two comments come up:

1. Inbreeding – the discussions regarding ASA I and ASA II do not seem to involve the “customers”. Specifically, I am thinking of people in industry and health care research and delivery. The gap between theory and practice seems to have been widening. Moreover, these discussions assume that these customers listen and wait attentively for the statistics community to pass on a verdict on how statistical analysis should be done to ensure reproducibility and repeatability etc…. The 2017 symposium Haig referred to had very few practitioners, either as speakers or as attendees. Most of the discussions in the sessions were inbred exchanges, as if we (statisticians) live on an isolated planet. This ignores the many professional disciplines now active in the data analytics playground.

2. On several occasions I mentioned the importance of generalisations of findings and findings’ representation. This seems to me a crucial aspect that is apparently not getting attention. A proposal for this is made in https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070. Will be glad to engage in discussions on it.

The only reason that I joined 840 others in signing the letter which advocated dropping the term “statistically significant” as a synonym for p < 0.05 is that dichotomisation makes no sense. It is obviously silly to treat p = 0.049 as meaning something different from p = 0.051. The letter did not say that p values should be dropped. It didn't even touch much on the deficiencies of p values as evidence.

I don't see how anyone can defend treating p = 0.05 as a threshold. The fact that so many users do exactly that has hindered the progress of science.

Neyman and Pearson were quite clear that using rigid cut-offs, and interpreting a p-value of .049 differently than .051, is silly, but that is quite separate from the stipulations of ASA II. The fact that p-values are continuous does not mean we can’t distinguish p = .4 from p = .0001. Many of the same people who signed on to Gelman’s call to “sign a petition” – which concerned, not ASA II, but Amrhein et al.’s paper – https://statmodeling.stat.columbia.edu/2019/03/05/abandon-retire-statistical-significance-your-chance-to-sign-a-petition/

agree with my recommended changes to ASA II. Do you? Do you concur that:

♦ No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p.2)

Do you not distinguish very large from very small PPV values? At what precise point will there be an indication of poor evidence? If you can’t answer, does it follow we should not use the term PPV?

Appealing to the fact that .048 doesn’t indicate an important difference from .049 as grounds to not say “significance” is a non-sequitur. And you will notice my recommended revisions are quite distinct from saying/not saying the word.

Perhaps you don’t read as much of the biomedical literature as I do. It’s the norm, in fact almost universal, to claim an effect by saying that it’s “statistically significant” (or the equivalent asterisk). In practice p < 0.05 is taken as a hard line between truth and falsehood. And journals have done nothing to discourage this practice.

You mention PPV but I do not believe that PPV is the right thing to use when trying to interpret an observed p value. That is because it is calculated from tail areas, not densities. In other words it is calculated by the p-less-than method. I maintain that if you have observed, say, p = 0.03 then you are interested only in experiments that produce p = 0.03, and values less than that are irrelevant to the interpretation of the observed p value. That is why it seems sensible to calculate the likelihood ratio from densities, not from tail areas, i.e. using the p-equals approach. When this is done, I refer to the result as a false positive risk (FPR), and under most circumstances it's a good deal higher than would be inferred from the PPV.

More details about this distinction in https://royalsocietypublishing.org/doi/10.1098/rsos.171085

If we calculate the likelihood ratio as L_10, the ratio for the best-supported alternative relative to H0, then if we have observed p = 0.05 we get L_10 of about 3 (compared with about 15 when the p-less-than approach is used, as for PPV). Odds of 3 show very weak support for H1 compared with the odds of 19:1 that are commonly, but mistakenly, inferred from the p value.

I'd be quite happy for people to report, along with the p value and confidence intervals, the likelihood ratio, the corresponding value of L_10, the odds in favour of there being a real effect, relative to there being no true effect. That is a frequentist measure (under my assumptions) and it measures the evidence that’s provided by the experiment.

If these odds are expressed as a probability, rather than as odds, we could cite, rather than L_10, the corresponding probability 1/(1 + L_10). I suggest that a sensible notation for this probability is FPR_50, because it can, in Bayesian context, be interpreted as the false positive risk when you assume a prior probability of 50%. But since it depends only on the likelihood ratio, there is no necessity to interpret it in that way, and it would save a lot of argument if one didn’t.

By way of example, if you have observed p = 0.05, the FPR_50 is about 0.27, much higher than the value of 0.06 that would be inferred from the PPV approach: see http://fpr-calc.ucl.ac.uk/
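
The p-equals versus p-less-than contrast can be sketched numerically. The block below is a rough illustration under assumptions that are not in the comment: it uses a two-sided z-test (normal approximation) and a single point alternative chosen so that a test at 0.05 would have power of about 0.78. The function names are hypothetical, and the numbers it produces are only expected to be close to, not identical with, those quoted above.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (stdlib only)


def z_from_two_sided_p(p):
    """Absolute z-score corresponding to a two-sided p-value."""
    return STD_NORMAL.inv_cdf(1 - p / 2)


def noncentrality_for_power(power, alpha=0.05):
    """Mean of the z-statistic under H1 giving roughly the requested power.

    Assumption: the negligible rejection probability in the far tail is ignored.
    """
    return STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)


def lr_p_equals(p_obs, delta):
    """Likelihood ratio H1:H0 from densities at the observed |z| (p-equals)."""
    z = z_from_two_sided_p(p_obs)
    density_h1 = STD_NORMAL.pdf(z - delta) + STD_NORMAL.pdf(z + delta)
    density_h0 = 2 * STD_NORMAL.pdf(z)
    return density_h1 / density_h0


def lr_p_less_than(p_obs, delta):
    """Ratio of tail areas: P(p <= p_obs | H1) / P(p <= p_obs | H0)."""
    z = z_from_two_sided_p(p_obs)
    tail_h1 = (1 - STD_NORMAL.cdf(z - delta)) + STD_NORMAL.cdf(-z - delta)
    return tail_h1 / p_obs


def false_positive_risk(likelihood_ratio, prior_h1=0.5):
    """Probability of H0 given the data; equals 1 / (1 + LR) when prior_h1 = 0.5."""
    posterior_odds_h1 = likelihood_ratio * prior_h1 / (1 - prior_h1)
    return 1 / (1 + posterior_odds_h1)


delta = noncentrality_for_power(0.78)  # assumed power, for illustration only
for label, lr in (("p-equals", lr_p_equals(0.05, delta)),
                  ("p-less-than", lr_p_less_than(0.05, delta))):
    print(label, round(lr, 2), round(false_positive_risk(lr), 3))
```

With these assumed inputs the p-equals ratio comes out at roughly 2.5, giving an FPR_50 near 0.28, while the p-less-than ratio is close to 15.6, giving about 0.06. The figures in the comment come from a different calculation, so small differences are expected.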

As Stephen Senn has pointed out, many Bayesians dislike the idea of testing a point null hypothesis, which is implicit in my calculation of FPR. It certainly isn't a unique approach but it seems to me a perfectly reasonable one. Several other approaches suggest FPR values between 0.2 and 0.3 when you observe p = 0.05. See https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529622

The problem with your “diagnostic screening model” of tests, which I’ve discussed in great detail elsewhere and is covered in Section 5.6, p. 361 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP), isn’t that it uses Bayes’ rule, it’s that it’s seriously flawed. Nor have you responded to the criticisms of this approach. It begins by dichotomizing all effects into “no effect” or “meaningful effect”, all else in between ignored. It then imagines my current hypothesis has been randomly selected from an urn of nulls with a known prevalence of true nulls – a high prevalence is needed to make this criticism of p-values work. Next it is imagined we infer a meaningful effect based on a single small p-value, e.g., .05, with selection effects to boot. Here’s the relevant portion of SIST:

https://errorstatistics.files.wordpress.com/2019/05/excur5-tourii.pdf

In other words, it will use the most abusive variation of tests, but will allow you to make up for it by claiming it was selected from an urn with a high prevalence of true effects. It will even allow a high PPV, so long as you can say your hypothesis was selected from an urn with high prior prevalence.

Few Bayesians or frequentists would assign a probability of .9 to a particular null hypothesis based on the supposition that the hypothesis I’m testing H’ is randomly selected from an urn of null hypotheses 90% of which are true. It is a fallacy of probabilistic instantiation (or principle of indifference), not to mention that we don’t know these prevalences. And plus, we do have relevant info about this particular hypothesis, and I could sort it into many different reference classes accordingly–giving inconsistent results. Are we to consider the proportion of meaningful effects in science as a whole? the proportion of meaningful effects in your research program? in this year’s studies? We can cut up the reference class in numerous ways, and our “prior” is altered dramatically. This is neither a frequentist nor a legitimate Bayesian prior, and is frankly irrelevant to assessing the warrant for a particular hypothesis based on given evidence. When diagnostic screeners use the diagnostic model, say in deciding whether to examine a gene (or a piece of luggage) more closely, they are not taking the prevalence as a probabilistic appraisal of the evidence for the particular genetic hypothesis.
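
How strongly the screening-model PPV depends on the chosen reference class can be seen in a short sketch. It uses the standard screening formula, PPV = power × prevalence / (power × prevalence + α × (1 − prevalence)); the values α = 0.05 and power = 0.8, and the function name, are assumptions made only for illustration.

```python
def screening_ppv(prevalence_true_effects, alpha=0.05, power=0.8):
    """PPV in the diagnostic-screening picture of tests.

    prevalence_true_effects: assumed share of 'meaningful effect' hypotheses
    in the urn the tested hypothesis is imagined to be drawn from.
    """
    true_positives = power * prevalence_true_effects
    false_positives = alpha * (1 - prevalence_true_effects)
    return true_positives / (true_positives + false_positives)


# Same test, same data; only the imagined urn (reference class) changes.
for prevalence in (0.1, 0.5, 0.9):
    print(prevalence, round(screening_ppv(prevalence), 3))
```

Holding the test fixed, the PPV moves from 0.64 at a prevalence of 0.1 to about 0.94 at 0.5 and about 0.99 at 0.9, so the figure is driven largely by which urn the hypothesis is said to come from.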

Ironically, the entire approach is based on the most dichotomous variation on tests: rather than report the p-value, it instructs you to “infer meaningful effect” or “infer no effect”. Then, after this most problematic dichotomizing, it tries to compute a “posterior” on the “meaningful effect” hypotheses, never mind that all the “prior” is used up on “no effect” and “meaningful effect” with no assignment in between.

This would be a truly terrible way to do science. Fields with high “crud factors” (Meehl) – which is why in those fields many say all nulls are false – would get high PPV values. But you still wouldn’t know how to replicate the crud factor effect, even if it’s a “true” association. I could go on….

For a blog post on this see https://errorstatistics.com/2015/12/05/beware-of-questionable-front-page-articles-warning-you-to-beware-of-questionable-front-page-articles-2/

David

Boiling down statistical analysis to a dichotomised crossroad is demeaning the contribution of statisticians to scientific investigations. In fact, I believe that we are over this now since the P value calculations are provided in all statistical software platforms and the question is how to use and interpret them properly. For example, JMP is using a surprise factor in the form of a LogWorth that builds on P values (Greenland also suggested introducing this). In other words, we are way beyond the static tables of critical values we used to have in appendices of statistical texts. If anyone is interested in under-the-hood calculations, my book on Modern Industrial Statistics gives details on exact and approximate calculations https://www.wiley.com/en-us/Modern+Industrial+Statistics%3A+with+applications+in+R%2C+MINITAB+and+JMP%2C+2nd+Edition-p-9781118763698
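
For readers unfamiliar with these transformations, here is a minimal sketch. It assumes the usual definitions, LogWorth as −log10(p) and Greenland's surprisal (S-value) as −log2(p); the function names are hypothetical and this is not JMP's own code.

```python
import math


def logworth(p_value):
    """LogWorth under the assumed definition -log10(p)."""
    return -math.log10(p_value)


def s_value(p_value):
    """Surprisal in bits under the assumed definition -log2(p)."""
    return -math.log2(p_value)


for p in (0.05, 0.01, 0.005):
    print(p, round(logworth(p), 2), round(s_value(p), 2))
```

Both are continuous rescalings of the p-value, so they carry no built-in cut-off; p = 0.05 corresponds to a LogWorth of about 1.3 and a surprisal of about 4.3 bits.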

The point is that this is not the point. The contributions of statistics and statistical thinking are way beyond the P value discussions. Besides the paper you are signed on to, I also find suggestions to use a cut-off of 0.005 or Bayes factors as reducing statistical work to a singular point, missing out on the big picture.

What I find terrible is that such articles are typically not constructive. We should move ahead with new ideas and new suggestions on how to improve things. The retrospective salvo is further pushing statisticians into a corner, instead of moving them to center stage.

RKennet:

What do you mean by “The retrospective salvo is further pushing statisticians into a corner, instead of moving them to center stage.”

I can sort of guess, but I’d like to hear your clarification.

Deborah

I think people do not realise the severity of the demise of Statistics. Just as an example, I just heard that IBM is dismantling its statistics group (a rumour….).

Specifically, “The current status of statistics in industry is strong; however, the status of statisticians in industry is possibly at an all-time low.” This quote is from Sallie Keller-McNulty in a panel on the future of industrial statistics (Technometrics, 50, 2, p. 105, 2008). In 2019, this diagnostic assessment is an understatement.

Data science, statistical learning and AI are now pervasive and my new book on The Real Work of Data Science is presenting a wider framework coming from my statistical background, wider than the one of Comp Sci and IE majors.

By the way, at the JSM in Vancouver, a dozen video clips of ASA members were shown before the president’s address. All the interviewees had as title: “data scientists”….

Ten years ago I asked for comments on a “note” of this situation that eventually reached version 18: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2171179. The note is a sketch of what could be the basis for a theory of applied statistics. As mentioned, it is a sketch presenting a wide view to build on.

The retrospective salvo I referred to is an implicit referral to Tukey’s sunset salvo https://www.tandfonline.com/doi/abs/10.1080/00031305.1986.10475361

Compare the comments of Tukey on where we are and where we are headed to the ASA documents listed above.

Sunset salvo is a constructive look ahead conceptual paper. It deals with something exciting for statistics, statisticians and the users of statistics.

Retrospective salvo is backwards looking, basically destroying the past without offering an optimistic vision for the future.

We need to say what to do, not what to not do….

ron

rKenett:

Concern about the demise of statistics, in the face of data science, AI and machine learning sparked the Year of Statistics in 2013, and great efforts ensued to revise curricula to avoid being left in the dust by these popular fields. The first call of alarm that I remember was in Normal Deviate’s blog (by Larry Wasserman). I believe the current P-value project at ASA is one more outgrowth, but as you say, it’s only backfiring. Mayo’s recommended revision would help avoid the appearance that the ASA is out on a limb, getting noticed by advancing an imagined politically correct stance (a comment by Miodrag in Mayo’s post speaks to this).

Yes, we all should encourage Mayo to draft, perhaps as a ms for TAS, a somewhat reworded/recast version of the principles/recommendations in ASA I and Wasserstein et al. (2019). Then enlist a largish, serious group of co-authors to sign on to the revision. Wasserstein, Lazar and Schirm hopefully would be amenable to signing on. The idea of asking them to revise their own article I see as quixotic.

But we have two obstacles. 1) Mayo begs off this project on the grounds that she is not a professional statistician. Possible translation: I have other priorities for the remainder of my summer;

and 2) Mayo disagrees with the one recommendation that has widest, longstanding support and that at least some editorial boards might be willing to implement: disallow the phrase “statistically significant” in research ms and in that way stimulate authors (and editors and reviewers) into more nuanced thinking and writing.

To my mind the single greatest accomplishment of “ASA II” has been to stimulate a fermentation process that has demonstrated how much support there is in the stats community, especially among those most knowledgeable about the foundational literature, for implementing this disallowance.

It goes against the recommendations of every stat book in my library! More power to it.

If statisticians after a century can’t even acknowledge the inappropriateness of verbally dichotomizing the P scale, then maybe they merit all being subsumed into Data Science departments and becoming “the tail that the dog wagged.”

Stuart

Some context on statistics as “the tail that the dog wagged” and why it does not have to be so.

Deming advocated a role of director of statistical methods in organisations aiming at enhanced quality, productivity and competitive position. The current era requires an update on what Deming prescribed, usually in the form of a chief data scientist or chief information officer. Our new book on The Real Work of Data Science is about that. https://www.wiley.com/en-us/The+Real+Work+of+Data+Science%3A+Turning+data+into+information%2C+better+decisions%2C+and+stronger+organizations-p-9781119570707

In fact, chapter 15, titled “Putting Data Science, and Data Scientists, in the Right Spots”, is an homage to Deming.

Statisticians have an opportunity to step into this role. I believe that a background in statistical thinking makes statisticians better suited to this role than people trained in Comp Sci or other fields. So far, ASA has ignored this message (and the book). In contrast, others have appreciated it and it has received ample exposure in various associations and conferences. Hopefully ASA wakes up to this call.

Some of the feedback we got on the book is listed on the book’s website and inside cover, including some nice words from David Cox….

ron

Stuart:

I never begged off this project because I’m not a statistician. My comment is here:

https://errorstatistics.com/2019/06/17/the-2019-asa-guide-to-p-values-and-statistical-significance-dont-say-what-you-dont-mean-some-recommendations/#comment-183767

Its main point is:

“I think it’s extremely important for the authors of ASA II to make the revisions–at least (1) and (2). There is nothing in ASA I that is seriously misleading. But these points are. It would be painless and should be done soon. I think the statement (1) was inadvertent, so why not fix it?”*

I truly thought at least one was an inadvertent slip and informed Wasserstein right away. The editorial is on-line, and when you go to it, it says something like: look out for corrections/updates (I forget the words), so it seems to me very wise and painless to at least correct (1) and (2).

I like Wasserstein, and it seems to me that someone should have caught these slips and the inconsistency with ASA I for him. I know that there was a band of writers for ASA I, and with such a complex piece as this one, I’m supposing it too had others giving editorial oversight. Whoever they are, my opinion is that they should have identified these points, and a few others–lest they come back to haunt.

*In a different comment I referred to not being a statistician wrt proposing a forum in a statistical conference such as the JSM. I did help with one on “reconsidering the significance test controversy” in Cardiff last September including David Cox.

Hurlbert:

There you go again. While you twice denied that journals should be proscribing the use of terms to authors, you continually turn around and call for this.

You wrote:

“Recall that these are from someone who thinks there is no possibility of or need for proscribing or requiring (e.g., in an ASA document, in a journal’s “instructions to authors”, or in a textbook) particular statistical methodologies”

https://errorstatistics.com/2019/06/17/the-2019-asa-guide-to-p-values-and-statistical-significance-dont-say-what-you-dont-mean-some-recommendations/#comment-183978

Moreover, you yourself are an activist in banning the word significant. The ASA II document calls for critical assessment, and no such input has been solicited. So which is it?

When I speak at or visit statistics departments, the vast majority either don’t have a clue that the ASA is busy with word bans, or if they have heard of it, roll their eyes and seem embarrassed. Most people don’t have time for such things.

Why not ask how many are in favor of stopping the rehearsals of the identical, age-old, hackneyed criticisms of P-values? I’ll bet there would be a flood of yahs.

Mayo,

Large numbers of ASA members and others seem to agree that better reporting would result from disallowing use of the phrase “statistically significant” in the reporting of research results. I am sure there is a diversity of motivations for that particular position.

Whether ASA journals wish to disallow the phrase “statistically significant” should be up to its editorial board, and the same for all other journals. It would seem a much more useful step for journals in other disciplines.

As for your eye-rolling friends, I wonder how many have a solid grasp of the historical literature or have published substantive contributions on these fundamental issues…..maybe 5% if we’re lucky?

Stuart:

You need to make up your mind (as dichotomous as that might seem). Are journal editors to be nudged, cajoled, harassed into barring authors from using a word, or is there “no possibility of or need for proscribing or requiring (e.g., in an ASA document, in a journal’s ‘instructions to authors’, or in a textbook)”, as you say below?

https://errorstatistics.com/2019/06/17/the-2019-asa-guide-to-p-values-and-statistical-significance-dont-say-what-you-dont-mean-some-recommendations/#comment-183978

“Recall that these are from someone who thinks there is no possibility of or need for proscribing or requiring (e.g., in an ASA document, in a journal’s “instructions to authors”, or in a textbook) particular statistical methodologies; the only possible consensus is that we should understand the methods we use, we use them correctly and we interpret them appropriately.”

Individual scientists ought to decide what tools and words to use, and journal editors should not be prodded, nudged or hammered into jumping on a bandwagon of word bans (even if it’s “for their own good”). The weak sciences that abuse tests, committing age-old fallacies of moving from observed associations to causes, and the like, do not become stronger sciences this way. But that’s not even the issue here; the issue is misstatements by the ASA ensconced in ASA II.

As for the depth of knowledge of eye-rollers, there’s David Cox (whose birthday happens to be today).

Part of the revision of curricula you mentioned led to the paper by De Veaux, R., Agarwal, M., Averett, M. et al. (2017). “Curriculum guidelines for undergraduate programs in data science.” Annual Review of Statistics and Its Application 4: 15–30. This proposal is mostly tools-oriented.

To look at the big picture, together with Shirley Coleman, I reviewed existing data science programs, offered in Europe, using the 8 quality dimensions. The idea was to see how these programs cover issues such as data resolution, data integration, chronology of data and goal, operationalisation, generalisation and communication. These information quality dimensions are what we considered necessary for generating information quality. This perspective integrates the various tools listed in De Veaux et al and emphasizes a statistical analysis strategy. Statisticians should be involved in such efforts. That paper is: Coleman, S. and Kenett, R. (2017) “The Information Quality Framework for Evaluating Data Science Programs”, Encyclopedia with Semantic Computing and Robotic Intelligence, World Scientific Press, 1 (2), pp. 125-138.

The bottom line is that a data science curriculum should focus on education for generating information quality. This means lots of work for statistics (and ASA….). It also represents what leading with statistics (the latest ASA slogan) should be about. Is ASA involved in a message that statistics is about information quality? Not sure….

More information on this in https://www.wiley.com/en-us/The+Real+Work+of+Data+Science%3A+Turning+data+into+information%2C+better+decisions%2C+and+stronger+organizations-p-9781119570707

ron

@rkenett

That is, more or less, what I was trying to say.

You say “The contributions of statistics and statistical thinking are way beyond the P value discussions”. That is the case among statisticians, but a glance at any biomedical journal shows that it is not the case among users, where p < 0.05 is still a near universal criterion for making a claim. The vast majority of NHST are done without the benefit of professional statistical advice.

I most certainly agree when you say "What I find terrible is that such articles are typically not constructive". I suspect that the reason that journals have failed to provide better statistical advice is a result of the fact that statisticians have failed totally to agree on how to judge whether the means of two independent groups are really different. If journals ask statisticians what they should do, they get different advice from different statisticians. I suspect that that's one reason for the glacial progress in improving the statistical standards.

Unlike many of the articles in the special issue of American Statistician, I made a concrete suggestion about how to improve matters. I suggest that, as well as giving the p value and CI, you should also give one extra number. The most comprehensible number would be the false positive risk (FPR) for a prior odds of 1 (comprehensible to users, because it tells you what most users still mistakenly think that a p value tells you). Details are in https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529622 (and the references therein).

Inevitably the FPR involves Bayes' theorem, so, inevitably, its use triggers internecine warfare among statisticians. There is an infinitude of ways of calculating the FPR. I chose to test a point null because it makes the math simple and much easier to understand for users. The Bayes Factor becomes a likelihood ratio, and so it's a deductive quantity that avoids the problem of induction.

Needless to say, the idea has attracted flak from both purist frequentists and purist Bayesians. Nevertheless, I think that it would be a great deal better than the present situation.
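For readers who want to see the arithmetic, here is a minimal sketch of how Bayes' theorem turns a likelihood ratio and a prior probability into a false positive risk. The function name and the example likelihood ratio of 3 are illustrative assumptions only; the linked paper derives the likelihood ratio from the test itself, which this sketch does not attempt.

```python
def false_positive_risk(likelihood_ratio: float, prior_h1: float = 0.5) -> float:
    """Return P(H0 | data) for a point null, via Bayes' theorem.

    likelihood_ratio is L10 = P(data | H1) / P(data | H0) (supplied by the
    caller; how to compute it is outside this sketch).
    prior_h1 is the prior probability that H1 is true; 0.5 means prior odds of 1.
    """
    if likelihood_ratio <= 0:
        raise ValueError("likelihood_ratio must be positive")
    if not 0.0 < prior_h1 < 1.0:
        raise ValueError("prior_h1 must lie strictly between 0 and 1")
    prior_odds_h1 = prior_h1 / (1.0 - prior_h1)
    posterior_odds_h1 = likelihood_ratio * prior_odds_h1
    # P(H0 | data) = 1 / (1 + posterior odds in favour of H1)
    return 1.0 / (1.0 + posterior_odds_h1)


# Illustrative value only: a likelihood ratio of 3 with prior odds of 1.
print(false_positive_risk(3.0))  # 0.25
```

With prior odds of 1 the posterior odds equal the likelihood ratio, so a modest likelihood ratio leaves a substantial probability that the null is true; a smaller prior for H1 pushes that probability higher still.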

David

I concur with your statement that “The vast majority of NHST are done without the benefit of professional statistical advice.” It is our (statisticians’) responsibility to change that by offering advice people seek and methods people will use because of added value (apparent or not).

The title of Ron Wasserstein’s talk in the forthcoming FTC conference is: “Time to Say Goodbye to Statistical Significance” http://www.falltechnicalconference.org/program2019/

Do you think that engineers, chemists or biologists who see this will think statistics is an up to date and attractive discipline and that they should seek collaboration with a statistician? I doubt that. The Japanese call this harakiri.

David:

It’s hard to believe you’re serious in claiming the Bayes Factor is purely deductive:

“The Bayes Factor becomes a likelihood ratio, and so it’s a deductive quantity that avoids the problem of induction.”

Even if we limit things to two point hypotheses so that we don’t go beyond the Likelihood Ratio, you still have to affirm the likelihood function, and that is not a deductive task. Nor is affirming a prior, which you elsewhere use.

Moreover, the problem for likelihood ratios enters: how do they pick up on selection effects (are the two hypotheses predesignated)?

Dear Brian,

Here are some thoughts on your comments.

1. Better not to use “ASA II” as shorthand for “Wallenstein et al. 2009”, as the latter is merely an editorial by three persons, not a consensus document like ASA I. Let’s use “W2009.”

We do need a “common sense” strategy as you suggest, but we should not expect that a good one will emerge from committees, large or small. And ASA is just one of the relevant societies. One can imagine bemused RSS members observing obstreperous colonials at work once again.

In any case, the only useful critiques will be those focused on specific issues, not on entire articles or editorials, and will come from the pens of those who’ve taken the pains to read the historical literature on the specific issues in extenso. The “majority view” of most johnny-come-lately kibitzers would seem irrelevant.

2. It was not the responsibility of “W2009” to summarize or repeat arguments and recommendations long clearly presented in the literature of the past, like those for disallowing the phrase “statistically significant”.

3. A lot of the current problems are semantic ones. Many of us who have supported the longstanding and so far unrefuted arguments against verbal dichotomization of the P-scale and describing results as “statistically significant” (e.g. Hurlbert & Lombardi 2009; Hurlbert, Levine & Utts 2019; Wallenstein, Lazar & Schirm 2019) are not trying to “ban talk of significance testing.”

We would argue however that use of the phrase “significance assessment” is preferable to “significance testing.” The latter now carries too much baggage. For most it will continue to imply the validity of verbally dichotomizing the P-scale. And it also will continue to lead to conflation, on a mass scale, of the very disparate processes of “testing of statistical hypotheses” and “testing of research or scientific hypotheses.” (Again, see H&L2009 on the point).

4. We think the core, simplest and most tractable issue is whether the term “statistically significant” should be disallowed in the reporting of research results and “significance assessments.” The logical case for doing so is laid out clearly in the historical and current literature. Failure to do so has been a major factor inflaming the “statistics wars” for the better part of a century.

Those who would retain the term have yet to make an informed argument for doing so. They have not cited a single published or hypothetical study using the phrase where the phrase leads to clearer and less misleading presentation of the results than would be achieved without it. Q.E.D. !

Stuart Hurlbert

Response to Hurlbert:

I suggested we employ the same term to refer to the document in question to avoid confusion. Haig’s guest post is, after all, a response to my call for guest post reflections on ASA II. He agreed with me. As I observe in earlier comments,

https://errorstatistics.com/2019/06/17/the-2019-asa-guide-to-p-values-and-statistical-significance-dont-say-what-you-dont-mean-some-recommendations/#comment-183814

“My blogpost was the result of my trying to identify what ASA II is, and what is its relation to ASA I. We have seen it is not consistent with ASA I, or at least I have brought out conflicts that require attending to.”

That is why I suggest ASA II be revised in quite minimalist ways.

ASA II suggests it is a follow-up of ASA I, and it merely takes a step that they were very close to taking in ASA I. So I really think the onus is on the authors and on the ASA to clarify the status of ASA II. Because ASA I was so strongly put forward as a list of directions or guidelines that everyone should cite, it is more than a bit plausible to read ASA II as an add-on, one that they nearly already included in ASA I. If this is wrong, a small correction or warning should be made, before we see a slew of lawsuits based on ASA II.

Stuart,

I know of your fine paper on the neo-Fisherian perspective, and of your recent co-authored contribution to the TAS Special Issue. So, it’s good to have your comments on my post. Here is my brief response to the points you make:

1. Mayo has just explained to you why ‘ASA II’ can be considered an acceptable shorthand for the Wasserstein et al. 2019 editorial. Anyway, I make it clear at the outset that I use this abbreviation to refer to this article. I agree with you that the editorial is not a consensus document. In fact, I say this in my third point, noting that the Special Issue represents a “diversity of views held about the nature of tests of significance”. Further, I would not describe the Wasserman et al. piece as “merely an editorial”. I think it is a substantial article, and will almost certainly be the main focus of discussion of matters arising from consideration of the Special Issue.

2. I agree with you that there is much to recommend the strategy of focusing on specific issues, but I do not think that useful critiques can only come from such a focus. I doubt that we can say with assurance that there is one best way to proceed to bring about understanding and change and, for that reason, think that a mixed strategy is advisable. There are many general issues that also deserve our attention. As I see it, a limitation of the 2017 Bethesda Symposium was that it attracted too few people who were well-equipped to deal with the many larger-scale issues that a “post-p<0.05 world” might be expected to address.

3. You say that the phrase, “significance assessment” is preferable to “significance testing”. This seems to suggest that it is the word, “testing”, not “significance”, that is problematic. Further, I’m unclear how the non-technical word, “assessment”, with its multiple meanings, might overcome the “excess baggage” of the former. The latter does have a statistically reputable interpretation, which could be conveyed with better statistics education. The second edition of Aris Spanos’s highly regarded textbook (Probability Theory and Statistical Inference, 2019), scheduled for publication soon, contains a historically informed treatment of hypothesis testing written from the vantage point of the error-statistical perspective.

4. You say that “[t]hose who would retain the term [“significance testing”] [“statistically significant”?] have yet to make an informed argument for doing so.” In my first point, I noted that the error-statistical outlook has the resources to mount a defence for using tests of significance, as well as for retaining the expression “statistically significant”. I’ve read your recent exchange with Mayo on this matter and do not think you have given a compelling reason to doubt that this is so. With respect, I think that your “proof” is incomplete.

Brian:

Our different emphases derive mostly from differences in what we are looking for. You (and most other discussants on this blogsite) are focused on the desirable content or scope of possible meeting symposia, chapters in statistics texts, online discussions, etc. where most of the discussants are statisticians. I and “Coup de grace” were/are asking – and answering — the question of whether there is any simple, useful, specific statistical recommendation or guideline with wide enough support, especially among statisticians, to justify it being added to the “instructions for authors” by editors of scientific journals generally, for education of and consumption by non-statisticians.

On your specific points let me comment as follows:

1) I agree that “W2009” is reasonably characterized as a “substantial article” and will, for a season, be a “main focus of discussion.” All to the good, whether Mayo et al. publish a clarifying “ASA III” or not.

2) Agreed there are many general issues that need our attention, but that does not obviate the need to correct specific logical errors one at a time as they come up. That was a main function of ASA I, but the assignment for the TAS special issue was more complex and vague. Many of us would argue that the verbal dichotomization of the P-scale and use of the term “statistically significant” is just as logically fallacious (and operationally superfluous) as is interpreting high P values as evidence favoring H0 over H1 (in a typical statistical test).

3) “significance assessment” was borrowed from earlier authors by Hurlbert & Lombardi 2009 but we forgot from whom so could not give credit! Clarity should win eventually. So far, “neoFisherian significance assessment” seems to have accumulated no baggage. It will be interesting to see how Spanos’s views have changed over 20 years after his interactions with Mayo and others. The first edition of his book had a few bloopers, e.g. on pp. 690-691, his assertion that high P values, even “P>0.10,” provide “strong support for H0” (i.e. implicitly evidence in favor of H0 over H1). While he did a fine job pointing out the differences between the N-P paradigm and the neoFisherian one, he never cut to the chase and stated his recommendation for the most typical data analysis situations.

4) I think the ball remains in the court of the dichotomizers! The widespread use of “statistically significant” is now known to be a result of confusion in the minds and literature of early statisticians and earlier and current sheep-like users of statistics in many disciplines. Popularity and historical momentum seem insufficient grounds for retaining the phrase or for trying to redefine the term so that it can be retained, like my granddaughter’s “security blanket,” out of pure nostalgia.

Actually, I have answered all your criticisms, most recently in https://arxiv.org/ftp/arxiv/papers/1905/1905.08338.pdf

You say “It begins by dichotomizing all effects into “no effect” or “meaningful effect”, all else in between ignored.”

I don’t think that this is true. I’m merely following Goodman (1999) when he says that “Bayes factors show that P values greatly overstate the evidence against the null hypothesis.” The likelihood ratio in favour of H1 is at its maximum when comparing the best-supported alternative against H0. All other alternative hypotheses are less well supported. Other values are not being ignored. So L_10, as I (and Goodman) calculate it, provides the maximum possible evidence for rejection of H0. The fact that this comes out to only about 3 when you have observed p = 0.05 shows that the evidence against H0 provided by p = 0.05 is much weaker than might (mistakenly) be inferred from the p value.

Also, I am not using the “diagnostic screening model”, because that model implies the p-less-than method of calculating the likelihood ratio. As I have explained in my first response, I believe that what’s needed is the p-equals method, as used by Goodman.

You also say “It then imagines my current hypothesis has been randomly selected from an urn of nulls with a known prevalence of true nulls –a high prevalence is needed to make this criticism of p-values work”.

I don’t think that this is true either, for two reasons. Firstly, I’m explicitly talking about how one should interpret a single p value. It would be misleading to ignore the prior probability even though one usually has no idea what it is. It will differ from one experiment to another but it will exist and to ignore it would be foolish. There is no sense in which I assume that the experiment has been “randomly selected from an urn of nulls with a known prevalence of true nulls”. I merely assume that in the experiment that I’ve done there is some prior probability, to which I can’t assign a value. The fact that this prior may vary widely from one experiment to another means, I think, that it’s impossible to calculate a lifetime false positive rate. That’s why I refer to it as a RISK not a rate.

Secondly, I think that it’s untrue to say that “a high prevalence is needed to make this criticism of p-values work”. I do assume that any prior P(H1) greater than 0.5 can’t be assumed, as to do so would be to say that you were almost sure of the result before you did the experiment. A prior of 0.5 is the nearest one can get to equipoise. The fact that even a prior of 0.5 implies a false positive risk of 27% when you observe p = 0.05 shows, I think, that my argument does not depend on assuming a high prevalence of nulls. It’s true of course that if the prior were smaller than 0.5, the FPR would be still higher than 27%, but 27% is quite big enough to show that there is a real problem with p = 0.05 as evidence.

I was cheered by the fact that when I asked Stephen Senn on Twitter “If you obs p=0.049 and claim there’s an effect, do you agree that there are plausible arguments that your risk of being mistaken is over 20%?” he responded with

“For some purposes I would think a risk of being mistaken of 20% is not bad.”

Three other approaches to the problem also give FPR between 20% and 30% for p = 0.05. See section 7 in https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529622
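As an illustration of how a different calibration can land in the same range, the sketch below uses the Sellke-Bayarri-Berger bound, under which the Bayes factor in favour of H0 is at least -e·p·ln(p) for p < 1/e. Treating this as one of the approaches meant here is an assumption, and the function names are hypothetical; it is a sketch, not a reproduction of the calculations in the linked paper.

```python
import math


def min_bayes_factor_h0(p: float) -> float:
    """Sellke-Bayarri-Berger lower bound on the Bayes factor for H0: -e * p * ln(p)."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)


def fpr_lower_bound(p: float, prior_h1: float = 0.5) -> float:
    """Lower bound on P(H0 | data) implied by the bound above and a prior P(H1)."""
    if not 0.0 < prior_h1 < 1.0:
        raise ValueError("prior_h1 must lie strictly between 0 and 1")
    prior_odds_h0 = (1.0 - prior_h1) / prior_h1
    posterior_odds_h0 = min_bayes_factor_h0(p) * prior_odds_h0
    return posterior_odds_h0 / (1.0 + posterior_odds_h0)


# How the bound moves as the prior for H1 shrinks (priors chosen for illustration).
for prior in (0.5, 0.2, 0.1):
    print(f"P(H1) = {prior}: FPR >= {fpr_lower_bound(0.05, prior):.2f}")
```

For p = 0.05 and a prior of 0.5 this gives a lower bound of roughly 0.29, inside the 20% to 30% range, and the bound rises as the prior probability of H1 falls.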

Of course my proposal is not unique (nothing Bayesian is). But I maintain that it’s plausible and that to give, as well as p value and CI, an estimate of FPR_50 (the FPR for prior odds =1) would be an improvement on current practice.

I can’t go through so much of this again as I’m preparing for a two-week intensive summer seminar with faculty and post docs on Phil Stat. You’ll see in 5.2 of SIST, starting page 332, and in 4.5 (who is exaggerating), p. 260, that this allows assigning a high posterior to the max likely alternative. The frequentist would be using a confidence level of .5 if she entertained such an inference. In short, that gambit winds up vastly exaggerating what you’re allowed to infer! Also see p. 264 of SIST, “whether P-values exaggerate depends on philosophy” (authors include Greenland, Goodman, Senn and others).

It doesn’t matter (for my concern) if it’s p < or = for this problem. You dichotomize into no effect, meaningful effect, and do your computation imagining one infers "meaningful effect" from a single small p-value. In short it is an account based on the most abusive use of tests, turning a continuous quantity into two pigeonholes. The result still exaggerates by giving a fairly high posterior to an inference to which a tester would attach high error probabilities.

On priors, you seem now to have changed from seeking a frequentist prior based on prevalence to what? What is your prior? Subjective belief–you've always said you rejected these. Empirical-frequentist, default or something else? Your view for years was plainly that the prior came from the supposed relative frequency with which meaningful effects exist in a field, or the like. The supposition that a hypothesis "has" one, but I can't tell you what it means, won't wash. Until you do, the discussion goes nowhere.

I am extremely familiar with diagnostic hypothesis testing and medical decisions, having practiced internal and acute medicine and endocrinology for nearly 50 years and still writing a textbook to guide students and young doctors towards proficiency. I hope that this does not put you all off! I have also been an honorary member of a mathematics department for over 20 years focussing on the probability and set theory of diagnosis and medical decision making. However, I hope that you can forgive me for approaching the issues from this perspective.

I have found ‘statistical significance testing’ to be confusing by making a ‘decision’ to ‘reject’ a hypothesis (as opposed to a course of action, which is what I am used to rejecting) and without explicitly assessing a study result’s ‘reliability’ first. By way of analogy there are usually three steps to interpreting diagnostic information: (1) assessing the reliability of the findings, (2) arriving at various diagnostic hypotheses with their associated predictions and (3) making decisions based on those predictions and then taking action.

I would be happier if these 3 steps were also performed when interpreting scientific information. For example, in order to assess the reliability of a study, I would like to see a likelihood distribution displayed, with the 95% (marking the position of a potential null hypothesis with a 2-sided P-value of 0.05), 99%, 99.5% (marking the position of a potential null hypothesis with a 2-sided P-value of 0.005) and 99.9% confidence intervals etc. marked along the X axis together with the position of the actual null hypothesis. The distribution could then be ‘normalized’ by making the area under the curve correspond to 1 in case some find this helpful (nothing being lost by doing so).
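The correspondence between these confidence levels and two-sided P-values can be sketched numerically. The example below assumes a normally distributed estimate with a known standard error, which is a simplifying assumption; the estimate, standard error and function names are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def confidence_intervals(estimate, std_error, levels=(0.95, 0.99, 0.995, 0.999)):
    """Return {level: (lower, upper)} for a normally distributed estimate."""
    intervals = {}
    for level in levels:
        # A two-sided interval leaves (1 - level) / 2 in each tail.
        z = STANDARD_NORMAL.inv_cdf(1.0 - (1.0 - level) / 2.0)
        intervals[level] = (estimate - z * std_error, estimate + z * std_error)
    return intervals


def two_sided_p(estimate, std_error, null_value=0.0):
    """Two-sided P-value for the null hypothesis that the true value is null_value."""
    z = abs(estimate - null_value) / std_error
    return 2.0 * (1.0 - STANDARD_NORMAL.cdf(z))


# Hypothetical study: observed difference 2.0 with standard error 1.0, null at 0.
estimate, std_error, null_value = 2.0, 1.0, 0.0
for level, (lower, upper) in confidence_intervals(estimate, std_error).items():
    covers_null = lower <= null_value <= upper
    print(f"{level:.1%} CI: ({lower:.2f}, {upper:.2f}), covers null: {covers_null}")
print(f"two-sided P = {two_sided_p(estimate, std_error, null_value):.4f}")
```

A null value sitting exactly on the boundary of the 95% interval has a two-sided P-value of 0.05, and one on the boundary of the 99.5% interval has a P-value of 0.005, which is the marking scheme described above.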

By analogy with making a preliminary diagnosis such as ‘systolic hypertension’ (e.g. with a true systolic BP >130mmHg), we could ‘diagnose’ a preliminary ‘significant’ result less extreme than the null hypothesis with a probability of 1-P (provided that all the observations were made in an impeccably consistent prospective way, verifying this with a check-list and applying severe testing to exclude all the possible pitfalls). This interpretation, which involves regarding the likelihood distribution as a probability distribution, would be in line with an assessment of a diagnostic test. The definition of the P value would still stand but it would be INTERPRETED in this way. A Bayesian might also wish to augment this step by combining the distribution based on the data alone with a Bayesian prior distribution, perhaps of the kind proposed by David C.

The next step would be to consider the possible explanations for the ‘significant result’ (in the same way as we would consider the differential diagnostic causes of the preliminary diagnosis of ‘systolic hypertension’). In scientific terms the possible explanation for a ‘significant’ result less extreme than the null hypothesis could be blatant fraud, a poor design causing various biases, and (if these became improbable explanations after considering each one by severe testing) we would consider the possible theoretical explanations for a genuine result. This could be done with verbal reasoning or probabilistic modelling perhaps based on Bayesian priors. The decision about how to take things forward by direct application or doing further research would depend on the probabilities of these outcomes.

I am presenting a poster (based on a draft paper) describing this approach at a conference tomorrow and would be grateful for your views!

Huw:

There are some phrases you use that don’t seem right (to me) and you might want to clarify, notably “a preliminary ‘significant’ result less extreme than the null hypothesis”. What does that mean?

“we could ‘diagnose’ a preliminary ‘significant’ result less extreme than the null hypothesis with a probability of 1-P (provided that all the observations were made in an impeccably consistent prospective way, verifying this with a check-list and applying severe testing to exclude all the possible pitfalls).”

Again,

“ In scientific terms the possible explanation for a ‘significant’ result less extreme than the null hypothesis could be blatant fraud,”

Thank you for pointing out some ambiguous phrases. I will first address “A preliminary ‘significant’ result less extreme than the null hypothesis”.

The null hypothesis would be that the average effects of treatment and control would be identical IF (the IF being the ‘hypothetical’) a very large or infinite number of observations were made.

The possible true results (eg average differences) less extreme than the null hypothesis would be the range of possible long term averages on one side of the null hypothesis in the same direction as the observed result.

The term ‘significant’ means that any of these long term averages would be of scientific or clinically predictive interest IF spurious or fraudulent reasons for them could be shown to be improbable. We might then consider hypotheses about the underlying mechanisms.

‘Blatant fraud’ could happen at two points. It could be a premeditated act to give a false impression that data had been recorded prospectively and consistently as planned (without hacking etc) so that it could be modelled with the maths of random sampling. This could also happen from ignorance and poor scientific training.

Blatant fraud could also happen by recruiting patients for a study or treating them in a deliberately biased way but collecting the data and analysing their confidence intervals and P-values impeccably. The explanation for the treatment being better than control would not be a real difference but a fraudulently engineered one.

I notice that Wasserstein has been referred to in one comment as Wallenstein, and in another as Wasserman. Yet another reason to keep to “ASA II”.

Here’s a new development from the NEJM.

NEJM Manuscript & Statistical Guidelines 2019

and an associated Editorial: Harrington et al., “New Guidelines for Statistical Reporting in the Journal” (NEJM, July 18, 2019).