On Some Self-Defeating Aspects of the ASA’s (2019) Recommendations on Statistical Significance Tests (ii)


“Before we stood on the edge of the precipice, now we have taken a great step forward”

 

What’s self-defeating about pursuing statistical reforms in the manner taken by the American Statistical Association (ASA) in 2019? In case you’re not up on the latest in the significance testing wars: the 2016 ASA Statement on P-Values and Statistical Significance, ASA I, arguably, was a reasonably consensual statement on the need to avoid some well-known abuses of P-values–notably, if you compute P-values while ignoring selective reporting, multiple testing, or stopping when the data look good, the computed P-value will be invalid (Principle 4, ASA I). But then Ron Wasserstein, executive director of the ASA, and his co-editors decided they weren’t happy with their own 2016 statement because it “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” altogether. In their new statement–ASA II(note)–they announced: “We take that step here….Statistically significant–don’t say it and don’t use it”.
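To see why Principle 4 matters, here is a minimal simulation of my own (not from ASA I; the setup and numbers are purely illustrative) in which twenty tested effects are all truly null, yet reporting only the smallest of the twenty P-values yields nominal “significance” far more often than 5% of the time:

```python
# Illustrative sketch (mine, not from ASA I): twenty one-sided z-tests,
# all nulls true, but only the best-looking P-value gets reported.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials, n_tests, n = 10_000, 20, 30

hits = 0
for _ in range(n_trials):
    # Twenty independent "studies"; every true effect is exactly zero.
    data = rng.normal(loc=0.0, scale=1.0, size=(n_tests, n))
    z = data.mean(axis=1) * np.sqrt(n)   # one-sample z statistics (sigma = 1 known)
    p = stats.norm.sf(z)                 # one-sided P-values, uniform under the null
    if p.min() <= 0.05:                  # selective reporting: keep only the minimum
        hits += 1

print(f"P(min p <= 0.05 across {n_tests} null tests) ~ {hits / n_trials:.2f}")
# Prints roughly 0.64 (analytically, 1 - 0.95**20): the reported number is no
# longer a valid P-value once the selection is ignored.
```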

Why do I say it is a mis-take to have taken the supposed next “great step forward”? Why do I count it as unsuccessful as a piece of statistical science policy? In what ways does it make the situation worse? Let me count the ways. The first is in this post. Others will come in following posts, until I become too disconsolate to continue.[i]

In this exercise, I imagine I am someone who eagerly wants the recommendations in ASA II(note) to be accepted by authors, journals, agencies, and the general public. In essence the recommendations are: you may report the P-value associated with a test statistic d–a measure of distance or incompatibility between data and a reference hypothesis–but don’t say that what you’re measuring are the attained statistical significance levels associated with d (even though that is the mathematical definition of what is being measured). Do not predesignate a P-value to be used as a threshold for inferring evidence of a discrepancy or incompatibility–or if you do, never use this threshold in interpreting data.
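For concreteness, here is what “the attained statistical significance level associated with d” means in the simplest textbook case, a one-sided z-test (a hypothetical sketch; the data and numbers are my own, not from ASA II(note)):

```python
# Hypothetical one-sided z-test of H0: mu = 0 vs. H1: mu > 0, sigma = 1 known.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(loc=0.3, scale=1.0, size=25)   # made-up sample of 25 observations

d_obs = x.mean() * np.sqrt(len(x))            # test statistic d: standardized distance from H0
p = stats.norm.sf(d_obs)                      # P(d >= d_obs under H0): the attained significance level
print(f"d = {d_obs:.2f}, p = {p:.3f}")
# Whatever we are told to call it, p just IS the attained significance level
# of d -- that is its mathematical definition.
```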

“Whether a p-value passes any arbitrary threshold should not be considered at all” in interpreting data. (ASA II(note))

This holds, even if you also supply an assessment of indicated population effect size or discrepancy (via confidence intervals, equivalence tests, severity assessments). The same goes for other thresholds based on confidence intervals or Bayes factors.
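To illustrate the parenthetical point: a severity assessment (in the sense of Mayo 2018) is computed from the very same statistic d, and it reports indicated magnitudes rather than a bare threshold. A rough sketch under the same one-sided z-test assumptions as above (the function and numbers are mine, for illustration only):

```python
import numpy as np
from scipy import stats

def severity_mu_greater(xbar, mu1, sigma, n):
    """SEV(mu > mu1): probability of a result less extreme than the observed
    xbar, were mu only mu1 (one-sided z-test, sigma known). A sketch in the
    spirit of Mayo (2018); the notation here is my own."""
    return stats.norm.cdf((xbar - mu1) * np.sqrt(n) / sigma)

# Made-up numbers: observed mean 0.4, sigma = 1, n = 25.
for mu1 in (0.0, 0.2, 0.4):
    print(f"SEV(mu > {mu1}) = {severity_mu_greater(0.4, mu1, 1.0, 25):.3f}")
# SEV(mu > 0.0) ~ 0.977: well warranted; SEV(mu > 0.4) = 0.5: not warranted.
# The point at issue is whether such assessments may be paired with thresholds,
# not whether they can be computed.
```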

I imagine myself a member of the team setting out the recommendations for ASA II(note), weighing whether it’s a good idea. We in this leadership group know there’s serious disagreement about our recommendations in ASA II(note), and that ASA II(note) could not by any stretch be considered a consensus statement. Indeed, even among the more than 40 papers explicitly invited to discuss “a world beyond P < 0.05”, we (unfortunately) wound up with proposals in radical disagreement. We [ASA II(note) authors] observe: “At times in this editorial and the papers you’ll hear deep dissonance, the echoes of ‘statistics wars’ still simmering today (Mayo 2018).”

(Aside: Hey, they are citing my book!)

So we agree there is disagreement. We also agree that a large part of the blame for lack of replication in many fields may be traced to bad behavior encouraged by the reward structure: incentives to publish surprising and novel studies, coupled with an overly flexible methodology, where many choice points in the “forking paths” (Gelman and Loken 2014) between data and hypotheses open the door to “questionable research practices” (QRPs). Call this the flexibility, rewards, and bias (F, R & B) hypothesis. On this hypothesis, the pressure to publish and to be accepted is so great as to seduce even researchers who are well aware of the pitfalls into capitalizing on selection biases (even if only subliminally).

As a member of the team, I imagine reasoning as follows:

Either the recommendations in ASA II(note) will be followed or they won’t. If the latter, then it cannot be considered successful. Now suppose the former, that people do take it up to a significant extent. The F, R & B hypothesis predicts that the imprimatur of the ASA will encourage researchers to adopt, or at least act in accordance with, the ASA II(note) recommendations.[ii] The trouble is that there will be no grounds for thinking that any apparent conversion was based on good reasons; or, at any rate, we will be unable to distinguish following the ASA II(note) stipulations on grounds of evidence from following them because the ASA said so. Therefore, even in the former situation, where the new stipulations are taken up to a significant degree, with lots of apparent converts, ASA II(note) could not count as a success. Therefore, in either case, what had seemed to us a great step forward is unsuccessful. So we shouldn’t put it forward.

“Before we were with our backs against the wall, now we have done a 180 degree turn”

A further worry occurs to me in my imaginary weighing of whether our ASA team should go ahead with publishing ASA II(note). It is this: many of the apparent converts to ASA II(note) might well have come to accept its stipulations for good reasons, after carrying out a reasoned comparison of statistical significance tests with leading alternative methods as regards their intended task (distinguishing real effects from random or spurious ones)–if the ASA had only seen its role as facilitating the debate between alternative methods, and as offering a forum for airing contrasting arguments held by ASA members. By marching ahead to urge journals, authors, and agencies to comply with ASA II(note), we will never know.

Not only will we not know how much of any observed compliance is due to finding the stipulations warranted, as opposed to its merely confirming the truth of the F, R & B hypothesis–not to mention people’s fear of being on the wrong side of the ASA’s preferences. It’s worse. The human tendency to instantiate the F, R & B hypothesis will be strengthened. Why? Because even in the face of acknowledged professional disagreement of a fairly radical sort, and even as we write “the ideas in this editorial are… open to debate” (ASA II(note)), we are recommending our position be accepted without actually having that debate. In asking for compliance, we are saying, in effect, “we have been able to see it is for the better, even though we recognize there is no professional agreement on our recommendations, and even major opposition”. John Ioannidis, no stranger to criticizing statistical significance tests, wrote this note after the publication of ASA II(note):

Many fields of investigation … have major gaps in the ways they conduct, analyze, and report studies and lack protection from bias. Instead of trying to fix what is lacking and set better and clearer rules, one reaction is to overturn the tables and abolish any gatekeeping rules (such as removing the term statistical significance). However, potential for falsification is a prerequisite for science. Fields that obstinately resist refutation can hide behind the abolition of statistical significance but risk becoming self-ostracized from the remit of science. (Ioannidis 2019)

Therefore, to conclude with my imaginary scenario, we might imagine the ASA team recognizes that putting forward ASA II(note) (in March 2019) is necessarily going to be unsuccessful and self-defeating, extolling the very behavior we supposedly want to eradicate. So we don’t do it. That imaginary situation, unfortunately, is not the real one we find ourselves in.

Making progress, without bad faith, in the real world needn’t be ruled out entirely. There are those, after all, who never heard of ASA II(note), and who do not publish in journals that require obeisance to it. It’s even possible that the necessary debate and comparison of alternative tools for the job could take place after the fact. That would be welcome. None of this, however, would diminish this first self-defeating aspect of ASA II(note).

My follow-up post is now up: “The ASA’s P-value Project: Why it’s Doing More Harm than Good”.

[i] See also my post of June 17, 2019, where I give specific suggestions for why certain principles in ASA II need to be amended to avoid being in tension with ASA I.

[ii] “Imprimatur” means “let it be printed” in Latin. I am careful about the context here: ASA II(note) is not a consensus document, as I make very clear; in fact, that is a key premise of my argument. But the statement that is described as (largely) consensual (ASA I) “stopped just short” of the 2019 editorial. When ASA II(note) first appeared, I asked Wasserstein about the relationship between the two documents; that was the topic of my June 17 post linked in [i]. It was never made clear. It’s blurred. Is it somewhere in the document and I missed it? Increasingly, now that it’s been out long enough for people to start citing it, it is described as the latest ASA recommendations. (They are still just recommendations.) If the ASA wants to clearly distinguish the 2019 from the 2016 statement, this is the time for the authors to do it. (I only consider, as part of ASA II(note), the general recommendations it gives, not any of the individual papers in the special issue.)


Blog posts on ASA II(note):

  • June 17, 2019: “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)
  • July 12, 2019: B. Haig: The ASA’s 2019 update on P-values and significance (ASA II(note)) (Guest Post)
  • July 19, 2019: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)
  • September 19, 2019: (Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access). The article by Hardwicke and Ioannidis (2019), and the editorials by Gelman and by me are linked on this post.

On ASA I:

  • Link to my published comment on ASA I, “Don’t Throw Out the Error Control Baby With the Bad Statistics Bathwater” is here.

REFERENCES:

Gelman, A. and Loken, E. (2014). “The Statistical Crisis in Science”. American Scientist 102(6): 460–465. (pdf)

Ioannidis, J. (2019). “The Importance of Predefined Rules and Prespecified Statistical Analyses: Do Not Abandon Significance”. JAMA 321(21): 2067–2068. (pdf)

Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST). Cambridge: Cambridge University Press.

Mayo, D. G. (2019). “P-value Thresholds: Forfeit at Your Peril”. European Journal of Clinical Investigation 49: e13170. doi:10.1111/eci.13170 (pdf)

Wasserstein, R., Schirm, A. and Lazar, N. (2019). “Moving to a World Beyond ‘p < 0.05’” (Editorial). The American Statistician 73(S1): 1–19. (online paper)(pdf)



14 thoughts on “On Some Self-Defeating Aspects of the ASA’s (2019) Recommendations on Statistical Significance Tests (ii)”

  1. I think ASA jumped the statistical shark with ASA II.

    Justin

  2. Splendid! We should abolish all committees. Let the people *think*!

    • Richard: Great to hear from you. Of course it wasn’t a call for abolishing committees–not that I’m fond of them–but rather against professional associations adopting one position on an issue where there’s strong disagreement, at least not without taking seriously the arguments on different sides. They should hold forums for debate. Positions the association doesn’t much like shouldn’t be described in straw-man terms, but as honestly and generously as possible; else the debate falls into fallacy.


  3. Jerry Ravetz

    As a newcomer to this debate, could I try out an idea? What we find is that a standard computation is subject to serious errors and distortions in its application and interpretation. These qualitative aspects of formal, quantitative procedures have now become dominant. They are well known and have been listed and analysed. Suppose that they could be coded, and a notation adopted in which their presence or absence would be expressed. Statements concerning the core calculations would, as a standard, be enriched with these qualitative aspects of the content. The concept of such a scheme is already available in the NUSAP notation, developed by Silvio Funtowicz and myself. The book, Uncertainty and Quality in Science for Policy, is available for private use on the website of Andrea Saltelli.

    • I think that book is not on his website, but it can be found elsewhere.

    • rkenett

      Jerry – glad to hear from you here. Your work with Funtowicz and Andrea Saltelli is indeed very important. My slowly growing pessimistic view is described in
      https://www.linkedin.com/pulse/pragmatic-view-role-statistics-statisticians-modern-data-kenett/
      see also
      https://www.linkedin.com/pulse/statistics-crossroad-generating-information-quality-ron-s-kenett/

      • I read your view, Ron Kenett. My experience after getting on for nearly 50 years in *statistics* is that there is nothing new under the sun. Every ten years or so there have been enough technological advances of various kinds that a whole new crew of folk is doing statistics, but is not trained in that area, and finds it suspect. So they invent new terminology and new slogans. Of course: new sciences bring new challenges and new opportunities. New computational means bring new challenges and new opportunities. Sooner or later, the newcomers have had to pick up the old knowledge (the small data knowledge, the “inference” knowledge, the “decision theoretic” point of view, the Bayesian point of view, and so on and so forth). I think that all the new waves will come and go (“there is nothing new under the sun, everything is emptiness and chasing after the wind”). Both old and round wheels will be reinvented by bright dynamic leadership types, given new names, but thank heavens, in the long run, the insights which we gained long ago into how to learn from data and how to quantify uncertainty will persist. I think it is important to realise that a statistical paradigm is just a “model”. All models are wrong, some are useful. The most interesting insights gained from a set of data are those to be gained when different paradigms give *different* answers. If every way you analyse the data gives you the same answer, that is comforting, but perhaps you are missing something very important…

        • Richard: This is very deep. Both “old and round wheels will be reinvented”. I like it.

          I agree as well that: “thank heavens, in the long run, the insights which we gained long ago into how to learn from data and how to quantify uncertainty will persist”––assuming they are not killed off too badly by committees, crusades and new kids on the block.

        • rkenett

          Richard – yes, I know this argument. I do think, however, that besides bending down and letting the new wave pass over your head, we should “think” and adapt. To your comments I have two thoughts:
          1. The focus should be on information, not “data” as you write. Our business, as statisticians, is to generate information. Given the changes in technology and data availability, can we become better at this? I think so. See my blog on LinkedIn, triggered by a request from Mayo: https://www.linkedin.com/pulse/statistics-crossroad-generating-information-quality-ron-s-kenett/
          2. A visionary article is Tukey’s 1962 paper: https://projecteuclid.org/euclid.aoms/1177704711.
          It sketches the future and maps some new directions for statistics to take. While I was in Madison (in the late 1970s) I had several discussions about this with George Box who, as an engineer, did not like the fuzziness of EDA and the robust methods advocated by Tukey. Box wanted to build models; Tukey wanted to analyse data (Box of course also analyzed data, and the ozone room in the basement, with charts on the walls, looked like a war room). The big data and ML evolution certainly went in the direction Tukey envisaged. The sterile approach to designed experiments in BH2 does not apply anymore. When you are approached by engineers and scientists, they come with Excel spreadsheets in their pockets. The discussion starts with “show me the data”, not “let’s discuss how to design an experiment”, which now comes as a second step.
          My view is that statistics should widen its scope and adopt a life-cycle view. I wrote about this in several papers, and it is the first chapter in https://www.amazon.com/gp/product/1119570700/ref=dbs_a_def_rwt_bibl_vppi_i0

  4. Pingback: Posts of Christmas Past (1): 13 howlers of significance tests (and how to avoid them) | Error Statistics Philosophy

  5. Pingback: Bad Statistics is Their Product: Fighting Fire With Fire | Error Statistics Philosophy

