5-year Review: The ASA’s P-value Project: Why it’s Doing More Harm than Good (cont from 11/4/19)


I continue my selective 5-year review of some of the posts revolving around the statistical significance test controversy from 2019. This post was first published on the blog on November 14, 2019. I feared then that many of the howlers about statistical significance tests would be further etched in granite after the ASA’s P-value project, and in many quarters this is, unfortunately, true. One howler I’ve noticed quite a lot is the (false) supposition that negative results are uninformative. Some fields, notably psychology, keep to a version of simple Fisherian tests, ignoring Neyman-Pearson (N-P) tests (never minding that Jacob Cohen was a psychologist who gave us “power analysis”). (See note [1]) For N-P, “it is immaterial which of the two alternatives…is labelled the hypothesis tested” (Neyman 1950, 259). Failing to find evidence of a genuine effect, coupled with a test’s having high capability to detect meaningful effects, warrants inferring the absence of meaningful effects. Even with the simple Fisherian test, failing to reject H0 is informative. Null results figure importantly throughout science, as when the Michelson-Morley null results told against the ether, and in directing attention away from unproductive theory development.
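
To fix ideas, here is a minimal sketch (in Python, with made-up numbers for a one-sided Normal test of H0: μ ≤ 0 with known σ; the specific figures are illustrative, not from any of the posts) of how a negative result, together with high power to detect a meaningful discrepancy, licenses inferring the absence of that discrepancy:

```python
# A sketch with made-up numbers: one-sided Normal test of H0: mu <= 0 vs H1: mu > 0,
# known sigma. A negative result plus high power to detect delta warrants "mu < delta".
from scipy.stats import norm

alpha, sigma, n = 0.05, 1.0, 100
se = sigma / n ** 0.5                        # standard error of the sample mean
cutoff = norm.ppf(1 - alpha) * se            # reject H0 when x_bar exceeds ~0.164

def power(delta):
    """P(reject H0; mu = delta): the test's capability to detect an effect of size delta."""
    return 1 - norm.cdf((cutoff - delta) / se)

x_bar = 0.10                                 # observed mean: does not reach the cutoff
print("reject H0?", x_bar > cutoff)          # False -- a negative result
print("power to detect 0.3:", round(power(0.3), 3))   # ~0.912
# High power against delta = 0.3, plus failure to reject, grounds inferring mu < 0.3;
# low power against delta = 0.1 (~0.26) would not license inferring mu < 0.1.
```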

Please share your comments on this blogpost.

*********************************

Everything is impeach and remove these days! Should that hold also for the concept of statistical significance and P-value thresholds? There’s an active campaign that says yes, but I aver it is doing more harm than good. In my last post, I said I would count the ways it is detrimental until I became “too disconsolate to continue”. There I showed why the new movement, launched by the Executive Director of the ASA (American Statistical Association), Ronald Wasserstein (in what I dub ASA II(note)), is self-defeating: it instantiates and encourages the human-all-too-human tendency to exploit researcher flexibility, rewards, and openings for bias in research (F, R & B Hypothesis). That was reason #1. Just reviewing it already fills me with such dismay that I fear I will become too disconsolate to continue before even getting to reason #2. So let me just quickly jot down reasons #2, 3, 4, and 5 (without full arguments) before I expire.

[I thought that with my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) I had said pretty much all I cared to say on this topic (and by and large, this is true), but almost as soon as it appeared in print, just around a year ago, things got very strange.]

But wait. Someone might object that I’m the one doing more harm than good by linking the ASA (The American Statistical Association) to Wasserstein’s campaign to get publishers, journalists, authors and the general public to buy into the recommendations of ASA II(note). “Shhhhh!” some counsel, “don’t give it more attention; we want people to look away”. Nothing to see? I don’t think so. I will take up this point in PART II of this post, as soon as I sketch my list of reasons #2-5.

Before starting, let me remind readers that what I abbreviate as ASA II(note) only refers to those portions of the 2019 editorial by Wasserstein, Schirm, and Lazar that allude to their general recommendations, not their summaries of contributed papers in the issue of TAS.

PART I

2 Decriminalize theft to end robbery. The key arguments for impeaching and removing statistical significance levels and P-value thresholds commit fallacies of the “cut off your nose to spite your face” variety. For example, we should ban P-value thresholds because they cause biased selection and data dredging. Discard P-value thresholds and P-hacking disappears! Or so it is argued. Faced with unwelcome nonsignificant results, eager researchers are still led to massage, spin, and data dredge–only now it is much harder to hold them directly accountable. For the argument, see my “P-value Thresholds: Forfeit at Your Peril” (2019).

3 Straw men and women fallacies. ASA I and II(note) do more harm than good by presenting oversimple caricatures of the tests. Even ASA I excludes a consideration of alternatives, error probabilities and power[1]. At the same time, it will contrast these threadbare “nil null” hypothesis tests with confidence intervals (CIs)–never minding that the latter employ alternatives. No wonder CIs look better, but such a comparison is unfair. (Neyman developed confidence intervals as inversions of tests at the same time he was developing hypothesis tests with alternatives, in 1930. Using only significance tests, you could recover the lower (and upper) 1-α CI bounds if you wanted, by asking for the hypotheses that the data are statistically significantly greater (smaller) than, at level α, using the usual 2-sided computation.)
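
To see the duality at work, here is a minimal sketch (Python, with made-up numbers for a Normal mean with known σ; none of it is from ASA I or II): the 95% CI bounds fall out of nothing but significance tests, by solving for the values μ0 that the data are just statistically significantly greater (smaller) than at one-sided level α/2, i.e. the usual 2-sided computation.

```python
# A sketch with made-up numbers: Normal mean, known sigma. The 95% CI bounds are
# recovered purely from significance tests, by inverting the one-sided p-value
# functions at level alpha/2 (the usual 2-sided computation).
from scipy.optimize import brentq
from scipy.stats import norm

alpha, sigma, n, x_bar = 0.05, 1.0, 25, 0.5
se = sigma / n ** 0.5

def p_greater(mu0):
    """One-sided p-value for 'the data are significantly greater than mu0'."""
    return 1 - norm.cdf((x_bar - mu0) / se)

def p_smaller(mu0):
    """One-sided p-value for 'the data are significantly smaller than mu0'."""
    return norm.cdf((x_bar - mu0) / se)

lower = brentq(lambda mu0: p_greater(mu0) - alpha / 2, x_bar - 5 * se, x_bar)
upper = brentq(lambda mu0: p_smaller(mu0) - alpha / 2, x_bar, x_bar + 5 * se)
print(round(lower, 3), round(upper, 3))                          # 0.108, 0.892
print(round(x_bar - 1.96 * se, 3), round(x_bar + 1.96 * se, 3))  # the textbook 95% CI
```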

In ASA II(note), we learn that “no p-value can reveal the …presence…of an association or effect” (at odds with principle 1 of ASA I). That could be true only in the sense that no formal statistical quantity alone could reveal the presence of an association. But in a realistic setting, small p-values surely do reveal the presence of effects. Yes, there are assumptions, but significance tests are prime tools to probe them. We hear of “the seductive certainty falsely promised by statistical significance”, and are told that “a declaration of statistical significance is the antithesis of thoughtfulness”. (How an account that never issues an inference without an associated error probability can be promising certainty is unexplained. As for the second allegation, thresholds are rendered meaningful by choosing them to reflect background information and a host of theoretical and epistemic considerations.) The requirement in philosophy of a reasonably generous interpretation of what you’re criticizing isn’t a call for being kind or gentle; it’s that otherwise your criticism is guilty of straw men (and women) fallacies, and thus fails.

4 Alternatives to significance testing are given a pass. You will not find any appraisal of the alternative methods recommended to replace significance tests for their intended tasks. Although many of the “alternative measures of evidence” listed in ASA I and II(note)–likelihood ratios, Bayes factors (subjective, default, empirical), posterior predictive values (in diagnostic screening)–have been critically evaluated by leading statisticians, no word of criticism is heard here. Here’s an exercise: run down the list of 6 “principles” of ASA I, applying them to any of the alternative measures of evidence on offer. Take, for example, Bayes factors. I claim that they do worse than do significance tests, even without modifications.[2]
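
For a taste of the exercise, here is a minimal sketch of my own (Python, made-up numbers, a textbook Normal model with a point null and a N(0, τ²) prior on the alternative; nothing here is from ASA I or II): the very same data, with a two-sided p-value of about 0.01, yield Bayes factors that favor H1 or favor H0 depending on the prior scale τ, a choice on which the editorial’s principles offer no guidance.

```python
# A sketch with made-up numbers: Normal data, known sigma; H0: mu = 0 vs
# H1: mu ~ N(0, tau^2). BF01 is the ratio of the likelihood under H0 to the
# marginal likelihood under H1 (x_bar ~ N(0, se^2 + tau^2) under H1).
from scipy.stats import norm

sigma, n, x_bar = 1.0, 100, 0.25        # z = 2.5, two-sided p ~ 0.012
se = sigma / n ** 0.5

def bf01(tau):
    """Bayes factor in favor of H0 over H1 with prior scale tau."""
    m0 = norm.pdf(x_bar, loc=0.0, scale=se)                         # density under H0
    m1 = norm.pdf(x_bar, loc=0.0, scale=(se**2 + tau**2) ** 0.5)    # marginal under H1
    return m0 / m1

print("two-sided p:", round(2 * (1 - norm.cdf(abs(x_bar) / se)), 3))
for tau in (0.1, 1.0, 10.0):
    print(f"tau = {tau}: BF01 = {bf01(tau):.2f}")
# Roughly 0.30, 0.46, 4.4: the same data "favor" H1 or H0 depending on the prior scale.
```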

5 Assumes probabilism. Any fair (non-question-begging) comparison of statistical methods should recognize the different roles probability may play in inference. The role of probability in inference by way of statistical falsification is quite different from using probability to quantify degrees of confirmation, support, plausibility or belief in a statistical hypothesis or model–or comparative measures of these. I abbreviate the former as error statistical methods, the latter as variants on probabilism. Use whatever terms you like. Statistical significance tests are part of a panoply of methods where probability arises to assess and control misleading interpretations of data.

Error probabilities quantify the capabilities of a method to detect the ways a claim (hypothesis, model or other) may be false, or specifiably flawed. The basic principle of testing is minimalist: there is evidence for a claim only to the extent it has been subjected to, and passes, a test that had at least a reasonable probability of having discerned how the claim may be false. (For a more detailed exposition, see Mayo 2018, or excerpts from SIST on this blog).
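
Here is a minimal sketch (Python, made-up numbers, the familiar one-sided Normal test T+ with known σ; the figures are illustrative only) of how that minimal principle gets quantified in a post-data severity assessment once a statistically significant result is in hand:

```python
# A sketch with made-up numbers: test T+ of H0: mu <= 0 vs H1: mu > 0, known sigma.
# After a significant result, SEV(mu > mu1) is the probability of a less impressive
# result were mu no greater than mu1, i.e. P(X_bar <= x_bar; mu = mu1).
from scipy.stats import norm

sigma, n, x_bar = 1.0, 100, 0.2         # z = 2.0, one-sided p ~ 0.023: H0 is rejected
se = sigma / n ** 0.5

def severity(mu1):
    """Severity with which the claim 'mu > mu1' passes, given the observed x_bar."""
    return norm.cdf((x_bar - mu1) / se)

for mu1 in (0.0, 0.1, 0.2, 0.3):
    print(f"SEV(mu > {mu1}) = {severity(mu1):.3f}")
# ~0.977, 0.841, 0.500, 0.159: "mu > 0" passes with high severity, "mu > 0.3" does not.
```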

Reason #5, then, is that “measures of evidence” in both ASA I and II(note) beg this key question (about the role of probability in statistical inference) in favor of probabilisms–usually comparative as with Bayes factors. If the recommendation in ASA II(note) to remove statistical thresholds is taken seriously, there are no tests and no statistical falsification. Recall what Ioannidis said in objecting to “don’t say significance”, cited in my last post:

Potential for falsification is a prerequisite for science. Fields that obstinately resist refutation can hide behind the abolition of statistical significance but risk becoming self-ostracized from the remit of science. (Ioannidis 2019)

“Self-ostracizing” is a great term. ASA should ostracize self-ostracizing. This takes me back to the question I promised to come back to: is it a mistake to see the ASA as entangled in the campaign to ban use of the “S-word”, and kill P-value thresholds?

PART II (to read this part, please go to the original post)

[1] “To keep the statement reasonably simple, we did not address alternative hypotheses, error types, or power (among other things)” (ASA I)

[2] The ASA 2016 Guide’s Six Principles

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency. P-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain p-values (typically those passing a significance threshold) renders the reported p-values essentially uninterpretable.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

[3] I am grateful to Ron Wasserstein for inviting me to be a “philosophical observer” at the 2015 meeting to draft ASA I.

2019 Blog posts on ASA II(note):

  • June 17, 2019: “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)
  • July 12, 2019: B. Haig: The ASA’s 2019 update on P-values and significance (ASA II(note))(Guest Post)
  • July 19, 2019: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)
  • September 19, 2019: (Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access). The article by Hardwicke and Ioannidis (2019), and the editorials by Gelman and by me are linked on this post.
  • November 4, 2019: On some Self-defeating aspects of the ASA’s 2019 recommendations of statistical significance tests

On ASA I:

  • Link to my published comment on ASA I, “Don’t Throw Out the Error Control Baby With the Bad Statistics Bathwater” is here.

REFERENCES:

Ioannidis, J. (2019). The importance of predefined rules and prespecified statistical analyses: do not abandon significance. JAMA 321: 2067-2068. (pdf)

Ionides, E., Giessing, A., Ritov, Y.  & Page, S. (2017). Response to the ASA’s Statement on p-Values: Context, Process, and Purpose, The American Statistician, 71:1, 88-89. (pdf)

Mayo, D. G. (2018), Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST), Cambridge: Cambridge University Press.

Mayo, D. G. (2019), P‐value thresholds: Forfeit at your peril. Eur J Clin Invest, 49: e13170. (pdf) doi:10.1111/eci.13170

Neyman, J. (1950), First Course in Probability and Statistics, NY: Henry Holt.

Wasserstein, R. & Lazar, N. (2016), “The ASA’s Statement on p-Values: Context, Process, and Purpose”, The American Statistician, 70(2): 129-133.

Wasserstein, R., Schirm, A. and Lazar, N. (2019), “Moving to a World Beyond ‘p < 0.05’” (Editorial), The American Statistician, 73(S1): 1-19. (ASA II(note))(pdf)



One thought on “5-year Review: The ASA’s P-value Project: Why it’s Doing More Harm than Good (cont from 11/4/19)”

  1. Stan Young

Deborah: I’m having a tussle with environmental epidemiologists. Below is a comment to my coauthor. Stan

Design of Experiments. In industry and agriculture, formal methods, design of experiments (DOE), are used to understand the world empirically. The start of the process is a triple: N, SD and delta, i.e., the needed sample size, the variability of the process (standard deviation, SD), and the difference to be detected (delta). In addition to the triple, there are formal arrangements of treatments (conditions) to form a composite set of treatments, the design, that can be subjected to analysis. A simple design is two treatments, A and B, which, along with SD and delta, can be used to determine N, the number of each treatment needed to have good power to detect a difference of delta (or greater). There are many more complicated designs. It is crucial to have/state delta, as otherwise there is no bound to the experiment. For example, if delta is any value greater than zero, then N goes to infinity and tradeoffs are not possible.

    Stan and Pat Young 919 782 2759
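
[A minimal sketch of the (N, SD, delta) triple Stan describes (mine, in Python, with illustrative numbers, not his): the standard Normal-approximation formula for the per-group sample size in a two-sided comparison of treatments A and B, which also shows why N blows up as delta is allowed to shrink toward zero.]

```python
# A sketch with illustrative numbers: per-group N to detect a difference of delta
# (or greater) between treatments A and B, given SD, using the Normal-approximation
# formula for a two-sided, two-sample comparison at level alpha with the stated power.
from math import ceil
from scipy.stats import norm

def n_per_group(sd, delta, alpha=0.05, power=0.80):
    """Per-group sample size from the triple (SD, delta) plus alpha and power."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return ceil(2 * ((z_a + z_b) * sd / delta) ** 2)

print(n_per_group(sd=1.0, delta=0.5))   # 63 per group
print(n_per_group(sd=1.0, delta=0.1))   # 1570: as delta shrinks toward 0, N explodes
```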

