Posts of Christmas Past (1): 13 howlers of significance tests (and how to avoid them)


I’m reblogging a post from Christmas past, from exactly 7 years ago. Guess what I gave as howler number 1 (of 13) among the well-worn criticisms of statistical significance tests haunting us back in 2012, all of which are put to rest in Mayo and Spanos 2011? Yes, it’s the frightening allegation that statistical significance tests forbid using any background knowledge! The researcher is imagined to start with a “blank slate” in each inquiry (no memories of fallacies past), and then unthinkingly apply a purely formal, automatic, accept-reject machine. What’s newly frightening (in 2019) is the credulity with which this apparition is now being met (by some). I make some new remarks below the post from Christmas past:

2013 is right around the corner, and here are 13 well-known criticisms of statistical significance tests, and how they are addressed within the error statistical philosophy, as discussed in Mayo, D. G. and Spanos, A. (2011) “Error Statistics”.

  • (#1) Error statistical tools forbid using any background knowledge [1].
  • (#2) All statistically significant results are treated the same.
  • (#3) The p-value does not tell us how large a discrepancy is found.
  • (#4) With large enough sample size even a trivially small discrepancy from the null can be detected.
  • (#5) Whether there is a statistically significant difference from the null depends on which is the null and which is the alternative.
  • (#6) Statistically insignificant results are taken as evidence that the null hypothesis is true.
  • (#7) Error probabilities are misinterpreted as posterior probabilities.
  • (#8) Error statistical tests are justified only in cases where there is a very long (if not infinite) series of repetitions of the same experiment.
  • (#9) Specifying statistical tests is too arbitrary.
  • (#10) We should be doing confidence interval estimation rather than significance tests.
  • (#11) Error statistical methods take into account the intentions of the scientists analyzing the data.
  • (#12) All models are false anyway.
  • (#13) Testing assumptions involves illicit data-mining.

You can read how we avoid them in the full paper here.
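The arithmetic behind items #3 and #4 is easy to see numerically. The sketch below is an illustration only and is not taken from the paper: it assumes a one-sided Normal test of H0: mu <= 0 with known sigma = 1, and a hypothetical observed mean of 0.01 that is held fixed while the sample size grows. The function name and every number in it are assumptions made for the example.

```python
from math import erfc, sqrt


def one_sided_p_value(observed_mean, null_mean, sigma, n):
    """Upper-tail p-value for a Normal mean with known sigma (illustrative)."""
    # Standardised distance of the observed mean from the null value.
    z = (observed_mean - null_mean) / (sigma / sqrt(n))
    # P(Z >= z) for a standard Normal; erfc stays accurate far out in the tail.
    return 0.5 * erfc(z / sqrt(2))


# Hypothetical numbers: the same tiny observed difference at growing n.
for n in (100, 10_000, 1_000_000):
    p = one_sided_p_value(observed_mean=0.01, null_mean=0.0, sigma=1.0, n=n)
    print(f"n = {n:>9,d}   p-value = {p:.3g}")
```

The same observed difference goes from unremarkable to overwhelmingly significant as n increases, which is why a bare p-value has to be read together with the sample size and the discrepancies the test was capable of detecting.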

My book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST 2018, CUP), excavates the most recent variations on all of these howlers. To allege that statistical significance tests don’t use background information is a willful distortion of the tests which Fisher developed, hand-in-hand, with a large methodology of experimental design: randomization, predesignation and testing model assumptions. All these depend on incorporating background information into the specification and interpretation of tests. “The purpose of randomisation” Fisher made clear, “is to guarantee the validity of the test of significance” (1935). Observational (and other) studies that lack proper controls may well need to concede that any reported P-values are illicit–but then why report P-values at all? (Confidence levels are then equally illicit, except as descriptive measures without error control.) I say they should not report P-values lacking in error-statistical interpretations, at least not without reporting this. But don’t punish studies that work hard to attain error control.
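Fisher's point about randomisation can be made concrete with a small randomization test. This is a minimal sketch under stated assumptions, not Fisher's own computation: the two groups of measurements are hypothetical, the test statistic is the difference in group means, and the p-value is one-sided. Because treatment was (by assumption) assigned at random, every relabelling of the pooled units was equally probable under the null of no effect, and that known assignment mechanism is what gives the p-value its error-probability meaning.

```python
from itertools import combinations


def randomization_p_value(treated, control):
    """Exact one-sided randomization p-value for a difference in means."""
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    n_control = len(control)
    total = sum(pooled)
    observed = sum(treated) / n_treated - sum(control) / n_control

    at_least_as_extreme = 0
    n_assignments = 0
    # Enumerate every way the treatment labels could have been assigned.
    for idx in combinations(range(len(pooled)), n_treated):
        treated_sum = sum(pooled[i] for i in idx)
        diff = treated_sum / n_treated - (total - treated_sum) / n_control
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
        n_assignments += 1
    return at_least_as_extreme / n_assignments


# Hypothetical data: three treated and three control units.
print(randomization_p_value([5, 6, 7], [1, 2, 3]))
```

Without random assignment the enumeration can still be carried out, but the equal-probability assumption behind it is no longer warranted by the design.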

Before you jump on the popular (but misguided) bandwagons of “abandoning statistical significance” or derogating P-values as so-called “purely (blank slate) statistical measures”, ask for evidence supporting the criticisms.[2] You will find they are based on rather blatant misuses and abuses. Only by blocking the credulity with which such apparitions are met these days (in some circles) can we attain improved statistical inferences in Christmases yet to come.

[1] “Error statistical methods” is an umbrella term for methods that employ probability in inference to assess and control the capabilities of methods to avoid mistakes in interpreting data. It includes statistical significance tests, confidence intervals, confidence distributions, randomization, resampling and bootstrapping. A proper subset of error statistical methods are those that use error probabilities to assess and control the severity with which claims may be said to have passed a test (with given data). A claim C passes a test with severity to the extent that it has been subjected to and survives a test that probably would have found specified flaws in C, if present. Please see excerpts from SIST 2018.
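The severity idea in this note can be sketched for one simple case. The setting is an assumption brought in for illustration and is not spelled out in this post: a one-sided Normal test of H0: mu <= mu0 with known sigma, where a statistically significant observed mean has been obtained. The severity for the claim "mu > mu1" is then taken to be the probability of observing a sample mean no larger than the one actually observed, were mu equal to mu1. The function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def severity_mu_greater_than(mu1, observed_mean, sigma, n):
    """Severity for the claim 'mu > mu1' after observing observed_mean.

    Computed as P(sample mean <= observed_mean; mu = mu1), i.e. the
    probability of a less impressive result if mu were only mu1.
    """
    standard_error = sigma / sqrt(n)
    return NormalDist().cdf((observed_mean - mu1) / standard_error)


# Hypothetical result: observed mean 0.2, sigma 1, n = 100 (z = 2 against mu0 = 0).
for mu1 in (0.0, 0.1, 0.2, 0.3):
    sev = severity_mu_greater_than(mu1, observed_mean=0.2, sigma=1.0, n=100)
    print(f"claim mu > {mu1:.1f}   severity = {sev:.3f}")
```

In this sketch the same significant result warrants "mu > 0" with high severity but "mu > 0.3" with low severity, so significant results are not all interpreted alike and the size of the indicated discrepancy is reported.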

[2] See

  • November 4, 2019: On some Self-defeating aspects of the ASA’s 2019 recommendations of statistical significance tests
  • November 14, 2019: The ASA’s P-value Project: Why it’s Doing More Harm than Good (cont from 11/4/19)
  • November 30, 2019: P-Value Statements and Their Unintended(?) Consequences: The June 2019 ASA President’s Corner (b)

The paper referred to in the post from Christmas past (1) is:

Mayo, D. G. and Spanos, A. (2011) “Error Statistics” in Philosophy of Statistics, Handbook of Philosophy of Science Volume 7 Philosophy of Statistics.
