If you think it’s a scandal to be without statistical falsification, you will need statistical tests (ii)


1. PhilSci and StatSci. I’m always glad to come across statistical practitioners who wax philosophical, particularly when Karl Popper is cited. Best of all is when they get the philosophy somewhere close to correct. So, I came across an article by Burnham and Anderson (2014) in Ecology:

“While the exact definition of the so-called ‘scientific method’ might be controversial, nearly everyone agrees that the concept of ‘falsifiability’ is a central tenant [sic] of empirical science (Popper 1959). It is critical to understand that historical statistical approaches (i.e., P values) leave no way to ‘test’ the alternative hypothesis. The alternative hypothesis is never tested, hence cannot be rejected or falsified!… Surely this fact alone makes the use of significance tests and P values bogus. Lacking a valid methodology to reject/falsify the alternative science hypotheses seems almost a scandal.” (Burnham and Anderson p. 629)

Well I am (almost) scandalized by this easily falsifiable allegation! I can’t think of a single “alternative”, whether in a “pure” Fisherian or a Neyman-Pearson hypothesis test (whether explicit or implicit) that’s not falsifiable; nor do the authors provide any. I grant that understanding testability and falsifiability is far more complex than the kind of popularized accounts we hear about; granted as well, theirs is just a short paper.[1] But then why make bold declarations on the topic of the “scientific method and statistical science,” on falsifiability and testability?

We know that literal deductive falsification only occurs with trite examples like “All swans are white”: a single black swan falsifies the universal claim C: all swans are white, whereas observing a single white swan wouldn’t allow inferring C (unless there was only 1 swan, or no variability in color). But Burnham and Anderson are discussing statistical falsification, and statistical methods of testing. Moreover, the authors champion a methodology that they say has nothing to do with testing or falsifying: “Unlike significance testing”, the approaches they favor “are not ‘tests,’ are not about testing” (p. 628). I’m not disputing their position that likelihood ratios, odds ratios, Akaike model selection methods are not about testing, but falsification is all about testing! No tests, no falsification, not even of the null hypotheses (which they presumably agree significance tests can falsify). It seems almost a scandal, and it would be one if critics of statistical testing were held to a more stringent, more severe, standard of evidence and argument than they are.

I may add installments/corrections (certainly on E. Pearson’s birthday Thursday); I’ll update with (i), (ii) and the date.

A bit of background. I view significance tests as only a part of a general statistical methodology of testing, estimation, and modeling that employs error probabilities of methods to control and assess how capable methods are at probing errors, and blocking misleading interpretations of data. I call it an error statistical methodology. I reformulate statistical tests as tools for severe testing. The outputs report on the discrepancies that have and have not been tested with severity. There’s much in Popper I agree with: data x only count as evidence for a claim H1 if they constitute an unsuccessful attempt to falsify H1. One does not have evidence for a claim if nothing has been done to rule out ways the claim may be false. I use formal error probabilities to direct a more satisfactory notion of severity than Popper’s.

2. Popper, Fisher-Neyman-Pearson, and falsification.

Popper’s philosophy shares quite a lot with the stringent testing ideas found in Fisher, and also Neyman-Pearson, something Popper himself recognized in the work the authors cite (LSD). Here is Popper:

We say that a theory is falsified only if we have accepted basic statements which contradict it…. This condition is necessary but not sufficient; for we have seen that non-reproducible single occurrences are of no significance to science. Thus a few stray basic statements contradicting a theory will hardly induce us to reject it as falsified. We shall take it as falsified only if we discover a reproducible effect which refutes the theory. In other words, we only accept the falsification if a low level empirical hypothesis which describes such an effect is proposed and corroborated. (Popper LSD, 1959, 203)

Such “a low level empirical hypothesis” is well captured by a statistical claim. Unlike the logical positivists, Popper realized that singular observation statements couldn’t provide the “basic statements” for science. In the same spirit, Fisher warned that in order to use significance tests to legitimately indicate incompatibility with hypotheses, we need not an isolated low P-value, but an experimental phenomenon.

[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher 1947, p. 14)

If such statistically significant effects are produced reliably, as Fisher required, they indicate a genuine effect. Conjectured statistical effects are likewise falsified if they contradict data and/or could only be retained through ad hoc saves, verification biases and “exception incorporation”. Moving in stages between data collection, modeling, inferring, and from statistical to substantive hypotheses and back again, learning occurs by a series of piecemeal steps with the same reasoning. The fact that at one stage H1 might be the alternative, at another, the test hypothesis, is no difficulty. The logic differs from inductive updating of probabilities of a hypothesis, as well as from a comparison of how much more probable H1 makes the data than does H0, as in likelihood ratios. These are two variants of probabilism.

Now there are many who embrace probabilism who deny they need tools to reject or falsify hypotheses. That’s fine. But having declared it a scandal (almost) for a statistical account to lack a methodology to reject/falsify, it’s a bit surprising to learn their account offers no such falsificationist tools. (Perhaps I’m misunderstanding; I invite correction.) For example, the likelihood ratio, they declare, “is an evidence ratio about parameters, given the model and the data. It is the likelihood ratio that defines evidence (Royall 1997)” (Burnham and Anderson, p. 628). They italicize “given” which underscores that these methods begin their work only after models are specified. Richard Royall is mentioned often, but Royall is quite clear that for data to favor H1 over H0 is not to have supplied evidence against H0. (“the fact that we can find some other hypothesis that is better supported than H does not mean that the observations are evidence against H” (1997, pp.21-2).) There’s no such thing as evidence for or against a single hypothesis for him. But without evidence against H0, one can hardly mount a falsification of H0. Thus, I fail to see how their preferred account promotes falsification. It’s (almost) a scandal.

Maybe all they mean is that “historical” Fisher said the tests have only a null, so the only alternative would be its denial. First, we shouldn’t be limiting ourselves to what Fisher thought, nor keep an arbitrary distinction between Fisherian tests, N-P tests, and confidence intervals. (David Cox is a leading Fisherian and his tests have either implicit or explicit alternatives. The choice of a test statistic indicates the alternative, even if it’s only directional. In N-P tests, the test hypothesis and the alternative may be swapped.) Second, even if one imagines the alternative is limited to either of the following:

(i) the effect is real/ non-spurious, or (ii) a parametric non-zero claim (e.g., μ ≠ 0),

they are still statistically falsifiable. An example of the first came last week. Shock waves were felt in high energy particle physics (HEP) when early indications (from last December) of a genuine new particle, one that would falsify the highly corroborated Standard Model (SM), were themselves falsified. This was based on falsifying a common statistical alternative in a significance test: the observed “resonance” (a great term) is real. (The “bumps” began to fade with more data [2].) As for case (ii), some of the most important results in science are null results. By means of high precision null hypothesis tests, bounds for statistical parameters are inferred by rejecting (or falsifying) discrepancies beyond the limits tests are capable of detecting. Think of the famous negative result of Michelson-Morley experiments that falsified the “ether” (or aether) of the type ruled out by special relativity, or the famous equivalence principles of experimental GTR. An example of each is briefly touched upon in a paper with David Cox (Mayo and Cox 2006). Of course, background knowledge about the instruments and theories is operative throughout. More typical are the cases where power analysis can be applied, as discussed in this post.
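To make the idea of inferring bounds from a null result concrete, here is a minimal sketch. It assumes a setting not spelled out above: a one-sided test of H0: μ ≤ 0 against H1: μ > 0, with a Normal sample mean and known σ. It uses the severity assessment from the error statistical literature, SEV(μ ≤ μ1) = P(sample mean > observed mean; μ = μ1). The function names and the numbers (σ = 1, n = 100, observed mean 0.1) are hypothetical, chosen only for illustration.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def severity_upper_bound(xbar_obs, mu1, sigma, n):
    """Severity for the claim mu <= mu1 after observing xbar_obs.

    Computed as P(sample mean > xbar_obs; mu = mu1): the probability of a
    larger observed mean than the one we got, were mu as large as mu1.
    """
    standard_error = sigma / sqrt(n)
    return 1.0 - normal_cdf((xbar_obs - mu1) / standard_error)


# Hypothetical numbers: sigma = 1, n = 100, observed mean 0.1
# (z = 1.0, so not significant at the one-sided 0.05 level).
sigma, n, xbar_obs = 1.0, 100, 0.1
for mu1 in (0.0, 0.1, 0.2, 0.3):
    sev = severity_upper_bound(xbar_obs, mu1, sigma, n)
    print(f"SEV(mu <= {mu1:.1f}) = {sev:.3f}")
```

With these made-up numbers, the insignificant result warrants μ ≤ 0.3 with severity of about 0.98, but it does not warrant μ ≤ 0 (about 0.16). The test rules out only those discrepancies it had a good chance of detecting.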


Perhaps they only mean to say that Fisherian tests don’t directly try to falsify “the effect is real”. They’re supposed to: it should be very difficult to bring about statistically significant results if the world is like H0.

3. Model validation, specification and falsification.

When serious attention is paid to the discovery of new ways to extend models and theories, and to model validation, basic statistical tests are looked to. This is so even for Bayesians, be they ecumenical like George Box, or “falsificationists” like Gelman.

For Box, any account that relies on statistical models requires “diagnostic checks and tests of fit which, I will argue, require frequentist theory significance tests for their formal justification” (Box 1983, p. 57). This leads Box to advocate ecumenism. He asks,

[w]hy can’t all criticism be done using Bayes posterior analysis?…The difficulty with this approach is that by supposing all possible sets of assumptions are known a priori, it discredits the possibility of new discovery. But new discovery is, after all, the most important object of the scientific process (ibid., p. 73).

Listen to Andrew Gelman (2011):

At a philosophical level, I have been persuaded by the arguments of Popper (1959), Kuhn (1970), Lakatos (1978), and others that scientific revolutions arise from the identification and resolution of anomalies. In statistical terms, an anomaly is a misfit of model to data (or perhaps an internal incoherence of the model), and it can be identified by a (Fisherian) hypothesis test without reference to any particular alternative (what Cox and Hinkley 1974 call ‘pure significance testing’)[3] (Gelman 2011, p. 70).

Discovery, model checking and correcting rely on statistical testing, formal or informal.

4. “An explicit, objective criterion of ‘best’ models” using methods that obey the LP (p.628).

Say Burnham and Anderson:

“At a deeper level, P values are not proper evidence as they violate the likelihood principle (Royall 1997)” (p. 627).

A list of pronouncements by Royall follows. What we know at a much deeper level is that any account that obeys the likelihood principle (LP) is not an account that directly assesses or controls the error probabilities of procedures. Control of error probabilities, even approximately, is essential for good tests, and this grows out of a concern, not for controlling error rates in the long run, but for evaluating how well tested models and hypotheses are with the data in hand. As with others who embrace the LP, the authors reject adjusting for selection effects, data dredging, multiple testing, etc.–gambits that alter the sampling distribution and, handled cavalierly, are responsible for much of the bad statistics we see. By the way, reference or default Bayesians also violate the LP. You can’t just make declarations about “proper evidence” without proper evidence. (There’s quite a lot on the LP on this blog; see also links to posts below the references.)
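One way to see how such gambits alter the sampling distribution is a small simulation of optional stopping, chosen here purely as an illustration. The setup is an assumption, not something taken from Burnham and Anderson: data are drawn from a standard Normal, so the null hypothesis μ = 0 is true, and a two-sided z-test at the nominal 0.05 level is applied either once at n = 100 or after every observation up to n = 100, stopping at the first "significant" result. All function names are hypothetical.

```python
import random
from math import sqrt


def rejects_fixed_n(rng, n, z_crit=1.96):
    """Single z-test at a sample size fixed in advance."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(total / sqrt(n)) > z_crit


def rejects_with_peeking(rng, max_n, z_crit=1.96):
    """Test after every observation; stop at the first nominal rejection."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total / sqrt(n)) > z_crit:
            return True
    return False


def rejection_rate(rule, trials, n, seed):
    """Share of simulated null experiments in which `rule` rejects."""
    rng = random.Random(seed)
    return sum(rule(rng, n) for _ in range(trials)) / trials


fixed_rate = rejection_rate(rejects_fixed_n, trials=2000, n=100, seed=1)
peeking_rate = rejection_rate(rejects_with_peeking, trials=2000, n=100, seed=1)
print(f"fixed n: {fixed_rate:.3f}   optional stopping: {peeking_rate:.3f}")
```

The likelihood function for the final data is the same whichever stopping rule produced it, so an account obeying the LP registers no difference. The error probabilities differ sharply: the fixed-n rule rejects a true null at about the nominal rate, while the peeking rule rejects far more often.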

Burnham and Anderson are concerned with how old a method is. Oh the horrors of being a “historical” method. Appealing to ridicule (“progress should not have to ride in a hearse”) is no argument. Besides, it’s manifestly silly to suppose you use a single method, or that error statistical tests haven’t been advanced as well as reinterpreted since Fisher’s day. Moreover, the LP is a historical, baked-on principle suitable for ye olde logical positivist days when empirical observations were treated as “given”. Within that statistical philosophy, it was typical to hold that the data speak for themselves, and that questionable research practices such as cherry-picking, data-dredging, data-dependent selections, and optional stopping are irrelevant to “what the data are saying”! It’s redolent of the time when statistical philosophy sought a single, “objective” evidential relationship to hold between given data, model and hypotheses. Holders of the LP still say this, and the authors are no exception.

[The LP was, I believe, articulated by George Barnard who announced he rejected it at the 1959 Savage forum for all but predesignated simple hypotheses. If you have a date or correction, please let me know. 8/10]

The truth is that one of the biggest problems behind the “replication crisis” is the violation of some age-old truisms about science. It’s the consumers of bad science (in medicine at least) that are likely to ride in a hearse. There’s something wistful about remarks we hear from some quarters now. Listen to Ben Goldacre (2016) in Nature: “The basics of a rigorous scientific method were worked out many years ago, but there is now growing concern about systematic structural flaws that undermine the integrity of published data,” which he follows with a list of selective publication, data dredging and all the rest, “leading collectively to the ‘replication crisis’.”

He’s trying to remind us that the rules for good science were all in place long ago and somehow are now being ignored or trampled over, in some fields. Wherever there’s a legitimate worry about “perverse incentives,” it’s not a good idea to employ methods where selection effects vanish.

5. Concluding comments

I don’t endorse many of the applications of significance tests in the literature, especially in the social sciences. Many p-values reported are vitiated by fallacious interpretations (going from a statistical to substantive effect), violated assumptions, and biasing selection effects. I’ve long recommended a reformulation of the tools to avoid fallacies of rejection and non-rejection. In some cases, sadly, better statistical inference cannot help, but that doesn’t make me want to embrace methods that do not directly pick up on the effects of biasing selections. Just the opposite.

If the authors are serious about upholding Popperian tenets of good science, then they’ll want to ensure the claims they make can be regarded as having passed a stringent probe into their falsity. I invite comments and corrections.

(Look for updates.)


[1] They are replying to an article by Paul Murtaugh. See the link to his paper here.


[3] Gelman continues: “At the next stage, we see science–and applied statistics–as resolution of anomalies via the creation of improved models which often include their predecessors as special cases. This view corresponds closely to the error-statistics idea of Mayo (1996).”


    • Box, G. 1983. “An Apology for Ecumenism in Statistics,” in Box, G.E.P., Leonard, T. and Wu, D. F. J. (eds.), pp. 51-84, Scientific Inference, Data Analysis, and Robustness. New York: Academic Press. [1982 Technical Summary Report #2408 for U.S. Army version here.]
    • Burnham, K. P. & Anderson, D. R. 2014, “P values are only an index to evidence: 20th- vs. 21st-century statistical science”, Ecology, vol. 95, no. 3, pp. 627-630.
    • Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and Hall.
    • Gelman, A. 2011. “Induction and Deduction in Bayesian Data Analysis”, Rationality, Markets and Morals (RMM) 2, Special Topic: Statistical Science and Philosophy of Science, pp. 67-78.
    • Goldacre, B. 2016. “Make Journals Report Clinical Trials Properly”, Nature 530,7 (04 February 2016)
    • Kuhn, T. S. 1970. The Structure of Scientific Revolutions, 2nd ed. Chicago: University of Chicago Press.
    • Lakatos, I. 1978. The Methodology of Scientific Research Programmes, Cambridge: Cambridge University Press.
    • Mayo, D. 1996. Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago: University of Chicago Press.
    • Mayo, D. and Cox, D. R. 2006. “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.
    • Murtaugh, P.A. 2014, “In defense of P values”, Ecology, vol. 95, no. 3, pp. 611-617.
    • Murtaugh, P.A. 2014, “Rejoinder”, Ecology, vol. 95, no. 3, pp. 651-653.
    • Fisher, R. A. 1947. The Design of Experiments (4th ed.). Edinburgh: Oliver and Boyd.
    • Popper, K. 1959. The Logic of Scientific Discovery. New York: Basic Books.
    • Royall, R. 1997. Statistical Evidence: A Likelihood Paradigm. Chapman and Hall, CRC Press.
    • Spanos, A. 2014. “Recurring controversies about P values and confidence intervals revisited”, Ecology, vol. 95, no. 3, pp. 645-651.

Related Blogposts

8/29/14: BREAKING THE LAW! (of likelihood): to keep their fit measures in line (A), (B 2nd)
10/10/14: BREAKING THE (Royall) LAW! (of likelihood) (C)
11/15/14: Why the Law of Likelihood is bankrupt—as an account of evidence
11/25/14: How likelihoodists exaggerate evidence from statistical tests

7/14/14: “P-values overstate the evidence against the null”: legit or fallacious? (revised)
7/23/14: Continued: “P-values overstate the evidence against the null”: legit or fallacious?
5/12/16: Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence”

Categories: P-values, Severity, statistical tests, Statistics, StatSci meets PhilSci


22 thoughts on “If you think it’s a scandal to be without statistical falsification, you will need statistical tests (ii)”

  1. Kyle Griffiths

    I noticed you were tweeting this link out to ecologists, which is of great interest to me since it was through Dynamic Ecology that I found your work, and through a Burnham and Anderson book that I became seriously interested. My sense is that likelihood-based model selection methods are very influential in ecology. Popular ecology statistics handbooks treat methods in an agnostic way wrt the debates on your blog. Even though I subscribe to your ideas, I don’t know how severity could be extended to cover ecological data sets, where there are typically many possible explanatory variables and ANOVA models tend to turn up significant no matter what. I’m eager for more experienced ecologists to jump in!

    • Kyle: I’m not sure why it would be an extension to use severity in ecology. Ecologists, many of them, were influenced by Royall. [When I asked Royall why, he said it was just an accident of some students he had, and that he would have preferred a different field of followers–he meant because of complexities in the field.] The model selection business was hot around 15 years ago, and clearly these methods are still popular, but I lost enthusiasm after my colleague in Econometrics, Aris Spanos, demonstrated how far off they can be. Moreover, I saw there were around 20 or so different model selection methods with different properties. The AIC and its variants boil down to using a type 1 error probability of alpha=.16 in significance testing. But the real problem is that the “best” among the group of models may be meaningless because nothing like an adequate model may be in the group. It’s all comparative. There’s no effort at model validation. If the correct or adequate model is outside the group you’re selecting from, you wouldn’t find out because it would just rank them according to a favored criterion. No falsification, or insights into what new models should be tried emerge.

      The last sections of this paper by Spanos might be relevant: https://errorstatistics.files.wordpress.com/2016/08/spanos-2006-jem.pdf
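The alpha=.16 figure in the reply above can be checked with a short calculation. The sketch assumes a setting the reply does not state: two nested models differing by one parameter, AIC = 2k - 2 log L, and the usual large-sample chi-square approximation (1 degree of freedom) for the likelihood ratio statistic. Under those assumptions AIC prefers the larger model exactly when 2(log L1 - log L0) > 2, so the implied type 1 error probability is P(chi-square with 1 df > 2). The helper name is hypothetical.

```python
from math import erfc, sqrt


def chi2_1df_tail(x):
    """Upper tail P(X > x) for a chi-square variable with 1 degree of freedom.

    If Z is standard Normal, Z**2 is chi-square(1), so
    P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return erfc(sqrt(x / 2.0))


# AIC penalty of 2 per parameter: with one extra parameter, the larger
# model wins when the likelihood ratio statistic exceeds 2.
aic_threshold = 2.0
implied_alpha = chi2_1df_tail(aic_threshold)
print(f"implied type 1 error probability: {implied_alpha:.3f}")

# For comparison, the conventional 0.05 level corresponds to a
# threshold of about 3.84 on the same scale.
print(f"tail area beyond 3.8415: {chi2_1df_tail(3.8415):.3f}")
```

Under these assumptions the implied level comes out near 0.157, in line with the figure quoted. With more than one extra parameter, or with AIC variants that use a different penalty, the implied level changes.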

      • Kyle Griffiths

        I completely agree with you about your main point (an adequate model might not be among those considered), and thank you for commenting on AIC! But the link seems to be redirecting me to my email for some reason…

  2. James T. Lee, MD PhD

    After I can stop wincing, I always become suspicious when any article’s creator(s) have not discerned the difference(s) between a “tenant” and a “tenet”. Seeing this kind of content flaw in a presumably learned publication (from any field) rings loud alarm bells and raises the red flag–What else in the writing may be bogus, blarney, or buncombe? Which parts of any buncombe may slip under the radar and defeat my hopefully still functional 73 year-old BS detector?

    There is a very short list of potential mechanisms that can account for the production of the tenant/tenet lesion: Insouciant proof-reading, faux scholarship, or rapid dictation over the phone to some secretary who faithfully typed “tenant” because he/she had never heard of a tenet. The last we can classify as unfortunate but understandable. The first two are unacceptable.

    • James: Yes, there’s some choppy language through the short article, so maybe they were in a rush. It’s silly to make a principle of falsification have to pay rent to science (as tenants do).

  3. Brian Cade

    I’m one of the statistical ecologists whose career has largely overlapped with that of Burnham and Anderson; same agency (USGS), same university (CSU), and same town (Fort Collins); so I’ve long been exposed to and dismayed by much of what they’ve published as they try to advocate for their information theoretic approach based on AIC. The commentary by Murtaugh (2014) that Burnham and Anderson (2014) are responding to reasonably points out that the information theoretic/AIC approach relies on differences in AIC between 2 models, which are just a rescaling of likelihood ratios between models and thus can be directly related to likelihood ratio hypothesis tests, confidence intervals based on test inversion, and coefficients of determination. How is this then some radical new paradigm for improved statistical or scientific inference? Burnham and Anderson in my opinion have gone to great lengths to obfuscate the relationship between AIC and other likelihood ratio inference methods as they advocate for the former. Of even greater concern is that they advocate for AIC based model averaging of individual regression coefficients across multiple candidate models for multimodel inferences. While I might concede that model averaging has some potential if the averaging is done across entire model estimators for the predicted response, model averaging of individual regression coefficients gets into all kinds of mathematical nonsense when there is changing multicollinearity among the covariates in different candidate models, a condition that commonly occurs with observational data (see Cade. 2015. Model averaging and muddled multimodel inferences. Ecology 96: 2370-2382). Of course, Bayesian model averaging of regression coefficients (Hoeting et al. 1999) suffers similar problems. So, yes, be scandalized. There is an entire generation of ecologists and wildlife biologists that have been taught the Burnham and Anderson approach and assume that they could not have erred.

    • Brian: Thanks for your comment, I’ll look up your paper. I expected the ecologists to come back at me, so it’s very interesting to hear they have doubts deeper than I would have expected (I’m counting twitter and email reactions). Still, I want to reiterate: I wasn’t scandalized by their full approach, which I’m not even familiar with; it was only the point about falsifiability that (jokingly) scandalized me. What I’m truly scandalized about is the extent to which significance test critics get away with repeating, verbatim often, the declarations and “laws” of a Royall or a Lindley or a Howson, and many others, without thinking through the issues themselves, or considering with fairness how testers would and do respond. Worse, there’s a double standard, often, when it comes to questioning some alternative methods. Anyway, that’s the purpose of my upcoming book.

  4. For some commentary on ecologists’ use of model selection, see https://dynamicecology.wordpress.com/2015/05/21/why-aic-appeals-to-ecologists-lowest-instincts/ (aside: any statistical technique can of course be used badly; for commentary on ecologists’ abuses of ANOVA see https://dynamicecology.wordpress.com/2012/11/27/ecologists-need-to-do-a-better-job-of-prediction-part-i-the-insidious-evils-of-anova/)

    For some debate on the possibilities for severe testing in ecology see

    Those two posts actually are debating Platt’s notion of “strong inference”, but I think most of the content generalizes if you mentally replace “strong inference” with “severe tests”.

    • Jeremy:
      Very interesting; I’ll check out the links. I knew I’d get good stuff notifying ecologists.

      • Well, don’t thank us for good stuff until you’ve checked it out and made sure it actually is good! 🙂

    • Jeremy:
      Passing a claim with severity doesn’t require the type of crucial test Platt described. To paraphrase my post, data x count as evidence for a claim H to the extent that it constitutes an unsuccessful attempt to falsify H. Were H specifiably false, with high probability the falsification attempt would have been successful (or, at least, the data would have disagreed with H), but instead the results we generate are in sync with H. That’s the gist of it, anyway.

      • Yes, thank you. I was just trying to provide a bit of context for those last two links. In particular, I’ve since come to recognize that in my post I (and numerous other ecologists) took “strong inference” in a rather loose sense somewhat divorced from Platt’s original usage. On reflection, I think I’d have been better off not stretching the term “strong inference” and instead just arguing that severe testing is indeed possible in ecology. And that among the ways to do it are approaches that bear a loose resemblance to strong inference.

  5. Kyle Griffiths

    I understand if you no longer wish to comment on the likelihood principle, but I’m curious if there are any further thoughts about whether likelihoods have a stable evidential meaning (paraphrasing you on Hacking’s 1972 review referenced in the LP posts above). I take it that model selection methods advocated by B&A are considered comparative b/c they don’t (have a stable evidential meaning).

    • Kyle: Hmm, but they do spew out a “best” among the group, and whether that has a stable meaning or not, my guess is no. In case you’re interested, there’s a lot on this blog about the alleged proof of the LP by Birnbaum based on frequentist principles. This was an extremely important result in statistical foundations, seeming to show that frequentist principles were self refuting. I wrote and published a few papers on some holes in the alleged “proof”, the last one in 2014 in Statistical Science.

    • john byrd

      Kyle, what do you mean by “stable evidential meaning”?

  6. Kyle Griffiths

    John, the phrase is from Mayo’s synopsis of Hacking on Edwards’ Likelihood in the LP posts above (BREAKING THE LAW…). I’m trying to understand the different views of evidence that have been discussed often on this blog. Some claim that likelihood functions offer all the information the data can tell us and in this view evidence comes from the relative support that different parameter values get under the likelihood function, and I take it that proportional likelihood functions are supposed to say the same thing from an evidential standpoint – that would be a stable evidential meaning. But I think Hacking disagrees that likelihoods can be compared like that. To round out my thoughts, an error statistician looks at the error distribution to see what could have occurred under theoretical or actual repetitions of the experiment and draws evidence from that. My interpretation, errors are mine.

    • Kyle: please read the other likelihood posts and chapters 9 and 10 of EGEK under my “publications”

    • Michael Lew

      Kyle, as one who routinely defends and promotes likelihood functions in comments on this blog, I wish to point out a common, but serious, mistake in your statement. The likelihood function does not offer “all the information the data can tell us”. Instead, the likelihood function tells you the relative support by the data for the possible values of the model parameters, within the statistical model.

      The difference between supplying a model-bound order of preference among values of the parameter of interest and “all of the information” is profound, and important, but it seems to be rarely noticed.

      • Kyle Griffiths

        Michael, I don’t want to belabor the point, but isn’t this what you meant when you posted “…The likelihood principle says that all of the evidence in the data relevant to parameter values in the model is contained in the likelihood function. That DOES NOT say that one has to make inferences only on the basis of the likelihood function…” (from “Why the Law of Likelihood is bankrupt…”)? because it was statements like this that made me curious about evidence from the likelihoodist vs. the error statistical POV.
