1. PhilSci and StatSci. I’m always glad to come across statistical practitioners who wax philosophical, particularly when Karl Popper is cited. Best of all is when they get the philosophy somewhere close to correct. So, I came across an article by Burnham and Anderson (2014) in Ecology:
“While the exact definition of the so-called ‘scientific method’ might be controversial, nearly everyone agrees that the concept of ‘falsifiability’ is a central tenant [sic] of empirical science (Popper 1959). It is critical to understand that historical statistical approaches (i.e., P values) leave no way to ‘test’ the alternative hypothesis. The alternative hypothesis is never tested, hence cannot be rejected or falsified!… Surely this fact alone makes the use of significance tests and P values bogus. Lacking a valid methodology to reject/falsify the alternative science hypotheses seems almost a scandal.” (Burnham and Anderson p. 629)
Well I am (almost) scandalized by this easily falsifiable allegation! I can’t think of a single “alternative”, whether in a “pure” Fisherian or a Neyman-Pearson hypothesis test (whether explicit or implicit) that’s not falsifiable; nor do the authors provide any. I grant that understanding testability and falsifiability is far more complex than the kind of popularized accounts we hear about; granted as well, theirs is just a short paper. But then why make bold declarations on the topic of the “scientific method and statistical science,” on falsifiability and testability?
We know that literal deductive falsification only occurs with trite examples like “All swans are white”: a single black swan falsifies the universal claim C: all swans are white, whereas observing a single white swan wouldn’t allow inferring C (unless there was only 1 swan, or no variability in color). But Burnham and Anderson are discussing statistical falsification, and statistical methods of testing. Moreover, the authors champion a methodology that they say has nothing to do with testing or falsifying: “Unlike significance testing”, the approaches they favor “are not ‘tests,’ are not about testing” (p. 628). I’m not disputing their position that likelihood ratios, odds ratios, Akaike model selection methods are not about testing, but falsification is all about testing! No tests, no falsification, not even of the null hypotheses (which they presumably agree significance tests can falsify). It seems almost a scandal, and it would be one if critics of statistical testing were held to a more stringent, more severe, standard of evidence and argument than they are.
I may add installments/corrections (certainly on E. Pearson’s birthday Thursday); I’ll update with (i), (ii) and the date.
A bit of background. I view significance tests as only a part of a general statistical methodology of testing, estimation, and modeling that employs error probabilities of methods to control and assess how capable methods are at probing errors, and blocking misleading interpretations of data. I call it an error statistical methodology. I reformulate statistical tests as tools for severe testing. The outputs report on the discrepancies that have and have not been tested with severity. There’s much in Popper I agree with: data x only count as evidence for a claim H1 if they constitute an unsuccessful attempt to falsify H1. One does not have evidence for a claim if nothing has been done to rule out ways the claim may be false. I use formal error probabilities to direct a more satisfactory notion of severity than Popper’s.
2. Popper, Fisher-Neyman-Pearson, and falsification.
Popper’s philosophy shares quite a lot with the stringent testing ideas found in Fisher, and also Neyman-Pearson, something Popper himself recognized in the work the authors cite (LSD). Here is Popper:
We say that a theory is falsified only if we have accepted basic statements which contradict it…. This condition is necessary but not sufficient; for we have seen that non-reproducible single occurrences are of no significance to science. Thus a few stray basic statements contradicting a theory will hardly induce us to reject it as falsified. We shall take it as falsified only if we discover a reproducible effect which refutes the theory. In other words, we only accept the falsification if a low level empirical hypothesis which describes such an effect is proposed and corroborated. (Popper LSD, 1959, 203)
Such “a low level empirical hypothesis” is well captured by a statistical claim. Unlike the logical positivists, Popper realized that singular observation statements couldn’t provide the “basic statements” for science. In the same spirit, Fisher warned that in order to use significance tests to legitimately indicate incompatibility with hypotheses, we need not an isolated low P-value, but an experimental phenomenon.
[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher 1947, p. 14)
If such statistically significant effects are produced reliably, as Fisher required, they indicate a genuine effect. Conjectured statistical effects are likewise falsified if they contradict data and/or could only be retained through ad hoc saves, verification biases and “exception incorporation”. Moving in stages between data collection, modeling, inferring, and from statistical to substantive hypotheses and back again, learning occurs by a series of piecemeal steps with the same reasoning. The fact that at one stage H1 might be the alternative, at another, the test hypothesis, is no difficulty. The logic differs from inductive updating probabilities of a hypothesis, as well as from a comparison of how much more probable H1 makes the data than does H0, as in likelihood ratios. These are 2 variants of probabilism.
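Fisher’s requirement, that we know how to run an experiment which will “rarely fail” to give a significant result, can be illustrated with a small simulation. This is only a sketch under assumptions that are mine, not Fisher’s: a one-sided z-test with known σ = 1, n = 25, α = 0.05, and a hypothetical true effect of 0.8. The function names are invented for illustration.

```python
import math
import random
from statistics import NormalDist

def significant(sample_mean, n, sigma=1.0, alpha=0.05):
    """One-sided z-test of H0: mu <= 0 against mu > 0, with sigma known."""
    z = sample_mean / (sigma / math.sqrt(n))
    p_value = 1.0 - NormalDist().cdf(z)
    return p_value < alpha

def success_rate(true_mu, n=25, sigma=1.0, trials=20_000, seed=1):
    """Fraction of repeated experiments that reach statistical significance."""
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw the sample mean directly from its sampling distribution.
        xbar = rng.gauss(true_mu, se)
        hits += significant(xbar, n, sigma)
    return hits / trials

rate_null = success_rate(0.0)    # the world is like H0 (assumed)
rate_effect = success_rate(0.8)  # a genuine effect of assumed size 0.8
print(f"significant under H0:      {rate_null:.3f}")
print(f"significant under mu=0.8:  {rate_effect:.3f}")
```

When the world is like H0, significance turns up only about as often as α allows, so an isolated small P-value is weak grounds; with a genuine effect of this (assumed) size the experiment rarely fails to produce one.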
Now there are many who embrace probabilism who deny they need tools to reject or falsify hypotheses. That’s fine. But having declared it a scandal (almost) for a statistical account to lack a methodology to reject/falsify, it’s a bit surprising to learn their account offers no such falsificationist tools. (Perhaps I’m misunderstanding; I invite correction.) For example, the likelihood ratio, they declare, “is an evidence ratio about parameters, given the model and the data. It is the likelihood ratio that defines evidence (Royall 1997)” (Burnham and Anderson, p. 628). They italicize “given” which underscores that these methods begin their work only after models are specified. Richard Royall is mentioned often, but Royall is quite clear that for data to favor H1 over H0 is not to have supplied evidence against H0. (“the fact that we can find some other hypothesis that is better supported than H does not mean that the observations are evidence against H” (1997, pp.21-2).) There’s no such thing as evidence for or against a single hypothesis for him. But without evidence against H0, one can hardly mount a falsification of H0. Thus, I fail to see how their preferred account promotes falsification. It’s (almost) a scandal.
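Royall’s remark, that finding a better supported rival is not evidence against H, can be seen numerically. The following is a minimal sketch under my own assumptions (Normal data with known σ, a point null μ0, and invented numbers): the alternative that sets μ equal to the observed mean has a likelihood ratio against the null of exp(z²/2), which is never below 1 whatever the data.

```python
import math

def max_likelihood_ratio(xbar, mu0, sigma, n):
    """Likelihood ratio of the best-fitting alternative (mu = xbar) against mu0.

    For a Normal mean with known sigma, L(mu) is proportional to
    exp(-n * (xbar - mu)**2 / (2 * sigma**2)), so the ratio is exp(z**2 / 2).
    """
    se = sigma / math.sqrt(n)
    z = (xbar - mu0) / se
    return math.exp(0.5 * z * z)

# Hypothetical observed means, expressed as z standard errors from mu0 = 0.
ratios = {
    z: max_likelihood_ratio(z * 0.1, mu0=0.0, sigma=1.0, n=100)
    for z in (0.0, 0.5, 1.0, 1.96)
}
for z, lr in ratios.items():
    print(f"z = {z:4.2f}  LR(best alternative vs null) = {lr:.3f}")
```

Since some rival can always be found that the data favor at least as well as H0, a comparative ratio on its own cannot supply the evidence against H0 that a falsification would need.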
Maybe all they mean is that “historical” Fisher said the tests have only a null, so the only alternative would be its denial. First, we shouldn’t be limiting ourselves to what Fisher thought, nor keeping an arbitrary distinction between Fisherian tests, N-P tests and confidence intervals. (David Cox is a leading Fisherian and his tests have either implicit or explicit alternatives. The choice of a test statistic indicates the alternative, even if it’s only directional. In N-P tests, the test hypothesis and the alternative may be swapped.) Second, even if one imagines the alternative is limited to either of the following:
(i) the effect is real/ non-spurious, or (ii) a parametric non-zero claim (e.g., μ ≠ 0),
they are still statistically falsifiable. An example of the first came last week. Shock waves were felt in high energy particle physics (HEP) when early indications (from last December) of a genuine new particle, one that would falsify the highly corroborated Standard Model (SM), were themselves falsified. This was based on falsifying a common statistical alternative in a significance test: the observed “resonance” (a great term) is real. (The “bumps” began to fade with more data.) As for case (ii), some of the most important results in science are null results. By means of high precision null hypothesis tests, bounds for statistical parameters are inferred by rejecting (or falsifying) discrepancies beyond the limits tests are capable of detecting. Think of the famous negative result of the Michelson-Morley experiments that falsified the “ether” (or aether) of the type ruled out by special relativity, or the famous equivalence principles of experimental GTR. An example of each is briefly touched upon in a paper with David Cox (Mayo and Cox 2006). Of course, background knowledge about the instruments and theories is operative throughout. More typical are the cases where power analysis can be applied, as discussed in this post.
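To make case (ii) concrete, here is a minimal sketch of how a non-significant result can rule out discrepancies larger than some bound. It assumes a one-sided Normal test with known σ (H0: μ ≤ 0 against μ > 0) and invented numbers; the function name is hypothetical. The severity for the claim μ ≤ μ1 is computed as the probability of a larger observed mean than the one obtained, were μ equal to μ1.

```python
import math
from statistics import NormalDist

def severity_upper_bound(xbar_obs, mu1, sigma, n):
    """Severity for inferring mu <= mu1 after a non-significant result.

    Computed as P(sample mean > xbar_obs; mu = mu1) for a Normal mean
    with known sigma (an assumed, simplified test setup).
    """
    se = sigma / math.sqrt(n)
    return 1.0 - NormalDist().cdf((xbar_obs - mu1) / se)

# Invented data: sigma = 1, n = 100, observed mean 0.05 (z = 0.5, not significant).
xbar_obs, sigma, n = 0.05, 1.0, 100
severities = {
    mu1: severity_upper_bound(xbar_obs, mu1, sigma, n)
    for mu1 in (0.05, 0.1, 0.2, 0.3)
}
for mu1, sev in severities.items():
    print(f"claim mu <= {mu1:.2f}: severity = {sev:.3f}")
```

Under these assumptions the claim μ ≤ 0.3 passes with high severity, so discrepancies that large are statistically falsified, while μ ≤ 0.05 is poorly probed. The bound one may infer depends on what the test was capable of detecting.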
Perhaps they only mean to say that Fisherian tests don’t directly try to falsify “the effect is real”. But they’re supposed to: it should be very difficult to bring about statistically significant results if the world is like H0.
3. Model validation, specification and falsification.
When serious attention is paid to the discovery of new ways to extend models and theories, and to model validation, basic statistical tests are looked to. This is so even for Bayesians, be they ecumenical like George Box, or “falsificationists” like Gelman.
For Box, any account that relies on statistical models requires “diagnostic checks and tests of fit which, I will argue, require frequentist theory significance tests for their formal justification” (Box 1983, p. 57). This leads Box to advocate ecumenism. He asks,
[w]hy can’t all criticism be done using Bayes posterior analysis?…The difficulty with this approach is that by supposing all possible sets of assumptions are known a priori, it discredits the possibility of new discovery. But new discovery is, after all, the most important object of the scientific process (ibid., p. 73).
Listen to Andrew Gelman (2011):
At a philosophical level, I have been persuaded by the arguments of Popper (1959), Kuhn (1970), Lakatos (1978), and others that scientific revolutions arise from the identification and resolution of anomalies. In statistical terms, an anomaly is a misfit of model to data (or perhaps an internal incoherence of the model), and it can be identified by a (Fisherian) hypothesis test without reference to any particular alternative (what Cox and Hinkley 1974 call ‘pure significance testing’) (Gelman 2011, p. 70).
Discovery, model checking and correcting rely on statistical testing, formal or informal.
4. “An explicit, objective criterion of ‘best’ models” using methods that obey the LP (p.628).
Say Burnham and Anderson:
“At a deeper level, P values are not proper evidence as they violate the likelihood principle (Royall 1997)” (p. 627).
A list of pronouncements by Royall follows. What we know at a much deeper level is that any account that obeys the likelihood principle (LP) is not an account that directly assesses or controls the error probabilities of procedures. Control of error probabilities, even approximately, is essential for good tests, and this grows out of a concern, not for controlling error rates in the long run, but for evaluating how well tested models and hypotheses are with the data in hand. As with others who embrace the LP, the authors reject adjusting for selection effects, data dredging, multiple testing, etc.–gambits that alter the sampling distribution and, handled cavalierly, are responsible for much of the bad statistics we see. By the way, reference or default Bayesians also violate the LP. You can’t just make declarations about “proper evidence” without proper evidence. (There’s quite a lot on the LP on this blog; see also links to posts below the references.)
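The way selection effects alter the sampling distribution can be shown with a short calculation and simulation. This is a sketch under assumptions I am adding: k independent tests, all nulls true, and continuous test statistics so that each P-value is uniform on (0, 1). The function names are hypothetical.

```python
import random

def prob_at_least_one_significant(k, alpha=0.05):
    """Chance that the smallest of k independent null P-values falls below alpha."""
    return 1.0 - (1.0 - alpha) ** k

def simulate_best_of_k(k, alpha=0.05, trials=20_000, seed=7):
    """Monte Carlo version: hunt through k tests and report only the best one."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under a true null, each P-value is uniform on (0, 1) (assumed).
        smallest_p = min(rng.random() for _ in range(k))
        hits += smallest_p < alpha
    return hits / trials

analytic = prob_at_least_one_significant(20)
simulated = simulate_best_of_k(20)
print(f"nominal level:                 0.050")
print(f"actual rate, best of 20 tests: {analytic:.3f} (analytic)")
print(f"actual rate, best of 20 tests: {simulated:.3f} (simulated)")
```

The reported “P < 0.05” is the same number whether or not it was the result of hunting, but the probability of producing such a report when nothing is going on is very different. An account that conditions only on the data actually reported has no place to register that difference.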
Burnham and Anderson are concerned with how old a method is. Oh the horrors of being a “historical” method. Appealing to ridicule (“progress should not have to ride in a hearse”) is no argument. Besides, it’s manifestly silly to suppose you use a single method, or that error statistical tests haven’t been advanced as well as reinterpreted since Fisher’s day. Moreover, the LP is a historical, baked-on principle suitable for ye olde logical positivist days when empirical observations were treated as “given”. Within that statistical philosophy, it was typical to hold that the data speak for themselves, and that questionable research practices such as cherry-picking, data-dredging, data-dependent selections, and optional stopping are irrelevant to “what the data are saying”! It’s redolent of the time when statistical philosophy sought a single, “objective” evidential relationship to hold between given data, model and hypotheses. Holders of the LP still say this, and the authors are no exception.
[The LP was, I believe, articulated by George Barnard who announced he rejected it at the 1959 Savage forum for all but predesignated simple hypotheses. If you have a date or correction, please let me know. 8/10]
The truth is that one of the biggest problems behind the “replication crisis” is the violation of some age-old truisms about science. It’s the consumers of bad science (in medicine at least) that are likely to ride in a hearse. There’s something wistful about remarks we hear from some quarters now. Listen to Ben Goldacre (2016) in Nature: “The basics of a rigorous scientific method were worked out many years ago, but there is now growing concern about systematic structural flaws that undermine the integrity of published data,” which he follows with a list of selective publication, data dredging and all the rest, “leading collectively to the ‘replication crisis’.”
He’s trying to remind us that the rules for good science were all in place long ago and somehow are now being ignored or trampled over, in some fields. Wherever there’s a legitimate worry about “perverse incentives,” it’s not a good idea to employ methods where selection effects vanish.
5. Concluding comments
I don’t endorse many of the applications of significance tests in the literature, especially in the social sciences. Many p-values reported are vitiated by fallacious interpretations (going from a statistical to substantive effect), violated assumptions, and biasing selection effects. I’ve long recommended a reformulation of the tools to avoid fallacies of rejection and non-rejection. In some cases, sadly, better statistical inference cannot help, but that doesn’t make me want to embrace methods that do not directly pick up on the effects of biasing selections. Just the opposite.
If the authors are serious about upholding Popperian tenets of good science, then they’ll want to ensure the claims they make can be regarded as having passed a stringent probe into their falsity. I invite comments and corrections.
(Look for updates.)
They are replying to an article by Paul Murtaugh. See the link to his paper here.
Gelman continues: “At the next stage, we see science–and applied statistics–as resolution of anomalies via the creation of improved models which often include their predecessors as special cases. This view corresponds closely to the error-statistics idea of Mayo (1996).”
- Box, G. 1983. “An Apology for Ecumenism in Statistics,” in Box, G.E.P., Leonard, T. and Wu, D. F. J. (eds.), pp. 51-84, Scientific Inference, Data Analysis, and Robustness. New York: Academic Press. [1982 Technical Summary Report #2408 for U.S. Army version here.]
- Burnham, K. P. & Anderson, D. R. 2014, “P values are only an index to evidence: 20th- vs. 21st-century statistical science”, Ecology, vol. 95, no. 3, pp. 627-630.
- Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and Hall.
- Gelman, A. 2011. “Induction and Deduction in Bayesian Data Analysis”, Rationality, Markets and Morals (RMM) 2, Special Topic: Statistical Science and Philosophy of Science, pp. 67-78.
- Goldacre, B. 2016. “Make Journals Report Clinical Trials Properly”, Nature 530, 7 (04 February 2016).
- Kuhn, T. S. 1970. The Structure of Scientific Revolutions, 2nd ed. Chicago: University of Chicago Press.
- Lakatos, I. 1978. The Methodology of Scientific Research Programmes, Cambridge: Cambridge University Press.
- Mayo, D. 1996. Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago: University of Chicago Press.
- Mayo, D. and Cox, D. R. 2006. “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.
- Murtaugh, P.A. 2014, “In defense of P values”, Ecology, vol. 95, no. 3, pp. 611-617.
- Murtaugh, P.A. 2014, “Rejoinder”, Ecology, vol. 95, no. 3, pp. 651-653.
- Fisher, R. A. 1947. The Design of Experiments (4th ed.). Edinburgh: Oliver and Boyd.
- Popper, K. 1959. The Logic of Scientific Discovery. New York: Basic Books.
- Royall, R. 1997. Statistical Evidence: A Likelihood Paradigm. Chapman and Hall, CRC Press.
- Spanos, A. 2014. “Recurring controversies about P values and confidence intervals revisited”, Ecology, vol. 95, no. 3, pp. 645-651.
LAW OF LIKELIHOOD: ROYALL
8/29/14: BREAKING THE LAW! (of likelihood): to keep their fit measures in line (A), (B 2nd)
10/10/14: BREAKING THE (Royall) LAW! (of likelihood) (C)
11/15/14: Why the Law of Likelihood is bankrupt—as an account of evidence
11/25/14: How likelihoodists exaggerate evidence from statistical tests
7/14/14: “P-values overstate the evidence against the null”: legit or fallacious? (revised)
7/23/14: Continued: “P-values overstate the evidence against the null”: legit or fallacious?
5/12/16: Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence”
I noticed you were tweeting this link out to ecologists, which is of great interest to me since it was through Dynamic Ecology that I found your work, and a Burnham and Anderson book that I became seriously interested. My sense is that likelihood-based model selection methods are very influential in ecology. Popular ecology statistics handbooks treat methods in an agnostic way wrt the debates on your blog. Even though I subscribe to your ideas, I don’t know how severity could be extended to cover ecological data sets, where there are typically many possible explanatory variables and ANOVA models tend to turn up significant no matter what. I’m eager for more experienced ecologists to jump in!