Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?
Critic: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true!
Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!
Raucous laughter ensues!
(Hah, hah… “So funny, I forgot to laugh! Or, I’m crying and laughing at the same time!)
The frequentist tester should retort:
Frequentist Tester: But you assume 50% of the null hypotheses are true, compute P(H0|x) using P(H0) = .5, imagine the null is rejected based on a single small p-value, and then blame the p-value for disagreeing with the result of your computation!
At times you even use α and power as likelihoods in your analysis! These tests violate both Fisherian and Neyman-Pearson tests.
It is well-known that for a fixed p-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H0. This Jeffreys-Lindley “disagreement” is considered problematic for Bayes ratios (e.g., Bernardo). It is not problematic for error statisticians. We always indicate the extent of discrepancy that is and is not indicated, and avoid making mountains out of molehills (See Spanos 2013). J. Berger and Sellke (1987) attempt to generalize the result to show the “exaggeration” even without large n. From their Bayesian perspective, it appears that p-values come up short, error statistical testers (and even some tribes of Bayesians) balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null — or even evidence for it!
The conflict between p-values and Bayesian posteriors typically considers the two sided test of the Normal mean, H0: μ = μ0 versus H1: μ ≠ μ0.
“If n = 50 one can classically ‘reject H0 at significance level p = .05,’ although Pr (H0|x) = .52 (which would actually indicate that the evidence favors H0).” (Berger and Sellke, 1987, p. 113).
If n = 1000, a result statistically significant at the .05 level leads to a posterior to the null of .82!
Some find the example shows the p-value “overstates evidence against a null” because it claims to use an “impartial” or “uninformative” Bayesian prior probability assignment of .5 to H0, the remaining .5 being spread out over the alternative parameter space. (“Spike and slab” I’ve heard Gelman call this, derisively.) Others demonstrate that the problem is not p-values but the high prior.
Casella and R. Berger (1987) show that “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (p. 111) whether in 1 or 2-sided tests. Note, too, the conflict with confidence interval reasoning since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). Many complain the “spiked concentration of belief in the null” is at odds with the view that “we know all nulls are false” (even though that view is also false.) See Senn’s interesting points on this same issue in his letter (to Goodman) here.
But often, as in the opening joke, the prior assignment is claimed to be keeping to the frequentist camp and to frequentist error probabilities. How’s that supposed to work? It is imagined that we sample randomly from a population of hypotheses, k% of which are assumed to be true. 50% is a common number. We randomly draw a hypothesis and get this particular one, maybe it concerns the mean deflection of light, or perhaps it is an assertion of bioequivalence of two drugs or whatever. The percentage “initially true” (in this urn of nulls) serves as the prior probability for your particular H0. I see this gambit in statistics, psychology, philosophy and elsewhere, and yet it commits a fallacious instantiation of probabilities:
50% of the null hypotheses in a given pool of nulls are true.
This particular null H0 was randomly selected from this urn (some may wish to add “nothing else is known” which would scarcely be true here).
Therefore P(H0 is true) = .5.
I discussed this 20 years ago, Mayo 1997a and b (links in the references) and ever since. However, statistical fallacies repeatedly return to fashion in slightly different guises. Nowadays, you’re most likely to see it within what may be called diagnostic screening models of tests.
It’s not that you can’t play a carnival game of reaching into an urn of nulls (and there are lots of choices for what to put in the urn), and use a Bernoulli model for the chance of drawing a true hypothesis (assuming we even knew the % of true hypotheses, which we do not), but the “event of drawing a true null” is no longer the particular hypothesis one aims to use in computing the probability of data x0 under hypothesis H0. In other words, it’s no longer the H0 needed for the likelihood portion of the frequentist computation. (Note, too, the selected null would get the benefit of being selected from an urn of nulls where few have been shown false yet: “innocence by association”. See my comment on J. Berger 2003, pp. 19-24.)
In any event, .5 is not the frequentist probability that the selected null H0 is true–in those cases where a frequentist prior exists. (I first discussed the nature of legitimate frequentist priors with Erich Lehmann; see the poem he wrote for me as a result in Mayo 1997a).
The diagnostic screening model of tests. The diagnostic screening model of tests has become increasingly commonplace, thanks to Big Data, perverse incentives, nonreplication and all the rest (Ioannidis 2005). As Taleb puts it:
“With big data, researchers have brought cherry-picking to an industrial level”.
Now the diagnostic screening model is apt for various goals–diagnostic screening (for disease) most obviously, but also for TSA bag checks, high throughput studies in genetics and other contexts where the concern is controlling the noise in the network rather than appraising the well-testedness of your research claim. Dichotomies are fine for diagnostics (disease or not, worth further study or not, dangerous bag or not) Forcing scientific inference into a binary basket is what most of us wish to move away from, yet the new screening model dichotomizes results into significant/non-significant, usually at the .05 level. One shouldn’t mix the notions of prevalence, positive predictive value, negative predictive value, etc. from screening with the concepts from statistical testing in science. Yet people do, and there are at least 2 tragicomic results: One is that error probabilities are blamed for disagreeing with measures of completely different things. One journal editor claims the fact that p-values differ from posteriors proves the “invalidity” of p-values.
The second tragicomic result is that inconsistent meanings of type 1 (and 2) error probabilities have found their way into the latest reforms, and into guidebooks for how to avoid inconsistent interpretations of statistical concepts. Whereas there’s a trade-off between type 1 error and type 2 error probabilities in Neyman-Pearson style hypotheses tests, this is no longer true when a type 1 error probability is defined as the posterior of H0 conditional on rejecting. Topsy turvy claims about power readily ensure (search this blog under power for numerous examples).
Conventional Bayesian variant. J Berger doesn’t really imagine selecting from an urn of nulls (he claims). Instead, spiked priors come from one of the systems of default or conventional priors. Curiously, he claims that by adopting his recommended conventional priors, frequentists can become more frequentist (than using flawed error probabilities). We get what he calls conditional p-values (or conditional error probabilities). Magician that he is, the result is that frequentist error probabilities are no longer error probabilities, or even frequentist!
How it happens is not entirely clear, but it’s based on his defining a “Frequentist Principle” that demands that a type 1 (or 2) error probability yield the same number as his conventional posterior probability. (See Berger 2003, and my comment in Mayo 2003).
Senn, in a guest post remarks:
The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives.
It is as if in giving recommendations in dosing children one abandoned a formula based on age and adopted one based on weight but insisted on using the same number of kg one had used for years.
Urn of Nulls. Others appear to be serious about the urn of nulls metaphor (e.g., Colquhoun 2014) Say 50% of the nulls in the urn are imagined to be true. Then, when you select your null, its initial probability of truth is .5. This however is to commit the fallacy of probabilistic instantiation.
Two moves are made: (1) it’s admitted it’s an erroneous probabilistic instantiation, but the goal is said to be assessing “science wise error rates” as in a diagnostic screening context. A second move (2) is to claim that a high positive predictive value PPV from the diagnostic model warrants high “epistemic probability”–whatever that is– to the particular case at hand.
The upshot of both are at odds with the goal of restoring scientific integrity. Even if we were to grant these “prevalence rates” (to allude to diagnostic testing), my question is: Why would it be relevant to how good a job you did in testing your particular hypothesis, call it H*? Sciences with high “crud factors” (Meehl 1990) might well get a high PPV simply because of nearly all its nulls being false. This still wouldn’t be evidence of replication ability, nor of understanding of the phenomenon. It would reward non-challenging thinking, and taking the easiest way out.
Safe Science. We hear it recommended that research focus on questions and hypotheses with high prior prevalence. Of course we’d never know the % of true nulls (many say all nulls are false, although that too is false) and we could cleverly game the description to have suitably high or low prior prevalence. Just think of how many ways you could describe those urns of nulls to get a desired PPV, especially on continuous parameters. Then there’s the heralding of safe science:
Large-scale evidence should be targeted for research questions where the pre-study probability is already considerably high, so that a significant research finding will lead to a post-test probability that would be considered quite definitive (Ioannidis, 2005, p. 0700).
The diagnostic model, in effect, says keep doing what you’re doing: publish after an isolated significant result, possibly with cherry-picking and selection effects to boot, just make sure there’s high enough prior prevalence. That preregistration often makes previous significant results vanish shows the problem isn’t the statistical method but its abuse. Ioannidis has done much to expose bad methods, but not with the diagnostic model he earlier popularized.
In every case of a major advance or frontier science that I can think of, there had been little success in adequately explaining some effect–low prior prevalence. It took Prusiner 10 years of failed experiments to finally transmit the prion for mad cow to chimps. People didn’t believe there could be infection without nucleic acid (some still adhere to the “protein only” hypothesis.) He finally won a Nobel Prize, but he would have had a lot less torture if he’d just gone along to get along, keep to the central dogma of biology rather than follow the results that upended it. However, it’s the researcher who has worked with a given problem, building on results and subjecting them to scrutiny, who understands the phenomenon well enough to not just replicate, but alter the entire process in new ways (e.g., prions are now being linked to Alzheimer’s).
Researchers who have churned out and published isolated significant results, and focused on “research questions where the where the pre-study probability is already considerably high” might meet the quota on PPV, but still won’t have the understanding to even show they “know how to conduct an experiment which will rarely fail to give us a statistically significant result”–which was Fisher’s requirement before inferring a genuine phenomenon (Fisher 1947).
Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than P-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities. (Senn)
References & Related articles
Berger, J. O. (2003). “Could Fisher, Jeffreys and Neyman have Agreed on Testing?” Statistical Science 18: 1-12.
Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.
Cassella G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82 106–111, 123–139.
Colquhoun, D. (2014) “An investigation of the false discovery rate and the misinterpretation of p-values.” Royal Society Open Science, 2014 1(3): pp. 1-16.
Fisher, R. A., (1956). Statistical Methods and Scientific Inference, Edinburgh: Oliver and Boyd.
Fisher, R.A. (1947), Design of Experiments.
Ioannidis, J. (2005). “Why Most Published Research Findings Are False”.
Jeffreys, (1939). Theory of Probability, Oxford: Oxford University Press.
Mayo, D. (1997a). “Duhem’s Problem, The Bayesian Way, and Error Statistics, or ‘What’s Belief got To Do With It?'” and “Response to Howson and Laudan,” Philosophy of Science 64(1): 222-244 and 323-333. NOTE: This issue only comes up in the “Response”, but it made most sense to include both here.
Mayo, D. (1997b) “Error Statistics and Learning from Error: Making a Virtue of Necessity,” in L. Darden (ed.) Supplemental Issue PSA 1996: Symposia Papers, Philosophy of Science 64: S195-S212.
Mayo, D. (2003). Comment on J. O. Berger’s “Could Fisher,Jeffreys and Neyman Have Agreed on Testing?”, Statistical Science18, 19-24.
Mayo, D. (2004). “An Error-Statistical Philosophy of Evidence,” in M. Taper and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press: 79-118.
Mayo (2005). “Philosophy of Statistics” in S. Sarkar and J. Pfeifer (eds.) Philosophy of Science: An Encyclopedia, London: Routledge: 802-815. (Has typos.)
Mayo, D.G. and Cox, D. R. (2006). “Frequentists Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.
Mayo, D. and Kruse, M. (2001). “Principles of Inference and Their Consequences,” in D. Cornfield and J. Williamson (eds.) Foundations of Bayesianism. Dordrecht: Kluwer Academic Publishes: 381-403.
Mayo, D. and Spanos, A. (2011). “Error Statistics” in Philosophy of Statistics , Handbook of Philosophy of Science Volume 7 Philosophy of Statistics, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports 66 (1): 195-244.
Pratt, J. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence: Comment.” J. Amer. Statist. Assoc. 82: 123-125.
Prusiner, S. (1991). Molecular Biology of Prion Diseases. Science, 252(5012), 1515-1522.
Prusiner, S. B. (2014) Madness and Memory: The Discovery of Prions—a New Biological Principle of Disease, New Haven, Connecticut: Yale University Press.
Spanos, A. (2013). “Who Should Be Afraid of the Jeffreys-Lindley Paradox”.
Taleb, N. (2013). “Beware the Big Errors of Big Data”. Wired.