TragiComedy hour: P-values vs posterior probabilities vs diagnostic error rates

Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?

Critic: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true!

Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!

Raucous laughter ensues!

(Hah, hah… “So funny, I forgot to laugh!” Or, I’m crying and laughing at the same time!)

The frequentist tester should retort:

Frequentist Tester: But you assume 50% of the null hypotheses are true, compute P(H0|x) using P(H0) = .5, imagine the null is rejected based on a single small p-value, and then blame the p-value for disagreeing with the result of your computation!

At times you even use α and power as likelihoods in your analysis! Such analyses violate both Fisherian and Neyman-Pearson testing.

It is well known that for a fixed p-value, with a sufficiently large n, even a statistically significant result can correspond to a large posterior probability on H0. This Jeffreys-Lindley “disagreement” is considered problematic for Bayes ratios (e.g., Bernardo). It is not problematic for error statisticians. We always indicate the extent of discrepancy that is and is not indicated, and avoid making mountains out of molehills (see Spanos 2013). J. Berger and Sellke (1987) attempt to generalize the result to show the “exaggeration” even without large n. From their Bayesian perspective, it appears that p-values come up short; error statistical testers (and even some tribes of Bayesians) balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null, or even evidence for it!

The conflict between p-values and Bayesian posteriors typically considers the two sided test of the Normal mean, H0: μ = μ0 versus H1: μ ≠ μ0.

“If n = 50 one can classically ‘reject H0 at significance level p = .05,’ although Pr (H0|x) = .52 (which would actually indicate that the evidence favors H0).” (Berger and Sellke, 1987, p. 113).

If n = 1000, a result statistically significant at the .05 level leads to a posterior probability on the null of .82!
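The two posteriors just quoted can be reproduced with a few lines of arithmetic. What follows is only a sketch under assumptions that are not spelled out in this post: a spike of prior_null on H0, and the remaining mass spread over the alternative as a Normal "slab" centered at μ0 with the same σ as the data, which is how I read the Berger and Sellke setup. The function name posterior_null and its parameters are mine, chosen for illustration.

```python
import math

def posterior_null(t, n, prior_null=0.5):
    """Posterior probability of a point null H0: mu = mu0.

    Assumed setup (an illustration, not taken from the post itself):
    x-bar ~ N(mu, sigma^2 / n), prior mass prior_null on H0, and a
    N(mu0, sigma^2) prior on mu under H1. Here t = sqrt(n)|x-bar - mu0|/sigma.
    """
    # Likelihood ratio of H1 (averaged over the slab) to H0.
    ratio_h1_to_h0 = math.exp(t ** 2 / (2.0 * (1.0 + 1.0 / n))) / math.sqrt(1.0 + n)
    # Posterior odds against H0 = prior odds against H0 times that ratio.
    odds_against_null = (1.0 - prior_null) / prior_null * ratio_h1_to_h0
    return 1.0 / (1.0 + odds_against_null)

# t = 1.96 is the two-sided .05 cutoff.
for n in (50, 1000):
    print(n, f"{posterior_null(1.96, n):.2f}")
```

Under these assumptions the sketch gives about .52 for n = 50 and .82 for n = 1000. Changing prior_null shows how much of the result is carried by the spike rather than by the data.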

Table 1 (modified) from J.O. Berger and T. Sellke (1987) “Testing a Point Null Hypothesis,” JASA 82(397) : 113.







Some find the example shows the p-value “overstates evidence against a null” because the analysis claims to use an “impartial” or “uninformative” Bayesian prior probability assignment of .5 to H0, the remaining .5 being spread out over the alternative parameter space. (“Spike and slab” I’ve heard Gelman call this, derisively.) Others demonstrate that the problem is not p-values but the high prior.

Casella and R. Berger (1987) show that “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (p. 111) whether in 1 or 2-sided tests. Note, too, the conflict with confidence interval reasoning since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). Many complain the “spiked concentration of belief in the null” is at odds with the view that “we know all nulls are false” (even though that view is also false). See Senn’s interesting points on this same issue in his letter (to Goodman).

But often, as in the opening joke, the prior assignment is claimed to be keeping to the frequentist camp and to frequentist error probabilities. How’s that supposed to work? It is imagined that we sample randomly from a population of hypotheses, k% of which are assumed to be true. 50% is a common number. We randomly draw a hypothesis and get this particular one, maybe it concerns the mean deflection of light, or perhaps it is an assertion of bioequivalence of two drugs or whatever. The percentage “initially true” (in this urn of nulls) serves as the prior probability for your particular H0. I see this gambit in statistics, psychology, philosophy and elsewhere, and yet it commits a fallacious instantiation of probabilities:

50% of the null hypotheses in a given pool of nulls are true.

This particular null H0 was randomly selected from this urn (some may wish to add “nothing else is known” which would scarcely be true here).

Therefore P(H0 is true) = .5.

I discussed this 20 years ago (Mayo 1997a and b; links in the references) and ever since. However, statistical fallacies repeatedly return to fashion in slightly different guises. Nowadays, you’re most likely to see it within what may be called diagnostic screening models of tests.

It’s not that you can’t play a carnival game of reaching into an urn of nulls (and there are lots of choices for what to put in the urn), and use a Bernoulli model for the chance of drawing a true hypothesis (assuming we even knew the % of true hypotheses, which we do not), but the “event of drawing a true null” is no longer the particular hypothesis one aims to use in computing the probability of data x0 under hypothesis H0. In other words, it’s no longer the H0 needed for the likelihood portion of the frequentist computation. (Note, too, the selected null would get the benefit of being selected from an urn of nulls where few have been shown false yet: “innocence by association”. See my comment on J. Berger 2003, pp. 19-24.)

In any event, .5 is not the frequentist probability that the selected null H0 is true–in those cases where a frequentist prior exists. (I first discussed the nature of legitimate frequentist priors with Erich Lehmann; see the poem he wrote for me as a result in Mayo 1997a).
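For readers who want to play the carnival game themselves, here is a rough sketch of the kind of simulation the critic in the opening joke has in mind. Everything in it is an assumption made for illustration: half the nulls in the urn are stipulated to be true, every false null has the same standardized shift (2.8, giving power of roughly .8 at the .05 level), and "p-values of .05" is read as p falling between .04 and .05. The function names are mine.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard Normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def share_true_nulls_near_05(n_tests=200_000, prior_true=0.5, shift=2.8,
                             window=(0.04, 0.05), seed=1):
    """Among simulated tests whose p-value lands in `window`,
    return the proportion drawn from the 'true null' part of the urn."""
    rng = random.Random(seed)
    selected = 0
    selected_true = 0
    for _ in range(n_tests):
        null_is_true = rng.random() < prior_true   # draw a hypothesis from the urn
        mean = 0.0 if null_is_true else shift      # stipulated effect for false nulls
        z = rng.gauss(mean, 1.0)                   # test statistic for this draw
        if window[0] <= two_sided_p(z) <= window[1]:
            selected += 1
            selected_true += null_is_true
    return selected_true / selected

print(share_true_nulls_near_05())
```

The number printed is a relative frequency in the simulated urn. It moves around as soon as prior_true, shift, or window is changed, and nothing in the computation refers to the particular H0 whose data are being probed.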

The diagnostic screening model of tests. The diagnostic screening model of tests has become increasingly commonplace, thanks to Big Data, perverse incentives, nonreplication and all the rest (Ioannidis 2005). As Taleb puts it:

“With big data, researchers have brought cherry-picking to an industrial level”.

Now the diagnostic screening model is apt for various goals–diagnostic screening (for disease) most obviously, but also for TSA bag checks, high throughput studies in genetics and other contexts where the concern is controlling the noise in the network rather than appraising the well-testedness of your research claim. Dichotomies are fine for diagnostics (disease or not, worth further study or not, dangerous bag or not). Forcing scientific inference into a binary basket is what most of us wish to move away from, yet the new screening model dichotomizes results into significant/non-significant, usually at the .05 level. One shouldn’t mix the notions of prevalence, positive predictive value, negative predictive value, etc. from screening with the concepts from statistical testing in science. Yet people do, and there are at least 2 tragicomic results: One is that error probabilities are blamed for disagreeing with measures of completely different things. One journal editor claims the fact that p-values differ from posteriors proves the “invalidity” of p-values.

The second tragicomic result is that inconsistent meanings of type 1 (and 2) error probabilities have found their way into the latest reforms, and into guidebooks for how to avoid inconsistent interpretations of statistical concepts. Whereas there’s a trade-off between type 1 error and type 2 error probabilities in Neyman-Pearson style hypothesis tests, this is no longer true when a type 1 error probability is defined as the posterior of H0 conditional on rejecting. Topsy-turvy claims about power readily ensue (search this blog under power for numerous examples).
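A small sketch shows why the trade-off disappears under the redefinition. It assumes the usual screening arithmetic: a stipulated proportion prior_null of true nulls, rejection with probability α when the null is true and with probability equal to the power when it is false. The function name and parameters are mine, not anything from the reform guidebooks.

```python
def prob_null_given_rejection(alpha, power, prior_null=0.5):
    """Screening-style 'type 1 error': P(H0 true | H0 rejected), by Bayes' rule."""
    false_alarms = prior_null * alpha            # true nulls that get rejected
    true_hits = (1.0 - prior_null) * power       # false nulls that get rejected
    return false_alarms / (false_alarms + true_hits)

# Hold alpha at .05 and raise the power (i.e., lower the type 2 error probability).
for power in (0.2, 0.5, 0.8, 0.95):
    print(power, round(prob_null_given_rejection(0.05, power), 3))
```

With α held fixed, the redefined "type 1 error" shrinks as the type 2 error probability shrinks, so the two move in the same direction. The quantity α itself never changed, which is one way to see that two different things are being given the same name.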

Conventional Bayesian variant. J. Berger doesn’t really imagine selecting from an urn of nulls (he claims). Instead, spiked priors come from one of the systems of default or conventional priors. Curiously, he claims that by adopting his recommended conventional priors, frequentists can become more frequentist (than using flawed error probabilities). We get what he calls conditional p-values (or conditional error probabilities). Magician that he is, the result is that frequentist error probabilities are no longer error probabilities, or even frequentist!

How it happens is not entirely clear, but it’s based on his defining a “Frequentist Principle” that demands that a type 1 (or 2) error probability yield the same number as his conventional posterior probability. (See Berger 2003, and my comment in Mayo 2003).

Senn, in a guest post remarks:

The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives.

It is as if in giving recommendations in dosing children one abandoned a formula based on age and adopted one based on weight but insisted on using the same number of kg one had used for years.

Urn of Nulls. Others appear to be serious about the urn of nulls metaphor (e.g., Colquhoun 2014). Say 50% of the nulls in the urn are imagined to be true. Then, when you select your null, its initial probability of truth is .5. This, however, is to commit the fallacy of probabilistic instantiation.

Two moves are made: (1) it’s admitted it’s an erroneous probabilistic instantiation, but the goal is said to be assessing “science wise error rates” as in a diagnostic screening context. A second move (2) is to claim that a high positive predictive value PPV from the diagnostic model warrants high “epistemic probability”–whatever that is– to the particular case at hand.

The upshot of both is at odds with the goal of restoring scientific integrity. Even if we were to grant these “prevalence rates” (to allude to diagnostic testing), my question is: Why would it be relevant to how good a job you did in testing your particular hypothesis, call it H*? Sciences with high “crud factors” (Meehl 1990) might well get a high PPV simply because nearly all their nulls are false. This still wouldn’t be evidence of replication ability, nor of understanding of the phenomenon. It would reward non-challenging thinking, and taking the easiest way out.
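To make the crud factor point concrete, here is a minimal sketch of the PPV computation from the diagnostic model. The prevalence figures, the power values, and the name ppv are all hypothetical, picked only to illustrate the arithmetic.

```python
def ppv(alpha, power, prevalence_false_nulls):
    """Positive predictive value: P(null is false | significant result)."""
    true_positives = prevalence_false_nulls * power
    false_positives = (1.0 - prevalence_false_nulls) * alpha
    return true_positives / (true_positives + false_positives)

# A field full of "crud" (90% of nulls false) running weak tests (power .2)...
crud_field = ppv(alpha=0.05, power=0.2, prevalence_false_nulls=0.9)
# ...versus a field with 50% false nulls running well-powered tests (power .8).
careful_field = ppv(alpha=0.05, power=0.8, prevalence_false_nulls=0.5)

print(round(crud_field, 3), round(careful_field, 3))
```

In this toy comparison the poorly powered tests in the crud-heavy field come out with the higher PPV. The number tracks what was stipulated about the urn, and says nothing about how well any particular H* was probed.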

Safe Science. We hear it recommended that research focus on questions and hypotheses with high prior prevalence. Of course we’d never know the % of true nulls (many say all nulls are false, although that too is false) and we could cleverly game the description to have suitably high or low prior prevalence. Just think of how many ways you could describe those urns of nulls to get a desired PPV, especially on continuous parameters. Then there’s the heralding of safe science:

Large-scale evidence should be targeted for research questions where the pre-study probability is already considerably high, so that a significant research finding will lead to a post-test probability that would be considered quite definitive (Ioannidis, 2005, p. 0700).

The diagnostic model, in effect, says keep doing what you’re doing: publish after an isolated significant result, possibly with cherry-picking and selection effects to boot, just make sure there’s high enough prior prevalence. That preregistration often makes previous significant results vanish shows the problem isn’t the statistical method but its abuse. Ioannidis has done much to expose bad methods, but not with the diagnostic model he earlier popularized.

In every case of a major advance or frontier science that I can think of, there had been little success in adequately explaining some effect–low prior prevalence. It took Prusiner 10 years of failed experiments to finally transmit the prion for mad cow to chimps. People didn’t believe there could be infection without nucleic acid (some still adhere to the “protein only” hypothesis). He finally won a Nobel Prize, but he would have had a lot less torture if he’d just gone along to get along, and kept to the central dogma of biology rather than following the results that upended it. However, it’s the researcher who has worked with a given problem, building on results and subjecting them to scrutiny, who understands the phenomenon well enough to not just replicate, but alter the entire process in new ways (e.g., prions are now being linked to Alzheimer’s).

Researchers who have churned out and published isolated significant results, and focused on “research questions where the pre-study probability is already considerably high” might meet the quota on PPV, but still won’t have the understanding to even show they “know how to conduct an experiment which will rarely fail to give us a statistically significant result”–which was Fisher’s requirement before inferring a genuine phenomenon (Fisher 1947).

Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than P-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities. (Senn)

References & Related articles

Berger, J. O.  (2003). “Could Fisher, Jeffreys and Neyman have Agreed on Testing?” Statistical Science 18: 1-12.

Berger, J. O. and Sellke, T.  (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Casella, G. and Berger, R.  (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.

Colquhoun, D. (2014) “An investigation of the false discovery rate and the misinterpretation of p-values.” Royal Society Open Science, 2014 1(3): pp. 1-16.

Fisher, R. A., (1956). Statistical Methods and Scientific Inference, Edinburgh: Oliver and Boyd.

Fisher, R.A. (1947), Design of Experiments.

Ioannidis, J. (2005). “Why Most Published Research Findings Are False”.

Jeffreys, H. (1939). Theory of Probability, Oxford: Oxford University Press.

Mayo, D. (1997a). “Duhem’s Problem, The Bayesian Way, and Error Statistics, or ‘What’s Belief got To Do With It?’” and “Response to Howson and Laudan,” Philosophy of Science 64(1): 222-244 and 323-333. NOTE: This issue only comes up in the “Response”, but it made most sense to include both here.

Mayo, D. (1997b) “Error Statistics and Learning from Error: Making a Virtue of Necessity,” in L. Darden (ed.) Supplemental Issue PSA 1996: Symposia Papers, Philosophy of Science 64: S195-S212.

Mayo, D. (2003). Comment on J. O. Berger’s “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” Statistical Science 18: 19-24.

Mayo, D. (2004). “An Error-Statistical Philosophy of Evidence,” in M. Taper and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press: 79-118.

Mayo (2005). “Philosophy of Statistics” in S. Sarkar and J. Pfeifer (eds.) Philosophy of Science: An Encyclopedia, London: Routledge: 802-815. (Has typos.)

Mayo, D.G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. and Kruse, M. (2001). “Principles of Inference and Their Consequences,” in D. Corfield and J. Williamson (eds.) Foundations of Bayesianism. Dordrecht: Kluwer Academic Publishers: 381-403.

Mayo, D. and Spanos, A. (2011). “Error Statistics” in Philosophy of Statistics , Handbook of Philosophy of Science Volume 7 Philosophy of Statistics, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.

Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports 66 (1): 195-244.

Pratt, J. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence: Comment.” J. Amer. Statist. Assoc. 82: 123-125.

Prusiner, S. (1991). Molecular Biology of Prion Diseases. Science, 252(5012), 1515-1522.

Prusiner, S. B. (2014) Madness and Memory: The Discovery of Prions—a New Biological Principle of Disease, New Haven, Connecticut: Yale University Press.

Spanos, A. (2013). “Who Should Be Afraid of the Jeffreys-Lindley Paradox”.

Taleb, N. (2013). “Beware the Big Errors of Big Data”. Wired.


Categories: Bayesian/frequentist, Comedy, significance tests, Statistics

9 thoughts on “TragiComedy hour: P-values vs posterior probabilities vs diagnostic error rates”

  1. Tom Passin

    It’s surprising to me how often in discussions like this- how to change p values into things they are not, for example – very basic things get overlooked time after time. One example relating to p-values is this: they are basically calculated using a Z value ( e.g., Sample Mean / sample std. dev., for an assumed mean of 0). What you don’t see brought up is the fact that the Z value itself is a statistic with its own variance. Instead of going to tables of t or F values, one can ask what this variance is. The square root of the variance will be Z sqrt(Var(M) + Var(s)), where M is the sample mean and s is the sample standard deviation.

    For a normal distribution of standard deviation = 1.0, and a sample size of 10, the square root of Var(s) is about 0.31 and it scales as 1/sqrt(sample size) as you might expect. (I recently ran a bunch of simulations to generate these values, just for fun). Therefore, especially for small sample sizes, you have a much larger band of variation for the Z value than you might have thought, and so your p values may very well be rather smaller than you had thought. But how often do you see this effect taken into account?

    Another elementary but overlooked point applies to Bayesian approaches. If you want to make use of a prior distribution, then when you come to make inferences you need to factor in the probability of your prior. But this is never done, either. You do sometimes see sensitivity studies, but if your results don’t change much when you change the prior, then what value is it? And if they do, let’s see some severe testing of the validity of using that prior.

    • Tom: On the last paragraph, you mean factor in the probability that the prior is correct? My guess is that Bayesians will tell you they can always get another hierarchy. It was Fisher who said, as do you, that if the priors don’t influence things much, then you might ask what they are doing there in the first place?

      • Tom Passin

        Yes, that’s just what I meant. After all, a prior probability distribution is, deep down, an assumption that needs to be validated. Otherwise it’s just hand-waving.

        Oh, and I picked off the standard deviation of the sample std. dev. for the wrong sample size: for n=10, the simulation value came out 0.22. The value I first wrote was for n=5.

        • Tom: It’s rare for people to try to validate their priors. One of the big problems in doing so is identifying what the prior means: is it a degree of belief about the prior distribution, about frequencies, a regularization device, or an undefined measure for purposes of getting a posterior?

        • Tom: Gelman, I’m guessing, will say that it’s no different from a model assumption such as iid. The difficulty some of us have is not knowing what we’re checking when it comes to priors and what counts as it being falsified. Further, the tests for assumptions have a methodology with various checks of their own, and may be satisfied with appropriate designs in experimental settings. I understand the idea that the priors have predictions too, for a posterior distribution from which we can imagine sampling, but the properties of the method need to be made clearer, if possible.

  2. Aljaz

    Great post. I’d still say that medical testing provides a very useful case for thinking about inference. First, it provides an intuitive introduction to your severity concept (by considering tests with different False Alarm rates). Second, it makes it clear that it’s not just about the long run error rate (even if the next test I do will be the very last test I ever do, I’ll still pick the one with False Alarm rate of 1% over the one with a 10% False Alarm rate!). Finally, while it shows that considering the prior (base rate) can be crucial it also makes you reflect on what is a useful prior. In the case of medical testing – if we have reliable data about the base rate of the disease, it clearly is. But as you point out, the situation in science is very different.

    I wish your work was more well-known among psychologists (I found out about it only recently). It provides a compelling philosophical basis for classical tests and it gets rid of the unappealing characteristics of NHST, e.g. the dichotomic thinking.

    • Aljaz: Thanks much. Yes, the problem is mixing the diagnostic screening setting with science, and whatever one thinks of the importance of also doing screening for journals or retraction rates or the like in science, the problem is that some have willfully encouraged claiming that the correct type 1 error probability is their favored posterior. So now we have guidelines that define type 1 error probability in inconsistent ways. Hopefully when my new book is out soon, the basis for error statistical tests will be better appreciated.

    • john byrd

      “In the case of medical testing – if we have reliable data about the base rate of the disease, it clearly is. But as you point out, the situation in science is very different.” One big difference is the whole idea of a base rate of true hypotheses. How often can we know that? I think in some applied work we have a notion of it, as in some cases in medicine. I think there is some value in using the diagnostic testing indices in a counterfactual setup where we suppose the prevalence has a given value to see how the testing will be affected by varying levels of the prevalence in future applications. Likewise, it is important to consider the degree of difference in possible subjects that we consider to be “true negatives” because this factor will also have a big impact on the number of “false positives” we expect to see. This latter point seems to get little attention in the literature, especially in those simulations that set 50% of nulls as false. How false were they (p-value of 0.049 or 0.001?)?

  3. Huw Llewelyn (@HL327)

    The reference to ‘medical testing’ made me sit up! When I interpret a patient’s test result, I consider a number of issues: 1. Stochastic. 2. Methodological. 3. Diagnostic. So (1) How probable is it that, on repeating the test many times, it will still be in a similar range of interest? The answer can be affected by past test results that may negate or support the latest result in a manner analogous to how a meta-analysis or a Bayesian prior may influence the interpretation of a new study result. (2) Would the result have been different if any methodological errors were corrected (e.g. the specimen being spoiled due to delay in getting to the lab)? (3) If answers ‘1’ and ‘2’ suggest that the result is worth interpreting, what are the possible causes of the result and is the list virtually complete? If so, one can reason with other observations to probabilistically ‘eliminate’ or ‘refute’ all H2 and H3 and … Hn, so that (H1) then becomes probable. If the list is incomplete, then the conclusion will involve ‘abductive reasoning’ so that the conclusion will be ‘H1 or something else’ but at least not H2 or H3 or … Hn. H2 could represent a stochastic explanation for the test result and H3 a methodological error that explains the test result. In order to consider the differential diagnoses / hypotheses, both would need to be low.

    I model this reasoning with a theorem of probabilistic elimination based on the expanded form of Bayes rule, which does not require actual baseline priors conditional on a shared universal set, but only their ratios. These ratios can be worked out from conditional probabilities and their inverse probabilities, which can be estimated from simple observations. Probabilities in medicine are based on sparse data and seemed to be agreed by consensus. They form a ‘model’ which allows us to share our reasoning. However, at the end of the day we calibrate and update our shared subjective probabilities informally during case discussions, ward rounds, etc. Perhaps something similar could be done for testing scientific hypotheses. The calibration of the probability estimates would have to take place by checking some accessible surrogate marker outcome such as the frequency of replicating studies. This would only check the system of estimating probabilities and not the result of individual studies of course. My subjective probabilities are based on imagined populations e.g. what proportion would get better without treatment if we had 1000 patients identical to the person in front of us?

I welcome constructive comments for 14-21 days. If you wish to have a comment of yours removed during that time, send me an e-mail.
