Continued: "P-values overstate the evidence against the null": legit or fallacious?



2. J. Berger and Sellke and Casella and R. Berger

Of course it is well-known that for a fixed P-value, with a sufficiently large n, even a statistically significant result can correspond to a large posterior on H0 (Jeffreys-Good-Lindley paradox). I.J. Good (I don't know if he was the first) recommended decreasing the required P-value as n increases, and had a formula for it. A more satisfactory route is to ensure the interpretation takes account of the (obvious) fact that with a fixed P-value and increasing n, the test is more and more sensitive to smaller and smaller discrepancies–much as is done with lower/upper bounds of confidence intervals. For some rules of thumb see Section 5.
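Good's formula is usually reported as the "standardized" P-value p_std = min(1/2, p·√(n/100)), which rescales p to an effective sample size of 100. A minimal sketch, assuming that commonly reported form:

```python
from math import sqrt

def good_standardized_p(p, n):
    """I.J. Good's standardized P-value (as commonly reported):
    rescale an observed p to an effective sample size of 100,
    capped at 1/2."""
    return min(0.5, p * sqrt(n / 100))

# The same nominal p = .05 is discounted as n grows:
for n in (100, 1000, 10000):
    print(n, round(good_standardized_p(0.05, n), 3))
```

At n = 100 the standardized value equals the nominal p; by n = 10,000 the same p = .05 is capped at 1/2, reflecting the JGL worry that a fixed P-value means less as n increases.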

The JGL result is generalized in J. Berger and Sellke (1987). They bring out the conflict between P-values and Bayesian posteriors by considering the two-sided test of the Normal mean, H0: μ = μ0 versus H1: μ ≠ μ0.

“If n = 50…, one can classically ‘reject H0 at significance level p = .05,’ although Pr (H0|x) = .52 (which would actually indicate that the evidence favors H0).” (Berger and Sellke, 1987, p. 113).

If n = 1000, a result statistically significant at the .05 level takes the probability of the null from its prior of .5 to a posterior of .82!
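These posteriors can be reproduced from Berger and Sellke's setup: prior mass .5 on the point null, with the remaining .5 spread as a N(μ0, σ²) prior over the alternative, which yields the Bayes factor B01 = √(n+1)·exp(−z²n/(2(n+1))). A quick check (the N(μ0, σ²) spread is their conventional choice):

```python
from math import exp, sqrt

def posterior_H0(z, n, prior_H0=0.5):
    """Posterior probability of H0: mu = mu0, given z = sqrt(n)(xbar - mu0)/sigma,
    with prior mass prior_H0 on the point null and the rest spread as
    N(mu0, sigma^2) over the alternative (Berger and Sellke's setup)."""
    b01 = sqrt(n + 1) * exp(-z**2 * n / (2 * (n + 1)))  # Bayes factor for H0
    odds = (prior_H0 / (1 - prior_H0)) * b01
    return odds / (1 + odds)

# A result just significant at the two-sided .05 level (z = 1.96):
print(round(posterior_H0(1.96, 50), 2))    # 0.52, as quoted above
print(round(posterior_H0(1.96, 1000), 2))  # 0.82
```

The same z = 1.96 that "rejects at the .05 level" leaves the posterior on H0 above one half at n = 50, and drives it to .82 at n = 1000.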

While from their Bayesian perspective, this appears to supply grounds for denying P-values are adequate for assessing evidence, significance testers rightly balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null–or even evidence for it!

From J. Berger and T. Sellke (1987), "Testing a Point Null Hypothesis," JASA 82(397): 113.

Many think this shows that the P-value ‘overstates evidence against a null’ because it claims to use an ‘impartial’ Bayesian prior probability assignment of .5 to H0, the remaining .5 spread out over the alternative parameter space. (But see the justification Berger and Sellke give in Section 3. A Dialogue.) Casella and R. Berger (1987) charge that the problem is not P-values but the high prior, and that “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (p. 111), whether in 1 or 2-sided tests. Note, too, the conflict with confidence interval reasoning, since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). See Senn’s very interesting points on this same issue in his letter (to Goodman) here.


3. A Dialogue (ending with a little curiosity in J. Berger and Sellke):

So a guy is fishing in Lake Elba, and a representative from the EPA (Elba Protection Association) points to notices stating that mean toxin levels in fish were found to exceed the permissible mean concentration, set at 0.

EPA Rep: We’ve conducted two studies (each with a random sample of 100 fish) showing statistically significant concentrations of toxin, at low P-values, e.g., .02.

P-Value denier: I deny you’ve shown evidence of high mean toxin levels; P-values exaggerate the evidence against the null.

EPA Rep: Why is that?

P-value denier: If I update the prior of .5 that I give to the null hypothesis (asserting toxin levels are of no concern), my posterior for H0 is still not all that low, not as low as .05 for sure.

EPA Rep: Why do you assign such a high prior probability to H0?

P-value denier: If I gave H0 a value lower than .5, then, if there’s evidence to reject H0, at most I would be claiming an improbable hypothesis has become more improbable. Who would be convinced by the statement ‘I conducted a Bayesian test of H0, assigning prior probability .1 to H0, and my conclusion is that H0 has posterior probability .05 and should be rejected’?

The last sentence is a direct quote from Berger and Sellke!

There’s something curious in assigning a high prior to the null H0–thereby making it harder to reject (or find evidence against) H0–and then justifying the assignment by saying it ensures that, if you do reject H0, there will be a meaningful drop in the probability of H0. What do you think of this?


4. The real puzzle.

I agree with J. Berger and Sellke that we should not “force agreement”. What’s puzzling to me is why it would be thought that an account that manages to evaluate how well or poorly tested hypotheses are–as significance tests can do–would want to measure up to an account that can only give a comparative assessment (be they likelihoods, odds ratios, or other) [ii]. From the perspective of the significance tester, the disagreements between (audited) P-values and posterior probabilities are an indictment, not of the P-value, but of the posterior, as well as the Bayes ratio leading to the disagreement (as even one or two Bayesians appear to be coming around to realize, e.g., Bernardo 2011, 58-9). Casella and R. Berger show that for sensible priors with one-sided tests, the P-value can be “reconciled” with the posterior, thereby giving an excellent retort to J. Berger and Sellke.

Personally, I don’t see why an error statistician would wish to construe the P-value as how “believe worthy” or “bet worthy” statistical hypotheses are. Changing the interpretation may satisfy J. Berger’s call for “an agreement on numbers” (and never mind philosophies), but doing so precludes the proper functioning of P-values, confidence levels, and other error probabilities.

And “what is the intended interpretation of the prior, again?” you might ask. Aside from the subjective construals (of betting and belief, or the like), the main one on offer (from the conventionalist Bayesians) is that the prior is undefined and is simply a way to compute a posterior. Never mind that they don’t agree on which to use. Your question should be: “Please tell me: how does a posterior, based on an undefined prior used solely to compute a posterior, become ‘the’ measure of evidence that we should aim to match?”


5. (Crude) Benchmarks for taking into account sample size:

Throwing out a few numbers may give sufficient warning to those inclined to misinterpret statistically significant differences at a given level but with varying sample sizes (please also search this blog [iii]). Using the familiar example of Normal testing with T+ :

H0: μ ≤ 0 vs. H1: μ > 0.  

Let σ = 1, and write σM = σ/√n for the standard error of the mean.

For this exercise, fix the sample mean M to be just significant at the .025 level for a 1-sided test, and vary the sample size n. In one case, n = 100, in a second, n = 1600. So, for simplicity, using the 2-standard deviation cut-off:

M = 0 + 2(σ/√n).

With stat sig results from test T+, we worry about unwarranted inferences of form:  μ > 0 + γ.

Some benchmarks:

 * The lower bound of a 50% confidence interval is 2(σ/√n). So there’s quite lousy evidence that μ > 2(σ/√n) (the associated severity is .5).

 * The lower bound of the 93% confidence interval is .5(σ/√n). So there’s decent evidence that μ > .5(σ/√n) (the associated severity is .93).

 * For n = 100, σ/√n = .1 (σ = 1); for n = 1600, σ/√n = .025.

 * Therefore, a .025 stat sig result is fairly good evidence that μ > .05 when n = 100; whereas a .025 stat sig result is quite lousy evidence that μ > .05 when n = 1600.

You’re picking up smaller and smaller discrepancies as n increases, when P is kept fixed. Taking the indicated discrepancy into account avoids erroneous construals and scotches any “paradox”.
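The benchmarks come directly from the severity computation SEV(μ > γ) = Φ((M − γ)/(σ/√n)), with the observed mean M fixed at the 2-standard-error cutoff. A minimal sketch:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def severity(gamma, n, sigma=1.0):
    """Severity for inferring mu > gamma in test T+, with the observed
    mean fixed at the 2-standard-error cutoff M = 2*sigma/sqrt(n)."""
    se = sigma / sqrt(n)
    M = 2 * se
    return Phi((M - gamma) / se)

# A .025 stat sig result is decent evidence that mu > .05 when n = 100 ...
print(round(severity(0.05, 100), 2))   # ~.93
# ... but lousy evidence for the very same claim when n = 1600:
print(round(severity(0.05, 1600), 2))  # .5
```

The same P-value, the same inferred claim μ > .05, but sample size alone moves the severity from .93 to .5: keeping the indicated discrepancy in view is what scotches the “paradox”.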


6. “The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics” (Cousins, 2014)

Robert Cousins, a HEP physicist willing to talk to philosophers and from whom I am learning about statistics in the Higgs discovery, illuminates the key issues, models and problems in his paper with that title. (The reference to Bernardo 2011 that I had in mind in Section 4 is cited on p. 26 of Cousins 2014).


7. July 20, 2014: There is a distinct issue here….That “P-values overstate the evidence against the null” is often stated as an uncontroversial “given”. In calling it a “fallacy”, I was being provocative. However, in dubbing it a fallacy, some people assumed I was referring to one or another well-known fallacy, leading them to guess I was referring to the fallacy of confusing P(E|H) with P(H|E), what some call the “prosecutor’s fallacy”. I wasn’t. Nor are Berger and Sellke committing a simple blunder of transposing conditionals. If they were, Casella and Berger would scarcely have needed to write their reply to point this out. So how shall we state the basis for the familiar criticism that P-values overstate evidence against (a null)?  I take it that the criticism goes something like this:

The problem with using a P-value to assess evidence against a given null hypothesis H0 is that it tends to be smaller, even much smaller, than an apparently plausible posterior assessment of H0, given data x (especially as n increases).  The mismatch is avoided with a suitably tiny P-value, and that’s why many recommend this tactic. [iv] Yet I say the correct answer to the question in my (new) title is: “fallacious”. It’s one of those criticisms that are not thought through carefully, but simply repeated on the basis of some well-known articles.

[i] We assume the P-values are “audited”, that they are not merely “nominal”, but are “actual” P-values. Selection effects, cherry-picking and other biases would alter the error probing capacity of the tests, and thus the purported P-value would fail the audit.

[ii] Note too that the comparative assessment will vary depending on the “catchall”.

[iii] See for example:

Section 6.1 “fallacies of rejection”.
Slide #8 of Spanos lecture in our seminar Phil 6334.

 [iv] So we can also put aside for the moment the issue of P-values not being conditional probabilities to begin with. We can also (I hope) distinguish another related issue, which requires a distinct post: using ratios of frequentist error probabilities, e.g., type 1 errors and power, to form a kind of “likelihood ratio” in a screening computation.


References (minimalist). A number of additional links are given in the comments to my previous post.

Berger, J. O. and Sellke, T.  (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.

Blog posts:

Comedy Hour at the Bayesian Retreat: P-values versus Posteriors.
Highly probable vs highly probed: Bayesian/ error statistical differences.


Categories: Bayesian/frequentist, CIs and tests, fallacy of rejection, highly probable vs highly probed, P-values, Statistics


39 thoughts on “Continued: ‘P-values overstate the evidence against the null’: legit or fallacious?”

  1. To respond to the last comment from the previous post, someone named vl claimed: “When most non-statisticians think about frequentist guarantees what they really care about is given a decision rule (for what to believe), how often is my decision correct, how often is it wrong?”
    Perhaps this kind of “performance” goal is in sync with the way frequentist “decision rules” are often presented, unfortunately. I think that what most people care about, if they aren’t trying to frame their data problem in terms of some formal statistical technique, isn’t that at all. It’s more like, what have I learned from the data about the particular problem at hand? What would be warranted to infer, and what unwarranted?

    • vl

      @mayo Thanks for the comments and discussion. I do whole-heartedly agree that the typical presentations are off.

      regarding –

      “what have I learned from the data about the particular problem at hand? What would be warranted to infer, and what unwarranted?”

I would raise the critique often leveled at the notion of Bayesian probability as encoding plausibility: that Bayesian probabilities are ultimately not testable quantities (although the conclusions one draws from them are, but that’s beside the point).

      The same issue seems to arise if we’re writing down calculations and rules for “learning”/”inferring” but we aren’t willing to couple them to an observable frequency of being correct.

      If we’re unwilling to evaluate a decision rule in terms of “how often are we wrong?” then how is the “learning” rule anything more than a circular definition?

I think it would be difficult to convince most non-statisticians/non-philosophers that one should follow a rule for learning that offers no relationship to the (frequentist) probability of being correct or incorrect in the learning process.

Can we be said to be “learning” by a procedure if we can’t even establish a model-based relationship between the procedure and the frequency of correctness?

      • VL: There must be a testable relationship between inferred solutions to problems and approx correct solutions–true–but I don’t think that link is necessarily captured by a crude “how often are we wrong”. Basically, that can be both too easy and too irrelevant. When we mount a strong argument from coincidence, say, to the surprising phenomenon of prion “infection” occurring by an alteration of protein folding, rather than via a virus or bacteria (to allude to a case I’m reading about*), it’s not a matter of error rates. A strong argument from coincidence is one that has sufficiently probed errors in the case at hand. The fact that on average I do OK does not in and of itself warrant the given inference in the case at hand. Conversely, triangulating via numerous procedures that converge to indicate something is much stronger. It’s not so much the “improbability” of being wrong but the fact that you’d have to posit something like a Cartesian demon to explain away the consilience of results.

        • vl

          I understand that error rates are a crude measure and it’s not a _sufficient_ condition for choosing a procedure.

I also understand your argument that in practice the space of explanatory models (or hypotheses) is not circumscribed.

However, it seems that if a procedure is supposed to work in the general case in which the space of explanatory models is not circumscribed, then shouldn’t it be a necessary prerequisite that it work in the limiting special case of a circumscribed set of explanations? (which could be explored via simulation)

If there’s a better measure than error rates, I’m open to alternatives, but it seems like a reasonable starting point. If I tell a non-statistician/philosopher that, given the same information about the error model and space of explanations, I propose a procedure with uniformly higher error rates, I would have a hard time making the case that a) they should use the procedure and that b) it would be more reliable than alternative procedures in the general case where the space of hypotheses is not defined.

  2. Anonymous

    The confusion cropping up in regards to p-values and conditional probabilities was talked about on Larry Wasserman’s blog:

    • I’m glad the Normal Deviate’s posts are still out there; he does an excellent job illuminating the point.
      Pr(x;Ho) = k
just says: the probability of x, computed under Ho, equals k. That is, it’s an analytic claim about the probability distribution of X, stipulated by Ho. This requires Ho to be related to x in a particular way.

      I don’t know how to put subscripts in comments, and don’t have the patience to find out. Check out the link to see Wasserman’s notation.

  3. Christian Hennig

I think that a legitimate reason to say that “p-values sometimes overstate the evidence against the null” is that in many situations the formal null on which the computation of p-values is based is more restricted than what one could call the “interpretative null”. For example, if the formal null is N(0,\sigma^2), then all distributions that look approximately normal and are symmetric or approximately symmetric around something about equal to zero, and also some models violating independence between observations slightly, would often be interpreted in the same way (“this drug has no effect” etc.). The p-value derived from the nominal null may be smaller than what one would get from the “worst case” contained in the “interpretative null”.

    This, though, has absolutely nothing to do with the Bayesian arguments for such a claim that have been discussed here up to now. In this respect, the Bayesians are in no way in a better situation to measure evidence than the frequentists; they have the same problem.

    • Hi Christian: I get the gist of what I think you’re saying, but wondered if you can be more explicit about the worst case in the interpretive null.

      • Christian Hennig

I think one could define this formally as follows. The “interpretative null” I is a set of distributions (to be defined depending on the problem) that are considered “interpretationally equivalent to the null”. Take a test statistic T and consider (w.l.o.g.) a one-sided test that rejects when T is too large. Now consider the situation T = t. The p-value is P(T >= t) under H0. In principle (I don’t say that this can be done easily in practice) one could compute P(T >= t) for every distribution in I and look at the maximum value; this is what I meant by the “worst case”.
If the standard p-value is much smaller than this maximum, this could be interpreted as saying that “the p-value overstates the evidence against the interpretative null” (because there is a member of I that is much more compatible with the data than the nominal H0).
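One way to sketch this worst-case computation is by simulation. The particular members of I below (iid normal plus a slightly autocorrelated AR(1) process, both mean zero) are illustrative assumptions, not part of Christian’s comment:

```python
import math
import random

random.seed(1)

def tail_prob(t_obs, sampler, n, reps=20000):
    """Monte Carlo estimate of P(T >= t_obs), where T = sqrt(n) * sample mean."""
    hits = 0
    for _ in range(reps):
        xs = sampler(n)
        if math.sqrt(n) * sum(xs) / n >= t_obs:
            hits += 1
    return hits / reps

def iid_normal(n):
    # the nominal H0: iid N(0,1) observations
    return [random.gauss(0, 1) for _ in range(n)]

def ar1(n, rho=0.1):
    # a member of the "interpretative null" I: mean zero, but with
    # slight positive dependence (stationary AR(1), small rho)
    xs, prev = [], random.gauss(0, 1)
    for _ in range(n):
        prev = rho * prev + math.sqrt(1 - rho**2) * random.gauss(0, 1)
        xs.append(prev)
    return xs

n, t_obs = 30, 2.0
p_nominal = tail_prob(t_obs, iid_normal, n)
p_worst = max(p_nominal, tail_prob(t_obs, ar1, n))
# p_worst exceeds p_nominal: the nominal p-value is smaller than the
# worst case over I, so in this sense it can "overstate the evidence"
# against the interpretative null.
```

With slight dependence the variance of the standardized mean is inflated, so the tail probability at t = 2 is larger than the nominal .023; the worst case over I is what the interpretative null would warrant.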

        • Christian: But I thought the purpose of requiring a small p-value or small type 1 error is to take into account the “worst case”.
I’m vaguely remembering something in Fisher where he might be making your point to Neyman, and why he requires certain types of nulls, but I don’t know if this is what you have in mind.
          Incidentally, (not that I think this is relevant, but it might be) remember I’m requiring it be an “audited” p-value, not one that could be invalid due to, say, selection effects not being taken account of.

          • Christian Hennig

            Mayo: I tried to formalise (more or less without formulae!) here one point that pops up from time to time when discussing p-values, which has to do with model assumptions being potentially violated, but the issue can’t really be avoided by testing them (I’m mostly concerned with distributions that violate certain assumptions of the nominal H0 so slightly that they can hardly be told apart from the H0 with any power/severity). Not sure whether Fisher made such a point; it may be.

            If you say “requiring a small p-value” do you refer to “p<0.05", or do you refer to people who want to see a much smaller value than that (as they had in the Higgs boson case) for safety reasons?
            Standard borderlines such as 0.05 don't refer to a worst case in terms of slightly violated model assumptions, but people who have potential problems with such slight violations in mind may say that "we only really believe the H0 is disproved if p<0.001" or so, in order to be on the safe side. It would require a proof though to show that they really are.

            Talking about "worst cases" always requires that the set is specified over which a worst case is sought (called "I" by me above).

            • Christian: I had said that, for purposes of this post’s discussion, that the p-values checked out, i.e., passed an “audit”, so that we could focus on the inference questions. The criticism, e.g., by J. Berger and Sellke is not that the model assumptions do not hold. I may have misunderstood your point.

              • Christian Hennig

                Mayo: Yes, I got it that you originally wrote about another kind of criticism, regarding which I agree with your discussion.
                I brought in another aspect, that I think is also connected to the topic/title of these posts.
                It is not so clear to me, though, how you think p-values could be “audited”.

  4. Christian: I believe your concern here offers an opportunity to clarify some issues about errors that have concerned me and some others.

    If we take the simple case of comparing binary outcomes in a two group ideal randomised controlled trial, we can illustrate the issue fairly straightforwardly. The usual assumptions would be that within each group the binary outcome would be distributed as Binomial (constant probability of success and independent) and the Null hypothesis would be Pt=Pc (treatment success rate equals control success rate).

For a chosen test statistic, the type one error rate will be a rather erratic-looking function of Pc and, as we were informed on Larry’s (Normal Deviate) website, this difficulty is overcome by taking the supremum of that function to get _the_ type one error rate.

    Here Pc is often referred to as a nuisance parameter as primarily we are interested in say Pt – Pc (some measure of treatment effect.)

Now what you are pointing out is that there are always reasonable concerns about additional nuisance parameters, such as other distributions the data may actually follow. Here, it would be non-constant probability of success and dependence. If so, the supremum to get _the_ type one error rate should be over those additional nuisance parameters.

So, type one error rates are really never that well determined, nor is power. Looking at problems where Normal assumptions are reasonable avoids all these concerns and gives a false impression of statistics that is hard for almost all of us to shake off.
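The supremum point can be made concrete by computing the exact size of a test over a grid of values of the nuisance parameter Pc. The pooled two-proportion z-test and the choice of n = 20 per group below are illustrative assumptions:

```python
from math import comb, sqrt

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def type1_error(p, n, z_crit=1.96):
    """Exact rejection probability of the pooled two-proportion z-test
    when the null Pt = Pc = p is true (n observations per group)."""
    total = 0.0
    for x1 in range(n + 1):
        for x2 in range(n + 1):
            phat = (x1 + x2) / (2 * n)
            if phat in (0.0, 1.0):
                continue  # degenerate pooled estimate: never reject
            se = sqrt(phat * (1 - phat) * 2 / n)
            z = (x1 / n - x2 / n) / se
            if abs(z) >= z_crit:
                total += binom_pmf(x1, n, p) * binom_pmf(x2, n, p)
    return total

n = 20
rates = {p / 100: type1_error(p / 100, n) for p in range(1, 100)}
sup_rate = max(rates.values())
# The exact size is an erratic function of the nuisance parameter Pc;
# "the" type one error rate is its supremum over Pc.
```

Plotting `rates` against Pc shows the erratic, discreteness-driven behavior Keith describes; the single number usually reported is the supremum.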

    • Interesting points!
      Keith, like you say, Christian’s example is covered in the Normal Deviate post linked above. Do you have any good references for the difficulties with this approach that you allude to? Thanks

Sure, but a quick simulation should give you the basic picture, assuming a binomial distribution.

        This one for testing:

        This one for confidence intervals:

And if you’re up for it, a technical paper as to why no general solution is likely ever to be available:

Unless you mean the difficulty of not knowing what the true distribution is – then I would have to go looking…likely to Bayesian Non-Parametrics and other stuff.


        • Christian Hennig

          Huber, in the Section on Robust Tests in the Robust Statistics book, has something on robustifying likelihood ratio tests for testing a full neighbourhood of a parametric null hypothesis. This doesn’t take into account potential violations of the iid assumption, though.

Christian: There are methods to address particular problems on a one-off basis. Robust likelihood ratio tests only have asymptotic error rates and sometimes fail catastrophically, as for instance in my thesis.

Normal Deviate discussed some techniques that had guarantees with minimal assumptions. But the arguments in Stigler’s “The Changing History of Robustness” suggest to me that that route is not least wrong.

I don’t think there will ever be safe statistical methods with well-determined error rates in most real applications, even when there are no apparent problems with random selection and assignment (which is almost never the case).

            • Christian Hennig

Keith: I certainly agree with your last sentence, and also with Stigler’s paper. My aim was not to suggest that all these problems could be solved. My original intention was less constructive: I wanted to highlight what problems exist and how one could think about them. I still think, though, that the “robust project” of at least trying to solve what can be solved in this respect is extremely valuable.

Also, Huber’s book actually has finite sample results. May I ask which robust tests failed catastrophically, and in what sense, in your thesis?

Christian: Sorry if I seemed argumentative; I liked the way you highlighted the problems, and I agree with trying to solve what can be solved – at least without too much blindness to possible surprises from brute-force reality.

                For the robust fail, search for “batman” here


                • Christian Hennig

                  Keith: Thanks. As far as I can see, this is not the kind of robust test I was talking about and treated in a simple case in Huber’s book, where one considers a worst case p-value over a neighbourhood of the H0.

Keith: I suggest that the goal is “not safe statistical methods with well determined error rates in most real applications”. If we wanted safety, we’d stick very close to the data and have a very boring science. Granted, methods should allow robust learning, at least when combined and inter-checked. I recall Fisher:

              “[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher 1947, 14).

              What counts as “knowing how” depends on the context; but we can discern its absence, as when a purported effect is irreproducible.

  5. I took a look at the link in comments from Keith O’Rourke:

    Keith had written: “ 2. Essentially, the two* ways you can go wrong in Bayes are; have a wonky prior, miss-specify the data generating model and misinterpret the posterior probabilities 😉
    People like Xiao-Li Meng and Mike Evans are working on ways to deal with all three but many statisticians just see no prior problems, hear no lack of data fit and speak of no difficulty interpreting posteriors.”
    “For Meng see”
    This looks like a new attempt to develop tools to mitigate the dangers of using Bayesian priors. (There are two other co-authors). As ingenious as the idea is, I find it perplexing to suggest that when there are prior-likelihood conflicts, the prior subtracts some of the information from the likelihood. It’s also equivocal. I can just hear what Fisher would say. Is the prior genuine information or isn’t it? (if it’s not, why multiply it with the likelihood to begin with?) I realize I’ve only scanned the paper, but isn’t this the gist?

    *I guess he meant “three”, and one of the problems I see is in disentangling them. I would also be interested to hear what meaning people accord to “misinterpreting the posterior”.

  6. Despite my thinking we should separate the “P-values overstate evidence against a null” problem and the “screening rates” problem, I admit that they are increasingly linked because of the so-called “replicability crisis”. They are inextricably blended in recent work by Val Johnson that has gotten a lot of attention:

    Click to access pnas-2013-johnson-1313476110.pdf

    Johnson says that his approach “provides a direct connection between significance levels, P values, and Bayes factors, thus making it possible to objectively examine the strength of evidence provided against a null hypothesis as a function of a P value or significance level.” (pp 1-2)
    I’d be interested to hear what people think of this at some point (it’s a mere 5 pages). So maybe David Colquhoun wants to jump back into the discussion. I’m amazed at Johnson’s treatment of the alternative hypothesis, and his blithe use of a prior (to the null) of .5. But I’m also intrigued by his Bayes-frequentist schizophrenia.

  7. Thanks for that response. I’m on holiday in Germany at the moment so will be brief. I would certainly not be happy with an assumed prior of 0.5. What persuaded me that the argument needs to be taken seriously is that no such assumption seems to be necessary. The false discovery rate of at least 30% can be reached by several different arguments which I tried to summarise in
    Now for the Brocken!

  8. I’d be enormously grateful for opinions on
    I am wedded to the idea of open peer review.
It was the screening analogy that first shattered my lifelong antipathy to Bayesian arguments. Then the realisation that all that was involved was counting – no need for vague subjective probabilities.
    Wearing my experimentalist hat, what matters to me is to avoid making a fool of myself too often. That means that the false discovery rate is what matters most. The fact that the FDR is much the same regardless of assumptions about priors clinches it for me. Fisherian testing has misled us!
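The counting behind the screening argument is short; the prevalence of real effects (.1), α (.05), and power (.8) below are the sort of illustrative assumptions such arguments use:

```python
def false_discovery_rate(prevalence, alpha, power):
    """Of all tests that come out 'significant', the fraction whose null
    is actually true: a pure counting argument over a population of
    tested hypotheses."""
    false_pos = (1 - prevalence) * alpha   # true nulls wrongly rejected
    true_pos = prevalence * power          # real effects detected
    return false_pos / (false_pos + true_pos)

# With 10% of tested hypotheses being real effects, alpha = .05, power = .8:
print(round(false_discovery_rate(0.1, 0.05, 0.8), 2))  # 0.36
```

Those illustrative inputs give an FDR of about 36%, in the neighborhood of the “at least 30%” figure, and lowering the power pushes it higher still.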

    • David: FDRs do depend on priors or “prevalence rates”, but the mistake in blurring the two kinds of “error probabilities” goes deeper.

      By the way, have you read about Isaac and college-readiness? It’s short:

    • Christian Hennig

      David: I think that these are important considerations to have in mind, but that the ways in which reality is more complex than what is assumed in these considerations should not be forgotten. I, and probably everybody here, would agree that just using P=0.05 as a magical cutoff between “sure discovery” and “just random” is a bad thing. Still, here are some additional aspects.
      1) Whether the FDR is more relevant than the type I error probability depends on the consequences. If P<=0.05 is just interpreted as "deserves further investigation" and if this "further investigation" is not all too expensive or painful, I wouldn't worry too much about 85% false discoveries that are weeded out by such further investigation. In science, I'd never believe an important finding just because it is backed up by P=0.035 in a single study. I'd hope that other people have a look at it and confirm it, by which I don't just mean replication but also additional evidence from looking at the issue in different ways (actually I'd probably say this for more or less arbitrarily small P, but with P=0.035 I'd probably demand more additional backing up than with P=1e-6). I guess that's what Mayo calls piece-meal approach to science.
2) H0s are violated in all kinds of ways only some of which would be interpreted as confirming certain research hypotheses. If we assume N(mu,sigma^2) for the difference between a drug and a placebo measured somehow as H0, we’d be interested in rejecting mu = 0. However, mu may be larger than 0, but only so weakly that one would interpret this as “the drug does a tiny little bit but pretty much nothing”; the normal assumption may be violated; the iid assumption may be violated. Actually I’d think that cases in which *all three* of these problems exist dominate, by a country mile, cases in which the H0 is indeed fulfilled. Now this unfortunately means that it is unclear how to even define an “FDR”, because pretty much nothing is a really false discovery in the sense that absolutely nothing violating the nominal H0 is going on indeed.
      3) Even if we understand a false discovery as a situation in which the “interpretative H0” is fulfilled (see earlier posts of mine), my impression is still that in science as a whole the prevalence of correct findings is fairly high. This is because many researchers test null hypotheses that are implausible anyway and reject them against alternatives that make sense but are pretty boring, which gives them publications without the need of any cheating. Then there is a small number of “interpretative H0s” that may well be true (“homeopathy doesn’t do anything”), and another number of “risky” H0s which are probably not exactly true, but in which the effect needs to be striking in order to a) be of real interest and b) make it possible to find it with the sample size at hand. Then, on the other hand, in some cases researchers want to *confirm* a plausible H0 with high severity. Such things can be assessed in advance, so the idea that whether a research hypothesis is true or not can be modelled as the outcome of a binary iid random variable is rather problematic. There are different classes of hypotheses for which one should have different prevalence rates.

    • john byrd

      David: I have a concern over your claim that 30% of “positive” results are wrong when using a cutoff of 0.05. It appears to me that the false discovery rate depends very much on the actual circumstances under which the testing is done. What makes the difference is both 1) how often one tests individuals that are truly different from the null (in the relevant respect), and 2) how different they really are (power of the test). You note this in the paper, but in my view deemphasize the points in an alarming manner. When we conduct validation studies of test methods, one step is to test a sample known to be consistent with the null. A 0.05 cutoff should produce a mean type I error of 5% in repeated runs. When a sample known to be different in the relevant manner is tested, the false discovery rate depends on point 2 above, and gives a basis for expectations in the future. I do not know of a magic number for what the FDR should be, but I agree it should be addressed when offering a test method to the reader. I am sure that 30% is not some magic number, however. I suspect the FDR varies wildly in different contexts. Lastly, I believe your paper would be strengthened by including thoughts from Efron, Benjamini and Hochberg, and Tukey. Thanks very much for sharing it. It is helpful to me.
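The dependence of the FDR on prevalence and power described in this comment can be made concrete with the standard screening arithmetic. A minimal sketch (the function name and the scenario numbers are illustrative, not from the comment):

```python
def fdr(alpha, power, prevalence):
    """Expected fraction of 'positive' results that are false positives,
    under the dichotomous screening model: FDR = FP / (FP + TP)."""
    false_pos = alpha * (1 - prevalence)  # truly-null cases that reject
    true_pos = power * prevalence         # truly-non-null cases that reject
    return false_pos / (false_pos + true_pos)

# The same alpha = 0.05 cutoff yields very different FDRs by context:
print(fdr(0.05, power=0.80, prevalence=0.5))  # ~0.059
print(fdr(0.05, power=0.80, prevalence=0.1))  # 0.36
print(fdr(0.05, power=0.20, prevalence=0.1))  # ~0.69
```

This is exactly the commenter's point: with plausible hypotheses and decent power the FDR is small, while with implausible hypotheses and low power it dwarfs the nominal 5%.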

      • David Colquhoun

        Thanks very much for your reply. What struck me was that three different approaches all gave an FDR of around 30 percent (and higher than that for underpowered experiments), despite making rather different assumptions. That suggests to me that there is a problem.

        • john byrd

          David: I would call the false discovery rate a phenomenon, not a problem. It is expected to exist in many testing contexts. It should not be a surprise. (Where the null hypotheses are all true, it can be 100% if not controlled.) Benjamini & Hochberg offer a means to control it. Efron notes how much we can learn from it.
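For readers unfamiliar with the Benjamini & Hochberg procedure mentioned here, a minimal implementation sketch of the step-up rule (the example p-values are illustrative):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: return the (sorted) indices of
    rejected hypotheses, controlling the expected FDR at level q for
    independent tests."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # find the largest rank k such that p_(k) <= (k/m) * q
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])  # reject the k_max smallest p-values

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9]
print(benjamini_hochberg(pvals, q=0.05))  # [0, 1]
```

Note the step-up character: a p-value can be rejected even if it fails its own threshold, so long as some larger-ranked p-value passes; with these ten p-values only the two smallest survive.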

          • John: The business of ROCs never should have been crudely mixed with methods for statistical inference in science. It is a relatively recent confusion that some have happily capitalized on. Reducing inference to a dichotomy (throwing out information) and conflating the prevalence of an event-type in random sampling with a measure of the inferential warrant of a statistical hypothesis is wrongheaded (and leads to getting the role of power backwards).

          • David Colquhoun

            john byrd and Mayo

            You say that the false discovery rate is a phenomenon, not a problem. I would say that, if people are unaware of it, then it’s a real problem. It could account for much of the lack of reproducibility in some branches of science. It is almost certainly responsible for the vast number of tests of alternative medicine, in which the hypotheses are mostly implausible (and in some cases, like homeopathy, obviously untrue).
            Such papers usually end with an assertion that “more research is necessary”.

            Huge amounts of time and money are wasted by publication of under-powered studies in which the FDR can easily be 80%. It’s all very well to say that science needs replication. That’s obviously true. But to publish ‘discoveries’ that have an 80% chance of being wrong just wastes everyone’s time. If you are willing to publish a finding that has at least a 26% chance of being wrong, that’s fine as long as you say so.

            As I say in the paper, the acceptable FDR depends on the losses incurred by false negatives and false positives. What isn’t acceptable is to ignore the FDR altogether, as is the near-universal custom in the medical literature. Although it won’t be possible to specify precise values for the FDR, rough values can be given, and the value for P close to 0.05 comes out to be at least 26%, regardless of the prior distribution (or prevalence, as I’d prefer to call it in this context). The arguments put forward by Sellke, T., Bayarri, M. J., and Berger, J. O. (2001) persuaded me of that, but you can reach a very similar conclusion by simulated tests, so I don’t think that the seriousness of the problem depends on your attitude to Bayarri et al.
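The Sellke, Bayarri and Berger (2001) calibration invoked here can be computed directly: their bound says the Bayes factor in favor of H0 is at least -e*p*ln(p) for p < 1/e, which with 50:50 prior odds gives a minimum false positive risk of roughly 29% at p = 0.05, the same ballpark as the 26% figure quoted. A sketch (function names are illustrative):

```python
import math

def min_bayes_factor(p):
    """Sellke-Bayarri-Berger (2001) lower bound on the Bayes factor in
    favor of H0, valid for p < 1/e: B(p) >= -e * p * ln(p)."""
    return -math.e * p * math.log(p)

def min_false_positive_risk(p, prior_h0=0.5):
    """Lower bound on P(H0 | p) implied by the calibration, given a
    prior probability (prevalence) for H0."""
    b = min_bayes_factor(p)
    post_odds = (prior_h0 / (1 - prior_h0)) * b
    return post_odds / (1 + post_odds)

print(round(min_false_positive_risk(0.05), 3))  # 0.289
```

Smaller p-values give a smaller bound (p = 0.01 yields about 11%), which is the arithmetic behind the recommendation to demand p well below 0.05.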

            The fundamental problem with the Fisherian approach seems to me to be that it makes no sense to consider only what would happen if the null hypothesis were true. You must also consider what would happen if it were not true. That’s obviously true for the screening problem. It now seems to me to be equally obvious in the case of significance testing.
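The contrast between "what happens if the null is true" and "what happens if it is not" can be seen by simulating p-values under both. A sketch with illustrative numbers (one-sided z-test, n = 16, true effect of 0.5 sd under the alternative):

```python
import math
import random

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def one_sided_p(n, mu, rng):
    """p-value of a one-sided z-test of H0: mu <= 0 (sigma = 1 known)."""
    xbar = sum(rng.gauss(mu, 1) for _ in range(n)) / n
    return 1 - norm_cdf(xbar * math.sqrt(n))

rng = random.Random(1)
trials = 20000
p_null = [one_sided_p(16, 0.0, rng) for _ in range(trials)]  # H0 true
p_alt = [one_sided_p(16, 0.5, rng) for _ in range(trials)]   # a real effect

frac_null = sum(p < 0.05 for p in p_null) / trials
frac_alt = sum(p < 0.05 for p in p_alt) / trials
print(frac_null)  # ~0.05: under H0, p-values are uniform
print(frac_alt)   # ~0.64: under this alternative, p-values pile up near 0
```

Under the null the rejection rate simply tracks the cutoff; only by also simulating the alternative does the power (and hence anything like an FDR) come into view.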

            Thanks for the reminder that I should include references to Efron and to Valen Johnson (he’s referred to on the blog but got omitted from the paper). That will be fixed in the next version.


            • David: I have given some references, so I won’t go further in explaining the points here, except to warn you that the recommendation of the Berger and Sellke folks (and all the others you’ve listed) is to make an inference that is far LESS conservative than what a significance tester would ever countenance: to wit, infer the alternative with maximum likelihood. That’s a highly unreliable method!

              Even stopping at the likelihoodist’s comparative claim, that h1 makes x more probable than h2 does, is not only extremely uninformative but fails at error control (unless h1 and h2 are predesignated point hypotheses).

              If you’re forcing dichotomous testing (throwing away all the information on, and interest in, magnitudes, and imagining just that H is true or H is false, where long-run screening rates over a stipulated population are all that matter), you can get some bad numbers*, but they have nothing to do with error statistical tests of a given hypothesis.

              *If you do go this route, use the more sensible numbers of Casella and R. Berger (bergernocover.pdf): you’ll see that the posterior of the null is no greater than the p-value!

              Stop and think these points through from scratch.
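The Casella and R. Berger reconciliation invoked here can be illustrated in the simplest one-sided case: with a flat (improper) prior on a normal mean, the posterior probability of the null exactly equals the one-sided p-value. (Their paper concerns infima over classes of priors; the flat-prior case below is just one concrete illustration, with made-up data.)

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# One-sided test of H0: mu <= 0 vs H1: mu > 0, sigma = 1 known.
# With a flat (improper) prior on mu, the posterior is N(xbar, 1/n),
# so P(H0 | data) = P(mu <= 0 | data) coincides with the p-value.

n, xbar = 25, 0.4                               # illustrative data (z = 2)
p_value = 1 - norm_cdf(xbar * math.sqrt(n))     # one-sided p-value
posterior_h0 = norm_cdf(-xbar * math.sqrt(n))   # P(mu <= 0 | data)

print(round(p_value, 4), round(posterior_h0, 4))  # 0.0228 0.0228
```

So in the one-sided setting there is no "overstatement" to speak of; the dramatic conflict in Berger and Sellke arises from the point null with a lump of prior mass on it.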

              • David Colquhoun

                Thanks very much (but your last link gives me a 404).

                Earlier you seemed to acknowledge that the analogy with screening problems might have some relevance to inference, but now you seem to have backed away from that. You surely aren’t denying that the FDR is relevant when considering a proposed screening test? Or have I misunderstood what you are saying?

                I just don’t see how insisting on P = 0.001 in order to control your FDR can possibly be considered “far LESS conservative than what a significance tester would ever countenance”. It is surely far more conservative. You could argue, as Stephen Senn does, that it is TOO conservative, but that depends on the loss incurred by failing to detect a real effect, relative to the damage to one’s reputation consequent on claiming a real effect when there isn’t one.

                Do the disagreements arise because you don’t believe that the FDR is relevant to the problem of inference?

                From the point of view of experimenters who do not want to make fools of themselves by claiming to have made a discovery when they haven’t, the FDR is surely what matters to them? And surely, from the perspective of the reproducibility crisis, it is also the FDR that matters? So are you saying that this isn’t true? Or do you think that the FDR is relevant, but impossible to measure?

                • David: Please get the Casella and R. Berger paper from the references to this blog; links to papers don’t always work in comments. When you read it, you’ll see my point. The reconcilability of posteriors and p-values in the one-sided testing case is generally acknowledged (and Casella and R. Berger, in that paper, demonstrate bounds).

                  On your other point, as I said, the unwarranted inferences licensed by the so-called conservative methods, applied to a test with their chosen p-value, say .001, include: inferring the maximally likely alternative and, worse, inferring the posterior of the point null based on a .5 prior on the null and this maximally likely alternative.

  9. In explaining why he doesn’t allow comments on his legal blog, Nathan Schachtman mentions “Although some bloggers (e.g., Deborah Mayo’s Error Statistics site) have had great success in generating interesting and important discussions, I have seen too much spam on other websites…” So I get to thank readers/commentators and Schachtman, who has often commented on issues of PhilStatLaw.
