“P-values overstate the evidence against the null”: legit or fallacious? (revised)

Posted on July 14, 2014 by Mayo

0. July 20, 2014: Some of the comments to this post reveal that using the word “fallacy” in my original title might have encouraged running together the current issue with the fallacy of transposing the conditional. Please see a newly added Section 7.

1. What you should ask…

Discussions of P-values in the Higgs discovery invariably recapitulate many of the familiar criticisms of P-values (some “howlers”, some not). When you hear the familiar refrain, “We all know that P-values overstate the evidence against the null hypothesis”, denying the P-value aptly measures evidence, what you should ask is:

“What do you mean by overstating the evidence against a hypothesis?”

An honest answer might be:

“What I mean is that when I put a lump of prior probability π₀ > 1/2 on a point null H₀ (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H₀.”

Your reply might then be: (a) P-values are not intended as posteriors in H₀ and (b) P-values can be used to determine whether there is evidence of inconsistency with a null hypothesis at various levels, and to distinguish how well or poorly tested claims are–depending on the type of question asked. The report on discrepancies “poorly” warranted is what controls any overstatements about discrepancies indicated.

You might toss in the query: Why do you assume that “the” correct measure of evidence (for scrutinizing the P-value) is via the Bayesian posterior?

If you wanted to go even further you might rightly ask: And by the way, what warrants your lump of prior to the null? (See Section 3. A Dialogue.)

^^^^^^^^^^^^^^^

2. J. Berger and Sellke and Casella and R. Berger

Of course it is well-known that for a fixed P-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H₀ (Jeffreys-Good-Lindley paradox). I.J. Good (I don’t know if he was the first) recommended decreasing the required P-value as n increases, and had a formula for it. A more satisfactory route is to ensure the interpretation takes account of the (obvious) fact that with a fixed P-value and increasing n, the test is more and more sensitive to discrepancies–much as is done with lower/upper bounds of confidence intervals. For some rules of thumb see Section 5.

The JGL result is generalized in J. Berger and Sellke (1987). They make out the conflict between P-values and Bayesian posteriors by considering the two-sided test of the Normal mean, H₀: μ = μ₀ versus H₁: μ ≠ μ₀.

“If n = 50…, one can classically ‘reject H₀ at significance level p = .05,’ although Pr (H₀|x) = .52 (which would actually indicate that the evidence favors H₀).” (Berger and Sellke, 1987, p. 113).

If n = 1000, a result statistically significant at the .05 level takes the probability of the null from a prior of .5 to a posterior of .82!

While from their Bayesian perspective, this appears to supply grounds for denying P-values are adequate for assessing evidence, significance testers rightly balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null–or even evidence for it!

From J. Berger and T. Sellke (1987) “Testing a Point Null Hypothesis,” JASA 82(397): 113.
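The two posteriors quoted above (.52 at n = 50, .82 at n = 1000) can be reproduced with a short sketch. It assumes that the .5 of prior mass not placed on H₀ is spread over the alternative as a Normal distribution centered at μ₀ with variance σ², and that "significant at the .05 level" means z = 1.96 in the two-sided test. The post does not state the form of the alternative prior, so treat that choice, and the function name `posterior_null`, as illustrative assumptions rather than as Berger and Sellke's exact setup.

```python
from math import exp, sqrt


def posterior_null(z, n, prior_null=0.5):
    """Posterior probability of the point null H0: mu = mu0.

    Assumption (not stated in the post): under H1, mu ~ N(mu0, sigma^2).
    z is the observed standardized distance sqrt(n) * |xbar - mu0| / sigma.
    """
    # Bayes factor in favor of H0: density of xbar under H0 divided by its
    # marginal density under H1 (a Normal with variance sigma^2 * (1 + n) / n).
    bf01 = sqrt(1 + n) * exp(-0.5 * z**2 * n / (1 + n))
    prior_odds = prior_null / (1 - prior_null)
    post_odds = prior_odds * bf01
    return post_odds / (1 + post_odds)


# Fixed two-sided P-value of .05 (z = 1.96), increasing sample size.
for n in (50, 1000):
    print(n, round(posterior_null(1.96, n), 2))
```

With the P-value held fixed, the factor √(1 + n) grows without bound, which is why the posterior on H₀ climbs with n. Lowering `prior_null` pulls the posterior back down, which is the dependence on the lump of prior that Casella and R. Berger object to.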

Many think this shows that the P-value ‘overstates evidence against a null’ because it claims to use an ‘impartial’ Bayesian prior probability assignment of .5 to H₀, the remaining .5 spread out over the alternative parameter space. (But see the justification Berger and Sellke give in Section 3. A Dialogue.) Casella and R. Berger (1987) charge that the problem is not P-values but the high prior, and that “concentrating mass on the point null hypothesis is biasing the prior in favor of H₀ as much as possible” (p. 111) whether in 1 or 2-sided tests. Note, too, the conflict with confidence interval reasoning since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). See Senn’s very interesting points on this same issue in his letter (to Goodman) here.

^^^^^^^^^^^^^^^^^

3. A Dialogue (ending with a little curiosity in J. Berger and Sellke):

So a guy is fishing in Lake Elba, and a representative from the EPA (Elba Protection Association) points to notices that mean toxin levels in fish were found to exceed the permissible mean concentration, set at 0.

EPA Rep: We’ve conducted two studies (each with a random sample of 100 fish) showing statistically significant concentrations of toxin, at low P-values, e.g., .02.

P-Value denier: I deny you’ve shown evidence of high mean toxin levels; P-values exaggerate the evidence against the null.

EPA Rep: Why is that?

P-value denier: If I update the prior of .5 that I give to the null hypothesis (asserting toxin levels are of no concern), my posterior for H₀ is still not all that low, not as low as .05 for sure.

EPA Rep: Why do you assign such a high prior probability to H₀?

P-value denier: If I gave H₀ a value lower than .5, then, if there’s evidence to reject H₀, at most I would be claiming an improbable hypothesis has become more improbable. Who would be convinced by the statement ‘I conducted a Bayesian test of H₀, assigning prior probability .1 to H₀, and my conclusion is that H₀ has posterior probability .05 and should be rejected’?

The last sentence is a direct quote from Berger and Sellke!

“When giving numerical results, we will tend to present Pr(H₀|x) for π₀ = 1/2. The choice of π₀ = 1/2 has obvious intuitive appeal in scientific investigations as being ‘objective.’ (some might argue that π₀ should even be chosen larger than 1/2 since H₀ is often the ‘established theory.’) …[I]t will rarely be justifiable to choose π₀ < 1/2; who, after all, would be convinced by the statement ‘I conducted a Bayesian test of H₀, assigning prior probability .1 to H₀, and my conclusion is that H₀ has posterior probability .05 and should be rejected’? We emphasize this obvious point because some react to the Bayesian-classical conflict by attempting to argue that π₀ should be made small in the Bayesian analysis so as to force agreement.” (Berger and Sellke, 115)

There’s something curious in assigning a high prior to the null H₀–thereby making it harder to reject (or find evidence against) H₀–and then justifying the assignment by saying it ensures that, if you do reject H₀, there will be a meaningful drop in the probability of H₀. What do you think of this?

^^^^^^^^^^^^^^^^^^^^

4. The real puzzle.

I agree with J. Berger and Sellke that we should not “force agreement”. What’s puzzling to me is why it would be thought that an account that manages to evaluate how well or poorly tested hypotheses are–as significance tests can do–would want to measure up to an account that can only give a comparative assessment (be they likelihoods, odds ratios, or other) [ii]. From the perspective of the significance tester, the disagreements between (audited) P-values and posterior probabilities are an indictment, not of the P-value, but of the posterior, as well as the Bayes ratio leading to the disagreement (as even one or two Bayesians appear to be coming around to realize, e.g., Bernardo 2011, 58-9). Casella and R. Berger show that for sensible priors with one-sided tests, the P-value can be “reconciled” with the posterior, thereby giving an excellent retort to J. Berger and Sellke. Personally, I don’t see why an error statistician would wish to construe the P-value as how “believe worthy” or “bet worthy” statistical hypotheses are. Changing the interpretation may satisfy J. Berger’s call for “an agreement on numbers” (and never mind philosophies), but doing so precludes the proper functioning of P-values, confidence levels, and other error probabilities. And “what is the intended interpretation of the prior, again?” you might ask. Aside from the subjective construals (of betting and belief, or the like), the main one on offer (from the conventionalist Bayesians) is that the prior is undefined and is simply a way to compute a posterior. Never mind that they don’t agree on which to use. Your question should be: “Please tell me: how does a posterior, based on an undefined prior used solely to compute a posterior, become “the” measure of evidence that we should aim to match?”

^^^^^^^^^^^^^^^^

5. (Crude) Benchmarks for taking into account sample size:

Throwing out a few numbers may give sufficient warning to those inclined to misinterpret statistically significant differences at a given level but with varying sample sizes (please also search this blog [iii]). Using the familiar example of Normal testing with T+:

H₀: μ ≤ 0 vs. H₁: μ > 0.

Let σ = 1, n = 25, so σ_x = σ/√n.

For this exercise, fix the sample mean M to be just significant at the .025 level for a 1-sided test, and vary the sample size n. In one case, n = 100, in a second, n = 1600. So, for simplicity, using the 2-standard deviation cut-off:

m₀ = 0 + 2(σ/√n).

With stat sig results from test T+, we worry about unwarranted inferences of form: μ > 0 + γ.

Some benchmarks:

* The lower bound of a 50% confidence interval is 2(σ/√n). So there’s quite lousy evidence that μ > 2(σ/√n) (the associated severity is .5).

* The lower bound of the 93% confidence interval is .5(σ/√n). So there’s decent evidence that μ > .5(σ/√n) (the associated severity is .93).

* For n = 100, σ/√n = .1 (σ = 1); for n = 1600, σ/√n = .025.

* Therefore, a .025 stat sig result is fairly good evidence that μ > .05, when n = 100; whereas, a .025 stat sig result is quite lousy evidence that μ > .05, when n = 1600.

You’re picking up smaller and smaller discrepancies as n increases, when P is kept fixed. Taking the indicated discrepancy into account avoids erroneous construals and scotches any “paradox”.
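The benchmark figures above can be checked with a small sketch. It assumes the usual severity computation for test T+ with a just-significant result, SEV(μ > μ₁) = P(M ≤ m₀; μ = μ₁), where m₀ = 2(σ/√n) is the observed mean. That formula is not spelled out in this section, and the function name `severity` and its parameters are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def severity(mu1, n, sigma=1.0, cutoff=2.0):
    """Severity for inferring mu > mu1 in test T+ (H0: mu <= 0 vs. H1: mu > 0).

    Assumes the observed mean m0 sits exactly `cutoff` standard errors
    above 0, i.e. the result is just significant at about the .025 level.
    """
    se = sigma / sqrt(n)   # standard error of the sample mean
    m0 = cutoff * se       # just-significant observed mean
    # Probability of a sample mean no larger than m0 if mu were equal to mu1.
    return NormalDist().cdf((m0 - mu1) / se)


# Same significance level, same claim (mu > .05), different sample sizes.
for n in (100, 1600):
    print(n, round(severity(0.05, n), 2))
```

Under these assumptions the claim μ > .05 gets severity of about .93 when n = 100 and .5 when n = 1600, matching the last benchmark: the same P-value licenses the inference in the first case and not in the second.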

^^^^^^^^^^

6. “The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics” (Cousins, 2014)

Robert Cousins, a HEP physicist willing to talk to philosophers and from whom I am learning about statistics in the Higgs discovery, illuminates the key issues, models and problems in his paper with that title. (The reference to Bernardo 2011 that I had in mind in Section 4 is cited on p. 26 of Cousins 2014).

^^^^^^^^^^^^^^^^^^^^^^^^^^

7. July 20, 2014: There is a distinct issue here…. That “P-values overstate the evidence against the null” is often stated as an uncontroversial “given”. In calling it a “fallacy”, I was being provocative. However, in dubbing it a fallacy, some people assumed I was referring to one or another well-known fallacy, leading them to guess I was referring to the fallacy of confusing P(E|H) with P(H|E) (what some call the “prosecutor’s fallacy”). I wasn’t. Nor are Berger and Sellke committing a simple blunder of transposing conditionals. If they were, Casella and Berger would scarcely have needed to write their reply to point this out. So how shall we state the basis for the familiar criticism that P-values overstate evidence against (a null)? I take it that the criticism goes something like this:

The problem with using a P-value to assess evidence against a given null hypothesis H₀ is that it tends to be smaller, even much smaller, than an apparently plausible posterior assessment of H₀, given data x (especially as n increases). The mismatch is avoided with a suitably tiny P-value, and that’s why many recommend this tactic. [iv] Yet I say the correct answer to the question in my (new) title is: “fallacious”. It’s one of those criticisms that have not been thought through carefully, but rather repeated based on some well-known articles.

[i] We assume the P-values are “audited”, that they are not merely “nominal”, but are “actual” P-values. Selection effects, cherry-picking and other biases would alter the error probing capacity of the tests, and thus the purported P-value would fail the audit.

[ii] Note too that the comparative assessment will vary depending on the “catchall”.

[iii] See for example:

Section 6.1 “fallacies of rejection”.
Slide #8 of Spanos lecture in our seminar Phil 6334.

[iv] So we can also put aside for the moment the issue of P-values not being conditional probabilities to begin with. We can also (I hope) distinguish another related issue, which requires a distinct post: using ratios of frequentist error probabilities, e.g., type 1 errors and power, to form a kind of “likelihood ratio” in a screening computation.

See continuation of the discussion comments here.

References (minimalist)

Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.

Blog posts:

Comedy Hour at the Bayesian Retreat: P-values versus Posteriors.

Highly probable vs highly probed: Bayesian/ error statistical differences.


71 thoughts on ““P-values overstate the evidence against the null”: legit or fallacious? (revised)”

July 16, 2014

Anonymous

It is very odd that this is readily taken for granted; on closer inspection, it applies just in the event that weighing evidence is computed Bayesianly.


July 16, 2014

Mayo

Yes, it’s a traveling meme.


July 16, 2014

Andrew Gelman

Mayo:

I think one problem here is the reluctance on many Bayesians’ part to seriously think about prior information. I know that I have made this mistake a lot.

Regarding some of the technical issues regarding p-values and Bayes factors, this discussion by Christian Robert and me might be helpful. We were discussing a paper by Val Johnson that we felt was misguided in that he was computing worst-case Bayes factors that corresponded to priors that did not make sense.


July 17, 2014

Mayo

Andrew: Yes I’d read this, thanks for linking. Johnson’s attempt here reflects the fallacy, but I can’t tell if you agree that it is a fallacy. (I’m assuming “audited”–actual not merely nominal P-values)


July 16, 2014

Stephen John Senn (@stephensenn)

A further bizarre feature of the Berger and Sellke model is that, although they want to reject the use of P-values, they seem to want to maintain 1/20: their main criticism seems to be that P = 0.05 does not translate into a posterior probability of 0.05 that the null is true, yet they like posterior probabilities.

However, surely, if we are supposed to move to a posterior probability, the most important posterior probability to consider is 0.5 not 0.05: at what point does it become more likely than not that the null hypothesis is false? (Of course, considering this point makes one wonder whether starting with a prior probability of 0.5 is appropriate.)

In other words to think that P=0.05 ought to equal posterior probability 0.05 is simply a naive error of units.

I was looking at a treatment for hay-fever the other day. As regards which children should take it, the message was that your child should weigh at least 30kg. This may or may not be a sensible recommendation but to criticise it because 30kg is not 30 months and hardly any child 30 months old would weigh 30kg (thank goodness) is surely beside the point. I think requiring P=0.05 to mean “posterior probability 0.05” is a mistake of the same type.

I might add, by the by, that in the world of regulatory science almost no drug is registered on the basis of a single P-value of 0.05.


July 16, 2014

Mayo

Stephen: Well you’re one of the few sensible ones (to recognize it as a fallacy almost on the order of confusing units). However, remember how widespread is the supposition that people misinterpret P-values as posteriors, because I guess they imagine evidence can’t be qualified in any other way? That was also the criticism in the Higgs case. I find a lot peculiar in the Berger and Sellke article, a key source for the allegation.

July 17, 2014

Keith O'Rourke

I am missing something here.

> at what point does it become more likely than not that the null hypothesis is false?

Surely? it should be “much more likely than not”…


July 17, 2014

Stephen John Senn (@stephensenn)

Why should it be “much more likely”? Jeffreys’s original claim was that astronomers had noted that two standard errors was about the tipping point: when you investigated further as often as not it turned out there was a discrepancy.

I think that this claim is not entirely convincing since standard deviations were not in common use until Pearson and I think Jeffreys was referring to astronomy of the 19th century but it is interesting that he considered 50:50 to be a useful boundary.

For many applications in drug development, such as deciding whether to cancel a project or collect more data, this might well be a useful boundary (if you wanted to think about things like this).


July 17, 2014

Mayo

Stephen: This is a good point. More likely than not (that the null Ho is false) would provide some Bayesian “confirmation” for not Ho. These issues highlight the difference in what these numbers are doing or thought to be doing.

July 18, 2014

Stephen John Senn (@stephensenn)

I quote from

Jeffreys, H. (1961). Theory of Probability. Oxford, Clarendon Press.

P 386

‘The need arose from the fact that in estimating new parameters the current methods of estimation ordinarily gave results different from zero, but it was habitually found that those up to about twice the standard error tended to diminish when the observations became more numerous or accurate, which was what could be expected if the differences represented only random error, but not what would be expected if they were estimates of a relevant new parameter. But this could be dealt with in a rough empirical way by taking twice the standard error as a criterion for possible genuineness and three times the standard error for definite acceptance.’

July 18, 2014

Keith O'Rourke

Stephen:

OK it depends and I thought the context was the approval of a drug that was costly and had nasty side effects (but this is a distraction).

In the rare or more likely extremely rare setting where I thought posterior probabilities could be interpreted literally, I believe I would know what probability to target. In most cases, given they are usually arbitrary, how does one then choose?

Don Berry has argued, in effect, in the regulatory setting that they should be chosen to target the usual type one and two error rates. This did get in the FDA guidance so there might even be empirical record of what those choices were.

I tried to bring this out in Two cheers for Bayes. Control Clin Trials. 1996 Aug;17(4):350-2. Essentially, the two ways you can go wrong in Bayes are: have a wonky prior, misspecify the data generating model and misinterpret the posterior probabilities 😉

People like Xiao-Li Meng and Mike Evans are working on ways to deal with all three but many statisticians just see no prior problems, hear no lack of data fit and speak of no difficulty interpreting posteriors.

Of course given my interests, I am biased in thinking there will be a third cheer one sunny day.


July 19, 2014

Mayo

Keith: Not sure what you mean by interpreting posterior probabilities “literally”, did I miss? Can you explain or cite your mention of Meng and Evans on this? I’m much more familiar with Evans. Thnx


July 22, 2014

Keith O'Rourke

For interpreting posteriors literally see Two Cheers reference above.

For Meng see http://andrewgelman.com/wp-content/uploads/2014/06/2014_MeNiRe.pdf
(which has references to Evans)


July 17, 2014

matloff

One does not need to resort (I use the word advisedly) to Bayesian analysis to assert p-values tend to be overstated as evidence against H0. The example I brought up here recently is a case in point.

In one of Larry Wasserman’s blog posts, cited here by Deborah, he had opined that while p-values can be misleading in general, in part because with large n we may pounce on a tiny, unimportant departure from H0, there are some situations in which ANY departure would be of interest. My reply now (I had not yet started reading Larry’s blog then) is that no matter how careful one is, there are going to be imperfections in measuring instruments and so on, thus creating a “departure” even if none were there. I worry that the Higgs experiments were of this nature.

In any case, my point is that the same principle applies whether H0 is true or false. The imperfections in measurement can result in an exaggerated p-value, i.e. one smaller than it would be with no measurement error. Here the p-value would overstate the evidence against H0.

On the other hand, if the measurement bias were in the opposite direction from a true departure of H0, the p-value will understate the evidence against H0.

By the way, if the main use to which one puts a confidence interval is to see whether it contains 0 (or other value posited by H0), then one has completely defeated the purpose of forming a CI.

Those counterintuitive theoretical results on Bayesian testing vs. p-values are interesting; Bayesians are doing contortions to justify their philosophy in light of those results. I agree that Deborah’s analysis exposes their circularity.


July 18, 2014

Mayo

Matloff: Why don’t you take note of my point (in section 5) about how easy it is to take account of sample size using a severity assessment or the like (e.g., confidence intervals)? That is a central way it avoids fallacies, even where CIs themselves do not. See reforming the reformers, dashing or I would link it.
Returning to this: It’s by combining several results that the Higgs experimenters sustain a stringent “argument from coincidence”, the different detectors, the continual probing of the data (ongoing). It should never be an “isolated” result (as Fisher says.)

July 18, 2014

Mayo

Matloff: Wasserman was granting to commentators that p-values can be misused, but he was referring, as I recall, to classic problems of cherry-picking, hunting for significance, multiple testing and other biases. All these are problematic and they show up in disqualified p-values (ones where the computed p-value differs from, and is much smaller than, the actual p-value). (The same is so for CIs).

Here are the CI posts I alluded to in my last comment. If you use CIs you still need supplements to avoid fallacies, and severity gives them to you. You can use the mechanics of CIs as well, of course, but the rationale for doing so in interpreting results is still needed:

https://errorstatistics.com/2013/06/05/do-cis-avoid-fallacies-of-tests-reforming-the-reformers-reblog-51712/

https://errorstatistics.com/2013/06/06/anything-tests-can-do-cis-do-better-cis-do-anything-better-than-tests-reforming-the-reformers-cont/


July 18, 2014

omaclaren

Mayo:

I also appreciate this and your other analyses. They’ve helped me gain a clearer understanding of these particular issues; however, Wasserman does state in that blog post:

“Now, having said all that, let me add a big disclaimer. I don’t use p-values very often. No doubt they are overused. Indeed, it’s not just p-values that are overused. The whole enterprise of hypothesis testing is overused. But, there are times when it is just the right tool and the search for the Higgs is a perfect example.”

Similarly, in “All of Statistics” he says:

“Warning! There is a tendency to use hypothesis testing methods even when they are not appropriate. Often, estimation and confidence intervals are better tools. Use hypothesis testing only when you want to test a well defined hypothesis.” (Chap. 11)

Do you think this is just a minor difference given the ability to do estimation via hypothesis testing (as in your linked posts) or that it points to something bigger, perhaps not well expressed? Hasn’t Gelman mentioned this, say regarding ‘testing’ vs ‘modelling’ approaches?

In my crude world, given a choice between a plot of how some quantity varies as a function of another and a single number summary like a p-value (and/or severity value) I’d prefer the plot. Admittedly these sorts of plots can be misleading but they seem more natural as a starting point. Often I have a non-statistical model (though it need not be) or set of models in mind, like a differential equation, and I’d like to compare and contrast its possible solutions with the functional relationships seen in the data.

I know some of Spanos’ work is probably relevant here but I find it a little difficult to follow, possibly because of the econometrics flavour (i.e. my ignorance of). The language of a field like inverse problems tends to be more familiar for my way of thinking.

So, presumably my point is that I find plots of functional responses for both models and data more natural and, honestly, probably more important, with things like hypothesis testing brought in later to make sure we’re not fooling ourselves (too much!). I’d be interested to hear your view on this.


July 18, 2014

Mayo

omaclaren: In his blogpost, Larry was clearly reacting to some comments bashing P-values, but as I said, we all know about “overuse” or rather “misuse”. Most importantly on your main comment: the construal of tests that I favor directly reports magnitudes of discrepancies that are well or poorly indicated. You might say that it combines testing and estimation. See, for example, Section 5. Also the links I gave in my last comment to posts describing why CIs also need supplementation by a severity analysis, if they are to avoid misinterpretation.

I doubt I’d be writing so much on p-values if it weren’t that those tools have come in for so much bashing–thereby crying out for a clarification from a philosopher of statistics.

On Spanos’ material, try searching this blog. For one thing, there are groups of posts on testing assumptions and model validation. If you employ hypothesis tests to check you’re not fooling yourself too much, then I say they’re doing important work for you, and that’s how they’re intended to be used.

Are these areas of interpreting statistics and of statistical foundations of relevance to practitioners and the interested public? I say yes.

Inverse problems?


July 18, 2014

omaclaren

Re: p-values bashing, I see your point and do appreciate your clarifications. I think there are many people who confuse/combine the issue of p-value overuse/misuse and whether what they really want are posteriors; I imagine that this forms a much stronger ‘voting bloc’ than either would get on its own. Regardless of the p-value/posterior probability issue I do think overuse is a big(ger) problem and justifies some of the criticism; however I understand that addressing this is not really your goal in this particular post.

Re: inverse problems. Largely, a different name for the same kind of thing but with an origin in e.g. geophysics and so with a slightly different emphasis or set of concepts. I mentioned this mainly because ordinary and partial differential equation models are common in this area vs. regression models with autocorrelation terms and things as in Spanos’ examples. But basically, for finite-dimensional problems, we have data, a model or models and we want to infer parameters.

Here is a paper which I’m sure you’ll find philosophically sloppy but is still somewhat interesting. It mentions Popper, Bayes and Inverse Problems all in one place:

http://www.nature.com/nphys/journal/v2/n8/abs/nphys375.html


July 19, 2014

Mayo

So how do we battle “overuse”?
I find it interesting to read how strong but fruitful disputes over data analysis take place in areas that scarcely make use of statistics, or where they enter here and there just when a question may be posed statistically, and then combined with entirely qualitative results. It helps to clarify how the use of statistics is continuous with scientific discovery/learning more generally. If people thought about that more, they’d avoid (mis)using statistics as window-dressing. Likewise, it would be instructive for critics who seem to imagine that if a method can be used superficially, then there is a license to use it superficially.

July 19, 2014

Mayo

omaclaren: Thanks for the link. I agree that it’s “philosophically sloppy but is still somewhat interesting”. Easy enough to fix, but it may result in shifting your entire thesis ….


Nearly every scientist’s favorite philosopher is Popper and he is adduced in connection with a large number of radically different philosophies! As he’s also one that I study and write about professionally, I’m going to link to some “no-pain philosophy” blog posts on him.

The author (who is not you I see) says, “As we learn from Popper, a theory must be able to predict the result of observations”, but of course, what we actually learn from Popper is that a theory of any scientific interest cannot predict observations, but only general phenomena which can only be linked to anything observable via a host of background theories of the substantive field, of instruments, and of experiments. Thus, in trying to “falsify” we are immediately blocked by the Duhemian problem of distinguishing between sources of “misfits”.

I’m very glad to see that your “inverse” problem is an example of what I’d call inductive or ampliative inference from tests. Such tests must go beyond mere “fits” however, and must solve their Duhemian problems. That’s what my philosophy of science and statistics are all about.

Here’s the second of three parts (likely most relevant)
https://errorstatistics.com/2012/02/01/no-pain-philosophy-skepticism-rationality-popper-and-all-that-part-2-duhems-problem-methodological-falsification/

For the third (a link to part 1 is within, but is less relevant I think):
https://errorstatistics.com/2012/02/03/no-pain-philosophy-part-3-a-more-contemporary-perspective/

Some more subtle issues from a recent seminar on Popper (from a general philstat course w/ my colleague Aris Spanos, economics):
https://errorstatistics.com/2014/02/26/phil6334-feb-24-2014-popper-and-pseudoscience-day-4-make-up/

Please feel free to ask questions.


July 20, 2014

omaclaren

Thanks for your responses, I appreciate you taking the time and I enjoy a good ‘slow blog’. I do have a number of thoughts (for what they’re worth) and questions; however I’ll have to see if I can put them together more coherently. I should also point out that I was not the author of the article I linked, sorry if that was unclear.

July 22, 2014

omaclaren

Thanks again for your response.

Regarding philosophical issues, inference and inverse problems:
I’ve replied below under one of Stephen’s comments as I believe he’s captured the way I think about the general issues of inference quite well. Also, my only other comment on this blog, from a little while ago, was actually on that last Popper post you linked! I should say that, despite some of my own quirky views which may disagree with yours (see below!), I do buy many of the points you’ve raised here and in your books and you’ve certainly helped my understanding of Popper, Peirce, Duhem etc. I’ll have to ask some more questions when I get a chance. For now I’ve just left my own rambling opinion, as is standard on the internet.

Regarding overuse of p-values:
I’m no expert but it seems there are some good, influential voices such as Wasserman, Gelman etc (encouragingly, I can think of quite a few more!) who are defending appropriate use of p-values while also encouraging other things like exploratory data analysis, increased modelling and new ways of representing data as complementary (or more important, often) tools. While more non-statisticians entering the ‘data analysis’ field does bring some dangers (which should certainly be pointed out by the likes of yourself and Stephen) it also appears to bring an increased desire for visualisation and informative representations of data.

So, having more of a ‘feel’ for the data and using more representative models seems, I think, to be a nice way of balancing out too much ‘blind’ testing and use of strawman hypotheses.

Reply

July 18, 2014

David Colquhoun

Surely the answer depends on the losses that are incurred as a result of false positives and false negatives. Stephen Senn is fond of citing the example of preliminary screening of a series of drugs. The cost of missing an active drug could be very high for the industry, so it’s reasonable to tolerate a lot of false positives at that stage. They will be weeded out later, before much harm is done to patients.

The fact is that the bulk of the biomedical literature still relies on P<0.05 to announce that they've made a discovery. In the situation where you test a lot of implausible hypotheses (an extreme example would be homeopathy) it is simply a matter of counting to see that many of the claims will be false.

It seems to me to be quite misleading (though not wrong) to call this argument Bayesian. It's simply an entirely non-contentious application of the rules of conditional probability. Ignoring it is very probably a large part of the reason for the present crisis of reproducibility that has plagued many areas of science.

It is the job of statisticians to minimise irreproducibility, not to condone it.

More at http://www.dcscience.net/?p=6518

Reply
July 18, 2014

Stephen John Senn (@stephensenn)

If David really wants to avoid the Bayesian word then he should use some real data. I have previously pointed him towards an empirical study
1. Djulbegovic B, Kumar A, Glasziou P, Miladinovic B, Chalmers I. Medical research: trial unpredictability yields predictable therapy gains. Nature 2013; 500: 395-396.
http://www.nature.com/nature/journal/v500/n7463/full/500395a.html
which could, if properly analysed, give him the mixing distribution he needs. (It is NOT just a case of the conditional probabilities.)

As regards the crisis of reproducibility, it’s not a crisis, it’s inevitable. We have to do science in a way that takes account of this and that includes paying more attention to the business of actually reproducing. As I pointed out in a recent twitter post, successive attempts by generations of scientists to reproduce measurements of the astronomical unit starting in the 19th century failed by the standards of reproducibility that the astronomers claimed. Astronomy seems to have survived despite this and nobody calls it a crisis.

In drug development, my explanation of the failure to reproduce is quite different and will not be cured by replacing P-values with confidence intervals (which I think David sometimes favours). (But that certainly does NOT mean that P-values alone should sum up trials.) See
1. Senn S. Being Efficient About Efficacy Estimation. Statistics in Biopharmaceutical Research 2013; 5: 204-210.

Reply
July 18, 2014

David Colquhoun

I thought I’d already answered Stephen’s comment, but I’ll try again in the hope that someone will demolish my argument.

The classical example of Bayesian argument is the assessment of the evidence of the hypothesis that the earth goes round the sun. The probability of this hypothesis being true, given some data, must be subjective since it’s not possible to imagine a population of solar systems, some of which are heliocentric and some of which are not. That’s one reason I’ve always advocated the frequentist view. The other reason is that, as an experimenter, I want my experiment to stand on its own, without feeding my prejudices into its interpretation.

But the situation seems quite different when you have an objectively determinable probability that H1 is right in the population being tested. Surely the most ardent anti-Bayesian does not deny that the false discovery rate is relevant to health screening tests (as at http://www.dcscience.net/?p=6473 )? I don’t see how anyone could argue against the correctness of that argument.

If you accept the argument for screening tests then the Fisherian advocate surely must have a problem. I argue that the problem of testing a series of drugs to see whether or not their effects differ from a control group is precisely analogous with the screening problem. It’s easy to imagine a large number of candidate drugs, some of which are active (fraction P(real), say), some of which aren’t. So the prevalence (or prior, if you must) is a perfectly well-defined probability, which could be determined with sufficient effort.

If you test one drug at random, the probability of it being active is P(real). It’s no different from the probability of picking a black ball from an urn that contains a fraction P(real) of black balls, to use the statisticians’ favourite example. If that is the case, I don’t see how you can deny that, if you observe P=0.047 in a single experiment, the probability that you will make a fool of yourself if you claim to have made a discovery is at least 30% (and much higher for under-powered experiments, as at http://www.dcscience.net/?p=6518 ).
All you have to do to convince yourself of that is to simulate 100,000 t tests (that takes minutes on my laptop).
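
A minimal sketch of this kind of simulation is given below. It is not the simulation referred to in the comment: to stay within the Python standard library it uses two-sided z tests rather than t tests, and the prevalence (0.1) and power (0.8) are assumed values chosen only for illustration. It also counts every result with P < 0.05, not only results close to P = 0.047, so it illustrates the counting argument rather than the exact figure quoted. Under these assumptions the expected fraction of false discoveries is 0.045 / (0.045 + 0.08) = 0.36.

```python
import math
import random
from statistics import NormalDist


def simulate_false_discovery_rate(n_tests=100_000, prevalence=0.1,
                                  power=0.8, alpha=0.05, seed=1):
    """Fraction of 'significant' results that come from true nulls.

    Assumptions (illustrative, not from the original discussion):
    two-sided z tests, a fixed effect size giving the stated power,
    and a fraction `prevalence` of hypotheses with a real effect.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Mean of the z statistic under a real effect, chosen so that the
    # test rejects with probability ~`power` (wrong-tail rejections ignored).
    shift = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)

    false_pos = 0
    true_pos = 0
    for _ in range(n_tests):
        real_effect = rng.random() < prevalence
        z = rng.gauss(shift if real_effect else 0.0, 1.0)
        # Two-sided P-value for a standard normal test statistic.
        p_value = math.erfc(abs(z) / math.sqrt(2))
        if p_value < alpha:
            if real_effect:
                true_pos += 1
            else:
                false_pos += 1
    return false_pos / (false_pos + true_pos)


fdr = simulate_false_discovery_rate()
print(f"simulated false discovery rate: {fdr:.2f}")
```

Changing `prevalence` shows how strongly the result depends on the assumed fraction of real effects, which is the quantity in dispute in this thread.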

You could certainly argue that that problem is best dealt with as a multiple comparison problem. But take the case where you test only one primary outcome per paper, over a lifetime. It’s only a slight stretch of the imagination to imagine that you could, with sufficient effort, determine the fraction of hypotheses that were actually true.

Of course, if the loss incurred from false negatives is high (as in the early stages of drug development) it may be quite reasonable to tolerate 50 percent of false discoveries. But if that’s the case, please say so clearly in the paper. As an experimenter, I value my reputation too much to publish papers which have a 50% chance of being wrong.

Reply

July 19, 2014

Stephen John Senn (@stephensenn)

Actually, the screening argument works better with P-values than it does with screening, strange as it may seem! (1-3) This is because it is highly debatable with screening that you can make sensitivity and specificity primitive inputs that will survive intact whatever the prevalence. For example, if they were derived from a population wide study (population epidemiology) why would they be relevant for a patient consulting his or her doctor (clinical epidemiology) and would the population prevalence be relevant for the clinic setting? With type I error rates at least for valid tests, we can believe they would be the same whatever the prior probability of a null hypothesis (but not whatever the alternative).

If I had a large lab screening molecules I would use a Bayesian approach or rather I would use an empirical Bayes approach based on actual data. This would satisfy David’s objective Bayes approach but it does not carry over very well to many other examples.

I have found it impossible to establish from his own blog what it is David wants. Sometimes it has seemed that he wants scientists to publish less but if you ask him if this means that negative results should not be published he quickly denies this. Sometimes he seems to claim that likelihood-based confidence intervals are the way to go but if you use these to judge if an effect is genuine you have the same problem as P-values. Sometimes he seems to want re-calibrated P-values. Sometimes he seems to want to use Bayes (as long as you don’t call him Bayesian).

Personally, I prefer to keep unrecalibrated P-values as one amongst many tools used but since it must be 40 years since I thought they were the same as posterior probabilities I am not at all shocked to find they are not.

However, whatever we do, subsequent generations of scientists will not perfectly replicate what earlier generations did(4). This does not shock statisticians who prefer true doubts to false certainties.

References

1. Dawid AP. Properties of Diagnostic Data Distributions. Biometrics 1976; 32: 647-658.
2. Guggenmoos-Holzmann I, van Houwelingen HC. The (in)validity of sensitivity and specificity. Statistics in Medicine 2000; 19: 1783-1792.
3. Miettinen OS, Caro JJ. Foundations of Medical Diagnosis – What Actually Are the Parameters Involved in Bayes Theorem. Statistics in Medicine 1994; 13: 201-209.
4. Youden J. Enduring values. Technometrics 1972; 14: 1-11.

Reply

July 19, 2014

David Colquhoun

Well, in my own work, significance tests are rare. As far as I can tell, I have followed my own dictum and never used the word “significant” (in its statistical sense) in any paper.

I’m in the business of estimation (of rate constants for transitions in Markov models of single ion channels; see http://www.onemol.org.uk/?page_id=10#chh2003 ) and likelihood intervals seem the natural way to express the uncertainty in the estimates from a single experiment. In fact what we usually give in papers is a plain old standard deviation of the mean of replicate experiments rather than any intra-experiment measure.

But these measures aren’t used for any sort of significance test. That’s a separate problem. I don’t trust significance tests for a different reason: systematic errors often overwhelm random errors in real life. So my interest in the false discovery rate is purely theoretical.

Reply
July 19, 2014

Mayo

Stephen: I see that some people are construing the “fallacy” of my post as tantamount to the fallacy of “transposing the conditional” (never mind that there isn’t a real conditional probability with a P-value), but I’ve never viewed the two as identical, have you? For example, when you responded to Goodman, you knew that he wasn’t falling into that howler. To reduce it to that takes all the interest away from it. I would distinguish it as well from the issues in the screening context. So what exactly is it?
I take it the problem is supposed to be that the mismatch is intuitively problematic on shared (if implicit) evidential grounds. What do you say?

Reply

July 20, 2014

Stephen John Senn (@stephensenn)

The Goodman comment was about regarding the replication probability as being relevant. It isn’t but that’s quite another matter.

Reply

July 21, 2014

Mayo

Stephen: You combined BOTH* problems in that letter–one of the reasons I find it so interesting.
*”Both” here refers to the problem of this post (the allegation that P-values overstate evidence against) AND the problem of replication probabilities.

Reply

July 20, 2014

coreyyanofsky

Mayo: “(never mind that there isn’t a real conditional probability with a P-value)”

Huh. I’d have said that a p-value is a real conditional probability — p-value = Pr(X_new < X | X = x_obs; H_0) — and this is why people who say that p-values aren't error probabilities (in the Mayo sense of the phrase) are wrong.

I mean, obviously Pr(X_new < X | X = x_obs; H_0) is identically equal to Pr(X < x_obs; H_0) for well-behaved H_0; the idea is just to expand the probability space with an IID replicate so that the p-value can be seen as simultaneously post-data (i.e., post observation of X) and pre-data (i.e., X_new is not observed). Then people who go around saying that p-values are not error probabilities can see the whole mathematical machinery that makes explicit the way in which p-values *are* error probabilities in a sense.

Reply

July 20, 2014

Mayo

Corey: I’m afraid yet another coupla things are being run together; I’ll try to clarify one of them at least on a modification of my blogpost, now that I’m back from travels.

Reply

July 18, 2014

John Stevens

The nature of probability in a Bayesian context is that it applies to any event in which we are uncertain. In the example of the event that the earth goes round the sun it is not a repeatable event. Before we know the truth, we could express our subjective probability and collect related evidence but there is no sense in which a population of solar systems tells us anything about this specific proposition that the earth goes round the sun.

Reply

July 18, 2014

Mayo

John Stevens: yes, Bayesians often say this. But think about it. A couple of things right off: if we really are dealing with a unique “event” when we talk about hypotheses and theories (e.g., H: the deflection effect due to the gravity of the sun is ~ 1.75 arc seconds) it is scarcely obvious that what we wish to do in quantifying (or qualifying) the uncertainty or error-proneness of inferences based on empirical tests (e.g., from 1919 to the present) is assign a posterior probability to H. Scientists don’t do that. Nor do they set sail by delineating an exhaustive list of hypotheses or theories while they are busy probing the universe, as they are. Granted, the Bayesian will allude to the so-called “catchall” (everything other than H), but what’s the probability of x given everything other than H? On the other hand, error probabilities refer to hypothetical repetitions of the sort we can carry out quite concretely with simulations, as in the case we’ve been speaking about–the Higgs particle. The mistakes are of a type–that is so wherever knowledge is possible. Finally, I might note, if the Bayesian account is really and truly limited to cases of a unique event, wherein even hypothetical repetitions are irrelevant and even meaningless, then how can it be at all relevant for replication or for science, and how can Bayesians at the very same time appeal to long run convergence to justify a cacophony of priors? Standard questions to ponder.

Reply

July 18, 2014

David Colquhoun

@Mayo

Here is a perfectly genuine question. Do you believe that it is proper to consider the false discovery rate when considering whether or not a medical screening is desirable?

If you do believe that, could you explain to me what’s different about the P value argument?

Reply

July 18, 2014

Mayo

David: On the first question there are perfectly ordinary frequentist probabilities of “false-positives” over populations of people, genes or whatnot, as when the truth of the “hypothesis” boils down to the occurrence of an event (e.g., the randomly selected student with score s is college ready, an example we’ve discussed a lot on this blog). But when we are interested in whether and what kind of evidential warrant there is for THIS hypothesis (with this test and data), e.g., the deflection effect is ~1.75, then I’m asking about how well errors have been ruled out by the tests, and am not asking about science-wise screening rates. The P-value argument we are discussing, on either side, does not involve screening. Similar numbers can be and have been used for science-wise screening, but that’s much more recent. I’ll try to stick to the issue at hand–it’s impossible to do more in comments, but please search the blog, and links to publications on the left-most column.

Reply

July 19, 2014

David Colquhoun

@Mayo

Thanks for the response, but I’m still not sure that I understand what you mean. Let me make the question more specific.

Suppose that we have a screening test for the development of Alzheimer’s disease in the over 60s, and we know that 5% of that population will eventually develop Alzheimer’s. Suppose too that we estimate that the test has a sensitivity of 0.8 and a specificity of 0.95.

Do you accept that, if you test positive, you have a 46% chance of developing the disease and a 54% chance of the result of the test being wrong?
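
The 46% and 54% figures follow from the stated prevalence, sensitivity and specificity by Bayes' theorem. The short sketch below reproduces the arithmetic; the function name is illustrative, and the calculation assumes (as the question does) that the person tested is drawn at random from the population to which the 5% prevalence applies.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test) for a randomly sampled member of the population."""
    true_positives = prevalence * sensitivity               # P(D and +)
    false_positives = (1 - prevalence) * (1 - specificity)  # P(not D and +)
    return true_positives / (true_positives + false_positives)


ppv = positive_predictive_value(prevalence=0.05, sensitivity=0.8, specificity=0.95)
print(f"P(disease | positive)    = {ppv:.3f}")      # 0.04 / 0.0875
print(f"P(no disease | positive) = {1 - ppv:.3f}")
```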

Reply

July 20, 2014

Stephen John Senn (@stephensenn)

Do you accept
a) that your calculations apply to a randomly sampled member of the population
b) that the figure applying to a patient presenting to a doctor and expressing concern about possible Alzheimer’s would be quite different
c) that if the test is based on a continuous score the categorisation positive or negative does not correspond to a continuous measure such as a P-value
d) that if such a test is used as a battery of tests it might provide useful information and
e) in terms of information, the elevating of a prior probability of 5% to a posterior of 46% is actually quite impressive and could, depending on losses involved, be quite useful in a decision problem
f) that one should be very careful how far one pushes population screening analogies especially if one is not prepared to make good the lack of objective knowledge about the probabilities by using subjective ones?

Reply

July 20, 2014

Mayo

Thanks Stephen! Especially e), f).

Reply

July 20, 2014

David Colquhoun

Stephen
Thanks, though I had hoped that @Mayo would give me her opinion on my question.

Certainly I accept (a) to (c). Your point (d), while true, is surely irrelevant to my question.

Your response (e) I take to be a (rather reluctant?) YES to my question.

Point (f) seems to suggest that you’ve guessed my next question: if you accept that the argument is sound for the screening problem, why do you reject the exactly analogous argument for significance tests? The prevalence (or prior) is admittedly rarely known with any precision but there are certainly circumstances where one can be confident that it’s pretty low. But whether or not you have a good estimate of its numerical value, surely it is an ordinary frequentist probability?

What does Mayo think about this?

Reply

July 21, 2014

Stephen John Senn (@stephensenn)

Point d) is not irrelevant unless one believes (which I don’t) that a scientist’s study should always finally decide an issue. If not, contributions that are not decisive can still be useful.

As regards point e) I am accepting the figures as you gave them but I still doubt in practice whether in the circumstance in which you wish to apply your screening analogy (ignoring the advice that screening itself should not naively be thought of in these terms) you could have such values.

As regards the screening analogy may I point out to you again that in all your discussions of screening you make the assumption that it is easier to estimate P(+|D), P(D) and P(+) than P(D|+), which you choose to calculate using the other three as primitive inputs. Why? If you have a sample of individuals with + and – symptoms and D (diseased) and N (not) you might like to reflect on the exercise involved in going about it like this.

As regards f), if your position is that a probability is frequentist because you can calculate it using a (theoretically but not in practice) possible frequency analogue, then, in view of the fact that MCMC uses frequencies of random numbers to calculate Bayesian probabilities, I assume that you maintain that all Bayesian probabilities can be regarded as relative frequencies. If that’s the way you think one can hardly argue. It is simply the converse of the common Bayesian position that all probabilities, whether based on frequencies or not, are subjective. I think many would find both positions rather extreme.

Reply

July 21, 2014

David Colquhoun

No, point (d) is certainly irrelevant to the screening example. There has been huge discussion about the advisability of running mass screening programs for things like breast cancer and Alzheimer’s. Most of the people who are screened are perfectly healthy and the aim is to find markers that will predict whether they will eventually get the disease. The consequences of false positives are dire: many women get mastectomies or chemotherapy unnecessarily. A few are saved but far more suffer.

Likewise, to be told that you’ll get dementia when the odds are that you won’t is devastating and cruel. These aren’t theoretical questions. They are very important for large numbers of people.

Of course the values that are estimated for sensitivity, specificity and prevalence are not precise but that’s always true. The fact is that for almost all screening tests that have been proposed, the predicted number of false positives renders the tests almost useless.

When we get to the difficult bit, extending the same argument to P values, I maintain that the prevalence (or prior, if you must) has rarely been estimated (Berger gives an example from astronomy where it has). That doesn’t affect the argument that it’s a perfectly well-defined frequentist probability which could, with sufficient effort, be determined. All you have to do in order to see that the argument matters is to look at the reductio ad absurdum. If you were daft enough to spend time testing homeopathic pills, which are identical with placebos, the prevalence is clearly zero and all positives are false positives. This matters in real life!

You say that you don’t believe that “a scientist’s study should always finally decide an issue”. Of course not. But the fact that, for example, a majority* of experiments in oncology and experimental psychology can’t be reproduced amounts to a crisis. It wastes time and money. And sad to say, part of the reason for that crisis is a result of the teaching of the very people who are meant to protect us against that sort of problem: statisticians.

One could argue that it is precisely because you can, retrospectively, measure P(D|+) that we know that Fisherian testing has misled people. But that’s hindsight and not much use to people now.

* some references at http://www.jove.com/blog/2012/05/03/studies-show-only-10-of-published-science-articles-are-reproducible-what-is-happening

Reply

July 21, 2014

Stephen John Senn (@stephensenn)

You were using screening as an analogy for significance testing, an analogy you like but I like less. In the case of significance testing, since you can have multiple independent studies you can have value from a study even if it is not strongly salient. The same would apply to screening if one test could lead to further tests. For example if a high blood pressure reading then led to further subsequent blood pressure measurements being taken. If this is not permitted in your screening example then don’t use it as an analogy to significance testing, something I queried your doing a long time ago.

But there is another point that is baffling about all this and here I don’t understand what Berger is on about. Although it is true that for large sample sizes a classical significance test will choose a more complex model (for example one in which there is a difference between two treatments) in favour of a simple model (one in which there is no difference) more easily than does the Bayesian information criterion (BIC), this is not true for small sample sizes where the reverse is the case. How is this possible? Significance more conservative than BIC? It seems to be the reverse of what Berger et al are claiming.

(One can also note, in this connection, the claim that is often made that for “difficult” problems where frequentist methods don’t have enough power one should consider Bayesian approaches. This seems to tend in the other direction altogether.)

The clue to the answer is in the quotation from Jeffreys I previously gave. If probability is the only basis for preferring one model to another, then as soon as its probability is more than one half, the complex model should be preferred to the simpler one. (This is not so if parsimony is an independent principle to probability but this was not Jeffreys’s view.) Thus by concentrating on posterior probabilities of 0.05 one is making exactly the mistake I alluded to before of complaining that a minimum weight of 30kg does not mean a minimum age of 30 months.

Reply

July 21, 2014

Mayo

Stephen: I also think it’s a mistake to run together screening cases in which a (behavioristic) concern for “controlling the noise in the network”, as Bross called it, is of primary concern. I take Berger and Sellke to be alluding to the evidential appraisal of a single null hypothesis.

Your point is well taken: in reality, “one test could lead to further tests”, and any appraisal that just looks at single isolated tests is missing the roles of tests in actual inquiry. The search for one-shot tests, outputting numbers, and the recent tendency to see everything in terms of high throughput screening, seem to be adding confusion to an already confused literature on tests.

Note, for example, a high enough sample size increases the power of significance tests to pick up trivial discrepancies. By contrast the way “power” is used in an Ioannidis-style screening computation results in higher power going hand in hand with “higher replication”.
The question is: How can we delineate these cases to illuminate what’s going on? How can we give an account of statistical inference that, while systematic, is not reducible to one-shot, isolated tests, but rather puts a series of probes together for a strong argument?
July 21, 2014

David Colquhoun

I can’t see a reply button on Mayo’s comment so I’ll reply to you.

I’d like to comment on the idea that “further tests are possible”. Of course I agree that eventually all science relies on independent replication. But many sorts of work have a single primary outcome, and are not likely to be replicated quickly. An obvious example is anything that comes from CERN. Big clinical trials must also now declare a single primary outcome.

In my own work with single ion channels, we wish to compare different receptor mechanisms (expressed as Markov processes with discrete states in continuous time). Such experiments rarely involve significance tests, but they take a long time and are unlikely to be replicated quickly.

If, as I think is true, that you accept that a screening test that gives 50% false positives is not going to be helpful. it still beats me why you think that a significance test that gives 50% of false positives is acceptable, on the grounds (if I understand correctly) that further tests can be done to sort it out later. I think that 100 percent of experimenters would think that it was highly unsatisfactory if, when they read that an effect is “significant”, there is actually a 50-50 chance that the effect isn’t real after all.

I’m trying hard to understand this problem. So my question becomes, do you (a) deny that the 50-50 chance is accurate in many experimental studies, or (b) agree that it may be accurate, but think it doesn’t matter because it will get sorted out later by “further tests”?

I’ve written an extended account of the stuff on my blog, which will appear soon on arXiv.
July 21, 2014

Mayo

David: I’m not sure if you’re writing to me or Stephen, but I’d like to reiterate that I think the situation with screening, even if we grant the numbers (generally a big “if”, but possible), is importantly different from the situation being discussed. I don’t say they’re unrelated, only that there are relevant differences–ones that are significant from a statistical inference perspective–and, moreover, the current issue remains regardless of how one rules regarding those screening settings, wherein control over error rates in the long run is the central interest.
Finally, when I do turn attention to behavioristic-screening contexts, I will also challenge the computations and arguments that have been popularized as of late. On that issue, then, for me, I’ll just say “to be continued…” (however look up “trouble in the lab” on this blog). (I’ll check out your blog when I can.)
July 22, 2014

David Colquhoun

Mayo. Thanks for the response.
I have explained my position more fully in a paper that includes simulated t tests. The preprint is now on arXiv at http://arxiv.org/abs/1407.5296
July 22, 2014

Stephen John Senn (@stephensenn)

I also can’t see the possibility of replying to David’s comment so I shall do so here.

First, I think that you are confusing two things
1) P-values
2) significance
If your point is that you don’t like the common use of P < 0.05 as significant then use a different threshold yourself for significance, better still, avoid the label. Failing that, use a different system altogether but please explain what. As I have pointed out before, in drug regulation the type I error for registration is set much lower (arguably, at 1/1600 since two phase III trials are required to be significant).

Second, whether or not you like this, model selection methods are commonly in use that are less stringent than P<0.05. I invite you to check out AIC and BIC and I would be interested to know in your own work using maximum likelihood whether you have ever had to choose between simpler and more complex models and if so how.

Third, I want to make it quite clear that I reject entirely this statement of yours "If, as I think is true, that you accept that a screening test that gives 50% false positives is not going to be helpful. " Such a test, depending on circumstance could be extremely useful. For example, if we had a simple test that if it labeled someone as being about to develop Ebola was right 50% of the time this could be extremely useful in disease control. There are many other circumstances one could think of and there simply is no such general rule defensible in terms of practical decision making.

Fourth, for reasons I have given before, the 'crisis of replication' cannot be solved by changing the threshold for 'significance'.
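
The 1/1600 figure mentioned under the first point can be reconstructed as follows, on the assumption (not spelled out in the comment) that each of two independent phase III trials must be significant at the two-sided 5% level in the direction favouring the drug, so that each contributes a one-sided type I error rate of 0.025.

```python
from fractions import Fraction

# Assumed reading of the registration requirement: two independent trials,
# each significant at two-sided 0.05 in the favourable direction.
two_sided_alpha = Fraction(5, 100)
one_sided_alpha = two_sided_alpha / 2          # 1/40 per trial
joint_type_one_error = one_sided_alpha ** 2    # both trials falsely positive

print(joint_type_one_error)  # 1/1600
```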

July 18, 2014

vl

I always thought that this idea of “overstating the evidence” was made on purely frequentist terms in the sense that one is relying on a purely frequentist definition of probability.

To paraphrase Wasserman, frequentist/bayes has nothing to do with the application of Bayes theorem and everything to do with the definition of probability.

Thus, conditional on an underlying generative model where the prior probabilities are _frequentist_ prior probabilities (e.g. incidence of breast cancer), we can evaluate the frequentist performance of a p<.05 cutoff with respect to frequentist metrics such as FDR, PPV, etc. and see that p<.05 corresponds to very poor characteristics with respect to these frequentist measures of error.

Reply

July 18, 2014

Mayo

VL: I was distinguishing “screening” cases. I discuss them on the blog and in published work. Here are some blog posts where they arise:

https://errorstatistics.com/2013/09/29/highly-probable-vs-highly-probed-bayesian-error-statistical-differences/

https://errorstatistics.com/2013/11/09/beware-of-questionable-front-page-articles-warning-you-to-beware-of-questionable-front-page-articles-i/

If an urn holds 50% “true” hypotheses (never mind how we’d know they were true), then one might say that an experiment involving randomly selecting from the urn has 50% chance of picking a “true” hypothesis. But now if I draw a particular hypothesis H(16): Newton’s theory of gravitation, I wouldn’t say Pr(H: Newton’s theory) = .5, would you? I wouldn’t say if I picked it out of an urn with 90% true hypotheses that suddenly the same hypothesis has 90% probability of truth. I call this the fallacy of probabilistic instantiation. All I’m saying is that these are different cases, and I was alluding to an argument wherein there is a criticism of using an (audited) p-value from a significance test to ascertain the evidence for a specific H(16), whatever H(16) is. (Nor do we assign .95 to a particular interval estimate resulting from a .95 estimatOR.) Please see posts for greater clarity, I’m traveling on a moving vehicle.

Reply

July 19, 2014

rasmusab

Depends on your view of probability, right? 🙂 With a subjective / logical (a la Jaynes) view of probability I could state that the probability that hypothesis H(16) is true is 50% until I know that it is Newton’s theory. In the same way that the probability of heads is 50% until I know the outcome of the flip. A specific hypothesis is going to be true (or false) independent of the probability of picking a true hypothesis from the hypothesis urn. Regarding your fallacy, Jaynes uses the “Mind Projection Fallacy” which is almost the opposite of yours (at least when applied to probability).

(Sometimes I get the feeling that a lot of the debate around Bayes / frequentism would go away if one realizes that the other part uses the word “probability” in a very different way.)

Reply

July 19, 2014

Mayo

Rasmusab: But here we ARE talking about the probability of H(16), not the probability that this method will output a true hypothesis. And the reason we are talking about it is that we are talking about the evidential appraisal one might give to it. A frequentist doesn’t assign a prob to H(16): Newton, but she does use probabilistic properties of methods in order to qualify inferences made about Newton, warranted or unwarranted by the data.

I think what needs clarifying is not the notion of probability–we all know there are differences, even if the subjectivist won’t precisely say—but rather the use to which probability may be put in expressing the error proneness of methods. In the account I favor, it is used to quantify the probative capacity of the test, from which I can then make an (ampliative) inference to a claim, and how warranted it is.

Reply

July 21, 2014

Stephen John Senn (@stephensenn)

I think Wasserman’s distinction is interesting but I see it rather differently*. The Bayesian prefers to address and tackle the inverse probability (data to model) directly. The frequentist prefers to stay with the direct probability formulation (model to data). There are problems with both and I am not here arguing for one or the other but I regard the inverse v direct difference as more important than ‘subjective’ versus ‘objective’.

* Admittedly I am guessing as to what he means about the definition of probability.

Reply

July 21, 2014

Mayo

Stephen: The frequentist moves from data to model by ascertaining the probability that the method would not have rejected Ho, were Ho correct. If this probability is high, then there’s an indication for not-Ho. Ironically, it’s the frequentist who performs a genuinely inductive inference (going beyond the data and “givens”). It is error prone, so it is qualified by means of error probabilities. The Bayesian computation is a purely deductive assignment to an “event”. I’ve never considered the probability of an event as a measure of “the evidence” for the event, even in contexts where we have such probabilities.
All that said, I think the issue of the current post differs from the “screening” context. Here the issue is using a P-value in an evidential assessment of a given null (or discrepancy from it). I have revised the title and added a new section 7.

Reply
July 22, 2014

omaclaren

@Stephen:
This is precisely how I see it! Give or take a misinterpretation on my behalf…

My interpretation:

Both are attempting to solve the ‘inverse problem’ of going from data to model, i.e. inference. This problem is inverse with respect to the ‘forward problem’ of going from model to data, i.e. standard probability calculations. Funnily enough Wasserman actually has this as Figure 1 in ‘All of Statistics’.

There are (broadly speaking) two ways of solving an inverse problem – actually attempt to construct the inverse operator to your forward operator, giving you an ‘inverse probability’ calculation (Bayes) for your parameters, or repeatedly solve the forward problem for different parameter values and reason by contradiction (a la Popper, Fisher) to eliminate values which give mismatches.

Confusingly, as Stephen says, the Bayesian is attempting to tackle the inverse problem ‘directly’ in the sense of actually constructing the inverse mapping; the frequentist is reasoning ‘indirectly’ in the same sense that proof by contradiction is indirect reasoning.

This is also why the Bayesian requires a ‘prior’ or ‘regularisation’ – the inverse function is in general non-unique or non-existent without further assumptions. There are many problems where introducing these assumptions directly in order to construct the inverse doesn’t seem to be a terrible idea; however I do think that the most general approach of the two is the ‘proof by contradiction’ method, again essentially because of the same reasons that Popper gave when he discussed the problem of induction. Combinations of the two approaches don’t seem that unnatural in practice, however, since mathematicians and scientists frequently employ both direct and indirect arguments and calculations in their work.

So to re-emphasise: the Bayesian solution to the inverse problem (inference) is direct/deductive since they are trying to directly construct the inverse function (an inverse probability calculation for parameters given data), while the Frequentist approach is indirect/analogous to proof by contradiction since they retain the forward calculation direction (probability calculations given parameters) and try to ‘learn from error’.

Which finally brings us to the extra wrinkle that Mayo introduces to tackle the whole Duhem issue (for the Frequentist case) – actual proof by contradiction must be replaced by a sort of probabilistic contradiction. In this sense the proof by contradiction then becomes an inductive move, further complicating everything!

My two cents.

Reply

July 22, 2014

Mayo

“I think he’s got it” or is getting there. (We haven’t conversed before, so I’m not sure.) And now we turn to the manner in which we solve our Duhemian problems. Once again, there is a Bayesian and an error statistical approach. Duhem’s own (subjective) approach is exactly akin to how the Bayesian philosopher purports to “solve” Duhem: Assign beliefs to all of the background and auxiliary claims used to entail the observation, and then “blame” the anomaly or failed prediction on the one with the lowest probability-belief assignment. The error statistical philosopher finds this the wrong way to go. (See, for example, my paper: “Duhem’s Problem, The Bayesian Way and Error Statistics: What’s Belief Got to Do With it?”) http://www.phil.vt.edu/dmayo/personal_website/(1997) Duhem’s Problem the Bayesian Way and Error Statistics or What’s Belief Got to Do with It.pdf

We don’t assign degrees of belief to the auxiliaries (to solve Duhem); they are either falsified directly, or, much more cleverly, we find a way to distinguish their effects. (As in the example I often mention in relation to tests of GTR: a mirror distortion does not look like a deflection effect (or a shadow effect, or a corona effect, etc.); or in cooking, a “too much salt” error is distinct from “too much water”.) Thus, we can rule out many by distinguishing their error properties (usually through deliberate probes), and what we cannot distinguish, we usefully report, e.g., we cannot distinguish gravitational waves from stellar dust, or however that recent case went. Of course, experimental planning can do a great deal to control these sources ahead of time, e.g., via randomization.

Reply

July 22, 2014

omaclaren

Oh I buy this, philosophically, or at least metaphorically – it fits the mathematical picture of inverse problems in my head*.

Now for me the question becomes how to implement this approach in problems that come up for me day-to-day. While they don’t have a monopoly, Bayesian-oriented statisticians do seem to be taking the lead in pursuing and popularising new tools to tackle things like inference for differential equations or individual-based models and ‘uncertainty quantification’ thereof. And given one can view Bayesian inference as one of constructing inverse functions through regularisation procedures it then becomes much less objectionable in principle.

As C. Glymour says in your and Spanos’ book –

‘Bayesian statistics is one thing (often a useful thing, but not germane to my topic); Bayesian epistemology is something else.’

I would have to be very stubborn (even more than I am!) to refuse to use practical tools they provide, if applicable.

As you say, though, Duhemian problems are important. Some further questions for me then become

– Can Bayesian statisticians handle their Duhemian problems in practice, e.g. by breaking from Bayesian epistemology if they need to?

– Can Frequentist statisticians handle their inverse problems and Duhemian problems in practice as well as Bayesians can?

– If yes to both, how important is it to choose between them (again, in practice) and are there other choices to make instead?

Probably my answers would be yes, yes and not much/yes (all qualified)…on the other hand I can see how it would be interesting philosophically, and maybe for the development of new approaches, to understand what’s going on once everyone starts breaking from their purported principles. I would definitely be interested to read a philosophical account anyway.

*As a strange aside I came to the rough form of my view in the previous comment while trying to understand when to use proof by contradiction in mathematics. Initially I didn’t like it, but I came to understand it as a method for solving inverse-style problems, that is, for reconciling the directions of desired inference and available calculation when these don’t naturally align (and when constructing an inverse is too hard). I also like the quote from G. Polya:

‘Both “reductio ad absurdum” and indirect proof are effective tools of discovery which present themselves naturally to an intent mind. Nevertheless, they are disliked by a few philosophers and many beginners; satirical people and tricky politicians do not appeal to everybody.’

Reply

July 19, 2014

Giridhar R Babu

I would summarize the following points with reference to P values from the notes of my professor, Sander Greenland and the textbook Modern epidemiology, 3rd edition.

P-value as a continuous measure of compatibility between a hypothesis and data: Should it be used to force a qualitative decision?

• The “decision-making mode of analysis” is guided by the “nearly universal acceptance” of the 5% cutoff point for significance testing: What was the circumstance that gave rise to popularity of the 5% significance level? Also, was a fixed alpha level suggested by Neyman and Pearson in their original formulation of hypothesis testing? No, not really.

Regarding the Alternative Hypothesis
• Equivalence testing & equivalent interval. The test hypothesis (Ho) is, for example, that the two treatments are NOT equivalent.
– The alternative hypothesis contains a large range of values, compatible with any observed data (two-sided).
– Ho: a specific value, consistent with a much narrower range of possible data.
• A major defect of the Neyman-Pearson alternative hypothesis was its assumption of a correct statistical model. A better formulation would be “Either the null is false or else the statistical model is wrong”.

Why can’t a P-value be interpreted as the probability of the observed data under the test hypothesis?

Data configuration = any distinct combination of possible values of the variables under study. The likelihood is defined as the probability that you would observe the data you did observe, given that the parameter of the statistical model takes a particular value, i.e. likelihood = Pr(data | parameter).

Problems with P-values:
P-values are calculated from statistical models. A major problem with P-values and tests (including those in all commercial software) is that the assumed models make no allowance for sources of bias, apart from confounding by controlled covariates. A statistical model can be understood as a set of assumptions about the distributions of the variables under study and how the data were obtained.

Correct definitions of different types of P-values:

An upper one-tailed P-value is the probability that we will have a test statistic (computed from data according to a statistical model) that is greater than or equal to its observed value, assuming that (a) the test hypothesis is correct, and (b) there is no source of bias in the data collection or analysis processes.

Note: Given an actual data set and a specified statistical model, you can always calculate the observed value of the specified test statistic and find both the upper and lower one-tailed P-values from the probability distribution of the test statistic.

The two-tailed P-value is usually defined as twice the smaller of the upper and lower P-values.
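
To make these definitions concrete, here is a minimal sketch that computes the upper, lower and two-tailed P-values for an observed test statistic. It assumes, purely for illustration, that the statistic is standard normal under the test hypothesis; the function name `normal_tail_p_values` is hypothetical and not taken from any of the sources cited here.

```python
import math

def normal_tail_p_values(z):
    """Upper, lower and two-tailed P-values for an observed z statistic.

    Assumes the test statistic is standard normal under the test hypothesis
    (an illustrative choice, not something the definitions above require).
    """
    upper = 0.5 * math.erfc(z / math.sqrt(2))   # Pr(Z >= z)
    lower = 0.5 * math.erfc(-z / math.sqrt(2))  # Pr(Z <= z)
    # Twice the smaller tail, capped at 1 so it remains a probability.
    two_tailed = min(1.0, 2 * min(upper, lower))
    return upper, lower, two_tailed

upper, lower, two_tailed = normal_tail_p_values(1.96)
print(f"upper={upper:.4f}  lower={lower:.4f}  two-tailed={two_tailed:.4f}")
```

For a continuous statistic the upper and lower values sum to one, which is why only the smaller of the two is doubled.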

Regarding the comment on the relationship between Type I and Type II error probabilities in a study, the real question is how the probability of a Type II error changes when we change the alpha level. You can think of the Type I error rate (α) as the “false-positive probability”, the Type II error rate (β) as the “false-negative probability” and 1−β as the “sensitivity” of a diagnostic test or classification method. Hence, Type I and Type II errors arise only when the study objective is a qualitative decision.
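
The dependence of β on α can be sketched for one simple case. The example below assumes a one-sided z test with known variance and a true effect `delta` measured in standard-error units; both the test and the value of `delta` are hypothetical choices made only to show the direction of the trade-off.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def type_ii_error(alpha, delta):
    """Type II error probability of a one-sided z test at level alpha.

    delta is the true effect in standard-error units (hypothetical input).
    The test rejects when Z exceeds the (1 - alpha) normal quantile, and
    Z ~ N(delta, 1) when the alternative is true.
    """
    critical = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(critical - delta)

for alpha in (0.001, 0.01, 0.05, 0.10):
    beta = type_ii_error(alpha, delta=2.5)
    print(f"alpha={alpha:<6} beta={beta:.3f} power={1 - beta:.3f}")
```

With the effect held fixed, loosening α lowers β and tightening α raises it, which is the false-positive versus false-negative trade-off described above.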

As I mentioned in my tweet (@epigiri), two papers are useful in clarifying the use of P-values.

Sander Greenland. Living with p values: resurrecting a Bayesian perspective on frequentist statistics. http://www.ncbi.nlm.nih.gov/m/pubmed/23232611/

Multiple comparisons and association selection in general epidemiology http://m.ije.oxfordjournals.org/content/37/3/430.full

@david_colquhoun @learnfromerror simply put, 2articles can summarise the debate on p values http://t.co/zw04jyELK0 http://t.co/amNSU80Cu9— Randomly Unbiased (@epigiri) July 19, 2014

Reply

July 19, 2014

Mayo

Giridhar:
Of course I’m very familiar with Greenland’s work. Around a year ago, we had some discussions with him on this blog:
https://errorstatistics.com/2013/06/26/why-i-am-not-a-dualist-in-the-sense-of-sander-greenland/

which originated here:
https://errorstatistics.com/2013/06/14/p-values-cant-be-trusted-except-when-used-to-argue-that-p-values-cant-be-trusted/

Reply

July 19, 2014

Mayo

SATURDAY NIGHT READING: I for one have been (and am still in the midst of) traveling for the past week, and apologize for only being able to quickly read many comments that call for slower scrutiny. There are a number of links I and others have given as well. So, I will not post what I had on draft but just read through this and try to identify the central issues. I am very wary of taking up lots of (related but different) issues at once. It gets too chaotic. Fortunately, this is a “slow blog”.

Reply
July 22, 2014

Mayo

David: This is intended to respond to your comment: https://errorstatistics.com/2014/07/14/the-p-values-overstate-the-evidence-against-the-null-fallacy/#comment-88320

I’ve also explained my position further.

As I say, the numbers aren’t the problem, it’s the fact that a “hypothesis” being true can’t shift its meaning within a given computation. (At least not without falling into the ‘fallacy of probabilistic instantiation’).

A couple of relevant posts:
https://errorstatistics.com/2012/04/28/3671/

https://errorstatistics.com/2012/05/05/comedy-hour-at-the-bayesian-epistemology-retreat-highly-probable-vs-highly-probed/

Related papers:
Mayo, D. G (1997a), “Response to Howson and Laudan,” Philosophy of Science 64: 323-333.
http://www.phil.vt.edu/dmayo/personal_website/(1997) Response to Howson and Laudan.pdf

Mayo, D. G. (1997b), “Error Statistics and Learning from Error: Making a Virtue of Necessity,” in L. Darden (ed.) Supplemental Issue PSA 1996: Symposia Papers, Philosophy of Science 64, S195-S212.
http://www.phil.vt.edu/dmayo/personal_website/(1997) Error Statistics and Learning from Error Making a Virtue of Necessity.pdf

Reply
July 22, 2014

David Colquhoun

@Mayo (or anyone else) I’d be interested to hear your opinion about another problem that’s relevant to this discussion.

“Some statisticians would say that, once you have observed, say, P = 0.047, that is part of the data so we should not include the ‘or less than’ bit. That is indisputable if we are trying to interpret the meaning of a single test that comes out with P = 0.047. To interpret this we need to see what happens in an imaginary series of experiments that all come out with P near to 0.05.”
That’s easily investigated by simulated t tests (page 8, http://arxiv.org/abs/1407.5296).

Reply

July 22, 2014

Mayo

David: Yes, the attained P-value should be reported.

Reply

July 23, 2014

David Colquhoun

I’m sorry, you seem to have misunderstood my question.

I was asking your opinion about the following case. One wishes to interpret the result of a single experiment that gives P = 0.047. To help with this, you simulate 100k t tests. Do you then look only at those that come out with, say, 0.045 < P < 0.05, or do you look at all results which give P < 0.05, when assessing things like the fraction of false positives?
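
The two selection rules in this question can be compared in a small simulation. The sketch below is not the simulation from the arXiv paper: it swaps in two-sided z tests with known variance for the t tests so that it needs only the standard library, and `prior_null` (the share of true nulls) and `effect` (the mean of z when the null is false) are hypothetical settings.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def false_positive_fractions(n_tests=200_000, prior_null=0.5, effect=2.8, seed=1):
    """Simulate two-sided z tests and return the fraction of true nulls among
    (a) all tests with P < 0.05 and (b) tests with 0.045 < P < 0.05.

    prior_null and effect are hypothetical settings chosen for illustration.
    """
    rng = random.Random(seed)
    counts = {"all": [0, 0], "near": [0, 0]}  # [nulls, total] per selection rule
    for _ in range(n_tests):
        is_null = rng.random() < prior_null
        z = rng.gauss(0.0 if is_null else effect, 1.0)
        p = 2 * (1 - STD_NORMAL.cdf(abs(z)))
        if p < 0.05:
            counts["all"][0] += is_null
            counts["all"][1] += 1
            if p > 0.045:
                counts["near"][0] += is_null
                counts["near"][1] += 1
    return {rule: nulls / total for rule, (nulls, total) in counts.items()}

fractions = false_positive_fractions()
print(fractions)
```

Under these assumed settings the two rules give noticeably different fractions, so which set of simulated results one conditions on matters for the answer.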

Reply

July 23, 2014

Stephen John Senn (@stephensenn)

Let me answer this again
1) If you want to calculate a Bayesian posterior distribution you should use the exact P-value
2) If proceeding the way you propose you need to calculate the likelihood under the null (easy if a precise hypothesis) and the likelihood under the alternative (hard – because a mixture of likelihoods corresponding to each distinct value under the null).
3) In practice you can’t do this except by coming off the fence. You need a prior distribution for all values of the parameter (including the single value under the null).
4) You get very different values depending on what you assume
5) If doing a simulation, yes you do need to retain only those that are (to a standard of precision to be determined) equal to P=0.047
6) This is not a good way to do it. Have you considered MCMC?
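
Points 2 to 4 can be illustrated with a closed-form case. The sketch below assumes a statistic z with z | θ ~ N(θ, 1), a point null θ = 0, and a N(0, τ²) prior on θ under the alternative, so that z is N(0, 1 + τ²) marginally under the alternative. The prior family, the values of `tau` and the 0.5 prior mass on the null are all assumptions made for illustration.

```python
from statistics import NormalDist

def posterior_null(p_value, tau, prior_null=0.5):
    """Posterior probability of a point null H0: theta = 0 given a two-sided P-value.

    Assumes z | theta ~ N(theta, 1) and, under the alternative,
    theta ~ N(0, tau^2), so that z ~ N(0, 1 + tau^2) marginally.
    tau and prior_null are hypothetical choices.
    """
    z = NormalDist().inv_cdf(1 - p_value / 2)
    like_null = NormalDist(0, 1).pdf(z)
    like_alt = NormalDist(0, (1 + tau**2) ** 0.5).pdf(z)
    return prior_null * like_null / (prior_null * like_null + (1 - prior_null) * like_alt)

for tau in (1, 2, 5, 20, 100):
    print(f"tau={tau:<4} Pr(H0 | P=0.047) = {posterior_null(0.047, tau):.3f}")
```

Within this one family the posterior on the null ranges from roughly a third to above 0.9 as τ grows, which is one way of seeing how strongly the answer depends on the assumed prior.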

Reply

July 23, 2014

Mayo

Stephen:
Thank you for restoring my hope that there are still some sane voices out there. To repeat your key points:
“the likelihood under the alternative (hard – because a mixture of likelihoods corresponding to each distinct value under the null).
3) In practice you can’t do this except by coming off the fence. You need a prior distribution for all values of the parameter (including the single value under the null).
4) You get very different values depending on what you assume”

So how do they propose to get the likelihood under the alternative? Can they take it to be the alternative giving max likelihood, as some do? Or must they “come off the fence”?

Reply

July 23, 2014

vl

I would argue that if one hasn’t articulated an alternative hypothesis, then one hasn’t articulated a hypothesis/theory. IMO articulating a null distribution while refusing to articulate the theory is not scientific.

When most non-statisticians think about frequentist guarantees what they really care about is given a decision rule (for what to believe), how often is my decision correct, how often is it wrong?

Estimating how often I’m correct or wrong under any testing rule thus requires defining the prior and the alternative hypothesis you have in mind. In some cases, there may be a range of possible alternative hypotheses corresponding to a mixture distribution.

Regarding Stephen Senn’s point #2: perhaps it’s hard to solve this analytically, but if I’m understanding his point this should amount to a few lines of R or Python and a bit of simulation. Not a good reason not to consider thinking about this.

If one wants to avoid articulating the alternative hypothesis, then forget about having any idea about the probability of being right or wrong by a frequentist definition of probability.

Put another way, an alternative hypothesis may turn out to be a “wrong” answer, but without specifying one, I haven’t even asked a well-posed question.

Reply

July 23, 2014

Keith O'Rourke

Stephen:

Agree, but want to point out that a simple two-stage simulation can be used here instead of MCMC.

1. Draw the unknown parameter from the prior distribution.
2. Draw a sample from the data-generating distribution with the parameter value from 1.
Only keep draws where the simulated P-value = 0.047.

The kept distribution will be a sample from the posterior.
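
The two-stage scheme can be sketched as follows. Because a simulated P-value is essentially never exactly 0.047, the sketch keeps draws within a tolerance `tol` of the target, in line with the “standard of precision” mentioned earlier in the thread. The prior (a 0.5 lump on θ = 0, otherwise θ ~ N(0, τ²)), the reduction of the data to a single z ~ N(θ, 1), and every parameter value are assumptions made for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def abc_posterior_null(target_p=0.047, tol=0.005, n_draws=200_000,
                       prior_null=0.5, tau=3.0, seed=7):
    """Rejection-ABC estimate of Pr(theta = 0 | P-value close to target_p).

    Hypothetical prior: theta = 0 with probability prior_null, otherwise
    theta ~ N(0, tau^2). The data are reduced to a single z ~ N(theta, 1).
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(n_draws):
        # Stage 1: draw the parameter from the prior.
        theta = 0.0 if rng.random() < prior_null else rng.gauss(0.0, tau)
        # Stage 2: draw data given that parameter and compute the two-sided P-value.
        z = rng.gauss(theta, 1.0)
        p = 2 * (1 - STD_NORMAL.cdf(abs(z)))
        # Keep the draw only if the simulated P-value is within tol of the target.
        if abs(p - target_p) < tol:
            kept.append(theta)
    # Draws from the continuous part of the prior are (almost surely) never exactly 0.
    share_null = sum(theta == 0.0 for theta in kept) / len(kept)
    return share_null, kept

share_null, kept = abc_posterior_null()
print(f"kept {len(kept)} draws; estimated Pr(H0 | P near 0.047) = {share_null:.3f}")
```

The kept values of θ approximate the posterior given a P-value near the target; a smaller `tol` moves the approximation closer to conditioning on the exact P-value, at the cost of keeping fewer draws.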

It is called ABC http://en.wikipedia.org/wiki/Approximate_Bayesian_computation and I have used it with some limited success in teaching people with limited mathematics.

I have a comment at the bottom giving details and references http://magazine.amstat.org/blog/2013/10/01/algebra-and-statistics/

Reply

July 22, 2014

Stephen John Senn (@stephensenn)

The difference is discussed in my book Statistical Issues in Drug Development but it does not really need simulation.

Reply
July 22, 2014

Mayo

I was sent a link to a blogpost whose author tells us that he will apply “the modern” approach to inference to criticize those Schnall results in social psych (on unscrambling cleanliness words causing decreased severity in disapproving immoral actions). The “modern” approach is to compute a Bayes factor and assume a .5 prior. http://www.nicebread.de/reanalyzing-the-schnalljohnson-cleanliness-data-sets-new-insights-from-bayesian-and-robust-approaches/
Although I certainly wouldn’t want to defend Schnall’s statistical inference –it falls down on error statistical grounds– it’s puzzling that this author sees no reason to defend ignoring error probabilities and using a spiked prior of .5. I don’t think this will help social psych to become more replicable.

We discussed the “crisis of replicability” in psych recently: https://errorstatistics.com/2014/06/30/some-ironies-in-the-replication-crisis-in-social-psychology-1st-installment/

Reply


© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.
