I first blogged this letter here. Below the references are some more recent blog links of relevance to this issue.

*Dear Reader: I am typing in some excerpts from a letter Stephen Senn shared with me in relation to my April 28, 2012 blog post. It is a letter to the editor of Statistics in Medicine in response to S. Goodman. It contains several important points that get to the issues we’ve been discussing. You can read the full letter here. Sincerely, D. G. Mayo*

STATISTICS IN MEDICINE, LETTER TO THE EDITOR

From: Stephen Senn*

Some years ago, in the pages of this journal, Goodman gave an interesting analysis of ‘replication probabilities’ of p-values. Specifically, he considered the possibility that a given experiment had produced a p-value that indicated ‘significance’ or near significance (he considered the range p=0.10 to 0.001) and then calculated the probability that a study with equal power would produce a significant result at the conventional level of significance of 0.05. He showed, for example, that given an uninformative prior and (subsequently) a resulting p-value that was exactly 0.05 from the first experiment, the probability of significance in the second experiment was 50 per cent. A more general form of this result is as follows: if the first trial yields p=α, then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5.
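Goodman’s 50 per cent figure can be checked directly. The sketch below is my own illustration, not from the letter: working in z-score units, a flat prior plus a first result of exactly p = 0.05 two-sided (z₁ ≈ 1.96) gives a predictive distribution for the replicate’s z-statistic of N(z₁, √2), so the chance that an equally powered replication again clears the 0.05 critical value is exactly one half.

```python
from statistics import NormalDist
import random

std = NormalDist()
z1 = std.inv_cdf(0.975)  # z giving two-sided p = 0.05 exactly (~1.96)

# With a flat prior, the posterior for the true effect (in z units) after
# observing z1 is N(z1, 1); an equally powered replication then has
# z2 ~ N(delta, 1), so predictively z2 ~ N(z1, sqrt(2)).
predictive = NormalDist(mu=z1, sigma=2 ** 0.5)
p_repl = 1 - predictive.cdf(z1)  # P(replicate exceeds the critical value again)
print(p_repl)  # 0.5 exactly: the predictive distribution is centred on z1

# Monte Carlo check of the same calculation
rng = random.Random(1)
n = 200_000
hits = sum(rng.gauss(rng.gauss(z1, 1), 1) > z1 for _ in range(n))
print(round(hits / n, 2))  # ≈ 0.5
```

The point is that the predictive distribution is centred exactly on the first trial’s critical value, so the replicate falls on the significant side half the time regardless of the nominal level.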

I share many of Goodman’s misgivings about p-values and I do not disagree with his calculations (except in slight numerical details). I also consider that his demonstration is useful for two reasons. First, it serves as a warning for anybody planning a further study similar to one just completed (and which has a marginally significant result) that this result may not be matched in the second study. Second, it serves as a warning that apparent inconsistency in results from individual studies may be expected to be common and that one should not overreact to this phenomenon.

However, I disagree with two points that he makes. First, he claims that ‘the replication probability provides a means, within the frequentist framework, to separate p-values from their hypothesis test interpretation, an important first step towards understanding the concept of inferential meaning’ (p. 879). I disagree with him on two grounds here: (i) it is not necessary to separate p-values from their hypothesis test interpretation; (ii) the replication probability has no direct bearing on inferential meaning. Second, he claims that ‘the replication probability can be used as a frequentist counterpart of Bayesian and likelihood models to show that p-values overstate the evidence against the null hypothesis’ (p. 875, my italics). I disagree that there is such an overstatement. …

THE TRUE PROBLEM WITH p-VALUES

The uninformative prior that Goodman considers causes no difficulties for p-values at all. The problem is rather that the ‘uninformative’ prior is rarely appropriate. In general, however, it is not possible to survive as a Bayesian on uninformative priors. It is a key feature of Jeffreys’s approach to scientific inference, for example, that he recognized that some way had to be found for down-weighting the effect of higher order terms [8]. To give another example, this approach is essential in dealing with carry-over in analysing cross-over trials [9].

If, in testing the effect of a treatment in a clinical trial, we have a lump of probability on the treatment effect being zero, then, as is well known from the Jeffreys–Good–Lindley paradox, a p-value overstates the evidence against the null [10; 11]. So, of course, do Bayesian posterior statements of the sort made by Student [3].

The important distinction here is Cox’s distinction between precise and dividing hypotheses [12]. In the Neyman–Pearson framework, the former corresponds to testing H0: τ = 0 against H1: τ ≠ 0, whereas the latter corresponds to testing H0: τ ≤ 0 against H1: τ > 0. The Bayesian analogue of the first case is to have a lump of probability on τ = 0. Where such a probability is appropriate, then from a Bayesian perspective the p-value will have most unfortunate properties.
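The Jeffreys–Good–Lindley point can be seen numerically. The sketch below is my own illustration (the 50:50 prior weights and the N(0, 1) scale for τ under the alternative are assumptions, not Senn’s): with a lump of prior probability on the precise null τ = 0, a result just significant at p = 0.05 leaves the null with a posterior probability of roughly a third — far above the 0.05 a naive reading of the p-value might suggest.

```python
from statistics import NormalDist

std = NormalDist()
z = std.inv_cdf(0.975)  # observed z giving two-sided p = 0.05 exactly

# Precise null H0: tau = 0, carrying a 'lump' of prior probability 0.5.
# Alternative H1: tau ~ N(0, 1) in z units, so marginally z ~ N(0, sqrt(2)).
lik_null = std.pdf(z)
lik_alt = NormalDist(0, 2 ** 0.5).pdf(z)

bf01 = lik_null / lik_alt      # Bayes factor in favour of the null
post_null = bf01 / (1 + bf01)  # posterior P(H0), with prior P(H0) = 0.5
print(round(post_null, 2))     # ≈ 0.35
```

With a dividing hypothesis (H0: τ ≤ 0 with a smooth prior) no such discrepancy arises; it is the lump of probability at zero that drives the result.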

It is important to realize, however, that the reason that Bayesians can regard p-values as overstating the evidence against the null is simply a reflection of the fact that Bayesians can disagree sharply with each other. For example, suppose that we have two Bayesians who agree, before seeing some data, that the probability that the treatment is beneficial is 0.5. Given that the treatment is effective, they have the same conditional prior distribution as to how effective it will be. However, one of them, the ‘pessimist’, believes that if not beneficial it may be harmful. The other, the ‘optimist’, believes that if not beneficial it will be harmless.

After running the trial the pessimist now believes with probability 0.95 that the treatment is beneficial, whereas the optimist now believes with probability 0.95 that it is useless. The reason is that the result of the trial is marginally positive. For the optimist, such a result could easily have arisen under the ‘null’, which is concentrated on zero. In fact, if most of the prior belief under the alternative corresponds to a large treatment benefit, a moderate observed benefit is more likely under the null than under most of the alternative. Hence, the optimist is now inclined to believe the null. For the pessimist, however, such a result is even less likely under the ‘null’ than under the alternative, since both stretch away towards infinity from zero but the point estimate is in the alternative region. Hence, the pessimist is now inclined to believe the alternative hypothesis.
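The optimist/pessimist divergence can be made concrete. The numbers below are my own illustration, not Senn’s: both Bayesians give the treatment prior probability 0.5 of being beneficial and, if beneficial, expect a large effect (δ ~ N(6, 1) in z units); the pessimist’s ‘not beneficial’ prior mirrors this on the harmful side, while the optimist’s is a point mass at zero. A marginally positive result, z = 2, then pushes the two posteriors in opposite directions.

```python
from statistics import NormalDist

z_obs = 2.0    # marginally positive trial result, in z units
SQRT2 = 2 ** 0.5

# Shared prior given 'beneficial': delta ~ N(6, 1), so marginally z ~ N(6, sqrt(2)).
lik_benef = NormalDist(6, SQRT2).pdf(z_obs)

# Optimist's 'not beneficial': delta = 0 exactly (harmless), so z ~ N(0, 1).
lik_opt_null = NormalDist(0, 1).pdf(z_obs)

# Pessimist's 'not beneficial': possibly harmful, delta ~ N(-6, 1), so z ~ N(-6, sqrt(2)).
lik_pess_null = NormalDist(-6, SQRT2).pdf(z_obs)

# Posterior P(beneficial) for each, starting from prior probability 0.5.
post_opt = lik_benef / (lik_benef + lik_opt_null)
post_pess = lik_benef / (lik_benef + lik_pess_null)
print(round(post_opt, 2))   # ≈ 0.09: optimist now doubts the benefit
print(round(post_pess, 2))  # ≈ 1.0: pessimist is nearly certain of it
```

Same data, same prior probability of benefit, yet the posteriors land on opposite sides — exactly because a moderate observed benefit is unlikely under a prior concentrated on large effects, and what fills the ‘null’ region differs between the two.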

REFERENCES

1. Lehmann EL. Testing Statistical Hypotheses. Chapman and Hall: New York, 1994.

2. Fisher RA. Statistical methods for research workers. In Statistical Methods, Experimental Design and Scientific Inference, Bennett JH (ed.). Oxford University Press: Oxford, 1925.

3. Student. The probable error of a mean. Biometrika 1908; 6:1–25.

4. Cushny AR, Peebles AR. The action of optical isomers. II. Hyoscines. Journal of Physiology 1905; 32:501–510.

5. Fisher RA, Yates F. Statistical Tables for Biological, Agricultural and Medical Research. Longman: Harlow, 1974. (First published Oliver and Boyd: Edinburgh, 1938.)

6. Royall RM. The effect of sample size on the meaning of significance tests. American Statistician 1986; 40:313–315.

7. Senn SJ, Richardson W. The first t-test. Statistics in Medicine 1994; 13:785–803.

8. Jeffreys H. Theory of Probability. Clarendon Press: Oxford, 1961.

9. Senn SJ. Consensus and controversy in pharmaceutical statistics (with discussion). Statistician 2000; 49:135–176.

10. Lindley DV. A statistical paradox. Biometrika 1957; 44:187–192.

11. Bartlett MS. A comment on D. V. Lindley’s statistical paradox. Biometrika 1957; 44:533–534.

12. Cox DR. The role of significance tests. Scandinavian Journal of Statistics 1977; 4:49–70.

13. Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine 1999; 130:995–1004.

14. Goodman SN. Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine 1999; 130:1005–1013.

*Department of Epidemiology and Public Health

Department of Statistical Science

University College London

1-19 Torrington Place

London WC1E 6BT, U.K.

For a relevant post of mine:

“P-values overstate the evidence against the null: legitimate or fallacious?”

For related guest posts by Senn:

“The Pathetic P-value” and its sequel, “Double Jeopardy: Judge Jeffreys Upholds the Law”

I’m curious what Senn thinks of Christian Robert’s Bayesian testing via mixtures. I believe it was developed at least in part because of these issues.

Om: You can ask him. What issues? Are you saying it was developed to deal with Goodman’s issues that Senn denies are issues?

I suppose my comment is me asking him, hoping he reads this post.

The issue I’m referring to is how to deal with point nulls in Bayesian testing, e.g., nicer alternatives to Jeffreys’s approach.