Dear Reader: I am typing in some excerpts from a letter Stephen Senn shared with me in relation to my April 28, 2012 blogpost. It is a letter to the editor of Statistics in Medicine in response to S. Goodman. It contains several important points that get to the issues we’ve been discussing, and you may wish to track down the rest of it. Sincerely, D. G. Mayo
Statist. Med. 2002; 21:2437–2444 http://onlinelibrary.wiley.com/doi/10.1002/sim.1072/abstract
STATISTICS IN MEDICINE, LETTER TO THE EDITOR
A comment on replication, p-values and evidence: S.N. Goodman, Statistics in Medicine 1992; 11:875–879
From: Stephen Senn*
Some years ago, in the pages of this journal, Goodman gave an interesting analysis of ‘replication probabilities’ of p-values. Specifically, he considered the possibility that a given experiment had produced a p-value that indicated ‘significance’ or near significance (he considered the range p=0.10 to 0.001) and then calculated the probability that a study with equal power would produce a significant result at the conventional level of significance of 0.05. He showed, for example, that given an uninformative prior, and (subsequently) a resulting p-value that was exactly 0.05 from the first experiment, the probability of significance in the second experiment was 50 per cent. A more general form of this result is as follows. If the first trial yields p=α then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5.
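The 50 per cent figure can be checked numerically. The sketch below is not Goodman's or Senn's own calculation. It assumes a normally distributed test statistic with known variance, two trials of equal precision, a flat ('uninformative') prior on the true effect, and two-sided p-values. Under those assumptions the predictive distribution of the second z-statistic, given the first, is normal with mean z1 and variance 2. The function name replication_probability is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def replication_probability(p_first, alpha=0.05):
    """Probability that a second, equally precise trial is significant at
    two-sided level alpha in the same direction as the first trial,
    given a flat prior on the true effect (an assumption of this sketch)."""
    # z-statistic implied by the first trial's two-sided p-value
    z_first = STD_NORMAL.inv_cdf(1 - p_first / 2)
    # critical value the second trial must exceed
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # predictive distribution of the second z-statistic is N(z_first, 2)
    return STD_NORMAL.cdf((z_first - z_crit) / sqrt(2))


if __name__ == "__main__":
    for p in (0.10, 0.05, 0.01, 0.001):
        print(p, round(replication_probability(p), 3))
```

When p_first equals alpha the two z-values coincide, the argument of the normal distribution function is zero, and the probability is exactly 0.5, whatever the value of alpha. This is the general form stated above.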
I share many of Goodman’s misgivings about p-values and I do not disagree with his calculations (except in slight numerical details). I also consider that his demonstration is useful for two reasons. First, it serves as a warning for anybody planning a further similar study to one just completed (and which has a marginally significant result) that this may not be matched in the second study. Second, it serves as a warning that apparent inconsistency in results from individual studies may be expected to be common and that one should not overreact to this phenomenon.
However, I disagree with two points that he makes. First, he claims that ‘the replication probability provides a means, within the frequentist framework, to separate p-values from their hypothesis test interpretation, an important first step towards understanding the concept of inferential meaning’ (p. 879). I disagree with him on two grounds here: (i) it is not necessary to separate p-values from their hypothesis test interpretation; (ii) the replication probability has no direct bearing on inferential meaning. Second, he claims that ‘the replication probability can be used as a frequentist counterpart of Bayesian and likelihood models to show that p-values overstate the evidence against the null-hypothesis’ (p. 875, my italics). I disagree that there is such an overstatement.
THE TRUE PROBLEM WITH p-VALUES
The uninformative prior that Goodman considers causes no difficulties for p-values at all. The problem is rather that the ‘uninformative’ prior is rarely appropriate. In general, however, it is not possible to survive as a Bayesian on uninformative priors. It is a key feature of Jeffreys’s approach to scientific inference, for example, that he recognized that some way had to be found for down-weighting the effect of higher order terms. To give another example, this approach is essential in dealing with carry-over in analysing cross-over trials.
If, in testing the effect of a treatment in a clinical trial, we have a lump of probability on the treatment effect being zero, then, as is well known from the Jeffreys–Good–Lindley paradox, a p-value overstates the evidence against the null [10; 11]. So, of course, do Bayesian posterior statements of the sort made by Student.
The important distinction here is Cox’s distinction between precise and dividing hypotheses. In the Neyman–Pearson framework, the former corresponds to testing H0: τ = 0 against H1: τ ≠ 0, whereas the latter corresponds to testing H0: τ ≤ 0 against H1: τ > 0. The Bayesian analogue of the first case is to have a lump of probability on τ = 0. Where such a probability is appropriate, then from a Bayesian perspective, the p-value will have most unfortunate properties.
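A small numerical sketch may help to show what the 'unfortunate properties' look like when there is a lump of prior probability on τ = 0. It is an illustration under assumptions that are not in the letter: n observations with unit variance, prior probability 0.5 on τ = 0, and a N(0, 1) prior on τ under the alternative. The function name posterior_null_probability is hypothetical. With the z-statistic held fixed at 1.96 (two-sided p of about 0.05), the posterior probability of the null rises with the sample size.

```python
from math import exp, sqrt


def posterior_null_probability(z, n, prior_null=0.5):
    """Posterior probability of H0: tau = 0 given z = sqrt(n) * sample mean.

    Assumes unit-variance observations, a point mass prior_null on tau = 0,
    and tau ~ N(0, 1) under the alternative (assumptions of this sketch)."""
    # Bayes factor for H0 against H1: ratio of the N(0, 1/n) density to the
    # N(0, 1 + 1/n) density, both evaluated at the observed sample mean
    bf01 = sqrt(1 + n) * exp(-0.5 * z * z * n / (n + 1))
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bf01
    return posterior_odds / (1 + posterior_odds)


if __name__ == "__main__":
    for n in (10, 100, 1000, 100000):
        print(n, round(posterior_null_probability(1.96, n), 3))
```

The p-value is the same in every row of the output, yet the posterior probability of the null moves from below one half to close to one as n grows.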
It is important to realize, however, that the reason that Bayesians can regard p-values as overstating the evidence against the null is simply a reflection of the fact that Bayesians can disagree sharply with each other. For example, suppose in fact that we have two Bayesians who agree before seeing some data that the probability that the treatment is beneficial is 0.5. Given that the treatment is effective they have the same conditional prior distribution as to how effective it will be. However, one of them, the ‘pessimist’, believes that if not beneficial it may be harmful. On the other hand, the other, the ‘optimist’, believes that if not beneficial it will be harmless. After running the trial the pessimist now believes with probability 0.95 that the treatment is beneficial, whereas the optimist now believes with probability 0.95 that it is useless.
The reason is that the result of the trial is marginally positive. For the optimist such a result could have easily arisen under the ‘null’, which is concentrated on zero. In fact, if most of the prior belief under the alternative corresponds to large treatment benefit, a moderate observed benefit is more likely under the null than under most of the alternative. Hence, the optimist is now inclined to believe the null. For the pessimist, however, such a result is even less likely under the ‘null’ than under the alternative, since both stretch away towards infinity from zero but the point estimate is in the alternative region. Hence, the pessimist is now inclined to believe the alternative hypothesis.
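The disagreement between the two Bayesians can be reproduced in a few lines. The following is only a sketch under assumptions that go beyond the letter: the estimate x is normal with standard error 1; given benefit, both Bayesians use the same half-normal prior with scale s on τ > 0; the pessimist uses the mirror-image half-normal on τ < 0 when the treatment is not beneficial; the optimist puts all the 'not beneficial' probability on τ = 0. The inputs x = 1.645 and s = 140 are chosen purely for illustration, and the function name posterior_beneficial is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def posterior_beneficial(x, s):
    """Return (pessimist, optimist) posterior probabilities that tau > 0.

    x: observed estimate in standard-error units.
    s: scale of the half-normal prior on the size of a benefit.
    Both start from prior probability 0.5 that the treatment is beneficial."""
    shrink = s / sqrt(1 + s * s)
    # density of x if tau had a full N(0, s^2) prior
    marginal = NormalDist(0, sqrt(1 + s * s)).pdf(x)
    # marginal likelihoods under each component (half-normal density is 2x)
    m_benefit = 2 * marginal * STD_NORMAL.cdf(x * shrink)
    m_harm = 2 * marginal * STD_NORMAL.cdf(-x * shrink)  # pessimist's 'null'
    m_zero = STD_NORMAL.pdf(x)                           # optimist's 'null'
    # equal prior weights of 0.5 cancel in both ratios
    pessimist = m_benefit / (m_benefit + m_harm)
    optimist = m_benefit / (m_benefit + m_zero)
    return pessimist, optimist


if __name__ == "__main__":
    pessimist, optimist = posterior_beneficial(1.645, 140.0)
    print("pessimist P(beneficial):", round(pessimist, 3))
    print("optimist  P(beneficial):", round(optimist, 3))
```

With these illustrative inputs the pessimist ends up with a probability of roughly 0.95 that the treatment is beneficial, while the optimist gives roughly 0.05 to benefit and hence about 0.95 to the treatment being useless, even though both saw the same data and began with the same prior probability of benefit.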
1. Lehmann EL. Testing Statistical Hypotheses. Chapman and Hall: New York, 1994.
2. Fisher RA. Statistical methods for research workers. In Statistical Methods, Experimental Design and Scientific Inference, Bennett JH (ed.). Oxford University: Oxford, 1925.
3. Student. The probable error of a mean. Biometrika 1908; 6:1–25.
4. Cushny AR, Peebles AR. The action of optical isomers. II. Hyoscines. Journal of Physiology 1905; 32:501–510.
5. Fisher RA, Yates F. Statistical Tables for Biological Agricultural and Medical Research. Longman: Harlow, 1974. (First published Oliver and Boyd: Edinburgh, 1938.)
6. Royall RM. The effect of sample size on the meaning of significance tests. American Statistician 1986; 40:313–315.
7. Senn SJ, Richardson W. The first t-test. Statistics in Medicine 1994; 13:785–803.
8. Jeffreys H. Theory of Probability. Clarendon Press: Oxford, 1961.
9. Senn SJ. Consensus and controversy in pharmaceutical statistics (with discussion). Statistician 2000; 49:135–176.
10. Lindley DV. A statistical paradox. Biometrika 1957; 44:187–192.
11. Bartlett MS. A comment on D.V. Lindley’s statistical paradox. Biometrika 1957; 44:533–534.
12. Cox DR. The role of significance tests. Scandinavian Journal of Statistics 1977; 4:49–70.
13. Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine 1999; 130:995–1004.
14. Goodman SN. Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine 1999; 130:1005–1013.
*Department of Epidemiology and Public Health
Department of Statistical Science
University College London
1-19 Torrington Place
London WC1E 6BT, U.K.