*Dear Reader: I am typing in some excerpts from a letter Stephen Senn shared with me in relation to my April 28, 2012 blogpost. It is a letter to the editor of Statistics in Medicine in response to S. Goodman. It contains several important points that get to the issues we’ve been discussing, and you may wish to track down the rest of it. Sincerely, D. G. Mayo*

Statist. Med. 2002; 21:2437–2444 http://errorstatistics.files.wordpress.com/2013/12/goodman.pdf

STATISTICS IN MEDICINE, LETTER TO THE EDITOR

A comment on replication, p-values and evidence: S.N. Goodman, Statistics in Medicine 1992; 11:875–879

From: Stephen Senn*

Some years ago, in the pages of this journal, Goodman gave an interesting analysis of ‘replication probabilities’ of p-values. Specifically, he considered the possibility that a given experiment had produced a p-value that indicated ‘significance’ or near significance (he considered the range p=0.10 to 0.001) and then calculated the probability that a study with equal power would produce a significant result at the conventional level of significance of 0.05. He showed, for example, that given an uninformative prior, and (subsequently) a resulting p-value that was exactly 0.05 from the first experiment, the probability of significance in the second experiment was 50 per cent. A more general form of this result is as follows. If the first trial yields p=α then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5.
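Goodman's 50 per cent figure is easy to reproduce by simulation. The sketch below (my illustration, not Goodman's code) takes z statistics with unit variance and a flat 'uninformative' prior on the true standardized effect, so that observing z1 in the first trial gives posterior theta ~ N(z1, 1):

```python
import random

# Replication probability under a flat prior: given z1 from trial 1,
# the posterior for the true effect theta is N(z1, 1), and an equally
# powered second trial yields z2 ~ N(theta, 1).
random.seed(1)

z1 = 1.96        # first trial exactly significant at two-sided alpha = 0.05
n = 200_000
hits = 0
for _ in range(n):
    theta = random.gauss(z1, 1.0)   # draw the true effect from the posterior
    z2 = random.gauss(theta, 1.0)   # simulate the replication trial
    if z2 > z1:                     # significant at the same level, same direction
        hits += 1

print(round(hits / n, 2))           # ≈ 0.5
```

The marginal distribution of z2 given z1 is N(z1, 2), which is symmetric about z1, so the replication probability is exactly one half whatever significance level the first trial just attained.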

I share many of Goodman’s misgivings about p-values and I do not disagree with his calculations (except in slight numerical details). I also consider that his demonstration is useful for two reasons. First, it serves as a warning for anybody planning a further study similar to one just completed (and which has a marginally significant result) that this may not be matched in the second study. Second, it serves as a warning that apparent inconsistency in results from individual studies may be expected to be common and that one should not overreact to this phenomenon.

However, I disagree with two points that he makes. First, he claims that ‘the replication probability provides a means, within the frequentist framework, to separate p-values from their hypothesis test interpretation, an important first step towards understanding the concept of inferential meaning’ (p. 879). I disagree with him on two grounds here: (i) it is not necessary to separate p-values from their hypothesis test interpretation; (ii) the replication probability has no direct bearing on inferential meaning. Second, he claims that ‘the replication probability can be used as a frequentist counterpart of Bayesian and likelihood models to show that p-values overstate the evidence against the null-hypothesis’ (p. 875, my italics). I disagree that there is such an overstatement.………

THE TRUE PROBLEM WITH p-VALUES

The uninformative prior that Goodman considers causes no difficulties for p-values at all. The problem is rather that the ‘uninformative’ prior is rarely appropriate. In general, however, it is not possible to survive as a Bayesian on uninformative priors. It is a key feature of Jeffreys’s approach to scientific inference, for example, that he recognized that some way had to be found for down-weighting the effect of higher order terms [8]. To give another example, this approach is essential in dealing with carry-over in analysing cross-over trials [9].

If, in testing the effect of a treatment in a clinical trial, we have a lump of probability on the treatment effect being zero, then, as is well known from the Jeffreys–Good–Lindley paradox, a p-value overstates the evidence against the null [10; 11]. So, of course, do Bayesian posterior statements of the sort made by Student [3].
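The paradox is easy to exhibit numerically. In the sketch below (my illustration; the diffuse prior variance of 100 is an arbitrary choice) a result with two-sided p ≈ 0.05 is weighed against a lump of prior probability 0.5 on the null:

```python
from math import exp, pi, sqrt

def normal_pdf(x, sd):
    """Density of N(0, sd^2) at x."""
    return exp(-0.5 * (x / sd) ** 2) / (sd * sqrt(2 * pi))

z = 1.96        # observed z statistic, two-sided p ~ 0.05
tau2 = 100.0    # variance of the diffuse alternative theta ~ N(0, tau2)

m0 = normal_pdf(z, 1.0)             # marginal likelihood under H0: theta = 0
m1 = normal_pdf(z, sqrt(1 + tau2))  # marginal likelihood under H1

bf01 = m0 / m1                      # Bayes factor in favour of the null
post_h0 = bf01 / (1 + bf01)         # posterior P(H0) when prior P(H0) = 0.5

print(round(bf01, 2), round(post_h0, 2))   # → 1.5 0.6
```

Despite p ≈ 0.05, the posterior probability of the null here exceeds one half, and it grows without bound as the alternative is made more diffuse. With a dividing hypothesis H0: τ ≤ 0 and no lump of probability, no such conflict arises.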

The important distinction here is Cox’s distinction between precise and dividing hypotheses [12]. In the Neyman–Pearson framework, the former corresponds to testing H0: τ = 0 against H1: τ ≠ 0, whereas the latter corresponds to testing H0: τ ≤ 0 against H1: τ > 0. The Bayesian analogue of the first case is to have a lump of probability on τ = 0. Where such a probability is appropriate, then from a Bayesian perspective the p-value will have most unfortunate properties.

It is important to realize, however, that the reason that Bayesians can regard p-values as overstating the evidence against the null is simply a reflection of the fact that Bayesians can disagree sharply with each other. For example, suppose in fact that we have two Bayesians who agree before seeing some data that the probability that the treatment is beneficial is 0.5. Given that the treatment is effective they have the same conditional prior distribution as to how effective it will be. However, one of them, the ‘pessimist’, believes that if not beneficial it may be harmful. On the other hand the other, the ‘optimist’, believes that if not beneficial it will be harmless. After running the trial the pessimist now believes with probability 0.95 that the treatment is beneficial, whereas the optimist now believes with probability 0.95 that it is useless. The reason is that the result of the trial is marginally positive. For the optimist such a result could have easily arisen under the ‘null’, which is concentrated on zero. In fact, if most of the prior belief under the alternative corresponds to large treatment benefit, a moderate observed benefit is more likely under the null than under most of the alternative. Hence, the optimist is now inclined to believe the null. For the pessimist, however, such a result is even less likely under the ‘null’ than under the alternative, since both stretch away towards infinity from zero but the point estimate is in the alternative region. Hence, the pessimist is now inclined to believe the alternative hypothesis.

REFERENCES

1. Lehmann EL. Testing Statistical Hypotheses. Chapman and Hall: New York, 1994.

2. Fisher RA. Statistical methods for research workers. In Statistical Methods, Experimental Design and Scientific Inference, Bennet JH (ed.). Oxford University Press: Oxford, 1925.

3. Student. The probable error of a mean. Biometrika 1908; 6:1–25.

4. Cushny AR, Peebles AR. The action of optical isomers. II. Hyoscines. Journal of Physiology 1905; 32:501–510.

5. Fisher RA, Yates F. Statistical Tables for Biological, Agricultural and Medical Research. Longman: Harlow, 1974. (First published Oliver and Boyd: Edinburgh, 1938.)

6. Royall RM. The effect of sample size on the meaning of significance tests. American Statistician 1986; 40:313–315.

7. Senn SJ, Richardson W. The first t-test. Statistics in Medicine 1994; 13:785–803.

8. Jeffreys H. Theory of Probability. Clarendon Press: Oxford, 1961.

9. Senn SJ. Consensus and controversy in pharmaceutical statistics (with discussion). Statistician 2000; 49:135–176.

10. Lindley DV. A statistical paradox. Biometrika 1957; 44:187–192.

11. Bartlett MS. A comment on D.V. Lindley’s statistical paradox. Biometrika 1957; 44:533–534.

12. Cox DR. The role of significance tests. Scandinavian Journal of Statistics 1977; 4:49–70.

13. Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine 1999; 130:995–1004.

14. Goodman SN. Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine 1999; 130:1005–1013.

*Department of Epidemiology and Public Health

Department of Statistical Science

University College London

1-19 Torrington Place

London WC1E 6BT, U.K.

Stephen: I really like the points you make. I’m pondering something you say about replication, also taken up in your 2007 Statistical Issues in Drug Development. You say that if we knew with high probability that a significant P-value would be followed by an even more significant one, we would believe with high probability that, on average, future experiments would leave us more certain about the falsity of the null (than with this one P-value). “This would be a disastrous property of any system of inference since it would undermine the value of future evidence by confusing actual with anticipated evidence. Thus this particular property of P-values is highly desirable and in fact does not distinguish them from Bayesian statements” (p. 191). Can you possibly flesh this out a bit more? Or point me to a discussion? I’m guessing at a couple of interpretations, and would very much like to understand it.

I think it has to do with the fact that commentators on evidence sometimes confuse a position with a direction. For example, a common habit of some physicians I have collaborated with has been to describe P=0.06 as a ‘trend towards significance’, to which I sarcastically reply, ‘does this mean that P=0.04 is a trend towards non-significance?’. The P-value is not going anywhere. It is where it is. One can argue as to whether where it is is at all interesting (many Bayesians would say it is not), but to suggest that it is heading off somewhere is a mistake, whatever system of inference you have.

It seems to me that a necessary quality of any evidential system is that uncertain evidence must imply that future evidence might tend in the contrary direction. After all, we know that when we have tomorrow’s evidence our overall amount of evidence will have increased. Bayesians must therefore believe that in future it is probable that they will believe something more certainly than they do now, but they can’t know exactly what that something will be. Otherwise we would have paradoxes of this sort: ‘Today is Monday and I believe that it is quite probable that it will rain on Wednesday. However, if you come and see me tomorrow, which is Tuesday, I will be able to tell you with absolute certainty that it will rain on Wednesday.’

Thus it seems to me obvious and not at all problematic that if a trial were just significant at the 5% level (P=0.047, for example), another trial of exactly the same sort would have only a modest probability of being significant at the same level, and hence quite a large chance of not being so. Imagine 100 such trials, for example. If 50% were significant at the 5% level this would be overwhelming evidence against the null.

Senn: Glad to hear from you. I’m still pondering this. What are the implications, if any, for meta-analysis? What, for example, would be the result of combining those 50 results, each significant at the .05 level, and maybe others close (according to meta-analysis)?

Well, one way of looking at it would be as follows. If half the results are significant and half not, and assuming that we are using the one-sided standard, then in fact we have a series of standardised values varying around 2; their mean will also be about 2 and its standard error will be 1/(root 100) = 1/10 = 0.1, so the z statistic of the mean would be 20, which is pretty impressive!

However, note that a single z statistic of 2 from one trial most definitely does not promise you that this will happen. There is, of course, always the possibility that the next trial will be disappointing. In fact, using the same argument you could show that if the z statistic for the next trial were less than 2*(root(2)-1)=0.83 the z statistic for the mean of them both would be less than 2. But that’s life. In a court case we would not accept counsel for the defence saying “I may only have one witness that my client was elsewhere but I could have found a second one so effectively I have two”.
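Both calculations above are quickly verified (the numbers are the illustrative ones from this exchange, taking each trial’s z statistic to have unit variance):

```python
from math import sqrt

# 100 trials whose z statistics average about 2.
k = 100
mean_z = 2.0
se_mean = 1 / sqrt(k)          # standard error of the mean of k unit-variance z's
z_of_mean = mean_z / se_mean
print(round(z_of_mean, 1))     # → 20.0

# Two trials: the z statistic of their mean is (z1 + z2) / sqrt(2).
# With z1 = 2 it falls below 2 exactly when z2 < 2*sqrt(2) - 2.
threshold = 2 * (sqrt(2) - 1)
print(round(threshold, 2))     # → 0.83
```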

Senn: But what is the null, that the nulls are true in all of them?

In a sense. In a classic fixed-effects meta-analysis, the null is that the treatment effect is identically zero in every trial. If this is rejected then the conclusion is that the treatment must work some of the time at least, that is to say in at least one trial. To go further and say something about the “average effect” you might then want to do a random-effects meta-analysis, although then “average” might mean different things to different people. This is, by the by, analogous to the issue in the famous dispute between Neyman and Fisher over the analysis of Latin squares where, I have argued, Fisher was in the right. See Statistics in Medicine, 2004, 23, 3729–3753 http://onlinelibrary.wiley.com/doi/10.1002/sim.2074/abstract

For completeness, the Bayesian version of the concept Senn is referring to has been called conservation of expected evidence. Quoting from the link:

“P(H) = P(H),

P(H) = P(H,E) + P(H,~E),

P(H) = P(H|E)*P(E) + P(H|~E)*P(~E).

Therefore, for every expectation of evidence, there is an equal and opposite expectation of counterevidence.

If you expect a strong probability of seeing weak evidence in one direction, it must be balanced by a weak expectation of seeing strong evidence in the other direction.”
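The quoted identity can be checked with any numbers at all; those below are made up purely for illustration:

```python
# Conservation of expected evidence with arbitrary illustrative numbers.
p_h = 0.3              # prior P(H)
p_e_given_h = 0.8      # P(E | H)
p_e_given_not_h = 0.2  # P(E | ~H)

# Total probability of seeing the evidence E:
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior in each case (Bayes' theorem):
post_if_e = p_e_given_h * p_h / p_e
post_if_not_e = (1 - p_e_given_h) * p_h / (1 - p_e)

# The posterior, averaged over what we expect to observe, equals the prior:
expected_posterior = post_if_e * p_e + post_if_not_e * (1 - p_e)
print(round(expected_posterior, 10))   # → 0.3
```

However the likelihoods are chosen, the expected posterior always reproduces the prior: a confident anticipation of updating in one direction must be offset by a small chance of a large update the other way.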

Thanks Corey. Will have to see if Senn agrees, but I might note that I find this very counterintuitive (which of course stands to reason since I don’t find it intuitive to assign probabilities to hypotheses) and also very equivocal (if it’s to capture English understandings). For example, “weak expectation” might mean a little bit of expectation, or low expectations (of strong evidence).

Just to add: I don’t mean that renders Senn’s result counterintuitive as regards math expectation.

Thanks for the clarification, Corey. This sort of thing has to happen because of the martingale property of forecasts (related to the tower property of expectation) and is also discussed, although not under that name, in my letter. I find it to be a perfectly reasonable property. The point of my letter in Statistics in Medicine was to argue, against Goodman, that the fact that P-values do not bring with them a guarantee of replication does not separate them from other forms of evidence. However, Steve Goodman argued that I had misinterpreted what he said, so one should also look at his reply as well, of course, as at his original article.