
You will often hear—especially in discussions about the “replication crisis”—that statistical significance tests exaggerate evidence. Significance testing, we hear, inflates effect sizes, inflates power, inflates the probability of a real effect, or inflates the probability of replication, and thereby misleads scientists.
If you look closely, you’ll find the charges are based on concepts and philosophical frameworks foreign to both Fisherian and Neyman–Pearson hypothesis testing. Nearly all have been discussed on this blog or in SIST (Mayo 2018), but new variations have cropped up. The emphasis that some are now placing on how biased selection effects invalidate error probabilities is welcome, but I say that the recommendations for reinterpreting quantities such as p-values and power introduce radical distortions of error statistical inferences. Before diving into the modern incarnations of the charges, it’s worth recalling Stephen Senn’s response to Steven Goodman’s attempt to convert p-values into replication probabilities nearly 20 years ago (“A Comment on Replication, P-values and Evidence,” Statistics in Medicine). I first blogged it in 2012, here. Below I am pasting some excerpts from Senn’s letter (but readers interested in the topic should look at all of it), because Senn’s clarity cuts straight through many of today’s misunderstandings.

LETTER TO THE EDITOR
A comment on replication, p-values and evidence
S.N. Goodman, Statistics in Medicine 1992; 11:875–879
From: Stephen Senn
Department of Epidemiology and Public Health, and Department of Statistical Science
University College London
1-19 Torrington Place
London WC1E 6BT, U.K.
Some years ago, in the pages of this journal, Goodman gave an interesting analysis of ‘replication probabilities’ of p-values. Specifically, he considered the possibility that a given experiment had produced a p-value that indicated ‘significance’ or near significance (he considered the range p=0.10 to 0.001) and then calculated the probability that a study with equal power would produce a significant result at the conventional level of significance of 0.05. He showed, for example, that given an uninformative prior, and (subsequently) a resulting p-value that was exactly 0.05 from the first experiment, the probability of significance in the second experiment was 50 per cent. A more general form of this result is as follows. If the first trial yields p=α then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5.
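[Not part of Senn’s letter: the 50 per cent figure is easy to reproduce under the setup Goodman assumes, namely a normal model with known variance, a flat (uninformative) prior on the true effect, and a second study of the same size as the first. A minimal Python sketch, working on the z-scale so the first trial’s standard error is 1:]

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05                       # two-sided significance level
z_crit = norm.ppf(1 - alpha / 2)   # ~1.96
z1 = z_crit                        # first trial: p exactly 0.05, i.e. z1 = 1.96

# Flat prior + normal likelihood: the posterior for the true effect is N(z1, 1),
# so a same-sized replication's z-statistic is predictively N(z1, 2).
p_replicate = 1 - norm.cdf(z_crit, loc=z1, scale=np.sqrt(2))
print(f"closed form: {p_replicate:.3f}")    # 0.500

# Monte Carlo check of the same predictive probability.
rng = np.random.default_rng(0)
theta = rng.normal(z1, 1.0, size=1_000_000)   # posterior draws of the true effect
z2 = rng.normal(theta, 1.0)                   # predictive draws of the replication z
print(f"simulation:  {np.mean(z2 >= z_crit):.3f}")   # ~0.500
```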
I share many of Goodman’s misgivings about p-values and I do not disagree with his calculations (except in slight numerical details). I also consider that his demonstration is useful for two reasons. First, it serves as a warning for anybody planning a further similar study to one just completed (and which has a marginally significant result) that this may not be matched in the second study. Second, it serves as a warning that apparent inconsistency in results from individual studies may be expected to be common and that one should not overreact to this phenomenon.
However, I disagree with two points that he makes. First, he claims that ‘the replication probability provides a means, within the frequentist framework, to separate p-values from their hypothesis test interpretation, an important first step towards understanding the concept of inferential meaning’ (p. 879). I disagree with him on two grounds here: (i) it is not necessary to separate p-values from their hypothesis test interpretation; (ii) the replication probability has no direct bearing on inferential meaning. Second, he claims that, ‘the replication probability can be used as a frequentist counterpart of Bayesian and likelihood models to show that p-values overstate the evidence against the null-hypothesis’ (p. 875, my italics). I disagree that there is such an overstatement.
…[C]onsider a Fisherian conductor of significance tests and a Neymanite conductor of hypothesis tests. Suppose that each tests the same null hypothesis using the same statistic on the same data. Each can calculate a p-value for his or her different reasons: the Fisherian to make an inference; the Neymanite to permit others to come to a decision. This p-value will be the same. Suppose that the p-value is 0.05 and so (just) conventionally significant. Goodman’s concept of a replication probability is not part of either of these systems of inference. It is not a likelihood as used by Fisher. It is not a power as used by Neyman and Pearson. However, it is closely related to the Bayesian idea of a predictive probability, and by stepping out of these two systems and into a third, you can calculate a Bayesian probability that the Fisherian will observe p<0.05 next time an identical experiment is run and that the Neymanite will observe a result ‘significant at α=0.05’. These two probabilities are identical, given the same prior, and thus, as should be obvious, cannot be the first or indeed any ‘step to separate p-values from their hypothesis test interpretation’ (Goodman, p. 879).
Similarly, although it is quite true that the probability of replicating a significant result is higher, other things being equal, given that one has merely noted on the first occasion ‘result significant at the 5 per cent level’ rather than p=0.05 [6], this is a trivial mathematical consequence of the fact that the average p-value from all trials that are significant at the 5 per cent level is less than 0.05. It is thus a phenomenon of the same sort as the following. The average diastolic blood pressure, on re-measuring, of men who have been selected for treatment because their diastolic BP is higher than 100 mmHg will be higher, other things being equal, than that for a group whose diastolic BP on first measurement was exactly 100 mmHg.
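[Not part of the letter: Senn’s blood-pressure analogy can be checked with a small simulation. The population mean, spread, and measurement-error figures below are invented purely for illustration; the qualitative conclusion does not depend on them.]

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
true_bp = rng.normal(90, 10, size=n)            # assumed true diastolic BP across men (mmHg)
first = true_bp + rng.normal(0, 8, size=n)      # first measurement (with error)
second = true_bp + rng.normal(0, 8, size=n)     # independent re-measurement

selected = first > 100                          # selected because the first reading exceeded 100
exactly_100 = np.abs(first - 100) < 0.5         # first reading (essentially) exactly 100

print(f"mean re-measurement, first > 100:  {second[selected].mean():.1f} mmHg")
print(f"mean re-measurement, first = 100:  {second[exactly_100].mean():.1f} mmHg")
# Both groups regress toward the population mean, but the '>100' group still averages
# higher on re-measurement than the 'exactly 100' group: the same selection effect that
# makes replication more likely given 'significant at 5%' than given 'p = 0.05' exactly.
```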
REPLICATION PROBABILITIES AND INFERENTIAL MEANING
Replication probabilities are not of direct relevance to inferential meaning. They confuse the issue of making inferences. This is because we make inferences primarily about hypotheses or about the state of nature and not about future samples. However, in the context of parametric inference, a replication probability is a reflection of two things. The first is the likely or probable or reasonable (depending upon one’s point of view) value of an unknown parameter. The second is the probability distribution of a future test statistic given a particular value of the unknown parameter; this second ingredient, however, depends among other things on the size of the trial one happens to choose next time around. Goodman has considered the case where the second trial is the same size as the first. This may be a natural choice but it is not inferentially necessary. It would be absurd if our inferences about the world, having just completed a clinical trial, were necessarily dependent on assuming the following:
- We are now going to repeat this experiment.
- We are going to repeat it only once.
- It must be exactly the same size as the experiment we have just run.
- The inferential meaning of the experiment we have just run is the extent to which it predicts this second experiment.
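[Not part of the letter: a numerical sketch of the point above, that the “replication probability” after a just-significant first trial depends heavily on how large one chooses to make the second trial. Same flat-prior normal setup as in the earlier aside; the relative sizes tried are arbitrary.]

```python
import numpy as np
from scipy.stats import norm

z_crit = norm.ppf(0.975)   # ~1.96
d1 = z_crit                # first-trial estimate, in units of the first trial's standard error

# Second trial is k times the size of the first; its estimate has SE = 1/sqrt(k) on that scale.
for k in [0.25, 0.5, 1, 2, 4, 16, np.inf]:
    se2 = 0.0 if np.isinf(k) else 1.0 / np.sqrt(k)
    pred_sd = np.sqrt(1 + se2 ** 2)              # flat-prior predictive SD of the new estimate
    p_sig = 1 - norm.cdf(z_crit * se2, loc=d1, scale=pred_sd)
    print(f"second trial {k:>5} times the first: P(significant, same direction) = {p_sig:.3f}")
# Equal-sized repeat: 0.500 (Goodman's case); an enormous second trial: 0.975.
```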
DO p-VALUES OVERSTATE THE EVIDENCE AGAINST THE NULL HYPOTHESIS?
Suppose we consider the following three questions that a Bayesian might put having obtained a posterior probability of 0.05 that treatment 2 was worse than treatment 1:
- What is the probability that in a future experiment, taking that experiment’s results on their own, the results would favour 1 rather than 2?
- What is the probability that, having conducted a future experiment and pooled its results with the current one, the combined results would favour 1 rather than 2?
- What is the probability that having conducted a future experiment and then calculated a Bayesian posterior using a uniform prior and the results of this second experiment alone, the probability that 2 would be worse than 1 would be less than or equal to 0.05?
[You can read Senn’s answers to the first two questions in the full letter here.]
… Neither of these two questions, however, is analogous to the question that Goodman asks about p-values. The analogous question is Q3. This question is not a question about the confirmation of a result; it is a question about the repetition of a probability associated with a result. In fact, if the experiment to come is the same size as the one that has been run, the answer to Q3 is 0.5, exactly Goodman’s result for the p-value (see Appendix).
…
We do not, however, need this replication probability to be higher than 0.5 to believe that the efficacy of the treatment is probable. A long series of trials, 50 per cent of which were significant at the 5 per cent level, would be convincing evidence that the treatment was effective. Given an uninformative prior, followed by one significant result at the 5 per cent level, what the replication probability shows is that the probability that any one of the remaining trials chosen at random from the series is significant is 50 per cent (given that the results of the other trials are not known). Because the probabilities are not independent we cannot easily go further than this. However, what we can say is that the probability (that is to say our subjective assessment) that a very large meta-analysis would show that the treatment was effective would be 95 per cent.
IN CONCLUSION
Although it is interesting to consider the repetition property of p-values, it is false to regard this as being relevant to separating p-values from their hypothesis test interpretation and it is false to regard the modest probability shown as being regrettable. It is desirable. [My emphasis!] Suppose it were the case that a low p-value brought with it a very high probability that it would be repeated. This would then imply that there was a very high probability that a meta-analysis using the current trial and the future one would produce an even lower p-value. This would mean that an anticipated result would have (nearly) the same inferential value as an actual one. On the contrary, the desirable property is that the meta-analysis should confirm the p-value of the current trial. (We can hope that it will do more but must also fear that it will do less.) However, this requires us to expect that the result of the new trial combined with the result we already have will leave us where we were. To think otherwise is to make exactly the same mistake a physician makes in writing of a result, ‘the result failed to reach significance, p=0.08, because the trial was too small’.
To expect that a future trial will be significant given that the current one has yielded p=0.05 is to expect that the future p-value will be at least as small as the current one. However, if we have an expectation that further experimentation will produce an even smaller p-value than the current trial, this is not because the p-value overstates the evidence against the null. On the contrary, it can only be because our prior belief is such that we consider it understates it. In other words, our prior belief enables us to recognise the p-value as being too pessimistic. …
Please share your thoughts and illuminations of this issue in the comments.
For related guest posts by Senn please use the search. Here are two of many:
“The Pathetic P-value” and its sequel, “Double Jeopardy: Judge Jeffreys upholds the law”
For a few related posts (from Mayo) on statistical significance tests exaggerating/overstating the evidence:
- May 9, 2022: A statistically significant result indicates H’ (μ > μ’) when POW(μ’) is low (not the other way round)–but don’t ignore the standard error
- May 2, 2022: Do “underpowered” tests “exaggerate” population effects? (iv)
- December 13, 2021: Bickel’s defense of significance testing on the basis of Bayesian model checking
- January 19, 2017: The “P-values overstate the evidence against the null” fallacy
- August 28, 2016: Tragicomedy hour: p-values vs posterior probabilities vs diagnostic error rates
- November 25, 2014: How likelihoodists exaggerate evidence from statistical tests
Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence,” https://errorstatistics.com/2012/05/10/excerpts-from-s-senns-letter-on-replication-p-values-and-evidence/



There’s some discussion between me and Senn (and Corey) in the comments to the initial post where I link to Senn’s letter.
https://errorstatistics.com/2012/05/10/excerpts-from-s-senns-letter-on-replication-p-values-and-evidence/
I had been trying to remember where Senn made the following remark, and it turns out to be there (although maybe it comes up elsewhere as well): “In a court case we would not accept counsel for the defence saying ‘I may only have one witness that my client was elsewhere but I could have found a second one so effectively I have two’.
Dear Mayo
Thank you for raising this topic. I have today published this pre-print on arXiv: 2512.13763v1.pdf. I was motivated to write it by Stephen’s response (with which I agree) to Steven Goodman’s 1992 paper, both of which are the subjects of this post. However, I use different expressions from those based on Goodman’s to estimate probabilities of replication. The title of my pre-print is:
‘Understanding statistics for biomedical research through the lens of replication’.
This is the abstract:
Clinicians and scientists have traditionally focussed on whether their findings will be replicated and are familiar with the concept. The probability that a replication study yields an effect with the same sign, or the same statistical significance as an original study depends on the sum of the variances of the effect estimates. On this basis, when P = 0.025 one-sided and the replication study has the same sample size and variance as the original study, the probability of achieving a one-sided P ≤ 0.025 a second time is only about 0.283, consistent with currently observed modest replication rates. A higher replication probability would require a larger sample size than that derived from current single variance power calculations. However, if the replication study is based on an infinitely large sample size (and thus has negligible variance) then the probability that its estimated mean is ‘same sign’ (e.g. again exceeds the null) is 1 − P = 0.975. The reasoning is made clearer by changing continuous distributions to discretised scales and probability masses, thus avoiding ambiguity and improper flat priors. This perspective is consistent with Frequentist and Bayesian interpretations and also requires further reasoning when testing scientific hypotheses and making decisions.
Some other points are also covered in the paper.
I would be grateful for any comments.
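[Editorial aside, not part of the comment or the pre-print: the abstract’s ‘same sign’ figure of 1 − P = 0.975 can also be reached by a simpler route than the discretised one the pre-print argues for, namely the flat-prior normal model from the Goodman/Senn exchange above. The sketch below shows only that route; it does not reproduce the paper’s own derivation or its 0.283 calculation.]

```python
from scipy.stats import norm

p_one_sided = 0.025
z1 = norm.ppf(1 - p_one_sided)   # observed z = 1.96 in the original study

# Flat prior: the posterior for the true effect is N(z1, 1) on the original study's SE scale.
# An infinitely large replication recovers the true effect exactly, so the probability that
# its estimate again exceeds the null is the posterior probability that the effect exceeds 0.
p_same_sign = 1 - norm.cdf(0, loc=z1, scale=1)
print(f"P(same sign in an enormous replication) = {p_same_sign:.3f}")   # 0.975 = 1 - P
```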
Dear Hew:
Thank you so much for linking to a pre-print of yours. While I haven’t studied it adequately yet, I can see you have a number of interesting proposals, in sync with recent appeals to “replication probabilities”. You say: “Some new insights have allowed P values to be connected to probabilities of replication, helping clinicians and scientists understand statistical concepts in terms familiar to them.” But if clinicians and scientists are to understand error statistical concepts “in terms familiar to them,” and these concepts are not already grounded in error statistical methods, the question arises as to what statistical grounds those familiar concepts rest on. Perhaps, as I take Senn to be saying, a correct understanding of the initial concepts corrects what some take as “familiar” assumptions about these methods. The worry that I have with the “new insights” connecting P values to probabilities of replication is that they entail misinterpreting the original error statistical concepts. It would be better for clinicians and scientists to be familiar with the concepts that are actually grounded by the methods they seek to understand, rather than start from a place that is “familiar” (perhaps when one is reasoning Bayesianly) and then invent ways that the original concepts appear to substantiate them. You say that you agree with Senn, but Senn suggests that those “familiar” interpretations are not desirable but regrettable. What is your take on Senn’s remarks such as:
“Replication probabilities are not of direct relevance to inferential meaning. They confuse the issue of making inferences. This is because we make inferences primarily about hypotheses or about the state of nature and not about future samples.
…. We do not, however, need [the] replication probability to be higher than 0.5 to believe that the efficacy of the treatment is probable.
…Although it is interesting to consider the repetition property of p-values, … it is false to regard the modest probability shown as being regrettable. It is desirable. Suppose it were the case that a low p-value brought with it a very high probability that it would be repeated. … This would mean that an anticipated result would have (nearly) the same inferential value as an actual one.”
Thank you, Mayo.
The question arises as to what statistical grounds those familiar concepts rest on.
The probability of replication in the long run is another measure of statistical error and addresses the same aleatory process in vocabulary familiar to the clinician and scientist. I show that, mathematically, the P value is equal to the probability of non-replication of the ‘same sign’, without invoking Bayes’ rule. This bridges the inferential gap between the P value and the probability of replication in the long run. In clinical and scientific inference, it is customary to establish the probability of a diagnosis or hypothesis or estimated treatment efficacy being correct before deciding to accept or reject a course of action (as opposed to directly rejecting a hypothesis on the basis of a P value alone).
The ‘true’ value of the parameter is established as a long-run frequency by hypothetically continuing a study impeccably until the sample size is enormous and the variance becomes negligible. The probability of the ‘true value’ falling within a specified range (e.g. > zero difference or > 1 SEM difference, etc.) conditional on the data is estimated directly using a probability distribution (and not a likelihood distribution followed by applying Bayes’ rule). It is therefore more frequentist than Bayesian.
What is your take on Senn’s remarks such as: we make inferences primarily about hypotheses or about the state of nature and not about future samples?
The subsequent hypotheses are tested using epistemological (as opposed to aleatory) probability theory of the kind that you, Mayo, term severe testing. These hypotheses are that the probable true result conditional on the data is due to true treatment efficacy, or to bias, or to P-hacking, and so on. Hopefully, evidence is available from the methods section and elsewhere that shows that all hypothetical causes except true treatment efficacy are improbable, leaving true treatment efficacy as the probable cause of the data. Further hypotheses (e.g. about the underlying mechanisms of the efficacy) may also be tested based on other data from the study or elsewhere.
What is your take on Senn’s remarks such as: Replication probabilities are not of direct relevance to inferential meaning, and they confuse the issue of making inferences?
I agree. Inferential meaning is provided by continuing or doing ‘same sign’ replication studies of theoretically enormous sample size. The replication probabilities that Stephen refers to are ‘P ≤ 0.05 again’ replication studies of the same or similar size. However, what he says also applies to ‘P ≤ 0.05 again’ replication studies of hypothetically enormous sample size. These modest probabilities (e.g. of 0.283 or 0.5 in my abstract) are preliminary as far as inference is concerned. My pre-print shows that they are exactly as expected and are no cause for concern. I show that P values alone predict these probabilities of ‘same size’ and ‘enormous size’ replication without having to know effect sizes etc.
…. We do not, however, need [the] replication probability to be higher than 0.5 to believe that the efficacy of the treatment is probable.
Agreed.
…Although it is interesting to consider the repetition property of p-values, … it is false to regard the modest probability shown as being regrettable. It is desirable. Suppose it were the case that a low p-value brought with it a very high probability that it would be repeated. … This would mean that an anticipated result would have (nearly) the same inferential value as an actual one.”
Agreed. It is not only desirable, but I have shown that the various probabilities of replication are exactly as expected and are no cause for concern.
Thanks for sharing this paper. These debates have a lot of threads, but I’ve really been enjoying reading old articles in Statistics in Medicine.
As far as a replication goes, I’d need to know something about the effect size. If we knew the effect size, then we would know the power of a future study. But then the whole problem would be solved. What I really need is a reasonable lower bound on the effect size.
A result of p = 0.05 puts that lower bound for the effect size at zero, using a 95% interval. If we replicate the study, we just have to accept that we might be underpowered no matter the replication sample size. Maybe we could say that for an observed p* < 0.05, there is a problematic effect size d* that is (or is not) ruled out. d* is problematic because it would imply a replication sample size n* that is outside of our budget.
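[Editorial aside: a rough sketch of Henry’s point, with entirely hypothetical numbers (a per-group size of 50 and an effect chosen to give roughly p = 0.025). It uses the standard normal-approximation sample-size formula for a two-sample comparison of means, which is my choice of illustration rather than anything Henry specifies.]

```python
import numpy as np
from scipy.stats import norm

sigma = 1.0            # assumed common SD of the outcome
n1 = 50                # per-group size of the original (hypothetical) study
d_hat = 0.45 * sigma   # observed effect; gives roughly p = 0.025 two-sided with n1 = 50

se1 = sigma * np.sqrt(2 / n1)
d_star = d_hat - norm.ppf(0.975) * se1   # lower end of the 95% interval for the effect

def n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sample comparison of means."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return np.ceil(2 * (z * sigma / d) ** 2)

print(f"observed effect d_hat = {d_hat:.2f}, 95% lower bound d_star = {d_star:.3f}")
print(f"n per group to detect d_hat:  {n_per_group(d_hat):.0f}")
print(f"n per group to detect d_star: {n_per_group(d_star):.0f}")
# As the observed p-value approaches 0.05, the lower bound approaches zero and the implied
# replication size blows up; that is Henry's 'problematic' d*.
```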
But I also think that the word “replication” might be misleading here. If I replicate your test equipment, it makes sense to copy what you did as close as possible. But if you claim that ESP is real with a sample size of n = 10, I’m going to try and rebut you with a new experiment with n = 40. I think that’s how it ought to go.
Henry:
I don’t think I understand your comment. Senn is questioning the inferential relevance to a given result of such considerations of replication probabilities. I don’t know if this gets to your comment.
Why cling to the “zero” hypothesis? Why not use alternate hypotheses?
Useful and clear article, thanks for sharing.