The pathetic P-value*
Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
This is the way the story is now often told. RA Fisher is the villain. Scientists were virtuously treading the Bayesian path when along came Fisher and gave them P-values, which they gladly accepted because they could get ‘significance’ so much more easily. Nearly a century of corrupt science followed, but now there are signs of a willingness to return to the path of virtue, having abandoned this horrible Fisherian complication:
We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started …
A condition of complete simplicity …
And all shall be well and
All manner of thing shall be well
TS Eliot, Little Gidding
Consider, for example, the distinguished scientist David Colquhoun citing the excellent scientific journalist Robert Matthews as follows:
“There is an element of truth in the conclusion of a perspicacious journalist:
‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug.’
Robert Matthews, Sunday Telegraph, 13 September 1998.”
However, this is not a plain fact but just plain wrong. Even if P-values were the guilty ‘mathematical machine’ they are portrayed to be, it is not RA Fisher’s fault. Putting the historical record right helps one to understand the issues better. As I shall argue, at the heart of this is not a disagreement between Bayesian and frequentist approaches but between two Bayesian approaches: it is a conflict to do with the choice of prior distributions.
Fisher did not persuade scientists to calculate P-values rather than Bayesian posterior probabilities; he persuaded them that the probabilities that they were already calculating and interpreting as posterior probabilities relied for this interpretation on a doubtful assumption. He proposed to replace this interpretation with one that did not rely on the assumption.
To understand this, consider Student’s key paper of 1908, in which the following statement may be found: ‘From the table the probability is .9985, or the odds are about 666 to one that 2 is the better soporific.’
Student was comparing two treatments that Cushny and Peebles had considered in their trials of optical isomers at the Insane Asylum at Kalamazoo. The t-statistic for the difference between the two means (in its modern form as proposed by Fisher) would be 4.06 on 9 degrees of freedom. The cumulative probability of this is 0.99858, or 0.9986 to 4 decimal places. However, given the constraints under which Student had to labour, his value of 0.9985 is remarkably accurate; he calculated 0.9985/(1 − 0.9985) ≈ 666 and interpreted this in terms of what a modern Bayesian would call posterior odds. Note that the right-hand probability corresponding to Student’s left-hand 0.9985 is 0.0015 and is, in modern parlance, the one-tailed P-value.
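As a quick check, here is a minimal sketch in Python (my own illustration, assuming scipy is available; not part of the original post) that reproduces these numbers from the t-statistic and degrees of freedom quoted above:

```python
# Reproduce the figures quoted above for Student's analysis:
# t = 4.06 on 9 degrees of freedom.
from scipy import stats

t_stat, df = 4.06, 9

cdf = stats.t.cdf(t_stat, df)   # exact cumulative probability, ~0.99858
p_one_tailed = 1 - cdf          # right-hand tail, ~0.0014

# Student, working from his own tables, rounded the probability to 0.9985:
student_cdf = 0.9985
student_odds = student_cdf / (1 - student_cdf)   # ~666 to 1

print(f"cumulative probability = {cdf:.5f}")
print(f"one-tailed P-value     = {p_one_tailed:.4f}")
print(f"Student's odds         = {student_odds:.0f} to 1")
```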
Where did Student get this method of calculation from? His own innovation was in deriving the appropriate distribution for what later came to be known as the t-statistic, but the general method of calculating an inverse probability from the distribution of the statistic was much older and associated with Laplace. In his influential monograph, Statistical Methods for Research Workers, Fisher, however, proposed an alternative, more modest interpretation, stating: ‘For n = 9, only one value in a hundred will exceed 3.250 by chance, so that the difference between the results is clearly significant.’
(Here n is the degrees of freedom and not the sample size.) In fact, Fisher does not even give a P-value here but merely notes that the probability is less than some agreed ‘significance’ threshold.
Comparing Fisher here to Student, and even making allowance for the fact that Student has calculated the ‘exact probability’ whereas Fisher, as a consequence of the way he had constructed his own table (entering at fixed pre-determined probability levels), merely gives a threshold, it is hard to claim that Fisher is somehow responsible for a more exaggerated interpretation of the probability concerned. In fact, Fisher has compared the observed value of 4.06 to a two-tailed critical value, a point that is controversial but cannot be represented as being more liberal than Student’s approach.
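To make Fisher’s table-based verdict concrete, a small sketch (again mine, with scipy assumed; not from the original post) recovers the two-tailed critical value he used:

```python
# Fisher entered his table at fixed, pre-determined probability levels
# rather than computing an exact P-value. At the 1% level, two-tailed,
# with 9 degrees of freedom, the critical value of t is about 3.250.
from scipy import stats

df, alpha = 9, 0.01
t_crit = stats.t.ppf(1 - alpha / 2, df)   # ~3.250

print(f"two-tailed 1% critical value = {t_crit:.3f}")
print(f"observed 4.06 exceeds it: {4.06 > t_crit}")  # hence 'clearly significant'
```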
To understand where the objection of some modern Bayesians to P-values comes from, we have to look to work that came after Fisher, not before him. The chief actor in the drama was Harold Jeffreys, whose Theory of Probability first appeared in 1939, by which time Statistical Methods for Research Workers was already in its seventh edition.
Jeffreys had been much impressed by the work of the Cambridge philosopher CD Broad, who had pointed out that the principle of insufficient reason might lead one to suppose that, given a large series of only positive trials, the next would also be positive, but could not lead one to conclude that all future trials would be. In fact, if the future series was large compared to the preceding observations, the probability was small [7, 8]. (Under a uniform prior, the probability that the next m trials are all positive, given n positive trials so far, is (n + 1)/(n + m + 1), which is small when m is large relative to n.) Jeffreys wished to show that induction could provide a basis for establishing the (probable) truth of scientific laws. This required lumps of probability on simpler forms of the law, rather than the smooth distribution associated with Laplace. Given a comparison of two treatments (as in Student’s case), the simpler form of the law might require only one parameter for their two means or, equivalently, that the parameter for their difference, τ, was zero. To translate this into the Neyman–Pearson framework requires testing something like
H0: τ = 0 v H1: τ ≠ 0 (1)
It seems, however, that Student was considering something like
H0: τ ≤ 0 v H1: τ > 0, (2)
although he perhaps also ought simultaneously to have been considering something like
H0: τ ≥ 0 v H1: τ < 0, (3)
though, again, in a Bayesian framework this is perhaps unnecessary.
(See David Cox for a discussion of the difference between plausible and dividing hypotheses.)
Now the interesting thing about all this is that, if you choose between (1) on the one hand and (2) or (3) on the other, it makes remarkably little difference to the inference you make in a frequentist framework. You can see this as either a strength or a weakness; it is largely due to the fact that the P-value is calculated under the null hypothesis and that in (2) and (3) the most extreme value, which is used for the calculation, is the same as that in (1). However, if you try to express the situations covered by (1) on the one hand and (2) and (3) on the other in terms of prior distributions and proceed to a Bayesian analysis, then it can make a radical difference, basically because all the other values in H0 in (2) and (3) have even less support than the single value of H0 in (1). This is the origin of the problem: there is a strong difference in results according to the Bayesian formulation. It is rather disingenuous to represent it as a problem with P-values per se.
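A small numerical illustration of this contrast (my own sketch, not Senn’s; the choice of a result sitting exactly at the two-sided 5% boundary and the unit-information normal prior under the alternative are illustrative assumptions) shows the one- and two-sided P-values differing only by a factor of two, while the smooth and lump priors give posterior probabilities for the null an order of magnitude apart:

```python
# Contrast the frequentist and the two Bayesian answers for a result
# just significant at the two-sided 5% level with 9 degrees of freedom.
import numpy as np
from scipy import stats

df, n = 9, 10
t_obs = stats.t.ppf(0.975, df)   # ~2.262, the two-sided 5% boundary

# Frequentist: (1) versus (2)/(3) changes the P-value only by a factor of 2.
p_one = stats.t.sf(t_obs, df)    # one-sided test of (2), 0.025
p_two = 2 * p_one                # two-sided test of (1), 0.050

# Smooth prior (Laplace/Student): with a flat prior on tau, the posterior
# probability that tau <= 0 equals the one-sided P-value.
post_H0_smooth = p_one           # 0.025, odds 39 to 1 against the null

# Lump prior (Jeffreys): probability 1/2 on tau = 0 and, under H1, a normal
# unit-information prior; a standard normal approximation to the Bayes factor.
z = t_obs
bf01 = np.sqrt(1 + n) * np.exp(-0.5 * z**2 * n / (1 + n))
post_H0_lump = bf01 / (1 + bf01)  # ~0.24, nearly ten times post_H0_smooth

print(p_one, p_two, post_H0_smooth, post_H0_lump)
```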
To do so, you would have to claim, at least, that the Laplace/Student Bayesian formulation is always less appropriate than the Jeffreys one. In Twitter exchanges with me, David Colquhoun has vigorously defended the position that (1) is what scientists do, even going so far as to state that all life-scientists do this. I disagree. My reading of the literature is that jobbing scientists don’t know what they do. The typical paper says something about the statistical methods and may mention the significance level but does not define the hypothesis being tested. In fact, a paper in the same journal and the same year as Colquhoun’s affords an example. Smyth et al. have 17 lines on statistical methods, including permutation tests (of which Colquhoun approves), but nothing about hypotheses, plausible, point, precise, dividing or otherwise, although the paper does, subsequently, contain a number of P-values.
In other words, scientists don’t bother to state which of (1) on the one hand or (2) and (3) on the other is relevant. It might be that they should, but it is not clear, if they did, which way they would jump. Certainly, in drug development I could argue that the most important thing is to avoid deciding that the new treatment is better than the standard when in fact it is worse, and this is certainly an important concern in developing treatments for rare diseases, a topic on which I do research. True Bayesian scientists, of course, would have to admit that many intermediate positions are possible. Ultimately, however, if we are concerned about the real false discovery rate, rather than what scientists should coherently believe about it, it is the actual distribution of effects that matters rather than their distribution in my head or, for that matter, David Colquhoun’s. Here a dram of data is worth a pint of pontification, and some interesting evidence as regards clinical trials is given by Djulbegovic et al.
Furthermore, in the one area, model-fitting, where the business of comparing simpler versus more complex laws really matters, rather than, say, deciding which of two treatments is better (note that in the latter case a wrong decision has more serious consequences), a common finding is not that the significance test using the 5% level is liberal but that it is conservative. The AIC criterion will choose a complex law more easily and, although there is no such general rule about the BIC because of its dependence on sample size, when one surveys this area it is hard to come to the conclusion that significance tests are generally more liberal.
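A back-of-the-envelope check (again my own, with scipy assumed) makes the AIC point concrete for the simplest case of one extra parameter:

```python
# AIC prefers the more complex model when the likelihood-ratio chi-square
# statistic exceeds 2 (the penalty for one extra parameter). The implied
# significance level is about 16%, so the 5% test is the stricter rule.
from scipy import stats

implied_alpha = stats.chi2.sf(2.0, df=1)     # ~0.157: AIC's implied alpha
five_pct_crit = stats.chi2.ppf(0.95, df=1)   # ~3.84: the 5% test's critical value

print(f"AIC's implied alpha (1 extra parameter) = {implied_alpha:.3f}")
print(f"5% test's chi-square critical value     = {five_pct_crit:.2f}")
```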
Finally, I want to make it clear that I am not suggesting that P-values alone are a good way to summarise results, nor am I suggesting that Bayesian analysis is necessarily bad. I am suggesting, however, that Bayes is hard and that pointing the finger at P-values ducks the issue. Bayesians (quite rightly, according to the theory) have every right to disagree with each other. This is the origin of the problem, and therefore to dismiss P-values
‘…would require that a procedure is dismissed because, when combined with information which it doesn’t require and which may not exist, it disagrees with a procedure that disagrees with itself.’ (p 195)
My research on inference for small populations is carried out in the framework of the IDEAL project http://www.ideal.rwth-aachen.de/ and supported by the European Union’s Seventh Framework Programme for research, technological development and demonstration under Grant Agreement no 602552.
1. Colquhoun, D., An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 2014. 1(3): p. 140216.
2. Senn, S.J., Two cheers for P-values. Journal of Epidemiology and Biostatistics, 2001. 6(2): p. 193-204.
3. Student, The probable error of a mean. Biometrika, 1908. 6: p. 1-25.
4. Senn, S.J. and W. Richardson, The first t-test. Statistics in Medicine, 1994. 13(8): p. 785-803.
5. Fisher, R.A., Statistical Methods for Research Workers, in Statistical Methods, Experimental Design and Scientific Inference, J.H. Bennett, Editor. 1990, Oxford University Press: Oxford.
6. Jeffreys, H., Theory of Probability. Third ed. 1961, Oxford: Clarendon Press.
7. Senn, S.J., Dicing with Death. 2003, Cambridge: Cambridge University Press.
8. Senn, S.J., Comment on “Harold Jeffreys’s Theory of Probability Revisited”. Statistical Science, 2009. 24(2): p. 185-186.
9. Cox, D.R., The role of significance tests. Scandinavian Journal of Statistics, 1977. 4: p. 49-70.
10. Smyth, A.K., et al., The use of body condition and haematology to detect widespread threatening processes in sleepy lizards (Tiliqua rugosa) in two agricultural environments. Royal Society Open Science, 2014. 1(4): p. 140257.
11. Djulbegovic, B., et al., Medical research: trial unpredictability yields predictable therapy gains. Nature, 2013. 500(7463): p. 395-396.
*This post was first blogged here last March. Please see the 145 comments from that discussion. A sequel to Senn’s paper is here. This is the third in my “let PBP” series.
Senn mentions David Colquhoun here. He’s another individual who has advocated the same computation seen in the last two posts of my “let PBP” series: a computation with little relation to the relevant error probabilities associated with tests.
An anti-Bayesian, Colquhoun has nevertheless been sold on the model of getting prior probabilities for hypotheses by imagining selecting from “urns of null hypotheses”, found in Berger and others (at least when Bayesian Berger is claiming to be ‘frequentist’). Yet his position is more extreme than that of Ioannidis, who showed that we need quite a bit of biasing selection effects for the frequentist-Bayesian computation to come out badly.

My problem with all of these “science-wise rates of false findings” is that no test of a hypothesis is allowed to be considered on its own merits, on whether maybe the researchers didn’t just rush into print after a single statistically significant result just at the .05 level. Your research, your findings, your checks, your data analysis should not be just a faceless number in a huge imagined urn of hypotheses (spanning how many fields? how many years? and how do we ever know the percentage that are “true”?). I’m opposed to guilt by association for hypotheses, as well as innocence by association. I’m opposed to advocating that scientists start with a pool of null hypotheses where a high percentage are assumed known to be false (making the findings “true”, as if we don’t care about the magnitude of effects). Who knows how you can ever determine the aggregate set of nulls to use to arrive at the proportion assumed true? Indeed, an often-heard meme is that “all nulls are false”, in which case presuming 50% true nulls, as Colquhoun does, makes no sense.
We should stop enabling bad behavior (publishing with a single .05 P-value after hunting and P-hacking) instead of seeking to make up for it with a sufficiently unchallenging urn of hypotheses in which there are few true nulls. That’s crazy!
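For concreteness, here is a sketch of the kind of “urn” computation at issue (my own rendering; the significance level of 0.05 and power of 0.8 are the conventional values used in such arguments, and the proportion of true nulls is exactly the unknowable quantity being objected to):

```python
# The screening computation: imagine drawing hypotheses from an urn in
# which some fixed proportion of nulls are true, testing each at alpha
# with the stated power, and counting what fraction of 'discoveries'
# are false. Everything hinges on the assumed proportion of true nulls.
alpha, power = 0.05, 0.80

for prop_true_nulls in (0.10, 0.50, 0.90):
    false_pos = alpha * prop_true_nulls          # significant, null true
    true_pos = power * (1 - prop_true_nulls)     # significant, null false
    fdr = false_pos / (false_pos + true_pos)
    print(f"true nulls = {prop_true_nulls:.0%}: false discovery rate = {fdr:.1%}")
```

The output runs from under 1% to 36% as the urn changes, which is the whole point: the “false discovery rate” is driven by an assumed proportion no one can check.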
The bigger difference between Fisher and Student here is that Student makes a claim about the odds of “a better soporific”, while Fisher limits himself to saying the “difference between the results is clearly significant”. The first interpretation is wrong, the second is of dubious scientific value. So why bother calculating these p-values*?
*I limit my argument to the default nil-null hypothesis of no difference. The story is completely different when the null hypothesis is predicted by the theory being tested, which can sometimes be “no difference”.