**Stephen Senn**

Head of Competence Center for Methodology and Statistics (CCMS)

Luxembourg Institute of Health

**The pathetic P-value* [3]**

This is the way the story is now often told. RA Fisher is the villain. Scientists were virtuously treading the Bayesian path when along came Fisher and gave them P-values, which they gladly accepted because they could get ‘significance’ so much more easily. Nearly a century of corrupt science followed, but now there are signs of a willingness to return to the path of virtue, abandoning this horrible Fisherian complication:

We shall not cease from exploration

And the end of all our exploring

Will be to arrive where we started …

A condition of complete simplicity …

And all shall be well and

All manner of thing shall be well

TS Eliot, Little Gidding

Consider, for example, the distinguished scientist David Colquhoun citing the excellent scientific journalist Robert Matthews as follows:

“There is an element of truth in the conclusion of a perspicacious journalist:

‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug.’

Robert Matthews

Sunday Telegraph, 13 September 1998.” [1]

However, this is not a plain fact but just plain wrong. Even if P-values were the guilty ‘mathematical machine’ they are portrayed to be, it is not RA Fisher’s fault. Putting the historical record right helps one to understand the issues better. As I shall argue, at the heart of this is not a disagreement between Bayesian and frequentist approaches but between two Bayesian approaches: it is a conflict to do with the choice of prior distributions[2].

Fisher did not persuade scientists to calculate P-values rather than Bayesian posterior probabilities; he persuaded them that the probabilities that they *were already calculating* *and interpreting as posterior probabilities* relied for this interpretation on a doubtful assumption. He proposed to replace this interpretation with one that did not rely on the assumption.

To understand this, consider Student’s key paper[3] of 1908, in which the following statement may be found:

Student was comparing two treatments that Cushny and Peebles had considered in their trials of optical isomers at the Insane Asylum at Kalamazoo[4]. The t-statistic for the difference between the two means (in its modern form as proposed by Fisher) would be 4.06 on 9 degrees of freedom. The cumulative probability of this is 0.99858, or 0.9986 to 4 decimal places. However, given the constraints under which Student had to labour, his value of 0.9985 is remarkably accurate. He calculated 0.9985/(1 − 0.9985) ≈ 666 and interpreted this in terms of what a modern Bayesian would call *posterior odds*. Note that the right-hand probability corresponding to Student’s left-hand 0.9985 is 0.0015 and is, in modern parlance, the *one-tailed P-value*.
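Student’s arithmetic is easy to check with modern tools. A minimal sketch, using only the Python standard library (the Simpson’s-rule integrator simply stands in for a t-distribution CDF; it is not a claim about how Student himself computed):

```python
import math

def student_t_cdf(x, df, steps=10000):
    """P(T <= x) for Student's t with df degrees of freedom (x >= 0),
    by composite Simpson's-rule integration of the density."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda t: c * (1.0 + t * t / df) ** (-(df + 1) / 2)
    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    # 0.5 accounts for the lower half of the symmetric distribution
    return 0.5 + total * h / 3.0

p_left = student_t_cdf(4.06, 9)    # Student's cumulative probability, ~0.9986
odds = p_left / (1.0 - p_left)     # his "posterior odds", ~700 (his rounded 0.9985 gives 666)
p_one_tailed = 1.0 - p_left        # the modern one-tailed P-value, ~0.0014
```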

Where did Student get this method of calculation from? His own innovation was in deriving the appropriate distribution for what later came to be known as the t-statistic, but the general method of calculating an inverse probability from the distribution of the statistic was much older and associated with Laplace. In his influential monograph, *Statistical Methods for Research Workers* [5], Fisher, however, proposed an alternative, more modest, interpretation, stating:

(Here *n* is the degrees of freedom and not the sample size.) In fact, Fisher does not even give a P-value here but merely notes that the probability is less than some agreed ‘significance’ threshold.

Comparing Fisher here to Student, and even making allowance for the fact that Student has calculated the ‘exact probability’ whereas Fisher, as a consequence of the way he had constructed his own table (entering at fixed pre-determined probability levels), merely gives a threshold, it is hard to claim that Fisher is somehow responsible for a more exaggerated interpretation of the probability concerned. In fact, Fisher has compared the observed value of 4.06 to a *two*-tailed critical value, a point that is controversial but cannot be represented as being more liberal than Student’s approach.

To understand where the objection of some modern Bayesians to P-values comes from, we have to look to work that came *after* Fisher, not before him. The chief actor in the drama was Harold Jeffreys, whose *Theory of Probability* [6] first appeared in 1939, by which time *Statistical Methods for Research Workers* was already in its seventh edition.

Jeffreys had been much impressed by the work of the Cambridge philosopher CD Broad, who had pointed out that the principle of insufficient reason might lead one to suppose, given a long series of exclusively positive trials, that the next trial would also be positive, but could not lead one to conclude that all future trials would be. In fact, if the future series was large compared to the preceding observations, the probability was small[7, 8]. Jeffreys wished to show that induction could provide a basis for establishing the (probable) truth of scientific laws. This required lumps of probability on simpler forms of the law, rather than the smooth distribution associated with Laplace. Given a comparison of two treatments (as in Student’s case), the simpler form of the law might require only one parameter for their two means or, equivalently, that the parameter for their difference, τ, was zero. To translate this into the Neyman-Pearson framework requires testing something like

H_{0}: τ = 0 v H_{1}: τ ≠ 0 (1)

It seems, however, that Student was considering something like

H_{0}: τ ≤ 0 v H_{1}: τ > 0, (2)

although he perhaps also ought simultaneously to be considering something like

H_{0}: τ ≥ 0 v H_{1}: τ < 0, (3)

although, again, in a Bayesian framework this is perhaps unnecessary.

(See David Cox[9] for a discussion of the difference between plausible and dividing hypotheses.)

Now the interesting thing about all this is that if you choose between (1) on the one hand and (2) or (3) on the other, it makes remarkably little difference to the inference you make in a frequentist framework. You can see this as either a strength or a weakness, and it is largely to do with the fact that the P-value is calculated under the null hypothesis and that in (2) and (3) the most extreme value, which is used for the calculation, is the same as that in (1). However, if you try to express the situations covered by (1) on the one hand and (2) and (3) on the other in terms of prior distributions and proceed to a Bayesian analysis, then it can make a radical difference, basically because all the other values in H_{0} in (2) and (3) have even less support than the single value of H_{0} in (1). This is the origin of the problem: there is a strong difference in results according to the Bayesian formulation. It is rather disingenuous to represent it as a problem with P-values *per se*.
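The divergence is easy to exhibit in a toy normal-theory calculation (a sketch only: the z-statistic, sample size, and the unit-information prior used for the Jeffreys-style lump are my illustrative choices, not anything in the text):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z, n = 2.0, 10  # illustrative observed z-statistic and sample size

# Laplace/Student-style smooth (flat) prior on tau: the posterior
# probability that tau <= 0 equals the one-tailed P-value, as in (2)
p_one_sided = 1.0 - norm_cdf(z)               # ~0.023

# Jeffreys-style lump of prior 1/2 on tau = 0, as in (1), with a
# unit-information N(0, sigma^2) prior on tau under the alternative
bf01 = math.sqrt(1 + n) * math.exp(-(z ** 2) / 2 * n / (n + 1))
post_h0 = bf01 / (1.0 + bf01)                 # ~0.35

# The same data leave ~35% posterior probability on the null under the
# Jeffreys formulation, versus ~2.3% under the smooth-prior formulation.
```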

To do so, you would have to claim, at least, that the Laplace, Student etc Bayesian formulation is always less appropriate than the Jeffreys one. In Twitter exchanges with me, David Colquhoun has vigorously defended the position that (1) is what scientists do, even going so far as to state that *all* life-scientists do this. I disagree. My reading of the literature is that jobbing scientists don’t know what they do. The typical paper says something about the statistical methods, and may mention the significance level, but does not define the hypothesis being tested. In fact, a paper in the same journal and same year as Colquhoun’s affords an example. Smyth et al[10] have 17 lines on statistical methods, including permutation tests (of which Colquhoun approves), but nothing about hypotheses, plausible, point, precise, dividing or otherwise, although the paper does, subsequently, contain a number of P-values.

In other words, scientists don’t bother to state which of (1) on the one hand or (2) and (3) on the other is relevant. It might be that they *should*, but it is not clear, if they did, which way they would jump. Certainly, in drug development I could argue that the most important thing is to avoid deciding that the new treatment is better than the standard when in fact it is worse, and this is certainly an important concern in developing treatments for rare diseases, a topic on which I do research. True Bayesian scientists, of course, would have to admit that many intermediate positions are possible. Ultimately, however, if we are concerned about the *real* false discovery rate, rather than what scientists should coherently *believe* about it, it is the actual distribution of effects that matters rather than their distribution in my head, or, for that matter, David Colquhoun’s. Here a dram of data is worth a pint of pontification, and some interesting evidence as regards clinical trials is given by Djulbegovic et al[11].

Furthermore, in the one area, model-fitting, where the business of comparing simpler versus more complex laws is important, rather than, say, deciding which of two treatments is better (note that in the latter case a wrong decision has more serious consequences), a common finding is *not* that the significance test using the 5% level is *liberal* but that it is *conservative*. The AIC criterion will choose a complex law more easily, and although there is no such general rule about the BIC, because of its dependence on sample size, when one surveys this area it is hard to come to the conclusion that significance tests are generally more liberal.
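For a single extra parameter, the AIC comparison can be re-expressed as a significance test, which makes the point concrete (a back-of-envelope sketch under standard large-sample assumptions):

```python
import math

# AIC = 2k - 2 * max log-likelihood. With one extra parameter, the
# complex model wins whenever twice the log-likelihood ratio exceeds 2;
# by Wilks' theorem that statistic behaves like z^2 under the simple law,
# so AIC "rejects" the simple law whenever |z| > sqrt(2).
z_aic = math.sqrt(2.0)
alpha_aic = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z_aic / math.sqrt(2.0))))
# alpha_aic ~ 0.157: AIC acts like a two-sided test at roughly the 16%
# level, considerably more liberal than a 5% significance test.
```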

Finally, I want to make it clear that I am not suggesting that P-values alone are a good way to summarise results, nor am I suggesting that Bayesian analysis is necessarily bad. I am suggesting, however, that Bayes is hard and that pointing the finger at P-values ducks the issue. Bayesians (quite rightly so according to the theory) have every right to disagree with each other. *This* is the origin of the problem and to *therefore* dismiss P-values

‘…would require that a procedure is dismissed because, when combined with information which it doesn’t require and which may not exist, it disagrees with a procedure that disagrees with itself.’[2] (p 195)

**Acknowledgement**

My research on inference for small populations is carried out in the framework of the IDEAL project http://www.ideal.rwth-aachen.de/ and supported by the European Union’s Seventh Framework Programme for research, technological development and demonstration under Grant Agreement no 602552.

**References**

1. Colquhoun, D., *An investigation of the false discovery rate and the misinterpretation of p-values.* Royal Society Open Science, 2014. **1**(3): p. 140216.
2. Senn, S.J., *Two cheers for P-values.* Journal of Epidemiology and Biostatistics, 2001. **6**(2): p. 193-204.
3. Student, *The probable error of a mean.* Biometrika, 1908. **6**: p. 1-25.
4. Senn, S.J. and W. Richardson, *The first t-test.* Statistics in Medicine, 1994. **13**(8): p. 785-803.
5. Fisher, R.A., *Statistical Methods for Research Workers*, in *Statistical Methods, Experimental Design and Scientific Inference*, J.H. Bennet, Editor. 1990, Oxford University Press: Oxford.
6. Jeffreys, H., *Theory of Probability*. Third ed. 1961, Oxford: Clarendon Press.
7. Senn, S.J., *Dicing with Death*. 2003, Cambridge: Cambridge University Press.
8. Senn, S.J., *Comment on “Harold Jeffreys’s Theory of Probability Revisited”.* Statistical Science, 2009. **24**(2): p. 185-186.
9. Cox, D.R., *The role of significance tests.* Scandinavian Journal of Statistics, 1977. **4**: p. 49-70.
10. Smyth, A.K., et al., *The use of body condition and haematology to detect widespread threatening processes in sleepy lizards (Tiliqua rugosa) in two agricultural environments.* Royal Society Open Science, 2014. **1**(4): p. 140257.
11. Djulbegovic, B., et al., *Medical research: trial unpredictability yields predictable therapy gains.* Nature, 2013. **500**(7463): p. 395-396.

***This post was first blogged here last March. Please see the 145 comments from that discussion. A sequel to Senn’s paper is here. This is the third [3] in my “let PBP” series.**

Senn mentions David Colquhoun here. He’s another individual who has advocated the same computation seen in my last two “let PBP” posts: a computation with little relation to the relevant error probabilities associated with tests.

https://errorstatistics.com/2015/12/05/beware-of-questionable-front-page-articles-warning-you-to-beware-of-questionable-front-page-articles-2/

https://errorstatistics.com/2015/11/28/return-to-the-comedy-hour-p-values-vs-posterior-probabilities-1/

An anti-Bayesian, Colquhoun nevertheless has been sold on the model of getting prior probabilities for hypotheses by imagining selecting from “urns of null hypotheses,” found in Berger and others (at least when Bayesian Berger is claiming to be ‘frequentist’). Yet it’s more extreme than Ioannidis, who showed we need quite a bit of biasing selection effects for the frequentist-Bayesian computation to come out bad. My problem with all of these “science-wise rates of false findings” is that no test of a hypothesis is allowed to be considered on its own merits: on whether, say, the researchers didn’t just rush into print after a single statistically significant result just at the .05 level. Your research, your findings, your checks, your data analysis should not just be a faceless number in a huge urn of hypotheses that is imagined (over how many fields? how many years? and how do we ever know the % that are “true”?). I’m opposed to guilt by association for hypotheses, as well as innocence by association. I’m opposed to advocating that scientists start with a pool of null hypotheses where it is assumed a high % are known to be false (making the findings “true”, as if we don’t care about the magnitude of effects). Who knows how you can ever determine the aggregate set of nulls to use to arrive at the proportion assumed true? Actually, an often-heard meme is that “all nulls are false”, in which case presuming 50% true nulls, as Colquhoun does, makes no sense.

We should stop enabling bad behavior (publishing with a single .05 P-value after hunting and P-hacking) by seeking to make up for it with a sufficiently unchallenging urn of hypotheses, where there are few true nulls. That’s crazy!

The bigger difference between Fisher and Student here is that Student makes a claim about the odds of “a better soporific”, while Fisher limits himself to saying the “difference between the results is clearly significant”. The first interpretation is wrong, the second is of dubious scientific value. So why bother calculating these p-values*?

*I limit my argument to the default nil-null hypothesis of no difference. The story is completely different when the null hypothesis is predicted by the theory being tested, which can sometimes be “no difference”.

Why bother? If you care about calibrating your subjective response to the data against a statistical model of some sort then you need to bother. The P-value may not be the only (or best) way to evaluate the evidence in hand using a statistical model, but that is what it does.

Stephen:

You write, “Bayesians (quite rightly so according to the theory) have every right to disagree with each other.”

You could also add, “Non-Bayesians (quite rightly so according to the theory) have every right to disagree with each other.”

Non-Bayesian statistics, like Bayesian statistics, uses models. Different researchers will use different models and thus, quite rightly so according to the theory, have every right to disagree with each other.

A few comments

(1) I have no way of reading Robert Matthews’ mind, but in quoting him I presumed that he was using Fisher’s name as a shorthand way of referring to null-hypothesis testing in general. If that is the case, then Matthews has a point.

(2) The problem lies (as I’m sure you are very well aware) in the fact that so many experimenters are under the false impression that the P value is the false discovery rate: the classical error of the transposed conditional. Of course they are wrong to suppose this, but they are absolutely right to think that what they want to know in order to interpret their results is the false discovery rate. The snag is that there is no precise way to calculate the thing that you want to know, but I think my paper has shown that it is possible to place a minimum limit on it with uncontroversial assumptions.

(3) Senn is perfectly right to say that most papers don’t state exactly (or at all) the hypothesis that they are testing. Nevertheless they behave as though they were testing a point null. I imagine that this is the case because that’s what is taught in just about every course and every textbook that they are likely to encounter.

(4) I made absolutely no claim about the “science-wise rates of false findings”. I very much doubt whether any realistic estimate can be made of that quantity. The question addressed in my paper was much narrower: if you observe P = 0.047 in a single unbiased test of significance, and claim that you have found a real effect, then the chance that you’ll be wrong is at least 30%. I’m talking about tests as usually conducted, so it doesn’t matter for that conclusion that the people doing the test have no clear idea of what hypothesis they are testing.

(5) Personally, testing a point null is what I want to do. As you say, probably most people are only vaguely aware that that’s what they are doing, but nevertheless it seems a perfectly appropriate thing to do. In the drug example, I wish to rule out the possibility that drug and placebo are identical. If I can persuade myself that they are not identical, then I can safely go on to the next stage, estimate the effect size, and judge whether or not it’s big enough to matter.

(6) @Mayo I deny “presuming 50% true nulls”. What I say is that the false discovery rate that I found is the minimum you can expect unless you start by assuming that the hypothesis you are testing is more likely than not to be true before the experiment is done. I don’t see any necessity for urns to come to that conclusion.

(7) I’d be more impressed if you had made a concrete suggestion of a better way of testing the difference between two means than the t test I used in the paper (I’d have preferred to use a randomisation test, of course, but the t test is faster so you can do 100,000 simulations in a couple of minutes). It isn’t very helpful to say t-tests aren’t the sensible thing to do unless you suggest an alternative.

(8) My paper certainly seems to have made an impression: 93,290 full-text views and 12,087 pdf downloads is huge compared with my serious work. That result is actually quite useful for my frequent argument that metrics don’t measure either originality or quality. But I do find that the conclusions seem to have been widely accepted by professional statisticians, even by David Cox. I’ll confess I was nervous when I talked about the results at a seminar in UCL’s Statistics Department, but no serious objections were raised, despite my berating them for not teaching students about the problems in elementary courses. So perhaps my conclusions are not as controversial as one might infer from reading this blog.

David: you say you don’t assume 50% true nulls, and then go ahead and assume it. There’s no evidence the simulated numbers you like so much have any bearing on the evidential appraisal of any given hypothesis. Bottom line: the numbers in all these analyses are manufactured and have nothing to do with correctly using the numbers that adequately run tests actually provide, e.g., type 1 and 2 errors, confidence levels. Even if science-wise error rates (you call them FDRs; Ioannidis, PPVs) were of interest, and I’m not saying there are no possible behavioristic performance questions that might call for them, there’s no empirical basis behind them. They have not been found, and Ioannidis has had to make ad hoc adjustments ever to get them. But science-wise error rates aren’t relevant for communicating or appraising inference. You mention Cox, but I know he’s allergic to this sort of thing. So I don’t know which aspects you regard as widely accepted. The idea of judging the value of significance tests along these lines (and then concluding they lack value) is so very, very far from the way Cox sees them that he’d probably just have a chuckle.

No, I do not assume the prior is 0.5. Rather, I say that it is not acceptable to assume a value greater than 0.5 (at least without very strong empirical evidence for a higher value, which is something that we very rarely have). Of course the prior could be less than 0.5, but that would result in a much larger false discovery rate, which is why I say *at least* 30%.
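Colquhoun’s “at least” figure can be reproduced, approximately, with a few lines of arithmetic rather than simulation. A sketch only: I substitute a normal for his t statistic and use his stated assumptions (prior 0.5 that the effect is real, 80% power, and the “P exactly 0.05” likelihoods of his section 10):

```python
import math

def npdf(z):
    """Standard normal density."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

prior_real = 0.5          # prior probability the effect is real (his ceiling)
z_crit = 1.9600           # two-sided 5% point of N(0, 1)
delta = z_crit + 0.8416   # standardised effect size giving 80% power

# likelihood of observing z exactly at the 5% boundary (either tail)
lik_null = npdf(z_crit) + npdf(-z_crit)
lik_real = npdf(z_crit - delta) + npdf(-z_crit - delta)

fdr = ((1 - prior_real) * lik_null /
       ((1 - prior_real) * lik_null + prior_real * lik_real))
# fdr ~ 0.29: roughly 3 in 10 "just significant" results come from true
# nulls under these assumptions, in line with the 26-30% figures quoted
# in this thread.
```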

I was surprised to get a letter, out of the blue, from David Cox in June 2015. He said, “I would have guessed the direction of the effect but underestimated its magnitude.”

“My question was: if you observe P=0.047 in a single unbiased test of significance, and claim that you have found a real effect, then the chance that you’ll be wrong is at least 30%. I’m talking about tests as usually conducted so it doesn’t matter for that conclusion that the people doing the test have no clear idea of what hypothesis they are testing.”

We beat this poor old horse to death last time, but it is still worth pointing out that the quote above can only be true with a package of assumptions about everyone else’s research that are not defensible, and also that it is nothing new to note that a SINGLE p-value is subject to sampling error: if you repeat the same experiment you expect to get something different next time. Under the assumptions of the null model, it is predictable how many p-values should be above 0.05 given N experiments. Likewise under an assumed alternative model. What is surprising? Surely not sampling error.

Fisher was certainly clear about the need for multiple results before drawing conclusions.

I’ve answered all of David’s criticisms before, in particular the one about testing point nulls (showing that neither history nor current practice nor purpose supports his contention), and showed how strongly his conclusions depend on his assumptions. I will just pick up on one thing. He writes

“if you observe P=0.047 in a single unbiased test of significance, and claim that you have found a real effect, then the chance that you’ll be wrong is at least 30%.”

Would he now like to calculate the following?

“if you observe P=0.047 in a single unbiased test of significance, and claim that therefore the effect is not real, then the chance that you’ll be wrong could be as high as x%.”

In view of David’s horror of making a fool of himself, I think he should calculate x.

That’s easy. If you test a sufficiently implausible hypothesis then the probability of wrong rejection approaches 1: e.g. if you test homeopathic pills which are identical with the placebo (so the null hypothesis is exactly true), then all positives are false positives.

But that was not your question. You asked

“if you observe P=0.047 in a single unbiased test of significance, and claim that therefore the effect is not real, then the chance that you’ll be wrong could be as high as x%.”

My first response to that is that I would never commit the solecism of claiming that an effect was not real. I would say that the experiment had not found good evidence for a real effect. It’s obvious that if you demand a lower P value to constitute strong evidence for rejection, as I and others suggest, the price will be more false negatives. How one deals with this is not a statistical question, but a matter of economics. I say in the paper

“Observation of a p∼0.05 means nothing more than ‘worth another look’. In practice, one’s attitude will depend on weighing the losses that ensue if you miss a real effect against the loss to your reputation if you claim falsely to have made a discovery.”

It is my understanding that Ioannidis considered only the case of P ≤ 0.05, whereas my conclusions (see section 10) are based on P = 0.05. I think the latter is the right approach to the question that I posed, and that gives an FDR of at least 26% for any prior that’s acceptable (≤ 0.5).

Lastly, neither you nor Mayo have said what the hapless experimenter should do when faced with the problem of comparing two means. It isn't very helpful to say that you can get any answer you want by making suitable assumptions.

I note that David does not actually give us x.

“Worth another look” is worth knowing, so the issue is this. How does David consider we should decide if something is worth another look? Does his P < 0.001 standard apply here or does he propose something else and if so what?

As regards comparing two means*, I would never rely on P-values alone. Point estimates, confidence intervals (or better, plots) and even likelihood and possibly Bayesian methods might be worth considering, but the sort of Bayesian/frequentist chimera you have on occasion proposed does not strike me as useful. “Wenn schon, denn schon” (“if at all, then properly”), as they say in German.

When you have nuisance parameters that are influential, things can get quite complicated. Here’s a proposal I considered for the business of “comparing two means” in a meta-analysis: http://onlinelibrary.wiley.com/doi/10.1002/sim.2639/abstract See in particular figure 4.

* In my applied work, in practice, comparing two means is not what it's about. The issue is rather estimating the treatment effect.

My proposal for verbal descriptions of P values was enlarged on in a comment on my paper 6 months ago http://rsos.royalsocietypublishing.org/content/1/3/140216#comment-1889100957

They are similar to suggestions made by Goodman and by V. Johnson.

The silly thing about these arguments is that we don’t seem to disagree very much about what should be done in practice.

David: I reject Goodman and Johnson’s recommendations for largely similar reasons. They view tests through a likelihoodist lens, and it’s to Goodman’s credit at least that he admits this. The result, ironically, is that both recommend inference to an alternative that is far, far beyond what a significance tester would allow. For example, Johnson will reject the null and infer an alternative against which the test has high power. This corresponds to using a confidence level < .5. This brings out the major difference between a comparative appraisal of “support” and the significance test.

@Mayo

I’m not sure what calculations of Ioannidis you are referring to, so I can’t answer that point.

David Colquhoun: (1) ‘The test had 80% sensitivity and 95% specificity, but it is clearly useless: the false discovery rate of 86% is disastrously high’. After an operation for the removal of a tumour, a patient has regular blood tests which include tumour markers. After one such test, one marker exceeds the maximum of the normal range by a factor of 2. The false discovery rate for such a high level is 90%. The oncologist remarks that this is disastrously high, that the test is useless and that he simply does not understand why it is standard practice. No further action is necessary.

(2) If all null hypotheses are true, the false discovery rate is 100%. If all are false, it is 0%. By adjusting the parameters, any false discovery rate can be obtained. In particular, even if there were a known empirical false discovery rate, the parameters could be adjusted to agree with it. This does not imply that the observed empirical rate is due to a proportion of false hypotheses.

(3) The simulations are such that the +1 is interpreted as meaning that the null hypothesis H_0: mu = 0 is false. Another interpretation is that H_0 is always true but that the +1 is a bias in the measurements. Berger and Sellke tell what I take to be a fictional account of an astronomer who finds out retrospectively that 30% of his null point hypotheses were correct although tested at the 5% level. Stigler gives 100 measurements of the speed of light in a vacuum. Only 3 values are less than the accepted speed of light. Is this because the light he was measuring was different from the light of today, somewhat faster, or is it more plausible that his measurements were biased?

(4) In time series data unsuspected long range correlations can lead to false discoveries.

(5) Much discussion of P-values is based on the family of normal models. This family is not prescribed by any authority and its use is the responsibility of the statistician. How is this decision made and does it play any role in the calculation of P-values? Here is Michelson’s third data set on the speed of light:

880, 880, 880, 860, 720, 720, 620, 860, 970, 950, 880, 910, 850, 870, 840, 840, 850, 840, 840, 840. The accepted modern value is 734.5. What is the P-value for the hypothesis H_0: c = 734.5 based on Michelson’s data? Any offers?

(6) More generally, should the decision about accepting a family of models have any influence on the P-value and, if so, how should this be done? To make it concrete, take the family of models to be the Gaussian family.
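Taking the challenge in (5) at face value: a sketch assuming the usual one-sample t-test against H_0: c = 734.5 (the figures are on the scale of Michelson’s recorded values), which is of course exactly the model choice being questioned:

```python
import math
import statistics

# Michelson's third data set, as quoted in the comment above
data = [880, 880, 880, 860, 720, 720, 620, 860, 970, 950,
        880, 910, 850, 870, 840, 840, 850, 840, 840, 840]
c0 = 734.5  # the accepted modern value, on the same scale as the data

n = len(data)
se = statistics.stdev(data) / math.sqrt(n)         # standard error of the mean
t = (statistics.mean(data) - c0) / se              # ~6.25 on 19 degrees of freedom
# The two-sided P-value is well below 0.0001: the data "reject" the true
# value decisively, which says rather more about measurement bias than
# about the speed of light.
```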

Stephen Senn: AIC and BIC are likelihood-based and give absolutely no idea as to whether a model is a reasonable approximation to the data or not. More generally, likelihood is blind. Why use AIC?

@lauriedavies2014

I don’t really understand the points you are making in (1) and (2).

(3) It’s true that my simulations used t tests on simulated data which were indeed normally-distributed: http://rsos.royalsocietypublishing.org/content/1/3/140216 My guess is that the results wouldn’t have differed greatly if I’d used a randomisation test (which is what I’d do in real life) rather than a t test.

(5) In real life, systematic errors are often more important than random errors. The example of the speed-of-light measurements is a notorious one. All this means is that, in real life, the false positive rate is likely to be even higher than I found in the paper. The simulations were done with perfectly randomised and unbiased data, yet still the question that I asked shows a minimum false positive rate of 26%. There are many ways that it could be higher than 26%, but not many that could make it less than 26%.

David: You really need to ponder your 26% in light of points 1) and 2) above… This was emphasized last go round as well.

Perhaps you should explain your points (1) and (2) more clearly.

If (1) means that a “positive” result is usually followed up by further investigations, that’s true, but it’s irrelevant to the question that I was trying to answer. Anyone who is familiar with the biomedical literature knows that it’s common to rush into print with P values that are only just below 0.05. There is a much-hyped example with P = 0.043 from Science magazine here: http://www.dcscience.net/2014/11/02/two-more-cases-of-hype-in-glamour-journals-magnets-cocoa-and-memory/

Point (2) is obvious and was dealt with in the paper (and several times in comments on this blog). The possibility of priors greater than 0.5 was raised in a comment on my paper by Loiselle & Ramachandran. My response was as follows:

“To postulate a prevalence greater than 0.5 is tantamount to saying that you were confident that your hypothesis was right before you did the experiment. That would imply that it is legitimate to claim that you have made a discovery on the basis of a statistical argument the premise of which is that it is more likely than not that your hypothesis was right before you did the experiment. I suspect that any such argument would be totally unacceptable to reviewers and editors. Certainly, I have never seen a paper that attempted to justify its conclusions in that way.”

http://rsos.royalsocietypublishing.org/content/2/8/150319

That is why, as an experimenter, I’m not willing to consider any prior greater than 0.5 (in the absence of strong empirical evidence which is virtually never available).

Does this answer your questions? Or have I misunderstood what you are trying to say?

Well, the points were not mine, but I think (2) is clarifying that your 26% means little to anyone else, because you control the “FDR” in your study by toggling the prior. You seem to believe it has some kind of empirical validity with regard to the use of p-values in research in general. It does not. What will it look like if you take Stephen’s suggestion and calculate x? For (1), I think the point is that your blanket statement of a 26% FDR as a disaster of some sort implies we should ignore the p-value altogether. However, even if you accept 26% as real (and I don’t), your draconian rejection of p-values makes no sense at all when you have a very small p-value, say 0.001. How likely is that to be a false discovery?

Please read the response to Mayo (and please read my paper). It is simply not true that I advocate “draconian rejection of p-values”. Quite the contrary.

David: Quite the contrary? You endorse the use of P-values?

My question was neither science-wise nor field-wise. It referred, very explicitly, to a single test.

The advantage of simulating what’s actually done in practice is that you can see how often you are wrong simply by counting: no math and no philosophy are needed.
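Counting by simulation can be sketched in a few lines. The following is my own minimal illustration of that idea, assuming the setup described in the paper: two-sample t-tests with n = 16 per group, a real effect of one standard deviation present in half the runs (so power is roughly 0.8 at α = 0.05 and the assumed prior is 0.5), keeping only the runs whose p-value falls just below 0.05 and counting what fraction of them came from true nulls.

```python
# Minimal sketch (my illustration of the simulation idea): simulate many
# two-sample t-tests, a real effect present in half of them, and count how
# often a "just significant" p-value comes from a true null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
runs, n, delta = 100_000, 16, 1.0          # delta = 1 sd gives power ~0.8 here
null_true = rng.random(runs) < 0.5         # assumed prior of 0.5 for "no effect"
shift = np.where(null_true, 0.0, delta)

a = rng.normal(0.0, 1.0, size=(runs, n))
b = rng.normal(shift[:, None], 1.0, size=(runs, n))
p = stats.ttest_ind(a, b, axis=1).pvalue   # one two-sample t-test per run

band = (p > 0.045) & (p < 0.05)            # keep only "just significant" runs
fdr = null_true[band].mean()
print(f"just-significant results that are false positives: {fdr:.0%}")
```

With enough runs the counted fraction settles in the region of the 26% figure quoted above; widening or narrowing the “just significant” band changes it only modestly.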

I agree that it’s a pity that I used the term “false discovery rate”, because of the somewhat different meaning attached to that term in the field of multiple comparisons. False positive rate would have been a better term. I did point this out 7 months ago in http://rsos.royalsocietypublishing.org/content/1/3/140216#comment-2027421208

I can’t bring myself to use “positive predictive value” because it is so wretchedly non-self-explanatory.

I don’t know where you got the idea that “You have said significance tests are irrelevant”; I quite explicitly say nothing of the sort.

David: But the point is that you assume that a relevant measure for a single test is the relative frequency of just-significant results, computed on the assumption of a prior rate of true nulls in the field.

David:

(1) `The test had 80% sensitivity and 95% specificity, but it is clearly useless: the false discovery rate of 86% is disastrously high’. This was in the context of mass screening, but unless I missed something this was the only context mentioned. The impression left is that a `false discovery rate of 86% is disastrously high’ whatever the context, and moreover that anyone using a method with such a high false discovery rate would be making a fool of himself 86% of the time. All I did was to provide a context where this is not so. As you yourself point out, any competent oncologist would take further measures. You state that your `paper seems to have made an impression 93290 full text views …’. How many of these interpreted it in the manner intended rather than in the manner in which it was written? You wrote that p approximately 0.05 means no more than perhaps worth a second look. Perhaps you could have given this much more emphasis rather than relegating it to one of several things that can be done in the last section.
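For anyone wanting to check the arithmetic behind the quoted 86%: with 80% sensitivity, 95% specificity, and a prevalence of 1% (the prevalence is my reading of the mass-screening setup, since it is what reproduces the quoted figure), Bayes’ theorem gives:

```python
# Checking the screening arithmetic: sensitivity 0.8, specificity 0.95.
# The 1% prevalence is my assumption about the mass-screening setup,
# chosen because it reproduces the 86% figure under discussion.
sens, spec, prev = 0.80, 0.95, 0.01

true_pos = sens * prev                 # diseased and test-positive
false_pos = (1 - spec) * (1 - prev)   # healthy but test-positive
fdr = false_pos / (true_pos + false_pos)
print(f"false discovery rate = {fdr:.0%}")   # 86% of positives are false
```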

(2) Yes, it is obvious, but it is to be read together with (3)-(6). Given the absence of empirical data on why there are so many false positives (if I am wrong then I would appreciate literature), your particular model is speculative. This too is obvious, but the first line of your summary gives no hint of it: you seem to take it as a fact that at least 30% of positive results are false, due to a mechanism well described by your model. How could we check this? We would have to go through the literature and consider all cases of hypothesis testing with a P-value of just less than 0.05. We would then have to check how many of these hypotheses turned out to be false and how many turned out to be true. This alone would be difficult, but it would not be sufficient. For each case we would have to look at the data and decide whether the calculation of the P-value was reasonable: no outliers, no bias, stability with respect to minor changes in the data, stability with respect to all reasonable choices of model or method of testing, no Simpson’s paradox (that is, homogeneous data sets were compared), no cheating, etc. In other words we would have to go through the data analysis and check that everything of importance corresponds well with the assumptions of your model. You are convinced that, had all this been done, the false discovery rate would turn out to be at least 30%. You may be correct, but in the absence of such data the claim is speculative.

My experience with real data (I do not include data from the literature in this) is very limited, namely X-ray diffractograms, lengths of study of German students, and the evaluation of interlaboratory comparisons. The data for the diffractograms were exact and clean and caused no problems. The student data are more interesting, especially in the light of Anne Case’s and Angus Deaton’s study on trends in death rates, which was rightly criticized by Andrew Gelman on his blog. At the time, the early 1980s, German students could in principle study as long as they wished. For one and the same degree the study times could range from 4 to 12 years. The lengths of study were measured by the lengths of study of the graduates in any one year. Because of changes in the number of students taking a particular degree, the composition of the graduates changed over time, causing changes in the average lengths of study of the graduates. When I pointed this out it became known as the Davies effect. Simpson’s paradox is explicitly mentioned in the preface to Peter Huber’s book on data analysis. Finally, I was involved in the formulation of the German standard methods for the examination of water, waste water and sludge. Such data are in general characterized by bias and outliers. In addition to this I have read articles in journals and blogs on the likelihood principle, on sufficiency, on optional stopping, even down to the standard accepted parametrization of the two-way table, and have come to the conclusion, no doubt just as speculative as yours, that there are no (<1%) examples in the literature which conform to your model.

(5) If one is to argue about P-values then this must include some justification of the manner in which they are calculated. In particular if the calculation is based on some normal model then this model must in some sense be justified. So to repeat:

Michelson’s third data set on the speed of light:

880,880,880,860,720,720,620,860,970,950,880,910,850,870,840,840,850,840,840,840. The accepted modern value is 734.5. What is the P-value of the hypothesis H_0: c=734.5 based on Michelson’s data? Would some normal model be acceptable? Any offers? There were no offers. There seems to be a reluctance to consider the problem of model choice. Given the prevalence of the normal model in discussions of P-values one would have thought it requires some attention.
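For what it is worth, here is what the naive normal-model calculation gives for these data (my own sketch; whether the model is acceptable is precisely the point at issue): a one-sample t-test of H_0: c = 734.5.

```python
# What a naive normal-model test gives for Michelson's third data set.
# The adequacy of the normal model is exactly what is in question here.
import math
import statistics

data = [880, 880, 880, 860, 720, 720, 620, 860, 970, 950,
        880, 910, 850, 870, 840, 840, 850, 840, 840, 840]
mu0 = 734.5                                   # accepted modern value

n = len(data)
mean = statistics.fmean(data)
se = statistics.stdev(data) / math.sqrt(n)    # standard error of the mean
t = (mean - mu0) / se
print(f"mean = {mean}, t = {t:.2f} with {n - 1} df")   # t ~ 6.2
```

The naive answer is a vanishingly small P-value (t of about 6.2 on 19 degrees of freedom), yet the values 620 and 720 sit far from the rest of the sample, which is exactly why the normal model itself needs justifying before the number means anything.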

Laurie: I think you ended this in the middle. Thanks for pointing out some of the unwarranted assumptions in David’s analysis. The worst part isn’t merely that he and others who jump on the science-wise screening bandwagon declare they’ve demonstrated the irrelevance of significance levels and power; it’s that they assume the appropriate way to appraise inference is by means of one of these science-wise rates. No one has ever shown the relevance, or even the possibility, of usefully applying the computation. The upshot would be for scientists to restrict themselves to hypotheses from an urn of non-nulls with high rates of “truth”, even if this reflects a high “crud factor” (as Meehl called it).

How can you possibly assert that

“The worst part isn’t merely that he and others who jump on the science-wise screening bandwagon declare they’ve demonstrated the irrelevance of significance levels and power”.

I have specifically said, several times now, that I’m not trying to calculate a “science-wise” error rate, and that I don’t believe that it’s possible to do so.

Neither do I for one moment say that significance levels are irrelevant, or that power is irrelevant.

Quite the contrary, my conclusions include:

“If you do a significance test, just state the p-value and give the effect size and confidence intervals. But be aware that 95% intervals may be misleadingly narrow, and they tell you nothing whatsoever about the false discovery rate.”

I don’t mind at all being criticised for things that I’ve said, but it’s a bit thick to be criticised for things that I have neither said nor believe.

David: I’ve had to use some name; do you prefer “field-wise” error rate? Even if you limited your “science-wise” error rates to a given field or a given journal, my criticisms stand. Please tell me the population over which you regard your posterior probability computation as ranging. How about field-wise positive predictive value? “False discovery rate” is defined differently, and has long been used not as a posterior or conditional probability. If you think it’s impossible to assess science-wise error rates, then why are your field-wise error rates any better off? My problem is deeper: even if computable, they’re irrelevant to assessing particular inferences. They might come in handy for various journal-wise or field-wise overall error rates.

You have said significance tests are irrelevant, and thus confidence intervals would be irrelevant too; there is a precise duality between them. My own view combines estimation and testing: rather than choosing a single confidence level, a series of confidence intervals is formed to determine which discrepancies are and are not warranted.