It is because the prior is never known that I advocate expressing uncertainty by calculating the prior that you’d have to believe in order to achieve a false positive rate (FPR) of, say, 5%. This is Matthews’ reverse Bayesian approach (see section 7 in http://www.biorxiv.org/content/biorxiv/early/2017/07/24/144337.full.pdf).
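The reverse-Bayesian idea can be sketched in a few lines. This is a minimal illustration, not the calculation in the linked paper: it uses the simpler “p-less-than” formula FPR = α(1−prior) / (α(1−prior) + power·prior), solved for the prior, and assumes a power of 0.8 (the paper’s own calculations use the more demanding “p-equals” likelihood-ratio version, via R scripts):

```python
def prior_needed(alpha, power, fpr):
    """Prior probability of a real effect that you'd have to believe
    in order to achieve the stated false positive rate, under the
    'p-less-than' formula FPR = a(1-p) / (a(1-p) + power*p)."""
    return alpha * (1 - fpr) / (alpha * (1 - fpr) + fpr * power)

# To get an FPR of 5% from a test significant at alpha = 0.05
# (power 0.8 assumed), you'd need a prior of roughly 0.54:
print(prior_needed(alpha=0.05, power=0.8, fpr=0.05))  # ≈ 0.54
```

The point of the exercise is that readers can see how implausibly high the required prior is, rather than being told a hypothesis is “significant.”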

In the rejoinder, Valen Johnson made it clear that the call is also based on empirical findings of non-reproducible research results. How many of those findings are significant at the 0.005 level? Should meta-analysis have a less stringent standard?

The many-authored paper does, belatedly, mention the bright line, but falls down by implying that the impediment to removing the line is a need to find a singular alternative method. Mayo, Stan and David all imply that the common assumption of ‘one size fits all’ for statistical inference is part of the problem. I agree. It is a large part, and it provides most of the impediment to a more reasonable use of statistical inference in scientific inference.

The proposal to reduce the critical threshold for significance from 0.05 to 0.005 may indeed reduce the number of unsupportable claims, and that may be helpful in some fields, but in other fields of research the power and resource costs of such a change would outweigh the benefits.

The prior probability of the outcome of a random selection is what most people seem to mean by the ‘prior probability’, however. This may mean taking into account prior data (e.g. from a pilot study, from a similar study, or from an imagined study) collected in exactly the same way as the current study. This can be done (a) by combining the prior distribution of such data with the likelihood distribution of the current data, or (b) by performing a simple meta-analysis, combining the ‘prior data’ with the ‘current data’ and assuming an ‘uninformed prior’ for the combined data, or (c) by assuming an ‘uninformed prior’ for the ‘current data’ and ignoring the ‘prior data’.
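For the simplest case (a normal mean with known variance), options (a) and (b) coincide: a conjugate update using the pilot data as the prior gives the same answer as simply pooling the two data sets. A small sketch, with hypothetical pilot and current samples:

```python
import numpy as np

rng = np.random.default_rng(1)
prior_data = rng.normal(0.5, 1.0, size=20)    # hypothetical pilot study
current_data = rng.normal(0.5, 1.0, size=30)  # current study, same protocol

# (a) Conjugate update: for a normal mean with known sigma, the posterior
# mean is the precision-weighted average of the two sample means.
m1, n1 = prior_data.mean(), len(prior_data)
m2, n2 = current_data.mean(), len(current_data)
post_mean_a = (n1 * m1 + n2 * m2) / (n1 + n2)

# (b) Simple meta-analysis: pool the raw data under a flat prior.
post_mean_b = np.concatenate([prior_data, current_data]).mean()

print(post_mean_a, post_mean_b)  # agree, up to floating point
```

This only holds because the ‘prior data’ were collected in exactly the same way as the current data; with different protocols or variances, the weighting would change.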

It seems to me that the ‘base-rate’ prior probabilities of the possible results of a random selection are always uniform (see the following blog: https://blog.oup.com/2017/06/suspected-fake-results-in-science/). It is only when prior data are taken into account to generate a non-base-rate prior probability that the latter are not uniform. The evidence on which these different prior probabilities are based then has to be combined with the evidence of the new data to produce a probability of replication based on all the evidence. This may have to be combined with even more evidence to arrive at the probabilities of various scientific hypotheses being ‘true’.

Changing levels of ‘significance’ from 0.05 to 0.005 does not seem an adequate solution.

“Since this paper was written, a paper (with 72 authors) has appeared [39] which proposes to change the norm for “statistical significance” from P = 0.05 to P = 0.005. Benjamin et al. [39] makes many of the same points that are made here, and in [1]. But there are a few points of disagreement:

(1) Benjamin et al. propose changing the threshold for “statistical significance”, whereas I propose dropping the term “statistically significant” altogether: just give the P value and the prior needed to give a specified false positive rate of 5% (or whatever). Or, alternatively, give the P value and the minimum false positive rate (assuming prior odds of 1). Use of fixed thresholds has done much mischief.

(2) The definition of false positive rate in equation 2 of Benjamin et al. [39] is based on the p-less-than interpretation. In [1], and in this paper, I argue that the p-equals interpretation is more appropriate for interpretation of single tests. If this is accepted, the problem with P values is even greater than stated by Benjamin et al. (e.g. see Figure 2).

(3) The value of P = 0.005 proposed by Benjamin et al. [39] would, in order to achieve a false positive rate of 5%, require a prior probability of real effect of about 0.4 (from calc-prior.R, with n = 16). It is, therefore, safe only for plausible hypotheses. If the prior probability were only 0.1, the false positive rate would be 24% (from calc-FPR+LR.R, with n = 16). It would still be unacceptably high even with P = 0.005. Notice that this conclusion differs from that of Benjamin et al [39] who state that the P = 0.005 threshold, with prior = 0.1, would reduce the false positive rate to 5% (rather than 24%). This is because they use the p-less-than interpretation which, in my opinion, is not the correct way to look at the problem.”
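The “p-less-than” side of the disagreement in point (3) can be checked directly. This sketch assumes a power of 0.8 (the quoted text’s own numbers come from the R script calc-FPR+LR.R with n = 16 per group); it reproduces the roughly 5% figure that Benjamin et al. report for a prior of 0.1:

```python
def fpr_p_less_than(alpha, power, prior):
    """False positive rate under the 'p-less-than' interpretation
    (the form used in Benjamin et al.'s equation 2)."""
    return alpha * (1 - prior) / (alpha * (1 - prior) + power * prior)

# P = 0.005 threshold with prior probability 0.1 of a real effect:
print(fpr_p_less_than(alpha=0.005, power=0.8, prior=0.1))  # ≈ 0.053
```

The p-equals interpretation conditions on observing P equal to 0.005, not P less than or equal to it, replacing power/alpha with the likelihood ratio of the observed result; that is what drives the FPR up to the 24% quoted above.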

Part of taking precautions, of course, is testing statistical assumptions, and the method for such checking is significance tests! The null asserts the given assumption holds, so, again, you might not want to be too demanding before claiming evidence of a model violation, if you’re keen to detect the fat tails you mention.

My own recommendation is never to just infer the difference is significant at such and such level, but to infer the magnitudes that are well and poorly indicated by the data at hand.** We should steer clear of all recipe statistics, whether in the form of Bayes factors (what do they mean?) or significance levels (which at least are calibrated with error probabilities).

*Neyman puts the “risk exists” hypothesis as the test or null hypothesis, by the way–in contrast with the standpoint of the point null.

**After tests of assumptions have shown reasonable model adequacy.

1. You don’t really know the value of, say, 2-sigma, but only an estimate of it. In the long run that estimate may be unbiased, but the p-value is highly non-linear in the value of sigma you use. So for *your* particular data set, who knows how far off your estimated p-value is? Of course, it might be off in a favorable way, but how can you tell?

2. You cannot, in many cases, really show from your data that the distributions are gaussian. You just can’t get enough points to reliably define the tails. If the tails are fatter, then a two-sigma (or whatever) spread won’t actually give you the p-value you thought you had. Of course, there are bounds (e.g., Chebyshev’s inequality), but they can be much looser than you might like your p-value to be.

It can be useful to do simulations – lots of runs – to get a sense of how these points may affect your particular case. And it is well to be humble about p-values anywhere close to a magic number like 0.05.
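Point 1 is easy to see by simulation. The sketch below (illustrative numbers only: a fixed observed mean of 0.6 with n = 16 and true sigma of 1) repeatedly re-estimates sigma from fresh samples and recomputes the nominal p-value, showing how widely the p-value swings with the sigma estimate:

```python
import math
import numpy as np

rng = np.random.default_rng(42)

def p_two_sided(z):
    """Two-sided normal p-value for a z statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

n, true_sigma, observed_mean = 16, 1.0, 0.6
true_p = p_two_sided(observed_mean / (true_sigma / math.sqrt(n)))

# Re-estimate sigma from fresh samples of the same size and recompute
# the p-value each time: the p-value is highly non-linear in sigma-hat.
est_ps = []
for _ in range(10_000):
    sigma_hat = rng.normal(0, true_sigma, n).std(ddof=1)
    est_ps.append(p_two_sided(observed_mean / (sigma_hat / math.sqrt(n))))

print(f"p with true sigma: {true_p:.4f}")
print(f"p with estimated sigma, 5th-95th percentile: "
      f"{np.percentile(est_ps, 5):.4f} to {np.percentile(est_ps, 95):.4f}")
```

With these numbers, the same observed effect can land anywhere from strongly significant to not significant at all, purely through sampling noise in the sigma estimate.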

It should be remembered that because the criticism is based on two-sided tests, even when there’s a predicted direction, the .05 level is actually .025 for the predicted direction, as I understand it. (In other words, you don’t typically see people using .05 one-sided tests, which really is weak).

My main problem with this particular argument is that it rests on a questionable Bayesian appraisal wherein a research hypothesis is claimed to have prior probability .1 because it has, allegedly, been selected from an urn of hypotheses only 10% of which are true (so 90% of the corresponding nulls are true). It’s kind of a cross between a type of Bayesian prior and a diagnostic screening prevalence. The real problem with the project, despite its leaders having the best of intentions, is that people may think this is the correct way to appraise evidence of a given hypothesis in science (transposing the “conditional” in an ordinary error probability). I’ve talked about all this before, so I’ll just leave it to others to weigh in.

The wheels of science fall off with p-hacking and HARKing, which are VERY common. I’ve now counted over 50 observational papers. When you count up the outcomes-of-interest, potential predictors, and covariates, the median number of questions under consideration is on the order of 9,000. Nine followed by three zeros. It is no wonder p-values from that process do not replicate.

Use an honest 0.05 (one question, and follow the other rules), then replicate, and you get 0.05 × 0.05 = 0.0025, which is close enough to 0.005 for me. Boos and Stefanski suggest 0.001.

None of these suggestions (0.005, 0.0025, 0.001) works if there is p-hacking and HARKing.
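The arithmetic behind both claims can be checked in a couple of lines. The replication figure assumes the two studies are independent; the multiplicity figure uses the comment’s ~9,000-questions-per-paper estimate with a per-question 0.05 threshold:

```python
alpha = 0.05

# An honest 0.05 plus an independent replication at 0.05: the chance
# that BOTH studies give a false positive on a true null is alpha^2.
print(alpha * alpha)  # 0.0025

# With ~9,000 implicit questions per paper, each tested at 0.05,
# at least one false positive is effectively guaranteed.
k = 9000
print(1 - (1 - alpha) ** k)  # ≈ 1.0
```

This is why no threshold fixes the problem on its own: p-hacking and HARKing change the number of questions actually asked, not the per-question error rate.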

https://errorstatistics.files.wordpress.com/2017/07/strack-2016-smiling-registered-replication-report.pdf