If you were on a committee to highlight issues surrounding P-values and replication, what’s the first definition you would check? Yes, exactly. Apparently, when it came to the recently released National Academies of Science “Consensus Study” Reproducibility and Replicability in Science 2019, no one did.

This Consensus Study was prompted by concerns about the reproducibility and replicability of scientific research. …To carry out the task, the National Academies appointed a committee of 15 members representing a wide range of expertise. (NAS Consensus Study xvi)

**I. Use the correct definition of P-value, distinguish likelihood and probability.** I limit myself to their remarks on statistical significance tests.

“Because hypothesis testing has been involved in a major portion of reproducibility and replicability assessments, we consider this mode of statistical inference in some detail.” (p.34) Unfortunately, they don’t give us the essential details, and what they give us contains flaws. Let me annotate what they say:

(1) Scientists use the term null hypothesis to describe the supposition that there is no difference between the two intervention groups or no effect of a treatment on some measured outcome (Fisher, 1935). (2) A standard statistical test aims to answer the question: If the null hypothesis is true, what is the likelihood of having obtained the observed difference? (3) In general, the greater the observed difference, the smaller the likelihood it would have occurred by chance when the null hypothesis is true. (4) This measure of the likelihood that an obtained value occurred by chance is called the p-value. (NAS Consensus Study p. 34)

*Remarks*:

(1) This limits the null hypothesis H_{0} to the “nil” or point null–an artificial restriction at the heart of many problems.

(2) It would be wrong to say the “aim” of a standard statistical test is getting a P-value–even if they did correctly define P-value, which they don’t. In fact, they incorrectly define it everywhere in the book, which is baffling. The aim, or *an* aim, is to distinguish signal from noise, or genuine effects from random error, or the like–in relation to a reference hypothesis (test hypothesis). The P-value is the probability (not the likelihood) of a difference as large *or larger* than the observed d_{0} under the assumption that the null hypothesis H_{0} is true. Any observed result d_{0} will be improbable in some respect. So if you declared evidence of a genuine effect whenever the observed difference was improbable under H_{0}, you’d have an extremely high Type I error probability (if not 1).

By looking at the P-value, Pr(d ≥ d_{0};H_{0}), we reason, if even larger differences than d_{0} occur fairly frequently under H_{0} (the P-value is *not* small), there’s scarcely evidence of incompatibility with H_{0}. Small P-values *indicate* a genuine discrepancy from (or incompatibility with) H_{0}, but isolated small P-values don’t suffice as evidence of genuine experimental effects (as Fisher stresses). (See this post). [i]
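To make the contrast between “the probability of the observed result” and the P-value concrete, here is a minimal sketch. Its setup is an assumption for illustration and appears nowhere in the NAS report: a hypothetical experiment of 100 coin tosses, with H_{0}: Pr(heads) = 0.5 and the number of heads as the test statistic. The function names are likewise illustrative. It computes the probability of exactly the observed count and the tail area Pr(d ≥ d_{0}; H_{0}).

```python
from math import comb


def point_probability(k: int, n: int, p0: float = 0.5) -> float:
    """Probability of exactly k successes in n trials, computed under H0."""
    return comb(n, k) * p0**k * (1 - p0) ** (n - k)


def p_value_upper(k: int, n: int, p0: float = 0.5) -> float:
    """One-sided P-value: Pr(X >= k; H0), the tail area at or beyond k."""
    return sum(point_probability(j, n, p0) for j in range(k, n + 1))


if __name__ == "__main__":
    n = 100  # hypothetical number of tosses
    for observed in (50, 60):
        print(
            f"observed={observed}  "
            f"Pr(X = observed; H0)={point_probability(observed, n):.4f}  "
            f"P-value Pr(X >= observed; H0)={p_value_upper(observed, n):.4f}"
        )
```

Under these assumptions even the single most probable outcome, 50 heads, has probability below 0.08, while its P-value is above 0.5. Being improbable under H_{0} therefore cannot, by itself, count as evidence against H_{0}; the tail area is what does the work. For 60 heads the tail area is roughly 0.03.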

(3) This is OK, but “likelihood” is a technical term and should not be used as a synonym for probable in any discussion trying to clarify terms. Doing so just begs for confusion and transposition fallacies. For example, frequentists will assign likelihoods, but not probabilities, to statistical hypotheses.

In case you thought (2) was a slip, the error is repeated in (4):

(4) This measure of the likelihood that an obtained value occurred by chance is called the p-value.

NO. This is wrong. So I return to my question: Wouldn’t this be the first thing you looked at if you were serving on this committee?

**II. Consensus? Again, P-values and likelihood.** After that wobbly introduction to statistical tests, this Consensus Document turns to remarks from the 2019 American Statistical Association (ASA) editorial by Wasserstein, R., Schirm, A. and Lazar, N. (2019)–(ASA II). Unlike the 2016 Statement on P-values, ASA I, its authors are clear that ASA II is *not* a consensus document but, rather, is “open to debate”. The NAS Consensus Study does not note this qualification, although it does not go as far as ASA II in declaring that the concept of statistical significance be banished.

More recently, it has been argued that p-values, properly calculated and understood, can be informative and useful; however, a conclusion of statistical significance based on an arbitrary threshold of likelihood (even a familiar one such as p ≤ 0.05) is unhelpful and frequently misleading (Wasserstein et al., 2019) (NAS Consensus Study) [ii]

Now any prespecified “threshold” for statistical significance is “arbitrary”, according to ASA II, so it’s not clear how the two parts of this sentence cohere. Let’s agree that the attained P-value should always be reported. It doesn’t follow that taking into account whether it satisfies a preset value, say 0.005, is misleading. Moreover, thresholds can be intelligently chosen, e.g., to reflect meaningful population effect sizes. (See my recent “P-value thresholds: Forfeit at your peril”)

The NAS Consensus Study continues with the following, which again I’ll annotate:

(5) In some cases, it may be useful to define separate interpretive zones, where p-values above one significance threshold are not deemed significant, p-values below a more stringent significance threshold are deemed significant, and p-values between the two thresholds are deemed inconclusive. (6) Alternatively, one could simply accept the calculated p-value for what it is—the likelihood of obtaining the observed result if the null hypothesis were true—and refrain from further interpreting the results as “significant” or “not significant.” (NAS Consensus Study, 36)

*Remarks*:

(5) This first part is a good idea, and is in sync with how Neyman and Pearson (N-P) first set out tests, with 3 regions. ASA II, however, is opposed to trichotomy.

[T]he problem is not that of having only two labels. Results should not be trichotomized, or indeed categorized into any number of groups. (ASA II)

(6) No. There they go again. The P-value is not the probability of obtaining the observed result if the null hypothesis were true. And *please* stop saying “likelihood” when you mean “probability”. They are not the same. It might be fine in informal discussions, but not in guides to avoid fallacies.

The Consensus Study considers different ways to ascertain successful replication.

CONCLUSION 5-2: A number of parametric and nonparametric methods may be suitable for assessing replication across studies. However, a restrictive and unreliable approach would accept replication only when the results in both studies have attained “statistical significance,” that is, when the p-values in both studies have exceeded a selected threshold. Rather, in determining replication, it is important to consider the distributions of observations and to examine how similar these distributions are. (NAS Consensus Study, p. 74)

They do not show that accepting replication so long as “distributions of observations” are deemed “similar” in some sense (left vague) is more reliable than requiring results attain prespecified P-value thresholds–at least if the assessment of unreliability includes increases in false positives as well as false negatives [iii]. Of course testing thresholds need to be intelligently chosen, with regard for variability, indicated magnitude of discrepancy (of the initial study), and power of the tests to detect various discrepancies.

**III. Some Good Points: Data dependent subgroups and double counting.** There are plenty of important points throughout the Consensus Study. I mention just two. First, they tell the famous story with Richard Peto and post-data subgroups.

Misuse of statistical testing often involves post hoc analyses of data already collected, making it seem as though statistically significant results provide evidence against the null hypothesis, when in fact they may have a high probability of being false positives…. A study from the late-1980s gives a striking example of how such post hoc analysis can be misleading. The International Study of Infarct Survival was a large-scale, international, randomized trial that examined the potential benefit of aspirin for patients who had had a heart attack. After data collection and analysis were complete, the publishing journal asked the researchers to do additional analysis to see if certain subgroups of patients benefited more or less from aspirin. Richard Peto, one of the researchers, refused to do so because of the risk of finding invalid but seemingly significant associations. In the end, Peto relented and performed the analysis, but with a twist: he also included a post hoc analysis that divided the patients into the twelve astrological signs, and found that Geminis and Libras did not benefit from aspirin, while Capricorns benefited the most (Peto, 2011). This obviously spurious relationship illustrates the dangers of analyzing data with hypotheses and subgroups that were not prespecified (p.97)
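The arithmetic behind the astrological-sign twist is easy to sketch. The following is a simplified illustration, not a reanalysis of the trial: it assumes 12 independent subgroup tests, each with a true null hypothesis and each carried out at the 0.05 level. The independence assumption, the level, and the function name are all introduced here for illustration.

```python
def prob_at_least_one_false_positive(num_tests: int, alpha: float) -> float:
    """Chance that at least one of num_tests independent tests of true
    null hypotheses comes out significant at level alpha."""
    return 1 - (1 - alpha) ** num_tests


if __name__ == "__main__":
    alpha = 0.05  # assumed per-test significance level
    for num_tests in (1, 12):
        rate = prob_at_least_one_false_positive(num_tests, alpha)
        print(f"{num_tests:2d} subgroup test(s): Pr(at least one false positive) = {rate:.3f}")
```

Under these assumptions, searching through twelve post hoc subgroups raises the chance of at least one nominally significant but spurious finding from 0.05 to about 0.46, which is why the reported P-value no longer reflects the actual error probability.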

Then there’s a note prohibiting “double counting” data:

A fundamental principle of hypothesis testing is that the same data that were used to generate a hypothesis cannot be used to test that hypothesis (de Groot, 2014). In confirmatory research, the details of how a statistical hypothesis test will be conducted must be decided before looking at the data on which it is to be tested. When this principle is violated, significance testing, confidence intervals, and error control are compromised. Thus, it cannot be assured that false positives are controlled at a fixed rate. In short, when exploratory research is interpreted as if it were confirmatory research, there can be no legitimate statistically significant result. (NAS Consensus Study)

Strictly speaking, there are cases where error control can be retained despite apparently violating this principle (often called the requirement of “use novelty” in philosophy). It will depend on what’s meant by “same data”. For example, data can be used in generating a hypothesis about how a statistical assumption can fail, and also be used in testing that assumption. However, “the data” are remodelled to ask a different question, and error control can be retained. (See SIST Excursion 4, Tours II and III.) [iv]

**IV. Issue an errata for the P-value definitions.** The NAS Consensus Study only just came out; issuing a correction now will avoid a new generation of incorrect understandings of P-values.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[i] For a severe Tester’s justification of P-values, see Souvenir C of SIST (Mayo 2018, CUP): A Severe Tester’s Translation Guide.

[ii] Perhaps the phrase “it is argued that” indicates that it is just one of many views, but elsewhere in the document points from ASA II are reported without qualification. Fortunately, the document does not include the ASA II recommendation not to use the words “significant/significance”. Later in the book, they give items from the 2016 ASA Statement on P-Values and Statistical Significance (ASA I), which is largely a consensus document.

[iii] They do regard the point as “reinforced by [ASA II] in which the use of a statistical significance threshold in reporting is strongly discouraged due to overuse and wide misinterpretation (Wasserstein et al., 2019).”

[iv] Nor is “double counting” necessarily pejorative when testing explanations of a known effect. I delineate the cases according to whether a severity assessment of the inference of interest is invalidated.

Mayo, D. 2018. *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*. (SIST) CUP.

You can find many excerpts and mementos from SIST on this blog in this post.

Deborah: This is THE most informative and trenchant commentary you have issued to date regarding this hornets’ nest at NAS. It could not have been made more concise and/or penetrating and it encapsulates with vigor “the heart of the matter”.

James T. Lee, MD PhD


James: I would say it’s the bare minimum rather than trenchant, but thanks. Now what do we do to reach the people at NAS to get an errata issued? What do you mean about the hornets’ nest at NAS?

So I wrote to the chair of the NAS Consensus Study. I wonder if he will respond. NAS should be responsive to the citizens for whom the report was purportedly written. I’d be glad to hear what others think.

Deborah:

I agree that this National Academy of Sciences report has problems. Maybe I’m not so surprised, given that the National Academy of Sciences itself has such problems. Consider: One of the key points of the scientific reform movement is that too much trust has been placed with traditional gatekeepers, including journals that prize novelty over accuracy, journals like . . . the Proceedings of the National Academy of Sciences. So I can well believe that the National Academy of Sciences, loaded as it is with big shots with lots of reputation to lose, is not the best organization for promulgating reform.

One advantage of the American Statistical Association here is that, bureaucratic and imperfect as this organization may be, it represents working professionals, not Ted-talking TV stars. To put it another way: yes, the statistics profession deserves much of the blame for the current crisis in science, but, compared to the National Academy of Sciences, we’re not so professionally invested in denying or minimizing the problems.

Andrew:

Thanks for your comment. I really don’t know anything about the NAS. (This is actually said to be sponsored by the National Academies of Science–a group of them.) I don’t see why “big shots with lots of reputation to lose” would have trouble defining a P-value. They’re the ones who chose to invest in this project, which I only heard about very recently.

One of the earlier commentators said the NAS was a “hornets’ nest”. So now I’m really curious to hear some NAS stories from you.

Your comment on #1 is not entirely clear to me.

Are you saying that what NAS wrote rules out nulls that say “no difference in distribution between groups” instead of “no difference in [e.g.] the mean of the distribution, between groups”?

Or that what NAS wrote rules out complex null hypotheses, i.e. that a stated parameter lives in a stated subspace of a larger set?

Or something else? Thanks in advance for clarifying.

George: The second is closer, but I don’t say they “rule out” other kinds of test or null hypotheses. They’re to be defining tests in general, and they’re not limited to point nulls, that’s all.

But wait, there’s more!

Not sure if you made it down to Appendix D – Using Bayes Analysis for Hypothesis Testing

They almost get the definition of a p-value right in Appendix D (though given that Bayes predated Fisher I’m unclear on what ‘classical’ statistics refers to)

“The p-value, in classical statistics, is defined as the probability of finding an observed, or more extreme, result under the assumption that the null hypothesis is true.”

They further describe the tortured logic behind one effort to redefine the level of statistical significance (p 228):

“If the observed results produce a p-value equal to 0.005 and the prior probability of the experimental hypothesis is 0.25, then the postexperimental probability that the experimental hypothesis is true is about 90 percent. It is reasoning such as this (using different assumptions in applying the Bayes formula) that led a group of statisticians to recommend setting the threshold p-value to 0.005 for claims of new discoveries (Benjamin et al., 2018). One drawback with this very stringent threshold for statistical significance is that it would fail to detect legitimate discoveries that by chance had not attained the more stringent p-value in an initial study. Regardless of the threshold level of p-value that is chosen, in no case is the p-value a measure of the likelihood that an experimental hypothesis is true.”

How a p-value ends up in a Bayesian framework leaves me baffled. It’s this kind of mash-up that leads to unclear findings such as the 0.005 threshold recommendation.

“When the prior probability of an experimental hypothesis (P[H1]) is 0.3 (meaning its pre-experimental likelihood of being true is about 1 in 3) and the p-value is 0.05, Table D-1 shows the post-experimental probability to be about 62 percent (posterior odds favoring H1 of 1.658 are equivalent to a probability of about 62%). If replication efforts of studies with these characteristics were to fail about 40 percent of the time, one would say this is in line with expectations, even assuming the studies were flawlessly executed.”

If a study is repeatedly flawlessly executed, and failed to reject the null hypothesis 40% of the time, this would hardly comport with Fisher’s concept “we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result”. At the point that an experiment will rarely fail to give a significant result, its prior probability will certainly not be anywhere near 0.3.

If the prior probability of an experimental hypothesis is 0.25 or 0.3, you are still in the exploratory phase of identifying a potentially replicable phenomenon and should perform more confirmatory experiments before attempting to assert a new discovery in a scientific journal.

“When a study fails to be replicated, it may be because of shortcomings in study design or execution, or it may be related to the boldness of the experiment and surprising nature of the results, as manifested in a low preexperimental probability that the scientific inference is correct (Wilson and Wixted, 2018). For this reason, failures to replicate can be a sign of error, may relate to variability in the data and sample size of a study, or they may signal investigators’ eagerness to make important, unexpected discoveries and represent a natural part of the scientific process.”

This appears to be a claim that when we see that a finding does not readily replicate, that indicates that the finding is an important, bold and unexpected discovery. Perhaps this was Fleischmann’s and Pons’ reasoning when they announced their cold fusion finding (mentioned on page 30).

Looks like some cleaning up in Appendix D would also be helpful.

—

They further describe the tortured logic behind one effort to redefine the level of statistical significance (p 228):

“If the observed results produce a p-value equal to 0.005 and the prior probability of the experimental hypothesis is 0.25, then the postexperimental probability that the experimental hypothesis is true is about 90 percent. It is reasoning such as this (using different assumptions in applying the Bayes formula) that led a group of statisticians to recommend setting the threshold p-value to 0.005 for claims of new discoveries (Benjamin et al., 2018). One drawback with this very stringent threshold for statistical significance is that it would fail to detect legitimate discoveries that by chance had not attained the more stringent p-value in an initial study. Regardless of the threshold level of p-value that is chosen, in no case is the p-value a measure of the likelihood that an experimental hypothesis is true.”

How a p-value ends up in a Bayesian framework leaves me baffled. It’s this kind of mash-up that leads to unclear findings such as the 0.005 threshold recommendation.

—

I’m wondering how researchers would know a prior probability of the experimental hypotheses?

If it is just a guess, I think then they may need a prior probability for their prior probability of the experimental hypotheses.

Justin

http://www.statisticool.com

Justin: You’re right. The cobbling together of Bayesian priors and p-values is another baffling and careless part of the report. In the end I decided just to focus on the problematic definition of P-values. I’m afraid Ioannidis, and others who imagine a “diagnostic model” of tests, are responsible for encouraging this morass–an incoherent stew of ingredients from different methodologies. In 4.5 of SIST, and earlier in the Tour, you’ll see that the Bayesian winds up giving high probability to the observed mean being equal to the population mean–in short, winds up finding strong evidence for a discrepancy from the null that would correspond to using a confidence level of .5.

Steven: Thanks for your comment. I had decided not to wade into their Bayesian morass, especially as I discuss this treatment in Statistical Inference as Severe Testing: How to get beyond the statistics wars. See Excursion 4 Tour II, p. 262

https://errorstatistics.files.wordpress.com/2019/10/excur4tourii-2019.pdf

This computation assumes two non-exhaustive hypotheses: the point null and the point alternative that is equal to the maximally likely hypothesis. The result is to give high probability to mu = x-bar even though doing so has an error probability of .5. For example, the first example they show gives a .8 posterior to mu = x-bar based on an observed x-bar = 1.65 SE. To infer mu as large as x-bar is tantamount to using a confidence level of .5. It has a corresponding severity of .5. See pp. 262-4 of SIST. https://errorstatistics.files.wordpress.com/2019/10/excur4tourii-2019.pdf

Ironically, while they do mention “or more extreme” here, they go on to ignore the tail and look at the likelihood (just at the point).
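The figures quoted from Appendix D can be reproduced with a short sketch. It rests on an assumption inferred from the quoted numbers rather than stated in the quoted passages: that the p-value is converted to a one-sided standard Normal cutoff z, and that the point null is compared with the single point alternative mu = x-bar, so that the likelihood ratio is exp(z²/2). The function names are illustrative.

```python
from math import exp
from statistics import NormalDist


def max_likelihood_ratio(p_value: float) -> float:
    """Likelihood ratio of the point alternative mu = x-bar to the point
    null, for a Normal test statistic with one-sided p-value p_value."""
    z = NormalDist().inv_cdf(1 - p_value)  # observed difference in SE units
    return exp(z * z / 2)


def posterior_odds(p_value: float, prior_prob: float) -> float:
    """Posterior odds for H1: prior odds times the likelihood ratio."""
    prior_odds = prior_prob / (1 - prior_prob)
    return prior_odds * max_likelihood_ratio(p_value)


def posterior_probability(p_value: float, prior_prob: float) -> float:
    """Convert the posterior odds for H1 into a probability."""
    odds = posterior_odds(p_value, prior_prob)
    return odds / (1 + odds)


if __name__ == "__main__":
    for p_value, prior in ((0.05, 0.3), (0.005, 0.25), (0.05, 0.5)):
        print(
            f"p={p_value}, prior={prior}: "
            f"odds={posterior_odds(p_value, prior):.3f}, "
            f"posterior={posterior_probability(p_value, prior):.3f}"
        )
```

Under these assumptions the sketch gives posterior odds of about 1.658 (a probability of about 0.62) for a prior of 0.3 with p = 0.05, and a posterior of about 0.90 for a prior of 0.25 with p = 0.005, in line with the quoted figures; a prior of 0.5 with p = 0.05 gives about 0.8. Only the likelihood at the observed point enters the calculation; the tail area is used solely to recover z.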

Steven: There is no such thing as “the prior probability of an experimental hypothesis”. There is only an assortment of assessments of how strongly so and so believes a claim, how readily they would bet on it or the like. Else there’s a default prior which is not a degree of belief and is generally not even a probability, being improper. There’s an assumption that talking of “the” prior probability makes sense and furthermore is a desirable entity to combine with the likelihood in reaching inferences. No one has ever shown this. At best invoking a prior is just a way of speaking of how well tested a claim is. But this is very different. And “likelihood” should never be used synonymously with “probability”. I hope that readers of this blog will read my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, because this “diagnostic model” of tests is plainly shown to be seriously flawed (Excursion 5).

Some edited comments originally posted on Twitter are listed below:

1. (30 Sep) On the negative side, not only do they use likelihood and probability as interchangeable terms, as a matter of speech, they also invert the meaning of reproducibility and repeatability (at least as used by industry and many other scientists) https://t.co/8xKf8taw0C?amp=1

2. (24 Sep) On the positive side they do consider generalisation of findings. An important topic that deserves more consideration. Fisher was mostly concerned with internal validation and less with external validation (which overlaps generalisation of findings). In section 20 of the Statistical Design of Experiments, he writes that randomization is a key to “validity”, highlighting the ability to get a valid estimate and to compute and quantify the size of the error in the estimate. The modern perspective needs to address the more general perspective of generalizability, see for example https://arxiv.org/abs/1808.01174

My point is that the NAS document could/should be considered a starting point for a town hall discussion eliciting and organising comments such as the ones in this blog.

Ron: On point 1, they separate reproducibility and replicability in what I think is the standard way (according to a different ASA doc). So I’m not sure what you mean.

We explain this in the Nature Methods piece I added a link to (https://www.nature.com/articles/nmeth.3489). Also, if you read my paper on generalisability of findings, it is explained there in detail https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070

Ron: I’m sorry this was stuck in moderation, and don’t know why. People who have had one comment approved are automatically allowed to comment without moderation, except in rare cases. Perhaps the 2 links trigger an automatic moderation, but I have no clue why this great big orange dot (that shows when something is in moderation) was not showing.

I do recall reading your reply and thinking that maybe you’d give a very brief explanation. I think at least some of what you place under concerns about “generalization” are in sync with the severe tester’s requirement of a report of what has been poorly probed. I think that would also avoid what you describe as similar sounding but misleading interpretations of results.

I was delighted to see your comments on the recently released NAS document. I too was quite surprised by the generally loose usage of important technical terms in a supposedly expository document authored by subject matter specialists. Most concerning for me, though, was the erroneous definition of the term “null hypothesis” (also pointed out by you). Understanding how an hypothesis is identified for testing requires, at the very least, understanding what an effect is, the implications of controlling a type I error, and what “statistically significant” will mean. To keep this comment simple, we might consider testing 1) no effect, 2) an effect smaller than some identified (perhaps regulatory) threshold, or 3) an effect at least as large as some identified threshold. For each a statistic can be calculated, a p-value defined, and a significance level identified for the probability of type I error. But the implication of type I error differs from hypothesis to hypothesis and with it the rationale for its control. In other words, understanding how a hypothesis comes to be tested amounts to understanding the proper application of statistical hypothesis testing. It is indeed unfortunate that this all too common mistake is repeated in the NAS document.

On a more humorous note: I once was invited to address some scientific colleagues on just this issue. After my presentation I was approached by one well respected member of the audience who proclaimed “you’re just talking about equivalence testing where one tests the alternative hypothesis”. I was stunned into silence and the conversation moved on in other directions.

Again thanks for your timely comments.

Bill Ross

Bill:

Thanks for your comment. I have written to the chair of the committee and received no response. If anyone has any suggestions about whom to contact at NAS, I’d be grateful.

I’m not sure why you were stunned into silence. The idea of equivalence testing, really, is in sync with power analysis, and even more so with a severity assessment. While usually used as a way to interpret non-significant results, it can also be used to set upper bounds based on significant results.

I’m afraid that the use of the point null, and the general looseness of terms, is encouraged by the ASA I and II guides.

Mayo – The above is a great example of how automation affects data and affects information quality. I did not understand why the previous blog was sequestered. The reason you gave on Twitter was only part of the story. Now I understand how Akismet works...

Regarding the topic of Reproducibility, Repeatability and Generalisability of findings. It really goes back to Fisher.

The definition I adopted is due to Chris Drummond:

“Reproducibility requires changes; replicability avoids them. A critical point of reproducing an experimental result is that irrelevant things are intentionally not replicated. One might say, one should replicate the result not the experiment.” Proc. of the Evaluation Methods for Machine Learning Workshop at the 26th ICML, Montreal, Canada, 2009.

and you must know the following quote from Fisher (1935):

“A highly standardized experiment supplies direct information only in respect of the narrow range of conditions achieved by standardization. Standardization, therefore, weakens rather than strengthens our ground for inferring a result, when, as is the case in practice, these conditions are somewhat varied.” He also sees the value in reproducibility. Reproducibility is therefore about inherent findings like the Nobel Prize winners’ achievements.

The bottom line is that the way the experiments are laid out differs according to your generalisation objective.

So the sequence is:

Set your goal —> Design your study —> Perform it —> Analyse the outcomes —> Generalise the findings

We suggest a section dedicated to Generalisations of findings is included in applied research papers. We also propose a way to structure such a section that refers to Gelman’s S type error.

The proposal applies alternative representations with a boundary of meaning distinguishing between alternatives with meaning equivalence and those with surface similarity. The BOM can be assessed with techniques of sensitivity analysis such as S-type errors or Cornfield inequality.

Consider the third key point bullet in https://physoc.onlinelibrary.wiley.com/doi/full/10.1113/JP275996

It states: “Using lineage analysis, the present study shows that the Type I cell lineage itself proliferates and expands in response to sustained hypoxia”.

An S type error would be that the Type I cell lineage actually shrinks, and does not proliferate.

We give such examples from translational medicine in https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070 which is not behind a paywall.