“There Is No Replication Crisis if We Don’t Expect Replication”.

I don’t know if the editorial paper is intended as the official ASA position, as was the case with the ASA P-value Statement (2016). There’s a definite danger in encouraging a view that statistics embraces postmodernism, radical skepticism, or scientific anarchy.

In Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018), I recommend that researchers seek to falsify certain types of claims (claims about measurements and experiments) when a purported effect repeatedly fails to be replicated and survives only by propping results up with ad hockeries.

What I mean by a ‘true result’ in a random sampling process is the result (mean or proportion) obtained after a very large or infinitely large number of observations. It is also important to understand that there are two kinds of prior probability being used in my paper (i.e. Llewelyn H (2019) Replacing P-values with frequentist posterior probabilities of replication—When possible parameter values must have uniform marginal prior probabilities. PLoS ONE 14(2): e0212302. https://doi.org/10.1371/journal.pone.0212302).

The first type is the ‘natural’ prior (e.g. a Bayesian prior) that you are talking about, which is beset with problems when an attempt is made to use it as a default uniform prior, as exemplified by the transformation problem you point out.

The second type of prior probability that I use in my paper is quite different. It is imposed on the data rather like placing a frame around a picture. The universal set on which the ‘natural’ non-uniform prior is based then becomes a subset (inside the ‘frame’) of this ‘imposed’ artificial universal set of parameters with uniform prior probabilities. This means that the odds of the non-uniform ‘natural’ individual prior probabilities within the ‘frame’ become equal to the likelihood ratio of the corresponding likelihood distribution. I give a more detailed explanation in the OSF supplement to my paper: https://osf.io/s6qgy/.

Once this is done it becomes apparent that if a P-value calculation is based on a symmetrical distribution (e.g. a Gaussian distribution), it is equal to the posterior probability that the ‘true’ result is the null hypothesis or something more extreme relative to the observed result (which could be a proportion or mean). This range of ‘true’ results after making a very large number of observations does not contain the observed result and therefore fails to replicate it. The ‘complement’ of this range (i.e. all true results less extreme than the null hypothesis) does contain the observed result and thus replicates it in the long run. The probability of replication within this range is therefore 1 − P. However, I called this the ‘idealistic’ probability of replication, as it assumes impeccable methodology. If no fault can be found after a rigorous examination (Mayo calls it ‘severe testing’), it can be regarded as a ‘realistic’ probability of replication.
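On the stated assumptions of a Gaussian sampling distribution and a uniform (‘framed’) prior, the numerical identity claimed here can be checked directly. The sketch below uses invented numbers (null mean 0, observed mean 0.8, known SD 1, n = 4), not figures from the paper:

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Illustrative numbers (assumed for the sketch, not taken from the paper):
mu0, xbar, sigma, n = 0.0, 0.8, 1.0, 4
se = sigma / math.sqrt(n)

# One-sided P-value: probability of the observed mean or something more
# extreme if the null hypothesis mu0 were the 'true' result.
p_value = 1 - phi((xbar - mu0) / se)

# Posterior probability that the 'true' mean is mu0 or more extreme, under a
# uniform prior on mu (the 'frame'), computed by normalising the likelihood
# over a wide grid of possible parameter values.
grid = [-10 + k * 0.001 for k in range(20001)]
lik = [math.exp(-0.5 * ((xbar - mu) / se) ** 2) for mu in grid]
posterior_tail = sum(l for mu, l in zip(grid, lik) if mu <= mu0) / sum(lik)

print(round(p_value, 3), round(posterior_tail, 3))  # the two agree
```

The identity relies on the symmetry of the Gaussian; with a skewed likelihood the one-sided P-value and the uniform-prior posterior tail probability need not coincide.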

All this allows the P-value to be understood in a logical way, with a reasoning process similar to that advocated by Bayesians. I think this explains why the P-value is intuitively useful. The P-value itself, however, is an arbitrary index and not even a probability (not a likelihood, a prior probability, or a posterior probability). It merely has the superficial appearance of a probability, which leads to endless confusion and errors. Many think, erroneously, that it is a false positive rate (or 1 − specificity) to be used with ‘lump’ prior probabilities. I think its real strength is that it is a measure of non-replication, as outlined above.
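A small simulation can illustrate the distinction drawn here between a P-value and a false positive rate among ‘significant’ findings. All the numbers (a 50% ‘lump’ prior on the null, effect size 0.3, n = 25) are assumptions chosen for the sketch:

```python
import math
import random

random.seed(7)

n_sims, n, sigma, effect = 100_000, 25, 1.0, 0.3
se = sigma / math.sqrt(n)

false_pos = true_pos = 0
for _ in range(n_sims):
    null_true = random.random() < 0.5          # 'lump' prior: 50% of effects null
    mu = 0.0 if null_true else effect
    xbar = random.gauss(mu, se)
    p = 1 - 0.5 * (1 + math.erf((xbar / se) / math.sqrt(2)))  # one-sided P-value
    if p < 0.05:
        if null_true:
            false_pos += 1
        else:
            true_pos += 1

# Among 'significant' results, the proportion that are actually null depends
# on the lump prior and the power -- it is not the P-value threshold of 0.05.
fdr = false_pos / (false_pos + true_pos)
print(round(fdr, 2))  # noticeably larger than 0.05
```

With these assumed inputs, roughly one in ten ‘significant’ results comes from a true null, which is one way of seeing why a P-value of 0.05 is not a 5% false positive rate.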

Huw

https://www.nap.edu/catalog/6024/science-and-creationism-a-view-from-the-national-academy-of

“Science is a particular way of knowing about the world. In science, explanations are limited to those based on observations and experiments that can be substantiated by other scientists. Explanations that cannot be based on empirical evidence are not a part of science.

In the quest for understanding, science involves a great deal of careful observation that eventually produces an elaborate written description of the natural world. Scientists communicate their findings and conclusions to other scientists through publications, talks at conferences, hallway conversations, and many other means. Other scientists then test those ideas and build on preexisting work. In this way, the accuracy and sophistication of descriptions of the natural world tend to increase with time, as subsequent generations of scientists correct and extend the work done by their predecessors.”

I still regard statistics, even in its current incomplete form, as an invaluable cornerstone for the scientific method, as a technique for understanding the rate at which we make erroneous conclusions as we continuously re-evaluate previous assertions in our quest to correct and extend the work of our predecessors. Statistical conclusions are not always correct, but error statistical methods allow us to understand and manage the rate at which we make errors.
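A minimal sketch of what ‘managing the rate at which we make errors’ means in practice (the sample size and number of simulated studies below are arbitrary): when every null hypothesis is true and the test is valid, rejections at the 0.05 level occur about 5% of the time, and that rate is known in advance.

```python
import math
import random

random.seed(0)

n_studies, n = 20_000, 30   # arbitrary numbers for the sketch

def one_sided_p(xbar, sigma, n, mu0=0.0):
    """One-sided P-value from a known-sigma z-test of H0: mu = mu0."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

rejections = 0
for _ in range(n_studies):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]   # the null is true
    if one_sided_p(sum(sample) / n, 1.0, n) < 0.05:
        rejections += 1

error_rate = rejections / n_studies
print(round(error_rate, 3))  # close to the nominal 0.05
```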

When I read (with considerable dismay) the statement

” . . . generalizations from single studies are rarely if ever warranted.”

I wonder if the authors of this paper

https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1543137

“Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis if We Don’t Expect Replication”

have read the NAS proffered definition of science, and if not, what on earth their definition of science is.

I hope that Sander Greenland, a frequent contributor at this blog site, who offered this excellent discussion about statistical methods in the same journal issue

https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529625

“Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values”

can discuss why he penned his name alongside Amrhein and Trafimow in making such contraindicated assertions about the scientific method and the statistical assessment of repeatable phenomena. If a single study is not generalizable, its report is an anecdote and not a part of the corpus of science.

I found your blog quite by accident but have profited from the exposure, and it has helped me become at least a voice encouraging my colleagues to be more careful about what they assert to have verified. Thank you for that.

In writing recently about some of the difficulties of venturing into the swamp, I was besieged by one reviewer who denied that anything related to statistical analysis could be learned from philosophy, or even from the practice of medicine, among other fields. While I am far from mastering even the basics, I am now aware of the overconfidence so easily expressed in my colleagues’ application of statistical analysis. No doubt the situation exists elsewhere as well. As you observe, even those most knowledgeable are still debating fundamental issues.

On a different note, my father-in-law, Dan Pletta, taught in the engineering school at VPI for forty years before he died in 1997. I got to know Blacksburg quite well. It has certainly changed since I knew it in the 1950s. Best wishes and keep up the good work.

No need to post this comment.

I have a paper on arXiv that explores the issue in full: https://arxiv.org/abs/1507.08394

I have read pages 14 to 16 as suggested. This is a familiar problem to someone like me who has spent years trying to interpret patients’ weights in clinics! You talk about three weighing scales with similar precision. You are implying that, from single measurements on three weighing scales without holding books, after holding books, and on return from the UK (3×3 = 9 single measurements), if in each of the 9 situations you had weighed yourself a large number of times, you would get a mean weight with a small standard deviation and be able to plot the narrow distribution of your weights. High precision means that when the weighing is performed repeatedly, the probability of replicating a weight within a narrow range is high.

By doing this for the three weighing scales and getting the same mean for each, you would think it improbable that there was bias (unless all three were biased in the same way), so that the three were therefore probably accurate (i.e. not biased) as well as precise. You also establish that the accuracy applies to differences in weight, because your weight increased by an extra 3 pounds when you were carrying three books weighing 3 lb (as measured on a fourth weighing machine, presumably). In order to avoid measurement bias due to poor operator methodology (not the fault of the machine), it would be important also to wear the same items of clothing at home and in the medical centre, not to change them randomly when testing the weighing machine, and to weigh at the same time of day and in the same relation to meals at each of the 9 weighing sessions. This is analogous to going through a checklist as part of severe testing to consider whether a scientific study was conducted in a consistent way if it were to be repeated as described.
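The reasoning about precision and bias in this passage can be sketched with a toy simulation; the true weight, biases, and standard deviations below are invented for illustration:

```python
import random
import statistics

random.seed(1)

TRUE_WEIGHT = 170.0   # invented 'true' body weight in lb
N_REPEATS = 5_000     # many repeated weighings per scale

def weigh(true_weight, bias, sd, n):
    """Simulate n readings from a scale with fixed bias and random error sd."""
    return [random.gauss(true_weight + bias, sd) for _ in range(n)]

# Three hypothetical scales: all precise (small sd), but one biased.
results = {}
for name, bias, sd in [("scale A", 0.0, 0.2),
                       ("scale B", 0.0, 0.2),
                       ("scale C", 1.5, 0.2)]:
    readings = weigh(TRUE_WEIGHT, bias, sd, N_REPEATS)
    results[name] = (statistics.mean(readings), statistics.stdev(readings))
    print(name, round(results[name][0], 1), round(results[name][1], 2))

# A and B agree on the mean, so bias is improbable (unless both err alike);
# C is just as precise (same small sd) yet systematically off by its bias.
```

Precision here is the small spread of repeated readings on one scale; it is the agreement of the means across independent scales that makes bias improbable.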

When you return from the UK, you weigh an extra 4.c lb (c being a constant >0 and <1 that you did not specify). Your scientific hypothesis was that your body weight has increased. However, this scientific inference needs to undergo severe testing by excluding the rival hypotheses: e.g. that, having left the USA in the summer and perhaps returned in the winter, you were not wearing heavier clothing or had not forgotten to take heavy boots off, etc., when using each of the 3 weighing scales at home and at the medical centre after your return. I assume that you would have taken care to wear the same clothing etc. to exclude this possibility as part of severe testing of your hypothesis that your body mass had increased! There are other hypotheses too (e.g. was the increased weight due to increased body fat or fluid, etc.). These are considerations that I have had to make often in medical clinics!

This process of assessing the precision of the weighing scales (e.g. if you had measured your weight many times on each scale instead of once) is the same process that I followed in the PLOS ONE paper, which instead addresses the probability of replication of study data. The realistic probability of study replication would also depend on evidence for the absence of probable methodological inconsistencies that would reduce precision and increase bias. This is a point that you have made strongly by applying severe testing, according to my understanding. Am I right in thinking this?

Anyone who denies the relevance of the small but important roles P-values play should absolutely stop using them; to reject them and then retain them as useful is disingenuous. If those who reject the role of P-values would just stop using them, they could stop torturing everyone else with rules they don’t really mean, and stop wrecking a key tool for the preliminary analysis of drugs and environmental risks.