Yes Pearson seemed to suffer a lot: through Fisher’s criticisms of his “God” (his father), Fisher slighting him in favor of Jerzy, and his father’s domineering ways. One day, he was “smitten” with the woman his cousin was to marry, and she with him, but his father tortured him so (claiming, even after the 2 yrs Egon gave his cousin to win her back, that everyone would always think he stole his cousin’s bride-to-be) that he gave her up. Remember the day I discovered it was apples and not blackcurrants (being grown in the plot kept by his cousin)?

https://errorstatistics.com/2016/08/18/history-of-statistics-sleuths-out-there-ideas-came-into-my-head-as-i-sat-on-a-gate-overlooking-an-experimental-blackcurrant-plot-no-wait-it-was-apples-probably/

Found this one: Douglas G. Altman and Jonathan J. Deeks, BMC Medical Research Methodology, 2002.

I was working with them at the time, and their primary concern was to stop folks from collapsing all the studies and analyzing the data as if they were a single study.

I might have actually suggested some of this wording “Given the choice between a method that always gives a right answer and a method that sometimes or even usually gives the right answer, it is common sense to use the one that always gives the right answer.”

However, I could never convince either of them that it would ever be worthwhile going beyond their view of meta-analysis as simply a weighted average of statistics reported in studies. Binary outcomes are unusual in that group totals are sufficient under the usual assumptions and a number of effect measures are immediately available.
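The "weighted average of statistics reported in studies" view corresponds to the standard fixed-effect inverse-variance method. A minimal sketch in Python of pooling log odds ratios computed from group totals alone (the 2x2 counts below are entirely made up for illustration):

```python
import math

# Hypothetical 2x2 counts per study: (events_trt, n_trt, events_ctl, n_ctl)
studies = [(12, 100, 20, 100), (8, 50, 15, 50), (30, 200, 44, 200)]

weights, estimates = [], []
for a, n1, c, n2 in studies:
    b, d = n1 - a, n2 - c
    log_or = math.log(a * d / (b * c))   # log odds ratio from group totals
    var = 1 / a + 1 / b + 1 / c + 1 / d  # Woolf's variance estimate
    estimates.append(log_or)
    weights.append(1 / var)              # inverse-variance weight

pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
se = math.sqrt(1 / sum(weights))
print(math.exp(pooled),
      math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se))
```

This is exactly the sense in which group totals are sufficient for binary outcomes: nothing beyond the four cell counts per study enters the calculation.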

Now, I had spent almost a year working on various versions of “Bayesian random effects meta-analysis of trials with binary outcomes: methods for the absolute risk difference and relative risk scales by D. E. Warn, S. G. Thompson and D. J. Spiegelhalter, Statistics in Medicine 2002” by Keith O’Rourke and Douglas G. Altman, as an attempt to convince Doug otherwise.

I do not think I was successful, especially since the editor did not decide to accept it until August 2005, so it likely fell off his list of things to think about.

Now, not everyone may be aware that Doug Altman recently passed away https://en.wikipedia.org/wiki/Doug_Altman

Greenland, S. (1999). The relation of the probability of causation to the relative risk and the doubling dose: a methodologic error that has become a social problem. American Journal of Public Health, 89, 1166-1169.

Hutton, J. L. (2000). Number needed to treat: properties and problems. Journal of the Royal Statistical Society: Series A (Statistics in Society), 163(3), 381–402. https://doi.org/10.1111/1467-985X.00175

The diagnostic screeners and “positive predictive value” advocates in significance testing likewise endorse the measure of how many tests you’d have to run at their favorite significance level to get one additional replication (by their measurement). But no such improvement of replication would occur.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.195.3268&rep=rep1&type=pdf

“Practical Bayesian Data Analysis from a Former Frequentist”

Frank E Harrell Jr

Division of Biostatistics and Epidemiology

Department of Health Evaluation Sciences

University of Virginia School of Medicine

MASTERING STATISTICAL ISSUES IN DRUG DEVELOPMENT

HENRY STEWART CONFERENCE STUDIES

15-16 MAY 2000

Pages 24-25 of document:

> “Much controversy about need for adjusting for sequential testing. Frequentist approach is complicated.”

Well we can’t have that. Heaven forfend that there are any complexities. By definition I suppose, Bayesian approaches are not complicated.

> “Example: 5 looks at data as trial proceeds. Looks had no effect, trial proceeded to end. Usual P = 0.04, need to adjust upwards for having looked”

How do looks have no effect? If looks have no effect, why do we look at all?

Of course looks have an effect. That’s precisely why many statisticians have worked on sequential methods over many years.
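The effect of looks is easy to check by simulation. A sketch (the five equally spaced looks, the batch size, and the normal data are my assumptions, not from the slides): testing at a nominal two-sided 0.05 level at each interim look inflates the overall type I error well above 0.05.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, looks, per_look = 20000, 5, 20
z_crit = 1.96  # nominal two-sided 5% critical value at every look
rejected = 0
for _ in range(n_trials):
    x = rng.standard_normal(looks * per_look)  # null is true: mean 0, sd 1
    for k in range(1, looks + 1):
        n = k * per_look
        z = x[:n].mean() * np.sqrt(n)  # cumulative z-statistic, known sd
        if abs(z) > z_crit:
            rejected += 1  # "significant" at some look
            break
print(rejected / n_trials)  # roughly 0.14, far above the nominal 0.05
```

The classic analytic value for five equally spaced looks at nominal 0.05 is about 0.142, which is why group-sequential boundaries (Pocock, O'Brien-Fleming) exist at all.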

> “Two studies with identical experiments and data but with investigators with different intentions: one might claim ‘significance’, the other not (Berry10). Example: one investigator may treat an interim analysis as a final analysis, another may intend to wait.”

There is nothing wrong with two different investigators with differing intentions deriving differing conclusions from the same body of data. Analysis findings are context dependent.

> “It gets worse — need to adjust ‘final’ point estimates for having done interim analyses”

I can understand adjusting final confidence interval endpoints for having done interim analyses, but I have yet to come across the scenario that a point estimate needed to be adjusted. I’m happy to be informed here of point estimate adjustment procedures that I have not yet heard of.

> “Freedman et al.36 give example where such adjustment yields 0.95 CI that includes 0.0 even for data indicating that study should be stopped at the first interim analysis”

> “As frequentist methods use intentions (e.g., stopping rule), they are not fully objective8. If the investigator died after reporting the data but before reporting the design of the experiment, it would be impossible to calculate a P–value or other standard measures of evidence.”

Of course any reasonable investigator reports the design of the experiment before collecting data. That’s why we have, e.g., the clinicaltrials.gov site: so designs can be reported before the data are collected and before any investigator dies. How this is not objective mystifies me.

> “Since P–values are probabilities of obtaining a result as or more extreme than the study’s result under repeated experimentation, frequentists interpret results by inferring ‘what would have occurred following results that were not observed at analyses that were never performed’ 29.”

Science is about studying repeated phenomena. We infer many things concerning results not observed at analyses never performed. We infer the results of a coin toss without observing all coins and all tosses of those coins. Of course, we could all be mightily surprised to find that, starting tomorrow, all coin tosses land heads up, and our old binomial coin-toss examples about fairness and 50/50 outcomes are no longer of any use. If the sun comes up, that is. But it hasn’t happened yet, and today the sun shines here in Vancouver, which is a bit odd but not impossible. These phenomena happen repeatedly, even though we have yet to observe them all, which is why frequentist methods have proven so useful in the scientific study of natural phenomena. So when I perform analyses, I intend to continue interpreting results, given the results observed, using the useful tool of inferring what might occur following results that were not observed at analyses that were never performed.

Hello Dr. Mayo,

The examples arguing against the relevance of intentions, as part of an argument against frequentist inference as a whole, are:

#1 Wagenmakers EJ., Lee M., Lodewyckx T., Iverson G.J. (2008) “Bayesian Versus Frequentist Inference.” In: Hoijtink H., Klugkist I., Boelen P.A. (eds) Bayesian Evaluation of Informative Hypotheses. Statistics for Social and Behavioral Sciences. Springer, New York, NY

DOI https://doi.org/10.1007/978-0-387-09612-4_9 Print ISBN 978-0-387-09611-7 Online ISBN 978-0-387-09612-4

“2.3 Frequentist Inference Depends on the Intention With Which the Data Were Collected

Because p-values are calculated over the sample space, changes in the sample space can greatly affect the p-value. For instance, assume that a participant answers a series of 17 test questions of equal difficulty; 13 answers are correct, 4 are incorrect, and the last question was answered incorrectly. Under the standard binomial sampling plan (i.e., “ask 17 questions”), the two-sided p-value is .049. The data are, however, also consistent with a negative binomial sampling plan (i.e., “keep on asking questions until the fourth error occurs”). Under this alternative sampling plan, the experiment could have been finished after four questions, or after a million. For this sampling plan, the p-value is .021.

What this simple example shows is that the intention of the researcher affects statistical inference – the data are consistent with both sampling plans, yet the p-value differs. Berger and Wolpert ([14, pages 30-33]) discuss the resulting counterintuitive consequences through a story involving a naive scientist and a frequentist statistician.”

(the example is interesting as it deals with both researcher intent and external factors, such as a grant being extended or not, but I thought it is too long to include. The chapter is available for free online at http://www.ejwagenmakers.com/2008/BayesFreqBook.pdf )

I think this is a classic case of trying to portray reliance on the sample space as absurd, as if it were somehow subjective, locked inside the scientist’s mind, and therefore could not possibly be a legitimate consideration. It goes hand in hand with the argument that the observed data are the only thing that matters, while leaving out that an integral part of what “data” is is the method through which the numbers were obtained.
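Incidentally, the quoted numbers are easy to reproduce. A quick check using scipy (assuming a point null of 0.5 and two-sided values obtained by doubling the relevant tail, which appears to be the chapter’s convention):

```python
from scipy.stats import binom

# 13 correct out of 17, with the 17th answer incorrect; point null theta = 0.5
# Binomial plan ("ask 17 questions"): double the upper tail P(X >= 13)
p_binom = 2 * binom.sf(12, 17, 0.5)
# Negative binomial plan ("ask until the 4th error"): the 4th error landing on
# question 17 or later means at most 3 errors among the first 16 questions
p_negbinom = 2 * binom.cdf(3, 16, 0.5)
print(round(p_binom, 3), round(p_negbinom, 3))  # 0.049 0.021
```

Same 13-correct, 4-incorrect data, two sampling plans, two p-values, which is exactly the behavior the chapter presents as a defect and the sampling-plan view treats as a feature.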

#2 “An Introduction to Bayesian Hypothesis Testing for Management Research”, Sandra Andraszewicz, Benjamin Scheibehenne, Jörg Rieskamp, Raoul Grasman, Josine Verhagen, and Eric-Jan Wagenmakers, Journal of Management, Vol 41, Issue 2, pp. 521 – 543, December 10, 2014, https://doi.org/10.1177/0149206314560412

A host of “criticisms” of p-values here, p-values depending on intent among them:

“Unfortunately, p values have a number of serious logical and statistical limitations (e.g., Wagenmakers, 2007). In particular, p values cannot quantify evidence in favor of a null hypothesis (e.g., Gallistel, 2009; Rouder, Speckman, Sun, Morey, & Iverson, 2009), they overstate the evidence against the null hypothesis (e.g., Berger & Delampady, 1987; Edwards, Lindman, & Savage, 1963; Johnson, 2013; Sellke, Bayarri, & Berger, 2001), and they depend on the sampling plan, that is, they depend on the intention with which the data were collected; consequently, identical data may yield different p values (Berger & Wolpert, 1988; Lindley, 1993; a concrete example is given below).

Bayesian hypothesis testing using Bayes factors provides a useful alternative to overcome these problems (e.g., Jeffreys, 1961; Kass & Raftery, 1995). Bayes factors quantify the support that the data provide for one hypothesis over another; thus, they allow researchers to quantify evidence for any hypothesis (including the null) and monitor this evidence as the data accumulate. In Bayesian inference, the intention with which the data are collected is irrelevant (Rouder, 2014). As will be apparent later, inference using p values can differ dramatically from inference using Bayes factors. Our main suggestion is that such differences should be acknowledged rather than ignored.”

Then the authors go into more detail giving an example where p-values are inferior to Bayes factors due to issues related to reflecting researcher intent, in particular, in a continuous monitoring scenario:

“An additional advantage is that, in contrast to the p value, the Bayes factor is not affected by the sampling plan, or the intention with which the data were collected”

[…]

“For Bayes factors, in contrast, the sampling plan is irrelevant to inference (as dictated by the stopping rule principle; Berger & Wolpert, 1988; Rouder, 2014). This means that researchers can monitor the evidence (i.e., the Bayes factor) as the data come in and terminate data collection whenever they like, such as when the evidence is deemed sufficiently compelling or when the researcher has run out of resources.”

Here we see the simple mistake of treating as a positive the fact that Bayes factors are not altered to accommodate what is known about the sampling procedure, while treating as a negative the need to alter the p-value calculation to accommodate that same knowledge.
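That mistake has consequences that can be simulated. A sketch under my own assumptions (a fair-coin point null, a Beta(1,1) prior under the alternative, and a stopping threshold of BF10 > 3, none of which come from the paper): monitoring the Bayes factor and quitting the moment the evidence looks “sufficiently compelling” produces such evidence against a true null far more often than any fixed-n analysis would.

```python
import numpy as np
from math import lgamma, log

def log_bf10(k, n):
    # BF10 for k heads in n flips: point null theta = 0.5 versus a Beta(1,1)
    # prior on theta; the marginal likelihood under H1 is B(k+1, n-k+1)
    return lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2) + n * log(2)

rng = np.random.default_rng(1)
trials, n_max, hits = 4000, 400, 0
for _ in range(trials):
    flips = rng.integers(0, 2, n_max)  # data generated under the true null
    k = 0
    for n in range(1, n_max + 1):
        k += flips[n - 1]
        if log_bf10(k, n) > log(3):  # optional stopping at "evidence" for H1
            hits += 1
            break
print(hits / trials)  # fraction of true-null runs stopped with BF10 > 3
```

The rate comes out well above a tenth here (four identical flips in a row already push BF10 past 3), though it is capped at 1/3 by Ville’s inequality, since BF10 is a nonnegative martingale with mean 1 under the null. Whether that cap makes the behavior acceptable is, of course, exactly what is in dispute.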

Best regards,

Georgi

As for the counterintuitive examples, such as taking account of the possibility that the instrument broke down or the like, there is NO error statistical justification for doing so! The probative value of the test is not influenced in the least. Please check my blog and published papers for more on this.