http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.195.3268&rep=rep1&type=pdf

“Practical Bayesian Data Analysis from a Former Frequentist”

Frank E Harrell Jr

Division of Biostatistics and Epidemiology

Department of Health Evaluation Sciences

University of Virginia School of Medicine

MASTERING STATISTICAL ISSUES IN DRUG DEVELOPMENT

HENRY STEWART CONFERENCE STUDIES

15-16 MAY 2000

Pages 24-25 of document:

> “Much controversy about need for adjusting for sequential testing. Frequentist approach is complicated.”

Well we can’t have that. Heaven forfend that there are any complexities. By definition I suppose, Bayesian approaches are not complicated.

> “Example: 5 looks at data as trial proceeds. Looks had no effect, trial proceeded to end. Usual P = 0.04, need to adjust upwards for having looked”

How do looks have no effect? If looks have no effect, why do we look at all?

Of course looks have an effect. That’s precisely why many statisticians have worked on sequential methods over many years.

> “Two studies with identical experiments and data but with investigators with different intentions: one might claim ‘significance’, the other not (Berry [10]). Example: one investigator may treat an interim analysis as a final analysis, another may intend to wait.”

There is nothing wrong with two different investigators with differing intentions deriving differing conclusions from the same body of data. Analysis findings are context dependent.

> “It gets worse — need to adjust ‘final’ point estimates for having done interim analyses”

I can understand adjusting final confidence interval endpoints for having done interim analyses, but I have yet to come across the scenario that a point estimate needed to be adjusted. I’m happy to be informed here of point estimate adjustment procedures that I have not yet heard of.

> “Freedman et al. [36] give example where such adjustment yields 0.95 CI that includes 0.0 even for data indicating that study should be stopped at the first interim analysis”

> “As frequentist methods use intentions (e.g., stopping rule), they are not fully objective [8]. If the investigator died after reporting the data but before reporting the design of the experiment, it would be impossible to calculate a P–value or other standard measures of evidence.”

Of course any reasonable investigator reports the design of the experiment before collecting data. That’s why we have e.g. the clinicaltrials.gov site – so designs are on the record before the data are collected, whether or not the investigator subsequently dies. How this is not objective mystifies me.

> “Since P–values are probabilities of obtaining a result as or more extreme than the study’s result under repeated experimentation, frequentists interpret results by inferring ‘what would have occurred following results that were not observed at analyses that were never performed’ [29].”

Science is about studying repeated phenomena. We infer many things concerning results not observed at analyses never performed. We infer the results of a coin toss without observing all coins and all tosses of those coins. Of course, we could all be mightily surprised to find that from tomorrow onward all coin tosses land heads up, and that our old binomial coin-toss examples about fairness and 50/50 outcomes are no longer of any use – if the sun comes up at all, that is. But it hasn’t happened yet, and today the sun shines here in Vancouver, which is a bit odd but not impossible. These phenomena happen repeatedly, even though we have yet to observe them all, which is why frequentist methods have proven so useful in the scientific study of natural phenomena. So when I perform analyses I intend to continue interpreting results, given the results observed, with the useful tool of inferring what might occur following results that were not observed at analyses that were never performed.

Hello Dr. Mayo,

The examples that invoke the relevance of intentions as part of an argument against frequentist inference as a whole are:

#1 Wagenmakers EJ., Lee M., Lodewyckx T., Iverson G.J. (2008) “Bayesian Versus Frequentist Inference.” In: Hoijtink H., Klugkist I., Boelen P.A. (eds) Bayesian Evaluation of Informative Hypotheses. Statistics for Social and Behavioral Sciences. Springer, New York, NY

DOI https://doi.org/10.1007/978-0-387-09612-4_9 Print ISBN 978-0-387-09611-7 Online ISBN 978-0-387-09612-4

“2.3 Frequentist Inference Depends on the Intention With Which the Data Were Collected

Because p-values are calculated over the sample space, changes in the sample space can greatly affect the p-value. For instance, assume that a participant answers a series of 17 test questions of equal difficulty; 13 answers are correct, 4 are incorrect, and the last question was answered incorrectly. Under the standard binomial sampling plan (i.e., “ask 17 questions”), the two-sided p-value is .049. The data are, however, also consistent with a negative binomial sampling plan (i.e., “keep on asking questions until the fourth error occurs”). Under this alternative sampling plan, the experiment could have been finished after four questions, or after a million. For this sampling plan, the p-value is .021.

What this simple example shows is that the intention of the researcher affects statistical inference – the data are consistent with both sampling plans, yet the p-value differs. Berger and Wolpert ([14, pages 30–33]) discuss the resulting counterintuitive consequences through a story involving a naive scientist and a frequentist statistician.”

(The example is interesting as it deals with both researcher intent and external factors, such as a grant being extended or not, but I thought it too long to include. The chapter is available for free online at http://www.ejwagenmakers.com/2008/BayesFreqBook.pdf )

I think this is a classical case that tries to portray relying on the sample space as absurd, as if it were somehow subjective, locked inside the scientist’s mind, and therefore could not possibly be a legitimate consideration. It goes hand-in-hand with the argument that the observed data are the only thing that matters, while leaving out that an integral part of what “data” is is the method through which the numbers were obtained.
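The arithmetic behind the quoted .049 versus .021 can be checked directly. A stdlib-only sketch of the two calculations (my own, following the numbers in the quote):

```python
from math import comb

# 17 questions, 13 correct, 4 errors, last answer an error; null: P(correct) = 1/2.

# Fixed-N (binomial) plan: two-sided p = 2 * P(X >= 13), X ~ Binomial(17, 1/2).
p_binomial = 2 * sum(comb(17, k) for k in range(13, 18)) / 2**17

# Stop-at-4th-error (negative binomial) plan: needing N >= 17 trials means
# at most 3 errors occurred in the first 16 trials; double the tail for two sides.
p_neg_binomial = 2 * sum(comb(16, e) for e in range(0, 4)) / 2**16

print(round(p_binomial, 3), round(p_neg_binomial, 3))  # 0.049 0.021
```

Same 13-of-17 data, two sampling plans, two p-values, matching the chapter's figures.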

#2 “An Introduction to Bayesian Hypothesis Testing for Management Research”, Sandra Andraszewicz, Benjamin Scheibehenne, Jörg Rieskamp, Raoul Grasman, Josine Verhagen, and Eric-Jan Wagenmakers, Journal of Management, Vol 41, Issue 2, pp. 521 – 543, December 10, 2014, https://doi.org/10.1177/0149206314560412

A host of “criticisms” against p-values appears here, p-values depending on intent among them:

“Unfortunately, p values have a number of serious logical and statistical limitations (e.g., Wagenmakers, 2007). In particular, p values cannot quantify evidence in favor of a null hypothesis (e.g., Gallistel, 2009; Rouder, Speckman, Sun, Morey, & Iverson, 2009), they overstate the evidence against the null hypothesis (e.g., Berger & Delampady, 1987; Edwards, Lindman, & Savage, 1963; Johnson, 2013; Sellke, Bayarri, & Berger, 2001), and they depend on the sampling plan, that is, they depend on the intention with which the data were collected; consequently, identical data may yield different p values (Berger & Wolpert, 1988; Lindley, 1993; a concrete example is given below).

Bayesian hypothesis testing using Bayes factors provides a useful alternative to overcome these problems (e.g., Jeffreys, 1961; Kass & Raftery, 1995). Bayes factors quantify the support that the data provide for one hypothesis over another; thus, they allow researchers to quantify evidence for any hypothesis (including the null) and monitor this evidence as the data accumulate. In Bayesian inference, the intention with which the data are collected is irrelevant (Rouder, 2014). As will be apparent later, inference using p values can differ dramatically from inference using Bayes factors. Our main suggestion is that such differences should be acknowledged rather than ignored.”

Then the authors go into more detail, giving an example where p-values are inferior to Bayes factors due to issues related to reflecting researcher intent, in particular in a continuous monitoring scenario:

“An additional advantage is that, in contrast to the p value, the Bayes factor is not affected by the sampling plan, or the intention with which the data were collected”

[…]

“For Bayes factors, in contrast, the sampling plan is irrelevant to inference (as dictated by the stopping rule principle; Berger & Wolpert, 1988; Rouder, 2014). This means that researchers can monitor the evidence (i.e., the Bayes factor) as the data come in and terminate data collection whenever they like, such as when the evidence is deemed sufficiently compelling or when the researcher has run out of resources.”

Here we see the simple mistake of treating it as a positive that Bayes factors are not altered to accommodate what is known about the sampling procedure, while treating as a negative the need to alter the p-value calculation to accommodate exactly that knowledge.
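For the record, the invariance the authors claim is easy to verify on the 13-of-17 example above: the sampling-plan constants cancel out of the Bayes factor, so BF01 is identical under both plans. This is my own sketch, not a computation from the paper, and the uniform prior on the success probability under H1 is an illustrative assumption.

```python
from math import comb, factorial

# Data: 13 correct, 4 errors in 17 trials; H0: theta = 1/2; H1: theta ~ Uniform(0, 1).
# Marginal likelihood of the sequence under H1:
# integral of theta^13 * (1 - theta)^4 dtheta = B(14, 5) = 13! * 4! / 18!.
beta_14_5 = factorial(13) * factorial(4) / factorial(18)

# Fixed-N (binomial) plan: the likelihood carries comb(17, 13).
bf01_binomial = (comb(17, 13) * 0.5**17) / (comb(17, 13) * beta_14_5)

# Stop-at-4th-error (negative binomial) plan: the likelihood carries comb(16, 3).
bf01_neg_binomial = (comb(16, 3) * 0.5**17) / (comb(16, 3) * beta_14_5)

# The plan-specific constants cancel in each ratio, so the two agree
# (up to floating-point rounding), at about 0.3268 in favour of H1.
print(round(bf01_binomial, 4), round(bf01_neg_binomial, 4))
```

Which of course is exactly the point in dispute: the p-value registers the stopping rule, the Bayes factor by construction cannot.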

Best regards,

Georgi

As for the counterintuitive examples, such as taking account of the possibility the instrument broke down or the like, there is NO error statistical justification for doing so! The probative value of the test is not influenced in the least. Please check my blog and published papers for more on this.

Simple requirement of being a scientist that would disqualify many data scientists: Being willing to refuse a project when the design or available data cannot meet the project's goals (this would perhaps disqualify many translational biomedical "scientists" and epidemiologists).

— Frank Harrell (@f2harrell) June 9, 2018


This is beyond troubling about data science https://t.co/q8rmxhWiYi

— Frank Harrell (@f2harrell) June 9, 2018


Paper:

Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estimation and model comparison. Perspectives on Psychological Science, 6(3), 299–312.

Link:

http://www.jstor.org/stable/41613499

Quote:

Unfortunately for NHST, the p value is ill-defined. The conventional NHST analysis assumes that the sample size N is fixed, and therefore repeating the experiment means generating simulated data based on the null value of the parameter over and over, with N = 47 each time. But the data do not tell us that the intention of the experimenter was to stop when N = 47. The data contain merely the information that z = 32 and N = 47, because we assume that the result of every trial is independent of other trials. The data collector may have intended to stop when the 32nd success was achieved, and it happened to take 47 trials to do that. In this case, the p value is computed by generating simulated data based on the null value of the parameter with z = 32 each time and with N varying from one sample to another. … There are many other stopping rules that could have generated the data… It is wrong to speak of “the” p value for a set of data, because any set of data has many different p values depending on the intent of the experimenter. According to NHST … we must know when the data collector intended to stop data collection, even though we also assume that the data are completely insulated from the researcher’s intention.

Opinion:

I always struggled a bit with this. On the one hand, it seems obvious that we should care about intentions, and that intentions should matter for our inference. Hearing that a person found a significant result after looking at a single variable, declared before data collection, is much more impressive than finding the exact same significant result after looking at 200 other variables post hoc. So here, clearly, intentions are important and are needed for valid inference. On the other hand, some counter-examples make this sound downright silly. Imagine a researcher who has the intention to sample N = 40, but the equipment breaks down after N = 20. Should the sampling distribution now be constructed as if N = 20 were the fixed N, or should it be constructed taking into account the probability of the equipment breaking down and collecting fewer than the intended N = 40? So, while I tend to generally agree that intentions matter, there are some cases where it seems silly. It is often exactly these cases that are presented in papers that try to argue against the use of intentions.
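Kruschke's "many different p values" claim for z = 32 and N = 47 can be illustrated with a short sketch. This is in the spirit of the quote rather than Kruschke's exact computation: I compare the fixed-N plan against one of the "many other stopping rules" he alludes to, namely stopping at the 15th error (which also yields 32 successes in 47 trials, with the 47th trial an error).

```python
from math import comb

# Data: z = 32 successes in N = 47 trials; null: theta = 1/2;
# one-sided alternative theta > 1/2.

# Stopping rule 1 - fixed N = 47: p = P(X >= 32), X ~ Binomial(47, 1/2).
p_fixed_n = sum(comb(47, k) for k in range(32, 48)) / 2**47

# Stopping rule 2 - stop at the 15th error: observing N >= 47 trials means
# at most 14 errors occurred in the first 46 trials.
p_fixed_errors = sum(comb(46, e) for e in range(0, 15)) / 2**46

print(p_fixed_n != p_fixed_errors)  # same data, different p-values
```

Both tail probabilities are small here, but they are not equal, which is all the example needs.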

https://twitter.com/orestistsinalis/status/1004097202887757824

In data science, it is frequently the case that the metric that is being optimised in an ML model’s cost function is not what you *really* want to optimise for, because your problem is usually a function of the ML model’s metric (e.g. optimise log loss to improve accuracy).

Therefore the best H becomes really the best *observed* H of correlated (e.g. acc is somewhat correlated with log loss) or, worse, accidental but desirable properties of the model.

To severely probe an ML model in the context of a *specific problem*, one needs to show that a change in the model was expected to influence the problem solution in a specific way and not others.

People also tune hyperparameters to death via so-called ‘grid search’. This is a prototypical example in my opinion of how to *not* learn from error. For me, a severe test for hyperparam tuning is to show a plausible *path* of your hyperparam search.

PS: Important position paper on machine learning practices: “On Pace, Progress, and Empirical Rigor” https://t.co/A2qMyMx204 https://t.co/M1XeC4ncbr


— Orestis Tsinalis (@orestistsinalis) June 5, 2018


some of the discussions on the principles of statistics he mentions, not that I took part, I had no idea what was going on. He has now abandoned this and I agree with his present attitude, which is to ignore them unless they irritate me to such an extent that I feel I have to say something. This is the case now.

Here are some comments on Crane and Martin.

4.1 The data must be relevant.

Mine are measurements of the quantity of copper in drinking water. The problem is to specify the actual amount of copper in the sample of drinking water. They clearly fulfill the demand of relevance.

4.2 The model must be sound.

The following models are all good approximations to the data: the normal, the Laplace, the log-normal, a t_4, the Cauchy, the comb distribution, etc. Which one do I choose and why?

The model should generalize to other possible data sets obtained under the same scientific conditions. What does this mean for my copper example: always copper, always drinking water, always the same laboratory, always the same staff? What about sludge, cadmium, dust, air, nitrates? Always without outliers, always with outliers, with one outlier, with two outliers? Always symmetric, always skewed? A different model for each of these possibilities? What would the authors suggest and why?

The authors state two possible roles. One involves a ‘hypothetical population’. What is a hypothetical population, a real population or a creation of the mind? I fail to understand why one needs hypothetical populations. What is the hypothetical population of my copper example? I try to give an answer for the data at hand.

The second role is to describe the data generating mechanism. In what sense is i.i.d. Bernoulli a description of the Newtonian chaos which generates the result of the coin toss? Actual data generating mechanisms are in general of a complexity that can in no way be described by a simple probability model. This applies to the copper example and the models I suggested there.

4.3 The inference must be valid.

D = data, P_theta, theta in Theta, a hypothesis A subset of Theta. Now it comes: ‘rarely is it possible to *prove* that A is true or false on D alone’. P_theta, Theta, A are all constructs of the mind. You can only talk about A being true or false if the data come attached with the whole model P_theta, Theta and a true value of theta, which seems to me to be a very baroque ontology.

My approach to this sort of data is to use functionals: mean, median, M-functionals, MAD, etc. There is now no model. What I offer is a procedure, or a set of procedures (Tukey, who seems to have been forgotten). I look at existence, breakdown points, boundedness, continuity, differentiability, equivariance over a full neighbourhood of the data. These latter concepts require a topology. I use the weak topology of the Kolmogorov metric, more generally a metric based on V-C classes, and so on.
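The Kolmogorov metric mentioned here is simple to compute. Purely as an illustration of treating a model as an approximation in that metric (this is my sketch, not Davies' actual procedure, and the data are synthetic stand-ins for copper measurements):

```python
import math
import random

def normal_cdf(x, mu, sigma):
    """Cdf of N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def kolmogorov_distance(data, mu, sigma):
    """sup_x |F_n(x) - F(x)| between the empirical cdf of the data
    and a N(mu, sigma^2) model; the sup is attained at a data point."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mu, sigma)
        d = max(d, abs((i + 1) / n - f), abs(i / n - f))
    return d

# Hypothetical stand-in for 50 copper measurements (mg/l); synthetic, illustration only.
random.seed(0)
data = [2.0 + 0.1 * random.gauss(0.0, 1.0) for _ in range(50)]
mu = sum(data) / len(data)
sigma = (sum((x - mu) ** 2 for x in data) / (len(data) - 1)) ** 0.5

# Small distance: the fitted normal is an adequate approximation to these data.
print(round(kolmogorov_distance(data, mu, sigma), 3))
```

A Laplace or t_4 model fitted to the same data would typically give a comparably small distance, which is exactly the "which one do I choose and why?" problem above.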

It is possible to do covariate choice for linear least squares regression without postulating or using the standard linear model, y = beta*x + noise, or indeed any model. It is possible to use the same idea to produce graphs for gene expression data.

I avoid the word true unless used in the sense of everyday truths or plain truths (Bernard Williams). Models are not true; they are approximations. The authors make no mention of the concept of approximation.

Speed’s criticism was directed at some very specific sorts of principles, but the suggestion that effectiveness is separate from principles is harmful and is what motivated @HarryDCrane and I to write our paper, https://t.co/WenAvl3Eec

— statsmartin (@statsmartin) May 14, 2018
