
JSM 2020 Panel Flyer (PDF)

JSM online program (with panel abstract & information):

**Stephen Senn**
Consultant Statistician

Edinburgh

**Losing Control**

**Match points**

The idea of local control is fundamental to the design and analysis of experiments and contributes greatly to a design’s efficiency. In clinical trials such control is often accompanied by randomisation, and the way that the randomisation is carried out has a close relationship to how the analysis should proceed. For example, if a parallel group trial is carried out in different centres, but randomisation is ‘blocked’ by centre, then, logically, centre should be in the model (Senn, S. J. & Lewis, R. J., 2019). On the other hand, if all the patients in a given centre are allocated the same treatment at random, as in a so-called *cluster randomised trial*, then the fundamental unit of inference becomes the centre and patients are regarded as repeated measures on it. In other words, the way in which the allocation has been carried out affects the degree of matching that has been achieved and this, in turn, is related to the analysis that should be employed. A previous blog of mine, *To Infinity and Beyond*, discusses the point.

**Balancing acts**

In all of this, balance, or rather the degree of it, plays a fundamental part, if not the one that many commentators assume. Balance of prognostic factors is often taken as being necessary to avoid bias. In fact, it is *not* necessary. For example, suppose we wished to eliminate the effect of differences between centres in a clinical trial but had not, in fact, blocked by centre. We would then, just by chance, have some centres in which the numbers of patients on treatment and control differed. The simple difference of the two means for the trial as a whole would then have some influence from the centres, which might be regarded as biasing. However, these effects can be eliminated by the simple stratagem of analysing the data in two stages. In the first stage we compare the means under treatment and control within each centre. In the second stage we combine these differences across the centres, weighting them according to the amount of information provided. In fact, including centre as a factor in a linear model to analyse the effect of treatment achieves the same result as this two-stage approach.

This raises the issue, ‘what is the value of balance?’. The answer is that, other things being equal, balanced allocations are more efficient in that they lead to lower variances. This follows from the fact that the variance of a contrast based on two means is

σ^{2}_{1}/*n*_{1} + σ^{2}_{2}/*n*_{2},   (1)

where σ^{2}_{1}, σ^{2}_{2} are the variances in the two groups being compared and *n*_{1}, *n*_{2} the two sample sizes. In an experimental context, it is often reasonable to proceed as if σ^{2}_{1} = σ^{2}_{2}, so that, writing σ^{2} for each variance, we have, for the variance of the contrast,

σ^{2}(1/*n*_{1} + 1/*n*_{2}).   (2)

Now consider the successive ratios 1, 1/2, 1/3, … 1/*n*. Each term is smaller than the preceding term. However, the amount by which a term is smaller is less than the amount by which the preceding term was smaller than the term that preceded it. For example, 1/3 − 1/4 = 1/12 but 1/2 − 1/3 = 1/6. In general we have 1/*n* − 1/(*n*+1) = 1/(*n*(*n*+1)), which clearly reduces with increasing *n*. It thus follows that if an extra observation can be added to construct such a contrast, it will have the greater effect on reducing the variance of that contrast if it can be added to the group that has the fewer observations. This in turn implies, other things being equal, that balanced contrasts are more efficient.
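These two facts can be checked numerically. The sketch below is my own illustration (with σ² set to 1 and a hypothetical total of 100 patients, neither taken from the post): the gain from an extra observation shrinks as 1/(n(n+1)), and for a fixed total the balanced 50/50 split minimises the variance σ²(1/n₁ + 1/n₂).

```python
# Illustrative sketch (not from the post): why balanced contrasts are efficient.

def contrast_variance(n1, n2, sigma2=1.0):
    """Variance of the difference of two group means: sigma^2 * (1/n1 + 1/n2)."""
    return sigma2 * (1.0 / n1 + 1.0 / n2)

# Diminishing returns: 1/n - 1/(n+1) = 1/(n(n+1))
assert abs((1/2 - 1/3) - 1/6) < 1e-12   # adding a 3rd observation to a group
assert abs((1/3 - 1/4) - 1/12) < 1e-12  # adding a 4th gains only half as much

# For a fixed total of N = 100 patients, the 50/50 split minimises the variance.
N = 100
variances = {n1: contrast_variance(n1, N - n1) for n1 in range(1, N)}
best_n1 = min(variances, key=variances.get)
print(best_n1, variances[best_n1])  # 50 0.04
```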

**Exploiting the ex-external**

However, it is often the case in a randomised clinical trial of a new treatment that a potential control treatment has been much studied in the past. Thus, many more observations, albeit of a historical nature, are available for the control treatment than the experimental one. This in turn suggests that if the argument that balanced datasets are better is used, we should now allocate more patients, and perhaps even all that are available, to the experimental arm. In fact, things are not so simple.

First, it should be noted that if blinding of patients and treating physicians to the treatment being given is considered important, this cannot be convincingly implemented unless randomisation is employed (Senn, S. J., 1994). I have discussed the way that this may have to proceed in a previous blog, *Placebos: it’s not only the patients that are fooled*, but in what follows I am going to assume that blinding is unimportant and consider other problems with using historical controls.

When historical controls are used there are two common strategies. The first is to regard the historical controls as providing an external standard which may be regarded as having negligible error and to use it, therefore, as an unquestionably valid reference. If significance tests are used, a one-sample test is applied to compare the experimental mean to the historical standard. The second is to treat historical controls as if they were concurrent controls and to carry out the statistical analysis that would be relevant were this the case. Both of these are inadequate. Once I have considered them, I shall turn to a third approach that might be acceptable.

**A standard error**

If an experimental group is compared to a historical standard, as if that standard were currently appropriate and established without error, an implicit analogy is being made to a parallel group trial with a control group arm of infinite size. This can be seen by looking at formula (2). Suppose that we let the first group be the control group and the second one the experimental group. As *n*_{1} → ∞, formula (2) will approach σ^{2}/*n*_{2}, which is, in fact, the formula we intend to use.

Figure 1 shows the variance that this approach uses as a horizontal red line and, as a blue line, the variance that would apply to a parallel group trial. The experimental group size has been set at 100 and the control group sample size varies from 100 to 2000. The within-group variance has been set to σ^{2} = 1. It can be seen that the historical-standard approach considerably underestimates the variance that will apply. In fact, even the formula given by the blue line will underestimate the variance, as we shall explain below.
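The numbers behind Figure 1 are easy to reproduce. A minimal sketch, using the values stated above (experimental arm n₂ = 100, σ² = 1, control arm size varying up to 2000):

```python
# Sketch of the Figure 1 comparison: one-sample ("historical standard")
# variance versus the variance of a genuine parallel group trial.

n2 = 100        # experimental group size (from the text)
sigma2 = 1.0    # within-group variance (from the text)

one_sample = sigma2 / n2  # the horizontal red line: sigma^2 / n2 = 0.01

for n1 in (100, 500, 2000):  # control group sizes
    parallel = sigma2 * (1.0 / n1 + 1.0 / n2)  # the blue line: formula (2)
    print(n1, round(parallel, 4))
# Even with 2000 controls the parallel-group variance (0.0105) still exceeds
# the one-sample value (0.01), which is never attained for finite n1.
```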

It thus follows that assessing the effect from a single arm given an experimental treatment by comparison to a value from historical controls, but using a formula for the standard error of σ/√*n*_{2}, where σ is the within-treated-group standard deviation and *n*_{2} is the number of patients, will underestimate the uncertainty in this comparison.

**Parallel lies**

A common alternative is to treat the historical data as if they came concurrently from a parallel group trial. This overlooks many matters, not least of which is that in many cases the data will have come from completely different centres and, whether or not they came from different centres, they came from different studies. That being so, the nearest analogue of a randomised trial is not a *parallel group trial* but a *cluster randomised trial* with study as a unit of clustering. The general set up is illustrated in Figure 2. This shows a comparison of data taken from seven historical studies of a control treatment (C) and one new study of an experimental treatment (E).

This means that there is a between-study variance that has to be added to the within-study variances.

**Cluster muster**

The consequence is that the control variance is not just a function of the number of patients but also of the number of studies. Suppose there are *k* such studies; then even if each of these studies has a huge number of patients, the variance of the control mean cannot be less than *ϒ*^{2}/*k*, where *ϒ*^{2} is the between-study variance. However, there is worse to come. The study of the new experimental treatment also has a between-study contribution but, since there is only one such study, its variance is *ϒ*^{2}/1 = *ϒ*^{2}. The result is that a lower bound for the variance of the contrast using historical data is

*ϒ*^{2} + *ϒ*^{2}/*k* = *ϒ*^{2}(1 + 1/*k*).

It turns out that the variance of the treatment contrast decreases, disappointingly, according to the number of clusters you can muster, not the number of patients. Of course, in practice, things are worse, since all of this is making the optimistic assumption that historical studies are exchangeable with the current one (Collignon, O. et al., 2019; Schmidli, H. et al., 2014).
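A quick sketch of this lower bound (with *ϒ*^{2} set to 1 purely for illustration) shows that no number of historical studies can push the variance of the contrast below *ϒ*^{2}:

```python
# The floor on the contrast variance with k historical control studies and
# one experimental study, as patient numbers in every study go to infinity:
# gamma^2 / k (control mean) + gamma^2 (the single experimental study).

def contrast_lower_bound(k, gamma2=1.0):
    """Between-study floor on Var(experimental mean - control mean)."""
    return gamma2 * (1.0 + 1.0 / k)

for k in (1, 5, 20, 100):
    print(k, contrast_lower_bound(k))
# Even 100 historical studies leave a variance floor just above gamma^2.
```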

Optimists may ask, however, whether this is not all a fuss about nothing. The theory indicates that this might be a problem, but is there anything in practice to indicate that it is? Unfortunately, yes. The TARGET study provides a good example of the sort of difficulties encountered in practice (Senn, S., 2008). This was a study comparing lumiracoxib, ibuprofen and naproxen in osteoarthritis. For practical reasons, centres were either enrolled in a sub-study comparing lumiracoxib to ibuprofen or one comparing lumiracoxib to naproxen. There were considerable differences between sub-studies in terms of baseline characteristics, but not within sub-studies, and there were even differences at outcome for lumiracoxib depending on which sub-study patients were enrolled in. This was not a problem for the way the trial was analysed, since it was foreseen from the outset, but it provides a warning that differences between studies may be important.

Another example is provided by Collignon, O. et al. (2019). Looking at historical data on acute myeloid leukaemia (AML), they identified 19 studies of a proposed control treatment, azacitidine. However, the variation from study to study was such that the 1279 subjects treated in these studies would only provide, in the best of cases, as much information as 50 patients studied concurrently.
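One way of expressing such a finding is as an "effective sample size". The sketch below is my own, not Collignon et al.'s calculation: the variance components (σ² = 1, *ϒ*^{2} = 0.4) are invented for illustration and equal-sized studies are assumed. The point is only that a modest between-study variance spread over 19 studies swamps the contribution of 1279 patients.

```python
# Hypothetical illustration: concurrent-patient equivalent of N historical
# patients spread over k studies, under a simple random-effects model.

def effective_n(N, k, sigma2, gamma2):
    """sigma^2 / Var(historical control mean), assuming equal study sizes."""
    var_hist = gamma2 / k + sigma2 / N  # between-study + within-study parts
    return sigma2 / var_hist

# 1279 patients in 19 studies, with invented variance components:
print(round(effective_n(1279, 19, sigma2=1.0, gamma2=0.4)))  # 46
```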

**COVID Control**

How have we done in the age of COVID? Not always very well. To give an example, a trial that received much coverage was one of hydroxychloroquine in the treatment of patients suffering from coronavirus infection (Gautret, P. et al., 2020). The trial was in 20 patients and “Untreated patients from another center and cases refusing the protocol were included as negative controls.” The senior author, Didier Raoult, later complained of the ‘invasion of methodologists’, blamed them and the pharmaceutical industry for a ‘moral dictatorship’ that physicians should resist, and compared modellers to astrologers (Nau, J.-Y., 2020).

However, the statistical analysis section of the paper has the following to say:

Statistical differences were evaluated by Pearson’s chi-square or Fisher’s exact tests as categorical variables, as appropriate. Means of quantitative data were compared using Student’s t-test.

Now, Karl Pearson, RA Fisher and Student were all methodologists. So, Gautret, P. et al. (2020) do not appear to be eschewing the work of methodologists, far from it. They are merely choosing to use this work inappropriately. But nature is a hard task-mistress and if outcome varies considerably amongst those infected with COVID-19, and we know it does, and if patients vary from centre to centre, and we know they do, then variation from centre to centre cannot be ignored and trials in which patients have not been randomised concurrently cannot be analysed as if they were. Fisher’s exact test, Pearson’s chi-square and Student’s t will underestimate the variation.
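The point can be demonstrated with a small simulation of my own (not from any of the papers cited): two "centres" with a genuine centre effect but no treatment effect at all, one supplying all the treated patients and the other all the controls. A naive two-sample test that treats patients as independent rejects the true null hypothesis far more often than its nominal 5% level.

```python
import math
import random

random.seed(1)

def naive_two_sample_p(x, y):
    """Two-sided p-value treating all patients as independent (normal approx.)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

trials, rejections = 2000, 0
for _ in range(trials):
    centre_t = random.gauss(0, 1)   # centre effects: between-centre s.d. = 1
    centre_c = random.gauss(0, 1)
    treated  = [centre_t + random.gauss(0, 1) for _ in range(20)]
    controls = [centre_c + random.gauss(0, 1) for _ in range(20)]
    if naive_two_sample_p(treated, controls) < 0.05:
        rejections += 1

print(rejections / trials)  # far above the nominal 0.05
```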

**The moral dictatorship of methodology**

Methodologists are, indeed, moral dictators. If you do not design your investigations carefully you are on the horns of a dilemma. Either, you carry out simplistic analyses that are simply wrong or you are condemned to using complex and often unconvincing modelling. Far from banishing the methodologists, you are holding the door wide open to let them in.

**Acknowledgement**

This is based on work that was funded by grant 602552 for the IDEAL project under the European Union FP7 programme and support from the programme is gratefully acknowledged.

**References**

Collignon, O., Schritz, A., Senn, S. J., & Spezia, R. (2019). Clustered allocation as a way of understanding historical controls: Components of variation and regulatory considerations. *Statistical Methods in Medical Research*, 962280219880213

Gautret, P., Lagier, J. C., Parola, P., Hoang, V. T., Meddeb, L., Mailhe, M., . . . Raoult, D. (2020). Hydroxychloroquine and azithromycin as a treatment of COVID-19: results of an open-label non-randomized clinical trial. *Int J Antimicrob Agents*, 105949

Nau, J.-Y. (2020). Hydroxychloroquine : le Pr Didier Raoult dénonce la «dictature morale» des méthodologistes. Retrieved from https://jeanyvesnau.com/2020/03/28/hydroxychloroquine-le-pr-didier-raoult-denonce-la-dictature-morale-des-methodologistes/

Schmidli, H., Gsteiger, S., Roychoudhury, S., O’Hagan, A., Spiegelhalter, D., & Neuenschwander, B. (2014). Robust meta-analytic-predictive priors in clinical trials with historical control information. *Biometrics*, **70**(4), 1023-1032

Senn, S. J. (2008). Lessons from TGN1412 and TARGET: implications for observational studies and meta-analysis. *Pharmaceutical Statistics*, **7**, 294-301

Senn, S. J. (1994). Fisher’s game with the devil. *Statistics in Medicine*, **13**(3), 217-230

Senn, S. J., & Lewis, R. J. (2019). Treatment Effects in Multicenter Randomized Clinical Trials. *JAMA*

**Link: **https://ww2.amstat.org/meetings/jsm/2020/onlineprogram/ActivityDetails.cfm?SessionID=219596

**To register for JSM: **https://ww2.amstat.org/meetings/jsm/2020/registration.cfm

**I.** “Colleges Face Rising Revolt by Professors,” proclaims an article in today’s *New York Times*, in relation to returning to in-person teaching:

Thousands of instructors at American colleges and universities have told administrators in recent days that they are unwilling to resume in-person classes because of the pandemic. More than three-quarters of colleges and universities have decided students can return to campus this fall. But they face a growing faculty revolt.

…This comes as major outbreaks have hit college towns this summer, spread by partying students and practicing athletes.

In an indication of how fluid the situation is, the University of Southern California said late Wednesday that “an alarming spike in coronavirus cases” had prompted it to reverse an earlier decision to encourage attending classes in person.

…. Faculty members at institutions including Penn State, the University of Illinois, Notre Dame and the State University of New York have signed petitions complaining that they are not being consulted and are being pushed back into classrooms too fast.

… “I shudder at the prospect of teaching in a room filled with asymptomatic superspreaders,” wrote Paul M. Kellermann, 62, an English professor at Penn State, in an essay for Esquire magazine, proclaiming that “1,000 of my colleagues agree.” Those colleagues have demanded that the university give them a choice of doing their jobs online or in person.

**II.** There is currently a circulating petition of Virginia faculty making similar requests, and if you’re a Virginia faculty member and wish to sign, you still have *one* day (7/4/20).

A preference to teach remotely isn’t only to mitigate the risk of infection by asymptomatic students; it may also reflect the need to take care of children who might not be in school full-time this fall. Yet a return to in-person teaching has been made the default option in many universities, such as Virginia Tech (which has decided 1/3 of classes will be in person).

Other universities have been more open to letting professors decide for themselves what to do. “Due to these extraordinary circumstances, the university is temporarily suspending the normal requirement that teaching be done in person,” the University of Chicago said in a message to instructors on June 26.

Yale said on Wednesday that it would bring only a portion of its students back to campus for each semester: freshmen, juniors and seniors in the fall, and sophomores, juniors and seniors in the spring. “Nearly all” college courses will be taught remotely, the university said, so that all students can enroll in them.

New York Times

It would be one thing if all students were regularly tested for covid-19, but in the long-awaited plan released yesterday by Virginia Tech, students are at most being “asked” to obtain a negative result within 5 days of returning to campus–with the exception of students living in a campus residence, who will be offered tests when they arrive. Getting tested is also being “strongly advised”.

If they test positive, they are asked to self-isolate (with the number of days not indicated). A student would need to begin the process of seeking a test several weeks prior to the start of class to ensure at least a 14-day isolation (even though asymptomatics are known to be infectious for longer). But my main concern is that even vigilant students would face obstacles to qualifying for testing, given the current criteria. A student who does not currently have symptoms would not meet the criteria for testing in Virginia, or in the vast majority of other states, unless they had been in close contact with infected persons. (There are exceptions, such as NYC.) This could be rectified if Virginia Tech could get the Virginia Department of Health to include “returning to campus” under their provision to test those “entering congregate settings”–currently limited to long-term care facilities, prisons, and the like.

It is now known that a large percentage of people with Covid-19 are asymptomatic. “Among more than 3,000 prison inmates in four states who tested positive for the coronavirus, the figure was astronomical: 96 percent asymptomatic.”(Link).

An extensive review in the *Annals of Internal Medicine* suggests that asymptomatic infections may account for 45 percent of all COVID-19 cases:

“The likelihood that approximately 40% to 45% of those infected with SARS-CoV-2 will remain asymptomatic suggests that the virus might have greater potential than previously estimated to spread silently and deeply through human populations. Asymptomatic persons can transmit SARS-CoV-2 to others for an extended period, perhaps longer than 14 days.

…

The focus of testing programs for SARS-CoV-2 should be substantially broadened to include persons who do not have symptoms of COVID-19.”

**III.** An easy solution would seem to be to turn to “pooled testing”. It’s an old statistical idea, but it’s only now gaining traction.[1] In the July 1 NYT:

The method, called pooled testing, signals a paradigm shift. Instead of carefully rationing tests to only those with symptoms, pooled testing would enable frequent surveillance of asymptomatic people. Mass identification of coronavirus infections could hasten the reopening of schools, offices and factories.

“We’re in intensive discussions about how we’re going to do it,” Dr. Anthony S. Fauci, the country’s leading infectious disease expert, said in an interview. “We hope to get this off the ground as soon as possible.”

…Here’s how the technique works: A university, for example, takes samples from every one of its thousands of students by nasal swab, or perhaps saliva. Setting aside part of each individual’s sample, the lab combines the rest into a batch holding five to 10 samples each. The pooled sample is tested for coronavirus infection. Barring an unexpected outbreak, just 1 percent or 2 percent of the students are likely to be infected, so the overwhelming majority of pools are likely to test negative.

But if a pool yields a positive result, the lab would retest the reserved parts of each individual sample that went into the pool, pinpointing the infected student. The strategy could be employed for as little as $3 per person per day, according to an estimate from economists at the University of California, Berkeley.
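The arithmetic behind the savings is simple. Under this two-stage ("Dorfman") scheme, a pool of *s* people always costs one test plus, whenever the pool is positive, *s* individual retests. The sketch below uses the article's figures: pools of 5 to 10 and a prevalence of 1 to 2 percent.

```python
# Expected tests per person under two-stage (Dorfman) pooled testing.

def expected_tests_per_person(p, s):
    """p = prevalence, s = pool size: 1/s pooled tests per person, plus an
    individual retest for everyone whose pool is positive (prob. 1-(1-p)^s)."""
    return 1.0 / s + (1.0 - (1.0 - p) ** s)

for s in (5, 10):
    for p in (0.01, 0.02):
        print(s, p, round(expected_tests_per_person(p, s), 3))
# Every case comes out below 0.3 tests per person, a 3-5x saving over
# testing everyone individually.
```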

The FDA has set out guidelines for adopting pooled testing, which employs the same PCR technology as individual diagnostic tests (link).

Universities should consider what they will do once a certain number of positive COVID-19 cases emerge. The Virginia Tech plan proposes to house infected students in a single dorm, but what about the majority of students who live off campus? At what point would they switch to remote teaching? As much as everyone wants to return to normalcy, a class of masked students, 6 feet apart, doesn’t obviously create a better learning environment than Zoom. By regularly conducting pooled tests, the university would become aware of increased spread as soon as a higher proportion of the pools return positive results, before we see an increase in serious cases and hospitalizations.

Chris Bilder, a statistician at the University of Nebraska–Lincoln, has been advising the Nebraska Public Health Laboratory on its use of group testing since April. He and his colleagues have developed a newly released app to determine precisely how best to conduct the pooling for a chosen reduction in testing and a given estimate of prevalence. (Link)
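I have not seen the app's internals, so the following is only a guess at the kind of optimisation involved: for a given prevalence estimate, choose the Dorfman pool size that minimises the expected number of tests per person.

```python
# Hypothetical sketch of the optimisation: best Dorfman pool size for a
# given prevalence p (two-stage pooling, perfect test accuracy assumed).

def optimal_pool_size(p, max_size=50):
    cost = {s: 1.0 / s + 1.0 - (1.0 - p) ** s for s in range(2, max_size + 1)}
    return min(cost, key=cost.get)

for p in (0.005, 0.01, 0.05):
    s = optimal_pool_size(p)
    tests = 1.0 / s + 1.0 - (1.0 - p) ** s
    print(p, s, round(tests, 3))
# Lower prevalence favours larger pools; as prevalence rises, the
# advantage of pooling shrinks.
```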

I will add to this over the next few days, as new reports become available. Please share your thoughts and related articles, in the comments.

[1] I first heard it discussed weeks ago by someone on Andrew Gelman’s blog, but I don’t know if it was the same idea.

**Trustworthiness of Statistical Analysis**

David Hand

**Abstract:** Trust in statistical conclusions derives from the trustworthiness of the data and analysis methods. Trustworthiness of the analysis methods can be compromised by misunderstanding and incorrect application. However, that should stimulate a call for education and regulation, to ensure that methods are used correctly. The alternative of banning potentially useful methods, on the grounds that they are often misunderstood and misused, is short-sighted, unscientific, and Procrustean. It damages the capability of science to advance, and feeds into public mistrust of the discipline.

Below are Prof. Hand’s slides w/o audio, followed by a video w/audio. You can also view them on the Meeting #6 post on the PhilStatWars blog (https://phil-stat-wars.com/2020/06/21/meeting-6-june-25/).

SLIDES:

VIDEO: (Viewing in full screen mode helps with buffering issues.)

We’re holding a bonus, 6th, meeting of the graduate research seminar PH500 for the Philosophy, Logic & Scientific Method Department at the LSE:

(Remote 10am-12 EST, 15:00 – 17:00 London time; Thursday, June 25)

**VI. (June 25) BONUS: Power, shpower, severity, positive predictive value (diagnostic model) & a Continuation of The Statistics Wars and Their Casualties**

**There will also be a guest speaker: Professor David Hand (Imperial College, London).** Here is Professor Hand’s presentation (click on “present” to hear sound).

The main readings are on the blog page for the seminar.


Nearly 3 months ago I tweeted “Stat people: shouldn’t they be testing a largish random sample of people [w/o symptoms] to assess rates, alert those infected, rather than only high risk, symptomatic people, in the U.S.?” I was surprised that nearly all the stat and medical people I know expressed the view that it wouldn’t be feasible or even very informative. Really? Granted, testing was and is limited, but had it been made a priority, it could have been done. In the new issue of *Significance* (June 2020) that I just received, James J. Cochran writes “on the importance of testing a random sample.” [1]

In the United States (as of 9 April 2020), President Donald Trump has said that testing for novel coronavirus infection will be limited to people who believe they may be infected. But if we only test people who believe they may be infected, we cannot understand how deep the virus has reached into the population. The only way this could work is if those who believe they may be infected are representative of the population with respect to novel coronavirus infection. Does anyone believe this is so? The common characteristic of those who believe they may be infected is that they all show some outward symptoms of infection by the virus. In other words, people who are being tested for the novel coronavirus are disproportionately showing severe symptoms. This would not be a problem if someone who is infected by the novel coronavirus immediately shows symptoms, but this is not the case. We have strong evidence that some people develop mild cases, show no symptoms, and carry the virus without knowing it because they are asymptomatic. Thus, efforts to understand the virus’s penetration into the population must include observation of the asymptomatic.

Indeed, a recent assessment (in the *Annals of Internal Medicine*) is that at least 40% of people with COVID-19 are (and remain) asymptomatic. (An overview is in *Time*.) Oddly, while remaining asymptomatic, some still show damage to the lungs or other organs.

The estimate of the proportion of the population who are infected can be calculated as:

(number in the sample who are infected and symptomatic + number in the sample who are infected but asymptomatic) / (size of the random sample)

So, we need data from a random sample of the entire population in order to gather data from infected people who are showing symptoms, infected people who are asymptomatic, and people who are not infected. All have some probability of being included in a true random sample of the population.
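As a sketch of how such an estimate behaves (the counts below are invented for illustration, not real data), one can attach a standard error to the sampled proportion:

```python
import math

def prevalence_estimate(positives, n):
    """Point estimate and normal-approximation 95% CI for prevalence
    from a simple random sample of n people, of whom `positives` test positive."""
    p_hat = positives / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Hypothetical random sample: 140 positives among 10,000 people tested.
p_hat, (lo, hi) = prevalence_estimate(positives=140, n=10_000)
print(round(p_hat, 4), round(lo, 4), round(hi, 4))  # 0.014 0.0117 0.0163
```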

As of 23 April, leaders in Germany and New York State (see bit.ly/2Kp2iXd and dailym.ai/3bxZ5Au) had moved to implement random testing to assess how widespread the virus is, but there has been resistance from leaders elsewhere. This could be due to ignorance, disregard, or lack of appreciation of statistical principles – a consequence of the lack of statistical literacy that pervades the general population. (If the general population insisted on the use of random sampling to assess how widespread the virus is, leaders would not likely resist.) Or it could reflect concern over the limited availability of tests and a desire to devote all of these limited tests to those who show symptoms of novel coronavirus infection.

Unfortunately, this might be inadvertently helping the novel coronavirus spread. If a society does not understand the extent of infection in the general population or the virus’s infectivity, how can it prepare and optimally devote its resources to slow the spread of the virus? How does it decide what preventive measures are appropriate or necessary? How does it minimise the likelihood that the virus spreads to the point that the capacity of the hospital system is overwhelmed? Most crucially, how does it know if it is making progress or if conditions are deteriorating?

Without the evidence that a random sample of the general population would provide, we are operating in the dark. While we operate in the dark, preventable deaths will accumulate, and we will continue to take measures that are not only ineffective, but also unnecessarily costly.

Most of the world still lacks the ability to test a large number of people, and this understandably makes even those leaders who appreciate sampling hesitant to test a random sample of the general population. But the bottom line is, we need more coronavirus tests than we think we need.

*Today is Allan Birnbaum’s birthday. In honor of his birthday, I’m posting the articles in the Synthese volume that was dedicated to his memory in 1977. The editors describe it as their way of “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. I had posted the volume before, but there are several articles that are very worth rereading. I paste a few snippets from the articles by Giere and Birnbaum. If you’re interested in statistical foundations, and are unfamiliar with Birnbaum, here’s a chance to catch up. (Even if you are, you may be unaware of some of these key papers.)*

**HAPPY BIRTHDAY ALLAN!**

*Synthese* Volume 36, No. 1 Sept 1977: *Foundations of Probability and Statistics*, Part I

**Editorial Introduction:**

This special issue of *Synthese* on the foundations of probability and statistics is dedicated to the memory of Professor Allan Birnbaum. Professor Birnbaum’s essay ‘The Neyman-Pearson Theory as Decision Theory; and as Inference Theory; with a Criticism of the Lindley-Savage Argument for Bayesian Theory’ was received by the editors of *Synthese* in October, 1975, and a decision was made to publish a special symposium consisting of this paper together with several invited comments and related papers. The sad news about Professor Birnbaum’s death reached us in the summer of 1976, but the editorial project could nevertheless be completed according to the original plan. By publishing this special issue we wish to pay homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics. We are grateful to Professor Ronald Giere who wrote an introductory essay on Professor Birnbaum’s concept of statistical evidence and who compiled a list of Professor Birnbaum’s publications.

THE EDITORS

**Table of Contents**


- Editorial Introduction. (1977). *Synthese*, 36(1), 3.
- Giere, R. (1977). Allan Birnbaum’s Conception of Statistical Evidence. *Synthese*, 36(1), 5-13.

**SUFFICIENCY, CONDITIONALITY AND LIKELIHOOD**

In December of 1961 Birnbaum presented the paper ‘On the Foundations of Statistical Inference’ (Birnbaum [19]) at a special discussion meeting of the American Statistical Association. Among the discussants was L. J. Savage, who pronounced it “a landmark in statistics”. Explicitly denying any “intent to speak with exaggeration or rhetorically”, Savage described the occasion as “momentous in the history of statistics”. “It would be hard”, he said, “to point to even a handful of comparable events” (Birnbaum [19], pp. 307-8). The reasons for Savage’s enthusiasm are obvious. Birnbaum claimed to have shown that two principles widely held by non-Bayesian statisticians (sufficiency and conditionality) jointly imply an important consequence of Bayesian statistics (likelihood).[1]

- Giere, R. (1977). Publications by Allan Birnbaum. *Synthese*, 36(1), 15-17.
- Birnbaum, A. (1977). The Neyman-Pearson Theory as Decision Theory, and as Inference Theory; With a Criticism of the Lindley-Savage Argument for Bayesian Theory. *Synthese*, 36(1), 19-49.

**INTRODUCTION AND SUMMARY**

…Two contrasting interpretations of the decision concept are formulated: behavioral, applicable to ‘decisions’ in a concrete literal sense as in acceptance sampling; and evidential, applicable to ‘decisions’ such as ‘reject H’ in a research context, where the pattern and strength of statistical evidence concerning statistical hypotheses is of central interest. Typical standard practice is characterized as based on the confidence concept of statistical evidence, which is defined in terms of evidential interpretations of the ‘decisions’ of decision theory. These concepts are illustrated by simple formal examples with interpretations in genetic research, and are traced in the writings of Neyman, Pearson, and other writers. The Lindley-Savage argument for Bayesian theory is shown to have no direct cogency as a criticism of typical standard practice, since it is based on a behavioral, not an evidential, interpretation of decisions.

- Lindley, D. (1977). The Distinction between Inference and Decision. *Synthese*, 36(1), 51-58.
- Pratt, J. (1977). ‘Decisions’ as Statistical Evidence and Birnbaum’s ‘Confidence Concept’. *Synthese*, 36(1), 59-69.
- Smith, C. (1977). The Analogy between Decision and Inference. *Synthese*, 36(1), 71-85.
- Kyburg, H. (1977). Decisions, Conclusions, and Utilities. *Synthese*, 36(1), 87-96.
- Neyman, J. (1977). Frequentist Probability and Frequentist Statistics. *Synthese*, 36(1), 97-131.
- Le Cam, L. (1977). A Note on Metastatistics or ‘An Essay toward Stating a Problem in the Doctrine of Chances’. *Synthese*, 36(1), 133-160.
- Kiefer, J. (1977). The Foundations of Statistics: Are There Any? *Synthese*, 36(1), 161-176.

[1] By “likelihood” here, Giere means the (strong) Likelihood Principle (SLP). Dotted through the first 3 years of this blog are a number of (formal and informal) posts on his SLP result, and my argument as to why it is unsound. I wrote a paper on this that appeared in *Statistical Science* in 2014. You can find it, along with a number of comments and my rejoinder, in this post: Statistical Science: The Likelihood Principle Issue is Out. The consequences of having found his proof unsound give a new lease on life to statistical foundations, or so I argue in my rejoinder.

Ship StatInfasST will embark on a new journey from 21 May to 18 June: a graduate research seminar for the Philosophy, Logic & Scientific Method Department at the LSE. Given that the pandemic has shut down cruise ships, it will remain at dock in the U.S. and use Zoom. If you care to follow any of the 5 sessions, nearly all of the materials will be linked here, collected from excerpts already on this blog. If you are interested in observing on Zoom beginning 28 May, please follow the directions here.

**For the updated schedule, see the seminar web page.**

**Topic: Current Controversies in Phil Stat** (LSE, remote; 10am-12pm EST, 15:00-17:00 London time; Thursdays 21 May-18 June)

**Main Text (SIST):** *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (CUP, 2018):

**I. (May 21)** Introduction: Controversies in Phil Stat:

**SIST: Preface, Excursion 1**

Preface

Excursion 1 Tour II

Notes/Outline of Excursion 1

Postcard: Souvenir A

**II.** **(May 28)** N-P and Fisherian Tests, Severe Testing:

**SIST: Excursion 3 Tour I** (focus on pages up to p. 152)

**Recommended:** Excursion 2 Tour II, pp. 92-100

*Optional:* I will (try to) answer questions on demarcation of science, induction, falsification, and Popper from Excursion 2 Tour II

**Handout**: *Areas Under the Standard Normal Curve*

**III.** **(June 4)** Deeper Concepts: Confidence Intervals and Tests: Higgs’ Discovery:

**SIST: Excursion 3 Tour III **

*Optional:* I will answer questions on Excursion 3 Tour II: Howlers and Chestnuts of Tests

**IV.** **(June 11)** Rejection Fallacies: Do P-values exaggerate evidence?

Jeffreys-Lindley paradox or Bayes/Fisher disagreement:

**SIST: Excursion 4 Tour II **


*Recommended (if time):* Excursion 4 Tour I: The Myth of “The Myth of Objectivity”

**V. (June 18) The Statistics Wars and Their Casualties:**

**SIST: Excursion 4 Tour III**: pp. 267-286; **Farewell Keepsake**: pp. 436-444

- Amrhein, V., Greenland, S., & McShane, B. (2019). Comment: Retire Statistical Significance. *Nature*, 567: 305-308.
- Ioannidis, J. (2019). The Importance of Predefined Rules and Prespecified Statistical Analyses: Do Not Abandon Significance. *JAMA*, 321(21): 2067-2068. doi:10.1001/jama.2019.4582
- Ioannidis, J. (2019). Correspondence: Retiring statistical significance would give bias a free pass. *Nature*, 567, 461. https://doi.org/10.1038/d41586-019-00969-2
- Mayo, D. G. (2019). P-value thresholds: Forfeit at your peril. *Eur J Clin Invest*, 49: e13170. doi:10.1111/eci.13170

**References: Captain’s Bibliography**

**DELAYED: JUNE 19-20 Workshop: The Statistics Wars and Their Casualties**

Here’s the final part of Brian Haig’s recent paper ‘What can psychology’s statistics reformers learn from the error-statistical perspective?’ in *Methods in Psychology* 2 (Nov. 2020). The full article, which is open access, is here. I will make some remarks in the comments.

**5. The error-statistical perspective and the nature of science**

As noted at the outset, the error-statistical perspective has made significant contributions to our philosophical understanding of the nature of science. These are achieved, in good part, by employing insights about the nature and place of statistical inference in experimental science. The achievements include deliberations on important philosophical topics, such as the demarcation of science from non-science, the underdetermination of theories by evidence, the nature of scientific progress, and the perplexities of inductive inference. In this article, I restrict my attention to two such topics: The process of falsification and the structure of modeling.

*5.1. Falsificationism*

The best known account of scientific method is the so-called hypothetico-deductive method. According to its most popular description, the scientist takes an existing hypothesis or theory and tests it indirectly by deriving one or more observational predictions that are subjected to direct empirical test. Successful predictions are taken to provide inductive confirmation of the theory; failed predictions are said to provide disconfirming evidence for the theory. In psychology, NHST is often embedded within such a hypothetico-deductive structure and contributes to weak tests of theories.

Also well known is Karl Popper’s falsificationist construal of the hypothetico-deductive method, which is understood as a general strategy of conjecture and refutation. Although it has been roundly criticised by philosophers of science, it is frequently cited with approval by scientists, including psychologists, even though they do not, indeed could not, employ it in testing their theories. The major reason for this is that Popper does not provide them with sufficient methodological resources to do so.

One of the most important features of the error-statistical philosophy is its presentation of a falsificationist view of scientific inquiry, with error statistics serving an indispensable role in testing. From a sympathetic, but critical, reading of Popper, Mayo endorses his strategy of developing scientific knowledge by identifying and correcting errors through strong tests of scientific claims. Making good on Popper’s lack of knowledge of statistics, Mayo shows how one can properly employ a range of, often familiar, error-statistical methods to implement her all-important severity requirement. Stated minimally, and informally, this requirement says, “A claim is severely tested to the extent that it has been subjected to and passes a test that probably would have found flaws, were they present.” (Mayo, 2018, p. xii) Further, in marked contrast with Popper, who deemed deductive inference to be the only legitimate form of inference, Mayo’s conception of falsification stresses the importance of inductive, or content-increasing, inference in science. We have here, then, a viable account of falsification, which goes well beyond Popper’s account with its lack of operational detail about how to construct strong tests. It is worth noting that the error-statistical stance offers a constructive interpretation of Fisher’s oft-cited remark that the null hypothesis is never proved, only possibly disproved.
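Mayo’s severity requirement, quoted informally above, can be cashed out numerically for the textbook one-sided Normal test T+ (H0: μ ≤ μ0 vs H1: μ > μ0, σ known), a case of the kind Mayo herself uses. The sketch below is illustrative only; the numbers (x̄ = 152, μ0 = 150, σ = 10, n = 100) are my assumptions, not from Haig’s paper. After a rejection, the severity with which the claim μ > μ1 passes is the probability the test would have produced a *less* extreme result were μ no greater than μ1:

```python
from math import erf, sqrt

def norm_cdf(z: float) -> float:
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def severity(x_bar: float, mu1: float, sigma: float, n: int) -> float:
    """Severity for the claim mu > mu1 in test T+ (H0: mu <= mu0 vs
    H1: mu > mu0, sigma known), given observed sample mean x_bar:
    SEV(mu > mu1) = P(X_bar <= x_bar; mu = mu1)."""
    se = sigma / sqrt(n)                 # standard error of the mean
    return norm_cdf((x_bar - mu1) / se)

# Illustrative numbers: n = 100, sigma = 10, so SE = 1; observed x_bar = 152.
print(severity(152, 150, 10, 100))  # claim mu > 150 passes with high severity
print(severity(152, 153, 10, 100))  # claim mu > 153 passes with low severity
```

The same observed result thus warrants the weaker claim strongly and the stronger claim poorly, which is the point of the requirement: severity attaches to particular claims, not to the test as a whole.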

*5.2. A hierarchy of models*

In the past, philosophers of science tended to characterize scientific inquiry by focusing on the general relationship between evidence and theory. Similarly, scientists, even today, commonly speak in general terms of the relationship between data and theory. However, due in good part to the labors of experimentally-oriented philosophers of science, we now know that this coarse-grained depiction is a poor portrayal of science. The error-statistical perspective is one such philosophy that offers a more fine-grained parsing of the scientific process.

Building on Patrick Suppes’ (1962) important insight that science employs a hierarchy of models that ranges from experimental experience to theory, Mayo’s (1996) error-statistical philosophy initially adopted a framework in which three different types of models are interconnected and serve to structure error-statistical inquiry: Primary models, experimental models, and data models. Primary models, which are at the top of the hierarchy, break down a research problem, or question, into a set of local hypotheses that can be investigated using reliable methods. Experimental models take the mid-position on the hierarchy and structure the particular experiment at hand. They serve to link primary models to data models. And, data models, which are at the bottom of the hierarchy, generate and model raw data, put them in canonical form, and check whether the data satisfy the assumptions of the experimental models. It should be mentioned that the error-statistical approach has been extended to primary models and theories of a more global nature (Mayo and Spanos, 2010) and, now, also includes a consideration of experimental design and the analysis and generation of data (Mayo, 2018).

This hierarchy of models facilitates the achievement of a number of goals that are important to the error-statistician. These include piecemeal strong testing of local hypotheses rather than broad theories, and employing the model hierarchy as a structuring device to knowingly move back and forth between statistical and scientific hypotheses. The error-statistical perspective insists on maintaining a clear distinction between statistical and scientific hypotheses, pointing out that psychologists often mistakenly take tests of significance to have direct implications for substantive hypotheses and theories.

**6. The philosophy of statistics**

A heartening attitude that comes through in the error-statistical corpus is the firm belief that the philosophy of statistics is an important part of statistical thinking. This emphasis on the conceptual foundations of the subject contrasts markedly with much of statistical theory, and most of statistical practice. It is encouraging, therefore, that Mayo’s philosophical work has influenced a number of prominent statisticians, who have contributed to the foundations of their discipline. Gelman’s error-statistical philosophy canvassed earlier is a prominent case in point. Through both precept and practice, Mayo’s work makes clear that philosophy can have a direct impact on statistical practice. Given that statisticians operate with an implicit philosophy, whether they know it or not, it is better that they avail themselves of an explicitly thought-out philosophy that serves their thinking and practice in useful ways. More particularly, statistical reformers recommend methods and strategies that have underlying philosophical commitments. It is important that they are identified, described, and evaluated.

The tools used by the philosopher of statistics in order to improve our understanding and use of statistical methods are considerable (Mayo, 2011). They include clarifying disputed concepts, evaluating arguments employed in statistical debates, including the core commitments of rival schools of thought, and probing the deep structure of statistical methods themselves. In doing this work, the philosopher of statistics, as philosopher, ascends to a meta-level to get purchase on their objects of study. This second-order inquiry is a proper part of scientific methodology.

It is important to appreciate that the error-statistical outlook is a scientific methodology in the proper sense of the term. Briefly stated, methodology is the interdisciplinary field that draws from disciplines that include statistics, philosophy of science, history of science, as well as indigenous contributions from the various substantive disciplines. As such, it is the key to a proper understanding of statistical and scientific methods. Mayo’s focus on the role of error statistics in science is deeply informed about the philosophy, history, and theory of statistics, as well as statistical practice. It is for this reason that the error-statistical perspective is strategically positioned to help the reader to go beyond the statistics wars.

**7. Conclusion**

The error-statistical outlook provides researchers, methodologists, and statisticians with a distinctive and illuminating perspective on statistical inference. Its Popper-inspired emphasis on strong tests is a welcome antidote to the widespread practice of weak statistical hypothesis testing that still pervades psychological research. More generally, the error-statistical standpoint affords psychologists an informative perspective on the nature of good statistical practice in science that will help them understand and transcend the statistics wars into which they have been drawn. Importantly, psychologists should know about the error-statistical perspective as a genuine alternative to the new statistics and Bayesian statistics. The new statisticians, Bayesian statisticians, and those with other preferences should address the challenges to their outlooks on statistics that the error-statistical viewpoint provides. Taking these challenges seriously would enrich psychology’s methodological landscape.

*This article is based on an invited commentary on Deborah Mayo’s book, *Statistical inference as severe testing: How to get beyond the statistics wars* (Cambridge University Press, 2018), which appeared at https://statmodeling.stat.columbia.edu/2019/04/12. It is adapted with permission. I thank Mayo for helpful feedback on an earlier draft.

Refer to the paper for the references. I invite your comments and questions.


Here’s a picture of ripping open the first box of (rush) copies of *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*,* and here’s a continuation of Brian Haig’s recent paper ‘What can psychology’s statistics reformers learn from the error-statistical perspective?’ in *Methods in Psychology* 2 (Nov. 2020). Haig contrasts error statistics, the “new statistics”, and Bayesian statistics from the perspective of the statistics wars in psychology. The full article, which is open access, is here. I will make several points in the comments.

**4. Bayesian statistics**

Despite its early presence, and prominence, in the history of statistics, the Bayesian outlook has taken an age to assert itself in psychology. However, a cadre of methodologists has recently advocated the use of Bayesian statistical methods as a superior alternative to the messy frequentist practice that dominates psychology’s research landscape (e.g., Dienes, 2011; Kruschke and Liddell, 2018; Wagenmakers, 2007). These Bayesians criticize NHST, often advocate the use of Bayes factors for hypothesis testing, and rehearse a number of other well-known Bayesian objections to frequentist statistical practice.

Of course, there are challenges for Bayesians from the error-statistical perspective, just as there are for the new statisticians. For example, the frequently made claim that *p* values exaggerate the evidence against the null hypothesis, but Bayes factors do not, is shown by Mayo not to be the case. She also makes the important point that Bayes factors, as they are currently used, do not have the ability to probe errors and, thus, violate the requirement for severe tests. Bayesians, therefore, need to rethink whether Bayes factors can be deployed in some way to provide strong tests of hypotheses through error control. As with the new statisticians, Bayesians also need to reckon with the coherent hybrid NHST afforded by the error-statistical perspective, and argue against it, rather than the common inchoate hybrids, if they want to justify abandoning NHST. Finally, I note in passing that Bayesians should consider, among other challenges, Mayo’s critique of the controversial Likelihood Principle, a principle which ignores the post-data consideration of sampling plans.
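The *p*-value/Bayes-factor disagreement mentioned above (the Jeffreys-Lindley contrast on the seminar reading list) can be made concrete with a toy computation. The sketch below is mine, not from Haig’s paper: it assumes Normal data with known unit variance, H0: μ = 0, and an illustrative N(0, 1) prior on μ under H1 (the choice τ = 1 is an assumption). Holding the z statistic fixed at 1.96, the two-sided *p*-value stays near 0.05 at every sample size, while the Bayes factor BF01 moves from mildly favoring H1 to strongly favoring the null as n grows:

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x: float, sd: float) -> float:
    """Density of N(0, sd^2) at x."""
    return exp(-x * x / (2 * sd * sd)) / (sd * sqrt(2 * pi))

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z statistic."""
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

def bf01(z: float, n: int, tau: float = 1.0) -> float:
    """Bayes factor for H0: mu = 0 against H1: mu ~ N(0, tau^2),
    given sample mean x_bar = z/sqrt(n) of n N(mu, 1) observations.
    tau = 1 is an illustrative prior choice, not a canonical one."""
    x_bar = z / sqrt(n)
    return normal_pdf(x_bar, 1 / sqrt(n)) / normal_pdf(x_bar, sqrt(tau**2 + 1 / n))

z = 1.96  # two-sided p ~= 0.05 at every n
for n in (10, 1_000, 100_000):
    print(n, round(two_sided_p(z), 3), round(bf01(z, n), 2))
```

Whether this shows that *p* values exaggerate evidence, or instead that Bayes factors with spiked-null priors are the wrong yardstick, is precisely what is disputed between the camps; Mayo argues the latter.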

*4.1. Contrasts between the Bayesian and error-statistical perspectives*

One of the major achievements of the philosophy of error-statistics is that it provides a comprehensive critical evaluation of the major variants of Bayesian statistical thinking, including the classical subjectivist, “default”, pragmatist, and eclectic options within the Bayesian corpus. Whether the adoption of Bayesian methods in psychology will overcome the disorders of current frequentist practice remains to be seen. What is clear from reading the error-statistical literature, however, is that the foundational options for Bayesians are numerous, convoluted, and potentially bewildering. It would be a worthwhile exercise to chart how these foundational options are distributed across the prominent Bayesian statisticians in psychology. For example, the increasing use of Bayes factors for hypothesis testing purposes is accompanied by disorderliness at the foundational level, just as it is in the Bayesian literature more generally. Alongside the fact that some Bayesians are sceptical of the worth of Bayes factors, we find disagreement about the comparative merits of the subjectivist and default Bayesian outlooks on Bayes factors in psychology (Wagenmakers et al., 2018).

The philosophy of error-statistics contains many challenges for Bayesians to consider. Here, I want to draw attention to three basic features of Bayesian thinking, which are rejected by the error-statistical approach. First, the error-statistical approach rejects the Bayesian insistence on characterizing the evidential relation between hypothesis and evidence in a universal and logical manner in terms of Bayes’ theorem. Instead, it formulates the relation in terms of the substantive and specific nature of the hypothesis and the evidence with regard to their origin, modeling, and analysis. This is a consequence of a strong commitment to a piecemeal, contextual approach to testing, using the most appropriate frequentist methods available for the task at hand. This contextual attitude to testing is taken up in Section 5.2, where one finds a discussion of the role different models play in structuring and decomposing inquiry.

Second, the error-statistical philosophy also rejects the classical Bayesian commitment to the subjective nature of prior probabilities, which the agent is free to choose, in favour of the more objective process of establishing error probabilities understood in frequentist terms. It also finds unsatisfactory the turn to the more popular objective, or “default”, Bayesian option, in which the agent’s appropriate degrees of belief are constrained by relevant empirical evidence. The error-statistician rejects this default option because it fails in its attempts to unify Bayesian and frequentist ways of determining probabilities.

And, third, the error-statistical outlook employs probabilities to measure how effectively *methods* facilitate the detection of error, and how those methods enable us to choose between alternative hypotheses. By contrast, orthodox Bayesians use probabilities to measure *belief* in hypotheses or degrees of confirmation. As noted earlier, most Bayesians are not concerned with error probabilities at all. It is for this reason that error-statisticians will say about Bayesian methods that, without supplementation with error probabilities, they are not capable of providing stringent tests of hypotheses.

*4.2. The Bayesian remove from scientific practice*

Two additional features of the Bayesian focus on beliefs, which have been noted by philosophers of science and statistics, draw attention to their outlook on science. First, Kevin Kelly and Clark Glymour worry that “Bayesian methods assign numbers to answers instead of producing answers outright.” (2004, p. 112) Their concern is that the focus on the scientist’s beliefs “screens off” the scientist’s direct engagement with the empirical and theoretical activities that are involved in the phenomenology of science. Mayo agrees that we should focus on the scientific phenomena of interest, not the associated epiphenomena of degrees of belief. This preference stems directly from the error-statistician’s conviction that probabilities properly quantify the performance of methods, not the scientist’s degrees of belief.

Second, Henry Kyburg is puzzled by the Bayesian’s desire to “replace the fabric of science… with a vastly more complicated representation in which each statement of science is accompanied by its probability, for each of us.” (1992, p.149) Kyburg’s puzzlement prompts the question, ‘Why should we be interested in each other’s probabilities?’ This is a question raised by David Cox about prior probabilities, and noted by Mayo (2018).

This Bayesian remove from science contrasts with the willingness of the error-statistical perspective to engage more directly with science. Mayo is a philosopher of science as well as statistics, and has a keen eye for scientific practice. Given that contemporary philosophers of science tend to take scientific practice seriously, it comes as no surprise that she brings it to the fore when dealing with statistical concepts and issues. Indeed, her error-statistical philosophy should be seen as a significant contribution to the so-called *new experimentalism*, with its strong focus, not just on experimental practice in science, but also on the role of statistics in such practice. Her discussion of the place of frequentist statistics in the discovery of the Higgs boson in particle physics is an instructive case in point.

Taken together, these just-mentioned points of difference between the Bayesian and error-statistical philosophies constitute a major challenge to Bayesian thinking that methodologists, statisticians, and researchers in psychology need to confront.

*4.3. Bayesian statistics with error-statistical foundations*

One important modern variant of Bayesian thinking, which now receives attention within the error-statistical framework, is the *falsificationist Bayesianism* of Andrew Gelman, which received its major formulation in Gelman and Shalizi (2013). Interestingly, Gelman regards his Bayesian philosophy as essentially error-statistical in nature – an intriguing claim, given the anti-Bayesian preferences of both Mayo and Gelman’s co-author, Cosma Shalizi. Gelman’s philosophy of Bayesian statistics is also significantly influenced by Popper’s view that scientific propositions are to be submitted to repeated criticism in the form of strong empirical tests. For Gelman, best Bayesian statistical practice involves formulating models using Bayesian statistical methods, and then checking them through hypothetico-deductive attempts to falsify and modify those models.

Both the error-statistical and neo-Popperian Bayesian philosophies of statistics extend and modify Popper’s conception of the hypothetico-deductive method, while at the same time offering alternatives to received views of statistical inference. The error-statistical philosophy injects into the hypothetico-deductive method an account of statistical induction that employs a panoply of frequentist statistical methods to detect and control for errors. For its part, Gelman’s Bayesian alternative involves formulating models using Bayesian statistical methods, and then checking them through attempts to falsify and modify those models. This clearly differs from the received philosophy of Bayesian statistical modeling, which is regarded as a formal inductive process.

From the wide-ranging error-statistical evaluation of the major varieties of Bayesian statistical thought on offer, Mayo concludes that Bayesian statistics needs new foundations: In short, those provided by her error-statistical perspective. Gelman acknowledges that his falsificationist Bayesian philosophy is underdeveloped, so it will be interesting to learn how its further development relates to Mayo’s error-statistical perspective. It will also be interesting to see if Bayesian thinkers in psychology engage with Gelman’s brand of Bayesian thinking. Despite the appearance of his work in a prominent psychology journal, they have yet to do so. However, Borsboom and Haig (2013) and Haig (2018) provide sympathetic critical evaluations of Gelman’s philosophy of statistics.

It is notable that in her treatment of Gelman’s philosophy, Mayo emphasizes that she is willing to allow a decoupling of statistical outlooks and their traditional philosophical foundations in favour of different foundations, which are judged more appropriate. It is an important achievement of Mayo’s work that she has been able to consider the current statistics wars without taking a particular side in the debates. She achieves this by examining methods, both Bayesian and frequentist, in terms of whether they violate her minimal severity requirement of “bad evidence, no test”.

I invite your comments and questions.

*This picture was taken by Diana Gillooly, Senior Editor for Mathematical Sciences, Cambridge University Press, at the book display for the Sept. 2018 meeting of the Royal Statistical Society in Cardiff. She also had the honor of doing the ripping. A blogpost on the session I was in is here.

This is the title of Brian Haig’s recent paper in *Methods in Psychology *2 (Nov. 2020). Haig is a professor emeritus of psychology at the University of Canterbury. Here he provides both a thorough and insightful review of my book *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (CUP, 2018) as well as an excellent overview of the high points of today’s statistics wars and the replication crisis, especially from the perspective of psychology. I’ll excerpt from his article in a couple of posts. The full article, which is open access, is here.

**Abstract:** In this article, I critically evaluate two major contemporary proposals for reforming statistical thinking in psychology: The recommendation that psychology should employ the “new statistics” in its research practice, and the alternative proposal that it should embrace Bayesian statistics. I do this from the vantage point of the modern error-statistical perspective, which emphasizes the importance of the severe testing of knowledge claims. I also show how this error-statistical perspective improves our understanding of the nature of science by adopting a workable process of falsification and by structuring inquiry in terms of a hierarchy of models. Before concluding, I briefly discuss the importance of the philosophy of statistics for improving our understanding of statistical thinking.

*Keywords:* The error-statistical perspective, The new statistics, Bayesian statistics, Falsificationism, Hierarchy of models, Philosophy of statistics

**1. Introduction**

Psychology has been prominent among a number of disciplines that have proposed statistical reforms for improving our understanding and use of statistics in research. However, despite being at the forefront of these reforms, psychology has ignored the philosophy of statistics to its detriment. In this article, I consider, in a broad-brush way, two major proposals that feature prominently in psychology’s current methodological reform literature: The recommendation that psychology should employ the so-called “new statistics” in its research practice, and the alternative proposal that psychology should embrace Bayesian statistics. I evaluate each from the vantage point of the error-statistical philosophy, which, I believe, is the most coherent perspective on statistics available to us. Before concluding, I discuss two interesting features of the conception of science adopted by the error-statistical perspective, along with brief remarks about the value of the philosophy of statistics for deepening our understanding of statistics.

**2. The error-statistical perspective**

The error-statistical perspective employed in this article is that of Deborah Mayo, sometimes in collaboration with Aris Spanos (Mayo, 1996, 2018; Mayo & Spanos, 2010, 2011). This perspective is landmarked by two major works. The first is Mayo’s ground-breaking book, *Error and the Growth of Experimental Knowledge* (1996), which presented the first extensive formulation of her error-statistical perspective on statistical inference. This philosophy provides a systematic understanding of experimental reasoning in science that uses frequentist statistics in order to manage error. Hence, its name. The novelty of the book lay in the fact that it employed ideas in statistical science to shed light on philosophical problems to do with the nature of evidence and inference.

The second book is Mayo’s recently published *Statistical inference as severe testing* (2018). In contrast with the first book, this work focuses on problems arising from statistical practice, but endeavors to solve them by probing their foundations from the related vantage points of the philosophy of science and the philosophy of statistics. By dealing with the vexed problems of current statistical practice, this book is a valuable repository of ideas, insights, and solutions designed to help a broad readership deal with the current crisis in statistics. Because my focus is on statistical reforms in psychology, I draw mainly from the resources contained in the second book.

Fundamental disputes about the nature and foundations of statistical inference are long-standing and ongoing. Most prominent have been the numerous debates between, and within, frequentist and Bayesian camps. Cutting across these debates have been more recent attempts to unify and reconcile rival outlooks, which have complexified the statistical landscape. Today, these endeavors fuel the ongoing concern that psychology and many sciences have with replication failures, questionable research practices, and the strong demand for an improvement of research integrity. Mayo refers to debates about these concerns as the “statistics wars”. With the addition of *Statistical inference as severe testing* to the error-statistical corpus, it is fair to say that the error-statistical outlook now has the resources to enable statisticians and scientists to understand and advance beyond the bounds of these statistics wars.

The strengths of the error-statistical approach are considerable (Haig, 2017; Spanos, 2019a, 2019b), and I believe that they combine to give us the most coherent philosophy of statistics currently available. For the purpose of this article, it suffices to say that the error-statistical approach contains the methodological and conceptual resources that enable one to diagnose and overcome the common misunderstandings of widely used frequentist statistical methods such as tests of significance. It also provides a trenchant critique of Bayesian ways of thinking in statistics. I will draw from these two strands of the error-statistical perspective to inform my critical evaluation of the new statistics and the Bayesian alternative.

Because the error-statistical and Bayesian outlooks are so different, some might consider it unfair to use the former to critique the latter. My response to this worry is three-fold: First, perspective-taking is an unavoidable feature of the human condition; we cannot rise above our human conceptual frameworks and adopt a position from nowhere. Second, in thinking things through, we often find it useful to proceed by contrast, rather than direct analysis. Indeed, the error-statistical outlook on statistics was originally developed in part by using the Bayesian outlook as a foil. And third, strong debates between Bayesians and frequentists have a long history, and they have helped shape the character of these two alternative outlooks on statistics. By participating in these debates, the error-statistical perspective is itself unavoidably controversial.

**3. The new statistics**

For decades, numerous calls have been made for replacing tests of statistical significance with alternative statistical methods. The new statistics, which urges the abandonment of null hypothesis significance testing (NHST), and the adoption of effect sizes, confidence intervals, and meta-analysis as a replacement package, is one such reform movement (Calin-Jageman and Cumming, 2019; Cumming, 2012, 2014). It has been heavily promoted in psychological circles and touted as a much-needed successor to NHST, which is deemed to be broken-backed. *Psychological Science*, which is the flagship journal of the Association for Psychological Science, endorsed the use of the new statistics, wherever appropriate (Eich, 2014). In fact, the new statistics might be considered the Association’s current quasi-official position on statistical inference. Although the error-statistical outlook does not directly address the new statistics movement, its suggestions for overcoming the statistics wars contain insights about statistics that can be employed to mount a powerful challenge to the integrity of that movement.

*3.1. Null hypothesis significance testing*

The new statisticians contend that NHST has major flaws and recommend replacing it with their favored statistical methods. Prominent among the flaws are the familiar claims that NHST encourages dichotomous thinking, and that it comprises an indefensible amalgam of the Fisherian and Neyman-Pearson schools of thought. However, neither of these features applies to the error-statistical understanding of NHST. The claim that we should abandon NHST because it leads to dichotomous thinking is unconvincing because it is leveled at the misuse of a statistical test that arises from its mechanical application and a poor understanding of its foundations. By contrast, the error-statistical perspective advocates the flexible use of levels of significance tailored to the case at hand, as well as reporting of exact *p* values – a position that Fisher himself came to hold.

Further, the error-statistical perspective makes clear that NHST, as commonly understood, is not an amalgam of Fisher’s and Neyman and Pearson’s thinking on the matter, especially their mature thought. Moreover, the error-statistical outlook can accommodate both evidential and behavioural interpretations of NHST, serving *probative* and *performance* goals respectively, to use Mayo’s suggestive terms. The error-statistical perspective urges us to move beyond the claim that NHST is an inchoate hybrid. Based on a close reading of the historical record, Mayo argues that Fisher and Neyman and Pearson should be interpreted as compatibilists, and that focusing on the vitriolic exchanges between Fisher and Neyman prevents one from seeing how their views dovetail. Importantly, Mayo formulates the error-statistical perspective on NHST by assembling insights from these founding fathers, and additional sources, into a coherent hybrid. There is much to be said for replacing psychology’s fixation on the muddle that is NHST with the error-statistical perspective on significance testing.

Thus, the recommendation of the new statisticians to abandon NHST, understood as the inchoate hybrid commonly employed in psychology, commits the fallacy of the false dichotomy because there exist alternative defensible accounts of NHST (Haig, 2017). The error-statistical perspective is one such attractive alternative.

*3.2. Confidence intervals*

For the new statisticians, confidence intervals replace *p*-valued null hypothesis significance testing. Confidence intervals are said to be more informative, and more easily understood, than *p* values, as well as serving the important scientific goal of estimation, which is preferred to hypothesis testing. Both of these claims are open to challenge. Whether confidence intervals are more informative than statistical hypothesis tests in a way that matters will depend on the research goals being pursued. For example, *p* values might properly be used to get a useful initial gauge of whether an experimental effect occurs in a particular study, before one runs further studies and reports *p* values, supplementary confidence intervals, and effect sizes. The claim that confidence intervals are more easily understood than *p* values is surprising, and is not borne out by the empirical evidence (e.g., Hoekstra et al., 2014). I will speak to the claim about the greater importance of estimation in the next section.

There is a double irony in the fact that the new statisticians criticize NHST for encouraging simplistic dichotomous thinking. For one thing, as already noted, such thinking is straightforwardly avoided by employing tests of statistical significance properly, whether or not one adopts the error-statistical perspective. For another, the adoption of standard frequentist confidence intervals in place of NHST forces the new statisticians into dichotomous thinking of another kind: deciding whether a parameter estimate falls inside, or outside, its confidence interval.

Error-statisticians have good reason for claiming that their reinterpretation of frequentist confidence intervals is superior to the standard view. The account of confidence intervals adopted by the new statisticians prespecifies a single confidence level (with a strong preference for 0.95 in their case). The single interval estimate corresponding to this level provides the basis for the inference that is drawn about the parameter values, depending on whether they fall inside or outside the interval. A limitation of this way of thinking is that each of the values of a parameter in the interval is taken to have the same evidential, or probative, force – an unsatisfactory state of affairs that results from weak testing. For example, there is no way of answering the relevant questions, ‘Are the values in the middle of the interval closer to the true value?’, or ‘Are they more probable than others in the interval?’

The error-statistician, by contrast, draws inferences about each of the obtained values, according to whether they are warranted, or not, at different severity levels, thus leading to a series of confidence intervals. Mayo (2018) captures the counterfactual logic of severity thinking involved with the following general example: “Were *μ* less than the 0.995 lower limit, then it is very probable (>0.995) that our procedure would yield a smaller sample mean than 0.6. This probability gives the severity.” (p. 195) Clearly, this is a more nuanced and informative assessment of parameter estimates than that offered by the standard view. Details on the error-statistical conception of confidence intervals can be found in Mayo (2018, pp. 189–201), as well as Mayo and Spanos (2011) and Spanos (2014, 2019a, b).
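Mayo’s counterfactual logic can be made concrete with a short calculation. The numbers below (σ = 2, n = 100, observed mean 0.6) echo the scale of the example she discusses but are otherwise illustrative assumptions; the function computes the severity with which the claim ‘μ > μ0’ passes, namely the probability that the procedure would have yielded a smaller sample mean than the one observed, were μ equal to μ0.

```python
from math import erf, sqrt

def norm_cdf(z: float) -> float:
    """Standard normal CDF via the error function (no external libraries)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def severity(x_bar: float, mu0: float, se: float) -> float:
    """Severity for the claim mu > mu0 given observed mean x_bar:
    the probability of a smaller sample mean than x_bar, computed under mu = mu0."""
    return norm_cdf((x_bar - mu0) / se)

# Illustrative (assumed) numbers: sigma = 2, n = 100, observed mean 0.6
se = 2 / sqrt(100)            # standard error = 0.2
x_bar = 0.6
mu0 = x_bar - 2.576 * se      # the 0.995 lower confidence limit

print(f"lower 0.995 limit: {mu0:.3f}")
print(f"severity of 'mu > {mu0:.3f}': {severity(x_bar, mu0, se):.3f}")
```

At the 0.995 lower limit the severity is 0.995 by construction, and for any μ below that limit it exceeds 0.995 – which is exactly the counterfactual claim in the quoted passage. Varying `mu0` traces out the series of intervals the error-statistician reports.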

Methodologists and researchers in psychology are now taking confidence intervals seriously. However, in the interests of adopting a sound frequentist conception of such intervals, they would be well advised to replace the new statistics conception of them with the superior error-statistical understanding.

*3.3. Estimation and hypothesis tests*

The new statisticians claim, controversially, that parameter estimation, rather than statistical hypothesis testing, leads to better science – presumably in part because of the deleterious effects of NHST. However, a strong preference for estimation leads Cumming (2012) to aver that the typical questions addressed in science are *what* questions (e.g., “What is the age of the earth?”, “What is the most likely sea-level rise by 2100?”). I think that this is a restricted, rather “flattened”, view of science where, by implication, explanatory *why* questions and *how* questions (which often ask for information about causal mechanisms) are considered atypical.

Why and how questions are just as important for science as what questions. They are often the sort of questions that science seeks to answer when constructing and evaluating explanatory hypotheses and theories. Interestingly, and at variance with this view, Cumming (Fidler and Cumming, 2014) acknowledges that estimation can be usefully combined with hypothesis testing in science, and that estimation can play a valuable role in theory construction. This is as it should be because science frequently incorporates parameter estimates in precise predictions that are used to assess the hypotheses and theories from which they are derived.

Although it predominantly uses the language of testing, the error-statistical perspective maintains that statistical inference can be employed to deal with both estimation and hypothesis testing problems. It also endorses the view that providing explanations of things is an important part of science and, in fact, advocates piecemeal testing of local hypotheses nested within large-scale explanatory theories.

Despite the generally favorable reception of the new statistics in psychology, it has been subject to criticism by both frequentists (e.g., Sakaluk, 2016), and Bayesians (e.g., Kruschke and Liddell, 2018). However, these criticisms have not occasioned a public response from the principal advocates of the new statistics movement. The error-statistical outlook presents a golden opportunity for those who advocate, or endorse, the new statistics to defend their position in the face of challenging criticism. A sound justification for the promotion and adoption of new statistics practices in psychology requires as much.

To be continued…. Please share comments and questions.

Excerpts and Mementos from SIST on this blog are compiled here.

**Stephen Senn**

Consultant Statistician

Edinburgh

The intellectual illness of clinical drug evaluation that I have discussed here can be cured, and it will be cured when we restore intellectual primacy to the questions we ask, not the methods by which we answer them. Lewis Sheiner^{1}

In their recent essay *Causal Evidence and Dispositions in Medicine and Public Health*^{2}, Elena Rocca and Rani Lill Anjum challenge ‘the epistemic primacy of randomised controlled trials (RCTs) for establishing causality in medicine and public health’. That an otherwise stimulating essay by two philosophers, experts on causality, which makes many excellent points on the nature of evidence, repeats a common misunderstanding about randomised clinical trials, is grounds enough for me to address this topic again. Before, however, explaining why I disagree with Rocca and Anjum on RCTs, I want to make clear that I agree with much of what they say. I loathe these pyramids of evidence, beloved by some members of the evidence-based movement, which have RCTs at the apex or possibly occupying a second place just underneath meta-analyses of RCTs. In fact, although I am a great fan of RCTs and (usually) of *intention-to-treat* analysis, I am convinced that RCTs alone are not enough. My thinking on this was profoundly affected by Lewis Sheiner’s essay of nearly thirty years ago (from which the quote at the beginning of this blog is taken). Lewis was interested in many aspects of investigating the effects of drugs and would, I am sure, have approved of Rocca and Anjum’s insistence that there are many layers of understanding how and why things work, and that means of investigating them may have to range from basic laboratory experiments to patient narratives via RCTs. Rocca and Anjum’s essay provides a good discussion of the various ‘causal tasks’ that need to be addressed and backs this up with some excellent examples.

In discussing RCTs Rocca and Anjum write

‘…any difference in outcome between the test group and the control group should be caused by the tested interventions, since all other differences should be homogenously distributed between the two groups,’

and later,

‘The experimental design is intended to minimise complexity—for instance, through strict inclusion and exclusion criteria’.

However, it is not the case that randomisation guarantees that any difference between the groups must be caused by the intervention. On the contrary, many things apart from the treatment will affect the observed difference. Nor is it the case that the analysis of RCTs requires the minimisation of complexity. Randomisation and its associated analysis deal with complexity in the experimental material, and although the treatment structure in RCTs is often simple, this is not always so (I give an example below); nor was it so in the field (literally) of agriculture for which Fisher developed his theory of randomisation. This is what Fisher himself had to say about complexity:

No aphorism is more frequently repeated in connection with field trials, than that we must ask Nature few questions, or ideally one question, at a time. The writer is convinced that this view is wholly mistaken. Nature, he suggests, will best respond to a logical and carefully thought out questionnaire; indeed, if we ask her a single question, she will often refuse to answer until some other topic has been discussed.

^{3} (p. 511)

This 1926 paper of Fisher’s is an important and early statement of his views on randomisation and was cited recently by Simon Raper in his article in *Significance*^{4}. Raper points out that Fisher was abandoning as unworkable an earlier view of causality, due to John Stuart Mill, whereby controlling for everything imaginable was the way to make valid causal judgements. I consider that Raper is right in thinking of Fisher’s approach as an alternative to Mill’s programme, rather than some realisation of it, so I disagree, for example, with Mumford and Anjum in their book^{5} when they state

‘Fisher’s idea is the basis of the randomized controlled trial (RCT), which builds on J.S. Mill’s earlier method of difference’ (pp. 111-112).

I shall now explain exactly what it is that Fisher’s approach does with the help of an example.

Before going into the example, which is a complex design, it is necessary to clear up one further potential point of confusion in Rocca and Anjum’s essay. N-of-1 studies are not alternatives to RCTs but a subset of them. RCTs include not just conventional parallel group trials but also cluster randomised trials and cross-over trials, including n-of-1 studies. The difference between these studies lies in the level at which one randomises, and this is reflected in my example, which has features of both a parallel group and a cross-over study. Thus, Rocca and Anjum’s paper, which I can recommend, will make more sense if their use of *RCT* is understood as ‘*randomised parallel group trial*’.

For the moment, all it is necessary to know is that within the same design, I can compare the effect on forced expiratory volume in one second (FEV_{1}), measured 12 hours after treatment, of two bronchodilators in asthma, which here I shall just label ISF24 and MTA6, in two different ways. First, I can use 71 patients who were given MTA6 and ISF24 on different occasions. Here I can compare the two treatments patient by patient. These data have the structure of a within-patient study. Second, within the same study there were 37 further patients who were given MTA6 but not ISF24 and 37 further patients who were given ISF24 but not MTA6. Here I can compare the two groups of patients with each other. These data have the structure of a between-patient or parallel group study.

I now proceed to analyse the data from the 71 pairs of values from the patients who were given both treatments, using a matched-pairs t-test. This will be referred to as the *within-patient study*. Note that this is an analysis of 2×71 = 142 values in total. I then proceed to compare the 37 patients given MTA6 *only* to the 37 given ISF24 *only* using a two-sample t-test. I shall refer to this as the *between-patient study*. Note that this is an analysis of 37+37 = 74 values in total. Finally, I combine the two using a meta-analysis.
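For readers who like to see the mechanics, here is a minimal simulation of the two analyses in Python. The data are invented, not the trial’s: the treatment difference, the between-patient standard deviation and the measurement noise are all assumed values, chosen only so that stable patient-to-patient differences dominate the within-patient noise, as they do in practice.

```python
import numpy as np

rng = np.random.default_rng(2020)
delta = 0.35                       # assumed true ISF24 - MTA6 difference (L)
sd_patient, sd_noise = 0.6, 0.15   # assumed variance components

# Within-patient study: 71 patients receive both treatments, so the
# stable patient effect u cancels in each paired difference.
u = rng.normal(0, sd_patient, 71)
mta6  = u + rng.normal(0, sd_noise, 71)
isf24 = u + delta + rng.normal(0, sd_noise, 71)
d = isf24 - mta6
se_within = d.std(ddof=1) / np.sqrt(71)          # matched-pairs analysis
t_within = d.mean() / se_within

# Between-patient study: 37 + 37 further patients, one treatment each,
# so patient effects do NOT cancel and inflate the standard error.
g_mta6  = rng.normal(0, sd_patient, 37) + rng.normal(0, sd_noise, 37)
g_isf24 = rng.normal(0, sd_patient, 37) + delta + rng.normal(0, sd_noise, 37)
se_between = np.sqrt(g_mta6.var(ddof=1) / 37 + g_isf24.var(ddof=1) / 37)
t_between = (g_isf24.mean() - g_mta6.mean()) / se_between

print(f"within : estimate {d.mean():.3f}, SE {se_within:.3f}")
print(f"between: estimate {g_isf24.mean() - g_mta6.mean():.3f}, SE {se_between:.3f}")
print(f"variance ratio ~ {(se_between / se_within) ** 2:.0f}")
```

Because pairing eliminates the patient effect, the within-patient standard error comes out far smaller than the between-patient one, for reasons taken up in the following paragraphs.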

The results are presented in the figure below which gives the point estimates for the difference between the two treatments and the 95% confidence intervals for both analyses and for a meta-analysis of both, which is labelled ‘combined’. (The horizontal dashed line is the point estimate for a full analysis of all the data and is described in the appendix.) Note how much wider the confidence intervals are for the between-patient study than the within-patient study. This is because the within-patient study is much more precise.

Why is the within-patient study so much more precise? Part of the story is that it is based on more data, in fact nearly twice as many data: 142 rather than 74. However, this is only part of the story. The ratio of variances is more than 30 to 1 and not just approximately 2 to 1, as the number of data might suggest. The main reason is that the within-patient study has balanced for a huge number of factors and the between-patient study has not. Thus, differences in 20,000 plus genes and all life-history until the beginning of the trial are balanced in the within-patient study, since each patient is his or her own control. For the between-patient study none of this is balanced by design. In fact, there are two crucial points regarding balance.

1. Randomisation does not produce balance

2. This does not affect the validity of the analysis

Why do I claim this does not matter? Suppose we accept the within-patient estimate as being nearly perfect because it balances for those huge numbers of factors. It seems that we can then claim that the between-patient estimate did a pretty bad job. The point estimate is 0.2L more than that from the within-patient design, a non-negligible difference. However, this is to misunderstand what the between-patient analysis claims. Its ‘claim’ is not the point estimate; its claim is the distribution associated with it, of which the 95% confidence interval is a sort of minimalist conventional summary and of which the point estimate is only one point. As I have explained elsewhere, such claims of uncertainty are a central feature of statistics. Thus, the true claim made by the between-patient study is not misleading. It is vague and, indeed, when we come to combine the results, the meta-analysis will give 30 times the weight to the within-patient estimate as to the between-patient estimate, simply because of the vagueness of the associated claim. This is why the result from the meta-analysis is so similar to that of the within-patient estimate. Furthermore, although this can never be guaranteed, since probabilities are involved, the 95% CI for the between-patient study includes the estimate given by the within-patient study. (Note that, in general, confidence intervals are a statement not about a value in a future study but about the ‘true’ average value^{6}; here, however, the within-patient study being very precise, the two can be taken to be similar.)
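The weighting just described is ordinary fixed-effect (inverse-variance) meta-analysis. A minimal sketch follows, with made-up numbers chosen so that the variance ratio is roughly 30 to 1 and the between-patient point estimate sits about 0.2 L above the within-patient one, as in the text.

```python
# Assumed illustrative estimates (litres) and standard errors -- not trial data
est_within,  se_within  = 0.35, 0.03
est_between, se_between = 0.55, 0.17

w_within  = 1.0 / se_within**2          # inverse-variance weights
w_between = 1.0 / se_between**2

combined = (w_within * est_within + w_between * est_between) / (w_within + w_between)
se_combined = (w_within + w_between) ** -0.5

print(f"weight ratio within:between ~ {w_within / w_between:.0f}:1")
print(f"combined estimate {combined:.3f} L (SE {se_combined:.3f})")
```

The vague between-patient claim gets almost no weight, so the combined estimate sits essentially on top of the within-patient one.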

This works because what Fisher’s analysis does is use variation at an appropriate level to estimate variation in the treatment estimate. So, for the between-patient study, it starts from the following observations:

1) There are numerous factors apart from treatment that could affect the outcome in one arm of the between-patient study compared to the other.

2) However, it is the joint effect of these that matters.

3) This joint effect of such factors will also vary within each of the two treatment groups.

4) Provided I use a method of allocation that is random, there will be no tendency for this variation within the groups to be larger or smaller than that between the groups.

5) Under this condition I have a way of estimating how reliable the treatment estimate is.

Thus, his programme is not about eliminating all sources of variation. He knows that this is impossible and accepts that estimates will be imperfect. Instead, he answers the question: ‘given that estimates are (inevitably) less than perfect, can we estimate how reliable they are?’. The answer he provides is ‘yes’ if we randomise.
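A small simulation illustrates observations 1) to 5): give 74 hypothetical patients a single number summarising the joint effect of all their prognostic factors, assume no treatment effect at all, and compare the spread of the group difference over many re-randomisations with the model-based standard error computed from the within-group variation of a single random allocation. All numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1926)  # a nod to Fisher's 1926 paper

# Joint effect of all prognostic factors for 74 hypothetical patients;
# no treatment effect, so any group difference is pure allocation noise.
outcomes = rng.normal(0, 0.62, 74)

# Actual spread of the difference in means over many random allocations
diffs = []
for _ in range(20000):
    perm = rng.permutation(74)
    diffs.append(outcomes[perm[:37]].mean() - outcomes[perm[37:]].mean())
true_sd = np.std(diffs)

# Model-based SE from ONE randomised allocation, using only
# the variation WITHIN the two groups
perm = rng.permutation(74)
g1, g2 = outcomes[perm[:37]], outcomes[perm[37:]]
model_se = np.sqrt(g1.var(ddof=1) / 37 + g2.var(ddof=1) / 37)

print(f"spread over re-randomisations: {true_sd:.3f}")
print(f"model-based SE (one trial)  : {model_se:.3f}")
```

The two numbers agree closely: provided the allocation was random, the variation within the groups really does estimate how far the group difference can stray, which is Fisher’s programme in miniature.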

If we now turn to the within-patient estimate, the same argument is repeated, but in a first step differences are calculated by patient. These differences do not reflect differences in genes etc., since each patient acts as his or her own control. (They could reflect a treatment-by-patient interaction, but that is another story I choose not to go into here^{7, 8}. See my blog on n-of-1 trials for a discussion.) The argument then uses the variance in the single group of differences to estimate how reliable their average will be.

Note that a different design requires a different analysis, in particular because otherwise the estimate of the variability of the treatment estimate will be inappropriate, even if the point estimate itself is not affected. This is illustrated in Figure 2, which shows what happens if you analyse the paired data from the 71 patients as if they were two independent sets of 71 each. Although the point estimate is unchanged, the confidence interval is now much wider than it was before. The value of having the patients as their own control is lost. The downstream effect of this is that the meta-analysis now weights the two estimates inappropriately.
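The effect of the wrong analysis is easy to reproduce on simulated paired data (the variance components are assumed, invented for illustration): compute the standard error both ways and note that the point estimate is identical while the claimed precision collapses.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 71
u = rng.normal(0, 0.6, n)                 # stable patient effects (assumed)
a = u + rng.normal(0, 0.15, n)            # treatment A
b = u + 0.35 + rng.normal(0, 0.15, n)     # treatment B, assumed effect 0.35 L

# Correct paired analysis: patient effects cancel in the differences
d = b - a
se_paired = d.std(ddof=1) / np.sqrt(n)

# Wrong analysis: the same 2 x 71 values treated as two independent groups
se_unpaired = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)

print(f"point estimate: {d.mean():.3f} (identical either way)")
print(f"paired SE     : {se_paired:.3f}")
print(f"'unpaired' SE : {se_unpaired:.3f} (interval much wider)")
```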

Note also, that it is not a feature of Fisher’s approach that claims made by larger or otherwise more precise trials are generally more reliable than smaller or otherwise less precise ones. The increase in precision is *consumed* by the calculation of the confidence interval^{9, 10}. More precise designs produce narrower intervals. Nothing is left to make the claim that is made more valid. It is simply more precise. The allowance for chance effects will be less, and appropriately so. Balance is a matter of precision not validity.

As I often put it, the shocking truth about RCTs is the opposite of what many believe. Far from requiring us to know that all possible causal factors affecting the outcome are balanced in order for the conventional analysis of RCTs to be valid, if we knew all such factors were balanced, the conventional analysis would be *invalid*. RCTs neither guarantee nor require balance. Imbalance is inevitable and Fisher’s analysis allows for this. The allowance that is made for imbalance is appropriate provided that we have randomised. Thus, randomisation is a device for enabling us to make precise estimates of an inevitable imprecision.

I thank George Davey Smith, Elena Rocca and Rani Lill Anjum for helpful comments on an earlier version.

1. Sheiner LB. The intellectual health of clinical drug evaluation [see comments]. *Clin Pharmacol Ther* 1991; **50**(1): 4-9.
2. Rocca E, Anjum RL. Causal Evidence and Dispositions in Medicine and Public Health. *International Journal of Environmental Research and Public Health* 2020; **17**.
3. Fisher RA. The arrangement of field experiments. *Journal of the Ministry of Agriculture of Great Britain* 1926; **33**: 503-13.
4. Raper S. Turning points: Fisher’s random idea. *Significance* 2019; **16**(1): 20-23.
5. Mumford S, Anjum RL. *Causation: A Very Short Introduction*. Oxford: OUP, 2013.
6. Senn SJ. A comment on replication, p-values and evidence, S.N. Goodman, *Statistics in Medicine* 1992; **11**: 875-879. *Statistics in Medicine* 2002; **21**(16): 2437-44.
7. Senn SJ. Mastering variation: variance components and personalised medicine. *Statistics in Medicine* 2016; **35**(7): 966-77.
8. Araujo A, Julious S, Senn S. Understanding Variation in Sets of N-of-1 Trials. *PLoS ONE* 2016; **11**(12): e0167167.
9. Senn SJ. Seven myths of randomisation in clinical trials. *Statistics in Medicine* 2013; **32**(9): 1439-50.
10. Cumberland WG, Royall RM. Does simple random sampling provide adequate balance? *J R Stat Soc Ser B* 1988; **50**(1): 118-24.
11. Senn SJ, Lillienthal J, Patalano F, et al. An incomplete blocks cross-over in asthma: a case study in collaboration. In: Vollmar J, Hothorn LA, eds. *Cross-over Clinical Trials*. Stuttgart: Fischer, 1997: 3-26.

This was a so-called *balanced incomplete blocks design*, necessitated because it was desired to study seven treatments (three doses of each of two formulations and a placebo)^{11}, but it was not considered practical to treat patients with more than five treatments. Thus, patients were allocated a different one of the seven treatments in each of the five periods. That is to say, each patient received a subset of five of the seven treatments. Twenty-one sequences of five treatments were used. Each sequence permits (5×4)/2 = 10 pairwise comparisons, but there are (7×6)/2 = 21 pairwise comparisons overall, and the sequences were chosen in such a way that any given one of the 21 pairwise comparisons would appear equally often over the design: each pair appears together in ten sequences. Looking at the members of such a given pair, one would find five further sequences in which the first appears but not the second, and *vice versa*. This leaves one sequence out of the 21 in which neither treatment appears. The sort of scheme involved is illustrated in Table 1 below.
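The counting in the paragraph above can be checked mechanically. The construction below is one standard way of obtaining such a design, in which each block of five omits exactly one of the 21 unordered pairs of treatments; it is not claimed to be the actual set of sequences used in the trial, and the order of periods is ignored.

```python
from itertools import combinations

treatments = frozenset(range(7))  # 7 treatments, labelled 0..6

# 21 blocks of size 5: each block is the complement of one unordered pair
blocks = [treatments - set(pair) for pair in combinations(treatments, 2)]

for x, y in combinations(treatments, 2):
    together   = sum({x, y} <= blk for blk in blocks)
    first_only = sum(x in blk and y not in blk for blk in blocks)
    neither    = sum(x not in blk and y not in blk for blk in blocks)
    # every pair: together in 10 blocks, each member alone in 5, absent from 1
    assert (together, first_only, neither) == (10, 5, 1)

print("7 treatments, 21 blocks of 5: pair together in 10, alone in 5 + 5, absent from 1")
```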

The active treatments were MTA6, MTA12, MTA24, ISF6, ISF12, ISF24, where the number refers to a dose in μg and the letters to two different formulations (MTA and ISF) of a dry powder of formoterol delivered by inhaler. The seventh treatment was a placebo.

In fact, the plan was to recruit six times as many patients as there were sequences, randomising a given patient to a sequence in a way that would guarantee approximately equal numbers per sequence. This would have given 126 patients in total. In the end, this target was exceeded and 161 patients were randomised to one of the sequences.

Obviously, this is a rather complex design, but I have used it because it enabled me to compare two treatments in two different ways. First, by using only the ten sequences in which they both appear; for this purpose, I could use each patient as his or her own control. Second, by using the ten further sequences in which only one of them appears.

This thus permitted me to analyse data from the same trial using a *within-patient analysis* and a *between-patient analysis*. The analyses used above should not be taken too seriously. The analysis would not generally proceed, and did not in fact proceed, in this way. For example, I ignored the complication of period effects and ignored the fact that by including all seven treatments in an analysis at once, I could recover more information. I simply chose two treatments to compare and ignored all other information in order to illustrate a point. The two treatments I compared, ‘ISF24’ and ‘MTA6’, were respectively the highest (24μg) dose of the then (1997) existing standard dry powder formulation, ISF, of the beta-agonist formoterol, and the lowest (6μg) dose of a newer formulation, MTA, that it was hoped to introduce. The experiment is discussed in full in Senn, Lilienthal, Patalano and Till^{11}.

The *full model* analysis that I showed as a dotted line in Figure 1 & Figure 2 fitted Patient as a random effect and Treatment and Period as fixed factors with 7 and 5 levels respectively.

Ed. A link to a selection of Senn’s posts and papers is here. Please share comments and thoughts.

As much as doctors and hospitals are raising alarms about a shortage of ventilators for Covid-19 patients, some doctors have begun to call for entirely reassessing the standard paradigm for their use, according to a cluster of articles that appeared in the last week. “What’s driving this reassessment is a baffling observation about Covid-19: Many patients have blood oxygen levels so low they should be dead. But they’re not gasping for air, their hearts aren’t racing, and their brains show no signs of blinking off from lack of oxygen.”[1] Within that group of patients, some doctors wonder if the standard use of mechanical ventilators does more harm than good.[2] The issue is controversial; I’ll just report what I find in the articles over the past week. Please share ongoing updates in the comments.

**I. Gattinoni: “COVID-19 pneumonia: different respiratory treatment for different phenotypes?”**

Luciano Gattinoni, one of the world’s experts in mechanical ventilation, “says more than half the patients he and his colleagues have treated in Northern Italy have had this unusual symptom. They seem to be able to breathe just fine, but their oxygen is very low”. …

He says these patients with more normal-looking lungs, but low blood oxygen, may also be especially vulnerable to ventilator-associated lung injury, where pressure from the air that’s being forced into the lungs damages the thin air sacs that exchange oxygen with the blood. [3]

Gattinoni labels these patients (more normal-looking lungs, but low blood oxygen) as Type L, and urges they be treated differently than the type of acute respiratory [ARDS] patients seen prior to Covid-19. This second type he calls Type H. (His editorial is in [4]). I found a picture of Type L and Type H lungs at this link on p. 12.

Patients with respiratory failure who can still breathe OK, but still have very low oxygen, may improve on oxygen alone, or on oxygen delivered through a lower pressure setting on a ventilator.[3]

Gattinoni thinks the trouble for these patients may not be swelling and stiffening of their lung tissue, which is what happens when an infection causes pneumonia. Instead, he thinks the problem may lie in the intricate web of blood vessels in the lungs.[a]

Gattinoni says putting a patient like this on a ventilator under too high a pressure may cause lung damage that ultimately looks like ARDS.[3]

In other words, the high pressure of the ventilator may turn a Type L patient into a more serious Type H patient. “If you start with the wrong protocol, at the end they become similar,” Gattinoni said.[2] Oy! He recommends the two types (which can be determined in a number of ways) be treated differently: Type L patients receive greater benefit from less invasive oxygen support, via breathing masks, such as those used for patients with sleep apnea, nasal cannulas, or via a non-invasive high flow device.

Gattinoni said one center in central Europe that had begun using different treatments for different types of COVID-19 patients had not seen any deaths among those patients in its intensive care unit. He said a nearby hospital that was treating all COVID-19 patients based on the same set of instructions had a 60% death rate in its ICU. [He did not give the names of the hospitals.]

“This is a kind of disease in which you don’t have to follow the protocol — you have to follow the physiology,” Gattinoni said. “Unfortunately, many, many doctors around the world cannot think outside the protocol.” [3]

**II. Kyle-Sidell: Covid vent protocols need a second look**

But there are some doctors who may want to think outside the protocol, yet face pressure against doing so–according to Cameron Kyle-Sidell, an emergency room and critical care doctor at Maimonides Medical Center in Brooklyn.

The article that captured my attention on April 6 was the surprising transcript of Kyle-Sidell being video interviewed by WebMD chief medical officer John Whyte [5]:

Whyte: You’ve been talking on social media; you say you’ve seen things that you’ve never seen before. What are some of those things that you’re seeing?

Kyle-Sidell: When I initially started treating patients, I was under the impression, as most people were, that I was going to be treating acute respiratory distress syndrome (ARDS)… And as I start to treat these patients, I witnessed things that are just unusual. …In the past, we haven’t seen patients who are talking in full sentences and not complaining of overt shortness of breath, with saturations [blood oxygen levels] in the high 70s [normal is said to be between 95 and 100].[b] This originally came to me when we had a patient who had hit what we call our trigger to put in a breathing tube, … Most of the time, when patients hit that level of hypoxia, they’re in distress and they can barely talk; they can’t say complete sentences. She could do all of those and she did not want a breathing tube. So she asked that we put it in at the last minute possible. It was this perplexing clinical condition: When was I supposed to put the breathing tube in?…

We ran into an impasse where I could not morally, in a patient-doctor relationship, continue the current protocols which, again, are the protocols of the top hospitals in the country. … So I had to step down from my position in the ICU, and now I’m back in the ER where we are setting up slightly different ventilation strategies. Fortunately, we’ve been boosted by recent work by Gattinoni.

Whyte: Do you feel that somewhere the world made a wrong turn in treating COVID-19?

Kyle-Sidell: I don’t know that they made a wrong turn. I mean, it came so fast. … It’s hard to switch tracks when the train is going a million miles an hour. …But I do think that it starts out with knowing, or at least accepting the idea, that this may be an entirely new disease. Because once you do that, then you can accept the idea that perhaps all the studies on ARDS in the 2000s and 2010s, which were large, randomized, well-performed, well-funded studies, perhaps none of those patients in those studies had COVID-19 or something resembling it. It allows you to move away from a paradigm in which this disease may fit and, unfortunately, walk somewhat into the unknown. …One of the reasons I speak up, and I hope people at the bedside speak up, is that I think there may be a disconnect between those who are seeing these patients directly, who are sensing that something is not quite right, and those brilliant people and researchers and administrators who are writing the protocols and working on finding answers. The first thing to do is see if we can admit that this is something new. I think it all starts from there.

Gattinoni’s paper and Kyle-Sidell’s on-line discussions are having an impact in the popular press. Yesterday, the *Telegraph* reported that “British and American intensive care doctors at the front line of the coronavirus crisis are starting to question the aggressive use of ventilators for the treatment of patients”.[6]

In many cases, they say the machines – which are highly invasive and require the patient to be rendered unconscious – are being used too early and may cause more harm than good. Instead they are finding that less invasive forms of oxygen treatment through face masks or nasal cannulas work better for patients, even those with very low blood oxygen readings….This is the sort of treatment Boris Johnson, the Prime Minister, is said to have received in an intensive care unit at St Thomas’ Hospital in London.

…Increasingly, doctors in the UK, America and Europe are using these less invasive measures and holding back on the use of mechanical ventilation for as long as possible…Invasive ventilation is never a good option for any patient if it can be avoided. It can result in muscle wastage around the lungs and makes secondary infections more likely. It also requires a cocktail of drugs which themselves can prove toxic and lead to organ failure.[6]

“Instead of asking how do we ration a scarce resource, we should be asking how do we best treat this disease?” says physician Muriel Gillick of Harvard Medical School.[1]

**III. Does Non-invasive Ventilation Risk Health Care Workers?**

Yet there’s an important reason the standard protocol is to bypass non-invasive ventilation in Covid-19 patients (in the U.S.), and I don’t know if Gattinoni or Kyle-Sidell address it: these devices are thought to pose risks to health care providers, at least without adequate protective equipment.[c]:

One problem, though, is that CPAP [continuous positive airway pressure] and other positive-pressure machines pose a risk to health care workers…The devices push aerosolized virus particles into the air, where anyone entering the patient’s room can inhale them [spillage]. The intubation required for mechanical ventilators can also aerosolize virus particles, but the machine is a contained system after that.[1]

“If we had unlimited supply of protective equipment and if we had a better understanding of what this virus actually does in terms of aerosolizing, and if we had more negative pressure rooms, then we would be able to use more” of the noninvasive breathing support devices, said [Lakshman] Swamy [an ICU physician and pulmonologist of Boston Medical Center].[1]

But surely it would be easier to procure adequate protective equipment than to obtain more ventilators, especially if it’s a way to beat the grim statistics for a significant group of Covid-19 sufferers. Italy has special plastic helmets that cordon off the patient’s head from the shoulders up, redolent of Victorian diving helmets. A virus filter prevents the aerosolization risk that lies behind the common protocol. The Italian helmet, however, hasn’t been approved by the FDA, and anyway, Italy has banned its export given its own COVID-19 crisis. Fortunately, at least one group in the U.S. is building its own coronavirus helmets.

Please share your thoughts, updates, and any errors you spot.

**NOTES:**

[a] The following are quotes from reference [3]:

Normally, when lungs become damaged, the vessels that carry blood through the lungs so it can be re-oxygenated constrict, or close down, so blood can be shunted away from the area that’s damaged to an area that’s still working properly. This protects the body from a drop in oxygen. Gattinoni thinks some COVID-19 patients can’t do this anymore. So blood is still flowing to damaged parts of the lungs. People still feel like they’re taking good breaths, but their blood oxygen is dropping all the same.[3]

One doctor treating COVID-19 patients in New York [Cameron Kyle-Sidell] says it was like altitude sickness. It was “as if tens of thousands of my fellow New Yorkers are stuck on a plane at 30,000 feet and the cabin pressure is slowly being let out. These patients are slowly being starved of oxygen”. [3]

Lung scans show the same “ground glass” appearance in both COVID-19 and high-altitude pulmonary edema (HAPE).

[b] An oximeter I recently bought, of not very good quality, has me at 97.

[c] Except perhaps when mechanical ventilators are in too short supply. (I am not up on the current regulations.) Of course, another reason is the danger of delaying an intubation that might prove necessary.

**REFERENCES:**

[1] “With ventilators running out, doctors say the machines are overused for Covid-19”, STAT, April 8, 2020.

[2] “Is Protocol-Driven COVID-19 Ventilation Doing More Harm Than Good?”, Medscape, April 6, 2020.

[3] “Doctors puzzle over covid-19 lung problems”, WebMD Health News, April 7, 2020.

[4] Gattinoni’s editorial: “COVID-19 pneumonia: different respiratory treatment for different phenotypes?”, L. Gattinoni et al. (2020).

[5] “Do COVID-19 Vent Protocols Need a Second Look?”, WebMD Interview, John Whyte, MD, MPH; Cameron Kyle-Sidell, MD, April 6, 2020.

[6] “Intensive care doctors question ‘overly aggressive’ use of ventilators in coronavirus crisis”, Telegraph, April 9, 2020.

**Aris Spanos**

Beyond the plenitude of misery and suffering that pandemics bring down on humanity, occasionally they contribute to the betterment of humankind by (inadvertently) boosting creative activity that leads to knowledge, and not just in epidemiology. A case in point is that of Isaac Newton and the pandemic of 1665-6.

Born in 1642 (on Christmas day – old Julian calendar) in the small village of Woolsthorpe Manor, southeast of Nottingham, England, Isaac Newton had a very difficult childhood. He lost his father, also named Isaac, a farmer, three months before he was born; his mother, Hannah, married again when he was 3 years old and moved away with her second husband to start a new family; he was brought up by his maternal grandmother until the age of 10, when his mother returned, after her second husband died, with three young kids in tow.

At age 12, Isaac was enrolled in the King’s School in Grantham [where Margaret Thatcher was born], 8 miles from home, where he boarded at the home of the local pharmacist. During his first two years at King’s School he was an average student, but after a skirmish with a schoolyard bully he took his revenge by distinguishing himself academically, or so the story goes! After that episode, Isaac began to exhibit an exceptional aptitude for constructing mechanical contraptions, such as windmills, dials, water-clocks, and kites. His mother, however, had other ideas and took young Isaac out of school at age 16 to tend the farm she had inherited from her second husband. Isaac was terrible at farming, and after a year the headmaster of King’s School, Mr. Stokes, persuaded Hannah to allow a promising pupil to return to school, taking Isaac to board in his own home. It was clear to both that young Isaac was not cut out to herd sheep and shovel dung. After completing the coursework in Latin, Greek and some mathematics, Newton was accepted at Trinity College, University of Cambridge, in 1661, at an age close to 19, somewhat older than the other students owing to his detour into farming. For his first three years he paid his way by working in the College’s kitchen, dining hall and housekeeping, but by 1664 he showed enough promise to be awarded a scholarship guaranteeing him four more years to complete his MA degree. As an undergraduate, Isaac spent most of his time in solitary intellectual pursuits which, beyond the prescribed Aristotelian texts, included reading widely in subjects that attracted his curiosity: history, philosophy – René Descartes in particular – and astronomy, including the works of Galileo and of Thomas Street, through whom he learned of Kepler’s work. Many scholars attribute Newton’s passion for mathematics to Descartes’s *Geometry*.

He completed his BA degree in 1665 without displaying any sign that he would become the most celebrated scientist of all time. That was to be changed by a pandemic!

The bubonic plague of 1665-6 ravaged London, killing more than 100,000 residents (25% of its population), and rapidly spread throughout the country. Like most universities, Cambridge closed its doors, and the majority of its students returned to their family residences in the countryside to isolate themselves and avoid the plague. Isaac, an undistinguished BA student from Cambridge University, returned to Woolsthorpe, where he began a most creative period of assimilating what he had learned during his studies and devoting ample time to subjects of great interest to him, including mathematics, philosophy, and physics, to which he could not devote sufficient time during his coursework at Cambridge. These two years of isolation turned out to be the most creative of his life. Newton’s major contributions to science and mathematics, including his work in optics, the laws of motion and universal gravitation, as well as the creation of infinitesimal calculus, can be traced back to these two years of incredible ingenuity and originality, and their importance for science can only be compared with Einstein’s 1905 *Annus Mirabilis*.

Newton returned to Cambridge in the autumn of 1667 with notebooks filled with ideas as well as solved and unsolved problems. Soon after, he was elected a Minor Fellow of Trinity College. Newton completed his MA in 1668, during which time he began interacting with Isaac Barrow, the Lucasian Professor of Mathematics, an accomplished mathematician in his own right with important contributions in geometry and optics, whom he had failed to impress as an undergraduate. He handed Barrow a set of notes on the generalized binomial theorem and various applications of his newly minted fluxions (modern differential calculus) developed during the two years in Woolsthorpe. After a short period of discoursing with Newton, Barrow realized the importance of his young student’s work. When Barrow resigned the Lucasian chair in 1669, he recommended Newton, then 26, to succeed him. Newton’s ideas during the next 30 years as Lucasian Professor of Mathematics changed the way we understand the physical world we live in.

One wonders how the history of science would have unfolded if it were not for the bubonic plague of 1665-6 forcing Newton into two years of isolation to study, contemplate and create!

**Aris Spanos **(March 2020)

Ed. (Mayo) note: Aris shared the case of Newton working through the bubonic plague with me two weeks ago, after hearing how unproductive I was. I asked him to write a blogpost on it, and I’m very grateful that he did!
