Monthly Archives: July 2020

JSM 2020 Panel on P-values & “Statistical Significance”

All: On July 30 (10am EST) I will give a virtual version of my JSM presentation, remotely like the one I will actually give on Aug 6 at the JSM. Co-panelist Stan Young may as well. One of our surprise guests tomorrow (not at the JSM) will be Yoav Benjamini!  If you’re interested in attending our July 30 practice session* please follow the directions here. Background items for this session are in the “readings” and “memos” of session 5.

*unless you’re already on our LSE Phil500 list

JSM 2020 Panel Flyer (PDF)
JSM online program w/panel abstract & information:

Categories: Announcement, JSM 2020, significance tests, stat wars and their casualties

Stephen Senn: Losing Control (guest post)


Stephen Senn
Consultant Statistician

Losing Control

Match points

The idea of local control is fundamental to the design and analysis of experiments and contributes greatly to a design’s efficiency. In clinical trials such control is often accompanied by randomisation and the way that the randomisation is carried out has a close relationship to how the analysis should proceed. For example, if a parallel group trial is carried out in different centres, but randomisation is ‘blocked’ by centre then, logically, centre should be in the model (Senn, S. J. & Lewis, R. J., 2019). On the other hand, if all the patients in a given centre are allocated the same treatment at random, as in a so-called cluster randomised trial, then the fundamental unit of inference becomes the centre and patients are regarded as repeated measures on it. In other words, the way in which the allocation has been carried out affects the degree of matching that has been achieved and this, in turn, is related to the analysis that should be employed. A previous blog of mine, To Infinity and Beyond, discusses the point.

Balancing acts

In all of this, balance, or rather the degree of it, plays a fundamental part, if not the one that many commentators assume. Balance of prognostic factors is often taken as being necessary to avoid bias. In fact, it is not necessary. For example, suppose we wished to eliminate the effect of differences between centres in a clinical trial but had not, in fact, blocked by centre. We would then, just by chance, have some centres in which numbers of patients on treatment and control differed. The simple difference of the two means for the trial as a whole would then have some influence from the centres, which might be regarded as biasing. However, these effects can be eliminated by the simple stratagem of analysing the data in two stages. In the first stage we compare the means under treatment and control within each centre. In the second stage we combine these differences across the centres, weighting them according to the amount of information provided. In fact, including centre as a factor in a linear model to analyse the effect of treatment achieves the same result as this two-stage approach.
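The equivalence between the two-stage approach and the linear model can be checked numerically. What follows is a minimal sketch, not the analysis of any real trial: the centre sizes, centre effects and treatment effect are invented for illustration, the function names are hypothetical, and the "amount of information" for a centre is taken to be 1/(1/n_t + 1/n_c), which is what it would be if the within-centre variance were the same everywhere.

```python
import numpy as np

def two_stage_estimate(y, treat, centre):
    """Stage 1: treatment-minus-control difference within each centre.
    Stage 2: average the differences, weighted by information."""
    diffs, weights = [], []
    for c in np.unique(centre):
        in_centre = centre == c
        y_t = y[in_centre & (treat == 1)]
        y_c = y[in_centre & (treat == 0)]
        if y_t.size == 0 or y_c.size == 0:
            continue  # a centre with only one arm carries no information
        diffs.append(y_t.mean() - y_c.mean())
        # Assumed weight: inverse of (1/n_t + 1/n_c), i.e. equal variances
        weights.append(1.0 / (1.0 / y_t.size + 1.0 / y_c.size))
    return float(np.average(diffs, weights=weights))

def linear_model_estimate(y, treat, centre):
    """Treatment coefficient from least squares with one dummy per centre."""
    centres = np.unique(centre)
    dummies = (centre[:, None] == centres[None, :]).astype(float)
    design = np.column_stack([treat.astype(float), dummies])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coef[0])

# Hypothetical unblocked trial: (treated, control) counts differ by centre.
arm_sizes = [(3, 7), (9, 5), (4, 4), (12, 8)]
centre = np.concatenate(
    [np.full(nt + nc, i) for i, (nt, nc) in enumerate(arm_sizes)]
)
treat = np.concatenate(
    [np.r_[np.ones(nt, dtype=int), np.zeros(nc, dtype=int)] for nt, nc in arm_sizes]
)

rng = np.random.default_rng(2020)
centre_effect = np.array([0.0, 2.0, -1.0, 3.0])  # invented centre differences
y = centre_effect[centre] + 0.5 * treat + rng.normal(0.0, 1.0, size=centre.size)

naive = float(y[treat == 1].mean() - y[treat == 0].mean())
two_stage = two_stage_estimate(y, treat, centre)
fitted = linear_model_estimate(y, treat, centre)
print(f"naive difference of means: {naive:.4f}")
print(f"two-stage estimate:        {two_stage:.4f}")
print(f"centre-adjusted model:     {fitted:.4f}")
```

The last two figures agree to rounding error, whereas the naive difference of means carries some of the centre effects with it.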

This raises the issue, ‘what is the value of balance?’. The answer is that, other things being equal, balanced allocations are more efficient in that they lead to lower variances. This follows from the fact that the variance of a contrast based on two means is

$$\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}, \tag{1}$$

where $\sigma_1^2, \sigma_2^2$ are the variances in the two groups being compared and $n_1, n_2$ the two sample sizes. In an experimental context, it is often reasonable to proceed as if $\sigma_1^2 = \sigma_2^2$ so that, writing $\sigma^2$ for each variance, we have an expression for the variance of the contrast of

$$\sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right). \tag{2}$$

Now consider the successive ratios 1, 1/2, 1/3, …, 1/n. Each term is smaller than the preceding term. However, the amount by which a term is smaller is less than the amount by which the preceding term was smaller than the term that preceded it. For example, 1/3 − 1/4 = 1/12 but 1/2 − 1/3 = 1/6. In general we have 1/n − 1/(n+1) = 1/(n(n+1)), which clearly reduces with increasing n. It thus follows that if an extra observation can be added to construct such a contrast, it will have the greater effect on reducing the variance of that contrast if it can be added to the group that has the fewest observations. This in turn implies, other things being equal, that balanced contrasts are more efficient.
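A few lines of arithmetic make the point concrete. This is only a sketch under the equal-variance assumption, with the variance of the contrast taken as σ²(1/n1 + 1/n2); the function names and the total of 200 patients are chosen purely for illustration.

```python
def contrast_variance(n1, n2, sigma2=1.0):
    """Variance of a difference of two means with common variance sigma2."""
    return sigma2 * (1.0 / n1 + 1.0 / n2)

def reduction_from_extra_observation(n1, n2, add_to_first, sigma2=1.0):
    """How much the variance falls if one observation is added to a group."""
    before = contrast_variance(n1, n2, sigma2)
    after = (contrast_variance(n1 + 1, n2, sigma2) if add_to_first
             else contrast_variance(n1, n2 + 1, sigma2))
    return before - after

# Same total of 200 patients, split in different ways.
for n1, n2 in [(100, 100), (120, 80), (150, 50), (180, 20)]:
    print(f"n1={n1:3d}, n2={n2:3d}: variance = {contrast_variance(n1, n2):.4f}")

# With groups of 150 and 50, compare adding one patient to each group in turn.
to_larger = reduction_from_extra_observation(150, 50, add_to_first=True)
to_smaller = reduction_from_extra_observation(150, 50, add_to_first=False)
print(f"reduction if added to the larger group:  {to_larger:.6f}")
print(f"reduction if added to the smaller group: {to_smaller:.6f}")
```

For a fixed total the even split gives the smallest variance, and the extra patient buys more when added to the smaller group.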

Exploiting the ex-external

However, it is often the case in a randomised clinical trial of a new treatment that a potential control treatment has been much studied in the past. Thus, many more observations, albeit of a historical nature, are available for the control treatment than the experimental one. This in turn suggests that if the argument that balanced datasets are better is used, we should now allocate more patients, and perhaps even all that are available, to the experimental arm. In fact, things are not so simple.

First, it should be noted that if blinding of patients and treating physicians to the treatment being given is considered important, this cannot be convincingly implemented unless randomisation is employed (Senn, S. J., 1994). I have discussed the way that this may have to proceed in a previous blog, Placebos: it’s not only the patients that are fooled. In what follows, however, I am going to assume that blinding is unimportant and consider other problems with using historical controls.

When historical controls are used there are two common strategies. The first is to regard the historical controls as providing an external standard which may be regarded as having negligible error and to use it, therefore, as an unquestionably valid reference. If significance tests are used, a one-sample test is applied to compare the experimental mean to the historical standard. The second is to treat historical controls as if they were concurrent controls and to carry out the statistical analysis that would be relevant were this the case. Both of these are inadequate. Once I have considered them, I shall turn to a third approach that might be acceptable.

A standard error

If an experimental group is compared to a historical standard, as if that standard were currently appropriate and established without error, an implicit analogy is being made to a parallel group trial with a control group arm of infinite size. This can be seen by looking at formula (2). Suppose that we let the first group be the control group and the second one the experimental group. As $n_1 \to \infty$, formula (2) approaches $\sigma^2/n_2$, which is, in fact, the formula we intend to use.

Figure 1 shows the variance that this approach uses as a horizontal red line and the variance that would apply to a parallel group trial as a blue line. The experimental group size has been set at 100 and the control group sample size allowed to vary from 100 to 2000. The within-group variance has been set to $\sigma^2 = 1$. It can be seen that this approach of the historical standard underestimates considerably the variance that will apply. In fact, even the formula given by the blue line will underestimate the variance, as we shall explain below.

Figure 1. The variance of the contrast for a two-group parallel clinical trial for which the number of patients on the experimental arm is 100 as a function of the number on the control group arm.

It thus follows that assessing the effect from a single arm given an experimental treatment by comparison to a value from historical controls, but using a formula for the standard error of $\sigma/\sqrt{n_2}$, where $\sigma$ is the within-treated group standard deviation and $n_2$ is the number of patients, will underestimate the uncertainty in this comparison.
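The quantities plotted in Figure 1 are easy to tabulate. The sketch below uses the settings given in the text (100 patients on the experimental arm, within-group variance 1, control arm from 100 to 2000) and assumes the parallel group variance is σ²(1/n1 + 1/n2); the particular control sizes printed are an arbitrary selection.

```python
sigma2 = 1.0
n_experimental = 100

# Variance implied by treating the historical value as an error-free standard.
standard_variance = sigma2 / n_experimental

rows = []
for n_control in (100, 200, 500, 1000, 2000):
    # Variance for a parallel group trial with this many concurrent controls.
    parallel_variance = sigma2 * (1.0 / n_control + 1.0 / n_experimental)
    rows.append((n_control, parallel_variance))
    print(f"controls={n_control:5d}  parallel={parallel_variance:.4f}  "
          f"historical standard={standard_variance:.4f}")
```

However large the control arm, the parallel group variance stays above the value used by the historical standard approach and only approaches it in the limit.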

Parallel lies

A common alternative is to treat the historical data as if they came concurrently from a parallel group trial. This overlooks many matters, not least of which is that in many cases the data will have come from completely different centres and, whether or not they came from different centres, they came from different studies. That being so, the nearest analogue of a randomised trial is not a parallel group trial but a cluster randomised trial with study as a unit of clustering. The general set up is illustrated in Figure 2. This shows a comparison of data taken from seven historical studies of a control treatment (C) and one new study of an experimental treatment (E).

Figure 2. A data set consisting of information on historical controls (C) in seven studies and information on an experimental treatment in a new study.

This means that there is a between-study variance that has to be added to the within-study variances.

Cluster muster

The consequence is that the control variance is not just a function of the number of patients but also of the number of studies. Suppose there are $k$ such studies. Then, even if each of these studies has a huge number of patients, the variance of the control mean cannot be less than $\gamma^2/k$, where $\gamma^2$ is the between-study variance. However, there is worse to come. The study of the new experimental treatment also has a between-study contribution but, since there is only one such study, its variance is $\gamma^2/1 = \gamma^2$. The result is that a lower bound for the variance of the contrast using historical data is

$$\gamma^2\left(1 + \frac{1}{k}\right).$$

It turns out that the variance of the treatment contrast decreases disappointingly according to the number of clusters you can muster. Of course, in practice, things are worse, since all of this is making the optimistic assumption that historical studies are exchangeable with the current one (Collignon, O. et al., 2019; Schmidli, H. et al., 2014).
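To see how slowly the bound falls, here is a small sketch. It assumes the simplest possible setting, which is mine rather than anything taken from the cited papers: exchangeable studies, a between-study variance written `gamma2`, a within-study variance `sigma2`, historical studies of equal size, and a control mean formed as the simple average of the study means. The function names are hypothetical.

```python
def lower_bound(k, gamma2=1.0):
    """Between-study part only: gamma2/k for controls plus gamma2 for the new study."""
    return gamma2 * (1.0 + 1.0 / k)

def contrast_variance(k, patients_per_study, n_experimental, sigma2, gamma2):
    """Between-study and within-study contributions for both arms."""
    control = gamma2 / k + sigma2 / (k * patients_per_study)
    experimental = gamma2 + sigma2 / n_experimental
    return control + experimental

for k in (1, 2, 7, 19, 100):
    print(f"k={k:3d}  lower bound = {lower_bound(k):.4f} x between-study variance")

# Even with very many patients, the contrast variance stays above the bound.
v = contrast_variance(k=7, patients_per_study=10_000, n_experimental=10_000,
                      sigma2=1.0, gamma2=0.05)
print(f"variance with 7 large studies: {v:.5f}  (bound {lower_bound(7, 0.05):.5f})")
```

Going from one historical study to infinitely many can at best halve the bound, because the single new study always contributes its own between-study term.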

Optimists may ask, however, whether this is not all a fuss about nothing. The theory indicates that this might be a problem but is there anything in practice to indicate it is? Unfortunately, yes. The TARGET study provides a good example of the sort of difficulties encountered in practice (Senn, S. J., 2008). This was a study comparing lumiracoxib, ibuprofen and naproxen in osteoarthritis. For practical reasons, centres were either enrolled in a sub-study comparing lumiracoxib to ibuprofen or one comparing lumiracoxib to naproxen. There were considerable differences between sub-studies in terms of baseline characteristics but not within sub-studies and there were even differences at outcome for lumiracoxib depending on which sub-study patients were enrolled in. This was not a problem for the way the trial was analysed (it was foreseen from the outset), but it provides a warning that differences between studies may be important.

Another example is provided by Collignon, O. et al. (2019). Looking at historical data on acute myeloid leukaemia (AML), they identified 19 studies of a proposed control treatment Azacitidine. However, the variation from study to study was such that the 1279 subjects treated in these studies would only provide, in the best of cases, as much information as 50 patients studied concurrently.

COVID Control

How have we done in the age of COVID? Not always very well. To give an example, a trial that received much coverage was one of hydroxychloroquine in the treatment of patients suffering from coronavirus infection (Gautret, P. et al., 2020). The trial was in 20 patients and “Untreated patients from another center and cases refusing the protocol were included as negative controls.” The senior author, Didier Raoult, later complained of the ‘invasion of methodologists’, blamed them and the pharmaceutical industry for a ‘moral dictatorship’ that physicians should resist, and compared modellers to astrologers (Nau, J.-Y., 2020).

However, the statistical analysis section of the paper has the following to say:

Statistical differences were evaluated by Pearson’s chi-square or Fisher’s exact tests as categorical variables, as appropriate. Means of quantitative data were compared using Student’s t-test.

Now, Karl Pearson, RA Fisher and Student were all methodologists. So, Gautret, P. et al. (2020) do not appear to be eschewing the work of methodologists, far from it. They are merely choosing to use this work inappropriately. But nature is a hard task-mistress and if outcome varies considerably amongst those infected with COVID-19, and we know it does, and if patients vary from centre to centre, and we know they do, then variation from centre to centre cannot be ignored and trials in which patients have not been randomised concurrently cannot be analysed as if they were. Fisher’s exact test, Pearson’s chi-square and Student’s t will underestimate the variation.
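A small simulation shows what goes wrong. This is a sketch under invented assumptions and is not a reanalysis of any real trial: treated patients all come from one centre and controls from another, there is no treatment effect whatsoever, centre effects have standard deviation 0.5 and patients within a centre have standard deviation 1. The cut-off 2.024 is the approximate two-sided 5% point of Student’s t with 38 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_per_arm = 5000, 20
gamma, sigma = 0.5, 1.0  # assumed between-centre and within-centre SDs

# One centre effect per arm per simulated study; the true treatment effect is zero.
centre_effects = rng.normal(0.0, gamma, size=(n_sims, 2))
treated = centre_effects[:, [0]] + rng.normal(0.0, sigma, size=(n_sims, n_per_arm))
control = centre_effects[:, [1]] + rng.normal(0.0, sigma, size=(n_sims, n_per_arm))

# Student's two-sample t statistic, computed as if the patients were
# concurrent and randomised within a single study.
diff = treated.mean(axis=1) - control.mean(axis=1)
pooled_var = (treated.var(axis=1, ddof=1) + control.var(axis=1, ddof=1)) / 2.0
naive_se = np.sqrt(pooled_var * 2.0 / n_per_arm)
t_stat = diff / naive_se

T_CRIT = 2.024  # approximate 5% two-sided critical value, 38 df
false_positive_rate = float(np.mean(np.abs(t_stat) > T_CRIT))

print(f"average naive standard error:   {naive_se.mean():.3f}")
print(f"actual SD of the difference:    {diff.std(ddof=1):.3f}")
print(f"rate of 'significant' results:  {false_positive_rate:.3f} (nominal 0.05)")
```

Under these assumptions the t-test's standard error is far smaller than the real variability of the difference, so 'significant' results turn up much more often than the nominal 5%.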

The moral dictatorship of methodology

Methodologists are, indeed, moral dictators. If you do not design your investigations carefully you are on the horns of a dilemma. Either you carry out simplistic analyses that are simply wrong or you are condemned to using complex and often unconvincing modelling. Far from banishing the methodologists, you are holding the door wide open to let them in.


This is based on work that was funded by grant 602552 for the IDEAL project under the European Union FP7 programme and support from the programme is gratefully acknowledged.


Collignon, O., Schritz, A., Senn, S. J., & Spezia, R. (2019). Clustered allocation as a way of understanding historical controls: Components of variation and regulatory considerations. Statistical Methods in Medical Research, 962280219880213

Gautret, P., Lagier, J. C., Parola, P., Hoang, V. T., Meddeb, L., Mailhe, M., . . . Raoult, D. (2020). Hydroxychloroquine and azithromycin as a treatment of COVID-19: results of an open-label non-randomized clinical trial. Int J Antimicrob Agents, 105949

Nau, J.-Y. (2020). Hydroxychloroquine : le Pr Didier Raoult dénonce la «dictature morale» des méthodologistes.  Retrieved from

Schmidli, H., Gsteiger, S., Roychoudhury, S., O’Hagan, A., Spiegelhalter, D., & Neuenschwander, B. (2014). Robust meta‐analytic‐predictive priors in clinical trials with historical control information. Biometrics, 70(4), 1023-1032

Senn, S. J. (2008). Lessons from TGN1412 and TARGET: implications for observational studies and meta-analysis. Pharmaceutical Statistics, 7, 294-301

Senn, S. J. (1994). Fisher’s game with the devil. Statistics in Medicine, 13(3), 217-230

Senn, S. J., & Lewis, R. J. (2019). Treatment Effects in Multicenter Randomized Clinical Trials. JAMA

Categories: covid-19, randomization, RCTs, S. Senn

JSM 2020: P-values & “Statistical Significance”, August 6


To register for JSM:

Categories: JSM 2020, P-values

Colleges & Covid-19: Time to Start Pool Testing


I. “Colleges Face Rising Revolt by Professors,” proclaims an article in today’s New York Times, in relation to returning to in-person teaching:

Thousands of instructors at American colleges and universities have told administrators in recent days that they are unwilling to resume in-person classes because of the pandemic. More than three-quarters of colleges and universities have decided students can return to campus this fall. But they face a growing faculty revolt.

…This comes as major outbreaks have hit college towns this summer, spread by partying students and practicing athletes.

In an indication of how fluid the situation is, the University of Southern California said late Wednesday that “an alarming spike in coronavirus cases” had prompted it to reverse an earlier decision to encourage attending classes in person.

…. Faculty members at institutions including Penn State, the University of Illinois, Notre Dame and the State University of New York have signed petitions complaining that they are not being consulted and are being pushed back into classrooms too fast.

… “I shudder at the prospect of teaching in a room filled with asymptomatic superspreaders,” wrote Paul M. Kellermann, 62, an English professor at Penn State, in an essay for Esquire magazine, proclaiming that “1,000 of my colleagues agree.” Those colleagues have demanded that the university give them a choice of doing their jobs online or in person.

II. There is currently a circulating petition of Virginia faculty making similar requests, and if you’re a Virginia faculty member and wish to sign, you still have one day (7/4/20).

A preference to teach remotely isn’t only about mitigating the risk of infection by asymptomatic students; it may also reflect the need to take care of children who might not be in school full-time this fall. Yet a return to in-person teaching has been made the default option in many universities such as Virginia Tech (which has decided 1/3 of classes will be in person).

Other universities have been more open to letting professors decide for themselves what to do. “Due to these extraordinary circumstances, the university is temporarily suspending the normal requirement that teaching be done in person,” the University of Chicago said in a message to instructors on June 26.

Yale said on Wednesday that it would bring only a portion of its students back to campus for each semester: freshmen, juniors and seniors in the fall, and sophomores, juniors and seniors in the spring. “Nearly all” college courses will be taught remotely, the university said, so that all students can enroll in them. New York Times

It would be one thing if all students were regularly tested for covid-19, but in the long-awaited plan released yesterday by Virginia Tech, students are at most being “asked” to obtain a negative result within 5 days of returning to campus–with the exception of students living in a campus residence, who will be offered tests when they arrive. Getting tested is also being “strongly advised”.

If they test positive, they are asked to self-isolate (with the number of days not indicated). A student would need to begin the process of seeking a test several weeks prior to the start of class to ensure at least a 14-day isolation (even though asymptomatics are known to be infectious for longer). But my main concern is that even vigilant students would face obstacles to qualifying for testing, given the current criteria. A student who does not currently have symptoms would not meet the criteria for testing in Virginia, or in the vast majority of other states, unless they had been in close contact with infected persons. (There are exceptions, such as NYC.) This could be rectified if Virginia Tech could get the Virginia Department of Health to include “returning to campus” under their provision to test those “entering congregate settings”–currently limited to long-term care facilities, prisons, and the like.

It is now known that a large percentage of people with Covid-19 are asymptomatic. “Among more than 3,000 prison inmates in four states who tested positive for the coronavirus, the figure was astronomical: 96 percent asymptomatic.”(Link).

An extensive review in the Annals of Internal Medicine suggests that asymptomatic infections may account for 40 to 45 percent of all COVID-19 infections:

“The likelihood that approximately 40% to 45% of those infected with SARS-CoV-2 will remain asymptomatic suggests that the virus might have greater potential than previously estimated to spread silently and deeply through human populations. Asymptomatic persons can transmit SARS-CoV-2 to others for an extended period, perhaps longer than 14 days.

The focus of testing programs for SARS-CoV-2 should be substantially broadened to include persons who do not have symptoms of COVID-19.”  

III. An easy solution would seem to be to turn to “pooled testing”. It’s an old statistical idea, but it’s only now gaining traction [1]. In the July 1 NYT:

The method, called pooled testing, signals a paradigm shift. Instead of carefully rationing tests to only those with symptoms, pooled testing would enable frequent surveillance of asymptomatic people. Mass identification of coronavirus infections could hasten the reopening of schools, offices and factories.

“We’re in intensive discussions about how we’re going to do it,” Dr. Anthony S. Fauci, the country’s leading infectious disease expert, said in an interview. “We hope to get this off the ground as soon as possible.”

…Here’s how the technique works: A university, for example, takes samples from every one of its thousands of students by nasal swab, or perhaps saliva. Setting aside part of each individual’s sample, the lab combines the rest into a batch holding five to 10 samples each. The pooled sample is tested for coronavirus infection. Barring an unexpected outbreak, just 1 percent or 2 percent of the students are likely to be infected, so the overwhelming majority of pools are likely to test negative.

But if a pool yields a positive result, the lab would retest the reserved parts of each individual sample that went into the pool, pinpointing the infected student. The strategy could be employed for as little as $3 per person per day, according to an estimate from economists at the University of California, Berkeley.
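The saving from the two-step scheme just described is simple to work out. The sketch below assumes an idealised setting that the article does not spell out: a perfectly accurate test, no loss of sensitivity from dilution, and infections that occur independently at a common prevalence. The function name is hypothetical.

```python
def expected_tests_per_person(prevalence, pool_size):
    """One pooled test shared by the pool, plus an individual retest for
    everyone in the pool whenever the pool is positive (pool_size >= 2)."""
    p_pool_positive = 1.0 - (1.0 - prevalence) ** pool_size
    return 1.0 / pool_size + p_pool_positive

for prevalence in (0.01, 0.02):
    for pool_size in (5, 10):
        tests = expected_tests_per_person(prevalence, pool_size)
        print(f"prevalence={prevalence:.0%}, pool of {pool_size:2d}: "
              f"{tests:.3f} tests per person")

# Pool size (from 2 to 30) needing the fewest tests at 1% prevalence.
best_size = min(range(2, 31), key=lambda k: expected_tests_per_person(0.01, k))
print("best pool size at 1% prevalence:", best_size)
```

At the prevalences quoted, pools of five to ten need roughly a fifth to a third of a test per person rather than one each, which is where the savings come from.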

The FDA has set out guidelines for adopting pooled testing, which employs the same PCR technology as individual diagnostic tests (link).

Universities should consider what they will do once a certain number of positive covid cases emerge. The Virginia Tech plan proposes to house infected students in a single dorm, but what about the majority of students who live off campus? At what point would they switch to remote teaching? As much as everyone wants to return to normalcy, a class of masked students, 6 feet apart, doesn’t obviously create a better learning environment than Zoom. By regularly conducting pooled tests, the university would become aware of increased spread as soon as a higher proportion of the pools return positive results, before we see an increase in serious cases and hospitalizations.
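The proportion of positive pools can itself be turned into a rough estimate of prevalence. This is a sketch only, resting on the same idealised assumptions of a perfect test and independent infections; the function name and the counts are hypothetical and are not the method of any particular app or laboratory.

```python
def prevalence_from_pools(n_positive_pools, n_pools, pool_size):
    """Estimate individual prevalence from the fraction of positive pools.
    A pool is negative only if all its members are, so
    P(pool negative) = (1 - prevalence) ** pool_size."""
    fraction_negative = 1.0 - n_positive_pools / n_pools
    return 1.0 - fraction_negative ** (1.0 / pool_size)

# Hypothetical weekly results from 1,000 pools of 10 students each.
for week, positives in enumerate([96, 110, 180], start=1):
    estimate = prevalence_from_pools(positives, 1000, 10)
    print(f"week {week}: {positives} positive pools -> "
          f"estimated prevalence {estimate:.2%}")
```

A rise in the estimate from week to week would flag increasing spread without waiting for individual results.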

Chris Bilder, a statistician at the University of Nebraska–Lincoln, has been advising the Nebraska Public Health Laboratory on its use of group testing since April. He and his colleagues have developed a newly released app to determine precisely how best to conduct the pooling for a chosen reduction in testing and a given estimate of prevalence. (Link)

I will add to this over the next few days, as new reports become available. Please share your thoughts and related articles, in the comments.

[1] I first heard it discussed weeks ago by someone on Andrew Gelman’s blog, but I don’t know if it was the same idea.

Categories: covid-19
