Monthly Archives: November 2020

Is it impossible to commit Type I errors in statistical significance tests? (i)


While immersed in our fast-paced, remote, NISS debate (October 15) with J. Berger and D. Trafimow, I didn’t immediately catch all that was said by my co-debaters (I will shortly post a transcript). We had all opted for no practice. But looking over the transcript, I was surprised that David Trafimow was indeed saying the answer to the question in my title is yes. Here are some excerpts from his remarks:

Trafimow 8:44
See, it’s tantamount to impossible that the model is correct, which means that the model is wrong. And so what you’re in essence doing then, is you’re using the P-value to index evidence against a model that is already known to be wrong. …But the point is the model was wrong. And so there’s no point in indexing evidence against it. So given that, I don’t really see that there’s any use for them. …

Trafimow 18:27
I’ll make a more general comment, which is that since since the model is wrong, in the sense of not being exactly correct, whenever you reject it, you haven’t learned anything. And in the case where you fail to reject it, you’ve made a mistake. So the worst, so the best possible cases you haven’t learned anything, the worst possible cases is you’re wrong…

Trafimow 37:54
Now, Deborah, again made the point that you need procedures for testing discrepancies from the null hypothesis, but I will repeat that …P-values don’t give you that. P-values are about discrepancies from the model…

But P-values are not about discrepancies from the model (in which a null or test hypothesis is embedded). If they were, you might say, as he does, that you should properly always find small P-values, so long as the model isn’t exactly correct. If you don’t, he says, you’re making a mistake. But this is wrong, and is in need of clarification. In fact, if violations of the model assumptions prevent computing a legitimate P-value, then its value is not really “about” anything.

Three main points:

[1] It’s very important to see that the statistical significance test is not testing whether the overall model is wrong, and it is not indexing evidence against the model. It is only testing the null hypothesis (or test hypothesis) H0. It is an essential part of the definition of a test statistic T that its distribution be known, at least approximately, under H0. Cox has discussed this for over 40 years; I’ll refer first to a recent, and then an early paper.

Cox (2020, p. 1):

Suppose that we study a system with haphazard variation and are interested in a hypothesis, H, about the system. We find a test quantity, a function t(y) of data y, such that if H holds, t(y) can be regarded as the observed value of a random variable t(Y) having a distribution under H that is known numerically to an adequate approximation, either by mathematical theory or by computer simulation. Often the distribution of t(Y) is known also under plausible alternatives to H, but this is not necessary. It is enough that the larger the value of t(y), the stronger the pointer against H.

Cox (1977, pp. 1-2):

The basis of a significance test is an ordering of the points in [a sample space] in order of increasing inconsistency with H0, in the respect under study. Equivalently there is a function t = t(y) of the observations, called a test statistic, and such that the larger is t(y), the stronger is the inconsistency of y with H0, in the respect under study. The corresponding random variable is denoted by T. To complete the formulation of a significance test, we need to be able to compute, at least approximately,

$$p(y_{\text{obs}}) = p_{\text{obs}} = \mathrm{pr}(T > t_{\text{obs}};\, H_0), \qquad (1)$$

called the observed level of significance.

…To formulate a test, we therefore need to define a suitable function t(.), or rather the associated ordering of the sample points. Essential requirements are that (a) the ordering is scientifically meaningful, (b) it is possible to evaluate, at least approximately, the probability (1).
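Cox's remark that the distribution of the test statistic under H may be known "either by mathematical theory or by computer simulation" can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from Cox: H0 is taken to be μ = 0 for IID Normal data with known σ = 1, the test statistic is the standardised sample mean, and the function names, sample size and observed value are hypothetical.

```python
import math
import random


def analytic_p_value(t_obs):
    """pr(T > t_obs; H0) when T is standard Normal under H0."""
    return 0.5 * math.erfc(t_obs / math.sqrt(2.0))


def simulated_p_value(t_obs, n=25, n_sims=20000, seed=0):
    """Approximate pr(T > t_obs; H0) by drawing data sets under H0.

    Assumed setup (illustrative only): y_1..y_n IID Normal(0, 1) under H0,
    with test statistic t(y) = sqrt(n) * mean(y).
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_sims):
        mean_y = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        if math.sqrt(n) * mean_y > t_obs:
            exceed += 1
    return exceed / n_sims


if __name__ == "__main__":
    t_obs = 2.0  # hypothetical observed value of the test statistic
    print("analytic :", round(analytic_p_value(t_obs), 4))
    print("simulated:", round(simulated_p_value(t_obs), 4))
```

Both routes evaluate probability (1) under H0 alone; nothing in the calculation requires the distribution of T under any alternative.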

To suppose, as Trafimow plainly does, that we can never commit a Type 1 error in statistical significance testing because the underlying model “is not exactly correct” is a serious misinterpretation. The statistical significance test only tests one null hypothesis at a time. It is piecemeal. If it’s testing, say, the mean of a Normal distribution, it’s not also testing the underlying assumptions of the Normal model (Normal, IID). Those assumptions are tested separately, and the error statistical methodology offers systematic ways for doing so, with yet more statistical significance tests [see point 3].

[2] Moreover, although the model assumptions must be met adequately in order for the P-value to serve as a test of H0, it isn’t required that we have an exactly correct model, merely that the reported error probabilities are close to the actual ones. As I say in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018) (several excerpts of which can be found on this blog):

Statistical models are at best approximations of aspects of the data-generating process. Reasserting this fact is not informative about the case at hand. These models work because they need only capture rather coarse properties of the phenomena: the error probabilities of the test method are approximately and conservatively related to actual ones. …Far from wanting true (or even “truer”) models, we need models whose deliberate falsity enables finding things out. (p. 300)

Nor do P-values “track” violated assumptions; such violations can lead to computing an incorrectly high, or an incorrectly low, P-value.

And what about cases where we know ahead of time that a hypothesis H0 is strictly false?—I’m talking about the hypothesis here, not the underlying model. (Examples would be with a point null, or one asserting “there’s no Higgs boson”.) Knowing a hypothesis H0 is false is not yet to falsify it. That is, we are not warranted in inferring we have evidence of a genuine effect or discrepancy from H0, and we still don’t know in which way it is flawed.

[3] What is of interest in testing H0 with a statistical significance test is whether there is a systematic discrepancy or inconsistency with H0—one that is not readily accounted for by background variability, chance, or “noise” (as modelled). We don’t need, or even want, a model that fully represented the phenomenon—whatever that would mean. In “design-based” tests, we look to experimental procedures, within our control, as with randomisation.


the simple precaution of randomisation will suffice to guarantee the validity of the test of significance, by which the result of the experiment is to be judged. (Fisher 1935, 21)

We look to RCTs quite often these days to test the benefits (and harms) of vaccines for Covid-19. Researchers observe differences in the number of Covid-19 cases in two randomly assigned groups, vaccinated and unvaccinated. We know there is ordinary variability in contracting Covid-19; it might be that, just by chance, more people who would have remained Covid-free, even without the vaccine, happen to be assigned to the vaccination group. The random assignment allows determining the probability that an even larger difference in Covid-19 rates would be observed even if H0: the two groups have the same chance of avoiding Covid-19. (I’m describing things extremely roughly; a much more realistic account of randomisation is given by several guest posts by Senn (e.g., blogpost).) Unless this probability is small, it would not be correct to reject H0 and infer that there is evidence the vaccine is effective. Yet Trafimow, if we take him seriously, is saying it would always be correct to reject H0, and that to fail to reject it is to make a mistake. I hope that no one’s seriously suggesting that we should always infer there’s evidence a vaccine or other treatment works. But I don’t know how else to understand the position that it’s always correct to reject H0, and that to fail to reject it is to make a mistake. This is a dangerous and wrong view, which fortunately researchers are not guilty of.
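The design-based reasoning above can be sketched as a simple randomisation test. This is only a rough illustration under stated assumptions: the data are invented, the function name is hypothetical, and the test statistic (placebo case rate minus vaccine case rate) is one choice among many. It is not the analysis used in any actual vaccine trial.

```python
import random


def randomisation_p_value(vaccine, placebo, n_perm=2000, seed=0):
    """One-sided randomisation test of H0: assignment makes no difference.

    vaccine, placebo: lists of 0/1 outcomes (1 = contracted Covid-19).
    Returns the proportion of re-randomisations whose difference in case
    rates (placebo minus vaccine) is at least as large as the observed one.
    """
    rng = random.Random(seed)
    n_v = len(vaccine)
    pooled = list(vaccine) + list(placebo)

    def rate_difference(v, p):
        return sum(p) / len(p) - sum(v) / len(v)

    observed = rate_difference(vaccine, placebo)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # re-assign the same outcomes to arms at random
        if rate_difference(pooled[:n_v], pooled[n_v:]) >= observed:
            at_least_as_large += 1
    # Count the observed assignment itself so the estimate is never zero.
    return (at_least_as_large + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical data, invented for illustration: 200 subjects per arm.
    vaccine = [1] * 2 + [0] * 198
    placebo = [1] * 20 + [0] * 180
    print(randomisation_p_value(vaccine, placebo))
```

With equal case counts in the two arms the same function returns a large value, so H0 would not be rejected. That is exactly the situation in which a Type I error would be committed by someone who rejected anyway.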

When we don’t have design-based assumptions, we may check the model-based assumptions by means of tests that are secondary in relation to the primary test. The trick is to get them to be independent of the unknowns in the primary test, and there are systematic ways to achieve this.

Cox 2006:

We now turn to a complementary use of these ideas, namely to test the adequacy of a given model, what is also sometimes called model criticism… It is necessary if we are to parallel the previous argument to find a statistic whose distribution is exactly or very nearly independent of the unknown parameter μ. An important way of doing this is by appeal to the second property of sufficient statistics, namely that after conditioning on their observed value the remaining data have a fixed distribution. (2006, p. 33)

“In principle, the information in the data is split into two parts, one to assess the unknown parameters of interest and the other for model criticism” (Cox 2006, p. 198). If the model is appropriate then the conditional distribution of Y given the value of the sufficient statistic s is known, so it serves to assess if the model is violated. The key is often to look at residuals: the difference between each observed outcome and what is expected under the model. The full data are remodelled to ask a different question. [i]

In testing assumptions, the null hypothesis is generally that the assumption(s) hold approximately. Again, even when we know this secondary null is strictly false, we want to learn in what way, and use the test to pinpoint improved models to try. (These new models must be separately tested.) [ii]

The essence of the reasoning can be made out entirely informally. Think of how the 1919 Eddington eclipse tests probed departures from the Newtonian predicted light deflection. They tested the Newtonian “half deflection” H0: μ ≤ 0.87, vs H1: μ > 0.87, which includes the Einstein value of 1.75. These primary tests relied upon sufficient accuracy in the telescopes to get a usable standard error for the star positions during the eclipse, and 6 months before (SIST, Excursion 3 Tour I). In one set of plates, which some thought supported Newton, this necessary assumption was falsified using a secondary test. Relying only on known star positions and the detailed data, it was clear that the sun’s heat had systematically distorted the telescope mirror. No assumption about general relativity was required.

If I update this, I will indicate with (i), (ii), etc.

I invite your comments and/or guest posts on this topic.

NOTE: Links to the full papers/book are given in this post, so you might want to check them out.

[i] See  Spanos 2010 (pp. 322-323) from Error & Inference. (This is his commentary on Cox and Mayo in the same volume.) Also relevant Mayo and Spanos 2011 (pp. 193-194).

[ii] It’s important to see that other methods, error statistical or Bayesian, rely on models. A central asset of simple significance tests, on which Bayesians will concur, is their apt role in testing assumptions.

Categories: D. Trafimow, J. Berger, National Institute of Statistical Sciences (NISS), Testing Assumptions | 15 Comments

S. Senn: “A Vaccine Trial from A to Z” with a Postscript (guest post)


Stephen Senn
Consultant Statistician
Edinburgh, Scotland

Alpha and Omega (or maybe just Beta)

Well actually, not from A to Z but from AZ. That is to say, the trial I shall consider is the placebo-controlled trial of the Oxford University vaccine for COVID-19 currently being run by AstraZeneca (AZ) under protocol AZD1222 – D8110C00001 and which I considered in a previous blog, Heard Immunity. A summary of the design features is given in Table 1. The purpose of this blog is to look a little deeper at features of the trial and the way I am going to do so is with the help of geometric representations of the sample space, that is to say the possible results the trial could produce. However, the reader is warned that I am only an amateur in all this. The true professionals are the statisticians at AZ who, together with their life science colleagues in AZ and Oxford, designed the trial.

Whereas in an October 20 post (on PHASTAR) I considered the sequential nature of the trial, here I am going to ignore that feature and only look at the trial as if it had a single look. Note that the trial employs a two to one randomisation, twice as many subjects being given vaccine as placebo.

However, first I shall draw attention to one interesting feature. Like the two other trials that I also previously considered (one by BioNTech and Pfizer and the other by Moderna) the null hypothesis that is being tested is not that the vaccine has no efficacy but that its efficacy does not exceed 30%. Vaccine Efficacy (VE) is defined as

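$$\mathrm{VE} = 100\% \times \left(1 - \frac{R_{\text{vaccine}}}{R_{\text{placebo}}}\right)$$

where $R_{\text{placebo}}$ and $R_{\text{vaccine}}$ are the ‘true’ rates of infection under placebo and vaccine respectively.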

Obviously, if the vaccine were completely ineffective, the value of VE would be 0. Presumably the judgement is that a vaccine will be of no practical use unless it has an efficacy of 30%. Perhaps a lower value than this could not really help to control the epidemic. The trial is designed to show that this is the case. In what follows, you can take it as read that the probability of the trial failing because the efficacy is equal to some value that is less than 30% (such as 27%, say) is even greater than if the value is exactly 30%. Therefore, it becomes of interest to consider the way the trial will behave if the value is exactly 30%.

Figuring it out

Figure 1 gives a representation of what might happen in terms of cases of infected subjects in both arms of the trial based on its design. It’s a complicated diagram and I shall take some time to explain it. For the moment I invite the reader to ignore the concentric circles and the shading. I shall get to those in due course.

Figure 1 Possible and expected outcomes for the trial plotted in the two dimensional space of vaccine and placebo cases of infection. The contour plot applies when the null hypothesis is true.

The X axis gives the number of cases in the vaccine group and the Y axis the number of cases under Placebo. It is important to bear in mind that twice as many subjects are being treated with vaccine as with placebo. The line of equality of infection rates is given by the dashed white diagonal line towards the bottom right hand side of the plot and labelled ‘0% efficacy’. This joins (for example) the points (80,40) and (140, 70) corresponding to twice as many cases under vaccine as placebo and reflecting the 2:1 allocation ratio. Other diagonal lines correspond to 30%, 50% and 60% VE respectively.

The trial is designed to stop once 150 cases of infection have occurred. This boundary is represented by the diagonal solid red line descending from the upper left (30 cases in the vaccine group and 120 cases in the placebo group) towards the bottom right (120 cases in the vaccine group and 30 cases in the placebo group). Thus, we know in advance, that the combination of results we shall see must lie on this line.
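The geometry of the efficacy lines and the stopping boundary can be checked with a few lines of arithmetic. The sketch below assumes only what the text states (2:1 allocation, stopping at 150 cases in total); the function names are hypothetical and are introduced purely for illustration.

```python
def observed_efficacy(vaccine_cases, placebo_cases, allocation_ratio=2.0):
    """Observed VE = 1 - (vaccine rate / placebo rate).

    allocation_ratio is subjects on vaccine per subject on placebo (2 here),
    so case counts are rescaled to a common denominator before comparison.
    """
    return 1.0 - (vaccine_cases / allocation_ratio) / placebo_cases


def boundary_point(efficacy, total_cases=150, allocation_ratio=2.0):
    """(vaccine cases, placebo cases) where an efficacy line meets the
    stopping boundary vaccine + placebo = total_cases."""
    odds = allocation_ratio * (1.0 - efficacy)  # vaccine cases per placebo case
    vaccine_cases = total_cases * odds / (1.0 + odds)
    return vaccine_cases, total_cases - vaccine_cases


if __name__ == "__main__":
    # Both quoted points lie on the 0% efficacy line.
    print(observed_efficacy(80, 40), observed_efficacy(140, 70))
    for ve in (0.0, 0.30, 0.50, 0.60):
        x, y = boundary_point(ve)
        print(f"VE {ve:.0%}: vaccine {x:.1f}, placebo {y:.1f}")
```

The intersection points need not be whole numbers (the 30% line meets the boundary at 87.5 vaccine cases), whereas observed case counts must be.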

Note that the diagram is slightly misleading: since the space concerned refers to numbers of cases, it is neither continuous in X nor continuous in Y. The only possible values are those given by the whole numbers, W, that is to say the positive integers plus zero. However, the same is not true for expected numbers and this is a common difference between parameters and random variables in statistics. For example, if we have a Poisson random variable with a given mean, the only possible values of the random variable are the whole numbers 0,1,2… but the mean can be any positive real number.

Ripples in the pond

Figure 2 is the same diagram as Figure 1 as regards every feature except that which I invited the reader to ignore. The concentric circles are contour plots that represent features of the trial that are suitable for planning. In order to decide how many subjects to recruit, the scientists at AZ and Oxford had to decide what infection rate was likely. They chose an infection rate of 0.8% per 6 months under placebo. This in turn implies that of 10,000 subjects treated with placebo, we might expect 80 to get COVID. On the other hand, a vaccine efficacy of 30% would imply an infection rate of 0.56% since

$$0.8\% \times (1 - 0.30) = 0.56\%.$$

For 20,000 subjects treated with vaccine we would expect (0.56/100) × 20,000 = 112 of them to be infected with COVID. If the vaccine efficacy were 60%, the value assumed for the power calculation, then the expected infection rate would be 0.32% and we would expect 64 of the subjects to be infected.
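The planning arithmetic can be reproduced in a few lines. This is a minimal sketch using only the figures quoted in the text (0.8% placebo infection rate per 6 months, 10,000 subjects on placebo and 20,000 on vaccine); the function name is hypothetical.

```python
def expected_cases(n_subjects, placebo_rate, efficacy=0.0):
    """Expected number of infections among n_subjects when the placebo
    infection rate is placebo_rate and the vaccine efficacy is efficacy
    (use efficacy=0.0 for the placebo arm)."""
    return n_subjects * placebo_rate * (1.0 - efficacy)


if __name__ == "__main__":
    placebo_rate = 0.008  # 0.8% per 6 months, the planning value in the text
    print("placebo arm      :", round(expected_cases(10_000, placebo_rate), 1))
    print("vaccine, VE = 30%:", round(expected_cases(20_000, placebo_rate, 0.30), 1))
    print("vaccine, VE = 60%:", round(expected_cases(20_000, placebo_rate, 0.60), 1))
```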

Since the infection rates are small, a Poisson distribution is a possible simple model for the probability of seeing certain combinations of infections. This is what the contour plots illustrate. For both cases, the expected number of cases under placebo is assumed to be 80 and this is illustrated by a dashed horizontal white line. However, the lower infection rate under H1 has the effect of shifting the contour plots to the left. Thus, in Figure 1 the dashed vertical line indicating the expected numbers in the vaccine arm is at 112 and in Figure 2 it is at 64. Nothing else changes between the figures.

Figure 2 Possible and expected outcomes for the trial plotted in the two dimensional space of vaccine and placebo cases of infection. The contour plot applies when the value under the alternative hypothesis assumed for power calculations is true.

Test bed

How should we carry out a significance test? One way of doing so is to condition on the total number of infected cases. The issue of whether to condition or not is a notorious controversy in statistics. Here the total of 150 is fixed but I think that there is a good argument for conditioning whether or not it is fixed. Such conditioning in this case leads to a binomial distribution describing the number of cases of infection observed out of the 150 that are in the vaccine group. Ignoring any covariates, therefore, a simple analysis is to compare the proportion of cases we see to the proportion we would expect to see under the null hypothesis. This proportion is given by 112/(112+80)=0.583. (Note a subtle but important point here. The total number of cases expected is 192 but we know the trial will stop at 150. That is irrelevant. It is the expected proportion that matters here.)
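The step from two Poisson counts to a binomial can be verified numerically. The sketch below assumes the simple Poisson model described earlier, with independent counts in the two arms and means 112 and 80 under the null hypothesis; the function names are hypothetical. It also illustrates why the expected total is irrelevant: halving both means leaves the conditional distribution unchanged.

```python
import math


def poisson_log_pmf(k, mean):
    """Log of the Poisson probability of observing k with the given mean."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def conditional_pmf(vaccine_cases, total, mean_vaccine, mean_placebo):
    """pr(vaccine cases = k | vaccine + placebo cases = total) for two
    independent Poisson counts; the total is Poisson with the summed mean."""
    log_joint = poisson_log_pmf(vaccine_cases, mean_vaccine) + poisson_log_pmf(
        total - vaccine_cases, mean_placebo
    )
    log_total = poisson_log_pmf(total, mean_vaccine + mean_placebo)
    return math.exp(log_joint - log_total)


if __name__ == "__main__":
    p_null = 112 / (112 + 80)
    for k in (60, 75, 90):
        print(k, conditional_pmf(k, 150, 112, 80), binomial_pmf(k, 150, p_null))
    # Same proportion, half the expected total: same conditional probability.
    print(conditional_pmf(75, 150, 56, 40))
```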

By trial and error or by some other means we can now discover that the probability of 75 or fewer cases given vaccine out of 150 in total when the probability is 0.583 is 0.024. The AZ protocol requires a two-sided P-value less than or equal to 4.9%, which is to say 0.0245 one sided, assuming the usual doubling rule, so this is just low enough. On the other hand, the probability of 76 or fewer cases under vaccine is 0.035 and thus too high. This establishes the point X=75, Y=75 as a critical value of the test. This is shown by the small red circle labelled ‘critical value’ on both figures. It just so happens that this lies along the 50% efficacy line. Thus observed 50% efficacy will be (just) enough to reject the hypothesis that the true efficacy is 30% or lower.
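The "trial and error" search for the critical value can be sketched directly. This is a minimal illustration assuming the conditional binomial analysis described above, with the unrounded null proportion 112/192 and a one-sided level of 0.0245; the function names are hypothetical and this is not the protocol's own analysis.

```python
import math


def binomial_cdf(k, n, p):
    """pr(X <= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def critical_value(n, p, alpha):
    """Largest k with pr(X <= k) <= alpha, or None if even k = 0 fails."""
    best = None
    for k in range(n + 1):
        if binomial_cdf(k, n, p) <= alpha:
            best = k
        else:
            break  # the lower tail only grows with k, so stop at first failure
    return best


if __name__ == "__main__":
    p_null = 112 / (112 + 80)  # expected proportion of cases on vaccine under H0
    print("pr(X <= 75):", round(binomial_cdf(75, 150, p_null), 3))
    print("pr(X <= 76):", round(binomial_cdf(76, 150, p_null), 3))
    print("critical value:", critical_value(150, p_null, 0.0245))
```

Using the rounded proportion 0.583 instead of 112/192 shifts the tail probabilities slightly, which matters here because the result at 75 cases is so close to the threshold.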

Reading the tea-leaves 

There are many other interesting features of this trial I could discuss, in particular what alternative analyses might be tried (the protocol refers to a ‘modified Poisson regression approach’ due to Zou, 2004) but I shall just consider one other issue here. That is that in theory when the trial stops might give some indication as to vaccine efficacy, a point that might be of interest to avid third party trial-watchers. If you look at Figure 3, which combines Figure 1 and Figure 2, you will note that the expected number of cases under H0, if the values used for planning are correct, is at least (when vaccine efficacy is 30%) 80+112=192. For zero efficacy the figure is 80+160=240. However, the trial will stop once 150 cases of infection have been observed. Thus, under H0, the trial is expected to stop before all 30,000 subjects have had six months of follow-up.

On the other hand, for an efficacy of 60% given in Figure 3 the value is 80+64 =144 and so slightly less than the figure required. Thus, under H1, the trial might not be big enough. Taken together, these figures imply that other things being equal, the earlier the trial stops the more likely the result is to be negative and the longer it continues, the more likely it is to be positive.

Of course, this raises the issue as to whether one can judge what is early and what is late. To make some guesses as to background rates of infection is inevitable when planning a trial. One would be foolish to rely on them when interpreting it.

Figure 3 Combination of Figures 1 and 2 showing contour plots for the joint density for the number of cases when the vaccine efficacy is 30% (H0) and the value under H1 of 60% used for planning.


Zou G. A modified Poisson regression approach to prospective studies with binary data. Am J Epidemiol. 2004;159(7):702-6.

POSTSCRIPT: Needlepoint

Pressing news

Extract of a press-release from Pfizer, 9 November 2020:

“I am happy to share with you that Pfizer and our collaborator, BioNTech, announced positive efficacy results from our Phase 3, late-stage study of our potential COVID-19 vaccine. The vaccine candidate was found to be more than 90% effective in preventing COVID-19 in participants without evidence of prior SARS-CoV-2 infection in the first interim efficacy analysis.” Albert Bourla (Chairman and CEO, Pfizer.)

Naturally, this had Twitter agog and calculations were soon produced to try and reconstruct the basis on which the claim was being made: how many cases of COVID-19 infection under vaccine had been seen in order to be able to make this claim? In the end these amateur calculations don’t matter. It’s what Pfizer calculates and what the regulators decide about the calculation that matters. I note by the by that a fair proportion of Twitter seemed to think that journal publication and peer review is essential. I don’t share this point of view, which I tend to think of as “quaint”. It’s the regulator’s view I am interested in but we shall have to wait for that.

Nevertheless, calculation can be fun and if I don’t think so, I am in the wrong profession. So here goes. However, first I should acknowledge that Jen Rogers’s interesting blog on the subject has been very useful in preparing this note.

The back of the envelope

To do the calculation properly, this is what one would have to know:

Need to know


Disposition of Subjects

Randomisation was one to one but strictly speaking we want to know the exact achieved proportions. BusinessWire describe a total of “43,538 participants to date, 38,955 of whom have received a second dose of the vaccine candidate as of November 8, 2020”.

Number of cases of infection

According to BusinessWire 94 were seen.

Method of analysis

Pfizer claims in the protocol a Bayesian analysis will be used. I shall not attempt this but use a very simple frequentist one conditioning on totals infected.

Aim of claim

Is the point estimate the basis of the claim or is the lower bound of some confidence interval the basis?

Level of confidence to be used

Pfizer planned to look five times but it seems that the first look was abandoned. The reported look is the 2nd but at a number of cases that is slightly greater (94) than the number originally planned for the 3rd (92). I shall assume that the confidence level for look three of an O’Brien-Fleming boundary is appropriate.


A simple analysis would assume no missing data or at least that any missing data are missing completely at random.

Other matters

Two doses are required. Were there any cases arising between the two doses and if so, what was done with them?


If I condition on the total number of infected cases, and assume equal numbers of subjects on each arm, then by varying the number of cases in the vaccine group and subtracting them from the total of 94 to get those on the control group arm, I can calculate the vaccine efficacy. This has been done in the figure below.

The solid blue circles are the estimate of the vaccine efficacy. The ‘whiskers’ below indicate a confidence limit of 99.16% which (I think) is the level appropriate for the third look in an O’Brien-Fleming scheme for an overall type I error rate of 5%. Horizontal lines have been drawn at 30% efficacy (the value used in the protocol for the null hypothesis) and 90% efficacy (the claimed effect in the press release). Three cases on the vaccine arm would give a lower confidence limit for vaccine efficacy of about 91.3% whereas four gives a value of 89.2%. Eight cases would give a point estimate of 90.7%. So depending on what exactly the claim of “more than 90% effective” might mean (and a whole host of other assumptions) we could argue that between three and eight cases of infection were seen.
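The point estimates in this back-of-the-envelope calculation are easy to reproduce. The sketch below assumes, as the text does, equal numbers of subjects on each arm and 94 cases in total, so that the ratio of infection rates is simply the ratio of case counts. The function name is hypothetical, and the confidence limits are not reproduced here since they depend on further choices about method and level.

```python
def efficacy_from_split(vaccine_cases, total_cases=94):
    """Point estimate of VE when arms are equal in size, so that the ratio
    of infection rates is just the ratio of case counts."""
    placebo_cases = total_cases - vaccine_cases
    return 1.0 - vaccine_cases / placebo_cases


if __name__ == "__main__":
    for vaccine_cases in range(0, 11):
        ve = efficacy_from_split(vaccine_cases)
        print(f"{vaccine_cases:2d} cases on vaccine -> VE estimate {100 * ve:.1f}%")
```

Under these assumptions eight is the largest number of vaccine-arm cases for which the point estimate stays above 90%; with nine cases it falls to roughly 89.4%.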

Safety second

Of course safety is often described as being first in terms of priorities but it usually takes longer to see the results that are necessary to judge it than to see those for efficacy. According to BusinessWire “Pfizer and BioNTech are continuing to accumulate safety data and currently estimate that a median of two months of safety data following the second (and final) dose of the vaccine candidate – the amount of safety data specified by the FDA in its guidance for potential Emergency Use Authorization – will be available by the third week of November.”

The world awaits the results with interest.

A Dec. 2, 2020 update by Senn


P. C. O’Brien and T. R. Fleming (1979) A multiple testing procedure for clinical trials. Biometrics, 35, 549-556.


Categories: covid-19, RCTs, Stephen Senn | 9 Comments

Phil Stat Forum: November 19: Stephen Senn, “Randomisation and Control in the Age of Coronavirus?”

For information about the Phil Stat Wars forum and how to join, see this post and this pdf. 


Categories: Error Statistics, randomization | Leave a comment
