S. Senn: “Beta testing”: The Pfizer/BioNTech statistical analysis of their Covid-19 vaccine trial (guest post)


Stephen Senn

Consultant Statistician
Edinburgh, Scotland

The usual warning

Although I have researched on clinical trial design for many years, prior to the COVID-19 epidemic I had had nothing to do with vaccines. The only object of these amateur musings is to amuse amateurs by raising some issues I have pondered and found interesting.

Coverage matters

In this blog I am going to cover the statistical analysis used by Pfizer/BioNTech (hereafter referred to as P&B) in their big phase III trial of a vaccine for COVID-19. I considered this in some previous posts, in particular Heard Immunity, and Infectious Enthusiasm. The first of these posts compared the P&B trial to two others, a trial run by Moderna and another by Astra Zeneca and Oxford University (hereafter referred to as AZ&Ox) and the second discussed the results that P&B reported.

Figure 1 Stopping boundaries for three trials. Labels are anticipated numbers of cases at the looks.

All three trials were sequential in nature and, as is only proper, all three protocols gave details of the proposed stopping boundaries. These are given in Figure 1. AZ&Ox proposed to have two looks, Moderna to have three and P&B to have five. As things turned out, there was only one interim look at the P&B trial and so two, and not five, in total.

Moderna and AZ&Ox specified a frequentist approach in their protocols and P&B a Bayesian one. It is aspects of this Bayesian approach that I propose to consider.

Some symbol stuff

It is common to measure vaccine efficacy in terms of a relative risk reduction expressed as a percentage. The percentage is a nuisance and instead I shall express it as a simple ratio. If ψ, πc, πv are the true vaccine efficacy and the probabilities of being infected in the control and vaccine groups respectively, then

$$\psi = 1 - \frac{\pi_v}{\pi_c}. \qquad (1)$$

Note that if we have Yc, Yv cases in the control and vaccine arms respectively and nc, nv subjects then an intuitively reasonable estimate of ψ is

$$VE = 1 - \frac{Y_v/n_v}{Y_c/n_c}, \qquad (2)$$

where VE is the observed vaccine efficacy.

If the total number of subjects is N and we have nc = rN, nv = (1 − r)N, with r being the proportion of subjects on the control arm, then we have

$$VE = 1 - \frac{Y_v}{Y_c}\cdot\frac{r}{1-r}. \qquad (3)$$

Note that if r = 1 − r = 1/2, that is to say that there are equal numbers of subjects on both arms, then (3) simply reduces to one minus the ratio of observed cases. VE thus has the curious property that its maximum value is 1 (when there are no cases in the vaccine group) but its minimum value is −∞ (when there are no cases in the control group and at least one in the vaccine group).
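The estimate and its two extreme values can be checked numerically. The following is a minimal sketch; the function names and the case counts are hypothetical and chosen purely for illustration.

```python
def observed_ve(cases_control, cases_vaccine, n_control, n_vaccine):
    """Observed vaccine efficacy: one minus the ratio of observed attack rates."""
    if cases_control == 0:
        if cases_vaccine == 0:
            raise ValueError("VE is undefined when there are no cases at all")
        return float("-inf")  # no control cases, at least one vaccine case
    return 1.0 - (cases_vaccine / n_vaccine) / (cases_control / n_control)


def observed_ve_from_allocation(cases_control, cases_vaccine, r):
    """Same estimate written in terms of r, the proportion on the control arm."""
    return 1.0 - (cases_vaccine / cases_control) * (r / (1.0 - r))


# Hypothetical counts, not trial data: 90 control cases, 10 vaccine cases.
print(observed_ve(90, 10, 1000, 1000))          # equal arms
print(observed_ve_from_allocation(90, 10, 0.5))  # one minus the ratio of cases
```

With equal allocation the two functions agree, and the arm sizes drop out entirely.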

A contour plot of vaccine efficacy as a function of the control and vaccine group probabilities of infection is given in Figure 2.

Figure 2 Vaccine efficacy as a function of the probability of infection in the control and vaccine groups.

Scaly beta

P&B specified a prior distribution for their analysis but very sensibly shied away from attempting to do one for vaccine efficacy directly. Instead, they considered a transformation, or re-scaling, of the parameter defining

$$\theta = \frac{1 - \psi}{2 - \psi}. \qquad (4)$$

This looks rather strange but in fact it can be re-expressed as

$$\theta = \frac{\pi_v}{\pi_c + \pi_v}. \qquad (5)$$

Figure 3 Contour plot of transformed vaccine efficacy, θ. 

Its contour plot is given in Figure 3. The transformation is the ratio of the probability of infection in the vaccine group to the sum of the probabilities in the two groups. It thus takes on a value between 0 and 1 and this in turn implies that it behaves like a probability. In fact if we have equal numbers of subjects on both arms and condition on the total numbers of cases we can regard it as being the probability that a randomly chosen case will be in the vaccine group and therefore as an estimate of the efficacy of the vaccine. The lower this probability, the more effective the vaccine.
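For readers who want the intermediate step, the re-expression follows directly from the definition of ψ as one minus the ratio of the two infection probabilities. A short derivation, using only the symbols already introduced, might run as follows; the second line gives the back-transformation from θ to ψ.

```latex
\theta = \frac{1-\psi}{2-\psi}
       = \frac{\pi_v/\pi_c}{1+\pi_v/\pi_c}
       = \frac{\pi_v}{\pi_c+\pi_v},
\qquad
\psi = \frac{1-2\theta}{1-\theta} = 1-\frac{\theta}{1-\theta}.
```

In particular θ = 1/2 corresponds to ψ = 0 (no efficacy) and θ = 0 corresponds to ψ = 1.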

In fact, this simple analysis captures much of what the data have to tell and, in estimating vaccine efficacy in previous posts, I simply used this ratio as a probability and estimated ‘exact’ confidence intervals using the binomial distribution. Having calculated the intervals on this scale, I back-transformed them to the vaccine efficacy scale.
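A minimal sketch of that kind of calculation is given below. It assumes equal allocation, conditions on the total number of cases, and takes ‘exact’ to mean the Clopper-Pearson interval, which is an assumption on my part; the limits are found by simple bisection using only the standard library. The function names and the counts are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def bisect_decreasing(f, iters=200):
    """Root of a decreasing function f on (0, 1), found by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(k, n, alpha=0.05):
    """Clopper-Pearson ('exact') limits for a binomial proportion."""
    lower = 0.0 if k == 0 else bisect_decreasing(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p)))
    upper = 1.0 if k == n else bisect_decreasing(
        lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper


def theta_to_ve(theta):
    """Back-transform: VE = 1 - theta / (1 - theta)."""
    return float("-inf") if theta >= 1 else 1 - theta / (1 - theta)


def ve_interval(cases_vaccine, cases_control, alpha=0.05):
    """Interval for VE; the upper limit for theta gives the lower limit for VE."""
    lo, hi = exact_interval(cases_vaccine, cases_vaccine + cases_control, alpha)
    return theta_to_ve(hi), theta_to_ve(lo)


# Hypothetical counts, for illustration only (not the trial's data).
print(ve_interval(cases_vaccine=10, cases_control=90))
```

Because the transformation from θ to vaccine efficacy is decreasing, the two limits swap roles when back-transformed.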

A prior distribution that is commonly used for modelling data using the binomial distribution is the beta-distribution (see pp. 55-61 of Forbes et al[1], 2011), hence the title of this post. This is a two parameter distribution with parameters (say) ν, ω and mean

$$\mu = \frac{\nu}{\nu + \omega} \qquad (6)$$

and variance

$$\sigma^2 = \frac{\nu\omega}{(\nu + \omega)^2(\nu + \omega + 1)}. \qquad (7)$$

Thus, the relative value of the two parameters governs the mean and, the larger the parameter values, the smaller the variance. A special case of the beta distribution is the uniform distribution, which can be obtained by setting both parameter values to 1. The resulting mean is 1/2 and the variance is 1/12. Parameter values of ½ and ½ give a distribution with the same mean but a larger variance of 1/8. For comparison, the mean and variance of a binomial proportion are p, p(1 − p)/n and if you set p = 1/2, n = 2 you get the same mean and variance. This gives a feel for how much information is contained in a prior distribution.
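These values are easy to verify with exact fractions. The sketch below uses the standard formulas for the mean, ν/(ν + ω), and variance, νω/((ν + ω)²(ν + ω + 1)), of a beta distribution; the function names are my own.

```python
from fractions import Fraction


def beta_mean_var(nu, omega):
    """Mean and variance of a beta(nu, omega) distribution."""
    total = nu + omega
    return nu / total, nu * omega / (total ** 2 * (total + 1))


def binomial_proportion_mean_var(p, n):
    """Mean and variance of an observed proportion from n Bernoulli(p) trials."""
    return p, p * (1 - p) / n


print(beta_mean_var(Fraction(1), Fraction(1)))               # uniform prior
print(beta_mean_var(Fraction(1, 2), Fraction(1, 2)))         # beta(1/2, 1/2)
print(binomial_proportion_mean_var(Fraction(1, 2), 2))       # p = 1/2, n = 2
```

The last two lines print the same pair, which is the sense in which a beta(1/2, 1/2) prior carries about as much information as two observations.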

Of Ps and Qs

P&B chose a beta distribution for θ with ν = 0.700102, ω = 1. The prior distribution is plotted in Figure 4. This has a mean of 0.4118 and a variance of approximately 1/11. I now propose to discuss and speculate how these values were arrived at. I come to a rather cynical conclusion and before I give you my reasoning, I want to make two points quite clear:

a) My cynicism does not detract from my admiration for what P&B have done. I think the achievement is magnificent, not only in terms of the basic science but also in terms of trial design, management and delivery.

b) I am not criticising the choice of a Bayesian analysis. In fact, I found that rather interesting.

However, I think it is appropriate to establish what exactly is incorporated in any prior distribution and that is what I propose to do.

First note the extraordinary number of significant figures (6) for the first parameter of the beta distribution, which has a value of 0.700102. The distribution itself (as established by its variance) is not very informative but at first sight there would seem to be a great deal of information about the prior distribution itself. This is a feature of some analyses that I have drawn attention to before. See Dawid’s Selection Paradox.

Figure 4 Prior distribution for θ

So this is what I think happened. P&B reached for a common default uniform distribution, a beta with parameters 1,1. However, this would yield an expected value of θ = 1/2, which corresponds to a vaccine efficacy of ψ = 0. On the other hand, they wished to show that the vaccine efficacy ψ was greater than 0.3. They thus asked the question, what value of θ corresponds to a value of ψ = 0.3? Substituting in (4) the answer is 0.4117647 or 0.4118 to four decimal places. They explained this in the protocol as follows: ‘The prior is centered at θ = 0.4118 (VE = 30%) which may be considered pessimistic’.
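The substitution is a one-liner. As a quick check, assuming the transformation θ = (1 − ψ)/(2 − ψ) and its inverse ψ = (1 − 2θ)/(1 − θ), with function names that are mine:

```python
def ve_to_theta(psi):
    """Transformed efficacy: theta = (1 - psi) / (2 - psi)."""
    return (1 - psi) / (2 - psi)


def theta_to_ve(theta):
    """Inverse transformation: psi = (1 - 2 * theta) / (1 - theta)."""
    return (1 - 2 * theta) / (1 - theta)


print(round(ve_to_theta(0.3), 7))  # theta matching a vaccine efficacy of 30%
print(ve_to_theta(0.0))            # no efficacy maps to theta = 1/2
print(theta_to_ve(0.5))            # and theta = 1/2 maps back to zero efficacy
```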

Figure 5 Combination of parameters for the prior distribution yielding the required mean. The diagonal light blue line gives the combination of values that will produce the desired mean. The red diamond gives the parameter combination chosen by P&B. The blue circle gives the parameter combination that would also have produced the mean chosen but also the same variance as a beta(1,1). The contour lines show the variance of the distribution as a function of the two parameters.

Note that the choice of word centered is odd. The mean of the distribution is 0.4118 but the distribution is not really centered there. Be that as it may, they now had an infinite number of possible combinations of values for ν, ω that would yield an expected value of 0.4118. Note that solving (6) for ν in terms of ω yields

$$\nu = \frac{\mu\,\omega}{1 - \mu} \qquad (8)$$

and plugging in μ = 0.4118, ω = 1 gives ν = 0.700102. Possible choices of parameter combinations yielding the same mean are given in Figure 5. An alternative to the beta(0.700102,1) they chose might have been beta(0.78516,1.1215). This would have yielded the same mean but given the equivalent variance to a conventional beta(1,1).
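Both parameter pairs can be reproduced in a few lines. The sketch below assumes the standard beta identities: the mean is ν/(ν + ω), and for a given mean μ the variance is μ(1 − μ)/(ν + ω + 1), so fixing the mean and the variance pins down ν + ω. The function names are hypothetical.

```python
def nu_for_mean(mu, omega):
    """Value of nu giving mean mu for a chosen omega."""
    return mu * omega / (1 - mu)


def params_for_mean_var(mu, var):
    """Beta parameters (nu, omega) with a given mean and variance."""
    total = mu * (1 - mu) / var - 1  # this is nu + omega
    return mu * total, (1 - mu) * total


print(round(nu_for_mean(0.4118, 1.0), 6))   # P&B's first parameter
nu, omega = params_for_mean_var(0.4118, 1 / 12)
print(round(nu, 5), round(omega, 4))        # same mean, variance of a beta(1,1)
```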

It is also somewhat debatable as to whether pessimistic is the right word. The distribution is certainly very uninformative. Note also that just because the mean value on the θ scale, when transformed to the vaccine efficacy scale, gives a value of 0.30, it does not follow that this is the mean value of the vaccine efficacy. Only medians can be guaranteed to be invariant under transformation. The median of the distribution of θ is 0.3716 and this corresponds to a median vaccine efficacy of 0.4088.
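The median is available in closed form here because the second parameter is 1: a beta(ν, 1) distribution has cumulative distribution function θ^ν, so its median is 0.5^(1/ν). A quick check, with function names of my own choosing:

```python
def beta_nu_one_median(nu):
    """Median of beta(nu, 1), whose CDF is theta ** nu."""
    return 0.5 ** (1 / nu)


def theta_to_ve(theta):
    """Back-transform to the vaccine efficacy scale."""
    return 1 - theta / (1 - theta)


median_theta = beta_nu_one_median(0.700102)
print(round(median_theta, 4))               # median on the theta scale
print(round(theta_to_ve(median_theta), 4))  # median vaccine efficacy
print(round(theta_to_ve(0.4118), 2))        # transformed mean, for comparison
```

The transformation is monotone, so the transformed median is the median of the transformed variable; the same does not hold for the mean.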

Can you beta Bayesian?

Perhaps unfairly, I could ask, ‘what has the Bayesian element added to the analysis?’ A Bayesian might reply, ‘what advantage does subtracting the Bayesian element bring to the analysis?’ Nevertheless, the choice of prior distribution here points to a problem. It clearly does not reflect what anybody believed about the vaccine efficacy before the trial began. Of course, establishing reasonable prior parameters for any statistical analysis is extremely difficult[2].

On the other hand, if a purely conventional prior is required why not choose beta(1,1) or beta(1/2,1/2), say? I think the 0.3 hypothesised value for vaccine efficacy is a red herring here. What should be of interest to a Bayesian is the posterior probability that the vaccine efficacy is greater than 30%. This does not require that the prior distribution is ‘centred’ on this value.

Of course the point is that provided that the variance of the prior distribution is large enough, the posterior inference is scarcely affected. In any case a Bayesian might reply, ‘if you don’t like the prior distribution choose your own’. To which a diehard frequentist might reply, ‘it is a bit late for choosing prior distributions’.
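A rough numerical illustration of this insensitivity is sketched below. It is not P&B's actual analysis: it assumes equal allocation, conditions on the total number of cases so that a beta(ν, ω) prior updates to beta(ν + vaccine cases, ω + control cases), uses hypothetical case counts, and approximates the posterior probability by crude midpoint integration.

```python
from math import lgamma, log, exp


def beta_pdf(x, a, b):
    """Density of beta(a, b) at 0 < x < 1, computed on the log scale."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(x) + (b - 1) * log(1 - x))


def beta_cdf(x, a, b, steps=20000):
    """P(theta <= x) by the midpoint rule; crude, but adequate when a >= 1."""
    h = x / steps
    return h * sum(beta_pdf((i + 0.5) * h, a, b) for i in range(steps))


def posterior_summary(nu, omega, cases_vaccine, cases_control, ve_threshold=0.3):
    """Posterior mean of theta and posterior probability that VE > threshold."""
    a, b = nu + cases_vaccine, omega + cases_control
    theta_threshold = (1 - ve_threshold) / (2 - ve_threshold)
    return a / (a + b), beta_cdf(theta_threshold, a, b)


# Hypothetical counts (10 vaccine cases, 90 control cases), three priors.
for prior in [(0.700102, 1.0), (1.0, 1.0), (0.5, 0.5)]:
    print(prior, posterior_summary(*prior, cases_vaccine=10, cases_control=90))
```

With counts of this size the three priors give posterior means within a fraction of a percentage point of each other and essentially the same posterior probability that efficacy exceeds 30%.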

I take two lessons from this, however. First, where Bayesian analyses are being used we should all try to understand what the prior distribution implies: in what we now ‘believe’ and how data would update such belief[3]. Second, disappointing as this may be to inferential enthusiasts, this sort of thing is not where the action is. The trial was well conceived, designed and conducted and the product was effective. My congratulations to all the scientists involved, including, but not limited to, the statisticians.


  1. Forbes, C., et al., Statistical distributions. 2011: John Wiley & Sons.
  2. Senn, S.J., Trying to be precise about vagueness. Statistics in Medicine, 2007. 26: p. 1417-1430.
  3. Senn, S.J., You may believe you are a Bayesian but you are probably wrong. Rationality, Markets and Morals, 2011. 2: p. 48-66.

Categories: covid-19, PhilStat/Med, S. Senn | 4 Comments

Why hasn’t the ASA Board revealed the recommendations of its new task force on statistical significance and replicability?

something’s not revealed

A little over a year ago, the board of the American Statistical Association (ASA) appointed a new Task Force on Statistical Significance and Replicability (under then president, Karen Kafadar), to provide it with recommendations. [Its members are here (i).] You might remember my blogpost at the time, “Les Stats C’est Moi”. The Task Force worked quickly, despite the pandemic, giving its recommendations to the ASA Board early, in time for the Joint Statistical Meetings at the end of July 2020. But the ASA hasn’t revealed the Task Force’s recommendations, and I just learned yesterday that it has no plans to do so*. A panel session I was in at the JSM, (P-values and ‘Statistical Significance’: Deconstructing the Arguments), grew out of this episode, and papers from the proceedings are now out. The introduction to my contribution gives you the background to my question, while revealing one of the recommendations (I only know of 2). 

[i] Linda Young, (Co-Chair), Xuming He, (Co-Chair) Yoav Benjamini, Dick De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xiao-Li Meng, Vijay Nair, Nancy Reid, Stephen Stigler, Stephen Vardeman, Chris Wikle, Tommy Wright, Karen Kafadar, Ex-officio. (Kafadar 2020)

You can access the full paper here.


Rejecting Statistical Significance Tests: Defanging the Arguments^

Abstract: I critically analyze three groups of arguments for rejecting statistical significance tests (don’t say ‘significance’, don’t use P-value thresholds), as espoused in the 2019 Editorial of The American Statistician (Wasserstein, Schirm and Lazar 2019). The strongest argument supposes that banning P-value thresholds would diminish P-hacking and data dredging. I argue that it is the opposite. In a world without thresholds, it would be harder to hold accountable those who fail to meet a predesignated threshold by dint of data dredging. Forgoing predesignated thresholds obstructs error control. If an account cannot say about any outcomes that they will not count as evidence for a claim (if all thresholds are abandoned), then there is no test of that claim. Giving up on tests means forgoing statistical falsification. The second group of arguments constitutes a series of strawperson fallacies in which statistical significance tests are too readily identified with classic abuses of tests. The logical principle of charity is violated. The third group rests on implicit arguments. The first in this group presupposes, without argument, a different philosophy of statistics from the one underlying statistical significance tests; the second (appeals to popularity and fear) only exacerbates the ‘perverse’ incentives underlying today’s replication crisis. 

1. Introduction and Background 

Today’s crisis of replication gives a new urgency to critically appraising proposed statistical reforms intended to ameliorate the situation. Many are welcome, such as preregistration, testing by replication, and encouraging a move away from cookbook uses of statistical methods. Others are radical and might inadvertently obstruct practices known to improve on replication. The problem is one of evidence policy, that is, it concerns policies regarding evidence and inference. Problems of evidence policy call for a mix of statistical and philosophical considerations, and while I am not a statistician but a philosopher of science, logic, and statistics, I hope to add some useful reflections on the problem that confronts us today. 

In 2016 the American Statistical Association (ASA) issued a statement on P-values, intended to highlight classic misinterpretations and abuses. 

The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions. …. much confusion and even doubt about the validity of science is arising. (Wasserstein and Lazar 2016, p. 129) 

The statement itself grew out of meetings and discussions with over two dozen others, and was specifically approved by the ASA board. The six principles it offers are largely rehearsals of fallacious interpretations to avoid. In a nutshell: P-values are not direct measures of posterior probabilities, population effect sizes, or substantive importance, and can be invalidated by biasing selection effects (e.g., cherry picking, P-hacking, multiple testing). The one positive principle is the first: “P-values can indicate how incompatible the data are with a specified statistical model” (ibid., p. 131). 

The authors of the editorial that introduces the 2016 ASA Statement, Wasserstein and Lazar, assure us that “Nothing in the ASA statement is new” (p. 130). It is merely a “statement clarifying several widely agreed upon principles underlying the proper use and interpretation of the p-value” (p. 131). Thus, it came as a surprise, at least to this outsider’s ears, to hear the authors of the 2016 Statement, along with a third co-author (Schirm), declare in March 2019 that: “The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” (Wasserstein, Schirm and Lazar 2019, p. 2, hereafter, WSL 2019). 

The 2019 Editorial announces: “We take that step here….[I]t is time to stop using the term ‘statistically significant’ entirely. …[S]tatistically significant –don’t say it and don’t use it” (WSL 2019, p. 2). Not just outsiders to statistics were surprised. To insiders as well, the 2019 Editorial was sufficiently perplexing for the then ASA President, Karen Kafadar, to call for a New ASA Task Force on Significance Tests and Replicability. 

Many of you have written of instances in which authors and journal editors—and even some ASA members—have mistakenly assumed this editorial represented ASA policy. The mistake is understandable: The editorial was co-authored by an official of the ASA. 

… To address these issues, I hope to establish a working group that will prepare a thoughtful and concise piece … without leaving the impression that p-values and hypothesis tests…have no role in ‘good statistical practice’. (K. Kafadar, President’s Corner, 2019, p. 4) 

This was a key impetus for the JSM panel discussion from which the current paper derives (“P-values and ‘Statistical Significance’: Deconstructing the Arguments”). Kafadar deserves enormous credit for creating the new task force.1 Although the new task force’s report, submitted shortly before the JSM 2020 meeting, has not been disclosed, Kafadar’s presentation noted that one of its recommendations is that there be a “disclaimer on all publications, articles, editorials, … authored by ASA Staff”.2 In this case, a disclaimer would have noted that the 2019 Editorial is not ASA policy. Still, given that its authors include ASA officials, it has a great deal of impact. 

We should indeed move away from unthinking and rigid uses of thresholds—not just with significance levels, but also with confidence levels and other quantities. No single statistical quantity from any school, by itself, is an adequate measure of evidence, for any of the many disparate meanings of “evidence” one might adduce. Thus, it is no special indictment of P-values that they fail to supply such a measure. We agree as well that the actual P-value should be reported, as all the founders of tests recommended (see Mayo 2018, Excursion 3 Tour II). But the 2019 Editorial goes much further. In its view: Prespecified P-value thresholds should not be used at all in interpreting results. In other words, the position advanced by the 2019 Editorial, “reject statistical significance”, is not just a word ban but a gatekeeper ban. For example, in order to comply with its recommendations, the FDA would have to end its “long established drug review procedures that involve comparing p-values to significance thresholds for Phase III drug trials” as the authors admit (p. 10). 

Kafadar is right to see the 2019 Editorial as challenging the overall use of hypothesis tests, even though it is not banning P-values. Although P-values can be used as descriptive measures, rather than as tests, when we wish to employ them as tests, we require thresholds. Ideally there are several P-value benchmarks, but even that is foreclosed if we take seriously their view: “[T]he problem is not that of having only two labels. Results should not be trichotomized, or indeed categorized into any number of groups…” (WSL 2019, p. 2). 

The March 2019 Editorial (WSL 2019) also includes a detailed introduction to a special issue of The American Statistician (“Moving to a World beyond p < 0.05”). The position that I will discuss, reject statistical significance, (“don’t say ‘significance’, don’t use P-value thresholds”), is outlined largely in the first two sections of the 2019 Editorial. What are the arguments given for the leap from the reasonable principles of the 2016 ASA Statement to the dramatic “reject statistical significance” position? Do they stand up to principles for good argumentation? 

Continue reading the paper here. Please share your comments.


1 Linda Young, (Co-Chair), Xuming He, (Co-Chair) Yoav Benjamini, Dick De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xiao-Li Meng, Vijay Nair, Nancy Reid, Stephen Stigler, Stephen Vardeman, Chris Wikle, Tommy Wright, Karen Kafadar, Ex-officio. (Kafadar 2020) 

2 Kafadar, K., “P-values: Assumptions, Replicability, ‘Significance’,” slides given in the Contributed Panel: P-Values and “Statistical Significance”: Deconstructing the Arguments at the (virtual) JSM 2020. (August 6, 2020). 

^CITATION: Mayo, D. (2020). Rejecting Statistical Significance Tests: Defanging the Arguments. In JSM Proceedings, Statistical Consulting Section. Alexandria, VA: American Statistical Association. (2020). 236-256.

*Jan 11 update. The ASA executive director, Ron Wasserstein, wants to emphasize that it is leaving to the members of the Task Force when and how to release the report on their own. I do not know if it will do so or if all of the authors will agree to this shift. Personally, I don’t know why the ASA Board would not wish to reveal the recommendations of the Task Force that it created–even without any presumption that it thereby is understood to be a policy document. There can be a clear disclaimer that it is not. The Task Force carried out the work that was asked of them in a timely manner. You can find a statement of the charge given to the Task Force in my comments.

Categories: 2016 ASA Statement on P-values, JSM 2020, replication crisis, statistical significance tests, straw person fallacy | 7 Comments

Next Phil Stat Forum: January 7: D. Mayo: Putting the Brakes on the Breakthrough (or “How I used simple logic to uncover a flaw in …..statistical foundations”)

The fourth meeting of our New Phil Stat Forum*:

The Statistics Wars
and Their Casualties

January 7, 16:00 – 17:30  (London time)
11 am-12:30 pm (New York, ET)**
**note time modification and date change

Putting the Brakes on the Breakthrough,

or “How I used simple logic to uncover a flaw in a controversial 60-year old ‘theorem’ in statistical foundations” 

Deborah G. Mayo



ABSTRACT: An essential component of inference based on familiar frequentist (error statistical) notions (p-values, statistical significance and confidence levels) is the relevant sampling distribution (hence the term sampling theory). This results in violations of a principle known as the strong likelihood principle (SLP), or just the likelihood principle (LP), which says, in effect, that outcomes other than those observed are irrelevant for inferences within a statistical model. Now Allan Birnbaum was a frequentist (error statistician), but he found himself in a predicament: He seemed to have shown that the LP follows from uncontroversial frequentist principles! Bayesians, such as Savage, heralded his result as a “breakthrough in statistics”! But there’s a flaw in the “proof”, and that’s what I aim to show in my presentation by means of 3 simple examples:

  • Example 1: Trying and Trying Again
  • Example 2: Two instruments with different precisions
    (you shouldn’t get credit/blame for something you didn’t do)
  • The Breakthrough: Don’t Birnbaumize that data my friend

As in the last 9 years, I posted an imaginary dialogue (here) with Allan Birnbaum at the stroke of midnight, New Year’s Eve, and this will be relevant for the talk.

The Phil Stat Forum schedule is at the Phil-Stat-Wars.com blog 


One of the following 3 papers:

My earliest treatment via counterexample:

A deeper argument can be found in:

For an intermediate Goldilocks version (based on a presentation given at the JSM 2013):

This post from the Error Statistics Philosophy blog will get you oriented. (It has links to other posts on the LP & Birnbaum, as well as background readings/discussions for those who want to dive deeper into the topic.)

Slides and Video Links:

D. Mayo’s slides: “Putting the Brakes on the Breakthrough, or ‘How I used simple logic to uncover a flaw in a controversial 60-year old ‘theorem’ in statistical foundations’”

D. Mayo’s  presentation:

Discussion on Mayo’s presentation:

Mayo’s Memos: Any info or events that arise that seem relevant to share with y’all before the meeting.

You may wish to look at my rejoinder to a number of statisticians: Rejoinder “On the Birnbaum Argument for the Strong Likelihood Principle”. (It is also above in the link to the complete discussion in the 3rd reading option.)

I often find it useful to look at other treatments. So I put together this short supplement to glance through to clarify a few select points.

Please post comments on the Phil Stat Wars blog here.


Categories: Birnbaum, Birnbaum Brakes, Likelihood Principle | 5 Comments

Midnight With Birnbaum (Remote, Virtual Happy New Year 2020)!

Unlike in the past 9 years since I’ve been blogging, I can’t revisit that spot in the road outside the Elbar Room, looking to get into a strange-looking taxi, to head to “Midnight With Birnbaum”. Because of the pandemic, I refuse to go out this New Year’s Eve, so the best I can hope for is a zoom link that will take me to a hypothetical party with him. (The pic on the left is the only blurry image I have of the club I’m taken to.) I just keep watching my email, to see if a zoom link arrives. My book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST 2018) doesn’t rehearse the argument from my Birnbaum article, but there’s much in it that I’d like to discuss with him. The (Strong) Likelihood Principle, whether or not it is named, remains at the heart of many of the criticisms of Neyman-Pearson (N-P) statistics and statistical significance testing in general. Let’s hope that in 2021 the American Statistical Association (ASA) will finally reveal the recommendations from the ASA Task Force on Statistical Significance and Replicability that the ASA Board itself created one year ago. They completed their recommendations early, back at the end of July 2020, but no response from the ASA has been forthcoming (to my knowledge). As Birnbaum insisted, the “confidence concept” is the “one rock in a shifting scene” of statistical foundations, insofar as there’s interest in controlling the frequency of erroneous interpretations of data. (See my rejoinder.) Birnbaum bemoaned the lack of an explicit evidential interpretation of N-P methods. I purport to give one in SIST 2018. Maybe it will come to fruition in 2021? Anyway, I was just sent an internet link, but it’s not zoom, not Skype, not Webinex, or anything I’ve ever seen before….no time to describe it now, but I’m recording and the rest of the transcript is live; this year there are some new, relevant additions. Happy New Year! Continue reading

Categories: Birnbaum Brakes, strong likelihood principle

A Perfect Time to Binge Read the (Strong) Likelihood Principle


An essential component of inference based on familiar frequentist notions: p-values, significance and confidence levels, is the relevant sampling distribution (hence the term sampling theory, or my preferred error statistics, as we get error probabilities from the sampling distribution). This feature results in violations of a principle known as the strong likelihood principle (SLP). To state the SLP roughly, it asserts that all the evidential import in the data (for parametric inference within a model) resides in the likelihoods. If accepted, it would render error probabilities irrelevant post data. Continue reading

Categories: Birnbaum, Birnbaum Brakes, law of likelihood | 3 Comments

Cox’s (1958) Chestnut: You should not get credit (or blame) for something you didn’t do


Just as you keep up your physical exercise during the pandemic (sure), you want to keep up with mental gymnastics too. With that goal in mind, and given we’re just a few days from the New Year (and given especially my promised presentation for January 7), here’s one of the two simple examples that will limber you up for the puzzle to ensue. It’s the famous weighing machine example from Sir David Cox (1958)[1]. It is one of the “chestnuts” in the museum exhibits of “chestnuts and howlers” in Excursion 3 (Tour II) of my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018). So block everything else out for a few minutes and consider 3 pages from SIST …  Continue reading

Categories: Birnbaum, Statistical Inference as Severe Testing, strong likelihood principle | 4 Comments

The Statistics Debate (NISS) in Transcript Form

I constructed, together with Jean Miller, a transcript from the October 15 Statistics Debate (with me, J. Berger and D. Trafimow and moderator D. Jeske), sponsored by NISS. It’s so much easier to access the material this way rather than listening to it on the video. Using this link, you can see the words and hear the video at the same time, as well as pause and jump around. Below, I’ve pasted our responses to Question #1. Have fun and please share your comments.

Deborah Mayo  03:46

Thank you so much. And thank you for inviting me, I’m very pleased to be here. Yes, I say we should continue to use p values and statistical significance tests. Uses of p values are really just a piece in a rich set of tools intended to assess and control the probabilities of misleading interpretations of data, i.e., error probabilities. They’re the first line of defense against being fooled by randomness as Yoav Benjamini puts it. If even larger, or more extreme effects than you observed are frequently brought about by chance variability alone, i.e., p value not small, clearly you don’t have evidence of incompatibility with the mere chance hypothesis. It’s very straightforward reasoning. Even those who criticize p values you’ll notice will employ them, at least if they care to check the assumptions of their models. And this includes well-known Bayesians such as George Box, Andrew Gelman, and Jim Berger. Critics of p values often allege it’s too easy to obtain small p values. But notice the whole replication crisis is about how difficult it is to get small p values with preregistered hypotheses. This shows the problem isn’t p values, but those selection effects and data dredging. However, the same data-dredged hypotheses can occur in other methods, likelihood ratios, Bayes factors, Bayesian updating, except that now we lose the direct grounds to criticize inferences for flouting error statistical control. The introduction of prior probabilities, which may also be data dependent, offers further researcher flexibility. Those who reject p values are saying we should reject the method because it can be used badly. And that’s a bad argument. We should reject misuses of p values. But there’s a danger of blindly substituting alternative tools that throw out the error control baby with the bad statistics bathwater.

Dan Jeske  05:58

Thank you, Deborah, Jim, would you like to comment on Deborah’s remarks and offer your own?

Jim Berger  06:06

Okay, yes. Well, I certainly agree with much of what Deborah said; after all, a p value is simply a statistic. And it’s an interesting statistic that does have many legitimate uses, when properly calibrated. And Deborah mentioned one such case, model checking, where Bayesians freely use some version of p values. On the other hand, if one interprets this question as, should they continue to be used in the same way that they’re used today, then my answer would be somewhat different. I think p values are commonly misinterpreted today, especially when they’re used to test a sharp null hypothesis. For instance, a p value of .05 is commonly interpreted by many as indicating the evidence is 20 to one in favor of the alternative hypothesis. And that just isn’t true. You can show, for instance, that if I’m testing a normal mean of zero versus nonzero, the odds of the alternative hypothesis to the null hypothesis can at most be seven to one. And that’s just a probabilistic fact, doesn’t involve priors or anything. It just is an answer covering all probability. And so if a p value of .05 is interpreted as 20 to one, it’s just being interpreted wrongly, and the wrong conclusions are being reached. I’m reminded of an interesting paper that was published some time ago now, which was reporting on a survey that was designed to determine whether clinical practitioners understood what a p value was. The results of the survey were published and were not surprising. Most clinical practitioners interpreted a p value of .05 as something like 20 to one odds against the null hypothesis, which again, is incorrect. The fascinating aspect of the paper is that the authors also got it wrong. Deborah pointed out that the p value is the probability under the null hypothesis of the data or something more extreme. The authors stated that the correct answer was that the p value is the probability of the data under the null hypothesis; they forgot the “more extreme.” So, I love this article, because the scientists who set out to show that their colleagues did not understand the meaning of p values themselves did not understand the meaning of p values.
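[Editorial note: one common way to arrive at a figure close to the “seven to one” mentioned above is the maximum likelihood ratio bound for a two-sided z test. The transcript does not say which calculation was meant, so treat the following as a sketch under that assumption. The likelihood of the observed z under the best-supported point alternative, divided by its likelihood under the null, equals exp(z²/2). The function name is hypothetical.]

```python
import math
from statistics import NormalDist


def max_likelihood_ratio_bound(p_value: float) -> float:
    """Largest possible likelihood ratio (alternative : null) for a
    two-sided z test of a normal mean, given the two-sided p value.

    Assumption: the alternative is the single point that fits the data
    best (mean equal to the observed value), so no other alternative,
    and no prior over alternatives, can give a larger ratio.
    """
    # Convert the two-sided p value back to the observed |z| statistic.
    z = NormalDist().inv_cdf(1 - p_value / 2)
    # Ratio of normal densities phi(0) / phi(z) simplifies to exp(z^2 / 2).
    return math.exp(z * z / 2)


if __name__ == "__main__":
    bound = max_likelihood_ratio_bound(0.05)
    print(f"p = 0.05 -> likelihood ratio at most about {bound:.1f} to 1")
```

[With p = .05 the bound comes out a little under seven, well short of 20 to one. A smaller p value gives a larger bound.]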

Dan Jeske  08:42


David Trafimow  08:44

Okay. Yeah, um, like Deborah and Jim, I’m delighted to be here. Thanks for the invitation. Um, and I partly agree with what both Deborah and Jim said. Um, it’s certainly true that people misuse p values. So, I agree with that. However, I think p values are more problematic than the other speakers have mentioned. And here’s the problem for me. We keep talking about p values relative to hypotheses, but that’s not really true. P values are relative to hypotheses plus additional assumptions. So, if we use the term model to describe the null hypothesis plus additional assumptions, then p values are based on models, not on hypotheses, or only partly on hypotheses. Now, here’s the thing. What are these other assumptions? An example would be random selection from the population, an assumption that is not true in any one of the thousands of papers I’ve read in psychology. And there are other assumptions, a lack of systematic error, linearity, and then we can go on and on; people have even published taxonomies of the assumptions because there are so many of them. See, it’s tantamount to impossible that the model is correct, which means that the model is wrong. And so, what you’re in essence doing then, is you’re using the p value to index evidence against a model that is already known to be wrong. And even the part about indexing evidence is questionable, but I’ll go with it for the moment. But the point is the model was wrong. And so, there’s no point in indexing evidence against it. So given that, I don’t really see that there’s any use for them. P values don’t tell you how close the model is to being right. P values don’t tell you how valuable the model is. P values pretty much don’t tell you anything that researchers might want to know, unless you misuse them. Anytime you draw a conclusion from a p value, you are guilty of misuse. So, I think the misuse problem is much more subtle than is perhaps obvious at first hand. So, that’s really all I have to say at the moment.

Dan Jeske  11:28

Thank you. Jim, would you like to follow up?

Jim Berger  11:32

Yes, so, I certainly agree that assumptions are often made that are wrong. I won’t say that that’s always the case. I mean, I know many scientific disciplines where I think they do a pretty good job. I work with high energy physicists, and they do a pretty good job of checking their assumptions. Excellent job. And they use p values. It’s something to watch out for. But any statistical analysis, you know, can run into this problem. If the assumptions are wrong, it’s going to be wrong.

Dan Jeske  12:09


Deborah Mayo  12:11

Okay. Well, Jim thinks that we should evaluate the p value by looking at the Bayes factor, and when he does, he finds that they’re exaggerating. But we really shouldn’t expect agreement on numbers from methods that are evaluating different things. This is like supposing that if we switch from a height to a weight standard, and we used six feet with the height, we should now require six stone, to use an example from Stephen Senn. On David, I think he’s wrong about the worrying assumptions with using the p value, since they have the fewest assumptions of any method, which is why people, even Bayesians, will say we need to apply them when we need to test our assumptions. And it’s something that we can do, especially with randomized controlled trials, to get the assumptions to work. The idea that we have to misinterpret p values to have them be relevant only rests on supposing that we need something other than what the p value provides.

Dan Jeske  13:19

David, would you like to give some final thoughts on this question?

David Trafimow  13:23

Sure. As far as Jim’s point, and Deborah’s point that we can do things to make the assumptions less wrong, the problem is the model is wrong or it isn’t wrong. Now if the model is close, that doesn’t justify the p value, because the p value doesn’t give the closeness of the model. And that’s the problem. We’re not using, for example, a sample mean to estimate a population mean, in which case, yeah, you wouldn’t expect the sample mean to be exactly right. If it’s close, it’s still useful. The problem is that p values aren’t being used to estimate anything. So, if you’re not estimating anything, then you’re stuck with either correct or incorrect, and the answer is always incorrect. You know, this is especially true in psychology, but I suspect it might even be true in physics. I’m not the physicist that Jim is. So, I can’t say that for sure.

Dan Jeske  14:35

Jim, would you like to offer final thoughts?

Jim Berger  14:37

Let me comment on Deborah’s comment that Bayes factors are just a different scale of measurement. My point was that it seems like people invariably think of p values as something like odds or probability of the null hypothesis. If that’s the way they’re thinking, because that’s the way their minds reason, I believe we should provide them with odds. And so, I try to convert p values into odds or Bayes factors, because I think that’s much more readily understandable by people.

Dan Jeske  15:11

Deborah, you have the final word on this question.

Deborah Mayo  15:13

I do think that we need a proper philosophy of statistics to interpret p values. But I think also that what’s missing in the reject p values movement is that a major reason for calling in statistics in science is to give us tools to inquire whether an observed phenomenon can be a real effect, or just noise in the data. And p values have intrinsic properties for this task, if used properly; other methods don’t, and to reject them is to jeopardize this important role. As Fisher emphasizes, we need randomized controlled trials precisely to ensure the validity of statistical significance tests; to reject them because they don’t give us posterior probabilities is illicit. In fact, I think that those who claim that we want such posteriors need to show, for any way we can actually get them, why.

You can watch the debate at the NISS website or in this blog post.

You can find the complete audio transcript at this LINK: https://otter.ai/u/hFILxCOjz4QnaGLdzYFdIGxzdsg
[There is a play button at the bottom of the page that allows you to start and stop the recording. You can move about in the transcript/recording by using the pause button and moving the cursor to another place in the dialog and then clicking the play button to hear the recording from that point. (The recording is synced to the cursor.)]

Categories: D. Jeske, D. Trafimow, J. Berger, NISS, statistics debate | 1 Comment

Is it impossible to commit Type I errors in statistical significance tests? (i)


While immersed in our fast-paced, remote NISS debate (October 15) with J. Berger and D. Trafimow, I didn’t immediately catch all that was said by my co-debaters (I will shortly post a transcript). We had all opted for no practice. But looking over the transcript, I was surprised that David Trafimow was indeed saying the answer to the question in my title is yes. Here are some excerpts from his remarks: Continue reading

Categories: D. Trafimow, J. Berger, National Institute of Statistical Sciences (NISS), Testing Assumptions | 29 Comments

S. Senn: “A Vaccine Trial from A to Z” with a Postscript (guest post)


Stephen Senn
Consultant Statistician
Edinburgh, Scotland

Alpha and Omega (or maybe just Beta)

Well actually, not from A to Z but from AZ. That is to say, the trial I shall consider is the placebo-controlled trial of the Oxford University vaccine for COVID-19 currently being run by AstraZeneca (AZ) under protocol AZD1222 – D8110C00001 and which I considered in a previous blog, Heard Immunity. A summary of the design features is given in Table 1. The purpose of this blog is to look a little deeper at features of the trial and the way I am going to do so is with the help of geometric representations of the sample space, that is to say the possible results the trial could produce. However, the reader is warned that I am only an amateur in all this. The true professionals are the statisticians at AZ who, together with their life science colleagues in AZ and Oxford, designed the trial. Continue reading

Categories: covid-19, RCTs, Stephen Senn | 9 Comments

Phil Stat Forum: November 19: Stephen Senn, “Randomisation and Control in the Age of Coronavirus?”

For information about the Phil Stat Wars forum and how to join, see this post and this pdf. 

Continue reading

Categories: Error Statistics, randomization | Leave a comment

S. Senn: Testing Times (Guest post)



Stephen Senn
Consultant Statistician
Edinburgh, Scotland

Testing Times

Screening for attention

There has been much comment on Twitter and other social media about testing for coronavirus and the relationship between a test being positive and the person tested having been infected. Some primitive form of Bayesian reasoning is often used to justify concern that an apparent positive may actually be falsely so, with specificity and sensitivity taking the roles of likelihoods and prevalence that of a prior distribution. This way of looking at testing dates back at least to a paper of 1959 by Ledley and Lusted[1]. However, as others[2, 3] have pointed out, there is a trap for the unwary in this, in that it is implicitly assumed that specificity and sensitivity are constant values unaffected by prevalence and it is far from obvious that this should be the case. Continue reading
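[Editorial note: the “primitive form of Bayesian reasoning” described above can be sketched as a short calculation of the probability of infection given a positive test. The function name and the numbers are purely illustrative assumptions, not taken from the post. The sketch also bakes in exactly the assumption the excerpt warns about, namely that sensitivity and specificity do not change with prevalence.]

```python
def positive_predictive_value(sensitivity: float,
                              specificity: float,
                              prevalence: float) -> float:
    """Probability of infection given a positive test, via Bayes' theorem.

    Assumption (the 'trap for the unwary'): sensitivity and specificity
    are treated as fixed constants that do not depend on prevalence.
    """
    # Probability that a randomly tested person is infected AND tests positive.
    true_positive = sensitivity * prevalence
    # Probability that a randomly tested person is uninfected AND tests positive.
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)


if __name__ == "__main__":
    # Hypothetical values chosen only to illustrate the arithmetic.
    ppv = positive_predictive_value(sensitivity=0.90,
                                    specificity=0.99,
                                    prevalence=0.01)
    print(f"P(infected | positive) = {ppv:.2f}")
```

[With these made-up inputs, fewer than half of the positives are true positives, which is the kind of result that prompts the concern described above.]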

Categories: S. Senn, significance tests, Testing Assumptions | 14 Comments

Souvenir From the NISS Stat Debate for Users of Bayes Factors (& P-Values)


What would I say is the most important takeaway from last week’s NISS “statistics debate” if you’re using (or contemplating using) Bayes factors (BFs)–of the sort Jim Berger recommends–as replacements for P-values? It is that J. Berger only regards the BFs as appropriate when there’s grounds for a high concentration (or spike) of probability on a sharp null hypothesis, e.g., H0: θ = θ0.

Thus, it is crucial to distinguish between precise hypotheses that are just stated for convenience and have no special prior believability, and precise hypotheses which do correspond to a concentration of prior belief. (J. Berger and Delampady 1987, p. 330).

Continue reading

Categories: bayes factors, Berger, P-values, S. Senn | 4 Comments

My Responses (at the P-value debate)


How did I respond to those 7 burning questions at last week’s (“P-Value”) Statistics Debate? Here’s a fairly close transcript of my (a) general answer, and (b) final remark, for each question–without the in-between responses to Jim and David. The exception is question 5 on Bayes factors, which naturally included Jim in my general answer. 

The questions with the most important consequences, I think, are questions 3 and 5. I’ll explain why I say this in the comments. Please share your thoughts. Continue reading

Categories: bayes factors, P-values, Statistics, statistics debate NISS | 1 Comment

The P-Values Debate



National Institute of Statistical Sciences (NISS): The Statistics Debate (Video)

Categories: J. Berger, P-values, statistics debate | 14 Comments

The Statistics Debate! (NISS DEBATE, October 15, Noon – 2 pm ET)

October 15, Noon – 2 pm ET (Website)

Where do YOU stand?

Given the issues surrounding the misuses and abuse of p-values, do you think p-values should be used? Continue reading

Categories: Announcement, J. Berger, P-values, Philosophy of Statistics, reproducibility, statistical significance tests, Statistics | 9 Comments

CALL FOR PAPERS (Synthese) Recent Issues in Philosophy of Statistics: Evidence, Testing, and Applications


Call for Papers: Topical Collection in Synthese

Title: Recent Issues in Philosophy of Statistics: Evidence, Testing, and Applications

The deadline for submissions is 1 December 2020 (extended from 1 November 2020)

Description: Continue reading

Categories: Announcement, CFP, Synthese | Leave a comment

G.A. Barnard’s 105th Birthday: The Bayesian “catch-all” factor: probability vs likelihood


G. A. Barnard: 23 Sept 1915-30 July, 2002

Yesterday was statistician George Barnard’s 105th birthday. To acknowledge it, I reblog an exchange between Barnard, Savage (and others) on likelihood vs probability. The exchange is from pp. 79-84 (of what I call) “The Savage Forum” (Savage, 1962).[i] A portion appears on p. 420 of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). Six other posts on Barnard are linked below, including 2 guest posts (Senn, Spanos); a play (pertaining to our first meeting); and a letter Barnard wrote to me in 1999. Continue reading

Categories: Barnard, phil/history of stat, Statistics | 10 Comments

Live Exhibit: Bayes Factors & Those 6 ASA P-value Principles


Live Exhibit: So what happens if you replace “p-values” with “Bayes Factors” in the 6 principles from the 2016 American Statistical Association (ASA) Statement on P-values? (Remove “or statistical significance” in question 5.)

Does the one positive assertion hold? Are the 5 “don’ts” true? Continue reading

Categories: ASA Guide to P-values, bayes factors | 2 Comments

September 24: Bayes factors from all sides: who’s worried, who’s not, and why (R. Morey)

Information and directions for joining our forum are here.

Continue reading

Categories: Announcement, bayes factors, Error Statistics, Phil Stat Forum, Richard Morey | 1 Comment
