Stephen Senn
Consultant Statistician
Edinburgh, Scotland
Although I have researched clinical trial design for many years, prior to the COVID-19 pandemic I had had nothing to do with vaccines. The only object of these amateur musings is to amuse amateurs by raising some issues I have pondered and found interesting.
In this blog I am going to cover the statistical analysis used by Pfizer/BioNTech (hereafter referred to as P&B) in their big phase III trial of a vaccine for COVID-19. I considered this in some previous posts, in particular Heard Immunity and Infectious Enthusiasm. The first of these posts compared the P&B trial to two others, a trial run by Moderna and another by AstraZeneca and Oxford University (hereafter referred to as AZ&Ox), and the second discussed the results that P&B reported.
Figure 1 Stopping boundaries for three trials. Labels are anticipated numbers of cases at the looks.
All three trials were sequential in nature and, as is only proper, all three protocols gave details of the proposed stopping boundaries. These are given in Figure 1. AZ&Ox proposed to have two looks, Moderna three and P&B five. As things turned out, there was only one interim look in the P&B trial, so there were two looks in total, not five.
Moderna and AZ&Ox specified a frequentist approach in their protocols and P&B a Bayesian one. It is aspects of this Bayesian approach that I propose to consider.
It is common to measure vaccine efficacy in terms of a relative risk reduction expressed as a percentage. The percentage is a nuisance and instead I shall express it as a simple ratio. If ψ, π_c, π_v are the true vaccine efficacy and the probabilities of being infected in the control and vaccine groups respectively, then

ψ = 1 − π_v/π_c.    (1)
Note that if we have Y_c, Y_v cases in the control and vaccine arms respectively and n_c, n_v subjects, then an intuitively reasonable estimate of ψ is

VE = 1 − (Y_v/n_v)/(Y_c/n_c),    (2)

where VE is the observed vaccine efficacy.

If the total number of subjects is N and we have n_c = rN, n_v = (1 − r)N, with r being the proportion of subjects on the control arm, then we have

VE = 1 − {r/(1 − r)} × (Y_v/Y_c).    (3)
Note that if r = 1 − r = 1/2, that is to say that there are equal numbers of subjects on both arms, then (3) simply reduces to one minus the ratio of observed cases. VE thus has the curious property that its maximum value is 1 (when there are no cases in the vaccine group) but its minimum value is −∞ (when there are no cases in the control group and at least one in the vaccine group).
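These formulas are simple enough to sketch in code. A minimal illustration (the counts here are made up for illustration, not trial data):

```python
# Observed vaccine efficacy from equation (2): VE = 1 - (Yv/nv)/(Yc/nc).
def vaccine_efficacy(y_c, y_v, n_c, n_v):
    """Observed VE; returns -inf when there are cases in the vaccine
    arm but none in the control arm, matching the range noted above."""
    if y_c == 0:
        return float("-inf") if y_v > 0 else float("nan")
    return 1 - (y_v / n_v) / (y_c / n_c)

# Illustrative counts only:
print(vaccine_efficacy(100, 0, 1000, 1000))    # -> 1.0 (no vaccine-arm cases)
print(vaccine_efficacy(100, 100, 1000, 1000))  # -> 0.0 (equal attack rates)
```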
A contour plot of vaccine efficacy as a function of the control and vaccine group probabilities of infection is given in Figure 2.
Figure 2 Vaccine efficacy as a function of the probability of infection in the control and vaccine groups.
P&B specified a prior distribution for their analysis but very sensibly shied away from attempting to do one for vaccine efficacy directly. Instead, they considered a transformation, or re-scaling, of the parameter, defining

θ = (1 − ψ)/(2 − ψ).    (4)
This looks rather strange but in fact it can be re-expressed as

θ = π_v/(π_c + π_v).    (5)
Figure 3 Contour plot of transformed vaccine efficacy, θ.
Its contour plot is given in Figure 3. The transformation is the ratio of the probability of infection in the vaccine group to the sum of the probabilities in the two groups. It thus takes on a value between 0 and 1 and this in turn implies that it behaves like a probability. In fact if we have equal numbers of subjects on both arms and condition on the total numbers of cases we can regard it as being the probability that a randomly chosen case will be in the vaccine group and therefore as an estimate of the efficacy of the vaccine. The lower this probability, the more effective the vaccine.
In fact, this simple analysis captures much of what the data have to tell. In estimating vaccine efficacy in previous posts, I simply treated this ratio as a probability and estimated ‘exact’ confidence intervals using the binomial distribution. Having calculated the intervals on this scale, I back-transformed them to the vaccine efficacy scale.
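As a sketch of that procedure: an ‘exact’ Clopper-Pearson interval for θ, back-transformed by inverting (4). The split used below, 8 vaccine-arm cases out of 170 in total, is the one P&B reported:

```python
from scipy.stats import beta

def clopper_pearson(k, n, conf=0.95):
    """'Exact' binomial confidence interval for a proportion."""
    a = (1 - conf) / 2
    lo = beta.ppf(a, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - a, k + 1, n - k) if k < n else 1.0
    return lo, hi

def theta_to_ve(theta):
    """Invert equation (4): VE = (1 - 2*theta)/(1 - theta)."""
    return (1 - 2 * theta) / (1 - theta)

lo, hi = clopper_pearson(8, 170)          # 8 vaccine-arm cases out of 170
# theta_to_ve is decreasing, so the interval endpoints swap:
print(theta_to_ve(hi), theta_to_ve(lo))   # lower and upper limits for VE
```

The point estimate, theta_to_ve(8/170), is about 0.95, in line with the headline efficacy figure.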
A prior distribution that is commonly used for modelling data using the binomial distribution is the beta-distribution (see pp. 55-61 of Forbes et al[1], 2011), hence the title of this post. This is a two-parameter distribution with parameters (say) ν, ω, mean

μ = ν/(ν + ω)    (6)

and variance

νω/{(ν + ω)²(ν + ω + 1)}.    (7)
Thus, the relative value of the two parameters governs the mean and, the larger the parameter values, the smaller the variance. A special case of the beta distribution is the uniform distribution, which can be obtained by setting both parameter values to 1. The resulting mean is 1/2 and the variance is 1/12. Parameter values of ½ and ½ give a distribution with the same mean but a larger variance of 1/8. For comparison, the mean and variance of a binomial proportion are p, p(1 − p)/n and if you set p = 1/2, n = 2 you get the same mean and variance. This gives a feel for how much information is contained in a prior distribution.
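These moments are easy to confirm numerically; a quick check of the two reference priors mentioned above (assuming scipy is available):

```python
from scipy.stats import beta

# Mean nu/(nu + omega) and variance nu*omega/((nu + omega)^2 (nu + omega + 1)),
# equations (6) and (7), for the uniform beta(1,1) and the beta(1/2,1/2).
for nu, omega in [(1, 1), (0.5, 0.5)]:
    m, v = beta.stats(nu, omega, moments="mv")
    print(f"beta({nu},{omega}): mean={float(m):.4f}, variance={float(v):.4f}")
# beta(1,1):     mean 1/2, variance 1/12
# beta(0.5,0.5): mean 1/2, variance 1/8
```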
P&B chose a beta distribution for θ with ν = 0.700102, ω = 1. The prior distribution is plotted in Figure 4. It has a mean of 0.4118 and a variance of approximately 1/11. I now propose to discuss, and speculate on, how these values were arrived at. I come to a rather cynical conclusion and before I give you my reasoning, I want to make two points quite clear:
a) My cynicism does not detract from my admiration for what P&B have done. I think the achievement is magnificent, not only in terms of the basic science but also in terms of trial design, management and delivery.
b) I am not criticising the choice of a Bayesian analysis. In fact, I found that rather interesting.
However, I think it is appropriate to establish what exactly is incorporated in any prior distribution and that is what I propose to do.
First note the extraordinary number of significant figures (six) in the first parameter of the beta distribution, which has a value of 0.700102. The distribution itself (as established by its variance) is not very informative, but at first sight there would seem to be a great deal of information about the prior distribution itself. This is a feature of some analyses that I have drawn attention to before. See Dawid’s Selection Paradox.
Figure 4 Prior distribution for θ
So this is what I think happened. P&B reached for a common default uniform distribution, a beta with parameters 1,1. However, this would yield an expected value of θ = 1/2, which corresponds to a vaccine efficacy of ψ = 0. On the other hand, they wished to show that the vaccine efficacy ψ was greater than 0.3. They thus asked the question: what value of θ corresponds to a value of ψ = 0.3? Substituting in (4), the answer is 0.4117647, or 0.4118 to four decimal places. They explained this in the protocol as follows: ‘The prior is centered at θ = 0.4118 (VE = 30%) which may be considered pessimistic’.
Figure 5 Combination of parameters for the prior distribution yielding the required mean. The diagonal light blue line gives the combination of values that will produce the desired mean. The red diamond gives the parameter combination chosen by P&B. The blue circle gives the parameter combination that would also have produced the mean chosen but also the same variance as a beta(1,1). The contour lines show the variance of the distribution as a function of the two parameters.
Note that the choice of the word centered is odd. The mean of the distribution is 0.4118 but the distribution is not really centered there. Be that as it may, they now had infinitely many possible combinations of values for ν, ω that would yield an expected value of 0.4118. Note that solving (6) for ν yields

ν = μω/(1 − μ)

and plugging in μ = 0.4118, ω = 1 gives ν = 0.700102. Possible choices of parameter combinations yielding the same mean are given in Figure 5. An alternative to the beta(0.700102, 1) they chose might have been beta(0.78516, 1.1215). This would have yielded the same mean but with the same variance as a conventional beta(1,1).
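The arithmetic of this reconstruction can be checked directly. A sketch, using the rounded mean the protocol quotes and the alternative beta(0.78516, 1.1215) parameterisation suggested above:

```python
from scipy.stats import beta

theta_30 = (1 - 0.30) / (2 - 0.30)  # equation (4) at VE = 0.30: 0.4117647...
mu = 0.4118                         # the protocol's rounded 'center'
nu = mu * 1.0 / (1 - mu)            # solving (6) with omega = 1
print(nu)                           # -> 0.700102...

m1, v1 = beta.stats(nu, 1.0, moments="mv")
m2, v2 = beta.stats(0.78516, 1.1215, moments="mv")
print(float(m1), float(v1))  # mean ~0.4118, variance ~1/11
print(float(m2), float(v2))  # same mean, but variance ~1/12, as for beta(1,1)
```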
It is also somewhat debatable as to whether pessimistic is the right word. The distribution is certainly very uninformative. Note also that just because the mean value on the θ scale transforms to a vaccine efficacy of 0.30, it does not follow that 0.30 is the mean value of the vaccine efficacy. Only medians are guaranteed to be invariant under (monotonic) transformation. The median of the distribution of θ is 0.3716 and this corresponds to a median vaccine efficacy of 0.4088.
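A quick numerical check of the median claim, using the beta(0.700102, 1) prior given above:

```python
from scipy.stats import beta

nu, omega = 0.700102, 1.0
med_theta = beta.ppf(0.5, nu, omega)   # median on the theta scale

def theta_to_ve(theta):
    """Invert equation (4): VE = (1 - 2*theta)/(1 - theta)."""
    return (1 - 2 * theta) / (1 - theta)

print(med_theta)               # ~0.3716
print(theta_to_ve(med_theta))  # median VE ~0.4088, not 0.30
# The mean theta, 0.4118, does map to VE = 0.30, but the mean VE differs:
# means, unlike medians, are not invariant under this nonlinear transform.
```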
Perhaps unfairly, I could ask, ‘what has the Bayesian element added to the analysis?’ A Bayesian might reply, ‘what advantage does subtracting the Bayesian element bring to the analysis?’ Nevertheless, the choice of prior distribution here points to a problem. It clearly does not reflect what anybody believed about the vaccine efficacy before the trial began. Of course, establishing reasonable prior parameters for any statistical analysis is extremely difficult[2].
On the other hand, if a purely conventional prior is required why not choose beta(1,1) or beta(1/2,1/2), say? I think the 0.3 hypothesised value for vaccine efficacy is a red herring here. What should be of interest to a Bayesian is the posterior probability that the vaccine efficacy is greater than 30%. This does not require that the prior distribution is ‘centred’ on this value.
Of course the point is that provided that the variance of the prior distribution is large enough, the posterior inference is scarcely affected. In any case a Bayesian might reply, ‘if you don’t like the prior distribution choose your own’. To which a diehard frequentist might reply, ‘it is a bit late for choosing prior distributions’.
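This insensitivity is easy to illustrate. With a beta(ν, ω) prior on θ and y cases in the vaccine arm out of n in total, conjugate updating gives a beta(ν + y, ω + n − y) posterior. The sketch below uses the 8-of-170 split P&B reported and compares three weak priors; the choice barely moves the posterior probability that VE > 30%:

```python
from scipy.stats import beta

y_v, n = 8, 170                       # vaccine-arm cases / total cases
theta_crit = (1 - 0.30) / (2 - 0.30)  # VE > 30%  <=>  theta < 0.4118

for nu, omega in [(0.700102, 1), (1, 1), (0.5, 0.5)]:
    post = beta(nu + y_v, omega + n - y_v)   # conjugate beta posterior
    print(f"prior beta({nu},{omega}): "
          f"P(VE > 30% | data) = {post.cdf(theta_crit):.6f}")
# With data this strong, all three priors give a probability of essentially 1.
```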
I take two lessons from this, however. First, where Bayesian analyses are being used, we should all try to understand what the prior distribution implies about what we now ‘believe’ and how data would update such belief[3]. Second, disappointing as this may be to inferential enthusiasts, this sort of thing is not where the action is. The trial was well conceived, designed and conducted and the product was effective. My congratulations to all the scientists involved, including, but not limited to, the statisticians.
A little over a year ago, the board of the American Statistical Association (ASA) appointed a new Task Force on Statistical Significance and Replicability (under then president, Karen Kafadar), to provide it with recommendations. [Its members are here (i).] You might remember my blogpost at the time, “Les Stats C’est Moi”. The Task Force worked quickly, despite the pandemic, giving its recommendations to the ASA Board early, in time for the Joint Statistical Meetings at the end of July 2020. But the ASA hasn’t revealed the Task Force’s recommendations, and I just learned yesterday that it has no plans to do so*. A panel session I was in at the JSM, (P-values and ‘Statistical Significance’: Deconstructing the Arguments), grew out of this episode, and papers from the proceedings are now out. The introduction to my contribution gives you the background to my question, while revealing one of the recommendations (I only know of 2).
[i] Linda Young, (Co-Chair), Xuming He, (Co-Chair) Yoav Benjamini, Dick De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xiao-Li Meng, Vijay Nair, Nancy Reid, Stephen Stigler, Stephen Vardeman, Chris Wikle, Tommy Wright, Karen Kafadar, Ex-officio. (Kafadar 2020)
You can access the full paper here.
Rejecting Statistical Significance Tests: Defanging the Arguments^
Abstract: I critically analyze three groups of arguments for rejecting statistical significance tests (don’t say ‘significance’, don’t use P-value thresholds), as espoused in the 2019 Editorial of The American Statistician (Wasserstein, Schirm and Lazar 2019). The strongest argument supposes that banning P-value thresholds would diminish P-hacking and data dredging. I argue that it is the opposite. In a world without thresholds, it would be harder to hold accountable those who fail to meet a predesignated threshold by dint of data dredging. Forgoing predesignated thresholds obstructs error control. If an account cannot say about any outcomes that they will not count as evidence for a claim—if all thresholds are abandoned—then there is no test of that claim. Giving up on tests means forgoing statistical falsification. The second group of arguments constitutes a series of strawperson fallacies in which statistical significance tests are too readily identified with classic abuses of tests. The logical principle of charity is violated. The third group rests on implicit arguments. The first in this group presupposes, without argument, a different philosophy of statistics from the one underlying statistical significance tests; the second—appeals to popularity and fear—only exacerbates the ‘perverse’ incentives underlying today’s replication crisis.
1. Introduction and Background
Today’s crisis of replication gives a new urgency to critically appraising proposed statistical reforms intended to ameliorate the situation. Many are welcome, such as preregistration, testing by replication, and encouraging a move away from cookbook uses of statistical methods. Others are radical and might inadvertently obstruct practices known to improve on replication. The problem is one of evidence policy, that is, it concerns policies regarding evidence and inference. Problems of evidence policy call for a mix of statistical and philosophical considerations, and while I am not a statistician but a philosopher of science, logic, and statistics, I hope to add some useful reflections on the problem that confronts us today.
In 2016 the American Statistical Association (ASA) issued a statement on P-values, intended to highlight classic misinterpretations and abuses.
The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions. …. much confusion and even doubt about the validity of science is arising. (Wasserstein and Lazar 2016, p. 129)
The statement itself grew out of meetings and discussions with over two dozen others, and was specifically approved by the ASA board. The six principles it offers are largely rehearsals of fallacious interpretations to avoid. In a nutshell: P-values are not direct measures of posterior probabilities, population effect sizes, or substantive importance, and can be invalidated by biasing selection effects (e.g., cherry picking, P-hacking, multiple testing). The one positive principle is the first: “P-values can indicate how incompatible the data are with a specified statistical model” (ibid., p. 131).
The authors of the editorial that introduces the 2016 ASA Statement, Wasserstein and Lazar, assure us that “Nothing in the ASA statement is new” (p. 130). It is merely a “statement clarifying several widely agreed upon principles underlying the proper use and interpretation of the p-value” ( p. 131). Thus, it came as a surprise, at least to this outsider’s ears, to hear the authors of the 2016 Statement, along with a third co-author (Schirm), declare in March 2019 that: “The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” (Wasserstein, Schirm and Lazar 2019, p. 2, hereafter, WSL 2019).
The 2019 Editorial announces: “We take that step here….[I]t is time to stop using the term ‘statistically significant’ entirely. …[S]tatistically significant –don’t say it and don’t use it” (WSL 2019, p. 2). Not just outsiders to statistics were surprised. To insiders as well, the 2019 Editorial was sufficiently perplexing for the then ASA President, Karen Kafadar, to call for a New ASA Task Force on Significance Tests and Replicability.
Many of you have written of instances in which authors and journal editors—and even some ASA members—have mistakenly assumed this editorial represented ASA policy. The mistake is understandable: The editorial was co-authored by an official of the ASA.
… To address these issues, I hope to establish a working group that will prepare a thoughtful and concise piece … without leaving the impression that p-values and hypothesis tests…have no role in ‘good statistical practice’. (K. Kafadar, President’s Corner, 2019, p. 4)
This was a key impetus for the JSM panel discussion from which the current paper derives (“P-values and ‘Statistical Significance’: Deconstructing the Arguments”). Kafadar deserves enormous credit for creating the new task force.^{1} Although the new task force’s report, submitted shortly before the JSM 2020 meeting, has not been disclosed, Kafadar’s presentation noted that one of its recommendations is that there be a “disclaimer on all publications, articles, editorials, … authored by ASA Staff”.^{2} In this case, a disclaimer would have noted that the 2019 Editorial is not ASA policy. Still, given that its authors include ASA officials, it has a great deal of impact.
We should indeed move away from unthinking and rigid uses of thresholds—not just with significance levels, but also with confidence levels and other quantities. No single statistical quantity from any school, by itself, is an adequate measure of evidence, for any of the many disparate meanings of “evidence” one might adduce. Thus, it is no special indictment of P-values that they fail to supply such a measure. We agree as well that the actual P-value should be reported, as all the founders of tests recommended (see Mayo 2018, Excursion 3 Tour II). But the 2019 Editorial goes much further. In its view: Prespecified P-value thresholds should not be used at all in interpreting results. In other words, the position advanced by the 2019 Editorial, “reject statistical significance”, is not just a word ban but a gatekeeper ban. For example, in order to comply with its recommendations, the FDA would have to end its “long established drug review procedures that involve comparing p-values to significance thresholds for Phase III drug trials” as the authors admit (p. 10).
Kafadar is right to see the 2019 Editorial as challenging the overall use of hypothesis tests, even though it is not banning P-values. Although P-values can be used as descriptive measures, rather than as tests, when we wish to employ them as tests, we require thresholds. Ideally there are several P-value benchmarks, but even that is foreclosed if we take seriously their view: “[T]he problem is not that of having only two labels. Results should not be trichotomized, or indeed categorized into any number of groups…” (WSL 2019, p. 2).
The March 2019 Editorial (WSL 2019) also includes a detailed introduction to a special issue of The American Statistician (“Moving to a World beyond p < 0.05”). The position that I will discuss, reject statistical significance, (“don’t say ‘significance’, don’t use P-value thresholds”), is outlined largely in the first two sections of the 2019 Editorial. What are the arguments given for the leap from the reasonable principles of the 2016 ASA Statement to the dramatic “reject statistical significance” position? Do they stand up to principles for good argumentation?
Continue reading the paper here. Please share your comments.
NOTES:
^{1} Linda Young, (Co-Chair), Xuming He, (Co-Chair) Yoav Benjamini, Dick De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xiao-Li Meng, Vijay Nair, Nancy Reid, Stephen Stigler, Stephen Vardeman, Chris Wikle, Tommy Wright, Karen Kafadar, Ex-officio. (Kafadar 2020)
^{2} Kafadar, K., “P-values: Assumptions, Replicability, ‘Significance’,” slides given in the Contributed Panel: P-Values and “Statistical Significance”: Deconstructing the Arguments at the (virtual) JSM 2020. (August 6, 2020).
^CITATION: Mayo, D. (2020). Rejecting Statistical Significance Tests: Defanging the Arguments. In JSM Proceedings, Statistical Consulting Section. Alexandria, VA: American Statistical Association. (2020). 236-256.
*Jan 11 update. The ASA executive director, Ron Wasserstein, wants to emphasize that it is leaving to the members of the Task Force when and how to release the report on their own. I do not know if it will do so or if all of the authors will agree to this shift. Personally, I don’t know why the ASA Board would not wish to reveal the recommendations of the Task Force that it created–even without any presumption that it thereby is understood to be a policy document. There can be a clear disclaimer that it is not. The Task Force carried out the work that was asked of them in a timely manner. You can find a statement of the charge given to the Task Force in my comments.
The fourth meeting of our New Phil Stat Forum*:
The Statistics Wars
and Their Casualties
January 7, 16:00 – 17:30 (London time)
11 am-12:30 pm (New York, ET)**
**note time modification and date change
Putting the Brakes on the Breakthrough,
or “How I used simple logic to uncover a flaw in a controversial 60-year old ‘theorem’ in statistical foundations”
Deborah G. Mayo
HOW TO JOIN US: SEE THIS LINK
ABSTRACT: An essential component of inference based on familiar frequentist (error statistical) notions, such as p-values, statistical significance and confidence levels, is the relevant sampling distribution (hence the term sampling theory). This reliance results in violations of a principle known as the strong likelihood principle (SLP), or just the likelihood principle (LP), which says, in effect, that outcomes other than those observed are irrelevant for inferences within a statistical model. Now Allan Birnbaum was a frequentist (error statistician), but he found himself in a predicament: he seemed to have shown that the LP follows from uncontroversial frequentist principles! Bayesians, such as Savage, heralded his result as a “breakthrough in statistics”! But there’s a flaw in the “proof”, and that’s what I aim to show in my presentation by means of 3 simple examples:
As in the last 9 years, I posted an imaginary dialogue (here) with Allan Birnbaum at the stroke of midnight, New Year’s Eve, and this will be relevant for the talk.
The Phil Stat Forum schedule is at the Phil-Stat-Wars.com blog
One of the following 3 papers:
My earliest treatment via counterexample:
A deeper argument can be found in:
For an intermediate Goldilocks version (based on a presentation given at the JSM 2013):
This post from the Error Statistics Philosophy blog will get you oriented. (It has links to other posts on the LP & Birnbaum, as well as background readings/discussions for those who want to dive deeper into the topic.)
D. Mayo’s slides: “Putting the Brakes on the Breakthrough, or ‘How I used simple logic to uncover a flaw in a controversial 60-year old ‘theorem’ in statistical foundations’”
D. Mayo’s presentation:
Discussion on Mayo’s presentation:
Mayo’s Memos: Any info or events that arise that seem relevant to share with y’all before the meeting.
You may wish to look at my rejoinder to a number of statisticians: Rejoinder “On the Birnbaum Argument for the Strong Likelihood Principle”. (It is also above in the link to the complete discussion in the 3^{rd} reading option.)
I often find it useful to look at other treatments. So I put together this short supplement to glance through to clarify a few select points.
Please post comments on the Phil Stat Wars blog here.
You know how in that Woody Allen movie, “Midnight in Paris,” the main character (I forget who plays it, I saw it on a plane) is a writer finishing a novel, and he steps into a cab that mysteriously picks him up at midnight and transports him back in time where he gets to run his work by such famous authors as Hemingway and Virginia Woolf? (It was a new movie when I began the blog in 2011.) He is wowed when his work earns their approval and he comes back each night in the same mysterious cab…Well, imagine an error statistical philosopher is either picked up in a mysterious taxi at midnight (New Year’s Eve 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018) or (in 2020) is given a mysterious link for a remote, virtual reality meeting place from sixty years ago, and lo and behold, finds herself in the company of Allan Birnbaum.[i]
ERROR STATISTICIAN: It’s wonderful to meet you Professor Birnbaum; I’ve always been extremely impressed with the important impact your work has had on philosophical foundations of statistics. I happen to be giving an informal presentation on January 7 on your famous argument about the likelihood principle (LP). (whispers: I can’t believe this!)
BIRNBAUM: Ultimately you know I rejected the LP as failing to control the error probabilities needed for my Confidence concept. But you know all this, I’ve read it in your book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (STINT, 2018, CUP).
ERROR STATISTICIAN: You’ve read my book? Wow! Then you know I don’t think your argument shows that the LP follows from such frequentist concepts as sufficiency S and the weak conditionality principle WCP. I don’t rehearse my argument there, but I first found the problem in 2006, when I was writing something on “conditioning” with David Cox. [ii] Sorry,…I know it’s famous…
BIRNBAUM: Well, I shall happily invite you to take any case that violates the LP and allow me to demonstrate that the frequentist is led to inconsistency, provided she also wishes to adhere to the WCP and sufficiency (although less than S is needed).
ERROR STATISTICIAN: Well I show that no contradiction follows from holding WCP and S, while denying the LP.
BIRNBAUM: Well, well, well: I’ll bet you a bottle of Elbar Grease champagne that I can demonstrate it!
ERROR STATISTICAL PHILOSOPHER: It is a great drink, I must admit that: I love lemons.
BIRNBAUM: OK. (A waiter brings a bottle, they each pour a glass and resume talking). Whoever wins this little argument pays for this whole bottle of vintage Ebar or Elbow or whatever it is Grease.
ERROR STATISTICAL PHILOSOPHER: I really don’t mind paying for the bottle.
BIRNBAUM: Good, you will have to. Take any LP violation. Let x’ be 2-standard deviation difference from the null (asserting μ = 0) in testing a normal mean from the fixed sample size experiment E’, say n = 100; and let x” be a 2-standard deviation difference from an optional stopping experiment E”, which happens to stop at 100. Do you agree that:
(0) For a frequentist, outcome x’ from E’ (fixed sample size) is NOT evidentially equivalent to x” from E” (optional stopping that stops at n)
ERROR STATISTICAL PHILOSOPHER: Yes, that’s a clear case where we reject the strong LP, and it makes perfect sense to distinguish their corresponding p-values (which we can write as p’ and p”, respectively). The searching in the optional stopping experiment makes the p-value quite a bit higher than with the fixed sample size. For n = 100, data x’ yields p’= ~.05; while p” is ~.3. Clearly, p’ is not equal to p”, I don’t see how you can make them equal.
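The contrast between p′ and p″ can be checked by simulation. A minimal sketch under the null (μ = 0, σ = 1 known); the optional-stopping protocol assumed here, testing after every observation up to n = 100, is just one simple version, and the exact inflation depends on the protocol chosen:

```python
import numpy as np

rng = np.random.default_rng(1)
sims, n_max = 20000, 100
x = rng.standard_normal((sims, n_max))
# |z| statistic at every possible stopping point n = 1..100:
z = np.abs(x.cumsum(axis=1)) / np.sqrt(np.arange(1, n_max + 1))

# Fixed n = 100: reject when |z| >= 1.96 at the final look only.
p_fixed = (z[:, -1] >= 1.96).mean()
# Optional stopping: reject if |z| >= 1.96 at ANY look from n = 2 to 100.
p_stop = (z[:, 1:] >= 1.96).any(axis=1).mean()
print(p_fixed, p_stop)   # ~.05 versus something several times larger
```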
BIRNBAUM: Suppose you’ve observed x”, a 2-standard deviation difference from an optional stopping experiment E”, that finally stops at n=100. You admit, do you not, that this outcome could have occurred as a result of a different experiment? It could have been that a fair coin was flipped where it is agreed that heads instructs you to perform E’ (fixed sample size experiment, with n = 100) and tails instructs you to perform the optional stopping experiment E”, stopping as soon as you obtain a 2-standard deviation difference, and you happened to get tails, and performed the experiment E”, which happened to stop with n =100.
ERROR STATISTICAL PHILOSOPHER: Well, that is not how x” was obtained, but ok, it could have occurred that way.
BIRNBAUM: Good. Then you must grant further that your result could have come from a special experiment I have dreamt up, call it a BB-experiment. In a BB-experiment, if the outcome from the experiment you actually performed has an outcome with a proportional likelihood to one in some other experiment not performed, E’, then we say that your result has an “LP pair”. For any violation of the strong LP, the outcome observed, let it be x”, has an “LP pair”, call it x’, in some other experiment E’. In that case, a BB-experiment stipulates that you are to report x” as if you had determined whether to run E’ or E” by flipping a fair coin.
(They fill their glasses again)
ERROR STATISTICAL PHILOSOPHER: You’re saying that if my outcome from trying and trying again, that is, from optional stopping experiment E”, has an “LP pair” in the fixed sample size experiment I did not perform, then I am to report x” as if the determination to run E” was by flipping a fair coin (which decides between E’ and E”)?
BIRNBAUM: Yes, and one more thing. If your outcome had actually come from the fixed sample size experiment E’, it too would have an “LP pair” in the experiment you did not perform, E”. Whether you actually observed x” from E”, or x’ from E’, you are to report it as x” from E”.
ERROR STATISTICAL PHILOSOPHER: So let’s see if I understand a Birnbaum BB-experiment: whether my observed 2-standard deviation difference came from E’ or E” (with sample size n), the result is reported as x”, as if it came from E” (optional stopping), and as a result of this strange type of a mixture experiment.
BIRNBAUM: Yes, or equivalently you could just report x*: my result is a 2-standard deviation difference and it could have come from either E’ (fixed sampling, n= 100) or E” (optional stopping, which happens to stop at the 100^{th} trial). That’s how I sometimes formulate a BB-experiment.
ERROR STATISTICAL PHILOSOPHER: You’re saying in effect that if my result has an LP pair in the experiment not performed, I should act as if I accept the strong LP and just report it’s likelihood; so if the likelihoods are proportional in the two experiments (both testing the same mean), the outcomes are evidentially equivalent.
BIRNBAUM: Well, but since the BB-experiment is an imagined “mixture” it is a single experiment, so really you only need to apply the weak LP, which frequentists accept. Yes? (The weak LP is the same as the sufficiency principle.)
ERROR STATISTICAL PHILOSOPHER: But what is the sampling distribution in this imaginary BB-experiment? Suppose I have Birnbaumized my experimental result, just as you describe, and observed a 2-standard deviation difference from optional stopping experiment E”. How do I calculate the p-value within a Birnbaumized experiment?
BIRNBAUM: I don’t think anyone has ever called it that.
ERROR STATISTICAL PHILOSOPHER: I just wanted to have a shorthand for the operation you are describing, there’s no need to use it, if you’d rather I not. So how do I calculate the p-value within a BB-experiment?
BIRNBAUM: You would report the overall p-value, which would be the average over the sampling distributions: (p’ + p”)/2
Say p’ is ~.05, and p” is ~.3; whatever they are, we know they are different, that’s what makes this a violation of the strong LP (given in premise (0)).
ERROR STATISTICAL PHILOSOPHER: So you’re saying that if I observe a 2-standard deviation difference from E’, I do not report the associated p-value p’, but instead I am to report the average p-value, averaging over some other experiment E” that could have given rise to an outcome with a proportional likelihood to the one I observed, even though I didn’t obtain it this way?
BIRNBAUM: I’m saying that you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB-experiment.
My this drink is sour!
ERROR STATISTICAL PHILOSOPHER: Yes, I love pure lemon.
BIRNBAUM: Perhaps you’re in want of a gene; never mind.
I’m saying you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB-experiment. If you are to interpret your experiment as if you are within the rules of a BB experiment, then x’ is evidentially equivalent to x” (is equivalent to x*). This is premise (1).
ERROR STATISTICAL PHILOSOPHER: But the result would be that the p-value associated with x’ (fixed sample size) is reported to be larger than it actually is (.05), because I’d be averaging over fixed and optional stopping experiments; while observing x” (optional stopping) is reported to be smaller than it is–in both cases because of an experiment I did not perform.
BIRNBAUM: Yes, the BB-experiment computes the P-value in an unconditional manner: it takes the convex combination over the 2 ways the result could have come about.
ERROR STATISTICAL PHILOSOPHER: This is just a matter of your definitions; it is an analytical or mathematical result, so long as we grant being within your BB experiment.
BIRNBAUM: True, (1) plays the role of the sufficiency assumption, but one need not even appeal to this, it is just a matter of mathematical equivalence.
By the way, I am focusing just on LP violations, therefore, the outcome, by definition, has an LP pair. In other cases, where there is no LP pair, you just report things as usual.
ERROR STATISTICAL PHILOSOPHER: OK, but p’ still differs from p”; so I still don’t see how I’m forced to infer the strong LP, which identifies the two. In short, I don’t see the contradiction with my rejecting the strong LP in premise (0). (Also, we should come back to the “other cases” at some point….)
BIRNBAUM: Wait! Don’t be so impatient; I’m about to get to step (2). Here, let’s toast to the new year: “To Elbar Grease!”
ERROR STATISTICAL PHILOSOPHER: To Elbar Grease!
BIRNBAUM: So far all of this was step (1).
ERROR STATISTICAL PHILOSOPHER: Oy, what is step 2?
BIRNBAUM: STEP 2 is this: Surely, you agree, that once you know from which experiment the observed 2-standard deviation difference actually came, you ought to report the p-value corresponding to that experiment. You ought NOT to report the average (p’ + p”)/2 as you were instructed to do in the BB experiment.
This gives us premise (2a):
(2a) outcome x”, once it is known that it came from E”, should NOT be analyzed as in a BB-experiment where p-values are averaged. The report should instead use the sampling distribution of the optional stopping test E”, yielding the p-value, p” (~.37). In fact, .37 is the value you give in SIST p. 44 (imagining the experimenter keeps taking 10 more).
ERROR STATISTICAL PHILOSOPHER: So, having first insisted I imagine myself in a Birnbaumized, I mean a BB-experiment, and report an average p-value, I’m now to return to my senses and “condition” in order to get back to the only place I ever wanted to be, i.e., back to where I was to begin with?
BIRNBAUM: Yes, at least if you hold to the weak conditionality principle WCP (of D. R. Cox)—surely you agree to this.
(2b) Likewise, if you knew the 2-standard deviation difference came from E’, then
x’ should NOT be deemed evidentially equivalent to x” (as in the BB experiment), the report should instead use the sampling distribution of fixed test E’, (.05).
ERROR STATISTICAL PHILOSOPHER: So, having first insisted I consider myself in a BB-experiment, in which I report the average p-value, I’m now to return to my senses and allow that if I know the result came from optional stopping, E”, I should “condition” on and report p”.
BIRNBAUM: Yes. There was no need to repeat the whole spiel.
ERROR STATISTICAL PHILOSOPHER: I just wanted to be clear I understood you. Of course, all of this assumes the model is correct or adequate to begin with.
BIRNBAUM: Yes, the LP (or SLP, to indicate it’s the strong LP) is a principle for parametric inference within a given model. So you arrive at (2a) and (2b), yes?
ERROR STATISTICAL PHILOSOPHER: OK, but it might be noted that unlike premise (1), premises (2a) and (2b) are not given by definition, they concern an evidential standpoint about how one ought to interpret a result once you know which experiment it came from. In particular, premises (2a) and (2b) say I should condition and use the sampling distribution of the experiment known to have been actually performed, when interpreting the result.
BIRNBAUM: Yes, and isn’t this weak conditionality principle WCP one that you happily accept?
ERROR STATISTICAL PHILOSOPHER: Well, the WCP originally refers to actual mixtures, where one flipped a coin to determine whether E’ or E” is performed, whereas you’re requiring that I consider an imaginary Birnbaum mixture experiment, where the choice of the experiment not performed varies depending on the outcome that needs an LP pair; and I cannot even determine what this might be until after I’ve observed the result that would violate the LP. I don’t know what the sample size will be ahead of time.
BIRNBAUM: Sure, but you admit that your observed x” could have come about through a BB-experiment, and that’s all I need. Notice
(1), (2a) and (2b) yield the strong LP!
Outcome x” from E”(optional stopping that stops at n) is evidentially equivalent to x’ from E’ (fixed sample size n).
ERROR STATISTICAL PHILOSOPHER: Clever, but your “proof” is obviously unsound; and before I demonstrate this, notice that the conclusion, were it to follow, asserts p’ = p”, (e.g., .05 = .3!), even though it is unquestioned that p’ is not equal to p”, that is because we must start with an LP violation (premise (0)).
BIRNBAUM: Yes, it is puzzling, but where have I gone wrong?
(The waiter comes by and fills their glasses; they are so deeply engrossed in thought they do not even notice him.)
ERROR STATISTICAL PHILOSOPHER: There are many routes to explaining a fallacious argument. Here’s one. What is required for STEP 1 to hold is the denial of what’s needed for STEP 2 to hold:
Step 1 requires us to analyze results in accordance with a BB-experiment. If we do so, true enough we get:
premise (1): outcome x” (in a BB experiment) is evidentially equivalent to outcome x’ (in a BB experiment):
That is because in either case, the p-value would be (p’ + p”)/2
Step 2 now insists that we should NOT calculate evidential import as if we were in a BB-experiment. Instead we should consider the experiment from which the data actually came, E’ or E”:
premise (2a): outcome x” (in a BB experiment) is/should be evidentially equivalent to x” from E” (optional stopping that stops at n): its p-value should be p”.
premise (2b): outcome x’ (within a BB experiment) is/should be evidentially equivalent to x’ from E’ (fixed sample size): its p-value should be p’.
If (1) is true, then (2a) and (2b) must be false!
If (1) is true and we keep fixed the stipulation of a BB experiment (which we must to apply step 2), then (2a) is asserting:
the average p-value (p’ + p”)/2 = p’, which is false.
Likewise, if (1) is true, then (2b) is asserting:
the average p-value (p’ + p”)/2 = p”, which is false.
Alternatively, we can see what goes wrong by realizing:
If (2a) and (2b) are true, then premise (1) must be false.
In short, your famous argument requires us to assess evidence in a given experiment in two contradictory ways: as if we are within a BB-experiment (and report the average p-value) and also as if we are not, but rather should report the actual p-value.
I can render it as formally valid, but then its premises can never all be true; alternatively, I can get the premises to come out true, but then the conclusion is false—so it is invalid. In no way does it show the frequentist is open to contradiction (by dint of accepting SP, WCP, and denying the LP).
BIRNBAUM: Yet some people still think it is a breakthrough. I never agreed to go as far as Jimmy Savage wanted me to….
ERROR STATISTICAL PHILOSOPHER: I have a much clearer exposition of what goes wrong in your argument than I did in the discussion from 2010. There were still several gaps, and no clear articulation of the WCP. In fact, I’ve come to see that clarifying the entire argument turns on defining the WCP. Have you seen my 2014 paper in Statistical Science? The key difference is that in (2014), the WCP is stated as an equivalence, as you intended. Cox’s WCP, many claim, was not an equivalence going in two directions. Slides from a presentation may be found on this blogpost.
BIRNBAUM: Yes, the “monster of the LP” arises from viewing WCP as an equivalence, instead of going in one direction (from mixtures to the known result).
ERROR STATISTICAL PHILOSOPHER: In my 2014 paper (unlike my earlier treatments) I too construe WCP as giving an “equivalence” but there is an equivocation that invalidates the purported move to the LP.
On the one hand, it’s true that if z is known (and known for example to have come from optional stopping), it’s irrelevant that it could have resulted from either fixed sample testing or optional stopping.
But it does not follow that if z is known, it’s irrelevant whether it resulted from fixed sample testing or optional stopping. It’s the slippery slide into this second statement, I claim, that makes your argument such a brain buster.
BIRNBAUM: Yes I have seen your 2014 paper, very clever! Your Rejoinder to some of the critics is gutsy, to say the least. Congratulations! I’ve also seen the slides on your blog.
ERROR STATISTICAL PHILOSOPHER: Thank you, I’m amazed you follow my blog! But look I must get your answer to a question before you leave this year.
(Sudden interruption by the waiter.)
WAITER: Who gets the tab?
BIRNBAUM: I do. To Elbar Grease! And to your new book SIST! I’ve read it 4 times. I have a list of comments and questions right here.
ERROR STATISTICAL PHILOSOPHER: Let me see, I’d love to read your questions and comments. (She takes a long legal-sized yellow sheet from Birnbaum, notices it is filled with tiny hand-written comments, covering both sides.)
BIRNBAUM: To Elbar Grease! To Severe Testing! Happy New Year!
ERROR STATISTICAL PHILOSOPHER: I have one quick question, Professor Birnbaum, and I swear that whatever you say will be just between us, I won’t tell a soul. In your last couple of papers, you suggest you’d discovered the flaw in your argument for the LP. Am I right? Even in the discussion of your (1962) paper, you seemed to agree with Pratt that WCP can’t do the job you intend.
BIRNBAUM: Savage, you know, never got off my case about remaining at “the half-way house” of likelihood, and not going full Bayesian. Then I wrote the review about the Confidence Concept as the one rock in a shifting scene… Pratt thought the argument should instead appeal to a Censoring Principle (basically, it doesn’t matter if your instrument cannot measure beyond k units if the measurement you’re making is under k units.)
ERROR STATISTICAL PHILOSOPHER: Yes, but who says frequentist error statisticians deny the Censoring Principle? So back to my question, you disappeared before answering last year…I just want to know…you did see the flaw, yes?
WAITER: We’re closing now; shall I call Remote Taxi?
BIRNBAUM: Yes.
ERROR STATISTICAL PHILOSOPHER: ‘Yes’, you discovered the flaw in the argument, or ‘yes’ to the taxi?
MANAGER: We’re closing now; I’m sorry you must leave.
ERROR STATISTICAL PHILOSOPHER: We’re leaving I just need him to clarify his answer….
(A large group of people bustles past.)
Prof. Birnbaum…? Allan? Where did he go? (oy, not again!)
But wait! I’ve got his list of comments and questions in my hand! It’s real!!!
Link to complete discussion:
Mayo, Deborah G. “On the Birnbaum Argument for the Strong Likelihood Principle” (with discussion & rejoinder). Statistical Science 29 (2014), no. 2, 227-266.
[i] Many links on the strong likelihood principle (LP or SLP) and Birnbaum may be found by searching this blog. Good sources for where to start as well as classic background papers may be found in my last blogpost. I hope to see you at my zoom Phil Stat Forum on January 7.
[ii] By the way, Ronald Giere gave me numerous original papers of yours. They’re in files in my attic library. Some are in mimeo, others typed…I mean, obviously for that time that’s what they’d be…now of course, oh never mind, sorry.
An essential component of inference based on familiar frequentist notions (p-values, significance and confidence levels) is the relevant sampling distribution (hence the term sampling theory, or my preferred error statistics, as we get error probabilities from the sampling distribution). This feature results in violations of a principle known as the strong likelihood principle (SLP). Stated roughly, the SLP asserts that all the evidential import in the data (for parametric inference within a model) resides in the likelihoods. If accepted, it would render error probabilities irrelevant post data.
SLP (We often drop the “strong” and just call it the LP. The “weak” LP just boils down to sufficiency)
For any two experiments E_{1} and E_{2} with different probability models f_{1}, f_{2}, but with the same unknown parameter θ, if outcomes x* and y* (from E_{1} and E_{2} respectively) determine the same (i.e., proportional) likelihood function (f_{1}(x*; θ) = cf_{2}(y*; θ) for all θ), then x* and y* are inferentially equivalent (for an inference about θ).
(What differentiates the weak and the strong LP is that the weak refers to a single experiment.)
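A standard concrete instance of an SLP pair is the textbook binomial versus negative binomial example (my illustration here, not drawn from this post): 9 heads and 3 tails, obtained either by fixing n = 12 tosses or by tossing until the 3rd tail. The likelihoods are proportional in θ, yet the one-sided p-values under θ = 0.5 differ, which is exactly the kind of violation defined below.

```python
from math import comb

# Observed: 9 heads, 3 tails.
# E1 (binomial, n = 12 fixed):       L1(theta) = C(12,9) * theta^9 * (1-theta)^3
# E2 (negative binomial, toss
#     until the 3rd tail):           L2(theta) = C(11,9) * theta^9 * (1-theta)^3
# The likelihoods are proportional (ratio C(12,9)/C(11,9) = 4 for every theta),
# so the SLP says the two outcomes are inferentially equivalent.

# One-sided p-values for "9 or more heads" under theta = 0.5:
p_binom = sum(comb(12, k) for k in range(9, 13)) / 2**12
# In E2, ">= 9 heads before the 3rd tail" means at most 2 tails in the
# first 11 tosses.
p_negbinom = sum(comb(11, k) for k in range(3)) / 2**11

print(round(p_binom, 4), round(p_negbinom, 4))  # 0.073 0.0327
```

Same likelihoods, different sampling distributions, different p-values: an SLP violation for the error statistician.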
Violation of SLP:
An SLP violation occurs whenever outcomes x* and y* from experiments E_{1} and E_{2} with different probability models f_{1}, f_{2}, but with the same unknown parameter θ, satisfy f_{1}(x*; θ) = cf_{2}(y*; θ) for all θ, and yet x* and y* have different implications for an inference about θ.
For an example of a SLP violation, E_{1} might be sampling from a Normal distribution with a fixed sample size n, and E_{2} the corresponding experiment that uses an optional stopping rule: keep sampling until you obtain a result 2 standard deviations away from a null hypothesis that θ = 0 (and for simplicity, a known standard deviation). When you do, stop and reject the point null (in 2-sided testing).
The SLP tells us (in relation to the optional stopping rule) that once you have observed a 2-standard deviation result, there should be no evidential difference between its having arisen from experiment E_{1}, where n was fixed, say, at 100, and experiment E_{2} where the stopping rule happens to stop at n = 100. For the error statistician, by contrast, there is a difference, and this constitutes a violation of the SLP.
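The numerical difference is easy to exhibit by simulation (a minimal sketch of my own; the cap of 100 looks and the seed are arbitrary choices): under the null, the fixed-n test rejects at |z| ≥ 2 about 5% of the time, while the try-and-try-again rule of E_{2}, applied after every observation up to n = 100, rejects far more often.

```python
import numpy as np

# Simulate under H0: theta = 0, sigma = 1 (known), with up to n_max observations.
rng = np.random.default_rng(1)
n_sims, n_max = 20_000, 100

fixed_rejects = 0     # E1: look only once, at n = 100
stopping_rejects = 0  # E2: look after every observation, stop at first |z| >= 2
for _ in range(n_sims):
    x = rng.standard_normal(n_max)
    # z-statistic after each of n = 1, ..., n_max observations
    z = np.cumsum(x) / np.sqrt(np.arange(1, n_max + 1))
    if abs(z[-1]) >= 2:
        fixed_rejects += 1
    if np.any(np.abs(z) >= 2):
        stopping_rejects += 1

print(fixed_rejects / n_sims)     # close to the nominal 0.05
print(stopping_rejects / n_sims)  # several times larger
```

The same observed 2-standard deviation difference thus carries very different error probabilities depending on which experiment generated it.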
———————-
Now for the surprising part: Remember the chestnut from my last post where a coin is flipped to decide which of two experiments to perform? David Cox (1958) proposes something called the Weak Conditionality Principle (WCP) to restrict the space of relevant repetitions for frequentist inference. The WCP says that once it is known which E_{i} produced the measurement, the assessment should be in terms of the properties of the particular E_{i}. Nothing could be more obvious.
The surprising upshot of Allan Birnbaum’s (1962) argument is that the SLP appears to follow from applying the WCP in the case of mixture experiments, together with so uncontroversial a principle as sufficiency (SP)–although even that has been shown to be optional, strictly speaking. But this would preclude the use of sampling distributions. J. Savage calls Birnbaum’s argument “a landmark in statistics” (see [i]).
Although his argument purports that [(WCP and SP) entails SLP], we will see that data may violate the SLP while holding both the WCP and SP. Such cases also directly refute [WCP entails SLP].
Binge reading the Likelihood Principle.
If you’re keen to binge read the SLP–a way to break holiday/winter break/pandemic doldrums– I’ve pasted most of the early historical sources below. The argument is simple; showing what’s wrong with it took a long time.
I will be talking about it, informally, at our Phil Stat Forum meeting on 7 January 11AM-12:30 PM Eastern (NY) time. For the announcement, and how to join us see this post.
I recommend reading one of the following: My earliest treatment, via counterexample, in Mayo (2010). A deeper argument is in Mayo (2014) in Statistical Science.[ii] An intermediate paper Mayo (2013) corresponds to a talk I presented at the JSM in 2013.
Interested readers may search this blog for quite a lot of discussion of the SLP including “U-Phils” (discussions by readers) (e.g., here, and here), and amusing notes (e.g., Don’t Birnbaumize that experiment my friend, and Midnight with Birnbaum).
The topic is different in style from our usual “Phil Stat Wars and Their Casualties” Forum. However, the casualties from this result are deep. They are relevant to the very notion of “evidence”, blithely taken for granted in such “best practice guides” as the 2016 ASA statement on P-values and significance. It is the interpretation of evidence (left intuitive by Birnbaum) underlying the SLP that is being presupposed. [iii]
To have a list for binging, I’ve grouped some key readings below [iv].
Classic Birnbaum Papers:
Note to Reader: If you look at the (1962) “discussion”, you can already see Birnbaum backtracking a bit, in response to Pratt’s comments.
Some additional early discussion papers:
Durbin:
There’s also a good discussion in Cox and Hinkley 1974.
Evans, Fraser, and Monette:
Kalbfleisch:
My discussions (also noted above):
[i] Savage on Birnbaum: “This paper is a landmark in statistics. . . . I, myself, like other Bayesian statisticians, have been convinced of the truth of the likelihood principle for a long time. Its consequences for statistics are very great. . . . [T]his paper is really momentous in the history of statistics. It would be hard to point to even a handful of comparable events. …once the likelihood principle is widely recognized, people will not long stop at that halfway house but will go forward and accept the implications of personalistic probability for statistics” (Savage 1962, 307-308).
[ii] The link includes comments on my paper by Bjornstad, Dawid, Evans, Fraser, Hannig, and Martin and Liu, and my rejoinder.
[iii] In Birnbaum’s argument, he introduces an informal, and rather vague, notion of the “evidence (or evidential meaning) of an outcome z from experiment E”. He writes it: Ev(E,z).
In my formulation of the argument, I introduce a new symbol ⇒ to represent a function from a given experiment-outcome pair, (E,z) to a generic inference implication. It (hopefully) lets us be clearer than does Ev.
(E,z) ⇒ Infr_{E}(z) is to be read “the inference implication from outcome z in experiment E” (according to whatever inference type/school is being discussed).
If E is within error statistics, for example, it is necessary to know the relevant sampling distribution associated with a statistic. If it is within a Bayesian account, a relevant prior would be needed.
[iv] I’ve blogged these links in the past; please let me know if any links are broken.
Just as you keep up your physical exercise during the pandemic (sure), you want to keep up with mental gymnastics too. With that goal in mind, and given we’re just a few days from the New Year (and given especially my promised presentation for January 7), here’s one of the two simple examples that will limber you up for the puzzle to ensue. It’s the famous weighing machine example from Sir David Cox (1958)[1]. It is one of the “chestnuts” in the museum exhibits of “chestnuts and howlers” in Excursion 3 (Tour II) of my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018). So block everything else out for a few minutes and consider 3 pages from SIST …
Exhibit (vi): Two Measuring Instruments of Different Precisions. SIST (pp. 170-173). Did you hear about the frequentist who, knowing she used a scale that’s right only half the time, claimed her method of weighing is right 75% of the time?
She says, “I flipped a coin to decide whether to use a scale that’s right 100% of the time, or one that’s right only half the time, so, overall, I’m right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.)
Basis for the joke: An N-P test bases error probability on all possible outcomes or measurements that could have occurred in repetitions, but did not. As with many infamous pathological examples, often presented as knockdown criticisms of all of frequentist statistics, this was invented by a frequentist, Cox (1958). It was a way to highlight what could go wrong in the case at hand, if one embraced an unthinking behavioral-performance view. Yes, error probabilities are taken over hypothetical repetitions of a process, but not just any repetitions will do. Here’s the statistical formulation.
We flip a fair coin to decide which of two instruments, E_{1} or E_{2}, to use in observing a Normally distributed random sample Z to make inferences about mean θ. E_{1} has a variance of 1, while that of E_{2} is 10^{6}. Any randomizing device used to choose which instrument to use will do, so long as it is irrelevant to θ. This is called a mixture experiment. The full data would report both the result of the coin flip and the measurement made with that instrument. We can write the report as having two parts: first, which experiment was run, and second, the measurement: (E_{i}, z), i = 1 or 2.
In testing a null hypothesis such as θ = 0, the same z measurement would correspond to a much smaller P-value were it to have come from E_{1} rather than from E_{2}: denote them as p_{1}(z) and p_{2}(z), respectively. The overall significance level of the mixture: [p_{1}(z) + p_{2}(z)]/2, would give a misleading report of the precision of the actual experimental measurement. The claim is that N-P statistics would report the average P-value rather than the one corresponding to the scale you actually used! These are often called the unconditional and the conditional test, respectively. The claim is that the frequentist statistician must use the unconditional test.
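Putting hypothetical numbers on this (my sketch, using z = 1.96 and the two variances above): the conditional p-values and their average come out to roughly .05, .998, and .524, so the averaged report misrepresents both instruments.

```python
import math

def two_sided_p(z, sigma):
    """Two-sided p-value for H0: theta = 0, given one N(theta, sigma^2) draw."""
    return math.erfc(abs(z) / (sigma * math.sqrt(2)))

z = 1.96                       # a hypothetical observed measurement
p1 = two_sided_p(z, 1.0)       # conditional on the precise instrument E1
p2 = two_sided_p(z, 1000.0)    # conditional on the crude instrument E2 (variance 10^6)
unconditional = (p1 + p2) / 2  # averaged over the fair coin flip

print(round(p1, 3), round(p2, 3), round(unconditional, 3))  # 0.05 0.998 0.524
```

A result that is borderline significant from the precise scale, and utterly unremarkable from the crude one, gets the same middling unconditional report either way.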
Suppose that we know we have observed a measurement from E_{2} with its much larger variance:
The unconditional test says that we can assign this a higher level of significance than we ordinarily do, because if we were to repeat the experiment, we might sample some quite different distribution. But this fact seems irrelevant to the interpretation of an observation which we know came from a distribution [with the larger variance]. (Cox 1958, p. 361)
Once it is known which E_{i} has produced z, the P-value or other inferential assessment should be made with reference to the experiment actually run. As we say in Cox and Mayo (2010):
The point essentially is that the marginal distribution of a P-value averaged over the two possible configurations is misleading for a particular set of data. It would mean that an individual fortunate in obtaining the use of a precise instrument in effect sacrifices some of that information in order to rescue an investigator who has been unfortunate enough to have the randomizer choose a far less precise tool. From the perspective of interpreting the specific data that are actually available, this makes no sense. (p. 296)
To scotch his famous example, Cox (1958) introduces a principle: weak conditionality.
Weak Conditionality Principle (WCP): If a mixture experiment (of the aforementioned type) is performed, then, if it is known which experiment produced the data, inferences about θ are appropriately drawn in terms of the sampling behavior in the experiment known to have been performed (Cox and Mayo 2010, p. 296).
It is called weak conditionality because there are more general principles of conditioning that go beyond the special case of mixtures of measuring instruments.
While conditioning on the instrument actually used seems obviously correct, nothing precludes the N-P theory from choosing the procedure “which is best on the average over both experiments” (Lehmann and Romano 2005, p. 394), and it’s even possible that the average or unconditional power is better than the conditional. In the case of such a conflict, Lehmann says relevant conditioning takes precedence over average power (1993b). He allows that in some cases of acceptance sampling, the average behavior may be relevant, but in scientific contexts the conditional result would be the appropriate one (see Lehmann 1993b, p. 1246). Context matters. Did Neyman and Pearson ever weigh in on this? Not to my knowledge, but I’m sure they’d concur with N-P tribe leader Lehmann. Admittedly, if your goal in life is to attain a precise α level, then when discrete distributions preclude this, a solution would be to flip a coin to decide the borderline cases! (See also Example 4.6, Cox and Hinkley 1974, pp. 95–6; Birnbaum 1962, p. 491.)
Is There a Catch?
The “two measuring instruments” example occupies a famous spot in the pantheon of statistical foundations, regarded by some as causing “a subtle earthquake” in statistical foundations. Analogous examples are made out in terms of confidence interval estimation methods (Tour III, Exhibit (viii)). It is a warning to the most behavioristic accounts of testing from which we have already distinguished the present approach. Yet justification for the conditioning (WCP) is fully within the frequentist error statistical philosophy, for contexts of scientific inference. There is no suggestion, for example, that only the particular data set be considered. That would entail abandoning the sampling distribution as the basis for inference, and with it the severity goal. Yet we are told that “there is a catch” and that WCP leads to the Likelihood Principle (LP)!
It is not uncommon to see statistics texts argue that in frequentist theory one is faced with the following dilemma: either to deny the appropriateness of conditioning on the precision of the tool chosen by the toss of a coin, or else to embrace the strong likelihood principle, which entails that frequentist sampling distributions are irrelevant to inference once the data are obtained. This is a false dilemma. Conditioning is warranted to achieve objective frequentist goals, and the [weak] conditionality principle coupled with sufficiency does not entail the strong likelihood principle. The ‘dilemma’ argument is therefore an illusion. (Cox and Mayo 2010, p. 298)
There is a large literature surrounding the argument for the Likelihood Principle, made famous by Birnbaum (1962). Birnbaum hankered for something in between radical behaviorism and throwing error probabilities out the window. Yet he himself had apparently proved there is no middle ground (if you accept WCP)! Even people who thought there was something fishy about Birnbaum’s “proof” were discomfited by the lack of resolution to the paradox. It is time for post-LP philosophies of inference. So long as the Birnbaum argument, which Savage and many others deemed important enough to dub a “breakthrough in statistics,” went unanswered, the frequentist was thought to be boxed into the pathological examples. She is not.
In fact, I show there is a flaw in his venerable argument (Mayo 2010b, 2013a, 2014b). That’s a relief. Now some of you will howl, “Mayo, not everyone agrees with your disproof! Some say the issue is not settled.” Fine, please explain where my refutation breaks down. It’s an ideal brainbuster to work on along the promenade after a long day’s tour. Don’t be dismayed by the fact that it has been accepted for so long. But I won’t revisit it here.
From Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018, CUP).
Excursion 3 Tour II, pp. 170-173.
I just noticed (12/29) that the classic Berger and Wolpert The Likelihood Principle is on-line; it includes their description of the Cox (1958) example.
Note to the Reader:
The LP was a main topic for the first few years of this blog (2011-2014). That’s because I was still refining an earlier disproof from Mayo (2010), based on giving a counterexample. I later saw the need for a deeper argument which I give in Mayo (2014) in Statistical Science.[3] (There, among other subtleties, the WCP is put as a logical equivalence as intended.)
“It was the adoption of an unqualified equivalence formulation of conditionality, and related concepts, which led, in my 1962 paper, to the monster of the likelihood axiom,” (Birnbaum 1975, 263).
An intermediate paper is Mayo (2013).
Some authors are claiming to have new and improved proofs of it. The only problem is that the new attempts reiterate the same premises that render the initial argument circular, only with greater gusto–or so I will argue. Once an argument is circular, it remains so. Textbooks should not call a claim a theorem if it’s not a theorem, i.e., if there isn’t a proof of it (within the relevant formal system).
If statistical inference follows Bayesian posterior probabilism, the LP follows easily. It’s shown in just a couple of pages of SIST Excursion 1 Tour II (45-6). All the excitement is whether the frequentist (error statistician) is bound to hold it. If she is, then error probabilities become irrelevant to the evidential import of data (once the data are given), at least when making parametric inferences within a statistical model.
Stay tuned for more later in the week.
[1] Cox 1958 has a different variant of the chestnut.
[2] Note sufficiency is not really needed in the “proof”.
[3] The discussion includes commentaries by Dawid, Evans, Martin and Liu, Hannig, and Bjørnstad–some of whom are very unhappy with me. But I’m given the final word (at least in that journal) in the rejoinder.
References (outside of the excerpt; for refs within SIST, please see SIST):
Birnbaum, A. (1962), “On the Foundations of Statistical Inference”, Journal of the American Statistical Association 57(298), 269-306.
Birnbaum, A. (1975). Comments on Paper by J. D. Kalbfleisch. Biometrika, 62 (2), 262–264.
Cox, D. R. (1958), “Some problems connected with statistical inference”, The Annals of Mathematical Statistics, 29, 357-372.
Mayo, D. G. (2010). “An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 305-14.
Mayo, D. G. (2013) “Presented Version: On the Birnbaum Argument for the Strong Likelihood Principle”, in JSM Proceedings, Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association: 440-453.
Mayo, D. G. (2014). “On the Birnbaum Argument for the Strong Likelihood Principle”, paper with discussion and Mayo rejoinder, Statistical Science 29(2), 227-239, 261-266.
The fourth meeting of our New Phil Stat Forum*:
The Statistics Wars
and Their Casualties
January 7, 16:00 – 17:30 (London time)
11 am-12:30 pm (New York, ET)**
**note time modification and date change
Putting the Brakes on the Breakthrough,
or “How I used simple logic to uncover a flaw in a controversial 60-year-old ‘theorem’ in statistical foundations”
Deborah G. Mayo
HOW TO JOIN US: SEE THIS LINK
ABSTRACT: An essential component of inference based on familiar frequentist (error statistical) notions (p-values, statistical significance and confidence levels) is the relevant sampling distribution (hence the term sampling theory). This results in violations of a principle known as the strong likelihood principle (SLP), or just the likelihood principle (LP), which says, in effect, that outcomes other than those observed are irrelevant for inferences within a statistical model. Now Allan Birnbaum was a frequentist (error statistician), but he found himself in a predicament: he seemed to have shown that the LP follows from uncontroversial frequentist principles! Bayesians, such as Savage, heralded his result as a “breakthrough in statistics”! But there’s a flaw in the “proof”, and that’s what I aim to show in my presentation by means of 3 simple examples:
As in the last 9 years, I will post an imaginary dialogue with Allan Birnbaum at the stroke of midnight, New Year’s Eve, and this will be relevant for the talk.
The Phil Stat Forum schedule is at the Phil-Stat-Wars.com blog
Deborah Mayo 03:46
Thank you so much. And thank you for inviting me, I’m very pleased to be here. Yes, I say we should continue to use p values and statistical significance tests. P values are really just a piece in a rich set of tools intended to assess and control the probabilities of misleading interpretations of data, i.e., error probabilities. They’re the first line of defense against being fooled by randomness, as Yoav Benjamini puts it. If even larger or more extreme effects than you observed are frequently brought about by chance variability alone, i.e., the p value is not small, clearly you don’t have evidence of incompatibility with the mere chance hypothesis. It’s very straightforward reasoning. Even those who criticize p values, you’ll notice, will employ them, at least if they care to check the assumptions of their models. And this includes well-known Bayesians such as George Box, Andrew Gelman, and Jim Berger. Critics of p values often allege it’s too easy to obtain small p values. But notice the whole replication crisis is about how difficult it is to get small p values with preregistered hypotheses. This shows the problem isn’t p values, but those selection effects and data dredging. However, the same data-dredged hypothesis can occur in other methods, likelihood ratios, Bayes factors, Bayesian updating, except that now we lose the direct grounds to criticize inferences for flouting error statistical control. The introduction of prior probabilities, which may also be data dependent, offers further researcher flexibility. Those who reject p values are saying we should reject the method because it can be used badly. And that’s a bad argument. We should reject misuses of p values. But there’s a danger of blindly substituting alternative tools that throw out the error control baby with the bad statistics bathwater.
Dan Jeske 05:58
Thank you, Deborah, Jim, would you like to comment on Deborah’s remarks and offer your own?
Jim Berger 06:06
Okay, yes. Well, I certainly agree with much of what Deborah said; after all, a p value is simply a statistic. And it’s an interesting statistic that does have many legitimate uses, when properly calibrated. Deborah mentioned one such case, model checking, where Bayesians freely use some version of p values. On the other hand, if one interprets the question as, should they continue to be used in the same way that they’re used today? Then my answer would be somewhat different. I think p values are commonly misinterpreted today, especially when they’re used to test a sharp null hypothesis. For instance, a p value of .05 is commonly interpreted by many as indicating the evidence is 20 to one in favor of the alternative hypothesis. And that just isn’t true. You can show, for instance, that if I’m testing a normal mean of zero versus nonzero, the odds of the alternative hypothesis to the null hypothesis can at most be seven to one. And that’s just a probabilistic fact; it doesn’t involve priors or anything. So if a p value of .05 is interpreted as 20 to one, it’s just being interpreted wrongly, and the wrong conclusions are being reached. I’m reminded of an interesting paper published some time ago, reporting on a survey designed to determine whether clinical practitioners understood what a p value was. The results of the survey were published and were not surprising. Most clinical practitioners interpreted a p value of .05 as something like 20 to one odds against the null hypothesis, which again is incorrect. The fascinating aspect of the paper is that the authors also got it wrong. Deborah pointed out that the p value is the probability under the null hypothesis of the data or something more extreme.
The authors stated that the correct answer was that the p value is the probability of the data under the null hypothesis; they forgot the more extreme. So, I love this article, because the scientists who set out to show that their colleagues did not understand the meaning of p values themselves did not understand the meaning of p values.
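[Aside: Berger’s “seven to one” figure can be checked with a little arithmetic. For a two-sided p value of .05 in a normal-mean test, the likelihood ratio of the alternative to the null, maximised over all alternative means, is exp(z²/2), with z the critical value. A minimal sketch in Python; the framing as a maximised likelihood ratio is my gloss on the standard bound, not a quote from the debate:]

```python
from math import exp
from statistics import NormalDist

p = 0.05
z = NormalDist().inv_cdf(1 - p / 2)  # two-sided critical value, about 1.96
# Maximising the normal likelihood ratio over the alternative mean gives
# exp(z^2 / 2): the best possible odds of alternative to null
max_odds = exp(z * z / 2)
print(round(max_odds, 2))  # about 6.8 -- "at most seven to one"
```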
Dan Jeske 08:42
David?
David Trafimow 08:44
Okay. Yeah. Um, like Deborah and Jim, I’m delighted to be here. Thanks for the invitation. And I partly agree with what both Deborah and Jim said; it’s certainly true that people misuse p values. So, I agree with that. However, I think p values are more problematic than the other speakers have mentioned. And here’s the problem for me. We keep talking about p values relative to hypotheses, but that’s not really true. P values are relative to hypotheses plus additional assumptions. So, if we use the term model to describe the null hypothesis plus additional assumptions, then p values are based on models, not on hypotheses, or only partly on hypotheses. Now, here’s the thing. What are these other assumptions? An example would be random selection from the population, an assumption that is not true in any one of the thousands of papers I’ve read in psychology. And there are other assumptions: a lack of systematic error, linearity, and we can go on and on; people have even published taxonomies of the assumptions because there are so many of them. See, it’s tantamount to impossible that the model is correct, which means that the model is wrong. And so, what you’re in essence doing then is using the p value to index evidence against a model that is already known to be wrong. And even the part about indexing evidence is questionable, but I’ll go with it for the moment. But the point is the model is wrong. And so, there’s no point in indexing evidence against it. So given that, I don’t really see that there’s any use for them. P values don’t tell you how close the model is to being right. P values don’t tell you how valuable the model is. P values pretty much don’t tell you anything that researchers might want to know, unless you misuse them. Anytime you draw a conclusion from a p value, you are guilty of misuse. So, I think the misuse problem is much more subtle than is perhaps obvious at first.
So, that’s really all I have to say at the moment.
Dan Jeske 11:28
Thank you. Jim, would you like to follow up?
Jim Berger 11:32
Yes, so I certainly agree that assumptions are often made that are wrong. I won’t say that that’s always the case. I know many scientific disciplines where I think they do a pretty good job. I work with high energy physicists, and they do an excellent job of checking their assumptions. And they use p values. It’s something to watch out for. But any statistical analysis can run into this problem. If the assumptions are wrong, it’s going to be wrong.
Dan Jeske 12:09
Deborah…
Deborah Mayo 12:11
Okay. Well, Jim thinks that we should evaluate the p value by looking at the Bayes factor, and when he does, he finds that p values exaggerate the evidence. But we really shouldn’t expect agreement on numbers from methods that are evaluating different things. This is like supposing that if we switch from a height to a weight standard, then where we used six feet with the height standard, we should now require six stone, to use an example from Stephen Senn. On David’s point, I think he’s wrong to worry about the assumptions behind the p value, since significance tests require fewer assumptions than any other method, which is why even Bayesians will say we need to apply them when we need to test our assumptions. And it’s something that we can do, especially with randomized controlled trials, to get the assumptions to work. The idea that we have to misinterpret p values to have them be relevant only rests on supposing that we need something other than what the p value provides.
Dan Jeske 13:19
David, would you like to give some final thoughts on this question?
David Trafimow 13:23
Sure. As far as Jim’s point, and Deborah’s point, that we can do things to make the assumptions less wrong: the problem is that the model is either wrong or it isn’t. Now if the model is close, that doesn’t justify the p value, because the p value doesn’t give the closeness of the model. And that’s the problem. We’re not using, for example, a sample mean to estimate a population mean, in which case, yeah, you wouldn’t expect the sample mean to be exactly right; if it’s close, it’s still useful. The problem is that p values aren’t being used to estimate anything. So, if you’re not estimating anything, then you’re stuck with either correct or incorrect, and the answer is always incorrect. This is especially true in psychology, but I suspect it might even be true in physics. I’m not the physicist that Jim is, so I can’t say that for sure.
Dan Jeske 14:35
Jim, would you like to offer Final Thoughts?
Jim Berger 14:37
Let me comment on Deborah’s comment about Bayes factors being just a different scale of measurement. My point was that people invariably seem to think of p values as something like odds or the probability of the null hypothesis. If that’s the way they’re thinking, because that’s the way their minds reason, I believe we should provide them with odds. And so, I try to convert p values into odds or Bayes factors, because I think that’s much more readily understandable by people.
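[Aside: the conversion Berger alludes to is, I believe, the calibration he has advocated in print (Sellke, Bayarri and Berger 2001), which bounds the odds against the null implied by a p value by 1/(−e·p·log p), valid for p < 1/e. A sketch, on the assumption that this is the calibration meant:]

```python
from math import e, log

def odds_bound(p):
    # Upper bound on the odds against the null implied by a p value,
    # valid for p < 1/e (about 0.368)
    return 1.0 / (-e * p * log(p))

print(round(odds_bound(0.05), 2))  # about 2.5 to one, far from 20 to one
print(round(odds_bound(0.01), 2))  # about 8 to one
```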
Dan Jeske 15:11
Deborah, you have the final word on this question.
Deborah Mayo 15:13
I do think that we need a proper philosophy of statistics to interpret p values. But I think also that what’s missing in the reject-p-values movement is that a major reason for calling in statistics in science is to give us tools to inquire whether an observed phenomenon can be a real effect, or just noise in the data. P values have intrinsic properties for this task, if used properly, that other methods don’t, and to reject them is to jeopardize this important role. As Fisher emphasized, we need randomized controlled trials precisely to ensure the validity of statistical significance tests. To reject them because they don’t give us posterior probabilities is illicit. In fact, I think that those who claim we want such posteriors need to show, for any way we can actually get them, why.
You can watch the debate at the NISS website or in this blog post.
You can find the complete audio transcript at this LINK: https://otter.ai/u/hFILxCOjz4QnaGLdzYFdIGxzdsg
[There is a play button at the bottom of the page that allows you to start and stop the recording. You can move about in the transcript/recording by using the pause button and moving the cursor to another place in the dialog and then clicking the play button to hear the recording from that point. (The recording is synced to the cursor.)]
While immersed in our fast-paced, remote, NISS debate (October 15) with J. Berger and D. Trafimow, I didn’t immediately catch all that was said by my co-debaters (I will shortly post a transcript). We had all opted for no practice. But looking over the transcript, I was surprised that David Trafimow was indeed saying the answer to the question in my title is yes. Here are some excerpts from his remarks:
Trafimow 8:44
See, it’s tantamount to impossible that the model is correct, which means that the model is wrong. And so what you’re in essence doing then, is you’re using the P-value to index evidence against a model that is already known to be wrong. …But the point is the model was wrong. And so there’s no point in indexing evidence against it. So given that, I don’t really see that there’s any use for them. …

Trafimow 18:27

I’ll make a more general comment, which is that since since the model is wrong, in the sense of not being exactly correct, whenever you reject it, you haven’t learned anything. And in the case where you fail to reject it, you’ve made a mistake. So the worst, so the best possible cases you haven’t learned anything, the worst possible cases is you’re wrong…

Trafimow 37:54

Now, Deborah, again made the point that you need procedures for testing discrepancies from the null hypothesis, but I will repeat that …P-values don’t give you that. P-values are about discrepancies from the model…
But P-values are not about discrepancies from the model (in which a null or test hypothesis is embedded). If they were, you might say, as he does, that you should properly always find small P-values, so long as the model isn’t exactly correct. If you don’t, he says, you’re making a mistake. But this is wrong, and is in need of clarification. In fact, if violations of the model assumptions prevent computing a legitimate P-value, then its value is not really “about” anything.
Three main points:
[1] It’s very important to see that the statistical significance test is not testing whether the overall model is wrong, and it is not indexing evidence against the model. It is only testing the null hypothesis (or test hypothesis) H_{0}. It is an essential part of the definition of a test statistic T that its distribution be known, at least approximately, under H_{0}. Cox has discussed this for over 40 years; I’ll refer first to a recent, and then an early paper.
Suppose that we study a system with haphazard variation and are interested in a hypothesis, H, about the system. We find a test quantity, a function t(y) of data y, such that if H holds, t(y) can be regarded as the observed value of a random variable t(Y) having a distribution under H that is known numerically to an adequate approximation, either by mathematical theory or by computer simulation. Often the distribution of t(Y) is known also under plausible alternatives to H, but this is not necessary. It is enough that the larger the value of t(y), the stronger the pointer against H.
The basis of a significance test is an ordering of the points in [a sample space] in order of increasing inconsistency with H_{0}, in the respect under study. Equivalently there is a function t = t(y) of the observations, called a test statistic, and such that the larger is t(y), the stronger is the inconsistency of y with H_{0}, in the respect under study. The corresponding random variable is denoted by T. To complete the formulation of a significance test, we need to be able to compute, at least approximately,
p(y_{obs}) = p_{obs} = pr(T ≥ t_{obs} ; H_{0}), (1)
called the observed level of significance.
…To formulate a test, we therefore need to define a suitable function t(.), or rather the associated ordering of the sample points. Essential requirements are that (a) the ordering is scientifically meaningful, (b) it is possible to evaluate, at least approximately, the probability (1).
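Cox’s recipe — a test statistic t(y) whose null distribution is known “either by mathematical theory or by computer simulation” — can be sketched directly in the simulation version. The statistic and the observed value below are invented purely for illustration:

```python
import random

random.seed(1)

# H: y_1, ..., y_n are iid N(0, 1); take t(y) = |sample mean|, so that a
# larger t(y) points more strongly against H, as Cox requires.
def t_stat(ys):
    return abs(sum(ys) / len(ys))

n, sims = 25, 10_000
t_obs = 0.9  # hypothetical observed value of the test statistic
null_ts = [t_stat([random.gauss(0, 1) for _ in range(n)]) for _ in range(sims)]
# Approximate pr(T >= t_obs; H_0), i.e. equation (1), by simulation
p_obs = sum(t >= t_obs for t in null_ts) / sims
print(p_obs)
```

Here t_obs sits roughly 4.5 standard errors from the null mean, so the simulated p value is essentially zero.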
To suppose, as Trafimow plainly does, that we can never commit a Type 1 error in statistical significance testing because the underlying model “is not exactly correct” is a serious misinterpretation. The statistical significance test only tests one null hypothesis at a time. It is piecemeal. If it’s testing, say, the mean of a Normal distribution, it’s not also testing the underlying assumptions of the Normal model (Normal, IID). Those assumptions are tested separately, and the error statistical methodology offers systematic ways for doing so, with yet more statistical significance tests [see point 3].
[2] Moreover, although the model assumptions must be met adequately in order for the P-value to serve as a test of H_{0}, it isn’t required that we have an exactly correct model, merely that the reported error probabilities are close to the actual ones. As I say in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018) (several excerpts of which can be found on this blog):
Statistical models are at best approximations of aspects of the data-generating process. Reasserting this fact is not informative about the case at hand. These models work because they need only capture rather coarse properties of the phenomena: the error probabilities of the test method are approximately and conservatively related to actual ones. …Far from wanting true (or even “truer”) models, we need models whose deliberate falsity enables finding things out. (p. 300)
Nor do P-values “track” violated assumptions; such violations can lead to computing an incorrectly high, or an incorrectly low, P-value.
And what about cases where we know ahead of time that a hypothesis H_{0 }is strictly false?—I’m talking about the hypothesis here, not the underlying model. (Examples would be with a point null, or one asserting “there’s no Higgs boson”.) Knowing a hypothesis H_{0 }is false is not yet to falsify it. That is, we are not warranted in inferring we have evidence of a genuine effect or discrepancy from H_{0}, and we still don’t know in which way it is flawed.
[3] What is of interest in testing H_{0} with a statistical significance test is whether there is a systematic discrepancy or inconsistency with H_{0}—one that is not readily accounted for by background variability, chance, or “noise” (as modelled). We don’t need, or even want, a model that fully represents the phenomenon—whatever that would mean. In “design-based” tests, we look to experimental procedures, within our control, as with randomisation.
Fisher:
the simple precaution of randomisation will suffice to guarantee the validity of the test of significance, by which the result of the experiment is to be judged. (Fisher 1935, 21)
We look to RCTs quite often these days to test the benefits (and harms) of vaccines for Covid-19. Researchers observe differences in the number of Covid-19 cases in two randomly assigned groups, vaccinated and unvaccinated. We know there is ordinary variability in contracting Covid-19; it might be that, just by chance, more people who would have remained Covid-free, even without the vaccine, happen to be assigned to the vaccination group. The random assignment allows determining the probability that an even larger difference in Covid-19 rates would be observed even if H_{0} holds: the two groups have the same chance of avoiding Covid-19. (I’m describing things extremely roughly; a much more realistic account of randomisation is given by several guest posts by Senn (e.g., blogpost).) Unless this probability is small, it would not be correct to reject H_{0} and infer that there is evidence the vaccine is effective. Yet Trafimow, if we take him seriously, is saying it would always be correct to reject H_{0}, and that to fail to reject it is to make a mistake. I hope that no one’s seriously suggesting that we should always infer there’s evidence a vaccine or other treatment works. But I don’t know how else to understand the position that it’s always correct to reject H_{0}, and that to fail to reject it is to make a mistake. This is a dangerous and wrong view, which fortunately vaccine researchers are not guilty of.
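The conditional logic of such a comparison can be sketched with invented numbers (the counts below are hypothetical, not from any actual trial). With 1:1 random assignment and the null that both groups have the same chance of infection, each infection is equally likely to fall in either arm, so the number of cases in the vaccinated group, given the total, is Binomial(total, 1/2):

```python
from math import comb

def tail_prob(k, n):
    # P(X <= k) for X ~ Binomial(n, 1/2): the chance of a split at least
    # this lopsided in favour of the vaccine arm under the null hypothesis
    return sum(comb(n, i) for i in range(k + 1)) / 2**n

# Hypothetical: 60 infections in all, 15 on vaccine, 45 on placebo
p = tail_prob(15, 60)
print(p)  # small: such an imbalance is hard to attribute to chance alone
```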
When we don’t have design-based assumptions, we may check the model-based assumptions by means of tests that are secondary in relation to the primary test. The trick is to get them to be independent of the unknowns in the primary test, and there are systematic ways to achieve this.
Cox 2006:
We now turn to a complementary use of these ideas, namely to test the adequacy of a given model, what is also sometimes called model criticism… It is necessary, if we are to parallel the previous argument, to find a statistic whose distribution is exactly or very nearly independent of the unknown parameter μ. An important way of doing this is by appeal to the second property of sufficient statistics, namely that after conditioning on their observed value the remaining data have a fixed distribution. (2006, p. 33)
“In principle, the information in the data is split into two parts, one to assess the unknown parameters of interest and the other for model criticism” (Cox 2006, p. 198). If the model is appropriate then the conditional distribution of Y given the value of the sufficient statistic s is known, so it serves to assess if the model is violated. The key is often to look at residuals: the difference between each observed outcome and what is expected under the model. The full data are remodelled to ask a different question. [i]
In testing assumptions, the null hypothesis is generally that the assumption(s) hold approximately. Again, even when we know this secondary null is strictly false, we want to learn in what way, and use the test to pinpoint improved models to try. (These new models must be separately tested.) [ii]
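Cox’s device — conditioning on the sufficient statistic so that “the remaining data have a fixed distribution” — can be seen in miniature with residuals from a sample mean. The numbers below are arbitrary; the point is only that the residuals are untouched by a shift in the unknown location parameter:

```python
import statistics

def residuals(ys):
    # Subtracting the sample mean (the sufficient statistic for a normal
    # location parameter) leaves quantities whose distribution is free of it
    ybar = statistics.fmean(ys)
    return [y - ybar for y in ys]

data = [4.1, 5.2, 3.8, 6.0, 5.5]
shifted = [y + 7.3 for y in data]  # the same data under a shifted true mean
diffs = [abs(a - b) for a, b in zip(residuals(data), residuals(shifted))]
print(max(diffs))  # numerically zero: the residuals are unchanged
```

This is why residual-based checks can probe the model without knowing μ: the same split of the information that estimates the parameter also isolates what is left over for model criticism.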
The essence of the reasoning can be made out entirely informally. Think of how the 1919 Eddington eclipse tests probed departures from the Newtonian predicted light deflection. It tested the Newtonian “half deflection” H_{0}: μ ≤ 0.87, vs H_{1}: μ > 0.87, which includes the Einstein value of 1.75. These primary tests relied upon sufficient accuracy in the telescopes to get a usable standard error for the star positions during the eclipse, and 6 months before (SIST, Excursion 3 Tour I). In one set of plates, which some thought supported Newton, this necessary assumption was falsified using a secondary test. Relying only on known star positions and the detailed data, it was clear that the sun’s heat had systematically distorted the telescope mirror. No assumption about general relativity was required.
If I update this, I will indicate with (i), (ii), etc.
I invite your comments and/or guest posts on this topic.
NOTE: Links to the full papers/book are given in this post, so you might want to check them out.
[i] See Spanos 2010 (pp. 322-323) from Error & Inference. (This is his commentary on Cox and Mayo in the same volume.) Also relevant Mayo and Spanos 2011 (pp. 193-194).
[ii] It’s important to see that other methods, error statistical or Bayesian, rely on models. A central asset of the simple significance test, on which Bayesians will concur, is their apt role in testing assumptions.
Stephen Senn
Consultant Statistician
Edinburgh, Scotland
Well actually, not from A to Z but from AZ. That is to say, the trial I shall consider is the placebo-controlled trial of the Oxford University vaccine for COVID-19 currently being run by AstraZeneca (AZ) under protocol AZD1222 – D8110C00001 and which I considered in a previous blog, Heard Immunity. A summary of the design features is given in Table 1. The purpose of this blog is to look a little deeper at features of the trial, and the way I am going to do so is with the help of geometric representations of the sample space, that is to say the possible results the trial could produce. However, the reader is warned that I am only an amateur in all this. The true professionals are the statisticians at AZ who, together with their life science colleagues in AZ and Oxford, designed the trial.
Whereas in an October 20 post (on PHASTAR) I considered the sequential nature of the trial, here I am going to ignore that feature and only look at the trial as if it had a single look. Note that the trial employs a two to one randomisation, twice as many subjects being given vaccine as placebo.
However, first I shall draw attention to one interesting feature. Like the two other trials that I also previously considered (one by BioNTech and Pfizer and the other by Moderna), the null hypothesis that is being tested is not that the vaccine has no efficacy but that its efficacy does not exceed 30%. Vaccine Efficacy (VE) is defined as

VE = 1 − R_{vaccine}/R_{placebo},

where R_{placebo} and R_{vaccine} are the ‘true’ rates of infection under placebo and vaccine respectively.
Obviously, if the vaccine were completely ineffective, the value of VE would be 0. Presumably the judgement is that a vaccine will be of no practical use unless it has an efficacy of at least 30%. Perhaps a lower value than this could not really help to control the epidemic. The trial is designed to show that the efficacy exceeds this threshold. In what follows, you can take it as read that the probability of the trial failing when the true efficacy is some value less than 30% (such as 27%, say) is even greater than when the value is exactly 30%. Therefore, it becomes of interest to consider the way the trial will behave if the value is exactly 30%.
Figure 1 gives a representation of what might happen in terms of cases of infected subjects in both arms of the trial based on its design. It’s a complicated diagram and I shall take some time to explain it. For the moment I invite the reader to ignore the concentric circles and the shading. I shall get to those in due course.
The X axis gives the number of cases in the vaccine group and the Y axis the number of cases under placebo. It is important to bear in mind that twice as many subjects are being treated with vaccine as with placebo. The line of equality of infection rates is given by the dashed white diagonal line towards the bottom right hand side of the plot and labelled ‘0% efficacy’. This joins (for example) the points (80,40) and (140, 70) corresponding to twice as many cases under vaccine as placebo and reflecting the 2:1 allocation ratio. Other diagonal lines correspond to 30%, 50% and 60% VE respectively.
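For readers who like to check such things, the observed efficacy corresponding to any point (cases on vaccine, cases on placebo) in this space follows directly from the 2:1 allocation. A small sketch (my own arithmetic, not taken from the protocol):

```python
def observed_ve(cases_vaccine, cases_placebo):
    # With twice as many subjects on vaccine, equal infection *rates* mean
    # twice as many vaccine cases, so observed VE = 1 - x / (2 * y)
    return 1 - cases_vaccine / (2 * cases_placebo)

print(observed_ve(80, 40))   # 0.0 -- a point on the 0% efficacy line
print(observed_ve(140, 70))  # 0.0 -- another point on the same line
print(observed_ve(75, 75))   # 0.5 -- a point on the 50% efficacy line
```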
The trial is designed to stop once 150 cases of infection have occurred. This boundary is represented by the diagonal solid red line descending from the upper left (30 cases in the vaccine group and 120 cases in the placebo group) towards the bottom right (120 cases in the vaccine group and 30 cases in the placebo group). Thus, we know in advance that the combination of results we shall see must lie on this line.
Note that the diagram is slightly misleading, since where the space concerned refers to numbers of cases, it is neither continuous in X nor continuous in Y. The only possible values are the whole numbers, that is to say the non-negative integers. However, the same is not true for expected numbers, and this is a common difference between parameters and random variables in statistics. For example, if we have a Poisson random variable with a given mean, the only possible values of the random variable are the whole numbers 0, 1, 2, … but the mean can be any positive real number.
Figure 2 is the same diagram as Figure 1 as regards every feature except that which I invited the reader to ignore. The concentric circles are contour plots that represent features of the trial that are suitable for planning. In order to decide how many subjects to recruit, the scientists at AZ and Oxford had to decide what infection rate was likely. They chose an infection rate of 0.8% per 6 months under placebo. This in turn implies that of 10,000 subjects treated with placebo, we might expect 80 to get COVID. On the other hand, a vaccine efficacy of 30% would imply an infection rate of 0.56%, since 0.8% × (1 − 0.30) = 0.56%.
For 20,000 subjects treated with vaccine we would expect (0.56/100) × 20,000 = 112 of them to be infected with COVID, and if the vaccine efficacy were 60%, the value assumed for the power calculation, then the expected infection rate would be 0.32% and we would expect (0.32/100) × 20,000 = 64 of the subjects to be infected.
Since the infection rates are small, a Poisson distribution is a possible simple model for the probability of seeing certain combinations of infections. This is what the contour plots illustrate. For both cases, the expected number of cases under placebo is assumed to be 80 and this is illustrated by a dashed horizontal white line. However, the lower infection rate under H_{1} has the effect of shifting the contour plots to the left. Thus, in Figure 1 the dashed vertical line indicating the expected numbers in the vaccine arm is at 112 and in Figure 2 it is at 64. Nothing else changes between the figures.
Figure 2 Possible and expected outcomes for the trial plotted in the two dimensional space of vaccine and placebo cases of infection. The contour plot applies when the value under the alternative hypothesis assumed for power calculations is true.
How should we carry out a significance test? One way of doing so is to condition on the total number of infected cases. The issue of whether to condition or not is a notorious controversy in statistics. Here the total of 150 is fixed, but I think that there is a good argument for conditioning whether or not it is fixed. Such conditioning in this case leads to a binomial distribution describing the number of cases of infection, out of the 150 in total, that are in the vaccine group. Ignoring any covariates, therefore, a simple analysis is to compare the proportion of cases we see to the proportion we would expect to see under the null hypothesis. This proportion is given by 112/(112+80) = 0.583. (Note a subtle but important point here. The total number of cases expected is 192 but we know the trial will stop at 150. That is irrelevant. It is the expected proportion that matters here.)
By trial and error or by some other means we can now discover that the probability of 75 or fewer cases given vaccine out of 150 in total when the probability is 0.583 is 0.024. The AZ protocol requires a two-sided P-value less than or equal to 4.9%, which is to say 0.0245 one sided, assuming the usual doubling rule, so this is just low enough. On the other hand, the probability of 76 or fewer cases under vaccine is 0.035 and thus too high. This establishes the point X=75, Y=75 as a critical value of the test. This is shown by the small red circle labelled ‘critical value’ on both figures. It just so happens that this lies along the 50% efficacy line. Thus observed 50% efficacy will be (just) enough to reject the hypothesis that the true efficacy is 30% or lower.
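These tail probabilities are easy to reproduce — a check of my own, conditioning on the 150 cases and using the expected proportion 112/192 under the null:

```python
from math import comb

def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

p0 = 112 / (112 + 80)          # 0.583, expected case proportion under H0
p_75 = binom_cdf(75, 150, p0)  # about 0.024: just low enough
p_76 = binom_cdf(76, 150, p0)  # about 0.035: too high
print(round(p_75, 3), round(p_76, 3))
```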
There are many other interesting features of this trial I could discuss, in particular what alternative analyses might be tried (the protocol refers to a ‘modified Poisson regression approach’ due to Zou, 2004), but I shall just consider one other issue here. That is that, in theory, the time at which the trial stops might give some indication as to vaccine efficacy, a point that might be of interest to avid third-party trial-watchers. If you look at Figure 3, which combines Figure 1 and Figure 2, you will note that the expected number of cases under H_{0}, if the values used for planning are correct, is at least 80 + 112 = 192 (attained when vaccine efficacy is exactly 30%). For zero efficacy the figure is 80 + 160 = 240. However, the trial will stop once 150 cases of infection have been observed. Thus, under H_{0}, the trial is expected to stop before all 30,000 subjects have had six months of follow-up.
On the other hand, for an efficacy of 60% given in Figure 3 the value is 80 + 64 = 144 and so slightly less than the figure required. Thus, under H_{1}, the trial might not be big enough. Taken together, these figures imply that, other things being equal, the earlier the trial stops the more likely the result is to be negative, and the longer it continues, the more likely it is to be positive.
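Treating the total number of cases at six months as Poisson with mean 192 (the H_{0} planning values) or 144 (H_{1}), one can gauge how likely the 150-case boundary is to be reached within the planned follow-up. This is a rough amateur check of my own, ignoring the sequential machinery:

```python
from math import exp, lgamma, log

def poisson_sf(k, lam):
    # P(N >= k) for N ~ Poisson(lam), summing the pmf on the log scale
    log_pmf = lambda i: -lam + i * log(lam) - lgamma(i + 1)
    return 1.0 - sum(exp(log_pmf(i)) for i in range(k))

p_h0 = poisson_sf(150, 192)  # efficacy exactly 30%: stopping nearly certain
p_h1 = poisson_sf(150, 144)  # efficacy 60%: the boundary may not be reached
print(round(p_h0, 3), round(p_h1, 3))
```

Under the null planning values the boundary is all but certain to be reached; under the 60% efficacy alternative there is a substantial chance it is not, which is the sense in which the trial "might not be big enough".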
Of course, this raises the issue as to whether one can judge what is early and what is late. To make some guesses as to background rates of infection is inevitable when planning a trial. One would be foolish to rely on them when interpreting it.
Figure 3 Combination of Figures 1 and 2 showing contour plots for the joint density for the number of cases when the vaccine efficacy is 30% (H_{0}) and the value under H_{1} of 60% used for planning.
Zou G. A modified Poisson regression approach to prospective studies with binary data. Am J Epidemiol. 2004;159(7):702-706.
POSTSCRIPT: Needlepoint
Extract of a press-release from Pfizer, 9 November 2020:
“I am happy to share with you that Pfizer and our collaborator, BioNTech, announced positive efficacy results from our Phase 3, late-stage study of our potential COVID-19 vaccine. The vaccine candidate was found to be more than 90% effective in preventing COVID-19 in participants without evidence of prior SARS-CoV-2 infection in the first interim efficacy analysis.” Albert Bourla (Chairman and CEO, Pfizer.)
Naturally, this had Twitter agog and calculations were soon produced to try and reconstruct the basis on which the claim was being made: how many cases of COVID-19 infection under vaccine had there been seen in order to be able to make this claim? In the end these amateur calculations don’t matter. It’s what Pfizer calculates and what the regulators decide about the calculation that matters. I note by the by that a fair proportion of Twitter seemed to think that journal publication and peer review is essential. I don’t share this point of view, which I tend to think of as “quaint”. It’s the regulator’s view I am interested in but we shall have to wait for that.
Nevertheless, calculation can be fun, and if I didn’t think so, I would be in the wrong profession. So here goes. However, first I should acknowledge that Jen Rogers’s interesting blog on the subject has been very useful in preparing this note.
To do the calculation properly, this is what one would have to know:

| Need to know | Discussion |
| --- | --- |
| Disposition of subjects | Randomisation was one to one but strictly speaking we want to know the exact achieved proportions. BusinessWire describes a total of “43,538 participants to date, 38,955 of whom have received a second dose of the vaccine candidate as of November 8, 2020”. |
| Number of cases of infection | According to BusinessWire, 94 were seen. |
| Method of analysis | Pfizer’s protocol specifies that a Bayesian analysis will be used. I shall not attempt this but use a very simple frequentist one, conditioning on totals infected. |
| Aim of claim | Is the point estimate the basis of the claim, or is the lower bound of some confidence interval the basis? |
| Level of confidence to be used | Pfizer planned to look five times but it seems that the first look was abandoned. The reported look is the 2nd, but at a number of cases (94) slightly greater than the number originally planned for the 3rd (92). I shall assume that the confidence level for look three of an O’Brien-Fleming boundary is appropriate. |
| Missingness | A simple analysis would assume no missing data, or at least that any missing data are missing completely at random. |
| Other matters | Two doses are required. Were there any cases arising between the two doses and, if so, what was done with them? |
If I condition on the total number of infected cases, and assume equal numbers of subjects on each arm, then by varying the number of cases in the vaccine group and subtracting them from the total of 94 to get those on the control group arm, I can calculate the vaccine efficacy. This has been done in the figure below.
The solid blue circles are the estimates of the vaccine efficacy. The ‘whiskers’ below indicate a confidence limit of 99.16%, which (I think) is the level appropriate for the third look in an O’Brien-Fleming scheme with an overall type I error rate of 5%. Horizontal lines have been drawn at 30% efficacy (the value used in the protocol for the null hypothesis) and 90% efficacy (the claimed effect in the press release). Three cases on the vaccine arm would give a lower confidence limit for vaccine efficacy of about 91.3%, whereas four would give 89.2%. Eight cases would give a point estimate of 90.7%. So, depending on what exactly the claim of “more than 90% effective” might mean (and a whole host of other assumptions), we could argue that between three and eight cases of infection were seen.
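The conditional calculation just described is easy to sketch in code. The following is a minimal illustration in Python — not Pfizer’s Bayesian analysis — that conditions on the 94 cases, computes the point estimate for each split, and adds an exact binomial lower bound as one plausible reading of the 99.16% level. The function names and the grid-search tolerance are my own choices, and the bound need not reproduce the whiskers in the figure exactly, since their precise construction is not stated.

```python
import math

def vaccine_efficacy(y_v, y_c):
    """Point estimate of vaccine efficacy with equal numbers per arm."""
    return 1 - y_v / y_c

def efficacy_from_split_prob(p):
    # Conditioning on the total number of cases with 1:1 randomisation,
    # Y_v ~ Binomial(total, p) with p = (1 - psi)/(2 - psi),
    # which inverts to psi = (1 - 2p)/(1 - p).
    return (1 - 2 * p) / (1 - p)

def binom_cdf(x, n, p):
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))

def lower_bound_efficacy(y_v, n=94, conf=0.9916):
    # Clopper-Pearson-style upper bound on p by grid search, mapped to a
    # lower bound on efficacy via the monotone transformation above.
    alpha = 1 - conf
    p_hi = max(i / 10000 for i in range(1, 10000)
               if binom_cdf(y_v, n, i / 10000) >= alpha / 2)
    return efficacy_from_split_prob(p_hi)

for y_v in (3, 4, 8):
    y_c = 94 - y_v
    print(f"{y_v} vaccine cases: VE = {vaccine_efficacy(y_v, y_c):.1%}, "
          f"lower bound = {lower_bound_efficacy(y_v):.1%}")
```

With eight vaccine cases the point estimate is 1 − 8/86, about 90.7%, matching the value quoted above.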
Of course safety is often described as being first in terms of priorities but it usually takes longer to see the results that are necessary to judge it than to see those for efficacy. According to BusinessWire “Pfizer and BioNTech are continuing to accumulate safety data and currently estimate that a median of two months of safety data following the second (and final) dose of the vaccine candidate – the amount of safety data specified by the FDA in its guidance for potential Emergency Use Authorization – will be available by the third week of November.”
The world awaits the results with interest.
https://www.linkedin.com/pulse/needlework-guesswork-stephen-senn/
For information about the Phil Stat Wars forum and how to join, see this post and this pdf.
For related posts on randomization by Stephen Senn, see these guest posts from this blog:
Slides and Video Links:
Stephen Senn’s slides: Randomisation and Control in the Age of Coronavirus
Stephen Senn’s presentation:
Discussion on Senn’s presentation:
*Meeting 11 of the general Phil Stat series which began with the LSE Seminar PH500 on May 21
Stephen Senn
Consultant Statistician
Edinburgh, Scotland
Testing Times
There has been much comment on Twitter and other social media about testing for coronavirus and the relationship between a test being positive and the person tested having been infected. Some primitive form of Bayesian reasoning is often used to justify concern that an apparent positive may actually be falsely so, with specificity and sensitivity taking the roles of likelihoods and prevalence that of a prior distribution. This way of looking at testing dates back at least to a paper of 1959 by Ledley and Lusted[1]. However, as others[2, 3] have pointed out, there is a trap for the unwary in this, in that it is implicitly assumed that specificity and sensitivity are constant values unaffected by prevalence and it is far from obvious that this should be the case.
In the age of COVID-19 this is a highly suitable subject for a blog. However, I am a highly unsuitable person to blog about it, since what I know about screening programmes could be written on the back of a postage stamp and the matter is very delicate. So, I shall duck the challenge but instead write about something that bears more than a superficial similarity to it, namely, testing for model adequacy prior to carrying out a test of a hypothesis of primary interest. It is an issue that arises in particular in cross-over trials, where there may be concerns that carry-over has taken place. Here, I may or may not be an expert, that is for others to judge, but I can’t claim that I think I am unqualified to write about it, since I once wrote a book on the subject[4, 5]. So this blog will be about testing model assumptions taking the particular example of cross-over trials.
The simplest of all cross-over trials, the so-called AB/BA cross-over, is one in which two treatments A and B are compared by randomising patients to one of two sequences: either A followed by B (labelled AB) or B followed by A (labelled BA). Each patient is thus studied in two periods, receiving one of the two treatments in each. There may be a so-called wash-out period between them but whether or not a wash-out is employed, the assumption will be made that by the time the effect of a treatment comes to be measured in period two, the effect of any treatment given in period one has disappeared. If such a residual effect, referred to as a carry-over, existed it would bias the treatment effect since, for example, the result in period two of the AB sequence would not only reflect the effect of giving B but the previous effect of having given A.
Everyone is agreed that if the effect of carry-over can be assumed negligible, an efficient estimate of the difference between the effect of B and A can be made by allowing each patient to act as his or her own control. One way of doing this is to calculate a difference for each patient of the period two values minus the period one values. I shall refer to these as the period differences. If the effect of treatment B is the same as that of treatment A, then these period differences will not be expected to differ systematically from one sequence to another. However, if (say) the effect of B was greater than that of A (and higher values were better), then in the AB sequence a positive difference would be added to the period differences and in the BA sequence that difference would be subtracted from the period differences. We should thus expect the means of the period differences for the two sequences to differ. So one way of testing the null hypothesis of no treatment effect is to carry out a two-sample t-test comparing the period differences between one sequence and another. Equivalent from the point of view of testing, but more convenient from the point of view of estimation, is to work with the semi period differences, that is to say the period differences divided by two. I shall refer to the associated estimate as CROS and the t-statistic for this test as CROS_{t}, since they are what the crossover trial was designed to produce.
Unfortunately, however, these period differences could also reflect carry-over and if this occurs it would bias the estimate of the treatment effect. The usual effect will be to bias it towards the null and so there may be a loss of power. Is there a remedy? One possibility is to discard the second period values. After all, the first period values cannot be contaminated by carry-over. On the other hand, single period values are what we have in any parallel group trial. So all we need to do is regard the first period values as coming from a parallel group trial. Patients in the AB sequence yield values under A and patients in the BA sequence yield values under B, so a comparison of the first period values, again using a two-sample t-test, is a test of the null hypothesis of no difference between treatments. I shall refer to this estimate as the PAR statistic and the corresponding t-statistic as PAR_{t}.
Note that the PAR statistic is expected to be a second best to the CROS statistic. Not only are half the values discarded but, since we can no longer use each patient as his or her own control, the relevant variance is a sum of between- and within-patient variances, unlike for CROS, which only reflects within-patient variation. Nevertheless, PAR may be expected to be unbiased in circumstances where CROS will be biased and, since all applied statistics is a bias-variance trade-off, it seems conceivable that there are circumstances under which PAR would be preferable.
However, there is a problem. How shall we know that carry-over has occurred? It turns out that there is a t-test for this too. First, what we construct are means over the two periods for each patient. In each sequence such means must reflect the effect of both treatments, since each patient receives each treatment and they must also reflect the effect of both periods, since each patient will be treated in each period. However, in the AB sequence the total (and hence the mean) will also reflect the effect of any carryover from A in the second period whereas in the BA sequence the total (and hence the mean) will reflect the carry-over of B. Thus, if the two carry-overs differ, which is what matters, these sequence means will differ and therefore the t-test of the totals comparing the two sequences is a valid test of zero differential carry-over. I shall refer to the estimate as SEQ and the corresponding t-statistic as SEQ_{t}.
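All three statistics are ordinary two-sample t-tests applied to different per-patient summaries, which makes them easy to compute directly. A minimal sketch in Python, with helper names and illustrative data of my own invention:

```python
import math

def two_sample_t(x, y):
    """Pooled-variance two-sample t-statistic for mean(x) - mean(y)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    pooled = (sum((v - mx) ** 2 for v in x) +
              sum((v - my) ** 2 for v in y)) / (nx + ny - 2)
    return (mx - my) / math.sqrt(pooled * (1 / nx + 1 / ny))

def crossover_tests(ab, ba):
    """ab, ba: lists of (period 1, period 2) responses for the two sequences.
    Returns the CROS, PAR and SEQ t-statistics described in the text."""
    semi_ab = [(y2 - y1) / 2 for y1, y2 in ab]   # semi period differences
    semi_ba = [(y2 - y1) / 2 for y1, y2 in ba]
    return {
        "CROS_t": two_sample_t(semi_ab, semi_ba),          # within-patient test
        "PAR_t": two_sample_t([y1 for y1, _ in ba],        # period 1 only:
                              [y1 for y1, _ in ab]),       # B vs A, parallel-group
        "SEQ_t": two_sample_t([y1 + y2 for y1, y2 in ab],  # per-patient totals:
                              [y1 + y2 for y1, y2 in ba]), # carry-over test
    }

# Illustrative (made-up) data: treatment B raises response by about 2 units.
ab = [(0.0, 2.2), (1.0, 2.8), (2.0, 4.1)]   # sequence AB: A then B
ba = [(2.1, 0.0), (3.0, 1.2), (4.0, 1.9)]   # sequence BA: B then A
print(crossover_tests(ab, ba))
```

In this toy example CROS_{t} is large (each patient acts as his or her own control), PAR_{t} is positive but much less impressive, and SEQ_{t} is near zero, as one would expect with no carry-over.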
These three tests were then formally incorporated in a testing strategy known as the two-stage procedure[6] as follows. First, a test for carry-over was performed using SEQ_{t}. Since the test was a between-patient test and therefore of low power, a nominal type I error rate of 10% was generally used. If SEQ_{t} was not significant, the statistician proceeded to use CROS_{t} to test the principal hypothesis of interest, namely that of the equality of the two treatments. If, however, SEQ_{t} was significant, which might be taken as an indication of carry-over, the fallback test PAR_{t} was used instead to test equality of the treatments.
The procedure is illustrated in Figure 1.
Of course, to the extent that the three tests are used as prescribed, they can be combined in a single algorithm. In fact in the pharmaceutical company that I joined in 1987, the programming group had written a SAS® macro to do exactly that. You just needed to point the macro at your data and it would calculate SEQ, come to a conclusion, choose either CROS or PAR as appropriate and give you your P-value.
I hated the procedure as soon as I saw it and never used it. I argued that it was an abuse of testing to assume that just because SEQ was not significant, no carry-over had occurred. One had to rely on other arguments to justify ignoring carry-over. It was only on hearing a colleague lecture on an example where the test for carry-over had proved significant and, much to his surprise given its low power, the first period test had also proved significant, that I suddenly realised that SEQ and PAR were highly correlated and therefore this was only to be expected. In consequence, the procedure would not maintain the Type I error rate. Only a few days later a manuscript from Statistics in Medicine arrived on my desk for review. The paper by Peter Freeman[7] overturned everything everyone believed about testing for carry-over. In bolting these tests together, statisticians had created a chimeric monster. Far from helping to solve the problem of carry-over, the two-stage procedure had made it worse.
‘How can screening for something be a problem?’, an applied statistician might ask, but in asking that they would be completely forgetting the advice they would give a physician who wanted to know the same thing. The process as a whole, of screening plus remedial action, needed to be studied, and statisticians had failed to do so. Peter Freeman completely changed that. He did what statisticians should have done and looked at how the procedure as a whole behaved. In the years since, I have simply asked statisticians who wish to give an opinion on cross-over trials what they think of Freeman’s paper[7]. It has become a litmus paper for me. Their answer tells me everything I need to know.
So what is the problem? The problem is illustrated by Figure 2. This shows a simulation from a null case. There is no difference between the treatments and no carry-over. The correlation between periods one and two has been set to 0.7. One thousand trials in 24 patients (12 for each sequence) have been simulated. The figure plots CROS_{t} (blue circles) and PAR_{t} (red diamonds) on the Y axis against SEQ_{t} on the X axis. The vertical lines show the critical boundaries for SEQ_{t} at the 10% level and the horizontal lines show the critical boundaries for CROS_{t} and PAR_{t} at the 5% level. Filled circles or diamonds indicate significant results of CROS_{t} and PAR_{t} and open circles or diamonds indicate non-significant values.
It is immediately noticeable that CROS_{t} and SEQ_{t} are uncorrelated. This is hardly surprising, since given equal variances CROS and SEQ are orthogonal by construction. On the other hand PAR_{t} and SEQ_{t} are very strongly correlated. This ought not to be surprising. PAR uses the first period means. SEQ also uses the first period with the same sign. Even if the second period means were uncorrelated with the first the two statistics would be correlated, since the same information is used. However, in practice the second period means will be correlated and thus a strong correlation can result. In this example the empirical correlation is 0.92.
The consequence is that if SEQ_{t} is significant, PAR_{t} is likely to be so. This can be seen from the scatter plot, where there are far more filled diamonds in the regions to the left of the lower critical value or to the right of the higher critical value for SEQ_{t} than in the region in between. In this simulation of 1000 trials, 99 values of SEQ_{t} are significant at the 10% level, and 50 values of CROS_{t} and 53 values of PAR_{t} are significant at the 5% level. These figures are close to the expected values. However, 91 values are significant using the two-stage procedure. Of course, this is just a simulation. However, for the theory[8] see http://www.senns.demon.co.uk/ROEL.pdf .
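A simulation of this kind is straightforward to reproduce in outline. The sketch below (pure Python, arbitrary seed, t critical values for 22 degrees of freedom hardcoded rather than computed) follows the same design — 1000 null trials, 12 patients per sequence, between-period correlation 0.7 — though the exact counts will differ from those quoted above because the random draws differ:

```python
import math
import random

def tstat(x, y):
    """Pooled two-sample t-statistic."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    s2 = (sum((v - mx) ** 2 for v in x) +
          sum((v - my) ** 2 for v in y)) / (nx + ny - 2)
    return (mx - my) / math.sqrt(s2 * (1 / nx + 1 / ny))

def simulate(n_trials=1000, n=12, rho=0.7, seed=1):
    """Null AB/BA trials: no treatment effect, no carry-over, correlated periods."""
    rng = random.Random(seed)
    T5, T10 = 2.074, 1.717   # two-sided 5% and 10% t critical values, 22 df
    hits = {"CROS": 0, "PAR": 0, "SEQ": 0, "two_stage": 0}

    def sequence():
        pats = []
        for _ in range(n):
            y1 = rng.gauss(0, 1)
            y2 = rho * y1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
            pats.append((y1, y2))
        return pats

    for _ in range(n_trials):
        ab, ba = sequence(), sequence()
        cros = tstat([(y2 - y1) / 2 for y1, y2 in ab],
                     [(y2 - y1) / 2 for y1, y2 in ba])
        par = tstat([y1 for y1, _ in ba], [y1 for y1, _ in ab])
        seq = tstat([y1 + y2 for y1, y2 in ab], [y1 + y2 for y1, y2 in ba])
        hits["CROS"] += abs(cros) > T5
        hits["PAR"] += abs(par) > T5
        hits["SEQ"] += abs(seq) > T10
        # the two-stage procedure: fall back to PAR when SEQ 'detects' carry-over
        hits["two_stage"] += abs(par if abs(seq) > T10 else cros) > T5
    return hits

print(simulate())
```

CROS and PAR each come out significant in roughly 5% of trials and SEQ in roughly 10%, but the two-stage count is noticeably inflated above 5%, because PAR is so often significant precisely when SEQ is.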
In fact, this inflation really underestimates the problem with the two-stage procedure. Either the extra complication is irrelevant (we end up using CROS_{t}) or the conditional type-I error rate is massively inflated. In this example, of the 99 cases where SEQ_{t} is significant, 48 of the values of PAR_{t} are significant. A nominal 5% significance rate has become nearly a 50% conditional one!
The first lesson is that, despite what your medical statistics textbook might tell you, you should never use the two-stage procedure. It is completely unacceptable.
Should you test for carry-over at all? That’s a bit more tricky. In principle more evidence is always better than less. The practical problem is that there is no advice that I can offer you as to what to do next on ‘finding’ carry-over except to drop the nominal target significance level. (See The AB/BA cross-over: how to perform the two-stage analysis if you can’t be persuaded that you shouldn’t[8], but note the warning in the title.)
Should you avoid using cross-over trials? No. They can be very useful on occasion. Their use needs to be grounded in biology and pharmacology. Statistical manipulation is not the cure for carry-over.
Are there more general lessons? Probably. The two-stage analysis is the worst case I know of but there may be others where testing assumptions is dangerous. Remember, a decision to behave as if something is true is not the same as knowing it is true. Also, beware of recognisable subsets. There are deep waters here.
What would I say is the most important takeaway from last week’s NISS “statistics debate” if you’re using (or contemplating using) Bayes factors (BFs)–of the sort Jim Berger recommends–as replacements for P-values? It is that J. Berger only regards the BFs as appropriate when there’s grounds for a high concentration (or spike) of probability on a sharp null hypothesis, e.g., H_{0}: θ = θ_{0}.
Thus, it is crucial to distinguish between precise hypotheses that are just stated for convenience and have no special prior believability, and precise hypotheses which do correspond to a concentration of prior belief. (J. Berger and Delampady 1987, p. 330).
Now, to be clear, I do not think that P-values need to be misinterpreted (Bayesianly) to use them evidentially, and think it’s a mistake to try to convert them into comparative measures of belief or support. However, it’s important to realize that even if you do think such a conversion is required, and are contemplating replacing them with the kind of BF Jim Berger advances, then it would be wrong to do so if there were no grounds for a high prior belief on a point null. Jim said in the debate that people want a Bayes factor, so we give it to them. But when you’re asking for it, especially if it’s described as a “default” method, you might assume it is capturing a reasonably common standpoint—not one that only arises in an idiosyncratic case. To allege that there’s really much less evidence against the sharp null than is suggested by a P-value, as does the BF advocate, is to hide the fact that most of this “evidence” is due to the spiked concentration of prior belief being given to the sharp null hypothesis. This is an a priori bias in favor of the sharp null, not evidence in the data. (There is also the matter of how the remainder of the prior is smeared over the parameter values in the alternative.) Jim Berger, somewhat to my surprise (at the debate) reaffirms that that is the context for the intended use of his recommended Bayes factor with the spiked prior. Yet these BFs are being touted as a tool to replace P-values for everyday use.
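To see the arithmetic driving this disagreement, here is a minimal sketch (my own, not code from Berger) of the standard calculation for a normal mean with known σ: prior mass 0.5 spiked on H_{0}: θ = 0 and, under the alternative, θ ~ N(0, σ²) — one conventional “default” choice:

```python
import math

def spiked_posterior(z, n, prior_null=0.5):
    """P(H0 | data) for H0: theta = 0 in a normal model with known sigma,
    where z = sqrt(n) * xbar / sigma, prior mass `prior_null` sits on the
    point null, and theta ~ N(0, sigma^2) under the alternative."""
    bf01 = math.sqrt(1 + n) * math.exp(-z ** 2 * n / (2 * (1 + n)))
    post_odds = bf01 * prior_null / (1 - prior_null)
    return post_odds / (1 + post_odds)

# A result just significant at the two-sided 5% level (z = 1.96):
print(spiked_posterior(1.96, 100))     # about 0.60: the null looks well supported
print(spiked_posterior(1.96, 10000))   # same z, larger n: higher still (Jeffreys-Lindley)
```

A result just significant at the two-sided 5% level thus coexists with a posterior probability of roughly 0.6 on the null, and the same z with a larger n supports the null even more strongly. The “overstatement” is manufactured by the spiked prior, which is precisely the point at issue.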
Jim’s Sharp Null BFs were developed For a Very Special Case. Harold Jeffreys developed the spiked priors for a Bayesian special problem: how to give high posterior probabilities to well corroborated theories. This is quite different from the typical use of statistical significance tests to detect indications of an observed effect that is not readily due to noise. (Of course isolated small P-values do not suffice to infer a genuine experimental phenomenon, as R.A. Fisher emphasized.)
Precise hypotheses . . . ideally relate to, say, some precise theory being tested. Of primary interest is whether the theory is right or wrong; the amount by which it is wrong may be of interest in developing alternative theories, but the initial question of interest is that modeled by the precise hypothesis test (J. Berger and Sellke 1987, p. 136).
Casella and Roger Berger (1987b) respond to Jim Berger and Sellke and to Jim Berger and Delampady –all in 1987. “We would be surprised if most researchers would place even a 10% prior probability of H_{0}. We hope that the casual reader of Berger and Delampady realizes that the big discrepancies between P-values and P(H_{0}|x) . . . are due to a large extent to the large value of [the prior of 0.5 to H_{0}] that was used.” They make the astute point that the most common uses of a point null, asserting that the difference between means is 0, or that a regression coefficient is 0, merely describe a potentially interesting feature of the population, with no special prior believability. They conclude: “J. Berger and Delampady admit…, P-values are reasonable measures of evidence when there is no a priori concentration of belief about H_{0}” (ibid., p. 345). Thus, they conclude, “the very argument that Berger and Delampady use to dismiss P-values can be turned around to argue for P-values” (ibid., p. 346).
As I said in response to debate question 3, “the move to redefine statistical significance, advanced by a megateam in 2017, including Jim, all rest upon the lump high prior probability on the null as well as the appropriateness of evaluating P-values using Bayes factors. The redefiners are prepared to say there’s no evidence against or even evidence for a null hypothesis, even though that point null is entirely excluded from the corresponding 95% confidence interval. This would often erroneously fail to uncover discrepancies [from the point null]”.
Conduct an Error Statistical Critique. Thus a question you should ask in contemplating the application of the default BF is this: What’s the probability the default BF would find no evidence against the null or even evidence for it for an alternative or discrepancy of interest to you? If the probability is fairly high then you’d not want to apply it.
Notice what we’re doing in asking this question: we’re applying the frequentist error statistical analysis to the Bayes factor. What’s sauce for the goose is sauce for the gander.[ii] This is what the error statistician needs to do whenever she’s told an alternative measure ought to be adopted as a substitute for an error statistical one: check its error statistical properties.
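As a sketch of such a check (my own construction, using the conventional spiked-prior Bayes factor for a normal mean: spike on θ = 0, θ ~ N(0, σ²) under the alternative), one can simulate data with a true discrepancy of 0.2σ and ask how often the “default” BF comes out favouring the null:

```python
import math
import random

def bf01(z, n):
    """Conventional spiked-prior Bayes factor in favour of H0: theta = 0
    (normal model, theta ~ N(0, sigma^2) under H1) -- an assumed 'default',
    not the only one possible."""
    return math.sqrt(1 + n) * math.exp(-z ** 2 * n / (2 * (1 + n)))

def prob_bf_favours_null(theta_over_sigma, n, sims=20000, seed=1):
    """Estimate P(BF01 > 1) when the true mean is theta = c * sigma, c > 0."""
    rng = random.Random(seed)
    mean_z = math.sqrt(n) * theta_over_sigma
    favours = sum(bf01(rng.gauss(mean_z, 1), n) > 1 for _ in range(sims))
    return favours / sims

# True discrepancy 0.2 sigma with n = 100 (so E[z] = 2, a genuine effect):
print(prob_bf_favours_null(0.2, 100))   # roughly 0.56 under these assumptions
```

Under these assumptions the default BF favours the null more than half the time even though a genuine discrepancy is present — exactly the error-statistical property one would want to know before adopting it.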
Is the Spiked Prior Appropriate to the Problem, Even With a Well-corroborated Value? Even in those highly special cases where a well-corroborated substantive theory gives a high warrant for a particular value of a parameter, it’s far from clear that a spiked prior reflects how scientists examine the question: is the observed anomaly (with the theory) merely background noise or some systematic effect? Remember when neutrinos appeared to travel faster than light—an anomaly for special relativity—in an OPERA experiment in 2011?
This would be a case where Berger would place a high concentration of prior probability on the point null, the speed of light c given by special relativity. The anomalous results, at most, would lower the posterior belief. But I don’t think scientists were interested in reporting that the posterior probability for the special relativity value had gone down a bit, due to their anomalous result, but was still quite high. Rather, they wanted to know whether the anomaly was mere noise or genuine, and finding it was genuine, they wanted to pinpoint blame for the anomaly. It turns out a fiber optic cable wasn’t fully screwed in and one of the clocks was ticking too fast. Merely discounting the anomaly (or worse, interpreting it as evidence strengthening their belief in the precise null) because of strong belief in special relativity would sidestep the most interesting work: gleaning important information about how well or poorly run the experiment was.[iii]
It is interesting to compare the position of the spiked prior with an equally common Bayesian position that all null hypotheses are false. The disagreement may stem from viewing H_{0} as asserting the correctness of a scientific theory (the spiked prior view) as opposed to asserting that a parameter in a model, representing a portion of that theory, is correct (the all nulls are false view).
Search Under “Overstates” On This Blog for More (and “P-values” for much more). The reason to focus attention on the disagreement between the P-value and the Bayes factor with a sharp null is that it explains an important battle in the statistics wars, and thus points the way to (hopefully) getting beyond it. The very understanding of the use and interpretation of error probabilities differs in the rival approaches.
As I was writing my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP), I was often distracted by high pitched discussions in 2015-17 about P-values “overstating” evidence on account of being smaller than a posterior probability on a sharp null. Thus, I wound up writing several posts, the ingredients of which made their way into the book, notably, Section 4.4. Here’s one. I eventually coined it as a fallacy, “P-values overstate the evidence fallacy”. For many excerpts from the book, including the rest of the “Tour” where this issue arises, see this blogpost.
Stephen Senn wrote excellent guest posts on P-values for this blog that are especially clarifying, such as this one. He observes that Jeffreys, having already placed the spiked prior on the point null, required only that the posterior on the alternative exceeded .5 in order to find evidence against the null, not that it be a large number such as .95.
A parsimony principle is used on the prior distribution. You can’t use it again on the posterior distribution. Once that is calculated, you should simply prefer the more probable model. The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives. (S. Senn)
***
[i] I mentioned two of the simplest inferential arguments using P-values during the debate: one for blocking an inference, a second for inferring incompatibility with (or discrepancy from) a null hypothesis, set as a reference: “If even larger or more extreme effects than you observed are frequently brought about by chance variability alone (P-value is not small), clearly you don’t have evidence of incompatibility with the “mere chance” hypothesis.”
“…A small P-value indicates discrepancy from a null value because with high probability, 1 – p, the test would have produced a larger P-value (less impressive difference) in a world adequately described by H0. Since the null hypothesis would very probably have survived if correct, when it doesn’t survive, it indicates inconsistency with it.” For a more detailed discussion see SIST, e.g., Souvenir C (SIST, p. 52) https://errorstatistics.files.wordpress.com/2019/04/sist_ex1-tourii.pdf.
[ii] From SIST* (p. 247): “The danger in critiquing statistical method X from the standpoint of the goals and measures of a distinct school Y, is that of falling into begging the question. If the P -value is exaggerating evidence against a null, meaning it seems too small from the perspective of school Y, then Y’ s numbers are too big, or just irrelevant, from the perspective of school X. Whatever you say about me bounces off and sticks to you.”
[iii] Conversely, the sharp null in discovering the Higgs Boson was disbelieved even before they built the expensive particle colliders (physicists knew there had to be a Higgs particle of some sort). You can find a number of posts on the Higgs on this blog (also in Mayo 2018, Excursion 3 Tour III).
*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). The book follows an “itinerary” of a stat cruise with lots of museum stops and souvenirs.
How did I respond to those 7 burning questions at last week’s (“P-Value”) Statistics Debate? Here’s a fairly close transcript of my (a) general answer, and (b) final remark, for each question–without the in-between responses to Jim and David. The exception is question 5 on Bayes factors, which naturally included Jim in my general answer.
The questions with the most important consequences, I think, are questions 3 and 5. I’ll explain why I say this in the comments. Please share your thoughts.
Question 1. Given the issues surrounding the misuses and abuse of p-values, do you think they should continue to be used or not? Why or why not?
Yes we should continue to use P-values and statistical significance tests. Uses of P-values are a piece in a rich set of tools for assessing and controlling the probabilities of misleading interpretations of data (error probabilities). They’re “the first line of defense against being fooled by randomness” (Yoav Benjamini). If even larger or more extreme effects than you observed are frequently brought about by chance variability alone (P-value is not small), clearly you don’t have evidence of incompatibility with the “mere chance” hypothesis.
Even those who criticize P-values will employ them at least if they care to check the assumptions of their statistical models—this includes Bayesians George Box, Andrew Gelman, and Jim Berger.
Critics of P-values often allege it’s too easy to obtain small P-values, but notice the replication crisis is all about how difficult it is to get small P-values with preregistered hypotheses. This shows the problem isn’t P-values but the selection effects and data-dredging. However, the same data dredged hypothesis can occur in likelihood ratios, Bayes factors, and Bayesian updating, except that we now lose the direct grounds to criticize inferences flouting error statistical control. The introduction of prior probabilities –which may also be data dependent–offers further researcher flexibility.
Those who reject P-values are saying we should reject a method because it can be used badly. That’s a very bad argument, committing straw person fallacies.
We should reject misuses and abuses of P-values, but there’s a danger of blithely substituting “alternative tools” that throw out the error control baby with the bad statistics bathwater.
Final remark on P-values
What’s missed in the reject P-values movement is the major reason for calling in statistics in science is that it gives tools to inquire whether an observed phenomenon could be a real effect or just noise in the data. P-values have the intrinsic properties for this task, if used properly. To reject them is to jeopardize this important role of statistics. As Fisher emphasizes, we seek randomized controlled trials in order to ensure the validity of statistical significance tests. To reject P-values because they don’t give posterior probabilities in hypotheses is illicit. The onus is on those claiming we want such posteriors to show, for any way of getting them, why.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Question 2 Should practitioners avoid the use of thresholds (e.g., P-value thresholds) in interpreting data? If so, does this preclude testing?
There’s a lot of confusion about thresholds. What people oppose are dichotomous accept/reject routines. We should move away from them as well as unthinking uses of thresholds like 95% confidence levels or other quantities. Attained P-values should be reported (as all the founders of tests recommended). We should not confuse fixing a threshold to habitually use with prespecifying a threshold beyond which there is evidence of inconsistency with a test hypothesis. I’ll often call it the null for short.
Some think that banishing thresholds would diminish P-hacking and data dredging. It is the opposite. In a world without thresholds, it would be harder to criticize those who fail to meet a small P-value because they engaged in data dredging and multiple testing, and at most have given us a nominally small P-value. Yet that is the upshot of declaring that predesignated P-value thresholds should not be used at all in interpreting data. If an account cannot say about any outcomes in advance that they will not count as evidence for a claim, then there is no test of that claim.
Giving up on tests means forgoing statistical falsification. What’s the point of insisting on replications if at no point can you say, the effect has failed to replicate?
You may favor a philosophy of statistics that rejects statistical falsification, but it will not do to declare by fiat that science should reject the falsification or testing view. (The “no thresholds” view also torpedoes common testing uses of confidence intervals and Bayes Factor standards.)
So my answer is NO and YES: don’t abandon thresholds; to do so is to ban tests.
Final remark on thresholds Q-2
A common fallacy is to suppose that because we have a continuum, that we cannot distinguish points at the extremes (fallacy of the beard). We can distinguish results readily produced by random variability from cases where there is evidence of incompatibility with the chance variability hypothesis. We use thresholds throughout science to measure if you’re pre-diabetic, diabetic, etc.
When P-values are banned altogether … the eager researcher does not claim, I’m simply describing, but they invariably go on to claim evidence for a substantive psych theory—but on results that would be blocked if they’d required a reasonably small P-value threshold.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Question 3 Is there a role for sharp null hypotheses or should we be thinking about interval nulls?
… I’d agree with those who regard testing of a point null hypothesis as problematic and often misused. Notice that arguments purporting to show P-values exaggerate evidence are based on this point null and a spiked or lump of prior probability on it. By giving a spiked prior to the nil, it’s easy to find the nil more likely than the alternative—the Jeffreys-Lindley paradox: the P-value can differ from the posterior probability on the null. But the posterior can also equal the P-value; it can range from p to 1 – p. In other words, the Bayesians differ amongst themselves, because with diffuse priors the P-value can equal the posterior on the null hypothesis.
My own work reformulates results of statistical significance tests in terms of discrepancies from the null that are well or poorly tested. A small P-value indicates a discrepancy from the null value because, with high probability 1 − p, the test would have produced a larger P-value (a less impressive difference) in a world adequately described by H_{0}. Since the null hypothesis would very probably have survived if correct, when it doesn’t survive, that indicates inconsistency with it.
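The probabilistic claim above can be checked by simulation: under a true null, the P-value is (approximately) uniform, so the test yields a P-value larger than a given p with probability about 1 − p. A minimal sketch in standard-library Python (the sample size, replication count, and seed are illustrative assumptions):

```python
import math
import random

def norm_sf(z):
    # upper-tail probability of a standard normal
    return 0.5 * math.erfc(z / math.sqrt(2))

random.seed(0)
p_obs = 0.05                 # an observed small P-value (assumed)
n_sims, n = 20_000, 30       # replications simulated under H0

larger = 0
for _ in range(n_sims):
    xs = [random.gauss(0, 1) for _ in range(n)]   # a world where H0 holds
    z = math.sqrt(n) * (sum(xs) / n)              # z-statistic, sigma = 1 known
    p = 2 * norm_sf(abs(z))
    if p > p_obs:
        larger += 1

print(f"fraction of H0 replications with p > {p_obs}: {larger / n_sims:.3f}")
```

The printed fraction is close to 0.95, illustrating why a null that fails to survive a test it would very probably have passed is thereby indicated to be inconsistent with the data.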
Final remark on sharp nulls Q-3
The move to redefine significance, advanced by a megateam including Jim, rests upon the lump of high prior probability on the null as well as on evaluating P-values with Bayes factors. It’s not equipoise; it’s biased in favor of the null. The redefiners are prepared to say there’s no evidence against, or even evidence for, a null hypothesis, even though that point null is entirely excluded from the corresponding 95% confidence interval. This would often erroneously fail to uncover discrepancies.
Whether to use a lower threshold is one thing; arguing that we should on the basis of Bayes factor standards lacks legitimate grounds.[1][2]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Question 4 Should we be teaching hypothesis testing anymore, or should we be focusing on point estimation and interval estimation?
Absolutely [we should be teaching hypothesis testing]. The way to understand confidence interval estimation, and to fix its shortcomings, is to understand their duality with tests. The same person who developed confidence intervals developed tests in the 1930s—Jerzy Neyman. The intervals are inversions of tests.
A 95% CI contains the parameter values that are not statistically significantly different from the data at the 5% level.
While I agree that P-values should be accompanied by CIs, my own preferred reconstruction of tests blends intervals and tests. It reports the discrepancies from a reference value that are well or poorly indicated at different levels—not just 1 level like .95. This improves on current confidence interval use. For example, the justification standardly given for inferring a particular confidence interval estimate is that it came from a method which, with high probability, would cover the true parameter value. This is a performance justification. The testing perspective on CIs gives an inferential justification. I would justify inferring evidence that the parameter exceeds the CI lower bound this way: if the parameter were smaller than the lower bound, then with high probability we would have observed a smaller value of the test statistic than we did.
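The duality is concrete: a 95% z-interval is exactly the set of null values that a 5% two-sided z-test fails to reject. A short sketch (the data summary n, sigma, and observed mean are made-up illustrative values, not from the text):

```python
import math

def norm_sf(z):
    # upper-tail probability of a standard normal
    return 0.5 * math.erfc(z / math.sqrt(2))

n, sigma, xbar = 25, 1.0, 0.5                 # illustrative data summary
se = sigma / math.sqrt(n)
lo, hi = xbar - 1.96 * se, xbar + 1.96 * se   # standard 95% CI

# The CI is exactly the set of nulls mu0 the 5% two-sided test fails to reject:
for mu0 in [lo - 0.01, lo + 0.01, xbar, hi - 0.01, hi + 0.01]:
    p = 2 * norm_sf(abs(xbar - mu0) / se)
    in_ci = lo <= mu0 <= hi
    assert (p >= 0.05) == in_ci               # duality holds at every probe point
    print(f"mu0 = {mu0:+.3f}: p = {p:.3f}, inside CI: {in_ci}")
```

Values just outside either bound are rejected at the 5% level; values just inside are not, which is the inversion-of-tests construction in miniature.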
Amazingly, the last president of the ASA, Karen Kafadar, had to appoint a new task force on statistical significance tests to affirm that statistical hypothesis testing is indeed part of good statistical practice, though much credit goes to her for bringing this about.
Final remark on question 4
Understanding the duality between tests and CIs is the key to improving both. …So it makes no sense for advocates of the “new statistics” to shun tests. The testing interpretation of confidence intervals also scotches criticisms based on examples where a 95% confidence estimate contains all possible parameter values. Although such an inference is ‘trivially true,’ it is scarcely vacuous on the testing construal. As David Cox remarks, that all parameter values are consistent with the data is an informative statement about the limitations of the data (to detect discrepancies at the particular level).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Question 5 What are your reasons for or against the use of Bayes Factors?
Jim is a leading advocate of Bayes factors and of the non-subjective interpretation of the Bayesian prior probabilities (2006) to be used. ‘Eliciting’ subjective priors, Jim has convincingly argued, is too difficult; experts’ prior beliefs almost never even overlap, he says, and scientists are reluctant to let subjective beliefs overshadow data. Default priors (reference or non-subjective priors) are supposed to prevent prior beliefs from influencing the posteriors—they are data dominant in some sense. But there is a variety of incompatible ways to go about this job.
(A few are maximum entropy, invariance, maximizing the missing information, coverage matching.) As David Cox points out, it’s unclear how we should interpret these default probabilities. Default priors, we are told, are simply formal devices to obtain default posteriors. “The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…” (Cox and Mayo 2010, 299), being improper.
Prior probabilities are supposed to let us bring in background information, but this pulls in the opposite direction from the goal of the default prior which is to reflect just the data. The goal of representing your beliefs is very different from the goal of finding a prior that allows the data to be dominant. Yet, current uses of Bayesian methods combine both in the same computation—how do you interpret them? I think this needs to be assessed now that they’re being so widely advocated.
Final remark on Q-5
BFs give a comparative appraisal, not a test, and the result depends on how you assign the priors to the null and alternative hypotheses.
Bayesian testing, Bayesians admit, is a work in progress. My feeling is, we shouldn’t kill a well worked out theory of testing for one that is admitted to be a work in progress.
It might be noted that even default Bayesian Jose Bernardo holds that the difference between the P-value and the BF (the Jeffreys Lindley paradox or Fisher-Jeffreys disagreement) is actually an indictment of the BF because it finds evidence in favor of a null hypothesis even when an alternative is much more likely.
Other Bayesians dislike the default priors because they can lead to improper posteriors and thus to violations of probability theory. This leads some like Dennis Lindley back to subjective Bayesianism.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Question 6 With so much examination of if/why the usual nominal type I error .05 is appropriate, should there be similar questions about the usual nominal type II error?
No, there should not be a similar examination of type II error bounds. Rigid bounds for either error should be avoided; N-P themselves urged that the specifications be used with discretion and understanding.
It occurs to me, if an examination is wanted it should be done by the new ASA Task Force on Significance Tests and Replicability. Its members aren’t out to argue for rejecting significance tests but to show they are part of proper statistical practice.
Power, the complement of the type II error probability, is, I often say, one of the most abused notions (note that it is only defined in terms of a threshold). Critics of statistical significance tests, I’m afraid to say, often fallaciously take a just statistically significant difference at level α as a better indication of a discrepancy from the null if the test’s power to detect that discrepancy is high rather than low. This is like saying it’s a better indication of a discrepancy of at least 10 than of at least 1 (whatever the parameter is). I call it the Mountains out of Molehills fallacy. It results from trying to use power and α as ingredients in a Bayes factor, and from viewing non-Bayesian methods through a Bayesian lens.
We set a high power to detect population effects of interest, but finding statistical significance doesn’t warrant saying we’ve evidence for those effects.
(The significance tester doesn’t infer points but inequalities, discrepancies at least such and such).
Final remark on Q-6, power
A legitimate criticism of P-values is they don’t give population effect sizes. Neyman developed power analysis for this purpose, in addition to comparing tests pre-data. Yet critics of tests typically keep to Fisherian tests that don’t have explicit alternatives or power. Neyman was keen to avoid misinterpreting non-significant results as evidence for a null hypothesis. He used power analysis post data (like Jacob Cohen much later) to set an upper bound for a discrepancy from the null value.
If a test has high power to detect a population discrepancy, but does not do so, it’s evidence the discrepancy is absent (qualified by the level).
My preference is to use the attained power but it’s the same reasoning.
I see people objecting to post-hoc power as “sinister,” but they’re referring to computing power using the observed effect as the parameter value in the computation. That is not power analysis.
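Neyman-style post-data power analysis can be made numerical for a one-sided z-test. In the sketch below (the sample size, level, and discrepancy values are illustrative assumptions), power is computed at hypothesized discrepancies fixed in advance, not at the observed effect; a non-rejection from a test with power near 1 at delta = 0.5 is grounds for ruling out discrepancies that large (qualified by the level):

```python
import math

def norm_sf(z):
    # upper-tail probability of a standard normal
    return 0.5 * math.erfc(z / math.sqrt(2))

n, sigma, z_alpha = 100, 1.0, 1.645   # one-sided 5% z-test of mu <= 0 (illustrative)
se = sigma / math.sqrt(n)

def power(delta):
    # P(reject | true mean = delta): probability Z exceeds the cutoff
    # when the statistic is centered at delta / se
    return norm_sf(z_alpha - delta / se)

for delta in [0.1, 0.3, 0.5]:
    print(f"power to detect mu >= {delta}: {power(delta):.3f}")
```

Note that `delta` here is a pre-specified population discrepancy of interest; plugging in the observed effect instead is the "sinister" post-hoc computation the text distinguishes from genuine power analysis.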
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
QUESTION 7 What are the problems that lead to the reproducibility crisis and what are the most important things we should do to address it?
Irreplication is due to many factors, from data generation and modeling to problems of measurement and linking statistics to substantive science. Here I just focus on P-values. The key problem is that in many fields, latitude in collecting and interpreting data makes it too easy to dredge up impressive-looking findings even when spurious. The fact that it becomes difficult to replicate effects when features of the tests are tied down shows the problem isn’t P-values but exploiting researcher flexibility and multiple testing. The same flexibility can occur when the p-hacked hypotheses enter methods promoted as alternatives to significance tests: likelihood ratios, Bayes factors, or Bayesian updating. But then the direct grounds to criticize inferences as flouting error-statistical control are lost (at least not without adding non-standard stipulations), since those methods condition on the actual outcome and don’t consider outcomes other than the one observed. This is embodied in something called the likelihood principle.
Admittedly error control, some think, is only of concern to ensure low error rates in some long run. I argue instead that what bothers us about the P-hacker and data dredger is that they have done a poor job in the case at hand. Their method very probably would have found some such effect even if it is merely noise.
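The "very probably would have found some such effect" point is easy to quantify in the simplest case: a researcher who dredges through k independent null effects and reports the first nominally significant one. A simulation sketch (k, the sample sizes, and the seed are illustrative assumptions):

```python
import math
import random

def norm_sf(z):
    # upper-tail probability of a standard normal
    return 0.5 * math.erfc(z / math.sqrt(2))

random.seed(1)
k, n_sims, n = 20, 5_000, 30      # 20 null "effects" dredged per study (assumed)

hits = 0
for _ in range(n_sims):
    for _ in range(k):            # try k independent effects, all truly null
        xs = [random.gauss(0, 1) for _ in range(n)]
        z = math.sqrt(n) * (sum(xs) / n)
        if 2 * norm_sf(abs(z)) < 0.05:
            hits += 1             # report the first "discovery" found
            break

print(f"chance of reporting a spurious effect: {hits / n_sims:.3f}")
print(f"theory: 1 - 0.95**{k} = {1 - 0.95**k:.3f}")
```

With 20 tries, the nominal 5% error rate inflates to roughly 64%, which is the sense in which the dredger's method "very probably would have found some such effect even if it is merely noise."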
Probability here is to assess how well tested claims are, which is very different from how comparatively believable they are—claims can even be known true while poorly tested. Though there’s room for both types of assessments in different contexts, how plausible and how well tested are very different and this needs to be recognized.
To address replication problems, statistical reforms should be developed together with a philosophy of statistics that properly underwrites them.[3]
Final remark on Q-7
Please see the video here or in this news article.
[1] The following are footnotes 4 and 5 from page 252 of Statistical Inference as Severe testing: How to Get Beyond the Statistics Wars. The relevant section is 4.4. (pp. 246-259)
Casella and Roger (not Jim) Berger (1987b) argue, “We would be surprised if most researchers would place even a 10% prior probability on H_{0}. We hope that the casual reader of Berger and Delampady realizes that the big discrepancies between P-values and P(H_{0}|x) . . . are due to a large extent to the large value of [the prior of 0.5 to H_{0}] that was used.” The most common uses of a point null, asserting that the difference between means is 0, or that a regression coefficient is 0, merely describe a potentially interesting feature of the population, with no special prior believability. “J. Berger and Delampady admit . . . P-values are reasonable measures of evidence when there is no a priori concentration of belief about H_{0}” (ibid., p. 345). Thus, “the very argument that Berger and Delampady use to dismiss P-values can be turned around to argue for P-values” (ibid., p. 346).
Harold Jeffreys developed the spiked priors for a very special case: to give high posterior probabilities to well corroborated theories. This is quite different from the typical use of statistical significance tests to detect indications of an observed effect that is not readily due to noise. (Of course isolated small P-values do not suffice to infer a genuine experimental phenomenon.)
In defending spiked priors, J. Berger and Sellke move away from the importance of effect size. “Precise hypotheses . . . ideally relate to, say, some precise theory being tested. Of primary interest is whether the theory is right or wrong; the amount by which it is wrong may be of interest in developing alternative theories, but the initial question of interest is that modeled by the precise hypothesis test” (1987, p. 136).
[2] As Cox and Hinkley explain, most tests of interest are best considered as running two one-sided tests, insofar as we are interested in the direction of departure. (Cox and Hinkley 1974; Cox 2020).
[3] In the error statistical view, the interest is not in measuring how strong your degree of belief in H is but how well you can show why it ought to be believed or not. How well can you put to rest skeptical challenges? What have you done to put to rest my skepticism of your lump prior on “no effect”?
National Institute of Statistical Sciences (NISS): The Statistics Debate (Video)