Q. Was it a mistake to quarantine the passengers aboard the Diamond Princess in Japan?

A. The original statement, which is not unreasonable, was that the best thing to do with these people was to keep them safely quarantined in an infection-control manner on the ship. As it turned out, that was very ineffective in preventing spread on the ship. So the quarantine process failed. I mean, I’d like to sugarcoat it and try to be diplomatic about it, but it failed. I mean, there were people getting infected on that ship. So something went awry in the process of quarantining on that ship. I don’t know what it was, but a lot of people got infected on that ship. (Dr. A Fauci, Feb 17, 2020)

This is part of an interview with Dr. Anthony Fauci, the coronavirus point person we’ve been seeing so much of lately. Fauci has been the director of the National Institute of Allergy and Infectious Diseases since 1984! You might find his surprise surprising. Even before our recent cram course in coronavirus transmission, tales of cruises hit by viral outbreaks were familiar enough, and the horror stories from passengers on the floating petri dish were well known by this Feb 17 interview. Even if everything had gone as planned, the quarantine really applied only to the (approximately 3,700) passengers, because the 1,000 or so crew members still had to run the ship, as well as cook and deliver food to the passengers’ cabins. Moreover, the ventilation systems on cruise ships can’t filter out particles smaller than 5,000 nanometers.[1]

“If the coronavirus is about the same size as SARS [severe acute respiratory syndrome], which is 120 nanometers in diameter, then the air conditioning system would be carrying the virus to every cabin,” according to Purdue researcher Qingyan Chen, who specializes in how air particles spread in different passenger craft. (His estimate was correct: the coronavirus is 120 nanometers.) Halfway through the quarantine, after passenger complaints, the ship began circulating only fresh air, which would have been preferable from the start. By then, however, it was too late: the ventilation system was likely already filled with the virus, says Chen.[2] Arthur Caplan, the bioethicist famous for issuing rulings on such matters, declares that

“Boats are notorious places for being incubators for viruses. It’s only morally justified to keep people on the boat if there are no other options.”

Admittedly, it is hard to see an alternative that could accommodate so many passengers for a 2-week quarantine on land, and there was the danger of infections spreading to the local population in Japan. So, by his assessment, the quarantine may be considered morally justified.

*The upshot*: As of 19 March 2020, at least 712 out of the 3,711 passengers and crew had tested positive for Covid-19; 9 of those who were on board have died from the disease (all over the age of 70). As I was writing this, I noted a new CDC report on the Diamond Princess as well as other cruise ships; it also reports 9 deaths.[3] A table on the distribution of ages of passengers on the Diamond Princess is in Note [4].

*So how did the Diamond Princess cruise ship become a floating petri dish for the coronavirus from Feb 4-Feb 20?*

**The Quarantine**

It was the last night of a 2-week luxury cruise aboard the Diamond Princess in Japan (Feb 3) when the captain came on the intercom. He announced that a passenger who had disembarked in Hong Kong 9 days earlier (Jan 25) had tested positive for the coronavirus. (He had been on board for 5 days.) Everyone would have to stay on board an extra day to be examined by the Japanese health authorities. A new slate of activities was arranged to occupy passengers during the day of health screening (later mostly dropped). But on the evening of February 3, things continued on the ship more or less as before the intercom message.

“The response aboard the Diamond Princess reflected concern, but not a major one. The buffets remained open as usual. Onboard celebrations, opera performances and goodbye parties continued”. (NYT, March 8)

The next day, as health officials went door to door to screen passengers, guests still circulated on board, lined up for buffets, and used communal spaces. But then, the following morning (Feb 5), as guests were heading to breakfast, the captain came over the intercom again. He announced that 10 people had tested positive for the coronavirus and would be taken off the ship. Everyone else would now have to be quarantined in their cabins for 14 days. On the second day of the quarantine (Feb 6), it was announced that 20 more people had tested positive; then on day three, 41 more; then 64 more; and on and on. By the end of the quarantine on February 19, at least 621 on the ship had tested positive for the virus.

Adding to the stress, “we quickly learned that our tests were part of an initial batch of 273 samples and that the first 10 cases reported on day one were only from the first 31 samples that had been processed” from the passengers with highest risk. (U.S. passenger, Spencer Fehrenbacher, interviewed on the ship)

As the number of infected ballooned, passengers were not always informed right away; some took to counting the ambulances lined up outside to guess how many new cases would be announced. I wonder if the passengers were told that the very first person to test positive was a crew member responsible for preparing food. In fact, by February 9, around 20 of the crew members had tested positive, *15 of whom were workers preparing food*. Crew members lived in close quarters, shared rooms and continued to eat their meals together buffet-style. They had no choice but to keep running the ship as best they could.

“Feverish passengers were left in their rooms for days without being tested for the virus. Health officials and even some medical professionals worked on board without full protective gear. [Several got infected.] Sick crew members slept in cabins with roommates who continued their duties across the ship, undercutting the quarantine”. (NYT Feb 22)

Passengers in cabins without windows (and later, others) were allowed to walk on deck, six feet apart, for a short time daily. Unfortunately, presumed infection-free “green zones” were not rigidly separated from potentially contaminated “red zones”, and people walked back and forth between them. Gay Courter, a writer from the U.S. who, as it happens, situated one of her murder mysteries on a cruise ship, told *Time* “It feels like I’m in a bad movie. I tell myself, ‘Wake up, wake up, this isn’t really happening.’” (Time, Feb 11). This is the same bad movie we are all in now, except our horror tale has gotten much worse than on Feb 10.

At some point, I think Feb 10, the ship became the largest concentration of Covid-19 cases outside China, which is why you’ll notice the Diamond Princess has its own category in the data compiled by the World Health Organization (and on Worldometer).

In a Science Today article, a Japanese infectious disease specialist regretted the patchwork way in which passenger testing was done:

Japan has missed a chance to answer important epidemiological questions about the new virus and the illness it causes. For instance, a rigorous investigation that tested all passengers at the start of the quarantine and followed them through to the end could have provided information on when infections occurred and answered questions about transmission, the course of the illness, and the behavior of the virus.

(They were only able to test people in stages.) A similar paucity of testing in the U.S. robs us of crucial information for understanding and controlling the coronavirus. However, a fair amount is being gleaned from the Diamond Princess, as you can see in the references below. (Please share additional references in the comments.) More is bound to follow.

**Estimates from the Diamond Princess**

“Data from the *Diamond Princess* cruise ship outbreak provides a unique snapshot of the true mortality and symptomatology of the disease, given that everyone on board was tested, regardless of symptoms”–or at least virtually all. [link] The estimates (from the Diamond Princess) I’ve seen are based on those from the London School of Hygiene and Tropical Medicine, in a paper still in preprint form, “Estimating the infection and case fatality ratio for COVID-19 using age-adjusted data from the outbreak on the Diamond Princess cruise ship”.

Adjusting for delay from confirmation-to-death, we estimated case and infection fatality ratios (CFR, IFR) for COVID-19 on the Diamond Princess ship as 2.3% (0.75%-5.3%) [among symptomatic] and 1.2% (0.38-2.7%) [all cases]. Comparing deaths onboard with expected deaths based on naive CFR estimates using China data, we estimate IFR and CFR in China to be 0.5% (95% CI: 0.2-1.2%) and 1.1% (95% CI: 0.3-2.4%) respectively. (PDF)

(For definitions and computations, see the article.) These are lower than the numbers we are often hearing. They used their lower fatality estimates to adjust (down) the estimates from China data. The paper lists a number of caveats.[5] I hope readers will have a look at it (it’s just a few pages) and share their thoughts in the comments. (Their estimates are in sync with an article by Fauci et al., to come out this week in *NEJM*; but whatever the numbers turn out to be, we know our healthcare system, in many places, is being overloaded. [6])
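To see roughly what “adjusting for delay from confirmation-to-death” does, here is a minimal numerical sketch in Python. It is *not* the authors’ model (they use age-stratified Diamond Princess data and a full delay distribution); every number below is invented purely for illustration.

```python
# Hedged sketch of delay adjustment for a case fatality ratio (CFR).
# All numbers are made up; this is NOT the preprint's actual computation.

cases = [int(10 * 1.2 ** t) for t in range(20)]  # hypothetical daily new cases in a growing outbreak

TRUE_CFR = 0.02   # assumed "true" fatality ratio among confirmed cases
DELAY = 7         # assumed confirmation-to-death delay, in days

# Deaths observed by the last day come only from cases confirmed at least
# DELAY days earlier (expected values are used, so no rounding noise).
resolved = sum(cases[: len(cases) - DELAY])
deaths = TRUE_CFR * resolved

naive_cfr = deaths / sum(cases)   # deaths / all cases to date: biased down mid-outbreak
adjusted_cfr = deaths / resolved  # deaths / cases old enough to have a known outcome

print(f"naive CFR:    {naive_cfr:.2%}")     # well under the assumed 2%
print(f"adjusted CFR: {adjusted_cfr:.2%}")  # recovers the assumed 2%
```

The point is only directional: while an outbreak is still growing, dividing deaths to date by all cases to date understates fatality, because the most recent cases have not yet had time to resolve.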

Another study takes the daily reports of infections on the Diamond Princess to attempt to evaluate the impact of the quarantine, as imperfect as it was, in comparison with a counterfactual situation in which nothing was done, including not removing infected people from the ship. They estimate that nearly 80%, rather than 17%, would have been infected. [link]

We found that the reproductive number [R_{0}] of COVID-19 in the cruise ship situation of 3,700 persons confined to a limited space was around 4 times higher than in the epicenter in Wuhan, where it was estimated to have a mean of 3.7.[7] The interventions that included the removal of all persons with confirmed COVID-19 disease combined with the quarantine of all passengers substantially reduced the anticipated number of new COVID-19 cases compared to a scenario without any interventions (17% attack rate with intervention versus 79% without intervention) … However, the main conclusion from our modelling is that evacuating all passengers and crew early on in the outbreak would have prevented many more passengers and crew members from getting infected. [link]

Only 76, rather than 621, would have been infected, they estimate. [8]

Conclusions: The cruise ship conditions clearly amplified an already highly transmissible disease. The public health measures prevented more than 2000 additional cases compared to no interventions. However, evacuating all passengers and crew early on in the outbreak would have prevented many more passengers and crew from infection.
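The role of R_{0} in these attack rates can be glimpsed with the classic SIR “final size” relation, z = 1 − exp(−R_{0}·z). This is not the study’s model (theirs is time-varying and includes case removal and quarantine); it is only a rough sanity check, in Python, of why a shipboard R_{0} of roughly 4 × 3.7 ≈ 14.8 implies near-universal infection absent any intervention.

```python
import math

def attack_rate(r0, iters=200):
    """Final attack rate z solving z = 1 - exp(-r0 * z),
    the classic SIR final-size relation (fixed-point iteration)."""
    z = 0.5
    for _ in range(iters):
        z = 1 - math.exp(-r0 * z)
    return z

# Illustrative values only; the study's 17%/79% figures come from a fuller
# time-limited model with interventions, not from this equation.
for r0 in (1.5, 2.0, 3.7, 14.8):
    print(f"R0 = {r0:5.1f}  ->  eventual attack rate ~ {attack_rate(r0):.0%}")
```

With R_{0} near 14.8, essentially everyone on board would eventually be infected if transmission ran its course, which is consistent with the authors’ conclusion that early evacuation, not shipboard quarantine, was the decisive option.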

These studies and models are of interest, although I’m in no position to evaluate them. Please share your thoughts and information, and point out any errors you find. I will indicate updates in the title of this post.

**Optimism**

I leave off with the remark of one of the U.S. passengers interviewed while still on the Diamond Princess:

“Being knee deep in the middle of a crisis leaves a person with two options — optimism or pessimism. The former gives a person strength, and the latter gives rise to fear.” (link)

He, like the others who were evacuated, faced an additional 2 weeks of quarantine.[9] He has since returned home and remains infection free.

*****

[1] As a noteworthy aside, Fauci was able to assure the interviewer that the “danger of getting coronavirus now is just minusculely low” (in the U.S. on Feb. 17). What a difference 2 weeks can make.

[2] In a 2015 paper, Chen and colleagues found that a cruise ship’s ventilation spread particles from cabin to cabin. They found that 1 infected person typically led to more than 40 cases a week later on a 2,000-passenger cruise. By contrast, the coronavirus, with a reproductive rate of 2 cases per infected person, would lead to only about 3 cases in that time. Planes rely on high-strength air filters and are designed to circulate air within cabin sections.
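The arithmetic behind “only about 3 cases” appears to be generation counting. Assuming a serial interval of roughly a week (my assumption; the note does not state one), a back-of-envelope sketch:

```python
def cumulative_cases(r, days, serial_interval):
    """Back-of-envelope total cases (index case included) after `days`,
    assuming one new generation of infections every `serial_interval` days."""
    generations = int(days // serial_interval)
    return sum(r ** g for g in range(generations + 1))

# With R ~ 2 and a serial interval of about a week, one index case yields
# 1 + 2 = 3 cumulative cases after a week -- the footnote's figure.
print(cumulative_cases(2, 7, 7))   # 3
# Reaching 40+ cases within a week, as in the 2015 cruise-ship scenario,
# implies a much higher R and/or a much shorter serial interval.
```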

[3] In a March 23 CDC report: Among 3,711 Diamond Princess passengers and crew, 712 (19.2%) had positive test results for SARS-CoV-2. Of these, 331 (46.5%) were asymptomatic at the time of testing. Among 381 symptomatic patients, 37 (9.7%) required intensive care, and nine (1.3%) died (*8*).

They found coronavirus in Diamond Princess cabins 17 days after passengers disembarked (prior to cleaning).

[4] A table from the Japanese National Institute of Infectious Diseases (NIID) (Source LINK):

[5]

“There were some limitations to our analysis. Cruise ship passengers may have a different health status to the general population of their home countries, due to health requirements to embark on a multi-week holiday, or differences related to socio-economic status or comorbidities. Deaths only occurred in individuals 70 years or older, so we were not able to generate age-specific cCFRs; the fatality risk may also be influenced by differences in healthcare between countries”.

[6] In a March 26 article by Fauci and others, Covid-19 — Navigating the Uncharted, we read:

“If one assumes that the number of asymptomatic or minimally symptomatic cases is several times as high as the number of reported cases, the case fatality rate may be considerably less than 1%.”

[7] R_{0} may be viewed as the expected number of cases generated directly by 1 case in a susceptible population.

[8] The number in the most recent report is 712, but that would be after the quarantine ended on Feb 19.

[9] I read today that one of the U. S. evacuated passengers just entered a clinical trial on remdesivir. This would be over a month since the end of the first quarantine.

———–

**REFERENCES:**

- Fauci interview: ‘Danger of getting coronavirus now is just minusculely low‘

- Giwa, A., Desai, A., & Duca, A. (translation by Sabrina Paula Rodera Zorita) (2020). “Novel 2019 Coronavirus SARS-CoV-2 (COVID-19): An Updated Overview for Emergency Clinicians,” March 23, 2020. *EBMedicine.net*; PubMed ID: 32207910. (LINK)

- Japanese National Institute of Infectious Diseases (NIID). “Field Briefing: Diamond Princess COVID-19 Cases, 20 Feb Update” (LINK)

- Russell, T., Hellewell, J., Jarvis, C., van Zandvoort, K., Abbott, S., Ratnayake, R., Flasche, S., Eggo, R., & Kucharski, A. (2020). “Estimating the infection and case fatality ratio for COVID-19 using age-adjusted data from the outbreak on the Diamond Princess cruise ship.” *MedRxiv: The preprint server for the Health Sciences*. (March 9, 2020). (PDF)

- Zheng, L., Chen, Q., Xu, J., & Wu, F. (2016). “Evaluation of intervention measures for respiratory disease transmission on cruise ships.” *Indoor and Built Environment*, 25(8), 1267–1278. (First published online August 28, 2015.) (PDF)

**Stephen Senn**

*Consultant Statistician*

*Edinburgh *

**Correcting errors about corrected estimates**

Randomised clinical trials are a powerful tool for investigating the effects of treatments. Given appropriate design, conduct and analysis they can deliver good estimates of effects. The key feature is concurrent control. Without concurrent control, randomisation is impossible. Randomisation is necessary, although not sufficient, for effective blinding. It also is an appropriate way to deal with unmeasured predictors, that is to say suspected but unobserved factors that might also affect outcome. It does this by ensuring that, in the absence of any treatment effect, the *expected* value of variation between and within groups is the same. Furthermore, probabilities regarding the relative variation can be delivered and this is what is necessary for valid inference.

There are two extreme positions regarding randomisation that are unreasonable. The first is that because randomisation only ensures performance in terms of probabilities, it is irrelevant to any actual study conducted. The second is that because randomisation delivers valid estimates on average, using observed covariate information, say in a so-called *analysis of covariance* (ANCOVA), is unnecessary and possibly even harmful. The first is easily answered, even if many find the answer difficult to understand: probabilities are what we have to use when we don’t have certainties. For further discussion of this see my previous blogs on this site Randomisation Ratios and Rationality and Indefinite Irrelevance and also a further blog Stop Obsessing about Balance. The second criticism is also, in principle, easy to answer: covariates provide the means of recognising that averages are not relevant to the case in hand (Senn, S. J., 2019). Nevertheless, many trialists stubbornly refuse to use covariate information. This is wrong and in this blog I shall explain why.

A variant of the refusal to use covariates occurs when the covariate is a baseline. The argument is then sometimes used that an obviously appropriate ‘response’ variable is the so-called *change-score* (or *gain- score*), that is to say, the difference between the variable of interest at outcome and its value at baseline. For example, in a trial of asthma, we might be interested in the variable forced expiratory volume in one second (FEV_{1}), a measure of lung function. We measure this at outcome and also at baseline and then use the difference (outcome – baseline) as the ‘response’ measure in subsequent analysis.

The argument then continues that since one has adjusted for the baseline by subtracting it from the outcome, no further adjustment is necessary; furthermore, analysis of these change-scores, being simpler than ANCOVA, is held to be more robust and reliable.

I shall now explain why this is wrong.

The important point to grasp from the beginning is that what are affected by the treatment are the outcomes. The baselines are not affected by treatment and so they do not carry the causal message. They may be predictive of the outcome, and that being so they may usefully be incorporated in an estimate of the effect of treatment, but treatment only has the capacity to affect the outcomes.

I stress this because much unnecessary complication is introduced by regarding the effect that treatments have on change over time as being fundamental. Such effects on change are a *consequence* of the effect on outcome: outcome is primary, change is secondary.

It will be easier to discuss all this with the help of some symbols. Table 1 shows symbols that can be used for referring to *statistics* for a parallel group trial in asthma with two arms, *control* and *treatment*. To simplify things, I shall take the case where the number of patients in each arm is identical, although this is not necessary to anything important that follows.

A further table, Table 2 gives symbols for various *parameters*. Some simplifying assumptions will be made that various parameter values do not change from control group to treatment group. For the assumptions in lines 2 and 4, this must be true if randomisation is carried out, since we are talking about expectations over all randomisations and the parameters refer to quantities measured *before* treatment starts. For lines 3 and 5, the assumptions are true under the null hypothesis that there is no difference between treatment and control.

Since the treatments can only affect the outcomes, the logical place to start is at the end. Thus, to use a term much in vogue, our *estimand* (that which we wish to estimate) is δ = μ_{Yt} − μ_{Yc}, the difference in ‘true’ means at outcome. Note that we could define the estimand in terms of the double difference

(μ_{Yt} − μ_{Xt}) − (μ_{Yc} − μ_{Xc}),

that is to say the differences between groups of the differences from baseline, but this is pointless because in a randomised trial we have μ_{Xc} = μ_{Xt} = μ_{X}, and so this reduces to what we had previously. In fact, if we think of the logic of the randomised clinical trial, by which the patients given control are there purely to estimate what would have happened to the patients given the treatment had they been given the control, this is quite unnecessary.

As a first stab at estimating the estimand, the simplest thing to use is the corresponding difference in statistics, that is to say

δ̂ = Ȳ_{t} − Ȳ_{c}.

This estimator is unbiased for δ: on average it will be equal to the parameter it is supposed to estimate. Its variance, given our assumptions, will be equal to

2σ_{Y}^{2}/*n*,

where *n* is the number of patients per arm. However, it is not independent of the observed difference at baseline. In fact, given our simplification, it has a covariance with this difference of

2ρσ_{X}σ_{Y}/*n*,

where ρ is the correlation between baseline and outcome.

This dependence implies two things. First, it means that although δ̂ is unbiased, it is not *conditionally* unbiased. Given an observed difference at baseline,

X̄_{t} − X̄_{c},

we can do better than just assuming that the difference we would see at outcome, in the absence of any treatment effect, would be zero. Zero is the value we would see over all randomisations but it is not the value we would see for all randomised clinical trials with the observed baseline. This can be easily illustrated using a simulation described below.

I simulated 1000 clinical trials of a bronchodilator in asthma using forced expiratory volume in one second (FEV_{1}) as an outcome with parameters set as in Table 3:

The seed is of no relevance to anybody except me but is included here should I ever need to check back in the future. The other parameters are supposed to be what might be plausible in such a clinical trial. The values are drawn from a bivariate Normal, assumed to be a suitable approximate theoretical model for a randomised clinical trial.

One thousand confidence intervals are displayed in Figure 1. They have been plotted against the baseline difference. Those that are plotted in black cover the ‘true’ value of 200 mL and those that are plotted in red do not. The point estimate is shown by a black diamond in the former case and by a red circle in the latter. There are 949, or 94.9%, that cover the true value and 51, or 5.1%, that do not. The differences from the theoretical expectations of 95% and 5% are immaterial.

However, since we have baseline information available, we can recognise something about the intervals: where the baseline difference is negative, they are more likely to underestimate the true value and where the difference is positive, they are likely to overestimate it. In fact, generally, the bigger the baseline difference, the worse the coverage. In the view of many statisticians, including me, this means that a confidence interval calculated in this way is satisfactory if the baselines have *not* been observed but not if they have.

Can we fix this using the change-score? The answer is no. Figure 2 shows the corresponding plot for the change-score. Now we have the reverse problem. Where the baseline difference is negative, the treatment effect is overestimated and where the difference is positive, underestimated. There are exceptions but a general pattern is visible: the estimates are negatively correlated with the baseline difference.

The reason that this happens is that the values at baseline are not perfectly predictive of the values at outcome. The way to deal with this is to calculate exactly *how* predictive they are, using the data *within* the treatment groups. Because I am in the privileged position of having set the values of the simulation, I know that the relevant *slope* parameter, β, is 0.6, considerably less than the implicit change-score value of 1. This is because the correlation, ρ, was set to be 0.6 and the variances at baseline and outcome were set to be equal. However, I did not ‘cheat’ by using this knowledge to do the adjusted calculation, since in practice I would never know the true value. In fact, for each simulation, the value was estimated from the covariances and variances within the treatment groups.

Figure 3 gives a dot histogram for the estimates of β. The mean over the 1000 simulations was 0.601 and the lowest value of the 1000 was 0.279, with the maximum being 0.869. Different values will have been used for different simulated clinical trials to adjust the difference at outcome by the difference at baseline. The adjusted estimates and confidence limits are given in Figure 4 and now it can be seen that these are independent of the baseline differences, which cannot now be used to select which confidence intervals are less likely to include the true value.

The three estimators we have considered can be regarded as special cases of

(Ȳ_{t} − Ȳ_{c}) − *b*(X̄_{t} − X̄_{c}).

If we set *b* = 0 we have the simple unadjusted estimate of Figure 1. If we set *b* = 1 we have the change-score estimate of Figure 2. Finally, if we set *b* = β̂, the within-group regression coefficient of *Y* on *X* shown in Figure 3, we get the ANCOVA estimate of Figure 4.

Given our assumptions, not unreasonable given the design, we have that the regression of the mean difference at outcome on the mean difference at baseline should be the same as the within-groups regression of the individual outcome values. Note that this assumption is not so reasonable for different designs, for example, cluster-randomised designs, a point that has been misunderstood in some treatments of Lord’s paradox. See Rothamsted Statistics meets Lord’s Paradox for an explanation and also Red Herrings and the Art of Cause Fishing. Thus, we can now study exactly what is going on in terms of the adjusted within-group differences *Y* − *bX*. If these adjusted differences are averaged for each of the two groups and we then form the difference of these averages, treatment minus control, we have the estimate (Ȳ_{t} − Ȳ_{c}) − *b*(X̄_{t} − X̄_{c}).

Now, the covariance of (*Y* − *bX*) and *X* is σ_{XY} − *b*σ_{X}^{2}, from which it follows immediately that this is 0 if and only if

*b* = σ_{XY}/σ_{X}^{2} = β.

Thus, the value of *b* that removes the dependence is the within-group regression slope β, so this is nothing other than the ANCOVA solution. (In practice we have to estimate β, a point to which I shall return later.) Hence the lack of dependence between estimate and baseline difference exhibited by Figure 4. Note also that dimensional analysis shows that the units of β are units of *Y* over units of *X*. In our particular example of using a baseline, these two units are the same and so cancel out, but the argument is also valid for any predictor, including one whose units are quite different. Once the slope has been multiplied by an *X* it yields a prediction in units of *Y*, which is, of course, exactly what is needed for ‘correcting’ an observed *Y*. This, in my opinion, is yet another argument against the change-score ‘solution’. It only works in a special case, for which there is no need to abandon the solution that works generally.

In fact, the change-score works badly. To see this, consider a re-writing of the adjusted outcome as

*Y* − *bX* = [*Y* − β*X*] + [(β − *b*)*X*].

On the RHS, the first term in square brackets yields the ANCOVA estimate. The second term in square brackets is the amount by which any general estimate differs from the ANCOVA estimate. If we calculate the variance of the RHS, we find it consists of three terms. The first of these is the variance of the ANCOVA estimator. The second is (β − *b*)^{2} times σ_{X}^{2}, the variance of *X*. The third is 2(β − *b*) times the covariance of [*Y* − β*X*] and *X*. However, we have shown that this covariance is zero. Thus, the third term is zero. Furthermore, the second term, being a product of squared quantities, is greater than zero unless β = *b*, but when this is the case we have the ANCOVA solution. Thus, ANCOVA is the minimum-variance solution. The change-score solution will have a higher variance.

Some summary statistics (over the 1000 simulated trials) are given for the three approaches used to analyse the results. There are no differences in coverage worth noticing. This is because each method of analysis is designed to provide correct coverage on average. The simple analysis of outcomes is positively and highly significantly correlated with the baseline difference. The change-score is highly significantly negatively correlated. The correlation is less in absolute terms because the correlation coefficient used for the simulation (0.6) is greater than 0.5. Had it been less than 0.5, the absolute correlation would have been greater for the change-score. Similarly, although the mean width of the change-score interval is less than that for the simple analysis, this does not have to be so and is a result of the correlation being greater than 0.5. The ANCOVA estimates are more precise than either, and this has to be so in expectation, barring a minor issue discussed in the next section.

It is not quite the case that ANCOVA is the best one can do, since I have assumed in the algebraic development (although not in the simulation) that the slope parameter is known. In practice it has to be estimated, and this leads to a small loss in precision compared to the (unrealistic) situation where the parameter is known. Basically, there are two losses: first, the degrees of freedom for estimating the error variance are reduced by one; second, there is, in practice, a small penalty for the loss of orthogonality. Further discussion of this is beyond the scope of this note but is covered in (Lesaffre, E. & Senn, S., 2003) and (Senn, S. J., 2011). In practice, the effect is small once one has even a modest number of patients.

For randomised clinical trials, there is no excuse for using a change-score approach rather than analysis of covariance. To do so not only betrays a conceptual confusion about causality but is also inefficient. Given the stakes, this is unacceptable (Senn, S. J., 2005).

Lesaffre, E., & Senn, S. (2003). A note on non-parametric ANCOVA for covariate adjustment in randomized clinical trials. *Statistics in Medicine, ***22**(23), 3583-3596.

Senn, S. J. (2005). An unreasonable prejudice against modelling? *Pharmaceutical statistics, ***4**, 87-89.

Senn, S. J. (2011). Modelling in drug development. In M. Christie, A. Cliffe, A. P. Dawid, & S. J. Senn (Eds.), *Simplicity Complexity and Modelling* (pp. 35-49). Chichester: Wiley.

Senn, S. J. (2019). The well-adjusted statistician. *Applied Clinical Trials*, 2.


I will run a graduate Research Seminar at the LSE on Thursdays from May 21-June 18:

(See my new blog, phil-stat-wars.com, for specifics.)

I am co-running a workshop from 19-20 June, 2020 at the LSE (Center for the Philosophy of Natural and Social Sciences, CPNSS), with Roman Frigg. Participants include:

If you have a particular Phil Stat event you’d like me to advertise, please send it to me.

*Notre Dame Philosophical Reviews* is a leading forum for publishing reviews of books in philosophy. The philosopher of statistics, Prasanta Bandyopadhyay, published a review of my book *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP)(SIST) in this journal, and I very much appreciate his doing so. Here I excerpt from his review, and respond to a cluster of related criticisms in order to avoid some fundamental misunderstandings of my project. Here’s how he begins:

In this book, Deborah G. Mayo (who has the rare distinction of making an impact on some of the most influential statisticians of our time) delves into issues in philosophy of statistics, philosophy of science, and scientific methodology more thoroughly than in her previous writings. Her reconstruction of the history of statistics, seamless weaving of the issues in the foundations of statistics with the development of twentieth-century philosophy of science, and clear presentation that makes the content accessible to a non-specialist audience constitute a remarkable achievement. Mayo has a unique philosophical perspective which she uses in her study of philosophy of science and current statistical practice.

[1] I regard this as one of the most important philosophy of science books written in the last 25 years. However, as Mayo herself says, nobody should be immune to critical assessment. This review is written in that spirit; in it I will analyze some of the shortcomings of the book.

* * * * * * * * *

I will begin with three issues on which Mayo focuses:

- Conflict about the foundation of statistical inference: Probabilism or Long-run Performance?
- Crisis in science: Which method is adequately general/flexible to be applicable to most problems?
- Replication crisis: Is scientific research reproducible?

Mayo holds that these issues are connected. Failure to recognize that connection leads to problems in statistical inference.

Probabilism, as Mayo describes it, is about accepting reasoned belief when certainty is not available. Error-statistics is concerned with understanding and controlling the probability of errors. This is a long-run performance criterion. Mayo is concerned with “probativeness” for the analysis of “particular statistical inference” (p. 14). She draws her inspiration concerning probativeness from severe testing and calls those who follow this kind of philosophy the “severe testers” (p. 9). This concept is the central idea of the book.… What should be done, according to the severe tester, is to take refuge in a meta-standard and evaluate each theory from that meta-theoretical standpoint. Philosophy will provide that higher ground to evaluate two contending statistical theories. In contrast to the statistical foundations offered by both probabilism and long-run performance accounts, severe testers advocate probativism, which does not recommend any statement to be warranted unless a fair amount of investigation has been carried out to probe ways in which the statement could be wrong.

Severe testers think their method is adequately general to capture this intuitively appealing requirement on any plausible account of evidence. That is, if a test were not able to find flaws with H even if H were incorrect, then a mere agreement of H with data X_{0} would provide poor evidence for H. This, according to the severe tester’s account, should be a minimal requirement on any account of evidence. This is how they address (ii). Next consider (iii). According to the severe tester’s diagnosis, the replication crisis arises when there is selective reporting: the statistics are cherry-picked for x, i.e., looked at for significance where it is absent, multiple testing, and the like. Severe testers think their account alone can handle the replication crisis satisfactorily. That leaves the burden on them to show that other accounts, such as probabilism and long-run performance, are incapable of handling the crisis, or are inadequate compared to the severe tester’s account. One way probabilists (such as subjective Bayesians) seem to block problematic inferences resulting from the replication crisis is by assigning a high subjective prior probability to the null-hypothesis, resulting in a high posterior probability for it. Severe testers grant that this procedure can block problematic inferences leading to the replication crisis. However, they insist that this procedure won’t be able to show *what* researchers have initially *done wrong* in producing the crisis in the first place. The nub of their criticism is that Bayesians don’t provide a convincing resolution of the replication crisis since they don’t explain where the researchers make their mistake.

I don’t think we can look to this procedure (“assigning a high subjective prior probability to the null-hypothesis, resulting in a high posterior probability for it”) to block problematic inferences. In some cases, your disbelief in H might be right on the money, but this is precisely what is *unknown* when undertaking research. An account must be able to directly register how biasing selection effects alter error probing capacities if it is to call out the resulting bad inferences–or so I argue. Data-dredged hypotheses are often very believable, that’s what makes them so seductive. Moreover, it’s crucial for an account to be able to say that H is plausible but terribly tested by this particular study or test. I don’t say that inquirers are always in the context of severe testing, by the way. We’re not always truly trying to find things out; often, we’re just trying to make our case. That said, I never claim the severe testing account is the only way to avoid irreplication in statistics, nor do I suggest that the problem of replication is the sole problem for an account of statistical inference. Explaining and avoiding irreplication is a *minimal* problem an account should be capable of solving. This relates to Bandyopadhyay’s central objection below.

In some places, he attributes to me a position that is nearly the opposite of what I argue. After explaining, I try to consider why he might be led to his topsy-turvy allegation.

The problem with the long-run performance-based frequency approach, according to Mayo, is that it is easy to support a false hypothesis with these methods by selective reporting. The severe tester thinks both Fisher’s and Neyman and Pearson’s methods leave the door open for cherry-picking, significance seeking, and multiple-testing, thus generating the possibility of a replication crisis. Fisher’s and Neyman-Pearson’s methods make room for enabling the support of a preferred claim even though it is not warranted by evidence. This causes severe testers like Mayo to abandon the idea of adopting long-run performance as a sufficient condition for statistical inferences; it is merely a necessary condition for them.

No, it is the opposite. The error statistical assessments are highly valuable because they pick up on the effects of data dredging, multiple testing, optional stopping and a host of biasing selection effects. Biasing selection effects are blocked in error statistical accounts because they preclude control of error probabilities! It is precisely because they render the error probability assessments invalid that error statistical accounts are able to require–with justification– predesignation and preregistration. That is the key message of SIST from the very start.

- SIST, p. 20: A key point too rarely appreciated: Statistical facts about P -values themselves demonstrate how data finagling can yield spurious significance. This is true for all error probabilities. That’s what a self-correcting inference account should do. … Scouring different subgroups and otherwise “trying and trying again” are classic ways to blow up the actual probability of obtaining an impressive, but spurious, finding – and that remains so even if you ditch P-values and never compute them.
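
The “blow up” here is simple arithmetic, and a few lines of code make it concrete (a hypothetical illustration of the point, not an example from SIST; the function name is mine): with k independent looks at the data, each at level α = 0.05, the probability of at least one spurious “significant” result under a true null is 1 − (1 − α)^k.

```python
# Hypothetical illustration: how "trying and trying again" inflates the
# actual probability of a spurious significant finding.
# With k independent tests, each at level alpha, and all nulls true,
# P(at least one p-value < alpha) = 1 - (1 - alpha)**k.

def spurious_significance_prob(k: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive in k independent tests."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20, 60):
    print(k, round(spurious_significance_prob(k), 3))
# A single test keeps the error rate at 0.05, but 20 looks push the
# chance of some spurious "effect" to about 0.64, and 60 to about 0.95.
```

And, as the passage notes, ditching P-values doesn’t change this arithmetic: the same inflation afflicts any procedure that scours subgroups until something “interesting” turns up.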

Consider the dramatic opposition between Savage, and Fisher and N-P regarding the Likelihood Principle and optional stopping:

- SIST, p. 46: The lesson about who is allowed to cheat depends on your statistical philosophy. Error statisticians require that the overall and not the “computed” significance level be reported. To them, cheating would be to report the significance level you got after trying and trying again in just *the same way* as if the test had a fixed sample size.

Bandyopadhyay seems to think that if I have criticisms of the long-run performance (or behavioristic) construal of error probabilities, it must be because I claim it leads to replication failure. That’s the only way I can explain his criticism above.

He is startled that I’m rejecting the long-run performance view I previously held.

This leads me to discuss the severe tester’s rejection of both probabilism and frequency-based long-run performance, especially the latter. It is understandable why Mayo finds fault with probabilists, since they are no friends of Bayesians who take probability theory to be the *only* logic of uncertainty. So, the position is consistent with the severe tester’s account proposed in Mayo’s last two influential books (1996 and 2010). What is surprising is that her account rejects the long-run performance view and only takes the frequency-based probability as necessary for statistical inference.

But I’ve always rejected the long run performance or “behavioristic” construal of error statistical methods–when it comes to using them for scientific inference. I’ve always rejected the supposition that the justification and rationale for error statistical methods is their ability to control the probabilities of erroneous inferences in a long run series of applications. Others have rejected it as well, notably, Birnbaum, Cox, Giere. Their sense is that these tools are satisfying inferential goals but in a way that no one has been able to quite explain. What hasn’t been done, and what I only hinted at in earlier work, is to supply an alternative, inferential rationale for error statistics. The trick is to show when and why long run error control supplies a measure of a method’s *capability* to identify mistakes. This capability assessment, in turn, supplies a measure of how well or poorly tested claims are. So, the inferential assessment, post data, is in terms of how well or poorly tested claims are.

My earlier work, *Error and the Growth of Experimental Knowledge* (EGEK) was directed at the uses of statistics for solving philosophical problems of evidence and inference.[1] SIST, by contrast, is focussed almost entirely on the philosophical problems of statistical practice. Moreover, I stick my neck out, and try to tackle essentially all of the examples around which there have been philosophical controversy from the severe tester’s paradigm. While I freely admit this represents a gutsy, if not radical, gambit, I actually find it perplexing that it hasn’t been done before. It seems to me that we convert information about (long-run) performance into information about well-testedness in ordinary, day to day reasoning. Take the informal example early on in the book.

- SIST, p. 14: Before leaving the USA for the UK, I record my weight on two scales at home, one digital, one not, and the big medical scale at my doctor’s office. …Returning from the UK, to my astonishment, not one but all three scales show anywhere from a 4–5 pound gain. …But the fact that all of them have me at an over 4-pound gain, while none show any difference in the weights of EGEK, pretty well seals it. …No one would say: ‘I can be assured that by following such a procedure, in the long run I would rarely report weight gains erroneously, but I can tell nothing from these readings about my weight now.’ To justify my conclusion by long-run performance would be absurd. Instead we say that the procedure had enormous capacity to reveal if any of the scales were wrong, and from this I argue about the source of the readings: *H*: I’ve gained weight…. This is the key – granted with a homely example – that can fill a very important gap in frequentist foundations: Just because an account is touted as having a long-run rationale, it does not mean it lacks a short run rationale, or even one relevant for the particular case at hand.

Let me now clarify the reason that satisfying a long-run performance requirement is only necessary, and not sufficient, for severity. Long-run behavior could be satisfied while the error probabilities do not reflect well-testedness in the case at hand. Go to the howlers and chestnuts of Excursion 3 Tour II:

- Exhibit (vi): Two Measuring Instruments of Different Precisions. Did you hear about the frequentist who, knowing she used a scale that’s right only half the time, claimed her method of weighing is right 75% of the time? She says, “I flipped a coin to decide whether to use a scale that’s right 100% of the time, or one that’s right only half the time, so, overall, I’m right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.)
*Basis for the joke:* An N-P test bases error probabilities on all possible outcomes or measurements that could have occurred in repetitions, but did not. As with many infamous pathological examples, often presented as knockdown criticisms of all of frequentist statistics, this was invented by a frequentist, Cox (1958). It was a way to highlight what could go wrong in the case at hand, if one embraced an unthinking behavioral-performance view. Yes, error probabilities are taken over hypothetical repetitions of a process, but not just any repetitions will do.
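
The joke’s arithmetic is worth spelling out (a hypothetical sketch; the variable names are mine): the 75% figure is an unconditional average over the coin flips, whereas once we know which scale was actually used, the relevant error probability is that scale’s own.

```python
# Hypothetical sketch of Cox's (1958) two-instruments example.
# A fair coin picks between a scale that's always right (accuracy 1.0)
# and one that's right only half the time (accuracy 0.5).

p_good, p_bad = 1.0, 0.5

# Unconditional (behavioristic) accuracy, averaged over coin flips:
unconditional = 0.5 * p_good + 0.5 * p_bad   # = 0.75

# Relevant (conditional) accuracy, given the lousy scale was used:
conditional_bad = p_bad                      # = 0.5

print(unconditional, conditional_bad)
```

The 0.75 is an honest long-run property of the mixture procedure, yet it badly misstates the probative capacity of the measurement actually made.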

In short: I’m taking the tools that are typically justified only because they control the probability of erroneous inferences in the long-run, and providing them with an inferential justification relevant for the case at hand. It’s only when long-run relative frequencies represent the method’s capability to discern mistaken interpretations of data that the performance and severe testing goals line up. Where the two sets of goals do not line up, severe testing takes precedence–at least when we’re trying to find things out. The book is an experiment in trying to do all of philosophy of statistics within the severe testing paradigm.

There’s more to reply to in his review, but I want to just focus on this clarification which should rectify his main criticism. For a discussion of the general points of severely testing theories, I direct the reader to extensive excerpts from SIST. His full review is here.

__________________________________

Bandyopadhyay attended my NEH Summer Seminar in 1999 on Inductive-Experimental Inference. I’m glad that he has pursued philosophy of statistics through the years. I do wish he had sent me his review earlier so that I could clarify the small set of confusions that led him to some unintended places. *Notre Dame Philosophical Reviews* might have given the author an opportunity to reply lest readers come away with a distorted view of the book. I will shortly be resuming a discussion of SIST on this blog, picking up with Excursion 2.

Update March 4: Note that I wound up commenting further on the Review in the following comments:

[1] If you find an example that has been the subject of philosophical debate that is omitted from SIST, let me know. You will notice that all these examples are elementary, which is why I was able to cover them with minimal technical complexity. Some more exotic examples are in “chestnuts and howlers”.


This is a belated birthday post for R.A. Fisher (17 February, 1890-29 July, 1962)–it’s a guest post from earlier on this blog by Aris Spanos.

**Happy belated birthday to R.A. Fisher!**

**‘R. A. Fisher: How an Outsider Revolutionized Statistics’**

by **Aris Spanos**

Few statisticians will dispute that R. A. Fisher **(February 17, 1890 – July 29, 1962)** is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of *optimal estimation* based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of *optimal testing* in the early 1930s. According to Hald (1998)

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the *ultimate outsider* when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however, did not stop him from publishing his first paper in statistics in 1912 (while still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper.

After graduating from Cambridge he drifted into a series of jobs, including subsistence farming and teaching high school mathematics and physics, until his temporary appointment as a statistician at Rothamsted Experimental Station in 1919. During the period 1912-1919 his interest in statistics was driven by his passion for eugenics and a realization that his mathematical knowledge of n-dimensional geometry could be put to good use in deriving finite sample distributions for estimators and tests in the spirit of Gosset’s (1908) paper. Encouraged by his early correspondence with Gosset, he derived the finite sampling distribution of the sample correlation coefficient, which he published in 1915 in *Biometrika*, the only statistics journal at the time, edited by Karl Pearson. To put this result in a proper context: Pearson had been working on this problem for two decades and had published more than a dozen papers with several assistants on approximating the first two moments of the sample correlation coefficient; Fisher derived the relevant distribution, not just the first two moments.

Due to its importance, the 1915 paper provided Fisher’s first skirmish with the ‘statistical establishment’. Karl Pearson would not lightly accept being overrun by a ‘newcomer’. So, he prepared a critical paper with four of his assistants that became known as “the cooperative study”, questioning Fisher’s result as stemming from a misuse of Bayes theorem. He proceeded to publish it in *Biometrika* in 1917 without bothering to let Fisher know before publication. Fisher was furious at K. Pearson’s move and prepared his answer in a highly polemical style, which Pearson promptly refused to publish in his journal. Eventually Fisher was able to publish his answer, after tempering the style, in *Metron*, a brand new statistics journal. As a result of this skirmish, Fisher pledged never to send another paper to *Biometrika*, and declared war on K. Pearson’s perspective on statistics. Fisher questioned not only Pearson’s method of moments, as giving rise to inefficient estimators, but also his derivation of the degrees of freedom of his chi-square test. Several highly critical published papers ensued.[i]

Between 1922 and 1930 Fisher did most of his influential work in recasting statistics, including publishing a highly successful textbook in 1925, but the ‘statistical establishment’ kept him ‘in his place’: a statistician at an experimental station. All his attempts to find an academic position, including a position in Social Biology at the London School of Economics (LSE), were unsuccessful (see Box, 1978, p. 202). Being turned down for the LSE position was not unrelated to the fact that the professor of statistics at the LSE was Arthur Bowley (1869-1957), second only to Pearson in the statistical high priesthood.[ii]

Coming of age as a statistician in 1920s England meant being awarded the Guy Medal in gold, silver or bronze, or at least receiving an invitation to present your work to the Royal Statistical Society (RSS). Despite his fundamental contributions to the field, Fisher’s invitation to the RSS would not come until 1934. To put that in perspective, Jerzy Neyman, his junior by some distance, was invited six months earlier! Indeed, one can make a strong case that the statistical establishment kept Fisher away for as long as they could get away with it. However, by 1933 they must have felt that they had to invite Fisher after he accepted a professorship at University College, London. The position was created after Karl Pearson retired and the College decided to split his chair into a statistics position that went to Egon Pearson (Pearson’s son) and a Galton professorship in Eugenics that was offered to Fisher. To make matters worse, Fisher’s offer came with a humiliating clause forbidding him to teach statistics at University College (see Box, 1978, p. 258); the father of modern statistics was explicitly told to keep his views on statistics to himself!

Fisher’s presentation to the Royal Statistical Society, on December 18th, 1934, entitled “The Logic of Inductive Inference”, was an attempt to summarize and explain his published work on recasting the problem of statistical induction since his classic 1922 paper. Bowley was (self?) appointed to move the traditional vote of thanks and open the discussion. After some begrudging thanks for Fisher’s ‘contributions to statistics in general’, he went on to disparage his new approach to statistical inference based on the likelihood function by describing it as abstruse, arbitrary and misleading. His comments were predominantly sarcastic and discourteous, and went as far as to accuse Fisher of plagiarism, by not acknowledging Edgeworth’s priority on the likelihood function idea (see Fisher, 1935, pp. 55-7). The litany of churlish comments continued with the rest of the old guard: Isserlis, Irwin and the philosopher Wolf (1935, pp. 57-64), who was brought in by Bowley to undermine Fisher’s philosophical discussion on induction. Jeffreys complained about Fisher’s criticisms of the Bayesian approach (1935, pp. 70-2).

To Fisher’s support came … Egon Pearson, Neyman and Bartlett. E. Pearson argued that:

“When these ideas [on statistical induction] were fully understood … it would be realized that statistical science owed a very great deal to the stimulus Professor Fisher had provided in many directions.” (Fisher, 1935, pp. 64-5)

Neyman too came to Fisher’s support, praising Fisher’s path-breaking contributions, and explaining Bowley’s reaction to Fisher’s critical review of the traditional view of statistics as an understandable attachment to old ideas (1935, p. 73).

Fisher, in his reply to Bowley and the old guard, was equally contemptuous:

“The acerbity, to use no stronger term, with which the customary vote of thanks has been moved and seconded … does not, I confess, surprise me. From the fact that thirteen years have elapsed between the publication, by the Royal Society, of my first rough outline of the developments, which are the subject of to-day’s discussion, and the occurrence of that discussion itself, it is a fair inference that some at least of the Society’s authorities on matters theoretical viewed these developments with disfavour, and admitted with reluctance. … However true it may be that Professor Bowley is left very much where he was, the quotations show at least that Dr. Neyman and myself have not been left in his company. … For the rest, I find that Professor Bowley is offended with me for “introducing misleading ideas”. He does not, however, find it necessary to demonstrate that any such idea is, in fact, misleading. It must be inferred that my real crime, in the eyes of his academic eminence, must be that of “introducing ideas”. (Fisher, 1935, pp. 76-82)[iii]

In summary, the pioneering work of Fisher, later supplemented by Egon Pearson and Neyman, was largely ignored by the Royal Statistical Society (RSS) establishment until the early 1930s. By 1933 it was difficult to ignore their contributions, published primarily in other journals, and the ‘establishment’ of the RSS decided to display its tolerance of their work by creating ‘the Industrial and Agricultural Research Section’, under the auspices of which both papers by Neyman and Fisher were presented in 1934 and 1935, respectively. [iv]

In 1943, Fisher was offered the Balfour Chair of Genetics at the University of Cambridge. Recognition from the RSS came in 1946 with the Guy medal in gold, and he became its president in 1952-1954, just after he was knighted! Sir Ronald Fisher retired from Cambridge in 1957. The father of modern statistics never held an academic position in statistics!

You can read more in Spanos 2008 (below)

**References**

Bowley, A. L. (1902, 1920, 1926, 1937) *Elements of Statistics*, 2nd, 4th, 5th and 6th editions, Staples Press, London.

Box, J. F. (1978) *The Life of a Scientist: R. A. Fisher*, Wiley, NY.

Fisher, R. A. (1912), “On an Absolute Criterion for Fitting Frequency Curves,” *Messenger of Mathematics*, 41, 155-160.

Fisher, R. A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population,” *Biometrika,* 10, 507-21.

Fisher, R. A. (1921) “On the ‘probable error’ of a coefficient deduced from a small sample,” *Metron* 1, 2-32.

Fisher, R. A. (1922) “On the mathematical foundations of theoretical statistics,” *Philosophical Transactions of the Royal Society*, A 222, 309-68.

Fisher, R. A. (1922a) “On the interpretation of χ^{2} from contingency tables, and the calculation of P,” *Journal of the Royal Statistical Society*, 85, 87-94.

Fisher, R. A. (1922b) “The goodness of fit of regression formulae and the distribution of regression coefficients,” *Journal of the Royal Statistical Society,* 85, 597–612.

Fisher, R. A. (1924) “The conditions under which χ^{2} measures the discrepancy between observation and hypothesis,” *Journal of the Royal Statistical Society*, 87, 442-450.

Fisher, R. A. (1925) *Statistical Methods for Research Workers*, Oliver & Boyd, Edinburgh.

Fisher, R. A. (1935) “The logic of inductive inference,” *Journal of the Royal Statistical Society* 98, 39-54, discussion 55-82.

Fisher, R. A. (1937), “Professor Karl Pearson and the Method of Moments,” *Annals of Eugenics*, 7, 303-318.

Gosset, W. S. (1908) “The probable error of the mean,” *Biometrika*, 6, 1-25.

Hald, A. (1998) *A History of Mathematical Statistics from 1750 to 1930*, Wiley, NY.

Hotelling, H. (1930) “British statistics and statisticians today,” *Journal of the American Statistical Association*, 25, 186-90.

Neyman, J. (1934) “On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection,” *Journal of the Royal Statistical Society,* 97, 558-625.

Rao, C. R. (1992) “R. A. Fisher: The Founder of Modern Statistics,” *Statistical Science*, 7, 34-48.

RSS (Royal Statistical Society) (1934) *Annals of the Royal Statistical Society* 1834-1934, The Royal Statistical Society, London.

Savage, L . J. (1976) “On re-reading R. A. Fisher,” *Annals of Statistics*, 4, 441-500.

Spanos, A. (2008), “Statistics and Economics,” pp. 1057-1097 in *The New Palgrave Dictionary of Economics*, Second Edition. Eds. Steven N. Durlauf and Lawrence E. Blume, Palgrave Macmillan.

Tippett, L. H. C. (1931) *The Methods of Statistics*, Williams & Norgate, London.

[i] Fisher (1937), published a year after Pearson’s death, is particularly acerbic. In Fisher’s mind, Karl Pearson went after a young Indian statistician – totally unfairly – just the way he went after him in 1917.

[ii] Bowley received the Guy Medal in silver from the Royal Statistical Society (RSS) as early as 1895, and became a member of the Council of the RSS in 1898. He was awarded the society’s highest honor, the Guy Medal in gold, in 1935.

[iii] It is important to note that Bowley revised his textbook in statistics for the last time in 1937, and predictably, he missed the whole change of paradigms brought about by Fisher, Neyman and Pearson.

[iv] In their centennial volume published in 1934, the RSS acknowledged the development of ‘mathematical statistics’, referring to Galton, Edgeworth, Karl Pearson, Yule and Bowley as the main pioneers, and listed the most important contributions in this sub-field which appeared in its Journal during the period 1909-33, but the three important papers by Fisher (1922a-b; 1924) are conspicuously absent from that list. The list itself is dominated by contributions in vital, commercial, financial and labour statistics (see RSS, 1934, pp. 208-23). There is a single reference to Egon Pearson.

This was first posted on 17, Feb. 2013 here.

**HAPPY BIRTHDAY R.A. FISHER!**

**I. Doubt is Their Product** is the title of a (2008) book by David Michaels, Assistant Secretary for OSHA from 2009-2017. I first mentioned it on this blog back in 2011 (“Will the Real Junk Science Please Stand Up?”). The expression is from a statement by a cigarette executive (“doubt is our product”), and the book’s thesis is explained in its subtitle:

**II. Fixing Science.** So, one day in January, I was invited to speak in a panel “Falsifiability and the Irreproducibility Crisis” at a conference “Fixing Science: Practical Solutions for the Irreproducibility Crisis.” The inviter, David Randall, whom I did not know, explained that a speaker had withdrawn from the session because of some kind of controversy surrounding the conference, but did not give details. He pointed me to an op-ed in the *Wall Street Journal.* I had already heard about the conference months before (from Nathan Schachtman) and before checking out the op-ed, my first thought was: I wonder if the controversy has to do with the fact that a keynote speaker is Ron Wasserstein, ASA Executive Director, a leading advocate of retiring “statistical significance” and barring P-value thresholds in interpreting data. Another speaker eschews all current statistical inference methods (e.g., P-values, confidence intervals) as just too uncertain (D. Trafimow). More specifically, I imagined it might have to do with the controversy over whether the March 2019 editorial in TAS (Wasserstein, Schirm, and Lazar 2019) was a continuation of the ASA 2016 Statement on P-values, and thus an official ASA policy document, or not. Karen Kafadar, recent President of the American Statistical Association (ASA), made it clear in December 2019 that it is not.[2] The “no significance/no thresholds” view is the position of the guest editors of the March 2019 issue. (See “P-Value Statements and Their Unintended(?) Consequences” and “Les stats, c’est moi”.) Kafadar created a new 2020 ASA Task Force on Statistical Significance and Replicability to:

prepare a thoughtful and concise piece …without leaving the impression that p-values and hypothesis tests—and, perhaps by extension as many have inferred, statistical methods generally—have no role in “good statistical practice”. (Kafadar 2019, p. 4)

Maybe those inviting me didn’t know I’m “anti” the Anti-Statistical Significance campaign (“On some self-defeating aspects of the 2019 recommendations”), that I agree with John Ioannidis (2019) that “retiring statistical significance would give bias a free pass”, and published an editorial “P-value Thresholds: Forfeit at Your Peril”. While I regard many of today’s statistical reforms as welcome (preregistration, testing for replication, transparency about data-dredging, P-hacking and multiple testing), I argue that those in Wasserstein et al. (2019) are “Doing more harm than good”. In “Don’t Say What You Don’t Mean”, I express doubts that Wasserstein et al. (2019) could really mean to endorse certain statements in their editorial that are so extreme as to conflict with the ASA 2016 guide on P-values. To be clear, I reject oversimple dichotomies, and cookbook uses of tests, long lampooned, and have developed a reformulation of tests that avoids the fallacies of significance and non-significance.[1] It’s just that many of the criticisms are confused, and, consequently, so are many reforms.

**III. Bad Statistics is Their Product.** It turns out that the brouhaha around the conference had nothing to do with all that. I thank Dorothy Bishop for pointing me to her blog, which gives a much fuller background. Aside from the lack of women (I learned a new word–a manference), her real objection is on the order of “Bad Statistics is Their Product”: the groups sponsoring the *Fixing Science* conference, the National Association of Scholars and the Independent Institute, Bishop argues, are using the replication crisis to cast doubt on well-established risks, notably those of climate change. She refers to a book whose title echoes David Michaels’: *Merchants of Doubt* (2010) (by the historians of science Conway and Oreskes). Bishop writes:

Uncertainty about science that threatens big businesses has been promoted by think tanks … which receive substantial funding from those vested interests. The Fixing Science meeting has a clear overlap with those players. (Bishop)

The speakers on bad statistics, as she sees it, are “foils” for these interests, and thus “responsible scientists should avoid” the meeting.

*But what if things are the reverse?* What if the “bad statistics is our product” leaders also have an agenda? By influencing groups who have a voice in evidence policy in government agencies, they might effectively discredit methods they don’t like and advance those they favor. Suppose you have strong arguments that the consequences of this will undermine important safeguards (despite the key players being convinced they’re promoting better science). Then you should speak, if you can, and not stay away. *You should try to fight fire with fire.*

**IV. So what Happened?** So I accepted the invitation and gave what struck me as a fairly radical title: “P-Value ‘Reforms’: Fixing Science or Threats to Replication and Falsification?” (The abstract and slides are below.) Bishop is right that evidence of bad science can be exploited to selectively weaken entire areas of science; but evidence of bad statistics can also be exploited to selectively weaken entire methods one doesn’t like, and successfully gain acceptance of alternative methods, without the hard work of showing those alternative methods do a better, or even a good, job at the task at hand. Of course both of these things might be happening simultaneously.

Do the conference organizers overlap with science policy as Bishop alleges? I’d never encountered either outfit before, but Bishop quotes from their annual report:

In April we published

The Irreproducibility Crisis, a report on the modern scientific crisis of reproducibility—the failure of a shocking amount of scientific research to discover true results because of slipshod use of statistics, groupthink, and flawed research techniques. We launched the report at the Rayburn House Office Building in Washington, DC; it was introduced by Representative Lamar Smith, the Chairman of the House Committee on Science, Space, and Technology.

So there is a mix with science policy makers in Washington, and their publication, *The Irreproducibility Crisis,* is clearly prepared to find its scapegoat in the bad statistics supposedly encouraged in statistical significance tests. To its credit, it discusses how data-dredging and multiple testing can make it easy to arrive at impressive-looking findings that are spurious, but nothing is said about ways to adjust or account for multiple testing and multiple modeling. (P-values *are* defined correctly, but their interpretation of confidence levels is incorrect.) Published before the Wasserstein et al. (2019) call to end P-value thresholds, which would require the FDA and other agencies to end what many consider vital safeguards of error control, it doesn’t go that far. *Not yet at least! *Trying to prevent that from happening is a key reason I decided to attend. (updated 2/16)

My first step was to send David Randall my book, *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP)–which he actually read and wrote a report on–and I met up with him in NYC to talk. He seemed surprised to learn about the controversies over statistical foundations and the disagreement about reforms. So did I hold people’s feet to the fire at the conference (when it came to scapegoating statistical significance tests and banning P-value thresholds for error probability control)? I did! I continue to do so in communications with David Randall. (I’ll write more in the comments to this post, once our slides are up.)

As for climate change, I wound up entirely missing that part of the conference: Due to the grounding of all flights to and from CLT the day I was to travel, thanks to rain, hail and tornadoes, I could only fly the following day, so our sessions were swapped. I hear the presentations will be posted. Doubtless, some people will use bad statistics and the “replication crisis” to claim there’s reason to reject our best climate change models, without having adequate knowledge of the science. But the real and present danger today that I worry about is that they will use bad statistics to claim there’s reason to reject our best (error) statistical practices, without adequate knowledge of the statistics or the philosophical and statistical controversies behind the “reforms”.

Let me know what you think in the comments.

**V.** Here’s my abstract and slides

P-Value “Reforms”: Fixing Science or Threats to Replication and Falsification?

Mounting failures of replication give a new urgency to critically appraising proposed statistical reforms. While many reforms are welcome, others are quite radical. The sources of irreplication are not mysterious: in many fields, latitude in collecting and interpreting data makes it too easy to dredge up impressive looking findings even when spurious. Paradoxically, some of the reforms intended to fix science enable rather than reveal illicit inferences due to P-hacking, multiple testing, and data-dredging. Some even preclude testing and falsifying claims altogether. Too often the statistics wars become proxy battles between competing tribal leaders, each keen to advance a method or philosophy, rather than improve scientific accountability.

[1]* Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST)*, 2018; *SIST* excerpts; Mayo and Cox 2006; Mayo and Spanos 2006.

[2] All uses of ASA II^{(note)} on this blog must now be qualified to reflect this.

[3] You can find a lot on the conference and the groups involved on-line. The letter by Lenny Teytelman warning people off the conference is here. Nathan Schachtman has a post up today on his law blog here.

]]>

My new paper, “*P* Values on Trial: Selective Reporting of (Best Practice Guides Against) Selective Reporting”, is out in *Harvard Data Science Review* (*HDSR*). *HDSR* describes itself as “A Microscopic, Telescopic, and Kaleidoscopic View of Data Science”. The editor-in-chief is Xiao-Li Meng, a statistician at Harvard. He writes a short blurb on each article in his opening editorial of the issue.

This is a case where reality proves the parody (or maybe, the proof of the parody is in the reality) or something like that. More specifically, Excursion 4 Tour III of my book *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP) opens with a parody of a legal case, that of Scott Harkonen (in the parody, his name is Paul Hack). You can read it here. A few months after the book came out, the actual case took a turn that went even a bit beyond what I imagined could transpire in my parody. I got cold feet when it came to naming names in the book, but in this article I do.

Below I paste Meng’s blurb, followed by the start of my article.

**Meng’s blurb **(his full editorial is here):

***P* Values on Trial (and the Beauty and Beast in a Single Number)**

Perhaps there are no statistical concepts or methods that have been used and abused more frequently than statistical significance and the *p* value. So much so that some journals are starting to recommend authors move away from rigid *p* value thresholds by which results are classified as significant or insignificant. The American Statistical Association (ASA) also issued a statement on statistical significance and *p* values in 2016, a unique practice in its nearly 180 years of history. However, the 2016 ASA statement did not settle the matter, but only ignited further debate, as evidenced by the 2019 special issue of *The American Statistician*. The fascinating account by the eminent philosopher of science Deborah Mayo of how the ASA’s 2016 statement was used in a legal trial should remind all data scientists that what we do or say can have completely unintended consequences, despite our best intentions.

The ASA is a leading professional society of the studies of uncertainty and variabilities. Therefore, the tone and overall approach of its 2016 statement is understandably nuanced and replete with cautionary notes. However, in the case of Scott Harkonen (CEO of InterMune), who was found guilty of misleading the public by reporting a cherry-picked ‘significant *p* value’ to market the drug Actimmune for unapproved uses, the appeal lawyers cited the ASA Statement’s cautionary note that “a *p value* without context or other evidence provides limited information,” as compelling new evidence that the scientific theory upon which petitioner’s conviction was based [that of statistical significance testing] is demonstrably false. I doubt the authors of the ASA statement ever anticipated that their warning against the inappropriate use of *p* value could be turned into arguments for protecting exactly such uses.

To further clarify the ASA’s position, especially in view of some confusions generated by the aforementioned special issue, the ASA recently established a task force on statistical significance (and research replicability) to “develop thoughtful principles and practices that the ASA can endorse and share with scientists and journal editors” within 2020. As a member of the task force, I’m particularly mindful of the message from Mayo’s article, and of the essentially impossible task of summarizing scientific evidence by a single number. As consumers of information, we are all seduced by simplicity, and nothing is simpler than conveying everything through a single number, which renders simplicity on multiple fronts, from communication to decision making. But, again, there is no free lunch. Most problems are just too complex to be summarized by a single number, and concision in this context can exact a considerable cost. The cost could be a great loss of information or validity of the conclusion, which are the central concerns regarding the *p* value. The cost can also be registered in terms of the tremendous amount of hard work it may take to produce a usable single summary.

**Abstract**

In an attempt to stem the practice of reporting impressive-looking findings based on data dredging and multiple testing, the American Statistical Association’s (ASA) 2016 guide to interpreting *p* values (Wasserstein & Lazar) warns that engaging in such practices “renders the reported *p*-values essentially uninterpretable” (pp. 131-132). Yet some argue that the ASA statement actually frees researchers from culpability for failing to report or adjust for data dredging and multiple testing. We illustrate the puzzle by means of a case appealed to the Supreme Court of the United States: that of Scott Harkonen. In 2009, Harkonen was found guilty of issuing a misleading press report on results of a drug advanced by the company of which he was CEO. Downplaying the high *p* value on the primary endpoint (and 10 secondary endpoints), he reported statistically significant drug benefits had been shown, without mentioning this referred only to a subgroup he identified from ransacking the unblinded data. Nevertheless, Harkonen and his defenders argued that “the conclusions from the ASA Principles are the opposite of the government’s” conclusion that his construal of the data was misleading (*Harkonen v. United States*, 2018, p. 16). On the face of it, his defenders are selectively reporting on the ASA guide, leaving out its objections to data dredging. However, the ASA guide also points to alternative accounts to which some researchers turn to avoid problems of data dredging and multiple testing. Since some of these accounts give a green light to Harkonen’s construal, a case might be made that the guide, inadvertently or not, frees him from culpability.

**Keywords:** statistical significance, *p* values, data dredging, multiple testing, ASA guide to *p* values, selective reporting

**Introduction**

The biggest source of handwringing about statistical inference boils down to the fact it has become very easy to infer claims that have not been subjected to stringent tests. Sifting through reams of data makes it easy to find impressive-looking associations, even if they are spurious. Concern with spurious findings is considered sufficiently serious to have motivated the American Statistical Association (ASA) to issue a guide to stem misinterpretations of *p* values (Wasserstein & Lazar, 2016; hereafter, ASA guide). Principle 4 of the ASA guide asserts that*:*

Proper inference requires full reporting and transparency. *P*-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain *p*-values (typically those passing a significance threshold) renders the reported *p*-values essentially uninterpretable. (pp. 131–132)

An intriguing example is offered by a legal case that was back in the news in 2018, having made it to the U.S. Supreme Court (*Harkonen v. United States*, 2018). In 2009, Scott Harkonen (CEO of drug company InterMune) was found guilty of wire fraud for issuing a misleading press report on Phase III results of the drug Actimmune in 2002, successfully pumping up its sales. While Actimmune had already been approved for two rare diseases, it was hoped that the FDA would approve it for a far more prevalent, yet fatal, lung disease (whose treatment would cost patients $50,000 a year). Confronted with a disappointing lack of statistical significance (*p* = .52)[1] on the primary endpoint—that the drug improves lung function as reflected by progression free survival—and on any of ten prespecified endpoints, Harkonen engaged in postdata dredging on the unblinded data until he unearthed a non-prespecified subgroup with a *nominally* statistically significant survival benefit. The day after the Food and Drug Administration (FDA) informed him it would not approve the use of the drug on the basis of his post hoc finding, Harkonen issued a press release to doctors and shareholders optimistically reporting Actimmune’s statistically significant survival benefits in the subgroup he identified from ransacking the unblinded data.
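The inflation at the heart of the case is easy to simulate. The sketch below (illustrative Python; the number of subgroups, subgroup sizes, and the helper `two_sample_z_p` are my invented assumptions, not figures from the trial) scans ten null subgroups per simulated study and counts how often at least one clears the nominal 0.05 threshold:

```python
import math
import random

def two_sample_z_p(x, y):
    """Approximate two-sided p-value for a difference in means (z-test)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

random.seed(1)
n_trials, n_subgroups, n_per_arm = 2000, 10, 50
hits = 0
for _ in range(n_trials):
    # The drug is useless here: every subgroup is pure noise in both arms
    found = any(
        two_sample_z_p([random.gauss(0, 1) for _ in range(n_per_arm)],
                       [random.gauss(0, 1) for _ in range(n_per_arm)]) < 0.05
        for _ in range(n_subgroups))
    hits += found
print(f"studies with a nominally 'significant' subgroup: {hits / n_trials:.2f}")
```

With ten independent looks the chance of a spurious nominal success is roughly 1 − 0.95^10 ≈ 0.40, which is exactly why Principle 4 calls unreported multiplicity “essentially uninterpretable”.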

What makes the case intriguing is not its offering yet another case of *p*-hacking, nor that it has found its way more than once to the Supreme Court. Rather, it is because in 2018, Harkonen and his defenders argued that the ASA guide provides “compelling new evidence that the scientific theory upon which petitioner’s conviction was based [that of statistical significance testing] is demonstrably false” (Goodman, 2018, p. 3). His appeal alleges that “the conclusions from the ASA Principles are the opposite of the government’s” charge that his construal of the data was misleading (*Harkonen v. United States*, 2018, p. 16 ).

Are his defenders merely selectively reporting on the ASA guide, making no mention of Principle 4, with its loud objections to the behavior Harkonen displayed? It is hard to see how one can hold Principle 4 while averring the guide’s principles run counter to the government’s charges against Harkonen. However, if we view the ASA guide in the context of today’s disputes about statistical evidence, things may look topsy turvy. None of the attempts to overturn his conviction succeeded (his sentence had been to a period of house arrest and a fine), but his defenders are given a leg to stand on—wobbly as it is. While the ASA guide does not show that the theory of statistical significance testing ‘is demonstrably false,’ it might be seen to communicate a message that is in tension with itself on one of the most important issues of statistical inference.

Before beginning, some caveats are in order. The legal case was not about which statistical tools to use, but merely whether Harkonen, in his role as CEO, was guilty of intentionally issuing a misleading report to shareholders and doctors. However, clearly, there could be no hint of wrongdoing if it were acceptable to treat post hoc subgroups the same as prespecified endpoints. In order to focus solely on that issue, I put to one side the question whether his press report rises to the level of wire fraud. Lawyer Nathan Schachtman argues that “the judgment in *United States v. Harkonen* is at odds with the latitude afforded companies in securities fraud cases” even where multiple testing occurs (Schachtman, 2020, p. 48). Not only are the intricacies of legal precedent outside my expertise; the arguments in his defense, at least the ones of interest here, concern only the data interpretation. Moreover, our concern is strictly with whether the ASA guide provides grounds to exonerate Harkonen-like interpretations of data.

I will begin by describing the case in relation to the ASA guide. I then make the case that Harkonen’s defenders mislead by omission of the relevant principle in the guide. I will then reopen my case by revealing statements in the guide that have thus far been omitted from my own analysis. Whether they exonerate Harkonen’s defenders is for you, the jury, to decide.

You can read the full article at HDSR here. The Harkonen case is also discussed on this blog: search Harkonen (and Matrixx).

]]>

**Stephen Senn**

*Consultant Statistician*

*Edinburgh*

‘The term “point estimation” made Fisher nervous, because he associated it with estimation without regard to accuracy, which he regarded as ridiculous.’ Jimmy Savage [1, p. 453]

The classic text by David Cox and David Hinkley, *Theoretical Statistics* (1974), has two extremely interesting features as regards estimation. The first is an indirect, implicit message; the second is explicit. Both teach that point estimation is far from being an obvious goal of statistical inference. The indirect message is that the chapter on point estimation (chapter 8) comes *after* that on interval estimation (chapter 7). This may puzzle the reader, who might anticipate that the complications of interval estimation would be handled after the apparently simpler point estimation rather than before. However, with the start of chapter 8, the reasoning is made clear. Cox and Hinkley state:

Superficially, point estimation may seem a simpler problem to discuss than that of interval estimation; in fact, however, any replacement of an uncertain quantity is bound to involve either some arbitrary choice or a precise specification of the purpose for which the single quantity is required. Note that in interval-estimation we explicitly recognize that the conclusion is uncertain, whereas in point estimation…no explicit recognition is involved in the final answer.[2, p. 250]

In my opinion, a great deal of confusion about statistics can be traced to the fact that the point estimate is seen as being the be all and end all, the expression of uncertainty being forgotten. For example, much of the criticism of randomisation overlooks the fact that the statistical analysis will deliver a probability statement and, other things being equal, the more unobserved prognostic factors there are, the more uncertain the result will be claimed to be. However, statistical statements are not wrong *because* they are uncertain, they are wrong if claimed to be more certain (or less certain) than they are.

Amongst justifications that Cox and Hinkley give for calculating point estimates is that when supplemented with an appropriately calculated standard error they will, in many cases, provide the means of calculating a confidence interval, or if you prefer being Bayesian, a credible interval. Thus, to provide a point estimate without also providing a standard error is, indeed, an all too standard error. Of course, there is no value in providing a standard error unless it has been calculated appropriately and addressing the matter of appropriate calculation is not necessarily easy. This is a point I shall pick up below but for the moment let us proceed to consider why it is useful to have standard errors.

First, suppose you have a point estimate. At some time in the past you or someone else decided to collect the data that made it possible. Time and money were invested in doing this. It would not have been worth doing this unless there was a state of uncertainty that the collection of data went some way to resolving. Has it been resolved? Are you certain enough? If not, should more data be collected or would that not be worth it? This cannot be addressed without assessing the uncertainty in your estimate and this is what the standard error is for.

Second, you may wish to combine the estimate with other estimates. This has a long history in statistics. It has been more recently (in the last half century) developed under the heading of *meta-analysis*, which is now a huge field of theoretical study and practical application. However, the subject is much older than that. For example, I have on the shelves of my library at home, a copy of the second (1875) edition of *On the Algebraical And Numerical Theory of Observations: And The Combination of* *Observations*, by George Biddell Airy (1801-1892). [3] Chapter III is entitled ‘Principles of Forming the Most Advantageous Combination of Fallible Measures’ and treats the matter in some detail. For example, Airy defines what he calls the *theoretical weight* (*t.w.*) for combining errors asand then draws attention to ‘two remarkable results’

First. The combination-weight for each measure ought to be proportional to its theoretical weight.

Second. When the combination-weight for each measure is proportional to its theoretical weight, the theoretical weight of the final result is equal to the sum of the theoretical weights of the several collateral measures. (pp. 55-56).

We are now more used to using the standard error (*SE*) rather than the probable error (*PE*) to which Airy refers. However, the *PE*, which can be defined as the *SE* multiplied by the upper quartile of the standard Normal distribution, is just a multiple of the *SE*. Thus we have *PE ≈ 0.6745 × SE* and therefore 50% of values ought to be in the range mean − *PE* to mean + *PE*, hence the name. Since the *PE* is just a multiple of the *SE*, Airy’s second *remarkable result* applies in terms of *SE*s also. Nowadays we might speak of the *precision*, defined as *precision* = 1/*SE*², and say that estimates should be combined in proportion to their precision, in which case the precision of the final result will be the sum of the individual precisions.
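Airy’s prescription is what we would now call fixed-effect, inverse-variance weighting. A minimal sketch (Python; the two estimates and standard errors are invented for illustration):

```python
import math

def combine(estimates, standard_errors):
    """Fixed-effect (inverse-variance) combination: weights are proportional
    to the precisions 1/SE^2, and the combined precision is their sum."""
    precisions = [1 / se ** 2 for se in standard_errors]
    total = sum(precisions)
    mean = sum(w * est for w, est in zip(precisions, estimates)) / total
    return mean, math.sqrt(1 / total)  # combined estimate and its SE

# Two fallible measures of the same quantity
est, se = combine([1.2, 0.8], [0.2, 0.4])
print(est, se)  # the more precise estimate dominates; combined SE < 0.2
```

Note that the combined standard error is smaller than either input standard error, the modern restatement of Airy’s second remarkable result.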

This second edition of Airy’s book dates from 1875 but, although I have not got a copy of the first edition, which dates from 1861, I am confident that the history can be pushed at least as far back as that. In fact, as has often been noticed, fixed effects meta-analysis is really just a form of least squares, a subject developed at the end of the 18^{th} and beginning of the 19^{th} century by Legendre, Gauss and Laplace, amongst others. [4]

A third reason to be interested in standard errors is that you may wish to carry out a Bayesian analysis. In that case, you should consider what the mean and the ‘standard error’ of your prior distribution are. You can then apply Airy’s two remarkable results as follows:

*posterior mean* = (prior precision × prior mean + data precision × data mean) / (prior precision + data precision)

and

posterior precision = prior precision + data precision.
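Under a normal prior and a normal likelihood, those two results give the familiar conjugate update. A small illustrative sketch (Python; the prior and data values are invented):

```python
def posterior(prior_mean, prior_se, data_mean, data_se):
    """Normal-normal update: posterior precision is the sum of the
    precisions, posterior mean is the precision-weighted average."""
    wp, wd = 1 / prior_se ** 2, 1 / data_se ** 2
    mean = (wp * prior_mean + wd * data_mean) / (wp + wd)
    se = (wp + wd) ** -0.5
    return mean, se

m, s = posterior(prior_mean=0.0, prior_se=1.0, data_mean=2.0, data_se=1.0)
print(m, s)  # 1.0 and ~0.707: halfway between two equally precise sources
```

With equally precise prior and data the posterior mean sits exactly halfway between them, and the posterior standard error shrinks below both.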
Suppose that you regard all this concern with uncertainty as an unnecessary refinement and argue, “Never mind Airy’s precision weighting; when I have more than one estimate, I’ll just use an unweighted average”. This might seem like a reasonable ‘belt and braces’ approach but the figure below illustrates a problem. It supposes the following. You have one estimate and you then obtain a second. You now form an unweighted average of the two. What is the precision of this mean compared to a) the first result alone and b) the second result alone? In the figure, the X axis gives the relative precision of the second result alone to that of the first result alone. The Y axis gives the relative precision of the mean to the first result alone (red curve) or to the second result alone (blue curve).

Where a curve is below 1, the precision of the mean is below the relevant single result. If the precision of the second result is less than 1/3 of the first, you would be better off using the first result alone. On the other hand, if the second result is more than three times as precise as the first, you would be better off using the second alone. The consequence is, that if you do not know the precision of your results you *not only don’t know which one to trust, you don’t even know if an average of them should be preferred*.
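The two crossover points in the figure can be checked directly. Taking the first result’s precision as 1 and the second’s as r, the unweighted mean has variance (1 + 1/r)/4; the sketch below (Python, written for this post) confirms the 1/3 and 3 thresholds described above:

```python
def relative_precisions(r):
    """Precision of the unweighted mean of two estimates, relative to each
    estimate alone. r = precision of result 2 / precision of result 1,
    with the first result's variance taken as 1."""
    var_mean = (1 + 1 / r) / 4  # Var[(X1 + X2)/2] with Var1 = 1, Var2 = 1/r
    return 1 / var_mean, (1 / var_mean) / r  # vs result 1, vs result 2

# Below r = 1/3 the first result alone beats the mean;
# above r = 3 the second result alone beats the mean.
print(relative_precisions(1 / 3))  # (1.0, 3.0): mean exactly ties result 1
print(relative_precisions(3))      # (3.0, 1.0): mean exactly ties result 2
```

At r = 1/3 and r = 3 the red and blue curves respectively cross 1, which is why not knowing the precisions leaves you unable even to decide whether averaging helps.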

So, to sum up, if you don’t know how uncertain your evidence is, you can’t use it. Thus, assessing uncertainty is important. However, as I said in the introduction, all too easily attention focuses on estimating the parameter of interest and not the probability statement. This (perhaps unconscious) obsession with point estimation as the be all and end all causes problems. As a common example, consider the following statement: ‘all covariates are balanced, therefore they do not need to be in the model’. This point of view expresses the belief that nothing of relevance will change if the covariates are not in the model, so why bother?

It is true that if a linear model applies, the point estimate for a ‘treatment effect’ will not change by including balanced covariates in the model. However, the expression of uncertainty will be quite different. The balanced case is one that does not apply in general. It thus follows that valid expressions of uncertainty have to allow for prognostic factors being imbalanced and this is, indeed, what they do. Misunderstanding of this is an error often made by critics of randomisation. I explain the misunderstanding like this: *If we knew that important but unobserved prognostic factors were balanced, the standard analysis of clinical trials would be wrong*. Thus, to claim that the analysis of randomised clinical trial relies on prognostic factors being balanced is exactly the opposite of what is true. [5]

As I explain in my blog Indefinite Irrelevance, if the prognostic factors are balanced, not adjusting for them treats them as if they might be imbalanced: the confidence interval will be too wide given that we know that they are __not__ imbalanced. (See also The Well Adjusted Statistician. [6])

Another way of understanding this is through the following example.

Consider a two-armed placebo-controlled clinical trial of a drug with a binary covariate (let us take the specific example of *sex*) and suppose that the patients split 50:50 according to the covariate. Now consider these two questions. What allocation of patients by sex within treatment arms will be such that differences in sex do not impact on 1) the estimate of the effect and 2) the estimate of the standard error of the estimate of the effect?

Everybody knows the answer to 1): the males and females must be equally distributed with respect to treatment. (Allocation one in the table below.) However, the answer to 2) is less obvious: it is that the two groups within which variances are estimated must be homogeneous by treatment and sex. (Allocation two in the table below shows one of the two possibilities.) That means that if we do not put sex in the model, in order to remove sex from affecting the estimate of the variance, we would have to have all the males in one treatment group and all the females in the other.

| Treatment | Allocation one: Male | Allocation one: Female | Allocation two: Male | Allocation two: Female | Total |
|---|---|---|---|---|---|
| Placebo | 25 | 25 | 50 | 0 | 50 |
| Drug | 25 | 25 | 0 | 50 | 50 |
| Total | 50 | 50 | 50 | 50 | 100 |

Table: Percentage allocation by sex and treatment for two possible clinical trials

Of course, nobody uses allocation two but if allocation one is used, then the logical approach is to analyse the data so that the influence of sex is eliminated from the estimate of the variance, and hence the standard error. Savage, referring to Fisher, puts it thus:

He taught what should be obvious but always demands a second thought from me: if an experiment is laid out to diminish the variance of comparisons, as by using matched pairs…the variance eliminated from the comparison shows up in the estimate of this variance (unless care is taken to eliminate it)… [1, p. 450]

The consequence is that one needs to allow for this in the estimation procedure. One needs to ensure not only that the effect is estimated appropriately but that __its uncertainty is also assessed appropriately__. In our example this means that *sex*, in addition to *treatment*, must be in the model.
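A quick simulation makes the point concrete (illustrative Python; the treatment and sex effect sizes and the sample sizes are my invented assumptions). With allocation one, leaving sex out of the model leaves the sex variability in the error estimate, even though the point estimate of the treatment effect is unaffected by the balance:

```python
import random
import statistics

random.seed(2)
tau, beta, n = 1.0, 3.0, 200  # treatment effect, sex effect, patients per cell

# Allocation one: sex balanced 50:50 within each treatment arm
data = [(arm, sex, tau * arm + beta * sex + random.gauss(0, 1))
        for arm in (0, 1) for sex in (0, 1) for _ in range(n)]

# Without sex in the model: the within-arm spread still contains
# the between-sex variability
unadj = statistics.stdev([y for arm, sex, y in data if arm == 0])

# With sex in the model: residuals about the four treatment-by-sex cell means
cells = {}
for arm, sex, y in data:
    cells.setdefault((arm, sex), []).append(y)
adj = statistics.stdev([y - statistics.mean(cells[(arm, sex)])
                        for arm, sex, y in data])

print(f"error SD without sex in model: {unadj:.2f}")  # ~ sqrt(1 + 9/4) ~ 1.8
print(f"error SD with sex in model:    {adj:.2f}")    # ~ 1.0
```

The variance eliminated from the comparison by balancing shows up in the estimate of the variance unless, as Savage says, care is taken to eliminate it.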

‘…it doesn’t approve of your philosophy.’ Ray Bradbury, *Here There Be Tygers*

So, estimating uncertainty is a key task of any statistician. Most commonly, it is addressed by calculating a standard error. However, this is not necessarily a simple matter. The school of statistics associated with design and analysis of agricultural experiments founded by RA Fisher, and to which I have referred as the Rothamsted School, addressed this in great detail. Such agricultural experiments could have a complicated block structure, for example, rows and columns of a field, with whole plots defined by their crossing and subplots within the whole plots. Many treatments could be studied simultaneously, with some (for example crop variety) being capable of being varied at the level of the plots but some (for example fertiliser) at the level of the subplots. This meant that variation at different levels affected different treatment factors. John Nelder developed a formal calculus to address such complex problems [7, 8].

In the world of clinical trials in which I have worked, we distinguish between trials in which patients can receive different treatments on different occasions, those in which each patient can independently receive only one treatment, and those in which all the patients in the same centre must receive the same treatment. Each such design (cross-over, parallel, cluster) requires a different approach to assessing uncertainty. (See To Infinity and Beyond.) Naively treating all observations as independent can underestimate the standard error, a problem that Hurlbert has referred to as *pseudoreplication*. [9]
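Hurlbert’s point can be sketched with a simulation (illustrative Python; the cluster counts and variance components are invented). Treating cluster-correlated observations as independent makes the naive standard error far too small:

```python
import math
import random
import statistics

random.seed(3)
n_clusters, m_per, sigma_c, sigma_e = 20, 25, 1.0, 1.0

def one_study():
    """Overall mean of clustered observations plus its naive SE
    (which wrongly treats all n_clusters * m_per values as independent)."""
    obs = []
    for _ in range(n_clusters):
        c = random.gauss(0, sigma_c)  # effect shared by the whole cluster
        obs += [c + random.gauss(0, sigma_e) for _ in range(m_per)]
    return statistics.mean(obs), statistics.stdev(obs) / math.sqrt(len(obs))

means, naive_ses = zip(*(one_study() for _ in range(500)))
print(f"true SE of the mean ≈ {statistics.stdev(means):.3f}")
print(f"average naive SE    ≈ {statistics.mean(naive_ses):.3f}")
```

Analytically, the true variance of the mean is sigma_c²/20 + sigma_e²/500 ≈ 0.052, while the naive calculation gives about 2/500 = 0.004: the pseudoreplicated standard error is several times too small, and confidence intervals built on it are correspondingly too narrow.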

A key point, however, is this: the formal nature of experimentation forces this issue of variation to our attention. In observational studies we may be careless. We tend to assume that, once we have chosen and made various adjustments to correct bias in the point estimate, the ‘errors’ can then be treated as independent. However, only for the simplest of experimental studies would such an assumption be true, so what justifies making it as a matter of habit for observational ones?

Recent work on historical controls has underlined the problem [10-12]. Trials that use such controls have features of both experimental and observational studies and so provide an illustrative bridge between the two. It turns out that treating the data as if they came from one observational study would underestimate the variance and hence overestimate the precision of the result. The implication is that analyses of observational studies more generally may be producing inappropriately narrow confidence intervals. [10]

If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts he shall end in certainties. Francis Bacon, *The Advancement of Learning*, Book I, v, 8.

In short, I am making an argument for Fisher’s general attitude to inference. Harry Marks has described it thus:

Fisher was a sceptic…But he was an unusually constructive sceptic. Uncertainty and error were, for Fisher, inevitable. But ‘rigorously specified uncertainty’ provided a firm ground for making provisional sense of the world. H Marks [13, p.94]

Point estimates are not enough. It is rarely the case that you have to act immediately based on your best guess. Where you don’t, you have to know how good your guesses are. This requires a principled approach to assessing uncertainty.

1. Savage, J., *On rereading R.A. Fisher*. Annals of Statistics, 1976. **4**(3): p. 441-500.
2. Cox, D.R. and D.V. Hinkley, *Theoretical Statistics*. 1974, London: Chapman and Hall.
3. Airy, G.B., *On the Algebraical and Numerical Theory of Errors of Observations and the Combination of Observations*. 1875, London: Macmillan.
4. Stigler, S.M., *The History of Statistics: The Measurement of Uncertainty before 1900*. 1986, Cambridge, Massachusetts: Belknap Press.
5. Senn, S.J., *Seven myths of randomisation in clinical trials*. Statistics in Medicine, 2013. **32**(9): p. 1439-50.
6. Senn, S.J., *The well-adjusted statistician*. Applied Clinical Trials, 2019: p. 2.
7. Nelder, J.A., *The analysis of randomised experiments with orthogonal block structure I. Block structure and the null analysis of variance*. Proceedings of the Royal Society of London, Series A, 1965. **283**: p. 147-162.
8. Nelder, J.A., *The analysis of randomised experiments with orthogonal block structure II. Treatment structure and the general analysis of variance*. Proceedings of the Royal Society of London, Series A, 1965. **283**: p. 163-178.
9. Hurlbert, S.H., *Pseudoreplication and the design of ecological field experiments*. Ecological Monographs, 1984. **54**(2): p. 187-211.
10. Collignon, O., et al., *Clustered allocation as a way of understanding historical controls: Components of variation and regulatory considerations*. Stat Methods Med Res, 2019: p. 962280219880213.
11. Galwey, N.W., *Supplementation of a clinical trial by historical control data: is the prospect of dynamic borrowing an illusion?* Statistics in Medicine, 2017. **36**(6): p. 899-916.
12. Schmidli, H., et al., *Robust meta-analytic-predictive priors in clinical trials with historical control information*. Biometrics, 2014. **70**(4): p. 1023-1032.
13. Marks, H.M., *Rigorous uncertainty: why RA Fisher is important*. Int J Epidemiol, 2003. **32**(6): p. 932-7; discussion 945-8.


Aris Spanos was asked to review my *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (CUP, 2018), but he was to combine it with a review of the re-issue of Ian Hacking’s classic *Logic of Statistical Inference*. The journal is *OEconomia: History, Methodology, Philosophy*. Below are excerpts from his discussion of my book (pp. 843-860). I will jump past the Hacking review, and occasionally excerpt for length. To read his full article go to the external journal pdf or the stable internal blog pdf.

**….**

The sub-title of Mayo’s (2018) book provides an apt description of its primary aim, in the sense that its focus is on the current discussions pertaining to replicability and trustworthy empirical evidence that revolve around the main fault line in statistical inference: the nature, interpretation and uses of probability in statistical modeling and inference. This underlies not only the form and structure of inductive inference, but also the nature of the underlying statistical reasoning and of the evidence it gives rise to.

A crucial theme in Mayo’s book pertains to the current confusing and confused discussions on reproducibility and replicability of empirical evidence. The book cuts through the enormous level of confusion we see today about basic statistical terms, and in so doing explains why the experts so often disagree about reforms intended to improve statistical science.

Mayo makes a concerted effort to delineate the issues and clear up these confusions by defining the basic concepts accurately and placing many widely held methodological views in the best possible light before scrutinizing them. In particular, the book discusses at length the merits and demerits of the proposed reforms which include: (a) replacing p-values with Confidence Intervals (CIs), (b) using estimation-based effect sizes and (c) redefining statistical significance.

The key philosophical concept employed by Mayo to distinguish between a *sound* empirical evidential claim for a hypothesis *H* and an unsound one is the notion of a *severe test*: if little has been done to rule out *flaws (errors and omissions)* in pronouncing that data **x**_{0} provide evidence for a hypothesis *H*, then that inferential claim has not passed a severe test, rendering the claim untrustworthy. One has trustworthy evidence for a claim C only to the extent that C passes a severe test; see Mayo (1983; 1996). A distinct advantage of the concept of severe testing is that it is sufficiently general to apply to both frequentist and Bayesian inferential methods.

Mayo makes a case that there is a two-way link between philosophy and statistics. On one hand, philosophy helps in resolving conceptual, logical, and methodological problems of statistical inference. On the other hand, viewing statistical inference as severe testing gives rise to novel solutions to crucial philosophical problems including induction, falsification and the demarcation of science from pseudoscience. In addition, it serves as the foundation for understanding and getting beyond the statistics wars that currently revolve around the replication crisis; hence the title of the book, *Statistical Inference as Severe Testing*.

Chapter (excursion) 1 of Mayo’s (2018) book sets the scene by scrutinizing the different role of probability in *statistical inference*, distinguishing between:

**(i) Probabilism.** Probability is used to assign a *degree of confirmation, support or belief* in a hypothesis *H*, given data **x**_{0} (Bayesian, likelihoodist, Fisher (fiducial)). An inferential claim *H* is warranted when it is assigned a *high* probability, support, or degree of belief (absolute or comparative).

**(ii) Performance.** Probability is used to ensure the *long-run reliability* of inference procedures; type I, II, coverage probabilities (frequentist, behavioristic Neyman-Pearson). An inferential claim *H* is warranted when it stems from a procedure with a low long-run error.

**(iii) Probativism.** Probability is used to assess the *probing capacity* of inference procedures, *pre-data* (type I, II, coverage probabilities), as well as *post-data* (p-value, severity evaluation). An inferential claim *H* is warranted when the different ways it can be false have been adequately probed and averted.

Mayo argues that probativism based on the severe testing account uses error probabilities to output an evidential interpretation based on assessing how severely an inferential claim *H* has passed a test with data **x**_{0}. Error control and long-run reliability is necessary but not sufficient for probativism. This perspective is contrasted to probabilism (Law of Likelihood (LL) and Bayesian posterior) that focuses on the relationships between data **x**_{0} and hypothesis *H*, and ignores outcomes **x** ∈ *R^{n}* other than the observed **x**_{0}.

Chapter (excursion) 2 entitled ‘Taboos of Induction and Falsification’ relates the various uses of probability to draw certain parallels between probabilism, Bayesian statistics and Carnapian logics of confirmation on one side, and performance, frequentist statistics and Popperian falsification on the other. The discussion in this chapter covers a variety of issues in philosophy of science, including, the problem of induction, the asymmetry of induction and falsification, sound vs. valid arguments, enumerative induction (straight rule), confirmation theory (and formal epistemology), statistical affirming the consequent, the old evidence problem, corroboration, demarcation of science and pseudoscience, Duhem’s problem and novelty of evidence. These philosophical issues are also related to statistical conundrums as they relate to significance testing, fallacies of rejection, the cannibalization of frequentist testing known as Null Hypothesis Significance Testing (NHST) in psychology, and the issues raised by the reproducibility and replicability of evidence.

Chapter (excursion) 3 on ‘Statistical Tests and Scientific Inference’ provides a basic introduction to frequentist testing paying particular attention to crucial details, such as specifying explicitly the assumed statistical model **M**_{θ}(**x**) and the proper framing of hypotheses in terms of its parameter space Θ, with a view to provide a coherent account by avoiding undue formalism. The Neyman-Pearson (N-P) formulation of hypothesis testing is explained using a simple example, and then related to Fisher’s significance testing. What is different from previous treatments is that the claimed ‘inconsistent hybrid’ associated with the NHST caricature of frequentist testing is circumvented. The crucial difference often drawn is based on the N-P emphasis on pre-data long-run error probabilities, and the behavioristic interpretation of tests as accept/reject rules. By contrast, the post-data p-value associated with Fisher’s significance tests is thought to provide a more evidential interpretation. In this chapter, the two approaches are reconciled in the context of the error statistical framework. The N-P formulation provides the formal framework in the context of which an optimal theory of frequentist testing can be articulated, but in its current expositions lacks a proper evidential interpretation. **[For the detailed example see his review pdf.]** …

If a hypothesis *H _{0}* passes a test

The *post-data severity evaluation* outputs the discrepancy γ stemming from the testing results and takes the probabilistic form:

SEV (θ ≶ θ* _{1}*;

where the inequalities are determined by the testing result and the sign of d(**x*** _{0}*).
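The post-data severity evaluation can be computed explicitly for the standard one-sided Normal test. A minimal numerical sketch (my own illustration, with invented sample values, following the form in Mayo and Spanos (2006)): for T+ with H0: µ ≤ 0 vs H1: µ > 0, σ known, after a rejection, SEV(µ > µ1) = Pr(d(**X**) ≤ d(**x**_0); µ = µ1).

```python
# Hedged numerical sketch of post-data severity for the one-sided Normal
# test T+ (H0: mu <= 0 vs H1: mu > 0). The sample values are invented.
from math import erf, sqrt

def Phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma, n = 1.0, 100
xbar0 = 0.2            # observed mean: d(x0) = 2 standard errors, so H0 is rejected

def severity(mu1):
    # SEV(mu > mu1) = Pr(d(X) <= d(x0); mu = mu1) = Phi((xbar0 - mu1) * sqrt(n) / sigma)
    return Phi((xbar0 - mu1) * sqrt(n) / sigma)

for mu1 in (0.0, 0.1, 0.2, 0.3):
    print(f"SEV(mu > {mu1}) = {severity(mu1):.3f}")
```

The output shows how the warranted discrepancy shrinks: the claim µ > 0 passes with high severity (about .977), µ > 0.1 with severity about .84, while µ > 0.2 only reaches .5, so only the smaller discrepancies are well warranted by this result.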

Mayo uses the post-data severity perspective to scotch several misinterpretations of the p-value, including the claim that the p-value is not a legitimate error probability. She also calls into question any comparisons of the tail areas of d(**X**) under *H _{0}* that vary with

The real life examples of the 1919 eclipse data for testing the General Theory of Relativity, as well as the 2012 discovery of the Higgs particle are used to illustrate some of the concepts in this chapter.

The discussion in this chapter sheds light on several important problems in statistical inference, including several howlers of statistical testing, Jeffreys’ tail area criticism, weak conditionality principle and the likelihood principle.

…**[To read about excursion 4, see his full review pdf.]**

Chapter (excursion) 5, entitled ‘Power and Severity’, provides an in-depth discussion of power and its abuses or misinterpretations, as well as scotch several confusions permeating the current discussions on the replicability of empirical evidence.

**Confusion 1:** The power of a N-P test *Τ*_{α}:= {d(**X**), C_{1}(α)} is a *pre-data* error probability that calibrates the generic (for any sample realization x∈* R^{n}* ) capacity of the test in detecting different discrepancies from

**Confusion 2:** The power function is properly defined for all θ_{1}∈Θ_{1} only when (Θ_{0}, Θ_{1}) constitute a partition of Θ. This is to ensure that θ^{∗} is not in a subset of Θ ignored by the comparisons since the *main objective* is to *narrow down* the unknown parameter space Θ using *hypothetical* values of θ. …Hypothesis testing poses questions as to whether a hypothetical value θ_{0} is close enough to θ^{∗} in the sense that the difference (θ^{∗} – θ_{0}) is ‘statistically negligible’; a notion defined using error probabilities.

**Confusion 3:** Hacking (1965) raised the problem of using pre-data error probabilities, such as the significance level α and power, to evaluate the testing results post-data. As mentioned above, the post-data severity aims to address that very problem, and is extensively discussed in Mayo (2018), excursion 5.

**Confusion 4:** Mayo and Spanos (2006) define “attained power” by replacing c_{α} with the observed d(**x**_{0}). But this should not be confused with replacing θ_{1} with its observed estimate [e.g., *x*_{n}], as in what is often called “observed” or “retrospective” power. To compare the two in example 2, contrast:

Attained power: POW(µ_{1})=Pr(d(**X**) > d(**x**_{0}); µ=µ_{1}), for all µ_{1}>µ_{0},

with what Mayo calls Shpower which is defined at µ=*x*_{n}:

Shpower: POW(*x*_{n})=Pr(d(**X**) > d(**x**_{0}); µ=*x*_{n}).

Shpower makes very little statistical sense, unless point estimation justifies the inferential claim *x*_{n }≅ µ^{∗}, which it does not, as argued above. Unfortunately, the statistical literature in psychology is permeated with (implicitly) invoking such a claim when touting the merits of estimation-based effect sizes. The estimate *x*_{n }represents just a single value of X_{n} ∼N(µ, σ^{2}/n ), and any inference pertaining to µ needs to take into account the uncertainty described by this sampling distribution; hence, the call for using interval estimation and hypothesis testing to account for that sampling uncertainty. The post-data severity evaluation addresses this problem using hypothetical reasoning and taking into account the relevant statistical context (11). It outputs the discrepancy from *H _{0}* warranted by test
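The contrast between attained power and Shpower can be made concrete in a few lines (my own illustration; the numbers are invented). Note that, with the definitions displayed above, plugging the estimate into the attained-power formula always returns exactly 0.5, whatever the data — one way to see why Shpower carries no information:

```python
# Attained power vs "Shpower" for the one-sided Normal test, using the
# definitions quoted in the review. Sample values are invented.
from math import erf, sqrt

def Phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma, n, mu0 = 1.0, 100, 0.0
xbar0 = 0.2                                   # observed mean (illustrative)

def attained_power(mu1):
    # POW(mu1) = Pr(d(X) > d(x0); mu = mu1), with d(X) = sqrt(n)(Xbar - mu0)/sigma
    return 1 - Phi((xbar0 - mu1) * sqrt(n) / sigma)

print(attained_power(0.3))                    # 0.8413...
shpower = attained_power(xbar0)               # plug in the estimate itself
print(shpower)                                # exactly 0.5, whatever xbar0 is
```

Attained power varies informatively with the hypothetical discrepancy µ1; Shpower collapses to a data-independent constant.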

**Confusion 5:** Frequentist error probabilities (type I, II, coverage, p-value) are *not conditional* on *H* (*H*_{0} or *H _{1}*) since θ=θ

This confusion undermines the credibility of Positive Predictive Value (PPV):

where (i) *F* = *H _{0}* is false, (ii) R=test rejects

A stronger case can be made that abuses and misinterpretations of frequentist testing are only symptomatic of a more extensive problem: the *recipe-like/uninformed implementation of statistical methods*. This contributes in many different ways to untrustworthy evidence, including: (i) statistical misspecification (imposing invalid assumptions on one’s data), (ii) poor implementation of inference methods (insufficient understanding of their assumptions and limitations), and (iii) unwarranted evidential interpretations of their inferential results (misinterpreting p-values and CIs, etc.).

Mayo uses the concept of a post-data severity evaluation to illuminate the above mentioned issues and explain how it can also provide the missing evidential interpretation of testing results. The same concept is also used to clarify numerous misinterpretations of the p-value throughout this book, as well as the fallacies:

**(a) Fallacy of acceptance (non-rejection).** No evidence against *H _{0}* is misinterpreted as evidence for it. This fallacy can easily arise when the power of a test is low (e.g. small

In chapter 5 Mayo returns to a recurring theme throughout the book, the mathematical duality between Confidence Intervals (CIs) and hypothesis testing, with a view to call into question certain claims about the superiority of CIs over p-values. This mathematical duality derails any claims that observed CIs are less vulnerable to the large n problem and more informative than p-values. Where they differ is in terms of their inferential claims stemming from their different forms of reasoning, factual vs. hypothetical. That is, the mathematical duality does not imply inferential duality. This is demonstrated by contrasting CIs with the post-data severity evaluation.
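The mathematical duality between CIs and tests is easy to verify numerically. A small sketch (my own toy numbers) for a two-sided Normal z-test with known σ: a value µ0 lies inside the 95% observed CI exactly when H0: µ = µ0 is not rejected at α = .05.

```python
# CI/test duality for a two-sided z-test with known sigma (toy numbers).
from math import sqrt

sigma, n = 1.0, 100
xbar0 = 0.2                                  # observed mean (illustrative)
z = 1.96
ci = (xbar0 - z * sigma / sqrt(n), xbar0 + z * sigma / sqrt(n))

def rejected(mu0):
    """Two-sided z-test of H0: mu = mu0 at alpha = .05."""
    return abs((xbar0 - mu0) * sqrt(n) / sigma) > z

# Duality: mu0 is inside the 95% CI exactly when H0: mu = mu0 is not rejected.
for mu0 in (0.0, 0.05, 0.15, 0.35):
    inside = ci[0] <= mu0 <= ci[1]
    assert inside == (not rejected(mu0))
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

This is the mathematical duality; the point in the text is that it does not by itself supply the inferential (severity) interpretation.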

Indeed, a case can be made that the post-data severity evaluation addresses several long-standing problems associated with frequentist testing, including the large *n* problem, the apparent arbitrariness of the N-P framing that allows for simple vs. simple hypotheses, say *H _{0}*: µ= 1 vs.

Chapter 5 also includes a retrospective view of the disputes between Neyman and Fisher in the context of the error statistical perspective on frequentist inference, bringing out their common framing and their differences in emphasis and interpretation. The discussion also includes an interesting summary of their personal conflicts, not always motivated by statistical issues; who said the history of statistics is boring?

Chapter (excursion) 6 of Mayo (2018) raises several important foundational issues and problems pertaining to Bayesian inference, including its primary aim, subjective vs. default Bayesian priors and their interpretations, default Bayesian inference vs. the Likelihood Principle, the role of the catchall factor, the role of Bayes factors in Bayesian testing, and the relationship between Bayesian inference and error probabilities. There is also discussion about attempts by ‘default prior’ Bayesians to unify or reconcile frequentist and Bayesian accounts.

A point emphasized in this chapter pertains to model validation. Despite the fact that Bayesian statistics shares the same concept of a statistical model **M**_{θ}(**x**) with frequentist statistics, there is hardly any discussion on validating **M**_{θ}(**x**) to secure the reliability of the posterior distribution: … upon which all Bayesian inferences are based. The exception is the indirect approach to model validation in Gelman et al (2013) based on the posterior predictive distribution: Since *m*(**x**) is parameter-free, one can use it as a basis for simulating a number of replications **x**_{1},
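The posterior predictive check referred to here can be sketched numerically. A minimal toy example (my own; the data, seed, and conjugate Normal model are invented and are not from the review): draw µ from the posterior, simulate replicated datasets, and compare a test statistic with its observed value.

```python
# Toy posterior predictive check in the spirit of Gelman et al. (2013).
# Conjugate Normal model: mu ~ N(0, 10^2) prior, x | mu ~ N(mu, 1).
import numpy as np

rng = np.random.default_rng(0)
x_obs = rng.normal(1.0, 1.0, size=50)          # invented "observed" data

n, tau2, s2 = x_obs.size, 100.0, 1.0
post_var = 1.0 / (1.0 / tau2 + n / s2)          # posterior variance of mu
post_mean = post_var * (x_obs.sum() / s2)       # posterior mean of mu

B = 4000
mu_draws = rng.normal(post_mean, np.sqrt(post_var), size=B)
x_rep = rng.normal(mu_draws[:, None], 1.0, size=(B, n))   # B replicated datasets

T = lambda x: x.std(axis=-1)                    # a feature the model should capture
ppp = (T(x_rep) >= T(x_obs)).mean()             # posterior predictive p-value
print(f"posterior predictive p-value: {ppp:.2f}")
```

Since the data here are generated from the assumed model, the check should not flag misspecification (the p-value should not be extreme); with misspecified data it typically would.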

On the question posed by the title of this review, Mayo’s answer is that the error statistical framework, a refinement or extension of the original Fisher-Neyman-Pearson framing in the spirit of Peirce, provides a pertinent foundation for frequentist modeling and inference.

A retrospective view of Hacking (1965) reveals that its main weakness is that its perspective on statistical induction adheres too closely to the philosophy of science framing of that period, and largely ignores the formalism based on the theory of stochastic processes {X* _{t}*, t∈

‘Probability as a dispositional property’ of a chance set-up alludes to the *propensity interpretation* of probability associated with Peirce and Popper, which is in complete agreement with the model-based frequentist interpretation; see Spanos (2019).

The main contribution of Mayo’s (2018) book is to put forward a framework and a strategy to evaluate the trustworthiness of evidence resulting from different statistical accounts. Viewing statistical inference as a form of severe testing elucidates the most widely employed arguments surrounding commonly used (and abused) statistical methods. In the severe testing account, probability arises in inference, not to measure degrees of plausibility or belief in hypotheses, but to evaluate and control how severely tested different inferential claims are. Without assuming that other statistical accounts aim for severe tests, Mayo proposes the following strategy for evaluating the trustworthiness of evidence: begin with a minimal requirement that if a test has little or no chance to detect flaws in a claim *H*, then *H*’s passing result constitutes untrustworthy evidence. Then, apply this minimal severity requirement to the various statistical accounts as well as to the proposed reforms, including estimation-based effect sizes, observed CIs and redefining statistical significance. Finding that they fail even the minimal severity requirement provides grounds to question the trustworthiness of their evidential claims. One need not reject some of these methods just because they have different aims, but because they give rise to evidence [claims] that fail the minimal severity requirement. Mayo challenges practitioners to be much clearer about their aims in particular contexts and different stages of inquiry. It is in this way that the book ingeniously links philosophical questions about the roles of probability in inference to the concerns of practitioners about coming up with trustworthy evidence across the landscape of the natural and the social sciences.

**References**

- Barnard, George. 1972. Review Article: Logic of Statistical Inference. *The British Journal for the Philosophy of Science*, 23: 123-190.
- Cramer, Harald. 1946. *Mathematical Methods of Statistics*. Princeton: Princeton University Press.
- Fisher, Ronald A. 1922. On the Mathematical Foundations of Theoretical Statistics. *Philosophical Transactions of the Royal Society* A, 222(602): 309-368.
- Fisher, Ronald A. 1925. *Statistical Methods for Research Workers*. Edinburgh: Oliver & Boyd.
- Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2013. *Bayesian Data Analysis*, 3rd ed. London: Chapman & Hall/CRC.
- Hacking, Ian. 1972. Review: Likelihood. *The British Journal for the Philosophy of Science*, 23(2): 132-137.
- Hacking, Ian. 1980. The Theory of Probable Inference: Neyman, Peirce and Braithwaite. In D. Mellor (ed.), *Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite*. Cambridge: Cambridge University Press, 141-160.
- Ioannidis, John P. A. 2005. Why Most Published Research Findings Are False. *PLoS Medicine*, 2(8): 696-701.
- Koopman, Bernard O. 1940. The Axioms and Algebra of Intuitive Probability. *Annals of Mathematics*, 41(2): 269-292.
- Mayo, Deborah G. 1983. An Objective Theory of Statistical Testing. *Synthese*, 57(3): 297-340.
- Mayo, Deborah G. 1996. *Error and the Growth of Experimental Knowledge*. Chicago: The University of Chicago Press.
- Mayo, Deborah G. 2018. *Statistical Inference as Severe Testing: How to Get Beyond the Statistical Wars*. Cambridge: Cambridge University Press.
- Mayo, Deborah G. and Aris Spanos. 2004. Methodology in Practice: Statistical Misspecification Testing. *Philosophy of Science*, 71(5): 1007-1025.
- Mayo, Deborah G. and Aris Spanos. 2006. Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction. *British Journal for the Philosophy of Science*, 57(2): 323-357.
- Mayo, Deborah G. and Aris Spanos. 2011. Error Statistics. In D. Gabbay, P. Thagard, and J. Woods (eds), *Philosophy of Statistics, Handbook of Philosophy of Science*. New York: Elsevier, 151-196.
- Neyman, Jerzy. 1952. *Lectures and Conferences on Mathematical Statistics and Probability*, 2nd ed. Washington: U.S. Department of Agriculture.
- Royall, Richard. 1997. *Statistical Evidence: A Likelihood Paradigm*. London: Chapman & Hall.
- Salmon, Wesley C. 1967. *The Foundations of Scientific Inference*. Pittsburgh: University of Pittsburgh Press.
- Spanos, Aris. 2013. A Frequentist Interpretation of Probability for Model-Based Inductive Inference. *Synthese*, 190(9): 1555-1585.
- Spanos, Aris. 2017. Why the Decision-Theoretic Perspective Misrepresents Frequentist Inference. In *Advances in Statistical Methodologies and Their Applications to Real Problems*. http://dx.doi.org/10.5772/65720, 3-28.
- Spanos, Aris. 2018. Mis-Specification Testing in Retrospect. *Journal of Economic Surveys*, 32(2): 541-577.
- Spanos, Aris. 2019. *Probability Theory and Statistical Inference: Empirical Modeling with Observational Data*, 2nd ed. Cambridge: Cambridge University Press.
- Von Mises, Richard. 1928. *Probability, Statistics and Truth*, 2nd ed. New York: Dover.
- Williams, David. 2001. *Weighing the Odds: A Course in Probability and Statistics*. Cambridge: Cambridge University Press.

Remember when I wrote to the National Academy of Science (NAS) in September pointing out mistaken definitions of P-values in their document on Reproducibility and Replicability in Science? (see my 9/30/19 post). I’d given up on their taking any action, but yesterday I received a letter from the NAS Senior Program officer:

Dear Dr. Mayo,

I am writing to let you know that the Reproducibility and Replicability in Science report has been updated in response to the issues that you have raised.

Two footnotes, on pages ~~31~~ 35 and 221, highlight the changes. The updated report is available from the following link: NEW 2020 NAS DOC. Thank you for taking the time to reach out to me and to Dr. Fineberg and letting us know about your concerns.

With kind regards and wishes of a happy 2020,

Jenny Heimberg

Jennifer Heimberg, Ph.D.

Senior Program Officer

The National Academies of Sciences, Engineering, and Medicine

I’m really glad to see the effort! The footnote on p. 35 reads:

The original document read:

And the revised paragraph is:

Although my letter had also made the point about the difference between ordinary English and technical uses of “likelihood”, I did not expect them to tinker with those because the document is filled with jumbled uses of the two. Notice, just for one example, how the replacement on p. 221, along with the footnote

is immediately followed by:

Do you see any mixture of “likelihood” and “probability”?[1]

Still, I *greatly appreciate* their making the correction, which will alert readers to be careful in combing through the document. As encouragement to others to write in corrections, they might have acknowledged the error corrector, but I’m not complaining. It underscores my position that it’s really not so onerous or impossible to fix mistakes in committee-generated “guides for best practices”. See, for instance, my friendly amendments to the March 2019 editorial in *The American Statistician*.[2]

[1]At a time when people are cavalierly combining Type I error probabilities and power in a quasi-Bayesian computation to yield a “posterior predictive value” (the *diagnostic screening model* of tests)–which is also found in the NAS document– it’s especially important to be consistent in the use of “likelihood”. For a criticism of the diagnostic screening model see pp 361-370 of my *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (SIST) (2018, CUP), or search this blog.

[2] Before you quit a committee on scientific methodology because you think they’re not upholding standards, please alert me at error@vt.edu.


You know how in that Woody Allen movie, “Midnight in Paris,” the main character (I forget who plays it, I saw it on a plane) is a writer finishing a novel, and he steps into a cab that mysteriously picks him up at midnight and transports him back in time where he gets to run his work by such famous authors as Hemingway and Virginia Woolf? (It was a new movie when I began the blog in 2011.) He is wowed when his work earns their approval and he comes back each night in the same mysterious cab…Well, imagine an error statistical philosopher is picked up in a mysterious taxi at midnight (New Year’s Eve ~~2011~~ ~~2012~~, ~~2013~~, ~~2014~~, ~~2015~~, ~~2016~~, ~~2017~~, ~~2018~~, 2019) and is taken back sixty years and, lo and behold, finds herself in the company of Allan Birnbaum.[i] There are a few 2019 updates–one is of great significance.

ERROR STATISTICIAN: It’s wonderful to meet you Professor Birnbaum; I’ve always been extremely impressed with the important impact your work has had on philosophical foundations of statistics. I happen to be writing on your famous argument about the likelihood principle (LP). (whispers: I can’t believe this!)

BIRNBAUM: Ultimately you know I rejected the LP as failing to control the error probabilities needed for my Confidence concept. But you know all this, I’ve read it in your new book: *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (SIST, 2018, CUP).

ERROR STATISTICIAN: You’ve read my new book? Wow! Then you know I don’t think your argument shows that the LP follows from such frequentist concepts as sufficiency S and the weak conditionality principle WCP. I don’t rehearse my argument there, but I first found it in 2006.[ii] Sorry,…I know it’s famous…

BIRNBAUM: Well, I shall happily invite you to take any case that violates the LP and allow me to demonstrate that the frequentist is led to inconsistency, provided she also wishes to adhere to the WCP and sufficiency (although less than S is needed).

ERROR STATISTICIAN: Well I show that no contradiction follows from holding WCP and S, while denying the LP.

BIRNBAUM: Well, well, well: I’ll bet you a bottle of Elba Grease champagne that I can demonstrate it!

ERROR STATISTICAL PHILOSOPHER: It is a great drink, I must admit that: I love lemons.

BIRNBAUM: OK. (A waiter brings a bottle, they each pour a glass and resume talking). Whoever wins this little argument pays for this whole bottle of vintage Ebar or Elbow or whatever it is Grease.

ERROR STATISTICAL PHILOSOPHER: I really don’t mind paying for the bottle.

BIRNBAUM: Good, you will have to. Take any LP violation. Let x’ be 2-standard deviation difference from the null (asserting μ = 0) in testing a normal mean from the fixed sample size experiment E’, say n = 100; and let x” be a 2-standard deviation difference from an optional stopping experiment E”, which happens to stop at 100. Do you agree that:

(0) For a frequentist, outcome x’ from E’ (fixed sample size) is NOT evidentially equivalent to x” from E” (optional stopping that stops at n)

ERROR STATISTICAL PHILOSOPHER: Yes, that’s a clear case where we reject the strong LP, and it makes perfect sense to distinguish their corresponding p-values (which we can write as p’ and p”, respectively). The searching in the optional stopping experiment makes the p-value quite a bit higher than with the fixed sample size. For n = 100, data x’ yields p’= ~.05; while p” is ~.3. Clearly, p’ is not equal to p”, I don’t see how you can make them equal.
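[The gap between p’ and p” claimed here is easy to check by simulation. A rough Monte Carlo sketch (my own; the seed and trial counts are invented): under the null, sample up to n = 100 and stop as soon as the running mean is 2 standard errors from 0. The stopping rule fires far more often than the fixed-n 5%.]

```python
# Monte Carlo sketch of why optional stopping inflates the p-value:
# under H0 (mu = 0), stop as soon as the running mean is 2 standard
# errors from 0, giving up at n = 100; count how often the rule fires.
import numpy as np

rng = np.random.default_rng(1)
trials, n_max = 20_000, 100
hits = 0
for _ in range(trials):
    x = rng.normal(0.0, 1.0, size=n_max)
    n = np.arange(1, n_max + 1)
    z = np.cumsum(x) / np.sqrt(n)          # running mean in standard-error units
    if np.any(np.abs(z) >= 2.0):
        hits += 1
print(f"optional-stopping rate under H0: {hits / trials:.2f}  (fixed-n two-sided p' ~ 0.05)")
```

The simulated rate comes out several times the fixed-sample 5%, in line with the dialogue's p” being far larger than p’.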

BIRNBAUM: Suppose you’ve observed x”, a 2-standard deviation difference from an optional stopping experiment E”, that finally stops at n=100. You admit, do you not, that this outcome could have occurred as a result of a different experiment? It could have been that a fair coin was flipped where it is agreed that heads instructs you to perform E’ (fixed sample size experiment, with n = 100) and tails instructs you to perform the optional stopping experiment E”, stopping as soon as you obtain a 2-standard deviation difference, and you happened to get tails, and performed the experiment E”, which happened to stop with n =100.

ERROR STATISTICAL PHILOSOPHER: Well, that is not how x” was obtained, but ok, it could have occurred that way.

BIRNBAUM: Good. Then you must grant further that your result could have come from a special experiment I have dreamt up, call it a BB-experiment. In a BB- experiment, if the outcome from the experiment you actually performed has an outcome with a proportional likelihood to one in some other experiment not performed, E’, then we say that your result has an “LP pair”. For any violation of the strong LP, the outcome observed, let it be x”, has an “LP pair”, call it x’, in some other experiment E’. In that case, a BB-experiment stipulates that you are to report x” as if you had determined whether to run E’ or E” by flipping a fair coin.

(They fill their glasses again)

ERROR STATISTICAL PHILOSOPHER: You’re saying that if my outcome from trying and trying again, that is, optional stopping experiment E”, with an “LP pair” in the fixed sample size experiment I did not perform, then I am to report x” as if the determination to run E” was by flipping a fair coin (which decides between E’ and E”)?

BIRNBAUM: Yes, and one more thing. If your outcome had actually come from the fixed sample size experiment E’, it too would have an “LP pair” in the experiment you did not perform, E”. Whether you actually observed x” from E”, or x’ from E’, you are to report it as x” from E”.

ERROR STATISTICAL PHILOSOPHER: So let’s see if I understand a Birnbaum BB-experiment: whether my observed 2-standard deviation difference came from E’ or E” (with sample size n), the result is reported as x”, as if it came from E” (optional stopping), and as a result of this strange type of a mixture experiment.

BIRNBAUM: Yes, or equivalently you could just report x*: my result is a 2-standard deviation difference and it could have come from either E’ (fixed sampling, n= 100) or E” (optional stopping, which happens to stop at the 100^{th} trial). That’s how I sometimes formulate a BB-experiment.

ERROR STATISTICAL PHILOSOPHER: You’re saying in effect that if my result has an LP pair in the experiment not performed, I should act as if I accept the strong LP and just report its likelihood; so if the likelihoods are proportional in the two experiments (both testing the same mean), the outcomes are evidentially equivalent.

BIRNBAUM: Well, but since the BB-experiment is an imagined “mixture” it is a *single* experiment, so really you only need to apply the *weak LP*, which frequentists accept. Yes? (The *weak LP* is the same as the *sufficiency principle*.)

ERROR STATISTICAL PHILOSOPHER: But what is the sampling distribution in this imaginary BB-experiment? Suppose I have Birnbaumized my experimental result, just as you describe, and observed a 2-standard deviation difference from optional stopping experiment E”. How do I calculate the p-value within a Birnbaumized experiment?

BIRNBAUM: I don’t think anyone has ever called it that.

ERROR STATISTICAL PHILOSOPHER: I just wanted to have a shorthand for the operation you are describing, there’s no need to use it, if you’d rather I not. So how do I calculate the p-value within a BB-experiment?

BIRNBAUM: You would report the overall p-value, which would be the average over the sampling distributions: (p’ + p”)/2

Say p’ is ~.05, and p” is ~.3; whatever they are, we know they are different: that’s what makes this a violation of the strong LP (given in premise (0)).

ERROR STATISTICAL PHILOSOPHER: So you’re saying that if I observe a 2-standard deviation difference from E’, I do not report the associated p-value p’, but instead I am to report the average p-value, averaging over some other experiment E” that could have given rise to an outcome with a proportional likelihood to the one I observed, even though I didn’t obtain it this way?

BIRNBAUM: I’m saying that you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB-experiment.

*My, this drink is sour!*

ERROR STATISTICAL PHILOSOPHER: Yes, I love pure lemon.

BIRNBAUM: Perhaps you’re in want of a gene; never mind.

I’m saying you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB-experiment. If you are to interpret your experiment as if you are within the rules of a BB experiment, then x’ is evidentially equivalent to x” (is equivalent to x*). This is premise (1).

ERROR STATISTICAL PHILOSOPHER: But the result would be that the p-value associated with x’ (fixed sample size) is reported to be larger than it actually is (.05), because I’d be averaging over fixed and optional stopping experiments; while the p-value associated with x” (optional stopping) is reported to be smaller than it is–in both cases because of an experiment I did not perform.

BIRNBAUM: Yes, the BB-experiment computes the P-value in an *unconditional* manner: it takes the convex combination over the 2 ways the result could have come about.
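(As a quick numeric aside: the .05 and .37 here are just the illustrative values used in the dialogue. The unconditional BB report is the equal-weights convex combination of the two conditional p-values; a minimal sketch:)

```python
# Illustrative values only, taken from the dialogue (p' ~ .05, p'' ~ .37).
p_fixed = 0.05      # p' from E' (fixed sample size)
p_optional = 0.37   # p'' from E'' (optional stopping)

# The BB-experiment's unconditional p-value: the convex combination
# over the two ways the result could have come about (fair coin flip).
p_bb = 0.5 * p_fixed + 0.5 * p_optional
print(round(p_bb, 2))  # 0.21 -- agrees with neither p' nor p''
```

The averaged value differs from both conditional reports, which is just what the dispute over conditioning turns on.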

ERROR STATISTICAL PHILOSOPHER: This is just a matter of your definitions; it is an analytical or mathematical result, so long as we grant being within your BB-experiment.

BIRNBAUM: True, (1) plays the role of the sufficiency assumption, but one need not even appeal to this; it is just a matter of mathematical equivalence.

By the way, I am focusing just on LP violations, therefore, the outcome, by definition, has an LP pair. In other cases, where there is no LP pair, you just report things as usual.

ERROR STATISTICAL PHILOSOPHER: OK, but p’ still differs from p”; so I still don’t see how I’m forced to infer the strong LP, which identifies the two. In short, I don’t see the contradiction with my rejecting the strong LP in premise (0). (Also, we should come back to the “other cases” at some point….)

BIRNBAUM: Wait! Don’t be so impatient; I’m about to get to step (2). Here, let’s toast to the new year: “To Elbar Grease!”

ERROR STATISTICAL PHILOSOPHER: To Elbar Grease!

BIRNBAUM: So far all of this was step (1).

ERROR STATISTICAL PHILOSOPHER: Oy, what is step 2?

BIRNBAUM: STEP 2 is this: Surely you agree that once you know from which experiment the observed 2-standard deviation difference actually came, you ought to report the p-value corresponding to that experiment. You ought NOT to report the average (p’ + p”)/2 as you were instructed to do in the BB-experiment.

This gives us premise (2a):

(2a) outcome x”, once it is known that it came from E”, should NOT be analyzed as in a BB-experiment where p-values are averaged. The report should instead use the sampling distribution of the optional stopping test E”, yielding the p-value, p” (~.37). In fact, .37 is the value you give in SIST p. 44 (imagining the experimenter keeps taking 10 more).

ERROR STATISTICAL PHILOSOPHER: So, having first insisted I imagine myself in a Birnbaumized, I mean a BB-experiment, and report an average p-value, I’m now to return to my senses and “condition” in order to get back to the only place I ever wanted to be, i.e., back to where I was to begin with?

BIRNBAUM: Yes, at least if you hold to the weak conditionality principle WCP (of D. R. Cox)—surely you agree to this.

(2b) Likewise, if you knew the 2-standard deviation difference came from E’, then

x’ should NOT be deemed evidentially equivalent to x” (as in the BB experiment), the report should instead use the sampling distribution of fixed test E’, (.05).

ERROR STATISTICAL PHILOSOPHER: So, having first insisted I consider myself in a BB-experiment, in which I report the average p-value, I’m now to return to my senses and allow that if I know the result came from optional stopping, E”, I should “condition” on and report p”.

BIRNBAUM: Yes. There was no need to repeat the whole spiel.

ERROR STATISTICAL PHILOSOPHER: I just wanted to be clear I understood you. Of course, all of this assumes the model is correct or adequate to begin with.

BIRNBAUM: Yes, the SLP is a principle for parametric inference within a given model. So you arrive at (2a) and (2b), yes?

ERROR STATISTICAL PHILOSOPHER: OK, but it might be noted that unlike premise (1), premises (2a) and (2b) are not given by definition, they concern an evidential standpoint about how one ought to interpret a result once you know which experiment it came from. In particular, premises (2a) and (2b) say I should condition and use the sampling distribution of the experiment known to have been actually performed, when interpreting the result.

BIRNBAUM: Yes, and isn’t this weak conditionality principle WCP one that you happily accept?

ERROR STATISTICAL PHILOSOPHER: Well, the WCP is defined for actual mixtures, where one flipped a coin to determine whether E’ or E” is performed, whereas you’re requiring that I consider an imaginary Birnbaum mixture experiment, where the choice of the experiment not performed varies depending on the outcome that needs an LP pair; and I cannot even determine what this might be until after I’ve observed the result that would violate the LP. I don’t know what the sample size will be ahead of time.

BIRNBAUM: Sure, but you admit that your observed x” could have come about through a BB-experiment, and that’s all I need. Notice:

(1), (2a) and (2b) yield the strong LP!

Outcome x” from E” (optional stopping that stops at n) is evidentially equivalent to x’ from E’ (fixed sample size n).

ERROR STATISTICAL PHILOSOPHER: Clever, but your “proof” is obviously unsound; and before I demonstrate this, notice that the conclusion, were it to follow, asserts p’ = p” (e.g., .05 = .3!), even though it is unquestioned that p’ is not equal to p”; that is because we must start with an LP violation (premise (0)).

BIRNBAUM: Yes, it is puzzling, but where have I gone wrong?

(The waiter comes by and fills their glasses; they are so deeply engrossed in thought they do not even notice him.)

ERROR STATISTICAL PHILOSOPHER: There are many routes to explaining a fallacious argument. Here’s one. What is required for STEP 1 to hold is the denial of what’s needed for STEP 2 to hold:

Step 1 requires us to analyze results in accordance with a BB-experiment. If we do so, true enough, we get:

premise (1): outcome x” (in a BB experiment) is evidentially equivalent to outcome x’ (in a BB experiment):

That is because in either case, the p-value would be (p’ + p”)/2.

Step 2 now insists that we should NOT calculate evidential import as if we were in a BB-experiment. Instead we should consider the experiment from which the data actually came, E’ or E”:

premise (2a): outcome x” (in a BB experiment) is/should be evidentially equivalent to x” from E” (optional stopping that stops at n): its p-value should be p”.

premise (2b): outcome x’ (within a BB-experiment) is/should be evidentially equivalent to x’ from E’ (fixed sample size): its p-value should be p’.

If (1) is true, then (2a) and (2b) must be false!

If (1) is true and we keep fixed the stipulation of a BB experiment (which we must to apply step 2), then (2a) is asserting:

The average p-value (p’ + p”)/2 = p’, which is false.

Likewise if (1) is true, then (2b) is asserting:

The average p-value (p’ + p”)/2 = p”, which is false.

Alternatively, we can see what goes wrong by realizing:

If (2a) and (2b) are true, then premise (1) must be false.

In short, your famous argument requires us to assess evidence in a given experiment in two contradictory ways: as if we are within a BB-experiment (and report the average p-value), and also that we are not, but rather should report the actual p-value.

I can render it as formally valid, but then its premises can never all be true; alternatively, I can get the premises to come out true, but then the conclusion is false—so it is invalid. In no way does it show the frequentist is open to contradiction (by dint of accepting S, WCP, and denying the LP).

BIRNBAUM: Yet some people still think it is a breakthrough (in favor of Bayesianism).

ERROR STATISTICAL PHILOSOPHER: I have a much clearer exposition of what goes wrong in your argument than I did in the discussion from 2010. There were still several gaps there, and the lack of a clear articulation of the WCP. In fact, I’ve come to see that clarifying the entire argument turns on defining the WCP. Have you seen my 2014 paper in *Statistical Science*? The key difference is that in (2014) the WCP is stated as an equivalence, as you intended. Cox’s WCP, many claim, was not an equivalence going in two directions. Slides from a presentation may be found on this blogpost.

BIRNBAUM: Yes I have seen your 2014 paper, very clever! Your Rejoinder to some of the critics is gutsy, to say the least. Congratulations! I’ve also seen the slides on your blog.

ERROR STATISTICAL PHILOSOPHER: Thank you, I’m amazed you follow my blog! But look I *must* get your answer to a question before you leave this year.

*Sudden interruption by the waiter*

WAITER: Who gets the tab?

BIRNBAUM: I do. To Elbar Grease! And to your new book SIST! I’ve read it 3 times. I have a list of comments and questions right here.

ERROR STATISTICAL PHILOSOPHER: Let me see, I’d love to read your questions and comments. (She takes a long legal-sized yellow sheet from Birnbaum, notices it is filled with tiny hand-written comments, covering both sides.)

BIRNBAUM: **To Elbar Grease! To Severe Testing! Happy New Year!**

ERROR STATISTICAL PHILOSOPHER: I have one quick question, Professor Birnbaum, and I swear that whatever you say will be just between us, I won’t tell a soul. In your last couple of papers, you suggest you’d discovered the flaw in your argument for the LP. Am I right? Even in the discussion of your (1962) paper, you seemed to agree with Pratt that WCP can’t do the job you intend.

BIRNBAUM: Savage, you know, never got off my case about remaining at “the half-way house” of likelihood, and not going full Bayesian. Then I wrote the review about the Confidence Concept as the one rock on a shifting scene… Pratt thought the argument should instead appeal to a Censoring Principle (basically, it doesn’t matter if your instrument cannot measure beyond k units if the measurement you’re making is under k units.)

ERROR STATISTICAL PHILOSOPHER: Yes, but who says frequentist error statisticians deny the Censoring Principle? So back to my question, you disappeared before answering last year…I just want to know…you did see the flaw, yes?

WAITER: We’re closing now; shall I call you a taxicab?

BIRNBAUM: Yes.

ERROR STATISTICAL PHILOSOPHER: ‘Yes’, you discovered the flaw in the argument, or ‘yes’ to the taxi?

MANAGER: We’re closing now; I’m sorry you must leave.

ERROR STATISTICAL PHILOSOPHER: We’re leaving; I just need him to clarify his answer….

*Large group of people bustle past.*

Prof. Birnbaum…? Allan? **Where did he go?** (Oy, not again!)

**But wait! I’ve got his list of comments and questions in my hand! It’s real!!!**

**Link to complete discussion:**

Mayo, Deborah G. “On the Birnbaum Argument for the Strong Likelihood Principle” (with discussion & rejoinder), *Statistical Science* 29 (2014), no. 2, 227-266.

[i] Many links on the strong likelihood principle (LP or SLP) and Birnbaum may be found by searching this blog. Good sources for where to start as well as historical background papers may be found in my last blogpost. Please see that post for how you can very easily win a free signed copy of SIST.

[ii] By the way, Ronald Giere gave me numerous original papers of yours. They’re in files in my attic library. Some are in mimeo, others typed…I mean, obviously for that time that’s what they’d be…now of course, oh never mind, sorry.

An essential component of inference based on familiar frequentist notions (p-values, significance and confidence levels) is the relevant sampling distribution (hence the term *sampling theory*, or my preferred *error statistics*, as we get error probabilities from the sampling distribution). This feature results in violations of a principle known as the *strong likelihood principle* (SLP). Roughly stated, the SLP asserts that all the evidential import in the data (for parametric inference within a model) resides in the likelihoods. If accepted, it would render error probabilities irrelevant post data.

**SLP** (We often drop the “strong” and just call it the LP. The “weak” LP just boils down to sufficiency.)

For any two experiments E_{1} and E_{2} with different probability models f_{1}, f_{2}, but with the same unknown parameter θ: if outcomes x* and y* (from E_{1} and E_{2} respectively) determine the same (i.e., proportional) likelihood function (f_{1}(x*; θ) = cf_{2}(y*; θ) for all θ), then x* and y* are inferentially equivalent (for an inference about θ).

(What differentiates the weak and the strong LP is that the weak refers to a single experiment.)
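For the fixed-sample versus optional stopping case taken up below, the proportionality is easy to see: the stopping rule is a function of the data alone, not of θ, so it contributes only a θ-free factor to the likelihood. Schematically (my own sketch, with φ the Normal density with known σ):

```latex
f_{E''}(x_1,\ldots,x_n;\theta)
  = \underbrace{\mathbf{1}\{\text{rule stops at } n \text{ on } x_1,\ldots,x_n\}}_{\text{free of }\theta}
    \cdot \prod_{i=1}^{n} \phi(x_i - \theta)
  = c\, f_{E'}(x_1,\ldots,x_n;\theta),
```

where c does not depend on θ, which is exactly the proportionality condition f_{1}(x*; θ) = cf_{2}(y*; θ) in the SLP statement.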

**Violation of SLP:**

Outcomes x* and y* from experiments E_{1} and E_{2} with different probability models f_{1}, f_{2}, but with the same unknown parameter θ, where f_{1}(x*; θ) = cf_{2}(y*; θ) for all θ, and yet x* and y* have different implications for an inference about θ.

For an example of a SLP violation, E_{1} might be sampling from a Normal distribution with a fixed sample size n, and E_{2} the corresponding experiment that uses an optional stopping rule: keep sampling until you obtain a result 2 standard deviations away from a null hypothesis that θ = 0 (and for simplicity, a known standard deviation). When you do, stop and reject the point null (in 2-sided testing).

The SLP tells us (in relation to the optional stopping rule) that once you have observed a 2-standard deviation result, there should be no evidential difference between its having arisen from experiment E_{1} , where n was fixed, say, at 100, and experiment E_{2} where the stopping rule happens to stop at n = 100. For the error statistician, by contrast, there is a difference, and this constitutes a violation of the SLP.
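A small simulation makes the difference vivid. This is my own sketch, not from the post: with the null true, the fixed-n test at the 2-standard-error cutoff rejects about 5% of the time, while the keep-sampling rule (checked after every observation, up to n = 100) rejects far more often.

```python
import math
import random

random.seed(1)  # for reproducibility of the sketch

def optional_stopping_rejects(max_n=100):
    """Sample X_i ~ N(0,1) (null true); stop and 'reject' the first time
    the sample mean is at least 2 standard errors from 0, up to max_n."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += random.gauss(0.0, 1.0)
        if abs(total / n) >= 2.0 / math.sqrt(n):  # |mean| >= 2*sigma/sqrt(n)
            return True
    return False

trials = 20_000
rate = sum(optional_stopping_rejects() for _ in range(trials)) / trials
print(rate)  # well above the fixed-n test's ~0.05
```

For the error statistician this inflated rejection rate is precisely why the two experiments differ evidentially, even when the observed data are the same.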

———————-

*Now for the surprising part:* Remember the 61-year-old chestnut from my last post, where a coin is flipped to decide which of two experiments to perform? David Cox (1958) proposes something called the Weak Conditionality Principle (WCP) to restrict the space of relevant repetitions for frequentist inference. The WCP says that once it is known which E_{i} produced the measurement, the assessment should be in terms of the properties of the particular E_{i}. Nothing could be more obvious.

The surprising upshot of Allan Birnbaum’s (1962) argument is that the SLP appears to follow from applying the WCP in the case of mixtures, together with so uncontroversial a principle as sufficiency (SP)–although even that has been shown to be optional, strictly speaking. But this would preclude the use of sampling distributions. L. J. Savage calls Birnbaum’s argument “a landmark in statistics” (see [i]).

Although his argument purports that [(WCP and SP) entails SLP], I show how data may violate the SLP while holding both the WCP and SP. Such cases directly refute [WCP entails SLP].

In Birnbaum’s argument, he introduces an informal, and rather vague, notion of the “evidence (or evidential meaning) of an outcome **z** from experiment E”. He writes it: Ev(E, **z**).

In my formulation of the argument, I introduce a new symbol to represent a function from a given experiment-outcome pair, (E,**z**) to a generic inference implication. It (hopefully) lets us be clearer than does Ev.

(E, **z**) → Infr_{E}(**z**) is to be read “the inference implication from outcome **z** in experiment E” (according to whatever inference type/school is being discussed).

*An outline of my argument is in the slides for a talk below: *

**Binge reading the Likelihood Principle.**

If you’re keen to binge read the SLP–a way to break holiday/winter break doldrums– I’ve pasted most of the early historical sources before the slides. The argument is simple; showing what’s wrong with it took a long time. My earliest treatment, via counterexample, is in Mayo (2010). A deeper argument is in Mayo (2014) in *Statistical Science*.[ii] An intermediate paper Mayo (2013) corresponds to the slides below–they were presented at the JSM. Interested readers may search this blog for quite a lot of discussion of the SLP including “U-Phils” (discussions by readers) (e.g., here, and here), and amusing notes (e.g., Don’t Birnbaumize that experiment my friend, and Midnight with Birnbaum).

**Why this issue is bound to resurface in 2020.**

I had blogged this “binge read” post a year ago, but the issue has scarcely been put to rest. I expect it to resurface in 2020 for a few reasons. First, I’d promised myself that once my book (SIST) was out, I’d try to collect central textbooks that still call it a theorem, and write to the authors. Hence my offer in my last post to send you a free signed copy of SIST in exchange for texts you find (1 book per textbook, and the page/pages themselves need to be sent or attached). The argument barely takes a page.

Second, I’ve already been asked to review some new attempts to declare an improvement on the original attempt. There’s no mention of my disproof, nor of Mike Evans’.

Third, it ought to come up as a crucial battle about the very notion of “evidence”, blithely taken for granted in such “best practice guides” as the 2016 ASA statement on P-values and significance (ASA I). It is the interpretation of evidence (left intuitive by Birnbaum) underlying the SLP that is being presupposed.

You may not wish to engage in what looks to be (and is) a rather convoluted logical argument. That’s fine, but just remember that when someone says “it’s been proved mathematically” that error probabilities are irrelevant to evidence post data, you can say, “I read somewhere that this has been disproved”.

—–

[i] Savage on Birnbaum: “This paper is a landmark in statistics. . . . I, myself, like other Bayesian statisticians, have been convinced of the truth of the likelihood principle for a long time. Its consequences for statistics are very great. . . . [T]his paper is really momentous in the history of statistics. It would be hard to point to even a handful of comparable events. …once the likelihood principle is widely recognized, people will not long stop at that halfway house but will go forward and accept the implications of personalistic probability for statistics” (Savage 1962, 307-308).

The argument purports to follow from principles frequentist error statisticians accept.

[ii] The link includes comments on my paper by Bjornstad, Dawid, Evans, Fraser, Hannig, and Martin and Liu, and my rejoinder.

**Classic Birnbaum Papers:**

- Birnbaum, A. (1962), “On the Foundations of Statistical Inference”, *Journal of the American Statistical Association* 57(298), 269-306.
- Savage, L. J., Barnard, G., Cornfield, J., Bross, I., Box, G., Good, I., Lindley, D., Clunies-Ross, C., Pratt, J., Levene, H., Goldman, T., Dempster, A., Kempthorne, O., and Birnbaum, A. (1962), “Discussion on Birnbaum’s ‘On the Foundations of Statistical Inference’”, *Journal of the American Statistical Association* 57(298), 307-326.
- Birnbaum, A. (1970), “Statistical Methods in Scientific Inference” (letter to the editor), *Nature* 225, 1033.
- Birnbaum, A. (1972), “More on Concepts of Statistical Evidence”, *Journal of the American Statistical Association* 67(340), 858-861.

**Note to Reader:** If you look at the “discussion”, you can already see Birnbaum backtracking a bit, in response to Pratt’s comments.

Some additional early discussion papers:

**Durbin:**

- Durbin, J. (1970), “On Birnbaum’s Theorem on the Relation Between Sufficiency, Conditionality and Likelihood”, *Journal of the American Statistical Association* 65(329), 395-398.
- Savage, L. J. (1970), “Comments on a Weakened Principle of Conditionality”, *Journal of the American Statistical Association* 65(329), 399-401.
- Birnbaum, A. (1970), “On Durbin’s Modified Principle of Conditionality”, *Journal of the American Statistical Association* 65(329), 402-403.

There’s also a good discussion in Cox and Hinkley 1974.

**Evans, Fraser, and Monette:**

- Evans, M., Fraser, D. A., and Monette, G. (1986), “On Principles and Arguments to Likelihood”, *The Canadian Journal of Statistics* 14, 181-199.

**Kalbfleisch:**

- Kalbfleisch, J. D. (1975), “Sufficiency and Conditionality”, *Biometrika* 62(2), 251-259.
- Barnard, G. A. (1975), “Comments on Paper by J. D. Kalbfleisch”, *Biometrika* 62(2), 260-261.
- Barndorff-Nielsen, O. (1975), “Comments on Paper by J. D. Kalbfleisch”, *Biometrika* 62(2), 261-262.
- Birnbaum, A. (1975), “Comments on Paper by J. D. Kalbfleisch”, *Biometrika* 62(2), 262-264.
- Kalbfleisch, J. D. (1975), “Reply to Comments”, *Biometrika* 62(2), 268.

**My discussions:**

- Mayo, D. G. (2010), “An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle”, in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D. Mayo and A. Spanos, eds.), Cambridge: Cambridge University Press, 305-314.
- Mayo, D. G. (2013), “Presented Version: On the Birnbaum Argument for the Strong Likelihood Principle”, in *JSM Proceedings*, Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association, 440-453.
- Mayo, D. G. (2014), “On the Birnbaum Argument for the Strong Likelihood Principle” (with discussion and Mayo rejoinder), *Statistical Science* 29(2), 227-266.


2018 marked 60 years since the famous weighing machine example from Sir David Cox (1958)[1]. It is now 61. It’s one of the “chestnuts” in the exhibits of “chestnuts and howlers” in Excursion 3 (Tour II) of my (still) new book *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (SIST, 2018). It’s especially relevant to take this up now, just before we leave 2019, for reasons that will be revealed over the next day or two. For a sneak preview of those reasons, see the “note to the reader” at the end of this post. So, let’s go back to it, with an excerpt from SIST (pp. 170-173).

**Exhibit (vi): Two Measuring Instruments of Different Precisions.** *Did you hear about the frequentist who, knowing she used a scale that’s right only half the time, claimed her method of weighing is right 75% of the time?*

She says, “I flipped a coin to decide whether to use a scale that’s right 100% of the time, or one that’s right only half the time, so, overall, I’m right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.)
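Her arithmetic, spelled out (a minimal sketch of the joke’s bookkeeping): she averages over the coin flip instead of conditioning on the scale she knows she used.

```python
# Each scale's chance of weighing correctly, as in the joke.
p_good_scale = 1.0   # right 100% of the time
p_bad_scale = 0.5    # right only half the time

# Unconditional performance, averaged over the fair coin flip:
overall = 0.5 * p_good_scale + 0.5 * p_bad_scale
print(overall)  # 0.75 -- her claimed 75%, though she knows she used the lousy scale
```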

*Basis for the joke:* An N-P test bases error probability on all possible outcomes or measurements that could have occurred in repetitions, but did not. As with many infamous pathological examples, often presented as knockdown criticisms of all of frequentist statistics, this was invented by a frequentist, Cox (1958). It was a way to highlight what could go wrong in the case at hand, if one embraced an unthinking behavioral-performance view. Yes, error probabilities are taken over hypothetical repetitions of a process, but not just any repetitions will do. Here’s the statistical formulation.

We flip a fair coin to decide which of two instruments, E_{1} or E_{2}, to use in observing a Normally distributed random sample *Z* to make inferences about mean θ; the two instruments have known but very different precisions.

In testing a null hypothesis such as *θ* = 0, the same *z* measurement would correspond to a much smaller p-value had it come from the more precise instrument than from the less precise one.

Suppose that we know we have observed a measurement from E_{2 }with its much larger variance:

The unconditional test says that we can assign this a higher level of significance than we ordinarily do, because if we were to repeat the experiment, we might sample some quite different distribution. But this fact seems irrelevant to the interpretation of an observation which we know came from a distribution [with the larger variance]. (Cox 1958, p. 361)

Once it is known which E_{i} has produced *z*, the p-value or other inferential assessment should be made with reference to the experiment actually run. As Cox and Mayo (2010) put it:

The point essentially is that the marginal distribution of a P-value averaged over the two possible configurations is misleading for a particular set of data. It would mean that an individual fortunate in obtaining the use of a precise instrument in effect sacrifices some of that information in order to rescue an investigator who has been unfortunate enough to have the randomizer choose a far less precise tool. From the perspective of interpreting the specific data that are actually available, this makes no sense. (p. 296)
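Rough numbers help here. The following is my own illustration with assumed precisions (σ = 1 for E_{1}, σ = 10 for E_{2}; the text does not give specific values): the same measurement z = 2 yields very different p-values conditionally, and the unconditional average answers neither question.

```python
from math import erf, sqrt

def p_two_sided(z, sigma):
    # P(|Z| >= |z|) for Z ~ N(0, sigma^2), using the Normal CDF via erf.
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / (sigma * sqrt(2.0)))))

z = 2.0
p_precise = p_two_sided(z, sigma=1.0)              # from E_1: ~0.046
p_imprecise = p_two_sided(z, sigma=10.0)           # from E_2: ~0.84
p_unconditional = 0.5 * (p_precise + p_imprecise)  # mixture average: ~0.44
print(round(p_precise, 3), round(p_imprecise, 3), round(p_unconditional, 3))
```

If we know the measurement came from E_{2}, conditioning says to report the ~0.84, not the ~0.44 average; likewise a known-E_{1} result should not be diluted up toward ~0.44.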

To scotch his famous example, Cox (1958) introduces a principle: weak conditionality.

**Weak Conditionality Principle (WCP):** If a mixture experiment (of the aforementioned type) is performed, then, if it is known which experiment produced the data, inferences about θ are appropriately drawn in terms of the sampling behavior in the experiment known to have been performed (Cox and Mayo 2010, p. 296).

It is called weak conditionality because there are more general principles of conditioning that go beyond the special case of mixtures of measuring instruments.

While conditioning on the instrument actually used seems obviously correct, nothing precludes the N-P theory from choosing the procedure “which is best on the average over both experiments” (Lehmann and Romano 2005, p. 394), and it’s even possible that the average or unconditional power is better than the conditional. In the case of such a conflict, Lehmann says relevant conditioning takes precedence over average power (1993b). He allows that in some cases of acceptance sampling, the average behavior may be relevant, but in scientific contexts the conditional result would be the appropriate one (see Lehmann 1993b, p. 1246). Context matters. Did Neyman and Pearson ever weigh in on this? Not to my knowledge, but I’m sure they’d concur with N-P tribe leader Lehmann. Admittedly, if your goal in life is to attain a precise α level, then when discrete distributions preclude this, a solution would be to flip a coin to decide the borderline cases! (See also Example 4.6, Cox and Hinkley 1974, pp. 95–6; Birnbaum 1962, p. 491.)

**Is There a Catch?**

The “two measuring instruments” example occupies a famous spot in the pantheon of statistical foundations, regarded by some as causing “a subtle earthquake” in statistical foundations. Analogous examples are made out in terms of confidence interval estimation methods (Tour III, Exhibit (viii)). It is a warning to the most behavioristic accounts of testing from which we have already distinguished the present approach. Yet justification for the conditioning (WCP) is fully within the frequentist error statistical philosophy, for contexts of scientific inference. There is no suggestion, for example, that only the particular data set be considered. That would entail abandoning the sampling distribution as the basis for inference, and with it the severity goal. Yet we are told that “there is a catch” and that WCP leads to the Likelihood Principle (LP)!

It is not uncommon to see statistics texts argue that in frequentist theory one is faced with the following dilemma: either to deny the appropriateness of conditioning on the precision of the tool chosen by the toss of a coin, or else to embrace the strong likelihood principle, which entails that frequentist sampling distributions are irrelevant to inference once the data are obtained. This is a false dilemma. Conditioning is warranted to achieve objective frequentist goals, and the [weak] conditionality principle coupled with sufficiency does not entail the strong likelihood principle. The ‘dilemma’ argument is therefore an illusion. (Cox and Mayo 2010, p. 298)

There is a large literature surrounding the argument for the Likelihood Principle, made famous by Birnbaum (1962). Birnbaum hankered for something in between radical behaviorism and throwing error probabilities out the window. Yet he himself had apparently proved there is no middle ground (if you accept WCP)! Even people who thought there was something fishy about Birnbaum’s “proof” were discomfited by the lack of resolution to the paradox. It is time for post-LP philosophies of inference. So long as the Birnbaum argument, which Savage and many others deemed important enough to dub a “breakthrough in statistics,” went unanswered, the frequentist was thought to be boxed into the pathological examples. She is not.

In fact, I show there is a flaw in his venerable argument (Mayo 2010b, 2013a, 2014b). That’s a relief. Now some of you will howl, “Mayo, not everyone agrees with your disproof! Some say the issue is not settled.” Fine, please explain where my refutation breaks down. It’s an ideal brainbuster to work on along the promenade after a long day’s tour. Don’t be dismayed by the fact that it has been accepted for so long. But I won’t revisit it here.

From *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (Mayo 2018, CUP).

Excursion 3 Tour II, pp. 170-173.

Note to the Reader:

The LP was a main topic for the first few years of this blog. That’s because I was still refining an earlier disproof from Mayo (2010), based on giving a counterexample. I later saw the need for a deeper argument which I give in Mayo (2014) in *Statistical Science*.[3] (There, among other subtleties, the WCP is put as a logical equivalence as intended.)

“It was the adoption of an unqualified equivalence formulation of conditionality, and related concepts, which led, in my 1962 paper, to the monster of the likelihood axiom” (Birnbaum 1975, 263).

An intermediate paper is Mayo (2013).

What I discovered in 2019 is that rather than admit that Allan Birnbaum’s (1962) alleged proof is circular, some authors are claiming to have new proofs of it. These consist of reiterating the same premises that render the argument circular, but with greater exuberance and more certainty. In this way, it is said to avoid objections to the earlier attempted proof! The only problem is: once an argument is circular, it remains so.

Textbooks should not call a claim a theorem if it’s not a theorem, i.e., if there isn’t a proof of it (within the relevant formal system). **So, in 2020, when you find a textbook that claims the LP is a theorem, provable from the (WCP) and (SP), or (WCP) alone, please send me an attachment or link of the relevant pages and reference. A free signed copy of SIST goes to the first person (1 copy for each such textbook) who does so.** Since there are many such textbooks out there, I expect to part with several copies of SIST in 2020.

If statistical inference follows Bayesian posterior probabilism, the LP follows easily. It’s shown in just a couple of pages of SIST Excursion 1 Tour II (45-6). All the excitement is whether the frequentist (error statistician) is bound to hold it. If she is, then error probabilities become irrelevant to the evidential import of data (once the data are given), at least when making parametric inferences within a statistical model.
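A minimal numerical sketch (mine, not part of the excerpt) of the classic binomial versus negative binomial example makes vivid what hangs on the LP. Suppose we observe 9 heads in 12 tosses. If n = 12 was fixed in advance, the model is binomial; if tossing continued until the 3rd tail, it is negative binomial. The two likelihood functions are proportional, so any account obeying the LP must treat the evidence identically; yet the two sampling plans yield different P-values for testing θ = 0.5 against θ > 0.5.

```python
from math import comb

# Observed: 9 heads (successes) and 3 tails in 12 tosses.

def lik_binomial(theta):
    """Likelihood when n = 12 was fixed in advance."""
    return comb(12, 9) * theta**9 * (1 - theta)**3

def lik_negbinomial(theta):
    """Likelihood when tossing continued until the 3rd tail
    (so the last toss is a tail; 9 heads among the first 11)."""
    return comb(11, 9) * theta**9 * (1 - theta)**3

# The ratio is a constant in theta: the likelihoods are proportional.
ratios = {round(lik_binomial(t) / lik_negbinomial(t), 6) for t in (0.3, 0.5, 0.8)}
print(ratios)  # {4.0}

# One-sided P-values for H0: theta = 0.5 vs H1: theta > 0.5:
# Binomial: P(X >= 9) with X ~ Bin(12, 0.5)
p_binom = sum(comb(12, k) for k in range(9, 13)) / 2**12
# Negative binomial: P(12 or more tosses needed to reach 3 tails)
#   = P(at most 2 tails in the first 11 tosses)
p_negb = sum(comb(11, k) for k in range(0, 3)) / 2**11
print(round(p_binom, 4), round(p_negb, 4))  # 0.073 0.0327
```

The error statistician reports the two cases differently (one result is significant at the 0.05 level, the other is not), because the error probabilities depend on the sampling plan; under the LP, the sampling plan is evidentially irrelevant once the data are in hand.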

If you’re keen to try your hand at the arguments (Birnbaum’s or mine), you might start with a summary post (based on slides) here. It is *not* included in SIST. There’s no real mathematics or statistics involved, it’s pure logic. But it’s very circuitous, which is one reason why the supposed “proof” has stuck around as long as it has. The other reason is that many people *want* it to be so.

[1] Cox 1958 has a different variant of the chestnut.

[2] Note sufficiency is not really needed in the “proof”.

[3] The discussion includes commentaries by Dawid, Evans, Martin and Liu, Hannig, and Bjørnstad–some of whom are very unhappy with me. But I’m given the final word in the rejoinder.

**References** (outside of the excerpt; for refs within SIST, please see SIST):

Birnbaum, A. (1962), “On the Foundations of Statistical Inference”, *Journal of the American Statistical Association* 57(298), 269-306.

Birnbaum, A. (1975), “Comments on Paper by J. D. Kalbfleisch”, *Biometrika* 62(2), 262-264.

Cox, D. R. (1958), “Some Problems Connected with Statistical Inference”, *The Annals of Mathematical Statistics* 29, 357-372.

Mayo, D. G. (2010), “An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle”, in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D. Mayo and A. Spanos, eds.), Cambridge: Cambridge University Press, 305-314.

Mayo, D. G. (2013), “On the Birnbaum Argument for the Strong Likelihood Principle” (presented version), in *JSM Proceedings*, Section on Bayesian Statistical Science, Alexandria, VA: American Statistical Association, 440-453.

Mayo, D. G. (2014), “On the Birnbaum Argument for the Strong Likelihood Principle” (with discussion and Mayo’s rejoinder), *Statistical Science* 29(2), 227-239; 261-266.

I’m reblogging a post from Christmas past–exactly 7 years ago. Guess what I gave as the number 1 (of 13) ~~howler~~ well-worn criticism of statistical significance tests haunting us back in 2012 (all of which are put to rest in Mayo and Spanos 2011)? Yes, it’s the frightening allegation that statistical significance tests forbid using any background knowledge! The researcher is imagined to start with a “blank slate” in each inquiry (no memories of fallacies past), and then unthinkingly apply a purely formal, automatic, accept-reject machine. What’s newly frightening (in 2019) is the credulity with which this apparition is now being met (by some). I make some new remarks below the post from Christmas past:

2013 is right around the corner, and here are 13 well-known criticisms of statistical significance tests, and how they are addressed within the error statistical philosophy, as discussed in Mayo, D. G. and Spanos, A. (2011) “Error Statistics“.

- (#1) Error statistical tools forbid using any background knowledge [1].
- (#2) All statistically significant results are treated the same.
- (#3) The p-value does not tell us how large a discrepancy is found.
- (#4) With large enough sample size even a trivially small discrepancy from the null can be detected.
- (#5) Whether there is a statistically significant difference from the null depends on which is the null and which is the alternative.
- (#6) Statistically insignificant results are taken as evidence that the null hypothesis is true.
- (#7) Error probabilities are misinterpreted as posterior probabilities.
- (#8) Error statistical tests are justified only in cases where there is a very long (if not infinite) series of repetitions of the same experiment.
- (#9) Specifying statistical tests is too arbitrary.
- (#10) We should be doing confidence interval estimation rather than significance tests.
- (#11) Error statistical methods take into account the intentions of the scientists analyzing the data.
- (#12) All models are false anyway.
- (#13) Testing assumptions involves illicit data-mining.

You can read how we avoid them in the full paper here.
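Howler #4 in particular is easy to exhibit numerically. The sketch below (my illustration, not part of the original post) computes the one-sided P-value for a fixed, trivially small discrepancy of 0.01σ from a Normal null as the sample size grows: the same substantively negligible effect moves from utterly non-significant to overwhelmingly “significant” on sample size alone, which is exactly why the error statistician insists on interpreting a P-value with the sample size, and an assessment of the discrepancy indicated, in view.

```python
from math import erfc, sqrt

def p_value_one_sided(delta, sigma, n):
    """One-sided P-value for an observed sample mean delta under
    H0: mu = 0, Normal model with known sigma:
    P(Z >= delta * sqrt(n) / sigma)."""
    z = delta * sqrt(n) / sigma
    return 0.5 * erfc(z / sqrt(2))

# The same tiny discrepancy (0.01 sigma) at increasing sample sizes:
for n in (100, 10_000, 1_000_000):
    print(n, p_value_one_sided(0.01, 1.0, n))
```

With n = 100 the P-value is about 0.46; with n = 1,000,000 it is on the order of 10^-23. Nothing about the substantive size of the effect has changed, only the test’s capability of detecting it.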

My book, *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (SIST 2018, CUP), excavates the most recent variations on all of these howlers. To allege that statistical significance tests don’t use background information is a willful distortion of the tests, which Fisher developed hand-in-hand with a large methodology of experimental design: randomization, predesignation and testing model assumptions. All of these depend on incorporating background information into the specification and interpretation of tests. “The purpose of randomisation,” Fisher made clear, “is to guarantee the validity of the test of significance” (1935). Observational (and other) studies that lack proper controls may well need to concede that any reported P-values are illicit–but then why report P-values at all? (Confidence levels are then equally illicit, except as descriptive measures without error control.) I say they should not report P-values lacking an error-statistical interpretation, at least not without flagging this. But don’t punish studies that work hard to attain error control.

Before you jump on the popular (but misguided) bandwagons of “abandoning statistical significance” or derogating P-values as so-called “purely (blank slate) statistical measures”, ask for evidence supporting the criticisms.[2] You will find they are based on rather blatant misuses and abuses. Only by blocking the credulity with which such apparitions are met these days (in some circles) can we attain improved statistical inferences in Christmases yet to come.

[1] “Error statistical methods” is an umbrella term for methods that employ probability in inference to assess and control the capabilities of methods to avoid mistakes in interpreting data. It includes statistical significance tests, confidence intervals, confidence distributions, randomization, resampling and bootstrapping. A proper subset of error statistical methods are those that use error probabilities to assess and control the *severity* with which claims may be said to have passed a test (with given data). A claim C passes a test with severity to the extent that it has been subjected to and survives a test that probably would have found specified flaws in C, if present. Please see excerpts from SIST 2018.
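For a simple case the severity assessment can be computed directly. The sketch below is my illustration of the idea (not code from SIST): take the one-sided Normal test T+ (H0: μ ≤ 0 vs H1: μ > 0, σ known), and ask, given a statistically significant sample mean x̄, how severely the claim μ > μ1 has passed. The severity is the probability the test would have produced a result less impressive than x̄ were μ only μ1: SEV(μ > μ1) = P(X̄ ≤ x̄; μ = μ1).

```python
from math import erf, sqrt

def phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def severity(xbar, mu1, sigma, n):
    """Severity for the claim mu > mu1, given a significant result xbar
    in test T+ (H0: mu <= 0 vs H1: mu > 0, sigma known):
    SEV(mu > mu1) = P(Xbar <= xbar; mu = mu1)."""
    return phi((xbar - mu1) * sqrt(n) / sigma)

# sigma = 1, n = 100, observed xbar = 0.2 (a 2-standard-error result):
print(round(severity(0.2, 0.0, 1.0, 100), 3))  # mu > 0   passes with high severity (~0.977)
print(round(severity(0.2, 0.1, 1.0, 100), 3))  # mu > 0.1 only moderately (~0.841)
print(round(severity(0.2, 0.2, 1.0, 100), 3))  # mu > 0.2 poorly (0.5)
```

Claims the test had little capacity to probe (e.g., μ > 0.2 here) pass with low severity even though the overall result is statistically significant, which is how severity blocks howlers #2 through #4 above.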

[2] See

- November 4, 2019: On some Self-defeating aspects of the ASA’s 2019 recommendations of statistical significance tests
- November 14, 2019: The ASA’s P-value Project: Why it’s Doing More Harm than Good (cont from 11/4/19)
- November 30, 2019: P-Value Statements and Their Unintended(?) Consequences: The June 2019 ASA President’s Corner (b)

The paper referred to in the post from Christmas past (1) is:

Mayo, D. G. and Spanos, A. (2011), “Error Statistics”, in *Philosophy of Statistics* (*Handbook of the Philosophy of Science*, Volume 7).

When it comes to the statistics wars, leaders of rival tribes sometimes sound as if they believed “les stats, c’est moi”.[1] So, rather than say they would like to supplement some well-known tenets (e.g., “a statistically significant effect may not be substantively important”) with a new rule that advances their particular preferred language or statistical philosophy, they may simply blurt out: “**we take that step here!**” followed by whatever rule of language or statistical philosophy they happen to prefer (as if they have just added the new rule to the existing, uncontested tenets). Karen Kafadar, in her last official (December) report as President of the American Statistical Association (ASA), expresses her determination to call out this problem at the ASA itself. (She raised it first in her June article, discussed in my last post.)

One final challenge, which I hope to address in my final month as ASA president, concerns issues of significance, multiplicity, and reproducibility. In 2016, the ASA published a statement that simply reiterated what p-values are and are not. It did not recommend specific approaches, other than “good statistical practice … principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean.” The guest editors of the March 2019 supplement to *The American Statistician* went further, writing: “The *ASA Statement on P-Values and Statistical Significance* stopped just short of recommending that declarations of ‘statistical significance’ be abandoned. We take that step here. … [I]t is time to stop using the term ‘statistically significant’ entirely.” Many of you have written of instances in which authors and journal editors—and even some ASA members—have mistakenly assumed this editorial represented ASA policy. The mistake is understandable: The editorial was co-authored by an official of the ASA. In fact, the ASA does not endorse any article, by any author, in any journal—even an article written by a member of its own staff in a journal the ASA publishes. (Kafadar, December President’s Corner)

Yet Wasserstein et al. 2019 describes itself as a *continuation* of the ASA 2016 Statement on P-values, which I abbreviate as ASA I. (Wasserstein is the Executive Director of the ASA.) It describes itself as merely recording the decision to “take that step here”, and add one more “don’t” to ASA I. As part of this new “don’t,” it also stipulates that we should not consider “at all” whether pre-designated P-value thresholds are met. (It also restates four of the six principles in ASA I so as to be considerably stronger than those in ASA I. I argue, in fact, the resulting principles are inconsistent with principles 1 and 4 of ASA I. See my post from June 17, 2019.) Since it describes itself as a continuation of the ASA policy in ASA I, and that description survived peer review at the journal TAS, readers presume that’s what it is; absent any disclaimer to the contrary, that conception (or misconception) remains operative.

There really is no other way to read the claim in the Wasserstein et al. March 2019 editorial: “*The ASA Statement on P-Values and Statistical Significance *stopped just short of recommending that declarations of ‘statistical significance’ be abandoned.[2] We take that step here.” Had the authors viewed their follow-up as anything but a continuation of ASA I, they would have said something like: “Our own recommendation is to go *much further* than ASA I. We suggest that all branches of science stop using the term ‘statistically significant’ entirely.” They do not say that. What they say is written from the perspective of “Les stats, c’est moi”.

**The 2019 P-value Project II**

Kafadar deserves a great deal of credit for providing some needed qualification in her December note. However, there needs to be a disclaimer by ASA as regards what it calls its **P-value Project**. The P-value Project, started in 2014, refers to the overall ASA campaign to provide guides for the correct use and interpretation of P-values and statistical significance, and journal editors and societies are asked to consider revising their instructions to authors in light of its guidelines. ASA I was distilled from many meetings and discussions among representatives in statistics. The only difference in today’s P-value Project is that both ASA I *and* the 2019 editorial by Wasserstein et al. are to form the new ASA guidelines–even if the latter is not to be regarded as a continuation of ASA I (in accord with Kafadar’s qualification). I will refer to it as the **2019 ASA P-value Project II**.[3] Wasserstein et al. 2019 is a piece of the P-value Project, and the authors thank the ASA for its support of this Project at the end of the article. [4] [5]

**Of Policies and Working Groups**

Kafadar continues:

Even our own ASA members are asking each other, “What do we tell our collaborators when they ask us what they should do about statistical hypothesis tests and p-values?” Should the ASA have a policy on hypothesis testing or on using “statistical significance”?

Allow me to weigh in here: No, no it should not. At one time I would have said yes, but no more. I can hear the policy now (sounding much like Wasserstein et al. 2019, only written in stone): “Don’t say, never say, or, if you really feel you must say significance and are prepared to thoroughly justify such a ‘thoughtless’ term, then you may only say ‘significance level p’, where p is continuous and never rounded up or cut off, ever. But never, ever use the ‘ant’ ending: signifi*cant*.”

Why can’t the ASA merely provide a bipartisan forum for discussion of the multitude of models, methods, aims, goals, and philosophies of its members? Wasserstein et al. 2019 admits there is no agreement, and that there might never be. Spare us another document whose implication is: we need not test, and cannot falsify claims, even statistically (since that is the consequence of no thresholds). I realize that Kafadar is calling for a serious statement–one that counters the impression of the Wasserstein et al. opinion.

To address these issues, I hope to establish a working group that will prepare a thoughtful and concise piece reflecting “good statistical practice,” without leaving the impression that p-values and hypothesis tests—and, perhaps by extension as many have inferred, statistical methods generally—have no role in “good statistical practice.” … The ASA should develop—and publicize—a properly endorsed statement on these issues that will guide good practice.

Be careful what you wish for. I give major plaudits to Kafadar for pressing hard to see that alternative views are respected, and for countering the popular but terrible arguments of the form: since these methods are misused, they should be banished and replaced with methods advocated by group Z (even if the credentials of Z’s methods haven’t been scrutinized!). We have already seen in 2019 the extensive politicization and sensationalizing of bandwagons in statistics. (See my editorial P-value Thresholds: Forfeit at your Peril.) The average ASA member, who doesn’t happen to be a thought leader or member of a politically correct statistical-philosophical tribe, is in great danger of being muffled entirely. There’s already a loss of trust. We already know, under the motto that “a crisis should never be wasted”, that many leaders of statistical tribes view the crisis of replication as an opportunity to sell alternative methods they have long been promoting. Rather than the properly endorsed, truly representative statement that Kafadar seeks, we may get dictates from those who are quite convinced that they know best: “les stats, c’est moi”.

**APPENDIX. How a Working Group on P-values and Significance Testing Could Work**

I see one way that a working group could actually work. The 2016 ASA statement, ASA I, included a principle, #4, that you don’t hear about in the 2019 follow-up: “P-values and related statistics” cannot be correctly interpreted without knowing how many hypotheses were tested, and how data were specified and results selected for inference. Notice the qualification “and related statistics”. The presumption is that some methods don’t require that information! That information is necessary only if one is out to control the error probabilities associated with an inference.

Here’s my idea: Have the group consist of those who work in areas where statistical inferences depend on controlling error probabilities (I call such methods *error statistical*). They would be involved in current uses and developments of statistical significance testing and the much larger (frequentist) error statistical methodology within which it forms just a part. They would be familiar with, and some would be involved in developing, the latest error statistical tools, including tests and confidence distributions, P-values with high-dimensional data, current problems of adjusting for multiple testing, and of testing statistical model assumptions; and they would be conversant with different aspects of comparative statistical methods (Bayesian and error statistical). They would present their findings and recommendations, and responses would be sought.

The need for the kind of forum I’m envisioning is so pressing that it should not be contingent on being created by any outside association. It should emerge spontaneously in 2020. *We take that step here.*

*Please share your thoughts in the comments.*

[1] This is a pun on “l’état, c’est moi” (“I am the state”, Louis XIV*.) I thank Glenn Shafer for the appropriate French spelling for my pun. (*Thanks to S. Senn for noticing I was missing the X in Louis XIV.)

[2] They are referring to the last section of ASA I on “other measures of evidence”. Indeed, that section suggests an endorsement of an assortment of alternative measures of evidence including Bayes factors, likelihood ratios and others. There is no attention to whether any of these methods accomplish the key task of the statistical significance test–to distinguish genuine from spurious effects. For a fuller explanation of this last section, please see my posts from June 17, 2019 and November 14, 2019. And, obviously, check the last section of ASA I.

Shortly after the 2019 editorial appeared, I queried Wasserstein as to the relationship between it and ASA I. It was never clarified. I hope now that it will be. At the same time I informed him of what appeared to me to be slips in expressing principles of ASA I, and I offered friendly amendments (see my post from June 17, 2019).

[3] If you’re giving the history of statistics, you can speak of those bad, bad men–dichotomaniacs, Neyman and Pearson–who, following Fisher, divided results into significant and non-significant discrepancies (introduced the alternative hypothesis, type I and II errors, power and optimal tests) and thereby tried to reduce all of statistics to acceptance sampling, engineering, and 5-year plans in Russia–as Fisher (1955) himself said (after the professional break with Neyman in 1935). Never mind that Neyman developed confidence intervals at the same time, 1930. For a full discussion of the history of the Fisher-Neyman (and related) wars, please see my *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (CUP, 2018).

[4] I was just sent this podcast and interview of Ron Wasserstein, so I’m adding it as a footnote. There, Wasserstein et al. 2019 is clearly described as the ASA’s “further guidance”, and Wasserstein takes no exception to it. The interviewer says:

“But it would seem as though Ron’s work has only just begun. The ASA has just published further guidance in the most recent edition of The American Statistician, which is open access and written for non-statisticians. The guidance is intended to go further and argues for an end to the concept of statistical significance and towards a model which the ASA have coined their ATOM Principle: Accept uncertainty, Thoughtful, Open and Modest.”

[5] Nathan Schachtman, in a new post just added to his law blog on this very topic, displays a letter from the ASA acknowledging that a journal has revised its guidelines taking into account *both* ASA I and the 2019 Wasserstein et al. editorial. I had seen this letter in relation to the NEJM, but it’s hard to know what to make of it. I haven’t seen letters acknowledging other journals, and there have been around 7 at this point. I may just be out of the loop.

**Selected blog posts on ASA I and the Wasserstein et al. 2019 editorial:**

- March 25, 2019: “Diary for Statistical War Correspondents on the Latest Ban on Speech.”
- June 17, 2019: “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)
- July 19, 2019: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)
- September 19, 2019: (Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access). The article by Hardwicke and Ioannidis (2019), and the editorials by Gelman and by me are linked on this post.
- November 4, 2019: On some Self-defeating aspects of the ASA’s 2019 recommendations of statistical significance tests
- November 14, 2019: The ASA’s P-value Project: Why it’s Doing More Harm than Good (cont from 11/4/19)
- November 30, 2019: P-Value Statements and Their Unintended(?) Consequences: The June 2019 ASA President’s Corner (b)
