The latest salvo in the statistics wars comes in the form of the publication of the statement of The ASA Task Force on Statistical Significance and Replicability, appointed by past ASA president Karen Kafadar in November/December 2019. (In the ‘before times’!) Its members are:

Linda Young (Co-Chair), Xuming He (Co-Chair), Yoav Benjamini, Dick De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xiao-Li Meng, Vijay Nair, Nancy Reid, Stephen Stigler, Stephen Vardeman, Chris Wikle, Tommy Wright, and Karen Kafadar (Ex-officio). (Kafadar 2020)

The full report of this Task Force is in *The Annals of Applied Statistics*, and on my blogpost. It begins:

In 2019 the President of the American Statistical Association (ASA) established a task force to address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.) This document is the statement of the task force… (Benjamini et al. 2021)

On Monday, August 2, the National Institute of Statistical Sciences (NISS) will hold a public discussion whose focus is this report, and several of its members will be there. (See the announcement at the end of this post.)

Kafadar, and the members of this task force, deserve a lot of credit for defying the popular movement to “abandon statistical significance” by unanimously declaring “that the use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned”…

• P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature.

• They are important tools that have advanced science through their proper application. …

• P-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data.

(Benjamini et al. 2021)

If you follow this blog, you know that I have often discussed the 2019 editorial in The American Statistician to which this Task Force report refers: Wasserstein, Schirm and Lazar (2019), hereafter WSL 2019 (see blog links below). But now I’m inviting you to share your views on any aspects of the overall episode (ASAgate?) for posting on this blog. (Send them by August 31, 2021; info in Note [1].) I’d like to put together a blogpost with multiple authors, and multiple perspectives, soon after. For background see this post. (For even more background, see the links at the end of this post.)

I first assumed WSL 2019 was a continuation of the 2016 ASA Statement on P-values, especially given how it is written. According to WSL 2019, the 2016 ASA Statement had “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned”, and they announce: “We take that step here….‘statistically significant’—don’t say it and don’t use it”. The use of p-value thresholds to distinguish data that do and do not indicate various discrepancies from a test hypothesis is also verboten.[2] In fact, it rejects any number of classifications of data: “[T]he problem is not that of having only two labels. Results should not be trichotomized, or indeed categorized into any number of groups…” (WSL 2019, p. 2).

To many, including then ASA president Karen Kafadar (2019), the position in WSL 2019 challenges the overall use of hypothesis tests even though it does not ban P-values:

Many of you have written of instances in which authors and journal editors—and even some ASA members—have mistakenly assumed this editorial represented ASA policy. The mistake is understandable: The editorial was co-authored by an official of the ASA.

Even our own ASA members are asking each other, “What do we tell our collaborators when they ask us what they should do about statistical hypothesis tests and p-values?”

So the ASA Board created the new Task Force in November 2019 “with a charge to develop thoughtful principles and practices that the ASA can endorse and share with scientists and journal editors.” (AMSTATNEWS 1 February 2020).

Several of its members will be at Monday’s NISS meeting. A panel session I organized at the 2020 JSM (P-values and ‘Statistical Significance’: Deconstructing the Arguments) grew out of this episode (my contribution in the proceedings is here).

The Task Force worked quickly, despite the pandemic, giving its recommendations to the ASA Board early, in time for the Joint Statistical Meetings (JSM) at the end of July 2020. But the ASA didn’t “endorse and share” the Task Force’s recommendations, and for months the document was in limbo, turned down for publication in numerous venues, until recently finding a home in *The Annals of Applied Statistics*. So, it is finally out. What does it say aside from what I have quoted above? I’m guessing that because the statements were unanimous, they couldn’t go much beyond some rather unobjectionable claims. It’s quite short, and there’s also an editorial by Kafadar (editor-in-chief of the journal) in the issue.

I imagine a statistical significance tester raising these objections to the task force report:

1. It does not tell us why, properly used, p-values increase rigor—namely, by enabling error probability control.

2. It implicitly seems to accept the view that using thresholds means viewing test outcomes as leading directly to decisions or actions (as in the behavioristic interpretation of Neyman-Pearson tests), rather than as part of an appraisal of evidence.

In fact any time you test a claim (or compute power) you are implicitly using a threshold. It needs to be specified, in advance, that not all outcomes will be allowed to be taken as evidence for a given claim.

3. It doesn’t tell us what’s meant by “properly used”.

While this might be assumed to be uncontroversial nowadays, you will sometimes hear critics aver ‘of course the tests are fine if properly used’, but then in the next breath suggest that this requires p-values to agree with quantities measuring very different things (since ‘that’s what people want’). Even the meaning of “abandon” statistical significance has become highly equivocal (e.g., McShane et al. 2019).

4. Others? (Use the comments, or put them in your guest blog contributions).
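The point in the second objection above, that any power computation presupposes a threshold, can be made concrete. Here is a minimal sketch of my own (the test, the numbers, and the function name are illustrative assumptions, not anything from the Task Force report): the power of a one-sided z-test against a discrepancy δ is simply undefined until a cutoff like z = 1.96 is fixed.

```python
from math import erf, sqrt

def std_normal_cdf(z: float) -> float:
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power_one_sided_z(delta: float, n: int, sigma: float, z_cutoff: float) -> float:
    """Power of a one-sided z-test of H0: mu = 0 vs mu > 0 against the
    alternative mu = delta, with known sigma and pre-fixed cutoff z_cutoff.
    Power = Pr(Z > z_cutoff - delta*sqrt(n)/sigma).
    Note the cutoff is an argument: no threshold, no power."""
    shift = delta * sqrt(n) / sigma
    return 1.0 - std_normal_cdf(z_cutoff - shift)

# Hypothetical numbers: n = 100, sigma = 1, cutoff z_.025 = 1.96, discrepancy 0.3.
print(round(power_one_sided_z(0.3, 100, 1.0, 1.96), 3))
```

Changing the cutoff changes the power, which is the sense in which even critics who "compute power" are implicitly relying on a threshold.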

On the meta-level, of course, she would be concerned about the communication breakdown suggested by the very fact that the ASA Board felt the need to appoint a Task Force to dispel the supposition that a position advanced by its Executive Director reflects the views of the ASA itself.[3] Still, in today’s climate of anti-statistical significance test fervor, the Task Force’s pressing on to find a home for their report when the ASA declined to make it public is impressive, if not heroic. We need more of that type of independence if scientific integrity is to be restored. The Task Force deserves further accolades for sparing us, for once, a rehearsal of the well-known howlers of abuses of tests that have long been lampooned.

Here’s the announcement of the NISS Program for August 2, 2021 (5pm ET)

NISS Affiliates liaisons and representatives representing academia, industry and government institutions traditionally meet over lunch at JSM to catch up with one another and hear from speakers on a topic of current interest. This year (even though this event takes place in the evening), the featured ‘luncheon’ speaker will be Karen Kafadar, the 2019 ASA President, from the University of Virginia. Karen initiated the ASA Task Force on Statistical Significance and Replicability during her presidential year; it was convened to address issues surrounding the use of p-values and statistical significance, as well as their connection to replicability. Xuming He from the University of Michigan and Linda Young from NASS, who both served as co-chairs of this Task Force, will summarize the discussion leading to the final report, and invite other task force members to join the discussion.

Luncheon Speakers: Karen Kafadar – 2019 ASA President (University of Virginia), Xuming He – Task Force Co-chair (University of Michigan), Linda Young – Task Force Co-chair (NASS), Stephen Stigler (University of Chicago), Nancy Reid (University of Toronto), and Yoav Benjamini (Tel Aviv University).

This is a Special Event that is open to the public. You need not be a member of a NISS Affiliate Institution to attend the event this year. Invite your colleagues!

**NOTES**

[1] They can be as long as you think apt. We can always post part of your commentary and link to the remainder (write me with questions). All who have their guest post included will receive a free copy of my book: *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP).

[2] Since the clarification was not made public until December 2019, an editorial I wrote, “P-value Thresholds: Forfeit at Your Peril”, mistakenly referred to WSL 2019 as ASA II: https://errorstatistics.files.wordpress.com/2019/11/mayo-2019-forfeit-own-peril-european_journal_of_clinical_investigation-2.pdf

[3] It would seem that a public disclaimer by the authors sent around to members and journals would have avoided this. Kafadar had indicated early on that the Task Force recommendations would include a call for a “Disclaimer on all publications, articles, editorials, … authored by ASA Staff (e.g., as required for U.S. Govt employees)”.

At any rate, this was in Kafadar’s slide presentation at our JSM forum. Perhaps at Monday’s forum someone will ask: Why was that sensible recommendation deleted from the final report?

**REFERENCES:**

Benjamini, Y., De Veaux, R., Efron, B., et al. (2021). The ASA President’s task force statement on statistical significance and replicability. *The Annals of Applied Statistics*. (Online June 20, 2021.)

Kafadar, K. (2019). “The Year in Review … And More to Come”. *Amstat News* (Dec. 2019).

Kafadar, K. (2020). “Task Force on Statistical Significance and Replicability”. *ASA Amstat Blog* (Feb. 1, 2020).

Kafadar, K. (2021). “Editorial: Statistical Significance, *P*-Values, and Replicability”. *The Annals of Applied Statistics*.

Mayo, D. G. (2020). “Rejecting Statistical Significance Tests: Defanging the Arguments”. In *JSM Proceedings*, Statistical Consulting Section. Alexandria, VA: American Statistical Association, 236-256.

Mayo, D. G. (2019). “P-value Thresholds: Forfeit at Your Peril”. *European Journal of Clinical Investigation* 49(10). EJCI-2019-0447.

McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). Abandon statistical significance. *American Statistician*, *73*, 235–245.

Wasserstein, R. & Lazar, N. (2016). “The ASA’s Statement on p-Values: Context, Process, and Purpose”. *The American Statistician* 70(2), 129-133; see “The American Statistical Association’s Statement on and of Significance” (March 17, 2016).

Wasserstein, R., Schirm, A., & Lazar, N. (2019). Moving to a world beyond “*p* < 0.05” (Editorial). *The American Statistician* 73(S1), 1–19. https://doi.org/10.1080/00031305.2019.1583913

**(SELECTED) BLOGPOSTS ON WSL 2019 FROM ERRORSTATISTICS.COM:**

March 25, 2019: “Diary for Statistical War Correspondents on the Latest Ban on Speech.”

June 17, 2019: “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)

July 19, 2019: “The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)”

September 19, 2019: “(Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access).” The article by Hardwicke and Ioannidis (2019), and the editorials by Gelman and by me are linked on this post.

November 4, 2019: “On some Self-defeating aspects of the ASA’s 2019 recommendations of statistical significance tests”

November 14, 2019: “The ASA’s P-value Project: Why it’s Doing More Harm than Good (cont from 11/4/19)”

November 30, 2019: “P-Value Statements and Their Unintended(?) Consequences: The June 2019 ASA President’s Corner (b)”

December 13, 2019: “’Les stats, c’est moi’: We take that step here! (Adopt our fav word or phil stat!)(iii)”

August 4, 2020: “August 6: JSM 2020 Panel on P-values & ‘Statistical significance’”

October 16, 2020: “The P-Values Debate”

December 13, 2020: “The Statistics Debate (NISS) in Transcript Form”

January 9, 2021: “Why hasn’t the ASA Board revealed the recommendations of its new task force on statistical significance and replicability?”

June 20, 2021: “At Long Last! The ASA President’s Task Force Statement on Statistical Significance and Replicability”

June 28, 2021: “Statisticians Rise Up To Defend (error statistical) Hypothesis Testing”

Kafadar, K. (2020) JSM slides.


*I’m reblogging two of my Higgs posts at the 9th anniversary of the 2012 discovery. (The first was in this post.) The following was originally “Higgs Analysis and Statistical Flukes: part 2” (from March 2013).[1]*

Some people say to me: “severe testing is fine for ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning, at least when we’re trying to find things out.[2] Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees of support/belief/plausibility to propositions, models, or theories.

The Higgs discussion finds its way into Tour III in Excursion 3 of my *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP). You can read it (in proof form) here, pp. 202-217, in a section with the provocative title:

3.8 The Probability Our Results Are Statistical Fluctuations: Higgs’ Discovery

**“Higgs Analysis and Statistical Flukes: part 2”**

Everyone was excited when the Higgs boson results were reported on July 4, 2012, indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of *excess events* of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post (part 1). It is an outsider’s angle on one small aspect of the statistical inferences involved. But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we [philosophers of statistics] should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels.

Here I keep close to an official report from ATLAS, in which researchers define a “global signal strength” parameter μ “such that μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the SM Higgs boson signal in addition to the background” (where SM is the Standard Model). The statistical test may be framed as a one-sided test, where the test statistic (which is actually a ratio) records differences in the positive direction, in standard deviation (sigma) units. Reports such as

Pr(Test T would yield at least a 5 sigma excess; H₀: background only) = extremely low

are deduced from the sampling distribution of the test statistic, fortified with much cross-checking of results (e.g., by modeling and simulating relative frequencies of observed excesses generated with “Higgs signal +background” compared to background alone). The inferences, even the formal statistical ones, go beyond p-value reports. For instance, they involve setting lower and upper bounds such that values excluded are ruled out with high severity, to use my term. But the popular report is in terms of the observed 5 sigma excess in an overall test T, and that is mainly what I want to consider here.
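The kind of cross-checking simulation mentioned above can be sketched in a toy counting experiment. This is my own minimal illustration under made-up numbers (the background mean, signal size, and trial counts below are assumptions, not ATLAS values): generate event counts under “background alone” and under “signal + background”, and compare how often each produces an excess of at least 5 sigma over the expected background.

```python
import math
import random

def poisson_sample(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) variate (Knuth's method; fine for modest lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

def excess_sigma(count: int, background: float) -> float:
    """Approximate significance of an observed excess over expected background."""
    return (count - background) / math.sqrt(background)

rng = random.Random(0)
b, s, trials = 100.0, 60.0, 10_000  # hypothetical background mean and signal size

bg_only = sum(excess_sigma(poisson_sample(b, rng), b) >= 5 for _ in range(trials))
sig_bg = sum(excess_sigma(poisson_sample(b + s, rng), b) >= 5 for _ in range(trials))

print(bg_only / trials)  # ~0: a 5 sigma excess from background alone is very rare
print(sig_bg / trials)   # a large fraction: signal+background routinely exceeds 5 sigma
```

The contrast in relative frequencies, not any single number, is what the cross-checks fortify: so large an excess almost never arises from background alone, but arises routinely when a genuine signal is present.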

*Error probabilities*

In a Neyman-Pearson setting, a cut-off cα is chosen pre-data so that the probability of a type I error is low. In general,

Pr(d(X) > cα; H₀) ≤ α

and in particular, alluding to an overall test T:

(1) Pr(Test T yields d(X) > 5 standard deviations; H₀) ≤ .0000003.

The test at the same time is designed to ensure a reasonably high probability of detecting global strength discrepancies of interest. (I always use “discrepancy” to refer to parameter magnitudes, to avoid confusion with observed differences).
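The bound in (1) can be checked numerically. Assuming d(X) is (approximately) standard normal under H₀, the one-sided 5 sigma tail area is 1 − Φ(5), computable with the complementary error function:

```python
from math import erfc, sqrt

def upper_tail(z: float) -> float:
    """One-sided upper tail area of the standard normal: Pr(Z > z)."""
    return 0.5 * erfc(z / sqrt(2.0))

# The 5 sigma tail area: about 2.87e-7, i.e., roughly the .0000003 in (1).
print(upper_tail(5.0))
```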

[Notice these are not likelihoods.] Alternatively, researchers can report observed standard deviations (here, the sigmas), or equivalently, the associated observed statistical significance probability, p₀. In general,

Pr(P < p₀; H₀) < p₀

and in particular,

(2) Pr(Test T yields P < .0000003; H₀) < .0000003.

For test T to yield a “worse fit” with H₀ (smaller p-value) due to background alone is sometimes called “a statistical fluke” or a “random fluke”, and the probability of so statistically significant a random fluke is ~0. With the March 2013 results, the 5 sigma difference has grown to 7 sigmas.

So probabilistic statements along the lines of (1) and (2) are standard. They allude to sampling distributions, either of the test statistic *d*(X), or of the p-value viewed as a random variable. They are scarcely illicit or prohibited. (I return to this in the last section of this post.)
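The claim in (2), that under H₀ the p-value (viewed as a random variable) falls below p₀ with probability at most p₀, can itself be checked by simulation. This is my own illustrative sketch in a simple one-sided normal test, not anything from the ATLAS analysis:

```python
import random
from math import erfc, sqrt

def p_value_one_sided(z: float) -> float:
    """Observed significance level for a one-sided z-test: Pr(Z > z; H0)."""
    return 0.5 * erfc(z / sqrt(2.0))

rng = random.Random(1)
trials = 100_000
p0 = 0.05

# Simulate the test statistic under H0 (standard normal), convert each draw to
# its p-value, and estimate Pr(P < p0; H0): it should be close to (at most) p0.
hits = sum(p_value_one_sided(rng.gauss(0.0, 1.0)) < p0 for _ in range(trials))
print(hits / trials)  # close to 0.05
```

That the p-value has (at most) a uniform distribution under H₀ is exactly what licenses statements like (1) and (2) about the test's error probabilities.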

*An implicit principle of inference or evidence*

Admittedly, the move to taking the 5 sigma effect as evidence for a genuine effect (of the Higgs-like sort) results from an implicit principle of evidence that I have been calling the severity principle (SEV). Perhaps the weakest form applies to a statistical rejection or falsification of the null. (I will deliberately use a few different variations on statements that can be made.)

Data x₀ from a test T provide evidence for rejecting H₀ (just) to the extent that H₀ would (very probably) have survived, were it a reasonably adequate description of the process generating the data (with respect to the question).

It is also captured by a general frequentist principle of evidence (FEV) (Mayo and Cox 2010) and a variant on the general idea of severity (SEV) (EGEK 1996, Mayo and Spanos 2006).[3]

The sampling distribution is computed under the assumption that the production of observed results is similar to the “background alone”, with respect to relative frequencies of signal-like events. (Likewise for computations under hypothesized discrepancies.) The relationship between H₀ and the probabilities of outcomes is an intimate one: the various statistical nulls refer to aspects of general types of data generating procedures (for a taxonomy, see Cox 1958, 1977). “H₀ is true” is a shorthand for a very long statement that…

*Severity and the detachment of inferences*

The sampling distributions serve to give counterfactuals. In this case, they tell us what it would be like, statistically, were the mechanism generating the observed signals similar to H₀.[i] While one would want to go on to consider the probability that test T yields so statistically significant an excess under various alternatives to μ = 0, this suffices for the present discussion. Sampling distributions can be used to arrive at error probabilities that are relevant for understanding the capabilities of the test process, in relation to something we want to find out. *Since a relevant test statistic is a function of the data and quantities about which we want to learn, the associated sampling distribution is the key to inference*. (This is why the bootstrap, and other types of re-sampling, work when one has a random sample from the process or population of interest.)

The *severity principle*, put more generally:

Data from a test T [ii] provide good evidence for inferring H (just) to the extent that H passes severely with x₀, i.e., to the extent that H would (very probably) not have survived the test so well were H false.

(The severity principle can also be made out just in terms of relative frequencies, as with bootstrap re-sampling.) In this case, what is surviving is minimally the non-null. Regardless of the specification of a statistical inference, to assess the severity associated with a claim *H* requires considering *H*’s denial: together they exhaust the answers to a given question.
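In a simple one-sample normal test (H₀: μ ≤ 0 vs μ > 0, with σ known), a severity assessment can be computed explicitly: SEV(μ > μ₁) = Pr(X̄ < x̄_obs; μ = μ₁). The sketch below is my own toy computation under those assumptions (the numbers are hypothetical, and nothing here is code from the Higgs analysis):

```python
from math import erf, sqrt

def std_normal_cdf(z: float) -> float:
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def severity_mu_greater(x_bar: float, mu1: float, sigma: float, n: int) -> float:
    """Severity for the claim mu > mu1, given observed mean x_bar:
    SEV(mu > mu1) = Pr(X-bar < x_bar; mu = mu1).
    High severity means: were mu <= mu1, so large an observed mean
    would very probably not have occurred."""
    se = sigma / sqrt(n)
    return std_normal_cdf((x_bar - mu1) / se)

# Hypothetical numbers: sigma = 1, n = 100, observed mean 0.4 (a 4 sigma result).
# Severity is high for modest discrepancies and drops as mu1 approaches x_bar.
for mu1 in (0.0, 0.2, 0.3, 0.39):
    print(mu1, round(severity_mu_greater(0.4, mu1, 1.0, 100), 3))
```

Note how the same data warrant "μ > 0" with severity near 1 but barely probe "μ > 0.39": the claim inferred is qualified by how well it could have been found false.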

Without making such a principle explicit, some critics assume the argument is all about the reported p-value. The inference actually **detached** from the evidence can be put in any number of ways, and no uniformity is to be expected or needed:

(3) There is strong evidence for H: a Higgs (or a Higgs-like) particle.

(3)’ They have experimentally demonstrated H: a Higgs (or Higgs-like) particle.

Or just, infer H.

Doubtless particle physicists would qualify these statements, but nothing turns on that. ((3) and (3)’ are a bit stronger than merely falsifying the null because certain properties of the particle must be shown. I leave this to one side.)

As always, the mere p-value is a pale reflection of the detailed information about the consistency of results that really fortifies the knowledge of a genuine effect. Nor is the precise improbability level what matters. We care about the inferences to real effects (and estimated discrepancies) that are warranted.

*Qualifying claims by how well they have been probed*

The inference is qualified by the statistical properties of the test, as in (1) and (2), but that does not prevent detaching (3). This much is shown: they are able to experimentally demonstrate the Higgs particle. They can take that much of the problem as solved and move on to other problems of discerning the properties of the particle, and much else that goes beyond our discussion. There is obeisance to the strict fallibility of every empirical claim, but there is no probability assigned. Neither is there in day-to-day reasoning, nor in the bulk of scientific inferences, which are not formally statistical. Having inferred (3), granted, one may say informally, “so probably we have experimentally demonstrated the Higgs”, or “probably, the Higgs exists” (?). Or an informal use of “likely” might arise. But whatever these might mean in informal parlance, they are not formal mathematical probabilities. (As often argued on this blog, discussions on statistical philosophy must not confuse these.)

[We can, however, write SEV(H) ~ 1.]

The claim in (3) is approximate and limited–as are the vast majority of claims of empirical knowledge and inference–and, moreover, we can say in just what ways. It is recognized that subsequent data will add precision to the magnitudes estimated, and may eventually lead to new and even entirely revised interpretations of the known experimental effects, models and estimates. That is what cumulative knowledge is about. (I sometimes hear people assert, without argument, that modeled quantities, or parameters, used to describe data generating processes are “things in themselves” and are outside the realm of empirical inquiry. This is silly. Else we’d be reduced to knowing only tautologies and maybe isolated instances as to how “I seem to feel now,” attained through introspection.)

*Telling what’s true about significance levels*

So we grant the critic that something like the severity principle is needed to move from statistical information plus background (theoretical and empirical) to inferences about evidence and inference (and to what levels of approximation). It may be called lots of other things and framed in different ways, and the reader is free to experiment. What we should not grant the critic is any allegation that there should be, or invariably is, a link from a small observed significance level to a small posterior probability assignment to H₀. Worse, (1 − the p-value) is sometimes alleged to be the posterior probability accorded to the Standard Model itself! This is neither licensed nor wanted!

If critics (or the p-value police, as Wasserman called them) maintain that Higgs researchers are misinterpreting their significance levels, correct them with the probabilities in (1) and (2). If they say it is patently obvious that Higgs researchers want to use the p-value as a posterior probability assignment to H₀, point out the more relevant and actually attainable [iii] inference that is detached in (3). If they persist that what is really, really wanted is a posterior probability assignment to the inference about the Higgs in (3), ask why? As a formal posterior probability it would require a prior probability on all hypotheses that could explain the data. That would include not just…

Degrees of belief will not do. Many scientists perhaps had (and have) strong beliefs in the Standard Model before the big collider experiments—given its perfect predictive success. Others may believe (and fervently wish) that it will break down somewhere (showing supersymmetry or whatnot); a major goal of inquiry is learning about viable rivals and how they may be triggered and probed. Research requires an open world, not a closed one with all possibilities trotted out and weighed by current beliefs. [v] We need to point up what has not yet been well probed which, by the way, is very different from saying of a theory that it is “not yet probable”.

*Those prohibited phrases*

One may wish to return to some of the condemned phrases of particular physics reports. Take,

“There is less than a one in a million chance that their results are a statistical fluke”.

This is not to assign a probability to the null, just one of many ways (perhaps not the best) of putting claims about the sampling distribution: The statistical null asserts that H₀: background alone adequately describes the process.

H₀ does not assert the results are a statistical fluke, but it tells us what we need to determine the probability of observed results “under H₀”. In particular, consider all outcomes in the sample space that are further from the null prediction than the observed, in terms of p-values: {x: p < p₀}.

I am repeating myself, I realize, in the hopes that at least one phrasing will drive the point home. Nor is it even the improbability that substantiates this; it is the fact that an extraordinary set of coincidences would have to have occurred again and again. To nevertheless retain H₀ as the source of the data would block learning. (Moreover, they know that if some horrible systematic mistake had been made, it would be detected in later data analyses.)

I will not deny that there have been misinterpretations of p-values, but if a researcher has just described performing a statistical significance test, it would be “ungenerous” to twist probabilistic assertions into posterior probabilities. It would be a kind of “confirmation bias” whereby one insists on finding one sentence among very many that could conceivably be misinterpreted Bayesianly.

*Triggering, indicating, inferring*

As always, the error statistical philosopher would distinguish different questions at multiple stages of the inquiry. The aim of many preliminary steps is “behavioristic” and performance oriented: the goal being to control error rates on the way toward finding excess events or bumps of interest.

If interested: *See statistical flukes (part 3)*

The original posts of parts 1 and 2 had around 30 comments each; you might want to look at them:

Part 1: https://errorstatistics.com/2013/03/17/update-on-higgs-data-analysis-statistical-flukes-1/

Part 2 https://errorstatistics.com/2013/03/27/higgs-analysis-and-statistical-flukes-part-2/

*Fisher insisted that to assert a phenomenon is experimentally demonstrable:* “[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher, *Design of Experiments*, 1947, p. 14)

2018/2015/2014 Notes

[0] Physicists manage to learn quite a lot from negative results. They’d love to find something more exotic, but the negative results will not go away. A recent article from CERN, “We need to talk about the Higgs”, says: “While there are valid reasons to feel less than delighted by the null results of searches for physics beyond the Standard Model (SM), this does not justify a mood of despondency.”

“Physicists aren’t just praying for hints of new physics, Strassler stresses. He says there is very good reason to believe that the LHC should find new particles. For one, the mass of the Higgs boson, about 125.09 billion electron volts, seems precariously low if the census of particles is truly complete. Various calculations based on theory dictate that the Higgs mass should be comparable to a figure called the Planck mass, which is about 17 orders of magnitude higher than the boson’s measured heft.” The article is here.

[1] My presentation at a Symposium on the Higgs discovery at the Philosophy of Science Association (Nov. 2014) is here.

[2] I have often noted that there are other times where we are trying to find evidence to support a previously held position.

[3] Aspects of the statistical controversy in the Higgs episode occur in *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (Mayo 2018).

___________

Original notes:

[i] This is a bit stronger than merely falsifying the null here, because certain features of the particle discerned must also be shown. I leave details to one side.

[ii] Which almost always refers to a set of tests, not just one.

[iii] I sense that some Bayesians imagine P(*H*) is more “hedged” than to actually infer (3). But the relevant hedging, the type we can actually attain, is given by an assessment of severity or corroboration or the like. Background enters via a repertoire of information about experimental designs, data analytic techniques, mistakes and flaws to be wary of, and a host of theories and indications about which aspects have/have not been severely probed. Many background claims enter to substantiate the error probabilities; others do not alter them.

[iv] In aspects of the modeling, researchers make use of known relative frequencies of events (e.g., rates of types of collisions) that lead to legitimate, empirically based, frequentist “priors” if one wants to call them that.

[v] After sending out the letter prompted by Lindley, O’Hagan wrote up a synthesis https://errorstatistics.com/2012/08/25/did-higgs-physicists-miss-an-opportunity-by-not-consulting-more-with-statisticians/

REFERENCES (from March, 2013 post):

ATLAS Collaboration (November 14, 2012), Atlas Note: “Updated ATLAS results on the signal strength of the Higgs-like boson for decays into WW and heavy fermion final states”, ATLAS-CONF-2012-162. http://cds.cern.ch/record/1494183/files/ATLAS-CONF-2012-162.pdf

Cox, D.R. (1958), “Some Problems Connected with Statistical Inference,” *Annals of Mathematical Statistics*, 29: 357–72.

Cox, D.R. (1977), “The Role of Significance Tests (with Discussion),” *Scandinavian Journal of Statistics*, 4: 49–70.

Mayo, D.G. (1996), *Error and the Growth of Experimental Knowledge*, University of Chicago Press, Chicago.

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference”, in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D. Mayo and A. Spanos, eds.), Cambridge: Cambridge University Press, 247-275.

Mayo, D.G., and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction”, *British Journal for the Philosophy of Science*, 57: 323–357.

What message is conveyed when the board of a professional association X appoints a Task Force intended to dispel the supposition that a position advanced by the Executive Director of association X reflects the views of association X, on a topic that members of X disagree about? What it says to me is that there is a serious breakdown of communication between the leadership and membership of that association. So while I’m extremely glad that the ASA appointed the Task Force on Statistical Significance and Replicability in 2019, I’m very sorry that the main reason it was needed was to address concerns that an editorial put forward by the ASA Executive Director (and 2 others) “might be mistakenly interpreted as official ASA policy”. The 2021 Statement of the Task Force (Benjamini et al. 2021) explains:

In 2019 the President of the American Statistical Association (ASA) established a task force to address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.) This document is the statement of the task force…

It’s also too bad that the statement was blocked for nearly a year and wasn’t shared by the ASA. In contrast to the 2019 editorial, the Task Force on Statistical Significance and Replicability writes that “P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned”. The full statement is in *The Annals of Applied Statistics*, and on my blogpost. It is very welcome that leading statisticians rose up to block the attitude I describe in this post as Les Stats C’est Moi, which diminishes the inclusivity of the variety of methodologies and philosophies among ASA members. Where were *Nature*, *Science* and other venues when they had their shot at an article, “Scientists Rise Up in Favor of (Error Statistical) Hypothesis Testing”? Nowhere to be found.

An excellent overview is given by Nathan Schachtman on his law blog here:

A Proclamation from the Task Force on Statistical Significance (June 21, 2021)

The American Statistical Association (ASA) has finally spoken up about statistical significance testing.[1] Sort of.

Back in February of this year, I wrote about the simmering controversy over statistical significance at the ASA.[2] Back in 2016, the ASA issued its guidance paper on p-values and statistical significance, which sought to correct misinterpretations and misrepresentations of “statistical significance.”[3] Lawsuit industry lawyers seized upon the ASA statement to proclaim a new freedom from having to exclude random error.[4] To obtain their ends, however, the plaintiffs’ bar had to distort the ASA guidance in statistically significant ways.

To add to the confusion, in 2019, the ASA Executive Director published an editorial that called for an end to statistical significance testing.[5] Because the editorial lacked disclaimers about whether or not it represented official ASA positions, scientists, statisticians, and lawyers on all sides were fooled into thinking the ASA had gone whole hog.[6] Then ASA President Karen Kafadar stepped into the breach to explain that the Executive Director was not speaking for the ASA.[7]

In November 2019, members of the ASA board of directors (BOD) approved a motion to create a “Task Force on Statistical Significance and Replicability.”[8] Its charge was

“to develop thoughtful principles and practices that the ASA can endorse and share with scientists and journal editors. The task force will be appointed by the ASA President with advice and participation from the ASA BOD. The task force will report to the ASA BOD by November 2020.”

The members of the Task Force identified in the motion were:

Linda Young (Nat’l Agricultural Statistics Service & Univ. of Florida; Co-Chair)

Xuming He (Univ. Michigan; Co-Chair)

Yoav Benjamini (Tel Aviv Univ.)

Dick De Veaux (Williams College; ASA Vice President)

Bradley Efron (Stanford Univ.)

Scott Evans (George Washington Univ.; ASA Publications Representative)

Mark Glickman (Harvard Univ.; ASA Section Representative)

Barry Graubard (Nat’l Cancer Instit.)

Xiao-Li Meng (Harvard Univ.)

Vijay Nair (Wells Fargo & Univ. Michigan)

Nancy Reid (Univ. Toronto)

Stephen Stigler (Univ. Chicago)

Stephen Vardeman (Iowa State Univ.)

Chris Wikle (Univ. Missouri)

Tommy Wright (U.S. Census Bureau)

[T]he Task Force arrived at its recommendations, but curiously, its report did not find a home in an ASA publication. Instead, “The ASA President’s Task Force Statement on Statistical Significance and Replicability” has now appeared as an “in press” publication at *The Annals of Applied Statistics*, where Karen Kafadar is the editor-in-chief. The report is accompanied by an editorial by Kafadar. …

You can read the rest of his post here.

Some links from Schachtman’s blog:

[1] Deborah Mayo, “At Long Last! The ASA President’s Task Force Statement on Statistical Significance and Replicability,” *Error Statistics* (June 20, 2021).

[2] “Falsehood Flies – The ASA 2016 Statement on Statistical Significance” (Feb. 26, 2021).

[3] Ronald L. Wasserstein & Nicole A. Lazar, “The ASA’s Statement on p-Values: Context, Process, and Purpose,” 70 *The Am. Statistician* 129 (2016); see “The American Statistical Association’s Statement on and of Significance” (March 17, 2016).

[4] “The American Statistical Association Statement on Significance Testing Goes to Court – Part I” (Nov. 13, 2018); “The American Statistical Association Statement on Significance Testing Goes to Court – Part 2” (Mar. 7, 2019).

[5] “Has the American Statistical Association Gone Post-Modern?” (Mar. 24, 2019); “American Statistical Association – Consensus versus Personal Opinion” (Dec. 13, 2019). See also Deborah G. Mayo, “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean,” *Error Statistics Philosophy* (June 17, 2019); B. Haig, “The ASA’s 2019 update on P-values and significance,” *Error Statistics Philosophy* (July 12, 2019); Brian Tarran, “THE S WORD … and what to do about it,” *Significance* (Aug. 2019); Donald Macnaughton, “Who Said What,” *Significance* 47 (Oct. 2019).

[6] Ronald L. Wasserstein, Allen L. Schirm, and Nicole A. Lazar, “Editorial: Moving to a World Beyond ‘p < 0.05’,” 73 *Am. Statistician* S1, S2 (2019).

[7] Karen Kafadar, “The Year in Review … And More to Come,” *AmStat News* 3 (Dec. 2019) (emphasis added); see Kafadar, “Statistics & Unintended Consequences,” *AmStat News* 3, 4 (June 2019).

[8] Karen Kafadar, “Task Force on Statistical Significance and Replicability,” ASA *Amstat Blog* (Feb. 1, 2020).

The **tenth** meeting of our Phil Stat Forum:

**The Statistics Wars
and Their Casualties**

**24 June 2021**

**TIME: 15:00-16:45 (London); 10:00-11:45 (New York, EDT)**

**For information about the Phil Stat Wars forum and how to join, click on this link.**

**“Have Covid-19 lockdowns led to an increase in domestic violence? Drawing inferences from police administrative data”**

**Katrin Hohl**

**Abstract:** This applied paper reflects on the challenges in measuring the impact of Covid-19 lockdowns on the volume and profile of domestic violence. The presentation has two parts. First, I present preliminary findings from analyses of large-scale police data from seven English police forces that disentangle longer-term trends from the effect of the imposing and lifting of lockdown restrictions. Second, I reflect on the methodological challenges involved in accessing, analysing and drawing inferences from police administrative data.

**Katrin Hohl** (Department of Sociology, City University London). Dr Katrin Hohl joined City University London in 2012 after completing her PhD at the LSE. Her research has two strands. The first revolves around various aspects of criminal justice responses to violence against women, in particular: the processes through which complaints of rape fail to result in a full police investigation, charge, prosecution and conviction; the challenges rape victims with mental health conditions pose to criminal justice, and the use of victim memory as evidence in rape complaints. The second strand focusses on public trust in the police, police legitimacy, compliance with the law and cooperation with the police and courts. Katrin has collaborated with the London Metropolitan Police on several research projects on the topics of public confidence in policing, police communication and neighbourhood policing. She is a member of the Centre for Law Justice and Journalism and the Centre for Crime and Justice Research.

Piquero et al. (2021) Domestic violence during the Covid-19 pandemic – Evidence from a systematic review and meta-analysis, Journal of Criminal Justice, 74 (May-June). (PDF)

Hohl, K. and Johnson K. (2020) A crisis exposed – how Covid-19 is impacting domestic abuse reported to the police.

Katrin Hohl presentation (Video Link)

Link to paste into browser: https://philstatwars.files.wordpress.com/2021/07/hohl-presentation-edited.mp4

*Meeting 18 of the general Phil Stat series, which began with the LSE Seminar PH500 on May 21.

**THE ASA PRESIDENT’S TASK FORCE STATEMENT ON STATISTICAL SIGNIFICANCE AND REPLICABILITY**

BY YOAV BENJAMINI, RICHARD D. DE VEAUX, BRADLEY EFRON, SCOTT EVANS, MARK GLICKMAN, BARRY I. GRAUBARD, XUMING HE, XIAO-LI MENG, NANCY REID, STEPHEN M. STIGLER, STEPHEN B. VARDEMAN, CHRISTOPHER K. WIKLE, TOMMY WRIGHT, LINDA J. YOUNG AND KAREN KAFADAR (for affiliations see the article)

Over the past decade, the sciences have experienced elevated concerns about replicability of study results. An important aspect of replicability is the use of statistical methods for framing conclusions. In 2019 the President of the American Statistical Association (ASA) established a task force to address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.) This document is the statement of the task force, and the ASA invited us to publicize it. Its purpose is two-fold: to clarify that the use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned, and to briefly set out some principles of sound statistical inference that may be useful to the scientific community.

P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature. They are important tools that have advanced science through their proper application.

Much of the controversy surrounding statistical significance can be dispelled through a better appreciation of uncertainty, variability, multiplicity, and replicability. The following general principles underlie the appropriate use of P-values and the reporting of statistical significance and apply more broadly to good statistical practice.

**Capturing the uncertainty associated with statistical summaries is critical.** Different measures of uncertainty can complement one another; no single measure serves all purposes. The sources of variation that the summaries address should be described in scientific articles and reports. Where possible, those sources of variation that have not been addressed should also be identified.

**Dealing with replicability and uncertainty lies at the heart of statistical science. Study results are replicable if they can be verified in further studies with new data.** Setting aside the possibility of fraud, important sources of replicability problems include poor study design and conduct, insufficient data, lack of attention to model choice without a full appreciation of the implications of that choice, inadequate description of the analytical and computational procedures, and selection of results to report. Selective reporting, even the highlighting of a few persuasive results among those reported, may lead to a distorted view of the evidence. In some settings this problem may be mitigated by adjusting for multiplicity. Controlling and accounting for uncertainty begins with the design of the study and measurement process and continues through each phase of the analysis to the reporting of results. Even in well-designed, carefully executed studies, inherent uncertainty remains, and the statistical analysis should account properly for this uncertainty.

**The theoretical basis of statistical science offers several general strategies for dealing with uncertainty.** P-values, confidence intervals and prediction intervals are typically associated with the frequentist approach. Bayes factors, posterior probability distributions and credible intervals are commonly used in the Bayesian approach. These are some among many statistical methods useful for reflecting uncertainty.

**Thresholds are helpful when actions are required.** Comparing P-values to a significance level can be useful, though P-values themselves provide valuable information. P-values and statistical significance should be understood as assessments of observations or effects relative to sampling variation, and not necessarily as measures of practical significance. If thresholds are deemed necessary as a part of decision-making, they should be explicitly defined based on study goals, considering the consequences of incorrect decisions. Conventions vary by discipline and purpose of analyses.

**In summary, P-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data.** Analyzing data and summarizing results are often more complex than is sometimes popularly conveyed. Although all scientific methods have limitations, the proper application of statistical methods is essential for interpreting the results of data analyses and enhancing the replicability of scientific results.
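The statement’s passing mention of “adjusting for multiplicity” can be made concrete with a minimal sketch of my own (purely illustrative, not part of the Task Force statement): the Bonferroni adjustment, which controls the family-wise error rate across m tests by requiring each p-value to fall below α/m rather than α.

```python
# Illustrative sketch: Bonferroni adjustment for multiplicity.
# With m hypotheses tested at once, each is tested at level alpha/m,
# so the chance of *any* false rejection stays at most alpha.

def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0_i only if p_i <= alpha / m, where m = number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Five tests: only the p-value below 0.05/5 = 0.01 survives adjustment,
# even though three are below the unadjusted 0.05 threshold.
p_vals = [0.004, 0.02, 0.03, 0.20, 0.60]
print(bonferroni_reject(p_vals))  # [True, False, False, False, False]
```

Bonferroni is deliberately conservative; less stringent procedures (e.g., Benjamini–Hochberg, controlling the false discovery rate) are often preferred when many tests are run, but the principle is the same: the reporting threshold must reflect how many looks at the data were taken.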

“The most reckless and treacherous of all theorists is he who professes to let facts and figures speak for themselves, who keeps in the background the part he has played, perhaps unconsciously, in selecting and grouping them.” (Alfred Marshall, 1885)


Hi Deborah,

I guess you may have heard about the approval of this Alzheimer’s drug under controversial circumstances:

- https://www.buzzfeednews.com/article/stephaniemlee/alzheimers-aducanumab-biogen-fda-approval
- FDA Report
I was asked for my statistical opinion about this prior to the decision. I am giving a brief talk later this week. My answer was that although non-effectiveness has not been proven, there was not sufficient evidence to show effectiveness, and I thought the FDA would ask for more research. They kind of did this in the form of requiring a Phase IV trial.[ii]

There were problems I think with the evidence, and with the FDA’s reasoning.

The problems were (i) stopping the trials on the basis of futility analyses, (ii) then re-analysing using subgroup analyses and secondary endpoints, and (iii) lack of evidence for a relationship between amyloid levels on imaging and cognitive scores.

I think this decision may set an unfortunate precedent that undermines the principle of pre-registering trials. Secondary analysis is fine for understanding why a trial failed and generating hypotheses for future studies, but it is not usually considered sufficient evidence for drug approval. Interestingly, the FDA’s own expert panel and statistician didn’t think that the evidence justified approval at this point.

I think there are critical flaws in the FDA’s reasoning. They argued for a lower standard of evidence given the current lack of effective treatments for Alzheimer’s disease. They argued that the trials showed that the drug prevented the accumulation of amyloid protein, and that should improve cognition. It is the improvement in cognition that is the primary endpoint, and the trials were stopped because of the decline of cognition in the treatment group [in one of two trials]. Even in the positive-trending trial, the decline in cognition in the treatment group was not sufficiently different from that in the control group. The initial hope in the field that amyloid was the actual cause of dementia and clearing it would help has been dashed many times. Previous drugs with similar actions have all failed. These trials were a bit different in targeting the early phase, so they are more about prevention than treatment. That was a reasonable approach, but we need strong evidence about preventing cognitive decline.

This is a rare example of not sticking to pre-registered analyses; I don’t know of any similar example. Other groups who have conducted ostensibly failed trials may conduct secondary analyses and then seek to have treatments approved based on those analyses. That opens up the possibility of data-dredging to achieve the desired outcome, and I think you will agree that is something that has contributed to the “replicability crisis” in science. Not the use of significance testing!

Given the small and questionable treatment effect and the huge cost of the drug, the FDA may have some explaining to do in a few years. The Phase IV trial may not be successful. I think the insurance companies and other countries will be watching with interest before they commit.

What do you think about this development?

Regards,

Geoff

I agree with Stuart’s worry about the post hoc searching, especially as it appears the FDA worked with Biogen to carve out the subgroup after the company itself stopped the trial for futility.[iii] I agree with those who criticize the FDA approving the drug despite poor evidence. I don’t know enough about the status of the amyloid plaque theory today to weigh in much further (see note [iv]). But I’m very concerned about the precedent this sets. In the Buzzfeed link:

“The FDA’s decision to approve aducanumab shows a stunning disregard for science and eviscerates the agency’s standards for approving new drugs,” Michael Carome, who tracks the pharmaceutical industry for the watchdog group Public Citizen, told BuzzFeed News. “Because of this reckless action, the agency’s credibility has been irreparably damaged.”

Granted, it’s hard to see how the FDA can insist on its own strong position against drug approvals based on such ransacking in the future. (See my ‘P-values on Trial’.) Stuart says he doesn’t know of another similar example, but perhaps others do (please inform us in the comments). I would definitely worry about future FDA panels being as prepared to say “no” as this one was–and that’s a real problem. (One panelist just quit; see comment 6/9/21.)

One good thing is that the FDA admitted that “the drug had provided incomplete evidence to demonstrate effectiveness” and so it was requiring Biogen to conduct a new clinical trial. This, I believe, is a distinct, possibly newish, category for especially devastating diseases. Some say people won’t volunteer for a clinical trial where they might be assigned to placebo, now that the drug has been approved. Maybe so, but maybe those who can’t possibly afford the $56,000 cost per year would be willing to enroll for a chance of being assigned to it.[iv]

The main thing, it seems to me, is that the FDA clearly made the distinction between having provided sufficient evidence that the drug works in improving cognitive function (Biogen did not), and the reasons it was nevertheless approving it: the fact that Alzheimer’s is a devastating disease with no effective treatment, and that this would give hope to Alzheimer’s sufferers after nearly 20 years with no new treatments (FDA Report). The agony this group went through in coping with the Covid pandemic might also have played a role. (There are also allegations of financial ties of various sorts, both with drug companies and advocacy groups.) The most serious problem with the latest calls to “abandon” statistical significance is their recommendation that such extra-evidential considerations go toward the evaluation of the evidence for effectiveness, and that these policy decisions be mixed in with the evidential evaluation. That’s one of the main reasons I find such calls so irresponsible, especially coming from those in positions of power and influence.

Please share your thoughts in the comments.

As usual, I indicate updated versions with (i), (ii),… in the title.

[i] Stuart is involved in statistical research in psychology at Melbourne, and has attended sessions of my Phil Stat Wars Forum.

[ii] This grants the drug only conditional approval, which can later be rescinded, though it’s very doubtful this would happen.

[iii] Stuart elaborated further on the subgroup analysis:

My understanding of the post hoc analysis is that it was only of participants who had completed the full course of treatment and were in the high dose group or the placebo group (not the low dose group – but surely they looked). The problem here is that at the point of the futility analysis, progress would have been compared for everyone at the same time from enrolment and it was looking very much like the treatment group were declining in their cognition at a similar rate to the placebo group, so they stopped the trials. Then they found post-hoc that the high dose “completers” showed a bit of an effect (only a 20% reduction in cognitive decline). Now it could be that the eventual completers were getting better at the mid-point. So, removing all those who had only partly completed the trial could have helped them get a significant result with the subset who completed before the trials were aborted. There’s no statistical control for that, so the chance that the finding was a Type I error goes up by an unknown amount. Sticking to the pre-registered protocol and completing the trials would have controlled for that.

It could be true that for effectiveness a high dose for the pre-specified treatment period is required, so the hypothesis is still alive. But a lot of us, probably including the FDA’s own expert panel and statistician, think there should be a new Phase III trial to test that hypothesis, and that the evidence from the aborted trials is compromised by its post hoc nature.
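The multiplicity worry Stuart describes can be made concrete with a small simulation of my own (purely illustrative numbers, not Biogen’s data): when a drug truly has no effect, reporting only the most favorable of several post hoc subgroup analyses pushes the false-positive rate well above the nominal 5% of a single pre-registered test.

```python
# Illustrative simulation: under a true null effect, picking the best of
# several post hoc subgroups inflates the Type I error rate, compared
# with a single pre-registered analysis of the full sample.
import math
import random

def p_value(treat, control):
    """Two-sided z-test p-value for a difference in means (known variance = 1)."""
    n, m = len(treat), len(control)
    z = (sum(treat) / n - sum(control) / m) / math.sqrt(1 / n + 1 / m)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(1)
sims, n = 2000, 80          # number of simulated null trials; patients per arm
full_hits = subgroup_hits = 0
for _ in range(sims):
    treat = [random.gauss(0, 1) for _ in range(n)]    # drug arm, no true effect
    control = [random.gauss(0, 1) for _ in range(n)]  # placebo arm
    full_hits += p_value(treat, control) < 0.05
    # Post hoc: split each arm into 4 subgroups and keep only the smallest p
    best = min(p_value(treat[i::4], control[i::4]) for i in range(4))
    subgroup_hits += best < 0.05

print(f"single pre-registered test, false-positive rate: {full_hits / sims:.3f}")
print(f"best of 4 post hoc subgroups, false-positive rate: {subgroup_hits / sims:.3f}")
```

The pre-registered test rejects at roughly the nominal 5% rate, while reporting the best of four subgroup analyses rejects far more often (near 1 − 0.95⁴ ≈ 18.5% for independent subgroups), and that is before any data-dependent choice of which subgroups to form.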

[iv] The fact that reducing amyloid in the brain, achieved by other experimental drugs, hasn’t yielded improved cognitive function had been taken by some as falsifying the amyloid theory altogether. (Those drugs have also all tended to have the side effect of brain swelling.) One valuable consequence of this approval could be to finally falsify the amyloid plaque theory, or, alternatively, give it new life.

**Added 6/11/21: Remarks from Bio Pharma collected by Endpoints (though without names that I can see):**

- — It’s a horrifying abdication of the agency’s responsibility as a vanguard of scientific truth. It makes a mockery of statistical rigor and their own advisors, and it will do irreparable damage to health care economics, research into Alzheimer’s, and their own reputation.
- — Totally discredits the FDA and potentially dangerous to patients without any real indication of benefit.
- — Unproven efficacy, unproven surrogate that can’t be reasonably expected to predict benefit given large # of other failed attempts.
- — “Borderline fraud,” one reader noted. “I have an ethical obligation to my patients. I will never prescribe.”
- — This is the first time in 30 years I’ve seen a decision that’s based almost entirely on lack of scientific reason, lack of science biology and medical evidence (they claim that “it’s reasonable to assume” that losing a-beta plaques will help the patient). There was powerful manipulation by, in this case, a patient group organization (Alz Assn, all-bad), which is ignorant and subservient to some influence group, and by this and like-minded companies. There was also manipulation of trial data, which made everyone believe this was nonsensical.
- — The sheer ignorance/disregard of proper scientific method that was displayed in not requiring a confirmatory trial prior to approval is disgraceful.
- — Clinical studies failed to provide convincing results as to its efficacy. I was interviewing at Biogen the day they first announced they are ceasing clinical trials due to failure of the drug. The team I was interviewing with was part of the group that “discovered” the drug and they clearly told me that they were not surprised at all that the clinical trials failed. 2 years down the line and FDA managed to take science back 20 years …
- — It is at odds with all prior guidance. They treated it like it was an AIDS drug in the late 1980s but the problem is nothing like AIDS. There is no established link between plaque regression and cognitive improvement. They changed their mind on accelerated approval. The AdComm should have had a chance to weigh in on this. It is not like DMD — this disease affects millions of people.
- — Cherry picked, post-hoc data. There have been 17 other A-beta antibody studies that showed no benefit or worse outcomes. There have been 16 trials (not all anti-A-beta) in which the treatment arm did worse than placebo despite reduction of amyloid.
- — This sets a terrible incentive for other companies to make me-too drugs rather than focus on alternative pathways that would be more effective, such as taking a gene-based approach. In addition, there is not enough proof of efficacy. They should have made them do a third trial at the higher dosing. Lastly, Biogen will make a boat-load of profit selling an unproven drug at a high price based on the (false?) hopes of patients.
- — Data presented by Biogen is at the best weak, and at the worst totally irrelevant. This is coupled with high costs to consumers for, at best, modest improvements in therapeutic outcomes. The FDA was founded to stop the sale of “patent” medicines to unwitting citizens, yet here it is a century later endorsing what is their functional equivalent.
- — The advisory panel overwhelmingly rejected aducanumab and later the FDA changed the criteria to the surrogate endpoint. It exposes patients to an ineffective and costly treatment, gives misguided encouragement to other similar would-be products, throws the reimbursement system into disarray and completely disregards science.
- — The indirect implications include validating a hypothesis that has never panned out (and I don’t believe it did here) and from which the world has moved on.
- — Overall, a very sad decision from the FDA, sad for serious and responsible drug development, not the one which is guided solely by the population’s unmet need. Very sad with major implications for the FDA, pharma, drug development and society. It will cost us years to recover from this one bad decision! I do not envy the patients … sad also for them who will take the risk of having unwanted side effects, for very few beneficial effects (if at all). Praying is less risky and maybe more efficacious.
- — It undermines the credibility of randomized placebo controlled trials. While post-hoc analyses can be useful in designing subsequent trials, they should not be used as the basis for approval. As Alexander et al. put it in their scathing JAMA commentary, “Any treatment will appear to be more effective if individuals in whom it works least are removed from the analysis.”
- — Did not meet the FDA’s standard of 2 demonstrative clinical trials. Also across disease states, the FDA has set an appropriate standard of requiring positive data on both patient-reported outcomes/symptoms as well as underlying disease process. In this case the drug only hinted at one of these and not the other at all. What gives? Yes we are desperate for ALZ therapies. But even its critics have previously appreciated the FDA’s role as the toughest reviewer of clinical data. Now like so many other areas of American leadership, that is crumbling.
- — Ignoring a nearly unanimous Ad Board sets a dangerous and confusing precedent.
- — The approval decision should be guided by science and not outside pressure (patient advocacy and/or business lobby). The adcom unilaterally rejecting the aducanumab on efficacy screams “no” to me.
- — The FDA has given Biogen free license to charge $56K (an insane price given how poor their data are) for 6+ million people. Even if Biogen is only able to capture 10% of that market, you’re talking about an economic burden on the US for payers of ~$34B a year. How are payers going to focus on any other disease therapies (with far better efficacy) when they’re distracted with this giant line item? Sovaldi’s CURE for hep C was already an enormous burden and now we have a chronic, ineffective therapy. The FDA has no ability to enforce the requirements of the post-marketing study and Biogen will likely be unable to recruit placebo patients anyway. They’ve handed Biogen a gigantic blockbuster without them earning it. Sad day.
- — This result will embolden the nearly fully denounced amyloid hypothesis supporters in the field and will ensure that other, more promising approaches fail to get funding, likely meaning we will not see progress in treating AZ for another 20+ years.
- — Can’t say it better than John’s editorial. Not following the science, terrible precedent and counterproductive for finding real therapies. Biogen will do a Sarepta and use the money to buy a real drug. Puts all the Covid goodwill in peril.
- — Terrible FDA precedent. Makes drug development and regulatory approval less predictable for all sponsors. FDA has stated many times that accelerated approval is NOT meant to be a consolation prize for a failed study – that’s what happened here.
- — While the 2nd trial may barely meet the requirements for the FDA’s Accelerated Approval Program, the strong disapproval by such a large number of neuroscientists should NOT have been dismissed … I am the daughter of parents who both died from the disease and it is prevalent in both sides of my family so my disagreement did not come without a lot of thought. I am always concerned when one individual in particular (in this case Dr. Dunn) appears to want it approved over all opposition. What is his relationship to Biogen?
- — This is simply the FDA unable to make a tough call, and passing the buck, hoping that payers will do the right thing. This is bad for patients, bad for the payers, and very, very bad for those manufacturers that are doing the right thing in developing good drugs supported by good data.
- — To me, this is worse than Shkreli-type price gouging, because it interferes with our ability to actually know if this drug can actually help anyone (besides Biogen).
- — What is the point of a p value if you play with the data until you get what you want (stats 101!)? — What is the point of aligning with the FDA prior to the studies on endpoints etc if it doesn’t matter? — What is the point of the FDA if their endorsement no longer means that the drug works? — Why have any adcomms in the future if all the experts will be ignored? But most of all: This provides false hope to a lot of desperate people who cannot (and should not!) understand the nuances above, and undermines efforts to develop something that does work.

There were some supportive remarks scattered in the mix here, but even some of those were qualified with a marked lack of enthusiasm for this drug. From readers:

- — It is a toe hold into neurodegeneration and a starting point upon which to improve. Much like early approvals in MS and ALS starting points are imperfect.
- — The data, while not up to the FDA’s usual standard of clear efficacy in 2 independent Phase 3 trials, are the most convincing to date of a disease-modifying effect in this desperate disease. Biogen should have to conduct another confirmatory trial, but patients should not have to wait for the outcome in order to be treated.
- — With each failure there is increased trepidation to invest in Alzheimer’s treatments, diagnostics and infrastructure. A win, although controversial, could reverse this trend and encourage treatments that are much better.


**1 Are the wars mostly in statistics, not philosophy?** According to Williamson, Bayesian philosophers, while mostly subjective, just “see Bayesianism as being concerned with belief, and so not a rival” to frequentist statistics, while Bayesian statisticians do “see Bayesianism as a rival to frequentist statistical inference—‘statistics wars’.” [SLIDE 6]. So, in his view, philosophers of statistics don’t see a war between frequentists and Bayesians, whereas statisticians do. I think, if anything, it is the reverse. Bayesian statisticians are more eclectic, and much less inclined to see a war between frequentists and Bayesians. When they do, they are largely wearing philosophical hats. Granted, they often do not recognize that they’re making philosophical presuppositions when waging a war with frequentists (error statisticians, in my terminology), but by and large they are happy to get on with the job. By contrast, philosophers who accept a frequentist, error statistical view are generally seen as exiles, under the presumption that only Bayesianism gives a sound (coherent) underlying philosophy.[i] (That is why the alt name for my blog is “frequentists in exile”.)

I’m not saying, by the way, that the main stat wars are between frequentists and Bayesians–there are many other battles as well. I’m just addressing some of the points Williamson makes.

I also find it surprising that according to Williamson, a Bayesian statistician appears to be much more of a true-blue subjectivist (in the manner of de Finetti) than is the Bayesian philosopher. Again, I would have thought it was the opposite. [SLIDE 6] Statisticians seem much more eclectic than philosophers in their interpretations of probability (see Gelman and Shalizi 2013). Williamson avers that Bayesian statisticians “often doubt the existence of non-epistemic probabilities (following Bruno de Finetti)”. “Non-epistemic probabilities,” he says, are “generic frequencies or single-case chances”. Doubting the existence of frequencies and chances, the Bayesian statistician, he claims, does not seek to “directly calibrate one’s credences [degrees of belief] to nonepistemic probabilities (generic frequencies or single-case chances).”

But certainly the large class of non-subjective, default, reference, and empirical Bayesians go beyond probability as subjective degrees of belief, and increasingly separate themselves from traditional subjective and personalist philosophies. Being calibrated to frequencies in some sense, I thought, was one of the main advantages of non-subjective, objective, or default Bayesianism in statistical practice. This is so, regardless of their metaphysics on chances or propensities: it suffices to allude to relative frequencies (actual or hypothetical) stemming from modeled phenomena or deliberately designed experiments.

Having a battle about where the wars are–statistics or philosophy of statistics–may not be productive, but I think it’s very important to understand the nature of the debates. In fact, one of the most serious casualties of the statistics wars from the philosophical perspective is in obscuring the roles of statistical methods (and other formal and quasi-formal methodologies) in addressing the epistemological problem of how to generate, learn and generalize from data. In other words, the wars have confused the value of statistics for philosophy.

**2.** **Philosophy of confirmation vs philosophy of statistics.** Now there is a radical difference to which Williamson’s discussion points, and it is between a given project–it might be called Bayesian confirmation theory, Bayesian epistemology, inductive logic or the like–and what I would call philosophy of statistics, or the philosophy of inductive-statistical inference. Bayesian confirmation theorists, since Carnap, have a tradition of building an account based on a restricted language: statements, propositions, and first order logics. By contrast, statisticians and statistical philosophers refer to probability models, continuous random variables, parameters and the like. The philosophical project is essentially to justify a mode of inductive inference, basically, Carnap’s straight rule or a version of enumerative induction: if k% of A’s have been B’s then believe the next A will be B to degree k. Perhaps a stipulation that the observations satisfy a condition of randomness or exchangeability is added.

An example from Williamson (SLIDE 7) is this: Suppose your evidence E is: 17 of a random sample of a hundred 21-year-olds develop a cough. That the sample frequency is 0.17 is evidence that the chance/frequency is ≈ 0.17. A case of what he calls a *direct inference* would be to take .17 as how confident one should be, or how strongly one should believe, the statement A: that Cheesewright, who is 21, gets a cough. In the confirmation philosopher’s project, there is a restriction to a finite language, set of predicates, and assignments of probabilities to chosen elements. A statistician, instead, might appeal to Bernoulli trials and a binomial model of the experiment to reach such a (direct) inference to the probability of an event occurring on the next trial.
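The contrast can be sketched in a few lines. This is my own gloss, not Williamson’s formalism; only the 17-in-100 figures come from his example, and the variable names are invented for illustration:

```python
# A rough sketch of the two routes (hypothetical variable names; the
# 17-in-100 figures are from Williamson's slide).
import math

k, n = 17, 100
p_hat = k / n  # sample frequency = 0.17

# Straight-rule / direct inference: adopt the sample frequency as one's
# degree of belief that the next randomly selected 21-year-old
# (e.g. Cheesewright) develops a cough.
credence_next = p_hat

# A statistician modeling the data as n Bernoulli trials would instead
# attach uncertainty to the estimate, e.g. via its standard error:
se = math.sqrt(p_hat * (1 - p_hat) / n)

print(credence_next)   # 0.17
print(round(se, 3))    # roughly 0.038
```

The point of the second computation is that the model-based route carries its empirical assumptions (independent, identically distributed trials) on its sleeve, which is exactly what the a priori confirmation project sought to avoid.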

Philosophers of confirmation sought an a priori (non-empirical) way to justify such an inference–to solve the traditional problem of induction–and any reference to a probability model, or even slipping in that it’s a random sample, makes an empirical assumption. So they shied away from tackling the inductive problem using models. By the way, I don’t view inductive inference as inferring probabilities of claims, but as making inferences that go beyond the data–they are ampliative. They are qualified using probabilities, but these are not posteriors in hypotheses. Even falsification requires inductive inferences in this sense (see SIST, excursion 2 Tour II, p. 83).

I don’t think philosophers still consider that an a priori justification of inductive inference is possible or even desirable. Thus, the impetus for restricting the philosophical account of inference to specially crafted first order languages goes by the board, and we can freely talk about design-based or model-based probabilities. Unlike what the confirmation project typically supposes, appeals to such models may be warranted or, alternatively, falsified. Showing how is part of what is involved in solving the problem of induction now, or so I argue (2018, pp 107-115).

**3. Do statisticians not move from general probabilities to specific assignments?** It’s interesting that Williamson claims that “statisticians tend not to appeal to direct inference principles” that move from population probabilities and frequencies to degrees of probability, belief, or support in a particular case. The reason, if I understand him, is that they tend not to believe in these non-epistemic, frequentist probability notions–taking us back to point 1 above.

I find Bayesian statisticians/practitioners highly interested in assigning degrees of belief, credence, or plausibility to events–whether we consider that tantamount to assigning beliefs to statements about events, or to events defined in a model. In fact, Bayesian statisticians see it as a key selling point of their methodology that it offers a way to assign probabilities to particular events and hypotheses, whereas the frequentist error statistician generally only speaks of the performance properties of methods in repetitions. The error statistician is also largely interested in inverse inference from data to claims about aspects of the data generating method (rather than direct inference).

The latest move (by both Bayesian and frequentist practitioners) to embrace what I call a “screening model” of statistical significance tests would seem to be an example of practitioners performing direct inferences. (See SIST, excursion 5 Tour II.) Here the probability of a particular hypothesis is given by considering it to have been (randomly?) selected from a universe or urn of hypotheses (where it’s assumed some % are known to be true).
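The urn picture can be made concrete with a toy calculation. The numbers below are illustrative assumptions of mine, not figures from SIST or the screening-model literature:

```python
# A toy "screening" calculation (illustrative numbers only): imagine
# hypotheses drawn from an urn in which a fraction `prior_true` are true,
# each tested at level alpha with the given power.
def ppv(prior_true, alpha=0.05, power=0.8):
    """P(hypothesis true | statistically significant result) under the urn picture."""
    true_pos = prior_true * power          # true hypotheses that reach significance
    false_pos = (1 - prior_true) * alpha   # false hypotheses that reach significance
    return true_pos / (true_pos + false_pos)

# With only 10% true hypotheses in the urn, a significant result yields
# a "probability of truth" well below 1 - alpha:
print(round(ppv(0.10), 2))  # 0.64
```

Whatever one thinks of treating a particular hypothesis as a random draw from such an urn, the computation itself is a textbook direct inference from generic frequencies to a probability for the case at hand.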

Williamson himself appeals to confidence intervals in illustrating a direct inference from a confidence level to a degree of belief in a particular interval estimate. A popular reconciliation, which I think he endorses, makes use of frequentist matching priors. Fisher’s fiducial argument essentially tried to do this without appealing to priors. So again it’s not clear why he takes statisticians as uninterested in direct inference. True, contradictions result from probabilistic instantiation, as noted in my “casualties”, and in Williamson’s presentation. (His solution is to drop Bayesian conditioning and start over with new maximum entropy priors.) The error statistician will also take error probabilities of methods as qualifying a particular inference. But what it qualifies is not its degree of believability but, rather, how well tested or corroborated it is.

The upshot: I consider statistical methods far more relevant to the philosophers’ epistemological projects than Williamson’s portrayal suggests, at least based on his May 20 presentation. Philosophers of confirmation and formal epistemology shortchange themselves by keeping their projects separate from the ones that empirical statistical (and other formal and quasi-formal) methods supply. In the reverse direction, the foundational and methodological problems of these methods cannot be so readily swept aside as simply directed at problems outside the philosophers’ own. In my view, this thinking has stalled progress in both arenas for the past 25 years.

Please share your comments and questions.

[i] Admittedly, there is a program of Bayesian epistemology that might be seen as doing ordinary epistemology (discussions about knowledge and beliefs) employing formal probabilities. But this is not Williamson’s project.


After Jon Williamson’s talk, *Objective Bayesianism from a Philosophical Perspective*, at the PhilStat forum on May 20, I raised some general “casualties” encountered by objective, non-subjective or default Bayesian accounts, not necessarily Williamson’s. I am pasting those remarks below, followed by some additional remarks and the video of his responses to my main kvetches.

To consider some casualties: some object that the Bayesian picture of a coherent way to represent and update beliefs goes by the board for the non-subjective or default Bayesian.

The non-subjective priors are not supposed to be considered expressions of uncertainty, ignorance, or degree of belief. They may not even be probabilities, being improper (that is, not integrating to one). As David Cox asks: can such a prior be regarded “as an adequate summary of information?” (2006a, p. 774)

A second casualty is that prior probabilities are often touted as a way for Bayesians to bring background information into the analysis, but this pulls in the opposite direction from the goal of the default prior, which aims to maximize the contribution of just the data, not background. Trying to describe your beliefs is different from trying to make the data dominant. Users applying a computer package with default priors might think they’re putting in something uninformative or safe. Not necessarily. The goal of truly “uninformative” priors was abandoned around 20 years ago: what is uninformative under one parameterization can be informative for another.

A third casualty, especially from the subjective Bayesian perspective, is that non-subjective Bayesianism is incoherent in terms of Bayes rule or conditioning—as Jon knows. One reason for incoherence is that the default prior on the same hypothesis can vary according to the type of experiment performed. This violates the likelihood principle (LP) followed by strict Bayes rule—to mention a term we’ve spoken of before in this forum.
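A standard numerical illustration of this clash with the likelihood principle (LP), my own and not from the talk: observing 3 successes in 12 trials yields proportional likelihoods whether the sampling was binomial (n fixed) or negative binomial (stop at the 3rd success), so by the LP the evidence is the same; yet the Jeffreys default priors differ (Beta(1/2, 1/2) for the binomial versus a prior proportional to p^(-1)(1-p)^(-1/2) for the negative binomial).

```python
# Check numerically that the two likelihoods are proportional in p,
# so that differing default priors violate the likelihood principle.
from math import comb

k, n = 3, 12  # illustrative counts

def lik_binomial(p):
    # n fixed in advance
    return comb(n, k) * p**k * (1 - p)**(n - k)

def lik_negbinomial(p):
    # sampling continued until the k-th success; last trial is a success
    return comb(n - 1, k - 1) * p**k * (1 - p)**(n - k)

ratios = [lik_binomial(p) / lik_negbinomial(p) for p in (0.1, 0.3, 0.5, 0.7)]
# The ratio is constant in p, so the likelihoods are proportional:
print(all(abs(r - ratios[0]) < 1e-12 for r in ratios))  # True
```

Since the likelihood functions agree up to a constant, any account that attaches different priors to the two designs must, on strict Bayes-rule updating, deliver different posteriors from “the same” evidence.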

When there’s a conflict with Bayes’ rule, or in the face of improper posteriors (that don’t integrate to one), the default Bayesian reassigns priors, starting over again, as it were. The very idea that inference is a matter of continually updating by Bayesian conditioning goes by the board, at least in those cases.

There are also casualties that emerge from the perspective of the frequentist error statistician.

Yes, the non-subjective Bayesian may allow the error probability measure to match a posterior. But do we want that?

One way would allow assigning a confidence level to the particular inferred estimate. But save for special cases, this leads to contradictions, as with Fisher’s attempt at fiducial intervals. To a frequentist, it’s a fallacy to instantiate the probability this way. (We might, for example, have two non-overlapping CIs, both at level .95.)
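A minimal simulation, under an assumed normal model with parameters of my own choosing, shows where the 0.95 lives on the frequentist reading: it is the long-run coverage of the CI *procedure*, while any particular realized interval either contains the parameter or it doesn’t:

```python
# Long-run coverage of the known-sigma 95% CI procedure for a normal mean.
# Parameters are illustrative assumptions, not from any cited example.
import math
import random

random.seed(1)
mu, sigma, n, reps = 0.0, 1.0, 25, 10_000
z = 1.96  # two-sided 95% normal critical value
covered = 0
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    half = z * sigma / math.sqrt(n)  # interval half-width, sigma known
    if xbar - half <= mu <= xbar + half:
        covered += 1

print(covered / reps)  # close to 0.95
```

Instantiating the 0.95 onto one realized interval is exactly the step the error statistician resists; the severity reading below offers a different way to qualify the particular inference.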

Nor would the error statistician want a p-value of .2, in testing a null hypothesis of no benefit of a treatment, say, to be taken as a .8 degree of plausibility in the treatment’s benefits, or to say they’d be as ready to bet on the alternative as they would on the occurrence of an event with probability 0.8. A severe tester would regard the claim of a genuine positive benefit as poorly tested, given a significance level of .2.

In her view, the error probability can be seen to be applied to the particular inference by telling us how well or poorly tested the claim is.

The error statistician operates without an exhaustive set of models, theories, languages, and predicates. She will split off a single piece, question, or model, with its error properties assessed and controlled.

Having said all that, Williamson has his own ingenious system, and I’m interested to consider how it rules on some of today’s controversies.

Jon is free to react to any of this, or perhaps they will arise later in conversation.

*New remarks by Mayo:* Williamson, in his response (below), agrees with my caveats relating to non-subjective Bayesianism, but avers that they only give evidence in support of his (non-standard) approach. Start with casualty #3. Williamson bites the bullet here, because he has long rejected what many Bayesians consider a great asset: the view that learning from data is all about Bayesian updating (applications of Bayes rule or conditionalization). However, the example he used in this presentation is different from the ones usually used in showing temporal incoherence, and I’ll raise some questions about it in a follow-up post on this blog. As for casualty #1, I can’t say whether his Bayesian posteriors provide appropriate degrees of rational belief, as I’m not sure how to attain them.

One example he gave was moving from “k% of observed A’s have been B’s” to a degree of belief equal to k that a randomly selected A will be a B. This is essentially Carnap’s “straight rule” of induction. But the randomness assumption (if I’m right that he uses it) makes the assessment a model-based one, whereas the kind of Carnapian-style inductive logic that Williamson wants to use, I think, does not use models, since otherwise there is a problem with taking it to solve problems of induction. (He will correct me if I’m wrong.) The other example he gives takes confidence levels as rational degrees of belief on particular estimates. But this takes us to my casualty regarding contradictions resulting from such probabilistic instantiations. Perhaps he wants to argue that assigning the confidence level to the particular instance is OK for the Bernoulli parameter p–the probability of success on a given trial. (This may be akin to the approach of Roger Rosenkrantz, who, like Williamson, is a Jaynesian.) But this requires an appeal to a model and not merely a first order language. And I don’t see how to avoid the reference class problem, except maybe to say, whatever reference class you’re interested in is fine for applying the straight rule. But this would be in tension with the main rationale of the project being to solve induction.[i]

J. Williamson response to Casualties video

J. Williamson’s “Objective Bayesianism from a Philosophical Perspective” Slides.

His full talk is in this Presentation Video.

Please share your comments and questions in the comments to this blog.

[i] For how the error statistical philosopher recommends we solve the problem of induction now, see pp 107-115 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP 2018): https://errorstatistics.files.wordpress.com/2019/09/ex2-tii.pdf

*The **ninth** meeting of our Phil Stat Forum*:*

**The Statistics Wars
and Their Casualties**

**20 May 2021**

**TIME: 15:00-16:45 (London); 10:00-11:45 (New York, EDT)**

**For information about the Phil Stat Wars forum and how to join, click on this link.**

**“Objective Bayesianism from a philosophical perspective”**

**Jon Williamson**

**Abstract:** This talk addresses the ‘statistics war’ between frequentists and Bayesians, and argues for a reconciliation of sorts. We start with an overview of Bayesianism and a divergence that has taken place between Bayesianism as adopted by philosophers and Bayesianism as adopted by statisticians. This divergence centres around the use of direct inference principles, which are now widely advocated by philosophers. I consider two direct inference principles, Reichenbach’s Principle of the Narrowest Reference Class and Lewis’ Principal Principle, and I argue that neither can be adequately accommodated within a standard Bayesian framework. A non-standard version of objective Bayesianism, however, can accommodate such principles. I introduce this version of objective Bayesianism and explain how it integrates both frequentist and Bayesian inference. Finally, I illustrate the application of the approach to medicine and suggest that this sort of approach offers a very natural solution to the statistical matching problem, which is becoming increasingly important.

**Jon Williamson **(Centre for Reasoning, University of Kent) works in the area of philosophy of science and medicine. He works on the philosophy of causality, the foundations of probability, formal epistemology, inductive logic, and the use of causality, probability and inference methods in science and medicine. Williamson’s books Bayesian Nets and Causality and In Defence of Objective Bayesianism develop the view that causality and probability are features of the way we reason about the world, not a part of the world itself. His books Probabilistic Logics and Probabilistic Networks and Lectures on Inductive Logic apply recent developments in Bayesianism to motivate a new approach to inductive logic. His latest book, Evaluating Evidence of Mechanisms in Medicine, seeks to broaden the range of evidence considered by evidence-based medicine. Jon Williamson’s webpage.

**(1) **Christian Wallmann and Jon Williamson: **The Principal Principle and subjective Bayesianism**, *European Journal for the Philosophy of Science* 10(1):3, 2020. doi:10.1007/s13194-019-0266-4 (Link to PDF)

**(2) **Jon Williamson: **Why Frequentists and Bayesians Need Each Other**, *Erkenntnis* 78:293-318, 2013. (Link to PDF)

J. Williamson Slides

J. Williamson Presentation Video

D. Mayo Casualties Slide

J. Williamson response to Casualties video

**Mayo’s Memos: **Any info or events that arise that seem relevant to share with y’all before the meeting. Please check back closer to the meeting day.

*Meeting 17 of the general Phil Stat series, which began with the LSE Seminar PH500 on May 21.

Tom Sterkenburg, PhD

Postdoctoral Fellow

Munich Center for Mathematical Philosophy

LMU Munich

Munich, Germany

The foundations of statistics is not a land of peace and quiet. “Tribal warfare” is perhaps putting it too strongly, but it is the case that for decades now various camps and subcamps have been exchanging heated arguments about the right statistical methodology. That these skirmishes are not just an academic exercise is clear from the widespread use of statistical methods, and from contemporary challenges that cry out for more secure foundations: the rise of big data, the replication crisis.

One often hears that to blame are classical, frequentist methods, which lack a proper justification and are easily misused at that; so that it is all a matter of stepping up our efforts to spread the Bayesian philosophy. This not only ignores the various conflicting views *within* the Bayesian camp, but also gives too little credit to opposing philosophical perspectives. In particular, this does not do justice to the work of philosopher of statistics Deborah Mayo. Perhaps most famously in her Lakatos Award-winning *Error and the Growth of Experimental Knowledge* (1996), Mayo has been developing an account of statistical and scientific inference that builds on Popper’s falsificationist philosophy and frequentist statistics. She has now written a new book, with the stated goal of helping us get beyond the statistics wars.

This work is a genuine tour de force. Mayo weaves together an extraordinary range of philosophical themes, technical discussions, and historical anecdotes into a lively and engaging exposition of what she calls the *error-statistical* philosophy. Like few other works in the area, Mayo instills in the reader an appreciation for both the interest and the significance of the topic of statistical methodology, and indeed for the importance of *philosophers* engaging with it.

That does not yet make the book an easy read. In fact, the downside of Mayo’s conversational style of presentation is that it can take a serious effort on the reader’s part to distill the argumentative structure and how various observations and explanations hang together. This, unfortunately, also limits its use somewhat for those intended readers who are new to the discussed topics.

In the following I will summarize the book, and conclude with some general remarks. (Mayo organizes her book into “excursions” divided into “tours”—we are invited to imagine we are on a cruise—but below I will stick to chapters divided into parts.)

Chapter 1 serves as a warming-up. In the course of laying out the motivation for the book’s project, Mayo introduces *severity* as a requirement for evidence. On the *weak* version of the severity criterion, one does *not* have evidence for a claim *C* if the method used to arrive at evidence *x* would probably have yielded so good a fit with *C* even if *C* were false.

Thus if a statistical philosophy is to tell us what we seek to quantify using probability, then Mayo’s error-statistical philosophy says that this is “well-testedness” or *probativeness*. This she sets apart from *probabilism*, which sees probability as a way of quantifying plausibility of hypotheses (a tenet of the Bayesian approach), but also from *performance*, where probability is a method’s long-run frequency of faulty inferences (the classical, frequentist approach). Mayo is careful, too, to set her philosophy apart from recent efforts to unify or bridge Bayesian and frequentist statistics, approaches that she chastises as “marriages of convenience” that simply look away from the underlying philosophical incongruities. There is here an ambiguity in the nature of Mayo’s project that remains unresolved throughout the book: is she indeed proposing a new perspective “to tell what is true about the different methods of statistics” (p. 28), the view-from-a-balloon that might finally get us beyond the statistics wars, or should we actually see her as joining the fray with yet another competing account? What is certainly clear is that Mayo’s philosophy is much closer to the frequentist than the Bayesian school, so that an important application of the new perspective is to exhibit the flaws of the latter. In the second part of the chapter Mayo immediately gets down to business, revisiting a classic point of contention in the form of the likelihood principle.

In Chapter 2 the discussion shifts to Bayesian confirmation theory, in the context of traditional philosophy of science and the problem of induction. Mayo’s diagnosis is that the aim of confirmation theory is *merely* to try to spell out inductive method, having given up on actually providing justification for it; and in general, that philosophers of science now feel it is taboo to even try to make progress on this account. The latter assessment is not entirely fair, even if it is true that recent proposals addressing the problem of induction (notably those by John Norton and by Gerhard Schurz, who both abandon the idea of a single context-independent inductive method) are still far removed from actual scientific or statistical practice. More interesting than the familiar issues with confirmation theory Mayo lists in the first part of the chapter is therefore the positive account she defends in the second.

Here she discusses falsificationism and how the error-statistical account builds and improves on Popper’s ideas. We read about demarcation, Duhem’s problem, and novel predictions; but also about the replicability crisis in psychology and fallacies of significance tests. In the last section Mayo returns to the question that has been in the background all this time: what is the error-statistical answer to the problem of inductive inference? By then we have already been handed a number of clues: inferences to hypotheses are arguments from strong coincidence, that (unlike “inductive” but really still deductive probabilistic logics) provide genuine “lift-off”, and that (against Popperians) we are free to call warranted or justified. Mayo emphasises that the output of a statistical inference is not a belief; and it is undeniable that for the plausibility of an hypothesis severe testing is neither necessary (the problem of after-the-fact cooked-up hypotheses, Mayo points out, is exactly that they can be so plausible) nor sufficient (as illustrated by the base-rate fallacy). Nevertheless, the envisioned epistemic yield of a (warranted) inference remains agonizingly imprecise. For instance, we read that (sensibly enough) isolated significant results do not count; but when do results start counting, and how? Much is delegated to the dynamics of the overall inquiry, as further illustrated below.

Chapter 3 goes deeper into severe testing: as employed in actual cases of scientific inference, and as instantiated in methods from classical statistics. Thus the first part starts with the 1919 Eddington experiment to test Einstein’s relativity theory, and continues with a discussion of Neyman–Pearson (N–P) tests. The latter are then accommodated into the error-statistical story, with the admonition that the severity rationale goes beyond the usual behavioural warrant of N–P testing as the guarantee of being rarely wrong in repeated application. Moreover, it is stressed, the statistical methods given by N–P as well as Fisherian tests represent “canonical pieces of statistical reasoning, in their naked form as it were” (p. 150). In a real scientific inquiry these are only part of the investigator’s reservoir of error-probabilistic tools “both formal and quasi-formal”, providing the parts that “are integrated in building up arguments from coincidence, informing background theory, self-correcting […], in an iterative movement” (p. 162).

In the next part of Chapter 3, Mayo defends the classical methods against an array of attacks launched from different directions. Apart from some old charges (or “howlers and chestnuts of statistical tests”), these include the accusations arising from the “family feud” between adherents of Fisher and Neyman–Pearson. Mayo argues that the purported different interpretational stances of the founders (Fisher’s more evidential outlook versus Neyman’s more behaviourist position) are a bad reason to preclude a unified view on both methodologies. In the third part, Mayo extends this discussion to incorporate confidence intervals, and the chapter concludes with another illustration of statistical testing in actual scientific inference, the 2012 discovery of the Higgs boson.

The different parts of Chapter 4 revolve around the theme of objectivity. First up is the “dirty hands argument”, the idea that since we can never be free of the influence of subjective choices, all statistical methods must be (equally) subjective. The mistake, Mayo says, is to assume that we are incapable of registering and managing these inevitable threats to objectivity. The subsequent dismissal of the Bayesian way of taking into account—or indeed embracing—subjectivity is followed, in the second part of the chapter, by a response to a series of Bayesian critiques of frequentist methods, and particularly the charge that, as compared to Bayesian posterior probabilities, *P* values overstate the evidence. The crux of Mayo’s reply is that “it’s erroneous to fault one statistical philosophy from the perspective of a philosophy with a different and incompatible conception of evidence or inference” (p. 265). This is certainly a fair point, but could just as well be turned against her own presentation of the error-statistical perspective as a meta-methodology. Of course, the lesson we are actually encouraged to draw is that an account of evidence in terms of severe testing is preferable to one in terms of plausibility. For this Mayo makes a strong case, in the next part, in connection to the need for tools to intercept various illegitimate research practices. The remainder of the chapter is devoted to some other important themes around frequentist methods: randomization, the trope that “all models are false”, and model validation.

Chapter 5 is a relatively technical chapter about the notion of a test’s *power*. Mayo addresses some purported misunderstandings around the use of power, and discusses the notion of *attained* or post-data power, combining elements of N–P and of Fisher, as part of her severity account. Later in the chapter we revisit the replication crisis, and in the last part we are given an entertaining “deconstruction” of the debates between N–P and Fisher. Finally, in Chapter 6, Mayo takes one last look at the probabilistic “foundations lost”, to clear the way for her parting proclamation of the new probative foundations. She discusses the retreat by theoreticians from full-blown subjective Bayesianism, the shaky grounds under objective or default Bayesianism, and attempts at unification (“schizophrenia”) or flat-out pragmatism. Saved till the end, fittingly, is the recent “falsificationist Bayesianism” that emerges from the writings of Andrew Gelman, who indeed adopts important elements of the error-statistical philosophy.

It seems only a plausible if not warranted inductive inference that the statistics wars will rage on for a while; but what, towards an assessment of Mayo’s programme, should we be looking for in a foundational account of statistics? The philosophical attraction of the dominant Bayesian approach lies in its promise of a principled and unified account of rational inference. It appears to be too rigid, however, in suggesting a fully mechanical method of inference: after you fix your prior it is, on the standard conception, just a matter of conditionalizing. At the same time it appears to leave too much open, in allowing you to reconstruct any desired reasoning episode by suitable choice of model and prior. Mayo is very clear that her account resists the first: we are not looking for a purely formal account, a single method that can be mindlessly pursued. Still, the severity rationale is emphatically meant to be restrictive: to expose certain inferences as unwarranted. But the threat of too much flexibility is still lurking in how much is delegated to the messy context of the overall inquiry. If too much is left to context-dependent expert judgment, for instance, the account risks forfeiting its advertised capacity to help us hold the experts accountable for their inferences. This motivates the desire for a more precise philosophical conception, if possible, of what inferences count as warranted and how. What Mayo’s book should certainly convince us of is the value of seeking to develop her programme further, and for that reason alone the book is recommended reading for all philosophers—not least those of the Bayesian denomination—concerned with the foundations of statistics.

***

Sterkenburg, T. (2020). “Deborah G. Mayo: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars”, *Journal for General Philosophy of Science* 51: 507–510. (https://doi.org/10.1007/s10838-019-09486-2).

Excerpts, mementos, and sketches of 16 tours (including links to proofs) are here.

*The eighth meeting of our Phil Stat Forum*:*

**The Statistics Wars
and Their Casualties**

**22 April 2021**

**TIME: 15:00-16:45 (London); 10:00-11:45 (New York, EST)**

**For information about the Phil Stat Wars forum and how to join, click on this link.**

**“How an information metric could bring truce to the statistics wars”**

**Daniele Fanelli**

**Abstract:** Both sides of debates on P-values, reproducibility, and other meta-scientific issues are entrenched in traditional methodological assumptions. For example, they often implicitly endorse rigid dichotomies (e.g. published findings are either “true” or “false”, replications either “succeed” or “fail”, research practices are either “good” or “bad”), or make simplifying and monistic assumptions about the nature of research (e.g. publication bias is generally a problem, all results should replicate, data should always be shared).

Thinking about knowledge in terms of information may clear a common ground on which all sides can meet, leaving behind partisan methodological assumptions. In particular, I will argue that a metric of knowledge that I call “K” helps examine research problems in a more genuinely “meta-“ scientific way, giving rise to a methodology that is distinct, more general, and yet compatible with multiple statistical philosophies and methodological traditions.

This talk will present statistical, philosophical and scientific arguments in favour of K, and will give a few examples of its practical applications.

**Daniele Fanelli** is a London School of Economics Fellow in Quantitative Methodology, Department of Methodology, London School of Economics and Political Science. He graduated in Natural Sciences, earned a PhD in Behavioural Ecology and trained as a science communicator, before devoting his postdoctoral career to studying the nature of science itself – a field increasingly known as meta-science or meta-research. He has been primarily interested in assessing and explaining the prevalence, causes and remedies of problems that may affect research and publication practices, across the natural and social sciences. Fanelli helps answer these and other questions by analysing patterns in the scientific literature using meta-analysis, regression and any other suitable methodology. He is a member of the Research Ethics and Bioethics Advisory Committee of Italy’s National Research Council, for which he developed the first research integrity guidelines, and of the Research Integrity Committee of the Luxembourg Agency for Research Integrity (LARI).

**Fanelli D** (2019) *A theory and methodology to quantify knowledge.* Royal Society Open Science – doi.org/10.1098/rsos.181055. (PDF)

4-page Background: **Fanelli D** (2018) *Is science really facing a reproducibility crisis, and do we need it to?* PNAS – doi.org/10.1073/pnas.1708272114. (PDF)

**See Phil-Stat-Wars.com**

*Meeting 16 of the general Phil Stat series, which began with the LSE Seminar PH500 on May 21.

*A Statistical Model as a Chance Mechanism*

**Jerzy Neyman** **(April 16, 1894 – August 5, 1981)** was a Polish-American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals. (This article was first posted here.)

One of Neyman’s most remarkable, but least recognized, achievements was his adapting of Fisher’s (1922) notion of a statistical model to render it pertinent for non-random samples. Fisher’s original parametric statistical model M_{θ}(**x**) was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data **x**_{0}:=(x_{1},x_{2},…,x_{n}) can be viewed as a ‘truly representative sample’ from that ‘population’:

“The postulate of randomness thus resolves itself into the question, Of what population is this a random sample?” (ibid., p. 313), underscoring that “the adequacy of our choice may be tested a posteriori.” (p. 314)

In cases where data **x**_{0} come from sample surveys, or can be viewed as a typical realization of a random sample **X**:=(X_{1},X_{2},…,X_{n}), i.e. Independent and Identically Distributed (IID) random variables, the ‘population’ metaphor can be helpful in adding some intuitive appeal to the inductive dimension of statistical inference, because one can imagine using a subset of a population (the sample) to draw inferences pertaining to the whole population.

This ‘infinite population’ metaphor, however, is of limited value in most applied disciplines relying on observational data. To see how inept this metaphor is, consider the question: what is the hypothetical ‘population’ when modeling the gyrations of stock market prices? More generally, what is observed in such cases is a certain ongoing process and not a fixed population from which we can select a representative sample. For that very reason, most economists in the 1930s considered Fisher’s statistical modeling irrelevant for economic data!

Due primarily to Neyman’s experience with empirical modeling in a number of applied fields, including genetics, agriculture, epidemiology, biology, astronomy and economics, his notion of a statistical model evolved beyond Fisher’s ‘infinite populations’ in the 1930s into Neyman’s frequentist ‘chance mechanisms’ (see Neyman, 1950, 1952):

Guessing and then verifying the ‘chance mechanism’, the repeated operation of which produces the observed frequencies. This is a problem of ‘frequentist probability theory’. Occasionally, this step is labeled ‘model building’. Naturally, the guessed chance mechanism is hypothetical. (Neyman, 1977, p. 99)

From my perspective, this was a major step forward for several reasons, including the following.

*First*, the notion of a statistical model as a ‘chance mechanism’ extended the intended scope of statistical modeling to include dynamic phenomena that give rise to data from non-IID samples, i.e. data that exhibit both dependence and heterogeneity, like stock prices.

*Second*, the notion of a statistical model as a ‘chance mechanism’ is not only of metaphorical value, but it can be operationalized in the context of a statistical model, formalized by:

M_{θ}(**x**) = {f(**x**;θ), θ∈Θ}, **x**∈R^{n}, Θ⊂R^{m}; m << n,

where the distribution of the sample f(**x**;θ) describes the probabilistic assumptions of the statistical model. This takes the form of a statistical Generating Mechanism (GM), stemming from f(**x**;θ), that can be used to generate simulated data on a computer. An example of such a Statistical GM is:

X_{t} = α_{0} + α_{1}X_{t-1} + σε_{t}, *t=1,2,…,n*

This indicates how one can use *pseudo-random* numbers for the error term ε_{t} ~ NIID(0,1) to simulate data for the Normal, AutoRegressive [AR(1)] Model. One can generate numerous sample realizations, say N = 100,000, of sample size *n* in nanoseconds on a PC.
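As an illustration, the statistical GM above can be operationalized directly on a computer. The following is a minimal Python sketch (mine, not from the original post; the parameter values α₀ = 1, α₁ = 0.5, σ = 1 are arbitrary):

```python
import numpy as np

def simulate_ar1(alpha0, alpha1, sigma, n, n_reps, seed=0):
    """Simulate n_reps realizations of size n from the Normal AR(1) GM:
    X_t = alpha0 + alpha1*X_{t-1} + sigma*e_t,  e_t ~ NIID(0, 1)."""
    rng = np.random.default_rng(seed)
    x = np.zeros((n_reps, n + 1))
    # start every realization at the stationary mean alpha0/(1 - alpha1)
    x[:, 0] = alpha0 / (1.0 - alpha1)
    for t in range(1, n + 1):
        eps = rng.standard_normal(n_reps)  # pseudo-random NIID(0,1) errors
        x[:, t] = alpha0 + alpha1 * x[:, t - 1] + sigma * eps
    return x[:, 1:]

# 1,000 simulated realizations of sample size n = 100
samples = simulate_ar1(alpha0=1.0, alpha1=0.5, sigma=1.0, n=100, n_reps=1000)
print(samples.shape)  # (1000, 100)
```

Each row of `samples` is one realization of the process, generated by the repeated operation of the hypothesized chance mechanism.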

*Third*, the notion of a statistical model as a ‘chance mechanism’ puts a totally different spin on another metaphor widely used by uninformed critics of frequentist inference. This is the ‘long-run’ metaphor associated with the relevant error probabilities used to calibrate frequentist inferences. The operationalization of the statistical GM reveals that the temporal aspect of this metaphor is totally irrelevant for frequentist inference; remember Keynes’s catchphrase “In the long run we are all dead”? Instead, what matters in practice is *repeatability in principle*, not repetition over time. For instance, one can use the above statistical GM to generate the empirical sampling distribution of any test statistic, and thus render operational not only the pre-data error probabilities (the type I and type II error probabilities, as well as the power of a test) but also the post-data probabilities associated with the severity evaluation; see Mayo (1996).
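In the same spirit, pre-data error probabilities can themselves be approximated by simulation. The following is an illustrative sketch of mine (not from the original post), using a simple one-sided z-test of H0: μ = 0 under an IID Normal model rather than the AR(1) model, to render a type I error probability operational by Monte Carlo:

```python
import numpy as np

# Approximate the type I error rate of the one-sided z-test of H0: mu = 0
# at nominal level 0.05, generating the data under H0 itself.
rng = np.random.default_rng(42)
n, n_reps = 50, 100_000
z_crit = 1.645  # one-sided 5% critical value of N(0, 1)

data = rng.standard_normal((n_reps, n))     # 100,000 samples drawn under H0
z_stats = data.mean(axis=1) * np.sqrt(n)    # z statistic for each sample
rejection_rate = np.mean(z_stats > z_crit)  # empirical type I error

print(round(float(rejection_rate), 3))  # close to the nominal 0.05
```

Replacing the IID Normal generator with the AR(1) GM above, and the z statistic with any other test statistic, yields that statistic’s empirical sampling distribution in exactly the same way.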

**I have restored all available links to the following references.**

For further discussion on the above issues see:

Spanos, A. (2013), “A Frequentist Interpretation of Probability for Model-Based Inductive Inference,” *Synthese*.

Fisher, R. A. (1922), “On the mathematical foundations of theoretical statistics,” *Philosophical Transactions of the Royal Society* A, 222: 309-368.

Mayo, D. G. (1996), *Error and the Growth of Experimental Knowledge*, The University of Chicago Press, Chicago.

Neyman, J. (1950), *First Course in Probability and Statistics*, Henry Holt, NY.

Neyman, J. (1952), *Lectures and Conferences on Mathematical Statistics and Probability*, 2nd ed. U.S. Department of Agriculture, Washington.

Neyman, J. (1977), “Frequentist Probability and Frequentist Statistics,” *Synthese*, 36, 97-131.

[i] He was born in an area that was part of Russia.

**Today is Jerzy Neyman’s birthday (April 16, 1894 – August 5, 1981).** I’m posting a link to a quirky paper of his that explains one of the most misunderstood of his positions–what he was opposed to in opposing the “inferential theory”. The paper is Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making’.[i] It’s chock full of ideas and arguments. “In the present paper,” he tells us, “the term ‘inferential theory’…will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation…either a substitute *a priori* distribution [exemplified by the so-called principle of insufficient reason] or a new measure of uncertainty” such as Fisher’s fiducial probability. It arises on p. 391 of Excursion 5 Tour III of *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP). Here’s a link to the proofs of that entire tour. If you hear Neyman rejecting “inferential accounts”, you have to understand it in this very specific way: he’s rejecting “new measures of confidence or diffidence”. Here he alludes to them as “easy ways out”. He is not rejecting statistical inference in favor of behavioral performance, as typically thought. Neyman always distinguished his error-statistical performance conception from Bayesian and fiducial probabilisms.[ii] The surprising twist here is semantical, and the culprit is none other than…Allan Birnbaum. Yet Birnbaum gets short shrift, and no mention is made of our favorite “breakthrough” (or did I miss it?).[iii] You can find quite a lot on this blog by searching Birnbaum.

Note: In this article,”attacks” on various statistical “fronts” refers to ways of attacking problems in one or another statistical research program.

**HAPPY BIRTHDAY NEYMAN!**

What doesn’t Neyman like about Birnbaum’s advocacy of a Principle of Sufficiency S (p. 25)? He doesn’t like that it is advanced as a normative principle (e.g., about when evidence is or ought to be deemed equivalent) rather than a criterion that does something for you, such as control errors. (Presumably it is relevant to a type of context, say parametric inference within a model.) S is put forward as a kind of principle of rationality, rather than one with a rationale in solving some statistical problem.

“The principle of sufficiency (S): If E is specified experiment, with outcomes x; if t = t (x) is any sufficient statistic; and if E’ is the experiment, derived from E, in which any outcome x of E is represented only by the corresponding value t = t (x) of the sufficient statistic; then for each x, Ev (E, x) = Ev (E’, t) where t = t (x)… (S) may be described informally as asserting the ‘irrelevance of observations independent of a sufficient statistic’.”

Ev(E, x) is a metalogical symbol referring to the evidence from experiment E with result x. The very idea that there is such a thing as an evidence function is never explained, but to Birnbaum “inferential theory” required such things. (At least that’s how he started out.) The view is very philosophical, and it inherits much from logical positivism and logics of induction. The principle S, and also other principles of Birnbaum, have a normative character: Birnbaum considers them “compellingly appropriate”.
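For concreteness, the mathematical fact behind S can be displayed in a few lines (an illustrative sketch of mine, not Birnbaum’s or Neyman’s): in a Bernoulli experiment, outcomes sharing the same value of the sufficient statistic t = Σx give rise to the same likelihood function.

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)  # grid of parameter values

def bernoulli_likelihood(x, theta):
    t = sum(x)  # sufficient statistic: the number of successes
    return theta**t * (1 - theta)**(len(x) - t)

# Two different outcome sequences with the same sufficient statistic t = 3
x1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
x2 = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# Their likelihood functions coincide at every theta, which is the sense
# in which x is "represented only by" t = t(x).
print(np.allclose(bernoulli_likelihood(x1, theta),
                  bernoulli_likelihood(x2, theta)))  # True
```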

“The principles of Birnbaum appear as a kind of substitutes for known theorems,” Neyman says. For example, various authors proved theorems to the general effect that the use of sufficient statistics will minimize the frequency of errors. But if you just start with the rationale (minimizing the frequency of errors, say), you wouldn’t need these “principles” from on high, as it were. That’s what Neyman seems to be saying in his criticism of them in this paper. Do you agree? He has the same gripe concerning Cornfield’s conception of a default-type Bayesian account akin to Jeffreys. Why?

[i] I am grateful to @omaclaran for reminding me of this paper on twitter in 2018.

[ii] Or so I argue in my *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*, 2018, CUP.

[iii] Do you think Neyman is using “breakthrough” here in reference to Savage’s description of Birnbaum’s “proof” of the (strong) Likelihood Principle? Or is it the other way round? Or neither? Please weigh in.

REFERENCES

**Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making‘, Revue De l’Institut International De Statistique / Review of the International Statistical Institute, 30(1), 11-27.**