
There was a very valuable panel discussion after my October 9 Neyman Seminar in the Statistics Department at UC Berkeley. I want to respond to many of the questions put forward by the participants (Ben Recht, Philip Stark, Bin Yu, Snow Zhang) that we did not address during that panel. Slides from my presentation, “Severity as a basic concept of philosophy of statistics,” are at the end of this post (but with none of the animations). I begin in this post by responding to Ben Recht, a professor of Artificial Intelligence and Computer Science at Berkeley, and his recent blogpost on my talk, What is Statistics’ Purpose? On severe testing, regulation, and butter passing. I will consider: (1) a complex or leading question; (2) why I chose to focus on Neyman’s philosophy of statistics; and (3) what the “100 years of fighting and browbeating” were/are all about.
(1) A complex or leading question.
A question Recht submitted to the panel is this:
Even if the epistemological value of statistical tests is highly questionable, is it reasonable to use statistical tests as benchmarks for regulatory approval (say for drugs or policy)?
Where has it been shown that the epistemological value of statistical tests is highly questionable? I follow Recht in using “statistical tests” to refer to statistical significance tests. One can affirm the value of statistical tests for regulatory approval, as he does, while denying the allegation, voiced by some, that tests lack epistemological value. The criticisms of statistical tests run to type. Either they are based on:
(A) misinterpretations or misuses of tests, or
(B) assuming notions of evidence and inference that are at odds with those underlying statistical tests.
If Recht knows of others that don’t fit under these umbrellas, I’d be interested to hear. The best-known examples of (A) are: interpreting p-values as posterior probabilities, unwarranted moves from statistical to substantive claims and magnitude errors, taking no evidence against H0 as evidence for it, and illicit error probabilities due to biasing selection effects (e.g., multiple testing, optional stopping, cherry picking, outcome switching, etc.). For the main examples of (B), see slides 58-60 of my talk.
Recht is right to stress that “statistics asks many different kinds of questions”, and that statistical (significance) tests are only a small part of a rich methodology I dub error statistics (itself a proper subset of statistics), but this does not show they lack epistemological value for the problems they are intended to address. Severity makes this explicit.
The fallacy of statistical affirming the consequent
I realize that there are misuses of statistical tests that, in some circles, are so baked-in that they are thought to actually be licensed by tests, notably the view that rejecting a statistical test (or null) hypothesis H0 warrants a substantive research claim H*. In places, Recht seems to associate my notion of severe testing with this illicit notion, often associated with something called NHST, but this is precisely what severe testing denies. He refers to Paul Meehl:
In Meehlian language, the verisimilitude of claims can be tested by experiment, and we want to demonstrate a “damned strange coincidence” to convince ourselves that our claim is causally associated with our experiment. For falsificationists, the more surprising the experimental outcome, the more we are assured our claim resembles the truth.
It is important to emphasize that the only claim severely tested by dint of statistically significant results is the denial of the test or null hypothesis H0. A claim C is severely tested only by passing a test that C would probably have failed were C false. That statistically significant results would be very surprising (i.e., very improbable) under H0, say that there’s no positive effect, does not warrant with severity the truthlikeness of some alternative claim C (however one likes to define verisimilitude), even if C “explains” or entails the effect. A falsificationist would not be “assured our claim resembles the truth” simply because there’s evidence of a genuine effect, not due to chance. Finding a genuine discrepancy from H0 only gives evidence of an incompatibility with or discrepancy from H0—of the particular sort the test is probing.
Recht has a series of interesting blogposts on Meehl, so I might note that when Meehl and the “damned strange coincidence” come up in my Statistical Inference as Severe Testing: How to get beyond the statistics wars [SIST] (CUP, 2018), I emphasize the difference between ruling out coincidence and finding evidence for research hypothesis H*.
For the corroboration to be strong, we have to have ‘Popperian risk’, … ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon called a highly improbable coincidence [“damn strange coincidence”]. (Meehl and Waller 2002, p. 284)
Yet we mustn’t blur an argument from coincidence merely to a real effect with one that underwrites arguing from coincidence to research hypothesis H*.
…Meehl’s critiques [of NHST] rarely mention the methodological falsificationism of Neyman and Pearson. Why is the field that cares about power—which is defined in terms of N-P tests—so hung up on simple significance tests?…With N-P tests, the statistical alternative to the null hypothesis is made explicit: the null and alternative exhaust the possibilities. There can be no illicit jumping of levels from statistical to causal (from H1 to H*). Fisher didn’t allow the illicit move either, but he was less explicit. (SIST 95-6)
That’s one big reason Neyman-Pearson (N-P) tests are so relevant for today’s hand-wringing about illicit statistical affirming the consequent.
Is Recht suggesting that the logic of statistical significance tests purports to warrant evidence for the truth of a substantive alternative H*? Something often called NHST is portrayed as warranting the illicit inference from statistical to substantive claims, and that is why, in my discussion with Gelman, I say we should move away from “NHST”–a term which was never an official designation in statistical testing. Also, NHST is often described as using the point nil null. The severe tester (in sync with both Neyman and with Cox) would consider one-sided tests, or two one-sided tests with an adjustment for selection.
We are not merely interested in inferring the existence of a discrepancy; we generally wish to infer those discrepancies (or population effect sizes) that are well or poorly tested. The severe tester considers a number of alternatives to the reference hypothesis H0, and reports severity curves (as in slide #62). Note that severity decreases as the test’s power at the corresponding alternative increases.[1]
(2) On why I chose to speak about Neyman.
In answer to Recht’s question as to why I focus on Neyman (also Fisher, Cox, Lehmann, Pearson) rather than some other great statisticians, let me give four main reasons:
(1) I am giving a Neyman seminar,
(2) Neyman’s (and Pearson’s) development of Fisherian tests avoids the central fallacy that Recht is on about (moving from rejecting H0 to inferring a substantive claim H*), and
(3) Neyman’s construal of tests in terms of error statistical performance provides a crucial strategy that takes us beyond the traditional problems of induction, and also beyond Popper (although Popper could have used N-P tests to flesh out his notion of “methodological falsification”). Recht is in favor of the performance, acceptance-sampling philosophy. I don’t think it goes far enough for using tests to make inferences with stringency. A fourth reason that I didn’t take up in my talk:
(4) Neyman’s applied papers provide important insights about the value of statistical models for finding out true things, even though the models themselves are at best approximations. They still enable adequate error probability control. I especially like his discussion of a conjecture and refutation exercise of a model for pest control (SIST, exhibit xii, p. 299 in Excursion 4 Tour iv).
Formal statistical tests give us error probabilities defined in terms of the sampling distribution of a test statistic. The more Fisherian construal focuses on the attained p-value: “pobs is the probability that we would mistakenly declare there to be evidence against H0, were we to regard the data under analysis as just decisive against H0.” (Cox and Hinkley 1974, 66.) (See slide #32.) Neyman and Pearson view a statistical test as a rule that maps observed values of an appropriate test statistic d(x) into either “reject H0” or “do not reject H0” in such a way that there is a low probability of erroneously rejecting H0 and a much higher probability of correctly rejecting H0. “Reject H0” and “fail to reject H0” are generally interpreted as x is evidence against H0, or x fails to provide evidence against H0 (which is not the same as evidence for H0).
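To make the two construals concrete, here is a minimal sketch in Python (the numbers are my own illustrative choices, not drawn from Cox and Hinkley or from the slides) of a one-sided z-test of H0: μ ≤ 0 with known σ, computing both the attained p-value and the N-P reject/do-not-reject output:

```python
# Minimal sketch of the two construals of a one-sided z-test of
# H0: mu <= 0 vs H1: mu > 0, sigma known. All numbers are illustrative.
import numpy as np
from scipy.stats import norm

mu0, sigma, n, alpha = 0.0, 1.0, 100, 0.05
xbar = 0.21                                   # hypothetical observed sample mean
d_obs = (xbar - mu0) / (sigma / np.sqrt(n))   # observed test statistic d(x)

# Fisherian construal: report the attained p-value
p_obs = norm.sf(d_obs)                        # P(d(X) >= d_obs; H0)

# Neyman-Pearson construal: map d(x) into "reject H0" or "do not reject H0"
c_alpha = norm.ppf(1 - alpha)                 # cutoff with P(reject; H0) = alpha
decision = "reject H0" if d_obs >= c_alpha else "do not reject H0"

print(f"d_obs = {d_obs:.2f}, attained p-value = {p_obs:.3f}, N-P decision: {decision}")
```

With these made-up values, d(x) is about 2.1, so the attained p-value is roughly 0.018 and the α = 0.05 rule outputs “reject H0”: the two construals report different things from the same sampling distribution.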
Contrary to what is often supposed, N-P testers also advocate reporting attained p-values post-data. Erich Lehmann, Neyman’s first Ph.D. student at Berkeley, makes this clear. See my slides (#30-31) from Lehmann’s (1993) “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One theory or two?”.
So error statistical tests supply error probabilities associated with test outputs. Granted, there is a missing premise that moves from them to any inference, claim, decision or any other output of a test. That is what the severity concept and severity requirements supply for contexts where the goal is finding something out, solving a statistical problem or critically scrutinizing a conjectured solution to a problem. That is why severity is a basic concept for philosophy of statistics—as in my title.
(3) What are the 100 years of fighting and browbeating all about?
According to Recht:
If, after 100 years of fighting and browbeating, we see that statistical testing consistently fails to be severe testing, then it’s pretty silly to keep teaching our students that statistical tests are severe tests. (Recht post)
I am not sure how Recht is viewing the “100 years of fighting and browbeating.” Does he think they are over the well-known fallacy of moving from statistical to substantive significance–affirming the consequent, discussed in (1) above? The only claim inferred with severity from a rejection of H0 is its denial (in relation to the test statistic). The fact that Meehl was wrestling with fallacious uses of tests in psychology does not alter what tests actually do, nor provide grounds to suppose that statisticians have been teaching their students that a statistically significant effect automatically warrants a substantive claim H*. The error probabilities of tests do not apply to H*, unless it is tantamount to the denial of H0. So the “100 years of fighting and browbeating” is not over whether statistical tests supply ways to move directly from statistical to substantive claims, from correlational to causal claims or the like. Those have been well-known fallacies for donkey’s years.
Perhaps Recht means that there has been 100 years of fighting and browbeating over whether statistical tests can supply tests with good error probabilities (which is all they strictly claim to do), but that is not so. We know they can supply them. (Neyman also developed confidence interval estimation as the inverse of tests, with corresponding coverage probabilities.) The 100 years of fighting is over whether error probabilities matter for statistical inference, given they do not supply degrees of support, belief, or probability of statistical hypotheses. The 100 years of fighting, in other words, is over (frequentist) error statistical performance versus Bayesian (or other) probabilisms. It is conceptual and philosophical. Recht does not say anything about this here, and I’d still like to know what he thinks. Notice that Bayesian confirmation does permit moving from rejecting H0 to a substantive H* insofar as H* receives a Bayes boost (is made more probable). The evidence from the data for updating, or for Bayes factors, is in the likelihood ratio (see slide #38) [the likelihood principle]. This is at odds with error probabilities.
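For readers unfamiliar with the term, a “Bayes boost” simply means that the posterior probability of H* exceeds its prior, which happens whenever the likelihood ratio favors H*. Here is a toy sketch with invented numbers (nothing here comes from the post or the slides):

```python
# Toy illustration of a "Bayes boost": H* is made more probable by data x
# whenever the likelihood ratio P(x|H*)/P(x|~H*) exceeds 1. Numbers are invented.
prior = 0.2            # P(H*)
lik_H = 0.9            # P(x | H*)
lik_notH = 0.3         # P(x | ~H*)

posterior = (lik_H * prior) / (lik_H * prior + lik_notH * (1 - prior))
bayes_factor = lik_H / lik_notH

print(f"Bayes factor = {bayes_factor:.1f}, prior = {prior}, posterior = {posterior:.2f}")
# posterior (~0.43) > prior (0.2): H* gets a Bayes boost. The likelihood ratio
# is unchanged by selection effects such as data-dredging or optional stopping,
# which is the error statistician's complaint.
```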
I might note that no formal statistical methods make use of the term I dub “severity”, although I thank Meehl for mentioning me (in the same breath as Popper and Salmon!). Severity leads to reformulating tests so as to avoid classic fallacies of rejection and non-rejection. I observe in SIST:
none of the [existing] formal notions directly give severity assessments. There isn’t even a statistical school or tribe that has explicitly endorsed this goal. I find this perplexing. That will not preclude our immersion into the mindset of a futuristic tribe whose members use error probabilities for assessing severity; it’s just the ticket for our task: understanding and getting beyond the statistics wars. We may call this tribe the severe testers. (SIST, 9)
So what of Recht’s question of the purpose of statistical tests?
The purposes of statistics, of course, are enormously broad, and I will leave that to statisticians. But the purposes of statistical significance tests are, first and foremost, as Benjamini puts it, to supply our “first line of defense against being fooled by randomness” (2016, p. 1). If an observed effect is explainable as due to chance variability, then we’d very probably fail to reliably generate statistically significant results. It is by dint of non-significant results that failed replication is identified, and, notice, critics of tests presuppose the use of tests for this critical role. (The Replication Paradox.)
By dint of statistically significant results, on the other hand, tests uncover discordancies and inconsistencies between data and a reference or test hypothesis H0. This is the essence of model criticism. In the severe tester’s formulation, attained p-values may be used to indicate the extent of discrepancies that are well or poorly warranted. This is conjecture and refutation, followed by new conjecture which is then put to the test (combined with background theory), and the error probing occurs all along the path from data collection through modeling to inference. (I see this as akin to Bin Yu’s veridical data science, which I’m only recently learning about.)
The introspection Recht calls for at the end of his post, as I see it, is an invitation to the philosophical and conceptual considerations that I discuss in my talk (probabilism, performance, and probativism).
For statisticians, this means there’s a need for introspection about what the field is for. …Statistics asks many different kinds of questions, but we confuse our students because the methods often look the same. Are we trying to quantify the verisimilitude of a theory or assertion? Or are we trying to quantify the error in a measurement to aid decision making? We need to speak with clarity about this. …until we disambiguate use, we will have more dumb arguments about what p-value thresholds mean for the replicability of science.
Whether one wants to call it measuring plausibility, probability, support, or verisimilitude, these would all fall under what I call probabilisms. Highly probable (however one measures it) differs from highly well probed, and distinct tools are needed for these different goals. Moreover, good performance is necessary but not sufficient for severe testing. (This was the distinction between Fisher and Neyman that Lehmann draws at the end of his paper.) In addition to clarifying use, probabilists need to clarify their chosen measures. At present, there are subjective, objective, empirical, and pragmatic accounts, with many systems within each, and little agreement on which to use or how to interpret them. How the goals of AI/ML and data science more generally fit in is yet a distinct question.
Recht’s final point leads me to add one more remark. He says:
So I’ll close [with] a word to the folks who like to philosophize about statistics, whether they be philosophers, statisticians, or bloggers. We need less focus on Popper’s modus tollens and more on his piecemeal social engineering. (Recht post)
The severe tester is happy to promote Popper’s piecemeal engineering for problems of policy reform: it is precisely akin to what is right-headed in his view of solving problems piecemeal. The need for democratic checks, and for considering how any one side on controversial policies may be wrong, leading to unintended consequences, is crucial. That was the gist of my (2021) editorial in Conservation Biology, “The statistics wars and intellectual conflicts of interest”, on the policy of abandoning significance and p-value thresholds. The editorial is here.
I’m grateful to statistician Philip Stark, also a panelist at my Berkeley talk, for his published comment (in Conservation Biology) on my editorial. In his view:
I also agree with Prof. Mayo’s thesis that abandoning P-values exacerbates moral hazard for journal editors, although there has always been moral hazard in the gatekeeping function. Absent any objective assessment of the agreement between the data and competing theories, publication decisions may be even more subject to cronyism, “taste,” confirmation bias, etc.
Throwing away P-values because many practitioners don’t know how to use them is like banning scalpels because most people don’t know how to perform surgery. Those who would perform surgery should be trained in the proper use of scalpels, and those who would use statistics should be trained in the proper use of P-values. (Stark 2022, 1)
I ended my editorial as follows:
The key function of statistical tests is to constrain the human tendency to selectively favor views they believe in. There are ample forums for debating statistical methodologies. There is no call for executive directors or journal editors to place a thumb on the scale. Whether in dealing with environmental policy advocates, drug lobbyists, or avid calls to expel statistical significance tests, a strong belief in the efficacy of an intervention is distinct from its having been well tested. Applied science will be well served by editorial policies that uphold that distinction.
Notes
[1] For example, in a one-sided test (of the mean): H0: μ ≤ μ0 vs H1: μ > μ0, if the power of the test to detect μ’ is high (i.e., POW(μ’) is high), then a just statistically significant result is poor evidence that μ > μ’: the severity associated with inferring μ > μ’ is low.
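Here is a minimal numerical sketch of this point for the one-sided Normal (known σ) test, using my own illustrative numbers; the severity formula follows the treatment of test T+ in SIST, though the code itself is not from the book or the slides:

```python
# Severity vs. power for test T+: H0: mu <= mu0 vs H1: mu > mu0, sigma known.
# For a just statistically significant xbar (at the cutoff), inferring mu > mu1:
#   SEV(mu > mu1) = P(Xbar <= xbar_obs; mu = mu1)
#   POW(mu1)      = P(Xbar >= cutoff;   mu = mu1)
import numpy as np
from scipy.stats import norm

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025       # illustrative values
se = sigma / np.sqrt(n)
cutoff = mu0 + norm.ppf(1 - alpha) * se          # reject H0 when xbar >= cutoff
xbar_obs = cutoff                                # a "just significant" result

for mu1 in np.round(np.arange(0.0, 0.81, 0.2), 1):
    sev = norm.cdf((xbar_obs - mu1) / se)        # severity for inferring mu > mu1
    pow_ = norm.sf((cutoff - mu1) / se)          # power of the test at mu1
    print(f"mu1 = {mu1:.1f}:  POW = {pow_:.3f},  SEV(mu > mu1) = {sev:.3f}")

# For a just-significant result, SEV(mu > mu1) = 1 - POW(mu1): the higher the
# power at mu1, the lower the severity for inferring mu > mu1.
```

For instance, at an alternative μ’ where the power is about 0.85, the severity for inferring μ > μ’ from a just-significant result is only about 0.15.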
My slides from the Neyman Seminar are below (pdf):
Severity as a basic concept in philosophy of statistics



Seems like this has been a seminal event encompassing statistical analysis, modeling in general, severe testing, and modern challenges in AI and ML. Was it recorded?
What has not been mentioned there (apparently) is the challenge of the generalizability of findings. Severe testing is one aspect of it. There are of course also others. For example, the design of blocks in experiments, together with the application of random effects, can be used to ensure generalizability. Non-significant block-by-treatment effects enhance a claim for generalizability. Another aspect of generalizability that needs to be addressed is how to present claims in a generalizable form. A proposal based on a boundary of meaning (BOM) that delineates alternatives that can be mapped as relevant to the existing data collection mechanism and data analysis is presented in https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070.
In summary, there are three aspects of generalizability worth considering: 1. how to design a study, 2. how to analyze a study, 3. how to present findings. Quoting the preface of Cuthbert Daniel’s 1976 book: “The major contribution of a statistician to an experimenter is to help him obtain more valid, that is to say, more general, more broadly based, results”. Mayo’s seminar seems to address this in some way. An attempt to operationalize the work of an error statistician, with a case study, is provided in https://chemistry-europe.onlinelibrary.wiley.com/doi/full/10.1002/ansa.202000159 – not an easy task….
Ron:
I’m sure it would have interested you. Unfortunately, it was not recorded, because they hadn’t planned for it–they only realized at the last minute. I could have tried recording it on my device, but that would have delayed the start of the seminar. I’m trying to write up what I have in my notes and ask the others what they have.
But what do you think of Recht’s post? His position surprised me, actually, even though I had seen some of his blog. Do you think it’s that, in AI, inferential statistics is only used for quality control?
Mayo
Quality control is one aspect but definitely not the only one. In fact, I had a discussion on this in the Q&A following Neil Lawrence’s talk at ENBIS in 2016. Neil is now the DeepMind Professor of Machine Learning at Cambridge. He made a similar argument to Ben Recht’s. My counterargument is that statistics should contribute to the discovery process and not be considered a sanitation discipline used to ensure quality and clean-ups. You can view this at the end of the following
My comment is at 59:29….
Sorry, it is 1:02:00
PS I uploaded today a manuscript on generalizability in measurement system uncertainty studies: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4996846
Ben Recht wrote a blogpost in response to this post: https://www.argmin.net/p/a-use-theory-of-testing
I wrote a reply to Ben Recht’s new blogpost in which he continues the discussion with me.
https://www.argmin.net/p/a-use-theory-of-testing
Ben:
Thank you for continuing the discussion. Here are some replies to what you say in the first portion (I’ll come back to the rest later):
Recht: “Now, it is 100% clear by a scan of any existing literature that statistical tests in science do not [provide an objective tool to distinguish genuine from spurious effects]. Statistical tests do not and have not revealed truths about objective reality.”
Objective tools for distinguishing genuine from spurious effects is not the same as tools for revealing “truths about objective reality”—whose meaning is unclear.
Recht: “Statistical tests most certainly do not constrain two scientists from completely disagreeing about how to interpret the same data.”
Who says that distinguishing genuine from spurious effects precludes “two scientists from completely disagreeing about how to interpret the same data”? I don’t understand why Recht thinks that tools that control the probability of erroneous interpretations of data would preclude disagreement. Scientists must give reasons for their disagreement that respect the evidence. Failed replications sometimes result in the initial researchers blaming the replication, leading, in turn, to examining the allegation. A new replication may be carried out to avoid the criticism. That too is progress in a conjecture and refutation exercise.
Recht: “If Stark and Benyamini are right, this in turn means that consumers of statistical tests, despite hundreds of years of statistician browbeating (my phrasing), have all been using them incorrectly.”
Stark and Benjamini are right, and it in no way follows “that consumers of statistical tests, despite hundreds of years of statistician browbeating …have all been using them incorrectly”. The tests are tools for falsification. Inability to falsify (statistically) a null hypothesis is a way to block erroneously inferring evidence of a genuine effect. Such negative results are at the heart of the so-called “replication crisis”. When nominally significant results are due to multiple testing, data-dredging, outcome switching and the like, it is unsurprising that the effects disappear when independent groups seek to replicate them with more stringent protocols. The replication crisis, where there is one, is evidence of how tests are used to avoid being fooled by randomness.
I think it is very important to understand what the 100 years of controversy is all about–I’ll come back to this.
Recht: “I made a lot of physicists angry yesterday arguing” against the use of p-values in the Higgs discovery.
The Higgs discovery is an excellent case study for examining the important role of statistical tests in science, as well as illuminating controversies (ever since Lindley accused physicists of “bad science”). In my 10 year review of the Higgs episode, I discuss the value of negative statistical results. https://errorstatistics.com/2022/07/04/10-years-after-the-july-4-statistical-discovery-of-the-the-higgs-the-value-of-negative-results/
From the 10 year review: It turned out that the promising bump or “resonance” (a great HEP term) disappeared as more data became available, drowning out the significant indications seen in April. Its reality was falsified. …While disappointing to physicists, this negative role of significance tests is crucial for denying BSM [Beyond Standard Model] anomalies are real, and setting upper bounds for these discrepancies with the SM Higgs.
I’ll return to the second portion of Recht’s post in another comment.
I continue my response to the remainder of Recht’s latest post: https://www.argmin.net/p/a-use-theory-of-testing
Recht: “However, communities can come together and define proper use of p-values for their applications. As Benjamini puts it, p-values require minimal formal set-up to define. In almost all applications, they are pretty easy to compute. If you need to set some easily computable quality control standard based on statistical sampling, why not use Fisherian NHSTs or Neymanian confidence intervals? Null hypothesis tests are just rules. They are rules set by particular communities to advance particular agendas.”
Yes, they are humanly defined rules, but they give the methods real capabilities. Neyman and Pearson couched them in terms of error probability control—but that can be extended.
There’s a serious language issue that still needs clearing up, or it will be hard to proceed.
Recht: “Look at AB Testing in the tech industry. Does anyone believe that AB tests severely test the veracity of claims about software widgets? Not anyone I’ve met. But AB tests are undeniably useful. … They are an imperfect convention, but a reasonable tool for quality control. This isn’t too far off from FDA drug testing. Are drug trials perfect at discovering which drugs are useful? No, but they provide a convenient framework to set regulatory standards against pushing toxic pharmaceuticals onto the market.”
Let’s clear up some terminology. Recht says no one believes an assertion like “AB tests severely test the veracity of claims about software widgets”, but why not? A claim C about a software widget, e.g., a search function (and I admit I had to look up “software widget”), might be that search functions are found useful in word processing. If a test method finds that data strongly accord with C, when it would not have found such strong accordance were C false, then C passes a severe test—whether a formal statistical test is used or not. So I suspect Recht has an idiosyncratic meaning in mind. Again, he asks “Are drug trials perfect at discovering which drugs are useful? No.” Severe tests are not discovery tools, let alone perfect ones. They always have error probabilities, or error-probing properties, attached.
Recht: “Neither of these example applications of statistical testing is about preventing someone from being fooled by randomness.”
Why not? Denying that a data-dredged subgroup that happens to show a treatment benefit warrants an inference to a genuine benefit is a way of preventing doctors, investors, and patients from being fooled by randomness.
Recht: “Statistical tests are used in these applications because they give transparent, reasonably defined standards. Statistical power calculations standardize sample sizes to achieve agreed upon thresholds of acceptance. Parties agree in advance that 100, 1000 or 10,000 suffice for a community to agree to proceed. They provide a reasonably efficient means of settling disputes between stakeholders. I think this interpretation of tests and their purpose is much closer to what Mayo herself says at the end of her post.”
Deborah Mayo: “The key function of statistical tests is to constrain the human tendency to selectively favor views they believe in.”
Recht: “The purpose of statistical tests is regulation”.
We may accept that “the purpose of statistical tests is regulation”, where this is not limited to policy regulation but extends to regulating conclusions, interpretations of data, inferences, and purported solutions to statistical problems–essentially in sync with Neyman’s quote in slide #34. But the reason tools can serve a purpose is that they actually have certain properties that make them capable of doing the job. Regulating performance, on average, can conflict with reliably regulating which claims are warranted in the case at hand. Test specification, in any particular case, will depend on the goals at hand.
Interesting discussion that brings an opportunity for new perspectives. In clinical trials that plan an interim analysis (analyzing the data halfway through the trial), one either stops the trial or continues following that analysis. If you stop it, your claim can be either that the treatment does not work or that it does. This is similar to a two-stage acceptance sampling procedure.
The argument for having an interim analysis is grounded in ethical considerations. Severe testing could also be justified on ethical grounds. Has anyone done so?