At the end of this post is “A recap of recaps”, the short video we showed at the beginning of Session 3 last week. It summarizes the presentations from Sessions 1 & 2, held on 22-23 September.

**The Statistics Wars**

**and Their Casualties**

**1 December and 8 December 2022**
**Sessions #3 and #4**

**15:00-18:15 London Time / 10:00am-1:15pm EST**

**For slides and videos of Sessions #1 and #2: see the workshop page**

**Session 3** (Moderator: Daniël Lakens, Eindhoven University of Technology)

**OPENING**

**“What Happened So Far”:** A medley (20 min) of recaps from Sessions 1 & 2: Deborah Mayo (Virginia Tech), Richard Morey (Cardiff), Stephen Senn (Edinburgh), Daniël Lakens (Eindhoven), Christian Hennig (Bologna) & Yoav Benjamini (Tel Aviv).

**SPEAKERS**

- **Daniele Fanelli** (London School of Economics and Political Science): *The neglected importance of complexity in statistics and Metascience* (Abstract)
- **Stephan Guttinger** (University of Exeter): *What are questionable research practices?* (Abstract)
- **David J. Hand** (Imperial College, London): *What’s the question?* (Abstract)

**DISCUSSIONS**:

- Closing Panel:
**“Where Should Stat Activists Go From Here (Part i)?”**: Yoav Benjamini, Daniele Fanelli, Stephan Guttinger, David Hand, Christian Hennig, Daniël Lakens, Deborah Mayo, Richard Morey, Stephen Senn

**Session 4** (Moderator: Deborah Mayo, Virginia Tech)

**SPEAKERS**

- **Jon Williamson** (University of Kent): *Causal inference is not statistical inference* (Abstract)
- **Margherita Harris** (London School of Economics and Political Science): *On Severity, the Weight of Evidence, and the Relationship Between the Two* (Abstract)
- **Aris Spanos** (Virginia Tech): *Revisiting the Two Cultures in Statistical Modeling and Inference as they relate to the Statistics Wars and Their Potential Casualties* (Abstract)
- **Uri Simonsohn** (Esade Ramon Llull University): *Mathematically Elegant Answers to Research Questions No One is Asking (meta-analysis, random effects models, and Bayes factors)* (Abstract)

**DISCUSSIONS**:

- Closing Panel:
**“Where Should Stat Activists Go From Here (Part ii)?”**: Workshop Participants: Yoav Benjamini, Alexander Bird, Mark Burgman, Daniele Fanelli, Stephan Guttinger, David Hand, Margherita Harris, Christian Hennig, Daniël Lakens, Deborah Mayo, Richard Morey, Stephen Senn, Uri Simonsohn, Aris Spanos, Jon Williamson

**********************************************************************

**DESCRIPTION:** While the field of statistics has a long history of passionate foundational controversy, the last decade has, in many ways, been the most dramatic. Misuses of statistics, biasing selection effects, and high-powered methods of big-data analysis have made it easy to find impressive-looking but spurious results that fail to replicate. As the crisis of replication has spread beyond psychology and the social sciences to biomedicine, genomics, machine learning and other fields, the need for critical appraisal of proposed reforms is growing. Many are welcome (transparency about data, eschewing mechanical uses of statistics); some are quite radical. The experts do not agree on the best ways to promote trustworthy results, and these disagreements often reflect philosophical battles, old and new, about the nature of inductive-statistical inference and the roles of probability in statistical inference and modeling. Intermingled in the controversies about evidence are competing social, political, and economic values. If statistical consumers are unaware of the assumptions behind rival evidence-policy reforms, they cannot scrutinize the consequences that affect them. What is at stake is a critical standpoint that we may increasingly be in danger of losing. Critically reflecting on proposed reforms and changing standards requires insights from statisticians, philosophers of science, psychologists, journal editors, economists and practitioners from across the natural and social sciences. This workshop will bring together these interdisciplinary insights, from speakers as well as attendees.

**Speakers/Panellists:**

**Yoav Benjamini** (Tel Aviv University), **Alexander Bird** (University of Cambridge), **Mark Burgman** (Imperial College London), **Daniele Fanelli** (London School of Economics and Political Science), **Roman Frigg** (London School of Economics and Political Science), **Stephan Guttinger** (University of Exeter), **David Hand** (Imperial College London), **Margherita Harris** (London School of Economics and Political Science), **Christian Hennig** (University of Bologna), **Daniël Lakens** (Eindhoven University of Technology), **Deborah Mayo** (Virginia Tech), **Richard Morey** (Cardiff University), **Stephen Senn** (Edinburgh, Scotland), **Uri Simonsohn** (Esade Ramon Llull University), **Aris Spanos** (Virginia Tech), **Jon Williamson** (University of Kent)

**Sponsors/Affiliations:**

- The Foundation for the Study of Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science (E.R.R.O.R.S.); Centre for Philosophy of Natural and Social Science (CPNSS), London School of Economics; Virginia Tech Department of Philosophy
**Organizers**: D. Mayo, R. Frigg and M. Harris

**Logistician** (chief logistics and contact person): Jean Miller

**Executive Planning Committee:** Y. Benjamini, D. Hand, D. Lakens, S. Senn

**Stephen Senn**, Consultant Statistician

Edinburgh, Scotland

A large university is interested in investigating the effects on the students of the diet provided in the university dining halls and any sex difference in these effects. Various types of data are gathered. In particular, the weight of each student at the time of his arrival in September and their weight the following June are recorded. (p. 304)

This is how Frederic Lord (1912-2000) introduced the paradox (1) that now bears his name. It is justly famous (or notorious). However, the addition of sex as a factor adds nothing to the essence of the paradox and (in my opinion) merely confuses the issue. Furthermore, studying the *effect* of diet needs some sort of control. Therefore, I shall consider the paradox in the purer form proposed by Wainer and Brown (2), which was subtly modified by Pearl and Mackenzie in *The Book of Why* (3) (see pp. 212-217).

In the Wainer and Brown form, two dining rooms are mentioned, Dining Room A and Dining Room B. Pearl and Mackenzie, however, although they too refer to Dining Room A and Dining Room B in the diagram they present, also refer to two *diets*. In my discussion below I shall maintain a distinction between *Hall* (using Lord’s original term) and *Diet*. This distinction is of causal interest since I shall assume that if Diet A was given in Hall 1 and Diet B in Hall 2 (say), the alternative arrangement of Diet B in Hall 1 and Diet A in Hall 2 might have been possible, and also that a difference between diets may be of wider general interest than a difference between halls.

A most thorough and penetrating analysis of assumptions made in discussing Lord’s paradox is given by Holland and Rubin (4), and the reader who is interested in learning more is referred to their paper.

I shall now consider four variants for the way that the data to be analysed might have arisen, and I shall illustrate the analysis using John Nelder’s approach to designed experiments (5, 6) as incorporated in Genstat® (7). This requires separate identification of structure in the experimental material that exists prior to experimentation (the *block structure*) and the nature of the treatments that are subsequently applied (the *treatment structure*). This, together with a third piece of information, the *design matrix*, which maps treatments onto units, determines the analysis.

**Variant 1a**

Students have already decided in which of the two halls they will dine. The university authorities then decide to allocate (at random) Diet A to one hall and Diet B to the other and measure initial and final weights of 100 students in each hall.

The disposition of students looks like this.

Count
          Diet
Hall     A     B
1      100     0
2        0   100

The Genstat® code for the dummy ANOVA (dummy because ANOVA has not been given an outcome variate) for the experiment looks like this.

BLOCKSTRUCTURE Hall/Student

COVARIATE Initial

TREATMENTSTRUCTURE Diet

ANOVA

Note that the fact that students are ‘nested’ within halls is shown using the / operator. The dummy analysis of variance includes this output:

**Analysis of variance (adjusted for covariate)**

Covariate: Initial Weight

Source of variation d.f.

Hall stratum

Diet 1

Hall.Student stratum

Covariate 1

Residual 197

Total 199

From this we see that Diet appears in the Hall stratum (that is to say at the higher level) but there are only two halls and so the effect of Diet cannot be separated from the effect of Hall.
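The confounding can be verified mechanically: with Diet allocated at hall level and only two halls, the Diet indicator column of the design matrix is identical to the Hall indicator column. A minimal Python sketch (the data layout is hypothetical, mirroring the 100-per-hall disposition above):

```python
# Variant 1a: Diet is assigned at hall level, so with two halls the Diet
# indicator and the Hall indicator are the same column of the design matrix.
# 100 students dine in Hall 1 (Diet A) and 100 in Hall 2 (Diet B).
hall = [1] * 100 + [2] * 100
diet = ["A"] * 100 + ["B"] * 100

hall_col = [1 if h == 2 else 0 for h in hall]   # indicator for Hall 2
diet_col = [1 if d == "B" else 0 for d in diet]  # indicator for Diet B

# Identical columns make the model matrix rank-deficient: no amount of data
# can separate the Diet effect from the Hall effect.
print(hall_col == diet_col)  # True
```

Any software, Genstat included, can do no more than report the two effects as confounded.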

**Variant 1b**

It has been decided to trial Diet A in Hall 1 (say) and Diet B in Hall 2 (say). Students are then randomly allocated in equal numbers to dine in one or the other hall, and in each hall 100 students are chosen to be measured. The disposition of students is as before. It is now accepted that the effects of Diet and Hall cannot be separated, but it is agreed that the joint effect of both will be studied. Hall can now be transferred from the block structure to the treatment structure. The code is now

BLOCKSTRUCTURE Student

COVARIATE Initial

TREATMENTSTRUCTURE Diet+Hall

ANOVA

The output includes

**Analysis of variance (adjusted for covariate)**

Covariate: Initial Weight

Source of variation d.f.

Student stratum

Diet 1

Covariate 1

Residual 197

Total 199

**Information summary**

Aliased model terms

Hall

It appears that the effect of Diet can now be studied. In fact, having fitted the initial weight and the covariate, 197 degrees of freedom are left for estimating residual variation. Note, however, that we are warned that Hall is an *aliased model term*. Since, as Hall is changed Diet is changed, the effect of one cannot be separated from the other. Thus although nothing can be said about the effect of Diet independently of Hall, their joint effect *can *be studied. Not only will it be possible to calculate a standard error but an appropriate covariate adjustment can be made.
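To make the covariate adjustment concrete, here is a sketch in Python (not Genstat), with invented effect sizes: the joint Hall+Diet effect is estimated by adjusting the difference in mean final weights using the pooled within-group regression of final on initial weight.

```python
import random

random.seed(42)

# Variant 1b: students randomised to halls, but Diet A always goes with
# Hall 1 and Diet B with Hall 2, so only the *joint* Hall+Diet effect is
# estimable. All effect sizes below are invented for illustration.
N = 100
JOINT_EFFECT = 2.0  # combined Hall 2 + Diet B effect on final weight (kg)

def mean(v):
    return sum(v) / len(v)

def centred(v):
    m = mean(v)
    return [u - m for u in v]

def simulate(effect):
    initial = [random.gauss(70, 5) for _ in range(N)]
    final = [35 + 0.5 * x + effect + random.gauss(0, 2) for x in initial]
    return initial, final

xa, ya = simulate(0.0)           # Hall 1 / Diet A
xb, yb = simulate(JOINT_EFFECT)  # Hall 2 / Diet B

# Pooled within-group slope of final on initial weight (the covariate).
sxx = sum(u * u for u in centred(xa)) + sum(u * u for u in centred(xb))
sxy = (sum(u * w for u, w in zip(centred(xa), centred(ya))) +
       sum(u * w for u, w in zip(centred(xb), centred(yb))))
slope = sxy / sxx

# Covariate-adjusted estimate of the joint effect (the ANCOVA estimator).
adjusted = (mean(yb) - mean(ya)) - slope * (mean(xb) - mean(xa))
print(round(adjusted, 2))  # close to JOINT_EFFECT
```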

It must be conceded, however, that the analysis proposed requires an important assumption. Although the method of assignment (students allocated independently to the hall/diet combination) does make initial weights independent of each other, the same is not necessarily true of final weights. It is possible that living together over the period of the experiment will introduce some sort of correlation. Thus a conventional analysis requires an assumption of independence that the experimental procedure cannot guarantee.

**Variant 2**

It is decided to vary diets *within* halls. In each hall an equal number of students will be randomly assigned to follow Diet A or Diet B. In each hall 100 students (50 on Diet A and 50 on Diet B) will have their initial and final weights measured. The disposition of students looks like this.

Count
          Diet2
Hall     A     B
1       50    50
2       50    50

The code to analyse this experiment will look like this.

BLOCKSTRUCTURE Hall/Student

COVARIATE Initial

TREATMENTSTRUCTURE Diet2

ANOVA

Here the code is apparently the same as in Variant 1a, apart from the fact that Diet is replaced by Diet2. The former factor varies diet between halls; the latter varies it within halls.

The output includes the following.

**Analysis of variance (adjusted for covariate)**

Covariate: Initial Weight

Source of variation d.f.

Hall stratum

Covariate 1

Hall.Student stratum

Diet2 1

Covariate 1

Residual 196

Total 199

It now becomes possible to estimate the effect of diet on weight.

A possible illustration of the data is given in Figure 1.
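A simulation may make the contrast with Variant 1b concrete. The sketch below (Python, with invented effect sizes; the covariate adjustment on initial weight is omitted for brevity) shows that, because Diet is varied within halls, averaging the within-hall diet differences recovers the Diet effect however large the Hall effect is.

```python
import random

random.seed(7)

# Variant 2: Diet is varied *within* halls, so the Diet effect is estimable
# free of any Hall effect. Effect sizes are invented for illustration.
HALL_EFFECT = 3.0  # Hall 2 students end up heavier, say
DIET_EFFECT = 1.5  # Diet B adds this much final weight (kg)

def mean(v):
    return sum(v) / len(v)

def final_weights(n, hall_eff, diet_eff):
    return [70 + hall_eff + diet_eff + random.gauss(0, 2) for _ in range(n)]

h1a = final_weights(50, 0.0, 0.0)          # Hall 1, Diet A
h1b = final_weights(50, 0.0, DIET_EFFECT)  # Hall 1, Diet B
h2a = final_weights(50, HALL_EFFECT, 0.0)          # Hall 2, Diet A
h2b = final_weights(50, HALL_EFFECT, DIET_EFFECT)  # Hall 2, Diet B

# Within-hall differences eliminate the Hall effect; average them over halls.
diet_est = ((mean(h1b) - mean(h1a)) + (mean(h2b) - mean(h2a))) / 2
print(round(diet_est, 2))  # close to DIET_EFFECT, whatever HALL_EFFECT is
```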

**Variant 3**

It is now decided to assign students independently to a diet. There is no attempt to block this by hall. (This would be a reasonable strategy if one believed the effect of Hall was negligible.) To what degree numbers are balanced by diet within halls is a matter of chance. As it turns out, the disposition of students is like this.

Count
          Diet3
Hall     A     B
1       45    55
2       55    45

The code for analysis will look like this.

BLOCKSTRUCTURE Student

COVARIATE Initial

TREATMENTSTRUCTURE Diet3

ANOVA

Here Diet3 is a factor representing how the diet has been allocated.

The output includes the following:

**Analysis of variance (adjusted for covariate)**

Covariate: Initial Weight

Source of variation d.f.

Student stratum

Diet3 1

Covariate 1

Residual 197

Total 199

“However, for statisticians who are trained in “conventional” (i.e. model-blind) methodology and avoid using causal lenses, it is deeply paradoxical that the correct conclusion in one case would be incorrect in another, even though the data look exactly the same.” *The Book of Why* (3), p. 217

Well, I am not that straw man, and I suspect very few statisticians are. I was trained in the Rothamsted approach to statistics, which takes design very seriously; elucidating causes is the central objective of experimental design.

Trialists and medical statisticians will recognise Variant 1a as being a *cluster randomised design,* albeit a rather degenerate example of the class, since there are only two clusters. Variant 2 is a *blocked parallel trial,* the blocking factor being hall. The clinical trial analogy would be blocking by centre. Variant 3 is a *completely randomised parallel group trial*. Variant 1b is more unusual. It is theoretically possible, and I would not be surprised to find that some clinical trial analogue has been run, but I know of no examples.

Each of these four cases leads to a different analysis. It seems intuitively right that they do and John Nelder’s approach delivers a different answer for each.

However, I am not sure that Directed Acyclic Graphs (DAGs) are up to the job. I shall be happy to be proved wrong but must conclude for the moment that it is the causal analysts who will find these four cases *deeply paradoxical*. They may even refuse to recognise that they are different: if DAGs can’t be drawn to illustrate them, the differences don’t exist.

In fact, it is difficult to decide which of these variants the authors of *The Book of Why* think they are discussing. Variants 2 and 3 ought to be dismissed from their discussion, yet the proposed analysis that they offer adjusts the difference in final weights using the *within*-halls regression on initial weights. This is the analysis that is appropriate to Variant 3, illustrated in Figure 2. It is very similar to the analysis for Variant 1b, but the interpretation is different. Variant 1b would not permit separation of Hall and Diet effects.

Lord’s Paradox illustrates the well-known statistical phenomenon that how data arose is essential to a correct understanding of their analysis. I consider that there are these lessons for causal inference.

- Identifiability is sterile unless it also delivers estimability (8).
- Point estimates are not enough. They are only a small part of the story. Inference must incorporate uncertainty.
- Hierarchical data sets are common, important and require handling appropriately.
- The design of experiments is a powerful field of statistics and is now at least a hundred years old. Although experiments are far from being the only way we make causal inferences, they are an important way that we do. Any causal theory should also be able to handle experiments and learning from statistics will be useful.

- Lord FM. A paradox in the interpretation of group comparisons. Psychological Bulletin. 1967;68:304-5.
- Wainer H, Brown LM. Two statistical paradoxes in the interpretation of group differences: Illustrated with medical school admission and licensing data. American Statistician. 2004;58(2):117-23.
- Pearl J, Mackenzie D. The Book of Why: Basic Books; 2018.
- Holland PW, Rubin DB. On Lord’s Paradox. In: Wainer H, Messick S, editors. Principals of Modern Psychological Measurement. Hillsdale, NJ: Lawrence Erlbaum Associates; 1983. p. 3-25.
- Nelder JA. The analysis of randomised experiments with orthogonal block structure I. Block structure and the null analysis of variance. Proceedings of the Royal Society of London. Series A. 1965;283:147-62.
- Nelder JA. The analysis of randomised experiments with orthogonal block structure II. Treatment structure and the general analysis of variance. Proceedings of the Royal Society of London. Series A. 1965;283:163-78.
- Payne R, Tobias R. General balance, combination of information and the analysis of covariance. Scandinavian Journal of Statistics. 1992:3-23.
- Maclaren OJ, Nicholson R. What can be estimated? Identifiability, estimability, causal inference and ill-posed inverse problems. arXiv preprint arXiv:1904.02826. 2019.


Please share your comments.

Some claim that no one attends Sunday morning (9am) sessions at the Philosophy of Science Association. But if you’re attending the PSA (in Pittsburgh), we hope you’ll falsify this supposition and come to hear us (Mayo, Thornton, Glymour, Mayo-Wilson, Berger) wrestle with some rival views on the trenchant problems of multiplicity, data-dredging, and error control. *Coffee and donuts to all who show up.*

*Multiplicity, Data-Dredging, and Error Control*

**November 13, 9:00 – 11:45 AM
(link to symposium on PSA website)**

**Speakers:**

**Deborah Mayo** (Virginia Tech) **abstract** Error Control and Severity

**Suzanne Thornton** (Swarthmore College) **abstract** The Duality of Parameters and the Duality of Probability

**Clark Glymour** (Carnegie Mellon University) **abstract** Good Data Dredging

**Conor Mayo-Wilson** (University of Washington, Seattle) **abstract** Bamboozled By Bonferroni

**James O. Berger** (Duke University) **abstract** Controlling for Multiplicity in Science

**Summary**

High powered methods, the big data revolution, and the crisis of replication in medicine and social sciences have prompted new reflections and debates in both statistics and philosophy about the role of traditional statistical methodology in current science. Experts do not agree on how to improve reliability, and these disagreements reflect philosophical battles–old and new– about the nature of inductive-statistical evidence and the roles of probability in statistical inference. We consider three central questions:

- How should we cope with the fact that data-driven processes, multiplicity and selection effects can invalidate a method’s control of error probabilities?
- Can we use the same data to search non-experimental data for causal relationships and also to reliably test them?
- Can a method’s error probabilities both control a method’s performance as well as give a relevant epistemological assessment of what can be learned from data?

As reforms to methodology are being debated, constructed or (in some cases) abandoned, the time is ripe to bring the perspectives of philosophers of science (Glymour, Mayo, Mayo-Wilson) and statisticians (Berger, Thornton) to reflect on these questions.

**Topic Description**

**Multiple testing, replication and error control.** The probabilities that a method leads to misinterpreting data in repeated use may be called its *error probabilities*.

Accordingly, the statistical significance tester and the Bayesian propose different ways to solve the problem. Jim Berger will argue that older frequentist solutions, such as Bonferroni and the False Discovery Rate (FDR), are inappropriate for many of today’s complex, high-throughput inquiries. He argues for a unified method that can address any such problems of multiplicity by means of the choice of objective prior probabilities of hypotheses.

Philosophical scrutiny of both older and newer solutions to the multiple testing problem challenges the very assumption that multiplicity must be taken account of, and adjusted for. Conor Mayo-Wilson shows that a prevalent argument for the Bonferroni correction, which recommends replacing a p-value threshold of α with α/n when testing n independent hypotheses, can violate important axioms of evidence. Correcting error probabilities or p-values for multiple testing, he argues, should be viewed as a value judgment about which hypotheses or models are worth pursuing.
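The correction under discussion is easy to state in code. The sketch below (my illustration, not Mayo-Wilson's) simulates repeated batches of independent tests of true nulls, where p-values are uniform, and contrasts the familywise error rate with and without the Bonferroni threshold α/n:

```python
import random

random.seed(1)

# n_tests independent tests of true nulls: each p-value is Uniform(0, 1).
# Bonferroni replaces the threshold alpha with alpha / n_tests, keeping the
# familywise error rate (the chance of ANY false rejection) at or below alpha.
alpha, n_tests, n_trials = 0.05, 20, 2000

def any_false_rejection(threshold):
    return any(random.random() < threshold for _ in range(n_tests))

uncorrected = sum(any_false_rejection(alpha) for _ in range(n_trials)) / n_trials
corrected = sum(any_false_rejection(alpha / n_tests) for _ in range(n_trials)) / n_trials

print(round(uncorrected, 2))  # near 1 - 0.95**20, i.e. about 0.64
print(round(corrected, 2))    # around 0.05 or below
```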

**Using the same data to construct and stringently test causal relationships.** Under the guise of fixing the problem of selective reporting, it is increasingly recommended that scientists predesignate all details of experimental procedure, number of tests run, and rules for collecting and analyzing data in advance of the experiment. Clark Glymour asks if predesignation comes at the cost of high Type II error probability—erroneously failing to find effects—and lost opportunities for discovery. In contemporary science, Glymour argues, in which the number of variables is large in comparison to the sample size, principled search algorithms can be invaluable. Some of the leading research areas of machine learning and AI develop “post-selection inferences” that violate the rule against finding one’s hypothesis in the data. These adaptive methods attempt to arrive at reliable results by compensating for the fact that the model was picked in a data-dependent way using methods such as cross validation, simulation, and bootstrapping. Glymour argues that some of these methods are a form of “severe testing” of their output, whereas commonly used regression methods are actually “bad” data dredging methods that do not severely test their results. For both frequentist and Bayesian statistics, search procedures press epistemic issues about how using observational data to try to reach beyond experimental possibilities should be evaluated for accuracy and reliability. We suggest, in each of our contributions, some principled ways to distinguish “bad” from “good” data dredging.

**Error probabilities and epistemic assessments.** Controversies between Bayesian and frequentist methods reflect different answers to the question of the role of probability in inference—to supply a measure of belief or support in hypotheses? or to control a method’s error probabilities? While a criticism often leveled at Type I and II error probabilities is they do not give direct assessments of epistemic probability, Bayesians are also often keen to show their methods have good performance in repeated sampling. Can the performance of a method under hypothetical uses also supply epistemically relevant measures of belief, confidence or corroboration? Suzanne Thornton presents new developments toward an affirmative answer by means of confidence distributions (CD) which provide confidence intervals for parameters at any level of confidence, not just the typical .95. Even regarding a parameter as fixed, say the mean deflection of light, we can calibrate how reliably a method enables finding out about its values. In this sense, she argues, parameters play a dual role—a possible key to reconciling approaches.
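The confidence-distribution idea can be glimpsed in a standard textbook case, the Normal mean with known σ (my own illustrative numbers, not Thornton's): a single CD yields an interval at any confidence level from its quantiles.

```python
from statistics import NormalDist

# Confidence distribution (CD) for a Normal mean with known sigma: the CD is
# N(xbar, sigma^2 / n), and its quantiles give equal-tailed confidence
# intervals at *any* level, not just the typical 0.95.
def cd_interval(xbar, sigma, n, level):
    cd = NormalDist(xbar, sigma / n ** 0.5)
    return cd.inv_cdf((1 - level) / 2), cd.inv_cdf((1 + level) / 2)

xbar, sigma, n = 5.2, 2.0, 50
for level in (0.50, 0.90, 0.95, 0.99):
    lo, hi = cd_interval(xbar, sigma, n, level)
    print(level, round(lo, 2), round(hi, 2))
# One fitted CD, intervals at every level; the 0.95 interval is the familiar
# xbar +/- 1.96 * sigma / sqrt(n).
```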

Deborah Mayo’s idea is to view a method’s ability to control erroneous interpretations of data as measuring its capability to probe errors. In her view, we have evidence for a claim just to the extent that it has been subjected to and passes a test that would probably have found it false, just if it is false. This probability is the stringency or severity with which it has passed the test. On the severity view, the question of whether, and when, to adjust a statistical method’s error probabilities in the face of multiple testing and data-dredging (debated by Berger, Glymour, and Mayo-Wilson) is directly connected to the relevance of error control for qualifying a particular statistical inference (discussed by Thornton). Thus a platform for connecting the five contributions emerges.

Our goal is to channel some of the sparks that grow out of our contrasting views to vividly illuminate the issues, and point to the directions for new interdisciplinary work.

]]>

From what standpoint should we approach the statistics wars? That’s the question from which I launched my presentation at the Statistics Wars and Their Casualties workshop (phil-stat-wars.com). In my view, it should be, not from the standpoint of technical disputes, but from the non-technical standpoint of the skeptical consumer of statistics (see my slides here). What should we do now as regards the controversies and conundrums growing out of the statistics wars? We should not leave off the discussions of our workshop without at least sketching a future program for answering this question. We still have 2 more sessions, December 1 and 8, but I want to prepare us for the final discussions which should look beyond a single workshop. (The slides and videos from the presenters in Sessions 1 and 2 can be found here.)

I will consider three, interrelated, responsibilities and tasks that we can undertake as statistical activist citizens. In so doing I will refer to presentations from the workshop, limiting myself to session #1. (I will add more examples in part (ii) of this post.)

**1. Keep alert to ongoing evidence policy “reforms”.** Scrutinize attempts to replace designs and methods that ensure error control with alternatives that actually make it harder to achieve error control. Be on the lookout for methods that presuppose a principle of evidence on which error probabilities drop out—the Likelihood Principle (LP). While they’re unlikely to be described that way, ask journals/authors etc. directly if the LP is being presupposed. Write letters to editors asking how the proposed change in method benefits (rather than hurts) the skeptical statistical consumer.

In slide #64 of my presentation, I proposed that in the context of the skeptical consumer of statistics, methods should be:

- *directly altered* by biasing selection effects,
- able to *falsify* claims statistically,
- able to *test statistical model* assumptions,
- able to *block inferences* that violate minimal severity.

If someone is trying to sell you a reform where any of these are lacking, you might wish to hold off buying.

In reacting to proposed Bayesian replacements for error statistical methods, ask how they are arrived at, what they mean, and how to check them. Here’s a slide (#22) from Stephen Senn on the various types of Bayesian approaches (slides from Senn’s presentation).

**2. Reject the howlers and caricatures of error statistical methods that are the basis of the vast majority of criticisms against them.** Typical examples are claims that either P-values must be misinterpreted as posterior probabilities or else they are irrelevant for science. Resist popular mantras that error statistical control is only relevant to ensure ‘quality control’, apt for such contexts as needing to avoid the acceptance of a batch of bolts with too high a proportion of defectives, but not for science. To suppose that the choice is either “belief or performance” is to commit a false dilemma fallacy. Admittedly, what is still needed is a clear articulation of the uses of error statistical methods that reflect what Cox, E. Pearson, Birnbaum, Giere (and others, including Mayo) dub the “evidential” vs the “behavioristic” uses of tests. Scientists use error statistical tests to appraise, develop, and answer questions about theories and models (e.g., could the observed effect readily be due to sampling variability? Could the data have been generated by a process approximately represented by model M? Is the data-model fit ‘too good to be true’?).

By the way, the founders of error statistical methods never claimed that low Type I and Type II error probabilities suffice for warranted inference. Observe that a statistically significant result *in*severely passes an alternative H’ against which a test has high power (Mayo, slide #50).

**3. No preferential treatment for one methodology or philosophy. Developing author and journal guidelines for avoiding problematic uses of Bayesian (and other) methods is long overdue.** In many journals, authors are warned to avoid classic fallacies of statistical significance: statistical significance is not substantive importance; P-values aren’t effect size measures; a nonstatistically significant difference isn’t evidence of no difference; a P-value is not a posterior probability of H_{0}.

Let’s focus here on one of the rivals that arose in several presentations: Bayes factors (BFs).[1] We can begin with the uses to which they are routinely put, especially in the service of critiques of statistical significance tests. BFs do not satisfy the 4 requirements I list at the outset. Two main problems arise.

**Problem #1:** *High probability of erroneous claims of evidence against a hypothesis H_{0}.* Because error probabilities drop out of Bayes factors, the ability to control them goes by the wayside. Stat activists should uncover how biasing selection effects (e.g., multiple testing, data-dredging, optional stopping) can adversely affect a method’s ability to uncover mistaken interpretations of data. Part of what I have in mind is an active research area in genomics, machine learning (ML) and big data science under terms such as ‘post-data selective inference’. The skeptical statistical consumer should be aware of how data-dependent methods can succeed or badly fail in some of the ML algorithms that affect them in medical, legal, and a host of social policies.

One of many well-known examples involves optional stopping in the context of a type of example the BF advocate often recommends—two-sided testing of the mean of a Normal distribution.[2] This example “is enough to refute the strong likelihood principle” (Cox 1978, p. 54), since, with high probability, it will stop with a “nominally” significant result even though the point null hypothesis is true. It contradicts what Cox and Hinkley call “the weak repeated sampling principle” (See SIST 2018, p. 45 or Mayo slides).
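A quick simulation (my own sketch, not from the cited sources) shows the effect: testing a true null after every observation at nominal level 0.05, and stopping at the first "significant" z, yields far more than 5% false alarms, with the rate climbing toward 1 as the maximum sample size grows.

```python
import math
import random

random.seed(3)

# Optional stopping under a true null (mu = 0): test after each observation
# at nominal two-sided level 0.05 (|z| > 1.96) and stop at the first
# "significant" z, or give up at n_max. The nominal 5% rate is badly inflated.
def stops_significant(n_max=500):
    total = 0.0
    for n in range(1, n_max + 1):
        total += random.gauss(0, 1)     # one more observation
        if abs(total) / math.sqrt(n) > 1.96:
            return True                 # "nominally significant" -- stop
    return False

trials = 1000
rate = sum(stops_significant() for _ in range(trials)) / trials
print(round(rate, 2))  # far above 0.05, and it climbs as n_max grows
```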

Inference by Bayes theorem entails the LP, so one must either give up error probability control or accept Bayesian “incoherence”.

Interestingly, Richard Morey, a leading developer of BFs, also focused his presentation on how BFs preclude satisfying error statistical severity. But he does not look at false rejections due to biasing selection effects. Rather, Morey shows that BFs allow erroneously accepting hypothesis H_{0} with high probability (see Morey slides here). Call this problem #2.

**Problem #2:** Even a statistically significant difference from H_{0}—a low P-value—can, according to a BF computation, become evidence in favor of H_{0} by assigning it a high enough prior degree of belief, especially coupled with a suitable choice of alternative.[3] See Morey’s conclusions in his slides.
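This is the Jeffreys-Lindley effect in miniature. The sketch below (illustrative numbers of my own, not from Morey's slides) takes the standard two-sided test of a Normal mean with a spiked prior on the point null: a fixed "significant" z of 2.0 yields a Bayes factor that increasingly favors H_{0} as n grows.

```python
import math

# Two-sided test of H0: theta = 0 for a Normal mean, against the spiked-prior
# Bayesian setup H1: theta ~ N(0, tau^2). Marginally, xbar is N(0, sigma^2/n)
# under H0 and N(0, tau^2 + sigma^2/n) under H1, so BF_01 is a ratio of two
# Normal densities at the observed xbar. Numbers are illustrative only.
def bf01(z, n, tau=1.0, sigma=1.0):
    xbar = z * sigma / math.sqrt(n)
    v0 = sigma ** 2 / n   # variance of xbar under H0
    v1 = tau ** 2 + v0    # marginal variance of xbar under H1
    def pdf(x, v):
        return math.exp(-x * x / (2 * v)) / math.sqrt(2 * math.pi * v)
    return pdf(xbar, v0) / pdf(xbar, v1)

# The same "significant" z = 2.0 (two-sided P of about 0.046) at growing n:
for n in (10, 100, 10000):
    print(n, round(bf01(2.0, n), 2))
# BF_01 rises with n: the identical P-value comes to favour the point null.
```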

**What needs to be done:** If a BF purports to supply evidence for a point null, compute, as does Morey, the probability that the assignments used would find as much or even more evidence for H_{0} even if H_{0} is false.

Morey begins by remarking that he himself has developed popular methods for computing BFs, so it is especially meaningful that he concedes their inability to sustain severity. He does an excellent job. One thing the skeptical statistical consumer will want to know is whether Morey alerts the user to these consequences. The user needs to see precisely what he so clearly shows *us* in his presentation: applying the defaults can have seriously problematic error statistical consequences. If he hasn’t already, I propose Morey include examples of this in his next BF computer package.[4]

The bottom line is: we need to erect guidelines to ward off the bad statistics that can easily result from rivals to error statistical methods, and encourage journals to include such guidelines, especially now that these alternatives are becoming so prevalent. That a method is Bayesian should not make it above reproach, as if it’s a protected class.[5]

——–

[0] In the open discussion in Session 1, Mark Burgman, editor of *Conservation Biology*, seemed surprised at my suggestion that he add to the guidelines in his journal caveats directed at methods other than statistical significance tests and P-values, including confidence intervals and Bayes factors. I am surprised at his surprise (but perhaps I misunderstood his reaction.)

[1] As I say in my presentation, there are other goals in statistics. We are not always trying to critically probe what is the case. In some contexts, we might merely be developing hypotheses and models for subsequent severe probing. However, it’s important to see that even weaker claims such as “this model is worth probing further” need to be probed with reasonable severity. In other words, severity provides a minimal requirement of evidence for any type of claim. Moreover, I should note, there are Bayesians who reject BFs and criticize the spiked priors on point null hypotheses. Their objections should be part of the remedy for problem #2: “To me, Bayes factors correspond to a discrete view of the world, in which we must choose between models A, B, or C” (Gelman 2011, p. 74), or a weighted average of them as in Madigan and Raftery (1994).

[2] Error statistical testers, but also some Bayesians, eschew these two-sided tests as artificial—particularly when paired with the lump of prior placed on the point null hypothesis. However, such tests are the mainstay of the examples currently relied on in launching criticisms of statistical significance tests.

[3] Strictly speaking, BFs do not supply evidence for or against hypotheses—they supply only comparative claims, e.g., that the data support or fit one hypothesis or model better than another. Morey speaks of “accepting H_{0}”, and that is entirely in sync with the way BF advocates purport to use BFs—namely as tests. Ironically, while BF enthusiasts (like Likelihoodists) are at one in criticizing P-values because they are not comparative (and thus, according to them, cannot supply evidence), BF advocates strive to turn their own comparative accounts into tests, by allowing values of the BF to count as accepting or rejecting statistical hypotheses. The trouble is that it is often forgotten that they were really only entitled to a comparative claim. Construed as tests, BFs can have high error probabilities. Moreover, as Morey points out, both claims under comparison can be poorly warranted. Further, the value of the BF—essentially a likelihood ratio—doesn’t mean the same thing in different contexts, especially if hypotheses are data dependent. As Morey points out, for the BF advocate the value of the BF is the evidence measure, whereas for error statisticians such “fit” measures only function as statistics to which we would need to attach a sampling distribution. By contrast, sampling distributions are rejected as irrelevant post-data by the BF advocate.

[4] Such consequences are often hidden by Bayesians behind the cover of: we are warranted in a high spiked prior degree of belief in H_{0} because nature is “simple” or we are being “conservative”. The former is a presumed metaphysics, not evidence. As for the latter, consider a case where the null hypothesis asserts “no serious toxicity exists”. Assigning H_{0} a high prior is quite the opposite of taking precautions. Even if one would be correct to doubt the existence of the effect, that is very different from having evidential reasons for this doubt. One may reject the effect for the wrong reasons.

[5] All slides and videos from Sessions 1 and 2 can be found on this post.


I will be writing some reflections on our two workshop sessions on this blog soon, but for now, here are just the slides I used on Thursday, 22 September. If you wish to ask a question of any of the speakers, use the blogpost at phil-stat-wars.com. The slides from the other speakers will also be up there on Monday.

Deborah G. Mayo’s slides from the workshop: *The Statistics Wars and Their Casualties*, Session 1, on September 22, 2022.

**Final Schedule for September 22 & 23 (Workshop Sessions 1 & 2)**

__Session 1: September 22 __

**Moderator**: David Hand (Imperial College London)

**3:00-3:10** (10:00-10:10): Deborah Mayo, Opening Remarks and Thanks

**3:10-3:15** (10:10-10:15) Chair introduction to the session

- **3:15-3:50** (10:15-10:50): **Deborah Mayo** (Virginia Tech) *The Statistics Wars and Their Casualties* (Abstract)
- **3:50-4:25** (10:50-11:25): **Richard Morey** (Cardiff University) *Bayes factors, p values, and the replication crisis* (Abstract)
- **4:25-5:00** (11:25-12:00): **Stephen Senn** (Edinburgh) *The replication crisis: are P-values the problem and are Bayes factors the solution?* (Abstract)

**5:00-5:10** (12:00-12:10): *Break*

**5:10-5:20** (12:10-12:20): PANEL DISCUSSION between speakers & chair

**5:20-5:50** (12:20-12:50): OPEN DISCUSSION with audience

**5:50-6:00** (12:50-1pm): Reflections on session

__Session 2: September 23__

**Co-Moderators:** S. Senn (Edinburgh) & M. Harris (LSE)

- **3:00-3:35** (10:00-10:35): **Daniël Lakens** (Eindhoven University of Technology) *The role of background assumptions in severity appraisal* (Abstract)
- **3:35-4:10** (10:35-11:10): **Christian Hennig** (University of Bologna) *On the interpretation of the mathematical characteristics of statistical tests*
- **4:10-4:45** (11:10-11:45): **Yoav Benjamini** (Tel Aviv University) *The two statistical cornerstones of replicability: addressing selective inference and irrelevant variability* (Abstract)

**4:45-4:55** (11:45-11:55): *Break*

**4:55 -5:05** (11:55-12:05): PANEL DISCUSSION between speakers & chair

**5:05-5:35** (12:05-12:35): OPEN DISCUSSION with audience

**5:35-5:45** (12:35-12:45): Panel discussion of both sessions

**5:45-6:00** (12:45-1pm): Discussion of both sessions & future topics by workshop participants


**The Statistics Wars and Their Casualties**

**22-23 September 2022
15:00-18:00 London Time***

**To register for the workshop,
please fill out the registration form here.**

**For schedules and updated details, please see the workshop webpage: phil-stat-wars.com.**

***These will be Sessions 1 & 2; two more online sessions (3 & 4) will follow on December 1 & 8.***

While the field of statistics has a long history of passionate foundational controversy, the last decade has, in many ways, been the most dramatic. Misuses of statistics, biasing selection effects, and high-powered methods of big-data analysis have helped to make it easy to find impressive-looking but spurious results that fail to replicate. As the crisis of replication has spread beyond psychology and social sciences to biomedicine, genomics, machine learning and other fields, the need for critical appraisal of proposed reforms is growing. Many are welcome (transparency about data, eschewing mechanical uses of statistics); some are quite radical. The experts do not agree on the best ways to promote trustworthy results, and these disagreements often reflect philosophical battles–old and new–about the nature of inductive-statistical inference and the roles of probability in statistical inference and modeling. Intermingled in the controversies about evidence are competing social, political, and economic values. If statistical consumers are unaware of assumptions behind rival evidence-policy reforms, they cannot scrutinize the consequences that affect them. What is at stake is a critical standpoint that we may increasingly be in danger of losing. Critically reflecting on proposed reforms and changing standards requires insights from statisticians, philosophers of science, psychologists, journal editors, economists and practitioners from across the natural and social sciences. This workshop will bring together these interdisciplinary insights–from speakers as well as attendees.

**Speakers/Panellists:**

**Yoav Benjamini** (Tel Aviv University), **Alexander Bird** (University of Cambridge), **Mark Burgman** (Imperial College London), **Daniele Fanelli** (London School of Economics and Political Science), **Roman Frigg** (London School of Economics and Political Science), **Stephan Guttinger** (University of Exeter), **David Hand** (Imperial College London), **Margherita Harris** (London School of Economics and Political Science), **Christian Hennig** (University of Bologna), **Daniël Lakens** (Eindhoven University of Technology), **Deborah Mayo** (Virginia Tech), **Richard Morey** (Cardiff University), **Stephen Senn** (Edinburgh, Scotland), **Jon Williamson** (University of Kent)

**Sponsors/Affiliations:**

The Foundation for the Study of Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science (E.R.R.O.R.S.); Centre for Philosophy of Natural and Social Science (CPNSS), London School of Economics; Virginia Tech Department of Philosophy

**Organizers**: D. Mayo, R. Frigg and M. Harris
**Logistician** (chief logistics and contact person): Jean Miller

**To register for the workshop, please fill out the registration form here.**

Thanks to CUP, the electronic version of my book, *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018)*, is available for free for one more week (through August 31) at this link: https://www.cambridge.org/core/books/statistical-inference-as-severe-testing/D9DF409EF568090F3F60407FF2B973B2. Blurbs of the 16 tours in the book may be found here: blurbs of the 16 tours.


This is my third and final post marking Egon Pearson’s birthday (Aug. 11). The focus is his little-known paper: “Statistical Concepts in Their Relation to Reality” (Pearson 1955). I’ve linked to it several times over the years, but always find a new gem or two, despite its being so short. E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in repeated applications or the long-run control of rates of erroneous interpretations–what he termed the “behavioral” rationale of tests. In an unpublished letter E. Pearson wrote to Birnbaum (1974), he talks about N-P theory admitting of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

(Nowadays, it might be said that some people concentrate to an absurd extent on “science-wise error rates” in their view of statistical tests as dichotomous screening devices.)

One of the best sources of E.S. Pearson’s statistical philosophy is his (1955) “Statistical Concepts in Their Relation to Reality”, his response to Fisher (1955)–the first part of what I call the “triad”. It begins like this:

Controversies in the field of mathematical statistics seem largely to have arisen because statisticians have been unable to agree upon how theory is to provide, in terms of probability statements, the numerical measures most helpful to those who have to draw conclusions from observational data. We are concerned here with the ways in which mathematical theory may be put, as it were, into gear with the common processes of rational thought, and there seems no reason to suppose that there is one best way in which this can be done. If, therefore, Sir Ronald Fisher recapitulates and enlarges on his views upon statistical methods and scientific induction we can all only be grateful, but when he takes this opportunity to criticize the work of others through misapprehension of their views as he has done in his recent contribution to this *Journal* (Fisher 1955, “Statistical Methods and Scientific Induction”), it is impossible to leave him altogether unanswered.

In the first place it seems unfortunate that much of Fisher’s criticism of Neyman and Pearson’s approach to the testing of statistical hypotheses should be built upon a “penetrating observation” ascribed to Professor G.A. Barnard, the assumption involved in which happens to be historically incorrect. There was no question of a difference in point of view having “originated” when Neyman “reinterpreted” Fisher’s early work on tests of significance “in terms of that technological and commercial apparatus which is known as an acceptance procedure”. There was no sudden descent upon British soil of Russian ideas regarding the function of science in relation to technology and to five-year plans. It was really much simpler–or worse.

The original heresy, as we shall see, was a Pearson one!…

You can read “Statistical Concepts in Their Relation to Reality” HERE.

* What was the heresy, really?* Pearson doesn’t mean it was he who endorsed the behavioristic model that Fisher is here attacking.[i] The “original heresy” refers to the break from Fisher in the explicit introduction of alternative hypotheses (even if only directional). Without considering alternatives, Pearson and Neyman argued, statistical tests of significance are insufficiently constrained–for evidential purposes! Note: this does

But it’s a mistake to suppose that’s all that an inferential or evidential formulation of statistical tests requires. What more is required comes out in my deconstruction of those famous (“miserable”) passages found in the key Neyman and Pearson 1933 paper. We acted out the play I wrote for SIST (2018) in our recent Summer Seminar in Phil Stat. The participants were surprisingly good actors!

Granted, these “evidential” attitudes and practices have never been explicitly codified to guide the interpretation of N-P tests. Doing so is my goal in viewing “Statistical Inference as Severe Testing”.

*Notice, by the way, Pearson’s discussion and extension of Fisher’s construal of differences that are not statistically significant on p. 207:*

These points might have been helpful to those especially concerned with mistaking non-statistically significant differences as supposed “proofs of the null”.

Share your comments.

**“The triad”:**

- Fisher, R. A. (1955), “Statistical Methods and Scientific Induction“.
*Journal of The Royal Statistical Society*(B) 17: 69-78. - Neyman, J. (1956), “Note on an Article by Sir Ronald Fisher,”
*Journal of the Royal Statistical Society*. Series B (Methodological), 18: 288-294. - Pearson, E. S. (1955), “Statistical Concepts in Their Relation to Reality,”
*Journal of the Royal Statistical Society*, B, 17: 204-207.

I’ll post some other Pearson items over the week.

[i] Fisher’s tirades against behavioral interpretations of “his” tests are almost entirely a reflection of his break with Neyman (after 1935) rather than any radical disagreement either in philosophy or method. Fisher could be even more behavioristic in practice (if not in theory) than Neyman, and Neyman could be even more evidential in practice (if not in theory) than Fisher. Moreover, it was really when others discovered Fisher’s fiducial methods could fail to correspond to intervals with valid error probabilities that Fisher began claiming he never really was too wild about them! (Check fiducial on this blog and in Excursion 5 of SIST.) Contemporary writers tend to harp on the so-called “inconsistent hybrid” combining Fisherian and N-P tests. I argue in SIST that it’s time to dismiss these popular distractions: they are serious obstacles to progress in statistical understanding. Most notably, Fisherians are kept from adopting features of N-P statistics, and vice versa (or they adopt them improperly). *What matters is what the methods are capable of doing! For more on this, see the post “it’s the methods, stupid!” *and excerpts from Excursion 3 of SIST. Thanks to CUP, my full book, corrected, can still be downloaded for free until August 31, 2022 at

References

Lehmann, E. (1997). Review of Error and the Growth of Experimental Knowledge by Deborah G. Mayo, Journal of the American Statistical Association, Vol. 92.

Also of relevance:

Erich Lehmann’s (1993), “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?“. *Journal of the American Statistical Association*, Vol. 88, No. 424: 1242-1249.

Mayo, D. (1996), “Why Pearson Rejected the Neyman-Pearson (Behavioristic) Philosophy and a Note on Objectivity in Statistics” (Chapter 11) in *Error and the Growth of Experimental Knowledge.* Chicago: University of Chicago Press. [This is a somewhat older view of mine; a newer view is in SIST below.]

Mayo, D. (2018). *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. (SIST) *CUP.

**Egon Pearson’s Neglected Contributions to Statistics**

by** Aris Spanos**

**Egon Pearson** (11 August 1895 – 12 June 1980) is widely known today for his contribution in recasting Fisher’s significance testing into the *Neyman-Pearson (1933) theory of hypothesis testing*. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

**(i) specification**: the need to state explicitly the inductive premises of one’s inferences,

**(ii) robustness**: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

**(iii) Mis-Specification (M-S) testing**: probing for potential departures from the Normality assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the **Student’s t** fame] and then **Fisher** (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as *the simple Normal model*:

X_{k} ∽ NIID(μ,σ²), k=1,2,…,n,… (1)

where ‘NIID(μ,σ²)’ stands for ‘Normal, Independent and Identically Distributed with mean μ and variance σ²’. These procedures include the ‘optimal’ estimators of μ and σ², Xbar and s², and the pivotal quantities:

(a) τ(**X**) =[√n(Xbar- μ)/s] ∽ St(n-1), (2)

(b) *v*(**X**) =[(n-1)s²/σ²] ∽ χ²(n-1), (3)

where St(n-1) and χ²(n-1) denote the Student’s t and chi-square distributions with (n-1) degrees of freedom.
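These pivots are what make exact finite-sample inference possible under (1). As a quick check, a short simulation sketch (pure standard library; the numbers are illustrative): the Student’s t pivot in (2) yields a 95% confidence interval for μ whose actual coverage under the simple Normal model matches the nominal level.

```python
import math
import random
import statistics

# 95% CI for mu from the pivot tau(X) = sqrt(n)(Xbar - mu)/s ~ St(n-1):
# Xbar +/- t_{0.975}(n-1) * s / sqrt(n); with n = 10, t_{0.975}(9) ~ 2.262.
random.seed(3)
n, t_crit, mu, sigma = 10, 2.262, 5.0, 2.0
trials, covered = 10_000, 0
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar, s = statistics.fmean(xs), statistics.stdev(xs)
    half = t_crit * s / math.sqrt(n)
    covered += (xbar - half <= mu <= xbar + half)
print(covered / trials)   # close to the nominal 0.95
```

The point of the robustness question that follows is precisely whether this nominal 0.95 survives when the Normality assumption in (1) fails.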

The question of ‘how these inferential results might be affected when the Normality assumption is false’ was originally raised by Gosset in a letter to Fisher in 1923:

“What I should like you to do is to find a solution for some other population than a normal one.” (Lehmann, 1999)

He went on to say that he tried the rectangular (uniform) distribution but made no progress, and he was seeking Fisher’s help in tackling this ‘robustness/sensitivity’ problem. In his reply that was unfortunately lost, Fisher must have derived the sampling distribution of τ(**X**), assuming some skewed distribution (possibly log-Normal). We know this from Gosset’s reply:

“I like the result for z [τ(**X**)] in the case of that horrible curve you are so fond of. I take it that in skew curves the distribution of z is skew in the opposite direction.” (Lehmann, 1999)

After this exchange Fisher was not particularly receptive to Gosset’s requests to address the problem of working out the implications of non-Normality for the Normal-based inference procedures; t, chi-square and F tests.

In contrast, **Egon Pearson** shared Gosset’s concerns about the robustness of Normal-based inference results (a)-(b) to non-Normality, and made an attempt to address the problem in a series of papers in the late 1920s and early 1930s.

This line of research for Pearson began with a review of the second (1928) edition of Fisher’s 1925 book, published in *Nature* and dated June 8th, 1929. Pearson, after praising the book for its path-breaking contributions, dared to raise a mild criticism relating to (i)-(ii) above:

“There is one criticism, however, which must be made from the statistical point of view. A large number of tests are developed upon the assumption that the population sampled is of ‘normal’ form. That this is the case may be gathered from a very careful reading of the text, but the point is not sufficiently emphasised. It does not appear reasonable to lay stress on the ‘exactness’ of tests, when no means whatever are given of appreciating how rapidly they become inexact as the population samples diverge from normality.” (Pearson, 1929a)

Fisher reacted badly to this criticism and was preparing an acerbic reply to the ‘young pretender’ when Gosset jumped into the fray with his own letter in *Nature*, dated July 20th, in an obvious attempt to moderate the ensuing fight. Gosset succeeded in tempering Fisher’s reply, dated **August 17th**, but instead of addressing the ‘robustness/sensitivity’ issue, Fisher focused primarily on Gosset’s call to address ‘the problem of what sort of modification of my tables for the analysis of variance would be required to adapt that process to non-normal distributions’. He described that as a hopeless task. This is an example of Fisher’s genius when cornered by an insightful argument. He sidestepped the issue of ‘robustness’ to departures from Normality by broadening it–alluding to other possible departures from the ID assumption–and rendering it a hopeless task by focusing on the call to ‘modify’ the statistical tables for all possible non-Normal distributions; there is an infinity of potential modifications!

**Egon Pearson** recognized the importance of stating explicitly the inductive premises upon which the inference results are based, and pressed ahead with exploring the robustness issue using several non-Normal distributions within the Pearson family. His probing was based primarily on *simulation*, relying on tables of pseudo-random numbers; see Pearson and Adyanthaya (1928, 1929), Pearson (1929b, 1931). His broad conclusions were that the t-test:

τ_{0}(**X**) = |[√n(Xbar-μ_{0})/s]|, (4)

for testing the hypotheses:

H_{0}: μ = μ_{0} vs. H_{1}: μ ≠ μ_{0}, (5)

is relatively robust to certain departures from Normality, especially when the underlying distribution is symmetric, but the ANOVA test is rather sensitive to such departures! He continued this line of research into his 80s; see Pearson and Please (1975).

Perhaps more importantly, Pearson (1930) proposed a *test for the Normality* assumption based on the skewness and kurtosis coefficients: a Mis-Specification (M-S) test. Ironically, Fisher (1929) provided the sampling distributions of the sample skewness and kurtosis statistics upon which Pearson’s test was based. Pearson continued sharpening his original M-S test for Normality, and his efforts culminated with the D’Agostino and Pearson (1973) test that is widely used today; see also Pearson et al. (1977). The crucial importance of testing Normality stems from the fact that it renders the ‘robustness/sensitivity’ problem manageable. The test results can be used to narrow down the possible departures one needs to worry about. They can also be used to suggest ways to respecify the original model.
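In the same spirit, a skewness-kurtosis normality check can be sketched in a few lines. For simplicity this uses the Jarque–Bera form of the statistic (a later relative of Pearson’s 1930 test, not the D’Agostino–Pearson version itself), which compares the sample skewness and kurtosis to their Normal values of 0 and 3:

```python
import math
import random

def jarque_bera(xs):
    """Skewness-kurtosis normality statistic; ~ chi2(2) under H0 (large n)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n   # central moments
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)

random.seed(7)
normal = [random.gauss(0, 1) for _ in range(1000)]
skewed = [random.expovariate(1) for _ in range(1000)]
print(jarque_bera(normal))   # small: consistent with Normality
print(jarque_bera(skewed))   # huge: Normality rejected (chi2(2) 5% cutoff ~ 5.99)
```

A rejection here narrows down which departures from the inductive premises one needs to worry about, which is exactly the role Pearson envisaged for M-S testing.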

After Pearson’s early publications on the ‘robustness/sensitivity’ problem Gosset realized that *simulation* alone was not effective enough to address the question of robustness, and called upon Fisher, who initially rejected Gosset’s call by saying ‘it was none of his business’, to derive analytically the implications of non-Normality using different distributions:

“How much does it [non-Normality] matter? And in fact that is your business: none of the rest of us have the slightest chance of solving the problem: we can play about with samples [i.e. perform simulation studies], I am not belittling E. S. Pearson’s work, but it is up to you to get us a proper solution.” (Lehmann, 1999).

In this passage one can discern the high esteem with which Gosset held Fisher for his technical ability. Fisher’s reply was rather blunt:

“I do not think what you are doing with nonnormal distributions is at all my business, and I doubt if it is the right approach. … Where I differ from you, I suppose, is in regarding normality as only a part of the difficulty of getting data; viewed in this collection of difficulties I think you will see that it is one of the least important.”

It’s clear from this that Fisher understood the problem of how to handle departures from Normality more broadly than his contemporaries. His answer alludes to two issues that were not well understood at the time:

(a) departures from the other two probabilistic assumptions (IID) have much more serious consequences for Normal-based inference than Normality, and

(b) deriving the consequences of particular forms of non-Normality on the reliability of Normal-based inference, and proclaiming a procedure enjoys a certain level of ‘generic’ robustness, does *not* provide a complete answer to the problem of dealing with departures from the inductive premises.

In relation to (a) it is important to note that the role of ‘randomness’, as it relates to the IID assumptions, was not well understood until the 1940s, when the notion of non-IID was framed in terms of explicit forms of heterogeneity and dependence pertaining to stochastic processes. Hence, the problem of assessing departures from IID was largely ignored at the time; the focus was almost exclusively on departures from Normality. Indeed, the early literature on nonparametric inference retained the IID assumptions and focused on inference procedures that replace the Normality assumption with indirect distributional assumptions pertaining to the ‘true’ but unknown *f*(x), like the existence of certain moments, its symmetry, smoothness, continuity and/or differentiability, unimodality, etc.; see Lehmann (1975). It is interesting to note that Egon Pearson did not consider the question of testing the IID assumptions until his 1963 paper.

In relation to (b), when one poses the question ‘how robust to non-Normality is the reliability of inference based on a t-test?’ one ignores the fact that the t-test might no longer be the ‘optimal’ test under a non-Normal distribution. This is because the sampling distribution of the test statistic and the associated type I and II error probabilities depend crucially on the validity of the statistical model assumptions. When any of these assumptions are invalid, the relevant error probabilities are no longer the ones derived under the original model assumptions, and the optimality of the original test is called into question. For instance, assuming that the ‘true’ distribution is uniform (Gosset’s rectangular):

X_{k }∽ U(a-μ,a+μ), k=1,2,…,n,… (6)

where *f*(x;a,μ)=(1/(2μ)), (a-μ) ≤ x ≤ (a+μ), μ > 0,

how does one assess the robustness of the t-test? One might invoke its generic robustness to symmetric non-Normal distributions and proceed as if the t-test is ‘fine’ for testing the hypotheses (5). A more well-grounded answer will be to assess the discrepancy between the nominal (assumed) error probabilities of the t-test based on (1) and the actual ones based on (6). If the latter approximate the former ‘closely enough’, one can justify the generic robustness. These answers, however, raise the broader question of what are the relevant error probabilities? After all, the optimal test for the hypotheses (5) in the context of (6), is no longer the t-test, but the test defined by:

w(**X**) = |{(n-1)([X_{[1]}+X_{[n]}]-μ_{0})}/{[X_{[1]}-X_{[n]}]}| ∽ F(2,2(n-1)), (7)

with a rejection region C_{1}:={**x**: w(**x**) > c_{α}}, where (X_{[1]}, X_{[n]}) denote the smallest and the largest element in the ordered sample (X_{[1]}, X_{[2]},…, X_{[n]}), and F(2,2(n-1)) the F distribution with 2 and 2(n-1) degrees of freedom; see Neyman and Pearson (1928). One can argue that the relevant comparison error probabilities are no longer the ones associated with the t-test ‘corrected’ to account for the assumed departure, but those associated with the test in (7). For instance, let the t-test have nominal and actual significance level .05 and .045, and power at μ_{1}=μ_{0}+1 of .4 and .37, respectively. The conventional wisdom will call the t-test robust, but is it reliable (effective) when compared with the test in (7), whose significance level and power (at μ_{1}) are, say, .03 and .9, respectively?
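The comparison of nominal versus actual error probabilities can itself be carried out by simulation, in the spirit of Pearson’s own experiments with tables of pseudo-random numbers. A hedged sketch (pure standard library; the sample size and cutoff are illustrative): generate samples from a uniform distribution, apply the nominal 5% two-sided t-test of H_{0}: μ = 0, and estimate the actual type I error:

```python
import math
import random
import statistics

random.seed(42)

def t_stat(xs, mu0=0.0):
    """Student's t statistic sqrt(n)(Xbar - mu0)/s."""
    n = len(xs)
    return math.sqrt(n) * (statistics.fmean(xs) - mu0) / statistics.stdev(xs)

# Nominal two-sided 5% t-test with n = 20 uses t_{0.975}(19) ~ 2.093.
crit, n, trials = 2.093, 20, 20_000
rejections = 0
for _ in range(trials):
    xs = [random.uniform(-1.0, 1.0) for _ in range(n)]   # symmetric, non-Normal, mean 0
    if abs(t_stat(xs)) > crit:
        rejections += 1
print(rejections / trials)   # actual size stays near the nominal 0.05
```

The estimated size lands close to the nominal level, illustrating the ‘generic robustness’ of the t-test to this symmetric departure; what such a check cannot settle, as argued above, is whether the t-test remains the relevant (optimal) test under the alternative model.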

A strong case can be made that a more complete approach to the statistical misspecification problem is:

(i) to probe thoroughly for any departures from all the model assumptions using trenchant M-S tests, and if any departures are detected,

(ii) proceed to respecify the statistical model by choosing a more appropriate model with a view to account for the statistical information that the original model did not.

Admittedly, this is a more demanding way to deal with departures from the underlying assumptions, but it addresses the concerns of Gosset, Egon Pearson, Neyman and Fisher much more effectively than the invocation of vague robustness claims; see Spanos (2010).

**References**

Bartlett, M. S. (1981) “Egon Sharpe Pearson, 11 August 1895-12 June 1980,” *Biographical Memoirs of Fellows of the Royal Society*, 27: 425-443.

D’Agostino, R. and E. S. Pearson (1973) “Tests for Departure from Normality. Empirical Results for the Distributions of b₂ and √(b₁),” *Biometrika*, 60: 613-622.

Fisher, R. A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population,” *Biometrika*, 10: 507-521.

Fisher, R. A. (1921) “On the “probable error” of a coefficient of correlation deduced from a small sample,” *Metron*, 1: 3-32.

Fisher, R. A. (1922a) “On the mathematical foundations of theoretical statistics,” *Philosophical Transactions of the Royal Society* A, 222, 309-368.

Fisher, R. A. (1922b) “The goodness of fit of regression formulae, and the distribution of regression coefficients,” *Journal of the Royal Statistical Society*, 85: 597-612.

Fisher, R. A. (1925) *Statistical Methods for Research Workers*, Oliver and Boyd, Edinburgh.

Fisher, R. A. (1929), “Moments and Product Moments of Sampling Distributions,” *Proceedings of the London Mathematical Society*, Series 2, 30: 199-238.

Neyman, J. and E. S. Pearson (1928) “On the use and interpretation of certain test criteria for purposes of statistical inference: Part I,” *Biometrika*, 20A: 175-240.

Neyman, J. and E. S. Pearson (1933) “On the problem of the most efficient tests of statistical hypotheses”, *Philosophical Transanctions of the Royal Society*, A, 231: 289-337.

Lehmann, E. L. (1975) *Nonparametrics: statistical methods based on ranks*, Holden-Day, San Francisco.

Lehmann, E. L. (1999) “‘Student’ and Small-Sample Theory,” *Statistical Science*, 14: 418-426.

Pearson, E. S. (1929a) “Review of ‘Statistical Methods for Research Workers,’ 1928, by Dr. R. A. Fisher”, *Nature*, June 8th, pp. 866-7.

Pearson, E. S. (1929b) “Some notes on sampling tests with two variables,” *Biometrika*, 21: 337-60.

Pearson, E. S. (1930) “A further development of tests for normality,” *Biometrika*, 22: 239-49.

Pearson, E. S. (1931) “The analysis of variance in cases of non-normal variation,” *Biometrika*, 23: 114-33.

Pearson, E. S. (1963) “Comparison of tests for randomness of points on a line,” *Biometrika*, 50: 315-25.

Pearson, E. S. and N. K. Adyanthaya (1928) “The distribution of frequency constants in small samples from symmetrical populations,” *Biometrika*, 20: 356-60.

Pearson, E. S. and N. K. Adyanthaya (1929) “The distribution of frequency constants in small samples from non-normal symmetrical and skew populations,” *Biometrika*, 21: 259-86.

Pearson, E. S. and N. W. Please (1975) “Relations between the shape of the population distribution and the robustness of four simple test statistics,” *Biometrika*, 62: 223-241.

Pearson, E. S., R. B. D’Agostino and K. O. Bowman (1977) “Tests for departure from normality: comparisons of powers,” *Biometrika*, 64: 231-246.

Spanos, A. (2010) “Akaike-type Criteria and the Reliability of Inference: Model Selection vs. Statistical Model Specification,” *Journal of Econometrics*, 158: 204-220.

Student (1908), “The Probable Error of the Mean,” *Biometrika*, 6: 1-25.

This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980)–one of my statistical heroes. It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. Yes, I know I’ve been neglecting this blog as of late, because I’m busy planning our workshop: The Statistics Wars and Their Casualties (22-23 September, online). See phil-stat-wars.com. I will reblog some favorite Pearson posts in the next few days.

**HAPPY BELATED BIRTHDAY EGON!**

Are methods based on error probabilities of use mainly to supply procedures that will not err too frequently in some long run? (*performance*). Or is it the other way round: that the control of long-run error properties is of crucial importance for probing the causes of the data at hand? (*probativeness*). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.

*Cases of Type A and Type B*

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

In cases of type A, long-run results are clearly of interest, while in cases of type B, repetition is impossible and may be irrelevant:

“In other and, no doubt, more numerous cases there is no repetition of the same type of trial or experiment, but all the same we can and many of us do use the same test rules to guide our decision, following the analysis of an isolated set of numerical data. Why do we do this? What are the springs of decision? Is it because the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment?

Or is it because we are content that the application of a rule, now in this investigation, now in that, should result in a long-run frequency of errors in judgment which we control at a low figure?” (Ibid., 173)

Although Pearson leaves this tantalizing question unanswered, claiming, “On this I should not care to dogmatize”, in studying how Pearson treats cases of type B, it is evident that in his view, “the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment” in learning about the particular case at hand.

“Whereas when tackling problem A it is easy to convince the practical man of the value of a probability construct related to frequency of occurrence, in problem B the argument that ‘if we were to repeatedly do so and so, such and such result would follow in the long run’ is at once met by the commonsense answer that we never should carry out a precisely similar trial again.

Nevertheless, it is clear that the scientist with a knowledge of statistical method behind him can make his contribution to a round-table discussion…” (Ibid., 171).

Pearson gives the following example of a case of type B (from his wartime work), where he claims no repetition is intended:

“Example of type B. Two types of heavy armour-piercing naval shell of the same caliber are under consideration; they may be of different design or made by different firms…. Twelve shells of one kind and eight of the other have been fired; two of the former and five of the latter failed to perforate the plate….” (Pearson 1947, 171)

“Starting from the basis that individual shells will never be identical in armour-piercing qualities, however good the control of production, he has to consider how much of the difference between (i) two failures out of twelve and (ii) five failures out of eight is likely to be due to this inevitable variability…” (Ibid.)

*We’re interested in considering what other outcomes could have occurred, and how readily, in order to learn what variability alone is capable of producing.* As a noteworthy aside, Pearson shows that treating the observed difference (between the two proportions) in one way yields an observed significance level of 0.052; treating it differently (along Barnard’s lines), he gets 0.025 as the (upper) significance level. But in scientific cases, Pearson insists, the difference in error probabilities makes no real difference to substantive judgments in interpreting the results. Only in an unthinking, automatic, routine use of tests would it matter:

“Were the action taken to be decided automatically by the side of the 5% level on which the observation point fell, it is clear that the method of analysis used would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule.” (ibid., 192)

The two analyses correspond to the tests effectively asking different questions, and if we recognize this, says Pearson, different meanings may be appropriately attached.
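As an aside for readers who want to check the arithmetic: Pearson's first figure, 0.052, can be reproduced as the one-sided exact conditional (hypergeometric) p-value for the 2×2 shell table. Here is a minimal sketch in Python using only the standard library; the function name is mine, and Barnard's unconditional analysis (which yields the 0.025) is not shown.

```python
from math import comb

def hypergeom_upper_p(fail_b, fired_b, fail_total, fired_total):
    """One-sided exact conditional p-value: the probability, holding the
    table margins fixed, of at least `fail_b` failures among the `fired_b`
    shells of the second type (a hypergeometric tail sum)."""
    denom = comb(fired_total, fail_total)
    fired_a = fired_total - fired_b
    p = 0.0
    for k in range(fail_b, min(fired_b, fail_total) + 1):
        p += comb(fired_b, k) * comb(fired_a, fail_total - k) / denom
    return p

# Pearson's data: 12 shells with 2 failures vs. 8 shells with 5 failures,
# so 7 failures in 20 firings overall.
p = hypergeom_upper_p(fail_b=5, fired_b=8, fail_total=7, fired_total=20)
print(round(p, 3))  # 0.052 -- the first of Pearson's two significance levels
```

The conditional and unconditional analyses effectively ask different questions of the same table, which is precisely Pearson's point about attaching different meanings to the two numbers.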

*Three Steps in the Original Construction of Tests*

After setting up the test (or null) hypothesis, and the alternative hypotheses against which “we wish the test to have maximum discriminating power” (Pearson 1947, 173), Pearson defines three steps in specifying tests:

“Step 1. We must specify the experimental probability set, the set of results which could follow on repeated application of the random process used in the collection of the data…

“Step 2. We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined, on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts”.

“Step 3. We then, if possible[i], associate with each contour level the chance that, if [the null] is true, a result will occur in random sampling lying beyond that level” (ibid.).

Pearson warns that:

“Although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order.” (Ibid. 173).

Strict behavioristic formulations jump from step 1 to step 3, after which one may calculate how the test has in effect accomplished step 2. However, the resulting test, while having adequate error probabilities, may have an inadequate distance measure and may even be irrelevant to the hypothesis of interest. This is one reason critics can construct howlers that appear to be licensed by N-P methods, and which make their way from time to time into this blog.

So step 3 remains crucial, even for cases of type B, for two reasons: first, for pre-data planning (that's familiar enough); second, for post-data scrutiny. Post-data, step 3 enables determining the capability of the test to have detected various discrepancies, departures, and errors, on which a critical scrutiny of the inferences is based. More specifically, the error probabilities are used to determine how well or poorly corroborated, or how severely tested, various claims are, post-data.
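This post-data use of step 3 can be made concrete with the simple one-sided Normal test (H0: μ ≤ μ0 vs. H1: μ > μ0, known σ) used throughout SIST. The sketch below is my own illustration with made-up numbers, not an example from Pearson:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def severity(xbar, mu1, sigma, n):
    """SEV(mu > mu1) after a statistically significant result in the
    one-sided Normal test: the probability of a result less extreme
    than the one observed, were mu no greater than mu1."""
    return norm_cdf((xbar - mu1) / (sigma / sqrt(n)))

# Illustrative numbers: sigma = 1, n = 25, observed mean 0.4 (z = 2.0,
# significant at the 0.025 level against mu0 = 0).
print(round(severity(0.4, 0.2, 1.0, 25), 3))  # 0.841: "mu > 0.2" passes severely
print(round(severity(0.4, 0.5, 1.0, 25), 3))  # 0.309: "mu > 0.5" is poorly warranted
```

The same significant result thus warrants some discrepancies from the null well and others badly, which is what "determining the capability of the test to have detected various discrepancies" comes to in practice.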

If we can readily bring about statistically significantly higher rates of success with the first type of armour-piercing naval shell than with the second (in the above example), we have evidence the first is superior. Or, as Pearson modestly puts it: the results “raise considerable doubts as to whether the performance of the [second] type of shell was as good as that of the [first]….” (Ibid., 192)[ii]

Still, while error rates of procedures may be used to determine how severely claims have or have not passed, they do not automatically do so; hence, again, the door is open to potential howlers that neither Egon nor, for that matter, Jerzy would have countenanced.

*Neyman Was the More Behavioristic of the Two*

Pearson was (rightly) considered to have rejected the more behaviorist leanings of Neyman.

Here’s a snippet from an unpublished letter he wrote to Birnbaum (1974) about the idea that the N-P theory admits of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

In Pearson’s (1955) response to Fisher (blogged here):

“To dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot….!” (Pearson 1955, 204)

“To the best of my ability I was searching for a way of expressing in mathematical terms what appeared to me to be the requirements of the scientist in applying statistical tests to his data. After contact was made with Neyman in 1926, the development of a joint mathematical theory proceeded much more surely; it was not till after the main lines of this theory had taken shape with its necessary formalization in terms of critical regions, the class of admissible hypotheses, the two sources of error, the power function, etc., that the fact that there was a remarkable parallelism of ideas in the field of acceptance sampling became apparent. Abraham Wald’s contributions to decision theory of ten to fifteen years later were perhaps strongly influenced by acceptance sampling problems, but that is another story.” (Ibid., 204-5)

“It may be readily agreed that in the first Neyman and Pearson paper of 1928, more space might have been given to discussing how the scientific worker’s attitude of mind could be related to the formal structure of the mathematical probability theory….Nevertheless it should be clear from the first paragraph of this paper that we were not speaking of the final acceptance or rejection of a scientific hypothesis on the basis of statistical analysis…. Indeed, from the start we shared Professor Fisher’s view that in scientific enquiry, a statistical test is ‘a means of learning’…” (Ibid., 206)

“Professor Fisher’s final criticism concerns the use of the term ‘inductive behavior’; this is Professor Neyman’s field rather than mine.” (Ibid., 207)

These points on Pearson are discussed in more depth in my book *Statistical Inference as Severe Testing (SIST): How to Get Beyond the Statistics Wars* (CUP 2018). You can read and download the entire book for free during the month of August 2022 at the link given at the end of this post.


**References:**

Neyman, J. and Pearson, E. S. (1928), “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference, Part I,” *Biometrika* 20(A): 175-240.

Pearson, E. S. (1947), “The Choice of Statistical Tests Illustrated on the Interpretation of Data Classed in a 2×2 Table,” *Biometrika* 34(1/2): 139-167.

Pearson, E. S. (1955), “Statistical Concepts and Their Relationship to Reality,” *Journal of the Royal Statistical Society, Series B (Methodological)*, 17(2): 204-207.

CUP will make the electronic version of my book, *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018), available to access for free from August 1-31 at this link: https://www.cambridge.org/core/books/statistical-inference-as-severe-testing/D9DF409EF568090F3F60407FF2B973B2. However, they will confirm the link closer to August, so check this blog on Aug 1 for any update, if you’re interested. *(July 31: the link works!)* *(August 5: the link is working. Let me know if you have problems getting in.)* Blurbs of the 16 tours in the book may be found here: blurbs of the 16 tours.

Here’s a CUP interview from when the book first came out.
