I came across a paper, “Tests of Statistical Significance Made Sound,” by Brian Haig, a psychology professor at the University of Canterbury, New Zealand. It hits most of the high notes regarding tests of statistical significance (ToSS), their history & philosophy, and, refreshingly, it is in the error-statistical spirit! I’m pasting excerpts from his discussion of “The Error-Statistical Perspective,” starting on p. 7.
The Error-Statistical Perspective
An important part of scientific research involves processes of detecting, correcting, and controlling for error, and mathematical statistics is one branch of methodology that helps scientists do this. In recognition of this fact, the philosopher of statistics and science, Deborah Mayo (e.g., Mayo, 1996), in collaboration with the econometrician, Aris Spanos (e.g., Mayo & Spanos, 2010, 2011), has systematically developed, and argued in favor of, an error-statistical philosophy for understanding experimental reasoning in science. Importantly, this philosophy permits, indeed encourages, the local use of ToSS, among other methods, to manage error.
In the error-statistical philosophy, the idea of an experiment is understood broadly to include controlled experiments, observational studies, and even thought experiments. What matters in all these types of inquiry is that a planned study permits one to mount reliable arguments from error. By using statistics, the researcher is able to model ‘‘what it would be like to control, manipulate, and change in situations where we cannot literally’’ do so (Mayo, 1996, p. 459). Furthermore, although the error-statistical approach has broad application within science, it is concerned neither with all of science nor with error generally. Instead, it focuses on scientific experimentation and error probabilities, which ground knowledge obtained from the use of statistical methods.
Development of the Error-Statistical Philosophy
In her initial formulation of the error-statistical philosophy, Mayo (1996) modified, and built upon, the classical Neyman–Pearsonian approach to ToSS. However, in later publications with Spanos (e.g., Mayo & Spanos, 2011), and in writings with David Cox (Cox & Mayo, 2010; Mayo & Cox, 2010), her error-statistical approach has come to represent a coherent blend of many elements, including both Neyman–Pearsonian and Fisherian thinking. For Fisher, reasoning about p values is based on postdata, or after-trial, consideration of probabilities, whereas Neyman and Pearson’s Type I and Type II errors are based on predata, or before-trial, error probabilities. The error-statistical approach assigns each a proper role that serves as an important complement to the other (Mayo & Spanos, 2011; Spanos, 2010). Thus, the error-statistical approach partially resurrects and combines, in a coherent way, elements of two perspectives that have been widely considered incompatible. In the postdata element of this union, reasoning takes the form of severe testing, a notion to which I now turn.
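To make the predata/postdata contrast concrete, here is a minimal sketch (my illustration, not from Haig’s paper) for a one-sided Normal test with known standard deviation; all of the numbers are purely illustrative.

```python
# Pre-data vs. post-data error probabilities for testing
# H0: mu <= 0 against H1: mu > 0 with known sigma (illustrative numbers).
import math
from scipy.stats import norm

sigma, n, alpha = 1.0, 100, 0.05
se = sigma / math.sqrt(n)

# Pre-data (Neyman-Pearson): the cutoff and error probabilities are
# fixed before any data are observed.
cutoff = norm.ppf(1 - alpha) * se          # reject H0 when x_bar > cutoff
type_I = 1 - norm.cdf(cutoff / se)         # equals alpha by construction

def power(mu1):
    """Pre-data probability that the test rejects H0 when mu = mu1."""
    return 1 - norm.cdf((cutoff - mu1) / se)

type_II = 1 - power(0.3)                   # beta at the alternative mu = 0.3

# Post-data (Fisher): the p value is computed from the observed mean.
x_bar = 0.25
p_value = 1 - norm.cdf(x_bar / se)

print(f"cutoff = {cutoff:.3f}, alpha = {type_I:.3f}, beta(0.3) = {type_II:.3f}")
print(f"observed x_bar = {x_bar}, p value = {p_value:.4f}")
```

The point of the contrast: alpha and beta are properties of the test procedure, available before the trial; the p value exists only once the observed mean is in hand.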
The Severity Principle
Central to the error-statistical approach is the notion of a severe test, which is a means of gaining knowledge of experimental effects. An adequate test of an experimental claim must be a severe test in the sense that relevant data must be good evidence for a hypothesis. Thus, according to the error-statistical perspective, a sufficiently severe test should conform to the severity principle, which has two variants: a weak severity principle and a full severity principle. The weak severity principle acknowledges situations where we should deny that data are evidence for a hypothesis. Adhering to this principle discharges the investigator’s responsibility to identify and eliminate situations where an agreement between data and hypothesis occurs when the hypothesis is false. Mayo and Spanos (2011) state the principle as follows:
Data x0 (produced by process G) do not provide good evidence for hypothesis H if x0 results from a test procedure with a very low probability or capacity of having uncovered the falsity of H, even if H is incorrect. (p. 162)
However, this negative conception of evidence, although important, is not sufficient; it needs to be conjoined with the positive conception of evidence to be found in the full severity principle. Mayo and Spanos (2011) formulate the principle thus,
Data x0 (produced by process G) provide good evidence for hypothesis H (just) to the extent that test T has severely passed H with x0. (p. 162)
With a severely tested hypothesis, the probability is low that the test procedure would pass the hypothesis if it were false. Furthermore, the probability that the data would agree this well with the hypothesis, were the hypothesis false, must be very low. The full severity principle is the key to the error-statistical account of evidence and provides the core of the rationale for the use of error-statistical methods. The error probabilities afforded by these methods provide a measure of how frequently the methods can discriminate between alternative hypotheses, and how reliably they can detect errors.
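A rough numerical sketch may help here. The following follows the severity function for the one-sided Normal test defined in Mayo and Spanos (2006); the sample numbers are mine and purely illustrative.

```python
# Post-data severity for the one-sided Normal test of H0: mu <= 0
# vs H1: mu > 0 with known sigma, after H0 has been rejected.
# Follows the severity function in Mayo & Spanos (2006); numbers illustrative.
import math
from scipy.stats import norm

def severity(mu1, x_bar, sigma=1.0, n=100):
    """SEV(mu > mu1): the probability the test would have produced a result
    according *less* well with 'mu > mu1' (i.e., X_bar <= x_bar) if mu
    were only mu1."""
    se = sigma / math.sqrt(n)
    return norm.cdf((x_bar - mu1) / se)

x_bar = 0.25   # observed mean; rejects H0 at the 5% level (cutoff ~ 0.164)
for mu1 in (0.0, 0.1, 0.2, 0.3):
    print(f"SEV(mu > {mu1}) = {severity(mu1, x_bar):.3f}")
```

The same data that severely pass “mu > 0” (SEV ≈ 0.994) fail to severely pass “mu > 0.3” (SEV ≈ 0.309), which is how severity reasoning blocks unwarranted inferences from a single statistically significant result.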
The error-statistical approach constitutes an inductive approach to scientific inquiry. However, unlike favored inductive methods that emphasize the broad logical nature of inductive reasoning (notably, the standard hypothetico-deductive method and the Bayesian approach to scientific inference), the error-statistical approach furnishes context-dependent, local accounts of statistical reasoning. It seeks to rectify the troubled foundations of Fisher’s account of inductive inference, makes selective use of Neyman and Pearson’s behaviorist conception of inductive behavior, and endorses Charles Peirce’s (1931-1958) view that inductive inference is justified pragmatically in terms of self-correcting inductive methods.
The error-statistical approach employs a wide variety of error-statistical methods to link experimental data to theoretical hypotheses. These include the panoply of standard frequentist statistics that use error probabilities assigned on the basis of the relative frequencies of errors in repeated sampling, such as ToSS and confidence interval estimation, which are used to collect, model, and interpret data. They also include computer-intensive resampling methods, such as the bootstrap, Monte Carlo simulations, nonparametric methods, and ‘‘noninferential’’ methods for exploratory data analysis. In all this, ToSS have a minor, though useful, role.
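Of the computer-intensive methods the excerpt mentions, the bootstrap is easiest to show in a few lines. The sketch below is my illustration, not Haig’s; the data are simulated stand-ins.

```python
# Minimal nonparametric bootstrap: a 95% percentile interval for the mean.
# Illustrative only; 'data' here are simulated rather than real observations.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.3, scale=1.0, size=100)   # stand-in for real data

# Resample the observed data with replacement and record each mean.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% percentile bootstrap interval for the mean: [{lo:.3f}, {hi:.3f}]")
```

Like the other frequentist tools above, the bootstrap earns its keep through the relative frequencies of errors in repeated (here, simulated) sampling.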
A Hierarchy of Models
In the early 1960s, Patrick Suppes (1962) suggested that science employs a hierarchy of models that ranges from experimental experience to theory. He claimed that theoretical models, which sit high in the hierarchy, are not compared directly with empirical data, which sit low in the hierarchy. Rather, they are compared with models of the data, which sit above the data in the hierarchy. The error-statistical approach similarly adopts a framework in which three different types of models are interconnected and serve to structure error-statistical inquiry: primary models, experimental models, and data models. Primary models break down a research question into a set of local hypotheses that can be investigated using reliable methods. Experimental models structure the particular experiment at hand and serve to link primary models to data models. And data models generate and model raw data, as well as check whether the data satisfy the assumptions of the experimental models. The error-statistical approach (Mayo & Spanos, 2010) has also been extended to primary models and theories of a more global nature. The hierarchy of models employed in the error-statistical perspective exhibits a structure similar to the important threefold distinction between data, phenomena, and theory (Woodward, 1989; see also Haig, 2014). These threefold distinctions accord better with scientific practice than the ubiquitous coarse-grained distinction between data and theory (or model).
Error-Statistical Philosophy and Falsificationism
The error-statistical approach shares a number of features with Karl Popper’s (1959) falsificationist theory of science. Both stress the importance of identifying and correcting errors for the growth of scientific knowledge, both focus on the importance of hypothesis testing in science, and both emphasize the importance of strong tests of hypotheses. However, the error-statistical approach differs from Popper’s theory in a number of respects: It focuses on statistical error and its role in experimentation, neither of which was considered by Popper. It employs a range of statistical methods to test for error. And, in contrast with Popper, who deemed deductive inference to be the only legitimate form of inference, it stresses the importance of inductive reasoning in its conception of science. This error-statistical stance regarding Popper can be construed as a constructive interpretation of Fisher’s oft-cited remark that the null hypothesis is never proved, only possibly disproved.
Error-Statistical Philosophy and Bayesianism
You can read this section on p. 10 of his paper. I’ll jump down to….
Virtues of the Error-Statistical Approach
The error-statistical approach has a number of strengths, which I enumerate at this point without justification:

1. It boasts a philosophy of statistical inference, which provides guidance for thinking about, and constructively using, common statistical methods, including ToSS, in the conduct of scientific experimentation. Statistical methods are often employed with the shallow understanding that comes from ignoring their accompanying theory and philosophy.
2. It has the conceptual and methodological resources to enable one to avoid the common misunderstandings of ToSS that afflict so much empirical research in the behavioral sciences.
3. It provides a challenging critique of, and alternative to, the Bayesian way of thinking in both statistics and current philosophy of science; moreover, it is arguably the major modern alternative to the Bayesian philosophy of statistics.
4. Finally, the error-statistical approach is not just a philosophy of statistics concerned with the growth of experimental knowledge. Mayo and Spanos also regard it as a general philosophy of science. As such, its authors employ error-statistical thinking to cast light on vexed philosophical problems to do with scientific inference, modeling, theory testing, explanation, and the like. A critical evaluation by prominent philosophers of science of the early extension of the error-statistical philosophy to the philosophy of science more generally can be found in Mayo and Spanos (2010).
He goes on to discuss how we avoid fallacies of rejection and non-rejection (“acceptance”). You can find it on pp. 11-12 here.
Share your comments; Haig has agreed to reply to queries, as will I.
 He had shared parts of an earlier draft, but I hadn’t read the final version completely. I’m not saying we agree on everything; I’ll post some comments on this.
References

Cox, D. R., & Mayo, D. G. (2010). Objectivity and conditionality in frequentist inference. In D. G. Mayo & A. Spanos (Eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science (pp. 276-304). Cambridge: Cambridge University Press.

Fisher, R. A. (1925). Statistical Methods for Research Workers. Edinburgh, Scotland: Oliver & Boyd.

Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66, 8-38.

Haig, B. D. (2014). Investigating the Psychological World: Scientific Method in the Behavioral Sciences. Cambridge, MA: MIT Press.

Haig, B. D. (2016). Tests of statistical significance made sound. Educational and Psychological Measurement, 1-18.

Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge (Science and Its Conceptual Foundations series). Chicago: University of Chicago Press.

Mayo, D. G. (2011). Statistical science and philosophy of science: Where do/should they meet in 2011 (and beyond)? Rationality, Markets and Morals (RMM), 2 (Special Topic: Statistical Science and Philosophy of Science), 79-102.

Mayo, D. G. (2012). Statistical science meets philosophy of science, Part 2: Shallow versus deep explorations. Rationality, Markets and Morals (RMM), 3 (Special Topic: Statistical Science and Philosophy of Science), 71-107.

Mayo, D. G., & Cox, D. R. (2010). Frequentist statistics as a theory of inductive inference. In D. G. Mayo & A. Spanos (Eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science (pp. 1-27). Cambridge: Cambridge University Press. This paper first appeared in The Second Erich L. Lehmann Symposium: Optimality (2006), Lecture Notes-Monograph Series, Vol. 49, Institute of Mathematical Statistics, pp. 247-275.

Mayo, D. G., & Spanos, A. (2006). Severe testing as a basic concept in a Neyman-Pearson philosophy of induction. British Journal for the Philosophy of Science, 57, 323-357.

Mayo, D. G., & Spanos, A. (2010). Introduction and background: Part I: Central goals, themes, and questions; Part II: The error-statistical philosophy. In D. G. Mayo & A. Spanos (Eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science (pp. 1-14, 15-27). Cambridge: Cambridge University Press.

Mayo, D. G., & Spanos, A. (2011). Error statistics. In P. S. Bandyopadhyay & M. R. Forster (Vol. Eds.), Philosophy of Statistics (Handbook of the Philosophy of Science, Vol. 7; General Eds. D. M. Gabbay, P. Thagard, & J. Woods) (pp. 1-46). Elsevier.

Peirce, C. S. (1931-1958). The Collected Papers of Charles Sanders Peirce (Vols. 1-8; C. Hartshorne & P. Weiss, Eds., Vols. 1-6; A. W. Burks, Ed., Vols. 7-8). Cambridge, MA: Harvard University Press.

Popper, K. (1959). The Logic of Scientific Discovery. New York: Basic Books.

Spanos, A. (2010). On a new philosophy of frequentist inference: Exchanges with David Cox and Deborah G. Mayo. In D. G. Mayo & A. Spanos (Eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science (pp. 315-330). Cambridge: Cambridge University Press.

Suppes, P. (1962). Models of data. In E. Nagel, P. Suppes, & A. Tarski (Eds.), Logic, Methodology, and Philosophy of Science: Proceedings of the 1960 International Congress (pp. 252-261). Stanford, CA: Stanford University Press.

Woodward, J. (1989). Data and phenomena. Synthese, 79, 393-472.
Dr. Emrah Aktunc
This is a really useful paper. Thank you. It is helpful to hear (read) different accounts of an issue, which gives a more nuanced understanding of the themes involved. Error-statistical testing is being argued to be important for psychology and other social sciences, but behavioral null hypothesis significance testing (NHST) is also widely misused in climate and other environment/earth-system sciences, largely because of the collapsed theory/model distinction articulated in Haig’s paper.