“Tests of Statistical Significance Made Sound”: excerpts from B. Haig

Posted on December 11, 2016 by Mayo

I came across a paper, “Tests of Statistical Significance Made Sound,” by Brian Haig, a psychology professor at the University of Canterbury, New Zealand. It hits most of the high notes regarding statistical significance tests, their history & philosophy and, refreshingly, is in the error statistical spirit! I’m pasting excerpts from his discussion of “The Error-Statistical Perspective” starting on p. 7.[1]

The Error-Statistical Perspective

An important part of scientific research involves processes of detecting, correcting, and controlling for error, and mathematical statistics is one branch of methodology that helps scientists do this. In recognition of this fact, the philosopher of statistics and science, Deborah Mayo (e.g., Mayo, 1996), in collaboration with the econometrician, Aris Spanos (e.g., Mayo & Spanos, 2010, 2011), has systematically developed, and argued in favor of, an error-statistical philosophy for understanding experimental reasoning in science. Importantly, this philosophy permits, indeed encourages, the local use of ToSS, among other methods, to manage error.

In the error-statistical philosophy, the idea of an experiment is understood broadly to include controlled experiments, observational studies, and even thought experiments. What matters in all these types of inquiry is that a planned study permits one to mount reliable arguments from error. By using statistics, the researcher is able to model ‘‘what it would be like to control, manipulate, and change in situations where we cannot literally’’ do so (Mayo, 1996, p. 459). Furthermore, although the error-statistical approach has broad application within science, it is concerned neither with all of science nor with error generally. Instead, it focuses on scientific experimentation and error probabilities, which ground knowledge obtained from the use of statistical methods.

Development of the Error-Statistical Philosophy

In her initial formulation of the error-statistical philosophy, Mayo (1996) modified, and built upon, the classical Neyman–Pearsonian approach to ToSS. However, in later publications with Spanos (e.g., Mayo & Spanos, 2011), and in writings with David Cox (Cox & Mayo, 2010; Mayo & Cox, 2010), her error-statistical approach has come to represent a coherent blend of many elements, including both Neyman–Pearsonian and Fisherian thinking. For Fisher, reasoning about p values is based on postdata, or after-trial, consideration of probabilities, whereas Neyman and Pearson’s Type I and Type II errors are based on predata, or before-trial, error probabilities. The error-statistical approach assigns each a proper role that serves as an important complement to the other (Mayo & Spanos, 2011; Spanos, 2010). Thus, the error-statistical approach partially resurrects and combines, in a coherent way, elements of two perspectives that have been widely considered to be incompatible. In the post-data element of this union, reasoning takes the form of severe testing, a notion to which I now turn.
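The predata/postdata contrast can be made concrete with a small Python sketch. It is only an illustration under assumptions that are not in Haig's text: a one-sided test of a Normal mean (null mu = 0 against mu > 0) with known sigma, test statistic T = sqrt(n) * xbar / sigma, and hypothetical function names and numbers.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal, used for all tail areas


def predata_error_probs(alpha, mu_alt, sigma, n):
    """Before-trial quantities: the rejection cutoff fixed by the Type I
    error rate alpha, and the Type II error probability against mu_alt."""
    cutoff = Z.inv_cdf(1 - alpha)          # reject when T > cutoff
    shift = sqrt(n) * mu_alt / sigma       # mean of T when mu = mu_alt
    type2 = Z.cdf(cutoff - shift)          # Pr(T <= cutoff; mu = mu_alt)
    return cutoff, type2


def postdata_p_value(xbar, sigma, n):
    """After-trial quantity: Pr(T > t_obs; mu = 0) for the observed mean."""
    t_obs = sqrt(n) * xbar / sigma
    return 1 - Z.cdf(t_obs)


if __name__ == "__main__":
    cutoff, type2 = predata_error_probs(alpha=0.05, mu_alt=0.3, sigma=1.0, n=100)
    print("cutoff:", round(cutoff, 3), "Type II error:", round(type2, 3))
    print("p-value:", round(postdata_p_value(xbar=0.2, sigma=1.0, n=100), 4))
```

The first function depends only on the design (alpha, sample size, the alternative of interest), while the second depends on the observed outcome, which is the sense in which the two sets of probabilities play complementary roles.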

The Severity Principle

Central to the error-statistical approach is the notion of a severe test, which is a means of gaining knowledge of experimental effects. An adequate test of an experimental claim must be a severe test in the sense that relevant data must be good evidence for a hypothesis. Thus, according to the error-statistical perspective, a sufficiently severe test should conform to the severity principle, which has two variants: A weak severity principle and a full severity principle. The weak severity principle acknowledges situations where we should deny that data are evidence for a hypothesis. Adhering to this principle discharges the investigator’s responsibility to identify and eliminate situations where an agreement between data and hypothesis occurs when the hypothesis is false. Mayo and Spanos (2011) state the principle as follows:

Data x₀ (produced by process G) do not provide good evidence for hypothesis H if x₀ results from a test procedure with a very low probability or capacity of having uncovered the falsity of H, even if H is incorrect. (p. 162)

However, this negative conception of evidence, although important, is not sufficient; it needs to be conjoined with the positive conception of evidence to be found in the full severity principle. Mayo and Spanos (2011) formulate the principle thus,

Data x₀ (produced by process G) provide good evidence for hypothesis H (just) to the extent that test T has severely passed H with x₀. (p. 162)

With a severely tested hypothesis, the probability is low that test procedure would pass muster if the hypothesis was false. Furthermore, the probability that the data agree with the alternative hypothesis must be very low. The full severity principle is the key to the error-statistical account of evidence and provides the core of the rationale for the use of error-statistical methods. The error probabilities afforded by these methods provide a measure of how frequently the methods can discriminate between alternative hypotheses, and how reliably they can detect errors.

Error-Statistical Methods

The error-statistical approach constitutes an inductive approach to scientific inquiry. However, unlike favored inductive methods that emphasize the broad logical nature of inductive reasoning (notably, the standard hypothetico-deductive method and the Bayesian approach to scientific inference), the error-statistical approach furnishes context-dependent, local accounts of statistical reasoning. It seeks to rectify the troubled foundations of Fisher’s account of inductive inference, makes selective use of Neyman and Pearson’s behaviorist conception of inductive behavior, and endorses Charles Peirce’s (1931-1958) view that inductive inference is justified pragmatically in terms of self-correcting inductive methods.

The error-statistical approach employs a wide variety of error-statistical methods to link experimental data to theoretical hypotheses. These include the panoply of standard frequentist statistics that use error probabilities assigned on the basis of the relative frequencies of errors in repeated sampling, such as ToSS and confidence interval estimation, which are used to collect, model, and interpret data. They also include computer-intensive resampling methods, such as the bootstrap, Monte Carlo simulations, nonparametric methods, and ‘‘noninferential’’ methods for exploratory data analysis. In all this, ToSS have a minor, though useful, role.
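As a rough illustration of an error probability understood as a relative frequency of errors in repeated sampling, the following Python sketch estimates a Type I error rate by Monte Carlo simulation. The setup is an assumption made for the example only (a one-sided Normal test with known sigma, cutoff 1.645, hypothetical function name and defaults); nothing here is taken from Haig's paper.

```python
import random
from math import sqrt


def simulated_type1_rate(n=25, sigma=1.0, cutoff=1.645, trials=20000, seed=0):
    """Fraction of simulated samples, drawn with the null (mu = 0) true,
    in which the test statistic T = sqrt(n) * xbar / sigma exceeds cutoff."""
    rng = random.Random(seed)              # fixed seed for repeatability
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        t = sqrt(n) * xbar / sigma
        if t > cutoff:
            rejections += 1                # an erroneous rejection of a true null
    return rejections / trials


if __name__ == "__main__":
    # Should land near the nominal 0.05 for this cutoff.
    print(simulated_type1_rate())
```

The simulated rate approximates the nominal 5% level because the cutoff 1.645 is roughly the upper 5% point of the standard Normal distribution.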

A Hierarchy of Models

In the early 1960s, Patrick Suppes (1962) suggested that science employs a hierarchy of models that ranges from experimental experience to theory. He claimed that theoretical models, which are high on the hierarchy, are not compared directly with empirical data, which are low on the hierarchy. Rather, they are compared with models of the data, which are higher than data on the hierarchy. The error-statistical approach similarly adopts a framework in which three different types of models are interconnected and serve to structure error-statistical inquiry: primary models, experimental models, and data models. Primary models break down a research question into a set of local hypotheses that can be investigated using reliable methods. Experimental models structure the particular models at hand and serve to link primary models to data models. And, data models generate and model raw data, as well as checking whether the data satisfy the assumptions of the experimental models. The error-statistical approach (Mayo & Spanos, 2010) has also been extended to primary models and theories of a more global nature. The hierarchy of models employed in the error-statistical perspective exhibits a structure similar to the important threefold distinction between data, phenomena, and theory (Woodward, 1989; see also Haig, 2014). These similar threefold distinctions accord better with scientific practice than the ubiquitous coarse-grained data-theory/model distinction.

Error-Statistical Philosophy and Falsificationism

The error-statistical approach shares a number of features with Karl Popper’s (1959) falsificationist theory of science. Both stress the importance of identifying and correcting errors for the growth of scientific knowledge, both focus on the importance of hypothesis testing in science, and both emphasize the importance of strong tests of hypotheses. However, the error-statistical approach differs from Popper’s theory in a number of respects: It focuses on statistical error and its role in experimentation, neither of which were considered by Popper. It employs a range of statistical methods to test for error. And, in contrast with Popper, who deemed deductive inference to be the only legitimate form of inference, it stresses the importance of inductive reasoning in its conception of science. This error-statistical stance regarding Popper can be construed as a constructive interpretation of Fisher’s oft-cited remark that the null hypothesis is never proved, only possibly disproved.

Error-Statistical Philosophy and Bayesianism

You can read this section on p. 10 of his paper. I’ll jump down to….

Virtues of the Error-Statistical Approach

The error-statistical approach has a number of strengths, which I enumerate at this point without justification: (1) it boasts a philosophy of statistical inference, which provides guidance for thinking about, and constructively using, common statistical methods, including ToSS, for the conduct of scientific experimentation. Statistical methods are often employed with a shallow understanding that comes from ignoring their accompanying theory and philosophy; (2) it has the conceptual and methodological resources to enable one to avoid the common misunderstandings of ToSS, which afflict so much empirical research in the behavioral sciences; (3) it provides a challenging critique of, and alternative to, the Bayesian way of thinking in both statistics and current philosophy of science; moreover, it is arguably the major modern alternative to the Bayesian philosophy of statistics; (4) finally, the error-statistical approach is not just a philosophy of statistics concerned with the growth of experimental knowledge. It is also regarded by Mayo and Spanos as a general philosophy of science. As such, its authors employ error-statistical thinking to cast light on vexed philosophical problems to do with scientific inference, modeling, theory testing, explanation, and the like. A critical evaluation by prominent philosophers of science of the early extension of the error-statistical philosophy to the philosophy of science more generally can be found in Mayo and Spanos (2010).

He goes on to discuss how we avoid fallacies of rejection and non-rejection (“acceptance”). You can find it on pp. 11-12 here.

Share your comments; Haig has agreed to reply to queries, as will I.

[1] He had shared parts of an earlier draft, but I hadn’t read the final version completely. I’m not saying we agree on everything; I’ll post some comments on this.

REFERENCES:

Cox, D. R. and Mayo, D. G. (2010). “Objectivity and Conditionality in Frequentist Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 276-304.

Haig, B. D. (2016). “Tests of Statistical Significance Made Sound”, Educational and Psychological Measurement, pp. 1-18.

Haig, B. D. (2014). Investigating the Psychological World: Scientific Method in the Behavioral Sciences. Cambridge: MIT Press.

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, Scotland: Oliver & Boyd.

Gelman, A., & Shalizi, C.R (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66, 8-38.

Mayo, D. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago: University of Chicago Press.

Mayo, D. G. (2011) “Statistical Science and Philosophy of Science: Where Do/Should They Meet in 2011 (and beyond).” Rationality, Markets and Morals (RMM) 2, Special Topic: Statistical Science and Philosophy of Science, 79–102.

Mayo, D. G. (2012). “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations”, Rationality, Markets, and Morals (RMM) 3, Special Topic: Statistical Science and Philosophy of Science, 71–107.

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.

Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal of Philosophy of Science, 57: 323-357.

Mayo, D. G. and Spanos, A. (2010). “Introduction and Background: Part I: Central Goals, Themes, and Questions; Part II The Error-Statistical Philosophy” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-14, 15-27.

Mayo, D. G. and Spanos, A. (2011) “Error Statistics” in Philosophy of Statistics, Handbook of Philosophy of Science Volume 7 Philosophy of Statistics, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.

Peirce, C. S. (1931-1958). The collected papers of Charles Sanders Peirce (Vols. 1-8; C. Hartshorne & P. Weiss [Eds., Vols. 1-6], & A. W. Burks [Ed., Vols. 7-8]). Cambridge, MA: Harvard University Press.

Popper, K. (1959). The Logic of Scientific Discovery. New York: Basic Books.

Spanos, A. (2010). On a New Philosophy of Frequentist Inference: Exchanges with David Cox and Deborah G. Mayo. In D. G. Mayo & A. Spanos (Eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science (pp. 315-330). New York, NY: Cambridge University Press.

Suppes, P. (1962). Models of Data. In E. Nagel, P. Suppes, & A. Tarski (Eds.), Logic, methodology, and philosophy of science: Proceedings of the 1960 International Congress (pp. 252-261). Stanford, CA: Stanford University Press.

Woodward, J. (1989). Data and Phenomena. Synthese, 79, 393-472.

Categories: Bayesian/frequentist, Error Statistics, fallacy of rejection, P-values, Statistics | 12 Comments

12 thoughts on ““Tests of Statistical Significance Made Sound”: excerpts from B. Haig”

December 12, 2016

emrahaktunc

nice 🙂

Dr. Emrah Aktunc


December 12, 2016

Roger Jones

This is a really useful paper. Thank you. It is helpful to hear (read) different accounts of an issue, which gives a more nuanced understanding of the themes involved. Error statistical testing is being argued as important for psychology and other social sciences, but behavioural null hypothesis statistical testing (NHST) is also being widely misused in climate and other environment/earth system sciences, largely because of the misapplication of the collapsed theory/model distinction articulated in Haig’s paper.


December 14, 2016

Brian Haig

Roger:

Thanks for your positive feedback about my paper. I’m pleased you found it useful. Your comments prompt the following thoughts.

1. NHST, understood as a muddled hybrid of what textbook writers think Fisher and Neyman and Pearson believed about significance testing, is used widely in the social and behavioral sciences, and in climate and related sciences. The recent book by Ray Hubbard (Corrupt Research, Sage, 2016) is an excellent scholarly treatment of the widespread use and misuse of NHST in management and social science. In virtually all of these applications, the misunderstandings generated by the hybrid are compounded by researchers’ own multiple confusions about the hybrid. The physical sciences are an interesting case. There, tests of significance are likely to be used to better purpose than in other disciplines. For example, as I noted in my paper, they were used responsibly in the discovery of a Higgs boson. Given that much was at stake in that discovery venture, teams of physicists consulted seriously with professional statisticians about how best to use statistical methods in their data analysis. More of this type of consultation is needed in our sciences.

2. Paul Meehl laboured long and hard, with limited success, to convince psychologists that it was inappropriate to use NHST to weakly test substantive hypotheses and theories. Influenced by Popper and, later, Lakatos, and taking error detection and elimination quite seriously, Meehl urged psychologists to submit their theories to much stronger tests. Given its emphasis on strong tests for error detection and elimination, the ES outlook has an important role to play here. I was interested to learn that you are using insights from the ES perspective to strongly test competing theories in your own field of climate science. Prospective users of ES methods should be encouraged by what you are doing. I have heard it said by a prominent methodologist in psychology that the ES perspective is a philosophy (which it is) and doesn’t contain useful guidance for researchers. I don’t see it this way. The panoply of frequentist methods used by error statisticians is familiar to most of us. The challenge is to learn about the ES perspective and employ it to understand and use our frequentist methods in more appropriate ways. I don’t underestimate the challenge, but who said science was easy? Realistically, this is the sort of hard challenge that awaits those who are prepared to move away from the mindless, mechanical use of tests of significance and related methods.

3. Finally, and briefly, I agree that the model/theory conflation you mention wrecks our attempts to produce good science. Suppes’ hierarchy of models, and the ES differentiation between primary, data, and experimental models provide a more realistic and helpful structure for science than the simpler frameworks we tend to work within. The three-fold distinction between data, empirical phenomena, and explanatory theory alluded to in my paper is a similarly valuable structuring device. With this distinction, data serve as evidence for phenomena, and phenomena in turn serve as evidence for theory. The simplistic data/theory talk that abounds in science gives a misleading picture of how good science often proceeds. The all-important distinction between statistical and substantive claims, and researchers’ conflation of them, applies to these three-fold structures. It is a merit of the ES approach that it endeavors to avoid these conflations.


December 12, 2016

Mayo

Brian:

This is an excellent paper and I hope it has a real impact in psychology. I’d like to separately ask Haig about issues in measurement psychology—I’ll come back to this.
A couple of things:

1. CIs, as used by the “new statisticians,” are very much in need of reform. They use them dichotomously, treat on par all the parameter values within an interval with a single fixed confidence level, permit fallacies of tests to persist (only in CI guise), and retain a behavioristic, performance rationale of CIs. For two posts on this, see:
Do CIs avoid fallacies of tests?
https://errorstatistics.com/2013/06/05/do-cis-avoid-fallacies-of-tests-reforming-the-reformers-reblog-51712/
Anything tests can do CIs do better
https://errorstatistics.com/2013/06/06/anything-tests-can-do-cis-do-better-cis-do-anything-better-than-tests-reforming-the-reformers-cont/

2. On the “inconsistent hybrid” of Fisher and N-P and the ousting of NHST.

Given the level of confusion that often reigns, it’s likely that talk of ousting NHST will mislead—even though Haig means to oust the abusive animal to which NHST often is taken to refer: reject with a single small p-value and move to inferring a research hypothesis (being sloppy about cherry picking, p-hacking and other QRPs).

A recent post from Gigerenzer’s contribution to our PSA symposium is relevant:
https://errorstatistics.com/2016/11/08/gigerenzer-at-the-psa-how-fisher-neyman-pearson-bayes-were-transformed-into-the-null-ritual-comments-and-queries-i/

Most importantly, the supposition that there is an inconsistent hybrid is false and has done very serious damage—something I’ve become increasingly aware of in the last few years. On the falsity, the accounts are mathematically in sync and are best seen as different tools for different kinds of questions. On the damage, the typical argument goes like this:

N-P is irrelevant for inference because its error probabilities only concern controlling errors in a long run series of applications. So it should be dumped for statistical inference. Fisher intends p-values to be relevant for inference, and strength of evidence. But this requires some form of probabilism (e.g., posteriors, Bayes factors, likelihoodist measures) and P-values aren’t any of these things. Therefore, P-values are invariably misinterpreted and also should be dumped for statistical inference. Therefore all of error statistics is wrongheaded and should be repealed and replaced by some form of probabilism.

This is a neat trick but it’s wrong. Error probabilities may be used to assess and control severity of tests—that goes for Fisherian and N-P tests. Moreover, N-P were only making good on Fisherian performance goals; Fisher started backtracking like crazy only after the break-up (after 1935), disgruntled because he saw N-P methods overshadowing his own.
I have several posts on this. Here’s just one:
Are P-values error probabilities: it’s the methods stupid.

https://errorstatistics.com/2014/08/17/are-p-values-error-probabilities-installment-1/

For newcomers or anyone who wants to see the 5 years of this blog (up to Sept 2016), please see:
All she wrote so far: error statistics philosophy 5 years on.
https://errorstatistics.com/2016/09/03/all-she-wrote-so-far-error-statistics-philosophy-5-years-on/

December 15, 2016

Brian Haig

Reply to Deborah Mayo

Deborah:

Naturally, I’m pleased that you like my paper. Thanks for the positive remarks. You asked me to respond to two matters: (1) confidence intervals and the new statistics; and (2) the wisdom of talking about ‘NHST’ as an incoherent account of significance testing. Since you address the first matter to my satisfaction, I won’t say much about it. However, I will comment on the new statistics more generally, since I think that their ready, and uncritical, adoption in psychology gives cause for real concern.

1. In the face of criticisms of NHST, as NHST is understood in psychology, some methods reformers have strongly urged that it be replaced by the new statistics of confidence intervals, effect sizes, and meta-analysis as the mainstay of statistical data analysis. The new statistics as a package deal is quickly becoming institutionalized, for example by the APS journal, Psychological Science.

2. As I pointed out in my paper, I think that the main argument underwriting the case for adopting the new statistics is fallacious. I agree that NHST, as it is understood in psychology, should be rejected because it is an indefensible amalgam of what are taken to be elements of Fisher and Neyman and Pearson. But that hardly justifies rejecting tests of significance altogether. To mount a convincing argument, the new statisticians would have to show that credible accounts of significance testing (mainly, the neo-Fisherian and the error-statistical) are, in fact, suspect. This they have not bothered to do.

3. You quickly point out the ES criticisms of CIs, as the “new” statisticians understand them, and refer the reader to two of your earlier posts on the topic for detail. I find the ES critique right-headed on this matter, and I think it leaves the new statistics account of CIs broken-backed. One additional point about CIs, though: The new statisticians claim that CIs are more “natural”, and more easily understood, than tests of significance. I doubt that this is so. Much of physical science is unnatural, lies outside the bounds of common sense, and is learnt with difficulty (Wolpert, 1992). I think this holds for statistical science, and parts of psychology, too. In a sense, both tests of significance and CIs are unnatural and require a good deal of effort to understand them properly. There is some empirical evidence that CIs are regularly misinterpreted (Hoekstra et al., 2014; but see Garcia-Perez & Alcala-Quintana, 2016).

4. With the new statistics, parameter estimation replaces hypothesis testing as the dominant approach to research. However, this would be to adopt a narrow view of research that would greatly hinder progress in psychological science if it became the modal research practice. For example, it leads Cumming (2012) to aver that typical questions asked by science are what questions (e.g., “What is the age of the earth?”; “What is the most likely sea-level rise by 2100?”). Explanatory why questions and how questions (the latter asking for information about causal mechanisms) are not explicitly considered by Cumming. However, why and how questions are just the sort of questions that science characteristically answers when constructing and evaluating hypotheses and theories. I agree with Morey et al. (2014) that hypothesis testing has an important place in scientific research, and that it seeks answers to questions that cannot always be answered by using estimation techniques.

5. Given the many and varied tasks of science, a vigorous methodological pluralism is essential for its progress. As a prominent advocate of the new statistics, Cumming (2014) seems to endorse methodological pluralism when he briefly claims that Bayesian statistics, exploratory data analysis, robust statistics, and resampling methods are all very important methodological resources and deserve a prominent place in the researcher’s methodological armoury. However, if this is the case, then the new statistics cannot be the primary form of data analysis, as Cumming wants to maintain. Psychology greatly needs to expand its tool box of research methods and resist the totalizing impulses that afflict many traditional, and “new”, statisticians (Gigerenzer & Marewski, 2015).

6. Finally, and briefly, I agree with you that it is very important to understand that Fisher and Neyman and Pearson can be combined in coherent fashion, as with the ES. Psychology is yet to appreciate this fact. You note that I used the label ‘NHST’ to refer to the inchoate hybrid that afflicts our textbook understanding of significance testing. This is how most psychologists understand the label (and I was writing for them), but you may well be right that this usage will make it more difficult to see that a coherent Fisher-Neyman-Pearson amalgam is possible. What we call something shouldn’t matter, but if it does in this case, that would be unfortunate.

Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York: Routledge.
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25, 7-29.
Garcia-Perez, M. A., & Alcala-Quintana (2016). The interpretation of scholars’ interpretations of confidence intervals: Criticism, replication, and extension of Hoekstra et al. (2014). Frontiers in Psychology, 7:1042. doi: 10.3389/fpsyg.2016.01042
Gigerenzer, G., & Marewski, J. N. (2015). Surrogate science: The idol of a universal method for scientific inference. Journal of Management, 41, 421-440.
Hoekstra, R., et al. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21, 1157-1164.
Morey, R. D., et al. (2014). Why hypothesis tests are essential for psychological science: A comment on Cumming (2014). Psychological Science, 25, 1289-1290.
Wolpert, L. (1992). The unnatural nature of science. Cambridge, MA: Harvard University Press.


December 15, 2016

Mayo

Brian:
I think I touched on all these points in my comment; I’ll check again later to see if I missed anything. ~~There’s nothing “new” about Cumming’s CIs~~–I take it back, the synthesis and the computerized accompaniments in his book are innovative. I was referring to the CI approach itself, and the newer work on confidence distributions. I find it very odd that he rejects, or appears to reject, N-P tests, which are dual to CIs and were developed by the same man. What would be really new is if Cumming* remedied some of the inadequacies of his use of CIs: dichotomous, with a single confidence level, all values in the interval deemed “plausible,”* unable to avoid fallacies of rejection/non-rejection in the tests which are dual to his CIs, and with a behavioristic, long-run coverage probability interpretation. As for meta-analysis, I don’t know. Schmidt said they would finally turn psych into a replicable science around 20 years ago. Have they? Ioannidis has a strong criticism of their use/misuse in medicine. What do you think of them?
By the way, without tests, there’s no checking of assumptions of statistical models. Thanks for the references, I will check out the only one I haven’t read.

*Dec. 16 qualification, as I was going too fast. First, I’m not knocking meta-analysis at all, I just don’t see that it’s transformed psych as hoped; I didn’t see them used in the studies taken up in the psych replication project–were they? Second, Cumming gives an informal, visual assessment of the likelihoods of different values within the interval (those closest to the data being higher), but he doesn’t advance this as part of the stat inference–not that he should. These likelihood measures don’t block fallacies of tests. For specifics see my posts on CIs linked above. It’s the difference between testing and likelihood reasoning: points within a CI are values not rejectable at the given confidence level. That doesn’t mean there’s evidence FOR them. I also think several different confidence levels should be used in interpreting intervals. Constructive corrections are welcome.
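For readers who want to see the duality and the "several confidence levels" suggestion in numbers, here is a minimal Python sketch. It assumes a Normal mean with known sigma and one-sided lower confidence bounds; the function names and the example values are hypothetical and serve only as illustration.

```python
from math import sqrt
from statistics import NormalDist


def lower_bounds(xbar, sigma, n, levels=(0.5, 0.8, 0.9, 0.95, 0.99)):
    """One-sided lower confidence bounds for mu at several confidence levels."""
    se = sigma / sqrt(n)
    return {level: xbar - NormalDist().inv_cdf(level) * se for level in levels}


def not_rejected(mu0, xbar, sigma, n, level):
    """Duality: mu0 lies above the lower bound at this level exactly when the
    one-sided test of mu = mu0 (against mu > mu0) fails to reject at 1 - level."""
    se = sigma / sqrt(n)
    t_obs = (xbar - mu0) / se
    return t_obs <= NormalDist().inv_cdf(level)


if __name__ == "__main__":
    for level, bound in lower_bounds(xbar=0.2, sigma=1.0, n=100).items():
        print(f"{level:.2f} lower bound: mu > {bound:.3f}")
```

A value such as mu0 = 0.1 sits inside the 0.95 interval here, which only says it was not rejected at that level; reporting the bounds at several levels shows how the set of non-rejected values shifts as the level changes.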


December 15, 2016

Mayo

Here’s a comment on the Morey et al. comment on Cumming. I agree that we need tests as well as estimation, but I disagree with their criticism of significance tests.
In the Morey et al. paper, it is said “NHST should be avoided. NHST, like estimation, fails to consider predictions given that the null hypothesis is false and thus also cannot provide support for theory.”
Does this make sense? NHST can be referring to an illicit animal that Fisher debunked (where a single small p-value is taken as evidence for any substantive theory that would entail or render probable the small p-value), or it can be a simple significance test with only directional alternatives. I assume it is the latter, because the former is too silly for these authors to be discussing.
I don’t know what theory is being referred to in the phrase: “and thus also cannot provide support for theory”? But consider the first part: Is it true that significance testing doesn’t consider predictions given that the null is false? No it is not.

Let the null Ho be a typical null hypothesis, say that mean mu = 0 (against a one sided alternative, say that mu exceeds 0). For Ho to be false is for there to be a discrepancy g from 0. We most certainly can compute Pr(T > To; mu = g), for test statistic T. Such a computation would = the power of the test against alternative mu = g, if the observed To were taken as grounds to reject the null. Cohen, a psychologist, urged power on researchers long ago, but psych researchers seem to forget. But never mind power, Fisherians like Cox do the computation with the p-value viewed as a statistic, which is the same as using the test statistic.

If Pr(T > To; mu = g) is high, then there’s poor evidence for mu > g. If it’s low, then there’s evidence of mu > g.
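As a rough numerical illustration of that computation, here is a minimal sketch assuming T is the mean of n independent Normal(mu, sigma^2) observations with sigma known. The observed value To = 0.4, sigma = 1, n = 25, and the function name are hypothetical choices, not figures from any study discussed here.

```python
from math import sqrt
from statistics import NormalDist


def prob_exceed(t_obs, g, sigma=1.0, n=25):
    """Pr(T > t_obs; mu = g), with T the sample mean of n Normal(mu, sigma^2) draws."""
    standard_error = sigma / sqrt(n)
    return 1.0 - NormalDist(mu=g, sigma=standard_error).cdf(t_obs)


# Hypothetical observed sample mean, for illustration only.
t_obs = 0.4

for g in (0.0, 0.2, 0.4, 0.6, 0.8):
    print(f"g = {g:.1f}: Pr(T > To; mu = g) = {prob_exceed(t_obs, g):.3f}")
```

At g = 0 the quantity is just the one-sided p-value. Under these assumptions it is low for small g (evidence that mu > g) and rises toward 1 as g moves past the observed value (poor evidence that mu > g).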

The odd thing is that the authors are concerned about being able to falsify a claim H before being entitled to have inference for H. That’s fine, but the method they prefer to significance testing, Bayes factors, has no way to falsify at all! At most we get a comparison of posteriors of 2 hypotheses, and by suitable choice of the hypotheses (and the prior) results vary widely (without error probability guarantees). Calling the priors “default” does not make them uninformative. By choosing the alternative far away from the null, for example, or with a suitable adjustment of the prior to the alternative, the null is supported. Other assignments lead to support for the alternative. These aren’t tests. Certainly not good ones.
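The sensitivity to the choice of alternative can be seen in a deliberately simplified sketch: a Bayes factor comparing a point null mu = 0 with a point alternative mu = mu1, for a Normal mean with known sigma. This is not the default-prior Bayes factor that Morey et al. advocate; the data summary (xbar = 0.4, sigma = 1, n = 25) and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def bf01_point(xbar, mu1, sigma=1.0, n=25):
    """Bayes factor for mu = 0 over mu = mu1 (both point hypotheses), given xbar."""
    standard_error = sigma / sqrt(n)
    likelihood_null = NormalDist(mu=0.0, sigma=standard_error).pdf(xbar)
    likelihood_alt = NormalDist(mu=mu1, sigma=standard_error).pdf(xbar)
    return likelihood_null / likelihood_alt


# Hypothetical observed sample mean; its one-sided p-value against mu = 0 is about 0.023.
xbar = 0.4

for mu1 in (0.2, 0.4, 1.0, 2.0):
    bf = bf01_point(xbar, mu1)
    favored = "null" if bf > 1 else "alternative"
    print(f"mu1 = {mu1:.1f}: BF01 = {bf:.3g} (favors the {favored})")
```

With the same data, an alternative near the observed mean is favored over the null, while an alternative placed far from it makes the null look well supported.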

Reply

December 15, 2016

Mayo

Morey responded on Twitter:

@learnfromerror Thanks. I would say some things differently if I were writing that today.

— Richard D. Morey (@richarddmorey) December 15, 2016

In Haig’s first comment he mentions that he’s heard someone claim that if error statistics is a “philosophy” then it can’t be relevant to practice. That’s rather hysterically funny. It’s tantamount to the practitioner saying, if I’m being required to understand the concepts and methods that I’m using and criticizing, then that must be irrelevant to practice!

Reply

December 20, 2016

Brian Haig

Deborah:

You ask whether or not Frank Schmidt’s belief, that the use of meta-analysis would turn psychology into a better (i.e., more replicable) science, has been justified. That’s a tough question.

1. Schmidt sets considerable store by the value of meta-analyses in science. For him, they assume greater importance than individual primary studies because they provide the empirical generalizations that motivate the construction of explanatory theories. Moreover, he thinks that meta-analyses are necessary for cumulative science. I’ll come back to Schmidt and meta-analysis in a moment.

2. You ask further whether meta-analysis has transformed psychology as hoped. Well, in one sense it has: Its rapid arrival has been likened to the big bang in cosmology (Shadish & Lecy, 2015). Empirical psychology looks different now, with thousands of meta-analyses populating the research landscape and purportedly giving us trustworthy knowledge about an array of empirical relationships. However, in a real sense, meta-analysis has not been a success story. There is now mounting evidence that the databases of primary studies are often untrustworthy as a result of numerous questionable research practices. There is also evidence that some meta-analyses themselves are not replicable (Hubbard, 2016). Whether meta-analyses in psychology are as bad as the picture Ioannidis (2016) paints of medical science is hard to say. As far as I’m aware, psychology has not undertaken a study as large in scale as Ioannidis’. So, it’s hard to make a big picture assessment. It might well be the case that primary studies in medicine are, on the whole, better than those in psychology, which doesn’t augur well for the quality of psychology’s meta-analyses. However, psychology (and education) have played a major role in the impressive development of meta-analytic methodology (think Glass, Schmidt, Rosenthal, and Hedges, etc.), so the potential for doing better quality meta-analytic research is there.

3. Schmidt (2016) has an interesting take on the so-called replication crisis. He believes that the obsession with replication is a red herring that draws our attention away from the justification of knowledge in the social and behavioral sciences. He maintains that meta-analyses provide conceptual, or constructive, replications, so there isn’t a paucity of replications in psychology. For him, the real problem is questionable research practices, such as selectively reporting p values, HARKing, increasing N until significance is reached, etc.; publication biases are also a major problem. I agree with Schmidt that meta-analyses can provide conceptual replications, but I fault him for not giving direct replications their proper due. Whereas Schmidt underplays the importance of direct replications, the “replication crisis” folk in psychology have focused almost entirely on direct replications at the expense of constructive replications. And yes, oddly enough, the Reproducibility Project: Psychology did not employ meta-analytic methods.

4. I think that the replication debates haven’t been as sure-footed as they might have been in placing replication in its proper scientific context. Contrary to what one reads in the literature: (a) replication is not directly involved in theory evaluation (recall the data/ phenomena/ theory contrasts); it is mostly concerned with phenomena detection; (b) there are other strategies besides replication for the validation of empirical generalizations (e.g., controlling for confounds, triangulation, calibration, etc.); and these two methodological facts weaken the claim that we are in the midst of a replication crisis. Instead, I believe that we have a credibility crisis, more generally, with a number of challenging problems to be solved, not a replication crisis to be averted.

Ioannidis, J. P. A. (2016). The mass production of redundant, misleading, and conflicted systematic reviews and meta-analyses. The Milbank Quarterly, 94, 485-514.
Schmidt, F. L., & Oh, I-S. (2016). The crisis of confidence in research findings in psychology: Is lack of replication the real problem? Or is it something else? Archives of Scientific Psychology, 4, 32-37.
Shadish, W. R., & Lecy, J. D. (2015). The meta-analytic big bang. Research Synthesis Methods, 6, 246-264.

Reply

December 20, 2016

Mayo

Brian: Thanks so much for your detailed comment and useful references.
You say:
“Contrary to what one reads in the literature: (a) replication is not directly involved in theory evaluation (recall the data/ phenomena/ theory contrasts); it is mostly concerned with phenomena detection; (b) there are other strategies besides replication for the validation of empirical generalizations (e.g., controlling for confounds, triangulation, calibration, etc.); and these two methodological facts weaken the claim that we are in the midst of a replication crisis.”

But isn’t a main way to ascertain one has a genuine phenomenon a matter of trying to subject the claim in question to further tests—tests which are likely to reveal the falsity of the claimed phenomenon?

Schmidt’s right that what matters most are QRPs, but I take a function of replication research in psych to be that of pointing to such problems. Admittedly, they seem reluctant to assign blame to failed replications in the Open Science replication initiatives. They tread carefully to avoid accusations of methodological terrorism. Interested readers can search this blog under replication, p-values, etc.

I don’t understand Schmidt saying on p. 33 that they find practically all nulls false by citing:
“For example, Lipsey and Wilson (1993) examined metaanalyses of over 300 psychological interventions. In only one of these interventions was the null hypothesis true (less than 1%).”
How do they find the null is true? Do they just mean that when they put together studies in a meta-analysis they very often find significance and thus few nulls are true? (They could be due to QRPs or crud factors.)

Schmidt says “it may be that research findings from laboratory experiments are the most questionable of any research areas”, but is this because he’s looking at psych experiments? And the reason these are typically experimental is that it seems the only way to render their questions open to statistical analysis at all. In other words, lacking random selection, they introduce probability by design, or try to.

Schmidt: “The fact that we have so many meta-analyses in our literatures proves that replication studies are carried out in most areas of research.”
The question is whether the vast literatures on some of the questionable effects translate into grounds to infer the reality of the effect. It’s interesting that when there are questions about things like the association between morality and cleanliness, ovulation & voting, etc., the authors can point to a huge literature. (Is this what Schmidt is referring to?) But if these are infected with cherry-picking, p-hacking and various other biasing selection effects, then these “replications” aren’t probative. Do people try to combine these in a meta-analysis?

I’m guessing that if Nosek and Co. thought trying to combine lots and lots of psych studies sufficed to show they were onto genuine effects, then that’s what they would be doing, and Kahneman would have called for a daisy chain of meta-analysis.

Anyway, I’d like to know what your particular experience is in psych regarding any of these issues. I’m especially interested in what you’ve found regarding the validity of standard measurements, e.g., of self esteem.

Reply

January 10, 2017

Brian Haig

Deborah:
I agree with you that strong tests play a major role in establishing the existence of empirical phenomena. I suspect that your query about this arose from the fact that my claim that replication is not directly involved in theory evaluation was stated too cryptically. The three-fold distinction between data, phenomena, and explanatory theory motivated my claim. I think this distinction is of major methodological importance, and is often underappreciated by methodologists and scientists. So let me say a bit more about it for the reader (I know that you know of this distinction, and use the related three-fold distinction between data, experimental, and primary models in your error-statistical philosophy). I think it is often essential to understand that data serve as evidence for phenomena (often taking the form of empirical generalizations) and that, in turn, claims about phenomena serve as evidence for theory. I said replication is not DIRECTLY involved in theory evaluation because it’s a major means for establishing the existence of phenomena, not evaluating the worth of theories. I think that, by and large, the methods we use for phenomena detection are different from those we use for theory construction. I speak about phenomena detection (and theory construction) at length in my book, ‘Investigating the Psychological World’ (2014). Jim Woodward’s paper, ‘Data and phenomena’ (Synthese, 1989, 79, 393-472) is an excellent treatment of the topic.

Reply

© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.
