Here’s a picture of ripping open the first box of (rush) copies of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*, and here’s a continuation of Brian Haig’s recent paper ‘What can psychology’s statistics reformers learn from the error-statistical perspective?’ in Methods in Psychology 2 (Nov. 2020). Haig contrasts error statistics, the “new statistics”, and Bayesian statistics from the perspective of the statistics wars in psychology. The full article, which is open access, is here. I will make several points in the comments.
4. Bayesian statistics
Despite its early presence, and prominence, in the history of statistics, the Bayesian outlook has taken an age to assert itself in psychology. However, a cadre of methodologists has recently advocated the use of Bayesian statistical methods as a superior alternative to the messy frequentist practice that dominates psychology’s research landscape (e.g., Dienes, 2011; Kruschke and Liddell, 2018; Wagenmakers, 2007). These Bayesians criticize NHST, often advocate the use of Bayes factors for hypothesis testing, and rehearse a number of other well-known Bayesian objections to frequentist statistical practice.
Of course, there are challenges for Bayesians from the error-statistical perspective, just as there are for the new statisticians. For example, Mayo shows that the frequently made claim that p values exaggerate the evidence against the null hypothesis, while Bayes factors do not, does not hold up. She also makes the important point that Bayes factors, as they are currently used, do not have the ability to probe errors and, thus, violate the requirement for severe tests. Bayesians, therefore, need to rethink whether Bayes factors can be deployed in some way to provide strong tests of hypotheses through error control. As with the new statisticians, Bayesians also need to reckon with the coherent hybrid NHST afforded by the error-statistical perspective, and argue against it, rather than the common inchoate hybrids, if they want to justify abandoning NHST. Finally, I note in passing that Bayesians should consider, among other challenges, Mayo’s critique of the controversial Likelihood Principle, a principle which ignores the post-data consideration of sampling plans.
4.1. Contrasts between the Bayesian and error-statistical perspectives
One of the major achievements of the philosophy of error-statistics is that it provides a comprehensive critical evaluation of the major variants of Bayesian statistical thinking, including the classical subjectivist, “default”, pragmatist, and eclectic options within the Bayesian corpus. Whether the adoption of Bayesian methods in psychology will overcome the disorders of current frequentist practice remains to be seen. What is clear from reading the error-statistical literature, however, is that the foundational options for Bayesians are numerous, convoluted, and potentially bewildering. It would be a worthwhile exercise to chart how these foundational options are distributed across the prominent Bayesian statisticians in psychology. For example, the increasing use of Bayes factors for hypothesis testing purposes is accompanied by disorderliness at the foundational level, just as it is in the Bayesian literature more generally. Alongside the fact that some Bayesians are sceptical of the worth of Bayes factors, we find disagreement about the comparative merits of the subjectivist and default Bayesian outlooks on Bayes factors in psychology (Wagenmakers et al., 2018).
The philosophy of error-statistics contains many challenges for Bayesians to consider. Here, I want to draw attention to three basic features of Bayesian thinking, which are rejected by the error-statistical approach. First, the error-statistical approach rejects the Bayesian insistence on characterizing the evidential relation between hypothesis and evidence in a universal and logical manner in terms of Bayes’ theorem. Instead, it formulates the relation in terms of the substantive and specific nature of the hypothesis and the evidence with regard to their origin, modeling, and analysis. This is a consequence of a strong commitment to a piecemeal, contextual approach to testing, using the most appropriate frequentist methods available for the task at hand. This contextual attitude to testing is taken up in Section 5.2, where one finds a discussion of the role different models play in structuring and decomposing inquiry.
Second, the error-statistical philosophy also rejects the classical Bayesian commitment to the subjective nature of prior probabilities, which the agent is free to choose, in favour of the more objective process of establishing error probabilities understood in frequentist terms. It also finds unsatisfactory the turn to the more popular objective, or “default”, Bayesian option, in which the agent’s appropriate degrees of belief are constrained by relevant empirical evidence. The error-statistician rejects this default option because it fails in its attempts to unify Bayesian and frequentist ways of determining probabilities.
And, third, the error-statistical outlook employs probabilities to measure how effectively methods facilitate the detection of error, and how those methods enable us to choose between alternative hypotheses. By contrast, orthodox Bayesians use probabilities to measure belief in hypotheses or degrees of confirmation. As noted earlier, most Bayesians are not concerned with error probabilities at all. It is for this reason that error-statisticians will say about Bayesian methods that, without supplementation with error probabilities, they are not capable of providing stringent tests of hypotheses.
4.2. The Bayesian remove from scientific practice
Two additional features of the Bayesian focus on beliefs, which have been noted by philosophers of science and statistics, draw attention to their outlook on science. First, Kevin Kelly and Clark Glymour worry that “Bayesian methods assign numbers to answers instead of producing answers outright.” (2004, p. 112) Their concern is that the focus on the scientist’s beliefs “screens off” the scientist’s direct engagement with the empirical and theoretical activities that are involved in the phenomenology of science. Mayo agrees that we should focus on the scientific phenomena of interest, not the associated epiphenomena of degrees of belief. This preference stems directly from the error-statistician’s conviction that probabilities properly quantify the performance of methods, not the scientist’s degrees of belief.
Second, Henry Kyburg is puzzled by the Bayesian’s desire to “replace the fabric of science… with a vastly more complicated representation in which each statement of science is accompanied by its probability, for each of us.” (1992, p.149) Kyburg’s puzzlement prompts the question, ‘Why should we be interested in each other’s probabilities?’ This is a question raised by David Cox about prior probabilities, and noted by Mayo (2018).
This Bayesian remove from science contrasts with the willingness of the error-statistical perspective to engage more directly with science. Mayo is a philosopher of science as well as statistics, and has a keen eye for scientific practice. Given that contemporary philosophers of science tend to take scientific practice seriously, it comes as no surprise that she brings it to the fore when dealing with statistical concepts and issues. Indeed, her error-statistical philosophy should be seen as a significant contribution to the so-called new experimentalism, with its strong focus, not just on experimental practice in science, but also on the role of statistics in such practice. Her discussion of the place of frequentist statistics in the discovery of the Higgs boson in particle physics is an instructive case in point.
Taken together, these just-mentioned points of difference between the Bayesian and error-statistical philosophies constitute a major challenge to Bayesian thinking that methodologists, statisticians, and researchers in psychology need to confront.
4.3. Bayesian statistics with error-statistical foundations
One important modern variant of Bayesian thinking, which now receives attention within the error-statistical framework, is the falsificationist Bayesianism of Andrew Gelman, which received its major formulation in Gelman and Shalizi (2013). Interestingly, Gelman regards his Bayesian philosophy as essentially error-statistical in nature – an intriguing claim, given the anti-Bayesian preferences of both Mayo and Gelman’s co-author, Cosma Shalizi. Gelman’s philosophy of Bayesian statistics is also significantly influenced by Popper’s view that scientific propositions are to be submitted to repeated criticism in the form of strong empirical tests. For Gelman, best Bayesian statistical practice involves formulating models using Bayesian statistical methods, and then checking them through hypothetico-deductive attempts to falsify and modify those models.
Both the error-statistical and neo-Popperian Bayesian philosophies of statistics extend and modify Popper’s conception of the hypothetico-deductive method, while at the same time offering alternatives to received views of statistical inference. The error-statistical philosophy injects into the hypothetico-deductive method an account of statistical induction that employs a panoply of frequentist statistical methods to detect and control for errors. For its part, Gelman’s Bayesian alternative involves formulating models using Bayesian statistical methods, and then checking them through attempts to falsify and modify those models. This clearly differs from the received philosophy of Bayesian statistical modeling, which is regarded as a formal inductive process.
From the wide-ranging error-statistical evaluation of the major varieties of Bayesian statistical thought on offer, Mayo concludes that Bayesian statistics needs new foundations: in short, those provided by her error-statistical perspective. Gelman acknowledges that his falsificationist Bayesian philosophy is underdeveloped, so it will be interesting to learn how its further development relates to Mayo’s error-statistical perspective. It will also be interesting to see if Bayesian thinkers in psychology engage with Gelman’s brand of Bayesian thinking. Despite the appearance of his work in a prominent psychology journal, they have yet to do so. However, Borsboom and Haig (2013) and Haig (2018) provide sympathetic critical evaluations of Gelman’s philosophy of statistics.
It is notable that in her treatment of Gelman’s philosophy, Mayo emphasizes that she is willing to allow a decoupling of statistical outlooks and their traditional philosophical foundations in favour of different foundations, which are judged more appropriate. It is an important achievement of Mayo’s work that she has been able to consider the current statistics wars without taking a particular side in the debates. She achieves this by examining methods, both Bayesian and frequentist, in terms of whether they violate her minimal severity requirement of “bad evidence, no test”.
I invite your comments and questions.
*This picture was taken by Diana Gillooly, Senior Editor for Mathematical Sciences, Cambridge University Press, at the book display for the Sept. 2018 meeting of the Royal Statistical Society in Cardiff. She also had the honor of doing the ripping. A blogpost on the session I was in is here.
I find it very valuable to reflect on these issues from the constructive perspective that Brian Haig offers, so I’m very thankful to him for setting it out.
I try to emphasize the large “gallimaufry” of Bayesian construals (SIST, p. 402) as well as the important differences between using Bayesian conceptions in philosophy (both in philosophy of science and epistemology) as opposed to applied statistics. Haig notes “the error-statistician’s conviction that probabilities properly quantify the performance of methods, not the scientist’s degrees of belief”. Two points (for the reader): (1) I try to develop a “third way” wherein error probabilities of methods can be used to assess how well or severely tested claims are. This gives them a proper “evidential” or “epistemic” function they are often thought to lack. (2) The severe tester, whom I consider to belong to a proper subset of error statisticians, holds that we’re not always in a context of wanting to find something out; in fact, she considers that most of the time we are more interested in making our case.
“The goal of highly well-tested claims differs sufficiently from highly probable ones that you can have your cake and eat it too: retaining both for different contexts.” (SIST xii)
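Point (1) can be made concrete with a minimal severity calculation, in the spirit of SIST’s one-sided Normal testing examples. This is an illustrative sketch only: the numbers (n = 100, σ = 10, observed mean 152, null value 150) are assumed for the example, not taken from the paper. Severity asks: were the claim μ > μ1 false, how probable is it that the test would have produced a result as small as, or smaller than, the one observed?

```python
import math

def normal_cdf(x):
    """Standard Normal CDF via the complementary error function (stdlib only)."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def severity(xbar, mu1, sigma, n):
    """Severity for the claim mu > mu1, given observed sample mean xbar in a
    one-sided test of H0: mu <= mu0 vs H1: mu > mu0 (Normal data, sigma known).
    SEV(mu > mu1) = P(Xbar <= xbar; mu = mu1)."""
    return normal_cdf((xbar - mu1) / (sigma / math.sqrt(n)))

# Assumed illustrative numbers: n = 100, sigma = 10, xbar = 152, mu0 = 150.
# The same outcome warrants "mu > 150" severely, "mu > 153" hardly at all.
for mu1 in (150, 151, 152, 153):
    print(f"SEV(mu > {mu1}) = {severity(152, mu1, 10, 100):.3f}")
```

The point of the sketch is that one set of error probabilities grounds graded assessments of *which* discrepancies from the null are and are not well indicated, which is the “evidential” function claimed for them in (1).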
To be continued….
You write: “As noted earlier, most Bayesians are not concerned with error probabilities at all. It is for this reason that error-statisticians will say about Bayesian methods that, without supplementation with error probabilities, they are not capable of providing stringent tests of hypotheses.”
That does seem to be true for the Bayesians you mention at the start of this section–Dienes, Kruschke, Wagenmakers. The irony is that this prevents them from directly upbraiding researchers who exploit researcher flexibilities, outcome-switching, finding your hypothesis in the data, and other biasing selection effects. That’s because the problem with those gambits is that they preclude error probability control. I give examples in SIST. The very justification for preregistration is controlling and assessing error probabilities.
However, some Bayesians, e.g., Jim Berger, redefine error probabilities to refer to one or another variation of a posterior probability of hypotheses. This is a big reason that disputants in the stat wars often talk past each other. For this reason, SIST distinguishes error probability₁ and error probability₂:
Thanks for your comment. Starting from your latter points:
When you say a sensible Bayesian analysis is a good way of communicating your findings to non-statisticians, do you mean that you convert error probabilities to a posterior probability that feels about right? Or that you do a full blown Bayesian analysis, choosing a default prior that seems not to give the overly rosy construal you find with P-values?
You say that “the conventional frequentist p-value gives a much too rosy impression”. Do you mean that people will assume the P-value is the posterior probability for the null hypothesis and this will be too small? P-values may be thought to “exaggerate” the effect, assuming a measure that is at odds with the frequentist measures (e.g., a Bayes factor with a point prior). This is the main focus of Excursion 4 Tour II of SIST. I think Fauci and Gilead might have thought the attained P-value in the recent remdesivir trial wasn’t rosy enough. Fauci announced that remdesivir didn’t “yet” show a statistically significant improvement in mortality. But that trial was stopped after his assessment. What were your thoughts on that? He might have just reported the confidence interval corresponding to the P=.058.
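The “exaggeration” dispute discussed in Excursion 4 Tour II can be made concrete with a stdlib-Python sketch (my illustration, not part of the exchange) of the standard Normal case: hold the test statistic fixed just at the p ≈ 0.05 boundary, put a point prior on H0: μ = 0, and a N(0, τ²) prior on μ under H1. The prior scale τ = 1 and the sample sizes are assumptions of the example. As n grows, the Bayes factor increasingly favours the null even though p stays at 0.05: the two measures answer different questions rather than one “exaggerating” the other.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic: P(|Z| >= |z|) under H0."""
    return math.erfc(abs(z) / math.sqrt(2))

def bf01_point_null(z, n, tau=1.0):
    """Bayes factor in favour of H0: mu = 0 against H1: mu ~ N(0, tau^2),
    with xbar ~ N(mu, 1/n) (sigma = 1 known) and z = sqrt(n) * xbar.
    Closed form: ratio of the two marginal densities of xbar."""
    k = n * tau**2
    return math.sqrt(1 + k) * math.exp(-(z**2 / 2) * k / (1 + k))

z = 1.96  # fixed just at the p ~ 0.05 boundary
for n in (10, 100, 10_000):
    print(f"n={n:>6}: p = {two_sided_p(z):.3f}, BF01 = {bf01_point_null(z, n):.2f}")
```

This is the familiar Jeffreys–Lindley pattern: whether the result looks “too rosy” depends on adopting the point-null-prior measure, which is exactly the measure at odds with the frequentist one.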
On comparing two binomial probabilities, I see now that you’re talking about the Raoult trial which I know nothing about. I heard a little about it months ago, but as you say it wasn’t an RCT. I will check your links. Clearly you’ve written on this and I’ve always known you to have wise assessments, so what do you say? I invite readers to weigh in on this.
Here’s a thorough article just out on that Raoult character: https://www.nytimes.com/2020/05/12/magazine/didier-raoult-hydroxychloroquine.html
Didier Raoult is a colourful character, a charismatic person, and he’s strongly against “political correctness”. He’s against *models* and the power of medical statisticians. The work in question has lead author Prof. Philippe Gautret, who himself is a medical doctor and professor of virology and leader of a research team in Raoult’s institute. The team does not include any statistician. They are aided in statistical issues by a young Vietnamese medical doctor, who is a PhD student in the group, and who calls himself an epidemiologist. The Gautret et al. publication is one of the few which compares HCQ treatment to the standard alternative, and their data consists of the very first ca. 40 Corona patients in Marseilles and Nice. Just recently they have published about their now 3000 patients, who nearly all have been treated with HCQ.
On my website (home page at Leiden University) I report (with links to RPubs web pages) some simple frequentist and Bayesian analyses of the Gautret et al. data, published back in March, of ca. 40 patients, comparing HCQ versus “standard” treatment for early (possible) Corona infection. I plan to soon add logistic regressions, both frequentist and Bayesian, and model search results (lasso etc.). This, and another amazingly similar Dutch data set, are n << p problems. And n is just too small for the modern machinery of applied statistics to work (lasso and logistic regression), though I have some ideas of how to circumvent this, now that we know more about Covid-19 than back in March. In particular, I used Wagenmakers' Bayesian alternative to SPSS, JASP, to do Bayesian equivalents of Fisher's exact test in a 2×2 table, and discovered what I think is a serious conceptual inconsistency in JASP. They do testing and then estimation conditional on test results! If you are a true Bayesian this makes little sense. I did find that a standard Bayesian analysis of a 2×2 table gave nice and very reasonable posterior probabilities, starting from a pretty reasonable "slab and spike" not-very-informative prior, which said clearly that the results supported "further investigation into this treatment", while the Fisherian p-value says, "this is extraordinarily significant".
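For readers who want to see the two kinds of 2×2 analysis side by side, here is a minimal stdlib-Python sketch. The counts are hypothetical, NOT the Gautret et al. data, and the flat Beta(1, 1) priors are a simplification standing in for the spike-and-slab prior mentioned above; it pairs Fisher's exact test with a Monte Carlo posterior probability that the first group's success rate exceeds the second's.

```python
import random
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of all tables (margins fixed)
    no more probable than the observed one."""
    n, r1, c1 = a + b + c + d, a + b, a + c
    def p(x):  # P(first cell = x) under the null, conditioning on margins
        return comb(c1, x) * comb(n - c1, r1 - x) / comb(n, r1)
    p_obs = p(a)
    lo, hi = max(0, r1 + c1 - n), min(r1, c1)
    return sum(p(x) for x in range(lo, hi + 1) if p(x) <= p_obs * (1 + 1e-9))

def posterior_prob_p1_gt_p2(a, b, c, d, draws=50_000, seed=1):
    """Monte Carlo estimate of P(p1 > p2 | data) with independent flat
    Beta(1, 1) priors on the two groups' success probabilities."""
    rng = random.Random(seed)
    hits = sum(
        rng.betavariate(a + 1, b + 1) > rng.betavariate(c + 1, d + 1)
        for _ in range(draws)
    )
    return hits / draws

# Hypothetical counts (not the Gautret data): 12/20 vs 4/20 "cured".
a, b, c, d = 12, 8, 4, 16
print("Fisher two-sided p:", round(fisher_exact_two_sided(a, b, c, d), 4))
print("P(p1 > p2 | data):", round(posterior_prob_p1_gt_p2(a, b, c, d), 3))
```

Note that the two numbers are not comparable quantities: the p-value is an error probability of the test procedure, while the posterior probability is a degree-of-belief measure under the stated priors, which is precisely the contrast running through this thread.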
On another topic I recently had to look at the Wikipedia page on p-values. Total disaster. I started editing it. Maybe other sensible people – readers of this blog – will weigh in and help improve it further. You don't have to be for or against p-values. It's a question of just telling the honest truth. Which is a difficult story to tell, but alas, that's how it is.