Tour guides in your travels jot down Mementos and Keepsakes from each Tour[i] of my new book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP 2018). Their scribblings, which may at times include details, at other times just a word or two, may be modified through the Tour, and in response to questions from travelers (so please check back). Since these are just mementos, they should not be seen as replacements for the more careful notions given in the journey (i.e., book) itself. Still, you’re apt to flesh out your notes in greater detail, so please share yours (along with errors you’re bound to spot), and we’ll create Meta-Mementos.
Excursion 1. Tour I: Beyond Probabilism and Performance
Notes from Section 1.1 Severity Requirement: Bad Evidence, No Test (BENT)
1.1 Terms (quick looks, to be crystalized as we journey on)
- epistemology: The general area of philosophy that deals with knowledge, evidence, inference, and rationality.
- severity requirement. In its weakest form it supplies a minimal requirement for evidence:
severity requirement (weak): One does not have evidence for a claim if little if anything has been done to rule out ways the claim may be false. If data x agree with a claim C but the method used is practically guaranteed to find such agreement, and had little or no capability of finding flaws with C even if they exist, then we have bad evidence, no test (BENT).
- error probabilities of a method: probabilities it leads or would lead to erroneous interpretations of data. (We will formalize this as we proceed.)
error statistical account: one that revolves around the control and assessment of a method’s error probabilities. An inference is qualified by the error probability of the method that led to it.
(This replaces common uses of “frequentist” which actually has many other connotations.)
error statistician: one who uses error statistical methods.
severe testers: a proper subset of error statisticians: those who use error probabilities to assess and control severity. (They may use them for other purposes as well.)
The severe tester also requires reporting what has been poorly probed and inseverely tested.
Error probabilities can, but don’t necessarily, provide assessments of the capability of methods to reveal or avoid mistaken interpretations of data. When they do, they may be used to assess how severely a claim passes a test.
- methodology and meta-methodology: Methods we use to study statistical methods may be called our meta-methodology – it’s one level removed.
We can keep to testing language as part of the meta-language we use to talk about formal statistical methods, where the latter include estimation, exploration, prediction, and data analysis.
There’s a difference between finding H poorly tested by data x, and finding x renders H improbable – in any of the many senses the latter takes on.
H: Isaac knows calculus.
x: results of a coin flipping experiment
Even taking H to be true, data x has done nothing to probe the ways in which H might be false.
5. R.A. Fisher, against isolated statistically significant results (p.4).
[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher 1935b/1947, p. 14)
Notes from section 1.2 of SIST: How to get beyond the stat wars
6. statistical philosophy (associated with a statistical methodology): core ideas that direct its principles, methods, and interpretations.
two main philosophies about the roles of probability in statistical inference: performance (in the long run) and probabilism.
(i) performance: probability functions to control and assess the relative frequency of erroneous inferences in some long run of applications of the method
(ii) probabilism: probability functions to assign degrees of belief, support, or plausibility to hypotheses. They may be non-comparative (a posterior probability) or comparative (a likelihood ratio or Bayes Factor)
Severe testing introduces a third:
(iii) probativism: probability functions to assess and control a method’s capability of detecting mistaken inferences, i.e., the severity associated with inferences.
• Performance is a necessary but not a sufficient condition for probativeness.
• Just because an account is touted as having a long-run rationale, it does not mean it lacks a short run rationale, or even one relevant for the particular case at hand.
7. Severity strong (argument from coincidence):
We have evidence for a claim C just to the extent it survives a stringent scrutiny. If C passes a test that was highly capable of finding flaws or discrepancies from C, and yet no or few are found, then the passing result, x, is evidence for C.
lift-off vs drag down
(i) lift-off : an overall inference can be more reliable and precise than its premises individually.
(ii) drag-down: An overall inference is only as reliable/precise as is its weakest premise.
• Lift-off is associated with convergent arguments, drag-down with linked arguments.
• statistics is the science par excellence for demonstrating lift-off!
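The contrast between lift-off and drag-down can be illustrated with a small calculation. This is only a sketch under assumptions that are ours, not the book's: each piece of evidence is treated as an independent indicator that is right with probability p, a convergent argument is modeled as a majority verdict, and a linked argument as a chain in which every premise must hold. The function names are hypothetical.

```python
from math import comb, prod
from typing import Sequence


def majority_correct(n: int, p: float) -> float:
    """Chance that more than half of n independent indicators,
    each right with probability p, are right."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


def chain_holds(reliabilities: Sequence[float]) -> float:
    """Chance that every link in a chain of independent premises holds."""
    return prod(reliabilities)


p = 0.7  # assumed reliability of each individual premise (illustrative)
convergent = majority_correct(5, p)  # five independent lines of evidence
linked = chain_holds([p] * 5)        # five premises that must all hold

print(f"single premise: {p:.3f}")
print(f"convergent (majority of 5): {convergent:.3f}")
print(f"linked (all 5 must hold): {linked:.3f}")
```

Under these assumptions the convergent verdict is more reliable than any single premise, while the linked chain is no more reliable than its weakest link (here it is considerably less so).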
8. arguing from error: there is evidence an error is absent to the extent that a procedure with a high capability of signaling the error, if and only if it is present, nevertheless detects no error.
Bernoulli (coin tossing) model: we record success or failure, assume a fixed probability of success θ on each trial, and that trials are independent. (P-value in the case of the Lady Tasting Tea, pp. 16-17).
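A minimal sketch of how a P-value arises under this model, assuming a null hypothesis of pure guessing (θ = 0.5) and a hypothetical outcome of 9 correct identifications in 10 trials. The counts are illustrative and not taken from the book; the function name is our own.

```python
from math import comb


def binomial_upper_tail(n: int, k: int, theta: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, theta): the chance of k or more
    successes in n independent trials with success probability theta."""
    return sum(comb(n, j) * theta**j * (1 - theta) ** (n - j)
               for j in range(k, n + 1))


# Hypothetical data (an assumption for illustration): 9 correct out of 10.
n_trials, n_correct = 10, 9
p_value = binomial_upper_tail(n_trials, n_correct, theta=0.5)
print(f"P(X >= {n_correct} | n = {n_trials}, theta = 0.5) = {p_value:.4f}")
```

The P-value here is the probability, computed under guessing, of doing at least as well as observed. It is only the computed value; whether it is also the actual one depends on how the data and hypothesis were generated.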
Error probabilities can be readily invalidated due to how the data (and hypotheses!) are generated or selected for testing.
9. computed (or nominal) vs actual error probabilities: You may claim it’s very difficult to get such an impressive result due to chance, when in fact it’s very easy to do so, with selective reporting (e.g., your computed P-value can be small, but the actual P-value is high.)
Examples: Peirce and Dr. Playfair (a law is inferred even though half of the cases required Playfair to modify the formula after the fact); Texas marksman (shooting prowess inferred from shooting bullets into the side of a barn, and painting a bull’s eye around clusters of bullet holes); Pickrite stock portfolio (Pickrite’s effectiveness at stock picking is inferred based on selecting those on which the “method” did best).
• We appeal to the same statistical reasoning to show the problematic cases as to show genuine arguments from coincidence.
• A key role for statistical inference is to identify ways to spot egregious deceptions and create strong arguments from coincidence.
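The gap between computed and actual error probabilities can be sketched numerically. The assumptions here are ours: k independent tests are run, every null hypothesis is true, each P-value is uniformly distributed under its null, and only the most impressive result is reported. The function name is hypothetical.

```python
def prob_at_least_one_significant(alpha: float, k: int) -> float:
    """Chance that at least one of k independent tests of true nulls
    yields a P-value at or below alpha."""
    return 1 - (1 - alpha) ** k


alpha = 0.05  # nominal (computed) level attached to the one reported result
for k in (1, 5, 20):
    actual = prob_at_least_one_significant(alpha, k)
    print(f"k = {k:2d} tries: nominal {alpha:.2f}, actual {actual:.3f}")
```

With a single pre-specified test the nominal and actual values coincide. With 20 tries and selective reporting, the chance of finding some nominally significant result by chance alone is well over one half under these assumptions, even though the reported P-value still reads 0.05 or less.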
10. Auditing a P-value (one part): checking whether the results are due to selective reporting, cherry picking, trying and trying again, or any number of other similar ruses.
• Replicability isn’t enough. Example: observational studies on Hormone Replacement Therapy (HRT) reproducibly showed benefits, but had little capacity to unearth biases due to “the healthy women’s syndrome.”
Souvenir A.[ii] Postcard to Send: the 4 fallacies from the opening of 1.1.
• We should oust mechanical, recipe-like uses of statistical methods, long lampooned.
• But simple significance tests have their uses, and shouldn’t be ousted simply because some people are liable to violate Fisher’s warnings.
• They have the means by which to register formally the fallacies in the postcard list. (Failed statistical assumptions and selection effects alter a test’s error-probing capacities.)
• Don’t throw out the error control baby with the bad statistics bathwater.
10. severity requirement (weak): If data x agree with a claim C but the method was practically incapable of finding flaws with C even if they exist, then x is poor evidence for C.
severity (strong): If C passes a test that was highly capable of finding flaws or discrepancies from C, and yet no or few are found, then the passing result, x, is an indication of, or evidence for, C.
Notes from Section 1.3: The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon
The Bayesian versus frequentist dispute parallels disputes between probabilism and performance.
-Using Bayes’ Theorem doesn’t make you a Bayesian.
-subjective Bayesianism and non-subjective (default) Bayesianism
11. Advocates of unifications are keen to show that (i) default Bayesian methods have good performance in a long series of repetitions – so probabilism may yield performance; or alternatively, (ii) frequentist quantities are similar to Bayesian ones (at least in certain cases) – so performance may yield probabilist numbers. Why is this not bliss? Why are so many from all sides dissatisfied?
It had long been assumed that only subjective or personalistic Bayesianism had a shot at providing genuine philosophical foundations, but some Bayesians have come to question whether the widespread use of methods under the Bayesian umbrella, however useful, indicates support for subjective Bayesianism as a foundation.
Marriages of Convenience? The current frequentist–Bayesian unifications are often marriages of convenience:
-some are concerned that methodological conflicts are bad for the profession.
-frequentist tribes have not disappeared; scientists still call for error control.
-Frequentists’ incentive to marry: Lacking a suitable epistemic interpretation of error probabilities – significance levels, power, and confidence levels – frequentists are constantly put on the defensive.
Eclecticism and Ecumenism. Current-day eclecticisms have a long history – the dabbling in tools from competing statistical tribes has not been thought to pose serious challenges.
Decoupling. On the horizon is the idea that statistical methods may be decoupled from the philosophies in which they are traditionally couched (e.g., Gelman and Shalizi 2013). The concept of severe testing is sufficiently general to apply to any of the methods now in use.
Why Our Journey? To disentangle the jungle. Being hesitant to reopen wounds from old battles does not heal them. They show up in the current problems of scientific integrity, irreproducibility, questionable research practices, and in the swirl of methodological reforms and guidelines that spin their way down from journals and reports.
How it occurs: the new stat scrutiny (arising from failures of replication) collects from:
-the earlier social science “significance test controversy”
-the traditional frequentist and Bayesian accounts, and corresponding frequentist-Bayesian wars
-the newer Bayesian–frequentist unifications (non-subjective, default Bayesianism)
This jungle has never been disentangled.
Your Tour Guide invites your questions in the comments.
[i] As these are scribbled down in notebooks through ocean winds, wetlands and insects, do not expect neatness. Please share improvements and corrections.
[ii] For a free copy of “Statistical Inference as Severe Testing”, send me your conception of Souvenir A, your real souvenir A, or a picture of your real Souvenir A (through Nov 16, 2018).
I have shared 50 pages of my book: Excursion 1 Tour I (1.1-1.3), pp. 3-29; Excursion 2, 2.1, pp. 59-66; 2.3, pp. 75-89.
more from Lakens on Excur 1 Tour I
Notes from Richard Morey
Severity and statistical evidence
Notes on Tour 1 of “Statistical Inference as Severe Testing”
Anyone who’s had any contact with statistical methods recently knows that there’s a battle being fought over the future of statistical methods. Actually, more than one; the big ones are significance testing vs confidence intervals and Bayes vs frequentism. The so-called “replication crisis” in the various sciences has provided an opportunity for people to advocate various solutions to the issues that plague statistical practice. These issues are real, and stakes high: bad choices could mean another 40 years wandering in the desert of bad methodology, as opposed to cleaning up some of the mess in various fields.
I was happy to snag the very last copy of Mayo’s “Statistical Inference as Severe Testing: How to get beyond the statistics wars” at the recent Royal Statistical Society conference in Cardiff. Mayo’s goal, telegraphed in the subtitle, is ambitious: Can we really get beyond the wars that have been raging for decades? Particularly at a time when the opportunities for the various actors to shape the future are so great?
But now is the most critical time to get beyond these “wars”. What we need is a discussion at a level above the nuts and bolts of statistical theory: we need philosophy. Only with an expansive view of the landscape of statistical inference in science can we be sure that we don’t harm science while trying to save it. This is the departure point for Mayo’s book.
I am currently reading the text, and I will try to blog with some notes as I finish sections of it. This post is about the first “Tour”, Beyond probabilism and performance.
Beyond Probabilism and Performance
The title of the section is immediately refreshing. Those familiar with Mayo’s work will know she advocates a frequentist perspective on statistical inference; those familiar with my work will know I generally advocate a Bayesian perspective. “Probabilism” refers to one way of understanding Bayesian inference, and “Performance” refers to one way of understanding frequentist inference. Bayesians are traditionally skeptical of a focus on performance, and Mayo offers us a peace offering if we are willing to take it: our suspicions of the performance viewpoint are indulged. The cost of this peace offering is a willingness to be skeptical of our own probabilism in turn.
The key to making this work is a reliance on meta-statistical principles: philosophy. We need to explore our intuitions about what scientific evidence is. What makes for a strong scientific/statistical inference? What do we want from statistics? Answering these questions outside of any particular statistical theory is important, because many battles in the “statistics wars” are fought over evaluating one statistical theory by the standards of another, when they should be considered on higher ground.
Mayo gives us a principle we can work with: severity. Her weak severity principle is something that few scientists, I think, would object to.
One does not have evidence for a claim if nothing has been done to rule out ways in which the claim may be false.
You might be right about a claim, but your responsibility, if you say you have evidence for a claim, is to show that you’ve tried to rule out ways in which it can be false. A good test of a claim — a stringent test — is one which has a strong possibility of ruling something out, if it is false. Mayo’s strong severity principle makes the connection between stringency and severity:
We have evidence for a claim…just to the extent that it survives a stringent scrutiny.
Mayo’s position is that neither probabilism nor performance goals adequately capture the severity perspective. The long-run performance perspective says nothing about how well-tested any particular claim is, and frequentists might even deny that such a goal is interesting: they just want to control overall error rates. Likewise, the Bayesian perspective focuses on coherence and the move from prior to posterior; there is nothing formal in Bayesian statistics that requires severity.
This does not mean that scientists applying statistical methods — Bayesian or frequentist — don’t fill the gap. Mayo introduces the idea of “decoupling”: that methods can become disconnected from the philosophy which originally spawned them. We are invited, then, to ask how our methods meet the requirements for severe testing, regardless of whether those methods are Bayesian, frequentist, or other. This appears to be Mayo’s roadmap away from the statistics wars, which she will outline in Tour 2.