Where you are in the Journey* We’ll move from the philosophical ground ﬂoor to connecting themes from other levels, from Popperian falsiﬁcation to signiﬁcance tests, and from Popper’s demarcation to current-day problems of pseudoscience and irreplication. An excerpt from our Museum Guide gives a broad-brush sketch of the ﬁrst few sections of Tour II:
Karl Popper had a brilliant way to “solve” the problem of induction: Hume was right that enumerative induction is unjustiﬁed, but science is a matter of deductive falsiﬁcation. Science was to be demarcated from pseudoscience according to whether its theories were testable and falsiﬁable. A hypothesis is deemed severely tested if it survives a stringent attempt to falsify it. Popper’s critics denied he could sustain this and still be a deductivist …
Popperian falsiﬁcation is often seen as akin to Fisher’s view that “every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis” (1935a, p. 16). Though scientists often appeal to Popper, some critics of signiﬁcance tests argue that they are used in decidedly non-Popperian ways. Tour II explores this controversy.
While Popper didn’t make good on his most winning slogans, he gives us many seminal launching-oﬀ points for improved accounts of falsiﬁcation, corroboration, science versus pseudoscience, and the role of novel evidence and predesignation. These will let you revisit some thorny issues in today’s statistical crisis in science.
2.3 Popper, Severity, and Methodological Probability
Here’s Popper’s summary (drawing from Popper, Conjectures and Refutations, 1962, p. 53):
- [Enumerative] induction … is a myth. It is neither a psychological fact … nor one of scientific procedure.
- The actual procedure of science is to operate with conjectures…
- Repeated observations and experiments function in science as tests of our conjectures or hypotheses, i.e., as attempted refutations.
- [It is wrongly believed that using the inductive method can] serve as a criterion of demarcation between science and pseudoscience. … None of this is altered in the least if we say that induction makes theories only probable.
There are four key, interrelated themes:
(1) Science and Pseudoscience. Redeﬁning scientiﬁc method gave Popper a new basis for demarcating genuine science from questionable science or pseudoscience. Flexible theories that are easy to conﬁrm – theories of Marx, Freud, and Adler were his exemplars – where you open your eyes and ﬁnd conﬁrmations everywhere, are low on the scientiﬁc totem pole (ibid., p. 35). For a theory to be scientiﬁc it must be testable and falsiﬁable.
(2) Conjecture and Refutation. The problem of induction is a problem only if it depends on an unjustiﬁable procedure such as enumerative induction. Popper shocked everyone by denying scientists were in the habit of inductively enumerating. It doesn’t even hold up on logical grounds. To talk of “another instance of an A that is a B” assumes a conceptual classiﬁcation scheme. How else do we recognize it as another item under the umbrellas A and B? (ibid., p. 44). You can’t just observe, you need an interest, a point of view, a problem.
The actual procedure for learning in science is to operate with conjectures in which we then try to ﬁnd weak spots and ﬂaws. Deductive logic is needed to draw out the remote logical consequences that we actually have a shot at testing (ibid., p. 51). From the scientist down to the amoeba, says Popper, we learn by trial and error: conjecture and refutation (ibid., p. 52). The crucial diﬀerence is the extent to which we constructively learn how to reorient ourselves after clashes.
Without waiting, passively, for repetitions to impress or impose regularities upon us, we actively try to impose regularities upon the world… These may have to be discarded later, should observation show that they are wrong. (ibid., p. 46)
(3) Observations Are Not Given. Popper rejected the time-honored empiricist assumption that observations are known relatively unproblematically. If they are at the “foundation,” it is only because there are apt methods for testing their validity. We dub claims observable because or to the extent that they are open to stringent checks. (Popper: “anyone who has learned the relevant technique can test it” (1959, p. 99).) Accounts of hypothesis appraisal that start with “evidence x,” as in conﬁrmation logics, vastly oversimplify the role of data in learning.
(4) Corroboration Not Conﬁrmation, Severity Not Probabilism. Last, there is his radical view on the role of probability in scientiﬁc inference. Rejecting probabilism, Popper not only rejects Carnap-style logics of conﬁrmation, he denies scientists are interested in highly probable hypotheses (in any sense). They seek bold, informative, interesting conjectures and ingenious and severe attempts to refute them. If one uses a logical notion of probability, as philosophers (including Popper) did at the time, the high content theories are highly improbable; in fact, Popper said universal theories have 0 probability. (Popper also talked of statistical probabilities as propensities.)
These themes are in the spirit of the error statistician. Considerable spade-work is required to see what to keep and what to revise, so bring along your archeological shovels.
Demarcation and Investigating Bad Science
There is a reason that statisticians and scientists often refer back to Popper; his basic philosophy – at least his most winning slogans – is in sync with ordinary intuitions about good scientific practice. Even people divorced from Popper’s full philosophy wind up going back to him when they need to demarcate science from pseudoscience. Popper’s right that if using enumerative induction makes you scientific, then anyone from an astrologer to one who blithely moves from observed associations to full-blown theories is scientific. Yet the criterion of testability and falsifiability – as typically understood – is nearly as bad. It is both too strong and too weak. Any crazy theory found false would be scientific, and our most impressive theories aren’t deductively falsifiable. Larry Laudan’s famous (1983) “The Demise of the Demarcation Problem” declared the problem taboo. This is a highly unsatisfactory situation for philosophers of science. Laudan and I generally see eye to eye; perhaps our disagreement here is just semantics. I share his view that what really matters is determining whether a hypothesis is warranted, rather than whether the theory is “scientific,” but surely Popper didn’t mean logical falsifiability sufficed. Popper is clear that many unscientific theories (e.g., Marxism, astrology) are falsifiable. It’s clinging to falsified theories that leads to unscientific practices. (Note: The use of a strictly falsified theory for prediction, or because nothing better is available, isn’t unscientific.) I say that, with a bit of fine-tuning, we can retain the essence of Popper to capture what makes an inquiry, if not an entire domain, scientific.
Following Laudan, philosophers tend to shy away from saying anything general about science versus pseudoscience – the predominant view is that there is no such thing. Some say that there’s at most a kind of “family resemblance” amongst domains people tend to consider scientific (Dupré 1993, Pigliucci 2010, 2013). One gets the impression that the demarcation task is being left to committees investigating allegations of poor science or fraud. They are forced to articulate what to count as fraud, as bad statistics, or as mere questionable research practices (QRPs). People’s careers depend on their rulings: they have “skin in the game,” as Nassim Nicholas Taleb might say (2018). The best one I know – the committee investigating fraudster Diederik Stapel – advises making philosophy of science a requirement for researchers (Levelt Committee, Noort Committee, and Drenth Committee 2012). So let’s not tell them philosophers have given up on it.
Diederik Stapel. A prominent social psychologist “was found to have committed a serious infringement of scientiﬁc integrity by using ﬁctitious data in his publications” (Levelt Committee 2012, p. 7). He was required to retract 58 papers, relinquish his university degree and much else. The authors of the report describe walking into a culture of conﬁrmation and veriﬁcation bias. They could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientiﬁc method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (ibid., p. 48). Free of the qualms that give philosophers of science cold feet, they advance some obvious yet crucially important rules with Popperian echoes:
One of the most fundamental rules of scientiﬁc research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that conﬁrm the research hypotheses. Violations of this fundamental rule, such as continuing to repeat an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tend to conﬁrm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts. (ibid., p. 48)
Exactly! This is our minimal requirement for evidence: If it’s so easy to ﬁnd agreement with a pet theory or claim, such agreement is bad evidence, no test, BENT. To scrutinize the scientiﬁc credentials of an inquiry is to determine if there was a serious attempt to detect and report errors and biasing selection eﬀects. We’ll meet Stapel again when we reach the temporary installation on the upper level: The Replication Revolution in Psychology.
The issue of demarcation (point (1)) is closely related to Popper’s conjecture and refutation (point (2)). While he regards a degree of dogmatism to be necessary before giving theories up too readily, the trial and error methodology “gives us a chance to survive the elimination of an inadequate hypothesis – when a more dogmatic attitude would eliminate it by eliminating us” (Popper 1962, p. 52). Despite giving lip service to testing and falsiﬁcation, many popular accounts of statistical inference do not embody falsiﬁcation – even of a statistical sort.
Nearly everyone, however, now accepts point (3), that observations are not just “given” – knocking out a crucial pillar on which naïve empiricism stood. To the question “What came first, hypothesis or observation?” Popper answers: another hypothesis, only lower down or more local. Do we get an infinite regress? No, because we may go back to increasingly primitive theories and even, Popper thinks, to an inborn propensity to search for and find regularities (ibid., p. 47). I’ve read about studies appearing to show that babies are aware of what is statistically unusual. In one, babies were shown a box with a large majority of red versus white balls (Xu and Garcia 2008, Gopnik 2009). When a succession of white balls was drawn, one after another, with the contents of the box covered by a screen, the babies looked longer than when the more common red balls were drawn. I don’t find this far-fetched. Anyone familiar with preschool computer games knows how far toddlers can get in solving problems without a single word, just by trial and error.
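The babies’ surprise tracks a simple probability calculation. Here is a sketch; the box proportions are my own illustrative numbers, not the figures from Xu and Garcia’s study:

```python
# Illustrative only: a mostly-red box (70 red, 10 white) and 4 draws are
# hypothetical numbers, not taken from Xu and Garcia (2008).
from math import comb

def prob_all_white(n_red, n_white, draws):
    """Probability that `draws` balls drawn without replacement are all white."""
    if draws > n_white:
        return 0.0
    return comb(n_white, draws) / comb(n_red + n_white, draws)

p = prob_all_white(n_red=70, n_white=10, draws=4)
print(f"P(4 white in a row) = {p:.5f}")
```

With these made-up proportions, a run of four white balls has a probability on the order of 10⁻⁴, while four reds would be unremarkable – the kind of contrast the looking-time measure is thought to register.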
Greater Content, Greater Severity. The position people are most likely to take a pass on is (4), his view of the role of probability. Yet Popper’s central intuition is correct: if we wanted highly probable claims, scientists would stick to low-level observables and not seek generalizations, much less theories with high explanatory content. In this day of fascination with Big Data’s ability to predict what book I’ll buy next, a healthy Popperian reminder is due: humans also want to understand and to explain. We want bold “improbable” theories. I’m a little puzzled when I hear leading machine learners praise Popper, a realist, while proclaiming themselves fervid instrumentalists. That is, they hold the view that theories, rather than aiming at truth, are just instruments for organizing and predicting observable facts. It follows from the success of machine learning, Vladimir Cherkassky avers, that “realism is not possible.” This is very quick philosophy! “… [I]n machine learning we are given a set of [random] data samples, and the goal is to select the best model (function, hypothesis) from a given set of possible models” (Cherkassky 2012). Fine, but is the background knowledge required for this setup itself reducible to a prediction–classification problem? I say no, as would Popper. Even if Cherkassky’s problem is relatively theory free, it wouldn’t follow that this is true for all of science. Some of the most impressive “deep learning” results in AI have been criticized for lacking the ability to generalize beyond observed “training” samples, or to solve open-ended problems (Marcus 2018).
A valuable idea to take from Popper is that probability in learning attaches to a method of conjecture and refutation, that is to testing: it is methodological probability. An error probability is a special case of a methodological probability. We want methods with a high probability of teaching us (and machines) how to distinguish approximately correct and incorrect interpretations of data, even leaving murky cases in the middle, and how to advance knowledge of detectable, while strictly unobservable, eﬀects.
The choices for probability that we are commonly offered are stark: “in here” (beliefs ascertained by introspection) or “out there” (frequencies in long runs, or chance mechanisms). This is the “epistemology” versus “variability” shoehorn we reject (Souvenir D). To qualify the method by which H was tested, frequentist performance is necessary, but it’s not sufficient. The assessment must be relevant to ensuring that claims have been put to severe tests. You can talk of a test having a type of propensity or capability to have discerned flaws, as Popper did at times. A highly explanatory, high-content theory, with interconnected tentacles, has a higher probability of having flaws discerned than low-content theories that do not rule out as much. Thus, when the bolder, higher content theory stands up to testing, it may earn higher overall severity than the one with measly content. That a theory is plausible is of little interest, in and of itself; what matters is that it is implausible for it to have passed these tests were it false or incapable of adequately solving its set of problems. It is the fuller, unifying theory developed in the course of solving interconnected problems that enables severe tests.
Methodological probability is not to quantify my beliefs, but neither is it about a world I came across without considerable eﬀort to beat nature into line. Let alone is it about a world-in-itself which, by deﬁnition, can’t be accessed by us. Deliberate eﬀort and ingenuity are what allow me to ensure I shall come up against a brick wall, and be forced to reorient myself, at least with reasonable probability, when I test a ﬂawed conjecture. The capabilities of my tools to uncover mistaken claims (its error probabilities) are real properties of the tools. Still, they are my tools, specially and imaginatively designed. If people say they’ve made so many judgment calls in building the inferential apparatus that what’s learned cannot be objective, I suggest they go back and work some more at their experimental design, or develop better theories.
Falsification Is Rarely Deductive. It is rare for any interesting scientific hypotheses to be logically falsifiable. This might seem surprising given all the applause heaped on falsifiability. For a scientific hypothesis H to be deductively falsified, some observable result taken together with H would have to yield a logical contradiction (A & ~A). But the only theories that deductively prohibit observations are of the sort one mainly finds in philosophy books: “All swans are white” is falsified by a single non-white swan. There are some statistical claims and contexts, I will argue, where it’s possible to achieve or come close to deductive falsification: claims such as “these data are independent and identically distributed (IID).” Going beyond a mere denial to replacing such claims requires more work.
However, interesting claims about mechanisms and causal generalizations require numerous assumptions (substantive and statistical) and are rarely open to deductive falsiﬁcation. How then can good science be all about falsiﬁability? The answer is that we can erect reliable rules for falsifying claims with severity. We corroborate their denials. If your statistical account denies we can reliably falsify interesting theories, it is irrelevant to real-world knowledge. Let me draw your attention to an exhibit on a strange disease, kuru, and how it falsiﬁed a fundamental dogma of biology.
Exhibit (v): Kuru. Kuru (which means “shaking”) was widespread among the Fore people of New Guinea in the 1960s. In around 3–6 months, kuru victims go from having difficulty walking, to outbursts of laughter, to inability to swallow and death. Kuru and (what we now know to be) related diseases, e.g., mad cow, Creutzfeldt–Jakob, and scrapie, are “spongiform” diseases, causing brains to appear spongy. Kuru clustered in families, in particular among Fore women and their children or elderly parents. Researchers began to suspect that transmission occurred through mortuary cannibalism. Consuming the brains of loved ones, a way of honoring the dead, was also a main source of meat permitted to women. Some say men got first dibs on the muscle; others deny men partook in these funerary practices. What we know is that ending these cannibalistic practices all but eradicated the disease. No one expected at the time that understanding kuru’s cause would falsify an established theory that only viruses and bacteria could be infectious. This “central dogma of biology” says:
H: All infectious agents have nucleic acid.
Any infectious agent free of nucleic acid would be anomalous for H – meaning it goes against what H claims. A separate step is required to decide when H’s anomalies should count as falsifying H. There needn’t be a cut-oﬀ so much as a standpoint as to when continuing to defend H becomes bad science. Prion researchers weren’t looking to test the central dogma of biology, but to understand kuru and related diseases. The anomaly erupted only because kuru appeared to be transmitted by a protein alone, by changing a normal protein shape into an abnormal fold. Stanley Prusiner called the infectious protein a prion – for which he received much grief. He thought, at ﬁrst, he’d made a mistake “and was puzzled when the data kept telling me that our preparations contained protein but not nucleic acid” (Prusiner 1997). The anomalous results would not go away and, eventually, were demonstrated via experimental transmission to animals. The discovery of prions led to a “revolution” in molecular biology, and Prusiner received a Nobel prize in 1997. It is logically possible that nucleic acid is somehow involved. But continuing to block the falsiﬁcation of H (i.e., block the “protein only” hypothesis) precludes learning more about prion diseases, which now include Alzheimer’s. (See Mayo 2014a.)
Insofar as we falsify general scientific claims, we are all methodological falsificationists. Some people say, “I know my models are false, so I’m done with the job of falsifying before I even begin.” Really? That’s not falsifying. Let’s look at your method: always infer that H is false, that it fails to solve its intended problem. Then you’re bound to infer this even when it is erroneous. Your method fails the minimal severity requirement.
Do Probabilists Falsify? It isn’t obvious a probabilist desires to falsify, rather than supply a probability measure indicating disconfirmation, the opposite of a B-boost (a B-bust?), or a low posterior. Members of some probabilist tribes propose that Popper is subsumed under a Bayesian account by taking a low value of Pr(x|H) to falsify H. That could not work. Individual outcomes described in detail will easily have very small probabilities under H without being genuine anomalies for H. To the severe tester, this is an attempt to distract from the inability of probabilists to falsify, insofar as they remain probabilists. What about comparative accounts (Likelihoodists or Bayes factor accounts), which I also place under probabilism? Reporting that one hypothesis is more likely than the other is not to falsify anything. Royall is clear that it’s wrong even to take the comparative report as evidence against one of the two hypotheses: they are not exhaustive. (Nothing turns on whether you prefer to put Likelihoodism under its own category.) Must all such accounts abandon the ability to falsify? No, they can indirectly falsify hypotheses by adding a methodological falsification rule. A natural candidate is to falsify H if its posterior probability is sufficiently low (or, perhaps, it is sufficiently disconfirmed). Of course, they’d need to justify the rule, ensuring it wasn’t often mistaken.
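The point that a tiny Pr(x|H) cannot by itself falsify H can be put numerically with a toy fair-coin example of my own (not from the text):

```python
import random

random.seed(1)
n = 50
# Simulate n tosses under H: the coin is fair.
seq = tuple(random.randint(0, 1) for _ in range(n))

# Under H, this exact sequence -- like every other length-n sequence -- has
# probability 2^-n.
prob_of_exact_sequence = 0.5 ** n
print(f"Pr(this exact sequence | fair coin) = {prob_of_exact_sequence:.2e}")
# Astronomically small, yet the data are no anomaly for H. A rule "falsify H
# whenever Pr(x|H) is tiny" would reject the fair coin on every run; a genuine
# anomaly must be an improbable *kind* of outcome, e.g. a count of heads far
# from n/2, not merely an improbable individual outcome.
```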
The Popperian (Methodological) Falsiﬁcationist Is an Error Statistician
When is a statistical hypothesis to count as falsiﬁed? Although extremely rare events may occur, Popper notes:
such occurrences would not be physical effects, because, on account of their immense improbability, they are not reproducible at will … If, however, we find reproducible deviations from a macro effect … deduced from a probability estimate … then we must assume that the probability estimate is falsified. (Popper 1959, p. 203)
In the same vein, we heard Fisher deny that an “isolated record” of statistically signiﬁcant results suﬃces to warrant a reproducible or genuine eﬀect (Fisher 1935a, p. 14). Early on, Popper (1959) bases his statistical falsifying rules on Fisher, though citations are rare. Even where a scientiﬁc hypothesis is thought to be deterministic, inaccuracies and knowledge gaps involve error-laden predictions; so our methodological rules typically involve inferring a statistical hypothesis. Popper calls it a falsifying hypothesis. It’s a hypothesis inferred in order to falsify some other claim. A ﬁrst step is often to infer an anomaly is real, by falsifying a “due to chance” hypothesis.
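As a minimal sketch of that first step – falsifying a “due to chance” hypothesis – here is a toy exact binomial test; the data are invented for illustration:

```python
from math import comb

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the chance of a result at least this
    extreme if the 'due to chance' hypothesis holds."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Invented data: 18 'successes' in 20 trials, where chance alone says p = 0.5.
p_value = binom_tail(20, 18)
print(f"P(18 or more out of 20 | chance) = {p_value:.5f}")
```

Even then, by Fisher’s and Popper’s lights, one small p-value is only a start: the deviation must prove reproducible before the chance hypothesis counts as falsified and the anomaly as real.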
The recognition that we need methodological rules to warrant falsiﬁcation led Popperian Imre Lakatos to dub Popper’s philosophy “methodological falsiﬁcationism” (Lakatos 1970, p. 106). If you look at this footnote, where Lakatos often buried gems, you read about “the philosophical basis of some of the most interesting developments in modern statistics. The Neyman–Pearson approach rests completely on methodological falsiﬁcationism” (ibid., p. 109, note 6). Still, neither he nor Popper made explicit use of N-P tests. Statistical hypotheses are the perfect tool for “falsifying hypotheses.” However, this means you can’t be a falsiﬁcationist and remain a strict deductivist. When statisticians (e.g., Gelman 2011) claim they are deductivists like Popper, I take it they mean they favor a testing account like Popper, rather than inductively building up probabilities. The falsifying hypotheses that are integral for Popper also necessitate an evidence-transcending (inductive) statistical inference.
This is hugely problematic for Popper because being a strict Popperian means never having to justify a claim as true or a method as reliable. After all, this was part of Popper’s escape from induction. The problem is this: Popper’s account rests on severe tests, tests that would probably falsify claims if false, but he cannot warrant saying a method is probative or severe, because that would mean it was reliable, which makes Popperians squeamish. It would appear to concede to his critics that Popper has a “whiﬀ of induction” after all. But it’s not inductive enumeration. Error statistical methods (whether from statistics or informal) can supply the severe tests Popper sought. This leads us to Pierre Duhem, physicist and philosopher of science.
To read ‘Duhemian Problems of Falsiﬁcation’, and souvenirs E and F, see all of section 2.3.
Live Exhibit (vi): Revisiting Popper’s Demarcation of Science. Here’s an experiment: try shifting what Popper says about theories to a related claim about inquiries to ﬁnd something out. To see what I have in mind, let’s listen to an exchange between two fellow travelers over coﬀee at Statbucks.
TRAVELER 1: If mere logical falsiﬁability suﬃces for a theory to be scientiﬁc, then, we can’t properly oust astrology from the scientiﬁc pantheon. Plenty of nutty theories have been falsiﬁed, so by deﬁnition they’re scientiﬁc. Moreover, scientists aren’t always looking to subject well-corroborated theories to “grave risk” of falsiﬁcation.
TRAVELER 2: I’ve been thinking about this. On your first point, Popper confuses things by making it sound as if he’s asking: When is a theory unscientific? What he is actually asking, or should be asking, is: When is an inquiry into a theory, or an appraisal of claim H, unscientific? We want to distinguish meritorious modes of inquiry from those that are BENT. If the test methods enable ad hoc maneuvering and sneaky face-saving devices, then the inquiry – the handling and use of data – is unscientific. Despite being logically falsifiable, theories can be rendered immune from falsification by means of cavalier methods for their testing. Adhering to a falsified theory no matter what is poor science. Some areas have so much noise and/or flexibility that they can’t or won’t distinguish warranted from unwarranted explanations of failed predictions. Rivals may find flaws in one another’s inquiry or model, but the criticism is not constrained by what’s actually responsible. This is another way inquiries can become unscientific.1
On your second point, it’s true that Popper talked of wanting to subject theories to grave risk of falsiﬁcation. I suggest that it’s really our inquiries into, or tests of, the theories that we want to subject to grave risk. The onus is on interpreters of data to show how they are countering the charge of a poorly run test. I admit this is a modiﬁcation of Popper. One could reframe the entire demarcation problem as one of the characters of an inquiry or test.
She makes a good point. In addition to blocking inferences that fail the minimal requirement for severity:
A scientiﬁc inquiry or test: must be able to embark on a reliable probe to pinpoint blame for anomalies (and use the results to replace falsiﬁed claims and build a repertoire of errors).
The parenthetical remark isn’t absolutely required, but is a feature that greatly strengthens scientiﬁc credentials. Without solving, not merely embarking on, some Duhemian problems there are no interesting falsiﬁcations. The ability or inability to pin down the source of failed replications – a familiar occupation these days – speaks to the scientiﬁc credentials of an inquiry. At any given time, even in good sciences there are anomalies whose sources haven’t been traced – unsolved Duhemian problems – generally at “higher” levels of the theory-data array. Embarking on solving these is the impetus for new conjectures. Checking test assumptions is part of working through the Duhemian maze. The reliability requirement is: infer claims just to the extent that they pass severe tests. There’s no sharp line for demarcation, but when these requirements are absent, an inquiry veers into the realm of questionable science or pseudoscience. Some physicists worry that highly theoretical realms can’t be expected to be constrained by empirical data. Theoretical constraints are also important. We’ll ﬂesh out these ideas in future tours.
1 For example, astronomy, but not astrology, can reliably solve its Duhemian puzzles. Chapter 2, Mayo (1996), following my reading of Kuhn (1970) on “normal science.”
*Where you are in the Journey: I posted all of Excursion 1 Tour I, here, here, and here, and omitted Tour II (but blogposts on the Law of Likelihood, Royall, optional stopping, and Barnard may be found by searching this blog). You are now in Excursion 2; the first stop of Tour I (2.1) is here. The main material from 2.2 can be found in this blogpost. You can read the rest of Excursion 2 Tour II, section 2.3, in proof form, here. For the full itinerary of Statistical Inference as Severe Testing: How to Get Beyond the Stat Wars (2018, CUP), see the SIST Itinerary.