Severity: Strong vs Weak (Excursion 1 continues)


Marking one year since the appearance of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP), let’s continue to the second stop (1.2) of Excursion 1 Tour 1. It begins on p. 13 with a quote from statistician George Barnard. Assorted reflections will be given in the comments. Ask me any questions pertaining to the Tour.


  • I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth. (George Barnard 1985, p. 2)

While statistical science (as with other sciences) generally goes about its business without attending to its own foundations, implicit in every statistical methodology are core ideas that direct its principles, methods, and interpretations. I will call this its statistical philosophy. To tell what’s true about statistical inference, understanding the associated philosophy (or philosophies) is essential. Discussions of statistical foundations tend to focus on how to interpret probability, and much less on the overarching question of how probability ought to be used in inference. Assumptions about the latter lurk implicitly behind debates, but rarely get the limelight. If we put the spotlight on them, we see that there are two main philosophies about the roles of probability in statistical inference: We may dub them performance (in the long run) and probabilism.

The performance philosophy sees the key function of statistical method as controlling the relative frequency of erroneous inferences in the long run of applications. For example, a frequentist statistical test, in its naked form, can be seen as a rule: whenever your outcome exceeds some value (say, X > x*), reject a hypothesis H0 and infer H1. The value of the rule, according to its performance-oriented defenders, is that it can ensure that, regardless of which hypothesis is true, there is both a low probability of erroneously rejecting H0 (rejecting  H0  when  it is true)   as   well   as   erroneously   accepting H0 (failing to reject H0 when it is false).

The second philosophy, probabilism, views probability as a way to assign degrees of  belief,  support,  or  plausibility to  hypotheses.  Many  keep  to a comparative report, for example that H0 is more believable than is H1 given data x; others strive to say H0 is less believable given data than before, and offer a quantitative report of the difference.

What happened to the goal of scrutinizing BENT science by the severity criterion? [See 1.1] Neither “probabilism” nor “performance” directly captures that demand. To take these goals at face value, it’s easy to see why they come up short. Potti and Nevins’ strong belief in the reliability of their prediction model for cancer therapy scarcely made up for the shoddy testing. Neither is good long-run performance a sufficient condition. Most obviously, there may be no long-run repetitions, and our interest in science is often just the particular statistical inference before us. Crude long-run requirements may be met by silly methods. Most importantly, good performance alone fails to get at why methods work when they do; namely – I claim – to let us assess and control the stringency of tests. This is the key to answering a burning question that has caused major headaches in statistical foundations: why should a low relative frequency of error matter to the appraisal of the inference at hand? It is not probabilism or performance we seek to quantify, but probativeness.

I do not mean to disparage the long-run performance goal – there are plenty of tasks in inquiry where performance is absolutely key. Examples are screening in high-throughput data analysis, and methods for deciding which of tens of millions of collisions in high-energy physics to capture and analyze. New applications of machine learning may lead some to say that only low rates of prediction or classification errors matter. Even with prediction, “black-box” modeling, and non-probabilistic  inquiries,  there  is  concern  with  solving a problem. We want to know if a good job has been done in the case at hand.

Severity (Strong): Argument from Coincidence

The weakest version of the severity requirement (Section 1.1), in the sense of easiest to justify, is negative, warning us when BENT data are at hand, and a surprising amount of mileage may be had from that negative principle alone. It is when we recognize how poorly certain claims are warranted that we get ideas for improved inquiries. In fact, if you wish to stop at the negative requirement, you can still go pretty far along with me. I also advocate the positive counterpart:

Severity (strong): We have evidence for a claim C just to the extent it survives a stringent scrutiny. If C passes a test that was highly capable of finding flaws or discrepancies from C, and yet none or few are found, then the passing result, x, is evidence for C.

One way this can be achieved is by an argument from coincidence. The most vivid cases occur outside formal statistics.

Some of my strongest examples tend to revolve around my weight. Before leaving the USA for the UK, I record my weight on two scales at home, one digital, one not, and the big medical scale at my doctor’s office. Suppose they are well calibrated and nearly identical in their readings, and they also all pick up on the extra 3 pounds when I’m weighed carrying three copies of my 1-pound book, Error and the Growth of Experimental Knowledge (EGEK). Returning from the UK, to my astonishment, not one but all three scales show anywhere from a 4–5 pound gain. There’s no difference when I place the three books on the scales, so I must conclude, unfortunately, that I’ve gained around 4 pounds. Even for me, that’s a lot. I’ve surely falsified the supposition that I lost weight! From this informal example, we may make two rather obvious points that will serve for less obvious cases. First, there’s the idea I call lift-off.

Lift-o: An overall inference can be more reliable and precise than its premises individually.

Each scale, by itself, has some possibility of error, and limited precision. But the fact that all of them have me at an over 4-pound gain, while none show any difference in the weights of EGEK, pretty well seals it. Were one scale off balance, it would be discovered by another, and would show up in the weighing of books. They cannot all be systematically misleading just when  it  comes  to  objects  of  unknown  weight,  can  they?  Rejecting a conspiracy of the scales, I conclude I’ve gained weight, at least 4 pounds. We may call this an argument from coincidence, and by its means we can attain lift-off. Lift-off runs directly counter to a seemingly obvious claim of drag-down.

Drag-down: An overall inference is only as reliable/precise as is its weakest premise.

The drag-down assumption is  common  among  empiricist  philosophers: As they like to say, “It’s turtles all the way down.” Sometimes our inferences do stand as a kind of tower built on linked stones – if even one stone fails they all come tumbling down. Call that a linked argument.

Our most prized scientific inferences would be in a very bad way if piling on assumptions invariably leads to weakened conclusions. Fortunately we also can build what may be called convergent arguments, where lift-off is attained. This seemingly banal point suffices to combat some of the most well entrenched skepticisms in philosophy of science. And statistics happens to be the science par excellence for demonstrating lift-off!

Now consider what justifies my weight conclusion, based, as we are supposing it is, on a strong argument from coincidence. No one would say: “I can be assured that by following such a procedure, in the long run I would rarely report weight gains erroneously, but I can tell nothing from these readings about my weight now.” To justify my conclusion by long-run performance would be absurd. Instead we say that the procedure had enormous capacity to reveal if any of the scales were wrong, and from this I argue about the source of the readings: H: I’ve gained weight. Simple as that. It would be a preposterous coincidence if none of the scales registered even slight weight shifts when weighing objects of known weight, and yet were systematically misleading when applied to my weight. You see where I’m going with this. This is the key – granted with a homely example – that can fill a very important gap in frequentist foundations: Just because an account is touted as having a long-run rationale, it does not mean it lacks a short run rationale, or even one relevant for the particular case at hand. Nor is it merely the improbability of all the results were H false; it is rather like denying an evil demon has read my mind just in the cases where I do not know the weight of an object, and deliberately deceived me. The argument to “weight gain” is an example of an argument from coincidence to the absence of an error, what I call:

Arguing from Error: There is evidence an error is absent to the extent that a procedure with a very high capability of signaling the error, if and only if it is present, nevertheless detects no error.

I am using “signaling” and “detecting” synonymously: It is important to keep in mind that we don’t know if the test output is correct, only that it gives a signal or alert, like sounding a bell. Methods that enable strong arguments to the absence (or presence) of an error I call strong error probes. Our ability to develop strong arguments from coincidence, I will argue, is the basis for solving the “problem of induction.”

Glaring Demonstrations of Deception

Intelligence is indicated by a capacity for deliberate deviousness. Such deviousness becomes self-conscious in inquiry: An example is the use of a placebo to find out what it would be like if the drug has no effect. What impressed me the most in my first statistics class was the demonstration of how apparently impressive results are readily produced when nothing’s going on, i.e., “by chance alone.” Once you see how it is done, and done easily, there is no going back. The toy hypotheses used in statistical testing are nearly always overly simple as scientific hypotheses. But when it comes to framing rather blatant deceptions, they are just the ticket!

When Fisher offered Muriel Bristol-Roach a cup of tea back in the 1920s, she refused it because he had put the milk in first. What difference could it make? Her husband and Fisher thought it would be fun to put her to the test (1935a). Say she doesn’t claim to get it right all the time but does claim that she has some genuine discerning ability. Suppose Fisher subjects her to 16 trials and she gets 9 of them right. Should I be impressed or not? By a simple experiment of randomly assigning milk first/tea first Fisher sought to answer this stringently. But don’t be fooled: a great deal of work goes into controlling biases and confounders before the experimental design can work. The main point just now is this: so long as lacking ability is sufficiently like the canonical “coin tossing” (Bernoulli) model (with the probability of success at each trial of 0.5), we can learn from the test procedure. In the Bernoulli model, we record success or failure, assume a fixed probability of success θ on each trial, and that trials are independent. If the probability of getting even more successes than she got, merely by guessing, is fairly high, there’s little indication of special tasting ability. The probability of at least 9 of 16 successes, even if θ = 0.5, is 0.4. To abbreviate, Pr(at least 9 of 16 successes; H0: θ = 0.5) = 0.4. This is the P-value of the observed difference; an unimpressive 0.4. You’d expect as many or even more “successes” 40% of the time merely by guessing. It’s also the significance level attained by the result. (I often use P-value as it’s shorter.) Muriel Bristol-Roach pledges that if her performance may be regarded as scarcely better than guessing, then she hasn’t shown her ability. Typically, a small value such as 0.05, 0.025, or 0.01 is required.

Such artificial and simplistic statistical hypotheses play valuable roles at stages of inquiry where what is needed are blatant standards of “nothing’s going on.” There is no presumption of a metaphysical chance agency, just that there is expected variability – otherwise one test would suffice – and that probability models from games of chance can be used to distinguish genuine from spurious effects. Although the goal of inquiry is to find things out, the hypotheses erected to this end are generally approximations and may be deliberately false. To present statistical hypotheses as identical to substantive scientific claims is to mischaracterize them. We want to tell what’s true about statistical inference. Among the most notable of these truths is:

P-values can be readily invalidated due to how the data (or hypotheses!) are generated or selected for testing.

If you fool around with the results afterwards, reporting only successful guesses, your report will be invalid. You may claim it’s very difficult to get such an impressive result due to chance, when in fact it’s very easy to do so, with selective reporting. Another way to put this: your computed P-value is small, but the actual P-value is high! Concern with spurious findings, while an ancient problem, is considered sufficiently serious to have motivated the American Statistical Association to issue a guide on how not to interpret P-values (Wasserstein and Lazar 2016); hereafter, ASA 2016 Guide. It may seem that if a statistical account is free to ignore such fooling around then the problem disappears! It doesn’t.

Incidentally, Bristol-Roach got all the cases correct, and thereby taught her husband a lesson about putting her claims to the test.

skips p. 18 on Peirce

Texas Marksman

Take an even simpler and more blatant argument of deception. It is my favorite: the Texas Marksman. A Texan wants to demonstrate his shooting prowess. He shoots all his bullets any old way into the side of a barn and then paints a bull’s-eye in spots where the bullet holes are clustered. This fails utterly to severely test his marksmanship ability. When some visitors come to town and notice the incredible number of bull’s-eyes, they ask to meet this marksman and are introduced to a little kid. How’d you do so well, they ask? Easy, I just drew the bull’s-eye around the most tightly clustered shots. There is impressive “agreement” with shooting ability, he might even compute how improbably so many bull’s-eyes would occur by chance. Yet his ability to shoot was not tested in the least by this little exercise. There’s a real effect all right, but it’s not caused by his marksmanship! It serves as a potent analogy for a cluster of formal statistical fallacies from data-dependent findings of “exceptional” patterns.

The term “apophenia” refers to a tendency to zero in on an apparent regularity or cluster within a vast sea of data and claim a genuine regularity. One of our fundamental problems (and skills) is that we’re apopheniacs. Some investment funds, none that we actually know, are alleged to produce several portfolios by random selection of stocks and send out only the one that did best. Call it the Pickrite method. They want you to infer that it would be a preposterous coincidence to get so great a portfolio if the Pickrite method were like guessing. So their methods are genuinely wonderful, or so you are to infer. If this had been their only portfolio, the probability of doing so well by luck is low. But the probability of at least one of many portfolios doing so well (even if each is generated by chance) is high, if not guaranteed.

Let’s review the rogues’ gallery of glaring arguments from deception. The lady tasting tea showed how a statistical model of “no effect” could be used to amplify our ordinary capacities to discern if something really unusual is going on. The P-value is the probability of at least as high a success rate as observed, assuming the test or null hypothesis, the probability of success is 0.5. Since even more successes than she got is fairly frequent through guessing alone (the P-value is moderate), there’s poor evidence of a genuine ability. The Playfair and Texas sharpshooter examples, while quasi-formal or informal, demonstrate how to invalidate reports of significant effects. They show how gambits of post-data adjustments or selection can render a method highly capable of spewing out impressive looking fits even when it’s just random noise.

We appeal to the same statistical reasoning to show the problematic cases as to show genuine arguments from coincidence.

So am I proposing that a key role for statistical inference is to identify ways to spot egregious deceptions (BENT cases) and create strong arguments from coincidence? Yes, I am.

Skips “Spurious P-values and Auditing” (p. 20) up to Souvenir A (p. 21)

Souvenir A: Postcard to Send

The gift shop has a postcard listing the four slogans from the start of this Tour. Much of today’s handwringing about statistical inference is unified by a call to block these fallacies. In some realms, trafficking in too-easy claims for evidence, if not criminal offenses, are “bad statistics”; in others, notably some social sciences, they are accepted cavalierly – much to the despair of panels on research integrity. We are more sophisticated than ever about the ways researchers can repress unwanted, and magnify wanted, results. Fraud-busting is everywhere, and the most important grain of truth is this: all the fraud-busting is based on error statistical reasoning (if only on the meta-level). The minimal requirement to avoid BENT isn’t met. It’s hard to see how one can grant the criticisms while denying the critical logic.

We should oust mechanical, recipe-like uses of statistical methods that have long been lampooned, and are doubtless made easier by Big Data mining. They should be supplemented with tools to report magnitudes of effects that have and have not been warranted with severity. But simple significance tests have their uses, and shouldn’t be ousted simply because some people are liable to violate Fisher’s warning and report isolated results. They should be seen as a part of a conglomeration of error statistical tools for distinguishing genuine and spurious effects. They offer assets that are essential to our task: they have the means by which to register formally the fallacies in the postcard list. The failed statistical assumptions, the selection effects from trying and trying again, all alter a test’s error-probing capacities. This sets off important alarm bells, and we want to hear them. Don’t throw out the error-control baby with the bad statistics bathwater.

The slogans about lying with statistics? View them, not as a litany of embarrassments, but as announcing what any responsible method must register, if not control or avoid. Criticisms of statistical tests, where valid, boil down to problems with the critical alert function. Far from the high capacity to warn, “Curb your enthusiasm!” as correct uses of tests do, there are practices that make sending out spurious enthusiasm as easy as pie. This is a failure for sure, but don’t trade them in for methods that cannot detect failure at all. If you’re shopping for a statistical account, or appraising a statistical reform, your number one question should be: does it embody trigger warnings of spurious effects? Of bias? Of cherry picking and multiple tries? If the response is: “No problem; if you use our method, those practices require no change in statistical assessment!” all I can say is, if it sounds too good to be true, you might wish to hold off buying it.

Skips remainder of section 1.2 (bott p. 22- middle p. 23).


2 This is the traditional use of “bias” as a systematic error. Ioannidis (2005) alludes to biasing as behaviors that result in a reported significance level differing from the value it actually has or ought to have (e.g., post-data endpoints, selective reporting). I will call those biasing selection effects.

FOR ALL OF TOUR I: SIST Excursion 1 Tour I

THE FULL ITINERARY: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars: SIST Itinerary



Categories: Statistical Inference as Severe Testing

Post navigation

6 thoughts on “Severity: Strong vs Weak (Excursion 1 continues)

  1. The most important take away (souvenir) from the rogues gallery of BENT cases is that this:

    “We appeal to the same statistical reasoning to show the problematic cases as to show genuine arguments from coincidence.”

    In however many presentations of SIST over the last year I have found my audiences most surprised (and interested!) to learn that many statistical “reforms” involve methods that violate the minimal severity requirement. I, in turn, am most surprised to learn that they weren’t aware of this. (This all comes out more clearly as we proceed to Tour II of this first Excursion.) One year later, let me be so bold as to coin Mayo’s Principle:

    Statistical “reforms” intended to avoid egregious deceptions and improve the credibility of statistics in science will do just the opposite–if they violate the weak severity requirement.

  2. QUESTION. What is proving to be the largest obstacle in the way of improving statistics?

    I will give 2 answers, one concerning content/argument and a second concerning process.
    (1) content/argument
    I knew this would be the most difficult, and thus SIST spend a fair amount of time on it. It is the thinking that if a concern, such as biasing selection effects, does not register in a given account, then an account is free of the concern.

    “In fact, Bayes factors can be used in the complete absence of a sampling plan, or in situations where the analyst does not know the sampling plan that was used.”
    Bayarri, Benjamin, Berger (2016, p. 100), full reference in SIST.

    If you don’t have to know the sampling plan, then in general, you don’t need to know if there was data dredging, number of hypotheses sought, etc.

    Here they discuss optional stopping–how you can blow up a significance level by going on long enough (in certain examples such as Normal testing of a point null that the mean = 0). This is discussed in detail in this excursion (1), Tour II. They go on to marvel at how this seems like cheating but it isn’t–for them. If your method is guaranteed to reject a true null, then even if WITHIN your account everything is hunky dory (because you don’t compute the error probability of the method), it doesn’t change the fact that your method is guaranteed to reject a true null. This is not hunky dory for anyone who cares about not being wrong with maximal probability.

    (2) The largest policy obstacle?
    Individuals from the ASA setting out “recommendations” that are admitted to not being the slightest bit consensual among statisticians, and which are in fact at odds with 100s of best practice manuals on clinical trials in medicine, and then trying to get journals, professional societies and authors to adopt them. I have no doubt that they are convinced that not saying “significant/significance” and ending the use of predesignated P-value thresholds in interpreting results is all to the good. What is much harder to believe is that it is regarded as beneficial to set out “alternatives measures of evidence” without any assessment of their qualities or abilities to perform the task of the methods they are to replace. What kind of exemplar are they setting for the field of statistics–a field that is committed, or so I thought, to critically evaluating methods of statistical inference and modeling?

    Following their dictates, people would still perform statistical significance testing, but in secret–don’t ask, don’t tell–and then overlaying their replacement methods. To outsiders, it appears that leaders of rival statistical tribes are using the “crisis of replication” as an opportunity to plant their pet method without having to convince scientists of their good qualities.

    I’m guessing this is all being done with the firm belief that they are making things better.

  3. Rick Krantz

    I have a question: why can’t “probabilism” prevent what you call results that are BENT? Does this come up later in the book?

  4. Rick: Good question. Because I use “probabilism” in a very general way, I’m prepared to allow that it’s possible, but that no existing probabilist accounts do. With probabilism, one has evidence for a claim to the extent that it’s probable in some sense, or gets a probability boost, or is comparatively more likely than some other claim. The first two include Bayesian accounts of confirmation or statistics, generally interpreted in terms of degrees of belief, increased belief, ‘rational” belief or betting odds. To say how strongly believed a claim is, or ought to be, regarded is not to say how well tested it is. We (severe testers) are keen to have it make sense to say a claim is well supported, believable, plausible or even true while at the same time terribly tested by a particular method and data. We think this is common in day-to-day reasoning. We also want to distinguish how well tested one and the same claim is by different methods. The claim is just as believable in both cases but in one case its flaws might have been well ruled out, and in another, poorly probed. In formal contexts, the terribly tested and poorly probed ideas are cashed out in terms of error probabilities of methods.

    The third kind of probabilism is represented by comparative likelihood accounts. Like Bayesian accounts, they view the import of data as coming through likelihood ratios. To block BENT cases requires considering error probabilities of methods and these go beyond likelihoods. Excursion 1 Tour II discusses this in some detail, so maybe we should come back to this if your question remains at that point.

    You might make a note of the fact that “error probabilities” have been defined in a different, Bayesian, way (by J. Berger), so that when he, and others, speak of Bayesian error control they may not (and probably are not) speaking of the same thing. Instead they may be speaking of posterior probabilities. This comes up in Excursion 3 Tour II.

  5. All excerpts and mementos from the book are posted here:
    I will be adding to this over the course of rereading SIST.

  6. Yusuke Ono

    I have three suggestions for your description about Fisher’s tea experiment.

    (1) I think the experiment in Fisher(1935) is fictional, not “based on a true story”. See Kendall(1963), Biometrika, 50(1/2), p.5.
    (2) In Fisher(1935), hypergeometric distribution, not Bernoulli distribution, is used.
    (3) Even in Fisher-Box(1978), the result by Bristol-Roach isn’t shown. The result is reported in Salsburg(2002), The Lady Tasting Tea, but I think it’s doubtful.

Blog at