Marking one year since the appearance of my book, *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP), let’s continue to the second stop (1.2) of Excursion 1 Tour 1. It begins on p. 13 with a quote from statistician George Barnard. Assorted reflections will be given in the comments. Ask me any questions pertaining to the Tour.

- I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in ﬁelds of application such as medicine, psychology, sociology, economics, and so forth. (George Barnard 1985, p. 2)

While statistical science (as with other sciences) generally goes about its business without attending to its own foundations, implicit in every statistical methodology are core ideas that direct its principles, methods, and interpretations. I will call this its *statistical philosophy*. To tell what’s true about statistical inference, understanding the associated philosophy (or philosophies) is essential. Discussions of statistical foundations tend to focus on how to interpret probability, and much less on the overarching question of how probability ought to be used in inference. Assumptions about the latter lurk implicitly behind debates, but rarely get the limelight. If we put the spotlight on them, we see that there are two main philosophies about the roles of probability in statistical inference: We may dub them *performance* (in the long run) and *probabilism*.

The performance philosophy sees the key function of statistical method as controlling the relative frequency of erroneous inferences in the long run of applications. For example, a frequentist statistical test, in its naked form, can be seen as a rule: whenever your outcome exceeds some value (say, *X* > *x*\*), reject a hypothesis *H*_{0} and infer *H*_{1}. The value of the rule, according to its performance-oriented defenders, is that it can ensure that, regardless of which hypothesis is true, there is both a low probability of erroneously rejecting *H*_{0} (rejecting *H*_{0} when it is true) as well as erroneously accepting *H*_{0} (failing to reject *H*_{0} when it is false).
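To make the rule concrete, here is a minimal sketch in Python under a setup the text does not specify: a Normal sample mean with known standard deviation, a point null *H*_{0}: μ = 0, a point alternative *H*_{1}: μ = 1, and 25 observations. All of these numbers, and the function name `error_probabilities`, are hypothetical choices for illustration only. The sketch computes the two error probabilities of the rule "reject *H*_{0} whenever the sample mean exceeds the cutoff."

```python
from statistics import NormalDist

def error_probabilities(mu0, mu1, sigma, n, cutoff):
    """Error probabilities of the rule: reject H0 when the sample mean > cutoff.

    Returns (type_1, type_2), where
      type_1 = Pr(sample mean > cutoff; mu = mu0)   (erroneous rejection)
      type_2 = Pr(sample mean <= cutoff; mu = mu1)  (erroneous acceptance)
    """
    se = sigma / n ** 0.5  # standard error of the sample mean
    type_1 = 1 - NormalDist(mu0, se).cdf(cutoff)
    type_2 = NormalDist(mu1, se).cdf(cutoff)
    return type_1, type_2

# Hypothetical values, chosen only to illustrate the rule.
MU0, MU1, SIGMA, N = 0.0, 1.0, 1.0, 25
SE = SIGMA / N ** 0.5

# Pick the cutoff so that the type I error probability is 0.025.
CUTOFF = NormalDist(MU0, SE).inv_cdf(0.975)

alpha, beta = error_probabilities(MU0, MU1, SIGMA, N, CUTOFF)
print(f"cutoff = {CUTOFF:.3f}, type I = {alpha:.3f}, type II = {beta:.4f}")
```

With these assumed values the cutoff is roughly 0.39, the type I error probability is 0.025 by construction, and the type II error probability is roughly 0.001. Both are low, which is all the performance philosophy asks of the rule.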

The second philosophy, probabilism, views probability as a way to assign degrees of belief, support, or plausibility to hypotheses. Many keep to a comparative report, for example that *H*_{0} is more believable than is *H*_{1} given data *x*; others strive to say *H*_{0} is less believable given data *x* than before, and offer a quantitative report of the difference.

What happened to the goal of scrutinizing BENT science by the severity criterion? [See 1.1] Neither “probabilism” nor “performance” directly captures that demand. Taking these goals at face value, it’s easy to see why they come up short. Potti and Nevins’ strong belief in the reliability of their prediction model for cancer therapy scarcely made up for the shoddy testing. Neither is good long-run performance a sufficient condition. Most obviously, there may be no long-run repetitions, and our interest in science is often just the particular statistical inference before us. Crude long-run requirements may be met by silly methods. Most importantly, good performance alone fails to get at *why* methods work when they do; namely – I claim – to let us assess and control the stringency of tests. This is the key to answering a burning question that has caused major headaches in statistical foundations: why should a low relative frequency of error matter to the appraisal of the inference at hand? It is not probabilism or performance we seek to quantify, but *probativeness*.

I do not mean to disparage the long-run performance goal – there are plenty of tasks in inquiry where performance is absolutely key. Examples are screening in high-throughput data analysis, and methods for deciding which of tens of millions of collisions in high-energy physics to capture and analyze. New applications of machine learning may lead some to say that only low rates of prediction or classification errors matter. Even with prediction, “black-box” modeling, and non-probabilistic inquiries, there is concern with solving a problem. We want to know if a good job has been done in the case at hand.

**Severity (Strong): Argument from Coincidence**

The weakest version of the severity requirement (Section 1.1), in the sense of easiest to justify, is negative, warning us when BENT data are at hand, and a surprising amount of mileage may be had from that negative principle alone. It is when we recognize how poorly certain claims are warranted that we get ideas for improved inquiries. In fact, if you wish to stop at the negative requirement, you can still go pretty far along with me. I also advocate the positive counterpart:

*Severity (strong): We have evidence for a claim C just to the extent it survives a stringent scrutiny.* If *C* passes a test that was highly capable of finding flaws or discrepancies from *C*, and yet none or few are found, then the passing result, *x*, is evidence for *C*.

One way this can be achieved is by an *argument from coincidence*. The most vivid cases occur outside formal statistics.

Some of my strongest examples tend to revolve around my weight. Before leaving the USA for the UK, I record my weight on two scales at home, one digital, one not, and the big medical scale at my doctor’s office. Suppose they are well calibrated and nearly identical in their readings, and they also all pick up on the extra 3 pounds when I’m weighed carrying three copies of my 1-pound book, *Error and the Growth of Experimental Knowledge* (EGEK). Returning from the UK, to my astonishment, not one but all three scales show anywhere from a 4–5 pound gain. There’s no difference when I place the three books on the scales, so I must conclude, unfortunately, that I’ve gained around 4 pounds. Even for me, that’s a lot. I’ve surely falsified the supposition that I lost weight! From this informal example, we may make two rather obvious points that will serve for less obvious cases. First, there’s the idea I call lift-off.

*Lift-off: An overall inference can be more reliable and precise than its premises individually.*

Each scale, by itself, has some possibility of error, and limited precision. But the fact that all of them have me at an over 4-pound gain, while none show any difference in the weights of EGEK, pretty well seals it. Were one scale off balance, it would be discovered by another, and would show up in the weighing of books. They cannot all be systematically misleading just when it comes to objects of unknown weight, can they? Rejecting a conspiracy of the scales, I conclude I’ve gained weight, at least 4 pounds. We may call this an *argument from coincidence*, and by its means we can attain lift-off. Lift-off runs directly counter to a seemingly obvious claim of drag-down.

*Drag-down: An overall inference is only as reliable/precise as is its weakest premise.*

The drag-down assumption is common among empiricist philosophers: As they like to say, “It’s turtles all the way down.” Sometimes our inferences do stand as a kind of tower built on linked stones – if even one stone fails they all come tumbling down. Call that a *linked* argument.

Our most prized scientific inferences would be in a very bad way if piling on assumptions invariably led to weakened conclusions. Fortunately we also can build what may be called *convergent* arguments, where lift-off is attained. This seemingly banal point suffices to combat some of the most well entrenched skepticisms in philosophy of science. And statistics happens to be the science par excellence for demonstrating lift-off!

Now consider what justifies my weight conclusion, based, as we are supposing it is, on a strong argument from coincidence. No one would say: “I can be assured that by following such a procedure, in the long run I would rarely report weight gains erroneously, but I can tell nothing from these readings about my weight now.” To justify my conclusion by long-run performance would be absurd. Instead we say that the procedure had enormous capacity to reveal if any of the scales were wrong, and from this I argue about the source of the readings: *H*: I’ve gained weight. Simple as that. It would be a preposterous coincidence if none of the scales registered even slight weight shifts when weighing objects of known weight, and yet were systematically misleading when applied to my weight. You see where I’m going with this. This is the key – granted with a homely example – that can fill a very important gap in frequentist foundations: Just because an account is touted as having a long-run rationale, it does not mean it lacks a short-run rationale, or even one relevant for the particular case at hand. Nor is it merely the improbability of all the results were *H* false; it is rather like denying an evil demon has read my mind just in the cases where I do not know the weight of an object, and deliberately deceived me. The argument to “weight gain” is an example of an argument from coincidence to the absence of an error, what I call:

*Arguing from Error*: There is evidence an error is absent to the extent that a procedure with a very high capability of signaling the error, if and only if it is present, nevertheless detects no error.

I am using “signaling” and “detecting” synonymously: It is important to keep in mind that we don’t know if the test output is correct, only that it gives a signal or alert, like sounding a bell. Methods that enable strong arguments to the absence (or presence) of an error I call *strong error probes*. Our ability to develop strong arguments from coincidence, I will argue, is the basis for solving the “problem of induction.”

**Glaring Demonstrations of Deception**

Intelligence is indicated by a capacity for deliberate deviousness. Such deviousness becomes self-conscious in inquiry: An example is the use of a placebo to ﬁnd out what it would be like if the drug has no eﬀect. What impressed me the most in my ﬁrst statistics class was the demonstration of how apparently impressive results are readily produced when nothing’s going on, i.e., “by chance alone.” Once you see how it is done, and done easily, there is no going back. The toy hypotheses used in statistical testing are nearly always overly simple as scientific hypotheses. But when it comes to framing rather blatant deceptions, they are just the ticket!

When Fisher offered Muriel Bristol-Roach a cup of tea back in the 1920s, she refused it because he had put the milk in first. What difference could it make? Her husband and Fisher thought it would be fun to put her to the test (Fisher 1935a). Say she doesn’t claim to get it right all the time but does claim that she has some genuine discerning ability. Suppose Fisher subjects her to 16 trials and she gets 9 of them right. Should I be impressed or not? By a simple experiment of randomly assigning milk first/tea first Fisher sought to answer this stringently. But don’t be fooled: a great deal of work goes into controlling biases and confounders before the experimental design can work. The main point just now is this: so long as lacking ability is sufficiently like the canonical “coin tossing” (Bernoulli) model (with the probability of success at each trial of 0.5), we can learn from the test procedure. In the Bernoulli model, we record success or failure, assume a fixed probability of success θ on each trial, and that trials are independent. If the probability of getting as many or even more successes than she got, merely by guessing, is fairly high, there’s little indication of special tasting ability. The probability of at least 9 of 16 successes, even if θ = 0.5, is 0.4. To abbreviate, Pr(at least 9 of 16 successes; *H*_{0}: θ = 0.5) = 0.4. This is the *P*-value of the observed difference; an unimpressive 0.4. You’d expect as many or even more “successes” 40% of the time merely by guessing. It’s also the *significance level attained* by the result. (I often use *P*-value as it’s shorter.) Muriel Bristol-Roach pledges that if her performance may be regarded as scarcely better than guessing, then she hasn’t shown her ability. Typically, a small value such as 0.05, 0.025, or 0.01 is required.
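The 0.4 figure can be checked directly from the Bernoulli model just described. The following is a small sketch using only Python's standard library. The function names `binomial_tail` and `least_successes_for` are illustrative labels, not notation from the book.

```python
from math import comb

def binomial_tail(k, n, theta):
    """Pr(at least k successes in n independent trials; success probability theta)."""
    return sum(comb(n, j) * theta**j * (1 - theta)**(n - j)
               for j in range(k, n + 1))

def least_successes_for(alpha, n, theta=0.5):
    """Smallest number of successes whose P-value under theta is <= alpha."""
    for k in range(n + 1):
        if binomial_tail(k, n, theta) <= alpha:
            return k
    return None  # no outcome reaches the threshold

# P-value for 9 of 16 successes under H0: theta = 0.5
p_value = binomial_tail(9, 16, 0.5)
print(f"Pr(at least 9 of 16; theta = 0.5) = {p_value:.3f}")

# How many successes would be needed to reach the usual small thresholds?
for alpha in (0.05, 0.025, 0.01):
    print(alpha, least_successes_for(alpha, 16))
```

The computed tail probability is about 0.402, matching the 0.4 in the text. Under the same model, she would need at least 12 of 16 correct for the *P*-value to fall to 0.05 or below.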

Such artiﬁcial and simplistic statistical hypotheses play valuable roles at stages of inquiry where what is needed are blatant standards of “nothing’s going on.” There is no presumption of a metaphysical chance agency, just that there is expected variability – otherwise one test would suﬃce – and that probability models from games of chance can be used to distinguish genuine from spurious eﬀects. Although the goal of inquiry is to ﬁnd things out, the hypotheses erected to this end are generally approximations and may be deliberately false. To present statistical hypotheses as identical to substantive scientiﬁc claims is to mischaracterize them. We want to tell what’s true about statistical inference. Among the most notable of these truths is:

*P*-values can be readily invalidated due to how the data (or hypotheses!) are generated or selected for testing.

If you fool around with the results afterwards, reporting only successful guesses, your report will be invalid. You may claim it’s very difficult to get such an impressive result due to chance, when in fact it’s very easy to do so, with selective reporting. Another way to put this: your *computed P*-value is small, but the *actual P*-value is high! Concern with spurious ﬁndings, while an ancient problem, is considered sufficiently serious to have motivated the American Statistical Association to issue a guide on how not to interpret *P*-values (Wasserstein and Lazar 2016); hereafter, ASA 2016 Guide. It may seem that if a statistical account is free to ignore such fooling around then the problem disappears! It doesn’t.
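The gap between the computed and the actual *P*-value can be seen in a small simulation. This is only a sketch, and it assumes one particular form of selective reporting that the text does not spell out: a pure guesser (θ = 0.5) runs 20 sessions of 16 trials each and reports only the best session. The number of sessions, the 0.05 threshold, the seed, and all names in the code are hypothetical choices.

```python
import random
from math import comb

N_TRIALS = 16    # trials per session, as in the tea-tasting example
N_SESSIONS = 20  # hypothetical number of sessions run before reporting the best
ALPHA = 0.05     # threshold for calling a result "impressive"

def binomial_tail(k, n, theta=0.5):
    """Pr(at least k successes in n independent trials; success probability theta)."""
    return sum(comb(n, j) * theta**j * (1 - theta)**(n - j)
               for j in range(k, n + 1))

def best_session_p_value(rng):
    """Simulate N_SESSIONS of pure guessing; return the computed P-value of the best."""
    best = max(sum(rng.random() < 0.5 for _ in range(N_TRIALS))
               for _ in range(N_SESSIONS))
    return binomial_tail(best, N_TRIALS)

def actual_error_rate(reps=2000, seed=1):
    """Relative frequency with which the reported (best) session has P <= ALPHA."""
    rng = random.Random(seed)
    hits = sum(best_session_p_value(rng) <= ALPHA for _ in range(reps))
    return hits / reps

# 12 of 16 is the smallest count whose computed P-value is <= 0.05.
single = binomial_tail(12, N_TRIALS)
analytic = 1 - (1 - single) ** N_SESSIONS  # chance at least one session qualifies
simulated = actual_error_rate()

print(f"computed P-value for one session with 12+ successes: {single:.3f}")
print(f"chance the best of {N_SESSIONS} sessions looks that good: {analytic:.3f}")
print(f"simulated relative frequency: {simulated:.3f}")
```

Under these assumptions, a single session reaches 12 or more successes by guessing with probability of about 0.04, yet the best of 20 sessions does so about 54% of the time. The reported number is small only because the selection was ignored.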

Incidentally, Bristol-Roach got all the cases correct, and thereby taught her husband a lesson about putting her claims to the test.

**skips p. 18 on Peirce**

**Texas Marksman**

Take an even simpler and more blatant argument of deception. It is my favorite: the Texas Marksman. A Texan wants to demonstrate his shooting prowess. He shoots all his bullets any old way into the side of a barn and then paints a bull’s-eye in spots where the bullet holes are clustered. This fails utterly to severely test his marksmanship ability. When some visitors come to town and notice the incredible number of bull’s-eyes, they ask to meet this marksman and are introduced to a little kid. How’d you do so well, they ask? Easy, I just drew the bull’s-eye around the most tightly clustered shots. There is impressive “agreement” with shooting ability; he might even compute how improbably so many bull’s-eyes would occur by chance. Yet his ability to shoot was not tested in the least by this little exercise. There’s a real effect all right, but it’s not caused by his marksmanship! It serves as a potent analogy for a cluster of formal statistical fallacies from data-dependent findings of “exceptional” patterns.

The term “apophenia” refers to a tendency to zero in on an apparent regularity or cluster within a vast sea of data and claim a genuine regularity. One of our fundamental problems (and skills) is that we’re apopheniacs. Some investment funds, none that we actually know, are alleged to produce several portfolios by random selection of stocks and send out only the one that did best. Call it the Pickrite method. They want you to infer that it would be a preposterous coincidence to get so great a portfolio if the Pickrite method were like guessing. So their methods are genuinely wonderful, or so you are to infer. If this had been their only portfolio, the probability of doing so well by luck is low. But the probability of at least one of many portfolios doing so well (even if each is generated by chance) is high, if not guaranteed.
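The arithmetic behind the Pickrite worry is short enough to sketch. The code below assumes, purely for illustration, that each randomly generated portfolio has a 5% chance of doing "so well" by luck and that portfolios perform independently of one another. Neither number nor the independence assumption comes from the text, and `prob_at_least_one` is a hypothetical helper name.

```python
def prob_at_least_one(p_single, k):
    """Pr(at least one of k independent tries succeeds), each with chance p_single."""
    return 1 - (1 - p_single) ** k

P_LUCKY = 0.05  # hypothetical chance that one random portfolio looks impressive

for k in (1, 10, 20, 50, 100):
    print(f"{k:>3} portfolios: {prob_at_least_one(P_LUCKY, k):.3f}")
```

With one portfolio the chance is the nominal 0.05. With 20 it is already about 0.64, and with 100 it is about 0.99, which is the sense in which an impressive best portfolio is "high, if not guaranteed" even when every portfolio is generated by chance.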

Let’s review the rogues’ gallery of glaring arguments from deception. The lady tasting tea showed how a statistical model of “no effect” could be used to amplify our ordinary capacities to discern if something really unusual is going on. The *P*-value is the probability of at least as high a success rate as observed, assuming the test or null hypothesis that the probability of success is 0.5. Since as many or even more successes than she got is fairly frequent through guessing alone (the *P*-value is moderate), there’s poor evidence of a genuine ability. The Pickrite and Texas sharpshooter examples, while quasi-formal or informal, demonstrate how to invalidate reports of significant effects. They show how gambits of post-data adjustments or selection can render a method highly capable of spewing out impressive-looking fits even when it’s just random noise.

We appeal to the same statistical reasoning to show the problematic cases as to show genuine arguments from coincidence.

So am I proposing that a key role for statistical inference is to identify ways to spot egregious deceptions (BENT cases) and create strong arguments from coincidence? Yes, I am.

**Skips “Spurious P-values and Auditing” (p. 20) up to Souvenir A (p. 21)**

**Souvenir A: Postcard to Send**

The gift shop has a postcard listing the four slogans from the start of this Tour. Much of today’s handwringing about statistical inference is uniﬁed by a call to block these fallacies. In some realms, trafficking in too-easy claims for evidence, if not criminal oﬀenses, are “bad statistics”; in others, notably some social sciences, they are accepted cavalierly – much to the despair of panels on research integrity. We are more sophisticated than ever about the ways researchers can repress unwanted, and magnify wanted, results. Fraud-busting is everywhere, and the most important grain of truth is this: all the fraud-busting is based on error statistical reasoning (if only on the meta-level). The minimal requirement to avoid BENT isn’t met. It’s hard to see how one can grant the criticisms while denying the critical logic.

We should oust mechanical, recipe-like uses of statistical methods that have long been lampooned, and are doubtless made easier by Big Data mining. They should be supplemented with tools to report magnitudes of eﬀects that have and have not been warranted with severity. But simple significance tests have their uses, and shouldn’t be ousted simply because some people are liable to violate Fisher’s warning and report isolated results. They should be seen as a part of a conglomeration of error statistical tools for distinguishing genuine and spurious eﬀects. They oﬀer assets that are essential to our task: they have the means by which to register formally the fallacies in the postcard list. The failed statistical assumptions, the selection eﬀects from trying and trying again, all alter a test’s error-probing capacities. This sets oﬀ important alarm bells, and we want to hear them. Don’t throw out the error-control baby with the bad statistics bathwater.

The slogans about lying with statistics? View them, not as a litany of embarrassments, but as announcing what any responsible method must register, if not control or avoid. Criticisms of statistical tests, where valid, boil down to problems with the critical alert function. Far from the high capacity to warn, “Curb your enthusiasm!” as correct uses of tests do, there are practices that make sending out spurious enthusiasm as easy as pie. This is a failure for sure, but don’t trade them in for methods that cannot detect failure at all. If you’re shopping for a statistical account, or appraising a statistical reform, your number one question should be: does it embody trigger warnings of spurious eﬀects? Of bias? Of cherry picking and multiple tries? If the response is: “No problem; if you use our method, those practices require no change in statistical assessment!” all I can say is, if it sounds too good to be true, you might wish to hold oﬀ buying it.

**Skips remainder of section 1.2 (bottom of p. 22 to middle of p. 23).**

**NOTES:**

^{2} This is the traditional use of “bias” as a systematic error. Ioannidis (2005) alludes to biasing as behaviors that result in a reported significance level differing from the value it actually has or ought to have (e.g., post-data endpoints, selective reporting). I will call those biasing selection eﬀects.

**FOR ALL OF TOUR I: SIST Excursion 1 Tour I**

**THE FULL ITINERARY:** *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*: **SIST Itinerary**