# Mayo: (section 6) “StatSci and PhilSci: part 2”

Here is section 6 of my paper: “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations”. Section 5 is in my last post.

6. Some Knock-Down Criticisms of Frequentist Error Statistics

With the error-statistical philosophy of inference under our belts, it is easy to run through the classic and allegedly damning criticisms of frequentist error-statistical methods. Open up Bayesian textbooks and you will find, endlessly reprised, the handful of ‘counterexamples’ and ‘paradoxes’ that make up the charges leveled against frequentist statistics, after which the Bayesian account is proffered as coming to the rescue. There is nothing about how frequentists have responded to these charges; nor evidence that frequentist theory endorses the applications or interpretations around which these ‘chestnuts’ revolve.

If frequentist and Bayesian philosophies are to find common ground, this should stop. The value of a generous interpretation of rival views should cut both ways. A key purpose of the forum out of which this paper arises is to encourage reciprocity.

6.1 Fallacies of Rejection

A frequentist error statistical account, based on the notion of severity, accords well with the idea of scientific inquiry as a series of small-scale inquiries into local experimental questions. Many fallacious uses of statistical methods result from supposing that the statistical inference licenses a jump to a substantive claim that is ‘on a different level’ from the one well probed. Given the familiar refrain that statistical significance is not substantive significance, it may seem surprising how often criticisms of significance tests depend on running the two together!

6.1.1 Statistical Significance is Not Substantive Significance: Different Levels

Consider one of the strongest types of examples that Bayesians adduce. In a coin-tossing experiment, for example, the result of n trials may occur in testing a null hypothesis that the results are merely due to chance. A statistically significant proportion of heads (greater than .5) may be taken as grounds for inferring a real effect. But could not the same set of outcomes also have resulted from testing a null hypothesis that denies ESP? And so, would not the same data set warrant inferring the existence of ESP? If in both cases the data are statistically significant to the same degree, the criticism goes, the error statistical tester is forced to infer that there is as good a warrant for inferring the existence of ESP as there is for merely inferring a non-chance effect. But this is false. Any subsequent question about the explanation of a non-chance effect, plausible or not, is at a different level from the space of hypotheses about the probability of heads in Bernoulli trials, and thus would demand a distinct analysis. The nature and threats of error in the hypothesis about Harry’s ESP differ from those in merely inferring a real effect. The first significance test did not discriminate between different explanations of the effect, even if the effect is real. The severity analysis makes this explicit.

6.1.2 Error-‘fixing’ Gambits in Model Validation

That a severity analysis always directs us to the relevant alternative (the denial of whatever is to be inferred) also points up fallacies that may occur in testing statistical assumptions. In a widely used test for independence in a linear regression model, a statistically significant difference from a null hypothesis that asserts the trials are independent may be taken as warranting one of many alternatives that could explain non-independence. For instance, the alternative H1 might assert that the errors are correlated with their past, expressed as a lag between trials. H1 now ‘fits’ the data all right, but since this is just one of many ways to account for the lack of independence, alternative H1 passes with low severity. This method has no chance of discerning other hypotheses that could also ‘explain’ the violation of independence. It is one thing to arrive at such an alternative based on the observed discrepancy with the requirement that it be subjected to further tests; it is another to say that this alternative is itself well tested, merely by dint of ‘correcting’ the misfit. It is noteworthy that Gelman’s Bayesian account advocates model checking. I am not familiar enough with its workings to say if it sufficiently highlights this distinction (Gelman 2011, this special topic of RMM; see also Mayo 2013).[i]

6.1.3 Significant Results with Overly Sensitive Tests: Large n Problem

A second familiar fallacy of rejection takes evidence of a statistically significant effect as evidence of a greater effect size than is warranted. It is known that with a large enough sample size any discrepancy from a null hypothesis will probably be detected. Some critics take this to show a rejection is no more informative than information on sample size (e.g., Kadane 2011, 438). Fortunately, it is easy to use the observed difference plus the sample size to distinguish discrepancies that are and are not warranted with severity. It is easy to illustrate by reference to our test T+. With statistically significant results, we evaluate inferences of the form:

μ > μ1 where μ1 = (μ0 + γ).

Throwing out a few numbers may give sufficient warning to those inclined to misinterpret statistically significant differences. Suppose test T+ has hypotheses

H0: μ ≤ 0 vs. H1: μ > 0.

Let σ = 1, n = 25, so σ_X̄ = (σ/√n) = .2.

In general:

SEV(μ > X̄ − δ_ε(σ/√n)) = 1 − ε.

Let X̄ = .4, so it is statistically significant at the .03 level. But look what happens to severity assessments attached to various hypotheses about discrepancies from 0:

SEV(μ > 0) = .97
SEV(μ > .2) = .84
SEV(μ > .3) = .7
SEV(μ > .4) = .5
SEV(μ > .5) = .3
SEV(μ > .6) = .16

The inference to μ > .4 is an especially useful benchmark.

So, clearly a statistically significant result cannot be taken as evidence for just any discrepancy in the alternative region. The severity becomes as low as .5 for an alternative equal to the observed sample mean, and any greater discrepancies are even more poorly tested! Thus, the severity assessment immediately scotches this well-worn fallacy. Keep in mind that the hypotheses entertained here are in the form, not of point values, but of discrepancies as large or larger than μ (for μ, greater than 0).
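The severity values listed above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the Normal model of test T+ with σ = 1, n = 25 and an observed mean of .4 as stated in the text; the function name `severity_greater` is a hypothetical label, not something from the paper. Note that the computed values are slightly finer than the rounded figures in the list (for instance, about .977 where the list reports .97).

```python
from statistics import NormalDist

# Values taken from the text: sigma = 1, n = 25, observed sample mean 0.4.
SIGMA = 1.0
N = 25
X_BAR = 0.4
SE = SIGMA / N ** 0.5  # standard error of the mean, 0.2


def severity_greater(mu1, x_bar=X_BAR, se=SE):
    """SEV(mu > mu1): probability of a sample mean no larger than the
    observed one, computed under mu = mu1 (hypothetical helper name)."""
    return NormalDist(mu1, se).cdf(x_bar)


for mu1 in (0.0, 0.2, 0.3, 0.4, 0.5, 0.6):
    print(f"SEV(mu > {mu1:.1f}) = {severity_greater(mu1):.3f}")
```

The benchmark is visible directly: when μ1 equals the observed mean, the severity is exactly one half, and it falls below one half for any larger discrepancy.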

Oddly, some Bayesian critics (e.g., Howson and Urbach 1993) declare that significance tests instruct us to regard a statistically significant result at a given level as more evidence against the null, the larger the sample size; they then turn around and blame the tests for yielding counterintuitive results! Others have followed suit, without acknowledging this correction from long ago (e.g., Sprenger 2012, this special topic of RMM). In fact, a result that is statistically significant at a given level with a larger sample size is indicative of less of a discrepancy from the null than if it resulted from a smaller sample size. The same point can equivalently be made for a fixed discrepancy from a null value μ0, still alluding to our one-sided test T+. Suppose μ1 = μ0 + γ. An α-significant difference with sample size n1 passes μ > μ1 less severely than with n2 where n2 > n1 (see Mayo 1981; 1996).
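The dependence on sample size can be checked numerically. The sketch below assumes the one-sided Normal test T+ with known σ and takes the observed mean to sit exactly at the α cutoff; the choices α = .025, σ = 1 and γ = .1 are illustrative assumptions, and `severity_at_cutoff` is a hypothetical helper name.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def severity_at_cutoff(n, gamma, sigma=1.0, alpha=0.025, mu0=0.0):
    """Severity for mu > mu0 + gamma when the sample mean is just
    alpha-significant with sample size n (illustrative assumption)."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    se = sigma / n ** 0.5
    x_bar = mu0 + z_alpha * se          # mean that just reaches significance
    return NormalDist(mu0 + gamma, se).cdf(x_bar)


for n in (25, 100, 400):
    print(f"n = {n:3d}: SEV(mu > 0.1) = {severity_at_cutoff(n, gamma=0.1):.3f}")
```

Under these assumptions the severity for the same fixed discrepancy drops as n grows, which is the sense in which a just-significant result from a larger sample indicates less of a discrepancy from the null.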

6.2 P-values Conflict with Posterior Probabilities: The Criticism in Statistics

Now we get to criticisms based on presupposing probabilism (in the form of Bayesian posterior probabilities). Assuming that significance tests really secretly aim to supply posterior probabilities to null hypotheses, the well-known fact that a frequentist p-value can differ from a Bayesian posterior in H0 is presumed to pose a problem for significance testers, if not prove their out and out “unsoundness” (Howson 1997a, b). This becomes the launching-off point for ‘conciliatory’ methods that escape the problem while inheriting an improved (Bayesian) foundation. What’s not to like?

Plenty, it turns out. Consider Jim Berger’s valiant attempt to get Fisher, Jeffreys, and Neyman to all agree on testing (Berger 2003). Taking a conflict between p-values and Bayesian posteriors as demonstrating the flaw with p-values, he offers a revision of tests thought to do a better job from both Bayesian and frequentist perspectives. He has us consider the two-sided version of our Normal distribution test H0: μ = μ0 vs. H1: μ ≠ μ0. (The difference between p-values and posteriors is far less marked with one-sided tests.) Referring to our example where the parameter measures mean pressure in the drill rig on that fateful day in April 2010, the alternative hypothesis asserts that there is some genuine discrepancy either positive or negative from some value μ0.

Berger warns that “at least 22%—and typically over 50%—of the corresponding null hypotheses will be true” if we assume that “half of the null hypotheses are initially true”, conditional on a 0.05 statistically significant d(x). Berger takes this to show that it is dangerous to “interpret the p-values as error probabilities”, but the meaning of ‘error probability’ has shifted. The danger follows only by assuming that the correct error probability is given by the proportion of true null hypotheses (in a chosen population of nulls), conditional on reaching an outcome significant at or near 0.05 (e.g., 22%, or over 50%). The discrepancy between p-values and posteriors increases with sample size. If n = 1000, a result statistically significant at the .05 level yields a posterior of .82 to the null hypothesis! (A statistically significant result has therefore increased the probability in the null!) But why should a frequentist use such a prior? Why should they prefer to report Berger’s ‘conditional error probabilities’ (of 22%, 50%, or 82%)?
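The .82 figure can be reproduced under one particular choice of prior. The text does not say how the remaining probability is spread over the alternative, so the following is only a sketch under an assumption: prior mass .5 on H0 and, under H1, a Normal prior for μ centred at μ0 with variance σ² (a setup of the kind used in Berger and Sellke 1987). With that assumption the Bayes factor in favour of the null for a standardized difference z is √(1 + n) · exp(−(z²/2) · n/(n + 1)). The function names are hypothetical.

```python
import math


def bayes_factor_null(z, n):
    """Bayes factor for H0 against H1, assuming mu ~ N(mu0, sigma^2)
    under H1 (an assumption made here, not stated in the text)."""
    return math.sqrt(1.0 + n) * math.exp(-0.5 * z * z * n / (n + 1.0))


def posterior_null(z, n, prior_null=0.5):
    """Posterior probability of H0 given the standardized difference z."""
    odds = bayes_factor_null(z, n) * prior_null / (1.0 - prior_null)
    return odds / (1.0 + odds)


# z = 1.96 corresponds to a two-sided p-value of about .05.
for n in (10, 50, 1000):
    print(f"n = {n:4d}: posterior of H0 = {posterior_null(1.96, n):.2f}")
```

Changing `prior_null` or the spread of the prior under H1 changes the posterior, so the conflict with the p-value is a consequence of the chosen prior rather than of the data alone.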

6.2.1 Fallaciously Derived Frequentist Priors

Berger’s first reply attempts to give the prior a frequentist flavor: It is assumed that there is random sampling from a population of hypotheses, 50% of which are assumed to be true. This serves as the prior probability for H0. We are then to imagine repeating the current significance test over all of the hypotheses in the pool we have chosen. Using a computer program, Berger describes simulating a long series of tests and records how often H0 is true given a small p-value.
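A simulation of this kind can be sketched as follows. It is not Berger's program: the pool composition, the distribution of effects among the false nulls (standardized effects drawn from a Normal with standard deviation 2), the band of p-values counted as "near .05", and the name `simulate_pool` are all assumptions made for illustration.

```python
import math
import random


def simulate_pool(prop_true_nulls, n_tests=200_000, alt_sd=2.0, seed=0):
    """Fraction of true nulls among tests whose two-sided p-value lands
    near .05, for a hypothetical pool of hypotheses."""
    rng = random.Random(seed)
    near_05 = 0
    true_and_near_05 = 0
    for _ in range(n_tests):
        null_true = rng.random() < prop_true_nulls
        # Standardized effect: zero under a true null, random otherwise.
        delta = 0.0 if null_true else rng.gauss(0.0, alt_sd)
        z = rng.gauss(delta, 1.0)
        p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
        if 0.04 <= p <= 0.06:
            near_05 += 1
            true_and_near_05 += null_true
    return true_and_near_05 / near_05


for prop in (0.5, 0.1):
    share = simulate_pool(prop)
    print(f"pool with {prop:.0%} true nulls: {share:.2f} true among p near .05")
```

The reported share moves substantially when the assumed proportion of true nulls in the pool changes, so the resulting 'conditional error probability' is a property of the pool one has chosen to imagine.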

What can it mean to ask how often H0 is true? It is generally agreed that it is either true or not true about this one universe. But, to quote C. S. Peirce, we are to imagine that “universes are as plentiful as blackberries”, and that we can randomly select one from a bag or urn. Then the posterior probability of H0 (conditional on the observed result) will tell us whether the original assessment is misleading. But which pool of hypotheses should we use? The ‘initially true’ percentages will vary considerably. Moreover, it is hard to see that we would ever know the proportion of true nulls rather than merely the proportion that thus far has not been rejected by other statistical tests! But the most serious flaw is this: even if we agreed that there was a 50% chance of randomly selecting a true null hypothesis from a given pool of nulls, .5 would still not give the error statistician a frequentist prior probability of the truth of this hypothesis. It would at most give the probability of the event of selecting a hypothesis with property ‘true’. (We are back to Carnap’s frequentist.) An event is not a statistical hypothesis that assigns probabilities to outcomes.

Nevertheless, this gambit is ubiquitous across the philosophy of statistics literature. It commits the same fallacious instantiation of probabilities:

50% of the null hypotheses in a given pool of nulls are true.
This particular null hypothesis H0 was randomly selected from this pool.
Therefore P(H0 is true) = .5.

I have called this the fallacy of probabilistic instantiation.

6.2.2 The Assumption of ‘Objective’ Bayesian Priors

When pressed, surprisingly, Berger readily disowns the idea of obtaining frequentist priors by sampling from urns of nulls (though he continues to repeat it). He mounts a second reply: error statisticians should use the ‘objective’ Bayesian prior of 0.5 to the null, the remaining 0.5 probability being spread out over the alternative parameter space. Many take this to be an ‘impartial’ or ‘uninformative’ Bayesian prior probability, as recommended by Jeffreys (1939). Far from impartial, the ‘spiked concentration of belief in the null’ gives high weight to the null and is starkly at odds with the role of null hypotheses in testing. Some claim that ‘all nulls are false’, the job being to unearth discrepancies from it.

It also leads to a conflict with Bayesian ‘credibility interval’ reasoning, since 0 is outside the corresponding interval (I come back to this). Far from considering the Bayesian posterior as satisfying its principles, the error-statistical tester would balk at the fact that use of the recommended priors can result in highly significant results often being construed as no evidence against the null—or even evidence for it!

The reason the Bayesian significance tester wishes to start with a fairly high prior to the null is that otherwise its rejection would be merely to claim that a fairly improbable hypothesis has become more improbable (Berger and Sellke 1987, 115). By contrast, it is informative for an error-statistical tester to reject a null hypothesis, even assuming it is not precisely true, because we can learn how false it is.

Other reference Bayesians seem to reject the ‘spiked’ prior that is at the heart of Berger’s recommended frequentist-Bayesian reconciliation, at least of Berger (2003). This includes Jose Bernardo, who began his contribution to our forum with a disavowal of just those reference priors that his fellow default Bayesians have advanced (2010). I continue to seek a clear epistemic warrant for the priors he does recommend. It will not do to bury the entire issue under a decision-theoretic framework that calls for its own epistemic justification. The default Bayesian position on tests seems to be in flux.

6.3 Severity Values Conflict with Posteriors: The Criticism in Philosophy

Philosophers of science have precisely analogous versions of this criticism: error probabilities (associated with inferences to hypotheses) are not posterior probabilities in hypotheses, so they cannot serve in an adequate account of inference. They are exported to launch the analogous indictment of the severity account (e.g., Howson 1997a, b; Achinstein 2001; 2010; 2011). However severely I might wish to say that a hypothesis H has passed a test, the Bayesian critic assigns a sufficiently low prior probability to H so as to yield a low posterior probability in H. But this is still no argument about why this counts in favor of, rather than against, their Bayesian computation as an appropriate assessment of the warrant to be accorded to hypothesis H. In every example, I argue, the case is rather the reverse. Here I want to identify the general flaw in their gambit.

To begin with, in order to use techniques for assigning frequentist probabilities to events, their examples invariably involve ‘hypotheses’ that consist of asserting that a sample possesses a characteristic, such as ‘having a disease’ or ‘being college ready’ or, for that matter, ‘being true’. This would not necessarily be problematic if it were not for the fact that their criticism requires shifting the probability to the particular sample selected—for example, Isaac is ready, or this null hypothesis is true. This was, recall, the fallacious probability assignment that we saw in Berger’s attempt in 6.2.1.

6.3.1 Achinstein’s Epistemic Probabilist

Achinstein (2010, 187) has most recently granted the fallacy . . . for frequentists:

My response to the probabilistic fallacy charge is to say that it would be true if the probabilities in question were construed as relative frequencies. However, [. . . ] I am concerned with epistemic probability.

He is prepared to grant the following instantiations:

p% of the hypotheses in a given pool of hypotheses are true (or a character holds for p%).
The particular hypothesis Hi was randomly selected from this pool.
Therefore, the objective epistemic probability P(Hi is true) = p.

Of course, epistemic probabilists are free to endorse this road to posteriors—this just being a matter of analytic definition. But the consequences speak loudly against the desirability of doing so.

An example Achinstein and I have debated (precisely analogous to several that are advanced by Howson, e.g., Howson 1997a, b) concerns a student, Isaac, who has taken a battery of tests and achieved very high scores, s, something given to be highly improbable for those who are not college ready. We can write the hypothesis:

H(I): Isaac is college ready.

And let the denial be H’:

H’(I): Isaac is not college ready (i.e., he is deficient).

The probability for such good results, given a student is college ready, is extremely high:

P(s│H(I)) is practically 1,

while very low assuming he is not college ready. In one computation, the probability that Isaac would get such high test results, given that he is not college ready, is .05:

P(s│H’(I)) = .05.

But imagine, continues our critic, that Isaac was randomly selected from the population of students in, let us say, Fewready Town—where college readiness is extremely rare, say one out of one thousand. The critic infers that the prior probability of Isaac’s college-readiness is therefore .001:

(*) P(H(I)) = .001.

If so, then the posterior probability that Isaac is college ready, given his high test results, would be very low:

P(H(I)│s) is very low,

even though the posterior probability has increased from the prior in (*).
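The critic's computation is a direct application of Bayes's theorem, and can be sketched with the numbers given above. The only assumption added here is that "practically 1" is treated as exactly 1 for P(s│H(I)); the function name `posterior_ready` is hypothetical.

```python
def posterior_ready(prior, p_s_given_ready=1.0, p_s_given_not_ready=0.05):
    """Posterior probability of readiness given high scores s,
    by Bayes's theorem (P(s | ready) = 1 is an approximation)."""
    numerator = prior * p_s_given_ready
    return numerator / (numerator + (1.0 - prior) * p_s_given_not_ready)


prior = 0.001  # the critic's prior (*) for a student from Fewready Town
post = posterior_ready(prior)
print(f"prior = {prior}, posterior = {post:.4f}")
```

With these inputs the posterior is just under .02: roughly twenty times the prior, yet still very low, which is exactly the combination the critic relies on.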

The fallacy here is that although the probability of a randomly selected student taken from high schoolers in Fewready Town is .001, it does not follow that Isaac, the one we happened to select, has a probability of .001 of being college ready (Mayo 1997; 2005, 117). That Achinstein’s epistemic probabilist denies this fallacy scarcely speaks in favor of that account.

The example considers only two outcomes: reaching the high scores s, or reaching lower scores, ~s. Clearly a lower grade ~s gives even less evidence of readiness; that is, P(H’(I)│~s) > P(H’(I)│s). Therefore, whether Isaac scored as high as s or lower, ~s, Achinstein’s epistemic probabilist is justified in having high belief that Isaac is not ready. Even if he claims he is merely blocking evidence for Isaac’s readiness, the analysis is open to problems: the probability of Achinstein finding evidence of Isaac’s readiness even if in fact he is ready (H(I) is true) is low if not zero. Other Bayesians might interpret things differently, noting that since the posterior for readiness has increased, the test scores provide at least some evidence for H(I)—but then the invocation of the example to demonstrate a conflict between a frequentist and Bayesian assessment would seem to largely evaporate.

To push the problem further, suppose that the epistemic probabilist receives a report that Isaac was in fact selected randomly, not from Fewready Town, but from a population where college readiness is common, Fewdeficient Town. The same scores s now warrant the assignment of a strong objective epistemic belief in Isaac’s readiness (i.e., H(I)). A high-school student from Fewready Town would need to have scored quite a bit higher on these same tests than a student selected from Fewdeficient Town for his scores to be considered evidence of his readiness. When we move from hypotheses like ‘Isaac is college ready’ to scientific generalizations, the difficulty for obtaining epistemic probabilities via his frequentist rule becomes overwhelming.
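How much the required scores depend on the town can be made concrete by asking how improbable the scores must be for an unready student before the posterior of readiness reaches one half. This is a sketch only: the text gives no readiness rate for Fewdeficient Town, so .999 is a hypothetical value, as is the one-half threshold, the treatment of P(s│H(I)) as 1, and the name `max_false_positive_rate`.

```python
def max_false_positive_rate(prior, target=0.5, p_s_given_ready=1.0):
    """Largest P(s | not ready) for which the posterior of readiness
    still reaches `target`, capped at 1 (hypothetical helper)."""
    bound = prior * p_s_given_ready * (1.0 - target) / (target * (1.0 - prior))
    return min(1.0, bound)


towns = {
    "Fewready Town": 0.001,      # readiness rate given in the text
    "Fewdeficient Town": 0.999,  # hypothetical rate, not given in the text
}
for town, prior in towns.items():
    print(f"{town}: P(s | not ready) must be at most "
          f"{max_false_positive_rate(prior):.6f}")
```

Under these assumptions a Fewready student needs scores that an unready student would attain only about one time in a thousand, whereas for a Fewdeficient student any scores at all suffice, even though the tests and the student's performance are the same.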

We need not preclude that H(I) has a legitimate frequentist prior; the frequentist probability that Isaac is college ready might refer to genetic and environmental factors that determine the chance of his deficiency—although I do not have a clue how one might compute it. The main thing is that this probability is not given by the probabilistic instantiation above.

These examples, repeatedly used in criticisms, invariably shift the meaning from one kind of experimental outcome—a randomly selected student has the property ‘college ready’—to another—a genetic and environmental ‘experiment’ concerning Isaac in which the outcomes are ready or not ready.

This also points out the flaw in trying to glean reasons for epistemic belief with just any conception of ‘low frequency of error’. If we declared each student from Fewready to be ‘unready’, we would rarely be wrong, but in each case the ‘test’ has failed to discriminate the particular student’s readiness from his unreadiness. Moreover, were we really interested in the probability that a student randomly selected from a town is college ready, and had the requisite probability model (e.g., Bernoulli), then there would be nothing to stop the frequentist error statistician from inferring the conditional probability. However, there seems to be nothing ‘Bayesian’ in this relative frequency calculation. Bayesians scarcely have a monopoly on the use of conditional probability!

6.4 Trivial Intervals and Allegations of Unsoundness

Perhaps the most famous, or infamous, criticism of all—based again on the insistence that frequentist error probabilities be interpreted as degrees of belief—concerns interval estimation methods. The allegation does not merely assert that probability should enter to provide posterior probabilities—the assumption I called probabilism. It assumes that the frequentist error statistician also shares this goal. Thus, whenever error probabilities, be they p-values or confidence levels, disagree with a favored Bayesian posterior, this is alleged to show that frequentist methods are unsound!

The ‘trivial interval’ example is developed by supplementing a special case of confidence interval estimation with additional, generally artificial, constraints so that it can happen that a particular 95% confidence interval is known to be correct—a trivial interval. If we know it is true, or so the criticism goes, then to report a .95 rather than a 100% confidence-level is inconsistent! Non-Bayesians, Bernardo warns, “should be subject to some re-education using well known, standard counter-examples such as the fact that conventional 0.95-confidence regions may actually consist of the whole real line” (2008, 453).

I discussed this years ago, using an example from Teddy Seidenfeld (Mayo 1981); Cox addressed it long before: “Viewed as a single statement [the trivial interval] is trivially true, but, on the other hand, viewed as a statement that all parameter values are consistent with the data at a particular level is a strong statement about the limitations of the data.” (Cox and Hinkley 1974, 226) With this reading, the criticism evaporates.

Nevertheless, it is still repeated as a knock-down criticism of frequentist confidence intervals. But the criticism assumes, invalidly, that an error probability is to be assigned as a degree of belief in the particular interval that results. In our construal, the trivial interval amounts to saying that no parameter values are ruled out with severity, scarcely a sign that confidence intervals are inconsistent. Even then, specific hypotheses within the interval would be associated with different severity values. Note: by the hypothesis within the confidence interval, I mean that for any parameter value in the interval μ1, there is an associated claim of the form μ ≤ μ1 or μ > μ1, and one can entertain the severity for each. Alternatively, in some contexts, it can happen that all parameter values are ruled out at a chosen level of severity.

Even though examples adduced to condemn confidence intervals are artificial, moving outside statistics, the situation in which none of the possible values for a parameter can be discriminated is fairly common in science. Then the ‘trivial interval’ is precisely what we would want to infer, at least viewing the goal as reporting what has passed at a given severity level. The famous red shift experiments on the General Theory of Relativity (GTR) for instance, were determined to be incapable of discriminating between different relativistic theories of gravity—an exceedingly informative result determined only decades after the 1919 experiments.

6.5 Getting Credit (or Blamed) for Something You Didn’t Do

Another famous criticism invariably taken as evidence of the frequentist’s need for re-education—and readily pulled from the bag of Bayesian jokes carried to Valencia—accuses the frequentist (error-statistical) account of licensing the following:

Oil Exec: Our inference to H: the pressure is at normal levels is highly reliable!
Senator: But you conceded that whenever you were faced with ambiguous readings, you continually lowered the pressure, and that the stringent ‘cement bond log’ test was entirely skipped.
Oil Exec: We omitted reliable checks on April 20, 2010, but usually we do a better job—I am giving the average!

He might give further details:

Oil Exec: We use a randomizer that most of the time directs us to run the gold-standard check on pressure. But, April 20 just happened to be one of those times we did the non-stringent test; but on average we do ok.

Overall, this ‘test’ rarely errs, but that is irrelevant to appraising the inference from the actual data on April 20, 2010. To report the average over tests whose outcomes, had they been performed, are unknown, violates the severity criterion. The data easily could have been generated when the pressure level was unacceptably high; reporting the average therefore misinterprets the actual data. The question is why anyone would saddle the frequentist with such shenanigans on averages. Lest anyone think I am inventing a criticism, here is the most famous statistical instantiation (Cox 1958).

6.6 Two Measuring Instruments with Different Precisions

A single observation X is to be made on a normally distributed random variable with unknown mean μ, but the measurement instrument is chosen by a coin flip: with heads we use instrument E’ with a known small variance, say 10⁻⁴, while with tails, we use E”, with a known large variance, say 10⁴. The full data indicate whether E’ or E” was performed, and the particular value observed, which we can write as x’ and x”, respectively.

In applying our test T+ to a null hypothesis, say, μ = 0, the ‘same’ value of  X would correspond to a much smaller p-value were it to have come from E’ than if it had come from E”. Denote the two p-values as p’ and p”, respectively. However, or so the criticism proceeds, the error statistician would report the average p-value:  .5(p’ + p”).

But this would give a misleading assessment of the precision and corresponding severity with either measurement! In fact, any time an experiment E is performed, the critic could insist we consider whether we could have done some other test, perhaps a highly imprecise test or a much more precise test or anything in between, and demand that we report whatever average properties they come up with. The error statistician can only shake her head in wonder that this gambit is at the heart of criticisms of frequentist tests. This makes no sense. Yet it is a staple of Bayesian textbooks, and a main reason given for why we must renounce frequentist methods.
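A short computation shows how far the averaged p-value is from either instrument's own p-value. The variances are those given in the text; the observed value x = .03 is a hypothetical choice for illustration, and the one-sided p-value for the null μ = 0 follows test T+. The name `p_value` is a hypothetical helper.

```python
from statistics import NormalDist


def p_value(x, sd, mu0=0.0):
    """One-sided p-value P(X >= x; mu = mu0) for a single Normal observation."""
    return 1.0 - NormalDist(mu0, sd).cdf(x)


x = 0.03                           # hypothetical observed value
p_precise = p_value(x, sd=1e-2)    # instrument E', variance 10^-4
p_imprecise = p_value(x, sd=1e2)   # instrument E'', variance 10^4
p_average = 0.5 * (p_precise + p_imprecise)

print(f"p'  (precise instrument)   = {p_precise:.5f}")
print(f"p'' (imprecise instrument) = {p_imprecise:.5f}")
print(f"average .5(p' + p'')       = {p_average:.5f}")
```

For this value the precise instrument gives a p-value near .001, the imprecise one a p-value near .5, and the average sits near .25, describing neither of the experiments that could actually have produced the reading.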

But what could lead the critic to suppose the error statistician must average over experiments not even performed? Here is the most generous construal I can think of. Perhaps the critic supposes what is actually a distortion of even the most radical behavioristic construal:

• If you consider outcomes that could have occurred in hypothetical repetitions of this experiment, you must also consider other experiments that were not (but could have been?) run in reasoning from the data observed, and report some kind of frequentist average.

So if you are not prepared to average over any of the imaginary tests the critic wishes to make up, then you cannot consider any data set other than the one observed. This, however, would entail no use of error probabilities. This alone should be a sign to the critic that he has misinterpreted the frequentist, but that is not what has happened.

Instead Bayesians argue that if one tries to block the critics’ insistence that I average the properties of imaginary experiments, then “unfortunately there is a catch” (Ghosh, Delampady and Samanta 2006, 38): I am forced to embrace the strong likelihood principle, which entails that frequentist sampling distributions are irrelevant to inference, once the data are obtained. This is a false dilemma: evaluating error probabilities must always be associated with the model of the experiment I have performed. Thus we conclude that “the ‘dilemma’ argument is therefore an illusion” (Cox and Mayo 2010). Nevertheless, the critics are right about one thing: if we were led to embrace the LP, all error-statistical principles would have to be renounced. If so, the very idea of reconciling Bayesian and error-statistical inference would appear misguided.

*To read previous sections, please see the RMM page, and scroll down to Mayo’s Sept 25 paper.
(All references can also be found in the link.)

[i] Goldstein (2006) alludes to such an example, but his students, who were supposed to give credence to support his construal, did not. He decided his students were at fault.