An argument that assumes the very thing that was to have been argued for is guilty of *begging the question*; signing on to an argument whose conclusion you favor even though you cannot defend its premises is to argue *unsoundly*, and in bad faith. When a whirlpool of “reforms” subliminally alter the nature and goals of a method, falling into these sins can be quite inadvertent. Start with a simple point on defining the power of a statistical test.

**I. Redefine Power?**

Given that power is one of the most confused concepts from Neyman-Pearson (N-P) frequentist testing, it’s troubling that in “Redefine Statistical Significance”, power gets redefined too. “Power,” we’re told, is a Bayes Factor BF “obtained by defining *H*_{1} as putting ½ probability on μ = ± m for the value of m that gives 75% power for the test of size α = 0.05. This *H*_{1} represents an effect size typical of that which is implicitly assumed by researchers during experimental design.” (material under Figure 1).

The Bayes factor discussed is of *H*_{1} over *H*_{0}, in two-sided Normal testing of *H*_{0}: μ = 0 versus *H*_{1}: μ ≠ 0.

“The variance of the observations is known. Without loss of generality, we assume that the variance is 1, and the sample size is also 1.” (p. 2 supplementary)

“This is achieved by assuming that μ under the alternative hypothesis is equal to ± (z

_{0.025}+ z_{0.75}) = ± 2.63 [1.96. + .63]. That is, the alternative hypothesis places ½ its prior mass on 2.63 and ½ its mass on -2.63”. (p. 2 supplementary)

Putting to one side whether this is “without loss of generality”, the use of “power” is quite different from the correct definition. The power of a test T (with type I error probability α) to detect a discrepancy μ’ is the probability T generates an observed difference that is statistically significant at level α, assuming μ = μ’. The value z = 2.63 comes from the fact that the alternative against which this test has power .75 is the value .63 SE in excess of the cut-off for rejection. (Since an SE is 1, they add .63 to 1.96.) I don’t really see why it’s advantageous to ride roughshod on the definition of power, and it’s not the main point of this blogpost, but it’s worth noting if you’re to avoid sinking into the quicksand.

Let’s distinguish the appropriateness of the test for a Bayesian, from its appropriateness as a criticism of significance tests. The latter is my sole focus. The criticism is that, at least if we accept these Bayesian assignments of priors, the posterior probability on *H*_{0} will be larger than the p-value. So if you were to interpret a p-value as a posterior on* H*_{0 }(a fallacy) or if you felt intuitively that a .05 (2-sided) statistically significant result should correspond to something closer to a .05 posterior on* H*_{0}, you should instead use a p-value of .005–or so it is argued. I’m not sure of the posterior on *H*_{0, }but the BF is between around 14 and 26.[1] That is the argument. If you lower the required p-value, it won’t be so easy to get statistical significance, and irreplicable results won’t be as common. [2]

The alternative corresponding to the preferred p =.005 requirement

“corresponds to a classical, two-sided test of size α = 0.005. The alternative hypothesis for this Bayesian test places ½ mass at 2.81 and ½ mass at -2.81. The null hypothesis for this test is rejected if the Bayes factor exceeds 25.7. Note that this curve is nearly identical to the “power” curve if that curve had been defined using 80% power, rather than 75% power. The Power curve for 80% power would place ½ its mass at ±2.80”. (Supplementary, p. 2)

z = 2.8 comes from adding .84 SE to the cut-off: 1.96 SE +.84 SE = 2.8. This gets to the alternative vs which the α = 0.05 test has 80% power. (See my previous post on power.)

Is this a good form of inference from the Bayesian perspective? (Why are we comparing μ = 0 and μ = 2.8?). As is always the case with “p-values exaggerate” arguments, there’s the supposition that testing should be on a point null hypothesis, with a lump of prior probability given to *H*_{0} (or to a region around 0 so small that it’s indistinguishable from 0). I leave those concerns for Bayesians, and I’m curious to hear from you. More importantly, does it constitute a relevant and sound criticism of significance testing? Let’s be clear: a tester might well have her own reasons for preferring z = 2.8 rather than z = 1.96, but that’s not the question. The question is whether they’ve provided a good argument for the significance tester to do so?

**II. What might the significance tester say?**

For starters, when she sets .8 power to detect a discrepancy, she doesn’t “implicitly assume” it’s a plausible population discrepancy, but simply one she wants the test to detect by producing a statistically significant difference (with probability .8). And if the test does produce a difference that differs statistically significantly from *H*_{0}, she does not infer the alternative against which the test had high power, call it μ’. (The supposition that she does grows out of fallaciously transposing “the conditional” involved in power.) Such a rule of interpreting data would have a high error probability of erroneously inferring a discrepancy μ’ (here 2.8).

The significance tester merely seeks evidence of *some* (genuine) discrepancy from 0, and eschews a comparative inference such as the ratio of the probability of the data under the points 0 and 2.63 (or 2.8). I don’t say there’s no role for a comparative inference, nor preclude someone arguing it is comparing how well μ = 2.8 “explains” the data compared to μ = 0 (given the assumptions), but the form of inference is so different from significance testing, it’s hard to compare them. She definitely wouldn’t ignore all the points in between 0 and 2.8. A one-sided test is preferable (unless the direction of discrepancy is of no interest). While one or two-sided doesn’t make that much difference for a significance tester, it makes a big difference for the type of Bayesian analyses that is appealed to in the “p-values exaggerate” literature. That’s because a lump prior, often .5 (but here .9!), is placed on the point 0 null. Without the lump, the p-value tends to be close to the posterior probability for *H*_{0}, as Casella and Berger (1987a,b) show–even though p-values and posteriors are actually measuring very different things.

“In fact it is not the case that P-values are too small, but rather that Bayes point null posterior probabilities are much too big!….Our concern should not be to analyze these misspecified problems, but to educate the user so that the hypotheses are properly formulated,” (Casella and Berger 1987 b, p. 334, p. 335).

There is a long and old literature on all this (at least since Edwards, Lindman and Savage 1963–let me know if you’re aware of older sources).

Those who lodge the “p-values exaggerate” critique often say, we’re just showing what would happen even if we made the strongest case for the alternative. No they’re not. They wouldn’t be putting the lump prior on 0 were they concerned not to bias things in favor of the null, and they wouldn’t be looking to compare 0 with so far away an alternative as 2.8 either.

The only way a significance tester can appraise or calibrate a measure such as a BF (and these will differ depending on the alternative picked) is to view it as a statistic and consider the probability of an even larger BF under varying assumptions about the value of μ. This is an error probability associated with the method. Accounts that appraise inferences according to the error probability of the method used I call *error statistical* (which is less equivocal than frequentist or other terms.)

For example, rejecting *H*_{0} when z ≥ 1.96 (which is the .05 test, since they make it 2-sided), we said, had .8 power to detect μ = 2.8, but with the .005 test it has only 50% power to do so. If one insists on a fixed .005 cut-off, this is construed as no evidence against the null (or even evidence *for* it–for a Bayesian). The new test has only 30% probability of finding significance were the data generated by μ = 2.3. So the significance tester is rightly troubled by the raised type II error [3], although the members of an imaginary Toxic Co. (having the risks of their technology probed) might be happy as clams.[4]

Suppose we do attain statistical significance at the recommended .005 level, say z = 2.8. The BF advocate assures us we can infer μ = 2.8, which is now 25 times as likely as μ = 0, (if all the various Bayesian assignments hold). The trouble is, the significance tester doesn’t want to claim good evidence for μ = 2.8. The significance tester merely infers an indication of a discrepancy (an isolated low p-value doesn’t suffice, and the assumptions also must be checked). She’d never ignore all the points other than 0 and ± 2.8. Suppose we were testing μ ≤ 2.7 vs. μ > 2.7, and observed z = 2.8. What is the p-value associated with this observed difference? The answer is ~.46. (Her inferences are not in terms of points but of discrepancies from the null, but I’m trying to relate the criticism to significance tests. ) To obtain μ ≥ 2.7 using one-sided confidence intervals would require a confidence level of .46! An absurdly high error probability.

The one-sided lower .975 bound with z = 2.8 would only entitle inferring μ > .84 (2.8 – 1.96)–quite a bit smaller than inferring μ = 2.8. If confidence levels are altered as well (and I don’t see why they wouldn’t be), the one-sided lower .995 bound would only be μ > 0. Thus, while the lump prior on *H*_{0 }results in a bias in favor of a null–increasing the type II error probability– it’s of interest to note that achieving the recommended p-value licenses an inference much *larger* than what the significance tester would allow.

Note, their inferences remain comparative in the sense of “*H*_{1} over *H*_{0}” on a given measure, it doesn’t actually say there’s evidence against (or for) either (unless it goes on to compute a posterior, not just odds ratios or BFs), nor does it falsify either hypothesis. This just underscores the fact that the BF comparative inference is importantly different from significance tests which seek to falsify a null hypothesis, with a view toward learning if there are genuine discrepancies, and if so, their magnitude.

Significance tests do not assign probabilities to these parametric hypotheses, but even if one wanted to, the spiked priors needed for the criticism are questioned by Bayesians and frequentists alike. Casella and Berger (1987a) say that “concentrating mass on the point null hypothesis is biasing the prior in favor of *H*_{0} as much as possible” (p. 111) whether in one or two-sided tests. According to them “The testing of a point null hypothesis is one of the most misused statistical procedures.” (ibid., p. 106)

**III. Why significance testers should reject the “redefine statistical significance” argument:**

**(i)** If you endorse this particular Bayesian way of attaining the BF, fine, but then your argument *begs the central question* against the significance tester (or of the confidence interval estimator, for that matter). The significance tester is free to turn the situation around, as Fisher does, as refuting the assumptions:

Even if one were to imagine that *H*_{0} had an extremely high prior probability, says Fisher—never minding “what such a statement of probability a priori could possibly mean”(Fisher, 1973, p.42)—the resulting high posteriori probability to *H*_{0} , he thinks, would only show that “reluctance to accept a hypothesis strongly contradicted by a test of significance” (ibid., p. 44) … “…is not capable of finding expression in any calculation of probability a posteriori” (ibid., p. 43). Indeed, if one were to consider the claim about the priori probability to be itself a hypothesis, Fisher says, “it would be rejected at once by the observations at a level of significance almost as great [as reached by *H*_{0} ]. …Were such a conflict of evidence, as has here been imagined under discussion… in a scientific laboratory, it would, I suggest, be some prior assumption…that would certainly be impugned.” (p. 44)

**(ii)** Suppose, on the other hand, you don’t endorse these priors or the Bayesian computation on which the “redefine significance” argument turns. Since lowering the p-value cut-off doesn’t seem too harmful, you might tend to look the other way as to the argument on which it is based. Isn’t that OK? Not unless you’re prepared to have your students compute these BFs and/or posteriors in just the manner upon which the critique of significance tests rests. Will you say, “oh that was just for criticism, not for actual use”? Unless you’re prepared to defend the statistical analysis, you shouldn’t support it. Lowering the p-value that you require for evidence of a discrepancy, or getting more data (should you wish to do so) doesn’t require it.

Moreover, your student might point out that you still haven’t matched p-values and BFs (or posteriors on *H*_{0} ): They still differ, with the p-value being smaller. If you wanted to match the p-value and the posterior, you could do so very easily: use the frequency matching priors (which doesn’t use the spike). You could still lower the p-value to .005, and obtain a rejection region precisely identical to the Bayesian. *Why isn’t that a better solution than one based on a conflicting account of statistical inference?*

Of course, even that is to grant the problem as put before us by the Bayesian argument. If you’re following good error statistical practice you might instead shirk all cut-offs. You’d report attained p-values, and wouldn’t infer a genuine effect until you’ve satisfied Fisher’s requirements: (a) Replicate yourself, show you can bring about results that “rarely fail to give us a statistically significant result” (1947, p. 14) and that you’re getting better at understanding the causal phenomenon involved. (b) Check your assumptions: both the statistical model, the measurements, and the links between statistical measurements and research claims. (c) Make sure you adjust your error probabilities to take account of, or at least report, biasing selection effects (from cherry-picking, trying and trying again, multiple testing, flexible determinations, post-data subgroups)–according to the context. That’s what prespecified reports are to inform you of. The suggestion that these are somehow taken care of by adjusting the pool of hypotheses on which you base a prior will not do. (It’s their plausibility that often makes them so seductive, and anyway, the injury is to how well-tested claims are, not to their prior believability.) The appeal to diagnostic testing computations of “false positive rates” in this paper opens up a whole new urn of worms. Don’t get me started. (see related posts.)

A final word is from a guest post by Senn. Harold Jeffreys, he says, held that if you use the spike (which he introduced), you are to infer the hypothesis that achieves greater than .5 posterior probability.

Within the Bayesian framework, in abandoning smooth priors for lump priors, it is also necessary to change the probability standard. (In fact I speculate that the 1 in 20 standard seemed reasonable partly because of the smooth prior.) … A parsimony principle is used on the prior distribution. You can’t use it again on the posterior distribution. Once that is calculated, you should simply prefer the more probable model. The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives.

It is as if in giving recommendations in dosing children one abandoned a formula based on age and adopted one based on weight but insisted on using the same number of kg one had used for years.

Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than P-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities. (Senn)

Please share your views, and alert me to errors. I will likely update this. Stay tuned for asterisks.

12/17 * I’ve already corrected a few typos.

[1] I do not mean the “false positive rate” defined in terms of α and (1 – β)–a problematic animal I put to one side here (Mayo 2003). Richard Morey notes that using their prior odds of 1:10, even the recommended BF of 26 gives us an unimpressive posterior odds ratio of 2.6 (email correspondence).

[2] Note what I call the “fallacy of replication”. It’s said to be too easy to get low p-values, but at the same time it’s too hard to get low p-values in replication. Is it too easy or too hard? That just shows it’s not the p-value at fault but cherry-picking and other biasing selection effects. Replicating a p-value is hard–when you’ve cheated or been sloppy the first time.

[3] They suggest increasing the sample size to get the power where it was with rejection at z = 1.96, and, while this is possible in some cases, increasing the sample size changes what counts as one sample. As n increases the discrepancy indicated by any level of significance decreases.

[4] The severe tester would report attained levels and,in this case, would indicate the the discrepancies indicated and ruled out with reasonable severity. (Mayo and Spanos 2011). Keep in mind that statistical testing inferences are in the form of µ > µ’ =µ_{0 }+ δ, or µ ≤ µ’ =µ_{0 }+ δ or the like. They are *not* to point values. As for the imaginary Toxic Co., I’d put the existence of a risk of interest in the null hypothesis of a one-sided test.

**Related Posts**

10/26/17: Going round and round again: a roundtable on reproducibility & lowering p-values

10/18/17: Deconstructing “A World Beyond P-values”

1/19/17: The “P-values overstate the evidence against the null” fallacy

8/28/16 Tragicomedy hour: p-values vs posterior probabilities vs diagnostic error rates

12/20/15 Senn: Double Jeopardy: Judge Jeffreys Upholds the Law, sequel to the pathetic p-value.

2/1/14 Comedy hour at the Bayesian epistemology retreat: highly probable vs highly probed vs B-boosts

11/25/14: How likelihoodists exaggerate evidence from statistical tests

Elements of this post are from Mayo 2018.

**References**

Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., 3 … Johnson, V. (2017, July 22), “Redefine statistical significance“, *Nature Human Behavior*.

Berger, J. O. and Delampady, M. (1987). “Testing Precise Hypotheses” and “Rejoinder“, *Statistical Science* **2**(3), 317-335.

Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of *p *values and evidence,” (with discussion). *J. Amer. Statist. Assoc. ***82: **112–139.

Cassella G. and Berger, R. (1987a). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). *J. Amer. Statist. Assoc. ***82 **106–111, 123–139.

Cassella, G. and Berger, R. (1987b). “Comment on Testing Precise Hypotheses by J. O. Berger and M. Delampady”, *Statistical Science* **2**(3), 344–347.

Edwards, W., Lindman, H. and Savage, L. (1963). “Bayesian Statistical Inference for Psychological Research”, Psychological Review 70(3): 193-242.

Fisher, R. A. (1947). *The Design of Experiments *(4^{th} ed.). Edinburgh: Oliver and Boyd. (First published 1935).

Fisher, R. A. (1973). *Statistical Methods and Scientific Inference,* 3rd ed, New York: Hafner Press.

Ghosh, J. Delampady, M., and Samanta, T. (2006). *An Introduction to Bayesian Analysis: Theory and Methods*. New York: Springer.

Mayo, D. G. (2003). “Could Fisher, Jeffreys and Neyman have Agreed on Testing? Commentary on J. Berger’s Fisher Address,” *Statistical Science* 18: 19-24.

Mayo (2018), *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*. Cambridge (June 2018)

Mayo, D. G. and Spanos, A. (2011) “Error Statistics” in *Philosophy of Statistics , Handbook of Philosophy of Science* Volume 7 *Philosophy of Statistics*, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.

Filed under: Bayesian/frequentist, fallacy of rejection, P-values, reforming the reformers, spurious p values ]]>

*In preparation for a new post that takes up some of the recent battles on reforming or replacing p-values, I reblog an older post on power, one of the most misunderstood and abused notions in statistics. (I add a few “notes on howlers”.) The power of a test T in relation to a discrepancy from a test hypothesis H _{0} is the probability T will lead to rejecting H_{0} when that discrepancy is present. Power is sometimes misappropriated to mean something only distantly related to the probability a test leads to rejection; but I’m getting ahead of myself. This post is on a classic fallacy of rejection.*

A classic fallacy of rejection is taking a statistically significant result as evidence of a discrepancy from a test (or null) hypothesis larger than is warranted. Standard tests do have resources to combat this fallacy, but you won’t see them in textbook formulations. It’s not new statistical method, but new (and correct) interpretations of existing methods, that are needed. One can begin with a companion to the rule in this recent post:

(1) If POW(T+,µ’) is low, then the statistically significantxis agoodindication that µ > µ’.

To have the companion rule also in terms of power, let’s suppose that our result *is just statistically significant *at a level α. (As soon as the observed difference exceeds the cut-off the rule has to be modified).

Rule (1) was stated in relation to a statistically significant result ** x** (at level α) from a one-sided test T+ of the mean of a Normal distribution with

(2) If POW(T+,µ’) is high, then an α statistically significantxis agoodindication that µ < µ’.

(The higher the POW(T+,µ’) is, the better the indication that µ < µ’.)That is, if the test’s power to detect alternative µ’ is

high, then the just statistically significantis axgoodindication (or good evidence) that the discrepancy from null isnotas large as µ’ (i.e., there’s good evidence that µ < µ’).

An account of severe testing based on error statistics is always keen to indicate inferences that are not warranted by the data, as well as those that are. Not only might we wish to indicate which discrepancies are poorly warranted, we can give upper bounds to warranted discrepancies by using (2).

**POWER**: POW(T+,µ’)** = **POW(Test T+ rejects *H*_{0};µ’) = Pr(M > M*; µ’), where M is the sample mean and M* is the cut-off for rejection. (Since it’s continuous, it doesn’t matter if we write > or ≥.)[i]

**EXAMPLE**. Let σ = 10, *n* = 100, so (σ/√*n*) = 1. Test T+ rejects H_{0 }at the .025 level if M_{ } > 1.96(1).

Find the power against µ = 2.3. To find Pr(M > 1.96; 2.3), get the standard Normal z = (1.96 – 2.3)/1 = -.84. Find the area to the right of -.84 on the standard Normal curve. It is .8. So POW(T+,2.8) = .8.

For simplicity in what follows, let the cut-off, M*, be 2. Let the observed mean M_{0} just reach the cut-off 2.

The power against alternatives between the null and the cut-off M* will range from α to .5. Power exceeds .5 only once we consider alternatives greater than M*, for these yield negative z values. Power fact, POW(M* + 1(σ/√*n*)) = .84.

That is, adding one (σ/ √*n*) unit to the cut-off M* takes us to an alternative against which the test has power = .84. So, POW(T+, µ** _{ }**= 3) = .84. See this post.

By (2), the (just) significant result * x* is decent evidence that µ< 3, because if µ ≥ 3, we’d have observed a more statistically significant result, with probability .84. The upper .84 confidence limit is 3. The significant result is much better evidence that µ< 4, the upper .975 confidence limit is 4 (approx.), etc.

Reporting (2) is typically of importance in cases of highly sensitive tests, but I think it should always accompany a rejection to avoid making mountains out of molehills. (However, in my view, (2) should be custom-tailored to the outcome not the cut-off.) In the case of statistical *in*significance, (2) is essentially ordinary *power analysis.* (In that case, the interest may be to avoid making molehills out of mountains.) Power analysis, applied to insignificant results, is especially of interest with low-powered tests. For example, failing to find a statistically significant increase in some risk may at most rule out (substantively) large risk increases. It might not allow ruling out risks of concern. Naturally, what counts as a risk of concern is a context-dependent consideration, often stipulated in regulatory statutes.

NOTES ON HOWLERS: When researchers set a high power to detect µ’, it is not an indication they regard µ’ as plausible, likely, expected, probable or the like. Yet we often hear people say “if statistical testers set .8 power to detect µ = 2.3 (in test T+), they must regard µ = 2.3 as probable in some sense”. No, in no sense. Another thing you might hear is, “when *H*_{0}: µ ≤ _{ }0 is rejected (at the .025 level), it’s reasonable to infer µ > 2.3″, or “testers are comfortable inferring µ ≥ 2.3”. No, they are not comfortable, nor should you be. Such an inference would be wrong with probability ~.8. Given M = 2 (or 1.96), you need to subtract to get a lower confidence bound, if the confidence level is not to exceed .5 . For example, µ > .5 is a lower confidence bound at confidence level .93.

Rule (2) also provides a way to distinguish values *within* a 1-α confidence interval (instead of choosing a given confidence level and then reporting CIs in the dichotomous manner that is now typical).

At present, power analysis is only used to interpret negative results–and there it is often called “retrospective power”, which is a fine term, but it’s often defined as what I call shpower). Again, confidence bounds could be, but they are not now, used to this end [iii].

**Severity replaces M* in (2) with the actual result, be it significant or insignificant. **

Looking at power means looking at the best case (just reaching a significance level) or the worst case (just missing it). This is way too coarse; we need to *custom tailor* results using the observed data. That’s what severity does, but for this post, I wanted to just illuminate the logic.[ii]

*One more thing:*

**Applying (1) and (2) requires the error probabilities to be actual** (approximately correct): Strictly speaking, rules (1) and (2) have a conjunct in their antecedents [iv]: “given the test assumptions are sufficiently well met”. *If background knowledge leads you to deny (1) or (2), it indicates you’re denying the reported error probabilities are the actual ones.* There’s evidence the test fails an “audit”. That, at any rate, is what I would argue.

————

[i] To state power in terms of P-values: POW(µ’) = Pr(P < p*; µ’) where P < p* corresponds to rejecting the null hypothesis at the given level.

[ii] It must be kept in mind that statistical testing inferences are going to be in the form of µ > µ’ =µ_{0 }+ δ, or µ ≤ µ’ =µ_{0 }+ δ or the like. They are *not* to point values! (Not even to the point µ =M_{0}.) Take a look at the alternative *H*_{1}: µ > _{ }0. It is not a point value. Although we are going beyond inferring the existence of some discrepancy, we still retain inferences in the form of inequalities.

[iii] That is, upper confidence bounds are too readily viewed as “plausible” bounds, and as values for which the data provide positive evidence. In fact, as soon as you get to an upper bound at confidence levels of around .6, .7, .8, etc. you actually have evidence µ’ < CI-upper. See this post.

[iv] The “antecedent” of a conditional refers to the statement between the “if” and the “then”.

OTHER RELEVANT POSTS ON POWER

- (6/9) U-Phil: Is the Use of Power* Open to a Power Paradox?
- (3/12/14) Get empowered to detect power howlers
- 3/17/14 Stephen Senn: “Delta Force: To what Extent is clinical relevance relevant?”
- (3/19/14) Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
- 12/29/14 To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliac and McCloskey 3 years on)
- 01/03/15

Filed under: CIs and tests, Error Statistics, power ]]>

“The subjective Bayesian theory as developed, for example, by Savage … cannot solve the deceptively simple but actually intractable old evidence problem, whence as a foundation for a logic of confirmation at any rate, it must be accounted a failure.” (Howson, (2017), p. 674)

What? Did the “old evidence” problem cause Colin Howson to recently abdicate his decades long position as a leading subjective Bayesian? It seems to have. I was so surprised to come across this in a recent perusal of* Philosophy of Science* that I wrote to him to check if it is really true. (It is.) I thought perhaps it was a different Colin Howson, or the son of the one who co-wrote 3 editions of Howson and Urbach: *Scientific Reasoning: The Bayesian Approach* espousing hard-line subjectivism since 1989.[1] I am not sure which of the several paradigms of non-subjective or default Bayesianism Howson endorses (he’d argued for years, convincingly, against any one of them), nor how he handles various criticisms (Kass and Wasserman 1996), I put that aside. Nor have I worked through his, rather complex, paper to the extent necessary, yet. What about the “old evidence” problem, made famous by Clark Glymour 1980? What is it?

Consider Jay Kadane, a well-known subjective Bayesian statistician. According to Kadane, the probability statement: Pr(d(** X**) ≥ 1.96) = .025

“is a statement about d(

) before it is observed. After it is observed, the event {d(X) ≥ 1.96} either happened or did not happen and hence has probability either one or zero” (2011, p. 439).X

Knowing d_{0}= 1.96, (the specific value of the test statistic d(** X**)), Kadane is saying, there’s no more

*What’s the accepted subjective Bayesian solution to this? * (I’m really asking.) One attempt is to subtract out, or try to, the fact that ** x** is known, and envision being in a context prior to knowing

Any case where the data are known prior to constructing or selecting a hypothesis to accord with them, strictly speaking, would count as cases where data are known, or so it seems.** The most well known cases in philosophy allude to a known phenomenon, such as Mercury’s perihelion, as evidence for Einstein’s GTR. (The perihelion was long known as anomalous for Newton, yet GTR’s predicting it, without adjustments, is widely regarded as evidence for GTR.)[2] You can read some attempted treatments by philosophers in Howson’s paper; I discuss Garber’s attempt in Chapter 10, Mayo 1996 [EGEK], 10.2.[3] *I’d like to hear from readers, regardless of statistical persuasion, how it’s handled in practice *(or why it’s deemed unproblematic).

But wait, are we sure it isn’t also a problem for non-subjective or default Bayesians? In this paradigm (and there are several varieties), the prior probabilities in hypotheses are not taken to express degrees of belief but are given by various formal assignments, so as to have minimal impact on the posterior probability. Although the holy grail of finding “uninformative” default priors has been given up, default priors are at least supposed to ensure the data dominate in some sense.[4] A true blue subjective Bayesian like Kadane is unhappy with non-subjective priors. Rather than quantify prior beliefs, non-subjective priors are viewed as primitives or conventions or references for obtaining posterior probabilities. How are they to be interpreted? It’s not clear, but let’s put this aside to focus on the “old evidence” problem.

*OK, so how do subjective Bayesians get around the old evidence problem?*

*I thank Jay Kadane for noticing I used the inequality in my original post 11/27/17. I haven’t digested his general reaction yet, stay tuned.

**There’s a place where Glymour (or Glymour, Scheines, Spirtes, and Kelly 1987) slyly argues that, strictly speaking, the data are always known by the time you appraise some some model–or so I seem to recall. But I’d have to research that or ask him.

[1] I’ll have to add a footnote to my new book (*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*, CUP, 2018), as I allude to him as a subjectivist Bayesian philosopher throughout.

[2] I argue that the reason it was good evidence for GTR is precisely because it was long known, and yet all attempts to explain it were ad hoc, so that they failed to pass severe tests. The deflection effect, by contrast, was new and no one had discerned it before, let alone tried to explain it. Note that this is at odds with the idea that *novel* results count more for a theory H when H (temporally) *pre*dicts them, than when H accounts for known results. (Here the known perihelion of Mercury is thought to be better evidence for GTR than the novel deflection effect.) But the issue isn’t the novelty of the results, it’s how well-tested H is, or so I argue. (EGEK Chapter 8, p. 288).

[3] I don’t see that the newer attempts avoid the key problem in Garber’s. I’m not sure if Howson is rescinding the remark I quote from him in EGEK, p. 333. Here he was trying to solve it by subtracting the data out from what’s known.

[4] Some may want to use “informative” priors as well, but their meaning/rationale is unclear. Howson mentions Wes Salmon’s style of Bayesianism in this paper, but Salmon was a frequentist.

REFERENCES

-Glymour, C. (1980), *Theory and Evidence,* Princeton University Press. I’ve added a link to the relevant chapter, “Why I am Not a Bayesian” (from Fitelson resources). The relevant pages are 85-93.

–Howson, C (2017), “Putting on the Garber Style? Better Not”, *Philosophy of Science*, 84 (October 2017) pp. 659–676.

-Kass, R. and Wasserman, L. (1996), “The Selection of Prior Distributions By Formal Rules”, JASA 91: 1343-70.

*Further References to Solutions (to this or Related problems): *

-Garber, Daniel. 1983. “Old Evidence and Logical Omniscience in Bayesian Confirmation Theory.” In Minnesota Studies in the Philosophy of Science, ed. J. Earman, 99–131. Minneapolis: University of Minnesota Press.

-Hartmann, Stephan, and Branden Fitelson. 2015. “A New Garber-Style Solution to the Problem of Old Evidence.” Philosophy of Science 82 (4): 712–17. H

–Seidenfeld, T., Schervish, M., and Kadane, T. 2012. “What kind of uncertainty is that ?” *Journal of Philosophy*, (2012), pp 516-533.

Filed under: Bayesian priors, objective Bayesians, Statistics Tagged: old evidence problem ]]>

**Erich Lehmann was born 100 years ago today! (20 November 1917 – 12 September 2009).** Lehmann was Neyman’s first student at Berkeley (Ph.D 1942), and his framing of Neyman-Pearson (NP) methods has had an enormous influence on the way we typically view them.*

I got to know Erich in 1997, shortly after publication of EGEK (1996). One day, I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that). He began by telling me that he was sitting in a very large room at an ASA (American Statistical Association) meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, wood table sat just one book, all alone, shiny red.

He said ” I wonder if it might be of interest to me!” So he walked up to it…. It turned out to be my *Error and the Growth of Experimental Knowledge* (1996, Chicago), which he reviewed soon after[0]. (What are the chances?) Some related posts on Lehmann’s letter are here and here.

One of Lehmann’s more philosophical papers is Lehmann (1993), “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” Here are some excerpts (blue), and remarks (black)

…A distinction frequently made between the approaches of Fisher and Neyman-Pearson is that in the latter the test is carried out at a fixed level, whereas the principal outcome of the former is the statement of a p value that may or may not be followed by a pronouncement concerning significance of the result [p.1243].

The history of this distinction is curious. Throughout the 19th century, testing was carried out rather informally. It was roughly equivalent to calculating an (approximate) p value and rejecting the hypothesis if this value appeared to be sufficiently small. … Fisher, in his 1925 book and later, greatly reduced the needed tabulations by providing tables not of the distributions themselves but of selected quantiles. … These tables allow the calculation only of ranges for the p values; however, they are exactly suited for determining the critical values at which the statistic under consideration becomes significant at a given level. As Fisher wrote in explaining the use of his [chi square] table (1946, p. 80):

In preparing this table we have borne in mind that in practice we do not want to know the exact value of P for any observed [chi square], but, in the first place, whether or not the observed value is open to suspicion. If P is between .1 and .9, there is certainly no reason to suspect the hypothesis tested. If it is below .02, it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05 and consider that higher values of [chi square] indicate a real discrepancy.

Similarly, he also wrote (1935, p. 13) that “it is usual and convenient for experimenters to take 5 percent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard .. .” ….

Neyman and Pearson followed Fisher’s adoption of a fixed level. In fact, Pearson (1962, p. 395) acknowledged that they were influenced by “[Fisher’s] tables of 5 and 1% significance levels which lent themselves to the idea of choice, in advance of experiment, of the risk of the ‘first kind of error’ which the experimenter was prepared to take.” … [p. 1244]

It is interesting to note that unlike Fisher, Neyman and Pearson (1933a, p. 296) did not recommend a standard level but suggested that “how the balance [between the two kinds of error] should be struck must be left to the investigator.” … It is thus surprising that in SMSI Fisher (1973, p. 44-45) criticized the NP use of a fixed conventional level.” …

Responding to earlier versions of these and related objections by Fisher to the Neyman-Pearson formulation, Pearson (1955, p. 206) admitted that the terms “acceptance” and “rejection” were perhaps unfortunately chosen, but of his joint work with Neyman he said that “from the start we shared Professor Fisher’s view that in scientific inquiry, a statistical test is ‘a means of learning’ ” and “I would agree that some of our wording may have been chosen inadequately, but I do not think that our position in some respects was or is so very different from that which Professor Fisher himself has now reached.” [This is from Pearson’s portion of the”triad”: “Statistical Concepts in Their Relation to Reality’.]

[A] central consideration of the Neyman-Pearson theory is that one must specify not only the hypothesis H but also the alternatives against which it is to be tested. In terms of the alternatives, one can then define the type II error (false acceptance) and the power of the test (the rejection probability as a function of the alternative). This idea is now fairly generally accepted for its importance in assessing the chance of detecting an effect (i.e., a departure from H) when it exists, determining the sample size required to raise this chance to an acceptable level, and providing a criterion on which to base the choice of an appropriate test…[p.1244].

The big difference Lehmann sees between Fisher and NP is not one we’re most likely to hear about these days: conditioning vs maximizing power. [The paper puts to one side the difference between Fisher’s fiducial and Neyman’s confidence intervals] [i]. To illustrate the conditioning issue, Lehmann alludes to the famous example of flipping a coin to determine if an experiment will use a highly precise or imprecise weighing machine (Cox 1958). Discussions may be found on this blog. [Search under Birnbaum or the SLP.] In this paper, Lehmann makes it clear that NP theory is free to condition (to attain the relevant error probabilities) even if an unconditional test could well be chosen if the context called for maximizing power. I find it interesting that Lehmann’s statistical philosophy (not just in his writing, but in many conversations we’ve had) emphasizes the difference between scientific contexts– where the concern is finding things out in the case at hand–and contexts where long-run optimization is of prime importance, while his own work contributed so much to developing the latter. Back to his paper:

Fisher was of course aware of the importance of power, as is clear from the following remarks (1947, p. 24): “With respect to the refinements of technique, we have seen above that these contribute nothing to the validity of the experiment and of the test of significance by which we determine its result. They may, however, be important, and even essential, in permitting the phenomenon under test to manifest itself.”

…Under the heading “Multiplicity of Tests of the Same Hypothesis,” he devoted a section (sec. 61) to this topic. Here again, without using the term, he referred to alternatives when he wrote (Fisher 1947, p. 182) that “we may now observe that the same data may contradict the hypothesis in any of a number of different ways.” After illustrating how different tests would be appropriate for different alternatives, he continued (p. 185):

The notion that different tests of significance are appropriate to test different features of the same null hypothesis presents no difficulty to workers engaged in practical experimentation but has been the occasion of much theoretical discussion among statisticians….[The experimenter] is aware of what observational discrepancy it is which interests him, and which he thinks may be statistically significant, before he inquires what test of significance, if any, is available appropriate to his needs. He is, therefore, not usually concerned with the question: To what observational feature should a test of significance be applied?

The idea is “that there is no need for a theory of test choice,because an experienced experimenter knows what is the appropriate test”. [ibid, p. 1245]. This reminds me of Senn’s post on Fisher’s alternative to the alternative. Maybe “an experienced experimenter” knows the appropriate test, but this doesn’t lessen the importance of NP’s interest in seeking to identify a statistical rationale for different choices, made informally. Frankly, I can’t take entirely seriously any complaints Fisher levels against NP after 1935, and the break-up. Lehmann returns to the question “cut-offs or P-values?” at the end. [1247]

In “the reporting of the conclusions of the analysis. Should this consist merely of a statement of significance or nonsignificance at a given level, or should a p value be reported? …fortunately, this is a case where you can have your cake and eat it too. One should routinely report the p value and, where desired, combine this with a statement on significance at any stated level. …

Both Neyman-Pearson and Fisher would give at most lukewarm support to standard significance levels such as 5% or 1%. Fisher, although originally recommending the use of such levels, later strongly attacked any standard choice.[p. 1248] Neyman-Pearson, in their original formulation of 1933, recommended a balance between the two kinds of error (i.e., between level and power). …

…To summarize, p values, fixed-level significance statements,conditioning, and power considerations can be combined into a unified approach. When long-term power and conditioning are in conflict, specification of the appropriate frame of reference takes priority, because it determines the meaning of the probability statements. A fundamental gap in the theory is the lack of clear principles for selecting the appropriate framework. Additional work in this area will have to come to terms with the fact that the decision in any particular situation must be based not only on abstract principles but also on contextual aspects.

The full paper is here. You can find the references within. It’s too bad Lehmann didn’t try to work out this unification; I try my own hand at this in my forthcoming book, *Statistical Inference as Severe Testing How to Get Beyond the Statistics Wars* (CUP 2018).

**Happy Birthday Erich!**

*I blogged most of this material in 2015.

[0]That same year I remember having a last-minute phone call with Erich to ask how best to respond to a “funny Bayesian example” raised by Colin Howson. It is essentially the case of Mary’s positive result for a disease, where Mary is selected randomly from a population where the disease is very rare. See for example here. His recommendations were extremely illuminating, and with them he sent me a poem he’d written (which you can read in my published response here). Aside from being a leading statistician, Erich had a (serious) literary bent.

In 1998 Lehmann wrote in support of the book’s receiving the Lakatos Prize which it did. The picture on the right (of Erich and his wife Julie Schaffer) was taken by Aris Spanos in 2003.

[i] Lehmann notes [p. 1242] that “in certain important situations tests can be obtained by an approach also due to Fisher for which he used the term *fiducial*. Most comparisons of Fisher’s work on hypothesis testing with that of Neyman and Pearson … do not include a discussion of the fiducial argument, which most statisticians have found difficult to follow. Although Fisher himself viewed fiducial considerations to be a very important part of his statistical thinking, this topic can be split off from other aspects of his work, and here I shall consider neither the fiducial nor the Bayesian approach…”

Following Lehmann, I too, largely sidestepped Fisher’s fiducial episode in discussing N-P and Fisher in the past, but I’ve recently come to the view that this seriously shortchanges one’s understanding of NP methods, and the disagreements between N-P and Fisher. But I won’t get into that now.

**(Selected) Books by Lehmann)**

*Testing Statistical Hypotheses*, 1959*Basic Concepts of Probability and Statistics*, 1964, co-author J. L. Hodges*Elements of Finite Probability*, 1965, co-author J. L. Hodges- Lehmann, Erich L.; With the special assistance of H. J. M. D’Abrera (2006).
*Nonparametrics: Statistical methods based on ranks*(Reprinting of 1988 revision of 1975 Holden-Day ed.). New York: Springer. pp. xvi+463. ISBN 978-0-387-35212-1. MR 2279708. *Theory of Point Estimation*, 1983*Elements of Large-Sample Theory (1988)*. New York: Springer Verlag.

*Reminiscences of a Statistician*, 2007, ISBN 978-0-387-71596-4*Fisher, Neyman, and the Creation of Classical Statistics*, 2011, ISBN 978-1-4419-9499-8 [published posthumously]

Articles (3 of very many)

- Lehmann, E.L.; Scheffé, H. (1950). “Completeness, similar regions, and unbiased estimation. I.”.
*Sankhyā: the Indian Journal of Statistics***10**(4): 305–340. JSTOR 25048038. MR 39201. - Lehmann, E.L.; Scheffé, H. (1955). “Completeness, similar regions, and unbiased estimation. II”.
*Sankhyā: the Indian Journal of Statistics***15**(3): 219–236. JSTOR 25048243. MR 72410. - Lehmann, E. L. 1993. “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?”
*Journal of the American Statistical Association*88 (424): 1242–1249.

Well-Known Results

Filed under: Fisher, P-values, phil/history of stat ]]>

**MONTHLY MEMORY LANE: 3 years ago: November 2014. **I mark in **red** **3-4** posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently**[1], and in ****green**** 3- 4 others** of general relevance to philosophy of statistics (in months where I’ve blogged a lot)**[2]**.** **Posts that are part of a “unit” or a group count as one (11/1/14 & 11/09/14 and 11/15/14 & 11/25/14 are grouped). The comments are worth checking out.

**November 2014**

**11/01 Philosophy of Science Assoc. (PSA) symposium on Philosophy of Statistics in the Higgs Experiments “How Many Sigmas to Discovery?”****11/09 “Statistical Flukes, the Higgs Discovery, and 5 Sigma” at the PSA****11/11**The Amazing Randi’s Million Dollar Challenge**11/12 A biased report of the probability of a statistical fluke: Is it cheating?****11/15 Why the Law of Likelihood is bankrupt–as an account of evidence****11/18**Lucien Le Cam: “The Bayesians Hold the Magic”**11/20**Erich Lehmann: Statistician and Poet**11/22**Msc Kvetch: “You are a Medical Statistic”, or “How Medical Care Is Being Corrupted”**11/25****How likelihoodists exaggerate evidence from statistical tests****11/30**3 YEARS AGO: MONTHLY (Nov.) MEMORY LANE

**[1]** Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

**[2]** New Rule, July 30,2016, March 30,2017 -a very convenient way to allow data-dependent choices (note why it’s legit in selecting blog posts, on severity grounds).

Filed under: 3-year memory lane ]]>

…. Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. Ms. Schnall wanted to discover whether prompting—or priming, in psych parlance—people with the concept of cleanliness would make them less judgmental.

My gripe has long been that the focus of replication research is with the “pure” statistics– do we get a low p-value or not?–largely sidestepping the often tenuous statistical-substantive links and measurement questions, as in this case [1]. When the study failed to replicate there was a lot of heated debate about the “fidelity” of the replication.[2] Yoav Benjamini cuts to the chase by showing the devastating selection effects involved (on slide #18):

For the severe tester, the onus is on researchers to show explicitly that they could put to rest expected challenges about selection effects. Failure to do so suffices to render the finding poorly or weakly tested (i.e., it passes with low severity) overriding the initial claim to have a non-spurious effect.

“Every research proposal and paper should have a replicabiity-check component” Benjamini recommends.[3] Here are the full slides for your Saturday night perusal.

[1] I’ve yet to hear anyone explain why unscrambling soap-related words should be a good proxy for “situated cognition” of cleanliness. But even without answering that thorny issue, identifying the biasing selection effects that were not taken account of vitiates the nominal low p-value. It is *easy* rather than difficult to find at least one such computed low p-value by selection alone. I advocate going further, where possible, and falsifying the claim that the statistical correlation is a good test of the hypothesis.

[2] For some related posts, with links to other blogs, check out:

Some ironies in the replication crisis in social psychology.

[3] The problem underscores the need for a statistical account to be *directly* altered by biasing selection effects, as is the p-value.

Filed under: Error Statistics, P-values, replication research, selection effects ]]>

There will be a roundtable on reproducibility Friday, October 27th (noon Eastern time), hosted by the International Methods Colloquium, on the reproducibility crisis in social sciences motivated by the paper, “Redefine statistical significance.” Recall, that was the paper written by a megateam of researchers as part of the movement to require p ≤ .005, based on appraising significance tests by a Bayes Factor analysis, with prior probabilities on a point null and a given alternative. It seems to me that if you’re prepared to scrutinize your frequentist (error statistical) method on grounds of Bayes Factors, then you must endorse using Bayes Factors (BFs) for inference to begin with. If you don’t endorse BFs–and, in particular, the BF required to get the disagreement with p-values–*, then it doesn’t make sense to appraise your non-Bayesian method on grounds of agreeing or disagreeing with BFs. For suppose you assess the recommended BFs from the perspective of an error statistical account–that is, one that checks how frequently the method would uncover or avoid the relevant mistaken inference.[i] Then, if you reach the stipulated BF level against a null hypothesis, you will find the situation is reversed, and the recommended BF exaggerates the evidence! (In particular, with high probability, it gives an alternative H’ fairly high posterior probability, or comparatively higher probability, even though H’ is false.) Failing to reach the BF cut-off, by contrast, can find no evidence against, and even finds evidence for, a null hypothesis with high probability, even when non-trivial discrepancies exist. They’re measuring very different things, and it’s illicit to expect an agreement on numbers.[ii] We’ve discussed this quite a lot on this blog (2 are linked below [iii]).

If the given list of panelists is correct, it looks to be 4 against 1, but I’ve no doubt that Lakens can handle it.

- Daniel Benjamin, Associate Research Professor of Economics at the University of Southern California and a primary co-author of “Redefine Statistical Significance”
- Daniel Lakens, Assistant Professor in Applied Cognitive Psychology at Eindhoven University of Technology and a primary co-author of a response to ‘Redefine statistical significance’ (under review).
- Blake McShane, Associate Professor of Marketing at Northwestern University and a co-author of the recent paper “Abandon Statistical Significance”.
- Jennifer Tackett, Associate Professor of Psychology at Northwestern University and a co-author of the recent paper “Abandon Statistical Significance”.
- E.J. Wagenmakers, Professor at the Methodology Unit of the Department of Psychology at the University of Amsterdam, a co-author of the paper “Redefine Statistical Significance”

To tune in to the presentation and participate in the discussion after the talk, visit this site on the day of the talk. To register for the talk in advance, click here.

The paradox for those wishing to abandon significance tests on grounds that there’s “a replication crisis”–and I’m not alleging everyone under the “lower your p-value” umbrella are advancing this–is that lack of replication is effectively uncovered *thanks to statistical significance tests*. They are also the basis for fraud-busting, and adjustments for multiple testing and selection effects. Unlike Bayes Factors, they:

- are directly affected by cherry-picking, data dredging and other biasing selection effects
- are able to test statistical model assumptions, and may have their own assumptions vouchsafed by appropriate experimental design
- block inferring a genuine effect when a method has low capability of having found it spurious.

In my view, the result of a significance test should be interpreted in terms of the discrepancies that are well or poorly indicated by the result. So we’d avoid the concern that leads some to recommend a .005 cut-off to begin with. But if this does become the standard for testing the existence of risks, I’d make “there’s an increased risk of at least r” the test hypothesis in a one-sided test, as Neyman recommends. Don’t give a gift to the risk producers. In the most problematic areas of social science, the real problems are (a) the questionable relevance of the “treatment” and “outcome” to what is purported to be measured, (b) cherry-picking, data-dependent endpoints, and a host of biasing selection effects, and (c) violated model assumptions. Lowering a p-value will do nothing to help with these problems; forgoing statistical tests of significance will do a lot to make them worse.

*Added Oct 27. This is worth noting because in other Bayesian assessment, indeed, in assessments deemed more sensible and less biased in favor of the null hypothesis–the p-value scarcely differs from the posterior on Ho. This is discussed, for example, in Casella and R. Berger 1987. See links in [iii]. The two are reconciled with 1-sided tests, and insofar as the typical study states a predicted direction, that’s what they should be doing.

[i] Both “frequentist” and “sampling theory” are unhelpful names. Since the key feature is basing inference on error probabilities of methods, I abbreviate by *error statistics*. The error probabilities are based on the sampling distribution of the appropriate test statistic. A proper subset of error statistical contexts are those that utilize error probabilities to assess and control the *severity* by which a particular claim is tested.

[ii] See#4 of my recent talk on statistical skepticism “7 challenges and how to respond to them”

[iii] Two related posts: p-values overstate the evidence against the null fallacy

How likelihoodists exaggerate evidence from statistical tests (search the blog for others)

Filed under: Announcement, P-values, reforming the reformers, selection effects ]]>

I was asked to write something explaining the background of my slides (posted here) in relation to the recent ASA “A World Beyond P-values” conference. I took advantage of some long flight delays on my return to jot down some thoughts:

The contrast between the closing session of the conference “A World Beyond P-values,” and the gist of the conference itself, shines a light on a pervasive tension within the “Beyond P-Values” movement. Two very different debates are taking place. First there’s the debate about how to promote better science. This includes welcome reminders of the timeless demands of rigor and integrity required to avoid deceiving ourselves and others–especially crucial in today’s world of high-powered searches and Big Data. That’s what the closing session was about. [1]

The second debate is a contemporary version of long-standing, but unresolved, disagreements on statistical philosophy: degrees of belief vs. error control, frequentist vs. Bayesian, testing vs. confidence intervals, etc. That’s what most of the conference revolved around. The opening talk by Steve Goodman was “Why is Eliminating P-values so Hard?”. Admittedly there were concurrent sessions, so my view is selective. True, bad statistics–perverse incentives, abuses of significance tests, publication biases and the resulting irreplicability–have given a new raison d’etre for (re)fighting the old, and tackling newer, statistics wars. And, just to be clear, let me say that I think these battles *should* be reexamined, but taking into account the more sophisticated variants of the methods, on all sides. Yet the conference, by and large, presumed the main war was already over, and the losers were tests of the statistical significance of differences–not merely abuses of the tests, the entire statistical method! [2]

Under the revolutionary rubric of “The Radical Prescription for Change”, we heard, in the final session, eminently sensible recommendations for doing good science–the first interpretation in my deconstruction. Marcia McNutt provided a terrific overview of what *Science*, *Nature*, and key agencies are doing to uplift scientific rigor and sound research. She listed statistical issues: file drawer problems, p-hacking, poor experimental design, model misspecification; and empirical ones: unidentified variables, outliers and data gaps, problems with data smoothing, and so on. In an attempt at “raising the bar”, she tells us, 80 editors agreed on the importance of preregistration, randomization and blindness. Excellent! Gelman recommended that p-values be just one piece of information rather than a rigid rod by which, once jumped over, publication ensues. Decisions should be holistic and take into account background information and questions of measurement. The ways statisticians can help scientists, Gelman proposed, is (1) by changing incentives so that it’s harder to cheat and (2) helping them determine the frequency properties of their tools (e.g., their abilities to reveal or avoid magnitude and sign errors). Meng, in his witty and sagacious manner, suggested punishing researchers by docking their salary if they’re wrong–using some multiple of their p-values. The one I like best is his recommendation that researchers ask themselves whether they’d ever dream of using the results of their work on themselves or a loved one. I totally agree!

Thus, in the interpretation represented by the closing session, “A World Beyond P-values” refers to a world beyond cookbook, and other modes of, bad statistics. A second reading, however, has it refer to statistical inference where significance tests, if permitted at all, are to be compelled to wear badges of shame–use them at your peril. Never mind that these are the very tools relied upon to reveal lack of replication, to show adulteration by cherry-picking and other biasing selection effects, and to test assumptions. From that vantage point, it made sense that participants began by offering up alternative or modified statistical tools–and there were many. Why fight the battle–*engage* the arguments–if the enemy is already down? Using the suffix “cide”, (killer), we might call it statistical testicide.

I’m certainly not defending the crude uses of tests long lampooned. Even when used correctly, they’re just a part of what I call error statistics: tools that employ sampling distributions to assess and control the capabilities of methods to avoid erroneous interpretations of data (error probabilities).[3] My own work in philosophy of statistics has been to reformulate statistical tests to avoid fallacies and arrive at an evidential interpretation of error probabilities in scientific contexts (to assess and control well-testedness).

Given my sense of the state of play, I decided that the best way to tackle the question of “What are the Best Uses For P-Values?”–the charge for our session–was to supply the key existing responses to criticisms of significance tests. Typically hidden from view (at least in these circles), these should now serve as handy retorts for the significance test user. *The starting place for future significance test challengers should no longer be to just rehearse the criticisms, but to grapple with these responses and the arguments behind them.*[4]

So to the question on my first slide: What contexts ever warrant the use of statistical tests of significance? The answer is: Precisely those you’d find yourself in if you’re struggling to get to a “World Beyond P-values” in the first sense–namely, battling bad statistical science.

___

[1] Andrew Gelman, Columbia University; Marcia McNutt, National Academy of Sciences; Xiao-Li Meng, Harvard University.

[2] Please correct me with info from other sessions. I’m guessing one of the policy-oriented session might have differed. Naturally, I’m excluding ours.

[3] A proper subset of error statistics uses these capabilities to assess how severely claims have passed.

[4] Please search this blog for details behind each, e.g., likelihood principle, p-values exaggerate, error probabilities, power, law of likelihood, p-value madness, etc.

**Some related blogposts:**

The ASA Document on P-values: One Year On

Statistical Reforms Without Philosophy Are Blind

Saturday Night Brainstorming and Task Forces (spoof)

On the Current State of Play in the Crisis of Replication in Psychology: Some Heresies

Filed under: P-values, Philosophy of Statistics, reforming the reformers ]]>

**7 QUESTIONS**

*Why use a tool that infers***from a single (arbitrary) P-value that pertains to a***statistical*hypothesis*H*_{0}*to a research claim**H***?***Why use an incompatible hybrid (of Fisher and N-P)?**- Why apply a method
**that uses error probabilities, the sampling distribution,**researcher “**intentions**” and**violates the likelihood principle**(LP)? You should**condition**on the data. **Why use methods that***overstate evidence*against a null hypothesis?**Why do you use a method that****presupposes the underlying statistical model?****Why use a measure that doesn’t report effect sizes?****Why do you use a method that doesn’t provide posterior probabilities (in hypotheses)?**

** **

Filed under: P-values, spurious p values, statistical tests, Statistics ]]>

Next week I’ll be in a session that I think is intended to explain what’s right about P-values at an ASA Symposium on Statistical Inference : “A World Beyond p < .05”.

Our session, “What are the best uses for P-values?“, I take it, will discuss the value of frequentist (error statistical) testing more generally, as is appropriate. (Aside from me, it includes Yoav Benjamini and David Robinson.)

One of the more baffling things about today’s statistics wars is their tendency to readily undermine themselves. One example is how they lead to relinquishing the strongest criticisms against findings that have been based on fishing expeditions. In the interest of promoting a statistical account that downplays error probabilities (Bayes Factors), it becomes difficult to lambast a researcher for engaging in practices on grounds that they violate error probabilities (cherry-picking, multiple testing, trying and trying again, post-data selection effects). You can’t condemn a researcher for engaging in practices on grounds that they violate error probabilities, if you reject error probabilities. The direct criticism has been lost. The criticism is redirected to finding the cherry picked hypothesis improbable. But now the cherry picker is free to discount the criticism as something a Bayesian can *always* do to counter a statistically significant finding–and they do this very effectively![1] Moreover, cherry picked hypotheses are often believable, that’s what makes things like post-data subgroups so seductive. Finally, in an adequate account, the improbability of a claim must be distinguished from its having been poorly tested. (You need to be able to say things like, “it’s plausible, but that’s a lousy test of it.”)

Now I realize error probabilities are often criticized as relevant only for long-run error control, but that’s a mistake. Ask yourself: what bothers you when cherry pickers selectively report favorable findings, and then claim to have good evidence of an effect? You’re not concerned that making a habit out of this would yield poor long-run performance–even though it would. What bothers you, and rightly so, is they haven’t done a good job in ruling out spurious findings in the case at hand. You need a principle to explain this epistemological standpoint–something frequentists have only hinted at. To state it informally:

I haven’t been given evidence for a claim C by dint of a method that had little if any capability to reveal specific flaws in C, even if they are present.

It’s a minimal requirement for evidence; I call it the severity requirement, but it doesn’t matter what its called. The stronger form says:

Data provide evidence for C only to the extent C has been subjected to and passes a reasonably severe test–one that probably would have found flaws in C, if present.

To many, “a world beyond p <.05” suggests a world without error statistical testing. So if you’re to use such a test in the best way, you’ll need to say why. *Why use a statistical test of significance for your context and problem rather than some other tool?* What type of context would *ever* make statistical (significance) testing just the ticket? Since the right use of these methods cannot be divorced from responding to expected critical challenges, I’ve decided to focus my remarks on giving those responses. However, I’m still working on it. Feel free to share thoughts.

Anyone who follows this blog knows I’ve been speaking about these things for donkey’s years–search terms of interest on this blog. I’ve also just completed a book, *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (CUP 2018).

[1] The inference to denying the weight of the finding lacks severity; it can readily be launched by simply giving high enough prior weight to the null hypothesis.

Mayo, D. 2018, *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (Cambridge, 2018)

Filed under: Announcement, Bayesian/frequentist, P-values ]]>

**With continued acknowledgement of Barnard’s birthday on Friday, Sept.23, I reblog an exchange on catchall probabilities from the “The Savage Forum” (pp 79-84 Savage, 1962) with some new remarks.[i] **

**BARNARD**:…Professor Savage, as I understand him, said earlier that a difference between likelihoods and probabilities was that probabilities would normalize because they integrate to one, whereas likelihoods will not. Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses. … If this is so we ought to allow that in addition to the hypotheses that we really consider we should allow something that we had not thought of yet, and of course as soon as we do this we lose the normalizing factor of the probability, and from that point of view probability has no advantage over likelihood. This is my general point, that I think while I agree with a lot of the technical points, I would prefer that this is talked about in terms of likelihood rather than probability. I should like to ask what Professor Savage thinks about that, whether he thinks that the necessity to enumerate hypotheses exhaustively, is important.

**SAVAGE**: Surely, as you say, we cannot always enumerate hypotheses so completely as we like to think. The list can, however, always be completed by tacking on a catch-all ‘something else’. In principle, a person will have probabilities given ‘something else’ just as he has probabilities given other hypotheses. In practice, the probability of a specified datum given ‘something else’ is likely to be particularly vague–an unpleasant reality. The probability of ‘something else’ is also meaningful of course, and usually, though perhaps poorly defined, it is definitely very small. Looking at things this way, I do not find probabilities unnormalizable, certainly not altogether unnormalizable.

Whether probability has an advantage over likelihood seems to me like the question whether volts have an advantage over amperes. The meaninglessness of a norm for likelihood is for me a symptom of the great difference between likelihood and probability. Since you question that symptom, I shall mention one or two others. …

On the more general aspect of the enumeration of all possible hypotheses, I certainly agree that the danger of losing serendipity by binding oneself to an over-rigid model is one against which we cannot be too alert. We must not pretend to have enumerated all the hypotheses in some simple and artificial enumeration that actually excludes some of them. The list can however be completed, as I have said, by adding a general ‘something else’ hypothesis, and this will be quite workable, provided you can tell yourself in good faith that ‘something else’ is rather improbable. The ‘something else’ hypothesis does not seem to make it any more meaningful to use likelihood for probability than to use volts for amperes.

Let us consider an example. Off hand, one might think it quite an acceptable scientific question to ask, ‘What is the melting point of californium?’ Such a question is, in effect, a list of alternatives that pretends to be exhaustive. But, even specifying which isotope of californium is referred to and the pressure at which the melting point is wanted, there are alternatives that the question tends to hide. It is possible that californium sublimates without melting or that it behaves like glass. Who dare say what other alternatives might obtain? An attempt to measure the melting point of californium might, if we are serendipitous, lead to more or less evidence that the concept of melting point is not directly applicable to it. Whether this happens or not, Bayes’s theorem will yield a posterior probability distribution for the melting point given that there really is one, based on the corresponding prior conditional probability and on the likelihood of the observed reading of the thermometer as a function of each possible melting point. Neither the prior probability that there is no melting point, nor the likelihood for the observed reading as a function of hypotheses alternative to that of the existence of a melting point enter the calculation. The distinction between likelihood and probability seems clear in this problem, as in any other.

**BARNARD**: Professor Savage says in effect, ‘add at the bottom of list H_{1}, H_{2},…”something else”’. But what is the probability that a penny comes up heads given the hypothesis ‘something else’. We do not know. What one requires for this purpose is not just that there should be some hypotheses, but that they should enable you to compute probabilities for the data, and that requires very well defined hypotheses. For the purpose of applications, I do not think it is enough to consider only the conditional posterior distributions mentioned by Professor Savage.

**LINDLEY**: I am surprised at what seems to me an obvious red herring that Professor Barnard has drawn across the discussion of hypotheses. I would have thought that when one says this posterior distribution is such and such, all it means is that among the hypotheses that have been suggested the relevant probabilities are such and such; conditionally on the fact that there is nothing new, here is the posterior distribution. If somebody comes along tomorrow with a brilliant new hypothesis, well of course we bring it in.

**BARTLETT**: But you would be inconsistent because your prior probability would be zero one day and non-zero another.

**LINDLEY**: No, it is not zero. My prior probability for other hypotheses may be ε. All I am saying is that conditionally on the other 1 – ε, the distribution is as it is.

**BARNARD**: Yes, but your normalization factor is now determined by ε. Of course ε may be anything up to 1. Choice of letter has an emotional significance.

**LINDLEY**: I do not care what it is as long as it is not one.

**BARNARD**: In that event two things happen. One is that the normalization has gone west, and hence also this alleged advantage over likelihood. Secondly, you are not in a position to say that the posterior probability which you attach to an hypothesis from an experiment with these unspecified alternatives is in any way comparable with another probability attached to another hypothesis from another experiment with another set of possibly unspecified alternatives. This is the difficulty over likelihood. Likelihood in one class of experiments may not be comparable to likelihood from another class of experiments, because of differences of metric and all sorts of other differences. But I think that you are in exactly the same difficulty with conditional probabilities just because they are conditional on your having thought of a certain set of alternatives. It is not rational in other words. Suppose I come out with a probability of a third that the penny is unbiased, having considered a certain set of alternatives. Now I do another experiment on another penny and I come out of that case with the probability one third that it is unbiased, having considered yet another set of alternatives. There is no reason why I should agree or disagree in my final action or inference in the two cases. I can do one thing in one case and other in another, because they represent conditional probabilities leaving aside possibly different events.

**LINDLEY**: All probabilities are conditional.

**BARNARD**: I agree.

**LINDLEY**: If there are only conditional ones, what is the point at issue?

**PROFESSOR E.S. PEARSON**: I suggest that you start by knowing perfectly well that they are conditional and when you come to the answer you forget about it.

**BARNARD**: The difficulty is that you are suggesting the use of probability for inference, and this makes us able to compare different sets of evidence. Now you can only compare probabilities on different sets of evidence if those probabilities are conditional on the same set of assumptions. If they are not conditional on the same set of assumptions they are not necessarily in any way comparable.

**LINDLEY**: Yes, if this probability is a third conditional on that, and if a second probability is a third, conditional on something else, a third still means the same thing. I would be prepared to take my bets at 2 to 1.

**BARNARD**: Only if you knew that the condition was true, but you do not.

**GOOD**: Make a conditional bet.

**BARNARD**: You can make a conditional bet, but that is not what we are aiming at.

**WINSTEN**: You are making a cross comparison where you do not really want to, if you have got different sets of initial experiments. One does not want to be driven into a situation where one has to say that everything with a probability of a third has an equal degree of credence. I think this is what Professor Barnard has really said.

**BARNARD**: It seems to me that likelihood would tell you that you lay 2 to 1 in favour of H_{1} against H_{2}, and the conditional probabilities would be exactly the same. Likelihood will not tell you what odds you should lay in favour of H_{1} as against the rest of the universe. Probability claims to do that, and it is the only thing that probability can do that likelihood cannot.

**[i]Anyone who thinks we really want a Bayesian probability assignment to a hypothesis must come to grips with the fact that it depends on having a catchall factor-of all possible hypotheses that could explain the data-and the probability of data given “something else”. Barnard is telling Savage this is unrealistic and when something new enters, the original probability assessments are wrong. In their attempts to get the “catchall factor” to disappear, most probabilists appeal to comparative assessments–likelihood ratios or Bayes’ factors. So, Barnard was saying, back then, there’s no advantage to likelihoods. Several key problems remain: (i) the appraisal is always relative to the choice of alternative, and this allows “favoring” one or the other hypothesis, without being able to say there is evidence for either; (ii) although the hypotheses are not exhaustive, many give priors to the null and alternative that sum to 1 (iii) the ratios do not have the same evidential meaning in different cases (what’s high? 10, 50, 800?), and (iv) there’s a lack of control of the probability of misleading interpretations, except with predesignated point against point hypotheses or special cases (this is why Barnard later rejected the Likelihoodism). You can read the rest of pages 78-103 of the Savage Forum here. This exchange was first blogged here. Share your comments.**

**References**

Savage, L. (1962), “Discussion”, in *The Foundations of Statistical Inference: A Discussion*, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 76.

Links to a scan of the entire Savage forum may be found here.

Filed under: Barnard, highly probable vs highly probed, phil/history of stat ]]>

Today is George Barnard’s birthday. I met him in the 1980s and we corresponded off and on until 1999. Here’s a snippet of his discussion with Savage (1962) (link below [i]) that connects to issues often taken up on this blog: stopping rules and the likelihood principle. (It’s a slightly revised reblog of an earlier post.) I’ll post some other items related to Barnard this week, in honor of his birthday.

*Happy Birthday George!*

Barnard: I have been made to think further about this issue of the stopping rule since I first suggested that the stopping rule was irrelevant (Barnard 1947a,b). This conclusion does not follow only from the subjective theory of probability; it seems to me that the stopping rule is irrelevantin certain circumstances. Since 1947 I have had the great benefit of a long correspondence—not many letters because they were not very frequent, but it went on over a long time—with Professor Bartlett, as a result of which I am considerably clearer than I was before. My feeling is that, as I indicated [on p. 42], we meet with two sorts of situation in applying statistics to data One is where we want to have a single hypothesis with which to confront the data. Do they agree with this hypothesis or do they not? Now in that situation you cannot apply Bayes’s theorem because you have not got any alternatives to think about and specify—not yet. I do not say they are not specifiable—they are not specified yet. And in that situation it seems to me the stopping rule is relevant.In particular, suppose somebody sets out to demonstrate the existence of extrasensory perception and says ‘I am going to go on until I get a one in ten thousand significance level’. Knowing that this is what he is setting out to do would lead you to adopt a different test criterion. What you would look at would not be the ratio of successes obtained, but how long it took him to obtain it. And you would have a very simple test of significance which said if it took you so long to achieve this increase in the score above the chance fraction, this is not at all strong evidence for E.S.P., it is very weak evidence. And the reversing of the choice of test criteria would I think overcome the difficulty.

This is the answer to the point Professor Savage makes; he says why use one method when you have vague knowledge, when you would use a quite different method when you have precise knowledge. It seem to me the answer is that you would use one method when you have precisely determined alternatives, with which you want to compare a given hypothesis, and you use another method when you do not have these alternatives.

Savage: May I digress to say publicly that I learned the stopping-rule principle from professor Barnard, in conversation in the summer of 1952. FranklyI then thought it a scandal that anyone in the profession could advance an idea so patently wrong, even as today I can scarcely believe that some people resist an idea so patently right.I am particularly surprised to hear Professor Barnard say today that the stopping rule is irrelevant in certain circumstances only, for the argument he first gave in favour of the principle seems quite unaffected by the distinctions just discussed. The argument then was this: The design of a sequential experiment is, in the last analysis, what the experimenter actually intended to do. His intention is locked up inside his head and cannot be known to those who have to judge the experiment. Never having been comfortable with that argument, I am not advancing it myself. But if Professor Barnard still accepts it, how can he conclude that the stopping-rule principle is only sometimes valid? (emphasis added)

Barnard: If I may reply briefly to Professor Savage’s question as to whether I still accept the argument I put to Professor Savage in 1952 (Barnard 1947a), I would say that I do so in relation to the question then discussed, where it is a matter of choosing from among a number of simple statistical hypotheses. When it is a question of deciding whether an observed result is reasonably consistent or not with a single hypothesis, no simple statistical alternatives being specified, then the argument cannot be applied. I would not claim it as foresight so much as good fortune that on page 664 of the reference given I did imply that the likelihood-ratio argument would apply ‘to all questions where the choice lies between a finite number of exclusive alternatives’; it is implicit that the alternatives here must be statistically specified. (75-77)

By the time I met Barnard, he was in what might be called the *Fisher-E.S. Pearson-Birnbaum-Cox error statistical camp, *rather than being a Likelihoodist. The example under discussion is a 2-sided Normal test of a 0 mean and was brought up by Peter Armitage (a specialist in sequential trials in medicine). I don’t see that it makes sense to favor preregistration and preregistered reports while denying the relevance *to evidence* of optional stopping and outcomes other than the one observed. That your appraisal of the evidence is altered when you actually see the history or plan supplied by the preregistered report is equivalent to worrying about the fact that the researcher would have stopped had the results been significant prior to the point of stopping–when this is not written down. It’s not a matter of “intentions locked up inside his head” that renders an inference unwarranted when it results from trying and trying again (to reach significance)–at least in a case like this one.

Share your Barnard reflections and links in the comments.

Related:

Savage, L. (1962), “Discussion”, in *The Foundations of Statistical Inference: A Discussion*, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 76.

Barnard, G.A. (1947a), “A Review of Sequential Analysis by Abraham Wald,” J. amer. Statist. Assoc., 42, 658-669.

Barnard, G.A. (1947b), “The Meaning of a Significance Level”, *Biometrika*, 34, 179-182.

Filed under: Likelihood Principle, Philosophy of Statistics Tagged: Barnard ]]>

Karl Popper died on September 17 1994. One thing that gets revived in my new book (*Statistical Inference as Severe Testing*, 2018, CUP) is a Popperian demarcation of science vs pseudoscience Here’s a snippet from what I call a “live exhibit” (where the reader experiments with a subject) toward the end of a chapter on Popper:

*Live**Exhibit. **Revisiting Popper’s Demarcation of Science: *Here’s an experiment: Try shifting what Popper says about theories to a related claim about inquiries to find something out. To see what I have in mind, join me in watching a skit over the lunch break:

*Physicist:* “If mere logical falsifiability suffices for a theory to be scientific, then, we can’t properly oust astrology from the scientific pantheon. Plenty of nutty theories have been falsified, so by definition they’re scientific. Moreover, scientists aren’t always looking to subject well corroborated theories to “grave risk” of falsification.”

*Fellow traveler:* “I’ve been thinking about this. On your first point, Popper confuses things by making it sound as if he’s asking: *When is a theory unscientific?* What he is actually asking or should be asking is: *When is an inquiry into a theory, or an appraisal of claim H unscientific*? We want to distinguish meritorious modes of inquiry from those that are BENT. If the test methods enable *ad hoc** *maneuvering, sneaky face-saving devices, then the inquiry–the handling and use of data–is unscientific. Despite being logically falsifiable, theories can be *rendered immune from falsification* by means of cavalier methods for their testing. Adhering to a falsified theory no matter what is poor science. On the other hand, some areas have so much noise that you can’t pinpoint what’s to blame for failed predictions. This is another way that inquiries become bad science.”

She continues:

“On your second point, it’s true that Popper talked of wanting to subject theories to grave risk of falsification. I suggest that it’s really our *inquiries* into, or tests of, the theories that we want to subject to grave risk. The onus is on interpreters of data to show how they are countering the charge of a poorly run test. I admit this is a modification of Popper. One could reframe the entire problem as one of the character of the inquiry or test.

In the context of trying to find something out, in addition to blocking inferences that fail the minimal requirement for severity[1]:

A scientific inquiry or test: must be able to embark on a reliable inquiry to pinpoint blame for anomalies (and use the results to replace falsified claims and build a repertoire of errors).

The parenthetical remark isn’t absolutely required, but is a feature that greatly strengthens scientific credentials. Without solving, not merely embarking on, some Duhemian problems there are no interesting falsifications. The ability or inability to pin down the source of failed replications–a familiar occupation these days–speaks to the scientific credentials of an inquiry. At any given time, there are anomalies whose sources haven’t been traced–unsolved Duhemian problems–generally at “higher” levels of the theory-data array. Embarking on solving these is the impetus for new conjectures. Checking test assumptions is part of working through the Duhemian maze. The reliability requirement is given by inferring claims just to the extent that they pass severe tests. There’s no sharp line for demarcation, but when these requirements are absent, an inquiry veers into the realm of questionable science or pseudo science. Some physicists worry that highly theoretical realms can’t be expected to be constrained by empirical data. Theoretical constraints are also important.

[1] Before claiming to have evidence for claim C, something must have been done to have found flaws in C, were C false. If a method is incapable of finding flaws with C, then finding none is poor grounds for inferring they are absent.

Mayo, D. 2018. *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (Cambridge)

Filed under: Error Statistics, Popper, pseudoscience, science vs pseudoscience Tagged: science vs pseudoscience ]]>

Sunday, September 10, was C.S. Peirce’s birthday. He’s one of my heroes. He’s a treasure chest on essentially any topic, and anticipated quite a lot in statistics and logic. (As Stephen Stigler (2016) notes, he’s to be credited with articulating and appling randomization [1].) I always find something that feels astoundingly new, even rereading him. He’s been a great resource as I complete my book, *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (CUP, 2018) [2]. I’m reblogging the main sections of a (2005) paper of mine. It’s written for a very general philosophical audience; the statistical parts are very informal. I first posted it in 2013. *Happy **(belated)** Birthday Peirce*.

**Peircean Induction and the Error-Correcting Thesis**

Deborah G. Mayo

*Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy*, Volume 41, Number 2, 2005, pp. 299-319

Peirce’s philosophy of inductive inference in science is based on the idea that what permits us to make progress in science, what allows our knowledge to grow, is the fact that science uses methods that are self-correcting or error-correcting:

Induction is the experimental testing of a theory. The justification of it is that, although the conclusion at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error. (5.145)

Inductive methods—understood as methods of experimental testing—are justified to the extent that they are error-correcting methods. We may call this Peirce’s error-correcting or self-correcting thesis (SCT):

**Self-Correcting Thesis SCT:** methods for inductive inference in science are error correcting; the justification for inductive methods of experimental testing in science is that they are self-correcting.

Peirce’s SCT has been a source of fascination and frustration. By and large, critics and followers alike have denied that Peirce can sustain his SCT as a way to justify scientific induction: “No part of Peirce’s philosophy of science has been more severely criticized, even by his most sympathetic commentators, than this attempted validation of inductive methodology on the basis of its purported self-correctiveness” (Rescher 1978, p. 20).

In this paper I shall revisit the Peircean SCT: properly interpreted, I will argue, Peirce’s SCT not only serves its intended purpose, it also provides the basis for justifying (frequentist) statistical methods in science. While on the one hand, contemporary statistical methods increase the mathematical rigor and generality of Peirce’s SCT, on the other, Peirce provides something current statistical methodology lacks: an account of inductive inference and a philosophy of experiment that links the justification for statistical tests to a more general rationale for scientific induction. Combining the mathematical contributions of modern statistics with the inductive philosophy of Peirce, sets the stage for developing an adequate justification for contemporary inductive statistical methodology.

**2. Probabilities are assigned to procedures not hypotheses**

Peirce’s philosophy of experimental testing shares a number of key features with the contemporary (Neyman and Pearson) Statistical Theory: statistical methods provide, not means for assigning degrees of probability, evidential support, or confirmation to hypotheses, but procedures for testing (and estimation), whose rationale is their predesignated high frequencies of leading to correct results in some hypothetical long-run. A Neyman and Pearson (NP) statistical test, for example, instructs us “To decide whether a hypothesis, *H*, of a given type be rejected or not, calculate a specified character, ** x_{0}**, of the observed facts; if

The relative frequencies of erroneous rejections and erroneous acceptances in an actual or hypothetical long run sequence of applications of tests are error probabilities; we may call the statistical tools based on error probabilities, error statistical tools. In describing his theory of inference, Peirce could be describing that of the error-statistician:

The theory here proposed does not assign any probability to the inductive or hypothetic conclusion, in the sense of undertaking to say how frequently that conclusion would be found true. It does not propose to look through all the possible universes, and say in what proportion of them a certain uniformity occurs; such a proceeding, were it possible, would be quite idle. The theory here presented only says how frequently, in this universe, the special form of induction or hypothesis would lead us right. The probability given by this theory is in every way different—in meaning, numerical value, and form—from that of those who would apply to ampliative inference the doctrine of inverse chances. (2.748)

The doctrine of “inverse chances” alludes to assigning (posterior) probabilities in hypotheses by applying the definition of conditional probability (Bayes’s theorem)—a computation requires starting out with a (prior or “antecedent”) probability assignment to an exhaustive set of hypotheses:

If these antecedent probabilities were solid statistical facts, like those upon which the insurance business rests, the ordinary precepts and practice [of inverse probability] would be sound. But they are not and cannot be statistical facts. What is the antecedent probability that matter should be composed of atoms? Can we take statistics of a multitude of different universes? (2.777)

For Peircean induction, as in the N-P testing model, the conclusion or inference concerns a hypothesis that either is or is not true in this one universe; thus, assigning a frequentist probability to a particular conclusion, other than the trivial ones of 1 or 0, for Peirce, makes sense only “if universes were as plentiful as blackberries” (2.684). Thus the Bayesian inverse probability calculation seems forced to rely on subjective probabilities for computing inverse inferences, but “subjective probabilities” Peirce charges “express nothing but the conformity of a new suggestion to our prepossessions, and these are the source of most of the errors into which man falls, and of all the worse of them” (2.777).

Hearing Pierce contrast his view of induction with the more popular Bayesian account of his day (the Conceptualists), one could be listening to an error statistician arguing against the contemporary Bayesian (subjective or other)—with one important difference. Today’s error statistician seems to grant too readily that the only justification for N-P test rules is their ability to ensure we will rarely take erroneous actions with respect to hypotheses in the long run of applications. This so called inductive behavior rationale seems to supply no adequate answer to the question of what is learned in any particular application about the process underlying the data. Peirce, by contrast, was very clear that what is really wanted in inductive inference in science is the ability to control error probabilities of test procedures, i.e., “the trustworthiness of the proceeding”. Moreover it is only by a faulty analogy with deductive inference, Peirce explains, that many suppose that inductive (synthetic) inference should supply a probability to the conclusion: “… in the case of analytic inference we know the probability of our conclusion (if the premises are true), but in the case of synthetic inferences we only know the degree of trustworthiness of our proceeding (“The Probability of Induction” 2.693).

Knowing the “trustworthiness of our inductive proceeding”, I will argue, enables determining the test’s probative capacity, how reliably it detects errors, and the severity of the test a hypothesis withstands. Deliberately making use of known flaws and fallacies in reasoning with limited and uncertain data, tests may be constructed that are highly trustworthy probes in detecting and discriminating errors in particular cases. This, in turn, enables inferring which inferences about the process giving rise to the data are and are not warranted: an inductive inference to hypothesis *H* is warranted to the extent that with high probability the test would have detected a specific flaw or departure from what *H* asserts, and yet it did not.

**3. So why is justifying Peirce’s SCT thought to be so problematic?**

You can read Section 3 here. (it’s not necessary for understanding the rest).

**4. Peircean induction as severe testing**

… [I]nduction, for Peirce, is a matter of subjecting hypotheses to “the test of experiment” (7.182).

The process of testing it will consist, not in examining the facts, in order to see how well they accord with the hypothesis, but on the contrary in examining such of the probable consequences of the hypothesis … which would be very unlikely or surprising in case the hypothesis were not true. (7.231)

When, however, we find that prediction after prediction, notwithstanding a preference for putting the most unlikely ones to the test, is verified by experiment,…we begin to accord to the hypothesis a standing among scientific results.

This sort of inference it is, from experiments testing predictions based on a hypothesis, that is alone properly entitled to be called induction. (7.206)

While these and other passages are redolent of Popper, Peirce differs from Popper in crucial ways. Peirce, unlike Popper, is primarily interested not in falsifying claims but in the positive pieces of information provided by tests, with “the corrections called for by the experiment” and with the hypotheses, modified or not, that manage to pass severe tests. For Popper, even if a hypothesis is highly *corroborated (by his lights)*, he regards this as at most a report of the hypothesis’ past performance and denies it affords positive evidence for its correctness or reliability. Further, Popper denies that he could vouch for the reliability of the method he recommends as “most rational”—conjecture and refutation. Indeed, Popper’s requirements for a highly corroborated hypothesis are not sufficient for ensuring severity in Peirce’s sense (Mayo 1996, 2003, 2005). Where Popper recoils from even speaking of warranted inductions, Peirce conceives of a proper inductive inference as what had passed a severe test—one which would, with high probability, have detected an error if present.

In Peirce’s inductive philosophy, we have evidence for inductively inferring a claim or hypothesis *H* when not only does *H* “accord with” the data ** x**; but also, so good an accordance would very probably not have resulted, were

*Hypothesis H passes a severe test with* ** x** iff (firstly)

The test would “have signaled an error” by having produced results less accordant with *H* than what the test yielded. Thus, we may inductively infer *H* when (and only when) *H* has withstood a test with high error detecting capacity, the higher this probative capacity, the more severely *H* has passed. What is assessed (quantitatively or qualitatively) is not the amount of support for *H* but the probative capacity of the test of experiment ET (with regard to those errors that an inference to *H* is declaring to be absent)……….

You can read the rest of Section 4 here here

**5. The path from qualitative to quantitative induction**

In my understanding of Peircean induction, the difference between qualitative and quantitative induction is really a matter of degree, according to whether their trustworthiness or severity is quantitatively or only qualitatively ascertainable. This reading not only neatly organizes Peirce’s typologies of the various types of induction, it underwrites the manner in which, within a given classification, Peirce further subdivides inductions by their “strength”.

*(I) First-Order, Rudimentary or Crude Induction*

Consider Peirce’s First Order of induction: the lowest, most rudimentary form that he dubs, the “pooh-pooh argument”. It is essentially an argument from ignorance: Lacking evidence for the falsity of some hypothesis or claim *H*, provisionally adopt *H*. In this very weakest sort of induction, crude induction, the most that can be said is that a hypothesis would eventually be falsified if false. (It may correct itself—but with a bang!) It “is as weak an inference as any that I would not positively condemn” (8.237). While uneliminable in ordinary life, Peirce denies that rudimentary induction is to be included as scientific induction. Without some reason to think evidence of *H*‘s falsity would probably have been detected, were H false, finding no evidence against *H* is poor inductive evidence *for* *H*. *H* has passed only a highly unreliable error probe.

*(II) Second Order (Qualitative) Induction*

It is only with what Peirce calls “the Second Order” of induction that we arrive at a genuine test, and thereby scientific induction. Within second order inductions, a stronger and a weaker type exist, corresponding neatly to viewing strength as the severity of a testing procedure.

The weaker of these is where the predictions that are fulfilled are merely of the continuance in future experience of the same phenomena which originally suggested and recommended the hypothesis… (7.116)

The other variety of the argument … is where [results] lead to new predictions being based upon the hypothesis of an entirely different kind from those originally contemplated and these new predictions are equally found to be verified. (7.117)

The weaker type occurs where the predictions, though fulfilled, lack novelty; whereas, the stronger type reflects a more stringent hurdle having been satisfied: the hypothesis has had “novel” predictive success, and thereby higher severity. (For a discussion of the relationship between types of novelty and severity see Mayo 1991, 1996). Note that within a second order induction the assessment of strength is qualitative, e.g., very strong, weak, very weak.

The strength of any argument of the Second Order depends upon how much the confirmation of the prediction runs counter to what our expectation would have been without the hypothesis. It is entirely a question of how much; and yet there is no measurable quantity. For when such measure is possible the argument … becomes an induction of the Third Order [statistical induction]. (7.115)

It is upon these and like passages that I base my reading of Peirce. A qualitative induction, i.e., a test whose severity is qualitatively determined, becomes a quantitative induction when the severity is quantitatively determined; when an objective error probability can be given.

*(III) Third Order, Statistical (Quantitative) Induction*

We enter the Third Order of statistical or quantitative induction when it is possible to quantify “how much” the prediction runs counter to what our expectation would have been without the hypothesis. In his discussions of such quantifications, Peirce anticipates to a striking degree later developments of statistical testing and confidence interval estimation (Hacking 1980, Mayo 1993, 1996). Since this is not the place to describe his statistical contributions, I move to more modern methods to make the qualitative-quantitative contrast.

**6. Quantitative and qualitative induction: significance test reasoning**

*Quantitative Severity*

A statistical significance test illustrates an inductive inference justified by a quantitative severity assessment. The significance test procedure has the following components: (1) a *null hypothesis* *H_{0}*, which is an assertion about the distribution of the sample

*H_{0}*: there are no increased cancer risks associated with hormone replacement therapy (HRT) in women who have taken them for 10 years.

*Let d(x)* measure the increased risk of cancer in

*p*-value = Prob(** d**(

If this probability is very small, the data are taken as evidence that

*H**: cancer risks are higher in women treated with HRT

The reasoning is a statistical version of *modes tollens*.

If the hypothesis *H _{0}* is correct then, with high probability, 1-

** x** is statistically significant at level

Therefore, ** x** is evidence of a discrepancy from

*(i.e., H* severely passes, where the severity is 1 minus the p-value) [iii]*

…

If a particular conclusion is wrong, subsequent severe (or highly powerful) tests will with high probability detect it. In particular, if we are wrong to reject *H _{0}* (and

It is true that the observed conformity of the facts to the requirements of the hypothesis may have been fortuitous. But if so, we have only to persist in this same method of research and we shall gradually be brought around to the truth. (7.115)

The correction is not a matter of getting higher and higher probabilities, it is a matter of finding out whether the agreement is fortuitous; whether it is generated about as often as would be expected were the agreement of the chance variety.

[Here are Part 2 and part 3; you can find the rest of section 6 here.]

[1] Stigler discusses some of the experiments Peirce performed. In one, with Joseph Jastrow, the goal was to test whether there’s a threshold below which you can’t discern the difference in weights between two objects. Psychologists had hypothesized that there was a minimal threshold “ such that if the difference was below the threshold, termed the just noticeable difference (jnd), the two stimuli were indistinguishable….[Peirce and Jastrow] showed this speculation was false’ Stigler (2016, 160). No matter how close in weight the objects were the probability of a correct discernment of difference differed from ½. A good example of evidence for a “no-effect” null by falsifying the alternative statistically.

[2] I’m now, truly, within days of completing a very short, but deep, conclusion.9/13/17

**REFERENCES:**

Hacking, I. 1980 “The Theory of Probable Inference: Neyman, Peirce and Braithwaite”, pp. 141-160 in D. H. Mellor (ed.), *Science, Belief and Behavior: Essays in Honour of R.B. Braithwaite*. Cambridge: Cambridge University Press.

Laudan, L. 1981 *Science and Hypothesis: Historical Essays on Scientific Methodology*. Dordrecht: D. Reidel.

Levi, I. 1980 “Induction as Self Correcting According to Peirce”, pp. 127-140 in D. H. Mellor (ed.), *Science, Belief and Behavior: Essays in Honor of R.B. Braithwaite*. Cambridge: Cambridge University Press.

Mayo, D. 1991 “Novel Evidence and Severe Tests”, *Philosophy of Science*, 58: 523-552.

———- 1993 “The Test of Experiment: C. S. Peirce and E. S. Pearson”, pp. 161-174 in E. C. Moore (ed.), *Charles S. Peirce and the Philosophy of Science*. Tuscaloosa: University of Alabama Press.

——— 1996 *Error and the Growth of Experimental Knowledge*, The University of Chicago Press, Chicago.

———–2003 “Severe Testing as a Guide for Inductive Learning”, in H. Kyburg (ed.), *Probability Is the Very Guide in Life*. Chicago: Open Court Press, pp. 89-117.

———- 2005 “Evidence as Passing Severe Tests: Highly Probed vs. Highly Proved” in P. Achinstein (ed.), *Scientific Evidence*, Johns Hopkins University Press.

Mayo, D. and Kruse, M. 2001 “Principles of Inference and Their Consequences,” pp. 381-403 in *Foundations of Bayesianism*, D. Cornfield and J. Williamson (eds.), Dordrecht: Kluwer Academic Publishers.

Mayo, D. and Spanos, A. 2004 “Methodology in Practice: Statistical Misspecification Testing” *Philosophy of Science*, Vol. II, PSA 2002, pp. 1007-1025.

———- (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Theory of Induction”, *The British Journal of Philosophy of Science* 57: 323-357.

Mayo, D. and Cox, D.R. 2006 “The Theory of Statistics as the ‘Frequentist’s’ Theory of Inductive Inference”, *Institute of Mathematical Statistics (IMS) Lecture Notes-Monograph Series, Contributions to the Second Lehmann Symposium*, *2005*.

Neyman, J. and Pearson, E.S. 1933 “On the Problem of the Most Efficient Tests of Statistical Hypotheses”, in *Philosophical Transactions of the Royal Society*, A: 231, 289-337, as reprinted in J. Neyman and E.S. Pearson (1967), pp. 140-185.

———- 1967 *Joint Statistical Papers*, Berkeley: University of California Press.

Niiniluoto, I. 1984 *Is Science Progressive*? Dordrecht: D. Reidel.

Peirce, C. S. *Collected Papers: Vols. I-VI*, C. Hartshorne and P. Weiss (eds.) (1931-1935). Vols. VII-VIII, A. Burks (ed.) (1958), Cambridge: Harvard University Press.

Popper, K. 1962 *Conjectures and Refutations: the Growth of Scientific Knowledge*, Basic Books, New York.

Rescher, N. 1978 *Peirce’s Philosophy of Science: Critical Studies in His Theory of Induction and Scientific Method*, Notre Dame: University of Notre Dame Press.

Stigler, S. 2016 *The Seven Pillars of Statistical Wisdom, Harvard.*

[i] Others who relate Peircean induction and Neyman-Pearson tests are Isaac Levi (1980) and Ian Hacking (1980). See also Mayo 1993 and 1996.

[ii] This statement of (b) is regarded by Laudan as the strong thesis of self-correcting. A weaker thesis would replace (b) with (b’): science has techniques for determining unambiguously whether an alternative *T’* is closer to the truth than a refuted *T*.

[iii] If the *p*-value were not very small, then the difference would be considered statistically insignificant (generally small values are 0.1 or less). We would then regard *H _{0}* as consistent with data

If there were a discrepancy from hypothesis *H _{0}* of

** x** is not statistically significant at level

Therefore, ** x** is evidence than any discrepancy from

For a general treatment of severity, see Mayo and Spanos (2006).

[Ed. Note: A not bad biographical sketch can be found on wikipedia.]

Filed under: Bayesian/frequentist, C.S. Peirce ]]>

ABSTRACT. In a well-cited article, ecologist Jared Diamond characterizes three main types of experiment that are performed in community ecology: the Laboratory Experiment (LE), the Field Experiment (FE), and the Natural Experiment (NE). Diamond argues that each form of experiment has strengths and weaknesses, with respect to, for example, realism or the ability to follow a causal trajectory. But does Diamond’s typology exhaust the available kinds of cause-finding practices? Some social scientists have characterized something they call causal process tracing. Is this a fourth type of experiment or something else? In this talk, I examine Diamond’s typology and causal process tracing in the context of a case study concerning the dynamics of wolf and deer populations on the Kaibab Plateau in the 1920s, a case that has been used as a canonical example of a trophic cascade by ecologists but which has also been subject to a fair bit of controversy. I argue that ecologists have profitably deployed causal process tracing together with other types of experiment to help settle questions of causality in this case. It remains to be seen how widespread the use of causal process tracing outside of the social sciences is (or could be), but there are some potentially promising applications, particularly with respect to questions about specific causal sequences.

**Sponsored by Mayo-Chatfield Fund for Experimental Reasoning, Reliability, Objectivity and Rationality ( E.R.R.O.R.) and the Philosophy Department**

**Related**:

- Millstein, R. (2006). “Natural Selection as a Population-Level Causal Process”,
*The British Journal for the Philosophy of Science*. Vol. 57 Issue 4, p. 627. - Millstein, R. (2013) Chapter 8: “Natural Selection and Causal Productivity” in H.-K. Chao et al. (eds.),
*Mechanism and Causality in Biology and Economics*.

**Possibly Related**:

- Mayo, D. (1986). Understanding Frequency-Dependent Causation”
*Philosophical Studies: An International Journal for Philosophy in the Analytic Tradition***49**: 109-124. - Mayo, D. and N. Gilinsky (1987). ” Models of Group Selection,”
*Philosophy of Science*, 54(4): 515-538.

***Roberta L. Millstein** is Professor of Philosophy at the University of California, Davis, with affiliations to the Science and Technology Studies Program and the John Muir Institute for the Environment. She specializes in the history and philosophy of biology as well as environmental ethics, with a particular focus on fundamental concepts in evolution and ecology. Her work has appeared in journals such as *Philosophy of Science; Journal of the History of Biology; Studies in History and Philosophy of Biological and Biomedical Sciences; Journal of the History of Biology*; and *Ethics, Policy & Environment*. She is Co-Editor of the online, open-access journal, *Philosophy, Theory, and Practice in Biology* and has served on governing boards for several academic societies. Her current project develops and defends a reinterpretation of Aldo Leopold’s land ethic in light of contemporary ecology.

Filed under: Announcement ]]>