I was just reading a paper by Martin and Liu (2014) in which they allude to the “questionable logic of proving H0 false by using a calculation that assumes it is true” (p. 1704). They say they seek to define a notion of “plausibility” that “fits the way practitioners use and interpret p-values: a small p-value means H0 is implausible, given the observed data,” but they seek “a probability calculation that does not require one to assume that H0 is true, so one avoids the questionable logic of proving H0 false by using a calculation that assumes it is true” (Martin and Liu 2014, p. 1704).
Questionable? A very standard form of argument is a reductio (ad absurdum) wherein a claim C is inferred (i.e., detached) by falsifying ~C, that is, by showing that assuming ~C entails something in conflict with (if not logically contradicting) known results or known truths [i]. Actual falsification in science is generally a statistical variant of this argument. Supposing H0 in p-value reasoning plays the role of ~C. Yet some aver it thereby “saws off its own limb”!
I first blogged on this in an “Overheard at the Comedy Hour” post.* It came up again in a comical criticism of significance tests in a paper by Szucs and Ioannidis. Here’s their version:
“[P]aradoxically, when we achieve our goal and successfully reject H0 we will actually be left in complete existential vacuum because during the rejection of H0 NHST ‘saws off its own limb’ (Jaynes, 2003; p. 524): If we manage to reject H0 then it follows that pr(data or more extreme data|H0) is useless because H0 is not true” (p.15).
Here’s Jaynes (p. 524):
“Suppose we decide that the effect exists; that is, we reject [null hypothesis] H0. Surely, we must also reject probabilities conditional on H0, but then what was the logical justification for the decision? Orthodox logic saws off its own limb.”
But this reasoning would saw off the legs of all hypothetical testing or falsification. The entailment from a provisional hypothesis or model H to x, whether it is statistical or deductive, does not go away after the hypothesis or model H is rejected on grounds that the prediction is not borne out.[i] It is called an argumentative or implicationary assumption in logic. It is not questionable, but the strongest form of scientific reasoning. When particle physicists deduce the events that would be expected with immensely high probability under H0: background alone (e.g., bumps would disappear with more data), the derivation does not get sawed off when H0 is refuted! The conditional claim remains. And if the statistical test passes an audit (of its assumptions), H0 is statistically falsified. (Search Higgs on this blog.)[ii]
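To make the "conditional claim remains" point concrete, here is a minimal sketch, assuming a one-sided test of a normal mean with known sigma. The function name, the sample sizes, and the two simulated data sets are illustrative assumptions, not anything from the papers discussed. The calculation Pr(d ≥ d0; H0) is the same hypothetical computation whether the data actually came from H0 or not, and it is exactly as correct after H0 is rejected as before.

```python
import math
import random


def p_value_under_h0(sample, mu0=0.0, sigma=1.0):
    """One-sided p-value Pr(d >= d0; H0: mu = mu0), normal mean, known sigma.

    H0 enters only as a supposition: mu0 fixes the reference distribution
    of the test statistic d. Nothing here asserts that H0 is true.
    """
    n = len(sample)
    xbar = sum(sample) / n
    d0 = (xbar - mu0) / (sigma / math.sqrt(n))  # observed test statistic
    # Upper tail of the standard normal: 1 - Phi(d0) = 0.5 * erfc(d0 / sqrt(2))
    return 0.5 * math.erfc(d0 / math.sqrt(2.0))


if __name__ == "__main__":
    rng = random.Random(1)  # illustrative seed
    # Hypothetical data: one sample generated in line with H0, one with a shift.
    null_like = [rng.gauss(0.0, 1.0) for _ in range(100)]
    shifted = [rng.gauss(0.5, 1.0) for _ in range(100)]
    print("p-value, data generated under H0:   ", p_value_under_h0(null_like))
    print("p-value, data generated with a shift:", p_value_under_h0(shifted))
```

The same function is applied to both samples. If the second p-value leads us to reject H0, the statement "were H0 adequate, a difference this large would occur with that probability" is untouched; it is the premise of the falsification, not a casualty of it.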
Now I don’t know if the limb-sawing charge is behind Martin and Liu’s claim that finding H0 “false by using a calculation that assumes it is true” is “questionable” (we’re not told), but I know this: If you’re seeing limb-sawing in p-value logic, you’re sawing off the limbs of reductio arguments; since it’s a mistake to saw off reductios, it follows that seeing limb-sawing in p-value logic is a mistake [ii].
Send me your limb-sawers: I’m now collecting other examples of the limb-sawing fallacy; for years I skipped by them with a silent “hah!”, trying to avert my eyes–not wanting to acknowledge how logic can “go on a holiday” even in some discussions by brilliant statisticians. But now I think the problem bears taking seriously, so please send me examples you come across [iii].
[i]Reductio ad absurdum, a form of argument where one provisionally assumes one or more claims, derives a contradiction from them, and then concludes that at least one of those claims must be false. A reductio argument …is specifically aimed at bringing someone to reject some belief (an arbitrary encyclopedia entry).
[ii] Actually the most important function of the p-value, as I see it, is to block rejections of H0 when the p-value is not small. We reason: if even larger differences than the observed d0 would be produced fairly often even if we supposed H0 adequately describes the data-generating mechanism, then d0 does not warrant rejecting H0.
[iii] The limb-sawing fallacy makes an appearance, but without attribution, in my new book [i] (“Statistical Inference as Severe Testing”, which is currently undergoing a final round of edits). Fans of Jaynes exhorted me not to attach his name to this howler, and I obliged. To their credit, they acknowledged his flaw.
REFERENCES
Jaynes, E. T. 2003. Probability Theory: The Logic of Science. Cambridge: Cambridge University Press.
Szucs, D. and Ioannidis, J. 2016. “When null hypothesis significance testing is unsuitable for research: a reassessment”
Martin, R. and Liu, C. (2014), “A Note on P-Values Interpreted as Plausibilities” Statistica Sinica, Vol. 24, No. 4 (October 2014), pp. 1703-1716.
*Some other comedy hour posts:
(09/03/11) Overheard at the comedy hour at the Bayesian retreat
(4/4/12) Jackie Mason: Fallacy of Rejection and the Fallacy of Nouvelle Cuisine
(04/28/12) Comedy Hour at the Bayesian Retreat: P-values versus Posteriors
(05/05/12) Comedy Hour at the Bayesian (Epistemology) Retreat: Highly Probable vs Highly Probed
(09/03/12) After dinner Bayesian comedy hour…. (1 year anniversary)
(09/08/12) Return to the comedy hour…(on significance tests)
(04/06/13) Who is allowed to cheat? I.J. Good and that after dinner comedy hour….
(04/27/13) Getting Credit (or blame) for Something You Didn’t Do (BP oil spill, comedy hour)
A related logical tragicomedy I have often observed —
☞ Modus Dolens
I agree that the limb-sawing argument doesn’t hold any water. It’s a strange one.
I’m not so sure that p value reasoning is best described as a reductio ad absurdum though – surely it’s more of a probabilistic modus tollens? I.e.,
If H0 true, then we will probably not observe p < .05
Observe p < .05
Therefore probably not H0
Which is not a valid argument… although it may produce quite accurate inferences depending on P(H0) and P(p < .05 | ~H0).
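To see how much the conclusion "probably not H0" leans on those two extra quantities, here is a minimal sketch using Bayes' theorem. It assumes one is willing to assign a prior P(H0) at all, and that "~H0" is a single alternative with a known rejection probability (power). The function name and the numbers are illustrative assumptions.

```python
def prob_h0_given_rejection(prior_h0, alpha=0.05, power=0.8):
    """P(H0 | p < alpha) by Bayes' theorem.

    Assumptions (not from the comment itself): P(p < alpha | H0) = alpha,
    and P(p < alpha | ~H0) = power for a single stand-in alternative.
    """
    rejected_and_h0 = alpha * prior_h0
    rejected_and_not_h0 = power * (1.0 - prior_h0)
    return rejected_and_h0 / (rejected_and_h0 + rejected_and_not_h0)


if __name__ == "__main__":
    # Same observed rejection, different background assumptions.
    for prior_h0, power in [(0.5, 0.8), (0.9, 0.8), (0.9, 0.2)]:
        post = prob_h0_given_rejection(prior_h0, power=power)
        print(f"P(H0)={prior_h0}, power={power}: P(H0 | p<.05) = {post:.3f}")
```

With a high prior on H0 and low power, P(H0 | p < .05) can exceed one half, so the two premises alone do not license "probably not H0".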
Matt: Last line “probably not H0” doesn’t hold, nor do we want it to hold (since it would involve a prior). We infer evidence of a discrepancy from Ho based on the severity principle (assuming an audit is passed), moving from essentially your 2 premises, qualifying by dint of the relevant error probability. However, we must also report discrepancies that are poorly warranted.
It seems to me that the sentence “First and foremost, the fundamental mistaken notion is to believe that if an observation is rare under a given hypothesis, then it can be regarded as evidence against that hypothesis” (1) is precisely a rejection of the reductio ad absurdum mode of reasoning, and so it qualifies as an example of the limb-sawing fallacy. Correct? [I think the passage is a quotation from an I.J. Good article (2), though this is not obvious from (1)].
1) Taroni F et al. Statistical hypothesis testing and common misinterpretations: Should we abandon p-value in forensic science applications? Forensic Sci Int 2016 Feb;259:e32-6.
2) I.J. Good. The Interface Between Statistics and the Philosophy of Science. In: Studies in Logic and the Foundations of Mathematics, Volume 126, 1989, Pages 393-411
Silvano: Whoever held the “First and foremost” sentence? Not I. Significance testing does not involve rejecting a null due to improbable events. It rejects Ho when Pr(d < do;Ho) is high, where d is the test statistic.
Well, if Pr(d < d0;H0) is high, then the p-value is small, and if the p-value is exceedingly small, “either an exceptionally rare chance has occurred or the [H0] theory is not true” (Fisher, 1959). Therefore, I agree that "if an observation is rare under a given hypothesis, then it can be regarded as evidence against that hypothesis", and I am brought to the conclusion that those who consider this statement a fundamental mistaken notion are in fact rejecting the reductio ad absurdum argument.
Silvano: No, that is incorrect, at least without a fair bit of qualification and filling in. Any observation can be very implausible in some manner, particularly if the characteristic is chosen based on the data. The disjunction would not go through (because even if Ho were true, it would be easy to find such an improbable result). To understand Fisher’s claim (which isn’t his best, but can live), one has to remember (as he clearly emphasized) that it’s not the improbability of the results, but the improbability of the results under Ho. That is, it’s being able to produce results at odds with a particular Ho, while adhering to the supposition that Ho adequately describes the data generating procedure. It was further required that the small p-value not be an “isolated result” if one is to infer an experimental phenomenon as opposed to a mere indication of a result at odds with what’s expected under Ho. But because he operated just with the null, it was also important to have a sensible test statistic/distance measure– even for the weak indication to go through. Further, the result has to take into account “selection effects” (Fisher’s “political principle”)–which readily make the p-value spurious–as well as assumptions of the model (which can be quite weak in the case of simple significance tests, precisely what makes them so useful in checking assumptions).
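A rough sketch of why selection effects matter, under stated assumptions: k independent tests, each with a continuous test statistic, so that each p-value is uniform on (0, 1) when Ho holds. The function names, trial count, and seed are illustrative. It computes, exactly and by simulation, how often at least one nominally small p-value turns up even though Ho is true throughout.

```python
import random


def prob_some_nominal_hit(k, alpha=0.05):
    """P(at least one of k independent p-values < alpha) when every null is true."""
    return 1.0 - (1.0 - alpha) ** k


def simulate_hit_rate(k, alpha=0.05, trials=20000, seed=0):
    """Monte Carlo check: draw k uniform p-values per trial, keep the smallest."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under the null, each p-value is Uniform(0, 1) (continuous statistic assumed).
        smallest_p = min(rng.random() for _ in range(k))
        if smallest_p < alpha:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, prob_some_nominal_hit(k), simulate_hit_rate(k))
```

Reporting only the smallest of 20 such p-values as if it were the result of a single pre-specified test misstates the error probability: the chance of a nominal "p < .05" somewhere is well over one half, not .05.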
I must confess I may have reached a similar conclusion to Szucs and Ioannidis when I wrote “As H0 is always true (i.e., it shows the theoretical random distribution of frequencies under certain parameters), it cannot, at the same time, be false nor falsifiable a posteriori. Basically, if at any point you say that H0 is false, then you are also invalidating the whole test and its results” (Perezgonzalez, 2015, https://doi.org/10.3389/fpsyg.2015.00223).
However, I don’t see my comment as trying to ‘saw NHST’s limb off’, but I appreciate I didn’t have a clear resolution to the conflict back then. I believe I do now, after appreciating that discussions of tests typically conflate substantive and statistical null hypotheses. In https://doi.org/10.3389/fpsyg.2015.00341 I advanced an understanding of p-values as percentile locations within a statistical hypothesis, an understanding that could help resolve typical confusions of p-values [p(D;statistical H0)] with the probability of hypotheses [p(substantive H0;D)]. In https://doi.org/10.3389/fpsyg.2015.01293 I came to appreciate that such confusion between substantive and statistical hypotheses was not clearly resolved by either Fisher or Meehl.
I see the genius of Fisher in his use of a statistical null hypothesis in order to shine some light on its corresponding substantive null hypothesis: “Our examination of the possible results of the experiment has therefore led us to a statistical test of significance, by which these results are divided into two classes with opposed interpretations [:] those which show a significant discrepancy from a certain [SUBSTANTIVE] hypothesis… and [those] which show no significant discrepancy from this [SUBSTANTIVE] hypothesis” (Fisher, 1960, p.15-16). That is, while rejecting a statistical null sees oneself ‘sawing one’s own limb off’, rejecting a substantive null does not carry such consequences. Namely, a p-value is a descriptor of the statistical hypothesis (a.k.a., a percentile location if you wish), but the test proper is concentrated on the level of significance, as Fisher also wrote, “It is open to the experimenter to be more or less exacting in respect of the smallness of the [STATISTICAL] probability he would require before he would be willing to admit that his observations have demonstrated a [SUBSTANTIVE] positive result. It is obvious that an experiment would be useless of which no possible result would satisfy him” (p.13).
However, I must also confess that Fisher’s genius can only be partly appreciated if we take into account Gosset’s / Neyman-Pearson’s argument that the smallness of the p-value cannot possibly lead to a rejection of a [SUBSTANTIVE] hypothesis unless an alternative hypothesis exists (I believe Fisher was not against such an alternative hypothesis, simply that a test of significance could be carried out without it as the rejection of the substantive H0 “acts in support of” the substantive alternative, noH0).
Fisher’s genius can be fully appreciated when also considering Mayo’s Severity principle for the [SUBSTANTIVE] H0: i.e., when the substantive H0 has been put to a severe test (e.g., a priori significance = 5%) and it has not passed the test (SEV noH0 = 1 – p < 5%) via p(D; statistical H0), then the [SUBSTANTIVE] H0 can be rejected without it amounting to 'sawing one's own limb' (i.e., it is the substantive H0 which is rejected with severity, not the statistical one).
(Note: Mayo probably disagrees with me with regard to noH0 being Fisher's rather than Neyman-Pearson's, so the reader may need to take this discrepancy into account whenever appropriate.)
Jose: No limb-sawing in either substantive or stat hyp. Yes you were/are guilty of the fallacy.
Jose: I really am puzzled that so many people (including you, as it seems) find it so difficult to follow a simple and everyday logic. 1) You have an idea of how the world works. 2) Therefore, you expect that your experience conforms to that idea. 3) You notice that your observations sometimes do not conform, and you devise a way to measure the discrepancy (the p-value). 4) Once you decide that the discrepancy has reached a critical value, you simply discard your idea. No limb-sawing here…
Mayo, Silvano: I agree, the logic is simple and understandable. The nitty-gritty of the issue is that any idea of how the world works –the world is big, a hypothesis– and the device used for measuring (testing) any discrepancy observed –a ruler, a piece of string– are two different things. Technically, if after measuring I deny the precision of the ruler (or, using Mayo’s example of body weighing, if I deny the scales are as precise and in working order as I knew them to be before carrying out the test), then how could I possibly know anything about the world?
To use another example: we know that winning the jackpot in a fair lottery is a very rare event, which is the information the p-value gives us: the probability of ‘winning’ the research lottery. It is just descriptive of a given model, and however low its probability, it does not warrant rejecting the model if this is the only hypothesis. When do I doubt the fairness of the enterprise? Only when I conceive an alternative hypothesis (this lottery is not fair) and put it to the test. The measuring device (the p-value) is still the same; the only change is that the probability space needs to accommodate two substantive hypotheses now, the level of significance being the limit that signals their boundary.
Szucs and Ioannidis are talking about denying the statistical null (a.k.a., the ruler, the scale), which amounts to limb-sawing –although this is not really what we are testing. That is, what we are really denying is the absence of (a given) difference (a.k.a., that there is no difference in distance or weight, conditioned on our ruler and scales).
Jose: There is no limb-sawing–it’s no different than reductio or falsification–they are confused.
Contrapositive or modus tollens arguments are very common in mathematics. Since it’s Comedy Hour, I can’t help thinking of Chrysippus, who is said to have died laughing, and his dog — not the one tied to a cart, the one chasing a rabbit or stag or whatever.