When logical fallacies of statistics go uncorrected, they are repeated again and again…and again. And so it is with the limb-sawing fallacy I first posted in one of my “Overheard at the Comedy Hour” posts.* It now resides as a comic criticism of significance tests in a paper by Szucs and Ioannidis (posted this week). Here’s their version:

“[P]aradoxically, when we achieve our goal and successfully reject *H*0 we will actually be left in complete existential vacuum because during the rejection of *H*0, NHST ‘saws off its own limb’ (Jaynes, 2003; p. 524): If we manage to reject *H*0 then it follows that pr(data or more extreme data|*H*0) is useless because *H*0 is not true” (p. 15).

Here’s Jaynes (p. 524):

“Suppose we decide that the effect exists; that is, we reject [null hypothesis] *H*0. Surely, we must also reject probabilities conditional on *H*0, but then what was the logical justification for the decision? Orthodox logic saws off its own limb.”

*Ha! Ha!* By this reasoning, no hypothetical testing or falsification could ever occur. As soon as *H* is falsified, the grounds for falsifying disappear! If *H*: all swans are white, then if I see a black swan, *H* is falsified. But according to this criticism, we can no longer assume the deduced prediction from *H*! What?

The entailment from a hypothesis or model *H* to **x**, whether it is statistical or deductive, does not go away after the hypothesis or model *H* is rejected on grounds that the prediction is not borne out.[i]

When particle physicists deduce the events that would be expected with immensely high probability under *H*0: background alone, the derivation does not get sawed off when *H*0 is refuted! The conditional claim remains. (In logic it is called an argumentative assumption or implicationary assumption.) And if the statistical test passes an audit (of its assumptions), *H*0 is statistically falsified.

It is scarcely useless to falsify claims! We’re not in an “existential vacuum”!
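The point can be made concrete with a toy computation (a hypothetical example; the coin-tossing numbers are made up for illustration): the p-value is a probability computed entirely under *H*0, and that conditional computation is just as valid after *H*0 is rejected as before.

```python
from math import comb

# Hypothetical illustration: test H0: "the coin is fair" (p = 0.5)
# after observing 58 heads in 60 tosses.
n, k = 60, 58

# p-value: Pr(58 or more heads | H0), a tail area computed
# entirely under the binomial model that H0 specifies.
p_value = sum(comb(n, j) * 0.5**n for j in range(k, n + 1))

# We reject H0 at, say, the 0.001 level *because* of this
# conditional claim -- and the claim itself (a theorem of the
# binomial model) remains true after H0 is rejected; nothing
# gets "sawed off".
reject = p_value < 0.001
print(p_value, reject)
```

The conditional probability is a mathematical fact about the model, not a report that the model is true; rejecting the model leaves the fact intact.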

The limb-sawing fallacy makes an appearance, but without attribution, in my new book[i] (“Statistical Inference as Severe Testing,” which I’m currently subjecting to a final round of edits).[ii] The rest of the paper by Szucs and Ioannidis rehearses many of the canonical howlers that pass as criticisms of significance tests, all of which have appeared many times on this blog, from “p-values exaggerate evidence” (no they don’t) to “what we really want are Bayesian posterior probabilities of statistical hypotheses” (really?). Hopefully, if their paper isn’t out yet, they can be persuaded to reassess their “reassessment”, and not swallow all of these chestnuts hook, line, and sinker.[iii]

I ended my 2013 post saying:

“To be generous, we may assume that in the heat of criticism, his [Jaynes’] logic takes a wild holiday. Unfortunately, I’ve heard several of his acolytes repeat this. There’s a serious misunderstanding of how hypothetical reasoning works: 6 lashes, and a pledge not to uncritically accept what critics say, however much you revere them”.

______

[i] Fans of Jaynes exhorted me not to attach his name to this howler, and I obliged. But what if I need to cite Szucs and Ioannidis?

[ii] Szucs and Ioannidis’ version might be seen as ever so slightly weaker than Jaynes’, since it’s less clear what they think goes wrong; but as they refer to him, we may assume they endorse his version.

[iii] For some papers on statistical tests:

- Mayo, D. G. and Cox, D. R. (2006), “Frequentist Statistics as a Theory of Inductive Inference,” in *Optimality: The Second Erich L. Lehmann Symposium*, ed. J. Rojo, (IMS), Vol. 49: 77-97.
- Mayo, D. G. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction,” *British Journal for the Philosophy of Science* 57(2): 323–57.
- A very recent post by Brian Haig on “Tests of Statistical Significance Made Sound” is relevant.

**REFERENCES**

Jaynes, E. T. 2003. *Probability Theory: The Logic of Science*. Cambridge: Cambridge University Press.

Szucs, D. and Ioannidis, J. 2016. “When null hypothesis significance testing is unsuitable for research: a reassessment.”

* **Some previous comedy hour posts:**

(09/03/11) Overheard at the comedy hour at the Bayesian retreat

(04/04/12) Jackie Mason: Fallacy of Rejection and the Fallacy of Nouvelle Cuisine

(04/28/12) Comedy Hour at the Bayesian Retreat: P-values versus Posteriors

(05/05/12) Comedy Hour at the Bayesian (Epistemology) Retreat: Highly Probable vs Highly Probed

(09/03/12) After dinner Bayesian comedy hour…. (1 year anniversary)

(09/08/12) Return to the comedy hour…(on significance tests)

(04/06/13) Who is allowed to cheat? I.J. Good and that after dinner comedy hour….

(04/27/13) Getting Credit (or blame) for Something You Didn’t Do (BP oil spill, comedy hour)

I’m rather surprised by this paper, as it is so much at odds with what I thought Ioannidis, at least, was keen to promote. I know he advanced the “diagnostic screening model” of tests in his (2005) paper, which is problematic, but to trot out the same baked-on positions is troubling. Consider:

“Hypotheses could be tested by either likelihood ratio testing, and/or Bayesian methods which usually view probability as characterizing the state of our beliefs about the world (Jaynes, 2003; Pearl 1998; MacKay, 2003; Gelman et al. 2014; Sivia and Skilling, 2006). The above alternative approaches require model specifications about alternative hypotheses, they can give probability statements about H0 and alternative hypotheses, they allow for clear model comparison, are insensitive to data collection procedures and do not suffer from problems with large samples.”

First of all, a comparative likelihood ratio is not a test in my book, because (a) it doesn’t falsify anything and (b) it fails to control the probability of erroneously preferring one hypothesis to another, even when that hypothesis is false. Worse, so many of the problems with lack of reproducibility–as Ioannidis knows–are due to cherry-picking, p-hacking, multiple testing, and trying and trying again. So how can they champion methods that are “insensitive to data collection”? I could go on with every howler, all of which are discussed many times on this blog.

“Fans of Jaynes exhorted me not to attach his name to this howler, and I obliged.” I find that surprising. For a moment, at least, you thought he didn’t deserve to be named there?

Christian: I’m not sure I got your question. I’m saying that some of his fans, realizing the embarrassment of his gaffe, privately asked me not to name him. You would have been around during my first post of this, and some of the Jaynesians who would comment. So in my book it’s mentioned as a howler without naming names. But if this paper includes it, it will be known.

Well, I was just expressing my surprise that you actually “obliged”.

Christian: I’ve wondered in general how important it is to name names in relation to fallacies. If the person is alive, I ask them. Jaynes wrote in the most gratuitously nasty, condescending, and demeaning voice possible. He was a leader in straw men put downs and appeals to ridicule. That encouraged others to do the same. That is the basis for the “comedy hours” on this blog, that is, I was taking up their “hilarious” put-downs. But the fact of the matter is, there’s a double standard. Bayesians can get away with it, even in professional articles–but if it comes in reverse, some react more like the way members of a certain sect react to caricatures of their God. I want to be constructive.

Now when I’m setting out arguments thought to show things like “p-values exaggerate”, I must give names and arguments, so I keep to the original leaders of the flawed argument, repeated by dozens, every day.

Dr. Mayo, I think it would be important for the discussion that Prof. Jaynes be quoted in context. He starts with “To see how far this procedure takes us from elementary logic, suppose we decide […]”. This means there is background to the quoted text, some discussion on periodicity and how the null model doesn’t take into account all information on the alternative. For what it’s worth (I’m not a statistician or philosopher), what I understand is that a hypothesis cannot be rejected if you do not include the information on the alternative, which the null model doesn’t have.

Martin: I’m aware of the context; it doesn’t help. Of course a null can be rejected without an explicit alternative. The alternative is its denial. But I’m not arguing against alternatives–they vary for different problems, e.g., to deny “iid holds for this data” it’s not necessary to have a specific non-iid alternative, but one can also learn about such alternatives. This is the difference between Cox’s omnibus and specific tests of model assumptions.

None of this bears on the Jaynes’ fallacy.

I agree that the limb sawing argument is a fallacy. But it’s also the case that P values, *as they are usually used*, will produce a high false positive rate. That conclusion doesn’t need any maths. All you have to do is to think of an RCT of homeopathy. The false positive rate will always be 100%, regardless of the P value, because the two groups have identical treatments.
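Colquhoun’s arithmetic can be checked with a quick simulation (a hedged sketch; the sample sizes, number of trials, and the normal-tail approximation to the t test are my own illustrative choices, not his): when both arms receive identical treatments the null is true by construction, so roughly 5% of trials reject at p < 0.05, and every one of those rejections is necessarily a false positive.

```python
import math
import random
import statistics

random.seed(1)

def two_sample_t_p(x, y):
    # Welch-style t statistic, with a normal approximation to the
    # two-sided tail area (adequate for this illustrative n).
    nx, ny = len(x), len(y)
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(
        statistics.variance(x) / nx + statistics.variance(y) / ny)
    return math.erfc(abs(t) / math.sqrt(2))

trials, rejections = 2000, 0
for _ in range(trials):
    # "Treatment" and "placebo" arms draw from the same distribution:
    # the null hypothesis is true in every single trial.
    x = [random.gauss(0, 1) for _ in range(30)]
    y = [random.gauss(0, 1) for _ in range(30)]
    if two_sample_t_p(x, y) < 0.05:
        rejections += 1

# About 5% of trials reject -- and since no real effect exists here,
# 100% of those rejections are false positives.
print(rejections / trials)
```

This is the sense in which the false positive rate among rejections is 100% in such a domain, whatever the p-value cutoff.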

David: We know your view on a different issue. Nobody ever said you couldn’t define a domain where all rejections are wrong, as Erich Lehmann used to say. So?

Just so readers know, as always, I’ve written to the authors, inviting them to comment.


Tragic to see Ioannidis step so far out of his bailiwick and pen such howlers. Perhaps if he does chime in here he can clarify where he received statistical and philosophical training. His CV lists a medical degree.

Just because P values, *as they are usually used*, (as Colquhoun states above) yield improper impressions of error rates is no reason to stop using them. This is as fallacious as declaring a ban on automobiles because people keep crashing them. We haven’t done that – instead we have implemented Driver’s Education and we require people to have a driver’s license when they operate an automobile.

This is precisely why so many grants in the USA require a line item cost for a statistician, so that medical doctors, engineers, biologists and so on will not mangle their data and continue to produce this degree of wrecked findings.

Proper use of NHST is the solution here, and that requires better training of researchers in many fields, notably those in the social sciences and psychology fields as discussed in replication crisis reviews. That training should also include teaching researchers to recognize when they are out of their bailiwick for statistical assessment of their findings, and how to obtain statistical guidance from a licensed statistician when they do not have the proper skills themselves to conduct appropriate statistical analyses.

I don’t go into the lab and start squirting liquids with pipettes, or pick up a scalpel and start removing limbs. Why professionals of other fields feel that statistics is something they can take on is a big part of the problem here.

Bashing NHST has become a fad of late – what do the bashers suggest to replace it? Goat entrails, anyone?

I don’t think that many people are suggesting that we should abolish P values. They are the only thing that can be calculated with any certainty, and they do what it says on the can.

I’m with Goodman and Valen Johnson: what’s needed is to change the way P values are described in works. My suggestions ( http://rsos.royalsocietypublishing.org/content/1/3/140216#comment-1889100957 ) are

P > 0.05 very weak evidence

P = 0.05 weak evidence: worth another look

P = 0.01 moderate evidence for a real effect

P = 0.001 strong evidence for real effect

Of course these are pretty rough – they could be overoptimistic if the prior was small enough.
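For concreteness, this verbal scale could be encoded as a small lookup (a sketch only; the function name is invented, and the cutpoints and labels are simply the ones suggested above, not a calibrated standard):

```python
def describe_p(p):
    """Map a p-value to the suggested verbal evidence scale.

    These are rough verbal guides, not calibrated posterior
    statements; they may be overoptimistic for small priors.
    """
    if p > 0.05:
        return "very weak evidence"
    if p > 0.01:
        return "weak evidence: worth another look"
    if p > 0.001:
        return "moderate evidence for a real effect"
    return "strong evidence for a real effect"

print(describe_p(0.04))  # weak evidence: worth another look
```

The boundary cases follow the scale directly: a p of exactly 0.05 reads “weak evidence”, 0.01 reads “moderate”, and 0.001 reads “strong”.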

What we need to get rid of is not P values, but the terms “significant” and “non-significant”, especially when the former is defined as P < 0.05.

“Significant” is a descriptive term in a most useful statistical testing paradigm. What we need to get rid of is misuse and misinterpretation thereof.

The level of evidence required to begin establishing the validity of a scientific phenomenon is context dependent. The gravitational wave crew would never have accepted your proposal – they insisted on a far smaller p-value for their first publication of an apparent gravitational wave event.

I’ve discussed your Royal Society paper elsewhere so I won’t repeat all that here

http://retractionwatch.com/2016/10/15/weekend-reads-arguments-for-abandoning-statistically-significant-boorish-behavior-and-useless-clinical-trials/#comment-1144007