I resume my comments on the contributions to our symposium on Philosophy of Statistics at the Philosophy of Science Association. My earlier comment was on Gerd Gigerenzer’s talk. I move on to Clark Glymour’s “Exploratory Research Is More Reliable Than Confirmatory Research.” His complete slides are after my comments.

**GLYMOUR’S ARGUMENT (in a nutshell):**

“The anti-exploration argument has everything backwards,” says Glymour (slide #11). While John Ioannidis maintains that “Research findings are more likely true in confirmatory designs,” the opposite is so, according to Glymour. (Ioannidis 2005, Glymour’s slide #6). Why? To answer this he describes an exploratory research account for causal search that he has been developing:

**(slide #5)**

What’s confirmatory research for Glymour? It’s moving directly from rejecting a null hypothesis with a low P-value to inferring a causal claim.

**(slide #2)**

**MAYO ON GLYMOUR:**

I have my problems with Ioannidis, but Glymour’s description of exploratory inquiry is not what Ioannidis is on about. What Ioannidis is or ought to be criticizing are findings obtained through cherry picking, trying and trying again, p-hacking, multiple testing with selective reporting, hunting and snooping, exploiting researcher flexibility—*where those gambits make it easy to output a “finding” even though it’s false.* In those cases, the purported finding fails to pass a *severe test*. One reports the observed effect is difficult to achieve unless it’s genuine, when in fact it’s easy (frequent) to attain just by expected chance variability. The central sources of nonreplicability are precisely these data-dependent selection effects, and that’s why they’re criticized.

If you’re testing purported claims with stringency and multiple times, as Glymour describes, subjecting claims arrived at one stage to *checks at another*, then you’re not really in “exploratory inquiry” as Ioannidis and others describe it. There can be no qualms with testing a conjecture arrived at through searching, using new data (so long as the probability of affirming the finding isn’t assured, even if the causal claim is false.) I have often said that the terms exploratory and confirmatory should be dropped, and we should talk just about poorly tested and well tested claims, and reliable versus unreliable inquiries.

(Added Nov. 20, 2016 in burgundy): Admittedly, and this may be Glymour’s main point, Ioannidis’ categories of exploratory and confirmatory inquires are too coarse. Here’s Ioannidis’ chart:

But nowadays, to come away from a discussion thinking that the warranted criticism of unreliable explorations can be ignored, is dangerous. Hence my comment.

Thus we can agree that compared to Glymour’s “exploratory inquiry,” what he calls “confirmatory inquiry” is inferior. Doubtless some people conduct statistical tests this way (*shame on them*!), but to do so commits two glaring fallacies: (1) moving from a single statistically significant result to a genuine effect; and (2) moving from a statistically significant effect to a causal claim. Admittedly, Ioannidis’ (2005) critique is aimed at such abusive uses of significance tests.

R.A. Fisher denounced these fallacies donkey’s years ago:

“[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher 1935, p.14)

“[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter…requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions.” (Gigerenzer et al 1989, pp. 95-6)

Glymour has been a leader in developing impressive techniques for causal exploration and modeling. I take his thesis to be that stringent modeling and self-critical causal searching are likely to do better than extremely lousy “experiments.” (He will correct me if I’m wrong.)

I have my own severe gripes with Ioannidis’ portrayal and criticism of significance tests in terms of what I call “the diagnostic screening model of tests.” I can’t tell from Glymour’s slides whether he agrees that looking at the posterior predictive value (PPV), as Ioannidis does, is legitimate for scientific inference, and as a basis for criticizing a proper use of significance tests. I think it’s a big mistake and has caused serious misunderstandings. Two slides from my PSA presentation allude to this[i].

Moreover, using “power” as a conditional probability for a Bayesian-type computation here is problematic (the null and alternative don’t exhaust the space of possibilities). Issues with the diagnostic screening model of tests have come up a lot on this blog. Some relevant posts are at the end. **Please share your thoughts.**

**Here are Clark Glymour’s slides: **

**Clark Glymour** (Alumni University Professor in Philosophy, Carnegie Mellon University, Pittsburgh, Pennsylvania) *“Exploratory Research is More Reliable Than Confirmatory Research” *(Abstract)

[i] I haven’t posted my PSA slides yet; I wanted the focus of this post to be on Glymour.

**Blogposts relating to the “diagnostic model of tests”**

- (11/9) Beware of questionable front page articles warning you to beware of questionable front page articles (iii)
- 03/16 Stephen Senn: The pathetic P-value (Guest Post)
- 05/09 Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)
- 01/17 “P-values overstate the evidence against the null”: legit or fallacious?
- 01/19 High error rates in discussions of error rates (1/21/16 update)
- 01/24 Hocus pocus! Adopt a magician’s stance, if you want to reveal statistical sleights of hand
- 04/11 When the rejection ratio (1 – β)/α turns evidenc0e on its head, for those practicing in an error-statistical tribe (ii)
- (08/28) TragiComedy hour: P-values vs posterior probabilities vs diagnostic error rates

**References**:

- Fisher, R. A. (1947).
*The Design of Experiments*, 4^{th }ed. Edinburgh: Oliver and Boyd.
- Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krüger, L. (1989).
*Empire of Chance: How Probability Changed Science and Everyday Life*. Cambridge UK: Cambridge University Press.
- Ioannidis, J. (2005). “Why most published research ﬁndings are false“,
*PLoS Med* 2(8):0696-0701.