Many fallacious uses of statistical methods result from supposing that the statistical inference licenses a jump to a substantive claim that is ‘on a different level’ from the statistical one being probed. Given the familiar refrain that statistical significance is not substantive significance, it may seem surprising how often criticisms of significance tests depend on running the two together! But it is not just two levels but a great many that need distinguishing in linking the collecting, modeling and analyzing of data to a variety of substantive claims of inquiry (though for simplicity I often focus on the three depicted, described in various ways).
A question that continues to arise revolves around a blurring of levels, and is behind my recent ESP post. It goes roughly like this:
If we are prepared to take a statistically significant proportion of successes (greater than .5) in n Binomial trials as grounds for inferring a real (better than chance) effect (perhaps of two teaching methods) but not as grounds for inferring Uri’s ESP (at guessing outcomes, say), then aren’t we implicitly invoking a difference in prior probabilities? The answer is no, but there are two very different points to be made:
First, merely finding evidence of a non-chance effect is at a different “level” from a subsequent question about the explanation or cause of a non-chance effect. To infer from the former to the latter is an example of a fallacy of rejection.[i] The nature and threats of error in the hypothesis about a specific cause of an effect are very different from those in merely inferring a real effect. There are distinct levels of inquiry and distinct errors at each given level. The severity analysis for the respective claims makes this explicit.[ii] Even a test that did a good job distinguishing and ruling out threats to a hypothesis of “mere chance” would not thereby have probed errors about specific causes or potential explanations. Nor does an “isolated record” of statistically significant results suffice. Recall Fisher: “In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result” (1935, 14). PSI researchers never managed to demonstrate this.
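To make Fisher's criterion concrete, here is a minimal sketch of the statistical level alone. It assumes a one-sided exact Binomial test of p = .5 against p > .5 at level .05; the function names (`binom_tail`, `critical_value`, `prob_significant`) and the particular values of n and p are illustrative choices, not anything taken from the ESP literature. It computes how often an experiment of n trials would yield a statistically significant result if the true success probability were p_true.

```python
from math import comb


def binom_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_value(n, alpha=0.05, p0=0.5):
    """Smallest number of successes k whose one-sided p-value is <= alpha.

    Returns n + 1 if no outcome can reach significance at this level.
    """
    for k in range(n + 1):
        if binom_tail(k, n, p0) <= alpha:
            return k
    return n + 1


def prob_significant(n, p_true, alpha=0.05, p0=0.5):
    """Probability that an n-trial experiment rejects p = p0 when p = p_true."""
    return binom_tail(critical_value(n, alpha, p0), n, p_true)


if __name__ == "__main__":
    # Hypothetical settings, chosen only for illustration.
    for p_true in (0.5, 0.55, 0.7):
        print(p_true, round(prob_significant(100, p_true), 3))
```

An experiment that “will rarely fail” to give significance corresponds to `prob_significant` being near 1, which under these assumptions requires a genuine and sufficiently large discrepancy from .5. Note what the calculation does not touch: even a value near 1 speaks only to the existence of a non-chance effect, and nothing in it probes whether ESP, a better teaching method, or a flaw in the protocol is responsible.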
[Michael Goldstein (2006) uses such an example to illustrate the need for priors (comparing ESP and the lady tasting tea), but his students did not seem to support his construal. He decided his students were far too credulous (in ESP). I was presented with an analogous example at the recent CMU conference on simplicity.]
But someone who grants that we may distinguish on grounds of severity the two inferences (e.g., to an improved teaching method vs ESP) may still want to know how we’d use background information to appraise the ESP hypothesis. Don’t we need Bayesian prior probabilities in appraising data x from an ESP trial? This gets to the second point: Our answer is that the background information that is relevant for inquiry into the phenomena is given in terms of the series of problems, flaws and fallacies described in my Sept. 22 post.
The background knowledge, insofar as it is relevant for inquiry, consists of very specific problems as well as specific recommendations/requirements for experimental designs. Communicating and using the background knowledge in inquiry also involves describing specific protocols, checks, and stipulations for any future experimental demonstrations (of PSI).
If given the choice to have the background summed up in terms of this series of problems and protocols (say in 1980), or only in a prior degree of belief in the reality of ESP, which would you take?
It is irrelevant that you may have no interest in pursuing research in ESP; the question is which would communicate the relevant information for conducting and analyzing research. (And of course I will make the analogous argument for inquiry more generally.)
So not only do we not appeal to prior probabilities to distinguish the inferences that our critic asks about, I really do not see how it could be deemed preferable to have a sum-up in terms of a formal prior probability distribution.
Are we to have a prior for each of the problems and protocols? Whose would we use? And, why such an indirect approach when we have severely tested experimental knowledge with which to scrutinize the data directly?
[i] There are two main kinds of fallacy of rejection: the first infers a discrepancy larger than is warranted; the second erroneously takes a statistical inference as warranting a subsequent inference.
You may search “fallacy of rejection” on this blog, and/or see Mayo and Spanos (2006, 2011) below.
[ii] We would also make distinctions between the well-testedness of hypothesis H, in relation to given data x, as opposed to in relation to all the available scientific evidence at a given time.
Fisher, R.A. (1935). The Design of Experiments.
Goldstein, M. (2006). “Subjective Bayesian Analysis: Principles and Practice.” Bayesian Analysis, 1(3): 403-420. DOI:10.1214/06-BA116
Mayo, D. and Spanos, A. (2011). “Error Statistics,” in Philosophy of Statistics, Handbook of Philosophy of Science, Volume 7 (volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster; general editors: Dov M. Gabbay, Paul Thagard and John Woods). Elsevier: 1-46.
Mayo, D. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction.” British Journal for the Philosophy of Science, 57: 323-357.