Any Jackie Mason fans out there? In connection with our discussion of power, and associated fallacies of rejection*–and since it’s Saturday night–I’m reblogging the following post.

In February [2012], in London, criminologist Katrin H. and I went to see Jackie Mason do his shtick, a one-man show billed as his swan song to England. It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix. Still, hearing his rants for the nth time was often quite hilarious.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond. But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which, to me, are not jokes. This blog found its reason for being partly as a place to expose, understand, and avoid them. Recall the September 26, 2011 post “Whipping Boys and Witch Hunters”: [i]

Fortunately, philosophers of statistics would surely not reprise decades-old howlers and fallacies. After all, it is the philosopher’s job to clarify and expose the conceptual and logical foibles of others; and even if we do not agree, we would never merely disregard and fail to address the criticisms in published work by other philosophers. Oh wait… one of the leading texts repeats the fallacy in its third edition:

“The classical thesis that a null hypothesis may be rejected with greater confidence, the greater the power of the test, is not borne out; indeed the reverse trend is signaled” (Howson and Urbach 2006, 154).

But this is mistaken. The frequentist appraisal of tests is, and has always been, the reverse, whether for Fisherian significance tests or those of the Neyman-Pearson variety. I pointed this out directly in relation to their Bayesian text in EGEK 1996, pp. 402-3.

But alas, they repeat it verbatim, with no reference to these corrections. Given the popularity of their text, the consequences are not surprising (at least in some quarters): another generation committing the same fallacy and/or repeating the same howlers against significance tests. It is essentially the fallacy behind the imaginary case of the “prionvac” reformer who is (inadvertently, we suppose) more impressed the smaller the discrepancy indicated, analogous to Mason’s haute cuisine chef (Oct. 4, 2011). (See also Note [ii].)
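To see the direction of the frequentist appraisal numerically, consider a one-sided z-test of H0: μ ≤ 0 with known σ. A minimal sketch (the sample sizes are made up for illustration):

```python
import math

Z_05 = 1.645  # one-sided .05 critical value of the standard normal

def just_significant_mean(sigma, n):
    """Smallest sample mean that just rejects H0: mu <= 0 at the .05
    level in a one-sided z-test with known sigma and sample size n."""
    return Z_05 * sigma / math.sqrt(n)

# The higher-powered test (larger n) rejects with a far smaller observed effect:
print(just_significant_mean(sigma=1.0, n=25))    # ~0.329
print(just_significant_mean(sigma=1.0, n=2500))  # ~0.033
```

A just-significant result from the more powerful test thus indicates a *smaller* discrepancy from the null, the reverse of the “greater confidence” reading.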

I am currently researching and writing a new book on contemporary philosophy of statistics. The review of the literature is itself a window on the movement of positions through philosophical scrutiny. With philosophy of frequentist statistics, however, I (often) find myself at the window of an older era, one I had hoped we’d left behind.

*And next week’s reading in Phil6334.

**References**

Howson, C. and P. Urbach (2006). *Scientific Reasoning: The Bayesian Approach*. La Salle, IL: Open Court.

Mayo, D. G. (1983). “An Objective Theory of Statistical Testing.” *Synthese* **57**(2): 297-340.

Mayo, D. G. (1996). *Error and the Growth of Experimental Knowledge* [EGEK]. Chicago: University of Chicago Press.

Mayo, D. G. and A. Spanos (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction.” *British Journal for the Philosophy of Science* **57**: 323-357.

Mayo, D. G. and A. Spanos (2011). “Error Statistics.” In *Philosophy of Statistics* (*Handbook of the Philosophy of Science*, Volume 7), volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster; general eds. Dov M. Gabbay, Paul Thagard, and John Woods. Elsevier: 1-46.

Morrison, D. and R. Henkel (eds.) (1970). *The Significance Test Controversy*. Chicago: Aldine.

Rosenthal, R. and J. Gaito (1963). “The Interpretation of Levels of Significance by Psychological Researchers.” *Journal of Psychology* **55**: 33-38.

[ii] “Now some early literature, e.g., Morrison and Henkel’s *Significance Test Controversy* (1970), performed an important service decades ago. They alerted social scientists to the fallacies of significance tests: misidentifying a statistically significant difference with one of substantive importance, interpreting insignificant results as evidence for the null hypothesis (especially problematic with insensitive tests), and the like. Chastising social scientists for applying significance tests in slavish and unthinking ways, contributors call attention to a cluster of pitfalls and fallacies of testing.

“The volume describes research studies conducted for the sole purpose of revealing these flaws. Rosenthal and Gaito (1963) document how it is not rare for scientists to mistakenly regard a statistically significant difference, say at level .05, as indicating a greater discrepancy from the null when arising from a large sample size rather than a smaller sample size, even though a correct interpretation of tests indicates the *reverse*.”

You should mention this to all the genomics researchers who get excited about the “new significant findings” to be discovered with increasingly larger sample sizes (keeping in mind that tests of association in genomics are applied to non-randomized samples).

vl: There are different things going on in the case of a large sample probing a given effect, and searching through a great many associations, right? With the latter screening enterprise, as in genomics, I take it they apply all kinds of adjustments. But maybe they also find themselves in the former situation?

Your point about non-randomized samples in genomics reminds me of a point that arose earlier (mentioned by Stan Young): http://errorstatistics.com/2013/06/19/stanley-young-better-p-values-through-randomization-in-microarrays/

vl: Genomics researchers are well aware of the issues of confounding; ask them about population stratification. And to help assess confounding by ancestry, there are literally millions of variants with negligible effects; finding even a little too much signal in them is a good indicator that one has a confounding problem. For everything else, note that genotypes are fixed at conception, so they are not affected by factors acquired over one’s lifespan.

Lack of randomization is not the issue you seem to suggest. If you think researchers are being statistically naive, could you perhaps explain why?
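The “too much signal in negligible-effect variants” check described above is essentially the genomic-control idea: under the null, squared z-scores should behave like χ²(1) draws, and a median inflated above the χ²(1) median signals confounding such as stratification. A rough simulation (the shift of 0.5 is a made-up confounding effect, not a realistic value):

```python
import numpy as np

rng = np.random.default_rng(1)
CHI2_1_MEDIAN = 0.4549  # median of a chi-square variate with 1 df

def lambda_gc(chi2_stats):
    """Genomic-control inflation factor: median observed statistic
    divided by the chi-square(1) null median."""
    return np.median(chi2_stats) / CHI2_1_MEDIAN

z_null = rng.normal(0.0, 1.0, 1_000_000)   # well-behaved null z-scores
z_strat = rng.normal(0.5, 1.0, 1_000_000)  # shifted by (made-up) stratification
print(lambda_gc(z_null ** 2))   # ~1.0: no confounding signal
print(lambda_gc(z_strat ** 2))  # >1: "a little too much signal" everywhere
```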

Randomisation and local control are important design issues when spotting samples on micro-arrays. See Christophe Lambert’s post “Stop Ignoring Experimental Design (or my head will explode)” http://blog.goldenhelix.com/?p=322 and also his interesting article with Laura Black in *Biostatistics*:

http://biostatistics.oxfordjournals.org/content/13/2/195.short

If Lambert is right, then genomics researchers may or may not be aware of randomisation but they are regularly ignoring it in their experimental technique to their cost.

The fact that genotypes are fixed at conception does not constitute a natural experiment, as I’m sure you’re aware. I’m familiar with the standard approaches to population stratification.

But to stay on the topic at hand, and to respond to Professor Mayo’s comment about adjustments: multiple comparison adjustments do get applied. Without randomization, I’m dubious that anyone has any idea of the relationship between real and nominal error rates, before or after the correction. More relevant to the post, these adjustments do not account for the fact that your millions of tests each correspond to a distinct statistical test with a distinct power (due to the different minor allele frequency of each SNP). There is no formal accounting for the fact that p-values across all SNPs in the genome should be appraised differently due to variation in power; in fact, the most common data visualization is simply a plot of millions of -log10(p-values).
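To make the MAF point concrete: for an additive effect on a quantitative trait, the noncentrality of the usual test scales with sqrt(2·n·MAF·(1−MAF)), so the same effect size yields very different power, and hence the same p-value means something different, at a rare versus a common SNP. A rough sketch (effect size, sample size, and threshold are hypothetical):

```python
import math
from statistics import NormalDist

nd = NormalDist()

def gwas_power(beta, maf, n, sigma=1.0, alpha=5e-8):
    """Approximate two-sided power to detect an additive effect beta (in
    trait-SD units per allele) at a SNP with the given minor allele
    frequency, assuming Hardy-Weinberg genotype frequencies."""
    ncp = beta * math.sqrt(2 * n * maf * (1 - maf)) / sigma
    z = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(ncp - z) + nd.cdf(-ncp - z)

# Same effect, same n, same genome-wide threshold -- very different power:
print(gwas_power(beta=0.05, maf=0.40, n=10_000))
print(gwas_power(beta=0.05, maf=0.05, n=10_000))
```

A single ranking of raw p-values across SNPs ignores exactly this variation.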

I also don’t expect the field to change its standard significance thresholds as sample sizes get larger. I haven’t heard the concerns about increasing power that Prof. Mayo raises here; mostly I hear excitement about obtaining more statistically significant hits and being able to detect “subtle” effects.

vl: I, and I’m sure many readers, would find it illuminating to have some links to the literature you’re talking about. Then I’d understand if this is something I am familiar with or something new (to me). Thanks.

vl: genotypes getting assigned at conception *does* enable natural experiments. Google “Mendelian randomization” and “nature’s randomized trial” to see lots of references on just this topic.

Given that lots of similar things are being estimated, shrinkage estimators ought to be possible. These use the idea that the regression of statistic on parameter is not the same as the regression of parameter on statistic.

If we call the parameter P and the statistic S then, if we treat parameters as random effects (natural in a hierarchical set-up), the regression of statistic on parameter is cov(P,S)/var(P), whereas the regression of parameter on statistic is cov(P,S)/var(S). In a standard model you can show that cov(P,S) = var(P), so that the regression of statistic on parameter is 1. This is a property I have called ‘direct unbiasedness’.

However, since sample sizes are not infinite, var(S) > var(P), and for the small sample sizes used in micro-array work var(S) >> var(P). Thus the regression of parameter on statistic is (often much) less than 1. Hence the shrinkage, and the use of shrunk estimates (sometimes called best linear unbiased predictors, BLUPs, in another context). Such shrunk parameter estimates have a property that I call ‘inverse unbiasedness’.
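The two slopes are easy to check by simulation: draw the parameters as random effects and add sampling noise to get the statistics. A minimal sketch (the variances are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
tau, sigma, m = 1.0, 2.0, 100_000   # sd of parameters, sd of noise, no. of "genes"

P = rng.normal(0.0, tau, m)          # parameters treated as random effects
S = P + rng.normal(0.0, sigma, m)    # statistics = parameter + sampling error

cov_PS = np.cov(P, S)[0, 1]
slope_S_on_P = cov_PS / np.var(P)    # ~1: 'direct unbiasedness'
slope_P_on_S = cov_PS / np.var(S)    # ~tau^2/(tau^2 + sigma^2) = 0.2: shrinkage
print(slope_S_on_P, slope_P_on_S)
```

The same covariance divided by two different scaling variances gives two very different slopes, which is the whole point.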

To assume that direct unbiasedness implies inverse unbiasedness is a logical error equivalent to assuming P(A|B) = P(B|A), which is not true if P(A) is not equal to P(B). Note the strong similarity between the two conditional probability equations P(A|B) = P(A&B)/P(B) and P(B|A) = P(A&B)/P(A) and the two regression equations: in both cases the numerator is the same; it is the scaling denominator that differs.

Confusing the two forms of unbiasedness is a particular example of an error I call ‘invalid inversion’. See http://onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2013.00652.x/abstract

@george – perhaps the Wikipedia article on Mendelian randomization is poorly written, but the dispute is precisely over whether genotypes are a valid instrument. So methods which assume that they are a valid instrument aren’t going to cut it as a justification.

The issue is that the assignment of the parents’ genotypes is not random, nor is mating. Thus a causal interpretation relying on the fact that the child’s genotype is a random assortment of the parents’ genotypes would be incredibly naive.

@mayo – there are two separate issues: the issue of causal inference that george brought up, and the erroneous interpretation of p-values across the genome when they correspond to different tests with different power.

Causal inference is a bit off-topic here, so I’ll try to stay on the power issue. One paper that discusses the power issue is – you’re going to absolutely hate this – Stephens and Balding’s 2009 review “Bayesian statistical methods for genetic association studies”.

I would be fine if you proposed a severity-based alternative to the Bayesian computation they suggest. What’s troubling to me is that these studies always assume that p-values across millions of tests with different power can be interpreted in the same way. In these papers you’ll see endless tables of “rankings” by p-values and plots of millions of p-values, even though the tests from which they’re derived correspond to different null distributions.