fallacy of non-significance

(Full) Excerpt. Excursion 5 Tour II: How Not to Corrupt Power (Power Taboos, Retro Power, and Shpower)


returned from London…

The concept of a test’s power is still being corrupted in the myriad ways discussed in 5.5, 5.6.  I’m excerpting all of Tour II of Excursion 5, as I did with Tour I (of Statistical Inference as Severe Testing:How to Get Beyond the Statistics Wars 2018, CUP)*. Originally the two Tours comprised just one, but in finalizing corrections, I decided the two together was too long of a slog, and I split it up. Because it was done at the last minute, some of the terms in Tour II rely on their introductions in Tour I.  Here’s how it starts:

5.5 Power Taboos, Retrospective Power, and Shpower

Let’s visit some of the more populous tribes who take issue with power – by which we mean ordinary power – at least its post-data uses. Power Peninsula is often avoided due to various “keep out” warnings and prohibitions, or researchers come during planning, never to return. Why do some people consider it a waste of time, if not totally taboo, to compute power once we know the data? A degree of blame must go to N-P, who emphasized the planning role of power, and only occasionally mentioned its use in determining what gets “confirmed” post-data. After all, it’s good to plan how large a boat we need for a philosophical excursion to the Lands of Overlapping Statistical Tribes, but once we’ve made it, it doesn’t matter that the boat was rather small. Or so the critic of post-data power avers. A crucial disanalogy is that with statistics, we don’t know that we’ve “made it there,” when we arrive at a statistically significant result. The statistical significance alarm goes off, but you are not able to see the underlying discrepancy that generated the alarm you hear. The problem is to make the leap from the perceived alarm to an aspect of a process, deep below the visible ocean, responsible for its having been triggered. Then it is of considerable relevance to exploit information on the capability of your test procedure to result in alarms going off (perhaps with different decibels of loudness), due to varying values of the parameter of interest. There are also objections to power analysis with insignificant results.

Exhibit (vi): Non-significance + High Power Does Not Imply Support for the Null over the Alternative. Sander Greenland (2012) has a paper with this title. The first step is to understand the assertion, giving the most generous interpretation. It deals with non-significance, so our ears are perked for a fallacy of non-rejection. Second, we know that “high power” is an incomplete concept, so he clearly means high power against “the alternative.” We have a handy example: alternative μ.84 in T+ (POW(T+, μ.84) = 0.84).

Note to blog reader: μ.84 abbreviates “the alternative against which the test has 0.84 power.” This general abbreviation was introduced in Tour I. 

Use the water plant case, T+: H0: μ ≤ 150 vs. H1: μ > 150, σ = 10, n = 100. With α = 0.025, z0.025 = 1.96, and the corresponding cut-off in terms of x0.025 is [150 + 1.96(10)/√100] = 151.96], μ.84 = 152.96.

Now a title like this is supposed to signal a problem, a reason for those “keep out” signs. His point, in relation to this example, boils down to noting that an observed difference may not be statistically significant – x may fail to make it to the cut-off  x0:025 – and yet be closer to μ.84 than to 0. This happens because the Type II error probability β (here, 0.16)1 is greater than the Type I error probability (0.025).

For a quick computation let x0:025  = 152 and μ.84 = 153. Halfway between alternative 153 and the 150 null is 151.5. Any observed mean greater than 151.5 but less than the x0.025 cut-off, 152, will be an example of Greenland’s phenomenon. An example would be those values that are closer to 153, the alternative against which the test has 0.84 power, than to 150 and thus, by a likelihood measure, support 153 more than 150 –  even though POW(μ  = 153) is high (0.84). Having established the phenomenon, your next question is: so what?

It would  be problematic if power analysis took the insignificant result as evidence for μ  = 150 –  maintaining compliance with the ecological stipulation – and I don’t doubt some try to construe it as such, nor that Greenland has been put in the position of needing to correct them. Power analysis merely licenses μμ.84  where 0.84 was chosen for “high power.” Glance back at Souvenir X. So at least one of the “keep out” signs can be removed.

All of Excursion 5 Tour II (in proofs) is here.


1 That is, β(μ.84) = Pr(d < 0.4; μ = 0.6) = Pr(Z < −1) = 0.16.



*This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).

It is still valuable to look at the discussions and comments under “power” and “shpower” on this blog.

Earlier excerpts and mementos from SIST (May 2018-May 2019) are here.


Where YOU are in the journey.






Categories: fallacy of non-significance, power, Statistical Inference as Severe Testing | Leave a comment

Heads I win, tails you lose? Meehl and many Popperians get this wrong (about severe tests)!


bending of starlight.

[T]he impressive thing about the 1919 tests of Einstein ‘s theory of gravity] is the risk involved in a prediction of this kind. If observation shows that the predicted effect is definitely absent, then the theory is simply refuted. The theory is incompatible with certain possible results of observation—in fact with results which everybody before Einstein would have expected. This is quite different from the situation I have previously described, [where]..it was practically impossible to describe any human behavior that might not be claimed to be a verification of these [psychological] theories.” (Popper, CR, [p. 36))


Popper lauds Einstein’s General Theory of Relativity (GTR) as sticking its neck out, bravely being ready to admit its falsity were the deflection effect not found. The truth is that even if no deflection effect had been found in the 1919 experiments it would have been blamed on the sheer difficulty in discerning so small an effect (the results that were found were quite imprecise.) This would have been entirely correct! Yet many Popperians, perhaps Popper himself, get this wrong.[i] Listen to Popperian Paul Meehl (with whom I generally agree).

The stipulation beforehand that one will be pleased about substantive theory T when the numerical results come out as forecast, but will not necessarily abandon it when they do not, seems on the face of it to be about as blatant a violation of the Popperian commandment as you could commit. For the investigator, in a way, is doing…what astrologers and Marxists and psychoanalysts allegedly do, playing heads I win, tails you lose.” (Meehl 1978, 821)

No, there is a confusion of logic. A successful result may rightly be taken as evidence for a real effect H, even though failing to find the effect need not be taken to refute the effect, or even as evidence as against H. This makes perfect sense if one keeps in mind that a test might have had little chance to detect the effect, even if it existed. The point really reflects the asymmetry of falsification and corroboration. Popperian Alan Chalmers wrote an appendix to a chapter of his book, What is this Thing Called Science? (1999)(which at first had criticized severity for this) once I made my case. [i] Continue reading

Categories: fallacy of non-significance, philosophy of science, Popper, Severity, Statistics | Tags: | 2 Comments

Higgs Discovery two years on (1: “Is particle physics bad science?”)


July 4, 2014 was the two year anniversary of the Higgs boson discovery. As the world was celebrating the “5 sigma!” announcement, and we were reading about the statistical aspects of this major accomplishment, I was aghast to be emailed a letter, purportedly instigated by Bayesian Dennis Lindley, through Tony O’Hagan (to the ISBA). Lindley, according to this letter, wanted to know:

“Are the particle physics community completely wedded to frequentist analysis?  If so, has anyone tried to explain what bad science that is?”

Fairly sure it was a joke, I posted it on my “Rejected Posts” blog for a bit until it checked out [1]. (See O’Hagan’s “Digest and Discussion”) Continue reading

Categories: Bayesian/frequentist, fallacy of non-significance, Higgs, Lindley, Statistics | Tags: , , , , , | 4 Comments

“Out Damned Pseudoscience: Non-significant results are the new ‘Significant’ results!” (update)

Sell me that antiseptic!

We were reading “Out, Damned Spot: Can the ‘Macbeth effect’ be replicated?” (Earp,B., Everett,J., Madva,E., and Hamlin,J. 2014, in Basic and Applied Social Psychology 36: 91-8) in an informal gathering of our 6334 seminar yesterday afternoon at Thebes. Some of the graduate students are interested in so-called “experimental” philosophy, and I asked for an example that used statistics for purposes of analysis. The example–and it’s a great one (thanks Rory M!)–revolves around priming research in social psychology. Yes the field that has come in for so much criticism as of late, especially after Diederik Stapel was found to have been fabricating data altogether (search this blog, e.g., here).[1] Continue reading

Categories: fallacy of non-significance, junk science, reformers, Statistics | 15 Comments

P-values as posterior odds?

METABLOG QUERYI don’t know how to explain to this economist blogger that he is erroneously using p-values when he claims that “the odds are” (1 – p)/p that a null hypothesis is false. Maybe others want to jump in here?

On significance and model validation (Lars Syll)

Let us suppose that we as educational reformers have a hypothesis that implementing a voucher system would raise the mean test results with 100 points (null hypothesis). Instead, when sampling, it turns out it only raises it with 75 points and having a standard error (telling us how much the mean varies from one sample to another) of 20. Continue reading

Categories: fallacy of non-significance, Severity, Statistics | 36 Comments

Blog at WordPress.com.