returned from London…
The concept of a test’s power is still being corrupted in the myriad ways discussed in 5.5, 5.6. I’m excerpting all of Tour II of Excursion 5, as I did with Tour I (of Statistical Inference as Severe Testing:How to Get Beyond the Statistics Wars 2018, CUP)*. Originally the two Tours comprised just one, but in finalizing corrections, I decided the two together was too long of a slog, and I split it up. Because it was done at the last minute, some of the terms in Tour II rely on their introductions in Tour I. Here’s how it starts:
5.5 Power Taboos, Retrospective Power, and Shpower
Let’s visit some of the more populous tribes who take issue with power – by which we mean ordinary power – at least its post-data uses. Power Peninsula is often avoided due to various “keep out” warnings and prohibitions, or researchers come during planning, never to return. Why do some people consider it a waste of time, if not totally taboo, to compute power once we know the data? A degree of blame must go to N-P, who emphasized the planning role of power, and only occasionally mentioned its use in determining what gets “confirmed” post-data. After all, it’s good to plan how large a boat we need for a philosophical excursion to the Lands of Overlapping Statistical Tribes, but once we’ve made it, it doesn’t matter that the boat was rather small. Or so the critic of post-data power avers. A crucial disanalogy is that with statistics, we don’t know that we’ve “made it there,” when we arrive at a statistically significant result. The statistical significance alarm goes off, but you are not able to see the underlying discrepancy that generated the alarm you hear. The problem is to make the leap from the perceived alarm to an aspect of a process, deep below the visible ocean, responsible for its having been triggered. Then it is of considerable relevance to exploit information on the capability of your test procedure to result in alarms going off (perhaps with different decibels of loudness), due to varying values of the parameter of interest. There are also objections to power analysis with insignificant results. Continue reading →
bending of starlight.
[T]he impressive thing about the 1919 tests of Einstein ‘s theory of gravity] is the risk involved in a prediction of this kind. If observation shows that the predicted effect is definitely absent, then the theory is simply refuted. The theory is incompatible with certain possible results of observation—in fact with results which everybody before Einstein would have expected. This is quite different from the situation I have previously described, [where]..it was practically impossible to describe any human behavior that might not be claimed to be a verification of these [psychological] theories.” (Popper, CR, [p. 36))
Popper lauds Einstein’s General Theory of Relativity (GTR) as sticking its neck out, bravely being ready to admit its falsity were the deflection effect not found. The truth is that even if no deflection effect had been found in the 1919 experiments it would have been blamed on the sheer difficulty in discerning so small an effect (the results that were found were quite imprecise.) This would have been entirely correct! Yet many Popperians, perhaps Popper himself, get this wrong.[i] Listen to Popperian Paul Meehl (with whom I generally agree).
The stipulation beforehand that one will be pleased about substantive theory T when the numerical results come out as forecast, but will not necessarily abandon it when they do not, seems on the face of it to be about as blatant a violation of the Popperian commandment as you could commit. For the investigator, in a way, is doing…what astrologers and Marxists and psychoanalysts allegedly do, playing heads I win, tails you lose.” (Meehl 1978, 821)
No, there is a confusion of logic. A successful result may rightly be taken as evidence for a real effect H, even though failing to find the effect need not be taken to refute the effect, or even as evidence as against H. This makes perfect sense if one keeps in mind that a test might have had little chance to detect the effect, even if it existed. The point really reflects the asymmetry of falsification and corroboration. Popperian Alan Chalmers wrote an appendix to a chapter of his book, What is this Thing Called Science? (1999)(which at first had criticized severity for this) once I made my case. [i] Continue reading →
July 4, 2014 was the two year anniversary of the Higgs boson discovery. As the world was celebrating the “5 sigma!” announcement, and we were reading about the statistical aspects of this major accomplishment, I was aghast to be emailed a letter, purportedly instigated by Bayesian Dennis Lindley, through Tony O’Hagan (to the ISBA). Lindley, according to this letter, wanted to know:
“Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?”
Fairly sure it was a joke, I posted it on my “Rejected Posts” blog for a bit until it checked out . (See O’Hagan’s “Digest and Discussion”) Continue reading →
Sell me that antiseptic!
We were reading “Out, Damned Spot: Can the ‘Macbeth effect’ be replicated?” (Earp,B., Everett,J., Madva,E., and Hamlin,J. 2014, in Basic and Applied Social Psychology 36: 91-8) in an informal gathering of our 6334 seminar yesterday afternoon at Thebes. Some of the graduate students are interested in so-called “experimental” philosophy, and I asked for an example that used statistics for purposes of analysis. The example–and it’s a great one (thanks Rory M!)–revolves around priming research in social psychology. Yes the field that has come in for so much criticism as of late, especially after Diederik Stapel was found to have been fabricating data altogether (search this blog, e.g., here). Continue reading →
I don’t know how to explain to this economist blogger that he is erroneously using p-values when he claims that “the odds are” (1 – p)/p that a null hypothesis is false. Maybe others want to jump in here?
On significance and model validation (Lars Syll)
Let us suppose that we as educational reformers have a hypothesis that implementing a voucher system would raise the mean test results with 100 points (null hypothesis). Instead, when sampling, it turns out it only raises it with 75 points and having a standard error (telling us how much the mean varies from one sample to another) of 20. Continue reading →