Taboos about power nearly always stem from misuse of power analysis. Sander Greenland (2012) has a paper called “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.” I’m not saying Greenland errs; the error would be made by anyone who interprets power analysis in a manner giving rise to Greenland’s objection. So what’s (ordinary) power analysis?
(I) Listen to Jacob Cohen (1988) introduce Power Analysis
“PROVING THE NULL HYPOTHESIS. Research reports in the literature are frequently flawed by conclusions that state or imply that the null hypothesis is true. For example, following the finding that the difference between two sample means is not statistically significant, instead of properly concluding from this failure to reject the null hypothesis that the data do not warrant the conclusion that the population means differ, the writer concludes, at least implicitly, that there is no difference. The latter conclusion is always strictly invalid, and is functionally invalid as well unless power is high. The high frequency of occurrence of this invalid interpretation can be laid squarely at the doorstep of the general neglect of attention to statistical power in the training of behavioral scientists.
What is really intended by the invalid affirmation of a null hypothesis is not that the population ES is literally zero, but rather that it is negligible, or trivial. This proposition may be validly asserted under certain circumstances. Consider the following: for a given hypothesis test, one defines a numerical value i (or iota) for the ES, where i is so small that it is appropriate in the context to consider it negligible (trivial, inconsequential). Power (1 – b) is then set at a high value, so that b is relatively small. When, additionally, a is specified, n can be found. Now, if the research is performed with this n and it results in nonsignificance, it is proper to conclude that the population ES is no more than i, i.e., that it is negligible; this conclusion can be offered as significant at the b level specified. In much research, “no” effect (difference, correlation) functionally means one that is negligible; “proof” by statistical induction is probabilistic. Thus, in using the same logic as that with which we reject the null hypothesis with risk equal to a, the null hypothesis can be accepted in preference to that which holds that ES = i with risk equal to b. Since i is negligible, the conclusion that the population ES is not as large as i is equivalent to concluding that there is “no” (nontrivial) effect. This comes fairly close and is functionally equivalent to affirming the null hypothesis with a controlled error rate (b), which, as noted above, is what is actually intended when null hypotheses are incorrectly affirmed (J. Cohen 1988, p. 16).
Here Cohen imagines the researcher sets the size of a negligible discrepancy ahead of time, something not always available. Even where a negligible i may be specified, the power to detect that i may be low rather than high. Still, two important points can be made; they are developed in (II) and (III) below.
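Cohen's recipe (fix a negligible i, fix α and power 1 – b, then solve for n) can be sketched for a one-sided Normal test with known σ. The particular numbers below (i = 0.6, σ = 2) are illustrative assumptions, chosen to echo the example in (II):

```python
import math
from statistics import NormalDist

def cohen_n(iota: float, sigma: float, alpha: float, power: float) -> int:
    """Smallest n giving the stated power to detect a discrepancy iota
    in a one-sided Normal (z) test with known sigma."""
    N = NormalDist()
    z_alpha = N.inv_cdf(1 - alpha)   # e.g. 1.96 for alpha = .025
    z_beta = N.inv_cdf(power)        # e.g. ~1.0 for power = .84
    return math.ceil(((z_alpha + z_beta) * sigma / iota) ** 2)

# If a discrepancy of 0.6 is deemed negligible (sigma = 2, alpha = .025,
# power = .84), the formula gives n = 97; with the rounded z-values 2 and 1
# used later in the post, the same formula gives exactly n = 100.
n = cohen_n(iota=0.6, sigma=2.0, alpha=0.025, power=0.84)
```

If a study with this n comes out nonsignificant, Cohen's logic licenses concluding that the population discrepancy is no more than i, with error rate b.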
Now to tell what’s true about Greenland’s concern that “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.”
(II) The first step is to understand the assertion, giving the most generous interpretation. It deals with nonsignificance, so our ears are perked for a fallacy of nonrejection or nonsignificance. We know that “high power” is an incomplete concept, so he clearly means high power against “the alternative”.
For a simple example of Greenland’s phenomenon, consider an example of the Normal test we’ve discussed a lot on this blog. Let T+ test H_{0}: µ ≤ 12 versus H_{1}: µ > 12, with σ = 2, n = 100. The test statistic is Z = √100(M – 12)/2, where M is the sample mean. With α = .025, the cut-off for declaring .025 significance is M*_{.025} = 12 + 2(2)/√100 = 12.4 (rounding z to 2 rather than 1.96 to keep the Figure below simple).
[Note: The thick black vertical line in the Figure, which I haven’t gotten to yet, is going to be the observed mean, M_{0} = 12.35. It’s a bit lower than the cut-off at 12.4.]
Now a title like Greenland’s is supposed to signal some problem. What is it? The statistical part just boils down to noting that the observed mean M_{0} (e.g., 12.35) may fail to make it to the cut-off M* (here 12.4), and yet be closer to an alternative against which the test has high power (e.g., 12.6) than it is to the null value, here 12. This happens because the Type 2 error probability is allowed to be greater than the Type 1 error probability (here .025).
Abbreviate the alternative against which the test T+ has .84 power as µ^{.84}, as I’ve often done. (See, for example, this post.) That is, the probability that test T+ rejects the null when µ = µ^{.84} is .84, i.e., POW(T+, µ^{.84}) = .84. One of our power short-cut rules tells us:
µ^{.84} = M* + 1σ_{M} = 12.4 + .2 = 12.6,
where σ_{M} := σ/√100 = .2.
Note: the Type 2 error probability in relation to the alternative µ = 12.6 is .16. This is the area to the left of 12.4 under the red curve in the Figure. Pr(M < 12.4; μ = 12.6) = Pr(Z < –1) = .16 = β(12.6).
µ^{.84} exceeds the null value by 3σ_{M}, so any observed mean that exceeds 12 by more than 1.5σ_{M} but less than 2σ_{M} gives an example of Greenland’s phenomenon. In T+, values 12.3 < M_{0} < 12.4 do the job. Pick M_{0} = 12.35. That value is indicated by the black vertical line in the Figure.
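The numbers in the example can be verified directly. A minimal sketch in Python (using the text's rounded cut-off z = 2):

```python
from statistics import NormalDist

N = NormalDist()
mu0, sigma, n = 12.0, 2.0, 100
sigma_M = sigma / n ** 0.5        # standard error of the mean: 0.2
M_star = mu0 + 2 * sigma_M        # cut-off 12.4 (z rounded to 2)

# Power against mu = 12.6: Pr(M >= 12.4; mu = 12.6) = Pr(Z >= -1) = .84
mu_84 = 12.6
power = 1 - N.cdf((M_star - mu_84) / sigma_M)

# Greenland's phenomenon at M0 = 12.35: nonsignificant,
# yet closer to the alternative 12.6 than to the null value 12
M0 = 12.35
nonsignificant = M0 < M_star
closer_to_alternative = abs(M0 - mu_84) < abs(M0 - mu0)
```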
Having established the phenomenon, your next question is: so what?
It would be problematic if power analysis took the insignificant result as evidence for μ = 12 (i.e., 0 discrepancy from the null). I’ve no doubt some try to construe it as such, and that Greenland has been put in the position of needing to correct them. This is the reverse of the “mountains out of molehills” fallacy: it’s making molehills out of mountains. It’s not uncommon: a nonsignificant observed risk increase is taken as evidence that risks are “negligible or nonexistent” or the like. The data are looked at through overly rosy glasses (or bottle). Power analysis enters to avoid taking no evidence of increased risk as evidence of no risk. Its reasoning only licenses μ < µ^{.84}, where .84 was chosen for “high power”. From what we see in Cohen, he does not give a green light to the fallacious use of power analysis.
(III) Now for how the inference from power analysis is akin to significance testing (as Cohen observes). Let μ^{1−β} be the alternative against which test T+ has high power, (1 – β). Power analysis sanctions the inference that would accrue if we switched the null and alternative, yielding the one-sided test in the opposite direction, T-, we might call it. That is, T- tests H_{0}: μ ≥ μ^{1−β} versus H_{1}: μ < μ^{1−β} at the β level. The test rejects H_{0} (at level β) when M < μ^{1−β} – z_{β}σ_{M}. Such a significant result would warrant inferring μ < μ^{1−β} at significance level β. Using power analysis doesn’t require making this switcheroo, which might seem complicated. The point is that there’s really no new reasoning involved in power analysis, which is why the members of the Fisherian tribe manage it without even mentioning power.
EXAMPLE. Use μ^{.84} in test T+ (α = .025, n = 100, σ_{M} = .2) to create test T-. Test T+ has .84 power against μ^{.84} = 12 + 3σ_{M} = 12.6 (with our usual rounding). So, test T- is
H_{0}: μ ≥ 12.6 versus H_{1}: μ < 12.6
and a result is statistically significantly smaller than 12.6 at level .16 whenever the sample mean M < 12.6 – 1σ_{M} = 12.4. To check, note (as when computing the Type 2 error probability of test T+) that
Pr(M < 12.4; μ = 12.6) = Pr(Z < –1) = .16 = β. In test T-, this serves as the Type 1 error probability.
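The duality can be checked numerically; a minimal sketch (z_β rounded to 1, as in the text):

```python
from statistics import NormalDist

N = NormalDist()
sigma_M = 0.2
mu_hi = 12.6        # null value of test T-: H0: mu >= 12.6
z_beta = 1.0        # rounded; exactly inv_cdf(1 - .16) is about 0.99

# T- rejects (at level ~.16) when M < 12.6 - 1(0.2) = 12.4
cut_low = mu_hi - z_beta * sigma_M

# Its Type 1 error probability equals the Type 2 error of T+ at 12.6
type1_Tminus = N.cdf((cut_low - mu_hi) / sigma_M)   # Pr(M < 12.4; mu = 12.6)
```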
So ordinary power analysis follows the identical logic as significance testing. [i] Here’s a qualitative version of the logic of ordinary power analysis.
Ordinary Power Analysis: If data x are not statistically significantly different from H_{0}, and the power to detect discrepancy γ is high, then x indicates that the actual discrepancy is no greater than γ.[ii]
Or, another way to put this:
If data x are not statistically significantly different from H_{0}, then x indicates that the underlying discrepancy (from H_{0}) is no greater than γ, just to the extent that the power to detect discrepancy γ is high.
************************************************************************************************
[i] Neyman, we’ve seen, was an early power analyst. See, for example, this post.
[ii] Compare power analytic reasoning with severity reasoning from a negative or insignificant result.
POWER ANALYSIS: If Pr(d > c_{α}; µ’) = high and the result is not significant, then it’s evidence µ < µ’
SEVERITY ANALYSIS (for an insignificant result): If Pr(d > d_{0}; µ’) = high and the result is not significant, then it’s evidence µ < µ’.
Severity replaces the pre-designated cut-off c_{α} with the observed d_{0}. Thus we obtain the same result while remaining in the Fisherian tribe. We still abide by power analysis though, since if Pr(d > d_{0}; µ’) is high then Pr(d > c_{α}; µ’) is high, at least in a sensible test like T+. In other words, power analysis is conservative: it gives a sufficient but not a necessary condition for warranting the bound µ < µ’. But why view a miss as good as a mile? Power is too coarse.
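The conservativeness point is easy to exhibit with the running example (M_{0} = 12.35, µ’ = 12.6); a minimal sketch:

```python
from statistics import NormalDist

N = NormalDist()
mu0, sigma_M, mu_prime = 12.0, 0.2, 12.6
c_alpha = 2.0                        # pre-set cut-off in z units (rounded)
d0 = (12.35 - mu0) / sigma_M         # attained z = 1.75: nonsignificant
shift = (mu_prime - mu0) / sigma_M   # 12.6 is 3 sigma_M above the null

# Power analysis uses only the pre-set cut-off: Pr(d > c_alpha; mu')
power = 1 - N.cdf(c_alpha - shift)       # Pr(Z > -1)    ~ .84
# Severity uses the attained result:      Pr(d > d0; mu')
severity = 1 - N.cdf(d0 - shift)         # Pr(Z > -1.25) ~ .89

# Since d0 < c_alpha, severity exceeds power: the power-analytic
# bound mu < mu' holds a fortiori, i.e., power analysis is conservative.
assert severity >= power
```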
Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum. [Link to quote above: p. 16]
Greenland, S. 2012. ‘Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative’, Annals of Epidemiology 22, pp. 364-8. Link to paper: Greenland (2012)
Allan Birnbaum died 40 years ago today. He lived to be only 53 [i]. From the perspective of philosophy of statistics and philosophy of science, Birnbaum is best known for his work on likelihood, the Likelihood Principle [ii], and for his attempts to blend concepts of likelihood with error probability ideas to arrive at what he termed “concepts of statistical evidence”. Failing to find adequate concepts of statistical evidence, Birnbaum called for joining the work of “interested statisticians, scientific workers and philosophers and historians of science”–an idea I have heartily endorsed. While known for a result that the (strong) Likelihood Principle followed from sufficiency and conditionality principles (a result that Jimmy Savage deemed one of the greatest breakthroughs in statistics), a few years after publishing it, he turned away from it, perhaps discovering gaps in his argument. A post linking to a 2014 Statistical Science issue discussing Birnbaum’s result is here. Reference [5] links to the Synthese 1977 volume dedicated to his memory. The editors describe it as their way of “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. Ample weekend reading!
NATURE VOL. 225 MARCH 14, 1970 (1033)
LETTERS TO THE EDITOR
Statistical Methods in Scientific Inference (posted earlier here)
It is regrettable that Edwards’s interesting article[1], supporting the likelihood and prior likelihood concepts, did not point out the specific criticisms of likelihood (and Bayesian) concepts that seem to dissuade most theoretical and applied statisticians from adopting them. As one whom Edwards particularly credits with having ‘analysed in depth…some attractive properties’ of the likelihood concept, I must point out that I am not now among the ‘modern exponents’ of the likelihood concept. Further, after suggesting that the notion of prior likelihood was plausible as an extension or analogue of the usual likelihood concept (ref.2, p. 200)[2], I have pursued the matter through further consideration and rejection of both the likelihood concept and various proposed formalizations of prior information and opinion (including prior likelihood). I regret not having expressed my developing views in any formal publication between 1962 and late 1969 (just after ref. 1 appeared). My present views have now, however, been published in an expository but critical article (ref. 3, see also ref. 4)[3] [4], and so my comments here will be restricted to several specific points that Edwards raised.
If there has been ‘one rock in a shifting scene’ of general statistical thinking and practice in recent decades, it has not been the likelihood concept, as Edwards suggests, but rather the concept by which confidence limits and hypothesis tests are usually interpreted, which we may call the confidence concept of statistical evidence. This concept is not part of the Neyman-Pearson theory of tests and confidence region estimation, which denies any role to concepts of statistical evidence, as Neyman consistently insists. The confidence concept takes from the Neyman-Pearson approach techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data. (The absence of a comparable property in the likelihood and Bayesian approaches is widely regarded as a decisive inadequacy.) The confidence concept also incorporates important but limited aspects of the likelihood concept: the sufficiency concept, expressed in the general refusal to use randomized tests and confidence limits when they are recommended by the Neyman-Pearson approach; and some applications of the conditionality concept. It is remarkable that this concept, an incompletely formalized synthesis of ingredients borrowed from mutually incompatible theoretical approaches, is evidently useful continuously in much critically informed statistical thinking and practice [emphasis mine].
While inferences of many sorts are evident everywhere in scientific work, the existence of precise, general and accurate schemas of scientific inference remains a problem. Mendelian examples like those of Edwards and my 1969 paper seem particularly appropriate as case-study material for clarifying issues and facilitating effective communication among interested statisticians, scientific workers and philosophers and historians of science.
Allan Birnbaum
New York University
Courant Institute of Mathematical Sciences,
251 Mercer Street,
New York, NY 10012
Birnbaum’s confidence concept, sometimes written (Conf), was his attempt to find in error statistical ideas a concept of statistical evidence–a term that he invented and popularized. In Birnbaum 1977 (24), he states it as follows:
(Conf): A concept of statistical evidence is not plausible unless it finds ‘strong evidence for J as against H’ with small probability (α) when H is true, and with much larger probability (1 – β) when J is true.
Birnbaum questioned whether Neyman-Pearson methods had “concepts of evidence” simply because Neyman talked of “inductive behavior” and Wald and others couched statistical methods in decision-theoretic terms. I have been urging that we consider instead how the tools may actually be used, and not be restricted by the statistical philosophies of founders (not to mention that so many of their statements are tied up with personality disputes, and problems of “anger management”). Recall, as well, E. Pearson’s insistence on an evidential construal of N-P methods, and the fact that Neyman, in practice, spoke of drawing inferences and reaching conclusions (e.g., Neyman’s nursery posts, links in [iii] below).
Still, since Birnbaum’s (Conf) appears to allude to pre-trial error probabilities, I regard (Conf) as still too “behavioristic”. But I discovered that Pratt, in the link in [5] below, entertains the possibility of viewing Conf in terms of what might be called post-data or “attained” error probabilities. Some of his papers hint at the possibility that he would have wanted to use Conf for a post-data assessment of how well (or poorly) various claims were tested. I developed the concept of severity and severe testing to provide an “evidential” or “inferential” notion, along with a statistical philosophy and a philosophy of science in which it is to be embedded.
I think that Fisher (1955) is essentially correct in maintaining that “When, therefore, Neyman denies the existence of inductive reasoning he is merely expressing a verbal preference”. It is a verbal preference one can also find in Popper’s view of corroboration. (He, and current day critical rationalists, also hold that probability arises to evaluate degrees of severity, well-testedness or corroboration, not inductive confirmation.) The inference to the severely corroborated claim is still inductive. It goes beyond the premises. It is qualified by the relevant severity assessments.
I have many of Birnbaum’s original drafts of papers and articles here (with carbon copies (!) and hand-written notes in the margins), thanks to the philosopher of science, Ronald Giere, who gave them to me years ago[iii].
***
[i] His untimely death was a suicide.
[ii] A considerable number of posts on the strong likelihood principle (SLP) may be found searching this blog (e.g., here and here). Links or references to the associated literature, perhaps all of it, may also be found here. A post linking to the 2014 Statistical Science issue on my criticism of Birnbaum’s “breakthrough” (to the SLP) is here.
[iii]See posts under “Neyman’s Nursery” (1, 2, 3, 4, 5)
References
[3] Birnbaum, A. (1969). ‘Concepts of Statistical Evidence’, in Philosophy, Science and Method: Essays in Honor of Ernest Nagel (edited by Morgenbesser, S., Suppes, P., and White, M.). St. Martin’s Press, NY.
[4] Birnbaum, A. (1968). ‘Likelihood’, in International Encyclopedia of the Social Sciences. Crowell-Collier, NY.
[5] Full contents of the 1977 Synthese volume dedicated to his memory can be found in this post.
[6] Birnbaum, A. (1977). “The Neyman-Pearson theory as decision theory, and as inference theory; with a criticism of the Lindley-Savage argument for Bayesian theory”. Synthese 36 (1) : 19-49. See links in [5]
Professor Richard Gill
Statistics Group
Mathematical Institute
Leiden University
It was statistician Richard Gill who first told me about Diederik Stapel (see an earlier post on Diederik). We were at a workshop on Error in the Sciences at Leiden in 2011. I was very lucky to have Gill be assigned as my commentator/presenter—he was excellent! As I was explaining some data problems to him, he suddenly said, “Some people don’t bother to collect data at all!” That’s when I learned about Stapel.
Committees often turn to Gill when someone’s work is being scrutinized for bad statistics, fraud, or anything in between. Do you think he’s being too easy on researchers when he says, about a given case:
“data has been obtained by some combination of the usual ‘questionable research practices’ [QRPs] which are prevalent in the field in question. Everyone does it this way, in fact, if you don’t, you’d never get anything published. …People are not deliberately cheating: they honestly believe in their theories and believe the data is supporting them.”
Isn’t that the danger in relying on deeply felt background beliefs? Have our attitudes changed (toward QRPs) over the past 3 years (harsher or less harsh)? Here’s a talk of his I blogged 3 years ago (followed by a letter he allowed me to post). I reflect on the pseudoscientific nature of the ‘recovered memories’ program in one of the Geraerts et al. papers in a later post.
I certainly have been thinking about these issues a lot in recent months. I got entangled in intensive scientific and media discussions – mainly confined to the Netherlands – concerning the cases of social psychologist Dirk Smeesters and of psychologist Elke Geraerts. See: http://www.math.leidenuniv.nl/~gill/Integrity.pdf
And I recently got asked to look at the statistics in some papers of another … [researcher] ..but this one is still confidential ….
The verdict on Smeesters was that he, like Stapel, actually faked data (though he still denies this). The Geraerts case is very much open, very much unclear. The senior co-authors Merckelbach, McNally of the attached paper, published in the journal “Memory”, have asked the journal editors for it to be withdrawn because they suspect the lead author, Elke Geraerts, of improper conduct. She denies any impropriety. It turns out that none of the co-authors have the data. Legally speaking it belongs to the University of Maastricht where the research was carried out and where Geraerts was a promising postdoc in Merckelbach’s group. She later got a chair at Erasmus University Rotterdam and presumably has the data herself but refuses to share it with her old co-authors or any other interested scientists. Just looking at the summary statistics in the paper one sees evidence of “too good to be true”. Average scores in groups supposed in theory to be similar are much closer to one another than one would expect on the basis of the within-group variation (the paper reports averages and standard deviations for each group, so it is easy to compute the F statistic for equality of the three similar groups and use its left tail probability as test statistic).
The same phenomenon turns up in another unpublished paper by the same authors and moreover in one of the papers contained in Geraerts’s (Maastricht) thesis. I attach the two papers published in Geraerts’s thesis which present results in very much the same pattern as the disputed “Memory” paper. Four groups of subjects, three supposed in theory to be rather similar, one expected to be strikingly different. In one of the two, just as in the Memory paper, the average scores of the three similar groups are much closer to one another than one would expect on the basis of the within-groups variation.
I got involved in the quarrel between Merckelbach and Geraerts which was being fought out in the media so various science journalists also consulted me about the statistical issues. I asked Geraerts if I could have the data of the Memory paper so that I could carry out distribution-free versions of the statistical tests of “too good to be true” which are easy to perform if you just have the summary statistics. She claimed that I had to get permission from the University of Maastricht. At some point both the presidents of Maastricht and Erasmus university were involved and presumably their legal departments too. Finally I got permission and arranged a meeting with Geraerts where she was going to tell me “her side of the story” and give me the data and we would look at my analyses together. Merckelbach and his other co-authors all enthusiastically supported this too, by the way. However at the last moment the chair of her department at Erasmus university got worried and stepped in and now an internal Rotterdam (=Erasmus) committee is investigating the allegations and Geraerts is not allowed to give anyone the data or talk to anyone about the problem.
I think this is totally crazy. First of all, the data set should have been made public years ago. Secondly, the fact that the co-authors of the paper never even saw the data themselves is a sign of poor research practices. Thirdly, getting university lawyers and having high level university ethics committees involved does not further science. Science is furthered by open discussion. Publish the data, publish the criticism, and let the scientific community come to its own conclusion. Hold a workshop where different points of view of presented about what is going on in these papers, where statisticians and psychologists communicate to one another.
Probably, Geraerts’s data has been obtained by some combination of the usual “questionable research practices” which are prevalent in the field in question. Everyone does it this way, in fact, if you don’t, you’d never get anything published: sample sizes are too small, effects are too small, noise is too large. People are not deliberately cheating: they honestly believe in their theories and believe the data is supporting them and are just doing the best to make this as clear as possible to everyone.
Richard
PS Summary of my investigation of the papers contained in Geraerts’s PhD thesis:
ch 8: Geraerts et al 2006b BRAT, “Long term consequences of suppression of intrusive anxious thoughts and repressive coping”.
ch 9: Geraerts et al 2006 AJP, “Suppression of intrusive thoughts and working memory capacity in repressive coping”.
These two chapters show the pattern of four groups of subjects, three of which are very similar, while the fourth is strikingly different with respect to certain (but not all) responses. In the case of chapter 8, the groups which are expected to be similar are (just as in the already disputed Memory and JAb papers) actually much too similar! The average scores are closer to one another than one can expect on the basis of the observed within-group variation (1 over square root of N law). In the case of chapter 9, nothing odd seems to be going on. The variation between the average scores of similar groups of subjects is just as big as it ought to be, relative to the variation within the groups.
Geraerts et al (2008 Memory pdf). “Recovered memories of childhood sexual abuse: Current findings and their legal implications” Legal and Criminological Psychology 13, 165–176
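Gill’s “too good to be true” check uses only the reported summary statistics. Here is a minimal sketch; the group means, SDs, and sizes are hypothetical stand-ins (not Geraerts’s numbers), and in practice one would convert F to a left-tail probability via an F distribution (e.g., scipy.stats.f.cdf), which is omitted here:

```python
def f_statistic(means, sds, ns):
    """One-way ANOVA F computed from per-group summary statistics
    (means, standard deviations, sample sizes)."""
    k = len(means)
    n_tot = sum(ns)
    grand = sum(n * m for n, m in zip(ns, means)) / n_tot
    ss_between = sum(n * (m - grand) ** 2 for n, m in zip(ns, means))
    ss_within = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_tot - k)
    return ms_between / ms_within

# Hypothetical groups "supposed in theory to be similar": under the null of
# equal population means F should hover around 1; an F far below 1 (tiny
# left-tail probability) is the "too good to be true" signal Gill describes.
F = f_statistic(means=[10.0, 10.05, 9.95], sds=[2.0, 2.0, 2.0], ns=[20, 20, 20])
```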
In a post 3 years ago (“What do these share in common: m&m’s, limbo stick, ovulation, Dale Carnegie? Sat night potpourri”), I expressed doubts about expending serious effort to debunk the statistical credentials of studies that most readers without any statistical training would regard as “for entertainment only,” dubious, or pseudoscientific quackery. It needn’t even be that the claim is implausible; what’s implausible is that it has been well probed in the experiment at hand. Given the attention being paid to such examples by some leading statisticians, and scores of replication researchers over the past 3 years–attention that has been mostly worthwhile–maybe the bar has been lowered. What do you think? Anyway, this is what I blogged 3 years ago. (Oh, I decided to put in a home-made cartoon!)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I had said I would label as pseudoscience or questionable science any enterprise that regularly permits the kind of ‘verification biases’ in the statistical dirty laundry list. How regularly? (I’ve been asked)
Well, surely if it’s as regular as, say, much of social psychology, it goes over the line. But it’s not mere regularity, it’s the nature of the data, the type of inferences being drawn, and the extent of self-scrutiny and recognition of errors shown (or not shown). The regularity is just a consequence of the methodological holes. My standards may be considerably more stringent than most, but quite aside from statistical issues, I simply do not find hypotheses well-tested if they are based on “experiments” that consist of giving questionnaires. At least not without a lot more self-scrutiny and discussion of flaws than I ever see. (There may be counterexamples.)
Attempts to recreate phenomena of interest in typical social science “labs” leave me with the same doubts. Huge gaps often exist between elicited and inferred results. One might locate the problem under “external validity” but to me it is just the general problem of relating statistical data to substantive claims.
Experimental economists (expereconomists) take lab results plus statistics to warrant sometimes ingenious inferences about substantive hypotheses. Vernon Smith (of the Nobel Prize in Econ) is rare in subjecting his own results to “stress tests”. I’m not withdrawing the optimistic assertions he cites from EGEK (Mayo 1996) on Duhem-Quine (e.g., from “Method in Experiment: Rhetoric and Reality” 2002, p. 104). I’d still maintain, “Literal control is not needed to attribute experimental results correctly (whether to affirm or deny a hypothesis). Enough experimental knowledge will do”. But that requires piece-meal strategies that accumulate, and at least a little bit of “theory” and/or a decent amount of causal understanding.[1]
I think the generalizations extracted from questionnaires allow for an enormous amount of “reading into” the data. Suddenly one finds the “best” explanation. Questionnaires should be deconstructed for how they may be misinterpreted, not to mention how responders tend to guess what the experimenter is looking for. (I’m reminded of the current hoopla over questionnaires on breadwinners, housework and divorce rates!) I respond with the same eye-rolling to just-so story telling along the lines of evolutionary psychology.
I apply the “Stapel test”: Even if Stapel had bothered to actually carry out the data-collection plans that he so carefully crafted, I would not find the inferences especially telling in the least. Take for example the planned-but-not-implemented study discussed in the recent New York Times article on Stapel:
Stapel designed one such study to test whether individuals are inclined to consume more when primed with the idea of capitalism. He and his research partner developed a questionnaire that subjects would have to fill out under two subtly different conditions. In one, an M&M-filled mug with the word “kapitalisme” printed on it would sit on the table in front of the subject; in the other, the mug’s word would be different, a jumble of the letters in “kapitalisme.” Although the questionnaire included questions relating to capitalism and consumption, like whether big cars are preferable to small ones, the study’s key measure was the amount of M&Ms eaten by the subject while answering these questions….Stapel and his colleague hypothesized that subjects facing a mug printed with “kapitalisme” would end up eating more M&Ms.
Stapel had a student arrange to get the mugs and M&Ms and later load them into his car along with a box of questionnaires. He then drove off, saying he was going to run the study at a high school in Rotterdam where a friend worked as a teacher.
Stapel dumped most of the questionnaires into a trash bin outside campus. At home, using his own scale, he weighed a mug filled with M&Ms and sat down to simulate the experiment. While filling out the questionnaire, he ate the M&Ms at what he believed was a reasonable rate and then weighed the mug again to estimate the amount a subject could be expected to eat. He built the rest of the data set around that number. He told me he gave away some of the M&M stash and ate a lot of it himself. “I was the only subject in these studies,” he said.
He didn’t even know what a plausible number of M&Ms consumed would be! But never mind that, observing a genuine “effect” in this silly study would not have probed the hypothesis. Would it?
II. Dancing the pseudoscience limbo: How low should we go?
Should those of us serious about improving the understanding of statistics be expending ammunition on studies sufficiently crackpot to lead CNN to withdraw reporting on a resulting (published) paper?
“Last week CNN pulled a story about a study purporting to demonstrate a link between a woman’s ovulation and how she votes, explaining that it failed to meet the cable network’s editorial standards. The story was savaged online as “silly,” “stupid,” “sexist,” and “offensive.” Others were less nice.”
That’s too low down for me… (though it’s good for it to be in Retraction Watch). Even stooping down to the level of “The Journal of Psychological Pseudoscience” strikes me as largely a waste of time–for meta-methodological efforts at least.
I was hastily making these same points in an e-mail to Andrew Gelman just yesterday:
E-mail to Gelman: Yes, the idea that X should be published iff a p<.05 in an interesting topic is obviously crazy.
I keep emphasizing that the problems of design and of linking stat to substantive are the places to launch a critique, and the onus is on the researcher to show how violations are avoided. … I haven’t looked at the ovulation study (but this kind of thing has been done a zillion times) and there are a zillion confounding factors and other sources of distortion that I know were not ruled out. I’m prepared to abide such studies as akin to Zoltar at the fair [Zoltar the fortune teller]. Or, view it as a human interest story—let’s see what amusing data they collected, […oh, so they didn’t even know if women they questioned were ovulating]. You talk of top psych journals, but I see utter travesties in the ones you call top. I admit I have little tolerance for this stuff, but I fail to see how adopting a better statistical methodology could help them. …
Look, there aren’t real regularities in many, many areas–better statistics could only reveal this to an honest researcher. If Stapel actually collected data on M&M’s and having a mug with “Kapitalism” in front of subjects, it would still be B.S.! There are a lot of things in the world I consider crackpot. They may use some measuring devices, and I don’t blame those measuring devices simply because they occupy a place in a pseudoscience or “pre-science” or “a science-wannabe”. Do I think we should get rid of pseudoscience? Yes! [At least if they have pretensions to science, and are not described as “for entertainment purposes only”[2].] But I’m afraid this would shut down [or radically redescribe] a lot more fields than you and most others would agree to. So it’s live and let live, and does anyone really think it’s hurting honest science very much?
There are fields like (at least parts of) experimental psychology that have been trying to get scientific by relying on formal statistical methods, rather than doing science. We get pretensions to science, and then when things don’t work out, they blame the tools. First, significance tests, then confidence intervals, then meta-analysis… do you think these same people are going to get the cumulative understanding they seek when they move to Bayesian methods? Recall [Frank] Schmidt in one of my Saturday night comedies, rhapsodizing about meta-analysis:
“It means that the behavioral and social sciences can attain the status of true sciences: they are not doomed forever to the status of quasi-sciences or pseudoscience. …[T]he gloom, cynicism, and nihilism that have enveloped many in the behavioral and social sciences is lifting. Young people starting out in the behavioral and social sciences today can hope for a much brighter future.” (Schmidt 1996)
III. Dale Carnegie salesman fallacy:
It’s not just that bending over backwards to criticize the most blatant abuses of statistics is a waste of time. I also think dancing the pseudoscientific limbo too low has a tendency to promote its very own fallacy! I don’t know if it has a name, so I made one up. Carnegie didn’t mean this to be used fallaciously, but merely as a means to a positive sales pitch for an idea, call it H. You want to convince a person of H? Get them to say yes to a series of claims first, then throw in H and let them make the leap to accept H too. “You agree that the p-values in the ovulation study show nothing?” “Yes” “You agree that study on bicep diameter is bunk?” “Yes, yes”, and “That study on ESP—pseudoscientific, yes?” “Yes, yes, yes!” Then announce, “I happen to favor operational probalogist statistics (H)”. Nothing has been said to advance H, no reasons have been given that it avoids the problems raised. But all those yeses may well lead the person to say yes to H, and to even imagine an argument has been given. Dale Carnegie was a shrewd man.
(June 25, 2016 cartoon)
Note: You might be interested in the (brief) exchange between Gelman and me in the comments from the original post.
Of relevance is a later post on the replication crisis in psych. Search the blog for more on replication, if interested.
[1] Vernon Smith ends his paper:
My personal experience as an experimental economist since 1956 resonates well with Mayo’s critique of Lakatos: “Lakatos, recall, gives up on justifying control; at best we decide—by appeal to convention—that the experiment is controlled. … I reject Lakatos and others’ apprehension about experimental control. Happily, the image of experimental testing that gives these philosophers cold feet bears little resemblance to actual experimental learning. Literal control is not needed to correctly attribute experimental results (whether to affirm or deny a hypothesis). Enough experimental knowledge will do. Nor need it be assured that the various factors in the experimental context have no influence on the result in question—far from it. A more typical strategy is to learn enough about the type and extent of their influences and then estimate their likely effects in the given experiment”. [Mayo EGEK 1996, 240]. V. Smith, “Method in Experiment: Rhetoric and Reality” 2002, p. 106.
My example in this chapter was linking statistical models in experiments on Brownian motion (by Brown).
[2] I actually like Zoltar (or Zoltan) fortune telling machines, and just the other day was delighted to find one in a costume store on 21st St.
Right after our session at the SPSP meeting last Friday, I chaired a symposium on replication that included Brian Earp–an active player in replication research in psychology (Replication and Evidence: A tenuous relationship p. 80). One of the first things he said, according to my notes, is that gambits such as cherry picking, p-hacking, hunting for significance, selective reporting, and other QRPs, had been taught as acceptable, and had become standard practice, in psychology, without any special need to adjust p-values or alert the reader to their spuriousness [i]. (He will correct me if I’m wrong[2].) It shocked me to hear it, even though it shouldn’t have, given what I’ve learned about statistical practice in social science. It was the Report on Stapel that really pulled back the curtain on this attitude toward QRPs in social psychology–as discussed in this blogpost 3 years ago. (If you haven’t read Section 5 of the report on flawed science, you should.) Many of us assumed that QRPs, even if still committed, were at least recognized to be bad statistical practices since the time of Morrison and Henkel’s (1970) Significance Test Controversy. A question now is this: have all the confessions of dirty laundry, the fraudbusting of prominent researchers, the pledges to straighten up and fly right, the years of replication research, done anything to remove the stains? I leave the question open for now. Here’s my “statistical dirty laundry” post from 2013:
[i] I assume this is no longer true.
[2] June 24: Earp’s correction was that QRPs had “become standard practice”. But if they were taught as things a scientist with integrity must avoid, or adjust for (or at least inform the reader about), then how did they become standard practice? In the interviews conducted by the Stapel committee, the interviewees showed a cavalier attitude toward these moves.
Some statistical dirty laundry
I finally had a chance to fully read the 2012 Tilburg Report* on “Flawed Science” last night. Here are some stray thoughts…
1. Slipping into pseudoscience.
The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”[1]. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job:
One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts…. [T]he use of research procedures in such a way as to ‘repress’ negative results by some means” may be called verification bias. [my emphasis] (Report, 48).
I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses, as to count as no evidence at all (see some from their list). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89). [2] It is unclear at what point a field slips into the pseudoscience realm.
2. A role for philosophy of science?
I am intrigued that one of the final recommendations in the Report is this:
In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered. Based on these insights, research Master’s students and PhD students must receive practical training from their supervisors in the application of the rules governing proper and honest scientific research, which should include examples of such undesirable conduct as data massage. The Graduate School must explicitly ensure that this is implemented.
A philosophy department could well create an entire core specialization that revolved around “the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science” (ideally linked with one or more other departments). That would be both innovative and fill an important gap, it seems to me. Is anyone doing this?
3. Hanging out some statistical dirty laundry.
Items in their laundry list include:
- An experiment fails to yield the expected statistically significant results. The experiment is repeated, often with minor changes in the manipulation or other conditions, and the only experiment subsequently reported is the one that did yield the expected results. The article makes no mention of this exploratory method… It should be clear, certainly with the usually modest numbers of experimental subjects, that using experiments in this way can easily lead to an accumulation of chance findings….
- A variant of the above method is: a given experiment does not yield statistically significant differences between the experimental and control groups. The experimental group is compared with a control group from a different experiment—reasoning that ‘they are all equivalent random groups after all’—and thus the desired significant differences are found. This fact likewise goes unmentioned in the article….
- The removal of experimental conditions. For example, the experimental manipulation in an experiment has three values. …Two of the three conditions perform in accordance with the research hypotheses, but a third does not. With no mention in the article of the omission, the third condition is left out….
- The merging of data from multiple experiments [where data] had been combined in a fairly selective way,…in order to increase the number of subjects to arrive at significant results…
- Research findings were based on only some of the experimental subjects, without reporting this in the article. On the one hand ‘outliers’…were removed from the analysis whenever no significant results were obtained. (Report, 49-50)
For many further examples, and also caveats [3], see the Report.
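The first gambit on the list, continuing or repeating an experiment until it “works as desired”, is easy to simulate. Here is a minimal sketch under assumptions of my own choosing (pure-noise data, a z-test with known variance, a look at the data after every 10 subjects up to 100): even with nothing but noise, “significance” turns up far more often than the nominal 5% of the time.

```python
import math
import random
from statistics import NormalDist

random.seed(2)
Z = NormalDist()

def p_two_sided(xs):
    # Two-sided z-test of "population mean = 0" with known sd = 1.
    n = len(xs)
    z = (sum(xs) / n) * math.sqrt(n)
    return 2 * (1 - Z.cdf(abs(z)))

def optional_stopping(max_n=100, start=10, step=10):
    # "Continue the experiment until it works": test after every batch,
    # stop and declare success at the first p < .05. The data are pure noise.
    xs = [random.gauss(0, 1) for _ in range(start)]
    while True:
        if p_two_sided(xs) < 0.05:
            return True          # a 'significant' chance finding
        if len(xs) >= max_n:
            return False
        xs += [random.gauss(0, 1) for _ in range(step)]

trials = 5_000
rate = sum(optional_stopping() for _ in range(trials)) / trials
print(f"Type I rate with optional stopping: {rate:.2f}")  # well above .05
```

With ten looks at the data, the actual Type I rate is roughly four times the nominal .05, which is exactly the sense in which the Report says such moves “render the hypotheses immune to the facts.”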
4. Significance tests don’t abuse science, people do.
Interestingly, the Report distinguishes the above laundry list from “statistical incompetence and lack of interest found” (52). If the methods used were statistical, then the scrutiny might be called “metastatistical” or the full scrutiny “meta-methodological”. Stapel often fabricated data, but the upshot of these criticisms is that sufficient finagling may similarly predetermine that a researcher’s favorite hypothesis gets support. (There is obviously a big advantage in having the data to scrutinize, as many are now demanding). Is it a problem of these methods that they are abused? Or does the fault lie with the abusers? Obviously the latter. Statistical methods don’t kill scientific validity, people do.
I have long rejected dichotomous testing, but the gambits in the laundry list create problems even for more sophisticated uses of methods, e.g., for indicating magnitudes of discrepancy and associated confidence intervals. At least the methods admit of tools for mounting a critique.
In “The Mind of a Con Man” (NY Times, April 26, 2013 [4]), Diederik Stapel explains how he always read the research literature extensively to generate his hypotheses. “So that it was believable and could be argued that this was the only logical thing you would find.” Rather than report on believability, researchers need to report the properties of the methods they used: What was their capacity to have identified, avoided, admitted verification bias? The role of probability here would not be to quantify the degree of confidence or believability in a hypothesis, given the background theory or most intuitively plausible paradigms, but rather to check how severely probed or well-tested a hypothesis is–whether the assessment is formal, quasi-formal or informal. Was a good job done in scrutinizing flaws…or a terrible one? Or was there just a bit of data massaging and cherry picking to support the desired conclusion? As a matter of routine, researchers should tell us. Yes, as Joe Simmons, Leif Nelson and Uri Simonsohn suggest in “A 21-word solution”, they should “say it!” No longer inclined to regard their recommendation as too unserious, researchers who are “clean” should go ahead and “clap their hands”[5]. (I will consider their article in a later post…)
I recommend reading the Tilburg report!
*The subtitle is “The fraudulent research practices of social psychologist Diederik Stapel.”
[1] “A ‘byproduct’ of the Committees’ inquiries is the conclusion that, far more than was originally assumed, there are certain aspects of the discipline itself that should be deemed undesirable or even incorrect from the perspective of academic standards and scientific integrity.” (Report 54).
[2] Mere falsifiability, by the way, does not suffice for stringency; but there are also methods Popper rejects that could yield severe tests, e.g., double counting. (Search this blog for more entries.)
[3] “It goes without saying that the Committees are not suggesting that unsound research practices are commonplace in social psychology. …although they consider the findings of this report to be sufficient reason for the field of social psychology in the Netherlands and abroad to set up a thorough internal inquiry into the state of affairs in the field” (Report, 48).
[4] Philosopher Janet Stemwedel discusses the NY Times article, noting that Diederik taught a course on research ethics!
[5] From Simmons, Nelson and Simonsohn:
The Fall 2012 Newsletter for the Society for Personality and Social Psychology: Many support our call for transparency, and agree that researchers should fully disclose details of data collection and analysis. Many do not agree. What follows is a message for the former; we begin by preaching to the choir.
Choir: There is no need to wait for everyone to catch up with your desire for a more transparent science. If you did not p-hack a finding, say it, and your results will be evaluated with the greater confidence they deserve.
If you determined sample size in advance, say it.
If you did not drop any variables, say it.
If you did not drop any conditions, say it.
Using Philosophy of Statistics to Make Progress in the Replication Crisis in Psychology
Deborah Mayo, Virginia Tech, Department of Philosophy, United States
Caitlin Parker, Virginia Tech, Department of Philosophy, United States
The discussion surrounding the replication crisis in psychology has raised philosophical issues that remain to be seriously addressed. These touch on foundational questions in the philosophy of statistics about the role of probability in scientific inference and the proper interpretation of statistical tests. Such matters are key to understanding a paradox related to replicability criticisms in social science. This is that, although critics argue that it is too easy to obtain statistically significant results, the comparatively low rate of positive results in replication studies shows that it is quite difficult to obtain low p-values. The resolution of the paradox is that small p-values aren’t easy to come by when experimental protocols are preregistered and researcher flexibility is minimized. They are easy to generate thanks to biasing selection effects: cherry-picking, multiple testing, and the type of questionable research practices that are widely lampooned. The consequence of these influences is that the reported, ‘nominal’ p-value for the original study differs greatly from the ‘actual’ p-value. As Gelman and Loken (2014) have argued, the same problem occurs due to the flexibility of choices in the “forking paths” leading from data to inferences, even if the critique remains informal. It follows that to avoid problematic inferences, researchers need statistical tools with the capacity to pick up on the effects of biasing selections. Significance tests have a limited but important goal, especially in testing model assumptions. To trade them in for methods that do not pick up on alterations to error probabilities (Bayes ratios, posterior probabilities, likelihood ratios) is not progress, but would enable their effects to remain hidden. The sensitivity of p-values to selection effects is actually the key to understanding their relevance to appraising particular inferences, not just to long-run error control.
The problems of hunting and cherry picking are not a matter of getting it wrong in the long run, but of failing to provide good grounds for the intended inference in the immediate inquiry. There’s a second way in which reforms are in danger of enabling fallacies. It is fallacious to take the falsification of a null hypothesis as evidence for a substantive theory (confusing statistical and substantive hypotheses). Neither Fisherian nor NP tests permit moving directly from statistical significance to research hypotheses, let alone from a single, just significant result. Yet in order to block an inference to a research hypothesis, a popular reform is to assign a lump of prior probability to the “no effect” null hypothesis. But this countenances, rather than prohibits, the blurring of statistical and substantive hypotheses! This is not only a statistical fallacy; it draws attention away from what is most needed in psychology experiments with poor replication: a scrutiny of the relevance of the measurements and experiments to the research hypotheses of interest [3]. Slides are here.
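The gap between the ‘nominal’ and ‘actual’ p-value is easy to exhibit numerically. A sketch (the figure of 20 hunts is an arbitrary choice of mine): under a true null each test’s p-value is uniform on (0, 1), so a researcher who runs 20 analyses and reports only the best one is in effect reporting min(p₁, …, p₂₀) at its nominal face value.

```python
import random

random.seed(1)

k, trials = 20, 100_000
# Under a true null, each p-value is uniform on (0, 1); simulate the
# smallest of k such p-values and ask how often it falls below .05.
hits = sum(min(random.random() for _ in range(k)) < 0.05
           for _ in range(trials))
actual = hits / trials
# Analytically: P(min p < .05) = 1 - 0.95**k, about 0.64 for k = 20.
print(f"nominal alpha: 0.05, actual Type I rate after {k} hunts: {actual:.2f}")
```

A ‘nominal’ p of .05 thus corresponds to an ‘actual’ error probability around .64: this is what it means for a method to “pick up on the effects of biasing selections.”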
[1] Parker had been my Masters’ student at Virginia Tech, and is beginning Ph.D work at Carnegie Mellon University in the fall.
[2] We’re on at 11:30, Enterprise Center, room 409. (Third paper in a session that starts at 10:30). The conference goes from June 17-19.
[3] Links to some relevant posts are at the end.
SPSP Mission Statement
Philosophy of science has traditionally focused on the relation between scientific theories and the world, at the risk of disregarding scientific practice. In social studies of science and technology, the predominant tendency has been to pay attention to scientific practice and its relation to theories, sometimes willfully disregarding the world except as a product of social construction. Both approaches have their merits, but they each offer only a limited view, neglecting some essential aspects of science. We advocate a philosophy of scientific practice, based on an analytic framework that takes into consideration theory, practice and the world simultaneously.
The direction of philosophy of science we advocate is not entirely new: naturalistic philosophy of science, in concert with philosophical history of science, has often emphasized the need to study scientific practices; doctrines such as Hacking’s “experimental realism” have viewed active intervention as the surest path to the knowledge of the world; pragmatists, operationalists and late-Wittgensteinians have attempted to ground truth and meaning in practices. Nonetheless, the concern with practice has always been somewhat outside the mainstream of English-language philosophy of science. We aim to change this situation, through a conscious and organized programme of detailed and systematic study of scientific practice that does not dispense with concerns about truth and rationality.
Practice consists of organized or regulated activities aimed at the achievement of certain goals. Therefore, the epistemology of practice must elucidate what kinds of activities are required in generating knowledge. Traditional debates in epistemology (concerning truth, fact, belief, certainty, observation, explanation, justification, evidence, etc.) may be re-framed with benefit in terms of activities. In a similar vein, practice-based treatments will also shed further light on questions about models, measurement, experimentation, etc., which have arisen with prominence in recent decades from considerations of actual scientific work.
There are some salient aspects of our general approach that are worth highlighting here.
I. Some relevant recent posts on p-values (search this blog for many others):
“Don’t Throw Out the Error Control Baby with the Bad Statistics Bathwater”
“P-Value Madness: A Puzzle About the Latest Test Ban, or ‘Don’t Ask, Don’t Tell’”
II. Posts on replication research in psychology:
Repligate Returns (or, the Non Significance of Nonsignificant Results Are the New Significant Results)
This includes links to:
“Some Ironies in the Replication Crisis in Social Psychology”
“The Paradox of Replication and the Vindication of the P-value, but She Can Go Deeper”
“Out Damned Pseudoscience: Nonsignificant Results Are the New Significant Results”
I came across an excellent post on a blog kept by Daniel Lakens: “So you banned p-values, how’s that working out for you?” He refers to the journal that recently banned significance tests, confidence intervals, and a vague assortment of other statistical methods, on the grounds that all such statistical inference tools are “invalid” since they don’t provide posterior probabilities of some sort (see my post). The editors’ charge of “invalidity” could only hold water if these error statistical methods purport to provide posteriors based on priors, which is false. The entire methodology is based on methods in which probabilities arise to qualify the method’s capabilities to detect and avoid erroneous interpretations of data [0]. The logic is of the falsification variety found throughout science. Lakens, an experimental psychologist, does a great job delineating some of the untoward consequences of their inferential ban. I insert some remarks in black.
The journal Basic and Applied Social Psychology banned p-values a year ago. I read some of their articles published in the last year. I didn’t like many of them. Here’s why.
First of all, it seems BASP didn’t just ban p-values. They also banned confidence intervals, because God forbid you use that lower bound to check whether or not it includes 0. They also banned reporting sample sizes for between subject conditions, because God forbid you divide that SD by the square root of N and multiply it by 1.96 and subtract it from the mean and guesstimate whether that value is smaller than 0.
It reminds me of alcoholics who go into detox and have to hand in their perfume, before they are tempted to drink it. Thou shall not know whether a result is significant – it’s for your own good! Apparently, thou shall also not know whether effect sizes were estimated with any decent level of accuracy. Nor shall thou include the effect in future meta-analyses to commit the sin of cumulative science. (my emphasis)
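For what it’s worth, the back-of-the-envelope guesstimate Lakens describes is a one-liner; hiding the sample sizes changes nothing except the reader’s ability to check. A sketch with made-up summary numbers (mean, SD, and N are mine, purely for illustration):

```python
import math

def approx_ci95(mean, sd, n):
    # The normal-approximation 95% CI Lakens describes:
    # mean ± 1.96 * SD / sqrt(N).
    half = 1.96 * sd / math.sqrt(n)
    return mean - half, mean + half

# Hypothetical condition summary, for illustration only.
lo, hi = approx_ci95(mean=0.40, sd=1.0, n=30)
print(f"95% CI: ({lo:.2f}, {hi:.2f}); includes 0: {lo <= 0 <= hi}")
# → 95% CI: (0.04, 0.76); includes 0: False
```

Whether that lower bound clears 0 is precisely the inferential check the ban tries, and fails, to suppress.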
There are some nice papers where the p-value ban has no negative consequences. For example, Swab & Greitemeyer (2015) examine whether indirect (virtual) intergroup contact (seeing you have 1 friend in common with an outgroup member, vs not) would influence intergroup attitudes. It did not, in 8 studies. P-values can’t be used to accept the null-hypothesis, and these authors explicitly note they aimed to control Type 2 errors based on an a-priori power analysis. So, after observing many null-results, they drew the correct conclusion that if there was an effect, it was very unlikely to be larger than what the theory on evaluative conditioning predicted. After this conclusion, they logically switch to parameter estimation, perform a meta-analysis and based on a Cohen’s d of 0.05, suggest that this effect is basically 0. It’s a nice article, and the p-value ban did not make it better or worse.
If the journal is banning reports of inferential notions, then how do power and Type 2 errors slip by the editors’ bloodhounds?
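Swab & Greitemeyer’s a priori control of Type 2 errors is just the power-analysis logic Cohen describes above: fix a negligible effect size, a high power, and an alpha, and solve for n. A normal-approximation sketch (the numbers d = 0.3 and n = 175 per group are mine, illustrative only, not theirs):

```python
import math
from statistics import NormalDist

Z = NormalDist()

def power_two_sample(d, n_per_group, alpha=0.05):
    # Approximate power of a two-sided, two-sample test to detect Cohen's d,
    # via the usual normal approximation (the tiny lower-tail term is ignored).
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return 1 - Z.cdf(z_crit - d * math.sqrt(n_per_group / 2))

# Roughly 175 subjects per group are needed for 80% power against d = 0.3;
# only with that kind of n does nonsignificance license a Cohen-style
# "the effect is negligible" conclusion.
print(f"power: {power_two_sample(0.3, 175):.2f}")
```

This is the sense in which their meta-analytic d of 0.05 can reasonably be called “basically 0”: the studies were sized so that a non-negligible effect would very probably have shown up.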
But in many other papers, especially those where sample sizes were small, and experimental designs were used to examine hypothesized differences between conditions, things don’t look good.
In many of the articles published in BASP, researchers make statements about differences between groups. Whether or not these provide support for their hypotheses becomes a moving target, without the need to report p-values. For example, some authors interpret a d of 0.36 as support for an effect, while in the same study, a Cohen’s d < 0.29 (we are not told the exact value) is not interpreted as an effect. You can see how banning p-values solved the problem of dichotomous interpretations (I’m being ironic). Also, with 82 people divided over three conditions, the p-value associated with the d = 0.36 interpreted as an effect is around p = 0.2. If BASP had required authors to report p-values, they might have interpreted this effect a bit more cautiously. And in case you are wondering: No, this is not the only non-significant finding interpreted as an effect. Surprisingly enough, it seems to happen a lot more often than in journals where authors report p-values! Who would have predicted this?! (my emphasis)
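Lakens’s back-calculation is easy to reproduce. A sketch assuming roughly equal groups of 27 (a guess of mine; the article reports only the total of 82) and using a normal approximation to the t distribution, which is close enough at ~50 degrees of freedom:

```python
import math
from statistics import NormalDist

def p_from_d(d, n1, n2):
    # Back out the two-sided p-value from Cohen's d for two independent
    # groups: t = d * sqrt(n1*n2 / (n1 + n2)).
    t = d * math.sqrt(n1 * n2 / (n1 + n2))
    return 2 * (1 - NormalDist().cdf(abs(t)))

# 82 subjects over three conditions -> roughly 27 per pairwise comparison.
p = p_from_d(d=0.36, n1=27, n2=27)
print(f"p ≈ {p:.2f}")  # close to the p ≈ 0.2 Lakens mentions
```

In other words, the “effect” the authors tout is one a reported p-value would have flagged as well within what noise alone produces.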
Nice work Trafimow and Marks! Just what psychology needs.
Saying one thing is bigger than something else, and reporting an effect size, works pretty well in simple effects. But how would you say there is a statistically significant interaction, if you can’t report inferential statistics and p-values? Here are some of my favorite statements.
“The ANOVA also revealed an interaction between [X] and [Y], η² = 0.03 (small to medium effect).”
How much trust do you have in that interaction from an exploratory ANOVA with a small to medium effect size of .03, partial eta squared? That’s what I thought.
“The main effects were qualified by an [X] by [Y] interaction. See Figure 2 for means and standard errors”
The main effects were qualified, but the interaction was not quantified. What does this author expect I do with the means and standard errors? Look at it while humming ‘ohm’ and wait to become enlightened? Everybody knows these authors calculated p-values, and based their statements on these values.
My predictions on the consequences of this journal’s puzzling policy appear to be true, all too true: They allow error statistical methods for purposes of a paper’s acceptance, but then require their extirpation in the published paper. I call it the “Don’t ask, don’t tell” policy (see this post). See also my commentary on the ASA P-value report.
In normal scientific journals, authors sometimes report a Bonferroni correction. But there’s no way you are going to Bonferroni those means and standard deviations, now is there? With their ban on p-values and confidence intervals, BASP has banned error control. For example, read the following statement:
Willpower theories were also related to participants’ BMI. The more people endorsed a limited theory, the higher their BMI. This finding corroborates the idea that a limited theory is related to lower self-control in terms of dieting and might therefore also correlate with patients BMI.
This is based on a two-sided p-value of 0.026, and it was one of 10 calculated correlation coefficients. Would a Bonferroni adjusted p-value have led to a slightly more cautious conclusion?
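The adjustment Lakens has in mind is one line: with m = 10 tests, either multiply the p-value by 10, or equivalently test each against α/10 = 0.005. A sketch:

```python
def bonferroni(p, m):
    # Bonferroni-adjusted p-value for one of m tests (capped at 1).
    return min(1.0, p * m)

# The BASP correlation: two-sided p = 0.026, one of 10 correlations computed.
p_adj = bonferroni(0.026, 10)
print(f"adjusted p = {p_adj:.2f}")  # 0.26: nowhere near the usual .05
```

The conclusion about willpower theories and BMI rests on a correlation that does not survive even this crudest of multiplicity corrections.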
Oh, and if you hoped banning p-values would lead anyone to use Bayesian statistics: No. It leads to a surprisingly large number of citations to Trafimow’s articles where he tries to use p-values as measures of evidence, and is disappointed they don’t do what he expects. Which is like going to The Hangover part 4 and complaining it’s really not that funny. Except everyone who publishes in BASP mysteriously agrees that Trafimow’s articles show NHST has been discredited and is illogical. (my emphasis)
This last sentence gets to the most unfortunate consequence of all. In a field increasingly recognized to be driven by “perverse incentives,” and desperately in need of publishing reform, even the appearance of “pay to play” is disturbing, when editors hold so idiosyncratic a view about standard statistical methods.
In their latest editorial[1], Trafimow and Marks hit down some arguments you could, after a decent bottle of liquor, interpret as straw men against their ban of p-values. They don’t, and have never, discussed the only thing p-values are meant to do: control error rates. Instead, they seem happy to publish articles where some (again, there are some very decent articles in BASP) authors get all the leeway they need to adamantly claim effects are observed, even though these effects look a lot like noise.
I’m guessing that Daniel means they might (after liquor at least) be interpreted as converting the many telling criticisms of their ban into such weak versions as to render them “straw men”. I make some comments on this editorial [1].
The absence of p-values has not prevented dichotomous conclusions, nor claims that data support theories (which is only possible using Bayesian statistics), nor anything else p-values were blamed for in science. After reading a year’s worth of BASP articles, you’d almost start to suspect p-values are not the real problem. Instead, it looks like researchers find making statistical inferences pretty difficult, and forcing them to ignore p-values didn’t magically make things better.[2]
As far as I can see, all that banning p-values has done, is increase the Type 1 error rate in BASP articles. Restoring a correct use of p-values would substantially improve how well conclusions authors draw actually follow from the data they have collected. The only expense, I predict, is a much lower number of citations to articles written by Trafimow about how useless p-values are. (my emphasis)
Lakens, by dint of this post, certainly deserves an Honorable Mention, and can choose a book prize from the palindrome prize list. He has agreed to answer questions posted in the comments. So share your thoughts.
[0] As I say (slide 26) in my recent Popper talk at the LSE: “To use an eclectic toolbox in statistics, it’s important not to expect an agreement on numbers from methods evaluating different things. A p-value isn’t ‘invalid’ because it does not supply ‘the probability of the null hypothesis, given the finding’ (the posterior probability of H_{0}) (Trafimow and Marks, 2015).”
[1] I checked this editorial. Among at least a half dozen fallacies*, the editors say that the definition of a p-value is “true by definition and hence trivial”. But the definition of the posterior probability of H given x is also “true by definition and hence trivial”. Yet they’re quite sure that P(H|x) is informative. Why? It’s just true by definition.
Another puzzling claim is that “One cannot compute the probability of the finding due to chance unless one knows the population effect size. And if one knows the population effect size, there is no need to do the research.”
Given how they understand “probability of the finding due to chance” what this says is you can’t compute P(H|x) unless you know the population effect size. So this nihilistic claim of the editors is that to make a statistical inference about H requires knowing H, but then there’s no need to do the research. So there’s never any reason to do any research!
*I won’t call them statistical howlers—a term I’ve used on this blog–because they really have little to do with statistics and involve rudimentary logical gaffes.
[2] My one question is what Lakens means in saying that to claim that data support theories “is only possible using Bayesian statistics”. I don’t see how Bayesian statistics infers that data support theories, unless he means they may be used to provide a comparative measure of support, such as a Bayes’ Factor or likelihood ratio. On the other hand, if “supporting a theory” means something like “accepting or inferring a theory is well tested” then it’s outside of Bayesian probabilism (understood as a report of a posterior probability, however defined). An “acceptance” or “rejection” rule could be added to Bayesian updating (e.g., infer H if its posterior is high enough), but I’m not sure Bayesians find this welcome. It’s also possible that Lakens finds authors of this journal claiming their theories are “probable,” and he’s pointing out their error.
Send me your thoughts.
Winner of the May 2016 Palindrome contest
Curtis Williams: Inventor, entrepreneur, and professional actor
The winning palindrome (a dialog):
“Disable preplan?… I, Mon Ami?”
“Ask!”
“Calm…Sit, fella.”
“No! I tag. I vandalized Dezi, lad.”
“Navigational leftism lacks aim…a nominal perp: Elba’s id.”
The requirement: A palindrome using “navigate” or “navigation” (and Elba, of course).
Book choice: Error and Inference (D. Mayo & A. Spanos, Cambridge University Press, 2010)
Bio: Curtis Mark Williams is the co-founder of WavHello and the inventor of Bellybuds, who also counts himself as an occasional professional actor who has performed on Broadway [1] and in several television shows and films.
He currently resides in Los Angeles with his lovely wife, two daughters, his dog, Newton, and his framed New Yorker Caption Contest winning cartoon. [He has been a finalist twice and the one he won is contest #329, by Joe Dator (inspired by his theatrical background. :)]
Statement: I’ve always loved wordplay and puns especially. I recently happened across some of Demetri Martin’s poetry and was fascinated by his long palindromes. I was curious how he could possibly write something like that. So I decided to sit down and give it a try and I was hooked instantly. I love the problem-solving process of constrained writing. And I love this blog for giving me a reason to do it!
Remark from Prof. Mayo: What a terrific palindrome!* “Navigational leftism lacks aim”–so true! It’s head and shoulders above my efforts using “navigation”.[2] I’m incredibly impressed, as well, by anyone who wins the New Yorker cartoon contest (I asked him to include it here.) I’ve tried many times and never even made it to finalist. I happen to remember voting in favor of this one (given my tap dancing interest). Congratulations Curtis! I hope you will send us a submission again.
[1] He was in the Broadway show, Reckless (starring Mary Louise Parker) back in 2004 at the Manhattan Theatre Club in NYC. The role he is most proud of (which wasn’t Broadway) was originating the role of Denis McCleary in Richard Greenberg’s The Violet Hour at South Coast Repertory in SoCal. (His professional name is Curtis Mark Williams)
[2] Here was one:
Able no geek, a fan, Amy, entraps. Navigate man’s name-tag. Ivan’s part Neyman, a fake Egon? Elba
*The editors inadvertently put up an earlier version lacking Elba at first.
Aris Spanos, my colleague (in economics) and co-author, came across this anonymous review of our Error and Inference (2010) [E & I]. Interestingly, the reviewer remarks that “The book gives a sense of security regarding the future of statistical science and its importance in many walks of life.” We’re not sure what the reviewer means–but it’s appreciated regardless. This post was from yesterday’s 3-year memory lane and was first posted here.
2010 American Statistical Association and the American Society for Quality
TECHNOMETRICS, AUGUST 2010, VOL. 52, NO. 3, Book Reviews, pp. 362-370.
Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. MAYO and Aris SPANOS, New York: Cambridge University Press, 2010, ISBN 978-0-521-88008-4, xvii+419 pp., $60.00.
This edited volume contemplates the interests of both scientists and philosophers regarding gathering reliable information about the problem/question at hand in the presence of error, uncertainty, and with limited data information.
The volume makes a significant contribution in bridging the gap between scientific practice and the philosophy of science. The main contribution of this volume pertains to issues of error and inference, and it showcases intriguing discussions on statistical testing while providing an alternative strategy to Bayesian inference. In other words, it provides cumulative information on the philosophical and methodological issues of scientific inquiry at large.
The target audience of this volume is quite general and open to a broad readership. With some reasonable knowledge of probability theory and statistical science, one can get the maximum benefit from most of the chapters of the volume. The volume contains original and fascinating articles by eminent scholars (nine, including the editors) who range from names in statistical science to philosophy, including D. R. Cox, a name well known to statisticians.
The editors have done a superb job in presenting, organizing, and structuring the material in a logical order. The “Introduction and Background” is nicely presented and summarized, allowing for a smooth reading of the rest of the volume. There is a broad range of carefully selected topics from various related fields reflecting recent developments in these areas. The rest of the volume is divided in nine chapters/sections as follows:
1. Learning from Error, Severe Testing, and the Growth of Theoretical Knowledge
2. The Life of Theory in the New Experimentalism
3. Revisiting Critical Rationalism
4. Theory Confirmation and Novel Evidence
5. Induction and Severe Testing
6. Theory Testing in Economics and the Error-Statistical Perspective
7. New Perspectives on (Some Old) Problems of Frequentist Statistics
8. Causal Modeling, Explanation and Severe Testing
9. Error and Legal Epistemology
In summary, this volume contains a wealth of knowledge and fascinating debates on a host of important and controversial topics equally important to the philosophy of science and scientific practice. This is a must-read—I enjoyed reading it and I am sure you will too! The book gives a sense of security regarding the future of statistical science and its importance in many walks of life. I also want to take the opportunity to suggest another seemingly related book by Harman and Kulkarni (2007). The review of this book appeared in Technometrics in May 2008 (Ahmed 2008).
The following are chapters in E & I (2010) written by Mayo and/or Spanos, if you’re interested. If you produce a palindrome meeting the extremely simple requirements for May (by June 4, midnight), you can win a free copy!
MONTHLY MEMORY LANE: 3 years ago: May 2013. I mark in red three posts that seem most apt for general background on key issues in this blog [1]. Some of the May 2013 posts blog the conference we held earlier that month: “Ontology and Methodology”. I highlight in burgundy a post on Birnbaum that follows up on my last post in honor of his birthday. New questions or comments can be placed on this post.
May 2013
[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept. 2014.
Today is Allan Birnbaum’s birthday. In honor of his birthday this year, I’m posting the articles in the Synthese volume that was dedicated to his memory in 1977. The editors describe it as their way of “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. I paste a few snippets from the articles by Giere and Birnbaum. If you’re interested in statistical foundations, and are unfamiliar with Birnbaum, here’s a chance to catch up. (Even if you are, you may be unaware of some of these key papers.)
HAPPY BIRTHDAY ALLAN!
Synthese Volume 36, No. 1 Sept 1977: Foundations of Probability and Statistics, Part I
Editorial Introduction:
This special issue of Synthese on the foundations of probability and statistics is dedicated to the memory of Professor Allan Birnbaum. Professor Birnbaum’s essay ‘The Neyman-Pearson Theory as Decision Theory; and as Inference Theory; with a Criticism of the Lindley-Savage Argument for Bayesian Theory’ was received by the editors of Synthese in October, 1975, and a decision was made to publish a special symposium consisting of this paper together with several invited comments and related papers. The sad news about Professor Birnbaum’s death reached us in the summer of 1976, but the editorial project could nevertheless be completed according to the original plan. By publishing this special issue we wish to pay homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics. We are grateful to Professor Ronald Giere who wrote an introductory essay on Professor Birnbaum’s concept of statistical evidence and who compiled a list of Professor Birnbaum’s publications.
THE EDITORS
Table of Contents
- Editorial Introduction. (1977). Synthese, 36(1), 3-3.
- Giere, R. (1977). Allan Birnbaum’s Conception of Statistical Evidence. Synthese, 36(1), 5-13.
SUFFICIENCY, CONDITIONALITY AND LIKELIHOOD. In December of 1961 Birnbaum presented the paper ‘On the Foundations of Statistical Inference’ (Birnbaum [19]) at a special discussion meeting of the American Statistical Association. Among the discussants was L. J. Savage who pronounced it “a landmark in statistics”. Explicitly denying any “intent to speak with exaggeration or rhetorically”, Savage described the occasion as “momentous in the history of statistics”. “It would be hard”, he said, “to point to even a handful of comparable events” (Birnbaum [19], pp. 307-8). The reasons for Savage’s enthusiasm are obvious. Birnbaum claimed to have shown that two principles widely held by non-Bayesian statisticians (sufficiency and conditionality) jointly imply an important consequence of Bayesian statistics (likelihood).[1]
- Giere, R. (1977). Publications by Allan Birnbaum. Synthese, 36(1), 15-17.
- Birnbaum, A. (1977). The Neyman-Pearson Theory as Decision Theory, and as Inference Theory; With a Criticism of the Lindley-Savage Argument for Bayesian Theory. Synthese, 36(1), 19-49.
INTRODUCTION AND SUMMARY ….Two contrasting interpretations of the decision concept are formulated: behavioral, applicable to ‘decisions’ in a concrete literal sense as in acceptance sampling; and evidential, applicable to ‘decisions’ such as ‘reject H’ in a research context, where the pattern and strength of statistical evidence concerning statistical hypotheses is of central interest. Typical standard practice is characterized as based on the confidence concept of statistical evidence, which is defined in terms of evidential interpretations of the ‘decisions’ of decision theory. These concepts are illustrated by simple formal examples with interpretations in genetic research, and are traced in the writings of Neyman, Pearson, and other writers. The Lindley-Savage argument for Bayesian theory is shown to have no direct cogency as a criticism of typical standard practice, since it is based on a behavioral, not an evidential, interpretation of decisions.
- Lindley, D. (1977). The Distinction between Inference and Decision. Synthese, 36(1), 51-58.
- Pratt, J. (1977). ‘Decisions’ as Statistical Evidence and Birnbaum’s ‘Confidence Concept’. Synthese, 36(1), 59-69.
- Smith, C. (1977). The Analogy between Decision and Inference. Synthese, 36(1), 71-85.
- Kyburg, H. (1977). Decisions, Conclusions, and Utilities. Synthese, 36(1), 87-96.
- Neyman, J. (1977). Frequentist Probability and Frequentist Statistics. Synthese, 36(1), 97-131.
- Le Cam, L. (1977). A Note on Metastatistics or ‘An Essay toward Stating a Problem in the Doctrine of Chances’. Synthese, 36(1), 133-160.
- Kiefer, J. (1977). The Foundations of Statistics: Are There Any? Synthese, 36(1), 161-176.
[1] By “likelihood” here, Giere means the (strong) Likelihood Principle (SLP). Dotted through the first 3 years of this blog are a number of (formal and informal) posts on Birnbaum’s SLP result, and my argument as to why it is unsound. I wrote a paper on this that appeared in Statistical Science 2014. You can find it, along with a number of comments and my rejoinder, in this post: Statistical Science: The Likelihood Principle Issue is Out. The consequences of having found his proof unsound give a new lease on life to statistical foundations, or so I argue in my rejoinder.
In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:
It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)
The recommendation is much more fully fleshed out in a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. Their recommendation is:
…that researchers should report the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results. (BBBS 2016, p. 3)….
The (pre-experimental) ‘rejection ratio’ R_{pre}, the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H_{1} and H_{0} respectively), is shown to capture the strength of evidence in the experiment for H_{1} over H_{0}. (ibid., p. 2)
But in fact it does no such thing! [See my post from the FUSION conference here.] J. Berger and his co-authors will tell you the rejection ratio (and a variety of other measures created over the years) are entirely frequentist because they are created out of frequentist error statistical measures. But a creation built on frequentist measures doesn’t mean the resulting animal captures frequentist error statistical reasoning. It might be a kind of Frequentstein monster! [1]
This brings me to where I left off in my last post: How could people think it plausible to compute comparative strength of evidence this way? The rejection ratio is one of the “new monsters”, but it also appears, without this name, in popular diagnostic screening models of tests. See, for example, this post (“Beware of questionable front page articles telling you to beware…”)
The Law of Comparative Support
It comes from a comparativist support position which has intrinsic plausibility, although I do not hold to it. It is akin to what some likelihoodists call “the law of support”: if H_{1} makes the observed results probable, while H_{0} makes them improbable, then the results are strong (or at least better) evidence for H_{1} compared to H_{0}. It appears to be saying (sensibly) that you have better evidence for the hypothesis that best “explains” the data, only this is not a good measure of explanation. It is not generally required that H_{0} and H_{1} be exhaustive. Even if you hold a comparative support position, the “ratio of statistical power to significance threshold” isn’t a plausible measure of it. Now BBBS also object to the Rejection Ratio, but largely because it’s not sensitive to the actual outcome; so they recommend the Bayes Factor post-data. My criticism is much, much deeper. To get around the data-dependence, let’s assume throughout that we’re dealing with a result just statistically significant at the α level.
I had a post last year called “What’s Wrong with Taking (1 – β)/α as a Likelihood Ratio Comparing H_{0} and H_{1}?” While it garnered over 80 interesting comments (and a continuation), only one or two concerned the point I really had in mind. So in what follows I’ll take some excerpts from it, interspersed with new remarks.
Take a one-sided Normal test T+: with n iid samples:
H_{0}: µ ≤ 0 against H_{1}: µ > 0
σ = 10, n = 100, σ/√n = σ_{x} = 1, α = .025.
So the test would reject H_{0} iff Z > c_{.025} = 1.96. (1.96 is the “cut-off”.)
People often talk of a test “having a power” but the test actually specifies a power function that varies with different point values in the alternative H_{1}. The power of test T+ in relation to point alternative µ’ is
Pr(Z > 1.96; µ = µ’).
We can abbreviate this as POW(T+,µ’).
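For readers who want to check the numbers in what follows, the power function of T+ can be computed with nothing but the standard library (a minimal sketch; `Phi` and `power` are my own helper names, and the normal CDF is obtained from the error function):

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power(mu_prime, cut=1.96, sigma=10, n=100):
    """POW(T+, mu') = Pr(Z > cut; mu = mu') for test T+.

    Under mu = mu', Z is normal with mean mu'/(sigma/sqrt(n))
    and unit variance.
    """
    se = sigma / sqrt(n)  # sigma_x = 1 in the example above
    return 1 - Phi(cut - mu_prime / se)

print(round(power(0), 3))     # at the null this is just alpha: 0.025
print(round(power(4.96), 3))  # approximately 0.999
```

As the remark above says, there is no single “power of the test”: `power(µ')` rises continuously as the point alternative µ’ moves away from 0.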
~~~~~~~~~~~~~~
Jacob Cohen’s slips
By the way, Jacob Cohen, a founder of power analysis, makes a few slips in introducing power, even though he correctly computes power throughout the book (so far as I know). [2] Someone recently reminded me of this, and given the confusion about power, maybe it’s had more of an ill effect than I assumed.
In the first sentence on p. 1 of Statistical Power Analysis for the Behavioral Sciences, Cohen says “The power of a statistical test is the probability it will yield statistically significant results.” Also faulty, and for two reasons, is what he says on p. 4: “The power of a statistical test of a null hypothesis is the probability that it will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the phenomenon exists.”
In case you don’t see the two mistakes, I will write them in my first comment.
~~~~~~~~~~~~~~
Examples of alternatives against which T+ has high power:
Let the observed outcome just reach the cut-off to reject the null, z_{0} = 1.96.
If we were to form a “rejection ratio” or a “likelihood ratio” of μ = 4.96 compared to μ_{0} = 0 using
[POW(T+, 4.96)]/α,
it would be 40. (.999/.025).
It is absurd to say the alternative 4.96 is supported 40 times as much as the null, even understanding support as comparative likelihood or something akin to it. The observed z_{0} = 1.96 is even closer to 0 than to 4.96. (The same point can be made with less extreme cases.) What is commonly done next is to assign priors of .5 to the two hypotheses, yielding
Pr(H_{0}|z_{0}) = 1/(1 + 40) = .024, so Pr(H_{1}|z_{0}) = .976.
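The arithmetic of the ratio and the posterior just given is easy to reproduce (a sketch only; this verifies the numbers, not the inference):

```python
from math import erf, sqrt

def Phi(x):
    # Standard normal CDF
    return 0.5 * (1 + erf(x / sqrt(2)))

alpha = 0.025
pow_alt = 1 - Phi(1.96 - 4.96)       # POW(T+, 4.96), about .999
rejection_ratio = pow_alt / alpha    # about 40
post_H0 = 1 / (1 + rejection_ratio)  # with .5 priors on each hypothesis
post_H1 = 1 - post_H0
print(round(rejection_ratio), round(post_H0, 3), round(post_H1, 3))
# prints: 40 0.024 0.976
```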
Such an inference is highly unwarranted and would almost always be wrong. Back to our question:
How could people think it plausible to compute comparative evidence this way?
I presume it comes from the comparativist support position noted above. I’m guessing they’re reasoning as follows:
The probability is very high that z > 1.96 under the assumption that μ = 4.96.
The probability is low that z > 1.96 under the assumption that μ = μ_{0} = 0.
We’ve observed z_{0} = 1.96 (so we’ve observed z ≥ 1.96).
Therefore, μ = 4.96 makes the observation more probable than does μ = 0.
Therefore the outcome is (comparatively) better evidence for μ= 4.96 than for μ = 0.
But the “outcome” for a likelihood is to be the specific outcome, and the comparative appraisal of which hypothesis accords better with the data only makes sense when one keeps to this.
I can pick any far away alternative I like for purposes of getting high power, and we wouldn’t want to say that just reaching the cut-off (1.96) is good evidence for it! Power works in the reverse. That is,
If POW(T+,µ’) is high, then z_{0} = 1.96 is poor evidence that μ > μ’.
That’s because were μ as great as μ’, with high probability we would have observed a larger z value (smaller p-value) than we did. Power may, if one wishes, be seen as a kind of distance measure, but (just like α) it is inverted.
(Note that our inferences take the form μ > μ’, μ < μ’, etc. rather than to a point value.)
In fact:
if Pr(Z > z_{0}; μ = μ’) is high, then Z = z_{0} is strong evidence that μ < μ’!
Rather than being evidence for μ’, the statistically significant result is evidence against μ being as high as μ’.
~~~~~~~~~~~~~~
My favorite post by Stephen Senn
In my very favorite post by Stephen Senn here, Senn strengthens a point from his 2008 book (p. 201), namely, that the following is “nonsense”:
[U]pon rejecting the null hypothesis, not only may we conclude that the treatment is effective but also that it has a clinically relevant effect. (Senn 2008, p. 201)
Now the test is designed to have high power to detect a clinically relevant effect (usually .8 or .9). I happen to have chosen an extremely high power (.999) but the claim holds for any alternative that the test has high power to detect. The clinically relevant discrepancy, as he describes it, is one “we should not like to miss”, but obtaining a statistically significant result is not evidence we’ve found a discrepancy that big.
Supposing that it is, is essentially to treat the test as if it were:
H_{0}: μ < 0 vs H_{1}: μ > 4.96
This, he says, is “ludicrous” as it:
would imply that we knew, before conducting the trial, that the treatment effect is either zero or at least equal to the clinically relevant difference. But where we are unsure whether a drug works or not, it would be ludicrous to maintain that it cannot have an effect which, while greater than nothing, is less than the clinically relevant difference. (Senn, 2008, p. 201)
The same holds with H_{0}: μ = 0 as null.
If anything, it is the lower confidence limit that we would look at to see what discrepancies from 0 are warranted. The one-sided lower .975 limit (equivalently, the lower limit of a two-sided .95 interval) would be 0, and the one-sided lower .95 limit would be .3. So we would be warranted in inferring from z:
μ > 0 or μ > .3.
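With σ_{x} = 1 and the observed z_{0} = 1.96, those two lower bounds are one-line arithmetic (variable names are mine; values rounded):

```python
z0 = 1.96                  # observed value, just at the cut-off; sigma_x = 1
lower_975 = z0 - 1.96      # lower .975 confidence bound: exactly 0
lower_95 = z0 - 1.645      # lower .95 confidence bound: about .3
print(f"{lower_975:.3f} {lower_95:.3f}")
```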
~~~~~~~~~~~~~~
What does the severe tester say?
In sync with the confidence interval, she would say SEV(μ > 0) = .975 (if one-sided), and would also note some other benchmarks, e.g., SEV(μ > .96) = .84.
Equally important for her is a report of what is poorly warranted. In particular the claim that the data indicate
μ > 4.96
would be wrong over 99% of the time!
Of course, I would want to use the actual result, rather than the cut-off for rejection (as with power) but the reasoning is the same, and here I deliberately let the outcome just hit the cut-off for rejection.
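Those severity benchmarks can be checked the same way (a sketch; `sev_gt` is my name for SEV(μ > µ’) evaluated at the just-significant outcome):

```python
from math import erf, sqrt

def Phi(x):
    # Standard normal CDF
    return 0.5 * (1 + erf(x / sqrt(2)))

def sev_gt(mu_prime, z0=1.96):
    """SEV(mu > mu'): the probability of a result as small as, or smaller
    than, the one observed, were mu no greater than mu'."""
    return Phi(z0 - mu_prime)

print(round(sev_gt(0), 3))     # 0.975
print(round(sev_gt(0.96), 2))  # 0.84
print(round(sev_gt(4.96), 4))  # 0.0013: mu > 4.96 is poorly warranted
```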
~~~~~~~~~~~~~~
The (Type 1, 2 error probability) trade-off vanishes
Notice what happens if we consider the “real Type 1 error” as Pr(H_{0}|z_{0}).
Since Pr(H_{0}|z_{0}) decreases with increasing power, it decreases with decreasing Type 2 error. So we know that to identify “Type 1 error” and Pr(H_{0}|z_{0}) is to use language in a completely different way than the one in which power is defined. For there we must have a trade-off between Type 1 and 2 error probabilities.
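A small numerical sketch of the vanishing trade-off (assigning .5 priors to the two hypotheses, as earlier in the post; the specific power values are my own choices):

```python
alpha = 0.025  # Type 1 error probability, held fixed
posteriors = []
for pow_ in (0.5, 0.8, 0.999):
    beta = 1 - pow_                   # Type 2 error probability
    post_H0 = alpha / (alpha + pow_)  # "Pr(H0 | rejection)" under .5 priors
    posteriors.append(post_H0)
    print(round(beta, 3), round(post_H0, 3))
# beta and post_H0 fall together at fixed alpha: there is no trade-off,
# so post_H0 cannot be a Type 1 error probability in the usual sense.
```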
Upshot (modified 8p.m. 5/23/16)
Using size/power as a likelihood ratio, or as an indication of the pre-data strength of evidence with which to accord a rejection, is a bad idea even for anyone who wants to assess comparative evidence by likelihoods. The error statistician is not in the business of making inferences to point values, nor of comparative appraisals of different point hypotheses (much less do we wish to be required to assign priors to the point hypotheses). Criticisms often start out forming these ratios and then blame the “tail areas” for exaggerating the evidence against the null. We don’t form those ratios. My point here, though, is that this gambit also serves very badly for a Bayes ratio or likelihood assessment. (Likelihoodlums* and Bayesians, please weigh in on this.)
This is related to several posts having to do with allegations that p-values overstate the evidence against the null hypothesis, such as this one.
Please alert me to errors.
*Michael Lew’s term.
REFERENCES
Bayarri, M., Benjamin, D., Berger, J., & Sellke, T. (2016, in press). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses”, Journal of Mathematical Psychology.
Benjamin, D. & Berger J. 2016. “Comment: A Simple Alternative to P-values,” The American Statistician (online March 7, 2016).
Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum.
Mayo, D. 2016. “Don’t throw out the Error Control Baby with the Error Statistical Bathwater“. (My comment on the ASA document)
Mayo, D. 2003. Comments on J. Berger’s, “Could Jeffreys, Fisher and Neyman have Agreed on Testing? (pp. 19-24)
Senn, S. 2008. Statistical Issues in Drug Development, 2^{nd} ed. Chichester, West Sussex: Wiley-Interscience, John Wiley & Sons.
Wasserstein, R. & Lazar, N. 2016. “The ASA’s Statement on P-values: Context, Process and Purpose”, The American Statistician (online March 7, 2016).
[1] I don’t say the Rejection Ratio can have no frequentist role. It may arise in a diagnostic screening or empirical Bayesian context.
[2] It may also be found in Neyman! (Search this blog under Neyman’s Nursery.) However, Cohen uniquely provides massive power computations, before it was all computerized.