The New England Journal of Medicine (NEJM) announced new guidelines for authors for statistical reporting yesterday*. The ASA describes the change as “in response to the ASA Statement on P-values and Statistical Significance and subsequent The American Statistician special issue on statistical inference” (ASA I and II, in my abbreviation). If so, it seems to have backfired. I don’t know all the differences in the new guidelines, but those explicitly noted appear to me to move in the reverse direction from where the ASA I and II guidelines were heading.
The most notable point is that the NEJM highlights the need for error control, especially for constraining the Type I error probability, and pays a lot of attention to adjusting P-values for multiple testing and post hoc subgroups. ASA I included an important principle (#4) that P-values are altered and may be invalidated by multiple testing, but the ASA documents do not call for adjustments for multiplicity, nor do I find a discussion of Type I or II error probabilities in them. NEJM gives strict requirements for controlling the family-wise error rate or the false discovery rate (understood as the Benjamini and Hochberg frequentist adjustments).
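To make the two kinds of adjustment concrete, here is a minimal sketch of Bonferroni (family-wise error rate) and Benjamini-Hochberg (false discovery rate) adjusted P-values. The function names and the five raw P-values are hypothetical and chosen only for illustration; nothing here is taken from the NEJM guidelines beyond the names of the procedures.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg_adjust(p_values):
    """Benjamini-Hochberg (step-up) adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five endpoints (illustration only).
raw = [0.001, 0.012, 0.025, 0.038, 0.200]
bonf = bonferroni_adjust(raw)
bh = benjamini_hochberg_adjust(raw)

for p, b, h in zip(raw, bonf, bh):
    print(f"raw={p:.3f}  bonferroni={b:.3f}  bh={h:.3f}")
```

With these made-up inputs, only one endpoint stays at or below 0.05 after the Bonferroni adjustment, while four do after Benjamini-Hochberg. That reflects the different error rates the two procedures control, not a difference in the data.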
They do not go along with the ASA II call for ousting thresholds, ending the use of the words “significance/significant”, or banning “p ≤ 0.05”. In the associated article, we read:
“Clinicians and regulatory agencies must make decisions about which treatment to use or to allow to be marketed, and P values interpreted by reliably calculated thresholds subjected to appropriate adjustments have a role in those decisions”.
When it comes to confidence intervals, the recommendations of ASA II, to the extent they were influential on the NEJM, seem to have had the opposite effect to what was intended–or is this really what they wanted?
Significance levels and P-values, in other words, are terms to be reserved for contexts in which their error statistical meaning is legitimate. This is a key strong point of the NEJM guidelines. Confidence levels, for the NEJM, lose their error statistical or “coverage probability” meaning, unless they follow the adjustments that legitimate P-values call for. Unadjusted intervals may still be reported, but they must be accompanied by a sign that warns the reader the intervals were not adjusted for multiple testing and thus “the inferences drawn may not be reproducible.” The P-value, but not the confidence interval, remains an inferential tool with control of error probabilities. Now CIs are inversions of tests, and strictly speaking should also have error control. Authors may be allowed to forfeit this, but then CIs can’t replace significance tests and their use may even (inadvertently, perhaps) signal lack of error control. (In my view, that is not a good thing.) Here are some excerpts:
For all studies:
Significance tests should be accompanied by confidence intervals for estimated effect sizes, measures of association, or other parameters of interest. The confidence intervals should be adjusted to match any adjustment made to significance levels in the corresponding test.
For clinical trials:
Original and final protocols and statistical analysis plans (SAPs) should be submitted along with the manuscript, as well as a table of amendments made to the protocol and SAP indicating the date of the change and its content.
The analyses of the primary outcome in manuscripts reporting results of clinical trials should match the analyses prespecified in the original protocol, except in unusual circumstances. Analyses that do not conform to the protocol should be justified in the Methods section of the manuscript. …
When comparing outcomes in two or more groups in confirmatory analyses, investigators should use the testing procedures specified in the protocol and SAP to control overall type I error — for example, Bonferroni adjustments or prespecified hierarchical procedures. P values adjusted for multiplicity should be reported when appropriate and labeled as such in the manuscript. In hierarchical testing procedures, P values should be reported only until the last comparison for which the P value was statistically significant. P values for the first nonsignificant comparison and for all comparisons thereafter should not be reported. For prespecified exploratory analyses, investigators should use methods for controlling false discovery rate described in the SAP — for example, Benjamini–Hochberg procedures.
When no method to adjust for multiplicity of inferences or controlling false discovery rate was specified in the protocol or SAP of a clinical trial, the report of all secondary and exploratory endpoints should be limited to point estimates of treatment effects with 95% confidence intervals. In such cases, the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No P values should be reported for these analyses.
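The reporting rule for hierarchical procedures quoted above can be sketched as follows. This is only an illustration under stated assumptions: a fixed-sequence procedure in which every hypothesis is tested at the same level `alpha` in a prespecified order, with "significant" read as `p <= alpha`. The endpoint names and P-values are hypothetical.

```python
def fixed_sequence_report(ordered_results, alpha=0.05):
    """Return the (name, p) pairs that may be reported.

    Hypotheses are taken in their prespecified order. Testing stops at the
    first p-value above alpha; that p-value and all later ones are withheld,
    even if a later one happens to be small.
    """
    reportable = []
    for name, p in ordered_results:
        if p > alpha:
            break
        reportable.append((name, p))
    return reportable


# Hypothetical endpoints in their prespecified order (illustration only).
endpoints = [
    ("primary", 0.003),
    ("secondary_1", 0.020),
    ("secondary_2", 0.090),  # first nonsignificant comparison: stop here
    ("secondary_3", 0.010),  # small, but comes after the stop, so withheld
]
reported = fixed_sequence_report(endpoints)
print(reported)
```

The point of the ordering is that a small P-value appearing after the sequence has stopped carries no claim to error control, which is why it is not reported.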
As noted earlier, since P-values would be invalidated in such cases, it’s entirely right not to give them. CIs are permitted, yes, but are required to sport an alert warning that, even though multiple testing was done, the intervals were not adjusted for this and therefore “the inferences drawn may not be reproducible.” In short, their coverage probability justification goes by the board.
I wonder if practitioners can opt out of this weakening of CIs, and declare in advance that they are members of a subset of CI users who will only report confidence levels with a valid error statistical meaning, dual to statistical hypothesis tests.
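A rough numerical sketch of what is lost when intervals go unadjusted: if m intervals each have 95% coverage and (an assumption made here purely for simplicity) the estimates are independent, the chance that all m cover their parameters at once is 0.95 to the power m. A Bonferroni-style adjustment raises each interval's level so that joint coverage is at least 95%, with or without independence. The function names are hypothetical and normal-theory intervals are assumed.

```python
from statistics import NormalDist


def joint_coverage_if_independent(level, m):
    """Probability that all m intervals cover, assuming independent estimates."""
    return level ** m


def bonferroni_level(family_level, m):
    """Per-interval confidence level giving at least family_level joint coverage."""
    return 1.0 - (1.0 - family_level) / m


def z_multiplier(level):
    """Two-sided standard normal multiplier for a given confidence level."""
    return NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)


m = 10  # hypothetical number of secondary endpoints
unadjusted_joint = joint_coverage_if_independent(0.95, m)
per_interval = bonferroni_level(0.95, m)
widening = z_multiplier(per_interval) / z_multiplier(0.95)

print(f"joint coverage of {m} unadjusted 95% intervals: {unadjusted_joint:.3f}")
print(f"per-interval level after adjustment: {per_interval:.4f}")
print(f"adjusted intervals are wider by a factor of about {widening:.2f}")
```

Under these assumptions, ten unadjusted 95% intervals jointly cover only about 60% of the time, and restoring 95% joint coverage makes each interval a little over 40% wider.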
The NEJM guidelines continue:
…When the SAP prespecifies an analysis of certain subgroups, that analysis should conform to the method described in the SAP. If the study team believes a post hoc analysis of subgroups is important, the rationale for conducting that analysis should be stated. Post hoc analyses should be clearly labeled as post hoc in the manuscript.
Forest plots are often used to present results from an analysis of the consistency of a treatment effect across subgroups of factors of interest. …A list of P values for treatment by subgroup interactions is subject to the problems of multiplicity and has limited value for inference. Therefore, in most cases, no P values for interaction should be provided in the forest plots.
If significance tests of safety outcomes (when not primary outcomes) are reported along with the treatment-specific estimates, no adjustment for multiplicity is necessary. Because information contained in the safety endpoints may signal problems within specific organ classes, the editors believe that the type I error rates larger than 0.05 are acceptable. Editors may request that P values be reported for comparisons of the frequency of adverse events among treatment groups, regardless of whether such comparisons were prespecified in the SAP.
When possible, the editors prefer that absolute event counts or rates be reported before relative risks or hazard ratios. The goal is to provide the reader with both the actual event frequency and the relative frequency. Odds ratios should be avoided, as they may overestimate the relative risks in many settings and be misinterpreted.
Authors should provide a flow diagram in CONSORT format. The editors also encourage authors to submit all the relevant information included in the CONSORT checklist. …The CONSORT statement, checklist, and flow diagram are available on the CONSORT website.
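The excerpt's remark that odds ratios may overestimate relative risks can be checked with simple arithmetic. In the sketch below the event counts are invented for illustration: one scenario with a common outcome and one with a rare outcome, each with a relative risk of 2.

```python
def relative_risk(events_t, n_t, events_c, n_c):
    """Ratio of event proportions, treated over control."""
    return (events_t / n_t) / (events_c / n_c)


def odds_ratio(events_t, n_t, events_c, n_c):
    """Ratio of event odds, treated over control."""
    odds_t = events_t / (n_t - events_t)
    odds_c = events_c / (n_c - events_c)
    return odds_t / odds_c


# Hypothetical counts (illustration only).
common = (40, 100, 20, 100)   # 40% vs 20% event rates
rare = (4, 1000, 2, 1000)     # 0.4% vs 0.2% event rates

rr_common, or_common = relative_risk(*common), odds_ratio(*common)
rr_rare, or_rare = relative_risk(*rare), odds_ratio(*rare)

print(f"common outcome: RR={rr_common:.2f}, OR={or_common:.2f}")
print(f"rare outcome:   RR={rr_rare:.2f}, OR={or_rare:.3f}")
```

When the outcome is common the odds ratio (about 2.67 here) sits well above the relative risk of 2; when the outcome is rare the two nearly coincide. Reporting absolute counts first, as the editors prefer, lets the reader see which situation applies.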
Detailed instructions to ensure that observational studies retain control of error rates are given.
In the associated article:
P values indicate how incompatible the observed data may be with a null hypothesis; “P<0.05” implies that a treatment effect or exposure association larger than that observed would occur less than 5% of the time under a null hypothesis of no effect or association and assuming no confounding. Concluding that the null hypothesis is false when in fact it is true (a type I error in statistical terms) has a likelihood of less than 5%. [i]…
The use of P values to summarize evidence in a study requires, on the one hand, thresholds that have a strong theoretical and empirical justification and, on the other hand, proper attention to the error that can result from uncritical interpretation of multiple inferences.^{5} This inflation due to multiple comparisons can also occur when comparisons have been conducted by investigators but are not reported in a manuscript. A large array of methods to adjust for multiple comparisons is available and can be used to control the type I error probability in an analysis when specified in the design of a study.^{6,7} Finally, the notion that a treatment is effective for a particular outcome if P<0.05 and ineffective if that threshold is not reached is a reductionist view of medicine that does not always reflect reality. [ii]
… A well-designed randomized or observational study will have a primary hypothesis and a prespecified method of analysis, and the significance level from that analysis is a reliable indicator of the extent to which the observed data contradict a null hypothesis of no association between an intervention or an exposure and a response. Clinicians and regulatory agencies must make decisions about which treatment to use or to allow to be marketed, and P values interpreted by reliably calculated thresholds subjected to appropriate adjustments have a role in those decisions.
Finally, the current guidelines are limited to studies with a traditional frequentist design and analysis, since that matches the large majority of manuscripts submitted to the Journal. We do not mean to imply that these are the only acceptable designs and analyses. The Journal has published many studies with Bayesian designs and analyses^{8-10} and expects to see more such trials in the future. When appropriate, our guidelines will be expanded to include best practices for reporting trials with Bayesian and other designs.
What do you think?
I will update this with corrections and thoughts using (i), (ii), etc.
The author guidelines:
https://www.nejm.org/author-center/new-manuscripts
The associated article:
https://www.nejm.org/doi/full/10.1056/NEJMe1906559
*I meant to thank Nathan Schachtman for notifying me and sending links; also Stuart Hurlbert.
[i] It would be better, it seems to me, if the term “likelihood” was used only for its technical meaning in a document like this.
[ii] I don’t see it as a matter of “reductionism” but simply a matter of the properties of the test and the discrepancies of interest in the context at hand.
The American Statistical Association’s (ASA) recent effort to advise the statistical and scientific communities on how they should think about statistics in research is ambitious in scope. It is concerned with an initial attempt to depict what empirical research might look like in “a world beyond p<0.05” (The American Statistician, 2019, 73, S1, 1-401). Quite surprisingly, the main recommendation of the lead editorial article in the Special Issue of The American Statistician devoted to this topic (Wasserstein, Schirm, & Lazar, 2019; hereafter, ASA II) is that “it is time to stop using the term ‘statistically significant’ entirely” (p. 2). ASA II acknowledges the controversial nature of this directive and anticipates that it will be subject to critical examination. Indeed, in a recent post, Deborah Mayo began her evaluation of ASA II by making constructive amendments to three recommendations that appear early in the document (‘Error Statistics Philosophy’, June 17, 2019). These amendments have received numerous endorsements, and I record mine here. In this short commentary, I briefly state a number of general reservations that I have about ASA II.
1. The proposal that we should stop using the expression “statistical significance” is given a weak justification.
ASA II proposes a superficial linguistic reform that is unlikely to overcome the widespread misconceptions and misuse of the concept of significance testing. A more reasonable, and common-sense, strategy would be to diagnose the reasons for the misconceptions and misuse and take ameliorative action through the provision of better statistics education, much as ASA I did with p values. Interestingly, ASA II references Mayo’s recent book, Statistical Inference as Severe Testing (2018), when mentioning the “statistics wars”. However, it refrains from considering the fact that her error-statistical perspective provides an informed justification for continuing to use tests of significance, along with the expression, “statistically significant”. Further, ASA II reports cases where some of the Special Issue authors thought that use of a p-value threshold might be acceptable. However, it makes no effort to consider how these cases might challenge their main recommendation.
2. The claimed benefits of abandoning talk of statistical significance are hopeful conjectures.
ASA II makes a number of claims about the benefits that it thinks will follow from abandoning talk of statistical significance. It says, “researchers will see their results more easily replicated – and, even when not, they will better understand why”. “[We] will begin to see fewer false alarms [and] fewer overlooked discoveries …”. And, “As ‘statistical significance’ is used less, statistical thinking will be used more.” (p. 1) I do not believe that any of these claims are likely to follow from retirement of the expression, “statistical significance”. Unfortunately, no justification is provided for the plausibility of any of the alleged benefits. To take two of these claims: First, removal of the common expression, “significance testing” will make little difference to the success rate of replications. It is well known that successful replications depend on a number of important factors, including research design, data quality, effect size, and study power, along with the multiple criteria often invoked in ascertaining replication success. Second, it is just implausible to suggest that refraining from talk about statistical significance will appreciably help overcome mechanical decision-making in statistical practice, and lead to a greater engagement with statistical thinking. Such an outcome will require, among other things, the implementation of science education reforms that centre on the conceptual foundations of statistical inference.
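The dependence of replication success on effect size and study power can be illustrated with a small calculation. The sketch assumes one particular replication criterion (the replication reaches two-sided p ≤ 0.05 in a two-group comparison with equal group sizes and known unit variance); the function name and the scenarios are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample_z(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample z-test for a standardized mean difference."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * sqrt(n_per_group / 2.0)  # true effect in standard-error units
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


# Hypothetical scenarios (illustration only).
for d, n in [(0.5, 64), (0.2, 64), (0.2, 400)]:
    print(f"effect size {d}, n per group {n}: power = {power_two_sample_z(d, n):.2f}")
```

With 64 participants per group, a medium effect would replicate by this criterion roughly four times in five, a small effect only about one time in five. Changing the vocabulary used to describe the result leaves these numbers untouched.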
3. ASA II’s main recommendation is not a majority view.
ASA II bases its main recommendation to stop using the language of “statistical significance” in good part on its review of the articles in the Special Issue. However, an inspection of the Special Issue reveals that this recommendation is at variance with the views of many of the 40-odd articles it contains. Those articles range widely, both in the topics they cover and in their attitudes to the usefulness of tests of significance. By my reckoning, only two of the articles advocate banning talk of significance testing. To be fair, ASA II acknowledges the diversity of views held about the nature of tests of significance. However, I think that this diversity should have prompted it to take proper account of the fact that its recommendation is only one of a number of alternative views about significance testing. At the very least, ASA II should have tempered its strong recommendation not to speak of statistical significance any more.
4. The claim for continuity between ASA I and ASA II is misleading.
There is no evidence in ASA I (Wasserstein & Lazar, 2016) for the assertion made in ASA II that the earlier document stopped just short of recommending that claims of “statistical significance” should be eliminated. In fact, ASA II marks a clear departure from ASA I, which was essentially concerned with how to better understand and use p-values. There is nothing in the earlier document to suggest that abandoning talk of statistical significance might be the next appropriate step forward in the ASA’s efforts to guide our statistical thinking.
5. Nothing is said about scientific method, and little is said about science.
The announcement of the ASA’s 2017 Symposium on Statistical Inference stated that the Symposium would “focus on specific approaches for advancing scientific methods in the 21st century”. However, the Symposium, and the resulting Special Issue of The American Statistician, showed little interest in matters to do with scientific method. This is regrettable because the myriad insights about scientific inquiry contained in contemporary scientific methodology have the potential to greatly enrich statistical science. The post-p<0.05 world depicted by ASA II is not an informed scientific world. It is an important truism that statistical inference plays a major role in scientific reasoning. However, for this role to be properly conveyed, ASA II would have to employ an informative conception of the nature of scientific inquiry.
6. Scientists who speak of statistical significance do embrace uncertainty.
I think that it is uncharitable, indeed incorrect, of ASA II to depict many researchers who use the language of significance testing as being engaged in a quest for certainty. John Dewey, Charles Peirce, and Karl Popper taught us quite some time ago that we are fallible, error-prone creatures, and that we must embrace uncertainty. Further, despite their limitations, our science education efforts frequently instruct learners to think of uncertainty as an appropriate epistemic attitude to hold in science. This fact, combined with the oft-made claim that statistics employs ideas about probability in order to quantify uncertainty, requires from ASA II a factually-based justification for its claim that many scientists who employ tests of significance do so in a quest for certainty.
Under the heading, “Has the American Statistical Association Gone Post-Modern?”, the legal scholar, Nathan Schachtman, recently stated:
The ASA may claim to be agnostic in the face of the contradictory recommendations, but there is one thing we know for sure: over-reaching litigants and their expert witnesses will exploit the real or apparent chaos in the ASA’s approach. The lack of coherent, consistent guidance will launch a thousand litigation ships, with no epistemic compass. (‘Schachtman’s Law’, March 24, 2019)
I suggest that, with appropriate adjustment, the same can fairly be said about researchers and statisticians, who might look to ASA II as an informative guide to a better understanding of tests of significance, and the many misconceptions about them that need to be corrected.
References
Haig, B. D. (2019). Stats: Don’t retire significance testing. Nature, 569, 487.
Mayo, D. G. (2019). The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean (Some Recommendations) (ii), blog post on Error Statistics Philosophy Blog, June 17, 2019.
Mayo, D. G. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. New York, NY: Cambridge University Press.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: Context, process, and purpose. The American Statistician, 70, 129-133.
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Editorial: Moving to a world beyond “p<0.05”. The American Statistician, 73, S1, 1-19.
Had I been scheduled to speak later at the 12th MuST Conference & 3rd Workshop “Perspectives on Scientific Error” in Munich, rather than on day 1, I could have (constructively) illustrated some of the errors and casualties by reference to a few of the conference papers that discussed significance tests. (Most gave illuminating discussions of such topics as replication research, the biases that discredit meta-analysis, statistics in the law, formal epistemology [i].) My slides follow my abstract.
The Statistics Wars: Errors and Casualties. Mounting failures of replication in the social and biological sciences give a new urgency to critically appraising proposed statistical reforms. While many reforms are welcome (preregistration of experiments, replication, discouraging cookbook uses of statistics), there have been casualties. The philosophical presuppositions behind the meta-research battles remain largely hidden. Too often the statistics wars have become proxy wars between competing tribe leaders, each keen to advance one or another tool or school, rather than build on efforts to do better science. Efforts of replication researchers and open science advocates are diminished when so much attention is centered on repeating hackneyed howlers of statistical significance tests (statistical significance isn’t substantive significance, no evidence against isn’t evidence for), when erroneous understanding of basic statistical terms goes uncorrected, and when bandwagon effects lead to popular reforms that downplay the importance of error probability control. These casualties threaten our ability to hold accountable the “experts,” the agencies, and all the data handlers increasingly exerting power over our lives.
[I missed the afternoon of day #4]
D. Mayo’s The Statistics Wars: Errors and Casualties slides:
So I should say something. But the task is delicate. And painful. Very. I should start by asking: What is it (i.e., what is it actually saying)? Then I can offer some constructive suggestions.
The Invitation to Broader Consideration and Debate
The papers in this issue propose many new ideas, ideas that in our determination as editors merited publication to enable broader consideration and debate. The ideas in this editorial are likewise open to debate. (ASAII p. 1)
The questions around reform need consideration and debate. (p. 9)
Excellent! A broad, open, critical debate is sorely needed. Still, we can only debate something when there is a degree of clarity as to what “it” is. I will be very happy to post readers’ meanderings on ASA II (~1000 words) if you send them to me.
My focus here is just on the intended positions of the ASA, not the summaries of articles. This comprises around the first 10 pages. Even from just the first few pages the reader is met with some noteworthy declarations:
Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof). (p. 1)
No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p.2)
A declaration of statistical significance is the antithesis of thoughtfulness. (p. 4)
Whether a p-value passes any arbitrary threshold should not be considered at all when deciding which results to present or highlight. (p. 2, my emphasis)
It is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive. (p.2)
“Statistically significant”– don’t say it and don’t use it. (p. 2)
(Wow!)
I am very sympathetic with the concerns about rigid cut-offs, and fallacies of moving from statistical significance to substantive scientific claims. I feel as if I’ve just written a whole book on it! I say, on p. 10 of SIST:
In formal statistical testing, the crude dichotomy of “pass/fail” or “significant or not” will scarcely do. We must determine the magnitudes (and directions) of any statistical discrepancies warranted, and the limits to any substantive claims you may be entitled to infer from the statistical ones.
Since ASA II will still use P-values, you’re bound to wonder why a user wouldn’t just report “the difference is statistically significant at the P-value attained”. (The probability of observing even larger differences, under the assumption of chance variability alone is p.) Confidence intervals (CIs) are already routinely given alongside P-values. So there is clearly more to the current movement than meets the eye. But for now I’m just trying to decipher what the ASA position is.
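As a minimal sketch of the kind of report described here, the following computes the attained P-value (one-sided and two-sided) together with a confidence interval for an observed difference. It assumes a normal approximation with a known standard error; the function name and the numbers are hypothetical.

```python
from statistics import NormalDist


def z_test_report(estimate, std_error, level=0.95):
    """Attained p-values and a confidence interval under a normal approximation."""
    norm = NormalDist()
    z = estimate / std_error
    # Probability of a difference at least this large, under chance variability alone.
    p_one_sided = 1.0 - norm.cdf(z)
    p_two_sided = 2.0 * (1.0 - norm.cdf(abs(z)))
    half_width = norm.inv_cdf(1.0 - (1.0 - level) / 2.0) * std_error
    interval = (estimate - half_width, estimate + half_width)
    return p_one_sided, p_two_sided, interval


# Hypothetical observed difference and standard error (illustration only).
p1, p2, ci = z_test_report(estimate=2.0, std_error=0.8)
print(f"one-sided p = {p1:.4f}, two-sided p = {p2:.4f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Reporting the attained P-value alongside the interval, as here, involves no fixed cut-off at all, which is why the question of what more is being asked for seems a fair one.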
What’s the Relationship Between ASA I and ASA II?
I assume, for this post, that ASA II is intended to be an extension of ASA I. In that case, it would subsume the 6 principles of ASA I. There is evidence for this. For one thing, it begins by sketching a “sampling” of “don’ts” from ASA I, for those who are new to the debate. Secondly, it recommends that ASA I be widely disseminated. But some Principles (1, 4) are apparently missing[3], and others are rephrased in ways that alter the initial meanings. Do they really mean these declarations as written? Let us try to take them at their word.
But right away we are struck with a conflict with Principle 1 of ASA I–which happens to be the only positive principle given. (See Note 5 for the six Principles of ASA I.)
Principle 1. P-values can indicate how incompatible the data are with a specified statistical model.
“A p-value provides one approach to summarizing the incompatibility between a particular set of data and a proposed model for the data. The most common context is a model, constructed under a set of assumptions, together with a so-called “null hypothesis.” Often the null hypothesis postulates the absence of an effect, such as no difference between two groups, or the absence of a relationship between a factor and an outcome. The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions.” (ASA I p. 131)
However, an indication of how incompatible data are with a claim of the absence of a relationship between a factor and an outcome would be an indication of the presence of the relationship; and providing evidence against a claim of no difference between two groups would often be of scientific or practical importance.
So, Principle 1 (from ASA I) doesn’t appear to square with the first bulleted item I listed (from ASA II):
(1) “Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof)” (p.1, ASA II).
Either modify (1) or erase Principle 1. But if you erase all thresholds for finding incompatibility (whether using P-values or other measures), there are no tests, and no falsifications, even of the statistical kind.
My understanding (from Ron Wasserstein) is that this bullet is intended to correspond to Principle 5 in ASA I – that P-values do not give population effect sizes. But it is now saying something stronger (at least to my ears and to everyone else I’ve asked). Do the authors mean to be saying that nothing (of scientific or practical importance) can be learned from statistical significance tests? I think not.
So, my first recommendation is:
Replace (1) with:
“Don’t conclude anything about the scientific or practical importance of the (population) effect size based only on statistical significance (or lack thereof).”
Either that, or simply stick to Principle 5 from ASA I: “A p-value, or statistical significance[4], does not measure the size of an effect or the importance of a result.” (p. 132) This statement is, strictly speaking, a tautology, true by the definitions of terms: probability isn’t itself a measure of the size of a (population) effect. However, you can use statistically significant differences to infer what the data indicate about the size of the (population) effect.[4]
My second friendly amendment concerns the second bulleted item:
(2) No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p. 2)
Focus just on “presence”. From this assertion it would seem to follow that no P-values[5], however small, even from well-controlled trials, can reveal the presence of an association or effect–and that is too strong. Again, we get a conflict with Principle 1 from ASA I. But I’m guessing, for now, the authors do not intend to say this. If you don’t mean it, don’t say it.
So, my second recommendation is to replace (2) with:
“No p-value by itself can reveal the plausibility, presence, truth, or importance of an association or effect.”
Without this friendly amendment, ASA II is at loggerheads with ASA I, and they should not be advocating those 6 principles without changing either or both. Without this or a similar modification, moreover, any other statistical quantity or evidential measure would likewise be unable to reveal these things. Or so many would argue. These modest revisions might prevent some readers stopping after the first few pages, and that would be a shame, as they would miss the many right-headed insights about linking statistical and scientific inference.
This leads to my third bulleted item from ASA II:
(3) A declaration of statistical significance is the antithesis of thoughtfulness… it ignores what previous studies have contributed to our knowledge. (p. 4)
Surely the authors do not mean to say that anyone who asserts the observed difference is statistically significant at level p has her hands tied and invariably ignores all previous studies, background information and theories in planning and reaching conclusions, decisions, proposed solutions to problems. I’m totally on board with the importance of backgrounds, and multiple steps relating data to scientific claims and problems. Here’s what I say in SIST:
The error statistician begins with a substantive problem or question. She jumps in and out of piecemeal statistical tests both formal and quasi-formal. The pieces are integrated in building up arguments from coincidence, informing background theory, self-correcting via blatant deceptions, in an iterative movement. The inference is qualified by using error probabilities to determine not “how probable,” but rather, “how well-probed” claims are, and what has been poorly probed. (SIST, p. 162)
But good inquiry is piecemeal: There is no reason to suppose one does everything at once in inquiry, and it seems clear from the ASA II guide that the authors agree. Since I don’t think they literally mean (3), why say it?
Practitioners who use these methods in medicine and elsewhere have detailed protocols for how background knowledge is employed in designing, running, and interpreting tests. When medical researchers specify primary outcomes, for just one example, it’s very explicitly with due regard for the mechanism of drug action. It’s intended as the most direct way to pick up on the drug’s mechanism. Finding incompatibility using P-values inherits the meaning already attached to a sensible test hypothesis. That valid P-values require context is presupposed by the very important Principle 4 of ASA I (see note 3).
As lawyer Nathan Schachtman observes, in a recent conversation on ASA II:
By the time a phase III clinical trial is being reviewed for approval, there is a mountain of data on pharmacology, pharmacokinetics, mechanism, target organ, etc. If Wasserstein wants to suggest that there are some people who misuse or misinterpret p-values, fine. The principle of charity requires that we give a more sympathetic reading to the broad field of users of statistical significance testing. (Schachtman 2019)
Now it is possible the authors are saying a reported P-value can never be thoughtful because thoughtfulness requires that a statistical measure, at any stage of probing, incorporate everything we know (SIST dubs this “big picture” inference). Do we want that? Or maybe (3) is their way of saying a statistical measure must incorporate background beliefs in the manner of Bayesian degree-of-belief (?) priors. Many would beg to differ, including some leading Bayesians. Andrew Gelman (2012) has suggested that ‘Bayesians Want Everybody Else to be Non-Bayesian’:
Bayesian inference proceeds by taking the likelihoods from different data sources and then combining them with a prior (or, more generally, a hierarchical model). The likelihood is key. . . No funny stuff, no posterior distributions, just the likelihood. . . I don’t want everybody coming to me with their posterior distribution – I’d just have to divide away their prior distributions before getting to my own analysis. (ibid., p. 54)
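What “dividing away” a prior amounts to can be sketched in the simplest conjugate case. The example assumes a normal prior for a mean and a normal likelihood summary with known standard deviation, so that precisions (inverse variances) add; the function names and numbers are hypothetical and are not drawn from Gelman’s article.

```python
def combine(prior_mean, prior_sd, like_mean, like_sd):
    """Posterior (mean, sd) for a normal mean: precisions add, means are precision-weighted."""
    w_prior, w_like = 1.0 / prior_sd**2, 1.0 / like_sd**2
    post_precision = w_prior + w_like
    post_mean = (w_prior * prior_mean + w_like * like_mean) / post_precision
    return post_mean, post_precision ** -0.5


def divide_away_prior(post_mean, post_sd, prior_mean, prior_sd):
    """Recover the likelihood summary (mean, sd) from someone else's posterior."""
    w_post, w_prior = 1.0 / post_sd**2, 1.0 / prior_sd**2
    w_like = w_post - w_prior  # positive whenever the data added information
    like_mean = (w_post * post_mean - w_prior * prior_mean) / w_like
    return like_mean, w_like ** -0.5


# Hypothetical prior and data summary (illustration only).
posterior = combine(prior_mean=0.0, prior_sd=2.0, like_mean=1.5, like_sd=0.5)
recovered = divide_away_prior(*posterior, prior_mean=0.0, prior_sd=2.0)
print(f"posterior mean and sd: {posterior[0]:.3f}, {posterior[1]:.3f}")
print(f"recovered likelihood mean and sd: {recovered[0]:.3f}, {recovered[1]:.3f}")
```

The round trip only works if the reader knows which prior was used, which is one way to read the complaint in the quotation: a reported likelihood can be combined with anyone’s prior, whereas a reported posterior has first to be unpicked.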
So, my third recommendation is to replace (3) with (something like):
“failing to report anything beyond a declaration of statistical significance is the antithesis of thoughtfulness.”
There’s much else that bears critical analysis and debate in ASA II; I’ll come back to it. I hope to hear from the authors of ASA II about my very slight, constructive amendments (to avoid a conflict with Principle 1).
Meanwhile, I fear we will see court cases piling up denying that anyone can be found culpable for abusing p-values and significance tests, since the ASA declared that all p-values are arbitrary, and whether predesignated thresholds are honored or breached should not be considered at all. (This was already happening based on ASA I.)[6]
Please share your thoughts and any errors in the comments. I will indicate later drafts of this post with (i), (ii), … Do send me other articles you find discussing this. Version (ii) of this post begins a list:
Nathan Schachtman (2019): Has the ASA Gone Post-Modern?
Cook et al. (2019): There is Still a Place for Significance Testing in Clinical Trials
NEJM Manuscript & Statistical Guidelines 2019
Harrington, New Guidelines for Statistical Reporting in the Journal, NEJM 2019
References:
Gelman, A. (2012) “Ethics and the Statistical Use of Prior Information”. http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics5.pdf
Mayo, D. (2016). “Don’t Throw out the Error Control Baby with the Bad Statistics Bathwater: A Commentary” on R. Wasserstein and N. Lazar: “The ASA’s Statement on P-values: Context, Process, and Purpose”, The American Statistician 70(2).
Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.
Schachtman, N. (2019). (private communication)
Wasserstein, R. and Lazar, N. (2016). “The ASA’s Statement on P-values: Context, Process and Purpose”, (and supplemental materials), The American Statistician 70(2), 129–33. (ASA I)
Wasserstein, R., Schirm, A. and Lazar, N. (2019) Editorial: “Moving to a World Beyond ‘p < 0.05’”, The American Statistician 73(S1): 1-19. (ASA II)
A salutary effect of power analysis is that it draws one forcibly to consider the magnitude of effects. In psychology, and especially in soft psychology, under the sway of the Fisherian scheme, there has been little consciousness of how big things are. (Cohen 1990, p. 1309)
See SIST, p. 323. For links to all of excursion 5 on power, see this post.
The concept of a test’s power is still being corrupted in the myriad ways discussed in 5.5, 5.6. I’m excerpting all of Tour II of Excursion 5, as I did with Tour I (of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars 2018, CUP)*. Originally the two Tours comprised just one, but in finalizing corrections, I decided the two together made for too long a slog, and I split it up. Because it was done at the last minute, some of the terms in Tour II rely on their introductions in Tour I. Here’s how it starts:
5.5 Power Taboos, Retrospective Power, and Shpower
Let’s visit some of the more populous tribes who take issue with power – by which we mean ordinary power – at least its post-data uses. Power Peninsula is often avoided due to various “keep out” warnings and prohibitions, or researchers come during planning, never to return. Why do some people consider it a waste of time, if not totally taboo, to compute power once we know the data? A degree of blame must go to N-P, who emphasized the planning role of power, and only occasionally mentioned its use in determining what gets “confirmed” post-data. After all, it’s good to plan how large a boat we need for a philosophical excursion to the Lands of Overlapping Statistical Tribes, but once we’ve made it, it doesn’t matter that the boat was rather small. Or so the critic of post-data power avers. A crucial disanalogy is that with statistics, we don’t know that we’ve “made it there,” when we arrive at a statistically significant result. The statistical significance alarm goes off, but you are not able to see the underlying discrepancy that generated the alarm you hear. The problem is to make the leap from the perceived alarm to an aspect of a process, deep below the visible ocean, responsible for its having been triggered. Then it is of considerable relevance to exploit information on the capability of your test procedure to result in alarms going off (perhaps with different decibels of loudness), due to varying values of the parameter of interest. There are also objections to power analysis with insignificant results.
Exhibit (vi): Non-significance + High Power Does Not Imply Support for the Null over the Alternative. Sander Greenland (2012) has a paper with this title. The first step is to understand the assertion, giving the most generous interpretation. It deals with non-significance, so our ears are perked for a fallacy of non-rejection. Second, we know that “high power” is an incomplete concept, so he clearly means high power against “the alternative.” We have a handy example: alternative μ^{.84} in T+ (POW(T+, μ^{.84}) = 0.84).
Note to blog reader: μ^{.84} abbreviates “the alternative against which the test has 0.84 power.” This general abbreviation was introduced in Tour I.
Use the water plant case, T+: H_{0}: μ ≤ 150 vs. H_{1}: μ > 150, σ = 10, n = 100. With α = 0.025, z_{0.025} = 1.96, and the corresponding cut-off in terms of the sample mean is x_{0.025} = 150 + 1.96(10)/√100 = 151.96; μ^{.84} = 152.96.
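The arithmetic for the water plant case can be checked with a short sketch. It assumes the sample mean is Normal with standard error σ/√n, and it takes μ^{.84} to be the value one standard error above the cut-off; the function names `cutoff` and `power` are illustrative, not from the book.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def cutoff(mu0, sigma, n, alpha):
    """Sample-mean cut-off for the one-sided test T+ at level alpha."""
    return mu0 + Z.inv_cdf(1 - alpha) * sigma / sqrt(n)

def power(mu_alt, mu0, sigma, n, alpha):
    """POW(T+, mu_alt) = Pr(sample mean >= cut-off; mu = mu_alt)."""
    se = sigma / sqrt(n)
    return 1 - Z.cdf((cutoff(mu0, sigma, n, alpha) - mu_alt) / se)

x_cut = cutoff(150, 10, 100, 0.025)      # about 151.96
mu_84 = x_cut + 1 * 10 / sqrt(100)       # one standard error above the cut-off
pow_84 = power(mu_84, 150, 10, 100, 0.025)
print(round(x_cut, 2), round(mu_84, 2), round(pow_84, 2))
```

Adding one standard error to the cut-off gives power Pr(Z ≥ −1), which is where the 0.84 comes from.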
Now a title like this is supposed to signal a problem, a reason for those “keep out” signs. His point, in relation to this example, boils down to noting that an observed difference may not be statistically significant – x may fail to make it to the cut-off x_{0.025} – and yet be closer to μ^{.84} than to the null value (here, 150). This happens because the Type II error probability β (here, 0.16)^{1} is greater than the Type I error probability (0.025).
For a quick computation let x_{0.025} = 152 and μ^{.84} = 153. Halfway between alternative 153 and the 150 null is 151.5. Any observed mean greater than 151.5 but less than the x_{0.025} cut-off, 152, will be an example of Greenland’s phenomenon. An example would be those values that are closer to 153, the alternative against which the test has 0.84 power, than to 150 and thus, by a likelihood measure, support 153 more than 150 – even though POW(μ = 153) is high (0.84). Having established the phenomenon, your next question is: so what?
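To make the phenomenon concrete, here is a minimal sketch using the rounded numbers above (cut-off 152, standard error 1). The observed mean 151.8 is a hypothetical value chosen to lie between 151.5 and 152, and `likelihood_ratio` is an illustrative helper, not something defined in the book.

```python
from math import exp

SE = 1.0  # sigma / sqrt(n) = 10 / sqrt(100)

def likelihood_ratio(xbar, mu_a, mu_b, se=SE):
    """Normal likelihood of mu_a relative to mu_b, given observed mean xbar."""
    return exp(-((xbar - mu_a) ** 2 - (xbar - mu_b) ** 2) / (2 * se ** 2))

xbar = 151.8                    # hypothetical observed mean
significant = xbar >= 152       # rounded cut-off used in the text
lr = likelihood_ratio(xbar, 153, 150)
print(significant, lr > 1)      # not significant, yet the likelihood favors 153 over 150
```

At the halfway point 151.5 the ratio is exactly 1, so any non-significant mean above it favors 153 on this comparative measure.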
It would be problematic if power analysis took the insignificant result as evidence for μ = 150 – maintaining compliance with the ecological stipulation – and I don’t doubt some try to construe it as such, nor that Greenland has been put in the position of needing to correct them. Power analysis merely licenses μ ≤ μ^{.84} where 0.84 was chosen for “high power.” Glance back at Souvenir X. So at least one of the “keep out” signs can be removed.
All of Excursion 5 Tour II (in proofs) is here.
Notes:
^{1} That is, β(μ^{.84}) = Pr(d < 0.4; μ = 0.6) = Pr(Z < −1) = 0.16.
_____________
*This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).
It is still valuable to look at the discussions and comments under “power” and “shpower” on this blog.
Earlier excerpts and mementos from SIST (May 2018-May 2019) are here.
Where YOU are in the journey.
An article [i], “There is Still a Place for Significance Testing in Clinical Trials,” appearing recently in Clinical Trials, while very short, effectively responds to recent efforts to stop error statistical testing [ii]. We need more of this. Much more. The emphasis in this excerpt is mine:
Much hand-wringing has been stimulated by the reflection that reports of clinical studies often misinterpret and misrepresent the findings of the statistical analyses. Recent proposals to address these concerns have included abandoning p-values and much of the traditional classical approach to statistical inference, or dropping the concept of statistical significance while still allowing some place for p-values. How should we in the clinical trials community respond to these concerns? Responses may vary from bemusement, pity for our colleagues working in the wilderness outside the relatively protected environment of clinical trials, to unease about the implications for those of us engaged in clinical trials….
However, we should not be shy about asserting the unique role that clinical trials play in scientific research. A clinical trial is a much safer context within which to carry out a statistical test than most other settings. Properly designed and executed clinical trials have opportunities and safeguards that other types of research do not typically possess, such as protocolisation of study design; scientific review prior to commencement; prospective data collection; trial registration; specification of outcomes of interest including, importantly, a primary outcome; and others. For randomised trials, there is even more protection of scientific validity provided by the randomisation of the interventions being compared. It would be a mistake to allow the tail to wag the dog by being overly influenced by flawed statistical inferences that commonly occur in less carefully planned settings….
The carefully designed clinical trial based on a traditional statistical testing framework has served as the benchmark for many decades. It enjoys broad support in both the academic and policy communities. There is no competing paradigm that has to date achieved such broad support. The proposals for abandoning p-values altogether often suggest adopting the exclusive use of Bayesian methods. For these proposals to be convincing, it is essential their presumed superior attributes be demonstrated without sacrificing the clear merits of the traditional framework. Many of us have dabbled with Bayesian approaches and find them to be useful for certain aspects of clinical trial design and analysis, but still tend to default to the conventional approach notwithstanding its limitations. While attractive in principle, the reality of regularly using Bayesian approaches on important clinical trials has been substantially less appealing – hence their lack of widespread uptake.
The issues that have led to the criticisms of conventional statistical testing are of much greater concern where statistical inferences are derived from observational data. … Even when the study is appropriately designed, there is also a common converse misinterpretation of statistical tests whereby the investigator incorrectly infers and reports that a non-significant finding conclusively demonstrates no effect. However, it is important to recognise that an appropriately designed and powered clinical trial enables the investigators to potentially conclude there is ‘no meaningful effect’ for the principal analysis.[iii] More generally, these problems are largely due to the fact that many individuals who perform statistical analyses are not sufficiently trained in statistics. It is naive to suggest that banning statistical testing and replacing it with greater use of confidence intervals, or Bayesian methods, or whatever, will resolve any of these widespread interpretive problems. Even the more modest proposal of dropping the concept of ‘statistical significance’ when conducting statistical tests could make things worse. By removing the prespecified significance level, typically 5%, interpretation could become completely arbitrary. It will also not stop data-dredging, selective reporting, or the numerous other ways in which data analytic strategies can result in grossly misleading conclusions.
You can read the full article here.
We may reject, with reasonable severity, that promoting correctly interpreted statistical tests is the real goal of the most powerful leaders of the movement to Stop Statistical Tests. The goal is just to stop (error) statistical tests altogether.[iv] That today’s CI leaders advance this goal is especially unwarranted and self-defeating, in that confidence intervals are just inversions of N-P tests, and were developed at the same time by the same man (Neyman) who developed (with E. Pearson) the theory of error statistical tests. See this recent post.
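The duality between tests and confidence intervals can be sketched for the one-sided Normal case with known σ. The data values below are hypothetical, and the helper names `rejects` and `lower_bound` are illustrative; the point is only that a value μ0 lies inside the one-sided interval exactly when the corresponding test of H_{0}: μ ≤ μ0 fails to reject.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def rejects(xbar, mu0, sigma, n, alpha=0.025):
    """One-sided test T+ of H0: mu <= mu0 rejects when d(x) >= z_alpha."""
    return sqrt(n) * (xbar - mu0) / sigma >= Z.inv_cdf(1 - alpha)

def lower_bound(xbar, sigma, n, alpha=0.025):
    """One-sided (1 - alpha) lower confidence bound for mu."""
    return xbar - Z.inv_cdf(1 - alpha) * sigma / sqrt(n)

xbar, sigma, n = 152.0, 10.0, 100    # hypothetical data
lb = lower_bound(xbar, sigma, n)
for mu0 in (149.0, 150.0, 150.5, 151.0):
    in_interval = mu0 > lb           # interval is {mu : mu > lb}
    print(mu0, rejects(xbar, mu0, sigma, n), in_interval)
```

Each printed row shows the two columns disagreeing: rejected null values fall outside the interval, and non-rejected ones fall inside it.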
Reader: I’ve placed on draft a number of posts while traveling in England over the past week, but haven’t had the chance to read them over, or find pictures for them. This will soon change, so stay tuned!
*****************************************************************
[i] Jonathan A Cook, Dean A Fergusson, Ian Ford, Mithat Gonen, Jonathan Kimmelman, Edward L Korn and Colin B Begg (2019). “There is still a place for significance testing in clinical trials”, Clinical Trials 2019, Vol. 16(3) 223–224.
[ii] Perhaps we should call those driven to Stop Error Statistical Tests “Obsessed”. I thank Nathan Schachtman for sending me the article.
[iii] It’s disappointing how many critics of tests seem unaware of this simple power analysis point, and how it avoids egregious fallacies of non-rejection, or moderate P-value. It precisely follows simple significance test reasoning. The severity account that I favor gives a more custom-tailored approach that is sensitive to the actual outcome. (See, for example, Excursion 5 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).)
[iv] Bayes factors, like other comparative measures, are not “tests”, and do not falsify (even statistically). They can only say one hypothesis or model is better than a selected other hypothesis or model, based on some^ selected criteria. They can both (all) be improbable, unlikely, or terribly tested. One can always add a “falsification rule”, but it must be shown that the resulting test avoids frequently passing/failing claims erroneously.
^The Anti-Testers would have to say “arbitrary criterion”, to be consistent with their considering any P-value “arbitrary”, and denying that a statistically significant difference, reaching any P-value, indicates a genuine difference from a reference hypothesis.
Introduction & Overview
The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* 05/19/18
Blurbs of 16 Tours: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST) 03/05/19
Excursion 1
EXCERPTS
Tour I
Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1) 09/08/18
Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2) 09/11/18
Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3) 09/15/18
Tour II
Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt 04/04/19
Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II) 11/08/18
MEMENTOS
Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars) 10/29/18
Excursion 2
EXCERPTS
Tour I
Excursion 2: Taboos of Induction and Falsification: Tour I (first stop) 09/29/18
“It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (Keepsake by Fisher, 2.1) 10/05/18
Tour II
Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3) 10/10/18
MEMENTOS
Tour Guide Mementos and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation) 11/14/18
Mementos for Excursion 2 Tour II Falsification, Pseudoscience, Induction 11/17/18
Excursion 3
EXCERPTS
Tour I
Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3 11/30/18
Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2) 12/01/18
First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3] 12/04/18
Tour II
It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP) 12/11/18
60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 Tour II 12/29/18
Tour III
Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III 12/20/18
MEMENTOS
Memento & Quiz (on SEV): Excursion 3, Tour I 12/08/18
Mementos for “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6) 12/13/18
Tour Guide Mementos From Excursion 3 Tour III: Capability and Severity: Deeper Concepts 12/26/18
Excursion 4
EXCERPTS
Tour I
Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP) 12/26/18
Tour II
Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?” 01/10/19
Tour IV
Excerpt from Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking 01/27/19
MEMENTOS
Mementos from Excursion 4: Blurbs of Tours I-IV 01/13/19
Excursion 5
Tour I
(full) Excerpt: Excursion 5 Tour I — Power: Pre-data and Post-data (from “SIST: How to Get Beyond the Stat Wars”) 04/27/19
Tour III
Deconstructing the Fisher-Neyman conflict wearing Fiducial glasses + Excerpt 5.8 from SIST 02/23/19
Excursion 6
Tour II
Excerpts: Souvenir Z: Understanding Tribal Warfare + 6.7 Farewell Keepsake from SIST + List of Souvenirs 05/04/19
*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).
We’ve reached our last Tour (of SIST)*: Pragmatic and Error Statistical Bayesians (Excursion 6), marking the end of our reading with Souvenir Z, the final Souvenir, as well as the Farewell Keepsake in 6.7. Our cruise ship Statinfasst, currently here at Thebes, will be back at dock for maintenance for our next launch at the Summer Seminar in Phil Stat (July 28-Aug 11). Although it’s not my preference that new readers begin with the Farewell Keepsake (it contains a few spoilers), I’m excerpting it together with Souvenir Z (and a list of all souvenirs A – Z) here, and invite all interested readers to peer in. There’s a check list on p. 437: If you’re in the market for a new statistical account, you’ll want to test if it satisfies the items on the list. Have fun!
Souvenir Z: Understanding Tribal Warfare
We began this tour asking: Is there an overarching philosophy that “matches contemporary attitudes”? More important is changing attitudes. Not to encourage a switch of tribes, or even a tribal truce, but something more modest and actually achievable: to understand and get beyond the tribal warfare. To understand them, at minimum, requires grasping how the goals of probabilism differ from those of probativeness. This leads to a way of changing contemporary attitudes that is bolder and more challenging. Snapshots from the error statistical lens let you see how frequentist methods supply tools for controlling and assessing how well or poorly warranted claims are. All of the links, from data generation to modeling, to statistical inference and from there to substantive research claims, fall into place within this statistical philosophy. If this is close to being a useful way to interpret a cluster of methods, then the change in contemporary attitudes is radical: it has never been explicitly unveiled. Our journey was restricted to simple examples because those are the ones fought over in decades of statistical battles. Much more work is needed. Those grappling with applied problems are best suited to develop these ideas, and see where they may lead. I never promised, when you bought your ticket for this passage, to go beyond showing that viewing statistics as severe testing will let you get beyond the statistics wars.
6.7 Farewell Keepsake
Despite the eclecticism of statistical practice, conflicting views about the roles of probability and the nature of statistical inference – holdovers from long-standing frequentist–Bayesian battles – still simmer below the surface of today’s debates. Reluctance to reopen wounds from old battles has allowed them to fester. To assume all we need is an agreement on numbers – even if they’re measuring different things – leads to statistical schizophrenia. Rival conceptions of the nature of statistical inference show up unannounced in the problems of scientific integrity, irreproducibility, and questionable research practices, and in proposed methodological reforms. If you don’t understand the assumptions behind proposed reforms, their ramifications for statistical practice remain hidden from you.
Rival standards reflect a tension between using probability (a) to constrain the probability that a method avoids erroneously interpreting data in a series of applications (performance), and (b) to assign degrees of support, confirmation, or plausibility to hypotheses (probabilism). We set sail on our journey with an informal tool for telling what’s true about statistical inference: If little if anything has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a severe test. From this minimal severe-testing requirement, we develop a statistical philosophy that goes beyond probabilism and performance. The goals of the severe tester (probativism) arise in contexts sufficiently different from those of probabilism that you are free to hold both, for distinct aims (Section 1.2). For statistical inference in science, it is severity we seek. A claim passes with severity only to the extent that it is subjected to, and passes, a test that it probably would have failed, if false. Viewing statistical inference as severe testing alters long-held conceptions of what’s required for an adequate account of statistical inference in science. In this view, a normative statistical epistemology – an account of what’s warranted to infer – must be:
• directly altered by biasing selection effects
• able to falsify claims statistically
• able to test statistical model assumptions
• able to block inferences that violate minimal severity
These overlapping and interrelated requirements are disinterred over the course of our travels. This final keepsake collects a cluster of familiar criticisms of error statistical methods. They are not intended to replace the detailed arguments, pro and con, within; here we cut to the chase, generally keeping to the language of critics. Given our conception of evidence, we retain testing language even when the statistical inference is an estimation, prediction, or proposed answer to a question. The concept of severe testing is sufficiently general to apply to any of the methods now in use. It follows that a variety of statistical methods can serve to advance the severity goal, and that they can, in principle, find their foundations in an error statistical philosophy. However, each requires supplements and reformulations to be relevant to real-world learning. Good science does not turn on adopting any formal tool, and yet the statistics wars often focus on whether to use one type of test (or estimation, or model selection) or another. Meta-researchers charged with instigating reforms do not agree, but the foundational basis for the disagreement is left unattended. It is no wonder some see the statistics wars as proxy wars between competing tribe leaders, each keen to advance one or another tool, rather than about how to do better science. Leading minds are drawn into inconsequential battles, e.g., whether to use a prespecified cut-off of 0.025 or 0.0025 – when in fact good inference is not about cut-offs altogether but about a series of small-scale steps in collecting, modeling and analyzing data that work together to find things out. Still, we need to get beyond the statistics wars in their present form. By viewing a contentious battle in terms of a difference in goals – finding highly probable versus highly well probed hypotheses – readers can see why leaders of rival tribes often talk past each other. 
To be clear, the standpoints underlying the following criticisms are open to debate; we’re far from claiming to do away with them. What should be done away with is rehearsing the same criticisms ad nauseam. Only then can we hear the voices of those calling for an honest standpoint about responsible science.
1. NHST Licenses Abuses. First, there’s the cluster of criticisms directed at an abusive NHST animal: NHSTs infer from a single P-value below an arbitrary cut-off to evidence for a research claim, and they encourage P-hacking, fishing, and other selection effects. The reply: this ignores crucial requirements set by Fisher and other founders: isolated significant results are poor evidence of a genuine effect and statistical significance doesn’t warrant substantive (e.g., causal) inferences. Moreover, selective reporting invalidates error probabilities. Some argue significance tests are un-Popperian because the higher the sample size, the easier to infer one’s research hypothesis. It’s true that with a sufficiently high sample size any discrepancy from a null hypothesis has a high probability of being detected, but statistical significance does not license inferring a research claim H. Unless H’s errors have been well probed, by merely finding a small P-value, H passes an extremely insevere test. No mountains out of molehills (Sections 4.3 and 5.1). Enlightened users of statistical tests have rejected the cookbook, dichotomous NHST, long lampooned: such criticisms are behind the times. When well-intentioned aims of replication research are linked to these retreads, it only hurts the cause. One doesn’t need a sharp dichotomy to identify rather lousy tests – a main goal for a severe tester. Granted, policy-making contexts may require cut-offs, as do behavioristic setups. But in those contexts, a test’s error probabilities measure overall error control, and are not generally used to assess well-testedness. Even there, users need not fall into the NHST traps (Section 2.5). While attention to banning terms is the least productive aspect of the statistics wars, since NHST is not used by Fisher or N-P, let’s give the caricature its due and drop the NHST acronym; “statistical tests” or “error statistical tests” will do.
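The sample-size point in the reply can be illustrated with a rough sketch. It assumes a one-sided Normal test with known σ at level 0.025 and a hypothetical discrepancy of 0.05 standard deviations; the function `power` is an illustrative helper.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def power(delta, sigma, n, alpha=0.025):
    """Power of the one-sided test T+ against a discrepancy delta from the null."""
    return 1 - Z.cdf(Z.inv_cdf(1 - alpha) - sqrt(n) * delta / sigma)

delta, sigma = 0.05, 1.0   # hypothetical: a discrepancy of 0.05 standard deviations
for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7}  POW = {power(delta, sigma, n):.3f}")
```

The power against the same tiny discrepancy climbs toward 1 as n grows, which is why a bare statistically significant result from a very large sample indicates only a small discrepancy and does not by itself warrant a substantive claim H.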
Simple significance tests are a small part of a conglomeration of error statistical methods.
To continue reading: Excerpt Souvenir Z, Farewell Keepsake & List of Souvenirs can be found here.
*We are reading Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP)
***
Where YOU are in the journey.
It’s a balmy day today on Ship StatInfasST: An invigorating wind has a salutary effect on our journey. So, for the first time I’m excerpting all of Excursion 5 Tour I (proofs) of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).
A salutary effect of power analysis is that it draws one forcibly to consider the magnitude of effects. In psychology, and especially in soft psychology, under the sway of the Fisherian scheme, there has been little consciousness of how big things are. (Cohen 1990, p. 1309)
So how would you use power to consider the magnitude of effects were you drawn forcibly to do so? In with your breakfast is an exercise to get us started on today’s shore excursion.
Suppose you are reading about a statistically significant result x (just at level α) from a one-sided test T+ of the mean of a Normal distribution with n IID samples, and known σ: H_{0}: μ ≤ 0 against H_{1}: μ > 0. Underline the correct word, from the perspective of the (error statistical) philosophy, within which power is defined.
- If the test’s power to detect μ′ is very low (i.e., POW(μ′) is low), then the statistically significant x is poor/good evidence that μ > μ′.
- Were POW(μ′) reasonably high, the inference to μ > μ′ is reasonably/poorly warranted.
We’ve covered this reasoning in earlier travels (e.g., Section 4.3), but I want to launch our new tour from the power perspective. Assume the statistical test has passed an audit (for selection effects and underlying statistical assumptions) – you can’t begin to analyze the logic if the premises are violated.
During our three tours on Power Peninsula, a partially uncharted territory, we’ll be residing at local inns, not returning to the ship, so pack for overnights. We’ll visit its museum, but mostly meet with different tribal members who talk about power – often critically. Power is one of the most abused notions in all of statistics, yet it’s a favorite for those of us who care about magnitudes of discrepancies. Power is always defined in terms of a fixed cut-off, c_{α}, computed under a value of the parameter under test; since these vary, there is really a power function. If someone speaks of the power of a test tout court, you cannot make sense of it, without qualification. First defined in Section 3.1, the power of a test against μ′ is the probability it would lead to rejecting H_{0} when μ = μ′:
POW(T, μ′) = Pr(d(X) ≥ c_{α}; μ = μ′), or Pr(test T rejects H_{0}; μ = μ′).
If it’s clear what the test is, we just write POW(μ′). Power measures the capability of a test to detect μ′ – where the detection is in the form of producing a d ≥ c_{α}. While power is computed at a point μ = μ′, we employ it to appraise claims of form μ > μ′ or μ < μ′.
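Because power is a function of the alternative μ′, it can help to tabulate it. The following is a minimal sketch for test T+ with H_{0}: μ ≤ 0, taking d(X) = √n(X̄ − μ_{0})/σ and c_{α} = z_{α}; the values σ = 1, n = 25, and α = 0.025 are assumptions chosen for illustration, and `pow_T_plus` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def pow_T_plus(mu_prime, mu0=0.0, sigma=1.0, n=25, alpha=0.025):
    """POW(T+, mu') = Pr(d(X) >= c_alpha; mu = mu')."""
    c_alpha = Z.inv_cdf(1 - alpha)                 # fixed cut-off for d(X)
    shift = sqrt(n) * (mu_prime - mu0) / sigma     # mean of d(X) when mu = mu'
    return 1 - Z.cdf(c_alpha - shift)

for mu_prime in (0.0, 0.2, 0.4, 0.6, 0.8):
    print(f"mu' = {mu_prime:.1f}  POW = {pow_T_plus(mu_prime):.3f}")
```

At μ′ = μ_{0} the power equals α, and it rises with μ′, so quoting “the power” without naming the alternative leaves the number undefined.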
Power is an ingredient in N-P tests, but even practitioners who declare they never set foot into N-P territory, but live only in the land of Fisherian significance tests, invoke power. This is all to the good, and they shouldn’t fear that they are dabbling in an inconsistent hybrid.
Jacob Cohen’s (1988) Statistical Power Analysis for the Behavioral Sciences is displayed at the Power Museum’s permanent exhibition. Oddly, he makes some slips in the book’s opening. On page 1 Cohen says: “The power of a statistical test is the probability it will yield statistically significant results.” Also faulty is what he says on page 4: “The power of a statistical test of a null hypothesis is the probability that it will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the phenomenon exists.” Cohen means to add “computed under an alternative hypothesis,” else the definitions are wrong. These snafus do not take away from Cohen’s important tome on power analysis, yet I can’t help wondering if these initial definitions play a bit of a role in the tendency to define power as ‘the probability of a correct rejection,’ which slips into erroneously viewing it as a posterior probability (unless qualified).
Although keeping to the fixed cut-off c_{α} is too coarse for the severe tester’s tastes, it is important to keep to the given definition for understanding the statistical battles. We’ve already had sneak previews of “achieved sensitivity” or “attained power” [Π(γ) = Pr(d(X) ≥ d(x_{0}); μ_{0} + γ)] by which members of Fisherian tribes are able to reason about discrepancies (Section 3.3). N-P accorded three roles to power: the first two are pre-data, for planning and comparing tests; the third is for interpretation post-data. It’s the third that they don’t announce very loudly, whereas that will be our main emphasis. Have a look at this museum label referring to a semi-famous passage by E. Pearson. Barnard (1950, p. 207) has just suggested that error probabilities of tests, like power, while fine for pre-data planning, should be replaced by other measures (likelihoods perhaps?) after the trial. What did Egon say in reply to George?
[I]f the planning is based on the consequences that will result from following a rule of statistical procedure, e.g., is based on a study of the power function of a test and then, having obtained our results, we do not follow the first rule but another, based on likelihoods, what is the meaning of the planning? (Pearson 1950, p. 228)
This is an interesting and, dare I say, powerful reply, but it doesn’t quite answer George. By all means apply the rule you planned to, but there’s still a legitimate question as to the relationship between the pre-data capability or performance measure, and post-data inference. The severe tester offers a view of this intimate relationship. In Tour II we’ll be looking at interactive exhibits far outside the museum, including N-P post-data power analysis, retrospective power, and a notion I call shpower. Employing our understanding of power, scrutinizing a popular reinterpretation of tests as diagnostic tools will be straightforward. In Tour III we go a few levels deeper in disinterring the N-P vs. Fisher feuds. I suspect there is a correlation between those who took Fisher’s side in the early disputes with Neyman and those leery of power. Oscar Kempthorne being interviewed by J. Leroy Folks (1995) said:
Well, a common thing said about [Fisher] was that he did not accept the idea of the power. But, of course, he must have. However, because Neyman had made such a point about power, Fisher couldn’t bring himself to acknowledge it (p. 331).
However, since Fisherian tribe members have no problem with corresponding uses of sensitivity, P-value distributions, or CIs, they can come along on a severity analysis. There’s more than one way to skin a cat, if one understands the relevant statistical principles. The issues surrounding power are subtle, and unraveling them will require great care, so bear with me. I will give you a money-back guarantee that by the end of the excursion you’ll have a whole new view of power. Did I mention you’ll have a chance to power the ship into port on this tour? Only kidding, however, you will get to show your stuff in a Cruise Severity Drill (Section 5.2).
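As a rough illustration of the “attained power” Π(γ) = Pr(d(X) ≥ d(x_{0}); μ_{0} + γ) mentioned at the start of this excerpt, here is a minimal sketch. It assumes the one-sided Normal test T+ with known σ and d(X) = (M – μ_{0})/(σ/√n); the function name and all numbers are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def attained_power(m_obs, mu0, gamma, sigma, n):
    """Pr(d(X) >= d(x0); mu = mu0 + gamma) for test T+ with known sigma."""
    se = sigma / sqrt(n)
    d_obs = (m_obs - mu0) / se   # observed test statistic d(x0)
    shift = gamma / se           # mean of d(X) when mu = mu0 + gamma
    return 1 - Z.cdf(d_obs - shift)

# Hypothetical values, for illustration only.
mu0, sigma, n, m_obs = 0.0, 1.0, 25, 0.4
for gamma in (0.0, 0.2, 0.4, 0.6):
    print(gamma, round(attained_power(m_obs, mu0, gamma, sigma, n), 3))
```

With γ = 0 the quantity is just the P-value; as γ grows, it reports how probable so large a d(X) would be were the discrepancy from μ_{0} as large as γ.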
To continue reading Excursion 5 Tour I, go here.
__________
This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).
Earlier excerpts and mementos from SIST up to Dec 31, 2018 are here.
Jan 10, 2019 Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?” is here,
Jan 27, Excerpt from Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking here,
Feb 23, Deconstructing the Fisher-Neyman conflict wearing fiducial glasses + Excerpt 5.8 from SIST here,
April 4, Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt here.
Jan 13, 2019 Mementos from SIST (Excursion 4) are here. These are summaries of all 4 tours.
March 5, 2019 Blurbs of all 16 Tours can be found here.
Where YOU are in the journey.
]]>Neyman, confronted with unfortunate news would always say “too bad!” At the end of Jerzy Neyman’s birthday week, I cannot help imagining him saying “too bad!” as regards some twists and turns in the statistics wars. First, too bad Neyman-Pearson (N-P) tests aren’t in the ASA Statement (2016) on P-values: “To keep the statement reasonably simple, we did not address alternative hypotheses, error types, or power”. An especially aggrieved “too bad!” would be earned by the fact that those in love with confidence interval estimators don’t appreciate that Neyman developed them (in 1930) as a method with a precise interrelationship with N-P tests. So if you love CI estimators, then you love N-P tests!
Consider a typical N-P test of the mean of a Normal distribution T+: H_{0}: µ ≤ µ_{0} vs H_{1}: µ > µ_{0}.
Imagine σ is known, since nothing of interest to the logic changes if it is estimated as is more typical. Notice the null hypothesis is composite, it is not a point, and the alternative is explicit (you can’t jump from a small P-value to some theory that would “explain ” it).[i]
The (1 – α) confidence interval (CI) corresponding to test T+ is that µ > the (1 – α) lower bound:
µ > M – c_{α}(σ/√n).
M is the sample mean, and this is the generic lower confidence bound. Replacing M with the observed sample mean M_{0} yields the particular CI lower bound.
Why does µ > M – c_{α}(σ/√n) correspond to the above test T+? Why is it an inversion or dual to the test?
Consider, said Neyman, that the values of µ that exceed M_{0} – c_{α}(σ/√n) are values of µ that could not be rejected at level α with sample mean M_{0}. Equivalently, these are values of the parameter µ that M_{0} is not statistically significantly greater than at a P-value of α. Yes, CIs correspond to Neyman-Pearson tests and were developed by Neyman in 1930, a bit after Fisher’s Fiducial intervals. Yes, those doing CIs (the so-called “new” statistics) are doing Neyman-Pearson tests, only inverted. Neyman didn’t care if you called them hypothesis tests or significance tests (as we saw in my last post). [ii]
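The duality just described can be checked numerically. The sketch below assumes the Normal test T+ with known σ; the helper names and the numbers are hypothetical. It computes the (1 – α) lower bound and confirms that the P-value for testing that bound equals α, while values above the bound are not rejected and values below it are.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def p_value(m_obs, mu, sigma, n):
    """One-sided P-value for testing 'mean <= mu' against 'mean > mu'."""
    return 1 - Z.cdf((m_obs - mu) / (sigma / sqrt(n)))

def lower_bound(m_obs, sigma, n, alpha):
    """(1 - alpha) lower confidence bound M0 - c_alpha * sigma / sqrt(n)."""
    c_alpha = Z.inv_cdf(1 - alpha)
    return m_obs - c_alpha * sigma / sqrt(n)

# Hypothetical values, for illustration only.
m_obs, sigma, n, alpha = 0.4, 1.0, 25, 0.025
mu_low = lower_bound(m_obs, sigma, n, alpha)

print(round(p_value(m_obs, mu_low, sigma, n), 4))      # P-value at the bound itself
print(p_value(m_obs, mu_low + 0.1, sigma, n) > alpha)  # a value above the bound
print(p_value(m_obs, mu_low - 0.1, sigma, n) < alpha)  # a value below the bound
```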
Thanks to the duality between tests and confidence intervals, you could give the information provided by a confidence interval at any level in terms of the corresponding test. For a two-sided, 95% confidence interval [µ_{L}, µ_{U}]:
µ_{L} is the (parameter) value that the sample mean is just statistically significantly greater than at the P = .025 level.
µ_{U} is the (parameter) value that the sample mean is just statistically significantly lower than at the P = .025 level.
That means it is wrong to say you cannot ascertain anything about the population effect size using P-value computations. You can. It’s not the only way. You can also use P-value functions (Fraser, Cox), power, and severity, but they are all interrelated.
You ask: Please tell me the value of µ that the sample mean M_{0} is just statistically significantly greater than, at the P = .025 level? The answer is the lower confidence bound µ_{L}.
If the tester is able to determine the P-value corresponding to a specific value of µ you wanted to test, then she is also able to use the observed M_{0} to compute the value µ_{L}.
Likewise for finding µ_{U} . All the information is there.
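To make “all the information is there” concrete, here is a sketch that recovers µ_{L} and µ_{U} using nothing but P-value computations: it searches for the parameter values at which the observed mean is just significant at P = .025 and compares them with the usual closed-form 95% interval. The Normal model with known σ, the bisection helper, and the numbers are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def p_greater(m_obs, mu, se):
    """P-value for 'sample mean is significantly greater than mu'."""
    return 1 - Z.cdf((m_obs - mu) / se)

def p_lower(m_obs, mu, se):
    """P-value for 'sample mean is significantly lower than mu'."""
    return Z.cdf((m_obs - mu) / se)

def bisect(g, lo, hi, tol=1e-10):
    """Root of g on [lo, hi], assuming g(lo) and g(hi) have opposite signs."""
    g_lo_positive = g(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (g(mid) > 0) == g_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical values, for illustration only.
m_obs, sigma, n = 0.4, 1.0, 25
se = sigma / sqrt(n)

mu_L = bisect(lambda mu: p_greater(m_obs, mu, se) - 0.025, m_obs - 10 * se, m_obs)
mu_U = bisect(lambda mu: p_lower(m_obs, mu, se) - 0.025, m_obs, m_obs + 10 * se)

z = Z.inv_cdf(0.975)
print(round(mu_L, 4), round(m_obs - z * se, 4))  # search result vs closed form
print(round(mu_U, 4), round(m_obs + z * se, 4))
```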
But choosing a single confidence level is quite inadequate. Yet that is still what members of today’s “new” CI tribe do–generally .95. They get very upset at your dichotomizing P ≤ 0.05 and P > 0.05, but happily dichotomize µ is in or out of the CI formed.
The severe tester always infers a discrepancy that is well indicated (if any) but also at least one that is poorly indicated. In relation to test T+, the inference µ > M_{0}, where M_{0} is the observed mean, is a good benchmark for a terrible inference! It corresponds to a lower confidence bound at level 0.5! And yet, critics of significance tests (at least, from outside the error statistical family) often advocate inferring
µ ≥ M_{0}
as either comparatively more likely or probable than the null or test hypothesis. For detailed examples, see SIST Excursion 4 Tour II Rejection Fallacies: Who’s Exaggerating What?
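Why level 0.5? A short worked step, writing Φ for the standard Normal distribution function (notation assumed here, not used in the post):

```latex
\[
\mu > M_0 - c_\alpha \frac{\sigma}{\sqrt{n}}, \qquad c_\alpha = \Phi^{-1}(1-\alpha).
\]
Setting the bound equal to $M_0$ requires $c_\alpha = 0$, so
\[
1 - \alpha = \Phi(c_\alpha) = \Phi(0) = 0.5 .
\]
```

So the lower bound M_{0} is what one gets from a confidence level of only 0.5.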
So why are members of the Confidence Interval tribe going around misrepresenting hypothesis tests as if they must take the form of Fisherian “simple” significance tests with a point null (nil) hypothesis, usually of 0? (N-P tests were purposely designed to improve upon Fisher’s tests, and it’s that improvement that gives you CIs.) And why do they say what’s inferred with a CI cannot be ascertained with N-P tests? Are they unaware they’re using N-P tests? Or is the simple Fisherian test (no explicit alternative, no consideration of power) just much easier to criticize? If they’re cousins or brothers, why the family feud? Sibling rivalry? Why be a Unitarian? Most testers would supply a P-value as well as a CI. The severe tester combines the two, so that discrepancies are directly reported from test results. For another reason, see [iii].
Critics of tests from outside the family, will also take the simple “nil” point null vs a two-sided alternative as their foil, and demonstrate that the p-value ≠ either their Bayes Factor or posterior probability. It serves as a convenient straw test to knock down. If they kept the comparison to one-sided tests, they would not disagree (at least not with any sensible prior). See SIST Excursion 4 Tour II Rejection Fallacies: Who’s Exaggerating What? This is shown by Casella and R. Berger (1987) and the reconciliation is agreed to by Berger and Sellke (1987).
I’m not saying the simple significance test doesn’t have uses; it’s vital for testing assumptions of statistical models. That’s why Bayesians who want to check their models can be found sneaking P-value goodies from the tests that many of them profess to dislike. If a small P-value indicates a discrepancy from the null there, it does so in other uses too. [iv]
Note too the connection between confidence intervals and severity: Taking a sample mean M that is just statistically significant at level α (M_{α}) as warranting µ > µ_{0} with severity 1 – α is the same as inferring µ > M_{0} – c_{α}(σ/√n) at confidence level 1 – α. However, severity improves on CIs by breaking out of the single confidence level, providing an inferential justification (rather than merely a long-run coverage rationale), and avoiding a number of fallacies and paradoxes of ordinary CIs. For a post on CIs and severity see here. Also see: Do CIs Avoid Fallacies of Tests? Reforming the Reformers. For a full discussion, see SIST.
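A minimal sketch of this connection, assuming test T+ with known σ and taking the severity for the claim µ > µ_{1} to be Pr(M ≤ M_{0}; µ = µ_{1}), which is how SIST defines it for this test. The function name and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def severity_greater(m_obs, mu1, sigma, n):
    """Severity for inferring mu > mu1 in test T+: Pr(M <= m_obs; mu = mu1)."""
    return Z.cdf((m_obs - mu1) / (sigma / sqrt(n)))

# Hypothetical values, for illustration only.
mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025
se = sigma / sqrt(n)
m_alpha = mu0 + Z.inv_cdf(1 - alpha) * se  # a mean just significant at level alpha

# Report severity for a range of claims mu > mu1, not a single level.
for mu1 in (mu0, mu0 + 0.5 * se, mu0 + se, m_alpha):
    print(round(mu1, 3), round(severity_greater(m_alpha, mu1, sigma, n), 3))
```

The first row reproduces the 1 – α confidence level for µ > µ_{0}, and the last row shows the 0.5 benchmark for µ > M_{0}; the rows in between are what reporting a single confidence level leaves out.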
[i] The null and alternative would be treated symmetrically. You are to choose the null, or more properly, what Neyman called the test hypothesis, according to which error was more serious. A lot of the agony that has people up in arms regarding the fallacy of taking non-significant results as evidence for a (point) null is immediately scotched by letting the test hypothesis be “an effect exists” (or an effect of a given magnitude is present). For example, T-: H_{0}: µ ≥ µ_{0} vs H_{1}: µ < µ_{0}.
A two-sided test, if wanted, may be seen as doing two one-sided tests (Cox and Hinkley 1974).
[ii] Note the equivalences:
µ < M – c_{α}(σ/√n) iff M > µ + c_{α}(σ/√n)
So µ < CI lower at confidence level 1 – α iff M reaches statistical significance at P = α in test T+. Since it’s continuous we could use ≤ or <.
Iff = if and only if.
[iii] Some prefer CIs to corresponding tests because it’s easier to slide the confidence level onto the interval estimate, viewing it as affording a probability assignment to the interval itself. This of course is, strictly, a fallacy, unless one just stipulates: I assign “probability” .95, say, to the result of applying a method if that method has .95 “coverage probability”. This is/was the Fiducial dream. But one cannot do probability computations with these assignments. For the severe tester’s evidential interpretation of CIs, please see SIST, Excursion 3 Tour III.
[iv] Moving from a discrepancy (from a model assumption) to a particular rival model invites the same risks as when explaining other small P-values by invoking a rival insofar as the null and the rival model do not exhaust the possibilities.
SIST= Mayo, D (2018), Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, CUP.
]]>
I’ll continue to post Neyman-related items this week in honor of his birthday. This isn’t the only paper in which Neyman makes it clear he denies a distinction between a test of statistical hypotheses and significance tests. He and E. Pearson also discredit the myth that the former is only allowed to report pre-data, fixed error probabilities, and is justified only by dint of long-run error control. Controlling the “frequency of misdirected activities” in the midst of finding something out, or solving a problem of inquiry, on the other hand, are epistemological goals. What do you think?
“Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena”
by Jerzy Neyman
ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.
I recommend, especially, the example on home ownership. Here are two snippets:
1. INTRODUCTION
The title of the present session involves an element that appears mysterious to me. This element is the apparent distinction between tests of statistical hypotheses, on the one hand, and tests of significance, on the other. If this is not a lapse of someone’s pen, then I hope to learn the conceptual distinction. Particularly with reference to applied statistical work in a variety of domains of Science, my own thoughts of tests of significance, or EQUIVALENTLY of tests of statistical hypotheses, are that they are tools to reduce the frequency of errors….
(iv) A similar remark applies to the use of the words “decision” or “conclusion”. It seems to me that at our discussion, these particular words were used to designate only something like a final outcome of complicated analysis involving several tests of different hypotheses. In my own way of speaking, I do not hesitate to use the words “decision” or “conclusion” every time they come handy. For example, in the analysis of the follow-up data for the [home ownership] experiment, Mark Eudey and I started by considering the importance of bias in forming the experimental and control groups of families. As a result of the tests we applied, we decided to act on the assumption (or concluded) that the two groups are not random samples from the same population. Acting on this assumption (or having reached this conclusion), we sought for ways to analyze that data other than by comparing the experimental and the control groups. The analyses we performed led us to “conclude” or “decide” that the hypotheses tested could be rejected without excessive risk of error. In other words, after considering the probability of error (that is, after considering how frequently we would be in error if in conditions of our data we rejected the hypotheses tested), we decided to act on the assumption that “high” scores on “potential” and on “education” are indicative of better chances of success in the drive to home ownership. (750-1; the emphasis is Neyman’s)
To read the full (short) paper: Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.
Following Neyman, I’ve “decided” to use the terms ‘tests of hypotheses’ and ‘tests of significance’ interchangeably in my book.[1] Now it’s true that Neyman was more behavioristic than Pearson, and it’s also true that tests of statistical hypotheses or tests of significance need an explicit reformulation and statistical philosophy to explicate the role of error probabilities in inference. My way of providing this has been in terms of severe tests. However, in Neyman-Pearson applications, more than in their theory, you can find many examples as well. Recall Neyman’s paper, “The Problem of Inductive Inference” (Neyman 1955) wherein Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:
I am concerned with the term “degree of confirmation” introduced by Carnap. …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true]. The question is: does this result “confirm” the hypothesis that H_{0} is true of the particular data set? (Neyman, pp 40-41).
Neyman continues:
The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc. If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H_{0}, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present]. Therefore, the failure of the test to reject H_{0} cannot be reasonably considered as anything like a confirmation of H_{0}. The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)
The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.
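Neyman’s point can be illustrated by computing the power of the test against a discrepancy of interest. This is only a sketch: it assumes a one-sided Normal test with known σ (Neyman’s example involved a different, locally best test), and the discrepancy, sample sizes and α are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def power(mu0, mu1, sigma, n, alpha):
    """Pr(test T+ rejects H0: mu <= mu0 at level alpha; mu = mu1)."""
    se = sigma / sqrt(n)
    c_alpha = Z.inv_cdf(1 - alpha)
    return 1 - Z.cdf(c_alpha - (mu1 - mu0) / se)

# Hypothetical discrepancy of interest: mu1 - mu0 = 0.2, with sigma = 1.
for n in (10, 100, 400):
    print(n, round(power(0.0, 0.2, 1.0, n, 0.05), 3))
```

With the smallest sample the chance of detecting the discrepancy is slim, so a failure to reject says little; with the largest, the power exceeds 0.95 and a non-rejection becomes informative about that discrepancy.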
I’m adding another paper of Neyman’s that echoes these same sentiments on the use of power, post data, to evaluate what is “confirmed”: ‘The Use of the Concept of Power in Agricultural Experimentation’.
Neyman, like Peirce, Popper and many others, holds that the only “logic” is deductive logic. “Confirmation” for Neyman is akin to Popperian “corroboration”–you could corroborate a hypothesis H only to the extent that it passed a severe test–one with a high probability of having found flaws in H, if they existed. Of course, Neyman puts this in terms of having high power to reject H, if H is false, and high probability of finding no evidence against H if true, but it’s the same idea. But the use of power post-data is to interpret the discrepancies warranted in the given test. (This third use of power is also in Neyman 1956, responding to Fisher, the Triad.) Unlike Popper, however, Neyman actually provides a methodology that can be shown to accomplish the task reliably.
Still, Fisher was correct to claim that Neyman is merely recording his preferred way of speaking. One could choose a different way. For example, Peirce defined induction as passing a severe test, and Popper said you could define it that way if you wanted to. But the main thing is that Neyman is attempting to distinguish the “inductive” or “evidence transcending” conclusions that statistics affords, on his approach,[2] from assigning to hypotheses degrees of belief, probability, support, plausibility or the like.
De Finetti gets it right when he says that the expression “inductive behavior…that was for Neyman simply a slogan underlining and explaining the difference between his own, the Bayesian and the Fisherian formulations” became, with Wald’s work, “something much more substantial” (de Finetti 1972, p.176). De Finetti called this “the involuntarily destructive aspect of Wald’s work” (ibid.).
Related papers on tests:
[1] That really is a decision, though it’s based on evidence that doing so is in sync with what both Neyman and Pearson thought. There are plenty of times, by the way, where Fisher is more behavioristic and less evidential than is Neyman, and certainly less than E. Pearson. I think this “he said/she said” route to understanding statistical methods is a huge mistake. I keep saying, “It’s the methods, stupid!” This is now the title of Excursion 3 Tour II of my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).
[2] And, Neyman rightly assumed at first, from Fisher’s approach. Fisher’s loud rants, later on, that Neyman turned his tests into crude acceptance sampling affairs akin to Russian 5 year-plans, and money-making goals of U.S. commercialism, all occurred after the break in 1935 which registered a conflict of egos, not statistical philosophies. Look up “anger management” on this blog.
Fisher is the arch anti-Bayesian; whereas, Neyman experimented with using priors at the start. The problem wasn’t so much viewing parameters as random variables, but lacking knowledge of what their frequentist distributions could possibly be. Thus he sought methods whose validity held up regardless of priors. Here E. Pearson was closer to Fisher, but unlike the two others, he was a really nice guy. (I hope everyone knows I’m talking of Egon here, not his mean daddy.) See chapter 11 of EGEK (1996):
[3] Who drew the picture of Neyman above? Anyone know?
References
de Finetti, B. 1972. Probability, Induction and Statistics: The Art of Guessing. Wiley.
Neyman, J. 1957. “The Use of the Concept of Power in Agricultural Experimentation”, Journal of the Indian Society of Agricultural Statistics, 9(1): 9–17.
Neyman, J. 1976. “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.” Commun. Statist. Theor. Meth. A5(8), 737-751.
Reader: This and other Neyman blogposts have been incorporated into my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). Several excerpts can be found on this blog. Look up excerpts and mementos.
]]>
We celebrated Jerzy Neyman’s Birthday (April 16, 1894) last night in our seminar: here’s a pic of the cake. My entry today is a brief excerpt and a link to a paper of his that we haven’t discussed much on this blog: Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making’ [i]. It’s chock full of ideas and arguments, but the one that interests me at the moment is Neyman’s conception of “his breakthrough”, in relation to a certain concept of “inference”. “In the present paper” he tells us, “the term ‘inferential theory’…will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation …either a substitute a priori distribution [exemplified by the so called principle of insufficient reason] or a new measure of uncertainty” such as Fisher’s fiducial probability. So if you hear Neyman rejecting “inferential accounts” you have to understand it in this very specific way: he’s rejecting “new measures of confidence or diffidence”. Here he alludes to them as “easy ways out”. Now Neyman always distinguishes his error statistical performance conception from Bayesian and Fiducial probabilisms [ii]. The surprising twist here is semantical and the culprit is none other than…Allan Birnbaum. Yet Birnbaum gets short shrift, and no mention is made of our favorite “breakthrough” (or did I miss it?).
Note: In this article, “attacks” on various statistical “fronts” refers to ways of attacking problems in one or another statistical research program.
HAPPY BIRTHDAY WEEK FOR NEYMAN!
What doesn’t Neyman like about Birnbaum’s advocacy of a Principle of Sufficiency S (p. 25)? He doesn’t like that it is advanced as a normative principle (e.g., about when evidence is or ought to be deemed equivalent) rather than a criterion that does something for you, such as control errors. (Presumably it is relevant to a type of context, say parametric inference within a model.) S is put forward as a kind of principle of rationality, rather than one with a rationale in solving some statistical problem.
“The principle of sufficiency (S): If E is specified experiment, with outcomes x; if t = t (x) is any sufficient statistic; and if E’ is the experiment, derived from E, in which any outcome x of E is represented only by the corresponding value t = t (x) of the sufficient statistic; then for each x, Ev (E, x) = Ev (E’, t) where t = t (x)… (S) may be described informally as asserting the ‘irrelevance of observations independent of a sufficient statistic’.”
Ev(E, x) is a metalogical symbol referring to the evidence from experiment E with result x. The very idea that there is such a thing as an evidence function is never explained, but to Birnbaum “inferential theory” required such things. (At least that’s how he started out.) The view is very philosophical and it inherits much from logical positivism and logics of induction. The principle S, and also other principles of Birnbaum, have a normative character: Birnbaum considers them “compellingly appropriate”.
“The principles of Birnbaum appear as a kind of substitutes for known theorems” Neyman says. For example, various authors proved theorems to the general effect that the use of sufficient statistics will minimize the frequency of errors. But if you just start with the rationale (minimizing the frequency of errors, say) you wouldn’t need these “principles” from on high as it were. That’s what Neyman seems to be saying in his criticism of them in this paper. Do you agree? He has the same gripe concerning Cornfield’s conception of a default-type Bayesian account akin to Jeffreys. Why?
[i] I thank @omaclaran for reminding me of this paper on twitter in 2018.
[ii] Or so I argue in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, 2018, CUP.
[iii] Do you think Neyman is using “breakthrough” here in reference to Savage’s description of Birnbaum’s “proof” of the (strong) Likelihood Principle? Or is it the other way round? Or neither? Please weigh in.
REFERENCES
Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making’, Revue De l’Institut International De Statistique / Review of the International Statistical Institute, 30(1), 11-27.
]]>My second Jerzy Neyman item, in honor of his birthday, is a little play that I wrote for Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018):
A local acting group is putting on a short theater production based on a screenplay I wrote: “Les Miserables Citations” (“Those Miserable Quotes”) [1]. The “miserable” citations are those everyone loves to cite, from their early joint 1933 paper:
We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.
But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. (Neyman and Pearson 1933, pp. 290-1).
In this early paper, Neyman and Pearson were still groping toward the basic concepts of tests–for example, “power” had yet to be coined. Taken out of context, these quotes have led to knee-jerk (behavioristic) interpretations which neither Neyman nor Pearson would have accepted. What was the real context of those passages? Well, the paper opens, just five paragraphs earlier, with a discussion of a debate between two French probabilists—Joseph Bertrand, author of “Calculus of Probabilities” (1907), and Emile Borel, author of “Le Hasard” (1914)! According to Neyman, what served “as an inspiration to Egon S. Pearson and myself in our effort to build a frequentist theory of testing hypotheses”(1977, p. 103) initially grew out of remarks of Borel, whose lectures Neyman had attended in Paris. He returns to the Bertrand-Borel debate in four different papers, and circles back to it often in his talks with his biographer, Constance Reid. His student Erich Lehmann (1993), regarded as the authority on Neyman, wrote an entire paper on the topic: “The Bertrand-Borel Debate and the Origins of the Neyman Pearson Theory”.
Since it’s Saturday night, let’s listen in on this one act play, just about to begin at the Elba Dinner Theater. Don’t worry, food and drink are allowed to be taken in. (I’ve also included, in the References, several links to papers for your weekend reading enjoyment!) There go les trois coups–the curtain’s about to open!
The curtain opens with a young Neyman and Pearson (from 1933) standing mid-stage, lit by a spotlight. (Neyman does the talking, since it’s his birthday).
Neyman: “Bertrand put into statistical form a variety of hypotheses, as for example the hypothesis that a given group of stars…form a ‘system.’ His method of attack, which is that in common use, consisted essentially in calculating the probability, P, that a certain character, x, of the observed facts would arise if the hypothesis tested were true. If P were very small, this would generally be considered as an indication that…H was probably false, and vice versa. Bertrand expressed the pessimistic view that no test of this kind could give reliable results.”
The stage fades to black, then a spotlight shines on Bertrand, stage right.
Bertrand: “How can we decide on the unusual results that chance is incapable of producing?…The Pleiades appear closer to each other than one would naturally expect…In order to make the vague idea of closeness more precise, should we look for the smallest circle that contains the group? the largest of the angular distances? the sum of squares of all the distances?…Each of these quantities is smaller for the group of the Pleiades than seems plausible. Which of them should provide the measure of implausibility? …”
[He turns to the audience, shaking his head.]
The stage fades to black, then a spotlight appears on Borel, stage left.
Borel: “The particular form that problems of causes often take…is the following: Is such and such a result due to chance or does it have a cause? It has often been observed how much this statement lacks in precision. Bertrand has strongly emphasized this point. But …to refuse to answer under the pretext that the answer cannot be absolutely precise, is to… misunderstand the essential nature of the application of mathematics.” (ibid. p. 964) Bertrand considers the Pleiades. ‘If one has observed a [precise angle between the stars]…in tenths of seconds…one would not think of asking to know the probability [of observing exactly this observed angle under chance] because one would never have asked that precise question before having measured the angle’… (ibid.)
Here is what one can say on this subject: One should carefully guard against the tendency to consider as striking an event that one has not specified beforehand, because the number of such events that may appear striking, from different points of view, is very substantial” (ibid. p. 964).
The stage fades to black, then a spotlight beams on Neyman and Pearson mid-stage. (Neyman does the talking)
Neyman: “We appear to find disagreement here, but are inclined to think that…the two writers [Bertrand and Borel] are not really considering precisely the same problem. In general terms the problem is this: Is it possible that there are any efficient tests of hypotheses based upon the theory of probability, and if so, what is their nature. …What is the precise meaning of the words ‘an efficient test of a hypothesis’?” (1933, p. 140/290)
Fade to black, spot on narrator mid-stage:
Narrator: We all know our famous (miserable) lines are about to come. But let’s linger on the “as far as a particular hypothesis is concerned” portion. For any particular case, one may identify a data dependent feature x that would be highly improbable “under the particular hypothesis of chance”. We must “carefully guard,” Borel warns, “against the tendency to consider as striking an event that one has not specified beforehand”. But if you are required to set the test’s capabilities ahead of time then you need to specify the type of falsity of H_{0}, the distance measure or test statistic beforehand. An efficient test should capture Fisher’s concern with tests sensitive to departures of interest. Listen to Neyman over 40 years later, reflecting on the relevance of Borel’s position in 1977.
Fade to black. Spotlight on an older Neyman, stage right.
Neyman: “The question (what is an efficient test of a statistical hypothesis) is about an intelligible methodology for deciding whether the observed difference…contradicts the stochastic model….”
Fade to black. Spotlight on an older Egon Pearson writing a letter to Neyman about the preprint Neyman sent of his 1977 paper. (The letter is unpublished, but I cite Lehmann 1993).
Pearson: “I remember that you produced this quotation [from Borel] when we began to get our [1933] paper into shape… The above stages [wherein he had been asking ‘Why use that particular test statistic?’] led up to Borel’s requirement of finding…a criterion which was ‘a function of the observations ‘en quelque sorte remarquable’. Now my point is that you and I (perhaps my first leading) had ourselves reached the Borel requirement independently of Borel, because we were serious humane thinkers; Borel’s expression neatly capped our own.”
Fade to black. End Play
Egon has the habit of leaving the most tantalizing claims unpacked, and this is no exception: What exactly is the Borel requirement already reached due to their being “serious humane thinkers”? I can well imagine growing this one act play into something like Michael Frayn’s expressionist play Copenhagen, wherein a variety of alternative interpretations are entertained based on subsequent work and remarks. I don’t say that it would enjoy a long life on Broadway, but a small handful of us would relish it.
As with my previous attempts at “statistical theatre of the absurd” (e.g., “Stat on a hot-tin roof”), there would be no attempt at all to popularize; only published quotes and closely remembered conversations would be included.
Deconstructions on the Meaning of the Play by Theater Critics
It’s not hard to see that “as far as a particular” star grouping is concerned, we cannot expect a reliable inference to just any non-chance effect discovered in the data. The more specific the feature is to these particular observations, the more improbable. What’s the probability of 3 hurricanes followed by 2 plane crashes (as occurred last month, say)? Harold Jeffreys put it this way: any sample is improbable in some respect. To cope with this fact, statistical method does one of two things: it appeals to prior probabilities of a hypothesis or to error probabilities of a procedure. The former can check our tendency to find a more likely explanation H’ than chance by an appropriately low prior weight to H’. What does the latter approach do? It says we need to consider the problem as of a general type. It’s a general rule, from a test statistic to some assertion about alternative hypotheses, expressing the non-chance effect. Such assertions may be in error, but we can control such erroneous interpretations. We deliberately move away from the particularity of the case at hand, to the general type of mistake that could be made.
Isn’t this taken care of by Fisher’s requirement that Pr(P < p_{0}; Ho) = p_{0}, that is, that the test rarely rejects the null if true? It may be, in practice, Neyman and Pearson thought, but only with certain conditions that were not explicitly codified by Fisher’s simple significance tests. With just the null hypothesis, it is unwarranted to take low P-values as evidence for a specific “cause” or non-chance explanation. Many could be erected post data, but the ways these could be in error would not have been probed. Fisher (1947, p. 182) is well aware that “the same data may contradict the hypothesis in any of a number of different ways,” and that different corresponding tests would be used.
The notion that different tests of significance are appropriate to test different features of the same null hypothesis presents no difficulty to workers engaged in practical experimentation. [T]he experimenter is aware of what observational discrepancy it is which interests him, and which he thinks may be statistically significant, before he inquires what test of significance, if any, is available appropriate to his needs (ibid., p. 185).
Even if “an experienced experimenter” knows the appropriate test, this doesn’t lessen the importance of NP’s interest in seeking to identify a statistical rationale for the choices made on informal grounds. In today’s world, if not in Fisher’s day, there’s legitimate concern about selecting the alternative that gives the more impressive P-value.
Here’s Egon Pearson writing with Chandra Sekar: In testing if a sample has been drawn from a single normal population, “it is not possible to devise an efficient test if we only bring into the picture this single normal probability distribution with its two unknown parameters. We must also ask how sensitive the test is in detecting failure of the data to comply with the hypotheses tested, and to deal with this question effectively we must be able to specify the directions in which the hypothesis may fail” (p. 121). “It is sometimes held that the criterion for a test can be selected after the data, but it will be hard to be unprejudiced at this point” (Pearson & Chandra Sekar, 1936, p. 129).
To base the choice of the test of a statistical hypothesis upon an inspection of the observations is a dangerous practice; a study of the configuration of a sample is almost certain to reveal some feature, or features, which are exceptions if the hypothesis is true….By choosing the feature most unfavourable to Ho out of a very large number of features examined it will usually be possible to find some reason for rejecting the hypothesis. It must be remembered, however, that the point now at issue will not be whether it is exceptional to find a given criterion with so unfavourable a value. We shall need to find an answer to the more difficult question. Is it exceptional that the most unfavourable criterion of the n, say, examined should have as unfavourable a value as this? (ibid., p. 127).
Notice, the goal is not behavioristic; it’s a matter of avoiding the glaring fallacies in the test at hand, fallacies we know all too well.
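Pearson and Chandra Sekar’s question (is it exceptional that the most unfavourable of the n criteria examined is this unfavourable?) can be made concrete with a small simulation. What follows is only a sketch and is not taken from their paper. It assumes, purely for illustration, that the n criteria are independent and that each P-value is Uniform(0, 1) when Ho is true, as holds for a continuous test statistic. Under those assumptions the chance that the smallest of n P-values falls at or below p_{0} is 1 − (1 − p_{0})^{n}. The function names are hypothetical.

```python
import random


def exact_prob_min_p_below(p0, n_criteria):
    """Pr(smallest of n_criteria P-values <= p0) under Ho.

    Illustrative assumption: the criteria are independent, so the
    P-values are independent Uniform(0, 1) draws when Ho is true.
    """
    return 1.0 - (1.0 - p0) ** n_criteria


def simulated_prob_min_p_below(p0, n_criteria, n_trials=20000, seed=1):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Examine n_criteria features and keep the most unfavourable one.
        smallest = min(rng.random() for _ in range(n_criteria))
        if smallest <= p0:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    for n in (1, 5, 20):
        exact = exact_prob_min_p_below(0.05, n)
        approx = simulated_prob_min_p_below(0.05, n)
        print(f"n={n:2d}  exact={exact:.3f}  simulated={approx:.3f}")
```

With a single prespecified criterion the rate is the nominal 0.05; with twenty criteria searched post data it is closer to two in three, which is why finding “some reason for rejecting the hypothesis” is so easy. Dependence among the criteria would change the numbers but not the moral.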
“The statistician who does not know in advance with which type of alternative to H_{0} he may be faced, is in the position of a carpenter who is summoned to a house to undertake a job of an unknown kind and is only able to take one tool with him! Which shall it be? Even if there is an omnibus tool, it is likely to be far less sensitive at any particular job than a specialized one; but the specialized tool will be quite useless under the wrong conditions” (ibid., p. 126).
In a famous result, Neyman (1952) demonstrates that by dint of a post-data choice of hypothesis, a result that leads to rejection in one test yields the opposite conclusion in another, both adhering to a fixed significance level. [Fisher concedes this as well.] If you are keen to ensure the test is capable of teaching about discrepancies of interest, you should prespecify an alternative hypothesis, where the null and alternative hypotheses exhaust the space, relative to a given question. We can infer discrepancies from the null, as well as corroborate their absence by considering those the test had high power to detect.
Playbill Souvenir
Let’s flesh out Neyman’s conclusion to the Borel-Bertrand debate: if we accept the words, “an efficient test of the hypothesis H” to mean a statistical (methodological) falsification rule that controls the probabilities of erroneous interpretations of data, and ensures the rejection was because of the underlying cause (as modeled), then we agree with Borel that efficient tests are possible. This requires (a) a prespecified test criterion to avoid verification biases while ensuring power (efficiency), and (b) consideration of alternative hypotheses to avoid fallacies of acceptance and rejection. We must steer clear of isolated or particular curiosities to find indications that we are tracking genuine effects.
“Fisher’s the one to be credited,” Pearson remarks, “for his emphasis on planning an experiment, which led naturally to the examination of the power function, both in choosing the size of sample so as to enable worthwhile results to be achieved, and in determining the most appropriate tests” (Pearson 1962, p. 277). If you’re planning, you’re prespecifying, perhaps, nowadays, by means of explicit preregistration.
Nevertheless, prespecifying the question (or test statistic) is distinct from predesignating a cut-off P-value for significance. Discussions of tests often suppose one is somehow cheating if the attained P-value is reported, as if it loses its error probability status. It doesn’t.[2] I claim they are confusing prespecifying the question or hypothesis with fixing the P-value in advance, a confusion that stems from failing to identify the rationale behind conventions of tests, or so I argue. Nor is it even that the predesignation is essential, rather than an excellent way to promote valid error probabilities.
But not just any characteristic of the data affords the relevant error probability assessment. It has got to be pretty remarkable!
Enter those pivotal statistics called upon in Fisher’s Fiducial inference. In fact, the story could well be seen to continue in the following two posts: “You can’t take the Fiducial out of Fisher if you want to understand the N-P performance philosophy” and “Deconstructing the Fisher-Neyman conflict wearing fiducial glasses”.
[1] Or, it might have been titled, “A Polish Statistician in Paris”, given the remake of “An American in Paris” is still going strong on Broadway, last time I checked.
[2] We know that Lehmann insisted people report the attained p-value so that others could apply their own preferred error probabilities. N-P felt the same way. (I may add some links to relevant posts later on.)
REFERENCES
Bertrand, J. 1888/1907. Calcul des Probabilités. Paris: Gauthier-Villars.
Borel, E. 1914. Le Hasard. Paris: Alcan.
Fisher, R. A. 1947. The Design of Experiments (4^{th} ed.). Edinburgh: Oliver and Boyd.
Lehmann, E. L. 2012. “The Bertrand-Borel Debate and the Origins of the Neyman-Pearson Theory”, in J. Rojo (ed.), Selected Works of E. L. Lehmann, Springer US, Boston, MA, pp. 965-974.
Mayo, D. 2018. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP).
Neyman, J. 1952. Lectures and Conferences on Mathematical Statistics and Probability. 2^{nd} ed. Washington, DC: Graduate School of U.S. Dept. of Agriculture.
Neyman, J. 1977. “Frequentist Probability and Frequentist Statistics”, Synthese 36(1): 97–131.
Neyman, J. & Pearson, E. 1933. “On the Problem of the Most Efficient Tests of Statistical Hypotheses”, Philosophical Transactions of the Royal Society of London 231. Series A, Containing Papers of a Mathematical or Physical Character: 289–337.
Pearson, E. S. 1962. “Some Thoughts on Statistical Inference”, The Annals of Mathematical Statistics, 33(2): 394-403.
Pearson, E. S. & Sekar, C. C. 1936. “The Efficiency of Statistical Tools and a Criterion for the Rejection of Outlying Observations”, Biometrika 28(3/4): 308-320. Reprinted (1966) in The Selected Papers of E. S. Pearson, (pp. 118-130). Berkeley: University of California Press.
Reid, C. 1982. Neyman–from life
A Statistical Model as a Chance Mechanism
Aris Spanos
Jerzy Neyman (April 16, 1894 – August 5, 1981) was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals. (This article was first posted here.)
One of Neyman’s most remarkable, but least recognized, achievements was his adapting of Fisher’s (1922) notion of a statistical model to render it pertinent for non-random samples. Fisher’s original parametric statistical model M_{θ}(x) was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data x_{0}:=(x_{1},x_{2},…,x_{n}) can be viewed as a ‘truly representative sample’ from that ‘population’:
“The postulate of randomness thus resolves itself into the question, ‘Of what population is this a random sample?’” (ibid., p. 313), underscoring that “the adequacy of our choice may be tested a posteriori” (p. 314).
In cases where data x_{0} come from sample surveys or can be viewed as a typical realization of a random sample X:=(X_{1},X_{2},…,X_{n}), i.e. Independent and Identically Distributed (IID) random variables, the ‘population’ metaphor can be helpful in adding some intuitive appeal to the inductive dimension of statistical inference, because one can imagine using a subset of a population (the sample) to draw inferences pertaining to the whole population.
This ‘infinite population’ metaphor, however, is of limited value in most applied disciplines relying on observational data. To see how inept this metaphor is, consider the question: what is the hypothetical ‘population’ when modeling the gyrations of stock market prices? More generally, what is observed in such cases is a certain on-going process and not a fixed population from which we can select a representative sample. For that very reason, most economists in the 1930s considered Fisher’s statistical modeling irrelevant for economic data!
Due primarily to Neyman’s experience with empirical modeling in a number of applied fields, including genetics, agriculture, epidemiology, biology, astronomy and economics, his notion of a statistical model evolved beyond Fisher’s ‘infinite populations’ in the 1930s into Neyman’s frequentist ‘chance mechanisms’ (see Neyman, 1950, 1952):
Guessing and then verifying the ‘chance mechanism’, the repeated operation of which produces the observed frequencies. This is a problem of ‘frequentist probability theory’. Occasionally, this step is labeled ‘model building’. Naturally, the guessed chance mechanism is hypothetical. (Neyman, 1977, p. 99)
From my perspective, this was a major step forward for several reasons, including the following.
First, the notion of a statistical model as a ‘chance mechanism’ extended the intended scope of statistical modeling to include dynamic phenomena that give rise to data from non-IID samples, i.e. data that exhibit both dependence and heterogeneity, like stock prices.
Second, the notion of a statistical model as a ‘chance mechanism’ is not only of metaphorical value, but it can be operationalized in the context of a statistical model, formalized by:
M_{θ}(x)={f(x;θ), θ∈Θ}, x∈R^{n}, Θ⊂R^{m}, m << n,
where the distribution of the sample f(x;θ) describes the probabilistic assumptions of the statistical model. This takes the form of a statistical Generating Mechanism (GM), stemming from f(x;θ), that can be used to generate simulated data on a computer. An example of such a Statistical GM is:
X_{t} = α_{0} + α_{1}X_{t-1} + σε_{t}, t=1,2,…,n
This indicates how one can use pseudo-random numbers for the error term ε_{t} ~ NIID(0,1) to simulate data for the Normal, AutoRegressive [AR(1)] Model. One can generate numerous sample realizations, say N=100,000, of sample size n in a matter of seconds on a PC.
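As a minimal sketch of what that looks like in practice, the statistical GM X_{t} = α_{0} + α_{1}X_{t-1} + σε_{t} can be coded in a few lines. The function name, the parameter values and the choice of starting value X_{0} = 0 are my own assumptions for illustration; the GM itself does not fix an initial condition.

```python
import random


def simulate_ar1(alpha0, alpha1, sigma, n, x0=0.0, rng=None):
    """Return one realization (X_1, ..., X_n) of the Normal AR(1) GM.

    X_t = alpha0 + alpha1 * X_{t-1} + sigma * eps_t,  eps_t ~ NIID(0, 1).
    The starting value x0 is an illustrative assumption.
    """
    rng = rng or random.Random()
    path = []
    prev = x0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)  # pseudo-random standard Normal error
        prev = alpha0 + alpha1 * prev + sigma * eps
        path.append(prev)
    return path


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    # N sample realizations, each of size n (values chosen for illustration).
    N, n = 1000, 100
    realizations = [simulate_ar1(1.0, 0.5, 1.0, n, rng=rng) for _ in range(N)]
    print(len(realizations), len(realizations[0]))
```

Each call produces one sample realization; repeating the call N times gives the collection of realizations from which empirical sampling distributions can be built.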
Third, the notion of a statistical model as a ‘chance mechanism’ puts a totally different spin on another metaphor widely used by uninformed critics of frequentist inference. This is the ‘long-run’ metaphor associated with the relevant error probabilities used to calibrate frequentist inferences. The operationalization of the statistical GM reveals that the temporal aspect of this metaphor is totally irrelevant for frequentist inference; remember Keynes’s catch phrase “In the long run we are all dead”? Instead, what matters in practice is its repeatability in principle, not over time! For instance, one can use the above statistical GM to generate the empirical sampling distributions for any test statistic, and thus render operational not only the pre-data error probabilities, like the type I and type II error probabilities and the power of a test, but also the post-data probabilities associated with the severity evaluation; see Mayo (1996).
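To illustrate repeatability in principle, here is a hedged sketch that uses the AR(1) GM to build an empirical sampling distribution and read off error probabilities from it. Everything specific in it is an assumption chosen for illustration: the null hypothesis α_{1} = 0 against α_{1} > 0, the lag-1 sample autocorrelation as test statistic, the alternative value α_{1} = 0.5, the sample size n = 50, the 5% level, and all function names. It is not a reconstruction of any particular published procedure.

```python
import random


def simulate_ar1(alpha0, alpha1, sigma, n, rng, x0=0.0):
    """One realization of X_t = alpha0 + alpha1*X_{t-1} + sigma*eps_t."""
    path, prev = [], x0
    for _ in range(n):
        prev = alpha0 + alpha1 * prev + sigma * rng.gauss(0.0, 1.0)
        path.append(prev)
    return path


def lag1_autocorr(x):
    """Lag-1 sample autocorrelation, used here as the test statistic."""
    mean = sum(x) / len(x)
    num = sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, len(x)))
    den = sum((v - mean) ** 2 for v in x)
    return num / den


def sampling_distribution(alpha1, n, n_reps, rng):
    """Empirical sampling distribution of the statistic under a given alpha1."""
    return [lag1_autocorr(simulate_ar1(0.0, alpha1, 1.0, n, rng))
            for _ in range(n_reps)]


def empirical_quantile(values, q):
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


def rejection_rate(values, critical_value):
    return sum(v > critical_value for v in values) / len(values)


def error_probabilities(n=50, n_reps=2000, alpha1_alt=0.5, level=0.05, seed=0):
    """Return (critical value, estimated type I error, estimated power)."""
    rng = random.Random(seed)
    # Critical value from the simulated null distribution (alpha1 = 0).
    crit = empirical_quantile(sampling_distribution(0.0, n, n_reps, rng),
                              1.0 - level)
    # Fresh null draws check the type I error; alternative draws give power.
    type1 = rejection_rate(sampling_distribution(0.0, n, n_reps, rng), crit)
    power = rejection_rate(sampling_distribution(alpha1_alt, n, n_reps, rng),
                           crit)
    return crit, type1, power


if __name__ == "__main__":
    crit, type1, power = error_probabilities()
    print(f"critical value={crit:.3f}  type I={type1:.3f}  power={power:.3f}")
```

Nothing here unfolds “over time”: the relative frequencies come from re-running the same chance mechanism, which is all the error probabilities require. The type II error probability at the chosen alternative is simply one minus the estimated power.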
HAPPY BIRTHDAY NEYMAN!
For further discussion on the above issues see:
Spanos, A. (2012), “A Frequentist Interpretation of Probability for Model-Based Inductive Inference,” in Synthese:
http://www.econ.vt.edu/faculty/2008vitas_research/Spanos/1Spanos-2011-Synthese.pdf
Fisher, R. A. (1922), “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society A, 222: 309-368.
Mayo, D. G. (1996), Error and the Growth of Experimental Knowledge, The University of Chicago Press, Chicago.
Neyman, J. (1950), First Course in Probability and Statistics, Henry Holt, NY.
Neyman, J. (1952), Lectures and Conferences on Mathematical Statistics and Probability, 2nd ed. U.S. Department of Agriculture, Washington.
Neyman, J. (1977), “Frequentist Probability and Frequentist Statistics,” Synthese, 36, 97-131.
[i] He was born in an area that was part of Russia.