The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)

The New England Journal of Medicine NEJM announced new guidelines for authors for statistical reporting  yesterday*. The ASA describes the change as “in response to the ASA Statement on P-values and Statistical Significance and subsequent The American Statistician special issue on statistical inference” (ASA I and II, in my abbreviation). If so, it seems to have backfired. I don’t know all the differences in the new guidelines, but those explicitly noted appear to me to move in the reverse direction from where the ASA I and II guidelines were heading.

The most notable point is that the NEJM highlights the need for error control, especially for constraining the Type I error probability, and pays a lot of attention to adjusting P-values for multiple testing and post hoc subgroups. ASA I included an important principle (#4) that P-values are altered and may be invalidated by multiple testing, but they do not call for adjustments for multiplicity, nor do I find a discussion of Type I or II error probabilities in the ASA documents. NEJM gives strict requirements for controlling family-wise error rate or false discovery rates (understood as the Benjamini and Hochberg frequentist adjustments).

They do not go along with the ASA II call for ousting thresholds, ending the use of the words “significance/significant”, or banning “p ≤ 0.05”.  In the associated article, we read:

“Clinicians and regulatory agencies must make decisions about which treatment to use or to allow to be marketed, and P values interpreted by reliably calculated thresholds subjected to appropriate adjustments have a role in those decisions”.

When it comes to confidence intervals, the recommendations of ASA II, to the extent they were influential on the NEJM, seem to have had the opposite effect to what was intended–or is this really what they wanted?

  • When no method to adjust for multiplicity of inferences or controlling false discovery rate was specified in the protocol or SAP of a clinical trial, the report of all secondary and exploratory endpoints should be limited to point estimates of treatment effects with 95% confidence intervals. In such cases, the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No P values should be reported for these analyses.

Significance levels and P-values, in other words, are terms to be reserved for contexts in which their error statistical meaning is legitimate. This is a key strong point of the NEJM guidelines. Confidence levels, for the NEJM, lose their error statistical or “coverage probability” meaning, unless they follow the adjustments that legitimate P-values call for. But they must be accompanied by a sign that warns the reader the intervals were not adjusted for multiple testing and thus “the inferences drawn may not be reproducible.” The P-value, but not the confidence interval, remains an inferential tool with control of error probabilities. Now CIs are inversions of tests, and strictly speaking should also have error control. Authors may be allowed to forfeit this, but then CIs can’t replace significance tests and their use may even (inadvertently, perhaps) signal lack of error control. (In my view, that is not a good thing.) Here are some excerpts:

For all studies:

  • Significance tests should be accompanied by confidence intervals for estimated effect sizes, measures of association, or other parameters of interest. The confidence intervals should be adjusted to match any adjustment made to significance levels in the corresponding test.

For clinical trials:

  • Original and final protocols and statistical analysis plans (SAPs) should be submitted along with the manuscript, as well as a table of amendments made to the protocol and SAP indicating the date of the change and its content.

  • The analyses of the primary outcome in manuscripts reporting results of clinical trials should match the analyses prespecified in the original protocol, except in unusual circumstances. Analyses that do not conform to the protocol should be justified in the Methods section of the manuscript. …

  • When comparing outcomes in two or more groups in confirmatory analyses, investigators should use the testing procedures specified in the protocol and SAP to control overall type I error — for example, Bonferroni adjustments or prespecified hierarchical procedures. P values adjusted for multiplicity should be reported when appropriate and labeled as such in the manuscript. In hierarchical testing procedures, P values should be reported only until the last comparison for which the P value was statistically significant. P values for the first nonsignificant comparison and for all comparisons thereafter should not be reported. For prespecified exploratory analyses, investigators should use methods for controlling false discovery rate described in the SAP — for example, Benjamini–Hochberg procedures.

  • When no method to adjust for multiplicity of inferences or controlling false discovery rate was specified in the protocol or SAP of a clinical trial, the report of all secondary and exploratory endpoints should be limited to point estimates of treatment effects with 95% confidence intervals. In such cases, the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No P values should be reported for these analyses.

As noted earlier, since P-values would be invalidated in such cases, it’s entirely right not to give them. CIs are permitted, yes, but are required to sport an alert warning that, even though multiple testing was done, the intervals were not adjusted for this and therefore “the inferences drawn may not be reproducible.” In short their coverage probability justification goes by the board.

I wonder if practitioners can opt out of this weakening of CIs, and declare in advance that they are members of a subset of CI users who will only report confidence levels with a valid error statistical meaning, dual to statistical hypothesis tests.

The NEJM guidelines continue:

  • …When the SAP prespecifies an analysis of certain subgroups, that analysis should conform to the method described in the SAP. If the study team believes a post hoc analysis of subgroups is important, the rationale for conducting that analysis should be stated. Post hoc analyses should be clearly labeled as post hoc in the manuscript.

  • Forest plots are often used to present results from an analysis of the consistency of a treatment effect across subgroups of factors of interest. …A list of P values for treatment by subgroup interactions is subject to the problems of multiplicity and has limited value for inference. Therefore, in most cases, no P values for interaction should be provided in the forest plots.

  • If significance tests of safety outcomes (when not primary outcomes) are reported along with the treatment-specific estimates, no adjustment for multiplicity is necessary. Because information contained in the safety endpoints may signal problems within specific organ classes, the editors believe that the type I error rates larger than 0.05 are acceptable. Editors may request that P values be reported for comparisons of the frequency of adverse events among treatment groups, regardless of whether such comparisons were prespecified in the SAP.

  • When possible, the editors prefer that absolute event counts or rates be reported before relative risks or hazard ratios. The goal is to provide the reader with both the actual event frequency and the relative frequency. Odds ratios should be avoided, as they may overestimate the relative risks in many settings and be misinterpreted.

  • Authors should provide a flow diagram in CONSORT format. The editors also encourage authors to submit all the relevant information included in the CONSORT checklist. …The CONSORT statement, checklist, and flow diagram are available on the CONSORT

Detailed instructions to ensure that observational studies retain control of error rates are given.

In the associated article:

P values indicate how incompatible the observed data may be with a null hypothesis; “P<0.05” implies that a treatment effect or exposure association larger than that observed would occur less than 5% of the time under a null hypothesis of no effect or association and assuming no confounding. Concluding that the null hypothesis is false when in fact it is true (a type I error in statistical terms) has a likelihood of less than 5%. [i]…

The use of P values to summarize evidence in a study requires, on the one hand, thresholds that have a strong theoretical and empirical justification and, on the other hand, proper attention to the error that can result from uncritical interpretation of multiple inferences.5 This inflation due to multiple comparisons can also occur when comparisons have been conducted by investigators but are not reported in a manuscript. A large array of methods to adjust for multiple comparisons is available and can be used to control the type I error probability in an analysis when specified in the design of a study.6,7 Finally, the notion that a treatment is effective for a particular outcome if P<0.05 and ineffective if that threshold is not reached is a reductionist view of medicine that does not always reflect reality. [ii]

A well-designed randomized or observational study will have a primary hypothesis and a prespecified method of analysis, and the significance level from that analysis is a reliable indicator of the extent to which the observed data contradict a null hypothesis of no association between an intervention or an exposure and a response. Clinicians and regulatory agencies must make decisions about which treatment to use or to allow to be marketed, and P values interpreted by reliably calculated thresholds subjected to appropriate adjustments have a role in those decisions.

Finally, the current guidelines are limited to studies with a traditional frequentist design and analysis, since that matches the large majority of manuscripts submitted to the Journal. We do not mean to imply that these are the only acceptable designs and analyses. The Journal has published many studies with Bayesian designs and analyses8-10 and expects to see more such trials in the future. When appropriate, our guidelines will be expanded to include best practices for reporting trials with Bayesian and other designs.

What do you think?

I will update this with corrections and thoughts using (i), (ii), etc.

The author guidelines:

https://www.nejm.org/author-center/new-manuscripts

The associated article:

https://www.nejm.org/doi/full/10.1056/NEJMe1906559

*I meant to thank Nathan Schachtman for notifying me and sending links; also Stuart Hurlbert.

[i] It would be better, it seems to me, if the term “likelihood” was used only for its technical meaning in a document like this.

[ii] I don’t see it as a matter of “reductionism” but simply a matter of the properties of the test and the discrepancies of interest in the context at hand.

 

 

 

Categories: ASA Guide to P-values | 15 Comments

B. Haig: The ASA’s 2019 update on P-values and significance (ASA II)(Guest Post)

Brian Haig, Professor Emeritus
Department of Psychology
University of Canterbury
Christchurch, New Zealand

The American Statistical Association’s (ASA) recent effort to advise the statistical and scientific communities on how they should think about statistics in research is ambitious in scope. It is concerned with an initial attempt to depict what empirical research might look like in “a world beyond p<0.05” (The American Statistician, 2019, 73, S1,1-401). Quite surprisingly, the main recommendation of the lead editorial article in the Special Issue of The American Statistician devoted to this topic (Wasserstein, Schirm, & Lazar, 2019; hereafter, ASA II) is that “it is time to stop using the term ‘statistically significant’ entirely”. (p.2) ASA II acknowledges the controversial nature of this directive and anticipates that it will be subject to critical examination. Indeed, in a recent post, Deborah Mayo began her evaluation of ASA II by making constructive amendments to three recommendations that appear early in the document (‘Error Statistics Philosophy’, June 17, 2019). These amendments have received numerous endorsements, and I record mine here. In this short commentary, I briefly state a number of general reservations that I have about ASA II.

1. The proposal that we should stop using the expression “statistical significance” is given a weak justification

ASA II proposes a superficial linguistic reform that is unlikely to overcome the widespread misconceptions and misuse of the concept of significance testing. A more reasonable, and common-sense, strategy would be to diagnose the reasons for the misconceptions and misuse and take ameliorative action through the provision of better statistics education, much as ASA I did with p values. Interestingly, ASA II references Mayo’s recent book, Statistical Inference as Severe Testing (2018), when mentioning the “statistics wars”. However, it refrains from considering the fact that her error-statistical perspective provides an informed justification for continuing to use tests of significance, along with the expression, “statistically significant”. Further, ASA II reports cases where some of the Special Issue authors thought that use of a p-value threshold might be acceptable. However, it makes no effort to consider how these cases might challenge their main recommendation.

2. The claimed benefits of abandoning talk of statistical significance are hopeful conjectures.

ASA II makes a number of claims about the benefits that it thinks will follow from abandoning talk of statistical significance. It says,“researchers will see their results more easily replicated – and, even when not, they will better understand why”. “[We] will begin to see fewer false alarms [and] fewer overlooked discoveries …”. And, “As ‘statistical significance’ is used less, statistical thinking will be used more.” (p.1) I do not believe that any of these claims are likely to follow from retirement of the expression, “statistical significance”. Unfortunately, no justification is provided for the plausibility of any of the alleged benefits.  To take two of these claims: First, removal of the common expression, “significance testing” will make little difference to the success rate of replications. It is well known that successful replications depend on a number of important factors, including research design, data quality, effect size, and study power, along with the multiple criteria often invoked in ascertaining replication success. Second, it is just implausible to suggest that refraining from talk about statistical significance will appreciably help overcome mechanical decision-making in statistical practice, and lead to a greater engagement with statistical thinking. Such an outcome will require, among other things, the implementation of science education reforms that centre on the conceptual foundations of statistical inference.

3. ASA II’s main recommendation is not a majority view.

ASA II bases its main recommendation to stop using the language of “statistical significance” in good part on its review of the articles in the Special Issue. However, an inspection of the Special Issue reveals that this recommendation is at variance with the views of many of the 40-odd articles it contains. Those articles range widely over topics covered, and attitudes to, the usefulness of tests of significance. By my reckoning, only two of the articles advocate banning talk of significance testing. To be fair, ASA II acknowledges the diversity of views held about the nature of tests of significance. However, I think that this diversity should have prompted it to take proper account of the fact that its recommendation is only one of a number of alternative views about significance testing. At the very least, ASA II should have tempered its strong recommendation not to speak of statistical significance any more.

4.The claim for continuity between ASA I and ASA II is misleading.  There is no evidence in ASA I (Wasserstein & Lazar, 2016) for the assertion made in ASA II that the earlier document stopped just short of recommending that claims of “statistical significance” should be eliminated. In fact, ASA II marks a clear departure from ASA I, which was essentially concerned with how to better understand and use p-values. There is nothing in the earlier document to suggest that abandoning talk of statistical significance might be the next appropriate step forward in the ASA’s efforts to guide our statistical thinking.

5. Nothing is said about scientific method, and little is said about science.

The announcement of the ASA’s 2017 Symposium on Statistical Inference stated that the Symposium would “focus on specific approaches for advancing scientific methods in the 21stcentury”. However, the Symposium, and the resulting Special Issue of The American Statistician, showed little interest in matters to do with scientific method. This is regrettable because the myriad insights about scientific inquiry contained in contemporary scientific methodology have the potential to greatly enrich statistical science. The post-p< 0.05 world depicted by ASA II is not an informed scientific world. It is an important truism that statistical inference plays a major role in scientific reasoning. However, for this role to be properly conveyed, ASA II would have to employ an informative conception of the nature of scientific inquiry.

6. Scientists who speak of statistical significance do embrace uncertainty. I think that it is uncharitable, indeed incorrect, of ASA II to depict many researchers who use the language of significance testing as being engaged in a quest for certainty. John Dewey, Charles Peirce, and Karl Popper taught us quite some time ago that we are fallible, error-prone creatures, and that we must embrace uncertainty. Further, despite their limitations, our science education efforts frequently instruct learners to think of uncertainty as an appropriate epistemic attitude to hold in science. This fact, combined with the oft-made claim that statistics employs ideas about probability in order to quantify uncertainty, requires from ASA II a factually-based justification for its claim that many scientists who employ tests of significance do so in a quest for certainty.

Under the heading, “Has the American Statistical Association Gone Post-Modern?”, the legal scholar, Nathan Schachtman, recently stated:

The ASA may claim to be agnostic in the face of the contradictory recommendations, but there is one thing we know for sure: over-reaching litigants and their expert witnesses will exploit the real or apparent chaos in the ASA’s approach. The lack of coherent, consistent guidance will launch a thousand litigation ships, with no epistemic compass.(‘Schachtman’s Law’, March 24, 2019)

I suggest that, with appropriate adjustment, the same can fairly be said about researchers and statisticians, who might look to ASA II as an informative guide to a better understanding of tests of significance, and the many misconceptions about them that need to be corrected.

References

Haig, B. D. (2019). Stats: Don’t retire significance testing. Nature, 569, 487.

Mayo, D. G. (2019). The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean (Some Recommendations)(ii),blog post on Error Statistics Philosophy Blog, June 17, 2019.

Mayo, D. G. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. New York, NY: Cambridge University Press.

Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: Context, process, and purpose. The American Statistician, 70, 129-133.

Wasserstein, R. L., Schirm. A. L., & Lazar, N. A. (2019). Editorial: Moving to a world beyond “p<0.05”. The American Statistician, 73, S1, 1-19.

Categories: ASA Guide to P-values, Brian Haig | Tags: | 31 Comments

The Statistics Wars: Errors and Casualties

.

Had I been scheduled to speak later at the 12th MuST Conference & 3rd Workshop “Perspectives on Scientific Error” in Munich, rather than on day 1, I could have (constructively) illustrated some of the errors and casualties by reference to a few of the conference papers that discussed significance tests. (Most gave illuminating discussions of such topics as replication research, the biases that discredit meta-analysis, statistics in the law, formal epistemology [i]). My slides follow my abstract. Continue reading

Categories: slides, stat wars and their casualties | Tags: | Leave a comment

“The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)

Some have asked me why I haven’t blogged on the recent follow-up to the ASA Statement on P-Values and Statistical Significance (Wasserstein and Lazar 2016)–hereafter, ASA I. They’re referring to the editorial by Wasserstein, R., Schirm, A. and Lazar, N. (2019)–hereafter, ASA II–opening a special on-line issue of over 40 contributions responding to the call to describe “a world beyond P < 0.05”.[1] Am I falling down on the job? Not really. All of the issues are thoroughly visited in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, SIST (2018, CUP). I invite interested readers to join me on the statistical cruise therein.[2] As the ASA II authors observe: “At times in this editorial and the papers you’ll hear deep dissonance, the echoes of ‘statistics wars’ still simmering today (Mayo 2018)”. True, and reluctance to reopen old wounds has only allowed them to fester. However, I will admit, that when new attempts at reforms are put forward, a philosopher of science who has written on the statistics wars ought to weigh in on the specific prescriptions/proscriptions, especially when a jumble of fuzzy conceptual issues are interwoven through a cacophony of competing reforms. (My published comment on ASA I, “Don’t Throw Out the Error Control Baby With the Bad Statistics Bathwater” is here.) Continue reading

Categories: ASA Guide to P-values, Statistics | 92 Comments

(Full) Excerpt. Excursion 5 Tour II: How Not to Corrupt Power (Power Taboos, Retro Power, and Shpower)

.

returned from London…

The concept of a test’s power is still being corrupted in the myriad ways discussed in 5.5, 5.6.  I’m excerpting all of Tour II of Excursion 5, as I did with Tour I (of Statistical Inference as Severe Testing:How to Get Beyond the Statistics Wars 2018, CUP)*. Originally the two Tours comprised just one, but in finalizing corrections, I decided the two together was too long of a slog, and I split it up. Because it was done at the last minute, some of the terms in Tour II rely on their introductions in Tour I.  Here’s how it starts:

5.5 Power Taboos, Retrospective Power, and Shpower

Let’s visit some of the more populous tribes who take issue with power – by which we mean ordinary power – at least its post-data uses. Power Peninsula is often avoided due to various “keep out” warnings and prohibitions, or researchers come during planning, never to return. Why do some people consider it a waste of time, if not totally taboo, to compute power once we know the data? A degree of blame must go to N-P, who emphasized the planning role of power, and only occasionally mentioned its use in determining what gets “confirmed” post-data. After all, it’s good to plan how large a boat we need for a philosophical excursion to the Lands of Overlapping Statistical Tribes, but once we’ve made it, it doesn’t matter that the boat was rather small. Or so the critic of post-data power avers. A crucial disanalogy is that with statistics, we don’t know that we’ve “made it there,” when we arrive at a statistically significant result. The statistical significance alarm goes off, but you are not able to see the underlying discrepancy that generated the alarm you hear. The problem is to make the leap from the perceived alarm to an aspect of a process, deep below the visible ocean, responsible for its having been triggered. Then it is of considerable relevance to exploit information on the capability of your test procedure to result in alarms going off (perhaps with different decibels of loudness), due to varying values of the parameter of interest. There are also objections to power analysis with insignificant results. Continue reading

Categories: fallacy of non-significance, power, Statistical Inference as Severe Testing | Leave a comment

Don’t let the tail wag the dog by being overly influenced by flawed statistical inferences

.

An article [i],“There is Still a Place for Significance Testing in Clinical Trials,” appearing recently in Clinical Trials, while very short, effectively responds to recent efforts to stop error statistical testing [ii]. We need more of this. Much more. The emphasis in this excerpt is mine: 

Much hand-wringing has been stimulated by the reflection that reports of clinical studies often misinterpret and misrepresent the findings of the statistical analyses. Recent proposals to address these concerns have included abandoning p-values and much of the traditional classical approach to statistical inference, or dropping the concept of statistical significance while still allowing some place for p-values. How should we in the clinical trials community respond to these concerns? Responses may vary from bemusement, pity for our colleagues working in the wilderness outside the relatively protected environment of clinical trials, to unease about the implications for those of us engaged in clinical trials…. Continue reading

Categories: statistical tests | Leave a comment

SIST: All Excerpts and Mementos: May 2018-May 2019

view from a hot-air balloon

Introduction & Overview

The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* 05/19/18

Blurbs of 16 Tours: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST) 03/05/19

 

Excursion 1

EXCERPTS

Tour I

Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1) 09/08/18

Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2) 09/11/18

Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3) 09/15/18

Tour II

Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt 04/04/19

Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II) 11/08/18

MEMENTOS

Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars) 10/29/18

 

Excursion 2

EXCERPTS

Tour I

Excursion 2: Taboos of Induction and Falsification: Tour I (first stop) 09/29/18

“It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (Keepsake by Fisher, 2.1) 10/05/18

Tour II

Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3) 10/10/18

MEMENTOS

Tour Guide Mementos and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation) 11/14/18

Mementos for Excursion 2 Tour II Falsification, Pseudoscience, Induction 11/17/18

 

Excursion 3

EXCERPTS

Tour I

Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3 11/30/18

Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2) 12/01/18

First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3] 12/04/18

Tour II

It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP) 12/11/18

60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 tour II. 12/29/18

Tour III

Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III 12/20/18

MEMENTOS

Memento & Quiz (on SEV): Excursion 3, Tour I 12/08/18

Mementos for “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6) 12/13/18

Tour Guide Mementos From Excursion 3 Tour III: Capability and Severity: Deeper Concepts 12/26/18

 

Excursion 4

EXCERPTS

Tour I

Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP) 12/26/18

Tour II

Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?” 01/10/19

Tour IV

Excerpt from Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking 01/27/19

MEMENTOS

Mementos from Excursion 4: Blurbs of Tours I-IV 01/13/19

 

Excursion 5

Tour I

(full) Excerpt: Excursion 5 Tour I — Power: Pre-data and Post-data (from “SIST: How to Get Beyond the Stat Wars”) 04/27/19

Tour III

Deconstructing the Fisher-Neyman conflict wearing Fiducial glasses + Excerpt 5.8 from SIST 02/23/19

 

Excursion 6

Tour II

Excerpts: Souvenir Z: Understanding Tribal Warfare +  6.7 Farewell Keepsake from SIST + List of Souvenirs 05/04/19

*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).

Categories: SIST, Statistical Inference as Severe Testing | Leave a comment

Excerpts: Final Souvenir Z, Farewell Keepsake & List of Souvenirs

.

We’ve reached our last Tour (of SIST)*: Pragmatic and Error Statistical Bayesians (Excursion 6), marking the end of our reading with Souvenir Z, the final Souvenir, as well as the Farewell Keepsake in 6.7. Our cruise ship Statinfasst, currently here at Thebes, will be back at dock for maintenance for our next launch at the Summer Seminar in Phil Stat (July 28-Aug 11). Although it’s not my preference that new readers begin with the Farewell Keepsake (it contains a few spoilers), I’m excerpting it together with Souvenir Z (and a list of all souvenirs A – Z) here, and invite all interested readers to peer in. There’s a check list on p. 437: If you’re in the market for a new statistical account, you’ll want to test if it satisfies the items on the list. Have fun!

Souvenir Z: Understanding Tribal Warfare

We began this tour asking: Is there an overarching philosophy that “matches contemporary attitudes”? More important is changing attitudes. Not to encourage a switch of tribes, or even a tribal truce, but something more modest and actually achievable: to understand and get beyond the tribal warfare. To understand them, at minimum, requires grasping how the goals of probabilism differ from those of probativeness. This leads to a way of changing contemporary attitudes that is bolder and more challenging. Snapshots from the error statistical lens let you see how frequentist methods supply tools for controlling and assessing how well or poorly warranted claims are. All of the links, from data generation to modeling, to statistical inference and from there to substantive research claims, fall into place within this statistical philosophy. If this is close to being a useful way to interpret a cluster of methods, then the change in contemporary attitudes is radical: it has never been explicitly unveiled. Our journey was restricted to simple examples because those are the ones fought over in decades of statistical battles. Much more work is needed. Those grappling with applied problems are best suited to develop these ideas, and see where they may lead. I never promised,when you bought your ticket for this passage, to go beyond showing that viewing statistics as severe testing will let you get beyond the statistics wars.

6.7 Farewell Keepsake

Despite the eclecticism of statistical practice, conflicting views about the roles of probability and the nature of statistical inference – holdovers from long-standing frequentist–Bayesian battles – still simmer below the surface of today’s debates. Reluctance to reopen wounds from old battles has allowed them to fester. To assume all we need is an agreement on numbers – even if they’re measuring different things – leads to statistical schizophrenia. Rival conceptions of the nature of statistical inference show up unannounced in the problems of scientific integrity, irreproducibility, and questionable research practices, and in proposed methodological reforms. If you don’t understand the assumptions behind proposed reforms, their ramifications for statistical practice remain hidden from you.

Rival standards reflect a tension between using probability (a) to constrain the probability that a method avoids erroneously interpreting data in a series of applications (performance), and (b) to assign degrees of support, confirmation, or plausibility to hypotheses (probabilism). We set sail on our journey with an informal tool for telling what’s true about statistical inference: If little if anything has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a severe test . From this minimal severe-testing requirement, we develop a statistical philosophy that goes beyond probabilism and performance. The goals of the severe tester (probativism) arise in contexts sufficiently different from those of probabilism that you are free to hold both, for distinct aims (Section 1.2). For statistical inference in science, it is severity we seek. A claim passes with severity only to the extent that it is subjected to, and passes, a test that it probably would have failed, if false. Viewing statistical inference as severe testing alters long-held conceptions of what’s required for an adequate account of statistical inference in science. In this view, a normative statistical epistemology –  an account of what’ s warranted to infer –  must be:

  directly altered by biasing selection effects
  able to falsify claims statistically
  able to test statistical model assumptions
  able to block inferences that violate minimal severity

These overlapping and interrelated requirements are disinterred over the course of our travels. This final keepsake collects a cluster of familiar criticisms of error statistical methods. They are not intended to replace the detailed arguments, pro and con, within; here we cut to the chase, generally keeping to the language of critics. Given our conception of evidence, we retain testing language even when the statistical inference is an estimation, prediction, or proposed answer to a question. The concept of severe testing is sufficiently general to apply to any of the methods now in use. It follows that a variety of statistical methods can serve to advance the severity goal, and that they can, in principle, find their foundations in an error statistical philosophy. However, each requires supplements and reformulations to be relevant to real-world learning. Good science does not turn on adopting any formal tool, and yet the statistics wars often focus on whether to use one type of test (or estimation, or model selection) or another. Meta-researchers charged with instigating reforms do not agree, but the foundational basis for the disagreement is left unattended. It is no wonder some see the statistics wars as proxy wars between competing tribe leaders, each keen to advance one or another tool, rather than about how to do better science. Leading minds are drawn into inconsequential battles, e.g., whether to use a prespecified cut-off  of 0.025 or 0.0025 –  when in fact good inference is not about cut-offs altogether but about a series of small-scale steps in collecting, modeling and analyzing data that work together to find things out. Still, we need to get beyond the statistics wars in their present form. By viewing a contentious battle in terms of a difference in goals –  finding highly probable versus highly well probed hypotheses – readers can see why leaders of rival tribes often talk past each other. To be clear, the standpoints underlying the following criticisms are open to debate; we’re far from claiming to do away with them. What should be done away with is rehearsing the same criticisms ad nauseum. Only then can we hear the voices of those calling for an honest standpoint about responsible science.

1. NHST Licenses Abuses. First, there’s the cluster of criticisms directed at an abusive NHST animal: NHSTs infer from a single P-value below an arbitrary cut-off to evidence for a research claim, and they encourage P-hacking, fishing, and other selection effects. The reply: this ignores crucial requirements set by Fisher and other founders: isolated significant results are poor evidence of a genuine effect and statistical significance doesn’t warrant substantive, (e.g., causal) inferences. Moreover, selective reporting invalidates error probabilities. Some argue significance tests are un-Popperian because the higher the sample size, the easier to infer one’s research hypothesis. It’s true that with a sufficiently high sample size any discrepancy from a null hypothesis has a high probability of being detected, but statistical significance does not license inferring a research claim H. Unless H’s errors have been well probed by merely finding a small P-value, H passes an extremely insevere test. No mountains out of molehills (Sections 4.3 and 5.1). Enlightened users of statistical tests have rejected the cookbook, dichotomous NHST, long lampooned: such criticisms are behind the times. When well-intentioned aims of replication research are linked to these retreads, it only hurts the cause. One doesn’t need a sharp dichotomy to identify rather lousy tests – a main goal for a severe tester. Granted, policy-making contexts may require cut-offs, as do behavioristic setups. But in those contexts, a test’s error probabilities measure overall error control, and are not generally used to assess well-testedness. Even there, users need not fall into the NHST traps (Section 2.5). While attention to banning terms is the least productive aspect of the statistics wars, since NHST is not used by Fisher or N-P, let’s give the caricature its due and drop the NHST acronym; “statistical tests” or “error statistical tests” will do. Simple significance tests are a small part of a conglomeration of error statistical methods.

To continue reading: Excerpt Souvenir Z, Farewell Keepsake & List of Souvenirs can be found here.

*We are reading Statistical Inference as Severe Testing: How to Get beyond the Statistics Wars (2018, CUP)

***

 

Where YOU are in the journey.

 


Categories: SIST, Statistical Inference as Severe Testing | Leave a comment

(full) Excerpt: Excursion 5 Tour I — Power: Pre-data and Post-data (from “SIST: How to Get Beyond the Stat Wars”)

S.S. StatInfasST

It’s a balmy day today on Ship StatInfasST: An invigorating wind has a salutary effect on our journey. So, for the first time I’m excerpting all of Excursion 5 Tour I (proofs) of Statistical Inference as Severe Testing How to Get Beyond the Statistics Wars (2018, CUP)

A salutary effect of power analysis is that it draws one forcibly to consider the magnitude of effects. In psychology, and especially in soft psychology, under the sway of the Fisherian scheme, there has been little consciousness of how big things are. (Cohen 1990, p. 1309)

 So how would you use power to consider the magnitude of effects were you drawn forcibly to do so? In with your breakfast is an exercise to get us started on today’ s shore excursion.

Suppose you are reading about a statistically signifi cant result x (just at level α ) from a one-sided test T+ of the mean of a Normal distribution with IID samples, and known σ: H0 : μ ≤ 0 against H1 : μ > 0. Underline the correct word, from the perspective of the (error statistical) philosophy, within which power is defined.

  • If the test’ s power to detect μ′ is very low (i.e., POW(μ′ ) is low), then the statistically significant x is poor/good evidence that μ > μ′ .
  • Were POW(μ′ ) reasonably high, the inference to μ > μ′ is reasonably/poorly warranted.

Continue reading

Categories: Statistical Inference as Severe Testing, Statistical power | Leave a comment

If you like Neyman’s confidence intervals then you like N-P tests

Neyman

Neyman, confronted with unfortunate news would always say “too bad!” At the end of Jerzy Neyman’s birthday week, I cannot help imagining him saying “too bad!” as regards some twists and turns in the statistics wars. First, too bad Neyman-Pearson (N-P) tests aren’t in the ASA Statement (2016) on P-values: “To keep the statement reasonably simple, we did not address alternative hypotheses, error types, or power”. An especially aggrieved “too bad!” would be earned by the fact that those in love with confidence interval estimators don’t appreciate that Neyman developed them (in 1930) as a method with a precise interrelationship with N-P tests. So if you love CI estimators, then you love N-P tests! Continue reading

Categories: ASA Guide to P-values, CIs and tests, Neyman | Leave a comment

Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen

Neyman April 16, 1894 – August 5, 1981

I’ll continue to post Neyman-related items this week in honor of his birthday. This isn’t the only paper in which Neyman makes it clear he denies a distinction between a test of  statistical hypotheses and significance tests. He and E. Pearson also discredit the myth that the former is only allowed to report pre-data, fixed error probabilities, and are justified only by dint of long-run error control. Controlling the “frequency of misdirected activities” in the midst of finding something out, or solving a problem of inquiry, on the other hand, are epistemological goals. What do you think?

Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena
by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

I recommend, especially, the example on home ownership. Here are two snippets: Continue reading

Categories: Error Statistics, Neyman, Statistics | Tags: | Leave a comment

Neyman vs the ‘Inferential’ Probabilists

.

We celebrated Jerzy Neyman’s Birthday (April 16, 1894) last night in our seminar: here’s a pic of the cake.  My entry today is a brief excerpt and a link to a paper of his that we haven’t discussed much on this blog: Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making‘ [i] It’s chock full of ideas and arguments, but the one that interests me at the moment is Neyman’s conception of “his breakthrough”, in relation to a certain concept of “inference”.  “In the present paper” he tells us, “the term ‘inferential theory’…will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation …either a substitute a priori distribution [exemplified by the so called principle of insufficient reason] or a new measure of uncertainty” such as Fisher’s fiducial probability. So if you hear Neyman rejecting “inferential accounts” you have to understand it in this very specific way: he’s rejecting “new measures of confidence or diffidence”. Here he alludes to them as “easy ways out”. Now Neyman always distinguishes his error statistical performance conception from Bayesian and Fiducial probabilisms [ii]. The surprising twist here is semantical and the culprit is none other than…Allan Birnbaum. Yet Birnbaum gets short shrift, and no mention is made of our favorite “breakthrough” (or did I miss it?).

drawn by his wife,Olga

Note: In this article,”attacks” on various statistical “fronts” refers to ways of attacking problems in one or another statistical research program.
HAPPY BIRTHDAY WEEK FOR NEYMAN! Continue reading

Categories: Bayesian/frequentist, Error Statistics, Neyman | Leave a comment

Jerzy Neyman and “Les Miserables Citations” (statistical theater in honor of his birthday yesterday)

images-14

Neyman April 16, 1894 – August 5, 1981

My second Jerzy Neyman item, in honor of his birthday, is a little play that I wrote for Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018):

A local acting group is putting on a short theater production based on a screenplay I wrote:  “Les Miserables Citations” (“Those Miserable Quotes”) [1]. The “miserable” citations are those everyone loves to cite, from their early joint 1933 paper:

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.

But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. (Neyman and Pearson 1933, pp. 290-1).

Continue reading

Categories: E.S. Pearson, Neyman, Statistics | Leave a comment

A. Spanos: Jerzy Neyman and his Enduring Legacy

Today is Jerzy Neyman’s birthday. I’ll post various Neyman items this week in recognition of it, starting with a guest post by Aris Spanos. Happy Birthday Neyman!

A. Spanos

A Statistical Model as a Chance Mechanism
Aris Spanos 

Jerzy Neyman (April 16, 1894 – August 5, 1981), was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals. (This article was first posted here.)

Neyman: 16 April

Neyman: 16 April 1894 – 5 Aug 1981

One of Neyman’s most remarkable, but least recognized, achievements was his adapting of Fisher’s (1922) notion of a statistical model to render it pertinent for  non-random samples. Fisher’s original parametric statistical model Mθ(x) was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data x0:=(x1,x2,…,xn) can be viewed as a ‘truly representative sample’ from that ‘population’: Continue reading

Categories: Neyman, Spanos | Leave a comment

Several reviews of Deborah Mayo’s new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars « Statistical Modeling, Causal Inference, and Social Science

Source: Several reviews of Deborah Mayo’s new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars « Statistical Modeling, Causal Inference, and Social Science

Categories: Error Statistics | Leave a comment

Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt

.

For the first time, I’m excerpting all of Excursion 1 Tour II from SIST (2018, CUP).

1.4 The Law of Likelihood and Error Statistics

If you want to understand what’s true about statistical inference, you should begin with what has long been a holy grail–to use probability to arrive at a type of logic of evidential support–and in the first instance you should look not at full-blown Bayesian probabilism, but at comparative accounts that sidestep prior probabilities in hypotheses. An intuitively plausible logic of comparative support was given by the philosopher Ian Hacking (1965)–the Law of Likelihood. Fortunately, the Museum of Statistics is organized by theme, and the Law of Likelihood and the related Likelihood Principle is a big one. Continue reading

Categories: Error Statistics, law of likelihood, SIST | 2 Comments

there’s a man at the wheel in your brain & he’s telling you what you’re allowed to say (not probability, not likelihood)

It seems like every week something of excitement in statistics comes down the pike. Last week I was contacted by Richard Harris (and 2 others) about the recommendation to stop saying the data reach “significance level p” but rather simply say

“the p-value is p”.

(For links, see my previous post.) Friday, he wrote to ask if I would comment on a proposed restriction (?) on saying a test had high power! I agreed that we shouldn’t say a test has high power, but only that it has a high power to detect a specific alternative, but I wasn’t aware of any rulings from those in power on power. He explained it was an upshot of a reexamination by a joint group of the boards of statistical associations in the U.S. and UK. of the full panoply of statistical terms. Something like that. I agreed to speak with him yesterday. He emailed me the proposed ruling on power: Continue reading

Categories: Bayesian/frequentist | 5 Comments

Diary For Statistical War Correspondents on the Latest Ban on Speech

When science writers, especially “statistical war correspondents”, contact you to weigh in on some article, they may talk to you until they get something spicy, and then they may or may not include the background context. So a few writers contacted me this past week regarding this article (“Retire Statistical Significance”)–a teaser, I now suppose, to advertise the ASA collection growing out of that conference “A world beyond P ≤ .05” way back in Oct 2017, where I gave a paper*. I jotted down some points, since Richard Harris from NPR needed them immediately, and I had just gotten off a plane when he emailed. He let me follow up with him, which is rare and greatly appreciated. So I streamlined the first set of points, and dropped any points he deemed technical. I sketched the third set for a couple of other journals who contacted me, who may or may not use them. Here’s Harris’ article, which includes a couple of my remarks. Continue reading

Categories: ASA Guide to P-values, P-values | 40 Comments

1 Days to Apply for the Summer Seminar in Phil Stat

Go to the website for instructions: SummerSeminarPhilStat.com.

Categories: Summer Seminar in PhilStat | 1 Comment

S. Senn: To infinity and beyond: how big are your data, really? (guest post)

.

 

Stephen Senn
Consultant Statistician
Edinburgh

What is this you boast about?

Failure to understand components of variation is the source of much mischief. It can lead researchers to overlook that they can be rich in data-points but poor in information. The important thing is always to understand what varies in the data you have, and to what extent your design, and the purpose you have in mind, master it. The result of failing to understand this can be that you mistakenly calculate standard errors of your estimates that are too small because you divide the variance by an n that is too big. In fact, the problems can go further than this, since you may even pick up the wrong covariance and hence use inappropriate regression coefficients to adjust your estimates.

I shall illustrate this point using clinical trials in asthma. Continue reading

Categories: Lord's paradox, S. Senn | 8 Comments

Blog at WordPress.com.