“The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(i)

Some have asked me why I haven’t blogged on the recent follow-up to the ASA Statement on P-Values and Statistical Significance (Wasserstein and Lazar 2016)–hereafter, ASA I. They’re referring to the editorial by Wasserstein, R., Schirm, A. and Lazar, N. (2019)–hereafter, ASA II–opening a special on-line issue of over 40 contributions responding to the call to describe “a world beyond P < 0.05”.[1] Am I falling down on the job? Not really. All of the issues are thoroughly visited in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, SIST (2018, CUP). I invite interested readers to join me on the statistical cruise therein[2]. As the ASA II authors observe: “At times in this editorial and the papers you’ll hear deep dissonance, the echoes of ‘statistics wars’ still simmering today (Mayo 2018)”. True, and reluctance to reopen old wounds has only allowed them to fester. However, I will admit that when new attempts at reform are put forward, a philosopher of science who has written on the statistics wars ought to weigh in on the specific prescriptions/proscriptions, especially when a jumble of fuzzy conceptual issues is interwoven through a cacophony of competing reforms. (My published comment on ASA I, “Don’t Throw Out the Error Control Baby With the Bad Statistics Bathwater”, is here.)

So I should say something. But the task is delicate. And painful. Very. I should start by asking: What is it (i.e., what is it actually saying)? Then I can offer some constructive suggestions.

The Invitation to Broader Consideration and Debate

The papers in this issue propose many new ideas, ideas that in our determination as editors merited publication to enable broader consideration and debate. The ideas in this editorial are likewise open to debate. (ASA II, p. 1)

The questions around reform need consideration and debate. (p. 9)

Excellent! A broad, open, critical debate is sorely needed. Still, we can only debate something when there is a degree of clarity as to what “it” is. I will be very happy to post readers’ meanderings on ASA II (~1000 words) if you send them to me.

My focus here is just on the intended positions of the ASA, not the summaries of articles. This comprises around the first 10 pages. Even from just the first few pages the reader is met with some noteworthy declarations:

♦ Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof) (p. 1)

♦ No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p.2)

♦ A declaration of statistical significance is the antithesis of thoughtfulness. (p. 4)

♦ Whether a p-value passes any arbitrary threshold should not be considered at all when deciding which results to present or highlight. (p. 2, my emphasis)

♦ It is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive. (p.2)

♦ “Statistically significant”– don’t say it and don’t use it (p. 2)

(Wow!)

I am very sympathetic with the concerns about rigid cut-offs, and fallacies of moving from statistical significance to substantive scientific claims. I feel as if I’ve just written a whole book on it! I say, on p. 10 of SIST:

In formal statistical testing, the crude dichotomy of “pass/fail” or “significant or not” will scarcely do. We must determine the magnitudes (and directions) of any statistical discrepancies warranted, and the limits to any substantive claims you may be entitled to infer from the statistical ones.

Since ASA II will still use P-values, you’re bound to wonder why a user wouldn’t just report “the difference is statistically significant at the P-value attained”. (The P-value is the probability of observing a difference as large as or larger than the one observed, under the assumption that chance variability alone is operating.) Confidence intervals (CIs) are already routinely given alongside P-values. So there is clearly more to the current movement than meets the eye. But for now I’m just trying to decipher what the ASA position is.
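For concreteness, here is a minimal sketch of such a report (in Python, assuming scipy is available; the null value, σ, n, and observed mean are illustrative assumptions of mine, not numbers from ASA I or ASA II): the attained P-value for an observed difference, with the two-sided 95% confidence interval that would routinely be given alongside it.

```python
# Minimal sketch: the attained P-value for an observed difference from a null value,
# with the 95% CI routinely reported alongside. Numbers are illustrative assumptions.
from math import sqrt
from scipy.stats import norm

mu0, sigma, n = 0.0, 1.0, 100        # null value, known sd, sample size (assumed)
xbar = 0.25                          # observed sample mean (assumed)
se = sigma / sqrt(n)                 # standard error = 0.1

z = (xbar - mu0) / se                # standardized difference
p = norm.sf(z)                       # P(difference at least this large; chance variability alone)
ci = (xbar - 1.96 * se, xbar + 1.96 * se)   # two-sided 95% confidence interval

print(f"Difference is statistically significant at the attained P-value {p:.4f} (z = {z:.2f})")
print(f"95% CI for mu: ({ci[0]:.3f}, {ci[1]:.3f})")
```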

What’s the Relationship Between ASA I and ASA II?

I assume, for this post, that ASA II is intended to be an extension of ASA I. In that case, it would subsume the 6 principles of ASA I. There is evidence for this. For one thing, it begins by sketching a “sampling” of “don’ts” from ASA I, for those who are new to the debate. Secondly, it recommends that ASA I be widely disseminated. But some Principles (1, 4) are apparently missing[3], and others are rephrased in ways that alter the initial meanings. Do they really mean these declarations as written? Let us try to take them at their word.

But right away we are struck with a conflict with Principle 1 of ASA I–which happens to be the only positive principle given. (See Note 5 for the six Principles of ASA I.)

Principle 1. P-values can indicate how incompatible the data are with a specified statistical model.

A p-value provides one approach to summarizing the incompatibility between a particular set of data and a proposed model for the data. The most common context is a model, constructed under a set of assumptions, together with a so-called “null hypothesis.” Often the null hypothesis postulates the absence of an effect, such as no difference between two groups, or the absence of a relationship between a factor and an outcome. The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions. (ASA I, p. 131)

However, an indication of how incompatible data are with a claim of the absence of a relationship between a factor and an outcome would be an indication of the presence of the relationship; and providing evidence against a claim of no difference between two groups would often be of scientific or practical importance.

So, Principle 1 (from ASA I) doesn’t appear to square with the first bulleted item I listed (from ASA II):

(1) “Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof)” (p.1, ASA II).

Either modify (1) or erase Principle 1. But if you erase all thresholds for finding incompatibility (whether using P-values or other measures), there are no tests, and no falsifications, even of the statistical kind.

My understanding (from Ron Wasserstein) is that this bullet is intended to correspond to Principle 5 in ASA I – that P-values do not give population effect sizes. But it is now saying something stronger (at least to my ears and to everyone else I’ve asked). Do the authors mean to be saying that nothing (of scientific or practical importance) can be learned from statistical significance tests? I think not.

So, my first recommendation is:

Replace (1) with:

“Don’t conclude anything about the scientific or practical importance of the (population) effect size based only on statistical significance (or lack thereof).”

Either that, or simply stick to Principle 5 from ASA I: “A p-value, or statistical significance[4], does not measure the size of an effect or the importance of a result.” (p. 132) This statement is, strictly speaking, a tautology, true by the definitions of terms: probability isn’t itself a measure of the size of a (population) effect. However, you can use statistically significant differences to infer what the data indicate about the size of the (population) effect.[4]

My second friendly amendment concerns the second bulleted item:

(2) No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p. 2)

Focus just on “presence”. From this assertion it would seem to follow that no P-values[5], however small, even from well-controlled trials, can reveal the presence of an association or effect–and that is too strong. Again, we get a conflict with Principle 1 from ASA I. But I’m guessing, for now, the authors do not intend to say this. If you don’t mean it, don’t say it.

So, my second recommendation is to replace (2) with:

“No p-value by itself can reveal the plausibility, presence, truth, or importance of an association or effect.”

Without this friendly amendment, ASA II is at loggerheads with ASA I, and the authors should not be advocating those six principles without changing one or the other (or both). Without this or a similar modification, moreover, any other statistical quantity or evidential measure is likewise unable to reveal these things. Or so many would argue. These modest revisions might prevent some readers from stopping after the first few pages, and that would be a shame, as they would miss the many right-headed insights about linking statistical and scientific inference.

This leads to my third bulleted item from ASA II:

(3) A declaration of statistical significance is the antithesis of thoughtfulness… it ignores what previous studies have contributed to our knowledge. (p. 4)

Surely the authors do not mean to say that anyone who asserts the observed difference is statistically significant at level p has her hands tied and invariably ignores all previous studies, background information, and theories in planning and reaching conclusions, decisions, and proposed solutions to problems. I’m totally on board with the importance of background knowledge, and with the multiple steps relating data to scientific claims and problems:

The error statistician begins with a substantive problem or question. She jumps in and out of piecemeal statistical tests both formal and quasi-formal. The pieces are integrated in building up arguments from coincidence, informing background theory, self-correcting via blatant deceptions, in an iterative movement. The inference is qualified by using error probabilities to determine not “how probable,” but rather, “how well-probed” claims are, and what has been poorly probed. (SIST, p. 162)

But good inquiry is piecemeal: There is no reason to suppose one does everything at once in inquiry, and it seems clear from the ASA II guide that the authors agree. Since I don’t think they literally mean (3), why say it?

Practitioners who use these methods in medicine and elsewhere have detailed protocols for how background knowledge is employed in designing, running, and interpreting tests. When medical researchers specify primary outcomes, for just one example, it’s very explicitly with due regard for the mechanism of drug action: the primary outcome is intended as the most direct way to pick up on the drug’s mechanism. Finding incompatibility using P-values inherits the meaning already attached to a sensible test hypothesis. That valid P-values require context is even in the very important Principle 4 of ASA I (see note [3]).

As lawyer Nathan Schachtman observes, in a recent conversation on ASA II:

By the time a phase III clinical trial is being reviewed for approval, there is a mountain of data on pharmacology, pharmacokinetics, mechanism, target organ, etc. If Wasserstein wants to suggest that there are some people who misuse or misinterpret p-values, fine. The principle of charity requires that we give a more sympathetic reading to the broad field of users of statistical significance testing. (Schachtman 2019)

Now it is possible the authors are saying a reported P-value can never be thoughtful because thoughtfulness requires that a statistical measure, at any stage of probing, incorporate everything we know (SIST dubs this “big picture” inference.) Do we want that? Or maybe (3) is their way of saying a statistical measure must incorporate background beliefs in the manner of Bayesian degree-of-belief (?) priors. Many would beg to differ, including some leading Bayesians. Andrew Gelman (2012) has suggested that ‘Bayesians Want Everybody Else to be Non-Bayesian’:

Bayesian inference proceeds by taking the likelihoods from different data sources and then combining them with a prior (or, more generally, a hierarchical model). The likelihood is key. . .  No funny stuff, no posterior distributions, just the likelihood. . . I don’t want everybody coming to me with their posterior distribution – I’d just have to divide away their prior distributions before getting to my own analysis. (ibid., p. 54)

So, my third recommendation is to replace (3) with (something like):

“failing to report anything beyond a declaration of statistical significance is the antithesis of thoughtfulness.”

There’s much else that bears critical analysis and debate in ASA II; I’ll come back to it. I hope to hear from the authors of ASA II about my very slight, constructive amendments (to avoid a conflict with Principle 1).

Meanwhile, I fear we will see court cases piling up denying that anyone can be found culpable for abusing p-values and significance tests, since the ASA declared that all p-values are arbitrary, and whether predesignated thresholds are honored or breached should not be considered at all. (This was already happening based on ASA I.)[6] 

Please share your thoughts, and point out any errors, in the comments. I will indicate later drafts of this post with (i), (ii), … Do send me other articles you find discussing this.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 

References:

Gelman, A. (2012). “Ethics and the Statistical Use of Prior Information”. http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics5.pdf

Mayo, D. (2016). “Don’t Throw out the Error Control Baby with the Bad Statistics Bathwater: A Commentary” on R. Wasserstein and N. Lazar: “The ASA’s Statement on P-values: Context, Process, and Purpose”, The American Statistician 70(2).

Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.

Schachtman, N.  (2019).  (private communication)

Wasserstein, R. and Lazar, N. (2016). “The ASA’s Statement on P-values: Context, Process and Purpose”, (and supplemental materials), The American Statistician 70(2), 129–33. (ASA I)

Wasserstein, R., Schirm, A. and Lazar, N. (2019) Editorial: “Moving to a World Beyond ‘p < 0.05’”, The American Statistician 73(S1): 1-19. (ASA II)

NOTES
[1] I gave an invited paper at the conference (“A World Beyond…”) out of which the idea for this volume grew. I was in a session with a few other exiles, describing the contexts where statistical significance tests are of value. I was too much involved in completing my book to write up my paper for this volume, and neither did the others in our small group. Links to my slides and Yoav Benjamini’s are below. I did post notes to journalists on the Amrhein article here.

 

[2] Excerpts and mementos from SIST are here.

 

[3] Principle 4 of ASA I asserts that “proper inference requires full reporting and transparency”: P-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain p-values (typically those passing a significance threshold) renders the reported p-values essentially uninterpretable. … Whenever a researcher chooses what to present based on statistical results, valid interpretation of those results is severely compromised if the reader is not informed of the choice and its basis. Researchers should disclose the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted, and all p-values computed. (pp. 131–2)

 

[4] Consider, for example, a two-sided (symmetric) 95% confidence interval estimate of a Normal mean: [a, b]. This information can also be given in terms of observed significance levels.
  • CI-lower is the (parameter) value that the data x are just statistically significantly greater than, at the 0.025 level.
  • CI-upper is the (parameter) value that the data x are just statistically significantly smaller than, at the 0.025 level.
There’s a clear duality between statistical significance tests and confidence intervals. (The CI contains those parameter values that would not be rejected at the corresponding significance level, were they the hypotheses under test.) CIs were developed by the same man who co-developed Neyman-Pearson (N-P) tests in the same years (~1930): Jerzy Neyman. There are other ways to get indicated effect sizes such as with (attained) power analysis and the P-value distribution over different values of the parameter. The goal of assessing how severely tested a claim is serves to direct this analysis (Mayo 2018). However, the mathematical computations are well-known (see Fraser’s article in the collection), and continue to be extended in work on Confidence Distributions. See this blog or SIST for references.
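A minimal numerical sketch of this duality (in Python with scipy; the observed mean, σ, and n are illustrative assumptions of mine): testing the parameter values at the two CI bounds returns observed significance levels of roughly 0.025 each.

```python
# Sketch of the duality in Note [4]: the 95% CI bounds are the parameter values the data
# are "just" significantly greater/smaller than at the 0.025 level. Numbers are assumptions.
from math import sqrt
from scipy.stats import norm

xbar, sigma, n = 152.0, 10.0, 100    # observed mean, known sd, sample size (assumed)
se = sigma / sqrt(n)                 # = 1

a, b = xbar - 1.96 * se, xbar + 1.96 * se    # two-sided 95% CI [a, b]

p_lower = norm.sf((xbar - a) / se)   # P(X-bar >= observed; mu = a): "greater than" significance
p_upper = norm.cdf((xbar - b) / se)  # P(X-bar <= observed; mu = b): "smaller than" significance

print(f"95% CI: [{a:.2f}, {b:.2f}]")
print(f"Observed significance level testing mu = CI-lower: {p_lower:.3f}")   # ~0.025
print(f"Observed significance level testing mu = CI-upper: {p_upper:.3f}")   # ~0.025
```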
      However, confidence intervals as currently used in reform movements inherit many of the weaknesses of N-P tests: they are dichotomous (inside/outside), adhere to a single confidence level, and are justified merely with a long-run performance (or coverage) rationale. By considering the P-values associated with different hypotheses (corresponding to parameter values in the interval), one can scotch all of these weaknesses. 
      It is often claimed that anything tests can do CIs do better (sung to the tune of “Annie Get Your Gun”). Not so. (See SIST p. 356). It is odd and ironic that psychologists urging us to use CIs depict statistical tests as exclusively of the artificial “simple” Fisherian variety, with a “nil” null and no explicit alternative, given how Paul Meehl chastised this tendency donkey’s years ago, and given that Jacob Cohen advanced power analysis.
A salutary effect of power analysis is that it draws one forcibly to consider the magnitude of effects. In psychology, and especially in soft psychology, under the sway of the Fisherian scheme, there has been little consciousness of how big things are. (Cohen 1990, p. 1309)
See SIST, p. 323. For links to all of excursion 5 on power, see this post.
      Of course, the beauty of the simple Fisherian test shows itself when there is no explicit alternative, as when testing assumptions of models–models that all the alternative statistical methods on offer also employ. ASA I also limits itself to the simple Fisherian test: “To keep the statement reasonably simple, we did not address alternative hypotheses, error types, or power…” (p. 130)

 

[5] I assume they intend to make claims about valid P-values, not those that are discredited by failing “audits”, due either to violated assumptions or to multiple testing and other selection effects, as described in Principle 4 of ASA I. The six principles of ASA I (2016), largely unexceptional, are:
  • P-values can indicate how incompatible the data are with a specified statistical model.
  • P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  • Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  • Proper inference requires full reporting and transparency.
  • A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  • By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

 

[6] Just because P-values form a continuum, it doesn’t follow that we can’t use very high and very low P-values to distinguish rather lousy from fairly well-indicated discrepancies. Beware the “Fallacy of the Continuum”. Would anyone use a confidence level of 0.5 or 0.6?
Categories: ASA Guide to P-values, Statistics | 5 Comments

(Full) Excerpt. Excursion 5 Tour II: How Not to Corrupt Power (Power Taboos, Retro Power, and Shpower)


returned from London…

The concept of a test’s power is still being corrupted in the myriad ways discussed in 5.5 and 5.6. I’m excerpting all of Tour II of Excursion 5, as I did with Tour I (of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars 2018, CUP)*. Originally the two Tours comprised just one, but in finalizing corrections I decided that the two together made too long a slog, and I split them up. Because this was done at the last minute, some of the terms in Tour II rely on their introductions in Tour I. Here’s how it starts:

5.5 Power Taboos, Retrospective Power, and Shpower

Let’s visit some of the more populous tribes who take issue with power – by which we mean ordinary power – at least its post-data uses. Power Peninsula is often avoided due to various “keep out” warnings and prohibitions, or researchers come during planning, never to return. Why do some people consider it a waste of time, if not totally taboo, to compute power once we know the data? A degree of blame must go to N-P, who emphasized the planning role of power, and only occasionally mentioned its use in determining what gets “confirmed” post-data. After all, it’s good to plan how large a boat we need for a philosophical excursion to the Lands of Overlapping Statistical Tribes, but once we’ve made it, it doesn’t matter that the boat was rather small. Or so the critic of post-data power avers. A crucial disanalogy is that with statistics, we don’t know that we’ve “made it there,” when we arrive at a statistically significant result. The statistical significance alarm goes off, but you are not able to see the underlying discrepancy that generated the alarm you hear. The problem is to make the leap from the perceived alarm to an aspect of a process, deep below the visible ocean, responsible for its having been triggered. Then it is of considerable relevance to exploit information on the capability of your test procedure to result in alarms going off (perhaps with different decibels of loudness), due to varying values of the parameter of interest. There are also objections to power analysis with insignificant results.

Exhibit (vi): Non-significance + High Power Does Not Imply Support for the Null over the Alternative. Sander Greenland (2012) has a paper with this title. The first step is to understand the assertion, giving the most generous interpretation. It deals with non-significance, so our ears are perked for a fallacy of non-rejection. Second, we know that “high power” is an incomplete concept, so he clearly means high power against “the alternative.” We have a handy example: alternative μ.84 in T+ (POW(T+, μ.84) = 0.84).

Note to blog reader: μ.84 abbreviates “the alternative against which the test has 0.84 power.” This general abbreviation was introduced in Tour I. 

Use the water plant case, T+: H0: μ ≤ 150 vs. H1: μ > 150, σ = 10, n = 100. With α = 0.025, z_0.025 = 1.96, the corresponding cut-off in terms of x̄_0.025 is 150 + 1.96(10)/√100 = 151.96, and μ.84 = 152.96.

Now a title like this is supposed to signal a problem, a reason for those “keep out” signs. His point, in relation to this example, boils down to noting that an observed difference may not be statistically significant – x̄ may fail to make it to the cut-off x̄_0.025 – and yet be closer to μ.84 than to the null value, 150. This happens because the Type II error probability β (here, 0.16)¹ is greater than the Type I error probability (0.025).

For a quick computation, let x̄_0.025 = 152 and μ.84 = 153. Halfway between the alternative 153 and the 150 null is 151.5. Any observed mean greater than 151.5 but less than the x̄_0.025 cut-off, 152, will be an example of Greenland’s phenomenon: such values are closer to 153, the alternative against which the test has 0.84 power, than to 150, and thus, by a likelihood measure, support 153 more than 150 – even though POW(μ = 153) is high (0.84). Having established the phenomenon, your next question is: so what?

It would be problematic if power analysis took the insignificant result as evidence for μ = 150 – maintaining compliance with the ecological stipulation – and I don’t doubt some try to construe it as such, nor that Greenland has been put in the position of needing to correct them. Power analysis merely licenses μ ≤ μ.84, where 0.84 was chosen for “high power.” Glance back at Souvenir X. So at least one of the “keep out” signs can be removed.
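For readers who want to check the numbers in this exhibit, here is a minimal sketch (in Python with scipy; my own illustrative code, not from SIST) reproducing the cut-off, μ.84, the power, and an instance of Greenland’s phenomenon. The particular observed mean, 151.7, is an assumption chosen to fall between 151.5 and 152.

```python
# Numerical check of Exhibit (vi): water plant test T+ (H0: mu <= 150 vs H1: mu > 150),
# sigma = 10, n = 100, alpha = 0.025. Illustrative code, not from SIST.
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, alpha = 150.0, 10.0, 100, 0.025
se = sigma / sqrt(n)                          # = 1
cutoff = mu0 + norm.ppf(1 - alpha) * se       # x-bar cut-off ~ 151.96
mu_84 = cutoff + norm.ppf(0.84) * se          # alternative with power ~0.84 (text rounds to 152.96)

power = norm.sf((cutoff - mu_84) / se)        # POW(T+, mu_.84) ~ 0.84
beta = 1 - power                              # Type II error ~ 0.16

# Greenland's phenomenon, using the rounded values 152 and 153 from the text:
xbar = 151.7                                  # an assumed observed mean between 151.5 and 152
significant = xbar >= cutoff                  # False: fails to reach the cut-off
closer_to_alt = abs(xbar - 153) < abs(xbar - 150)   # True: closer to 153 than to 150

print(f"cut-off = {cutoff:.2f}, mu_.84 = {mu_84:.2f}, power = {power:.2f}, beta = {beta:.2f}")
print(f"xbar = {xbar}: significant? {significant}; closer to 153 than to 150? {closer_to_alt}")
```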

All of Excursion 5 Tour II (in proofs) is here.

Notes:

1 That is, β(μ.84) = Pr(d < 0.4; μ = 0.6) = Pr(Z < −1) = 0.16; in the water plant scaling, this is β(μ.84) = Pr(X̄ < 151.96; μ = 152.96) = Pr(Z < −1) = 0.16.

 

_____________

*This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).

It is still valuable to look at the discussions and comments under “power” and “shpower” on this blog.

Earlier excerpts and mementos from SIST (May 2018-May 2019) are here.

 

Where YOU are in the journey.

 

 

 

 

 

Categories: fallacy of non-significance, power, Statistical Inference as Severe Testing | Leave a comment

Don’t let the tail wag the dog by being overly influenced by flawed statistical inferences


An article [i], “There is Still a Place for Significance Testing in Clinical Trials,” appearing recently in Clinical Trials, while very short, effectively responds to recent efforts to stop error statistical testing [ii]. We need more of this. Much more. The emphasis in this excerpt is mine:

Much hand-wringing has been stimulated by the reflection that reports of clinical studies often misinterpret and misrepresent the findings of the statistical analyses. Recent proposals to address these concerns have included abandoning p-values and much of the traditional classical approach to statistical inference, or dropping the concept of statistical significance while still allowing some place for p-values. How should we in the clinical trials community respond to these concerns? Responses may vary from bemusement, pity for our colleagues working in the wilderness outside the relatively protected environment of clinical trials, to unease about the implications for those of us engaged in clinical trials….

However, we should not be shy about asserting the unique role that clinical trials play in scientific research. A clinical trial is a much safer context within which to carry out a statistical test than most other settings. Properly designed and executed clinical trials have opportunities and safeguards that other types of research do not typically possess, such as protocolisation of study design; scientific review prior to commencement; prospective data collection; trial registration; specification of outcomes of interest including, importantly, a primary outcome; and others. For randomised trials, there is even more protection of scientific validity provided by the randomisation of the interventions being compared. It would be a mistake to allow the tail to wag the dog by being overly influenced by flawed statistical inferences that commonly occur in less carefully planned settings….

The carefully designed clinical trial based on a traditional statistical testing framework has served as the benchmark for many decades. It enjoys broad support in both the academic and policy communities. There is no competing paradigm that has to date achieved such broad support. The proposals for abandoning p-values altogether often suggest adopting the exclusive use of Bayesian methods. For these proposals to be convincing, it is essential their presumed superior attributes be demonstrated without sacrificing the clear merits of the traditional framework. Many of us have dabbled with Bayesian approaches and find them to be useful for certain aspects of clinical trial design and analysis, but still tend to default to the conventional approach notwithstanding its limitations. While attractive in principle, the reality of regularly using Bayesian approaches on important clinical trials has been substantially less appealing – hence their lack of widespread uptake.

The issues that have led to the criticisms of conventional statistical testing are of much greater concern where statistical inferences are derived from observational data. … Even when the study is appropriately designed, there is also a common converse misinterpretation of statistical tests whereby the investigator incorrectly infers and reports that a non-significant finding conclusively demonstrates no effect. However, it is important to recognise that an appropriately designed and powered clinical trial enables the investigators to potentially conclude there is ‘no meaningful effect’ for the principal analysis.[iii]  More generally, these problems are largely due to the fact that many individuals who perform statistical analyses are not sufficiently trained in statistics. It is naive to suggest that banning statistical testing and replacing it with greater use of confidence intervals, or Bayesian methods, or whatever, will resolve any of these widespread interpretive problems. Even the more modest proposal of dropping the concept of ‘statistical significance’ when conducting statistical tests could make things worse. By removing the prespecified significance level, typically 5%, interpretation could become completely arbitrary. It will also not stop data-dredging, selective reporting, or the numerous other ways in which data analytic strategies can result in grossly misleading conclusions.

You can read the full article here.

We may reject, with reasonable severity, the claim that promoting correctly interpreted statistical tests is the real goal of the most powerful leaders of the movement to Stop Statistical Tests. The goal is just to stop (error) statistical tests altogether.[iv] That today’s CI leaders advance this goal is especially unwarranted and self-defeating, in that confidence intervals are just inversions of N-P tests, and were developed at the same time by the same man (Neyman) who developed (with E. Pearson) the theory of error statistical tests. See this recent post.

Reader: I’ve placed on draft a number of posts while traveling in England over the past week, but haven’t had the chance to read them over, or find pictures for them. This will soon change, so stay tuned!

*****************************************************************

[i] Jonathan A. Cook, Dean A. Fergusson, Ian Ford, Mithat Gonen, Jonathan Kimmelman, Edward L. Korn and Colin B. Begg (2019). “There is still a place for significance testing in clinical trials”, Clinical Trials 16(3): 223–224.

[ii] Perhaps we should call those driven to Stop Error Statistical Tests “Obsessed”. I thank Nathan Schachtman for sending me the article.

[iii] It’s disappointing how many critics of tests seem unaware of this simple power analysis point, and of how it avoids egregious fallacies of non-rejection (or of a moderate P-value). It precisely follows simple significance test reasoning. The severity account that I favor gives a more custom-tailored approach that is sensitive to the actual outcome. (See, for example, Excursion 5 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).)

[iv] Bayes factors, like other comparative measures, are not “tests”, and do not falsify (even statistically). They can only say one hypothesis or model is better than a selected other hypothesis or model, based on some^ selected criteria. They can both (all) be improbable, unlikely, or terribly tested. One can always add a “falsification rule”, but it must be shown that the resulting test avoids frequently passing/failing claims erroneously. 

^The Anti-Testers would have to say “arbitrary criterion”, to be consistent with their considering any P-value “arbitrary”, and denying that a statistically significant difference, reaching any P-value, indicates a genuine difference from a reference hypothesis.

 

 

 

Categories: statistical tests | Leave a comment

SIST: All Excerpts and Mementos: May 2018-May 2019

view from a hot-air balloon

Introduction & Overview

The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* 05/19/18

Blurbs of 16 Tours: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST) 03/05/19

 

Excursion 1

EXCERPTS

Tour I

Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1) 09/08/18

Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2) 09/11/18

Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3) 09/15/18

Tour II

Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt 04/04/19

Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II) 11/08/18

MEMENTOS

Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars) 10/29/18

 

Excursion 2

EXCERPTS

Tour I

Excursion 2: Taboos of Induction and Falsification: Tour I (first stop) 09/29/18

“It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (Keepsake by Fisher, 2.1) 10/05/18

Tour II

Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3) 10/10/18

MEMENTOS

Tour Guide Mementos and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation) 11/14/18

Mementos for Excursion 2 Tour II Falsification, Pseudoscience, Induction 11/17/18

 

Excursion 3

EXCERPTS

Tour I

Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3 11/30/18

Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2) 12/01/18

First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3] 12/04/18

Tour II

It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP) 12/11/18

60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 tour II. 12/29/18

Tour III

Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III 12/20/18

MEMENTOS

Memento & Quiz (on SEV): Excursion 3, Tour I 12/08/18

Mementos for “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6) 12/13/18

Tour Guide Mementos From Excursion 3 Tour III: Capability and Severity: Deeper Concepts 12/26/18

 

Excursion 4

EXCERPTS

Tour I

Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP) 12/26/18

Tour II

Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?” 01/10/19

Tour IV

Excerpt from Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking 01/27/19

MEMENTOS

Mementos from Excursion 4: Blurbs of Tours I-IV 01/13/19

 

Excursion 5

Tour I

(full) Excerpt: Excursion 5 Tour I — Power: Pre-data and Post-data (from “SIST: How to Get Beyond the Stat Wars”) 04/27/19

Tour III

Deconstructing the Fisher-Neyman conflict wearing Fiducial glasses + Excerpt 5.8 from SIST 02/23/19

 

Excursion 6

Tour II

Excerpts: Souvenir Z: Understanding Tribal Warfare +  6.7 Farewell Keepsake from SIST + List of Souvenirs 05/04/19

*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).

Categories: SIST, Statistical Inference as Severe Testing | Leave a comment

Excerpts: Final Souvenir Z, Farewell Keepsake & List of Souvenirs


We’ve reached our last Tour (of SIST)*: Pragmatic and Error Statistical Bayesians (Excursion 6), marking the end of our reading with Souvenir Z, the final Souvenir, as well as the Farewell Keepsake in 6.7. Our cruise ship StatInfasST, currently here at Thebes, will be back at dock for maintenance before our next launch at the Summer Seminar in Phil Stat (July 28-Aug 11). Although it’s not my preference that new readers begin with the Farewell Keepsake (it contains a few spoilers), I’m excerpting it together with Souvenir Z (and a list of all souvenirs A – Z) here, and invite all interested readers to peer in. There’s a checklist on p. 437: if you’re in the market for a new statistical account, you’ll want to test whether it satisfies the items on the list. Have fun!

Souvenir Z: Understanding Tribal Warfare

We began this tour asking: Is there an overarching philosophy that “matches contemporary attitudes”? More important is changing attitudes. Not to encourage a switch of tribes, or even a tribal truce, but something more modest and actually achievable: to understand and get beyond the tribal warfare. To understand them, at minimum, requires grasping how the goals of probabilism differ from those of probativeness. This leads to a way of changing contemporary attitudes that is bolder and more challenging. Snapshots from the error statistical lens let you see how frequentist methods supply tools for controlling and assessing how well or poorly warranted claims are. All of the links, from data generation to modeling, to statistical inference and from there to substantive research claims, fall into place within this statistical philosophy. If this is close to being a useful way to interpret a cluster of methods, then the change in contemporary attitudes is radical: it has never been explicitly unveiled. Our journey was restricted to simple examples because those are the ones fought over in decades of statistical battles. Much more work is needed. Those grappling with applied problems are best suited to develop these ideas, and see where they may lead. I never promised, when you bought your ticket for this passage, to go beyond showing that viewing statistics as severe testing will let you get beyond the statistics wars.

6.7 Farewell Keepsake

Despite the eclecticism of statistical practice, conflicting views about the roles of probability and the nature of statistical inference – holdovers from long-standing frequentist–Bayesian battles – still simmer below the surface of today’s debates. Reluctance to reopen wounds from old battles has allowed them to fester. To assume all we need is an agreement on numbers – even if they’re measuring different things – leads to statistical schizophrenia. Rival conceptions of the nature of statistical inference show up unannounced in the problems of scientific integrity, irreproducibility, and questionable research practices, and in proposed methodological reforms. If you don’t understand the assumptions behind proposed reforms, their ramifications for statistical practice remain hidden from you.

Rival standards reflect a tension between using probability (a) to constrain the probability that a method avoids erroneously interpreting data in a series of applications (performance), and (b) to assign degrees of support, confirmation, or plausibility to hypotheses (probabilism). We set sail on our journey with an informal tool for telling what’s true about statistical inference: If little if anything has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a severe test. From this minimal severe-testing requirement, we develop a statistical philosophy that goes beyond probabilism and performance. The goals of the severe tester (probativism) arise in contexts sufficiently different from those of probabilism that you are free to hold both, for distinct aims (Section 1.2). For statistical inference in science, it is severity we seek. A claim passes with severity only to the extent that it is subjected to, and passes, a test that it probably would have failed, if false. Viewing statistical inference as severe testing alters long-held conceptions of what’s required for an adequate account of statistical inference in science. In this view, a normative statistical epistemology – an account of what’s warranted to infer – must be:

  • directly altered by biasing selection effects
  • able to falsify claims statistically
  • able to test statistical model assumptions
  • able to block inferences that violate minimal severity

These overlapping and interrelated requirements are disinterred over the course of our travels. This final keepsake collects a cluster of familiar criticisms of error statistical methods. They are not intended to replace the detailed arguments, pro and con, within; here we cut to the chase, generally keeping to the language of critics. Given our conception of evidence, we retain testing language even when the statistical inference is an estimation, prediction, or proposed answer to a question. The concept of severe testing is sufficiently general to apply to any of the methods now in use. It follows that a variety of statistical methods can serve to advance the severity goal, and that they can, in principle, find their foundations in an error statistical philosophy. However, each requires supplements and reformulations to be relevant to real-world learning. Good science does not turn on adopting any formal tool, and yet the statistics wars often focus on whether to use one type of test (or estimation, or model selection) or another. Meta-researchers charged with instigating reforms do not agree, but the foundational basis for the disagreement is left unattended. It is no wonder some see the statistics wars as proxy wars between competing tribe leaders, each keen to advance one or another tool, rather than about how to do better science. Leading minds are drawn into inconsequential battles, e.g., whether to use a prespecified cut-off of 0.025 or 0.0025 – when in fact good inference is not about cut-offs altogether but about a series of small-scale steps in collecting, modeling and analyzing data that work together to find things out. Still, we need to get beyond the statistics wars in their present form. By viewing a contentious battle in terms of a difference in goals – finding highly probable versus highly well probed hypotheses – readers can see why leaders of rival tribes often talk past each other. To be clear, the standpoints underlying the following criticisms are open to debate; we’re far from claiming to do away with them. What should be done away with is rehearsing the same criticisms ad nauseam. Only then can we hear the voices of those calling for an honest standpoint about responsible science.

1. NHST Licenses Abuses. First, there’s the cluster of criticisms directed at an abusive NHST animal: NHSTs infer from a single P-value below an arbitrary cut-off to evidence for a research claim, and they encourage P-hacking, fishing, and other selection effects. The reply: this ignores crucial requirements set by Fisher and other founders: isolated significant results are poor evidence of a genuine effect and statistical significance doesn’t warrant substantive, (e.g., causal) inferences. Moreover, selective reporting invalidates error probabilities. Some argue significance tests are un-Popperian because the higher the sample size, the easier to infer one’s research hypothesis. It’s true that with a sufficiently high sample size any discrepancy from a null hypothesis has a high probability of being detected, but statistical significance does not license inferring a research claim H. Unless H’s errors have been well probed by merely finding a small P-value, H passes an extremely insevere test. No mountains out of molehills (Sections 4.3 and 5.1). Enlightened users of statistical tests have rejected the cookbook, dichotomous NHST, long lampooned: such criticisms are behind the times. When well-intentioned aims of replication research are linked to these retreads, it only hurts the cause. One doesn’t need a sharp dichotomy to identify rather lousy tests – a main goal for a severe tester. Granted, policy-making contexts may require cut-offs, as do behavioristic setups. But in those contexts, a test’s error probabilities measure overall error control, and are not generally used to assess well-testedness. Even there, users need not fall into the NHST traps (Section 2.5). While attention to banning terms is the least productive aspect of the statistics wars, since NHST is not used by Fisher or N-P, let’s give the caricature its due and drop the NHST acronym; “statistical tests” or “error statistical tests” will do. Simple significance tests are a small part of a conglomeration of error statistical methods.

To continue reading: Excerpt Souvenir Z, Farewell Keepsake & List of Souvenirs can be found here.

*We are reading Statistical Inference as Severe Testing: How to Get beyond the Statistics Wars (2018, CUP)

***

 

Where YOU are in the journey.

 


Categories: SIST, Statistical Inference as Severe Testing | Leave a comment

(full) Excerpt: Excursion 5 Tour I — Power: Pre-data and Post-data (from “SIST: How to Get Beyond the Stat Wars”)

S.S. StatInfasST

It’s a balmy day today on Ship StatInfasST: An invigorating wind has a salutary effect on our journey. So, for the first time, I’m excerpting all of Excursion 5 Tour I (proofs) of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).

A salutary effect of power analysis is that it draws one forcibly to consider the magnitude of effects. In psychology, and especially in soft psychology, under the sway of the Fisherian scheme, there has been little consciousness of how big things are. (Cohen 1990, p. 1309)

So how would you use power to consider the magnitude of effects were you drawn forcibly to do so? In with your breakfast is an exercise to get us started on today’s shore excursion.

Suppose you are reading about a statistically significant result x (just at level α) from a one-sided test T+ of the mean of a Normal distribution with IID samples, and known σ: H0: μ ≤ 0 against H1: μ > 0. Underline the correct word, from the perspective of the (error statistical) philosophy, within which power is defined.

  • If the test’s power to detect μ′ is very low (i.e., POW(μ′) is low), then the statistically significant x is poor/good evidence that μ > μ′.
  • Were POW(μ′) reasonably high, the inference to μ > μ′ is reasonably/poorly warranted.
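One way to explore the exercise numerically is sketched below (Python with scipy; σ, n, and the two alternatives μ′ are illustrative assumptions of mine). For a result just significant at α, it computes the pre-data power POW(μ′) alongside a post-data, severity-type assessment of the claim μ > μ′, namely Pr(X̄ ≤ observed x̄; μ = μ′), as discussed in SIST; comparing the two quantities for a small and a large μ′ is the exercise.

```python
# A sketch for the exercise: test T+ (H0: mu <= 0 vs H1: mu > 0) with a result
# just statistically significant at alpha. Sigma, n, and the mu' values are assumptions.
from math import sqrt
from scipy.stats import norm

sigma, n, alpha = 1.0, 25, 0.025
se = sigma / sqrt(n)                          # = 0.2
cutoff = norm.ppf(1 - alpha) * se             # x-bar just significant at alpha (~0.39)
xbar = cutoff                                 # observed result "just at level alpha"

def POW(mu_prime):
    """Pre-data power of T+ against the alternative mu_prime."""
    return norm.sf((cutoff - mu_prime) / se)

def post_data(mu_prime):
    """Severity-type assessment of 'mu > mu_prime': Pr(X-bar <= observed x-bar; mu = mu_prime)."""
    return norm.cdf((xbar - mu_prime) / se)

for mu_prime in (0.1, 0.5):                   # a low-power and a high-power alternative
    print(f"mu' = {mu_prime}: POW(mu') = {POW(mu_prime):.2f}, "
          f"Pr(X-bar <= x-bar_obs; mu') = {post_data(mu_prime):.2f}")
```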

Continue reading

Categories: Statistical Inference as Severe Testing, Statistical power | Leave a comment

If you like Neyman’s confidence intervals then you like N-P tests

Neyman

Neyman, confronted with unfortunate news, would always say “too bad!” At the end of Jerzy Neyman’s birthday week, I cannot help imagining him saying “too bad!” as regards some twists and turns in the statistics wars. First, too bad Neyman-Pearson (N-P) tests aren’t in the ASA Statement (2016) on P-values: “To keep the statement reasonably simple, we did not address alternative hypotheses, error types, or power”. An especially aggrieved “too bad!” would be earned by the fact that those in love with confidence interval estimators don’t appreciate that Neyman developed them (in 1930) as a method with a precise interrelationship with N-P tests. So if you love CI estimators, then you love N-P tests! Continue reading

Categories: ASA Guide to P-values, CIs and tests, Neyman | Leave a comment

Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen

Neyman April 16, 1894 – August 5, 1981

I’ll continue to post Neyman-related items this week in honor of his birthday. This isn’t the only paper in which Neyman makes it clear he denies a distinction between a test of statistical hypotheses and a test of significance. He and E. Pearson also discredit the myth that the former is only allowed to report pre-data, fixed error probabilities, and is justified only by dint of long-run error control. Controlling the “frequency of misdirected activities” in the midst of finding something out, or solving a problem of inquiry, is, on the other hand, an epistemological goal. What do you think?

Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena
by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

I recommend, especially, the example on home ownership. Here are two snippets: Continue reading

Categories: Error Statistics, Neyman, Statistics | Tags: | Leave a comment

Neyman vs the ‘Inferential’ Probabilists


We celebrated Jerzy Neyman’s Birthday (April 16, 1894) last night in our seminar: here’s a pic of the cake.  My entry today is a brief excerpt and a link to a paper of his that we haven’t discussed much on this blog: Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making‘ [i] It’s chock full of ideas and arguments, but the one that interests me at the moment is Neyman’s conception of “his breakthrough”, in relation to a certain concept of “inference”.  “In the present paper” he tells us, “the term ‘inferential theory’…will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation …either a substitute a priori distribution [exemplified by the so called principle of insufficient reason] or a new measure of uncertainty” such as Fisher’s fiducial probability. So if you hear Neyman rejecting “inferential accounts” you have to understand it in this very specific way: he’s rejecting “new measures of confidence or diffidence”. Here he alludes to them as “easy ways out”. Now Neyman always distinguishes his error statistical performance conception from Bayesian and Fiducial probabilisms [ii]. The surprising twist here is semantical and the culprit is none other than…Allan Birnbaum. Yet Birnbaum gets short shrift, and no mention is made of our favorite “breakthrough” (or did I miss it?).

drawn by his wife, Olga

Note: In this article, “attacks” on various statistical “fronts” refer to ways of attacking problems in one or another statistical research program.
HAPPY BIRTHDAY WEEK FOR NEYMAN! Continue reading

Categories: Bayesian/frequentist, Error Statistics, Neyman | Leave a comment

Jerzy Neyman and “Les Miserables Citations” (statistical theater in honor of his birthday yesterday)


Neyman April 16, 1894 – August 5, 1981

My second Jerzy Neyman item, in honor of his birthday, is a little play that I wrote for Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018):

A local acting group is putting on a short theater production based on a screenplay I wrote:  “Les Miserables Citations” (“Those Miserable Quotes”) [1]. The “miserable” citations are those everyone loves to cite, from their early joint 1933 paper:

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.

But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. (Neyman and Pearson 1933, pp. 290-1).

Continue reading

Categories: E.S. Pearson, Neyman, Statistics | Leave a comment

A. Spanos: Jerzy Neyman and his Enduring Legacy

Today is Jerzy Neyman’s birthday. I’ll post various Neyman items this week in recognition of it, starting with a guest post by Aris Spanos. Happy Birthday Neyman!

A. Spanos

A Statistical Model as a Chance Mechanism
Aris Spanos 

Jerzy Neyman (April 16, 1894 – August 5, 1981) was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals. (This article was first posted here.)


Neyman: 16 April 1894 – 5 Aug 1981

One of Neyman’s most remarkable, but least recognized, achievements was his adapting of Fisher’s (1922) notion of a statistical model to render it pertinent for non-random samples. Fisher’s original parametric statistical model Mθ(x) was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data x0 := (x1, x2, …, xn) can be viewed as a ‘truly representative sample’ from that ‘population’: Continue reading

Categories: Neyman, Spanos | Leave a comment

Several reviews of Deborah Mayo’s new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars « Statistical Modeling, Causal Inference, and Social Science

Source: Several reviews of Deborah Mayo’s new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars « Statistical Modeling, Causal Inference, and Social Science

Categories: Error Statistics | Leave a comment

Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt


For the first time, I’m excerpting all of Excursion 1 Tour II from SIST (2018, CUP).

1.4 The Law of Likelihood and Error Statistics

If you want to understand what’s true about statistical inference, you should begin with what has long been a holy grail–to use probability to arrive at a type of logic of evidential support–and in the first instance you should look not at full-blown Bayesian probabilism, but at comparative accounts that sidestep prior probabilities in hypotheses. An intuitively plausible logic of comparative support was given by the philosopher Ian Hacking (1965)–the Law of Likelihood. Fortunately, the Museum of Statistics is organized by theme, and the Law of Likelihood and the related Likelihood Principle is a big one. Continue reading

Categories: Error Statistics, law of likelihood, SIST | 2 Comments

there’s a man at the wheel in your brain & he’s telling you what you’re allowed to say (not probability, not likelihood)

It seems like every week something of excitement in statistics comes down the pike. Last week I was contacted by Richard Harris (and 2 others) about the recommendation to stop saying the data reach “significance level p” but rather simply say

“the p-value is p”.

(For links, see my previous post.) Friday, he wrote to ask if I would comment on a proposed restriction (?) on saying a test had high power! I agreed that we shouldn’t say a test has high power, but only that it has high power to detect a specific alternative; still, I wasn’t aware of any rulings from those in power on power. He explained it was an upshot of a reexamination, by a joint group of the boards of statistical associations in the U.S. and U.K., of the full panoply of statistical terms. Something like that. I agreed to speak with him yesterday. He emailed me the proposed ruling on power: Continue reading

Categories: Bayesian/frequentist | 5 Comments

Diary For Statistical War Correspondents on the Latest Ban on Speech

When science writers, especially “statistical war correspondents”, contact you to weigh in on some article, they may talk to you until they get something spicy, and then they may or may not include the background context. So a few writers contacted me this past week regarding this article (“Retire Statistical Significance”)–a teaser, I now suppose, to advertise the ASA collection growing out of that conference “A world beyond P ≤ .05” way back in Oct 2017, where I gave a paper*. I jotted down some points, since Richard Harris from NPR needed them immediately, and I had just gotten off a plane when he emailed. He let me follow up with him, which is rare and greatly appreciated. So I streamlined the first set of points, and dropped any points he deemed technical. I sketched the third set for a couple of other journals who contacted me, who may or may not use them. Here’s Harris’ article, which includes a couple of my remarks. Continue reading

Categories: ASA Guide to P-values, P-values | 40 Comments

1 Day to Apply for the Summer Seminar in Phil Stat

Go to the website for instructions: SummerSeminarPhilStat.com.

Categories: Summer Seminar in PhilStat | 1 Comment

S. Senn: To infinity and beyond: how big are your data, really? (guest post)


 

Stephen Senn
Consultant Statistician
Edinburgh

What is this you boast about?

Failure to understand components of variation is the source of much mischief. It can lead researchers to overlook that they can be rich in data-points but poor in information. The important thing is always to understand what varies in the data you have, and to what extent your design, and the purpose you have in mind, master it. The result of failing to understand this can be that you mistakenly calculate standard errors of your estimates that are too small because you divide the variance by an n that is too big. In fact, the problems can go further than this, since you may even pick up the wrong covariance and hence use inappropriate regression coefficients to adjust your estimates.
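Note to blog reader: here is a minimal sketch (mine, in Python with numpy, not Senn’s asthma example) of the “wrong n” problem he describes. Treating repeated measurements on the same patients as if they were independent observations divides by far too large an n, making the standard error look much smaller than the design warrants.

```python
# A minimal sketch (not Senn's asthma example) of how the "wrong n" shrinks standard errors:
# 20 patients, 50 repeated measurements each, with most variation between patients (assumed).
import numpy as np

rng = np.random.default_rng(1)
n_patients, n_reps = 20, 50
patient_means = rng.normal(0.0, 1.0, n_patients)             # between-patient variation (assumed)
data = patient_means[:, None] + rng.normal(0.0, 0.2, (n_patients, n_reps))  # within-patient noise

# Naive SE: treats all 1000 observations as independent (n too big)
naive_se = data.std(ddof=1) / np.sqrt(data.size)

# SE that respects the design: one summary (mean) per patient, n = number of patients
per_patient = data.mean(axis=1)
proper_se = per_patient.std(ddof=1) / np.sqrt(n_patients)

print(f"naive SE (n = {data.size}): {naive_se:.3f}")
print(f"per-patient SE (n = {n_patients}): {proper_se:.3f}")
```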

I shall illustrate this point using clinical trials in asthma. Continue reading

Categories: Lord's paradox, S. Senn | 5 Comments

Blurbs of 16 Tours: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST)

Statistical Inference as Severe Testing:
How to Get Beyond the Statistics Wars (2018, CUP)

Deborah G. Mayo

Abstract for Book

By disinterring the underlying statistical philosophies this book sets the stage for understanding and finally getting beyond today’s most pressing controversies revolving around statistical methods and irreproducible findings. Statistical Inference as Severe Testing takes the reader on a journey that provides a non-technical “how to” guide for zeroing in on the most influential arguments surrounding commonly used–and abused– statistical methods. The book sets sail with a tool for telling what’s true about statistical controversies: If little if anything has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a stringent or severe test. In the severe testing account, probability arises in inference, not to measure degrees of plausibility or belief in hypotheses, but to assess and control how severely tested claims are. Viewing statistical inference as severe testing supplies novel solutions to problems of induction, falsification and demarcating science from pseudoscience, and serves as the linchpin for understanding and getting beyond the statistics wars. The book links philosophical questions about the roles of probability in inference to the concerns of practitioners in psychology, medicine, biology, economics, physics and across the landscape of the natural and social sciences.

Keywords for book:

Severe testing, Bayesian and frequentist debates, Philosophy of statistics, Significance testing controversy, statistics wars, replication crisis, statistical inference, error statistics, Philosophy and history of Neyman, Pearson and Fisherian statistics, Popperian falsification

Continue reading

Categories: Statistical Inference as Severe Testing | 2 Comments

Deconstructing the Fisher-Neyman conflict wearing fiducial glasses + Excerpt 5.8 from SIST


Fisher/ Neyman

This continues my previous post: “Can’t take the fiducial out of Fisher…” in recognition of Fisher’s birthday, February 17. These 2 posts reflect my working out of these ideas in writing Section 5.8 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, CUP 2018). Here’s all of Section 5.8 (“Neyman’s Performance and Fisher’s Fiducial Probability”) for your Saturday night reading.* 

Move up 20 years to the famous 1955/56 exchange between Fisher and Neyman. Fisher clearly connects Neyman’s adoption of a behavioristic-performance formulation to his denying the soundness of fiducial inference. When “Neyman denies the existence of inductive reasoning, he is merely expressing a verbal preference. For him ‘reasoning’ means what ‘deductive reasoning’ means to others.” (Fisher 1955, p. 74). Continue reading

Categories: fiducial probability, Fisher, Neyman, Statistics | 2 Comments

Can’t Take the Fiducial Out of Fisher (if you want to understand the N-P performance philosophy) [i]


R.A. Fisher: February 17, 1890 – July 29, 1962

Continuing with posts in recognition of R.A. Fisher’s birthday, I post one from a few years ago on a topic that had previously not been discussed on this blog: Fisher’s fiducial probability

[Neyman and Pearson] “began an influential collaboration initially designed primarily, it would seem, to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals” (D. R. Cox, 2006, p. 195).

The entire episode of fiducial probability is fraught with minefields. Many say it was Fisher’s biggest blunder; others suggest it still hasn’t been understood. The majority of discussions omit the side trip to the Fiducial Forest altogether, finding the surrounding brambles too thorny to penetrate. Besides, a fascinating narrative about the Fisher-Neyman-Pearson divide has managed to bloom and grow while steering clear of fiducial probability–never mind that it remained a centerpiece of Fisher’s statistical philosophy. I now think that this is a mistake. It was thought, following Lehmann (1993) and others, that we could take the fiducial out of Fisher and still understand the core of the Neyman-Pearson vs Fisher (or Neyman vs Fisher) disagreements. We can’t. Quite aside from the intrinsic interest in correcting the “he said/he said” of these statisticians, the issue is intimately bound up with the current (flawed) consensus view of frequentist error statistics. Continue reading

Categories: fiducial probability, Fisher, Phil6334/ Econ 6614, Statistics | Leave a comment
