Introduction & Overview
The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* 05/19/18
Blurbs of 16 Tours: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST) 03/05/19
Excursion 1
EXCERPTS
Tour I
Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1) 09/08/18
Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2) 09/11/18
Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3) 09/15/18
Tour II
Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt 04/04/19
Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II) 11/08/18
MEMENTOS
Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars) 10/29/18
Excursion 2
EXCERPTS
Tour I
Excursion 2: Taboos of Induction and Falsification: Tour I (first stop) 09/29/18
“It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (Keepsake by Fisher, 2.1) 10/05/18
Tour II
Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3) 10/10/18
MEMENTOS
Tour Guide Mementos and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation) 11/14/18
Mementos for Excursion 2 Tour II Falsification, Pseudoscience, Induction 11/17/18
Excursion 3
EXCERPTS
Tour I
Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3 11/30/18
Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2) 12/01/18
First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3] 12/04/18
Tour II
It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP) 12/11/18
60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 tour II. 12/29/18
Tour III
Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III 12/20/18
MEMENTOS
Memento & Quiz (on SEV): Excursion 3, Tour I 12/08/18
Mementos for “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6) 12/13/18
Tour Guide Mementos From Excursion 3 Tour III: Capability and Severity: Deeper Concepts 12/26/18
Excursion 4
EXCERPTS
Tour I
Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP) 12/26/18
Tour II
Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?” 01/10/19
Tour IV
Excerpt from Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking 01/27/19
MEMENTOS
Mementos from Excursion 4: Blurbs of Tours I-IV 01/13/19
Excursion 5
Tour I
(full) Excerpt: Excursion 5 Tour I — Power: Pre-data and Post-data (from “SIST: How to Get Beyond the Stat Wars”) 04/27/19
Tour III
Deconstructing the Fisher-Neyman conflict wearing Fiducial glasses + Excerpt 5.8 from SIST 02/23/19
Excursion 6
Tour II
Excerpts: Souvenir Z: Understanding Tribal Warfare + 6.7 Farewell Keepsake from SIST + List of Souvenirs 05/04/19
*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).
We’ve reached our last Tour (of SIST)*: Pragmatic and Error Statistical Bayesians (Excursion 6), marking the end of our reading with Souvenir Z, the final Souvenir, as well as the Farewell Keepsake in 6.7. Our cruise ship Statinfasst, currently here at Thebes, will be back at dock for maintenance for our next launch at the Summer Seminar in Phil Stat (July 28-Aug 11). Although it’s not my preference that new readers begin with the Farewell Keepsake (it contains a few spoilers), I’m excerpting it together with Souvenir Z (and a list of all souvenirs A – Z) here, and invite all interested readers to peer in. There’s a check list on p. 437: If you’re in the market for a new statistical account, you’ll want to test if it satisfies the items on the list. Have fun!
Souvenir Z: Understanding Tribal Warfare
We began this tour asking: Is there an overarching philosophy that “matches contemporary attitudes”? More important is changing attitudes. Not to encourage a switch of tribes, or even a tribal truce, but something more modest and actually achievable: to understand and get beyond the tribal warfare. To understand them, at minimum, requires grasping how the goals of probabilism differ from those of probativeness. This leads to a way of changing contemporary attitudes that is bolder and more challenging. Snapshots from the error statistical lens let you see how frequentist methods supply tools for controlling and assessing how well or poorly warranted claims are. All of the links, from data generation to modeling, to statistical inference and from there to substantive research claims, fall into place within this statistical philosophy. If this is close to being a useful way to interpret a cluster of methods, then the change in contemporary attitudes is radical: it has never been explicitly unveiled. Our journey was restricted to simple examples because those are the ones fought over in decades of statistical battles. Much more work is needed. Those grappling with applied problems are best suited to develop these ideas, and see where they may lead. I never promised, when you bought your ticket for this passage, to go beyond showing that viewing statistics as severe testing will let you get beyond the statistics wars.
6.7 Farewell Keepsake
Despite the eclecticism of statistical practice, conflicting views about the roles of probability and the nature of statistical inference – holdovers from long-standing frequentist–Bayesian battles – still simmer below the surface of today’s debates. Reluctance to reopen wounds from old battles has allowed them to fester. To assume all we need is an agreement on numbers – even if they’re measuring different things – leads to statistical schizophrenia. Rival conceptions of the nature of statistical inference show up unannounced in the problems of scientific integrity, irreproducibility, and questionable research practices, and in proposed methodological reforms. If you don’t understand the assumptions behind proposed reforms, their ramifications for statistical practice remain hidden from you.
Rival standards reflect a tension between using probability (a) to constrain the probability that a method avoids erroneously interpreting data in a series of applications (performance), and (b) to assign degrees of support, confirmation, or plausibility to hypotheses (probabilism). We set sail on our journey with an informal tool for telling what’s true about statistical inference: If little if anything has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a severe test. From this minimal severe-testing requirement, we develop a statistical philosophy that goes beyond probabilism and performance. The goals of the severe tester (probativism) arise in contexts sufficiently different from those of probabilism that you are free to hold both, for distinct aims (Section 1.2). For statistical inference in science, it is severity we seek. A claim passes with severity only to the extent that it is subjected to, and passes, a test that it probably would have failed, if false. Viewing statistical inference as severe testing alters long-held conceptions of what’s required for an adequate account of statistical inference in science. In this view, a normative statistical epistemology – an account of what’s warranted to infer – must be:
• directly altered by biasing selection effects
• able to falsify claims statistically
• able to test statistical model assumptions
• able to block inferences that violate minimal severity
These overlapping and interrelated requirements are disinterred over the course of our travels. This final keepsake collects a cluster of familiar criticisms of error statistical methods. They are not intended to replace the detailed arguments, pro and con, within; here we cut to the chase, generally keeping to the language of critics. Given our conception of evidence, we retain testing language even when the statistical inference is an estimation, prediction, or proposed answer to a question. The concept of severe testing is sufficiently general to apply to any of the methods now in use. It follows that a variety of statistical methods can serve to advance the severity goal, and that they can, in principle, find their foundations in an error statistical philosophy. However, each requires supplements and reformulations to be relevant to real-world learning. Good science does not turn on adopting any formal tool, and yet the statistics wars often focus on whether to use one type of test (or estimation, or model selection) or another. Meta-researchers charged with instigating reforms do not agree, but the foundational basis for the disagreement is left unattended. It is no wonder some see the statistics wars as proxy wars between competing tribe leaders, each keen to advance one or another tool, rather than about how to do better science. Leading minds are drawn into inconsequential battles, e.g., whether to use a prespecified cut-off of 0.025 or 0.0025 – when in fact good inference is not about cut-offs altogether but about a series of small-scale steps in collecting, modeling and analyzing data that work together to find things out. Still, we need to get beyond the statistics wars in their present form. By viewing a contentious battle in terms of a difference in goals – finding highly probable versus highly well probed hypotheses – readers can see why leaders of rival tribes often talk past each other. 
To be clear, the standpoints underlying the following criticisms are open to debate; we’re far from claiming to do away with them. What should be done away with is rehearsing the same criticisms ad nauseam. Only then can we hear the voices of those calling for an honest standpoint about responsible science.
1. NHST Licenses Abuses. First, there’s the cluster of criticisms directed at an abusive NHST animal: NHSTs infer from a single P-value below an arbitrary cut-off to evidence for a research claim, and they encourage P-hacking, fishing, and other selection effects. The reply: this ignores crucial requirements set by Fisher and other founders: isolated significant results are poor evidence of a genuine effect and statistical significance doesn’t warrant substantive (e.g., causal) inferences. Moreover, selective reporting invalidates error probabilities. Some argue significance tests are un-Popperian because the higher the sample size, the easier to infer one’s research hypothesis. It’s true that with a sufficiently high sample size any discrepancy from a null hypothesis has a high probability of being detected, but statistical significance does not license inferring a research claim H. Unless H’s errors have been well probed, by merely finding a small P-value, H passes an extremely insevere test. No mountains out of molehills (Sections 4.3 and 5.1). Enlightened users of statistical tests have rejected the cookbook, dichotomous NHST, long lampooned: such criticisms are behind the times. When well-intentioned aims of replication research are linked to these retreads, it only hurts the cause. One doesn’t need a sharp dichotomy to identify rather lousy tests – a main goal for a severe tester. Granted, policy-making contexts may require cut-offs, as do behavioristic setups. But in those contexts, a test’s error probabilities measure overall error control, and are not generally used to assess well-testedness. Even there, users need not fall into the NHST traps (Section 2.5). While attention to banning terms is the least productive aspect of the statistics wars, since NHST is not used by Fisher or N-P, let’s give the caricature its due and drop the NHST acronym; “statistical tests” or “error statistical tests” will do.
Simple significance tests are a small part of a conglomeration of error statistical methods.
To continue reading: Excerpt Souvenir Z, Farewell Keepsake & List of Souvenirs can be found here.
*We are reading Statistical Inference as Severe Testing: How to Get beyond the Statistics Wars (2018, CUP)
***
Where YOU are in the journey.
It’s a balmy day today on Ship StatInfasST: An invigorating wind has a salutary effect on our journey. So, for the first time I’m excerpting all of Excursion 5 Tour I (proofs) of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).
A salutary effect of power analysis is that it draws one forcibly to consider the magnitude of effects. In psychology, and especially in soft psychology, under the sway of the Fisherian scheme, there has been little consciousness of how big things are. (Cohen 1990, p. 1309)
So how would you use power to consider the magnitude of effects were you drawn forcibly to do so? In with your breakfast is an exercise to get us started on today’s shore excursion.
Suppose you are reading about a statistically significant result x (just at level α) from a one-sided test T+ of the mean of a Normal distribution with n IID samples, and known σ: H_{0}: μ ≤ 0 against H_{1}: μ > 0. Underline the correct word, from the perspective of the (error statistical) philosophy, within which power is defined.
- If the test’s power to detect μ′ is very low (i.e., POW(μ′) is low), then the statistically significant x is poor/good evidence that μ > μ′.
- Were POW(μ′) reasonably high, the inference to μ > μ′ is reasonably/poorly warranted.
We’ve covered this reasoning in earlier travels (e.g., Section 4.3), but I want to launch our new tour from the power perspective. Assume the statistical test has passed an audit (for selection effects and underlying statistical assumptions) – you can’t begin to analyze the logic if the premises are violated.
During our three tours on Power Peninsula, a partially uncharted territory, we’ll be residing at local inns, not returning to the ship, so pack for overnights. We’ll visit its museum, but mostly meet with different tribal members who talk about power – often critically. Power is one of the most abused notions in all of statistics, yet it’s a favorite for those of us who care about magnitudes of discrepancies. Power is always defined in terms of a fixed cut-off, c_{α}, computed under a value of the parameter under test; since these vary, there is really a power function. If someone speaks of the power of a test tout court, you cannot make sense of it, without qualification. First defined in Section 3.1, the power of a test against μ′ is the probability it would lead to rejecting H_{0} when μ = μ′:
POW(T, μ′) = Pr(d(X) ≥ c_{α}; μ = μ′), or Pr(test T rejects H_{0}; μ = μ′).
If it’s clear what the test is, we just write POW(μ′). Power measures the capability of a test to detect μ′ – where the detection is in the form of producing a d ≥ c_{α}. While power is computed at a point μ = μ′, we employ it to appraise claims of form μ > μ′ or μ < μ′.
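The definition above can be turned into a few lines of arithmetic. What follows is only a minimal sketch, assuming the test statistic is d(X) = (M − μ_{0})/(σ/√n) with M the sample mean; the function name `power` and the default numbers (σ = 1, n = 25, α = 0.025) are illustrative assumptions, not values taken from the text.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def power(mu_prime, mu0=0.0, sigma=1.0, n=25, alpha=0.025):
    """POW(mu') = Pr(d(X) >= c_alpha; mu = mu') for the one-sided test T+.

    Assumes d(X) = (M - mu0) / (sigma / sqrt(n)); all defaults are hypothetical.
    """
    c_alpha = Z.inv_cdf(1 - alpha)   # fixed cut-off for d(X)
    se = sigma / sqrt(n)             # standard error of the sample mean
    shift = (mu_prime - mu0) / se    # mean of d(X) when mu = mu'
    return 1 - Z.cdf(c_alpha - shift)


if __name__ == "__main__":
    # Tabulate the power function at a few hypothetical alternatives.
    for mu_prime in (0.0, 0.2, 0.4, 0.6, 0.8):
        print(f"POW({mu_prime:.1f}) = {power(mu_prime):.3f}")
```

Two checkpoints follow from the definition itself: at μ′ = μ_{0} the power equals α, and at the alternative sitting exactly on the cut-off for the sample mean, μ_{0} + c_{α}(σ/√n), it equals 0.5. Because the value changes with μ′, it only makes sense to speak of a power function.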
Power is an ingredient in N-P tests, but even practitioners who declare they never set foot into N-P territory, but live only in the land of Fisherian significance tests, invoke power. This is all to the good, and they shouldn’t fear that they are dabbling in an inconsistent hybrid.
Jacob Cohen’s (1988) Statistical Power Analysis for the Behavioral Sciences is displayed at the Power Museum’s permanent exhibition. Oddly, he makes some slips in the book’s opening. On page 1 Cohen says: “The power of a statistical test is the probability it will yield statistically significant results.” Also faulty is what he says on page 4: “The power of a statistical test of a null hypothesis is the probability that it will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the phenomenon exists.” Cohen means to add “computed under an alternative hypothesis,” else the definitions are wrong. These snafus do not take away from Cohen’s important tome on power analysis, yet I can’t help wondering if these initial definitions play a bit of a role in the tendency to define power as ‘the probability of a correct rejection,’ which slips into erroneously viewing it as a posterior probability (unless qualified).
Although keeping to the fixed cut-off c_{α} is too coarse for the severe tester’s tastes, it is important to keep to the given definition for understanding the statistical battles. We’ve already had sneak previews of “achieved sensitivity” or “attained power” [Π(γ) = Pr(d(X) ≥ d(x_{0}); μ_{0} + γ)] by which members of Fisherian tribes are able to reason about discrepancies (Section 3.3). N-P accorded three roles to power: the first two are pre-data, for planning and comparing tests; the third is for interpretation post-data. It’s the third that they don’t announce very loudly, whereas that will be our main emphasis. Have a look at this museum label referring to a semi-famous passage by E. Pearson. Barnard (1950, p. 207) has just suggested that error probabilities of tests, like power, while fine for pre-data planning, should be replaced by other measures (likelihoods perhaps?) after the trial. What did Egon say in reply to George?
[I]f the planning is based on the consequences that will result from following a rule of statistical procedure, e.g., is based on a study of the power function of a test and then, having obtained our results, we do not follow the first rule but another, based on likelihoods, what is the meaning of the planning? (Pearson 1950, p. 228)
This is an interesting and, dare I say, powerful reply, but it doesn’t quite answer George. By all means apply the rule you planned to, but there’s still a legitimate question as to the relationship between the pre-data capability or performance measure, and post-data inference. The severe tester offers a view of this intimate relationship. In Tour II we’ll be looking at interactive exhibits far outside the museum, including N-P post-data power analysis, retrospective power, and a notion I call shpower. Employing our understanding of power, scrutinizing a popular reinterpretation of tests as diagnostic tools will be straightforward. In Tour III we go a few levels deeper in disinterring the N-P vs. Fisher feuds. I suspect there is a correlation between those who took Fisher’s side in the early disputes with Neyman and those leery of power. Oscar Kempthorne being interviewed by J. Leroy Folks (1995) said:
Well, a common thing said about [Fisher] was that he did not accept the idea of the power. But, of course, he must have. However, because Neyman had made such a point about power, Fisher couldn’t bring himself to acknowledge it (p. 331).
However, since Fisherian tribe members have no problem with corresponding uses of sensitivity, P-value distributions, or CIs, they can come along on a severity analysis. There’s more than one way to skin a cat, if one understands the relevant statistical principles. The issues surrounding power are subtle, and unraveling them will require great care, so bear with me. I will give you a money-back guarantee that by the end of the excursion you’ll have a whole new view of power. Did I mention you’ll have a chance to power the ship into port on this tour? Only kidding, however, you will get to show your stuff in a Cruise Severity Drill (Section 5.2).
To continue reading Excursion 5 Tour I, go here.
__________
This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).
Earlier excerpts and mementos from SIST up to Dec 31, 2018 are here.
Jan 10, 2019 Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?” is here,
Jan 27, Excerpt from Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking here,
Feb 23, Deconstructing the Fisher-Neyman conflict wearing fiducial glasses + Excerpt 5.8 from SIST here,
April 4, Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt here.
Jan 13, 2019 Mementos from SIST (Excursion 4) are here. These are summaries of all 4 tours.
March 5, 2019 Blurbs of all 16 Tours can be found here.
Where YOU are in the journey.
Neyman, confronted with unfortunate news, would always say “too bad!” At the end of Jerzy Neyman’s birthday week, I cannot help imagining him saying “too bad!” as regards some twists and turns in the statistics wars. First, too bad Neyman-Pearson (N-P) tests aren’t in the ASA Statement (2016) on P-values: “To keep the statement reasonably simple, we did not address alternative hypotheses, error types, or power”. An especially aggrieved “too bad!” would be earned by the fact that those in love with confidence interval estimators don’t appreciate that Neyman developed them (in 1930) as a method with a precise interrelationship with N-P tests. So if you love CI estimators, then you love N-P tests!
Consider a typical N-P test of the mean of a Normal distribution T+: H_{0}: µ ≤ µ_{0} vs H_{1}: µ > µ_{0}.
Imagine σ is known, since nothing of interest to the logic changes if it is estimated as is more typical. Notice the null hypothesis is composite (it is not a point), and the alternative is explicit (you can’t jump from a small P-value to some theory that would “explain” it).[i]
The (1 – α) confidence interval (CI) corresponding to test T+ is that µ > the (1 – α) lower bound:
µ > M – c_{α}(σ/√n).
M is the sample mean, and this is the generic lower confidence bound. Replacing M with the observed sample mean M_{0} yields the particular CI lower bound.
Why does µ > M – c_{α}(σ/√n) correspond to the above test T+? Why is it an inversion or dual to the test?
Consider, said Neyman, that the values of µ that exceed M_{0} – c_{α}(σ/√n) are values of µ that could not be rejected at level α with sample mean M_{0}. Equivalently, these are values of the parameter µ that M_{0} is not statistically significantly greater than at a P-value of α. Yes, CIs correspond to Neyman-Pearson tests and were developed by Neyman in 1930, a bit after Fisher’s Fiducial intervals. Yes, those doing CIs (the so-called “new” statistics) are doing Neyman-Pearson tests, only inverted. Neyman didn’t care if you called them hypothesis tests or significance tests (as we saw in my last post). [ii]
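The inversion Neyman describes can be checked numerically. This is a minimal sketch, assuming known σ and the one-sided test T+; the observed mean, σ, n, and α are hypothetical numbers chosen only for illustration, and the function names are mine.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def lower_bound(m0, sigma, n, alpha):
    """One-sided (1 - alpha) lower confidence bound: M0 - c_alpha * sigma / sqrt(n)."""
    return m0 - Z.inv_cdf(1 - alpha) * sigma / sqrt(n)


def p_value(m0, mu, sigma, n):
    """P-value of the observed mean M0 in test T+ when the value under test is mu."""
    return 1 - Z.cdf((m0 - mu) / (sigma / sqrt(n)))


if __name__ == "__main__":
    m0, sigma, n, alpha = 0.4, 1.0, 100, 0.025   # hypothetical numbers
    bound = lower_bound(m0, sigma, n, alpha)
    print(f"lower bound = {bound:.4f}, P at the bound = {p_value(m0, bound, sigma, n):.4f}")
    # Values of mu below the bound are rejected at level alpha; values above are not.
    for mu in (bound - 0.1, bound + 0.1):
        p = p_value(m0, mu, sigma, n)
        verdict = "rejected" if p < alpha else "not rejected"
        print(f"mu = {mu:.4f}: P = {p:.4f} -> {verdict} at level {alpha}")
```

The P-value computed at the bound itself equals α, which is the sense in which the confidence bound and the test carry the same information.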
Thanks to the duality between tests and confidence intervals, you could give the information provided by a confidence interval at any level in terms of the corresponding test. For a two-sided, 95% confidence interval [µ_{L}, µ_{U}]:
µ_{L} is the (parameter) value that the sample mean is just statistically significantly greater than at the P = .025 level.
µ_{U} is the (parameter) value that the sample mean is just statistically significantly lower than at the P = .025 level.
That means it is wrong to say you cannot ascertain anything about the population effect size using P-value computations. You can. It’s not the only way. You can also use P-value functions (Fraser, Cox), power, and severity, but they are all interrelated.
You ask: Please tell me the value of µ that the sample mean M_{0} is just statistically significantly greater than, at the P = .025 level? The answer is the lower confidence bound µ_{L}.
If the tester is able to determine the P-value corresponding to a specific value of µ you wanted to test, then she is also able to use the observed M_{0} to compute the value µ_{L}.
Likewise for finding µ_{U} . All the information is there.
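To make the point concrete, here is a sketch that recovers µ_{L} using nothing but P-value computations: it searches (by bisection) for the value of µ at which the observed mean is just significant at P = .025, and compares the answer with the closed-form bound. The standard error and observed mean are hypothetical, and the function names are assumptions for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def upper_tail_p(m0, mu, se):
    """Pr(M >= M0; mu): how significantly the observed mean M0 exceeds mu."""
    return 1 - Z.cdf((m0 - mu) / se)


def solve_mu_lower(m0, se, p_target=0.025, tol=1e-10):
    """Bisection for mu_L, the mu that M0 is just significantly greater than.

    Requires p_target < 0.5, since the P-value is ~0 at lo and 0.5 at hi.
    """
    lo, hi = m0 - 10 * se, m0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if upper_tail_p(m0, mid, se) < p_target:
            lo = mid   # M0 is more significant than needed: move mu up
        else:
            hi = mid   # M0 is not significant enough: move mu down
    return (lo + hi) / 2


if __name__ == "__main__":
    m0, se = 0.4, 0.1                        # hypothetical observed mean and sigma/sqrt(n)
    mu_l = solve_mu_lower(m0, se)
    mu_u = 2 * m0 - mu_l                     # by symmetry of the Normal distribution
    closed_form = m0 - Z.inv_cdf(0.975) * se
    print(f"mu_L from P-values = {mu_l:.6f}, closed form = {closed_form:.6f}")
    print(f"mu_U = {mu_u:.6f}, lower-tail P at mu_U = {Z.cdf((m0 - mu_u) / se):.4f}")
```

The search is deliberately naive; the only claim is that whoever can compute the P-value for any given µ already has everything needed to report the interval.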
But choosing a single confidence level is quite inadequate. Yet that is still what members of today’s “new” CI tribe do–generally .95. They get very upset at your dichotomizing P ≤ 0.05 and P > 0.05, but happily dichotomize µ is in or out of the CI formed.
The severe tester always infers a discrepancy that is well indicated (if any) but also at least one that is poorly indicated. In relation to test T+, the inference µ > M_{0}, where M_{0} is the observed mean, is a good benchmark for a terrible inference! It corresponds to a lower confidence bound at level 0.5! And yet, critics of significance tests (at least, from outside the error statistical family) often advocate inferring
µ ≥ M_{0}
as either comparatively more likely or probable than the null or test hypothesis. For detailed examples, see SIST Excursion 4 Tour II Rejection Fallacies: Who’s Exaggerating What?
So why are members of the Confidence Interval tribe going around misrepresenting hypothesis tests as if they must take the form of Fisherian “simple” significance tests with a point null (nil) hypothesis, usually of 0? (N-P tests were purposely designed to improve upon Fisher’s tests, and it’s that improvement that gives you CIs.) And why do they say what’s inferred with a CI cannot be ascertained with N-P tests? Are they unaware they’re using N-P tests? Or is the simple Fisherian test (no explicit alternative, no consideration of power) just much easier to criticize? If they’re cousins or brothers, why the family feud? Sibling rivalry? Why be a Unitarian? Most testers would supply a P-value as well as a CI. The severe tester combines the two, so that discrepancies are directly reported from test results. For another reason, see [iii].
Critics of tests from outside the family will also take the simple “nil” point null vs a two-sided alternative as their foil, and demonstrate that the p-value ≠ either their Bayes Factor or posterior probability. It serves as a convenient straw test to knock down. If they kept the comparison to one-sided tests, they would not disagree (at least not with any sensible prior). See SIST Excursion 4 Tour II Rejection Fallacies: Who’s Exaggerating What? This is shown by Casella and R. Berger (1987) and the reconciliation is agreed to by Berger and Sellke (1987).
I’m not saying the simple significance test doesn’t have uses; it’s vital for testing assumptions of statistical models. That’s why Bayesians who want to check their models can be found sneaking P-value goodies from the tests that many of them profess to dislike. If a small P-value indicates a discrepancy from the null there, it does so in other uses too. [iv]
Note too the connection between confidence intervals and severity: Taking a sample mean M that is just statistically significant at level α (M_{α}) as warranting µ > µ_{0} with severity 1 – α is the same as inferring µ > M_{0} – c_{α}(σ/√n) at confidence level 1 – α. However, severity improves on CIs by breaking out of the single confidence level, providing an inferential justification (rather than merely a long-run coverage rationale), and avoids a number of fallacies and paradoxes of ordinary CIs. For a post on CIs and severity see here. Also see: Do CIs Avoid Fallacies of Tests? Reforming the Reformers. For a full discussion, see SIST.
[i] The null and alternative would be treated symmetrically. You are to choose the null, or more properly, what Neyman called the test hypothesis, according to which error was more serious. A lot of the agony that has people up in arms regarding the fallacy of taking non-significant results as evidence for a (point) null is immediately scotched by letting the test hypothesis be “an effect exists” (or an effect of a given magnitude is present). For example, T-: H_{0}: µ ≥ µ_{0} vs H_{1}: µ < µ_{0}.
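The link between severity and confidence levels can be sketched numerically. The code assumes, as I understand the definition used in SIST for test T+, that the severity for the claim µ > µ_{1} given observed mean M_{0} is Pr(M ≤ M_{0}; µ = µ_{1}); the function name and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def severity_mu_greater(mu1, m0, sigma, n):
    """SEV(mu > mu1) = Pr(M <= M0; mu = mu1), assuming test T+ with known sigma."""
    return Z.cdf((m0 - mu1) / (sigma / sqrt(n)))


if __name__ == "__main__":
    m0, sigma, n = 0.4, 1.0, 100             # hypothetical observed mean and design
    se = sigma / sqrt(n)
    # Several confidence levels at once, rather than a single 0.95.
    for level in (0.5, 0.84, 0.95, 0.975):
        mu1 = m0 - Z.inv_cdf(level) * se     # lower confidence bound at this level
        sev = severity_mu_greater(mu1, m0, sigma, n)
        print(f"level {level}: mu > {mu1:.3f} has severity {sev:.3f}")
```

Under these assumptions the severity for µ > µ_{1} equals the confidence level at which µ_{1} is the lower bound, so µ > M_{0} comes out at 0.5, the benchmark for a poorly warranted inference mentioned earlier.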
A two-sided test, if wanted, may be seen as doing two one-sided tests (Cox and Hinkley 1974).
[ii] Note the equivalences:
µ < M – c_{α}(σ/√n) iff M > µ + c_{α}(σ/√n)
So µ < CI lower at confidence level 1 – α iff M reaches statistical significance at P = α in test T+. Since it’s continuous we could use ≤ or <.
Iff = if and only if.
[iii] Some prefer CIs to corresponding tests because it’s easier to slide the confidence level onto the interval estimate, viewing it as affording a probability assignment to the interval itself. This of course is, strictly, a fallacy, unless one just stipulates: I assign “probability” .95, say, to the result of applying a method if that method has .95 “coverage probability”. This is/was the Fiducial dream. But one cannot do probability computations with these assignments. For the severe tester’s evidential interpretation of CIs, please see SIST, Excursion 3 Tour III.
[iv] Moving from a discrepancy (from a model assumption) to a particular rival model invites the same risks as when explaining other small P-values by invoking a rival insofar as the null and the rival model do not exhaust the possibilities.
SIST= Mayo, D (2018), Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, CUP.
I’ll continue to post Neyman-related items this week in honor of his birthday. This isn’t the only paper in which Neyman makes it clear he denies a distinction between a test of statistical hypotheses and significance tests. He and E. Pearson also discredit the myth that the former are only allowed to report pre-data, fixed error probabilities, and are justified only by dint of long-run error control. Controlling the “frequency of misdirected activities” in the midst of finding something out, or solving a problem of inquiry, on the other hand, is an epistemological goal. What do you think?
“Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena”
by Jerzy Neyman
ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.
I recommend, especially, the example on home ownership. Here are two snippets:
1. INTRODUCTION
The title of the present session involves an element that appears mysterious to me. This element is the apparent distinction between tests of statistical hypotheses, on the one hand, and tests of significance, on the other. If this is not a lapse of someone’s pen, then I hope to learn the conceptual distinction. Particularly with reference to applied statistical work in a variety of domains of Science, my own thoughts of tests of significance, or EQUIVALENTLY of tests of statistical hypotheses, are that they are tools to reduce the frequency of errors….
(iv) A similar remark applies to the use of the words “decision” or “conclusion”. It seems to me that at our discussion, these particular words were used to designate only something like a final outcome of complicated analysis involving several tests of different hypotheses. In my own way of speaking, I do not hesitate to use the words ‘decision’ or “conclusion” every time they come handy. For example, in the analysis of the follow-up data for the [home ownership] experiment, Mark Eudey and I started by considering the importance of bias in forming the experimental and control groups of families. As a result of the tests we applied, we decided to act on the assumption (or concluded) that the two groups are not random samples from the same population. Acting on this assumption (or having reached this conclusion), we sought for ways to analyze that data other than by comparing the experimental and the control groups. The analyses we performed led us to “conclude” or “decide” that the hypotheses tested could be rejected without excessive risk of error. In other words, after considering the probability of error (that is, after considering how frequently we would be in error if in conditions of our data we rejected the hypotheses tested), we decided to act on the assumption that “high” scores on “potential” and on “education” are indicative of better chances of success in the drive to home ownership. (750-1; the emphasis is Neyman’s)
To read the full (short) paper: Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.
Following Neyman, I’ve “decided” to use the terms ‘tests of hypotheses’ and ‘tests of significance’ interchangeably in my book.[1] Now it’s true that Neyman was more behavioristic than Pearson, and it’s also true that tests of statistical hypotheses or tests of significance need an explicit reformulation and statistical philosophy to explicate the role of error probabilities in inference. My way of providing this has been in terms of severe tests. However, in Neyman-Pearson applications, more than in their theory, you can find many examples as well. Recall Neyman’s paper, “The Problem of Inductive Inference” (Neyman 1955) wherein Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:
I am concerned with the term “degree of confirmation” introduced by Carnap. …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true]. The question is: does this result “confirm” the hypothesis that H_{0} is true of the particular data set? (Neyman, pp 40-41).
Neyman continues:
The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc. If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H_{0}, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present]. Therefore, the failure of the test to reject H_{0} cannot be reasonably considered as anything like a confirmation of H_{0}. The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)
The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.
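Neyman's point about consulting the power function before reading anything into a non-rejection can be illustrated with a small calculation. The sketch below is only an illustration under assumed conditions: a one-sided Normal test with known σ, and hypothetical values for the discrepancy, σ and sample sizes (none of these numbers come from Neyman's paper).

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal distribution, mean 0 and sd 1


def power_one_sided_z(delta, sigma, n, alpha=0.05):
    """Power of the one-sided z test of H0: mu <= mu0 at mu = mu0 + delta.

    Assumes n IID Normal observations with known sigma; the test rejects
    when the standardized sample mean exceeds the upper-alpha cut-off.
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    shift = delta * sqrt(n) / sigma  # discrepancy in standard-error units
    return 1 - STD_NORMAL.cdf(z_alpha - shift)


# Hypothetical numbers: a discrepancy of 0.2 with sigma = 1.
low = power_one_sided_z(delta=0.2, sigma=1.0, n=10)
high = power_one_sided_z(delta=0.2, sigma=1.0, n=400)
print(f"n = 10:  power against the discrepancy = {low:.3f}")
print(f"n = 400: power against the discrepancy = {high:.3f}")
```

With the small sample the test has little chance of detecting the discrepancy, so a failure to reject says little about its absence; with the large sample the power exceeds 0.95, the situation Neyman calls "radically different".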
I’m adding another paper of Neyman’s that echoes these same sentiments on the use of power, post data, to evaluate what is “confirmed”: ‘The Use of the Concept of Power in Agricultural Experimentation’.
Neyman, like Peirce, Popper and many others, holds that the only “logic” is deductive logic. “Confirmation” for Neyman is akin to Popperian “corroboration”–you could corroborate a hypothesis H only to the extent that it passed a severe test–one with a high probability of having found flaws in H, if they existed. Of course, Neyman puts this in terms of having high power to reject H, if H is false, and high probability of finding no evidence against H if true, but it’s the same idea. But the use of power post-data is to interpret the discrepancies warranted in the given test. (This third use of power is also in Neyman 1956, responding to Fisher, the Triad.) Unlike Popper, however, Neyman actually provides a methodology that can be shown to accomplish the task reliably.
Still, Fisher was correct to claim that Neyman is merely recording his preferred way of speaking. One could choose a different way. For example, Peirce defined induction as passing a severe test, and Popper said you could define it that way if you wanted to. But the main thing is that Neyman is attempting to distinguish the “inductive” or “evidence transcending” conclusions that statistics affords, on his approach,[2] from assigning to hypotheses degrees of belief, probability, support, plausibility or the like.
De Finetti gets it right when he says that the expression “inductive behavior…that was for Neyman simply a slogan underlining and explaining the difference between his own, the Bayesian and the Fisherian formulations” became, with Wald’s work, “something much more substantial” (de Finetti 1972, p.176). De Finetti called this “the involuntarily destructive aspect of Wald’s work” (ibid.).
Related papers on tests:
[1] That really is a decision, though it’s based on evidence that doing so is in sync with what both Neyman and Pearson thought. There are plenty of times, by the way, where Fisher is more behavioristic and less evidential than is Neyman, and certainly less than E. Pearson. I think this “he said/she said” route to understanding statistical methods is a huge mistake. I keep saying, “It’s the methods, stupid!” This is now the title of Excursion 3 Tour II of my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).
[2] And, Neyman rightly assumed at first, from Fisher’s approach. Fisher’s loud rants, later on, that Neyman turned his tests into crude acceptance sampling affairs akin to Russian 5-year plans, and money-making goals of U.S. commercialism, all occurred after the break in 1935 which registered a conflict of egos, not statistical philosophies. Look up “anger management” on this blog.
Fisher is the arch anti-Bayesian; whereas, Neyman experimented with using priors at the start. The problem wasn’t so much viewing parameters as random variables, but lacking knowledge of what their frequentist distributions could possibly be. Thus he sought methods whose validity held up regardless of priors. Here E. Pearson was closer to Fisher, but unlike the two others, he was a really nice guy. (I hope everyone knows I’m talking of Egon here, not his mean daddy.) See chapter 11 of EGEK (1996):
[3] Who drew the picture of Neyman above? Anyone know?
References
de Finetti, B. 1972. Probability, Induction and Statistics: The Art of Guessing. Wiley.
Neyman, J. 1957. “The Use of the Concept of Power in Agricultural Experimentation“, Journal of the Indian Society of Agricultural Statistics, 9(1): 9–17.
Neyman, J. 1976. “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.” Commun. Statist. Theor. Meth. A5(8), 737-751.
Reader: This and other Neyman blogposts have been incorporated into my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). Several excerpts can be found on this blog. Look up excerpts and mementos.
We celebrated Jerzy Neyman’s Birthday (April 16, 1894) last night in our seminar: here’s a pic of the cake. My entry today is a brief excerpt and a link to a paper of his that we haven’t discussed much on this blog: Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making‘ [i]. It’s chock full of ideas and arguments, but the one that interests me at the moment is Neyman’s conception of “his breakthrough”, in relation to a certain concept of “inference”. “In the present paper” he tells us, “the term ‘inferential theory’…will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation …either a substitute a priori distribution [exemplified by the so called principle of insufficient reason] or a new measure of uncertainty” such as Fisher’s fiducial probability. So if you hear Neyman rejecting “inferential accounts” you have to understand it in this very specific way: he’s rejecting “new measures of confidence or diffidence”. Here he alludes to them as “easy ways out”. Now Neyman always distinguishes his error statistical performance conception from Bayesian and Fiducial probabilisms [ii]. The surprising twist here is semantical and the culprit is none other than…Allan Birnbaum. Yet Birnbaum gets short shrift, and no mention is made of our favorite “breakthrough” (or did I miss it?).
Note: In this article, “attacks” on various statistical “fronts” refers to ways of attacking problems in one or another statistical research program.
HAPPY BIRTHDAY WEEK FOR NEYMAN!
What doesn’t Neyman like about Birnbaum’s advocacy of a Principle of Sufficiency S (p. 25)? He doesn’t like that it is advanced as a normative principle (e.g., about when evidence is or ought to be deemed equivalent) rather than a criterion that does something for you, such as control errors. (Presumably it is relevant to a type of context, say parametric inference within a model.) S is put forward as a kind of principle of rationality, rather than one with a rationale in solving some statistical problem.
“The principle of sufficiency (S): If E is specified experiment, with outcomes x; if t = t (x) is any sufficient statistic; and if E’ is the experiment, derived from E, in which any outcome x of E is represented only by the corresponding value t = t (x) of the sufficient statistic; then for each x, Ev (E, x) = Ev (E’, t) where t = t (x)… (S) may be described informally as asserting the ‘irrelevance of observations independent of a sufficient statistic’.”
Ev(E, x) is a metalogical symbol referring to the evidence from experiment E with result x. The very idea that there is such a thing as an evidence function is never explained, but to Birnbaum “inferential theory” required such things. (At least that’s how he started out.) The view is very philosophical and it inherits much from logical positivism and logics of induction. The principle S, and also other principles of Birnbaum, have a normative character: Birnbaum considers them “compellingly appropriate”.
“The principles of Birnbaum appear as a kind of substitutes for known theorems” Neyman says. For example, various authors proved theorems to the general effect that the use of sufficient statistics will minimize the frequency of errors. But if you just start with the rationale (minimizing the frequency of errors, say) you wouldn’t need these “principles” from on high as it were. That’s what Neyman seems to be saying in his criticism of them in this paper. Do you agree? He has the same gripe concerning Cornfield’s conception of a default-type Bayesian account akin to Jeffreys. Why?
[i] I thank @omaclaran for reminding me of this paper on twitter in 2018.
[ii] Or so I argue in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, 2018, CUP.
[iii] Do you think Neyman is using “breakthrough” here in reference to Savage’s description of Birnbaum’s “proof” of the (strong) Likelihood Principle? Or is it the other way round? Or neither? Please weigh in.
REFERENCES
Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making‘, Revue De l’Institut International De Statistique / Review of the International Statistical Institute, 30(1), 11-27.
My second Jerzy Neyman item, in honor of his birthday, is a little play that I wrote for Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018):
A local acting group is putting on a short theater production based on a screenplay I wrote: “Les Miserables Citations” (“Those Miserable Quotes”) [1]. The “miserable” citations are those everyone loves to cite, from Neyman and Pearson’s early joint 1933 paper:
We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.
But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. (Neyman and Pearson 1933, pp. 290-1).
In this early paper, Neyman and Pearson were still groping toward the basic concepts of tests–for example, “power” had yet to be coined. Taken out of context, these quotes have led to knee-jerk (behavioristic) interpretations which neither Neyman nor Pearson would have accepted. What was the real context of those passages? Well, the paper opens, just five paragraphs earlier, with a discussion of a debate between two French probabilists: Joseph Bertrand, author of “Calculus of Probabilities” (1907), and Emile Borel, author of “Le Hasard” (1914)! According to Neyman, what served “as an inspiration to Egon S. Pearson and myself in our effort to build a frequentist theory of testing hypotheses” (1977, p. 103) initially grew out of remarks of Borel, whose lectures Neyman had attended in Paris. He returns to the Bertrand-Borel debate in four different papers, and circles back to it often in his talks with his biographer, Constance Reid. His student Erich Lehmann (1993), regarded as the authority on Neyman, wrote an entire paper on the topic: “The Bertrand-Borel Debate and the Origins of the Neyman-Pearson Theory”.
Since it’s Saturday night, let’s listen in on this one act play, just about to begin at the Elba Dinner Theater. Don’t worry, food and drink are allowed to be taken in. (I’ve also included, in the References, several links to papers for your weekend reading enjoyment!) There go les trois coups–the curtain’s about to open!
The curtain opens with a young Neyman and Pearson (from 1933) standing mid-stage, lit by a spotlight. (Neyman does the talking, since it’s his birthday).
Neyman: “Bertrand put into statistical form a variety of hypotheses, as for example the hypothesis that a given group of stars…form a ‘system.’ His method of attack, which is that in common use, consisted essentially in calculating the probability, P, that a certain character, x, of the observed facts would arise if the hypothesis tested were true. If P were very small, this would generally be considered as an indication that…H was probably false, and vice versa. Bertrand expressed the pessimistic view that no test of this kind could give reliable results.
The stage fades to black, then a spotlight shines on Bertrand, stage right.
Bertrand: “How can we decide on the unusual results that chance is incapable of producing?…The Pleiades appear closer to each other than one would naturally expect…In order to make the vague idea of closeness more precise, should we look for the smallest circle that contains the group? the largest of the angular distances? the sum of squares of all the distances?…Each of these quantities is smaller for the group of the Pleiades than seems plausible. Which of them should provide the measure of implausibility. …
[He turns to the audience, shaking his head.]
The stage fades to black, then a spotlight appears on Borel, stage left.
Borel: “The particular form that problems of causes often take…is the following: Is such and such a result due to chance or does it have a cause? It has often been observed how much this statement lacks in precision. Bertrand has strongly emphasized this point. But …to refuse to answer under the pretext that the answer cannot be absolutely precise, is to… misunderstand the essential nature of the application of mathematics.” (ibid. p. 964) Bertrand considers the Pleiades. ‘If one has observed a [precise angle between the stars]…in tenths of seconds…one would not think of asking to know the probability [of observing exactly this observed angle under chance] because one would never have asked that precise question before having measured the angle’… (ibid.)
Here is what one can say on this subject: One should carefully guard against the tendency to consider as striking an event that one has not specified beforehand, because the number of such events that may appear striking, from different points of view, is very substantial” (ibid. p. 964).
The stage fades to black, then a spotlight beams on Neyman and Pearson mid-stage. (Neyman does the talking)
Neyman: “We appear to find disagreement here, but are inclined to think that…the two writers [Bertrand and Borel] are not really considering precisely the same problem. In general terms the problem is this: Is it possible that there are any efficient tests of hypotheses based upon the theory of probability, and if so, what is their nature. …What is the precise meaning of the words ‘an efficient test of a hypothesis’?” (1933, p. 140/290)
Fade to black, spot on narrator mid-stage:
Narrator: We all know our famous (miserable) lines are about to come. But let’s linger on the “as far as a particular hypothesis is concerned” portion. For any particular case, one may identify a data dependent feature x that would be highly improbable “under the particular hypothesis of chance”. We must “carefully guard,” Borel warns, “against the tendency to consider as striking an event that one has not specified beforehand”. But if you are required to set the test’s capabilities ahead of time then you need to specify the type of falsity of Ho, the distance measure or test statistic beforehand. An efficient test should capture Fisher’s concern with tests sensitive to departures of interest. Listen to Neyman over 40 years later, reflecting on the relevance of Borel’s position in 1977.
Fade to black. Spotlight on an older Neyman, stage right.
Neyman: “The question (what is an efficient test of a statistical hypothesis) is about an intelligible methodology for deciding whether the observed difference…contradicts the stochastic model….
Fade to black. Spotlight on an older Egon Pearson writing a letter to Neyman about the preprint Neyman sent of his 1977 paper. (The letter is unpublished, but I cite Lehmann 1993).
Pearson: “I remember that you produced this quotation [from Borel] when we began to get our [1933] paper into shape… The above stages [wherein he had been asking ‘Why use that particular test statistic?’] led up to Borel’s requirement of finding…a criterion which was ‘a function of the observations ‘en quelque sorte remarquable’. Now my point is that you and I (perhaps my first leading) had ourselves reached the Borel requirement independently of Borel, because we were serious humane thinkers; Borel’s expression neatly capped our own.”
Fade to black. End Play
Egon has the habit of leaving the most tantalizing claims unpacked, and this is no exception: What exactly is the Borel requirement already reached due to their being “serious humane thinkers”? I can well imagine growing this one act play into something like the expressionist play of Michael Frayn, Copenhagen, wherein a variety of alternative interpretations are entertained based on subsequent work and remarks. I don’t say that it would enjoy a long life on Broadway, but a small handful of us would relish it.
As with my previous attempts at “statistical theatre of the absurd” (e.g., “Stat on a hot-tin roof”), there would be no attempt at all to popularize: only published quotes and closely remembered conversations would be included.
Deconstructions on the Meaning of the Play by Theater Critics
It’s not hard to see that “as far as a particular” star grouping is concerned, we cannot expect a reliable inference to just any non-chance effect discovered in the data. The more specific the feature is to these particular observations, the more improbable. What’s the probability of 3 hurricanes followed by 2 plane crashes (as occurred last month, say)? Harold Jeffreys put it this way: any sample is improbable in some respect; to cope with this fact statistical method does one of two things: appeals to prior probabilities of a hypothesis or to error probabilities of a procedure. The former can check our tendency to find a more likely explanation H’ than chance by an appropriately low prior weight to H’. What does the latter approach do? It says, we need to consider the problem as of a general type. It’s a general rule, from a test statistic to some assertion about alternative hypotheses, expressing the non-chance effect. Such assertions may be in error but we can control such erroneous interpretations. We deliberately move away from the particularity of the case at hand, to the general type of mistake that could be made.
Isn’t this taken care of by Fisher’s requirement that Pr(P < p_{0}; Ho) = p_{0}, that is, that the test rarely rejects the null if true? It may be, in practice, Neyman and Pearson thought, but only with certain conditions that were not explicitly codified by Fisher’s simple significance tests. With just the null hypothesis, it is unwarranted to take low P-values as evidence for a specific “cause” or non-chance explanation. Many could be erected post data, but the ways these could be in error would not have been probed. Fisher (1947, p. 182) is well aware that “the same data may contradict the hypothesis in any of a number of different ways,” and that different corresponding tests would be used.
The notion that different tests of significance are appropriate to test different features of the same null hypothesis presents no difficulty to workers engaged in practical experimentation. [T]he experimenter is aware of what observational discrepancy it is which interests him, and which he thinks may be statistically significant, before he inquires what test of significance, if any, is available appropriate to his needs (ibid., p. 185).
Even if “an experienced experimenter” knows the appropriate test, this doesn’t lessen the importance of NP’s interest in seeking to identify a statistical rationale for the choices made on informal grounds. In today’s world, if not in Fisher’s day, there’s legitimate concern about selecting the alternative that gives the more impressive P-value.
Here’s Egon Pearson writing with Chandra Sekar: In testing if a sample has been drawn from a single normal population, “it is not possible to devise an efficient test if we only bring into the picture this single normal probability distribution with its two unknown parameters. We must also ask how sensitive the test is in detecting failure of the data to comply with the hypotheses tested, and to deal with this question effectively we must be able to specify the directions in which the hypothesis may fail” (p. 121). “It is sometimes held that the criterion for a test can be selected after the data, but it will be hard to be unprejudiced at this point” (Pearson & Chandra Sekar, 1936, p. 129).
To base the choice of the test of a statistical hypothesis upon an inspection of the observations is a dangerous practice; a study of the configuration of a sample is almost certain to reveal some feature, or features, which are exceptions if the hypothesis is true….By choosing the feature most unfavourable to Ho out of a very large number of features examined it will usually be possible to find some reason for rejecting the hypothesis. It must be remembered, however, that the point now at issue will not be whether it is exceptional to find a given criterion with so unfavourable a value. We shall need to find an answer to the more difficult question. Is it exceptional that the most unfavourable criterion of the n, say, examined should have as unfavourable a value as this? (ibid., p. 127).
Notice, the goal is not behavioristic; it’s a matter of avoiding the glaring fallacies in the test at hand, fallacies we know all too well.
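Pearson and Chandra Sekar's "more difficult question" can be made concrete with a rough simulation. The sketch below assumes, purely for illustration, that the criteria examined yield independent P-values (each Uniform(0, 1) when the hypothesis is true); real criteria computed on the same sample would typically be dependent, so the numbers are indicative only. The function name and parameter values are hypothetical.

```python
import random


def prob_some_criterion_significant(n_criteria, alpha=0.05,
                                    n_sims=20_000, seed=2):
    """Estimate how often the most unfavourable of n_criteria P-values
    falls at or below alpha when the hypothesis tested is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Under H0 each P-value is Uniform(0, 1); independence is assumed.
        smallest_p = min(rng.random() for _ in range(n_criteria))
        if smallest_p <= alpha:
            hits += 1
    return hits / n_sims


for n in (1, 5, 20):
    simulated = prob_some_criterion_significant(n)
    exact = 1 - (1 - 0.05) ** n  # closed form under independence
    print(f"criteria examined = {n:2d}: simulated {simulated:.3f}, "
          f"exact {exact:.3f}")
```

Reporting the smallest of many P-values as if it were the only one examined misstates the relevant error probability: the chance of finding "some reason for rejecting the hypothesis" grows quickly with the number of features inspected.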
“The statistician who does not know in advance with which type of alternative to H_{0} he may be faced, is in the position of a carpenter who is summoned to a house to undertake a job of an unknown kind and is only able to take one tool with him! Which shall it be? Even if there is an omnibus tool, it is likely to be far less sensitive at any particular job than a specialized one; but the specialized tool will be quite useless under the wrong conditions” (ibid., p. 126).
In a famous result, Neyman (1952) demonstrates that by dint of a post-data choice of hypothesis, a result that leads to rejection in one test yields the opposite conclusion in another, both adhering to a fixed significance level. [Fisher concedes this as well.] If you are keen to ensure the test is capable of teaching about discrepancies of interest, you should prespecify an alternative hypothesis, where the null and alternative hypotheses exhaust the space, relative to a given question. We can infer discrepancies from the null, as well as corroborate their absence by considering those the test had high power to detect.
Playbill Souvenir
Let’s flesh out Neyman’s conclusion to the Borel-Bertrand debate: if we accept the words, “an efficient test of the hypothesis H” to mean a statistical (methodological) falsification rule that controls the probabilities of erroneous interpretations of data, and ensures the rejection was because of the underlying cause (as modeled), then we agree with Borel that efficient tests are possible. This requires (a) a prespecified test criterion to avoid verification biases while ensuring power (efficiency), and (b) consideration of alternative hypotheses to avoid fallacies of acceptance and rejection. We must steer clear of isolated or particular curiosities to find indications that we are tracking genuine effects.
“Fisher’s the one to be credited,” Pearson remarks, “for his emphasis on planning an experiment, which led naturally to the examination of the power function, both in choosing the size of sample so as to enable worthwhile results to be achieved, and in determining the most appropriate tests” (Pearson 1962, p. 277). If you’re planning, you’re prespecifying, perhaps, nowadays, by means of explicit preregistration.
Nevertheless prespecifying the question (or test statistic) is distinct from predesignating a cut-off P-value for significance. Discussions of tests often suppose one is somehow cheating if the attained P-value is reported, as if it loses its error probability status. It doesn’t.[2] I claim they are confusing prespecifying the question or hypothesis, with fixing the P-value in advance–a confusion whose origin stems from failing to identify the rationale behind conventions of tests, or so I argue. Nor is it even that the predesignation is essential, rather than an excellent way to promote valid error probabilities.
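The claim that an attained P-value keeps its error probability status can be checked by simulation. The sketch below assumes a one-sided test whose statistic is standard Normal under the null (an illustrative choice, not one taken from the text): whatever value p the attained P-value happens to take, a P-value that small or smaller occurs with probability about p when the null is true.

```python
import random
from statistics import NormalDist


def rejection_rate(p0, n_sims=100_000, seed=1):
    """Relative frequency of P-values at or below p0 when H0 is true.

    Assumes a one-sided test with a standard Normal test statistic.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    hits = 0
    for _ in range(n_sims):
        z = rng.gauss(0.0, 1.0)          # test statistic generated under H0
        p_value = 1 - std_normal.cdf(z)  # attained one-sided P-value
        if p_value <= p0:
            hits += 1
    return hits / n_sims


for p0 in (0.01, 0.05, 0.10):
    print(f"Pr(P <= {p0:.2f}; H0) is approximately {rejection_rate(p0):.4f}")
```

The same relative-frequency reading holds for every threshold, which is why no cut-off needs to be fixed in advance for the reported P-value to be interpretable; what must be fixed is the test statistic.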
But not just any characteristic of the data affords the relevant error probability assessment. It has got to be pretty remarkable!
Enter those pivotal statistics called upon in Fisher’s Fiducial inference. In fact, the story could well be seen to continue in the following two posts: “You can’t take the Fiducial out of Fisher if you want to understand the N-P performance philosophy“, and ” Deconstructing the Fisher-Neyman conflict wearing fiducial glasses”.
[1] Or, it might have been titled, “A Polish Statistician in Paris”, given the remake of “An American in Paris” is still going strong on Broadway, last time I checked.
[2] We know that Lehmann insisted people report the attained p-value so that others could apply their own preferred error probabilities. N-P felt the same way. (I may add some links to relevant posts later on.)
REFERENCES
Bertrand, J. (1888/1907). Calcul des Probabilités. Paris: Gauthier-Villars.
Borel, E. 1914. Le Hasard. Paris: Alcan.
Fisher, R. A. 1947. The Design of Experiments (4^{th} ed.). Edinburgh: Oliver and Boyd.
Lehmann, E.L. 2012. “The Bertrand-Borel Debate and the Origins of the Neyman-Pearson Theory” in J. Rojo (ed.), Selected Works of E. L. Lehmann, 2012, Springer US, Boston, MA, pp. 965-974.
Mayo, D. 2018. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP)
Neyman, J. 1952. Lectures and Conferences on Mathematical Statistics and Probability. 2^{nd} ed. Washington, DC: Graduate School of U.S. Dept. of Agriculture.
Neyman, J. 1977. “Frequentist Probability and Frequentist Statistics“, Synthese 36(1): 97–131.
Neyman, J. & Pearson, E. 1933. “On the Problem of the Most Efficient Tests of Statistical Hypotheses“, Philosophical Transactions of the Royal Society of London 231. Series A, Containing Papers of a Mathematical or Physical Character: 289–337.
Pearson, E. S. 1962. “Some Thoughts on Statistical Inference”, The Annals of Mathematical Statistics, 33(2): 394-403.
Pearson, E. S. & Sekar, C. C. 1936. “The Efficiency of Statistical Tools and a Criterion for the Rejection of Outlying Observations“, Biometrika 28(3/4): 308-320. Reprinted (1966) in The Selected Papers of E. S. Pearson, (pp. 118-130). Berkeley: University of California Press.
Reid, Constance (1982). Neyman–from life
A Statistical Model as a Chance Mechanism
Aris Spanos
Jerzy Neyman (April 16, 1894 – August 5, 1981), was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals. (This article was first posted here.)
One of Neyman’s most remarkable, but least recognized, achievements was his adapting of Fisher’s (1922) notion of a statistical model to render it pertinent for non-random samples. Fisher’s original parametric statistical model M_{θ}(x) was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data x_{0}:=(x_{1},x_{2},…,x_{n}) can be viewed as a ‘truly representative sample’ from that ‘population’:
“The postulate of randomness thus resolves itself into the question, ‘Of what population is this a random sample?’” (ibid., p. 313), underscoring that “the adequacy of our choice may be tested a posteriori” (p. 314).
In cases where data x_{0} come from sample surveys or can be viewed as a typical realization of a random sample X:=(X_{1},X_{2},…,X_{n}), i.e. Independent and Identically Distributed (IID) random variables, the ‘population’ metaphor can be helpful in adding some intuitive appeal to the inductive dimension of statistical inference, because one can imagine using a subset of a population (the sample) to draw inferences pertaining to the whole population.
This ‘infinite population’ metaphor, however, is of limited value in most applied disciplines relying on observational data. To see how inept this metaphor is, consider the question: what is the hypothetical ‘population’ when modeling the gyrations of stock market prices? More generally, what is observed in such cases is a certain on-going process and not a fixed population from which we can select a representative sample. For that very reason, most economists in the 1930s considered Fisher’s statistical modeling irrelevant for economic data!
Due primarily to Neyman’s experience with empirical modeling in a number of applied fields, including genetics, agriculture, epidemiology, biology, astronomy and economics, his notion of a statistical model evolved beyond Fisher’s ‘infinite populations’ in the 1930s into Neyman’s frequentist ‘chance mechanisms’ (see Neyman, 1950, 1952):
Guessing and then verifying the ‘chance mechanism’, the repeated operation of which produces the observed frequencies. This is a problem of ‘frequentist probability theory’. Occasionally, this step is labeled ‘model building’. Naturally, the guessed chance mechanism is hypothetical. (Neyman, 1977, p. 99)
From my perspective, this was a major step forward for several reasons, including the following.
First, the notion of a statistical model as a ‘chance mechanism’ extended the intended scope of statistical modeling to include dynamic phenomena that give rise to data from non-IID samples, i.e. data that exhibit both dependence and heterogeneity, like stock prices.
Second, the notion of a statistical model as a ‘chance mechanism’ is not only of metaphorical value, but it can be operationalized in the context of a statistical model, formalized by:
M_{θ}(x)={f(x;θ), θ∈Θ}, x∈R^{n}, Θ⊂R^{m}; m << n,
where the distribution of the sample f(x;θ) describes the probabilistic assumptions of the statistical model. This takes the form of a statistical Generating Mechanism (GM), stemming from f(x;θ), that can be used to generate simulated data on a computer. An example of such a Statistical GM is:
X_{t} = α_{0} + α_{1}X_{t-1} + σε_{t}, t=1,2,…,n
This indicates how one can use pseudo-random numbers for the error term ε_{t} ~ NIID(0,1) to simulate data for the Normal, AutoRegressive [AR(1)] Model. One can generate numerous sample realizations, say N=100000, of sample size n in seconds on a PC.
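A minimal sketch of such a statistical GM on a computer might look as follows. The parameter values, the starting value x0 and the function name are illustrative assumptions, not part of the model specification above.

```python
import random


def simulate_ar1(alpha0, alpha1, sigma, n, x0=0.0, seed=None):
    """Generate one realization (X_1, ..., X_n) from the Normal AR(1) GM
    X_t = alpha0 + alpha1 * X_{t-1} + sigma * eps_t, eps_t ~ NIID(0, 1).

    x0 is an assumed starting value for X_0.
    """
    rng = random.Random(seed)
    xs = []
    prev = x0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)  # pseudo-random standard Normal error
        current = alpha0 + alpha1 * prev + sigma * eps
        xs.append(current)
        prev = current
    return xs


# Hypothetical parameter values, chosen only for illustration.
sample = simulate_ar1(alpha0=1.0, alpha1=0.5, sigma=1.0, n=200, seed=42)
print(f"n = {len(sample)}, sample mean = {sum(sample) / len(sample):.3f}")
```

Calling the function repeatedly with different seeds yields as many sample realizations as desired, each of which exhibits the dependence built into the mechanism.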
Third, the notion of a statistical model as a ‘chance mechanism’ puts a totally different spin on another metaphor widely used by uninformed critics of frequentist inference. This is the ‘long-run’ metaphor associated with the relevant error probabilities used to calibrate frequentist inferences. The operationalization of the statistical GM reveals that the temporal aspect of this metaphor is totally irrelevant for frequentist inference; remember Keynes’s catch phrase “In the long run we are all dead”? Instead, what matters in practice is its repeatability in principle, not over time! For instance, one can use the above statistical GM to generate the empirical sampling distributions for any test statistic, and thus render operational, not only the pre-data error probabilities like the type I and type II error probabilities as well as the power of a test, but also the post-data probabilities associated with the severity evaluation; see Mayo (1996).
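To illustrate how repeated runs of a statistical GM stand in for the 'long run', the sketch below generates empirical sampling distributions of one test statistic under the Normal AR(1) mechanism. The choice of statistic (√n times the lag-1 sample autocorrelation, for testing α_{1} = 0), the parameter values and the number of replications are all assumptions made for the example.

```python
import random
from math import sqrt


def simulate_ar1(alpha0, alpha1, sigma, n, rng):
    """One realization from X_t = alpha0 + alpha1*X_{t-1} + sigma*eps_t."""
    xs, prev = [], 0.0  # starting value 0 is an assumption
    for _ in range(n):
        prev = alpha0 + alpha1 * prev + sigma * rng.gauss(0.0, 1.0)
        xs.append(prev)
    return xs


def lag1_statistic(xs):
    """sqrt(n) times the lag-1 sample autocorrelation."""
    n = len(xs)
    mean = sum(xs) / n
    dev = [x - mean for x in xs]
    num = sum(dev[t] * dev[t - 1] for t in range(1, n))
    den = sum(d * d for d in dev)
    return sqrt(n) * num / den


def sampling_distribution(alpha1, n=100, n_reps=2000, seed=0):
    """Sorted statistics from n_reps independent runs of the GM."""
    rng = random.Random(seed)
    return sorted(lag1_statistic(simulate_ar1(0.0, alpha1, 1.0, n, rng))
                  for _ in range(n_reps))


# Cut-off: empirical 95th percentile of the statistic under alpha1 = 0.
null_dist = sampling_distribution(alpha1=0.0, seed=1)
critical = null_dist[int(0.95 * len(null_dist))]

# Type I error probability estimated on fresh runs of the null GM.
fresh_null = sampling_distribution(alpha1=0.0, seed=2)
type_i = sum(s > critical for s in fresh_null) / len(fresh_null)

# Power against the (hypothetical) alternative alpha1 = 0.3.
alt_dist = sampling_distribution(alpha1=0.3, seed=3)
power = sum(s > critical for s in alt_dist) / len(alt_dist)

print(f"critical value = {critical:.3f}")
print(f"estimated type I error = {type_i:.3f}, estimated power = {power:.3f}")
```

Nothing here unfolds over time: the replications are hypothetical repetitions of the same mechanism, and the error probabilities are read off as relative frequencies across them.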
HAPPY BIRTHDAY NEYMAN!
For further discussion on the above issues see:
Spanos, A. (2012), “A Frequentist Interpretation of Probability for Model-Based Inductive Inference,” in Synthese:
http://www.econ.vt.edu/faculty/2008vitas_research/Spanos/1Spanos-2011-Synthese.pdf
Fisher, R. A. (1922), “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society A, 222: 309-368.
Mayo, D. G. (1996), Error and the Growth of Experimental Knowledge, The University of Chicago Press, Chicago.
Neyman, J. (1950), First Course in Probability and Statistics, Henry Holt, NY.
Neyman, J. (1952), Lectures and Conferences on Mathematical Statistics and Probability, 2nd ed. U.S. Department of Agriculture, Washington.
Neyman, J. (1977), “Frequentist Probability and Frequentist Statistics,” Synthese, 36, 97-131.
[i] He was born in an area that was part of Russia.
For the first time, I’m excerpting all of Excursion 1 Tour II from SIST (2018, CUP).
1.4 The Law of Likelihood and Error Statistics
If you want to understand what’s true about statistical inference, you should begin with what has long been a holy grail–to use probability to arrive at a type of logic of evidential support–and in the first instance you should look not at full-blown Bayesian probabilism, but at comparative accounts that sidestep prior probabilities in hypotheses. An intuitively plausible logic of comparative support was given by the philosopher Ian Hacking (1965)–the Law of Likelihood. Fortunately, the Museum of Statistics is organized by theme, and the Law of Likelihood and the related Likelihood Principle is a big one.
Law of Likelihood (LL): Data x are better evidence for hypothesis H_{1} than for H_{0} if x is more probable under H_{1} than under H_{0}: Pr(x; H_{1}) > Pr(x; H_{0}), that is, the likelihood ratio LR of H_{1} over H_{0} exceeds 1.
H_{0} and H_{1} are statistical hypotheses that assign probabilities to the values of the random variable X. A fixed value of X is written x_{0}, but we often want to generalize about this value, in which case, following others, I use x. The likelihood of the hypothesis H, given data x, is the probability of observing x, under the assumption that H is true or adequate in some sense. Typically, the ratio of the likelihood of H_{1} over H_{0} also supplies the quantitative measure of comparative support. Note when X is continuous, the probability is assigned over a small interval around X to avoid probability 0.
Does the Law of Likelihood Obey the Minimal Requirement for Severity?
Likelihoods are vital to all statistical accounts, but they are often misunderstood because the data are fixed and the hypothesis varies. Likelihoods of hypotheses should not be confused with their probabilities. Two ways to see this. First, suppose you discover all of the stocks in Pickrite’s promotional letter went up in value (x)–all winners. A hypothesis H to explain this is that their method always succeeds in picking winners. H entails x, so the likelihood of H given x is 1. Yet we wouldn’t say H is therefore highly probable, especially without reason to put to rest that they culled the winners post hoc. For a second way, at any time, the same phenomenon may be perfectly predicted or explained by two rival theories; so both theories are equally likely on the data, even though they cannot both be true.
Suppose Bristol-Roach, in our Bernoulli tea tasting example, got two correct guesses followed by one failure. The observed data can be represented as x_{0} = <1,1,0>. Let the hypotheses be different values for θ, the probability of success on each independent trial. The likelihood of the hypothesis H_{0}: θ = 0.5, given x_{0}, which we may write as Lik(0.5), equals (½)(½)(½) = 1/8. Strictly speaking, we should write Lik(θ;x_{0}), because it’s always computed given data x_{0}; I will do so later on. The likelihood of the hypothesis θ = 0.2 is Lik(0.2) = (0.2)(0.2)(0.8) = 0.032. In general, the likelihood in the case of Bernoulli independent and identically distributed trials takes the form: Lik(θ) = θ^{s}(1-θ)^{f}, 0 < θ < 1, where s is the number of successes and f the number of failures. Infinitely many values for θ between 0 and 1 yield positive likelihoods; clearly, then, likelihoods do not sum to 1, or any number in particular. Likelihoods do not obey the probability calculus.
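The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `bernoulli_likelihood` and the grid of θ values are my own illustrative choices.

```python
def bernoulli_likelihood(theta, outcomes):
    """Lik(theta; x) = theta^s * (1 - theta)^f for IID Bernoulli trials."""
    s = sum(outcomes)            # number of successes
    f = len(outcomes) - s        # number of failures
    return theta ** s * (1.0 - theta) ** f

x0 = [1, 1, 0]                   # two correct guesses followed by one failure

lik_half = bernoulli_likelihood(0.5, x0)   # (1/2)(1/2)(1/2)
lik_02 = bernoulli_likelihood(0.2, x0)     # (0.2)(0.2)(0.8)
print(lik_half, round(lik_02, 3))

# Likelihoods over an (arbitrary) grid of hypotheses need not sum to 1:
grid = [k / 10 for k in range(1, 10)]
total = sum(bernoulli_likelihood(theta, x0) for theta in grid)
print(round(total, 3))
```

The grid total changes whenever the grid does, which is one way to see that nothing ties likelihoods to the probability calculus.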
The Law of Likelihood (LL) will immediately be seen to fail on our minimal severity requirement – at least if it is taken as an account of inference. Why? There is no onus on the Likelihoodist to predesignate the rival hypotheses – you are free to search, hunt, and post-designate a more likely, or even maximally likely, rival to a test hypothesis H_{0}.
Consider the hypothesis that θ = 1 on trials one and two and 0 on trial three. That makes the probability of x maximal. For another example, hypothesize that the observed pattern would always recur in three-trials of the experiment (I. J. Good said in his cryptoanalysis work these were called “kinkera”). Hunting for an impressive fit, or trying and trying again, one is sure to find a rival hypothesis H_{1} much better “supported” than H_{0} even when H_{0} is true. As George Barnard puts it, “there always is such a rival hypothesis, viz. that things just had to turn out the way they actually did” (1972, p. 129).
Note that for any outcome of n Bernoulli trials, the likelihood of H_{0}: θ = 0.5 is (0.5)^{n}, so is quite small. The likelihood ratio (LR) of a best-supported alternative compared to H_{0} would be quite high. Since one could always erect such an alternative,
(*) Pr(LR in favor of H_{1} over H_{0}; H_{0}) = maximal.
Thus the LL permits BENT evidence. The severity for H_{1} is minimal, though the particular H_{1} is not formulated until the data are in hand. I call such maximally fitting, but minimally severely tested, hypotheses Gellerized, since Uri Geller was apt to erect a way to explain his results in ESP trials. Our Texas sharpshooter is analogous because he can always draw a circle around a cluster of bullet holes, or around each single hole. One needn’t go to such an extreme rival, but it suffices to show that the LL does not control the probability of erroneous interpretations.
What do we do to compute (*)? We look beyond the specific observed data to the behavior of the general rule or method, here the LL. The output is always a comparison of likelihoods. We observe one outcome, but we can consider that for any outcome, unless it makes H_{0} maximally likely, we can find an H_{1} that is more likely. This lets us compute the relevant properties of the method: its inability to block erroneous interpretations of data. As always, a severity assessment is one level removed: you give me the rule, and I consider its latitude for erroneous outputs. We’re actually looking at the probability distribution of the rule, over outcomes in the sample space. This distribution is called a sampling distribution. It’s not a very apt term, but nothing has arisen to replace it. For those who embrace the LL, once the data are given, it’s irrelevant what other outcomes could have been observed but were not. Likelihoodists say that such considerations make sense only if the concern is the performance of a rule over repetitions, but not for inference from the data. Likelihoodists hold to “the irrelevance of the sample space” (once the data are given). This is the key contrast between accounts based on error probabilities (error statistical) and logics of statistical inference.
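One way to see the computation of (*) at work is to enumerate the whole sample space for n = 3 Bernoulli trials. The sketch below, with function names of my own choosing, pits H_{0}: θ = 0.5 against two post-designated rivals for every possible outcome: the best-fitting IID value of θ (the observed success rate) and the Gellerized hypothesis that the observed pattern had to occur. It then adds up, under H_{0}, the probability of outcomes for which the rival comes out more likely.

```python
from itertools import product

def lik(theta, outcome):
    """Bernoulli IID likelihood theta^s * (1 - theta)^f."""
    s = sum(outcome)
    f = len(outcome) - s
    return theta ** s * (1.0 - theta) ** f

n, theta0 = 3, 0.5
prob_lr_favors_rival = 0.0       # Pr(LR in favor of H1 over H0; H0)
geller_lrs = []

for outcome in product([0, 1], repeat=n):
    pr_under_h0 = lik(theta0, outcome)          # (0.5)^n for every outcome
    theta_hat = sum(outcome) / n                # rival chosen after seeing the data
    lr_best_iid = lik(theta_hat, outcome) / pr_under_h0
    lr_geller = 1.0 / pr_under_h0               # rival assigns probability 1 to the outcome
    geller_lrs.append(lr_geller)
    if lr_best_iid > 1.0:
        prob_lr_favors_rival += pr_under_h0

print(prob_lr_favors_rival, set(geller_lrs))
```

With n = 3 no outcome has an observed success rate of exactly 0.5, so even the more modest IID rival beats H_{0} on every outcome; the calculation ranges over outcomes that were not observed, which is exactly what the Likelihoodist deems irrelevant.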
To continue reading Excursion 1 Tour II, go here.
__________
This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).
Earlier excerpts and mementos from SIST up to Dec 31, 2018 are here.
Jan 10, 2019 Excerpt from SIST is here, Jan 27 is here, and Feb 23 here.
Jan 13, 2019 Mementos from SIST (Excursion 4) are here. These are summaries of all 4 tours.
March 5, 2019 Blurbs of all 16 Tours can be found here.
It seems like every week something of excitement in statistics comes down the pike. Last week I was contacted by Richard Harris (and 2 others) about the recommendation to stop saying the data reach “significance level p” but rather simply say
“the p-value is p”.
(For links, see my previous post.) Friday, he wrote to ask if I would comment on a proposed restriction (?) on saying a test had high power! I agreed that we shouldn’t say a test has high power, but only that it has a high power to detect a specific alternative, but I wasn’t aware of any rulings from those in power on power. He explained it was an upshot of a reexamination by a joint group of the boards of statistical associations in the U.S. and UK of the full panoply of statistical terms. Something like that. I agreed to speak with him yesterday. He emailed me the proposed ruling on power:
Do not say a test has high power. Don’t believe that if a test has high power to produce a low p-value when an alternative H’ is true, that finding a low p-value is good evidence for H’. This is wrong. Any effect, no matter how tiny, can produce a small p-value if the power of the test is high enough.
Recommendation: Report the complement of the power in relation to H’: the probability of a type II error β(H’). For instance, instead of saying the power of the test against H’ is .8, say “β(H’) = 0.2.”
“So what do you think?” he began the conversation. Giggling just a little, I told him I basically felt the same way about this as the ban on significance/significant. I didn’t see why people couldn’t just stop abusing power, and especially stop using it in the backwards fashion that is now common (and is actually encouraged by using power as a kind of likelihood). I spend the entire Excursion 5 on power in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars 2018, CUP. Readers of this blog can merely search for power to find quite a lot! That said, I told him the recommendation seemed OK, but noted that you need a minimal threshold (for declaring evidence against a test hypothesis) in order to compute power.
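To illustrate the last point, that power is undefined until a threshold is fixed, here is a small Python sketch for a one-sided Normal test of H_{0}: μ ≤ μ_{0} with known σ. The test, the function name `power_one_sided_z`, and the numbers (μ_{0} = 0, σ = 1, n = 25, α = 0.025, H’: μ = 0.56) are all assumptions made up for the example.

```python
from statistics import NormalDist

def power_one_sided_z(mu_alt, mu0, sigma, n, alpha):
    """Power of the one-sided z-test of H0: mu <= mu0 against the point alternative mu_alt.
    The cut-off z_alpha comes from the chosen threshold alpha; without it there is no power."""
    se = sigma / n ** 0.5
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)      # rejection threshold on the z scale
    return 1.0 - NormalDist().cdf(z_alpha - (mu_alt - mu0) / se)

power = power_one_sided_z(mu_alt=0.56, mu0=0.0, sigma=1.0, n=25, alpha=0.025)
beta = 1.0 - power                                    # probability of a type II error at H'
print(f"power against H' = {power:.2f}, beta(H') = {beta:.2f}")
```

Changing `alpha` changes the power against the same H’, so reporting β(H’) in place of power relabels the number without removing its dependence on a threshold.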
After talking about power, we moved on to some other statistical notions under review. He’d told me last time he lacked a statistical background, and asked me to point out any flagrant mistakes (in his sum-ups, not the Board’s items, which were still under embargo, whatever that means.) I was glad to see that, apparently, the joint committees were subjecting some other notions to scrutiny (for once). According to his draft, the Board doesn’t want people saying a hypothesis or model receives a high probability, say .95, because it is invariably equivocal.
Do not say a hypothesis or model has a high posterior probability (e.g., 0.95) given the data. The statistical and ordinary language meanings of “probability” are now so hopelessly confused; the term should be avoided for that reason alone.
Don’t base your scientific conclusion or practical decision solely on whether a claim gets a high posterior probability. That a hypothesis is given a .95 posterior does not by itself mean it has a “truth probability” of 0.95, nor that H is practically certain (while it’s very improbable that H is false), nor that the posterior was arrived at by a method that is correct 95% of the time, nor that it is rational to bet that the probability of H is 0.95, nor that H will replicate 95% of the time, nor that the falsity of H would have produced a lower posterior on H with probability .95. The posterior can reflect empirical priors, default or data dominant priors, priors from an elicitation of beliefs, conjugate priors, regularisation, prevalence of true effects in a field, or many, many others.
A Bayesian posterior report doesn’t tell you how uncertain that report is.
A posterior of .95 depends on just one way of exhausting the space of possible hypotheses or models (invariably excluding those not thought of). This can considerably distort the scientific process which is always open ended.
Recommendation: If you’re doing a Bayesian posterior assessment, just report “a posterior on H is .95” (or other value), or a posterior distribution over parameters. Don’t say probable.
At this point I began to wonder if he was for real. Was he the Richard Harris who wrote that article last week? I was approached by 3 different journals, and never questioned them. Was this some kind of a backlash to the p-value pronouncements from Stat Report Watch? Or maybe he was that spoofer Justin Smith (whom I don’t know in the least) who recently started that blog on the P-value police. My caller assured me he was on the level, and he did have the official NPR logo. So we talked for around 2 hours!
Comparative measures don’t get off scot-free, according to this new report of the joint Boards:
Don’t say one hypothesis H is more likely than another H’ because this is likely to be interpreted as H is more probable than H’.
Don’t believe that because H is more likely than H’, given data x, that H is probable, well supported or plausible, while H’ is not. This is wrong. It just means H makes the data x more probable than does H’. A high likelihood ratio LR can occur when both H and H’ are highly unlikely, and when some other hypothesis H” is even more likely. Two incompatible hypotheses can both be maximally likely. Being incompatible, they cannot both be highly probable. Don’t believe a high LR in favor of H over H’ means the effect size is large or practically important.
Recommendation: Report “the value of the LR (of H over H’) = k” rather than “H is k times as likely as H'”. As likelihoods enter the computation as a ratio, the word “likelihood” is not necessary and should be dropped wherever possible. The LR level can be reported. The statistical and ordinary language meanings of “likely” are sufficiently confused to avoid the term.
Odds ratios and Bayes Factors (BFs), surprisingly, are treated almost the same way as the LR. (Don’t say H is more probable than H’. Just report BFs and prior odds. A BF doesn’t tell you if an effect size is scientifically or practically important. There’s no BF value between H and H’ that tells you there’s good evidence for H.)
But maybe I shouldn’t be so surprised. As in the initial ASA statement, the newest Report avers
Nothing in this statement is new. Statisticians and others have been sounding the alarm about these matters for decades,
In support of their standpoint on posteriors as well as on Bayes Factors they cite Andrew Gelman:
“I do not trust Bayesian induction over the space of models because the posterior probability of a continuous-parameter model depends crucially on untestable aspects of its prior distribution” (Gelman 2011, p. 70). He’s also cited as regards their new rule on Bayes Factors. “To me, Bayes factors correspond to a discrete view of the world, in which we must choose between models A, B, or C” (Gelman 2011, p. 74) or a weighted average of them as in Madigan and Raftery (1994).
Also familiar is the mistake in taking a high BF in favor of a null or test hypothesis H over an alternative H’ as if it supplies evidence in favor of H. It’s always just a comparison; there’s never a falsification, unlike statistical tests (unless supplemented with a falsification rule[i]).
The Board warns: “Don’t believe a Bayes Factor in favor of H over H’, using a “default” Bayesian prior, means the results are neutral, uninformative, or bias free.” Here the report quotes Uri Simonsohn:
“Saying a Bayesian test ‘supports the null’ in absolute terms seems as fallacious to me as interpreting the p-value as the probability that the null is false.”
“What they actually ought to write is ‘the data support the null more than they support one mathematically elegant alternative hypothesis I compared it to’”
The default Bayes factor test “means the Bayesian test ends up asking: ‘is the effect zero, or is it biggish?’ When the effect is neither, when it’s small, the Bayesian test ends up concluding (erroneously) it’s zero” with high probability.
Scanning the rest of Harris’ article, which was merely in rough draft yesterday, I could see that next in line to face the axe are: confidence, credible, coherent, and probably other honorifics. Maybe now people can spend more time thinking, or so they tell you [ii]!
Check date!
[i] For example, the rule might be: falsify H in favor of H’ whenever H’ is k times as likely or probable as H, or whenever the posterior of H’ exceeds .95, or whenever the p-value against H is less than .05.
First set.
1. We agree on the age-old fallacy of non-rejection of a null hypothesis: a non-statistically significant result at level P is not evidence for the null because a test may have low probability of rejecting a null even if it’s false (i.e., it might have low power to detect a particular alternative).
The solution in the severity interpretation of tests is to take a result that is not statistically significant at a small level, i.e., a large P-value, as ruling out given discrepancies from the null or other reference value:
The data indicate that discrepancies from the null are less than those parametric values the test had a high probability of detecting, if present. See p. 351 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). [i]
This is akin to the use of power analysis, except that it is sensitive to the actual outcome. It is very odd that this paper makes no mention of power analysis, since that is the standard way to interpret non-significant results.
Using non-significant results (“moderate” P-values) to set upper bounds is done throughout the sciences and is highly informative. This paper instead urges us to read into any observed difference found to be in the welcome direction, to potentially argue for an effect.
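Here is a rough Python sketch of how a non-significant result can set upper bounds in this way, for a one-sided Normal test of H_{0}: μ ≤ μ_{0} with known σ. It evaluates the claim μ ≤ μ_{1} by the probability of a larger observed mean than the one obtained, were μ equal to μ_{1}. The function name `severity_upper_bound` and the numbers (σ = 10, n = 100, observed mean 151) are illustrative assumptions.

```python
from statistics import NormalDist

def severity_upper_bound(xbar_obs, mu1, sigma, n):
    """Assessment of the claim 'mu <= mu1' after a non-significant result:
    Pr(sample mean > xbar_obs; mu = mu1), for a Normal mean with known sigma."""
    se = sigma / n ** 0.5
    return 1.0 - NormalDist().cdf((xbar_obs - mu1) / se)

sigma, n, xbar_obs = 10.0, 100, 151.0      # standard error is 1; values are made up
for mu1 in (151.0, 152.0, 153.0):
    sev = severity_upper_bound(xbar_obs, mu1, sigma, n)
    print(f"claim mu <= {mu1}: {sev:.3f}")
```

Unlike a pre-data power calculation, the assessment moves with the observed mean: a smaller observed mean would warrant tighter upper bounds, and the same outcome warrants a bound of 153 better than a bound of 152.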
2. I agree that one shouldn’t mechanically use P< .05. Ironically, they endorse a .95 confidence interval CI. They should actually use several levels, as is done with a severity assessment.
I have objections to their interpretation of CIs, but I will mainly focus my objections on the ban of the words “significance” or “significant”. It’s not too hard to report that results are significant at level .001 or whatever. Assuming researchers invariably use an unthinking cut-off, rather than reporting the significance level attained by the data, they want to ban words. They (Greenland at least) claim this is a political fight, and so arguing by an appeal to numbers (who sign on to their paper) is appropriate for science. I think many will take this as yet one more round of significance test bashing–even though, amazingly, it is opposite to the most popular of today’s statistical wars. I explain in #3. (The actual logic of significance testing is lost in both types of criticisms.)
3. The most noteworthy feature of this criticism of statistical significance tests is that it is opposite to the most well-known and widely circulated current criticisms of significance tests.
In other words, the big move in the statistics wars these days is to fight irreplication by making it harder to reject, and find evidence against, a null hypothesis. The most well known Bayesian reforms being bandied about do this by giving a point prior–a lump of prior probability–to a point null hypothesis. (There’s no mention of this in the paper.)
These Bayesians argue that small P-values are consistent with strong evidence for the null hypothesis. They conclude that P-values exaggerate the evidence against the null hypothesis. Never mind for now that they are insisting P-values be measured against a standard that is radically different from what the P-value means. All of the criticisms invoke reasoning at odds with statistical significance tests. I want to point out the inconsistency between those reforms and the current one. I will call them Group A and Group B:
Group A: “Make it harder to find evidence against the null”: a P-value of .05 (i.e. a statistically significant result) should not be taken as evidence against the null, it may often be evidence for the null.
Group B (“Retire Stat Sig”): “Make it easier to find evidence against the null”: a P-value > .05 (i.e., a non-statistically significant result) should not be taken as evidence for the null, it may often be evidence against the null.
A proper use and interpretation of statistical tests (as set out in my SIST) interprets P-values correctly in both cases and avoids fallacies of rejection (inferring a magnitude of discrepancy larger than warranted) and fallacies of non-rejection (inferring the absence of an effect smaller than warranted).
The fact that we shouldn’t use thresholds unthinkingly does not mean we don’t need thresholds for lousy and terrible evidence! When data provide lousy evidence, when little if anything has been done to rule out known flaws in a claim, it’s not a little bit of evidence (on my account). The most serious concern with the “Retire” argument to ban thresholds for significance is that it is likely to encourage the practice whereby researchers spin their non-significant results by P-hacking or data dredging. It’s bad enough that they do this. Read Goldacre [ii]
Note their saying the researcher should discuss the observed difference. This opens the door to spinning it convincingly to the uninitiated reader.
4. What about selection effects? The really important question that is not mentioned in this paper is whether the researcher is allowed to search for endpoints post-data.
My own account replaces P-values with reports of how severely tested various claims are, whether formal or informal. If we are in a context reporting P-values, the phrase “statistically significant” at the observed P-value is important because the significance level is invalidated by multiple testing, optional stopping, data-dependent subgroups, and data dredging. Everyone knows that. (A P-value, by contrast, if detached from corresponding & testable claims about significance levels, is sometimes seen as a mere relationship between data and a hypothesis.) Getting rid of the term is just what is wanted by those who think the researcher should be free to scour the data in search of impressive-looking effects, or interpret data according to what they believe. Some aver that their very good judgment allows them to determine post-data what the pre-registered endpoints really are or were or should have been. (Goldacre calls this “trust the trialist”). The paper mentions pre-registration fleetingly, but these days we see nods to it that actually go hand in hand with flouting it.
The ASA P-value Guide very pointedly emphasizes that selection effects invalidate P-values. But it does not say that selection effects need to be taken into account by any of the “alternative measures of evidence”, including Bayesian and Likelihoodist. Are they free from Principle 4 on transparency, or not? Whether or when to take account of multiple testing and data dredging are known to be key points on which those accounts differ from significance tests (at least all those who hold to the Likelihood Principle, as with Bayes Factors and Likelihood Ratios).
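The way searching invalidates a reported significance level can be shown with a short simulation. The Python sketch below assumes 20 independent endpoints, each tested one-sided at level .05 with every null hypothesis true, and asks how often at least one reaches nominal significance. The number of endpoints, the independence assumption, and the variable names are mine, chosen only for illustration.

```python
import random
from statistics import NormalDist

alpha, k = 0.05, 20                              # nominal level; endpoints searched
z_crit = NormalDist().inv_cdf(1.0 - alpha)       # one-sided cut-off on the z scale

# Analytic value for independent tests: 1 - (1 - alpha)^k
analytic = 1.0 - (1.0 - alpha) ** k

rng = random.Random(7)
N = 10000                                        # simulated studies, all nulls true
hits = 0
for _ in range(N):
    # Each endpoint's z-statistic is N(0,1) under its null; report only the best one.
    best_z = max(rng.gauss(0.0, 1.0) for _ in range(k))
    if best_z > z_crit:
        hits += 1
simulated = hits / N

print(f"analytic = {analytic:.3f}, simulated = {simulated:.3f}")
```

The actual probability of reporting a nominally significant endpoint is far above .05, which is why the claimed level is no longer the one the procedure has.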
5. A few asides:
They should really be doing one-sided tests and do away with the point null altogether (except for special cases; I agree with D.R. Cox, who suggests doing two 1-sided tests). (With 1-sided tests, the test hypothesis and alternative hypothesis are symmetrical, as with N-P tests.)
The authors seem to view a test as a report on parameter values that merely fit or are compatible with data. This misses testing reasoning! Granted the points within a CI aren’t far enough away to reject the null at level .05–but that doesn’t mean there’s evidence for them. In other words, they commit the same fallacy they are on about, but regarding members of the CI. In fact there is fairly good evidence the parameter value is less than those values close to the upper confidence limit. Yet this paper calls them compatible, even where there’s rather strong evidence against them, as with an upper .9 level bound, say.
[Using one-sided tests and letting the null assert: a positive effect exists, the recommended account is tantamount to taking the non-significant result as evidence for this null.]
Second Set (to briefly give the minimal non-technical points):
I do think we should avoid the fallacy of going from a large P-value to evidence for a point null hypothesis: inferring evidence of no effect.
CIs at the .95 level are more dichotomous than reporting attained P-values for various hypotheses.
The fact that we shouldn’t use thresholds unthinkingly does not mean we don’t need thresholds for lousy and terrible evidence!
The most serious concern with the argument to ban thresholds for significance is that it encourages researchers to spin their non-significant results by P-hacking, data dredging, multiple testing, and outcome-switching.
I would like to see some attention paid to how easy it is to misinterpret results with Bayesian and Likelihoodist methods. Obeying the LP, there is no onus to take account of selection effects, and priors are very often data-dependent, giving even more flexibility.
Third Set (for different journals)
Banning the word “significance” may well free researchers from being held accountable when they downplay negative results and search the data for impressive-looking subgroups.
It’s time for some attention to be paid to how easy it is to misinterpret results on various (subjective, default) Bayesian methods–if there is even agreement on one to examine. The brouhaha is all about a method that plays a small role in an overarching methodology that is able to bound the probabilities of seriously misleading interpretations of data. These are called error probabilities. Their role is just a first indication of whether results could readily be produced by chance variability alone.
Rival schools of statistics (the ASA Guide’s “alternative accounts of evidence”) have never shown their worth in controlling error probabilities of methods. (Without this, we cannot assess their capability for having probed mistaken interpretations of data).
Until those alternative methods are subject to scrutiny for the same or worse abuses–biasing selection effects–we should be wary of ousting these methods and the proper speech that goes with them.
One needs to consider a statistical methodology as a whole–not one very small piece. That full methodology may be called error statistics. (Focusing on the simple significance test, with a point null & no alternative or power consideration, as in the ASA Guide, hardly does justice to the overall error statistical methodology. Error statistics is known to be a piecemeal account–it’s highly distorting to focus on an artificial piece of it.)
Those who use these methods with integrity never recommend using a single test to move from statistical significance to a substantive scientific claim. Once a significant effect is found, they move on to estimating its effect size & exploring properties of the phenomenon. I don’t favor existing testing methodologies but rather reinterpret tests as a way to infer discrepancies that are well or poorly indicated. I described this account over 25 years ago.
On the other hand, simple significance tests are important for testing assumptions of statistical models. Bayesians, if they test their assumptions, use them as well, so they could hardly ban them entirely. But what are P-values measuring? OOPS! you’re not allowed to utter the term s____ance level that was coined for this purpose. Big Brother has dictated! (Look at how strange it is to rewrite Goldacre’s claim below without it. [ii])
I’m very worried that the lead editorial in the new “world after P ≤ 0.05” collection warns us that even if scientists repeatedly show statistically significant increases (p< 0.01 or 0.001) in lead poisoning among children in City F, we mustn’t “conclude anything about scientific or practical importance” such as the water is causing lead poisoning.
“Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof)” (p.1, editorial for the Special Issue).
Following this rule, and note the qualification that had been in the ASA Guide is missing, would mean never inferring risks of concern when there was uncertainty (among much else that would go by the wayside). Risks have to be so large and pervasive that no statistics is needed! Statistics is just window dressing, with no actual upshot about the world. Menopausal women would still routinely be taking and dying from hormone replacement therapy because “real world” observational results are compatible with HRT staving off age-related diseases.
Welcome to the brave new world after abandoning error control.
See also my post “Deconstructing ‘A World Beyond P-values’” on the 2017 conference.
[i] Mayo, D. (2018). Statistical Inference as Severe Testing: How To Get Beyond the Statistics Wars, Cambridge: Cambridge University Press.
[ii] Should we replace the offending terms with “moderate or non-small P-values”? The required level for “significance” is separately reported.
Misleading reporting by presenting a study in a more positive way than the actual results reflect constitutes ‘spin’. Authors of an analysis of 72 trials with non-significant results reported it was a common phenomenon, with 40% of the trials containing some form of spin. Strategies included reporting on statistically significant results for within-group comparisons, secondary outcomes, or subgroup analyses and not the primary outcome, or focussing the reader on another study objective away from the statistically non-significant result. (Goldacre)
[added March 25: To be clear, I have no objection to recommending people not use “statistical significance” routinely in that it may be confused with “important”. But the same warnings about equivocation would have to be given to the use of claims: H is more likely than H’. H is more probable than H’. H has probability p. What I object to is mandating a word ban, along with derogating statistical tests in general, while raising no qualms or questions about alternative methods. It doesn’t suffice to say “all methods have problems” either. Let’s look at them.
In the time people have spent repeating old criticisms of significance tests, different ways to deal with data-dependent selection effects could have been developed and experimented with. I know there is considerable work in this area, but I haven’t seen it in the pop discussions of significance tests and p-values.
Related links:
Gelman’s blog (post on April 12, 2019): Reviews and Discussions of Mayo’s New Book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.
Papers/Articles
Ronald L. Wasserstein, Allen L. Schirm, and Nicole A. Lazar, “Editorial: Moving to a World Beyond ‘p < 0.05’,” 73 Am. Statistician S1, S2 (2019).
Valentin Amrhein, Sander Greenland, Blake McShane, Retiring Statistical significance.
John P. A. Ioannidis, “Retiring statistical significance would give bias a free pass,” 567 Nature 461 (2019).
John P. A. Ioannidis, “Do Not Abandon Statistical Significance” (Nature, April 4, 2019)
Stephen Senn
Consultant Statistician
Edinburgh
Failure to understand components of variation is the source of much mischief. It can lead researchers to overlook that they can be rich in data-points but poor in information. The important thing is always to understand what varies in the data you have, and to what extent your design, and the purpose you have in mind, master it. The result of failing to understand this can be that you mistakenly calculate standard errors of your estimates that are too small because you divide the variance by an n that is too big. In fact, the problems can go further than this, since you may even pick up the wrong covariance and hence use inappropriate regression coefficients to adjust your estimates.
I shall illustrate this point using clinical trials in asthma.
Suppose that I design a clinical trial in asthma as follows. I have six centres, each centre has four patients, each patient will be studied in two episodes of seven days and during these seven days the patients will be measured daily, that is to say, seven times per episode. I assume that between the two episodes of treatment there is a period of some days in which no measurements are taken. In the context of a cross-over trial, which I may or may not decide to run, such a period is referred to as a washout period.
The block structure is like this:
Centres/Patients/Episodes/Measurements
The / sign is a nesting operator and it shows, for example, that I have Patients ‘nested’ within centres. For example, I could label the patients 1 to 4 in each centre, but I don’t regard patient 3 (say) in centre 1 as being somehow similar to patient 3 in centre 2 and patient 3 in centre 3 and so forth. Patient is a term that is given meaning by referring it to centre.
The block structure is shown in Figure 1, which does not, however, show the seven measurements per episode.
Figure 1. Schematic representation of the block structure for some possible clinical trials. The six centres are shown by black lines. For each centre there are four patients shown by blue lines and each patient is studied in two episodes, shown by red lines.
I now wish to compare two treatments, two so-called beta-agonists. The first of these I shall call Zephyr and the second Mistral. I shall do this using a measure of lung function called forced expiratory volume in one second (FEV_{1}). If there are no dropouts and no missing measurements, I shall have 6 x 4 x 2 x 7 = 336 FEV_{1} readings. Is this my ‘n’?
I am going to use Genstat®, a package that fully incorporates John Nelder’s ideas of general balance[1, 2] and the analysis of designed experiments and uses, in fact, what I have called the Rothamsted approach to experiments.
I start by declaring the block structure thus
BLOCKSTRUCTURE Centre/Patient/Episode/Measurement
This is the ‘null’ situation: it describes the variation in the experimental material before any treatment is applied. I can ask Genstat® to do a ‘null’ skeleton analysis of variance for me by typing the statement
ANOVA
The output is as given in Table 1.
Source of variation | d.f. |
Centre stratum | 5 |
Centre.Patient stratum | 18 |
Centre.Patient.Episode stratum | 24 |
Centre.Patient.Episode.Measurement stratum | 288 |
Total | 335 |
Table 1. Degrees of freedom for a null analysis of variance for a nested block structure.
This only gives me possible sources of variation and degrees of freedom associated with them but not the actual variances: that would require data. There are six centres, so five degrees of freedom between centres. There are four patients per centre, so three degrees of freedom per centre between patients but there are six centres and therefore 6 x 3 = 18 in total. There are two episodes per patient and so one degree of freedom between episodes per patient but there are 24 patients and so 24 degrees of freedom in total. Finally, there are seven measurements per episode and hence six degrees of freedom but 48 episodes in total so 48 x 6 = 288 degrees of freedom for measurements.
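The counting argument above can be checked mechanically. What follows is a minimal Python sketch, not Genstat output: the function name `nested_df` and the way the levels are encoded are illustrative assumptions, and the only rule used is the one stated in the text (units per parent minus one, multiplied by the number of parent units).

```python
def nested_df(levels):
    """Degrees of freedom per stratum for a balanced, fully nested block structure.

    levels: list of (name, units per parent unit), outermost level first.
    """
    df = {}
    parents = 1   # number of units at the level above the current one
    path = []
    for name, size in levels:
        path.append(name)
        # (size - 1) degrees of freedom inside each parent unit
        df[".".join(path)] = parents * (size - 1)
        parents *= size
    return df

# Block structure from the text: Centre/Patient/Episode/Measurement
levels = [("Centre", 6), ("Patient", 4), ("Episode", 2), ("Measurement", 7)]
strata = nested_df(levels)
total = sum(strata.values())

for name, d in strata.items():
    print(f"{name} stratum: {d}")
print(f"Total: {total}")
```

The total is one less than the number of readings, as it should be for a complete partition of the variation.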
Having some actual data would put flesh on the bones of this skeleton by giving me some mean square errors, but to understand the general structure this is not necessary. It tells me that at the highest level I will have variation between centres, next patients within centres, after that episodes within patients and finally measurements within episodes. Which of these are relevant to judging the effect of any treatments I wish to study depends how I allocate treatments.
I now consider three possible approaches to allocating treatments to patients. In each of the three designs, the same number of measurements will be available for each treatment. There will be 168 measurements under Zephyr and 168 measurements under Mistral and thus 336 in total. However, as I shall show, the designs will be very different, and this will lead to different analyses being appropriate and lead us to understand better what our n is.
I shall also suppose that we are interested in causal analysis rather than prediction. That is to say, we are interested in estimating the effect that the treatments did have (actually, the difference in their effects) in the trial that was actually run. The matter of predicting what would happen in future to other patients is much more delicate and raises other issues and I shall not address it here, although I may do so in future. For further discussion see my paper Added Values[3].
In the first experiment, I carry out a so-called cluster-randomised trial. I choose three centres at random and all patients in the three centres chosen receive Zephyr, in both episodes and on all occasions. For the other three centres, all patients on all occasions receive Mistral. I create a factor Treatment (cluster trial), (Cluster for short) which encodes this allocation so that the pattern of allocation to Zephyr or Mistral reflects this randomised scheme.
In the second experiment, I carry out a parallel group trial blocking by centre. In each centre, I choose two patients to receive Zephyr and two to receive Mistral. Thus, overall, there are 6 x 2 = 12 patients on each treatment. I create a factor Treatment (parallel trial) (Parallel for short) to reflect this.
The third experiment consists of a cross-over trial. Each patient is randomised to one of two sequences, either receiving Zephyr in episode one and Mistral in episode two, or vice versa. Each patient receives both treatments so that there will be 6 x 4 = 24 patients given each treatment. I create a factor Treatment (cross-over trial) (Cross-over for short) to encode this.
Note that the total number of measurements obtained is the same for each of the three schemes. For the cluster randomised trial, a given treatment will be studied in three centres each of which has four patients, each of whom will be studied in two episodes on seven occasions. Thus, we have 3 x 4 x 2 x 7 = 168 measurements per treatment. For the parallel group trial, 12 patients are studied for a given treatment in two episodes, each providing 7 measurements. Thus, we have 12 x 2 x 7 = 168 measurements per treatment. For the cross-over trial we have 24 patients each of whom will receive a given treatment in one episode (either episode one or two) so we have 24 x 1 x 7 = 168 measurements per treatment.
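The same arithmetic can be laid out in a few lines of Python. This is only a bookkeeping sketch; the variable names are illustrative and the halving simply reflects which level of the structure each design splits between the two treatments.

```python
centres, patients, episodes, measurements = 6, 4, 2, 7

per_treatment = {
    # cluster trial: half of the centres go to each treatment
    "cluster": (centres // 2) * patients * episodes * measurements,
    # parallel trial: half of the patients in each centre go to each treatment
    "parallel": centres * (patients // 2) * episodes * measurements,
    # cross-over trial: one of each patient's two episodes goes to each treatment
    "cross-over": centres * patients * (episodes // 2) * measurements,
}

for design, count in per_treatment.items():
    print(f"{design}: {count} measurements per treatment")
```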
Thus, from one point of view the n in the data is the same for each of these three designs. However, each of the three designs provides very different amounts of information and this alone should be enough to warn anybody against assuming that all problems of precision can be solved by increasing the number of data.
Before collecting any data, I can analyse this scheme and use Nelder’s approach to tell me where the information is in each scheme.
Using the three factors to encode the corresponding allocation, I now ask Genstat® to prepare a dummy analysis of variance (in advance of having collected any data) as follows. All I need to do is type a statement of the form
TREATMENTSTRUCTURE Design
ANOVA
where Design is set equal to Cluster, Parallel or Cross-over, as the case may be. The result is shown in Table 2.
Source of variation | d.f. |
Centre stratum | |
Treatment (cluster trial) | 1 |
Residual | 4 |
Centre.Patient stratum | |
Treatment (parallel trial) | 1 |
Residual | 17 |
Centre.Patient.Episode stratum | |
Treatment (cross-over trial) | 1 |
Residual | 23 |
Centre.Patient.Episode.Measurement stratum | 288 |
Total | 335 |
Table 2. Analysis of variance skeleton for three possible designs using the block structure given in Table 1
This shows us that the three possible designs will have quite different degrees of precision associated with them. Since, for the cluster trial, any given centre only receives one of the treatments, the variation between centres affects the estimate of the treatment effect and its standard error must reflect this. Since, however, the parallel trial balances treatments by centres it is unaffected by variation between centres. It is, however, affected by variation between patients. This variation is, in turn, eliminated by the cross-over trial which, in consequence is only affected by variation between episodes (although this variation will, itself, inherit variation from measurements). Each higher level of variation inherits variation from the lower levels but adds its own.
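The residual degrees of freedom in Table 2 follow from Table 1 once we note the stratum in which each design estimates the treatment effect. A small Python sketch of that bookkeeping, with dictionary names chosen here purely for illustration:

```python
# Stratum degrees of freedom from Table 1
stratum_df = {
    "Centre": 5,
    "Centre.Patient": 18,
    "Centre.Patient.Episode": 24,
}

# Stratum in which each design's treatment contrast is estimated (Table 2)
treatment_stratum = {
    "Cluster": "Centre",
    "Parallel": "Centre.Patient",
    "Cross-over": "Centre.Patient.Episode",
}

treatment_df = 1  # two treatments, hence one degree of freedom

residual_df = {
    design: stratum_df[stratum] - treatment_df
    for design, stratum in treatment_stratum.items()
}
print(residual_df)
```

The cluster trial is left with only four residual degrees of freedom for estimating the relevant variance, despite the 336 readings.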
Note, however, that for all three designs the unbiased estimate of the treatment effect is the same. All that is necessary is to average the 168 measurements under Zephyr and the 168 under Mistral and calculate the difference. It is the estimate of the appropriate variation in the estimate that varies.
Suppose that, more generally, we have m centres, with n patients per centre and p episodes per patient, with the number of measurements per episode fixed, then for the cross-over trial the variance of our estimate will be proportional to σ_{E}^{2}/(mnp), where σ_{E}^{2} is the variance between episodes. For the parallel group trial, there will be a further term involving σ_{P}^{2}/(mn), where σ_{P}^{2} is the variance between patients. Finally, for the cluster randomised trial there will be a further term involving σ_{C}^{2}/m, where σ_{C}^{2} is the variance between centres.
The consequences of this are that you cannot decrease the variance of a cluster randomised trial indefinitely simply by increasing the number of patients; it is centres you need to increase. Likewise, you cannot decrease the variance of a parallel group trial indefinitely by increasing the number of episodes; it is patients you need to increase.
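To see the floor numerically, here is a rough Python sketch. It assumes that the three terms described above simply add and that every constant of proportionality equals one; the function name and the variance values are hypothetical and chosen only for illustration.

```python
def effect_variance(design, m, n, p, var_centre, var_patient, var_episode):
    """Variance of the treatment estimate, up to constants assumed equal to 1.

    m centres, n patients per centre, p episodes per patient.
    """
    v = var_episode / (m * n * p)      # present in all three designs
    if design in ("parallel", "cluster"):
        v += var_patient / (m * n)     # patients are not balanced out
    if design == "cluster":
        v += var_centre / m            # centres are not balanced out
    return v

# Hypothetical variance components, for illustration only
components = dict(var_centre=1.0, var_patient=1.0, var_episode=1.0)

for n in (4, 40, 4000):
    v = effect_variance("cluster", m=6, n=n, p=2, **components)
    print(f"cluster trial, {n} patients per centre: variance = {v:.4f}")

floor = components["var_centre"] / 6
print(f"floor set by the number of centres: {floor:.4f}")
```

However many patients are recruited per centre, the cluster trial's variance stays above the between-centre term; only adding centres lowers it.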
Why should this matter? Why should it matter how certain we are about anything? There are several reasons. Bayesian statisticians need to know what relative weight to give their prior belief and the evidence from the data. If they do not, they do not know how to produce a posterior distribution. If they do not know what the variances of both data and prior are, they don’t know the posterior variance. Frequentists and Bayesians are often required to combine evidence from various sources as, say, in a so-called meta-analysis. They need to know what weight to give to each and again to assess the total information available at the end. Any rational approach to decision-making requires an appreciation of the value of information. If one had to make a decision with no further prospect of obtaining information based on a current estimate it might make little difference how precise it was but if the option of obtaining further information at some cost applies, this is no longer true. In short, estimation of uncertainty is important. Indeed, it is a central task of statistics.
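The Bayesian point can be made concrete with the textbook normal prior and normal likelihood case, in which the posterior mean is a precision-weighted average. The Python sketch below is illustrative only: the function name and the numbers are assumptions, and the conjugate normal model is chosen for simplicity rather than taken from the trial discussed above.

```python
def normal_posterior(prior_mean, prior_var, estimate, estimate_var):
    """Posterior mean and variance for a normal prior and a normal estimate."""
    w_prior = 1.0 / prior_var        # precision of the prior
    w_data = 1.0 / estimate_var      # precision claimed for the data
    post_var = 1.0 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * estimate)
    return post_mean, post_var

# Same prior and same estimate; only the claimed variance of the estimate differs
honest = normal_posterior(0.0, 1.0, 1.0, 0.25)
too_small = normal_posterior(0.0, 1.0, 1.0, 0.01)
print("appropriate variance:", honest)
print("variance too small:  ", too_small)
```

Dividing by an n that is too big shrinks the claimed variance, which pulls the posterior towards the data and makes it look more certain than it should. The same precision weights are what a fixed-effect meta-analysis uses to combine studies.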
Finally, there is one further point that is important. What applies to variances also applies to covariances. If you are adjusting for a covariate using a regression approach, then the standard estimate of the coefficient of adjustment will involve a covariance divided by a variance. Just as there can be variances at various levels there can be covariances at various levels. It is important to establish which is relevant[4] otherwise you will calculate the adjustment incorrectly.
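A toy example shows how different the slopes at two levels can be. The data below are entirely made up so that the slope between group means is -2 while the slope within groups is +1; the helper `slope` is just covariance divided by variance, and the grouping stands in for any higher level such as centre or patient.

```python
def slope(xs, ys):
    """Least-squares slope: covariance of x and y divided by variance of x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Made-up data: three groups, three units each
groups = {
    g: ([10 * g + d for d in (-1, 0, 1)], [-20 * g + d for d in (-1, 0, 1)])
    for g in (1, 2, 3)
}

# Between-group slope: regress group means on group means
mean_x = [sum(xs) / len(xs) for xs, _ in groups.values()]
mean_y = [sum(ys) / len(ys) for _, ys in groups.values()]
between = slope(mean_x, mean_y)

# Within-group slope: pool deviations from each group's own means
dev_x, dev_y = [], []
for xs, ys in groups.values():
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    dev_x += [x - mx for x in xs]
    dev_y += [y - my for y in ys]
within = slope(dev_x, dev_y)

print(f"between-group slope: {between:.2f}, within-group slope: {within:.2f}")
```

An adjustment that uses one of these slopes when the design calls for the other will be wrong in both size and sign here.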
Just because you have many data does not mean that you will come to precise conclusions: the variance of the effect estimate may not, as one might naively suppose, be inversely proportional to the number of data, but to some other much rarer feature in the data-set. Failure to appreciate this has led to excessive enthusiasm for the use of synthetic patients and historical controls as alternatives to concurrent controls. However, the relevant dominating component of variation is that between studies not between patients. This does not shrink to zero as the number of subjects goes to infinity. It does not even shrink to zero as the number of studies goes to infinity, since if the current study is the only one that the new treatment is on, the relevant variance for that arm is at least σ_{S}^{2}/1, where σ_{S}^{2} is the variance between studies, even if, for the ‘control’ data-set it may be negligible, thanks to data collected from many subjects in many studies.
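A back-of-envelope Python sketch of this floor follows. It assumes a simple additive model in which each arm's variance is a between-study term divided by the number of studies plus a between-patient term divided by the number of patients; that model, the function name and the values are all assumptions made for illustration.

```python
def contrast_variance(var_study, var_patient, n_new, k_hist, n_hist):
    """Variance of (new arm - historical control) under an assumed additive model.

    The new treatment is studied in one study with n_new patients.
    The control data come from k_hist studies with n_hist patients each.
    """
    new_arm = var_study / 1 + var_patient / n_new
    control_arm = var_study / k_hist + var_patient / (k_hist * n_hist)
    return new_arm + control_arm

# Hypothetical components, for illustration only
var_study, var_patient = 0.5, 4.0

for n_new, k_hist in ((100, 5), (10_000, 50), (1_000_000, 5_000)):
    v = contrast_variance(var_study, var_patient, n_new, k_hist, n_hist=1_000)
    print(f"n_new={n_new}, historical studies={k_hist}: variance = {v:.4f}")
```

However large the historical data-set and the new arm become, the variance does not fall below the between-study variance contributed by the single current study.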
There is a lesson also for epidemiology here. All too often, the argument in the epidemiological, and more recently, the causal literature has been about which effects one should control for or condition on without appreciating that merely stating what should be controlled for does not solve how. I am not talking here about the largely sterile debate, to which I have contributed myself[5] as to how at a given level, adjustment should be made for possible confounders (for example, propensity score or linear model), but to the level at which such adjustment can be made. The usual implicit assumption is that an observational study is somehow a deficient parallel group trial, with maybe complex and perverse allocation mechanisms that must somehow be adjusted for, but that once such adjustments have been made, precision increases as the subjects increase. But suppose the true analogy is a cluster randomised trial. Then, whatever you adjust for, your standard errors will be too small.
Finally, it is my opinion that much of the discussion about Lord’s paradox would have benefitted from an appreciation of the issue of components of variance. I am used to informing medical clients that saying we will analyse the data using analysis of variance is about as useful as saying we will treat the patients with a pill. The varieties of analysis of variance are legion and the same is true of analysis of covariance. So, you conditioned on the baseline values. Bravo! But how did you condition on them? If you used a slope obtained at the wrong level of the data then, except fortuitously, your adjustment will be wrong, as will the precision you claim for it.
Finally, if I may be permitted an auto-quote, the price one pays for not using concurrent control is complex and unconvincing mathematics. That complexity may be being underestimated by those touting ‘big data’.
Deborah G. Mayo
Abstract for Book
By disinterring the underlying statistical philosophies this book sets the stage for understanding and finally getting beyond today’s most pressing controversies revolving around statistical methods and irreproducible findings. Statistical Inference as Severe Testing takes the reader on a journey that provides a non-technical “how to” guide for zeroing in on the most influential arguments surrounding commonly used–and abused– statistical methods. The book sets sail with a tool for telling what’s true about statistical controversies: If little if anything has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a stringent or severe test. In the severe testing account, probability arises in inference, not to measure degrees of plausibility or belief in hypotheses, but to assess and control how severely tested claims are. Viewing statistical inference as severe testing supplies novel solutions to problems of induction, falsification and demarcating science from pseudoscience, and serves as the linchpin for understanding and getting beyond the statistics wars. The book links philosophical questions about the roles of probability in inference to the concerns of practitioners in psychology, medicine, biology, economics, physics and across the landscape of the natural and social sciences.
Keywords for book:
Severe testing, Bayesian and frequentist debates, Philosophy of statistics, Significance testing controversy, statistics wars, replication crisis, statistical inference, error statistics, Philosophy and history of Neyman, Pearson and Fisherian statistics, Popperian falsification
Tour I: Beyond Probabilism and Performance
(1.1) If we’re to get beyond the statistics wars, we need to understand the arguments behind them. Disagreements about the roles of probability in statistical inference–holdovers from long-standing frequentist-Bayesian battles–still simmer below the surface of current debates on scientific integrity, irreproducibility, and questionable research practices. Striving to restore scientific credibility, researchers, professional societies, and journals are getting serious about methodological reforms. Some–disapproving of cherry picking and advancing preregistration–are welcome. Others might create obstacles to the critical standpoint we seek. Without understanding the assumptions behind proposed reforms, their ramifications for statistical practice remain hidden. (1.2) Rival standards reflect a tension between using probability (i) to constrain a method’s ability to avoid erroneously interpreting data (performance), and (ii) to assign degrees of support, confirmation, or plausibility to hypotheses (probabilism). We set sail with a tool for telling what’s true about statistical inference: If little has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a severe test. From this minimal severe-testing requirement, we develop a statistical philosophy that goes beyond probabilism and performance. (1.3) We survey the current state of play in statistical foundations.
Excursion 1 Tour I: Keywords
Error statistics, severity requirement: weak/strong, probabilism, performance, probativism, statistical inference, argument from coincidence, Lift-off (vs drag down), sampling distribution, cherry-picking
Excursion 1 Tour II: Error Probing Tools vs. Logics of Evidence
Core battles revolve around the relevance of a method’s error probabilities. What’s distinctive about the severe testing account is that it uses error probabilities evidentially: to assess how severely a claim has passed a test. Error control is necessary but not sufficient for severity. Logics of induction focus on the relationships between given data and hypotheses–so outcomes other than the one observed drop out. This is captured in the Likelihood Principle (LP). Tour II takes us to the crux of central wars in relation to the Law of Likelihood (LL) and Bayesian probabilism. (1.4) Hypotheses deliberately designed to accord with the data can result in minimal severity. The likelihoodist tries to oust them via degrees of belief captured in prior probabilities. To the severe tester, such gambits directly alter the evidence by leading to inseverity. (1.5) If a tester tries and tries again until significance is reached–optional stopping–significance will be attained erroneously with high probability. According to the LP, the stopping rule doesn’t alter evidence. The irrelevance of optional stopping is an asset for holders of the LP; it’s the opposite for a severe tester. The warring sides talk past each other.
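The effect of trying and trying again can be seen in a small simulation. The Python sketch below is an illustration under stated assumptions, not an example from the book: data are standard normal with the null hypothesis true, a two-sided z-test at the nominal .05 level is applied after every new observation, and the function names, seed and number of trials are arbitrary choices.

```python
import random
from math import sqrt

def rejects_with_optional_stopping(max_n, rng, z_crit=1.96):
    """Test after each observation; stop as soon as nominal significance is reached."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)          # null hypothesis is true: mean 0
        if abs(total / sqrt(n)) > z_crit:     # z statistic for the running mean
            return True
    return False

def rejection_rate(max_n, trials=2000, seed=1):
    rng = random.Random(seed)
    hits = sum(rejects_with_optional_stopping(max_n, rng) for _ in range(trials))
    return hits / trials

one_look = rejection_rate(max_n=1)       # a single test, no optional stopping
many_looks = rejection_rate(max_n=100)   # up to 100 looks at the data
print(f"one look: {one_look:.3f}, up to 100 looks: {many_looks:.3f}")
```

With a single look the rejection rate stays near the nominal level; with repeated looks the actual probability of erroneously reaching significance is several times larger, and it keeps growing as more looks are allowed.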
Excursion 1 Tour II: Keywords
Statistical significance: nominal vs actual, Law of likelihood, Likelihood principle, Inductive inference, Frequentist/Bayesian, confidence concept, Bayes theorem, default/non-subjective Bayesian, stopping rules/optional stopping, argument from intentions
Tour I: Induction and Confirmation
The roots of rival statistical accounts go back to the logical Problem of Induction. (2.1) The logical problem of induction is a matter of finding an argument to justify a type of argument (enumerative induction), so it is important to be clear on arguments, their soundness versus their validity. Given that any attempt to solve the logical problem of induction leads to circularity, philosophers turned instead to building logics that seemed to capture our intuitions about induction, e.g., Carnap’s confirmation theory. There’s an analogy between contrasting views in philosophy and statistics: Carnapian confirmation is to Bayesian statistics, as Popperian falsification is to frequentist error statistics. Logics of confirmation take the form of probabilisms, either in the form of raising the probability of a hypothesis, or arriving at a posterior probability. (2.2) The contrast between these types of probabilisms, and the problems each is found to have in confirmation theory is directly relevant to the types of probabilisms in statistics. Notably, Harold Jeffreys’ non-subjective Bayesianism, and current spin-offs, share features with Carnapian inductive logics. We examine problems of irrelevant conjunctions: if x confirms H, it confirms (H & J) for any J.
Tour I: keywords
asymmetry of induction and falsification, argument, sound and valid, enumerative induction (straight rule), confirmation theory (and formal epistemology), statistical affirming the consequent, guide to life, problem of induction, irrelevant conjunction, likelihood ratio, old evidence problem
Excursion 2 Tour II: Falsification, Pseudoscience, Induction
Tour II visits Popper, falsification, corroboration, Duhem’s problem (what to blame in the case of anomalies) and the demarcation of science and pseudoscience (2.3). While Popper comes up short on each, the reader is led to improve on Popper’s notions. Central ingredients for our journey are put in place via souvenirs: a framework of models and problems, and a post-Popperian language to speak about inductive inference. Defining a severe test, for Popperians, is linked to when data supply novel evidence for a hypothesis: family feuds about defining novelty are discussed (2.4). We move into Fisherian significance tests and the crucial requirements he set: isolated significant results are poor evidence of a genuine effect, and statistical significance doesn’t warrant substantive, e.g., causal inference (2.5). Applying our new demarcation criterion to a plausible effect (males are more likely than females to feel threatened by their partner’s success), we argue that a real revolution in psychology will need to be more revolutionary than at present. Whole inquiries might have to be falsified, their measurement schemes questioned (2.6). The Tour’s pieces are synthesized in (2.7), where a guest lecturer explains how to solve the problem of induction now, having redefined induction as severe testing.
Excursion 2 Tour II: keywords
Corroboration, Demarcation of science and pseudoscience, Falsification, Duhem’s problem, Novelty, Biasing selection effects, Simple significance tests, Fallacies of rejection, NHST, Reproducibility and replication
Tour I: Ingenious and Severe Tests
We move from Popper to the development of statistical tests (3.2) by way of a gallery on (3.1): Data Analysis in the 1919 Eclipse tests of the General Theory of Relativity (GTR). The tour opens by honing in on where the main members of our statistical cast are in 1919: Fisher, Neyman and Pearson. From the GTR episode, we identify the key elements of a statistical test–the steps we find in E. Pearson’s opening description of tests in (3.2). The typical (behavioristic) formulation of N-P tests is as mechanical rules to accept or reject claims with good long run error probabilities. The severe tester breaks out of the behavioristic prison. The classical testing notions–Type I and II errors, power, consistent tests–are shown to grow out of requiring of probative tests. Viewing statistical inference as severe testing, we explore how members of the Fisherian tribe can do all N-P tests do (3.3). We consider the frequentist principle of evidence FEV (Mayo and Cox) and the divergent interpretations that are called for by Cox’s taxonomy of null hypotheses. The last member of the taxonomy–substantively based null hypotheses–returns us to the opening episode of GTR.
Tour I: keywords
eclipse test, statistical test ingredients, Type I & II errors, power, P-value, uniformly most powerful (UMP); severity interpretation of tests, severity function, frequentist principle of evidence FEV; Cox’s taxonomy of nulls
Excursion 3 Tour II: It’s The Methods, Stupid
Tour II disentangles a jungle of conceptual issues at the heart of today’s statistical wars. (3.4) unearths the basis for counterintuitive inferences thought to be licensed by Fisherian or N-P tests. These howlers and chestnuts show: the need for an adequate test statistic, the difference between implicationary and actual assumptions, and the fact that tail areas serve to raise, and not lower, the bar for rejecting a null hypothesis. Stop (3.5) pulls back the curtain on an equivocal use of “error probability”. When critics allege that Fisherian P-values are not error probabilities, they mean Fisher wanted an evidential not a performance interpretation–this is a philosophical not a mathematical claim. In fact, N-P and Fisher used P-values in both ways. Critics argue that P-values are for evidence, unlike error probabilities, but in the next breath they aver P-values aren’t good measures of evidence either, since they disagree with probabilist measures: likelihood ratios, Bayes Factors or posteriors (3.6). But the probabilist measures are inconsistent with the error probability ones. By claiming the latter are what’s wanted, the probabilist begs key questions, and misinterpretations are entrenched.
Excursion 3 Tour II keywords
howlers and chestnuts of statistical tests, Jeffreys tail area criticism, two machines with different positions, weak conditionality principle, likelihood principle, long run performance vs probabilism, Neyman vs Fisher, hypothetical long-runs, error probability_{1} and error probability_{2}, incompatibilism (Fisher & Neyman-Pearson must be separated)
Excursion 3 Tour III: Capability and Severity: Deeper Concepts
A long-standing family feud among frequentists is between hypotheses tests and confidence intervals (CIs). In fact there’s a clear duality between the two: the parameter values within the (1 – α) CI are those that are not rejectable by the corresponding test at level α. (3.7) illuminates both CIs and severity by means of this duality. A key idea is arguing from the capabilities of methods to what may be inferred. In (3.8) we reopen a highly controversial matter of interpretation in relation to statistics and the 2012 discovery of the Higgs particle based on a “5 sigma observed effect”. Because the 5-sigma standard refers to frequentist significance testing, the discovery was immediately imbued with controversies that, at bottom, concern statistical philosophy. Some Bayesians even hinted it was “bad science”. One of the knottiest criticisms concerns the very meaning of the phrase: “the probability our data are merely a statistical fluctuation”. Failing to clarify it may impinge on the nature of future big science inquiry. The problem is a bit delicate, and my solution is likely to be provocative. Even rejecting my construal will allow readers to see what it’s like to switch from wearing probabilist, to severe testing, glasses.
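The duality between tests and confidence intervals can be checked directly in the simplest case. The Python sketch below assumes a normal mean with known standard deviation and a two-sided z-test; the sample mean, sigma, n and the grid of candidate values are made-up numbers for illustration.

```python
from statistics import NormalDist

def z_test_rejects(mu0, xbar, sigma, n, alpha=0.05):
    """Two-sided z-test of the hypothesis that the mean equals mu0."""
    z = (xbar - mu0) / (sigma / n ** 0.5)
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)

def confidence_interval(xbar, sigma, n, alpha=0.05):
    """The (1 - alpha) confidence interval for the mean, sigma known."""
    half = NormalDist().inv_cdf(1 - alpha / 2) * sigma / n ** 0.5
    return xbar - half, xbar + half

xbar, sigma, n = 0.4, 1.0, 25          # hypothetical data summary
low, high = confidence_interval(xbar, sigma, n)

# For each candidate value, being inside the interval should coincide
# with not being rejected by the corresponding test
candidates = [i / 100 for i in range(-100, 201)]
agree = all(
    (low <= mu0 <= high) == (not z_test_rejects(mu0, xbar, sigma, n))
    for mu0 in candidates
)
print(f"95% CI: ({low:.3f}, {high:.3f}); duality holds on the grid: {agree}")
```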
Excursion 3 Tour III: keywords
confidence intervals, duality of confidence intervals and tests, rubbing off interpretation, confidence level, Higgs particle, look elsewhere effect, random fluctuations, capability curves, 5 sigma, beyond standard model physics (BSM)
Tour I: The Myth of “The Myth of Objectivity”
Blanket slogans such as “all methods are equally objective and subjective” trivialize into oblivion the problem of objectivity. Such cavalier attitudes are at odds with the moves to take back science. The goal of this tour is to identify what there is in objectivity that we won’t give up, and shouldn’t. While knowledge gaps leave room for biases and wishful thinking, we regularly come up against data that thwart our expectations and disagree with predictions we try to foist upon the world. This pushback supplies objective constraints on which our critical capacity is built. Supposing an objective method is to supply formal, mechanical, rules to process data is a holdover of a discredited logical positivist philosophy.
Discretion in data generation and modeling does not warrant concluding: statistical inference is a matter of subjective belief. It is one thing to talk of our models as objects of belief and quite another to maintain that our task is to model beliefs. For a severe tester, a statistical method’s objectivity requires the ability to audit an inference: check assumptions, pinpoint blame for anomalies, falsify, and directly register how biasing selection effects–hunting, multiple testing and cherry-picking–alter its error probing capacities.
Tour I: keywords
objective vs. subjective, objectivity requirements, auditing, dirty hands argument, logical positivism; default Bayesians, equipoise assignments, (Bayesian) wash-out theorems, degenerating program, epistemology: internal/external distinction
Excursion 4 Tour II: Rejection Fallacies: Who’s Exaggerating What?
We begin with the Mountains out of Molehills Fallacy (large n problem): The fallacy of taking a (P-level) rejection of H_{0} with larger sample size as indicating greater discrepancy from H_{0} than with a smaller sample size (4.3). The Jeffreys-Lindley paradox shows with large enough n, a .05 significant result can correspond to assigning H_{0} a high probability .95. There are family feuds as to whether this is a problem for Bayesians or frequentists! The severe tester takes account of sample size in interpreting the discrepancy indicated. A modification of confidence intervals (CIs) is required.
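The Jeffreys-Lindley effect is easy to reproduce under one common set-up. The Python sketch below assumes a point null for a normal mean with known sigma, prior probability .5 on the null, and a normal prior with standard deviation tau on the mean under the alternative; these modelling choices, the function name and the sample sizes are assumptions for illustration, not the book's worked example.

```python
from math import exp, sqrt

def posterior_null(z, n, tau=1.0, sigma=1.0, prior_null=0.5):
    """Posterior probability of a point null given an observed z statistic.

    Under the alternative the mean has a N(0, tau^2) prior (an assumption).
    """
    r = n * tau ** 2 / sigma ** 2
    # Bayes factor in favour of the null: ratio of the marginal densities of the mean
    bf01 = sqrt(1 + r) * exp(-0.5 * z ** 2 * r / (1 + r))
    odds = bf01 * prior_null / (1 - prior_null)
    return odds / (1 + odds)

# Hold the result fixed at the .05 significance boundary and let n grow
for n in (10, 100, 1_000, 100_000):
    print(f"n = {n:>7}: P(H0 | z = 1.96) = {posterior_null(1.96, n):.3f}")
```

The same just-significant z corresponds to a steadily higher posterior probability for the null as n increases, which is the divergence the tour examines.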
It is commonly charged that significance levels overstate the evidence against the null hypothesis (4.4, 4.5). What’s meant? One answer considered here, is that the P-value can be smaller than a posterior probability to the null hypothesis, based on a lump prior (often .5) to a point null hypothesis. There are battles between and within tribes of Bayesians and frequentists. Some argue for lowering the P-value to bring it into line with a particular posterior. Others argue the supposed exaggeration results from an unwarranted lump prior to a wrongly formulated null. We consider how to evaluate reforms based on Bayes factor standards (4.5). Rather than dismiss criticisms of error statistical methods that assume a standard from a rival account, we give them a generous reading. Only once the minimal principle for severity is violated do we reject them. Souvenir R summarizes the severe tester’s interpretation of a rejection in a statistical significance test. At least 2 benchmarks are needed: reports of discrepancies (from a test hypothesis) that are, and those that are not, well indicated by the observed difference.
Keywords:
significance test controversy, mountains out of molehills fallacy, large n problem, confidence intervals, P-values exaggerate evidence, Jeffreys-Lindley paradox, Bayes/Fisher disagreement, uninformative (diffuse) priors, Bayes factors, spiked priors, spike and slab, equivocating terms, severity interpretation of rejection (SIR)
Excursion 4 Tour III: Auditing: Biasing Selection Effects & Randomization
Tour III takes up Peirce’s “two rules of inductive inference”: predesignation (4.6) and randomization (4.7). The Tour opens on a court case transpiring: the CEO of a drug company is being charged with giving shareholders an overly rosy report based on post-data dredging for nominally significant benefits. Auditing a result includes checking for (i) selection effects, (ii) violations of model assumptions, and (iii) obstacles to moving from statistical to substantive claims. We hear it’s too easy to obtain small P-values, yet replication attempts find it difficult to get small P-values with preregistered results. I call this the paradox of replication. The problem isn’t P-values but failing to adjust them for cherry picking and other biasing selection effects. Adjustments by Bonferroni and false discovery rates are considered. There is a tension between popular calls for preregistering data analysis, and accounts that downplay error probabilities. Worse, in the interest of promoting a methodology that rejects error probabilities, researchers who most deserve lambasting are thrown a handy line of defense. However, data dependent searching need not be pejorative. In some cases, it can improve severity. (4.6)
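A short Python sketch of why unadjusted searching inflates error rates, and what a Bonferroni adjustment does about it. The independence assumption in `familywise_error`, the function names and the list of P-values are all illustrative assumptions.

```python
def familywise_error(alpha, m):
    """Chance of at least one nominally significant result among m tests
    when every null is true, assuming the tests are independent."""
    return 1 - (1 - alpha) ** m

def bonferroni_reject(p_values, alpha=0.05):
    """Reject only those hypotheses whose P-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Hunting through 20 endpoints at the .05 level
print(f"P(at least one 'significant' result in 20 tests) = {familywise_error(0.05, 20):.3f}")

# Hypothetical P-values from five post-data comparisons
p_values = [0.001, 0.012, 0.03, 0.04, 0.2]
print("nominal .05:", [p <= 0.05 for p in p_values])
print("Bonferroni: ", bonferroni_reject(p_values))
```

Four of the five comparisons are nominally significant, but only one survives once the number of comparisons searched is taken into account.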
Big Data cannot ignore experimental design principles. Unless we take account of the sampling distribution, it becomes difficult to justify resampling and randomization. We consider RCTs in development economics (RCT4D) and genomics. Failing to randomize microarrays is thought to have resulted in a decade lost in genomics. Granted, the rejection of error probabilities is often tied to presupposing their relevance is limited to long-run behavioristic goals, a presupposition we reject. They are essential for an epistemic goal: controlling and assessing how well or poorly tested claims are. (4.7)
Keywords
error probabilities and severity, predesignation, biasing selection effects, paradox of replication, capitalizing on chance, Bayes factors, batch effects, preregistration, randomization: Bayes-frequentist rationale, Bonferroni adjustment, false discovery rates, RCT4D, genome-wide association studies (GWAS)
Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking
While all models are false, it’s also the case that no useful models are true. Were a model so complex as to represent data realistically, it wouldn’t be useful for finding things out. A statistical model is useful by being adequate for a problem, meaning it enables controlling and assessing if purported solutions are well or poorly probed and to what degree. We give a way to define severity in terms of solving a problem. (4.8) When it comes to testing model assumptions, many Bayesians agree with George Box (1983) that “it requires frequentist theory of significance tests” (p. 57). Tests of model assumptions, also called misspecification (M-S) tests, are thus a promising area for Bayes-frequentist collaboration. (4.9) When the model is in doubt, the likelihood principle is inapplicable or violated. We illustrate a non-parametric bootstrap resampling. It works without relying on a theoretical probability distribution, but it still has assumptions. (4.10) We turn to the M-S testing approach of econometrician Aris Spanos. (4.11) I present the high points for unearthing spurious correlations, and assumptions of linear regression, employing 7 figures. M-S tests differ importantly from model selection: the latter uses a criterion for choosing among models, but does not test their statistical assumptions. Such criteria test fit rather than whether a model has captured the systematic information in the data.
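As a rough illustration of non-parametric bootstrap resampling, the sketch below builds a percentile interval for a mean by resampling the observed data with replacement. The data, the function name, and the percentile method are assumptions made for the example. The procedure still presupposes that the observations behave like an independent, identically distributed sample.

```python
import random
from statistics import mean


def bootstrap_mean_interval(data, n_resamples=2000, level=0.95, seed=0):
    """Approximate percentile bootstrap interval for the mean.

    Resamples the data with replacement; no theoretical sampling
    distribution is used, but IID sampling is still assumed.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    means = sorted(mean(rng.choices(data, k=n)) for _ in range(n_resamples))
    tail = (1 - level) / 2
    lo = means[int(tail * n_resamples)]
    hi = means[int((1 - tail) * n_resamples) - 1]
    return lo, hi


# Hypothetical measurements
data = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9, 6.2, 3.9, 5.0, 4.4]
print(mean(data), bootstrap_mean_interval(data))
```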
Keywords
adequacy for a problem, severity (in terms of problem solving), model testing/misspecification (M-S) tests, likelihood principle conflicts, bootstrap, resampling, Bayesian p-value, central limit theorem, nonsense regression, significance tests in model checking, probabilistic reduction, respecification
Excursion 5 Tour I: Power: Pre-data and Post-data
The power of a test to detect a discrepancy from a null hypothesis H_0 is its probability of leading to a significant result if that discrepancy exists. Critics of significance tests often compare H_0 and a point alternative H_1 against which the test has high power. But these don’t exhaust the space. Blurring the power against H_1 with a Bayesian posterior in H_1 results in exaggerating the evidence. (5.1) A drill is given for practice (5.2). As we learn from Neyman and Popper: if data failed to reject a hypothesis H, it does not corroborate H unless the test probably would have rejected it if false. A classic fallacy is to construe no evidence against H_0 as evidence of the correctness of H_0. It was in the list of slogans opening Excursion 1. H is corroborated severely only if, and only to the extent that, it passes a test it probably would have failed, if false. By reflecting this reasoning, power analysis avoids such fallacies, but it’s too coarse. Severity analysis follows the pattern but is sensitive to the actual outcome (it uses what I call attained power). (5.3) Using severity curves we read off assessments for interpreting non-significant results in a standard test. (5.4)
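The contrast between power and attained power can be sketched for a one-sided Normal test of H_0: mu <= mu0 against mu > mu0 with known sigma. The setup, function names, and numbers below are illustrative assumptions. Power uses the test’s cutoff, while the severity assessment for a non-significant result replaces the cutoff with the observed sample mean.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal


def cutoff(mu0, sigma, n, alpha=0.025):
    """Sample-mean cutoff for rejecting H0: mu <= mu0 at level alpha."""
    return mu0 + Z.inv_cdf(1 - alpha) * sigma / sqrt(n)


def power(mu1, mu0, sigma, n, alpha=0.025):
    """Probability the test rejects H0 when the true mean is mu1."""
    se = sigma / sqrt(n)
    return 1 - Z.cdf((cutoff(mu0, sigma, n, alpha) - mu1) / se)


def severity_upper_bound(mu1, xbar_obs, sigma, n):
    """Severity for inferring mu <= mu1 from a non-significant xbar_obs:
    the probability of a larger sample mean than observed if mu were mu1."""
    se = sigma / sqrt(n)
    return 1 - Z.cdf((xbar_obs - mu1) / se)


# Hypothetical test: mu0 = 0, sigma = 1, n = 100, observed mean 0.1 (not significant)
for mu1 in (0.1, 0.2, 0.3):
    print(mu1,
          round(power(mu1, 0, 1, 100), 3),
          round(severity_upper_bound(mu1, 0.1, 1, 100), 3))
```

When the observed mean sits exactly at the cutoff the two assessments coincide. The further the observed mean falls below the cutoff, the more the severity for mu <= mu1 exceeds the power at mu1, which is the sense in which power analysis alone is too coarse.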
Excursion 5 Tour I: keywords
power of a test, attained power (and severity), fallacies of non-rejection, severity curves, severity interpretation of negative results (SIN), power analysis, Cohen and Neyman on power analysis, retrospective power
Excursion 5 Tour II: How not to Corrupt Power
We begin with objections to power analysis, and scrutinize accounts that appear to be at odds with power and severity analysis. (5.5) Understanding power analysis also promotes an improved construal of CIs: instead of a fixed confidence level, several levels are needed, as with confidence distributions. Severity offers an evidential assessment rather than mere coverage probability. We examine an influential new front in the statistics wars based on what I call the diagnostic model of tests. (5.6) The model is a cross between a Bayesian and frequentist analysis. To get the priors, the hypothesis you’re about to test is viewed as a random sample from an urn of null hypotheses, a high proportion of which are true. The analysis purports to explain the replication crisis because the proportion of true nulls amongst hypotheses rejected may be higher than the probability of rejecting a null hypothesis given it’s true. We question the assumptions and the altered meaning of error probability (error probability_2 in 3.6). The Tour links several arguments that use probabilist measures to critique error statistics.
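The arithmetic behind the diagnostic model can be sketched as follows. The function name and the inputs (90% true nulls in the urn, alpha of .05, power of .8) are hypothetical values chosen for illustration. The calculation treats the hypothesis under test as a random draw from the urn, which is the assumption the Tour questions.

```python
def prop_true_nulls_among_rejections(prior_true_null, alpha, power):
    """Proportion of rejected hypotheses that are true nulls, under the
    urn model: a fraction prior_true_null of nulls are true, true nulls
    are rejected at rate alpha, false nulls at rate power."""
    false_rejections = alpha * prior_true_null
    true_rejections = power * (1 - prior_true_null)
    return false_rejections / (false_rejections + true_rejections)


# Hypothetical urn: 90% true nulls, alpha = .05, power = .8
print(round(prop_true_nulls_among_rejections(0.9, 0.05, 0.8), 2))
```

With these inputs the proportion of true nulls among rejections is well above alpha, yet alpha itself, the probability of rejecting a given null when it is true, is unchanged. The two numbers answer different questions.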
Excursion 5 Tour II: keywords
confidence distributions, coverage probability, criticisms of power, diagnostic model of tests, shpower vs power, fallacy of probabilistic instantiation, crud factors
Excursion 5 Tour III: Deconstructing the N-P vs. Fisher Debates
We begin with a famous passage from Neyman and Pearson (1933), taken to show N-P philosophy is limited to long-run performance. The play, “Les Miserables Citations” leads to a deconstruction that illuminates the evidential over the performance construal. (5.7) To cope with the fact that any sample is improbable in some respect, statistical methods appeal either to prior probabilities of hypotheses or to error probabilities of a method. Pursuing the latter, N-P are led to (i) prespecify a test criterion and (ii) consider alternative hypotheses and power. Fisher at first endorsed their idea of a most powerful test. Fisher hoped fiducial probability would both control error rates of a method – performance – as well as supply an evidential assessment. When confronted with the fact that fiducial solutions disagreed with performance goals he himself had held, Fisher abandoned them. (5.8) He railed against Neyman who was led to a performance construal largely to avoid inconsistencies in Fisher’s fiducial probability. The problem we face today is precisely to find a measure that controls error while capturing evidence. This is what severity purports to supply. We end with a connection with recent work on Confidence Distributions.
Excursion 5 Tour III: keywords
Bertrand and Borel debate, Neyman-Pearson test development, behavioristic (performance model) of tests, deconstructing N-P (1933), Fisher’s fiducial probabilities, Neyman/Fisher feuds, Neyman and Fisher dovetail, confidence distributions
Excursion 6 Tour I: What Ever Happened to Bayesian Foundations
Statistical battles often grow out of assuming the goal is a posterior probabilism of some sort. Yet when we examine each of the ways this could be attained, the desirability for science evanesces. We survey classical subjective Bayes via an interactive museum display on Lindley and commentators. (6.1) We survey a plethora of meanings given to Bayesian priors (6.2) and current family feuds between subjective and non-subjective Bayesians. (6.3) The most prevalent Bayesian accounts are default/non-subjective, but there is no agreement on suitable priors. Sophisticated methods give as many priors as there are parameters and different orderings. They are deemed mere formal devices for obtaining a posterior. How then should we interpret the posterior as an adequate summary of information? While touted as the best way to bring in background, they are simultaneously supposed to minimize the influence of background. The main assets of the Bayesian picture – a coherent way to represent and update beliefs – go by the board. (6.4) The very idea of conveying “the” information in the data is unsatisfactory. It turns on what one wants to know. An answer to how much a prior would be updated differs from an answer to how well or poorly tested claims are. The latter question, of interest to a severe tester, is not answered by accounts that require assigning probabilities to a catchall factor: science must be open ended.
Excursion 6 Tour I: keywords
Classic subjective Bayes, subjective vs default Bayesians, Bayes conditioning, default priors (and their multiple meanings), default Bayesian and the Likelihood Principle, catchall factor
Excursion 6 Tour II: Pragmatic and Error Statistical Bayesians
Tour II asks: Is there an overarching philosophy that “matches contemporary attitudes”? Kass’s pragmatic Bayesianism seeks unification by a restriction to cases where the default posteriors match frequentist error probabilities. (6.5) Even with this severe limit, the necessity for a split personality remains: probability is to capture variability as well as degrees of belief. We next consider the falsificationist Bayesianism of Andrew Gelman, and his work with others. (6.6) This purports to be an error statistical view, and we consider how its foundations might be developed. The question of where it differs from our misspecification testing is technical and is left open. Even more important than shared contemporary attitudes is changing them: not to encourage a switch of tribes, but to understand and get beyond the tribal warfare. If your goal is really and truly probabilism, you are better off recognizing the differences than trying to unify or reconcile. Snapshots from the error statistical lens let you see how frequentist methods supply tools for controlling and assessing how well or poorly warranted claims are. If you’ve come that far in making the gestalt switch to the error statistical paradigm, a new candidate for an overarching philosophy is at hand. Our Farewell Keepsake delineates the requirements for a normative epistemology and surveys nine key statistics wars and a cluster of familiar criticisms of error statistical methods. They can no longer be blithely put forward as having weight without wrestling with the underlying presuppositions and challenges collected on our journey. This provides the starting point for any future attempts to refight these battles. The reader will then be beyond the statistics wars. (6.7)
Excursion 6 Tour II: keywords
pragmatic Bayesians, falsificationist Bayesian, confidence distributions, epistemic meaning for coverage probability, optional stopping and Bayesian intervals, error statistical foundations
Mayo, D. (2018). Statistical Inference as Severe Testing: How To Get Beyond the Statistics Wars, Cambridge: Cambridge University Press.
_______
*Earlier excerpts and mementos from SIST up to Dec 31, 2018 are here.
Jan 10, 2019 Excerpt from SIST is here.
Jan 13, 2019 Mementos from SIST (Excursion 4) are here. These are summaries of all 4 tours.
Feb 23, 2019 Excerpt from SIST 5.8 is here.