It’s a balmy day today on Ship StatInfasST: An invigorating wind has a salutary effect on our journey. So, for the first time, I’m excerpting all of Excursion 5 Tour I (proofs) of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).
A salutary effect of power analysis is that it draws one forcibly to consider the magnitude of effects. In psychology, and especially in soft psychology, under the sway of the Fisherian scheme, there has been little consciousness of how big things are. (Cohen 1990, p. 1309)
So how would you use power to consider the magnitude of effects were you drawn forcibly to do so? In with your breakfast is an exercise to get us started on today’s shore excursion.
Suppose you are reading about a statistically significant result x (just at level α) from a one-sided test T+ of the mean of a Normal distribution with n IID samples, and known σ: H0: μ ≤ 0 against H1: μ > 0. Underline the correct word, from the perspective of the (error statistical) philosophy, within which power is defined.
- If the test’s power to detect μ′ is very low (i.e., POW(μ′) is low), then the statistically significant x is poor/good evidence that μ > μ′.
- Were POW(μ′) reasonably high, the inference to μ > μ′ is reasonably/poorly warranted.
We’ve covered this reasoning in earlier travels (e.g., Section 4.3), but I want to launch our new tour from the power perspective. Assume the statistical test has passed an audit (for selection effects and underlying statistical assumptions) – you can’t begin to analyze the logic if the premises are violated.
During our three tours on Power Peninsula, a partially uncharted territory, we’ll be residing at local inns, not returning to the ship, so pack for overnights. We’ll visit its museum, but mostly meet with different tribal members who talk about power – often critically. Power is one of the most abused notions in all of statistics, yet it’s a favorite for those of us who care about magnitudes of discrepancies. Power is always defined in terms of a fixed cut-off, cα, computed under a value of the parameter under test; since these vary, there is really a power function. If someone speaks of the power of a test tout court, you cannot make sense of it, without qualification. First defined in Section 3.1, the power of a test against μ′ is the probability it would lead to rejecting H0 when μ = μ′:
POW(T, μ′) = Pr(d(X) ≥ cα; μ = μ′), or Pr(test T rejects H0; μ = μ′).
If it’s clear what the test is, we just write POW(μ′). Power measures the capability of a test to detect μ′ – where the detection is in the form of producing a d ≥ cα. While power is computed at a point μ = μ′, we employ it to appraise claims of form μ > μ′ or μ < μ′.
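For the one-sided Normal test T+ of the breakfast exercise, POW(μ′) can be computed in closed form. The sketch below is only illustrative: it assumes the standardized test statistic d(X) = √n(X̄ − μ0)/σ with μ0 = 0, and takes cα to be the upper-α point of the standard Normal. Neither that form nor the function name `power` nor the values n = 100, σ = 1, α = 0.025 appear in this excerpt; they are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal: mean 0, sd 1


def power(mu_prime, n, sigma, alpha=0.025, mu0=0.0):
    """POW(mu') = Pr(d(X) >= c_alpha; mu = mu') for the one-sided test T+.

    Assumption: d(X) = sqrt(n) * (Xbar - mu0) / sigma, which is N(0, 1)
    when mu = mu0 and N(sqrt(n) * (mu' - mu0) / sigma, 1) when mu = mu'.
    """
    c_alpha = STD_NORMAL.inv_cdf(1 - alpha)      # fixed cut-off for d(X)
    shift = sqrt(n) * (mu_prime - mu0) / sigma   # mean of d(X) when mu = mu'
    return 1 - STD_NORMAL.cdf(c_alpha - shift)


# Hypothetical planning values, chosen only for illustration.
n, sigma = 100, 1.0
for mu_prime in (0.0, 0.1, 0.2, 0.3, 0.4):
    print(f"POW({mu_prime:.1f}) = {power(mu_prime, n, sigma):.3f}")
```

Evaluating POW(μ′) over a grid of μ′ values in this way shows why there is really a power function: at μ′ = μ0 the power equals α, and it rises toward 1 as μ′ moves further above μ0.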
Power is an ingredient in N-P tests, but even practitioners who declare they never set foot into N-P territory, but live only in the land of Fisherian significance tests, invoke power. This is all to the good, and they shouldn’t fear that they are dabbling in an inconsistent hybrid.
Jacob Cohen’s (1988) Statistical Power Analysis for the Behavioral Sciences is displayed at the Power Museum’s permanent exhibition. Oddly, he makes some slips in the book’s opening. On page 1 Cohen says: “The power of a statistical test is the probability it will yield statistically significant results.” Also faulty is what he says on page 4: “The power of a statistical test of a null hypothesis is the probability that it will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the phenomenon exists.” Cohen means to add “computed under an alternative hypothesis,” else the definitions are wrong. These snafus do not take away from Cohen’s important tome on power analysis, yet I can’t help wondering if these initial definitions play a bit of a role in the tendency to define power as ‘the probability of a correct rejection,’ which slips into erroneously viewing it as a posterior probability (unless qualified).
Although keeping to the fixed cut-off cα is too coarse for the severe tester’s tastes, it is important to keep to the given definition for understanding the statistical battles. We’ve already had sneak previews of “achieved sensitivity” or “attained power” [Π(γ) = Pr(d(X) ≥ d(x0); μ0 + γ)] by which members of Fisherian tribes are able to reason about discrepancies (Section 3.3). N-P accorded three roles to power: the first two are pre-data, for planning and comparing tests; the third is for interpretation post-data. It’s the third that they don’t announce very loudly, whereas that will be our main emphasis. Have a look at this museum label referring to a semi-famous passage by E. Pearson. Barnard (1950, p. 207) has just suggested that error probabilities of tests, like power, while fine for pre-data planning, should be replaced by other measures (likelihoods perhaps?) after the trial. What did Egon say in reply to George?
[I]f the planning is based on the consequences that will result from following a rule of statistical procedure, e.g., is based on a study of the power function of a test and then, having obtained our results, we do not follow the first rule but another, based on likelihoods, what is the meaning of the planning? (Pearson 1950, p. 228)
This is an interesting and, dare I say, powerful reply, but it doesn’t quite answer George. By all means apply the rule you planned to, but there’s still a legitimate question as to the relationship between the pre-data capability or performance measure, and post-data inference. The severe tester offers a view of this intimate relationship. In Tour II we’ll be looking at interactive exhibits far outside the museum, including N-P post-data power analysis, retrospective power, and a notion I call shpower. Employing our understanding of power, scrutinizing a popular reinterpretation of tests as diagnostic tools will be straightforward. In Tour III we go a few levels deeper in disinterring the N-P vs. Fisher feuds. I suspect there is a correlation between those who took Fisher’s side in the early disputes with Neyman and those leery of power. Oscar Kempthorne being interviewed by J. Leroy Folks (1995) said:
Well, a common thing said about [Fisher] was that he did not accept the idea of the power. But, of course, he must have. However, because Neyman had made such a point about power, Fisher couldn’t bring himself to acknowledge it (p. 331).
However, since Fisherian tribe members have no problem with corresponding uses of sensitivity, P-value distributions, or CIs, they can come along on a severity analysis. There’s more than one way to skin a cat, if one understands the relevant statistical principles. The issues surrounding power are subtle, and unraveling them will require great care, so bear with me. I will give you a money-back guarantee that by the end of the excursion you’ll have a whole new view of power. Did I mention you’ll have a chance to power the ship into port on this tour? Only kidding; however, you will get to show your stuff in a Cruise Severity Drill (Section 5.2).
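The “attained power” Π(γ) = Pr(d(X) ≥ d(x0); μ0 + γ) mentioned earlier in this tour replaces the fixed cut-off cα with the observed d(x0). As a rough sketch of how it might be computed for a Normal mean with known σ, the code below again assumes d(X) = √n(X̄ − μ0)/σ. The function name `attained_power` and the numbers d(x0) = 2.5, n = 100, σ = 1 are hypothetical, chosen only to make the example run.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal: mean 0, sd 1


def attained_power(gamma, d_obs, n, sigma):
    """Pi(gamma) = Pr(d(X) >= d(x0); mu = mu0 + gamma).

    Assumption: d(X) = sqrt(n) * (Xbar - mu0) / sigma, so that under
    mu = mu0 + gamma, d(X) is Normal with mean sqrt(n) * gamma / sigma
    and standard deviation 1.
    """
    shift = sqrt(n) * gamma / sigma  # mean of d(X) when mu = mu0 + gamma
    return 1 - STD_NORMAL.cdf(d_obs - shift)


# Hypothetical observed result and design, for illustration only.
d_obs, n, sigma = 2.5, 100, 1.0
for gamma in (0.0, 0.1, 0.2, 0.3, 0.4):
    print(f"Pi({gamma:.1f}) = {attained_power(gamma, d_obs, n, sigma):.3f}")
```

At γ = 0 this is just the P-value of the observed result, and when d(x0) happens to equal cα it coincides with POW(μ0 + γ) as defined above.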
To continue reading Excursion 5 Tour I, go here.
This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).
Earlier excerpts and mementos from SIST up to Dec 31, 2018 are here.
Jan 10, 2019 Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?” is here.
Jan 27, Excerpt from Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking here.
Feb 23, Deconstructing the Fisher-Neyman conflict wearing fiducial glasses + Excerpt 5.8 from SIST
April 4, Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt here.
Jan 13, 2019 Mementos from SIST (Excursion 4) are here. These are summaries of all 4 tours.
March 5, 2019 Blurbs of all 16 Tours can be found here.
Where YOU are in the journey.
A short comprehension question about page 327, concerning the case H0: mu ≤ delta vs. H1: mu > delta. Is this the case where both distributions overlap, so that the probability that the observed difference exceeds the critical value under H1 is alpha?