
SIST: All Excerpts and Mementos: May 2018-May 2019

view from a hot-air balloon

Introduction & Overview

The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* 05/19/18

Blurbs of 16 Tours: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST) 03/05/19


Excursion 1


Tour I

Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1) 09/08/18

Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2) 09/11/18

Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3) 09/15/18

Tour II

Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt 04/04/19

Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II) 11/08/18


Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars) 10/29/18


Excursion 2


Tour I

Excursion 2: Taboos of Induction and Falsification: Tour I (first stop) 09/29/18

“It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (Keepsake by Fisher, 2.1) 10/05/18

Tour II

Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3) 10/10/18


Tour Guide Mementos and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation) 11/14/18

Mementos for Excursion 2 Tour II Falsification, Pseudoscience, Induction 11/17/18


Excursion 3


Tour I

Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3 11/30/18

Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2) 12/01/18

First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3] 12/04/18

Tour II

It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP) 12/11/18

60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 Tour II 12/29/18

Tour III

Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III 12/20/18


Memento & Quiz (on SEV): Excursion 3, Tour I 12/08/18

Mementos for “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6) 12/13/18

Tour Guide Mementos From Excursion 3 Tour III: Capability and Severity: Deeper Concepts 12/26/18


Excursion 4


Tour I

Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP) 12/26/18

Tour II

Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?” 01/10/19

Tour IV

Excerpt from Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking 01/27/19


Mementos from Excursion 4: Blurbs of Tours I-IV 01/13/19


Excursion 5

Tour I

(full) Excerpt: Excursion 5 Tour I — Power: Pre-data and Post-data (from “SIST: How to Get Beyond the Stat Wars”) 04/27/19

Tour III

Deconstructing the Fisher-Neyman conflict wearing Fiducial glasses + Excerpt 5.8 from SIST 02/23/19


Excursion 6

Tour II

Excerpts: Souvenir Z: Understanding Tribal Warfare + 6.7 Farewell Keepsake from SIST + List of Souvenirs 05/04/19

*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).

Categories: SIST, Statistical Inference as Severe Testing

Excerpts: Final Souvenir Z, Farewell Keepsake & List of Souvenirs


We’ve reached our last Tour (of SIST)*: Pragmatic and Error Statistical Bayesians (Excursion 6), marking the end of our reading with Souvenir Z, the final Souvenir, as well as the Farewell Keepsake in 6.7. Our cruise ship StatInfasST, currently here at Thebes, will be back at dock for maintenance before our next launch at the Summer Seminar in Phil Stat (July 28-Aug 11). Although it’s not my preference that new readers begin with the Farewell Keepsake (it contains a few spoilers), I’m excerpting it together with Souvenir Z (and a list of all souvenirs A–Z) here, and invite all interested readers to peer in. There’s a checklist on p. 437: if you’re in the market for a new statistical account, you’ll want to test whether it satisfies the items on the list. Have fun!

Souvenir Z: Understanding Tribal Warfare

We began this tour asking: Is there an overarching philosophy that “matches contemporary attitudes”? More important is changing attitudes. Not to encourage a switch of tribes, or even a tribal truce, but something more modest and actually achievable: to understand and get beyond the tribal warfare. To understand the warring tribes, at minimum, requires grasping how the goals of probabilism differ from those of probativeness. This leads to a way of changing contemporary attitudes that is bolder and more challenging. Snapshots from the error statistical lens let you see how frequentist methods supply tools for controlling and assessing how well or poorly warranted claims are. All of the links, from data generation to modeling, to statistical inference and from there to substantive research claims, fall into place within this statistical philosophy. If this is close to being a useful way to interpret a cluster of methods, then the change in contemporary attitudes is radical: it has never been explicitly unveiled. Our journey was restricted to simple examples because those are the ones fought over in decades of statistical battles. Much more work is needed. Those grappling with applied problems are best suited to develop these ideas, and see where they may lead. I never promised, when you bought your ticket for this passage, to go beyond showing that viewing statistics as severe testing will let you get beyond the statistics wars.

6.7 Farewell Keepsake

Despite the eclecticism of statistical practice, conflicting views about the roles of probability and the nature of statistical inference – holdovers from long-standing frequentist–Bayesian battles – still simmer below the surface of today’s debates. Reluctance to reopen wounds from old battles has allowed them to fester. To assume all we need is an agreement on numbers – even if they’re measuring different things – leads to statistical schizophrenia. Rival conceptions of the nature of statistical inference show up unannounced in the problems of scientific integrity, irreproducibility, and questionable research practices, and in proposed methodological reforms. If you don’t understand the assumptions behind proposed reforms, their ramifications for statistical practice remain hidden from you.

Rival standards reflect a tension between using probability (a) to constrain the probability that a method avoids erroneously interpreting data in a series of applications (performance), and (b) to assign degrees of support, confirmation, or plausibility to hypotheses (probabilism). We set sail on our journey with an informal tool for telling what’s true about statistical inference: If little if anything has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a severe test. From this minimal severe-testing requirement, we develop a statistical philosophy that goes beyond probabilism and performance. The goals of the severe tester (probativism) arise in contexts sufficiently different from those of probabilism that you are free to hold both, for distinct aims (Section 1.2). For statistical inference in science, it is severity we seek. A claim passes with severity only to the extent that it is subjected to, and passes, a test that it probably would have failed, if false. Viewing statistical inference as severe testing alters long-held conceptions of what’s required for an adequate account of statistical inference in science. In this view, a normative statistical epistemology – an account of what’s warranted to infer – must be:

  directly altered by biasing selection effects
  able to falsify claims statistically
  able to test statistical model assumptions
  able to block inferences that violate minimal severity
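In symbols, the severity idea behind this list can be put as follows (a gloss of mine, not the book’s own notation; d(X) is the test statistic of a test T with outcome x0): a claim C passes severely to the extent that

```latex
% A gloss on severity (editor's notation, not a quotation from SIST):
\mathrm{SEV}(T, x_0, C) \;=\;
  \Pr\big(\, d(X) \text{ accords less well with } C \text{ than } d(x_0) \text{ does}
  \;;\; C \text{ false} \,\big) \text{ is high.}
```

The minimal requirement is the denial: if the test would probably have “passed” C even were C false, then x0 is poor evidence for C.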

These overlapping and interrelated requirements are disinterred over the course of our travels. This final keepsake collects a cluster of familiar criticisms of error statistical methods. They are not intended to replace the detailed arguments, pro and con, within; here we cut to the chase, generally keeping to the language of critics. Given our conception of evidence, we retain testing language even when the statistical inference is an estimation, prediction, or proposed answer to a question. The concept of severe testing is sufficiently general to apply to any of the methods now in use. It follows that a variety of statistical methods can serve to advance the severity goal, and that they can, in principle, find their foundations in an error statistical philosophy. However, each requires supplements and reformulations to be relevant to real-world learning. Good science does not turn on adopting any formal tool, and yet the statistics wars often focus on whether to use one type of test (or estimation, or model selection) or another. Meta-researchers charged with instigating reforms do not agree, but the foundational basis for the disagreement is left unattended. It is no wonder some see the statistics wars as proxy wars between competing tribe leaders, each keen to advance one or another tool, rather than about how to do better science. Leading minds are drawn into inconsequential battles, e.g., whether to use a prespecified cut-off of 0.025 or 0.0025 – when in fact good inference is not about cut-offs at all but about a series of small-scale steps in collecting, modeling and analyzing data that work together to find things out. Still, we need to get beyond the statistics wars in their present form. By viewing a contentious battle in terms of a difference in goals – finding highly probable versus highly well-probed hypotheses – readers can see why leaders of rival tribes often talk past each other. To be clear, the standpoints underlying the following criticisms are open to debate; we’re far from claiming to do away with them. What should be done away with is rehearsing the same criticisms ad nauseam. Only then can we hear the voices of those calling for an honest standpoint about responsible science.

1. NHST Licenses Abuses. First, there’s the cluster of criticisms directed at an abusive NHST animal: NHSTs infer from a single P-value below an arbitrary cut-off to evidence for a research claim, and they encourage P-hacking, fishing, and other selection effects. The reply: this ignores crucial requirements set by Fisher and other founders: isolated significant results are poor evidence of a genuine effect, and statistical significance doesn’t warrant substantive (e.g., causal) inferences. Moreover, selective reporting invalidates error probabilities. Some argue significance tests are un-Popperian because the larger the sample size, the easier it is to infer one’s research hypothesis. It’s true that with a sufficiently large sample size any discrepancy from a null hypothesis has a high probability of being detected, but statistical significance does not license inferring a research claim H. If H’s errors have not been well probed, then merely finding a small P-value means H has passed an extremely insevere test. No mountains out of molehills (Sections 4.3 and 5.1). Enlightened users of statistical tests have rejected the cookbook, dichotomous NHST, long lampooned: such criticisms are behind the times. When well-intentioned aims of replication research are linked to these retreads, it only hurts the cause. One doesn’t need a sharp dichotomy to identify rather lousy tests – a main goal for a severe tester. Granted, policy-making contexts may require cut-offs, as do behavioristic setups. But in those contexts, a test’s error probabilities measure overall error control, and are not generally used to assess well-testedness. Even there, users need not fall into the NHST traps (Section 2.5). While attention to banning terms is the least productive aspect of the statistics wars, since NHST is not used by Fisher or N-P, let’s give the caricature its due and drop the NHST acronym; “statistical tests” or “error statistical tests” will do. Simple significance tests are a small part of a conglomeration of error statistical methods.
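To see why selective reporting invalidates error probabilities, a quick simulation helps. This is my sketch, not the book’s: it assumes a hypothetical “hunt” over 20 independent tests of true nulls, reporting only the smallest P-value, and uses numpy/scipy.

```python
# Sketch (editor's illustration): selective reporting invalidates error probabilities.
# Run 20 independent one-sided tests of true null hypotheses, report the smallest P-value.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
reps, n_tests, n, alpha = 5000, 20, 30, 0.05

false_alarms = 0
for _ in range(reps):
    # z-statistics for n_tests studies, each with n observations, all nulls true
    z = rng.normal(0, 1, (n_tests, n)).mean(axis=1) * np.sqrt(n)
    p_min = norm.sf(z).min()          # report only the smallest one-sided P-value
    false_alarms += p_min <= alpha

print(f"nominal level: {alpha}, actual rate after hunting: {false_alarms / reps:.2f}")
# ~0.64, close to 1 - 0.95**20: the reported "P < 0.05" no longer means what it says
```

The nominal 5% error probability describes only the procedure actually used; here the procedure is “report the best of 20,” and its error rate is far higher.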

To continue reading: Excerpt Souvenir Z, Farewell Keepsake & List of Souvenirs can be found here.

*We are reading Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).



Where YOU are in the journey.


Categories: SIST, Statistical Inference as Severe Testing

(full) Excerpt: Excursion 5 Tour I — Power: Pre-data and Post-data (from “SIST: How to Get Beyond the Stat Wars”)

S.S. StatInfasST

It’s a balmy day today on Ship StatInfasST: an invigorating wind has a salutary effect on our journey. So, for the first time, I’m excerpting all of Excursion 5 Tour I (proofs) of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).

A salutary effect of power analysis is that it draws one forcibly to consider the magnitude of effects. In psychology, and especially in soft psychology, under the sway of the Fisherian scheme, there has been little consciousness of how big things are. (Cohen 1990, p. 1309)

So how would you use power to consider the magnitude of effects were you drawn forcibly to do so? In with your breakfast is an exercise to get us started on today’s shore excursion.

Suppose you are reading about a statistically significant result x (just at level α) from a one-sided test T+ of the mean of a Normal distribution with IID samples and known σ: H0: μ ≤ 0 against H1: μ > 0. Underline the correct word, from the perspective of the (error statistical) philosophy, within which power is defined.

  • If the test’s power to detect μ′ is very low (i.e., POW(μ′) is low), then the statistically significant x is poor/good evidence that μ > μ′.
  • Were POW(μ′) reasonably high, the inference to μ > μ′ is reasonably/poorly warranted.
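If you want to check your underlining numerically, here is a minimal sketch (mine, not from the Tour). It assumes the setup above, test T+ with known σ and a result landing just at the α cutoff, and uses scipy; the sample size and μ′ values are illustrative only.

```python
# Sketch (editor's illustration): power vs. severity for test T+ of
# H0: mu <= 0 vs H1: mu > 0, Normal IID data, known sigma.
from scipy.stats import norm

def power(mu_prime, n, sigma=1.0, alpha=0.025):
    """POW(mu') = P(X-bar >= cutoff; mu'), with cutoff = z_alpha * sigma / sqrt(n)."""
    se = sigma / n ** 0.5
    cutoff = norm.ppf(1 - alpha) * se            # reject H0 when x-bar >= cutoff
    return 1 - norm.cdf((cutoff - mu_prime) / se)

def severity_mu_gt(mu_prime, xbar, n, sigma=1.0):
    """SEV(mu > mu'): P(X-bar <= observed x-bar; mu = mu'), the probability of a worse fit were mu <= mu'."""
    se = sigma / n ** 0.5
    return norm.cdf((xbar - mu_prime) / se)

n, sigma, alpha = 25, 1.0, 0.025
xbar = norm.ppf(1 - alpha) * sigma / n ** 0.5    # outcome just significant at level alpha
for mu_prime in (0.1, 0.2, 0.5, 1.0):
    print(f"mu'={mu_prime}: POW={power(mu_prime, n):.3f}  "
          f"SEV(mu > mu')={severity_mu_gt(mu_prime, xbar, n):.3f}")
```

For a just-significant result the two are complements, SEV(μ > μ′) = 1 − POW(μ′): low power to detect μ′ goes with a well-warranted inference to μ > μ′, and high power with a poorly warranted one.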

Continue reading

Categories: Statistical Inference as Severe Testing, Statistical power

If you like Neyman’s confidence intervals then you like N-P tests


Neyman, confronted with unfortunate news, would always say “too bad!” At the end of Jerzy Neyman’s birthday week, I cannot help imagining him saying “too bad!” as regards some twists and turns in the statistics wars. First, too bad Neyman-Pearson (N-P) tests aren’t in the ASA Statement (2016) on P-values: “To keep the statement reasonably simple, we did not address alternative hypotheses, error types, or power”. An especially aggrieved “too bad!” would be earned by the fact that those in love with confidence interval estimators don’t appreciate that Neyman developed them (in 1930) as a method with a precise interrelationship with N-P tests. So if you love CI estimators, then you love N-P tests! Continue reading
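The duality is easy to exhibit numerically. Below is a minimal sketch (mine, assuming the one-sided Normal-mean test with known σ and illustrative numbers): the (1 − α) lower confidence bound collects exactly those μ0 that the level-α test fails to reject.

```python
# Sketch (editor's illustration): Neyman's CI/test duality for
# H0: mu = mu0 vs H1: mu > mu0, Normal IID data, known sigma.
from scipy.stats import norm

def rejects(mu0, xbar, n, sigma=1.0, alpha=0.05):
    """Level-alpha one-sided test: reject when the z-statistic >= z_alpha."""
    z = (xbar - mu0) / (sigma / n ** 0.5)
    return z >= norm.ppf(1 - alpha)

def lower_bound(xbar, n, sigma=1.0, alpha=0.05):
    """(1 - alpha) lower confidence bound: the CI is (lower_bound, infinity)."""
    return xbar - norm.ppf(1 - alpha) * sigma / n ** 0.5

xbar, n, sigma = 0.4, 25, 1.0
lb = lower_bound(xbar, n, sigma)
for mu0 in (0.0, 0.1, 0.2, 0.3):
    assert rejects(mu0, xbar, n, sigma) == (mu0 <= lb)   # duality, point by point
    print(f"mu0={mu0}: rejected={rejects(mu0, xbar, n, sigma)}, in CI={mu0 > lb}")
```

Whatever confidence level you demand of the estimator, you have thereby fixed the size of the corresponding test; the two are one method viewed from two ends.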

Categories: ASA Guide to P-values, CIs and tests, Neyman

Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen

Neyman April 16, 1894 – August 5, 1981

I’ll continue to post Neyman-related items this week in honor of his birthday. This isn’t the only paper in which Neyman makes it clear he denies a distinction between a test of statistical hypotheses and a significance test. He and E. Pearson also discredit the myth that the former is only allowed to report pre-data, fixed error probabilities, and is justified only by dint of long-run error control. Controlling the “frequency of misdirected activities” in the midst of finding something out, or solving a problem of inquiry, is, on the other hand, an epistemological goal. What do you think?

Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena
by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

I recommend, especially, the example on home ownership. Here are two snippets: Continue reading

Categories: Error Statistics, Neyman, Statistics

Neyman vs the ‘Inferential’ Probabilists


We celebrated Jerzy Neyman’s birthday (April 16, 1894) last night in our seminar: here’s a pic of the cake. My entry today is a brief excerpt and a link to a paper of his that we haven’t discussed much on this blog: Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making’ [i]. It’s chock full of ideas and arguments, but the one that interests me at the moment is Neyman’s conception of “his breakthrough”, in relation to a certain concept of “inference”. “In the present paper,” he tells us, “the term ‘inferential theory’…will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation…either a substitute a priori distribution [exemplified by the so-called principle of insufficient reason] or a new measure of uncertainty” such as Fisher’s fiducial probability. So if you hear Neyman rejecting “inferential accounts”, you have to understand it in this very specific way: he’s rejecting “new measures of confidence or diffidence”. Here he alludes to them as “easy ways out”. Now Neyman always distinguishes his error statistical performance conception from Bayesian and fiducial probabilisms [ii]. The surprising twist here is semantical, and the culprit is none other than…Allan Birnbaum. Yet Birnbaum gets short shrift, and no mention is made of our favorite “breakthrough” (or did I miss it?).

drawn by his wife, Olga

Note: In this article, “attacks” on various statistical “fronts” refers to ways of attacking problems in one or another statistical research program.

Categories: Bayesian/frequentist, Error Statistics, Neyman

Jerzy Neyman and “Les Miserables Citations” (statistical theater in honor of his birthday yesterday)


Neyman April 16, 1894 – August 5, 1981

My second Jerzy Neyman item, in honor of his birthday, is a little play that I wrote for Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018):

A local acting group is putting on a short theater production based on a screenplay I wrote: “Les Miserables Citations” (“Those Miserable Quotes”) [1]. The “miserable” citations are those everyone loves to cite from Neyman and Pearson’s early joint 1933 paper:

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.

But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. (Neyman and Pearson 1933, pp. 290-1).

Continue reading

Categories: E.S. Pearson, Neyman, Statistics

A. Spanos: Jerzy Neyman and his Enduring Legacy

Today is Jerzy Neyman’s birthday. I’ll post various Neyman items this week in recognition of it, starting with a guest post by Aris Spanos. Happy Birthday Neyman!

A. Spanos

A Statistical Model as a Chance Mechanism
Aris Spanos 

Jerzy Neyman (April 16, 1894 – August 5, 1981), was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals. (This article was first posted here.)


Neyman: 16 April 1894 – 5 Aug 1981

One of Neyman’s most remarkable, but least recognized, achievements was his adapting of Fisher’s (1922) notion of a statistical model to render it pertinent for non-random samples. Fisher’s original parametric statistical model Mθ(x) was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data x0:=(x1,x2,…,xn) can be viewed as a ‘truly representative sample’ from that ‘population’: Continue reading

Categories: Neyman, Spanos

Several reviews of Deborah Mayo’s new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars « Statistical Modeling, Causal Inference, and Social Science

Source: Several reviews of Deborah Mayo’s new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars « Statistical Modeling, Causal Inference, and Social Science

Categories: Error Statistics

Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt


For the first time, I’m excerpting all of Excursion 1 Tour II from SIST (2018, CUP).

1.4 The Law of Likelihood and Error Statistics

If you want to understand what’s true about statistical inference, you should begin with what has long been a holy grail – to use probability to arrive at a type of logic of evidential support – and in the first instance you should look not at full-blown Bayesian probabilism, but at comparative accounts that sidestep prior probabilities in hypotheses. An intuitively plausible logic of comparative support was given by the philosopher Ian Hacking (1965) – the Law of Likelihood. Fortunately, the Museum of Statistics is organized by theme, and the Law of Likelihood and the related Likelihood Principle is a big one. Continue reading
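To fix ideas before boarding: the Law of Likelihood says data x favor H1 over H0 just when Pr(x; H1) > Pr(x; H0). A minimal sketch (mine, not from the Tour; the binomial numbers are illustrative) shows the feature the error statistician worries about: an alternative chosen to fit the data always wins the comparison, however the data came about.

```python
# Sketch (editor's illustration): the Law of Likelihood in a binomial model.
# x "favors" H1 over H0 iff LR = P(x; H1) / P(x; H0) > 1.
from scipy.stats import binom

n, x = 100, 60                      # 60 successes in 100 Bernoulli trials
theta0 = 0.5                        # H0: fair process

def likelihood(theta):
    return binom.pmf(x, n, theta)

theta_hat = x / n                   # maximally likely alternative, picked after seeing x
lr = likelihood(theta_hat) / likelihood(theta0)
print(f"LR(theta={theta_hat} vs theta={theta0}) = {lr:.1f}")   # > 1 by construction
# Whatever x turns out to be, theta_hat = x/n is favored over theta0, so comparative
# support alone puts no control on how often such "favoring" is erroneous.
```

That lack of error control, rather than the intuitive appeal of the comparison, is where the severe tester gets off the boat.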

Categories: Error Statistics, law of likelihood, SIST

there’s a man at the wheel in your brain & he’s telling you what you’re allowed to say (not probability, not likelihood)

It seems like every week something exciting in statistics comes down the pike. Last week I was contacted by Richard Harris (and two others) about the recommendation to stop saying the data reach “significance level p” but rather simply say

“the p-value is p”.

(For links, see my previous post.) Friday, he wrote to ask if I would comment on a proposed restriction (?) on saying a test had high power! I agreed that we shouldn’t say a test has high power, but only that it has high power to detect a specific alternative; I wasn’t aware, though, of any rulings from those in power on power. He explained it was an upshot of a reexamination, by a joint group of the boards of statistical associations in the U.S. and UK, of the full panoply of statistical terms. Something like that. I agreed to speak with him yesterday. He emailed me the proposed ruling on power: Continue reading

Categories: Bayesian/frequentist

Diary For Statistical War Correspondents on the Latest Ban on Speech

When science writers, especially “statistical war correspondents”, contact you to weigh in on some article, they may talk to you until they get something spicy, and then they may or may not include the background context. So a few writers contacted me this past week regarding this article (“Retire Statistical Significance”) – a teaser, I now suppose, to advertise the ASA collection growing out of that conference “A world beyond P ≤ .05” way back in Oct 2017, where I gave a paper*. I jotted down some points, since Richard Harris from NPR needed them immediately, and I had just gotten off a plane when he emailed. He let me follow up with him, which is rare and greatly appreciated. So I streamlined the first set of points, and dropped any points he deemed technical. I sketched the third set for a couple of other journals who contacted me, who may or may not use them. Here’s Harris’ article, which includes a couple of my remarks. Continue reading

Categories: ASA Guide to P-values, P-values

1 Day to Apply for the Summer Seminar in Phil Stat

Go to the website for instructions:

Categories: Summer Seminar in PhilStat

S. Senn: To infinity and beyond: how big are your data, really? (guest post)



Stephen Senn
Consultant Statistician

What is this you boast about?

Failure to understand components of variation is the source of much mischief. It can lead researchers to overlook that they can be rich in data-points but poor in information. The important thing is always to understand what varies in the data you have, and to what extent your design, and the purpose you have in mind, master it. The result of failing to understand this can be that you mistakenly calculate standard errors of your estimates that are too small because you divide the variance by an n that is too big. In fact, the problems can go further than this, since you may even pick up the wrong covariance and hence use inappropriate regression coefficients to adjust your estimates.

I shall illustrate this point using clinical trials in asthma. Continue reading
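A minimal simulation (mine, not Senn’s; it assumes a toy setting of repeated measurements nested within patients, with made-up variance components) makes the warning concrete: dividing by the total number of data points, rather than attending to the components of variation, understates the standard error badly.

```python
# Sketch (editor's illustration): rich in data points, poor in information.
# m patients, k repeated measurements each; between-patient variation dominates.
import numpy as np

rng = np.random.default_rng(1)
m, k = 20, 50                                 # 20 patients, 50 measurements each
sd_between, sd_within = 1.0, 0.5

reps = 2000
naive_ses, sample_means = [], []
for _ in range(reps):
    patient_effects = rng.normal(0, sd_between, m)
    data = patient_effects[:, None] + rng.normal(0, sd_within, (m, k))
    naive_ses.append(data.std(ddof=1) / np.sqrt(m * k))   # divides by n = 1000
    sample_means.append(data.mean())

print(f"naive SE (divide by m*k): {np.mean(naive_ses):.3f}")
print(f"actual SD of the mean:    {np.std(sample_means):.3f}")
# The naive SE is several times too small: the 1000 points carry only about
# 20 patients' worth of information on the between-patient component.
```

For the purpose of estimating the mean, the 1,000 data points behave like roughly 20: which n you divide by is exactly the question of which component of variation your design masters.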

Categories: Lord's paradox, S. Senn | 5 Comments

Blurbs of 16 Tours: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST)

Statistical Inference as Severe Testing:
How to Get Beyond the Statistics Wars (2018, CUP)

Deborah G. Mayo

Abstract for Book

By disinterring the underlying statistical philosophies, this book sets the stage for understanding and finally getting beyond today’s most pressing controversies revolving around statistical methods and irreproducible findings. Statistical Inference as Severe Testing takes the reader on a journey that provides a non-technical “how to” guide for zeroing in on the most influential arguments surrounding commonly used – and abused – statistical methods. The book sets sail with a tool for telling what’s true about statistical controversies: if little if anything has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a stringent or severe test. In the severe testing account, probability arises in inference, not to measure degrees of plausibility or belief in hypotheses, but to assess and control how severely tested claims are. Viewing statistical inference as severe testing supplies novel solutions to problems of induction, falsification and demarcating science from pseudoscience, and serves as the linchpin for understanding and getting beyond the statistics wars. The book links philosophical questions about the roles of probability in inference to the concerns of practitioners in psychology, medicine, biology, economics, physics and across the landscape of the natural and social sciences.

Keywords for book:

severe testing, Bayesian and frequentist debates, philosophy of statistics, significance testing controversy, statistics wars, replication crisis, statistical inference, error statistics, philosophy and history of Neyman, Pearson and Fisherian statistics, Popperian falsification

Continue reading

Categories: Statistical Inference as Severe Testing

Deconstructing the Fisher-Neyman conflict wearing fiducial glasses + Excerpt 5.8 from SIST


Fisher/ Neyman

This continues my previous post: “Can’t take the fiducial out of Fisher…” in recognition of Fisher’s birthday, February 17. These 2 posts reflect my working out of these ideas in writing Section 5.8 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, CUP 2018). Here’s all of Section 5.8 (“Neyman’s Performance and Fisher’s Fiducial Probability”) for your Saturday night reading.* 

Move up 20 years to the famous 1955/56 exchange between Fisher and Neyman. Fisher clearly connects Neyman’s adoption of a behavioristic-performance formulation to his denying the soundness of fiducial inference. When “Neyman denies the existence of inductive reasoning, he is merely expressing a verbal preference. For him ‘reasoning’ means what ‘deductive reasoning’ means to others.” (Fisher 1955, p. 74). Continue reading

Categories: fiducial probability, Fisher, Neyman, Statistics

Can’t Take the Fiducial Out of Fisher (if you want to understand the N-P performance philosophy) [i]


R.A. Fisher: February 17, 1890 – July 29, 1962

Continuing with posts in recognition of R.A. Fisher’s birthday, I post one from a few years ago on a topic that had previously not been discussed on this blog: Fisher’s fiducial probability.

[Neyman and Pearson] “began an influential collaboration initially designed primarily, it would seem, to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals” (D. R. Cox 2006, p. 195).

The entire episode of fiducial probability is fraught with minefields. Many say it was Fisher’s biggest blunder; others suggest it still hasn’t been understood. The majority of discussions omit the side trip to the Fiducial Forest altogether, finding the surrounding brambles too thorny to penetrate. Besides, a fascinating narrative about the Fisher-Neyman-Pearson divide has managed to bloom and grow while steering clear of fiducial probability – never mind that it remained a centerpiece of Fisher’s statistical philosophy. I now think that this is a mistake. It was thought, following Lehmann (1993) and others, that we could take the fiducial out of Fisher and still understand the core of the Neyman-Pearson vs Fisher (or Neyman vs Fisher) disagreements. We can’t. Quite aside from the intrinsic interest in correcting the “he said/he said” of these statisticians, the issue is intimately bound up with the current (flawed) consensus view of frequentist error statistics. Continue reading

Categories: fiducial probability, Fisher, Phil6334/ Econ 6614, Statistics

Guest Blog: R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)



In recognition of R.A. Fisher’s birthday on February 17…a week of Fisher posts!

‘R. A. Fisher: How an Outsider Revolutionized Statistics’

by Aris Spanos

Few statisticians will dispute that R. A. Fisher (February 17, 1890 – July 29, 1962) is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into modern model-based induction in a series of papers in the early 1920s. He put forward a theory of optimal estimation based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of optimal testing in the early 1930s. According to Hald (1998),

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the ultimate outsider when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however, did not stop him from publishing his first paper in statistics in 1912 (still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper. Continue reading

Categories: Fisher, phil/history of stat, Phil6334/ Econ 6614, Spanos, Statistics

R.A. Fisher: “Statistical Methods and Scientific Induction”

I continue a week of Fisherian posts begun on his birthday (Feb 17). This is his contribution to the “Triad” – an exchange between Fisher, Neyman and Pearson 20 years after the Fisher-Neyman break-up. The other two are below. They are each very short and worth your rereading.

17 February 1890 — 29 July 1962

“Statistical Methods and Scientific Induction”

by Sir Ronald Fisher (1955)


The attempt to reinterpret the common tests of significance used in scientific research as though they constituted some kind of acceptance procedure and led to “decisions” in Wald’s sense, originated in several misapprehensions and has led, apparently, to several more.

The three phrases examined here, with a view to elucidating the fallacies they embody, are:

  1. “Repeated sampling from the same population”,
  2. Errors of the “second kind”,
  3. “Inductive behavior”.

Mathematicians without personal contact with the Natural Sciences have often been misled by such phrases. The errors to which they lead are not only numerical.

To continue reading Fisher’s paper.


Note on an Article by Sir Ronald Fisher

by Jerzy Neyman (1956)




(1) FISHER’S allegation that, contrary to some passages in the introduction and on the cover of the book by Wald, this book does not really deal with experimental design is unfounded. In actual fact, the book is permeated with problems of experimentation. (2) Without consideration of hypotheses alternative to the one under test and without the study of probabilities of the two kinds, no purely probabilistic theory of tests is possible. Continue reading

Categories: E.S. Pearson, fiducial probability, Fisher, Neyman, phil/history of stat, Phil6334/ Econ 6614

Guest Post: STEPHEN SENN: ‘Fisher’s alternative to the alternative’

“You May Believe You Are a Bayesian But You Are Probably Wrong”


As part of the week of posts on R.A. Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012 and 2017. See especially the comments from Feb 2017.

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of R.A. Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions, Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976, p. 473)

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976, and the lecture on which it was based was given in Detroit on 29 December 1971 (p. 441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre, but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published, and this throws light on many aspects of Fisher’s thought, including significance tests. Continue reading

Categories: Fisher, S. Senn, Statistics
