1. **New monsters.** One of the bizarre facts of life in the statistics wars is that a method from one school may be criticized on grounds that it conflicts with a conception that is the reverse of what that school intends. How is that even to be deciphered? That was the difficult task I set for myself in writing *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (CUP, 2018) [SIST 2018]. I thought I was done, but new monsters keep appearing. In some cases, rather than see how the notion of severity gets us beyond fallacies, misconstruals are taken to criticize severity! So, for example, in the last couple of posts, here and here, I deciphered some of the better known power howlers (discussed in SIST Excursion 5 Tour II); I’m linking to all of this tour (in proofs).

We may examine claim (I) in a typical one-sided test (of the mean):

H_{0}: μ ≤ μ_{0} vs. H_{1}: μ > μ_{0}

(I) Our Claim: If the power of the test to detect μ’ is high (i.e., POW(μ’) is high, e.g., over .5), then a just significant result is *poor* evidence that μ > μ’; while if POW(μ’) is low (e.g., less than .1), it’s *good* evidence that μ > μ’ (provided the assumptions for these claims hold approximately).

Now this claim (I) follows directly from the definitions of terms, but some argue that this just goes to show what’s wrong with the terms, rather than with their construal of them.

**Specific Example of test T+.** Let’s use an example from our last post, taken from SIST (2018), just to illustrate. We’re testing the normal mean µ:

H_{0}: µ ≤ 150 against H_{1}: µ > 150

with σ = 10, SE = σ/√n = 1. The critical value for α = .025 is z = 1.96. That is, we reject the claim that the population mean is less than or equal to 150 (we reject µ ≤ 150) and infer there is evidence that µ > 150 whenever the sample mean M is at least 1.96 SE in excess of 150, i.e., when M ≥ 150 + 1.96(1). For simplicity, let’s use the 2 SE cut-off as the critical value for rejecting:

Test T+ (n = 100): Reject µ ≤ 150 when M > 150 + 2SE = 152.

**QUESTION**: Now, suppose your outcome just makes it over the hurdle, M = 152. Does it make sense to say that this outcome is better evidence that µ is at least 153 than it is evidence that µ is at least 152? Well, POW(153) > POW(152), so if we thought the higher the power against μ’ the stronger the evidence that µ is at least µ’, then the answer would be yes. But logic alone would tell us that since:

claim A (e.g., µ ≥ 153) entails claim B (e.g., µ ≥ 152), claim B should be better warranted than claim A.
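The two power computations behind this can be checked directly; here is a quick sketch in Python (illustrative only, using the test T+ numbers defined above):

```python
from scipy.stats import norm

# Test T+: H0: mu <= 150 vs H1: mu > 150, with sigma = 10, n = 100, so SE = 1
SE = 1.0
cutoff = 152.0  # the 2 SE rejection cut-off

def power(mu_alt):
    # POW(mu') = Pr(M > cutoff; mu = mu')
    return 1 - norm.cdf((cutoff - mu_alt) / SE)

print(round(power(153), 3))  # 0.841
print(round(power(152), 3))  # 0.5
```

So POW(153) > POW(152), exactly as stated, even though µ ≥ 153 entails µ ≥ 152.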

Nevertheless, we showed how one can make sense of the allegation that if you reach a just statistically significant result, yet the test had low power to detect a discrepancy from the null that is known from external sources to be correct, then if you use the observed M as an estimate of µ (rather than a lower CI bound), you’ll be “exaggerating” µ. Some take away from this that low power for µ’ indicates poor evidence for µ > µ’. Or they put it as a comparative: the higher the power to detect µ’, the better the evidence for µ > µ’. This conflicts with our claim (I). We show that (I) is correct–but some may argue that upholding (I) is a problem for severity!

2. **A severity critic.** One such critic, Rochefort-Maranda (2020), hereafter RM, writes: “My aim…is to show how the problem of inflated effect sizes…corrupts the severity measure of evidence”, where severity is from Mayo and Spanos 2006, 2011. But his example actually has a sample size of only 10! You would be right to suspect violated assumptions, and **Aris Spanos** (2022), in his article in *Philosophy of Science*, shows in great detail how far his example is from satisfying the assumptions.

“[RM’s] claim that ‘the more powerful a test that rejects H_{0}, the more the evidence against H_{0}’ constitutes a misconception. This claim is based on misunderstanding the difference between aiming for ‘a large n’ predata to increase the power of the test (a commendable strategy) and what the particular power implies, postdata.” (p. 16)

“Rochefort-Maranda’s (2020) case against the postdata severity evaluation, built on a numerical example using a ‘bad draw’ of simulated data with n = 10, illustrates how one can generate untrustworthy evidence (inconsistent estimators and an underpowered test) and declare severity as the culprit for the ensuing dubious results. His discussion is based on several misconceptions about the proper implementation and interpretation of frequentist testing.” (p. 18)

RM’s data, Spanos shows,

“is a form of “simulated data dredging” that describes the practice of simulating hundreds of replications of size n by changing the “seed” of the pseudorandom number algorithm in search of a desired result.” (ibid.)

Here the desired result is one that appears to lead to endorsing an inference with high severity even where it is stipulated we happen to know the inference is false. He doesn’t show such a misleading result is probable–merely logically possible–and in fact he himself says it’s practically impossible to replicate such misleading data!

3. **The whole criticism is wrong-headed, even if assumptions hold.** I want to raise another very general problem that would hold for such a criticism even if we imagine all the assumptions are met. (This is different from the “M inflates µ” problem.) In a nutshell: the fact that one can imagine a parameter value excluded from a confidence interval at a reasonable confidence level is scarcely an indictment of CIs! RM’s argument is just of that form, and it seems to me he could have spared himself the elaborate simulations and programming to show this. He overlooks the fact that the error probability must be included to qualify the inference, be it a p-value, confidence interval, or severity assessment.

Go back to our example. We observe M = 152 and our critic says: But suppose we knew the true µ was 150.01. RM is alarmed: We have observed a difference from 150 of 2 when the true difference is only .01. He blames it on the low power against .01.

“That is because we have a significant result with an underpowered test such that the effect size is incredibly bigger than reality [200 times bigger]. The significance tester ‘would thus be wrong to believe that there is such a substantial difference from H_{0}’. But S would feel warranted to reach a similarly bad conclusion with the severity score.” (Rochefort-Maranda 2020)

Bad conclusion? Let’s separate his two allegations.[i] Yes, it would be wrong to take the observed M as the population µ–but don’t people already know this? (One would be wrong ~50% of the time.) I come back to this at the end with some quotes from Stephen Senn.

The .975 lower confidence bound with M = 152 is µ > 150.04. And there’s scarcely anything bad about inferring µ > 150.04, provided the assumptions hold approximately. It’s a well warranted statistical inference.

RM comes along and says: but suppose I know µ = 150.01. Of course, we don’t know the true value µ. But let’s suppose we do. Then the alleged problem is that we infer µ > 150.04 with severity .975 (the lower .975 confidence bound), and we’re “wrong” because the true value is outside the confidence interval! Note that the inference to µ > 150.01 is even stronger, severity .98.

Insofar as a statistical inference account is fallible, a CI, even at a high confidence level, can exclude the imagined known µ value. This is no indictment of the method. The same is true for a severity assessment, and of course there is a duality between tests and CIs. We keep such errors small by our choice of the error probability associated with the method.

Remember, the form of inference (with CIs and tests) is not to a point value but to an inequality such as µ > 150.01.

Of course the inference is actually: if the assumptions hold approximately, then there is a good indication that µ > 150.01. The p-value is ~.02. The confidence level with 150.01 as lower bound is ~.98. The fact that the power against µ = 150.01 is low is actually a way to explain why the just statistically significant M is a *good indication* that µ > 150.01.(ii)
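All three numbers in this paragraph can be reproduced in a few lines (a sketch in Python under the same test T+ setup; illustrative only):

```python
from scipy.stats import norm

SE = 1.0   # sigma = 10, n = 100
M = 152.0  # observed sample mean

# p-value for testing H0: mu <= 150 with M = 152
p = 1 - norm.cdf((M - 150) / SE)

# .975 lower confidence bound: M - 1.96 SE
lower_975 = M - 1.96 * SE

# severity for the inference mu > 150.01: Pr(M < 152; mu = 150.01)
sev = norm.cdf((M - 150.01) / SE)

print(round(p, 2))          # 0.02
print(round(lower_975, 2))  # 150.04
print(round(sev, 2))        # 0.98
```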

Once again, as in our last post, if one knows the observed difference is radically out of line, one suspects the computations of the p-value, the lower confidence bound, and severity are illicit, typically by biasing selections, ignoring data-dredging, optional stopping, and/or using a sample size too small to satisfy assumptions. This is what goes wrong in the RM illustration, as Spanos shows.

4. **To conclude…** It does not make sense to say that because the test T+ has low power against values “close” to µ_{0} (here 150), a statistically significant M isn’t good evidence that µ exceeds that value. At least not within the error statistical paradigm. It’s the *opposite*, and one only needs to remember that the power of the test against µ_{0} is α, here .025. It is even more absurd to say this in the case where M exceeds the 2 SE cut-off (we’ve been supposing it just makes it to the statistically significant M, 151.96, or to simplify, 152). Suppose for example M = 153. This is even stronger grounds that µ > 150.01 (p-value ~.001).

On the other hand, it makes sense to say–since it’s true–that as sample size increases, the value of M needed to just reach .025 statistical significance gets closer to 150. So if you’re bound to use the observed M to estimate the population mean, then just reaching .025 significance is less of an exaggeration with higher n.
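This is simple arithmetic on the cut-off, 150 + 1.96(σ/√n); a quick sketch:

```python
from math import sqrt

sigma = 10.0

def cutoff(n, z=1.96):
    # smallest M that just reaches .025 significance: 150 + z * sigma/sqrt(n)
    return 150 + z * sigma / sqrt(n)

print(round(cutoff(100), 3))    # 151.96
print(round(cutoff(10000), 3))  # 150.196
```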

*Question for the Reader:* However, what if we compare two .025 statistically significant results in test T+ but with two sample sizes, say one with n = 100 (as before) and a second with n = 10,000? The .025 statistically significant result with n = 10,000 indicates less of a discrepancy from 150 than the one with n = 100. Do you see why? (Give your reply in the comments.) (See (iii))

Finally, to take the observed effect M as a good estimate of the true µ in the population is a bad idea—it is to ignore the fact that statistical methods have uncertainty. Rather, we would use the lower bound of a confidence interval with reasonably high confidence level (or corresponding high severity). If you think .975 or .95 give lower bounds that are too conservative, as someone who emailed me recently avers, then at least report the 1SE lower bound (for a confidence level of .84). Error statistical methods don’t hide the uncertainties associated with the method. If you do, it’s no wonder you end up with unwarranted claims. (iv)

*Stephen Senn on ignoring uncertainty.* There’s a guest post on this blog by Stephen Senn, alluding to R.A. Fisher and Cox and Hinkley (1974), on this issue:

“In my opinion, a great deal of confusion about statistics can be traced to the fact that the point estimate is seen as being the be all and end all, the expression of uncertainty being forgotten.

…to provide a point estimate without also providing a standard error is, indeed, an all too standard error. …if you don’t know how uncertain your evidence is, you can’t use it. Thus, assessing uncertainty is important. … This (perhaps unconscious) obsession with point estimation as the be all and end all causes problems. …

Point estimates are not enough. It is rarely the case that you have to act immediately based on your best guess. Where you don’t, you have to know how good your guesses are. This requires a principled approach to assessing uncertainty.” (Senn)

**Use the comments for your queries and response to my “question for the reader”.**

______

[i] I put aside the fact that we would never call the degree of corroboration a “score”.

(ii) I hear people say, well if the power against 150.01 is practically the same as α, then the test isn’t discriminating 150.01 from the null of 150. Fine. M warrants both µ > 150 as well as µ > 150.01, and the warrant for the former is scarcely stronger than the latter. So the warranted inferences are similar.

(iii) On the other hand, if a test just *fails to make it to the statistically significant cut-off,* and POW(µ’) is low, then there’s poor evidence that µ < µ’. It’s for an inference of this form that the low power creates problems.

(iv) I note in one of my comments that Ioannidis’ (2008) way of stating the “inflation” claim is less open to misconstrual. He says it’s the observed effect (i.e., the observed M) that “inflates” the “true” population effect, when the test has low power to detect that effect (though he allows it can also underestimate the true effect)–especially in the context of a variety of selective effects.

**REFERENCES:**

- Rochefort-Maranda, G. (2020). Inflated effect sizes and underpowered tests: how the severity measure of evidence is affected by the winner’s curse. *Philosophical Studies*.
- Mayo, D. G. (2018). *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*. CUP.
- Mayo, D. G., & Spanos, A. (2006). Severe testing as a basic concept in a Neyman–Pearson philosophy of induction. *The British Journal for the Philosophy of Science*, 57(2), 323–357.
- Mayo, D. G., & Spanos, A. (2011). Error statistics. *Philosophy of Statistics*, 7, 152–198.
- Spanos, A. (2022). Severity and trustworthy evidence: Foundational problems versus misuses of frequentist testing. *Philosophy of Science*.

My short answer to your question is that with n=10000 we observed 150.2 instead of 152. The severity curve would be closer to 150, but much steeper. I think my answer would be different if significance was a filter to even think about the problem (maybe you would only notice if it hits the threshold).

RM’s example is not specific enough about how severity is calculated. The calculation doesn’t account for model selection inherent in the problem. It feels unfair of me to tell him “what he really wants to say”, but I have some supporting quotes from the text.

RM quotes Button et al 2013: “when an underpowered study discovers a true effect, it is likely that the estimate …. will be exaggerated”.

He again quotes Ioannidis 2008: “Inflation is expected when, to claim success (discovery), an association has to pass a certain threshold of statistical significance…”

I think this is the scenario RM wants to model: “I have a lot of variability, and I can only publish if the claim gamma > 0 is significant. But I want to say more than that – can I reliably say that gamma > c, given a publishable result?”.

My answer to that is no – not if he is using the same data for both claims.

I tried some simulations with his model, following RM’s and Spanos’s (2022) calculations. If you set gamma to any value > 0, and then filter on the statistically significant outcomes, an effect size of 0.856 is not impressive. The severity score for gamma > c, for any c > 0, and taking the model selection into account, is never more than 0.3.

My conclusion is that we are justified in saying gamma > 0, but we cannot claim that gamma > c given our analysis. Given the power, we were lucky to make the discovery. I believe this is similar to post-selection adjustments from using stepwise or lasso.

A better option for RM might be to use the following strategy. Split the data into 5/5, and check for a significant gamma > 0 on the first half. Then use the 2nd half to test for effect size. The power could hardly be worse than it already is. Or, he could increase n and try his original strategy.
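The split-sample strategy might be sketched as follows (all numbers here are hypothetical; this is just a minimal illustration of screening for significance on one half and estimating on the untouched half, not RM’s actual model):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
data = rng.normal(loc=0.5, scale=1.0, size=10)  # hypothetical gamma = 0.5, n = 10

first, second = data[:5], data[5:]

# Stage 1: screen for a significant gamma > 0 on the first half only
t1, p1 = ttest_1samp(first, 0.0, alternative="greater")

if p1 < 0.05:
    # Stage 2: estimate the effect size on data not used for selection
    print(f"screened in; independent estimate = {second.mean():.3f}")
else:
    print("not significant on the screening half; no effect-size estimate")
```

With n = 10 each stage is of course badly underpowered, as noted above; the point is only that the effect-size estimate is then free of the significance filter.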

It’s easy to see that the just significant result, say at level .025, with n = 100, indicates a larger (positive) discrepancy from 150 than does a just significant result at the .025 level, with n = 10,000. Form the two .84 CIs. With n = 10,000, we get mu > 150.1, with n = 100, we get mu > 151. So we have evidence of a larger discrepancy from 150 when the .025 significant result comes from n = 100 than when it comes from n = 10,000.
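This comparison can be checked in a couple of lines (a sketch; the .84 lower bound is M minus 1 SE, as in the post):

```python
from math import sqrt

sigma = 10.0

def lower_84_bound(n):
    # a just .025-significant M is 150 + 2 SE; the .84 lower bound drops 1 SE
    SE = sigma / sqrt(n)
    return (150 + 2 * SE) - SE

print(round(lower_84_bound(100), 1))    # 151.0
print(round(lower_84_bound(10000), 1))  # 150.1
```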

hwyneken:

Please don’t use the term “severity score”, although I know what you mean. We wouldn’t say the p-value “score” or the confidence level “score”. Severity is an assessment of how warranted (or corroborated) the inference is.

When people say, given the low power you were lucky to get statistical significance, it seems to me they’re missing the testing reasoning. It would have been very improbable to get so large an observed difference (from the null) WERE WE in a world where the null hypothesis holds. That’s why we take the statistically significant result to indicate we are NOT in such a world, but rather one where mu exceeds mu0.

thanks – would it be appropriate to say “SEV score” or is just SEV better?

I agree with your argument that low power, for expected alternatives, is not an inherent reason to dismiss a significant result. I could see how saying “we were lucky” takes away from the meaningfulness of that result in the eyes of the audience.

I still think that Erik and RM are on to something, but that the answer has more to do with accounting for test selection. It’s kind of slippery, but it seems like the use case is: “if I get significance, I’ll estimate an effect size. If not, I’ll try an equivalence test or just forget about it.” This seems different in kind from the water heater example, where we still might try to establish a lower bound, even if we did not see a significant increase from 150.

In other words, the p values are not adjusted for this selection, and that is making what you called “honest hunting” impossible. Do you think this is on the right track?

hwyneken: One needs to carefully distinguish each of several different issues, which is exactly why I wrote three posts delineating the different problems.

I do not argue “that low power, for expected alternatives, is not an inherent reason to dismiss a significant result.” If the assumptions are approximately met, the low power against mu’ is a good reason to take a statistically significant result as indicating mu > mu’. Some people get it backwards because they misinterpret an entirely distinct point. Remember, the power against the null mu0, is very low–alpha. People are misinterpreting statistical significance tests if they deny my claim, correctly parsed. You might read Erik’s reply to me.

A distinct issue arises if one takes the observed M as the population mu. That is not the way estimation is to be done. If one insists on doing it, and further imagines we know the true population mu, then I imagine there might be all manner of “adjustments” to patch things up so that such an estimate is closer to what is assumed known.

Two distinct issues are:

(a) violated assumptions and selection effects,

(b) misinterpreting power as a posterior probability (previous post)

Thanks – what you wrote here helped this to click: “Remember, the power against the null mu0, is very low–alpha.” It looks like there’s convergence between what you and Erik are saying.

What I want to be able to calculate is

SEV(mu’ < mu) = P(t* <= t ; mu = mu', data, significance filter)

I can always fall back on simulation, but I might want to work this out for a simple example. I think Erik's paper has some tools to help with that.

Hwyneken:

You might think there’s convergence between Erik and me, given his comment on my last post wherein he concurs with my assertion (I). (This is not letting me link to just the comment, so far as I can tell (like it used to), so I’ll paste Erik’s comments, where he expressed agreement, at the bottom of this comment.)*

However, it is clear from the emails he sends me that he’s keen to revert to supposing we disagree. That there are at least two different claims being bandied about should be quite clear. I think the way Ioannidis (2008) explains why observed associations tend to ‘inflate’ true associations, especially in cases with selection effects, early stopping, researcher flexibility, etc., avoids a very important equivocation. That’s because he makes it clear that it’s the observed effect that may tend to be larger than the “true” population effect, when that true population effect is one against which the test has low power (and thus is smaller than the cut-off for a statistically significant result). Then there’s no tendency to suppose the assertion is the false one that I’ve been on about, to wit: that a stat sig result is better evidence that mu exceeds mu’ when POW(mu’) is high than when it is low. That would be at odds with statistical significance test reasoning (and this discussion presumes that’s what we’re talking about). The claims are quite different!

*Erik’s comment on my last post:

You ask: First, do you disagree with my claim? You claim: If POW(μ’) is high then a just significant result is poor evidence that μ > μ’; while if POW(μ’) is low it’s good evidence that μ > μ’ (provided assumptions for these claims hold approximately).

This seems fine, as I understand it. POW(mu’) being high would mean that mu’ is something like 3 standard errors larger than mu0. If the test is *just* significant, then our estimate of mu would be about 2 standard errors larger than mu0 (assuming alpha=0.025, one-sided). So in this case, mu’ is 1 standard error larger than our estimate of mu. If anything, there is reason to think mu’ is larger than mu.

Conversely, POW(mu’) being low would mean that mu’ is something like 1 standard error larger than mu0. In that case, mu’ is 1 standard error smaller than our estimate of mu. So, then there is some reason to think mu’ is smaller than mu.

Deborah: I’m not “keen to revert to supposing we disagree”. I agree with your claim which I quoted. However, I also agree with Gelman, Carlin and Ioannidis that “underpowered tests exaggerate population effects”, with which you seem to disagree.

Erik: I have shown in detail how both assertions can be true. It seems to me that you are troubled with having granted my claim (I) because it seems to be in tension with your claim that the observed effect tends to exaggerate the true effect, when the power against the true effect is low. So maybe the question should be: how do you reconcile them, intuitively or logically?

As I understand your claim (I), it seems quite obvious (see my earlier comment). I am not at all troubled by it because I do not even see how it relates to the claim of Gelman, Carlin and Ioannidis.

You say that I claim: observed effect tends to exaggerate the true effect, when the power against the true effect is low.

Perhaps there’s the problem; I only make this claim *conditionally* on significance. I have been very clear and precise about that. I have even stated (and proved) the claim in mathematical terms, but you have not engaged with that at all.

Erik:

Sure, the observed (just) statistically significant effect exaggerates the true effect when the true effect is one against which the test has low power. I didn’t repeat everything in your full statement because I thought we’d been through it in email and on this blog. Here I was just trying to stress Ioannidis’ wording, which occurs to me might help avoid the equivocation. I also agree that Gelman and Carlin’s claim is very different from mine, while being consistent with it (in my view), and it receives its own analysis in my book. My discussion concerned a type of post-data use of power, and theirs is not really about “power analysis” as typically understood. But it is described in Gelman and Carlin as an example of retro power analysis. I was trying to disentangle our claims.

However, in their examples there are blatant selection effects and publication biases. So the ostensive error probability computations are actually not warranted.

Finally, it’s important to see that rather than infer the observed effect is ~ the population effect, the confidence interval (lower bound) subtracts 1-2 SE from the observed effect.
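How the conditional-on-significance exaggeration coexists with claim (I) can be seen in a small simulation (a sketch using the test T+ numbers; the true µ = 150.4 is stipulated, and the “significance filter” is just conditioning on M > 152):

```python
import numpy as np

rng = np.random.default_rng(42)
true_mu, SE, cutoff = 150.4, 1.0, 152.0  # POW(150.4) is only ~.05 in test T+

# Sampling distribution of the mean, then condition on significance
M = rng.normal(true_mu, SE, size=200_000)
significant = M[M > cutoff]

# Conditional on crossing the cut-off, the observed M exaggerates 150.4,
# even though each significant M remains good evidence that mu > 150.01.
print(round(significant.mean(), 1))  # ~152.4, well above the true 150.4
```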

The more common problem is trying to use power analysis to interpret negative results but replacing mu with the observed M. They are trying to avoid this. Problem is, since this computation yields a low value, standard power analysis doesn’t work. Standard power analysis reasons from a nonsignificant result to arguing that there’s evidence mu < mu' if POW(mu') is high. I'm considering here a one sided test as before. But having replaced mu with the observed nonsignificant result, the highest power you'll get is .5. I call this "shpower analysis" and there's a lot in my book and on this blog on it.
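The .5 ceiling is immediate once the observed M is plugged in for µ’; a quick sketch:

```python
from scipy.stats import norm

SE, cutoff = 1.0, 152.0  # test T+ with n = 100

def shpower(M_obs):
    # "power" with the observed M substituted for mu'
    return 1 - norm.cdf((cutoff - M_obs) / SE)

# Any nonsignificant result has M_obs < cutoff, so shpower < .5:
print(round(shpower(151.5), 3))  # 0.309
print(round(shpower(152.0), 3))  # 0.5 (the ceiling, reached at the cut-off)
```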

The focus on exaggeration (M) is interesting. What about the consideration of sign (S)? In my experience this is even more interesting. See https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070

Ron: I just wrote a paper with Steve Goodman about the type S error in the context of the Cochrane database, see https://onlinelibrary.wiley.com/doi/full/10.1002/sim.9406

There was another point I raised in the comments to my last post that reflects a qualification I recognized from a reader comment (Michael Lew) long ago. I’ll repeat it here:

Now the reason I wrote “just statistically significant at the given level”, e.g., .025, concerns another reading that I don’t think is meant by those claiming M exaggerates mu, but it might be. It came up in a much earlier post. Suppose one compares the SAME outcome that reached the .025 level with n = 100 with that SAME OBSERVED EFFECT SIZE but with larger n, say n = 10,000. Call this test T+ #2. Now an SE is (σ/√n) = 10/100 = .1. The 2 SE cut-off for rejection becomes 150.2. Then the p-value against a specified value of µ will be smaller with n = 10,000 than with n = 100. So that’s a way to make the assertion true that has nothing to do with estimating the population parameter with the observed difference. But I doubt that’s what they mean. I’ll explain why.

Spoze we know µ =150.4 or around that small.

In Test T+ #1, POW(150.4) = Pr(M > 152; µ = 150.4) = Pr(Z > 1.6) = .05, so it’s very low.

Someone says: but suppose n were large enough to make the power against µ = 150.4 high? Test T+ #2 will do the trick. With n = 10,000, POW(150.4) = Pr(M > 150.2; µ = 150.4) = Pr(Z > −2) = .98.

In both cases, under this reading I am entertaining, the observed sample mean is the same, say 152. In Test T+ #2, the observed mean M is 20 SE in excess of 150!

(Wouldn’t one suspect a gross exaggeration then?) Anyway, I grant that the p-value against µ < 150.4 is much smaller (~0) in the test with n = 10,000 than in the test with n = 100 (p-value ~.05).

Compare the p-values in testing µ < 150.4 with M = 152 from the two tests. The p-value from Test T+ #1 is ~.05, whereas the p-value from Test T+ #2 is ~0, because 152 is 16 SE in excess of 150.4. Please inform me of errors, I’m doing this quickly, but the same result holds even if approximate. So there is stronger evidence against µ < 150.4 from the test with the higher power (by imagining the same observed effect size, 152, came from a test with high power against 150.4). That is why I stipulated that in comparing two tests, one should look at outcomes with the same statistical significance level. I’m rather sure that proponents of the claim are not saying, “Imagine the same M came from a test where M is 20 SE rather than 2 SE”, but Michael Lew once raised a similar case, only in terms of likelihoods. And it may well be that this is what some mean.
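For readers who want to check these numbers, here is a quick sketch (same stipulations: σ = 10, observed M = 152, true µ = 150.4):

```python
from math import sqrt
from scipy.stats import norm

sigma, M = 10.0, 152.0

def SE(n):
    return sigma / sqrt(n)

def power(n, mu_alt):
    # power against mu', with the 2 SE rejection cut-off for each n
    cut = 150 + 2 * SE(n)
    return 1 - norm.cdf((cut - mu_alt) / SE(n))

def pval(n):
    # p-value for testing mu <= 150.4 with the same observed M = 152
    return 1 - norm.cdf((M - 150.4) / SE(n))

print(round(power(100, 150.4), 2))    # 0.05  (Test T+ #1)
print(round(power(10000, 150.4), 2))  # 0.98  (Test T+ #2)
print(round(pval(100), 2))            # 0.05
print(pval(10000))                    # 0.0   (152 is 16 SE above 150.4)
```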