Posts Tagged With: severe testing

Power and Severity with nonsignificant results: more power puzzles? (ii)

Posted on March 14, 2026 by Mayo

The concept of a test’s power, originating in Neyman-Pearson’s early work, is by and large a pre-data concept, used for specifying a test (notably, determining a worthwhile sample size) and for choosing between tests. In some papers, however, Neyman lists a third goal for power: to interpret test results post data, much in the spirit of what is often called “power analysis”. This is to determine the discrepancy from a null hypothesis that may be ruled out, given nonsignificant results. One example is in a paper “The Problem of Inductive Inference” (Neyman 1955), already a surprising title for behaviorist Neyman. The reason I’m bringing this up is that it has direct bearing on some of today’s most puzzling (and problematic) post-data uses of power. Interestingly, in that 1955 paper, Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap. …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true]. The question is: does this result “confirm” the hypothesis that H₀ is true of the particular data set? (Neyman, pp 40-41).

Neyman continues: Continue reading →

Categories: Neyman's Nursery, power analysis | Tags: negative result, Neyman, power, severe testing | Leave a comment

Neyman, Power, and Severity

Posted on August 5, 2014 by Mayo

Jerzy Neyman: April 16, 1894-August 5, 1981. This reblogs posts under “The Will to Understand Power” & “Neyman’s Nursery” here & here.

Way back when, although I’d never met him, I sent my doctoral dissertation, Philosophy of Statistics, to one person only: Professor Ronald Giere. (And he would read it, too!) I knew from his publications that he was a leading defender of frequentist statistical methods in philosophy of science, and that he’d worked for a time with Birnbaum in NYC.

Some ~~ten~~ 15 years ago, Giere decided to quit philosophy of statistics (while remaining in philosophy of science): I think it had to do with a certain form of statistical exile (in philosophy). He asked me if I wanted his papers—a mass of work on statistics and statistical foundations gathered over many years. Could I make a home for them? I said yes. Then came his caveat: there would be a lot of them.

As it happened, we were building a new house at the time, Thebes, and I designed a special room on the top floor that could house a dozen or so file cabinets. (I painted it pale rose, with white lacquered book shelves up to the ceiling.) Then, for more than 9 months (same as my son!), I waited . . . Several boxes finally arrived, containing hundreds of files—each meticulously labeled with titles and dates. More than that, the labels were hand-typed! I thought, If Ron knew what a slob I was, he likely would not have entrusted me with these treasures. (Perhaps he knew of no one else who would actually want them!) Continue reading →

Categories: Neyman, phil/history of stat, power, Statistics | Tags: negative result, Neyman, power, severe testing | 5 Comments

Further Reflections on Simplicity: Mechanisms

Posted on June 29, 2012 by Mayo

To continue with some philosophical reflections on the papers from the “Ockham’s razor” conference, let me respond to something in Shalizi’s recent comments (http://cscs.umich.edu/~crshalizi/weblog/). His emphasis on the interest in understanding processes and mechanisms, as opposed to mere prediction, seems exactly right. But he raises a question that seems to me simply answered (on grounds of evidence): If “a model didn’t seem to need” a mechanism, it is left out, why?

“It’s this, the leave-out-processes-you-don’t-need, which seems to me the core of the Razor for scientific model-building. This is definitely not the same as parameter-counting, and I think it’s also different from capacity control and even from description-length-measuring (cf.), though I am open to Peter persuading me otherwise. I am not, however, altogether sure how to formalize it, or what would justify it, beyond an aesthetic preference for tidy models. (And who died and left the tidy-minded in charge?) The best hope for such justification, I think, is something like Kevin’s idea that the Razor helps us get to the truth faster, or at least with fewer needless detours. Positing processes and mechanisms which aren’t strictly called for to account for the phenomena is asking for trouble needlessly.”

But it is easy to see that if a model M is adequate for data x regarding an aspect of a phenomenon (i.e., M has passed reasonably severe tests with x), then a model M’ that added an “unnecessary” mechanism would have passed with very low severity, or, if one prefers, M’ would be very poorly corroborated. To justify “leaving-out-processes-you-don’t-need”, then, the appeal is not to aesthetics or heuristics but to the severity or well-testedness of M and M’.

Continue reading →

Categories: philosophy of science, Statistics | Tags: Cosma Shalizi, Magic Angle Spinning, mechanisms, prions, severe testing | 4 Comments

Do CIs Avoid Fallacies of Tests? Reforming the Reformers

Posted on May 17, 2012 by Mayo

The one method that enjoys the approbation of the New Reformers is that of confidence intervals (See May 12, 2012, and links). The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter µ” (Cumming 2012, p. 69). He recommends prespecified confidence levels .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value.” (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it? Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+. Continue reading →
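The duality between tests and intervals mentioned above can be checked numerically. What follows is only a minimal sketch under stated assumptions: the numbers (observed mean 0.5, σ = 1, n = 100) and the function names are hypothetical, chosen purely for illustration, and the standardized sample mean is taken to be √n(x̄ − µ₀)/σ. It shows that, with the 2-standard-deviation cut-off, the values µ₀ that T+ would reject are exactly those falling below the lower limit of the one-sided interval.

```python
from math import sqrt

def d_stat(xbar, mu0, sigma, n):
    """Standardized sample mean for testing H0: mu <= mu0 (assumed form)."""
    return sqrt(n) * (xbar - mu0) / sigma

def lower_limit(xbar, sigma, n, cutoff=2.0):
    """Lower limit of the one-sided interval mu > xbar - cutoff*sigma/sqrt(n)."""
    return xbar - cutoff * sigma / sqrt(n)

def rejects(xbar, mu0, sigma, n, cutoff=2.0):
    """Test T+ rejects H0: mu <= mu0 iff d(x0) exceeds the cutoff."""
    return d_stat(xbar, mu0, sigma, n) > cutoff

# Hypothetical data summary, for illustration only.
xbar, sigma, n = 0.5, 1.0, 100
ll = lower_limit(xbar, sigma, n)

# Candidate null values on either side of the lower limit.
candidates = (0.0, 0.2, 0.29, 0.31, 0.5)
agreement = [rejects(xbar, mu0, sigma, n) == (mu0 < ll) for mu0 in candidates]

print(f"lower limit = {ll:.3f}")
print("test and interval agree for every candidate:", all(agreement))
```

The agreement is no accident: d(x₀) > 2 is algebraically the same condition as µ₀ < x̄ − 2σ/√n, which is why the interval cannot by itself escape whatever fallacies attach to the corresponding test.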

Categories: Statistics | Tags: confidence intervals, confidenceinterval, Geoff Cumming, P-value, reformers, severe testing, statistical significance | 14 Comments

Reposting from Jan 29: No-Pain Philosophy: Skepticism, Rationality, Popper, and All That: The First of 3 Parts

Posted on February 1, 2012 by Mayo

I want to shift to the arena of testing the adequacy of statistical models and misspecification testing (leading up to articles by Aris Spanos, Andrew Gelman, and David Hendry). But first, a couple of informal, philosophical mini-posts, if only to clarify terms we will need (each has a mini test at the end).

1. How do we obtain Knowledge, and how can we get more of it?

Few people doubt that science is successful and that it makes progress. This remains true for the philosopher of science, despite her tendency to skepticism. By contrast, most of us think we know a lot of things, and that science is one of our best ways of acquiring knowledge. But how do we justify our lack of skepticism? Continue reading →

Categories: philosophy of science | Tags: Popper, probabilism, severe testing, wedge between rationality and skepticism | 3 Comments

No-Pain Philosophy: Skepticism, Rationality, Popper, and All That: First of 3 Parts

Posted on January 29, 2012 by Mayo

Categories: No-Pain Philosophy, philosophy of science | Tags: Popper, probabilism, severe testing, wedge between rationality and skepticism | 2 Comments

Neyman’s Nursery (NN2): Power and Severity [Continuation of Oct. 22 Post]:

Posted on November 9, 2011 by Mayo

Let me pick up where I left off in “Neyman’s Nursery,” [built to house Giere’s statistical papers-in-exile]. The main goal of the discussion is to get us to exercise correctly our “will to understand power”, if only little by little. One of the two surprising papers I came across the night our house was hit by lightning has the tantalizing title “The Problem of Inductive Inference” (Neyman 1955). It reveals a use of statistical tests strikingly different from the long-run behavior construal most associated with Neyman. Surprising too, Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap. …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true]. The question is: does this result “confirm” the hypothesis that H₀ is true of the particular data set? (Neyman, pp 40-41).

Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc. If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H₀, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present]. Therefore, the failure of the test to reject H₀ cannot be reasonably considered as anything like a confirmation of H₀. The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)

The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.

Neyman alludes to a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation, call it test T+. (Whether Greek symbols will appear where they should, I cannot say; it’s being worked on back at Elba).

H₀: µ ≤ µ₀ against H₁: µ > µ₀.

The test statistic d(X) is the standardized sample mean.

The test rule: Infer a (positive) discrepancy from µ₀ iff d(x₀) > cα, where cα corresponds to a difference statistically significant at the α level.

In Carnap’s example the test could not reject the null hypothesis, i.e., d(x₀) ≤ cα, but (to paraphrase Neyman) the problem is that the chance of detecting the presence of discrepancy δ from the null, with so few observations, is extremely slim, even if [δ is present].

We are back to our old friend: interpreting negative results!

“One may be confident in the absence of that discrepancy only if the power to detect it were high.”

The power of the test T+ to detect discrepancy δ:

(1) P(d(X) > cα; µ = µ₀ + δ)

It is interesting to hear Neyman talk this way since it is at odds with the more behavioristic construal he usually championed. He sounds like a Cohen-style power analyst! Still, power is calculated relative to an outcome just missing the cutoff cα. This is, in effect, the worst case of a negative (non significant) result, and if the actual outcome corresponds to a larger p-value, that should be taken into account in interpreting the results. It is more informative, therefore, to look at the probability of getting a worse fit (with the null hypothesis) than you did:

(2) P(d(X) > d(x₀); µ = µ₀ + δ)

In this example, this gives a measure of the severity (or degree of corroboration) for the inference µ < µ₀ + δ.

Although (1) may be low, (2) may be high (For numbers, see Mayo and Spanos 2006).
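To see how (1) and (2) can come apart, here is a minimal numerical sketch. The numbers are hypothetical (they are not the ones in Mayo and Spanos 2006), and the sketch assumes that d(X) is √n(X̄ − µ₀)/σ, so that under µ = µ₀ + δ it is Normal with mean δ√n/σ and unit variance. The function names are my own labels for expressions (1) and (2).

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard Normal CDF, written with the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power(delta, n, sigma, c_alpha):
    """Expression (1): P(d(X) > c_alpha; mu = mu0 + delta)."""
    return 1.0 - normal_cdf(c_alpha - delta * sqrt(n) / sigma)

def severity(delta, n, sigma, d_obs):
    """Expression (2): P(d(X) > d(x0); mu = mu0 + delta)."""
    return 1.0 - normal_cdf(d_obs - delta * sqrt(n) / sigma)

# Hypothetical setup, for illustration only.
n, sigma = 25, 1.0   # standard error sigma/sqrt(n) = 0.2
c_alpha = 1.96       # cutoff for a one-sided test at alpha of about 0.025
d_obs = 0.0          # observed mean equals mu0: a clearly nonsignificant result
delta = 0.2          # discrepancy of interest (one standard error)

pow_delta = power(delta, n, sigma, c_alpha)
sev_delta = severity(delta, n, sigma, d_obs)

print(f"power against delta = {delta}: {pow_delta:.3f}")
print(f"severity for mu < mu0 + {delta}: {sev_delta:.3f}")
```

With these made-up values the power against δ is roughly 0.17 while the severity for inferring µ < µ₀ + δ is roughly 0.84. The two coincide only when the observed d(x₀) sits exactly at the cutoff cα, which is the worst case of a negative result described above.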

Spanos and I (Mayo and Spanos 2006) couldn’t find a term in the literature defined precisely this way–the way I’d defined it in Mayo (1996) and before. We were thinking at first of calling it “attained power” but then came across what some have called “observed power” which is very different (and very strange). Those measures are just like ordinary power but calculated assuming the value of the mean equals the observed mean! (Why anyone would want to do this and then apply power analytic reasoning is unclear. I’ll come back to this in my next post.) Anyway, we refer to it as the Severity Interpretation of “Acceptance” (SIA) in Mayo and Spanos 2006.

The claim in (2) could also be made out viewing the p-value as a random variable, calculating its distribution for various alternatives (Cox 2006, 25). This reasoning yields a core frequentist principle of evidence (FEV) in Mayo and Cox (2010, 256):

FEV:¹ A moderate p-value is evidence of the absence of a discrepancy d from H₀ only if there is a high probability the test would have given a worse fit with H₀ (i.e., smaller p-value) were a discrepancy d to exist.

It is important to see that it is only in the case of a negative result that severity for various inferences is in the same direction as power. In the case of significant results, d(x) in excess of the cutoff, the opposite concern arises—namely, the test is too sensitive. So severity is always relative to the particular inference being entertained: speaking of the “severity of a test” simpliciter is an incomplete statement in this account. These assessments enable sidestepping classic fallacies of tests that are either too sensitive or not sensitive enough.²
________________________________________

The full version of our frequentist principle of evidence FEV corresponds to the interpretation of a small p-value:

x is evidence of a discrepancy d from H₀ iff, if H₀ is a correct description of the mechanism generating x, then, with high probability a less discordant result would have occurred.

Severity (SEV) may be seen as a meta-statistical principle that follows the same logic as FEV reasoning within the formal statistical analysis.

By making a SEV assessment relevant to the inference under consideration, we obtain a measure where high (low) values always correspond to good (poor) evidential warrant.
It didn’t have to be done this way, but I decided it was best, even though it means appropriately swapping out the claim H for which one wants to assess SEV.

NOTE: There are 5 Neyman’s Nursery posts (NN1-NN5). NN3 is here. Search this blog for the others.

REFERENCES:

Cohen, J. (1992) A Power Primer.
Cohen, J. (1988), Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Erlbaum.

Mayo, D. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.

Mayo, D. and Cox, D. (2010), “Frequentist Statistics as a Theory of Inductive Inference,” in D. Mayo and A. Spanos (2010), pp. 247-275.

Mayo, D. and Spanos, A. (eds.) (2010), Error and Inference, Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, CUP.

Neyman, J. (1955), “The Problem of Inductive Inference,” Communications on Pure and Applied Mathematics, VIII, 13-46.

Categories: Neyman's Nursery, Statistics | Tags: negative result, Neyman, power, severe testing | Leave a comment

© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.