**Stephen Senn**
Head, Methodology and Statistics Group, Competence Center for Methodology and Statistics (CCMS), Luxembourg

**Delta Force**

*To what extent is clinical relevance relevant?*

**Inspiration**

This note has been inspired by a Twitter exchange with respected scientist and famous blogger David Colquhoun. He queried whether a treatment that had 2/3 of an effect that would be described as *clinically relevant* could be useful. I was surprised at the question, since I would regard it as being pretty obvious that it could but, on reflection, I realise that things that may seem obvious to some who have worked in drug development may not be obvious to others, and if they are *not* obvious to others they are either in need of a defence or wrong. I don't think I am wrong, and this note is to explain my thinking on the subject.

**Conventional power or sample size calculations**

As it happens, I don't particularly like conventional power calculations but I think they are, nonetheless, a good place to start. To carry out such a calculation a statistician needs the following ingredients:

1. A definition of a rational design (the smallest design that is feasible but would retain the essential characteristics of the design chosen).
2. An agreed outcome measure.
3. A proposed analysis.
4. A measure of variability for the rational design. (This might, for example, be the between-patient variance σ² for a parallel group design.)
5. An agreed type I error rate, α.
6. An agreed power, 1 − β.
7. A *clinically relevant difference*, δ. (To be discussed.)
8. The size of the experiment, *n* (in terms of multiples of the rational design).

In treatments of this subject, points 1-3 are frequently glossed over as already being known and given although, in my experience, any serious work on trial design involves the statistician in a lot of work investigating and discussing these issues. In consequence, in conventional discussions, attention is placed on points 4-8. Typically, it is assumed that 4-7 are given and 8, the size of the experiment, is calculated as a consequence. More rarely, 4, 5, 7 and 8 are given and 6, the power, is calculated from the other four. An obvious weakness of this system is that there is no formal mention of cost, whether in money, lost opportunities or patient time and suffering.

**An example**

A parallel group trial is planned in asthma with 3 months' follow-up. The agreed outcome measure is forced expiratory volume in one second (FEV1) at the end of the trial. The between-patient standard deviation is 450 ml and the clinically relevant difference is 200 ml. A type I error rate of 5% is chosen and the test will be two-sided. A power of 80% is targeted.

An *approximate* formula that may be used is

n = (2σ²/δ²) × (zα/2 + zβ)²   (1)

Here the second term on the right-hand side reflects what I call *decision precision*, with zα/2, zβ as the relevant percentage points of the standard Normal. If you lower the type I error rate or increase the power, decision precision will increase. The first term on the right-hand side is the variance for a rational design (consisting of one patient on each arm) expressed as a ratio to the square of a clinically relevant difference. It is a noise-to-signal ratio.

Substituting, we have

n = (2 × 450²/200²) × (1.96 + 0.84)² ≈ 10.125 × 7.84 ≈ 79.4

Thus we need an 80-fold replication of the rational design, which is to say, 80 patients on each arm.
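This calculation is easy to reproduce in code. Here is a minimal sketch in Python, using only the standard library (the function name is mine, not from the text):

```python
from statistics import NormalDist

def sample_size_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Patients per arm for a two-arm parallel group trial (two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 5%, two-sided
    z_beta = z.inv_cdf(power)           # 0.84 for 80% power
    # noise-to-signal ratio times "decision precision"
    return (2 * sigma**2 / delta**2) * (z_alpha + z_beta) ** 2

n = sample_size_per_arm(sigma=450, delta=200)
print(round(n, 1))  # 79.5 -- round up to 80 patients per arm
```

Rounding up to the next whole rational design gives the 80 patients per arm quoted above.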

**What is delta?**

I now list different points of view regarding this.

**1. It is the difference we would like to observe**

This point of view is occasionally propounded but it is incompatible with the formula used. To see this, consider a re-arrangement of equation (1) as

δ/√(2σ²/n) = zα/2 + zβ   (2)

The numerator on the left-hand side is the clinically relevant difference and the denominator is the standard error. Now if the observed difference, *d*, is the same as the clinically relevant difference, then we can replace δ by *d* in (2), but that would imply that the ratio of the observed value to its standard error would be (in our example) 2.8. This does not correspond to a P-value of 0.05, which our calculation was supposed to deliver us with 80% probability *if the clinically relevant difference obtained*, but to a P-value of about 0.005, or just over 1/10 of what our power calculation would accept as constituting proof of efficacy.

To put it another way, if δ is the value we would like to observe and if the treatment does, indeed, have a value of δ, then we have only half a chance, not an 80% chance, that the trial will deliver to us a value as big as this.
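Both numbers are easy to verify, as in this sketch using only the Python standard library (variable names are mine):

```python
from statistics import NormalDist

z = NormalDist()
z_alpha, z_beta = z.inv_cdf(0.975), z.inv_cdf(0.80)

# If the observed difference equals delta, the ratio of estimate to standard
# error is z_alpha + z_beta = 2.8, giving a two-sided P-value near 0.005:
p_two_sided = 2 * (1 - z.cdf(z_alpha + z_beta))
print(round(p_two_sided, 3))  # 0.005

# If the true effect is delta, the observed difference exceeds delta with
# probability 1/2, since its sampling distribution is centred on delta:
half_chance = 1 - z.cdf(0.0)
print(half_chance)  # 0.5
```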

**2. It is the difference we would like to 'prove' obtains**

This view is hopeless. It requires that the lower confidence limit should be greater than δ. If this is what is needed, the power calculation is completely irrelevant.

**3. It is the difference we believe obtains**

This is another wrong-headed notion. Since the smaller the value of δ, the larger the sample size, it would have the curious side effect that, given a number of drug-development candidates, we would spend most money on those we considered least promising. There are some semi-Bayesian versions of this in which a probability distribution for δ would be substituted for a single value. Most medical statisticians would reject this as a pointless elaboration of a point of view that is wrong in the first place. If you reject the notion that δ is your best guess as to what the treatment effect is, there is no need to elaborate this rejected position by giving δ a probability distribution.

Note, I am not rejecting the idea of Bayesian sample size calculations. A fully decision-analytic approach might be interesting. I am rejecting what is a Bayesian-frequentist chimera.

**4. It is the difference you would not like to miss**

This is the interpretation I favour. The idea is that we control two (conditional) errors in the process. The first is α, the probability of claiming that a treatment is effective when it is, in fact, no better than placebo. The second is the error of failing to develop a (very) interesting treatment further. If a trial in drug development is not 'successful', there is a chance that the whole development programme will be cancelled. It is the conditional probability of cancelling an interesting project that we seek to control.

Note that the FDA will usually require that two phase III trials are 'significant', and significance requires that the observed effect is at least equal to zα/2 standard errors, that is, (zα/2/(zα/2 + zβ))δ. In our example this would give us (1.96/2.8)δ = 0.7δ, or a little over two thirds of δ, for at least two trials for any drug that obtained registration. In practice, the observed average of the two would be somewhat in excess of 0.7δ. Of course, we would be naïve to believe that all drugs that get accepted have this effect (regression to the mean is ever-present) but nevertheless it provides *some* reassurance.
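The two-thirds figure follows directly from the percentage points already used in the power calculation, as this quick check shows (variable names are mine):

```python
from statistics import NormalDist

z = NormalDist()
z_alpha = z.inv_cdf(0.975)  # 1.96
z_beta = z.inv_cdf(0.80)    # 0.84

# Smallest significant observed effect, as a fraction of delta:
fraction = z_alpha / (z_alpha + z_beta)
print(round(fraction, 2))  # 0.7 -- a little over two thirds
```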

**Lessons**

In other words, if you are going to do a power calculation and you are going to target some sort of value like 80% power, you need to set δ at a value higher than the one you would be happy to find. Statisticians like me think of δ as *the difference we would not like to miss* and we call this *the clinically relevant difference*.

Does this mean that an effect that is 2/3 of the clinically relevant difference is worth having? Not necessarily. That depends on what *your* understanding of the phrase is. It should be noted, however, that when it is crucial to establish that *no important difference between treatments exists*, as in a non-inferiority study, then another sort of difference is commonly used. This is referred to as the *clinically irrelevant difference*. Such differences are quite commonly no more than 1/3 of the sort of difference a drug will have shown historically to placebo and hence *much* smaller than the difference you would not like to miss.

Another lesson, however, is this. In this area, as in others in the analysis of clinical data, dichotomisation is a bad habit. There are no hard and fast boundaries. Relevance is a matter of degree not kind.

## Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity

Is it taboo to use a test’s power to assess what may be learned from the data in front of us? (Is it limited to pre-data planning?) If not entirely taboo, some regard power as irrelevant post-data[i], and the reason I’ve heard is along the lines of an analogy Stephen Senn gave today (in a comment discussing his last post here)[ii].

My fire alarm analogy is here. My analogy presumes you are assessing the situation (about the fire) long distance.

While post-data power is scarcely taboo for a severe tester, severity always uses the actual outcome, with its level of statistical significance, whereas power is in terms of the fixed cut-off. Still, power provides (worst-case) pre-data guarantees. Now before you get any wrong ideas, I am not endorsing what some people call retrospective power, and I call "shpower", which goes against severity logic and is misconceived.

We are reading the Fisher-Pearson-Neyman "triad" tomorrow in Phil6334. Even here (i.e., Neyman 1956), Neyman alludes to a post-data use of power. But, strangely enough, I only noticed this after discovering more blatant discussions in what Spanos and I call "Neyman's hidden papers". Here's an excerpt from Neyman's Nursery (part 2) [NN-2].

_____________

One of the two surprising papers I came across the night our house was hit by lightning has the tantalizing title "The Problem of Inductive Inference" (Neyman 1955). It reveals a use of statistical tests strikingly different from the long-run behavior construal most associated with Neyman. Surprising too, Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

Neyman continues:

The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.

Neyman alludes to a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation, call it test T+.

H0: µ ≤ µ0 against H1: µ > µ0.

The test statistic d(X) is the standardized sample mean. The test rule: infer a (positive) discrepancy from µ0 iff d(x0) > cα, where cα corresponds to a difference statistically significant at the α level.

In Carnap's example the test could not reject the null hypothesis, i.e., d(x0) ≤ cα, but (to paraphrase Neyman) the problem is that the chance of detecting the presence of discrepancy δ from the null, with so few observations, is extremely slim, even if [δ is present].

We are back to our old friend: interpreting negative results!

“One may be confident in the absence of that discrepancy only if the power to detect it were high.”

The power of the test T+ to detect discrepancy δ:

(1) P(d(X) > cα; µ = µ0 + δ)

It is interesting to hear Neyman talk this way since it is at odds with the more behavioristic construal he usually championed. He sounds like a Cohen-style power analyst! Still, power is calculated relative to an outcome just missing the cutoff cα. This is, in effect, the worst case of a negative (non-significant) result, and if the actual outcome corresponds to a larger p-value, that should be taken into account in interpreting the results. It is more informative, therefore, to look at the probability of getting a worse fit (with the null hypothesis) than you did:

(2) P(d(X) > d(x0); µ = µ0 + δ)

In this example, this gives a measure of the severity (or degree of corroboration) for the inference µ < µ0 + δ. Although (1) may be low, (2) may be high. (For numbers, see Mayo and Spanos 2006.)
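To see how (1) and (2) come apart on a negative result, here is a worked sketch (the numbers are hypothetical, chosen in the spirit of, not taken from, Mayo and Spanos 2006): take n = 100, σ = 1, cα = 1.96, an observed d(x0) = 1.0, and discrepancy δ = 0.2.

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist()
n, sigma, delta = 100, 1.0, 0.2
c_alpha = z.inv_cdf(0.975)  # 1.96
d_x0 = 1.0                  # observed, non-significant test statistic

# Under mu = mu0 + delta, d(X) is Normal with mean sqrt(n)*delta/sigma = 2.0
shift = sqrt(n) * delta / sigma

power = 1 - z.cdf(c_alpha - shift)  # (1): P(d(X) > c_alpha; mu = mu0 + delta)
sev = 1 - z.cdf(d_x0 - shift)       # (2): P(d(X) > d(x0); mu = mu0 + delta)

print(round(power, 3), round(sev, 3))  # 0.516 0.841 -- (1) middling, (2) high
```

Here the power against δ = 0.2 is barely better than a coin toss, yet the probability of a worse fit than the one observed is high: exactly the contrast between (1) and (2).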

Spanos and I (Mayo and Spanos 2006) couldn't find a term in the literature defined precisely this way, the way I'd defined it in Mayo (1996) and before. We were thinking at first of calling it "attained power" but then came across what some have called "observed power", which is very different (and very strange). Those measures are just like ordinary power but calculated assuming the value of the mean equals the observed mean! (I call this "shpower".)

Anyway, we refer to it as the Severity Interpretation of “Acceptance” (SIA) in Mayo and Spanos 2006.

The claim in (2) could also be made out viewing the p-value as a random variable, calculating its distribution for various alternatives (Cox 2006, 25). This reasoning yields a core frequentist principle of evidence (FEV) (Mayo and Cox 2010, 256):

FEV: A moderate p-value is evidence of the absence of a discrepancy d from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller p-value) were a discrepancy d to exist.

It is important to see that it is only in the case of a negative result that severity for various inferences is in the same direction as power. In the case of significant results, d(x) in excess of the cutoff, the opposite concern arises, namely, that the test is too sensitive. So severity is always relative to the particular inference being entertained: speaking of the "severity of a test" simpliciter is an incomplete statement in this account. These assessments enable sidestepping classic fallacies of tests that are either too sensitive or not sensitive enough. …

By making a SEV assessment relevant to the inference under consideration, we obtain a measure where high (low) values always correspond to good (poor) evidential warrant. It didn’t have to be done this way (at first I didn’t), but I decided it was best, even though it means appropriately swapping out the claim H for which one wants to assess SEV.

[i] To repeat it again: some may be thinking of an animal I call “shpower”.

[ii] I realize comments are informal and unpolished, but isn’t that the beauty of blogging?

NOTE: To read the full post go to [NN-2]. There are 5 Neyman's Nursery posts (NN1-NN5). Search this blog for the others.

REFERENCES:

Cohen, J. (1988), Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Erlbaum.

Cohen, J. (1992), "A Power Primer."

Mayo, D. G. and Spanos, A. (2006), "Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction," British Journal for the Philosophy of Science, 57: 323-357.

Mayo, D. G. and Cox, D. R. (2006), "Frequentist Statistics as a Theory of Inductive Inference," in Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph Series, Vol. 49, Institute of Mathematical Statistics (IMS): 77-97.

Mayo, D. and Spanos, A. (eds.) (2010), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, CUP.

Mayo, D. G. and Spanos, A. (2011), "Error Statistics."

Neyman, J. (1955), "The Problem of Inductive Inference," Communications on Pure and Applied Mathematics, VIII: 13-46.