Stephen Senn

Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity

Is it taboo to use a test’s power to assess what may be learned from the data in front of us? (Is it limited to pre-data planning?) If not entirely taboo, some regard power as irrelevant post-data[i], and the reason I’ve heard is along the lines of an analogy Stephen Senn gave today (in a comment discussing his last post here)[ii].

Senn comment: So let me give you another analogy to your (very interesting) fire alarm analogy. (My analogy is imperfect, but so is the fire alarm.) If you want to cross the Atlantic from Glasgow you should do some serious calculations to decide what boat you need. However, if several days later you arrive at the Statue of Liberty, the fact that you see it is more important than the size of the boat for deciding that you did, indeed, cross the Atlantic.

My fire alarm analogy is here. My analogy presumes you are assessing the situation (about the fire) long distance.

Mayo comment (in reply): A crucial disanalogy arises: You see the statue and you see the observed difference in a test, but even when the stat sig alarm goes off, you are not able to see the discrepancy that generated the observed difference or the alarm you hear. You don’t know that you’ve arrived (at the cause). The statistical inference problem is precisely to make that leap from the perceived alarm to some aspect of the underlying process that resulted in the alarm being triggered. Then it is of considerable relevance to exploit info on the capability of your test procedure to result in alarms going off (perhaps of different loudness), due to varying values of an aspect of the underlying process: µ′, µ′′, µ′′′, etc.

Using the loudness of the alarm you actually heard, rather than the minimal stat sig bell, would be analogous to using the p-value rather than the pre-data cut-off for rejection. But the logic is just the same.

While post-data power is scarcely taboo for a severe tester, severity always uses the actual outcome, with its level of statistical significance, whereas power is in terms of the fixed cut-off. Still, power provides (worst-case) pre-data guarantees. Now before you get any wrong ideas, I am not endorsing what some people call retrospective power, and what I call “shpower”, which goes against severity logic and is misconceived.

We are reading the Fisher-Pearson-Neyman “triad” tomorrow in Phil6334. Even here (i.e., Neyman 1956), Neyman alludes to a post-data use of power. But, strangely enough, I only noticed this after discovering more blatant discussions in what Spanos and I call “Neyman’s hidden papers”. Here’s an excerpt from Neyman’s Nursery (part 2) [NN-2]:

_____________

One of the two surprising papers I came across the night our house was hit by lightning has the tantalizing title “The Problem of Inductive Inference” (Neyman 1955).  It reveals a use of statistical tests strikingly different from the long-run behavior construal most associated with Neyman.  Surprising too, Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap.  …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true].  The question is: does this result “confirm” the hypothesis that H0 is true of the particular data set? (Neyman 1955, pp. 40-41).

Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc.  If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H0, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present].  Therefore, the failure of the test to reject H0 cannot be reasonably considered as anything like a confirmation of H0.  The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)

The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.

Neyman alludes to a one-sided test of the mean of a Normal distribution with n iid samples and known standard deviation; call it test T+.

H0: µ ≤ µ0 against H1: µ > µ0.

The test statistic d(X) is the standardized sample mean.

The test rule: Infer a (positive) discrepancy from µ0 iff d(x0) > cα, where cα corresponds to a difference statistically significant at the α level.

In Carnap’s example the test could not reject the null hypothesis, i.e., d(x0) ≤ cα, but (to paraphrase Neyman) the problem is that the chance of detecting the presence of discrepancy δ from the null, with so few observations, is extremely slim, even if [δ is present].

We are back to our old friend: interpreting negative results!

“One may be confident in the absence of that discrepancy only if the power to detect it were high.”

The power of the test T+ to detect discrepancy δ:

(1)  P(d(X) > cα; µ = µ0 + δ)

It is interesting to hear Neyman talk this way since it is at odds with the more behavioristic construal he usually championed.  He sounds like a Cohen-style power analyst!  Still, power is calculated relative to an outcome just missing the cutoff  cα.  This is, in effect, the worst case of a negative (non significant) result, and if the actual outcome corresponds to a larger p-value, that should be taken into account in interpreting the results.  It is more informative, therefore, to look at the probability of getting a worse fit (with the null hypothesis) than you did:

(2)  P(d(X) > d(x0); µ = µ0 + δ)

In this example, this gives a measure of the severity (or degree of corroboration) for the inference µ < µ0 + δ.

Although (1) may be low, (2) may be high. (For numbers, see Mayo and Spanos 2006.)
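
To make the contrast concrete, here is a minimal sketch in Python of the two calculations for test T+, using scipy and illustrative numbers of my own choosing (they are not taken from Mayo and Spanos 2006):

```python
from scipy.stats import norm

# Hypothetical numbers for test T+ (H0: mu <= mu0 vs H1: mu > mu0, sigma known)
mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.05
se = sigma / n ** 0.5            # standard error of the sample mean
c_alpha = norm.ppf(1 - alpha)    # cut-off for the standardized statistic d(X)

def power(delta):
    """(1) P(d(X) > c_alpha; mu = mu0 + delta): uses the fixed cut-off."""
    return 1 - norm.cdf(c_alpha - delta / se)

def severity_neg(d_obs, delta):
    """(2) P(d(X) > d_obs; mu = mu0 + delta): uses the actual (non-significant)
    outcome, read as severity for the inference mu < mu0 + delta."""
    return 1 - norm.cdf(d_obs - delta / se)

delta = 0.2   # discrepancy of interest (hypothetical)
d_obs = 0.0   # observed standardized difference, well short of c_alpha
print(power(delta))                 # about 0.26 -- low
print(severity_neg(d_obs, delta))   # about 0.84 -- high for the same delta
```

The same δ that the test had little capability to detect can nonetheless be well ruled out by an outcome sitting right at the null, which is the point of using (2) rather than (1).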

Spanos and I (Mayo and Spanos 2006) couldn’t find a term in the literature defined precisely this way, the way I’d defined it in Mayo (1996) and before.  We were thinking at first of calling it “attained power” but then came across what some have called “observed power,” which is very different (and very strange).  Those measures are just like ordinary power but calculated assuming the value of the mean equals the observed mean!  (I call this “shpower”.)

Anyway, we refer to it as the Severity Interpretation of “Acceptance” (SIA) in Mayo and Spanos 2006.

The claim in (2) could also be made out viewing the p-value as a random variable, calculating its distribution for various alternatives (Cox 2006, 25).  This reasoning yields a core frequentist principle of evidence (FEV) (Mayo and Cox 2010, 256):

FEV: A moderate p-value is evidence of the absence of a discrepancy δ from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller p-value) were a discrepancy δ to exist.

It is important to see that it is only in the case of a negative result that severity for various inferences is in the same direction as power.  In the case of significant results, with d(x) in excess of the cutoff, the opposite concern arises: namely, that the test is too sensitive. So severity is always relative to the particular inference being entertained: speaking of the “severity of a test” simpliciter is an incomplete statement in this account.  These assessments enable sidestepping classic fallacies of tests that are either too sensitive or not sensitive enough.

By making a SEV assessment relevant to the inference under consideration, we obtain a measure where high (low) values always correspond to good (poor) evidential warrant. It didn’t have to be done this way (at first I didn’t do it this way), but I decided it was best, even though it means appropriately swapping out the claim H for which one wants to assess SEV.

[i] To repeat: some may be thinking of an animal I call “shpower”.

[ii] I realize comments are informal and unpolished, but isn’t that the beauty of blogging?

NOTE: To read the full post, go to [NN-2]. There are 5 Neyman’s Nursery posts (NN-1 through NN-5); search this blog for the others.

REFERENCES:

Cohen, J. (1992), “A Power Primer,” Psychological Bulletin, 112(1): 155-159.
Cohen, J. (1988), Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Erlbaum.

Mayo, D. G. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.

Mayo, D. G. and Cox, D. R. (2006), “Frequentist Statistics as a Theory of Inductive Inference,” in Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97. Reprinted in Mayo and Spanos (2010).

Mayo, D. G. and Spanos, A. (eds.) (2010), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, Cambridge: Cambridge University Press.

Mayo, D. G. and Spanos, A. (2011), “Error Statistics,” in Philosophy of Statistics (Handbook of the Philosophy of Science, Vol. 7), eds. P. S. Bandyopadhyay and M. R. Forster, Elsevier.

Neyman, J. (1955), “The Problem of Inductive Inference,” Communications on Pure and Applied Mathematics, VIII, 13-46.

Categories: exchange with commentators, Neyman's Nursery, P-values, Phil6334, power, Stephen Senn | 5 Comments

Stephen Senn: “Delta Force: To what extent is clinical relevance relevant?” (Guest Post)

Stephen Senn
Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS),
Luxembourg

Delta Force
To what extent is clinical relevance relevant?

Inspiration
This note has been inspired by a Twitter exchange with respected scientist and famous blogger David Colquhoun. He queried whether a treatment that had 2/3 of an effect that would be described as clinically relevant could be useful. I was surprised at the question, since I would regard it as being pretty obvious that it could but, on reflection, I realise that things that may seem obvious to some who have worked in drug development may not be obvious to others, and if they are not obvious to others they are either in need of a defence or wrong. I don’t think I am wrong, and this note is to explain my thinking on the subject.

Conventional power or sample size calculations
As it happens, I don’t particularly like conventional power calculations but I think they are, nonetheless, a good place to start. To carry out such a calculation a statistician needs the following ingredients:

  1. A definition of a rational design (the smallest design that is feasible but would retain the essential characteristics of the design chosen).
  2. An agreed outcome measure.
  3. A proposed analysis.
  4. A measure of variability for the rational design. (This might, for example, be the between-patient variance σ² for a parallel group design.)
  5. An agreed type I error rate, α.
  6. An agreed power, 1-β.
  7. A clinically relevant difference, δ. (To be discussed.)
  8. The size of the experiment, n, (in terms of multiples of the rational design).

In treatments of this subject, points 1-3 are frequently glossed over as already being known and given, although in my experience any serious work on trial design involves the statistician in a lot of work investigating and discussing these issues. In consequence, in conventional discussions, attention is placed on points 4-8. Typically, it is assumed that 4-7 are given and 8, the size of the experiment, is calculated as a consequence. More rarely, 4, 5, 7 and 8 are given and 6, the power, is calculated from the other four. An obvious weakness of this system is that there is no formal mention of cost, whether in money, lost opportunities or patient time and suffering.

An example
A parallel group trial is planned in asthma with 3 months follow up. The agreed outcome measure is forced expiratory volume in one second (FEV1) at the end of the trial. The between-patient standard deviation is 450ml and the clinically relevant difference is 200ml. A type I error rate of 5% is chosen and the test will be two-sided. A power of 80% is targeted.

An approximate formula that may be used is

(1)   n ≈ (2σ²/δ²) × (zα/2 + zβ)²

Here the second term on the right hand side reflects what I call decision precision, with zα/2, zβ  as the relevant percentage points of the standard Normal. If you lower the type I error rate or increase the power, decision precision will increase. The first term on the right hand side is the variance for a rational design (consisting of one patient on each arm) expressed as a ratio to the square of a clinically relevant difference. It is a noise-to-signal ratio.

Substituting we have

n ≈ (2 × 450²/200²) × (1.96 + 0.84)² ≈ 10.125 × 7.84 ≈ 79.4, i.e. roughly 80

Thus we need an 80-fold replication of the rational design, which is to say, 80 patients on each arm.
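
For readers who want to check the arithmetic, here is a minimal Python sketch (my own, using scipy; not part of the original post) reproducing the numbers above:

```python
import math
from scipy.stats import norm

# Asthma example: sigma = 450 ml, delta = 200 ml, two-sided alpha = 0.05, power = 0.80
sigma, delta, alpha, beta = 450.0, 200.0, 0.05, 0.20

z_a = norm.ppf(1 - alpha / 2)   # approximately 1.96
z_b = norm.ppf(1 - beta)        # approximately 0.84

# Noise-to-signal ratio for the rational design times "decision precision"
n_per_arm = (2 * sigma**2 / delta**2) * (z_a + z_b)**2
print(n_per_arm)                # approximately 79.5
print(math.ceil(n_per_arm))     # 80 patients per arm, as in the text
```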

What is delta?
I now list different points of view regarding this.

     1.     It is the difference we would like to observe
This point of view is occasionally propounded but it is incompatible with the formula used. To see this, consider a re-arrangement of equation (1) as:

(2)   δ / √(2σ²/n) = zα/2 + zβ

The numerator on the left hand side is the clinically relevant difference and the denominator is the standard error. Now if the observed difference, d, is the same as the clinically relevant difference, then we can replace δ by d in (2), but that would imply that the ratio of the observed difference to its standard error would be (in our example) 2.8. This does not correspond to a P-value of 0.05, which our calculation was supposed to deliver with 80% probability if the clinically relevant difference obtained, but to a P-value of 0.006, or just over 1/10 of what our power calculation would accept as constituting proof of efficacy.

To put it another way, if δ is the value we would like to observe and if the treatment does, indeed, have a value of δ, then we have only half a chance, not an 80% chance, that the trial will deliver to us a value as big as this.

     2.     It is the difference we would like to ‘prove’ obtains
This view is hopeless. It requires that the lower confidence limit should be greater than δ. If this is what is needed, the power calculation is completely irrelevant.

     3.     It is the difference we believe obtains
This is another wrong-headed notion. Since the smaller the value of δ the larger the sample size, it would have the curious side effect that, given a number of drug-development candidates, we would spend most money on those we considered least promising. There are some semi-Bayesian versions of this in which a probability distribution for δ would be substituted for a single value. Most medical statisticians would reject this as being a pointless elaboration of a point of view that is wrong in the first place. If you reject the notion that δ is your best guess as to what the treatment effect is, there is no need to elaborate this rejected position by giving δ a probability distribution.

Note, I am not rejecting the idea of Bayesian sample size calculations. A fully decision-analytic approach might be interesting.  I am rejecting what is a Bayesian-frequentist chimera.

     4.     It is the difference you would not like to miss
This is the interpretation I favour. The idea is that we control two (conditional) errors in the process. The first is α, the probability of claiming that a treatment is effective when it is, in fact, no better than placebo. The second is the error of failing to develop a (very) interesting treatment further. If a trial in drug development is not ‘successful’, there is a chance that the whole development programme will be cancelled. It is the conditional probability of cancelling an interesting project that we seek to control.

Note that the FDA will usually require that two phase III trials are ‘significant’, and significance requires that the observed effect is at least equal to [zα/2/(zα/2 + zβ)]δ. In our example this would give us (1.96/2.8)δ = 0.7δ, or a little over two thirds of δ, for at least two trials for any drug that obtained registration. In practice, the observed average of the two would be somewhat in excess of 0.7δ. Of course, we would be naïve to believe that all drugs that get accepted have this effect (regression to the mean is ever-present) but nevertheless it provides some reassurance.
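
As a quick check of the 0.7δ figure, here is a two-line calculation in Python with scipy (my own sketch, not part of the original post):

```python
from scipy.stats import norm

# Smallest observed effect reaching two-sided significance at alpha = 0.05,
# as a fraction of delta, when the trial was powered at 80% for delta
z_a = norm.ppf(0.975)        # approximately 1.96
z_b = norm.ppf(0.80)         # approximately 0.84
print(z_a / (z_a + z_b))     # approximately 0.70, a little over two thirds
```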

Lessons
In other words, if you are going to do a power calculation and you are going to target some sort of value like 80% power, you need to set δ at a value that is higher than one you would already be happy to find. Statisticians like me think of δ as the difference we would not like to miss, and we call this the clinically relevant difference.

Does this mean that an effect that is 2/3 of the clinically relevant difference is worth having? Not necessarily. That depends on what your understanding of the phrase is. It should be noted, however, that when it is crucial to establish that no important difference between treatments exists, as in a non-inferiority study, then another sort of difference is commonly used. This is referred to as the clinically irrelevant difference. Such differences are quite commonly no more than 1/3 of the sort of difference a drug will have shown historically to placebo and hence much smaller than the difference you would not like to miss.

Another lesson, however, is this. In this area, as in others in the analysis of clinical data, dichotomisation is a bad habit. There are no hard and fast boundaries. Relevance is a matter of degree not kind.

Categories: power, Statistics, Stephen Senn | 38 Comments

Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]

This headliner appeared last month, but to a sparse audience (likely because it was during winter break), so Management’s giving him another chance… 

You might not have thought there could be new material for 2014, but there is, and if you look a bit more closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike ….

It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler* in criticizing the use of p-values.

“Did you hear the one about significance testers rejecting H0 because of outcomes H0 didn’t predict?

‘What’s unusual about that?’ you ask?

What’s unusual, is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so to see why I’ll strive to give it a generous interpretation. Continue reading

Categories: Comedy, Fisher, Jeffreys, P-values, Stephen Senn | Leave a comment

STEPHEN SENN: Fisher’s alternative to the alternative

Reblogging 2 years ago:

By: Stephen Senn

This year [2012] marks the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976, p. 473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (p. 441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre, but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought, including on significance tests.

The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.

Fisher replies with a ‘shorter catechism’ on transformations which ends as follows: Continue reading

Categories: Fisher, Statistics, Stephen Senn | 31 Comments

Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]

You might not have thought there could be new material for 2014, but there is, and if you look a bit more closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike ….

It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler* in criticizing the use of p-values.

“Did you hear the one about significance testers rejecting H0 because of outcomes H0 didn’t predict?

‘What’s unusual about that?’ you ask?

Well, what’s unusual, is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so to see why I’ll strive to give it a generous interpretation.

We can view p-values in terms of rejecting H0, as in the joke: There’s a test statistic D such that H0 is rejected if its observed value d0 reaches or exceeds a cut-off d* where Pr(D > d*; H0) is small, say .025.
           Reject H0 if Pr(D > d0; H0) < .025.
The report might be “reject H0 at level .025″.
Example:  H0: The mean light deflection effect is 0. So if we observe a 1.96 standard deviation difference (in one-sided Normal testing) we’d reject H0.

Now it’s true that if the observation were further into the rejection region, say 2, 3 or 4 standard deviations, it too would result in rejecting the null, and with an even smaller p-value. It’s also true that H0 “has not predicted” a 2, 3, 4, 5 etc. standard deviation difference in the sense that differences so large are “far from” or improbable under the null. But wait a minute. What if we’ve only observed a 1 standard deviation difference (p-value = .16)? It is unfair to count it against the null that 1.96, 2, 3, 4 etc. standard deviation differences would have diverged seriously from the null, when we’ve only observed the 1 standard deviation difference. Yet the p-value tells you to compute Pr(D > 1; H0), which includes these more extreme outcomes! This is “a remarkable procedure” indeed! [i]
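Here is a minimal Python sketch (my own, using scipy; not part of the original post) of the tail-area computation behind the numbers just quoted:

```python
from scipy.stats import norm

# One-sided tail-area p-value Pr(D >= d0; H0) for a standard Normal test statistic
def p_value(d0):
    return 1 - norm.cdf(d0)

print(round(p_value(1.0), 2))    # 0.16 -- a 1 SD difference, short of the .025 cut-off
print(round(p_value(1.96), 3))   # 0.025 -- the 1.96 SD difference just reaches it
```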

So much for making out the howler. The only problem is that significance tests do not do this, that is, they do not reject with, say, D = 1 because larger D values might have occurred (but did not). D = 1 does not reach the cut-off, and does not lead to rejecting H0. Moreover, looking at the tail area makes it harder, not easier, to reject the null (although this isn’t the only function of the tail area): since it requires not merely that Pr(D = d0; H0) be small, but that Pr(D > d0; H0) be small. And this is well justified because when this probability is not small, you should not regard it as evidence of discrepancy from the null. Before getting to this …. Continue reading

Categories: Comedy, Fisher, Jeffreys, P-values, Statistics, Stephen Senn | 12 Comments

Stephen Senn: Dawid’s Selection Paradox (guest post)

Stephen Senn
Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS),
Luxembourg

“Dawid’s Selection Paradox”

You can protest, of course, that Dawid’s Selection Paradox is no such thing but then those who believe in the inexorable triumph of logic will deny that anything is a paradox. In a challenging paper published nearly 20 years ago (Dawid 1994), Philip Dawid drew attention to a ‘paradox’ of Bayesian inference. To describe it, I can do no better than to cite the abstract of the paper, which is available from Project Euclid, here: http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?

 When the inference to be made is selected after looking at the data, the classical statistical approach demands — as seems intuitively sensible — that allowance be made for the bias thus introduced. From a Bayesian viewpoint, however, no such adjustment is required, even when the Bayesian inference closely mimics the unadjusted classical one. In this paper we examine more closely this seeming inadequacy of the Bayesian approach. In particular, it is argued that conjugate priors for multivariate problems typically embody an unreasonable determinism property, at variance with the above intuition.

I consider this to be an important paper not only for Bayesians but also for frequentists, yet it has only been cited 14 times as of 15 November 2013 according to Google Scholar. In fact I wrote a paper about it in the American Statistician a few years back (Senn 2008) and have also referred to it in a previous blogpost (12 May 2012). That I think it is important and neglected is excuse enough to write about it again.

Philip Dawid is not responsible for my interpretation of his paradox, but the way that I understand it can be explained by considering what it means to have a prior distribution. First, as a reminder, if you are going to be 100% Bayesian, which is to say that all of what you will do by way of inference will be to turn a prior into a posterior distribution using the likelihood and the operation of Bayes theorem, then your prior distribution has to satisfy two conditions. First, it must be what you would use to bet now (that is to say, at the moment it is established) and second, no amount of subsequent data will change your prior qua prior. It will, of course, be updated by Bayes theorem to form a posterior distribution once further data are obtained, but that is another matter. The relevant time here is your observation time, not the time when the data were collected, so that data that were available in principle but only came to your attention after you established your prior distribution count as further data.

Now suppose that you are going to make an inference about a population mean, θ, using a random sample from the population and choose the standard conjugate prior distribution. Then in that case you will use a Normal distribution with known (to you) parameters μ and σ². If σ² is large compared to the random variation you might expect for the means in your sample, then the prior distribution is fairly uninformative, and if it is small then fairly informative, but being uninformative is not in itself a virtue. Being not informative enough runs the risk that your prior distribution is not one you might wish to use to bet now, and being too informative runs the risk that your prior distribution is one you might be tempted to change given further information. In either of these two cases your prior distribution will be wrong. Thus the task is to be neither too informative nor not informative enough. Continue reading
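
For readers who want to see the conjugate set-up in code, here is a minimal Python sketch of the Normal-Normal update (my own illustrative numbers; prior_var plays the role of Senn’s σ², and the known sampling variance of a single observation is a separate quantity):

```python
def normal_posterior(prior_mean, prior_var, ybar, sampling_var, n):
    """Posterior mean and variance for theta with a N(prior_mean, prior_var) prior
    and a sample mean ybar of n observations with known sampling variance."""
    prec_prior = 1.0 / prior_var
    prec_data = n / sampling_var
    post_var = 1.0 / (prec_prior + prec_data)
    post_mean = post_var * (prec_prior * prior_mean + prec_data * ybar)
    return post_mean, post_var

# A diffuse prior (large prior_var) versus an informative one (small prior_var)
print(normal_posterior(prior_mean=0.0, prior_var=100.0, ybar=1.2, sampling_var=4.0, n=20))
print(normal_posterior(prior_mean=0.0, prior_var=0.25, ybar=1.2, sampling_var=4.0, n=20))
```

The informative prior pulls the posterior mean towards prior_mean; whether that is a feature or a bug is exactly the question of whether the prior is one you would still be prepared to bet on.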

Categories: Bayesian/frequentist, selection effects, Statistics, Stephen Senn | 67 Comments

Highly probable vs highly probed: Bayesian/ error statistical differences

A reader asks: “Can you tell me about disagreements on numbers between a severity assessment within error statistics, and a Bayesian assessment of posterior probabilities?” Sure.

There are differences between Bayesian posterior probabilities and formal error statistical measures, as well as between the latter and a severity (SEV) assessment, which differs from the standard type 1 and 2 error probabilities, p-values, and confidence levels—despite the numerical relationships. Here are some random thoughts that will hopefully be relevant for both types of differences. (Please search this blog for specifics.)

1. The most noteworthy difference is that error statistical inference makes use of outcomes other than the one observed, even after the data are available: there’s no other way to ask things like, how often would you find 1 nominally statistically significant difference in a hunting expedition over k or more factors?  Or to distinguish optional stopping with sequential trials from fixed sample size experiments.  Here’s a quote I came across just yesterday:

“[S]topping ‘when the data looks good’ can be a serious error when combined with frequentist measures of evidence. For instance, if one used the stopping rule [above]…but analyzed the data as if a fixed sample had been taken, one could guarantee arbitrarily strong frequentist ‘significance’ against H0.” (Berger and Wolpert, 1988, 77).

The worry about being guaranteed to erroneously exclude the true parameter value here is an error statistical affliction that the Bayesian is spared (even though I don’t think they can be too happy about it, especially when HPD intervals are assured of excluding the true parameter value). See this post for an amusing note; Mayo and Kruse (2001) below; and, if interested, search the (strong) likelihood principle, and Birnbaum.
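
The Berger and Wolpert point is easy to see by simulation; here is a rough Python sketch (my own, with arbitrary settings) of “stopping when the data look good” under a true null and then analysing the result as if the sample size had been fixed:

```python
import numpy as np

rng = np.random.default_rng(1)

def stops_significant(max_n=1000):
    """Add N(0,1) observations (null true) and stop as soon as the nominal
    two-sided z-test gives p < .05; report whether that ever happened."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.standard_normal()
        if abs(total / np.sqrt(n)) > 1.96:
            return True
    return False

trials = 2000
rate = sum(stops_significant() for _ in range(trials)) / trials
print(rate)   # far above the nominal .05, and it keeps growing as max_n increases
```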

2. Highly probable vs. highly probed. SEV doesn’t obey the probability calculus: for any test T and outcome x, the severity for both H and ~H might be horribly low. Moreover, an error statistical analysis is not in the business of probabilifying hypotheses but evaluating and controlling the capabilities of methods to discern inferential flaws (problems with linking statistical and scientific claims, problems of interpreting statistical tests and estimates, and problems of underlying model assumptions). This is the basis for applying what may be called the Severity principle. Continue reading

Categories: Bayesian/frequentist, Error Statistics, P-values, Philosophy of Statistics, Statistics, Stephen Senn, strong likelihood principle | 40 Comments

Stephen Senn: Open Season (guest post)

Stephen Senn
Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS),
Luxembourg

“Open Season”

The recent joint statement(1) by the Pharmaceutical Research and Manufacturers of America (PhRMA) and the European Federation of Pharmaceutical Industries and Associations (EFPIA) represents a further step in what has been a slow journey towards what (one assumes) will be the eventually achieved goal of sharing clinical trial data. In my inaugural lecture of 1997 at University College London I called for all pharmaceutical companies to develop a policy for sharing trial results and I have repeated this in many places since(2-5). Thus I can hardly complain if what I have been calling for for over 15 years is now close to being achieved.

However, I have recently been thinking about it again and it seems to me that there are some problems that need to be addressed. One is the issue of patient confidentiality. Ideally, covariate information should be exploitable, as it often increases the precision of inferences and also the utility of decisions based upon them, since it (potentially) increases the possibility of personalising medical interventions. However, providing patient-level data increases the risk of breaching confidentiality. This is a complicated and difficult issue about which, however, I have nothing useful to say. Instead I want to consider another matter. What will be the influence on the quality of the inferences we make of enabling many subsequent researchers to analyse the same data?

One of the reasons that many researchers have called for all trials to be published is that trials that are missing tend to be different from those that are present. Thus there is a bias in summarising evidence from published trials only, and it can be a difficult task, with no guarantee of success, to identify those that have not been published. This is a wider reflection of the problem of missing data within trials. Such data have long worried trialists and the Food and Drug Administration (FDA) itself has commissioned a report on the subject from leading experts(6). On the European side the Committee for Medicinal Products for Human Use (CHMP) has a guideline dealing with it(7).

However, the problem is really a particular example of data filtering and it also applies to statistical analysis. If the analyses that are present have been selected from a wider set, then there is a danger that they do not provide an honest reflection of the message that is in the data. This problem is known as that of multiplicity and there is a huge literature dealing with it, including regulatory guidance documents(8, 9).

Within drug regulation this is dealt with by having pre-specified analyses. The broad outlines of these are usually established in the trial protocol and the approach is then specified in some detail in the statistical analysis plan, which is required to be finalised before un-blinding of the data. The strategies used to control for multiplicity will involve some combination of defining a significance testing route (an order in which tests must be performed and associated decision rules) and reduction of the required level of significance to detect an event (the simplest such reduction is sketched below).
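
By way of illustration only (this is not part of Senn’s post, and real regulatory strategies are usually more elaborate, e.g. hierarchical or gatekeeping procedures), the crudest reduction of the significance level is a Bonferroni adjustment across m pre-specified endpoints:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Declare an endpoint 'significant' only if its p-value clears alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Three pre-specified endpoints: only the first clears 0.05 / 3 (about 0.0167)
print(bonferroni_reject([0.012, 0.030, 0.200]))   # [True, False, False]
```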

I am not a great fan of these manoeuvres, which can be extremely complex. One of my objections is that it is effectively assumed that the researchers who chose them are mandated to circumscribe the inferences that scientific posterity can make(10). I take the rather more liberal view that provided that everything that is tested is reported one can test as much as one likes. The problem comes if there is selective use of results and in particular selective reporting. Nevertheless, I would be the first to concede the value of pre-specification in clarifying the thinking of those about to embark on conducting a clinical trial and also in providing a ‘template of trust’ for the regulator when provided with analyses by the sponsor.

However, what should be our attitude to secondary analyses? From one point of view these should be welcome. There is always value in looking at data from different perspectives and indeed this can be one way of strengthening inferences in the way suggested nearly 50 years ago by Platt(11). There are two problems, however. First, not all perspectives are equally valuable. Some analyses in the future, no doubt, will be carried out by those with little expertise and in some cases, perhaps, by those with a particular viewpoint to justify. There is also the danger that some will carry out multiple analyses (of which, when one considers the possibility of changing endpoints, performing transformations, choosing covariates and modelling frameworks, there are usually a great number) but then only present those that are ‘interesting’. It is precisely to avoid this danger that the ritual of pre-specified analysis is insisted upon by regulators. Must we also insist upon it for those seeking to reanalyse?

To do so would require such persons to do two things. First, they would have to register the analysis plan before being granted access to the data. Second, they would have to promise to make the analysis results available, otherwise we will have a problem of missing analyses to go with the problem of missing trials. I think that it is true to say that we are just beginning to feel our way with this. It may be that the chance has been lost and that the whole of clinical research will be ‘world wide webbed’: there will be a mass of information out there but we just don’t know what to believe. Whatever happens the era of privileged statistical analyses by the original data collectors is disappearing fast.

[Ed. note: Links to some earlier related posts by Prof. Senn are:  "Casting Stones" 3/7/13, "Also Smith & Jones" 2/23/13, and "Fooling the Patient: An Unethical Use of Placebo?" 8/2/12 .]

References

1. PhRMA, EFPIA. Principles for Responsible Clinical Trial Data Sharing. PhRMA; 2013 [cited 2013 31 August]; Available from: http://phrma.org/sites/default/files/pdf/PhRMAPrinciplesForResponsibleClinicalTrialDataSharing.pdf.

2. Senn SJ. Statistical quality in analysing clinical trials. Good Clinical Practice Journal. [Research Paper]. 2000;7(6):22-6.

3. Senn SJ. Authorship of drug industry trials. Pharm Stat. [Editorial]. 2002;1:5-7.

4. Senn SJ. Sharp tongues and bitter pills. Significance. [Review]. 2006 September 2006;3(3):123-5.

5. Senn SJ. Pharmaphobia: fear and loathing of pharmaceutical research. [pdf] 1997 [updated 31 August 2013; cited 31 August 2013]. Updated version of paper originally published on PharmInfoNet.

6. Little RJ, D’Agostino R, Cohen ML, Dickersin K, Emerson SS, Farrar JT, et al. The prevention and treatment of missing data in clinical trials. N Engl J Med. 2012 Oct 4;367(14):1355-60.

7. Committee for Medicinal Products for Human Use (CHMP). Guideline on Missing Data in Confirmatory Clinical Trials London: European Medicine Agency; 2010. p. 1-12.

8. Committee for Proprietary Medicinal Products. Points to consider on multiplicity issues in clinical trials. London: European Medicines Evaluation Agency; 2002.

9. International Conference on Harmonisation. Statistical principles for clinical trials (ICH E9). Statistics in Medicine. 1999;18:1905-42.

10. Senn S, Bretz F. Power and sample size when multiple endpoints are considered. Pharm Stat. 2007 Jul-Sep;6(3):161-70.

11. Platt JR. Strong Inference: Certain systematic methods of scientific thinking may produce much more rapid progress than others. Science. 1964 Oct 16;146(3642):347-53.

Categories: evidence-based policy, science communication, Statistics, Stephen Senn | 6 Comments

Stephen Senn: Indefinite irrelevance

Stephen Senn
Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS),
Luxembourg

At a workshop on randomisation I attended recently I was depressed to hear what I regard as hackneyed untruths treated as if they were important objections. One of these is that of indefinitely many confounders. The argument goes that although randomisation may make it probable that some confounders are reasonably balanced between the arms, since there are indefinitely many of these, the chance that at least some are badly confounded is so great as to make the procedure useless.

This argument is wrong for several related reasons. The first is to do with the fact that the total effect of these indefinitely many confounders is bounded. This means that the argument put forward is analogously false to one in which it were claimed that the infinite series ½, ¼, ⅛, … did not sum to a limit because there were infinitely many terms. The fact is that the outcome value one wishes to analyse poses a limit on the possible influence of the covariates. Suppose that we were able to measure a number of covariates on a set of patients prior to randomisation (in fact this is usually not possible but that does not matter here). Now construct principal components, C1, C2, … based on these covariates. We suppose that each of these predicts to a greater or lesser extent the outcome, Y (say). In a linear model we could put coefficients on these components, k1, k2, … (say). However, one is not free to postulate anything at all by way of values for these coefficients, since it has to be the case for any set of m such coefficients that k1²V(C1) + k2²V(C2) + … + km²V(Cm) ≤ V(Y), where V( ) indicates the variance. Thus variation in outcome bounds variation in prediction. This total variation in outcome has to be shared between the predictors, and the more predictors you postulate, the less on average the influence per predictor.
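
A quick numerical check of that bound (my own construction, not from Senn’s post): with orthogonal components, the explained variation Σ kᵢ²V(Cᵢ) can never exceed V(Y), however many components are postulated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Many covariates, and an outcome depending weakly on them plus noise
n, m = 500, 50
X = rng.standard_normal((n, m))
Y = X @ rng.normal(scale=0.2, size=m) + rng.standard_normal(n)

# Principal components (orthogonal) of the centred covariates, and the
# least-squares coefficients of Y on them
C = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)[0]
k = np.linalg.lstsq(C, Y - Y.mean(), rcond=None)[0]

explained = np.sum(k**2 * C.var(axis=0))
print(explained <= Y.var())   # True: the bound holds whatever m is
print(explained, Y.var())
```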

The second error is to ignore the fact that statistical inference does not proceed on the basis of signal alone but also on noise. It is the ratio of these that is important. If there are indefinitely many predictors then there is no reason to suppose that their influence on the variation between treatment groups will be bigger than their variation within groups and both of these are used to make the inference. Continue reading

Categories: RCTs, Statistics, Stephen Senn | 15 Comments

Gelman sides w/ Neyman over Fisher in relation to a famous blow-up

Andrew Gelman had said he would go back to explain why he sided with Neyman over Fisher in relation to a big, famous argument discussed on my Feb. 16, 2013 post: “Fisher and Neyman after anger management?”, and I just received an e-mail from Andrew saying that he has done so: “In which I side with Neyman over Fisher”. (I’m not sure what Senn’s reply might be.) Here it is:

“In which I side with Neyman over Fisher” Posted by Andrew Gelman on 24 May 2013, 9:28 am

As a data analyst and a scientist, Fisher > Neyman, no question. But as a theorist, Fisher came up with ideas that worked just fine in his applications but can fall apart when people try to apply them too generally.

Here’s an example that recently came up.

Deborah Mayo pointed me to a comment by Stephen Senn on the so-called Fisher and Neyman null hypotheses. In an experiment with n participants (or, as we used to say, subjects or experimental units), the Fisher null hypothesis is that the treatment effect is exactly 0 for every one of the n units, while the Neyman null hypothesis is that the individual treatment effects can be negative or positive but have an average of zero.

Senn explains why Neyman’s hypothesis in general makes no sense—the short story is that Fisher’s hypothesis seems relevant in some problems (sometimes we really are studying effects that are zero or close enough for all practical purposes), whereas Neyman’s hypothesis just seems weird (it’s implausible that a bunch of nonzero effects would exactly cancel). And I remember a similar discussion as a student, many years ago, when Rubin talked about that silly Neyman null hypothesis. Continue reading

Categories: Fisher, Statistics, Stephen Senn | 10 Comments

Guest post: Bad Pharma? (S. Senn)

Professor Stephen Senn*
Full Paper: Bad JAMA?
Short version–Opinion Article: Misunderstanding publication bias
Video below

Data filters

The student undertaking a course in statistical inference may be left with the impression that what is important is the fundamental business of the statistical framework employed: should one be Bayesian or frequentist, for example? Where does one stand as regards the likelihood principle and so forth? Or it may be that these philosophical issues are not covered but that a great deal of time is spent on the technical details, for example, depending on framework, various properties of estimators, how to apply the method of maximum likelihood, or how to implement Markov chain Monte Carlo methods and check for chain convergence. However, much of this work will take place in a (mainly) theoretical kingdom one might name simple-random-sample-dom. Continue reading

Categories: Statistics, Stephen Senn | 12 Comments
