S. Senn

S. Senn: “Automatic for the people? Not quite” (Guest post)

Stephen Senn
Head of  Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Automatic for the people? Not quite

What caught my eye was the estimable (in its non-statistical meaning) Richard Lehman tweeting about the equally estimable John Ioannidis. For those who don’t know them, the former is a veteran blogger who keeps a very cool and shrewd eye on the latest medical ‘breakthroughs’ and the latter a serial iconoclast of idols of scientific method. This is what Lehman wrote:

Ioannidis hits 8 on the Richter scale: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173184 … Bayes factors consistently quantify strength of evidence, p is valueless.

Since Ioannidis works at Stanford, which is located in the San Francisco Bay Area, he has every right to be interested in earthquakes but on looking up the paper in question, a faint tremor is the best that I can afford it. I shall now try and explain why, but before I do, it is only fair that I acknowledge the very generous, prompt and extensive help I have been given to understand the paper[1] in question by its two authors Don van Ravenzwaaij and Ioannidis himself.

What van Ravenzwaaij and Ioannidis (R&I) have done is investigate the FDA’s famous two trials rule as a requirement for drug registration. To do this R&I simulated two-armed parallel group clinical trials according to the following combinations of scenarios (p4).

Thus, to sum up, our simulations varied along the following dimensions:
1. Effect size: small (0.2 SD), medium (0.5 SD), and zero (0 SD)
2. Number of total trials: 2, 3, 4, 5, and 20
3. Number of participants: 20, 50, 100, 500, and 1,000

The first setting defines the treatment effect in terms of the common within-group standard deviation, the second the total number of trials submitted to the FDA (with exactly two of them significant) and the third the number of patients per group.

They thus had 3 x 5 x 5 = 75  simulation settings in total. In each case the simulations were run until 500 cases arose for which two trials were significant. For each of these cases they calculated a one-sided Bayes factor and then proceeded to judge the FDA’s rule based on P-values according to the value the Bayes factor indicated.
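
For readers who want to see the shape of the exercise, here is a minimal Python sketch of the kind of simulation R&I describe, as I read their summary above. The per-group interpretation of the sample size, the reading of ‘two trials significant’ as exactly two at the two-sided 5% level, and the absence of a direction check are my assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate_submission(effect, n_per_group, n_trials):
    """Simulate one 'submission': n_trials two-armed parallel-group trials.
    Returns the two-sided p-value of each trial."""
    p_values = []
    for _ in range(n_trials):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(effect, 1.0, n_per_group)
        p_values.append(stats.ttest_ind(treated, control).pvalue)
    return np.array(p_values)

def collect_two_significant(effect, n_per_group, n_trials, target=500):
    """Keep simulating submissions until `target` of them have exactly two
    trials significant at the two-sided 5% level."""
    kept = []
    while len(kept) < target:
        p = simulate_submission(effect, n_per_group, n_trials)
        if np.sum(p < 0.05) == 2:
            kept.append(p)
    return kept

# One of the 75 settings: medium effect (0.5 SD), 100 patients per group, 3 trials.
cases = collect_two_significant(effect=0.5, n_per_group=100, n_trials=3)
print(len(cases), "submissions with exactly two significant trials")
```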

In my opinion this is a hopeless mishmash of two systems: the first (frequentist) conditional on the hypotheses and the second (Bayesian) conditional on the data. They cannot be mixed to any useful purpose in the way attempted and the result is not only irrelevant frequentist statistics but also irrelevant Bayesian statistics.

Before proceeding to discuss the inferential problems, however, I am going to level a further charge of irrelevance as regards the simulations. It is true that the ‘two trials rule’ is rather vague in that it is not clear how many trials one is allowed to run to get two significant ones. In my opinion it is reasonable to consider that the FDA might accept two out of three but it is frankly incredible that they would accept two out of twenty unless there were further supporting evidence. For example, if two large trials were significant and 18 smaller ones were not individually significant but were significant as a set in a meta-analysis, one could imagine the programme passing. Even this scenario, however, is most unlikely and I would be interested to know of any case of any sort in which the FDA has accepted a ‘two out of twenty’ registration.

Now let us turn to the mishmash. Let us look, first of all, at the set-up in frequentist terms. The simplest common case to take is the ‘two out of two’ significant scenario. Sponsors going into phase III will typically perform calculations to target at least 80% power for the programme as a whole. Thus 90% power for individual trials is a common standard, since the product of the powers is just over 80%. For the two effect sizes of 0.2 and 0.5 that R&I consider this would, according to nQuery®, yield 527 and 86 patients per arm respectively. The overall power of the programme would be 81% and the joint two-sided type I error rate would be 2 x (1/40)² = 1/800, reflecting the fact that each of two two-sided tests would have to be significant at the 5% level but in the right direction.
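
The planning arithmetic can be reproduced with a normal-approximation sketch in Python; the nQuery figures of 527 and 86 quoted above presumably allow for the t distribution and so come out a patient or two higher than the values printed below, but the programme-level calculations are the same.

```python
from scipy import stats

alpha = 0.05                               # two-sided significance level per trial
power = 0.90                               # power of each individual trial
z_alpha = stats.norm.ppf(1 - alpha / 2)    # 1.96
z_beta = stats.norm.ppf(power)             # 1.28

for delta in (0.2, 0.5):                   # effect sizes in SD units
    n_per_arm = 2 * (z_alpha + z_beta) ** 2 / delta ** 2
    print(f"effect {delta}: roughly {n_per_arm:.0f} patients per arm")

# Programme-level operating characteristics of the 'two out of two' rule
print("joint power       :", round(power ** 2, 2))   # 0.81
print("joint type I error:", 2 * (alpha / 2) ** 2)   # 2 x (1/40)^2 = 1/800
```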

Now, of course, these are planned characteristics in advance of running a trial. In practice you will get a result and then, in the spirit of what R&I are attempting, it would be of interest to consider the least impressive result that would just give you registration. This, of course, is P=0.05 for each of the two trials. At this point, by the way, I note that a standard frequentist objection can be entered against the two-trials rule. If the designs of two trials are identical, then, given that they are of the same size, the sufficient statistic is simply the average of the two results. If conducted simultaneously there would be no reason not to use this. This leads to a critical region for a more powerful test, based on the average result from the two trials and providing a 1/1600 type I error rate (one-sided), illustrated in the figure below as the region to the right of and above the blue diagonal line. The corresponding region for the two-trials rule is to the right of the vertical red line and above the horizontal one. The just ‘significant’ value for the two-trials rule has a standardised z-score of 1.96 x √2 = 2.77, whereas the rule based on the average from the two trials would have a z-score of 3.02. In other words, evidentially, the value according to the two-trials rule is less impressive[2].
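
A few lines of Python make the evidential comparison concrete, under the assumption spelled out in the text that both trials sit exactly on the two-sided 5% boundary and in the same direction.

```python
from math import sqrt
from scipy import stats

z_crit = stats.norm.ppf(0.975)                        # 1.96: two-sided 5% per trial

# Two-trials rule: both trials significant at the 5% level, in the same direction.
type_I_one_sided = (1 - stats.norm.cdf(z_crit)) ** 2  # (1/40)^2 = 1/1600
print("two-trials rule, one-sided type I error:", type_I_one_sided)

# Least impressive passing result: both trials sitting exactly at z = 1.96.
z_average = z_crit * sqrt(2)                          # 2.77
tail = 1 - stats.norm.cdf(z_average)                  # about 0.0028, i.e. well short
print("averaged z-score:", round(z_average, 2))       # of the 1/1600 that the rule
print("one-sided tail  :", round(tail, 4))            # nominally controls
```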

Now, the Bayesian will argue that the frequentist is controlling the behaviour of the procedure if one of two possible realities out of a whole range applies, but has given no prior thought to their likely occurrence or, for that matter, to the occurrence of other values. If, for example, moderate effect sizes are very unlikely, but it is quite plausible that the treatment has no effect at all, and the trials are very large, then even though their satisfying the two-trials rule would be a priori unlikely, if it were only minimally satisfied it might actually imply that the null hypothesis was likely true.

A possible way for the Bayesian to assess the evidential value is to assume, just for argument’s sake, that the null hypothesis and the set of possible alternative hypotheses are equally likely a priori (the prior odds are one) and then calculate the posterior probability, and hence odds, given the observed data. The ratio of the posterior odds to the prior odds is known as the Bayes factor[3]. Citing a paper[4] by Rouder et al. describing this approach, R&I then use the BayesFactor package created by Morey and Rouder to calculate the Bayes factor corresponding to every case of two significant trials they generate.

Actually it is not the Bayes factor but a Bayes factor. As Morey and Rouder make admirably clear in a subsequent paper[5], what the Bayes factor turns out to be depends very much on how the probability is smeared over the range of the alternative hypothesis. This can perhaps be understood by looking at the ratios of likelihoods (relative to the value under the null) when P=0.05 for each of the two trials as a function of the true (unknown) effect size for the sample sizes of 527 and 86 that would give 90% power for the values of the effect sizes (0.2 and 0.5) that R&I consider. The logs of these (chosen to make plotting easier) are given in the figure below. The blue curve corresponds to the smaller effect size used in planning (0.2) and hence the larger sample size (527) and the red curve corresponds to the larger effect size (0.5) and hence the smaller sample size (86). Given the large number of degrees of freedom available, the Normal distribution likelihoods have been used. The values of the statistic that would be just significant at the 5% level (0.1207 and 0.2989) for the two cases are given by the vertical dashed lines and, since these are the values that we assume observed in the two cases, each curve reaches its respective maximum at the relevant value.
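
The two curves can be reproduced from the quantities given in the text; here is a minimal Python sketch (within-group SD taken as 1 and treated as known, so the Normal likelihood is used, matching the large-degrees-of-freedom argument above).

```python
import numpy as np
from scipy import stats

Z_CRIT = stats.norm.ppf(0.975)             # 1.96

def log_lr(delta, n_per_arm):
    """Log likelihood ratio (effect = delta versus effect = 0) for a two-arm trial
    whose observed mean difference sits exactly on the two-sided 5% boundary."""
    se = np.sqrt(2.0 / n_per_arm)          # standard error of the difference in means
    d_obs = Z_CRIT * se                    # just-significant observed difference
    return (stats.norm.logpdf(d_obs, loc=delta, scale=se)
            - stats.norm.logpdf(d_obs, loc=0.0, scale=se))

for n, label in [(527, "blue curve (planned effect 0.2)"),
                 (86,  "red curve (planned effect 0.5)")]:
    d_obs = Z_CRIT * np.sqrt(2.0 / n)
    print(f"{label}: just-significant difference {d_obs:.4f}, "
          f"log-LR maximised there at {log_lr(d_obs, n):.2f}, "
          f"support falls back to zero at about {2 * d_obs:.4f}")
```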

Wherever a value on the curve is positive, the ratio of likelihoods is greater than one and the posited value of the effect size is supported against the null. Wherever it is negative, the ratio is less than one and the null is supported. Thus, whether the posited values of the treatment effect that make up the alternative are supported as a set or not depends on how you smear the prior probability. The Bayes factor is the ratio of the prior-weighted integrals of the likelihood under the two hypotheses. In this case the likelihood under the null is a constant, so the conditional prior under the alternative is crucial. There is no automatic solution and careful choice is necessary. So what are you supposed to do? Well, as a Bayesian you are supposed to choose a prior distribution that reflects what you believe. At this point, I want to make it quite clear that if you think you can do it you should do so and I don’t want to argue against that. However, this is really hard and it has serious consequences[6]. Suppose that the sample size of 527 has been used, corresponding to the blue curve. Then any value of the effect size greater than 0 and less than about 2 x 0.1207 = 0.2414 has more support than the null hypothesis itself, but any value greater than 0.2414 has less support than the null. How this pans out in your Bayes factor now depends on your prior distribution. If your prior maintains that all possible values of the effect size when the alternative hypothesis is true must be modest (say never greater than 0.2414), then they are all supported and so is the set. On the other hand, if you think that unless the null hypothesis is true, only values greater than 0.2414 are possible, then all such values are unsupported and so is the set. In general, the way the conditional prior smears the probability is crucial.
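
To see numerically how much the smearing matters, here is a hedged sketch that integrates the same Normal likelihood against two illustrative conditional priors of my own choosing (they are not the Cauchy prior R&I used): the same just-significant result gives a Bayes factor above one under the first prior and well below one under the second.

```python
import numpy as np
from scipy import stats, integrate

n_per_arm = 527
se = np.sqrt(2.0 / n_per_arm)          # SE of the difference in means
d_obs = 1.959964 * se                  # just-significant observed difference (~0.1207)

def bayes_factor(prior_pdf, lo, hi):
    """BF for 'effect drawn from prior_pdf' against the point null of no effect."""
    marginal_alt, _ = integrate.quad(
        lambda delta: stats.norm.pdf(d_obs, loc=delta, scale=se) * prior_pdf(delta),
        lo, hi)
    marginal_null = stats.norm.pdf(d_obs, loc=0.0, scale=se)
    return marginal_alt / marginal_null

# Prior 1: all mass on modest effects (uniform on 0 to 0.2414) -- every value supported.
bf_modest = bayes_factor(lambda d: stats.uniform.pdf(d, loc=0.0, scale=0.2414),
                         0.0, 0.2414)
# Prior 2: all mass on larger effects (uniform on 0.2414 to 1) -- every value unsupported.
bf_large = bayes_factor(lambda d: stats.uniform.pdf(d, loc=0.2414, scale=0.7586),
                        0.2414, 1.0)

print("BF, 'modest effects only' prior:", round(bf_modest, 2))   # comes out above 1
print("BF, 'large effects only' prior :", round(bf_large, 3))    # comes out well below 1
```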

Be that as it may, I doubt that choosing ‘a Cauchy distribution with a width of r = √2/2’, as R&I did, would stand any serious scrutiny. Bear in mind that these are molecules that have passed a series of in vitro and in vivo pre-clinical screens as well as phase I, IIa and IIb before being put to the test in phase III. However, if R&I were serious about this, they would consider how well the distribution works as a prediction of what actually happens in phase III and examine some data.

Instead, they assume (as far as I can tell) that the Bayes factor they calculate in this way is some sort of automatic gold standard by which any other inferential statistic can and should be judged, whether or not the distribution on which the Bayes factor is based is reasonable. This is reflected in Richard Lehman’s tweet ‘Bayes factors consistently quantify strength of evidence’, which, in fact, needs to be rephrased as ‘Bayes factors coherently quantify strength of evidence for You if You have chosen coherent prior distributions to construct them.’ It’s a big if.

R&I then make a second mistake of simultaneously conditioning on a result and a hypothesis. Suppose their claim is correct that in each of the cases of two significant trials that they generate the FDA would register the drug without further consideration. Then, for the first two of the three cases ‘Effect size: small (0.2 SD), medium (0.5 SD), and zero (0 SD)’ the FDA has got it right and for the third it has got it wrong. By the same token, wherever any decision based on the Bayes factor would disagree with the FDA it would be wrong in the first two cases and right in the third. However, this is completely useless information. It can’t help us decide between the two approaches. If we want to use true posited values of the effect size, we have to consider all possible outcomes of the two-trials rule, not just the ones that indicate ‘register’. For the cases that indicate ‘register’, it is a foregone conclusion that we will have 100% success (in terms of decision-making) in the first two cases and 100% failure in the third. What we need to consider also is the situation where it is not the case that two trials are significant.

If, on the other hand, R&I wish to look at this in Bayesian terms, then they have also picked this up the wrong way. If they are committed to their adopted prior distribution, then once they have calculated the Bayes factor there is no more to be said, and if they simulate from the prior distribution they have adopted, then their decision making will, as judged by the simulation, turn out to be truly excellent. If they are not committed to the prior distribution, then they are faced with the sore puzzle that is Bayesian robustness. How far can the prior distribution from which one simulates be from the prior distribution one assumes for inference in order for the simulation to be a) a severe test but b) not totally irrelevant?

In short the R&I paper, in contradistinction to Richard Lehman’s claim, tells us nothing about the reasonableness of the FDA’s rule. That would require an analysis of data. Automatic for the people? Not quite. To be Bayesian, ‘to thine own self be true’. However, as I have put it previously, this is very hard and ‘You may believe you are a Bayesian but you are probably wrong’[7].

Acknowledgements

I am grateful to Don van Ravenzwaaij and John Ioannidis for helpful correspondence and to Andy Grieve for helpful comments. My research on inference for small populations is carried out in the framework of the IDEAL project http://www.ideal.rwth-aachen.de/ and supported by the European Union’s Seventh Framework Programme for research, technological development and demonstration under Grant Agreement no 602552.

 

References

  1. van Ravenzwaaij, D. and J.P. Ioannidis, A simulation study of the strength of evidence in the recommendation of medications based on two trials with statistically significant results. PLoS One, 2017. 12(3): p. e0173184.
  2. Senn, S.J., Statistical Issues in Drug Development. Statistics in Practice. 2007, Hoboken: Wiley. 498.
  3. O’Hagan, A., Bayes factors. Significance, 2006. 3(4): p. 184-186.
  4. Rouder, J.N., et al., Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic bulletin & review, 2009. 16(2): p. 225-237.
  5. Morey, R.D., J.-W. Romeijn, and J.N. Rouder, The philosophy of Bayes factors and the quantification of statistical evidence. Journal of Mathematical Psychology, 2016. 72: p. 6-18.
  6. Grieve, A.P., Discussion of Piegorsch and Gladen (1986). Technometrics, 1987. 29(4): p. 504-505.
  7. Senn, S.J., You may believe you are a Bayesian but you are probably wrong. Rationality, Markets and Morals, 2011. 2: p. 48-66.

 

Categories: Bayesian/frequentist, Error Statistics, S. Senn | 9 Comments

Guest Blog: STEPHEN SENN: ‘Fisher’s alternative to the alternative’

“You May Believe You Are a Bayesian But You Are Probably Wrong”

As part of the week of recognizing R.A.Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012.  (I will comment in the comments.)

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests. Continue reading

Categories: Fisher, S. Senn, Statistics | 13 Comments

Hocus pocus! Adopt a magician’s stance, if you want to reveal statistical sleights of hand

Here’s the follow-up post to the one I reblogged on Feb 3 (please read that one first). When they sought to subject Uri Geller to the scrutiny of scientists, magicians had to be brought in because only they were sufficiently trained to spot the subtle sleight of hand shifts by which the magician tricks by misdirection. We, too, have to be magicians to discern the subtle misdirections and shifts of meaning in the discussions of statistical significance tests (and other methods)—even by the same statistical guide. We needn’t suppose anything deliberately devious is going on at all! Often, the statistical guidebook reflects shifts of meaning that grow out of one or another critical argument. These days, they trickle down quickly to statistical guidebooks, thanks to popular articles on the “statistics crisis in science”. The danger is that their own guidebooks contain inconsistencies. To adopt the magician’s stance is to be on the lookout for standard sleights of hand. There aren’t that many.[0]

I don’t know Jim Frost, but he gives statistical guidance at the minitab blog. The purpose of my previous post is to point out that Frost uses the probability of a Type I error in two incompatible ways in his posts on significance tests. I assumed he’d want to clear this up, but so far he has not. His response to a comment I made on his blog is this: Continue reading

Categories: frequentist/Bayesian, P-values, reforming the reformers, S. Senn, Statistics | 39 Comments

Frequentstein: What’s wrong with (1 – β)/α as a measure of evidence against the null? (ii)

In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)

The recommendation is much more fully fleshed out in a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. Their recommendation is:

…that researchers should report the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results. (BBBS 2016, p. 3)….

The (pre-experimental) ‘rejection ratio’ Rpre, the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H1 and H0 respectively), is shown to capture the strength of evidence in the experiment for H1 over H0. (ibid., p. 2)

But in fact it does no such thing! [See my post from the FUSION conference here.] J. Berger, and his co-authors, will tell you the rejection ratio (and a variety of other measures created over the years) are entirely frequentist because they are created out of frequentist error statistical measures. But a creation built on frequentist measures doesn’t mean the resulting animal captures frequentist error statistical reasoning. It might be a kind of Frequentstein monster! [1] Continue reading

Categories: J. Berger, power, reforming the reformers, S. Senn, Statistical power, Statistics | 36 Comments

Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence”

I first blogged this letter here. Below the references are some more recent blog links of relevance to this issue. 

 Dear Reader:  I am typing in some excerpts from a letter Stephen Senn shared with me in relation to my April 28, 2012 blogpost.  It is a letter to the editor of Statistics in Medicine  in response to S. Goodman. It contains several important points that get to the issues we’ve been discussing. You can read the full letter here. Sincerely, D. G. Mayo

 STATISTICS IN MEDICINE, LETTER TO THE EDITOR

From: Stephen Senn*

Some years ago, in the pages of this journal, Goodman gave an interesting analysis of ‘replication probabilities’ of p-values. Specifically, he considered the possibility that a given experiment had produced a p-value that indicated ‘significance’ or near significance (he considered the range p=0.10 to 0.001) and then calculated the probability that a study with equal power would produce a significant result at the conventional level of significance of 0.05. He showed, for example, that given an uninformative prior, and (subsequently) a resulting p-value that was exactly 0.05 from the first experiment, the probability of significance in the second experiment was 50 per cent. A more general form of this result is as follows. If the first trial yields p=α then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5. Continue reading
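
A quick simulation sketch (mine, not Senn’s or Goodman’s) illustrates the general form of the result on the z-score scale, assuming a flat prior on the true standardized effect and taking ‘equal power’ to mean an equally sized second trial.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
z1 = stats.norm.ppf(1 - alpha / 2)     # first trial lands exactly at p = alpha (two-sided)

# Flat prior on the true standardised effect => its posterior is N(z1, 1) on the z scale.
true_effect = rng.normal(z1, 1.0, size=1_000_000)
z2 = rng.normal(true_effect, 1.0)      # predictive z-score of an equally sized second trial

print(round(np.mean(z2 > z1), 3))      # significant at alpha and in the same direction: ~0.5
```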

Categories: 4 years ago!, reproducibility, S. Senn, Statistics | Tags: , , , | 3 Comments

Stephen Senn: The pathetic P-value (Guest Post) [3]

Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

The pathetic P-value* [3]

This is the way the story is now often told. RA Fisher is the villain. Scientists were virtuously treading the Bayesian path, when along came Fisher and gave them P-values, which they gladly accepted, because they could get ‘significance’ so much more easily. Nearly a century of corrupt science followed but now there are signs that there is a willingness to return to the path of virtue and having abandoned this horrible Fisherian complication:

We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started …

A condition of complete simplicity..

And all shall be well and
All manner of thing shall be well

TS Eliot, Little Gidding

Consider, for example, distinguished scientist David Colquhoun citing the excellent scientific journalist Robert Matthews as follows

“There is an element of truth in the conclusion of a perspicacious journalist:

‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug. ‘

Robert Matthews Sunday Telegraph, 13 September 1998.” [1]

However, this is not a plain fact but just plain wrong. Even if P-values were the guilty ‘mathematical machine’ they are portrayed to be, it is not RA Fisher’s fault. Putting the historical record right helps one to understand the issues better. As I shall argue, at the heart of this is not a disagreement between Bayesian and frequentist approaches but between two Bayesian approaches: it is a conflict to do with the choice of prior distributions[2].

Fisher did not persuade scientists to calculate P-values rather than Bayesian posterior probabilities; he persuaded them that the probabilities that they were already calculating and interpreting as posterior probabilities relied for this interpretation on a doubtful assumption. He proposed to replace this interpretation with one that did not rely on the assumption. Continue reading

Categories: P-values, S. Senn, statistical tests, Statistics | 27 Comments

Stephen Senn: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics

Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

This post first appeared here. An issue sometimes raised about randomized clinical trials is the problem of indefinitely many confounders. This, for example is what John Worrall has to say:

Even if there is only a small probability that an individual factor is unbalanced, given that there are indefinitely many possible confounding factors, then it would seem to follow that the probability that there is some factor on which the two groups are unbalanced (when remember randomly constructed) might for all anyone knows be high. (Worrall J. What evidence is evidence-based medicine? Philosophy of Science 2002; 69: S316-S330: see p. S324 )

It seems to me, however, that this overlooks four matters. The first is that it is not indefinitely many variables we are interested in but only one, albeit one we can’t measure perfectly. This variable can be called ‘outcome’. We wish to see to what extent the difference observed in outcome between groups is compatible with the idea that chance alone explains it. The indefinitely many covariates can help us predict outcome but they are only of interest to the extent that they do so. However, although we can’t measure the difference we would have seen in outcome between groups in the absence of treatment, we can measure how much it varies within groups (where the variation cannot be due to differences between treatments). Thus we can say a great deal about random variation to the extent that group membership is indeed random. Continue reading

Categories: RCTs, S. Senn, Statistics | Tags: , | 6 Comments

Can You change Your Bayesian prior? (ii)

This is one of the questions high on the “To Do” list I’ve been keeping for this blog.  The question grew out of discussions of “updating and downdating” in relation to papers by Stephen Senn (2011) and Andrew Gelman (2011) in Rationality, Markets, and Morals.[i]

“As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’.” (Senn, 2011, p. 59)

“If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.” (Gelman, 2011, p. 77)

But if uncertainty is not expressible as a prior, then a major lynchpin for Bayesian updating seems questionable. If you can go from the posterior to the prior, on the other hand, perhaps it can also lead you to come back and change it.

Is it legitimate to change one’s prior based on the data?

I don’t mean update it, but reject the one you had and replace it with another. My question may yield different answers depending on the particular Bayesian view. I am prepared to restrict the entire question of changing priors to Bayesian “probabilisms”, meaning the inference takes the form of updating priors to yield posteriors, or to report a comparative Bayes factor. Interpretations can vary. In many Bayesian accounts the prior probability distribution is a way of introducing prior beliefs into the analysis (as with subjective Bayesians) or, conversely, to avoid introducing prior beliefs (as with reference or conventional priors). Empirical Bayesians employ frequentist priors based on similar studies or well established theory. There are many other variants.

S. SENN: According to Senn, one test of whether an approach is Bayesian is that while Continue reading

Categories: Bayesian/frequentist, Gelman, S. Senn, Statistics | 111 Comments

From our “Philosophy of Statistics” session: APS 2015 convention

“The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference,” at the 2015 American Psychological Society (APS) Annual Convention in NYC, May 23, 2015:

 

D. Mayo: “Error Statistical Control: Forfeit at your Peril” 

 

S. Senn: “‘Repligate’: reproducibility in statistical studies. What does it mean and in what sense does it matter?”

 

A. Gelman: “The statistical crisis in science” (this is not his exact presentation, but he focussed on some of these slides)

 

For more details see this post.

Categories: Bayesian/frequentist, Error Statistics, P-values, reforming the reformers, reproducibility, S. Senn, Statistics | 10 Comments

Stephen Senn: The pathetic P-value (Guest Post)

Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

The pathetic P-value

This is the way the story is now often told. RA Fisher is the villain. Scientists were virtuously treading the Bayesian path, when along came Fisher and gave them P-values, which they gladly accepted, because they could get ‘significance’ so much more easily. Nearly a century of corrupt science followed but now there are signs that there is a willingness to return to the path of virtue and having abandoned this horrible Fisherian complication:

We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started …

A condition of complete simplicity..

And all shall be well and
All manner of thing shall be well

TS Eliot, Little Gidding

Consider, for example, distinguished scientist David Colquhoun citing the excellent scientific journalist Robert Matthews as follows

“There is an element of truth in the conclusion of a perspicacious journalist:

‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug. ‘

Robert Matthews Sunday Telegraph, 13 September 1998.” [1]

However, this is not a plain fact but just plain wrong. Even if P-values were the guilty ‘mathematical machine’ they are portrayed to be, it is not RA Fisher’s fault. Putting the historical record right helps one to understand the issues better. As I shall argue, at the heart of this is not a disagreement between Bayesian and frequentist approaches but between two Bayesian approaches: it is a conflict to do with the choice of prior distributions[2].

Fisher did not persuade scientists to calculate P-values rather than Bayesian posterior probabilities; he persuaded them that the probabilities that they were already calculating and interpreting as posterior probabilities relied for this interpretation on a doubtful assumption. He proposed to replace this interpretation with one that did not rely on the assumption. Continue reading

Categories: P-values, S. Senn, statistical tests, Statistics | 148 Comments

Stephen Senn: Is Pooling Fooling? (Guest Post)

Stephen Senn
Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS), Luxembourg

Is Pooling Fooling?

‘And take the case of a man who is ill. I call two physicians: they differ in opinion. I am not to lie down, and die between them: I must do something.’ Samuel Johnson, in Boswell’s A Journal of a Tour to the Hebrides

A common dilemma facing meta-analysts is what to put together with what? One may have a set of trials that seem to be approximately addressing the same question but some features may differ. For example, the inclusion criteria might have differed with some trials only admitting patients who were extremely ill but with other trials treating the moderately ill as well. Or it might be the case that different measurements have been taken in different trials. An even more extreme case occurs when different, if presumed similar, treatments have been used.

It is helpful to make a point of terminology here. In what follows I shall be talking about pooling results from various trials. This does not involve naïve pooling of patients across trials. I assume that each trial will provide a valid within-trial comparison of treatments. It is these comparisons that are to be pooled (appropriately).

A possible way to think of this is in terms of a Bayesian model with a prior distribution covering the extent to which results might differ as features of trials are changed. I don’t deny that this is sometimes an interesting way of looking at things (although I do maintain that it is much more tricky than many might suppose[1]) but I would also like to draw attention to the fact that there is a frequentist way of looking at this problem that is also useful.

Suppose that we have k ‘null’ hypotheses that we are interested in testing, each being capable of being tested in one of k trials. We can label these Hn1, Hn2, … Hnk. We are perfectly entitled to test the null hypothesis Hjoint that they are all jointly true. In doing this we can use appropriate judgement to construct a composite statistic based on all the trials whose distribution is known under the null. This is a justification for pooling. Continue reading
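
As a concrete illustration of what such a composite statistic might look like, here is a short Python sketch; the estimates and standard errors are invented numbers, and the inverse-variance weighting is just one possible exercise of the ‘appropriate judgement’ mentioned above, not the specific statistic Senn has in mind.

```python
import numpy as np
from scipy import stats

# Invented within-trial treatment estimates and standard errors for k = 4 trials.
estimates = np.array([0.30, 0.12, 0.45, 0.05])
std_errs  = np.array([0.15, 0.20, 0.25, 0.10])

# One possible composite statistic: the inverse-variance weighted average of the
# within-trial comparisons (no naive pooling of patients across trials).
weights   = 1.0 / std_errs ** 2
pooled    = np.sum(weights * estimates) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

z = pooled / pooled_se
p_joint = 2 * stats.norm.sf(abs(z))    # two-sided test of the joint null H_joint
print(f"pooled estimate {pooled:.3f} (SE {pooled_se:.3f}), z = {z:.2f}, p = {p_joint:.4f}")
```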

Categories: evidence-based policy, PhilPharma, S. Senn, Statistics | 19 Comments
