S. Senn

S. Senn: Personal perils: are numbers needed to treat misleading us as to the scope for personalised medicine? (Guest Post)

Personal perils: are numbers needed to treat misleading us as to the scope for personalised medicine?

A common misinterpretation of Numbers Needed to Treat is causing confusion about the scope for personalised medicine.

Stephen Senn
Consultant Statistician,
Edinburgh

Introduction

Thirty years ago, Laupacis et al1 proposed an intuitively appealing way that physicians could decide how to prioritise health care interventions: they could consider how many patients would need to be switched from an inferior treatment to a superior one in order for one to have an improved outcome. They called this the number needed to be treated. It is now more usually referred to as the number needed to treat (NNT).

Within fifteen years, NNTs were so well established that the then editor of the British Medical Journal, Richard Smith, could write: ‘Anybody familiar with the notion of “number needed to treat” (NNT) knows that it’s usually necessary to treat many patients in order for one to benefit’2. Fifteen years further on, bringing us up to date, Wikipedia makes a similar point: ‘The NNT is the average number of patients who need to be treated to prevent one additional bad outcome (e.g. the number of patients that need to be treated for one of them to benefit compared with a control in a clinical trial).’3

This common interpretation is false, as I have pointed out previously in two blogs on this site: Responder Despondency and  Painful Dichotomies. Nevertheless, it seems to me the point is worth making again and the thirty-year anniversary of NNTs provides a good excuse.

NNTs based on dichotomies, as opposed to those based on true binary outcomes (which are very rare), do not measure the proportion of patients who benefit from the drug and even when not based on such dichotomies, they say less about differential response than many suppose. Common false interpretations of NNTs are creating confusion about the scope for personalised medicine.

Not necessarily true

To illustrate the problem, consider a 2015 Nature comment piece by Nicholas Schork4 calling for N-of-1 trials to be used more often in personalising medicine. These are trials in which, as a guide to treatment, patients are repeatedly randomised in different episodes to the therapies being compared5.

NNTs are commonly used in health economics. Other things being equal, a drug with a larger NNT ought to have a lower cost per patient day than one with a smaller NNT if it is to justify its place in the market. Here, however, they were used to make the case for the scope for personalised medicine, and hence the need for N-of-1 trials, a potentially very useful approach to personalising treatment. Schork claimed, ‘The top ten highest-grossing drugs in the United States help between 1 in 25 and 1 in 4 of the people who take them’ (p609). This claim may or may not be correct (it is almost certainly wrong) but the argument for it is false.

The figure, Imperfect medicine, is based on Schork’s figure Imprecision medicine and shows the NNTs for the ten best-selling drugs in the USA at the time of his comment. The NNTs range, for example, from 4 for Humira® in arthritis to 25 for Nexium in heartburn. This is then interpreted as meaning that since, for example, on average 4 patients would have to be treated with Humira rather than placebo in order to get one more response, only one in 4 patients responds to Humira.

Imperfect medicine: Numbers Needed to Treat based on a figure in Schork (2015). The total number of dots represents how many patients you would have to switch to the treatment mentioned to get one additional response (blue dot). The red dots are supposed to represent the patients for whom it would make no difference.

Take the example of Nexium. The figure quoted by Schork is taken from a meta-analysis carried out by Gralnek et al6 based on several studies comparing Esomeprazole (Nexium) to other proton pump inhibitors. The calculation of the NNT may be illustrated using one of the studies that comprise the meta-analysis, the EXPO study reported by Labenz et al7, a clinical trial in which more than 3000 patients with erosive oesophagitis were treated with either Esomeprazole or Pantoprazole and then evaluated at 8 weeks.

Of those treated with Esomeprazole 92.1% were healed. Of those treated with Pantoprazole 87.3% were healed. The difference of 4.8% is the risk difference. Expressed as a proportion this is 0.048 and the reciprocal of this figure is 21, rounded up to the nearest whole number. This figure is the NNT and an interpretation is that on average you would need to treat 21 patients with Esomeprazole rather than with Pantoprazole to have one extra healed case at 8 weeks. For the meta-analysis as a whole, Gralnek et al6 found a risk difference of 4% and this yields an NNT of 25, the figure quoted by Schork. (See Box for further discussion.)
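The arithmetic can be checked mechanically. A minimal sketch in Python (purely illustrative; the function name is invented here) of the calculation just described:

    import math

    def nnt(p_treated: float, p_control: float) -> int:
        """Number needed to treat: reciprocal of the risk difference, rounded up."""
        risk_difference = p_treated - p_control
        return math.ceil(1 / risk_difference)

    # EXPO study, healing at 8 weeks (Labenz et al): 92.1% vs 87.3%
    print(nnt(0.921, 0.873))      # risk difference 0.048 -> NNT of 21

    # Gralnek et al meta-analysis: risk difference of 4%
    print(math.ceil(1 / 0.04))    # NNT of 25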

 Two different interpretations of the EXPO oesophageal ulcer data

 

It is impossible for us to observe the ulcers that were studied in the EXPO trial under both treatments. Each patient was treated with either Esomeprazole or Pantoprazole. We can imagine what the response would have been on either treatment but we can only observe it on one. Table 1 and Table 2 have the same observable marginal probabilities of ulcer healing but different postulated joint ones.

                                  Esomeprazole
                           Not healed     Healed      Total
    Pantoprazole  Not healed      7.9        4.8       12.7
                  Healed          0.0       87.3       87.3
                  Total           7.9       92.1      100.0

Table 1 Possible joint distribution of response (percentages) for the EXPO trial. Case where no patient would respond on Pantoprazole who did not on Esomeprazole

In the case of Table 1, no patient that would not have been healed by Esomeprazole could have been healed by Pantoprazole. In consequence, the patients who could have been healed are precisely those who were healed with Esomeprazole, that is to say 92.1%. In the case of Table 2, all patients who were not healed by Esomeprazole, that is to say 7.9%, could have been healed by Pantoprazole. In principle it then becomes possible to heal all patients. Of course, intermediate situations are possible, but all such tables have the same NNT of 21. The NNT cannot tell us which is true.

                                  Esomeprazole
                           Not healed     Healed      Total
    Pantoprazole  Not healed      0.0       12.7       12.7
                  Healed          7.9       79.4       87.3
                  Total           7.9       92.1      100.0

Table 2 Possible joint distribution of response (percentages) for the EXPO trial. Case where all patients who did not respond on Esomeprazole would respond on Pantoprazole
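The point of the Box can also be checked mechanically. A small illustrative sketch in Python (the percentages are those of Tables 1 and 2; the code is not part of the original post) showing that both postulated joint distributions reproduce exactly the same margins, and hence the same NNT, while implying very different scope for personalisation:

    import math

    # Postulated joint distributions (percentages).
    # Rows: Pantoprazole (not healed, healed); columns: Esomeprazole (not healed, healed).
    table1 = [[7.9, 4.8],
              [0.0, 87.3]]    # Table 1: nobody healed by Pantoprazole but not by Esomeprazole
    table2 = [[0.0, 12.7],
              [7.9, 79.4]]    # Table 2: everyone failing on Esomeprazole would heal on Pantoprazole

    def margins_and_nnt(joint):
        healed_eso = joint[0][1] + joint[1][1]     # column margin: healed on Esomeprazole
        healed_panto = joint[1][0] + joint[1][1]   # row margin: healed on Pantoprazole
        return healed_eso, healed_panto, math.ceil(100 / (healed_eso - healed_panto))

    for name, joint in (("Table 1", table1), ("Table 2", table2)):
        print(name, margins_and_nnt(joint))        # both: margins 92.1 and 87.3, NNT 21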

 

A number of points can be made using this example. First, the NNT is comparator-specific. Proton pump inhibitors as a class are highly effective and one would get quite a different figure if placebo rather than Pantoprazole had been used as the control for Esomeprazole. Second, the figure, of itself, does not tell us the scope for personalising medicine. It is quite compatible with the two extreme positions given in the Box. In the first case, every single patient who was helped by Pantoprazole would have been helped by Esomeprazole. If there are no cost or tolerability advantages to the former, the optimal policy would be to give all patients the latter. In the second case, every single patient who was not helped by Esomeprazole would have been helped by Pantoprazole. If a suitable means can be found of identifying such patients, all patients can be treated successfully. Third, healing is a process that takes time. The eight-week time-point is partly arbitrary. The careful analysis presented by Labenz et al7 shows healing rates rising with time, with the Esomeprazole rate always above that for Pantoprazole. Perhaps, given time, either would heal all ulcers, the difference between them being one of speed. Fourth, although it is not directly related to this discussion, it should be appreciated that a given drug can have many NNTs. The NNT will vary according to the comparator, the outcome chosen, the cut-point for any dichotomy and the length of follow-up8. (The original article proposing NNTs by Laupacis et al1 discusses a number of such caveats.) Indeed, for the EXPO study the risk difference at 4 weeks is 8.7%, giving an NNT of 12 rather than the 21 at 8 weeks. This shows the importance of not mixing NNTs for different follow-ups in a meta-analysis.

An easy lie or a difficult truth?

There are no shortcuts to finding evidence for variation in response9. Dichotomising continuous measures not only has the capacity to exaggerate unimportant differences, it is also inefficient and needlessly increases trial sizes10.

Rather than becoming simpler, the ways that clinical trials are reported need to become more nuanced. In a previous blog I showed how an NNT of 10 for headache had been misinterpreted as meaning that only 1 in 10 benefitted from paracetamol. It is, or ought to be, obvious that in order to understand the extent to which patients respond to paracetamol you should study them more than once under treatment and under control. For example, a design could be employed in which each patient was treated for four headaches, twice with placebo and twice with paracetamol. This is an example of the n-of-1 trials that Schork calls for4. We hardly ever run these. Of course, for some diseases they are not practical, but where we can’t run them we should not pretend to have identified what we can’t.
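As an illustration only (the numbers are hypothetical and belong to no trial), a short simulation of the four-headache design just described shows how within-patient replication lets one separate genuine patient-by-treatment interaction from ordinary episode-to-episode noise:

    import numpy as np

    rng = np.random.default_rng(1)
    n_patients = 200
    average_benefit = 1.0    # average effect of paracetamol over placebo (arbitrary units)
    interaction_sd = 0.0     # patient-by-treatment interaction; set > 0 for genuine 'responders'
    within_sd = 1.0          # episode-to-episode noise within a patient

    personal_effect = average_benefit + rng.normal(0, interaction_sd, n_patients)

    # Each patient treats four headaches: two with placebo, two with paracetamol.
    placebo = rng.normal(0, within_sd, (n_patients, 2))
    treated = personal_effect[:, None] + rng.normal(0, within_sd, (n_patients, 2))

    effect_hat = treated.mean(axis=1) - placebo.mean(axis=1)   # each patient's apparent effect
    # With two episodes per arm, the noise contribution to each apparent effect is
    # (within-arm variance on treatment + within-arm variance on placebo) / 2.
    noise_var = ((treated.var(axis=1, ddof=1) + placebo.var(axis=1, ddof=1)) / 2).mean()

    # Excess variation of the apparent effects over that noise estimates the interaction variance.
    print(effect_hat.var(ddof=1) - noise_var)   # near zero here: the apparent variation is noise

Setting interaction_sd above zero produces a clear excess, which is what genuine evidence of differential response would look like; without the replication the two cases are indistinguishable.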

The role for n-of-1 trials is indeed there but not necessarily to personalise treatment. More careful analysis of response may simply reveal that this is less variable than supposed11. In some cases such trials may simply deliver the message that we need to do better for everybody12.

In his editorial of 2003 Smith referred to pharmacogenetics as providing ‘hopes that greater understanding of genetics will mean that we will be able to identify with a “simple genetic test” people who will respond to drugs and design drugs for individuals rather than populations.’ and added, ‘We have, however, been hearing this tune for a long time’2.

Smith’s complaint about an old tune is as true today as it was in 2003. However, the message for the pharmaceutical industry may simply be that we need better drugs not better diagnosis.

Acknowledgement

I am grateful to Andreas Laupacis and Jennifer Deevy for helpfully providing me with a copy of the 1988 paper.

References

  1. Laupacis A, Sackett DL, Roberts RS. An Assessment of Clinically Useful Measures of the Consequences of Treatment. New England Journal of Medicine 1988;318(26):1728-33.
  2. Smith R. The drugs don’t work. British Medical Journal 2003;327(7428).
  3. Wikipedia. Number needed to treat 2018 [Available from: https://en.wikipedia.org/wiki/Number_needed_to_treat].
  4. Schork NJ. Personalized medicine: Time for one-person trials. Nature 2015;520(7549):609-11.
  5. Araujo A, Julious S, Senn S. Understanding Variation in Sets of N-of-1 Trials. PloS one 2016;11(12):e0167167.
  6. Gralnek IM, Dulai GS, Fennerty MB, et al. Esomeprazole versus other proton pump inhibitors in erosive esophagitis: a meta-analysis of randomized clinical trials. Clin Gastroenterol Hepatol 2006;4(12):1452-8.
  7. Labenz J, Armstrong D, Lauritsen K, et al. A randomized comparative study of esomeprazole 40 mg versus pantoprazole 40 mg for healing erosive oesophagitis: the EXPO study. Alimentary pharmacology & therapeutics 2005;21(6):739-46.
  8. Suissa S. Number needed to treat: enigmatic results for exacerbations in COPD. The European respiratory journal : official journal of the European Society for Clinical Respiratory Physiology 2015;45(4):875-8.
  9. Senn SJ. Mastering variation: variance components and personalised medicine. Statistics in Medicine 2016;35(7):966-77.
  10. Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med 2006;25(1):127-41.
  11. Churchward-Venne TA, Tieland M, Verdijk LB, et al. There are no nonresponders to resistance-type exercise training in older men and women. Journal of the American Medical Directors Association 2015;16(5):400-11.
  12. Senn SJ. Individual response to treatment: is it a valid assumption? BMJ 2004;329(7472):966-68.
Categories: personalized medicine, PhilStat/Med, S. Senn | 7 Comments

Guest Blog: STEPHEN SENN: ‘Fisher’s alternative to the alternative’

“You May Believe You Are a Bayesian But You Are Probably Wrong”


As part of the week of recognizing R.A.Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012/2017.  The comments from 2017 lead to a troubling issue that I will bring up in the comments today.

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests. Continue reading

Categories: Fisher, S. Senn, Statistics | 1 Comment

S. Senn: Evidence Based or Person-centred? A Statistical debate (Guest Post)


Stephen Senn
Head of  Competence Center
for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Evidence Based or Person-centred? A statistical debate

It was hearing Stephen Mumford and Rani Lill Anjum (RLA) in January 2017 speaking at the Epistemology of Causal Inference in Pharmacology conference in Munich organised by Jürgen Landes, Barbara Osmani and Roland Poellinger, that inspired me to buy their book, Causation A Very Short Introduction[1]. Although I do not agree with all that is said in it and also could not pretend to understand all it says, I can recommend it highly as an interesting introduction to issues in causality, some of which will be familiar to statisticians but some not at all.

Since I have a long-standing interest in researching into ways of delivering personalised medicine, I was interested to see a reference on Twitter to a piece by RLA, Evidence based or person centered? An ontological debate, in which she claims that the choice between evidence based or person-centred medicine is ultimately ontological[2]. I don’t dispute that thinking about health care delivery in ontological terms might be interesting. However, I do dispute that there is any meaningful choice between evidence based medicine (EBM) and person centred healthcare (PCH). To suggest so is to commit a category mistake by suggesting that means are alternatives to ends.

In fact, EBM will be essential to delivering effective PCH, as I shall now explain. Continue reading

Categories: personalized medicine, RCTs, S. Senn | 7 Comments

Frequentstein’s Bride: What’s wrong with using (1 – β)/α as a measure of evidence against the null?


ONE YEAR AGO: …and growing more relevant all the time. Rather than leak any of my new book*, I reblog some earlier posts, even if they’re a bit scruffy. This was first blogged here (with a slightly different title). It’s married to posts on “the P-values overstate the evidence against the null fallacy”, such as this, and is wedded to this one on “How to Tell What’s True About Power if You’re Practicing within the Frequentist Tribe”. 

In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)
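For concreteness, a small sketch of the pre-data quantity being recommended (illustrative numbers only, with a one-sided one-sample z-test; this is not Benjamin and Berger’s own example):

    from scipy.stats import norm

    alpha = 0.05                       # significance threshold (Type I error rate)
    n, effect, sigma = 25, 0.5, 1.0    # hypothetical design: sample size, true effect, known SD

    z_crit = norm.ppf(1 - alpha)
    power = 1 - norm.cdf(z_crit - effect * n ** 0.5 / sigma)

    rejection_ratio = power / alpha    # pre-data Rejection Ratio: power divided by alpha
    print(round(power, 3), round(rejection_ratio, 1))   # about 0.804 and 16.1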

Continue reading

Categories: Bayesian/frequentist, fallacy of rejection, J. Berger, power, S. Senn | 17 Comments

S. Senn: “Automatic for the people? Not quite” (Guest post)


Stephen Senn
Head of  Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Automatic for the people? Not quite

What caught my eye was the estimable (in its non-statistical meaning) Richard Lehman tweeting about the equally estimable John Ioannidis. For those who don’t know them, the former is a veteran blogger who keeps a very cool and shrewd eye on the latest medical ‘breakthroughs’ and the latter a serial iconoclast of idols of scientific method. This is what Lehman wrote

Ioannidis hits 8 on the Richter scale: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173184 … Bayes factors consistently quantify strength of evidence, p is valueless.

Since Ioannidis works at Stanford, which is located in the San Francisco Bay Area, he has every right to be interested in earthquakes but on looking up the paper in question, a faint tremor is the best that I can afford it. I shall now try and explain why, but before I do, it is only fair that I acknowledge the very generous, prompt and extensive help I have been given to understand the paper[1] in question by its two authors Don van Ravenzwaaij and Ioannidis himself. Continue reading

Categories: Bayesian/frequentist, Error Statistics, S. Senn | 18 Comments

Guest Blog: STEPHEN SENN: ‘Fisher’s alternative to the alternative’

“You May Believe You Are a Bayesian But You Are Probably Wrong”


As part of the week of recognizing R.A.Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012.  (I will comment in the comments.)

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests. Continue reading

Categories: Fisher, S. Senn, Statistics | 13 Comments

Hocus pocus! Adopt a magician’s stance, if you want to reveal statistical sleights of hand


Here’s the follow-up post to the one I reblogged on Feb 3 (please read that one first). When they sought to subject Uri Geller to the scrutiny of scientists, magicians had to be brought in because only they were sufficiently trained to spot the subtle sleight of hand shifts by which the magician tricks by misdirection. We, too, have to be magicians to discern the subtle misdirections and shifts of meaning in the discussions of statistical significance tests (and other methods)—even by the same statistical guide. We needn’t suppose anything deliberately devious is going on at all! Often, the statistical guidebook reflects shifts of meaning that grow out of one or another critical argument. These days, they trickle down quickly to statistical guidebooks, thanks to popular articles on the “statistics crisis in science”. The danger is that their own guidebooks contain inconsistencies. To adopt the magician’s stance is to be on the lookout for standard sleights of hand. There aren’t that many.[0]

I don’t know Jim Frost, but he gives statistical guidance at the minitab blog. The purpose of my previous post is to point out that Frost uses the probability of a Type I error in two incompatible ways in his posts on significance tests. I assumed he’d want to clear this up, but so far he has not. His response to a comment I made on his blog is this: Continue reading

Categories: frequentist/Bayesian, P-values, reforming the reformers, S. Senn, Statistics | 39 Comments

Frequentstein: What’s wrong with (1 – β)/α as a measure of evidence against the null? (ii)


In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)

The recommendation is much more fully fleshed out in a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. Their recommendation is:

…that researchers should report the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results. (BBBS 2016, p. 3)….

The (pre-experimental) ‘rejection ratio’ Rpre, the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H1 and H0 respectively), is shown to capture the strength of evidence in the experiment for H1 over H0. (ibid., p. 2)

But in fact it does no such thing! [See my post from the FUSION conference here.] J. Berger, and his co-authors, will tell you the rejection ratio (and a variety of other measures created over the years) are entirely frequentist because they are created out of frequentist error statistical measures. But a creation built on frequentist measures doesn’t mean the resulting animal captures frequentist error statistical reasoning. It might be a kind of Frequentstein monster! [1] Continue reading

Categories: J. Berger, power, reforming the reformers, S. Senn, Statistical power, Statistics | 36 Comments

Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence”


I first blogged this letter here. Below the references are some more recent blog links of relevance to this issue. 

 Dear Reader:  I am typing in some excerpts from a letter Stephen Senn shared with me in relation to my April 28, 2012 blogpost.  It is a letter to the editor of Statistics in Medicine  in response to S. Goodman. It contains several important points that get to the issues we’ve been discussing. You can read the full letter here. Sincerely, D. G. Mayo

 STATISTICS IN MEDICINE, LETTER TO THE EDITOR

From: Stephen Senn*

Some years ago, in the pages of this journal, Goodman gave an interesting analysis of ‘replication probabilities’ of p-values. Specifically, he considered the possibility that a given experiment had produced a p-value that indicated ‘significance’ or near significance (he considered the range p=0.10 to 0.001) and then calculated the probability that a study with equal power would produce a significant result at the conventional level of significance of 0.05. He showed, for example, that given an uninformative prior, and (subsequently) a resulting p-value that was exactly 0.05 from the first experiment, the probability of significance in the second experiment was 50 per cent. A more general form of this result is as follows. If the first trial yields p=α then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5. Continue reading
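The 0.5 figure is easy to reproduce. A short illustrative check (assuming, as in Goodman’s setup, an uninformative prior and normally distributed test statistics with known variance; the code is added here for illustration and is not Senn’s):

    import numpy as np
    from scipy.stats import norm

    alpha = 0.05
    z1 = norm.ppf(1 - alpha)     # first trial exactly significant at alpha (one-sided)

    # With a flat prior, the predictive distribution of an equally powered second trial's
    # z-statistic is normal with mean z1 and variance 2 (posterior variance + sampling variance).
    print(1 - norm.cdf(z1, loc=z1, scale=np.sqrt(2)))   # exactly 0.5

    # The same result by simulation.
    rng = np.random.default_rng(0)
    true_effect = rng.normal(z1, 1, 200_000)   # posterior draws given the first trial
    z2 = rng.normal(true_effect, 1)            # statistic of the second trial
    print((z2 > z1).mean())                    # approximately 0.5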

Categories: 4 years ago!, reproducibility, S. Senn, Statistics | Tags: , , , | 3 Comments

Stephen Senn: The pathetic P-value (Guest Post) [3]


Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

The pathetic P-value* [3]

This is the way the story is now often told. RA Fisher is the villain. Scientists were virtuously treading the Bayesian path, when along came Fisher and gave them P-values, which they gladly accepted, because they could get ‘significance’ so much more easily. Nearly a century of corrupt science followed but now there are signs that there is a willingness to return to the path of virtue and having abandoned this horrible Fisherian complication:

We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started …

A condition of complete simplicity..

And all shall be well and
All manner of thing shall be well

TS Eliot, Little Gidding

Consider, for example, distinguished scientist David Colquhoun citing the excellent scientific journalist Robert Matthews as follows

“There is an element of truth in the conclusion of a perspicacious journalist:

‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug. ‘

Robert Matthews Sunday Telegraph, 13 September 1998.” [1]

However, this is not a plain fact but just plain wrong. Even if P-values were the guilty ‘mathematical machine’ they are portrayed to be, it is not RA Fisher’s fault. Putting the historical record right helps one to understand the issues better. As I shall argue, at the heart of this is not a disagreement between Bayesian and frequentist approaches but between two Bayesian approaches: it is a conflict to do with the choice of prior distributions[2].

Fisher did not persuade scientists to calculate P-values rather than Bayesian posterior probabilities; he persuaded them that the probabilities that they were already calculating and interpreting as posterior probabilities relied for this interpretation on a doubtful assumption. He proposed to replace this interpretation with one that did not rely on the assumption. Continue reading

Categories: P-values, S. Senn, statistical tests, Statistics | 27 Comments

Stephen Senn: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics


Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

This post first appeared here. An issue sometimes raised about randomized clinical trials is the problem of indefinitely many confounders. This, for example is what John Worrall has to say:

Even if there is only a small probability that an individual factor is unbalanced, given that there are indefinitely many possible confounding factors, then it would seem to follow that the probability that there is some factor on which the two groups are unbalanced (when remember randomly constructed) might for all anyone knows be high. (Worrall J. What evidence is evidence-based medicine? Philosophy of Science 2002; 69: S316-S330: see p. S324 )

It seems to me, however, that this overlooks four matters. The first is that it is not indefinitely many variables we are interested in but only one, albeit one we can’t measure perfectly. This variable can be called ‘outcome’. We wish to see to what extent the difference observed in outcome between groups is compatible with the idea that chance alone explains it. The indefinitely many covariates can help us predict outcome but they are only of interest to the extent that they do so. However, although we can’t measure the difference we would have seen in outcome between groups in the absence of treatment, we can measure how much it varies within groups (where the variation cannot be due to differences between treatments). Thus we can say a great deal about random variation to the extent that group membership is indeed random. Continue reading
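A simulation sketch (hypothetical numbers, added purely as illustration) of this point: under the null, with the outcome driven by a very large number of unmeasured prognostic factors, a randomised comparison still has its nominal error rate, because the within-group variation reflects the same sources of variation as the between-group difference:

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    n_per_arm, n_covariates, n_sims = 50, 1000, 1000

    rejections = 0
    for _ in range(n_sims):
        # Outcome built from 1000 small unmeasured prognostic factors; no treatment effect at all.
        covariates = rng.normal(0, 1, (2 * n_per_arm, n_covariates))
        outcome = covariates.sum(axis=1) / np.sqrt(n_covariates) + rng.normal(0, 1, 2 * n_per_arm)
        treated = rng.permutation(2 * n_per_arm) < n_per_arm    # randomise to the two arms
        rejections += ttest_ind(outcome[treated], outcome[~treated]).pvalue < 0.05

    print(rejections / n_sims)   # close to the nominal 0.05 despite the 1000 'confounders'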

Categories: RCTs, S. Senn, Statistics | Tags: , | 6 Comments

Can You change Your Bayesian prior? (ii)


This is one of the questions high on the “To Do” list I’ve been keeping for this blog.  The question grew out of discussions of “updating and downdating” in relation to papers by Stephen Senn (2011) and Andrew Gelman (2011) in Rationality, Markets, and Morals.[i]

“As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’.” (Senn, 2011, p. 59)

“If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.” (Gelman, 2011, p. 77)

But if uncertainty is not expressible as a prior, then a major lynchpin for Bayesian updating seems questionable. If you can go from the posterior to the prior, on the other hand, perhaps it can also lead you to come back and change it.

Is it legitimate to change one’s prior based on the data?

I don’t mean update it, but reject the one you had and replace it with another. My question may yield different answers depending on the particular Bayesian view. I am prepared to restrict the entire question of changing priors to Bayesian “probabilisms”, meaning the inference takes the form of updating priors to yield posteriors, or to report a comparative Bayes factor. Interpretations can vary. In many Bayesian accounts the prior probability distribution is a way of introducing prior beliefs into the analysis (as with subjective Bayesians) or, conversely, to avoid introducing prior beliefs (as with reference or conventional priors). Empirical Bayesians employ frequentist priors based on similar studies or well established theory. There are many other variants.


S. SENN: According to Senn, one test of whether an approach is Bayesian is that while Continue reading

Categories: Bayesian/frequentist, Gelman, S. Senn, Statistics | 111 Comments

From our “Philosophy of Statistics” session: APS 2015 convention


“The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference,” at the 2015 American Psychological Society (APS) Annual Convention in NYC, May 23, 2015:

 

D. Mayo: “Error Statistical Control: Forfeit at your Peril” 

 

S. Senn: “‘Repligate’: reproducibility in statistical studies. What does it mean and in what sense does it matter?”

 

A. Gelman: “The statistical crisis in science” (this is not his exact presentation, but he focussed on some of these slides)

 

For more details see this post.

Categories: Bayesian/frequentist, Error Statistics, P-values, reforming the reformers, reproducibility, S. Senn, Statistics | 10 Comments

Stephen Senn: The pathetic P-value (Guest Post)


Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

The pathetic P-value

This is the way the story is now often told. RA Fisher is the villain. Scientists were virtuously treading the Bayesian path, when along came Fisher and gave them P-values, which they gladly accepted, because they could get ‘significance’ so much more easily. Nearly a century of corrupt science followed but now there are signs that there is a willingness to return to the path of virtue and having abandoned this horrible Fisherian complication:

We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started …

A condition of complete simplicity..

And all shall be well and
All manner of thing shall be well

TS Eliot, Little Gidding

Consider, for example, distinguished scientist David Colquhoun citing the excellent scientific journalist Robert Matthews as follows

“There is an element of truth in the conclusion of a perspicacious journalist:

‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug. ‘

Robert Matthews Sunday Telegraph, 13 September 1998.” [1]

However, this is not a plain fact but just plain wrong. Even if P-values were the guilty ‘mathematical machine’ they are portrayed to be, it is not RA Fisher’s fault. Putting the historical record right helps one to understand the issues better. As I shall argue, at the heart of this is not a disagreement between Bayesian and frequentist approaches but between two Bayesian approaches: it is a conflict to do with the choice of prior distributions[2].

Fisher did not persuade scientists to calculate P-values rather than Bayesian posterior probabilities; he persuaded them that the probabilities that they were already calculating and interpreting as posterior probabilities relied for this interpretation on a doubtful assumption. He proposed to replace this interpretation with one that did not rely on the assumption. Continue reading

Categories: P-values, S. Senn, statistical tests, Statistics | 148 Comments

Stephen Senn: Is Pooling Fooling? (Guest Post)


Stephen Senn
Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS), Luxembourg

Is Pooling Fooling?

‘And take the case of a man who is ill. I call two physicians: they differ in opinion. I am not to lie down, and die between them: I must do something.’ Samuel Johnson, in Boswell’s A Journal of a Tour to the Hebrides

A common dilemma facing meta-analysts is what to put together with what? One may have a set of trials that seem to be approximately addressing the same question but some features may differ. For example, the inclusion criteria might have differed with some trials only admitting patients who were extremely ill but with other trials treating the moderately ill as well. Or it might be the case that different measurements have been taken in different trials. An even more extreme case occurs when different, if presumed similar, treatments have been used.

It is helpful to make a point of terminology here. In what follows I shall be talking about pooling results from various trials. This does not involve naïve pooling of patients across trials. I assume that each trial will provide a valid within-trial comparison of treatments. It is these comparisons that are to be pooled (appropriately).

A possible way to think of this is in terms of a Bayesian model with a prior distribution covering the extent to which results might differ as features of trials are changed. I don’t deny that this is sometimes an interesting way of looking at things (although I do maintain that it is much more tricky than many might suppose[1]) but I would also like to draw attention to the fact that there is a frequentist way of looking at this problem that is also useful.

Suppose that we have k ‘null’ hypotheses that we are interested in testing, each being capable of being tested in one of k trials. We can label these Hn1, Hn2, … Hnk. We are perfectly entitled to test the null hypothesis Hjoint that they are all jointly true. In doing this we can use appropriate judgement to construct a composite statistic based on all the trials whose distribution is known under the null. This is a justification for pooling. Continue reading
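As one concrete, and purely illustrative, example of such a composite statistic (the numbers are made up and this is not Senn’s own worked example), within-trial treatment contrasts can be combined with inverse-variance weights and the combined statistic referred to its known distribution under the joint null:

    import numpy as np
    from scipy.stats import norm

    # Hypothetical within-trial treatment contrasts and their standard errors (k = 3 trials).
    estimates = np.array([0.30, 0.12, 0.25])
    std_errors = np.array([0.10, 0.15, 0.08])

    weights = 1 / std_errors ** 2
    pooled = np.sum(weights * estimates) / np.sum(weights)   # inverse-variance weighted combination
    pooled_se = np.sqrt(1 / np.sum(weights))

    z = pooled / pooled_se          # standard normal under the joint null of no effect in any trial
    p_value = 2 * norm.sf(abs(z))
    print(round(pooled, 3), round(pooled_se, 3), round(z, 2), p_value)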

Categories: evidence-based policy, PhilPharma, S. Senn, Statistics | 19 Comments
