Professor Stephen Senn*
Full Paper: Bad JAMA?
Short version–Opinion Article: Misunderstanding publication bias
Video below
Data filters
The student undertaking a course in statistical inference may be left with the impression that what matters most is the fundamental business of the statistical framework employed: should one be Bayesian or frequentist, for example? Where does one stand as regards the likelihood principle, and so forth? Or it may be that these philosophical issues are not covered but that a great deal of time is spent on the technical details: for example, depending on the framework, various properties of estimators, how to apply the method of maximum likelihood, or how to implement Markov chain Monte Carlo methods and check for chain convergence. However, much of this work will take place in a (mainly) theoretical kingdom one might name simple-random-sample-dom.
In fact, many real data-sets have features that do not accord well with this set-up. For example, there is a huge variety of data-generating and data-filtering mechanisms that any applied statistician has to deal with. The standard example chosen for statistical inference courses, a simple random sample, is almost never encountered in practice.
Reading Ben Goldacre’s Bad Pharma recently, I was struck by the way in which data filtering can mislead. I must declare an interest, and since this would otherwise take up most of this blog, I will do so with a URL. Here it is: http://www.senns.demon.co.uk/Declaration_Interest.htm . Amongst the claims that Goldacre makes are, first, that authors are less likely to submit negative papers for publication and, second, that journals have no prejudice against negative papers and in favour of positive ones.
Since authors are frequently reviewers, this combination of bias as regards one activity and not as regards the other is implausible and, in my opinion, Goldacre is making the classic error of assuming that the data that arise at the point of study have not suffered any filtering before being seen. If authors have a bias against negative and in favour of positive papers, then reviewers cannot be getting to see a comparable sample of each. If authors were submitting according to probability of acceptance, then we might well expect to see similar probabilities of acceptance for negative and positive studies but higher quality of the former: if a study was negative, authors, anticipating editorial attitudes, would not submit unless the study was of excellent quality. Thus identical probabilities of acceptance would be evidence of a bias, just as, for example, statistics showing that women candidates for promotion were on average as successful as men would be evidence of bias if we also found that the women were on average much better qualified. See http://f1000research.com/2012/12/11/positively-negative-are-editors-to-blame-for-publication-bias/ for a discussion.
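To make this mechanism concrete, here is a toy simulation (in Python, with numbers invented purely for illustration) in which editors do favour positive papers but authors, anticipating this, submit negative studies only when they are of high quality:

```python
# Toy simulation of the mechanism described above: editors favour positive
# papers, but authors anticipate this and submit negative studies only when
# the quality is high. All numbers are invented purely for illustration.
import random

random.seed(1)

submitted = []  # (is_positive, quality) for studies actually sent to a journal
for _ in range(200_000):
    is_positive = random.random() < 0.5   # result of the study
    quality = random.random()             # 0 = poor, 1 = excellent
    # Author self-censoring: negative studies are submitted only if excellent.
    if is_positive or quality > 0.8:
        submitted.append((is_positive, quality))

def p_accept(is_positive, quality):
    # Editorial policy: quality matters, plus a bonus for positive results.
    return 0.1 + 0.5 * quality + (0.2 if is_positive else 0.0)

stats = {True: [0, 0, 0.0], False: [0, 0, 0.0]}  # [submitted, accepted, total quality]
for is_positive, quality in submitted:
    counts = stats[is_positive]
    counts[0] += 1
    counts[2] += quality
    if random.random() < p_accept(is_positive, quality):
        counts[1] += 1

for label, key in (("positive", True), ("negative", False)):
    n, accepted, total_quality = stats[key]
    print(f"{label}: acceptance rate = {accepted / n:.2f}, "
          f"mean quality of submissions = {total_quality / n:.2f}")
# Expect acceptance rates of about 0.55 for both groups, despite the editorial
# bonus, with markedly higher mean quality among submitted negative studies.
```

On such a model, finding equal acceptance rates does not exonerate the editors; taken together with the higher quality of the submitted negative studies, it is just what a biased system would produce.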
Such data filters can cause problems for naïve approaches to inference. Here are some notorious cases.
- A study found that Oscar winners lived longer than actors who did not win an Oscar. Does esteem lead to long life? (See the sketch after this list.)
- Obese infarct survivors have better prognosis than the non-obese. Does obesity protect against further heart damage?
- Women in the village of Whickham were asked whether or not they smoked. Twenty years later, a higher percentage of non-smokers than of smokers had died. Does smoking help you live longer?
- The average age at death of right-handers was found to be much greater than that of left-handers. Is there a sinister effect on mortality?
- In a trial comparing three non-steroidal anti-inflammatory drugs, patients were stratified by concomitant aspirin use (yes or no). Aspirin takers had a much higher rate of cardiovascular events. Is aspirin bad for your heart?
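Here is the promised toy sketch for the Oscar example (again, the numbers are invented purely for illustration). The award is given no causal effect on survival whatever; the only requirement is that an actor must still be alive at the moment a win could occur:

```python
# Toy illustration of the selection effect in the Oscar example. By
# construction the award has NO effect on lifespan; winners are simply those
# who survived long enough to win. Numbers invented purely for illustration.
import random

random.seed(1)

winners, non_winners = [], []
for _ in range(100_000):
    lifespan = random.gauss(75, 10)            # age at death, award-independent
    possible_win_age = random.uniform(30, 90)  # age at which a win could occur
    # An actor can only be counted a winner if still alive at that age.
    if lifespan > possible_win_age and random.random() < 0.1:
        winners.append(lifespan)
    else:
        non_winners.append(lifespan)

print(f"mean age at death, winners:     {sum(winners) / len(winners):.1f}")
print(f"mean age at death, non-winners: {sum(non_winners) / len(non_winners):.1f}")
# Winners come out noticeably 'longer-lived' even though the award was given
# no causal effect on survival at all.
```

Winners appear longer-lived simply because dying early removes any chance of ever being counted as a winner.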
However, my favourite story regarding this concerns Abraham Wald. Faced with a distribution of bullet holes in returning planes, Wald stunned his military colleagues by claiming that it was important to put armour where there were no bullet holes, since those were precisely the planes that did not get back.
Sometimes what you don’t see is more important than what you do, and theoretical statistics courses do not always prepare you for this.
Click on picture for a video presentation:
*Head of the Competence Centre for Methodology and Statistics
CRP Santé
Strassen, Luxembourg
Stephen,
Love the stories of Wald and the biased editors! They perfectly demonstrate what Taleb calls the narrative fallacy or “ignoring the graveyard”. Thanks much for this post.
Thanks, Mark. Taleb’s point about the problem of “ignoring the graveyard” is an important one.
When I heard Stephen Senn discuss this at University College London last month, I recall asking how a “positive result” was being defined. Is it merely a matter of statistical significance, or something more like a result that is noteworthy, novel, anomalous or the like? Failing to reject in a number of high-precision null hypothesis tests in physics and elsewhere can be much more telling than finding a statistically significant discrepancy.
The “negative” results of Michelson and Morley gave impressive evidence for the null of a zero ether effect, against the (very strongly believed) ether theory, because there was so high a probability of finding even a slight departure. Likewise with numerous (increasingly precise) tests of the Einstein equivalence principles (e.g., finding no Nordtvedt effect).
In any event, Goldacre is, presumably, showing a negative result, but not of high quality, Senn argues.
An aside: my sense is that when people make debunking their claim to fame (what with blogs like http://www.badscience.net), their need to keep up the show takes on a life of its own.
Mayo: I have a number of problems with Goldacre that won’t fit in a comment. I really enjoyed Senn’s full review linked above though.
In terms of positive results: one thing I think is a problem is the easy running together of medicine/epidemiology with the whole of science — the physics examples you mention are interesting but I suspect not what Goldacre is thinking about.
So to give the best possible interpretation of Goldacre that I can, I think it is fairly clear from the context that he regards a positive result as meaning one which would lead to an intervention being used or licensed in medical practice. More technically, I think that means a paper has a positive finding if:
(1) It addresses a causal question “Does X cause Y?”
(2) It answers the question “yes”.
Papers that don’t ask questions of the form (1) aren’t included in this definition of positive/negative.
The statistical formulation of this (in for example a randomised controlled trial) would generally be as a null hypothesis test where non-causation would entail a regression coefficient of zero, or an effect size of zero, or similar. Broadly such nulls correspond to statistical independence between X and Y.
Statistical significance / rejection of this null would then essentially mean a positive result, both by virtue of being significant and by virtue of leading to inference of a causal effect.
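A sketch of how that classification might be coded (hypothetical data, and a plain two-sample t-test standing in for whatever analysis a real trial would actually use):

```python
# Sketch: classifying a (hypothetical) two-arm trial as 'positive' in the
# sense above, i.e. the null of no effect is rejected AND the estimated
# effect favours the intervention. Data and threshold are purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=0.0, scale=1.0, size=200)   # outcomes under control
treated = rng.normal(loc=0.3, scale=1.0, size=200)   # outcomes under treatment

t_stat, p_value = stats.ttest_ind(treated, control)  # H0: equal means (no effect)
effect = treated.mean() - control.mean()

is_positive = (p_value < 0.05) and (effect > 0)
print(f"effect estimate = {effect:.2f}, p = {p_value:.3f}, "
      f"classified as {'positive' if is_positive else 'negative'}")
```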
But in the observational study Senn is talking about here, the causal question (note: as posed by Goldacre, *not* the study’s authors) is:
“Does the positive / negative nature of a research finding cause journal editors to be more likely to publish it?”
As I read it, Senn is pointing out that, because of the implicit filtering (only papers actually submitted to journals were analysed), we would expect the existence of this causal influence to entail statistical independence (not statistical dependence, as would usually be the case with an RCT).
So... if positive means “rejects a null hypothesis of statistical independence”, then yes, it is negative. But if positive means “accords well with the existence of this particular causal influence”, then it is positive, although Goldacre seems to think it is negative because he hasn’t interpreted the result correctly.
James: Thanks for this. I need to study the last few paragraphs of your comment more carefully. Just to note: I was of course deliberately considering an area outside “pharma” to illustrate how unobvious it is what to count as positive (I was asking Senn if Goldacre defined it; I take it he did not), but also to highlight something to which statistical discussions often give short shrift: the information that a stringent non-rejection of a null can supply. I suppose something similar might be possible in a non-rejection of a null asserting that a new drug (or generic) is just as effective as a known one. I know Senn has criticisms of how this is done in practice.
I’d also like to hear your other criticisms of Goldacre–he’s got this incredible schtick I see (watched one of his video clips). He even sells stuff, which gives me the idea to design some T-shirts of our own!
Yes you definitely have a point that positive/negative is not obvious. Also, now that I think about it, a lot that Goldacre talks about in his book is unreported side-effects — that is, trials which are “positive” for a side effect are apparently left unpublished — suggesting quite the opposite bias!
Goldacre is a minor celebrity in the UK… in fact I would say I was a big fan myself pretty unreservedly a few years ago — I bought his first book (“Bad Science”) when it came out and enjoyed it a lot, went to one of his talks when he was sufficiently non-famous for the talk to be free. Speaking of debunking for example, he did a particularly good job exposing a dodgy nutrition programme on British TV called “You are what you eat” (or some similar cliche) fronted by a quack doctor of sorts called Gillian McKeith. She hectored the overweight on TV and made them defecate into a bucket because, she claimed, she could diagnose and fix you from the smell. Goldacre called her the “poo lady” and managed to get his dead cat a doctorate from the same place she got hers http://en.wikipedia.org/wiki/Gillian_McKeith#Controversy_over_qualifications. All good fun.
But I would say Stephen Senn sums up the problem with his style perfectly, from the full article:
“Statistics is a subject that many medics find easy but most statisticians find difficult.”
Take this article:
http://www.badscience.net/2006/10/the-prosecutors-phallusy/
– about a legal case with some questionable statistical evidence involved. I could go on at length here, but let’s say I find it overly simplistic to call lawyers “innumerate”, though clearly there has been a problem with statistical evidence in legal trials – nor do I want to scapegoat the professor involved. But “innumerate” means being unable to convert eight out of ten into a percentage, or to calculate change for a coffee. Statistical inference isn’t that kind of problem!
For the details, I think Goldacre is adopting the argument of Philip Dawid: http://128.40.111.250/evidence/content/dawid-paper.pdf — I find Dawid’s characterisation of Neyman-Pearson statistics in that paper problematic, to say the least. This in particular I could talk for a while on – suffice it to say that a p-value or an alpha level is not, and is not supposed to be, the posterior probability that the null is true, but you knew that! Anyway, did Goldacre notice this problem he presents as obvious before Dawid pointed it out?
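To spell out that distinction with toy numbers (entirely invented, and assuming a uniform prior over everyone who could have left the trace): suppose a DNA profile has a one-in-a-million match probability and sixty million people could in principle have left it.

```python
# Toy prosecutor's-fallacy arithmetic (numbers invented for illustration):
# P(match | innocent) is not P(innocent | match). Assume a uniform prior over
# everyone who could have left the trace, and that the guilty person matches.
p_match_given_innocent = 1e-6      # "one in a million" match probability
population = 60_000_000            # people who could in principle have left it

expected_innocent_matches = p_match_given_innocent * (population - 1)
p_guilty_given_match = 1 / (1 + expected_innocent_matches)

print(f"expected innocent matches ≈ {expected_innocent_matches:.0f}")
print(f"P(guilty | match) ≈ {p_guilty_given_match:.3f}  (nowhere near 1 - 1e-6)")
```

A tiny P(match | innocent) is perfectly compatible with a small P(guilty | match), which is exactly the confusion at issue.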
Ranting over, rock band merchandise is a great idea. Personally I think coffee mugs are the best form — possibly with “ERROR PROBE” written across the side or a picture of E.S. Pearson?
James: Dawid falls into the confusion about data-dependent “selection effects” in DNA searches that Cox and I discuss:
Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-27.
Not sure an “Error Probe” mug, even with Egon’s mug, would be sexy enough. Perhaps that picture of Solome…
Thanks for the question, Deborah. Goldacre is reporting primary research and the various studies that he cites do not necessarily use the same definition of positive. In the Olson et al study in JAMA, 2002, they defined studies as positive ‘if they showed a statistically significant effect’ (p2806). A further complication was that there might be more than one endpoint; if there was, they used the primary one (if declared) and otherwise made a decision using all endpoints. They also explain (pp2806-2807) that others might have defined positive as significant and showing a benefit, but they did not. On re-reading this paper I realise that, in my accompanying piece that Deborah has posted, I misunderstood a statement Olson et al made in discussion of their findings. They did not (I think) use two different definitions of positive (one-sided or two-sided significance); they were pointing out instead that, had they used a one-sided test themselves, they would have had a significant result in favour of positive studies.
Goldacre uses the results to indicate absence of a bias in favour of positive studies, but in this he goes further than the authors. They quote a confidence interval of 0.99 to 1.88 for the relative risk and are rather more circumspect about claiming that this is evidence of absence.
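For what it is worth, one can back-calculate an approximate p-value from the quoted interval using the usual normal approximation on the log relative-risk scale (my own rough reconstruction, not the authors’ analysis), and it shows why a one-sided test crosses the conventional threshold while the two-sided one does not:

```python
# Rough reconstruction (mine, not the authors'): back-calculate an approximate
# p-value from the quoted 95% CI of 0.99 to 1.88 for the relative risk,
# assuming approximate normality on the log relative-risk scale.
import math
from scipy.stats import norm

lower, upper = 0.99, 1.88
log_l, log_u = math.log(lower), math.log(upper)

log_rr = (log_l + log_u) / 2            # approximate log relative risk
se = (log_u - log_l) / (2 * 1.96)       # approximate standard error
z = log_rr / se

p_two_sided = 2 * norm.sf(abs(z))
p_one_sided = norm.sf(z)

print(f"RR ≈ {math.exp(log_rr):.2f}, z ≈ {z:.2f}")
print(f"two-sided p ≈ {p_two_sided:.3f}, one-sided p ≈ {p_one_sided:.3f}")
# Roughly: two-sided p just above 0.05, one-sided p a little under 0.03,
# consistent with the point about a one-sided test being significant.
```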
Goldacre’s got various RCT primers on his blog today:
http://www.badscience.net/2013/01/i-made-this-radio-4-documentary-on-randomised-trials-on-government-policy/
This was the material passed around by the RCT4D (development economics) group, I see. To his credit he stresses the problems with post-data endpoints and subclass searches.
To add another link into the mix: the RCT document was published by the government’s Behavioural Insights Team — quite an interesting context; the idea is to use RCTs to determine how to change people’s behaviour. I came across this blog post about it: http://www.bbc.co.uk/blogs/adamcurtis/2010/11/post_1.html — fascinating clips of B.F. Skinner (hopefully accessible outside the UK, not sure); also the fourth video below “ghastly plastic flowers” is unmissable.
(Not very statistics related though!)
James: Thanks for pointing this out, I mention it on my current blogpost.