Science is in flux. The basics of a rigorous scientific method were worked out many years ago, but there is now growing concern about systematic structural flaws that undermine the integrity of published data: selective publication, inadequate descriptions of study methods that block efforts at replication, and data dredging through undisclosed use of multiple analytical strategies. Problems such as these undermine the integrity of published data and increase the risk of exaggerated or even false-positive findings, leading collectively to the ‘replication crisis’.
Alongside academic papers that document the prevalence of these problems, we have seen a growth in ‘technical activism’: groups creating data structures and services to help find solutions. These include the Reproducibility Project, which shares out the work of replicating hundreds of published papers in psychology, and Registered Reports, in which researchers can specify their methods and analytical strategy before they begin a study.
These initiatives can generate conflict, because they set out to hold individuals to account. Most researchers maintain a public pose that science is about healthy, reciprocal, critical appraisal. But when you replicate someone’s methods and find discrepant results, there is inevitably a risk of friction.
Our team in the Centre for Evidence-Based Medicine at the University of Oxford, UK, is now facing the same challenge. We are targeting the problem of selective outcome reporting in clinical trials.
At the outset, those conducting clinical trials are supposed to publicly declare what measurements they will take to assess the relative benefits of the treatments being compared. This is long-standing best practice, because an outcome such as ‘cardiovascular health’ could be measured in many ways. So researchers are expected to list the specific blood tests and symptom-rating scales that they will use, for example, alongside the dates on which measurements will be taken, and any cut-off values they will apply to turn continuous data into categorical variables.
This is all done to prevent researchers from ‘data-dredging’ their results. If researchers switch from these pre-specified outcomes, without explaining that they have done so, then they break the assumptions of their statistical tests. That carries a significant risk of exaggerating findings, or simply getting them wrong, and this in turn helps to explain why so many trial results eventually turn out to be incorrect.
You might think that this problem is so obvious that it would already be competently managed by researchers and journals. But that is not the case. Repeatedly, academic papers have been published showing that outcome-switching is highly prevalent, and that such switches often lead to more favourable statistically significant results being reported instead. ….
Our group has taken a new approach to trying to fix this problem. Since last October, we have been checking the outcomes reported in every trial published in five top medical journals against the pre-specified outcomes from the registry entries or protocols. Most had discrepancies, many of them major. Then, crucially, we have submitted a correction letter, on every trial that misreported its outcomes, to the journal in question. (All of our raw data, methods and correspondence with journals are available on our website at COMPare-trials.org.)
We expected that journals would take these discrepancies seriously, because trial results are used by physicians, researchers and patients to make informed decisions about treatments. Instead, we have seen a wide range of reactions. Some have demonstrated best practice: the BMJ, for instance, quickly published a correction on one misreported trial we found, within days of our letter being posted.
Other journals have not followed the BMJ’s lead. The editors at Annals of Internal Medicine, for example, have responded to our correction letters with an unsigned rebuttal that, in our view, raises serious questions about their commitment to managing outcome-switching. For example, they repeatedly (but confusedly) argue that it is acceptable to identify “prespecified outcomes” from documents produced after a trial began; they make concerning comments that undermine the crucial resource of trial registers; and they say that their expertise allows them to permit — and even solicit — undeclared outcome-switching.
The practice of identifying “prespecified outcomes” from post hoc information can indeed “break the assumptions of their statistical tests”. For example, the reported significance level may have no relation to the actual significance level. You might report that an observed effect would not easily (frequently) be brought about by mere chance variability (small reported P-value), when in fact it would frequently be brought about by chance alone, thanks to dredging (large actual P-value). You’re breaking the test’s assumptions, but what if your account of evidence denies that’s any skin off its nose?
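To make the gap between reported and actual significance levels concrete, here is a minimal simulation sketch. It is only an illustration under assumptions that are mine, not Goldacre's or COMPare's: every outcome is measured under a true null, the outcomes are independent standard normal test statistics, and the "switched" trial reports whichever of 10 outcomes has the smallest P-value. The function name and its parameters are hypothetical.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_outcomes, n_trials=20000, alpha=0.05, seed=0):
    """Estimate how often a null trial reports p < alpha when the
    smallest of `n_outcomes` P-values is the one that gets reported.

    Assumption (for illustration only): outcomes are independent and
    each test statistic is standard normal under the null.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    hits = 0
    for _ in range(n_trials):
        p_values = []
        for _ in range(n_outcomes):
            z = rng.gauss(0.0, 1.0)  # test statistic under the null
            # Two-sided P-value for this outcome.
            p_values.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
        # Report only the most favourable outcome.
        if min(p_values) < alpha:
            hits += 1
    return hits / n_trials


# One genuinely prespecified outcome versus the best of ten.
prespecified = false_positive_rate(n_outcomes=1)
switched = false_positive_rate(n_outcomes=10)
print(f"actual error rate, 1 prespecified outcome: {prespecified:.3f}")
print(f"actual error rate, best of 10 outcomes:    {switched:.3f}")
```

Under these assumptions the actual error rate for the best of 10 outcomes should sit near 1 - 0.95^10, roughly 0.40, even though each reported P-value is nominally compared against 0.05. Real trial outcomes are typically correlated, so the inflation in practice would differ from this figure.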
Take for example the epidemiologist Stephen Goodman, a co-director of a leading home for “technical activism” (Meta-Research Innovation Center at Stanford):
“Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value…But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value”(Goodman 1999, p. 1010).
“Nothing to do with the data”? On the frequentist (error statistical) philosophy, it has a lot to do with the data. To Goodman’s credit, he’s up front about his standpoint being based on accepting the “likelihood principle”. [See Supplement below.] However, what people come away with is the upshot of that evidential standpoint, not the philosophical nuances that lurk in the undergrowth. At the center’s inaugural conference, the questionable relevance of significance levels was the punch line of the joke in a funny group video you may have seen (posted on 538). The take-away message is scarcely: Do all you can to report any post-data specifications that would violate the legitimacy of your significance level.
Goldacre is surely right to suspect that some of the resistance to calls against “outcome switching” is defensiveness; but he shouldn’t close his eyes to the role played by foundational principles of evidence.
What do you think?
(ii) Supplement: You may wonder how the contrasting standpoints (between, say, a Goodman and a Goldacre) involve philosophical principles of evidence. In a nutshell, the former holds an “evidential-relation” (EGEK) or a logicist (Hacking) view of statistical evidence where, given statements of hypotheses and data, an evidential appraisal (generally comparative) falls out. Then considerations such as when the hypotheses were constructed drop out, at least for questions of evidence. Philosophers sought “logics of confirmation” for a long time, and some, perhaps many, still do. The idea is to have a context-free logic for inductive inference akin to logics of deductive inference. This contrasts with the position that an evidential appraisal depends on features (of the selection and generation of data) that alter the error probabilities of the procedure, such as “selection effects”. Interested readers can search this blog for statistically oriented discussions (under likelihood principle, law of likelihood) or philosophically oriented ones (novel evidence, double counting, Popper, Carnap). Carnap’s logicism took the form of confirmation theories. Popper rejected this and required “novel” evidence for a severe test, even though he tended to change his definition of novelty and never settled on an adequate notion of severity. On the statistical side, relating to the likelihood principle, is the “Law of Likelihood” (LL). Some relevant posts are here and here. The (LL) regards data x as evidence supporting H1 over H0 iff
Pr(x; H1) > Pr(x; H0).
H0 and H1 are statistical hypotheses that assign probabilities to the random variable X taking value x. On many accounts, the likelihood ratio also measures the strength of that comparative evidence. Here’s Richard Royall (whom Goodman follows):
“If hypothesis A implies that the probability that a random variable X takes the value x is pA(x), while hypothesis B implies that the probability is pB(x), then the observation X = x is evidence supporting A over B if and only if pA(x) > pB(x), and the likelihood ratio, pA(x)/ pB(x), measures the strength of that evidence.” (Royall, 2004, p. 122)
Moreover, “the likelihood ratio, is the exact factor by which the probability ratio [ratio of priors in A and B] is changed.” (ibid. 123)
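A small numerical sketch may help show why, on the (LL), features of how the data were generated can drop out. The numbers are hypothetical and chosen only for illustration: 7 successes and 3 failures, with H1: p = 0.7 and H0: p = 0.5. The two function names are likewise my own. The same data are scored under two sampling plans, a fixed number of trials (binomial) and sampling until the third failure (negative binomial).

```python
from math import comb


def binomial_likelihood(p, successes, failures):
    """Pr(x; p) when the total number of trials was fixed in advance."""
    n = successes + failures
    return comb(n, successes) * p**successes * (1 - p) ** failures


def negative_binomial_likelihood(p, successes, failures):
    """Pr(x; p) when sampling continued until the `failures`-th failure,
    so the final trial is necessarily a failure."""
    n = successes + failures
    return comb(n - 1, successes) * p**successes * (1 - p) ** failures


# Hypothetical data and hypotheses, for illustration only.
successes, failures = 7, 3
p_h1, p_h0 = 0.7, 0.5

lr_fixed_n = (binomial_likelihood(p_h1, successes, failures)
              / binomial_likelihood(p_h0, successes, failures))
lr_stop_rule = (negative_binomial_likelihood(p_h1, successes, failures)
                / negative_binomial_likelihood(p_h0, successes, failures))

print(f"likelihood ratio, fixed n:       {lr_fixed_n:.4f}")
print(f"likelihood ratio, stopping rule: {lr_stop_rule:.4f}")
```

The combinatorial factors cancel in each ratio, so the two likelihood ratios agree: the comparative evidence for H1 over H0 is the same whichever sampling plan produced the data. An error statistical assessment, by contrast, would generally differ between the two plans, since the sampling distributions differ.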
RELATED POSTS: Statistical “reforms” without philosophy are blind
Violations of statistical test assumptions won’t always occur as a result of post-data specifications. Post hoc determinations can, in certain cases, be treated “as if” they were prespecified, but the onus is on the researcher to show the error probabilities aren’t vitiated. Interestingly, the philosopher C.S. Peirce anticipates this contemporary point.
I first saw the video in a tweet by the American Statistical Association.
Goldacre, B. (2008). Bad Science. HarperCollins Publishers.
Goldacre, B. (2016). “Make Journals Report Clinical Trials Properly,” Nature 530, 7 (4 February 2016).
Goodman, S. (1999). ‘Toward Evidence-Based Medical Statistics. 2: The Bayes Factor’, Annals of Internal Medicine, 130(12): 1005-13.
Hacking, I. (1980). “The Theory of Probable Inference: Neyman, Peirce and Braithwaite.” In D. H. Mellor (ed.), Science, belief and behavior: Essays in honor of R.B. Braithwaite. 141-160. Cambridge: CUP.
Royall, R. (2004), “The Likelihood Paradigm for Statistical Evidence” 119-138; Rejoinder 145-151, in M. Taper, and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press.