T. Kepler: “Trouble with ‘Trouble at the Lab’?” (guest post)

Tom Kepler’s guest post arose in connection with my November 9 post & comments.


Professor Thomas B. Kepler
Department of Microbiology
Department of Mathematics & Statistics
Boston University School of Medicine

There is much to say about the article in the Economist, but the first thing to note is that it is far more balanced than its sensational headline promises. Promising to throw open the curtain on “Unreliable research” is mere click-bait for the science-averse readers who have recently found validation against their intellectual insecurities in the populist uprising against the shadowy world of the scientist. What with the East Anglia conspiracy, and so on, there’s no such thing as “too skeptical” when it comes to science.

There is some remarkably casual reporting in an article that purports to be concerned with mechanisms to ensure that inaccuracies are not perpetuated.

For example, the authors cite the comment in Nature by Begley and Ellis and summarize it thus: …scientists at Amgen, an American drug company, tried to replicate 53 studies that they considered landmarks in the basic science of cancer, often co-operating closely with the original researchers to ensure that their experimental technique matched the one used first time round. Stan Young, in his comments to Mayo’s blog adds, “These claims can not be replicated – even by the original investigators! Stop and think of that.” But in fact the role of the original investigators is described as follows in Begley and Ellis: “…when findings could not be reproduced, an attempt was made to contact the original authors, discuss the discrepant findings, exchange reagents and repeat experiments under the authors’ direction, occasionally even in the laboratory of the original investigator.” (Emphasis added.) Now, please stop and think about what agenda is served by eliding the tempered language of the original.

Both the Begley and Ellis comment and the brief correspondence by Prinz et al. also cited in this discussion are about laboratories in commercial pharmaceutical companies failing to reproduce experimental results. While deciding how to interpret their findings, it would be prudent to bear in mind the insight from Harry Collins, the sociologist of science, who is paraphrased in the Economist piece as indicating that “performing an experiment always entails what sociologists call “tacit knowledge”—craft skills and extemporisations that their possessors take for granted but can pass on only through example. Thus if a replication fails, it could be because the repeaters didn’t quite get these je-ne-sais-quoi bits of the protocol right.” Indeed, I would go further and conjecture that few experimental biologists would hold out hope that any one laboratory could claim the expertise necessary to reproduce the results of 53 ground-breaking papers in diverse specialties, even within cancer drug discovery. And to those who are unhappy that authors often do not comply with the journals’ clear policy of data-sharing, how do you suppose you would fare in getting such data from the pharmaceutical companies that wrote these damning papers? Or the authors of the papers themselves? Nature had to clarify, writing two months after the publication of Begley and Ellis, “Nature, like most journals, requires authors of research papers to make their data available on request. In this less formal Comment, we chose not to enforce this requirement so that Begley and Ellis could abide by the legal agreements [they made with the original authors].” There seems to be a good reason that the data are not being provided, but it does make pursuing the usual self-corrective course of science ironically unavailable. Furthermore, one might be persuaded to grant the benefit of the doubt beyond this one case to authors who don’t respond to demands for all their data and metadata immediately. They, too, may have reasons (other than concealing ineptitude) for their failure to respond to requests fast enough to satisfy the requestor.

I agree that there are problems with the way science is done, and that serious attention must be paid to making its practice more efficient and fairer to its practitioners. There is much to be gained by reforming peer-review, for example, and a great deal of progress is being made. The hyper-competitive atmosphere of contemporary science and the attendant implicit directive to value speed over reliability is deeply problematic and unfair to many thoughtful young scientists. (I have often been frustrated at others’ scooping me while using flawed analyses. What was frustrating in these cases was not that they were wrong, but that they were right, in spite of the naiveté of their methods.) The industrialization of science and the growth of “team science” threaten to exacerbate the very real problems of elevating a small number of elite PIs to mythic status at the expense of many very good people. Indeed, the self-corrective nature of science is, unfortunate as it may be, strictly impersonal. The scientific method does not provide assurance that its practitioner will be treated justly and fairly. At the same time, I do not believe that fomenting rebellion (Stan Young’s comment: “Why should the taxpayer fund such an unreliable enterprise?”) is going to be a productive strategy.

The problem is that the non-practicing public–even the very well-educated–have an oversimplified conception of how science works. It is not the case that there is a finite number of propositions that make up the instantaneous canon, and that this set of common beliefs grows and shrinks through the publication of experimental results. As we all know too well, the condition is far messier than that. Mistaken results and bad theories are not typically dispatched with a single fatal blow, but instead die through simple neglect. Perhaps it is lamentable that “to outsiders they will appear part of the scientific canon”, as the Economist opines, but that is simply not relevant as a measure of the ability of the scientific enterprise to self-correct. The reputations of poor scientists may survive longer than they should but poor ideas are dealt with very effectively.

Where I agree strongly with the anonymous authors is in their contention that “Budding scientists must be taught technical skills, including statistics, and must be imbued with scepticism towards their own results and those of others.” I would further urge that scientists learn respect for their peers in other disciplines. I am fortunate to have had (and currently hold) appointments in Biomedical Basic Science departments and in Math and Statistics departments, and have been on the receiving (and alas, giving) end of interdisciplinary prejudices in both directions. Where statisticians see experimental biomedical researchers as corrupt strivers in need of policing, biologists see statisticians as uninterested in actual science and perfectly willing to hold up its progress indefinitely in the name of some imagined platonic ideal.

Maybe they’re both right. But maybe raising the next generation to be just a little more appreciative and less defensive will contribute to the continued growth of the scientific worldview we all share.

And in that vein, I’m fine with the statistical argument presented in the Economist article. It does not hew to any coherent philosophical conception of statistics, but it is clear, correct as far as it goes, and conveys a correct understanding to the reader.

Categories: junk science, reforming the reformers, science communication, Statistics | 20 Comments


20 thoughts on “T. Kepler: “Trouble with ‘Trouble at the Lab’?” (guest post)”

  1. Torg

    “The problem is that the non-practicing public–even the very well-educated–have an oversimplified conception of how science works. It is not the case that there is a finite number of propositions that make up the instantaneous canon, and that this set of common beliefs grows and shrinks through the publication of experimental results. As we all know too well, the condition is far messier than that. Mistaken results and bad theories are not typically dispatched with a single fatal blow, but instead die through simple neglect. ”

    The non-practicing public expects scientists to be using the scientific method. This method does not include supporting your theory by disproving that there is no correlation or that two means are exactly equal. Theories dying due to neglect is a symptom of the NHST hybrid. Paul Meehl noted this decades ago for psychology; the problem has since spread into medicine and biology.

    Does this sound familiar to any preclinical researchers? Paul Meehl in 1978 on the state of psychology:
    “…Perhaps the easiest way to convince yourself is by scanning the literature of soft psychology over the last 30 years and noticing what happens to theories. Most of them suffer the fate that General MacArthur ascribed to old generals—They never die, they just slowly fade away. In the developed sciences, theories tend either to become widely accepted and built into the larger edifice of well-tested human knowledge or else they suffer destruction in the face of recalcitrant facts and are abandoned, perhaps regretfully as a “nice try.” But in fields like personology and social psychology, this seems not to happen. There is a period of enthusiasm about a new theory, a period of attempted application to several fact domains, a period of disillusionment as the negative data come in, a growing bafflement about inconsistent and unreplicable empirical results, multiple resort to ad hoc excuses, and then finally people just sort of lose interest in the thing and pursue other endeavors.”

    Theoretical Risks and Tabular Asterisks:
    Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology
    http://www.psych.umn.edu/people/meehlp/113TheoreticalRisks.pdf

    • “Theories dying due to neglect is a symptom of the NHST hybrid.”
      Oh please. We’re all very familiar with Meehl, but this was not his point. He rued the lack of real theories in soft psychology—they were scarcely out there being neglected. The interpretive fads that merely blew away did so because they were not real theories.
      And by the way, the so-called NHST is not a hybrid of Fisher and Neyman-Pearson; it is a made-up animal invented in psychology that is neither N-P, nor Fisher, nor any combination or subset of the two.

      • Torg

        Mayo, the best evidence I’ve seen is that the hybrid was created by accident by E.F. Lindquist in 1938-1940. If you read the Meehl paper I cited, he directly blames significance testing in the case of testing a hypothesis that does not correspond to a theoretical prediction.

        “But, you may say, what has all this got to do with significance testing? Isn’t the social scientist’s use of the null hypothesis simply the application of Popperian (or Bayesian) thinking in contexts in which probability plays such a big role? No, it is not. One reason it is not is that the usual use of null hypothesis testing in soft psychology as a means of “corroborating” substantive theories does not subject the theory to grave risk of refutation modus tollens, but only to a rather feeble danger.”

        • Torg: Again, we know what Meehl said, some of us even knew Paul, and we know why rejecting a null fails to provide evidence for substantive claims and theories. If NHST directs or allows one to commit such “fallacies of rejection” then I aver it is not a hybrid or an “account of tests” but simply an abuse of significance tests, both of the N-P and Fisherian varieties. Moreover it is one that its founders explicitly rejected.

          Fallacies of rejection are of two main types: moving directly from statistical significance to (a) causal or even genuine experimental effects, and (b) an inferred effect that exceeds the magnitude of discrepancy (from null) that is warranted.
          Both fallacies are immediately scotched by a severity assessment of tests (or CIs). Meehl, by the way, endorsed this severity analysis (grouping it with Salmon and Popper). The blog may be searched for more on this.
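
          A minimal numerical sketch of that severity assessment may help (the one-sided Normal test, the known sigma, and all of the numbers below are illustrative assumptions, not figures from this exchange):

          ```python
          # Sketch: post-rejection severity assessment for a one-sided Normal test
          # H0: mu <= mu0 vs H1: mu > mu0, sigma known. All numbers are illustrative.
          from scipy.stats import norm

          mu0, sigma, n = 0.0, 2.0, 100      # null value, known SD, sample size
          se = sigma / n ** 0.5              # standard error of the sample mean
          xbar_obs = 0.4                     # observed mean: z = 2.0, rejects H0 at the 0.05 level

          def severity(mu1):
              """SEV(mu > mu1) after rejection: P(Xbar <= xbar_obs; mu = mu1)."""
              return norm.cdf((xbar_obs - mu1) / se)

          for mu1 in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
              print(f"SEV(mu > {mu1:.1f}) = {severity(mu1):.3f}")
          # SEV(mu > 0.0) is about 0.98, SEV(mu > 0.4) = 0.50, SEV(mu > 0.5) is about 0.31:
          # the rejection warrants "mu > 0" but not "mu > 0.4", blocking fallacy (b).
          ```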

          • Torg

            “we know why rejecting a null fails to provide evidence for substantive claims and theories.”

            Yet use of the hybrid is growing, if anything. Sorry, but if you know all this already, the majority of researchers, and apparently those teaching them statistics, do not. Reading the frequentist vs. Bayesian arguments is funny when you see what goes on from the inside; that disagreement is irrelevant in comparison. This is what makes me think you really do not understand intuitively the problem of the hybrid.

            I was reading comments on a post a few days back and saw someone arguing for teaching some combination of Bayesian and frequentist approaches in order to reduce confusion. I am not sure whether that strategy will be effective, but I do think that the confusion and the hybrid problem are much bigger than those statisticians who are not interacting with medical/social researchers realize. If they knew, they would be horrified and would never trust anything published, never want to take any drug recommended by their doctor, and never trust any “evidence-based” government policy to do anything except waste money.

            It very well could be mostly snake oil at this point; we won’t know unless a movement for a vast number of replication studies forms and is successful. These stories about pharma companies being able to replicate only a minority of published reports support that thought. I am glad it is making it into the news now.

            My hope is that researchers’ scientific intuitions have managed to still guide us down a somewhat correct path despite the widespread incorrect use and interpretation of p-values over the last few decades. The problem is not fraud; it is confusion.

            • Torg: You claim: “This is what makes me think you really do not understand intuitively the problem of the hybrid.”
              All I can say is that I wrote down the specifics of the fallacies of rejection associated with the (unapproved, distorted) animal some call the hybrid of testing. Why don’t you write down your idea of its features and then we can see who understands intuitively what the great big problem of the terrifying hybrid is. And is not.

              And further, although you might think the question of underlying goals and the role of probability in statistical inference has nothing to do with the growing “confusion” you keep mentioning, think again. These issues are at the heart of it, and those ARE the issues between frequentists and Bayesians, if one scratches just a bit below the surface.

              Beyond those fundamental issues of the nature and role of probability in inductive-statistical learning, the problems are straightforward–failing to check assumptions, satisfy data generation and modeling requirements, or engaging in pseudoscience.

              On dual teaching, please check out my Greenland post (and search others):
              http://errorstatistics.com/2013/06/26/why-i-am-not-a-dualist-in-the-sense-of-sander-greenland/

              On hybrid hysteria and the like, see:

              http://errorstatistics.com/2013/10/31/whipping-boys-and-witch-hunters/

              http://errorstatistics.com/2013/01/19/saturday-night-brainstorming-and-task-forces-2013-tfsi-on-nhst/

              • Torg

                “Why don’t you write down your idea of its features and then we can see who understands intuitively what the great big problem of the terrifying hybrid is.”

                1. It provides no additional information beyond plotting the data. The p-value in this case is the equivalent of reporting the average color of a picture rather than showing the whole picture.

                2. It is in widespread use and all attempts to stop this have failed thus far.

                3. The majority of those reporting p values are not aware of #1, and if you try to explain it they will not change but instead attack or dismiss the messenger.

                I have read through and seen that you agree with the whole Meehl argument and that the hybrid (it is a hybrid and is not a fault of Fisher or Neyman, look at the historical development and ask researchers)

                “Beyond those fundamental issues of the nature and role of probability in inductive-statistical learning, the problems are straightforward–failing to check assumptions, satisfy data generation and modeling requirements, or engaging in pseudoscience. ”

                These are impractical for many research projects. I guess my question is: if you knew it was impractical to accomplish those things, would you say to avoid statistical tests and instead only describe the data and methods?

                • Let’s see if there’s any content here about the method at the center of controversy:
                  (1) The p-value is something that doesn’t go beyond the data, but maybe reduces the data, such as by averaging, (2) no one’s been able to stop its widespread use, and (3) Most people who use it don’t know (1).
                  Wow.

                  • Torg

                    As I said, you do not appear to understand the magnitude or nature of the problem. You want a technical answer to argue against when the problem is social. Please volunteer or take a job teaching medical/grad students and being asked for advice by their superiors. I have come to think the best way to understand is via experience.

              • Torg

                to correct this line:
                “I have read through and seen that you agree with the whole Meehl argument and that the hybrid (it is a hybrid and is not a fault of Fisher or Neyman, look at the historical development and ask researchers)”

                .. is bad.

                • I fail to see any description of a “hybrid method” in what you wrote, or even of “a method”.
                  The only content regarding the “very, very bad method” is that it doesn’t go beyond the data, and might even reduce the data.

                  Well, it’s an interesting question as to whether a reduction of data actually provides more information to humans than all of the ‘blooming buzzing confusion’ (James) of a report of every aspect and feature of data–even though in a sense, it doesn’t go beyond the data.
                  The answer is yes.
                  The truth is that we can’t even perceive without being selective, without having an interest and a perspective, eliminating much. Fisher, it’s quite true, regarded one of the 3 main functions of statistics as taking a whole pile of detailed observations and rendering them graspable by a human mind so as to try to understand and learn from them.

                  Stare at a page of numbers that record 50 features of each of the 5000 samples of water and fish from Fukushima over 1 day. Is the radiation level beyond x? Are the toxic concentrations higher than yesterday? Is it safe for pregnant women to eat more than z fish caught today? These questions, and the stipulations of what to “record”, already invoke vast condensations of observations, without which all language, communication, and certainly all learning would be impossible.
                  Of course, the second key role for statistics is inductive/ampliative inference, and that does, literally, go beyond the data–but only thanks to having reduced the initial mess of observations.

                  Anyway, I see no description of the hybrid method in any of your comments.

  2. Tom: Thanks for your post. Maybe you can say some more about the problem at Duke, since you were there. So you’re not too concerned with the erroneous computations promulgated in these attempts to show, by mixing power and Bayesian computations, a high rate of false positives?

    • Tom Kepler

      Re: Bayesian computations. I read the statistical bit in the article as a dressed-up version of the rare disease problem, which illustrates the counterintuitive notion that even a good diagnostic test reports more false positives than true positives if the condition being diagnosed is rare enough. I think that in that situation, both Bayesian computations and the term “power” are warranted.
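
      To make that arithmetic concrete, here is a minimal sketch of the rare-disease calculation applied to tested hypotheses (the prior truth rate of one in ten echoes the Ioannidis figure quoted below; the power of 0.8 and the 0.05 significance level are conventional assumptions supplied for illustration, not numbers from the article):

      ```python
      # Sketch of the rare-disease arithmetic applied to tested hypotheses.
      # prior_true = 0.1 echoes the "one in ten" figure quoted below;
      # power = 0.8 and alpha = 0.05 are conventional assumptions, not from this post.
      def positive_predictive_value(prior_true, power, alpha):
          """Fraction of 'significant' results that reflect true effects."""
          true_pos = prior_true * power           # true hypotheses correctly detected
          false_pos = (1 - prior_true) * alpha    # false hypotheses declared significant
          return true_pos / (true_pos + false_pos)

      ppv = positive_predictive_value(prior_true=0.1, power=0.8, alpha=0.05)
      print(f"PPV = {ppv:.2f}; about {1 - ppv:.0%} of positives would be false")
      # PPV = 0.64: even a well-powered test yields roughly 36% false positives
      # when only one hypothesis in ten is true to begin with.
      ```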

      But I would object strongly if I were being asked to take seriously the idea that a population of hypotheses is out there to be randomly sampled, with established prior probabilities of truth and falsehood.

      Having now gone back and re-read the article, I see it contains the passage, “Dr Ioannidis argues that in his field, epidemiology, you might expect one in ten hypotheses to be true.” So yes, I do object to the presumption underlying this statement.

      As to Duke…to the extent that my experience there bears on the contentious issues involving experimental reproducibility and statistics, I’ll share my observations. Stay tuned.

      • Tom: I am staying tuned and eagerly await your observations, whether they bear directly on the scientific/statistical issues or the socio-political ones.

  3. Anon

    “Where statisticians see experimental biomedical researchers as corrupt strivers in need of policing, biologists see statisticians as uninterested in actual science and perfectly willing to hold up its progress indefinitely in the name of some imagined platonic ideal.”

    Can you please explain or illustrate the imagined platonic ideal in whose name statisticians are willing to hold up scientific progress?

  4. Tom Kepler

    My collaborators and I frequently adapt or develop novel experimental methods; my group analyzes the data produced using them, trying to account for everything we know about the instruments, the designs, and the biology. The methods we develop can be complicated, and the software implementing them takes time to write. Meanwhile, another team member might do a rough analysis, get a reasonable answer, and wonder why we can’t just use that simpler and quicker result and publish already.

    Finding the appropriate balance is not easy.

    • Tom: Was this your answer to Anon’s question about the platonic ideal? I assume not, unless that would be the first group (your group)?

  5. Pingback: Friday links: the history of “Big Data” in ecology, inside an NSF panel, funny Fake Science, and more | Dynamic Ecology

  6. Pingback: S. Stanley Young: More Trouble with ‘Trouble in the Lab’: S. Stanley Young (Guest post) | Error Statistics Philosophy

I welcome constructive comments for 14-21 days
