A new front in the statistics wars? Peaceful negotiation in the face of so-called ‘methodological terrorism’

I haven’t been blogging that much lately, as I’m tethered to the task of finishing revisions on a book (on the philosophy of statistical inference!). But I noticed two interesting blogposts, one by Jeff Leek, another by Andrew Gelman, and even a related petition on Twitter, reflecting a newish front in the statistics wars: When it comes to improving scientific integrity, do we need more carrots or more sticks?

Leek’s post, from yesterday, called “Statistical Vitriol” (29 Sep 2016), calls for de-escalation of the consequences of statistical mistakes:

Over the last few months there has been a lot of vitriol around statistical ideas. First there were data parasites and then there were methodological terrorists. These epithets came from established scientists who have relatively little statistical training. There was the predictable backlash to these folks from their counterparties, typically statisticians or statistically trained folks who care about open source.

I’m a statistician who cares about open source but I also frequently collaborate with scientists from different fields. It makes me sad and frustrated that statistics – which I’m so excited about and have spent my entire professional career working on – is something that is causing so much frustration, anxiety, and anger.

I have been thinking a lot about the cause of this anger and division in the sciences. As a person who interacts with both groups pretty regularly I think that the reasons are some combination of the following.

1. Data is now everywhere, so every single publication involves some level of statistical modeling and analysis. It can’t be escaped.

2. The deluge of scientific papers means that only big claims get your work noticed, get you into fancy journals, and get you attention.

3. Most senior scientists, the ones leading and designing studies, have little or no training in statistics. There is a structural reason for this: data was sparse when they were trained and there wasn’t any reason for them to learn statistics. So statistics and data science wasn’t (and still often isn’t) integrated into medical and scientific curricula.

Even for senior scientists in charge of designing statistical studies?

4. There is an imbalance of power in the scientific process between statisticians/computational scientists and scientific investigators or clinicians. The clinicians/scientific investigators are “in charge” and the statisticians are often relegated to a secondary role. … There are a large number of lonely bioinformaticians out there.

5. Statisticians and computational scientists are also frustrated because there is often no outlet for them to respond to these papers in the formal scientific literature – those outlets are controlled by scientists and rarely have statisticians in positions of influence within the journals.

Since statistics is everywhere (1) and only flashy claims get you into journals (2) and the people leading studies don’t understand statistics very well (3), you get many publications where the paper makes a big claim based on shaky statistics but it gets through. This then frustrates the statisticians because they have little control over the process (4) and can’t get their concerns into the published literature (5).

This used to just result in lots of statisticians and computational scientists complaining behind closed doors. The internet changed all that, everyone is an internet scientist now. 

…Sometimes to get attention, statisticians start to have the same problem as scientists; they need their complaints to get attention to have any effect. So they go over the top. They accuse people of fraud, or being statistically dumb, or nefarious, or intentionally doing things with data, or cast a wide net and try to implicate a large number of scientists in poor statistics. The ironic thing is that these things are the same thing that the scientists are doing to get attention that frustrated the statisticians in the first place.

Just to be 100% clear here I am also guilty of this. I have definitely fallen into the hype trap – talking about the “replicability crisis”. I also made the mistake earlier in my blogging career of trashing the statistics of a paper that frustrated me. …

I also understand the feeling of “being under attack”. I’ve had that happen to me too and it doesn’t feel good. So where do we go from here? How do we end statistical vitriol and make statistics a positive force? Here is my six part plan:

  1. We should create continuing education for senior scientists and physicians in statistical and open data thinking so people who never got that training can understand the unique requirements of a data rich scientific world.
  2. We should encourage journals and funders to incorporate statisticians and computational scientists at the highest levels of influence so that they can drive policy that makes sense in this new data driven time.
  3. We should recognize that scientists and data generators have a lot more on the line when they produce a result or a scientific data set. We should give them appropriate credit for doing that even if they don’t get the analysis exactly right.
  4. We should de-escalate the consequences of statistical mistakes. Right now the consequences are: retractions that hurt careers, blog posts that are aggressive and often too personal, and humiliation by the community. We should make it easy to acknowledge these errors without ruining careers. This will be hard – scientists’ careers often depend on the results they get (recall 2 above). So we need a way to pump up/give credit to/acknowledge scientists who are willing to sacrifice that to get the stats right.
  5. We need to stop treating retractions/statistical errors/mistakes like a sport where there are winners and losers. Statistical criticism should be easy, allowable, publishable and not angry or personal.
  6. Any paper where statistical analysis is part of the paper must have a statistically trained author, a statistically trained reviewer, or both. I wouldn’t believe a paper on genomics that was performed entirely by statisticians with no biology training any more than I would believe a paper with statistics in it performed entirely by physicians with no statistical training.

I think scientists forget that statisticians feel un-empowered in the scientific process and statisticians forget that a lot is riding on any given study for a scientist. So being a little more sympathetic to the pressures we all face would go a long way to resolving statistical vitriol.

What do you think of his six-part plan? More carrots or more sticks? (You can read his post here.)

There may be a fairly wide disparity between the handling of these issues in medicine and biology as opposed to the social sciences. In psychology at least, it appears my predictions (vague, but clear enough) of the likely untoward consequences of their way of handling their “replication crisis” are proving all too true. (See, for example, this post.)

Compare Leek to Gelman’s recent blog on the person raising accusations of “methodological terrorism,” Susan Fiske. (I don’t know if Fiske coined the term, but I consider the analogy reprehensible and think she should retract it.) Here’s Gelman:

Who is Susan Fiske and why does she think there are methodological terrorists running around? I can’t be sure about the latter point because she declines to say who these terrorists are or point to any specific acts of terror. Her article provides exactly zero evidence but instead gives some uncheckable half-anecdotes.

I first heard of Susan Fiske because her name was attached as editor to the aforementioned PPNAS articles on himmicanes, etc. So, at least in some cases, she’s a poor judge of social science research….

Fiske’s own published work has some issues too. I make no statement about her research in general, as I haven’t read most of her papers. What I do know is what Nick Brown sent me: [an article] by Amy J. C. Cuddy, Michael I. Norton, and Susan T. Fiske (Journal of Social Issues, 2005). . . .

This paper was just riddled through with errors. First off, its main claims were supported by t statistics of 5.03 and 11.14 . . . ummmmm, upon recalculation the values were actually 1.8 and 3.3. So one of the claims wasn’t even “statistically significant” (thus, under the rules, was unpublishable).

….The short story is that Cuddy, Norton, and Fiske made a bunch of data errors—which is too bad, but such things happen—and then when the errors were pointed out to them, they refused to reconsider anything. Their substantive theory is so open-ended that it can explain just about any result, any interaction in any direction.

And that’s why the authors’ claim that fixing the errors “does not change the conclusion of the paper” is both ridiculous and all too true….

The other thing that’s sad here is how Fiske seems to have felt the need to compromise her own principles here. She deplores “unfiltered trash talk,” “unmoderated attacks” and “adversarial viciousness” and insists on the importance of “editorial oversight and peer review.” According to Fiske, criticisms should be “most often in private with a chance to improve (peer review), or at least in moderated exchanges (curated comments and rebuttals).” And she writes of “scientific standards, ethical norms, and mutual respect.”

But Fiske expresses these views in an unvetted attack in an unmoderated forum with no peer review or opportunity for comments or rebuttals, meanwhile referring to her unnamed adversaries as “methodological terrorists.” Sounds like unfiltered trash talk to me. But, then again, I haven’t seen Fiske on the basketball court so I really have no idea what she sounds like when she’s really trash talkin’. (You can read Gelman’s post, which also includes a useful chronology of events, here.)

How can Leek’s six-point plan for “peaceful engagement” work in cases where authors deny the errors really matter? What if they view statistics as so much holy water to dribble over their data, mere window-dressing to attain a veneer of science? I have heard some (successful) social scientists say this aloud (privately)! Far from showing that the claims they infer have withstood genuine attempts at falsification (as good Popperians would demand), the entire effort is a self-sealing affair, dressed up with statistical razzmatazz.

So I concur with Gelman, who has no sympathy for those who wish to protect their work from criticism while going merrily on their way using significance tests illicitly. I also have no sympathy for those who think the cure is merely lowering p-values, or embracing methods where the assessment and control of error probabilities are absent. For me, by the way, error probability control is not about securing good long-run error rates, but about ensuring a severe probing of error in the case at hand.

One group may unfairly call the critics “methodological terrorists.” Another may unfairly demonize the statistical methods as the villains to be blamed, banned and eradicated: it’s all the p-value’s fault there’s bad science (never mind that demonstrating failures of replication, and fraudbusting itself, rely on the use of significance tests). Worse, in some circles, methods that neatly hide the damage from biasing selection effects are championed (in high places)![1]

Gelman says the paradigm of erroneously moving from an already spurious p-value to a substantive claim—thereby doubling up on the blunders—is dead. Is it? That would be swell, but I have my doubts, especially in the most troubling areas. They didn’t nail Potti and Nevins, whose erroneous cancer trials had life-threatening consequences; we can scarcely feel confident that such finagling isn’t continuing in clinical trials (see this post), though I think there’s some hope for improvements. But how can it be that “senior scientists, the ones leading and designing studies, have little or no training in statistics,” as Leek says? This is exactly why everyone could say “it’s not my job” in the horror story of the Potti and Nevins fraud. At least social psychologists aren’t using their results to base decisions on chemo treatments for breast cancer patients.

In the social sciences, the replication revolution has raised awareness, no doubt, and it’s altogether a plus that they’re stressing preregistration. But it’s been such a windfall that one cannot help asking: why would a field whose own members frequently write about its “perverse incentives” have an incentive to kill the cash cow? Especially with all its interesting side-lines? Replication research has a life of its own, and offers a career of its own with grants aplenty, so grist for its mills would need to continue. That’s rather cynical, but unless they’re prepared to call out bad science, including mounting serious critiques of widely held experimental routines and measurements (which could well lead to whole swaths of inquiry falling by the wayside), I don’t see how any other outcome is to be expected.

Share your thoughts. I wrote much more, but it got too long. I may continue this…

Related:

Send me related links you find (on comments) and I’ll post them.

1) “There’s no tone problem in psychology,” by Tal Yarkoni

2) “Menschplaining: Three Ideas for Civil Criticism,” by Uri Simonsohn on Data Colada

http://datacolada.org/52

[1] I do not attribute this stance to Gelman, who has made it clear that he cares about what could have happened but didn’t in analyzing tests, and is sympathetic to the idea of statistical tests as error probes:

“But I do not make these decisions on altering, rejecting, and expanding models based on the posterior probability that a model is true. …In statistical terms, an anomaly is a misfit of model to data (or perhaps an internal incoherence of the model), and it can be identified by a (Fisherian) hypothesis test without reference to any particular alternative (what Cox and Hinkley 1974 call “pure significance testing”). … At the next stage, we see science—and applied statistics—as resolving anomalies via the creation of improved models which often include their predecessors as special cases. This view corresponds closely to the error-statistics idea of Mayo (1996)” (Gelman 2011, p. 70).

 

 

Categories: Anil Potti, fraud, Gelman, pseudoscience, Statistics | 15 Comments


15 thoughts on “A new front in the statistics wars? Peaceful negotiation in the face of so-called ‘methodological terrorism’”

  1. With respect to “de-escalating” the consequences of error, maybe it should be “3 strikes and you’re out!”

  2. Peter Chapman

    I spent my entire career as an applied statistician working with applied biologists, ecologists, and chemists.

    At the beginning of this blog you list Leek’s reasons for the present situation. I don’t agree with 1 and 2. I agree with only the first sentence in 3, but the whole of 4. To 3 I would add that, in my experience, a lot of scientists have no interest in, nor aptitude for, statistics, and some demonstrate actual contempt. To junior scientists, many applied statisticians are introverts and poor communicators, and are perceived to be very, very odd. Senior scientists rarely meet a statistician as a member of their peer group, so statisticians are almost always junior and of low status. Even though they don’t understand it, scientists perceive statistics to be easy and routine. But when one of their own shows an aptitude for computing and knows how to enter data into a statistics package, they are perceived to be very clever.

    Sometimes academic statisticians don’t help. I know from personal experience that some also perceive applied statisticians as being of low status and as carrying out routine work. The job of teaching statistics to biologists, ecologists, etc. is often given to the most junior member of staff – i.e. the person who has just completed his or her PhD and is new to teaching. This results in statistics being taught as a bunch of equations. Teaching statistics well to biologists etc. is challenging and requires real skill, backed up by experience.

    The leaders of the statistics profession in the academic societies do not help either. They tend to be senior academics or government statisticians who are very remote from the coalface. I tend to envy engineers: if their models were wrong, their bridges (etc.) would fall down, which leads to a professional organisation that works for the ordinary engineer.

    Finally, it doesn’t help when statisticians argue about the benefits of different methods: NHST, Bayes, Bayes factors, AIC, BIC, likelihood ratios, etc. For a low-powered study they will all flounder, and for a study with good power they will all give a reasonable answer. Because they all utilize the likelihood as the means of introducing data to the model, they are all probably doing the same thing.

    I am surprised, but then again not (because of the above), that design (and study conduct) never gets a mention when discussing the replication crisis. I am pretty sure that poor design and poor study conduct are a major cause or contributor.

    And finally, I don’t believe that Leek’s 6 point solution will work for the reasons outlined above.

    I guess this doesn’t pass the constructive test.

    • Peter: Thanks for your comment; I’ll study later which points you agree with and which not. I think your points about teaching and about the leaders of academic societies are very good ones. On your list, at least statistical hypothesis tests (of the Fisherian and N-P varieties) allow you to control and assess power (and other error probabilities). The others in the list (except NHST, which is usually associated with an illicit distortion of Fisherian tests, permitting a jump from statistical to substantive, among other crimes) don’t even compute power. AIC, even if the assumptions are satisfied, has been shown to have lousy error probabilities. Bayes factors and likelihood ratios suffer from both comparativism (with lots of latitude) and lack of error probability control.

      • To clarify my reference to power: suppose that you are trying to compare two groups and you simulate many data sets (representing two groups with different means) from a model that you know has a power of 50% under NHST. You compute p-values for all these data sets and find that 50% are < 0.05, as expected. Now fit a Bayesian model (with a noninformative prior) to all the data sets and you find the 95% posterior interval includes zero 50% of the time. Now try AIC on the same data sets and you find there is a non-linear one-to-one mapping between AIC values and p-values, so an AIC value is a transformation of a p-value. If you work out what value of AIC corresponds to p = 0.05 and use this AIC value as a cutoff, you will find that AIC has the same power as NHST.

        At least this is what I am finding in my so far limited investigation into these matters; a rough sketch of the kind of simulation I have in mind is given below. But I am happy to be proved wrong.
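        A minimal sketch of such a simulation, under some illustrative assumptions (the per-group n = 20 and effect size 0.64 are choices tuned to give roughly 50% power for a pooled two-sample t-test, and AIC here compares a common-mean with a separate-means Gaussian model; none of these specifics are taken from the comment itself):

```python
# Illustrative simulation only: sample size, effect size, and the Gaussian AIC
# setup are assumptions chosen to match the ~50% power scenario described above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
n, delta, alpha, n_sim = 20, 0.64, 0.05, 5000   # ~50% power for a pooled two-sample t-test
N = 2 * n

t_crit = stats.t.ppf(1 - alpha / 2, N - 2)
# For the two-group Gaussian model, Delta_AIC = N*log(RSS1/RSS0) + 2, and
# RSS0/RSS1 = 1 + t^2/(N-2), so Delta_AIC is a monotone function of the t statistic
# (hence of the p-value). The AIC cutoff matching p = 0.05 is therefore:
aic_cut = -N * np.log(1 + t_crit**2 / (N - 2)) + 2

rej_p = rej_ci = rej_aic = 0
for _ in range(n_sim):
    x = rng.normal(0.0, 1.0, n)          # group 1
    y = rng.normal(delta, 1.0, n)        # group 2, shifted mean

    t, p = stats.ttest_ind(x, y)         # pooled-variance two-sample t-test
    rej_p += p < alpha

    # Flat-prior interval for the mean difference; with a noninformative prior it
    # numerically coincides with the classical 95% confidence interval.
    sp2 = ((n - 1) * x.var(ddof=1) + (n - 1) * y.var(ddof=1)) / (N - 2)
    half = t_crit * np.sqrt(sp2 * 2 / n)
    diff = y.mean() - x.mean()
    rej_ci += not (diff - half <= 0.0 <= diff + half)

    # AIC for separate-means vs common-mean Gaussian models (MLE variance).
    pooled = np.concatenate([x, y])
    rss0 = ((pooled - pooled.mean()) ** 2).sum()
    rss1 = ((x - x.mean()) ** 2).sum() + ((y - y.mean()) ** 2).sum()
    rej_aic += (N * np.log(rss1 / rss0) + 2) < aic_cut

print(rej_p / n_sim, rej_ci / n_sim, rej_aic / n_sim)
# All three rejection rates come out near 0.5, and the three decision rules coincide
# on each simulated data set, illustrating the claimed one-to-one mapping.
```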

        • Peter: I don’t speak of NHST, but rather of hypothesis tests or significance tests, or even N-P tests. You’re mixing all kinds of things. What is it for a model to have power? What is it for an AIC model selection procedure to have power?

          • Does it matter what we call them? I don’t think I’m mixing anything. I’m saying that for a given set of data different people will recommend different methods of analysis. I’m then saying that all the methods will give the same answer. If you carry out an N-P test and an AIC comparison on the same data they will give the same answer. This is because there is a one-to-one mapping between a p-value and an AIC value, so one is simply a transformation of the other. AIC helps you decide whether to include a parameter in a model. This is the same as deciding whether mu1 – mu2 = 0 in an N-P test.

  3. Deborah:

    You write, “Gelman says the paradigm of erroneously moving from an already spurious p-value to a substantive claim—thereby doubling up on the blunders–is dead. Is it? That would be swell, but I have my doubts. . .”

    I’d say that the paradigm is dead among serious scientists. Yes, there will be legacy figures such as John Bargh or Roy Baumeister who won’t want to give up their previous successes, not to mention people such as Malcolm Gladwell who have made their living from such storytelling, but I think the mainstream of science is moving away from that paradigm. It’s “dead” in the sense that I think nothing can resuscitate it.

    • Andrew: Thanks for your comment. I’m not sure if you know that it was “dead” among serious scientists 30, 40+ years ago. Morrison & Henkel’s 1970 The Significance Test Controversy, which contains papers from decades earlier, was in my doctoral dissertation (“philosophy of statistics”) and in my Lakatos Prize talk from 1999. Lakatos cites bad significance testing way back when. I knew and corresponded with Meehl over many years. (Meehl and I would laugh over what seemed to us work best seen as “for entertainment only.”) This seems to me to be an area that has actually gone backwards in many ways (at least the people writing in Morrison & Henkel knew they were fallacies), and now that “replication research” is its own research field with abundant funding and positions of honor, grist for its mills will continue.

  4. Michael Lew

    What we need to do is to make it apparent to all that statistics should not, and cannot, stand as a substitute for scientific reasoning. The results of a statistical analysis serve to clarify a portion of the information that should feed into the rational and principled argument that leads to a scientific assessment of ideas and models. Statistics (and P-values in particular, I suppose) should never be used as a primary gatekeeper against false inferences.

    I rate Leek’s six points on the basis of whether they help or hinder that understanding.

    1. Education: It’s hard to argue against the goodness of education but, judging by the observable outcomes, the statistical education provided over the last 20 years (I did not pay attention before then) has been quite deficient. The need for a change in the nature of educational material and process is well supported on both empirical and theoretical grounds. More of the same would be wasteful and damaging.

    2. Statisticians in journals: Sure, it probably wouldn’t hurt. However, if I were to choose between that and a reduction in the influence of the hype machinery of the ‘major’ journals, I would go for the latter.

    3. Give “data generators” some slack and recognition. Sure, but is that a change? Seems like we do it all the time.

    4. De-escalate the consequences of statistical mistakes: Really? How often are people dismissed and expelled from scientific societies because of honest statistical mistakes? Nearly never. (How often for obvious intentional fraud? Not often enough.)

    5. Stop statistical criticism of published papers from being a sport with losers: Yes! This is a good one. Scientific progress entails revisions of understandings at all levels. Revisions of the interpretations of published datasets should be possible without substantive loss of kudos. It should be technically possible now that we use non-paper-based publication. Comments from readers should be dealt with by authors far more often and far more openly than they are at the moment.

    6. Statistically trained author (or reviewer): I agree with that idea, but not for the reason that might be supposed. In a large number of papers the advice from a conventional statistician would not be particularly valuable of itself. However, having separate people responsible for analysis and scientific inferences might well serve to improve the quality of the rational and principled argument that is crucial to good scientific inference.

  5. I’ll return to comments later tonight; I’m hosting a big party for all of the philo-sufferers at Thebes.

  6. I made the following comment over on Leek’s blog, in response to another commentator:

    Deborah Mayo, replying to Benjamin Kirkup:
    I think the point in C is quite right and rarely acknowledged. People talk as if the whole problem is significance tests or some other statistical method, when in fact what’s going on is (often) largely political. The reason Stapel was able to get by for so long (without collecting any data) is that he cleverly chose topics that many people, liberals especially, would agree with ahead of time, and more than that, wanted to see given some evidence-based argument. I may bring this comment over to my current blog on this.

    It’s another reason this kind of work won’t end.

  7. Related recent blogs:
    1) “There’s no tone problem in psychology,” by Tal Yarkoni

    2) “Menschplaining: Three Ideas for Civil Criticism,” by Uri Simonsohn on Data Colada

    http://datacolada.org/52

  8. I read on Twitter that Fiske is dropping the term “methodological terrorism”. Maybe my remark influenced her, though it’s unlikely she’d read my blog. A good thing, in any event.

  9. Cartoon on the replication crisis:
    https://thenib.com/repeat-after-me

