Who ya gonna call for statistical Fraudbusting? R.A. Fisher, P-values, and error statistics (again)

If there’s somethin’ strange in your neighborhood. Who ya gonna call? (Fisherian Fraudbusters!)*

*[adapted from R. Parker’s “Ghostbusters”]

When you need to warrant serious accusations of bad statistics, if not fraud, where do scientists turn? Answer: To the frequentist error statistical reasoning and to p-value scrutiny, first articulated by R.A. Fisher[i]. The latest accusations of big time fraud in social psychology concern the case of Jens Förster. As Richard Gill notes:

The methodology here is not new. It goes back to Fisher (founder of modern statistics) in the 30’s. Many statistics textbooks give as an illustration Fisher’s re-analysis (one could even say: meta-analysis) of Mendel’s data on peas. The tests of goodness of fit were, again and again, too good. There are two ingredients here: (1) the use of the left-tail probability as p-value instead of the right-tail probability. (2) combination of results from a number of independent experiments using a trick invented by Fisher for the purpose, and well known to all statisticians. (Richard D. Gill)
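Gill's second ingredient, combining p-values from independent experiments, is commonly known as Fisher's method: under the null hypothesis, minus twice the sum of the log p-values follows a chi-square distribution with 2k degrees of freedom for k experiments. The sketch below is only an illustration of that idea. The function name and the example p-values are made up here and are not taken from the report.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    Returns (statistic, combined_p), where statistic = -2 * sum(ln p_i)
    is referred to a chi-square distribution with 2k degrees of freedom.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # statistic / 2
    # For an even number of degrees of freedom (2k) the chi-square
    # upper-tail probability has a closed form:
    #   exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return 2.0 * half_stat, math.exp(-half_stat) * total

# Hypothetical left-tail p-values from four independent experiments.
example_left_tail_p = [0.04, 0.10, 0.02, 0.07]
statistic, combined = fisher_combine(example_left_tail_p)
print(statistic, combined)
```

For the "too good to be true" question, the inputs would be left-tail probabilities (ingredient 1), so a tiny combined value says the fits are jointly closer to the model than chance would allow.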


Those who deny the value of statistical significance test reasoning should wonder at how, correctly used and understood, it can be the basis for charges of bias, distortion and fraud (apparently depriving Förster of an expected Humboldt Foundation award this week).[ii] For a related post: (https://errorstatistics.com/2013/06/14/p-values-cant-be-trusted-except-when-used-to-argue-that-p-values-cant-be-trusted/)

The following is from a discussion by Neuroskeptic in Discover Magazine with lots of links, and useful graphs. 

On the “Suspicion of Scientific Misconduct by Jens Förster” 

By Neuroskeptic | May 6, 2014 4:42 pm


One week ago, the news broke that the University of Amsterdam is recommending the retraction of a 2012 paper by one of its professors, social psychologist Prof Jens Förster, due to suspected data manipulation. The next day, Förster denied any wrongdoing.

Shortly afterwards, the Retraction Watch blog posted a (leaked?) copy of an internal report that set out the accusations against Förster.

The report, titled Suspicion of scientific misconduct by Dr. Jens Förster, is anonymous and dated September 2012. Reportedly it came from a statistician(s) at Förster’s own university. It relates to three of Förster’s papers, including the one that the University says should be retracted, plus two others. [Below is the abstract from Retraction Watch.]

Here we analyze results from three recent papers (2009, 2011, 2012) by Dr. Jens Förster from the Psychology Department of the University of Amsterdam. These papers report 40 experiments involving a total of 2284 participants (2242 of which were undergraduates). We apply an F test based on descriptive statistics to test for linearity of means across three levels of the experimental design. Results show that in the vast majority of the 42 independent samples so analyzed, means are unusually close to a linear trend. Combined left-tailed probabilities are 0.000000008, 0.0000004, and 0.000000006, for the three papers, respectively. The combined left-tailed p-value of the entire set is p = 1.96 * 10^-21, which corresponds to finding such consistent results (or more consistent results) in one out of 508 trillion (508,000,000,000,000,000,000). Such a level of linearity is extremely unlikely to have arisen from standard sampling. We also found overly consistent results across independent replications in two of the papers. As a control group, we analyze the linearity of results in 10 papers by other authors in the same area. These papers differ strongly from those by Dr. Förster in terms of linearity of effects and the effect sizes. We also note that none of the 2284 participants showed any missing data, dropped out during data collection, or expressed awareness of the deceit used in the experiment, which is atypical for psychological experiments. Combined these results cast serious doubt on the nature of the results reported by Dr. Förster and warrant an investigation of the source and nature of the data he presented in these and other papers.

Read the whole report here.

A vigorous discussion of the allegations has been taking place in this Retraction Watch comment thread. The identity and motives of the unknown accuser(s) are one main topic of debate; another is whether Förster’s inability to produce raw data and records relating to the studies is suspicious or not.

The actual accusations have been less discussed, and there’s a perception that they are based on complex statistics that ordinary psychologists have no hope of understanding. But as far as I can see, they are really very simple – if poorly explained in the report – so here’s my attempt to clarify the accusations.

First a bit of background.

The Experiments

In the three papers in question, Förster reported a large number of separate experiments. In each experiment, participants (undergraduate students) were randomly assigned to three groups, and each group was given a different ‘intervention’. All participants were then tested on some outcome measure.

In each case, Förster’s theory predicted that one of the intervention groups would test low on the outcome measure, another would be medium, and another would be high (Low < Med < High).

Generally the interventions were various tasks designed to make the participants pay attention to either the ‘local’ or the ‘global’ (gestalt) properties of some visual, auditory, smell or taste stimulus. Local and global formed the low and high groups (though not always in that order). The Medium group either got no intervention, or a balanced intervention with neither a local nor global emphasis. The outcome measures were tests of creative thinking, and others.

The Accusation

The headline accusation is that the results of these experiments were too linear: that the mean outcome scores of the three groups, Low, Medium, and High, tended to be almost evenly spaced. That is to say, the difference between the Low and Medium group means tended to be almost exactly the same as the difference between the Medium and High means.

The report includes six montages, each showing graphs from one batch of the experiments. Here’s my meta-montage of all of the graphs:

[Image: forster_linear]

This montage is the main accusation in a nutshell: those lines just seem too good to be true. The trends are too linear, too ‘neat’, to be real data. Therefore, they are… well, the report doesn’t spell it out, but the accusation is pretty clear: they were made up.

The super-linearity is especially stark when you compare Förster’s data to the accuser’s ‘control’ sample of 21 recently published, comparable results from the same field of psychology:

[Image: control_papers]

It doesn’t look good. But is that just a matter of opinion, or can we quantify how ‘too good’ they are?

The Evidence

Using a method they call delta-F, the accusers calculated the odds of seeing such linear trends, even assuming that the real psychological effects were perfectly linear. These odds came out as 1 in 179 million, 1 out of 128 million, and 1 out of 2.35 million in each of the three papers individually.

Combined across all three papers, the odds were one out of 508 quintillion: 508,000,000,000,000,000,000. (The report, using the long scale, says 508 ‘trillion’ but in modern English ‘trillion’ refers to a much smaller number.)
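The step from the combined p-value to these odds is a one-line reciprocal, shown here as a rough check on the rounded figure quoted in the abstract:

$$
\frac{1}{1.96 \times 10^{-21}} \approx 5.1 \times 10^{20}
$$

That is roughly 510 quintillion on the short scale (one quintillion is $10^{18}$, which the long scale calls a trillion). The small gap between 510 and the quoted 508 would be explained if the reported 1.96 is itself a rounded value; that is an assumption here, not something the report states.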

So the accusers say:

Thus, the results reported in the three papers by Dr. Förster deviate strongly from what is to be expected from randomness in actual psychological data.

How so?

The Statistics

Unless the sample size is huge, a perfectly linear observed result is unlikely, even assuming that the true means of the three groups are linearly spaced. This is because there is randomness (‘noise’) in each observation. This noise is measurable as the variance in the scores within each of the three groups.

For a given level of within-group variance, and a given sample size, we can calculate the odds of seeing a given level of linearity in the following way.

delta-F is defined as the difference in the sum of squares accounted for by a linear model (linear regression) and a nonlinear model (one-way ANOVA), divided by the mean squared error (within-group variance). The killer equation from the report:


If this difference is small, it means that a nonlinear model can’t fit the data any better than a linear one – which is pretty much the definition of ‘linear’.
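In symbols, the verbal definition of delta-F can be sketched as follows. The notation is assumed here for illustration and is not copied from the report:

$$
\Delta F = \frac{SS_{\text{ANOVA}} - SS_{\text{linear}}}{MS_{\text{within}}}
$$

With three groups, the one-way ANOVA model spends 2 degrees of freedom on the group means and the straight line spends 1, so the numerator carries 1 degree of freedom. For equal group sizes $n$ and group means $\bar{y}_1, \bar{y}_2, \bar{y}_3$, the numerator reduces to $n(\bar{y}_1 - 2\bar{y}_2 + \bar{y}_3)^2 / 6$, which is zero exactly when the middle mean sits halfway between the other two.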

Assuming that the underlying reality is perfectly linear (independent samples from three distributions with evenly spaced means), this delta-F metric should follow what’s known as an F distribution. We can work out how likely a given delta-F score is to occur, by chance, given this assumption, i.e. we can convert delta-F scores to p-values.

Remember, this is assuming that the underlying psychology is always linear. This is almost certainly unrealistic, but it’s the best possible assumption for Förster. If the reality were nonlinear, low delta-F scores would be even more unlikely.

The delta-F metric is not new, but the application of it is (I think). Delta-F is a case of the well-known use of F-tests to compare the fit of two statistical models. People normally use this method to see whether some ‘complex’ model fits the data significantly better than a ‘simple’ model (the null hypothesis). In that case, they are looking to see if delta-F is high enough to be unlikely given the null hypothesis.

But here the whole thing is turned on its head. Random noise means that a complex model will sometimes fit the data better than a simple one, even if the simple model describes reality. In a conventional use of F-tests, that would be regarded as a false positive. But in this case it’s the absence of those false positives that’s unusual.
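A minimal Python sketch of the reasoning in this section, under stated assumptions: the three groups are coded -1, 0, +1, scores are normal with equal variance, and the left-tail probability is estimated by simulation rather than from the F distribution directly. The function names are illustrative and this is not the report's code.

```python
import random

def delta_f(low, med, high):
    """Delta-F for three groups coded x = -1, 0, +1 (assumed coding)."""
    groups = [low, med, high]
    xs = [-1.0, 0.0, 1.0]
    ns = [len(g) for g in groups]
    n_total = sum(ns)
    means = [sum(g) / len(g) for g in groups]
    grand = sum(sum(g) for g in groups) / n_total
    x_bar = sum(n * x for n, x in zip(ns, xs)) / n_total

    # Sum of squares explained by the one-way ANOVA (2 df).
    ss_between = sum(n * (m - grand) ** 2 for n, m in zip(ns, means))
    # Sum of squares explained by a straight line through the groups (1 df).
    s_xy = sum(n * (x - x_bar) * (m - grand) for n, x, m in zip(ns, xs, means))
    s_xx = sum(n * (x - x_bar) ** 2 for n, x in zip(ns, xs))
    ss_linear = s_xy ** 2 / s_xx
    # Within-group mean square: the noise estimate in the denominator.
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    ms_within = ss_within / (n_total - 3)
    return (ss_between - ss_linear) / ms_within

def left_tail_p(observed, n_per_group=20, n_sims=2000, seed=0):
    """Estimate P(delta-F <= observed) when the true means are exactly linear."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        sample = [[rng.gauss(mu, 1.0) for _ in range(n_per_group)]
                  for mu in (0.0, 1.0, 2.0)]  # evenly spaced true means
        if delta_f(*sample) <= observed:
            hits += 1
    return hits / n_sims

# A hypothetical, suspiciously small delta-F from one experiment.
print(left_tail_p(0.001))
```

A small left-tail probability for one experiment is unremarkable on its own; the accusation rests on small values recurring across many independent experiments.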

The Questions

I’m not a statistician but I think I understand the method (and have bashed together some MATLAB simulations). I find the method convincing. My impression is that delta-F is a valid test of non-linearity and ‘super-linearity’ in three-group designs.

I have been trying to think up a ‘benign’ scenario that could generate abnormally low delta-F scores in a series of studies. I haven’t managed it yet.

But there is one thing that troubles me. All of the statistics above operate on the assumption that data are continuously distributed. However, most of the data in Förster’s studies were categorical i.e. outcome scores were fixed to be (say) 1 2 3 4 or 5, but never 4.5, or any other number.

Now if you simulate categorical data (by rounding all numbers to the nearest integer), the delta-F distribution starts behaving oddly. For example given the null hypothesis, the p-curve should be flat, like it is in the graph on the right. But with rounding, it looks like the graph on the left:


The p-values at the upper end of the range (i.e. at the end of the range corresponding to super-linearity) start to ‘clump’.

The authors of the accusation note this as well (when I replicated the effect, I knew my simulations were working!). They say that it’s irrelevant because the clumping doesn’t make the p-values either higher or lower on average. The high and low clumps average out. My simulations also bear this out: rounding to integers doesn’t introduce bias.
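The rounding check can be sketched along the following lines. This is not the MATLAB code mentioned above, and every setting is an assumption made for illustration: 20 participants per group, true means of 2, 3 and 4, unit standard deviation, rounding to the nearest integer with no clipping to a 1 to 5 scale, and p-values read off an empirical reference distribution built from unrounded simulations.

```python
import bisect
import random

def delta_f_equal_n(low, med, high):
    """Delta-F for three equal-sized groups, via the curvature of the means."""
    n = len(low)
    groups = (low, med, high)
    means = [sum(g) / n for g in groups]
    curvature = means[0] - 2.0 * means[1] + means[2]  # zero for a perfect line
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (n * curvature ** 2 / 6.0) / (ss_within / (3 * n - 3))

def simulate(n_sims, n, rounded, rng):
    """Simulate delta-F when the true means are exactly linear."""
    results = []
    for _ in range(n_sims):
        groups = [[rng.gauss(mu, 1.0) for _ in range(n)] for mu in (2.0, 3.0, 4.0)]
        if rounded:
            # Mimic categorical scores by rounding to the nearest integer.
            groups = [[float(round(y)) for y in g] for g in groups]
        results.append(delta_f_equal_n(*groups))
    return results

def empirical_left_tail_p(values, reference_sorted):
    """Left-tail p-value of each value against a sorted reference sample."""
    m = len(reference_sorted)
    return [bisect.bisect_right(reference_sorted, v) / m for v in values]

rng = random.Random(1)
reference = sorted(simulate(5000, 20, False, rng))    # continuous scores
p_rounded = empirical_left_tail_p(simulate(2000, 20, True, rng), reference)

mean_p = sum(p_rounded) / len(p_rounded)  # near 0.5 if rounding adds no bias
distinct_p = len(set(p_rounded))          # fewer distinct values means more clumping
print(mean_p, distinct_p)
```

Comparing `mean_p` with 0.5 addresses the bias question, while the number of distinct p-values (or a histogram of `p_rounded`) shows the clumping. With rounded scores the three means can fall on an exact line, giving a delta-F of exactly zero, which continuous data essentially never produce.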

However, a p-value distribution just shouldn’t look like that, so it’s still a bit worrying. Perhaps, if some additional constraints and assumptions are added to the simulations, delta-F might become not just clumped, but also biased – in which case the accusations would fall apart.

Perhaps. Or perhaps the method is never biased. But in my view, if Förster and his defenders want to challenge the statistics of the accusations, this is the only weak spot I can see. Förster’s career might depend on finding a set of conditions that skew those curves.

UPDATE 8th May 2014: The findings of the Dutch scientific integrity commission, LOWI, on Förster, have been released. English translation here. As was already known, LOWI recommended the retraction of the 2012 paper, on grounds that the consistent linearity was so unlikely to have occurred by chance that misconduct seems likely. What’s new in the report, however, is the finding that the superlinearity was not present when male and female participants were analysed separately. This is probably the nail in the coffin for Förster because it shows that there is nothing inherent in the data that creates superlinearity (i.e. it is not a side effect of the categorical data, as I speculated it might be.) Rather, both male and female data show random variation but they always seem to ‘cancel out’ to produce a linear mean. This is very hard to explain in a benign way.

 [i]This doesn’t mean that fraud busting charges, even those that rise to a level of concern, should not also be critically evaluated. On the contrary, it is crucial that they be scrupulously criticized.

[ii] A warning to fans of Nate Silver: he shows the greatest disrespect for, and misunderstanding of, R.A. Fisher and significance tests. “Fisher and his contemporaries …sought to develop a set of statistical methods that they hoped would free us from any possible contamination from bias…[T]he frequentist methods–in striving for immaculate statistical procedures that can’t be contaminated by the researcher’s bias–keep him hermetically sealed off from the real-world.” (Silver 2012, 252-3). Fisher designed methods, relied on to this day, to detect and unearth biases, based on an understanding of how they arise and how, with care, they may be controlled and/or discerned objectively. Where does his “immaculate conception” come from? Silver does a great disservice to Fisher and fraud busting (e.g., in his 2012 “The Signal and the Noise”, pp.250-255). I hope he will correct his perception.

Categories: Error Statistics, Fisher, significance tests, Statistical fraudbusting, Statistics

42 thoughts on “Who ya gonna call for statistical Fraudbusting? R.A. Fisher, P-values, and error statistics (again)”

  1. See also: Fake-data colada.

  2. Corey: Yes, I’d seen that when it came out. The discussion on Retraction Watch which is quite long, and I surely shouldn’t be reading it, comes up with a few other explanations. In particular, Richard Gill suggested how selecting out impressive results over many experiments could conceivably…. It’s unclear.

  3. I think there is something bothersome about this process of investigating fraud, as voiced by many discussing the case. I mean from a scientific point of view. From a legal point of view, there are yet other worries, like the right (?) to face your accuser? But we can just focus on the scientific issues. What do people think?

  4. Well, taking all things together, it’s pretty clear to me that this is a case of scientific fraud. Your IQ does not jump by 15 points because of whether your breakfast cereal this morning was one of your usual choices or a mixture of two of your usual choices. This guy was in a competition with Stapel and Smeesters. They all had to cheat in order to get even better results than their competitors.

    • Hi Richard: Great to have you swing by. So you don’t think this is questionable research practices or selections. But you still think he has a valid lawsuit?

    • Richard: Is that what they were studying?–or something remotely like that? It doesn’t give the details, but likely I missed it.

    • Visitor

      Richard: The article does not define the research hypotheses in the problem papers. In discussion on other blogs, some say Forster’s research hypotheses may still be tenable and that he should be allowed to replicate his experiments. If “they all had to cheat” as in your comment, then the entire field is bunk, so replication wouldn’t settle anything.

      • It has something to do with local/global sensations priming creativity. I don’t know.

  5. Christian Hennig

    Three comments regarding the statistics. Firstly, a multiple testing issue is implicit in the reasoning. There are many aspects with respect to which such data could look dodgy. People computed a p-value probably only after they had a look at the data and this thing flew in their faces. The theory of p-values assumes that the test computed is not chosen dependent on the data. So in principle they could have found other things and the resulting p-value needs to be taken with a grain of salt because formally one should have tested for a pre-defined list of things that could have looked “too good to be true”, not just linearity after things looked too linear on graphs.
    Secondly, “too linear” is measured with a statistic that has a variance estimator in the denominator. So one could suspect that what goes on here is too linear, but also that the error variance estimator is too large. This could come from other violations of the model such as outliers or large variance under specific conditions (I don’t know whether we see all observations here) that are problematic but maybe less likely the result of cheating.
    Thirdly, regarding discreteness, important is not what this does to the p-value “on average”, but what it does to the probability of getting very small p-values, which is probably affected to a somewhat stronger extent than what one would believe looking at the average. (It seems for example that the probability of observing only *exact* lines is non-zero, and possibly larger than the p-value that the fraudbusters computed.)

    That said, intuitively I still think that the data look pretty fishy. But it’s not that easy to make a really sharp argument that cannot be challenged.

    • Christian: Regarding your second and third points, the analysis carried out by Uri Simonsohn (see my link to Data Colada above) does not suffer from the problems you note.

      • Christian Hennig

        Corey: Nice, thanks for pointing me to it.

    • Christian:
      First, there’s a fairly extensive discussion (including statisticians) at the Retraction watch link I gave; maybe you should add your comments, if you haven’t already. Definitely the multiple testing arises in their discussions and is discounted. I think it is clear that the absurdly small numbers don’t mean much. Gill recommends cutting them by half. The only thing that matters is whether one has a strong argument for ruling out other explanations, here, QRPs (questionable research practices). That argument still stands here, at least according to the many analysts.
      I’m not sure about the other two points, but I hope you raise them. Yet you think the data look fishy… because, I guess, they’re too neat? Again, I hope you add your points to their discussion.

      • Christian Hennig

        I had a look at the retraction watch discussion. I won’t write there because I feel that here it’s about philosophy and statistics but not about the personalities of those involved. Over there it’s much more personal and sensitive and I’d feel that I should really have read all the related stuff and have thought things through absolutely thoroughly, whereas here I feel that nobody except myself will suffer if I get something slightly wrong.

        There are also a number of good comments and I commend Richard wholeheartedly for the very balanced and thought through postings there. I just don’t have the time for this to write something even close to that standard.

        I think that to me really it is a very interesting issue here (for the philosophy more than for whether Foerster is a fraudster or not) that one can do calculations and come up with precise numbers, but that actually (regarding my intuition but also regarding how thoughtful people such as Richard argue) what counts is not the precision but the somewhat informal way of how the different bits of evidence work together. And to some but very weak extent the numbers, but not their precise values but rather the fact that they are *very, very small*. Richard writes that 1.96*10^{-21} is probably wrong by several orders of magnitude, and this is pretty much my point. These calculations make some assumptions that are wrong (first of all that the hypothesis to be tested, namely here the “typical random variation under linearity” was fixed before seeing the data, but also that data are not discrete, homogeneous variances etc.), and then one has to somehow intuitively assign numbers to by how much this will change the strength of the evidence. One poster said something like they could hardly think of a factor more than 1000 coming in from multiple testing and even be it 10^6, the issue would still stand because 10^{-21} times 10^6 times “something more to account for other problems” is probably (!?) still very, very small. Although we’re getting into the order of magnitude that would be required for winning a lottery just once, I guess (see Foerster’s response). On the other hand, the list of other reasons for suspicion will get down again what people think of as “the really true p-value for the whole thing”, which every statistician would agree to not be exactly 1.96*10^{-21}, but many would think is still so very very small that reasonable doubt will have a tough time with it. But does such a “true but uncomputable p-value” exist? And what role should it play in such a discussion?

        One reason I shouldn’t post this over there is because I think that my thoughts really don’t have strong implications either way regarding the Foerster case. But as I said, I’m fascinated by the role of informal arguments amending precise but not really reliable numbers, and whether one can still rely on the imprecise outcome of this endeavor.

        • Christian: I can see why you wouldn’t necessarily want to comment there, and I’m glad my little blog offers something to some people, I just wanted to indicate the place to go to get a hearing on possibly new analyses of the case. It’s easy to sign on when so many others are agreeing, yet Gill seemed to open the door someplace, somewhere in one of his postings, to a distinct possibility that hadn’t been traced out, and that might involve a simulation. I can’t recall the details, nor check right now.
          Foerster’s own letter is most informative. It’s as if he’s saying, we know the basic theory is true (whatever it is, something having to do with global vs local perceptions and creativity), and our job is to find a way to trigger it in the lab. Our task is to find a way to demonstrate the effect by arriving at an experimental protocol that triggers it reliably. By denying he was “testing” anything except his protocol for generating the assumed result, I spoze he could say “there’s no way I manipulated anything”.

  6. Dear Professor Mayo,

    The use of significance tests is just fine for this particular sort of fraud (Cyril Burt’s fraud in educational psychology is a leading instance, with grim results in supporting the 11-Plus exam in Britain for many decades). But you would not, I hope, want to defend null hypothesis testing without a loss function as it is in fact grossly overused in many fields (medicine, economics, psychology) to determine “whether an effect exists.”


    Deirdre McCloskey

    • Deirdre:
      Thank you so much for your comment (I apologize for not being notified of its presence until last night*).

      What matters is that these tests supply methods for statistically discerning and distinguishing sources of data. This they do by affording a set of statistical arguments whose power remains whether the problem happens to be fraud-busting—i.e., self-correcting–or something else. The important issue is understanding how the tools work. Saying, it’s fine over there, but not over here, instead of very clearly eliciting the nature of the licit statistical arguments, we make no progress in the debates about statistical methodology; thus they lack intellectual integrity.

      As for your point about adding a loss function to null hypothesis testing, I’m not sure what you mean. Statistical inferences in science should, and readily do, go beyond inferring an effect exists, to indicating magnitudes that are poorly detected, and any that are warranted. I can well imagine that the reviewers of the Foerster data felt the deepest responsibility, given the consequences of finding QRPs, let alone, evidence of manipulation. Did they employ a formal loss function? I doubt it.

      *I hope to have fixed the blog notifications now.

      • Dear Professor Mayo,

        Hmm. Your reply worries me. You are defending the conventions—this against dozens of the best statistical theorists. I wonder if you have read and considered their positions. 5% Null Hypothesis Testing Without a Loss Function (you say you don’t know what the last phrase means, which worries me even more: it has been conventional talk in statistical theory since around 1940) is in the opinion of many bankrupt. I do wonder why you are so eager to defend the orthodoxy.


        Deirdre McCloskey

        • Deirdre:
          Who are these “best statistical theorists” that my view is against? (Seriously, if you mention 2 or 3, I can react to arguments rather than “dozens of the best statisticians” allegedly disagree with me). I’m not defending any of the conventions of misinterpreting and abusing tests, quite the opposite. I assure you, as well, that I have more than read the “dozens of the best statistical theorists,” I have even written papers with some of them.
          I know from reading your work that you object to merely inferring the existence of effects without indicating magnitudes of discrepancy, and I agree. It is explicit in my formulation (or reformulation) of tests and confidence intervals.
          Instead of saying you are worried, please define your null hypothesis testing with an explicit loss function. (Perhaps in relation to this example of fraud busting.) Maybe we agree.

  7. The error variance estimator too large? The F statistic is a ratio. Ratio too small IFF the denominator is too small compared to the numerator IFF the numerator is too large compared to the denominator. (IFF = if and only if.) Editor’s correction.

    If you want to look at the error variance on its own,
    you can equally well say that it’s too small, since the difference between the three groups is suspiciously highly statistically significant.

    Under the null hypothesis, the numerator and denominator of an F statistic are *both* estimates of the *same* error variance.

  8. Sorry, that should have been: Ratio too small IFF the denominator is too small compared to the numerator IFF the numerator is too large compared to the denominator. (IFF = if and only if. I wrote it first with some keyboard symbols which apparently WordPress took as markup code)

    • Richard: thank you for the response. Did I get your IFF correct in editing your comment?

    • Christian Hennig

      Richard: Note that I commented on one specific test only. Actually I don’t believe that the error variance being too large is an issue here (my posting doesn’t take into account the issue that differences between groups are suspiciously large and it ignores a number of other things one could look at; which is one of the reasons why I still believe that something’s wrong with the data despite what I wrote), but in principle it could for this kind of test. You’re right that under the null hypothesis numerator and denominator estimate the same thing, but what I meant was that model assumptions could be violated in such a way that the denominator is affected for other reasons than problems with linearity, which is what people take the test to be about.

  9. Retraction watch posted Forster’s letter from May 11, but the comments are entirely new, and interesting. The first comment says Forster’s letter just goes to show why there should be a statistician on board in the psych experiments. Well maybe this guy was a bit rude, but he does have a point, and I wonder if that would be a way for them to improve. Assuming, of course, that stats folk wanted to be involved in social psych research of this sort. Here’s the link:

    • Mark

      Even if there was no explicit fraud here (although, given your meta-montage and the linked analyses (I like those by Simonsohn that Corey pointed out), it certainly seems like there was), the presentation of the results was most certainly misleading. These experiments were sold as “randomized”, however, the following quote from Forster (also pointed out in the first comment) implies that they most certainly were not randomized:
      “What does happen in my lab, as in any lab, is that some participants fail to complete the experiment (e.g., because of computer failure, personal problems, etc.). The partial data of these people is, of course, useless. Typically, I instruct RAs to fill up the conditions to compensate for such data loss. For example, if I aimed at 20 participants per condition, I will make sure that these will be 20 full-record participants.”

      Fraud by ignorance, at the very least, I’d say.

      • Mark: His idea is that if there are dropouts or incompletes he will keep adding students to get the 20 complete scores. A lot of flexibility there. When reviewers and others expressed surprise that he had no dropouts, they obviously didn’t consider this procedure was being followed.

        • Mark

          Mayo, right, got that… but it ain’t randomization. This method does not carry the benefits of randomization (i.e., avoiding selection bias). And, as far as I can tell, we can have no idea of the extent of such substitution because he didn’t think it important to report.

          • Mark: Possibly he collects more than he needs initially, and then omits the incompletes. I have a feeling that’s what they do.

      • john byrd

        That phrase “fill up the conditions” sounds a lot like using an algorithm for missing data. If that was the case, it could explain a data set too good to be true. In any event, something is not right. I wonder how much management there was of the student helpers. It seems many professors these days spend little time mentoring students in the craft of their discipline.

  10. I noticed something strange in Foerster’s letter. Which of the following comes from his letter, and which from my April Fool’s post from 2013? (near the very end)

    “If the experiment does not confirm the hypothesis, it is our fault, and we do it over til it works right. We change the subjects or the questionnaire, we find which responses are too small and must be fixed.”

    “If the data did not confirm the hypothesis, I talked to people in the lab about what needs to be done next, which would typically involve brainstorming about what needs to be changed, implementing the changes, preparing the new study and re-running it.”


    • E.Berk

      First line nearly identical. Maybe he got it from the April Fools Day post, as if it gave advice for research, rather than spoofing on the bad practices described in the Tilburg Report.

  11. West

    What would a successful fraudulent research study look like? That is from the perspective of all of these post-publication investigations.

    • West: I’m not sure what you’re asking.
      One way to see it, from the perspective of either the simulations or the corresponding analytic formulation, is that fraud-checking checks if the experiments are capable of producing the observed patterns of fit (and how). From the simulations of “fake data colada”: “We drew samples from populations with means and standard deviations equal to those reported in the suspicious paper. …We gave the reported data every benefit of the doubt. How often did we observe such a perfectly linear result in our 100,000 simulations?”
      Not once.

      The simulation emphasizes the role played by the error probabilities (in the analytic formulation) in grounding (the counterfactual reasoning for) the particular case at hand (not long-runs). Even if the hypotheses to be demonstrated are plausible or given as true, the experiment would not have produced these patterns. Which particular questions/simulations are appropriate to catch the problem in the case at hand will vary, but there are a small number of them, grouped by “canonical mistakes”.

  12. Steven McKinney

    Not certain what your footnote [ii] refers to; I do not see any “[ii]” annotation above the footmark.

    But on this point, indeed Nate Silver appears to be a committed Bayesian, which in statistics (I am a statistician) is problematic. To be dismissive of a large body of statistical science with the intellectual weight of Ronald Fisher, Bradley Efron and a host of other genius thinkers behind the science, shows biased and misguided thinking on Silver’s part.

    This issue is well discussed by Gary Marcus and Ernest Davis of New York University in their New Yorker “Page Turner” blog post (January 25, 2013, “What Nate Silver gets wrong”).


    • Steven: Thanks for your comment.
      I really liked that New Yorker review, and of course I concur: “Unfortunately, Silver’s discussion of alternatives to the Bayesian approach is dismissive, incomplete, and misleading. In some cases, Silver tends to attribute successful reasoning to the use of Bayesian methods without any evidence that those particular analyses were actually performed in Bayesian fashion.” Some of my own informal blog remarks are a bit stronger.
      I’m really glad they brought out those points because everyone generally treats him as too much of a data analytics rock star to criticize at all! (Larry Wasserman wrote a review pointing out that Silver was actually a frequentist.)

      Putting aside his bizarre remarks about Fisher*, his description at the JSM as to why he thinks journalists and others should be Bayesian makes no sense, especially in relation to his stated aims for a “data driven” blog, 538.

      He said journalists are so biased and inclined to see through their tunnel vision that they should reveal their biases up front. Even if we imagine people are sufficiently aware of their biases, and even if they could report them in terms of prior probability distributions, we certainly would want to keep them separate from the data analysis. We wouldn’t want to combine them to give an assessment of how warranted a claim is. I don’t see the writers on 538 offering their priors and combining them with data, but then again I’ve only read a few of the articles. Instead, we see—yes, statistical significance tests given!
      I have a post in draft on an early article in his 538 blog about “decoding” health news (I think it was by Jeff Leek). He was advising readers to multiply their prior beliefs in a health risk (one was something like hours on Facebook causing cancer) by other made-up numbers assigned to a study (not likelihoods, and no model). That strikes me as very bad advice, and not what we’d want to do to “decode” health articles at all.

      I fixed the [ii].
      * I’m determined to get Nate to somehow withdraw the remarks he makes on those pages in The Signal and the Noise.

    • I concur with their criticisms of Nate, but this comment struck me:

      “Silver seems to be using “Bayesian” not to mean the use of Bayes’s theorem but, rather, the general strategy of combining many different kinds of information.”

      As that is essentially Brad Efron’s definition of being Bayesian as in this paper http://statweb.stanford.edu/~ckirby/brad/papers/2009Future.pdf

      Though I would not agree with Efron, as I find it hard to distinguish between one study and one observation or sub-group in that study: they all could be taken as islands on their own.

      • Keith: Everyone combines different kinds of information whether they are walking down the street or buying a car. Efron doesn’t “define” it that way, nor do even the most minimalist of Bayesians, e.g., Andrew Gelman. As for Silver, he was quite clear at the JSM that he favors Bayesian inference in order to quantify prior bias and opinion, which is intended then to be “updated” by evidence (presumably likelihoods). I don’t say the view makes sense; as for what he really means deep down, I have no idea. Let’s try not to utterly trivialize different standpoints about statistical inference by seeking “definitions” no one can fail to satisfy. We’re all disjunctive syllogizers too, including my cat, I’ve noticed.

  13. Pingback: Friday links: valuing scientists vs. science, real stats vs. fake data, Pigliucci vs. Tyson, and more | Dynamic Ecology

  14. From the abstract of Efron’s paper I linked to

    “Very roughly speaking, the difference between direct and indirect statistical evidence marks the boundary between
    frequentist and Bayesian thinking.”

    My taking “combining many different kinds of information” as related to Efron’s view stated above surely did not deserve a scolding. I made a blog comment not a submission of a graduate student essay.

    By the way, in a talk I gave on Meta-analysis to the Oxford Statistics Department in 1981, I claimed multi-cell organisms were the first to combine information and hence do informal meta-analysis.

    • Keith: Sorry, I thought you were submitting a graduate student essay, just kidding. But seriously what you wrote, and given that it’s written by you, is sufficient for many to go away thinking that Efron’s definition of being Bayesian (in the cited paper) is essentially to follow “the general strategy of combining many different kinds of information.”

      If there weren’t so much confusion surrounding this, it wouldn’t matter.

      On the business of using “direct” vs “indirect” statistical evidence, I think Efron should say just what he means here, so people can decide if the particular use of “indirect” evidence is warranted. (I know what he’s getting at, but it’s equivocal. In this connection, it’s interesting that people very often say the frequentist uses probability indirectly, the Bayesian directly, thereby reversing things.)
      I’ve no problem with multi-cell organisms being the first to do informal meta-analysis.

  15. Jeremy Fox at Dynamic Ecology links to this post: http://dynamicecology.wordpress.com/2014/05/16/friday-links-25/#comments
    He remarks: “The goal is to infer, beyond a reasonable doubt, whether or not a particular person has or hasn’t faked data. We don’t want to quantify our personal beliefs about whether he did or not, as a subjective Bayesian would. We don’t even want to quantify the fraction of relevantly-similar cases in which the data were faked, as an objective Bayesian like Nate Silver would. Rather, we have a hypothesis we want to subject to a severe test. People who make the blanket claim that frequentist hypothesis testing is never of scientific interest are just wrong”.

    But is Nate Silver an objective Bayesian? I didn’t think so.

    • vl

      Nate is an “objective” Bayesian in the sense that he has a non-subjective definition of a “correct” probability model – calibration.

      This is why Larry argues he’s a frequentist applying Bayes theorem. He’s using frequentist priors and a frequentist definition of probability.

      • vl: There’s some ambiguity. First, so-called “objective” (default, non-subjective, reference) Bayesians do not intend their priors to be frequentist, and second, Silver could not have been clearer (e.g., at the JSM) that he insisted on the Bayesian philosophy (at least for journalists) so that they could express their beliefs and biases. Recall Silver condemning frequentist methods: “as striving for immaculate statistical procedures” resulting in their being “hermetically sealed off from the real-world.” (Silver 2012, 252-3). So, he can’t have it all three ways!
