**Stapel’s “fix” for science is to admit it’s all “fixed!”**

The recent case of Michael LaCour, the researcher suspected of using faked data in a (now retracted) study on how to promote support for gay marriage, is directing a bit of limelight on our star fraudster Diederik Stapel (50+ retractions).

**The Chronicle of Higher Education** just published an article by Tom Bartlett: “Can a Longtime Fraud Help Fix Science?” You can read his full interview of Stapel here. A snippet:

You write that “every psychologist has a toolbox of statistical and methodological procedures for those days when the numbers don’t turn out quite right.” Do you think every psychologist uses that toolbox? In other words, is everyone at least a little bit dirty?

Stapel: In essence, yes. The universe doesn’t give answers. There are no data matrices out there. We have to select from reality, and we have to interpret. There’s always dirt, and there’s always selection, and there’s always interpretation. That doesn’t mean it’s all untruthful. We’re dirty because we can only live with models of reality rather than reality itself. It doesn’t mean it’s all a bag of tricks and lies. But that’s where the inconvenience starts.

It’s the illusion that these models are one-to-one descriptions of reality. That’s what we hope for, but that’s of course not true. I think the solution is in accepting this and saying these are the tips and tricks, and this is the story I want to tell, and this is how I did it, instead of trying to pose as if it’s real. We should be more open about saying, I’m using this trick, this statistical method, and people can figure out for themselves.

This is our “dirty hands” argument, so often used these days, coupled with claims of so-called “perverse incentives,” to excuse QRPs (questionable research practices), bias, and flat-out **cheating**. The leap from “our models are invariably idealizations” to “we all have dirty hands” to “statistical tricks cannot be helped” may inadvertently be encouraged by some articles on how to “fix” science.

Earlier in the interview:

You mention lots of possible reasons for your fraud: laziness, ambition, a short attention span. One of the more intriguing reasons to me — and you mention it twice in the book — is nihilism. Do you mean that? Did you think of yourself as a nihilist? Then or now?

Stapel: I’m not sure I’m a nihilist….

Did you think of the work you were doing as meaningful?

Stapel: I was raised in the 1980s, at the height of postmodernism, and that was something I related to. I studied many of the French postmodernists. That made me question meaningfulness. I had a hard time explaining the meaningfulness of my work to students.

I’ll bet.

I agree with Bartlett that you don’t have to have any sympathy with a fraudster to possibly learn from him about preventing doctored statistics, or sharpening fraudbusting skills, except that it turns out Stapel *really and truly* believes science is a fraud![ii] In his pristine accomplishment of using *no data at all*, rather than merely subjecting them to extraordinary rendition (leaving others to wrangle over the fine points of statistics), you could say that Stapel is the ultimate, radical, postmodern scientific anarchist. Stapel is a personable guy, and I’ve had some interesting exchanges with him; but on that basis, and from his “Fictionfactory” and autobiography, *Derailment*, I say he’s the wrong person to ask. *He still doesn’t get it!*

[i] There are several posts on this blog that discuss Stapel:

Some Statistical Dirty Laundry

Derailment: Faking Science: A true story of academic fraud, by Diederik Stapel (translated into English)

Should a “fictionfactory” peepshow be barred from a festival on “Truth and Reality”? Diederik Stapel says no

How to hire a fraudster chauffeur (includes video of Stapel’s TED talk)

50 shades of grey between error and fraud

Thinking of Eating Meat Causes Antisocial behavior

[ii] At least social science, social psychology. He may be right that the effects are small or uninteresting in social psych.

Filed under: junk science, Statistics

**MONTHLY MEMORY LANE: 3 years ago: June 2012.** I mark in **red** three posts that seem most apt for general background on key issues in this blog.[1] It was *extremely* difficult to pick only 3 this month; please check out others that look interesting to you. This new feature, appearing the last week of each month, began at the blog’s 3-year anniversary in Sept. 2014.

**June 2012**

- (6/2) Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)
- **(6/6) Review of *Error and Inference* (Mayo and Spanos) by C. Hennig**
- (6/9) U-Phil: Is the Use of Power* Open to a Power Paradox?
- (6/12) CMU Workshop on Foundations for Ockham’s Razor
- (6/14) Answer to the Homework & a New Exercise
- (6/15) Scratch Work for a SEV Homework Problem
- (6/17) Repost (5/17/12): Do CIs Avoid Fallacies of Tests? Reforming the Reformers
- (6/17) G. Cumming Response: The New Statistics
- **(6/19) The Error Statistical Philosophy and The Practice of Bayesian Statistics: Comments on Gelman and Shalizi**
- (6/23) Promissory Note
- **(6/26) Deviates, Sloths, and Exiles: Philosophical Remarks on the Ockham’s Razor Workshop***
- (6/29) Further Reflections on Simplicity: Mechanisms

[1] Excluding those recently reblogged. Posts that are part of a “unit” or a group of “U-Phils” count as one.

Filed under: 3-year memory lane

This is one of the questions high on the “To Do” list I’ve been keeping for this blog. The question grew out of discussions of “updating and downdating” in relation to papers by Stephen Senn (2011) and Andrew Gelman (2011) in *Rationality, Markets, and Morals*.[i]

“As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’.” (Senn, 2011, p. 59)

“If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.” (Gelman, 2011, p. 77)

But if uncertainty is not expressible as a prior, then a major linchpin of Bayesian updating seems questionable. If you can go from the posterior back to the prior, on the other hand, perhaps the data can also lead you to come back and change the prior itself.
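To see the updating/downdating symmetry in the simplest possible setting, here is a sketch of my own (not from Senn or Gelman) using a conjugate Beta-Binomial model: Bayes’s rule just adds the observed counts to the prior’s pseudo-counts, so eliciting a posterior and then calculating the prior is a matter of subtracting them back out.

```python
def update(prior_a, prior_b, successes, failures):
    """Bayes's rule for a Beta(a, b) prior with binomial data:
    the posterior is Beta(a + successes, b + failures)."""
    return prior_a + successes, prior_b + failures

def downdate(post_a, post_b, successes, failures):
    """Senn's 'downdating': recover the prior that, combined with the
    same data, would yield the stated posterior."""
    return post_a - successes, post_b - failures

# Elicit a posterior of Beta(9, 5) after seeing 7 successes in 10 trials,
# then calculate the prior it implies:
prior = downdate(9, 5, successes=7, failures=3)
print(prior)  # (2, 2), i.e. a Beta(2, 2) prior

# Updating that prior with the same data reproduces the posterior exactly:
print(update(*prior, successes=7, failures=3))  # (9, 5)
```

Of course, the philosophical question is not whether the arithmetic runs in both directions (in the conjugate case it trivially does), but which direction is supposed to carry the inferential weight.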

**Is it legitimate to change one’s prior based on the data?**

I don’t mean update it, but reject the one you had and replace it with another. My question may yield different answers depending on the particular Bayesian view. I am prepared to restrict the entire question of changing priors to Bayesian “probabilisms”, meaning the inference takes the form of updating priors to yield posteriors, or to report a comparative Bayes factor. Interpretations can vary. In many Bayesian accounts the prior probability distribution is a way of introducing prior beliefs into the analysis (as with subjective Bayesians) or, conversely, to avoid introducing prior beliefs (as with reference or conventional priors). Empirical Bayesians employ frequentist priors based on similar studies or well established theory. There are many other variants.

S. SENN: According to Senn, one test of whether an approach is Bayesian is that while

“arrival of new data will, of course, require you to update your prior distribution to being a posterior distribution, no conceivable possible constellation of results can cause you to wish to change your prior distribution. If it does, you had the wrong prior distribution and this prior distribution would therefore have been wrong even for cases that did not leave you wishing to change it.” (Senn, 2011, p. 63)

“If you cannot go back to the drawing board, one seems stuck with priors one now regards as wrong; if one does change them, then what was the meaning of the prior as carrying prior information?” (Senn, 2011, p. 58)

I take it that Senn is referring to a Bayesian prior expressing belief. (He will correct me if I’m wrong.)[ii] Senn takes the upshot to be that priors cannot be changed based on data. **Is there a principled ground for blocking such moves?**

I.J. GOOD: The traditional idea was that one would have thought very hard about one’s prior before proceeding—that’s what Jack Good always said. Good advocated his device of “imaginary results” whereby one would envisage all possible results in advance (1971, p. 431) and choose a prior that you can live with whatever happens. *This could take a long time!* Given how difficult this would be, in practice, Good allowed

“that it is possible after all to change a prior in the light of actual experimental results [but] rationality of type II has to be used.” (Good 1971, p. 431)

Maybe this is an example of what Senn calls requiring the informal to come to the rescue of the formal? Good was commenting on D. J. Bartholomew [iii] in the same wonderful volume (edited by Godambe and Sprott).

D. LINDLEY: According to subjective Bayesian Dennis Lindley:

“[I]f a prior leads to an unacceptable posterior then I modify it to cohere with properties that seem desirable in the inference.” (Lindley 1971, p. 436)

This would seem to open the door to all kinds of verification biases, wouldn’t it? This is the same Lindley who famously declared:

“I am often asked if the method gives the *right* answer: or, more particularly, how do you know if you have got the *right* prior. My reply is that I don’t know what is meant by ‘right’ in this context. The Bayesian theory is about *coherence*, not about right or wrong.” (1976, p. 359)

H. KYBURG: Philosopher Henry Kyburg (who wrote a book on subjective probability, but was or became a frequentist) gives what I took to be the standard line (for subjective Bayesians at least):

“There is no way I can be in error in my prior distribution for μ––unless I make a logical error…. It is that very fact that makes this prior distribution perniciously subjective. It represents an assumption that has consequences, but cannot be corrected by criticism or further evidence.” (Kyburg 1993, p. 147)

It can be updated of course via Bayes rule.

D.R. COX: While recognizing the serious problem of “temporal incoherence”, (a violation of diachronic Bayes updating), David Cox writes:

“On the other hand [temporal coherency] is not inevitable and there is nothing intrinsically inconsistent in changing prior assessments” in the light of data; however, the danger is that “even initially very surprising effects can *post hoc* be made to seem plausible.” (Cox 2006, p. 78)

An analogous worry would arise, Cox notes, if frequentists permitted data-dependent selections of hypotheses (significance seeking, cherry picking, etc.). However, frequentists (if they are not to be guilty of cheating) would need to take into account any adjustments to the overall error probabilities of the test. But the Bayesian is not in the business of computing error probabilities associated with a method for reaching posteriors. At least not traditionally. Would Bayesians even be required to report such shifts of priors? (A principle is needed.)
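To make the error-probability worry concrete, here is a small simulation of my own (not from Cox 2006): under a continuous test statistic with a true null, a p-value is uniformly distributed, so a researcher who computes k independent p-values and reports only the smallest faces a “significance” rate of 1 − (1 − α)^k rather than α.

```python
import random

def familywise_rate(k, alpha=0.05, trials=20000, seed=1):
    """Under a true null with a continuous test statistic, each p-value
    is Uniform(0, 1).  If k independent p-values are computed and only
    the smallest is reported, the chance the reported result clears the
    alpha threshold is 1 - (1 - alpha)**k, not alpha.  Estimate it by
    simulation."""
    rng = random.Random(seed)
    hits = sum(
        min(rng.random() for _ in range(k)) < alpha
        for _ in range(trials)
    )
    return hits / trials

print(round(familywise_rate(1), 2))   # ≈ 0.05, the nominal rate
print(round(familywise_rate(20), 2))  # ≈ 0.64, since 1 - 0.95**20 ≈ 0.64
```

This is exactly the adjustment frequentists are obliged to report; the question in the text is what principle obliges the Bayesian to report the analogous shifts of priors.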

What if the proposed adjustment of the prior is based on the data and resulting likelihoods, rather than on an impetus to ensure one’s favorite hypothesis gets a desirable posterior? After all, Jim Berger says that prior elicitation typically takes place *after* “the expert has already seen the data” (2006, p. 392). Are experts instructed to try not to take the data into account? In any case, if the prior is determined post-data, then one wonders how it can be seen to reflect information distinct from the data under analysis. All the work of obtaining posteriors would have been accomplished by the likelihoods. There’s also the issue of using the data twice.

**So what do you think is the answer? Does it differ for subjective vs conventional vs other stripes of Bayesian?**

[i]Both were contributions to the RMM (2011) volume: Special Topic: Statistical Science and Philosophy of Science: Where Do (Should) They Meet in 2011 and Beyond? (edited by D. Mayo, A. Spanos, and K. Staley). The volume was an outgrowth of a 2010 conference that Spanos and I (and others) ran in London, and conversations that emerged soon after. See full list of participants, talks and sponsors here.

[ii] Senn and I had a published exchange on his paper that was based on my “deconstruction” of him on this blog, followed by his response! The published comments are here (Mayo) and here (Senn).

[iii] At first I thought Good was commenting on Lindley. Bartholomew came up in this blog in discussing when Bayesians and frequentists can agree on numbers.

**WEEKEND READING**

Gelman, A. 2011. “Induction and Deduction in Bayesian Data Analysis.”

Senn, S. 2011. “You May Believe You Are a Bayesian But You Are Probably Wrong.”

Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.”

Discussions and Responses on Senn and Gelman can be found searching this blog:

Commentary on Berger & Goldstein: Christen, Draper, Fienberg, Kadane, Kass, Wasserman

Rejoinders: Berger, Goldstein

REFERENCES

Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” *Bayesian Analysis* 1 (3): 385–402.

Cox, D. R. 2006. *Principles of Statistical Inference*. Cambridge, UK: Cambridge University Press.

Gelman, A. 2011. “Induction and Deduction in Bayesian Data Analysis.” *Rationality, Markets and Morals: Studies at the Intersection of Philosophy and Economics* 2 (Special Topic: Statistical Science and Philosophy of Science): 67–78.

Godambe, V. P., and D. A. Sprott, ed. 1971. *Foundations of Statistical Inference*. Toronto: Holt, Rinehart and Winston of Canada.

Good, I. J. 1971. Comment on Bartholomew. In *Foundations of Statistical Inference*, edited by V. P. Godambe and D. A. Sprott, 108–122. Toronto: Holt, Rinehart and Winston of Canada.

Kyburg, H. E. Jr. 1993. “The Scope of Bayesian Reasoning.” In *Philosophy of Science Association: PSA 1992*, vol 2, 139-152. East Lansing: Philosophy of Science Association.

Lindley, D. V. 1971. “The Estimation of Many Parameters.” In *Foundations of Statistical Inference*, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart and Winston.

Lindley, D.V. 1976. “Bayesian Statistics.” In Harper and Hooker (eds.), *Foundations of Probability Theory, Statistical Inference and Statistical Theories of Science*, 353–362. D. Reidel.

Senn, S. 2011. “You May Believe You Are a Bayesian But You Are Probably Wrong.” *Rationality, Markets and Morals: Studies at the Intersection of Philosophy and Economics* 2 (Special Topic: Statistical Science and Philosophy of Science): 48–66.

Filed under: Bayesian/frequentist, Gelman, S. Senn, Statistics

*I had a chance to reread the 2012 Tilburg Report* on “Flawed Science” last night. The full report is now here. The discussion of the statistics is around pp. 17-21 (of course there was so little actual data in this case!) You might find it interesting. Here are some stray thoughts reblogged from 2 years ago…*

*1. Slipping into pseudoscience.*

The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”[1]. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job.

One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts

…. [T]he use of research procedures in such a way as to ‘repress’ negative results by some means” may be called verification bias. [my emphasis] (Report, 48).

I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses as to count as no evidence at all (see some from their list). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89). [2] It is unclear at what point a field slips into the pseudoscience realm.

*2. A role for philosophy of science?*

I am intrigued that one of the final recommendations in the Report is this:

In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered. Based on these insights, research Master’s students and PhD students must receive practical training from their supervisors in the application of the rules governing proper and honest scientific research, which should include examples of such undesirable conduct as data massage. The Graduate School must explicitly ensure that this is implemented.

A philosophy department could well create an entire core specialization that revolved around “the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science” (ideally linked with one or more other departments). That would be both innovative and fill an important gap, it seems to me. Is anyone doing this?

*3. Hanging out some statistical dirty laundry.*

Items in their laundry list include:

- An experiment fails to yield the expected statistically significant results. The experiment is repeated, often with minor changes in the manipulation or other conditions, and the only experiment subsequently reported is the one that did yield the expected results. The article makes no mention of this exploratory method… It should be clear, certainly with the usually modest numbers of experimental subjects, that using experiments in this way can easily lead to an accumulation of chance findings….
- A variant of the above method is: a given experiment does not yield statistically significant differences between the experimental and control groups. The experimental group is compared with a control group from a different experiment—reasoning that ‘they are all equivalent random groups after all’—and thus the desired significant differences are found. This fact likewise goes unmentioned in the article….
- The removal of experimental conditions. For example, the experimental manipulation in an experiment has three values. …Two of the three conditions perform in accordance with the research hypotheses, but a third does not. With no mention in the article of the omission, the third condition is left out….
- The merging of data from multiple experiments [where data] had been combined in a fairly selective way,…in order to increase the number of subjects to arrive at significant results…
- Research findings were based on only some of the experimental subjects, without reporting this in the article. On the one hand ‘outliers’…were removed from the analysis whenever no significant results were obtained. (Report, 49-50)

For many further examples, and also caveats [3], see the Report.
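The first gambit on the list, rerunning an experiment until it “works” and reporting only the successful run, is easy to quantify with a simulation (a hypothetical sketch of mine, not from the Report). With every null hypothesis true, a researcher allowed five attempts finds a nominally significant result over a fifth of the time:

```python
import random
from statistics import NormalDist, mean

def two_group_p(rng, n=20):
    """One experiment: two groups drawn from the SAME distribution
    (so the null is true); return a two-sided z-test p-value
    (known sigma = 1, for simplicity)."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    se = (2 / n) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def repeat_until_significant(max_runs, trials=2000, seed=7):
    """Fraction of 'researchers' who obtain p < 0.05 somewhere within
    max_runs attempts, reporting only the run that worked."""
    rng = random.Random(seed)
    wins = sum(
        any(two_group_p(rng) < 0.05 for _ in range(max_runs))
        for _ in range(trials)
    )
    return wins / trials

print(repeat_until_significant(max_runs=1))  # ≈ 0.05, the nominal rate
print(repeat_until_significant(max_runs=5))  # ≈ 0.23, since 1 - 0.95**5 ≈ 0.23
```

The “minor changes in the manipulation” noted in the Report only make matters worse, since they give the researcher extra degrees of freedom beyond the bare re-runs simulated here.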

**4. Significance tests don’t abuse science, people do**.

Interestingly, the Report distinguishes the above laundry list from the “statistical incompetence and lack of interest found” (52). If the methods used were statistical, then the scrutiny might be called “metastatistical”, or the full scrutiny “meta-methodological”. Stapel often fabricated data, but the upshot of these criticisms is that sufficient finagling may similarly predetermine that a researcher’s favorite hypothesis gets support. (There is obviously a big advantage in having the data to scrutinize, as many are now demanding.) Is it a problem of these methods that they are abused? Or does the fault lie with the abusers? Obviously the latter. *Statistical methods don’t kill scientific validity, people do*.

I have long rejected dichotomous testing, but the gambits in the laundry list create problems even for more sophisticated uses of methods, e.g., for indicating magnitudes of discrepancy and associated confidence intervals. At least these methods admit of tools for mounting a critique.

In “The Mind of a Con Man” (NY Times, April 26, 2013[4]), Diederik Stapel explains how he always read the research literature extensively to generate his hypotheses. “So that it was believable and could be argued that this was the only logical thing you would find.” Rather than report on believability, *researchers need to report the properties of the methods they used: What was their capacity to have identified, avoided, admitted verification bias?* The role of probability here would not be to quantify the degree of confidence or believability in a hypothesis, given the background theory or most intuitively plausible paradigms, but rather to check how severely probed or well-tested a hypothesis is, whether the assessment is formal, quasi-formal or informal. Was a good job done in scrutinizing flaws, or a terrible one? Or was there just a bit of data massaging and cherry picking to support the desired conclusion? *As a matter of routine, researchers should tell us.* Yes, as Joe Simmons, Leif Nelson and Uri Simonsohn suggest in “A 21-word solution”, they should “say it!” No longer inclined to regard their recommendation as too unserious, researchers who are “clean” should go ahead and “clap their hands”[5]. (I will consider their article in a later post…)

*The subtitle is “The fraudulent research practices of social psychologist Diederik Stapel.”

[1] “A ‘byproduct’ of the Committees’ inquiries is the conclusion that, far more than was originally assumed, there are certain aspects of the discipline itself that should be deemed undesirable or even incorrect from the perspective of academic standards and scientific integrity.” (Report 54).

[2] Mere falsifiability, by the way, does not suffice for stringency; but there are also methods Popper rejects that could yield severe tests, e.g., double counting. (Search this blog for more entries.)

[3] “It goes without saying that the Committees are not suggesting that unsound research practices are commonplace in social psychology. …although they consider the findings of this report to be sufficient reason for the field of social psychology in the Netherlands and abroad to set up a thorough internal inquiry into the state of affairs in the field” (Report, 48).

[4] Philosopher Janet Stemwedel discusses the NY Times article, noting that Diederik taught a course on research ethics!

[5] From Simmons, Nelson and Simonsohn:

“Many support our call for transparency, and agree that researchers should fully disclose details of data collection and analysis. Many do not agree. What follows is a message for the former; we begin by preaching to the choir.” (Fall 2012 Newsletter of the Society for Personality and Social Psychology)

Popper, K. 1994. *The Myth of the Framework*. London: Routledge.

Choir: There is no need to wait for everyone to catch up with your desire for a more transparent science. If *you* did not p-hack a finding, say it, and your results will be evaluated with the greater confidence they deserve. If you determined sample size in advance, say it. If you did not drop any variables, say it. If you did not drop any conditions, say it.

Filed under: junk science, spurious p values

I thought the criticisms of social psychologist Jens Förster were already quite damning (despite some attempts to explain them as mere QRPs), but there has recently been some pushback from two of his co-authors, Liberman and Denzler. Their objections are directed at the application of a distinct method, touted as “Bayesian forensics”, to their joint work with Förster. I discussed it very briefly in a recent “rejected post”. Perhaps the earlier method of criticism was inapplicable to these additional papers, and there’s an interest in seeing those papers retracted as well as the one that was. I don’t claim to know. A distinct “policy” issue is whether there should be uniform standards for retraction calls. At the very least, one would think new methods should be well vetted before authors are subjected to their indictment (particularly methods that are incapable of issuing in exculpatory evidence, like this one). Here’s a portion of their response. I don’t claim to be up on this case, but I’d be very glad to have reader feedback.

**Nira Liberman, School of Psychological Sciences, Tel Aviv University, Israel**

**Markus Denzler, Federal University of Applied Administrative Sciences, Germany**

June 7, 2015

**Response to a Report Published by the University of Amsterdam**

The University of Amsterdam (UvA) has recently announced the completion of a report that summarizes an examination of all the empirical articles by Jens Förster (JF) during the years of his affiliation with UvA, including those co-authored by us. The report is available online. The report relies solely on statistical evaluation, using the method originally employed in the anonymous complaint against JF, as well as a new version of a method for detecting “low scientific veracity” of data, developed by Prof. Klaassen (2015). The report concludes that some of the examined publications show “strong statistical evidence for low scientific veracity”, some show “inconclusive evidence for low scientific veracity”, and some show “no evidence for low veracity”. UvA announced that on the basis of that report, it would send letters to the Journals, asking them to retract articles from the first category, and to consider retraction of articles in the second category.

After examining the report, **we have reached the conclusion that it is misleading, biased and is based on erroneous statistical procedures**. In view of that we surmise that it **does not present reliable evidence for “low scientific veracity”**.

**We ask you to consider our criticism of the methods used in UvA’s report and the procedures leading to their recommendations in your decision.**

Let us emphasize that we never fabricated or manipulated data, nor have we ever witnessed such behavior on the part of Jens Förster or other co-authors.

**Here are our major points of criticism.** Please note that, due to time considerations, our examination and criticism focus on papers co-authored by us. Below, we provide some background information and then elaborate on these points.

- **The new method is falsely portrayed as “standard procedure in Bayesian forensic inference.” In fact, it is set up in such a way that evidence can only strengthen a prior belief in low data veracity.** This method is not widely accepted among other experts, and has never been published in a peer-reviewed journal. Despite that, UvA’s recommendations for all but one of the papers in question are solely based on this method. No confirming (not to mention disconfirming) evidence from independent sources was sought or considered.

- **The new method’s criteria for “low veracity” are too inclusive** (a 5-8% chance of wrongly accusing a publication of showing “strong evidence of low veracity,” and as high as a 40% chance of wrongly accusing a publication of showing “inconclusive evidence for low veracity”). Illustrating the potential consequences, a failed-replication paper by other authors that we examined was flagged by this method.

- **The new method (and in fact also the “old method” used in former cases against JF) rests on the wrong assumption that dependence of errors between experimental conditions necessarily indicates “low veracity,”** whereas in real experimental settings many (benign) factors may contribute to such dependence.

- The report treats between-subjects designs of 3 x 2 as two independent instances of 3-level single-factor experiments. However, the same (benign) procedures may render this assumption questionable, thus inflating the indicators for “low veracity” used in the report.

- **The new method (and also the old method) estimates fraud as the extent of deviation from a linear contrast. This contrast cannot be applied to “control” variables** (or control conditions) for which experimental effects were neither predicted nor found, as was done in the report. The misguided application of the linear contrast to control variables also produces, in some cases, inflated estimates of “low veracity.”

- **The new method appears to be critically sensitive to minute changes in values** that are within the boundaries of rounding.

- Finally, **we examine every co-authored paper that was classified as showing “strong” or “inconclusive” evidence of low veracity** (excluding one paper that is already retracted) **and show that it does not feature any reliable evidence for low veracity.**

Background

On April 2nd each of us received an email from the head of the Psychology Department at the University of Amsterdam (UvA), Prof. De Groot, on behalf of the University’s Executive Board. She informed us that all the empirical articles by Jens Förster (JF) during the years of his affiliation with UvA, including those co-authored by us, had been examined by three statisticians, who had submitted their report. According to this (earlier version of the) report, we were told, some of the examined publications had “strong statistical evidence for fabrication”, some had “questionable veracity,” and some showed “no statistical evidence for fabrication”. Prof. De Groot also wrote that on the basis of that report, letters would be sent to the relevant Journals, asking them to retract articles from the first two categories. It is important to note that this was the first time we were officially informed about the investigation. None of the co-authors had ever been contacted by UvA to assist with the investigation. The University could have taken an interest in the data files, or in earlier drafts of the papers, or in information on when, where and by whom the studies were run. Apparently, however, UvA’s Executive Board did not find any of these relevant for judging the potential veracity of the publications and requesting retraction.

Only upon repeated requests, on April 7th, 2015, did we receive the 109-page report (dated March 31st, 2015), and we were given 2.5 weeks to respond. This deadline was determined one-sidedly. Also, UvA did not provide the R code used to investigate our papers for almost two weeks (until April 22nd), despite the fact that it was listed as an attachment to the initial report. We responded on April 27th, following which the authors of the report corrected it (henceforth Report-R) and wrote a response letter (henceforth the PKW letter, after its authors Peeters, Klaassen, and de Wiel). Both documents are dated May 15, 2015, but were sent to us only on June 2, **the same day that UvA published the news regarding the report and its conclusions on its official site**, and the final report was leaked. **Thus, we were not allowed any time to read Report-R or the PKW letter before the report and UvA’s conclusions were made public. These and other procedural decisions by the UvA were needlessly detrimental to us.**

The present response letter refers to Report-R. Report-R is almost unchanged from the original report, except that its language and the labels for the qualitative assessments of the papers are somewhat softened, referring to “low veracity” rather than “fraud” or “manipulation”. This was done to reflect the authors’ own acknowledgement that their methods “cannot demarcate fabrication from erroneous or questionable research practices.” UvA’s retraction decisions changed only slightly in response to this acknowledgement. They are still requesting retraction of papers with “strong evidence for low veracity”. They are also asking journals to “consider retraction” of papers with “inconclusive evidence for low veracity,” a step that hardly matches this lukewarm new label (also see Point 2 below about the likelihood of a paper receiving this label erroneously).

Unlike our initial response letter, this letter is not addressed to UvA, but rather to editors who have read Report-R or reports about it. To keep things simple, we cite from the PKW letter only when necessary. In this way, a reader can follow our argument by reading Report-R and the present letter, without also having to read the original version of the report, our previous response letter, and the PKW letter.

Because of time pressure, we decided to respond only to findings that concerned co-authored papers, excluding the by-now-retracted paper Förster and Denzler (2012, SPPS). We therefore looked at the general introduction of Report-R and at the sections that concern the following papers:

In the “strong evidence for low veracity” category

Förster and Denzler, 2012, JESP

Förster, Epstude, and Ozelsel, 2009, PSPB

Förster, Liberman, and Shapira, 2009, JEP:G

Liberman and Förster, 2009, JPSP

In the “inconclusive evidence for low veracity” category

Denzler, Förster, and Liberman, 2009, JESP

Förster, Liberman, and Kuschel, 2008, JPSP

Kuschel, Förster, and Denzler, 2010, SPPS

This is not meant to suggest that our criticism does not apply to the other parts of Report-R. We just did not have sufficient time to carefully examine them. **We would like to elaborate now on points 1-7 above and explain in detail why we think that UvA’s report is biased, misleading, and flawed.**

**The new method by Klaassen (2015) (the V method) is inherently biased**

Report-R heavily relies on a new method for detecting low veracity (Klaassen, 2015), whose author, Prof. Klaassen, is also one of the authors of Report-R (and its previous version).

In this method (which we’ll refer to as the V method), a V coefficient is computed and used as an indicator of data veracity. V is called “evidential value” and is treated as the belief-updating coefficient in Bayes’ formula, as in equation (2) in Klaassen (2015). For example, according to the V method, when we examine a new study with V = 2, our posterior odds for fabrication should be double the prior odds. If we now add another study with V = 3, those odds should triple again (a combined factor of 6). Klaassen (2015) writes: “When a paper contains more than one study based on independent data, then the evidential values of these studies can and may be combined into an overall evidential value by multiplication in order to determine the validity of the whole paper” (p. 10).

The problem is that V is not allowed to be less than unity. This means that nothing can ever reduce confidence in the hypothesis of “low data veracity”. The V method entails, for example, that the more studies a paper contains, the more convinced we should become that its data have low veracity.
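To see this ratchet effect concretely, here is a minimal sketch (our own illustration, not code from Report-R) of the V method’s multiplication rule, using the V = 2 and V = 3 values from the example above. Because every V is constrained to be at least 1, combining studies can only raise the odds of fabrication or leave them unchanged, never lower them:

```python
from functools import reduce

def combined_evidential_value(vs):
    """Combine per-study V values by multiplication, as the V method
    prescribes for independent studies within one paper."""
    assert all(v >= 1 for v in vs), "the method does not allow V < 1"
    return reduce(lambda a, b: a * b, vs, 1.0)

def posterior_odds(prior_odds, vs):
    """Posterior odds of fabrication = prior odds x combined V."""
    return prior_odds * combined_evidential_value(vs)

# The example from the text: V = 2, then V = 3 -> odds grow sixfold.
print(posterior_odds(1.0, [2.0, 3.0]))  # 6.0

# The ratchet: adding studies, even maximally unremarkable ones with
# V = 1 (the minimum), can never lower the odds of fabrication.
print(posterior_odds(1.0, [2.0, 3.0] + [1.0] * 10))  # still 6.0
```

No sequence of further studies, however clean, counts as exculpatory under this rule.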

**Klaassen (2015) writes “we apply the by now standard approach in Forensic Statistics” (p. 1). We doubt very much, however, that an approach that can only increase confidence in a defendant’s guilt could be a standard approach in court.**

We consulted an expert in Bayesian statistics (who preferred not to disclose their name). The expert found the V method problematic, and noted that, quite contrary to the V method, typical Bayesian methods would allow both upward and downward changes in one’s confidence in a prior hypothesis.

In their letter, PKW defend the V method by saying that it has been used in the Stapel and Smeesters cases. As far as we know, however, in those cases there was other, independent evidence of fraud (e.g., Stapel reported significant effects with t-test values smaller than 1; in Smeesters’ data, individual scores were distributed too evenly; see Simonsohn, 2013), and the V method only supported that other evidence. In contrast, in our case, labeling the papers in question as having “low scientific veracity” is almost always based only on V values – the second method for testing “ultra-linearity” in a set of studies (Δ*F* combined with Fisher’s method) either could not be applied, due to a low number of independent studies in the paper, or was applied and did not yield a reason for concern. We do not know what weight the V method received in the Stapel and Smeesters cases (relative to the other evidence), nor whether all the experts who examined those cases found the method useful. As noted above, a statistician we consulted found the method very problematic.

The authors of Report-R do acknowledge that combining V values becomes problematic as the number of studies increases (e.g., p. 4), and they explain in the PKW letter that “the conclusions reached in the report are never based on overall evidential values, but on the (number of) evidential values of individual samples/sub-experiments that are considered substantial”. They nevertheless proceed to compute overall V’s and report them repeatedly in Report-R (e.g., “The overall V has a lower bound of 9.93”, p. 31; “The overall V amounts to 8.77”, p. 66). Why?

**The criteria for “low veracity” are too inclusive**

… applying the V method across the board would result in erroneously retracting 1/12 to 1/19 of all published papers with experimental designs similar to those examined in Report-R (before taking into account those flagged as exhibiting “inconclusive” evidence).

In their letter, PKW write that “these probabilities are in line with (statistical) standards for accepting a chance-result as scientific evidence”. In fact, these p-values are higher than what is commonly acceptable in science. One would think that in “forensic” contexts of “fraud detection” the standard should be, if anything, even stricter (that is, with a lower chance of error).

Report-R says: “When there is no strong evidence for low scientific veracity (according to the judgment above), but there are multiple constituent (sub)experiments with a substantial evidential value, then the evidence for low scientific veracity of a publication is considered inconclusive” (p. 2). As already mentioned, UvA plans to ask journals to consider retraction of such papers. For example, in Denzler, Förster, and Liberman (2009) there are two Vs that are greater than 6 (Table 14.2) out of 17 V values computed for that paper in Report-R. The probability of obtaining two or more values of 6 or more out of 17 computed values by chance is 0.40. **Let us reiterate this figure – a 40% chance of a type-I error.**

Do these thresholds provide good enough reasons to ask journals to retract a paper or consider retraction? Apparently, the Executive Board of the University of Amsterdam thinks so. We are sure that many would disagree.

An anecdotal demonstration of the potential consequences of applying such liberal standards comes from our examination, using the V method, of a recent publication by Blanken, de Ven, Zeelenberg, and Meijers (2014, Social Psychology). We chose this paper because it had the appropriate design (three between-subjects conditions) and was conducted as part of an Open Science replication initiative. It presents three failures to replicate the moral licensing effect (e.g., Merritt, Effron, & Monin, 2010). The whole research process is fully transparent, and materials and data are available online. The three experiments in this paper yield 10 V values, two of which are higher than 6 (9.02 and 6.18; we thank PKW for correcting a slight error in our earlier computation). The probability of obtaining two or more V-values of 6 or more out of 10 by chance is 0.19. By the criteria of Report-R, this paper would be classified as showing “inconclusive evidence of low veracity”. **By the standards of UvA’s Executive Board, which did not seek any confirming evidence for statistical findings based on the V method, this would require sending a note to the journal asking it to consider retraction of this failed replication paper**. We doubt that many would find this reasonable.
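The two tail probabilities quoted above (0.40 for 2 of 17 V values, 0.19 for 2 of 10) can be reproduced with a binomial calculation. Report-R’s per-test chance of a V of 6 or more arising by luck alone is not stated in the excerpts here; the value p ≈ 0.08 used below is a back-fitted assumption on our part, chosen because it recovers both quoted figures:

```python
from math import comb

def tail_prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance that k or more of n
    independent V values exceed the threshold by luck alone."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(k, n + 1))

P_V6 = 0.08  # assumed per-test chance of V >= 6 under chance alone

# Denzler, Förster, and Liberman (2009): 2 or more of 17 values >= 6
print(round(tail_prob_at_least(2, 17, P_V6), 2))  # 0.4

# Blanken et al. (2014): 2 or more of 10 values >= 6
print(round(tail_prob_at_least(2, 10, P_V6), 2))  # 0.19
```

The point of the computation is only that, with many V values per paper, a couple of “substantial” ones are quite likely under pure chance.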

It is interesting in this context to note that in a different investigation that applied a variation of the V method (investigation of the Smeesters case) a V = 9 was used as the threshold. Simply adopting that threshold from previous work in the current report would dramatically change the conclusions. Of the 20 V values deemed “substantial” in the papers we consider here, only four have Vs over 9, which would qualify them as “substantial” with this higher threshold. Accordingly, none of the papers would have made it to the “strong evidence” category. In addition, three of the four Vs that are above 9 pertain to control conditions – we elaborate later on why this might be problematic.

**Dependence of measurement errors does not necessarily indicate low veracity**

Klaassen (2015) writes: “If authors are fiddling around with data and are fabricating and falsifying data, they tend to underestimate the variation that the data should show due to the randomness within the model. Within the framework of the above ANOVA-regression case, we model this by introducing dependence between the normal random variables ε_ij, which represent the measurement errors” (p. 3). Thus, the argument that underlies the V method is that if fraud tends to create dependence of measurement errors between independent samples, then any evidence of such dependence is indicative of fraud. This is a logically invalid deduction. There are many benign causes that might create dependency between measurement errors in independent conditions. …

**See the entire response:** Response to a Report Published by the University of Amsterdam.

**Klaassen, C. A. J. (2015).** *Evidential value in ANOVA-regression results in scientific integrity studies*. arXiv:1405.4540v2 [stat.ME].

Discussion of the Klaassen method on PubPeer: https://pubpeer.com/publications/5439C6BFF5744F6F47A2E0E9456703

**Some previous posts on Jens Förster case:**

- May 10, 2014: Who ya gonna call for statistical Fraudbusting? R.A. Fisher, P-values, and error statistics (again)
- January 18, 2015: Power Analysis and Non-Replicability: If bad statistics is prevalent in your field, does it follow you can’t be guilty of scientific fraud?

Filed under: junk science, reproducibility Tagged: Jens Forster


“There are some ironic twists in the way psychology is dealing with its replication crisis that may well threaten even the most sincere efforts to put the field on firmer scientific footing”

That’s philosopher’s talk for “I see a rich source of problems that cry out for the ministrations of philosophers of science and of statistics”. Yesterday, I began my talk at the Society for Philosophy and Psychology workshop on “Replication in the Sciences” with examples of two main philosophical tasks: to clarify concepts, and to reveal inconsistencies, tensions, and ironies surrounding methodological “discomforts” in scientific practice.

Example of a conceptual clarification

Editors of the journal *Basic and Applied Social Psychology* announced they are banning statistical hypothesis testing because it is “invalid” (A puzzle about the latest “test ban”). It’s invalid because it does not supply “the probability of the null hypothesis, given the finding” (the posterior probability of H_{0}) (Trafimow and Marks 2015).

- Since the methodology of testing explicitly rejects the mode of inference it is faulted for not supplying, it is incorrect to claim the methods are invalid on that ground.
- A simple conceptual job of the sort philosophers are good at
(I don’t know if the group of eminent statisticians assigned to react to the “test ban” will bring up this point. I don’t think it includes any philosophers.)

____________________________________________________________________________________

Example of revealing inconsistencies and tensions

Critic: It’s too easy to satisfy standard significance thresholds.

You: Why do replicationists find it so hard to achieve significance thresholds?

Critic: Obviously the initial studies were guilty of p-hacking, cherry-picking, significance seeking, QRPs.

You: So, the replication researchers want methods that pick up on and block these biasing selection effects.

Critic: Actually the “reforms” recommend methods where selection effects and data dredging make no difference.

________________________________________________________________

Whether this tension can be resolved or not is a separate question.

- We are constantly hearing of how the “reward structure” leads to taking advantage of researcher flexibility
- As philosophers, we can at least show how to hold their feet to the fire, and warn of the perils of accounts that bury the finagling

The philosopher is the curmudgeon (takes chutzpah!)

I also think it’s crucial for philosophers of science and statistics to show how to improve on and solve problems of methodology in scientific practice.

My slides are below; share comments.

Filed under: Error Statistics, reproducibility, Statistics

**MONTHLY MEMORY LANE: 3 years ago: May 2012. Lots of worthy reading and rereading for your Saturday night memory lane; it was hard to choose just 3. **

I mark in **red** **three** posts that seem most apt for general background on key issues in this blog.* (Posts that are part of a “unit” or a group of “U-Phils” count as one.) This new feature, appearing at the end of each month, began at the blog’s 3-year anniversary in Sept. 2014.

*excluding any that have been recently reblogged.

**May 2012**

- **(5/1) Stephen Senn: A Paradox of Prior Probabilities**
- (5/5) Comedy Hour at the Bayesian (Epistemology) Retreat: Highly Probable vs Highly Probed
- (5/8) LSE Summer Seminar: Contemporary Problems in Philosophy of Statistics
- (5/10) Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence”
- (5/12) Saturday Night Brainstorming & Task Forces: The TFSI on NHST (for update see 2015 Task Force)
- **(5/17) Do CIs Avoid Fallacies of Tests? Reforming the Reformers**
- (5/20) Betting, Bookies and Bayes: Does it Not Matter?
- (5/23) Does the Bayesian Diet Call For Error-Statistical Supplements?
- **(5/24) An Error-Statistical Philosophy of Evidence (PH500, LSE Seminar) – short intro to error statistics**
- (5/28) Painting-by-Number #1
- (5/31) Metablog: May 31, 2012

Filed under: 3-year memory lane

**Today is Allan Birnbaum’s Birthday.** Birnbaum’s (1962) classic “On the Foundations of Statistical Inference,” in *Breakthroughs in Statistics* (volume I, 1993), concerns a principle that remains at the heart of today’s controversies in statistics–even if it isn’t obvious at first: the Likelihood Principle (LP) (also called the strong likelihood principle, SLP, to distinguish it from the weak LP [1]). According to the LP/SLP, given the statistical model, the information from the data is fully contained in the likelihood ratio. Thus, *properties of the sampling distribution of the test statistic vanish* (as I put it in my slides from my last post)! But error probabilities are all properties of the sampling distribution. Thus, embracing the LP (SLP) blocks our error statistician’s direct ways of taking into account “biasing selection effects” (slide #10).
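The textbook illustration of this (not worked in this post, but standard in the SLP literature) contrasts binomial and negative binomial sampling. With 9 heads and 3 tails, the two likelihoods are proportional in θ, so the LP/SLP counts the evidence as identical; yet the p-values against H₀: θ = 0.5, being properties of the sampling distribution, differ. A self-contained sketch:

```python
from math import comb

# Data: 9 heads and 3 tails. Two sampling plans that could have produced it:
#   Binomial: the number of tosses, n = 12, was fixed in advance.
#   Negative binomial: tossing continued until the 3rd tail appeared.

def binom_lik(theta, heads=9, n=12):
    return comb(n, heads) * theta**heads * (1 - theta)**(n - heads)

def negbinom_lik(theta, heads=9, tails=3):
    return comb(heads + tails - 1, heads) * theta**heads * (1 - theta)**tails

def binom_pvalue(heads=9, n=12, theta0=0.5):
    """P(X >= heads) under H0: theta = theta0, X ~ Binomial(n, theta0)."""
    return sum(comb(n, k) * theta0**k * (1 - theta0)**(n - k)
               for k in range(heads, n + 1))

def negbinom_pvalue(heads=9, tails=3, theta0=0.5):
    """P(9 or more heads before the 3rd tail appears) under H0."""
    return 1 - sum(comb(k + tails - 1, k) * theta0**k * (1 - theta0)**tails
                   for k in range(heads))

# The two likelihoods are proportional (their ratio is constant in theta),
# so the LP/SLP deems the two reports evidentially equivalent...
print(binom_lik(0.7) / negbinom_lik(0.7))  # 4.0, for any theta

# ...yet their sampling distributions, and hence p-values, differ:
print(round(binom_pvalue(), 4))     # 0.073
print(round(negbinom_pvalue(), 4))  # 0.0327
```

The error statistician cares about that difference; the LP/SLP says it is irrelevant.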

*Intentions is a New Code Word:* Here, then, is where Birnbaum struggled. Why? Because he regarded controlling the probability of misleading interpretations as essential for scientific inference, and yet he seemed to have demonstrated that the LP/SLP followed from frequentist principles! That would mean error statistical principles entailed the denial of error probabilities! For many years this was assumed to be the case, and accounts that rejected error probabilities flourished. Frequentists often admitted that their approach seemed to lack what Birnbaum called a “concept of evidence”–even those who suspected there was something pretty fishy about Birnbaum’s “proof”. I have shown the flaw in Birnbaum’s alleged demonstration of the LP/SLP (most fully in the Statistical Science issue). (It only uses logic, really, yet philosophers of science do not seem interested in it.) [3]

The Statistical Science Issue: This is the first Birnbaum birthday on which I can point to the Statistical Science issue being out. I’ve a hunch that Birnbaum would have liked my rejoinder to discussants (*Statistical Science*): **Bjornstad, Dawid, Evans, Fraser, Hannig,** and **Martin and Liu**. For those unfamiliar with the argument, at the end of this entry are slides from an entirely informal talk, as well as some links from this blog. **Happy Birthday, Birnbaum!**

[1] The Weak LP concerns a single experiment; whereas, the strong LP concerns two (or more) experiments. The weak LP is essentially just the sufficiency principle.

[2] I will give $50 for each of the first 30 distinct (fully cited and linked) published examples (with distinct authors) that readers find of criticisms of frequentist methods based on arguing against the relevance of “intentions”. Include as much of the cited material as needed for a reader to grasp the general argument. Entries must be posted as a comment to this post.*****

[3] The argument still cries out for being translated into a symbolic logic of some sort.

Excerpts from my Rejoinder

I. Introduction … As long-standing as Birnbaum’s result has been, Birnbaum himself went through dramatic shifts in a short period of time following his famous (1962) result. More than of historical interest, these shifts provide a unique perspective on the current problem.

Already in the rejoinder to Birnbaum (1962), he is worried about criticisms (by Pratt 1962) pertaining to applying WCP to his constructed mathematical mixtures (what I call Birnbaumization), and hints at replacing WCP with another principle (Irrelevant Censoring). Then there is a gap until around 1968 at which point Birnbaum declares the SLP plausible “only in the simplest case, where the parameter space has but two” predesignated points (1968, 301). He tells us in Birnbaum (1970a, 1033) that he has pursued the matter thoroughly leading to “rejection of both the likelihood concept and various proposed formalizations of prior information”. The basis for this shift is that the SLP permits interpretations that “can be seriously misleading with high probability” (1968, 301). He puts forward the “confidence concept” (Conf) which takes from the Neyman-Pearson (N-P) approach “techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” while supplying it an evidential interpretation (1970a, 1033). Given the many different associations with “confidence,” I use (Conf) in this Rejoinder to refer to Birnbaum’s idea. Many of the ingenious examples of the incompatibilities of SLP and (Conf) are traceable back to Birnbaum, optional stopping being just one (see Birnbaum 1969). A bibliography of Birnbaum’s work is Giere 1977. Before his untimely death (at 53), Birnbaum denies the SLP even counts as a principle of evidence (in Birnbaum 1977). He thought it anomalous that (Conf) lacked an explicit evidential interpretation even though, at an intuitive level, he saw it as the “one rock in a shifting scene” in statistical thinking and practice (Birnbaum 1970, 1033). I return to this in part IV of this rejoinder……

IV. Post-SLP foundations. Return to where we left off in the opening section of this rejoinder: Birnbaum (1969).

The problem-area of main concern here may be described as that of determining precise concepts of statistical evidence (systematically linked with mathematical models of experiments), concepts which are to be non-Bayesian, non-decision-theoretic, and significantly relevant to statistical practice. (Birnbaum 1969, 113)

Given Neyman’s behavioral decision construal, Birnbaum claims that “when a confidence region estimate is interpreted as statistical evidence about a parameter” (1969, p. 122), an investigator has necessarily adjoined a concept of evidence, (Conf), that goes beyond the formal theory. What is this evidential concept? The furthest Birnbaum gets in defining (Conf) is in his posthumous article (1977):

(Conf) A concept of statistical evidence is not plausible unless it finds ‘strong evidence for H_{2} against H_{1}’ with small probability (α) when H_{1} is true, and with much larger probability (1 – β) when H_{2} is true. (1977, 24)

On the basis of (Conf), Birnbaum reinterprets statistical outputs from N-P theory as strong, weak, or worthless statistical evidence depending on the error probabilities of the test (1977, 24-26). While this sketchy idea requires extensions in many ways (e.g., beyond pre-data error probabilities, and beyond the two-hypothesis setting), the spirit of (Conf) – that error probabilities qualify properties of methods, which in turn indicate the warrant to accord a given inference – is, I think, a valuable shift of perspective. This is not the place to elaborate, except to note that my own twist on Birnbaum’s general idea is to appraise evidential warrant by considering the capabilities of tests to have detected erroneous interpretations, a concept I call *severity*. That Birnbaum preferred a propensity interpretation of error probabilities is not essential. What matters is their role in picking up how features of experimental design and modeling alter a method’s capabilities to control “seriously misleading interpretations”. Even those who embrace a version of probabilism may find a distinct role for a severity concept. Recall that Fisher always criticized the presupposition that a single use of mathematical probability must be competent for qualifying inference in all logical situations (1956, 47).

Birnbaum’s philosophy evolved from seeking concepts of evidence in degrees of support, belief, or plausibility between statements of data and hypotheses, to embracing (Conf) with the required control of misleading interpretations of data. The former view reflected the logical empiricist assumption that there exist context-free evidential relationships – a paradigm philosophers of statistics have been slow to throw off. The newer (post-positivist) movements in philosophy and history of science were just appearing in the 1970s. Birnbaum was ahead of his time in calling for a philosophy of science relevant to statistical practice; it is now long overdue!
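In its simplest two-hypothesis form, (Conf) can be sketched numerically. The test below is a toy example of our own devising (not Birnbaum’s): with n normal observations (σ = 1), declare “strong evidence for H₂: μ = 1 against H₁: μ = 0” when the sample mean exceeds a cutoff, and check that this happens with small probability α under H₁ and much larger probability 1 − β under H₂:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def error_probs(n, cutoff, mu1=0.0, mu2=1.0, sigma=1.0):
    """For the test that declares 'strong evidence for H2: mu = mu2'
    when the mean of n observations exceeds `cutoff`:
    alpha = P(declare | H1 true), power = P(declare | H2 true)."""
    se = sigma / sqrt(n)
    alpha = 1 - norm_cdf((cutoff - mu1) / se)
    power = 1 - norm_cdf((cutoff - mu2) / se)
    return alpha, power

# With n = 25 and cutoff 0.33: small alpha, much larger 1 - beta,
# as (Conf) requires of a plausible concept of statistical evidence.
alpha, power = error_probs(n=25, cutoff=0.33)
print(round(alpha, 3), round(power, 4))  # alpha just under 0.05, power ~0.9996
```

On (Conf)’s reading, such outputs would count as strong statistical evidence; shrinking n or raising the cutoff degrades them toward weak or worthless evidence.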

“Relevant clarifications of the nature and roles of statistical evidence in scientific research may well be achieved by bringing to bear in systematic concert the scholarly methods of statisticians, philosophers and historians of science, and substantive scientists” (Birnbaum 1972, 861).

**Link to complete discussion: **

Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle (with discussion & rejoinder). *Statistical Science* 29 (2014), no. 2, 227-266.

**Links to individual papers:**

Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle. *Statistical Science* 29 (2014), no. 2, 227-239.

Dawid, A. P. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. *Statistical Science* 29 (2014), no. 2, 240-241.

Evans, Michael. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. *Statistical Science* 29 (2014), no. 2, 242-246.

Martin, Ryan; Liu, Chuanhai. Discussion: Foundations of Statistical Inference, Revisited. *Statistical Science* 29 (2014), no. 2, 247-251.

Fraser, D. A. S. Discussion: On Arguments Concerning Statistical Principles. *Statistical Science* 29 (2014), no. 2, 252-253.

Hannig, Jan. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. *Statistical Science* 29 (2014), no. 2, 254-258.

Bjørnstad, Jan F. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. *Statistical Science* 29 (2014), no. 2, 259-260.

Mayo, Deborah G. Rejoinder: “On the Birnbaum Argument for the Strong Likelihood Principle”. *Statistical Science* 29 (2014), no. 2, 261-266.

**Abstract:** An essential component of inference based on familiar frequentist notions, such as p-values, significance and confidence levels, is the relevant sampling distribution. This feature results in violations of a principle known as the strong likelihood principle (SLP), the focus of this paper. In particular, if outcomes *x*^{∗} and *y*^{∗} from experiments *E*_{1} and *E*_{2} (both with unknown parameter *θ*) have different probability models *f*_{1}(.), *f*_{2}(.), then even though *f*_{1}(*x*^{∗}; *θ*) = c*f*_{2}(*y*^{∗}; *θ*) for all *θ*, outcomes *x*^{∗} and *y*^{∗} may have different implications for an inference about *θ*. Although such violations stem from considering outcomes other than the one observed, we argue, this does not require us to consider experiments other than the one performed to produce the data. David Cox [Ann. Math. Statist. 29 (1958) 357–372] proposes the Weak Conditionality Principle (WCP) to justify restricting the space of relevant repetitions. The WCP says that once it is known which *E*_{i} produced the measurement, the assessment should be in terms of the properties of *E*_{i}.

**Key words:** Birnbaumization, likelihood principle (weak and strong), sampling theory, sufficiency, weak conditionality

Regular readers of this blog know that the topic of the “Strong Likelihood Principle (SLP)” has come up quite frequently. Numerous informal discussions of earlier attempts to clarify where Birnbaum’s argument for the SLP goes wrong may be found on this blog. [SEE PARTIAL LIST BELOW.[i]] These mostly stem from my initial paper Mayo (2010) [ii]. I’m grateful for the feedback.

**[i]** A quick take on the argument may be found in the appendix to: “A Statistical Scientist Meets a Philosopher of Science: A conversation between David Cox and Deborah Mayo (as recorded, June 2011)”

Some previous posts on this topic can be found at the following links (and by searching this blog with key words):

- Midnight with Birnbaum (Happy New Year).
- New Version: On the Birnbaum argument for the SLP: Slides for my JSM talk.
- Don’t Birnbaumize that experiment my friend*–updated reblog.
- Allan Birnbaum, Philosophical Error Statistician: 27 May 1923 – 1 July 1976 .
- LSE seminar
- A. Birnbaum: Statistical Methods in Scientific Inference
- ReBlogging the Likelihood Principle #2: Solitary Fishing: SLP Violations
- Putting the brakes on the breakthrough: An informal look at the argument for the Likelihood Principle.

**UPhils and responses**

- U-PHIL: Gandenberger & Hennig : Blogging Birnbaum’s Proof
- U-Phil: Mayo’s response to Hennig and Gandenberger
- Mark Chang (now) gets it right about circularity
- U-Phil: Ton o’ Bricks
- Blogging (flogging?) the SLP: Response to Reply- Xi’an Robert
- U-Phil: J. A. Miller: Blogging the SLP

**[ii]**

- Mayo, D. G. (2010). “An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle” in
*Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science*(D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 305-14.

Below are my slides from my May 2, 2014 presentation in the Virginia Tech Department of Philosophy 2014 Colloquium series:

“Putting the Brakes on the Breakthrough, or ‘How I used simple logic to uncover a flaw in a controversial 50 year old ‘theorem’ in statistical foundations taken as a ‘breakthrough’ in favor of Bayesian vs frequentist error statistics’”

Birnbaum, A. 1962. “On the Foundations of Statistical Inference.” In *Breakthroughs in Statistics*, edited by S. Kotz and N. Johnson, 1:478–518. Springer Series in Statistics 1993. New York: Springer-Verlag.

*****Judges reserve the right to decide if the example constitutes the relevant use of “intentions” (amid a foundations of statistics criticism) in a published article. Different subsets of authors can count for distinct entries. No more than 2 entries per person. This means we need your name.

Filed under: Birnbaum, Birnbaum Brakes, frequentist/Bayesian, Likelihood Principle, phil/history of stat, Statistics

*“The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference*,” at the 2015 American Psychological Society (APS) Annual Convention in NYC, May 23, 2015:

**D. Mayo: “Error Statistical Control: Forfeit at your Peril” **

**S. Senn: “‘Repligate’: reproducibility in statistical studies. What does it mean and in what sense does it matter?”**

**A. Gelman: “The statistical crisis in science” (this is not his exact presentation, but he focussed on some of these slides)**

**For more details see this post.**

Filed under: Bayesian/frequentist, Error Statistics, P-values, reforming the reformers, reproducibility, S. Senn, Statistics

**Society for Philosophy and Psychology (SPP): 41st Annual meeting**

**SPP 2015 Program**

**Wednesday, June 3rd**

** 1:30-6:30: Preconference Workshop on Replication in the Sciences, organized by Edouard Machery**

**1:30-2:15: Edouard Machery (Pitt)**

**2:15-3:15: Andrew Gelman (Columbia, Statistics, via video link)**

**3:15-4:15: Deborah Mayo (Virginia Tech, Philosophy)**

*4:15-4:30: Break*

**4:30-5:30: Uri Simonsohn (Penn, Psychology)**

**5:30-6:30: Tal Yarkoni (University of Texas, Neuroscience)**

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference,** 2015 APS Annual Convention Saturday, May 23 2:00 PM- 3:50 PM in Wilder (Marriott Marquis 1535 B’way)

See earlier post for Frank Sinatra and more details

Filed under: Announcement, reproducibility

**A new joint paper….**

**“Error statistical modeling and inference: Where methodology meets ontology”**

**Aris Spanos · Deborah G. Mayo**

**Abstract:** In empirical modeling, an important desideratum for deeming theoretical entities and processes real is that they can be reproducible in a statistical sense. Current day crises regarding replicability in science intertwine with the question of how statistical methods link data to statistical and substantive theories and models. Different answers to this question have important methodological consequences for inference, which are intertwined with a contrast between the ontological commitments of the two types of models. The key to untangling them is the realization that behind every substantive model there is a statistical model that pertains exclusively to the probabilistic assumptions imposed on the data. It is not that the methodology determines whether to be a realist about entities and processes in a substantive field. It is rather that the substantive and statistical models refer to different entities and processes, and therefore call for different criteria of adequacy.

**Keywords:** Error statistics · Statistical vs. substantive models · Statistical ontology · Misspecification testing · Replicability of inference · Statistical adequacy

To read the full paper: “Error statistical modeling and inference: Where methodology meets ontology.”

The related conference.

**Reference: **Spanos, A. & Mayo, D. G. (2015). “Error statistical modeling and inference: Where methodology meets ontology.” *Synthese* (online May 13, 2015), pp. 1-23.

Filed under: Error Statistics, misspecification testing, O & M conference, reproducibility, Severity, Spanos

**Stephen Senn**

Head of Competence Center for Methodology and Statistics (CCMS)

Luxembourg Institute of Health

**Double Jeopardy?: Judge Jeffreys Upholds the Law**

“But this could be dealt with in a rough empirical way by taking twice the standard error as a criterion for possible genuineness and three times the standard error for definite acceptance.” Harold Jeffreys (1), p. 386

This is the second of two posts on P-values. In the first, The Pathetic P-Value, I considered the relation of P-values to Laplace’s Bayesian formulation of induction, pointing out that P-values, whilst they had a very different interpretation, were numerically very similar to a type of Bayesian posterior probability. In this one, I consider their relation, or lack of it, to Harold Jeffreys’s radically different approach to significance testing. (An excellent account of the development of Jeffreys’s thought is given by Howie (2), which I recommend highly.)

The story starts with Cambridge philosopher C. D. Broad (1887-1971), who in 1918 pointed to a difficulty with Laplace’s Law of Succession. Broad considers the problem of drawing counters from an urn containing *n* counters and supposes that all *m* drawn have been observed to be white. He now considers two very different questions, which have two very different probabilities: the probability that the next counter drawn will be white, which is (*m* + 1)/(*m* + 2), and the probability that all *n* counters in the urn are white, which is (*m* + 1)/(*n* + 1).

Note that in the case that only one counter remains we have *n = m + 1* and the two probabilities are the same. However, if *n > m+1* they are not the same and in particular if *m* is large but *n* is much larger, the first probability can approach 1 whilst the second remains small.
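Broad’s contrast can be made concrete with a small computation. This is a sketch assuming Laplace’s uniform prior; the urn sizes below are illustrative choices, not values from Broad:

```python
from fractions import Fraction

def next_white(m):
    """Laplace's rule of succession: P(next counter is white),
    given m counters drawn and all observed to be white."""
    return Fraction(m + 1, m + 2)

def all_white(m, n):
    """Broad's result: P(all n counters in the urn are white),
    given m of them drawn and all white."""
    return Fraction(m + 1, n + 1)

m, n = 100, 100_000
# The next-draw probability is close to 1, but the "all white"
# probability stays tiny because n is much larger than m.
print(float(next_white(m)), float(all_white(m, n)))
```

Note that when `n = m + 1` the two functions agree, exactly as the post observes.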

The practical implication of this is that just because Bayesian induction implies that a long sequence of successes (and no failures) supports the belief that the next trial will be a success, it does not follow that one should believe that all future trials will be so. This distinction is often misunderstood. Here is The Economist getting it wrong in September 2000:

The canonical example is to imagine that a precocious newborn observes his first sunset, and wonders whether the sun will rise again or not. He assigns equal prior probabilities to both possible outcomes, and represents this by placing one white and one black marble into a bag. The following day, when the sun rises, the child places another white marble in the bag. The probability that a marble plucked randomly from the bag will be white (ie, the child’s degree of belief in future sunrises) has thus gone from a half to two-thirds. After sunrise the next day, the child adds another white marble, and the probability (and thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise.

See *Dicing with Death*(3) (pp76-78).

The practical relevance of this is that scientific laws cannot be established by Laplacian induction. Jeffreys (1891-1989) puts it thus:

Thus I may have seen 1 in 1000 of the ‘animals with feathers’ in England; on Laplace’s theory the probability of the proposition, ‘all animals with feathers have beaks’, would be about 1/1000. This does not correspond to my state of belief or anybody else’s. (p. 128)

Here Jeffreys is using Broad’s formula with the ratio of *m* to *n* equal to 1:1000.

To Harold Jeffreys the situation was unacceptable. Scientific laws *had to be* capable of being, if not proved, at least made more likely by a process of induction. The solution he found was to place a lump of probability on the simpler model that any particular scientific law would imply, compared to some vaguer and more general alternative. In hypothesis-testing terms, we can say that Jeffreys moved from testing

H_{0A}: θ ≤ 0 vs H_{1A}: θ > 0 and H_{0B}: θ ≥ 0 vs H_{1B}: θ < 0

to testing

H_{0}: θ = 0 vs H_{1}: θ ≠ 0

As he put it

The essential feature is that we express ignorance of whether the new parameter is needed by taking half the prior probability for it as concentrated in the value indicated by the null hypothesis, and distributing the other half over the range possible.(1) (p249)

Now, the interesting thing is that in frequentist terms these two formulations make very little difference. The P-value calculated in the second case is the same as that in the first, although its interpretation is slightly different. In the second case it is in a sense exact, since the null hypothesis is ‘simple’. In the first case it is a maximum, since for a given statistic one calculates the probability of a result as extreme or more extreme than that observed for that value of the null hypothesis for which this probability is maximised.

In the Bayesian case the answers are radically different, as is shown by the attached figure, which gives one-sided P-values and posterior probabilities (calculated from a simulation for fun rather than by necessity) for smooth and lump prior distributions. If we allow that θ may vary smoothly over some range, which is, in a sense, case 1 and is the Laplacian formulation, we get a very different result from allowing it to have a lump of probability at 0, which is the innovation of Jeffreys. The origin of the difference lies in the prior probability. It may seem that we are still in the world of uninformative prior probabilities, but this is far from so. In the Laplacian formulation every value of θ is equally likely. However, in the Jeffreys formulation the value under the null is infinitely more likely than any other value. This fact is partly hidden by the approach: first make *H*0 and *H*1 equally likely, then make every value under *H*1 equally likely. The net result is that all values of θ are far from being equally likely.
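The smooth-versus-lump contrast can be sketched numerically. Below is a minimal analytic version (rather than Senn's simulation) for a normal mean: under a flat ("smooth") prior the posterior probability that θ ≤ 0 equals the one-sided P-value, whereas under a Jeffreys-style lump prior, with P(H0) = 1/2 on θ = 0 and θ ~ N(0, τ²) under H1, the posterior probability of H0 follows from the Bayes factor. The sample size n and prior variance τ² here are illustrative assumptions, not values from the post:

```python
import math

def norm_pdf(z, var=1.0):
    """Density of a normal with mean 0 and the given variance."""
    return math.exp(-z * z / (2 * var)) / math.sqrt(2 * math.pi * var)

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, tau2 = 10, 1.0  # sample size and H1 prior variance: illustrative choices

for z in (1.645, 1.96, 2.576):  # conventional one-sided cut-off points
    p_one_sided = 1 - norm_cdf(z)  # = posterior P(theta <= 0) under a flat prior
    # Lump prior: marginally z ~ N(0, 1) under H0 and z ~ N(0, 1 + n*tau2) under H1.
    bf_01 = norm_pdf(z, 1.0) / norm_pdf(z, 1.0 + n * tau2)
    post_h0 = bf_01 / (1 + bf_01)  # with prior P(H0) = 1/2
    print(f"z = {z}: one-sided P = {p_one_sided:.3f}, lump-prior P(H0) = {post_h0:.3f}")
```

The lump-prior posterior probability of the null is far larger than the P-value at every conventional cut-off, which is exactly the divergence the figure displays.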

This simple case is a paradigm for a genuine issue in Bayesian inference that arises over and over again. It is crucially important how you set up the problem when specifying prior distributions. (Note that this is not in itself a criticism of the Bayesian approach. It is a warning that care is necessary.)

Has Jeffreys’s innovation been influential? Yes and no. A lot of Bayesian work seems to get by without it. For example, an important paper on Bayesian Approaches to Randomized Trials by Spiegelhalter, Freedman and Parmar (4), which has been cited more than 450 times according to Google Scholar (as of 7 May 2015), considers four types of smooth priors, reference, clinical, sceptical and enthusiastic, none of which involve a lump of probability.

However, there is one particular area where an analogous approach is very common and that is model selection. The issue here is that if one just uses likelihood as a guide between models one will always prefer a more complex one to a simpler one. A practical likelihood based solution is to use a penalised form whereby the likelihood is handicapped by a function of the number of parameters. The most famous of these is the AIC criterion. It is sometimes maintained that this deals with a very different sort of problem to that addressed by hypothesis/significance testing but its originator, Akaike (1927-2009) (5), clearly did not think so, writing

So, it is clear that Akaike regarded this as a unifying approach to estimation *and* hypothesis testing: that which was primarily an estimation tool, likelihood, was now a model selection tool also.

However, as Murtaugh (6) points out, there is a strong similarity between using AIC and P-values to judge the adequacy of a model. The AIC criterion involves the log-likelihood, which is also what is involved in the analysis of deviance, where asymptotically minus twice the difference in log-likelihoods between two nested models has a chi-square distribution with mean equal to the difference in the number of parameters fitted. The net result is that if you use AIC to choose between a simpler model and a more complex model within which it is nested, and the more complex model has one extra parameter, choosing or rejecting the more complex model is equivalent to using a significance threshold of 16%.
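The 16% figure is easy to verify: with one extra parameter, AIC prefers the larger model exactly when the likelihood-ratio statistic exceeds 2, and the tail probability of a one-degree-of-freedom chi-square beyond 2 is about 0.157. A minimal check:

```python
import math

# AIC = -2*logL + 2k.  Adding one parameter raises the penalty by 2, so the
# larger nested model wins on AIC iff the likelihood-ratio statistic
# 2*(logL1 - logL0) exceeds 2.  Under the simpler model that statistic is
# asymptotically chi-square with 1 df, so the implied significance level is:

def chi2_1_sf(x):
    """P(chi-square with 1 df > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2))

implied_alpha = chi2_1_sf(2.0)
print(round(implied_alpha, 3))  # about 0.157, i.e. Murtaugh's "16%"
```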

For those who are used to the common charge levelled by, for example, Berger and Sellke (7) and, more recently, David Colquhoun (8) in his 2014 paper, that the P-value approach gives significance too easily, this is a baffling result: here significance tests are too conservative rather than too liberal. Of course, Bayesians will prefer the BIC to the AIC, and this means that there is a further influence of sample size on inference that is not captured by any function that depends on likelihood and number of parameters only. Nevertheless, it is hard to argue that, whatever advantages the AIC may have in terms of flexibility, for the purpose of comparing nested models it somehow represents a more rigorous approach than significance testing.

However, it is easily understood if one appreciates the following. Within the Bayesian framework, in abandoning smooth priors for lump priors, it is also necessary to change the probability standard. (In fact I speculate that the 1 in 20 standard seemed reasonable partly because of the smooth prior.) In formulating the hypothesis-testing problem the way that he does, Jeffreys has already used up any preference for parsimony in terms of prior probability. Jeffreys made it quite clear that this was his view, stating

I maintain that the only ground that we can possibly have for not rejecting the simple law is that we believe that it is quite likely to be true. (p. 119)

He then proceeds to express this in terms of a prior probability. Thus there can be no double jeopardy. A parsimony principle is used on the prior distribution. You can’t use it again on the posterior distribution. Once that is calculated, you should simply prefer the more probable model. The error that is made is not only to assume that P-values are what they are not, but also to assume that, when one tries to interpret them in the way that one should not, the previous calibration survives.

It is as if in giving recommendations in dosing children one abandoned a formula based on age and adopted one based on weight but insisted on using the same number of kg one had used for years.

Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than P-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities.

ACKNOWLEDGEMENT

My research on inference for small populations is carried out in the framework of the IDEAL project http://www.ideal.rwth-aachen.de/ and supported by the European Union’s Seventh Framework Programme for research, technological development and demonstration under Grant Agreement no 602552.

REFERENCES

- Jeffreys H. Theory of Probability. Third ed. Oxford: Clarendon Press; 1961.
- Howie D. Interpreting Probability: Controversies and Developments in the Early Twentieth Century. Skyrms B, editor. Cambridge: Cambridge University Press; 2002. 262 p.
- Senn SJ. Dicing with Death. Cambridge: Cambridge University Press; 2003.
- Spiegelhalter DJ, Freedman LS, Parmar MKB. Bayesian Approaches to Randomized Trials. Journal of the Royal Statistical Society Series a-Statistics in Society. 1994;157:357-87.
- Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Czáki F, editors. Second International Symposium on Information Theory. Budapest: Akademiai Kiadó; 1973. p. 267-81.
- Murtaugh PA. In defense of P values. Ecology. 2014;95(3):611-7.
- Berger JO, Sellke T. Testing a point null hypothesis: the irreconcilability of *p* values and evidence (with discussion). J Amer Statist Assoc. 1987;82:112-39.
- Colquhoun D. An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science. 2014;1(3):140216.

Also Relevant

- Casella G, Berger RL. Reconciling Bayesian and frequentist evidence in the one-sided testing problem (with discussion). J Amer Statist Assoc. 1987;82:106-11, 123-39.

Related Posts

P-values overstate the evidence?

P-values versus posteriors (comedy hour)

Spanos: Recurring controversies about P-values and confidence intervals (paper in relation to Murtaugh)

Filed under: Jeffreys, P-values, reforming the reformers, Statistics, Stephen Senn

Filed under: frequentist/Bayesian, msc kvetch, rejected post

These days, there are so many dubious assertions about alleged correlations between two variables that an entire website, Spurious Correlations (Tyler Vigen), is devoted to exposing (and creating*) them! A classic problem is that the means of variables X and Y may both be trending in the order in which the data are observed, invalidating the assumption that their means are constant. In my initial study with Aris Spanos on misspecification testing, the X and Y means were trending in much the same way I imagine a lot of the examples on this site are, like the one relating the number of people who die by becoming tangled in their bedsheets to the per capita consumption of cheese in the U.S.

The annual data for 2000-2009 are: x_{t}, per capita consumption of cheese (U.S.): **x** = (29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8); y_{t}, number of people who died by becoming tangled in their bedsheets: **y** = (327, 456, 509, 497, 596, 573, 661, 741, 809, 717).
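A quick way to see the trending-means problem with these data is to compare the raw correlation with the correlation after removing a linear trend from each series. This is a sketch; linear detrending is one simple way to handle trending means, not necessarily the approach taken in Spanos's note:

```python
import numpy as np

# Annual data, 2000-2009, as given in the post
x = np.array([29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8])  # cheese consumption
y = np.array([327, 456, 509, 497, 596, 573, 661, 741, 809, 717], float)     # bedsheet deaths
t = np.arange(len(x))

r_raw = np.corrcoef(x, y)[0, 1]

def detrend(z, t):
    """Residuals from a least-squares straight line fitted against time."""
    slope, intercept = np.polyfit(t, z, 1)
    return z - (intercept + slope * t)

r_detrended = np.corrcoef(detrend(x, t), detrend(y, t))[0, 1]
# The raw correlation is very high; the detrended one is much weaker.
print(round(r_raw, 3), round(r_detrended, 3))
```

Both series rise over the decade, so almost any two such series will appear strongly correlated until the common trend is taken into account.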

I asked Aris Spanos to have a look, and it took him no time to identify the main problem. He was good enough to write up a short note which I’ve pasted as slides.

**Aris Spanos**

*Wilson E. Schmidt Professor of Economics*

*The site says that the server attempts to generate a new correlation every 60 seconds.

Filed under: misspecification testing, Spanos, Statistics, Testing Assumptions