It’s an apt time to reblog the “statistical dirty laundry” post from 2013 here. I hope we can take up the recommendations from Simmons, Nelson and Simonsohn at the end (Note [5]), which we didn’t last time around.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I finally had a chance to fully read the 2012 Tilburg Report* on “Flawed Science” last night. Here are some stray thoughts…
1. Slipping into pseudoscience.
The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”[1]. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job:
One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts…. “[T]he use of research procedures in such a way as to ‘repress’ negative results by some means” may be called verification bias. [my emphasis] (Report, 48).
I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses, as to count as no evidence at all (see some from their list). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89). [2] It is unclear at what point a field slips into the pseudoscience realm.
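To see how quickly the first gambit in the quoted rule (“continuing an experiment until it works as desired”) erodes a test’s stringency, here is a minimal simulation sketch of my own (not from the Report; it assumes a true null effect, a hypothetical researcher who peeks after every 10 subjects, and uses Python with numpy/scipy): stopping at the first p < 0.05 yields “significant” results well above the nominal 5%.

```python
# Minimal sketch (not from the Report): "continuing an experiment until it works".
# A hypothetical researcher tests after every 10 subjects and stops at the first
# p < 0.05, even though the true effect is zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, max_n, batch, alpha = 5000, 100, 10, 0.05
false_positives = 0

for _ in range(n_studies):
    data = np.empty(0)
    for _ in range(max_n // batch):
        data = np.concatenate([data, rng.normal(0.0, 1.0, batch)])  # null: mean 0
        if stats.ttest_1samp(data, 0.0).pvalue < alpha:
            false_positives += 1  # stop and "report" the significant result
            break

print(f"Nominal alpha: {alpha:.2f}; rate of 'significant' null findings with peeking:"
      f" {false_positives / n_studies:.2f}")
```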
2. A role for philosophy of science?
I am intrigued that one of the final recommendations in the Report is this:
In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered. Based on these insights, research Master’s students and PhD students must receive practical training from their supervisors in the application of the rules governing proper and honest scientific research, which should include examples of such undesirable conduct as data massage. The Graduate School must explicitly ensure that this is implemented.
A philosophy department could well create an entire core specialization that revolved around “the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science” (ideally linked with one or more other departments). That would be both innovative and fill an important gap, it seems to me. Is anyone doing this?
3. Hanging out some statistical dirty laundry.
Items in their laundry list include:
- An experiment fails to yield the expected statistically significant results. The experiment is repeated, often with minor changes in the manipulation or other conditions, and the only experiment subsequently reported is the one that did yield the expected results. The article makes no mention of this exploratory method… It should be clear, certainly with the usually modest numbers of experimental subjects, that using experiments in this way can easily lead to an accumulation of chance findings….
- A variant of the above method is: a given experiment does not yield statistically significant differences between the experimental and control groups. The experimental group is compared with a control group from a different experiment—reasoning that ‘they are all equivalent random groups after all’—and thus the desired significant differences are found. This fact likewise goes unmentioned in the article….
- The removal of experimental conditions. For example, the experimental manipulation in an experiment has three values. …Two of the three conditions perform in accordance with the research hypotheses, but a third does not. With no mention in the article of the omission, the third condition is left out….
- The merging of data from multiple experiments [where data] had been combined in a fairly selective way,…in order to increase the number of subjects to arrive at significant results…
- Research findings were based on only some of the experimental subjects, without reporting this in the article. On the one hand ‘outliers’…were removed from the analysis whenever no significant results were obtained. (Report, 49-50)
For many further examples, and also caveats [3], see the Report. (A back-of-the-envelope sketch of the first gambit in the list appears just below.)
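To put a rough number on that first gambit, here is a back-of-the-envelope sketch (mine, not the Committees’): if each repeat of a null experiment has a 5% chance of crossing the significance threshold by luck alone, and only the “successful” repeat is reported, the chance of at least one spurious success grows quickly with the number of hidden repeats.

```python
# Sketch (not from the Report): probability that at least one of k repeats of a
# null experiment crosses p < 0.05, when only the "successful" repeat is reported.
alpha = 0.05
for k in (1, 2, 3, 5, 10):
    p_at_least_one = 1 - (1 - alpha) ** k
    print(f"{k:2d} repeats -> P(at least one 'significant' result) = {p_at_least_one:.2f}")
```

Five hidden repeats already push the chance of a reportable “finding” past one in five.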
4. Significance tests don’t abuse science, people do.
Interestingly, the Report distinguishes the above laundry list from “statistical incompetence and lack of interest found” (52). If the methods used were statistical, then the scrutiny might be called “metastatistical” or the full scrutiny “meta-methodological”. Stapel often fabricated data, but the upshot of these criticisms is that sufficient finagling may similarly predetermine that a researcher’s favorite hypothesis gets support. (There is obviously a big advantage in having the data to scrutinize, as many are now demanding). Is it a problem of these methods that they are abused? Or does the fault lie with the abusers? Obviously the latter. Statistical methods don’t kill scientific validity, people do.
I have long rejected dichotomous testing, but the gambits in the laundry list create problems even for more sophisticated uses of methods, e.g., for indicating magnitudes of discrepancy and associated confidence intervals. At least the methods admit of tools for mounting a critique.
In “The Mind of a Con Man” (NY Times, April 26, 2013) [4], Diederik Stapel explains how he always read the research literature extensively to generate his hypotheses: “So that it was believable and could be argued that this was the only logical thing you would find.” Rather than report on believability, researchers need to report the properties of the methods they used: what was their capacity to have identified, avoided, or admitted verification bias? The role of probability here would not be to quantify the degree of confidence or believability in a hypothesis, given the background theory or most intuitively plausible paradigms, but rather to check how severely probed or well-tested a hypothesis is, whether the assessment is formal, quasi-formal or informal. Was a good job done in scrutinizing flaws…or a terrible one? Or was there just a bit of data massaging and cherry picking to support the desired conclusion? As a matter of routine, researchers should tell us. Yes, as Joe Simmons, Leif Nelson and Uri Simonsohn suggest in “A 21-word solution”, they should “say it!” I am no longer inclined to regard their recommendation as too unserious: researchers who are “clean” should go ahead and “clap their hands” [5]. (I will consider their article in a later post…)
I recommend reading the Tilburg Report!
*The subtitle is “The fraudulent research practices of social psychologist Diederik Stapel.”
[1] “A ‘byproduct’ of the Committees’ inquiries is the conclusion that, far more than was originally assumed, there are certain aspects of the discipline itself that should be deemed undesirable or even incorrect from the perspective of academic standards and scientific integrity.” (Report 54).
[2] Mere falsifiability, by the way, does not suffice for stringency; but there are also methods Popper rejects that could yield severe tests, e.g., double counting. (Search this blog for more entries.)
[3] “It goes without saying that the Committees are not suggesting that unsound research practices are commonplace in social psychology. …although they consider the findings of this report to be sufficient reason for the field of social psychology in the Netherlands and abroad to set up a thorough internal inquiry into the state of affairs in the field” (Report, 48).
[4] Philosopher Janet Stemwedel discusses the NY Times article, noting that Stapel taught a course on research ethics!
[5] From Simmons, Nelson and Simonsohn:
Many support our call for transparency, and agree that researchers should fully disclose details of data collection and analysis. Many do not agree. What follows is a message for the former; we begin by preaching to the choir.
Choir: There is no need to wait for everyone to catch up with your desire for a more transparent science. If you did not p-hack a finding, say it, and your results will be evaluated with the greater confidence they deserve.
If you determined sample size in advance, say it.
If you did not drop any variables, say it.
If you did not drop any conditions, say it.
(From the Fall 2012 Newsletter for the Society for Personality and Social Psychology.)
Popper, K. (1994). The Myth of the Framework: In Defence of Science and Rationality. London: Routledge.
Re: “That would be both innovative and fill an important gap, it seems to me. Is anyone doing this?” Jan-Willem Romeijn (philosophy dept at Groningen) and I outlined such a plan that is going to be deployed at the University of Groningen’s University College. Our idea was to *begin* methods education with philosophy. I’ve since moved to Cardiff University, but I have similar plans for the curriculum here.
Richard: I was going to ask if there was a group anywhere that felt well qualified to do this.
I do feel purposeful philosophy of science is quite rare (purposeful being Peirce’s later favoured synonym for pragmatic), as is a purposeful grasp of applying statistics in research (as, say, related in “Questions, Answers and Statistics”, 1986, by Terry Speed: http://iase-web.org/documents/papers/icots2/Speed.pdf).
So I would want to see folks in Mayo’s and Gelman’s league working collaboratively.
OK, wishful thinking, but it would be nice to see outlines (i.e., I am suggesting you lean over backwards to discover what you might be wrong about here).
Mayo: Nice post, but you are dismissing any possible role for priors to help us bend over backwards in order for reality to surprise us in how we are wrong. Why not simply say they have to help in that regard?
Keith
Keith: Thanks for your comment. Just to address your last point:
“Why not simply say they [prior probabilities] have to help..?”
Here’s my answer:
1. To appeal to your disbelief in H as a way to criticize whether data x warrant H is a fallacy. Moreover, the dispute becomes a war over beliefs in the very hypothesis to be probed; it begs the question and fails to achieve what’s needed.
2. What is required is to show what’s wrong with the methodology used, such that the error probabilities are illicit. Reported error probabilities can be shown to be invalidated when they fail an audit, which has two parts: checking whether
(a) violated assumptions (of the statistical model, or of the links from statistical to substantive claims), or
(b) a slew of “selection effects” (cherry picking, barn hunting, ad hoc adjustments, monster barring, data-dependent hypotheses, etc.)
result in violating the error probability requirements for a severe test of claim H. (They don’t always!)
The background information and techniques for (a) and (b) are not captured by a prior probability in H; they involve knowledge of the errors, biases, flaws and foibles that obstruct learning for the type of problem and stage of inquiry at hand.
3. The critique in #2 turns on the fact that the flaws incapacitate the error probability controls; the reported capability of the test to inform and control misleading interpretations is not its actual capacity (i.e., reported error probabilities differ from actual ones, whether formal or qualitative). The entire appeal to error probabilities of this type is outside the Bayesian frameworks, at least officially. (A small simulation below illustrates the gap between reported and actual error probabilities.)
4. Going back to #1, we may not have a clue if H is plausible, but if we do have legitimate grounds to regard it as implausible, this will invariably boil down to its having passed poor tests. That is at most grounds to suspect some methodological flaw (point #2). It does not demonstrate the methodological flaw (without begging the question). It can at most motivate an interested critic to check for such flaws (as does knowledge of conflicts of interest).
Finally, the plausibility of H is completely distinct from whether someone has provided a real test of it. Scientists must be able to say things like “H is correct (plausible, solves our problems, etc), but this test of it is lousy and lacks probative force!”
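To make the reported-versus-actual contrast in point 3 concrete, here is a small simulation of my own (an illustrative sketch, not part of the original comment), covering only part (a) of the audit: observations that violate the independence assumption are fed to a standard one-sample t-test, and the actual type I error rate under a true null comes out far larger than the nominal 5% the test reports.

```python
# Illustration (not from the comment above): part (a) of the audit.
# Positively autocorrelated observations violate the t-test's independence
# assumption, so the nominal 5% test rejects a true null far more often than reported.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n, rho, alpha = 5000, 50, 0.5, 0.05

def ar1_noise(n, rho, rng):
    """Zero-mean AR(1) series: x[t] = rho * x[t-1] + e[t]."""
    e = rng.normal(size=n)
    x = np.empty(n)
    x[0] = e[0]
    for t in range(1, n):
        x[t] = rho * x[t - 1] + e[t]
    return x

rejections = sum(
    stats.ttest_1samp(ar1_noise(n, rho, rng), 0.0).pvalue < alpha
    for _ in range(n_sims)
)
print(f"Reported alpha: {alpha:.2f}; actual rejection rate: {rejections / n_sims:.2f}")
```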
It’s kind of impressive how no matter how many Bayesians say prior probs aren’t ‘beliefs’, no matter how many write in excruciating detail that prior probs aren’t ‘beliefs’, and no matter how many Bayesians successfully apply prior probs which aren’t ‘beliefs’ in hundreds of thousands of real problems, you always simply dismiss them as ‘beliefs’ every single time it comes up.
Truly, it’s impressive. It does make it impossible and pointless to discuss statistics with you though.
Are you referring to “non subjective” Bayesians then? But here we get priors which are given to be undefined mathematical entities purely to compute a posterior. Their goal is to have the LEAST impact on the data, so how do they help? I didn’t think people were referring to those to help with the problem we were discussing. I assumed the idea was to appeal to prior information about the correctness, plausibility of the H in question. My remarks still hold in cases where these beliefs are well founded. So I fail to see the relevance of your remark, as stated.
Respectfully, your knowledge of Objective Bayes is so shallow it’s impossible to engage you intelligibly. Please take the time to understand O-Bayes and stop repeating the same old howlers and falsehoods. You’re doing a disservice to all involved.
BayesPhilo: My knowledge of O-Bayes is directly through Jim Berger, Bernardo, and Sun. You ought to be able to sketch how you think your recommended methodology can be used to address the particular problem being discussed, as many of us have tried to do, and we did so without disguises. Else, we will have to conclude that it is fruitless for us to engage with YOU intelligibly.
Mayo: I just want to increase and accelerate my getting less wrong, so I would not worry about going outside the (various and strictly uncountable) _official_ Bayesian frameworks.
> involves background knowledge of errors, biases, flaws and foibles that obstruct learning of the type of problem and stage of inquiry at hand.
I could try to quantify this into a prior (and in the process learn a lot about the background) and then give up; or use it just as a sensitivity analysis; or use it to discern poor error properties of a complex frequency-based approach; or replace the frequency-based approach with Bayesian machinery that has better frequency properties (I actually did that in one publication); or become convinced that the prior is a good model and use the posterior (as Fisher did in some genetics examples).
(I am not expecting to get an answer in a blog comment.)
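One minimal way to picture the “use it just as sensitivity analysis” option above (an illustrative sketch with made-up numbers, not drawn from Keith’s own work): run a simple conjugate analysis under several candidate priors and see whether the qualitative conclusion is driven by the prior or by the data.

```python
# Illustration only (made-up data): prior sensitivity analysis for a binomial proportion.
# If the qualitative conclusion shifts as the prior ranges over defensible choices,
# the background assumptions, not the data, are doing much of the work.
from scipy import stats

successes, n = 14, 20                    # hypothetical data
priors = {                               # Beta(a, b) priors, all hypothetical choices
    "flat Beta(1, 1)":         (1.0, 1.0),
    "Jeffreys Beta(.5, .5)":   (0.5, 0.5),
    "sceptical Beta(2, 8)":    (2.0, 8.0),
    "enthusiastic Beta(8, 2)": (8.0, 2.0),
}
for name, (a, b) in priors.items():
    posterior = stats.beta(a + successes, b + n - successes)  # conjugate update
    print(f"{name:25s} -> posterior P(theta > 0.5) = {posterior.sf(0.5):.2f}")
```

Here a sceptical and an enthusiastic prior bracket the flat-prior answer; if the decision at hand flips across that range, it is the background assumptions rather than the data doing the work.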
Going outside the official Bayesian frameworks would be great. And when you do, seek the most direct means to solve the problem. If the problems are those I listed (obviously with only the brevity of a blog comment), then it makes little sense to strive to pack these concerns into priors when they directly concern error probabilities, capacities, probativeness and the like. Since any claim can have the words “I judge that” placed in front of it, one could always try to translate from the error statistical paradigm into the Bayesian paradigm, but why not “go native” and speak the language that gets directly to the problem?
Mayo:
I was going to add that in cases like a well-designed randomised study with good compliance and ample sample size, there is little advantage to considering priors in any role in evaluating the evidence in that study. In my work, a lot more challenging things arise in most applications.
Keith
Phaneron0: Are you saying that a lot more challenging things arise in the well-designed cases? I take it you mean all the work that goes into modeling or even figuring out how to solve problems to develop a relevant inquiry.
But in the not-so-well-designed nonrandomised study, etc., the priors can be an advantage, you say, and I take it their job would be to try to achieve what a better-designed study could have accomplished?
(I’m asking because I’ve heard some people say that outside of a well-controlled study or the like, the goals change rather dramatically.)
Sorry about the default pseudo name of Phaneron0.
> outside of a well-controlled study or the like, the goals change rather dramatically.
I think they should.
This might be a helpful reference on that position.
Good practices for quantitative bias analysis. Timothy L Lash, Matthew P Fox, Richard F MacLehose, George Maldonado, Lawrence C McCandless and Sander Greenland.
http://ije.oxfordjournals.org/content/early/2014/07/30/ije.dyu149.long
Not everyone agrees apparently, as Andrew Gelman and I point out in http://www.stat.columbia.edu/~gelman/research/published/GelmanORourkeBiostatistics.pdf
“Jager and Leek’s key assumption is that p-values of null effects will follow a uniform distribution. We argue that this will be the case under only very limited settings and thus they are mistaken”
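For readers wanting to see what that uniformity assumption amounts to, here is a small illustrative simulation of my own (not from the Gelman and O’Rourke comment): p-values from a continuous test of a true null are indeed close to uniform, but even a mild departure such as a discrete test statistic breaks the uniformity before any selection effects enter.

```python
# Illustration (not from the paper): when are null p-values uniform?
# A continuous test of a true null gives p-values close to Uniform(0, 1);
# a small-sample discrete test already does not, before any selection effects enter.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims = 2000

# Continuous case: two-sided one-sample t-test, true null mean of 0.
p_cont = np.array([stats.ttest_1samp(rng.normal(size=30), 0.0).pvalue
                   for _ in range(n_sims)])

# Discrete case: exact two-sided binomial test of p = 0.5 with n = 10, true null.
p_disc = np.array([stats.binomtest(int(rng.binomial(10, 0.5)), 10, 0.5).pvalue
                   for _ in range(n_sims)])

print("Share of null p-values below 0.05:")
print(f"  continuous t-test : {(p_cont < 0.05).mean():.3f}   (close to 0.05)")
print(f"  discrete binomial : {(p_disc < 0.05).mean():.3f}   (well below 0.05)")
```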
Keith: I agree that purposeful philosophy of science in THIS arena is rare, and it makes no sense, since there were philosophers of statistics back in the 70s and 80s. I have taught courses in this arena, and Spanos, I, and others were supposed to build something bigger in that area (when I was joint with Econ), which is one of the reasons I stayed. After I.J. Good, there was little interest in statistics among those working in philosophical foundations.
Now my plan is to pursue this on my own through my E.R.R.O.R. Fund and a collection of interested people–mostly statisticians. That is, as soon as my book is done (soon). I’ll likely start with a big conference or a small workshop of planners. Anyone interested?
Keith:
By the way, I’m glad you said you “would want to see folks in Mayo’s and Gelman’s league working collaboratively” and I don’t think that’s the slightest bit far-fetched, so long as he’s interested.
You can read Stapel’s own take on this in his autobiography (in English) here: http://nick.brown.free.fr/stapel/
“…in the light of the Committees’ assignment, the definition of fraud should be limited to the fabrication, falsification or unjustified replenishment of data, as well as the whole or partial fabrication of analysis results. It also includes the misleading presentation of crucial points as far as the organization or nature of the experiment are concerned.” (p. 17, Report)
Unjustified replenishment? I’d need to check what they mean by that; I read it some time ago. Since they were dealing with out and out fraud (Diederik Stapel), it’s noteworthy that they nevertheless extend the condemnation beyond that.
The Blog Editors have blocked an unconstructive comment by Bayesphilo alleging as he always has, under his various disguises, that the level of misunderstanding is too great to discuss anything. But there are a couple of things some might wish to discuss, so I’ve eked them out from between the grit.
Bayesphilo: Gelman, Jeffreys, and Jaynes are all highly prominent O-Bayesians, more so than the authors you cite. Nowhere in their published works do I find “Their goal is to have the LEAST impact on the data,” or even anything remotely like that.
Mayo: I’d said that the goal is for the priors to have minimal impact on the posteriors. I didn’t think Gelman considered himself an O-Bayesian (I could be wrong), but O-Bayesians do appeal to a variety of “uninformative”, invariant, Maxent and other priors (check Kass and Wasserman). Gelman, if I’m not mistaken, has said he’d rather work with informative, or weakly informative, or some such priors. So that puts him squarely within the group I had in mind in my response.
Bayesphilo [with respect to Jaynes]: “His goal for assigning P(A|B) was to encode the uncertainty about A implied by B regardless whether P(A|B) was to be used as a likelihood or a prior. Whether this is “informative” and made a difference, depends entirely on A and B.”
Please explain how this view helps to show that selection effects, barn-hunting, researcher degrees of freedom, optional stopping, etc. can alter and invalidate the warrant for one’s statistical inference (to claims of the sort being discussed here.) I’ve no doubt he’d want to. We can demonstrate how things like P-values go out the window under various such moves.
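One way to “demonstrate how things like P-values go out the window”: a minimal simulation of my own (illustrative only, not from this exchange) of one of the selection effects listed above, namely hunting across many outcome measures and reporting only the best-looking one.

```python
# Illustration (mine, not from the exchange): hunting across many outcomes.
# Ten independent outcome measures, all with zero true effect; only the smallest
# p-value is reported.  The reported "p < 0.05" rate is nowhere near 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_sims, n_outcomes, n = 5000, 10, 30

hits = 0
for _ in range(n_sims):
    pvals = [stats.ttest_1samp(rng.normal(size=n), 0.0).pvalue
             for _ in range(n_outcomes)]
    if min(pvals) < 0.05:   # report only the "best" outcome
        hits += 1

print(f"Reported significance threshold: 0.05; "
      f"actual chance of a reportable 'effect': {hits / n_sims:.2f}")
```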
” . . . everyone in my research environment does the same, and so does everyone we talk to at international conferences”
One can almost hear a petulant teenager: “Dad, why can’t I??? All my friends are doing it . . . ”
Thus they all jumped into the fire. Truly amazing to see such behaviour amongst a collection of academics. Who then will parent the grown-ups?
It will no doubt fall on future generations to cull through this mess of current research reports (and not just those from the field of social psychology) based on questionable research practices, and weed out the minority of truthful findings. Sound statistical philosophy will help them form a foundation upon which to stand while they cull.
Steven: You’re so right about how the interviewees sound. As for future generations culling…well, if one goes back to Morrison and Henkel from the 60s, you’ll find them leagues ahead of what we’re seeing now. They criticized unthinking uses of tests, and fallacies of rejection and non-rejection, but they were clear on how a variety of selection effects invalidate p-values and would scarcely have said “everyone does it, so you can’t blame me”. Maybe Meehl would have smacked them if they had. If someone wanted a thesis topic in stat education/sociology, or science studies, they might explore how we came to be sliding further down the ladder in integrity and understanding when it comes to statistical method.
Once again the discussion didn’t get around to the recommendations from Simmons, Nelson and Simonsohn for “A 21-word solution”. I’m no longer inclined to regard their recommendation as too unserious. At least it would prevent people like Forster from saying they had no idea, that nobody ever told them this was problematic:
“If you determined sample size in advance, say it.
If you did not drop any variables, say it.
If you did not drop any conditions, say it.”
(This was in The Fall 2012 Newsletter for the Society for Personality and Social Psychology, linked above).
I guess I’ll have to do a separate post on it.
Along the lines of using that children’s “Clap Your Hands” song, as suggested by Joe Simmons, Leif Nelson and Uri Simonsohn in their “A 21-word solution”, for researchers to “just say” they haven’t cheated, here’s a verse:
To be sung to the tune of “Blowin’ in the Wind”…
How many times must we publish that list
Of bad practices you should avoid?
And how many years, if you still won’t desist
May we finally call you a fraud?
The answer my friend….
Hi Deborah … I just read this excellent post (though I still need to read through the various comments to your post) and am now moving into Chapter 3 of your Experimental Knowledge book. For now, since I am a legal scholar, I just want to say how relevant I am finding your work to the issue of whether law is a science or whether law is just politics. As far as I can tell, there are no “severe” or “critical” tests in law and legal analysis; there is no way of saying x rule or theory of interpretation is better or worse than y rule or theory. I am now left wondering whether such tests are possible in a domain like law, or whether law should more properly be classified with art, religion, etc. (i.e. an important and worthwhile activity, but not a testable or rigorous one).
Explain what kind of claims you think are immune to test in law. Perhaps you mean value judgments about burdens of proof.
Of course determining legal evidence (within a system) is (or should be) akin to determining reliable evidence in any field–though I take it you’re speaking of aspects of legal theory. Larry Laudan is a philosopher who writes on evidence in the law.
Thanks, Deborah, for your thoughtful reply. Yes, there are two aspects of law I have in mind. One is the rules of evidence and legal procedure — or, more broadly speaking, the accurate application of existing rules to facts, i.e. “normal testing” within a given framework (or set of pre-existing rules), to borrow your helpful term from a chapter of your book. My other area of interest is the merits of the rules themselves — or the framework in which we conduct our “normal testing”, whether those tests consist of legal trials or social science experiments. In any case, I appreciate the pointer; I will look up Laudan’s work, continue reading your error statistics work as well, and keep you posted on my progress.
You might find the section on Oliver Wendell Holmes in Cheryl Misak’s The American Pragmatists helpful in distinguishing science from law.
Thanks for the pointer. I see Misak’s book was published in 2013. That’s another trip to the library for me!