October 15, Noon – 2 pm ET (Website)
Where do YOU stand?
Given the issues surrounding the misuses and abuse of p-values, do you think p-values should be used? Continue reading
October 15, Noon – 2 pm ET (Website)
Given the issues surrounding the misuses and abuse of p-values, do you think p-values should be used? Continue reading
Gelman and Loken (2014) recognize that even without explicit cherry picking there is often enough leeway in the “forking paths” between data and inference so that by artful choices you may be led to one inference, even though it also could have gone another way. In good sciences, measurement procedures should interlink with well-corroborated theories and offer a triangulation of checks– often missing in the types of experiments Gelman and Loken are on about. Stating a hypothesis in advance, far from protecting from the verification biases, can be the engine that enables data to be “constructed”to reach the desired end .
[E]ven in settings where a single analysis has been carried out on the given data, the issue of multiple comparisons emerges because different choices about combining variables, inclusion and exclusion of cases…..and many other steps in the analysis could well have occurred with different data (Gelman and Loken 2014, p. 464).
An idea growing out of this recognition is to imagine the results of applying the same statistical procedure, but with different choices at key discretionary junctures–giving rise to a multiverse analysis, rather than a single data set (Steegen, Tuerlinckx, Gelman, and Vanpaemel 2016). One lists the different choices thought to be plausible at each stage of data processing. The multiverse displays “which constellation of choices corresponds to which statistical results” (p. 797). The result of this exercise can, at times, mimic the delineation of possibilities in multiple testing and multiple modeling strategies. Continue reading
Right after our session at the SPSP meeting last Friday, I chaired a symposium on replication that included Brian Earp–an active player in replication research in psychology (Replication and Evidence: A tenuous relationship p. 80). One of the first things he said, according to my notes, is that gambits such as cherry picking, p-hacking, hunting for significance, selective reporting, and other QRPs, had
been taught as acceptable become standard practice in psychology, without any special need to adjust p-values or alert the reader to their spuriousness [i]. (He will correct me if I’m wrong.) It shocked me to hear it, even though it shouldn’t have, given what I’ve learned about statistical practice in social science. It was the Report on Stapel that really pulled back the curtain on this attitude toward QRPs in social psychology–as discussed in this blogpost 3 years ago. (If you haven’t read Section 5 of the report on flawed science, you should.) Many of us assumed that QRPs, even if still committed, were at least recognized to be bad statistical practices since the time of Morrison and Henkel’s (1970) Significance Test Controversy. A question now is this: have all the confessions of dirty laundry, the fraudbusting of prominent researchers, the pledges to straighten up and fly right, the years of replication research, done anything to remove the stains? I leave the question open for now. Here’s my “statistical dirty laundry” post from 2013: Continue reading
I’m giving a joint presentation with Caitlin Parker on Friday (June 17) at the meeting of the Society for Philosophy of Science in Practice (SPSP): “Using Philosophy of Statistics to Make Progress in the Replication Crisis in Psychology” (Rowan University, Glassboro, N.J.) The Society grew out of a felt need to break out of the sterile straightjacket wherein philosophy of science occurs divorced from practice. The topic of the relevance of PhilSci and PhilStat to Sci has often come up on this blog, so people might be interested in the SPSP mission statement below our abstract.
Using Philosophy of Statistics to Make Progress in the Replication Crisis in Psychology
Deborah Mayo Virginia Tech, Department of Philosophy United States
Caitlin Parker Virginia Tech, Department of Philosophy United States
I first blogged this letter here. Below the references are some more recent blog links of relevance to this issue.
Dear Reader: I am typing in some excerpts from a letter Stephen Senn shared with me in relation to my April 28, 2012 blogpost. It is a letter to the editor of Statistics in Medicine in response to S. Goodman. It contains several important points that get to the issues we’ve been discussing. You can read the full letter here. Sincerely, D. G. Mayo
STATISTICS IN MEDICINE, LETTER TO THE EDITOR
From: Stephen Senn*
Some years ago, in the pages of this journal, Goodman gave an interesting analysis of ‘replication probabilities’ of p-values. Specifically, he considered the possibility that a given experiment had produced a p-value that indicated ‘significance’ or near significance (he considered the range p=0.10 to 0.001) and then calculated the probability that a study with equal power would produce a significant result at the conventional level of significance of 0.05. He showed, for example, that given an uninformative prior, and (subsequently) a resulting p-value that was exactly 0.05 from the first experiment, the probability of significance in the second experiment was 50 per cent. A more general form of this result is as follows. If the first trial yields p=α then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5. Continue reading
Below are the slides from my Popper talk at the LSE today (up to slide 70): (post any questions in the comments)
Remember “Repligate”? [“Some Ironies in the Replication Crisis in Social Psychology“] and, more recently, the much publicized attempt to replicate 100 published psychology articles by the Open Science Collaboration (OSC) [“The Paradox of Replication“]? Well, some of the critics involved in Repligate have just come out with a criticism of the OSC results, claiming they’re way, way off in their low estimate of replications in psychology . (The original OSC report is here.) I’ve only scanned the critical article quickly, but some bizarre statistical claims leap out at once. (Where do they get this notion about confidence intervals?) It’s published in Science! There’s also a response from the OSC researchers. Neither group adequately scrutinizes the validity of many of the artificial experiments and proxy variables–an issue I’ve been on about for a while. Without firming up the statistics-research link, no statistical fixes can help. I’m linking to the articles here for your weekend reading. I invite your comments! For some reason a whole bunch of items of interest, under the banner of “statistics and the replication crisis,” are all coming out at around the same time, and who can keep up? March 7 brings yet more! (Stay tuned). Continue reading
Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results
I generally find National Academy of Science (NAS) manifestos highly informative. I only gave a quick reading to around 3/4 of this one. I thank Hilda Bastian for twittering the link. Before giving my impressions, I’m interested to hear what readers think, whenever you get around to having a look. Here’s from the intro*:
Questions about the reproducibility of scientific research have been raised in numerous settings and have gained visibility through several high-profile journal and popular press articles. Quantitative issues contributing to reproducibility challenges have been considered (including improper data management and analysis, inadequate statistical expertise, and incomplete data, among others), but there is no clear consensus on how best to approach or to minimize these problems…
Findings of Research Misconduct A Notice by the Health and Human Services Dept
on 11/09/2015 AGENCY: Office of the Secretary, HHS. ACTION: Notice. ----------------------------------------------------------------------- SUMMARY: Notice is hereby given that the Office of Research Integrity (ORI) has taken final action in the following case: Anil Potti, M.D., Duke University School of Medicine: Based on the reports of investigations conducted by Duke University School of Medicine (Duke) and additional analysis conducted by ORI in its oversight review, ORI found that Dr. Anil Potti, former Associate Professor of Medicine, Duke, engaged in research misconduct in research supported by National Heart, Lung, and Blood Institute (NHLBI), National Institutes of Health (NIH), grant R01 HL072208 and National Cancer Institute (NCI), NIH, grants R01 CA136530, R01 CA131049, K12 CA100639, R01 CA106520, and U54 CA112952. ORI found that Respondent engaged in research misconduct by including false research data in the following published papers, submitted manuscript, grant application, and the research record as specified in 1-3 below. Specifically, ORI found that: Continue reading
Critic 1: It’s much too easy to get small P-values.
Critic 2: We find it very difficult to get small P-values; only 36 of 100 psychology experiments were found to yield small P-values in the recent Open Science collaboration on replication (in psychology).
Is it easy or is it hard?
You might say, there’s no paradox, the problem is that the significance levels in the original studies are often due to cherry-picking, multiple testing, optional stopping and other biasing selection effects. The mechanism by which biasing selection effects blow up P-values is very well understood, and we can demonstrate exactly how it occurs. In short, many of the initially significant results merely report “nominal” P-values not “actual” ones, and there’s nothing inconsistent between the complaints of critic 1 and critic 2.
The resolution of the paradox attests to what many have long been saying: the problem is not with the statistical methods but with their abuse. Even the P-value, the most unpopular girl in the class, gets to show a little bit of what she’s capable of. She will give you a hard time when it comes to replicating nominally significant results, if they were largely due to biasing selection effects. That is just what is wanted; it is an asset that she feels the strain, and lets you know. It is statistical accounts that can’t pick up on biasing selection effects that should worry us (especially those that deny they are relevant). That is one of the most positive things to emerge from the recent, impressive, replication project in psychology. From an article in the Smithsonian magazine “Scientists Replicated 100 Psychology Studies, and Fewer Than Half Got the Same Results”:
The findings also offered some support for the oft-criticized statistical tool known as the P value, which measures whether a result is significant or due to chance. …
The project analysis showed that a low P value was fairly predictive of which psychology studies could be replicated. Twenty of the 32 original studies with a P value of less than 0.001 could be replicated, for example, while just 2 of the 11 papers with a value greater than 0.04 were successfully replicated. (Link is here.)
I thought the criticisms of social psychologist Jens Förster were already quite damning (despite some attempts to explain them as mere QRPs), but there’s recently been some pushback from two of his co-authors Liberman and Denzler. Their objections are directed to the application of a distinct method, touted as “Bayesian forensics”, to their joint work with Förster. I discussed it very briefly in a recent “rejected post“. Perhaps the earlier method of criticism was inapplicable to these additional papers, and there’s an interest in seeing those papers retracted as well as the one that was. I don’t claim to know. A distinct “policy” issue is whether there should be uniform standards for retraction calls. At the very least, one would think new methods should be well-vetted before subjecting authors to their indictment (particularly methods which are incapable of issuing in exculpatory evidence, like this one). Here’s a portion of their response. I don’t claim to be up on this case, but I’d be very glad to have reader feedback.
Nira Liberman, School of Psychological Sciences, Tel Aviv University, Israel
Markus Denzler, Federal University of Applied Administrative Sciences, Germany
June 7, 2015
Response to a Report Published by the University of Amsterdam
The University of Amsterdam (UvA) has recently announced the completion of a report that summarizes an examination of all the empirical articles by Jens Förster (JF) during the years of his affiliation with UvA, including those co-authored by us. The report is available online. The report relies solely on statistical evaluation, using the method originally employed in the anonymous complaint against JF, as well as a new version of a method for detecting “low scientific veracity” of data, developed by Prof. Klaassen (2015). The report concludes that some of the examined publications show “strong statistical evidence for low scientific veracity”, some show “inconclusive evidence for low scientific veracity”, and some show “no evidence for low veracity”. UvA announced that on the basis of that report, it would send letters to the Journals, asking them to retract articles from the first category, and to consider retraction of articles in the second category.
After examining the report, we have reached the conclusion that it is misleading, biased and is based on erroneous statistical procedures. In view of that we surmise that it does not present reliable evidence for “low scientific veracity”.
We ask you to consider our criticism of the methods used in UvA’s report and the procedures leading to their recommendations in your decision.
Let us emphasize that we never fabricated or manipulated data, nor have we ever witnessed such behavior on the part of Jens Förster or other co-authors.
Here are our major points of criticism. Please note that, due to time considerations, our examination and criticism focus on papers co-authored by us. Below, we provide some background information and then elaborate on these points. Continue reading
Around a year ago on this blog I wrote:
“There are some ironic twists in the way psychology is dealing with its replication crisis that may well threaten even the most sincere efforts to put the field on firmer scientific footing”
That’s philosopher’s talk for “I see a rich source of problems that cry out for ministrations of philosophers of science and of statistics”. Yesterday, I began my talk at the Society for Philosophy and Psychology workshop on “Replication in the Sciences”with examples of two main philosophical tasks: to clarify concepts, and reveal inconsistencies, tensions and ironies surrounding methodological “discomforts” in scientific practice.
Example of a conceptual clarification
Editors of a journal, Basic and Applied Social Psychology, announced they are banning statistical hypothesis testing because it is “invalid” (A puzzle about the latest “test ban”)
It’s invalid because it does not supply “the probability of the null hypothesis, given the finding” (the posterior probability of H0) (2015 Trafimow and Marks)
- Since the methodology of testing explicitly rejects the mode of inference they don’t supply, it would be incorrect to claim the methods were invalid.
- Simple conceptual job that philosophers are good at
(I don’t know if the group of eminent statisticians assigned to react to the “test ban” will bring up this point. I don’t think it includes any philosophers.)
Example of revealing inconsistencies and tensions
Critic: It’s too easy to satisfy standard significance thresholds
You: Why do replicationists find it so hard to achieve significance thresholds?
Critic: Obviously the initial studies were guilty of p-hacking, cherry-picking, significance seeking, QRPs
You: So, the replication researchers want methods that pick up on and block these biasing selection effects.
Critic: Actually the “reforms” recommend methods where selection effects and data dredging make no difference.
Whether this can be resolved or not is separate.
- We are constantly hearing of how the “reward structure” leads to taking advantage of researcher flexibility
- As philosophers, we can at least show how to hold their feet to the fire, and warn of the perils of accounts that bury the finagling
The philosopher is the curmudgeon (takes chutzpah!)
I also think it’s crucial for philosophers of science and statistics to show how to improve on and solve problems of methodology in scientific practice.
My slides are below; share comments.
“The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference,” at the 2015 American Psychological Society (APS) Annual Convention in NYC, May 23, 2015:
D. Mayo: “Error Statistical Control: Forfeit at your Peril”
S. Senn: “‘Repligate’: reproducibility in statistical studies. What does it mean and in what sense does it matter?”
A. Gelman: “The statistical crisis in science” (this is not his exact presentation, but he focussed on some of these slides)
For more details see this post.
Society for Philosophy and Psychology (SPP): 41st Annual meeting
SPP 2015 Program
Wednesday, June 3rd
1:30-6:30: Preconference Workshop on Replication in the Sciences, organized by Edouard Machery
1:30-2:15: Edouard Machery (Pitt)
2:15-3:15: Andrew Gelman (Columbia, Statistics, via video link)
3:15-4:15: Deborah Mayo (Virginia Tech, Philosophy)
4:30-5:30: Uri Simonshon (Penn, Psychology)
5:30-6:30: Tal Yarkoni (University of Texas, Neuroscience)
The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference, 2015 APS Annual Convention Saturday, May 23 2:00 PM- 3:50 PM in Wilder (Marriott Marquis 1535 B’way)
A new joint paper….
“Error statistical modeling and inference: Where methodology meets ontology”
Aris Spanos · Deborah G. Mayo
Abstract: In empirical modeling, an important desideratum for deeming theoretical entities and processes real is that they can be reproducible in a statistical sense. Current day crises regarding replicability in science intertwine with the question of how statistical methods link data to statistical and substantive theories and models. Different answers to this question have important methodological consequences for inference, which are intertwined with a contrast between the ontological commitments of the two types of models. The key to untangling them is the realization that behind every substantive model there is a statistical model that pertains exclusively to the probabilistic assumptions imposed on the data. It is not that the methodology determines whether to be a realist about entities and processes in a substantive field. It is rather that the substantive and statistical models refer to different entities and processes, and therefore call for different criteria of adequacy.
Keywords: Error statistics · Statistical vs. substantive models · Statistical ontology · Misspecification testing · Replicability of inference · Statistical adequacy
To read the full paper: “Error statistical modeling and inference: Where methodology meets ontology.”
The related conference.
Reference: Spanos, A. & Mayo, D. G. (2015). “Error statistical modeling and inference: Where methodology meets ontology.” Synthese (online May 13, 2015), pp. 1-23.
Given recent evidence of the irreproducibility of a surprising number of published scientific findings, the White House’s Office of Science and Technology Policy (OSTP) sought ideas for “leveraging its role as a significant funder of scientific research to most effectively address the problem”, and announced funding for projects to “reset the self-corrective process of scientific inquiry”. (first noted in this post.)
I was sent some information this morning with a rather long description of the project that received the top government award thus far (and it’s in the millions). I haven’t had time to read the proposal*, which I’ll link to shortly, but for a clear and quick description, you can read the excerpt of an interview of the OSTP representative by the editor of the Newsletter for Innovation in Science Journals (Working Group), Jim Stein, who took the lead in writing the author check list for Nature.
Stein’s queries are in burgundy, OSTP’s are in blue. Occasional comments from me are in black, which I’ll update once I study the fine print of the proposal itself. Continue reading
It’s an apt time to reblog the “statistical dirty laundry” post from 2013 here. I hope we can take up the recommendations from Simmons, Nelson and Simonsohn at the end (Note ), which we didn’t last time around.
I finally had a chance to fully read the 2012 Tilberg Report* on “Flawed Science” last night. Here are some stray thoughts…
1. Slipping into pseudoscience.
The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job:
One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts…. [T]he use of research procedures in such a way as to ‘repress’ negative results by some means” may be called verification bias. [my emphasis] (Report, 48).
I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses, as to count as no evidence at all (see some from their list ). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89).  It is unclear at what point a field slips into the pseudoscience realm.
2. A role for philosophy of science?
I am intrigued that one of the final recommendations in the Report is this: Continue reading
If questionable research practices (QRPs) are prevalent in your field, then apparently you can’t be guilty of scientific misconduct or fraud (by mere QRP finagling), or so some suggest. Isn’t that an incentive for making QRPs the norm?
The following is a recent blog discussion (by Ulrich Schimmack) on the Jens Förster scandal: I thank Richard Gill for alerting me. I haven’t fully analyzed Schimmack’s arguments, so please share your reactions. I agree with him on the importance of power analysis, but I’m not sure that the way he’s using it (via his “R index”) shows what he claims. Nor do I see how any of this invalidates, or spares Förster from, the fraud allegations along the lines of Simonsohn[i]. Most importantly, I don’t see that cheating one way vs another changes the scientific status of Forster’s flawed inference. Forster already admitted that faced with unfavorable results, he’d always find ways to fix things until he got results in sync with his theory (on the social psychology of creativity priming). Fraud by any other name.
The official report, “Suspicion of scientific misconduct by Dr. Jens Förster,” is anonymous and dated September 2012. An earlier post on this blog, “Who ya gonna call for statistical fraud busting” featured a discussion by Neuroskeptic that I found illuminating, from Discover Magazine: On the “Suspicion of Scientific Misconduct by Jens Förster.” Also see Retraction Watch.
Does anyone know the official status of the Forster case?
“How Power Analysis Could Have Prevented the Sad Story of Dr. Förster”
From Ulrich Schimmack’s “Replicability Index” blog January 2, 2015. A January 14, 2015 update is here. (occasional emphasis in bright red is mine) Continue reading
You still have a few days to respond to the call of your country to solve problems of scientific reproducibility!
The following passages come from Retraction Watch, with my own recommendations at the end.
“White House takes notice of reproducibility in science, and wants your opinion”
The White House’s Office of Science and Technology Policy (OSTP) is taking a look at innovation and scientific research, and issues of reproducibility have made it onto its radar.
Here’s the description of the project from the Federal Register:
The Office of Science and Technology Policy and the National Economic Council request public comments to provide input into an upcoming update of the Strategy for American Innovation, which helps to guide the Administration’s efforts to promote lasting economic growth and competitiveness through policies that support transformative American innovation in products, processes, and services and spur new fundamental discoveries that in the long run lead to growing economic prosperity and rising living standards.
I wonder what Steven Pinker would say about some of the above verbiage?
And here’s what’s catching the eye of people interested in scientific reproducibility:
(11) Given recent evidence of the irreproducibility of a surprising number of published scientific findings, how can the Federal Government leverage its role as a significant funder of scientific research to most effectively address the problem?
The OSTP is the same office that, in 2013, took what Nature called “a long-awaited leap forward for open access” when it said “that publications from taxpayer-funded research should be made free to read after a year’s delay.That OSTP memo came after more than 65,000 people “signed a We the People petition asking for expanded public access to the results of taxpayer-funded research.”
Have ideas on improving reproducibility? Emails to email@example.com are preferred, according to the notice, which also explains how to fax or mail comments. The deadline is September 23.
Off the top of my head, how about:
Promote the use of methodologies that:
Institute penalties for QRPs and fraud?
Please offer your suggestions in the comments, or directly to Uncle Sam.
[i]It may require a certain courage on the part of researchers, journalists, referees.