“Using PhilStat to Make Progress in the Replication Crisis in Psych” at Society for PhilSci in Practice (SPSP)

I’m giving a joint presentation with Caitlin Parker[1] on Friday (June 17) at the meeting of the Society for Philosophy of Science in Practice (SPSP): “Using Philosophy of Statistics to Make Progress in the Replication Crisis in Psychology” (Rowan University, Glassboro, N.J.)[2] The Society grew out of a felt need to break out of the sterile straitjacket wherein philosophy of science is done divorced from practice. The relevance of PhilSci and PhilStat to science has often come up on this blog, so people might be interested in the SPSP mission statement below our abstract.

Using Philosophy of Statistics to Make Progress in the Replication Crisis in Psychology

Deborah Mayo, Department of Philosophy, Virginia Tech, United States
Caitlin Parker, Department of Philosophy, Virginia Tech, United States

The discussion surrounding the replication crisis in psychology has raised philosophical issues that remain to be seriously addressed. These touch on foundational questions in the philosophy of statistics about the role of probability in scientific inference and the proper interpretation of statistical tests. Such matters are key to understanding a paradox related to replicability criticisms in social science: although critics argue that it is too easy to obtain statistically significant results, the comparatively low rate of positive results in replication studies shows that it is quite difficult to obtain low p-values. The resolution of the paradox is that small p-values aren’t easy to come by when experimental protocols are preregistered and researcher flexibility is minimized. They are easy to generate thanks to biasing selection effects: cherry-picking, multiple testing, and the type of questionable research practices that are widely lampooned. The consequence of these influences is that the reported, ‘nominal’ p-value for the original study differs greatly from the ‘actual’ p-value. As Gelman and Loken (2014) have argued, the same problem occurs due to the flexibility of choices in the “forking paths” leading from data to inferences, even if the critique remains informal. It follows that, to avoid problematic inferences, researchers need statistical tools with the capacity to pick up on the effects of biasing selections. Significance tests serve a limited but important role, especially in testing model assumptions. To trade them in for methods that do not pick up on alterations to error probabilities (Bayes ratios, posterior probabilities, likelihood ratios) is not progress; it would enable those effects to remain hidden. The sensitivity of p-values to selection effects is actually the key to understanding their relevance to appraising particular inferences, not just to long-run error control.
The problems of hunting and cherry-picking are not a matter of getting it wrong in the long run, but of failing to provide good grounds for the intended inference in the immediate inquiry. There is a second way in which reforms are in danger of enabling fallacies. It is fallacious to take the falsification of a null hypothesis as evidence for a substantive theory (confusing statistical and substantive hypotheses). Neither Fisherian nor Neyman–Pearson (NP) tests permit moving directly from statistical significance to research hypotheses, let alone from a single, just-significant result. Yet in order to block an inference to a research hypothesis, a popular reform is to assign a lump of prior probability to the “no effect” null hypothesis. But this countenances, rather than prohibits, blurring statistical and substantive hypotheses! It is not only a statistical fallacy; it draws attention away from what is most needed in psychology experiments with poor replication: a scrutiny of the relevance of the measurements and experiments to the research hypotheses of interest [3]. Slides are here.
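The gap between a ‘nominal’ and an ‘actual’ p-value under hunting can be made concrete with a small simulation (my illustrative sketch, not part of the abstract; it assumes independent standard-normal test statistics under a true null):

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

def actual_error_rate(n_tests, n_sims=20_000, alpha=0.05, seed=1):
    """Fraction of all-null experiments that report 'significance' when
    only the smallest of n_tests p-values is reported. Assumes the
    n_tests statistics are independent N(0, 1) draws under the null."""
    rng = random.Random(seed)
    hits = sum(
        min(two_sided_p(rng.gauss(0, 1)) for _ in range(n_tests)) < alpha
        for _ in range(n_sims)
    )
    return hits / n_sims

# Preregistered single test: the actual error rate matches alpha (~0.05).
# Hunting across 20 tests and reporting only the best: the reported
# 'nominal' p-value is still below 0.05, but the actual error rate is
# near 1 - 0.95**20, roughly 0.64.
print(actual_error_rate(1), actual_error_rate(20))
```

The reported p-value in the hunting case looks just as impressive as in the preregistered case, which is exactly why methods insensitive to the selection would leave the inflation hidden.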

[1] Parker had been my Master’s student at Virginia Tech, and is beginning Ph.D. work at Carnegie Mellon University in the fall.
[2] We’re on at 11:30, Enterprise Center, room 409. (Third paper in a session that starts at 10:30). The conference goes from June 17-19.
[3] Links to some relevant posts are at the end.


SPSP Mission Statement

Philosophy of science has traditionally focused on the relation between scientific theories and the world, at the risk of disregarding scientific practice. In social studies of science and technology, the predominant tendency has been to pay attention to scientific practice and its relation to theories, sometimes willfully disregarding the world except as a product of social construction. Both approaches have their merits, but they each offer only a limited view, neglecting some essential aspects of science. We advocate a philosophy of scientific practice, based on an analytic framework that takes into consideration theory, practice and the world simultaneously.

The direction of philosophy of science we advocate is not entirely new: naturalistic philosophy of science, in concert with philosophical history of science, has often emphasized the need to study scientific practices; doctrines such as Hacking’s “experimental realism” have viewed active intervention as the surest path to the knowledge of the world; pragmatists, operationalists and late-Wittgensteinians have attempted to ground truth and meaning in practices. Nonetheless, the concern with practice has always been somewhat outside the mainstream of English-language philosophy of science. We aim to change this situation, through a conscious and organized programme of detailed and systematic study of scientific practice that does not dispense with concerns about truth and rationality.

Practice consists of organized or regulated activities aimed at the achievement of certain goals. Therefore, the epistemology of practice must elucidate what kinds of activities are required in generating knowledge. Traditional debates in epistemology (concerning truth, fact, belief, certainty, observation, explanation, justification, evidence, etc.) may be re-framed with benefit in terms of activities. In a similar vein, practice-based treatments will also shed further light on questions about models, measurement, experimentation, etc., which have arisen with prominence in recent decades from considerations of actual scientific work.

There are some salient aspects of our general approach that are worth highlighting here.

  1. We are concerned with not only the acquisition and validation of knowledge, but its use. Our concern is not only about how pre-existing knowledge gets applied to practical ends, but also about how knowledge itself is fundamentally shaped by its intended use. We aim to build meaningful bridges between the philosophy of science and the newer fields of philosophy of technology and philosophy of medicine; we also hope to provide fresh perspectives for the latter fields.
  2. We emphasize how human artifacts, such as conceptual models and laboratory instruments, mediate between theories and the world. We seek to elucidate the role that these artifacts play in the shaping of scientific practice.
  3. Our view of scientific practice must not be distorted by lopsided attention to certain areas of science. The traditional focus on fundamental physics, as well as the more recent focus on certain areas of biology, will be supplemented by attention to other fields such as economics and other social/human sciences, the engineering sciences, and the medical sciences, as well as relatively neglected areas within biology, physics, and other physical sciences.
  4. In our methodology, it is crucial to have a productive interaction between philosophical reasoning and a study of actual scientific practices, past and present. This provides a strong rationale for history-and-philosophy of science as an integrated discipline, and also for inviting the participation of practicing scientists, engineers and policymakers.


I. Some relevant recent posts on p-values (search this blog for many others):

“Don’t Throw Out the Error Control Baby with the Bad Statistics Bathwater”

“A Small P-value Indicates that the Results are Due to Chance Alone: Fallacious or Not? More on the ASA P-value Doc”

“P-Value Madness: A Puzzle About the Latest Test Ban, or ‘Don’t Ask, Don’t Tell’”

II. Posts on replication research in psychology:

“Repligate Returns (or, the Non Significance of Nonsignificant Results Are the New Significant Results)”

This includes links to:

“Some Ironies in the Replication Crisis in Social Psychology”

“The Paradox of Replication and the Vindication of the P-value, but She Can Go Deeper”

“Out Damned Pseudoscience: Nonsignificant Results Are the New Significant Results” 



Categories: Announcement, replication research, reproducibility


8 thoughts on “‘Using PhilStat to Make Progress in the Replication Crisis in Psych’ at Society for PhilSci in Practice (SPSP)”

  1. Stan Young

    “They (small p-values) are easy to generate thanks to biasing selection effects: cherry-picking, multiple testing, and the type of questionable research practices that are widely lampooned.” Properly run, just about any basic statistical test is good enough. RCTs with FDA oversight work well, for example. There are at least two ELEPHANTS in the room. Researchers intentionally game the science system: they know about multiple testing and they exploit it to their advantage. Journal editors know about multiple testing, but love a good story (and a positive hit on their impact factor). Funding agencies need to fix the incentives. P-hacking should be considered science fraud. Many editors have either been AWOL or willing enablers.

    • Stan: Thanks for your comment. I totally agree. It isn’t just opportunists wanting to publish novel stories; there are also potential conflicts of interest among scholars who also serve as lawyers, medical experts, expert witnesses, patent seekers (remember Potti), etc. Their views can even shift according to setting. I wonder, at times, how much of the opportunistic gaming is disguised as merely holding a rival philosophy of statistics (e.g., one where error probability control is irrelevant to evidence/inference). See section 4 of “statistical reforms without philosophy are blind”. (Harkonen seemed to think that he couldn’t very well be blamed for searching through as many post-data endpoints as possible to find “nominal” significance, since it was an issue around which there was philosophical controversy! The Supreme Court wasn’t buying it.*)
      Ironically, I have found some leaders of the new “technical activism” explicitly touting one philosophical standpoint about inductive inference—and even calling it “philosophical” (generally likelihoodist or Bayesian)—while refusing to consider a rival conception of inductive inference on the grounds that it will lead us into an endless philosophical discussion. It’s a clever trick.

      *If some of my lawyer friends who backed him on other subtle, legalistic grounds see this, they’ll surely fault me.

      • Stan Young

        Deborah: We are on the same page on this. I agree there are some hypocrites out there. At least one epidemiology thought leader honors multiple testing in his private consulting, yet speaks very eloquently against any adjustment for multiple testing publicly. If you are not part of the solution there is good money to be made prolonging the problem – I read somewhere.

        Two things. The attached just appeared in Science. I attended a workshop by the National Library of Medicine on problems of science replication. It was a who’s who on poor replication. I attach the program. NIH is quite interested, but even they think the problem is beyond them. A real mess.


        • Stan: Can you send a link? Else I’ll try to make one from your e-mail. Thanks.

        • Greenland has written a number of papers about how and when he’d take account of multiple testing, for example, with Robins:

          Click to access greenland-robins-1991-eb-multiple-comp-epidemiol.pdf

        • Sander Greenland

          I am rather shocked that you of all people would call your opponents “hypocrites.”
          After all, you join the chorus attacking the lack of transparency and replicability of results, yet you published an article (Young SS, Karr A. Significance 2011: 116-120) attacking observational epidemiology that gave no details about the sampling frame for included studies or how the studies and associations examined there were selected, and no criterion for declaring failure to replicate. Thus no one I know can reproduce your analysis based on what you have presented so far. Especially startling is that 10 of the 12 papers listed in Table 1 of Young & Karr are nutritional, yet a quick glance at general epidemiology and medical journals shows nutrition covers only a minority of topics in which both trial and epidemiologic data exist (far more common are studies of medical interventions such as drugs and devices, which is my primary application area).

          Worse, you target me with the false accusation that “At least one epidemiology thought leader [changed by Mayo from “Greenland” in your original comment] honors multiple testing in his private consulting, yet speaks very eloquently against any adjustment for multiple testing publicly.” This is false, and you ought to apologize (Mayo has already apologized in private for your comment). Here is why:

          Following the pioneering methods and examples of Efron & Morris, I have published a dozen articles recommending empirical-Bayes and partial-Bayes (“semi-Bayes”) adjustments for multiple comparisons, starting with Greenland & Robins, “Empirical-Bayes adjustments for multiple comparisons are sometimes useful” (Epidemiology 1991: 244-251) and including Greenland, “When should epidemiologic regressions use random coefficients?” (Biometrics 2000: 915-921) which I sent you years ago. The latter article specifically targeted inflated estimates in the nutritional literature – and Young and Karr should have cited it if they had been honest and scholarly reporters on the problem of data dredging in nutritional literature. In parallel I have actually used these adjustments for numerous substantive publications (e.g., Greenland & Finkle, “A retrospective cohort study of implanted medical devices and selected chronic diseases in Medicare claims data,” Annals of Epidemiology 2000: 205-213).

          That you prefer staying with adjustments to alpha levels or P-values rather than adopting hierarchical methodologies reflects only your preference for methods that regard any false positive as more costly than any false negative, a loss function which many do not share. That your writings fail to mention, let alone use, modern alternatives to your methods, coupled with your failure to disclose the sampling and analysis methodology behind your claims, corrupts the quality of science as badly as the data dredging we all decry.

          • EDITORIAL COMMENT: I was in transit when the Young comment came through, and I did not have comments on “manual approve”. This issue has come up many times before, and it’s no secret by now who takes what view, but still…. Fortunately, the upshot is constructive because Greenland explains his position. I also linked to one of his papers. Having said all that, I’ve never entirely understood the hierarchical solution to the issue of multiple testing. It was first discussed by Senn on this blog (alluding to Dawid).
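For readers who share that puzzlement, the hierarchical (empirical-Bayes) handling of multiple comparisons that Greenland cites can be sketched in a few lines (my illustration in the Efron–Morris spirit, assuming a common known standard error; it is not code from any of the papers mentioned):

```python
import random
import statistics

def eb_shrink(estimates, se):
    """Method-of-moments empirical-Bayes shrinkage: pull each estimate
    toward the grand mean by a weight reflecting how much of the observed
    spread looks like noise. Assumes a common standard error `se`."""
    m = statistics.fmean(estimates)
    # Between-effect variance: observed spread minus what noise explains.
    tau2 = max(0.0, statistics.pvariance(estimates) - se**2)
    b = tau2 / (tau2 + se**2)  # shrinkage weight in [0, 1)
    return [m + b * (y - m) for y in estimates]

# Twenty 'effects' that are all truly zero, observed with noise (se = 1).
rng = random.Random(7)
raw = [rng.gauss(0, 1) for _ in range(20)]
shrunk = eb_shrink(raw, se=1.0)

# The apparent 'winners' (largest raw estimates) are discounted most:
# instead of adjusting per-test alpha levels, the hierarchical model
# automatically deflates extreme estimates within a batch of comparisons.
print(max(abs(y) for y in raw), max(abs(t) for t in shrunk))
```

The design choice at issue in the exchange above is visible here: nothing is "rejected" at an adjusted alpha; instead every estimate is moderated at once, with the most extreme ones pulled hardest toward the group mean.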

  2. Pingback: Mayo & Parker “Using PhilStat to Make Progress in the Replication Crisis in Psych” SPSP Slides | A bunch of data
