A little over a year ago, the board of the American Statistical Association (ASA) appointed a new Task Force on Statistical Significance and Replicability (under then-president Karen Kafadar) to provide it with recommendations. [Its members are here (i).] You might remember my blogpost at the time, “Les Stats C’est Moi”. The Task Force worked quickly, despite the pandemic, giving its recommendations to the ASA Board early, in time for the Joint Statistical Meetings at the end of July 2020. But the ASA hasn’t revealed the Task Force’s recommendations, and I just learned yesterday that it has no plans to do so*. A panel session I was in at the JSM (P-values and ‘Statistical Significance’: Deconstructing the Arguments) grew out of this episode, and papers from the proceedings are now out. The introduction to my contribution gives you the background to my question, while revealing one of the recommendations (I only know of 2).
Why hasn’t the ASA Board revealed the recommendations of its new task force on statistical significance and replicability?
I constructed, together with Jean Miller, a transcript from the October 15 Statistics Debate (with me, J. Berger and D. Trafimow and moderator D. Jeske), sponsored by NISS. It’s so much easier to access the material this way rather than listening to it on the video. Using this link, you can see the words and hear the video at the same time, as well as pause and jump around. Below, I’ve pasted our responses to Question #1. Have fun and please share your comments.
Dan Jeske: [QUESTION 1] Given the issues surrounding the misuses and abuse of p values, do you think they should continue to be used or not? Why or why not?
Deborah Mayo 03:46
Thank you so much. And thank you for inviting me, I’m very pleased to be here. Yes, I say we should continue to use p values and statistical significance tests. Uses of p values are really just a piece in a rich set of tools intended to assess and control the probabilities of misleading interpretations of data, i.e., error probabilities. They’re the first line of defense against being fooled by randomness, as Yoav Benjamini puts it. If even larger, or more extreme, effects than you observed are frequently brought about by chance variability alone, i.e., the p value is not small, clearly you don’t have evidence of incompatibility with the mere chance hypothesis. It’s very straightforward reasoning. Even those who criticize p values, you’ll notice, will employ them, at least if they care to check the assumptions of their models. And this includes well-known Bayesians such as George Box, Andrew Gelman, and Jim Berger. Critics of p values often allege it’s too easy to obtain small p values. But notice the whole replication crisis is about how difficult it is to get small p values with preregistered hypotheses. This shows the problem isn’t p values, but those selection effects and data dredging. However, the same data-dredged hypotheses can occur in other methods (likelihood ratios, Bayes factors, Bayesian updating), except that now we lose the direct grounds to criticize inferences for flouting error statistical control. The introduction of prior probabilities, which may also be data dependent, offers further researcher flexibility. Those who reject p values are saying we should reject the method because it can be used badly. And that’s a bad argument. We should reject misuses of p values. But there’s a danger of blindly substituting alternative tools that throw out the error control baby with the bad statistics bathwater.
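[To make the “as extreme or more extreme” reasoning concrete, here is a minimal Python sketch; it is not something presented in the debate. It assumes a two-sided z-test whose test statistic is standard normal under the mere chance hypothesis. The function names and the observed values are hypothetical, chosen only for illustration.]

```python
import random
from math import erfc, sqrt


def two_sided_p_value(z_observed: float) -> float:
    """Probability, under a standard normal null, of a statistic at least
    as far from zero as the one observed (two-sided tail area)."""
    return erfc(abs(z_observed) / sqrt(2.0))


def simulated_p_value(z_observed: float, n_sim: int = 100_000, seed: int = 0) -> float:
    """Estimate the same tail area by generating data from chance alone
    and counting how often the result is as extreme as the observed one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = sum(
        1 for _ in range(n_sim) if abs(rng.gauss(0.0, 1.0)) >= abs(z_observed)
    )
    return hits / n_sim


if __name__ == "__main__":
    for z in (0.5, 1.96):  # hypothetical observed statistics
        print(z, two_sided_p_value(z), simulated_p_value(z))
```

[With an observed statistic of 0.5, roughly 62% of chance-only results are at least as extreme, so there is no evidence of incompatibility with the chance hypothesis; with 1.96 the figure drops to about 5%.]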
Dan Jeske 05:58
Thank you, Deborah, Jim, would you like to comment on Deborah’s remarks and offer your own?
Jim Berger 06:06
Okay, yes. Well, I certainly agree with much of what Deborah said; after all, a p value is simply a statistic. And it’s an interesting statistic that does have many legitimate uses, when properly calibrated. And Deborah mentioned one such case, model checking, where Bayesians freely use some version of p values. On the other hand, if one interprets this question as, should they continue to be used in the same way that they’re used today? Then my answer would be somewhat different. I think p values are commonly misinterpreted today, especially when they’re used to test a sharp null hypothesis. For instance, a p value of .05 is commonly interpreted by many as indicating the evidence is 20 to one in favor of the alternative hypothesis. And that just isn’t true. You can show, for instance, that if I’m testing a normal mean of zero versus nonzero, the odds of the alternative hypothesis to the null hypothesis can at most be seven to one. And that’s just a probabilistic fact, doesn’t involve priors or anything. It just is an answer covering all probability. And so if a p value of .05 is interpreted as 20 to one, it’s just being interpreted wrongly, and the wrong conclusions are being reached. I’m reminded of an interesting paper that was published some time ago now, which was reporting on a survey that was designed to determine whether clinical practitioners understood what a p value was. The results of the survey were published and were not surprising. Most clinical practitioners interpreted a p value of .05 as something like 20 to one odds against the null hypothesis, which again, is incorrect. The fascinating aspect of the paper is that the authors also got it wrong. Deborah pointed out that the p value is the probability under the null hypothesis of the data or something more extreme.
The authors stated that the correct answer was that the p value is the probability of the data under the null hypothesis; they forgot the “more extreme”. So, I love this article, because the scientists who set out to show that their colleagues did not understand the meaning of p values themselves did not understand the meaning of p values.
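[Jim’s “at most seven to one” figure can be reproduced under one reading of his remark. The Python sketch below assumes he is referring to the maximum likelihood ratio for a two-sided test of a normal mean with known variance, where the alternative mean is set to whatever value fits the data best. That reading, and the function name, are assumptions rather than anything stated in the debate.]

```python
from math import exp
from statistics import NormalDist


def max_odds_against_null(p_value: float) -> float:
    """Largest possible likelihood ratio (alternative : null) for a
    two-sided z-test of a normal mean, given the observed p value.

    The alternative most favored by the data puts the mean exactly at
    the observed value, which gives a likelihood ratio of exp(z**2 / 2).
    """
    # Convert the two-sided p value back to the observed |z| statistic.
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    return exp(z * z / 2.0)


if __name__ == "__main__":
    print(round(max_odds_against_null(0.05), 2))
```

[For a p value of .05 this comes out just under seven to one, well short of 20 to one. Averaging over alternatives with any prior can only lower the ratio, which is why the bound does not depend on a choice of prior.]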
Dan Jeske 08:42
David Trafimow 08:44
Okay. Yeah, like Deborah and Jim, I’m delighted to be here. Thanks for the invitation. And I partly agree with what both Deborah and Jim said; it’s certainly true that people misuse p values. So, I agree with that. However, I think p values are more problematic than the other speakers have mentioned. And here’s the problem for me. We keep talking about p values relative to hypotheses, but that’s not really true. P values are relative to hypotheses plus additional assumptions. So, if we use the term model to describe the null hypothesis plus additional assumptions, then p values are based on models, not on hypotheses, or only partly on hypotheses. Now, here’s the thing. What are these other assumptions? An example would be random selection from the population, an assumption that is not true in any one of the thousands of papers I’ve read in psychology. And there are other assumptions, a lack of systematic error, linearity, and then we can go on and on; people have even published taxonomies of the assumptions because there are so many of them. See, it’s tantamount to impossible that the model is correct, which means that the model is wrong. And so, what you’re in essence doing then, is you’re using the p value to index evidence against a model that is already known to be wrong. And even the part about indexing evidence is questionable, but I’ll go with it for the moment. But the point is the model was wrong. And so, there’s no point in indexing evidence against it. So given that, I don’t really see that there’s any use for them. P values don’t tell you how close the model is to being right. P values don’t tell you how valuable the model is. P values pretty much don’t tell you anything that researchers might want to know, unless you misuse them. Anytime you draw a conclusion from a p value, you are guilty of misuse. So, I think the misuse problem is much more subtle than is perhaps obvious at first hand.
So, that’s really all I have to say at the moment.
Dan Jeske 11:28
Thank you. Jim, would you like to follow up?
Jim Berger 11:32
Yes, so, I certainly agree that assumptions are often made that are wrong. I won’t say that that’s always the case. I mean, I know many scientific disciplines where I think they do a pretty good job. I work with high energy physicists, and they do a pretty good job of checking their assumptions. Excellent job. And they use p values. It’s something to watch out for. But any statistical analysis, you know, can run into this problem. If the assumptions are wrong, it’s going to be wrong.
Dan Jeske 12:09
Deborah Mayo 12:11
Okay. Well, Jim thinks that we should evaluate the p value by looking at the Bayes factor, and when he does, he finds that they’re exaggerating. But we really shouldn’t expect agreement on numbers from methods that are evaluating different things. This is like supposing that if we switch from a height to a weight standard, and we use six feet with the height, we should now require six stone, to use an example from Stephen Senn. On David, I think he’s wrong about the worrying assumptions with using the p value, since they have the fewest assumptions of any method, which is why people, and why even Bayesians, will say we need to apply them when we need to test our assumptions. And it’s something that we can do, especially with randomized controlled trials, to get the assumptions to work. The idea that we have to misinterpret p values to have them be relevant only rests on supposing that we need something other than what the p value provides.
Dan Jeske 13:19
David, would you like to give some final thoughts on this question?
David Trafimow 13:23
Sure. As far as Jim’s point, and Deborah’s point that we can do things to make the assumptions less wrong: the problem is the model is wrong or it isn’t wrong. Now if the model is close, that doesn’t justify the p value, because the p value doesn’t give the closeness of the model. And that’s the problem. We’re not using, for example, a sample mean to estimate a population mean, in which case, yeah, you wouldn’t expect the sample mean to be exactly right. If it’s close, it’s still useful. The problem is that p values aren’t being used to estimate anything. So, if you’re not estimating anything, then you’re stuck with either correct or incorrect, and the answer is always incorrect. You know, this is especially true in psychology, but I suspect it might even be true in physics. I’m not the physicist that Jim is, so I can’t say that for sure.
Dan Jeske 14:35
Jim, would you like to offer final thoughts?
Jim Berger 14:37
Let me comment on Deborah’s comment that Bayes factors are just a different scale of measurement. My point was that it seems like people invariably think of p values as something like odds or probability of the null hypothesis. If that’s the way they’re thinking, because that’s the way their minds reason, I believe we should provide them with odds. And so, I try to convert p values into odds or Bayes factors, because I think that’s much more readily understandable by people.
Dan Jeske 15:11
Deborah, you have the final word on this question.
Deborah Mayo 15:13
I do think that we need a proper philosophy of statistics to interpret p values. But I think also that what’s missing in the reject p values movement is that a major reason for calling in statistics in science is to give us tools to inquire whether an observed phenomenon can be a real effect, or just noise in the data. P values have intrinsic properties for this task, if used properly; other methods don’t, and to reject them is to jeopardize this important role. As Fisher emphasizes, we need randomized controlled trials precisely to ensure the validity of statistical significance tests; to reject them because they don’t give us posterior probabilities is illicit. In fact, I think that those who claim that we want such posteriors need to show, for any way we can actually get them, why.
You can find the complete audio transcript at this LINK: https://otter.ai/u/hFILxCOjz4QnaGLdzYFdIGxzdsg
[There is a play button at the bottom of the page that allows you to start and stop the recording. You can move about in the transcript/recording by using the pause button and moving the cursor to another place in the dialog and then clicking the play button to hear the recording from that point. (The recording is synced to the cursor.)]
National Institute of Statistical Sciences (NISS): The Statistics Debate (Video)
October 15, Noon – 2 pm ET (Website)
Where do YOU stand?
Given the issues surrounding the misuses and abuse of p-values, do you think p-values should be used?
CALL FOR PAPERS (Synthese) Recent Issues in Philosophy of Statistics: Evidence, Testing, and Applications
Call for Papers: Topical Collection in Synthese
Title: Recent Issues in Philosophy of Statistics: Evidence, Testing, and Applications
The deadline for submissions is 1 December 2020 (extended from 1 November 2020).
I will now hold a monthly remote forum on Phil Stat: The Statistics Wars and Their Casualties–the title of the workshop I had scheduled to hold at the London School of Economics (Centre for Philosophy of Natural and Social Science: CPNSS) on 19-20 June 2020. (See the announcement at the bottom of this blog). I held the graduate seminar in Philosophy (PH500) that was to precede the workshop remotely (from May 21-June 25), and this new forum will be both an extension of that and a linkage to the planned workshop. The issues are too pressing to put off for a future in-person workshop, which I still hope to hold. It will begin with presentations by workshop participants, with lots of discussion. If you want to be part of this monthly forum and engage with us, please go to the information and directions page. The links are now fixed, sorry. (It also includes readings for Aug 20.) If you are already on our list, you’ll automatically be notified of new meetings. (If you have questions, email me.)
All: On July 30 (10am EST) I will give a virtual version of my JSM presentation, remotely like the one I will actually give on Aug 6 at the JSM. Co-panelist Stan Young may as well. One of our surprise guests tomorrow (not at the JSM) will be Yoav Benjamini! If you’re interested in attending our July 30 practice session* please follow the directions here. Background items for this session are in the “readings” and “memos” of session 5.
*unless you’re already on our LSE Phil500 list
To register for JSM: https://ww2.amstat.org/meetings/jsm/2020/registration.cfm
Ship StatInfasST will embark on a new journey from 21 May – 18 June, a graduate research seminar for the Philosophy, Logic & Scientific Method Department at the LSE, but given the pandemic has shut down cruise ships, it will remain at dock in the U.S. and use zoom. If you care to follow any of the 5 sessions, nearly all of the materials will be linked here collected from excerpts already on this blog. If you are interested in observing on zoom beginning 28 May, please follow the directions here.
For the updated schedule, see the seminar web page.
Topic: Current Controversies in Phil Stat
(LSE, Remote 10am-12 EST, 15:00 – 17:00 London time; Thursdays 21 May-18 June)
I will run a graduate Research Seminar at the LSE on Thursdays from May 21-June 18:
In Tour II of this first Excursion of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018, CUP), I pull back the cover on disagreements between experts charged with restoring integrity to today’s statistical practice. Some advised me to wait until later (in the book) to get to this eye-opener. Granted, the full story involves some technical issues, but after many months, I think I arrived at a way to get to the heart of things informally (with a promise of more detailed retracing of steps later on). It was too important not to reveal right away that some of the most popular “reforms” fall down on the job even with respect to our most minimal principle of evidence (you don’t have evidence for a claim if little if anything has been done to probe the ways it can be flawed).
For the first time, I’m excerpting all of Excursion 1 Tour II from SIST (2018, CUP).
1.4 The Law of Likelihood and Error Statistics
If you want to understand what’s true about statistical inference, you should begin with what has long been a holy grail–to use probability to arrive at a type of logic of evidential support–and in the first instance you should look not at full-blown Bayesian probabilism, but at comparative accounts that sidestep prior probabilities in hypotheses. An intuitively plausible logic of comparative support was given by the philosopher Ian Hacking (1965)–the Law of Likelihood. Fortunately, the Museum of Statistics is organized by theme, and the Law of Likelihood and the related Likelihood Principle is a big one.
Please See New Information for Summer Seminar in PhilStat
Bibliography (this includes a selection of articles with links; numbers 1-15 after the item refer to seminar meeting number.)
Achinstein (2010). Mill’s Sins or Mayo’s Errors? (E&I: 170-188). (11)
Bacchus, Kyburg, & Thalos (1990). Against Conditionalization, Synthese (85): 475-506. (15)
New Course Starts Tomorrow: Current Debates on Statistical Inference and Modelings: Joint Phil and Econ
The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars
Excerpts from the Preface:
The Statistics Wars:
Today’s “statistics wars” are fascinating: They are at once ancient and up to the minute. They reflect disagreements on one of the deepest, oldest, philosophical questions: How do humans learn about the world despite threats of error due to incomplete and variable data? At the same time, they are the engine behind current controversies surrounding high-profile failures of replication in the social and biological sciences. How should the integrity of science be restored? Experts do not agree. This book pulls back the curtain on why.