What would I say is the most important takeaway from last week’s NISS “statistics debate” if you’re using (or contemplating using) Bayes factors (BFs)–of the sort Jim Berger recommends–as replacements for P-values? It is that J. Berger only regards the BFs as appropriate when there’s grounds for a high concentration (or spike) of probability on a sharp null hypothesis, e.g., H_{0}: θ = θ_{0}.
Thus, it is crucial to distinguish between precise hypotheses that are just stated for convenience and have no special prior believability, and precise hypotheses which do correspond to a concentration of prior belief. (J. Berger and Delampady 1987, p. 330).
Now, to be clear, I do not think that P-values need to be misinterpreted (Bayesianly) to use them evidentially, and think it’s a mistake to try to convert them into comparative measures of belief or support. However, it’s important to realize that even if you do think such a conversion is required, and are contemplating replacing them with the kind of BF Jim Berger advances, then it would be wrong to do so if there were no grounds for a high prior belief on a point null. Jim said in the debate that people want a Bayes factor, so we give it to them. But when you’re asking for it, especially if it’s described as a “default” method, you might assume it is capturing a reasonably common standpoint—not one that only arises in an idiosyncratic case. To allege that there’s really much less evidence against the sharp null than is suggested by a P-value, as does the BF advocate, is to hide the fact that most of this “evidence” is due to the spiked concentration of prior belief being given to the sharp null hypothesis. This is an a priori bias in favor of the sharp null, not evidence in the data. (There is also the matter of how the remainder of the prior is smeared over the parameter values in the alternative.) Jim Berger, somewhat to my surprise, reaffirmed at the debate that this is the context for the intended use of his recommended Bayes factor with the spiked prior. Yet these BFs are being touted as a tool to replace P-values for everyday use.
Jim’s Sharp Null BFs Were Developed for a Very Special Case. Harold Jeffreys developed the spiked priors for a special Bayesian problem: how to give high posterior probabilities to well corroborated theories. This is quite different from the typical use of statistical significance tests to detect indications of an observed effect that is not readily due to noise. (Of course isolated small P-values do not suffice to infer a genuine experimental phenomenon, as R.A. Fisher emphasized.)
Precise hypotheses . . . ideally relate to, say, some precise theory being tested. Of primary interest is whether the theory is right or wrong; the amount by which it is wrong may be of interest in developing alternative theories, but the initial question of interest is that modeled by the precise hypothesis test (J. Berger and Sellke 1987, p. 136).
Casella and Roger Berger (1987b) respond to Jim Berger and Sellke and to Jim Berger and Delampady–all in 1987. “We would be surprised if most researchers would place even a 10% prior probability of H_{0}. We hope that the casual reader of Berger and Delampady realizes that the big discrepancies between P-values and P(H_{0}|x) . . . are due to a large extent to the large value of [the prior of 0.5 to H_{0}] that was used.” They make the astute point that the most common uses of a point null, asserting the difference between means is 0, or a regression coefficient is 0, merely describe a potentially interesting feature of the population, with no special prior believability. They note: “J. Berger and Delampady admit…, P-values are reasonable measures of evidence when there is no a priori concentration of belief about H_{0}” (ibid., p. 345). Thus, they conclude, “the very argument that Berger and Delampady use to dismiss P-values can be turned around to argue for P-values” (ibid., p. 346).
As I said in response to debate question 3, “the move to redefine statistical significance, advanced by a megateam in 2017, including Jim, all rest upon the lump high prior probability on the null as well as the appropriateness of evaluating P-values using Bayes factors. The redefiners are prepared to say there’s no evidence against or even evidence for a null hypothesis, even though that point null is entirely excluded from the corresponding 95% confidence interval. This would often erroneously fail to uncover discrepancies [from the point null]”.
Conduct an Error Statistical Critique. Thus a question you should ask in contemplating the application of the default BF is this: What’s the probability the default BF would find no evidence against the null or even evidence for it for an alternative or discrepancy of interest to you? If the probability is fairly high then you’d not want to apply it.
Notice what we’re doing in asking this question: we’re applying the frequentist error statistical analysis to the Bayes factor. What’s sauce for the goose is sauce for the gander.[ii] This is what the error statistician needs to do whenever she’s told an alternative measure ought to be adopted as a substitute for an error statistical one: check its error statistical properties.
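Here is a minimal sketch of such a check, under assumptions I am supplying purely for illustration: a normal model with σ = 1, a spike of prior probability on H_{0}: θ = 0, and a N(0, 1) prior under the alternative (which yields a closed-form Bayes factor); the discrepancy 0.2 and sample size 50 are hypothetical, not Berger’s exact recommendation.

```python
# Hedged sketch of an error-statistical critique of a spiked-prior Bayes
# factor. Assumed (illustrative) setup: normal data, known sigma = 1,
# spike on H0: theta = 0, and theta ~ N(0, 1) under the alternative.
import math
import random

def bf01(xbar, n):
    """Bayes factor in favor of H0: theta = 0 (conjugate normal setup)."""
    z = xbar * math.sqrt(n)
    return math.sqrt(1 + n) * math.exp(-0.5 * z**2 * n / (n + 1))

def prob_bf_favors_null(theta1, n, trials=10_000, seed=7):
    """Monte Carlo estimate of P(BF01 >= 1) when the true theta is theta1."""
    rng = random.Random(seed)
    se = 1 / math.sqrt(n)
    favors = sum(bf01(rng.gauss(theta1, se), n) >= 1 for _ in range(trials))
    return favors / trials

# For a (hypothetical) discrepancy of interest theta1 = 0.2 at n = 50, the
# default BF favors the point null roughly 70% of the time: the kind of
# high error probability that should warn you off applying it here.
p_miss = prob_bf_favors_null(theta1=0.2, n=50)
```

If that probability is high for discrepancies you care about, the BF fails the error-statistical audit described above.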
Is the Spiked Prior Appropriate to the Problem, Even With a Well-corroborated Value? Even in those highly special cases where a well-corroborated substantive theory gives a high warrant for a particular value of a parameter, it’s far from clear that a spiked prior reflects how scientists examine the question: is the observed anomaly (with the theory) merely background noise or some systematic effect? Remember when neutrinos appeared to travel faster than light—an anomaly for special relativity—in an OPERA experiment in 2011?
This would be a case where Berger would place a high concentration of prior probability on the point null, the speed of light c given by special relativity. The anomalous results, at most, would lower the posterior belief. But I don’t think scientists were interested in reporting that the posterior probability for the special relativity value had gone down a bit, due to their anomalous result, but was still quite high. Rather, they wanted to know whether the anomaly was mere noise or genuine, and finding it was genuine, they wanted to pinpoint blame for the anomaly. It turns out a fiber optic cable wasn’t fully screwed in and one of the clocks was ticking too fast. Merely discounting the anomaly (or worse, interpreting it as evidence strengthening their belief in the precise null) because of strong belief in special relativity would sidestep the most interesting work: gleaning important information about how well or poorly run the experiment was.[iii]
It is interesting to compare the position of the spiked prior with an equally common Bayesian position that all null hypotheses are false. The disagreement may stem from viewing H_{0} as asserting the correctness of a scientific theory (the spiked prior view) as opposed to asserting a parameter in a model, representing a portion of that theory, is correct (the all nulls are false view).
Search Under “Overstates” On This Blog for More (and “P-values” for much more). The reason to focus attention on the disagreement between the P-value and the Bayes factor with a sharp null is that it explains an important battle in the statistics wars, and thus points the way to (hopefully) getting beyond it. The very understanding of the use and interpretation of error probabilities differs in the rival approaches.
As I was writing my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP), I was often distracted by high-pitched discussions in 2015-17 about P-values “overstating” evidence on account of being smaller than a posterior probability on a sharp null. Thus, I wound up writing several posts, the ingredients of which made their way into the book, notably Section 4.4. Here’s one. I eventually dubbed it the “P-values overstate the evidence” fallacy. For many excerpts from the book, including the rest of the “Tour” where this issue arises, see this blogpost.
Stephen Senn wrote excellent guest posts on P-values for this blog that are especially clarifying, such as this one. He observes that Jeffreys, having already placed the spiked prior on the point null, required only that the posterior on the alternative exceeded .5 in order to find evidence against the null, not that it be a large number such as .95.
A parsimony principle is used on the prior distribution. You can’t use it again on the posterior distribution. Once that is calculated, you should simply prefer the more probable model. The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives. (S. Senn)
***
[i] I mentioned two of the simplest inferential arguments using P-values during the debate: one for blocking an inference, a second for inferring incompatibility with (or discrepancy from) a null hypothesis, set as a reference: “If even larger or more extreme effects than you observed are frequently brought about by chance variability alone (P-value is not small), clearly you don’t have evidence of incompatibility with the “mere chance” hypothesis.”
“…A small P-value indicates discrepancy from a null value because with high probability, 1 – p, the test would have produced a larger P-value (less impressive difference) in a world adequately described by H0. Since the null hypothesis would very probably have survived if correct, when it doesn’t survive, it indicates inconsistency with it.” For a more detailed discussion see SIST, e.g., Souvenir C (SIST, p. 52) https://errorstatistics.files.wordpress.com/2019/04/sist_ex1-tourii.pdf.
[ii] From SIST* (p. 247): “The danger in critiquing statistical method X from the standpoint of the goals and measures of a distinct school Y, is that of falling into begging the question. If the P-value is exaggerating evidence against a null, meaning it seems too small from the perspective of school Y, then Y’s numbers are too big, or just irrelevant, from the perspective of school X. Whatever you say about me bounces off and sticks to you.”
[iii] Conversely, the sharp null in discovering the Higgs Boson was disbelieved even before they built the expensive particle colliders (physicists knew there had to be a Higgs particle of some sort). You can find a number of posts on the Higgs on this blog (also in Mayo 2018, Excursion 3 Tour III).
*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). The book follows an “itinerary” of a stat cruise with lots of museum stops and souvenirs.
How did I respond to those 7 burning questions at last week’s (“P-Value”) Statistics Debate? Here’s a fairly close transcript of my (a) general answer, and (b) final remark, for each question–without the in-between responses to Jim and David. The exception is question 5 on Bayes factors, which naturally included Jim in my general answer.
The questions with the most important consequences, I think, are questions 3 and 5. I’ll explain why I say this in the comments. Please share your thoughts.
Question 1. Given the issues surrounding the misuses and abuse of p-values, do you think they should continue to be used or not? Why or why not?
Yes we should continue to use P-values and statistical significance tests. Uses of P-values are a piece in a rich set of tools for assessing and controlling the probabilities of misleading interpretations of data (error probabilities). They’re “the first line of defense against being fooled by randomness” (Yoav Benjamini). If even larger or more extreme effects than you observed are frequently brought about by chance variability alone (P-value is not small), clearly you don’t have evidence of incompatibility with the “mere chance” hypothesis.
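The reasoning in that last sentence can be checked directly. Here is a minimal sketch, with numbers I am supplying for illustration (a normal model with known σ, an observed mean of 0.10 from n = 100), comparing a computed P-value with the frequency of chance-only results at least as extreme:

```python
# If even larger effects than observed are frequently produced by chance
# variability alone, the P-value is not small. Illustrative setup:
# n = 100 standard-normal draws under the "mere chance" hypothesis mu = 0.
import math
import random

def p_value_one_sided(xbar, n, sigma=1.0):
    """One-sided P-value for H0: mu = 0 vs mu > 0, known sigma."""
    z = xbar / (sigma / math.sqrt(n))
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# An observed mean of 0.10 gives z = 1.0 and P ~ 0.16: not small.
p_obs = p_value_one_sided(0.10, 100)

# The frequency reading: under H0, means at least this large arise about
# 16% of the time by chance variability alone.
random.seed(1)
trials = 20_000
hits = sum(
    sum(random.gauss(0, 1) for _ in range(100)) / 100 >= 0.10
    for _ in range(trials)
)
freq = hits / trials
```

The simulated frequency agrees with the computed P-value, which is exactly the error-probability guarantee being claimed.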
Even those who criticize P-values will employ them at least if they care to check the assumptions of their statistical models—this includes Bayesians George Box, Andrew Gelman, and Jim Berger.
Critics of P-values often allege it’s too easy to obtain small P-values, but notice the replication crisis is all about how difficult it is to get small P-values with preregistered hypotheses. This shows the problem isn’t P-values but the selection effects and data-dredging. However, the same data-dredged hypotheses can occur in likelihood ratios, Bayes factors, and Bayesian updating, except that we now lose the direct grounds to criticize inferences flouting error statistical control. The introduction of prior probabilities–which may also be data-dependent–offers further researcher flexibility.
Those who reject P-values are saying we should reject a method because it can be used badly. That’s a very bad argument, committing straw person fallacies.
We should reject misuses and abuses of P-values, but there’s a danger of blithely substituting “alternative tools” that throw out the error control baby with the bad statistics bathwater.
Final remark on P-values
What’s missed in the reject-P-values movement is that the major reason for calling in statistics in science is that it gives tools to inquire whether an observed phenomenon could be a real effect or just noise in the data. P-values have the intrinsic properties for this task, if used properly. To reject them is to jeopardize this important role of statistics. As Fisher emphasized, we seek randomized controlled trials in order to ensure the validity of statistical significance tests. To reject P-values because they don’t give posterior probabilities on hypotheses is illicit. The onus is on those claiming we want such posteriors to show why, for any way of getting them.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Question 2 Should practitioners avoid the use of thresholds (e.g., P-value thresholds) in interpreting data? If so, does this preclude testing?
There’s a lot of confusion about thresholds. What people oppose are dichotomous accept/reject routines. We should move away from them as well as unthinking uses of thresholds like 95% confidence levels or other quantities. Attained P-values should be reported (as all the founders of tests recommended). We should not confuse fixing a threshold to habitually use with prespecifying a threshold beyond which there is evidence of inconsistency with a test hypothesis. I’ll often call it the null for short.
Some think that banishing thresholds would diminish P-hacking and data dredging. It is the opposite. In a world without thresholds, it would be harder to criticize those who fail to meet a small P-value because they engaged in data dredging and multiple testing, and at most have given us a nominally small P-value. Yet that is the upshot of declaring that predesignated P-value thresholds should not be used at all in interpreting data. If an account cannot say about any outcomes in advance that they will not count as evidence for a claim, then there is no test of that claim.
Giving up on tests means forgoing statistical falsification. What’s the point of insisting on replications if at no point can you say, the effect has failed to replicate?
You may favor a philosophy of statistics that rejects statistical falsification, but it will not do to declare by fiat that science should reject the falsification or testing view. (The “no thresholds” view also torpedoes common testing uses of confidence intervals and Bayes Factor standards.)
So my answer is NO and YES: don’t abandon thresholds, to do so is to ban tests.
Final remark on thresholds Q-2
A common fallacy is to suppose that because we have a continuum, we cannot distinguish points at the extremes (the fallacy of the beard). We can distinguish results readily produced by random variability from cases where there is evidence of incompatibility with the chance variability hypothesis. We use thresholds throughout science, e.g., to determine whether you’re pre-diabetic or diabetic.
When P-values are banned altogether … the eager researcher does not claim, I’m simply describing, but they invariably go on to claim evidence for a substantive psych theory—but on results that would be blocked if they’d required a reasonably small P-value threshold.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Question 3 Is there a role for sharp null hypotheses or should we be thinking about interval nulls?
… I’d agree with those who regard testing of a point null hypothesis as problematic and often misused. Notice that arguments purporting to show P-values exaggerate evidence are based on this point null together with a spiked or lumped prior on it. By giving a spiked prior to the nil, it’s easy to find the nil more likely than the alternative—the Jeffreys-Lindley paradox: the P-value can differ greatly from the posterior probability on the null. But the posterior can also equal the P-value; it can range anywhere from p to 1 – p. In other words, Bayesians differ amongst themselves, because with diffuse priors the P-value can equal the posterior on the null hypothesis.
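The disagreement can be made numerical. A minimal sketch, under assumptions I am supplying for illustration (a normal model with known σ, a spike of 0.5 on H_{0}: θ = 0, and θ ~ N(0, σ²) under the alternative, which gives a closed-form Bayes factor):

```python
# With a spike of 0.5 on the point null, a fixed two-sided P-value of 0.05
# (z = 1.96) yields a posterior on H0 that GROWS with n: the
# Jeffreys-Lindley disagreement.
import math

def posterior_on_null(z, n):
    """P(H0 | data) given z-statistic z at sample size n, prior
    P(H0) = 0.5, and theta ~ N(0, sigma^2) under H1 (conjugate setup)."""
    bf01 = math.sqrt(1 + n) * math.exp(-0.5 * z**2 * n / (n + 1))
    return bf01 / (1 + bf01)

post_small_n = posterior_on_null(1.96, 10)    # ~0.37
post_large_n = posterior_on_null(1.96, 1000)  # ~0.82: "evidence for" the
                                              # null despite P = 0.05
```

The same z = 1.96 result that a significance test reads as evidence against the null is converted, at large n, into a posterior of over 0.8 on the null: almost all of it supplied by the spiked prior.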
My own work reformulates results of statistical significance tests in terms of discrepancies from the null that are well or poorly tested. A small P-value indicates discrepancy from a null value because with high probability, 1 – p the test would have produced a larger P-value (less impressive difference) in a world adequately described by H_{0}. Since the null hypothesis would very probably have survived if correct, when it doesn’t survive, it indicates inconsistency with it.
Final remark on sharp nulls Q-3
The move to redefine significance, advanced by a megateam including Jim, all rest upon the lump high prior probability on the null as well as evaluating P-values using Bayes factors. It’s not equipoise, it’s biased in favor of the null. The redefiners are prepared to say there’s no evidence against or even evidence for a null hypothesis, even though that point null is entirely excluded from the corresponding 95% confidence interval. This would often erroneously fail to uncover discrepancies.
Whether to use a lower threshold is one thing; to argue that we should do so based on Bayes factor standards lacks legitimate grounds.[1][2]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Question 4 Should we be teaching hypothesis testing anymore, or should we be focusing on point estimation and interval estimation?
Absolutely [we should be teaching hypothesis testing]. The way to understand confidence interval estimation, and to fix its shortcomings, is to understand their duality with tests. The same person who developed confidence intervals developed tests in the 1930s—Jerzy Neyman. The intervals are inversions of tests.
A 95% CI contains the parameter values that are not statistically significantly different from the data at the 5% level.
While I agree that P-values should be accompanied by CIs, my own preferred reconstruction of tests blends intervals and tests. It reports the discrepancies from a reference value that are well or poorly indicated at different levels—not just 1 level like .95. This improves on current confidence interval use. For example, the justification standardly given for inferring a particular confidence interval estimate is that it came from a method which, with high probability, would cover the true parameter value. This is a performance justification. The testing perspective on CIs gives an inferential justification. I would justify inferring evidence that the parameter exceeds the CI lower bound this way: if the parameter were smaller than the lower bound, then with high probability we would have observed a smaller value of the test statistic than we did.
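The duality can be made concrete. A minimal sketch with hypothetical numbers (normal model, known σ): recover a 95% CI by collecting the values not rejected by the two-sided 5%-level test.

```python
# A 95% confidence interval as an inversion of the two-sided 5% z-test:
# keep every theta0 the test would NOT reject, given the data.
import math

def rejects(xbar, theta0, n, sigma=1.0, z_crit=1.96):
    return abs(xbar - theta0) / (sigma / math.sqrt(n)) > z_crit

def ci_by_inversion(xbar, n, sigma=1.0, step=1e-4):
    """Brute-force grid inversion over a window around xbar."""
    lo = xbar - 1.0
    kept = [
        lo + i * step
        for i in range(int(2.0 / step))
        if not rejects(xbar, lo + i * step, n, sigma)
    ]
    return min(kept), max(kept)

# Matches the textbook formula xbar +/- 1.96 * sigma / sqrt(n):
lower, upper = ci_by_inversion(xbar=0.5, n=100)
```

The grid search is deliberately naive; the point is that the interval falls out of the test, which is why shunning tests while keeping CIs makes no sense.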
Amazingly, the most recent president of the ASA, Karen Kafadar, had to appoint a new task force on statistical significance tests to affirm that statistical hypothesis testing is indeed part of good statistical practice. Much credit goes to her for bringing this about.
Final remark on question 4
Understanding the duality between tests and CIs is the key to improving both. …So it makes no sense for advocates of the “new statistics” to shun tests. The testing interpretation of confidence intervals also scotches criticisms of examples where, it can happen that a 95% confidence estimate contains all possible parameter values. Although such an inference is ‘trivially true,’ it is scarcely vacuous in the testing construal. As David Cox remarks, that all parameter values are consistent with the data is an informative statement about the limitations of the data (to detect discrepancies at the particular level).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Question 5 What are your reasons for or against the use of Bayes Factors?
Jim is a leading advocate of Bayes factors and also of the non-subjective interpretation of the Bayesian prior probabilities to be used (2006). ‘Eliciting’ subjective priors, Jim has convincingly argued, is too difficult; experts’ prior beliefs almost never even overlap, he says, and scientists are reluctant for subjective beliefs to overshadow data. Default priors (reference or non-subjective priors) are supposed to prevent prior beliefs from influencing the posteriors–they are data dominant in some sense. But there’s a variety of incompatible ways to go about this job.
(A few are maximum entropy, invariance, maximizing the missing information, coverage matching.) As David Cox points out, it’s unclear how we should interpret these default probabilities. Default priors, we are told, are simply formal devices to obtain default posteriors. “The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…” (Cox and Mayo 2010, 299), being improper.
Prior probabilities are supposed to let us bring in background information, but this pulls in the opposite direction from the goal of the default prior which is to reflect just the data. The goal of representing your beliefs is very different from the goal of finding a prior that allows the data to be dominant. Yet, current uses of Bayesian methods combine both in the same computation—how do you interpret them? I think this needs to be assessed now that they’re being so widely advocated.
Final remark on Q-5
BFs give a comparative appraisal not a test. It depends on how you assign the priors to the test and alternative hypotheses.
Bayesian testing, Bayesians admit, is a work in progress. My feeling is, we shouldn’t kill a well worked out theory of testing for one that is admitted to being a work in progress.
It might be noted that even default Bayesian Jose Bernardo holds that the difference between the P-value and the BF (the Jeffreys Lindley paradox or Fisher-Jeffreys disagreement) is actually an indictment of the BF because it finds evidence in favor of a null hypothesis even when an alternative is much more likely.
Other Bayesians dislike the default priors because they can lead to improper posteriors and thus to violations of probability theory. This leads some like Dennis Lindley back to subjective Bayesianism.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Question 6 With so much examination of if/why the usual nominal type I error .05 is appropriate, should there be similar questions about the usual nominal type II error?
No, there should not be a similar examination of type II error bounds. Rigid bounds for either error should be avoided. Neyman and Pearson (N-P) themselves urged that the specifications be used with discretion and understanding.
It occurs to me, if an examination is wanted it should be done by the new ASA Task Force on Significance Tests and Replicability. Its members aren’t out to argue for rejecting significance tests but to show they are part of proper statistical practice.
Power, the complement of the type II error probability, I often say is a most abused notion (note it’s only defined in terms of a threshold). Critics of statistical significance tests, I’m afraid to say, often fallaciously take a just statistically significant difference at level α as a better indication of a discrepancy from a null if the test’s power to detect that discrepancy is high rather than low. This is like saying it’s a better indication for a discrepancy of at least 10 than of at least 1 (whatever the parameter is). I call it the Mountains out of Molehills fallacy. It results from trying to use power and alpha as ingredients for a Bayes factor and from viewing non-Bayesian methods through a Bayesian lens.
We set a high power to detect population effects of interest, but finding statistical significance doesn’t warrant saying we’ve evidence for those effects.
(The significance tester doesn’t infer points but inequalities, discrepancies at least such and such).
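A small worked sketch of the fallacy, with numbers I am supplying for illustration (normal model, σ = 1, n = 100, one-sided H0: θ ≤ 0, and a severity-style assessment of how well “θ > θ1” passes): the just-significant result accords better with a small discrepancy than a large one, even though power runs the other way.

```python
# Hypothetical numbers: sigma = 1, n = 100, one-sided test of H0:
# theta <= 0, with a just statistically significant mean (z = 1.96).
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, sigma = 100, 1.0
se = sigma / math.sqrt(n)
xbar = 1.96 * se                       # just significant at the 0.05 level

def severity(theta1):
    """How well 'theta > theta1' passes: P(less impressive result; theta1)."""
    return phi((xbar - theta1) / se)

def power(theta1, z_crit=1.96):
    """P(test rejects; theta = theta1)."""
    return 1.0 - phi(z_crit - theta1 / se)

# Power is LOW for theta = 0.05 and HIGH for theta = 0.25, yet the
# just-significant result is good evidence for theta > 0.05 (~0.93) and
# poor evidence for theta > 0.25 (~0.29): molehills, not mountains.
sev_small, sev_large = severity(0.05), severity(0.25)
pow_small, pow_large = power(0.05), power(0.25)
```

High power to detect a large discrepancy makes a barely significant result a worse, not better, indication of that large discrepancy.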
Final remark on Q-6, power
A legitimate criticism of P-values is they don’t give population effect sizes. Neyman developed power analysis for this purpose, in addition to comparing tests pre-data. Yet critics of tests typically keep to Fisherian tests that don’t have explicit alternatives or power. Neyman was keen to avoid misinterpreting non-significant results as evidence for a null hypothesis. He used power analysis post data (like Jacob Cohen much later) to set an upper bound for a discrepancy from the null value.
If a test has high power to detect a population discrepancy, but does not do so, it’s evidence the discrepancy is absent (qualified by the level).
My preference is to use the attained power but it’s the same reasoning.
I see people objecting to post-hoc power as “sinister” but they’re referring to computing power by using the observed effect as the parameter value in its computation. This is not power analysis.
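A minimal sketch of the distinction, with illustrative numbers of my own choosing (normal model, known σ, one-sided test at α = 0.05): legitimate post-data power analysis asks about a prespecified discrepancy, while the criticized move plugs in the observed effect.

```python
# Neyman-style post-data power analysis vs "observed power"
# (illustrative numbers; normal model, sigma = 1, alpha = 0.05 one-sided).
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power(delta, n, sigma=1.0, z_crit=1.645):
    """P(test rejects; true discrepancy = delta)."""
    return 1.0 - phi(z_crit - delta / (sigma / math.sqrt(n)))

n = 100
# Suppose the test did NOT reject. A discrepancy of 0.30 would have been
# detected with probability ~0.91, so the non-significant result is
# evidence the discrepancy is less than 0.30 (qualified by that level).
high_power = power(0.30, n)

# The "sinister" version instead plugs the observed (non-significant)
# effect, say xbar = 0.10, into the power function: not power analysis.
observed_power = power(0.10, n)   # ~0.26, uninformative for this purpose
```

The first computation licenses an upper bound on the discrepancy; the second merely restates the observed effect in different units.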
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
QUESTION 7 What are the problems that lead to the reproducibility crisis and what are the most important things we should do to address it?
Irreplication is due to many factors, from data generation and modeling, to problems of measurement, and linking statistics to substantive science. Here I just focus on P-values. The key problem is that in many fields, latitude in collecting and interpreting data makes it too easy to dredge up impressive looking findings even when spurious. The fact it becomes difficult to replicate effects when features of the tests are tied down shows the problem isn’t P-values but exploiting researcher flexibility and multiple testing. The same flexibility can occur when the p-hacked hypotheses enter methods being promoted as alternatives to significance tests: likelihood ratios, Bayes factors, or Bayesian updating. But direct grounds to criticize inferences as flouting error statistical control are lost (at least without adding non-standard stipulations). Since they condition on the actual outcome, they don’t consider outcomes other than the one observed. This is embodied in something called the likelihood principle.
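The “too easy to dredge up impressive findings” point can be quantified with a toy simulation (the numbers are illustrative, not from the debate): run 20 independent null tests at the 0.05 level and see how often at least one comes out nominally significant.

```python
# Multiple testing under the null: with k = 20 independent tests at
# alpha = 0.05, P(at least one "hit") = 1 - 0.95^20, about 0.64.
import math
import random

random.seed(3)
k, trials, alpha, n = 20, 2000, 0.05, 30

def one_p_value():
    """Two-sided z-test p-value on n standard-normal (null) draws."""
    xbar = sum(random.gauss(0, 1) for _ in range(n)) / n
    z = abs(xbar) * math.sqrt(n)
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

# Fraction of simulated "studies" (each dredging through k null effects)
# that can report at least one nominally significant finding:
hit_rate = sum(
    any(one_p_value() < alpha for _ in range(k)) for _ in range(trials)
) / trials

expected = 1 - (1 - alpha) ** k   # closed form, ~0.64
```

Roughly two-thirds of all-null “studies” can report a significant effect, which is exactly the error-statistical criticism that conditioning on the observed outcome forfeits.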
Admittedly error control, some think, is only of concern to ensure low error rates in some long run. I argue instead that what bothers us about the P-hacker and data dredger is that they have done a poor job in the case at hand. Their method very probably would have found some such effect even if it is merely noise.
Probability here is to assess how well tested claims are, which is very different from how comparatively believable they are—claims can even be known true while poorly tested. Though there’s room for both types of assessments in different contexts, how plausible and how well tested are very different and this needs to be recognized.
To address replication problems, statistical reforms should be developed together with a philosophy of statistics that properly underwrites them.[3]
Final remark on Q-7
Please see the video here or in this news article.
[1] The following are footnotes 4 and 5 from page 252 of Statistical Inference as Severe testing: How to Get Beyond the Statistics Wars. The relevant section is 4.4. (pp. 246-259)
Casella and Roger (not Jim) Berger (1987b) argue, “We would be surprised if most researchers would place even a 10% prior probability of H_{0}. We hope that the casual reader of Berger and Delampady realizes that the big discrepancies between P-values and P(H_{0}|x) . . . are due to a large extent to the large value of [the prior of 0.5 to H_{0}] that was used.” The most common uses of a point null, asserting the difference between means is 0, or a regression coefficient is 0, merely describe a potentially interesting feature of the population, with no special prior believability. “J. Berger and Delampady admit…, P-values are reasonable measures of evidence when there is no a priori concentration of belief about H_{0}” (ibid., p. 345). Thus, “the very argument that Berger and Delampady use to dismiss P-values can be turned around to argue for P-values” (ibid., p. 346).
Harold Jeffreys developed the spiked priors for a very special case: to give high posterior probabilities to well corroborated theories. This is quite different from the typical use of statistical significance tests to detect indications of an observed effect that is not readily due to noise. (Of course isolated small P-values do not suffice to infer a genuine experimental phenomenon.)
In defending spiked priors, J. Berger and Sellke move away from the importance of effect size. “Precise hypotheses . . . ideally relate to, say, some precise theory being tested. Of primary interest is whether the theory is right or wrong; the amount by which it is wrong may be of interest in developing alternative theories, but the initial question of interest is that modeled by the precise hypothesis test” (1987, p. 136).
[2] As Cox and Hinkley explain, most tests of interest are best considered as running two one-sided tests, insofar as we are interested in the direction of departure. (Cox and Hinkley 1974; Cox 2020).
[3] In the error statistical view, the interest is not in measuring how strong your degree of belief in H is but how well you can show why it ought to be believed or not. How well can you put to rest skeptical challenges? What have you done to put to rest my skepticism of your lump prior on “no effect”?
National Institute of Statistical Sciences (NISS): The Statistics Debate (Video)
October 15, Noon – 2 pm ET (Website)
Given the issues surrounding the misuses and abuse of p-values, do you think p-values should be used?
Do you think the use of estimation and confidence intervals eliminates the need for hypothesis tests?
Bayes Factors – are you for or against?
How should we address the reproducibility crisis?
If you are intrigued by these questions and have an interest in how these questions might be answered – one way or the other – then this is the event for you!
Want to get a sense of the thinking behind the practicality (or not) of various statistical approaches? Interested in hearing both sides of the story – during the same session!?
This event will be held in a debate type of format. The participants will be given selected questions ahead of time so they have a chance to think about their responses, but this is intended to be much less of a presentation and more of a give and take between the debaters.
So – let’s have fun with this! The best way to find out what happens is to register and attend!
Dan Jeske (University of California, Riverside)
Jim Berger (Duke University)
Deborah Mayo (Virginia Tech)
David Trafimow (New Mexico State University)
Register to Attend this Event Here!
Dan Jeske (moderator) received MS and PhD degrees from the Department of Statistics at Iowa State University in 1982 and 1985, respectively. He was a distinguished member of technical staff, and a technical manager at AT&T Bell Laboratories between 1985-2003. Concurrent with those positions, he was a visiting part-time lecturer in the Department of Statistics at Rutgers University. Since 2003, he has been a faculty member in the Department of Statistics at the University of California, Riverside (UCR), serving as Chair of the department 2008-2015. He is currently the Vice Provost of Academic Personnel and the Vice Provost of Administrative Resolution at UCR. He is the Editor-in-Chief of The American Statistician, an elected Fellow of the American Statistical Association, an Elected Member of the International Statistical Institute, and is President-elect of the International Society for Statistics in Business and Industry. He has published over 100 peer-reviewed journal articles and is a co-inventor on 10 U.S. Patents. He served a 3-year term on the Board of Directors of ASA in 2013-2015.
Jim Berger is the Arts and Sciences Professor of Statistics at Duke University. His current research interests include Bayesian model uncertainty and uncertainty quantification for complex computer models. Berger was president of the Institute of Mathematical Statistics from 1995-1996 and of the International Society for Bayesian Analysis during 2004. He was the founding director of the Statistical and Applied Mathematical Sciences Institute, serving from 2002-2010. He was co-editor of the Annals of Statistics from 1998-2000 and was a founding editor of the Journal on Uncertainty Quantification from 2012-2015. Berger received the COPSS `President’s Award’ in 1985, was the Fisher Lecturer in 2001, the Wald Lecturer of the IMS in 2007, and received the Wilks Award from the ASA in 2015. He was elected as a foreign member of the Spanish Real Academia de Ciencias in 2002, elected to the USA National Academy of Sciences in 2003, was awarded an honorary Doctor of Science degree from Purdue University in 2004, and became an Honorary Professor at East China Normal University in 2011.
Deborah G. Mayo is professor emerita in the Department of Philosophy at Virginia Tech. Her Error and the Growth of Experimental Knowledge won the 1998 Lakatos Prize in philosophy of science. She is a research associate at the London School of Economics: Centre for the Philosophy of Natural and Social Science (CPNSS). She co-edited (with A. Spanos) Error and Inference (2010, CUP). Her most recent book is Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). She founded the Fund for Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (E.R.R.O.R), which sponsored a 2-week summer seminar in Philosophy of Statistics in 2019 for 15 faculty in philosophy, psychology, statistics, law and computer science (co-directed with A. Spanos). She publishes widely in philosophy of science, statistics, and philosophy of experiment. She blogs at errorstatistics.com and phil-stat-wars.com.
David Trafimow is Professor in the Department of Psychology at New Mexico State University. His research area is social psychology; in particular, he studies social cognition, especially how self-cognitions are organized and how they interrelate with presumed determinants of behavior (e.g., attitudes, subjective norms, control beliefs, and behavioral intentions). His research interests include the cognitive structures and processes underlying attributions and memory for events and persons. He is also involved in methodological, statistical, and philosophical issues pertaining to science.
Call for Papers: Topical Collection in Synthese
Title: Recent Issues in Philosophy of Statistics: Evidence, Testing, and Applications
Description:
Statistics play an essential role in an extremely wide range of human reasoning. From theorizing in the physical and social sciences to determining evidential standards in legal contexts, statistical methods are ubiquitous, and questions about their proper application inevitably arise. As tools for making inferences that go beyond a given set of data, they are inherently a means of reasoning ampliatively, and so it is unsurprising that philosophers interested in the notions of evidence and inductive inference have been concerned to utilize statistical frameworks to further our understanding of these topics. The purpose of this volume is to present a cross-section of subjects related to statistical argumentation, written by scholars from a variety of fields in order to explore issues in philosophy of statistics from different perspectives. Here, we intend for “Philosophy of Statistics” to be broadly construed. This volume will thus include discussions of foundational issues in statistics, as well as questions having to do with evidence, induction, and confirmation as applied in various contexts.
Appropriate topics for submission include, among others:
For further information, please contact the guest editor(s): molly.kao@umontreal.ca; eshech@auburn.edu
See: https://philevents.org/event/show/83126
Journal: Synthese
Guest Editor(s):
Molly Kao, University of Montreal
Deborah Mayo, Virginia Tech
Elay Shech, Auburn University
Yesterday was statistician George Barnard’s 105th birthday. To acknowledge it, I reblog an exchange between Barnard, Savage (and others) on likelihood vs probability. The exchange is from pp. 79-84 of (what I call) “The Savage Forum” (Savage, 1962).[i] A portion appears on p. 420 of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). Six other posts on Barnard are linked below, including 2 guest posts (Senn, Spanos), a play (pertaining to our first meeting), and a letter Barnard wrote to me in 1999.
BARNARD:…Professor Savage, as I understand him, said earlier that a difference between likelihoods and probabilities was that probabilities would normalize because they integrate to one, whereas likelihoods will not. Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses. … If this is so we ought to allow that in addition to the hypotheses that we really consider we should allow something that we had not thought of yet, and of course as soon as we do this we lose the normalizing factor of the probability, and from that point of view probability has no advantage over likelihood. This is my general point, that I think while I agree with a lot of the technical points, I would prefer that this is talked about in terms of likelihood rather than probability. I should like to ask what Professor Savage thinks about that, whether he thinks that the necessity to enumerate hypotheses exhaustively, is important.
SAVAGE: Surely, as you say, we cannot always enumerate hypotheses so completely as we like to think. The list can, however, always be completed by tacking on a catch-all ‘something else’. In principle, a person will have probabilities given ‘something else’ just as he has probabilities given other hypotheses. In practice, the probability of a specified datum given ‘something else’ is likely to be particularly vague–an unpleasant reality. The probability of ‘something else’ is also meaningful of course, and usually, though perhaps poorly defined, it is definitely very small. Looking at things this way, I do not find probabilities unnormalizable, certainly not altogether unnormalizable.
Whether probability has an advantage over likelihood seems to me like the question whether volts have an advantage over amperes. The meaninglessness of a norm for likelihood is for me a symptom of the great difference between likelihood and probability. Since you question that symptom, I shall mention one or two others. …
On the more general aspect of the enumeration of all possible hypotheses, I certainly agree that the danger of losing serendipity by binding oneself to an over-rigid model is one against which we cannot be too alert. We must not pretend to have enumerated all the hypotheses in some simple and artificial enumeration that actually excludes some of them. The list can however be completed, as I have said, by adding a general ‘something else’ hypothesis, and this will be quite workable, provided you can tell yourself in good faith that ‘something else’ is rather improbable. The ‘something else’ hypothesis does not seem to make it any more meaningful to use likelihood for probability than to use volts for amperes.
Let us consider an example. Off hand, one might think it quite an acceptable scientific question to ask, ‘What is the melting point of californium?’ Such a question is, in effect, a list of alternatives that pretends to be exhaustive. But, even specifying which isotope of californium is referred to and the pressure at which the melting point is wanted, there are alternatives that the question tends to hide. It is possible that californium sublimates without melting or that it behaves like glass. Who dare say what other alternatives might obtain? An attempt to measure the melting point of californium might, if we are serendipitous, lead to more or less evidence that the concept of melting point is not directly applicable to it. Whether this happens or not, Bayes’s theorem will yield a posterior probability distribution for the melting point given that there really is one, based on the corresponding prior conditional probability and on the likelihood of the observed reading of the thermometer as a function of each possible melting point. Neither the prior probability that there is no melting point, nor the likelihood for the observed reading as a function of hypotheses alternative to that of the existence of a melting point enter the calculation. The distinction between likelihood and probability seems clear in this problem, as in any other.
BARNARD: Professor Savage says in effect, ‘add at the bottom of list H_{1}, H_{2},…”something else”’. But what is the probability that a penny comes up heads given the hypothesis ‘something else’. We do not know. What one requires for this purpose is not just that there should be some hypotheses, but that they should enable you to compute probabilities for the data, and that requires very well defined hypotheses. For the purpose of applications, I do not think it is enough to consider only the conditional posterior distributions mentioned by Professor Savage.
LINDLEY: I am surprised at what seems to me an obvious red herring that Professor Barnard has drawn across the discussion of hypotheses. I would have thought that when one says this posterior distribution is such and such, all it means is that among the hypotheses that have been suggested the relevant probabilities are such and such; conditionally on the fact that there is nothing new, here is the posterior distribution. If somebody comes along tomorrow with a brilliant new hypothesis, well of course we bring it in.
BARTLETT: But you would be inconsistent because your prior probability would be zero one day and non-zero another.
LINDLEY: No, it is not zero. My prior probability for other hypotheses may be ε. All I am saying is that conditionally on the other 1 – ε, the distribution is as it is.
BARNARD: Yes, but your normalization factor is now determined by ε. Of course ε may be anything up to 1. Choice of letter has an emotional significance.
LINDLEY: I do not care what it is as long as it is not one.
BARNARD: In that event two things happen. One is that the normalization has gone west, and hence also this alleged advantage over likelihood. Secondly, you are not in a position to say that the posterior probability which you attach to an hypothesis from an experiment with these unspecified alternatives is in any way comparable with another probability attached to another hypothesis from another experiment with another set of possibly unspecified alternatives. This is the difficulty over likelihood. Likelihood in one class of experiments may not be comparable to likelihood from another class of experiments, because of differences of metric and all sorts of other differences. But I think that you are in exactly the same difficulty with conditional probabilities just because they are conditional on your having thought of a certain set of alternatives. It is not rational in other words. Suppose I come out with a probability of a third that the penny is unbiased, having considered a certain set of alternatives. Now I do another experiment on another penny and I come out of that case with the probability one third that it is unbiased, having considered yet another set of alternatives. There is no reason why I should agree or disagree in my final action or inference in the two cases. I can do one thing in one case and other in another, because they represent conditional probabilities leaving aside possibly different events.
LINDLEY: All probabilities are conditional.
BARNARD: I agree.
LINDLEY: If there are only conditional ones, what is the point at issue?
PROFESSOR E.S. PEARSON: I suggest that you start by knowing perfectly well that they are conditional and when you come to the answer you forget about it.
BARNARD: The difficulty is that you are suggesting the use of probability for inference, and this makes us able to compare different sets of evidence. Now you can only compare probabilities on different sets of evidence if those probabilities are conditional on the same set of assumptions. If they are not conditional on the same set of assumptions they are not necessarily in any way comparable.
LINDLEY: Yes, if this probability is a third conditional on that, and if a second probability is a third, conditional on something else, a third still means the same thing. I would be prepared to take my bets at 2 to 1.
BARNARD: Only if you knew that the condition was true, but you do not.
GOOD: Make a conditional bet.
BARNARD: You can make a conditional bet, but that is not what we are aiming at.
WINSTEN: You are making a cross comparison where you do not really want to, if you have got different sets of initial experiments. One does not want to be driven into a situation where one has to say that everything with a probability of a third has an equal degree of credence. I think this is what Professor Barnard has really said.
BARNARD: It seems to me that likelihood would tell you that you lay 2 to 1 in favour of H_{1} against H_{2}, and the conditional probabilities would be exactly the same. Likelihood will not tell you what odds you should lay in favour of H_{1} as against the rest of the universe. Probability claims to do that, and it is the only thing that probability can do that likelihood cannot.
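A small aside on Lindley’s point (my own toy illustration, not part of the 1962 forum): conditionally on the catch-all ‘something else’ being false, the posterior over the enumerated hypotheses comes out the same whatever prior mass ε the catch-all gets, since ε cancels in the normalization. Barnard’s complaint concerns the unconditional comparison, which does depend on ε and on the uncomputable probability of the data given ‘something else’. The cancellation is easy to check numerically (the coin hypotheses and numbers below are arbitrary choices of mine):

```python
def conditional_posterior(priors, likelihoods, eps):
    """Posterior over the enumerated hypotheses, conditional on the
    catch-all 'something else' (prior mass eps) being false."""
    scaled = [p * (1.0 - eps) for p in priors]   # enumerated priors sum to 1 - eps
    joint = [p * l for p, l in zip(scaled, likelihoods)]
    total = sum(joint)                 # normalizer omits the catch-all term,
    return [j / total for j in joint]  # so the (1 - eps) factor cancels

# Two toy hypotheses about a coin: fair (p = 0.5) vs biased (p = 0.7);
# data: 7 heads in 10 tosses (likelihoods up to a shared constant).
lik = [0.5**10, 0.7**7 * 0.3**3]
for eps in (0.01, 0.3):
    print([round(x, 3) for x in conditional_posterior([0.5, 0.5], lik, eps)])
# both printed lines are identical: eps drops out of the conditional posterior
```

What the sketch cannot do, of course, is assign a likelihood to the data under ‘something else’, which is exactly Barnard’s objection.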
You can read the rest of pages 78-103 of the Savage Forum here.
HAPPY BIRTHDAY GEORGE!
References
*Six other Barnard links on this blog:
Guest Posts:
Aris Spanos: Comment on the Barnard and Copas (2002) Empirical Example
Stephen Senn: On the (ir)relevance of stopping rules in meta-analysis
Posts by Mayo:
Barnard, Background Information, and Intentions
Statistical Theater of the Absurd: Stat on a Hot Tin Roof
George Barnard’s 100^{th} Birthday: We Need More Complexity and Coherence in Statistical Education
Letter from George Barnard on the Occasion of my Lakatos Award
Links to a scan of the entire Savage forum may be found at: https://errorstatistics.com/2013/04/06/who-is-allowed-to-cheat-i-j-good-and-that-after-dinner-comedy-hour/
Live Exhibit: So what happens if you replace “p-values” with “Bayes Factors” in the 6 principles from the 2016 American Statistical Association (ASA) Statement on P-values? (Remove “or statistical significance” in question 5.)
Does the one positive assertion hold? Are the 5 “don’ts” true?
I will hold off saying what I think until our Phil Stat forum (Phil Stat Wars and Their Casualties) on Thursday [1], although anyone who has read Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars [SIST] (CUP, 2018) will have a pretty good idea. You can read the relevant sections 4.5 and 4.6 in proof form. In SIST, I called examples “exhibits”, and examples the reader is invited to work through are called “live exhibits”. That’s because the whole book involves tours through statistical museums.
What do you think?
[1] For my general take on the meaning of the theme, see Statistical Crises and Their Casualties.
Selected blog posts on the 2016 ASA Statement on P-values & the Wasserstein et al. March 2019 editorial in The American Statistician:
R. Morey’s slides “Bayes Factors from all sides: who’s worried, who’s not, and why” are at this link: https://richarddmorey.github.io/TalkPhilStat2020/#1
Upcoming talks will include Stephen Senn (Statistical consultant, Scotland, November 19, 2020); Deborah Mayo (Philosophy, Virginia Tech, December 19, 2020); and Alexander Bird (Philosophy, King’s College London, January 28, 2021). https://phil-stat-wars.com/schedule/.
In October, instead of our monthly meeting, I invite you to a P-value debate on October 15 sponsored by the National Institute of Statistical Sciences, with J. Berger, D. Mayo, and D. Trafimow. Register at https://www.niss.org/events/statistics-debate.
Many of the discussions in the book were importantly influenced (corrected and improved) by readers’ comments on the blog over the years. I posted several excerpts and mementos from SIST here. I thank readers for their input. Readers should look up the topics in SIST on this blog to check out the comments, and see how ideas were developed, corrected, and turned into “excursions” in SIST.
In the summer of 2019, A. Spanos and I led a Summer Seminar in Phil Stat at Virginia Tech for 15 faculty members from around the world in philosophy, psychology, and statistics. A write up is here.
This past summer (May 21-June 18), I ran a virtual LSE PH500 seminar on Current Controversies in Phil Stat.
Please peruse the 9 years of offerings below, taking advantage of the discussions by guest posters and readers.
Thank you so much for your interest!
Sincerely,
D. Mayo
P.S. Yes, I continue to use this old typewriter! I wrote all of SIST on it, switching to a university computer only for the final proofs, as required. You’ll be interested to hear that my latest supplier of ye olde typewriter ribbons is a woman in England distantly related to the woman E.S. Pearson’s cousin was to marry, and if you’ve read SIST, you know how this turned out. Fortunately, I have enough spare ribbons to get me through a year of the pandemic, but I expect more in January 2021.
September 2011
October 2011
November 2011
December 2011
January 2012
February 2012
March 2012
April 2012
May 2012
June 2012
July 2012
August 2012
September 2012
October 2012
November 2012
December 2012
January 2013
February 2013
March 2013
April 2013
May 2013
June 2013
July 2013
August 2013
September 2013
October 2013
November 2013
December 2013
January 2014
February 2014
March 2014
April 2014
May 2014
June 2014
July 2014
August 2014
September 2014
October 2014
November 2014
December 2014
January 2015
February 2015
March 2015
April 2015
May 2015
June 2015
July 2015
August 2015
September 2015
October 2015
November 2015
December 2015
January 2016
February 2016
March 2016
April 2016
May 2016
June 2016
July 2016
August 2016
September 2016
October 2016
November 2016
December 2016
January 2017
February 2017
March 2017
April 2017
May 2017
June 2017
July 2017
August 2017
September 2017
October 2017
November 2017
December 2017
January 2018
February 2018
March 2018
April 2018
May 2018
June 2018
July 2018
August 2018
September 2018
October 2018
November 2018
December 2018
January 2019
February 2019
March 2019
April 2019
May 2019
June 2019
July 2019
August 2019
September 2019
October 2019
November 2019
December 2019
January 2020
February 2020
March 2020
April 2020
May 2020
June 2020
July 2020
August 2020
September 2020
Compiled by Jean Miller and D. Mayo.
Speakers:
Sir David Cox, Nuffield College, Oxford
Deborah Mayo, Virginia Tech
Richard Morey, Cardiff University
Aris Spanos, Virginia Tech
Intermingled in today’s statistical controversies are some long-standing, but unresolved, disagreements on the nature and principles of statistical methods and the roles for probability in statistical inference and modelling. In reaction to the so-called “replication crisis” in the sciences, some reformers identify significance tests as a major culprit. To understand the ramifications of the proposed reforms, there is a pressing need for a deeper understanding of the source of the problems in the sciences and a balanced critique of the alternative methods being proposed to supplant significance tests. In this session, speakers offer perspectives on significance tests from statistical science, econometrics, experimental psychology, and philosophy of science. There will also be a panel discussion.
Sept 5, 2020 update: Little did we know that the P-value wars were soon to take a severe turn for the worse, with continued casualties:
Cox’s paper (based on his RSS talk): cox378-1 (download)
Some relevant papers and blogs of mine after the Wasserstein et al (2019) editorial.
]]>
You can find several excerpts and mementos from the book, including whole “tours” (in proofs) updated June 2020 here.
What do I mean by “The Statistics Wars and Their Casualties”? It is the title of the workshop I have been organizing with Roman Frigg at the London School of Economics (CPNSS) [1], which was to have happened in June. It is now the title of a forum I am zooming on Phil Stat that I hope you will want to follow. It’s time that I explain and explore some of the key facets I have in mind with this title.
The Statistics Wars, of course, refers to the wars of ideas between competing tribes of statisticians, data analysts and probabilistic modelers. Some may be surprised to learn that the field of statistics, arid and staid as it seems, has a fascinating and colorful history of philosophical debate, marked by unusual heights of passion, personality, and controversy for at least a century. Others know them all too well and regard them as “merely” philosophical, or perhaps cultural or political. That is why it seemed apt to use “wars” in the title of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP)—although I wasn’t at all sure that Cambridge would allow it.[2] Although the wars go back for many years, my interest is in their current emergence within the “crisis of replication”, and in relation to challenges posed by the rise of Big Data, data analytics, machine learning, and bioinformatics. A variety of high-powered methods, despite impressive successes in some contexts, have often led to failed replication, irreproducibility and bias. A number of statistical “reforms” and reformulations of existing methods are welcome (preregistration, replication, calling out cookbook statistics). Others are radical and even obstruct practices known to improve on replication. With the “war” metaphor in place, it is only natural to dub these untoward consequences its “casualties”–whether intended or unintended.
Nowadays, it is not unusual for people to hang out shingles, promising to give clarifying explanations of statistical significance tests, P-values, and Type 1 and 2 errors, but it seems to me that the terms are more confused than ever–including, to my dismay, by some of the “experts”.[3] These are among the casualties I have in mind. These issues are sufficiently urgent not to wait until the coronavirus pandemic is adequately controlled in the U.S. so that I can travel abroad. So I’m inviting the workshop participants (and perhaps others) to speak at a remote forum, even if their topic isn’t the one they’ll choose for an eventual in-person workshop. I will encourage them to draw out some of the more contrarian positions I find in their work.
Why the urgency? For one thing, these issues are increasingly being brought to bear on some very public controversies—including some that are reflected in the pandemic itself. The British Medical Journal found that all prediction models fail the tests for bias delineated in its guidelines for machine learning models. Playing on Box’s famous remark, the authors declare “all clinical prediction models for covid-19 to date are wrong and none are useful.” (True, this was back in April.)[4] Second, the “classical” statistics wars between Bayesians and frequentist error statisticians still simmer below the surface in assumptions about the very role probability plays in statistical inference. (Whenever people say ‘the issue is not about Bayesian vs frequentist’, I find, the issue turns out to be about Bayesian vs frequentist disagreements–or grows directly out of them.) Most important, what is at stake is a critical standpoint that we may be in danger of losing, if not permanently, then for a sufficiently long time to do real damage.
Of course, what counts as a welcome reform, and what counts as a casualty, depends on who you ask. But the entire issue is rarely even considered. We should at least point up conflicts and inconsistencies in positions being bandied about. I’m most interested in those that are self-defeating or that, however indirectly, weaken stated goals and aims. For example, there are those who call for greater replication tests of purported findings while denying we should ever use a P-value threshold, or any other threshold, in interpreting data (Wasserstein et al, 2019). How then to pinpoint failed replications? We should move away from unthinking thresholds, not just with P-values but with any other statistical quantity. However, unless you can say ahead of time that some outcomes will not be allowed to count in favor of a claim, you don’t have a test of that claim–whatever it is.[5] If statistical consumers are unaware of assumptions behind proposed changes to standards of evidence, they can’t scrutinize the casualties that affect them (in drug treatments, personalized medicine, psychology, economics and so on). They might jump on a popular bandwagon, only to discover, too late, that important protections in interpreting data are gone. When debates become politicized—as they now often are—warranted criticism is easily blurred with irrational distrust of science. Grappling with these issues requires a mix of philosophical, conceptual, and statistical considerations. I hope that by means of this forum we can have an impact on policies about evidence that are being debated, cancelled, adopted and put into practice across the sciences.
The P-value Wars and its Casualties
The best known of the statistics wars concern statistical significance tests. Many blame statistical significance tests for making it too easy to find impressive-looking effects that do not replicate with predesignated hypotheses and tighter controls. However, the fact that it becomes difficult to replicate effects when features of the tests are tied down gives new understanding of, and appreciation for, the role of statistical significance tests. It vindicates them. Statistical significance tests are part of a rich set of tools “for systematically appraising and bounding the probabilities … of seriously misleading interpretations of data” (Birnbaum 1970, 1033). These are a method’s error probabilities. Accounts where probability is used to assess and control a method’s error probabilities I call error statistical. This is much more apt than “frequentist”, which fails to identify the core feature of the methodology I have in mind. This error control is nullified by biasing selection effects–cherry picking, multiple testing, selective reporting, data dredging, optional stopping and P-hacking.
However, the same flexibility can occur when cherry-picked or P-hacked hypotheses enter into methods that are being promoted as alternatives to significance tests: likelihood ratios, Bayes factors, or Bayesian updating. There is one big difference: the direct grounds to criticize inferences as flouting error statistical control are lost, unless they are supplemented with principles that are not now standard. We hear that “Bayes factors can be used in the complete absence of a sampling plan, or in situations where the analyst does not know the sampling plan that was used” (Bayarri, Benjamin, Berger, & Sellke 2016, 100). But without the sampling plan, you cannot ascertain the altered capabilities of a method to distinguish genuine from spurious effects. Put simply, Bayesian inference conditions on the actual outcome, and so would not consider error probabilities, which refer to outcomes other than the one observed. Even outside of cases with biasing selection effects, the very idea that we are uninterested in how a method would behave in general, with different data, seems very strange. Perhaps practitioners are now prepared to supplement these accounts with new stipulations that can pick up on the general behavior of a method. Excellent; then they are in sync with error statistical intuitions. But they should make this clear.
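To make the sampling-plan point concrete, here is a minimal simulation sketch of my own (the particulars—peeking after every observation, a maximum of 100 observations, 5000 simulated studies—are arbitrary illustrative choices): under a true null, a researcher who keeps testing and stops the moment the nominal 5% cutoff is crossed will declare “significance” far more often than 5% of the time. The data alone look the same however the stopping was done; only the sampling plan reveals the inflated error probability.

```python
import math
import random

random.seed(1)

def optional_stopping(max_n=100, z_crit=1.96):
    """One simulated study under a true null (standard normal noise).
    The researcher computes a z-statistic after every observation and
    stops the moment it exceeds the nominal two-sided 5% cutoff."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += random.gauss(0.0, 1.0)
        if abs(total / math.sqrt(n)) > z_crit:
            return True   # a "significant" effect is claimed
    return False          # reached max_n with no significant result

trials = 5000
rate = sum(optional_stopping() for _ in range(trials)) / trials
print(rate)  # well above the nominal 0.05 under this much peeking
```

A Bayes factor computed from the final data set would be unchanged by the peeking; it is precisely the error-statistical assessment that registers the difference.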
Admittedly, error statisticians haven’t been clear as to the justification for caring about error probabilities, implicitly accepting that it reflects a concern with good long-run performance. Performance matters, but the justification also concerns the inference at hand.
The problem with the data dredger’s inference is not that it uses a method with poor long-run error control. It should be clear from the replication crisis that what bothers us about P-hackers and data dredgers is that they have done a poor job in the case at hand. They have found data to agree with a hypothesized effect, but they did so by means of a method that very probably would have found some such effect (or other) even if spurious. As Popper would say, the inference has passed a weak, and not a severe, test. Notice what the critical reader of a registered report is doing, whether pre-data or post-data. She looks, in effect, at the sampling distribution: the probability that one or another hypothesis, stopping point, choice of grouping variables, and so on, could have led to a false positive–even without a formal error probability computation. Psychologist Daniël Lakens—our August 20 presenter—suggests that the “severity argument currently provides one of the few philosophies of science that allows for a coherent conceptual analysis of the value of preregistration” (2019, 225). Links to the video and slides of his excellent talk are below the Notes. Thus, the replication crisis has had the constructive upshot of supplying a rationale never made entirely clear by significance testers. Ironically, however, statistical significance tests are mired in more controversy than ever.
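The “some such effect (or other)” point lends itself to a small simulation (again an illustrative sketch of my own; the 20 hypotheses and 30 observations per test are arbitrary): test many independent, truly null effects and report whether any reaches nominal significance. With 20 tries at the 5% level, a “finding” turns up in roughly 1 − 0.95^20 ≈ 64% of studies, even though every effect is spurious.

```python
import math
import random

random.seed(2)

def dredge(n_hypotheses=20, n=30, z_crit=1.96):
    """One simulated study: test n_hypotheses independent, truly null
    effects (pure noise) and report whether ANY reaches nominal p < .05."""
    for _ in range(n_hypotheses):
        z = sum(random.gauss(0.0, 1.0) for _ in range(n)) / math.sqrt(n)
        if abs(z) > z_crit:
            return True   # at least one "discovery" to report
    return False

trials = 5000
rate = sum(dredge() for _ in range(trials)) / trials
print(rate)  # close to 1 - 0.95**20, i.e. about 0.64, not 0.05
```

This is what the critical reader of a registered report is gauging informally: how probable it is that the reported “effect” would have emerged from the search even were all the nulls true.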
One final remark: It’s important to see that in many contexts the “same” data can be used to erect a model or claim as well as provide a warranted test of the claim. (I put quotes around “same”, because the data are actually remodeled.) Examples include: using data to test statistical model assumptions, DNA matching, and reliable estimation procedures. It may even be guaranteed that a method will output a claim or model in accordance with the data. The problem is not guaranteeing agreement between data and a claim; the problem is doing so even though the claim is false or specifiably false. We shouldn’t confuse cases where we’re trying to determine whether there even is a real effect that needs explaining—arguably, the key role for statistical significance tests—and cases where we have a known effect and are seeking to explain it. In the latter case, we must use the known effect in arriving at and testing proposed explanations.
NOTES:
[1] Alexander Bird (King’s College London), Mark Burgman (Imperial College London), Daniele Fanelli (London School of Economics and Political Science), Roman Frigg (London School of Economics and Political Science), David Hand (Imperial College London), Christian Hennig (University of Bologna), Katrin Hohl (City University London), Daniël Lakens (Eindhoven University of Technology), Deborah Mayo (Virginia Tech), Richard Morey (Cardiff University), Stephen Senn (Edinburgh, Scotland), Jon Williamson (University of Kent)
[2] It would be interesting to collect those non-actual wars that seem aptly described as “wars”. What is it about them? The mommy wars, the culture wars. I occasionally find a new one that has the essence I have in mind, but I haven’t kept up a list. Please share examples in the comments. What is it they share? Or are they too different to be viewed as sharing an essence? I know one thing they share: they will not be won!
[3] I can hear some people saying, ‘you see, even the experts can’t understand them’, but I have a different theory. The fallacies have become more prevalent because of a growing tendency to interpret statistical significance tests through the lens of quantities that measure very different things, e.g., likelihood ratios or Bayes factors.
[4] A future statistics casualty to consider: Remember when 2013 was dubbed “the year of statistics”? This was partly to avoid being sidelined in the face of all the attention being given to the new kid on the block called “data science” or “data analytics”. The features that made data science so attractive—it gave answers quickly without all the qualifications and care of statistics—led statisticians to question if it was more of a pragmatic, business occupation, good at finding predictive patterns in data, but not a full-blown profession with principles as enjoyed in statistics. Yet, at the risk of losing resources, the field of statistics has rapidly merged in a variety of forms with data science. So I was surprised to see in the latest issue of Significance that “the problems with applied data science exist because data science currently does not constitute a profession but is instead an occupation” (Steuer 2020). The problems are claimed to be rooted in a failure to incorporate both an ethics and an epistemology of data. A philosophy of data science is needed.
[5] Andrew Gelman is a Bayesian who also wants to be a falsificationist. In a joint paper with the error statistician Cosma Shalizi in 2013, they say: “[W]hat we are advocating, then, is what Cox and Hinkley (1974) call ‘pure significance testing’, in which certain of the model’s implications are compared directly to the data” (Gelman and Shalizi 2013, 20). But, in recent years, Gelman has thrown in with those keen on “abandoning statistical significance”. I don’t see how his own falsificationist philosophy avoids being a casualty.
Video of D. Lakens’ presentation (3 parts):
(Viewing in full screen mode helps with buffering issues.)
Part 1: Mayo’s Introduction & Lakens’ presentation
Part 2: Lakens’ presentation continued
Part 3: Discussion
Our second meeting will be 24 September. Richard Morey will give a presentation combining critiques of both P-values and Bayes factors from a human factors perspective.
July 30 PRACTICE VIDEO for JSM talk (All materials for Practice JSM session here)
JSM 2020 Panel Flyer (PDF)
JSM online program (w/ panel abstract & information)