I was asked to write something explaining the background of my slides (posted here) in relation to the recent ASA “A World Beyond P-values” conference. I took advantage of some long flight delays on my return to jot down some thoughts:
The contrast between the closing session of the conference “A World Beyond P-values,” and the gist of the conference itself, shines a light on a pervasive tension within the “Beyond P-Values” movement. Two very different debates are taking place. First there’s the debate about how to promote better science. This includes welcome reminders of the timeless demands of rigor and integrity required to avoid deceiving ourselves and others–especially crucial in today’s world of high-powered searches and Big Data. That’s what the closing session was about. [1]
The second debate is a contemporary version of long-standing, but unresolved, disagreements on statistical philosophy: degrees of belief vs. error control, frequentist vs. Bayesian, testing vs. confidence intervals, etc. That’s what most of the conference revolved around. The opening talk, by Steve Goodman, was “Why is Eliminating P-values so Hard?”. Admittedly there were concurrent sessions, so my view is selective. True, bad statistics–perverse incentives, abuses of significance tests, publication biases and the resulting irreplicability–have given a new raison d’être for (re)fighting the old, and tackling newer, statistics wars. And, just to be clear, let me say that I think these battles should be reexamined, but taking into account the more sophisticated variants of the methods on all sides. Yet the conference, by and large, presumed the main war was already over, and that the losers were tests of the statistical significance of differences–not merely abuses of the tests, but the entire statistical method itself! [2]
Under the revolutionary rubric of “The Radical Prescription for Change”, we heard, in the final session, eminently sensible recommendations for doing good science–the first interpretation in my deconstruction. Marcia McNutt provided a terrific overview of what Science, Nature, and key agencies are doing to promote scientific rigor and sound research. She listed statistical issues: file drawer problems, p-hacking, poor experimental design, model misspecification; and empirical ones: unidentified variables, outliers and data gaps, problems with data smoothing, and so on. In an attempt at “raising the bar”, she tells us, 80 editors agreed on the importance of preregistration, randomization and blinding. Excellent! Gelman recommended that p-values be just one piece of information rather than a rigid bar that, once cleared, guarantees publication. Decisions should be holistic and take into account background information and questions of measurement. The ways statisticians can help scientists, Gelman proposed, are (1) by changing incentives so that it’s harder to cheat and (2) by helping them determine the frequency properties of their tools (e.g., their abilities to reveal or avoid magnitude and sign errors). Meng, in his witty and sagacious manner, suggested punishing researchers by docking their salary if they’re wrong–using some multiple of their p-values. The one I like best is his recommendation that researchers ask themselves whether they’d ever dream of using the results of their work on themselves or a loved one. I totally agree!
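To give a concrete sense of the “frequency properties” Gelman has in mind, here is a minimal simulation sketch of magnitude (Type M) and sign (Type S) errors for a low-powered study. The numbers (a true effect of 2 with a standard error of 8) are my own toy assumptions for illustration, not anything presented in the session:

```python
# Illustrative sketch (toy numbers): magnitude (Type M) and sign (Type S)
# errors for a low-powered study, in the spirit of Gelman's remarks.
import numpy as np

rng = np.random.default_rng(1)

true_effect = 2.0      # hypothetical small true effect
se = 8.0               # hypothetical large standard error (low power)
n_sims = 100_000

estimates = rng.normal(true_effect, se, n_sims)   # sampling distribution of the estimate
z = estimates / se
significant = np.abs(z) > 1.96                    # "publishable" results at p < 0.05

sig_estimates = estimates[significant]
power = significant.mean()
type_s = (np.sign(sig_estimates) != np.sign(true_effect)).mean()  # wrong sign, given significance
type_m = np.abs(sig_estimates).mean() / abs(true_effect)          # exaggeration ratio, given significance

print(f"power ~ {power:.2f}, Type S ~ {type_s:.2f}, Type M ~ {type_m:.1f}x")
```

With these made-up numbers, a statistically significant estimate exaggerates the true effect several-fold and carries the wrong sign a nontrivial fraction of the time; those are exactly the frequency properties of one’s tools that, on Gelman’s proposal, statisticians should help scientists compute.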
Thus, in the interpretation represented by the closing session, “A World Beyond P-values” refers to a world beyond cookbook statistics and other modes of bad statistics. A second reading, however, has it refer to statistical inference where significance tests, if permitted at all, are compelled to wear badges of shame–use them at your peril. Never mind that these are the very tools relied upon to reveal lack of replication, to show adulteration by cherry-picking and other biasing selection effects, and to test assumptions. From that vantage point, it made sense that participants began by offering up alternative or modified statistical tools–and there were many. Why fight the battle–engage the arguments–if the enemy is already down? Using the suffix “-cide” (killer), we might call it statistical testicide.
I’m certainly not defending the crude uses of tests long lampooned. Even when used correctly, they’re just a part of what I call error statistics: tools that employ sampling distributions to assess and control the capabilities of methods to avoid erroneous interpretations of data (error probabilities).[3] My own work in philosophy of statistics has been to reformulate statistical tests to avoid fallacies and arrive at an evidential interpretation of error probabilities in scientific contexts (to assess and control well-testedness).
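For concreteness, here is a minimal sketch of how a sampling distribution yields error probabilities, and how the severity-style reformulation mentioned above uses the same distribution to assess which discrepancies are well tested. The setup (a one-sided test of a Normal mean with known sigma) and all the numbers are my own toy assumptions, not taken from the slides:

```python
# Toy sketch: a one-sided Normal test, its p-value, and a severity-style
# assessment of claims of the form "mu > mu1" (all numbers assumed).
from math import sqrt
from scipy.stats import norm

mu0, sigma, n = 0.0, 10.0, 100    # null value, known sd, sample size (assumed)
xbar = 2.5                        # hypothetical observed sample mean
se = sigma / sqrt(n)              # standard error = 1.0

# Error probability from the sampling distribution:
# p-value = P(Xbar >= observed xbar; mu = mu0)
p_value = 1 - norm.cdf((xbar - mu0) / se)

# Severity for the inference "mu > mu1", given this significant result:
# SEV(mu > mu1) = P(Xbar <= observed xbar; mu = mu1)
def severity(mu1: float) -> float:
    return norm.cdf((xbar - mu1) / se)

print(f"p-value vs mu0={mu0}: {p_value:.4f}")
for mu1 in (0.5, 1.0, 2.0, 3.0):
    print(f"SEV(mu > {mu1}) = {severity(mu1):.3f}")
```

The point of the sketch is that the same observed result gives high severity only to modest discrepancies from the null; claims of larger discrepancies are poorly tested, which is how error probabilities block fallacious magnitude inferences from a single significant result.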
Given my sense of the state of play, I decided that the best way to tackle the question of “What are the Best Uses For P-Values?”–the charge for our session–was to supply the key existing responses to criticisms of significance tests. Typically hidden from view (at least in these circles), these should now serve as handy retorts for the significance test user. The starting place for future significance test challengers should no longer be to just rehearse the criticisms, but to grapple with these responses and the arguments behind them.[4]
So to the question on my first slide: What contexts ever warrant the use of statistical tests of significance? The answer is: Precisely those you’d find yourself in if you’re struggling to get to a “World Beyond P-values” in the first sense–namely, battling bad statistical science.
___
[1] Andrew Gelman, Columbia University; Marcia McNutt, National Academy of Sciences; Xiao-Li Meng, Harvard University.
[2] Please correct me with info from other sessions. I’m guessing one of the policy-oriented sessions might have differed. Naturally, I’m excluding ours.
[3] A proper subset of error statistics uses these capabilities to assess how severely claims have passed.
[4] Please search this blog for details behind each, e.g., likelihood principle, p-values exaggerate, error probabilities, power, law of likelihood, p-value madness, etc.
Some related blogposts:
The ASA Document on P-values: One Year On
Statistical Reforms Without Philosophy Are Blind
Saturday Night Brainstorming and Task Forces (spoof)
On the Current State of Play in the Crisis of Replication in Psychology: Some Heresies
What a tour de force! Thanks for writing up this summary.
Dear Enrique: Thanks for your comment. So glad you find it of interest, although I might have thought you favored the second interpretation in my deconstruction, and perhaps you do.
The NYT article on Amy Cuddy and the power of power posing https://www.nytimes.com/2017/10/18/magazine/when-the-revolution-came-for-amy-cuddy.html?_r=1
is getting a ton of attention, understandably. It discusses how the analysis of her p-values (using p-curves) by Simonsohn and others was at the heart of unearthing the problem with her proposed inferences. This is the central value of p-values and of the associated error statistical methodology of which they are just one part. It has yet to be shown that the “alternative” measures many favor over p-values, e.g., Bayes factors, likelihood ratios, would similarly afford a stringent tool to detect lack of replication. It would be strange to advocate, in the name of improving scientific rigor, the banning or downplaying of the very methods that reveal bad statistics. Without tools to discriminate the statistical significance of differences (p-values), what would have shown the lack of replication?* And yet, on the second, and most popular, reading of “a World Beyond P-values”, that’s precisely what’s being recommended. The movement for more responsible science cannot have it both ways. Of course, I’ve been arguing this for a long time, as blog readers know (and it’s one of the key themes of my forthcoming book).
*Turning the dispute into one over whether Cuddy’s hypothesis deserved a high or low prior probability would still demand a way to stringently test it (e.g., through inability to replicate small p-values).
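[A rough simulation sketch of the idea behind p-curve diagnostics, added for illustration only; it is not the actual procedure used by Simonsohn and colleagues, and the effect sizes and sample sizes are assumptions. When there is a genuine effect, the significant p-values pile up near zero; when there is none, they are roughly flat just below .05.]

```python
# Sketch of the intuition behind p-curves (not the Simonsohn et al. procedure):
# compare the distribution of significant p-values with and without a real effect.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)

def significant_pvalues(effect: float, n: int = 20, studies: int = 10_000):
    """Collect p < .05 results from simulated two-group studies."""
    pvals = []
    for _ in range(studies):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect, 1.0, n)
        p = ttest_ind(a, b).pvalue
        if p < 0.05:
            pvals.append(p)
    return np.array(pvals)

for label, effect in [("true effect d=0.8", 0.8), ("null effect d=0.0", 0.0)]:
    p = significant_pvalues(effect)
    # Right skew (mass near 0) suggests evidential value; flatness suggests none.
    print(f"{label}: share of significant p-values below .01 = {(p < 0.01).mean():.2f}")
```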
Thank you as well from me. It seems hard to defend a reasonable use of tests and p-values and at the same time acknowledge and fight against their widespread misuse, but I still think it’s very worthwhile!
Christian: Thanks for your comment, but (unless I misunderstand you) I think my position is rather the opposite: I don’t see how one can champion good science, replication, and predesignation and at the same time call for ousting significance tests. It’s by significance tests that the spurious results are found, and if they were abandoned, it’s scarcely clear that the alternative methods would reveal lack of replication. It would become a battle of priors if, say, they were replaced by Bayes factors. It’s in order to safeguard error probabilities that strategies such as prespecified reports get their rationale. Was I not clear about this?
Probably I didn’t word my comment well. I don’t see the disagreement here.
Christian: Just to put my point in the most provocative way: even if statistical practice were no more inclined to questionable research practices and pockets of irreproducibility than it was 20 years ago, the same criticisms of frequentist error statistical methods would be with us, with likelihoodists and Bayesians (of various stripes) questioning the roles of error probabilities. The other day Stephen Senn said today’s replication crisis is really a “proxy war” between two types of Bayesians. I agree with him that there’s a proxy war going on–but with more than two rivals–these are essentially the same statistics wars that have persisted for decades, with one important difference: few Bayesians want to be true-blue subjectivists. Non-subjective Bayesians have given us a slew of variations, with no agreement on either meaning or use on the horizon. Even if you don’t want to go that far, I think you have to agree that replication problems were and are opportunistically used by those who have spent many years advocating for their preferred alternatives to statistical significance tests, and perhaps to frequentist statistics altogether.
The danger is that the proxy wars may result in real winners who will have succeeded by testicide rather than by demonstrating the worth of their methods, and science will be the loser. Now you might say, as did the small circle of like-minded participants at last week’s conference, that stringent error control will win out in the end, and that scientists won’t abandon such methods on grounds of fashion, but I’m much less sanguine. Many practitioners are disinclined to think through the methods on offer for themselves. I’ve heard them say: just tell us what to use. So if the ASA says “use those methods at your peril,” they may follow.
*As I’m sure you know, not everyone agrees there’s a replication “crisis” outside of newbies from big data (untrained in statistics, with powerful methods suddenly thrust upon them). Psychology was no different 20 or 30 or 40 or more years ago, as we know from Meehl.
Surely this is not “most provocative” from my point of view… I’m not quite sure why you think I disagree here, probably my initial comment was not well put; I didn’t mean to say anything that opposes what you wrote in your last two replies. Anyway, I agree with much of what you write here and about some other things (“proxy war between two types of Bayesians”) I don’t have an opinion. So all fine on my side and I hope somebody else gets something from the additional things you wrote here.