I was asked to write something explaining the background of my slides (posted here) in relation to the recent ASA “A World Beyond P-values” conference. I took advantage of some long flight delays on my return to jot down some thoughts:
The contrast between the closing session of the conference “A World Beyond P-values,” and the gist of the conference itself, shines a light on a pervasive tension within the “Beyond P-Values” movement. Two very different debates are taking place. First there’s the debate about how to promote better science. This includes welcome reminders of the timeless demands of rigor and integrity required to avoid deceiving ourselves and others–especially crucial in today’s world of high-powered searches and Big Data. That’s what the closing session was about. 
The second debate is a contemporary version of long-standing, but unresolved, disagreements on statistical philosophy: degrees of belief vs. error control, frequentist vs. Bayesian, testing vs. confidence intervals, etc. That’s what most of the conference revolved around. The opening talk by Steve Goodman was “Why is Eliminating P-values so Hard?”. Admittedly there were concurrent sessions, so my view is selective. True, bad statistics–perverse incentives, abuses of significance tests, publication biases and the resulting irreplicability–have given a new raison d’etre for (re)fighting the old, and tackling newer, statistics wars. And, just to be clear, let me say that I think these battles should be reexamined, but taking into account the more sophisticated variants of the methods, on all sides. Yet the conference, by and large, presumed the main war was already over, and the losers were tests of the statistical significance of differences–not merely abuses of the tests, the entire statistical method! 
Under the revolutionary rubric of “The Radical Prescription for Change”, we heard, in the final session, eminently sensible recommendations for doing good science–the first interpretation in my deconstruction. Marcia McNutt provided a terrific overview of what Science, Nature, and key agencies are doing to uplift scientific rigor and sound research. She listed statistical issues: file drawer problems, p-hacking, poor experimental design, model misspecification; and empirical ones: unidentified variables, outliers and data gaps, problems with data smoothing, and so on. In an attempt at “raising the bar”, she told us, 80 editors agreed on the importance of preregistration, randomization and blinding. Excellent! Gelman recommended that p-values be just one piece of information rather than a rigid rod by which, once jumped over, publication ensues. Decisions should be holistic and take into account background information and questions of measurement. Statisticians can help scientists, Gelman proposed, (1) by changing incentives so that it’s harder to cheat and (2) by helping them determine the frequency properties of their tools (e.g., their abilities to reveal or avoid magnitude and sign errors). Meng, in his witty and sagacious manner, suggested punishing researchers by docking their salary if they’re wrong–using some multiple of their p-values. The one I like best is his recommendation that researchers ask themselves whether they’d ever dream of using the results of their work on themselves or a loved one. I totally agree!
Thus, in the interpretation represented by the closing session, “A World Beyond P-values” refers to a world beyond cookbook, and other modes of, bad statistics. A second reading, however, has it refer to statistical inference where significance tests, if permitted at all, are to be compelled to wear badges of shame–use them at your peril. Never mind that these are the very tools relied upon to reveal lack of replication, to show adulteration by cherry-picking and other biasing selection effects, and to test assumptions. From that vantage point, it made sense that participants began by offering up alternative or modified statistical tools–and there were many. Why fight the battle–engage the arguments–if the enemy is already down? Using the suffix “cide” (killer), we might call it statistical testicide.
I’m certainly not defending the crude uses of tests long lampooned. Even when used correctly, they’re just a part of what I call error statistics: tools that employ sampling distributions to assess and control the capabilities of methods to avoid erroneous interpretations of data (error probabilities). My own work in philosophy of statistics has been to reformulate statistical tests to avoid fallacies and arrive at an evidential interpretation of error probabilities in scientific contexts (to assess and control well-testedness).
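To make “error probabilities” concrete, here is a minimal simulation sketch. It is only an illustration under stated assumptions, not a method from the slides: a one-sided z-test of a true null hypothesis (mean 0, known standard deviation 1, 25 observations), with the function names, sample sizes and the choice of 10 “looks” all invented for the example. It compares how often a single prespecified test yields p < 0.05 with how often the smallest of 10 independent tests does.

```python
import math
import random


def one_sided_p_value(sample, sigma=1.0):
    """P-value for a one-sided z-test of H0: mu = 0 vs H1: mu > 0 (sigma known)."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    # Upper-tail area of the standard normal: P(Z >= z)
    return 0.5 * math.erfc(z / math.sqrt(2))


def rejection_rate(n_trials, n_obs, looks, alpha=0.05, seed=0):
    """Fraction of trials, with H0 true, in which the smallest of
    `looks` independent p-values falls below alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        p_values = [
            one_sided_p_value([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
            for _ in range(looks)
        ]
        if min(p_values) < alpha:
            rejections += 1
    return rejections / n_trials


# Hypothetical settings chosen only for illustration.
honest = rejection_rate(n_trials=4000, n_obs=25, looks=1)
cherry_picked = rejection_rate(n_trials=4000, n_obs=25, looks=10)
print(f"single prespecified test: {honest:.3f}")
print(f"best of 10 tests:         {cherry_picked:.3f}")
```

Under these assumptions the first rate should sit near the nominal 0.05, while the second should be near 1 − 0.95^10, roughly 0.40. The reported p-value looks the same in both cases; what changes is the capability of the overall procedure to avoid an erroneous interpretation, which is the quantity the sampling distribution lets us assess.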
Given my sense of the state of play, I decided that the best way to tackle the question of “What are the Best Uses For P-Values?”–the charge for our session–was to supply the key existing responses to criticisms of significance tests. Typically hidden from view (at least in these circles), these should now serve as handy retorts for the significance test user. The starting place for future significance test challengers should no longer be to just rehearse the criticisms, but to grapple with these responses and the arguments behind them.
So to the question on my first slide: What contexts ever warrant the use of statistical tests of significance? The answer is: Precisely those you’d find yourself in if you’re struggling to get to a “World Beyond P-values” in the first sense–namely, battling bad statistical science.
 The closing-session panelists: Andrew Gelman, Columbia University; Marcia McNutt, National Academy of Sciences; Xiao-Li Meng, Harvard University.
 Please correct me with info from other sessions. I’m guessing one of the policy-oriented sessions might have differed. Naturally, I’m excluding ours.
 A proper subset of error statistics uses these capabilities to assess how severely claims have passed.
 Please search this blog for details behind each, e.g., likelihood principle, p-values exaggerate, error probabilities, power, law of likelihood, p-value madness, etc.
Some related blogposts: