Guest Post (part 2 of 2): Daniël Lakens: “How were we supposed to move beyond p < .05, and why didn’t we?”


Professor Daniël Lakens
Human Technology Interaction
Eindhoven University of Technology

[Some earlier posts by D. Lakens on this topic are at the end of this post]*

This continues Part 1:

4: Most do not offer any alternative at all

At this point, it might be worthwhile to point out that most of the contributions to the special issue do not discuss alternative approaches to p < .05 at all. They discuss general problems with low-quality research (Kmetz, 2019), the importance of improving quality control (D. W. Hubbard & Carriquiry, 2019), results-blind reviewing (Locascio, 2019), or the role of subjective judgment (Brownstein et al., 2019). There are historical perspectives on how we got to this point (Kennedy-Shaffer, 2019), and ideas about how science should work instead, many stressing the importance of replication studies (R. Hubbard et al., 2019; Tong, 2019). Note that Trafimow recommends replication as an alternative (Trafimow, 2019), but also co-authors a paper stating that we should not expect findings to replicate (Amrhein et al., 2019), thereby directly contradicting himself within the same special issue. Others propose giving up not just on p-values, but on generalizable knowledge altogether (Amrhein et al., 2019); the suggestion is to report only descriptive statistics.

5: Why has nothing changed?

It is worth reflecting on why we have had these discussions about the use of p-values approximately every decade for the last century, while nothing changes. I will make three personal and largely unsubstantiated observations that you are free to disagree with.

First, as far as I noticed, not a single contribution to the special issue engaged with philosophy of science. This is almost unbelievable, as the choice of a statistical method is based on the aims of science (Laudan, 1986), and what the aims of science are is by definition a philosophical question. One would have expected every contribution to start with three paragraphs about philosophy of science, and only then, after explaining their views on the aims of science, continue with their proposed recommendations. It is worthwhile to acknowledge that the practice of p < .05 is perfectly defensible from a methodological falsificationist approach to knowledge generation (Uygun Tunç et al., 2023). It is fine to criticize methodological falsificationism, but one can hardly blame methodological falsificationists for using a method that is coherent with their aims and the claims they want to make. One is also free to align one’s recommendations with a different philosophy of science, but then that philosophy has to be coherent and well developed.

Second, there were very few real-life examples of how one should analyze data in practice. In our own contribution to the special issue (Dongen et al., 2019) we analyzed the same dataset in four different ways, and all approaches led to the same conclusion. Of course, different approaches might yield different results in some applied contexts, and in some contexts it makes no sense to ask certain statistical questions (including those answered by p < .05). But there might be little use in recommendations that are not related to an applied context. Some solutions work in some contexts, but not in others. It requires a careful understanding of the research context to know which recommendations will work in which research areas. Without this understanding, statisticians might make proposals no one needs. It reminds me of a statistician who presented their work and, at the end of their talk, asked: “So, does anyone have a dataset where they would need the method I have developed?”
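
As a toy illustration of the kind of worked example I have in mind (this is a sketch, not the analysis reported in Dongen et al., 2019; the simulated data, the ±0.5 equivalence bounds, and the BIC-based Bayes factor approximation are illustrative assumptions), the same two-group dataset can be summarized from several perspectives:

```python
# Sketch: one simulated two-group dataset, summarized four ways.
# Requires numpy and scipy; all numbers are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2019)
a = rng.normal(0.3, 1.0, 50)   # hypothetical treatment group
b = rng.normal(0.0, 1.0, 50)   # hypothetical control group
n1, n2 = len(a), len(b)

# 1) Null hypothesis significance test: Welch's t-test p-value.
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)

# 2) Estimation: mean difference with a 95% confidence interval.
diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / n1 + b.var(ddof=1) / n2)
df = n1 + n2 - 2  # simple df; Welch-Satterthwaite would be more precise
ci = diff + np.array([-1, 1]) * stats.t.ppf(0.975, df) * se

# 3) Equivalence test (TOST) against illustrative bounds of +/- 0.5.
p_lower = stats.t.sf((diff + 0.5) / se, df)   # H0: diff <= -0.5
p_upper = stats.t.cdf((diff - 0.5) / se, df)  # H0: diff >= +0.5
p_tost = max(p_lower, p_upper)

# 4) Rough Bayes factor for H1 over H0 via the BIC approximation.
pooled = np.concatenate([a, b])
sse0 = np.sum((pooled - pooled.mean()) ** 2)                      # one common mean
sse1 = np.sum((a - a.mean()) ** 2) + np.sum((b - b.mean()) ** 2)  # two group means
n = n1 + n2
delta_bic = n * np.log(sse1 / sse0) + np.log(n)                   # BIC(H1) - BIC(H0)
bf10 = np.exp(-delta_bic / 2)

print(f"p = {p_value:.3f}, diff = {diff:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}], "
      f"TOST p = {p_tost:.3f}, approx BF10 = {bf10:.2f}")
```

Whether these summaries agree, and which of them answers a question anyone is actually asking, depends entirely on the applied context, which is exactly the point.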

Third, there was by and large a lack of engagement with other perspectives. Researchers propose their own views, but there is no honest discussion of the relative strengths and weaknesses of what they propose. The statistics community has a reward structure similar to that of most scientific fields, where provocative and attention-grabbing titles lead to a lot of citations. Actually solving problems is not strongly rewarded (with some exceptions, of course). Reaching consensus, or examining whether proposals actually work in practice, is a lot of work and does not get any statistician tenure. The field is no exception in this regard, but I think this helps to explain why discussions about p-values never end.

References

(References below are for both parts )

Amrhein, V., Trafimow, D., & Greenland, S. (2019). Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis if We Don’t Expect Replication. The American Statistician, 73(sup1), 262–270. https://doi.org/10.1080/00031305.2018.1543137

Anderson, A. A. (2019). Assessing Statistical Results: Magnitude, Precision, and Model Uncertainty. The American Statistician, 73(sup1), 118–121. https://doi.org/10.1080/00031305.2018.1537889

Benjamin, D. J., & Berger, J. O. (2019). Three Recommendations for Improving the Use of p-Values. The American Statistician, 73(sup1), 186–191. https://doi.org/10.1080/00031305.2018.1543135

Betensky, R. A. (2019). The p-Value Requires Context, Not a Threshold. The American Statistician, 73(sup1), 115–117. https://doi.org/10.1080/00031305.2018.1529624

Blume, J. D., Greevy, R. A., Welty, V. F., Smith, J. R., & Dupont, W. D. (2019). An Introduction to Second-Generation p-Values. The American Statistician, 73(sup1), 157–167. https://doi.org/10.1080/00031305.2018.1537893

Brownstein, N. C., Louis, T. A., O’Hagan, A., & Pendergast, J. (2019). The Role of Expert Judgment in Statistical Inference and Evidence-Based Decision-Making. The American Statistician, 73(sup1), 56–68. https://doi.org/10.1080/00031305.2018.1529623

Calin-Jageman, R. J., & Cumming, G. (2019). The New Statistics for Better Science: Ask How Much, How Uncertain, and What Else Is Known. The American Statistician, 73(sup1), 271–280. https://doi.org/10.1080/00031305.2018.1518266

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. https://doi.org/10.1037/0003-066X.49.12.997

Cohen, J. (1995). The earth is round (p < .05): Rejoinder. American Psychologist, 50(12), 1103. https://doi.org/10.1037/0003-066X.50.12.1103

Colquhoun, D. (2019). The False Positive Risk: A Proposal Concerning What to Do About p-Values. The American Statistician, 73(sup1), 192–201. https://doi.org/10.1080/00031305.2018.1529622

Dongen, N. N. N. van, Doorn, J. B. van, Gronau, Q. F., Ravenzwaaij, D. van, Hoekstra, R., Haucke, M. N., Lakens, D., Hennig, C., Morey, R. D., Homer, S., Gelman, A., Sprenger, J., & Wagenmakers, E.-J. (2019). Multiple Perspectives on Inference for Two Simple Statistical Scenarios. The American Statistician, 73(sup1), 328–339. https://doi.org/10.1080/00031305.2019.1565553

Fricker, R. D., Burke, K., Han, X., & Woodall, W. H. (2019). Assessing the Statistical Analyses Used in Basic and Applied Social Psychology After Their p-Value Ban. The American Statistician, 73(sup1), 374–384. https://doi.org/10.1080/00031305.2018.1537892

Gannon, M. A., de Bragança Pereira, C. A., & Polpo, A. (2019). Blending Bayesian and Classical Tools to Define Optimal Sample-Size-Dependent Significance Levels. The American Statistician, 73(sup1), 213–222. https://doi.org/10.1080/00031305.2018.1518268

Goodman, W. M., Spruill, S. E., & Komaroff, E. (2019). A Proposed Hybrid Effect Size Plus p-Value Criterion: Empirical Evidence Supporting its Use. The American Statistician, 73(sup1), 168–185. https://doi.org/10.1080/00031305.2018.1564697

Greenland, S. (2019). Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values. The American Statistician, 73(sup1), 106–114. https://doi.org/10.1080/00031305.2018.1529625

Hodges, J. L., & Lehmann, E. L. (1954). Testing the Approximate Validity of Statistical Hypotheses. Journal of the Royal Statistical Society. Series B (Methodological), 16(2), 261–268. https://doi.org/10.1111/j.2517-6161.1954.tb00169.x

Hubbard, D. W., & Carriquiry, A. L. (2019). Quality Control for Scientific Research: Addressing Reproducibility, Responsiveness, and Relevance. The American Statistician, 73(sup1), 46–55. https://doi.org/10.1080/00031305.2018.1543138

Hubbard, R., Haig, B. D., & Parsa, R. A. (2019). The Limited Role of Formal Statistical Inference in Scientific Inference. The American Statistician, 73(sup1), 91–98. https://doi.org/10.1080/00031305.2018.1464947

Kennedy-Shaffer, L. (2019). Before p < 0.05 to Beyond p < 0.05: Using History to Contextualize p-Values and Significance Testing. The American Statistician, 73(sup1), 82–90. https://doi.org/10.1080/00031305.2018.1537891

Kmetz, J. L. (2019). Correcting Corrupt Research: Recommendations for the Profession to Stop Misuse of p-Values. The American Statistician, 73(sup1), 36–45. https://doi.org/10.1080/00031305.2018.1518271

Krueger, J. I., & Heck, P. R. (2019). Putting the P-Value in its Place. The American Statistician, 73(sup1), 122–128. https://doi.org/10.1080/00031305.2018.1470033

Lakens, D., & Delacre, M. (2020). Equivalence Testing and the Second Generation P-Value. Meta-Psychology, 4, 1–11. https://doi.org/10.15626/MP.2018.933

Lakens, D., Adolfi, F. G., Albers, C. J., et al. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171. https://doi.org/10.1038/s41562-018-0311-x

Laudan, L. (1986). Science and Values: The Aims of Science and Their Role in Scientific Debate.

Locascio, J. J. (2019). The Impact of Results Blind Science Publishing on Statistical Consultation and Collaboration. The American Statistician, 73(sup1), 346–351. https://doi.org/10.1080/00031305.2018.1505658

Maier, M., & Lakens, D. (2022). Justify Your Alpha: A Primer on Two Practical Approaches. Advances in Methods and Practices in Psychological Science, 5(2), 25152459221080396. https://doi.org/10.1177/25152459221080396

Manski, C. F. (2019). Treatment Choice With Trial Data: Statistical Decision Theory Should Supplant Hypothesis Testing. The American Statistician, 73(sup1), 296–304. https://doi.org/10.1080/00031305.2018.1513377

Matthews, R. A. J. (2019). Moving Towards the Post p < 0.05 Era via the Analysis of Credibility. The American Statistician, 73(sup1), 202–212. https://doi.org/10.1080/00031305.2018.1543136

Mazzolari, R., Porcelli, S., Bishop, D. J., & Lakens, D. (2022). Myths and methodologies: The use of equivalence and non-inferiority tests for interventional studies in exercise physiology and sport science. Experimental Physiology, 107(3), 201–212. https://doi.org/10.1113/EP090171

McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). Abandon Statistical Significance. The American Statistician, 73(sup1), 235–245. https://doi.org/10.1080/00031305.2018.1527253

Murphy, K. R., & Myors, B. (1999). Testing the hypothesis that treatments have negligible effects: Minimum-effect tests in the general linear model. Journal of Applied Psychology, 84(2), 234–248. https://doi.org/10.1037/0021-9010.84.2.234

Pogrow, S. (2019). How Effect Size (Practical Significance) Misleads Clinical Practice: The Case for Switching to Practical Benefit to Assess Applied Research Findings. The American Statistician, 73(sup1), 223–234. https://doi.org/10.1080/00031305.2018.1549101

Tong, C. (2019). Statistical Inference Enables Bad Science; Statistical Thinking Enables Good Science. The American Statistician, 73(sup1), 246–261. https://doi.org/10.1080/00031305.2018.1518264

Trafimow, D. (2019). Five Nonobvious Changes in Editorial Practice for Editors and Reviewers to Consider When Evaluating Submissions in a Post p < 0.05 Universe. The American Statistician, 73(sup1), 340–345. https://doi.org/10.1080/00031305.2018.1537888

Uygun Tunç, D., Tunç, M. N., & Lakens, D. (2023). The epistemic and pragmatic function of dichotomous claims based on statistical hypothesis tests. Theory & Psychology, 09593543231160112. https://doi.org/10.1177/09593543231160112

Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a World Beyond “p < 0.05.” The American Statistician, 73(sup1), 1–19. https://doi.org/10.1080/00031305.2019.1583913

Wong, T. K., Kiers, H., & Tendeiro, J. (2022). On the Potential Mismatch Between the Function of the Bayes Factor and Researchers’ Expectations. Collabra: Psychology, 8(1). https://doi.org/10.1525/collabra.36357 

*Other related guest posts by Lakens on this blog:

June 8, 2016: “So you banned p-values, how’s that working out for you?” D. Lakens exposes the consequences of a puzzling “ban” on statistical inference
Jan 5, 2022: Lakens (Guest Post): Averting journal editors from making fools of themselves.
June 30, 2022: D. Lakens responds to confidence interval crusading journal editors



13 thoughts on “Guest Post (part 2 of 2): Daniël Lakens: “How were we supposed to move beyond p < .05, and why didn’t we?””

  1. I am grateful to Daniel Lakens for his thoughtful and illuminating two-part guest post. I have a lot to say about Part 2, which connects to a talk Lakens recently gave at a conference on Foundations of Applied Statistics, but that would call for a blog post of its own.

    https://www.youtube.com/watch?v=pOaqDbm38yo

    Here are a few remarks: I totally agree that the authors proposing alternative methods should explain and defend their alternative aims for statistical science, and show how their methodology achieves those ends. It does not suffice to repeat again and again and again the same exact allegations against statistical significance tests and error statistical methods. I don’t know another field or problem where you can keep publishing repetitions of the identical points. (I gave a list in the “Farewell Keepsake,” pp. 436–443, of my book SIST, 2018.)

    While I strongly endorse Lakens’ idea of putting together teams of statisticians, scientists and philosophers of science (if they are engaged in statistical practice) to weigh in on methodology, the idea of convening a group to reach consensus fills me with terror. Responding to criticisms, and recognizing and including diverse perspectives, are essential for scientific progress, but that’s not how the current controversies about statistical significance have played out of late. One need only consider how even the 2016 ASA policy was created, the 2017 conference organized, and this special issue introduced. The worst thing that could happen would be to force agreement, although, unfortunately, to some degree, it is already happening (by dint of the reward structure). That’s why I wrote the editorial “The statistics wars and intellectual conflicts of interest” in Conservation Biology: https://conbio.onlinelibrary.wiley.com/doi/10.1111/cobi.13861

    Dissent might well be treated as was the report of the 2019 ASA Task Force on statistical significance and replicability (which opposed abandoning statistical significance). The use of multiple methods, as in the paper Lakens co-authored, is preferable. I’m grateful to him for his post, which I hope will encourage disagreement with the new orthodoxy in (at least some of) statistics.

    • rkenett

      The slides of the conference mentioned by Mayo are available at https://neaman.org.il/en/On-the-foundations-of-applied-statistics

      Topics discussed there include the role of philosophical concepts in statistics (Lakens), a historical perspective (Senn), selective inference (Benjamini), DOE for generalizability (Steinberg), and the need for a foundation of applied statistics (Kenett).

      The argument by Lakens on why Popper was not a Bayesian, since he considered science not to depend on priors, was particularly interesting. It reminded me of Paul Erdős, the famous mathematician, who used to talk about The Book, in which God maintains the perfect proofs of mathematical theorems. See:

      Kenett, R. S., & Redman, T. C. (2019). The real work of data science: turning data into information, better decisions, and stronger organizations. John Wiley & Sons.

      Aigner, M. and Ziegler, G. (2000). Proofs from the Book. Springer.

      Is science about “the book”?

  2. vaccinelegit0q

    Many people like myself working in the field (e.g., quality control) have already moved beyond p-values. We use methods in the tradition of Shewhart, Deming, Fisher, Taguchi, and Moen, e.g., factorial experimentation, degrees of belief, orthogonal arrays, and so on, and it works quite well. In fact, it works tremendously well. Just look at how consistently we make high-quality products.

    These methods do not depend upon p-values at all. Or if they do employ them, they are used as Fisher intended: as a guide to inference rather than a tool for selecting a hypothesis or making a decision.

    Why not include a discussion of these approaches?

    Perhaps you mentioned this and I missed it, but the solution seems clear: you get away from p-values by “doing science”, meaning through successful prediction in a variety of contexts, outside the lab, in the real world.

    Perhaps that brings us to the real problem: so much “science” that is done today is done by “proving” effects in highly controlled environments (labs), not in the field. Sure, ergodic, closed systems can be tested in the lab (e.g., particle physics, chemistry), but nearly everything at human scale is open / non-ergodic.

    For instance, while we do test new polymers in the lab, that is only part of the experiment. We still have to test them in the field, under real-world conditions. This is just as Fisher did in his agricultural experiments. Fisher (AFAIK) never used p-values achieved in the lab to justify a treatment. He relied upon tangible results in the real world: the fertilizer improved crop yields on real farms, or it didn’t.

    Am I missing anything?

      • Let me respond to Kenett’s comment on quality control rather than inference and testing:
        “Quality control” from Deming, Neyman and Pearson, and others never went away. It’s what goes on in full-bodied uses of error statistics, which combine piecemeal studies, each keeping to error control. It’s quite ironic if today’s critics of statistical significance tests point to “quality control” as opposed to frequentist error statistical methodology, when that is at the heart of its foundations. One of the strongest criticisms of N-P tests and related methods is that they focus on ensuring one will not act on or infer an erroneous claim too often in the long run.
        What is it we are controlling when we apply quality control to statistics? We’re controlling claims about data, data models, and links between statistics and what we’re trying to find out about. Inferences are our product! Those who accuse us of not being interested in inference and hypothesis testing must be forgetting that the goal of using data to find things out is to INFER claims or solutions to problems that go beyond the data. If people understood that statistical significance tests are a small part of a rich methodology intended to combine local, stringent probes of error, they wouldn’t be charging, as they do, that frequentist error statistics does not use background information. What they mean is that it doesn’t use prior probabilities (unless parameters can be treated as random variables with frequentist distributions), but it systematically and brilliantly uses background information to design and control error and variability at multiple levels, from data to substantive claims.

        As Mayo and Cox (2006) say:

        “A fundamental tenet of the conception of inductive learning most at home with the frequentist philosophy is that inductive inference requires building up incisive arguments and inferences by putting together several different piece-meal results; we have set out considerations to guide these pieces. Although the complexity of the issues makes it more difficult to set out neatly, as, for example, one could by imagining that a single algorithm encompasses the whole of inductive inference, the payoff is an account that approaches the kind of arguments that scientists build up in order to obtain reliable knowledge and understanding of a field.”

        Click to access 2006-mayocox-freq-stats-as-a-theory-of-inductive-inference.pdf

        I disagree with Gelman’s allegation about hypothesis testing: “Hypothesis testing, in which the purpose of data collection and analysis is to rule out a null hypothesis (typically, zero effect and zero systematic error) that nobody believes in the first place.” The purpose of rejecting an appropriate test hypothesis (Fisher dubbed them nulls) is to find out if there is a genuine phenomenon. That is NOT the same as inferring there’s some non-systematic error someplace e.g., in measurement procedures. What we absolutely do NOT know ahead of time is whether we have got hold of a genuine phenomenon or real effect. That’s what scientific theories explain and predict, and we only falsify scientific theories with error-prone predictions by dint of rejecting local statistical claims.

        One of the dangers in abandoning statistical significance and the error-probing methodology is that inference—finding things out so as to go beyond the data—is downplayed or replaced with mere description or perhaps claims about degrees of belief. Inference may be risky, but that’s how we discover new things, and we can control the risks in just the way envisioned by Deming and Neyman-Pearson.

        I’m traveling, so this is quick; but I felt the need to respond right away.

        • rkenett

          Mayo

          Thanks for your response. Some quick thoughts are listed below.

          Shewhart reversed the statistical approach to hypothesis testing. In classical statistics you have data and you fit a model; the data are fixed.

          In Shewhart’s conceptual framework this is reversed. You have an abstract model for a process under control. If the data show otherwise, you can stop the process and get it back under control.
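
          To make this concrete, here is a minimal illustrative sketch of a Shewhart-style 3-sigma rule (a toy illustration only, not the sequential procedure in the papers cited below): the in-control model is fixed in advance, and observations are judged against it.

```python
# Sketch of a Shewhart-style control rule; the in-control model and the
# data below are simulated assumptions, not from any real process.
import numpy as np

rng = np.random.default_rng(0)
mu0, sigma0 = 10.0, 1.0                          # assumed in-control process model
ucl, lcl = mu0 + 3 * sigma0, mu0 - 3 * sigma0    # 3-sigma control limits

x = rng.normal(mu0, sigma0, 500)                 # simulated in-control observations
alarms = np.flatnonzero((x > ucl) | (x < lcl))   # points that would stop the process

# With in-control data, every alarm is a false alarm: the per-point
# false-alarm probability is about 0.0027, an average run length near 370.
print(f"false alarms at indices {alarms.tolist()} out of {len(x)} observations")
```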

          This brings up the difference between process control and surveillance (when the process cannot be reset). This gets us to look at subsequent false alarms (i.e., one after another). I believe my paper with Moshe Pollak in 1983, listed below, was the first to make this distinction and to propose using the probability of false alarms (PFA). PFA works in surveillance and process control. See:

          Kenett, R., & Pollak, M. (1983). On sequential detection of a shift in the probability of a rare event. Journal of the American Statistical Association, 78(382), 389–395.

          Kenett, R. S., & Pollak, M. (2012). On assessing the performance of sequential procedures for detecting a change. Quality and Reliability Engineering International, 28(5), 500–507.

  3. Peter Monnerjahn

    Lakens:

    “I might be the only person in the world who has read all 43 contributions to this special issue.”

    I’m pretty sure there are a couple of us. 🙂

    “As far as I noticed, not a single contribution to the special issue engaged with philosophy of science.”

    Bingo! And that’s the major failing of the whole special issue: nobody talks about either the aims of science, what a consensus on a methodology of science might be, or even how the statistical methods commonly used in science relate to any overarching ideas about what it actually is that we (should) do. And nobody seems to have so much as thought about asking us philosophers of science whether we might have anything worthwhile to add to the debate. It very much looks like there is a (tacit) consensus that we don’t.

    And yet, both high-level concepts like “falsificationism” and relatively low-level logical concepts like “inference” are constantly used in those discussions. Unless one thinks that practitioners simply know better about all of this, it would be decidedly odd not to at least wonder what the people whose actual specialty these things are think about them.

    So, as a way of extending an invitation to start a conversation, allow me to ask a question that most of you may find odd (or even perverse) but that I promise holds a key to unlocking the debate about p-values. Here goes: Have you ever heard anyone advance the idea (or thought about that idea yourself) that a single p-value, regardless of any other considerations such as the significance level, is meaningless? And before you answer, maybe consider this: If we somehow were to respond to the substantive question in the affirmative, it would immediately suggest that any dichotomous decisions based on p-values would be illegitimate (in science). (And I might teasingly add that all three Founding Fathers of statistical tests–Fisher, Neyman, and Pearson–explicitly said just that.)

    And one other thing that I think may enrich the debate. I found it rather curious that not a single contribution to the whole replication crisis back-and-forth, neither in the special issue nor thereafter (as far as I can tell), has even mentioned the place of theory in the social sciences. Denny Borsboom had something interesting to say on that 10+ years ago: “Theoretical Amnesia”.

    I’d be very happy to discuss this further with you. 🙂

    • Peter:
      Thanks for your comment.

      Philosophers of statistics have long been well aware of the foundational and philosophical controversies in statistical methodology. If they’re not engaging with philosophical problems in scientific practice, it is they who are remiss. Philosophy of statistics was ahead of its time in engaging with practitioners and scientists on foundational issues. When I started out, it was common to find philosophers of science at statistics meetings. I identify 1998 as the turning point, when Colin Howson said he would turn his attention away from statistics to focus on “logic” or probability logic. “Formal epistemology” was born. Despite the fact that philosophers of science in general have been pledging to be relevant to practice, this is not so when it comes to statistical practice, where, ironically, pressing philosophical debates are occurring. Few Bayesian epistemologists consider themselves working on foundational problems in statistical practice: it’s the statisticians who have taken the lead. That’s why my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) is mostly directed to the debates in statistical practice, with the exception of Excursion 2.

      Lack of philosophical skills hampers practitioners just as lack of formal statistical training hampers formal epistemologists (who mostly have different goals).

      Ron Wasserstein invited me to be a “philosophical observer” at the 2015 pow-wow on which the ASA 2016 statement on statistical significance and p-values was based. No other philosophers were involved. The problem isn’t that they haven’t been asked, but rather that very few are trained in statistics. In 2019, I led (with Aris Spanos) a summer seminar for 15 faculty with the goal of providing philosophers a background in statistics and in central statistical controversies.
      https://liberalarts.vt.edu/news/articles/2019/10/experts-convene-to-explore-new-philosophy-of-statistics-field.html

      During the pandemic I organized and ran a series of Phil Stat Forums and workshops:
      Phil-stat-wars.com, connected with the LSE. Although philosophers of science are engaged in projects that are relevant to statistical debates, very, very few are actually doing research in the area. I would be glad to hear of some people I’m missing. In my view, unless they wish to stay on the sidelines occupying themselves with philosophical puzzles, however interesting, philosophers of science should be immersed in central problems revolving around the nature and role of probability in learning about the world in the face of error.

      The leading dogma, of course, has long been that the only statistical methodology with proper philosophical foundations is Bayesian—in some classical subjectivist sense. But most practicing Bayesians are non-subjective or ‘pragmatic’, which is to say their foundations are messy and not the coherent affair that underlies the supposition that, unlike frequentist error statisticians, they alone are properly philosophical. The search for a neat logic underlies many of the criticisms of statistical significance tests. However, outside of some fields, error probabilities still matter. Although error statistical tools are a hodge-podge, I argue that there is an overarching logic that connects and interrelates them. That was the main theme of my Error and the Growth of Experimental Knowledge (Chicago, 1996, Lakatos Prize), and my work since then.

      I’d be glad to hear more about your work.

      By the way, Benjamini has a paper in the Harvard Data Science Review that gives a chart of the papers in the 2019 special issue.

    • @Peter Monnerjahn: “Here goes: Have you ever heard anyone advance the idea (or thought about that idea yourself) that a single p-value, regardless of any other considerations such as the significance level, is meaningless?”

      Wouldn’t the answer to that question just be: It may not mean much, but for sure it means more than nothing?

  4. Thank you, Daniel, for your posts. Part 2 said what I hoped it would when I read Part 1. The offering of alternatives, the lack of an underlying philosophical basis to aspects of the discussion, and the different aims and methods of science are three intersecting issues that have been missing from the debate.

    I’m a working scientist who has had to learn to use different statistical tools over the course of my journey. My undergrad training in stats was totally inadequate. When I got into applied research on understanding past climates, I quickly learned that there were few established statistical tools for exploring past uncertainties except for Monte Carlo techniques. This became even more evident when projecting future climate and assessing potential risks across a wide range of sectors (e.g., ecology, ag, economics, health). I pioneered a lot of the early risk work in these areas.

    I’m not a mathy type, being more comfortable with sandpits, so when I need to learn and use a new method, I play with it a lot to understand how it works. The contexts I aim to use it in have their own sensitivities, so I want to understand the interplay between those and sensitivities/robustness within the test. This means avoiding black boxes and the latest fashions.

    This approach highlights the importance of model selection, understanding which methods are suitable for which tasks, rather than taking a p-hammer to everything. But quantification of uncertainty is important and should be pursued wherever possible, so no (potential) likelihood should go unexamined.

    Natural and human systems are complex and we analyse them for a wide variety of interactions. This includes detection and attribution, interventions, projections, predictions and causality (if we want to understand how and why a particular phenomenon has emerged). Some of these involve mechanistic representations and others statistical (e.g., excess deaths and extreme weather). The output of mathematical models, such as earth system models, is also analysed statistically.

    In practice, if there’s time series data, someone will unquestioningly put a line through it. Trend is a synonym for change in much of the literature.

    The presence of complex systems highlights the diversity in real-world phenomena and the approaches used to explore them. The idea that there can be a single approach to how we test them, relying on a specific kind of metric, is hard to sustain.

    It is not surprising that many of the efforts to explore replicability expose potential flaws if it’s hard to pin down the actual nature of the studies. This point has been highlighted in previous posts in this series.

    Would it be possible to develop a typology of different families of tests, used for different purposes, and from there develop a more systematic understanding of their strengths and weaknesses? Individual practitioners with wide experience do this for themselves, but efforts to impose a blanket approach have failed.

    Methods in practice range from being highly developed to totally exploratory. Pre-registration and strict guidelines (e.g., recipes and standards) are suitable for the former but not the latter.

    Exploratory work, or situations with competing processes and explanations, is where I would argue methods such as severe testing are most needed. But this would need bespoke approaches that are not well understood and that, furthermore, will be rejected by those who favour some kind of orthodox approach.

    Tests that combine scientific and statistical hypotheses will depend on the subject matter. To this end, there are many textbooks that address these needs, but they are largely aimed at undergraduates and practitioners of established methods.

    Personally, I’m not overly concerned about replicability. Cutting-edge research is hard to replicate, so overburdening it with rules that emphasise orthodoxy at the cost of innovation will be counterproductive.

    What I am concerned about is gate-keeping. We are in a situation where massive amounts of literature are being produced each year and merit is assessed on the capacity to produce more. The low-quality end of research will continue to increase. Paper mills will keep springing up, leading to whack-a-mole tactics as they do.

    Research governing bodies have to decide which matters more. Headlines featuring crises always get attention and research replicability has been right up there. This raises concerns and nervousness about science’s reputation. But if we look around, science is being attacked constantly – the problem is not topical, it’s structural. If it wasn’t replicability, it would be something else.

    Although the failure of replicability gets headlines, the role of the science publishers who generate those headlines, while joining in the general wailing and gnashing of teeth, also needs to be examined. For an industry that gets rates of return similar to those of arms and drug dealers, while drawing on free labour, this is far too cute. They need to invest in better-quality research. This doesn’t mean doubling down on the current inequalities that disenfranchise the global south, but the opposite.

    These issues will not be addressed from within individual silos. The idea of putting together interdisciplinary teams to work through some of these issues in detail could be really beneficial. For example, building a library of exemplars in specific areas could be useful. Leaning on publishers to invest in this for their own best interests (long term) could also be tried.

    Science, and research generally, needs to emphasise discovery and excellence, and to encourage as many as possible to pursue them. It can’t be gatekeeper and (p-)police. That’s a recipe for mediocrity, and things are bad enough as it is. (Apologies for the soapbox.)

  5. vaccinelegit0q

    In 1990 Deming wrote the following:

    Unfortunately, the statistical methods in textbooks and in the classroom do not tell the student that the problem in the use of data is prediction. What the student learns is how to calculate a variety of tests (t-test, F-test, chi-square, goodness of fit, etc.) in order to announce that the difference between the two methods or treatments is either significant or not significant. Unfortunately, such calculations are a mere formality. Significance or the lack of it provides no degree of belief—high, moderate, or low—about prediction of performance in the future, which is the only reason to carry out the comparison, test, or experiment in the first place.

    The theory behind this statement is primarily that:

    1. A test of significance is only useful / relevant when studying “dead” things or processes in equilibrium / maximum entropy.
    2. When dealing with non-equilibrium / fat-tailed (non-ergodic) systems, a test of significance does not preserve the structure or order of the data because it is a symmetric function of the data. Thus, changes in the structure of the data observed have no effect upon the first, second, or fourth moments.

    For instance, it is only appropriate to apply the Neyman and Pearson procedure to populations with fixed boundary conditions (e.g., to determine if you should send the 1000 radios you just made to market, or do a recall). However, it is inappropriate to use this procedure when studying whether a change to a manufacturing process is desirable or not.

    Deming also made this point in his papers On Probability as a Basis for Action and On the Distinction Between Enumerative and Analytic Surveys. This is also why Shewhart said that such methods could only be used when a process is “under statistical control.”

    In fact the recent Covid-19 vaccine debacle illustrates how statistics is misused / abused. When a vaccine is introduced to the body, its complete effects cannot be known in advance (i.e., they are computationally irreducible). You have to run the whole process (let someone live with the vaccine for decades, have children, etc.) to determine whether it’s “safe” or not. A good example is Thalidomide. When given to a pregnant woman, its effects can only be known after the fetus-maturity process executes. It was impossible to know the outcomes of taking the medication in advance because the boundary conditions of the sample space had not yet emerged.

    So, back to tests of significance: there seems to me to be too little education on the basics of statistical theory and on whether or not data from an experiment meet the assumptions required – as Deming pointed out.

    Perhaps this is one reason why this debate seems to go nowhere.

  6. rkenett

    “Tests of variables that affect a process are useful only if they predict what will happen if this or that variable is increased or decreased. Statistical theory, as taught in the books, is valid and leads to operationally verifiable tests and criteria for an enumerative study. Not so with an analytic problem, as the conditions of the experiment will not be duplicated in the next trial. Unfortunately, most problems in industry are analytic.”

    W. E. Deming, from the preface to The Economic Control of Quality of Manufactured Product by W. Shewhart, 1931

    The enumerative/analytic dichotomy is key. Handling analytic problems implies generalizability of findings. In AI/ML/DL this is achieved with a split between training and validation data. A first step in addressing generalizability is developing methods for the representation of findings. A step in this direction is proposed in: Kenett, R. S., & Rubinstein, A. (2021). Generalizing research findings for enhanced reproducibility: An approach based on verbal alternative representations. Scientometrics, 126(5), 4137–4151. https://papers.ssrn.com/sol3/Papers.cfm?abstract_id=3035070
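
    As a minimal illustration of the training/validation split mentioned above (the toy regression and the use of scikit-learn are assumptions of this sketch, not part of the cited approach):

```python
# Sketch: check whether a fitted model generalizes to data held out from fitting.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=1)
model = LinearRegression().fit(X_tr, y_tr)

# A finding that "generalizes" should hold up on validation data the model never saw.
print(f"train R^2 = {model.score(X_tr, y_tr):.3f}, validation R^2 = {model.score(X_va, y_va):.3f}")
```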

    Deming points to the limited generalizability of enumerative studies to a current population frame. He was an expert in sampling methods aimed at this. He also indicates that analytic studies are most important. Generalizability in this context is an open challenge. For COVID-related aspects of this, see Serio, C. D., Malgaroli, A., Ferrari, P., & Kenett, R. S. (2022). The reproducibility of COVID-19 data analysis: Paradoxes, pitfalls, and future challenges. PNAS Nexus, 1(3), pgac125; and Kenett, R. S., Manzi, G., Rapaport, C., & Salini, S. (2022). Integrated analysis of behavioural and health COVID-19 data combining Bayesian networks and structural equation models. International Journal of Environmental Research and Public Health, 19(8), 4859.

  7. Christopher Tong

    I was made aware of this series of discussions via an unsolicited email from an associate of Prof. Mayo that I received earlier this week. I would like to clarify Prof. Lakens’ comment above, regarding my paper (Tong, 2019) in the TAS special issue. He correctly notes that my paper belongs in the category of not offering alternatives to p-values. He should have added that I argued that most of the time there should be no such alternatives, because most of science is exploratory, not confirmatory, where model building and model criticism are actively taking place. At this stage statistical inference is premature, due to model uncertainty (which isn’t quantified by inferential statistics, whether they be p-values, confidence intervals, Bayes factors, likelihood ratios, or whatever). In exploratory research, the data must be allowed to influence the form of the model, and the boundary between fitting and overfitting cannot be defined too sharply. The examination of many sets of data, as part of the model/experiment iteration, can play a key role in recognizing and managing overfitting, though other precautions should also be taken (details in my paper). Only at the later stages of research, when model specification before data collection can finally be contemplated, could statistical inference be done under the conditions where various putative properties (e.g., unbiasedness, coverage probabilities, Type 1 and Type 2 error rates, etc.) are reasonably approximate. Meanwhile Statistics as a discipline should prioritize teaching good study design and execution, disciplined data exploration, and statistical thinking, which apply throughout all stages of research, unlike statistical inference, which has a very narrow scope.

    Pedantically one might still ask: for a confirmatory study, what are alternatives to p-values? My paper did not address this, as I correctly assumed that many other authors in the special issue would focus solely on this point. I don’t pretend to be smarter than they are. I only recommended that “Methods with alleged generality, such as the p-value or Bayes factor, should be avoided in favor of discipline- and problem-specific solutions that can be designed to be fit for purpose.” To provide a concrete example, here I offer the case study of vaccine efficacy trials. Consider the FDA’s original 2020 Guidance for Industry on “Development and Licensure of Vaccines to Prevent COVID-19”. The efficacy portion of the guidance specified the agency’s expectations for vaccine efficacy: “a point estimate for a placebo-controlled efficacy trial of at least 50%, with a lower bound of the appropriately alpha-adjusted confidence interval around the primary efficacy endpoint estimate of > 30%.” No p-values were mentioned anywhere in this document. I know of five major pivotal efficacy studies for COVID-19 vaccines published in the New England Journal of Medicine in 2020-2021, subsequent to the issuance of the FDA guidance: Pfizer/BioNTech, Moderna/NIAID, Janssen (Johnson & Johnson), AstraZeneca/Oxford, and Novavax. All five included estimates of vaccine efficacy (based on 1 – the incidence rate ratio, estimated using different statistical models, such as Poisson regression or the Cox proportional hazards model) with either 95% confidence or 95% credible intervals. Only two of them (Moderna and AstraZeneca) also included p-values in the main NEJM papers (I did not search the supplementary files). To emphasize my earlier point, pivotal studies occur at the end of vaccine development, after an extensive (and in this case accelerated) series of preclinical and clinical studies, and when comprehensive prespecification of the trial protocol (including statistical analysis plan) is expected.
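
    As a back-of-the-envelope illustration of the efficacy summary described above (the counts and person-time are invented, and the crude Wald interval stands in for the Poisson regression or Cox models the trials actually used):

```python
# Sketch: vaccine efficacy as 1 minus the incidence rate ratio, with a rough
# 95% interval on the log rate ratio. All numbers are hypothetical.
import numpy as np
from scipy import stats

cases_vax, persontime_vax = 10, 20_000.0    # hypothetical vaccinated arm
cases_plc, persontime_plc = 100, 20_000.0   # hypothetical placebo arm

irr = (cases_vax / persontime_vax) / (cases_plc / persontime_plc)
ve = 1 - irr                                # point estimate of vaccine efficacy

se_log_irr = np.sqrt(1 / cases_vax + 1 / cases_plc)   # Wald SE for log(IRR)
z = stats.norm.ppf(0.975)
irr_lo = np.exp(np.log(irr) - z * se_log_irr)
irr_hi = np.exp(np.log(irr) + z * se_log_irr)
ve_lo, ve_hi = 1 - irr_hi, 1 - irr_lo

# An FDA-style criterion like the one quoted above asks for a point estimate
# of at least 50% and a lower confidence bound above 30% (alpha adjustment
# ignored in this toy example); no p-value is required anywhere.
print(f"VE = {ve:.1%} (95% CI {ve_lo:.1%} to {ve_hi:.1%})")
```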

    This is not an isolated example in vaccine clinical trials. See for example the FDA guidance on seasonal flu vaccines, or the book of Halloran et al., Design and Analysis of Vaccine Studies (Springer, 2010). In Sec. 6.2, Halloran et al. discuss seven real-world vaccine trials. In each case, interval estimates of vaccine efficacy are provided; p-values are also reported in only two of these seven. I conjecture that many investigators in the vaccine clinical trial arena have been comfortable with interval estimation, with p-values as optional at best, for a very long time. The ASA statements and special issue had absolutely no influence on such long-standing attitudes, because (in this particular arena of medical research) we were already living in “a world beyond p < 0.05″. An exercise for the reader (I have not done this myself) would be to examine the clinical efficacy sections (Sec. 14) of the package inserts for all licensed vaccines in the U.S., and tabulate how many provide one or more of the following: interval estimates of vaccine efficacy, p-values, or other inferential statistics. Here is where you can find them: https://www.fda.gov/vaccines-blood-biologics/vaccines/vaccines-licensed-use-united-states

    The views expressed are my own. (I am aware of several errors in my 2019 paper, but I stand by the overall spirit and major themes of the paper.)

    References

    The original 2020 FDA guidance for COVID-19 vaccines may no longer be extant; however the updated version does quote the original, the relevant part of which I quoted above. It may be found here: https://www.fda.gov/media/142749/download

    FDA guidance on seasonal flu vaccines: https://www.fda.gov/files/vaccines%2C%20blood%20%26%20biologics/published/Guidance-for-Industry–Clinical-Data-Needed-to-Support-the-Licensure-of-Seasonal-Inactivated-Influenza-Vaccines.pdf

    The NEJM papers are as follows:

    Pfizer/BioNTech: DOI: 10.1056/NEJMoa2034577
    Moderna/NIAID: DOI: 10.1056/NEJMoa2035389
    Janssen: DOI: 10.1056/NEJMoa2101544
    AstraZeneca/Oxford: DOI: 10.1056/NEJMoa2105290
    Novavax: DOI: 10.1056/NEJMoa2107659

