Guest Post: Ron Kenett: What’s happening in statistical practice since the “abandon statistical significance” call


Ron S. Kenett
Chairman of the KPA Group;
Senior Research Fellow, the Samuel Neaman Institute, Technion, Haifa;
Chairman, Data Science Society, Israel

 

What’s happening in statistical practice since the “abandon statistical significance” call

This is a retrospective view based on experience gained from applying statistics to a wide range of problems, with an emphasis on the past few years. The post is kept at a general level in order to provide a bird’s eye view of the points being made.

An important influence on the current practice of statistics is the merging of empirical predictive analytics with probability-based modelling. In many applications, though not in all, one has access to massive data coming from sensor technologies and from unstructured formats such as text and images. In these cases, methods to fit and assess a model are based on splitting the data into a training set and a validation set: the training data is used to fit a model, and the validation set to evaluate it. If the fit of a model to the training data is very high but the fit to the validation set is low, we experience what is labeled “overfitting”. Overfitting reduces the ability to generalize the model to future implementations and implies poor predictive performance. This approach is different from classical statistical analysis, which relies on probability models and hypothesis testing.
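To make the workflow concrete, here is a minimal sketch, assuming Python with scikit-learn and purely synthetic data: an overly flexible model fits the training set almost perfectly yet does much worse on the held-out validation set, which is the signature of overfitting.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import r2_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=500)  # noisy linear signal

    # Split the data: fit on the training set, evaluate on the held-out validation set.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)  # unpruned tree
    print("training R^2:  ", r2_score(y_train, model.predict(X_train)))  # essentially perfect
    print("validation R^2:", r2_score(y_val, model.predict(X_val)))      # much lower: overfitting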

R. A. Fisher, in his fundamental paper “On the Mathematical Foundations of Theoretical Statistics”, stated that “the object of statistical method is the reduction of data.” He then identified “three problems which arise in the reduction of data”: Problem 1, Specification—choosing the right mathematical model for a population; Problem 2, Estimation—methods to calculate estimates from a sample; and Problem 3, Distribution—the properties of estimators derived from samples. The predictive analytic methods outlined above change all this. The Specification, Estimation, Distribution trio is replaced with computer-intensive methods including cross validation, bootstrapping and simulations. The models used in this context are supervised or unsupervised, with transfer learning, active learning for labelling, and zero-shot and few-shot learning.
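As a minimal illustration of this shift on the Distribution problem, assuming Python with NumPy and a synthetic sample, the sampling distribution of an estimator can be approximated by bootstrap resampling rather than derived analytically:

    import numpy as np

    rng = np.random.default_rng(1)
    sample = rng.exponential(scale=2.0, size=100)  # the observed data (synthetic here)

    # Resample with replacement and recompute the estimator each time.
    boot_medians = np.array([
        np.median(rng.choice(sample, size=sample.size, replace=True))
        for _ in range(5000)
    ])

    # Bootstrap standard error and a percentile interval for the median,
    # obtained without deriving the estimator's distribution in closed form.
    print("bootstrap SE of the median:", boot_medians.std(ddof=1))
    print("95% percentile interval:", np.percentile(boot_medians, [2.5, 97.5]))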

In this context of new problems and new methods we still face the fundamental issue of collecting data with relevance to the problem at hand, what Colin Mallows called the “zeroth problem.” For example, the splitting of the data into training and validation sets must be consistent with the data generation process. Some principles for achieving this are formulated in an approach titled “befitting cross validation” (BCV); see Kenett et al. (2022), https://xwdeng80.github.io/BCV2022.pdf.
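The following sketch, a generic grouped-splitting example and not the BCV procedure itself, illustrates the point with invented data: observations arrive in batches that share an unmeasured batch effect, so a random split that mixes batches across training and validation looks optimistic, while holding out whole batches mimics prediction for a new batch.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GroupKFold, KFold, cross_val_score

    rng = np.random.default_rng(2)
    n_batches, per_batch = 20, 15
    groups = np.repeat(np.arange(n_batches), per_batch)

    # Each batch has its own location in feature space and its own offset in y
    # (e.g., a production-batch or lab effect not captured by the features).
    centers = rng.normal(scale=3.0, size=(n_batches, 3))
    offsets = rng.normal(scale=2.0, size=n_batches)
    X = centers[groups] + rng.normal(size=(groups.size, 3))
    y = X[:, 0] + offsets[groups] + rng.normal(scale=0.5, size=groups.size)

    model = RandomForestRegressor(n_estimators=200, random_state=0)

    # A shuffled K-fold split lets the model see every batch during training,
    # so it can memorize the batch offsets and the score is optimistic.
    print("shuffled KFold R^2:",
          cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=0)).mean())

    # Holding out whole batches is consistent with predicting for a new, unseen batch.
    print("grouped KFold R^2:",
          cross_val_score(model, X, y, groups=groups, cv=GroupKFold(5)).mean())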

The performance of a model on the validation set determines its goodness of fit. In classification problems one computes misclassification errors, lift, ROC curves and confusion matrices.
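A minimal sketch of these validation-set summaries, again assuming scikit-learn and synthetic data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    pred = clf.predict(X_val)              # predicted class labels
    prob = clf.predict_proba(X_val)[:, 1]  # scores for the positive class

    print("confusion matrix:\n", confusion_matrix(y_val, pred))
    print("misclassification error:", 1 - accuracy_score(y_val, pred))
    print("area under the ROC curve:", roc_auc_score(y_val, prob))

    # Top-decile lift: prevalence of the positive class among the 10% of validation
    # cases ranked highest by the model, relative to the overall base rate.
    top = np.argsort(-prob)[: len(y_val) // 10]
    print("top-decile lift:", y_val[top].mean() / y_val.mean())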

The combination of computer-intensive methods and classical statistical methods (Bayesian or frequentist) is a challenge for current work in data analysis. In many companies, Statistics groups have been replaced by Data Science groups. Some universities do not even involve statisticians in their data science programs.

With this perspective, let me share my views on what’s happening in statistical practice since the “abandon significance” call five years ago.

The few years considered here start at the Bethesda ASA Symposium on Statistical Inference (SSI) in October 2017. The event followed the 2016 ASA “p-value statement” and fed a large special issue of The American Statistician. I summarized what happened there in a blog post titled “to p or not to p”. See: https://blogisbis.wordpress.com/2017/10/24/to-p-or-not-to-p-my-thoughts-on-the-asa-symposium-on-statistical-inference/

Following SSI, Mayo’s error statistics philosophy blog provided a platform to discuss controversies around hypothesis testing and p-values as presented in Bethesda and beyond. It also provided some clarity about what was official ASA policy and what was not. The blog gathered perspectives under an umbrella labelled “The statistics wars”. At some point I mentioned, in a comment, that these debates seemed to be localised to specific groups and that most users of statistics ignored them. I also noted that the statistics wars appeared to have no impact on statistical analysis software platforms or on statistics curriculum in academia and elsewhere.

Today, I believe that these comments still hold and that the cargo cult application of statistics, described by Stark and Saltelli, is prevalent. Moreover, the statistics war discussions in the literature, which reached a peak five years ago, seem to be waning. As Shakespeare wrote: “much ado about nothing.”

In contrast, there are many advances in data analysis. Some examples include assessing selection bias (Benjamini, 2019), evaluating fairness in analytic models (Plecko and Bareinboim, 2023) and considering information quality (https://sites.google.com/site/datainfoq).

Moreover, it seems that the p-value discussions, started by the ASA over five years ago, did not strengthen the general position of statistics or statisticians. These discussions coincided with a particularly delicate period in which other disciplines got deeply involved in data analysis and modelling. The result is that statistics is at a crossroads. Some thoughts on how to address the current challenges at this crossroads were presented in this seminar. See also here. A summary is listed below:

  1. How should we practice Statistics? Embrace a life cycle perspective, from problem elicitation to generalization, operationalization and communication of findings.
  2. How should we teach Statistics courses? Engage the students and dedicate time to the conceptual understanding of Statistical methods and thinking.
  3. What are research areas for statistics and analytics to focus on? Areas at the interface of Statistics with Machine Learning/Artificial Intelligence/Computer Science/Data Science.
  4. How do we initiate synergistic collaborations with other disciplines? By direct communication.
  5. What is the role of professional organizations in this transformation? Professional organizations have a unique responsibility to foster discussion and provide an opportunity for contrarian views to be expressed.
  6. How should lifelong learning be implemented to update the skills of working statisticians? Adult education poses a different challenge from the one faced in regular academia. In that context, simulation-based education and online training materials are excellent options.

In summary, the positioning of analytics is at a peak. Statisticians should leverage this opportunity by clarifying the unique selling points of statistics. This was the context of the conference “On the foundations of applied statistics” held at the Samuel Neaman Institute, Technion, Israel, in April 2024. Presenters addressed various aspects of applied statistics, including philosophical methods, historical examples and designing experiments for generalizability of findings. For slides and a recording see https://neaman.org.il/en/On-the-foundations-of-applied-statistics. At the conference, Daniel Lakens proposed holding a Vatican-like event where a multi-perspective group sits down to map the foundations of applied statistics until white smoke is observed. Why not…


26 thoughts on “Guest Post: Ron Kenett: What’s happening in statistical practice since the “abandon statistical significance” call”

  1. I thank Ron Kenett for his guest post which blends new reflections on recent and current work in statistical science with work in data analysis, predictive analytics, AI and ML. I’m grateful to him for his attempts to address some of the queries I raised on earlier versions. I’m still perplexed about some of his remarks, however. At the start, I get the impression he thinks statistics as we know it (or as Fisher knew it) has been replaced by AI/ML. “The specification, estimation, distribution trio is replaced, with computer intensive methods including cross validation, validation and simulations.” But don’t these still rely on error statistical reasoning that depends on model assumptions? And is the goal just goodness of fit (of past data) to models? (Obviously, predicting new cases is desired, but this requires more than goodness of fit.) What happens to scientific theory and understanding? Clearly, they are still crucial to science, and they used to also be of importance for statistical science. Kenett ends with claiming: “Statisticians can leverage this opportunity by clarifying the unique selling points of statistics”, but one might wonder  why there would be a need for statistics if what Kenett seems to suggest is true (and its main function replaced). Fisher excoriated Neyman for (supposedly) seeking to replace statistical methods in science with tools appropriate for industrial quality control and acceptance sampling in commerce and engineering (Fisher 1955*).  *Link can be found on this blog.

    From Kenett’s review, it appears that Neyman’s behavioral performance philosophy has won out. Statistics “is at a crossroads” Kenett says, but this, I recall was the mantra of the hand-wringing of a decade or more ago, when I was just starting this blog and people like Wasserman were worried about being replaced by data science. From an outsider’s point of view, it appears the data scientists won that battle, and Kenett appears to affirm this. But all of that strikes me as largely separate from the “abandon significance” skirmish. Perhaps that’s why Kenett says so few have taken note of it—they’re busy doing data analytics! “[T]he statistics wars appeared to have no impact on statistical analysis software platforms or on statistics curriculum in academia and elsewhere.” Do readers agree with this?

    He claims: “the statistics war discussions in the literature, that reached a peak five years ago, seem waning…In contrast, there are many advances in data analysis,” and as an example he gives assessment of selection bias as developed in Yoav Benjamini’s work. Benjamini (2020) “Selective inference: the silent killer of replicability”:

    https://hdsr.mitpress.mit.edu/pub/l39rpgyc/release/3

    The reason Benjamini was (and is) so unhappy with Wasserstein’s pointing the finger at p-values (“it’s not the p-value’s fault”) is that it promoted the idea that methods that are insensitive to error probabilities somehow escape the problem of selection effects. The crucial controversy of the statistics wars, as I see it, concerns whether the statistical assessment of evidence needs to take into account error statistical properties of methods. Those who say “yes”, I dub “error statisticians”. Many of the “alternatives” to statistical significance tests listed even in Wasserstein et al., 2016 say no. We often hear, for example, that accounts that hold to the likelihood principle are free from concerns about gambits that invalidate p-values (e.g., stopping rules).

    Worse, we hear that freedom from adjusting for multiplicities allows strict Bayesians to occupy the philosophical high ground, and it’s not just social scientists, it’s also clinical trialists in medicine:

    The requirement of type I error control for Bayesian adaptive designs causes them to lose many of their philosophical advantages, such as compliance with the likelihood principle, and creates a design that is inherently frequentist.

    The paper is referenced in the following link:

    https://errorstatistics.com/2021/08/21/should-bayesian-clinical-trialists-wear-error-statistical-hats/

    In my view, those who say, we don’t care about type I error control because we know our models are strictly false, confuse the central issue in distinguishing real effects from noise. This is what encourages the “cargo cult” science and cargo cult statistics that Kenett mentions—not p-values. Legitimate p-values require taking account of selection effects.  

    As for the “why not” hold a Vatican-like event “until white smoke is observed”, I say what I wrote in a comment on Lakens:

    “While I strongly endorse Lakens’ idea of putting together teams of statisticians, scientists and philosophers of science (if they are engaged in statistical practice) to weigh in on methodology, the idea of convening a group to reach consensus fills me with terror. Responding to criticisms, recognizing and including diverse perspectives are essential for scientific progress, but that’s not how the current controversies about statistical significance have played out as of late. One need only consider how even the 2016 ASA policy was created, the 2017 conference organized, and this special [2019] issue introduced”.

    I hope that others from a variety of backgrounds will weigh in on some of the perplexities I’ve raised.

    • rkenett

      Mayo

      Thank you for addressing the points in my blog post. Some responses to your comments follow; in each case I first quote your comment:

      1. “I get the impression he thinks statistics as we know it (or as Fisher knew it) has been replaced by AI/ML” – Yes, this is my observation, especially in unregulated industries.

      2. “What happens to scientific theory and understanding?” – Indeed, there are different ways to deal with this. The extraordinary performance of AI/ML is a game changer.

      3. “why there would be a need for statistics” – This is the rallying point where efforts should converge. I am posing this question to encourage a discussion around it.

      4. “But all of that strikes me as largely separate from the ‘abandon significance’ skirmish.” – Mayo is right on this point. The position of data science and the abandon significance discussion can be separated. On the other hand they are collocated in time and point again at the current unique selling point of statistics.

      5. “The statistics wars appeared to have no impact on statistical analysis software platforms or on statistics curriculum in academia and elsewhere. Do readers agree with this?” – Yes, this is my observation. It gets some reinforcement in the blog posts of Hennig and Lakens, who show the lack of impact on current active research.

      6. “the idea of convening a group to reach consensus fills me with terror” – I agree that there should be some prerequisites for this. An essential one is that participants should bring application domain expertise. One option is to run the event around application domains which have different ways of conducting statistical analysis. I believe I heard Steve Goodman from Stanford discuss this. A one-size-fits-all approach would not work.

      The initiative by Mayo to organize blog posts on the abandon significance perspective is most welcome. It helps map some data reflecting different perspectives. Should this not be a first step in statistical evaluation?

      Please keep in mind that my blog post “is kept at a general level in order to provide a bird’s eye view of the points being made.”

          • Ron:

            I’m glad that you think my blog is helpful, but I don’t think too many people engage, and admittedly, I haven’t kept it up as much as would be needed.

            I’m still not getting the “current unique selling point of statistics” in relation to issues such as the roles and importance of error statistical methods (e.g., the controversy about “abandoning significance”): You write:

            “On the other hand they are collocated in time and point again at the current unique selling point of statistics.”

            Perhaps the suggestion is that the field of statistics might offer a broad philosophical backdrop to reflecting on the implications of AI/ML replacing traditional statistics, given its tradition of sensitivity to epistemological foundations of empirical inquiry/ inductive inference and the like? Unfortunately, the recent episode (in statistics) of abandoning statistical significance is an example of a degenerating rather than a progressive case of how such controversies could achieve constructive ends.

            Moreover, if AI/ML is free from such controversy (and I’m not saying it is), then why call in the statisticians?

            • rkenett

              Mayo

              At the bottom of this is my belief in the value of integrating statistical methods and statistical thinking. The first part is mathematical, the second conceptual. This special sauce was behind the George Box Monday night beer and statistics seminar at his Shorewood house in Madison. This setup, combining a domain expert presenting a problem and a statistician describing how he handled it, was quite unique. Statisticians should be experts in asking questions and finding ways to handle them with data. This needs to be reinforced and developed in research and education…

              • Kenett:
                It’s clearly of value to combine a domain expert presenting a problem and a statistician describing how he handled it—or possibly even more than one way. I am very surprised, however, that you, a student of Cox no less, could speak of statistical methods as purely mathematical and somehow distinct from something called “statistical thinking”. That’s the mindset that entirely gets the order wrong, and imagines that the founders of statistical methods weren’t creating ways to formalize and systematize methods for solving statistical problems in science. These are empirical problems of inquiry, when the goal is to find things out in the face of variability and error. The methodology for addressing them is also empirical (and theoretical). Granted Bayesians have famously touted their method(s) as providing a logic for induction akin to deductive logic where formal syntactical methods suffice. But frequentist statisticians always rejected that logical image—the very reason they are charged with failing to have the kind of unified logical foundations that many philosophers have long craved. That’s because, in the frequentist (error statistical) view, assessing a method for tackling statistical problems in science depends on the method’s relevant error probabilities. It’s an error statistical methodology, and it is neither purely mathematical nor purely conceptual. As such, it’s intimately connected to experimental design, and must be sensitive to the biases and errors that enter in collecting, modeling, and selecting data. This requires quite a lot of background and domain knowledge which rarely comes in the form of prior probability distributions.
                Of course, in practice, honest Bayesians must relinquish the picture of a unified logic as well—even if they still purport to hold the philosophical high ground. Kenett mentions a “George Box Monday night beer and statistics seminar”. Well, Box considered that the formal deductive portion of Bayes could enter only after the creative, inductive work of arriving at and testing a model, which he claimed requires statistical significance tests:

                “some check is needed on [the brain’s] pattern seeking ability, for common experience shows that some pattern or other can be seen in almost any set of data or facts. This is the object of diagnostic checks and tests of fit which, I will argue, require frequentist theory significance tests for their formal justification.” (Box 1983, 57)

                Fisher’s preface to the 13th edition of Statistical Methods, Experimental Design, and Scientific Inference (a reissue of 3 books that today’s practitioners should read or reread: Statistical Methods for Research Workers, The Design of Experiments, and Statistical Methods and Scientific Inference) begins:

                “For several years prior to the preparation of this book, the author had been working in somewhat intimate co-operation with a number of biological research departments at Rothamsted; the book was very decidedly the product of this circumstance. Daily contact with statistical problems as they presented themselves to laboratory workers stimulated the purely mathematical researches upon which the new methods were based.”

                Reading or rereading Fisher’s work might be a real eye-opener, however different modern problems of statistical inquiries have become. I don’t see how anyone can come away from reading Fisher, or Neyman or Pearson or Cox or many others who developed statistical methods, and still regard statistical methods as somehow separate from the “thinking” that goes into planned problem solving in learning from data.

                The bottom line is: I get the sense of what Kenett is advocating, and I admit there is a role for new perspectives on the rationales of these methods. But I say we should reject the supposed dichotomy between “statistical method and statistical thinking” which unfortunately gives rise to such titles as “Statistical inference enables bad science, statistical thinking enables good science,” in the special TAS 2019 issue. This is nonsense.

                • rkenett

                  Mayo,

                  Thank you again for sharing your thoughts and comments. In retrospect I should have put “integrating” in bold in “integrating statistical methods and statistical thinking”.

                  To facilitate this integration I made two suggestions. One is a general framework of information quality https://sites.google.com/site/datainfoq, the other is mapping a life cycle view of statistics https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2315556

                  In both cases you will find both statistical methods and statistical thinking.

                  Regarding David Cox: his impact was immeasurable. When he was awarded the ENBIS Box Medal in 2005, his acceptance speech was mostly dedicated to his experience in the wool industry (https://enbis.org/wp-content/uploads/2022/01/cox_2005.pdf). Cox is indeed a role model for integrating statistical methods and statistical thinking.

                  My favorite quote of Sir David is: “Much fine work in statistics involves minimal mathematics; some bad work in statistics gets by because of its apparent mathematical content.” David Cox (1981), Theory and general principle in statistics, JRSS(A), 144, pp. 289-297.

                  Again, my point is the need to integrate statistical methods and statistical thinking. If you want, a left brain, right brain combination challenge….

      1. John Byrd

        These are indeed some great discussions. I will make a couple of points from the perspective of someone who works with (mostly) biological data (morphology). I do see a great uptick in the use of AI/ML approaches, and the sales pitch for accepting them in a particular case revolved around the application of the models derived from the training set to the holdout validation set. We have been doing this since the 1980s with classification problems, so it is not totally new. (The power of computing tools and the popularity of them is new.) The underlying philosophy in deciding to make an inference from such a model application is fundamentally the same as Fisher held when he said you can make an inference when you know how to perform a test that rarely fails to give a significant result (after performing it multiple times). The problems hidden in the sales pitch often center on a couple of areas. First, who says the validation set is representative of the whole universe of cases you will apply it to in the future? In biological sciences, it will often not be a true random sample from that “population.” Second, many of these new models are what we loosely call “black box” models, which means we don’t really know what just happened each time we run it. Scientists are supposed to be able to explain how they arrived at conclusions. I suspect this bothers older scientists more than the younger ones… So, I say that the reasoning underlying these approaches was given to us by Fisher, Neyman, Pearson, Deming, Cohen, Cox and others from our past. If “data science” becomes ignorant of what statistics can teach us, they will end up re-inventing these same concepts that guide error control, sampling issues, etc. Then we all get to watch a younger generation think they invented such concepts.

        • John:
          I completely agree with your comment which is precisely in sync with my own. I reiterate what you wrote:

          “If “data science” becomes ignorant of what statistics can teach us, they will end up re-inventing these same concepts that guide error control, sampling issues, etc. Then we all get to watch a younger generation think they invented such concepts.”

        • rkenett

          John

          Thank you for your comments. Two thoughts:

          1. Cross validation: As you write, this is an area requiring some statistical thinking in order to ensure that the validation results can be generalized to setups equivalent to the ones generating the data used in the study. I actually mentioned this in the post: “Some principles for achieving this are formulated in an approach titled ‘befitting cross validation’ (BCV), see Kenett et al. (2022), https://xwdeng80.github.io/BCV2022.pdf.”
          2. (Re)learning history: I recommend the talk by Stephen Senn at the conference mentioned in my post, https://neaman.org.il/en/On-the-foundations-of-applied-statistics.
          • Ron:

            No proper application of statistical method is free of “thinking”. My point is that viewing statistical method as pure mathematics distinct from something called “statistical thinking” is wrongheaded and harmful.

      2. rkenett

        Mayo

        Statistical methods are not mindless. My point is that one needs a proper integration of methods with statistical thinking. I did not expect pushback on this point.

        Take for example the front end of statistical analysis, problem elicitation. This requires asking questions. Statisticians need to excel at this. Cognitive scientists are looking at developing such skills, see for example https://psycnet.apa.org/record/2024-33540-001. Should statisticians be aware of these research findings and perhaps contribute to their development? I think so.

        Another example deals with communication via graphs. Statisticians have contributed to such developments. See for example: Kleiner, B., & Hartigan, J. A. (1981). Representing points in many dimensions by trees and castles. Journal of the American Statistical Association, 76(374), 260-269. This is an important area for statisticians to be active in.

        Yet another example, already mentioned here, is generalizability of findings. For example, the application of random effects can be considered for achieving it. In animal experiments conducted across different labs, how the experiments are conducted affects your ability to generalize the findings; see Richter, S. H., Garner, J. P., & Würbel, H. (2009). Environmental standardization: cure or cause of poor reproducibility in animal experiments? Nature Methods, 6(4), 257-261.

        The above sketches the bigger picture embodied by a life cycle view of statistics, where methods and statistical thinking integrate. It is indeed a wide paradigm that might help establish a unique selling point for statistics in this era of AI/ML/DL…


        • Ron:

          Yes, all that goes without saying for any applied science. It’s the conception that statistical method is mere mathematics, is separate from and doesn’t already embody statistical reasoning, that is misleading–even though I grant that critical reflection is needed to identify the natural rationales behind the methods. For the case of error statistical testing, which obviously is only one method, there are at least two rationales: performance and (what I call) probativeness. Applying the methodology to substantive problems obviously involves theoretical and domain-specific information. That’s a different issue. John Byrd’s comment hits the nail on the head.

      3. An interesting discussion of today’s ML revolution from the Harvard Data Science Review is on Gelman’s blog:

        https://statmodeling.stat.columbia.edu/2024/07/12/19-ways-of-looking-at-data-science-at-the-singularity/

        An article by Wendall points out:

        “many of us—probably most of us—feel that neural networks are not the same as scientific explanations. They are wonderful tools that can improve the performance of instruments and make useful predictions. But ‘netsplaining’ is not the same as a scientific explanation. I fear that the search for scientific explanations will lose material support.”

      4. Very interesting posting and discussion! When thinking these days about the general state of statistics and where it is or should be going, I’m struck by a sense of tension that I have tried to express to some extent here (later published as a comment on Mayo’s editorial in Conservation Biology):

        https://errorstatistics.com/2022/01/09/the-asa-controversy-on-p-values-as-an-illustration-of-the-difficulty-of-statistics

        I tend to agree with most of what is said, and when people apparently disagree, I tend to agree with both sides, in the sense that they point to (potentially contradictory looking) valuable aspects of how statistics, data analysis, data science develop these days.

        Regarding the dichotomy between “statistical thinking” and “statistical methodology”, I am well aware of the limitations of methodology considered in isolation. Much of this is well covered by the controversy about statistical testing, where one major aspect is the trouble with “binary thinking”, with the idea of models or certain parameter values being “true” or “false”, and the idea that a test could tell us on which side we are. This extends to the term “error” in error statistics – the idea that in arriving at a particular conclusion we either commit an “error” or not is obviously driven by that same binary thinking.

        We might think that “statistical thinking” may lead us to more differentiated (and therefore supposedly more appropriate) views of a situation, and it may be framed in this way as superior to methodology.

        Then, on the other side, statistical methodology is to quite some extent a formalisation of principles of statistical thinking, and if we want to analyse formally the implications of our thinking (and, more broadly, how to do it best), we generate “statistical methodology” by modelling situations (probability models) and decision making (statistical methods, model-based or not). “Errors” and “error probabilities” are then relevant again, in the sense that statistical thinking can be criticised by saying: “if you apply ‘statistical thinking principle’ (i.e., method) A in artificial situation B, in which we know (as we can ‘control’ the truth when assuming models) that we should arrive at conclusion C, in fact you will quite likely arrive at conclusion D, which is opposite to C.” If so, we have learnt something about how statistical thinking can be led astray.
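        As a minimal illustration of this kind of check, consider a simulation sketch (assuming Python with NumPy and SciPy; the decision rule “report the best-looking of 20 effects whenever its unadjusted p-value is below 0.05” is just an assumed example) in which the truth is controlled so that every positive claim is an error:

            import numpy as np
            from scipy import stats

            rng = np.random.default_rng(0)
            n_sims, n_tests, n = 10_000, 20, 30
            false_claims = 0

            for _ in range(n_sims):
                # 20 independent "effects", all truly zero: any significant finding is an error.
                samples = rng.normal(loc=0.0, scale=1.0, size=(n_tests, n))
                pvals = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue
                if pvals.min() < 0.05:  # report the best-looking effect, unadjusted
                    false_claims += 1

            # The nominal per-test error rate is 5%; the select-the-best rule errs far more often.
            print("probability of at least one false claim:", false_claims / n_sims)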

        I don’t really think this kind of reasoning can be easily replaced, and I claim that quality statistical thinking needs to be informed by such knowledge.

        The controversy on statistical tests may seem of very limited relevance faced with the complexity of modern data and data science methods (originating from statistics, machine learning, or anywhere else). But it shows that even very simple decision rules have complex implications, and are hard to fully understand and easy to misunderstand and misuse.

        I see how much of the focus in data science goes elsewhere, and the reasons for this. Still I’d be very worried if we’d run away from the task to understand data and the logic of statistical thinking as well as we can, helped by probability models, decision theory, error statistical thinking, which naturally starts from the simplest methods and their implications. Of course we want to build up understanding of more complex situations and methods, but at least some focus needs to remain at the basis, because as long as tests are still applied and misunderstood, we cannot hope for general good understanding of anything beyond them.

        • Christian:
          I’m glad that you point out that formal statistical methods can correct informal thinking, and that formal methods are intended to embody informal thinking in order to do a better job in solving statistical problems than without the formal tools. That, of course, is standard throughout science. Formal method couldn’t correct informal thinking if the thinking were somehow superior. I thought my desk was around 2 feet, but lo and behold, the measuring tape shows it’s nearly 4 feet. I recommend we move away from recent talk dividing formal statistical methods and informal thinking, but keep to the usual understanding that formal methods can embody, sharpen and correct informal thinking, and the abilities of formal methods can direct how empirical problems are posed. That is why we have to think about how to ask a question of inquiry in terms of the kinds of questions an available method can address.

      5. rkenett

        Christian

        Your comment is a great example of a journey through this field of integration I was referring to. As you did in your guest post here, starting with data (your sample of four papers) is part of the statistical perspective. This is also demonstrated in the Tawakol et al. (2017) case study of 13 individuals with post-traumatic stress disorder that you used in your excellent presentation at the conference on the foundations of applied statistics https://neaman.org.il/en/Files/3%20Christian%20Hennig_20240411090135.376.pdf.

        Just the title of your talk, “Understanding statistical inference based on models that aren’t true”, indicates that you are dealing with the integration of statistical methods and statistical thinking.

        This brings up another dichotomy, presented by David Zucker (https://neaman.org.il/en/Files/David%20Zucker-discussion.pdf) in his discussion of Bernard Francq’s talk (https://neaman.org.il/en/Files/4.%20Bernard%20Francq.pdf). David suggested distinguishing between the strength of evidence we can derive from data and the interpretation of findings from statistical analysis. Such a dichotomy actually helps the integration of statistical methods and statistical thinking.

        I believe we need more discussions on such topics and am grateful to Mayo for facilitating such exchanges. Tx for your comment here.

      6. Christopher Tong

        In responding to Prof. Kenett above, Prof. Mayo states: “we should reject the supposed dichotomy between ‘statistical method and statistical thinking’ which unfortunately gives rise to such titles as ‘Statistical inference enables bad science, statistical thinking enables good science,’ in the special TAS 2019 issue. This is nonsense.”

        I am the author of the paper whose title she attacks as “nonsense”. If she had read my paper she would know that, like Kenett, I am advocating placing statistical thinking at the center of statistical teaching and practice. The dichotomy that she thinks is false exists in much of actual teaching and practice, and is one that it seems both Kenett and I are trying to undo. The title of my paper reflects the real (not the ideal) situation, and if that’s “nonsense”, then (and I would agree) much of statistical teaching and practice is nonsense. Finally I note that mine is one of only two papers in the 2019 special issue that even contains the phrase “statistical thinking” in the title. I strongly recommend the other one, which offers a concrete solution to how the “integrating” that Kenett speaks of can be done in statistics education.

        The views expressed are my own.

        • Response to Christopher Tong:
          Thank you for your comment. My thinking was that it would be good to alert the authors of the papers Lakens discusses, and I’m glad that you have. I have read your paper, and, as much as your highly provocative title earns you rewards in contexts such as the special issue in which it appears, it does enormous disservice to statistical inference as “enabling bad science”. Your paper itself—which reviews many right-headed contributions—shows that the very insights and tools that your “good statistical thinking” requires are themselves at the foundations of frequentist error statistical methodology and depend upon statistical inference methods, formal and informal. The formal tools were developed as deliberate idealizations by the founders as exemplars—to check and improve our ordinary (pre-statistics) statistical thinking. I wrote a book called Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018). You can find all 16 “tours” on this blog (in its final draft form): 16 tours

          Blurbs of 16 Tours: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST)

          The notion of statistical inference developed there is very different from your sterile depiction. You talk as if practicing scientists in fields that employ statistical method gain their first exposure to statistics at the point that they are doing applied research. This should not be true. High school students, if they are to be critical consumers of the policies and decisions that will affect them in their lives—let alone conduct research– should study statistical method, including experimental design. Nor can statistical researchers without a VERY clear understanding of statistical concepts and computations assume they only need to think about the domain field, and assume good science will emerge. They should have a deep grasp of the formal methods that others will use to check their models and results. Statistical significance tests and test of statistical hypotheses more generally, are intimately connected to experimental design, as Fisher emphasized.
          I worry that your paper warns these students off, claiming it will only endanger their ability to do good science. What a relief for the students! This is one of their hardest courses, and now they can point to an important journal that has an article that warns us NOT to study statistical inference.

          Statistical significance tests are just one small part of statistical science, but they are piecemeal methods and cannot all be learned at one time. Fisher wrote a book, Statistical Methods and Scientific Inference; the integration of the two was there from the start.

          Testing statistical assumptions is a crucial part of error statistical methods. You mention Box, but he is talking about Bayesian vs frequentist methods. Box considered that Bayesian inference gives the formal, deductive part of inference which, in his view, could enter only after the creative, inductive work of arriving at and testing a model, which he claimed requires statistical significance tests:
          “some check is needed on [the brain’s] pattern seeking ability, for common experience shows that some pattern or other can be seen in almost any set of data or facts. This is the object of diagnostic checks and tests of fit which, I will argue, require frequentist theory significance tests for their formal justification.” (Box 1983, 57)
          Yet you say “formal, probability-based statistical inference should play no role in most scientific research, which is inherently exploratory, requiring flexible methods of analysis that inherently risk overfitting”. Box disagrees, saying we need checks on such risks, and statistical significance tests provide that. Eye-balling the data won’t suffice (I say this after having worked with Aris Spanos, an expert on testing model assumptions). Whenever we use data to solve statistical problems we are doing statistical inference: this goes beyond the data, and thus it is inductive or ampliative. A paper I wrote with David Cox in 2006 is called “Frequentist Statistics as a Theory of Inductive Inference”:

          Click to access 2006-mayocox-freq-stats-as-a-theory-of-inductive-inference.pdf

          Perhaps some pedagogical treatments of statistical inference methods are overly formal, allowing students to just use computers to get the answer. Maybe that’s what’s behind your saying that there’s a divorce between statistical inference (bad) and statistical thinking (good). I say that computing solutions by hand provides a much deeper understanding of methods, and of where our intuitive thinking about probability and statistical inference is often badly wrong. It seems you’re missing that the key rationale for using deliberately idealized models in statistics is in order to learn from data how they fail and how to improve them. Used correctly, they serve as references for severe testing.

          Of course, as you stress, “exploratory” inquiry and model building require a data dependence that would not be kosher in a predesignated “confirmatory inquiry”. But even in exploratory inquiry, we can use data both to build and severely probe such questions as whether a given method or model ought to be modified, whether it will serve to find out what we want to know, despite approximations, etc. Moreover, in exploratory inference, there are still statistical assumptions that ought to, and can be, checked by methods with different assumptions, and by triangulating results. Fisher, Neyman and many, many others gave us mathematics to show how various designs (e.g., randomizations) and remodeling of data allow “subtracting out” or compensating for misspecifications. Contemporary methods go further, but puzzlingly, you reject all such “technical fixes”.

          John Byrd had it right (in his comment on Kenett—who I do not think shares your view of statistical inference as enabling bad science):
          “So, I say that the reasoning underlying these [data science] approaches was given to us by Fisher, Neyman, Pearson, Deming, Cohen, Cox and others from our past. If “data science” becomes ignorant of what statistics can teach us, they will end up re-inventing these same concepts that guide error control, sampling issues, etc. Then we all get to watch a younger generation think they invented such concepts.”

          A paper I wrote jointly with David Hand in 2022 might be of interest:
          D. Mayo & D. Hand, “Statistical significance and its critics: practicing damaging science, or damaging scientific practice?”

      7. Christopher Tong

        Prof. Mayo, thank you for your response and entertaining my remarks. I had also posted a reply directly to Prof. Lakens’ post at the same time I posted the above, but the other post appears to still be stuck in moderation, perhaps because it was laden with links to sources.

        Here you wrote: “Whenever we use data to solve statistical problems we are doing statistical inference”. We seem to have a disagreement over a definition. By statistical inference I mean the estimation and/or testing of parameters in statistical models, a much narrower concept than you intend. My title will make more sense if this is understood (I tried to explain it in sec. 2’s first paragraph, but I’ve known for a while that this paragraph needs to be completely rewritten).

        In real life, statistical inference (as I define it) is more often misused than used properly, as amply documented in the papers by Chatfield, Gelman & Loken, Nelder, Simmons et al, and Freedman, cited in my paper. In that sense, mostly it does enable bad science. A caveat is that (when used properly in the sense I describe in my paper, sec. 3) it can also enable good science, and I give some vivid examples in my reply to Lakens. The caveat is stated in different words in my paper (sec. 3; sec. 8), but I could have been even more explicit about it. Please note that the title of my paper did not say “Statistical Inference always enables bad science”, nor did I say that statistical thinking always enables good science. It doesn’t.

        I do not agree with Box’s point about testing assumptions, as it pertains to formal hypothesis testing. This is a subtle topic I’m not going to dwell on here, but an interesting discussion about it recently occurred on Reddit (search for “testing assumptions” in the r/statistics subreddit). The commenters do not agree with each other there, of course, but to me sufficient questions were raised to render formal tests of assumptions not something we should knee-jerk accept.

        Randomization and other experimental design concepts were not included in the technical solutions I dismissed (sec. 5); on the contrary my paper says that these should receive more proportionate emphasis in statistical teaching and practice (abstract; Sec. 7.2; Sec. 8).

        I agree that the whole point of statistical modeling during exploratory research is to assist with model selection/criticism/improvement. However such tasks do not become “severe” (i.e., surviving severe scrutiny, as you define it) until confronted with new data. I think the prion example in your book is a grand illustration of this process – 30 years of designing various experiments along various lines of evidence, generating data, scrutinizing the central dogma of infectious agents, and building a new narrative that rests on a huge evidence base.

        Forgive me, I will not be able to look at the other papers you linked (or even to return here) until after JSM.

        The views expressed are my own.

      8. Christopher Tong

        This month’s issue of Physics Today features yet another example of a “severe test” in the discovery of dark matter. An article by Jaco de Swart describes two 1974 papers less celebrated than the famous work of Fritz Zwicky (1933) and Vera Rubin (1970). One is by a group at Tartu Observatory in Estonia (USSR) consisting of Jaan Einasto, Enn Saar, and Ants Kaasik; the other by a group at Princeton consisting of Jeremiah Ostriker, P. J. E. Peebles (later a 2019 Nobel laureate), and Amos Yahil. de Swart writes of the two 1974 papers:

        On either side of the globe, Einasto’s and Ostriker’s groups independently demonstrated the evidence of dark matter. Despite working in vastly different political contexts, both groups involved collaborations between young astrophysicists and cosmologists studying galaxies. The evidence they presented was neither a simple proof nor a single observation, like that of Zwicky or Rubin, but an inference using a combination of different arguments. As Peebles stated when I interviewed him, “What was the best argument? None of them. This is a case of no one argument being compelling, but so many arguments pointing in the same direction.” The two papers were exemplars of the nascent field of physical cosmology and its interdisciplinary teamwork and methodology; combining data and arguments from different scales–from stars and galaxies to clusters–to form a consistent picture of the cosmos.

        de Swart makes clear that the two groups’ ideas were initially resisted harshly by their colleagues, perhaps with justification, but over time gained acceptance, especially after additional data from multiple teams published post-1977. I referred to this extended process of interrogating nature as “triangulation” in my 2019 paper, and it has a similar ring to the prion example in Statistical Inference as Severe Testing. I submit that no single statistical calculation such as a p-value can possibly constitute a “severe test” (unless it is from a confirmatory study done at the end of a series of precursor studies in a learn-and-confirm framework, such as phased clinical trials per ICH E9). Rather it is the accumulation of multiple studies along different lines of evidence, as David Freedman argued (using the John Snow cholera example, the smoking and lung cancer example, and others), that constitutes the severe scrutiny needed to make a credible scientific claim. The use of a statistical computation from a single set of data is a scientifically unsound way to short-circuit this process if the output is taken too seriously or too literally.

        (I skimmed through both 1974 dark matter papers, and couldn’t find any p-values in either of them.)

        The views expressed are my own.
