Has Statistics become corrupted? Philip Stark’s questions (and some questions about them) (ii)


In this post, I consider the questions posed for my (October 9) Neyman Seminar by Philip Stark, Distinguished Professor of Statistics at UC Berkeley. We didn’t directly deal with them during the panel discussion following my talk, and I find some of them a bit surprising. (Other panelists’ questions are here).

Philip Stark asks:

  1. When and how did Statistics lose its way and become (largely) a mechanical way to bless results rather than a serious attempt to avoid fooling ourselves and others?
  2. To what extent have statisticians been complicit in the corruption of Statistics?
  3. Are there any clear turning points where things got noticeably worse?
  4. Is this a problem of statistics instruction ((a) teaching methodology rather than teaching how to answer scientific questions, (b) deemphasizing assumptions, (c) encouraging mechanical calculations and ignoring the interpretation of those calculations), (d) of disciplinary myopia (to publish in the literature of particular disciplines, you are required to use inappropriate methods), (e) of moral hazard (statisticians are often funded on scientific projects and have a strong incentive to do whatever it takes to bless “discoveries”), or something else?
  5. What can academic statisticians do to help get the train back on the tracks? Can you point to good examples?

These are important and highly provocative questions! To a large extent, Stark and other statisticians would be the ones to address them. As an outsider, and as a philosopher of science, I will merely analyze these questions, and in so doing raise some questions about them. That’s Part I of this post. In Part II, I will list some of Stark’s replies to #5 in his (2018) joint paper with Andrea Saltelli, “Cargo-cult statistics and scientific crisis”. (The full paper is relevant for #1-4 as well.)

Part I. Some Questions Provoked by Stark’s Questions

There may be a blurring these days between blaming methods as incapable of performing their job and blaming their misuse and abuse. It seems clear that Stark would not be asking in Question 5, “What can academic statisticians do to help get the train back on the tracks?”, if he thought statistical methods themselves were corrupt. As I read Stark, he means not that the methods themselves merely provide holy water, but that, driven by perverse incentives, celebrity culture, researcher flexibility and the like, many(?) researchers are led to misuse them so that, in effect, they serve merely as holy water to bless results. If so, it’s not that Statistics lost its way, but that many statistical inquiries are unsound or unscientific. This makes his position (as I understand it) importantly different from what I took Ben Recht to be claiming in his recent blogpost, which I discuss and reply to in an earlier blogpost.[i]

For Stark, academic statisticians can help get the train back on the tracks by solving some of the problems of statistics instruction that he lists (the parts of Question 4 I label (a)-(c)), although (d) and (e) point more to weak sciences, flexible methods, perverse incentives and moral failings.

When understanding, care, and honesty become valued less than novelty, visibility, scale, funding, and salary, science is at risk. … bad science outcompetes better science. (Stark and Saltelli, 2018)

Bad science violates my minimal severity requirement. (We don’t have evidence for a claim if little if anything has been done to probe the ways it can be wrong.) Stark’s work, to his credit, has used error statistical methods to weed out weak and insevere statistical inferences and to advance severe error probes. But here are some questions on his questions:

Question 1. Do statisticians worry that declaring “Statistics has become corrupt” will be taken as further grist for the mills of the movement to abandon significance, and even, in some quarters, to downplay instruction in statistical inference methods? While Stark may have all of Statistics in mind, my talk focused on error statistical tests, and they are the methods most often blamed for cookbook, sciency statistics, so I keep to them. Are statisticians concerned about how it might sound to a graduate student to say: “Statistics is (largely) corrupt: would you like to study for a Ph.D. in Statistics?” Might overly harsh self-criticism, especially (ironically) of methods designed for self-criticism, weaken those methods in the meta-statistical competition between rival schools or philosophies of statistics?

I begin my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018, p. 3):

It is easy to lie with statistics. Or so the cliché goes. It is also very difficult to uncover these lies without statistical methods – at least of the right kind. Self-correcting statistical methods are needed, and, with minimal technical fanfare, that’s what I aim to illuminate.[ii]

Self-critical methods can also be, inadvertently, self-destructive, if criticism isn’t handled in a constructive manner in appraising competing methods. Here’s what I mean:

Question 2. Isn’t it possible that the more self-critical and honest methods, in their willingness to concede defects due to perverse incentives, may lose out in the battle with less self-critical and less self-effacing methodologies? Brad Efron (1998) is right to say the frequentist (error statistician) is the pessimist, who worries that “if anything can go wrong it will,” while the Bayesian optimistically assumes if anything can go right it will (Efron 1998, p. 99). Suppose error statisticians hold themselves to a uniquely high standard, deeply skeptical of findings that lack valid error probabilities. They scrutinize models, acknowledging that multiple testing, optional stopping, data-dredging and other biasing selection effects can readily undermine the integrity of these probabilities. Suppose that when statisticians from other schools criticize their tests for allowing illicit p-values, the error statistician confesses: “Yes, we are largely corrupt,” while Bayesians or other statisticians reply: “I agree, you are corrupt, but we’re not!”
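To make the worry concrete, here is a minimal simulation sketch (my illustration, not Stark’s or Efron’s; Python, with hypothetical sample sizes and a hypothetical peeking schedule) of how optional stopping – testing after each batch of observations and stopping at the first p < 0.05 – inflates the nominal 5% Type I error rate even when the null hypothesis is true:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def optional_stopping_rejects(n_max=100, peek_every=10, alpha=0.05):
    """Sample from N(0, 1) (so the null is true) and run a t-test after
    every batch of observations, stopping at the first p < alpha."""
    x = rng.normal(0, 1, n_max)
    for n in range(peek_every, n_max + 1, peek_every):
        if stats.ttest_1samp(x[:n], 0).pvalue < alpha:
            return True   # a "significant" result found by peeking
    return False

n_sims = 5000
fixed_n = np.mean([stats.ttest_1samp(rng.normal(0, 1, 100), 0).pvalue < 0.05
                   for _ in range(n_sims)])
peeking = np.mean([optional_stopping_rejects() for _ in range(n_sims)])
print(f"Fixed n = 100 test:     Type I error ~ {fixed_n:.3f}")  # close to 0.05
print(f"Peeking every 10 obs.:  Type I error ~ {peeking:.3f}")  # well above 0.05
```

With ten interim looks the realized error rate typically runs at several times the nominal 5%; that is the sense in which biasing selection effects destroy the integrity of the reported error probabilities.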

How do statisticians ensure that methods that open themselves to criticism for severe testing violations are not replaced by tools that are less capable of error control? In my Neyman seminar I asked whether accounts of evidence that are insensitive to error probabilities somehow escape the consequences of biasing selection effects. We did not discuss it, but my answer was: not for a severe tester. Good science requires being able to apply severity at the meta-level–in scrutinizing inferences and arguments about which methods to use for given problems. The same kinds of social and ethical conflicts that Stark so aptly uncovers are operative here. Here, I think Statistics might be ideally positioned to call them out, although statisticians rarely do.[iii]

Question 3. How, specifically, might statistical instructors tackle the issues raised in Stark’s Question 4? (See Part II.) Question 4(c) notes the need for a correct elucidation of the meaning of concepts. However, it seems to me that the strictly correct, but shallow and unenthusiastic, recitals of definitions of things like p-values that we often see scarcely help. I say that a genuine appreciation of their value would also be required. If one adopts, as a default, a pessimistic standpoint, it can be an obstacle to communicating what error-statistical thinking is all about. Another obstacle to a non-equivocal interpretation, even if only an unconscious one, is being wedded to a notion of evidence and inference at odds with the one underlying statistical tests.

In Regina Nuzzo’s “Tips for communicating p-values” (2018), she says:

One little-known requirement… is that all analyses and results be presented, no matter the outcome. Yes, these seem like strange, nit-picky rules, but they’re part of the deal when using p-values. (Nuzzo, 2018)

It’s a cool paper (discussed in my blogpost), but are these “strange, nit-picky rules”? Are other methods for the job of distinguishing genuine from spurious statistical effects free of such rules? Effective communication requires something like an “enthusiastic grasp” of what the tools can accomplish, and of how violating the “strange” rules permits being fooled by randomness. Statisticians might be better able to describe what I’m after. This type of passion-driven statistics is found in Stark’s own work.
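As a small illustration (mine, not Nuzzo’s, with hypothetical numbers) of why the “present all analyses” rule is not nit-picky: if a researcher runs many analyses of pure noise and reports only the most impressive p-value, the chance of announcing at least one nominally significant effect is far above 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, n_analyses, alpha = 5000, 20, 0.05

hits = 0
for _ in range(n_sims):
    # Twenty independent "analyses" of pure noise: the null is true in every one.
    pvals = [stats.ttest_1samp(rng.normal(0, 1, 30), 0).pvalue
             for _ in range(n_analyses)]
    if min(pvals) < alpha:      # report only the best-looking result
        hits += 1

print(f"At least one p < 0.05 among {n_analyses} null analyses: {hits / n_sims:.2f}")
print(f"Theoretical value 1 - 0.95**{n_analyses} = {1 - 0.95**n_analyses:.2f}")  # about 0.64
```

Under these assumptions, reporting only the winner turns a nominal 5% error rate into roughly a 64% chance of announcing a spurious effect.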

What do readers think? I now turn to some of Stark’s suggestions in Stark and Saltelli (2018).

Part II. What Can Statisticians Do? Some suggestions from Stark and Saltelli (2018)

  • Statisticians can help with important, controversial issues with immediate consequences for society. We can help fight power asymmetries in the use of evidence. We can stand up for the responsible use of statistics, even when that means taking personal risks.
  • We should be vocally critical of cargo-cult statistics, including where study design is ignored, where p-values, confidence intervals and posterior distributions are misused… We should be critical even when the abuses involve politically charged issues, such as the social cost of climate change. …
  • We can insist that “service” courses foster statistical thinking, deep understanding, and appropriate scepticism, rather than promulgating cargo-cult statistics. We can help empower individuals to appraise quantitative information critically – to be informed, effective citizens of the world. …
  • When we appraise each other’s work in academia, we can ignore impact factors, citation counts, and the like: they do not measure importance, correctness, or quality. We can pay attention to the work itself, rather than the masthead of the journal in which it appeared, the press coverage it received, or the funding that supported it. We can insist on evidence that the work is correct – on reproducibility and replicability – rather than pretend that editors and referees can reliably vet research by proxy when the requisite evidence was not even submitted for scrutiny.
  • We can decline to referee manuscripts that do not include enough information to tell whether they are correct. We can commit to working reproducibly, to publishing code and data, and generally to contributing to the intellectual commons.
  • And we can be of service. Direct involvement of statisticians on the side of citizens in societal and environmental problems can help earn the justified trust of society. …

These are laudable goals, in sync with the goal of severity. The authors are to be credited for how they have implemented them in their work. Please share your reactions to them, as well as to the questions I raise, in the comments.

[i] I don’t claim to be clear on Recht’s position because I’m unsure of Recht’s reply to the queries I raised on his posts. But it became clear in our (blog) discussion that he regards statistical tests, by which he means statistical significance tests, as serving important regulatory control of error rates. (See my earlier blogpost.)

[ii] After the Feynman quote about bending over backwards.

[iii] Stark is an exception. An example is his comment on my editorial “The statistics wars and intellectual conflicts of interest.”

 



16 thoughts on “Has Statistics become corrupted? Philip Stark’s questions (and some questions about them) (ii)”

  1. rkenett

    This is basically a cut and paste of a comment provided to an earlier post. Could not find an option to just provide a link.

    Philip Stark’s comments are right on the dot and deserve in-depth consideration.

    Two comments:

    1. Box, Hunter and Hunter, in the preface of their book Statistics for Experimenters, write: “Even more important than learning about statistical techniques is the development of what might be called a capability for statistical thinking.” Implicitly this advice goes against the mechanization of statistics. My take on this is that one should focus on statistical thinking per se, with methods and examples. Several of my earlier inputs to this blog series are derived from such motivation.
    2. The evolution of statistics has been wonderfully depicted in the book by Efron and Hastie. It provides some answers to Stark’s comments. See this very short video which mentions this: https://user-images.githubusercontent.com/8720575/180794703-c6f05f40-eefd-4e1a-93f9-42cb78e6a6b4.mp4
    • Ron:
      The criticisms of cookbook statistics are well known, although I wouldn’t divorce the statistical methods from “statistical thinking” (see this post https://errorstatistics.com/2024/08/26/dont-divorce-statistical-inference-from-statistical-thinking-some-exchanges/). What I’m really keen to hear are responses to the questions I raise about Stark’s questions.

      • rkenett

        Mayo – OK I will be more pedestrian. By the way, please call me Ron…

        Question 1. Do statisticians worry that declaring “Statistics has become corrupt” will be taken as further grist for the mills of the movement to abandon significance, and even, in some quarters, to downplay instruction in statistical inference methods? >>> This is a bit tricky. It assumes that the “abandon p-values” discussions had/have a significant (pardon the pun) effect on how statistics is used. My perception is that they have not. On the other hand, the call by Stark and Saltelli can/should invigorate a discussion on how statistics should be practiced, which is a very good discussion to have. Your question assumes that this has to do with instruction in statistical inference methods. It is much more than that.

        Question 2. Isn’t it possible that the more self-critical and honest methods, in their willingness to concede defects due to perverse incentives, may lose out in the battle with less self-critical and less self-effacing methodologies? >>> The point is that the focus now should be on the combination of AI/ML with statistics and the identification of where statistics can/should contribute. Your question ignores this context.

        Question 3. How, specifically, might statistical instructors tackle the issues raised in Stark’s Question 4? (See Part II.) Question 4(c) notes the need for a correct elucidation of the meaning of concepts. >>> Your summarizing write-up is perfectly OK. No comments on it. However, your insistence that you “wouldn’t divorce the statistical methods from ‘statistical thinking’” is perplexing. You do not seem to recognize the need for this two-pole emphasis. Applied statisticians seem to be contrarian to your view. At the minimum it helps divide and conquer the complexity of data analysis.

        David Donoho in https://hdsr.mitpress.mit.edu/pub/g9mau4m0/release/2 brings out an interesting point in that AI/ML has evolved by addressing real problems and benchmark data sets. This is in contrast to a more theoretical approach taken by some statistical communities. David is highlighting a path for the evolution of modern statistics in unison with AI and ML.

        As Neyman was saying “life is complicated but not uninteresting”. We do seem to be at an interesting singularity…

        • Ron: You focus just on instruction.

          Mayo: Stark’s questions focus us there, as does Stark and Saltelli.
          AI/ML didn’t often come up in the panel discussion except when Recht said prediction was not inference. Your point regarding my question 2 doesn’t address it. How does adding AI/ML alter or solve the problem of corrupt statistics that Philip talks about? Please explain.

          Ron: The point is that the focus now should be on the combination of AI/ML with statistics and the identification of where statistics can/should contribute. Your question ignores this context.

          Mayo: (revised 10/8) I’m not sure that the context (in relation to these questions–mine and Stark’s) is the need to blend statistics with AI/ML. And if one does blend them, does the corruption go away? I’m not sure if Stark intends his questions to me to apply there as well.
          Thanks for sharing the link to Donoho’s “frictionless reproducibility” in machine learning. I’d read it, and several comments, when it came out in HDSR. By building on three main components – shared data, public code, and community-wide challenges with performance assessments – reproducibility and improvements in performance are sped up. On the face of it, this is trial and error and severe testing (for improvements in relation to the performance criterion) in the case of machine learning. If this leads to replication (not just reproducibility), then Stark’s claims about corruption might no longer hold, at least for those AI/ML applications.

          But that’s aside from my main point which is that there are potential unintended consequences of viewing today’s statistics as corrupt–that’s what I want people to consider.

          • rkenett

            The point I was making regarding Donoho’s paper is related to what he calls “challenges”.

            Specifically: “Adopting challenge problems as a new paradigm powering methodological research. The paradigm includes: a shared public dataset, a prescribed and quantified task performance metric, a set of enrolled competitors seeking to outperform each other on the task, and a public leaderboard. The paradigm can also include virtual challenges lacking the formal leaderboard, in which authors still attempt to publish a new state-of-the-art result going beyond previously recorded/published performance levels on a given dataset/performance metric. Thousands of such challenges and virtual challenges, with millions of entries, have now taken place, across many fields.”

            This indeed presents a new paradigm with substantial effects worth considering in in-depth discussions. My comment above was about this point. This is repeated here for the sake of clarity; I stated: “David Donoho … brings out an interesting point in that AI/ML has evolved by addressing real problems and benchmark data sets. This is in contrast to a more theoretical approach taken by some statistical communities. David is highlighting a path for the evolution of modern statistics in unison with AI and ML.” This path started in the early 1980s with Feigenbaum’s fifth generation challenge (doi: 10.1016/0004-3702(84)90047-X). Hope these notes are found useful. There are certainly lots of interesting things to do in the statistics domain to meet this challenge.

            • Ron:
              Yes, there’s an interesting discussion about his suggestion that this paradigm (“a shared public dataset, a prescribed and quantified task performance metric, a set of enrolled competitors seeking to outperform each other” on the (prediction) task, and a public leaderboard) encompasses “empirical science”.
              Focusing just on the topic of the blog, it’s still not clear this paradigm, impressive as it is, solves the problems raised by Stark’s questions or mine. Maybe you’re saying that in the future it will, at least for AI/ML. I don’t claim to have more than a distant outsider’s familiarity, but I do subscribe to Narayanan and Kapoor’s “AI Snake Oil”. Some of the commentators on Donoho (e.g., Gelman) raise similar criticisms in the current state of play.
              I thought that applied statistics also deals with “real problems”, but is not limited to prediction.
              By the way, can you describe a specific, and very simple, example of the progress under this paradigm (ideally outside of image classification)?

              • rkenett

                Mayo – I guess you are looking for examples. An annotated list of some of the projects my students handed out last week includes:

                1. Stress Detection During Sleep – feature selection
                2. Road traffic predictions – improvements of infrastructure
                3. Car thefts trends – evaluating insurance policy cost
                4. 3D  printing optimization – optimizing industrial process
                5. Municipality surveys – identifying needs of elderly members
                6. Test of a laser system – improving system design
                7. Predictive waste recycling models – predictive models
                8. Football players performance – planning player careers
                9. Football games results – evaluating betting sites
                10. Camera AGC impact – evaluating impact of automatic gain control
                11. Road accidents by car type – comparing car safety performance
                12. Sensor data analysis – improving process control
                13. Sound waves analysis – system performance evaluation
                14. Road accidents by location – improving road infrastructures

                These projects mostly involved penalized regression, random forests, Gaussian process models and various clustering methods. Model validation used befitting cross validation (BCV). The students also carried out self-assessments using https://www.dropbox.com/scl/fi/0b68kx2p67vgtbcq8g9g8/InfoQ-checklist.xlsx?rlkey=cho1ckena267ezfzvqf9x7sxd&st=75z8rruk&dl=0

                One of the papers representing what they learned is https://onlinelibrary.wiley.com/doi/pdf/10.1002/qre.3449

                BCV is introduced in https://onlinelibrary.wiley.com/doi/abs/10.1002/asmb.2701

                The Python code for my book on Modern Statistics that covers most of this is available in https://gedeck.github.io/mistat-code-solutions/ModernStatistics/

                Another paper with checklists the students found useful is https://chemistry-europe.onlinelibrary.wiley.com/doi/full/10.1002/ansa.202000159

                • Thank you for the examples.
                  Don’t these “challenges” still have the problem of multiple testing? One blogger asks about the purpose of the competition challenges:

                  AI competitions don’t produce useful models


                  “They obviously aren’t to reliably find the best model. They don’t even really reveal useful techniques to build great models, because we don’t know which of the hundred plus models actually used a good, reliable method, and which method just happened to fit the under-powered test set.”

                  Things might have greatly improved of course.
                  Do we get scientific theories from these prediction models?
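                  To illustrate the quoted worry about models that “just happened to fit the under-powered test set,” here is a minimal sketch (my illustration, with hypothetical numbers, treating the models’ scores as independent for simplicity): when many equally good models are ranked on one finite test set, the leaderboard winner’s score overstates its true performance.

```python
import numpy as np

rng = np.random.default_rng(3)
true_accuracy = 0.80          # every submitted model is equally good in reality
n_models, test_size = 100, 500

# Each model's leaderboard score is its accuracy on the same finite test set.
scores = rng.binomial(test_size, true_accuracy, n_models) / test_size
winner_score = scores.max()

# Score the "winning" model on fresh data: the apparent edge evaporates.
fresh_score = rng.binomial(test_size, true_accuracy) / test_size

print(f"Best leaderboard score: {winner_score:.3f}")   # typically around 0.84
print(f"True accuracy:          {true_accuracy:.3f}")
print(f"Winner on fresh data:   {fresh_score:.3f}")    # back near 0.80
```

                  On a 500-case test set the winner typically scores around 0.84 even though every entrant’s true accuracy is 0.80; on fresh data the advantage disappears, which is the multiple-testing problem in leaderboard form.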

                  • rkenett

                    Mayo – the examples I listed were all about generating information related to specific goals, preferably information of quality. They were not set up as competitions. To achieve this, students had to state a goal, g, identify relevant data, X, apply a bunch of methods, f, and compare them with a utility function U. As you know, we define information quality, InfoQ = U[f(X|g)]. This analysis combines mathematically defined tools with statistical thinking. In my previous comment I listed a link to an InfoQ Excel sheet that scores 8 InfoQ dimensions that represent a deconstruction of the generated information provided by a specific study.

                    Pedagogically the approach I take in a graduate data science course is “learning by doing”. In the first lecture I describe to the students the journey we will take between “needs”, “methods” and “deployment platforms”.

                    I suggest that the types of studies done by the students are typical of what is done in services, industry, engineering and business, and that InfoQ has wide scope of use. If, for example, Edward Ionides wants a structured assessment of his modeling work he can use it.

                    More on InfoQ in https://sites.google.com/site/datainfoq. We even used this framework to evaluate data science programs https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2911557

                    How we teach things was mentioned by Stark as a challenge. It is…

                  • rkenett

                    Mayo – An attempt to handle the issue raised by Oakden in computational biology was provided in https://ieeexplore.ieee.org/document/5611486

                    Quoting the abstract: “Large databases whose usage is open to the scientific community to facilitate research are becoming commonplace, especially in Biology and Genetics. The emerging scenario in which a community of researchers sequentially conduct multiple statistical tests on one shared database gives rise to major multiple hypothesis testing issues. It is often hard to control false discovery in the presence of unpredictable and sequential use, and existing tools are very limited. We suggest a scheme we term Quality Preserving Database (QPD) for controlling false discovery without any power loss by adding new samples for each use of the database and charging the user with the expenses. The crux of the scheme is a carefully crafted pricing system that fairly prices different user requests based on their demands while controlling false discovery. The statistical problem encountered is one of defining appropriate measures of false discovery that can be controlled sequentially, and designing methodologies that can control them in the context of QPD.
                    We describe a simple QPD implementation based on controlling the family-wise error rate using a method called alpha-spending, and a more involved implementation based on controlling a measure called mFDR, using an approach we term generalized alpha investing. We derive the favorable statistical properties of generalized alpha investing variants in general, and in the context of QPD in particular. The variant we implement can guarantee infinite use of a public database while preserving power, with very low costs, or even no costs under some realistic assumptions. We demonstrate this idea in simulations and describe its potential application to several real life setups.”

                    The QPD idea suggested a decade ago did not really take off…
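                    For readers unfamiliar with alpha-spending, here is a minimal sketch of the basic idea (my illustration, not the QPD authors’ implementation; the p-values and the geometric spending schedule are hypothetical): each successive query of the shared database is tested at a smaller threshold, so the thresholds sum to at most the overall alpha and the family-wise error rate is controlled no matter how many queries arrive.

```python
# Minimal alpha-spending sketch: the k-th query to the shared database is
# tested at alpha_k = alpha * (1/2)**k, so the thresholds sum to at most alpha
# and the family-wise error rate stays below alpha over an unbounded sequence.
ALPHA = 0.05

def alpha_spending(p_values, alpha=ALPHA):
    """Return a (decision, threshold) pair for each successive p-value."""
    results = []
    for k, p in enumerate(p_values, start=1):
        alpha_k = alpha * 0.5 ** k                 # geometric spending schedule
        results.append(("reject" if p < alpha_k else "retain", alpha_k))
    return results

# Hypothetical p-values from successive users of the database
p_values = [0.001, 0.03, 0.0005, 0.2]
for p, (decision, alpha_k) in zip(p_values, alpha_spending(p_values)):
    print(f"p = {p:<7} threshold = {alpha_k:.4f} -> {decision}")
```

                    Generalized alpha investing, mentioned in the abstract, refines this: a rejection earns back some alpha “wealth” for later tests, which is how the scheme can preserve power over indefinite use of the database.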

  2. Ugh, Stark writes “moral hazard (statisticians are often funded on scientific projects and have a strong incentive to do whatever it takes to bless “discoveries”), or something else?”

    This may be an issue, but it’s not an example of moral hazard. Moral hazard is when an actor takes on extra risk because the costs of failure/mistakes are borne by someone else. 

    • I’m not sure if Philip means it in that strict sense. Perhaps he thinks it’s rare to be held accountable, else he wouldn’t say “largely”.

  3. Edward Ionides had commented on Stark’s questions when I posted all four sets of questions earlier. His comment is here:

    https://errorstatistics.com/2024/10/27/panel-discussion-questions-from-my-neyman-lecture-severity-as-a-basic-concept-in-philosophy-of-statistics/#comment-267284

  4. Stark informs me that he’s currently steeped in (what I would call) one of his passion-driven statistical roles–having to do with voting! So he might not reply for a while. There’s no rush.

  5. Christian Hennig

    Stark implicitly suggests that at some earlier point things were fine and “on track”. I’m not so sure I’d agree with that. Certainly some problems haven’t always been as bad as they are now, but some problems have existed for a very long time and are merely highlighted more now than earlier. I will list a few major issues.

    The first one doesn’t only apply to statistics and certainly not exclusively to certain methods and approaches. This is the generally rising pressure on everyone to market and sell oneself, although this has always existed to a certain extent. This means that a growing number of scientists is driven by publications and measurable impact, and also that much statistical methodology is oversold. As editor, reviewer and reader I just see so much work that makes some sense and is potentially useful, but comes with exaggerated marketing claims and doesn’t mention potential pitfalls. We can’t really be surprised that non-experts reading such material will apply it uncritically and will overinterpret results. The same happens in teaching, as teachers feel that they need to “sell” their material to the students to keep them motivated. This is all the more important when teaching statistics to non-statisticians (students or others), because these people don’t tend to enjoy the material otherwise. Connected to this are time pressure and generally the pressure to present “solutions”. Usually there is little reward for being slow and questioning such “solutions”, unless of course there are immediate negative practical consequences visible. There is also, as we all know, often little reward for trying to replicate work or reproduce results, or to discover dishonesty and cheating.

    A problem with statistics is that dishonesty is rather easy, particularly selective reporting. Manipulating and falsifying data is also fairly easy and hard to find out. Statistical methods are readily available to everyone, and are seen as of crucial importance for publication (and I’d agree they are of crucial importance to science), so the temptation to misuse them is large. On top of that, many people who apply statistics don’t have a strong understanding of it, so there is also lots of unintentional misuse, helped along by the fact that the methodology is so easily available, and so important (be it for scientific progress, be it for scientists to get recognition and impact).

    Another issue specific to statistics is that statistics is in fact very difficult and subtle, and methodology is often marketed as if this were not the case. Many key problems with statistics have to do with this. Just to name a few: The whole topic of model assumptions and checking them is very problematic and complicated, and there is a strong tendency to oversimplify it when teaching. Subtlety starts with realising that model assumptions are never literally fulfilled in practice, yet they are important, and it is rather hard to understand that some issues with model assumptions are harmless and some are not, and which the non-harmless ones are – and I should add here that as often in statistics and actually science (and particularly in significance testing) it is already a problematic simplification to say this in a binary way (“harmless or not”) whereas in many situations we sit uncomfortably between these two, and how much certain issues matter will crucially depend on exactly what is done, how results are interpreted and the like. Furthermore we can realise that there are many models in a close neighbourhood of an assumed model (meaning that they are hard or impossible to distinguish based on the available data, i.e., they agree with the data equally well), which may have very different implications regarding outcomes of statistical methods. This includes violations of elementary assumptions such as i.i.d. that may be hard or impossible to detect.

    Another issue that is genuinely problematic is multiple testing and how to take that into account. There is of course nice theory and methodology, but there aren’t agreed rules about what criterion is relevant in what situation, or about which tests to take into account when adjusting for multiple testing. Connected is the unclear impact of making statistical decisions based on the data in other ways than formal testing, particularly from visualisation, which can be problematic despite the fact that most statisticians are convinced that this is very sensible if done right. Further issues arise with causality, generalisability, missing values,… (there are “solutions” to all these issues flying around in the literature, but see above regarding “solutions”…)

    Much of this is beyond the understanding of non-statisticians applying statistics, and I should honestly say that these issues are not fully understood by statisticians either. “Cookbookery” is always a temptation if we don’t want to confuse our students (and ourselves). Simplification is a necessity. In fact we use simplified models of reality because we have a hard time handling it otherwise.

    Of course like many I think of solutions to these problems, but (as always) all that comes to mind has pitfalls. I’m actually very keen on teaching the complexity and subtlety of statistics, meaning in particular raising issues and not making them go away by some kind of algorithmic solution. Emphasising the limitations of what we do and why we do it anyway, and what this implies regarding the required caution when interpreting results. But of course students and clients may not like it (and maybe consult data scientists instead who present “solutions” without the issues).

    And then of course replication, reproduction, targeting the same scientific problem in different ways is required. The really reliable scientific knowledge is the knowledge that has been challenged and checked many times by many people in different ways. But this of course means that we need to become slower before accepting something as scientific fact, more effort is required for fewer results, and at the same time patients are dying and the environment is destroyed even more, so that some may say, no, we need speed absolutely in dealing with these problems.

    • Christian:
      Thank you so much for your excellent comment; it has numerous insightful reflections, and I agree with all of them. Stark does suggest things used to be “on track”, as you say, whereas in truth, these foibles are so well-known as to be associated with reverberating slogans I list on the first page of my book (SIST, 2018):
      • Association is not causation.
      • Statistical significance is not substantive significance.
      • No evidence of risk is not evidence of no risk.
      • If you torture the data enough, they will confess.

      Of course, powerful computer methods make it easier to commit fallacies. In the midst of researchers marketing themselves, it’s ironic to see statisticians selling themselves short when claiming it’s largely all corrupt. Perhaps they have settled for being a handmaiden to data science by contributing a vague “statistical thinking”. It’s good that you teach your students the complexity and value of statistics. On p. 23 I write:

      “You might aver that we are too weak to fight off the lures of retaining the status quo – the carrots are too enticing, given that the sticks aren’t usually too painful. I’ve heard some people say that evoking traditional mantras for promoting reliability, now that science has become so crooked, only makes things worse. Really? Yes there is gaming, but if we are not to become utter skeptics of good science, we should understand how the protections can work. In either case, I’d rather have rules to hold the “experts” accountable than live in a lawless wild west. I, for one, would be skeptical of entering clinical trials based on some of the methods now standard. There will always be cheaters, but give me an account that has eyes with which to spot them, and the means by which to hold cheaters accountable. That is, in brief, my basic statistical philosophy. The stakes couldn’t be higher in today’s world. Feynman said to take on an “extra type of integrity” that is not merely the avoidance of lying but striving “to check how you’re maybe wrong.” I couldn’t agree more. But we laywomen are still going to have to proceed with a cattle prod.”

      Why can’t we get AI methods to carry out the replications and report problems? Does anyone talk about this?

