Author Archives: Mayo

E. Ionides & Ya’acov Ritov (Guest Post) on Mayo’s editorial, “The Statistics Wars and Intellectual Conflicts of Interest”


Edward L. Ionides


Director of Undergraduate Programs and Professor,
Department of Statistics, University of Michigan

Ya’acov Ritov
Professor,
Department of Statistics, University of Michigan

 

Thanks for the clear presentation of the issues at stake in your recent Conservation Biology editorial (Mayo 2021). There is a need for such articles elaborating and contextualizing the ASA President’s Task Force statement on statistical significance (Benjamini et al., 2021). The Benjamini et al. (2021) statement is sensible advice. For better or worse, it has no references and just speaks what looks to us like plain sense. However, it avoids addressing why there is a debate in the first place, and what justifications and misconceptions drive the different positions. Consequently, it may be ineffective at communicating to those swing voters who have sympathies with some of the insinuations in the Wasserstein & Lazar (2016) statement. We say “insinuations” here since we consider that their 2016 statement made a forceful, indirect, and erroneous attack on p-values. Wasserstein & Lazar (2016) started with a constructive discussion about the uses and abuses of p-values before moving against them. This approach was good rhetoric: “I have come to praise p-values, not to bury them,” to invert Shakespeare’s Antony. Good rhetoric does not always promote good science, but Wasserstein & Lazar (2016) successfully managed to frame and lead the debate, according to Google Scholar. We warned of the potential consequences of that article and its flaws (Ionides et al., 2017), and we refer the reader to our article for more explanation of these issues (it may be found below). Wasserstein, Schirm and Lazar (2019) made their position clearer, and therefore easier to confront. We are grateful to Benjamini et al. (2021) and Mayo (2021) for rising to the debate. Rephrasing Churchill in support of their efforts, “Many forms of statistical methods have been tried, and will be tried in this world of sin and woe. No one pretends that the p-value is perfect or all-wise. Indeed (noting that its abuse has much responsibility for the replication crisis) it has been said that the p-value is the worst form of inference except all those other forms that have been tried from time to time.” Continue reading

Categories: ASA Task Force on Significance and Replicability, editors, P-values, significance tests | 2 Comments

B. Haig on questionable editorial directives from Psychological Science (Guest Post)


Brian Haig, Professor Emeritus
Department of Psychology
University of Canterbury
Christchurch, New Zealand

 

What do editors of psychology journals think about tests of statistical significance? Questionable editorial directives from Psychological Science

Deborah Mayo’s (2021) recent editorial in Conservation Biology addresses the important issue of how journal editors should deal with strong disagreements about tests of statistical significance (ToSS). Her commentary speaks to applied fields, such as conservation science, but it is relevant to basic research as well as to other sciences, such as psychology. In this short guest commentary, I briefly remark on the role played by the prominent journal Psychological Science (PS) in the debate over whether researchers should employ ToSS. PS is the flagship journal of the Association for Psychological Science, and two of its editors-in-chief have offered explicit, but questionable, advice on this matter. Continue reading

Categories: ASA Task Force on Significance and Replicability, Brian Haig, editors, significance tests | Tags: | 2 Comments

D. Lakens (Guest Post): Averting journal editors from making fools of themselves


Daniël Lakens

Associate Professor
Human Technology Interaction
Eindhoven University of Technology

Averting journal editors from making fools of themselves

In a recent editorial, Mayo (2021) warns journal editors to avoid calls for author guidelines to reflect a particular statistical philosophy, and not to go beyond merely enforcing the proper use of significance tests. That such a warning is needed at all should embarrass anyone working in statistics. And yet, a mere three weeks after Mayo’s editorial was published, the need for such warnings was reinforced when a co-editorial by journal editors from the International Society of Physiotherapy Journal Editors (Elkins et al., 2021), titled “Statistical inference through estimation: recommendations from the International Society of Physiotherapy Journal Editors”, stated: “[This editorial] also advises researchers that some physiotherapy journals that are members of the International Society of Physiotherapy Journal Editors (ISPJE) will be expecting manuscripts to use estimation methods instead of null hypothesis statistical tests.” Continue reading

Categories: D. Lakens, significance tests | 4 Comments

Midnight With Birnbaum (Remote, Virtual Happy New Year 2021)!


For the second year in a row, unlike the previous 9 years that I’ve been blogging, it’s not feasible to actually revisit that spot in the road, looking to get into a strange-looking taxi, to head to “Midnight With Birnbaum”. Because of the extended pandemic, I am not going out this New Year’s Eve again, so the best I can hope for is a zoom link of the sort I received last year, not long before midnight, that will link me to a hypothetical party with him. (The pic on the left is the only blurry image I have of the club I’m taken to.) I just keep watching my email to see if a zoom link arrives. My book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018) doesn’t include the argument from my article in Statistical Science (“On the Birnbaum Argument for the Strong Likelihood Principle”), but you can read it at that link, along with commentaries by A. P. Dawid, Michael Evans, Martin and Liu, D. A. S. Fraser (who sadly passed away in 2021), Jan Hannig, and Jan Bjornstad; and there’s much in it that I’d like to discuss with him. The (Strong) Likelihood Principle (LP or SLP), whether or not it is named, remains at the heart of many of the criticisms of Neyman-Pearson (N-P) statistics and statistical significance testing in general. Continue reading

Categories: Birnbaum, Birnbaum Brakes, strong likelihood principle | Tags: , , , | 1 Comment

“This is the moment” to discount a positive Covid test (after 5 days) (i)


This week’s big controversy concerns the CDC’s decision to cut the recommended isolation period for people infected with Covid. CDC director Walensky was all over the news explaining that this “was the moment” for a cut, given the whopping number of new Covid cases (over 400,000 on Dec. 28, exceeding the previous record, which was in the 300,000s).

“In the context of the fact that we were going to have so many more cases — many of those would be asymptomatic or mildly symptomatic — people would feel well enough to be at work, they would not necessarily tolerate being home, and that they may not comply with being home, this was the moment that we needed to make that decision,” Walensky told CNN.

The CDC had already explained last week that “health care workers’ isolation period could be cut to five days, or even fewer, in the event of severe staffing shortages at U.S. hospitals”.

Then, on Monday, the CDC announced that individuals who test positive for Covid-19 and are asymptomatic need to isolate for only five days, not 10 days, citing increasing evidence that people are most infectious in the initial days after developing symptoms.

What’s really causing alarm among many health experts is that the new policy has no requirement for a negative result on a rapid test before ending isolation. Even if you test positive on day 5, the CDC says, you can go about your business, so long as you’re asymptomatic or mildly symptomatic or your “symptoms are resolving” and you wear a mask. I don’t suppose the new looser guidance would result in any pressure being put on a pilot or other worker to get back to work even with some mild brain fog or coughing that seemed to be resolving.[1] Continue reading

Categories: covid-19 | 5 Comments

January 11: Phil Stat Forum (remote)

Special Session of the (remote)
Phil Stat Forum:

11 January 2022

“Statistical Significance Test Anxiety”

TIME: 15:00-17:00 (London, GMT); 10:00-12:00 (EST)

Presenters: Deborah Mayo (Virginia Tech) &
Yoav Benjamini (Tel Aviv University)

Moderator: David Hand (Imperial College London)



Focus of the Session: 

Continue reading

Categories: Announcement, David Hand, Phil Stat Forum, significance tests, Yoav Benjamini | Leave a comment

The Statistics Wars and Intellectual Conflicts of Interest


My editorial in Conservation Biology is published (open access): “The Statistics Wars and Intellectual Conflicts of Interest”. Share your comments here and/or send a separate item (to Error), if you wish, for possible guest posting*. (All readers are invited to a special January 11 Phil Stat Session with Y. Benjamini and D. Hand described here.) Here’s most of the editorial:

The Statistics Wars and Intellectual Conflicts of Interest

How should journal editors react to heated disagreements about statistical significance tests in applied fields, such as conservation science, where statistical inferences often are the basis for controversial policy decisions? They should avoid taking sides. They should also avoid obeisance to calls for author guidelines to reflect a particular statistical philosophy or standpoint. The question is how to prevent the misuse of statistical methods without selectively favoring one side.

The statistical‐significance‐test controversies are well known in conservation science. In a forum revolving around Murtaugh’s (2014) “In Defense of P values,” Murtaugh argues, correctly, that most criticisms of statistical significance tests “stem from misunderstandings or incorrect interpretations, rather than from intrinsic shortcomings of the P value” (p. 611). However, underlying those criticisms, and especially proposed reforms, are often controversial philosophical presuppositions about the proper uses of probability in uncertain inference. Should probability be used to assess a method’s probability of avoiding erroneous interpretations of data (i.e., error probabilities) or to measure comparative degrees of belief or support? Wars between frequentists and Bayesians continue to simmer in calls for reform.

Consider how, in commenting on Murtaugh (2014), Burnham and Anderson (2014 : 627) aver that “P‐values are not proper evidence as they violate the likelihood principle (Royall, 1997).” This presupposes that statistical methods ought to obey the likelihood principle (LP), a long‐standing point of controversy in the statistics wars. The LP says that all the evidence is contained in a ratio of likelihoods (Berger & Wolpert, 1988). Because this is to condition on the particular sample data, there is no consideration of outcomes other than those observed and thus no consideration of error probabilities. One should not write this off because it seems technical: methods that obey the LP fail to directly register gambits that alter their capability to probe error. Whatever one’s view, a criticism based on presupposing the irrelevance of error probabilities is radically different from one that points to misuses of tests for their intended purpose—to assess and control error probabilities.

Error control is nullified by biasing selection effects: cherry‐picking, multiple testing, data dredging, and flexible stopping rules. The resulting (nominal) p values are not legitimate p values. In conservation science and elsewhere, such misuses can result from a publish‐or‐perish mentality and experimenter’s flexibility (Fidler et al., 2017). These led to calls for preregistration of hypotheses and stopping rules–one of the most effective ways to promote replication (Simmons et al., 2012). However, data dredging can also occur with likelihood ratios, Bayes factors, and Bayesian updating, but the direct grounds to criticize inferences as flouting error probability control are lost. This conflicts with a central motivation for using p values as a “first line of defense against being fooled by randomness” (Benjamini, 2016). The introduction of prior probabilities (subjective, default, or empirical)–which may also be data dependent–offers further flexibility.
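To make the nullification of error control concrete, here is a minimal simulation sketch (in Python, with made-up numbers): each trial generates 20 independent outcomes with no real effects, runs a one-sample t test on each, and reports only the cherry-picked smallest p value. The chance of obtaining at least one nominally small p value is then far above the 0.05 the individual tests advertise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_trials, n_hypotheses, n_obs = 5_000, 20, 30

false_alarms = 0
for _ in range(n_trials):
    # 20 independent outcomes, each with no real effect (illustrative numbers)
    data = rng.normal(0.0, 1.0, (n_hypotheses, n_obs))
    _, p = stats.ttest_1samp(data, 0.0, axis=1)
    if p.min() < 0.05:          # report only the cherry-picked "best" result
        false_alarms += 1

print(f"P(at least one nominal p < 0.05, no real effects) = {false_alarms / n_trials:.2f}")
# roughly 1 - 0.95**20 = 0.64, far above the nominal 0.05
```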

Signs that one is going beyond merely enforcing proper use of statistical significance tests are that the proposed reform is either the subject of heated controversy or is based on presupposing a philosophy at odds with that of statistical significance testing. It is easy to miss or downplay philosophical presuppositions, especially if one has a strong interest in endorsing the policy upshot: to abandon statistical significance. Having the power to enforce such a policy, however, can create a conflict of interest (COI). Unlike a typical COI, this one is intellectual and could threaten the intended goals of integrity, reproducibility, and transparency in science.

If the reward structure is seducing even researchers who are aware of the pitfalls of capitalizing on selection biases, then one is dealing with a highly susceptible group. For a journal or organization to take sides in these long-standing controversies—or even to appear to do so—encourages groupthink and discourages practitioners from arriving at their own reflective conclusions about methods.

The American Statistical Association (ASA) Board appointed a President’s Task Force on Statistical Significance and Replicability in 2019 that was put in the odd position of needing to “address concerns that a 2019 editorial [by the ASA’s executive director (Wasserstein et al., 2019)] might be mistakenly interpreted as official ASA policy” (Benjamini et al., 2021)—as if the editorial continues the 2016 ASA Statement on p-values (Wasserstein & Lazar, 2016). That policy statement merely warns against well‐known fallacies in using p values. But Wasserstein et al. (2019) claim it “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” and announce taking that step. They call on practitioners not to use the phrase statistical significance and to avoid p value thresholds. Call this the no‐threshold view. The 2016 statement was largely uncontroversial; the 2019 editorial was anything but. The President’s Task Force should be commended for working to resolve the confusion (Kafadar, 2019). Their report concludes: “P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results” (Benjamini et al., 2021). A disclaimer that Wasserstein et al., 2019 was not ASA policy would have avoided both the confusion and the slight to opposing views within the Association.

The no‐threshold view has consequences (likely unintended). Statistical significance tests arise “to test the conformity of the particular data under analysis with [a statistical hypothesis] H0 in some respect to be specified” (Mayo & Cox, 2006: 81). There is a function D of the data, the test statistic, such that the larger its value (d), the more inconsistent are the data with H0. The p value is the probability the test would have given rise to a result more discordant from H0 than d is, were the results due to background or chance variability (as described in H0). In computing p, hypothesis H0 is assumed merely for drawing out its probabilistic implications. If even larger differences than d are frequently brought about by chance alone (p is not small), the data are not evidence of inconsistency with H0. Requiring a low p value before inferring inconsistency with H0 controls the probability of a type I error (i.e., erroneously finding evidence against H0).
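Here is a minimal numerical sketch of the computation just described, with purely illustrative numbers: a one-sided test of a normal mean with known standard deviation.

```python
import math
from scipy import stats

# One-sided test of H0: mu <= mu0 vs H1: mu > mu0 under a normal model with
# known sigma; all numbers are made up for illustration.
mu0, sigma, n = 0.0, 1.0, 25
xbar = 0.4                                 # observed sample mean
d = (xbar - mu0) / (sigma / math.sqrt(n))  # test statistic D evaluated at the data
p_value = 1 - stats.norm.cdf(d)            # probability of a result at least as discordant, under H0

print(f"d = {d:.2f}, p = {p_value:.3f}")   # d = 2.00, p = 0.023
```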

Whether interpreting a simple Fisherian or an N‐P test, avoiding fallacies calls for considering one or more discrepancies from the null hypothesis under test. Consider testing a normal mean H0: μ ≤ μ0 versus H1: μ > μ0. If the test would fairly probably have resulted in a smaller p value than observed, if μ = μ1 were true (where μ1 = μ0 + γ, for γ > 0), then the data provide poor evidence that μ exceeds μ1. It would be unwarranted to infer evidence of μ > μ1. Tests do not need to be abandoned when the fallacy is easily avoided by computing p values for one or two additional benchmarks (Burgman, 2005; Hand, 2021; Mayo, 2018; Mayo & Spanos, 2006).
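Continuing the same illustrative numbers (a normal mean with known sigma, an observed d of 2), here is a sketch of that check: for each benchmark μ1 = μ0 + γ, compute the probability that the test would have produced a smaller p value than the one observed, were μ = μ1 true. A high value flags the benchmarks for which inferring μ > μ1 would be unwarranted.

```python
import math
from scipy import stats

mu0, sigma, n, xbar = 0.0, 1.0, 25, 0.4    # same illustrative setup as above
se = sigma / math.sqrt(n)
d_obs = (xbar - mu0) / se

for gamma in (0.2, 0.4, 0.6):
    mu1 = mu0 + gamma
    # probability of a smaller p value (a larger d) than observed, were mu = mu1 true
    prob_smaller_p = 1 - stats.norm.cdf(d_obs - (mu1 - mu0) / se)
    print(f"mu1 = {mu1:.1f}: P(smaller p; mu1) = {prob_smaller_p:.2f}")
# High values (e.g., mu1 = 0.6 here) signal poor evidence that mu exceeds that mu1.
```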

The same is true for avoiding fallacious interpretations of nonsignificant results. These are often of concern in conservation, especially when interpreted as showing that no risks exist. In fact, the test may have had a low probability of detecting risks. But nonsignificant results are not uninformative. If the test very probably would have resulted in a more statistically significant result were there a meaningful effect, say μ > μ1 (where μ1 = μ0 + γ, for γ > 0), then the data are evidence that μ < μ1. (This is not to infer μ ≤ μ0.) “Such an assessment is more relevant to specific data than is the notion of power” (Mayo & Cox, 2006: 89). This also matches inferring that μ is less than the upper bound of the corresponding confidence interval (at the associated confidence level) or a severity assessment (Mayo, 2018). Others advance equivalence tests (Lakens, 2017; Wellek, 2017). An N‐P test tells one to specify H0 so that the type I error is the more serious (considering costs); that alone can alleviate problems in the examples critics adduce (H0 would be that the risk exists).
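A companion sketch for a statistically nonsignificant outcome (again with made-up numbers): if the test very probably would have yielded a more significant result under μ1, the data indicate μ < μ1, which lines up with the one-sided upper confidence bound.

```python
import math
from scipy import stats

mu0, sigma, n = 0.0, 1.0, 25
se = sigma / math.sqrt(n)
xbar = 0.1                        # an illustrative, statistically insignificant outcome
d_obs = (xbar - mu0) / se         # d = 0.5, p = 0.31: no evidence of inconsistency with H0

for gamma in (0.2, 0.4):
    mu1 = mu0 + gamma
    prob_more_sig = 1 - stats.norm.cdf(d_obs - (mu1 - mu0) / se)
    print(f"mu1 = {mu1:.1f}: P(more significant result; mu1) = {prob_more_sig:.2f}")
# Where this probability is high (mu1 = 0.4 here), the data are evidence that mu < mu1.

upper_95 = xbar + stats.norm.ppf(0.95) * se   # matching one-sided 95% upper bound
print(f"One-sided 95% upper confidence bound for mu: {upper_95:.2f}")
```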

Many think the no‐threshold view merely insists that the attained p value be reported. But leading N‐P theorists already recommend reporting p, which “gives an idea of how strongly the data contradict the hypothesis…[and] enables others to reach a verdict based on the significance level of their choice” (Lehmann & Romano, 2005: 63−64). What the no‐threshold view does, if taken strictly, is preclude testing. If one cannot say ahead of time about any result that it will not be allowed to count in favor of a claim, then one does not test that claim. There is no test or falsification, even of the statistical variety. What is the point of insisting on replication if at no stage can one say the effect failed to replicate? One may argue for approaches other than tests, but it is unwarranted to claim by fiat that tests do not provide evidence. (For a discussion of rival views of evidence in ecology, see Taper & Lele, 2004.)

Many sign on to the no‐threshold view thinking it blocks perverse incentives to data dredge, multiple test, and p hack when confronted with a large, statistically nonsignificant p value. Carefully considered, the reverse seems true. Even without the word significance, researchers could not present a large (nonsignificant) p value as indicating a genuine effect. It would be nonsensical to say that, even though more extreme results would frequently occur by random variability alone, their data are evidence of a genuine effect. The researcher would still need a small value, which is to operate with a threshold. However, it would be harder to hold data dredgers culpable for reporting a nominally small p value obtained through data dredging. What distinguishes nominal p values from actual ones is that they fail to meet a prespecified error probability threshold.

 

While it is well known that stopping when the data look good inflates the type I error probability, a strict Bayesian is not required to adjust for interim checking because the posterior probability is unaltered. Advocates of Bayesian clinical trials are in a quandary because “The [regulatory] requirement of Type I error control for Bayesian [trials] causes them to lose many of their philosophical advantages, such as compliance with the likelihood principle” (Ryan et al., 2020: 7).
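A small simulation sketch of the inflation just mentioned (illustrative numbers only): peek at accumulating normal data every 10 observations and stop the first time the nominal one-sided p value dips below 0.05. Even though H0 is true in every trial, the overall type I error rate comes out well above 0.05, which is exactly what the regulatory requirement is meant to control; a likelihood-based analysis of the same data would be unchanged by the looks, which is the quandary the quotation describes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials, n_max = 10_000, 100
looks = range(10, n_max + 1, 10)           # interim checks every 10 observations

rejections = 0
for _ in range(n_trials):
    x = rng.normal(0.0, 1.0, n_max)        # H0 (mu = 0) is true in every trial
    for n in looks:
        z = x[:n].mean() * np.sqrt(n)      # z statistic for the data so far (sigma = 1)
        if 1 - stats.norm.cdf(z) < 0.05:   # stop as soon as the nominal p dips below 0.05
            rejections += 1
            break

print(f"Actual type I error rate = {rejections / n_trials:.2f}")   # well above 0.05
```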

It may be retorted that implausible inferences will indirectly be blocked by appropriate prior degrees of belief (informative priors), but this misses the crucial point. The key function of statistical tests is to constrain the human tendency to selectively favor views they believe in. There are ample forums for debating statistical methodologies. There is no call for executive directors or journal editors to place a thumb on the scale. Whether in dealing with environmental policy advocates, drug lobbyists, or avid calls to expel statistical significance tests, a strong belief in the efficacy of an intervention is distinct from its having been well tested. Applied science will be well served by editorial policies that uphold that distinction.

For the acknowledgments and references, see the full editorial here.

I will cite as many (constructive) readers’ views as I can at the upcoming forum with Yoav Benjamini and David Hand on January 11 on zoom (see this post). *Authors of articles I put up as guest posts or cite at the Forum will get a free copy of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018).

Categories: significance tests, spurious p values, stat wars and their casualties, strong likelihood principle | 3 Comments

Bickel’s defense of significance testing on the basis of Bayesian model checking


In my last post, I said I’d come back to a (2021) article by David Bickel, “Null Hypothesis Significance Testing Defended and Calibrated by Bayesian Model Checking” in The American Statistician. His abstract begins as follows:

 

Significance testing is often criticized because p-values can be low even though posterior probabilities of the null hypothesis are not low according to some Bayesian models. Those models, however, would assign low prior probabilities to the observation that the p-value is sufficiently low. That conflict between the models and the data may indicate that the models need revision. Indeed, if the p-value is sufficiently small while the posterior probability according to a model is insufficiently small, then the model will fail a model check….(from Bickel 2021)

Continue reading

Categories: Bayesian/frequentist, D. Bickel, Fisher, P-values | 3 Comments

P-values disagree with posteriors? Problem is your priors, says R.A. Fisher

What goes around…

How often do you hear P-values criticized for “exaggerating” the evidence against a null hypothesis? If your experience is like mine, the answer is ‘all the time’, and in fact, the charge is often taken as one of the strongest cards in the anti-statistical significance playbook. The argument boils down to the fact that the P-value accorded to a point null H0 can be small while its Bayesian posterior probability is high–provided a high enough prior is accorded to H0. But why suppose P-values should match Bayesian posteriors? And what justifies the high (or “spike”) prior to a point null? While I discuss this criticism at considerable length in Statistical Inference as Severe Testing: How to get beyond the statistics wars (CUP, 2018), I did not quote an intriguing response by R.A. Fisher to disagreements between P-values and posteriors (in Statistical Methods and Scientific Inference, Fisher 1956); namely, that such a prior probability assignment would itself be rejected by the observed small P-value–if the prior were itself regarded as a hypothesis to test. Or so he says. I did mention this response by Fisher in an encyclopedia article from way back in 2006 on “philosophy of statistics”: Continue reading
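For readers who want to see the disagreement in numbers, here is a minimal sketch under standard illustrative assumptions (not Fisher’s own example): a point null H0: μ = 0 with a lump (“spike”) of prior probability, μ ~ N(0, 1) under the alternative, and data from a normal model arranged so the two-sided P-value sits at about 0.05. The posterior on H0 stays at or above one half, and climbs as the spike prior grows.

```python
import math
from scipy import stats

# Illustrative spike-and-slab setup (all numbers made up): point null H0: mu = 0
# gets a lump of prior probability; under H1, mu ~ N(0, tau^2); sigma known.
sigma, tau, n = 1.0, 1.0, 100
se = sigma / math.sqrt(n)
xbar = 1.96 * se                                         # two-sided P-value of about 0.05
p_value = 2 * (1 - stats.norm.cdf(abs(xbar) / se))

m0 = stats.norm.pdf(xbar, 0, se)                         # marginal likelihood under H0
m1 = stats.norm.pdf(xbar, 0, math.sqrt(tau**2 + se**2))  # marginal likelihood under H1

for prior_H0 in (0.5, 0.7, 0.9):
    post_H0 = prior_H0 * m0 / (prior_H0 * m0 + (1 - prior_H0) * m1)
    print(f"prior P(H0) = {prior_H0:.1f}: p = {p_value:.3f}, posterior P(H0 | data) = {post_H0:.2f}")
```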

Categories: Bayesian/frequentist, Fisher, P-values | 7 Comments

Memory Lane (4 years ago): Why significance testers should reject the argument to “redefine statistical significance”, even if they want to lower the p-value*


An argument that assumes the very thing that was to have been argued for is guilty of begging the question; signing on to an argument whose conclusion you favor even though you cannot defend its premises is to argue unsoundly, and in bad faith. When a whirlpool of “reforms” subliminally alters the nature and goals of a method, falling into these sins can be quite inadvertent. Start with a simple point on defining the power of a statistical test. Continue reading

Categories: Bayesian/frequentist, fallacy of rejection, P-values, reforming the reformers, spurious p values | 3 Comments

Our presentations from the PSA: Philosophy in Science (PinS) symposium

Philosophy in Science:
Can Philosophers of Science Contribute to Science?

 

Below are the presentations from our remote session on “Philosophy in Science” on November 13, 2021 at the Philosophy of Science Association meeting. We are having an extended discussion on Monday, November 22 at 3pm Eastern Standard Time. If you wish to take part, write to me of your interest by email (error) with the subject “PinS” or use the comments below. (Include name, affiliation and email). Continue reading

Categories: PSA 2021 | 7 Comments

Our session is now remote: Philo of Sci Association (PSA): Philosophy IN Science (PinS): Can Philosophers of Science Contribute to Science?


Philosophy in Science: Can Philosophers of Science Contribute to Science?
     on November 13, 2-4 pm

 

OUR SESSION HAS BECOME REMOTE: PLEASE JOIN US on ZOOM! This session revolves around the intriguing question: Can Philosophers of Science Contribute to Science? They’re calling it philosophy “in” science–when philosophical ministrations actually intervene in a science itself. This is the session I’ll be speaking in. I hope you will join our session; now that it’s remote, you can take part through the Zoom link even if you’re not attending the meeting in person. But I’d like to hear what you think about this question–in the comments to this post. Continue reading

Categories: Announcement, PSA 2021 | Leave a comment

S. Senn: The Many Halls Problem (Guest Post)


Stephen Senn
Consultant Statistician
Edinburgh, Scotland

 

The Many Halls Problem
It’s not that paradox but another

Generalisation is passing…from the consideration of a restricted set to that of a more comprehensive set containing the restricted one…Generalization may be useful in the solution of problems. George Pólya [1] (p. 108)

Introduction

In a previous blog post (https://www.linkedin.com/pulse/cause-concern-stephen-senn/) I considered Lord’s Paradox[2], applying John Nelder’s calculus of experiments[3, 4]. Lord’s paradox involves two different analyses of the effect of two different diets, one for each of two different student halls, on the weight of students. One statistician compares the so-called change scores or gain scores (final weight minus initial weight) and the other compares final weights, adjusting for initial weights using analysis of covariance. Since the mean initial weights vary between halls, the two analyses will come to different conclusions unless the slope of final on initial weights just happens to be one (in practice, it would usually be less). The fact that two apparently reasonable analyses would lead to different conclusions constitutes the paradox. I chose the version of the paradox outlined by Wainer and Brown [5] and also discussed in The Book of Why[6]. I illustrated this by considering two different experiments: one in which, as in the original example, the diet varies between halls, and a further example in which it varies within halls. I simulated some data, which are available in the appendix to that blog and can also be downloaded from http://www.senns.uk/Lords_Paradox_Simulated.xls so that any reader who wishes to try their hand at analysis can have a go. Continue reading
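For readers who would rather see the two analyses side by side before opening the spreadsheet, here is an independent minimal sketch in Python (it does not use Senn’s simulated data; the halls, sample sizes, slope of 0.5, and absence of any diet effect are assumptions made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                                    # students per hall (assumed)

# Assumed data-generating model: final weight regresses on initial weight with
# slope 0.5, and the two diets have no effect at all.
def simulate_hall(mean_initial):
    initial = rng.normal(mean_initial, 5, n)
    final = 30 + 0.5 * initial + rng.normal(0, 3, n)
    return initial, final

init_a, fin_a = simulate_hall(60)          # Hall A: lighter students on average
init_b, fin_b = simulate_hall(70)          # Hall B: heavier students on average

# Statistician 1: compare mean change (gain) scores between halls.
change_diff = (fin_b - init_b).mean() - (fin_a - init_a).mean()

# Statistician 2: analysis of covariance -- regress final weight on initial
# weight plus a hall indicator; the indicator's coefficient is the adjusted effect.
initial = np.concatenate([init_a, init_b])
final = np.concatenate([fin_a, fin_b])
hall = np.concatenate([np.zeros(n), np.ones(n)])
X = np.column_stack([np.ones(2 * n), initial, hall])
coef, *_ = np.linalg.lstsq(X, final, rcond=None)

print(f"Change-score estimate of diet effect: {change_diff:.2f}")   # about -5 with these settings
print(f"ANCOVA estimate of diet effect:       {coef[2]:.2f}")       # about 0 with these settings
```

Because the slope of final on initial weight is below one and the halls differ in mean initial weight, the two seemingly reasonable analyses disagree, which is the point of the paradox.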

Categories: Lord's paradox, S. Senn | 7 Comments

I’ll be speaking at the Philo of Sci Association (PSA): Philosophy IN Science: Can Philosophers of Science Contribute to Science?


Philosophy in Science: Can Philosophers of Science Contribute to Science?
     on November 13, 2-4 pm

 

This session revolves around the intriguing question: Can Philosophers of Science Contribute to Science? They’re calling it philosophy “in” science–when philosophical ministrations actually intervene in a science itself. This is the session I’ll be speaking in. I hope you will come to our session if you’re there–the meeting is hybrid, but you can’t see our session through a remote link. But I’d like to hear what you think about this question–in the comments to this post. Continue reading

Categories: Error Statistics | 4 Comments

Philo of Sci Assoc (PSA) Session: Current Debates on Statistical Modeling and Inference

 


The Philosophy of Science Association (PSA) is holding its biennial meeting (one year late)–live/hybrid/remote*–in November 2021, and I plan to be there (my first in-person meeting since Feb 2020). Some of the members from the 2019 Summer Seminar that I ran with Aris Spanos are in a Symposium:

Current Debates on Statistical Modeling and Inference
on November 13, 9 am-12:15 pm

Here are the members and talks (Link to session/abstracts):

  • Aris Spanos (Virginia Tech): Self-Correction and Statistical Misspecification (co-author Deborah Mayo, Virginia Tech)
  • Ruobin Gong (Rutgers): Measuring Severity in Statistical Inference
  • Riet van Bork (University of Amsterdam): Psychometric Models: Statistics and Interpretation (co-author Jan-Willem Romeijn, University of Groningen)
  • Marcello di Bello (Lehman College CUNY): Is Algorithmic Fairness Possible?
  • Elay Shech (Auburn University): Statistical Modeling, Mis-specification Testing, and Exploration (co-author Mike Tamir, Berkeley)

Continue reading

Categories: Error Statistics | 1 Comment

The (Vaccine) Booster Wars: A prepost


We’re always reading about how the pandemic has created a new emphasis on preprints, so it stands to reason that non-reviewed preposts would now have a place in blogs. Maybe then I’ll “publish” some of the half-baked posts languishing in draft on errorstatistics.com. I’ll update or replace this prepost after reviewing.

The Booster wars

Continue reading

Categories: the (Covid vaccine) booster wars | 19 Comments

Workshop-New Date!

The Statistics Wars
and Their Casualties

New Date!

4-5 April 2022

London School of Economics (CPNSS)

Yoav Benjamini (Tel Aviv University), Alexander Bird (University of Cambridge), Mark Burgman (Imperial College London),  Daniele Fanelli (London School of Economics and Political Science), Roman Frigg (London School of Economics and Political Science), Stephen Guettinger (London School of Economics and Political Science), David Hand (Imperial College London), Margherita Harris (London School of Economics and Political Science), Christian Hennig (University of Bologna), Katrin Hohl (City University London), Daniël Lakens (Eindhoven University of Technology), Deborah Mayo (Virginia Tech), Richard Morey (Cardiff University), Stephen Senn (Edinburgh, Scotland), Jon Williamson (University of Kent) Continue reading

Categories: Error Statistics | Leave a comment

All She Wrote (so far): Error Statistics Philosophy: 10 years on

Dear Reader: I began this blog 10 years ago (Sept. 3, 2011)! A double celebration is taking place at the Elbar Room, remotely for the first time due to Covid, both for the blog and the 3-year anniversary of the physical appearance of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars [SIST] (CUP, 2018). A special rush edition made an appearance on Sept 3, 2018, in time for the RSS meeting in Cardiff, where we had a session deconstructing the arguments against statistical significance tests (with Sir David Cox, Richard Morey and Aris Spanos). Join us between 7 and 8 pm for a drink of Elba Grease.


Many of the discussions in the book were importantly influenced (corrected and improved) by readers’ comments on the blog over the years. I posted several excerpts and mementos from SIST here. I thank readers for their input. Readers might want to look up the topics in SIST on this blog to check out the comments, and see how ideas were developed, corrected, and turned into “excursions” in SIST.

I recently invited readers to weigh in on the ASA Task Force on Statistical Significance and Replicability–any time through September–to be part of a joint guest post (or posts). All contributors will get a free copy of SIST. Continue reading

Categories: 10 year memory lane, Statistical Inference as Severe Testing | Leave a comment

Should Bayesian Clinical Trialists Wear Error Statistical Hats? (i)

 

I. A principled disagreement

The other day I was in a practice session (on Zoom) for a panel I’m on about how different approaches and philosophies (Frequentist, Bayesian, machine learning) might explain “why we disagree” when interpreting clinical trial data. The focus is radiation oncology.[1] An important point of disagreement between frequentists (error statisticians) and Bayesians concerns whether, and if so how, to modify inferences in the face of a variety of selection effects, multiple testing, and stopping for interim analysis. Such multiplicities directly alter the capabilities of methods to avoid erroneously interpreting data, so the frequentist error probabilities are altered. By contrast, if an account conditions on the observed data, error probabilities drop out, and we get principles such as the stopping rule principle. My presentation included a quote from Bayarri and J. Berger (2004): Continue reading

Categories: multiple testing, statistical significance tests, strong likelihood principle | 26 Comments

Performance or Probativeness? E.S. Pearson’s Statistical Philosophy: Belated Birthday Wish

E.S. Pearson

This is a belated birthday post for E.S. Pearson (11 August 1895–12 June 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. Yes, I know I’ve been neglecting this blog as of late, but this topic will appear in a new guise in a post I’m writing now, to appear tomorrow.

HAPPY BELATED BIRTHDAY EGON!

Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long-run error properties is of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson. Continue reading

Categories: E.S. Pearson, Error Statistics | 2 Comments
