I came across an excellent post on a blog kept by Daniel Lakens: “So you banned p-values, how’s that working out for you?” He refers to the journal that recently banned significance tests, confidence intervals, and a vague assortment of other statistical methods, on the grounds that all such statistical inference tools are “invalid” since they don’t provide posterior probabilities of some sort (see my post). The editors’ charge of “invalidity” could hold water only if these error statistical methods purported to provide posteriors based on priors, which they do not. The entire methodology is based on methods in which probabilities arise to qualify a method’s capability to detect and avoid erroneous interpretations of data [0]. The logic is of the falsification variety found throughout science. Lakens, an experimental psychologist, does a great job delineating some of the untoward consequences of their inferential ban. I insert some remarks in black. Continue reading
Frequentstein: What’s wrong with (1 – β)/α as a measure of evidence against the null? (ii)
In their “Comment: A Simple Alternative to p-values” (on the ASA P-value document), Benjamin and Berger (2016) recommend that researchers report a pre-data Rejection Ratio:
It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)
The proposal is much more fully fleshed out in a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. Their recommendation is:
…that researchers should report the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results. (BBBS 2016, p. 3)….
The (pre-experimental) ‘rejection ratio’ R_pre, the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H1 and H0 respectively), is shown to capture the strength of evidence in the experiment for H1 over H0. (ibid., p. 2)
But in fact it does no such thing! [See my post from the FUSION conference here.] J. Berger and his co-authors will tell you that the rejection ratio (and a variety of other measures created over the years) is entirely frequentist because it is built out of frequentist error statistical measures. But a creation built from frequentist measures doesn’t mean the resulting animal captures frequentist error statistical reasoning. It might be a kind of Frequentstein monster! [1] Continue reading
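For concreteness, here is a minimal sketch of the pre-experimental rejection ratio for a one-sided z-test. This is my own illustration, not code from BBBS; the function and parameter names are hypothetical.

```python
# Sketch: pre-data rejection ratio R_pre = power / alpha
# for a one-sided z-test of H0: mu = 0 vs H1: mu = delta (sigma known).
from scipy.stats import norm

def rejection_ratio(alpha, delta, n, sigma=1.0):
    """Return (power, R_pre) for a one-sided z-test at level alpha."""
    z_crit = norm.ppf(1 - alpha)               # rejection cutoff on the z scale
    power = 1 - norm.cdf(z_crit - delta * n**0.5 / sigma)
    return power, power / alpha

# With alpha = 0.05, delta = 0.5*sigma, n = 25: power is about 0.80,
# so R_pre is about 16 -- rejection is "16 times more probable under H1".
print(rejection_ratio(0.05, 0.5, 25))
```

Note that the ratio is fixed before the data are in; it says nothing about how discrepant the particular rejection is from the null, which is just the point at issue.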
Fallacies of Rejection, Nouvelle Cuisine, and assorted New Monsters
Whenever I’m in London, my criminologist friend Katrin H. and I go in search of stand-up comedy. Since it’s Saturday night (and I’m in London), we’re setting out in search of a good comedy club (I’ll complete this post upon return). A few years ago we heard Jackie Mason do his shtick, a one-man show billed as his swan song to England. It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix. Still, hearing his rants for the nth time was often quite hilarious. It turns out that he has already been back doing another “final shtick tour” in England, but not tonight.
A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!
As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond.
But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which, to me, are not jokes. This blog found its reason for being partly as a place to expose, understand, and avoid them. I had earlier used this Jackie Mason opening to launch into a well-known fallacy of rejection using statistical significance tests. This time around I’m going to go further. I began by needling some leading philosophers of statistics: Continue reading
A. Spanos: Talking back to the critics using error statistics
With all the recent attention to kvetching about significance tests, it’s an apt time to reblog Aris Spanos’ overview of the error statistician talking back to the critics [1]. A related paper for your Saturday night reading is Mayo and Spanos (2011).[2] It mixes the error statistical philosophy of science with its philosophy of statistics, introduces severity, and responds to 13 criticisms and howlers.
I’m going to comment on some of the ASA discussion contributions I hadn’t discussed earlier. Please share your thoughts in relation to any of this.
[1] It was first blogged here, as part of our seminar 2 years ago.
[2] For those seeking a bit more balance to the main menu offered in the ASA Statistical Significance Reference list.
See also on this blog:
A. Spanos, “Recurring controversies about p-values and confidence intervals revisited”
A. Spanos, “Lecture on frequentist hypothesis testing”
Philosophy-laden meta-statistics: Is “technical activism” free of statistical philosophy? (ii)
Ben Goldacre (of Bad Science), in a Nature article today (“Make Journals Report Clinical Trials Properly”), expresses puzzlement as to why bad statistical practices—“selective publication, inadequate descriptions of study methods that block efforts at replication, and data dredging through undisclosed use of multiple analytical strategies”—are continuing to occur even in the face of the new “technical activism” (a great term he introduces). Worse, these questionable practices are actually being defended by some medical journals: “[J]ournal editors now need to engage in a serious public discussion on why this is still happening”. Goldacre doesn’t consider that at least some of the pushback he’s seeing has a basis in statistical philosophy! I explain at the end. Here’s Goldacre (Feb 2, 2016; emphasis mine):

Science is in flux. The basics of a rigorous scientific method were worked out many years ago, but there is now growing concern about systematic structural flaws that undermine the integrity of published data: selective publication, inadequate descriptions of study methods that block efforts at replication, and data dredging through undisclosed use of multiple analytical strategies. Problems such as these undermine the integrity of published data and increase the risk of exaggerated or even false-positive findings, leading collectively to the ‘replication crisis’. Continue reading
Hocus pocus! Adopt a magician’s stance, if you want to reveal statistical sleights of hand
When they sought to subject Uri Geller to the scrutiny of scientists, magicians had to be brought in, because only they were sufficiently trained to spot the subtle sleights of hand by which a magician misdirects. We, too, have to be magicians to discern the subtle misdirections and shifts of meaning in discussions of statistical significance tests (and other methods)—even within the same statistical guidebook. We needn’t suppose anything deliberately devious is going on at all! Often the shifts of meaning grow out of one or another critical argument, and these days they trickle down quickly to statistical guidebooks, thanks to popular articles on the “statistics crisis in science”. The danger is that the guidebooks then contain inconsistencies. To adopt the magician’s stance is to be on the lookout for standard sleights of hand. There aren’t that many.[0]
I don’t know Jim Frost, but he gives statistical guidance at the Minitab blog. The purpose of my previous post was to point out that Frost uses the probability of a Type I error in two incompatible ways in his posts on significance tests. I assumed he’d want to clear this up, but so far he has not. His response to a comment I made on his blog is this: Continue reading
High error rates in discussions of error rates (1/21/16 update)
Do you ever find yourself holding your breath when reading an exposition of significance tests that’s going swimmingly so far? If you’re a frequentist in exile, you know what I mean. I’m sure others feel this way too. When I came across Jim Frost’s posts on The Minitab Blog, I thought I might actually have located a success story. He does a good job explaining P-values (with charts) and the duality between P-values and confidence levels, and he even rebuts the latest “test ban” (the “Don’t Ask, Don’t Tell” policy). Mere descriptive reports of observed differences of the sort the editors recommend, Frost shows, are uninterpretable without a corresponding P-value or the equivalent. So far, so good. I have only small quibbles, such as the use of “likelihood” when probability is meant, and various and sundry nitpicky things. But watch how in some places significance levels are defined as the usual error probabilities and error rates—indeed, in the glossary for the site—while in other places it is denied that they provide error rates. In those other places, error probabilities and error rates shift their meaning to posterior probabilities, based on priors representing the “prevalence” of true null hypotheses. Continue reading
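To make the incompatibility vivid, here is a minimal sketch, my own and not Frost’s, contrasting the Type I error probability α with the posterior probability that a rejected null is true, which requires an assumed prevalence of true nulls:

```python
# alpha is a property of the test procedure alone; the probability that
# a *rejected* null is true depends also on power and on the assumed
# prevalence of true nulls among those tested -- a different quantity.
def prob_null_given_rejection(alpha, power, prevalence):
    """P(H0 true | rejection), with prevalence = proportion of true nulls."""
    p_reject = prevalence * alpha + (1 - prevalence) * power
    return prevalence * alpha / p_reject

# With alpha = 0.05 and power = 0.80: if 90% of tested nulls were true,
# ~36% of rejections would be of true nulls, even though alpha stays 5%.
print(prob_null_given_rejection(0.05, 0.80, 0.90))
```

Sliding between these two readings of “error rate” is precisely the sleight of hand at issue.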
WHIPPING BOYS AND WITCH HUNTERS (ii)
At least as apt today as 3 years ago… HAPPY HALLOWEEN! Memory Lane with new comments in blue.
In an earlier post I alleged that frequentist hypothesis tests often serve as whipping boys, by which I meant “scapegoats”, for the well-known misuses, abuses, and flagrant misinterpretations of tests (both simple Fisherian significance tests and Neyman-Pearson tests, although in different ways)—as well as for what really boils down to a field’s weaknesses in modeling, theorizing, experimentation, and data collection. Checking the history of the term, however, reveals a certain disanalogy with at least the original meaning of a “whipping boy”: an innocent boy who was punished when a medieval prince misbehaved and was in need of discipline. It was thought that seeing an innocent companion, often a friend, beaten for his own transgressions would be an effective way to ensure that the prince would not repeat the same mistake. But the flogging of significance tests, rather than serving as a tool for humble self-improvement and a commitment to avoiding flagrant rule violations, has tended instead to yield declarations that it is the rules that are invalid! The violators are excused as not being able to help it! The situation is more akin to witch hunting, which in some places became an occupation in its own right.
Now some early literature, e.g., Morrison and Henkel’s Significance Test Controversy (1962), performed an important service over fifty years ago. They alerted social scientists to the fallacies of significance tests: misidentifying a statistically significant difference with one of substantive importance, interpreting insignificant results as evidence for the null hypothesis (especially problematic with insensitive tests), and the like. Chastising social scientists for applying significance tests in slavish and unthinking ways, the contributors called attention to a cluster of pitfalls and fallacies of testing. Continue reading
P-value madness: A puzzle about the latest test ban (or ‘don’t ask, don’t tell’)
Given the excited whispers about the upcoming meeting of the American Statistical Association Committee on P-Values and Statistical Significance, it’s an apt time to reblog my post on the “Don’t Ask Don’t Tell” policy that began the latest brouhaha!
A large number of people have sent me articles on the “test ban” of statistical hypothesis tests and confidence intervals at a journal called Basic and Applied Social Psychology (BASP)[i]. Enough. One person suggested that since it came so close to my recent satirical Task force post, I either had advance knowledge or some kind of ESP. Oh please, no ESP required. None of this is the slightest bit surprising, and I’ve seen it before; I simply didn’t find it worth blogging about (but Saturday night is a perfect time to read/reread the (satirical) Task force post [ia]). Statistical tests are being banned, say the editors, because they purport to give probabilities of null hypotheses (really?) and do not; hence they are “invalid”.[ii] (Confidence intervals are thrown in the waste bin as well—also claimed “invalid”.) “The state of the art remains uncertain” regarding inferential statistical procedures, say the editors. I don’t know, maybe some good will come of all this.
Yet there’s a part of their proposal that brings up some interesting logical puzzles, and logical puzzles are my thing. In fact, I think there is a mistake the editors should remedy, lest authors be led into disingenuous stances, and strange tangles ensue. I refer to their rule that authors be allowed to submit papers whose conclusions are based on allegedly invalid methods so long as, once accepted, they remove any vestiges of them! Continue reading
From our “Philosophy of Statistics” session: APS 2015 convention
“The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference,” at the 2015 American Psychological Society (APS) Annual Convention in NYC, May 23, 2015:
D. Mayo: “Error Statistical Control: Forfeit at your Peril”
S. Senn: “‘Repligate’: reproducibility in statistical studies. What does it mean and in what sense does it matter?”
A. Gelman: “The statistical crisis in science” (this is not his exact presentation, but he focused on some of these slides)
For more details see this post.
Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)
Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Double Jeopardy?: Judge Jeffreys Upholds the Law
“But this could be dealt with in a rough empirical way by taking twice the standard error as a criterion for possible genuineness and three times the standard error for definite acceptance”. Harold Jeffreys (1) (p. 386)
This is the second of two posts on P-values. In the first, The Pathetic P-Value, I considered the relation of P-values to Laplace’s Bayesian formulation of induction, pointing out that P-values, whilst they had a very different interpretation, were numerically very similar to a type of Bayesian posterior probability. In this one, I consider their relation, or lack of it, to Harold Jeffreys’s radically different approach to significance testing. (An excellent account of the development of Jeffreys’s thought is given by Howie (2), which I recommend highly.)
The story starts with Cambridge philosopher CD Broad (1887-1971), who in 1918 pointed to a difficulty with Laplace’s Law of Succession. Broad considers the problem of drawing counters from an urn containing n counters and supposes that all m drawn had been observed to be white. He now considers two very different questions, which have two very different probabilities and writes:
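[The quoted passage appears as an image in the original. For reference, the two probabilities Broad contrasts are the standard rule-of-succession results under Laplace’s uniform prior, given m white counters drawn from an urn of n; the reconstruction is mine:]

```latex
% Broad's two probabilities, reconstructed (standard rule-of-succession
% results under a uniform prior), for m white draws from n counters:
\[
  P(\text{the next counter drawn is white}) = \frac{m+1}{m+2},
  \qquad
  P(\text{all } n \text{ counters are white}) = \frac{m+1}{n+1}.
\]
```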
Note that in the case that only one counter remains we have n = m + 1 and the two probabilities are the same. However, if n > m+1 they are not the same and in particular if m is large but n is much larger, the first probability can approach 1 whilst the second remains small.
The practical implication is that just because Bayesian induction implies that a long sequence of successes (and no failures) supports the belief that the next trial will be a success, it does not follow that one should believe that all future trials will be successes. This distinction is often misunderstood. Here is The Economist getting it wrong in September 2000:
The canonical example is to imagine that a precocious newborn observes his first sunset, and wonders whether the sun will rise again or not. He assigns equal prior probabilities to both possible outcomes, and represents this by placing one white and one black marble into a bag. The following day, when the sun rises, the child places another white marble in the bag. The probability that a marble plucked randomly from the bag will be white (ie, the child’s degree of belief in future sunrises) has thus gone from a half to two-thirds. After sunrise the next day, the child adds another white marble, and the probability (and thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise.
See Dicing with Death (3) (pp. 76-78).
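A minimal sketch of the distinction (my own illustration, not Senn’s code): the rule of succession drives belief in the next success toward certainty, while belief that a long run of future trials will all succeed stays small.

```python
# After s consecutive successes and no failures, Laplace's rule of
# succession gives P(next success) = (s+1)/(s+2). Chaining the updates,
# P(next k trials all succeed) telescopes to (s+1)/(s+k+1).
from math import prod

def p_next(s):
    return (s + 1) / (s + 2)

def p_next_k(s, k):
    # product of successive rule-of-succession updates
    return prod((s + j + 1) / (s + j + 2) for j in range(k))

s = 1000                     # a long run of observed sunrises
print(p_next(s))             # ~0.999: tomorrow's sunrise is near-certain
print(p_next_k(s, 10**6))    # ~0.001: a million more sunrises is not
```

This is exactly Broad’s point: the first probability approaches 1 while the second remains small.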
The practical relevance of this is that scientific laws cannot be established by Laplacian induction. Jeffreys (1891-1989) puts it thus:
Thus I may have seen 1 in 1000 of the ‘animals with feathers’ in England; on Laplace’s theory the probability of the proposition, ‘all animals with feathers have beaks’, would be about 1/1000. This does not correspond to my state of belief or anybody else’s. (p. 128)
A puzzle about the latest test ban (or ‘don’t ask, don’t tell’)
A large number of people have sent me articles on the “test ban” of statistical hypothesis tests and confidence intervals at a journal called Basic and Applied Social Psychology (BASP)[i]. Enough. One person suggested that since it came so close to my recent satirical Task force post, I either had advance knowledge or some kind of ESP. Oh please, no ESP required. None of this is the slightest bit surprising, and I’ve seen it before; I simply didn’t find it worth blogging about. Statistical tests are being banned, say the editors, because they purport to give probabilities of null hypotheses (really?) and do not; hence they are “invalid”.[ii] (Confidence intervals are thrown in the waste bin as well—also claimed “invalid”.) “The state of the art remains uncertain” regarding inferential statistical procedures, say the editors. I don’t know, maybe some good will come of all this.
Yet there’s a part of their proposal that brings up some interesting logical puzzles, and logical puzzles are my thing. In fact, I think there is a mistake the editors should remedy, lest authors be led into disingenuous stances, and strange tangles ensue. I refer to their rule that authors be allowed to submit papers whose conclusions are based on allegedly invalid methods so long as, once accepted, they remove any vestiges of them!
“Question 1. Will manuscripts with p-values be desk rejected automatically?
Answer to Question 1. No. If manuscripts pass the preliminary inspection, they will be sent out for review. But prior to publication, authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about “significant” differences or lack thereof, and so on).”
Now if these measures are alleged to be irrelevant and invalid instruments for statistical inference, then why should they be included in the peer review process at all? Will reviewers be told to ignore them? That would only seem fair: papers should not be judged by criteria alleged to be invalid, but how will reviewers blind themselves to them? It would seem the measures should be excluded from the get-go. If they are included in the review, why shouldn’t the readers see what the reviewers saw when they recommended acceptance?
But here’s where the puzzle really sets in. If the authors must free their final papers from such encumbrances as sampling distributions and error probabilities, what will be the basis presented for their conclusions in the published paper? Presumably, from the notice, they are allowed only mere descriptive statistics or non-objective Bayesian reports (added: actually I can’t tell which kind of Bayesianism they allow, given the Fisher reference, which doesn’t fit*). Won’t this be tantamount to requiring authors to support their research in a way that is either (actually) invalid, or has little to do with the error statistical properties that were actually reported and on which the acceptance was based?[ii] Continue reading
2015 Saturday Night Brainstorming and Task Forces: (4th draft)
Saturday Night Brainstorming: The TFSI on NHST–part reblog from here and here, with a substantial 2015 update!
Each year, leaders of the movement to “reform” statistical methodology in psychology, social science, and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI) and propose new regulations they would like to see adopted, not just in the APA publication manual any more, but in all science journals! Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Frustrated that the TFSI has still not banned null hypothesis significance testing (NHST)–a fallacious version of statistical significance tests that dares to violate Fisher’s first rule: It’s illicit to move directly from statistical to substantive effects–the New Reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?
Most recently, the group has helped successfully launch a variety of “replication and reproducibility projects”. Having discovered how much the reward structure encourages bad statistics and gaming the system, they have cleverly pushed to change the reward structure: failed replications (from a group chosen by a crowd-sourced band of replicationistas) would not be hidden in those dusty old file drawers, but would be guaranteed publication without that long, drawn-out process of peer review. Do these failed replications indicate the original study was a false positive? Or that the replication attempt is a false negative? It’s hard to say.
This year, as is typical, there is a new member who is pitching in to contribute what he hopes are novel ideas for reforming statistical practice. In addition, for the first time, there is a science reporter blogging the meeting for her next freelance “bad statistics” piece for a high-impact science journal. Notice that this committee only grows; no one has dropped off in the 3 years I’ve followed them.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Pawl: This meeting will come to order. I am pleased to welcome our new member, Dr. Ian Nydes, adding to the medical strength we have recently built with epidemiologist S.C. In addition, we have a science writer with us today, Jenina Oozo. To familiarize everyone, we begin with a review of old business and gradually turn to new business.
Franz: It’s so darn frustrating after all these years to see researchers still using NHST methods; some of the newer modeling techniques routinely build on numerous applications of those pesky tests.
Jake: And the premier publication outlets in the social sciences still haven’t mandated the severe reforms sorely needed. Hopefully the new blood, Dr. Ian Nydes, can help us go beyond resurrecting the failed attempts of the past. Continue reading
All I want for Chrismukkah is that critics & “reformers” quit howlers of testing (after 3 yrs of blogging)! So here’s Aris Spanos “Talking Back!”
This was initially posted as slides from our joint Spring 2014 seminar: “Talking Back to the Critics Using Error Statistics”. (You can enlarge them.) Related reading is Mayo and Spanos (2011).
A. Spanos: Talking back to the critics using error statistics (Phil6334)
Aris Spanos’ overview of error statistical responses to familiar criticisms of statistical tests. Related reading is Mayo and Spanos (2011).
T. Kepler: “Trouble with ‘Trouble at the Lab’?” (guest post)
Tom Kepler’s guest post arose in connection with my November 9 post & comments.
Professor Thomas B. Kepler
Department of Microbiology
Department of Mathematics & Statistics
Boston University School of Medicine
There is much to say about the article in the Economist, but the first thing to note is that it is far more balanced than its sensational headline promises. Promising to throw open the curtain on “Unreliable research” is mere click-bait for the science-averse readers who have recently found validation against their intellectual insecurities in the populist uprising against the shadowy world of the scientist. What with the East Anglia conspiracy, and so on, there’s no such thing as “too skeptical” when it comes to science.
There is some remarkably casual reporting in an article that purports to be concerned with mechanisms to assure that inaccuracies not be perpetuated.
For example, the authors cite the comment in Nature by Begley and Ellis and summarize it thus: …scientists at Amgen, an American drug company, tried to replicate 53 studies that they considered landmarks in the basic science of cancer, often co-operating closely with the original researchers to ensure that their experimental technique matched the one used first time round. Stan Young, in his comments to Mayo’s blog, adds, “These claims can not be replicated – even by the original investigators! Stop and think of that.” But in fact the role of the original investigators is described as follows in Begley and Ellis: “…when findings could not be reproduced, an attempt was made to contact the original authors, discuss the discrepant findings, exchange reagents and repeat experiments under the authors’ direction, occasionally even in the laboratory of the original investigator.” (Emphasis added.) Now, please stop and think about what agenda is served by eliding the tempered language of the original.
Both the Begley and Ellis comment and the brief correspondence by Prinz et al. also cited in this discussion are about laboratories in commercial pharmaceutical companies failing to reproduce experimental results. While deciding how to interpret their findings, it would be prudent to bear in mind the insight from Harry Collins, the sociologist of science paraphrased in the Economist piece as indicating that “performing an experiment always entails what sociologists call “tacit knowledge”—craft skills and extemporisations that their possessors take for granted but can pass on only through example. Thus if a replication fails, it could be because the repeaters didn’t quite get these je-ne-sais-quoi bits of the protocol right.” Indeed, I would go further and conjecture that few experimental biologists would hold out hope that any one laboratory could claim the expertise necessary to reproduce the results of 53 ground-breaking papers in diverse specialties, even within cancer drug discovery. And to those who are unhappy that authors often do not comply with the journals’ clear policy of data-sharing, how do you suppose you would fare getting such data from the pharmaceutical companies that wrote these damning papers? Or the authors of the papers themselves? Nature had to clarify, writing two months after the publication of Begley and Ellis, “Nature, like most journals, requires authors of research papers to make their data available on request. In this less formal Comment, we chose not to enforce this requirement so that Begley and Ellis could abide by the legal agreements [they made with the original authors].” Continue reading
P-values can’t be trusted except when used to argue that P-values can’t be trusted!
Have you noticed that some of the harshest criticisms of frequentist error-statistical methods these days rest on methods and grounds that the critics themselves purport to reject? Is there a whiff of inconsistency in proclaiming an “anti-hypothesis-testing stance” while in the same breath extolling the uses of statistical significance tests and p-values in mounting criticisms of significance tests and p-values? I was reminded of this in the last two posts (comments) on this blog (here and here) and one from Gelman from a few weeks ago (“Interrogating p-values”).
Gelman quotes from a note he is publishing:
“..there has been a growing sense that psychology, biomedicine, and other fields are being overwhelmed with errors … . In two recent series of papers, Gregory Francis and Uri Simonsohn and collaborators have demonstrated too-good-to-be-true patterns of p-values in published papers, indicating that these results should not be taken at face value.”
But this fraudbusting is based on finding statistically significant differences from null hypotheses (e.g., nulls asserting random assignments of treatments)! If we are to hold small p-values untrustworthy, we would be hard pressed to take them as legitimating these criticisms, especially those of a career-ending sort.
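The flavor of these fraudbusting checks can be captured in a minimal sketch (my illustration of the general idea, not the actual Francis or Simonsohn procedures):

```python
# If each of k independent studies had roughly 50% power against the
# effects claimed, the chance that all k reach p < 0.05 is 0.5**k.
# An unbroken run of successes is thus itself statistically significant
# evidence of selection -- a significance-test argument through and through.
k, power = 10, 0.5
p_all_significant = power ** k
print(p_all_significant)     # 0.5**10 ~ 0.001: "too good to be true"
```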
…in addition to the well-known difficulties of interpretation of p-values…, and to the problem that, even when all comparisons have been openly reported and thus p-values are mathematically correct, the ‘statistical significance filter’ ensures that estimated effects will be in general larger than true effects, with this discrepancy being well over an order of magnitude in settings where the true effects are small… (Gelman 2013)
But surely anyone who believed this would be up in arms about using small p-values as evidence of statistical impropriety. Am I the only one wondering about this?*
CLARIFICATION (6/15/13): Corey’s comment today leads me to a clarification, lest anyone misunderstand my point. I am sure that Francis, Simonsohn, and others would never be using p-values and associated methods in the service of criticism if they did not regard the tests as legitimate scientific tools. I wasn’t talking about them. I was alluding to critics of tests who point to their work as evidence that the statistical tools are not legitimate. Now maybe Gelman only intends to say, what we know and agree with, that tests can be misused and misinterpreted. But in these comments, our exchanges, and elsewhere, it is clear he is saying something much stronger. In my view, the use of significance tests by debunkers should have been taken as strong support for the value of the tools, correctly used. In short, I thought it was a success story! And I was rather perplexed to see somewhat the reverse.
______________________
*This just in: if one wants to see a genuine quack extremist** who was outed long ago***, see Ziliak’s article declaring that the Higgs physicists are pseudoscientists for relying on significance levels (in the Financial Post, 6/12/13)!
**I am not placing the critics referred to above under this umbrella in the least.
***For some reviews of Ziliak and McCloskey, see widgets on left. For their flawed testimony on the Matrixx case, please search this blog.