Slides from my March 17 presentation on “Severe Testing: The Key to Error Correction” given at the Boston Colloquium for Philosophy of Science Alfred I.Taub forum on “Understanding Reproducibility and Error Correction in Science.”

Slides from my March 17 presentation on “Severe Testing: The Key to Error Correction” given at the Boston Colloquium for Philosophy of Science Alfred I.Taub forum on “Understanding Reproducibility and Error Correction in Science.”

Here’s the follow-up post to the one I reblogged on Feb 3 (please read that one first). When they sought to subject Uri Geller to the scrutiny of scientists, magicians had to be brought in because only they were sufficiently trained to spot the subtle sleight of hand shifts by which the magician tricks by misdirection. We, too, have to be magicians to discern the subtle misdirections and shifts of meaning in the discussions of statistical significance tests (and other methods)—even by the same statistical guide. We needn’t suppose anything deliberately devious is going on at all! Often, the statistical guidebook reflects shifts of meaning that grow out of one or another critical argument. These days, they trickle down quickly to statistical guidebooks, thanks to popular articles on the “statistics crisis in science”. The danger is that their own guidebooks contain inconsistencies. To adopt the magician’s stance is to be on the lookout for standard sleights of hand. There aren’t that many.[0]

I don’t know Jim Frost, but he gives statistical guidance at the minitab blog. The purpose of my previous post is to point out that Frost uses the probability of a Type I error in two incompatible ways in his posts on significance tests. I assumed he’d want to clear this up, but so far he has not. His response to a comment I made on his blog is this: Continue reading

Categories: frequentist/Bayesian, P-values, reforming the reformers, S. Senn, Statistics
39 Comments

“Guides for the Perplexed” in statistics become “Guides to Become Perplexed” when “error probabilities” (in relation to statistical hypotheses tests) are confused with posterior probabilities of hypotheses. Moreover, these posteriors are neither frequentist, subjectivist, nor default. Since this doublespeak is becoming more common in some circles, it seems apt to reblog a post from one year ago (you may wish to check the comments).

Do you ever find yourself holding your breath when reading an exposition of significance tests that’s going swimmingly so far? If you’re a frequentist in exile, you know what I mean. I’m sure others feel this way too. When I came across Jim Frost’s posts on The Minitab Blog, I thought I might actually have located a success story. He does a good job explaining P-values (with charts), the duality between P-values and confidence levels, and even rebuts the latest “test ban” (the “Don’t Ask, Don’t Tell” policy). Mere descriptive reports of observed differences that the editors recommend, Frost shows, are uninterpretable without a corresponding P-value or the equivalent. So far, so good. I have only small quibbles, such as the use of “likelihood” when meaning probability, and various and sundry nitpicky things. But watch how in some places significance levels are defined as the usual error probabilities —indeed in the glossary for the site—while in others it is denied they provide error probabilities. In those other places, error probabilities and error rates shift their meaning to posterior probabilities, based on priors representing the “prevalence” of true null hypotheses.

Begin with one of his kosher posts “Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics” (blue is Frost): Continue reading

When logical fallacies of statistics go uncorrected, they are repeated again and again…and again. And so it is with the limb-sawing fallacy I first posted in one of my “Overheard at the Comedy Hour” posts.* It now resides as a comic criticism of significance tests in a paper by Szucs and Ioannidis (posted this week), Here’s their version:

“[P]aradoxically, when we achieve our goal and successfully reject

Hwe will actually be left in complete existential vacuum because during the rejection of_{0 }HNHST ‘_{0 }saws off its own limb’ (Jaynes, 2003; p. 524): If we manage to rejectHthen it follows that pr(data or more extreme data|_{0}H) is useless because_{0}His not true” (p.15)._{0}

Here’s Jaynes (p. 524):

“Suppose we decide that the effect exists; that is, we reject [null hypothesis]

H. Surely, we must also reject probabilities conditional on_{0}H, but then what was the logical justification for the decision? Orthodox logic saws off its own limb.’ “_{0}

*Ha! Ha!* By this reasoning, no hypothetical testing or falsification could ever occur. As soon as *H* is falsified, the grounds for falsifying disappear! If *H*: all swans are white, then if I see a black swan, *H* is falsified. But according to this criticism, we can no longer assume the deduced prediction from *H*! What? Continue reading

Categories: Error Statistics, P-values, reforming the reformers, Statistics
14 Comments

Here are the slides from our talk at the Society for Philosophy of Science in Practice (SPSP) conference. I covered the first 27, Parker the rest. The abstract is here:

I came across an excellent post on a blog kept by Daniel Lakens: “So you banned p-values, how’s that working out for you?” He refers to the journal that recently banned significance tests, confidence intervals, and a vague assortment of other statistical methods, on the grounds that all such statistical inference tools are “invalid” since they don’t provide posterior probabilities of some sort (see my post). The editors’ charge of “invalidity” could only hold water if these error statistical methods purport to provide posteriors based on priors, which is false. The entire methodology is based on methods in which probabilities arise to qualify the method’s capabilities to detect and avoid erroneous interpretations of data [0]. The logic is of the falsification variety found throughout science. Lakens, an experimental psychologist, does a great job delineating some of the untoward consequences of their inferential ban. I insert some remarks in * black*. Continue reading

In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)

The recommendation is much more fully fleshed out in a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. Their recommendation is:

…that researchers should report the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results. (BBBS 2016, p. 3)….

The (pre-experimental) ‘rejection ratio’ R

_{pre}, the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting underH_{1}andH_{0}respectively), is shown to capture the strength of evidence in the experiment for H_{1 }over H_{0}. (ibid., p. 2)

*But in fact it does no such thing!* [See my post from the FUSION conference here.] J. Berger, and his co-authors, will tell you the rejection ratio (and a variety of other measures created over the years) are entirely frequentist because they are created out of frequentist error statistical measures. But a creation built on frequentist measures doesn’t mean the resulting animal captures frequentist error statistical reasoning. It might be a kind of Frequentstein monster! [1] Continue reading

Categories: J. Berger, power, reforming the reformers, S. Senn, Statistical power, Statistics
36 Comments

Whenever I’m in London, my criminologist friend Katrin H. and I go in search of stand-up comedy. Since it’s Saturday night (and I’m in London), we’re setting out in search of a good comedy club (I’ll complete this post upon return). A few years ago we heard Jackie Mason do his shtick, a one-man show billed as his swan song to England. It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix. Still, hearing his rants for the n^{th} time was often quite hilarious. It turns out that he has already been back doing another “final shtick tour” in England, but not tonight.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond:

But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. I had earlier used this Jackie Mason opening to launch into a well-known fallacy of rejection using statistical significance tests. I’m going to go further this time around. I began by needling some leading philosophers of statistics: Continue reading

Given all the recent attention given to kvetching about significance tests, it’s an apt time to reblog Aris Spanos’ overview of the error statistician talking back to the critics [1]. A related paper for your Saturday night reading is Mayo and Spanos (2011).[2] It mixes the error statistical philosophy of science with its philosophy of statistics, introduces severity, and responds to 13 criticisms and howlers.

I’m going to comment on some of the ASA discussion contributions I hadn’t discussed earlier. Please share your thoughts in relation to any of this.

[1]It was first blogged here, as part of our seminar 2 years ago.

[2] For those seeking a bit more balance to the main menu offered in the ASA Statistical Significance Reference list.

See also on this blog:

A. Spanos, “Recurring controversies about p-values and confidence intervals revisited”

A. Spanos, “Lecture on frequentist hypothesis testing

Ben Goldacre (of *Bad Science*), in a *Nature* article today (“Make Journals Report Clinical Trials Properly“), expresses puzzlement as to why bad statistical practices– “selective publication, inadequate descriptions of study methods that block efforts at replication, and data dredging through undisclosed use of multiple analytical strategies“–are continuing to occur even in the face of the new “technical activism” (a great term he introduces). Worse, these questionable practices are actually being defended by some medical journals. “[J]ournal editors now need to engage in a serious public discussion on why this is still happening“. Goldacre doesn’t consider that at least some of the pushback he’s seeing has a basis in statistical philosophy! I explain at the end. Here’s Goldacre (Feb 2, 2016; emphasis mine):

Science is in flux. The basics of a rigorous scientific method were worked out many years ago, but there is now growing concern about systematic structural flaws that undermine the integrity of published data: selective publication, inadequate descriptions of study methods that block efforts at replication, and data dredging through undisclosed use of multiple analytical strategies. Problems such as these undermine the integrity of published data and increase the risk of exaggerated or even false-positive findings, leading collectively to the ‘replication crisis’. Continue reading

Categories: P-values, reforming the reformers, Statistics
60 Comments

When they sought to subject Uri Geller to the scrutiny of scientists, magicians had to be brought in because only they were sufficiently trained to spot the subtle sleight of hand shifts by which the magician tricks by misdirection. We, too, have to be magicians to discern the subtle misdirections and shifts of meaning in the discussions of statistical significance tests (and other methods)—even by the same statistical guide. We needn’t suppose anything deliberately devious is going on at all! Often, the statistical guidebook reflects shifts of meaning that grow out of one or another critical argument. These days, they trickle down quickly to statistical guidebooks, thanks to popular articles on the “statistics crisis in science”. The danger is that their own guidebooks contain inconsistencies. To adopt the magician’s stance is to be on the lookout for standard sleights of hand. There aren’t that many.[0]

I don’t know Jim Frost, but he gives statistical guidance at the minitab blog. The purpose of my previous post is to point out that Frost uses the probability of a Type I error in two incompatible ways in his posts on significance tests. I assumed he’d want to clear this up, but so far he has not. His response to a comment I made on his blog is this: Continue reading

Categories: P-values, reforming the reformers, statistical tests
2 Comments

Do you ever find yourself holding your breath when reading an exposition of significance tests that’s going swimmingly so far? If you’re a frequentist in exile, you know what I mean. I’m sure others feel this way too. When I came across Jim Frost’s posts on The Minitab Blog, I thought I might actually have located a success story. He does a good job explaining P-values (with charts), the duality between P-values and confidence levels, and even rebuts the latest “test ban” (the “Don’t Ask, Don’t Tell” policy). Mere descriptive reports of observed differences that the editors recommend, Frost shows, are uninterpretable without a corresponding P-value or the equivalent. So far, so good. I have only small quibbles, such as the use of “likelihood” when meaning probability, and various and sundry nitpicky things. But watch how in some places significance levels are defined as the usual error probabilities and error rates—indeed in the glossary for the site—while in others it is denied they provide error rates. In those other places, error probabilities and error rates shift their meaning to posterior probabilities, based on priors representing the “prevalence” of true null hypotheses. Continue reading

**Stephen Senn**

Head of Competence Center for Methodology and Statistics (CCMS)

Luxembourg Institute of Health

**Double Jeopardy?: Judge Jeffreys Upholds the Law*[4]**

“But this could be dealt with in a rough empirical way by taking twice the standard error as a criterion for possible genuineness and three times the standard error for definite acceptance”. Harold Jeffreys(1) (p386)

This is the second of two posts on P-values. In the first, The Pathetic P-Value, I considered the relation of P-values to Laplace’s Bayesian formulation of induction, pointing out that that P-values, whilst they had a very different interpretation, were numerically very similar to a type of Bayesian posterior probability. In this one, I consider their relation or lack of it, to Harold Jeffreys’s radically different approach to significance testing. (An excellent account of the development of Jeffreys’s thought is given by Howie(2), which I recommend highly.)

The story starts with Cambridge philosopher CD Broad (1887-1971), who in 1918 pointed to a difficulty with Laplace’s Law of Succession. Broad considers the problem of drawing counters from an urn containing *n* counters and supposes that all *m* drawn had been observed to be white. He now considers two very different questions, which have two very different probabilities and writes: Continue reading

At least as apt today as 3 years ago…* HAPPY HALLOWEEN!* Memory Lane with new comments in

In an earlier post I alleged that frequentist hypotheses tests often serve as whipping boys, by which I meant “scapegoats”, for the well-known misuses, abuses, and flagrant misinterpretations of tests (both simple Fisherian significance tests and Neyman-Pearson tests, although in different ways)—as well as for what really boils down to a field’s weaknesses in modeling, theorizing, experimentation, and data collection. Checking the history of this term however, there is a certain disanalogy with at least the original meaning of a “whipping boy,” namely, an innocent boy who was punished when a medieval prince misbehaved and was in need of discipline. It was thought that seeing an innocent companion, often a friend, beaten for his own transgressions would supply an effective way to ensure the prince would not repeat the same mistake. But significance tests floggings, rather than a tool for a humbled self-improvement and commitment to avoiding flagrant rule violations, has tended instead to yield declarations that it is the rules that are invalid! The violators are excused as not being able to help it! The situation is more akin to that of witch hunting that in some places became an occupation in its own right.

Now some early literature, e.g., Morrison and Henkel’s *Significance Test Controversy* (1962), performed an important service over fifty years ago. They alerted social scientists to the fallacies of significance tests: misidentifying a statistically significant difference with one of substantive importance, interpreting insignificant results as evidence for the null hypothesis—especially problematic with insensitive tests, and the like. Chastising social scientists for applying significance tests in slavish and unthinking ways, contributors call attention to a cluster of pitfalls and fallacies of testing. Continue reading

Given the excited whispers about the upcoming meeting of the American Statistical Association Committee on P-Values and Statistical Significance, it’s an apt time to **reblog my post **on the “Don’t Ask Don’t Tell” policy that began the latest brouhaha!

A large number of people have sent me articles on the “test ban” of statistical hypotheses tests and confidence intervals at a journal called *Basic and Applied Social Psychology* (BASP)[i]. *Enough*. One person suggested that since it came so close to my recent satirical Task force post, that I either had advance knowledge or some kind of ESP. *Oh please, no ESP required.*None of this is the slightest bit surprising, and I’ve seen it before; I simply didn’t find it worth blogging about (but Saturday night is a perfect time to read/reread the (satirical) Task force post [ia]). Statistical tests are being banned, say the editors, because they purport to give probabilities of null hypotheses (*really*?) and do not, hence they are “invalid”.[ii] (Confidence intervals are thrown in the waste bin as well—also claimed “invalid”).“The state of the art remains uncertain” regarding inferential statistical procedures, say the editors. I don’t know, maybe some good will come of all this.

Yet there’s a part of their proposal that brings up some interesting logical puzzles, and logical puzzles are my thing. In fact, I think there is a mistake the editors should remedy, lest authors be led into disingenuous stances, and strange tangles ensue. I refer to their rule that authors be allowed to submit papers whose conclusions are based on allegedly invalid methods so long as, once accepted, they remove any vestiges of them! Continue reading

Categories: P-values, pseudoscience, reforming the reformers, Statistics
7 Comments

*“The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference*,” at the 2015 American Psychological Society (APS) Annual Convention in NYC, May 23, 2015:

**D. Mayo: “Error Statistical Control: Forfeit at your Peril” **

**S. Senn: “‘Repligate’: reproducibility in statistical studies. What does it mean and in what sense does it matter?”**

**A. Gelman: “The statistical crisis in science” (this is not his exact presentation, but he focussed on some of these slides)**

**For more details see this post.**

**Stephen Senn**

Head of Competence Center for Methodology and Statistics (CCMS)

Luxembourg Institute of Health

**Double Jeopardy?: Judge Jeffreys Upholds the Law**

“But this could be dealt with in a rough empirical way by taking twice the standard error as a criterion for possible genuineness and three times the standard error for definite acceptance”. Harold Jeffreys(1) (p386)

This is the second of two posts on P-values. In the first, The Pathetic P-Value, I considered the relation of P-values to Laplace’s Bayesian formulation of induction, pointing out that that P-values, whilst they had a very different interpretation, were numerically very similar to a type of Bayesian posterior probability. In this one, I consider their relation or lack of it, to Harold Jeffreys’s radically different approach to significance testing. (An excellent account of the development of Jeffreys’s thought is given by Howie(2), which I recommend highly.)

The story starts with Cambridge philosopher CD Broad (1887-1971), who in 1918 pointed to a difficulty with Laplace’s Law of Succession. Broad considers the problem of drawing counters from an urn containing *n* counters and supposes that all *m* drawn had been observed to be white. He now considers two very different questions, which have two very different probabilities and writes:

Note that in the case that only one counter remains we have *n = m + 1* and the two probabilities are the same. However, if *n > m+1* they are not the same and in particular if *m* is large but *n* is much larger, the first probability can approach 1 whilst the second remains small.

The practical implication of this just because Bayesian induction implies that a large sequence of successes (and no failures) supports belief that the next trial will be a success, it does not follow that one should believe that all future trials will be so. This distinction is often misunderstood. This is The Economist getting it wrong in September 2000

The canonical example is to imagine that a precocious newborn observes his first sunset, and wonders whether the sun will rise again or not. He assigns equal prior probabilities to both possible outcomes, and represents this by placing one white and one black marble into a bag. The following day, when the sun rises, the child places another white marble in the bag. The probability that a marble plucked randomly from the bag will be white (ie, the child’s degree of belief in future sunrises) has thus gone from a half to two-thirds. After sunrise the next day, the child adds another white marble, and the probability (and thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise.

See *Dicing with Death*(3) (pp76-78).

The practical relevance of this is that scientific laws cannot be established by Laplacian induction. Jeffreys (1891-1989) puts it thus

Thus I may have seen 1 in 1000 of the ‘animals with feathers’ in England; on Laplace’s theory the probability of the proposition, ‘all animals with feathers have beaks’, would be about 1/1000. This does not correspond to my state of belief or anybody else’s. (P128)

Categories: Jeffreys, P-values, reforming the reformers, Statistics, Stephen Senn
41 Comments

A large number of people have sent me articles on the “test ban” of statistical hypotheses tests and confidence intervals at a journal called *Basic and Applied Social Psychology* (BASP)[i]. *Enough*. One person suggested that since it came so close to my recent satirical Task force post, that I either had advance knowledge or some kind of ESP. *Oh please, no ESP required.*None of this is the slightest bit surprising, and I’ve seen it before; I simply didn’t find it worth blogging about. Statistical tests are being banned, say the editors, because they purport to give probabilities of null hypotheses (*really*?) and do not, hence they are “invalid”.[ii] (Confidence intervals are thrown in the waste bin as well—also claimed “invalid”).“The state of the art remains uncertain” regarding inferential statistical procedures, say the editors. I don’t know, maybe some good will come of all this.

Yet there’s a part of their proposal that brings up some interesting logical puzzles, and logical puzzles are my thing. In fact, I think there is a mistake the editors should remedy, lest authors be led into disingenuous stances, and strange tangles ensue. I refer to their rule that authors be allowed to submit papers whose conclusions are based on allegedly invalid methods so long as, once accepted, they remove any vestiges of them!

“

Question 1. Will manuscripts with p-values be desk rejected automatically?

Answer to Question 1. No. If manuscripts pass the preliminary inspection, they will be sent out for review. But prior to publication, authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about “significant” differences or lack thereof, and so on).”

Now if these measures are alleged to be irrelevant and invalid instruments for statistical inference, then why should they be included in the peer review process at all? Will reviewers be told to ignore them? That would only seem fair: papers should not be judged by criteria alleged to be invalid, but how will reviewers blind themselves to them? It would seem the measures should be excluded from the get-go. If they are included in the review, why shouldn’t the readers see what the reviewers saw when they recommended acceptance?

But here’s where the puzzle really sets in. If the authors must free their final papers from such encumbrances as sampling distributions and error probabilities, what will be the basis presented for their conclusions in the published paper? Presumably, from the notice, they are allowed only mere descriptive statistics or non-objective Bayesian reports (*added*: actually can’t tell which kind of Bayesianism they allow, given the Fisher reference which doesn’t fit*). Won’t this be tantamount to requiring authors support their research in a way that is either (actually) invalid, or has little to do with the error statistical properties that were actually reported and on which the acceptance was based?[ii] Continue reading

Categories: P-values, reforming the reformers, Statistics
73 Comments

*Saturday Night Brainstorming: The TFSI on NHST–part reblog from here and here, with a substantial 2015 update!*

*Each year leaders of the movement to “reform” statistical methodology in psychology, social science, and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like to see adopted, not just by the APA publication manual any more, but all science journals! ***Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers.**

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*Frustrated that the TFSI has still not banned null hypothesis significance testing (NHST)–a fallacious version of statistical significance tests that dares to violate Fisher’s first rule: It’s illicit to move directly from statistical to substantive effects–**the New Reformers have created, and **very successfully* published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? **Or not? **

*Most recently, the group has helped successfully launch a variety of “replication and reproducibility projects”. Having discovered how much the reward structure encourages bad statistics and gaming the system, they have cleverly pushed to change the reward structure: Failed replications (from a group chosen by a crowd-sourced band of replicationistas ) would not be hidden in those dusty old file drawers, but would be guaranteed to be published without that long, drawn out process of peer review. Do these failed replications indicate the original study was a false positive? or that the replication attempt is a false negative? It’s hard to say. *

*This year, as is typical, there is a new member who is pitching in to contribute what he hopes are novel ideas for reforming statistical practice. In addition, for the first time, there is a science reporter blogging the meeting for her next free lance “bad statistics” piece for a high impact science journal. Notice, it seems this committee only grows, no one has dropped off, in the 3 years I’ve followed them. *

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

** Pawl**: This meeting will come to order. I am pleased to welcome our new member, Dr. Ian Nydes, adding to the medical strength we have recently built with epidemiologist S.C.. In addition, we have a science writer with us today, Jenina Oozo. To familiarize everyone, we begin with a review of old business, and gradually turn to new business.

** Franz**: It’s so darn frustrating after all these years to see researchers still using NHST methods; some of the newer modeling techniques routinely build on numerous applications of those pesky tests.

** Jake**: And the premier publication outlets in the social sciences still haven’t mandated the severe reforms sorely needed. Hopefully the new blood, Dr. Ian Nydes, can help us go beyond resurrecting the failed attempts of the past. Continue reading

This was initially posted as slides from our joint Spring 2014 seminar: “Talking Back to the Critics Using Error Statistics”. (You can enlarge them.) Related reading is Mayo and Spanos (2011)