**My “April 1” posts for the past 5 years have been so close to the truth or possible truth that they weren’t always spotted as April Fool’s pranks, which is what made them genuine April Fool’s pranks. (After a few days I labeled them as such, or revealed it in a comment.) So, since it’s Saturday night on the last night of April, I’m reblogging my 5 posts from the first days of April. (Which fooled you the most?)** Continue reading

# Comedy

## Yes, these were not (entirely) real–my five April pranks

## Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]

This headliner appeared two years ago, but to a sparse audience (likely because it was during winter break), so Management’s giving him another chance…

You might not have thought there could be new material for 2014, but there is, and if you look a bit more closely, you’ll see that it’s actually not Jay Leno [1] who is standing up there at the mike ….

It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler* in criticizing the use of p-values.

“Did you hear the one about significance testers rejecting H_{0} because of outcomes H_{0} didn’t predict?

‘What’s unusual about that?’ you ask?

What’s unusual is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so to see why I’ll strive to give it a generous interpretation. Continue reading

## Return to the Comedy Hour: P-values vs posterior probabilities (1)

Some recent criticisms of statistical tests of significance have breathed brand new life into some very old howlers, many of which have been discussed on this blog. One variant that returns to the scene every decade I think (for 50+ years?) takes a “disagreement on numbers” to show a problem with significance tests even from a “frequentist” perspective. Since it’s Saturday night, let’s listen in to one of the comedy hours from **3 years ago** (0) (new notes in red):

*Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?*

JB [Jim Berger]: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true! (1)

Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!

Raucous laughter ensues!

(Hah, hah…. I feel I’m back in high school: “So funny, I forgot to laugh!”)

The frequentist tester should retort:

But you assumed 50% of the null hypotheses are true, and computed P(H_{0}|x) (imagining P(H_{0}) = .5)—and then assumed my p-value should agree with the number you get, if it is not to be misleading!

Yet, our significance tester is not heard from as they move on to the next joke…. Continue reading
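Berger’s claim is easy to check numerically. Below is a minimal sketch, assuming (as in the joke) that 50% of the nulls in the pool are true; the spread of the non-null effects (a normal with standard deviation 2 on the z-scale) is my own illustrative assumption, not Berger’s exact setup:

```python
import math
import random

random.seed(1)

def two_sided_p(z):
    """Two-sided p-value for a z-statistic under H0: Z ~ N(0, 1)."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

N = 200_000
hits_null = hits_alt = 0
for _ in range(N):
    null_true = random.random() < 0.5            # half the pool: true nulls
    # illustrative assumption: non-null effects drawn from N(0, 2^2)
    theta = 0.0 if null_true else random.gauss(0.0, 2.0)
    z = random.gauss(theta, 1.0)                 # one z-statistic per "study"
    if 0.04 <= two_sided_p(z) <= 0.06:           # tests with p-values near .05
        if null_true:
            hits_null += 1
        else:
            hits_alt += 1

frac_true_nulls = hits_null / (hits_null + hits_alt)
print(f"fraction of true nulls among tests with p near .05: {frac_true_nulls:.2f}")
```

Among tests that happen to land near p = .05, the proportion of true nulls comes out far above 5%, which is Berger’s point; the frequentist retort above is that the 5% was never a claim about this conditional, prior-dependent quantity in the first place.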

## Sir Harold Jeffreys’ (tail area) one-liner: Saturday night comedy (b)

This headliner appeared before, but to a sparse audience, so Management’s giving him another chance… His joke relates to both Senn’s post (about alternatives), and to my recent post about using (1 – β)/α as a likelihood ratio--but for very different reasons. (I’ve explained at the bottom of this “(b) draft”.)

**….If you look closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike (especially as he’s no longer doing the Tonight Show)….**

**It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler joke* in criticizing the use of p-values.**

“Did you hear the one about significance testers rejecting H_{0} because of outcomes H_{0} didn’t predict?

‘What’s unusual about that?’ you ask?

What’s unusual is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so to see why I’ll strive to give it a generous interpretation. Continue reading

## 2015 Saturday Night Brainstorming and Task Forces: (4th draft)

*Saturday Night Brainstorming: The TFSI on NHST–part reblog from here and here, with a substantial 2015 update!*

*Each year leaders of the movement to “reform” statistical methodology in psychology, social science, and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI) and propose new regulations they would like to see adopted, not just in the APA publication manual any more, but in all science journals!* **Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers.**

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*Frustrated that the TFSI has still not banned null hypothesis significance testing (NHST)–a fallacious version of statistical significance tests that dares to violate Fisher’s first rule: it’s illicit to move directly from statistical to substantive effects–the New Reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead the use of confidence intervals, power calculations, and meta-analysis of effect sizes?* **Or not?**

*Most recently, the group has helped successfully launch a variety of “replication and reproducibility projects”. Having discovered how much the reward structure encourages bad statistics and gaming the system, they have cleverly pushed to change the reward structure: failed replications (from a group chosen by a crowd-sourced band of replicationistas) would not be hidden in those dusty old file drawers, but would be guaranteed to be published without that long, drawn-out process of peer review. Do these failed replications indicate the original study was a false positive? Or that the replication attempt is a false negative? It’s hard to say.*

*This year, as is typical, there is a new member who is pitching in to contribute what he hopes are novel ideas for reforming statistical practice. In addition, for the first time, there is a science reporter blogging the meeting for her next freelance “bad statistics” piece for a high-impact science journal. Notice that this committee only grows; no one has dropped off in the 3 years I’ve followed them.*

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Pawl**: This meeting will come to order. I am pleased to welcome our new member, Dr. Ian Nydes, adding to the medical strength we have recently built with epidemiologist S.C. In addition, we have a science writer with us today, Jenina Oozo. To familiarize everyone, we begin with a review of old business, and gradually turn to new business.

**Franz**: It’s so darn frustrating after all these years to see researchers still using NHST methods; some of the newer modeling techniques routinely build on numerous applications of those pesky tests.

**Jake**: And the premier publication outlets in the social sciences still haven’t mandated the severe reforms sorely needed. Hopefully the new blood, Dr. Ian Nydes, can help us go beyond resurrecting the failed attempts of the past. Continue reading

## Should a “Fictionfactory” peepshow be barred from a festival on “Truth and Reality”? Diederik Stapel says no (rejected post)

So I hear that Diederik Stapel is the co-author of a book *Fictionfactory* (in Dutch, with a novelist, Dautzenberg)[i], and of what they call their “Fictionfactory peepshow”, only it’s been disinvited at the last minute from a Dutch festival on “truth and reality” (due to have run 9/26/14), and all because of Stapel’s involvement. Here’s an excerpt from an article in last week’s Retraction Watch (article is here):

Here’s a case of art imitating science.

The organizers of a Dutch drama festival have put a halt to a play about the disgraced social psychologist Diederik Stapel, prompting protests from the authors of the skit — one of whom is Stapel himself.

According to an article in NRC Handelsblad:

The Amsterdam Discovery Festival on science and art has canceled, at the last minute, the play written by Anton Dautzenberg and former professor Diederik Stapel. Co-sponsor, The Royal Netherlands Academy of Arts and Sciences (KNAW), doesn’t want Stapel, who committed science fraud, to perform at a festival that’s associated with the KNAW.

FICTION FACTORY

The management of the festival, planned for September 26th at the Tolhuistuin in Amsterdam, contacted Stapel and Dautzenberg 4 months ago with the request to organize a performance of their book and lecture project ‘The Fictionfactory’. Especially for this festival they [Stapel and Dautzenberg] created a ‘Fictionfactory-peepshow’.

“Last Friday I received a call [from the management of the festival] that our performance has been canceled at the last minute because the KNAW will withdraw their subsidy if Stapel is on the festival program”, says Dautzenberg. “This looks like censorship, and by an institution that also wants to represent arts and experiments”.

Well this is curious, as things with Stapel always are. What’s the “Fictionfactory Peepshow”? If you go to Stapel’s homepage, it’s all in Dutch, but Google translation isn’t too bad, and I have a pretty good description of the basic idea. *So since it’s Saturday night, let’s take a peek, or peep (at what it might have been)…*

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*Here we are at the “Truth and Reality” Festival: first stop (after some cotton candy): the Stapel Fictionfactory Peepshow! It’s all dark, I can’t see a thing. What? It says I have to put some coins in a slot if I want to turn it on (but that they also take credit cards). So I’m to look down in this tiny window. The curtains are opening!…I see a stage with two funky looking guys–one of them is Stapel. They’re reading, or reciting from some magazine with big letters: “Fact, Fiction and the Frictions we Hide”.*

STAPEL: Welkom. You can ask us any questions! In response, you will always be given an option: ‘Do you want to know the truth or do you want to be comforted with fictions and feel-good fantasy?’

“Well I’ve brought some data with me from a study in social psychology. My question is this: “Is there a statistically significant effect here?”

STAPEL: Do you want to know the truth or do you want to be comforted with fictions and feel-good fantasy?

“Fiction please”.

STAPEL: I can massage your data, manipulate your numbers, reveal the taboos normally kept under wraps. For a few more coins I will let you see the secrets behind unreplicable results, and for a large bill will manufacture for you a sexy statistical story to turn on the editors.

(Then after the dirty business is all done [ii].)

STAPEL: Do you have more questions for me?

“Will it be published (fiction please)?”

STAPEL: “Yes.”

“Will anyone find out about this (fiction please)?”

STAPEL: “No, I mean yes, I mean no.”

“I’d like to change to hearing the truth now. I have three questions”.

STAPEL: No problem, we take credit cards. Dank u. What are your questions?’

“Will Uri Simonsohn be able to fraudbust my results using the kind of tests he used on others? And if so, how long will it take him? (truth, please)”

STAPEL: “Yes. But not for at least 6 months to one year.”

“Here’s my final question. Are these data really statistically significant and at what level?” (truth please)

**Nothing. Blank screen suddenly! With an acrid smelling puff of smoke, ew. But I’d already given the credit card! (Tricked by the master trickster).**

**What if he either always lies or always tells the truth? Then what would you ask him if you want to know the truth about your data? (Liar’s paradox variant)**

**Feel free to share your queries/comments.**

* I thank Caitlin Parker for sending me the article

[i] Diederik Stapel was found guilty of science fraud in psychology in 2011; he made up data out of whole cloth and retracted over 50 papers. http://www.nytimes.com/2013/04/28/magazine/diederik-stapels-audacious-academic-fraud.html?pagewanted=all&_r=0


[ii] Perhaps they then ask you how much you’ll pay for a bar of soap (because you’d sullied yourself). Why let potential priming data go to waste? Oh wait, he doesn’t use real data…. Perhaps the peepshow was supposed to be a kind of novel introduction to research ethics.

Some previous posts on Stapel:

- Phil/Stat/Law: 50 Shades of gray between error and fraud (July 3, 2013)
- How to hire a fraudster chauffeur (Sept 18, 2013)

## Getting Credit (or blame) for Something You Didn’t Do (BP oil spill)

Four years ago, many of us were glued to the “spill cam” showing, in real time, the gushing oil from the April 20, 2010 explosion that sank the Deepwater Horizon oil rig in the Gulf of Mexico, killing 11, and spewing oil until July 15 (see video clip that was added below). Remember junk shots, top kill, blowout preventers? [1] The EPA lifted its gulf drilling ban on BP just a couple of weeks ago* (BP has paid around ~~$13~~ $27 billion in fines and compensation), and April 20, 2014, is the deadline to properly file forms for new compensation claims.

*(*After which BP had another small spill in Lake Michigan.)*

But what happened to the 200 million gallons of oil? Has it vanished, or just been sunk to the bottom of the sea by dispersants, which may have caused hidden destruction of sea life? I don’t know, but given it’s Saturday night, let’s listen in to a reblog of a spill-related variation on the second of two original “overheard at the comedy hour” jokes.

## Who is allowed to cheat? I.J. Good and that after dinner comedy hour….

It was from my Virginia Tech colleague I.J. Good (in statistics), who died five years ago (April 5, 2009), at 93, that I learned most of what I call “howlers” on this blog. His favorites were based on the “paradoxes” of stopping rules. (I had posted this last year here.)

“In conversation I have emphasized to other statisticians, starting in 1950, that, in virtue of the ‘law of the iterated logarithm,’ by optional stopping an arbitrarily high sigmage, and therefore an arbitrarily small tail-area probability, can be attained even when the null hypothesis is true. In other words if a Fisherian is prepared to use optional stopping (which usually he is not) he can be sure of rejecting a true null hypothesis provided that he is prepared to go on sampling for a long time.

The way I usually express this ‘paradox’ is that a Fisherian [but not a Bayesian] can cheat by pretending he has a plane to catch like a gambler who leaves the table when he is ahead” (Good 1983, 135)[*]
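Good’s gambler can be simulated directly. A minimal sketch, assuming normal data with a true null, and a nominal two-sided .05 test applied after every new observation, stopping at the first “significant” z:

```python
import math
import random

random.seed(2)

def rejects_with_optional_stopping(n_max):
    """Sample from a true null N(0, 1); after each new observation, test at
    the nominal two-sided .05 level (|z| > 1.96) and stop at the first
    'significant' result -- the gambler who leaves when he is ahead."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += random.gauss(0.0, 1.0)
        z = total / math.sqrt(n)
        if abs(z) > 1.96:
            return True
    return False

trials = 2_000
rate = sum(rejects_with_optional_stopping(500) for _ in range(trials)) / trials
print(f"rejection rate of a true null under optional stopping: {rate:.2f}")
```

With a fixed sample size the rejection rate of the true null stays at .05; allowed to “catch a plane” whenever the test first looks significant, the rate climbs well above it, and by the law of the iterated logarithm it approaches 1 as the maximum sample size grows.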

## Fallacy of Rejection and the Fallacy of Nouvelle Cuisine

Any Jackie Mason fans out there? In connection with our discussion of power, and associated fallacies of rejection*–and since it’s Saturday night–I’m reblogging the following post.

In February [2012], in London, criminologist Katrin H. and I went to see Jackie Mason do his shtick, a one-man show billed as his swan song to England. It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix. Still, hearing his rants for the nth time was often quite hilarious.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond. But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. Recall the September 26, 2011 post “Whipping Boys and Witch Hunters”: [i]

Fortunately, philosophers of statistics would surely not reprise decades-old howlers and fallacies. After all, it is the philosopher’s job to clarify and expose the conceptual and logical foibles of others; and even if we do not agree, we would never merely disregard and fail to address the criticisms in published work by other philosophers. Oh wait, ….one of the leading texts repeats the fallacy in their third edition: Continue reading

## Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]

This headliner appeared last month, but to a sparse audience (likely because it was during winter break), so Management’s giving him another chance…

You might not have thought there could be new material for 2014, but there is, and if you look a bit more closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike ….

It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler* in criticizing the use of p-values.

“Did you hear the one about significance testers rejecting H_{0} because of outcomes H_{0} didn’t predict?

‘What’s unusual about that?’ you ask?

What’s unusual is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so to see why I’ll strive to give it a generous interpretation. Continue reading

## Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs B-boosts)

Since we’ll be discussing Bayesian confirmation measures in next week’s seminar—the relevant blogpost being here--let’s listen in to one of the comedy hours at the Bayesian retreat as reblogged from May 5, 2012.

The problem was, the epistemic probability in H was so low that H couldn’t be believed! Instead we believe its denial H’! So, she will infer hypotheses that are simply unbelievable!

So it appears the error statistical testing account fails to serve as an account of knowledge or evidence (i.e., an epistemic account). However severely I might wish to say that a hypothesis *H* has passed a test, this Bayesian critic assigns a sufficiently low prior probability to *H* so as to yield a low posterior probability in *H* [i]. But this is no argument about why this counts in favor of, rather than against, their particular Bayesian computation as an appropriate assessment of the warrant to be accorded to hypothesis *H*.

To begin with, in order to use techniques for assigning frequentist probabilities to events, their examples invariably involve “hypotheses” that consist of asserting that a sample possesses a characteristic, such as “having a disease” or “being college ready” or, for that matter, “being true.” This would not necessarily be problematic if it were not for the fact that their criticism requires shifting the probability to the particular sample selected—for example, a student Isaac is college-ready, or this null hypothesis (selected from a pool of nulls) is true. This was, recall, the fallacious probability assignment that we saw in Berger’s attempt, later (perhaps) disavowed. Also there are just two outcomes, say s and ~s, and no degrees of discrepancy from H. Continue reading

## Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]

You might not have thought there could be new material for 2014, but there is, and if you look a bit more closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike ….

It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler* in criticizing the use of p-values.

“Did you hear the one about significance testers rejecting H_{0} because of outcomes H_{0} didn’t predict?

‘What’s unusual about that?’ you ask?

Well, what’s unusual is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

I say it’s funny, so to see why I’ll strive to give it a generous interpretation.

We can view p-values in terms of rejecting *H*_{0}, as in the joke: There’s a test statistic D such that *H*_{0} is rejected if its observed value d_{0} reaches or exceeds a cut-off d* where Pr(D > d*; *H*_{0}) is small, say .025.

Reject *H*_{0} if Pr(D > d_{0}; *H*_{0}) < .025.

The report might be “reject *H*_{0} at level .025”.

*Example*: *H*_{0}: The mean light deflection effect is 0. So if we observe a 1.96 standard deviation difference (in one-sided Normal testing) we’d reject *H*_{0}.

Now it’s true that if the observation were further into the rejection region, say 2, 3 or 4 standard deviations, it too would result in rejecting the null, and with an even smaller p-value. It’s also true that *H*_{0} “has not predicted” a 2, 3, 4, 5 etc. standard deviation difference in the sense that differences so large are “far from” or improbable under the null. But wait a minute. What if we’ve only observed a 1 standard deviation difference (p-value = .16)? It is unfair to count it against the null that 1.96, 2, 3, 4 etc. standard deviation differences would have diverged seriously from the null, when we’ve only observed the 1 standard deviation difference. Yet the p-value tells you to compute Pr(D > 1; *H*_{0}), which includes these more extreme outcomes! This is “a remarkable procedure” indeed! [i]

So much for making out the howler. The only problem is that significance tests do not do this; that is, they do not reject with, say, D = 1 because larger D values might have occurred (but did not). D = 1 does not reach the cut-off, and does not lead to rejecting *H*_{0}. Moreover, looking at the tail area makes it harder, not easier, to reject the null (although this isn’t the only function of the tail area): since it requires not merely that Pr(D = d_{0}; *H*_{0}) be small, but that Pr(D > d_{0}; *H*_{0}) be small. And this is well justified because when this probability is not small, you should not regard it as evidence of discrepancy from the null. Before getting to this…. Continue reading
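The arithmetic in the example is easy to verify. A minimal sketch, taking D to be standard normal under *H*_{0}, as in the one-sided Normal test above:

```python
import math

def one_sided_p(d):
    """Pr(D > d; H0), with D standard normal under the null."""
    return 1 - 0.5 * (1 + math.erf(d / math.sqrt(2)))

d_star = 1.96   # cut-off with Pr(D > d*; H0) close to .025

for d0 in (1.0, 1.96, 3.0):
    p = one_sided_p(d0)
    verdict = "reject H0" if p < 0.025 else "do not reject H0"
    print(f"d0 = {d0:4.2f}   p-value = {p:.3f}   -> {verdict}")
```

A 1 standard deviation difference yields a tail-area p-value of about .16, nowhere near the cut-off, so the test does not reject: exactly the point that the tail area makes rejection harder, not easier.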

## Saturday night comedy from a Bayesian diary (rejected post*)

*See “rejected posts”.

## Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs what?)

Our favorite high school student, Isaac, gets a better shot at showing his college readiness using one of the comparative measures of support or confirmation discussed last week. Their assessment thus seems more in sync with the severe tester, but they are not purporting that z is evidence for inferring (or even believing) an H to which z affords a high B-boost*. Their measures identify a third category that reflects the degree to which H would predict z (where the comparison might be predicting without z, or under ~H or the like). At least if we give it an empirical, rather than a purely logical, reading. Since it’s Saturday night let’s listen in to one of the comedy hours at the Bayesian retreat as reblogged from May 5, 2012.

The problem was, the epistemic probability in H was so low that H couldn’t be believed! Instead we believe its denial H’! So, she will infer hypotheses that are simply unbelievable!

So it appears the error statistical testing account fails to serve as an account of knowledge or evidence (i.e., an epistemic account). However severely I might wish to say that a hypothesis *H* has passed a test, this Bayesian critic assigns a sufficiently low prior probability to *H* so as to yield a low posterior probability in *H* [i]. But this is no argument about why this counts in favor of, rather than against, their particular Bayesian computation as an appropriate assessment of the warrant to be accorded to hypothesis *H*.

To begin with, in order to use techniques for assigning frequentist probabilities to events, their examples invariably involve “hypotheses” that consist of asserting that a sample possesses a characteristic, such as “having a disease” or “being college ready” or, for that matter, “being true.” This would not necessarily be problematic if it were not for the fact that their criticism requires shifting the probability to the particular sample selected—for example, a student Isaac is college-ready, or this null hypothesis (selected from a pool of nulls) is true. This was, recall, the fallacious probability assignment that we saw in Berger’s attempt, later (perhaps) disavowed. Also there are just two outcomes, say s and ~s, and no degrees of discrepancy from H. Continue reading

## First blog: “Did you hear the one about the frequentist…”? and “Frequentists in Exile”

*Dear Reader*: Tonight marks the 2-year anniversary of this blog; so I’m reblogging my very first posts from 9/3/11 here and here (from the rickety old blog site)*. (One was the “about”.) The current blog was included once again in the top 50 statistics blogs. Amazingly, I have received e-mails from different parts of the world describing experimental recipes for the special concoction we exiles favor! (Mine is here.) If you can fly over to the Elbar Room, please join us: I’m treating everyone to doubles of Elbar Grease! Thanks for reading and contributing! *D. G. Mayo*

(*The old blogspot is a big mix; it was before Rejected blogs. Yes, I still use this old typewriter [ii])

**“Overheard at the Comedy Club at the Bayesian Retreat” 9/3/11 by D. Mayo**

**“Did you hear the one about the frequentist . . .**

- “who claimed that observing “heads” on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”

or

- “who defended the reliability of his radiation reading, despite using a broken radiometer, on the grounds that most of the time he uses one that works, so on average he’s pretty reliable?”

Such jests may work for an after-dinner laugh, but if it turns out that, despite being retreads of “straw-men” fallacies, they form the basis of why some reject frequentist methods, then they are not such a laughing matter. But surely the drubbing of frequentist methods could not be based on a collection of howlers, could it? I invite the curious reader to stay and find out.

If we are to take the criticisms seriously, and put to one side the possibility that they are deliberate distortions of frequentist statistical methods, we need to identify their sources. To this end I consider two interrelated areas around which to organize foundational issues in statistics: (1) the roles of probability in induction and inference, and (2) the nature and goals of statistical inference in science or learning. Frequentist sampling statistics, which I prefer to call “error statistics,” continues to be raked over the coals in the foundational literature, but with little scrutiny of the presuppositions about goals and methods, without which the criticisms lose all force.

First, there is the supposition that an adequate account must assign degrees of probability to hypotheses, an assumption often called *probabilism*. Second, there is the assumption that the main, if not the only, goal of error-statistical methods is to evaluate long-run error rates. Given the wide latitude with which some critics define “controlling long-run error,” it is not surprising to find them arguing that (i) error statisticians approve of silly methods, and/or (ii) rival (e.g., Bayesian) accounts also satisfy error statistical demands. Absent this sleight of hand, Bayesian celebrants would have to go straight to the finale of their entertainment hour: a rousing rendition of “There’s No Theorem Like Bayes’s Theorem.”

Never mind that frequentists have responded to these criticisms, they keep popping up (verbatim) in many Bayesian textbooks and articles on philosophical foundations. The difficulty of articulating a statistical philosophy that fully explains the basis for both (i) insisting on error-statistical guarantees, while (ii) avoiding pathological examples in practice, has turned many a frequentist away from venturing into foundational battlegrounds. Some even concede the distorted perspectives drawn from overly literal and radical expositions of what Fisher, Neyman, and Pearson “really thought”. Many others just find the “statistical wars” distasteful.

Here is where I view my contribution—as a philosopher of science—to the long-standing debate: not merely to call attention to the howlers that pass as legitimate criticisms of frequentist error statistics, but also to sketch the main lines of an alternative statistical philosophy within which to better articulate the roles and value of frequentist tools. Let me be clear that I do not consider this the only philosophical framework for frequentist statistics—different terminology could do as well. I will consider myself successful if I can provide one way of building, or one standpoint from which to build, a frequentist, error-statistical philosophy.

But given this is a blog, I shall be direct and to the point: I hope to cultivate the interests of others who might want to promote intellectual honesty within a generally very lopsided philosophical debate. I will begin with the first entry to the comedy routine, as it is put forth by leading Bayesians……

___________________________________________

**“Frequentists in Exile” 9/3/11 by D. Mayo**

Confronted with the position that “arguments for this personalistic theory were so persuasive that anything to any extent inconsistent with that theory should be discarded” (Cox 2006, 196), frequentists might have seen themselves in a kind of exile when it came to foundations, even those who had been active in the dialogues of an earlier period [i]. Sometime around the late 1990s there were signs that this was changing. Regardless of the explanation, the fact that it did occur and is occurring is of central importance to statistical philosophy.

Now that Bayesians have stepped off their a priori pedestal, it may be hoped that a genuinely deep scrutiny of the frequentist and Bayesian accounts will occur. In some corners of practice it appears that frequentist error statistical foundations are being discovered anew. Perhaps frequentist foundations, never made fully explicit, but at most lying deep below the ocean floor, are finally being disinterred. But let’s learn from some of the mistakes in the earlier attempts to understand it. With this goal I invite you to join me in some deep water drilling, here as I cast about on my Isle of Elba.

Cox, D. R. (2006), *Principles of Statistical Inference*, CUP.

________________________________________________

[i] Yes, that’s the Elba connection: Napoleon’s exile (from which he returned to fight more battles).

[ii] I have discovered a very reliable antique typewriter shop in Oxford that was able to replace the two missing typewriter keys. So long as my “ribbons” and carbon sheets don’t run out, I’m set.

## Overheard at the comedy hour at the Bayesian retreat-2 years on

It’s nearly two years since I began this blog, and some are wondering if I’ve covered all the howlers thrust our way? Sadly, no. So since it’s Saturday night here at the Elba Room, let’s listen in on one of the more puzzling fallacies–one that I let my introductory logic students spot…

“Did you hear the one about significance testers sawing off their own limbs?

‘Suppose we decide that the effect exists; that is, we reject [null hypothesis] *H*₀. Surely, we must also reject probabilities conditional on *H*₀, but then what was the logical justification for the decision? Orthodox logic saws off its own limb.’ ”

*Ha! Ha!* By this reasoning, no hypothetical testing or falsification could ever occur. As soon as *H* is falsified, the grounds for falsifying disappear! If *H*: all swans are white, then if I see a black swan, *H* is falsified. But according to this critic, we can no longer assume the deduced prediction from *H*! What? The entailment from a hypothesis or model *H* to **x**, whether it is statistical or deductive, does not go away after the hypothesis or model *H* is rejected on grounds that the prediction is not borne out.[i] When particle physicists deduce that the events could not be due to background alone, the statistical derivation (to what would be expected under *H*: background alone) does not get sawed off when *H* is denied!

The above quote is from Jaynes (p. 524) writing on the pathologies of “orthodox” tests. How does someone writing a great big book on “the logic of science” get this wrong? To be generous, we may assume that in the heat of criticism, his logic takes a wild holiday. Unfortunately, I’ve heard several of his acolytes repeat this. There’s a serious misunderstanding of how hypothetical reasoning works: 6 lashes, and a pledge not to uncritically accept what critics say, however much you revere them.

______

Jaynes, E. T. 2003. *Probability Theory: The Logic of Science. *Cambridge: Cambridge University Press.

[i] Of course there is also no warrant for inferring an alternative hypothesis, unless it is a non-null warranted with severity—even if the alternative entails the existence of a real effect. (Statistical significance is not substantive significance—it is by now a cliché. Search this blog for *fallacies of rejection*.)

**A few previous comedy hour posts:**

(09/03/11) Overheard at the comedy hour at the Bayesian retreat

(4/4/12) Jackie Mason: Fallacy of Rejection and the Fallacy of Nouvelle Cuisine

(04/28/12) Comedy Hour at the Bayesian Retreat: P-values versus Posteriors

(05/05/12) Comedy Hour at the Bayesian (Epistemology) Retreat: Highly Probable vs Highly Probed

(09/03/12) After dinner Bayesian comedy hour…. (1 year anniversary)

(09/08/12) Return to the comedy hour…(on significance tests)

(04/06/13) Who is allowed to cheat? I.J. Good and that after dinner comedy hour….

(04/27/13) Getting Credit (or blame) for Something You Didn’t Do (BP oil spill, comedy hour)

## Bad news bears: ‘Bayesian bear’ rejoinder-reblog mashup

Oh No! It’s those mutant bears again. To my dismay, I’ve been sent, for the *third* time, that silly, snarky, adolescent clip of those naughty “what the p-value” bears (first posted on Aug 5, 2012), who cannot seem to get a proper understanding of significance tests into their little bear brains. So apparently some people haven’t seen my rejoinder which, as I said then, practically wrote itself. So since it’s Saturday night here at the Elbar Room, let’s listen in to a mashup of both the clip and my original rejoinder (in which p-value bears are replaced with hypothetical Bayesian bears).

These stilted bear figures and their voices are sufficiently obnoxious in their own right, even without the tedious lampooning of p-values and the feigned horror at learning they should not be reported as posterior probabilities.

*Mayo’s Rejoinder:*

*Bear #1:* Do you have the results of the study?

*Bear #2:* Yes. The good news is there is a .996 probability of a positive difference in the main comparison.

*Bear #1:* Great. So I can be well assured that there is just a .004 probability that such positive results would occur if they were merely due to chance.

*Bear #2:* Not really, that would be an incorrect interpretation. Continue reading

## Getting Credit (or blame) for Something You Didn’t Do (BP oil spill, comedy hour)

Three years ago, many of us were glued to the “spill cam” showing, in real time, the gushing oil from the April 20, 2010 explosion that sank the Deepwater Horizon oil rig in the Gulf of Mexico, killing 11 and spewing oil until July 15. Trials have been taking place this month, as people try to meet the 3-year deadline to sue BP and others. *But what happened to the 200 million gallons of oil?* (Is anyone up to date on this?) Has it vanished, or just sunk to the bottom of the sea by dispersants that may have caused hidden destruction of sea life? I don’t know, but given it’s Saturday night around the 3-year anniversary, let’s listen in to a reblog of a spill-related variation on the second of two original “overheard at the comedy hour” jokes.

*In effect, it accuses* the frequentist error-statistical account of licensing the following (make-believe) argument after the 2010 oil spill:

*Oil Exec:* We had highly reliable evidence that *H*: the pressure was at normal levels on April 20, 2010!

*Senator:* But you conceded that whenever your measuring tool showed dangerous or ambiguous readings, you continually lowered the pressure, and that the stringent “cement bond log” test was entirely skipped.

*Oil Exec:* Granted, we omitted reliable checks on April 20, 2010, but usually we do a better job—I am reporting the average! *You see,* we use a randomizer that most of the time directs us to run the gold-standard check on pressure. But April 20 just happened to be one of those times we did the nonstringent test; on average we do OK.

*Senator: *But you don’t know that your system would have passed the more stringent test you didn’t perform!

*Oil Exec:* That’s the beauty of the frequentist test!

Even if we grant (for the sake of the joke) that overall, this “test” rarely errs in the report it outputs (pass or fail), that is irrelevant to appraising the inference from the data on April 20, 2010 (which would have differed had the more stringent test been run). That interpretation violates the severity criterion: the observed passing result was altogether common if generated from a source where the pressure level was unacceptably high. Therefore *it misinterprets the actual data*. The question is why anyone would saddle the frequentist with such shenanigans on averages. … Lest anyone think I am inventing a criticism, here is a familiar statistical instantiation, where the probability of each experiment is given to be .5 (Cox 1958).

*Two Measuring Instruments with Different Precisions:*

A single observation *X* is to be made on a normally distributed random variable with unknown mean µ, but the measurement instrument is chosen by a coin flip: with heads we use instrument E’ with a known small variance, say 10^{-4}, while with tails we use E”, with a known large variance, say 10^{4}. The full data indicate whether E’ or E” was performed, and the particular value observed, which we can write as x’ and x”, respectively. (This example comes up in the “ton o’ bricks” post.)

In applying our test *T+* (see the November 2011 blog post) to a null hypothesis, say, µ = 0, the “same” value of *X* would correspond to a much smaller p-value were it to have come from E’ than if it had come from E”. Denote the two p-values as p’ and p”, respectively. However, or so the criticism proceeds, the error statistician would report the average p-value: .5(p’ + p”).

But this would give a misleading assessment of the precision and corresponding severity with either measurement! Instead you should report the p-value of the result in the experiment actually run (this is Cox’s Weak Conditionality Principle, WCP).
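To see the contrast in numbers, here is a minimal sketch of the two-instrument example. The variances 10^{-4} and 10^{4} come from the example above; the particular observed value x = 0.05 is an assumption for illustration only:

```python
from math import erf, sqrt

def p_value(x, sigma):
    """One-sided p-value for testing mu = 0 against mu > 0,
    given a single N(mu, sigma^2) observation x with known sigma."""
    z = x / sigma
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))  # P(Z > z) for standard normal Z

# Known standard deviations: variances 10^-4 (E') and 10^4 (E'')
sigma_precise, sigma_noisy = 1e-2, 1e2

x = 0.05  # the "same" observed value under either instrument (illustrative)
p_prime = p_value(x, sigma_precise)   # from E': z = 5, p essentially 0
p_dblprime = p_value(x, sigma_noisy)  # from E'': z = 0.0005, p near 0.5

# The criticized "unconditional" report averages over the coin flip:
p_averaged = 0.5 * (p_prime + p_dblprime)

# WCP: report the p-value from the experiment actually run.
print(f"p'  (from E')  = {p_prime:.3g}")
print(f"p'' (from E'') = {p_dblprime:.3g}")
print(f"averaged       = {p_averaged:.3g}")
```

The same x is overwhelming evidence against µ = 0 if it came from the precise instrument, and no evidence at all if it came from the noisy one; the averaged number (about .25) describes neither experiment.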

But what could lead the critic to suppose the error statistician must average over experiments not even performed? Rule #2 for legitimate criticism is to give the position being criticized the most generous construal one can think of. Perhaps the critic supposes what is actually a distortion of even the most radical behavioristic construal:

The severity requirement makes explicit that such a construal is to be rejected—I would have thought it obvious, and not in need of identifying a special principle. Since it wasn’t, I articulated this special notion for interpreting tests and the corresponding severity criterion. Continue reading

## Who is allowed to cheat? I.J. Good and that after dinner comedy hour….

It was from my Virginia Tech colleague I.J. Good (in statistics), who died four years ago (April 5, 2009), at 93, that I learned most of what I call “howlers” on this blog. His favorites were based on the “paradoxes” of stopping rules.

“In conversation I have emphasized to other statisticians, starting in 1950, that, in virtue of the ‘law of the iterated logarithm,’ by optional stopping an arbitrarily high sigmage, and therefore an arbitrarily small tail-area probability, can be attained even when the null hypothesis is true. In other words if a Fisherian is prepared to use optional stopping (which usually he is not) he can be sure of rejecting a true null hypothesis provided that he is prepared to go on sampling for a long time.

The way I usually express this ‘paradox’ is that a Fisherian [but not a Bayesian] can cheat by pretending he has a plane to catch like a gambler who leaves the table when he is ahead” (Good 1983, 135)[*]
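Good’s point is easy to check by simulation; here is a hypothetical sketch (not from the original exchange) that samples from a *true* null and “peeks” at the z-statistic after every observation, stopping as soon as a nominal .05 rejection appears:

```python
import random

def rejects_under_optional_stopping(max_n, z_crit=1.96, seed=None):
    """Draw N(0,1) observations one at a time under a TRUE null
    (mu = 0, sigma = 1 known), computing the z-statistic after each;
    stop and 'reject' as soon as |z| exceeds z_crit.
    Returns True if rejection occurred within max_n observations."""
    rng = random.Random(seed)
    running_sum = 0.0
    for n in range(1, max_n + 1):
        running_sum += rng.gauss(0.0, 1.0)
        z = running_sum / n ** 0.5  # z = sqrt(n) * xbar / sigma
        if abs(z) > z_crit:
            return True
    return False

# How often does "try and try again" yield a nominal p < .05
# even though H0 is true? Far more often than the advertised 5%.
trials = 200
hits = sum(rejects_under_optional_stopping(5000, seed=i) for i in range(trials))
rate = hits / trials
print(f"rejection rate under optional stopping: {rate:.2f}")
```

With enough patience (the law of the iterated logarithm guarantees the z-statistic will eventually cross any fixed threshold), the rejection rate climbs toward 1, which is exactly why error statisticians insist the stopping rule alter the reported error probabilities.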

This paper came from a conference where we both presented, and he was *extremely* critical of my error statistical defense on this point. (I was a year out of grad school, and he a University Distinguished Professor.)

One time, years later, after hearing Jack give this howler for the nth time, “a Fisherian [but not a Bayesian] can cheat, etc.,” I was driving him to his office, and suddenly blurted out what I really thought:

“You know Jack, as many times as I have heard you tell this, I’ve always been baffled as to its lesson about who is allowed to cheat. Error statisticians require the overall and not the ‘computed’ significance level be reported. To us, what would be cheating would be reporting the significance level you got after trying and trying again in just the same way as if the test had a fixed sample size. True, we are forced to fret about how stopping rules alter the error probabilities of tests, while the Bayesian is free to ignore them, but why isn’t the real lesson that the Bayesian is allowed to cheat?” (A published version of my remark may be found in EGEK p. 351: “As often as my distinguished colleague presents this point…”)

To my surprise, or actually shock, after pondering this a bit, Jack said something like, “Hmm, I never thought of it this way.”

By the way, the story of the “after dinner Bayesian comedy hour” on this blog did not allude to Jack but to someone who gave a much more embellished version. Since it’s Saturday night, let’s once again listen in on the comedy hour that unfolded at my dinner table at an academic conference:

Did you hear the one about the researcher who gets a phone call from the guy analyzing his data? First the guy congratulates him and says, “The results show a Continue reading