Three years ago, many of us were glued to the “spill cam” showing, in real time, the oil gushing from the April 20, 2010 explosion that sank the Deepwater Horizon oil rig in the Gulf of Mexico, killed 11, and spewed oil until July 15. Trials have been taking place this month, as people try to meet the 3-year deadline to sue BP and others. *But what happened to the 200 million gallons of oil?* (Is anyone up to date on this?) Has it vanished, or has it just been sunk to the bottom of the sea by dispersants, which may have caused hidden destruction of sea life? I don’t know, but given it’s Saturday night around the 3-year anniversary, let’s listen in to a reblog of a spill-related variation on the second of two original “overheard at the comedy hour” jokes.

*In effect, it accuses* the frequentist error-statistical account of licensing the following (make-believe) argument after the 2010 oil spill:

*Oil Exec:* We had highly reliable evidence that

*H:* the pressure was at normal levels on April 20, 2010!

*Senator:* But you conceded that whenever your measuring tool showed dangerous or ambiguous readings, you continually lowered the pressure, and that the stringent “cement bond log” test was entirely skipped.

*Oil Exec:* Granted, we omitted reliable checks on April 20, 2010, but usually we do a better job; I am reporting the average! *You see,* we use a randomizer that most of the time directs us to run the gold-standard check on pressure. But, but April 20 just happened to be one of those times we did the nonstringent test; on average we do OK.

*Senator:* But you don’t know that your system would have passed the more stringent test you didn’t perform!

*Oil Exec:* That’s the beauty of the frequentist test!

Even if we grant (for the sake of the joke) that overall, this “test” rarely errs in the report it outputs (pass or fail), that is irrelevant to appraising the inference from the data on April 20, 2010 (which would have differed had the more stringent test been run). That interpretation violates the severity criterion: the observed passing result was altogether common if generated from a source where the pressure level was unacceptably high. Therefore *it misinterprets the actual data*. The question is why anyone would saddle the frequentist with such shenanigans on averages. … Lest anyone think I am inventing a criticism, here is a familiar statistical instantiation, where the probability of choosing each experiment is given to be .5 (Cox 1958).

*Two Measuring Instruments with Different Precisions:*

A single observation X is to be made on a normally distributed random variable with unknown mean µ, but the measurement instrument is chosen by a coin flip: with heads we use instrument E’ with a known small variance, say 10^{-4}, while with tails we use E”, with a known large variance, say 10^{4}. The full data indicate whether E’ or E” was performed, and the particular value observed, which we can write as x’ and x”, respectively. (This example comes up in “ton o’ bricks”.)

In applying our test *T+* (see November 2011 blog post) to a null hypothesis, say, µ = 0, the “same” value of X would correspond to a much smaller p-value were it to have come from E’ than if it had come from E”. Denote the two p-values as p’ and p”, respectively. However, or so the criticism proceeds, the error statistician would report the average p-value: .5(p’ + p”).

But this would give a misleading assessment of the precision and corresponding severity with either measurement! Instead, you should report the p-value of the result in the experiment actually run (this is Cox’s Weak Conditionality Principle, WCP).
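To make the contrast concrete, here is a minimal sketch in Python. It assumes *T+* is a one-sided (upper-tail) test of µ = 0 against µ > 0, and the observed value 0.03 is a hypothetical number picked only for illustration; the function and variable names are mine, not Cox’s.

```python
from math import erfc, sqrt


def upper_tail_p(x, sigma, mu0=0.0):
    """P(X >= x) when X ~ N(mu0, sigma^2): a one-sided p-value.

    Assumes an upper-tail test of mu = mu0 against mu > mu0.
    """
    z = (x - mu0) / sigma
    return 0.5 * erfc(z / sqrt(2.0))


# Standard deviations implied by the variances in the example.
SIGMA_PRECISE = 1e-2    # instrument E', variance 10^-4
SIGMA_IMPRECISE = 1e2   # instrument E'', variance 10^4

x_obs = 0.03  # hypothetical observed value, for illustration only

p_precise = upper_tail_p(x_obs, SIGMA_PRECISE)      # p' : E' was run
p_imprecise = upper_tail_p(x_obs, SIGMA_IMPRECISE)  # p'': E'' was run
p_average = 0.5 * (p_precise + p_imprecise)         # the alleged report

print(f"p'  (precise instrument):   {p_precise:.5f}")
print(f"p'' (imprecise instrument): {p_imprecise:.5f}")
print(f"average .5(p' + p''):       {p_average:.5f}")
```

With these assumed numbers the same x is about three standard deviations from 0 under E’ but a negligible fraction of one under E”, so p’ is small and p” is close to .5. The average lands near .25, which describes neither the experiment that was run nor the one that was not.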

But what could lead the critic to suppose the error statistician must average over experiments not even performed? Rule #2 for legitimate criticism is to give the position being criticized the most generous construal one can think of. Perhaps the critic supposes what is actually a distortion of even the most radical behavioristic construal:

The severity requirement makes explicit that such a construal is to be rejected; I would have thought it obvious, and not in need of identifying a special principle. Since it wasn’t, I articulated this special notion for interpreting tests and the corresponding severity criterion.