Four years ago, many of us were glued to the “spill cam” showing, in real time, the oil gushing from the April 20, 2010 explosion that sank the Deepwater Horizon oil rig in the Gulf of Mexico, killing 11, and spewing oil until July 15. (Remember junk shots, top kill, blowout preventers?) The EPA lifted its Gulf drilling ban on BP just a couple of weeks ago* (BP has paid around $27 billion in fines and compensation), and April 20, 2014, is the deadline to properly file claims for new compensation.
(*After which BP had another small spill in Lake Michigan.)
But what happened to the 200 million gallons of oil? Has it vanished, or was it merely sunk to the bottom of the sea by dispersants that may have caused hidden destruction of sea life? I don’t know, but given that it’s Saturday night, let’s listen in on a reblog of a spill-related variation on the second of two original “overheard at the comedy hour” jokes.
In effect, it accuses the frequentist error-statistical account of licensing the following (make-believe) argument after the 2010 oil spill:
Oil Exec: We had highly reliable evidence that H: the pressure was at normal levels on April 20, 2010!
Senator: But you conceded that whenever your measuring tool showed dangerous or ambiguous readings, you continually lowered the pressure, and that the stringent “cement bond log” test was entirely skipped.
Oil Exec: Granted, we omitted reliable checks on April 20, 2010, but usually we do a better job—I am reporting the average! You see, we use a randomizer that most of the time directs us to run the gold-standard check on pressure. But, but April 20 just happened to be one of those times we did the nonstringent test; on average, though, we do OK.
Senator: But you don’t know that your system would have passed the more stringent test you didn’t perform!
Oil Exec: That’s the beauty of the frequentist test!
Even if we grant (for the sake of the joke) that overall this “test” rarely errs in the report it outputs (pass or fail), that is irrelevant to appraising the inference from the data on April 20, 2010 (which would have differed had the more stringent test been run). That interpretation violates the severity criterion: the observed passing result would have been altogether common even if generated from a source where the pressure level was unacceptably high; therefore it misinterprets the actual data. The question is why anyone would saddle the frequentist with such shenanigans about averages. … Lest anyone think I am inventing a criticism, here is a familiar statistical instantiation, where the probability of each experiment is given to be .5 (Cox 1958).
Two Measuring Instruments with Different Precisions:
A single observation X is to be made on a normally distributed random variable with unknown mean µ, but the measuring instrument is chosen by a coin flip: with heads we use instrument E’, with a known small variance, say 10^–4, while with tails we use E”, with a known large variance, say 10^4. The full data indicate whether E’ or E” was performed, and the particular value observed, which we can write as x’ and x”, respectively. (This example comes up in my discussions of the strong likelihood principle (SLP), e.g., ton o’bricks, and here.)
In applying our favorite one-sided (upper) Normal test T+ to a null hypothesis, say, µ = 0, the “same” value of X would correspond to a much smaller p-value were it to have come from E’ than if it had come from E”. Denote the two p-values as p’ and p”, respectively. However, or so the criticism proceeds, the error statistician would report the average p-value: .5(p’ + p”).
But this would give a misleading assessment of the precision and corresponding severity with either measurement! Instead you should report the p-value of the result in the experiment actually run (this is Cox’s Weak Conditionality Principle, WCP).
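A quick numerical sketch makes the contrast vivid. Here is a short Python illustration (the observed value x = 0.05 is my own illustrative choice, not from Cox’s example) of the one-sided p-values for test T+ of µ = 0 under the precise instrument E’ (variance 10^–4) and the imprecise instrument E” (variance 10^4), and of how uninformative their average is:

```python
from math import erfc, sqrt

def p_upper(x, sigma):
    """One-sided (upper) p-value for H0: mu = 0 vs mu > 0,
    given a single observation x ~ N(mu, sigma^2).
    Uses P(Z >= z) = 0.5 * erfc(z / sqrt(2)) for standard normal Z."""
    z = x / sigma
    return 0.5 * erfc(z / sqrt(2))

x = 0.05                      # illustrative observed value (my choice)
p_precise = p_upper(x, 1e-2)  # E':  variance 10^-4, so sigma = 10^-2
p_crude   = p_upper(x, 1e2)   # E'': variance 10^4,  so sigma = 10^2

print(f"p'  (if E' was run):  {p_precise:.2e}")  # tiny: strong evidence against H0
print(f"p'' (if E'' was run): {p_crude:.3f}")    # ~0.5: no evidence at all
print(f"averaged p-value:     {0.5 * (p_precise + p_crude):.3f}")
```

The same observed number is extremely improbable under H0 if it came from E’, and utterly unremarkable if it came from E”. The average, roughly .25, matches the actual precision of neither instrument, which is just the WCP’s point: report the p-value of the experiment actually performed.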
But what could lead the critic to suppose the error statistician must average over experiments not even performed? Rule #2 for legitimate criticism is to give the position being criticized the most generous construal one can think of. Perhaps the critic supposes what is actually a distortion of even the most radical behavioristic construal:
- If you consider outcomes that could have occurred in hypothetical repetitions of this experiment, you must also consider other experiments you did not run (but could have been run) in reasoning from the data observed (from the test you actually ran), and report some kind of frequentist average!
The severity requirement makes explicit that such a construal is to be rejected—I would have thought it obvious, and not in need of identifying a special principle. Since it wasn’t, I articulated this special notion for interpreting tests and the corresponding severity criterion.
I gave an honorary mention to Christian Robert on this point in his discussion of Cox and Mayo (2010). Robert writes (p. 9):
A compelling section is the one about the weak conditionality principle (pp. 294-298), as it objects to the usual statement that a frequency approach breaks this principle. In a mixture experiment about the same parameter θ, inferences made conditional on the experiment “are appropriately drawn in terms of the sampling behaviour in the experiment known to have been performed” (p. 296). This seems hardly objectionable, as stated. And I must confess the sin of stating the opposite as The Bayesian Choice has this remark (Robert (2007), Example 1.3.7, p. 18) that the classical confidence interval averages over the experiments. The term experiment validates the above conditioning in that several experiments could be used to measure θ, each with a different p-value. I will not argue with this.
He would want me to mention that he does raise some caveats:
I could, however, [argue] about ‘conditioning is warranted to achieve objective frequentist goals’ (p. 298) in that the choice of the conditioning, among other things, weakens the objectivity of the analysis. In a sense the above pirouette out of the conditioning principle paradox suffers from the same weakness, namely that when two distributions characterise the same data (the mixture and the conditional distributions), there is a choice to be made between “good” and “bad”. (http://arxiv.org/abs/1111.5827)
But there is nothing arbitrary about regarding as “good” the only experiment actually run and from which the actual data arose. The severity criterion only makes explicit what is, or should be, already obvious. Objectivity, for us, is directed by the goal of making correct and warranted inferences, not freedom from thinking. After all, any time an experiment E is performed, the critic could insist that the decision to perform E was the result of some chance circumstances: with some probability we might have felt differently that day and run some other test, perhaps a highly imprecise test, or a much more precise one, or anything in between. The critic could then demand that we report whatever average properties they come up with. The error statistician can only shake her head in wonder that this gambit is at the heart of criticisms of frequentist tests.
Still, we exiled ones can’t be too fussy, and Robert still gets the mention for conceding that we have a solid leg on which to pirouette.
The relevance of the Deepwater Horizon spill to this blog stems from its having occurred while I was busy organizing the conference “StatSci meets PhilSci” (to take place at the LSE in June 2010). So all my examples there involved “deepwater drilling,” but of the philosophical sort. Search the blog for further connections (especially the RMM volume, and the blog’s “mascot” stock, Diamond Offshore (DO), which has now bottomed out at around $48; long story).
Of course, the spill cam wasn’t set up right away.
 If any readers work on the statistical analysis of the toxicity of the fish or sediment from the BP oil spill, or know of good references, please let me know.
BP said all tests had shown that Gulf seafood was safe to consume and there had been no published studies demonstrating seafood abnormalities due to the Deepwater Horizon accident.
There have been around 4-5 other “honorable mentions” since then, though I’m not sure of the exact number.
Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 247-275.