Getting Credit (or blame) for Something You Didn’t Do (BP oil spill)


Spill Cam

Four years ago, many of us were glued to the “spill cam” showing, in real time, the oil gushing from the April 20, 2010 explosion that sank the Deepwater Horizon oil rig in the Gulf of Mexico, killing 11 and spewing oil until July 15 (see the video clip added below). Remember junk shots, top kill, blowout preventers? [1] The EPA lifted its gulf drilling ban on BP just a couple of weeks ago* (BP has paid around $27 billion in fines and compensation), and April 20, 2014, is the deadline to properly file forms for new compensation claims.

(*After which BP had another small spill in Lake Michigan.)

But what happened to the 200 million gallons of oil? Has it vanished, or just been sunk to the bottom of the sea by dispersants that may have caused hidden destruction of sea life? I don’t know, but given that it’s Saturday night, let’s listen in on a reblog of a spill-related variation on the second of two original “overheard at the comedy hour” jokes.

In effect, it accuses the frequentist error-statistical account of licensing the following (make-believe) argument after the 2010 oil spill:

Oil Exec: We had highly reliable evidence that H: the pressure was at normal levels on April 20, 2010!

Senator: But you conceded that whenever your measuring tool showed dangerous or ambiguous readings, you continually lowered the pressure, and that the stringent “cement bond log” test was entirely skipped.

Oil Exec: Granted, we omitted reliable checks on April 20, 2010, but usually we do a better job—I am reporting the average! You see, we use a randomizer that most of the time directs us to run the gold-standard check on pressure. April 20 just happened to be one of those times we ran the nonstringent test; but on average we do OK.

Senator:  But you don’t know that your system would have passed the more stringent test you didn’t perform!

Oil Exec: That’s the beauty of the frequentist test!

Even if we grant (for the sake of the joke) that overall this “test” rarely errs in the report it outputs (pass or fail), that is irrelevant to appraising the inference from the data on April 20, 2010 (which would have differed had the more stringent test been run). That interpretation violates the severity criterion: the observed passing result was altogether common if generated from a source where the pressure level was unacceptably high; it therefore misinterprets the actual data. The question is why anyone would saddle the frequentist with such shenanigans on averages. … Lest anyone think I am inventing a criticism, here is a familiar statistical instantiation, where the probability of choosing each experiment is given to be .5 (Cox 1958).
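The shenanigans can be made concrete with a quick simulation (all of the error rates and the 90/10 randomizer split below are made-up numbers, purely for illustration): averaged over the randomizer, the chance of wrongly passing a dangerously high pressure is small, yet conditional on the lax check actually having been run, that chance is enormous.

```python
import random

def check_wrongly_passes(stringent):
    """Hypothetical checks applied when pressure is in fact dangerously
    high: the stringent check almost always catches it, the lax check
    usually waves it through. (Both miss rates are invented.)"""
    miss_rate = 0.01 if stringent else 0.90
    return random.random() < miss_rate  # True = wrongly reports "pass"

random.seed(1)
n = 100_000
wrong_overall = 0
wrong_lax = 0
lax_runs = 0
for _ in range(n):
    stringent = random.random() < 0.9  # randomizer: gold standard 90% of the time
    wrong = check_wrongly_passes(stringent)
    wrong_overall += wrong
    if not stringent:
        lax_runs += 1
        wrong_lax += wrong

print(wrong_overall / n)      # about 0.10: looks fine "on average"
print(wrong_lax / lax_runs)   # about 0.90 on the days the lax check was run
```

The average the Oil Exec reports pools over checks that were never run on April 20; the second number is the one relevant to the actual data.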

Two Measuring Instruments with Different Precisions:

 A single observation X is to be made on a normally distributed random variable with unknown mean µ, but the measuring instrument is chosen by a coin flip: with heads we use instrument E′, with a known small variance, say 10^–4, while with tails we use E″, with a known large variance, say 10^4. The full data indicate whether E′ or E″ was performed, and the particular value observed, which we can write as x′ and x″, respectively. (This example comes up in my discussions of the strong likelihood principle (SLP), e.g., ton o’bricks, and here.)

In applying our favorite one-sided (upper) Normal test T+ to a null hypothesis, say, µ = 0, the “same” value of X would correspond to a much smaller p-value were it to have come from E’ than if it had come from E”.  Denote the two p-values as p’ and p”, respectively.  However, or so the criticism proceeds, the error statistician would report the average p-value:  .5(p’ + p”).

But this would give a misleading assessment of the precision and corresponding severity with either measurement! Instead you should report the p-value of the result in the experiment actually run (this is Cox’s Weak Conditionality Principle, WCP).
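To put numbers on this, here is a minimal sketch of the two-instruments example (the observed value x = 0.03 is my own choice for illustration): the same observed x gives wildly different p-values under E′ and E″, and the averaged report tracks neither.

```python
import math

def p_value_upper(x, sigma):
    """One-sided (upper) p-value for the test T+ of H0: mu = 0,
    given a single observation x from N(mu, sigma^2)."""
    z = x / sigma
    # Standard normal survival function via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))

x = 0.03                 # hypothetical observed value (made up)
sigma_precise = 1e-2     # E': known small variance 10^-4
sigma_imprecise = 1e2    # E'': known large variance 10^4

p_prime = p_value_upper(x, sigma_precise)          # ~0.0013 (a 3-sigma result)
p_double_prime = p_value_upper(x, sigma_imprecise) # ~0.5 (utterly unremarkable)

averaged = 0.5 * (p_prime + p_double_prime)        # ~0.25, the alleged report
print(p_prime, p_double_prime, averaged)
```

The WCP says to report p′ if the coin landed heads and p″ if tails; the averaged figure of roughly 0.25 misdescribes the precision of whichever experiment was actually run.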

But what could lead the critic to suppose the error statistician must average over experiments not even performed?  Rule #2 for legitimate criticism is to give the position being criticized the most generous construal one can think of.  Perhaps the critic supposes what is actually a distortion of even the most radical behavioristic construal:

  •   If you consider outcomes that could have occurred in hypothetical repetitions of this experiment, you must also consider other experiments you did not run (but could have been run) in reasoning from the data observed (from the test you actually ran), and report some kind of frequentist average!

The severity requirement makes explicit that such a construal is to be rejected—I would have thought it obvious, and not in need of identifying a special principle. Since it wasn’t, I articulated this special notion for interpreting tests and the corresponding severity criterion.
I gave an honorary mention to Christian Robert [3] on this point in his discussion of Cox and Mayo (2010). Robert writes (p. 9):

A compelling section is the one about the weak conditionality principle (pp. 294–298), as it objects to the usual statement that a frequency approach breaks this principle. In a mixture experiment about the same parameter θ, inferences made conditional on the experiment “are appropriately drawn in terms of the sampling behaviour in the experiment known to have been performed” (p. 296). This seems hardly objectionable, as stated. And I must confess the sin of stating the opposite as The Bayesian Choice has this remark (Robert (2007), Example 1.3.7, p. 18) that the classical confidence interval averages over the experiments. The term experiment validates the above conditioning in that several experiments could be used to measure θ, each with a different p-value. I will not argue with this.

He would want me to mention that he does raise some caveats:

I could, however, [argue] about ‘conditioning is warranted to achieve objective frequentist goals’ (p. 298) in that the choice of the conditioning, among other things, weakens the objectivity of the analysis. In a sense the above pirouette out of the conditioning principle paradox suffers from the same weakness, namely that when two distributions characterise the same data (the mixture and the conditional distributions), there is a choice to be made between “good” and “bad”.

But there is nothing arbitrary about regarding as “good” the only experiment actually run and from which the actual data arose.  The severity criterion only makes explicit what is/should be already obvious. Objectivity, for us, is directed by the goal of making correct and warranted inferences, not freedom from thinking. After all, any time an experiment E is performed, the critic could insist that the decision to perform E is the result of some chance circumstances and with some probability we might have felt differently that day and have run some other test, perhaps a highly imprecise test or a much more precise test or anything in between, and demand that we report whatever average properties they come up with.  The error statistician can only shake her head in wonder that this gambit is at the heart of criticisms of frequentist tests.

Still, we exiled ones can’t be too fussy, and Robert still gets the mention for conceding that we have  a solid leg on which to pirouette.

[1] The relevance of the Deepwater Horizon spill to this blog stems from its having occurred while I was busy organizing the conference “StatSci meets PhilSci” (to take place at the LSE in June 2010). So all my examples there involved “deepwater drilling”, but of the philosophical sort. Search the blog for further connections (especially the RMM volume, and the blog’s “mascot” stock, Diamond Offshore, DO, which has now bottomed out at around $48, long story).

Of course, the spill cam wasn’t set up right away.

[2] If any readers work on the statistical analysis of the toxicity of the fish or sediment from the BP oil spill, or know of good references, please let me know.

BP said all tests had shown that Gulf seafood was safe to consume and there had been no published studies demonstrating seafood abnormalities due to the Deepwater Horizon accident.

[3] There have been around 4–5 other “honorable mentions” since then; I’m not sure of the exact count.


Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 247-275.




Categories: Comedy, Statistics


8 thoughts on “Getting Credit (or blame) for Something You Didn’t Do (BP oil spill)”

  1. Reference to the weak conditionality principle might lead readers to think I’m hinting at a return to the infamous strong likelihood principle.

  2. john byrd

    It seems straightforward that probabilities are assigned to processes. The sample space and probability space only have meaning when assigned by the experiment. So, the only probabilities that are warranted relate to the experiment actually performed. How much confusion results from failure to articulate the experiment responsible for the outcomes?

    • John: Great to hear from you. The Cox (1958) example—the fact that it had such an enormous impact—suggests a lot of concern about identifying the relevant experiment, the one actually performed. Today you will still see this example given under section headings like “paradoxes of classical statistics” (e.g., Ghosh, Delampady, Samanta 2010, p. 37). The pattern in that text—and this is a very middle-of-the-road Bayesian text—is common: 2.3 “advantages of being a Bayesian”, with the focus on examples like the probability of rain tomorrow, followed by 2.4 “paradoxes”. The other paradox is the ultra-silly example of Welch’s, wherein you happen to know the true value of theta for a given outcome, even though that outcome could be seen as the result of applying a 95% CI procedure. (It’s essentially ex. 8 in Cox and Mayo (2010, p. 296)

      Click to access ch%207%20cox%20&%20mayo.pdf

      also taken up in Spanos’ chapter, p. 326—both in our Error and Inference (2010).) I don’t know if I’m the only one who feels this way, but when I see an account offer howlers as “paradoxes” that are supposed to lead us to prefer view X, I am the opposite of being sold on view X. Instead, I suspect view X cannot have much to recommend it on its own if they are trotting out artificial howlers a mere 30 pages into the book.

      Yes, the Cox example and Welch example make up the entire set of “paradoxes”* in Ghosh et al, and I can point to other texts that are similar.
      *They grant you can condition on the instrument actually used to make the measurement (in the Cox 58 example), but then you are stuck with the strong likelihood principle, and thus with giving up error probabilities (see p. 38). Of course, we at this blog already know the truth about this….

      • visitingstudent

        The classical paradoxes we’re usually told about are low p-values in the face of strong evidence for a null hypothesis, and letting experimenter intentions into the data analysis.

        • We just looked at those. Search for p-value vs posteriors for the first and optional stopping for the second.

  3. Spill cam video just added, so if you have nothing good to watch, you can watch the oil and mud gush forth. I actually became quite impressed with deepwater robots as a result of this, and still own stock in companies that make them, as well as the special mud they must use to plug up leaking wells that have been pretty much pumped dry.

  4. e.Berk

    I don’t understand something. Wouldn’t the Bayesian oil executive multiply by the probability of running the gold-standard test in order to get a high posterior on the measurement being correct, even if this is one of the rare times they cut corners on the cement log? Wouldn’t it be like giving that high prior to Isaac being unready, despite the high grade? Isaac pays the penalty of coming from Fewready, and the BP measurement gets the benefit of their usually doing the gold-standard measurement.

    • Yes, if the probability were assigned to this measurement being reliable, it would be higher because they usually do the gold-standard test, thereby getting the kind of bump (rightly) deemed objectionable in averaging over instruments. So I guess your point is: why isn’t it objectionable for a Bayesian to average?
