Surprising Facts about Surprising Facts


A paper of mine on “double-counting” and novel evidence just came out: “Some surprising facts about (the problem of) surprising facts” in Studies in History and Philosophy of Science (2013).

ABSTRACT: A common intuition about evidence is that if data x have been used to construct a hypothesis H, then x should not be used again in support of H. It is no surprise that x fits H, if H was deliberately constructed to accord with x. The question of when and why we should avoid such ‘‘double-counting’’ continues to be debated in philosophy and statistics. It arises as a prohibition against data mining, hunting for significance, tuning on the signal, and ad hoc hypotheses, and as a preference for predesignated hypotheses and ‘‘surprising’’ predictions. I have argued that it is the severity or probativeness of the test—or lack of it—that should determine whether a double-use of data is admissible. I examine a number of surprising ambiguities and unexpected facts that continue to bedevil this debate.

Categories: double-counting, Error Statistics, philosophy of science, Statistics

36 thoughts on “Surprising Facts about Surprising Facts”

  1. In the doc, the bold-font “x” glyph looks a fair bit like an aleph. Weird.

  2. Still reading your paper, but a first question:

    What do you think about the following severity requirement:

    “The observed data provide good evidence for H if, and only if,
    (i) they do not provide evidence against H
    (ii) they provide (strong) evidence against ~H. ”

    This makes sense to me, since when data do not provide evidence against H, it does not necessarily imply that the same data provide evidence against ~H. When both conditions hold, the data are said to provide evidence in favor of H.

    What do you think?


    • Alexandre: Well, for starters I would not want to set out a requirement for evidence that made use of the notion of evidence.

      • Of course,

        My intention was not that… it was something like

        “The observed data provide good evidence for H if, and only if,
        (i) they agree with H
        (ii) they (strongly) disagree with ~H. ”


        “The observed data provide good evidence for H if, and only if,
        (i) they do not contradict H
        (ii) they (strongly) contradict ~H. ”

        I come back later…


        • Alexandre: These too are unclear, and would need unpacking (especially H and ~H), and on the face of it, wouldn’t work for my purposes, for a number of reasons that would need more time to explain. Of course they might work for yours.
          Remember the importance (for the severity account) in being able to say what there’s poor evidence of. What is blocked and why needs to be informative in its own right.

  3. I have a bit of a quandary. You see, a colleague of mine and I were collaborating on a study investigating 20 possible factors that might explain an effect. I, being a careerist hack, carried out 20 tests at the 0.05 nominal significance level, and found one factor that seemed to explain the effect. What luck! Another paper for my CV.
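    The worry about those 20 tests can be made concrete with a minimal sketch (assuming, purely for illustration, that the tests are independent, which the story doesn't say): even if none of the factors has any real effect, such a hunt will likely turn up at least one nominally significant result.

```python
# Illustrative sketch (independence of the 20 tests is assumed, not
# given in the story): family-wise error rate of a 20-test hunt.
n_tests, alpha = 20, 0.05

# Probability that at least one of 20 true-null tests comes out
# "significant" at the 0.05 level purely by chance.
fwer = 1 - (1 - alpha) ** n_tests
print(f"P(at least one lucky 'finding') = {fwer:.2f}")  # ~0.64
```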

    My colleague, Professor Mustard, is a domain expert, and suspected that a certain specific factor might be the primary cause. She tested her hypothesis using the standard statistical test — and she was, to all appearances, absolutely right! (She and I had identified the same factor.) She then asked a graduate student to check whether any other factors showed any potential, but none did. What a triumph for her!

    While Professor Mustard and I were discussing the paper draft in the faculty lounge, a passing error statistician heard me describe my results and butted in. He explained that my data analytic approach had violated the severity requirement with a use-constructing rule. Professor Mustard explained that although *my* analysis was faulty, *her* approach had focused on the key factor from the start, and so we were fully justified in our plan to publish a paper about her successful hypothesis.

    This puzzled the error statistician, so he suggested we consult with you directly. (He observed rye-ly in passing that perhaps the solution was to leave my name off the list of authors! I was not amused.)

    • Cornedbeef: So you mustard a ham-handed note as Cornedbeef butt there’s a serious missed steak with this old baked-on example! The error statistician is puzzled? Boloney! The two cases each have a very different margarine for error. But I can’t rehash all I wrote. Read Mayo, Mr. Mustard, and catch up!

      • (Apparently a previous reply got eaten by the moderation queue…)

        Mayo: I have read your work, as you know. I’m not trying to trick you and I’m not being deliberately obtuse — I’m trying to apply the severity principle in specific toy examples and check my reasoning. The way I’ve framed my little story should make it clear that I think that the error statistician’s rye observation is actually correct. The argument from coincidence doesn’t go through for Cornedbeef’s rule but does go through for Mustard’s predesignated hypothesis.

        Mustard’s background knowledge and domain expertise are an essential part of the story, while Cornedbeef has contributed nothing and should get no credit (unless he helped collect the data, but let’s assume he didn’t). So here we have a case where two rules claim that the same conclusion follows from the same data, but one of the claims is faulty on account of the properties of the rule that asserts the claim. Right?

        • Corey: Well you seem to have answered your own question, but I’m not sure I see it as a case where “two rules claim that the same conclusion follows from the same data, but one of the claims is faulty on account of the properties of the rule that asserts the claim.” I wouldn’t even call it the “same” data, and certainly not “follows from” (maybe, purports to be warranted by). Of course I’m not interested in publication policy, but if Mustard needs to send her graduate students out to check if there are other explanations (how? Dredging the observed data?) then there’s a huge question as to whether the assumptions of her allegedly “successful” study are satisfied. So, I agree with E. Berk—Mustard’s procedure doesn’t cut the mustard for Mayo. Of course, any such criticism can be appealed, but the onus is on the researcher to show how she’s subjected her methods and models to severe scrutiny.

          • Mayo: “Well you seem to have answered your own question”

            Yup — and now I’m checking my answer in the back of the book, as it were.

            ‘I wouldn’t even call it the “same” data…’

            I think this is a terminology problem? To be ridiculously concrete, the two professors are using the exact same Excel file (and the same statistical test).

            “purports to be warranted by”

            Yeah, I had a brief struggle with how to phrase this, and then just shrugged and said “screw it, it’s a blog comment”.

            “the onus is on the researcher to show how she’s subjected her methods and models to severe scrutiny.”

            I’m skating over the “methods and models” deliberately — we may assume that these have been validated and are SOP in the field of study. Do I take your point correctly that the text Mustard might have written in the event that her grad student had reported that some other factor(s) had looked interesting is important in the severity account? (I can imagine various things a typical honest-but-not-statistically-sophisticated scientist might be inclined to write in that event, things along the lines of, “The explanation I favored does not look to be the whole story.” It seems hard to really pin down a “sample space”, though.)

            • Corey: You cannot skate over methods and models because you can’t interpret “the data” without them. Rereading the ham-handed note by Cornedbeef that you mustard earlier, I see that what you mean is something like the “same observed data on factor f”, and that Mustard only sent her students to scour the rest of the full data (rather than separately ruling out other factors, if that’s what she needs to affirm f–I have no idea), but this assumes their finding no nominal significance on the other factors rules them out and affirms her explanation, which it does not. And yes, what she would have done had she found the others is relevant to what she did, or rather it IS part of, and inseparable from, correctly understanding her method and model. I don’t think going much further with this imaginary and vague example will take us too far, though.

              • Mayo: You’ve used what I call “toy problems” (e.g., Mayo/Hatshepsut’s scale) to explain severity reasoning in your papers. My approach here was to create a toy problem with two analyses in which certain severity-relevant aspects were “bracketed” (held constant) in order to focus on the contrast between the two analyses vis-à-vis other severity-relevant aspects. Apparently my approach needs some fine-tuning…

                • Corey: Sure, but it depends on what’s at stake in the toy example. In the Hatshepsut weighing-of-mummy example, I was just giving something analogous to the Hitchcock–Sober measuring example (to take up their criticism of severity). Not sure what you’re aiming for with your example.

                  • Aiming for predesignation of an hypothesis of interest based on prior information.

                    • Corey: Don’t understand aiming for predesignation thru background. Explain. Maybe you mean aiming for a well warranted inference. You won’t get this with isolated test results; you won’t have knowledge of a genuine effect.

                    • Mayo: I’m trying to get at whether or not the fact that prior information (in our judgment) picks out a particular hypothesis of interest among some set in which it is otherwise undistinguished is sufficient to make the argument from coincidence go through.

                    • My thought was that the argument from coincidence goes through (to some extent?) for a hypothesis predesignated by (our interpretation of the) prior information, but fails when that is lacking.

                    • Check the link I sent last time. An arg from coincidence depends on what was actually done/shown–not a matter of an interpretation of prior information.

                    • Derp. My first reply wasn’t showing up for me until I posted the second.

    • E.Berk

      Cornedbeef: Any data from a randomized control clinical trial could have come from a judgment sample, but the properties of the test still differ. With the data-free information you provide, the test properties are indeterminable. It would be fraudulent reporting to treat a hunting expedition for a small p-value as having an error probability of p%, but it’s impossible to say whether Mustard has passed anything with severity either. She told her graduate students to go check for any other factors and they found nothing? Doesn’t cut the mustard.

  4. The connections between novelty, predesignation, no-double counting and severity correspond to major contrasts and problems in philosophy of science over the years. The Popperians and predesignationists before them spent a lot of time trying to pin down “novel facts”. The paper mentions the 3 main accounts of novelty. Popper advocated either temporal or theoretical novelty (or maybe both together). The “no double counting” and “no use-constructing” variants seemed the most promising, but they still didn’t really have a clear epistemological rationale (a rationale having to do with knowledge or evidence).

    On the statistics side you have frequentist testers also caring about predesignation–setting out the hypothesis to test or criterion for rejection “ahead of time”–or at least not using the data in a certain way to tie things down…
    But it cannot be the time alone that matters–if one is seeking an epistemological rationale. Prohibiting “use-constructing” seemed to make more sense. Then one day some old man backed his car into my silver Camaro (#1), but drove off, apparently unaware he’d smashed into me. Maybe you’d say the reenactment for the police wasn’t even obviously a violation of predesignation, but I think it was. No matter, that’s when I realized what was wrong with taking predesignation as a requirement, in and of itself, and what to replace it with.

  5. This just in on the Harkonen trial: “failure to disclose multiple testing or deviation from a protocol is demonstrably false or misleading”.
    We’ve discussed the case at some length before (search blog), and it’s so circuitous as to prove little if anything about how the methodological issues are playing out at SCOTUS. Still, Professor Cornedbeef should take note!

  6. Deborah

    The severity principle proposed by Deborah Mayo is used for accepting a null hypothesis H when:

    1. there is no evidence to reject H and
    2. H passes the test with high severity.

    It seems to me that it is quite similar to my proposal (the s-value), but the measures involved in steps 1. and 2. are different.

    see more on

    • Alexandre: I know we contrasted our views before, and the contrast remains, so far as I can see. It appears that you’re looking at fits, nearness, or likelihoods (or, more literally, the confidence level at which the null or alternative would be an upper or lower CI limit given the data). I could never say, as you do (correct me if I’m wrong), that I have SEV(H) = 1 and also SEV(~H) = .3, for the H, ~H you give. Also, in contrast to what you have in example (c), the sample average of 9.9 would not warrant saying mu > 10 is hunky dory. SEV(mu > 10) would be less than .5.

      But I only had a quick look at your blog. I may be missing something. Thanks.
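      The numerical claim just above (that SEV(mu > 10) falls below .5 when the sample average is 9.9) can be checked with a minimal sketch, under illustrative assumptions not given in the thread: a one-sided Normal test with known σ = 1 and n = 25, chosen purely for concreteness. Severity for “mu > 10” is then the probability of a sample mean as small as or smaller than the observed one, computed under mu = 10, which is below .5 whenever the observed mean is below 10.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Illustrative assumptions (not specified in the thread):
# X ~ N(mu, sigma^2), sigma = 1, sample size n = 25, observed mean 9.9.
sigma, n, xbar = 1.0, 25, 9.9

# Severity for "mu > 10": probability of a sample mean as small as or
# smaller than the one observed, were mu exactly 10.
sev = norm_cdf((xbar - 10.0) / (sigma / sqrt(n)))
print(f"SEV(mu > 10) = {sev:.3f}")  # ~0.309, below 0.5
```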

      • Deborah,

        I think there is a relation, but it is not a 1:1 relation. Unfortunately, I am not able to explain it properly now, for many reasons.

        We should first acknowledge that there are different types of hypotheses:

        1) H0 and H1 are both non-sharp hypotheses
        2) H0 is sharp and H1 is non-sharp
        3) H0 is non-sharp and H1 is sharp.

        For each of these types, we have different types of decisions. It is very important to treat each case separately, since each case carries a different degree of restrictiveness. On the one hand, if H0 is sharp, then it imposes a strong restriction and such an H0 cannot be accepted (e.g., all swans are white). On the other hand, if H0 is non-sharp, then it imposes a soft restriction and such an H0 can be accepted (e.g., all swans in my backyard are white).

        *Analysis only for a non-sharp H0*

        When H0 is non-sharp, the value s(H0) can happen to be 1. This just means that the observed data do not provide information against H0. That alone is not sufficient to accept H0; to accept H0 we should verify whether H1 is strongly contradicted by the data, i.e., whether s(H1) is small.

        Remember that s(H0) = 1 does not imply that the p-value under H0 is also 1. It implies that the p-value (under H0) is not small. That is, the observed data corroborate H0.

        I stress that even if the observed data corroborate H0, the corresponding p-value will almost never be one. Therefore, the s-value and p-value are not directly comparable; a special transformation is required.

        A confusion might occur here. If H0 is a point null hypothesis (a sharp hypothesis), then the observed data will never corroborate H0, since the best estimate will almost always lie outside H0. However, remember that there exist non-point null hypotheses (non-sharp hypotheses), and for such cases the best estimate can lie inside H0 (corroborating it).

  7. I don’t know why but the hypotheses did not appear right in the text above:

    Let’s try again:

    H0: μ ≤ 10 and H1: μ > 10.

    • Alexandre: The comment parser is not smart enough to distinguish between the angle bracket used as a less-than sign and an HTML tag delimiter.

  8. Off-topic: I just noticed the palindrome for December. Impressive!

    • Corey: Thanks. I wonder why no one tries to win free books when I’ve made the contest trivially easy. I admit it used to be quite hard with two words + Elba, but now it’s a cinch.

  9. Comments should now be open to all, the Elbians modified the restrictions.

  10. Corey: I couldn’t reply directly under your comment: “Mayo: I’m trying to get at whether or not the fact that prior information (in our judgment) picks out a particular hypothesis of interest among some set in which it is otherwise undistinguished is sufficient to make the argument from coincidence go through.”

    I’m trying to guess at your meaning. Perhaps you’d find it of relevance to look at 3 pages of chapter 9: “the Creative Error Theorist” (EGEK pp. 314-316). It’s available on my publications page off this site.

    It’s the tail end of this chapter:

  11. Disclosing flexibility with data is a good start, but I say, demonstrate you’ve subjected your analysis to severe scrutiny.

  12. An example from Ronald Giere (showing N-P inadmissibility of a type of postdesignation of properties):
    To each set of n members from a population of A’s assign some shared property. The full population has U members where U > 2n. Then arbitrarily assign this same property to (U/2 – n) additional members. Then every possible n-membered sample shares at least one apparent regularity, even though every property has a ratio of ½ in the population. (EGEK p. 305)
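    Giere’s construction can be verified with a small sketch (the toy numbers U = 6, n = 2 are my own choice, satisfying U > 2n): build one “property” per n-membered sample by padding the sample with arbitrary extra members up to U/2, then check that every sample shares its constructed property while each property holds for exactly half the population.

```python
from itertools import combinations

U, n = 6, 2  # toy population size and sample size, with U > 2n
population = list(range(U))

# For each n-membered sample, construct a "property": the sample itself
# plus (U/2 - n) arbitrary additional members, giving ratio 1/2 overall.
properties = {}
for sample in combinations(population, n):
    extras = [m for m in population if m not in sample][: U // 2 - n]
    properties[sample] = set(sample) | set(extras)

# Every possible n-sample shares its constructed property...
assert all(set(s) <= p for s, p in properties.items())
# ...yet each property holds for exactly half of the population.
assert all(len(p) == U // 2 for p in properties.values())
print(f"checked {len(properties)} samples: every one shares a 'regularity'")
```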

  13. Christian Hennig

    Prof. Mustard is fine if she only ever intended to publish the result of her pre-specified test as a proper significance result, whereas the student’s potential findings would in any case be published with a note that they are the result of rather exploratory scanning, and that such p-values are nothing more than exploratory tools pointing at what to run a proper study on next, right?
