Reliability and Reproducibility: Fraudulent p-values through multiple testing (and other biases): S. Stanley Young (Phil 6334: Day#13)


S. Stanley Young, PhD
Assistant Director for Bioinformatics
National Institute of Statistical Sciences
Research Triangle Park, NC

Here are Dr. Stanley Young’s slides from our April 25 seminar. They contain several tips for unearthing deception by fraudulent p-value reports. Since it’s Saturday night, you might wish to perform an experiment with three 10-sided dice*, recording the results of 100 rolls (3 at a time) on the form on slide 13. An entry, e.g., (0,1,3), becomes an imaginary p-value of .013 associated with a particular combination of tumor type, sex (male/female), and age (old/young). You then report only the hypotheses whose null is rejected at a “p-value” less than .05. Forward your results to me for publication in a peer-reviewed journal.

*Sets of 10-sided dice will be offered as a palindrome prize beginning in May.
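The arithmetic behind the joke: each die shows a digit from 0 to 9, so the three-digit “p-value” is uniform on .000, .001, ..., .999, and exactly 50 of the 1,000 equally likely outcomes fall below .05. In 100 rolls you should therefore expect about 5 “significant” findings from pure noise, and the chance of getting at least one is 1 - 0.95^100, roughly 0.994. Below is a minimal Python sketch of the experiment. The function names are hypothetical, and because the form on slide 13 (which assigns each roll to a tumor type, sex and age cell) is not reproduced here, rolls are simply numbered.

```python
from __future__ import annotations

import random


def roll_three_d10(rng: random.Random) -> tuple[int, int, int]:
    """Roll three fair 10-sided dice, each showing a digit from 0 to 9."""
    return (rng.randrange(10), rng.randrange(10), rng.randrange(10))


def dice_to_p(roll: tuple[int, int, int]) -> float:
    """Read the three digits as an imaginary p-value, e.g. (0, 1, 3) -> 0.013."""
    a, b, c = roll
    return (100 * a + 10 * b + c) / 1000


def run_experiment(n_rolls: int = 100, alpha: float = 0.05,
                   seed: int | None = None) -> list[tuple[int, float]]:
    """Generate n_rolls pure-noise 'p-values' and keep only those below alpha.

    Each roll stands in for one hypothesis (hypothetically, one cell of the
    form); the returned (roll index, p) pairs are what a selective reporter
    would publish.
    """
    rng = random.Random(seed)
    reported = []
    for i in range(n_rolls):
        p = dice_to_p(roll_three_d10(rng))
        if p < alpha:
            reported.append((i, p))
    return reported


def chance_of_at_least_one(n_rolls: int = 100, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent noise 'tests' falls below alpha."""
    return 1 - (1 - alpha) ** n_rolls


if __name__ == "__main__":
    reported = run_experiment(seed=2014)
    print(f"{len(reported)} of 100 noise 'hypotheses' came out 'significant':")
    for i, p in reported:
        print(f"  roll {i:3d}: p = {p:.3f}")
    print(f"P(at least one p < .05) = {chance_of_at_least_one():.3f}")
```

The number reported varies from run to run, but every one of those “findings” is noise by construction, which is the point of reporting only the hypotheses that cross .05.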

Categories: Phil6334, science communication, spurious p values, Statistical fraudbusting, Statistics


12 thoughts on “Reliability and Reproducibility: Fraudulent p-values through multiple testing (and other biases): S. Stanley Young (Phil 6334: Day#13)”

  1. The key statement is ‘if you report just the significant ones’. The P-value per test is controlled provided that each test is performed correctly. Of course, it is implausible that the P-values within a given trial are independent, and dependence may cause some difficulties for interpretation. Nevertheless, if the null is true, the expected proportion of P-values less than 0.05 (say) will not be more than 1/20, however many tests you do.

    The main sin of multiplicity is doing lots of stuff and only reporting what appears interesting. This is actually a problem whether or not you do significance tests.

    1. Senn S, Bretz F. Power and sample size when multiple endpoints are considered. Pharmaceutical Statistics 2007; 6: 161-170.

  2. Stan Young led a terrific seminar that really let participants see the applications of the foundational issues we’d been discussing from various philosophical perspectives all semester. To mention just one thing: I hadn’t understood before how resampling can be used to adjust for multiple testing. The entire discussion was extremely illuminating. Thanks so much Stan!

  3. “Papers following good manufacturing procedures and addressing important questions should be accepted without regard to statistical significance” (slide 28).

    Couldn’t agree more!

    • Carlos: I think there is some ambiguity in that point, so I’m glad you raised it so we can see what Stan says. I take his point to be not that the actual (non-fraudulent, adjusted as needed) p-value wouldn’t be reported, but that there is much to learn from non-significant results as well, provided, of course, they follow “good manufacturing procedures”. The question is what that will require aside from critical scrutiny of whether the reported error probabilities are legitimate or illegitimate. If the concern isn’t to control and assess the legitimacy of the error probabilities reported (be they p-values, confidence levels, or others), then it’s hard to see the basis for Stan’s emphasis on needed adjustments and avoidance of biased assessments.

      • Taking his analogy from Deming (who, by the way, is one of the applied statisticians I admire the most!), I guess he is trying to emphasize the incentives!

        Today we “reward” significant results. So we give “employees” the wrong incentive – they hunt significance. And so we end up with unreliable “significant” results. Certainly not what we wanted.

        If, instead, we “reward” good methods, we will get what we want: good data from which we can draw reliable inferences, even if the inference is that we can’t settle the question yet (which is better than false precision)!

        • Carlos: Right, but this is different from ignoring or throwing out p-values or other error probabilities. I was noting that his statement was ambiguous.

  4. * I guess he is

  5. West

    Mayo: Apologies for going off-topic, but an interesting small workshop was held this weekend in SC that discussed the problem of multiple testing in particle physics at length. I figured from your Higgs coverage that you would be interested in perusing the slides, particularly the one from Robert Cousins. They are publicly available here:

    • West: Very interested. Where was this again? SC?

      • West

        University of South Carolina at Columbia

        • I can’t believe no one told me, neither Staley nor Cousins, and I’m doing a session on the Higgs next fall with both of them. Thanks for letting me know.
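Comment 2 mentions using resampling to adjust for multiple testing. Young’s slides are not reproduced here, so what follows is only a sketch of one standard resampling approach, a single-step maxT permutation adjustment, and not necessarily the procedure presented in the seminar. Group labels are shuffled jointly across all endpoints, which preserves whatever correlation the endpoints have (the dependence Senn mentions), and each endpoint’s observed statistic is then compared with the permutation distribution of the largest statistic over all endpoints. The data are simulated with no true group difference, so any raw p-value below .05 is exactly the kind of result that selective reporting would turn into a “finding”. The names `simulate_null`, `maxt_adjust` and the rest are hypothetical.

```python
from __future__ import annotations

import random


def _mean_var(values: list[float]) -> tuple[float, float]:
    """Sample mean and unbiased sample variance."""
    n = len(values)
    mu = sum(values) / n
    return mu, sum((v - mu) ** 2 for v in values) / (n - 1)


def welch_t(x: list[float], y: list[float]) -> float:
    """Absolute value of the Welch two-sample t statistic."""
    mx, vx = _mean_var(x)
    my, vy = _mean_var(y)
    return abs(mx - my) / (vx / len(x) + vy / len(y)) ** 0.5


def endpoint_stats(data: list[list[float]], labels: list[int]) -> list[float]:
    """|t| for every endpoint; data[subject][endpoint], labels[subject] in {0, 1}."""
    stats = []
    for j in range(len(data[0])):
        x = [row[j] for row, g in zip(data, labels) if g == 0]
        y = [row[j] for row, g in zip(data, labels) if g == 1]
        stats.append(welch_t(x, y))
    return stats


def maxt_adjust(data: list[list[float]], labels: list[int], n_perm: int = 999,
                seed: int | None = None) -> tuple[list[float], list[float]]:
    """Raw and single-step maxT-adjusted permutation p-values, one per endpoint."""
    rng = random.Random(seed)
    observed = endpoint_stats(data, labels)
    m = len(observed)
    raw_hits = [1] * m  # the observed labelling counts as one permutation
    adj_hits = [1] * m
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)  # one shuffle is applied to all endpoints at once
        t_perm = endpoint_stats(data, perm)
        t_max = max(t_perm)
        for j in range(m):
            if t_perm[j] >= observed[j]:
                raw_hits[j] += 1
            if t_max >= observed[j]:  # compare against the largest |t| of any endpoint
                adj_hits[j] += 1
    return ([h / (n_perm + 1) for h in raw_hits],
            [h / (n_perm + 1) for h in adj_hits])


def simulate_null(n_per_group: int, n_endpoints: int, rng: random.Random,
                  shared: float = 0.5) -> tuple[list[list[float]], list[int]]:
    """Two groups with NO true difference; a per-subject effect correlates endpoints."""
    data = []
    for _ in range(2 * n_per_group):
        subject = rng.gauss(0.0, 1.0)
        data.append([shared * subject + rng.gauss(0.0, 1.0) for _ in range(n_endpoints)])
    labels = [0] * n_per_group + [1] * n_per_group
    return data, labels


if __name__ == "__main__":
    data, labels = simulate_null(15, 20, random.Random(2014))
    raw, adj = maxt_adjust(data, labels, seed=1)
    print("endpoints with raw p < .05:     ", sum(p < 0.05 for p in raw))
    print("endpoints with adjusted p < .05:", sum(p < 0.05 for p in adj))
```

Because an endpoint’s adjusted p-value counts every permutation in which any endpoint is at least as extreme, it can never be smaller than its raw p-value; that is how the adjustment accounts for having looked at many endpoints at once.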
