P-values can’t be trusted except when used to argue that P-values can’t be trusted!

Have you noticed that some of the harshest criticisms of frequentist error-statistical methods these days rest on methods and grounds that the critics themselves purport to reject? Is there a whiff of inconsistency in proclaiming an “anti-hypothesis-testing stance” while in the same breath extolling the uses of statistical significance tests and p-values in mounting criticisms of significance tests and p-values? I was reminded of this in the last two posts (comments) on this blog (here and here) and one from Gelman from a few weeks ago (“Interrogating p-values”).

Gelman quotes from a note he is publishing:

“..there has been a growing sense that psychology, biomedicine, and other fields are being overwhelmed with errors … . In two recent series of papers, Gregory Francis and Uri Simonsohn and collaborators have demonstrated too-good-to-be-true patterns of p-values in published papers, indicating that these results should not be taken at face value.”

But this fraudbusting is based on finding statistically significant differences from null hypotheses (e.g., nulls asserting random assignments of treatments)! If we are to hold small p-values untrustworthy, we would be hard pressed to take them as legitimating these criticisms, especially those of a career-ending sort.

…in addition to the well-known difficulties of interpretation of p-values…,…and to the problem that, even when all comparisons have been openly reported and thus p-values are mathematically correct, the ‘statistical significance filter’ ensures that estimated effects will be in general larger than true effects, with this discrepancy being well over an order of magnitude in settings where the true effects are small… (Gelman 2013)

But surely anyone who believed this would be up in arms about using small p-values as evidence of statistical impropriety. Am I the only one wondering about this?*
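The "statistical significance filter" in the passage quoted above can be illustrated with a small simulation. This is only a sketch under assumptions that are mine, not Gelman's: every study estimates the same small true effect (0.1) with a normally distributed estimate of standard error 1, and only two-sided 5% "significant" results are kept. The function name and all parameter values are hypothetical.

```python
import random
import statistics


def significance_filter_demo(true_effect=0.1, se=1.0, n_studies=100_000,
                             z_crit=1.96, seed=0):
    """Simulate many studies of one small effect and keep the 'significant' ones.

    Each estimate is drawn from Normal(true_effect, se). A study counts as
    significant when |estimate / se| exceeds z_crit (two-sided 5% test).
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    significant = [e for e in estimates if abs(e / se) > z_crit]
    return {
        # Fraction of studies that pass the filter.
        "share_significant": len(significant) / n_studies,
        # Average size of the estimates that survive the filter.
        "mean_abs_significant_estimate": statistics.fmean(
            abs(e) for e in significant
        ),
    }


if __name__ == "__main__":
    print(significance_filter_demo())
```

Under these assumed settings every surviving estimate exceeds 1.96 in absolute value, so the filtered estimates overstate a true effect of 0.1 by a factor of roughly twenty or more. The size of the exaggeration depends entirely on the assumed ratio of true effect to standard error.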

CLARIFICATION (6/15/13): Corey’s comment today leads me to a clarification, lest anyone misunderstand my point. I am sure that Francis, Simonsohn and others would never be using p-values and associated methods in the service of criticism if they did not regard the tests as legitimate scientific tools. I wasn’t talking about them. I was alluding to critics of tests who point to their work as evidence the statistical tools are not legitimate. Now maybe Gelman only intends to say, what we know and agree with, that tests can be misused and misinterpreted. But in these comments, our exchanges, and elsewhere, it is clear he is saying something much stronger. In my view, the use of significance tests by debunkers should have been taken as strong support for the value of the tools, correctly used. In short, I thought it was a success story! and I was rather perplexed to see somewhat the reverse.


*This just in: If one wants to see a genuine quack extremist** who was outed long ago***, see Ziliak’s article declaring the Higgs physicists are pseudoscientists for relying on significance levels! (in the Financial Post 6/12/13).

**I am not placing the critics referred to above under this umbrella in the least.

***For some reviews of Ziliak and McCloskey, see widgets on left. For their flawed testimony on the Matrixx case, please search this blog.

Categories: reforming the reformers, Statistical fraudbusting, Statistics


43 thoughts on “P-values can’t be trusted except when used to argue that P-values can’t be trusted!”

  1. Nathan Schachtman


    Thanks for the link; I will read with skeptical caution.

    My first reaction was to doubt that the quote of Justice Breyer is accurate. It is accurate, but we should remember that it was Matrixx Initiatives that was insisting that Siracusano PLEAD statistical significance, on the mistaken and misguided notion that security fraud plaintiffs must prove causation. (And it follows that if plaintiffs must prove causation and its necessary predicate facts, then they must plead these facts as allegations in their complaint.) Now the Supreme Court unanimously rejected Matrixx’s assertion, and it’s easy to see why. FDA may take action (and did against this “homeopathic” remedy) without evidence of causation. Having said that it didn’t need causation, the Court was done with statistical significance, but the Justices waxed on, as they sometimes improvidently do. (Note Justice Scalia’s humble concurrence in yesterday’s decision in the Myriad gene patenting case, which was a subtle complaint that Justice Thomas had gone on too long about things he really didn’t understand, at least Justice Scalia didn’t understand.)

    Anyway, Ziliak has harrumphed his success in several sources, including Significance, and the Lancet, but in fact, his amicus brief, which incorrectly interpreted p-values as posterior odds, was ultimately irrelevant to the holding of the Matrixx case.

    I will now settle down and read the Financial Post piece. Thanks for the links to Gelman’s recent papers, too.

    Nathan Schachtman

    • Nathan: It is hard to believe that Z & M have so blatantly and incorrectly interpreted p values as posterior odds. For their erroneous interpretation of power see:
      “If the power of a test is high, say, .85 or higher, then the scientist can be reasonably confident that at minimum the null hypothesis (of, again, zero effect if that is the null chosen) is false and that therefore his rejection of it is highly probably correct”. (Z & M, 132-3).
      When Spanos first told me about this misinterpretation, I assumed they were only alleging that others were guilty of these and other fallacies. But it turns out that they are. I naively thought that they would want to correct this wrong assertion, especially as it works against their own arguments. If psychological sciences are now joining to try and insist on better practices regarding correct use of p-values, as they seem to be, shouldn’t economists be outing the “reformers” spreading bad definitions?* Your excellent post today on power gets to a related issue:

      *Their published response to Spanos was to declare he should be ignored as merely stating mathematical mumbo-jumbo. I’m serious.

  2. Nate: My link to Ziliak was a last second add-on, not really relating to the post which is about more serious critics. Actually Ziliak had sent me this a few days ago, but I didn’t notice it. How do I get to write something of substance on junk science in the Financial Post?

    Anyone who wants to know what really happened in the Matrixx case should see (aside from Schachtman):


  3. Thanks for sharing. I was quite interested in Ziliak’s post. The problem is not just statistical, about which you would know better. I think he gets scientific reasoning all wrong. I would think that science is based on “inference to the best explanation”. If we go down Ziliak’s route, then we wouldn’t be able to say anything about anything.

    • Rameez: The problem with inference to the best explanation is that it is easy to find “explanations” for observed results, even where those explanations have not been probed in the least. In fact, it is this fallacy—affirming the consequent—that is at the bottom of fallacious uses of statistics. (That is why the pure likelihood approach fails to control error probabilities; and why “abduction” differs from reliable ampliative inference, to refer to Peirce..) What one requires in addition to the data x being explained by (or according with) hypothesis H is some assurance that it would not be very easy to have obtained so good an accordance even if H is false. This is the minimal requirement for x even counting as evidence for H. Else one can readily read one’s favorite “explanation” into the data.
      The general severity requirement is found in Peirce, Popper, Meehl and many other places. In the formal realm, a proper use of power* may be used to provide a formal variant on this general severity requirement (*or its data dependent variation in severity). Some recently posted slides:

      • Hmm…the way I mean it is that all theories are grossly underdetermined by the data. Yes, as Ziliak says, it might turn out it’s not Higgs but rather “Jove or Zeus or Prometheus”. But pointing this out is almost meaningless. That’s how science has worked historically. We go with what we go with. Often “other explanations can’t be probed in the least” because we usually have no idea what those might be, or we can’t formulate them coherently in the theoretical framework that we are currently using.

        • rrameez: I don’t think all theories are grossly underdetermined. Even large scale theories like, say, the Standard Model, which may be grossly underdetermined AS A WHOLE, permit highly stringent learning about parts and variants. So even if we believe, for example, that GTR will break down, we also know many things that an eventual theory of gravity will still need to include. This is one key reason we would not wish to represent the knowledge we have about many portions of the theory with posterior probabilities. So I agree with you that we usually have no clear idea about the hypotheses under “the catchall” umbrella, which is why a closed system, as is required by the Bayesian formulation, is a poor representation of how we manage to find things out.

          Take a look at this great comment by Nelder on Lindley:

          As for being able to explain the Higgs by any old thing like Marilyn Monroe (apparently something some philosopher said), see https://errorstatistics.com/2013/04/29/what-should-philosophers-of-science-do-falsification-higgs-statistics-marilyn/

  4. Poole and I have also found it amusing that Gelman and others use significance tests to criticize P-values. But then, this use seems no less defensible than the routine uses of testing I see in observational studies (in which the randomness model used to construct the test is as hypothetical as the hypothesis being tested)

    I for one have been in the minority among those identified with the “Modern Epidemiology” movement in defending P-values and advocating their proper interpretation rather than rejection; see Greenland and Poole (Jurimetrics 2011) and Greenland (Preventive Medicine 2011, Annals of Epidemiology 2012).

    In the first issue of Epidemiology 2013 there is an article by Greenland and Poole followed by an exchange with Gelman about P-values. The series arose from the editor of the journal forbidding P-values in an article on which I was coauthor. Poole and I focused on the point that P-values are legitimate Bayesian statistics as well as being core frequentist statistics (error statistics in terms of Mayo and Spanos). That fact was known as far back as Laplace, but it seems Fisher and Neyman drove that interpretation out of common (and commonsense) thinking and teaching.

    In the school of thought to which I might be said to adhere, Bayesian and frequentist interpretations are considered complementary, not competitive. I would count Good, Box, Rubin and Gelman and probably most thoughtful applied statisticians today in that school.

    • Sander: Thanks for your comment. Sorry to hear the witch hunt has reached the journal in which you wanted to publish p-values. I will study your papers to learn more about your proposed interpretation; I am very familiar with what Good thought. Not sure what interpretation Fisher and Neyman are supposed to have driven out of common sense.

      • We reviewed the old Bayesian interpretation of P-values familiar to Student (Gosset) in our 2013 article before turning to the more recent Bayesian bounding interpretation of Casella and Berger (Stat Sci 1987). By “drove out” I meant Fisher and Neyman neglected to mention the interpretation as an adjunct to their preferred interpretations – their influence was so far reaching that it is hard to find the interpretation in statistics texts after WWII (especially basic texts, where I think it is most needed to moderate the usual “significance” interpretation).

    • Andrew Gelman pointed out I misattributed the use of P-values to criticize use of P-values to him instead of to Francis, and so should have begun with reference to Francis instead of Gelman.

      • Sander: But why would you find it amusing that Francis uses significance tests to criticize (certain uses of) P-values? I do not think you would. As I say in my post, “I am sure that Francis, Simonsohn and others would never be using p-values and associated methods in the service of criticism if they did not regard the tests as legitimate scientific tools. I wasn’t talking about them. I was alluding to critics of tests who point to their work as evidence the statistical tools are not legitimate.”

        • Perhaps “amusing” was too vague a word, and by that I did not mean to imply illegitimacy of the analysis in question (which I hope the subsequent sentence made clear).

          But to answer your question (I hope), I sense some confusion of criticism of the value of tests as popular tools vs. criticism of their logical foundation. I am a critic in the first, practical category, who regards the adoption of testing outside of narrow experimental programs as an unmitigated disaster, resulting in publication bias, prosecutor-type fallacies, and affirming the consequent fallacies throughout the health and social science literature. Even though testing can in theory be used soundly, it just hasn’t done well in practice in these fields. This could be ascribed to human failings rather than failings of received testing theories, but I would require any theory of applied statistics to deal with human limitations, just as safety engineering must do for physical products. I regard statistics as having been woefully negligent of cognitive psychology in this regard. In particular, widespread adoption and vigorous defense of a statistical method or philosophy is no more evidence of its scientific value than widespread adoption and vigorous defense of a religion is evidence of its scientific value.

          That should bring us to alternatives. I am aware of no compelling data showing that other approaches would have done better, but I do find compelling the arguments that at least some of the problems would have been mitigated by teaching a dualist approach to statistics, in which every procedure must be supplied with both an accurate frequentist and an accurate Bayesian interpretation, if only to reduce prevalent idiocies like interpreting a two-sided P-value as “the” posterior probability of a point null hypothesis.

          • Nicole Jinn

            To: Sander Greenland: What exactly is this ‘dualist’ approach to teaching statistics and why does it mitigate the problems, as you claim? (I am increasingly interested in finding more effective ways to teach/instruct others in various age groups about statistics.)

            I have a difficult time seeing how effective this ‘dualist’ way of teaching could be for the following reason: the Bayesian and frequentist approaches are vastly different in their aims and the way they see statistics being used in (natural or social) science, especially when one looks more carefully at the foundations of each methodology (e.g., disagreements about where exactly probability enters into inference, or about what counts as relevant information). Hence, it does not make sense (to me) to supply both types of interpretation to the same data and the same research question! Instead, it makes more sense (from a teaching perspective) to demonstrate a Bayesian interpretation for one experiment, and a frequentist interpretation for another experiment, in the hopes of getting at the (major) differences between the two methodologies.

            • The reasoning behind the dualist approach is that by seeing a correct Bayesian interpretation of a confidence interval, the user is warned that the confidence interval is not necessarily her posterior interval, because the prior required to make it her posterior interval may be far from the prior she would have if she accounted for available background information. In parallel, by seeing a correct Bayesian interpretation of a P-value, the user is warned that the P-value is not necessarily her posterior probability of the model or hypothesis that the P-value “tests”, because the prior required to make it her posterior probability may be far from the prior she would have if she accounted for available background information. This approach requires no elaborate computation on her part, just initial training (I wrote a series of papers for the International Journal of Epidemiology on that, and I give workshops based on those), and then in application some additional reflection on the background literature for the topic.

              This approach is intended to provide a brake on the usual misinterpretation of confidence intervals as posterior intervals and P-values as posterior probabilities. They are indeed posterior intervals and probabilities, but only under particular priors that may not represent the opinion of anyone who has actually done a critical reading of the background literature on a topic. A P-value says a little bit more, however: It bounds from below posterior probabilities from a class of priors that is fairly broad. Now, the larger a lower bound, the more informative it is. This means, almost paradoxically under some frequentist perspectives, that for Bayesian bounding purposes the larger the P-value, the more informative it is. All this is discussed in two articles by Greenland & Poole, 2013, who review old theoretical results by Casella & Berger, 1987.
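A minimal numerical sketch of the bounding idea described above, under assumptions of my own choosing rather than the general Casella & Berger setup: a single observation x from Normal(theta, sigma), a one-sided null theta <= 0, and normal priors on theta centered at zero with standard deviation tau. The function names and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(x, sigma):
    """P-value for H0: theta <= 0 against theta > 0, with x ~ Normal(theta, sigma)."""
    return 1.0 - NormalDist().cdf(x / sigma)


def posterior_prob_null(x, sigma, tau):
    """Posterior P(theta <= 0 | x) under the prior theta ~ Normal(0, tau)."""
    # Standard conjugate-normal update for a single observation.
    post_var = 1.0 / (1.0 / tau**2 + 1.0 / sigma**2)
    post_mean = post_var * x / sigma**2
    return NormalDist(post_mean, sqrt(post_var)).cdf(0.0)


if __name__ == "__main__":
    x, sigma = 2.0, 1.0
    print("one-sided p-value:", one_sided_p_value(x, sigma))
    for tau in (1.0, 10.0, 1e6):
        print("tau =", tau, "posterior P(null | x):",
              posterior_prob_null(x, sigma, tau))
```

In this restricted family, with x > 0, the posterior probability of the null stays above the one-sided P-value and approaches it only as the prior becomes very diffuse. That is the sense in which the P-value equals a posterior probability under one particular (near-flat) prior and acts as a lower bound for the others considered here.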

              I have used this ‘dualist’ approach in teaching for over 30 years, inspired by the dualist authors of the 1970s (especially Good, Leamer, and Box). It seems effective in scaling back the misrepresentation of confidence intervals and P-values as the posterior intervals and probabilities implied by the data, and my experience has been corroborated by other colleagues who have tried it. I recommend the dualist approach for this empirical reason, as well as for the following more philosophical reasons: As you note, certain rigid statistical approaches labeled “Bayesian” and “frequentist” appear vastly different in their aims and the way they say statistics should be used in science. Hence, it makes perfect sense to me to supply both types of interpretation to the same data and the same research question, so that the user is aware of potential conflicts and can respond as needed. In my work I have found that much if not most of the time the two perspectives will agree if constructed under the same assumptions (model) and interpreted properly, but if they seem to diverge then one is alerted to a problem and needs to isolate the source of apparent disagreement.

              The bottom line is I think it essential to understand both frequentist and Bayesian interpretations in real applications. I regard the disagreements between certain extremist camps within these “schools” as a pseudo-controversy generated by a failure to appreciate the importance of understanding and viewing results through alternative perspectives, at least any perspective held by a large segment of the scientific community. Failing to seek multiple perspectives is gambling (foolishly in my view) that the one perspective you chose gets you the best inference possible. No perspective or philosophy comes with a credible guarantee of optimality or even sufficiency. Furthermore, all the empirical evidence I see suggests to me that using a frequentist perspective alone is frequently disastrous; as Poole pointed out, even Neyman could not get his interpretations correct all the time. Surely we can do better.

              I think artificial intelligence (AI) research provides us important clues as to how. That research shows there is no known single formal methodology for inference that can be said to be optimal in any practical sense, as is clear from the fact that we cannot yet build a robot or program that can perform inferences about certain highly complex human systems (like economies) consistently better than the best human experts, who blend statistical tools (often modest ones) with intuitive judgments. To take a highly statistical field as an example, econometrics has made great strides in forecasting but econometricians still cannot use their tools to build investment strategies that beat highly informed but informal experts like Warren Buffett. Consonant with that observation, AI research has found Bayesian structures beneficial in the construction of expert systems, where the priors correspond to the intuition injected into the system along with data to produce inferences.

              By the way, you use the word “experiment” but the fields of greatest concern to me are primarily nonexperimental (observational). My own research experience has included a lot of studies using secondary databases, needed because randomized trials were either infeasible (as in occupational or environmental studies) or were too small, short-term, or selective to provide for detection of infrequent or long-term effects. In these studies the entire frequentist foundation often looks far more hypothetical and dubious than carefully constructed priors.

          • Sander: Thanks for your comment.
            Interestingly, I think the conglomeration of error statistical tools are the ones most apt at dealing with human limitations and foibles: they give piecemeal methods to ask one question at a time (e.g., would we be mistaken to suppose there is evidence of any effect at all?, mistaken about how large?, about iid assumptions?, about possible causes?, about implications for distinguishing any theories?). The standard Bayesian apparatus requires setting out a complete set of hypotheses that might arise, plus prior probabilities in each of them (or in “catchall” hypotheses), as well as priors in the model…and after this herculean task is complete, there is a purely deductive update: being deductive it never goes beyond the givens. Perhaps the data will require a change in your prior—this is what you must have believed before, since otherwise you find your posterior unacceptable—thereby encouraging the very self-sealing inferences we all claim to deplore.

            As for your suggestion of requiring a justification using both or various schools, the thing is, the cases you call “disasters” can readily be “corroborated” Bayesianly. Take the case of a spurious p-value regarded as evidence for hypothesis H, let it be a favorite howler (e.g., ovulation and political preferences, repressed memories, or drug x benefits y, or what have you). The researcher believes his hypothesis H fairly plausible to begin with, and he has found data x that seem to be just what is expected were H true; and the low (nominal) p-value leads him to find the data improbable if H is false. P(H) and P(x|H) are fairly high, P(x|not-H) = very low, and a high posterior for P(H|x) may be had. I’m giving a crude reconstruction, but you get the picture. Now we have two methods warranting the original silly inference! Now an inference that is not countenanced by error statistical testing (due to both violated assumptions and the fact that statistical significance does not warrant the substantive “explanation”) is corroborated!
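The crude reconstruction in the paragraph above is just Bayes' theorem applied to two hypotheses. The sketch below plugs in numbers that are purely hypothetical (no values are given in the comment) to show how a "fairly high" P(H) and P(x|H) combined with a "very low" nominal P(x|not-H) yield a high posterior, and how that posterior depends on taking the nominal value at face value.

```python
def posterior(prior_h, lik_h, lik_not_h):
    """P(H | x) from P(H), P(x | H) and P(x | not-H) via Bayes' theorem."""
    numerator = prior_h * lik_h
    return numerator / (numerator + (1.0 - prior_h) * lik_not_h)


if __name__ == "__main__":
    # Hypothetical inputs: the nominal p-value is read as P(x | not-H) = 0.05.
    print(posterior(prior_h=0.5, lik_h=0.8, lik_not_h=0.05))
    # Same prior and P(x | H), but suppose such data would have been about
    # as easy to obtain if H were false (e.g. because of selection effects).
    print(posterior(prior_h=0.5, lik_h=0.8, lik_not_h=0.8))
```

With the first set of assumed inputs the posterior is about 0.94. With the second, where the data are equally probable whether or not H is true, the posterior simply equals the prior, so everything turns on whether the low nominal figure can be trusted.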

            So, while I’m sorry to shoot down so ecumenical-sounding a suggestion, this would not ring (but instead would mute) the error-statistical alarm bells that we want to hear (wrt the disasters), and which are the basis for mounting criticisms. Besides, even Bayesians cannot reconcile competing approaches to Bayesian inference; even those under the same banner, e.g., default Bayesians, disagree on rudimentary examples—as Bernardo and Berger and others concede. True, there are unifications that show “agreement on numbers” (as Berger puts it), and ways to show that even Bayesian methods have good long-run performance in a long series of repetitions. I happen to think the result is the worst of both worlds (i.e., heralding p-values as giving posterior beliefs, and extolling crude “behavioristic” rationales of long-run performance).

            Anyone wanting to see more on these topics here, please search this blog.

            (I just noticed Nicole Jinn also asked after this “dualist” suggestion.)

            • Thanks for your comments. As I mentioned, it seems we live in antiparallel universes in terms of how we see what is going on. All the empirical evidence I see reading through health and medical science journals points to error-statistical methods as being routinely misinterpreted in Bayesian ways. Such observations are widely corroborated in health and social sciences (see Ch. 10 of Modern Epidemiology for many citations) and I think are unsurprising given common cognitive fallacies (such as those reported in the Kahneman et al. 1982 and Gilovich et al. 2002 anthologies). These observations should be alarming. We can theorize all we want about why these problems arise and persist, but I think we do not yet understand human limitations in statistical reasoning half as well as it seems your comments assume.

              Thus I opt for an empirical view of the situation: We have had an education and practice disaster on our hands for a half century or more. In recent decades it has abated somewhat thanks to authors emphasizing estimation over testing. But, as is still lamented, improper inversion of conditional data probabilities (error statistics) into hypothesis probabilities is still habitual and even encouraged by some teachers. See this link for an example and tell me if you don’t see a problem:
              Charlie Poole sent us a more subtle example in which failure to reject the null is misinterpreted as no evidence against the null (Hans-Hermann Dubben, Hans-Peter Beck-Bornholdt. Systematic review of publication bias in studies on publication bias. BMJ 2005; 331:433-434). Knowledge of most any branch of the medical literature will show this kind of error remains prevalent, unsurprisingly perhaps since as Charlie showed even Neyman fell prey to it.

              Your completely hypothetical Bayes example is disconnected from any research context and resembles nothing like what I see, teach, recommend or apply as Bayesian checks on inferences. Those methods I have found useful are detailed in several articles of mine, the basic introduction being Greenland 2006, Bayesian perspectives for epidemiologic research. I. Foundations and basic methods (with comment and reply) Int J Epidemiol 35:765–778, reprinted as Chapter 18 in the 3rd edition of Modern Epidemiology. The one comment printed with it may be interesting because it shows that some Bayesians feel just as threatened by this Bayes-frequentist connection as do some frequentists – understandably perhaps because it points up the weaknesses of the data models they share.

              In nonexperimental epidemiology, error statistics derive from purely hypothetical data-probability models (DPMs) which appear to be nothing more than prior distributions for the data, as they have no basis in the study design, which by definition lacks any known randomization of treatment. Only some of the Bayesian analogs I recommend as checks derive from the same DPM; others derive by relaxation of the sharp constraints that are used to force identification in error statistics, but are not known to be correct by design and may even be highly doubtful, up to a point.

              I have explained how I see penalized estimation as a frequentist and Bayesian blend that demonstrates how both can be used to explore proper uncertainty in nonexperimental results that arise from inability to identify the correct or even assuredly adequate model (Greenland 2009, Relaxation penalties and priors for plausible modeling of nonidentified bias sources, Statistical Science 24:195-210.). For real case studies explaining and demonstrating the methods see Greenland 2000, When should epidemiologic regressions use random coefficients? Biometrics 56:915–921; and Greenland 2005, Multiple-bias modeling for analysis of observational data (with discussion), J R Stat Soc Ser A 2005;168:267–308. I have published several other case studies in epidemiology and statistics journals, and have used these methods for subject-matter research articles.

              I have endorsed these methods based on my experience with them and that reported by others. Following Good, Leamer, Box and others I recognize that no set of tools should be recommended without empirical evaluation, and that accepted tools should be subject to severe tests indefinitely, just as tests of general relativity continue to this day. Conversely, no one should reject a set of tools without empirical evaluation, as long as the tools have their own theoretical justification that is consistent both internally (free of contradictions) and externally (is not absolutely contradicted by data), and preferably consistent with accepted theory (a theory that consistently predicts accepted facts). To reject tools or facts on philosophical grounds beyond these basics strikes me as unscientific as refusing to question one’s accepted tools and facts.

              Before we go any further I think it is growing time to dispense with the increasingly obsolete labels of frequentist and Bayesian, except in the history of 20th century statistics (and I suppose 21st century philosophy). Good pointed out there were over 40,000 kinds of Bayesians, and I doubt if the number has diminished or the number of kinds of frequentist is much smaller. I am not a Bayesian or a frequentist, as I find such labels confining and almost antithetical to how I conceive science, which seeks many tools as well as theories and clings to none when faced with breakdowns (perhaps philosophically I may be a sort of restrained descendant of Feyerabend). As Stan Young said, we have a system out of control and it needs to be fixed somehow. On this and data sharing we agree even though his answer to the problem is multiple-testing adjustment whereas mine sees testing obsession as one of the factors worsening the current situation; hence I focus on sparse-data estimation instead.

              My message is not that we need to do Bayesian analyses all or even most of the time; rather it is that if we are claiming to make an inference, we need to understand the priors that we would need to hold to endorse the inference if we were Bayesian, so we can contrast those to actual data information in the field. Senn categorizes this approach as subjunctive Bayes: if I had believed that, I should have come to believe this were I Bayesian. Conversely, I do well to ponder that if I now believed this, it would imply that I must have believed that, were I a Bayesian. As with any diagnostic, this reverse-Bayesian analysis may warn us to take special care to discourage Bayesian interpretation of our error statistics, and alert us to overconfident inferences from our colleagues and ourselves.

              This is a far cry from anything I see you allude to as Bayesian. If you are not familiar with the ideas then you might do worse than start with Leamer – his 1978 masterpiece Specification Searches is available as a free download here:
              Also Box (his classic 1980 JRSS article) and some articles by Good (in his 1983 anthology). My 2006 IJE article gives other relevant references.
              Perhaps reflecting my 40 years in epidemiology, I am more concerned with frequencies of events and their relation to antecedent events than with untested theories. Thus, if you want me to consider your claims seriously, what I would ask of you is:

              1) Please supply one real example of a published study in which the authors applied the kind of methodology I am talking about, and because of that they made clearly incorrect statements about the hypothesis or parameter or evidence under discussion (apart from computational errors, which all methods are at risk of). You need to supply some case studies which provide at least some test of your claims in real analyses for which we have a sense of where the answers lie.

              2) Please comment on Charlie’s example (Dubben and Beck-Bornholdt, 2005) in which he showed the writers were led into an incorrect statement about evidence by misinterpreting a singular test rather than considering the full confidence distribution or P-value function (the usual null P-value and confidence limits comprising 3 points on the graph).

              3) Please comment on the examples in the articles I sent you showing a professor of biostatistics misusing tests and power to claim support of the null in data that discriminate nothing within the range of interest (Greenland 2011, Null misinterpretation in statistical testing and its impact on health risk assessment, Prev Med 53:225–228; and Greenland 2012, Nonsignificance plus high power does not imply support for the null over the alternative, Ann Epidemiol 22:364–368).

              4) Please explain how you rationalize your promotion of error statistics in the face of documented and chronic failure of researchers to interpret them correctly in vitally important settings.

  5. Often the p-values you see or are shown are but the tip of a multiple-testing iceberg. Tens or hundreds or thousands of p-values are computed, the result of looking at multiple outcomes and/or computing multiple models. As such, these p-values are not to be trusted at all. Editors of journals let authors get away with this and sophists provide slippery justifications. You can have trustworthy p-values, but that requires access to the data. Mostly, those reporting selected p-values will not provide their data sets.

    • Stan: Thanks for your comment. Yes, I agree with all that you’ve written. I think I’m more skeptical than you in a couple of ways*, but first to note the obvious: the existence of mistakenly assessed p-values is not the same as discrediting significance test reasoning in general. I had something on this recently…


      I think dichotomous tests are generally oversimple and distorting, and I advocate determining discrepancies that are and are not well warranted (with severity).

      But never mind that. We are fortunate to have a real expert here on adjustments. (You, I mean.) So perhaps you can weigh in on the puzzle that is bothering me (you likely already know about it). It is sketched in the comments (with Richard Gill) on my last post (and in Gill’s post prior): the use of positive false discovery rates to adjust for multiple testing in fraud detection. The problem was just brought to my attention a few days ago in reading Gill’s analysis of Smeesters. I hope to post on this at some point, once I understand it better. Can you enlighten us at some point on this? (The issue is not whether he massaged data, he already admitted to that; I’m only curious about the properties of the method in its own right.)

      *Note for example my previous post that turns entirely on criticizing the flexibility of the concepts and hypotheses. I don’t see how any corrections can really bridge the gap between the statistics gleaned from these lab settings and the rather fuzzy theories they hope to confirm. The importance of emphasizing this is that it is wrong to blame significance tests or any statistical method for pseudoscientific theories, or for cases that are merely exploring interpretations of data, in light of a general or vague background theory.

  6. Corey

    “If we are to hold small p-values untrustworthy…”

    Mayo: We are to hold a *collection* of small p-values untrustworthy due to a recognition of a signature of data-dredging/p-hacking (and also the obvious incentives scientists face to engage in p-hacking). All you need to warrant the claim is the assumption that Francis, Simonsohn, et alia aren’t such cartoonish hypocrites as to have engaged in p-hacking of their own. The argument isn’t that difficult to follow, especially given your own criticism of p-hacking; did you apply the principle of charity?

    The second quote from Gelman has nothing to do with criticizing p-values, but rather with the practice of applying a p-value “filter” to a large (say, more than 200) collection of hypotheses and then using the usual estimators within that list.

    Here’s a toy example/challenge for you. Suppose that I take one sample from a normal distribution with unknown mean and standard deviation 1. It is known to you that I will refuse to report any observed value within [-2, +2]. But I do report to you that I observed 2.5. What is the severity of the test that just barely passes “mean greater than 2.5”?

    The practice Gelman is criticizing is equivalent to you claiming that “mean greater than 2.5” is warranted with severity = 0.5 in the above scenario.
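A minimal sketch of the arithmetic behind this toy example, assuming that severity for “mean greater than mu1” is taken as the probability of a result no larger than the one observed when the mean equals mu1, and that the reporting rule is handled by conditioning on the observation falling outside [-2, +2]. Both function names are hypothetical, and the conditioning step is one possible reading of the challenge, not a method stated in the thread.

```python
from math import erf, sqrt


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def sev_unfiltered(x_obs, mu1, sigma=1.0):
    """P(X <= x_obs; mu = mu1), ignoring the reporting rule."""
    return norm_cdf((x_obs - mu1) / sigma)


def sev_filtered(x_obs, mu1, lo=-2.0, hi=2.0, sigma=1.0):
    """P(X <= x_obs | X outside [lo, hi]; mu = mu1), assuming x_obs > hi."""
    below = norm_cdf((lo - mu1) / sigma)  # P(X < lo)
    between = norm_cdf((x_obs - mu1) / sigma) - norm_cdf((hi - mu1) / sigma)  # P(hi < X <= x_obs)
    above = 1.0 - norm_cdf((hi - mu1) / sigma)  # P(X > hi)
    return (below + between) / (below + above)


naive = sev_unfiltered(2.5, 2.5)    # ignores the refusal to report values in [-2, +2]
adjusted = sev_filtered(2.5, 2.5)   # conditions on the value having been reported
print(f"unfiltered: {naive:.3f}, filtered: {adjusted:.3f}")
```

Under these assumptions the unfiltered figure is the 0.5 mentioned above, while the filtered figure comes out lower (roughly 0.28), since a reported value just above 2 is unsurprising once everything in [-2, +2] is suppressed.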

    • Corey: Let me correct something: I did not allege or assume that Francis, Simonsohn were doubting the value of using p-values, so maybe I need to correct this in my post. Quite the opposite. They would never be using them to this end if they did not regard them as valuable statistical tools. My remarks concerned those who point to their results as evidence, not that p-values may be misused or their assumptions flagrantly violated, but that using significance tests is wrong-headed. Significance tests, of course, are just a small part of the panoply of error statistical methods, and using them in a dichotomous fashion is generally oversimplified. Using them to identify discrepancies indicated and not indicated, however, would avoid the criticism I take Gelman to be raising.

      Update: I’ve made the clarification, thanks for noting it. But if you read Gelman’s article you’ll see he’s clearly distancing himself from a position of merely pointing up fallacies.

      Thanks too for your point that a discrepancy warranted with Severity of .5 would be unwarranted. This shows that such misinterpretations are readily avoided with a simple severity computation, and significance tests of that sort should not leave home without something like it.

      • Corey

        Mayo: Got it. (My impression of) Gelman’s position is that p-values are well-suited to some settings and poorly suited to others, e.g., they work in settings with large effects and targeted experiments to measure them (as in agriculture, the applied setting for which Fisher developed them) and fail in settings with very large numbers of weak effects (as in genome wide association studies).

        • Corey: I’m confused about your last comment. There are several research programs, each fairly huge, involving genome wide association studies, all of which are directly based on the use of p-values and various other, quite ingenious but basic, sampling theory methods. Adjustments for selection and multiple tests are carried out in a dozen ways, all to distinguish real effects from chance. With screening methods, of course, the goal shifts to what I call behavioristic; in many cases frequentist priors are genuinely of interest and not so hard to get. But other approaches are intermingled with theorizing about gene clusters. Brad Efron is the one who knows; also Tom Kepler, who follows this blog I think.

          • Corey

            Mayo: Fair enough; say rather that raw p-values, used naively, fail in such contexts.

            A biostatistician of my acquaintance once related an anecdote about attending a biology conference in the early days of microarrays. He saw a session in which the presenter claimed detection of differential expression (with p < 0.05, natch) for 5% of the total number of genes on the chip. The list of selected genes was elaborately interpreted in the biological context of the experiment. During the question period he commented that the microarray results were entirely consistent with no true positives and was informed by the session moderator that his comment was *very* controversial.
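The point of the anecdote can be illustrated with a small sketch. It assumes a hypothetical chip of 1000 genes whose p-values are spread evenly over (0, 1), an idealized stand-in for the situation where no gene is truly differentially expressed, and applies the Benjamini and Hochberg (1995) step-up rule that comes up later in this thread. The function name and the numbers are illustrative assumptions, not a reconstruction of the analysis presented at the conference.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return sorted indices of hypotheses rejected by the BH step-up rule at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank k whose ordered p-value satisfies p_(k) <= k * q / m.
        if pvalues[idx] <= rank * q / m:
            k_max = rank
    return sorted(order[:k_max])


m = 1000  # hypothetical number of genes on the chip
null_like = [(i - 0.5) / m for i in range(1, m + 1)]  # evenly spaced p-values

raw_hits = sum(p < 0.05 for p in null_like)
bh_hits = len(benjamini_hochberg(null_like, q=0.05))
print(f"raw p < 0.05: {raw_hits} of {m}; BH rejections at q = 0.05: {bh_hits}")
```

With evenly spaced p-values, 5% of the genes fall below 0.05 by construction, yet the step-up rule rejects none of them, which is the sense in which such a list is consistent with no true positives.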

            • Corey: Are we arguing foundations by anecdote now? A friend of a friend of a friend said he heard someone say such and such and then someone said “his comment was *very* controversial”?

              • Corey

                Mayo: I’m not arguing foundations per se; I’m just giving an example to show that such a use-case has actually existed. It seems that first molecular biology (with microarrays) and then epidemiology (with GWAS) had the same experience: enthusiastic researchers jumped on initial results, failed to get replication, and then slowed down and actually got statisticians into the picture.

                • Corey: OK and when they got statisticians into the picture, they used significance tests, right? Take a look and see. Let me be clear that nothing turns on whether or not statistics succeeds in the land of microarrays (of course in many ways it already has). If I had to guess, I’d say substantive theoretical knowledge is what is making, and will make, a dent in those efforts, but I think that is typical of good sciences, i.e., the use of statistics on one problem withers away thanks to theory, but some novel question may call for statistical probes again. Pure speculation in an arena I know little about (i.e., bioinformatics). Can this also happen in the social sciences? I don’t know.

                  • Tom Kepler

                    What happened with microarrays and statistics is more interesting than what Stan lets on. It was not that biologists blundered about until finally getting wise and turning things over to statisticians.

                    In fact, biologists went to statisticians early on, but were deeply dissatisfied with their analyses. The statistician typically claimed that there were no genes expressed differentially after correcting for multiple tests, but the biologist knew that that could not be correct. The comparisons they were doing were certain to give rise to differential gene expression. The null hypothesis that no genes were differentially expressed was inappropriate to start with.

                    It is important to realize that this episode was not a matter of biologists simply being naive (although there were certainly examples of that) but more generally of a failure to communicate effectively between the scientists and the statisticians about a challenging new technology.

                    Once biologists and statisticians learned how to talk to each other, significant improvements in appropriate analytical methods were developed, such as False Discovery Rate.

                    There is clearly much more important work to be done in collaborations between statisticians and biologists for high-throughput assays, perhaps in cleverer ways of thinking about null hypotheses. The solution in any case is not simply to insist on p-value corrections for irrelevant null hypotheses.

                    • Corey

                      Tom: You wrote “Stan”, but I think you meant me. My impressions were formed second-hand, so thank you for the in-depth look!

                • Corey: Screening is screening (my interest is in scientific inference, which differs–though even here, behavioristic methods have a role. Recall triggering in the Higgs particle case). I think the opening up of multiple, competing lines of inquiry in bioinformatics is healthy and intriguing. (It would make a great case study for philosophers of science interested in multiple methodologies.) Some avenues are more linked with theory than others, but all that I have seen use sampling theory statistics. I am inclined to favor theoretical approaches, but I am also rooting for the formal statistical efforts, if only because Brad Efron has developed this masterful research program…

                  apophenia–good word.

                  This reply should go after Corey’s comment below, but it didn’t

                  • There is a relatively unknown problem with microarray experiments, in addition to the multiple testing problem. Samples should be randomized over important sources of variation. Until relatively recently, the samples were not sent through assay equipment in random order. That turned out to be a problem. See http://blog.goldenhelix.com/?p=322. Essentially all the data pre-2010 is unreliable.

                    Lack of randomization can mess up p-values.

                    • “Lack of randomization can mess up p-values.”
                      Indeed – that was the topic of my 1990 paper “Randomization, statistics, and causal inference,” Epidemiology 1, 421-429.
                      Note that almost nothing in epidemiologic research is randomized, and that is a source of error in all its stats at least comparable to error from data dredging.

                • In a 1989 JASA paper and more fully developed in a Wiley 1993 book, Peter Westfall and I showed how to adjust raw p-values to reflect both the multiple testing and distribution of the observed data to give adjusted p-values. The technology is rather simple and easily understood. The methodology is rather widely adopted, but some persist in testing a lot of questions and using only raw p-values. It is not surprising at all that claims based on these raw p-values do not replicate.

                  • Stan: Thanks for reminding me where this example began. Corey was illustrating a case where significance tests don’t work (possibly according to Gelman), but it turns out to be a case where raw p-values in multiple testing don’t work. p-values are not evidential relation measures, despite many people wanting them to be.
                    Their use requires them to be able to serve as (relevant) error probabilities.

              • Corey

                Mayo: And “a friend of a friend of a friend” is not what I said. I said my acquaintance was the one who stood up in a roomful of biologists and told them, in the gentlest terms possible, that most of their recent biological findings were apophenia.

  7. After Corey’s initial comment, I was curious about that point of Gelman’s (claiming that statistically significant results give overly large effect sizes). I put aside my dislike of examples involving scoring people on how attractive/unattractive they are; especially if they also seek evolutionary explanations for observed differences in sex ratios.

    But never mind, just the numbers. In the article he references (Gelman and Weakliem 2009, p. 312), they suggest that a 2-sided test of null: m = 0, with SE = .043, would imply an estimate of .084 which is “much larger than anything we would realistically think the effect size should be” (based on their background knowledge of sex ratio differences).
    Now a positive 5% statistically significant result x from 0 (2.5% one-sided) would be x = .084 (1.96SE from 0). That would be taken to reject the null, and infer m > 0. SEV(m > 0) = .95. (0 is also the CI lower .95 bound). It should not be taken to indicate an estimated effect of .084! SEV(m > .084) = .5!
    The SEV associated with m being as great as .01 is ~ .92. Maybe they are getting at the flaw of taking the observed difference as an estimate of the population effect size—big mistake. Besides p-values are famously said not to give effect sizes. I used SEV to do so, but strictly, the test would stop with: x indicates m > 0.
    Inform me of any errors please.
    See fallacies of rejection 6.1 in https://errorstatistics.com/2012/10/21/mayo-section-6-statsci-and-philsci-part-2/
    (Recall, too, my warnings about CI reformers.)
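A minimal sketch of the severity arithmetic in this comment, assuming a one-sided normal model with known standard error, so that SEV(m > m1) is the probability of a result no larger than x when the mean equals m1. The function name is hypothetical, and figures computed this way depend on that one-sided convention and on rounding, so they are offered as a cross-check and may not coincide with every rounded value quoted above.

```python
from math import erf, sqrt


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def severity_greater(x_obs, m1, se):
    """SEV(m > m1) = P(X <= x_obs; m = m1), assuming X ~ Normal(m, se^2)."""
    return norm_cdf((x_obs - m1) / se)


se = 0.043     # standard error quoted in the comment
x_obs = 0.084  # observed difference, about 1.96 SE from 0

for m1 in (0.0, 0.01, 0.084):
    print(f"SEV(m > {m1}) = {severity_greater(x_obs, m1, se):.3f}")
```

Whatever convention is used, the qualitative pattern is the one the comment describes: severity is high for the claim m > 0 and falls to one half for the claim that m exceeds the observed difference itself.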

  8. Tom: Thanks so much for weighing in. I’ve forgotten when I first heard about FDRs (but I especially recall it was reading Julie Schaffer in the late 1990s). Are you saying that FDRs were essentially developed to accommodate this kind of situation, when the computations under all true nulls were deemed wrongheaded? So, in a sense, it grew out of the need in microarrays?

    However, I was recently rereading some of the early discussion of multiple testing, e.g., in Morrison and Henkel (~1970), and thinking that they were already taking into account the number of nulls (in a group) that are found nominally significant (the ‘honest hunter’ in EGEK ch. 9), so that the more there are, the less conservative the correction. This made me wonder if one doesn’t wind up in the same place as using the FDR (step down) technique, but I had no time to compare the computations. I’m likely not being very clear, just landed….

    I don’t know why the order of comments is so jumbled (nor how to fix it), I have to get the Elbians to look into this.

    • Tom Kepler

      The FDR had been under development for a few years prior to microarrays, and Benjamini and Hochberg’s seminal paper came out one year before the first papers on microarrays.

      But in a historical note (J. R. Statist. Soc. B (2010) 72, Part 4, pp. 405–416) Benjamini wrote:

      “Acceptance of the FDR idea remained slow even after Benjamini and Hochberg (1995) was published…The dramatic change in attitude came when genetic research took a new dimension, in quantitative trait loci and microarrays analyses, where the number of hypotheses tested in an experiment reached thousands. This seemed unthinkable 10 years earlier: for example, our simulations in Benjamini and Hochberg (1995) had been criticized for considering 4–64 hypotheses, as ‘no one uses multiple comparisons for problems with 50 or 100 tested hypotheses’. Alas, facing the new challenges, tools that balance multiplicity control and power were needed, and FDR methodology could yield useful answers.”

  9. A new post will be started this evening to focus on a couple of the microarray themes, since wordpress has a limit to nested comments, and these are now kind of a jumble, sorry. I will inquire about the settings, which I’ve tried fiddling with (input welcome). Thanks

  10. I recognize the oddness of using frequentist logic to criticize frequentist hypothesis tests. As indicated above, despite their problems, I do think frequentist hypothesis tests can be used to draw some reasonable conclusions (but I am open to being convinced otherwise, and I already think they are used way too often and often incorrectly). The main complaint is that biased articles are misusing hypothesis tests, not that they are fundamentally wrong (the latter might also be true, but not necessarily by these analyses).

    My rationale for applying frequentist logic is two-fold. First, I do not know how to apply a Bayesian version of the analysis. I’ve thought about it, and I’ve asked some Bayesians to think about it. Maybe we are not clever enough, but we have not yet identified a method that makes sense. We have not given up. Second, my audience is mostly made up of people who believe in the frequentist approach. If I used a Bayesian method to critique their findings, they would just ignore it.

    Gelman’s post was referring to one of several comments that were published in reply to an article I wrote in the Journal of Mathematical Psychology. Links to the original article, the comments, and my reply can be found on my web site.

    • Greg: Thanks for your note, but my point is the reverse of what you seem to be suggesting. “I recognize the oddness of using frequentist logic to criticize frequentist hypothesis tests.” Nothing odd in the least! Quite the opposite: any good method is self-correcting and error correcting. We use deductive logic to criticize arguments using deductive logic, do we not? And we use error statistical principles to demonstrate that certain error statistical assumptions are being violated and the like. If anything, what’s problematic is using principles at odds with method M in order to criticize an application of method M (see my current blogpost on Greenland’s ‘dualism’). My point in this blog (i.e., the one on which you’re commenting) is the whiff of inconsistency in both denying and asserting that method M gives trustworthy results. Perhaps reread this post and the clarification after Corey’s point. The context of the previous two blogposts might also be useful (in setting the context wherein my points arose).

