P-values

Why I am not a “dualist” in the sense of Sander Greenland

This post picks up, and continues, an exchange that began with comments on my June 14 blogpost (between Sander Greenland, Nicole Jinn, and me). My new response is at the end. The concern is how to expose and ideally avoid some of the well-known flaws and foibles in statistical inference, thanks to gaps between data and statistical inference, and between statistical inference and substantive claims. I am not rejecting the use of multiple methods in the least (they are highly valuable when one method is capable of detecting or reducing flaws in one or more others). Nor am I speaking of classical dualism in metaphysics (which I also do not espouse). I begin with Greenland’s introduction of this idea in his comment… (For various earlier comments, see the post.)

Sander Greenland 

I sense some confusion of criticism of the value of tests as popular tools vs. criticism of their logical foundation. I am a critic in the first, practical category, who regards the adoption of testing outside of narrow experimental programs as an unmitigated disaster, resulting in publication bias, prosecutor-type fallacies, and affirming the consequent fallacies throughout the health and social science literature. Even though testing can in theory be used soundly, it just hasn’t done well in practice in these fields. This could be ascribed to human failings rather than failings of received testing theories, but I would require any theory of applied statistics to deal with human limitations, just as safety engineering must do for physical products. I regard statistics as having been woefully negligent of cognitive psychology in this regard. In particular, widespread adoption and vigorous defense of a statistical method or philosophy is no more evidence of its scientific value than widespread adoption and vigorous defense of a religion is evidence of its scientific value.
That should bring us to alternatives. I am aware of no compelling data showing that other approaches would have done better, but I do find compelling the arguments that at least some of the problems would have been mitigated by teaching a dualist approach to statistics, in which every procedure must be supplied with both an accurate frequentist and an accurate Bayesian interpretation, if only to reduce prevalent idiocies like interpreting a two-sided P-value as “the” posterior probability of a point null hypothesis.
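To make the last point concrete, here is a minimal sketch of how far a two-sided P-value can sit from the posterior probability of a point null. Everything numerical in it is an illustrative assumption, not anything from the comment: a Normal mean with known standard deviation, prior mass 0.5 on the null, and a Normal(0, tau^2) prior on the mean under the alternative. The function names are hypothetical.

```python
import math

def normal_pdf(x, sd):
    """Density of a zero-mean Normal with standard deviation sd, at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def two_sided_p(z):
    """Two-sided P-value for a standard Normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def posterior_null(xbar, se, tau, prior_null=0.5):
    """Posterior probability of the point null (mean = 0).

    Assumed setup: xbar ~ Normal(mean, se^2); under the alternative the
    mean has a Normal(0, tau^2) prior, so marginally
    xbar ~ Normal(0, tau^2 + se^2).
    """
    like_null = normal_pdf(xbar, se)
    like_alt = normal_pdf(xbar, math.sqrt(tau ** 2 + se ** 2))
    numerator = prior_null * like_null
    return numerator / (numerator + (1.0 - prior_null) * like_alt)

# Assumed data: n = 100 observations with sigma = 1, observed z = 1.96.
se = 1.0 / math.sqrt(100)
xbar = 1.96 * se

p_value = two_sided_p(xbar / se)
post = posterior_null(xbar, se, tau=1.0)

print(f"two-sided P-value:       {p_value:.3f}")
print(f"posterior P(null | data): {post:.3f}")
```

Under these particular assumptions the two numbers differ by roughly an order of magnitude; a different prior would give a different posterior, which is exactly why the P-value cannot be read as "the" posterior probability.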

 Nicole Jinn
 (to Sander Greenland)

 What exactly is this ‘dualist’ approach to teaching statistics and why does it mitigate the problems, as you claim? (I am increasingly interested in finding more effective ways to teach/instruct others in various age groups about statistics.)
I have a difficult time seeing how effective this ‘dualist’ way of teaching could be for the following reason: the Bayesian and frequentist approaches are vastly different in their aims and the way they see statistics being used in (natural or social) science, especially when one looks more carefully at the foundations of each methodology (e.g., disagreements about where exactly probability enters into inference, or about what counts as relevant information). Hence, it does not make sense (to me) to supply both types of interpretation to the same data and the same research question! Instead, it makes more sense (from a teaching perspective) to demonstrate a Bayesian interpretation for one experiment, and a frequentist interpretation for another experiment, in the hopes of getting at the (major) differences between the two methodologies.

Mayo

Sander. Thanks for your comment. 
Interestingly, I think the conglomeration of error statistical tools are the ones most apt at dealing with human limitations and foibles: they give piecemeal methods to ask one question at a time (e.g., would we be mistaken to suppose there is evidence of any effect at all? mistaken about how large? about iid assumptions? about possible causes? about implications for distinguishing any theories?). The standard Bayesian apparatus requires setting out a complete set of hypotheses that might arise, plus prior probabilities in each of them (or in “catchall” hypotheses), as well as priors in the model…and after this herculean task is complete, there is a purely deductive update: being deductive it never goes beyond the givens. Perhaps the data will require a change in your prior—this is what you must have believed before, since otherwise you find your posterior unacceptable—thereby encouraging the very self-sealing inferences we all claim to deplore. Continue reading

Categories: Bayesian/frequentist, Error Statistics, P-values, Statistics | 21 Comments

Stanley Young: better p-values through randomization in microarrays

I wanted to locate some uncluttered lounge space for one of the threads to emerge in comments from 6/14/13. Thanks to Stanley Young for permission to post this. 

S. Stanley Young, PhD
Assistant Director for Bioinformatics
National Institute of Statistical Sciences
Research Triangle Park, NC

There is a relatively unknown problem with microarray experiments, in addition to the multiple testing problems. Samples should be randomized over important sources of variation; otherwise p-values may be flawed. Until relatively recently, the microarray samples were not sent through assay equipment in random order. Clinical trial statisticians at GSK insisted that the samples go through assay in random order. Rather amazingly the data became less messy and p-values became more orderly. The story is given here:
http://blog.goldenhelix.com/?p=322
Essentially all the microarray data pre-2010 is unreliable. For another example, mass spec data were analyzed by Petricoin. The samples were not randomized, and claims with very small p-values failed to replicate. See K.A. Baggerly et al., “Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments,” Bioinformatics, 20:777-85, 2004. So often the problem is not with p-value technology, but with the design and conduct of the study.
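As a rough illustration of the design point, here is a minimal sketch of assigning assay run order at random, so that run position (and anything that drifts with it) is not confounded with case/control status. The sample labels, the seed, and the function name are hypothetical; a real study would likely also block on plate, batch, or operator.

```python
import random

def randomized_run_order(sample_ids, seed):
    """Return the sample IDs in a random run order.

    A fixed seed makes the order reproducible, so the randomization
    schedule can be recorded and audited later.
    """
    rng = random.Random(seed)
    order = list(sample_ids)  # copy, so the caller's list is untouched
    rng.shuffle(order)
    return order

# Hypothetical samples: 6 cases and 6 controls.
samples = [f"case_{i}" for i in range(1, 7)] + [f"control_{i}" for i in range(1, 7)]

run_order = randomized_run_order(samples, seed=2013)
for position, sample_id in enumerate(run_order, start=1):
    print(position, sample_id)
```

Running all cases first and all controls second is the kind of ordering this is meant to avoid: any instrument drift over the run would then look like a group difference.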

Please check other comments on microarrays from 6/14/13.

Categories: P-values, Statistics | 9 Comments

Guest Post. Kent Staley: On the Five Sigma Standard in Particle Physics

Kent Staley
Associate Professor
Department of Philosophy
Saint Louis University

Regular visitors to Error Statistics Philosophy may recall a discussion that broke out here and on other sites last summer when the CMS and ATLAS collaborations at the Large Hadron Collider announced that they had discovered a new particle in their search for the Higgs boson that had at least some of the properties expected of the Higgs. Both collaborations emphasized that they had results that were significant at the level of “five sigma,” and the press coverage presented this as a requirement in high energy particle physics for claiming a new discovery. Both the use of significance testing and the reliance on the five sigma standard became a matter of debate.

Mayo has already commented on the recent updates to the Higgs search results (here and here); these seem to have further solidified the evidence for a new boson and the identification of that boson with the Higgs of the Standard Model. I have been thinking recently about the five sigma standard of discovery and what we might learn from reflecting on its role in particle physics. (I gave a talk on this at a workshop sponsored by the “Epistemology of the Large Hadron Collider” project at Wuppertal [i], which included both philosophers of science and physicists associated with the ATLAS collaboration.)

Just to refresh our memories, back in July 2012, Tony O’Hagan posted at the ISBA forum (prompted by “a question from Dennis Lindley”) three questions regarding the five-sigma claim:

  1. “Why such an extreme evidence requirement? We know from a Bayesian perspective that this only makes sense if (a) the existence of the Higgs boson (or some other particle sharing some of its properties) has extremely small prior probability and/or (b) the consequences of erroneously announcing its discovery are dire in the extreme. Neither seems to be the case, so why 5-sigma?
  2. “Rather than ad hoc justification of a p-value, it is of course better to do a proper Bayesian analysis. Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?
  3. “We know that given enough data it is nearly always possible for a significance test to reject the null hypothesis at arbitrarily low p-values, simply because the parameter will never be exactly equal to its null value. And apparently the LHC has accumulated a very large quantity of data. So could even this extreme p-value be illusory?”
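The large-sample worry in the third question can be sketched numerically. This is only an illustration under assumed values (a one-sided Normal test with known sigma, and a fixed, tiny true offset `delta` from the null); it says nothing about the actual LHC analysis, and the function names are hypothetical.

```python
import math

def one_sided_p(z):
    """Upper-tail probability of a standard Normal beyond z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def expected_z(delta, sigma, n):
    """Expected test statistic when the true mean is delta away from the null."""
    return delta * math.sqrt(n) / sigma

# Assumed values: a true discrepancy of 0.01 standard deviations.
delta, sigma = 0.01, 1.0
sample_sizes = (100, 10_000, 250_000, 1_000_000)

results = [(n, one_sided_p(expected_z(delta, sigma, n))) for n in sample_sizes]
for n, p in results:
    print(f"n = {n:>9,}  expected z = {expected_z(delta, sigma, n):5.1f}  p = {p:.3g}")
```

With the discrepancy held fixed, the expected statistic grows like the square root of n, so the p-value can be driven as low as one likes. Whether that makes a small p-value "illusory" depends on whether such a tiny discrepancy is of interest, which is the substantive question being raised.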

O’Hagan received a lot of responses to this post, and he very helpfully wrote up and posted a digest of those responses, discussed on this blog here and here. Continue reading

Categories: Error Statistics, P-values, Statistics | 26 Comments

Higgs analysis and statistical flukes (part 2)

Everyone was excited when the Higgs boson results were reported on July 4, 2012, indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of excess events of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post. This, too, is a rough outsider’s angle on one small aspect of the statistical inferences involved. (Doubtless there will be corrections.) But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels.
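For readers who want the sigma language in probability terms, here is a minimal sketch converting a number of standard deviations into a one-sided tail probability. It assumes a standard Normal reference distribution, which is a simplification of how the collaborations actually compute significance; the function name is hypothetical.

```python
import math

def one_sided_tail(n_sigma):
    """Probability that a standard Normal variable exceeds n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

for n_sigma in (2, 3, 5):
    p = one_sided_tail(n_sigma)
    print(f"{n_sigma} sigma -> p = {p:.3g} (about 1 in {1 / p:,.0f})")
```

On this simplified reading, a 5 sigma excess is one that background alone would produce with probability of roughly 3 in 10 million.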

Following an official report from ATLAS, researchers define a “global signal strength” parameter “such that μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the SM Higgs boson signal in addition to the background” (where SM is the Standard Model). The statistical test may be framed as a one-sided test, where the test statistic (which is actually a ratio) records differences in the positive direction, in standard deviation (sigma) units. Reports such as: Continue reading

Categories: P-values, statistical tests, Statistics | 33 Comments

Normal Deviate: Double Misunderstandings About p-values

sisyphean task

I’m really glad to see that the Normal Deviate has posted about the error in taking the p-value as any kind of conditional probability. I consider the “second” misunderstanding to be the (indirect) culprit behind the “first”.

Double Misunderstandings About p-values

March 14, 2013 – 7:57 pm

It’s been said a million times and in a million places that a p-value is not the probability of H0 given the data.

But there is a different type of confusion about p-values. This issue arose in a discussion on Andrew’s blog.

Andrew criticizes the New York Times for giving a poor description of the meaning of p-values. Of course, I agree with him that being precise about these things is important. But, in reading the comments on Andrew’s blog, it occurred to me that there is often a double misunderstanding.

First, let me say that I am neither defending nor criticizing p-values in this post. I am just going to point out that there are really two misunderstandings floating around. Continue reading

Categories: P-values | 3 Comments
