selection effects

Yoav Benjamini, “In the world beyond p < .05: When & How to use P < .0499…"


These were Yoav Benjamini’s slides, “In the world beyond p<.05: When & How to use P<.0499…” from our session at the ASA 2017 Symposium on Statistical Inference (SSI): A World Beyond p < 0.05. (Mine are in an earlier post.) He begins by asking:

However, it’s mandatory to adjust for selection effects, and Benjamini is one of the leaders in developing ways to carry out the adjustments. Even calling out the avenues for cherry-picking and multiple testing, long known to invalidate p-values, would make replication research more effective (and less open to criticism).

Readers of this blog will recall one of the controversial cases of failed replication in psychology: Simone Schnall’s 2008 paper on cleanliness and morality. In an earlier post, I quote from the Chronicle of Higher Education:

…. Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. Ms. Schnall wanted to discover whether prompting—or priming, in psych parlance—people with the concept of cleanliness would make them less judgmental.

My gripe has long been that the focus of replication research is on the “pure” statistics–do we get a low p-value or not?–largely sidestepping the often tenuous statistical-substantive links and measurement questions, as in this case [1]. When the study failed to replicate, there was a lot of heated debate about the “fidelity” of the replication.[2] Yoav Benjamini cuts to the chase by showing the devastating selection effects involved (on slide #18):

For the severe tester, the onus is on researchers to show explicitly that they could put to rest expected challenges about selection effects. Failure to do so suffices to render the finding poorly or weakly tested (i.e., it passes with low severity), overriding the initial claim to have found a non-spurious effect.

“Every research proposal and paper should have a replicability-check component,” Benjamini recommends.[3] Here are the full slides for your Saturday night perusal.

[1] I’ve yet to hear anyone explain why unscrambling soap-related words should be a good proxy for “situated cognition” of cleanliness. But even without answering that thorny issue, identifying the biasing selection effects that were not taken into account vitiates the nominal low p-value. It is easy rather than difficult to find at least one such computed low p-value by selection alone. I advocate going further, where possible, and falsifying the claim that the statistical correlation is a good test of the hypothesis.

[2] For some related posts, with links to other blogs, check out:

Some ironies in the replication crisis in social psychology.

Repligate returns.

A new front in the statistics wars: peaceful negotiation in the face of so-called methodological terrorism

For statistical transparency: reveal multiplicity and/or falsify the test: remark on Gelman and colleagues

[3] The problem underscores the need for a statistical account whose assessments are directly altered by biasing selection effects, as the p-value is.

Categories: Error Statistics, P-values, replication research, selection effects | 22 Comments

Going round and round again: a roundtable on reproducibility & lowering p-values


There will be a roundtable on reproducibility on Friday, October 27th (noon Eastern time), hosted by the International Methods Colloquium, on the reproducibility crisis in the social sciences, motivated by the paper “Redefine statistical significance.” Recall that this was the paper written by a megateam of researchers as part of the movement to require p ≤ .005, based on appraising significance tests by a Bayes Factor analysis with prior probabilities on a point null and a given alternative.

It seems to me that if you’re prepared to scrutinize your frequentist (error statistical) method on grounds of Bayes Factors, then you must endorse using Bayes Factors (BFs) for inference to begin with. If you don’t endorse BFs–and, in particular, the BF required to get the disagreement with p-values*–then it doesn’t make sense to appraise your non-Bayesian method on grounds of agreeing or disagreeing with BFs. For suppose you assess the recommended BFs from the perspective of an error statistical account–that is, one that checks how frequently the method would uncover or avoid the relevant mistaken inference.[i] Then you will find the situation is reversed, and the recommended BF exaggerates the evidence! (In particular, with high probability, it gives an alternative H’ fairly high posterior probability, or comparatively higher probability, even though H’ is false.) The two are measuring very different things, and it’s illicit to expect agreement on numbers.[ii] We’ve discussed this quite a lot on this blog (two related posts are linked below [iii]).
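To see the kind of numbers-clash at stake, here is a minimal numerical sketch (my own illustration, not the specific Bayes Factor analysis in “Redefine statistical significance”): for a Normal mean with a point null and an N(0, τ²) prior on the alternative, a result sitting exactly at the two-sided .05 cutoff can yield a Bayes Factor that favors the null.

```python
# Sketch: a two-sided p-value of ~.05 vs. a Bayes Factor for a point null.
# Assumptions (mine, for illustration): X-bar ~ N(theta, sigma^2/n), H0: theta = 0,
# and under H1, theta ~ N(0, tau^2).  Not the exact analysis in the "Redefine" paper.
from scipy import stats
import numpy as np

n, sigma, tau = 100, 1.0, 1.0      # sample size, known SD, prior SD under H1
se = sigma / np.sqrt(n)            # standard error of the sample mean
xbar = 1.96 * se                   # observed mean sitting right at the .05 cutoff

p_two_sided = 2 * stats.norm.sf(abs(xbar) / se)           # ~= 0.05

# Marginal likelihoods of the observed mean: under H0 its variance is se^2;
# under H1 it is tau^2 + se^2 (prior variance adds to sampling variance).
m0 = stats.norm.pdf(xbar, loc=0, scale=se)
m1 = stats.norm.pdf(xbar, loc=0, scale=np.sqrt(tau**2 + se**2))
bf01 = m0 / m1                                            # BF in favor of the null

print(f"two-sided p-value: {p_two_sided:.3f}")
print(f"BF_01 (null over alternative): {bf01:.2f}")       # > 1: data favor the null
```

With these illustrative numbers, the same data that just reach p = .05 give BF01 ≈ 1.5 in favor of the null; whether that shows the p-value “exaggerates” the evidence, or shows the BF understates it, depends entirely on which standard you have already endorsed.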

If the given list of panelists is correct, it looks to be 4 against 1, but I’ve no doubt that Lakens can handle it.

  1. Daniel Benjamin, Associate Research Professor of Economics at the University of Southern California and a primary co-author of “Redefine Statistical Significance”.
  2. Daniel Lakens, Assistant Professor in Applied Cognitive Psychology at Eindhoven University of Technology and a primary co-author of a response to ‘Redefine statistical significance’ (under review).
  3. Blake McShane, Associate Professor of Marketing at Northwestern University and a co-author of the recent paper “Abandon Statistical Significance”.
  4. Jennifer Tackett, Associate Professor of Psychology at Northwestern University and a co-author of the recent paper “Abandon Statistical Significance”.
  5. E.J. Wagenmakers, Professor at the Methodology Unit of the Department of Psychology at the University of Amsterdam and a co-author of the paper “Redefine Statistical Significance”.

To tune in to the presentation and participate in the discussion after the talk, visit this site on the day of the talk. To register for the talk in advance, click here.

The paradox for those wishing to abandon significance tests on grounds that there’s “a replication crisis”–and I’m not alleging that everyone under the “lower your p-value” umbrella is advancing this–is that lack of replication is effectively uncovered thanks to statistical significance tests. They are also the basis for fraud-busting and for adjustments for multiple testing and selection effects. Unlike Bayes Factors, they:

  • are directly affected by cherry-picking, data dredging and other biasing selection effects
  • are able to test statistical model assumptions, and may have their own assumptions secured by appropriate experimental design
  • block inferring a genuine effect when the method had low capability of having found the effect spurious, had it been spurious.

In my view, the result of a significance test should be interpreted in terms of the discrepancies that are well or poorly indicated by the result. So we’d avoid the concern that leads some to recommend a .005 cut-off to begin with. But if this does become the standard for testing the existence of risks, I’d make “there’s an increased risk of at least r” the test hypothesis in a one-sided test, as Neyman recommends. Don’t give a gift to the risk producers. In the most problematic areas of social science, the real problems are (a) the questionable relevance of the “treatment” and “outcome” to what is purportedly being measured, (b) cherry-picking, data-dependent endpoints, and a host of other biasing selection effects, and (c) violated model assumptions. Lowering the p-value threshold will do nothing to help with these problems; forgoing statistical tests of significance will do a lot to make them worse.
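As a minimal sketch of what “discrepancies well or poorly indicated” comes to in the simplest case, take a one-sided Normal test with known σ (the numbers below are hypothetical, chosen only for illustration): after a rejection, the severity with which the claim μ > μ1 passes is the probability of a result less extreme than the one observed, computed under μ = μ1.

```python
# Sketch: post-data severity assessment for a one-sided Normal test of
# H0: mu <= mu0 vs H1: mu > mu0 with known sigma.  All numbers are hypothetical.
from scipy import stats
import numpy as np

mu0, sigma, n = 0.0, 2.0, 100
se = sigma / np.sqrt(n)
xbar = 0.4                       # observed mean: z = 2.0, so H0 is rejected at the .05 level

for mu1 in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    # SEV(mu > mu1) = P(X-bar <= observed x-bar), computed under mu = mu1:
    # high severity means the claim "mu > mu1" has passed a stringent test.
    sev = stats.norm.cdf((xbar - mu1) / se)
    print(f"SEV(mu > {mu1:.1f}) = {sev:.3f}")
```

On these numbers the rejection licenses μ > 0, and even μ > 0.2, with fairly high severity, while μ > 0.4 is poorly indicated; reporting results this way already addresses the exaggeration worry without any change of threshold.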

 *Added Oct 27. This is worth noting because in other Bayesian assessments–indeed, in assessments deemed more sensible and less biased in favor of the null hypothesis–the p-value scarcely differs from the posterior probability on H0. This is discussed, for example, in Casella and R. Berger (1987). See the links in [iii]. The two are reconciled with one-sided tests, and insofar as the typical study states a predicted direction, one-sided tests are what researchers should be using.

[i] Both “frequentist” and “sampling theory” are unhelpful names. Since the key feature is basing inference on the error probabilities of methods, I abbreviate the approach as error statistics. The error probabilities are based on the sampling distribution of the appropriate test statistic. A proper subset of error statistical contexts comprises those that use error probabilities to assess and control the severity with which a particular claim is tested.

[ii] See #4 of my recent talk on statistical skepticism, “7 challenges and how to respond to them”.

[iii] Two related posts: p-values overstate the evidence against the null fallacy

How likelihoodists exaggerate evidence from statistical tests (search the blog for others)

 

 


Categories: Announcement, P-values, reforming the reformers, selection effects | 5 Comments

What have we learned from the Anil Potti training and test data fireworks? Part 1 (draft 2)

Image: “toilet fireworks,” by stephenthruvegas on Flickr

Over 100 patients signed up for the chance to participate in the clinical trials at Duke (2007-10) that promised a custom-tailored cancer treatment spewed out by a cutting-edge prediction model developed by Anil Potti, Joseph Nevins and their team at Duke. Their model purported to predict your probable response to one or another chemotherapy based on microarray analyses of various tumors. While they are now described as “false pioneers” of personalized cancer treatments, it’s not clear what has been learned from the fireworks surrounding the Potti episode overall. Most of the popular focus has been on glaring typographical and data processing errors—at least that’s what I mainly heard about until recently. Although those errors were quite crucial to the science in this case (surely more so than Potti’s CV padding), what interests me now are the general methodological and logical concerns that rarely make it into the popular press. Continue reading

Categories: science communication, selection effects, Statistical fraudbusting | 38 Comments

Phil 6334 Visitor: S. Stanley Young, “Statistics and Scientific Integrity”

We are pleased to announce our guest speaker at Thursday’s seminar (April 24, 2014), “Statistics and Scientific Integrity”:

S. Stanley Young, PhD
Assistant Director for Bioinformatics
National Institute of Statistical Sciences
Research Triangle Park, NC

Co-author of Resampling-Based Multiple Testing (Westfall and Young, 1993, Wiley).



The main readings for the discussion are:

 

Categories: Announcement, evidence-based policy, Phil6334, science communication, selection effects, Statistical fraudbusting, Statistics | 4 Comments

Phil6334 Day #7: Selection effects, the Higgs and 5 sigma, Power

Below are slides from March 6, 2014: (a) the 2nd half of “Frequentist Statistics as a Theory of Inductive Inference” (Selection Effects),* and (b) the discussion of the Higgs particle discovery and the controversy over 5 sigma.

We spent the rest of the seminar computing significance levels, rejection regions, and power (by hand and with the Excel program). Here is the updated syllabus (3rd installment).
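For readers without the Excel sheet, here is a minimal sketch of the same computations for a one-sided Normal test with known σ (the particular numbers are hypothetical):

```python
# Sketch: significance level, rejection region, and power for a one-sided test of
# H0: mu <= mu0 vs H1: mu > mu0, Normal data with known sigma.  Numbers are hypothetical.
from scipy import stats
import numpy as np

mu0, sigma, n, alpha = 0.0, 2.0, 25, 0.05
se = sigma / np.sqrt(n)

z_alpha = stats.norm.ppf(1 - alpha)     # 1.645 for alpha = .05
cutoff = mu0 + z_alpha * se             # reject H0 whenever the sample mean exceeds this
print(f"rejection region: x-bar > {cutoff:.3f} (alpha = {alpha})")

# Power at an alternative mu1: probability the sample mean lands in the rejection region.
for mu1 in [0.2, 0.5, 1.0]:
    power = stats.norm.sf((cutoff - mu1) / se)
    print(f"power at mu = {mu1:.1f}: {power:.3f}")
```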

A relevant paper on selection effects on this blog is here.

Categories: Higgs, P-values, Phil6334, selection effects | Leave a comment

Capitalizing on chance (ii)

DGM playing the slots

I may have been exaggerating one year ago when I started this post with “Hardly a day goes by”, but now it is literally the case*. (This also pertains to reading for Phil6334 for Thurs. March 6):

Hardly a day goes by where I do not come across an article on the problems for statistical inference based on fallaciously capitalizing on chance: high-powered computer searches and “big” data trolling offer rich hunting grounds out of which apparently impressive results may be “cherry-picked”:

When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level. . . . Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be “significant at the 5 percent level.” Does this mean that differences as large as the one tested would occur by chance only 5 percent of the time when the true difference is zero? The answer is no, because the difference tested has been selected from the twenty differences that were examined. The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104)[1]

…Oh wait–this is from a contributor to Morrison and Henkel way back in 1970! But there is one big contrast, I find, that makes current-day reports so much more worrisome: critics of the Morrison and Henkel ilk clearly report that ignoring a variety of “selection effects” results in a fallacious computation of the actual significance level associated with a given inference; clear terminology is used to distinguish the “computed” or “nominal” significance level on the one hand from the actual or warranted significance level on the other. Continue reading
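The 64 percent in the Selvin passage is just the standard multiplicity arithmetic, on the assumption (mine, for concreteness) that the twenty differences are tested independently, each at the 5 percent level, with no real differences present:

```python
# Sketch of the Selvin arithmetic: if 20 independent differences are each tested at
# the 5% level and all true differences are zero, the chance that at least one
# comes out "significant" is 1 - 0.95**20, about 64%.
# (The independence assumption is mine; the quoted passage does not spell it out.)
p_at_least_one = 1 - 0.95 ** 20
print(f"P(at least one 'significant' result among 20 null tests) = {p_at_least_one:.2f}")  # ~0.64
```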

Categories: junk science, selection effects, spurious p values, Statistical fraudbusting, Statistics | 4 Comments

Stephen Senn: Dawid’s Selection Paradox (guest post)

Stephen Senn
Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS),
Luxembourg

“Dawid’s Selection Paradox”

You can protest, of course, that Dawid’s Selection Paradox is no such thing, but then those who believe in the inexorable triumph of logic will deny that anything is a paradox. In a challenging paper published nearly 20 years ago (Dawid 1994), Philip Dawid drew attention to a ‘paradox’ of Bayesian inference. To describe it, I can do no better than to cite the abstract of the paper, which is available from Project Euclid here: http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?

 When the inference to be made is selected after looking at the data, the classical statistical approach demands — as seems intuitively sensible — that allowance be made for the bias thus introduced. From a Bayesian viewpoint, however, no such adjustment is required, even when the Bayesian inference closely mimics the unadjusted classical one. In this paper we examine more closely this seeming inadequacy of the Bayesian approach. In particular, it is argued that conjugate priors for multivariate problems typically embody an unreasonable determinism property, at variance with the above intuition.

I consider this to be an important paper not only for Bayesians but also for frequentists, yet it has only been cited 14 times as of 15 November 2013 according to Google Scholar. In fact I wrote a paper about it in the American Statistician a few years back (Senn 2008) and have also referred to it in a previous blogpost (12 May 2012). That I think it is important and neglected is excuse enough to write about it again.

Philip Dawid is not responsible for my interpretation of his paradox, but the way that I understand it can be explained by considering what it means to have a prior distribution. As a reminder, if you are going to be 100% Bayesian, which is to say that all you will do by way of inference is turn a prior into a posterior distribution using the likelihood and the operation of Bayes theorem, then your prior distribution has to satisfy two conditions. First, it must be what you would use to bet now (that is to say, at the moment it is established), and second, no amount of subsequent data will change your prior qua prior. It will, of course, be updated by Bayes theorem to form a posterior distribution once further data are obtained, but that is another matter. The relevant time here is your observation time, not the time when the data were collected, so that data that were available in principle but only came to your attention after you established your prior distribution count as further data.

Now suppose that you are going to make an inference about a population mean, θ, using a random sample from the population and choose the standard conjugate prior distribution. In that case you will use a Normal distribution with known (to you) parameters μ and σ². If σ² is large compared to the random variation you might expect for the means in your sample, then the prior distribution is fairly uninformative, and if it is small, then fairly informative; but being uninformative is not in itself a virtue. Being not informative enough runs the risk that your prior distribution is not one you would wish to use to bet now, and being too informative runs the risk that your prior distribution is one you might be tempted to change given further information. In either of these two cases your prior distribution will be wrong. Thus the task is to be neither too informative nor not informative enough. Continue reading
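A minimal sketch of the conjugate setup Senn describes, in his notation (a N(μ, σ²) prior for the population mean θ, with the sampling variance of the sample mean treated as known); the specific numbers are mine, chosen only for illustration:

```python
# Sketch of the conjugate-Normal update Senn describes: prior theta ~ N(mu, sigma2),
# sample mean xbar with known sampling variance v (population variance / n).
# The numerical values below are illustrative assumptions, not from the post.
def normal_posterior(mu, sigma2, xbar, v):
    """Posterior mean and variance for theta given a Normal prior and Normal likelihood."""
    w = (1 / sigma2) / (1 / sigma2 + 1 / v)   # precision weight placed on the prior mean
    post_mean = w * mu + (1 - w) * xbar
    post_var = 1 / (1 / sigma2 + 1 / v)
    return post_mean, post_var

xbar, v = 1.2, 0.25                            # observed mean and its sampling variance
for sigma2 in [0.1, 1.0, 10.0]:                # informative -> fairly uninformative priors
    m, s2 = normal_posterior(mu=0.0, sigma2=sigma2, xbar=xbar, v=v)
    print(f"prior var {sigma2:>5}: posterior mean {m:.2f}, posterior var {s2:.3f}")
```

With the informative prior (small σ²) the posterior barely budges from the prior mean, while the diffuse prior all but reproduces the sample mean: the trade-off between being too informative and not informative enough that Senn is pointing to.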

Categories: Bayesian/frequentist, selection effects, Statistics, Stephen Senn | 68 Comments
