**From Gelman’s blog:**

“In one of life’s horrible ironies, I wrote a paper “Why we (usually) don’t have to worry about multiple comparisons” but now I spend lots of time worrying about multiple comparisons.” (Posted 14 October 2014, 11:13 am)

Exhibit B: The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time, in press. (Andrew Gelman and Eric Loken) (Shortened version is here.)

The “forking paths” paper, in my reading, **basically argues that mere hypothetical possibilities about what you would or might have done had the data been different (in order to secure a desired interpretation) suffice to alter the characteristics of the analysis you actually did.** That’s an error statistical argument–maybe even stronger than what some error statisticians would say. What’s really being condemned are overly flexible ways to move from statistical results to substantive claims.

**The p-values are illicit when taken to provide evidence for those claims because an actual p-value requires Prob(P < p;Ho) = p** (and the actual p-value has become much greater by design). The criticism makes perfect sense if you’re scrutinizing inferences according to how well or severely tested they are. Actual error probabilities are accordingly altered or unable to be calculated. However, if one is going to scrutinize inferences according to severity then the same problematic flexibility would apply to Bayesian analyses, whether or not they have a way to pick up on it. (It’s problematic if they don’t.) I don’t see the magic by which a concern for multiple testing disappears in Bayesian analysis (e.g., in the first paper) except by assuming some prior takes care of it.
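To make the requirement Prob(P < p; Ho) = p concrete, here is a minimal sketch of how it fails when the reported p-value is the smallest of several that could have been computed. The assumption of `k` independent analysis paths is an idealization introduced for illustration (real forking paths are usually correlated and rarely enumerable), and the function names are hypothetical.

```python
import random

def actual_rate_analytic(alpha, k):
    """P(smallest of k independent null p-values <= alpha) = 1 - (1 - alpha)^k."""
    return 1.0 - (1.0 - alpha) ** k

def actual_rate_simulated(alpha, k, runs=20000, seed=0):
    """Monte Carlo check: under the null each p-value is uniform on (0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        smallest = min(rng.random() for _ in range(k))
        if smallest <= alpha:
            hits += 1
    return hits / runs

if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, round(actual_rate_analytic(0.05, k), 4),
              round(actual_rate_simulated(0.05, k), 4))
```

With a single path the nominal and actual rates agree; with more paths the nominal 0.05 understates the chance of reaching "significance" under the null.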

See my comment here.

When you have lots of P-values you can examine them to see as a set whether they seem small or not. In 2001-2002 I gave a series of 8 identical (well, nearly identical) courses to the laboratories of a big pharma company. It was like Groundhog day all over again, all over again, all over again…. etc

The one question I was asked during every course was ‘how do I adjust for multiplicity?’. I pointed out to them that controlling the type I error rate per experiment didn’t make much sense, since some of them had 2 treatments per experiment, some 3, some 8 and some even more. What the lab was doing was screening compounds (thousands of them) to send them further down the line. If a lab that screened 10,000 molecules found only 500 significant at the 5% level then it had a problem, since that is the number expected by chance alone, but if it had 2000 that were significant it might have a different sort of problem. If the lab down the road could only handle 1000 molecules then in that case why not set the threshold for significance at a level that would hand 1000 on? What would certainly make no sense would be to adjust the levels per experiment. An operational characteristic should be set for the lab and perhaps more widely for the company but frankly, you have to be lacking all nous if you think that the way to solve this is by application of some fancy procedure for each experiment.
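The lab-level threshold described above can be sketched in a few lines. This is only an illustration of the idea of choosing the cutoff from downstream capacity; the function names are hypothetical and ties between p-values are ignored.

```python
def expected_null_hits(n_screened, alpha):
    """Number of 'significant' results expected if no compound has any effect."""
    return n_screened * alpha

def capacity_threshold(p_values, capacity):
    """Significance cutoff that hands on exactly `capacity` compounds
    (assuming no ties): the capacity-th smallest p-value."""
    if not 1 <= capacity <= len(p_values):
        raise ValueError("capacity must be between 1 and the number of compounds")
    return sorted(p_values)[capacity - 1]

def passed_on(p_values, threshold):
    """Indices of compounds whose p-value meets the lab-level cutoff."""
    return [i for i, p in enumerate(p_values) if p <= threshold]

if __name__ == "__main__":
    # 10,000 molecules at the 5% level: about 500 hits expected by chance alone.
    print(expected_null_hits(10_000, 0.05))
```

The cutoff is a property of the whole screen, not of any single experiment, which is the point being made about operational characteristics.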

Now the Bayesian, of course, has a lab book in his or her head corresponding to millions of experiments against which this one can be compared. All that is necessary is to open the book and study it, isn’t it?

See my posts on this website here

https://errorstatistics.com/2013/12/03/stephen-senn-dawids-selection-paradox-guest-post/

and here

https://errorstatistics.com/2012/05/01/stephen-senn-a-paradox-of-prior-probabilities/

> All that is necessary is to open the book and study it, isn’t it?

Obviously, but what if you opened the wrong book?

As you point out, if your theory says you can’t ask that question, or if having opened a book you have bought it and are stuck with it forever, then all the repeated-use evaluations involve draws from omnipotent priors (as Greenland put it) and you can never be found at fault.

However, thoughtful Bayesians do worry about repeated use of methods (as Rubin once put it, smart people do not like being repeatedly wrong) but they seem to have difficulties getting this across to others – maybe even especially to other Bayesians.

One of the major barriers might be that once you admit the prior (and likelihood) might be wrong, almost all the claimed advantages of the Bayesian approach evaporate. Arguably the flexibility, convenience and directness of Bayesian methods of analysis remain (i.e. why Box argued for Bayesian estimation once an acceptable-for-the-purpose model [representation] was tentatively settled on) but that seemingly is not impressive enough for most?

In a frequentist approach, when the model is taken as true there is a huge amount of work to be done in realistic applications that seldom is totally satisfactory and often unfinished (as you point out, relevant subsets keep coming up and as David Cox used to comment we need more students to study higher order asymptotics). In a Bayesian approach, when the model is taken as true there is no work that can’t be done by simulation and with that supposedly no risk of being “wrong”. In either approach, the model is always wrong and ways to deal with that are not totally satisfactory or even that hopeful.

We see eye to eye on this, Keith. I have no particular difficulty with Bayesian methods as being one of a number of approaches we use to examine data with a goal of local temporary coherence but it is clear that this requires us to be allowed to do something at a higher level. Quite what this higher level is, is a bit of a mystery but one of the possible advantages (but I realise it would require deeper justification) of having systems that don’t always agree is that disagreement can be a trigger for reflection.

I just got Laurie Davies’ new book on “Data Analysis and Approximate Models”,

http://www.crcpress.com/product/isbn/9781482215861

in which he says that normally data analysis has two modes. In the first one people use all kinds of exploratory techniques and goodness of fit tests to nail down a model, and once the model is found, it is treated in the second mode as if it was always fixed, using likelihood, Bayes, Neyman-Pearson etc.

He writes that the ways of arguing in the two modes are largely inconsistent with each other, which is why much of statistics is based on theory for the second mode, ignoring the first one altogether, which he finds inappropriate and deplorable.

Davies himself proposes an integrated one-mode approach to data analysis with probability models that doesn’t assume that any model is “true”. Obviously the jury is still out on Davies’s own approach but the two-modes idea about what is currently done hits the nail on its head, as far as I see.

Christian: Thanks for the update, I will want to look at his new book. Of course, exploratory vs confirmatory has been a distinction that’s been around for donkey’s years, but rather than separating them, they may be seen as interspersed. After all, building pieces requires the pieces to be well tested–I mean if they’re going to be part of the model later used in “confirmatory” tests.

Indeed this is a sensible discussion from an error-statistical point of view, or in Gelman’s words

“The statistical framework of this paper is frequentist: we consider the statistical properties of hypothesis tests under hypothetical replications of the data.”

What also is being condemned is the unfortunate practice of not honestly describing exploratory findings as such:

“To put it another way, we view these papers – despite their statistically significant p-values – as exploratory, and when we look at exploratory results we must be aware of their uncertainty and fragility. It does not seem to us to be good scientific practice to make strong general claims based on noisy data, and one problem with much current scientific practice is the inadvertent multiplicity of analysis that allows statistical significance to be so easy to come by, with researcher degrees of freedom hidden because researcher only sees one data set at a time.”

Unfortunately for honest researchers, there are not many “top-tier” journals that will publish “exploratory findings” so in the publish-or-perish environment with for-profit business-model journals authors are forced to misrepresent the nature of their findings. In all honesty, many studies are exploratory in nature, being early studies in a given field (witness the recent STAP fiasco in Nature). But because journals want “leading edge” findings described in hyped-up language, and rarely will publish confirmatory studies, honest discussion of early findings and their tenuous nature is sorely lacking. This needs to change. Hopefully this and related publications by Gelman will help on that front.

Steven: I’m really glad you wrote because I can’t really figure out Gelman on this one, and I know you were writing about some recent health policy documents in which Bayesians were promising to make multiplicity vanish.

My position isn’t that background knowledge can’t make them vanish, it’s the fact that the properties of the test may (not always) be altered by things like multiple testing–often easy to demonstrate, other times more subtle. So, I’d rather report the properties of the test and then let the challenge to the inference be defeasible by background.

Now in Gelman’s “forking paths” paper it’s hard to tell whether he’s imagining these are only problems for one who defines QRPs or cheating as an error statistician does, or whether he too regards those practices as unwarranted. That’s what I want to know. Or is he merely saying they are unwarranted for a person who believes the capacity of your tests to easily infer your pet hypothesis matters?

I feel the crucial point, at least one of them, is missing in this discussion. (I realize Senn and Keith are onto something else of importance.) The issue I have in mind is essentially, why/when do considerations about what could fairly easily have happened (to lead to unwarranted interpretations of data) matter in interpreting the inference actually made with the actual data obtained? There is a critical stance in the kind of error statistical view that I think (I may be wrong) is oozing through the “garden” paper: the mere fact of allowing flexibility indicates you weren’t “sincerely trying” (as Popper inadequately puts it) to be self-critical at every stage. Even if you thought you were trying, you didn’t do a good enough job. Thus, even if we were to think the inference plausible in its own right, we may wish to say that it wasn’t tested well here.

It is because, at least in my reading, the error statistical standpoint is scrutinizing H’s well-testedness (not the plausibility of H) that its stipulations are so often misunderstood, and assumed to be a matter of merely not wanting to be often wrong in some long run. Instead the counterfactual bears upon how capable your overall method was in avoiding erroneous interpretations in the case at hand.

I see it as being the problem of the elusive denominator. The question is if we want to control an error rate what is the “what” per “what” we want to control it for? The typical lazy default is that “what”=”experiment”, but this soon leads to absurdities. If I have 8 new antifungal treatments, A,B…H and I conduct 8 experiments each of which consists of comparing one of A to H to a neutral vehicle in vitro I don’t have to adjust for multiplicity. However, if I compare all 8 to vehicle in one experiment, I do. This is just mad, especially since, other things being equal, the probability of making at least one type I error is greater in the former case than the latter.
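The claim that the chance of at least one type I error is greater with 8 separate experiments than with one experiment sharing a vehicle arm can be checked by simulation. The sketch below rests on simplifying assumptions not in the comment: arm means are standardized to unit standard error, variances are known, every test is an unadjusted two-sided z-test at 5%, and all names are hypothetical.

```python
import random

Z_CRIT = 1.959964  # two-sided 5% critical value of the standard normal

def any_false_positive(shared_vehicle, rng, k=8):
    """One lab run under the global null: True if any of the k tests rejects."""
    shared = rng.gauss(0.0, 1.0)  # vehicle arm mean used when the arm is shared
    for _ in range(k):
        # Separate experiments each get a fresh, independent vehicle arm.
        vehicle = shared if shared_vehicle else rng.gauss(0.0, 1.0)
        treatment = rng.gauss(0.0, 1.0)
        z = (treatment - vehicle) / 2 ** 0.5  # difference of two unit-variance means
        if abs(z) > Z_CRIT:
            return True
    return False

def familywise_rate(shared_vehicle, runs=20000, seed=1):
    """Estimated probability of at least one type I error across the k tests."""
    rng = random.Random(seed)
    hits = sum(any_false_positive(shared_vehicle, rng) for _ in range(runs))
    return hits / runs

if __name__ == "__main__":
    print("8 separate experiments:", familywise_rate(False))
    print("1 experiment, shared vehicle:", familywise_rate(True))
    print("independent-tests formula:", 1 - 0.95 ** 8)
```

Sharing the vehicle arm makes the 8 test statistics positively correlated, so false positives tend to cluster in the same runs, which lowers the chance of seeing at least one.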

If, on the other hand, I assume “what” = “test”, I don’t have to make an adjustment. What is clearly illegitimate is to carry out lots of tests and only report that which is significant but it is less obvious that carrying out lots of tests unadjusted and reporting them all is a problem.

The problem of the denominator is one reason why there has been a switch to false positive rates but it would be wrong to claim these as a cure-all.

Stephen: I think you’re going back to the typical way of using error probs as mere behavioristic error rates, as opposed to their relevance in the case at hand. If one moves away from “rates” as I do (which isn’t to say they don’t matter in health policy, but my focus is on inference), then it’s not really a matter of the denominator. Accordingly, using “rates” as the basis for whether cases should be distinguished is not the crucial criterion. Perhaps it would help if you gave me your impression of the “forking paths” paper (short version), which is rather different than your case, I realize, but invokes some hypotheticals.

On the earlier issue, Stan Young makes a distinction between meta-analysis and multiplicity in relation to what one wants to know, and also, whether they can all be considered as testing essentially the same hypothesis. In Westfall and Young, they say adjusting p-values is wanted if the assessment of the individual test is of interest, as opposed to meta-analysis that blends them.

Christian Hennig has kindly mentioned my book. I offer here a more precise description of my attitude to EDA and formal statistical inference. The modus operandi of EDA is tentative and probing. It is distribution function based with a weak topology typified by the Kolmogorov metric. The modus operandi of formal inference is ‘behave as if true’ (this is much stronger than fixing a model). It is density based with a strong topology typified by the total variation metric. There is therefore a double break when moving from EDA to formal inference. The distribution function and the density function are linked by the pathologically discontinuous differential operator. This discontinuity makes all discussions about the likelihood principle pointless. Likelihood requires truth; the adequacy of a model is not sufficient. Once in the ‘behave as if true’ mode the means by which one came to be in possession of the truth, foul or fair, are irrelevant, which is why (Steven McKinney) ‘exploratory findings’ are seldom mentioned. This applies equally well to Bayesians and frequentists. I argue for one phase, essentially the EDA phase, for one topology and for treating models consistently as approximations. This may at first glance seem innocuous but it is not. The Bayesian paradigm requires truth. Two different models cannot both be true, which via a Dutch book argument leads to the additivity of Bayesian priors. However two different models can both be adequate approximations. There is no exclusion and consequently no additivity. For the frequentists there are no ‘true parameter values’ and so no confidence intervals which cover them. For approximate models an appropriately defined P-value is a measure of how weak the concept of adequate approximation has to be before the model can be regarded as an adequate approximation. Finally the pathological discontinuity of the differential operator means that it is possible to have arbitrarily severe models which fit the data. A theory of error statistics requires a discussion of regularization.

Laurie: This isn’t your main issue, I spoze, but I wish to register my denial that “The modus operandi of formal inference is ‘behave as if true’ (this is much stronger than fixing a model)”. Even the “behavioristic” model of Neyman identified the “act” with a very specific behavior, e.g., declare there’s an indication of a discrepancy, deny we can warrant ‘confirming’ the discrepancy from null is less than an amount against which the test has low power (Neyman’s “power analysis”), publish/do not publish a result, infer the experimental effect of interest is unable to be brought about at will, decide the randomization assumption has been violated (as with cloud seeding), infer the model is adequate for probing such and such a question, but not another, report the data indicate doubts that the gold has been adulterated by more than x%, report the data indicate the skull’s age is approx m +- e, etc.

Howson and Urbach spread the view that for Neyman and Pearson one is instructed to bet all one’s worldly goods on the result of a single statistical test–among other howlers, like they deny there’s such a thing as evidence, learning, inference. Neyman never regarded stat models as more than what he called “picturesque descriptions”: of a given aspect of a situation more or less adequate for tackling a given problem.

Having said all that, what do you mean by “A theory of error statistics requires a discussion of regularization”?

Deborah, let me use a concrete example. In my book there are 27 measurements of the amount of copper in a sample of drinking water (real data). Suppose for the sake of argument the legal upper limit is 2.03mg per litre. We want therefore to make a decision about whether the limit is exceeded by the water sample we have. A statistician will look at the data, decide upon some location-scale model and (speculatively) identify the amount of copper with the location parameter mu. Exceeding the legal limit is translated into the null hypothesis H_0: mu > 2.03. How does the statistician choose a model? One can take the Pratt approach: the observations ‘look normally distributed’ (they do!), so use the normal model. Having decided on the normal model, one then does the best one can do under the normal model (behaving as if true) and uses the mean to ‘estimate’ mu. I take this to be as severe as possible under the model. But as the observations also ‘look Laplace distributed’ (they do!), the statistician can use the Laplace distribution, do the best under this model and ‘estimate’ mu by the median. This is even more severe than the mean. Finally the observations also ‘look comb* distributed’ (they do!), so the statistician can use the comb distribution, do the best under this model and ‘estimate’ mu by maximum likelihood. This is much more severe than either the mean or the median. The three 95% confidence intervals are [1.970,2.062], [1.989,2.071] and [2.0248,2.0256] respectively. The latter confidence interval is not a typing error, it really is that small. A Bayesian approach is of no help. The posterior for mu for the comb model is essentially concentrated on the confidence interval.

Severity or efficiency can be imported from the model. Tukey called this a free lunch, which in his experience does not exist in statistics, and said that bland or hornless models should be used, for example minimum Fisher information models. This is what I mean by regularization. The normal model is bland, the comb model is not. The modus operandi ‘behave as if true’ does not commit the statistician to actually believing that the model is true, it simply means that for the purpose of the analysis the statistician behaves as if this were so.

Instead of calculating confidence intervals as above I suggest the use of approximation intervals which specify those values of the parameter which are consistent with the data. For the normal model consistency is based on the Kolmogorov (or better the Kuiper) distance of the data from the model, on the absence of outliers, on the skewness of the data and on the values of the mean and standard deviation. You can do this also for the comb model and the two approximation intervals are about the same. This means that the EDA part of the analysis is included in the specification of the parameters. A small approximation interval can be an indication of lack of fit. What one does not do is to declare the normal model as adequate, move into the ‘behave as if true’ mode and then base the rest of the analysis on the optimal estimators and forget completely about the results of the EDA phase.
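A small piece of arithmetic on the three intervals quoted above shows how much precision the choice of model imports. The interval endpoints and the 2.03 limit are taken from the comment; the dictionary labels and function name are only illustrative.

```python
LIMIT = 2.03  # hypothetical legal upper limit in mg per litre, as in the example

# 95% confidence intervals for mu quoted in the comment, keyed by model.
INTERVALS = {
    "normal (mean)": (1.970, 2.062),
    "Laplace (median)": (1.989, 2.071),
    "comb (maximum likelihood)": (2.0248, 2.0256),
}

def summarize(intervals, limit):
    """Width of each interval and whether it still contains the legal limit."""
    return {
        name: {"width": high - low, "contains_limit": low <= limit <= high}
        for name, (low, high) in intervals.items()
    }

if __name__ == "__main__":
    for name, row in summarize(INTERVALS, LIMIT).items():
        print(f"{name}: width {row['width']:.4f}, contains limit: {row['contains_limit']}")
```

The comb interval is over a hundred times narrower than the normal one and lies entirely below 2.03, so the decision about the legal limit changes with a model choice that the data alone do not settle.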

comb* defined in my book

http://www.crcpress.com/product/isbn/9781482215861

lauriedavies: I don’t know why your message washed–I was traveling, and thought it had been approved, sorry. I look forward to reading your book, and I am interested in the drinking water example.

“I don’t see the magic by which a concern for multiple testing disappears in Bayesian analysis (e.g., in the first paper) except by assuming some prior takes care of it.”

I’ll take a stab at this. The point of the first paper is that in a Bayesian analysis, the inference is no longer weakly powered as it was in the hypothesis test.

I don’t think it’s fair to say all the additional information is coming from the prior in a Bayesian analysis. In multilevel models, the information comes from the model structure and the ability to, say, make a single inference from multiple measurements.

I think there’s also a bit of confusion regarding the 2014 paper, as many people seem to interpret the message as, “Gelman says we should be using multiple comparison adjustments after all!” My read of it is that conditional on doing weakly powered tests (with many subjective decision points) multiple comparisons matter. However, my preferred approach is still to use the structure of the problem so as to avoid doing weakly powered analyses in the first place.

BTW, thank you for the generous offer in the other thread. I will buy the e-book. Still have a backlog of reading material though (slowly making my way through Senn’s book, for one thing)

I think the question is a tad more subtle than described. The original work on the question was by Storey, in 2003, http://www.cs.berkeley.edu/~jordan/sail/readings/storey-annals-05.pdf. Storey introduces the q-value, a measure analogous to the p-value in these settings, and shows its meaning in both classical frequentist and Bayesian contexts.
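For readers who have not met q-values, a minimal sketch may help. It computes Benjamini-Hochberg adjusted p-values, which coincide with q-value estimates when the proportion of true nulls is conservatively taken to be 1; Storey's paper goes further by estimating that proportion from the data, which this sketch does not attempt. The function name is hypothetical.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values (q-value estimates with pi0 = 1).

    For the i-th smallest p-value p_(i) out of m, the adjusted value is
    min over j >= i of m * p_(j) / j, capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum enforces monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q[idx] = running_min
    return q

if __name__ == "__main__":
    print(bh_q_values([0.01, 0.04, 0.03, 0.005]))
```

A q-value attaches to each test the smallest false discovery rate at which that test would be called significant, rather than a per-test type I error probability.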

There is a natural way in which multiple tests framed as sequential Bayesian updates have their false alarm and false negative rates accommodated within Bayesian analysis. In particular, Bayesian hierarchical models address the multiple comparisons phenomenon in terms of sampling from hyperpriors. See Section 4.5 of the 3rd edition of Bayesian Data Analysis, by Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin, 2014.

Hypergeometric: Well, I gather he’s not too satisfied with this recommendation as a general solution. His garden of forking paths, etc. is pointing to general issues of latitude in interpretation, and finding one’s hypothesis in the data and such.

By the way, appealing to these hyperpriors to match error probability results has been discussed a few times on this blog, e.g., by Stephen Senn discussing Dawid. Even where this can “work”, there needs to be an impetus or justification to do something to change the assessment. It requires a principle. It may be error statistical, it may be something else.