Since the comments to my previous post are getting too long, I’m reblogging it here to make more room. I say that the issue raised by J. Berger and Sellke (1987) and Casella and R. Berger (1987) concerns evaluating the evidence in relation to a given hypothesis (using error probabilities). Given the information that *this* hypothesis H* was randomly selected from an urn with 99% true hypotheses, we wouldn’t say this gives a great deal of evidence for the truth of H*, nor suppose that H* had thereby been well-tested. (H* might concern the existence of a standard model-like Higgs.) I think the issues about “science-wise error rates” and long-run performance in dichotomous, diagnostic screening should be taken up separately, but commentators can continue on this, if they wish (perhaps see this related post).

**0. July 20, 2014:** Some of the comments to this post reveal that using the word “fallacy” in my original title might have encouraged running together the current issue with the fallacy of transposing the conditional. Please see a newly added Section 7.

**1. What you should ask…**

Discussions of P-values in the Higgs discovery invariably recapitulate many of the familiar criticisms of P-values (some “howlers”, some not). When you hear the familiar refrain, “We all know that P-values overstate the evidence against the null hypothesis”, denying the P-value aptly measures evidence, what you should ask is:

“What do you mean by overstating the evidence against a hypothesis?”

An honest answer might be:

“What I mean is that when I put a lump of prior probability π_{0} > 1/2 on a point null H_{0} (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H_{0}.”

Your reply might then be: *(a) P-values are not intended as posteriors in H_{0} and (b) P-values can be used to determine whether there is evidence of inconsistency with a null hypothesis at various levels, and to distinguish how well or poorly tested claims are–depending on the type of question asked. The report on discrepancies “poorly” warranted is what controls any overstatements about discrepancies indicated.*

You might toss in the query: *Why do you assume that “the” correct measure of evidence (for scrutinizing the P-value) is via the Bayesian posterior?*

If you wanted to go even further you might rightly ask: **And by the way, what warrants your lump of prior to the null?** (See Section 3.)

^^^^^^^^^^^^^^^

**2. J. Berger and Sellke and Casella and R. Berger**

Of course it is well-known that for a fixed P-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H_{0} (Jeffreys-Good-Lindley paradox). I.J. Good (I don’t know if he was the first) recommended decreasing the required P-value as n increases, and had a formula for it. A more satisfactory route is to ensure the interpretation takes account of the (obvious) fact that with a fixed P-value and increasing n, the test is more and more sensitive to discrepancies–much as is done with lower/upper bounds of confidence intervals. For some rules of thumb see Section 5.

The JGL result is generalized in J. Berger and Sellke (1987). They make out the conflict between P-values and Bayesian posteriors by considering the two sided test of the Normal mean, *H*_{0}: μ = μ_{0} versus *H*_{1}: μ ≠ μ_{0} .

“If *n* = 50…, one can classically ‘reject H_{0} at significance level p = .05,’ although Pr(H_{0}|x) = .52 (which would actually indicate that the evidence favors H_{0}).” (Berger and Sellke, 1987, p. 113).

If *n* = 1000, a result statistically significant at the .05 level takes the probability of the null from a prior of .5 to a posterior of .82!
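For readers who want to see where figures like .52 and .82 come from, here is a minimal sketch in Python. It rests on assumptions that are mine rather than anything stated above: a lump of prior π_{0} = 1/2 on the point null, a Normal prior under H_{1} centered at μ_{0} with the same variance as a single observation, and a result that is just significant at the two-sided .05 level (z = 1.96). The function name `posterior_null` is hypothetical.

```python
import math

def posterior_null(z, n, prior_null=0.5):
    """Posterior probability of the point null H0: mu = mu0.

    Assumptions (labeled loudly): a lump of prior `prior_null` on H0 and,
    under H1, a Normal prior for mu centered at mu0 whose variance equals
    that of one observation. z = sqrt(n) * (xbar - mu0) / sigma.
    """
    # Bayes factor for H0 against H1: ratio of the marginal densities of xbar
    bayes_factor_01 = math.sqrt(1 + n) * math.exp(-0.5 * z**2 * n / (n + 1))
    prior_odds_01 = prior_null / (1 - prior_null)
    posterior_odds_01 = prior_odds_01 * bayes_factor_01
    return posterior_odds_01 / (1 + posterior_odds_01)

# Same standardized difference (z = 1.96), increasing sample size
for n in (50, 1000):
    print(n, round(posterior_null(1.96, n), 2))
```

Under these assumptions the same z = 1.96 gives a posterior on the null of about .52 at n = 50 and about .82 at n = 1000. Lowering `prior_null` pulls both numbers down, which is the point at issue in the Casella and R. Berger complaint about the high prior.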

While from their Bayesian perspective, this appears to supply grounds for denying P-values are adequate for assessing evidence, significance testers rightly balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null–or even evidence *for* it!

Many think this shows that the P-value ‘overstates evidence against a null’ because it claims to use an ‘impartial’ Bayesian prior probability assignment of .5 to *H*_{0}, the remaining .5 spread out over the alternative parameter space. (But see the justification Berger and Sellke give in Section 3. *A Dialogue*.) Casella and R. Berger (1987) charge that the problem is not P-values but the high prior, and that “concentrating mass on the point null hypothesis is biasing the prior in favor of *H*_{0} as much as possible” (p. 111) whether in 1 or 2-sided tests. Note, too, the conflict with confidence interval reasoning since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). See Senn’s very interesting points on this same issue in his letter (to Goodman) here.

^^^^^^^^^^^^^^^^^

**3. A Dialogue** (ending with a little curiosity in J. Berger and Sellke):

*So a guy is fishing in Lake Elba, and a representative from the EPA (Elba Protection Association) points to notices that mean toxin levels in fish were found to exceed the permissible mean concentration, set at 0.*

EPA Rep: We’ve conducted two studies (each with a random sample of 100 fish) showing statistically significant concentrations of toxin, at low P-values, e.g., .02.

P-value denier: I deny you’ve shown evidence of high mean toxin levels; P-values exaggerate the evidence against the null.

EPA Rep: Why is that?

P-value denier: If I update the prior of .5 that I give to the null hypothesis (asserting toxin levels are of no concern), my posterior for H_{0} is still not all that low, not as low as .05 for sure.

EPA Rep: Why do you assign such a high prior probability to H_{0}?

P-value denier: If I gave H_{0} a value lower than .5, then, if there’s evidence to reject H_{0}, at most I would be claiming an improbable hypothesis has become more improbable. Who would be convinced by the statement ‘I conducted a Bayesian test of H_{0}, assigning prior probability .1 to H_{0}, and my conclusion is that H_{0} has posterior probability .05 and should be rejected’?

*The last sentence is a direct quote from Berger and Sellke!*

“When giving numerical results, we will tend to present Pr(H_{0}|x) for π_{0} = 1/2. The choice of π_{0} = 1/2 has obvious intuitive appeal in scientific investigations as being ‘objective.’ (some might argue that π_{0} should even be chosen larger than 1/2 since H_{0} is often the ‘established theory.’) …[I]t will rarely be justifiable to choose π_{0} < 1/2; who, after all, would be convinced by the statement ‘I conducted a Bayesian test of H_{0}, assigning prior probability .1 to H_{0}, and my conclusion is that H_{0} has posterior probability .05 and should be rejected’? We emphasize this obvious point because some react to the Bayesian-classical conflict by attempting to argue that π_{0} should be made small in the Bayesian analysis so as to force agreement.” (Berger and Sellke, 115)

*There’s something curious in assigning a high prior to the null H_{0}–thereby making it harder to reject (or find evidence against) H_{0}–and then justifying the assignment by saying it ensures that, if you do reject H_{0}, there will be a meaningful drop in the probability of H_{0}. What do you think of this?*

^^^^^^^^^^^^^^^^^^^^

**4. The real puzzle.**

I agree with J. Berger and Sellke that we should not “force agreement”. What’s puzzling to me is why it would be thought that an account that manages to evaluate how well or poorly tested hypotheses are–as significance tests can do–would want to measure up to an account that can only give a comparative assessment (be they likelihoods, odds ratios, or other) [ii]. From the perspective of the significance tester, the disagreements between (audited) P-values and posterior probabilities are an indictment, not of the P-value, but of the posterior, as well as the Bayes ratio leading to the disagreement (as even one or two Bayesians appear to be coming around to realize, e.g., Bernardo 2011, 58-9). Casella and R. Berger show that for sensible priors with one-sided tests, the P-value can be “reconciled” with the posterior, thereby giving an excellent retort to J. Berger and Sellke. Personally, I don’t see why an error statistician would wish to construe the P-value as how “believe worthy” or “bet worthy” statistical hypotheses are. Changing the interpretation may satisfy J. Berger’s call for “an agreement on numbers” (and never mind philosophies), but doing so precludes the proper functioning of P-values, confidence levels, and other error probabilities. And “**what is the intended interpretation of the prior, again?**” you might ask. Aside from the subjective construals (of betting and belief, or the like), the main one on offer (from the conventionalist Bayesians) is that the prior is undefined and is simply a way to compute a posterior. Never mind that they don’t agree on which to use. Your question should be: **“Please tell me: how does a posterior, based on an undefined prior used solely to compute a posterior, become “the” measure of evidence that we should aim to match?”**

^^^^^^^^^^^^^^^^

**5. (Crude) Benchmarks for taking into account sample size:**

Throwing out a few numbers may give sufficient warning to those inclined to misinterpret statistically significant differences at a given level but with varying sample sizes (please also search this blog [iii]). Using the familiar example of Normal testing with T+ :

H_{0}: μ ≤ 0 vs. H_{1}: μ > 0.

Let σ = 1, n = 25, so σ_{x} = (σ/√n).

*For this exercise, fix the sample mean M to be just significant at the .025 level for a 1-sided test, and vary the sample size n. In one case, n = 100, in a second, n = 1600. So, for simplicity, using the 2-standard deviation cut-off:*

m_{0} = 0 + 2(σ/√n).

With stat sig results from test T+, we worry about unwarranted inferences of the form: *μ > 0 + γ.*

*Some benchmarks:*

- The lower bound of a 50% confidence interval is 2(σ/√*n*). *So there’s quite lousy evidence that μ > 2(σ/√n).*

- The lower bound of the 93% confidence interval is .5(σ/√*n*). *So there’s decent evidence that μ > .5(σ/√n).*

- For *n* = 100, σ/√*n* = .1 (σ = 1); for *n* = 1600, σ/√*n* = .025.

- *Therefore, a .025 stat sig result is fairly good evidence that μ > .05, when n = 100.*

You’re picking up smaller and smaller discrepancies as *n* increases, when P is kept fixed. Taking the indicated discrepancy into account avoids erroneous construals and scotches any “paradox”.
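The benchmarks can be checked with a few lines of Python. This is only a sketch under the section’s own simplifications (σ = 1, the observed mean sitting exactly at the 2-standard-deviation cut-off); the function names are hypothetical, and the “level” computed is the confidence level at which a given γ becomes the one-sided lower bound.

```python
import math

def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def level_for_lower_bound(gamma, n, sigma=1.0, cutoff=2.0):
    """Confidence level at which `gamma` is the one-sided lower bound for mu,
    given a sample mean just at the cut-off: m0 = 0 + cutoff * sigma / sqrt(n)."""
    se = sigma / math.sqrt(n)
    m0 = 0 + cutoff * se
    return normal_cdf((m0 - gamma) / se)

# Same claim, mu > .05, assessed with the same P-value but different n
for n in (100, 1600):
    print(n, round(level_for_lower_bound(0.05, n), 2))
```

With n = 100 the claim μ > .05 corresponds to roughly the 93% lower bound; with n = 1600 it corresponds to only the 50% bound, even though the P-value is the same in both cases.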

^^^^^^^^^^

**6. “The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics” (Cousins, 2014)**

Robert Cousins, a HEP physicist willing to talk to philosophers and from whom I am learning about statistics in the Higgs discovery, illuminates the key issues, models and problems in his paper with that title. (The reference to Bernardo 2011 that I had in mind in Section 4 is cited on p. 26 of Cousins 2014).

^^^^^^^^^^^^^^^^^^^^^^^^^^

*7. July 20, 2014:* **There is a distinct issue here…** That “P-values overstate the evidence against the null” is often stated as an uncontroversial “given”. In calling it a “fallacy”, I was being provocative. However, in dubbing it a fallacy, some people assumed I was referring to one or another

The problem with using a P-value to assess evidence against a given null hypothesis H_{0} is that it tends to be smaller, even much smaller, than an apparently plausible posterior assessment of H_{0}, given data *x* (especially as n increases). The mismatch is avoided with a suitably tiny P-value, and that’s why many recommend this tactic. [iv] Yet I say the correct answer to the question in my (new) title is: “fallacious”. It’s one of those criticisms that have not been thought through carefully, but rather repeated based on some well-known articles.

[i] We assume the P-values are “audited”, that they are not merely “nominal”, but are “actual” P-values. Selection effects, cherry-picking and other biases would alter the error probing capacity of the tests, and thus the purported P-value would fail the audit.

[ii] Note too that the comparative assessment will vary depending on the “catchall”.

[iii] See for example:

Section 6.1 “fallacies of rejection“.

Slide #8 of Spanos lecture in our seminar Phil 6334.

[iv] So we can also put aside for the moment the issue of P-values not being conditional probabilities to begin with. We can also (I hope) distinguish another related issue, which requires a distinct post: using ratios of frequentist error probabilities, e.g., type 1 errors and power, to form a kind of “likelihood ratio” in a screening computation.

**References** (minimalist). A number of additional links are given in comments to my previous post.

Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of *p* values and evidence,” (with discussion). *J. Amer. Statist. Assoc.* **82**: 112–139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). *J. Amer. Statist. Assoc.* **82**: 106–111, 123–139.

*Blog posts:*

Comedy Hour at the Bayesian Retreat: P-values versus Posteriors.

Highly probable vs highly probed: Bayesian/ error statistical differences.



Some people say to me: “This kind of reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out) [2]. Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories.

**“Higgs Analysis and Statistical Flukes: part 2”**

Everyone was excited when the Higgs boson results were reported on July 4, 2012 indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of *excess events* of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post (part 1). It is an outsider’s angle on one small aspect of the statistical inferences involved. But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we [philosophers of statistics] should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels.

Here I keep close to an official report from ATLAS. Researchers define a “global signal strength” parameter “such that μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the SM Higgs boson signal in addition to the background” (where SM is the Standard Model). The statistical test may be framed as a one-sided test, where the test statistic (which is actually a ratio) records differences in the positive direction, in standard deviation (sigma) units. Reports such as

Pr(Test T would yield at least a 5 sigma excess; H_{0}: background only) = extremely low

are deduced from the sampling distribution of the test statistic, fortified with much cross-checking of results (e.g., by modeling and simulating relative frequencies of observed excesses generated with “Higgs signal +background” compared to background alone). The inferences, even the formal statistical ones, go beyond p-value reports. For instance, they involve setting lower and upper bounds such that values excluded are ruled out with high severity, to use my term. But the popular report is in terms of the observed 5 sigma excess in an overall test T, and that is mainly what I want to consider here.

*Error probabilities*

In a Neyman-Pearson setting, a cut-off c_{α} is chosen pre-data so that the probability of a type I error is low. In general,

Pr(d(X) > c_{α}; H_{0}) ≤ α

and in particular, alluding to an overall test T:

(1) Pr(Test T yields d(X) > 5 standard deviations; H_{0}) ≤ .0000003.

The test at the same time is designed to ensure a reasonably high probability of detecting global strength discrepancies of interest. (I always use “discrepancy” to refer to parameter magnitudes, to avoid confusion with observed differences).
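The number in (1) is just the upper tail area of a standard Normal beyond 5 standard deviations. A minimal sketch, assuming the test statistic is (approximately) standard Normal under H_{0}; that assumption and the function name `upper_tail` are mine, and the actual ATLAS computation is more involved.

```python
import math

def upper_tail(k_sigma):
    """One-sided tail area Pr(d(X) > k_sigma; H0), assuming d(X) ~ N(0, 1) under H0."""
    # erfc avoids the loss of precision that 1 - cdf would suffer this far out
    return 0.5 * math.erfc(k_sigma / math.sqrt(2))

print(upper_tail(5))  # a little under 3 in ten million
```

The same function can be evaluated at other cut-offs to see how quickly the tail area shrinks as the number of sigmas grows.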

[Notice these are not likelihoods.] Alternatively, researchers can report observed standard deviations (here, the sigmas), or equivalently, the associated observed statistical significance probability, *p*_{0}. In general,

Pr(P < p_{0}; H_{0}) ≤ p_{0}

and in particular,

(2) Pr(Test T yields P < .0000003; H_{0}) ≤ .0000003.

For test T to yield a “worse fit” with *H*_{0} (smaller p-value) due to background alone is sometimes called “a statistical fluke” or a “random fluke”, and the probability of so statistically significant a random fluke is ~0. With the March 2013 results, the 5 sigma difference has grown to 7 sigmas.

So probabilistic statements along the lines of (1) and (2) are standard. They allude to sampling distributions, either of test statistic *d*(**X**), or the P-value viewed as a random variable. They are scarcely illicit or prohibited. (I return to this in the last section of this post).
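To make “the P-value viewed as a random variable” concrete, here is a small simulation sketch. It assumes a one-sided test whose statistic is standard Normal under H_{0}; the function names, the seed, and the number of trials are hypothetical choices of mine.

```python
import math
import random

def p_value(d):
    """One-sided P-value for observed statistic d, assuming d(X) ~ N(0, 1) under H0."""
    return 0.5 * math.erfc(d / math.sqrt(2))

def fraction_below(p0, trials=100_000, seed=1):
    """Relative frequency with which P < p0 when the data are generated under H0."""
    rng = random.Random(seed)
    hits = sum(p_value(rng.gauss(0.0, 1.0)) < p0 for _ in range(trials))
    return hits / trials

for p0 in (0.05, 0.01):
    print(p0, fraction_below(p0))
```

Up to simulation noise, the relative frequency of P-values below p_{0} under H_{0} comes out close to p_{0} itself, which is what statements of form (2) assert. A threshold as small as .0000003 would need far more trials than this sketch uses.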

*An implicit principle of inference or evidence*

Admittedly, the move to taking the 5 sigma effect as evidence for a genuine effect (of the Higgs-like sort) results from an implicit principle of evidence that I have been calling the severity principle (SEV). Perhaps its weakest form applies to a statistical rejection or falsification of the null. (I will deliberately use a few different variations on statements that can be made.)

Data x_{0} from a test T provide evidence for rejecting H_{0} (just) to the extent that H_{0} would (very probably) have survived, were it a reasonably adequate description of the process generating the data (with respect to the question).

It is also captured by a general frequentist principle of evidence (FEV) (Mayo and Cox 2010), a variant on the general idea of severity (SEV) (EGEK 1996, Mayo and Spanos 2006, etc.).

The sampling distribution is computed, under the assumption that the production of observed results is similar to the “background alone”, with respect to relative frequencies of signal-like events. (Likewise for computations under hypothesized discrepancies.) The relationship between *H*_{0} and the probabilities of outcomes is an intimate one: the various statistical nulls live their lives to refer to aspects of general types of data generating procedures (for a taxonomy, see Cox 1958, 1977). “*H*_{0} is true” is a shorthand for a very long statement that

*Severity and the detachment of inferences*

The sampling distributions serve to give counterfactuals. In this case they tell us what it would be like, statistically, were the mechanism generating the observed signals similar to *H*_{0}.[i] While one would want to go on to consider the probability test T yields so statistically significant an excess under various alternatives to μ = 0, this suffices for the present discussion. Sampling distributions can be used to arrive at error probabilities that are relevant for understanding the capabilities of the test process, in relation to something we want to find out. *Since a relevant test statistic is a function of the data and quantities about which we want to learn, the associated sampling distribution is the key to inference*. (This is why bootstrap, and other types of, resampling works when one has a random sample from the process or population of interest.)
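As a rough illustration of the parenthetical remark on resampling, the sketch below approximates the sampling distribution of a sample mean by bootstrap resampling. Everything in it is an assumption for illustration: the simulated Normal data, the sample size of 200, the seeds, and the function name `bootstrap_means`.

```python
import random
import statistics

def bootstrap_means(sample, resamples=2000, seed=0):
    """Means of `resamples` bootstrap samples, each drawn with replacement
    from `sample` and of the same size as `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(resamples)]

# Hypothetical data: a random sample of 200 draws from N(0, 1)
data_rng = random.Random(42)
sample = [data_rng.gauss(0.0, 1.0) for _ in range(200)]

means = bootstrap_means(sample)
print("bootstrap standard error:", statistics.stdev(means))
print("s / sqrt(n):             ", statistics.stdev(sample) / len(sample) ** 0.5)
```

The spread of the resampled means should land near the textbook standard error s/√n, which is the sense in which the resampled relative frequencies stand in for the sampling distribution. This depends on the data being a random sample from the process of interest.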

The *severity principle*, put more generally:

Data from a test T[ii] provide good evidence for inferring H (just) to the extent that H passes severely with x_{0}, i.e., to the extent that H would (very probably) not have survived the test so well were H false.

(The severity principle can also be made out just in terms of relative frequencies, as with bootstrap re-sampling.) In this case, what is surviving is minimally the non-null. Regardless of the specification of a statistical inference, to assess the severity associated with a claim H requires considering H’s denial: together they exist the answers to a given question.
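For readers who want to see the principle with numbers, here is a minimal sketch under strong simplifying assumptions: a one-sided test of μ ≤ 0 against μ > 0, with a Normally distributed sample mean and known σ (the simple set-up of Mayo and Spanos 2006, not the actual Higgs analysis). Having observed a statistically significant mean, the severity for the claim μ > μ1 is the probability of a less impressive result than the one observed, were μ only μ1. The function names are mine.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def severity_greater(xbar, mu1, sigma, n):
    """SEV(mu > mu1) after observing sample mean xbar.

    Probability that the sample mean would have been no larger than
    the observed one, computed under mu = mu1.
    """
    standard_error = sigma / sqrt(n)
    return normal_cdf((xbar - mu1) / standard_error)


# Illustrative values: an observed excess of 5 standard errors.
xbar, sigma, n = 5.0, 1.0, 1

for mu1 in (0.0, 2.0, 4.0, 5.0):
    sev = severity_greater(xbar, mu1, sigma, n)
    print(f"SEV(mu > {mu1}) = {sev:.7f}")
```

With these made-up numbers the claim of some genuine discrepancy (μ > 0) passes with severity very close to 1, while the claim of a discrepancy as large as the observed excess (μ > 5) has severity of only 0.5. That is the sense in which the report of "poorly" warranted discrepancies blocks overstatement.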

Without making such a principle explicit, some critics assume the argument is all about the reported p-value. The inference actually **detached** from the evidence can be put in any number of ways, and no uniformity is to be expected or needed:

(3) There is strong evidence for H: a Higgs (or a Higgs-like) particle.

(3)’ They have experimentally demonstrated H: a Higgs (or Higgs-like) particle.

Or just, infer H.

Doubtless particle physicists would qualify these statements, but nothing turns on that. ((3) and (3)’ are a bit stronger than merely falsifying the null because certain properties of the particle must be shown. I leave this to one side.)

As always, the mere p-value is a pale reflection of the detailed information about the consistency of results that really fortifies the knowledge of a genuine effect. Nor is the precise improbability level what matters. We care about the inferences to real effects (and estimated discrepancies) that are warranted.

*Qualifying claims by how well they have been probed*

The inference is qualified by the statistical properties of the test, as in (1) and (2), but that does not prevent detaching (3). This much is shown: they are able to experimentally demonstrate the Higgs particle. They can take that much of the problem as solved and move on to other problems of discerning the properties of the particle, and much else that goes beyond our discussion*. There is obeisance to the strict fallibility of every empirical claim, but there is no probability assigned. Neither is there in day-to-day reasoning, nor in the bulk of scientific inferences, which are not formally statistical. Having inferred (3), granted, one may say informally, “so probably we have experimentally demonstrated the Higgs”, or “probably, the Higgs exists” (?). Or an informal use of “likely” might arise. But whatever these might mean in informal parlance, they are not formal mathematical probabilities. (As often argued on this blog, discussions on statistical philosophy must not confuse these.)

[We can however write, SEV(H) ~1]

The claim in (3) is approximate and limited–as are the vast majority of claims of empirical knowledge and inference–and, moreover, we can say in just what ways. It is recognized that subsequent data will add precision to the magnitudes estimated, and may eventually lead to new and even entirely revised interpretations of the known experimental effects, models and estimates. That is what cumulative knowledge is about. (I sometimes hear people assert, without argument, that modeled quantities, or parameters, used to describe data generating processes are “things in themselves” and are outside the realm of empirical inquiry. This is silly. Else we’d be reduced to knowing only tautologies and maybe isolated instances as to how “I seem to feel now,” attained through introspection.)

*Telling what’s true about significance levels*

So we grant the critic that something like the severity principle is needed to move from statistical information plus background (theoretical and empirical) to inferences about evidence and inference (and to what levels of approximation). It may be called lots of other things and framed in different ways, and the reader is free to experiment. What we should not grant the critic is any allegation that there should be, or invariably is, a link from a small observed significance level to a small posterior probability assignment to *H*_{0}. Worse, (1 – the p-value) is sometimes alleged to be the posterior probability accorded to the Standard Model itself! This is neither licensed nor wanted!

If critics (or the p-value police, as Wasserman called them) maintain that Higgs researchers are misinterpreting their significance levels, correct them with the probabilities in (1) and (2). If they say, it is patently obvious that Higgs researchers want to use the p-value as a posterior probability assignment to *H*_{0}, point out the more relevant and actually attainable [iii] inference that is detached in (3). If they persist that what is really, really wanted is a posterior probability assignment to the inference about the Higgs in (3), ask why? As a formal posterior probability it would require a prior probability on all hypotheses that could explain the data. That would include not just

Degrees of belief will not do. Many scientists perhaps had (and have) strong beliefs in the Standard Model before the big collider experiments—given its perfect predictive success. Others may believe (and fervently wish) that it will break down somewhere (showing supersymmetry or whatnot); a major goal of inquiry is learning about viable rivals and how they may be triggered and probed. Research requires an open world not a closed one with all possibilities trotted out and weighed by current beliefs. [v] We need to point up what has not yet been well probed which, by the way, is very different from saying of a theory that it is “not yet probable”.

*Those prohibited phrases*

One may wish to return to some of the condemned phrases of particular physics reports. Take,

“There is less than a one in a million chance that their results are a statistical fluke”.

This is not to assign a probability to the null, just one of many ways (perhaps not the best) of putting claims about the sampling distribution: The statistical null asserts that Ho: background alone adequately describes the process.

Ho does not assert the results are a statistical fluke, but it tells us what we need to determine the probability of observed results “under Ho”. In particular, consider all outcomes in the sample space that are further from the null prediction than the observed, in terms of p-values {x: p < po}. Even when Ho is true, such “signal like” outcomes may occur. They are po level flukes. Were such flukes generated even with moderate frequency under Ho, they would not be evidence against Ho. But in this case, such flukes occur a teeny tiny proportion of the time. Then SEV enters: if we are regularly able to generate such teeny tiny p-values, we have evidence of a genuine discrepancy from Ho.
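For a rough sense of how "teeny tiny" that proportion is, here is a small sketch computing the upper tail area beyond 5 standard deviations. It assumes the test statistic is standard Normal under Ho, which is a simplification of the real analysis; the function name `upper_tail` is mine.

```python
from math import erfc, sqrt


def upper_tail(z):
    """P(Z >= z) for a standard Normal Z, via the complementary error function."""
    return 0.5 * erfc(z / sqrt(2.0))


p_one_sided = upper_tail(5.0)    # excess at least 5 sigma above the null prediction
p_two_sided = 2.0 * p_one_sided  # at least 5 sigma away in either direction

print(f"one-sided tail area beyond 5 sigma: {p_one_sided:.3e}")
print(f"two-sided tail area beyond 5 sigma: {p_two_sided:.3e}")
print(f"roughly 1 in {1.0 / p_one_sided:,.0f} under Ho")
```

Either way the figure is below one in a million, and it is a statement about how often such signal-like outcomes would be generated under Ho, not a probability assigned to Ho.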

I am repeating myself, I realize, in the hope that at least one phrasing will drive the point home. Nor is it even the improbability that substantiates this; it is the fact that an extraordinary set of coincidences would have to have occurred again and again. To nevertheless retain Ho as the source of the data would block learning. (Moreover, they know that if some horrible systematic mistake had been made, it would be detected in later data analyses.)

I will not deny that there have been misinterpretations of p-values, but if a researcher has just described performing a statistical significance test, it would be “ungenerous” to twist probabilistic assertions into posterior probabilities. It would be a kind of “confirmation bias” whereby one insists on finding one sentence among very many that could conceivably be misinterpreted Bayesianly.

*Triggering, indicating, inferring*

As always, the error statistical philosopher would distinguish different questions at multiple stages of the inquiry. The aim of many preliminary steps is “behavioristic” and performance oriented: the goal being to control error rates on the way toward finding excess events or bumps of interest.

I hope it is (more or less) clear that burgundy is new; black is old. If interested: *See statistical flukes (part 3)*

The original posts of parts 1 and 2 had around 30 comments each; you might want to look at them:

Part 1: http://errorstatistics.com/2013/03/17/update-on-higgs-data-analysis-statistical-flukes-1/

Part 2: http://errorstatistics.com/2013/03/27/higgs-analysis-and-statistical-flukes-part-2/

*Fisher insisted that to assert a phenomenon is experimentally demonstrable: [W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher, Design of Experiments, 1947, 14).

New Notes

[1] I plan to do some new work in this arena soon, so I’ll be glad to have comments.

[2] I have often noted that there are other times where we are trying to find evidence to support a previously held position.

REFERENCES (from March, 2013 post):

ATLAS Collaboration (November 14, 2012), Atlas Note: “Updated ATLAS results on the signal strength of the Higgs-like boson for decays into WW and heavy fermion final states”, ATLAS-CONF-2012-162. http://cds.cern.ch/record/1494183/files/ATLAS-CONF-2012-162.pdf

Cox, D.R. (1958), “Some Problems Connected with Statistical Inference,” *Annals of Mathematical Statistics*, 29: 357–72.

Cox, D.R. (1977), “The Role of Significance Tests (with Discussion),” *Scandinavian Journal of Statistics*, 4: 49–70.

Mayo, D.G. (1996), *Error and the Growth of Experimental Knowledge*, University of Chicago Press, Chicago.

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 247-275.

Mayo, D.G., and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” *British Journal for the Philosophy of Science*, 57: 323–357.

___________

Original notes:

[i] This is a bit stronger than merely falsifying the null here, because certain features of the particle discerned must also be shown. I leave details to one side.

[ii] Which almost always refers to a set of tests, not just one.

[iii] I sense that some Bayesians imagine P(H) is more “hedged” than to actually infer (3). But the relevant hedging, the type we can actually attain, is given by an assessment of severity or corroboration or the like. Background enters via a repertoire of information about experimental designs, data analytic techniques, mistakes and flaws to be wary of, and a host of theories and indications about which aspects have/have not been severely probed. Many background claims enter to substantiate the error probabilities; others do not alter them.

[iv] In aspects of the modeling, researchers make use of known relative frequencies of events (e.g., rates of types of collisions) that lead to legitimate, empirically based, frequentist “priors” if one wants to call them that.

[v] After sending out the letter, prompted by Lindley, O’Hagan wrote up a synthesis http://errorstatistics.com/2012/08/25/did-higgs-physicists-miss-an-opportunity-by-not-consulting-more-with-statisticians/


July 4, 2014 was the two year anniversary of the Higgs boson discovery. As the world was celebrating the “5 sigma!” announcement, and we were reading about the statistical aspects of this major accomplishment, I was aghast to be emailed a letter, purportedly instigated by Bayesian Dennis Lindley, through Tony O’Hagan (to the ISBA). Lindley, according to this letter, wanted to know:

“Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?”

Fairly sure it was a joke, I posted it on my “Rejected Posts” blog for a bit until it checked out [1]. (See O’Hagan’s “Digest and Discussion”)

Then, as details of the statistical analysis trickled down to the media, the P-value police (Wasserman, see (2)) came out in full force to examine if reports by journalists and scientists could in any way or stretch of the imagination be seen to have misinterpreted the sigma levels as posterior probability assignments to the various models and claims. The HEP (High Energy Physics) community had been painstaking in their communication of the results, but the P-bashers insisted on transforming the intended conditional….(I’ll come back to this.)

As for the HEP researchers, a central interest now is to explore any and all leads in the data that would point to physics beyond the Standard Model (BSM). The Higgs is just coming out to be too “perfectly plain vanilla,” and they’ve been unable to reject an SM null for years (3) (more on this later). So on this two-year anniversary, I’ll reblog a few of the Higgs posts, with some updated remarks—beginning with the first one below.

I suppose[ed] this was somewhat of a joke from the ISBA, prompted by Dennis Lindley, but as I [now] accord the actual extent of jokiness to be only ~10%, I’m sharing it on the blog [i]. Lindley (according to O’Hagan) wonders why scientists require so high a level of statistical significance before claiming to have evidence of a Higgs boson. It is asked: “Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?”

*Bad science? * I’d really like to understand what these representatives from the ISBA would recommend, if there is even a shred of seriousness here (or is Lindley just peeved that significance levels are getting so much press in connection with so important a discovery in particle physics?)

Well, read the letter and see what you think.

On Jul 10, 2012, at 9:46 PM, ISBA Webmaster wrote:

Dear Bayesians,

A question from Dennis Lindley prompts me to consult this list in search of answers.

We’ve heard a lot about the Higgs boson. The news reports say that the LHC needed convincing evidence before they would announce that a particle had been found that looks like (in the sense of having some of the right characteristics of) the elusive Higgs boson. Specifically, the news referred to a confidence interval with 5-sigma limits.

Now this appears to correspond to a frequentist significance test with an extreme significance level. Five standard deviations, assuming normality, means a p-value of around 0.0000005. A number of questions spring to mind.

1. Why such an extreme evidence requirement? We know from a Bayesian perspective that this only makes sense if (a) the existence of the Higgs boson (or some other particle sharing some of its properties) has extremely small prior probability and/or (b) the consequences of erroneously announcing its discovery are dire in the extreme. Neither seems to be the case, so why 5-sigma?

2. Rather than ad hoc justification of a p-value, it is of course better to do a proper Bayesian analysis. Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?

3. We know that given enough data it is nearly always possible for a significance test to reject the null hypothesis at arbitrarily low p-values, simply because the parameter will never be exactly equal to its null value. And apparently the LHC has accumulated a very large quantity of data. So could even this extreme p-value be illusory?

If anyone has any answers to these or related questions, I’d be interested to know and will be sure to pass them on to Dennis.

Regards,

Tony

—-

Professor A O’Hagan

Email: a.ohagan@sheffield.ac.uk

Department of Probability and Statistics

University of Sheffield

So given that the Higgs boson does not have such an extremely small prior probability, a proper Bayesian analysis would have enabled evidence of the Higgs long before attaining such an “extreme evidence requirement”. Why has no one tried to explain to these scientists how, with just a little Bayesian analysis, they might have been done last year or years ago? I take it the Bayesian would also enjoy the simplicity and freedom of not having to adjust for “the Look Elsewhere Effect” (LEE [ii]).

Let’s see if there’s a serious follow-up.[iii]

[i] bringing it down from my “Msc Kvetching page” where I’d put it last night.

[ii] For a discussion of how the error statistical philosophy avoids the classic criticisms of significance tests, see Mayo & Spanos (2011) ERROR STATISTICS. Other articles may be found on the link to my publication page.

[iii] O’Hagan informed me of several replies to his letter at the following: http://bayesian.org/forums/news/3648

*****************************************************

(1) There’s scarce need for my “Rejected Posts” blog now that renegade thoughts can go on “twitter” (@learnfromerror), but I’ll keep it around for later.

(2) The Higgs Boson and the p-value Police: http://normaldeviate.wordpress.com/2012/07/11/the-higgs-boson-and-the-p-value-police/

(3) The logic in this case is especially interesting. Each failure to reject a null of this type informs us about the variant of BSM ruled out. (I’ll check with Robert Cousins that I’ve put this correctly. Update: He says that I have.) Here’s a link to Cousins’ recent paper on the Higgs and foundations of statistics: http://arxiv.org/abs/1310.3791.


**Winner of June 2014 Palindrome Contest: First Second\* Time Winner!**

\*Her April win is here.

**Palindrome:**

**Parsec? I overfit omen as Elba sung “I err on! Oh, honor reign!” Usable, sane motif revoices rap.**

**The requirement:** A palindrome with Elba plus overfit. (The optional second word: “average” was not needed to win.)

**Bio:**

Lori Wike is principal bassoonist of the Utah Symphony and is on the faculty of the University of Utah and Westminster College. She holds a Bachelor of Music degree from the Eastman School of Music and a Master of Arts degree in Comparative Literature from UC-Irvine.

**Statement**:

I’m thrilled to be a second-time winner of the palindrome contest and my love of book collecting overrides any guilty feelings I may have about winning twice! Here’s a fun picture of me in the midst of polygonal fracturing from my June escapades. Sadly, I don’t think I can work “polygonal” into a palindrome.

I’ve been fascinated by palindromes ever since first learning about them as a child in a Martin Gardner book. I started writing palindromes several years ago when my interest in the form was rekindled by reading about the constraint-based techniques of several Oulipo writers. While I love all sorts of wordplay and puzzles, and I occasionally write some word-unit palindromes as well, I find writing the traditional letter-unit palindromes to be the most satisfying challenge, due to the extreme formal constraint of exact letter reversal–which is made even more fun in a contest like this where one has to include specific words in the palindrome. I also enjoy writing palindromes about specific themes (Poe’s Raven, Oedipus Rex, Verdi’s Aida) and I have plans to write a very long palindrome about Proust one of these days.

**Book Choice**:

*Dicing with Death: Chance, Risk and Health* (Stephen Senn 2003, Cambridge: Cambridge University Press)


The article in the Chronicle of Higher Education also gets credit for its title: “Replication Crisis in Psychology Research Turns Ugly and Odd”. I’ll likely write this in installments… (2nd, 3rd, 4th)

^^^^^^^^^^^^^^^

The Guardian article answers yes to the question “Do ‘hard’ sciences hold the solution…“:

Psychology is evolving faster than ever. For decades now, many areas in psychology have relied on what academics call “questionable research practices” – a comfortable euphemism for types of malpractice that distort science but which fall short of the blackest of frauds, fabricating data.

But now a new generation of psychologists is fed up with this game. Questionable research practices aren’t just being seen as questionable – they are being increasingly recognised for what they are: soft fraud. In fact, “soft” may be an understatement. What would your neighbours say if you told them you got published in a prestigious academic journal because you cherry-picked your results to tell a neat story? How would they feel if you admitted that you refused to share your data with other researchers out of fear they might use it to undermine your conclusions? Would your neighbours still see you as an honest scientist – a person whose research and salary deserves to be funded by their taxes?

For the first time in history, we are seeing a co-ordinated effort to make psychology more robust, repeatable, and transparent.

“Soft fraud”? (Is this like “white collar” fraud?) Is it possible that holding social psych up as a genuine replicable science is, ironically, creating soft frauds too readily?

*Or would it be all to the good if the result is to so label large portions of the (non-trivial) results of social psychology?*

The sentiment in the Guardian article is that the replication program in psych is just doing what is taken for granted in other sciences; it shows psych is maturing, it’s getting *better and better all the time* …so long as the replication movement continues. Yes? [0]

^^^^^^^^

It’s hard to entirely dismiss the concerns of the pushback, dubbed in some quarters as “Repligate”. Even in this contrarian mode, you might sympathize with “those who fear that psychology’s growing replication movement, which aims to challenge what some critics see as a tsunami of suspicious science, is more destructive than corrective” (e.g., Professor Wilson, at U Va) while at the same time rejecting their dismissal of the seriousness of the problem of false positives in psych. The problem *is* serious, but there may be built-in obstacles to fixing things by the current route. From the Chronicle:

Still, Mr. Wilson was polite. Daniel Gilbert, less so. Mr. Gilbert, a professor of psychology at Harvard University, … wrote that certain so-called replicators are “shameless little bullies” and “second stringers” who engage in tactics “out of Senator Joe McCarthy’s playbook” (he later took back the word “little,” writing that he didn’t know the size of the researchers involved).

Wow. Let’s read a bit more:

Scrutiny From the Replicators

What got Mr. Gilbert so incensed was the treatment of Simone Schnall, a senior lecturer at the University of Cambridge, whose 2008 paper on cleanliness and morality was selected for replication in a special issue of the journal *Social Psychology*. … In one experiment, Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. Ms. Schnall wanted to discover whether prompting—or priming, in psych parlance—people with the concept of cleanliness would make them less judgmental. … These studies fit into a relatively new field known as embodied cognition, which examines how one’s environment and body affect one’s feelings and thoughts. …

For instance, political extremists might literally be less capable of discerning shades of grey than political moderates—or so Matt Motyl thought until his results disappeared. Now he works actively in the replication movement.[1]

Links are here.

7/1: By the way, since Schnall’s research was testing “embodied cognition” why wouldn’t they have subjects involved in actual cleansing activities rather than have them unscramble words about cleanliness?

^^^^^^^^^^

Another irony enters: some of the people working on the replication project in social psych are the same people who hypothesize that a large part of the blame for lack of replication may be traced to the reward structure, to incentives to publish surprising and sexy studies, and to an overly flexible methodology opening the door to promiscuous QRPs (you know: Questionable Research Practices.) Call this the “rewards and flexibility” hypothesis. If the rewards/flex hypothesis is correct, as is quite plausible, then wouldn’t it follow that the same incentives are operative in the new psych replication movement? [2]

A skeptic of the movement in psychology could well ask how the replication can be judged sounder than the original studies? When RCTs fail to replicate observational studies, the presumption is that RCTs would have found the effect, were it genuine. That’s why it’s taken as an indictment of the observational study. But here, one could argue, it’s just another study, not obviously one that *corrects* the earlier. The question some have asked, “Who will replicate the replicators?” is not entirely without merit. Triangulation for purposes of correction, I say, is what’s really needed. [3]

Daniel Kahneman, who first called for the “daisy chain” (after the Stapel scandal), likely hadn’t anticipated the tsunami he was about to unleash.[4]

Daniel Kahneman, a Nobel Prize winner who has tried to serve as a sort of a peace broker, recently offered some rules of the road for replications, including keeping a record of the correspondence between the original researcher and the replicator, as was done in the Schnall case. Mr. Kahneman argues that such a procedure is important because there is “a lot of passion and a lot of ego in scientists’ lives, reputations matter, and feelings are easily bruised.”

That’s undoubtedly true, and taking glee in someone else’s apparent misstep is unseemly. Yet no amount of politeness is going to soften the revelation that a published, publicized finding is bogus. Feelings may very well get bruised, reputations tarnished, careers trashed. That’s a shame, but while being nice is important, so is being right.

Is the replication movement getting psych closer to “being right”? That is the question. What if inferences from priming studies and “embodied cognition” really *are* questionable? What if the hypothesized effects are incapable of being turned into replicable science?

^^^^^^^^^

The sentiment voiced in the Guardian bristles at the thought; there is pushback even to Kahneman’s apparently civil “rules of the road”:

For many psychologists, the reputational damage [from a failed replication]… is grave – so grave that they believe we should limit the freedom of researchers to pursue replications. In a recent open letter, Nobel laureate Daniel Kahneman called for a new rule in which replication attempts should be “prohibited” unless the researchers conducting the replication consult beforehand with the authors of the original work. Kahneman says, “Authors, whose work and reputation are at stake, should have the right to participate as advisers in the replication of their research.” Why? Because method sections published by psychology journals are generally too vague to provide a recipe that can be repeated by others. Kahneman argues that successfully reproducing original effects could depend on seemingly irrelevant factors – hidden secrets that only the original authors would know. “For example, experimental instructions are commonly paraphrased in the methods section, although their wording and even the font in which they are printed are known to be significant.”

“Hidden secrets”? This was a remark sure to enrage those who take psych measurements as (at least potentially) akin to measuring the Hubble constant:

If this doesn’t sound very scientific to you, you’re not alone. For many psychologists, Kahneman’s cure is worse than the disease. Dr Andrew Wilson from Leeds Metropolitan University points out that if the problem with replication in psychology is vague method sections then the logical solution – not surprisingly – is to publish *detailed* method sections. In a lively response to Kahneman, Wilson rejects the suggestion of new regulations: “If you can’t stand the replication heat, get out of the empirical kitchen because publishing your work means you think it’s ready for prime time, and if other people can’t make it work based on your published methods then that’s your problem and not theirs.”

Prime time for priming research in social psych?

Read the rest of the Guardian article. Second installment later on…maybe….

**What do readers think?**

^^^^^^^^^^^^^^

Naturally the issues that interest me the most are statistical-methodological. Some of the methodology and meta-methodology of the replication effort is apparently being developed hand-in-hand with the effort itself—that makes it all the more interesting, while also potentially risky.

The replicationist’s question of methodology, as I understand it, is alleged to be what we might call “purely statistical”. It is not: would the initial positive results warrant the psychological hypothesis, were the statistics unproblematic? The presumption from the start was that the answer to this question is yes. In the case of the controversial Schnall study, the question wasn’t: can the hypotheses about cleanliness and morality be well-tested or well probed by finding statistical associations between unscrambling cleanliness words and “being less judgmental” about things like eating your dog if he’s run over? At least not directly. In other words, the statistical-substantive link was not at issue. The question is limited to: do we get the statistically significant effect in a replication of the initial study, presumably one with high power to detect the effects at issue? So, for the moment, I too will retain that as the sole issue around which the replication attempts revolve.

Checking statistical assumptions is, of course, a part of the pure statistics question, since the P-value and other measures depend on assumptions being met at least approximately.

The replication team assigned to Schnall (U of Cambridge) reported results apparently inconsistent with the positive ones she had obtained. Schnall shares her experiences in “Further Thoughts on Replications, Ceiling Effects and Bullying” and “The Replication Authors’ Rejoinder”: http://www.psychol.cam.ac.uk/cece/blog

The replication authors responded to my commentary in a rejoinder. It is entitled “Hunting for Artifacts: The Perils of Dismissing Inconsistent Replication Results.” In it, they accuse me of “criticizing after the results are known,” or CARKing, as Nosek and Lakens (2014) call it in their editorial. In the interest of “increasing the credibility of published results” interpretation of data evidently needs to be discouraged at all costs, which is why the special issue editors decided to omit any independent peer review of the results of all replication papers. (Schnall)

Perhaps her criticisms are off the mark, and in no way discount the failed replication (I haven’t read them), but CARKing? Data and model checking are intended to take place post-data. So the post-data aspect of a critique scarcely renders it illicit. The statistical fraud-busting of a Smeesters or a Jens Forster was based entirely on post-data criticisms. So it would be *ironic* if, in the midst of defending efforts to promote scientific credentials, they inadvertently labeled post-data criticisms as questionable.

^^^^^^^^^^^^^^^^^^^^^^^^^^^

Uri Simonsohn [5] at “Data Colada” discusses, specifically, the objections raised by Simone Schnall (2nd installment), and the responses by the authors who failed to replicate her work: Brent Donnellan, Felix Cheung and David Johnson.

Simonsohn does not reject out of hand Schnall’s allegation that the lack of replication is explained away (e.g., by a “ceiling effect”). (In fact, he has elsewhere discussed a case that was rightfully absolved thereby [6].) Simonsohn provides statistical grounds for denying a ceiling effect is to be blamed in Schnall’s case. However, he also agrees with Schnall’s discounting the replicators’ reaction to the charge of a ceiling effect by simply lopping off the most extreme results.

In their rejoinder (.pdf), the replicators counter by dropping all observations at the ceiling and showing the results are still not significant.

I don’t think that’s right either. (Data Colada)

Since the replicators here have the burden of proof of evidence, the statistical problems with their *ad hoc* retort to Schnall are grounds for concern, or should be.

http://datacolada.org/2014/06/04/23-ceiling-effects-and-replications/

What follows from this? What follows is that the analysis of the evidential import of failed replications in this field is an unsettled business. Despite the best of intentions of the new replicationists, there are grounds for questioning if the meta-methodology is ready for the heavy burden being placed on it. I’m not saying that facets for the necessary methodology aren’t out there, but that the pieces haven’t been fully assembled ahead of time. Until they are, the basis for scrutinizing failed (and successful) replications will remain in flux.

^^^^^^^^^^

Final irony. If the replication researchers claim they haven’t caught on to any of the problems or paradoxes I have intimated for their enterprise, let me end with one more… No, I’ll save it for installment 4.

^^^^^^^^^^

Statistical significance testers in psychology (and other areas) often maintain there is no information, or no proper inference, to be obtained from statistically insignificant (negative) results. This, despite power analyst Jacob Cohen toiling amongst them for years. Maybe they’ve been misled by their own constructed animal, the so-called NHST (no need to look it up, if you don’t already know).

*The irony is that much replication analysis turns on interpreting non statistically significant results!*
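To see how a statistically insignificant result can still carry information, here is a minimal sketch for a one-sided Normal test of μ ≤ 0 against μ > 0 with known σ. The setup, the numbers, and the function names (`power`, `severity_of_upper_bound`) are illustrative assumptions, not taken from any of the replication studies. The idea is that a negative result warrants "μ is no greater than μ1" only for those discrepancies μ1 that the test had a high probability of detecting.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal: mean 0, sd 1


def power(mu1: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """P(test rejects mu <= 0 at level alpha) when the true mean is mu1."""
    se = sigma / sqrt(n)
    cutoff = STD_NORMAL.inv_cdf(1 - alpha) * se  # reject when the sample mean exceeds this
    return 1 - STD_NORMAL.cdf((cutoff - mu1) / se)


def severity_of_upper_bound(xbar: float, mu1: float, sigma: float, n: int) -> float:
    """P(sample mean would exceed the observed xbar) if the true mean were mu1.

    A high value means the data would probably have come out larger were mu
    as big as mu1, so the claim "mu <= mu1" has been well probed.
    """
    se = sigma / sqrt(n)
    return 1 - STD_NORMAL.cdf((xbar - mu1) / se)


# Hypothetical study: n = 100, sigma = 1, observed mean 0.1 (z = 1, not significant).
for mu1 in (0.0, 0.1, 0.3):
    print(mu1,
          round(power(mu1, 1.0, 100), 3),
          round(severity_of_upper_bound(0.1, mu1, 1.0, 100), 3))
```

With these made-up numbers, the same non-significant result gives poor grounds for ruling out any positive effect at all, but good grounds for ruling out an effect as large as 0.3.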

One of my first blogposts talks about interpreting negative results and I’ve been publishing on this for donkey’s years [7]. Here are some posts for your Saturday night reading:

http://errorstatistics.com/2011/11/09/neymans-nursery-2-power-and-severity-continuation-of-oct-22-post/

Some numerical examples:

^^^^^^

[0] Unsurprisingly, replicationistas in psych are finding well-known results from experimental psych to be replicable. Interestingly, similar results are found in experimental economics, dubbed “experimental exhibits”. Expereconomists recognize that rival interpretations of the exhibits are still open to debate.

[1] In Nuzzo’s article: “For a brief moment in 2010, Matt Motyl was on the brink of scientific glory: he had discovered that extremists quite literally see the world in black and white”.

(Glory, I tell you!)

[2] Some of the results are now published in Social Psychology. Perhaps it was not such an exaggeration to suggest, in an earlier post, that “non-significant results are the new significant results”. At the time I didn’t know the details of the replication project; I was just reacting to graduate students presenting this as the basis for a philosophical position, when philosophers should have been performing a stringent methodological critique.

[3] By contrast, statistical fraudbusting and statistical forensics have some rigorous standards that are hard to evade, e.g., recently Jens Forster.

[4] In Kahneman’s initial call (Oct, 2012) “He suggested setting up a ‘daisy chain’ of replication, in which each lab would propose a priming study that another lab would attempt to replicate. Moreover, he wanted labs to select work they considered to be robust, and to have the lab that performed the original study help the replicating lab vet its procedure.”

[5] Simonsohn is always churning out the most intriguing and important statistical analyses in social psychology. The field needs more like him.

[6] For an excellent discussion of a case that *is* absolved from non-replication by appealing to the ceiling effect see http://datacolada.org/2014/06/27/24-p-curve-vs-excessive-significance-test/.

[7] e.g., Mayo 1985, 1988, to see how we talked about statistics in risk assessment philosophy back then.

Filed under: junk science, science communication, Statistical fraudbusting, Statistics

One of the world’s leading economists, INET Oxford’s Prof. Sir David Hendry received a unique award from the Economic and Social Research Council (ESRC)…

Commenting on the award, Torbjørn Hægeland, Director of Research at Statistics Norway, said: ‘Professor David Hendry’s contributions have exerted a great influence on the way we do practical econometric work. In particular, the automatic model selection programme, Autometrics, is used extensively to guide improved empirical modelling, especially when there are structural shifts, avoiding wasted time on incorrect formulations so our economists can focus on analysis and specification.’

You can read about the award here.

Sir David Hendry’s contribution to RMM Vol. 2, 2011: “Empirical Economic Model Discovery and Theory Evaluation”:

**Abstract:**
*Economies are so high dimensional and non-constant that many features of models cannot be derived by prior reasoning, intrinsically involving empirical discovery and requiring theory evaluation. Despite important differences, discovery and evaluation in economics are similar to those of science. Fitting a pre-specified equation limits discovery, but automatic methods can formulate much more general initial models with many possible variables, long lag lengths and non-linearities, allowing for outliers, data contamination, and parameter shifts; then select congruent parsimonious-encompassing models even with more candidate variables than observations, while embedding the theory; finally rigorously evaluate selected models to ascertain their viability.*

http://www.rmm-journal.de/downloads/Article_Hendry.pdf

His work grows out of a unique philosophical conception of the relationship between data and theory. [3]

*BOOKS*

(new) *Empirical Model Discovery and Theory Evaluation: Automatic Selection Methods in Econometrics (Arne Ryde Memorial Lectures)*

*Hendry, D.F. and B. Nielsen (2007)*, Econometric Modeling: A Likelihood Approach. Princeton University Press.

*J. Campos, N.R. Ericsson and D.F. Hendry (2004)* **General to Specific Modelling**. Edward Elgar. Forthcoming.

*Clements, M.P. and D.F. Hendry (2002)*. **A Companion to Economic Forecasting**. Oxford: Blackwell Publishers. (ISBN 0631215697)

*Hendry, D.F. and N.R. Ericsson (2001)* **Understanding Economic Forecasts** Cambridge, Mass.: MIT Press.

*Doornik, J.A. and D.F. Hendry (2001)*. **Interactive Monte Carlo Experimentation in Econometrics Using PcNaive** London: Timberlake Consultants Press.

*Doornik, J.A. and D.F. Hendry (2001)*. **GiveWin: An Interface to Empirical Modelling** (2nd edition), London: Timberlake Consultants Press. (ISBN 0-9533394-3-2) (1st ed. 1996, 2nd ed. 1999)

*Hendry, D.F. and J.A. Doornik (2001)*. **Empirical Econometric Modelling Using PcGive** Volumes I, II and III London: Timberlake Consultants Press. (Vol I: 2nd ed. 1999, 1st ed. 1996; version 8: 1994, version 7: 1992) (Vol II: 2nd ed. 1999, 1st ed. 1997; version 8: 1994)

*D.F. Hendry and H-M. Krolzig (2001)*. **Automatic Econometric Model Selection** London: Timberlake Consultants Press.

*Hendry, D.F. (2001)* **Econometrics: Alchemy or Science?** 2nd Edition. Oxford: Oxford University Press. (ISBN 0-19-829354-2)

*W. A. Barnett, D. F. Hendry, S. Hylleberg, T. Teräsvirta, D. Tjøstheim, and A. Würtz (eds) (2000)*. **Nonlinear Econometric Modeling in Time Series**. Proceedings of the Eleventh International Symposium in Economic Theory Cambridge: Cambridge University Press.

*Clements, M.P. and D.F. Hendry (1999)*. **Forecasting Non-stationary Economic Time Series**. Cambridge, Mass.: MIT Press.

*Clements, M.P. and D.F. Hendry (1998)*. **Forecasting Economic Time Series**. Cambridge: Cambridge University Press. (ISBN 0-521-634806)

*Hendry, D.F. and M.S. Morgan (1995)*. **The Foundations of Econometric Analysis**. Cambridge: Cambridge University Press. (ISBN 0-521-38043-X)

Some general comments by Clark Glymour, in relation to Hendry’s paper, are below:

[1] Professor Hendry is Director of the Programme in Economic Modelling at the Institute for New Economic Thinking at the Oxford Martin School.

[2] Hendry, D. (2011) “Empirical Economic Model Discovery and Theory Evaluation”, in *Rationality, Markets and Morals*, Volume 2, Special Topic: *Statistical Science and Philosophy of Science,* (D. G. Mayo, A. Spanos & K. W. Staley (guest eds.)): 115-145.

[3] Hendry was Aris Spanos’ dissertation advisor at the LSE; their work has interconnected over the years.

Filed under: David Hendry, StatSci meets PhilSci Tagged: David Hendry

**May 2014**

(5/1) Putting the brakes on the breakthrough: An informal look at the argument for the Likelihood Principle

(5/3) You can only become coherent by ‘converting’ non-Bayesianly

(5/6) Winner of April Palindrome contest: Lori Wike

(5/7) A. Spanos: Talking back to the critics using error statistics (Phil6334)

(5/10) Who ya gonna call for statistical Fraudbusting? R.A. Fisher, P-values, and error statistics (again)

(5/15) Scientism and Statisticism: a conference* (i)

(5/17) Deconstructing Andrew Gelman: “A Bayesian wants everybody else to be a non-Bayesian.”

(5/20) The Science Wars & the Statistics Wars: More from the Scientism workshop

(5/25) Blog Table of Contents: March and April 2014

(5/27) Allan Birnbaum, Philosophical Error Statistician: 27 May 1923 – 1 July 1976

(5/31) What have we learned from the Anil Potti training and test data frameworks? Part 1 (draft 2)

Filed under: blog contents, Metablog, Statistics

[The papers in this collection] give examples of problems which are well-suited to being tackled using such methods, but one must not lose sight of the merits of having multiple different strategies and tools in one’s inferential armory. (Hand [1])

…. But I have to ask, is the emphasis on ‘Bayesian’ necessary? That is, do we need further demonstrations aimed at promoting the merits of Bayesian methods? … The examples in this special issue were selected, firstly by the authors, who decided what to write about, and then, secondly, by the editors, in deciding the extent to which the articles conformed to their desiderata of being Bayesian success stories: that they ‘present actual data processing stories where a non-Bayesian solution would have failed or produced sub-optimal results.’ In a way I think this is unfortunate. I am certainly convinced of the power of Bayesian inference for tackling many problems, but the generality and power of the method is not really demonstrated by a collection specifically selected on the grounds that this approach works and others fail. To take just one example, choosing problems which would be difficult to attack using the Neyman-Pearson hypothesis testing strategy would not be a convincing demonstration of a weakness of that approach if those problems lay outside the class that that approach was designed to attack.

Hand goes on to make a philosophical assumption that might well be questioned by Bayesians:

One of the basic premises of science is that you must not select the data points which support your theory, discarding those which do not. In fact, on the contrary, one should test one’s theory by challenging it with tough problems or new observations. (This contrasts with political party rallies, where the candidates speak to a cheering audience of those who already support them.) So the fact that the articles in this collection provide wonderful stories illustrating the power of modern Bayesian methods is rather tarnished by the one-sidedness of the story.

This, of course, is the philosophical standpoint reflected in a severe or stringent testing philosophy, and it’s one that I heartily endorse. But it may be a mistake to assume it is universal: there’s an entirely distinct conception of confirmation as gathering data in order to support a position already held [2]. *I don’t mean this at all facetiously.* On the contrary, to suppose the editors of this issue share the testing conception is to implicitly suggest they are engaged in an exercise with questionable scientific standards (“tarnished by the one-sidedness of the story”). Recall my post on “who is allowed to cheat” and optional stopping with I.J. Good? It took some pondering for him to admit a different way of cashing out “allowed to cheat”. Likewise, wearing Bayesian glasses lets me take various Bayesian remarks as other than disingenuous. Hand goes on to offer a tantalizing suggestion:

Or perhaps, if one is going to have a collection of papers demonstrating the power of one particular inferential school, then, in the journalist spirit of balanced reporting, we should invite a series of similar issue containing articles which present actual data processing stories where a nonfrequentist / non-likelihood / non-[fill in your favourite school of inference] solution would have failed or produced sub-optimal results.

On the face of it, it sounds like a great idea! Sauce for the goose and all that…. David Hand is courageous for even suggesting it (deserving an *honorary mention*!), and he’d be an excellent editor of such an imaginary, parallel journal issue. [Share potential names. See [3]] But if X = “a frequentist” approach, it becomes clear, on further thought, it actually wouldn’t make sense, and frequentists (or, as I prefer, error statisticians) wouldn’t wish to pursue such a thing. Besides, they wouldn’t be allowed– “Frequentist” seems to be some kind of an “F” word in statistics these days–and anyway Bayesian accounts have the latitude to mimic any solution post hoc, if they so desire; if they didn’t concur with the solution, they’d merely deny the claims to superior performance (as sought by the editors of any such imaginary, parallel, journal issue). [Yet, perhaps a good example of the kind of article that would work is Fraser's quick and dirty confidence in a 2011 issue of the same journal.]

Christian Robert explains that the goal was for “a collection of six-page vignettes that describe real cases in which Bayesian analysis has been the only way to crack a really important problem.” Papers should address the question: “Why couldn’t it be solved by other means? What were the shortcomings of other statistical solutions?” I’m not sure what criteria the special editors employed to judge that Bayesian methods were required. According to one of the contributors (Stone) it means the problem required subjective priors. [See Note 4] (I’m a bit surprised at the choice of name for the special issue. Incidentally, the “big” refers to the bigness of the problem, not big data. Not sure about “stories”.)

Yet scientific methods are supposed to be interconnected, fostering both interchecking via multiple lines of evidence as well as building on diverse strategies. I just read of a promising new technique that would allow a blood test to detect infectious prions (as in mad cow disease) in living animals—a first. This will be both scrutinized and built upon by multiple approaches in current prion research. Seeing how the new prion test works, those using other methods will *want* to avail themselves of the new Mad Cow test. Saying Bayesianism is *required*, by contrast, doesn’t obviously suggest that non-Bayesians would wish to go there.

Aside: Robert begins his description of the special issue: “Bayesian statistics is now endemic in many areas of scientific, business and social research”, but does he really mean endemic? (See [5])

All in all, I think Hand gives a strong, generous, positive endorsement, interspersed with some caveats and hesitations:

When presented with fragmentary evidence, for example, one should proceed with caution. In such circumstances, the opportunity for undetected selection bias is considerable. Assumptions about the missing data mechanism may be untestable, perhaps even unnoticed. Data can be missing only in the context of a larger model, and one might not have any idea about what model might be suitable.

Caution is voiced by another discussant, A. H. Welsh:

Another reason a model may be difficult to fit is that it does not describe the data. Forcing it to “fit”, for example by switching to a Bayesian analysis, may not be the best response. It is difficult to check complicated models, particularly hierarchical models with latent variables, measurement error, missing data etc but using an incorrect model may be a concern when the model proves difficult to fit.

Recall, in this connection, this post (on “When Bayesian Inference Shatters”.)

Do you know what would really have been impressive (in my judgement)? A special journal issue replete with articles identifying the most serious flaws, shortcomings, and problems in Bayesian applications; perhaps showing how non-Bayesian methods helped to pinpoint loopholes and improve solutions. Methodological progress is never so sure or so speedy as when subjected to severe criticism. I think people would stand up and really take notice to see Bayesians remove the rose-colored glasses for a bit. What do you think?

[Added 6/22: I see this is equivocal. I had meant that the criticism be self-criticism and that the Bayesians themselves would have vigorously brought out the problems. But mixing in constructive criticism from others would also be of value.]

Here’s some of the rest….

The editors emphasised that they were not looking for ‘argumentative rehashes of the Bayesian versus frequentist debate’. I can only commend them on that. On the other hand, times move on, ideas develop, and understanding deepens, so while ‘argumentative rehashes’ might not be desirable, re-examination from a more sophisticated perspective might be.

I couldn’t agree more as to the need for “a re-examination from a more sophisticated perspective”, and it’s a point very rarely articulated. I hear people quote Neyman and Pearson from like the first few months of exploring a brand new approach and overlook the 70 years of developments in the general frequentist, sampling or (as I prefer) error statistical domain of inference and modeling. ….

An interesting question, perhaps in part sociological, is why different scientific communities tend to favour different schools of inference. Astronomers favour Bayesian methods, particle physicists and psychologists seem to favour frequentist methods. Is there something about these different domains which makes them more amenable to attack by different approaches? In general, when building statistical models, we must not forget that the aim is to understand something about the real world. Or predict, choose an action, make a decision, summarize evidence, and so on, but always about the real world, not an abstract mathematical world… …As an aside, there is also the question of what exactly is meant by ‘Bayesian’. Cox and Donnelly (2011, p144) remark that ‘the word Bayesian, however, became ever more widely used, sometimes representing a regression to the older usage of “flat” prior distributions supposedly representing initial ignorance, sometimes meaning models in which the parameters of interest are regarded as random variables and occasionally meaning little more than that the laws of probability are somewhere invoked.’

Yes that’s another thorny question that remains without a generally accepted answer. I’ve seen it used to simply mean the use of conditional probability anywhere, any time.

Turning to the papers themselves, the Bayesian approach to statistics, with its interpretation of parameters as random variables, has the merit of formulating everything in a consistent manner. Instead of trying to fit together objects of various different kinds, one merely has a single common type of brick to use, which certainly makes life easier.

What is this single brick? Managing to assess everything as a probability brick, when they actually have very different references, isn’t obviously better than recognizing and reporting the differences, possibly synthesizing in some other way. To end up with a remark by Welsh:

One motivation for doing a Bayesian analysis for this problem (and one that is commonly articulated) is that the event in question is unique so it is not meaningful to think about replications. This is not really convincing because hypothetical replications are hypothetical whether they are conceived of for an event that is extremely rare (and in the extreme happens once) or for events that occur frequently.

I concur with Welsh. The study of unique events and fixed hypotheses still involves general types of questions and theories under what I call a repertoire of background. [One might ask, if “the event in question is unique so it is not meaningful to think about replications,” then how does the methodology serve for replicable science?]

Please send any corrections to this draft (i).

**I invite comments, as always, and UPhils for guest blog posting (by July 15), if anyone is interested: error@vt.edu**

[1] The citations come from the Statistical Science posting of future articles (thus final corrected versions could differ), but I am also linking to the published discussion articles.

[2] As even Popper emphasized, even a certain degree of dogmatism has a role, to avoid rejecting a claim too soon. But this is intended to occur within an inquiry that is working hard to find flaws and weaknesses, else it falls far short of being scientific–*for Popper.*

[3] Fab frequentist “tales” (areas)?

[4] I never know whether requiring subjective priors means they required beliefs about weights of evidence, beliefs about frequencies, beliefs about beliefs, or something closer to Christian Robert’s idea that a prior “has nothing to do with ‘reality,’ it is a reference measure that is necessary for making probability statements” (2011, 317-18) in a comment on Don Fraser’s quick and dirty confidence paper.

[5] Endemic

- (of a disease or condition) regularly found among particular people or in a certain area; “areas where malaria is **endemic**”
  Denoting an area in which a particular disease is regularly found.
- (of a plant or animal) native or restricted to a certain country or area; “a marsupial **endemic to** northeastern Australia”
  Growing or existing in a certain place or region.

Filed under: Bayesian/frequentist, Honorary Mention, Statistics

Four ~~score~~ years ago (!) we held the conference “Statistical Science and Philosophy of Science: Where Do (Should) They Meet?” at the London School of Economics, Center for the Philosophy of Natural and Social Science, CPNSS, where I’m visiting professor [1]. Many of the discussions on this blog grew out of contributions from the conference, and conversations initiated soon after. The conference site is here; my paper on the general question is here.[2]

*My main contribution was “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations” (SS & POS 2). It begins like this:*

**1. Comedy Hour at the Bayesian Retreat[3]**

Overheard at the comedy hour at the Bayesian retreat: Did you hear the one about the frequentist…

“who defended the reliability of his radiation reading, despite using a broken radiometer, on the grounds that most of the time he uses one that works, so on average he’s pretty reliable?”

or

“who claimed that observing ‘heads’ on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”

Such jests may work for an after-dinner laugh, but if it turns out that, despite being retreads of ‘straw-men’ fallacies, they form the basis of why some statisticians and philosophers reject frequentist methods, then they are not such a laughing matter. But surely the drubbing of frequentist methods could not be based on a collection of howlers, could it? I invite the reader to stay and find out.

If we are to take the criticisms seriously, and put to one side the possibility that they are deliberate distortions of frequentist statistical methods, we need to identify their sources. To this end I consider two interrelated areas around which to organize foundational issues in statistics: (1) the roles of probability in induction and inference, and (2) the nature and goals of statistical inference in science or learning. Frequentist sampling statistics, which I prefer to call ‘error statistics’, continues to be raked over the coals in the foundational literature, but with little scrutiny of the presuppositions about goals and methods, without which the criticisms lose all force.

First, there is the supposition that an adequate account must assign degrees of probability to hypotheses, an assumption often called probabilism. Second, there is the assumption that the main, if not the only, goal of error-statistical methods is to evaluate long-run error rates. Given the wide latitude with which some critics define ‘controlling long-run error’, it is not surprising to find them arguing that (i) error statisticians approve of silly methods, and/or (ii) rival (e.g., Bayesian) accounts also satisfy error statistical demands. Absent this sleight of hand, Bayesian celebrants would have to go straight to the finale of their entertainment hour: a rousing rendition of ‘There’s No Theorem Like Bayes’s Theorem’.

Never mind that frequentists have responded to these criticisms, they keep popping up (verbatim) in every Bayesian and some non-Bayesian textbooks and articles on philosophical foundations. No wonder that statistician Stephen Senn is inclined to “describe a Bayesian as one who has a reverential awe for all opinions except those of a frequentist statistician” (Senn 2011, 59, this special topic of RMM). Never mind that a correct understanding of the error-statistical demands belies the assumption that any method (with good performance properties in the asymptotic long-run) succeeds in satisfying error-statistical demands.

The difficulty of articulating a statistical philosophy that fully explains the basis for both (i) insisting on error-statistical guarantees, while (ii) avoiding pathological examples in practice, has turned many a frequentist away from venturing into foundational battlegrounds. Some even concede the distorted perspectives drawn from overly literal and radical expositions of what Fisher, Neyman, and Pearson ‘really thought’. I regard this as a shallow way to do foundations.

Here is where I view my contribution—as a philosopher of science—to the long-standing debate: not merely to call attention to the howlers that pass as legitimate criticisms of frequentist error statistics, but also to sketch the main lines of an alternative statistical philosophy within which to better articulate the roles and value of frequentist tools. Let me be clear that I do not consider this the only philosophical framework for frequentist statistics—different terminology could do as well. I will consider myself successful if I can provide one way of building, or one standpoint from which to build, a frequentist, error-statistical philosophy. Here I mostly sketch key ingredients and report on updates in a larger, ongoing project.

**2. Popperians Are to Frequentists as Carnapians Are to Bayesians**

Statisticians do, from time to time, allude to better-known philosophers of science (e.g., Popper). The familiar philosophy/statistics analogy—that Popper is to frequentists as Carnap is to Bayesians—is worth exploring more deeply, most notably the contrast between the popular conception of Popperian falsification and inductive probabilism. Popper himself remarked:

In opposition to [the] inductivist attitude, I assert that C(*H*,*x*) must not be interpreted as the degree of corroboration of *H* by *x*, unless *x* reports the results of our sincere efforts to overthrow *H*. The requirement of sincerity cannot be formalized—no more than the inductivist requirement that *x* must represent our total observational knowledge. (Popper 1959, 418, I replace ‘e’ with ‘x’)

In contrast with the more familiar reference to Popperian falsification, and its apparent similarity to statistical significance testing, here we see Popper alluding to failing to reject, or what he called the “corroboration” of hypothesis *H*. Popper chides the inductivist for making it too easy for agreements between data **x** and *H* to count as giving *H* a degree of confirmation.

Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory—or in other words, only if they result from serious attempts to refute the theory. (Popper 1994, 89)

(Note the similarity to Peirce in Mayo 2011, 87.)

**2.1 Severe Tests**

Popper did not mean to cash out ‘sincerity’ psychologically of course, but in some objective manner. Further, high corroboration must be ascertainable: ‘sincerely trying’ to find flaws will not suffice. Although Popper never adequately cashed out his intuition, there is clearly something right in this requirement. It is the gist of an experimental principle presumably accepted by Bayesians and frequentists alike, thereby supplying a minimal basis to philosophically scrutinize different methods. (Mayo 2011, section 2.5, this special topic of RMM) Error-statistical tests lend themselves to the philosophical standpoint reflected in the severity demand. Pretty clearly, evidence is not being taken seriously in appraising hypothesis *H* if it is predetermined that, even if *H* is false, a way would be found to either obtain, or interpret, data as agreeing with (or ‘passing’) hypothesis *H*. Here is one of many ways to state this:

Severity Requirement (weakest): An agreement between data **x** and *H* fails to count as evidence for a hypothesis or claim *H* if the test would yield (with high probability) so good an agreement even if *H* is false.

Because such a test procedure had little or no ability to find flaws in *H*, finding none would scarcely count in *H*’s favor.

*2.1.1 Example: Negative Pressure Tests on the Deep Horizon Rig*

Did the negative pressure readings provide ample evidence that:

*H*_{0}: leaking gases, if any, were within the bounds of safety (e.g., less than θ_{0})?

Not if the rig workers kept decreasing the pressure until *H*_{0} passed, rather than performing a more stringent test (e.g., a so-called ‘cement bond log’ using acoustics). Such a lowering of the hurdle for passing *H*_{0} made it too easy to pass *H*_{0} even if it were false, that is, even if

*H*_{1}: the pressure build-up was in excess of θ_{0}.

That ‘the negative pressure readings were misinterpreted’ meant that it was incorrect to construe them as indicating *H*_{0}. If such negative readings would be expected, say, 80 percent of the time, even if *H*_{1} is true, then the readings fail to count as evidence for *H*_{0}.
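One way to put a number on the 80 percent figure, offered as a sketch: read the severity with which *H*_{0} passes as the probability that the test would have produced worse agreement with *H*_{0} were *H*_{1} true. The shorthand SEV and the symbol **x** for the observed readings are introduced here only for illustration.

```latex
\mathrm{SEV}(H_0) \;=\; 1 - P(\text{readings as favorable to } H_0 \text{ as } \mathbf{x};\ H_1) \;=\; 1 - 0.8 \;=\; 0.2
```

On this reading, the passing result had only a .2 chance of coming out otherwise even if the pressure build-up were excessive, which is why it does so little to rule out *H*_{1}.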

**2.2 Another Egregious Violation of the Severity Requirement**

Too readily interpreting data as agreeing with or fitting hypothesis *H* is not the only way to violate the severity requirement. Using utterly irrelevant evidence, such as the result of a coin flip to appraise a diabetes treatment, would be another way. In order for data **x** to succeed in corroborating *H* with severity, two things are required: (i) **x** must fit *H*, for an adequate notion of fit, and (ii) the test must have a reasonable probability of finding worse agreement with *H*, were *H* false. I have been focusing on (ii) but requirement (i) also falls directly out from error statistical demands. In general, for *H* to fit **x**, *H* would have to make **x** more probable than its denial. Coin tossing hypotheses say nothing about hypotheses on diabetes and so they fail the fit requirement. Note how this immediately scotches the second howler in the second opening example.
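A minimal worked version of the fit check for the coin example, assuming (as the howler itself stipulates) that the coin lands heads with probability .05 whether or not the treatment hypothesis *H* is true:

```latex
\frac{P(\text{heads};\ H)}{P(\text{heads};\ \text{not-}H)} \;=\; \frac{0.05}{0.05} \;=\; 1
```

Since *H* does not make the outcome any more probable than its denial does, requirement (i) fails, however improbable the outcome is.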

But note that we can appraise the severity credentials of other accounts by using whatever notion of ‘fit’ they permit. For example, if a Bayesian method assigns high posterior probability to *H* given data **x**, we can appraise how often it would do so even if *H* is false. That is a main reason I do not want to limit what can count as a purported measure of fit: we may wish to entertain different measures for purposes of criticism.
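Such an appraisal can be sketched by simulation. The toy setup below is an assumption for illustration only: a point hypothesis *H*: θ = 0 against a single rival θ = 0.5, Normal data with σ = 1, and a prior lump on *H*. The function names are hypothetical. The code generates data with *H* false and counts how often the posterior on *H* still reaches 0.9.

```python
import math
import random


def posterior_of_H(xbar, n, theta_alt, prior_H, sigma=1.0):
    """Posterior probability of H: theta = 0 against the rival theta = theta_alt,
    given the mean xbar of n Normal(theta, sigma^2) observations."""
    se2 = sigma ** 2 / n
    # Log likelihood ratio of H to the rival, based on the sample mean.
    log_lr = ((xbar - theta_alt) ** 2 - xbar ** 2) / (2 * se2)
    log_odds = math.log(prior_H / (1 - prior_H)) + log_lr
    return 1 / (1 + math.exp(-log_odds))


def rate_high_posterior_when_H_false(n, theta_alt, prior_H,
                                     threshold=0.9, trials=20000, seed=0):
    """Relative frequency with which the posterior on H reaches `threshold`
    when the data are in fact generated under the rival (H false)."""
    rng = random.Random(seed)
    se = 1.0 / math.sqrt(n)  # sigma is fixed at 1 in this sketch
    hits = 0
    for _ in range(trials):
        xbar = rng.gauss(theta_alt, se)  # sample mean drawn with H false
        if posterior_of_H(xbar, n, theta_alt, prior_H) >= threshold:
            hits += 1
    return hits / trials


# Compare an even prior with a heavy prior lump on H.
for prior in (0.5, 0.95):
    print(prior, rate_high_posterior_when_H_false(n=10, theta_alt=0.5, prior_H=prior))
```

In this toy case the heavier prior lump on *H* makes a high posterior for *H* considerably more frequent even though *H* is false, which is the kind of error-probability appraisal the paragraph above has in mind.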

**2.3 The Rationale for Severity is to Find Things Out Reliably**

Although the severity requirement reflects a central intuition about evidence, I do not regard it as a primitive: it can be substantiated in terms of the goals of learning. To flout it would not merely permit being wrong with high probability (a long-run behavior rationale). In any particular case, little if anything will have been done to rule out the ways in which data and hypothesis can ‘agree’, even where the hypothesis is false. The burden of proof on anyone claiming to have evidence for *H* is to show that the claim is not guilty of at least an egregious lack of severity.

Although one can get considerable mileage even with the weak severity requirement, I would also accept the corresponding positive conception of evidence, which will comprise the full severity principle:

Severity Principle (full): Data **x** provide a good indication of or evidence for hypothesis *H* (only) to the extent that test *T* severely passes *H* with **x**.

Degree of corroboration is a useful shorthand for the degree of severity with which a claim passes, and may be used as long as the meaning remains clear.

**2.4 What Can Be Learned from Popper; What Can Popperians Be Taught?**

Interestingly, Popper often crops up as a philosopher to emulate, both by Bayesian and frequentist statisticians. As a philosopher, I am glad to have one of our own taken as useful, but feel I should point out that, despite having the right idea, Popperian logical computations never gave him an adequate way to implement his severity requirement, and I think I know why: Popper once wrote to me that he regretted never having learned mathematical statistics. Were he to have made the ‘error probability’ turn, today’s meeting ground between philosophy of science and statistics would likely look very different, at least for followers of Popper, the ‘critical rationalists’.

Consider, for example, Alan Musgrave (1999; 2006). Although he declares that “the critical rationalist owes us a theory of criticism” (2006, 323) this has yet to materialize. Instead, it seems that current-day critical rationalists retain the limitations that emasculated Popper. Notably, they deny that the method they recommend—either to accept or to prefer the hypothesis best-tested so far—is reliable. They are right: the best-tested so far may have been poorly probed. But critical rationalists maintain nevertheless that their account is ‘rational’. If asked why, their response is the same as Popper’s: ‘I know of nothing more rational’ than to accept the best-tested hypotheses. It sounds rational enough, but only if the best-tested hypothesis so far is itself well tested (see Mayo 2006; 2010b). So here we see one way in which a philosopher, using methods from statistics, could go back to philosophy and implement an incomplete idea.

On the other hand, statisticians who align themselves with Popper need to show that the methods they favor uphold falsificationist demands: that they are capable of finding claims false, to the extent that they are false; and retaining claims, just to the extent that they have passed severe scrutiny (of ways they can be false). Error probabilistic methods can serve these ends; but it is less clear that Bayesian methods are well-suited for such goals (or if they are, it is not clear they are properly ‘Bayesian’).

__________________

To read sections 3 and 4 see: SS & POS 2 or go to the RMM page, and scroll down to Mayo’s Sept 25 paper.

*Here is section 5:*

**5. The Error-Statistical Philosophy**

I recommend moving away, once and for all, from the idea that frequentists must ‘sign up’ for either Neyman and Pearson, or Fisherian paradigms. As a philosopher of statistics I am prepared to admit to supplying the tools with an interpretation and an associated philosophy of inference. I am not concerned to prove this is what any of the founders ‘really meant’.

Fisherian simple-significance tests, with their single null hypothesis and at most an idea of a directional alternative (and a corresponding notion of the ‘sensitivity’ of a test), are commonly distinguished from Neyman and Pearson tests, where the null and alternative exhaust the parameter space, and the corresponding notion of power is explicit. On the interpretation of tests that I am proposing, these are just two of the various types of testing contexts appropriate for different questions of interest. My use of a distinct term, ‘error statistics’, frees us from the bogeymen and bogeywomen often associated with ‘classical’ statistics, and it is to be hoped that that term is shelved. (Even ‘sampling theory’, technically correct, does not seem to represent the key point: the sampling distribution matters in order to evaluate error probabilities, and thereby assess corroboration or severity associated with claims of interest.) Nor do I see that my comments turn on whether one replaces frequencies with ‘propensities’ (whatever they are).

**5.1 Error (Probability) Statistics**

*What is key on the statistics side* is that the probabilities refer to the distribution of a statistic *d*(**X**), the so-called sampling distribution. This alone is at odds with Bayesian methods where consideration of outcomes other than the one observed is disallowed (likelihood principle [LP]), at least once the data are available.

Neyman-Pearson hypothesis testing violates the likelihood principle, because the event either happens or does not; and hence has probability one or zero. (Kadane 2011, 439)

The idea of considering, hypothetically, what other outcomes could have occurred in reasoning from the one that did occur seems so obvious in ordinary reasoning that it will strike many, at least those outside of this specialized debate, as bizarre for an account of statistical inference to banish such considerations. And yet, banish them the Bayesian must—at least if she is being coherent. I come back to the likelihood principle in section 7.

*What is key on the philosophical side* is that error probabilities may be used to quantify the probativeness or severity of tests (in relation to a given inference).

The twin goals of probative tests and informative inferences constrain the selection of tests. But however tests are specified, they are open to an after-data scrutiny based on the severity achieved. Tests do not always or automatically give us relevant severity assessments, and I do not claim one will find just this construal in the literature. Because any such severity assessment is relative to the particular ‘mistake’ being ruled out, it must be qualified in relation to a given inference, and a given testing context. We may write:

SEV(*T*, **x**, *H*) to abbreviate ‘the severity with which test *T* passes hypothesis *H* with data **x**’.

When the test and data are clear, I may just write SEV(*H*). The standpoint of the severe prober, or the severity principle, directs us to obtain error probabilities that are relevant to determining well testedness, and this is the key, I maintain, to avoiding counterintuitive inferences which are at the heart of often-repeated comic criticisms. This makes explicit what Neyman and Pearson implicitly hinted at:

If properly interpreted we should not describe one [test] as more accurate than another, but according to the problem in hand should recommend this one or that as providing information which is more relevant to the purpose. (Neyman and Pearson 1967, 56–57)

For the vast majority of cases we deal with, satisfying the N-P long-run desiderata leads to a uniquely appropriate test that simultaneously satisfies Cox’s (Fisherian) focus on minimally sufficient statistics, and also the severe testing desiderata (Mayo and Cox 2010).

**5.2 Philosophy-Laden Criticisms of Frequentist Statistical Methods**

What is rarely noticed in foundational discussions is that appraising statistical accounts at the foundational level is ‘theory-laden’, and in this case the theory is philosophical. A deep as opposed to a shallow critique of such appraisals must therefore unearth the philosophical presuppositions underlying both the criticisms and the plaudits of methods. To avoid question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended.

But for many philosophers, in particular, Bayesians, the presumption that inference demands a posterior probability for hypotheses is thought to be so obvious as not to require support. At any rate, the only way to give a generous interpretation of the critics (rather than assume a deliberate misreading of frequentist goals) is to allow that critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, the criticisms of frequentist statistical methods assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error-statistical methods can only achieve radical behavioristic goals, wherein long-run error rates alone matter.

Criticisms then follow readily, in the form of one or both:

- Error probabilities do not supply posterior probabilities in hypotheses.
- Methods can satisfy long-run error probability demands while giving rise to counterintuitive inferences in particular cases.

I have proposed an alternative philosophy that replaces these tenets with different ones:

- The role of probability in inference is to quantify how reliably or severely claims have been tested.
- The severity principle directs us to the relevant error probabilities; control of long-run error probabilities, while necessary, is not sufficient for good tests.

The following examples will substantiate and flesh out these claims.

**5.3 Severity as a ‘Metastatistical’ Assessment**

In calling severity ‘metastatistical’, I am deliberately calling attention to the fact that the reasoned deliberation it involves cannot simply be equated to formal quantitative measures, particularly those that arise in recipe-like uses of methods such as significance tests. In applying it, we consider several possible inferences that might be considered of interest. In the example of test *T+* [this is a one-sided Normal test of H_{0}: μ ≤ μ_{0} against H_{1}: μ > μ_{0}, on p. 81], the data-specific severity evaluation quantifies the extent of the discrepancy (γ) from the null that is (or is not) indicated by data **x** rather than quantifying a ‘degree of confirmation’ accorded a given hypothesis. Still, if one wants to emphasize a post-data measure one can write:

SEV(μ < x̄_{0} + γσ_{x}) to abbreviate: The severity with which a test *T+* with a result **x** passes the hypothesis:

(μ < x̄_{0} + γσ_{x}), with σ_{x} abbreviating (σ/√n).

One might consider a series of benchmarks or upper severity bounds:

SEV(μ < x̄_{0} + 0σ_{x}) = .5

SEV(μ < x̄_{0} + .5σ_{x}) = .7

SEV(μ < x̄_{0} + 1σ_{x}) = .84

SEV(μ < x̄_{0} + 1.5σ_{x}) = .93

SEV(μ < x̄_{0} + 1.96σ_{x}) = .975
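The benchmarks can be checked numerically. The following is a minimal Python sketch, assuming (as in the text) that σ is known and the sample mean is Normal. Under that assumption SEV(μ < x̄_{0} + γσ_{x}) is the probability of a sample mean larger than the one observed, computed under μ = x̄_{0} + γσ_{x}, and this reduces to Φ(γ). The function name `sev_upper` is illustrative only, and the last benchmark uses the conventional 1.96 cutoff, for which Φ is .975 to three decimals.

```python
from statistics import NormalDist

def sev_upper(gamma: float) -> float:
    """Severity for the claim mu < xbar_0 + gamma * sigma_x in test T+.

    Computed as P(Xbar > xbar_0; mu = xbar_0 + gamma * sigma_x), which
    equals Phi(gamma) when Xbar is Normal with known sigma.
    """
    return NormalDist().cdf(gamma)

# Print the severity attached to each benchmark discrepancy.
for gamma in (0, 0.5, 1, 1.5, 1.96):
    print(f"gamma = {gamma:<4}  SEV = {sev_upper(gamma):.3f}")
```

The observed mean and the sample size drop out of the calculation: in this simple case the severity depends only on how many standard errors the bound lies above x̄_{0}.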

More generally, one might interpret nonstatistically significant results (i.e., *d*(**x**) ≤ *c_{α}*) in test *T+* as follows:

(μ ≤ x̄_{0} + γ_{ε}(σ/√n)) passes the test *T+* with severity (1 – ε),

for any P(*d*(**X**) > γ_{ε}) = ε.

It is true that I am here limiting myself to a case where σ is known and we do not worry about other possible ‘nuisance parameters’. Here I am doing philosophy of statistics; only once the logic is grasped will the technical extensions be forthcoming.
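To make the general statement concrete, here is a small Python sketch under the same assumptions (Normal mean, σ known). Given ε, it finds γ_{ε} from the standard Normal quantile and returns the upper bound that passes with severity 1 – ε. The function name and the numbers passed to it are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def severe_upper_bound(xbar0: float, sigma: float, n: int, eps: float) -> float:
    """Bound mu_1 such that the claim 'mu <= mu_1' passes T+ with severity 1 - eps."""
    gamma_eps = NormalDist().inv_cdf(1 - eps)  # P(d(X) > gamma_eps) = eps
    return xbar0 + gamma_eps * sigma / sqrt(n)

def severity_of_bound(xbar0: float, sigma: float, n: int, mu1: float) -> float:
    """P(Xbar > xbar0; mu = mu1): severity for 'mu <= mu1' after observing xbar0."""
    se = sigma / sqrt(n)
    return 1 - NormalDist(mu1, se).cdf(xbar0)

# Hypothetical inputs: observed mean 0.4, sigma 2, n = 100, eps = 0.025.
bound = severe_upper_bound(xbar0=0.4, sigma=2.0, n=100, eps=0.025)
print(round(bound, 3), round(severity_of_bound(0.4, 2.0, 100, bound), 3))
```

The second function recomputes the severity directly from the sampling distribution, as a check that the bound delivers the stated 1 – ε.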

*5.3.1 Severity and Confidence Bounds in the Case of Test T+*

It will be noticed that these bounds are identical to the corresponding upper confidence interval bounds for estimating μ. There is a duality relationship between confidence intervals and tests: the confidence interval contains the parameter values that would not be rejected by the given test at the specified level of significance. It follows that the (1 – α) one-sided confidence interval (CI) that corresponds to test *T+* is of form:

μ > X̄ − c_{α}(σ/√n)

The corresponding CI, in other words, would not be the assertion of the upper bound, as in our interpretation of statistically insignificant results. In particular, the 97.5 percent CI estimator corresponding to test *T+* is:

μ > X̄ − 1.96(σ/√n)
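The duality between the one-sided interval and test *T+* can be illustrated with a short Python sketch, again assuming a Normal mean with known σ. For a handful of candidate null values μ_{0}, it checks that μ_{0} lies above the lower confidence bound exactly when *T+* fails to reject H_{0}: μ ≤ μ_{0}. The function names and all numerical inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def lower_bound(xbar: float, sigma: float, n: int, alpha: float) -> float:
    """Lower limit of the (1 - alpha) one-sided interval mu > xbar - c_alpha * sigma / sqrt(n)."""
    c_alpha = NormalDist().inv_cdf(1 - alpha)
    return xbar - c_alpha * sigma / sqrt(n)

def rejects(xbar: float, mu0: float, sigma: float, n: int, alpha: float) -> bool:
    """T+ rejects H0: mu <= mu0 when d(x) = (xbar - mu0) / (sigma / sqrt(n)) exceeds c_alpha."""
    d = (xbar - mu0) / (sigma / sqrt(n))
    return d > NormalDist().inv_cdf(1 - alpha)

# Hypothetical data summary: observed mean 0.4, sigma 2, n = 100, alpha = 0.025.
xbar, sigma, n, alpha = 0.4, 2.0, 100, 0.025
lb = lower_bound(xbar, sigma, n, alpha)
for mu0 in (-0.2, 0.0, 0.2, 0.4):
    in_interval = mu0 > lb
    print(mu0, in_interval, not rejects(xbar, mu0, sigma, n, alpha))
```

For each μ_{0} the two booleans agree: the interval collects the null values that the test would not reject at level α.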

We were only led to the upper bounds in the midst of a severity interpretation of negative results (see Mayo and Spanos 2006). [See also posts on this blog, e.g., on reforming the reformers.]

Still, applying the severity construal to the application of confidence interval estimation is in sync with the recommendation to consider a series of lower and upper confidence limits, as in Cox (2006). But are not the degrees of severity just another way to say how probable each claim is? No. This would lead to well-known inconsistencies, and gives the wrong logic for ‘how well-tested’ (or ‘corroborated’) a claim is.

A classic misinterpretation of an upper confidence interval estimate is based on the following fallacious instantiation of a random variable by its fixed value:

P(μ < (X̄ + 2(σ/√n)); μ) = .975,

observe mean x̄_{0},

therefore, P(μ < (x̄_{0} + 2(σ/√n)); μ) = .975.

While this instantiation is fallacious, critics often argue that we just cannot help it. Hacking (1980) attributes this assumption to our tendency toward ‘logicism’, wherein we assume a logical relationship exists between any data and hypothesis. More specifically, it grows out of the first tenet of the statistical philosophy that is assumed by critics of error statistics, that of probabilism.
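A small simulation may help show why the instantiation fails. The sketch below, in Python, assumes a Normal sample mean with known σ and uses hypothetical parameter values. The probability of roughly .975 describes how often the random bound X̄ + 2(σ/√n) exceeds the fixed μ over repetitions. Once a particular mean is observed, the statement that μ lies below the realized bound is simply true or false.

```python
import random
from math import sqrt

def coverage(mu: float, sigma: float, n: int, reps: int = 20000, seed: int = 0) -> float:
    """Fraction of repetitions in which the bound Xbar + 2*sigma/sqrt(n) exceeds mu."""
    rng = random.Random(seed)
    se = sigma / sqrt(n)
    hits = 0
    for _ in range(reps):
        xbar = rng.gauss(mu, se)     # one realized sample mean
        hits += mu < xbar + 2 * se   # True or False for this realization
    return hits / reps

# Hypothetical parameter values, used only to run the simulation.
rate = coverage(mu=1.0, sigma=2.0, n=25)
print(round(rate, 3))
```

The relative frequency is a property of the estimator across repetitions. Inside the loop, each individual comparison is a plain boolean and no probability attaches to it.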

*5.3.2 Severity versus Rubbing Off*

The severity construal is different from what I call the ‘rubbing off construal’ which says: infer from the fact that the procedure is rarely wrong to the assignment of a low probability to its being wrong in the case at hand. This is still dangerously equivocal, since the probability properly attaches to the method not the inference. Nor will it do to merely replace an error probability associated with an inference to *H* with the phrase ‘degree of severity’ with which *H* has passed. The long-run reliability of the rule is a necessary but not a sufficient condition to infer *H* (with severity).

The reasoning instead is the counterfactual reasoning behind what we agreed was at the heart of an entirely general principle of evidence. Although I chose to couch it within the severity principle, the general frequentist principle of evidence (FEV) or something else could be chosen.

To emphasize another feature of the severity construal, suppose one wishes to entertain the severity associated with the inference:

*H*: μ < (x̄_{0} + 0σ_{x})

on the basis of mean x̄_{0} from test

*5.3.3 What’s Belief Got to Do with It?*

Some philosophers profess not to understand what I could be saying if I am prepared to allow that a hypothesis *H* has passed a severe test *T* with **x** without also advocating (strong) belief in *H*. When SEV(*H*) is high there is no problem in saying that **x** warrants *H*, or if one likes, that **x** warrants believing *H*, even though that would not be the direct outcome of a statistical inference. The reason it is unproblematic in the case where SEV(*H*) is high is:

If SEV(*H*) is high, its denial is low, i.e., SEV(~*H*) is low.

But it does not follow that a severity assessment should obey the probability calculus, or be a posterior probability—it should not, and is not.

After all, a test may poorly warrant both a hypothesis *H* and its denial, violating the probability calculus. That is, SEV(*H*) may be low because its denial was ruled out with severity, i.e., because SEV(~*H*) is high. But SEV(*H*) may also be low because the test is too imprecise to allow us to take the result as good evidence for *H*.

Even if one wished to retain the idea that degrees of belief correspond to (or are revealed by?) bets an agent is willing to take, that degrees of belief are comparable across different contexts, and all the rest of the classic subjective Bayesian picture, this would still not have shown the relevance of a measure of belief to the objective appraisal of what has been learned from data. Even if I strongly believe a hypothesis, I will need a concept that allows me to express whether or not the test with outcome **x** warrants *H*. That is what a severity assessment would provide. In this respect, a dyed-in-the-wool subjective Bayesian could accept the severity construal for science, and still find a home for his personalistic conception.

Critics should also welcome this move because it underscores the basis for many complaints: the strict frequentist formalism alone does not prevent certain counterintuitive inferences. That is why I allowed that a severity assessment is on the metalevel in scrutinizing an inference. Granting that, the error-statistical account based on the severity principle does prevent the counterintuitive inferences that have earned so much fame not only at Bayesian retreats, but throughout the literature.

*5.3.4 Tacking Paradox Scotched*

In addition to avoiding fallacies within statistics, the severity logic avoids classic problems facing both Bayesian and hypothetical-deductive accounts in philosophy. For example, tacking on an irrelevant conjunct to a well-confirmed hypothesis *H* seems magically to allow confirmation for some irrelevant conjuncts. Not so in a severity analysis. Suppose the severity for claim *H* (given test *T* and data **x**) is high: i.e., SEV(*T*, **x**, *H*) is high, whereas a claim *J* is not probed in the least by test *T*. Then the severity for the conjunction (*H* & *J*) is very low, if not minimal.

If SEV(Test *T*, data **x**, claim *H*) is high, but *J* is not probed in the least by the experimental test *T*, then SEV(*T*, **x**, (*H* & *J*)) = very low or minimal.

For example, consider:

*H*: GTR and *J*: Kuru is transmitted through funerary cannibalism,

and let data **x**_{0} be a value of the observed deflection of light in accordance with the general theory of relativity, GTR. The two hypotheses do not refer to the same data models or experimental outcomes, so it would be odd to conjoin them; but if one did, the conjunction gets minimal severity from this particular data set. Note that we distinguish

A severity assessment also allows a clear way to distinguish the well-testedness of a portion or variant of a larger theory, as opposed to the full theory. To apply a severity assessment requires exhausting the space of alternatives to any claim to be inferred (i.e., ‘*H* is false’ is a specific denial of *H*). These must be relevant rivals to *H*: they must be at ‘the same level’ as *H*. For example, if *H* is asking about whether drug Z causes some effect, then a claim at a different (‘higher’) level might be a theory purporting to explain the causal effect. A test that severely passes the former does not allow us to regard the latter as having passed severely. So severity directs us to identify the portion or aspect of a larger theory that passes. We may often need to refine the hypothesis of stated interest so that it is sufficiently local to enable a severity assessment. Background knowledge will clearly play a key role. Nevertheless we learn a lot from determining that we are not allowed to regard given claims or theories as passing with severity. I come back to this in the next section (and much more elsewhere, e.g., Mayo 2010a, b).

[1] co-organized with Aris Spanos.

[2] This was a special topic of the on-line journal, *Rationality, Markets and Morals (RMM)*, edited by Max Albert, also a conference participant. For more Saturday night reading, check out the page. Authors are: David Cox, Andrew Gelman, David F. Hendry, Deborah G. Mayo, Stephen Senn, Aris Spanos, Jan Sprenger, Larry Wasserman. Search this blog for a number of commentaries on most of these papers.

[3] Long-time blog readers will recognize this from the start of this blog. For some background, and a table of contents for the paper, see my Oct 17 post.


**Aris Spanos**

Wilson E. Schmidt Professor of Economics

*Department of Economics, Virginia Tech*

**Recurring controversies about P values and confidence intervals revisited***

Volume 95, Issue 3 (March 2014): pp. 645-651

*INTRODUCTION*

The use, abuse, interpretations and reinterpretations of the notion of a *P* value has been a hot topic of controversy since the 1950s in statistics and several applied fields, including psychology, sociology, ecology, medicine, and economics.

The initial controversy between Fisher’s significance testing and the Neyman and Pearson (N-P; 1933) hypothesis testing concerned the extent to which the pre-data Type I error probability α can address the arbitrariness and potential abuse of Fisher’s *post-data threshold* for the *P* value. Fisher adopted a falsificationist stance and viewed the *P* value as an indicator of disagreement (inconsistency, contradiction) between data *x*_{0} and the null hypothesis (

The primary aim of this paper is to revisit several charges, interpretations, and comparisons of the *P* value with other procedures as they relate to their primary aims and objectives, the nature of the questions posed to the data, and the nature of their underlying reasoning and the ensuing inferences. The idea is to shed light on some of these issues using the *error-statistical* perspective; see Mayo and Spanos (2011).

…..

Click to read all of A. Spanos on “Recurring controversies”.

……

*SUMMARY AND CONCLUSIONS*

The paper focused primarily on certain charges, claims, and interpretations of the *P* value as they relate to CIs and the AIC. It was argued that some of these comparisons and claims are misleading because they ignore key differences in the procedures being compared, such as (1) their primary aims and objectives, (2) the nature of the questions posed to the data, as well as (3) the nature of their underlying reasoning and the ensuing inferences.

In the case of the *P* value, the crucial issue is whether Fisher’s evidential interpretation of the *P* value as “indicating the strength of evidence against *H*_{0}” is appropriate. It is argued that, despite Fisher’s maligning of the Type II error, a principled way to provide an adequate evidential account, in the form of post-data severity evaluation, calls for taking into account the power of the test.

The error-statistical perspective brings out a key weakness of the *P* value and addresses several foundational issues raised in frequentist testing, including the fallacies of acceptance and rejection as well as misinterpretations of observed CIs; see Mayo and Spanos (2011). The paper also uncovers the connection between model selection procedures and hypothesis testing, revealing the inherent unreliability of the former. Hence, the choice between different procedures should not be “stylistic” (Murtaugh 2013), but should depend on the questions of interest, the answers sought, and the reliability of the procedures.

*Spanos, A. (2014) Recurring controversies about P values and confidence intervals revisited. *Ecology* 95(3): 645-651.

**Murtaugh, “In defense of P values”; Murtaugh, “Rejoinder”**

**Burnham & Anderson, “P values are only an index to evidence: 20th- vs 21st-century statistical science”**


So what happened? Medical journals, the main vehicles for publishing clinical trials today, are after all the ‘gatekeepers of medical evidence’, as they are described in *Bad Pharma*, Ben Goldacre’s 2012 bestseller. …… The Alltrials campaign, launched two years ago on the back of Goldacre’s book, has attracted an extraordinary level of support. …

Professor Senn has long argued the AllTrials case, he insisted. ‘There’s no doubt that obtaining a license to market a drug should involve an obligation to share the results with interested parties,’ he said.

His point, however, was that this sharing should not involve medical journals. …There were several reasons, he said, as to why Bad *JAMA* and other journals were at least as much to blame as Bad Pharma for a lack of transparency in pharmaceutical research: the constant need of the medical press to make a sensational impact, ‘the vanity and ambitions of scientists,’ and the confusing restrictions of embargos, as well as the fact that, despite the evidence, it was clear that journals *do* favour ‘exciting’ research. Instead of journals, Professor Senn claimed, trials should be self published either on the web or in some publicly searchable registry, such as the website Clinicaltrials.gov.

I wonder if this would have helped in the case of the Potti and Nevins Duke trials. I believe the NCI only discovered it was partially funding one of the trials by noticing it on the clinical trials website.

Between the medical journals and the regulators, Senn puts more trust in the latter.

[A]ccording to Professor Senn, it’s the regulators, virtually alone, that keep medicine safe. ‘Regulators may make mistakes, but they do a better job than the journals,’ he said. ‘Would you want to fly to New York with a big reputable airline like BA, which is heavily regulated? Or a plane built by Professor Smith and his colleagues from the local university?’

What do you think?

With so much to disagree on, speakers and audience members agreed that transparent clinical research is a complex goal, and should be addressed as such. Discussing the future is just the start of the process, pointed out Dr Groves. ‘Publication bias is not only down to publishers, it is also dependent on people submitting their results including old data, whether it’s in a loft or on a floppy disk or filed away somewhere, so bring out your dead,’ she said. ‘We need to be able to make decisions on all the evidence. That means that observational studies should be regarded as being as important as randomised controlled trials. We know we’ve got to improve and there’s a long way to go. It’s an exciting time,’ she said.

*Bring out your dead?*


by Stephen Senn*

Those not familiar with drug development might suppose that showing that a new pharmaceutical formulation (say a generic drug) is equivalent to a formulation that has a licence (say a brand name drug) ought to be simple. However, it can often turn out to be bafflingly difficult[1].

If, as is often the case, both formulations are given in forms that are absorbed through the gut, whether as pills, oral solutions or suppositories, then so-called *bioequivalence trials* form an attractive option. The basic idea is that the concentration in the blood of the new *test* formulation can be compared to the *licensed* reference formulation. Equivalence of concentration in the blood plausibly implies equivalence in all possible effect sites and thus equality of all benefits and harms.

Typically, healthy volunteers are recruited and given the test formulation on one occasion and the reference formulation on another, the order being randomised. Regular blood samples are taken and the concentration time curves summarised using simple statistics: for example the area under the curve (AUC) is always used, the concentration maximum C_{max} nearly always also and the time to reach a maximum T_{max}, very often. These statistics are then compared across formulations to show that they are similar.

In the rest of this post I shall ignore the problem that various summary measures are employed and assume that we are just considering AUC. There seems to be a general (but arbitrary) agreement that two formulations are equivalent if the true ratio of AUC under test and reference lies between 0.8 & 1.25. In that case (at least as regards the AUC requirement) the formulations are deemed bioequivalent. The true ratio, however, is a parameter not a statistic and so the task is to see what the data can show about the reasonableness of any claim regarding this unknown theoretical quantity.

It is here, however, that the statistical difficulties begin. A simple frequentist solution would appear to be to calculate the 95% confidence intervals for the relative bioavailability and check that these lie within the limits of equivalence. Modelling is always done on the log-scale and since log(0.8) = -log(1.25) we have that limits for the log relative bioavailability of test and reference are (approximately) -0.22 to +0.22. However, there is more than one 95% confidence interval, and an early dispute in this field was whether a traditional confidence interval centred on the point estimate should be calculated, as Kirkwood[2] proposed in 1981, or one centred on the middle of the range of equivalence, that is to say on 0 (on the log scale), as Westlake[3] had earlier proposed in 1972.
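As a rough illustration of the conventional check described above, here is a minimal Python sketch. It assumes Normal theory with a known standard error on the log scale, and the point estimate and standard error are hypothetical values, not taken from any trial. The helper names `conventional_ci` and `within_limits` are likewise illustrative.

```python
from math import log
from statistics import NormalDist

# Equivalence limits on the log scale: about -0.223 and +0.223.
LOWER, UPPER = log(0.8), log(1.25)

def conventional_ci(estimate: float, se: float, level: float = 0.95):
    """Confidence interval centred on the point estimate (Normal theory, known se)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se

def within_limits(ci) -> bool:
    """True when the whole interval lies inside the equivalence limits."""
    lo, hi = ci
    return LOWER < lo and hi < UPPER

# Hypothetical log relative bioavailability and its standard error.
ci = conventional_ci(estimate=0.05, se=0.08)
print(round(LOWER, 3), round(UPPER, 3), within_limits(ci))
```

Working on the log scale is what makes the limits symmetric about zero, which is the property a Westlake-style interval centred on 0 exploits.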

As O’Quigley and Baudoin pointed out[4], the difference is, essentially, between deciding whether the ‘shortest’ confidence interval is included within the limits of equivalence or whether the fiducial probability that the true relative bioavailability lies within the limits is at least 95%. The latter is always the easier requirement to satisfy. To see why, consider the case where the point estimate is positive. In that case clearly the lower conventional confidence limit would never lie outside the limit unless the upper one did. Thus by lengthening the lower limit and shortening the upper in such a way as to maintain the 95% probability one can make it easier to satisfy equivalence.

An alternative approach was taken by Schuirmann[5] who proposed to look at the matter in terms of two one-sided tests. Imagine that we have two regulators: a toxicity and an efficacy regulator. The former defines as toxic any drug whose relative bioavailability is greater than 1.25 and the latter as ineffective any drug whose relative bioavailability is less than 0.8. Each is unconcerned by the other’s decision and so no trading of alpha from one to the other can take place. It turns out that this requirement is satisfied operationally by accepting bioequivalence if the conventional 90% confidence limits are within the limits of equivalence. Opinions differ as to how logical this is. For example, the FDA requires conventional placebo-controlled trials of a new treatment to be tested at the 5% level two-sided but since they would never accept a treatment that was worse than placebo the regulator’s risk is 2.5% not 5%. Why should it be lower for bioequivalence?
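The operational equivalence between the two one-sided tests and the 90% interval can be checked directly. The following Python sketch assumes Normal theory with a known standard error on the log scale. The function names and the grid of estimates and standard errors are hypothetical, and each one-sided test is taken at the 5% level.

```python
from math import log
from statistics import NormalDist

LOWER, UPPER = log(0.8), log(1.25)
Z = NormalDist()

def tost_accepts(estimate: float, se: float, alpha: float = 0.05) -> bool:
    """Two one-sided tests, each at level alpha (Normal theory, known se)."""
    p_efficacy = 1 - Z.cdf((estimate - LOWER) / se)  # H0: true value <= LOWER
    p_toxicity = Z.cdf((estimate - UPPER) / se)      # H0: true value >= UPPER
    return p_efficacy < alpha and p_toxicity < alpha

def ci90_within_limits(estimate: float, se: float) -> bool:
    """Conventional 90% interval lies wholly inside the equivalence limits."""
    z = Z.inv_cdf(0.95)
    return LOWER < estimate - z * se and estimate + z * se < UPPER

# Hypothetical grid of point estimates and standard errors.
for se in (0.05, 0.10, 0.15):
    for k in range(-6, 7):
        est = 0.05 * k
        assert tost_accepts(est, se) == ci90_within_limits(est, se)
print("TOST at 5% and the 90% CI rule agree on the grid")
```

Because each regulator spends a full 5% on one side, the matching two-sided interval has 90% rather than 95% coverage.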

Be that as it may, 90% confidence intervals are regularly used but they have been criticised by a number of frequentists of a Neyman-Pearson persuasion. (See for example R. Berger and Hsu[6].) The argument goes as follows. If the trial is small enough so that the standard error is large enough, the width of the confidence interval, however calculated, will exceed the width of the equivalence interval. Thus the Type I error rate is zero. Various proposals have been made as to how to recover the missing Type I error but they all boil down to this: given a small enough trial you could claim equivalence even though the point estimate was outside the limits of equivalence! Needless to say nobody uses such tests in practice and they have been severely criticised from a theoretical point of view[7].
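The point about small trials can be made concrete with a short Python sketch, assuming Normal theory and a conventional 90% interval on the log scale. It computes the standard error beyond which the interval is wider than the equivalence range, so that equivalence can never be declared and the chance of wrongly declaring it is zero. The function name is illustrative.

```python
from math import log
from statistics import NormalDist

def max_se_for_equivalence(level: float = 0.90) -> float:
    """Largest standard error at which the interval can still fit inside the limits."""
    width = log(1.25) - log(0.8)               # width of the equivalence range
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.645 for a 90% interval
    return width / (2 * z)

threshold = max_se_for_equivalence()
print(round(threshold, 4))
```

With a standard error above this threshold the interval has width 2z·se greater than the equivalence range, whatever the point estimate, which is the situation the criticism targets.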

The above argument is based on Normal theory tests. Horrendous complications are introduced by using the t-test if one departs from classical confidence intervals.

And don’t get me started on equivalence when concentration in the blood is irrelevant but a pharmacodynamic outcome must be used instead!

So, what seems to be a simple problem turns out to be controversial and difficult. As I sometimes put it ‘equivalence is different’.

Here there be tygers!

*Head, Methodology and Statistics Group

Competence Center for Methodology and Statistics (CCMS)

Luxembourg

**References**

1. Senn, S.J., *Statistical issues in bioequivalence.* Statistics in Medicine, 2001. **20**(17-18): p. 2785-2799.

2. Kirkwood, T.B.L., *Bioequivalence testing – a need to rethink.* Biometrics, 1981. **37**: p. 589-591.

3. Westlake, W.J., *Use of confidence intervals in analysis of comparative bioavailability trials.* Journal of Pharmaceutical Sciences, 1972. **61**(8): p. 1340-1341.

4. O’Quigley, J. and C. Baudoin, *General approaches to the problem of bioequivalence.* The Statistician, 1988. **37**: p. 51-58.

5. Schuirmann, D.J., *A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability.* J Pharmacokinet Biopharm, 1987. **15**(6): p. 657-80.

6. Berger, R.L. and J.C. Hsu, *Bioequivalence trials, intersection-union tests and equivalence confidence sets.* Statistical Science, 1996. **11**(4): p. 283-302.

7. Perlman, M.D. and L. Wu, *The emperor’s new tests.* Statistical Science, 1999. **14**(4): p. 355-369.

References added by Editor for readers:

1. Senn SJ. Falsificationism and clinical trials [see comments]. Statistics in Medicine 1991; 10: 1679-1692.

2. Senn SJ. Inherent difficulties with active control equivalence studies. Statistics in Medicine 1993; 12: 2367-2375.

3. Senn SJ. Fisher’s game with the Devil. Statistics in Medicine 1994; 13: 217-230.


Over 100 patients signed up for the chance to participate in the clinical trials at Duke (2007-10) that promised a custom-tailored cancer treatment spewed out by a cutting-edge prediction model developed by Anil Potti, Joseph Nevins and their team at Duke. Their model purported to predict your probable response to one or another chemotherapy based on microarray analyses of various tumors. While they are now described as “false pioneers” of personalized cancer treatments, it’s not clear what has been learned from the fireworks surrounding the Potti episode overall. Most of the popular focus has been on glaring typographical and data processing errors; at least, that’s what I mainly heard about until recently. Although those errors were quite crucial to the science in this case (surely more so than Potti’s CV padding), what interests me now are the general methodological and logical concerns that rarely make it into the popular press. These revolve around the capability of the predictive model, and the back-and-forth criticisms and defenses of its reported error rates, both for so-called “internal validity” and especially for the intended recommendations on new patients. Even after the errors were exposed by Baggerly and Coombes (2007, 2009), the trials were allowed to continue (after a brief pause to let the Duke internal committee investigate; it found no problems). Surely they would have tested and validated a model on which they would be recommending chemo treatments and associated surgery; it couldn’t be that these human subjects were the first external tests of the model? Could it?

This is my first foray into the episode, and I don’t claim to have a worked-out view on the methodology (that’s the beauty of a blog, right?). Here, then, for your weekend reading, are some background materials in relation to this episode. For starters, there is what Baggerly and Coombes call “a starter kit”:

http://bioinformatics.mdanderson.org/Supplements/ReproRsch-All/Modified/StarterSet/index.html

Other key links and background will be found throughout this post. I’ll also be adding to this case later on.

**1.2.** **It’s Not My Fault if You Didn’t Apply My Method**

True or false? You can’t complain about not being able to reproduce my result if you haven’t used my method.

Well, suppose I’ve claimed to provide evidence of a genuine statistical effect or of a reliable statistical predictor, and applying “my method” for warranting my claim depends on such gambits as: leaving out unfriendly data, cherry picking, ignoring multiple testing or the like. Then you certainly can rightly complain about not being able to reproduce my result. That’s because my claim C (to have evidence of a genuine effect or reliable predictor) readily “passes the test”–by the lights of my method–even if C is false. Diederik Stapel had “a method” for assuring support for his social psychology hypotheses (i.e., inventing data), but no one would think the purported effects are justified by his method simply because we too could finagle data that “fit” his hypotheses.

On the other hand, we can imagine cases where it would be correct to complain that a perfectly valid method had been misapplied. So some distinctions are needed, and I will try to supply them. (I take this up in Part 2.)

**1.3 Potti and Nevins and Baggerly and Coombes:**

Potti and Nevins rejected the criticism by Baggerly and Coombes (B & C), which was based on B & C’s inability to reproduce the Potti and Nevins results, and denounced B & C’s allegation that the Potti and Nevins method does “not work”:

“When we apply the same methods but maintain the separation of training and test sets, predictions are poor….Simulations show that the results are no better than those obtained with randomly selected cell lines.” (Baggerly, Wang and Coombes, Nature Medicine, Nov 2007, p. 1277.)

To which Potti and Nevins responded:

“[T]hey suggest that our method of including both training and test data in the generation of metagenes (principal components) is flawed. We feel this approach is entirely appropriate, as it does not include any information regarding the actual patient response and thus does not influence the generation of the signature with respect to predicting patient outcome….In short, they reproduce our result when they use our method.” (Potti and Nevins, Nature Medicine, Nov 2007, p. 1277.)

The Institute of Medicine (IOM) Report (link below), growing out of the Duke episode, clearly appears to side with B & C:

“Candidate omics-based tests should be confirmed using an independent set of samples not used in the generation of the computational model, and when feasible, blinded to any outcome or phenotypic data until after the computational procedures have been locked down. …Ideally the specimens for independent confirmation will have been collected at a different point in time, at different institutions, from a different patient population, with samples processed in a different laboratory to demonstrate that the test has broad applicability and is not overfit to any particular situation.” (p. 36)

See Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials; Board on Health Care Services; Board on Health Sciences Policy; Institute of Medicine, *Evolution of Translational Omics: Lessons Learned and the Path Forward.*

*We will want to examine (in part 2) when it is warranted to claim a method “does not work” or “fails to reproduce”.*

**1.4 Steven McKinney letter to the IOM**

It was a recent comment on this blog by statistician Steven McKinney that led to my delving further into this case. He agreed to my posting a letter he supplied to the IOM committee (PAF Document 19) below, and to responding to reader questions on this blog. [It can be found at the Cancer Letter website page: www.cancerletter.com/downloads/20110107_2/download, item 3 in "Internal NCI documents (zip files 1, 2 and 3)".] So have a read, and I’ll come back to this in Part 2 later on.

——-

December 16, 2010

Steven McKinney, Ph.D.

Statistician

Molecular Oncology and Breast Cancer Program

British Columbia Cancer Research Centre

Vancouver B.C.

Canada

Christine M. Micheel, Ph.D.

Program Officer

Board on Health Care Services and National Cancer Policy Forum

Institute of Medicine

500 5th Street, NW, 767;

Washington, DC 20001, USA

Dear Dr. Micheel,

I have been following with interest and concern the development of events related to the three clinical trials (NCT00509366, NCT00545948, NCT00636441) currently under review by the Institute of Medicine (Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials).

I have reviewed many of the omics papers related to this issue, and wish to communicate my concerns to the review committee. In brief, my concern is that the methodology employed in the now retracted papers, and many others issued by the Duke group all use a flawed statistical analytical paradigm. Essentially the paradigm involves fitting a statistical model to all available study data then splitting the data into subsets, labeling one of them a “training” set, another a “validation” or “test” set, and showing that the statistical model works well for both sets. The analysis paradigm is described as a statistical train-test-validate exercise in several published papers, though it is technically not a true train-test-validate exercise as the model under evaluation involves predictor components derived from the full data set.

I believe that this issue needs to be investigated as part of the Institute of Medicine’s review, because concerned readers who have written letters to journal editors have not been successful in educating a wider audience (in particular journal editors and reviewers of biomedical journals) as to the problematic aspects of the analysis method that are repeatedly used by the Duke group. The issue at hand is not just one researcher who committed errors in one analysis, but rather the systematic use of a flawed analytical paradigm in multiple papers discussing personalized medicine in a widening scope of medical scenarios.

The statistical properties of this analytical paradigm, in particular its type I error rate, have not to my knowledge been reviewed or published. I respectfully request the IOM committee to include this issue in its agenda for the upcoming review, as findings from this committee will provide a broader educational opportunity, allowing journal editors and reviewers to have a better understanding of the statistical properties of the analyses repeatedly developed and submitted for publication by the Duke University investigators.

As a citizen of the United States and a taxpayer, and as a practicing biomedical applied statistician, I am especially concerned about the possibility that the funding garnered for such potentially flawed studies is detracting from other groups’ ability to obtain funding to perform valid research in the valuable arena of personalized medicine. Additionally, the use of human subjects in ongoing studies involving this methodology is ethically problematic.

I will discuss this issue in greater detail in the attached Appendix to this letter of concern, and provide citations to the literature illustrating the various aspects involved.

Thank you for your consideration of this matter.

Yours sincerely

Steven McKinney

Attachments: Appendix – Details of points of concern regarding the statistical analytical paradigm repeatedly used in personalized medicine research papers published by Duke University investigators.

**Appendix – **Details of points of concern regarding the statistical analytical paradigm repeatedly used in personalized medicine research papers published by Duke University investigators

In 2001 West et al. [1] published some details of a statistical analytical method involving “Bayesian regression models that provide predictive capability based on gene expression data”. In the Statistical Methods section of this paper they state that the “Analysis uses binary regression models combined with singular value decompositions (SVDs) and with stochastic regularization by using Bayesian analysis (M.W., unpublished work) as discussed and referenced in Experimental Procedures, which are published as supporting information on the PNAS web site.”

Given the current state of affairs, it is of concern that so many papers have been published using this methodology when some undetermined amount of the underlying theory is unpublished.

In the supporting information on the PNAS website, the authors state “Statistical Methods. The analysis uses standard binary regression models combined with singular value decompositions (SVDs), also referred to as singular factor decompositions, and with stochastic regularization using Bayesian analysis (1). It is beyond the scope here to provide full technical details, so the interested reader is referred to ref. 2, which extends ref. 3 from linear to binary regression models; these manuscripts are available at the Duke web site, www.isds.duke.edu/~mw. Some key details are elaborated here.”

It is unclear why it should be “beyond the scope” to include details of the analytical methods in the supporting information materials – typically this is precisely the place to provide such details. Fortunately the reference “ref. 2” cited above (West, M., Nevins, J. R., Marks, J. R., Spang, R. & Zuzan, H. (2000) *German conference on Bioinformatics*, in press.) is still available as an online publication in the electronic journal *In Silico Biology* (reference [2] below).

In the online journal article, the authors provide additional details about the analytical method, including the fact that “In a first step we fitted the regression model using the entire set of expression profiles and class assignments” (see the section titled “Probabilistic tumor classification”). This is a key point, and is precisely why the investigators’ continued publications claiming to have “validated” their analysis is false and deserves thorough statistical evaluation as part of the IOM review of these issues. When predictor variables derived from the entire set of data are used, it cannot be claimed that subsequent “validation” exercises are true cross-validation or out-of-sample evaluations of the model’s predictive capabilities, as the Duke investigators repeatedly state in publications.

In the same paragraph, the authors state “Note, that if we draw a decision line at a probability of 0.5 we obtain a perfect classification of all 27 tumors. However the analysis uses the true class assignments *z*_{1} … *z*_{27} of all the tumors. Hence, although the plot demonstrates a good fit of the model to the data it does not give us reliable indications for a good predictive performance. One might suspect that the method just “stores” the given class assignments in the parameter, . Indeed this would be the case if one uses binary regression for *n* samples and *n* predictors without the additional restrains introduced by the priors. That this suspicion is unjustified with respect to the Bayesian method can be demonstrated by out-of-sample predictions.”

I believe this is the key flaw in the reasoning behind this statistical analytical method. The authors state without proof (via theoretical derivation or simulation study) that this Bayesian method is somehow immune to the issue of overfitting a model to a set of data. This is the aspect of this analytical paradigm that truly needs a sound statistical evaluation, so that a determination as to the true predictive capacity of this method can be scientifically demonstrated.

The authors state further in the Discussion section that “Clearly, the methodology is not limited to only this medical context nor is it specialized to diagnostic questions only. We have applied our model to the problem of predicting the nodal status of breast tumors based on expression profiles of tissue samples form the primary tumor. The results are reported in West et al., 2001. Due to the very general setting of our model, we expect it would be successful for a large class of diagnostic problems in various fields of medicine.”

Interestingly, the supporting material cited in [1] actually references [1]. This again is an issue of concern.

Also now of concern is the realization of the authors’ prediction that they expect the method to be applicable to a large class of diagnostic problems in various fields of medicine. Indeed the authors have used this methodology in a widening scope of medical fields, as will be outlined below. That this methodology has been accepted for publication in many journals over many years, before its statistical properties have truly been investigated, is indeed an issue of concern.

I believe that part of the reason that journal editors and reviewers have not questioned the methodology is that the method uses primarily Bayesian statistical models, which are not as widely taught or understood in biological and medical higher education. It is difficult for many non-statisticians to follow the statistical logic and mathematical aspects of such complex Bayesian methods.

Thus the authors clearly describe that their paradigm is to fit a model to an entire data set, derive a set of predictors from that model, then use those predictors along with others on subsets of the entire data set. They state that the excellent performance of such models is validated, when it appears that what is actually demonstrated is that a model overfitted to an entire data set performs well on subsets of that entire data set. This issue should in my opinion be a key issue of concern in this IOM review of this omics methodology.

In early 2006, an apparently seminal paper was published by the Duke investigators in the journal *Nature* (Bild et al. [3]). This was followed by a publication in the *New England Journal of Medicine* (Potti et al. [4]) and another in *Nature Medicine* (Potti et al. [5]). All of these papers cite West et al. [1] and use the methodology therein. References [3] and [5] discuss breast cancer, and reference [4] discusses lung cancer. All use the same analytical paradigm, fitting an initial model to all available data to develop predictors (called “metagenes” in [3] and [4], then “gene expression signatures” in [5]) based on a singular value decomposition of the entire data set; then using these predictors on various subsets of the data involved and calling some portion of this subset analysis a “validation” exercise.

At this point researchers at the M.D. Anderson clinic explored the possibility of adapting this analytical paradigm, and asked statisticians Keith Baggerly and Kevin Coombes to review the publications. Their investigations are of course key in shedding light on this issue. In 2007 Baggerly and Coombes published a letter in the Correspondence section of *Nature Medicine* (Coombes et al. [6]). Coombes et al. state “Their software does not maintain the independence of training and test sets, and the test data alter the model. Specifically, their software uses ‘metagenes’: weighted combinations of individual genes. Weights are assigned through a singular value decomposition (SVD). Their software applies SVD to the training and test data simultaneously, yielding different weights than when SVD is applied only to the training data (Supplementary Report 9). Even using this more extensive model, however, we could not reproduce the reported results.” and further state that “When we apply the same methods but maintain the separation of training and test sets, predictions are poor (Fig. 1 and Supplementary Report 7). Simulations show that the results are no better than those obtained with randomly selected cell lines (Supplementary Report 8).”
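The point that test data “alter the model” can be illustrated with a toy calculation. The sketch below is not the Duke software and not Coombes et al.'s analysis; the data are simulated, and the helper names (`first_metagene_weights`, `make_samples`) and every setting are hypothetical. It finds the leading singular vector by power iteration, once from training samples alone and once from training and test samples pooled.

```python
import random

def first_metagene_weights(rows, iters=200):
    """Gene weights of the leading right singular vector of a
    samples-by-genes matrix, via power iteration on column-centered data."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    x = [[r[j] - means[j] for j in range(p)] for r in rows]
    w = [1.0 / p ** 0.5] * p
    for _ in range(iters):
        scores = [sum(xi[j] * w[j] for j in range(p)) for xi in x]          # X w
        w = [sum(x[i][j] * scores[i] for i in range(n)) for j in range(p)]  # X^T (X w)
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
    return w

def make_samples(rng, n, loadings, factor_sd, noise_sd=0.3):
    """Simulated expression: one latent factor times gene loadings, plus noise."""
    rows = []
    for _ in range(n):
        f = rng.gauss(0, factor_sd)
        rows.append([f * a + rng.gauss(0, noise_sd) for a in loadings])
    return rows

rng = random.Random(0)
# Hypothetical setup: training variation sits in genes 0-3,
# while the test samples carry a stronger factor in genes 4-7.
train = make_samples(rng, 20, [1, 1, 1, 1, 0, 0, 0, 0], factor_sd=1.0)
test = make_samples(rng, 20, [0, 0, 0, 0, 1, 1, 1, 1], factor_sd=2.0)

w_train_only = first_metagene_weights(train)
w_combined = first_metagene_weights(train + test)
# Cosine similarity between the two weight vectors (both have unit length)
cosine = sum(a * b for a, b in zip(w_train_only, w_combined))
```

In this contrived setting the pooled weights point in a very different direction from the training-only weights, so any downstream model built on the “metagene” has been shaped by the test samples even though no outcome labels were used.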

Thus Coombes et al. have performed some initial analysis that sheds light on the true type I and type II error rates of this methodology. What is unclear from the work of Coombes et al. is the degree of departure from the null condition of no difference between groups of interest in the data sets used, so that the power of the statistical method can be properly evaluated. This is why a careful systematic study of this methodology, using known null data (data with equivalent distributional characteristics between groups of interest) and known non-null data (data with increasing levels of differential characteristics between groups of interest) is required, so that power characteristics of the methodology can be measured under null and non-null conditions. Further, such analysis needs to properly evaluate model performance on true out-of-sample data.

In 2007, the Duke group published another heavily cited paper (Hsu et al. [7], recently retracted on November 16, 2010). SVD components developed for this paper were termed “gene expression signatures”. All of these papers share the attribute that excessive claims of model accuracy are repeatedly asserted, with purported evidence from exercises termed “cross-validation”.

Recently, additional papers from the Duke investigators have been published concerning viral infection (Zaas et al. [8]) and bacterial infection (Zaas et al. [9]). Statnikov et al. [10] submitted a letter challenging this methodology once again, stating “We suggest several approaches to improve the analysis protocol that led to discovery of the acute respiratory viral response signature. First, to obtain an unbiased estimate of predictive accuracy, genes should be selected using the training set of subjects as opposed to selecting genes from the entire data set as was done in the study of Zaas et al. (2009). The latter gene selection procedure is known to typically lead to overoptimistic predictive accuracy estimates. Second, the cross-validation procedure employed by Zaas et al. should be modified to prohibit the use of samples from the same subjects both for developing signature and estimating its predictive accuracy, as this is another potential source of over-optimism.”
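The over-optimism from selecting genes on the entire data set is easy to reproduce on pure noise. The following is a small simulation sketch, not a reanalysis of any Duke data: the sample sizes, the mean-difference gene filter, the nearest-centroid classifier, and all function names are assumptions chosen to keep the example short. Class labels are unrelated to the simulated expression values, so an honest procedure should score about 50%.

```python
import random

def top_genes(x, y, k):
    """Indices of the k genes with the largest absolute class mean difference."""
    def gap(j):
        a = [row[j] for row, lab in zip(x, y) if lab == 1]
        b = [row[j] for row, lab in zip(x, y) if lab == 0]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(len(x[0])), key=gap, reverse=True)[:k]

def centroid_accuracy(x_tr, y_tr, x_te, y_te, genes):
    """Test-set accuracy of a nearest-centroid classifier on the chosen genes."""
    def centroid(lab):
        rows = [r for r, l in zip(x_tr, y_tr) if l == lab]
        return [sum(r[j] for r in rows) / len(rows) for j in genes]
    c0, c1 = centroid(0), centroid(1)
    hits = 0
    for r, lab in zip(x_te, y_te):
        d0 = sum((r[j] - c) ** 2 for j, c in zip(genes, c0))
        d1 = sum((r[j] - c) ** 2 for j, c in zip(genes, c1))
        hits += int((1 if d1 < d0 else 0) == lab)
    return hits / len(y_te)

def simulate(reps=30, n=40, p=500, k=10, seed=1):
    """Average test accuracy with leaky versus training-only gene selection."""
    rng = random.Random(seed)
    leaky = honest = 0.0
    for _ in range(reps):
        x = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        y = [i % 2 for i in range(n)]  # labels carry no signal: a known null
        half = n // 2
        x_tr, y_tr, x_te, y_te = x[:half], y[:half], x[half:], y[half:]
        # Leaky: genes chosen using all samples, test set included
        leaky += centroid_accuracy(x_tr, y_tr, x_te, y_te, top_genes(x, y, k))
        # Honest: genes chosen using the training samples only
        honest += centroid_accuracy(x_tr, y_tr, x_te, y_te, top_genes(x_tr, y_tr, k))
    return leaky / reps, honest / reps

leaky_acc, honest_acc = simulate()
```

The leaky estimate should come out well above chance even though there is nothing to find, while the training-only estimate should hover near one half. Adding a known shift to some genes in one class would turn the same harness into a check under non-null conditions.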

The Duke investigators, as with all previous challenges, offer only verbal refutations to these points, with no formal statistical evaluation via simulation or otherwise to address the true distributional properties of this method.

More recent papers from the Duke investigators that should be reviewed include Chen et al. [11] and Chen et al. [12]. Additional complex Bayesian methods continue to be combined around the same analytical paradigm, and it is beyond the capability of many journal editors and reviewers to understand and deconstruct the arguments offered by the Duke investigators.

Additionally, with the apparent weight of so many seemingly accurate analysis results, resources such as research grants from federal agencies are being utilized without proper understanding of the value of the returns. Moreover, several studies on humans (the clinical trials currently scheduled for review, and the viral infection studies described in references [8] and [9]) have been conducted based on methodology with as yet unknown statistical properties. This is an issue of major concern, and a review of the statistical properties of the methodology used throughout these studies along with a publication of guidelines for evaluation of whether or not human trials involving this methodology should be permitted would be very valuable to the research community.

[McKinney References: see page 7 of PAF Document 19]

I’ll come back to this in “Potti Training and Test Data” Part 2. Please share your thoughts.

_____________

**References:**

- Baggerly & Coombes. (2009). Deriving Chemosensitivity from cell lines: Forensic Bioinformatics and reproducible research in high-throughput Biology, *Ann. of Appl. Stat.*, Vol. 3, No. 4 (Dec. 2009), pp. 1309-1334. [Starter Kit Webpage supplement for B&C 2009: http://bioinformatics.mdanderson.org/Supplements/ReproRsch-All/Modified/StarterSet/index.html]
- Baggerly, Coombes, Neeley (2008) Run Batch Effects Potentially Compromise the Usefulness of Genomic Signatures for Ovarian Cancer. *JCO* March 1, 2008:1186-1187.
- Coombes, Wang & Baggerly. (2007). “Microarrays: retracing steps.” *Nat. Med*. Nov 13(11):1276-7.
- Dressman, Potti, Nevins & Lancaster (2008) In Reply. *JCO* March 1, 2008:1187-1188.
- McShane (2010). NCI Address to Institute of Medicine Committee Convened to Review Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials. *PAF 20*.
- Micheel et al (Eds) Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials; Board on Health Care Services; Board on Health Sciences Policy; Institute of Medicine (2012). *Evolution of Translational Omics: Lessons Learned and the Path Forward.* Nat. Acad. Press.
- Potti et al. (2006). Genomic signatures to guide the use of chemotherapeutics. *Nat. Med*. Nov 12(11):1294-300. Epub 2006 Oct 22.
- Potti and Nevins (2007) Reply to Coombes, Wang & Baggerly. *Nat. Med*. Nov 13(11):1277-8.
- Spang et al. (2002) Prediction And Uncertainty Gene Expression Profiles. *In Silico Biology* 2, 0033.


Today is Allan Birnbaum’s Birthday. Birnbaum’s (1962) classic “On the Foundations of Statistical Inference” is in *Breakthroughs in Statistics* (volume I, 1993). I’ve a hunch that Birnbaum would have liked my rejoinder to discussants of my forthcoming paper (*Statistical Science*): **Bjornstad, Dawid, Evans, Fraser, Hannig,** and **Martin and Liu.** I hadn’t realized until recently that all of this is up under “future papers” here [1]. You can find the rejoinder: **STS1404-004RA0-2**. That takes away some of the surprise of having it all come out at once (and in final form). For those unfamiliar with the argument, at the end of this entry are slides from a recent, entirely informal, talk that I never posted, as well as some links from this blog. **Happy Birthday Birnbaum!**

My Rejoinder

**I. Introduction**

I am honored and grateful to have so many interesting and challenging comments on my paper. I want to thank the discussants for their willingness to jump back into the thorny quagmire of Birnbaum’s argument. To a question raised in the paper “Does it matter?”, these discussions show the answer is yes. The enlightening connections to contemporary projects are especially valuable in galvanizing future efforts to address foundational issues in statistics.

As long-standing as Birnbaum’s result has been, Birnbaum himself went through dramatic shifts in a short period of time following his famous (1962) result. More than of historical interest, these shifts provide a unique perspective on the current problem.

Already in the rejoinder to Birnbaum (1962), he is worried about criticisms (by Pratt 1962) pertaining to applying WCP to his constructed mathematical mixtures (what I call Birnbaumization), and hints at replacing WCP with another principle (Irrelevant Censoring). Then there is a gap until around 1968 at which point Birnbaum declares the SLP plausible “only in the simplest case, where the parameter space has but two” predesignated points (1968, 301). He tells us in Birnbaum (1970a, 1033) that he has pursued the matter thoroughly leading to “rejection of both the likelihood concept and various proposed formalizations of prior information”. The basis for this shift is that the SLP permits interpretations that “can be seriously misleading with high probability” (1968, 301). He puts forward the “confidence concept” (Conf) which takes from the Neyman-Pearson (N-P) approach “techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” while supplying it an evidential interpretation (1970a, 1033). Given the many different associations with “confidence,” I use (Conf) in this Rejoinder to refer to Birnbaum’s idea. Many of the ingenious examples of the incompatibilities of SLP and (Conf) are traceable back to Birnbaum, optional stopping being just one (see Birnbaum 1969). A bibliography of Birnbaum’s work is Giere 1977. Before his untimely death (at 53), Birnbaum denies the SLP even counts as a principle of evidence (in Birnbaum 1977). He thought it anomalous that (Conf) lacked an explicit evidential interpretation even though, at an intuitive level, he saw it as the “one rock in a shifting scene” in statistical thinking and practice (Birnbaum 1970, 1033). I return to this in part IV of this rejoinder.

**II. Bjornstad, Dawid, and Evans**

Let me begin by answering the central criticisms that, if correct, would be obstacles to what I purport to have shown in my paper. It is entirely understandable that leading voices in a long-lived controversy would assume that all of the twists and turns, avenues and roadways, have already been visited, and that no new flaw in the argument could enter to shake up the debate. I say to the reader that the surest sign that the issue is unsettled is that my critics disagree among themselves about the puzzle and even the key principles under discussion: the WCP, and in one case, the SLP itself.

……

**IV. Post-SLP foundations**

Return to where we left off in the opening section of this rejoinder: Birnbaum (1969).

“The problem-area of main concern here may be described as that of determining precise *concepts of statistical evidence* (systematically linked with mathematical models of experiments), concepts which are to be *non-Bayesian, non-decision-theoretic, and significantly relevant to statistical practice.*” (Birnbaum 1969, 113)

Given Neyman’s behavioral decision construal, Birnbaum claims that “when a confidence region estimate is interpreted as statistical evidence about a parameter” (1969, p. 122), an investigator has necessarily adjoined a concept of evidence, (Conf), that goes beyond the formal theory. What is this evidential concept? The furthest Birnbaum gets in defining (Conf) is in his posthumous article (1977):

(Conf) A concept of statistical evidence is not plausible unless it finds ‘strong evidence for *H*_{2} against *H*_{1}’ with small probability (α) when *H*_{1} is true, and with much larger probability (1 – β) when *H*_{2} is true. (1977, 24)

On the basis of (Conf), Birnbaum reinterprets statistical outputs from N-P theory as strong, weak, or worthless statistical evidence depending on the error probabilities of the test (1977, 24-26). While this sketchy idea requires extensions in many ways (e.g., beyond pre-data error probabilities, and beyond the two hypothesis setting), the spirit of (Conf), that error probabilities qualify properties of methods which in turn indicate the warrant to accord a given inference, is, I think, a valuable shift of perspective. This is not the place to elaborate, except to note that my own twist on Birnbaum’s general idea is to appraise evidential warrant by considering the capabilities of tests to have detected erroneous interpretations, a concept I call *severity*. That Birnbaum preferred a propensity interpretation of error probabilities is not essential. What matters is their role in picking up how features of experimental design and modeling alter a method’s capabilities to control “seriously misleading interpretations”. Even those who embrace a version of probabilism may find a distinct role for a severity concept. Recall that Fisher always criticized the presupposition that a single use of mathematical probability must be competent for qualifying inference in all logical situations (1956, 47).

Birnbaum’s philosophy evolved from seeking concepts of evidence in degree of support, belief, or plausibility between statements of data and hypotheses to embracing (Conf) with the required control of misleading interpretations of data. The former view reflected the logical empiricist assumption that there exist context-free evidential relationships, a paradigm philosophers of statistics have been slow to throw off. The newer (post-positivist) movements in philosophy and history of science were just appearing in the 1970s. Birnbaum was ahead of his time in calling for a philosophy of science relevant to statistical practice; it is now long overdue!
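For readers who want numbers attached to (Conf), here is a minimal sketch in a two-point Normal setting. The setting itself (a Normal mean with known σ, *H*_{1}: μ = 0 versus *H*_{2}: μ = 1, and a cutoff on the sample mean) and the function name are my illustrative assumptions, not Birnbaum’s.

```python
from statistics import NormalDist

def error_probabilities(mu1, mu2, sigma, n, cutoff):
    """Error probabilities of the rule: report 'strong evidence for H2
    against H1' when the sample mean exceeds cutoff."""
    se = sigma / n ** 0.5
    # alpha: probability of the report when H1 (mean mu1) is true
    alpha = 1 - NormalDist(mu1, se).cdf(cutoff)
    # 1 - beta: probability of the report when H2 (mean mu2) is true
    power = 1 - NormalDist(mu2, se).cdf(cutoff)
    return alpha, power

# Hypothetical numbers: 25 observations, sigma = 1, cutoff at 0.4
alpha, power = error_probabilities(mu1=0.0, mu2=1.0, sigma=1.0, n=25, cutoff=0.4)
```

Here the cutoff sits two standard errors above the *H*_{1} mean and three below the *H*_{2} mean, so α is about 0.023 and 1 – β about 0.999, the pattern (Conf) asks for.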

“Relevant clarifications of the nature and roles of statistical evidence in scientific research may well be achieved by bringing to bear in systematic concert the scholarly methods of statisticians, philosophers and historians of science, and substantive scientists” (Birnbaum 1972, 861).

The paper itself is here.

Below are my slides from my May 2, 2014 presentation in the Virginia Tech Department of Philosophy 2014 Colloquium series:

“Putting the Brakes on the Breakthrough, or ‘How I used simple logic to uncover a flaw in a controversial 50 year old ‘theorem’ in statistical foundations taken as a ‘breakthrough’ in favor of Bayesian vs frequentist error statistics’”

Some previous posts on this topic can be found at the following links (and by searching this blog with key words):

- U-Phil: Blogging the Likelihood Principle, New Summary
- Don’t Birnbaumize that Experiment my friend–updated reblog
- New Version: On the Birnbaum Argument for the SLP: slides for my JSM talk

[1] I discovered, not long ago, that for months an uncorrected version was up at the *Statistical Science* page. I hope it didn’t confuse too many people.

Birnbaum, A. 1962. “On the Foundations of Statistical Inference.” In *Breakthroughs in Statistics*, edited by S. Kotz and N. Johnson, 1:478–518. Springer Series in Statistics 1993. New York: Springer-Verlag.
