Suppose you are reading about a statistically significant result *x* from a one-sided test T+ of the mean µ of a Normal distribution (H_{0}: µ ≤ 0 vs. H_{1}: µ > 0, with σ known).

I have heard some people say:

A. If the test’s power to detect alternative µ’ is very low, then the statistically significant *x* is *poor* evidence of a discrepancy (from the null) corresponding to µ’. (i.e., there’s poor evidence that µ > µ’.) *See point on language in notes.

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.

I have heard other people say:

B. If the test’s power to detect alternative µ’ is very low, then the statistically significant *x* is *good* evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that µ > µ’).

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.

**Which is correct, from the perspective of the (error statistical) philosophy, within which power and associated tests are defined?**

Allow that the test assumptions are adequately met. I have often said on this blog, and I repeat, the most misunderstood and abused (or unused) concept from frequentist statistics is that of a test’s power to reject the null hypothesis under the assumption alternative µ’ is true: POW(µ’). I deliberately write it in this correct manner because it is faulty to speak of the power of a test without specifying against what alternative it’s to be computed. It will also get you into trouble if you define power as in the first premise in a recent post:

the probability of correctly rejecting the null

–which is both ambiguous and fails to specify the all important *conjectured* alternative. [For handholding slides on power, please see this post.] That you compute power for several alternatives is not the slightest bit problematic; it’s precisely what you *want* to do in order to assess the test’s capability to detect discrepancies. If you knew the true parameter value, why would you be running an inquiry to make statistical inferences about it?

It must be kept in mind that inferences are going to be in the form of µ > µ’ = µ_{0} + δ, or µ < µ’ = µ_{0} + δ, or the like. They are *not* to point values! (Not even to the point µ = M_{0}.) Most simply, you may consider that the inference is in terms of the one-sided lower confidence bound (for various confidence levels)–the dual for test T+.

**DEFINITION**: POW(T+, µ’) = POW(Test T+ rejects *H*_{0}; µ’) = Pr(M > M*; µ’), where M is the sample mean and M* is the cut-off for rejection. (Since it’s continuous it doesn’t matter if we write > or ≥.) I’ll leave off the T+ and write POW(µ’).

In terms of P-values: POW(µ’) = Pr(P < p*; µ’) where P < p* corresponds to rejecting the null hypothesis at the given level.

**Let σ = 10, n = 100, so (σ/√n) = 1.** (Nice and simple!)

**Test T+ rejects H_{0} at the ~.025 level if M > 2.**
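These numbers can be checked directly. A minimal sketch in Python (standard library only; the helper names `Phi` and `pow_mu` are mine, not from the post):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma, n = 10, 100
se = sigma / sqrt(n)          # sigma/sqrt(n) = 1
cutoff = 2.0                  # M*: reject H0 when M > 2

alpha = 1 - Phi((cutoff - 0) / se)   # type I error probability at mu = 0
print(round(alpha, 4))               # 0.0228, i.e. the "~.025 level"

def pow_mu(mu):
    """POW(mu') = Pr(M > M*; mu') for test T+."""
    return 1 - Phi((cutoff - mu) / se)
```

With `pow_mu` in hand, the power against any conjectured alternative is one line, e.g. `pow_mu(0.25)` and `pow_mu(3)` reproduce the .04 and .84 figures used below.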

**CASE 1:** We need a µ’ such that POW(µ’) = low. The power against alternatives between the null and the cut-off M* will range from α to .5. Consider the power against the null:

1. POW(µ’ = 0) = α = .025. Since the probability of M > 2, under the assumption that µ = 0, is low, the significant result indicates µ > 0. That is, since power against µ = 0 is low, the statistically significant result is a good indication that µ > 0. Equivalently, 0 is the lower bound of a .975 confidence interval.

2. For a second example of low power that does not use the null: we get power of .04 if µ’ = M* – 1.75(σ/√n) units–which in this case is (2 – 1.75) = .25. That is, POW(.25) = .04.[ii] Equivalently, µ > .25 is the lower confidence bound at level .96 (this is the CI that is dual to the test T+).
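The duality claimed in point 2 can be verified numerically: the power against µ’ and the confidence level at which µ’ serves as the lower bound sum to one. A quick sketch (standard library only; `Phi` is just my name for the normal CDF):

```python
from math import erf, sqrt

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))

cutoff, se = 2.0, 1.0                 # M* = 2, sigma/sqrt(n) = 1

mu_prime = cutoff - 1.75 * se         # = 0.25
power = 1 - Phi((cutoff - mu_prime) / se)       # POW(0.25)
conf_level = Phi((cutoff - mu_prime) / se)      # level of the CI whose lower bound is mu'

print(round(power, 2), round(conf_level, 2))    # 0.04 0.96
```

So low power against µ’ corresponds to a high-level lower confidence bound µ > µ’, exactly as the post says.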

**CASE 2:** We need a µ’ such that POW(µ’) = high. Using one of our power facts, POW(M* + 1(σ/√n)) = .84.

3. That is, adding one (σ/√n) unit to the cut-off M* takes us to an alternative against which the test has power = .84. So µ’ = 2 + 1 = 3 will work: POW(T+, µ’ = 3) = .84. See this post. Should we say that the significant result is a good indication that µ > 3? No: the corresponding confidence level would be only .16. Pr(M > 2; µ = 3) = Pr(Z > –1) = .84. It would be *terrible* evidence for µ > 3!
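The .84 and .16 figures here come from the same normal-tail computation. A short sketch (standard library only; variable names are mine):

```python
from math import erf, sqrt

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))

cutoff, se = 2.0, 1.0
mu_prime = 3.0

power = 1 - Phi((cutoff - mu_prime) / se)   # Pr(M > 2; mu = 3) = Pr(Z > -1)
sev = Phi((cutoff - mu_prime) / se)         # Pr(M <= 2; mu = 3): how well mu > 3 has been probed

print(round(power, 2), round(sev, 2))       # 0.84 0.16
```

High power against µ’ = 3 goes hand in hand with a mere .16 confidence level for the claim µ > 3, which is why the significant result is terrible evidence for it.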

Blue curve is the null, red curve is one possible conjectured alternative: µ = 3. Green area is power, little turquoise area is α.

As Stephen Senn points out (in my favorite of his posts), the alternative against which we set high power is the discrepancy from the null that “we should not like to miss”, delta (Δ). Δ is not the discrepancy we may infer from a significant result (in a test where POW(Δ) = .84).

**So the correct answer is B.**

**Does A hold true if we happen to know (based on previous severe tests) that µ < µ’? I’ll return to this.**

*Point on language: “to detect alternative µ'” means, “produce a statistically significant result when µ = µ’.” It does not mean we infer µ’. Nor do we know the underlying µ’ after we see the data. Perhaps the strict definition should be employed unless one is clear on this. The power of the test to detect µ’ just refers to the probability the test would produce a result that rings the significance alarm, if the data were generated from a world or experiment where µ = µ’.

[i] I surmise, without claiming a scientific data base, that this fallacy has been increasing over the past few years. It was discussed way back when in Morrison and Henkel (1970). (A relevant post relates to a Jackie Mason comedy routine.) Research was even conducted to figure out how psychologists could be so wrong. Wherever I’ve seen it, it’s due to (explicitly or implicitly) transposing the conditional in a Bayesian use of power. For example, (1 – β)/ α is treated as a kind of likelihood in a Bayesian computation. I say this is unwarranted, even for a Bayesian’s goal, see 2/10/15 post below.

[ii] Pr(M > 2; µ = .25 ) = Pr(Z > 1.75) = .04.

OTHER RELEVANT POSTS ON POWER

- (6/9) U-Phil: Is the Use of Power* Open to a Power Paradox?
- (3/4/14) Power, power everywhere–(it) may not be what you think! [illustration]
- (3/12/14) Get empowered to detect power howlers
- 3/17/14 Stephen Senn: “Delta Force: To what Extent is clinical relevance relevant?”
- (3/19/14) Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
- (12/29/14) To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)
- (01/03/15) No headache power (for Deirdre)
- (02/10/15) What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?

Filed under: confidence intervals and tests, power, Statistics ]]>

**Stephen Senn**

*Head of Competence Center for Methodology and Statistics (CCMS)*

*Luxembourg Institute of Health*

*This post first appeared here.* An issue sometimes raised about randomized clinical trials is the problem of indefinitely many confounders. This, for example is what John Worrall has to say:

Even if there is only a small probability that an individual factor is unbalanced, given that there are indefinitely many possible confounding factors, then it would seem to follow that the probability that there is some factor on which the two groups are unbalanced (when remember randomly constructed) might for all anyone knows be high. (Worrall J. What evidence is evidence-based medicine? *Philosophy of Science* 2002; 69: S316–S330; see p. S324)

It seems to me, however, that this overlooks four matters. The first is that it is not indefinitely many variables we are interested in but only one, albeit one we can’t measure perfectly. This variable can be called ‘outcome’. We wish to see to what extent the difference observed in outcome between groups is compatible with the idea that chance alone explains it. The indefinitely many covariates can help us predict outcome but they are only of interest to the extent that they do so. However, although we can’t measure the difference we would have seen in outcome between groups in the absence of treatment, we can measure how much it varies within groups (where the variation cannot be due to differences between treatments). Thus we can say a great deal about random variation to the extent that group membership is indeed random.

The second point is that in the absence of a treatment effect, where randomization has taken place, the statistical theory predicts probabilistically how the variation in outcome between groups relates to the variation within. The third point, strongly related to the other two, is that statistical inference in clinical trials proceeds using ratios. The F statistic produced from Fisher’s famous analysis of variance is the *ratio* of the variance between to the variance within and calculated using observed outcomes. (The ratio form is due to Snedecor but Fisher’s approach using semi-differences of natural logarithms is equivalent.) The critics of randomization are talking about the effect of the unmeasured covariates on the *numerator* of this ratio. However, any factor that could be imbalanced *between* groups could vary strongly *within* and thus while the numerator would be affected, so would the denominator. Any Bayesian will soon come to the conclusion that, given randomization, coherence imposes strong constraints on the degree to which one expects an unknown something to inflate the numerator (which implies not only differing between groups but also, coincidentally, having predictive strength) but not the denominator.

The final point is that statistical inferences are probabilistic: either about statistics in the frequentist mode or about parameters in the Bayesian mode. Many strong predictors varying from patient to patient will tend to inflate the variance within groups; this will be reflected in due turn in wider confidence intervals for the estimated treatment effect. It is not enough to attack the estimate. Being a statistician means never having to say you are certain. It is not the estimate that has to be attacked to prove a statistician a liar, it is the certainty with which the estimate has been expressed. We don’t call a man a liar who claims that with probability one half you will get one head in two tosses of a coin just because you might get two tails.
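Senn's ratio argument can be illustrated with a small simulation (my own sketch, not from his post): even when the outcome is driven entirely by many unmeasured covariates, randomized allocation keeps the test's type I error rate near its nominal level, because the covariates inflate the within-group variance along with any chance between-group difference. The sketch below assumes a two-arm trial with no treatment effect and uses a normal approximation to the two-sample t-test:

```python
import random
from math import sqrt
from statistics import mean, variance

random.seed(1)

def one_trial(n_per_arm=50, n_covariates=20):
    """One randomized trial: outcome = sum of many unmeasured covariate
    effects; no treatment effect at all."""
    patients = [sum(random.gauss(0, 1) for _ in range(n_covariates))
                for _ in range(2 * n_per_arm)]
    random.shuffle(patients)                       # randomized allocation
    a, b = patients[:n_per_arm], patients[n_per_arm:]
    z = (mean(a) - mean(b)) / sqrt(variance(a) / n_per_arm +
                                   variance(b) / n_per_arm)
    return abs(z) > 1.96                           # reject at nominal two-sided .05

rejection_rate = sum(one_trial() for _ in range(500)) / 500
print(rejection_rate)    # stays close to the nominal 0.05
```

The 20 covariates make both the numerator (between-group difference) and the denominator (within-group spread) of the test statistic larger, so the ratio, and hence the error rate, is unaffected, which is exactly Senn's point.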

Filed under: RCTs, S. Senn, Statistics Tagged: confounders, Evidence-based medicine ]]>

**MONTHLY MEMORY LANE: 3 years ago: July 2012.** I mark in **red** three posts that seem most apt for general background on key issues in this blog.[1] This new feature, appearing the last week of each month, began at the blog’s 3-year anniversary in Sept, 2014. (Once again it was tough to pick just 3; please check out others which might interest you, e.g., Schachtman on StatLaw, the machine learning conference on simplicity, the story of Lindley and particle physics, Glymour and so on.)

**July 2012**

- (7/1) PhilStatLaw: “Let’s Require Health Claims to Be ‘Evidence Based’” (Schachtman)
- (7/2) More from the Foundations of Simplicity Workshop*
- (7/3) Elliott Sober Responds on Foundations of Simplicity
- (7/4) Comment on Falsification
- (7/6) Vladimir Cherkassky Responds on Foundations of Simplicity
- (7/8) Metablog: Up and Coming
- **(7/9) Stephen Senn: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics**
- (7/10) PhilStatLaw: Reference Manual on Scientific Evidence (3d ed) on Statistical Significance (Schachtman)
- (7/11) Is Particle Physics Bad Science?
- (7/12) Dennis Lindley’s “Philosophy of Statistics”
- (7/15) Deconstructing Larry Wasserman – it starts like this…
- (7/16) Peter Grünwald: Follow-up on Cherkassky’s Comments
- (7/19) New Kvetch Posted 7/18/12
- (7/21) “Always the last place you look!”
- (7/22) Clark Glymour: The Theory of Search Is the Economics of Discovery (part 1)
- (7/23) Clark Glymour: The Theory of Search Is the Economics of Discovery (part 2)
- (7/27) P-values as Frequentist Measures
- **(7/28) U-PHIL: Deconstructing Larry Wasserman** (connects to 7/15)
- **(7/31) What’s in a Name? (Gelman’s blog)**

**[1] **excluding those recently reblogged. Posts that are part of a “unit” or a group of “U-Phils” count as one.

Filed under: 3-year memory lane, Statistics ]]>

Someone linked this to me on Twitter. I thought it was a home blog at first. Surely the U.S. Dept of Health and Human Services can give a better definition than this.

U.S. Department of Health and Human Services

Effective Health Care Program

Glossary of Terms

We know that many of the concepts used on this site can be difficult to understand. For that reason, we have provided you with a glossary to help you make sense of the terms used in Comparative Effectiveness Research. Every word that is defined in this glossary should appear highlighted throughout the Web site…..

Statistical Significance

Definition: A mathematical technique to measure whether the results of a study are likely to be true. *Statistical significance* is calculated as the probability that an effect observed in a research study is occurring because of chance. Statistical significance is usually expressed as a P-value. The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true). Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

Example: For example, results from a research study indicated that people who had dementia with agitation had a slightly lower rate of blood pressure problems when they took Drug A compared to when they took Drug B. In the study analysis, these results were not considered to be *statistically significant* because p=0.2. The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

You can find it here. First of all, one should never use “likelihood” and “probability” in what is to be a clarification of formal terms, as these mean very different things in statistics. Some of the claims given actually aren’t so bad if “likely” takes its statistical meaning, but are all wet if construed as mathematical probability.

What really puzzles me is, how do they expect readers to understand the claims that appear within this definition? Are their meanings known to anyone? Watch:

**Statistical Significance **

- A mathematical technique to measure whether the results of a study are likely to be true.

**What does it mean to say “the results of a study are likely to be true”?**

*Statistical significance* is calculated as the probability that an effect observed in a research study is occurring because of chance.

**Meaning?**

- Statistical significance is usually expressed as a P-value.
- The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true).

**How should we define “more likely that the results are true”?**

- Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

**oy, oy**

- The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

**Oy, oy, oy.** **OK, I’ll turn this into a single “oy” and just suggest dropping “probably” (leaving the hypertext “probability”). But this was part of the illustration, not the definition.**

Surely it’s possible to keep to their brevity and do a better job than this, even though one would really want to explain about the types of null hypotheses, the test statistic, and the assumptions of the test (we aren’t told if their example is an RCT). I’ve listed how they might capture what I think they mean to say, off the top of my head. Submit your improvements, corrections and additions, and I’ll add them. Updates will be indicated with (ii), (iii), etc.

**Statistical Significance**

- A mathematical technique to measure whether the results of a study are likely to be true.

a) A statistical technique to measure whether the results of a study indicate the null hypothesis is false, that some *genuine* discrepancy from the null hypothesis exists.

*Statistical significance* is calculated as the probability that an effect observed in a research study is occurring because of chance.

a) The statistical significance of an observed difference is the probability of observing results as large as was observed, even if the null hypothesis is true.

b) The statistical significance of an observed difference is how frequently even larger differences than were observed would occur (through chance variability), even if the null hypothesis is true.

- Statistical significance is usually expressed as a P-value.

a) Statistical significance may be expressed as a P-value associated with an observed difference from a null hypothesis *H*_{0} within a given statistical test T.

- The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true).

a) The smaller the P-value, the less consistent the results are with the null hypothesis, and the more consistent they are with a genuine discrepancy from the null.

- Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

a) Researchers generally regard the results as inconsistent with the null if statistical significance is less than 0.05 (p<.05).

- (Part of the illustrative example): The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

a) The probability that even larger differences would occur due to chance variability (even if the null is true) is high enough to regard the result as consistent with the null being true.

**7/17/15 remark:** Maybe there’s a convention in this glossary that if the word is not in hypertext, it is being used informally. In that case, this might not be so bad. I’d remove “probably” to get:

b) The probability that the results were due to chance was high enough to conclude that the two drugs did not differ in causing blood pressure problems.

**7/17/15:** In (ii), in reaction to a comment, I replaced d_{obs} with “observed difference”, and cut out Pr(d ≥ d_{obs}; *H*_{0}). I also allowed that #6 wasn’t too bad, especially if (the non-hypertext) “probably” is removed. The only thing is, this was *not* part of the definition, but rather the illustration. So maybe this could be the basis for fixing the others in the definition itself.


Filed under: P-values, Statistics ]]>

- The power of a test is the probability of correctly rejecting the null hypothesis. Write it as 1 – β.
- So, the probability of incorrectly rejecting the null hypothesis is β.
- But the probability of incorrectly rejecting the null is α (the type 1 error probability).

So α = β.

I’ve actually seen this, and variants on it [i].

[i] Although they didn’t go so far as to reach the final, shocking, deduction.
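The howler dissolves as soon as β is actually computed: it is a function of the conjectured alternative, not a single fixed number, and it need not equal α. A sketch using the T+ test from the power post above (σ/√n = 1, cutoff M* = 2; standard library only, helper names mine):

```python
from math import erf, sqrt

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))

cutoff, se = 2.0, 1.0
alpha = 1 - Phi(cutoff / se)        # Pr(reject; H0), ~.023

def beta(mu):
    """Type II error at alternative mu: Pr(fail to reject; mu)."""
    return Phi((cutoff - mu) / se)

# beta varies with the alternative, and neither value equals alpha:
print(round(alpha, 3), round(beta(3.0), 2), round(beta(0.25), 2))  # 0.023 0.16 0.96
```

The slide from "power = probability of correctly rejecting" to "β = probability of incorrectly rejecting" conflates the type II error (failing to reject when the alternative is true) with the type I error, which is the whole trick.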

Filed under: Error Statistics, power, Statistics ]]>

2015: The Large Hadron Collider (LHC) is back in collision mode in 2015[0]. There’s a 2015 update, a virtual display, and links from ATLAS, one of two detectors at the LHC, here. The remainder is from one year ago (2014). *I’m reblogging a few of the Higgs posts at the anniversary of the 2012 discovery. (The first was in this post.) The following was originally “Higgs Analysis and Statistical Flukes: part 2” (from March, 2013).[1]*

Some people say to me: “This kind of reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out).[2] Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories.

**“Higgs Analysis and Statistical Flukes: part 2″**

Everyone was excited when the Higgs boson results were reported on July 4, 2012 indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of *excess events* of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post (part 1). It is an outsider’s angle on one small aspect of the statistical inferences involved. But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we [philosophers of statistics] should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels.

Here I keep close to an official report from ATLAS, in which researchers define a “global signal strength” parameter “such that μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the SM Higgs boson signal in addition to the background” (where SM is the Standard Model). The statistical test may be framed as a one-sided test, where the test statistic (which is actually a ratio) records differences in the positive direction, in standard deviation (sigma) units. Reports such as

Pr(Test T would yield at least a 5 sigma excess;

H_{0}: background only) = extremely low

are deduced from the sampling distribution of the test statistic, fortified with much cross-checking of results (e.g., by modeling and simulating relative frequencies of observed excesses generated with “Higgs signal +background” compared to background alone). The inferences, even the formal statistical ones, go beyond p-value reports. For instance, they involve setting lower and upper bounds such that values excluded are ruled out with high severity, to use my term. But the popular report is in terms of the observed 5 sigma excess in an overall test T, and that is mainly what I want to consider here.

*Error probabilities*

In a Neyman-Pearson setting, a cut-off c_{α} is chosen pre-data so that the probability of a type I error is low. In general,

Pr(d(X) > c_{α}; H_{0}) ≤ α

and in particular, alluding to an overall test T:

(1) Pr(Test T yields d(X) > 5 standard deviations; H_{0}) ≤ .0000003.

The test at the same time is designed to ensure a reasonably high probability of detecting global strength discrepancies of interest. (I always use “discrepancy” to refer to parameter magnitudes, to avoid confusion with observed differences).

[Notice these are not likelihoods.] Alternatively, researchers can report observed standard deviations (here, the sigmas), or equivalently, the associated observed statistical significance probability, *p*_{0}. In general,

Pr(P < p_{0}; H_{0}) < p_{0}

and in particular,

(2) Pr(Test T yields P < .0000003; H_{0}) < .0000003.

For test T to yield a “worse fit” with *H*_{0 }(smaller p-value) due to background alone is sometimes called “a statistical fluke” or a “random fluke”, and the probability of so statistically significant a random fluke is ~0. With the March 2013 results, the 5 sigma difference has grown to 7 sigmas.
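The .0000003 bound is just the upper tail of the standard normal distribution at 5 sigma, which is easy to check (standard library only; `upper_tail` is my helper name):

```python
from math import erfc, sqrt

def upper_tail(z):
    """Pr(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * erfc(z / sqrt(2))

print(upper_tail(5))   # ~2.87e-07, i.e. below the .0000003 in (1) and (2)
print(upper_tail(7))   # ~1.3e-12, the even tinier 7-sigma tail
```

So the probability of so statistically significant a "random fluke" under the background-only hypothesis really is of order one in a few million at 5 sigma, and about one in a trillion at 7.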

So probabilistic statements along the lines of (1) and (2) are standard. They allude to sampling distributions, either of test statistic *d*(**X**), or the p-value viewed as a random variable. They are scarcely illicit or prohibited. (I return to this in the last section of this post.)

*An implicit principle of inference or evidence*

Admittedly, the move to taking the 5 sigma effect as evidence for a genuine effect (of the Higgs-like sort) results from an implicit principle of evidence that I have been calling the severity principle (SEV). Perhaps the weakest form is to a statistical rejection or falsification of the null. (I will deliberately use a few different variations on statements that can be made.)

Data x_{0} from a test T provide evidence for rejecting H_{0} (just) to the extent that H_{0} would (very probably) have survived, were it a reasonably adequate description of the process generating the data (with respect to the question).

It is also captured by a general frequentist principle of evidence (FEV) (Mayo and Cox 2010), a variant on the general idea of severity (SEV) (EGEK 1996, Mayo and Spanos 2006, etc.).

The sampling distribution is computed, under the assumption that the production of observed results is similar to the “background alone”, with respect to relative frequencies of signal-like events. (Likewise for computations under hypothesized discrepancies.) The relationship between *H*_{0} and the probabilities of outcomes is an intimate one: the various statistical nulls live their lives to refer to aspects of general types of data generating procedures (for a taxonomy, see Cox 1958, 1977). “*H*_{0} is true” is a shorthand for a very long statement about the type of data generating procedure.

*Severity and the detachment of inferences*

The sampling distributions serve to give counterfactuals. In this case, they tell us what it would be like, statistically, were the mechanism generating the observed signals similar to *H*_{0}.[i] While one would want to go on to consider the probability test T yields so statistically significant an excess under various alternatives to μ = 0, this suffices for the present discussion. Sampling distributions can be used to arrive at error probabilities that are relevant for understanding the capabilities of the test process, in relation to something we want to find out. *Since a relevant test statistic is a function of the data and quantities about which we want to learn, the associated sampling distribution is the key to inference*. (This is why the bootstrap, and other types of, re-sampling works when one has a random sample from the process or population of interest.)

The *severity principle*, put more generally:

Data from a test T[ii] provide good evidence for inferring H (just) to the extent that H passes severely with x_{0}, i.e., to the extent that H would (very probably) not have survived the test so well were H false.

(The severity principle can also be made out just in terms of relative frequencies, as with bootstrap re-sampling.) In this case, what is surviving is minimally the non-null. Regardless of the specification of a statistical inference, to assess the severity associated with a claim *H* requires considering *H*‘s denial: together they exhaust the answers to a given question.

Without making such a principle explicit, some critics assume the argument is all about the reported p-value. The inference actually **detached** from the evidence can be put in any number of ways, and no uniformity is to be expected or needed:

(3) There is strong evidence for H: a Higgs (or a Higgs-like) particle.

(3)’ They have experimentally demonstrated H: a Higgs (or Higgs-like) particle.

Or just, infer H.

Doubtless particle physicists would qualify these statements, but nothing turns on that. ((3) and (3)’ are a bit stronger than merely falsifying the null because certain properties of the particle must be shown. I leave this to one side.)

As always, the mere p-value is a pale reflection of the detailed information about the consistency of results that really fortifies the knowledge of a genuine effect. Nor is the precise improbability level what matters. We care about the inferences to real effects (and estimated discrepancies) that are warranted.

*Qualifying claims by how well they have been probed*

The inference is qualified by the statistical properties of the test, as in (1) and (2), but that does not prevent detaching (3). This much is shown: they are able to experimentally demonstrate the Higgs particle. They can take that much of the problem as solved and move on to other problems of discerning the properties of the particle, and much else that goes beyond our discussion. There is obeisance to the strict fallibility of every empirical claim, but there is no probability assigned. Neither is there in day-to-day reasoning, nor in the bulk of scientific inferences, which are not formally statistical. Having inferred (3), granted, one may say informally, “so probably we have experimentally demonstrated the Higgs”, or “probably, the Higgs exists” (?). Or an informal use of “likely” might arise. But whatever these might mean in informal parlance, they are not formal mathematical probabilities. (As often argued on this blog, discussions on statistical philosophy must not confuse these.)

[We can however write, SEV(H) ~1]

The claim in (3) is approximate and limited–as are the vast majority of claims of empirical knowledge and inference–and, moreover, we can say in just what ways. It is recognized that subsequent data will add precision to the magnitudes estimated, and may eventually lead to new and even entirely revised interpretations of the known experimental effects, models and estimates. That is what cumulative knowledge is about. (I sometimes hear people assert, without argument, that modeled quantities, or parameters, used to describe data generating processes are “things in themselves” and are outside the realm of empirical inquiry. This is silly. Else we’d be reduced to knowing only tautologies and maybe isolated instances as to how “I seem to feel now,” attained through introspection.)

*Telling what’s true about significance levels*

So we grant the critic that something like the severity principle is needed to move from statistical information plus background (theoretical and empirical) to inferences about evidence and inference (and to what levels of approximation). It may be called lots of other things and framed in different ways, and the reader is free to experiment. What we should not grant the critic is any allegation that there should be, or invariably is, a link from a small observed significance level to a small posterior probability assignment to *H*_{0}. Worse, (1 – the p-value) is sometimes alleged to be the posterior probability accorded to the Standard Model itself! This is neither licensed nor wanted!

If critics (or the p-value police, as Wasserman called them) maintain that Higgs researchers are misinterpreting their significance levels, correct them with the probabilities in (1) and (2). If they say, it is patently obvious that Higgs researchers want to use the p-value as a posterior probability assignment to *H*_{0}, point out the more relevant and actually attainable [iii] inference that is detached in (3). If they persist that what is really, really wanted is a posterior probability assignment to the inference about the Higgs in (3), ask why? As a formal posterior probability it would require a prior probability on all hypotheses that could explain the data. That would include not just the hypotheses under test but all possible rivals, including ones not yet even thought of.

Degrees of belief will not do. Many scientists perhaps had (and have) strong beliefs in the Standard Model before the big collider experiments—given its perfect predictive success. Others may believe (and fervently wish) that it will break down somewhere (showing supersymmetry or whatnot); a major goal of inquiry is learning about viable rivals and how they may be triggered and probed. Research requires an open world not a closed one with all possibilities trotted out and weighed by current beliefs. [v] We need to point up what has not yet been well probed which, by the way, is very different from saying of a theory that it is “not yet probable”.

*Those prohibited phrases*

One may wish to return to some of the condemned phrases of particular physics reports. Take,

“There is less than a one in a million chance that their results are a statistical fluke”.

This is not to assign a probability to the null, just one of many ways (perhaps not the best) of putting claims about the sampling distribution: The statistical null asserts that *H*_{0}: background alone adequately describes the process.

*H*_{0} does not assert the results are a statistical fluke, but it tells us what we need to determine the probability of observed results “under *H*_{0}”. In particular, consider all outcomes in the sample space that are further from the null prediction than the observed, in terms of p-values: {**x**: p(**x**) < p(**x**_{0})}, i.e., the set of outcomes whose p-value is smaller than the one actually observed.

I am repeating myself, I realize, in the hope that at least one phrasing will drive the point home. Nor is it even the improbability that substantiates this; it is the fact that an extraordinary set of coincidences would have to have occurred again and again. To nevertheless retain *H*_{0} as the source of the data would block learning. (Moreover, they know that if some horrible systematic mistake was made, it would be detected in later data analyses.)
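To make the sampling-distribution reading concrete, here is a minimal sketch (not from any actual Higgs analysis; the function name is mine and the 5-sigma input is simply the discovery threshold physicists cite) of the one-sided tail probability of a z-score under a Normal null:

```python
import math

def one_sided_p(z: float) -> float:
    """One-sided p-value: P(Z >= z) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# The "5 sigma" excess corresponds to a tail probability of
p_five_sigma = one_sided_p(5.0)
print(f"{p_five_sigma:.2e}")  # 2.87e-07 -- well under one in a million
```

The p-value is a probability computed *under H0* about outcomes; it is not a probability assigned *to* H0.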

I will not deny that there have been misinterpretations of p-values, but if a researcher has just described performing a statistical significance test, it would be “ungenerous” to twist probabilistic assertions into posterior probabilities. It would be a kind of “confirmation bias” whereby one insists on finding one sentence among very many that could conceivably be misinterpreted Bayesianly.

*Triggering, indicating, inferring*

As always, the error statistical philosopher would distinguish different questions at multiple stages of the inquiry. The aim of many preliminary steps is “behavioristic” and performance oriented: the goal being to control error rates on the way toward finding excess events or bumps of interest.

I hope it is (more or less) clear that blue is from 2015, burgundy is from 2014; black is old. If interested: *See statistical flukes (part 3)*

The original posts of parts 1 and 2 had around 30 comments each; you might want to look at them:

Part 1: http://errorstatistics.com/2013/03/17/update-on-higgs-data-analysis-statistical-flukes-1/

Part 2: http://errorstatistics.com/2013/03/27/higgs-analysis-and-statistical-flukes-part-2/

*Fisher insisted that to assert a phenomenon is experimentally demonstrable:*

“[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher, *Design of Experiments* 1947, 14)

2015/2014 Notes

[0] Physicists manage to learn quite a lot from negative results. They’d love to find something more exotic, but the negative results will not go away.

“Physicists aren’t just praying for hints of new physics, Strassler stresses. He says there is very good reason to believe that the LHC should find new particles. For one, the mass of the Higgs boson, about 125.09 billion electron volts, seems precariously low if the census of particles is truly complete. Various calculations based on theory dictate that the Higgs mass should be comparable to a figure called the Planck mass, which is about 17 orders of magnitude higher than the boson’s measured heft.” The article is here.

[1] My presentation at a Symposium on the Higgs discovery at the Philosophy of Science Association (Nov. 2014) is here.

[2] I have often noted that there are other times where we are trying to find evidence to support a previously held position.

___________

Original notes:

[i] This is a bit stronger than merely falsifying the null here, because certain features of the particle discerned must also be shown. I leave details to one side.

[ii] Which almost always refers to a set of tests, not just one.

[iii] I sense that some Bayesians imagine P(*H*) is more “hedged” than to actually infer (3). But the relevant hedging, the type we can actually attain, is given by an assessment of severity or corroboration or the like. Background enters via a repertoire of information about experimental designs, data analytic techniques, mistakes and flaws to be wary of, and a host of theories and indications about which aspects have/have not been severely probed. Many background claims enter to substantiate the error probabilities; others do not alter them.

[iv] In aspects of the modeling, researchers make use of known relative frequencies of events (e.g., rates of types of collisions) that lead to legitimate, empirically based, frequentist “priors” if one wants to call them that.

[v] After sending out the letter, prompted by Lindley, O’Hagan wrote up a synthesis http://errorstatistics.com/2012/08/25/did-higgs-physicists-miss-an-opportunity-by-not-consulting-more-with-statisticians/

REFERENCES (from March, 2013 post):

ATLAS Collaboration (November 14, 2012), Atlas Note: “Updated ATLAS results on the signal strength of the Higgs-like boson for decays into WW and heavy fermion final states”, ATLAS-CONF-2012-162. http://cds.cern.ch/record/1494183/files/ATLAS-CONF-2012-162.pdf

Cox, D.R. (1958), “Some Problems Connected with Statistical Inference,” *Annals of Mathematical Statistics*, 29: 357–72.

Cox, D.R. (1977), “The Role of Significance Tests (with Discussion),” *Scandinavian Journal of Statistics*, 4: 49–70.

Mayo, D.G. (1996), *Error and the Growth of Experimental Knowledge*, University of Chicago Press, Chicago.

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 247-275.

Mayo, D.G., and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” *British Journal of Philosophy *of *Science*, 57: 323–357.

Filed under: Higgs, highly probable vs highly probed, P-values, Severity

**Winner of June 2014 Palindrome Contest: (a dozen book choices)**

**Lori Wike: **Principal bassoonist of the Utah Symphony; Faculty member at University of Utah and Westminster College

**Palindrome:** Sir, a pain, a madness! Elba gin in a pro’s tipsy end? I know angst, sir! I taste, I demonstrate lemon omelet arts. Nome diet satirists gnaw on kidneys, pits or panini. Gab less: end a mania, Paris!

**Book choice**: *Conjectures and Refutations *(K. Popper 1962, New York: Basic Books)

**The requirement:** A palindrome using “demonstrate” (and Elba, of course).

**Bio: **Lori Wike is principal bassoonist of the Utah Symphony and is on the faculty of the University of Utah and Westminster College. She holds a Bachelor of Music degree from the Eastman School of Music and a Master of Arts degree in Comparative Literature from UC-Irvine.

**Statement**: “I’m very happy to be a fourth-time winner in this palindrome contest. This palindrome ended up being a particularly fun one to write. Here is a picture of me visiting Akaka Falls, a necessary stop on any palindromist tour itinerary! I’ve been fascinated by palindromes ever since first learning about them as a child in a Martin Gardner book. I started writing palindromes several years ago when my interest in the form was rekindled by reading about the constraint-based techniques of several Oulipo writers. While I love all sorts of wordplay and puzzles, and I occasionally write some word-unit palindromes as well, I find writing the traditional letter-unit palindromes to be the most satisfying challenge. My latest palindrome goal is to attempt to write a palindromic mystery story.”

**Runner-up:** John Falcone (Asturias, Spain)

**Die tarts! No medics. I’d abandon Elba. Nut unable? Nod. Nab a disc. I demonstrate id.**

**Mayo’s June attempts (selected):**

- Able Noah, Plato or God. Deified lo-diet arts, no media. I demonstrate idol: deified dogroot! Alpha on Elba.
- Disable code, but use tarts. No Medco data doc demonstrates U-tube doc. Elba’s id.
- Elba fat, a diet? Arts, no media. I demonstrate i-data fable
- Lo! Disable now Sam. God’s assay: “Monet arts, no med deified!” Demonstrate, no? My ass! As dogma’s won Elba’s idol.

**Lori is amazing, but you can win with very simple palindromes. (I call on Lori only if no one has won for months on end.) I’ll be adding more book selections for July and August.**

Filed under: Palindrome

**Professor Larry Laudan**

** Lecturer in Law and Philosophy**

** University of Texas at Austin**

**“When the ‘Not-Guilty’ Falsely Pass for Innocent” by Larry Laudan**

While it is a belief deeply ingrained in the legal community (and among the public) that false negatives are much more common than false positives (a 10:1 ratio being the preferred guess), empirical studies of that question are very few and far between. While false convictions have been carefully investigated in more than two dozen studies, there are virtually no well-designed studies of the frequency of false acquittals. The disinterest in the latter question is dramatically borne out by looking at discussions among intellectuals of the two sorts of errors. (A search of Google Books identifies some 6.3k discussions of the former and only 144 treatments of the latter in the period from 1800 to now.) I’m persuaded that it is time we brought false negatives out of the shadows, not least because each such mistake carries significant potential harms, typically inflicted by falsely-acquitted recidivists who are on the streets instead of in prison.

In criminal law, false negatives occur under two circumstances: when a guilty defendant is acquitted at trial and when an arrested, guilty defendant has the charges against him dropped or dismissed by the judge or prosecutor. Almost no one tries to measure how often either type of false negative occurs. That is partly understandable, given the fact that the legal system prohibits a judicial investigation into the correctness of an acquittal at trial; the double jeopardy principle guarantees that such acquittals are fixed in stone. Thanks in no small part to the general societal indifference to false negatives, there have been virtually no efforts to design empirical studies that would yield reliable figures on false acquittals. That means that my efforts here to estimate how often they occur must depend on a plethora of *indirect* indicators. With a bit of ingenuity, it is possible to find data that provide strong clues as to approximately how often a truly guilty defendant is acquitted at trial and in the pre-trial process. The resulting inferences are not precise and I will try to explain why as we go along. As we look at various data sources not initially designed to measure false negatives, we will see that they nonetheless provide salient information about when and why false acquittals occur, thereby enabling us to make an approximate estimate of their frequency.

My discussion of how to estimate the frequency of false negatives will fall into two parts, reflecting the stark differences between the sources of errors in pleas and the sources of error in trials. (All the data to be cited here deal entirely with cases of crimes of violence.)

**i). Estimating the frequency of false negatives at trials.** Trial acquittals represent a very small subset of overall acquittals. Specifically, of the 232k defendants who were arrested in 2008 for, but not convicted of, a violent crime, only 6% (15k) were acquitted at trial. Conventional wisdom has it that most defendants acquitted at trial are probably factually guilty. After all, so the usual argument goes, these defendants wouldn’t even be going to trial unless the prosecutor believed that he had a strong chance of persuading jurors that these defendants were guilty beyond a reasonable doubt.

While this argument does not rest on any solid data (and we will soon be looking at one that does), it enjoys a prima facie plausibility. Even if the prosecutor sometimes overestimates the strength of his case against the defendant, it seems reasonable to suppose that most defendants winning an acquittal at trial have an apparent guilt in the range from about 70% to 90%. One’s initial inclination in such circumstances is to suppose that at least half of those who are acquitted at trial actually committed the crime(s) they are charged with but the evidence allowed room for rational doubt about defendant’s guilt. Accordingly, one might assume that about half of those acquitted at trial are guilty, giving us some 7.5k false negatives, even though my strong suspicion is that the true figure is higher than that. There are two powerful reasons for thinking that this simplistic assumption understates the frequency of guilt among those acquitted at trial. They are as follows:

a). One potential source for corroborating my hunch involves looking at some interesting data from Scotland. There, the justice system uses BARD as the standard, as in the United States, and trial by jury. However, the Scottish system consists of three verdicts rather than the usual two: ‘guilty’, ‘guilt not proven’ and ‘not guilty’.^{[1]} The intermediate verdict gives us a point of entry for trying to pin down the rate of false acquittals. A guilt-not-proven verdict is called for when i). the jury is persuaded that the defendant is factually guilty (that is, p(guilt) ≧ 0.5) but ii). the jury is not convinced of that guilt beyond a reasonable doubt. Both the not-guilty and the guilt-not-proven verdicts count as official acquittals but they send decidedly different messages. In a study of criminal prosecutions in 2005 and 2006 done by the Scottish government, it turned out that 71% of those defendants tried for homicide and acquitted received a ‘guilt-not-proven’ verdict.^{[2]} That means that about 7-in-10 acquittals in Scotland involve defendants regarded by the jurors as having probably committed the crime.

b). A different way of estimating the frequency of false acquittals at trials emerges from the monumental study by Kalven and Zeisel (*The American Jury*) of some 3,500+ jury trials in the US. The researchers asked judges in each of the trials that resulted in an acquittal whether, in the opinion of the judge, the case was ‘close’ (meaning the apparent guilt of the acquitted defendant verged on proof beyond a reasonable doubt) or whether it was a ‘clear’ acquittal (meaning that defendant’s apparent guilt was well below the BARD standard). According to the responses to this question (dealing with 1,191 acquittals), judges indicated that, in their opinions, only 5% of the trials resulted in ‘clear’ acquittals; by contrast, 52% of the cases were, in the view of judges, ‘clear for conviction’.^{[3]}

Since about one-third of trials for violent crimes result in an acquittal, the Kalven-Zeisel data would seem to entail that only about 15% of the acquittals are ‘clear’ acquittals, while some 85% are, in the opinion of the presiding judge, close cases. If, as in our example from 2008, there are some 15k acquittals, more than 12k of them are close enough to warrant an assumption that these are probably factually guilty defendants, even if their apparent guilt fails to eliminate all reasonable doubts.

Putting the two data sets together, it is fair to say that significantly more than half of those acquitted at trial of a violent crime were nonetheless regarded by the jurors and judges as probably guilty and thereby are reasonably assumed to be false negatives.^{[4]} Accordingly, I shall hereafter assume that, among those 15k acquittals that emerged in trials for violent crimes in the US in 2008, some 11.2k of them were false negatives.

**ii). False negatives in the dropping of charges (pre-trial acquittals).** The much more intriguing question concerns the true guilt or innocence not of those 15k defendants acquitted at trial but of those 217k arrestees against whom charges were dropped or dismissed. Such decisions obviously came prior to trial, usually at the initiative of a prosecutor, sometimes at the initiative of a judge. We know that of those arrested by the police and charged with violent crimes in 2008, some 37% never make it to a trial or a plea bargain; the prosecutor or the pre-trial judge, in effect, acquits them. But how many of them so acquitted were truly innocent? Fortunately, there are two very large studies that shed substantial light on the answer to that crucial question. Both depend on the responses of thousands of prosecutors who were quizzed about the reasons why they dropped the charges that they did. One such study, analyzing FBI-initiated prosecutions nationwide, provides annual data about the reasons why federal prosecutors have dropped (and judges have dismissed) charges against those accused of a violent crime. The second study, undertaken by the Bureau of Justice Statistics, looked at the same issue in state cases, where of course most violent crime adjudications take place.

What emerge from both studies are many cases that were dropped for reasons that may indicate defendant’s innocence, or at least the relative weakness of the prosecutor’s case against the defendant. I shall call these factors *innocence-indicators*. Both studies show that prosecutors have multiple reasons for the dismissal or dropping of charges against persons charged with a violent crime. Still, both data sets about prosecutorial decisions indicate that the dominant motive for dropping outstanding charges is *not*, as you might expect, a belief that the defendant is actually innocent.^{[5]}

Sometimes, charges are dropped because of a defendant’s willingness to testify for the state in the separate trial of an accomplice. Occasionally, charges are dropped because the prosecutor discovers that the statute of limitations expires before the trial can be scheduled or he discovers that the defendant, when the alleged crime occurred, was a minor and should be tried in juvenile court. Prosecutors will also often drop charges if the rulings in the pre-trial evidence hearing indicate that the judge will exclude what the prosecutors deem to be highly inculpatory evidence of defendants’ guilt. When that occurs, the case against the defendant obviously becomes less compelling than it would have been if the relevant evidence were admitted. In fact, this was reported as the most frequent problem that prosecutors’ offices ran into.^{[6]} Commonly, prosecutors cite limitations of personnel and financial resources to cope with all the cases on their docket as another reason for dropping charges. (So much for the common idea that prosecutors have virtually unlimited resources!) Charges are also likely to be dropped if a key witness for the state vanishes or changes her testimony (as the Bureau of Justice Statistics puts it: “the reason for this reluctance [to testify] was usually fear of reprisal, followed by actual threats against the victim or witness.”^{[7]}), or if the defendant was awarded bail awaiting trial and vanished, thereby becoming a fugitive at large.^{[8]} Clearly, none of these reasons for dropping a case is, in any sense, an indicator of the defendant’s innocence.

Oftentimes, of course, charges are dropped for reasons that imply the weakness of the case against the defendant. A detailed report about the many decisions made in 2010 by federal prosecutors – in deciding whether to drop charges against some 7.3k detainees arrested by the FBI – claims that in 20.5% of dismissals, there appeared to be a ‘lack of criminal intent’; 7% of dropped charges were a result of the prosecutor’s decision that ‘no crime was committed’; and in another quarter of the dropped cases there were signs of ‘weak or insufficient evidence.’^{[9]} That boils down to saying that, in federal trials for violent crimes, slightly less than half of all dismissals (48%) are motivated by factors other than a worry that defendant’s guilt might not be provable at trial. (Recall, too, that ‘insufficient evidence’ does not mean lack of substantial evidence that defendant committed the crime but rather evidence the prosecution believes is probably insufficient to establish defendant’s guilt beyond a reasonable doubt.)

This already gives us reason to suspect that about half of the cases where charges are dropped involve the abandonment of charges against defendants whom the prosecutor thought were probably factually guilty but was not at all sure that he could prove that guilt beyond a reasonable doubt. That argument becomes much more convincing when we remind ourselves of how defendants came to the prosecutors’ attention in the first place. Typically, a person becomes the object of police investigations initially as nothing more than a suspect, perhaps among several others who strike the police as possible culprits. If, after further inquiries and the analysis of more evidence, police decide to file charges (thereby ‘clearing’ the case as far as the police are concerned), they are required to have grounds to believe that it is more likely than not that defendant committed the crime. To make the arrest official, the police must persuade either a judge or a grand jury (or both) that a rational person, confronted with the available evidence, would conclude that defendant probably committed the crime.

Accordingly, by the time the prosecutor typically gets deeply into the act, he is dealing with a host of arrestees, each of whom is considered by the police, a grand jury and the arraigning judge to be more likely than not to be guilty on the available evidence. As the prosecutor begins assembling his case, some new evidence will often come in or be actively sought. Sometimes, that evidence will be exculpatory, and persuade the prosecutor that defendant really did not commit the crime. Much more often, though, the decision point for the prosecutor arrives when, after having reviewed the evidence, he must decide whether the case against the defendant is strong enough to persuade a trial jury that the defendant is guilty beyond a reasonable doubt. Supposing, with many scholars, that this standard represents roughly a 90+% likelihood of guilt, this means that most of those now charged with a crime have an apparent guilt that falls in the very broad range from 50+% to something close to 100%. The prosecutor will generally cull those defendants in the range of 50-80% apparent guilt out of the class of those he intends to take to trial or to negotiate a plea bargain with.

Why would he do that? When apparent guilt is in that range, the prosecutor knows that it is unlikely that he will be able to persuade the defendant to accept a plea bargain and he also knows that, if he takes the defendant to trial, it will probably result in an acquittal. There are moral reasons as well that lead to the dropping of charges,^{[10]} even against those whom the prosecutor believes to be factually guilty.

The second pertinent study on this vexing issue of the frequency of guilt among those dropped out of the system prior to trial was published in 1992.^{[11]} Unlike the FBI study, this one investigated state (rather than federal) criminal trials. It included some 40k cases. The researchers asked prosecutors why they had dropped charges in the cases (or why judges had dismissed charges) when they did. Three of the reasons given appear to be innocence-indicators: ‘evidence issues’, ‘witness problems’ and ‘the interests of justice’. Some 35% of the dropped/dismissed cases were attributed to these reasons. That left 65% of the abandoned cases involving reasons implying nothing about guilt or innocence.^{[12]} An earlier study of 17,500 arrests in Washington, D.C. federal courts indicates that the prosecutor dropped 3.6k cases but only a third of those dismissals (34%) were attributed to ‘insufficiency of evidence’.^{[13]}

Taking the mean between the FBI probably-guilty rate of 47% and the BJS value of 65%, we arrive at the estimate that about 56% of the dismissed and dropped arrestees were probably factually guilty. Even so, that figure doesn’t take us fully where we want to go. We’re after a reasonable estimate of the number of *truly* guilty who have the charges against them either dropped or dismissed. The fact that the 56% of arrestees against whom charges were dropped are probably guilty does not yet give us a definite way of determining how many of them were actually guilty.

There is, however, a way of generating the result we seek. Remember that the defendants in this group were dropped or dismissed for reasons that had nothing to do with signs of their innocence. Hence, we can reasonably suppose that the proportion of guilty among them would be about the same as the proportion of guilty among those who go to trial. (After all, there is no perceived evidential weakness in the case against them that distinguished them from those who do go to trial.) Exactly two-thirds of those who went to trial for a violent crime were convicted. We have already explained why we assume that 75% of those acquitted at trial are probably truly guilty.

That seems to provide a plausible rationale for saying that, among those defendants who had the charges against them dropped for *non-evidentiary* reasons, approximately two-out-of-three (and probably more) are highly likely to be guilty. Hence, we shall assume that about 37% (that is, two-thirds of the 56% of those who were booted out of the trial system for non-evidentiary reasons) are factually guilty (and, if they had gone to trial, would have been convicted). This amounts to 81k false negatives. When added to the estimate of 12k probably guilty defendants among those acquitted at trial, this figure entails that, at a minimum, some 93k of the 595k arrestees are acquitted although truly guilty. This suggests a false negative rate of ~40% (viz., 93k guilty out of 232k acquitted). The false positive rate in this example is 3% (some 11k falsely convicted defendants out of 360k convicted).
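Laudan’s running arithmetic can be retraced in a few lines. This is only a sketch of the calculation as quoted above; the inputs are his rounded estimates, not independent data:

```python
# Rounded inputs from the text (2008 arrests for violent crimes)
trial_acquittals = 15_000      # acquitted at trial
pretrial_drops   = 217_000     # charges dropped or dismissed pre-trial
total_acquitted  = 232_000     # trial + pre-trial acquittals
convicted        = 360_000

# Step 1: ~75% of trial acquittals assumed factually guilty (note [4])
fn_trial = 0.75 * trial_acquittals                        # ~11.2k

# Step 2: mean of the FBI (47%) and BJS (65%) rates -> ~56% "probably guilty"
probably_guilty = (0.47 + 0.65) / 2                       # 0.56

# Step 3: of those, two-thirds assumed truly guilty (the trial conviction rate)
fn_pretrial = (2 / 3) * probably_guilty * pretrial_drops  # ~81k

false_negatives = fn_trial + fn_pretrial                  # ~92-93k
fn_rate = false_negatives / total_acquitted               # ~0.40
fp_rate = 11_000 / convicted                              # ~0.03
print(round(false_negatives), round(fn_rate, 2), round(fp_rate, 2))
# prints: 92263 0.4 0.03
```

The ~40% false-negative rate thus follows mechanically from the three assumed proportions; anyone who disputes the conclusion must dispute one of those inputs.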

It remains to be seen whether this pattern of error distribution serves the interests of society. That is the subject of my next book. For now, I will simply note that recidivism data show unambiguously that the 93k false negatives do vastly more harm to innocent citizens than the 11k false positives do. Quite clearly, the current standard of proof needs drastic re-adjustment.

*Notes*

[1] For a lengthy discussion of the Scottish verdict system, see my “Need Verdicts Come in Pairs?” *International Journal of Evidence and Proof*, vol. 14 (2010), 1-24.

[2] *Scottish Government Statistical Bulletin*, Crim/2006/Part 11. The data come from the years 2004-2005. (http://www.scotland.gov.uk/Publications/2006/04/25104019/11.)

[3] Kalven & Zeisel, *op. cit*., Table 32.

[4] Given the Scottish estimate of ~70% false negatives at trial and the Kalven-Zeisel estimate of an 85% false negative rate in trials, I shall assume a false negative rate of ~75% in acquittals at trial.

[5] See especially US Dept. of Justice, *United States Attorneys’ Annual Statistical Report*, 2010.

[6] *Ibid*., Table 6.

[7] BJS, *Prosecutors in State Courts, 1994* (1996), p.5.

[8] 5% of those on bail awaiting trial on a murder charge become fugitives. *BJS, Felony Defendants in Large Urban Counties, 2009 –Statistical Tables, *Table 18*.*

[9] The detailed breakdown of the relevant data can be found in Table 14 of US Dept. of Justice, *United States Attorneys’ Annual Statistical Report*, 2010. In that year, the FBI declined to prosecute some 7,252 cases of arrested defendants (794 of these cases were violent crimes) (*ibid*., Table 3).

[10] The ethics manual of the American Bar Association, the *ABA Standards for Criminal Justice: Prosecution and Defense Function*, insists that prosecutors “should not institute, or cause to be instituted, or permit the continued pendency of criminal charges when the prosecutor knows that the charges are not supported by probable cause.” (Standard 3-3.9) It goes on to say that the prosecutor should drop charges against the defendant if there is** “**reasonable doubt that the accused is in fact guilty.” (*ibid.*)

[11] Barbara Boland et al., *The Prosecution of Felony Arrests, 1988*. Bureau of Justice Statistics, 1992.

[12] Here were the data for some of the cities in their study: Denver (46% dropped because of innocence issue); Los Angeles (50%); Manhattan (43%); St. Louis (20%); San Diego (27%); Seattle (25%); and Washington, D.C. (37%). Id., Table 5.

[13] Brian Forst et al., *What Happens after Arrest? *Institute for Law and Social Research (1977), Exhibit 5.1.

Earlier guest post by Laudan:

Larry Laudan. Why Presuming Innocence is Not a Bayesian Prior

Filed under: evidence-based policy, false negatives, PhilStatLaw, Statistics Tagged: false negatives

**Stapel’s “fix” for science is to admit it’s all “fixed!”**

That recent case of the guy suspected of using faked data for a study on how to promote support for gay marriage in a (retracted) paper, Michael LaCour, is directing a bit of limelight on our star fraudster Diederik Stapel (50+ retractions).

**The Chronicle of Higher Education** just published an article by Tom Bartlett: “Can a Longtime Fraud Help Fix Science?” You can read his full interview of Stapel here. A snippet:

You write that “every psychologist has a toolbox of statistical and methodological procedures for those days when the numbers don’t turn out quite right.” Do you think every psychologist uses that toolbox? In other words, is everyone at least a little bit dirty?

Stapel: In essence, yes. The universe doesn’t give answers. There are no data matrices out there. We have to select from reality, and we have to interpret. There’s always dirt, and there’s always selection, and there’s always interpretation. That doesn’t mean it’s all untruthful. We’re dirty because we can only live with models of reality rather than reality itself. It doesn’t mean it’s all a bag of tricks and lies. But that’s where the inconvenience starts.

It’s the illusion that these models are one-to-one descriptions of reality. That’s what we hope for, but that’s of course not true. I think the solution is in accepting this and saying these are the tips and tricks, and this is the story I want to tell, and this is how I did it, instead of trying to pose as if it’s real. We should be more open about saying, I’m using this trick, this statistical method, and people can figure out for themselves.

This is our “dirty hands” argument, so often used these days, coupled with claims of so-called “perverse incentives,” to excuse QRPs (questionable research practices), bias, and flat-out **cheating**. The leap from “our models are invariably idealizations” to “we all have dirty hands” to “statistical tricks cannot be helped” may inadvertently be encouraged by some articles on how to “fix” science.

Earlier in the interview:

You mention lots of possible reasons for your fraud: laziness, ambition, a short attention span. One of the more intriguing reasons to me — and you mention it twice in the book — is nihilism. Do you mean that? Did you think of yourself as a nihilist? Then or now?

Stapel: I’m not sure I’m a nihilist. ….

Did you think of the work you were doing as meaningful?

Stapel: I was raised in the 1980s, at the height of postmodernism, and that was something I related to. I studied many of the French postmodernists. That made me question meaningfulness. I had a hard time explaining the meaningfulness of my work to students.

I’ll bet.

I agree with Bartlett that you don’t have to have any sympathy with a fraudster to possibly learn from him about preventing doctored statistics, or sharpening fraudbusting skills, except that it turns out Stapel *really and truly* believes science is a fraud![ii] In his pristine accomplishment of using *no data at all*, rather than merely subjecting them to extraordinary rendition (leaving others to wrangle over the fine points of statistics), you could say that Stapel is the ultimate, radical, postmodern scientific anarchist. Stapel is a personable guy, and I’ve had some interesting exchanges with him; but on that basis, from his “Fictionfactory” and autobiography “Derailment”, I say he’s the wrong person to ask. *He still doesn’t get it!*

[i] There are several posts on this blog that discuss Stapel:

Some Statistical Dirty Laundry

Derailment: Faking Science: A true story of academic fraud, by Diederik Stapel (translated into English)

Should a “fictionfactory” peepshow be barred from a festival on “Truth and Reality”? Diederik Stapel says no

How to hire a fraudster chauffeur (includes video of Stapel’s TED talk)

50 shades of grey between error and fraud

Thinking of Eating Meat Causes Antisocial behavior

[ii] At least social science, social psychology. He may be right that the effects are small or uninteresting in social psych.

Filed under: junk science, Statistics

**MONTHLY MEMORY LANE: 3 years ago: June 2012.** I mark in **red** three posts that seem most apt for general background on key issues in this blog.[1] It was *extremely* difficult to pick only 3 this month; please check out others that look interesting to you. This new feature, appearing the last week of each month, began at the blog’s 3-year anniversary in Sept. 2014.

**June 2012**

- (6/2) Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)
- **(6/6) Review of *Error and Inference* (Mayo and Spanos) by C. Hennig**
- (6/9) U-Phil: Is the Use of Power* Open to a Power Paradox?
- (6/12) CMU Workshop on Foundations for Ockham’s Razor
- (6/14) Answer to the Homework & a New Exercise
- (6/15) Scratch Work for a SEV Homework Problem
- (6/17) Repost (5/17/12): Do CIs Avoid Fallacies of Tests? Reforming the Reformers
- (6/17) G. Cumming Response: The New Statistics
- **(6/19) The Error Statistical Philosophy and The Practice of Bayesian Statistics: Comments on Gelman and Shalizi**
- (6/23) Promissory Note
- **(6/26) Deviates, Sloths, and Exiles: Philosophical Remarks on the Ockham’s Razor Workshop***
- (6/29) Further Reflections on Simplicity: Mechanisms

**[1]** Excluding those recently reblogged. Posts that are part of a “unit” or a group of “U-Phils” count as one.

Filed under: 3-year memory lane

This is one of the questions high on the “To Do” list I’ve been keeping for this blog. The question grew out of discussions of “updating and downdating” in relation to papers by Stephen Senn (2011) and Andrew Gelman (2011) in *Rationality, Markets, and Morals*.[i]

“As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’.” (Senn, 2011, p. 59)

“If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.” (Gelman, 2011, p. 77)

But if uncertainty is not expressible as a prior, then a major linchpin of Bayesian updating seems questionable. If one can go from the posterior back to the prior, on the other hand, perhaps the data can also lead one to go back and change the prior itself.
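A minimal sketch of how mechanical the updating/downdating arithmetic is, using a conjugate Beta-Binomial model (my own toy illustration with hypothetical numbers, not an example from Senn or Gelman):

```python
# Conjugate Beta-Binomial model: Bayes updating is mere parameter
# arithmetic, so "downdating" (recovering the prior from the posterior
# and the data) is exactly as mechanical, as Senn's remark suggests.

def update(alpha, beta, successes, failures):
    """Beta(alpha, beta) prior + binomial data -> Beta posterior."""
    return alpha + successes, beta + failures

def downdate(alpha_post, beta_post, successes, failures):
    """Recover the prior from the posterior and the same data."""
    return alpha_post - successes, beta_post - failures

# Hypothetical numbers: a Beta(2, 2) prior, then 7 successes in 10 trials.
prior = (2, 2)
posterior = update(*prior, 7, 3)        # Beta(9, 5)
recovered = downdate(*posterior, 7, 3)  # Beta(2, 2), the original prior
print(posterior, recovered)
```

As an exercise in mathematics, nothing privileges one direction over the other; the question raised here is whether *inference* does.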

**Is it legitimate to change one’s prior based on the data?**

I don’t mean update it, but reject the one you had and replace it with another. My question may yield different answers depending on the particular Bayesian view. I am prepared to restrict the entire question of changing priors to Bayesian “probabilisms”, meaning the inference takes the form of updating priors to yield posteriors, or to report a comparative Bayes factor. Interpretations can vary. In many Bayesian accounts the prior probability distribution is a way of introducing prior beliefs into the analysis (as with subjective Bayesians) or, conversely, to avoid introducing prior beliefs (as with reference or conventional priors). Empirical Bayesians employ frequentist priors based on similar studies or well established theory. There are many other variants.

S. SENN: According to Senn, one test of whether an approach is Bayesian is that while

“arrival of new data will, of course, require you to update your prior distribution to being a posterior distribution, no conceivable possible constellation of results can cause you to wish to change your prior distribution. If it does, you had the wrong prior distribution and this prior distribution would therefore have been wrong even for cases that did not leave you wishing to change it.” (Senn, 2011, 63)

“If you cannot go back to the drawing board, one seems stuck with priors one now regards as wrong; if one does change them, then what was the meaning of the prior as carrying prior information?” (Senn, 2011, p. 58)

I take it that Senn is referring to a Bayesian prior expressing belief. (He will correct me if I’m wrong.)[ii] Senn takes the upshot to be that priors cannot be changed based on data. **Is there a principled ground for blocking such moves?**

I.J. GOOD: The traditional idea was that one would have thought very hard about one’s prior before proceeding—that’s what Jack Good always said. Good advocated his device of “imaginary results”, whereby one would envisage all possible results in advance (1971, p. 431) and choose a prior one can live with whatever happens. *This could take a long time!* Given how difficult this would be in practice, Good allowed

“that it is possible after all to change a prior in the light of actual experimental results [but] rationality of type II has to be used.” (Good 1971, p. 431)

Maybe this is an example of what Senn calls requiring the informal to come to the rescue of the formal? Good was commenting on D. J. Bartholomew [iii] in the same wonderful volume (edited by Godambe and Sprott).

D. LINDLEY: According to subjective Bayesian Dennis Lindley:

“[I]f a prior leads to an unacceptable posterior then I modify it to cohere with properties that seem desirable in the inference.” (Lindley 1971, p. 436)

This would seem to open the door to all kinds of verification biases, wouldn’t it? This is the same Lindley who famously declared:

“I am often asked if the method gives the *right* answer: or, more particularly, how do you know if you have got the *right* prior. My reply is that I don’t know what is meant by ‘right’ in this context. The Bayesian theory is about *coherence*, not about right or wrong.” (1976, p. 359)

H. KYBURG: Philosopher Henry Kyburg (who wrote a book on subjective probability, but was or became a frequentist) gives what I took to be the standard line (for subjective Bayesians at least):

“There is no way I can be in error in my prior distribution for μ – unless I make a logical error. … It is that very fact that makes this prior distribution perniciously subjective. It represents an assumption that has consequences, but cannot be corrected by criticism or further evidence.” (Kyburg 1993, p. 147)

It can, of course, be updated via Bayes’ rule.

D.R. COX: While recognizing the serious problem of “temporal incoherence” (a violation of diachronic Bayes updating), David Cox writes:

“On the other hand [temporal coherency] is not inevitable and there is nothing intrinsically inconsistent in changing prior assessments” in the light of data; however, the danger is that “even initially very surprising effects can *post hoc* be made to seem plausible.” (Cox 2006, p. 78)

An analogous worry would arise, Cox notes, if frequentists permitted data-dependent selections of hypotheses (significance seeking, cherry picking, etc.). However, frequentists (if they are not to be guilty of cheating) would need to take any such selections into account in the overall error probabilities of the test. But the Bayesian is not in the business of computing error probabilities associated with a method for reaching posteriors, at least not traditionally. Would Bayesians even be required to report such shifts of priors? (A principle is needed.)

What if the proposed adjustment of the prior is based on the data and resulting likelihoods, rather than on an impetus to ensure one’s favorite hypothesis gets a desirable posterior? After all, Jim Berger says that prior elicitation typically takes place *after* “the expert has already seen the data” (2006, p. 392). Are experts instructed to try not to take the data into account? In any case, if the prior is determined post-data, one wonders how it can be seen to reflect information distinct from the data under analysis: all the work in obtaining the posterior would have been done by the likelihood. There is also the issue of using the data twice.
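To see the double-use worry concretely, here is a toy sketch of my own (not from Berger): in a conjugate Normal model with known variance, centering the prior on the observed sample mean after seeing the data makes the posterior look more concentrated around the observed effect than an honestly pre-specified prior of the same width would:

```python
import math

def normal_cdf(x):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def posterior(m0, tau2, xbar, sigma2, n):
    """Conjugate Normal-Normal posterior (known data variance sigma2)."""
    prec = 1 / tau2 + n / sigma2
    mean = (m0 / tau2 + n * xbar / sigma2) / prec
    return mean, 1 / prec  # posterior mean, posterior variance

def prob_near(m, v, center, delta):
    """Posterior probability that mu lies within +/- delta of `center`."""
    s = math.sqrt(v)
    return normal_cdf((center + delta - m) / s) - normal_cdf((center - delta - m) / s)

xbar, sigma2, n = 1.0, 1.0, 10  # hypothetical data summary
honest = posterior(m0=0.0, tau2=1.0, xbar=xbar, sigma2=sigma2, n=n)  # pre-specified prior
double = posterior(m0=xbar, tau2=1.0, xbar=xbar, sigma2=sigma2, n=n)  # prior centered on the data

# The data-dependent prior makes the posterior look more concentrated
# around the observed effect than the pre-specified prior does.
print(prob_near(*honest, xbar, 0.3), prob_near(*double, xbar, 0.3))
```

The data-dependent prior buys extra apparent confirmation precisely because the same data entered twice.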

**So what do you think is the answer? Does it differ for subjective vs conventional vs other stripes of Bayesian?**

[i]Both were contributions to the RMM (2011) volume: Special Topic: Statistical Science and Philosophy of Science: Where Do (Should) They Meet in 2011 and Beyond? (edited by D. Mayo, A. Spanos, and K. Staley). The volume was an outgrowth of a 2010 conference that Spanos and I (and others) ran in London, and conversations that emerged soon after. See full list of participants, talks and sponsors here.

[ii] Senn and I had a published exchange on his paper that was based on my “deconstruction” of him on this blog, followed by his response! The published comments are here (Mayo) and here (Senn).

[iii] At first I thought Good was commenting on Lindley. Bartholomew came up in this blog in discussing when Bayesians and frequentists can agree on numbers.

**WEEKEND READING**

Gelman, A. 2011. “Induction and Deduction in Bayesian Data Analysis.”

Senn, S. 2011. “You May Believe You Are a Bayesian But You Are Probably Wrong.”

Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.”

Discussions and responses on Senn and Gelman can be found by searching this blog:

**Commentary on Berger & Goldstein:** Christen, Draper, Fienberg, Kadane, Kass, Wasserman

**Rejoinders:** Berger, Goldstein

REFERENCES

Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” *Bayesian Analysis* 1 (3): 385–402.

Cox, D. R. 2006. *Principles of Statistical Inference*. Cambridge, UK: Cambridge University Press.

Gelman, A. 2011. “Induction and Deduction in Bayesian Data Analysis.” *Rationality, Markets and Morals: Studies at the Intersection of Philosophy and Economics* 2 (Special Topic: Statistical Science and Philosophy of Science): 67–78.

Godambe, V. P., and D. A. Sprott, ed. 1971. *Foundations of Statistical Inference*. Toronto: Holt, Rinehart and Winston of Canada.

Good, I. J. 1971. Comment on Bartholomew. In *Foundations of Statistical Inference*, edited by V. P. Godambe and D. A. Sprott, 108–122. Toronto: Holt, Rinehart and Winston of Canada.

Kyburg, H. E. Jr. 1993. “The Scope of Bayesian Reasoning.” In *Philosophy of Science Association: PSA 1992*, vol 2, 139-152. East Lansing: Philosophy of Science Association.

Lindley, D. V. 1971. “The Estimation of Many Parameters.” In *Foundations of Statistical Inference*, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart and Winston.

Lindley, D. V. 1976. “Bayesian Statistics.” In Harper and Hooker (eds.), *Foundations of Probability Theory, Statistical Inference and Statistical Theories of Science*, 353–362. Dordrecht: D. Reidel.

Senn, S. 2011. “You May Believe You Are a Bayesian But You Are Probably Wrong.” *Rationality, Markets and Morals: Studies at the Intersection of Philosophy and Economics* 2 (Special Topic: Statistical Science and Philosophy of Science): 48–66.

Filed under: Bayesian/frequentist, Gelman, S. Senn, Statistics

*I had a chance to reread the 2012 Tilburg Report on “Flawed Science” last night. The full report is now here. The discussion of the statistics is around pp. 17–21 (of course there was so little actual data in this case!). You might find it interesting. Here are some stray thoughts reblogged from 2 years ago…*

*1. Slipping into pseudoscience.*

The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”[1]. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job.

One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts…. [T]he use of research procedures in such a way as to ‘repress’ negative results by some means may be called verification bias. [my emphasis] (Report, 48)

I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses as to count as no evidence at all (see some from their list below). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89). [2] It is unclear at what point a field slips into the pseudoscience realm.

*2. A role for philosophy of science?*

I am intrigued that one of the final recommendations in the Report is this:

In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered. Based on these insights, research Master’s students and PhD students must receive practical training from their supervisors in the application of the rules governing proper and honest scientific research, which should include examples of such undesirable conduct as data massage. The Graduate School must explicitly ensure that this is implemented.

A philosophy department could well create an entire core specialization that revolved around “the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science” (ideally linked with one or more other departments). That would be both innovative and fill an important gap, it seems to me. Is anyone doing this?

*3. Hanging out some statistical dirty laundry.*
Items in their laundry list include:

- An experiment fails to yield the expected statistically significant results. The experiment is repeated, often with minor changes in the manipulation or other conditions, and the only experiment subsequently reported is the one that did yield the expected results. The article makes no mention of this exploratory method… It should be clear, certainly with the usually modest numbers of experimental subjects, that using experiments in this way can easily lead to an accumulation of chance findings….
- A variant of the above method is: a given experiment does not yield statistically significant differences between the experimental and control groups. The experimental group is compared with a control group from a different experiment—reasoning that ‘they are all equivalent random groups after all’—and thus the desired significant differences are found. This fact likewise goes unmentioned in the article….
- The removal of experimental conditions. For example, the experimental manipulation in an experiment has three values. …Two of the three conditions perform in accordance with the research hypotheses, but a third does not. With no mention in the article of the omission, the third condition is left out….
- The merging of data from multiple experiments [where data] had been combined in a fairly selective way,…in order to increase the number of subjects to arrive at significant results…
- Research findings were based on only some of the experimental subjects, without reporting this in the article. On the one hand ‘outliers’…were removed from the analysis whenever no significant results were obtained. (Report, 49-50)

For many further examples, and also caveats [3], see the Report.
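The first gambit on the list is easy to quantify with a toy simulation (my own sketch, not from the Report): if a two-group experiment with no real effect may be rerun up to three times, with only the “successful” run reported, the chance of reporting a nominally significant result is roughly 1 − 0.95³ ≈ 0.14, not the advertised 0.05:

```python
import math
import random

random.seed(1)

def one_experiment(n=30):
    """Two equal groups, NO true effect; two-sided z-test at the 5% level."""
    g1 = [random.gauss(0, 1) for _ in range(n)]
    g2 = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(g1) / n - sum(g2) / n) / math.sqrt(2 / n)
    return abs(z) > 1.96  # nominally "significant"

def report_after_retries(max_tries=3):
    """Rerun the failed experiment (with 'minor changes') up to max_tries,
    reporting success if ANY run comes out significant."""
    return any(one_experiment() for _ in range(max_tries))

papers = 4000
rate = sum(report_after_retries() for _ in range(papers)) / papers
print(round(rate, 3))  # roughly 1 - 0.95**3 = 0.143, far above 0.05
```

The “accumulation of chance findings” the Report warns about is just this inflation, hidden by not reporting the failed runs.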

**4. Significance tests don’t abuse science, people do**.

Interestingly, the Report distinguishes the above laundry list from “statistical incompetence and lack of interest found” (52). If the methods used were statistical, then the scrutiny might be called “metastatistical”, or the full scrutiny “meta-methodological”. Stapel often fabricated data, but the upshot of these criticisms is that sufficient finagling may similarly predetermine that a researcher’s favorite hypothesis gets support. (There is obviously a big advantage in having the data to scrutinize, as many are now demanding.) Is it a problem of these methods that they are abused? Or does the fault lie with the abusers? Obviously the latter. *Statistical methods don’t kill scientific validity, people do*.

I have long rejected dichotomous testing, but the gambits in the laundry list create problems even for more sophisticated uses of methods, e.g., for indicating magnitudes of discrepancy and associated confidence intervals. At least the methods admit of tools for mounting a critique.

In “The Mind of a Con Man” (NY Times, April 26, 2013[4]), Diederik Stapel explains how he always read the research literature extensively to generate his hypotheses, “so that it was believable and could be argued that this was the only logical thing you would find.” Rather than report on believability, *researchers need to report the properties of the methods they used: what was their capacity to have identified, avoided, or admitted verification bias?* The role of probability here would not be to quantify the degree of confidence or believability in a hypothesis, given the background theory or most intuitively plausible paradigms, but rather to check how severely probed or well-tested a hypothesis is, whether the assessment is formal, quasi-formal or informal. Was a good job done in scrutinizing flaws, or a terrible one? Or was there just a bit of data massaging and cherry picking to support the desired conclusion? As a matter of routine, researchers should tell us. Yes, as Joe Simmons, Leif Nelson and Uri Simonsohn suggest in “A 21-word solution”, they should “say it!” No longer inclined to regard their recommendation as too unserious, I say that researchers who are “clean” should go ahead and “clap their hands”[5]. (I will consider their article in a later post…)

*The subtitle is “The fraudulent research practices of social psychologist Diederik Stapel.”

[1] “A ‘byproduct’ of the Committees’ inquiries is the conclusion that, far more than was originally assumed, there are certain aspects of the discipline itself that should be deemed undesirable or even incorrect from the perspective of academic standards and scientific integrity.” (Report 54).

[2] Mere falsifiability, by the way, does not suffice for stringency; but there are also methods Popper rejects that could yield severe tests, e.g., double counting. (Search this blog for more entries.)

[3] “It goes without saying that the Committees are not suggesting that unsound research practices are commonplace in social psychology. …although they consider the findings of this report to be sufficient reason for the field of social psychology in the Netherlands and abroad to set up a thorough internal inquiry into the state of affairs in the field” (Report, 48).

[4] Philosopher, Janet Stemwedel discusses the NY Times article, noting that Diederik taught a course on research ethics!

[5] From Simmons, Nelson and Simonsohn:

From the Fall 2012 Newsletter for the Society for Personality and Social Psychology:

“Many support our call for transparency, and agree that researchers should fully disclose details of data collection and analysis. Many do not agree. What follows is a message for the former; we begin by preaching to the choir.”

Popper, K. 1994. *The Myth of the Framework*.

Choir: There is no need to wait for everyone to catch up with your desire for a more transparent science. If *you* did not p-hack a finding, *say it*, and your results will be evaluated with the greater confidence they deserve. If you determined sample size in advance, *say it*. If you did not drop any variables, *say it*. If you did not drop any conditions, *say it*.

Filed under: junk science, spurious p values

I thought the criticisms of social psychologist Jens Förster were already quite damning (despite some attempts to explain them as mere QRPs), but there’s recently been some pushback from two of his co-authors, Liberman and Denzler. Their objections are directed at the application of a distinct method, touted as “Bayesian forensics”, to their joint work with Förster. I discussed it very briefly in a recent “rejected post”. Perhaps the earlier method of criticism was inapplicable to these additional papers, and there’s an interest in seeing those papers retracted as well as the one that was. I don’t claim to know. A distinct “policy” issue is whether there should be uniform standards for retraction calls. At the very least, one would think new methods should be well-vetted before subjecting authors to their indictment (particularly methods which are incapable of issuing exculpatory evidence, like this one). Here’s a portion of their response. I don’t claim to be up on this case, but I’d be very glad to have reader feedback.

**Nira Liberman, School of Psychological Sciences, Tel Aviv University, Israel**

**Markus Denzler, Federal University of Applied Administrative Sciences, Germany**

June 7, 2015

**Response to a Report Published by the University of Amsterdam**

The University of Amsterdam (UvA) has recently announced the completion of a report that summarizes an examination of all the empirical articles by Jens Förster (JF) during the years of his affiliation with UvA, including those co-authored by us. The report is available online. The report relies solely on statistical evaluation, using the method originally employed in the anonymous complaint against JF, as well as a new version of a method for detecting “low scientific veracity” of data, developed by Prof. Klaassen (2015). The report concludes that some of the examined publications show “strong statistical evidence for low scientific veracity”, some show “inconclusive evidence for low scientific veracity”, and some show “no evidence for low veracity”. UvA announced that on the basis of that report, it would send letters to the Journals, asking them to retract articles from the first category, and to consider retraction of articles in the second category.

After examining the report, **we have reached the conclusion that it is misleading, biased and is based on erroneous statistical procedures**. In view of that we surmise that it **does not present reliable evidence for “low scientific veracity”**.

**We ask you to consider our criticism of the methods used in UvA’s report and the procedures leading to their recommendations in your decision.**

Let us emphasize that we never fabricated or manipulated data, nor have we ever witnessed such behavior on the part of Jens Förster or other co-authors.

**Here are our major points of criticism. **Please note that, due to time considerations, our examination and criticism focus on papers co-authored by us. Below, we provide some background information and then elaborate on these points.

1. **The new method is falsely portrayed as “standard procedure in Bayesian forensic inference.” In fact, it is set up in such a way that evidence can only strengthen a prior belief in low data veracity.** This method is not widely accepted among other experts, and has never been published in a peer-reviewed journal. Despite that, UvA’s recommendations for all but one of the papers in question are based solely on this method. No confirming (not to mention disconfirming) evidence from independent sources was sought or considered.

2. **The new method’s criteria for “low veracity” are too inclusive** (a 5–8% chance of wrongly accusing a publication of showing “strong evidence of low veracity”, and as high as a 40% chance of wrongly accusing a publication of showing “inconclusive evidence for low veracity”). Illustrating the potential consequences, a failed replication paper by other authors that we examined was flagged by this method.

3. **The new method (and in fact also the “old method” used in former cases against JF) rests on the wrong assumption that dependence of errors between experimental conditions necessarily indicates “low veracity”**, whereas in real experimental settings many (benign) reasons may contribute to such dependence.

4. The report treats between-subjects designs of 3 × 2 as two independent instances of 3-level single-factor experiments. However, the same (benign) procedures may render this assumption questionable, thus inflating the indicators of “low veracity” used in the report.

5. **The new method (and also the old method) estimates fraud as the extent of deviation from a linear contrast. This contrast cannot be applied to “control” variables** (or control conditions) for which experimental effects were neither predicted nor found, as was done in the report. The misguided application of the linear contrast to control variables also produces, in some cases, inflated estimates of “low veracity”.

6. **The new method appears to be critically sensitive to minute changes in values** that are within the boundaries of rounding.

7. Finally, **we examine every co-authored paper that was classified as showing “strong” or “inconclusive” evidence of low veracity** (excluding one paper that is already retracted) **and show that it does not feature any reliable evidence for low veracity.**

Background

On April 2nd each of us received an email from the head of the Psychology Department at the University of Amsterdam (UvA), Prof. De Groot, on behalf of the University’s Executive Board. She informed us that all the empirical articles by Jens Förster (JF) during the years of his affiliation with UvA, including those co-authored by us, had been examined by three statisticians who submitted their report. According to this (earlier version of the) report, we were told, some of the examined publications had “strong statistical evidence for fabrication”, some had “questionable veracity,” and some showed “no statistical evidence for fabrication”. Prof. De Groot also wrote that on the basis of that report, letters would be sent to the relevant Journals, asking them to retract articles from the first two categories. It is important to note that this was the first time we were officially informed about the investigation. None of the co-authors had ever been contacted by UvA to assist with the investigation. The University could have taken an interest in the data files, or in earlier drafts of the papers, or in information on when, where and by whom the studies were run. Apparently, however, UvA’s Executive Board did not find any of these relevant for judging the potential veracity of the publications and requesting retraction.

Only upon repeated requests, on April 7th, 2015, we received the 109-page report (dated March 31st, 2015) and were given 2.5 weeks to respond. This deadline was determined one-sidedly. Also, UvA did not provide the R-code used to investigate our papers for almost two weeks (until April 22nd), despite the fact that it was listed as an attachment to the initial report. We responded on April 27th, following which the authors of the report corrected it (henceforth Report-R) and wrote a response letter (henceforth, the PKW letter, after authors Peeters, Klaassen, and de Wiel). Both documents are dated May 15, 2015, but were sent to us only on June 2, **the same day that UvA also published the news regarding the report** and its conclusions on its official site, and the final report was leaked. **Thus, we were not allowed any time to read Report-R or the PKW letter before the report and UvA’s conclusions were made public. These and other procedural decisions by the UvA were needlessly detrimental to us.**

The present response letter refers to Report-R. Report-R is almost unchanged compared to the original report, except that the language of the report and the labels for the qualitative assessments of the papers are somewhat softened, to refer to “low veracity” rather than “fraud” or “manipulation”. This has been done to reflect the authors’ own acknowledgement that their methods “cannot demarcate fabrication from erroneous or questionable research practices.” UvA’s retraction decisions only slightly changed in response to this acknowledgement. They are still requesting retraction of papers with “strong evidence for low veracity”. They are also asking journals to “consider retraction” for papers with “inconclusive evidence for low veracity,” which seems not to match this lukewarm new label (also see Point 2 below about the likelihood for a paper to receive this label erroneously).

Unlike our initial response letter, this letter is not addressed to UvA, but rather to editors who read Report-R or reports about it. To keep things simple, we will refer to the PKW letter by citing from it only when necessary. In this way, a reader can follow our argument by reading Report-R and the present letter, but is not required to also read the original version of the report, our previous response letter, and the PKW letter.

Because of time pressure, we decided to respond only to findings that concerned co-authored papers, excluding the by-now-retracted paper Förster and Denzler (2012, SPPS). We therefore looked at the general introduction of Report-R and at the sections that concern the following papers:

In the “strong evidence for low veracity” category

Förster and Denzler, 2012, JESP

Förster, Epstude, and Ozelsel, 2009, PSPB

Förster, Liberman, and Shapira, 2009, JEP:G

Liberman and Förster, 2009, JPSP

In the “inconclusive evidence for low veracity” category

Denzler, Förster, and Liberman, 2009, JESP

Förster, Liberman, and Kuschel, 2008, JPSP

Kuschel, Förster, and Denzler, 2010, SPPS

This is not meant to suggest that our criticism does not apply to the other parts of Report-R. We just did not have sufficient time to carefully examine them. **We would like to elaborate now on points 1-7 above and explain in detail why we think that UvA’s report is biased, misleading, and flawed.**

**The new method by Klaassen (2015) (the V method) is inherently biased**

Report-R heavily relies on a new method for detecting low veracity (Klaassen, 2015), whose author, Prof. Klaassen, is also one of the authors of Report-R (and its previous version).

In this method (which we’ll refer to as the V method), a V coefficient is computed and used as an indicator of data veracity. V is called “evidential value” and is treated as the belief-updating coefficient in Bayes’ formula, as in equation (2) in Klaassen (2015). For example, according to the V method, when we examine a new study with V = 2, our posterior odds for fabrication should be double the prior odds. If we now add another study with V = 3, our confidence in fabrication should triple again. Klaassen (2015) writes: “When a paper contains more than one study based on independent data, then the evidential values of these studies can and may be combined into an overall evidential value by multiplication in order to determine the validity of the whole paper” (p. 10).

The problem is that V is not allowed to be less than unity. This means that nothing can ever reduce confidence in the hypothesis of “low data veracity”. The V method entails, for example, that the more studies a paper contains, the more convinced we should become that its data have low veracity.
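To make the asymmetry concrete, here is a minimal sketch (ours, not Klaassen’s code) of combining evidential values by multiplication, as equation (2) of Klaassen (2015) prescribes. Because each V is constrained to be at least 1, the combined value can only stay flat or grow, whatever the data look like:

```python
from math import prod

def combined_evidential_value(vs):
    """Combine per-study evidential values by multiplication, as the
    V method prescribes for independent studies in one paper.
    Under the method, each V is constrained to be >= 1."""
    assert all(v >= 1 for v in vs), "the V method never yields V < 1"
    return prod(vs)

# Posterior odds of fabrication = combined V * prior odds.
studies = [2.0, 3.0]  # the worked example in the text
print(combined_evidential_value(studies))          # 6.0

# Adding any further study can only preserve or increase the odds:
print(combined_evidential_value(studies + [1.0]))  # still 6.0
print(combined_evidential_value(studies + [1.5]))  # 9.0
```

A most uninformative further study (V = 1) leaves the odds unchanged; no study can ever lower them.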

**Klaassen (2015) writes “we apply the by now standard approach in Forensic Statistics” (p. 1). We doubt very much, however, that an approach that can only increase confidence in a defendant’s guilt could be a standard approach in court.**

We consulted an expert in Bayesian statistics (s/he preferred not to disclose his/her name). S/he found the V method problematic, and noted that, quite contrary to the V method, typical Bayesian methods would allow both upward and downward changes in one’s confidence in a prior hypothesis.
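The contrast can be illustrated with a toy likelihood-ratio (Bayes factor) comparison of two hypotheses — our hypothetical example, not anything from Report-R, with the rates 0.9 and 0.5 chosen purely for illustration. Unlike V, a Bayes factor can fall below 1, in which case it lowers the posterior odds:

```python
from math import comb

def bayes_factor(k, n, p_alt=0.9, p_null=0.5):
    """Likelihood ratio for k successes in n Bernoulli trials under an
    'alternative' rate vs. a 'null' rate (both rates are illustrative
    assumptions). Values > 1 raise the posterior odds; values < 1 lower them."""
    lik = lambda p: comb(n, k) * p**k * (1 - p) ** (n - k)
    return lik(p_alt) / lik(p_null)

# Data that fit the alternative push the odds up...
print(bayes_factor(9, 10) > 1)   # True
# ...while data that fit the null push the odds down --
# something the V method, bounded below by 1, can never do.
print(bayes_factor(5, 10) < 1)   # True
```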

In their letter, PKW defend the V method by saying that it has been used in the Stapel and Smeesters cases. As far as we know, however, in these cases there was other, independent evidence of fraud (e.g., Stapel reported significant effects with t-test values smaller than 1; in Smeesters’ data, individual scores were distributed too evenly; see Simonsohn, 2013) and the V method only supported that other evidence. In contrast, in our case, labeling the papers in question as having “low scientific veracity” is almost always based only on V values – the second method for testing “ultra-linearity” in a set of studies (Δ*F *combined with Fisher’s method) either could not be applied due to a low number of independent studies in the paper or was applied and did not yield a reason for concern. We do not know what weight the V method received in the Stapel and Smeesters cases (relative to other evidence), and whether all the experts who examined those cases found the method useful. As noted before, a statistician we consulted found the method very problematic.

The authors of Report-R do acknowledge that combining V values becomes problematic as the number of studies increases (e.g., p. 4) and explain in the PKW letter that “the conclusions reached in the report are never based on overall evidential values, but on the (number of) evidential values of individual samples/sub-experiments that are considered substantial”. They nevertheless proceed to compute overall V’s and report them repeatedly in Report-R (e.g., “The overall V has a lower bound of 9.93″, p. 31; “The overall V amounts to 8.77″, on p. 66). Why?

**The criteria for “low veracity” are too inclusive**

… applying the V method across the board would result in erroneously retracting 1/12-1/19 of all published papers with experimental designs similar to those examined in Report-R (before taking into account those flagged as exhibiting “inconclusive” evidence).

In their letter, PKW write “these probabilities are in line with (statistical) standards for accepting a chance-result as scientific evidence”. In fact, these p-values are higher than what is commonly acceptable in science. One would think that in “forensic” contexts of “fraud detection” the threshold should be, if anything, even more stringent (that is, with a lower chance of error).

Report-R says: “When there is no strong evidence for low scientific veracity (according to the judgment above), but there are multiple constituent (sub)experiments with a substantial evidential value, then the evidence for low scientific veracity of a publication is considered inconclusive” (p. 2). As already mentioned, UvA plans to ask journals to consider retraction of such papers. For example, in Denzler, Förster, and Liberman (2009) there are two Vs that are greater than 6 (Table 14.2) out of 17 V values computed for that paper in Report-R. The probability of obtaining two or more values of 6 or more out of 17 computed values by chance is 0.40. **Let us reiterate this figure: a 40% chance of a type-I error.**

Do these thresholds provide good enough reasons to ask journals to retract a paper or consider retraction? Apparently, the Executive Board of the University of Amsterdam thinks so. We are sure that many would disagree.

An anecdotal demonstration of the potential consequences of applying such liberal standards comes from our examination of a recent publication by Blanken, de Ven, Zeelenberg, and Meijers (2014, Social Psychology) using the V method. We chose this paper because it had the appropriate design (three between-subjects conditions) and was conducted as part of an Open Science replication initiative. It presents three failures to replicate the moral licensing effect (e.g., Merritt, Effron, & Monin, 2010). The whole research process is fully transparent, and materials and data are available online. The three experiments in this paper yield 10 V values, two of which are higher than 6 (9.02 and 6.18; we thank PKW for correcting a slight error in our earlier computation). The probability of obtaining two or more V-values of 6 or more out of 10 by chance is 0.19. By the criteria of Report-R, this paper would be classified as showing “inconclusive evidence of low veracity”. **By the standards of UvA’s Executive Board, which did not seek any confirming evidence for statistical findings based on the V method, this would require sending a note to the journal asking it to consider retraction of this failed replication paper**. We doubt that many would find this reasonable.
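Both chance figures follow from a simple binomial computation. The per-value chance probability of a “substantial” V (V ≥ 6) is not stated explicitly here; a value of roughly 0.08 — our back-solved assumption, consistent with the 1/12–1/19 range quoted earlier — recovers both numbers:

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance that at least k of
    n independently computed V values cross the threshold by chance."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

P_SUBSTANTIAL = 0.08  # assumed per-value chance of V >= 6

# Denzler, Förster, and Liberman (2009): 2+ of 17 values over 6
print(round(prob_at_least(2, 17, P_SUBSTANTIAL), 2))  # 0.4

# Blanken et al. (2014): 2+ of 10 values over 6
print(round(prob_at_least(2, 10, P_SUBSTANTIAL), 2))  # 0.19
```

The same arithmetic shows why the criterion grows more inclusive as a paper reports more sub-experiments: with more V values computed, crossing the threshold twice by chance alone becomes ever more likely.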

It is interesting in this context to note that in a different investigation that applied a variation of the V method (investigation of the Smeesters case) a V = 9 was used as the threshold. Simply adopting that threshold from previous work in the current report would dramatically change the conclusions. Of the 20 V values deemed “substantial” in the papers we consider here, only four have Vs over 9, which would qualify them as “substantial” with this higher threshold. Accordingly, none of the papers would have made it to the “strong evidence” category. In addition, three of the four Vs that are above 9 pertain to control conditions – we elaborate later on why this might be problematic.

**Dependence of measurement errors does not necessarily indicate low veracity**

Klaassen (2015) writes: “If authors are fiddling around with data and are fabricating and falsifying data, they tend to underestimate the variation that the data should show due to the randomness within the model. Within the framework of the above ANOVA-regression case, we model this by introducing dependence between the normal random variables ε_ij, which represent the measurement errors” (p. 3). Thus, the argument that underlies the V method is that if fraud tends to create dependence of measurement errors between independent samples, then any evidence of such dependence is indicative of fraud. This is a logically invalid deduction (it affirms the consequent). There are many benign causes that might create dependency between measurement errors in independent conditions. …

**See the entire response:** Response to a Report Published by the University of Amsterdam.

**Klaassen, C. A. J. (2015).** *Evidential value in ANOVA-regression results in scientific integrity studies*. arXiv:1405.4540v2 [stat.ME].

Discussion of the Klaassen method on PubPeer: https://pubpeer.com/publications/5439C6BFF5744F6F47A2E0E9456703

**Some previous posts on Jens Förster case:**

- May 10, 2014: Who ya gonna call for statistical Fraudbusting? R.A. Fisher, P-values, and error statistics (again)
- January 18, 2015: Power Analysis and Non-Replicability: If bad statistics is prevalent in your field, does it follow you can’t be guilty of scientific fraud?

Filed under: junk science, reproducibility. Tagged: Jens Forster


“There are some ironic twists in the way psychology is dealing with its replication crisis that may well threaten even the most sincere efforts to put the field on firmer scientific footing”

That’s philosopher’s talk for “I see a rich source of problems that cry out for the ministrations of philosophers of science and of statistics”. Yesterday, I began my talk at the Society for Philosophy and Psychology workshop on “Replication in the Sciences” with examples of two main philosophical tasks: to clarify concepts, and to reveal inconsistencies, tensions, and ironies surrounding methodological “discomforts” in scientific practice.

Example of a conceptual clarification: Editors of a journal, Basic and Applied Social Psychology, announced they are banning statistical hypothesis testing because it is “invalid” (A puzzle about the latest “test ban”). It’s invalid because it does not supply “the probability of the null hypothesis, given the finding” (the posterior probability of H_0) (Trafimow and Marks, 2015).

- Since the methodology of testing explicitly rejects the very mode of inference it is charged with failing to supply, it is incorrect to claim the methods are invalid.
- Simple conceptual job that philosophers are good at
(I don’t know if the group of eminent statisticians assigned to react to the “test ban” will bring up this point. I don’t think it includes any philosophers.)

____________________________________________________________________________________

Example of revealing inconsistencies and tensions:

Critic: It’s too easy to satisfy standard significance thresholds.

You: Why do replicationists find it so hard to achieve significance thresholds?

Critic: Obviously the initial studies were guilty of p-hacking, cherry-picking, significance seeking, QRPs.

You: So, the replication researchers want methods that pick up on and block these biasing selection effects.

Critic: Actually the “reforms” recommend methods where selection effects and data dredging make no difference.

________________________________________________________________

Whether this can be resolved or not is separate.

- We are constantly hearing of how the “reward structure” leads to taking advantage of researcher flexibility
- As philosophers, we can at least show how to hold their feet to the fire, and warn of the perils of accounts that bury the finagling

The philosopher is the curmudgeon (takes chutzpah!)

I also think it’s crucial for philosophers of science and statistics to show how to improve on and solve problems of methodology in scientific practice.

My slides are below; share comments.

Filed under: Error Statistics, reproducibility, Statistics