Phil 6334 Visitor: S. Stanley Young, “Statistics and Scientific Integrity”

We are pleased to announce our guest speaker at Thursday’s seminar (April 24, 2014), on “Statistics and Scientific Integrity”:

S. Stanley Young, PhD
Assistant Director for Bioinformatics
National Institute of Statistical Sciences
Research Triangle Park, NC



The main readings for the discussion are:


Categories: Announcement, evidence-based policy, Phil6334, science communication, selection effects, Statistical fraudbusting

Phil 6334: Foundations of statistics and its consequences: Day #12

We interspersed key issues from the reading for this session (from Howson and Urbach) with portions of my presentation at the Boston Colloquium (Feb. 2014): Revisiting the Foundations of Statistics in the Era of Big Data: Scaling Up to Meet the Challenge. (Slides below.)*

Someone sent us a recording (mp3) of the panel discussion from that Colloquium (there’s a lot on “big data” and its politics), including Mayo, Xiao-Li Meng (Harvard), Kent Staley (St. Louis), and Mark van der Laan (Berkeley).

See if this works: | mp3

*There’s a prelude here to our visitor on April 24: Professor Stanley Young from the National Institute of Statistical Sciences.


Categories: Bayesian/frequentist, Error Statistics, Phil6334

Getting Credit (or blame) for Something You Didn’t Do (BP oil spill)


Spill Cam

Spill Cam

Four years ago, many of us were glued to the “spill cam” showing, in real time, the oil gushing from the April 20, 2010 explosion that sank the Deepwater Horizon oil rig in the Gulf of Mexico, killing 11, and spewing oil until July 15 (see the video clip added below). Remember junk shots, top kill, blowout preventers? [1] The EPA lifted its gulf drilling ban on BP just a couple of weeks ago* (BP has paid around $27 billion in fines and compensation), and April 20, 2014, is the deadline to properly file forms for new compensations.

(*After which BP had another small spill in Lake Michigan.)

But what happened to the 200 million gallons of oil? Has it vanished, or just been sunk to the bottom of the sea by dispersants that may have caused hidden destruction of sea life? I don’t know, but given it’s Saturday night, let’s listen in to a reblog of a spill-related variation on the second of two original “overheard at the comedy hour” jokes.

In effect, it accuses the frequentist error-statistical account of licensing the following (make-believe) argument after the 2010 oil spill:

Oil Exec: We had highly reliable evidence that H: the pressure was at normal levels on April 20, 2010!

Senator: But you conceded that whenever your measuring tool showed dangerous or ambiguous readings, you continually lowered the pressure, and that the stringent “cement bond log” test was entirely skipped.

Oil Exec: Granted, we omitted reliable checks on April 20, 2010, but usually we do a better job—I am reporting the average! You see, we use a randomizer that most of the time directs us to run the gold-standard check on pressure. April 20 just happened to be one of those times we did the nonstringent test; but on average we do OK.

Senator:  But you don’t know that your system would have passed the more stringent test you didn’t perform!

Oil Exec: That’s the beauty of the frequentist test!

Even if we grant (for the sake of the joke) that overall, this “test” rarely errs in the report it outputs (pass or fail), that is irrelevant to appraising the inference from the data on April 20, 2010 (which would have differed had the more stringent test been run). That interpretation violates the severity criterion: the observed passing result was altogether common if generated from a source where the pressure level was unacceptably high; therefore it misinterprets the actual data. The question is why anyone would saddle the frequentist with such shenanigans on averages. … Lest anyone think I am inventing a criticism, here is a familiar statistical instantiation, where the probability of each experiment is given to be .5 (Cox 1958).

Two Measuring Instruments with Different Precisions:

A single observation X is to be made on a normally distributed random variable with unknown mean µ, but the measurement instrument is chosen by a coin flip: with heads we use instrument E′ with a known small variance, say 10⁻⁴, while with tails we use E″, with a known large variance, say 10⁴. The full data indicate whether E′ or E″ was performed, and the particular value observed, which we can write as x′ and x″, respectively. (This example comes up in my discussions of the strong likelihood principle (SLP), e.g., ton o’ bricks, and here.)

In applying our favorite one-sided (upper) Normal test T+ to a null hypothesis, say, µ = 0, the “same” value of X would correspond to a much smaller p-value were it to have come from E’ than if it had come from E”.  Denote the two p-values as p’ and p”, respectively.  However, or so the criticism proceeds, the error statistician would report the average p-value:  .5(p’ + p”).

But this would give a misleading assessment of the precision and corresponding severity with either measurement! Instead you should report the p-value of the result in the experiment actually run (this is Cox’s Weak Conditionality Principle, WCP).
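
To see the two reports side by side numerically, here is a minimal sketch; the observed value x = 1.0 is an illustrative assumption, with the known variances 10⁻⁴ and 10⁴ taken from the example.

```python
import math

def p_value(x, sigma):
    """One-sided (upper) p-value for a single N(mu, sigma^2) observation,
    testing mu = 0: P(X >= x; mu = 0) = 1 - Phi(x / sigma)."""
    z = x / sigma
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

x = 1.0                               # hypothetical observed value
p_precise = p_value(x, sigma=0.01)    # E' was the instrument actually used
p_coarse = p_value(x, sigma=100.0)    # E'' was not used

# The caricatured unconditional report vs. the WCP-conditional report:
p_average = 0.5 * (p_precise + p_coarse)
print(p_precise, p_coarse, p_average)
```

The same x that is wildly significant under E′ (p′ ≈ 0) is uninformative under E″ (p″ ≈ .5); averaging the two misrepresents the precision of the experiment actually run.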

But what could lead the critic to suppose the error statistician must average over experiments not even performed?  Rule #2 for legitimate criticism is to give the position being criticized the most generous construal one can think of.  Perhaps the critic supposes what is actually a distortion of even the most radical behavioristic construal:

  •   If you consider outcomes that could have occurred in hypothetical repetitions of this experiment, you must also consider other experiments you did not run (but could have been run) in reasoning from the data observed (from the test you actually ran), and report some kind of frequentist average!

The severity requirement makes explicit that such a construal is to be rejected—I would have thought it obvious, and not in need of identifying a special principle. Since it wasn’t, I articulated this special notion for interpreting tests and the corresponding severity criterion.
I gave an honorary mention to Christian Robert [3] on this point in his discussion of Cox and Mayo (2010). Robert writes (p. 9):

A compelling section is the one about the weak conditionality principle (pp.294- 298), as it objects to the usual statement that a frequency approach breaks this principle. In a mixture experiment about the same parameter θ, inferences made conditional on the experiment “are appropriately drawn in terms of the sampling behaviour in the experiment known to have been performed” (p. 296). This seems hardly objectionable, as stated. And I must confess the sin of stating the opposite as The Bayesian Choice has this remark (Robert (2007), Example 1.3.7, p.18) that the classical confidence interval averages over the experiments. The term experiment validates the above conditioning in that several experiments could be used to measure θ, each with a different p-value. I will not argue with this.

He would want me to mention that he does raise some caveats:

I could, however, [argue] about ‘conditioning is warranted to achieve objective frequentist goals’ (p. 298) in that the choice of the conditioning, among other things, weakens the objectivity of the analysis. In a sense the above pirouette out of the conditioning principle paradox suffers from the same weakness, namely that when two distributions characterise the same data (the mixture and the conditional distributions), there is a choice to be made between “good” and “bad”.

But there is nothing arbitrary about regarding as “good” the only experiment actually run and from which the actual data arose.  The severity criterion only makes explicit what is/should be already obvious. Objectivity, for us, is directed by the goal of making correct and warranted inferences, not freedom from thinking. After all, any time an experiment E is performed, the critic could insist that the decision to perform E is the result of some chance circumstances and with some probability we might have felt differently that day and have run some other test, perhaps a highly imprecise test or a much more precise test or anything in between, and demand that we report whatever average properties they come up with.  The error statistician can only shake her head in wonder that this gambit is at the heart of criticisms of frequentist tests.

Still, we exiled ones can’t be too fussy, and Robert still gets the mention for conceding that we have a solid leg on which to pirouette.

[1] The relevance of the Deepwater Horizon spill to this blog stems from its having occurred while I was busy organizing the conference “StatSci Meets PhilSci” (which took place at the LSE in June 2010). So all my examples there involved “deepwater drilling”, but of the philosophical sort. Search the blog for further connections (especially the RMM volume, and the blog’s “mascot” stock, Diamond Offshore, DO, which has now bottomed out at around $48; long story).

Of course, the spill cam wasn’t set up right away.

[2] If any readers work on the statistical analysis of the toxicity of the fish or sediment from the BP oil spill, or know of good references, please let me know.

BP said all tests had shown that Gulf seafood was safe to consume and there had been no published studies demonstrating seafood abnormalities due to the Deepwater Horizon accident.

[3] There have been around 4-5 other “honorable mentions” since then, though I’m not sure of the exact count.


Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference,” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos, eds.), Cambridge: Cambridge University Press: 247–275.




Categories: Comedy, Statistics

Duality: Confidence intervals and the severity of tests

A question came up in our seminar today about how to understand the duality between a simple one-sided test and the lower limit (LL) of a corresponding 1-sided confidence interval estimate. This is also a good route to SEV (i.e., severity). Here’s a quick answer:

Consider our favorite test of the mean of a Normal distribution with n iid samples, and known standard deviation σ: test T+. This time let:

H0: µ ≤ 0 against H1: µ > 0, and let σ = 1.

Nothing of interest to the logic changes if the s.d. is estimated, as is more typical. With σ = 1 and n = 25, σ/√n = .2.

The (1 – α) confidence interval (CI) corresponding to test T+ consists of the values of µ exceeding the (1 – α) LL:

µ > M – cα(1/√n),

where M represents the statistic, usually written X-bar, the sample mean. For example,

M – 2.5(1/√n)

is the generic lower limit (LL) of a 99% CI. The impressive thing is that this holds regardless of the true value of µ. If, for any M, you assert:

µ > M – cα(1/√n),

your assertions will be correct 99% of the time. [Once the data are in hand, M takes the value of a particular sample mean. Without quantifiers, this is a little imprecise.]
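
The 99% coverage claim can be checked by simulation; a quick sketch (the “true” µ value and the seed are arbitrary assumptions, since the coverage holds for any µ):

```python
import math
import random

random.seed(6334)  # arbitrary seed for reproducibility

mu_true = 0.3               # arbitrary; the coverage claim holds for any value
sigma, n, c = 1.0, 25, 2.5
se = sigma / math.sqrt(n)   # = .2

trials = 100_000
covered = 0
for _ in range(trials):
    M = random.gauss(mu_true, se)   # the sample mean of n iid N(mu, 1) draws
    LL = M - c * se                 # generic lower limit
    if mu_true > LL:
        covered += 1

coverage = covered / trials
print(coverage)   # close to Phi(2.5), about .994
```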

Now for the duality between CIs and tests. How does it work?

Put aside for the moment our fixed hypothesis of interest; just retain the form of test T+. Keeping the s.d. of 1, and n = 25, suppose we have observed M = .6.

Consider the question: for what value of µ0 would M = .6 be the 2.5 s.d. cut-off (in test T+)? That is, for what value of µ0 would an observed mean of .6 exceed µ0 by 2.5 s.d.s? (Or again: for what value of µ0 would our observation reach a p-value of .01 in test T+?)

Clearly, the answer is in testing H0: µ ≤  .1 against H1: µ >  .1.

The corresponding .99 lower limit of the one-sided confidence interval would be:

(.1, ∞), i.e., the claim µ > .1.

The duality with tests says that these are the µ values (in the given model and test) that would not be rejected as statistically significant at the .01 level, had they been the ones tested in T+. For example:

H0: µ ≤  .15 would not be rejected, nor would H0: µ ≤  .2, H0: µ ≤  .25 and so on. That’s because the observed M is not statistically significantly greater (at the .01 level) than any of the µ values in the interval. Since this is continuous, it does not matter if the cut-off is just at .1 or values greater than .1.

On the other hand, a test hypothesis of H0: µ ≤  .09 would be rejected by M = .6; as would µ ≤  .08, µ ≤  .07…. H0: µ ≤  0, and so on. Using significance test language again, the observed M is statistically significantly greater than all these values (p-level smaller than .01), and at smaller and smaller levels of significance.

Under the supposition that the data were generated from a world where H0: µ ≤ .1, at least 99% of the time a smaller M than the one observed would occur.

The test was so incapable of having produced so large a value of M as .6, were µ less than the 99% CI lower bound, that we argue there is an indication (if not full-blown evidence) that µ > .1.

We are assuming these values are “audited”, and the assumptions of the model permit the computations to be approximately valid. Following Fisher, evidence of an experimental effect requires more than a single, isolated significant result, but let us say that is satisfied.

The severity with which µ > .1 “passes” the test with this result M = .6 (in test T+) is ~.99:

SEV(µ > .1, test T+, M = .6) = P(M < .6; µ = .1) = P(Z < (.6 – .1)/.2) = P(Z < 2.5) = .99.

Here’s a little chart for this example:

Duality between the LL of 1-sided confidence intervals and a fixed outcome M = .6 of test T+: H0: µ ≤ µ0 vs H1: µ > µ0. σ = 1, n = 25, σ/√n = .2. These computations are approximate.

Were µ no greater than | Capability of T+ to produce M as large as .6 | Level (1 – α) at which µ is the 1-sided LL | Claim C | SEV associated with C
.1 | .01 | .99 | µ > .1 | .99
.2 | .025 | .975 | µ > .2 | .975
.3 | .07 | .93 | µ > .3 | .93
.4 | .16 | .84 | µ > .4 | .84
.5 | .3 | .7 | µ > .5 | .7
.6 | .5 | .5 | µ > .6 | .5
.7 | .69 | .31 | µ > .7 | .31

In all these cases, the test had fairly low capability to produce an M as large as .6; the largest capability reached is .69. I’ll consider what the test is more capable of doing in another post. Note that as the capability increases, the corresponding confidence level decreases.
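
The chart’s entries can be reproduced with nothing more than the standard Normal CDF; a sketch:

```python
import math

def Phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

M, se = 0.6, 0.2   # observed mean; sigma / sqrt(n) = 1 / sqrt(25)

rows = []
for mu0 in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]:
    capability = 1 - Phi((M - mu0) / se)   # P(M >= .6; mu = mu0)
    sev = Phi((M - mu0) / se)              # SEV(mu > mu0) = P(M < .6; mu = mu0)
    rows.append((mu0, capability, sev))
    print(f"mu0 = {mu0:.1f}: capability = {capability:.3f}, SEV = {sev:.3f}")
```

The exact figures (e.g., .0062 rather than .01 at µ0 = .1) differ slightly from the chart, which, as noted, uses approximate values.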

Categories: confidence intervals and tests, Phil6334

A. Spanos: Jerzy Neyman and his Enduring Legacy

A Statistical Model as a Chance Mechanism
Aris Spanos 

Jerzy Neyman (April 16, 1894 – August 5, 1981), was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals. (This article was first posted here.)

Neyman: 16 April 1894 – 5 Aug 1981

One of Neyman’s most remarkable, but least recognized, achievements was his adaptation of Fisher’s (1922) notion of a statistical model to render it pertinent for non-random samples. Fisher’s original parametric statistical model Mθ(x) was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data x0:=(x1,x2,…,xn) can be viewed as a ‘truly representative sample’ from that ‘population’:

“The postulate of randomness thus resolves itself into the question, ‘Of what population is this a random sample?’” (ibid., p. 313), underscoring that “the adequacy of our choice may be tested a posteriori.” (p. 314)

In cases where data x0 come from sample surveys, or can be viewed as a typical realization of a random sample X:=(X1,X2,…,Xn), i.e. Independent and Identically Distributed (IID) random variables, the ‘population’ metaphor can be helpful in adding some intuitive appeal to the inductive dimension of statistical inference, because one can imagine using a subset of a population (the sample) to draw inferences pertaining to the whole population.

This ‘infinite population’ metaphor, however, is of limited value in most applied disciplines relying on observational data. To see how inept this metaphor is consider the question: what is the hypothetical ‘population’ when modeling the gyrations of stock market prices? More generally, what is observed in such cases is a certain on-going process and not a fixed population from which we can select a representative sample. For that very reason, most economists in the 1930s considered Fisher’s statistical modeling irrelevant for economic data!

Due primarily to Neyman’s experience with empirical modeling in a number of applied fields, including genetics, agriculture, epidemiology, biology, astronomy and economics, his notion of a statistical model evolved in the 1930s beyond Fisher’s ‘infinite populations’ into Neyman’s frequentist ‘chance mechanisms’ (see Neyman, 1950, 1952):

Guessing and then verifying the ‘chance mechanism’, the repeated operation of which produces the observed frequencies. This is a problem of ‘frequentist probability theory’. Occasionally, this step is labeled ‘model building’. Naturally, the guessed chance mechanism is hypothetical. (Neyman, 1977, p. 99)

From my perspective, this was a major step forward for several reasons, including the following.

First, the notion of a statistical model as a ‘chance mechanism’ extended the intended scope of statistical modeling to include dynamic phenomena that give rise to data from non-IID samples, i.e. data that exhibit both dependence and heterogeneity, like stock prices.

Second, the notion of a statistical model as a ‘chance mechanism’ is not only of metaphorical value, but it can be operationalized in the context of a statistical model, formalized by:

Mθ(x) = {f(x;θ), θ∈Θ}, x∈Rⁿ, Θ⊂Rᵐ, m << n,

where the distribution of the sample f(x;θ) describes the probabilistic assumptions of the statistical model. This takes the form of a statistical Generating Mechanism (GM), stemming from  f(x;θ), that can be used to generate simulated data on a computer. An example of such a Statistical GM is:

Xt = α0 + α1Xt-1 + σεt,  t=1,2,…,n

This indicates how one can use pseudo-random numbers for the error term  εt ~NIID(0,1) to simulate data for the Normal, AutoRegressive [AR(1)] Model. One can generate numerous sample realizations, say N=100000, of sample size n in nanoseconds on a PC.
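
A minimal sketch of such a simulation (the parameter values and sample size are illustrative assumptions, not taken from the text):

```python
import random

random.seed(2014)  # arbitrary seed for reproducibility

def simulate_ar1(n, alpha0, alpha1, sigma, x0=0.0):
    """One realization of the Normal AR(1) statistical GM:
    X_t = alpha0 + alpha1 * X_{t-1} + sigma * eps_t, with eps_t ~ NIID(0, 1)."""
    x, path = x0, []
    for _ in range(n):
        x = alpha0 + alpha1 * x + sigma * random.gauss(0.0, 1.0)
        path.append(x)
    return path

series = simulate_ar1(n=200, alpha0=0.5, alpha1=1.0, sigma=1.0)
print(series[:5])
```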

Third, the notion of a statistical model as a ‘chance mechanism’ puts a totally different spin on another metaphor widely used by uninformed critics of frequentist inference. This is the ‘long-run’ metaphor associated with the relevant error probabilities used to calibrate frequentist inferences. The operationalization of the statistical GM reveals that the temporal aspect of this metaphor is totally irrelevant for frequentist inference; remember Keynes’s catch phrase “In the long run we are all dead”? Instead, what matters in practice is repeatability in principle, not over time! For instance, one can use the above statistical GM to generate the empirical sampling distributions for any test statistic, and thus render operational not only the pre-data error probabilities, like the type I and II error probabilities and the power of a test, but also the post-data probabilities associated with the severity evaluation; see Mayo (1996).
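
The point about operationalizing error probabilities can be sketched the same way: repeat the GM many times and read the relevant frequencies directly off the replications. Everything here (the parameter values, the sample mean as test statistic, the threshold .4) is an illustrative assumption.

```python
import random
import statistics

random.seed(1894)  # arbitrary seed

def ar1_path(n, alpha0=0.0, alpha1=0.5, sigma=1.0):
    """One realization of the AR(1) GM: X_t = alpha0 + alpha1*X_{t-1} + sigma*eps_t."""
    x, path = 0.0, []
    for _ in range(n):
        x = alpha0 + alpha1 * x + sigma * random.gauss(0.0, 1.0)
        path.append(x)
    return path

# Empirical sampling distribution of the sample mean over N replications:
N, n = 2000, 100
means = [statistics.mean(ar1_path(n)) for _ in range(N)]

# An error probability read off the replications, e.g. how often the
# sample mean exceeds a threshold when the true mean is 0:
freq_exceed = sum(m > 0.4 for m in means) / N
print(statistics.mean(means), statistics.stdev(means), freq_exceed)
```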


For further discussion on the above issues see:

Spanos, A. (2012), “A Frequentist Interpretation of Probability for Model-Based Inductive Inference,” Synthese.

Fisher, R. A. (1922), “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society A, 222: 309-368.

Mayo, D. G. (1996), Error and the Growth of Experimental Knowledge, The University of Chicago Press, Chicago.

Neyman, J. (1950), First Course in Probability and Statistics, Henry Holt, NY.

Neyman, J. (1952), Lectures and Conferences on Mathematical Statistics and Probability, 2nd ed. U.S. Department of Agriculture, Washington.

Neyman, J. (1977), “Frequentist Probability and Frequentist Statistics,” Synthese, 36, 97-131.

[i] He was born in an area that was then part of Russia.

Categories: phil/history of stat, Spanos, Statistics

Phil 6334: Notes on Bayesian Inference: Day #11 Slides



A. Spanos Probability/Statistics Lecture Notes 7: An Introduction to Bayesian Inference (4/10/14)

Categories: Bayesian/frequentist, Phil 6334 class material, Statistics

“Murder or Coincidence?” Statistical Error in Court: Richard Gill (TEDx video)

“There was a vain and ambitious hospital director. A bad statistician. … There were good medics and bad medics, good nurses and bad nurses, good cops and bad cops … Apparently, even some people in the Public Prosecution service found the witch hunt deeply disturbing.”

This is how Richard Gill, statistician at Leiden University, describes a feature film (Lucia de B.) just released about the case of Lucia de Berk, a nurse found guilty of several murders based largely on statistics. Gill is widely known (among other things) for exposing the flawed statistical analysis used to convict her, which ultimately led (after Gill’s tireless efforts) to her conviction being overturned. (I hope they translate the film into English.) In a recent e-mail Gill writes:

“The Dutch are going into an orgy of feel-good tear-jerking sentimentality as a movie comes out (the premiere is tonight) about the case. It will be a good movie, actually, but it only tells one side of the story. …When a jumbo jet goes down we find out what went wrong and prevent it from happening again. The Lucia case was a similar disaster. But no one even *knows* what went wrong. It can happen again tomorrow.

I spoke about it a couple of days ago at a TEDx event (Flanders).

You can find some p-values in my slides ["Murder by Numbers", pasted below the video]. They were important – first in convicting Lucia, later in getting her a fair re-trial.”

Since it’s Saturday night, let’s watch Gill’s TEDx talk, “Statistical Error in court”.

Slides from the Talk: “Murder by Numbers”:


Categories: junk science, P-values, PhilStatLaw, science communication, Statistics

“Out Damned Pseudoscience: Non-significant results are the new ‘Significant’ results!” (update)

Sell me that antiseptic!

We were reading “Out, Damned Spot: Can the ‘Macbeth effect’ be replicated?” (Earp, B., Everett, J., Madva, E., and Hamlin, J., 2014, Basic and Applied Social Psychology 36: 91–98) in an informal gathering of our 6334 seminar yesterday afternoon at Thebes. Some of the graduate students are interested in so-called “experimental” philosophy, and I asked for an example that used statistics for purposes of analysis. The example–and it’s a great one (thanks, Rory M!)–revolves around priming research in social psychology. Yes, the field that has come in for so much criticism of late, especially after Diederik Stapel was found to have been fabricating data altogether (search this blog, e.g., here).[1]

But since then the field has, ostensibly, attempted to clean up its act. On the meta-level, Simmons, Nelson, and Simonsohn (2011) is an excellent example of the kind of self-scrutiny the field needs, and their list of requirements and guidelines offer a much needed start (along with their related work). But the research itself appears to be going on in the same way as before (I don’t claim this one is representative), except that now researchers are keen to show their ability and willingness to demonstrate failure to replicate. So negative results are the new positives! If the new fashion is non-replication, that’s what will be found (following Kahneman‘s call for a “daisy chain” in [1]).

In “Out, Damned Spot,” the authors are unable to replicate what they describe as a famous experiment (Zhong and Liljenquist 2006), wherein participants who read “a passage describing an unethical deed as opposed to an ethical deed, … were subsequently likelier to rate cleansing products as more desirable than other consumer products” (92). There are a variety of protocols, all rather similar. For instance, students are asked to write out a passage to the effect that:

“I shredded a document that I knew my co-worker Harlan was desperately looking for so that I would be the one to get a promotion.”

or:

“I place the much sought-after document in Harlan’s mail box.”

See the article for the exact words. Participants are told, untruthfully, that the study is on handwriting, or on punctuation, or the like. (Aside: Would you feel more desirous of soap products after punctuating a paragraph about shredding a file that your colleague is looking for? More desirous than when…? More desirous than if you put it in his mailbox, I guess.[2]) In another variation on the Zhong et al. studies, when participants are asked to remember an unethical vs. an ethical deed they committed, they tended to pick an antiseptic wipe over a pen as compensation.

Yet these authors declare there is “a robust experimental foundation for the existence of a real-life Macbeth Effect” and therefore are surprised that they are unable to replicate the result. The very fact that the article starts with giving high praise to these earlier studies already raises a big question mark in my mind as to their critical capacities, so I am not too surprised that they do not bring such capacities into their own studies. It’s so nice to have cross-out capability. Given that the field considers this effect solid and important, it is appropriate for the authors to regard it as such. (I think they are just jumping onto the new bandwagon. Admittedly, I’m skeptical, so send me defenses, if you have them. I place this under “fallacies of negative results”.)

I asked the group of seminar participants if they could even identify a way to pick up on the “Macbeth” effect assuming no limits to where they could look or what kind of imaginary experiment one could run. Hmmm. We were hard pressed to come up with any. Follow evil-doers around (invisibly) and see if they clean-up? Follow do-gooders around (invisibly) to see if they don’t wash so much? (Never mind that cleanliness is next to godliness.) Of course if the killer has got blood on her (as in Lady “a little water clears us of this deed” Macbeth) she’s going to wash up, but the whole point is to apply it to moral culpability more generally (seeing if moral impurity cashes out as physical). So the first signal that an empirical study is at best wishy-washy, and at worst pseudoscientific is the utter vagueness of the effect they are studying. There’s little point to a sophisticated analysis of the statistics if you cannot get past this…unless you’re curious as to what other howlers lie in store. Yet with all of these experiments, the “causal” inference of interest is miles and miles away from the artificial exercises subjects engage in…. (unless too trivial to bother studying).

Returning to their study: after the writing exercise, the current researchers (Earp et al.) have participants rate various consumer products for their desirability on a scale of 1 to 7.

They found “no significant difference in the mean desirability of the cleansing items between the moral condition (M= 3.09) and immoral condition (M = 3.08)” (94)—a difference that is so small as to be suspect in itself. Their two-sided confidence interval contains 0 so the null is not rejected. (We get a p-value and Cohen’s d, but no data.) Aris Spanos brought out a point we rarely hear (that came up in our criticism of a study on hormesis): it’s easy to get phony results with artificial measurement scales like 1-7. (Send links of others discussing this.) The mean isn’t even meaningful, and anyway, by adjusting the scale, a non-significant difference can become significant. (I don’t think this is mentioned in Simmons, Nelson, and Simonsohn 2011, but I need to reread it.)

The authors seem to think that failing to replicate studies restores credibility, and is indicative of taking a hard-nosed line, getting beyond the questionable significant results that have come in for such a drubbing. It does not. You can do just as questionable a job finding no effect as finding one. What they need to do is offer a stringent critique of the other (and their own) studies. A negative result is not a stringent critique. (Kahneman: please issue this further requirement.)

In fact, the scrutiny our seminar group arrived at in a mere one-hour discussion did more to pinpoint the holes in the other studies than all their failures to replicate. As I see it, that’s the kind of meta-level methodological scrutiny that their field needs if they are to lift themselves out of the shadows of questionable science. I could go on for pages and pages on all that is irksome and questionable about their analysis but will not. These researchers don’t seem to get it. (Or so it seems.)

If philosophers are basing philosophical theories on such “experimental” work without tearing them apart methodologically, then they’re not doing their job. Quine was wrong, and Popper was right (on this point): naturalized philosophy (be it ethics, epistemology or other) is not a matter of looking to psychological experiment.

Some proposed labels: We might label as questionable science any inferential inquiry where the researchers have not shown sufficient self-scrutiny of fairly flagrant threats to the inferences of interest. These threats would involve problems all along the route from the data generation and modeling to their interpretation. If an enterprise regularly fails to demonstrate such self-scrutiny, or worse, if its standard methodology revolves around reports that do a poor job at self-scrutiny, then I label the research area pseudoscience. If it regularly uses methods that permit erroneous interpretations of data with high probability, then we might be getting into “fraud” or at least “junk” science. (Some people want to limit “fraud” to a deliberate act. Maybe so, but my feeling is, as professional researchers claiming to have evidence of something, the onus is on them to be self-critical. Unconscious wishful thinking doesn’t get you off the hook.)

[1] In 2012 Kahneman said he saw a train-wreck looming for social psychology and suggested a “daisy chain” of replication.

[2] Correction: I had “less” switched with “more” in the early draft (I wrote this quickly during the seminar).

[3] New reference from Uri Simonsohn:

[4] Addition, April 11, 2014: A commentator wrote that I should read Mook’s classic paper against “external validity”.

In replying, I noted that I agreed with Mook entirely: “I entirely agree that ‘artificial’ experiments can enable the most probative and severe tests. But Mook emphasizes the hypothetical-deductive method (which we may assume would be of the statistical and not purely deductive variety). This requires the very entailments that are questionable in these studies. … I have argued (in sync with Mook, I think) as to why ‘generalizability’, external validity and the like may miss the point of severe testing, which typically requires honing in on, amplifying, isolating, and even creating effects that would not occur in any natural setting. If our theory T would predict a result or effect with which these experiments conflict, then the argument to the flaw in T holds—as Mook remarks. What is missing from the experiments I criticize is the link that’s needed for testing—the very point that Mook is making.

I especially like Mook’s example of the wire monkeys. Despite the artificiality, we can use that experiment to understand that hunger reduction is not valued more than motherly comfort or the like. That’s the trick of a good experiment: if the theory (e.g., about hunger reduction being primary) were true, then we would not expect those laboratory results. The key question is whether we are gaining an understanding, and that’s what I’m questioning.

I’m emphasizing the meaningfulness of the theory-statistical hypothesis link on purpose. People get so caught up in the statistics that they tend to ignore, at times, the theory-statistics link.

Granted, there are at least two distinct things that might be tested: the effect itself (here the Macbeth effect), and the reliability of the previous positive results. Even if the previous positive results are irrelevant for understanding the actual effect of interest, one may wish to argue that it is or could be picking up something reliably. Even though understanding the effect is of primary importance, one may claim only to be interested in whether the previous results are statistically sound. Yet another interest might be claimed to be learning more about how to trigger it. I think my criticism, in this case, actually gets to all of these, and for different reasons. I’ll be glad to hear other positions.




Simmons, J., Nelson, L. and Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science XX(X) 1-8.

Zhong, C. and Liljenquist, K. 2006. Washing away your sins: Threatened morality and physical cleansing. Science, 313, 1451-1452.

Categories: fallacy of non-significance, junk science, reformers, Statistics | 12 Comments

Phil 6334: Duhem’s Problem, highly probable vs highly probed; Day #9 Slides


April 3, 2014: We interspersed discussion with slides; these cover the main readings of the day (check syllabus four): Duhem’s Problem and the Bayesian Way, and “Highly Probable vs Highly Probed”. Slides are below (followers of this blog will be familiar with most of this, e.g., here). We also did further work on misspecification testing.

Monday, April 7, is an optional outing, “a seminar class trip”

"Thebes", Blacksburg, VA


you might say, here at Thebes, at which time we will analyze the statistical curves of the mountains, pie charts of pizza, and (seriously) study some experiments on the problem of replication in “the Hamlet Effect in social psychology”. If you’re around, please bop in!

Mayo’s slides on Duhem’s Problem and more from April 3 (Day#9):



Categories: Bayesian/frequentist, highly probable vs highly probed, misspecification testing | 8 Comments

Who is allowed to cheat? I.J. Good and that after dinner comedy hour….

It was from my Virginia Tech colleague I.J. Good (in statistics), who died five years ago (April 5, 2009), at 93, that I learned most of what I call “howlers” on this blog. His favorites were based on the “paradoxes” of stopping rules. (I had posted this last year here.)

“In conversation I have emphasized to other statisticians, starting in 1950, that, in virtue of the ‘law of the iterated logarithm,’ by optional stopping an arbitrarily high sigmage, and therefore an arbitrarily small tail-area probability, can be attained even when the null hypothesis is true. In other words if a Fisherian is prepared to use optional stopping (which usually he is not) he can be sure of rejecting a true null hypothesis provided that he is prepared to go on sampling for a long time. The way I usually express this ‘paradox’ is that a Fisherian [but not a Bayesian] can cheat by pretending he has a plane to catch like a gambler who leaves the table when he is ahead” (Good 1983, 135) [*]

This paper came from a conference where we both presented, and he was extremely critical of my error statistical defense on this point. (I was like a year out of grad school, and he a University Distinguished Professor.) 

One time, years later, after hearing Jack give this howler for the nth time, “a Fisherian [but not a Bayesian] can cheat, etc.,” I was driving him to his office, and suddenly blurted out what I really thought:

“You know Jack, as many times as I have heard you tell this, I’ve always been baffled as to its lesson about who is allowed to cheat. Error statisticians require the overall and not the ‘computed’ significance level be reported. To us, what would be cheating would be reporting the significance level you got after trying and trying again in just the same way as if the test had a fixed sample size. True, we are forced to fret about how stopping rules alter the error probabilities of tests, while the Bayesian is free to ignore them, but why isn’t the real lesson that the Bayesian is allowed to cheat?” (A published version of my remark may be found in EGEK p. 351: “As often as my distinguished colleague presents this point…”)

 To my surprise, or actually shock, after pondering this a bit, Jack said something like, “Hmm, I never thought of it this way.”

By the way, the story of the “after dinner Bayesian comedy hour” on this blog did not allude to Jack but to someone who gave a much more embellished version. Since it’s Saturday night, let’s once again listen in on the comedy hour that unfolded at my dinner table at an academic conference:

 Did you hear the one about the researcher who gets a phone call from the guy analyzing his data? First the guy congratulates him and says, “The results show a statistically significant difference at the .05 level—p-value .048.” But then, an hour later, the phone rings again. It’s the same guy, but now he’s apologizing. It turns out that the experimenter intended to keep sampling until the result was 1.96 standard deviations away from the 0 null—in either direction—so they had to reanalyze the data (n=169), and the results were no longer statistically significant at the .05 level. 

Much laughter.

So the researcher is tearing his hair out when the same guy calls back again. “Congratulations!” the guy says. “I just found out that the experimenter actually had planned to take n=169 all along, so the results are statistically significant.”

 Howls of laughter.

 But then the guy calls back with the bad news . . .

It turns out that failing to score a sufficiently impressive effect after n’ trials, the experimenter went on to n” trials, and so on and so forth until finally, say, on trial number 169, he obtained a result 1.96 standard deviations from the null.

It continues this way, and every time the guy calls in and reports a shift in the p-value, the table erupts in howls of laughter! From everyone except me, sitting in stunned silence, staring straight ahead. The hilarity ensues from the idea that the experimenter’s reported psychological intentions about when to stop sampling are altering the statistical results.

The allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter may be called the argument from intentions. When stopping rules matter, however, we are looking not at “intentions” but at real alterations to the probative capacity of the test, as picked up by a change in the test’s corresponding error probabilities. The analogous problem occurs if there is a fixed null hypothesis and the experimenter is allowed to search for maximally likely alternative hypotheses (Mayo and Kruse 2001; Cox and Hinkley 1974). Much the same issue is operating in what physicists call the look-elsewhere effect (LEE), which arose in the context of “bump hunting” in the Higgs results.

The optional stopping effect often appears in illustrations of how error statistics violates the Likelihood Principle (LP), alluding to a two-sided test from a Normal distribution:

Xi ~ N(µ,σ) and we test  H0: µ=0, vs. H1: µ≠0.

The stopping rule might take the form:

Keep sampling until |m| ≥ 1.96(σ/√n),

with m the sample mean. When n is fixed, the type 1 error probability is .05, but with this stopping rule the actual significance level may differ from, and will be greater than, .05. In fact, ignoring the stopping rule allows a high or maximal probability of error. For a sampling theorist, this example alone, “taken in the context of examining consistency with θ = 0, is enough to refute the strong likelihood principle” (Cox 1977, p. 54), since, with probability 1, it will stop with a “nominally” significant result even though θ = 0. As Birnbaum (1969, 128) puts it, “the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of probabilities of erroneous interpretations.” From the error-statistical standpoint, ignoring the stopping rule allows readily inferring that there is evidence for a non-null hypothesis even though it has passed with low if not minimal severity.
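Cox’s point is easy to exhibit numerically. Here is a minimal simulation sketch (not from the original post; the cap of 1,000 observations, the 2,000 replications, and the seed are illustrative choices): under a true null, “peeking” after every observation drives the rejection rate far above the nominal .05 of the fixed-n test.

```python
import numpy as np

rng = np.random.default_rng(0)
reps, n_max, sigma = 2000, 1000, 1.0

def rejects_somewhere(x, sigma=1.0):
    """True if |sample mean| >= 1.96*sigma/sqrt(n) at any interim look."""
    n = np.arange(1, len(x) + 1)
    means = np.cumsum(x) / n
    return bool(np.any(np.abs(means) >= 1.96 * sigma / np.sqrt(n)))

# The null hypothesis mu = 0 is TRUE in every replication.
optional = np.mean([rejects_somewhere(rng.normal(0, sigma, n_max))
                    for _ in range(reps)])
fixed = np.mean([abs(rng.normal(0, sigma, n_max).mean())
                 >= 1.96 * sigma / np.sqrt(n_max) for _ in range(reps)])

print(f"fixed-n rejection rate:      {fixed:.3f}")    # near the nominal .05
print(f"optional-stopping rejection: {optional:.3f}") # well above .05
```

Raising n_max pushes the optional-stopping rate toward 1, in line with the law of the iterated logarithm that Good invokes.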

Peter Armitage, in his comments on Savage at the 1959 forum (“Savage Forum” 1962), put it thus:

I think it is quite clear that likelihood ratios, and therefore posterior probabilities, do not depend on a stopping rule. . . . I feel that if a man deliberately stopped an investigation when he had departed sufficiently far from his particular hypothesis, then “Thou shalt be misled if thou dost not know that.” If so, prior probability methods seem to appear in a less attractive light than frequency methods, where one can take into account the method of sampling. (Savage 1962, 72; emphasis added; see [ii])

H is not being put to a stringent test when a researcher allows trying and trying again until the data are far enough from H0 to reject it in favor of H.

Stopping Rule Principle

The effect one is picking up appears evanescent—locked in someone’s head—if one has no way of taking error probabilities into account:

In general, suppose that you collect data of any kind whatsoever — not necessarily Bernoullian, nor identically distributed, nor independent of each other . . . — stopping only when the data thus far collected satisfy some criterion of a sort that is sure to be satisfied sooner or later, then the import of the sequence of n data actually observed will be exactly the same as it would be had you planned to take exactly n observations in the first place. (Edwards, Lindman, and Savage 1963, 238-239)

This is called the irrelevance of the stopping rule or the Stopping Rule Principle (SRP), and is an implication of the (strong) likelihood principle (LP), which is taken up elsewhere in this blog.[i]

To the holder of the LP, the intuition is that the stopping rule is irrelevant; to the error statistician the stopping rule is quite relevant because the probability that the persistent experimenter finds data against the no-difference null is increased, even if the null is true. It alters the well-testedness of claims inferred. (Error #11 of Mayo and Spanos 2011 “Error Statistics“.)

A Funny Thing Happened at the Savage Forum[ii]

While Savage says he was always uncomfortable with the argument from intentions, he is reminding Barnard of the argument that Barnard promoted years before. He’s saying, in effect, Don’t you remember, George? You’re the one who so convincingly urged in 1952 that to take stopping rules into account is like taking psychological intentions into account:

The argument then was this: The design of a sequential experiment is, in the last analysis, what the experimenter actually intended to do. His intention is locked up inside his head. (Savage 1962, 76)

But, alas, Barnard had changed his mind. Still, the argument from intentions is repeated again and again by Bayesians. Howson and Urbach think it entails dire conclusions for significance tests:

A significance test inference, therefore, depends not only on the outcome that a trial produced, but also on the outcomes that it could have produced but did not.  And the latter are determined by certain private intentions of the experimenter, embodying his stopping rule.  It seems to us that this fact precludes a significance test delivering any kind of judgment about empirical support. . . . For scientists would not normally regard such personal intentions as proper influences on the support which data give to a hypothesis. (Howson and Urbach 1993, 212)

It is fallacious to insinuate that regarding optional stopping as relevant is in effect to make private intentions relevant. Although the choice of stopping rule (as with other test specifications) is determined by the intentions of the experimenter, it does not follow that taking account of its influence is to take account of subjective intentions. The allegation is a non sequitur.

We often hear things like:

[I]t seems very strange that a frequentist could not analyze a given set of data, such as (x1,…, xn) [in Armitage’s example] if the stopping rule is not given. . . . [D]ata should be able to speak for itself. (Berger and Wolpert 1988, 78)

But data do not speak for themselves, unless sufficient information is included to correctly appraise relevant error probabilities. The error statistician has a perfectly nonpsychological way of accounting for the impact of stopping rules, as well as other aspects of experimental plans. The impact is on the stringency or severity of the test that the purported “real effect” has passed. In the optional stopping plan, there is a difference in the set of possible outcomes; certain outcomes available in the fixed sample size plan are no longer available. If a stopping rule is truly open-ended (it need not be), then the possible outcomes do not contain any that fail to reject the null hypothesis. (The above rule stops in a finite number of trials; it is “proper”.)

Does the difference in error probabilities corresponding to a difference in sampling plans correspond to any real difference in the experiment? Yes. The researchers really did do something different in the try-and-try-again scheme and, as Armitage says, thou shalt be misled if your account cannot report this.

We have banished the argument from intentions, the allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter. So if you’re at my dinner table, can I count on you not to rehearse this one…?

One last thing….

 The Optional Stopping Effect with Bayesian (Two-sided) Confidence Intervals

The equivalent stopping rule can be framed in terms of the corresponding 95% “confidence interval” method, given the normal distribution above (their term and quotes):

Keep sampling until the 95% confidence interval excludes 0.

Berger and Wolpert concede that using this stopping rule “has thus succeeded in getting the [Bayesian] conditionalist to perceive that μ ≠ 0, and has done so honestly” (pp. 80-81). This seems to be a striking admission—especially as the Bayesian interval assigns a probability of .95 to the truth of the interval estimate (using a “noninformative prior density”):

µ = m ± 1.96(σ/√n)

But, they maintain (or did back then) that the LP only “seems to allow the experimenter to mislead a Bayesian. The ‘misleading,’ however, is solely from a frequentist viewpoint, and will not be of concern to a conditionalist.” Does this mean that while the real error probabilities are poor, Bayesians are not impacted, since, from the perspective of what they believe, there is no misleading?

[*] It was because of these “conversations” that Jack thought his name should be included in the “Jeffreys-Lindley paradox”, so I always call it the Jeffreys-Good-Lindley paradox. I discuss this in EGEK 1996, Chapter 10, and in Mayo and Kruse (2001). See a recent paper by my colleague Aris Spanos (2013) on the Jeffreys-Lindley paradox.

[i] There are certain exceptions where the stopping rule may be “informative”.  Other posts may be found on LP violations, and an informal version of my critique of Birnbaum’s LP argument. On optional stopping, see also Irony and Bad Faith.

[ii] I found, on an old webpage of mine, (a pale copy of) the “Savage forum”:


Armitage, P. (1962), “Discussion”, in The Foundations of Statistical Inference: A Discussion, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 72.

Berger J. O. and Wolpert, R. L. (1988), The Likelihood Principle: A Review, Generalizations, and Statistical Implications 2nd edition, Lecture Notes-Monograph Series, Vol. 6, Shanti S. Gupta, Series Editor, Hayward, California: Institute of Mathematical Statistics.

Birnbaum, A. (1969), “Concepts of Statistical Evidence” In Philosophy, Science, and Method: Essays in Honor of Ernest Nagel, S. Morgenbesser, P. Suppes, and M. White (eds.): New York: St. Martin’s Press, 112-43.

Cox, D. R. (1977), “The Role of Significance Tests (with discussion)”, Scandinavian Journal of Statistics 4, 49–70.

Cox, D. R. and D. V. Hinkley (1974), Theoretical Statistics, London: Chapman & Hall.

Edwards, W., Lindman, H., and Savage, L. (1963), “Bayesian Statistical Inference for Psychological Research”, Psychological Review 70: 193-242.

Good, I. J. (1983), Good Thinking: The Foundations of Probability and Its Applications, Minneapolis: University of Minnesota Press.

Howson, C., and P. Urbach (1993[1989]), Scientific Reasoning: The Bayesian Approach, 2nd  ed., La Salle: Open Court.

Mayo, D. (1996), [EGEK] Error and the Growth of Experimental Knowledge (Chapter 10: “Why You Cannot Be Just a Little Bayesian”), Chicago: University of Chicago Press.

Mayo, D. G. and Kruse, M. (2001), “Principles of Inference and Their Consequences”, in D. Corfield and J. Williamson (eds.), Foundations of Bayesianism, Dordrecht: Kluwer Academic Publishers: 381-403.

Savage, L. (1962), “Discussion”, in The Foundations of Statistical Inference: A Discussion, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 76.

Spanos, A. “Who Should Be Afraid of the Jeffreys-Lindley Paradox?” Philosophy of Science, 80 (2013): 73-93.

Categories: Bayesian/frequentist, Comedy, Statistics | Tags: , , | 18 Comments

Self-referential blogpost (conditionally accepted*)

This is a blogpost on a talk (by Jeremy Fox) on blogging that will be live tweeted here at Virginia Tech on Monday April 7, and the moment I post this blog on “Blogging as a Mode of Scientific Communication” it will be tweeted. Live.

Jeremy’s upcoming talk on blogging will be live-tweeted by @FisheriesBlog, 1 pm EDT Apr. 7

Posted on April 3, 2014 by Jeremy Fox

If you like to follow live tweets of talks, you’re in luck: my upcoming Virginia Tech talk on blogging will be live tweeted by Brandon Peoples, a grad student there who co-authors The Fisheries Blog. Follow @FisheriesBlog at 1 pm US Eastern Daylight Time on Monday, April 7 for the live tweets.

Jeremy Fox’s excellent blog, “Dynamic Ecology,” often discusses matters statistical from a perspective in sync with error statistics.

I’ve never been invited to talk about blogging, or even to blog about blogging; maybe this is a new trend. I look forward to meeting him (live!).


* Posts that don’t directly pertain to philosophy of science/statistics are placed under “rejected posts” but since this is a metablogpost on a talk on a blog pertaining to statistics it has been “conditionally accepted”, unconditionally, i.e., without conditions.

Categories: Announcement, Metablog | Leave a comment

Skeptical and enthusiastic Bayesian priors for beliefs about insane asylum renovations at Dept of Homeland Security: I’m skeptical and unenthusiastic


Danvers State Hospital

I had heard of medical designs that employ individuals who supply Bayesian subjective priors that are deemed either “enthusiastic” or “skeptical” as regards the probable value of medical treatments.[i] From what I gather, these priors are combined with data from trials in order to help decide whether to stop trials early or continue. But I’d never heard of these Bayesian designs in relation to decisions about building security or renovations! Listen to this….

You may have heard that the Department of Homeland Security (DHS), whose 240,000 employees are scattered among 50 office locations around D.C., has been planning to have headquarters built at an abandoned insane asylum, St. Elizabeths, in DC [ii]. See a recent discussion here. In 2006 officials projected the new facility would be ready by 2015; now an additional $3.2 billion is needed to complete the renovation of the 159-year-old mental hospital by 2026 (Congressional Research Service)[iii]. The initial plan of developing the entire structure is no longer feasible; so to determine which parts of the facility are most likely to be promising, “DHS is bringing in a team of data analysts who are possessed,” said Homeland Security Secretary Jeh Johnson (during a DHS meeting, Feb. 26), “possessed with vibrant background beliefs to sense which buildings are most probably worth renovating, from the point of view of security. St. Elizabeths needs to be fortified with 21st-century technologies for cybersecurity and antiterrorism missions,” Johnson explained.

Failing to entice private companies to renovate the dilapidated west campus of the historic mental health facility that sits on 176 acres overlooking the Anacostia River, they can only hope to renovate selectively: “Which parts are we going to overhaul? Parts of the hospital have been rotting for years!” Johnson declared.


[I’m too rushed at the moment to write this up clearly but thought it of sufficient interest to post a draft (look for follow-up drafts).]

Skeptical and enthusiastic priors: excerpt from DHS memo:

The description of the use of so-called “enthusiastic” and “skeptical” priors is sketched in a DHS memo released in January 2014 (but which had been first issued in 2011). Here’s part of it:

Enthusiastic priors are used in evaluating the portions of St. Elizabeths campus thought to be probably unpromising, in terms of environmental soundness, or because of an existing suspicion of probable security leaks. If the location fails to be probably promising using an enthusiastic prior, plus data, then there is overwhelming evidence to support the decision that the particular area is not promising.

Skeptical priors are used in situations where the particular asylum wing, floor, or campus quadrant is believed to be probably promising for DHS. If the skeptical opinion, combined with the data on the area in question, yields a high posterior belief that it is a promising area to renovate, this would be taken as extremely convincing evidence to support the decision that the wing, floor or building is probably promising.
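For contrast with the parody, the genuine clinical-trials methodology alluded to in note [i] (Spiegelhalter et al.) amounts to simple conjugate-normal updating. A hypothetical sketch, where the trial estimate (0.5 with standard error 0.2) and the prior spreads are invented for illustration:

```python
from math import erf, sqrt

def posterior(prior_mean, prior_sd, data_mean, data_se):
    """Conjugate normal update: precision-weighted average of prior and data."""
    w_prior, w_data = 1 / prior_sd**2, 1 / data_se**2
    post_var = 1 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * data_mean)
    return post_mean, sqrt(post_var)

def prob_positive(mean, sd):
    """P(effect > 0) for a Normal(mean, sd) posterior."""
    return 0.5 * (1 + erf(mean / (sd * sqrt(2))))

data_mean, data_se = 0.5, 0.2  # illustrative trial estimate of the effect

# Skeptical prior: centered on no effect; enthusiastic: centered on benefit.
for label, pm, psd in [("skeptical", 0.0, 0.25), ("enthusiastic", 0.5, 0.25)]:
    m, s = posterior(pm, psd, data_mean, data_se)
    print(f"{label:12s}: posterior mean {m:.2f}, sd {s:.2f}, "
          f"P(effect>0) = {prob_positive(m, s):.3f}")
```

A result that survives the skeptical prior (or fails even under the enthusiastic one) is the “extremely convincing evidence” the memo’s language is parodying.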

But long before they can apply this protocol, they must hire specialists to provide the enthusiastic or skeptical priors. (See stress testing below.) The article further explains, “In addition, Homeland Security took on a green initiative — deciding to outfit the campus’ buildings (some dating back to 1855) with features like rainwater toilets and Brazilian hardwood in the name of sustainability.” With that in mind, they also try to get a balance of environmentalist enthusiasts and green skeptics.

Asked how he can justify the extra $3 billion (a minimal figure), Mr. Johnson said that “I think that the morale of DHS, unity of mission, would go a long way if we could get to a headquarters.” He was pleased to announce that an innovative program of recruiting was now nearly complete.

Stress Testing for Calibrated “Enthusiastic” and “Skeptical” Prior Probabilities

Perhaps the most interesting part of all this is how they conduct stress testing for individuals to supply calibrated Bayesian priors concerning St. Elizabeths. Before being hired to give “skeptical” or “enthusiastic” prior distributions, candidates must pass a rather stringent panoply of stress tests based on their hunches regarding relevant facts associated with a number of other abandoned insane asylums. (It turns out there are a lot of them throughout the world. I had no idea.) The list of asylums on which they based the testing (over the past years) has been kept Top Secret Classified until very recently [iv]. Even now, one is directed to a non-governmental website to find a list of 8 or so of the old mental facilities that apparently appeared in just one batch of tests.

Scott Bays-Knorr, a DHS data analyst specialist who is coordinating the research and hiring of “sensors,” made it clear that the research used acceptable, empirical studies: “We’re not testing for paranormal ability or any hocus pocus. These are facts, and we are interested in finding those people whose beliefs match the facts reliably. DHS only hires highly calibrated, highly sensitive individuals”, said Bays-Knorr. Well I’m glad there’s no hocus-pocus at least.

The way it works is that they combine written tests with fMRI data—which monitors blood flow and, therefore, activity inside the brain in real time—to try to establish a neural signature that can be correlated with security-relevant data about the abandoned state hospitals. “The probability they are attuned to these completely unrelated facts about abandoned state asylums they’ve probably never even heard of is about 0. So we know our results are highly robust,” Bays-Knorr assured some skeptical senators.

Danvers State Hospital

Take, for example, Danvers State Hospital, a psychiatric asylum opened in 1878 in Danvers, Massachusetts.

“We check their general sensitivity by seeing if any alarm bells go off in relation to little known facts about unrelated buildings that would be important to a high security facility. ‘What about a series of underground tunnels’ we might ask, any alarm bells go off? Any hairs on their head stand up when we flash a picture of a mysterious fire at the Danvers site in 2007?” Bays-Knorr enthused. “If we’ve got a verified fire skeptic who, when we get him to DC, believes that a part of St. Elizabeths is clear, then we start to believe that’s a fire-safe location. You don’t want to build U.S. cybersecurity if good sensors give it a high probability of being an incendiary location.” I think some hairs on my head are starting to stand up.

Interestingly, some of the tests involve matching drawn pictures, which remind me a little of those remote sensing tests of telepathy. Here’s one such target picture:

Target: Danvers State Hospital

They claim they can ensure robustness by means of correlating a sensor’s impressions of completely unrelated facts about the facility. For example, using fMRI data they can check if “anything lights up” in connection with Lovecraft’s Arkham Sanatorium, the short story “The Thing on the Doorstep”, or Arkham Asylum in the Batman comic world.

Bays-Knorr described how simple facts are used as a robust benchmark for what he calls “the touchy-feely stuff”. For example, picking up on a simple fact in connection with High Royds Hospital (in England) is sensing its alternative name: West Riding Pauper Lunatic Asylum. They compare that to reactions to the question of why patient farming ended. I frankly don’t get it, but then again, I’m highly skeptical of approaches not constrained by error statistical probing.


Yet Bays-Knorr seemed to be convincing many of the Senators who will have to approve an extra $3 billion on the project. He further described the safeguards: “We never published this list of asylums; the candidate sensors did not even know what we were going to ask them. It doesn’t matter if they’re asylum specialists or have a 6th sense. If they have good hunches, if they fit the average of the skeptics or the enthusiasts, then we want them.” Only if the correlations are sufficiently coherent is a ‘replication score’ achieved. The testing data are then sent to an independent facility of blind “big data” statisticians, Cherry Associates, from whom the purpose of the analysis is kept entirely hidden. “We look for feelings and sensitivity; often the person doesn’t know she even has it,” one Cherry Assoc. representative noted. Testing has gone on for the past 7 years ($700 million) and is only now winding up. (I’m not sure how many were hired, but with $150,000 salaries for part-time work, it seems a good gig!)

Community priors, skeptical and enthusiastic, are eventually obtained based on those hired as U.S. Government DHS Calibrated Prior Degree Specialists.

Sounds like lunacy to me![v]


High Royds Hospital



[i] Spiegelhalter, D. J., Abrams, K. R., & Myles, J. P. (2004). Bayesian approaches to clinical trials and health care evaluation. Chichester: Wiley.

[ii] Homepage for DHS and St. Elizabeths Campus Plans.

GSA Development of St. Elizabeths campus: “Preserving the Legacy, Realizing Potential”


[iii] U.S. House of Representatives
Committee on Homeland Security
January 2014

Prepared by Majority Staff of the Committee on Homeland Security

[iv] The randomly selected hospitals in one standardized test included the following:

Topeka State Hospital

Danvers State Hospital

Denbigh Asylum

Pilgrim State Hospital

Trans-Allegheny Asylum

High Royds Hospital

Whittingham Hospital

Norwich State Hospital

Essential guide to abandoned insane asylums

[v] Or, a partial April Fool’s joke!



Categories: junk science, Statistics, subjective Bayesian elicitation | 11 Comments

Phil 6334: March 26, philosophy of misspecification testing (Day #9 slides)


“Probability/Statistics Lecture Notes 6: An Introduction to Mis-Specification (M-S) Testing” (Aris Spanos)


[Other slides from Day 9 by guest, John Byrd, can be found here.]

Categories: misspecification testing, Phil 6334 class material, Spanos, Statistics | Leave a comment

Winner of the March 2014 palindrome contest (rejected post)

Winner of the March 2014 Palindrome Contest

Caitlin Parker


Able, we’d well aim on. I bet on a note. Binomial? Lewd. Ew, Elba!

The requirement was: a palindrome with Elba plus “binomial”, with an optional second word: “bet”. A palindrome that used both “binomial” and “bet” topped an acceptable palindrome that used only “binomial”.
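For anyone who wants to check entries (or their own attempts), the palindrome condition, reading the same in both directions, can be tested mechanically. A small sketch; the normalization rule (ignore case and anything that isn’t a letter) is my assumption about how entries are judged:

```python
import re

def is_palindrome(s):
    """True if s reads the same forwards and backwards, ignoring case and non-letters."""
    letters = re.sub(r"[^a-z]", "", s.lower())
    return letters == letters[::-1]

entry = "Able, we'd well aim on. I bet on a note. Binomial? Lewd. Ew, Elba!"
print(is_palindrome(entry))                                          # True
print(all(w in entry.lower() for w in ("elba", "binomial", "bet")))  # True
```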

Short bio: 
Caitlin Parker is a first-year master’s student in the Philosophy department at Virginia Tech. Though her interests are in philosophy of science and statistics, she also has experience doing psychological research.

“Thanks for the challenge! Palindromes give us a fun opportunity to practice planning in a setting where each new letter has the power to completely recast one’s previous efforts. Since one has to balance developing a structure with preserving some kind of meaning, it can take forever to get a palindrome to ‘work’ – but it’s incredibly satisfying when it does.”

Choice of Book:
Fisher, Neyman and the Creation of Classical Statistics (E. L. Lehmann 2012, Dordrecht, New York: Springer)

Congratulations Caitlin! With consecutive months now of winners using two words (+ Elba), this bodes well for returning to that more severe challenge.
See April contest (first word: fallacy; optional second word: error).
Categories: Announcement, Palindrome, Rejected Posts | Leave a comment

Severe osteometric probing of skeletal remains: John Byrd

John E. Byrd, Ph.D., D-ABFA

Central Identification Laboratory

Guest, March 27, Phil 6334

“Statistical Considerations of the Histomorphometric Test Protocol for Determination of Human Origin of Skeletal Remains”

John E. Byrd, Ph.D., D-ABFA
Maria-Teresa Tersigni-Tarrant, Ph.D.
Central Identification Laboratory

Categories: Phil6334, Philosophy of Statistics, Statistics | 1 Comment

Phil 6334: Misspecification Testing: Ordering From A Full Diagnostic Menu (part 1)


We’re going to be discussing the philosophy of M-S testing today in our seminar, so I’m reblogging this from Feb. 2012. I’ve linked the 3 follow-ups below. Check the original posts for some good discussion. (Note visitor*)

“This is the kind of cure that kills the patient!”

is the line of Aris Spanos that I most remember from when I first heard him talk about testing assumptions of, and respecifying, statistical models in 1999. (The patient, of course, is the statistical model.) On finishing my book, EGEK 1996, I had been keen to fill its central gaps, one of which was fleshing out a crucial piece of the error-statistical framework of learning from error: how to validate the assumptions of statistical models. But the whole problem turned out to be far more philosophically—not to mention technically—challenging than I imagined. I will try (in 3 short posts) to sketch a procedure that I think puts the entire process of model validation on a sound logical footing. Thanks to attending several of Spanos’ seminars (and his patient tutorials, for which I am very grateful), I was eventually able to reflect philosophically on aspects of his already well-worked-out approach. (Synergies with the error statistical philosophy, of which this is a part, warrant a separate discussion.)

Problems of Validation in the Linear Regression Model (LRM)

The example Spanos was considering was the Linear Regression Model (LRM), which may be seen to take the form:

M0:      yt = β0 + β1xt + ut,  t=1,2,…,n,…

where µt = β0 + β1xt is viewed as the systematic component, and ut = yt – β0 – β1xt as the error (non-systematic) component. The error process {ut, t=1, 2, …, n, …} is assumed to be Normal, Independent and Identically Distributed (NIID) with mean 0 and variance σ2, i.e., Normal white noise. Using the data z0:={(xt, yt), t=1, 2, …, n}, the coefficients (β0, β1) are estimated (by least squares), yielding an empirical equation intended to enable us to understand how yt varies with xt.

Empirical Example

Suppose that in her attempt to find a way to understand and predict changes in the U.S.A. population, an economist discovers, using regression, an empirical relationship that appears to provide almost a ‘law-like’ fit (see figure 1):

yt = 167.115+ 1.907xt + ût,                                    (1)

where yt denotes the population of the USA (in millions), and xt denotes a secret variable whose identity is not revealed until the end of these 3 posts. Both series refer to annual data for the period 1955-1989.

Figure 1: Fitted Line

A Primary Statistical Question: How good a predictor is xt?

The goodness of fit measure of this estimated regression, R2=.995, indicates an almost perfect fit. Testing the statistical significance of the coefficients shows them to be highly significant, with p-values that are zero (0) to the third decimal, indicating a very strong relationship between the variables. Everything looks hunky dory; what could go wrong?

Is this inference reliable? Not unless the data z0 satisfy the probabilistic assumptions of the LRM, i.e., the errors are NIID with mean 0, variance σ2.
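The danger can be seen in a small simulation (my own illustration, not the seminar’s actual data, whose secret variable stays secret): two causally unrelated series that both trend over time can produce a near-perfect “law-like” fit, while the residuals flagrantly violate the NIID error assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(35.0)  # 35 annual observations, as in 1955-1989

# Two trending series with slowly drifting (random-walk) errors;
# x has no causal connection to y.
y = 150 + 2.0 * t + np.cumsum(rng.normal(0, 0.2, 35))
x = 10 + 1.0 * t + np.cumsum(rng.normal(0, 0.2, 35))

# Ordinary least squares fit of y on x
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - (b0 + b1 * x)

r2 = 1 - resid.var() / y.var()
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print(f"R^2 = {r2:.3f}")                            # near 1: a seemingly 'law-like' fit
print(f"lag-1 residual correlation = {lag1:.2f}")   # far from 0: independence fails
```

The point is that the impressive R2 and tiny p-values are computed under the NIID assumptions; once the residuals show strong temporal dependence, those numbers no longer mean what they appear to mean.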

Misspecification (M-S) Tests: Questions of model validation may be  seen as ‘secondary’ questions in relation to primary statistical ones; the latter often concern the sign and magnitude of the coefficients of this linear relationship.

Partitioning the Space of Possible Models: Probabilistic Reduction (PR)

The task in validating a model M0 (LRM) is to test ‘M0 is valid’ against everything else!

In other words, if we let H0 assert that the ‘true’ distribution of the sample Z, f(z) belongs to M0, the alternative H1 would be the entire complement of M0, more formally:

H0: f(z) ∈ M0  vs. H1: f(z) ∈ [P – M0]

where P denotes the set of all possible statistical models that could have given rise to z0:={(xt,yt), t=1, 2, …, n}, and ∈ is “an element of” (all we could find).

The traditional analysis of the LRM has already, implicitly, reduced the space of models that could be considered. It reflects just one way of reducing the set of all possible models of which data z0 can be seen to be a realization. This provides the motivation for Spanos’ modeling approach (first in Spanos 1986, 1989, 1995).

Given that each statistical model arises as a parameterization from the joint distribution:

D(Z1,…,Zn;φ): = D((X1, Y1), (X2, Y2), …., (Xn, Yn); φ),

we can consider how one or another set of probabilistic assumptions on the joint distribution gives rise to different models. The assumptions used to reduce P, the set of all possible models,  to a single model, here the LRM, come from a menu of three broad categories.  These three categories  can always be used in statistical modeling:

(D) Distribution, (M) Dependence, (H) Heterogeneity.

For example, the LRM arises when we reduce P by means of the “reduction” assumptions:

(D) Normal (N), (M) Independent (I), (H) Identically Distributed (ID).

Since we are partitioning or reducing P by means of the probabilistic assumptions, it may be called the Probabilistic Partitioning or Probabilistic Reduction (PR) approach.[i]

The same assumptions, traditionally given by means of the error term, are instead specified in terms of the observable random variables (yt, Xt): [1]-[5] in table 1 to render them directly assessable by the data in question.

Table 1 – The Linear Regression Model (LRM)

yt = β0 + β1xt + ut,  t=1,2,…,n,…

[1] Normality: D(yt |xt; θ) Normal
[2] Linearity: E(yt |Xt=xt) = β0 + β1xt Linear in xt
[3] Homoskedasticity: Var(yt |Xt=xt) =σ2, Free of xt
[4] Independence: {(yt |Xt=xt), t=1,…,n,…} Independent
[5] t-invariance: θ:=(β0 , β1, σ2), constant over t

There are several advantages to specifying the model assumptions in terms of the observables yt and xt instead of the unobservable error term.

First, hidden or implicit assumptions now become explicit ([5]).

Second, some of the error term assumptions, such as having a zero mean, do not look nearly as innocuous when expressed as an assumption concerning the linearity of the regression function between yt and xt .

Third, the LRM (conditional) assumptions can be assessed indirectly from the data via the (unconditional) reduction assumptions, since:

N entails [1]-[3],             I entails [4],             ID entails [5].

As a first step, we partition the set of all possible models coarsely in terms of reduction assumptions on D(Z1,…,Zn;φ):

LRM | Alternatives
(D) Distribution: Normal | non-Normal
(M) Dependence: Independent | Dependent
(H) Heterogeneity: Identically Distributed | non-ID

Given the practical impossibility of probing for violations in all possible directions, the PR approach consciously considers an effective probing strategy to home in on the directions in which the primary statistical model might be potentially misspecified.   Having taken us back to the joint distribution, why not get ideas by looking at yt and xt themselves using a variety of graphical techniques?  This is what the Probabilistic Reduction (PR) approach prescribes for its diagnostic task….Stay tuned!
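As a toy illustration of probing in two of these directions (my own sketch, not Spanos’s protocol), one can compute a Durbin-Watson-type statistic from the residuals for the independence assumption [4], and a crude split-sample comparison for t-invariance [5]:

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson-type statistic: ~2 under independence,
    near 0 under strong positive temporal dependence."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def split_mean_shift(resid):
    """Crude t-invariance probe: difference in mean residual between
    the two halves of the sample, scaled by the residual spread."""
    h = len(resid) // 2
    return abs(resid[:h].mean() - resid[h:].mean()) / resid.std()

rng = np.random.default_rng(1)
iid = rng.normal(0, 1, 200)              # well-behaved (NIID-like) residuals
walk = np.cumsum(rng.normal(0, 1, 200))  # strongly dependent residuals
walk = walk - walk.mean()

print(durbin_watson(iid))   # close to 2
print(durbin_watson(walk))  # far below 2: independence is suspect
```

These are only two arrows from the full diagnostic menu; the PR approach pairs such statistics with graphical analysis of yt and xt themselves.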

*Rather than list scads of references, I direct the interested reader to those in Spanos.

[i] This is because when the NIID assumptions are imposed on D(Z1,…,Zn;φ), the latter simplifies into a product of conditional distributions, yielding the LRM.

See follow-up parts:




*We also have a visitor to the seminar from Hawaii, John Byrd, a forensic anthropologist and statistical osteometrician. He’s long been active on the blog. I’ll post something of his later on.

Categories: Intro MS Testing, Statistics | Tags: , , , , | 16 Comments

The Unexpected Way Philosophy Majors Are Changing The World Of Business




“Philosophy majors rule” according to this recent article. We philosophers should be getting the word out. Admittedly, the type of people inclined to do well in philosophy are already likely to succeed in analytic areas. Coupled with the chutzpah of taking up an “outmoded and impractical” major like philosophy in the first place, innovative tendencies are not surprising. But can the study of philosophy also promote these capacities? I think it can and does; yet it could be far more effective than it is, if it were less hermetic and more engaged with problem-solving across the landscape of science, statistics, law, medicine, and evidence-based policy. Here’s the article:


  The Unexpected Way Philosophy Majors Are Changing The World Of Business

Dr. Damon Horowitz quit his technology job and got a Ph.D. in philosophy — and he thinks you should too.

“If you are at all disposed to question what’s around you, you’ll start to see that there appear to be cracks in the bubble,” Horowitz said in a 2011 talk at Stanford. “So about a decade ago, I quit my technology job to get a philosophy PhD. That was one of the best decisions I’ve made in my life.”

As Horowitz demonstrates, a degree in philosophy can be useful for professions beyond a career in academia. Degrees like his can help in the business world, where a philosophy background can pave the way for real change. After earning his PhD in philosophy from Stanford, where he studied computer science as an undergraduate, Horowitz went on to become a successful tech entrepreneur and Google’s in-house philosopher/director of engineering. His own career makes a pretty good case for the value of a philosophy education.

Despite a growing media interest in the study of philosophy and dramatically increasing enrollment in philosophy programs at some universities, the subject is still frequently dismissed as outmoded and impractical, removed from the everyday world and relegated to the loftiest of ivory towers.

That doesn’t fit with the realities of both the business and tech worlds, where philosophy has proved itself to be not only relevant but often the cornerstone of great innovation. Philosophy and entrepreneurship are a surprisingly good fit. Some of the most successful tech entrepreneurs and innovators come from a philosophy background and put the critical thinking skills they developed to good use launching new digital services to fill needs in various domains of society. Atlantic contributor Edward Tenner even went so far as to call philosophy the “most practical major.”

In fact, many leaders of the tech world — from LinkedIn co-founder Reid Hoffman to Flickr founder Stewart Butterfield — say that studying philosophy was the secret to their success as digital entrepreneurs.

“The thought leaders of our industry are not the ones who plodded dully, step by step, up the career ladder,” said Horowitz. “They’re the ones who took chances and developed unique perspectives.”

Here are a few reasons that philosophy majors will become the entrepreneurs who are shaping the business world.

Philosophy develops strong critical thinking skills and business instincts.

Philosophy is a notoriously challenging major, and has rigorous standards of writing and argumentation, which can help students to develop strong critical thinking skills that can be applied to a number of different professions. The ability to think critically may be of particular advantage to tech entrepreneurs.

“Open-ended assignments push philosophy students to find and take on a unique aspect of the work of the philosopher they are studying, to frame their thinking around a fresh and interesting question, or to make original connections between the writings of two distinct thinkers,” Christine Nasserghodsi, director of innovation at the Wellington International School in Dubai, wrote in a HuffPost College blog. “Similarly, entrepreneurs need to be able to identify and understand new and unique opportunities in existing markets.”

Flickr co-founder Stewart Butterfield got his bachelor’s and master’s degrees in philosophy at University of Victoria and Cambridge, where he specialized in philosophy of mind. After the highly profitable sale of Flickr to Yahoo!, the Canadian tech entrepreneur began working on a new online civilization-building game, Glitch.

“I think if you have a good background in what it is to be human, an understanding of life, culture and society, it gives you a good perspective on starting a business, instead of an education purely in business,” Butterfield told University of Victoria students in 2008. “You can always pick up how to read a balance sheet and how to figure out profit and loss, but it’s harder to pick up the other stuff on the fly.”

Former philosophy students have gone on to make waves in the tech world.

Besides Horowitz and Butterfield, a number of tech executives, including former Hewlett-Packard Company CEO Carly Fiorina and LinkedIn co-founder and executive chairman Reid Hoffman, studied philosophy as undergraduates. Hoffman majored in philosophy at Oxford before he went on to become a highly successful tech entrepreneur and venture capitalist, and author of The Startup Of You.

“My original plan was to become an academic,” Hoffman told Wired. “I won a Marshall scholarship to read philosophy at Oxford, and what I most wanted to do was strengthen public intellectual culture — I’d write books and essays to help us figure out who we wanted to be.”

Hoffman decided to instead become a software engineer when he realized that staying in academia might not have the measurable impact on the world that he desires. Now, he uses the sharp critical thinking skills he honed while studying philosophy to make profitable investments in tech start-ups.

“When presented with an investment that I think will change the world in a really good way, if I can do it, I’ll do it,” he said.

Philosophers (amateur and professional) will be the ones to grapple with the biggest issues facing their generation.

I can’t say I agree with the following suggestion that philosophers have a “better understanding of the nature of man and his place in the world” but it sounds good.

Advances in physics, technology and neuroscience pose an ever-evolving set of questions about the nature of the world and man’s place in it; questions that we may not yet have the answers to, but that philosophers diligently explore through theory and argument. And of course, there are some questions of morality and meaning that were first posed by ancient thinkers and that we must continue to question as humanity evolves: How should we treat one another? What does it mean to live a good life?

The Princeton philosophy department argues that because philosophers have a “better understanding of the nature of man and his place in the world,” they’re better able to identify and address issues in modern society. For this reason, philosophy should occupy a more prominent place in the business world, says Dov Seidman, author of HOW: Why HOW We Do Anything Means Everything.

“Philosophy can help us address the (literally) existential challenges the world currently confronts, but only if we take it off the back burner and apply it as a burning platform in business,” Seidman wrote in a 2010 Bloomberg Businessweek article. “Philosophy explores the deepest, broadest questions of life—why we exist, how society should organize itself, how institutions should relate to society, and the purpose of human endeavor, to name just a few.”

Philosophy students are ‘citizens of the world.’

In an increasingly global economy — one in which many businesses are beginning to accept a sense of social responsibility — those who care and are able to think critically about global and humanitarian issues will be the ones who are poised to create real change.

Rebecca Newberger Goldstein, philosopher, novelist and author of the forthcoming Plato at the Googleplex, recently told The Atlantic that doing philosophical work makes students “citizens of the world.” She explains why students should study philosophy, despite their concerns about employability:

To challenge your own point of view. Also, you need to be a citizen in this world. You need to know your responsibilities. You’re going to have many moral choices every day of your life. And it enriches your inner life. You have lots of frameworks to apply to problems, and so many ways to interpret things. It makes life so much more interesting. It’s us at our most human. And it helps us increase our humanity. No matter what you do, that’s an asset.

This global-mindedness and humanistic perspective may even make you a more desirable job candidate.

“You go into the humanities to pursue your intellectual passion, and it just so happens as a byproduct that you emerge as a desired commodity for industry,” said Horowitz. “Such is the halo of human flourishing.”

Categories: philosophy of science, Philosophy of Statistics, Statistics | 1 Comment

Fallacies of statistics & statistics journalism, and how to avoid them: Summary & Slides Day #8 (Phil 6334)



We spent the first half of Thursday’s seminar discussing the Fisher, Neyman, and E. Pearson “triad”[i]. So, since it’s Saturday night, join me in rereading for the nth time these three very short articles. The key issues were: error of the second kind, behavioristic vs evidential interpretations, and Fisher’s mysterious fiducial intervals. Although we often hear exaggerated accounts of the differences in the Fisherian vs Neyman-Pearson (N-P) methodology, in fact, N-P were simply providing Fisher’s tests with a logical ground (even though other foundations for tests are still possible), and Fisher welcomed this gladly. Notably, with the single null hypothesis, N-P showed that it was possible to have tests where the probability of rejecting the null when true exceeded the probability of rejecting it when false. Hacking called such tests “worse than useless”, and N-P developed a theory of testing that avoids such problems. Statistical journalists who report on the alleged “inconsistent hybrid” (a term popularized by Gigerenzer) should recognize the extent to which the apparent disagreements on method reflect professional squabbling between Fisher and Neyman after 1935 [A recent example is a Nature article by R. Nuzzo; see ii below]. The two types of tests are best seen as asking different questions in different contexts. They both follow error-statistical reasoning.

We then turned to a severity evaluation of tests as a way to avoid classic fallacies and misinterpretations.



“Probability/Statistics Lecture Notes 5 for 3/20/14: Post-data severity evaluation” (Prof. Spanos)

[i] Fisher, Neyman, and E. Pearson.

[ii] In a recent Nature article by Regina Nuzzo, we hear that N-P statistics “was spearheaded in the late 1920s by Fisher’s bitter rivals”. Nonsense. It was Neyman and Pearson who came to Fisher’s defense against the old guard. See for example Aris Spanos’ post here. According to Nuzzo, “Neyman called some of Fisher’s work mathematically ‘worse than useless’”. It never happened. Nor does she reveal, if she is aware of it, the purely technical notion being referred to. Nuzzo’s article doesn’t give the source of the quote; I’m guessing it’s from Gigerenzer quoting Hacking, or Goodman (whom she is clearly following and cites) quoting Gigerenzer quoting Hacking, but that’s a big jumble.

N-P did provide a theory of testing that could avoid the purely technical problem that can theoretically emerge in an account that does not consider alternatives or discrepancies from a null. As for Fisher’s charge against an extreme behavioristic, acceptance sampling approach, there’s something to this, but as Neyman’s response shows, Fisher, in practice, was more inclined toward a dichotomous “thumbs up or down” use of tests than Neyman. Recall Neyman’s “inferential” use of power in my last post.  If Neyman really had altered the tests to such an extreme, it wouldn’t have required Barnard to point it out to Fisher many years later. Yet suddenly, according to Fisher, we’re in the grips of Russian 5-year plans or U.S. robotic widget assembly lines! I’m not defending either side in these fractious disputes, but alerting the reader to what’s behind a lot of writing on tests (see my anger management post). I can understand how Nuzzo’s remark could arise from a quote of a quote, doubly out of context. But I think science writers on statistical controversies have an obligation to try to avoid being misled by whomever they’re listening to at the moment. There are really only a small handful of howlers to take note of. It’s fine to sign on with one side, but not to state controversial points as beyond debate. I’ll have more to say about her article in a later post (and thanks to the many of you who have sent it to me).

Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-339). Hillsdale: Lawrence Erlbaum Associates.

Hacking, I. (1965). Logic of statistical inference. Cambridge: Cambridge University Press.

Nuzzo, R .(2014). “Scientific method: Statistical errors: P values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume”. Nature, 12 February 2014.

Categories: phil/history of stat, Phil6334, science communication, Severity, significance tests, Statistics | Tags: | 35 Comments

Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity

Is it taboo to use a test’s power to assess what may be learned from the data in front of us? (Is it limited to pre-data planning?) If not entirely taboo, some regard power as irrelevant post-data[i], and the reason I’ve heard is along the lines of an analogy Stephen Senn gave today (in a comment discussing his last post here)[ii].

Senn comment: So let me give you another analogy to your (very interesting) fire alarm analogy (My analogy is imperfect but so is the fire alarm.) If you want to cross the Atlantic from Glasgow you should do some serious calculations to decide what boat you need. However, if several days later you arrive at the Statue of Liberty the fact that you see it is more important than the size of the boat for deciding that you did, indeed, cross the Atlantic.

My fire alarm analogy is here. My analogy presumes you are assessing the situation (about the fire) long distance.

Mayo comment (in reply): A crucial disanalogy arises: You see the statue and you see the observed difference in a test, but even when the stat sig alarm goes off, you are not able to see the discrepancy that generated the observed difference or the alarm you hear. You don’t know that you’ve arrived (at the cause). The statistical inference problem is precisely to make that leap from the perceived alarm to some aspect of the underlying process that resulted in the alarm being triggered. Then it is of considerable relevance to exploit info on the capability of your test procedure to result in alarms going off (perhaps of different loudness), due to varying values of an aspect of the underlying process µ’, µ”,µ”‘  …etc..

Using the loudness of the alarm you actually heard, rather than the minimal stat sig bell, would be analogous to using the p-value rather than the pre-data cut-off for rejection. But the logic is just the same.

While post-data power is scarcely taboo for a severe tester, severity always uses the actual outcome, with its level of statistical significance, whereas power is in terms of the fixed cut-off. Still power provides (worst-case) pre-data guarantees. Now before you get any wrong ideas, I am not endorsing what some people call retrospective power, and I call “shpower”–which goes against severity logic, and is misconceived.

We are reading the Fisher-Pearson-Neyman “triad” tomorrow in Phil6334. Even here (i.e., Neyman 1956), Neyman alludes to a post-data use of power. But, strangely enough, I only noticed this after discovering more blatant discussions in what Spanos and I call “Neyman’s hidden papers”. Here’s an excerpt from Neyman’s Nursery (part 2) [NN-2]:


One of the two surprising papers I came across the night our house was hit by lightning has the tantalizing title “The Problem of Inductive Inference” (Neyman 1955).  It reveals a use of statistical tests strikingly different from the long-run behavior construal most associated with Neyman.  Surprising too, Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap.  …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true].  The question is: does this result “confirm” the hypothesis that H0 is true of the particular data set? (Neyman, pp 40-41).

Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc.  If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H0, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present].  Therefore, the failure of the test to reject H0 cannot be reasonably considered as anything like a confirmation of H0.  The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)

The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.

Neyman alludes to a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation, call it test T+.

H0: µ ≤ µ0 against H1: µ > µ0.

The test statistic d(X) is the standardized sample mean.

The test rule: Infer a (positive) discrepancy from µ0 iff {d(x0) > cα}, where cα corresponds to a difference statistically significant at the α level.

In Carnap’s example the test could not reject the null hypothesis, i.e., d(x0) ≤ cα, but (to paraphrase Neyman) the problem is that the chance of detecting the presence of discrepancy δ from the null, with so few observations, is extremely slim, even if [δ is present].

We are back to our old friend: interpreting negative results!

“One may be confident in the absence of that discrepancy only if the power to detect it were high.”

The power of the test T+ to detect discrepancy δ:

(1)  P(d(X) > cα; µ =  µ0 + δ)

It is interesting to hear Neyman talk this way since it is at odds with the more behavioristic construal he usually championed.  He sounds like a Cohen-style power analyst!  Still, power is calculated relative to an outcome just missing the cutoff  cα.  This is, in effect, the worst case of a negative (non significant) result, and if the actual outcome corresponds to a larger p-value, that should be taken into account in interpreting the results.  It is more informative, therefore, to look at the probability of getting a worse fit (with the null hypothesis) than you did:

(2)  P(d(X) > d(x0); µ = µ0 + δ)

In this example, this gives a measure of the severity (or degree of corroboration) for the inference µ < µ0 + δ.

Although (1) may be low, (2) may be high (For numbers, see Mayo and Spanos 2006).
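A hedged numerical sketch of how (1) and (2) can come apart for test T+ (the numbers are illustrative, not those in Mayo and Spanos 2006):

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf       # standard Normal CDF
n, sigma = 25, 2.0           # sample size and known sigma; take mu0 = 0
c_alpha = 1.645              # cutoff for a one-sided alpha = 0.05 test

def power(delta):
    """(1): P(d(X) > c_alpha; mu = mu0 + delta), with d(X) ~ N(sqrt(n)*delta/sigma, 1)."""
    return 1 - Phi(c_alpha - sqrt(n) * delta / sigma)

def severity(d_obs, delta):
    """(2): P(d(X) > d_obs; mu = mu0 + delta) -- the probability of a worse fit
    with the null than actually observed, were discrepancy delta present."""
    return 1 - Phi(d_obs - sqrt(n) * delta / sigma)

delta = 0.6                  # discrepancy of interest
d_obs = 0.5                  # insignificant observed d(x0), well below c_alpha
print(round(power(delta), 3))            # (1) is modest
print(round(severity(d_obs, delta), 3))  # (2) is high: good warrant for mu < mu0 + delta
```

With these numbers the power against δ is under 0.5, yet the severity for the inference µ < µ0 + δ, using the actual insignificant outcome, exceeds 0.8, illustrating why the attained result, not just the pre-data cutoff, matters.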

Spanos and I (Mayo and Spanos 2006) couldn’t find a term in the literature defined precisely this way–the way I’d defined it in Mayo (1996) and before.  We were thinking at first of calling it “attained power” but then came across what some have called “observed power” which is very different (and very strange).  Those measures are just like ordinary power but calculated assuming the value of the mean equals the observed mean!  (I call this “shpower”. )

Anyway, we refer to it as the Severity Interpretation of “Acceptance” (SIA) in Mayo and Spanos 2006.

The claim in (2) could also be made out viewing the p-value as a random variable, calculating its distribution for various alternatives (Cox 2006, 25).  This reasoning yields a core frequentist principle of evidence (FEV) in Mayo and Cox (2010, 256):

FEV: A moderate p-value is evidence of the absence of a discrepancy d from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller p-value) were a discrepancy d to exist.

It is important to see that it is only in the case of a negative result that severity for various inferences is in the same direction as power.  In the case of significant results, d(x) in excess of the cutoff, the opposite concern arises—namely, the test is too sensitive. So severity is always relative to the particular inference being entertained: speaking of the “severity of a test” simpliciter is an incomplete statement in this account.  These assessments enable sidestepping classic fallacies of tests that are either too sensitive or not sensitive enough.…..

By making a SEV assessment relevant to the inference under consideration, we obtain a measure where high (low) values always correspond to good (poor) evidential warrant.
 It didn’t have to be done this way (at first I didn’t), but I decided it was best, even though it means appropriately swapping out the claim H for which one wants to assess SEV.

[i] To repeat it again: some may be thinking of an animal I call “shpower”.

[ii] I realize comments are informal and unpolished, but isn’t that the beauty of blogging?

NOTE: To read the full post go to [NN-2]. There are 5 Neyman’s Nursery posts (NN1-NN5). Search this blog for the others.


Cohen, J. (1992), “A Power Primer,” Psychological Bulletin, 112(1): 155-159.
Cohen, J. (1988), Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Erlbaum.

Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.

Mayo, D. G. and Cox, D. R. (2006), “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. and Spanos, A. (eds.) (2010), Error and Inference, Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, CUP.

Mayo, D. G. and Spanos, A. (2011), “Error Statistics,” in Philosophy of Statistics, Handbook of the Philosophy of Science, Vol. 7, Elsevier.

Neyman, J. (1955), “The Problem of Inductive Inference,” Communications on Pure and Applied Mathematics, VIII, 13-46.

Categories: exchange with commentators, Neyman's Nursery, P-values, Phil6334, power, Stephen Senn | 5 Comments

Stephen Senn: “Delta Force: To what extent is clinical relevance relevant?” (Guest Post)

Stephen Senn
Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS),

Delta Force
To what extent is clinical relevance relevant?

This note has been inspired by a Twitter exchange with respected scientist and famous blogger  David Colquhoun. He queried whether a treatment that had 2/3 of an effect that would be described as clinically relevant could be useful. I was surprised at the question, since I would regard it as being pretty obvious that it could but, on reflection, I realise that things that may seem obvious to some who have worked in drug development may not be obvious to others, and if they are not obvious to others are either in need of a defence or wrong. I don’t think I am wrong and this note is to explain my thinking on the subject.

Conventional power or sample size calculations
As it happens, I don’t particularly like conventional power calculations but I think they are, nonetheless, a good place to start. To carry out such a calculation a statistician needs the following ingredients:

  1. A definition of a rational design (the smallest design that is feasible but would retain the essential characteristics of the design chosen).
  2. An agreed outcome measure.
  3. A proposed analysis.
  4. A measure of variability for the rational design. (This might, for example, be the between-patient variance σ2 for a parallel group design.)
  5. An agreed type I error rate, α.
  6. An agreed power, 1-β.
  7. A clinically relevant difference, δ. (To be discussed.)
  8. The size of the experiment, n, (in terms of multiples of the rational design).

In treatments of this subject, points 1-3 are frequently glossed over as already known and given, although in my experience any serious work on trial design involves the statistician in a lot of work investigating and discussing these issues. In consequence, in conventional discussions, attention is placed on points 4-8. Typically, it is assumed that 4-7 are given and 8, the size of the experiment, is calculated as a consequence. More rarely, 4, 5, 7 and 8 are given and 6, the power, is calculated from the other four. An obvious weakness of this system is that there is no formal mention of cost, whether in money, lost opportunities or patient time and suffering.

An example
A parallel group trial is planned in asthma with 3 months’ follow-up. The agreed outcome measure is forced expiratory volume in one second (FEV1) at the end of the trial. The between-patient standard deviation is 450 ml and the clinically relevant difference is 200 ml. A type I error rate of 5% is chosen and the test will be two-sided. A power of 80% is targeted.

An approximate formula that may be used is

$$ n \approx \frac{2\sigma^2}{\delta^2}\,\bigl(z_{\alpha/2} + z_\beta\bigr)^2 \qquad (1) $$

Here the second term on the right hand side reflects what I call decision precision, with $z_{\alpha/2}$ and $z_\beta$ as the relevant percentage points of the standard Normal. If you lower the type I error rate or increase the power, decision precision will increase. The first term on the right hand side is the variance for a rational design (consisting of one patient on each arm) expressed as a ratio to the square of a clinically relevant difference. It is a noise-to-signal ratio.

Substituting we have

$$ n \approx \frac{2 \times 450^2}{200^2}\,(1.96 + 0.84)^2 = 10.125 \times 7.84 \approx 80 $$

Thus we need an 80-fold replication of the rational design, which is to say, 80 patients on each arm.
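The arithmetic can be verified in a few lines of Python. This is my own sketch, not part of the original post; the function name is mine, and it implements the approximate formula above using only the standard library:

```python
from math import ceil
from statistics import NormalDist

def per_arm_sample_size(sigma, delta, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two-sided test in a parallel group trial."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)            # ~1.96 for alpha = 0.05, two-sided
    z_beta = z.inv_cdf(power)                     # ~0.84 for 80% power
    noise_to_signal = 2 * sigma**2 / delta**2     # variance of the rational design / delta^2
    decision_precision = (z_alpha + z_beta)**2    # the "decision precision" term
    return noise_to_signal * decision_precision

n = per_arm_sample_size(sigma=450, delta=200)
print(ceil(n))  # 80 patients per arm
```

Using the exact Normal quantiles rather than the rounded 1.96 and 0.84 gives 79.5, which rounds up to the same 80-fold replication of the rational design.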

What is delta?
I now list different points of view regarding this.

     1.     It is the difference we would like to observe
This point of view is occasionally propounded but it is incompatible with the formula used. To see this consider a re-arrangement of equation (1) as

$$ \frac{\delta}{\sqrt{2\sigma^2/n}} = z_{\alpha/2} + z_\beta \qquad (2) $$

The numerator on the left hand side is the clinically relevant difference and the denominator is the standard error. Now if the observed difference, d, is the same as the clinically relevant difference then we can replace δ by d in (2), but that would imply that the ratio of the observed value to its standard error would be (in our example) 2.8. This does not correspond to a P-value of 0.05, which our calculation was supposed to deliver us with 80% probability if the clinically relevant difference obtained, but to a P-value of about 0.005, or just over 1/10 of what our power calculation would accept as constituting proof of efficacy.

To put it another way, if δ is the value we would like to observe and if the treatment does, indeed, have a value of δ, then we have only half a chance, not an 80% chance, that the trial will deliver to us a value as big as this.
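These two numbers can be checked directly. The sketch below is mine, not the author's; it computes the P-value implied by an observed difference equal to δ, and the chance of observing at least δ when the true effect is δ:

```python
from statistics import NormalDist

z = NormalDist()
z_alpha, z_beta = z.inv_cdf(0.975), z.inv_cdf(0.80)

# If d = delta, the ratio of observed value to standard error is z_alpha + z_beta:
ratio = z_alpha + z_beta
print(round(ratio, 1))          # 2.8

# The corresponding two-sided P-value is far smaller than 0.05:
p_two_sided = 2 * (1 - z.cdf(ratio))
print(round(p_two_sided, 3))    # 0.005

# If the true effect equals delta, the chance of observing d >= delta is exactly 1/2:
print(1 - z.cdf(0.0))           # 0.5
```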

     2.     It is the difference we would like to ‘prove’ obtains
This view is hopeless. It requires that the lower confidence interval should be greater than δ. If this is what is needed, the power calculation is completely irrelevant.

     3.     It is the difference we believe obtains
This is another wrong-headed notion. Since the smaller the value of δ the larger the sample size, it would have the curious side effect that, given a number of drug-development candidates, we would spend most money on those we considered least promising. There are some semi-Bayesian versions of this in which a probability distribution for δ would be substituted for a single value. Most medical statisticians would reject this as being a pointless elaboration of a point of view that is wrong in the first place. If you reject the notion that δ is your best guess as to what the treatment effect is, there is no need to elaborate this rejected position by giving δ a probability distribution.

Note, I am not rejecting the idea of Bayesian sample size calculations. A fully decision-analytic approach might be interesting.  I am rejecting what is a Bayesian-frequentist chimera.

     4.     It is the difference you would not like to miss
This is the interpretation I favour. The idea is that we control two (conditional) errors in the process. The first is α, the probability of claiming that a treatment is effective when it is, in fact, no better than placebo. The second is the error of failing to develop a (very) interesting treatment further. If a trial in drug development is not ‘successful’, there is a chance that the whole development programme will be cancelled. It is the conditional probability of cancelling an interesting project that we seek to control.

Note that the FDA will usually require that two phase III trials are ‘significant’, and significance requires that the observed effect is at least equal to $\frac{z_{\alpha/2}}{z_{\alpha/2}+z_\beta}\,\delta$. In our example this would give us (1.96/2.8)δ = 0.7δ, or a little over two thirds of δ, for at least two trials for any drug that obtained registration. In practice, the observed average of the two would be somewhat in excess of 0.7δ. Of course, we would be naïve to believe that all drugs that get accepted have this effect (regression to the mean is ever-present) but nevertheless it provides some reassurance.
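The 0.7δ figure follows directly from the Normal quantiles. A minimal check (my own, assuming the conventional α = 0.05 two-sided and 80% power used throughout the example):

```python
from statistics import NormalDist

z = NormalDist()
z_alpha, z_beta = z.inv_cdf(0.975), z.inv_cdf(0.80)

# Smallest observed effect that reaches significance, as a fraction of delta:
fraction = z_alpha / (z_alpha + z_beta)
print(round(fraction, 2))   # 0.7
```

So a trial powered at 80% against δ can be ‘significant’ with an observed effect of only about 70% of δ, which is the connection back to the 2/3 question that opened this note.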

In other words, if you are going to do a power calculation and you are going to target some sort of value like 80% power, you need to set δ at a value that is higher than the one you would be happy to find. Statisticians like me think of δ as the difference we would not like to miss, and we call this the clinically relevant difference.

Does this mean that an effect that is 2/3 of the clinically relevant difference is worth having? Not necessarily. That depends on what your understanding of the phrase is. It should be noted, however, that when it is crucial to establish that no important difference between treatments exists, as in a non-inferiority study, then another sort of difference is commonly used. This is referred to as the clinically irrelevant difference. Such differences are quite commonly no more than 1/3 of the sort of difference a drug will have shown historically to placebo and hence much smaller than the difference you would not like to miss.

Another lesson, however, is this. In this area, as in others in the analysis of clinical data, dichotomisation is a bad habit. There are no hard and fast boundaries. Relevance is a matter of degree not kind.

Categories: power, Statistics, Stephen Senn | 38 Comments
