**S. Stanley Young, PhD**

Assistant Director for Bioinformatics

National Institute of Statistical Sciences

Research Triangle Park, NC

Author of *Resampling-Based Multiple Testing*, Westfall and Young (1993), Wiley.


The main readings for the discussion are:

- Young, S. & Karr, A. (2011). “Deming, Data and Observational Studies,” *Significance* 8(3): 116–120.
- Begley, C. G. & Ellis, L. M. (2012). “Raise Standards for Preclinical Cancer Research,” *Nature* 483: 531–533.
- Ioannidis, J. P. A. (2005). “Why Most Published Research Findings Are False,” *PLoS Medicine* 2(8): e124.
- Peng, R. D., Dominici, F. & Zeger, S. L. (2006). “Reproducible Epidemiologic Research,” *American Journal of Epidemiology* 163(9): 783–789.

Filed under: Announcement, evidence-based policy, Phil6334, science communication, selection effects, Statistical fraudbusting, Statistics

*Someone sent us a recording (mp3) of the panel discussion from that Colloquium (there’s a lot on “big data” and its politics), including Mayo, Xiao-Li Meng (Harvard), Kent Staley (St. Louis), and Mark van der Laan (Berkeley).*

See if this works: | mp3

*There’s a prelude here to our visitor on April 24: Professor Stanley Young from the National Institute of Statistical Sciences.*

Filed under: Bayesian/frequentist, Error Statistics, Phil6334

Four years ago, many of us were glued to the “spill cam” showing, in real time, the gushing oil from the April 20, 2010 explosion that sank the Deepwater Horizon oil rig in the Gulf of Mexico, killing 11, and spewing oil until July 15 (see the video clip added below). Remember junk shots, top kill, blowout preventers? [1] The EPA lifted its Gulf drilling ban on BP just a couple of weeks ago* (BP has paid around ~~$13~~ $27 billion in fines and compensation), and April 20, 2014, is the deadline to properly file forms for new compensation claims.

*(After which BP had another small spill in Lake Michigan.)*

But what happened to the 200 million gallons of oil? Has it vanished, or was it just sunk to the bottom of the sea by dispersants that may have caused hidden destruction of sea life? I don’t know, but given it’s Saturday night, let’s listen in to a reblog of a spill-related variation on the second of two original “overheard at the comedy hour” jokes.

*In effect, it accuses* the frequentist error-statistical account of licensing the following (make-believe) argument after the 2010 oil spill:

Oil Exec: We had highly reliable evidence that H: the pressure was at normal levels on April 20, 2010!

Senator: But you conceded that whenever your measuring tool showed dangerous or ambiguous readings, you continually lowered the pressure, and that the stringent “cement bond log” test was entirely skipped.

Oil Exec: Granted, we omitted reliable checks on April 20, 2010, but usually we do a better job—I am reporting the average! You see, we use a randomizer that most of the time directs us to run the gold-standard check on pressure. April 20 just happened to be one of those times we did the nonstringent test; but on average we do OK.

Senator: But you don’t know that your system would have passed the more stringent test you didn’t perform!

Oil Exec: That’s the beauty of the frequentist test!

Even if we grant (for the sake of the joke) that, overall, this “test” rarely errs in the report it outputs (pass or fail), that is irrelevant to appraising the inference from the data on April 20, 2010 (which would have differed had the more stringent test been run). That interpretation violates the severity criterion: the observed passing result was altogether common if generated from a source where the pressure level was unacceptably high; therefore *it misinterprets the actual data*. The question is why anyone would saddle the frequentist with such shenanigans on averages. … Lest anyone think I am inventing a criticism, here is a familiar statistical instantiation, where the probability of choosing each experiment is given to be .5 (Cox 1958).

*Two Measuring Instruments with Different Precisions:*

A single observation X is to be made on a normally distributed random variable with unknown mean µ, but the measurement instrument is chosen by a coin flip: with heads we use instrument E’ with a known small variance, say 10^{-4}, while with tails we use E”, with a known large variance, say 10^{4}. The full data indicate whether E’ or E” was performed, and the particular value observed, which we can write as x’ and x”, respectively. (This example comes up in my discussions of the strong likelihood principle (SLP), e.g., ton o’bricks, and here.)

In applying our favorite one-sided (upper) Normal test *T+* to a null hypothesis, say, µ = 0, the “same” value of X would correspond to a much smaller p-value were it to have come from E’ than if it had come from E”. Denote the two p-values as p’ and p”, respectively. However, or so the criticism proceeds, the error statistician would report the average p-value: .5(p’ + p”).

But this would give a misleading assessment of the precision and corresponding severity with either measurement! Instead you should report the p-value of the result in the experiment actually run (this is Cox’s Weak Conditionality Principle, WCP).
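To put numbers on this, here is a minimal sketch in Python (the observed value x = 2.0 is a purely hypothetical assumption for illustration); it computes p’, p”, and the criticized averaged report .5(p’ + p”) for the one-sided test of µ = 0:

```python
# Sketch of the two-instruments example: one observation x, two known sd's.
# x = 2.0 is a hypothetical value chosen only to illustrate the contrast.
from scipy.stats import norm

x = 2.0
sd_precise = 1e-2    # E': variance 10^-4
sd_imprecise = 1e2   # E'': variance 10^4

p_precise = norm.sf(x / sd_precise)          # p'  (essentially 0)
p_imprecise = norm.sf(x / sd_imprecise)      # p'' (about .49)
p_average = 0.5 * (p_precise + p_imprecise)  # the (unwarranted) averaged report

print(f"p' = {p_precise:.3g}, p'' = {p_imprecise:.3g}, average = {p_average:.3g}")
```

Whichever instrument was actually used, the averaged report (about .25 here) misstates the precision of the measurement actually made—exactly the point of the WCP.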

But what could lead the critic to suppose the error statistician must average over experiments not even performed? Rule #2 for legitimate criticism is to give the position being criticized the most generous construal one can think of. Perhaps the critic supposes what is actually a distortion of even the most radical behavioristic construal:

The severity requirement makes explicit that such a construal is to be rejected—I would have thought it obvious, and not in need of identifying a special principle. Since it wasn’t, I articulated this special notion for interpreting tests and the corresponding severity criterion.

I gave an honorary mention to Christian Robert [3] on this point in his discussion of Cox and Mayo (2010). Robert writes (p. 9):

A compelling section is the one about the weak conditionality principle (pp.294- 298), as it objects to the usual statement that a frequency approach breaks this principle. In a mixture experiment about the same parameter θ, inferences made conditional on the experiment “are appropriately drawn in terms of the sampling behaviour in the experiment known to have been performed” (p. 296). This seems hardly objectionable, as stated. And I must confess the sin of stating the opposite as The Bayesian Choice has this remark (Robert (2007), Example 1.3.7, p.18) that the classical confidence interval averages over the experiments. The term experiment validates the above conditioning in that several experiments could be used to measure θ, each with a different p-value. I will not argue with this.

He would want me to mention that he does raise some caveats:

I could, however, [argue] about ‘conditioning is warranted to achieve objective frequentist goals’ (p. 298) in that the choice of the conditioning, among other things, weakens the objectivity of the analysis. In a sense the above pirouette out of the conditioning principle paradox suffers from the same weakness, namely that when two distributions characterise the same data (the mixture and the conditional distributions), there is a choice to be made between “good” and “bad”. http://arxiv.org/abs/1111.5827

But there is nothing arbitrary about regarding as “good” the only experiment actually run and from which the actual data arose. The severity criterion only makes explicit what is/should be already obvious. Objectivity, for us, is directed by the goal of making correct and warranted inferences, not freedom from thinking. After all, any time an experiment E is performed, the critic could insist that the decision to perform E is the result of some chance circumstances and with some probability we might have felt differently that day and have run some other test, perhaps a highly imprecise test or a much more precise test or anything in between, and demand that we report whatever average properties they come up with. The error statistician can only shake her head in wonder that this gambit is at the heart of criticisms of frequentist tests.

Still, we exiled ones can’t be too fussy, and Robert still gets the mention for conceding that we have a solid leg on which to pirouette.

[1] The relevance of the Deepwater Horizon spill to this blog stems from its having occurred while I was busy organizing the conference “StatSci meets PhilSci” (to take place at the LSE in June 2010). So all my examples there involved “deepwater drilling”, but of the philosophical sort. Search the blog for further connections (especially the RMM volume, and the blog’s “mascot” stock, Diamond Offshore (DO), which has now bottomed out at around $48—long story).

Of course, the spill cam wasn’t set up right away.

[2] If any readers work on the statistical analysis of the toxicity of the fish or sediment from the BP oil spill, or know of good references, please let me know.

BP said all tests had shown that Gulf seafood was safe to consume and there had been no published studies demonstrating seafood abnormalities due to the Deepwater Horizon accident.

[3] There have been around 4-5 other “honorable mentions” since then, though I’m not sure of the exact count.

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference,” in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D. Mayo and A. Spanos, eds.), Cambridge: Cambridge University Press: 247–275.

Filed under: Comedy, Statistics

Consider our favorite test of the mean of a Normal distribution with n iid samples, and known standard deviation σ: test T+. This time let:

H_{0}: µ ≤ 0 against H_{1}: µ > 0, and let σ = 1.

Nothing of interest to the logic changes if the s.d. is estimated, as is more typical. With σ = 1 and *n* = 25, (σ/√*n*) = .2.

The (1 – α) confidence interval (CI) corresponding to test T+ asserts that µ exceeds the (1 – α) lower limit (LL):

µ > M – c_{α}(1/√*n*),

*where M represents the statistic, usually written X-bar, the sample mean. For example,*

M – 2.5(1/√*n*)

is the generic lower limit (LL) of a 99% CI (taking 2.5 as the approximate cut-off). The impressive thing is that this holds regardless of the true value of µ. If, for any M, you assert:

µ > M – c_{α}(1/√*n*),

your assertions will be correct 99% of the time. [Once the data are in hand, M takes the value of a particular sample mean. Without quantifiers, this is a little imprecise.]
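A quick Monte Carlo sketch of this coverage claim (the true µ below is an arbitrary illustrative assumption; with the rounded cut-off 2.5, the actual coverage is about .994 rather than .99 exactly):

```python
# Monte Carlo check: the assertion mu > M - 2.5*(sigma/sqrt(n)) is correct
# ~99% of the time, whatever the true mu (here an arbitrary illustrative value).
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma, n, trials = 0.37, 1.0, 25, 100_000

M = rng.normal(mu_true, sigma / np.sqrt(n), size=trials)  # simulated sample means
lower_limits = M - 2.5 * (sigma / np.sqrt(n))             # 99% lower limits
print((mu_true > lower_limits).mean())  # ~.994 (2.33 would give .99 exactly)
```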

*Now for the duality between CIs and tests. How does it work?*

Put aside for the moment our fixed hypothesis of interest; just retain the form of test T+. Keeping the s.d. of 1, and n = 25, suppose we have observed M = .6.

Consider the question: For what value of µ_{0} would M = .6 be the 2.5 s.d. cut-off (in test T+)? That is, for what value of µ_{0} would an observed mean of .6 exceed µ_{0} by 2.5 s.d.s? (Or again, for what value of µ_{0} would our observation reach a p-value of .01 in test T+?)

Clearly, the answer is µ_{0} = .6 – 2.5(.2) = .1: that is, in testing H_{0}: µ ≤ .1 against H_{1}: µ > .1.

The corresponding .99 lower limit of the one-sided confidence interval would be:

µ > .1, i.e., the interval (.1, ∞).

The duality with tests says that these are the µ values (in the given model and test) that would not be statistically significant at the .01 level, *had they been the ones tested in T+*. For example:

H_{0}: µ ≤ .15 would not be rejected, nor would H_{0}: µ ≤ .2, H_{0}: µ ≤ .25, and so on. *That’s because the observed M is not statistically significantly greater (at the .01 level) than any of the µ values in the interval.* Since this is continuous, it does not matter if the cut-off is just at .1 or at values greater than .1.

On the other hand, a test hypothesis of H_{0}: µ ≤ .09 would be rejected by M = .6; as would µ ≤ .08, µ ≤ .07, …, H_{0}: µ ≤ 0, and so on. Using significance test language again, the observed M is statistically significantly greater than all these values (p-level smaller than .01), and at smaller and smaller levels of significance.

Under the supposition that the data were generated from a world where µ ≤ .1 (the H_{0} of the test just described), at least 99% of the time a smaller M than was observed would occur.

The test was so incapable of having produced so large a value of M as .6, were µ less than the 99% CI lower bound, that we argue there is an indication (if not full blown evidence) that µ > .1.

We are assuming these values are “audited”, and the assumptions of the model permit the computations to be approximately valid. Following Fisher, evidence of an experimental effect requires more than a single, isolated significant result, but let us say that is satisfied.

The severity with which µ > .1 “passes” the test with this result M = .6 (in test T+) is ~ .99.

SEV(µ > .1, test T+, M = .6) = P(M < .6; µ = .1) = P(Z < (.6 – .1)/.2) = P(Z < 2.5) ≈ .99.

Here’s a little chart for this example:

Duality between the LL of 1-sided confidence intervals and a fixed outcome M = .6 of test T+: H_{0}: µ ≤ µ_{0} vs. H_{1}: µ > µ_{0}. σ = 1, *n* = 25, (σ/√*n*) = .2. These computations are approximate.

| Were µ no greater than | Capability of T+ to produce M as large as .6 | µ is the 1-sided LL with level | Claim C | SEV associated with C |
|---|---|---|---|---|
| .1 | .01 | .99 | (µ > .1) | .99 |
| .2 | .025 | .975 | (µ > .2) | .975 |
| .3 | .07 | .93 | (µ > .3) | .93 |
| .4 | .16 | .84 | (µ > .4) | .84 |
| .5 | .3 | .7 | (µ > .5) | .7 |
| .6 | .5 | .5 | (µ > .6) | .5 |
| .7 | .69 | .31 | (µ > .7) | .31 |

In all these cases, the test had fairly low capability to produce M as large as .6—the largest it gets is .69. I’ll consider what the test is more capable of doing in another post. Note that as the capability increases, the corresponding confidence level decreases.
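For readers who want to recompute the chart, here is a short Python sketch (it reproduces the entries up to the chart’s rounding):

```python
# Recompute the chart: for M = .6 and sigma/sqrt(n) = .2,
#   capability = P(M >= .6; mu = mu0)   and   SEV(mu > mu0) = P(M < .6; mu = mu0).
from scipy.stats import norm

M_obs, se = 0.6, 0.2
for mu0 in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7):
    capability = norm.sf(M_obs, loc=mu0, scale=se)  # chance of M as large as .6
    sev = norm.cdf(M_obs, loc=mu0, scale=se)        # severity for claim mu > mu0
    print(f"mu0 = {mu0:.1f}: capability = {capability:.3f}, SEV = {sev:.3f}")
```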

Filed under: confidence intervals and tests, Phil6334

**Jerzy Neyman** (April 16, 1894 – August 5, 1981), was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals. (This article was first posted here.)

One of Neyman’s most remarkable, but least recognized, achievements was his adapting of Fisher’s (1922) notion of a statistical model to render it pertinent for non-random samples. Fisher’s original parametric statistical model M_{θ}(**x**) was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data **x**_{0}:=(x_{1},x_{2},…,x_{n}) can be viewed as a ‘truly representative sample’ from that ‘population’:

“The postulate of randomness thus resolves itself into the question, ‘Of what population is this a random sample?’” (ibid., p. 313), underscoring that “the adequacy of our choice may be tested a posteriori” (p. 314).

In cases where data **x**_{0} come from sample surveys or can be viewed as a typical realization of a random sample **X**:=(X_{1},X_{2},…,X_{n}), i.e. Independent and Identically Distributed (IID) random variables, the ‘population’ metaphor can be helpful in adding some intuitive appeal to the inductive dimension of statistical inference, because one can imagine using a subset of a population (the sample) to draw inferences pertaining to the whole population.

This ‘infinite population’ metaphor, however, is of limited value in most applied disciplines relying on observational data. To see how inept this metaphor is, consider the question: what is the hypothetical ‘population’ when modeling the gyrations of stock market prices? More generally, what is observed in such cases is a certain on-going process and not a fixed population from which we can select a representative sample. For that very reason, most economists in the 1930s considered Fisher’s statistical modeling irrelevant for economic data!

Due primarily to Neyman’s experience with empirical modeling in a number of applied fields, including genetics, agriculture, epidemiology, biology, astronomy and economics, his notion of a statistical model evolved beyond Fisher’s ‘infinite populations’ in the 1930s into frequentist ‘chance mechanisms’ (see Neyman, 1950, 1952):

Guessing and then verifying the ‘chance mechanism’, the repeated operation of which produces the observed frequencies. This is a problem of ‘frequentist probability theory’. Occasionally, this step is labeled ‘model building’. Naturally, the guessed chance mechanism is hypothetical. (Neyman, 1977, p. 99)

From my perspective, this was a major step forward for several reasons, including the following.

*First*, the notion of a statistical model as a ‘chance mechanism’ extended the intended scope of statistical modeling to include dynamic phenomena that give rise to data from non-IID samples, i.e. data that exhibit both dependence and heterogeneity, like stock prices.

*Second*, the notion of a statistical model as a ‘chance mechanism’ is not only of metaphorical value, but it can be operationalized in the context of a statistical model, formalized by:

M_{θ}(**x**) = {f(**x**;θ), θ∈Θ}, **x**∈R^{n}, Θ⊂R^{m}; m << n,

where the distribution of the sample f(**x**;θ) describes the probabilistic assumptions of the statistical model. This takes the form of a statistical Generating Mechanism (GM), stemming from f(**x**;θ), that can be used to generate simulated data on a computer. An example of such a Statistical GM is:

X_{t} = α_{0} + α_{1}X_{t-1} + σε_{t}, *t=1,2,…,n*

This indicates how one can use *pseudo-random* numbers for the error term ε_{t} ~ NIID(0,1) to simulate data for the Normal, AutoRegressive [AR(1)] Model. One can generate numerous sample realizations, say N = 100,000, of sample size *n* in nanoseconds on a PC.
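As a concrete sketch of such a simulation (the parameter values α_{0} = 0, α_{1} = .5, σ = 1 below are illustrative assumptions, not values from the text):

```python
# Sketch: simulate the Normal AR(1) statistical GM
#   X_t = alpha0 + alpha1*X_{t-1} + sigma*eps_t,   eps_t ~ NIID(0,1).
import numpy as np

def simulate_ar1(n, alpha0=0.0, alpha1=0.5, sigma=1.0, x0=0.0, rng=None):
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal(n)   # pseudo-random NIID(0,1) errors
    x = np.empty(n)
    prev = x0
    for t in range(n):
        prev = alpha0 + alpha1 * prev + sigma * eps[t]
        x[t] = prev
    return x

# N sample realizations of size n; applying a test statistic to each yields
# its empirical sampling distribution.
realizations = [simulate_ar1(n=100) for _ in range(10_000)]
```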

*Third*, the notion of a statistical model as a ‘chance mechanism’ puts a totally different spin on another metaphor widely used by uninformed critics of frequentist inference. This is the ‘long-run’ metaphor associated with the relevant error probabilities used to calibrate frequentist inferences. The operationalization of the statistical GM reveals that the temporal aspect of this metaphor is totally irrelevant for frequentist inference; remember Keynes’s catch phrase “In the long run we are all dead”? Instead, what matters in practice is *repeatability in principle*, not over time! For instance, one can use the above statistical GM to generate the empirical sampling distributions for any test statistic, and thus render operational not only the pre-data error probabilities (type I and II errors, as well as the power of a test) but also the post-data probabilities associated with the severity evaluation; see Mayo (1996).

**HAPPY BIRTHDAY NEYMAN!**

For further discussion on the above issues see:

Spanos, A. (2012), “A Frequentist Interpretation of Probability for Model-Based Inductive Inference,” in *Synthese*:

http://www.econ.vt.edu/faculty/2008vitas_research/Spanos/1Spanos-2011-Synthese.pdf

Fisher, R. A. (1922), “On the mathematical foundations of theoretical statistics,” *Philosophical Transactions of the Royal Society* A, 222: 309-368.

Mayo, D. G. (1996), *Error and the Growth of Experimental Knowledge*, The University of Chicago Press, Chicago.

Neyman, J. (1950), *First Course in Probability and Statistics*, Henry Holt, NY.

Neyman, J. (1952), *Lectures and Conferences on Mathematical Statistics and Probability*, 2nd ed. U.S. Department of Agriculture, Washington.

Neyman, J. (1977), “Frequentist Probability and Frequentist Statistics,” *Synthese*, 36, 97-131.

[i] He was born in an area that was part of Russia.

Filed under: phil/history of stat, Spanos, Statistics Tagged: frequentist statistics, Spanos

A. Spanos Probability/Statistics Lecture Notes 7: An Introduction to Bayesian Inference (4/10/14)

Filed under: Bayesian/frequentist, Phil 6334 class material, Statistics

This is how Richard Gill, statistician at Leiden University, describes a feature film (*Lucia de B.*) just released about the case of Lucia de Berk, a nurse found guilty of several murders based largely on statistics. Gill is widely known (among other things) for exposing the flawed statistical analysis used to convict her, which ultimately led (after Gill’s tireless efforts) to her conviction being overturned. (I hope they translate the film into English.) In a recent e-mail Gill writes:

“The Dutch are going into an orgy of feel-good tear-jerking sentimentality as a movie comes out (the premiere is tonight) about the case. It will be a good movie, actually, but it only tells one side of the story. …When a jumbo jet goes down we find out what went wrong and prevent it from happening again. The Lucia case was a similar disaster. But no one even *knows* what went wrong. It can happen again tomorrow.

I spoke about it a couple of days ago at a TEDx event (Flanders).

You can find some p-values in my slides ["Murder by Numbers", pasted below the video]. They were important – first in convicting Lucia, later in getting her a fair re-trial.”

Since it’s Saturday night, let’s watch Gill’s TEDx talk, “Statistical Error in court”.

Slides from the Talk: “Murder by Numbers”:

Filed under: junk science, P-values, PhilStatLaw, science communication, Statistics Tagged: Richard Gill

We were reading “Out, Damned Spot: Can the ‘Macbeth effect’ be replicated?” (Earp, B., Everett, J., Madva, E., and Hamlin, J., 2014, in *Basic and Applied Social Psychology* 36: 91–98) in an informal gathering of our 6334 seminar yesterday afternoon at Thebes. Some of the graduate students are interested in so-called “experimental” philosophy, and I asked for an example that used statistics for purposes of analysis. The example—and it’s a great one (thanks, Rory M!)—revolves around priming research in social psychology. Yes, the field that has come in for so much criticism of late, especially after Diederik Stapel was found to have been fabricating data altogether (search this blog, e.g., here). [1]

But since then the field has, ostensibly, attempted to clean up its act. On the meta-level, Simmons, Nelson, and Simonsohn (2011) is an excellent example of the kind of self-scrutiny the field needs, and their list of requirements and guidelines offer a much needed start (along with their related work). But the research itself appears to be going on in the same way as before (I don’t claim this one is representative), except that now researchers are keen to show their ability and willingness to *demonstrate failure to replicate. So negative results are the new positives!* If the new fashion is non-replication, that’s what will be found (following Kahneman‘s call for a “daisy chain” in [1]).

In “Out, Damned Spot,” the authors are unable to replicate what they describe as a famous experiment (Zhong and Liljenquist 2006) wherein participants who read “a passage describing an unethical deed as opposed to an ethical deed, … were subsequently likelier to rate cleansing products as more desirable than other consumer products” (92). There are a variety of protocols, all rather similar. For instance, students are asked to write out a passage to the effect that:

“I shredded a document that I knew my co-worker Harlan was desperately looking for so that I would be the one to get a promotion.”

or

“I place the much sought-after document in Harlan’s mail box.”

See the article for the exact words. Participants are told, untruthfully, that the study is on handwriting, or on punctuation or the like. (Aside: Would you feel more desirous of soap products after punctuating a paragraph about shredding a file that your colleague is looking for? More desirous than when…? More desirous than if you put it in his mailbox, I guess. [2]) In another variation on the Zhong et al. studies, when participants are asked to remember an unethical vs. ethical deed they committed, they tended to pick an antiseptic wipe over a pen as compensation.

Yet these authors declare there is “a robust experimental foundation for the existence of a real-life Macbeth Effect” and therefore are surprised that they are unable to replicate the result. ~~The very fact that the article starts with giving high praise to these earlier studies already raises a big question mark in my mind as to their critical capacities, so I am not too surprised that they do not bring such capacities into their own studies.~~ It’s so nice to have cross-out capability. Given that the field considers this effect solid and important, it is appropriate for the authors to regard it as such. *(I think they are just jumping onto the new bandwagon. Admittedly, I’m skeptical, so send me defenses, if you have them. I place this under “fallacies of negative results”)*

I asked the group of seminar participants if they could even identify a way to pick up on the “Macbeth” effect assuming no limits to where they could look or what kind of imaginary experiment one could run. Hmmm. We were hard pressed to come up with any. Follow evil-doers around (invisibly) and see if they clean up? Follow do-gooders around (invisibly) to see if they don’t wash so much? (Never mind that cleanliness is next to godliness.) Of course if the killer has got blood on her (as in Lady “a little water clears us of this deed” Macbeth) she’s going to wash up, but the whole point is to apply it to moral culpability more generally (seeing if moral impurity cashes out as physical). So the first signal that an empirical study is at best wishy-washy, and at worst pseudoscientific, is the utter vagueness of the effect they are studying. There’s little point to a sophisticated analysis of the statistics if you cannot get past this…unless you’re curious as to what other howlers lie in store. Yet with all of these experiments, the “causal” inference of interest is miles and miles away from the artificial exercises subjects engage in…(unless too trivial to bother studying).

Returning to their study, after the writing exercise, the current researchers (Earp et al.) have participants rate various consumer products for their desirability on a scale of 1 to 7.

They found “no significant difference in the mean desirability of the cleansing items between the moral condition (M= 3.09) and immoral condition (M = 3.08)” (94)—a difference that is so small as to be suspect in itself. Their two-sided confidence interval contains 0 so the null is not rejected. (We get a p-value and Cohen’s d, but no data.) Aris Spanos brought out a point we rarely hear (that came up in our criticism of a study on hormesis): it’s easy to get phony results with artificial measurement scales like 1-7. (Send links of others discussing this.) The mean isn’t even meaningful, and anyway, by adjusting the scale, a non-significant difference can become significant. (I don’t think this is mentioned in Simmons, Nelson, and Simonsohn 2011, but I need to reread it.)

The authors seem to think that failing to replicate studies restores credibility, and is indicative of taking a hard-nosed line, getting beyond the questionable significant results that have come in for such a drubbing. It does not. You can do just as questionable a job finding no effect as finding one. What they need to do is offer a stringent critique of the other (and their own) studies. A negative result is not a stringent critique. (Kahneman: please issue this further requirement.)

In fact, the scrutiny our seminar group arrived at in a mere one-hour discussion did more to pinpoint the holes in the other studies than all their failures to replicate. As I see it, that’s the kind of meta-level methodological scrutiny that their field needs if they are to lift themselves out of the shadows of questionable science. I could go on for pages *and pages* on all that is irksome and questionable about their analysis but will not. These researchers don’t seem to get it. *(Or so it seems.)*

If philosophers are basing philosophical theories on such “experimental” work without tearing them apart methodologically, then they’re not doing their job. Quine was wrong, and Popper was right (on this point): naturalized philosophy (be it ethics, epistemology or other) is not a matter of looking to psychological experiment.

*Some proposed labels:* We might label as questionable science any inferential inquiry where the researchers have not shown sufficient self-scrutiny of fairly flagrant threats to the inferences of interest. These threats would involve problems all along the route from the data generation and modeling to their interpretation. If an enterprise regularly fails to demonstrate such self-scrutiny, or worse, if its standard methodology revolves around reports that do a poor job at self-scrutiny, then I label the research area pseudoscience. If it regularly uses methods that permit erroneous interpretations of data with high probability, then we might be getting into “fraud” or at least “junk” science. (Some people want to limit “fraud” to a deliberate act. Maybe so, but my feeling is, as professional researchers claiming to have evidence of something, the onus is on them to be self-critical. Unconscious wishful thinking doesn’t get you off the hook.)

[1] In 2012 Kahneman said he saw a train-wreck looming for social psychology and suggested a “daisy chain” of replication.

[2] Correction: I had “less” switched with “more” in the early draft (I wrote this quickly during the seminar).

[3] New reference from Uri Simonsohn: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2259879

[4] Addition, April 11, 2014: A commentator wrote that I should read Mook’s classic paper against “external validity”.

http://www.uoguelph.ca/~psystats/readings_3380/mook%20article.pdf

In replying, I noted that I agreed with Mook entirely: “I entirely agree that ‘artificial’ experiments can enable the most probative and severe tests. But Mook emphasizes the hypothetical-deductive method (which we may assume would be of the statistical and not purely deductive variety). This requires the very entailments that are questionable in these studies. …I have argued (in sync with Mook, I think) as to why ‘generalizability’, external validity and the like may miss the point of severe testing, which typically requires honing in on, amplifying, isolating, and even creating effects that would not occur in any natural setting. If our theory T would predict a result or effect with which these experiments conflict, then the argument to the flaw in T holds—as Mook remarks. What is missing from the experiments I criticize is the link that’s needed for testing—the very point that Mook is making.

I especially like Mook’s example of the wire monkeys. Despite the artificiality, we can use that experiment to understand that hunger reduction is not valued more than motherly comfort or the like. That’s the trick of a good experiment: if the theory (e.g., about hunger-reduction being primary) were true, then we would not expect those laboratory results. The key question is whether we are gaining an understanding, and that’s what I’m questioning.

I’m emphasizing the meaningfulness of the theory-statistical hypothesis link on purpose. People get so caught up in the statistics that they tend to ignore, at times, the theory-statistics link.

Granted, there are at least two distinct things that might be tested: the effect itself (here the Macbeth effect), and the reliability of the previous positive results. Even if the previous positive results are irrelevant for understanding the actual effect of interest, one may wish to argue that it is or could be picking up something reliably. Even though understanding the effect is of primary importance, one may claim only to be interested in whether the previous results are statistically sound. Yet another interest might be claimed to be learning more about how to trigger it. I think my criticism, in this case, actually gets to all of these, and for different reasons. I’ll be glad to hear other positions.

______

Simmons, J., Nelson, L. and Simonsohn, U. (2011). “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” *Psychological Science* 22(11): 1359–1366.

Zhong, C. and Liljenquist, K. (2006). “Washing Away Your Sins: Threatened Morality and Physical Cleansing,” *Science* 313: 1451–1452.

Filed under: fallacy of non-significance, junk science, reformers, Statistics

April 3, 2014: We interspersed discussion with slides; these cover the main readings of the day (check syllabus): Duhem’s Problem and the Bayesian Way, and “Highly Probable vs. Highly Probed”. See syllabus four. Slides are below (followers of this blog will be familiar with most of this, e.g., here). We also did further work on misspecification testing.

Monday, April 7, is an optional outing, “a seminar class trip” you might say, here at Thebes, at which time we will analyze the statistical curves of the mountains, pie charts of pizza, and (seriously) study some experiments on the problem of replication in “the Hamlet Effect in social psychology”. If you’re around, please bop in!

Mayo’s slides on Duhem’s Problem and more from April 3 (Day #9):

Filed under: Bayesian/frequentist, highly probable vs highly probed, misspecification testing

“In conversation I have emphasized to other statisticians, starting in 1950, that, in virtue of the ‘law of the iterated logarithm,’ by optional stopping an arbitrarily high sigmage, and therefore an arbitrarily small tail-area probability, can be attained even when the null hypothesis is true. In other words if a Fisherian is prepared to use optional stopping (which usually he is not) he can be sure of rejecting a true null hypothesis provided that he is prepared to go on sampling for a long time.

The way I usually express this ‘paradox’ is that a Fisherian [but not a Bayesian] can cheat by pretending he has a plane to catch like a gambler who leaves the table when he is ahead” (Good 1983, 135)[*]

This paper came from a conference where we both presented, and he was *extremely* critical of my error statistical defense on this point. (I was like a year out of grad school, and he a University Distinguished Professor.)

One time, years later, after hearing Jack give this howler for the nth time, “a Fisherian [but not a Bayesian] can cheat, etc.,” I was driving him to his office, and suddenly blurted out what I really thought:

“You know Jack, as many times as I have heard you tell this, I’ve always been baffled as to its lesson about who is allowed to cheat. Error statisticians require the overall and not the ‘computed’ significance level be reported. To us, what would be cheating would be reporting the significance level you got after trying and trying again in just the same way as if the test had a fixed sample size. True, we are forced to fret about how stopping rules alter the error probabilities of tests, while the Bayesian is free to ignore them, but why isn’t the real lesson that the Bayesian is allowed to cheat?” (A published version of my remark may be found in EGEK p. 351: “As often as my distinguished colleague presents this point…”)

To my surprise, or actually shock, after pondering this a bit, Jack said something like, “Hmm, I never thought of it this way.”

By the way, the story of the “after dinner Bayesian comedy hour” on this blog did not allude to Jack but to someone who gave a much more embellished version. Since it’s Saturday night, let’s once again listen in to the comedy hour that unfolded at my dinner table at an academic conference:

Did you hear the one about the researcher who gets a phone call from the guy analyzing his data? First the guy congratulates him and says, “The results show a statistically significant difference at the .05 level—p-value .048.” But then, an hour later, the phone rings again. It’s the same guy, but now he’s apologizing. It turns out that the experimenter intended to keep sampling until the result was 1.96 standard deviations away from the 0 null—in either direction—so they had to reanalyze the data (n = 169), and the results were no longer statistically significant at the .05 level.

Much laughter.

So the researcher is tearing his hair out when the same guy calls back again. “Congratulations!” the guy says. “I just found out that the experimenter actually had planned to take n = 169 all along, so the results are statistically significant.”

Howls of laughter.

But then the guy calls back with the bad news . . .

It turns out that, failing to score a sufficiently impressive effect after *n*′ trials, the experimenter went on to *n*″ trials, and so on and so forth until finally, say, on trial number 169, he obtained a result 1.96 standard deviations from the null.

It continues this way, and every time the guy calls in and reports a shift in the p-value, the table erupts in howls of laughter! From everyone except me, sitting in stunned silence, staring straight ahead. The hilarity ensues from the idea that the experimenter’s reported psychological intentions about when to stop sampling are altering the statistical results.

The allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter may be called the *argument from intentions.* When stopping rules matter, however, we are looking not at “intentions” but at real alterations to the probative capacity of the test, as picked up by a change in the test’s corresponding error probabilities. The analogous problem occurs if there is a fixed null hypothesis and the experimenter is allowed to search for maximally likely alternative hypotheses (Mayo and Kruse 2001; Cox and Hinkley 1974). Much the same issue is operating in what physicists call the *look-elsewhere effect* (LEE), which arose in the context of “bump hunting” in the Higgs results.

The *optional stopping effect* often appears in illustrations of how error statistics violates the Likelihood Principle (LP), alluding to a two-sided test from a Normal distribution:

*X*_{i} ~ N(µ, σ), and we test H_{0}: µ = 0 vs. H_{1}: µ ≠ 0.

The stopping rule might take the form:

*Keep sampling until |m| ≥ 1.96(σ/√n)*,

with *m* the sample mean. When *n* is fixed the type 1 error probability is .05, but with this stopping rule the actual significance level may differ from, and will be greater than, .05. In fact, ignoring the stopping rule allows a high or maximal probability of error. For a sampling theorist, this example alone “taken in the context of examining consistency with θ = 0, is enough to refute the strong likelihood principle.” (Cox 1977, p. 54) since, with probability 1, it will stop with a “nominally” significant result even though θ = 0. As Birnbaum (1969, 128) puts it, “the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of probabilities of erroneous interpretations.” From the error-statistical standpoint, ignoring the stopping rule allows readily inferring that there is evidence for a non- null hypothesis even though it has passed with low if not minimal severity.
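A minimal simulation sketch of this inflation (capping each experiment at 1,000 observations is an assumption made only for the illustration; the truly open-ended rule rejects with probability 1):

```python
# Estimate the actual significance level of "keep sampling until
# |m| >= 1.96*sigma/sqrt(n)" when the null mu = 0 is true.
import numpy as np

rng = np.random.default_rng(1)
sigma, n_max, trials = 1.0, 1_000, 5_000

rejections = 0
for _ in range(trials):
    x = rng.normal(0.0, sigma, size=n_max)
    n = np.arange(1, n_max + 1)
    m = np.cumsum(x) / n                           # running sample mean
    if np.any(np.abs(m) >= 1.96 * sigma / np.sqrt(n)):
        rejections += 1                            # stopped with "significance"

print(rejections / trials)  # well above the nominal .05; -> 1 as n_max grows
```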

Peter Armitage, in his comments on Savage at the 1959 forum (“Savage Forum” 1962), put it thus:

I think it is quite clear that likelihood ratios, and therefore posterior probabilities, do not depend on a stopping rule. . . .

I feel that if a man deliberately stopped an investigation when he had departed sufficiently far from his particular hypothesis, then “Thou shalt be misled if thou dost not know that.” If so, prior probability methods seem to appear in a less attractive light than frequency methods, where one can take into account the method of sampling. (Savage 1962, 72; emphasis added; see [ii])

*H* is *not* being put to a stringent test when a researcher allows trying and trying again until the data are far enough from *H*_{0} to reject it in favor of *H*.

**Stopping Rule Principle**

The effect of optional stopping *appears* evanescent—locked in someone’s head—if one has no way of taking error probabilities into account:

In general, suppose that you collect data of any kind whatsoever — not necessarily Bernoullian, nor identically distributed, nor independent of each other . . . — stopping only when the data thus far collected satisfy some criterion of a sort that is sure to be satisfied sooner or later, then the import of the sequence of *n* data actually observed will be exactly the same as it would be had you planned to take exactly *n* observations in the first place. (Edwards, Lindman, and Savage 1963, 238–239)

This is called the *irrelevance of the stopping rule* or the *Stopping Rule Principle* (SRP), and is an implication of the (strong) likelihood principle (LP), which is taken up elsewhere in this blog. [i]

To the holder of the LP, the intuition is that the stopping rule is irrelevant; to the error statistician the stopping rule is quite relevant, because the probability that the persistent experimenter finds data against the no-difference null is increased, even if the null is true. It alters the well-testedness of claims inferred. (Error #11 of Mayo and Spanos 2011, “Error Statistics”.)

**A Funny Thing Happened at the Savage Forum[i]**

While Savage says he was always uncomfortable with the argument from intentions, he is reminding Barnard of the argument that Barnard promoted years before. He’s saying, in effect, Don’t you remember, George? You’re the one who so convincingly urged in 1952 that to take stopping rules into account is like taking psychological intentions into account:

The argument then was this: The design of a sequential experiment is, in the last analysis, what the experimenter actually intended to do. His intention is locked up inside his head. (Savage 1962, 76)

But, alas, Barnard had changed his mind. Still, the argument from intentions is repeated again and again by Bayesians. Howson and Urbach think it entails dire conclusions for significance tests:

A significance test inference, therefore, depends not only on the outcome that a trial produced, but also on the outcomes that it could have produced but did not. And the latter are determined by certain private intentions of the experimenter, embodying his stopping rule. It seems to us that this fact precludes a significance test delivering any kind of judgment about empirical support. . . . For scientists would not normally regard such personal intentions as proper influences on the support which data give to a hypothesis. (Howson and Urbach 1993, 212)

It is fallacious to insinuate that regarding optional stopping as relevant is in effect to make private intentions relevant. Although the choice of stopping rule (as with other test specifications) is determined by the intentions of the experimenter, it does not follow that taking account of its influence is to take account of subjective intentions. The allegation is a *non sequitur*.

We often hear things like:

[I]t seems very strange that a frequentist could not analyze a given set of data, such as (x_{1}, …, x_{n}) [in Armitage’s example] if the stopping rule is not given. . . . [D]ata should be able to speak for itself. (Berger and Wolpert 1988, 78)

But data do not speak for themselves, unless sufficient information is included to correctly appraise relevant error probabilities. The error statistician has a perfectly nonpsychological way of accounting for the impact of stopping rules, as well as other aspects of experimental plans. The impact is on the stringency or severity of the test that the purported “real effect” has passed. In the optional stopping plan, there is a difference in the set of possible outcomes; certain outcomes available in the fixed sample size plan are no longer available. If a stopping rule is truly open-ended (it need not be), then the possible outcomes do not contain any that fail to reject the null hypothesis. (The above rule stops in a finite number of trials with probability 1; it is “proper”.)

Does the difference in error probabilities corresponding to a difference in sampling plans correspond to any real difference in the experiment? Yes. The researchers really did do something different in the try-and-try-again scheme and, as Armitage says, thou shalt be misled if your account cannot report this.

We have banished the argument from intentions, the allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter. So if you’re at my dinner table, can I count on you not to rehearse this one…?

One last thing….

**The Optional Stopping Effect with Bayesian (Two-sided) Confidence Intervals**

The equivalent stopping rule can be framed in terms of the corresponding 95% “confidence interval” method, given the normal distribution above (their term and quotes):

Keep sampling until the 95% confidence interval excludes 0.

Berger and Wolpert concede that using this stopping rule “has thus succeeded in getting the [Bayesian] conditionalist to perceive that μ ≠ 0, and has done so honestly” (pp. 80-81). This seems to be a striking admission—especially as the Bayesian interval assigns a probability of .95 to the truth of the interval estimate (using a “noninformative prior density”):

*m* ± 1.96(σ/√n)

But, they maintain (or did back then) that the LP only “seems to allow the experimenter to mislead a Bayesian. The ‘misleading,’ however, is solely from a frequentist viewpoint, and will not be of concern to a conditionalist.” Does this mean that while the real error probabilities are poor, Bayesians are not impacted, since, from the perspective of what they believe, there is no misleading?

[*] It was because of these “conversations” that Jack thought his name should be included in the “Jeffreys-Lindley paradox”, so I always call it the Jeffreys-Good-Lindley paradox. I discuss this in EGEK 1996, Chapter 10, and in Mayo and Kruse (2001). See a recent paper by my colleague Aris Spanos (2013) on the Jeffreys-Lindley paradox.

[i] There are certain exceptions where the stopping rule may be “informative”. Other posts may be found on LP violations, and an informal version of my critique of Birnbaum’s LP argument. On optional stopping, see also Irony and Bad Faith.

[ii] I found, on an old webpage of mine, (a pale copy of) the “Savage forum”:

- Savage Forum title page through page 20.pdf
- Savage Forum pages 21 through 35.pdf
- Savage Forum pages 36 through 52.pdf
- Savage Forum pages 53 through 55.pdf
- Savage Forum pages 56 through 70.pdf
- Savage Forum pages 71 through 77.pdf
- Savage Forum pages 78 through 103.pdf
- Savage Forum reference pages 104 through 112.pdf

Armitage, P. (1962), “Discussion”, in *The Foundations of Statistical Inference: A Discussion*, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 72.

Berger J. O. and Wolpert, R. L. (1988), *The Likelihood Principle: **A Review, Generalizations, and Statistical Implications* 2^{nd} edition, Lecture Notes-Monograph Series, Vol. 6, Shanti S. Gupta, Series Editor, Hayward, California: Institute of Mathematical Statistics.

Birnbaum, A. (1969), “Concepts of Statistical Evidence” In *Philosophy, Science, and Method: Essays in Honor of Ernest Nagel*, S. Morgenbesser, P. Suppes, and M. White (eds.): New York: St. Martin’s Press, 112-43.

Cox, D. R. (1977), “The Role of Significance Tests” (with discussion), *Scandinavian Journal of Statistics* 4: 49–70.

Cox, D. R. and D. V. Hinkley (1974), *Theoretical Statistics,* London: Chapman & Hall.

Edwards, W., Lindman, H., and Savage, L. J. (1963), “Bayesian Statistical Inference for Psychological Research,” *Psychological Review* 70: 193–242.

Good, I. J. (1983), *Good Thinking: The Foundations of Probability and Its Applications*, Minneapolis: University of Minnesota Press.

Howson, C., and P. Urbach (1993[1989]), *Scientific Reasoning: The Bayesian Approach*, 2^{nd} ed., La Salle: Open Court.

Mayo, D. (1996), [EGEK] *Error and the Growth of Experimental Knowledge*, Chapter 10: “Why You Cannot Be Just a Little Bayesian,” Chicago: University of Chicago Press.

Mayo, D. G. and Kruse, M. (2001), “Principles of Inference and Their Consequences,” in D. Corfield and J. Williamson (eds.), *Foundations of Bayesianism*. Dordrecht: Kluwer Academic Publishers: 381–403.

Savage, L. (1962), “Discussion”, in *The Foundations of Statistical Inference: A Discussion*, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 76.

Spanos, A. “Who Should Be Afraid of the Jeffreys-Lindley Paradox?” *Philosophy of Science*, 80 (2013): 73-93.

Filed under: Bayesian/frequentist, Comedy, Statistics Tagged: argument from intentions, frequentist criticisms, Stopping rules

Jeremy’s upcoming talk on blogging will be live-tweeted by @FisheriesBlog, 1 pm EDT Apr. 7. Posted on April 3, 2014 by Jeremy Fox.

If you like to follow live tweets of talks, you’re in luck: my upcoming Virginia Tech talk on blogging will be live tweeted by Brandon Peoples, a grad student there who co-authors The Fisheries Blog. Follow @FisheriesBlog at 1 pm US Eastern Daylight Time on Monday, April 7 for the live tweets.

Jeremy Fox’s excellent blog, “Dynamic Ecology,” often discusses matters statistical from a perspective in sync with error statistics.

I’ve never been invited to talk about blogging or even to blog about blogging; maybe this is a new trend. I look forward to meeting him (live!).

* Posts that don’t directly pertain to philosophy of science/statistics are placed under “rejected posts,” but since this is a metablog post on a talk on a blog pertaining to statistics, it has been “conditionally accepted”, unconditionally, i.e., without conditions.

Filed under: Announcement, Metablog

I had heard of medical designs that employ individuals who supply Bayesian subjective priors that are deemed either “enthusiastic” or “skeptical” as regards the probable value of medical treatments.[i] From what I gather, these priors are combined with data from trials in order to help decide whether to stop trials early or continue. But I’d never heard of these Bayesian designs in relation to decisions about building security or renovations! Listen to this….

You may have heard that the Department of Homeland Security (DHS), whose 240,000 employees are scattered among 50 office locations around D.C., has been planning to have headquarters built at an abandoned insane asylum, St. Elizabeths, in DC [ii]. See a recent discussion here. In 2006 officials projected the new facility would be ready by 2015; now an additional $3.2 billion is needed to complete the renovation of the 159-year-old mental hospital by 2026 (Congressional Research Service) [iii]. The initial plan of developing the entire structure is no longer feasible; so, to determine which parts of the facility are most likely to be promising, “DHS is bringing in a team of data analysts who are possessed,” said Homeland Security Secretary Jeh Johnson (during a DHS meeting, Feb. 26), “possessed with vibrant background beliefs to sense which buildings are most probably worth renovating, from the point of view of security. St. Elizabeths needs to be fortified with 21st-century technologies for cybersecurity and antiterrorism missions,” Johnson explained.

Failing to entice private companies to renovate the dilapidated west campus of the historic mental health facility that sits on 176 acres overlooking the Anacostia River, they can only hope to renovate selectively: “Which parts are we going to overhaul? Parts of the hospital have been rotting for years!” Johnson declared.


**Skeptical and enthusiastic priors: excerpt from DHS memo:**

The description of the use of so-called “enthusiastic” and “skeptical” priors is sketched in a DHS memo released in January 2014 (but which had been first issued in 2011). Here’s part of it:

Enthusiastic priors are used in evaluating the portions of St. Elizabeths campus thought to be probably unpromising, in terms of environmental soundness, or because of an existing suspicion of probable security leaks. If the location fails to be probably promising using an enthusiastic prior, plus data, then there is overwhelming evidence to support the decision that the particular area is not promising.

Skeptical priors are used in situations where the particular asylum wing, floor, or campus quadrant is believed to be probably promising for DHS. If the skeptical opinion, combined with the data on the area in question, yields a high posterior belief that it is a promising area to renovate, this would be taken as extremely convincing evidence to support the decision that the wing, floor or building is probably promising.

But long before they can apply this protocol, they must hire specialists to provide the enthusiastic or skeptical priors. *(See stress testing below.)* The article further explains, “In addition, Homeland Security took on a green initiative — deciding to outfit the campus’ buildings (some dating back to 1855) with features like rainwater toilets and Brazilian hardwood in the name of sustainability.” With that in mind, they also try to get a balance of environmentalist enthusiasts and green skeptics.

Asked how he can justify the extra $3 billion (a minimal figure), Mr. Johnson said, “I think that the morale of DHS, unity of mission, would go a long way if we could get to a headquarters.” He was pleased to announce that an innovative program of recruiting was nearly complete.

**Stress Testing for Calibrated “Enthusiastic” and “Skeptical” Prior Probabilities**

Perhaps the most interesting part of all this is how they conduct stress testing for individuals to supply calibrated Bayesian priors concerning St. Elizabeths. Before being hired to give “skeptical” or “enthusiastic” prior distributions, candidates must pass a rather stringent panoply of stress tests based on their hunches regarding relevant facts associated with a number of other abandoned insane asylums. *(It turns out there are a lot of them throughout the world. I had no idea.)* The list of asylums on which they based the testing (over the past years) has been kept Top Secret Classified until very recently [iv]. Even now, one is directed to a non-governmental website to find a list of 8 or so of the old mental facilities that apparently appeared in just one batch of tests.

Scott Bays-Knorr, a DHS data analyst specialist who is coordinating the research and hiring of “sensors,” made it clear that the research used acceptable, empirical studies: “We’re not testing for paranormal ability or any hocus pocus. These are facts, and we are interested in finding those people whose beliefs match the facts reliably. DHS only hires highly calibrated, highly sensitive individuals”, said Bays-Knorr. *Well I’m glad there’s no hocus-pocus at least.*

The way it works is that they combine written tests with fMRI data—which monitors blood flow and, therefore, activity inside the brain in real time—to try to establish a neural signature that can be correlated with security-relevant data about the abandoned state hospitals. “The probability they are attuned to these completely unrelated facts about abandoned state asylums they’ve probably never even heard of is about 0. So we know our results are highly robust,” Bays-Knorr assured some skeptical senators.

**Danvers State Hospital**

Take, for example, Danvers State Hospital, a psychiatric asylum opened in 1878 in Danvers, Massachusetts.

“We check their general sensitivity by seeing if any alarm bells go off in relation to little known facts about unrelated buildings that would be important to a high security facility. ‘What about a series of underground tunnels’ we might ask, any alarm bells go off? Any hairs on their head stand up when we flash a picture of a mysterious fire at the Danvers site in 2007?” Bays-Knorr enthused. “If we’ve got a verified fire skeptic who, when we get him to DC, believes that a part of St. Elizabeths is clear, then we start to believe that’s a fire-safe location. You don’t want to build U.S. cybersecurity if good sensors give it a high probability of being an incendiary location.” *I think some hairs on my head are starting to stand up.*

Interestingly, some of the tests involve matching drawn pictures, which remind me a little of those remote sensing tests of telepathy. Here’s one such target picture:

They claim they can ensure robustness by means of correlating a sensor’s impressions of completely unrelated facts about the facility. For example, using fMRI data they can check if “anything lights up” in connection with Lovecraft’s Arkham Sanatorium, the short story “The Thing on the Doorstep”, or Arkham Asylum in the Batman comic world.

Bays-Knorr described how simple facts are used as a robust benchmark for what he calls “the touchy-feely stuff”. For example, picking up on a simple fact in connection with High Royds Hospital (in England) is sensing its alternative name: West Riding Pauper Lunatic Asylum. They compare that to reactions to the question of why patient farming ended. *I frankly don’t get it, but then again, I’m highly skeptical of approaches not constrained by error statistical probing.*


Yet Bays-Knorr seemed to be convincing many of the Senators who will have to approve an extra 3 billion on the project. He further described the safeguards: “We never published this list of asylums; the candidate sensors did not even know what we were going to ask them. It doesn’t matter if they’re asylum specialists or have a 6th sense. If they have good hunches, if they fit the average of the skeptics or the enthusiasts, then we want them.” Only if the correlations are sufficiently coherent is a ‘replication score’ achieved. The testing data are then sent to an independent facility of blind “big data” statisticians, Cherry Associates, from whom the purpose of the analysis is kept entirely hidden. “We look for feelings and sensitivity; often the person doesn’t know she even has it,” one Cherry Associates representative noted. Testing has gone on for the past 7 years ($700 million) and is only now winding up. *(I’m not sure how many were hired, but with $150,000 salaries for part-time work, it seems a good gig!)*

Community priors, skeptical and enthusiastic, are eventually obtained based on those hired as U.S. Government DHS Calibrated Prior Degree Specialists.

**Sounds like lunacy to me![v] (but check the date of this post!)**

[i] Spiegelhalter, D. J., Abrams, K. R., & Myles, J. P. (2004). *Bayesian approaches to clinical trials and health care evaluation*. Chichester: Wiley.

[ii] Homepage for DHS and St. Elizabeths Campus Plans.

http://www.stelizabethsdevelopment.com/index.html

GSA Development of St. Elizabeths campus: “Preserving the Legacy, Realizing Potential”

[iii] U.S. House of Representatives, Committee on Homeland Security, January 2014. Prepared by Majority Staff of the Committee on Homeland Security.

http://homeland.house.gov/sites/homeland.house.gov/files/documents/01-10-14-StElizabeths-Report.pdf

[iv] The randomly selected hospitals in one standardized test included the following:

Topeka State Hospital

Danvers State Hospital

Denbigh Asylum

Pilgrim State Hospital

Trans-Allegheny Asylum

High Royds Hospital

Whittingham Hospital

Norwich State Hospital

Essential guide to abandoned insane asylums: http://www.atlasobscura.com/articles/abandoned-insane-asylums

[v] **Or, a partial April Fool’s joke!**

Filed under: junk science, Statistics, subjective Bayesian elicitation

“Probability/Statistics Lecture Notes 6: An Introduction to Mis-Specification (M-S) Testing” (Aris Spanos)

[Other slides from Day 9 by guest, John Byrd, can be found here.]

Filed under: misspecification testing, Phil 6334 class material, Spanos, Statistics

**Caitlin Parker**

**Able, we’d well aim on. I bet on a note. Binomial? Lewd. Ew, Elba!**

The requirement was: a palindrome with “Elba” plus “binomial”, with an optional second word: “bet”. A palindrome that uses both “binomial” and “bet” topped an acceptable palindrome that uses only “binomial”.

**Short bio:**

Caitlin Parker is a first-year master’s student in the Philosophy department at Virginia Tech. Though her interests are in philosophy of science and statistics, she also has experience doing psychological research.

**Statement:**
“Thanks for the challenge! Palindromes give us a fun opportunity to practice planning in a setting where each new letter has the power to completely recast one’s previous efforts. Since one has to balance developing a structure with preserving some kind of meaning, it can take forever to get a palindrome to ‘work’ – but it’s incredibly satisfying when it does.”

**Choice of Book:**

See April contest (first word: fallacy; optional second word: error).

Filed under: Announcement, Palindrome, Rejected Posts


*Guest, March 27, Phil 6334*

“Statistical Considerations of the Histomorphometric Test Protocol for Determination of Human Origin of Skeletal Remains”

By:

John E. Byrd, Ph.D., D-ABFA

Maria-Teresa Tersigni-Tarrant, Ph.D.

Central Identification Laboratory

JPAC

Filed under: Phil6334, Philosophy of Statistics, Statistics