Your (very own) personalized genomic prediction varies depending on who else was around?


personalized medicine roulette

As if I wasn’t skeptical enough about personalized predictions based on genomic signatures, Jeff Leek recently had a surprising post, “A surprisingly tricky issue when using genomic signatures for personalized medicine”. Leek (on his blog Simply Statistics) writes:

My student Prasad Patil has a really nice paper that just came out in Bioinformatics (preprint in case paywalled). The paper is about a surprisingly tricky normalization issue with genomic signatures. Genomic signatures are basically statistical/machine learning functions applied to the measurements for a set of genes to predict how long patients will survive, or how they will respond to therapy. The issue is that usually when building and applying these signatures, people normalize across samples in the training and testing set.

….it turns out that this one simple normalization problem can dramatically change the results of the predictions. In particular, we show that the predictions for the same patient, with the exact same data, can change dramatically if you just change the subpopulations of patients within the testing set.

Here’s an extract from the paper, “Test set bias affects reproducibility of gene signatures”:

Test set bias is a failure of reproducibility of a genomic signature. In other words, the same patient, with the same data and classification algorithm, may be assigned to different clinical groups. A similar failing resulted in the cancellation of clinical trials that used an irreproducible genomic signature to make chemotherapy decisions (Letter (2011)).

This is a reference to the Anil Potti case:

The Cancer Letter (2011). Duke Accepts Potti Resignation; Retraction Process Initiated with Nature Medicine.

But far from the Potti case being a uniquely problematic example (see here and here), at least with respect to test set bias, this article makes it appear that test set bias is a threat to be expected much more generally. Going back to the abstract of the paper:

ABSTRACT Motivation: Prior to applying genomic predictors to clinical samples, the genomic data must be properly normalized to ensure that the test set data are comparable to the data upon which the predictor was trained. The most effective normalization methods depend on data from multiple patients. From a biomedical perspective, this implies that predictions for a single patient may change depending on which other patient samples they are normalized with. This test set bias will occur when any cross-sample normalization is used before clinical prediction.

Results: We demonstrate that results from existing gene signatures which rely on normalizing test data may be irreproducible when the patient population changes composition or size using a set of curated, publicly-available breast cancer microarray experiments. As an alternative, we examine the use of gene signatures that rely on ranks from the data and show why signatures using rank-based features can avoid test set bias while maintaining highly accurate classification, even across platforms…..

“The implications of a patient’s classification changing due to test set bias may be important clinically, financially, and legally. … a patient’s classification could affect a treatment or therapy decision. In other cases, an estimation of the patient’s probability of survival may be too optimistic or pessimistic. The fundamental issue is that the patient’s predicted quantity should be fully determined by the patient’s genomic information, and the bias we will explore here is induced completely due to technical steps.”

“DISCUSSION We found that breast cancer tumor subtype predictions varied for the same patient when the data for that patient were processed using differing numbers of patient sets and patient sets had varying distributions of key characteristics (ER* status). This is undesirable behavior for a prediction algorithm, as the same patient should always be assigned the same prediction assuming their genomic data do not change (6)…

“This raises the question of how similar the test set needs to be to the training data for classifications to be trusted when the test data are normalized.”

*ER: estrogen receptor.

Returning to Leek’s post:

The basic problem is illustrated in this graphic.




This seems like a pretty esoteric statistical issue, but it turns out that this one simple normalization problem can dramatically change the results of the predictions. …

In this plot, Prasad made predictions for the exact same set of patients two times when the patient population varied in ER status composition. As many as 30% of the predictions were different for the same patient with the same data if you just varied who they were being predicted with.



This paper highlights how tricky statistical issues can slow down the process of translating ostensibly really useful genomic signatures into clinical practice and lends even more weight to the idea that precision medicine is a statistical field.

As a complete outsider to this field, I’m wondering: at what point in the determination of the patient’s prediction does the normalization apply? Say a patient walks into her doctor’s office to get a prediction or recommendation…
As for their recommendation to use ranks rather than normalizing, can it work? And should we expect these concerns to be well taken care of in the latest generation of microarrays?
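On the first question, the mechanics are easy to see in a toy sketch (mine, not the paper’s pipeline; quantile normalization stands in here for whatever cross-sample step a pipeline uses): the same patient’s normalized profile shifts depending on which cohort she is processed with, while her within-sample gene ranks do not.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantile_normalize(X):
    # Force every sample (column) to share one empirical distribution:
    # replace each value by the mean of the values of the same rank.
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    row_means = np.sort(X, axis=0).mean(axis=1)
    return row_means[ranks]

n_genes = 50
patient = rng.normal(size=n_genes)            # one fixed patient profile

# Two hypothetical test cohorts with different composition
cohort_a = rng.normal(loc=0.0, size=(n_genes, 20))
cohort_b = rng.normal(loc=2.0, size=(n_genes, 20))

norm_a = quantile_normalize(np.column_stack((patient, cohort_a)))[:, 0]
norm_b = quantile_normalize(np.column_stack((patient, cohort_b)))[:, 0]

# The patient's normalized values depend on who else was in the batch
print(np.allclose(norm_a, norm_b))            # False

# ...but her gene ranks survive intact, so a rank-based signature
# returns the same answer in both cohorts
ranks_a = np.argsort(np.argsort(norm_a))
ranks_b = np.argsort(np.argsort(norm_b))
print(bool((ranks_a == ranks_b).all()))       # True
```

A classifier fed `norm_a` versus `norm_b` can therefore flip its prediction for the very same patient; one built on within-sample ranks cannot.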


Prasad Patil, Pierre-Olivier Bachant-Winner, Benjamin Haibe-Kains, and Jeffrey T. Leek, “Test set bias affects reproducibility of gene signatures,” Bioinformatics, Advance Access published March 18, 2015, OUP.



Categories: Anil Potti, personalized medicine, Statistics | 5 Comments

Objectivity in Statistics: “Arguments From Discretion and 3 Reactions”

dirty hands

We constantly hear that procedures of inference are inescapably subjective because of the latitude of human judgment as it bears on the collection, modeling, and interpretation of data. But this is seriously equivocal: Being the product of a human subject is hardly the same as being subjective, at least not in the sense we are speaking of—that is, as a threat to objective knowledge. Are all these arguments about the allegedly inevitable subjectivity of statistical methodology rooted in equivocations? I argue that they are! [This post combines this one and this one, as part of our monthly “3 years ago” memory lane.]

“Argument from Discretion” (dirty hands)

Insofar as humans conduct science and draw inferences, it is obvious that human judgments and human measurements are involved. True enough, but too trivial an observation to help us distinguish among the different ways judgments should enter, and how, nevertheless, to avoid introducing bias and unwarranted inferences. The issue is not that a human is doing the measuring, but whether we can reliably use the thing being measured to find out about the world.

Remember the dirty-hands argument? In the early days of this blog (e.g., October 13, 16), I deliberately took up this argument as it arises in evidence-based policy because it offered a certain clarity that I knew we would need to come back to in considering general “arguments from discretion”. To abbreviate:

  1. Numerous human judgments go into specifying experiments, tests, and models.
  2. Because there is latitude and discretion in these specifications, they are “subjective.”
  3. Whether data are taken as evidence for a statistical hypothesis or model depends on these subjective methodological choices.
  4. Therefore, statistical inference and modeling is invariably subjective, if only in part.

We can spot the fallacy in the argument much as we did in the dirty-hands argument about evidence-based policy. It is true, for example, that by employing a very insensitive test for detecting a positive discrepancy d’ from a 0 null, the test has low probability of finding statistical significance even if a discrepancy as large as d’ exists. But that doesn’t prevent us from determining, objectively, that an insignificant difference from that test fails to warrant inferring evidence of a discrepancy less than d’.
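A minimal numerical sketch of this point (my own illustration, assuming a one-sided Normal test of μ ≤ 0 with known σ; all numbers are arbitrary): the test below has low power against the discrepancy d′ of interest, and a non-significant result accordingly provides poor grounds for inferring the discrepancy is less than d′.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf
n, sigma, alpha = 10, 1.0, 0.05
d_prime = 0.3                      # discrepancy we care about detecting

# cutoff for the sample mean in a one-sided test of mu <= 0
cutoff = NormalDist().inv_cdf(1 - alpha) * sigma / sqrt(n)   # ~0.52

# power against mu = d': the test is insensitive
power = 1 - Phi((cutoff - d_prime) * sqrt(n) / sigma)        # ~0.24

# an observed, statistically insignificant sample mean
xbar = 0.40

# probability of a result this large or larger were mu actually d';
# a low value means the insignificant result poorly warrants "mu < d'"
sev = 1 - Phi((xbar - d_prime) * sqrt(n) / sigma)            # ~0.38

print(round(power, 2), round(sev, 2))
```

Both numbers are well below any reassuring level: the insignificant result from this insensitive test cannot objectively license the claim that the discrepancy is under d′, and that assessment itself is objective.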

Test specifications may well be a matter of personal interest and bias, but, given the choices made, whether or not an inference is warranted is not a matter of personal interest and bias. Setting up a test with low power against d’ might be a product of your desire not to find an effect for economic reasons, of insufficient funds to collect a larger sample, or of the inadvertent choice of a bureaucrat. Or ethical concerns may have entered. But none of this precludes our critical evaluation of what the resulting data do and do not indicate (about the question of interest). The critical task need not itself be a matter of economics, ethics, or what have you. Critical scrutiny of evidence reflects an interest all right—an interest in not being misled, an interest in finding out what the case is, and others of an epistemic nature.

Objectivity in statistical inference, and in science more generally, is a matter of being able to critically evaluate the warrant of any claim. This, in turn, is a matter of evaluating the extent to which we have avoided or controlled those specific flaws that could render the claim incorrect. If the inferential account cannot discern any flaws, performs the task poorly, or denies there can ever be errors, then it fails as an objective method of obtaining knowledge.

Consider a parallel with the problem of objectively interpreting observations: observations are always relative to the particular instrument or observation scheme employed.  But we are often aware not only of the fact that observation schemes influence what we observe but also of how they influence observations and how much noise they are likely to produce so as to subtract them out. Hence, objective learning from observation is not a matter of getting free of arbitrary choices of instrument, but a matter of critically evaluating the extent of their influence to get at the underlying phenomenon.

For a similar analogy, the fact that my weight shows up as k pounds reflects the convention (in the United States) of using the pound as a unit of measurement on a particular type of scale. But given the convention of using this scale, whether or not my weight shows up as k pounds is a matter of how much I weigh!*

Likewise, the result of a statistical test is only partly determined by the specification of the tests (e.g., when a result counts as statistically significant); it is also determined by the underlying scientific phenomenon, at least as modeled.  What enables objective learning to take place is the possibility of devising means for recognizing and effectively “subtracting out” the influence of test specifications, in order to learn about the underlying phenomenon, as modeled.

Focusing just on statistical inference, we can distinguish between an objective statistical inference, and an objective statistical method of inference.  A specific statistical inference is objectively warranted, if it has passed a severe test; a statistical method is objective by being able to evaluate and control (at least approximately) the error probabilities needed for a severity appraisal.  This also requires the method to communicate the information needed to conduct the error statistical evaluation  (or report it as problematic).

It should be kept in mind that we are after the dual aims of severity and informativeness.  Merely stating tautologies is to state objectively true claims, but they are not informative. But, it is vital to have a notion of objectivity, and we should stop feeling that we have to say, well there are objective and subjective elements in all methods; we cannot avoid dirty hands in discretionary choices of specification, so all inference methods do about as well when it comes to the criteria of objectivity.  They do not.

*Which, in turn, is a matter of my having overeaten in London.


3 Reactions to the Challenge of Objectivity

(1) If discretionary judgments are thought to introduce subjectivity in inference, a classic strategy thought to achieve objectivity is to extricate such choices, replacing them with purely formal a priori computations or agreed-upon conventions (see March 14). If leeway for discretion introduces subjectivity, then cutting off discretion must yield objectivity! Or so some argue. Such strategies may be found, to varying degrees, across the different approaches to statistical inference.

The inductive logics of the type developed by Carnap promised to be an objective guide for measuring degrees of confirmation in hypotheses, despite much-discussed problems, paradoxes, and conflicting choices of confirmation logics. In Carnapian inductive logics, initial assignments of probability are based on a choice of language and on intuitive, logical principles. The consequent logical probabilities can then be updated (given the statements of evidence) with Bayes’s Theorem. The fact that the resulting degrees of confirmation are at the same time analytical and a priori (giving them an air of objectivity) reveals the central weakness of such confirmation theories as “guides for life”: as guides, say, for empirical frequencies or for finding things out in the real world. Something very similar happens with the varieties of “objective” Bayesian accounts, both in statistics and in formal Bayesian epistemology in philosophy (a topic to which I will return; if interested, see my RMM contribution).

A related way of trying to remove latitude for discretion might be to define objectivity in terms of the consensus of a specified group, perhaps of experts, or of agents with “diverse” backgrounds. Once again, such a convention may enable agreement yet fail to have the desired link-up with the real world. It would be necessary to show why consensus reached by the particular choice of group (another area for discretion) achieves the learning goals of interest.

Likewise, routine and automatic choices in statistics can be justified as promoting a specified goal, but it is the onus of anyone supporting the account in question to show this.

(2) The second reaction is to acknowledge and even to embrace subjective and personal factors. For Savage (1964: 178) the fact that a subjective (which I am not here distinguishing from a “personalistic”) account restores the role of opinion in statistics was a cause of celebration. I am not sure if current-day subjective Bayesians concur—but I would like to hear from them.

Underlying this second reaction, there is often a deep confusion between our limits in achieving the goal of adequately capturing a given data generating mechanism, and making the goal itself be to capture our subjective degrees of belief in (or about) the data generating mechanism. The former may be captured by severity assessments (or something similar), but these are not posterior probabilities (even if one grants the latter could be). Most importantly for the current issue, assessing the existing limitations and inadequacies of inferences is not the same as making our goal be to quantitatively model (our or someone else’s) degrees of belief! Yet these continue to be run together, making it easy to suppose that acknowledging the former limitation is tantamount to accepting the latter.

As I noted in a March 14 comment to A. Spanos, “let us imagine there was a perfect way to measure a person’s real and true degrees of belief in a hypothesis (maybe with some neuropsychology development), while with frequentist statistical models, we grope our way and at most obtain statistically adequate representations of aspects of the data generating mechanism producing the relevant phenomenon. In the former [we are imagining], the measurement is 100% reliable, but the question that remains is the relevance of the thing being measured for finding out about the world. People seem utterly to overlook this” (at least when they blithely repeat variations on “arguments from discretion”, see March 14 post).
Henry Kyburg (1992) put it in terms of error: the subjectivist precludes objectivity because he or she cannot be in error:

This is almost a touchstone of objectivity: the possibility of error. There is no way I can be in error in my prior distribution for µ—unless I make a logical error. . . . It is that very fact that makes this prior distribution perniciously subjective. It represents an assumption that has consequences, but cannot be corrected by criticism or further evidence. (p. 147)

(3) The third way to deal with the challenges of objectivity in inference is to deliberately develop checks of error, and to insist that our statistical methods be self-correcting. Rather than expressing opinions, we want to avoid being misled by beliefs and opinions—mine and yours—building on the recognition that checks of error enable us to acquire reliable knowledge about the world.

This third way is to discern what enabled us to reject the “dirty hands” argument: we can critically evaluate discretionary choices, and design methods to determine objectively what is and is not indicated. It may well mean that the interpretation of the data itself is a report of the obstacles to inference! Far from being a hodgepodge of assumptions and decisions, objectivity in inference can and should involve a systematic self-critical scrutiny all along the inferential path. Each stage of inquiry and each question within that stage involve potential errors and biases. By making these explicit we can learn despite background judgments.

Nowadays, the reigning mood may be toward some sort of third way; but we must be careful. Merely rejecting the dirty-hands conclusion (as in my March 14 post) is not yet to show that any particular method achieves such objective scrutiny in given cases. Nor does it suffice to declare that “of course we subject our assumptions to stringent checks”, and “we will modify our models should we find misfits with the data”. We have seen in our posts on m-s tests, for instance, the dangers of “error fixing” strategies (M-S post 1, 2, 3, 4). The method for checking must itself be justified by showing it has the needed properties for pinpointing flaws reliably. It is not obvious that popular “third-way” gambits meet the error statistical requirements for objectivity in statistics that I have discussed in many previous posts and papers (the ability to evaluate and control relevant error probabilities). At least, it remains an open question as to whether they do.

_____________

Carnap, R. (1962). Logical Foundations of Probability. Chicago: University of Chicago Press.

Kyburg, H. E., Jr.  (1992). “The Scope of Bayesian Reasoning,” in D. Hull, M. Forbes, and K. Okruhlik (eds.), PSA 1992, Vol. II, East Lansing, MI: 139-52.

Savage, L. J. (1964). “The Foundations of Statistics Reconsidered,” in H. E. Kyburg and H. E. Smokler (eds.), Studies in Subjective Probability, New York: Wiley: 173-88.

Categories: Objectivity, Statistics | Tags: , | 6 Comments

Stephen Senn: The pathetic P-value (Guest Post)

S. Senn

Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

The pathetic P-value

This is the way the story is now often told. RA Fisher is the villain. Scientists were virtuously treading the Bayesian path when along came Fisher and gave them P-values, which they gladly accepted because they could get ‘significance’ so much more easily. Nearly a century of corrupt science followed, but now there are signs of a willingness to return to the path of virtue, having abandoned this horrible Fisherian complication:

We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started …

A condition of complete simplicity…

And all shall be well and
All manner of thing shall be well

TS Eliot, Little Gidding

Consider, for example, distinguished scientist David Colquhoun citing the excellent scientific journalist Robert Matthews as follows:

“There is an element of truth in the conclusion of a perspicacious journalist:

‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug. ‘

Robert Matthews Sunday Telegraph, 13 September 1998.” [1]

However, this is not a plain fact but just plain wrong. Even if P-values were the guilty ‘mathematical machine’ they are portrayed to be, it is not RA Fisher’s fault. Putting the historical record right helps one to understand the issues better. As I shall argue, at the heart of this is not a disagreement between Bayesian and frequentist approaches but between two Bayesian approaches: it is a conflict to do with the choice of prior distributions[2].

Fisher did not persuade scientists to calculate P-values rather than Bayesian posterior probabilities; he persuaded them that the probabilities that they were already calculating and interpreting as posterior probabilities relied for this interpretation on a doubtful assumption. He proposed to replace this interpretation with one that did not rely on the assumption.

To understand this, consider Student’s key paper[3] of 1908, in which the following statement may be found:

[Extract from Student (1908)]

Student was comparing two treatments that Cushny and Peebles had considered in their trials of optical isomers at the Insane Asylum at Kalamazoo[4]. The t-statistic for the difference between the two means (in its modern form as proposed by Fisher) would be 4.06 on 9 degrees of freedom. The cumulative probability of this is 0.99858, or 0.9986 to 4 decimal places. Given the constraints under which Student had to labour, his figure of 0.9985 is remarkably accurate; he calculated 0.9985/(1 − 0.9985) ≈ 666 and interpreted this in terms of what a modern Bayesian would call posterior odds. Note that the right-hand probability corresponding to Student’s left-hand 0.9985 is 0.0015 and is, in modern parlance, the one-tailed P-value.
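Student’s arithmetic is easy to check today. A pure-Python sketch (numerically integrating the t density rather than relying on an external library):

```python
from math import gamma, pi, sqrt

def t_sf(t, df, steps=100000, upper=60.0):
    # P(T > t) for Student's t, by trapezoidal integration of the density;
    # the density is negligible beyond `upper` for these df.
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    f = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    h = (upper - t) / steps
    area = 0.5 * (f(t) + f(upper))
    for i in range(1, steps):
        area += f(t + i * h)
    return area * h

p_one = t_sf(4.06, 9)            # one-tailed P-value, ~0.0014
odds = 0.9985 / (1 - 0.9985)     # Student's rounded figure, ~666 to 1
print(round(p_one, 4), round(odds))
```

The exact tail probability comes out near 0.0014, confirming how close Student’s hand computation was given the tables available to him.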

Where did Student get this method of calculation from? His own innovation was in deriving the appropriate distribution for what later came to be known as the t-statistic but the general method of calculating an inverse probability from the distribution of the statistic was much older and associated with Laplace. In his influential monograph, Statistical Methods for Research Workers[5], Fisher, however, proposed an alternative more modest interpretation, stating:

[Extract from Fisher, Statistical Methods for Research Workers]

(Here n is the degrees of freedom and not the sample size.) In fact, Fisher does not even give a P-value here but merely notes that the probability is less than some agreed ‘significance’ threshold.

Comparing Fisher here to Student, and even making allowance for the fact that Student has calculated the ‘exact probability’ whereas Fisher, as a consequence of the way he had constructed his own table (entering at fixed pre-determined probability levels), merely gives a threshold, it is hard to claim that Fisher is somehow responsible for a more exaggerated interpretation of the probability concerned. In fact, Fisher has compared the observed value of 4.06 to a two-tailed critical value, a point that is controversial but cannot be represented as being more liberal than Student’s approach.

To understand where the objection of some modern Bayesians to P-values comes from, we have to look to work that came after Fisher, not before him. The chief actor in the drama was Harold Jeffreys whose Theory of Probability[6] first appeared in 1939, by which time Statistical Methods for Research Workers was already in its seventh edition.

Jeffreys had been much impressed by work of the Cambridge philosopher CD Broad who had pointed out that the principle of insufficient reason might lead one to suppose that, given a large series of only positive trials, the next would also be positive but could not lead one to conclude that all future trials would be. In fact, if the future series was large compared to the preceding observations, the probability was small[7, 8]. Jeffreys wished to show that induction could provide a basis for establishing the (probable) truth of scientific laws. This required lumps of probability on simpler forms of the law, rather than the smooth distribution associated with Laplace. Given a comparison of two treatments (as in Student’s case) the simpler form of the law might require only one parameter for their two means, or equivalently, that the parameter for their difference, τ , was zero. To translate this into the Neyman-Pearson framework requires testing something like

H0: τ = 0 v H1: τ ≠ 0         (1)

It seems, however, that Student was considering something like

H0: τ ≤ 0 v H1: τ > 0,         (2)

although he perhaps also ought simultaneously to be considering something like

H0: τ ≥0 v H1: τ < 0,           (3)

although, again, in a Bayesian framework this is perhaps unnecessary.

(See David Cox[9] for a discussion of the difference between plausible and dividing hypotheses.)

Now the interesting thing about all this is that if you choose between (1) on the one hand and (2) or (3) on the other, it makes remarkably little difference to the inference you make in a frequentist framework. You can see this either as a strength or a weakness; it is largely to do with the fact that the P-value is calculated under the null hypothesis and that in (2) and (3) the most extreme value, which is used for the calculation, is the same as that in (1). However, if you try to express the situations covered by (1) on the one hand and (2) and (3) on the other in terms of prior distributions and proceed to a Bayesian analysis, then it can make a radical difference, basically because all the other values in H0 in (2) and (3) have even less support than the value of H0 in (1). This is the origin of the problem: there is a strong difference in results according to the Bayesian formulation. It is rather disingenuous to represent it as a problem with P-values per se.
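The divergence is easy to exhibit in a textbook Normal example (my sketch; the unit-information prior under H1 is an illustrative choice, not something from Senn’s text). Under a Laplace-style flat prior, the posterior probability that τ ≤ 0 simply reproduces the one-tailed P-value; put a Jeffreys-style lump of prior mass on τ = 0 and the very same “just significant” data can end up favouring the null.

```python
from math import sqrt, exp
from statistics import NormalDist

Phi = NormalDist().cdf

z, n = 1.96, 100      # a result just significant at the two-sided 5% level

# Laplace/Student reading: flat prior on tau, so
# P(tau <= 0 | data) equals the one-tailed P-value
post_tau_le_0 = Phi(-z)                                   # ~0.025

# Jeffreys-style reading: point mass at tau = 0 vs a unit-information
# Normal prior under H1; Bayes factor in favour of H0
bf01 = sqrt(n + 1) * exp(-(z * z / 2) * n / (n + 1))      # ~1.5

print(round(post_tau_le_0, 3), round(bf01, 2))
```

With the lump prior the data marginally favour H0 (BF01 > 1) even though the flat-prior posterior, like the P-value, looks decisive. Nothing about the P-value changed between the two calculations; only the prior formulation did.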

To do so, you would have to claim, at least, that the Laplace, Student etc. Bayesian formulation is always less appropriate than the Jeffreys one. In Twitter exchanges with me, David Colquhoun has vigorously defended the position that (1) is what scientists do, even going so far as to state that all life-scientists do this. I disagree. My reading of the literature is that jobbing scientists don’t know what they do. The typical paper says something about the statistical methods, may mention the significance level but does not define the hypothesis being tested. In fact, a paper in the same journal and same year as Colquhoun’s affords an example. Smyth et al.[10] have 17 lines on statistical methods, including permutation tests (of which Colquhoun approves), but nothing about hypotheses, plausible, point, precise, dividing or otherwise, although the paper does, subsequently, contain a number of P-values.

In other words, scientists don’t bother to state which of (1) on the one hand or (2) and (3) on the other is relevant. It might be that they should, but it is not clear, if they did, which way they would jump. Certainly, in drug development I could argue that the most important thing is to avoid deciding that the new treatment is better than the standard when in fact it is worse, and this is certainly an important concern in developing treatments for rare diseases, a topic on which I research. True Bayesian scientists, of course, would have to admit that many intermediate positions are possible. Ultimately, however, if we are concerned about the real false discovery rate, rather than what scientists should coherently believe about it, it is the actual distribution of effects that matters rather than their distribution in my head, or, for that matter, David Colquhoun’s. Here a dram of data is worth a pint of pontification, and some interesting evidence as regards clinical trials is given by Djulbegovic et al[11].

Furthermore, in the one area, model-fitting, where the business of comparing simpler versus complex laws is important, rather than, say, deciding which of two treatments is better (note that in the latter case a wrong decision has more serious consequences), then a common finding is not that the significance test using the 5% level is liberal but that it is conservative. The AIC criterion will choose a complex law more easily and although there is no such general rule about the BIC, because of its dependence on sample size, when one surveys this area it is hard to come to the conclusion that significance tests are generally more liberal.
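The AIC comparison can be made precise with a one-line calculation (a standard back-of-envelope result, not spelled out in the post): with one extra parameter, AIC prefers the bigger model whenever twice the log-likelihood gain exceeds 2, and that chi-square(1) threshold corresponds to testing at roughly the 16% level, far more liberal than 5%.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf

# P(chi2_1 > 2) = 2 * (1 - Phi(sqrt(2))): the significance level
# implicitly used by AIC when the models differ by one parameter
alpha_aic = 2 * (1 - Phi(sqrt(2)))
print(round(alpha_aic, 3))      # 0.157
```

So, judged against AIC, the 5% significance test is indeed the conservative procedure, consistent with Senn’s point.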

Finally, I want to make it clear, that I am not suggesting that P-values alone are a good way to summarise results, nor am I suggesting that Bayesian analysis is necessarily bad. I am suggesting, however, that Bayes is hard and pointing the finger at P-values ducks the issue. Bayesians (quite rightly so according to the theory) have every right to disagree with each other. This is the origin of the problem and to therefore dismiss P-values

‘…would require that a procedure is dismissed because, when combined with information which it doesn’t require and which may not exist, it disagrees with a procedure that disagrees with itself.’[2] (p 195)


My research on inference for small populations is carried out in the framework of the IDEAL project and supported by the European Union’s Seventh Framework Programme for research, technological development and demonstration under Grant Agreement no 602552.


  1. Colquhoun, D., An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 2014. 1(3): p. 140216.
  2. Senn, S.J., Two cheers for P-values. Journal of Epidemiology and Biostatistics, 2001. 6(2): p. 193-204.
  3. Student, The probable error of a mean. Biometrika, 1908. 6: p. 1-25.
  4. Senn, S.J. and W. Richardson, The first t-test. Statistics in Medicine, 1994. 13(8): p. 785-803.
  5. Fisher, R.A., Statistical Methods for Research Workers, in Statistical Methods, Experimental Design and Scientific Inference, J.H. Bennett (ed.), 1990, Oxford: Oxford University Press.
  6. Jeffreys, H., Theory of Probability, 3rd ed., 1961, Oxford: Clarendon Press.
  7. Senn, S.J., Dicing with Death, 2003, Cambridge: Cambridge University Press.
  8. Senn, S.J., Comment on “Harold Jeffreys’s Theory of Probability Revisited”. Statistical Science, 2009. 24(2): p. 185-186.
  9. Cox, D.R., The role of significance tests. Scandinavian Journal of Statistics, 1977. 4: p. 49-70.
  10. Smyth, A.K., et al., The use of body condition and haematology to detect widespread threatening processes in sleepy lizards (Tiliqua rugosa) in two agricultural environments. Royal Society Open Science, 2014. 1(4): p. 140257.
  11. Djulbegovic, B., et al., Medical research: trial unpredictability yields predictable therapy gains. Nature, 2013. 500(7463): p. 395-396.
Categories: P-values, S. Senn, statistical tests, Statistics | 139 Comments

All She Wrote (so far): Error Statistics Philosophy: 3.5 years on



D.G. Mayo with typewriter

Error Statistics Philosophy: Blog Contents (3.5 years)
By: D. G. Mayo [i]

September 2011

October 2011

November 2011

December 2011

January 2012

February 2012

17 February 1890–29 July 1962

March 2012

April 2012

May 2012

June 2012

July 2012

August 2012

September 2012

October 2012

November 2012

December 2012

January 2013

  • (1/2) Severity as a ‘Metastatistical’ Assessment
  • (1/4) Severity Calculator
  • (1/6) Guest post: Bad Pharma? (S. Senn)


    S. Senn

  • (1/9) RCTs, skeptics, and evidence-based policy
  • (1/10) James M. Buchanan
  • (1/11) Aris Spanos: James M. Buchanan: a scholar, teacher and friend
  • (1/12) Error Statistics Blog: Table of Contents
  • (1/15) Ontology & Methodology: Second call for Abstracts, Papers
  • (1/18) New Kvetch/PhilStock
  • (1/19) Saturday Night Brainstorming and Task Forces: (2013) TFSI on NHST

    NHST task force 2

  • (1/22) New PhilStock
  • (1/23) P-values as posterior odds?
  • (1/26) Coming up: December U-Phil Contributions….
  • (1/27) U-Phil: S. Fletcher & N.Jinn
  • (1/30) U-Phil: J. A. Miller: Blogging the SLP

    Jean Miller

February 2013

  • (2/2) U-Phil: Ton o’ Bricks
  • (2/4) January Palindrome Winner
  • (2/6) Mark Chang (now) gets it right about circularity
  • (2/8) From Gelman’s blog: philosophy and the practice of Bayesian statistics
  • (2/9) New kvetch: Filly Fury
  • (2/10) U-PHIL: Gandenberger & Hennig: Blogging Birnbaum’s Proof
    A. Spanos

  • (2/11) U-Phil: Mayo’s response to Hennig and Gandenberger
  • (2/13) Statistics as a Counter to Heavyweights…who wrote this?
  • (2/16) Fisher and Neyman after anger management?
  • (2/17) R. A. Fisher: how an outsider revolutionized statistics
  • (2/20) Fisher: from ‘Two New Properties of Mathematical Likelihood’
  • (2/23) Stephen Senn: Also Smith and Jones
  • (2/26) PhilStock: DO < $70
  • (2/26) Statistically speaking…

March 2013

  • (3/1) capitalizing on chance

    Mayo at slots

  • (3/4) Big Data or Pig Data?


    pig data

  • (3/7) Stephen Senn: Casting Stones

    S. Senn

  • (3/10) Blog Contents 2013 (Jan & Feb)
  • (3/11) S. Stanley Young: Scientific Integrity and Transparency

    Stan Young

  • (3/13) Risk-Based Security: Knives and Axes
  • (3/15) Normal Deviate: Double Misunderstandings About p-values
  • (3/17) Update on Higgs data analysis: statistical flukes (1)
  • (3/21) Telling the public why the Higgs particle matters
  • (3/23) Is NASA suspending public education and outreach?
  • (3/27) Higgs analysis and statistical flukes (part 2)
  • (3/31) possible progress on the comedy hour circuit?

April 2013

  • (4/1) Flawed Science and Stapel: Priming for a Backlash?

    Kent Staley

  • (4/4) Guest Post. Kent Staley: On the Five Sigma Standard in Particle Physics

    I. J. Good

  • (4/6) Who is allowed to cheat? I.J. Good and that after dinner comedy hour….
  • (4/10) Statistical flukes (3): triggering the switch to throw out 99.99% of the data
  • (4/11) O & M Conference (upcoming) and a bit more on triggering from a participant…..

    Marilyn Monroe not walking past a Higgs boson and not making it decay, whatever philosophers might say.

  • (4/14) Does statistics have an ontology? Does it need one? (draft 2)
  • (4/19) Stephen Senn: When relevance is irrelevant
  • (4/22) Majority say no to inflight cell phone use, knives, toy bats, bow and arrows, according to survey
  • (4/23) PhilStock: Applectomy? (rejected post)
  • (4/25) Blog Contents 2013 (March)
  • (4/27) Getting Credit (or blame) for Something You Didn’t Do (BP oil spill, comedy hour)

    BP oil spill comedy hour

  • (4/29) What should philosophers of science do? (falsification, Higgs, statistics, Marilyn)

May 2013

  • (5/3) Schedule for Ontology & Methodology, 2013
  • (5/6) Professorships in Scandal?
  • (5/9) If it’s called the “The High Quality Research Act,” then ….
  • (5/13) ‘No-Shame’ Psychics Keep Their Predictions Vague: New Rejected post
  • (5/14) “A sense of security regarding the future of statistical science…” Anon review of Error and Inference
  • (5/18) Gandenberger on Ontology and Methodology (May 4) Conference: Virginia Tech



  • (5/19) Mayo: Meanderings on the Onto-Methodology Conference
  • (5/22) Mayo’s slides from the Onto-Meth conference
  • (5/24) Gelman sides w/ Neyman over Fisher in relation to a famous blow-up
  • (5/26) Schachtman: High, Higher, Highest Quality Research Act

    Kent Staley

  • (5/27) A.Birnbaum: Statistical Methods in Scientific Inference
  • (5/29) K. Staley: review of Error & Inference

June 2013

  • (6/1) Winner of May Palindrome Contest
  • (6/1) Some statistical dirty laundry
  • (6/5) Do CIs Avoid Fallacies of Tests? Reforming the Reformers (Reblog 5/17/12):



  • (6/6) PhilStock: Topsy-Turvy Game
  • (6/6) Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)
  • (6/8) Richard Gill: “Integrity or fraud… or just questionable research practices?”

    Richard Gill

  • (6/11) Mayo: comment on the repressed memory research
  • (6/14) P-values can’t be trusted except when used to argue that p-values can’t be trusted!
  • (6/19) PhilStock: The Great Taper Caper
  • (6/19) Stanley Young: better p-values through randomization in microarrays

    Stan Young

  • (6/22) What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Sat night potpourri
  • (6/26) Why I am not a “dualist” in the sense of Sander Greenland
  • (6/29) Palindrome “contest” contest
  • (6/30) Blog Contents: mid-year

July 2013

  • (7/3) Phil/Stat/Law: 50 Shades of gray between error and fraud
  • (7/6) Bad news bears: ‘Bayesian bear’ rejoinder–reblog mashup
  • (7/10) PhilStatLaw: Reference Manual on Scientific Evidence (3d ed) on Statistical Significance (Schachtman)
  • (7/11) Is Particle Physics Bad Science? (memory lane)
  • (7/13) Professor of Philosophy Resigns over Sexual Misconduct (rejected post)
  • (7/14) Stephen Senn: Indefinite irrelevance
  • (7/17) Phil/Stat/Law: What Bayesian prior should a jury have? (Schachtman)
  • (7/19) Msc Kvetch: A question on the Martin-Zimmerman case we do not hear
  • (7/20) Guest Post: Larry Laudan. Why Presuming Innocence is Not a Bayesian Prior

    Larry Laudan

  • (7/23) Background Knowledge: Not to Quantify, But To Avoid Being Misled By, Subjective Beliefs
  • (7/26) New Version: On the Birnbaum argument for the SLP: Slides for JSM talk

August 2013

  • (8/1) Blogging (flogging?) the SLP: Response to Reply- Xi’an Robert
  • (8/5) At the JSM: 2013 International Year of Statistics
  • (8/6) What did Nate Silver just say? Blogging the JSM
  • (8/9) 11th bullet, multiple choice question, and last thoughts on the JSM
  • (8/11) E.S. Pearson: “Ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot”

    Egon Pearson on a Gate (by D. Mayo)

  • (8/13) Blogging E.S. Pearson’s Statistical Philosophy
  • (8/15) A. Spanos: Egon Pearson’s Neglected Contributions to Statistics
  • (8/17) Gandenberger: How to Do Philosophy That Matters (guest post)
  • (8/21) Blog contents: July, 2013
  • (8/22) PhilStock: Flash Freeze
  • (8/22) A critical look at “critical thinking”: deduction and induction
  • (8/28) Is being lonely unnatural for slim particles? A statistical argument
  • (8/31) Overheard at the comedy hour at the Bayesian retreat-2 years on


    significance tests saw off their own limbs-comedy

September 2013

  • (9/2) Is Bayesian Inference a Religion?
  • (9/3) Gelman’s response to my comment on Jaynes
  • (9/5) Stephen Senn: Open Season (guest post)
  • (9/7) First blog: “Did you hear the one about the frequentist…”? and “Frequentists in Exile”
  • (9/10) Peircean Induction and the Error-Correcting Thesis (Part I)

    C. S. Peirce, 10 Sept. 1839

  • (9/10) (Part 2) Peircean Induction and the Error-Correcting Thesis
  • (9/12) (Part 3) Peircean Induction and the Error-Correcting Thesis
  • (9/14) “When Bayesian Inference Shatters” Owhadi, Scovel, and Sullivan (guest post)


    when Bayesian inference shatters.

  • (9/18) PhilStock: Bad news is good news on Wall St.
  • (9/18) How to hire a fraudster chauffeur
  • (9/22) Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”
  • (9/23) Barnard’s Birthday: background, likelihood principle, intentions
  • (9/24) Gelman est effectivement une erreur statistician
  • (9/26) Blog Contents: August 2013
  • (9/29) Highly probable vs highly probed: Bayesian/ error statistical differences

October 2013

  • (10/3) Will the Real Junk Science Please Stand Up? (critical thinking)

    J. Hosiasson

  • (10/5) Was Janina Hosiasson pulling Harold Jeffreys’ leg?
  • (10/9) Bad statistics: crime or free speech (II)? Harkonen update: Phil Stat / Law /Stock
  • (10/12) Sir David Cox: a comment on the post, “Was Hosiasson pulling Jeffreys’ leg?”

    Sir David Cox

  • (10/19) Blog Contents: September 2013
  • (10/19) Bayesian Confirmation Philosophy and the Tacking Paradox (iv)*
  • (10/25) Bayesian confirmation theory: example from last post…
  • (10/26) Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs what ?)

November 2013

  • (11/2) Oxford Gaol: Statistical Bogeymen
  • (11/4) Forthcoming paper on the strong likelihood principle



  • (11/9) Null Effects and Replication
  • (11/9) Beware of questionable front page articles warning you to beware of questionable front page articles (iii)
  • (11/13) T. Kepler: “Trouble with ‘Trouble at the Lab’?” (guest post)



  • (11/16) PhilStock: No-pain bull
  • (11/16) S. Stanley Young: More Trouble with ‘Trouble in the Lab’ (Guest post)

    Stan Young

  • (11/18) Lucien Le Cam: “The Bayesians hold the Magic”
  • (11/20) Erich Lehmann: Statistician and Poet
  • (11/23) Probability that it is a statistical fluke [i]
  • (11/27) “The probability that it be a statistical fluke” [iia]
  • (11/30) Saturday night comedy at the “Bayesian Boy” diary (rejected post*)

December 2013

  • (12/3) Stephen Senn: Dawid’s Selection Paradox (guest post)

    S. Senn

  • (12/7) FDA’s New Pharmacovigilance
  • (12/9) Why ecologists might want to read more philosophy of science (UPDATED)
  • (12/11) Blog Contents for Oct and Nov 2013
  • (12/14) The error statistician has a complex, messy, subtle, ingenious piece-meal approach
  • (12/15) Surprising Facts about Surprising Facts
  • (12/19) A. Spanos lecture on “Frequentist Hypothesis Testing”



  • (12/24) U-Phil: Deconstructions [of J. Berger]: Irony & Bad Faith 3
  • (12/25) “Bad Arguments” (a book by Ali Almossawi)
  • (12/26) Mascots of Bayesneon statistics (rejected post)
  • (12/27) Deconstructing Larry Wasserman
  • (12/28) More on deconstructing Larry Wasserman (Aris Spanos)
  • (12/28) Wasserman on Wasserman: Update! December 28, 2013
  • (12/31) Midnight With Birnbaum (Happy New Year)

January 2014

  • (1/2) Winner of the December 2013 Palindrome Book Contest (Rejected Post)
  • (1/3) Error Statistics Philosophy: 2013
  • (1/4) Your 2014 wishing well. …
  • (1/7) “Philosophy of Statistical Inference and Modeling” New Course: Spring 2014: Mayo and Spanos: (Virginia Tech)
  • (1/11) Two Severities? (PhilSci and PhilStat)
  • (1/14) Statistical Science meets Philosophy of Science: blog beginnings
  • (1/16) Objective/subjective, dirty hands and all that: Gelman/Wasserman blogolog (ii)
  • (1/18) Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]
  • (1/22) Phil6334: “Philosophy of Statistical Inference and Modeling” New Course: Spring 2014: Mayo and Spanos (Virginia Tech) UPDATE: JAN 21
  • (1/24) Phil 6334: Slides from Day #1: Four Waves in Philosophy of Statistics
  • (1/25) U-Phil (Phil 6334) How should “prior information” enter in statistical inference?
  • (1/27) Winner of the January 2014 palindrome contest (rejected post)
  • (1/29) BOSTON COLLOQUIUM FOR PHILOSOPHY OF SCIENCE: Revisiting the Foundations of Statistics

    Boston Colloquium 2013-2014


  • (1/31) Phil 6334: Day #2 Slides

February 2014

  • (2/1) Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs B-boosts)
  • (2/3) PhilStock: Bad news is bad news on Wall St. (rejected post)
  • (2/5) “Probabilism as an Obstacle to Statistical Fraud-Busting” (draft iii)
  • (2/9) Phil6334: Day #3: Feb 6, 2014
  • (2/10) Is it true that all epistemic principles can only be defended circularly? A Popperian puzzle



  • (2/12) Phil6334: Popper self-test

    Statistical Snow Sculpture

  • (2/13) Phil 6334 Statistical Snow Sculpture
  • (2/14) January Blog Table of Contents
  • (2/15) Fisher and Neyman after anger management?
  • (2/17) R. A. Fisher: how an outsider revolutionized statistics
  • (2/18) Aris Spanos: The Enduring Legacy of R. A. Fisher
  • (2/20) R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’
  • (2/21) STEPHEN SENN: Fisher’s alternative to the alternative
  • (2/22) Sir Harold Jeffreys’ (tail-area) one-liner: Sat night comedy [draft ii]

    Comedy Hour

  • (2/24) Phil6334: February 20, 2014 (Spanos): Day #5
  • (2/26) Winner of the February 2014 palindrome contest (rejected post)
  • (2/26) Phil6334: Feb 24, 2014: Induction, Popper and pseudoscience (Day #4)

March 2014

C. Shalizi

  • (3/1) Cosma Shalizi gets tenure (at last!) (metastat announcement)
  • (3/2) Significance tests and frequentist principles of evidence: Phil6334 Day #6
  • (3/3) Capitalizing on Chance (ii)
  • (3/4) Power, power everywhere–(it) may not be what you think! [illustration]
  • (3/8) Msc kvetch: You are fully dressed (even under your clothes)?
  • (3/8) Fallacy of Rejection and the Fallacy of Nouvelle Cuisine
  • (3/11) Phil6334 Day #7: Selection effects, the Higgs and 5 sigma, Power
  • (3/12) Get empowered to detect power howlers
  • (3/15) New SEV calculator (guest app: Durvasula)
  • (3/17) Stephen Senn: “Delta Force: To what extent is clinical relevance relevant?” (Guest Post)

    S. Senn

  • (3/19) Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
  • (3/22) Fallacies of statistics & statistics journalism, and how to avoid them: Summary & Slides Day #8 (Phil 6334)
  • (3/25) The Unexpected Way Philosophy Majors Are Changing The World Of Business

    J. Byrd

  • (3/26) Phil6334:Misspecification Testing: Ordering From A Full Diagnostic Menu (part 1)
  • (3/28) Severe osteometric probing of skeletal remains: John Byrd
  • (3/29) Winner of the March 2014 palindrome contest (rejected post)
  • (3/30) Phil6334: March 26, philosophy of misspecification testing (Day #9 slides)

April 2014

  • (4/1) Skeptical and enthusiastic Bayesian priors for beliefs about insane asylum renovations at Dept of Homeland Security: I’m skeptical and unenthusiastic
  • (4/3) Self-referential blogpost (conditionally accepted*)

    I. J. Good

  • (4/5) Who is allowed to cheat? I.J. Good and that after dinner comedy hour. . ..

    Richard Gill

  • (4/6) Phil6334: Duhem’s Problem, highly probable vs highly probed; Day #9 Slides
  • (4/8) “Out Damned Pseudoscience: Non-significant results are the new ‘Significant’ results!” (update)
  • (4/12) “Murder or Coincidence?” Statistical Error in Court: Richard Gill (TEDx video)

    Neyman April 16, 1894 – August 5, 1981

  • (4/14) Phil6334: Notes on Bayesian Inference: Day #11 Slides
  • (4/16) A. Spanos: Jerzy Neyman and his Enduring Legacy
  • (4/17) Duality: Confidence intervals and the severity of tests
  • (4/19) Getting Credit (or blame) for Something You Didn’t Do (BP oil spill)
  • (4/21) Phil 6334: Foundations of statistics and its consequences: Day#12
  • (4/23) Phil 6334 Visitor: S. Stanley Young, “Statistics and Scientific Integrity”

    Stan Young

  • (4/26) Reliability and Reproducibility: Fraudulent p-values through multiple testing (and other biases): S. Stanley Young (Phil 6334: Day #13)
  • (4/30) Able Stats Elba: 3 Palindrome nominees for April! (rejected post)

May 2014

  • (5/1) Putting the brakes on the breakthrough: An informal look at the argument for the Likelihood Principle

  • (5/3) You can only become coherent by ‘converting’ non-Bayesianly
  • (5/6) Winner of April Palindrome contest: Lori Wike
  • (5/7) A. Spanos: Talking back to the critics using error statistics (Phil6334)
    fraud buster

    A. Spanos

  • (5/10) Who ya gonna call for statistical Fraudbusting? R.A. Fisher, P-values, and error statistics (again)
  • (5/15) Scientism and Statisticism: a conference* (i)
  • (5/17) Deconstructing Andrew Gelman: “A Bayesian wants everybody else to be a non-Bayesian.”
  • (5/20) The Science Wars & the Statistics Wars: More from the Scientism workshop



  • (5/25) Blog Table of Contents: March and April 2014
  • (5/27) Allan Birnbaum, Philosophical Error Statistician: 27 May 1923 – 1 July 1976
  • (5/31) What have we learned from the Anil Potti training and test data frameworks? Part 1 (draft 2)

    Potti training

June 2014

  • (6/5) Stephen Senn: Blood Simple? The complicated and controversial world of bioequivalence (guest post)
  • (6/9) “The medical press must become irrelevant to publication of clinical trials.”
  • (6/11) A. Spanos: “Recurring controversies about P values and confidence intervals revisited”
  • (6/14) “Statistical Science and Philosophy of Science: where should they meet?”

    Sir David Hendry

  • (6/21) Big Bayes Stories? (draft ii)
  • (6/25) Blog Contents: May 2014
  • (6/28) Sir David Hendry Gets Lifetime Achievement Award
  • (6/30) Some ironies in the ‘replication crisis’ in social psychology (4th and final installment)

July 2014

  • (7/7) Winner of June Palindrome Contest: Lori Wike
  • (7/8) Higgs Discovery 2 years on (1: “Is particle physics bad science?”)
  • (7/10) Higgs Discovery 2 years on (2: Higgs analysis and statistical flukes)
  • (7/14) “P-values overstate the evidence against the null”: legit or fallacious? (revised)

    R. Berger

  • (7/23) Continued:”P-values overstate the evidence against the null”: legit or fallacious?
  • (7/26) S. Senn: “Responder despondency: myths of personalized medicine” (Guest Post)
  • (7/31) Roger Berger on Stephen Senn’s “Blood Simple” with a response by Senn (Guest Posts)

August 2014

  • (08/03) Blogging Boston JSM2014?
  • (08/05) Neyman, Power, and Severity
  • (08/06) What did Nate Silver just say? Blogging the JSM 2013

    N. Silver

  • (08/09) Winner of July Palindrome: Manan Shah

    E.S. Pearson on the gate, D. Mayo sketch

  • (08/09) Blog Contents: June and July 2014
  • (08/11) Egon Pearson’s Heresy
  • (08/17) Are P Values Error Probabilities? Or, “It’s the methods, stupid!” (2nd install)
  • (08/23) Has Philosophical Superficiality Harmed Science?
  • (08/29) BREAKING THE LAW! (of likelihood): to keep their fit measures in line (A), (B 2nd)

September 2014

  • (9/30) Letter from George (Barnard)


    G. Barnard

  • (9/27) Should a “Fictionfactory” peepshow be barred from a festival on “Truth and Reality”? Diederik Stapel says no (rejected post)
  • (9/23) G.A. Barnard: The Bayesian “catch-all” factor: probability vs likelihood
  • (9/21) Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”
  • (9/18) Uncle Sam wants YOU to help with scientific reproducibility!
  • (9/15) A crucial missing piece in the Pistorius trial? (2): my answer (Rejected Post)
  • (9/12) “The Supernal Powers Withhold Their Hands And Let Me Alone”: C.S. Peirce
  • (9/6) Statistical Science: The Likelihood Principle issue is out…!

    Table of Contents


  • (9/4) All She Wrote (so far): Error Statistics Philosophy Contents-3 years on
  • (9/3) 3 in blog years: Sept 3 is 3rd anniversary of

October 2014

  • 10/01 Oy Faye! What are the odds of not conflating simple conditional probability and likelihood with Bayesian success stories?

    Faye Flam

  • 10/05 Diederik Stapel hired to teach “social philosophy” because students got tired of success stories… or something (rejected post)
  • 10/07 A (Jan 14, 2014) interview with Sir David Cox by “Statistics Views”
  • 10/10 BREAKING THE (Royall) LAW! (of likelihood) (C)
  • 10/14 Gelman recognizes his error-statistical (Bayesian) foundations
  • 10/18 PhilStat/Law: Nathan Schachtman: Acknowledging Multiple Comparisons in Statistical Analysis: Courts Can and Must



  • 10/22 September 2014: Blog Contents
  • 10/26 To Quarantine or not to Quarantine?: Science & Policy in the time of Ebola
  • 10/31 Oxford Gaol: Statistical Bogeymen

November 2014

  • 11/01 Philosophy of Science Assoc. (PSA) symposium on Philosophy of Statistics in the Higgs Experiments “How Many Sigmas to Discovery?”
  • 11/09 “Statistical Flukes, the Higgs Discovery, and 5 Sigma” at the PSA
  • 11/11 The Amazing Randi’s Million Dollar Challenge
  • 11/12 A biased report of the probability of a statistical fluke: Is it cheating?
  • 11/15 Why the Law of Likelihood is bankrupt–as an account of evidence

    Juliet Shafer, Erich Lehmann, D. Mayo

  • 11/18 Lucien Le Cam: “The Bayesians Hold the Magic”
  • 11/20 Erich Lehmann: Statistician and Poet
  • 11/22 Msc Kvetch: “You are a Medical Statistic”, or “How Medical Care Is Being Corrupted”
  • 11/25 How likelihoodists exaggerate evidence from statistical tests

December 2014

  • 12/02 My Rutgers Seminar: tomorrow, December 3, on philosophy of statistics
  • 12/04 “Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance” (Dec 3 Seminar slides)
  • 12/06 How power morcellators inadvertently spread uterine cancer
  • 12/11 Msc. Kvetch: What does it mean for a battle to be “lost by the media”?
  • 12/13 S. Stanley Young: Are there mortality co-benefits to the Clean Power Plan? It depends. (Guest Post)
  • 12/17 Announcing Kent Staley’s new book, An Introduction to the Philosophy of Science (CUP)

    Staley Intro to Phil of Science

  • 12/21 Derailment: Faking Science: A true story of academic fraud, by Diederik Stapel (translated into English)
  • 12/23 All I want for Chrismukkah is that critics & “reformers” quit howlers of testing (after 3 yrs of blogging)! So here’s Aris Spanos “Talking Back!”

  • 12/26 3 YEARS AGO: MONTHLY (Dec.) MEMORY LANE: Midnight With Birnbaum
  • 12/29 To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)
  • 12/31 Midnight With Birnbaum (Happy New Year)

January 2015

  • 01/02 Blog Contents: Oct.- Dec. 2014
  • 01/03 No headache power (for Deirdre)


    no-headache power

  • 01/04 Significance Levels are Made a Whipping Boy on Climate Change Evidence: Is .05 Too Strict? (Schachtman on Oreskes)
  • 01/07 “When Bayesian Inference Shatters” Owhadi, Scovel, and Sullivan (reblog)
  • 01/08 On the Brittleness of Bayesian Inference–An Update: Owhadi and Scovel (guest post)



  • 01/12 “Only those samples which fit the model best in cross validation were included” (whistleblower) “I suspect that we likely disagree with what constitutes validation” (Potti and Nevins)
  • 01/16 Winners of the December 2014 Palindrome Contest: TWO!
  • 01/18 Power Analysis and Non-Replicability: If bad statistics is prevalent in your field, does it follow you can’t be guilty of scientific fraud?
  • 01/21 Some statistical dirty laundry



  • 01/24 What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Sat night potpourri
  • 01/26 Trial on Anil Potti’s (clinical) Trial Scandal Postponed Because Lawyers Get the Sniffles (updated)
  • 01/31 Saturday Night Brainstorming and Task Forces: (4th draft)

    NHST task force 3

February 2015

  • 02/05 Stephen Senn: Is Pooling Fooling? (Guest Post)
  • 02/10 What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?
  • 02/13 Induction, Popper and Pseudoscience
  • 02/16 Continuing the discussion on truncation, Bayesian convergence and testing of priors
  • 02/16 R. A. Fisher: ‘Two New Properties of Mathematical Likelihood’: Just before breaking up (with N-P)
  • 02/17 R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)

    Jeffreys’ one-liner

  • 02/19 Stephen Senn: Fisher’s Alternative to the Alternative
  • 02/21 Sir Harold Jeffreys’ (tail area) one-liner: Saturday night comedy (b)



  • 02/27 Big Data is the New Phrenology?

[i] Table of Contents compiled by N. Jinn & J. Miller*

*I thank Jean Miller for her assiduous work on the blog, and all contributors and readers for helping “frequentists in exile” to feel (and truly become) less exiled–wherever they may be!

Categories: blog contents, Metablog, Statistics | 1 Comment

A puzzle about the latest test ban (or ‘don’t ask, don’t tell’)



A large number of people have sent me articles on the “test ban” of statistical hypothesis tests and confidence intervals at a journal called Basic and Applied Social Psychology (BASP)[i]. Enough. One person suggested that since it came so close to my recent satirical Task force post, I either had advance knowledge or some kind of ESP. Oh please, no ESP required. None of this is the slightest bit surprising, and I’ve seen it before; I simply didn’t find it worth blogging about. Statistical tests are being banned, say the editors, because they purport to give probabilities of null hypotheses (really?) and do not, hence they are “invalid”.[ii] (Confidence intervals are thrown in the waste bin as well, also claimed “invalid”.) “The state of the art remains uncertain” regarding inferential statistical procedures, say the editors. I don’t know, maybe some good will come of all this.

Yet there’s a part of their proposal that brings up some interesting logical puzzles, and logical puzzles are my thing. In fact, I think there is a mistake the editors should remedy, lest authors be led into disingenuous stances, and strange tangles ensue. I refer to their rule that authors be allowed to submit papers whose conclusions are based on allegedly invalid methods so long as, once accepted, they remove any vestiges of them!

Question 1. Will manuscripts with p-values be desk rejected automatically?

Answer to Question 1. “No. If manuscripts pass the preliminary inspection, they will be sent out for review. But prior to publication, authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about “significant” differences or lack thereof, and so on).”

Now if these measures are alleged to be irrelevant and invalid instruments for statistical inference, then why should they be included in the peer review process at all? Will reviewers be told to ignore them? That would only seem fair: papers should not be judged by criteria alleged to be invalid, but how will reviewers blind themselves to them? It would seem the measures should be excluded from the get-go. If they are included in the review, why shouldn’t the readers see what the reviewers saw when they recommended acceptance?

But here’s where the puzzle really sets in. If the authors must free their final papers from such encumbrances as sampling distributions and error probabilities, what will be the basis presented for their conclusions in the published paper? Presumably, from the notice, they are allowed only mere descriptive statistics or non-objective Bayesian reports (added: actually it’s unclear which kind of Bayesianism they allow, given the Fisher reference, which doesn’t fit*). Won’t this be tantamount to requiring that authors support their research in a way that is either (actually) invalid, or has little to do with the error-statistical properties that were actually reported and on which the acceptance was based?[ii]

Maybe this won’t happen because prospective authors already know there’s a bias in this particular journal against reporting significance levels, confidence levels, power etc., but the announcement says they are permitted.

Translate P-values into euphemisms

Or might authors be able to describe p-values only using a variety of euphemisms, for instance: “We have consistently observed differences such that, were there no genuine effect, then there is a very high probability we would have observed differences smaller than those we found; yet we kept finding results that could almost never have been produced if we hadn’t got hold of a genuine discrepancy from the null model.” Or some such thing, just as long as the dreaded “P-word” is not mentioned? In one way, that would be good; the genuine basis for why and when small p-values warrant indications of discrepancies should be made explicit. I’m all for that. But, in all likelihood such euphemisms would be laughed at; everyone would know the code for “small p-value” when banned from saying the “p-word”, so what would have been served?

Or, much more likely, rewording p-values wouldn’t be allowed, so authors might opt to:

Find a way to translate error-statistical results to Bayesian posteriors?

They might say something like: “These results make me assign a very small probability to the ‘no effect’ hypothesis”, even though their study actually used p-values and not priors? But a problem immediately arises. If the paper is accepted based on p-values, then if they want to use priors to satisfy the editors in the final publication, they might have to resort to the uninformative priors that the editors have also banned [added: again, on further analysis, it’s unclear which type of Bayesian priors they are permitting as “interesting” enough to be considered on a case by case basis, as the Fisher genetics example supports frequentist priors]. So it would follow that unless authors did a non-objective Bayesian analysis first, the only reasonable thing would be for the authors to give, in their published paper, merely a descriptive report.[iii]
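To see why the translation in question is not innocuous, here is a minimal numerical sketch (my illustration, not anything from the BASP editors): with a hypothetical 50-50 spike-and-slab prior and a normal alternative, a result yielding a two-sided p-value of about .05 can leave a posterior probability of roughly .35 on the null.

```python
import math

def normal_pdf(x, sd=1.0):
    """Density of a mean-zero normal with standard deviation sd, at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Observed z = 1.96: two-sided p-value of about .05 for H0: theta = 0.
z = 1.96
# Hypothetical prior (my choice): Pr(H0) = 1/2; under H1, theta ~ N(0, tau^2).
tau = 1.0
# Bayes factor for H0: marginal density of z under H0 over that under H1.
bf01 = normal_pdf(z, 1.0) / normal_pdf(z, math.sqrt(1.0 + tau ** 2))
posterior_h0 = bf01 / (1.0 + bf01)
print(round(posterior_h0, 2))  # ~0.35, nowhere near the p-value of .05
```

So an author forced to restate a p = .05 result as a posterior would either report a null probability far above .05, undercutting the published claim, or have to hunt for a prior that manufactures a small one.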

Give descriptive reports and make no inferences

If the only way to translate an error statistical report into a posterior entails illicit uninformative priors, then authors can opt for a purely descriptive report. What kind of descriptive report would convey the basis of the inference if it was actually based on statistical inference methods? Unclear, but there’s something else. Won’t descriptive reports in published papers be a clear tip-off for readers that p-values, size, power or confidence intervals were actually used in the original paper? The only way they wouldn’t be is if the papers were merely descriptive from the start. Will readers be able to find out? Will they be able to obtain those error statistics used, or will the authors not be allowed to furnish them? If they are allowed to furnish them, then all the test ban would have achieved is the need for a secret, middle-level source that publishes the outlawed error probabilities. How does this fit with the recent moves toward transparency, shared data, even telling whether variables were selected post hoc, etc.? See “upshot” below. This is sounding more like “don’t ask, don’t tell!”

To sum up this much.

If papers based on error statistics are accepted, then the final published papers must find a different way to justify their results. We have considered three ways:

  1. using euphemisms for error probabilities,
  2. merely giving a descriptive report without any hint of inference, or
  3. translating what was done so as to give a (non-default? informative? non-subjective?) posterior probability.

But there’s a serious problem with each.

Consider #3 again. If they’re led to invent priors that permit translating the low p-value into a low prior for the null, say, then won’t that just create the invalidity that was not there at all when p-values were allowed to be reported as p-values? If they’re also led to obey the ban on non-informative priors, mightn’t they be compelled to employ (or assume) information in the form of a prior, even though that did not enter their initial argument? You can see how confusing this can get. Will the readers at least be told by the authors that they had to change the justification from the one used in the appraisal of the manuscript? “Don’t ask, don’t tell” doesn’t help if people are trying to replicate the result thinking the posterior probability was the justification when in fact it was based on a p-value. Each generally has different implications for replication. Of course, if it’s just descriptive statistics, it’s not clear what “replication” would even amount to.

What happens to randomization and experimental design?

If we’re ousting error probabilities, be they p-values, type 1 and 2 errors, power, or confidence levels, then shouldn’t authors be free to oust the methods of experimental design and data collection whose justification is in substantiating the “physical basis for the validity of the test” of significance? (Fisher, DOE 17). Why should they go through the trouble of experimental designs whose justification is precisely to support an inference procedure the editors deem illegitimate?


It would have made more sense if the authors were required to make the case without the (alleged) invalid measures from the start.  Maybe they should correct this. I’m serious, at least if one is to buy into the test ban. Authors could be encouraged to attend to points almost universally ignored (in social psychology) when the attention is on things like p-values, to wit: what’s the connection between what you’re measuring and your inference or data interpretation? (Remember unscrambling soap words and moral judgments?) [iv] On the other hand, the steps toward progress are at risk of being nullified.

See out damned pseudoscience, and Some ironies in the replication crisis in social psychology

The major problems with the uses of NHST in social psych involve the presumption that one is allowed to go from a statistical to a substantive (often causal!) inference—never mind that everyone has known this fallacy for 100 years—invalid statistical assumptions (including questionable proxy variables), and questionable research practices (QRPs): cherry-picking, post-data subgroups, barn-hunting, p-hacking, and so on. That these problems invalidate the method’s error probabilities was the basis for deeming them bad practices!

Everyone can see at a glance (without any statistics) that reporting a lone .05 p-value for green jelly beans and acne (in that cartoon), while failing to report the 19 other colors that showed no association, means that the reported .05 p-value is invalidated! We grasp immediately that finding 1 of 20 with a nominal p-value of .05 is common, not rare, by chance alone. Therefore, the actual p-value is not low, as purported! That’s what an invalid p-value really means. The main reason for the existence of the p-value is that it renders certain practices demonstrably inadmissible (like this one): they provably alter the actual p-value. Without such invalidating moves, the reported p-value is very close to the actual one: Pr(p-value < .05; null) ≈ .05. But these illicit moves reveal themselves in invalid p-values![v] What grounds will there be for transparency about such cherry-picking now, in that journal?
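The arithmetic behind the jelly bean cartoon is easy to check. A minimal sketch (Python, standard library only; not from the original post) of the chance of at least one nominally significant result among 20 independent tests of true nulls:

```python
# Chance of at least one nominal p < .05 among 20 independent
# tests of true nulls (the 20 jelly bean colors).
m = 20
nominal = 0.05
at_least_one = 1 - (1 - nominal) ** m
print(f"{at_least_one:.4f}")  # 0.6415
```

So the "actual" p-value attached to reporting the single best of 20 colors is roughly .64, nowhere near the nominal .05.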

Remember that bold move by Simmons, Nelson and Simonsohn? (See “statistical dirty laundry” post here). They had called on researchers to “just say it”: “If you determined sample size in advance, say it. If you did not drop any variables, say it. If you did not drop any conditions, say it.”

The new call, for this journal at least, will be: “If you used p-values, confidence intervals, size, power, sampling distributions, just don’t say it”.[vi]


*See my comment on this blog concerning their Fisher 1973 reference.

[i] An NSF Director asked for my response, but I didn’t have one for dissemination. They sent me the ASA response.

[ii] Allowing statistical significance to pass directly to substantive significance, as we often see in NHST, is invalid; but there’s nothing invalid in the correct report of a p-value, as used, for instance, in the recent discovery of the Higgs particle (search the blog for posts), in showing that hormone replacement therapy increases risks of breast cancer (unlike what observational studies were telling us for years), or in showing that Anil Potti’s prediction model, on which personalized cancer treatments were based, was invalid. Everyone who reads this blog knows I oppose cookbook statistics, knows I’d insist on indicating the discrepancies passed with good or bad severity, and on taking into account a slew of selection effects and violations of statistical model assumptions—especially the links between observed proxy variables in social psych and the claims inferred. Alternatives to the null are made explicit, but what’s warranted may not be the alternative posed for purposes of getting a good distance measure, etc., etc. (You can search this blog for all these issues and more.)

[iii] Can anyone think of an example wherein a warranted low Bayesian probability of the null hypothesis—what the editors seek—would not have corresponded to finding strong evidence of discrepancy from the null hypothesis by means of a low p-value, ideally with corresponding discrepancy size? I can think of cases where a high posterior in a “real effect” claim is shot down by a non-low p-value (once selection effects, and stopping rules are taken account of) but that’s not at issue, apparently.

[iv] I think one of the editors may have had a representative at the Task Force meeting I recently posted about.

An aside: These groups seem to love evocative terms and acronyms. We’ve got the Test Ban (reminds me of when I was a kid in NYC public schools and we had to get under our desks) of NHSTP at BASP.

[v] Anyone who reads this blog knows that I favor reporting the discrepancies well-warranted and poorly warranted and not merely a p-value. There are some special circumstances where the p-value alone is of value. (See Mayo and Cox 2010).

[vi] Think of how all this would have helped Diederik Stapel.

Categories: P-values, reforming the reformers, Statistics | 70 Comments

“Probabilism as an Obstacle to Statistical Fraud-Busting”

Boston Colloquium 2013-2014


“Is the Philosophy of Probabilism an Obstacle to Statistical Fraud Busting?” was my presentation at the 2014 Boston Colloquium for the Philosophy of Science: “Revisiting the Foundations of Statistics in the Era of Big Data: Scaling Up to Meet the Challenge.”

As often happens, I never put these slides into a stand-alone paper. But I have incorporated them into my book (in progress*), “How to Tell What’s True About Statistical Inference”. Background and slides were posted last year.

Slides (draft from Feb 21, 2014) 

Download the 54th Annual Program

Cosponsored by the Department of Mathematics & Statistics at Boston University.

Friday, February 21, 2014
10 a.m. – 5:30 p.m.
Photonics Center, 9th Floor Colloquium Room (Rm 906)
8 St. Mary’s Street

*Seeing a light at the end of the tunnel, finally.
Categories: P-values, significance tests, Statistical fraudbusting, Statistics | 7 Comments

Big Data Is The New Phrenology?




It happens I’ve been reading a lot lately about the assumption, in social psychology and psychology in general, that what they’re studying is measurable, quantifiable. Addressing the problem has been pushed to the back burner for decades, thanks to some redefinitions of what it is to “measure” in psych (anything for which there’s a rule to pop out a number, says Stevens–an operationalist in the naive positivist spirit). This, at any rate, is what I’m reading, thanks to papers sent by a colleague of Meehl’s (N. Waller). (Here’s one by Mitchell.) I think it’s time to reopen the question. The measures I see of “severity of moral judgment”, “degree of self-esteem” and much else in psychology appear to fit this description in a very non-self-critical manner. No statistical window-dressing (nor banning of statistical inference) can help them become more scientific. So when I saw this on Math Babe’s twitter I decided to try the “reblog” function and see what happened. Here it is (with her F word included). The article to which she alludes is “Recruiting Better Talent Through Brain Games”.

Originally posted on mathbabe:

Have you ever heard of phrenology? It was, once upon a time, the “science” of measuring someone’s skull to understand their intellectual capabilities.

This sounds totally idiotic but was a huge fucking deal in the mid-1800’s, and really didn’t stop getting some credit until much later. I know that because I happen to own the 1911 edition of the Encyclopedia Britannica, which was written by the top scholars of the time but is now horribly and fascinatingly outdated.

For example, the entry for “Negro” is famously racist. Wikipedia has an excerpt: “Mentally the negro is inferior to the white… the arrest or even deterioration of mental development [after adolescence] is no doubt very largely due to the fact that after puberty sexual matters take the first place in the negro’s life and thoughts.”

But really that one line doesn’t tell the whole story. Here’s the whole thing…


Categories: msc kvetch, scientism, Statistics | 3 Comments


3 years ago…

MONTHLY MEMORY LANE: 3 years ago: February 2012. I mark in red three posts (or units) that seem most apt for general background on key issues in this blog. Given our Fisher reblogs, we’ve already seen many this month. So, I’m marking in red (1) The Triad, and (2) the Unit on Spanos’ misspecification tests. Please see those posts for their discussion. The two posts from 2/8 are apt if you are interested in a famous case involving statistics at the Supreme Court. Beyond that it’s just my funny theatre-of-the-absurd piece with Barnard. (Gelman’s is just a link to his blog.)


February 2012


  • (2/11) R.A. Fisher: Statistical Methods and Scientific Inference
  • (2/11)  JERZY NEYMAN: Note on an Article by Sir Ronald Fisher
  • (2/12) E.S. Pearson: Statistical Concepts in Their Relation to Reality





This new, once-a-month, feature began at the blog’s 3-year anniversary in Sept, 2014.


Jan. 2012

Dec. 2011

Nov. 2011

Oct. 2011

Sept. 2011 (Within “All She Wrote (so far)”)

Categories: 3-year memory lane, Statistics | 1 Comment

Sir Harold Jeffreys’ (tail area) one-liner: Saturday night comedy (b)

Comedy hour icon


This headliner appeared before, but to a sparse audience, so Management’s giving him another chance… His joke relates both to Senn’s post (about alternatives) and to my recent post about using (1 – β)/α as a likelihood ratio, but for very different reasons. (I’ve explained at the bottom of this “(b) draft”.)

 ….If you look closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike, (especially as he’s no longer doing the Tonight Show) ….



It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler joke* in criticizing the use of p-values.

“Did you hear the one about significance testers rejecting H0 because of outcomes H0 didn’t predict?

‘What’s unusual about that?’ you ask?

What’s unusual is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so to see why I’ll strive to give it a generous interpretation. Continue reading

Categories: Comedy, Discussion continued, Fisher, Jeffreys, P-values, Statistics, Stephen Senn | 5 Comments

Stephen Senn: Fisher’s Alternative to the Alternative


As part of the week of recognizing R.A. Fisher (February 17, 1890 – July 29, 1962), I reblog Senn from 3 years ago.

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976, p. 473)

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (p. 441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests.


The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.

Fisher replies with a ‘shorter catechism’ on transformations which ends as follows: Continue reading

Categories: Fisher, Statistics, Stephen Senn | Tags: , , , | 59 Comments

R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)



In recognition of R.A. Fisher’s birthday….

‘R. A. Fisher: How an Outsider Revolutionized Statistics’

by Aris Spanos

Few statisticians will dispute that R. A. Fisher (February 17, 1890 – July 29, 1962) is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of optimal estimation based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of optimal testing in the early 1930s. According to Hald (1998)

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the ultimate outsider when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however did not stop him from publishing his first paper in statistics in 1912 (still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper. Continue reading

Categories: Fisher, phil/history of stat, Spanos, Statistics | 6 Comments

R.A. Fisher: ‘Two New Properties of Mathematical Likelihood': Just before breaking up (with N-P)

17 February 1890–29 July 1962

In recognition of R.A. Fisher’s birthday tomorrow, I will post several entries on him. I find this (1934) paper intriguing: it comes immediately before the conflicts with Neyman and Pearson erupted. It represents essentially the last time he could take their work at face value, without the professional animosities that almost entirely caused, rather than being caused by, the apparent philosophical disagreements and name-calling everyone focuses on. Fisher links his tests and sufficiency to the Neyman and Pearson lemma in terms of power. It’s as if we may see them as ending up in a very similar place (no pun intended) while starting from different origins. I quote just the most relevant portions; the full article is linked below. I’d blogged it earlier here. You may find some gems in it.

‘Two new Properties of Mathematical Likelihood’

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

  The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principal result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true.

If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other. Continue reading

Categories: Fisher, phil/history of stat, Statistics | Tags: , , , | 3 Comments

Continuing the discussion on truncation, Bayesian convergence and testing of priors



My post “What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?” gave rise to a set of comments that were mostly off topic but interesting in their own right. Since the thread became too long to follow, I’ve put what appears to be the last group of comments here, starting with Matloff’s query. Please feel free to continue the discussion here; we may want to come back to the topic. Feb 17: Please note one additional voice at the end. (Check back to that post if you want to see the history.)


I see the conversation is continuing. I have not had time to follow it, but I do have a related question, on which I’d be curious as to the response of the Bayesians in our midst here.

Say the analyst is sure that μ > c, and chooses a prior distribution with support on (c,∞). That guarantees that the resulting estimate is > c. But suppose the analyst is wrong, and μ is actually less than c. (I believe that some here conceded this could happen in some cases in which the analyst is “sure” μ > c.) Doesn’t this violate one of the most cherished (by Bayesians) features of the Bayesian method — that the effect of the prior washes out as the sample size n goes to infinity?


(to Matloff),

The short answer is that assuming information such as “mu is greater than c” which isn’t true screws up the analysis. It’s like a mathematician starting a proof by saying “assume 3 is an even number”. If it were possible to consistently get good results from false assumptions, there would be no need to ever get our assumptions right. Continue reading
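Matloff’s worry is easy to exhibit numerically. Below is a minimal sketch (Python, standard library only; the flat truncated prior and grid approximation are my illustrative choices, not anything from the discussion): with prior support restricted to (c, ∞), the posterior mean stays above c no matter how large n gets, even when the true μ lies below c.

```python
import math

# Grid-approximate posterior mean for a Normal(mu, 1) model when the prior
# is flat but truncated to (c, grid_hi), yet the true mu lies BELOW c.
def posterior_mean(xbar, n, c, grid_hi=5.0, steps=20000):
    mus = [c + (grid_hi - c) * (i + 0.5) / steps for i in range(steps)]
    loglik = [-n * (xbar - mu) ** 2 / 2.0 for mu in mus]  # xbar ~ Normal(mu, 1/n)
    m = max(loglik)                                       # stabilize before exp()
    w = [math.exp(l - m) for l in loglik]
    return sum(mu * wi for mu, wi in zip(mus, w)) / sum(w)

c = 0.0
xbar = -0.5   # the analyst is wrong: the data keep pointing below c
for n in (10, 1000, 100000):
    print(n, round(posterior_mean(xbar, n, c), 4))
# The estimate creeps toward the boundary c but can never drop below it:
# in this direction the prior's influence never "washes out".
```

The usual washing-out theorems assume the true parameter value is in the support of the prior; here it isn’t, so the posterior piles up at the boundary instead.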

Categories: Discussion continued, Statistics | 60 Comments

Induction, Popper and Pseudoscience



February is a good time to read or reread these pages from Popper’s Conjectures and Refutations. Below are (a) some of my newer reflections on Popper after rereading him in the graduate seminar I taught one year ago with Aris Spanos (Phil 6334), and (b) my slides on Popper and the philosophical problem of induction, first posted here. I welcome reader questions on either.

As is typical in rereading any deep philosopher, I discover (or rediscover) different morsels of clues to understanding—whether fully intended by the philosopher or a byproduct of their other insights, and a more contemporary reading. So it is with Popper. A couple of key ideas to emerge from the seminar discussion (my slides are below) are:

  1. Unlike the “naïve” empiricists of the day, Popper recognized that observations are not just given unproblematically, but also require an interpretation, an interest, a point of view, a problem. What came first, a hypothesis or an observation? Another hypothesis, if only at a lower level, says Popper.  He draws the contrast with Wittgenstein’s “verificationism”. In typical positivist style, the verificationist sees observations as the given “atoms,” and other knowledge is built up out of truth functional operations on those atoms.[1] However, scientific generalizations beyond the given observations cannot be so deduced, hence the traditional philosophical problem of induction isn’t solvable. One is left trying to build a formal “inductive logic” (generally deductive affairs, ironically) that is thought to capture intuitions about scientific inference (a largely degenerating program). The formal probabilists, as well as philosophical Bayesianism, may be seen as descendants of the logical positivists–instrumentalists, verificationists, operationalists (and the corresponding “isms”). So understanding Popper throws a great deal of light on current day philosophy of probability and statistics.

Continue reading

Categories: Phil 6334 class material, Popper, Statistics | 7 Comments

What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?



Here’s a quick note on something that I often find in discussions on tests, even though it treats “power”, which is a capacity-of-test notion, as if it were a fit-with-data notion…..

1. Take a one-sided Normal test T+ with n iid samples:

H0: µ ≤  0 against H1: µ >  0

σ = 10,  n = 100,  σ/√n = σx = 1,  α = .025.

So the test would reject H0 iff Z > c.025 = 1.96 (1.96 is the “cut-off”).


  2. Simple rules for alternatives against which T+ has high power:
  • If we add σx (here 1) to the cut-off (here, 1.96), we are at an alternative value for µ that test T+ has .84 power to detect.
  • If we add 3σx to the cut-off, we are at an alternative value for µ that test T+ has ~.999 power to detect; we can write this value as µ.999 = 4.96.
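These rules are just shifted Normal tail areas. A quick standard-library check (my sketch, not part of the original post):

```python
import math

# Standard Normal cdf via the error function
Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))

cutoff, sigma_x = 1.96, 1.0
# Power of T+ against mu = cutoff + k*sigma_x equals Phi(k):
print(round(Phi(1), 3))   # 0.841  (power against mu = 2.96)
print(round(Phi(3), 4))   # 0.9987 (power against mu = 4.96)
```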

Let the observed outcome just reach the cut-off to reject the null, z = 1.96.

If we were to form a “likelihood ratio” of μ = 4.96 compared to μ0 = 0 using

[Power(T+, 4.96)]/α,

it would be 40 (.999/.025).

It is absurd to say the alternative 4.96 is supported 40 times as much as the null, understanding support as likelihood or comparative likelihood. (The data, z = 1.96, are even closer to 0 than to 4.96; the same point can be made with less extreme cases.) What is commonly done next is to assign priors of .5 to the two hypotheses, yielding

Pr(H0 | z0) = 1/(1 + 40) = .024, so Pr(H1 | z0) = .976.

Such an inference is highly unwarranted and would almost always be wrong. Continue reading
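To see just how wrong the “(1 – β)/α as likelihood ratio” move is here, the genuine likelihood ratio at z = 1.96 can be computed directly (a standard-library Python sketch, not part of the original post):

```python
import math

def phi(z):   # standard Normal density
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def Phi(z):   # standard Normal cdf
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

alpha, cutoff, mu_alt = 0.025, 1.96, 4.96

power = 1 - Phi(cutoff - mu_alt)      # P(Z > 1.96; mu = 4.96) ~ .9987
bogus_ratio = power / alpha           # the "(1 - beta)/alpha" ratio ~ 40

# The genuine likelihood ratio of H1 to H0 at the observed z = 1.96:
actual_lr = phi(cutoff - mu_alt) / phi(cutoff)

print(f"(1 - beta)/alpha = {bogus_ratio:.1f}")    # ~ 40: "supports" H1
print(f"actual LR(H1/H0) = {actual_lr:.3f}")      # ~ 0.076: favors H0
```

So at the very data point that just rejects H0, the likelihood actually favors the null over μ = 4.96 by about 13 to 1, the opposite of what the 40:1 figure suggests.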

Categories: Bayesian/frequentist, law of likelihood, Statistical power, statistical tests, Statistics, Stephen Senn | 87 Comments

Stephen Senn: Is Pooling Fooling? (Guest Post)

Stephen Senn


Stephen Senn
Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS), Luxembourg

Is Pooling Fooling?

‘And take the case of a man who is ill. I call two physicians: they differ in opinion. I am not to lie down, and die between them: I must do something.’ Samuel Johnson, in Boswell’s A Journal of a Tour to the Hebrides

A common dilemma facing meta-analysts is what to put together with what? One may have a set of trials that seem to be approximately addressing the same question but some features may differ. For example, the inclusion criteria might have differed with some trials only admitting patients who were extremely ill but with other trials treating the moderately ill as well. Or it might be the case that different measurements have been taken in different trials. An even more extreme case occurs when different, if presumed similar, treatments have been used.

It is helpful to make a point of terminology here. In what follows I shall be talking about pooling results from various trials. This does not involve naïve pooling of patients across trials. I assume that each trial will provide a valid within-trial comparison of treatments. It is these comparisons that are to be pooled (appropriately).

A possible way to think of this is in terms of a Bayesian model with a prior distribution covering the extent to which results might differ as features of trials are changed. I don’t deny that this is sometimes an interesting way of looking at things (although I do maintain that it is much more tricky than many might suppose[1]) but I would also like to draw attention to the fact that there is a frequentist way of looking at this problem that is also useful.

Suppose that we have k ‘null’ hypotheses that we are interested in testing, each being capable of being tested in one of k trials. We can label these Hn1, Hn2, … Hnk. We are perfectly entitled to test the null hypothesis Hjoint that they are all jointly true. In doing this we can use appropriate judgement to construct a composite statistic based on all the trials whose distribution is known under the null. This is a justification for pooling. Continue reading
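One classical instance of such a composite statistic (my illustrative choice; Senn does not specify one here) is Fisher’s method for combining independent p-values: under the joint null Hjoint, −2 Σ ln pᵢ has a chi-square distribution with 2k degrees of freedom. A standard-library sketch:

```python
import math

def fisher_combined(pvals):
    """Fisher's method: under the joint null, -2 * sum(ln p_i) follows a
    chi-square distribution with 2k degrees of freedom (k independent trials)."""
    k = len(pvals)
    stat = -2.0 * sum(math.log(p) for p in pvals)
    # For even degrees of freedom 2k, the chi-square upper tail is exactly
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = stat / 2.0
    tail = math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))
    return stat, tail

# Three trials, none individually significant at .05:
stat, p_joint = fisher_combined([0.08, 0.11, 0.09])
print(f"stat = {stat:.2f}, joint p = {p_joint:.4f}")  # jointly significant at .05
```

The example shows the point of testing Hjoint: three individually unimpressive trials can jointly supply evidence against the shared null, which is one frequentist rationale for pooling.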

Categories: evidence-based policy, PhilPharma, S. Senn, Statistics | 19 Comments

2015 Saturday Night Brainstorming and Task Forces: (4th draft)


TFSI workgroup

Saturday Night Brainstorming: The TFSI on NHST–part reblog from here and here, with a substantial 2015 update!

Each year leaders of the movement to “reform” statistical methodology in psychology, social science, and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI) and propose new regulations they would like to see adopted, not just by the APA publication manual any more, but by all science journals! Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers.


Frustrated that the TFSI has still not banned null hypothesis significance testing (NHST)–a fallacious version of statistical significance tests that dares to violate Fisher’s first rule: It’s illicit to move directly from statistical to substantive effects–the New Reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?  

Most recently, the group has helped successfully launch a variety of “replication and reproducibility projects”. Having discovered how much the reward structure encourages bad statistics and gaming the system, they have cleverly pushed to change the reward structure: failed replications (from a group chosen by a crowd-sourced band of replicationistas) would not be hidden in those dusty old file drawers, but would be guaranteed to be published without that long, drawn-out process of peer review. Do these failed replications indicate the original study was a false positive? Or that the replication attempt is a false negative? It’s hard to say.

This year, as is typical, there is a new member who is pitching in to contribute what he hopes are novel ideas for reforming statistical practice. In addition, for the first time, there is a science reporter blogging the meeting for her next freelance “bad statistics” piece for a high-impact science journal. Notice that this committee only grows; no one has dropped off in the 3 years I’ve followed them.


Pawl: This meeting will come to order. I am pleased to welcome our new member, Dr. Ian Nydes, adding to the medical strength we have recently built with epidemiologist S.C. In addition, we have a science writer with us today, Jenina Oozo. To familiarize everyone, we begin with a review of old business, and gradually turn to new business.

Franz: It’s so darn frustrating after all these years to see researchers still using NHST methods; some of the newer modeling techniques routinely build on numerous applications of those pesky tests.

Jake: And the premier publication outlets in the social sciences still haven’t mandated the severe reforms sorely needed. Hopefully the new blood, Dr. Ian Nydes, can help us go beyond resurrecting the failed attempts of the past. Continue reading

Categories: Comedy, reforming the reformers, science communication, Statistical fraudbusting, statistical tests, Statistics | Tags: , , , , , , | 19 Comments


3 years ago...

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: January 2012. I mark in red three posts that seem most apt for general background on key issues in this blog.

January 2012

This new, once-a-month, feature began at the blog’s 3-year anniversary in Sept, 2014. I will count U-Phil’s on a single paper as one of the three I highlight (else I’d have to choose between them). I will comment on  3-year old posts from time to time.

This Memory Lane needs a bit of explanation. This blog began largely as a forum to discuss a set of contributions from a conference I organized (with A. Spanos and J. Miller*), “Statistical Science and Philosophy of Science: Where Do (Should) They Meet?” at the London School of Economics, Center for the Philosophy of Natural and Social Science (CPNSS), in June 2010 (where I am a visitor). Additional papers grew out of conversations initiated soon after (with Andrew Gelman and Larry Wasserman). The conference site is here. My reflections in this general arena (Sept. 26, 2012) are here.

As articles appeared in a special topic of the on-line journal Rationality, Markets and Morals (RMM), edited by Max Albert[i]—also a conference participant—I would announce an open invitation to readers to take a couple of weeks to write an extended comment. Each “U-Phil”–which stands for “U philosophize”–was a contribution to this activity. I plan to go back to that exercise at some point. Generally I would give a “deconstruction” of the paper first, followed by U-Phils, and then the author gave responses to the U-Phils and to me as they wished. You can readily search this blog for all the U-Phils and deconstructions.**

I was also keeping a list of issues that we either haven’t taken up, or need to return to. One example here is: Bayesian updating and down dating. Further notes about the origins of this blog are here. I recommend everyone reread Senn’s paper.** 

For newcomers, here’s your chance to catch up; for old-timers, this is philosophy: rereading is essential!

[i] Along with Hartmut Kliemt and Bernd Lahno.

*For a full list of collaborators, sponsors, logisticians, and related collaborations, see the conference page. The full list of speakers is found there as well.

**The U-Phil exchange between Mayo and Senn was published in the same special topic of RMM. But I still wish to know how we can cultivate “Senn’s-ability.” We could continue that activity as well, perhaps.


Dec. 2011
Nov. 2011
Oct. 2011
Sept. 2011 (Within “All She Wrote (so far)”)

Categories: 3-year memory lane, blog contents, Statistics, Stephen Senn, U-Phil | 2 Comments

Trial on Anil Potti’s (clinical) Trial Scandal Postponed Because Lawyers Get the Sniffles (updated)



Trial in Medical Research Scandal Postponed
By Jay Price

DURHAM, N.C. — A judge in Durham County Superior Court has postponed the first civil trial against Duke University by the estate of a patient who had enrolled in one of a trio of clinical cancer studies that were based on bogus science.

The case is part of what the investigative TV news show “60 Minutes” said could go down in history as one of the biggest medical research frauds ever.

The trial had been scheduled to start Monday, but several attorneys involved contracted flu. Judge Robert C. Ervin hasn’t settled on a new start date, but after a conference call with him Monday night, attorneys in the case said it could be as late as this fall.

Flu? Don’t these lawyers get flu shots? Wasn’t Duke working on a flu vaccine? Delaying till fall 2015?

The postponement delayed resolution in the long-running case for the two patients still alive among the eight who filed suit. It also prolonged a lengthy public relations headache for Duke Medicine that has included retraction of research papers in major scientific journals, the embarrassing segment on “60 Minutes” and the revelation that the lead scientist had falsely claimed to be a Rhodes Scholar in grant applications and credentials.

Because it’s not considered a class action, the eight cases may be tried individually. The one designated to come first was brought by Walter Jacobs, whose wife, Julie, had enrolled in an advanced stage lung cancer study based on the bad research. She died in 2010.

“We regret that our trial couldn’t go forward on the scheduled date,” said Raleigh attorney Thomas Henson, who is representing Jacobs. “As our filed complaint shows, this case goes straight to the basic rights of human research subjects in clinical trials, and we look forward to having those issues at the forefront of the discussion when we are able to have our trial rescheduled.”

It all began in 2006 with research led by a young Duke researcher named Anil Potti. He claimed to have found genetic markers in tumors that could predict which cancer patients might respond well to what form of cancer therapy. The discovery, which one senior Duke administrator later said would have been a sort of Holy Grail of cancer research if it had been accurate, electrified other scientists in the field.

Then, starting in 2007, came the three clinical trials aimed at testing the approach. These enrolled more than 100 lung and breast cancer patients, and were eventually expected to enroll hundreds more.

Duke shut them down permanently in 2010 after finding serious problems with Potti’s science.

Now some of the patients – or their estates, since many have died from their illnesses – are suing Duke, Potti, his mentor and research collaborator Dr. Joseph Nevins, and various Duke administrators. The suit alleges, among other things, that they had engaged in a systematic plan to commercially develop cancer tests worth billions of dollars while using science that they knew or should have known to be fraudulent.

The latest revelation in the case, based on documents that emerged from the lawsuit and first reported in the Cancer Letter, a newsletter that covers cancer research issues, is that a young researcher working with Potti had alerted university officials to problems with the research data two years before the experiments on the cancer patients were stopped.

Categories: junk science, rejected post, Statistics | Tags: | 6 Comments

What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Sat night potpourri


For entertainment only

Here’s the follow-up to my last (reblogged) post, initially here. My take hasn’t changed much since 2013. Should we be labeling some pursuits “for entertainment only”? Why not? (See also a later post on the replication crisis in psych.)


I had said I would label as pseudoscience or questionable science any enterprise that regularly permits the kind of ‘verification biases’ in the statistical dirty laundry list.  How regularly? (I’ve been asked)

Well, surely if it’s as regular as, say, much of social psychology, it goes over the line. But it’s not mere regularity, it’s the nature of the data, the type of inferences being drawn, and the extent of self-scrutiny and recognition of errors shown (or not shown). The regularity is just a consequence of the methodological holes. My standards may be considerably more stringent than most, but quite aside from statistical issues, I simply do not find hypotheses well-tested if they are based on “experiments” that consist of giving questionnaires. At least not without a lot more self-scrutiny and discussion of flaws than I ever see. (There may be counterexamples.)

Attempts to recreate phenomena of interest in typical social science “labs” leave me with the same doubts. Huge gaps often exist between elicited and inferred results. One might locate the problem under “external validity” but to me it is just the general problem of relating statistical data to substantive claims.

Experimental economists (“expereconomists”) take lab results plus statistics to warrant sometimes ingenious inferences about substantive hypotheses. Vernon Smith (Nobel laureate in economics) is rare in subjecting his own results to “stress tests.” I’m not withdrawing the optimistic assertions he cites from EGEK (Mayo 1996) on Duhem-Quine (e.g., from “Rhetoric and Reality” 2001, p. 29). I’d still maintain, “Literal control is not needed to attribute experimental results correctly (whether to affirm or deny a hypothesis). Enough experimental knowledge will do.” But that requires piecemeal strategies that accumulate, and at least a little bit of “theory” and/or a decent amount of causal understanding.[1]

I think the generalizations extracted from questionnaires allow for an enormous amount of “reading into” the data. Suddenly one finds the “best” explanation. Questionnaires should be deconstructed for how they may be misinterpreted, not to mention how responders tend to guess what the experimenter is looking for. (I’m reminded of the current hoopla over questionnaires on breadwinners, housework and divorce rates!) I respond with the same eye-rolling to just-so story telling along the lines of evolutionary psychology.

I apply the “Stapel test”: Even if Stapel had bothered to actually carry out the data-collection plans that he so carefully crafted, I would not find the inferences especially telling in the least. Take, for example, the planned-but-not-implemented study discussed in the recent New York Times article on Stapel.

Categories: junk science, Statistical fraudbusting, Statistics | 3 Comments
