Monthly Archives: June 2012

Further Reflections on Simplicity: Mechanisms

To continue with some philosophical reflections on the papers from the “Ockham’s razor” conference, let me respond to something in Shalizi’s recent comments ( His emphasis on the interest in understanding processes and mechanisms, as opposed to mere prediction, seems exactly right. But he raises a question that seems to me simply answered (on grounds of evidence):  If “a model didn’t seem to need” a mechanism, it is left out, why?

“It’s this, the leave-out-processes-you-don’t-need, which seems to me the core of the Razor for scientific model-building. This is definitely not the same as parameter-counting, and I think it’s also different from capacity control and even from description-length-measuring (cf.), though I am open to Peter persuading me otherwise. I am not, however, altogether sure how to formalize it, or what would justify it, beyond an aesthetic preference for tidy models. (And who died and left the tidy-minded in charge?) The best hope for such justification, I think, is something like Kevin’s idea that the Razor helps us get to the truth faster, or at least with fewer needless detours. Positing processes and mechanisms which aren’t strictly called for to account for the phenomena is asking for trouble needlessly.”

But it is easy to see that if a model M is adequate for data x regarding an aspect of a phenomenon (i.e., M had passed reasonably severe tests with x) , then a model M’ that added an “unnecessary” mechanism would have passed with very low severity, or, if one prefers, M’ would be very poorly corroborated.  To justify “leaving-out-processes-you-don’t-need” then, the appeal is not to aesthetics or heuristics but to the severity or well-testedness of M and M’.

Continue reading

Categories: philosophy of science, Statistics | Tags: , , , ,

Deviates, Sloths, and Exiles: Philosophical Remarks on the Ockham’s Razor Workshop*

Picking up the pieces…

My flight out of Pittsburgh has been cancelled, and as I may be stuck in the airport for some time, I will try to make a virtue of it by jotting down some of my promised reflections on the “simplicity and truth” conference at Carnegie Mellon (organized by Kevin Kelly). My remarks concern only the explicit philosophical connections drawn by (4 of) the seven non-philosophers who spoke. For more general remarks, see blogs of: Larry Wasserman (Normal Deviate) and Cosma Shalizi (Three-Toed Sloth). (The following, based on my notes and memory, may include errors/gaps, but I trust that my fellow bloggers and sloggers, will correct me.)

First to speak were Vladimir Vapnik and Vladimir Cherkassky, from the field of machine learning, a discipline I know of only formally. Vapnik, of the Vapnik Chervonenkis (VC) theory, is known for his seminal work here. Their papers, both of which addressed directly the philosophical implications of their work, share enough themes to merit being taken up together.

Vapnik and Cherkassky find a number of striking dichotomies in the standard practice of both philosophy and statistics. They contrast the “classical” conception of scientific knowledge as essentially rational with the more modern, “data-driven” empirical view:

The former depicts knowledge as objective, deterministic, rational. Ockham’s razor is a kind of synthetic a priori statement that warrants our rational intuitions as the foundation of truth with a capital T, as well as a naïve realism (we may rely on Cartesian “clear and distinct” ideas; God does not deceive; and so on). The latter empirical view, illustrated by machine learning, is enlightened. It settles for predictive successes and instrumentalism, views models as mental constructs (in here, not out there), and exhorts scientists to restrict themselves to problems deemed “well posed” by machine-learning criteria.

But why suppose the choice is between assuming “a single best (true) theory or model” and the extreme empiricism of their instrumental machine learner? Continue reading

Categories: philosophy of science, Statistics | Tags: , , , ,

Promissory Note

Dear Reader:
After a month of traveling, I’m soon to return to home port; then it’s just a ferry back to Elba. I promise to post (hopefully by Monday) some philosophical reflections on the past few days at the Ockham’s Razor conference, here at CMU (see post from June 12, 2012), and catch up on your comments/e-mails. I am to present Sunday (tomorrow) at 9 a.m.

Categories: Metablog | Tags: ,

The Error Statistical Philosophy and The Practice of Bayesian Statistics: Comments on Gelman and Shalizi

Mayo elbowThe following is my commentary on a paper by Gelman and Shalizi, forthcoming (some time in 2013) in the British Journal of Mathematical and Statistical Psychology* (submitted February 14, 2012).

The Error Statistical Philosophy and the Practice of Bayesian Statistics: Comments on A. Gelman and C. Shalizi: Philosophy and the Practice of Bayesian Statistics**
Deborah G. Mayo

  1. Introduction

I am pleased to have the opportunity to comment on this interesting and provocative paper. I shall begin by citing three points at which the authors happily depart from existing work on statistical foundations.

First, there is the authors’ recognition that methodology is ineluctably bound up with philosophy. If nothing else “strictures derived from philosophy can inhibit research progress” (p. 4). They note, for example, the reluctance of some Bayesians to test their models because of their belief that “Bayesian models were by definition subjective,” or perhaps because checking involves non-Bayesian methods (4, n4).

Second, they recognize that Bayesian methods need a new foundation. Although the subjective Bayesian philosophy, “strongly influenced by Savage (1954), is widespread and influential in the philosophy of science (especially in the form of Bayesian confirmation theory),”and while many practitioners perceive the “rising use of Bayesian methods in applied statistical work,” (2) as supporting this Bayesian philosophy, the authors flatly declare that “most of the standard philosophy of Bayes is wrong” (2 n2). Despite their qualification that “a statistical method can be useful even if its philosophical justification is in error”, their stance will rightly challenge many a Bayesian.

Continue reading

Categories: Statistics | Tags: , , , ,

G. Cumming Response: The New Statistics

Prof. Geoff Cumming [i] has taken up my invite to respond to “Do CIs Avoid Fallacies of Tests? Reforming the Reformers” (May 17th), reposted today as well. (I extend the same invite to anyone I comment on, whether it be in the form of a comment or full post).   He reviews some of the complaints against p-values and significance tests, but he has not here responded to the particular challenge I raise: to show how his appeals to CIs avoid the fallacies and weakness of significance tests. The May 17 post focuses on the fallacy of rejection; the one from June 2, on the fallacy of acceptance. In each case, one needs to supplement his CIs with something along the lines of the testing scrutiny offered by SEV. At the same time, a SEV assessment avoids the much-lampooned uses of p-values–or so I have argued. He does allude to a subsequent post, so perhaps he will address these issues there.

The New Statistics

PROFESSOR GEOFF CUMMING [ii] (submitted June 13, 2012)

I’m new to this blog—what a trove of riches! I’m prompted to respond by Deborah Mayo’s typically insightful post of 17 May 2012, in which she discussed one-sided tests and referred to my discussion of one-sided CIs (Cumming, 2012, pp 109-113). A central issue is:

Cumming (quoted by Mayo): as usual, the estimation approach is better

Mayo: Is it?

Lots to discuss there. In this first post I’ll outline the big picture as I see it.

‘The New Statistics’ refers to effect sizes, confidence intervals, and meta-analysis, which, of course, are not themselves new. But using them, and relying on them as the basis for interpretation, would be new for most researchers in a wide range of disciplines—that for decades have relied on null hypothesis significance testing (NHST). My basic argument for the new statistics rather than NHST is summarised in a brief magazine article ( and radio talk ( The website has information about the book (Cumming, 2012) and ESCI software, which is a free download.

Continue reading

Categories: Statistics | Tags: , , , , , , ,

Repost (5/17/12): Do CIs Avoid Fallacies of Tests? Reforming the Reformers

The one method that enjoys the approbation of the New Reformers is that of confidence intervals (See May 12, 2012, and links). The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter m (Cumming 2012, p.69). He recommends prespecified confidence levels .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value. “ (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it?   Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+.

H0: µ ≤  0 against H1: µ >  0 , and let σ= 1.

Test T+ at significance level .02 is analogous to forming the one-sided (lower) 98% confidence interval:

µ > M – 2(1/ √n ).

where M, following Cumming, is the sample mean (thereby avoiding those x-bars). M – 2(1/ √n ) is the lower limit (LL) of a 98% CI.

Central problems with significance tests (whether of the N-P or Fisherian variety) include: Continue reading

Categories: Statistics | Tags: , , ,

Scratch Work for a SEV Homework Problem

Scratch-Paper-postSomeone wrote to me asking to see the scratch work for the SEV calculations.  (See June 14 post, also LSE problem set.)  I’ll just do the second one:

What is the Severity with which (μ<3.29) passes the test T+ in the case where  σx = 2?  We have that the observed sample mean M is 1.4, so

SEV (μ < 3.29) = P( test T+ yields a result that fits the 0 null less well than the one you got (in the direction of the alternative); computed assuming μ as large as 3.29)

SEV(μ < 3.29) = P(M >1.4; μ >3.29) > P(Z > (1.4 -3.29)/2)) * = P(Z > -1.89/2) = P(Z > -.945 ) ~ .83

*We calculate this at the point μ = 3.29, since the SEV would be larger for greater values of μ.

That’s quite a difference from the power calculation of .5, calculated in the usual way of a discrepancy detect size (DDS) analysis.


NEW PROBLEM: You want to make an inference that passes with high SEV, say you want  SEV(μ < μ’) = .99, with the same (statistically insignificant) outcome you got from the second case of test T+ as before (σx = 2).  What value for μ’ can you infer μ < μ’ with a SEV of .99?

Categories: Statistics | Tags: ,

Answer to the Homework & a New Exercise

Debunking the “power paradox” allegation from my previous post. The authors consider a one-tailed Z test of the hypothesis H0: μ ≤ 0 versus H1: μ > 0: our Test T+.  The observed sample mean is = 1.4 and in the first case σx = 1, and in the second case σx = 2.

First case: The power against μ = 3.29 is high, .95 (i.e. P(Z > 1.645; μ=3.29) =1-φ(-1.645) = .95), and thus the DDS assessor would take the result as a good indication that μ < 3.29.

Second case: For σx = 2, the cut-off for rejection would be 0 + 1.65(2) = 3.30.

So, in the second case (σx = 2) the probability of erroneously accepting H0, even if μ were as high as 3.29, is .5!  (i.e. P(Z ≤ 1.645; μ=3.29)  = φ(1.645-(3.29/2)) ~.5.)  Although p1 < p2[i] the justifiable upper bound in the first test is smaller (closer to 0) than in the second!  Hence, the DDS assessment is entirely in keeping with the appropriate use of error probabilities in interpreting tests. There is no conflict with p-value reasoning.


The DDS power analyst always takes the worst cast of just missing the cut-off for rejection. Compare instead

SEV(μ < 3.29) for the first test, and SEV(μ < 3.29) for the second (using the actual outcomes as SEV requires).

[i] p1= .081 and p2 = .242.

Categories: Statistics | Tags: , , ,

CMU Workshop on Foundations for Ockham’s Razor

CMU Workshop on Foundations for Ockham’s Razor

Carnegie Mellon University, Center for Formal Epistemology:

Workshop on Foundations for Ockham’s Razor

All are welcome to attend.

June 22-24, 2012

Adamson WingBaker Hall 136A, Carnegie Mellon University

Workshop web page and schedule

Contact:  Kevin T. Kelly (

Rationale:  Scientific theory choice is guided by judgments of simplicity, a bias frequently referred to as “Ockham’s Razor”. But what is simplicity and how, if at all, does it help science find the truth? Should we view simple theories as means for obtaining accurate predictions, as classical statisticians recommend? Or should we believe the theories themselves, as Bayesian methods seem to justify? The aim of this workshop is to re-examine the foundations of Ockham’s razor, with a firm focus on the connections, if any, between simplicity and truth.


Categories: Announcement, philosophy of science | Tags: ,

U-Phil: Is the Use of Power* Open to a Power Paradox?

* to assess Detectable Discrepancy Size (DDS)

In my last post, I argued that DDS type calculations (also called Neymanian power analysis) provide needful information to avoid fallacies of acceptance in the test T+; whereas, the corresponding confidence interval does not (at least not without special testing supplements).  But some have argued that DDS computations are “fundamentally flawed” leading to what is called the “power approach paradox”, e.g., Hoenig and Heisey (2001).

We are to consider two variations on the one-tailed test T+: H0: μ ≤ 0 versus H1: μ > 0 (p. 21).  Following their terminology and symbols:  The Z value in the first, Zp1, exceeds the Z value in the second, Zp2, although the same observed effect size occurs in both[i], and both have the same sample size, implying that σ1 < σ2.  For example, suppose σx1 = 1 and σx2 = 2.  Let observed sample mean M be 1.4 for both cases, so Zp1 = 1.4 and Zp2 = .7. They note that for any chosen power, the computable detectable discrepancy size will be smaller in the first experiment, and for any conjectured effect size, the computed power will always be higher in the first experiment.

“These results lead to the nonsensical conclusion that the first experiment provides the stronger evidence for the null hypothesis (because the apparent power is higher but significant results were not obtained), in direct contradiction to the standard interpretation of the experimental results (p-values).” (p. 21)

But rather than show the DDS assessment “nonsensical”, nor any direct contradiction to interpreting p values, this just demonstrates something  nonsensical in their interpretation of the two p-value results from tests with different variances.  Since it’s Sunday  night and I’m nursing[ii] overexposure to rowing in the Queen’s Jubilee boats in the rain and wind, how about you find the howler in their treatment. (Also please inform us of articles pointing this out in the last decade, if you know of any.)


Hoenig, J. M. and D. M. Heisey (2001), “The Abuse of Power: The Pervasive Fallacy of Power Calculations in Data Analysis,” The American Statistician, 55: 19-24.


[i] The subscript indicates the p-value of the associated Z value.

[ii] With English tea and a cup of strong “Elbar grease”.

Categories: Statistics, U-Phil | Tags: , , , , ,

Review of Error and Inference by C. Hennig

Theoria just sent me this review by Hennig* of Error and Inference.
in THEORIA 74 (2012): 245-247,

(Open access)

Deborah G. Mayo and Aris Spanos, eds. 2009. Error and Inference. Cambridge: Cambridge University Press.

Error and Inference focuses on the error-statistical philosophy of science (ESP) put forward by Deborah Mayo and Aris Spanos (MS). Chapters 1, 6 and 7 are mainly written by MS (partly with the statistician David Cox), whereas Chapters 2-5, 8, and 9 are driven by the contributions of other authors. There are responses to all these contributions at the end of the chapters, usually written by Mayo.

The structure of the book with the responses at the end of each chapter is a striking feature. The critical contributions enable a very lively discussion of ESP. On the other hand always having the last word puts Mayo and Spanos in a quite advantageous position. Some of the contributors may have underestimated Mayo’s ability to make the most of this advantage.

Central to ESP are the issues of probing scientific theories objectively by data, and Mayo’s concept of “severe testing” (ST). ST is based on a frequentist interpretation of probability, on conventional hypothesis testing and the associated error probabilities. ESP advertises a “piecemeal” approach to testing a scientific theory, in which various different aspects, which can be used to make predictions about data, are subjected to hypothesis tests. A statistical problem with such an approach is that failure of rejection of a null hypothesis H0 does not necessarily constitute evidence in favour of H0. The space of probability models is so rich that it is impossible to rule out all other probability models.

Continue reading

Categories: philosophy of science, Statistics | Tags: ,

Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)

*The title is to be sung to the tune of “Anything You Can Do I Can Do Better”  from one of my favorite plays, Annie Get Your Gun (‘you’ being replaced by ‘test’).

This post may be seen to continue the discussion in May 17 post on Reforming the Reformers.

Consider again our one-sided Normal test T+, with null H0: μ < μ0 vs μ >μ0  and  μ0 = 0,  α=.025, and σ = 1, but let n = 25. So M is statistically significant only if it exceeds .392. Suppose M just misses significance, say

Mo = .39.

The flip side of a fallacy of rejection (discussed before) is a fallacy of acceptance, or the fallacy of misinterpreting statistically insignificant results.  To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ =μ0, we wish to identify discrepancies that can and cannot be ruled out.  For our test T+, we reason from insignificant results to inferential claims of the form:

μ < μ0 + γ

Fisher continually emphasized that failure to reject was not evidence for the null.  Neyman, we saw, in chastising Carnap, argued for the following kind of power analysis:

Neymanian Power Analysis (Detectable Discrepancy Size DDS): If data x are not statistically significantly different from H0, and the power to detect discrepancy γ is high(low), then x constitutes good (poor) evidence that the actual effect is no greater than γ. (See 11/9/11 post)

By taking into account the actual x0, a more nuanced post-data reasoning may be obtained.

“In the Neyman-Pearson theory, sensitivity is assessed by means of the power—the probability of reaching a preset level of significance under the assumption that various alternative hypotheses are true. In the approach described here, sensitivity is assessed by means of the distribution of the random variable P, considered under the assumption of various alternatives. “ (Cox and Mayo 2010, p. 291):

Continue reading

Categories: Reformers: Prionvac, Statistics | Tags: , , , , , , ,

Blog at