Dear Reader: I will be traveling a lot in the next few weeks, and may not get to post much; we’ll see. If I do not reply to comments, I’m not ignoring them; they’re a lot more fun than some of the things I must do now to complete my book, but I need to resist, especially while traveling and giving seminars.* The rule we’ve followed is for comments to shut after 10 days, but we wanted to allow them still to appear. The blogpeople on Elba forward comments for 10 days, so beyond that it’s just haphazard if I notice them. It’s impossible otherwise to keep this blog up at all, and I would like to. Feel free to call any to my attention (use the “can we talk” page or error@vt.edu). If there’s a burning issue, interested readers might wish to poke around (or scour) the multiple layers of goodies on the left-hand side of this web page, wherein all manner of foundational/statistical controversies are considered from many years of working in this area. In a recent attempt by Aris Spanos and me to address the age-old criticisms from the perspective of the “error statistical philosophy,” we delineate 13 criticisms. I list them below. Continue reading

# Monthly Archives: May 2012

## Painting-by-Number #1

In an exchange with an anonymous commentator, responding to my May 23 blog post, I was asked what I meant by an argument (in favor of a method) based on “painting-by-number” reconstructions. “Painting-by-numbers” refers to reconstructing an inference or application of method X (analogous to a method of painting) to make it consistent with an application of method Y (painting with a paint-by-number kit). The locution comes from EGEK (Mayo 1996) and alludes to a kind of argument sometimes used to garner “success stories” for a method: i.e., show that any case, given enough latitude, could be reconstructed so as to be an application of (or at least consistent with) the preferred method.

Referring to specific applications of error-statistical methods, I wrote in EGEK (pp. 100-101):

We may grant that experimental inferences, once complete, may be reconstructed so as to be seen as applications of Bayesian methods—even though that would be stretching it in many cases. My point is that the inferences actually made are applications of standard non-Bayesian methods [e.g., significance tests]. . . . The point may be made with an analogy. Imagine the following conversation: Continue reading

## An Error-Statistical Philosophy of Evidence (PH500, LSE Seminar)

This short paper, together with the response to comments by Casella and McCoy, may provide an OK overview of some issues/ideas, and as I’m making it available for my upcoming PH500 seminar*, I thought I’d post it too. The paper itself was a 15-minute presentation at the Ecological Society of America in 1998; my response to criticisms, around the same length, was requested much later. While in some ways the time lag shows (e.g., McCoy’s reference to “reductionist” accounts, part of the popular constructive leanings of the time, and the scant mention of Bayesian developments taking place around then), it is simple and short and non-technical**. Also, as I should hope, my own views have gone considerably beyond what I wrote then.

(Taper and Lele did an excellent job with this volume, as long as it took, particularly interspersing the commentary. I recommend it!***)

Mayo, D. (2004). “An Error-Statistical Philosophy of Evidence” in M. Taper and S. Lele (eds.) *The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations*. Chicago: University of Chicago Press: 79-118 (with discussion). Continue reading

## Does the Bayesian Diet Call For Error-Statistical Supplements?

Some of the recent comments on my May 20 post lead me to point us back to my earlier (April 15) post on dynamic Dutch books, and continue where Howson left off:

“And where does this conclusion leave the Bayesian theory? ….I claim that nothing valuable is lost by abandoning updating rules. The idea that the only updating policy sanctioned by the Bayesian theory is updating by conditionalization was untenable even on its own terms, since the learning of each conditioning proposition could not itself have been by conditionalization.” (Howson 1997, 289).

So a Bayesian account requires a distinct account of empirical learning in order to learn “of each conditioning proposition” (propositions which may be statistical hypotheses). This was my argument in EGEK (1996, 87)*. And this other account, I would go on to suggest, should ensure the claims (which I prefer to “propositions”) are reliably warranted or severely corroborated.

*Error and the Growth of Experimental Knowledge (Mayo 1996): http://www.phil.vt.edu/dmayo/personal_website/bibliography%20complete.htm. Scroll down to chapter 3.

- Howson, C. (1997). “A Logic of Induction,” *Philosophy of Science* **64**(2): 268-290.
- Mayo, D. G. (1996). *Error and the Growth of Experimental Knowledge*. Chicago: University of Chicago Press.
- Mayo, D. G. (1997). “Duhem’s Problem, The Bayesian Way, and Error Statistics, or ‘What’s Belief got To Do With It?’” and “Response to Howson and Laudan,” *Philosophy of Science* **64**(2): 222-244 and 323-333.

## Betting, Bookies and Bayes: Does it Not Matter?

On Gelman’s blog today he offers a simple rejection of Dutch Book arguments for Bayesian inference:

“I have never found this argument appealing, because a bet is a game not a decision. A bet requires 2 players, and one player has to offer the bets.”

But what about dynamic Bayesian Dutch book arguments which are thought to be the basis for advocating updating by Bayes’s theorem? Betting scenarios, even if hypothetical, are often offered as the basis for making Bayesian measurements operational, and for claiming Bayes’s rule is a warranted representation of updating “uncertainty”. The question I had asked in an earlier (April 15) post (and then placed on hold) is: Does it not matter that Bayesians increasingly seem to debunk betting representations?

## Do CIs Avoid Fallacies of Tests? Reforming the Reformers

The one method that enjoys the approbation of the New Reformers is that of confidence intervals (See May 12, 2012, and links). The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter µ” (Cumming 2012, p. 69). He recommends prespecified confidence levels .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value.” (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.
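As a quick check on the ~.98 figure, here is a minimal sketch that computes the one-sided confidence level implied by a 2-standard-deviation cut-off under a Normal model. The helper name `normal_cdf` is introduced here only for illustration.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard Normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# One-sided confidence level for a cut-off of 2 standard deviations.
one_sided_level = normal_cdf(2.0)
print(round(one_sided_level, 4))  # 0.9772, i.e. roughly .98
```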

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]
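The duality can be illustrated with a small sketch. It assumes a Normal mean with known σ, a one-sided lower confidence bound at the 2-standard-error cut-off, and a one-sided test of H0: µ ≤ µ0 against µ > µ0 at the same cut-off. The function names and the numbers (`xbar`, `sigma`, `n`) are hypothetical, chosen only for illustration.

```python
import math


def lower_bound(xbar: float, sigma: float, n: int, k: float = 2.0) -> float:
    """Lower limit of the one-sided interval: xbar minus k standard errors."""
    return xbar - k * sigma / math.sqrt(n)


def rejects(xbar: float, mu0: float, sigma: float, n: int, k: float = 2.0) -> bool:
    """One-sided test of H0: mu <= mu0; reject when xbar exceeds mu0 by more than k standard errors."""
    return xbar > mu0 + k * sigma / math.sqrt(n)


# Hypothetical data summary: observed mean 0.4, sigma 1, n = 100 (standard error 0.1).
xbar, sigma, n = 0.4, 1.0, 100
lb = lower_bound(xbar, sigma, n)

for mu0 in (0.0, 0.1, 0.3, 0.5):
    in_interval = mu0 >= lb
    # A value of mu lies in the interval exactly when the test does not reject it.
    print(mu0, in_interval, not rejects(xbar, mu0, sigma, n))
```

In this sketch the two printed booleans agree for every µ0 tried: the interval collects just those values the corresponding test fails to reject.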

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it? Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+. Continue reading

## Saturday Night Brainstorming & Task Forces: The TFSI on NHST

*Each year leaders of the movement to reform statistical methodology in psychology and related social sciences get together for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like the APA publication manual to adopt, and strategize about how to institutionalize improvements to statistical methodology. See my discussion of the New Reformers in the blogposts of Sept 26, Oct. 3 and 4, 2011[i]*

*While frustrated that the TFSI has still not banned null hypothesis significance testing (NHST), despite attempts going back to at least 1996, the reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?*

*Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers, somewhere near an airport in a major metropolitan area.[ii] Please see 2015 update here. Continue reading*

## Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence,”

*Dear Reader: I am typing in some excerpts from a letter Stephen Senn shared with me in relation to my April 28, 2012 blogpost. It is a letter to the editor of Statistics in Medicine in response to S. Goodman. It contains several important points that get to the issues we’ve been discussing, and you may wish to track down the rest of it. Sincerely, D. G. Mayo*

Statist. Med. 2002; 21:2437–2444 http://errorstatistics.files.wordpress.com/2013/12/goodman.pdf

STATISTICS IN MEDICINE, LETTER TO THE EDITOR

A comment on replication, p-values and evidence: S.N. Goodman, Statistics in Medicine 1992; 11:875–879

From: Stephen Senn*

Some years ago, in the pages of this journal, Goodman gave an interesting analysis of ‘replication probabilities’ of p-values. Specifically, he considered the possibility that a given experiment had produced a p-value that indicated ‘significance’ or near significance (he considered the range p=0.10 to 0.001) and then calculated the probability that a study with equal power would produce a significant result at the conventional level of significance of 0.05. He showed, for example, that given an uninformative prior, and (subsequently) a resulting p-value that was exactly 0.05 from the first experiment, the probability of significance in the second experiment was 50 per cent. A more general form of this result is as follows. If the first trial yields p=α then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5. Continue reading
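The 50 per cent figure can be checked with a short sketch under simplifying assumptions that are mine rather than taken from the letter: Normal test statistics with known variance, two trials of equal size, a flat prior on the true effect, and two-sided p-values. Under those assumptions the predictive distribution of the second z-statistic given the first is Normal with mean z1 and variance 2. The function name `replication_probability` is hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and sd 1


def replication_probability(p_first: float, alpha: float = 0.05) -> float:
    """Probability that a second, equal-sized trial is significant at level alpha
    in the same direction, given a first two-sided p-value and a flat prior."""
    z_first = STD_NORMAL.inv_cdf(1.0 - p_first / 2.0)  # z-statistic of the first trial
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)     # critical value for the second
    # Predictive distribution of the second z-statistic: Normal(z_first, variance 2).
    return 1.0 - STD_NORMAL.cdf((z_crit - z_first) / math.sqrt(2.0))


print(replication_probability(0.05, alpha=0.05))  # 0.5 when the first p-value equals alpha
```

When the first p-value equals α the two critical values coincide, so the predictive distribution is centred exactly on the cut-off and half its mass lies beyond it, whatever α is.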

## LSE Summer Seminar: Contemporary Problems in Philosophy of Statistics

*I am planning to lead 5 seminars in the Department of Philosophy, Logic, and Scientific Method this summer (2) and autumn (3) on Contemporary Philosophy of Statistics under the PH500 rubric (listed under summer term). This will be rather informal, based on the book I am writing with this name. There will be at least one guest seminar leader in the fall. Anyone interested in attending or finding out more may write to me: error@vt.edu.*

Wednesday 6th June 3-5pm T206

Wednesday 13th June 3-5pm T206

Autumn term dates: To Be Announced

LSE contact person: c.j.thompson@lse.ac.uk.

*PH 500. Contemporary Problems in Philosophy of Statistical Science Continue reading*

## Comedy Hour at the Bayesian (Epistemology) Retreat: Highly Probable vs Highly Probed

Bayesian philosophers (among others) have analogous versions of the criticism in my April 28 blogpost: error probabilities (associated with inferences to hypotheses) may conflict with chosen posterior probabilities in hypotheses. Since it’s Saturday night let’s listen in to one of the comedy hours at the Bayesian retreat (note the sedate philosopher’s comedy club backdrop):

*Did you hear the one about the frequentist error statistical tester who inferred a hypothesis H passed a stringent test (with data x)?*

*The problem was, the epistemic probability in H was so low that H couldn’t be believed! Instead we believe its denial H’! So, she will infer hypotheses that are simply unbelievable!*

So clearly the error statistical testing account fails to serve in an account of knowledge or inference (i.e., an epistemic account). However severely I might wish to say that a hypothesis *H* has passed a test, the Bayesian critic assigns a sufficiently low prior probability to *H* so as to yield a low posterior probability in *H*[i]. But this is no argument about why this counts in favor of, rather than against, their Bayesian computation as an appropriate assessment of the warrant to be accorded to hypothesis *H*.

To begin with, in order to use techniques for assigning frequentist probabilities to events, their examples invariably involve “hypotheses” that consist of asserting that a sample possesses a characteristic, such as “having a disease” or “being college ready” or, for that matter, “being true.” This would not necessarily be problematic if it were not for the fact that their criticism requires shifting the probability to the particular sample selected—for example, a student Isaac is college-ready, or this null hypothesis (selected from a pool of nulls) is true. This was, recall, the fallacious probability assignment that we saw in Berger’s attempt, later (perhaps) disavowed. Also there are just two outcomes, say s and ~s, and no degrees of discrepancy from H. Continue reading

## Stephen Senn: A Paradox of Prior Probabilities

*Stephen Senn*

Head of the Methodology and Statistics Group,

Competence Center for Methodology and Statistics (CCMS), Luxembourg

This paradox is clearly inspired by and in a sense is just another form of Philip Dawid’s selection paradox[1]. See my paper in *The American Statistician* for a discussion of this[2]. However, I rather like this concrete example of it.

Imagine that you are about to carry out a Bayesian analysis of a new treatment for rheumatism. However, just to avoid various complications I am going to assume that you are looking at a potential side effect of the treatment. I am going to take the effect on diastolic blood pressure (DBP) as the example of a side-effect one might look at.

Now, to be truly Bayesian I think that you ought to have a look at a long list of previous treatments for rheumatism but time is short and this is not always so easy. So instead you argue like this.

- I know from the results of the WHO Monica project that the standard deviation of DBP is about 11 mmHg in a general population.
- I have no prior opinion as to whether anti-rheumatics as a class have a beneficial or harmful effect on DBP
- I think that large effects on DBP, whether harmful or beneficial, are rather improbable for a drug designed to treat rheumatism.
- I believe the data are approximately Normal
- I am going to use a conjugate prior for the effect of treatment with mean 0 and standard deviation 4 mmHg. This makes very large beneficial or harmful effects unlikely but still allows reasonable play for the data. This means that the prior variance is 16 mmHg² compared to a data variance I am expecting to be about 120 mmHg². This means that as soon as I have treated 8 subjects the variance of the data mean (about 15 mmHg²) should be smaller than the prior variance, and so I will actually be weighting the data more than the prior at that point. This seems about reasonable to me.
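The arithmetic in the last point can be sketched as follows, assuming the usual conjugate Normal model with known data variance, in which the posterior mean weights the prior mean and the data mean by their precisions (inverse variances). The function name `data_weight` is introduced here only for illustration.

```python
def data_weight(prior_var: float, data_var: float, n: int) -> float:
    """Weight given to the data mean in the posterior mean under a conjugate
    Normal model: precision of the data mean over total precision."""
    mean_var = data_var / n            # variance of the mean of n observations
    data_precision = 1.0 / mean_var
    prior_precision = 1.0 / prior_var
    return data_precision / (data_precision + prior_precision)


PRIOR_VAR = 16.0   # prior standard deviation of 4 mmHg, squared
DATA_VAR = 120.0   # expected data variance in mmHg squared

for n in (7, 8):
    print(n, DATA_VAR / n, round(data_weight(PRIOR_VAR, DATA_VAR, n), 3))
```

With 7 subjects the variance of the data mean (about 17.1) still exceeds the prior variance, so the prior carries slightly more weight; with 8 subjects it drops to 15 and the data weight is 16/31, just over one half.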

You can choose different figures if you want but here I am attempting to apply a standard Bayesian analysis in a reasonably honest manner. Continue reading