Author Archives: Mayo

Review of Error and Inference by C. Hennig

Theoria just sent me this review by Hennig* of Error and Inference, in THEORIA 74 (2012): 245-247.

(Open access)

Deborah G. Mayo and Aris Spanos, eds. 2009. Error and Inference. Cambridge: Cambridge University Press.

Error and Inference focuses on the error-statistical philosophy of science (ESP) put forward by Deborah Mayo and Aris Spanos (MS). Chapters 1, 6 and 7 are mainly written by MS (partly with the statistician David Cox), whereas Chapters 2-5, 8, and 9 are driven by the contributions of other authors. There are responses to all these contributions at the end of the chapters, usually written by Mayo.

The structure of the book, with the responses at the end of each chapter, is a striking feature. The critical contributions enable a very lively discussion of ESP. On the other hand, always having the last word puts Mayo and Spanos in a quite advantageous position. Some of the contributors may have underestimated Mayo’s ability to make the most of this advantage.

Central to ESP are the issues of probing scientific theories objectively by data, and Mayo’s concept of “severe testing” (ST). ST is based on a frequentist interpretation of probability, on conventional hypothesis testing and the associated error probabilities. ESP advertises a “piecemeal” approach to testing a scientific theory, in which various different aspects, which can be used to make predictions about data, are subjected to hypothesis tests. A statistical problem with such an approach is that failure of rejection of a null hypothesis H0 does not necessarily constitute evidence in favour of H0. The space of probability models is so rich that it is impossible to rule out all other probability models.


Categories: philosophy of science, Statistics

Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)

*The title is to be sung to the tune of “Anything You Can Do I Can Do Better”  from one of my favorite plays, Annie Get Your Gun (‘you’ being replaced by ‘test’).

This post may be seen to continue the discussion in the May 17 post on Reforming the Reformers.

Consider again our one-sided Normal test T+, with null H0: μ ≤ μ0 vs μ > μ0 and μ0 = 0, α = .025, and σ = 1, but let n = 25. So M is statistically significant only if it exceeds .392. Suppose M just misses significance, say

M0 = .39.

The flip side of a fallacy of rejection (discussed before) is a fallacy of acceptance, or the fallacy of misinterpreting statistically insignificant results.  To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ =μ0, we wish to identify discrepancies that can and cannot be ruled out.  For our test T+, we reason from insignificant results to inferential claims of the form:

μ < μ0 + γ

Fisher continually emphasized that failure to reject was not evidence for the null.  Neyman, we saw, in chastising Carnap, argued for the following kind of power analysis:

Neymanian Power Analysis (Detectable Discrepancy Size DDS): If data x are not statistically significantly different from H0, and the power to detect discrepancy γ is high (low), then x constitutes good (poor) evidence that the actual effect is no greater than γ. (See 11/9/11 post)
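The power analysis just described can be sketched numerically for test T+ with the values given above (μ0 = 0, σ = 1, n = 25, α = .025). This is only an illustration: the function names `cutoff` and `power` and the particular discrepancies γ tried in the loop are assumptions made here, not part of the original discussion.

```python
from math import sqrt
from statistics import NormalDist

def cutoff(mu0, sigma, n, alpha):
    """Smallest sample mean M that test T+ declares significant at level alpha."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return mu0 + z_alpha * sigma / sqrt(n)

def power(gamma, mu0, sigma, n, alpha):
    """Probability that M exceeds the cutoff when the true mean is mu0 + gamma."""
    se = sigma / sqrt(n)
    return 1 - NormalDist(mu0 + gamma, se).cdf(cutoff(mu0, sigma, n, alpha))

# Values from the text: mu0 = 0, sigma = 1, n = 25, alpha = .025
c = cutoff(mu0=0.0, sigma=1.0, n=25, alpha=0.025)
print(f"cutoff = {c:.3f}")

# Illustrative discrepancies gamma (hypothetical choices)
for gamma in (0.2, 0.4, 0.6, 0.8):
    print(f"gamma = {gamma:.1f}  power = {power(gamma, 0.0, 1.0, 25, 0.025):.3f}")
```

On this reading, an insignificant result is good evidence that the discrepancy is no greater than γ only for those γ against which the power is high; a discrepancy equal to the cutoff itself is detected only half the time.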

By taking into account the actual x0, a more nuanced post-data reasoning may be obtained.

“In the Neyman-Pearson theory, sensitivity is assessed by means of the power—the probability of reaching a preset level of significance under the assumption that various alternative hypotheses are true. In the approach described here, sensitivity is assessed by means of the distribution of the random variable P, considered under the assumption of various alternatives.” (Cox and Mayo 2010, p. 291)
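One way to make the post-data reasoning concrete is to replace the fixed cutoff by the observed mean M0 = .39 and ask, for each γ, how probable a larger M would be were μ = μ0 + γ. The sketch below assumes this reading; the function name `prob_larger_mean` and the γ values tried are hypothetical, and the calculation is offered as an illustration rather than as the formula of the quoted passage.

```python
from math import sqrt
from statistics import NormalDist

def prob_larger_mean(gamma, m_obs, mu0, sigma, n):
    """P(M > m_obs) computed under mu = mu0 + gamma.

    A high value suggests the claim mu < mu0 + gamma is well warranted
    by the insignificant result m_obs (an assumed reading, see lead-in).
    """
    se = sigma / sqrt(n)
    return 1 - NormalDist(mu0 + gamma, se).cdf(m_obs)

# Values from the text: M0 = .39, mu0 = 0, sigma = 1, n = 25
for gamma in (0.2, 0.39, 0.59, 0.79):
    value = prob_larger_mean(gamma, m_obs=0.39, mu0=0.0, sigma=1.0, n=25)
    print(f"mu < {gamma:.2f}: {value:.3f}")
```

Unlike the power, this assessment changes with the observed result: a smaller M0 would warrant ruling out smaller discrepancies.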


Categories: Reformers: Prionvac, Statistics

Metablog: May 31, 2012

Dear Reader: I will be traveling a lot in the next few weeks, and may not get to post much; we’ll see. If I do not reply to comments, I’m not ignoring them—they’re a lot more fun than some of the things I must do now to complete my book, but I need to resist, especially while traveling and giving seminars.* The rule we’ve followed is for comments to shut after 10 days, but we wanted to allow them still to appear. The blogpeople on Elba forward comments for 10 days, so beyond that it’s just haphazard if I notice them. It’s impossible otherwise to keep this blog up at all, and I would like to. Feel free to call any to my attention (use the “can we talk” page or error@vt.edu). If there’s a burning issue, interested readers might wish to poke around (or scour) the multiple layers of goodies on the left hand side of this web page, wherein all manner of foundational/statistical controversies are considered from many years of working in this area. In a recent attempt by Aris Spanos and me to address the age-old criticisms from the perspective of the “error statistical philosophy,” we delineate 13 criticisms. I list them below.

Categories: Metablog, Philosophy of Statistics, Statistics

Painting-by-Number #1

In an exchange with an anonymous commentator, responding to my May 23 blog post, I was asked what I meant by an argument (in favor of a method) based on “painting-by-number” reconstructions. “Painting-by-numbers” refers to reconstructing an inference or application of method X (analogous to a method of painting) to make it consistent with an application of method Y (painting with a paint-by-number kit). The locution comes from EGEK (Mayo 1996) and alludes to a kind of argument sometimes used to garner “success stories” for a method: i.e., show that any case, given enough latitude, could be reconstructed so as to be an application of (or at least consistent with) the preferred method.

Referring to specific applications of error-statistical methods, I wrote in EGEK (pp. 100-101):

We may grant that experimental inferences, once complete, may be reconstructed so as to be seen as applications of Bayesian methods—even though that would be stretching it in many cases. My point is that the inferences actually made are applications of standard non-Bayesian methods [e.g., significance tests]. . . . The point may be made with an analogy. Imagine the following conversation: Continue reading

Categories: Statistics

An Error-Statistical Philosophy of Evidence (PH500, LSE Seminar)

This short paper, together with the response to comments by Casella and McCoy, may provide an OK overview of some issues/ideas, and as I’m making it available for my upcoming PH500 seminar*, I thought I’d post it too. The paper itself was a 15-minute presentation at the Ecological Society of America in 1998; my response to criticisms, around the same length, was requested much later. While in some ways the time lag shows, e.g., McCoy’s reference to “reductionist” accounts–part of the popular constructive leanings of the time; scant mention of Bayesian developments taking place around then, it is simple and short and non-technical **. Also, as I should hope, my own views have gone considerably beyond what I wrote then.

(Taper and Lele did an excellent job with this volume, as long as it took, particularly interspersing the commentary. I recommend it!***)

Mayo, D. (2004). “An Error-Statistical Philosophy of Evidence” in M. Taper and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press: 79-118 (with discussion).

Categories: philosophy of science, Statistics

Does the Bayesian Diet Call For Error-Statistical Supplements?

Some of the recent comments to my May 20 post lead me to point us back to my earlier (April 15) post on dynamic Dutch books, and continue where Howson left off:

“And where does this conclusion leave the Bayesian theory? ….I claim that nothing valuable is lost by abandoning updating rules.  The idea that the only updating policy sanctioned by the Bayesian theory is updating by conditionalization was untenable even on its own terms, since the learning of each conditioning proposition could not  itself have been by conditionalization.” (Howson 1997, 289).

So a Bayesian account requires a distinct account of empirical learning in order to learn “of each conditioning proposition” (propositions which may be statistical hypotheses).  This was my argument in EGEK (1996, 87)*. And this other account, I would go on to suggest, should ensure the claims (which I prefer to “propositions”) are reliably warranted or severely corroborated.

*Error and the Growth of Experimental Knowledge (Mayo 1996):  Scroll down to chapter 3.

Categories: Statistics

Betting, Bookies and Bayes: Does it Not Matter?

On Gelman’s blog today he offers a simple rejection of Dutch Book arguments for Bayesian inference:

“I have never found this argument appealing, because a bet is a game not a decision. A bet requires 2 players, and one player has to offer the bets.”

But what about dynamic Bayesian Dutch book arguments which are thought to be the basis for advocating updating by Bayes’s theorem?  Betting scenarios, even if hypothetical, are often offered as the basis for making Bayesian measurements operational, and for claiming Bayes’s rule is a warranted representation of updating “uncertainty”. The question I had asked in an earlier (April 15) post (and then placed on hold) is: Does it not matter that Bayesians increasingly seem to debunk  betting representations?

Categories: Statistics

Do CIs Avoid Fallacies of Tests? Reforming the Reformers

The one method that enjoys the approbation of the New Reformers is that of confidence intervals (See May 12, 2012, and links). The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter μ” (Cumming 2012, p. 69). He recommends prespecified confidence levels .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value.” (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]
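The duality can be checked directly in the Normal case with known σ and the 2-standard deviation cut-off. The sketch below is only illustrative: the observed mean `m_obs = 0.6`, the grid of candidate values, and the function names are hypothetical choices made here.

```python
from math import sqrt

def lower_limit(m_obs, sigma, n, k=2.0):
    """Lower limit of the one-sided interval mu > M - k*sigma/sqrt(n)."""
    return m_obs - k * sigma / sqrt(n)

def rejects(mu_prime, m_obs, sigma, n, k=2.0):
    """Does the one-sided test of H0: mu <= mu_prime reject with this M?"""
    return m_obs > mu_prime + k * sigma / sqrt(n)

# Hypothetical data: observed mean 0.6, sigma = 1, n = 25
m_obs, sigma, n = 0.6, 1.0, 25
ll = lower_limit(m_obs, sigma, n)
print(f"lower limit = {ll:.2f}")

# Candidate values inside the interval are exactly those the test does not reject
for mu_prime in (-0.5, 0.0, 0.1, 0.3, 0.5):
    inside = mu_prime > ll
    print(f"mu' = {mu_prime:+.1f}  rejected = {rejects(mu_prime, m_obs, sigma, n)}  inside CI = {inside}")
```

Every candidate value falls on one side or the other: it is either rejected by the corresponding test or contained in the interval, never both.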

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it? Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+.

Categories: Statistics

Saturday Night Brainstorming & Task Forces: The TFSI on NHST

Each year leaders of the movement to reform statistical methodology in psychology and related social sciences get together for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like the APA publication manual to adopt, and strategize about how to institutionalize improvements to statistical methodology. See my discussion of the New Reformers in the blogposts of Sept 26, Oct. 3 and 4, 2011.[i]

While frustrated that the TFSI has still not banned null hypothesis significance testing (NHST), since attempts going back to at least 1996, the reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?  

Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers, somewhere near an airport in a major metropolitan area.[ii] Please see 2015 update here.

Categories: Statistics

Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence,”

Dear Reader: I am typing in some excerpts from a letter Stephen Senn shared with me in relation to my April 28, 2012 blogpost. It is a letter to the editor of Statistics in Medicine in response to S. Goodman. It contains several important points that get to the issues we’ve been discussing, and you may wish to track down the rest of it. Sincerely, D. G. Mayo

Statist. Med. 2002; 21:2437–2444  https://errorstatistics.com/wp-content/uploads/2013/12/goodman.pdf

 STATISTICS IN MEDICINE, LETTER TO THE EDITOR

A comment on replication, p-values and evidence: S.N. Goodman, Statistics in Medicine 1992; 11:875–879

From: Stephen Senn*

Some years ago, in the pages of this journal, Goodman gave an interesting analysis of ‘replication probabilities’ of p-values. Specifically, he considered the possibility that a given experiment had produced a p-value that indicated ‘significance’ or near significance (he considered the range p=0.10 to 0.001) and then calculated the probability that a study with equal power would produce a significant result at the conventional level of significance of 0.05. He showed, for example, that given an uninformative prior, and (subsequently) a resulting p-value that was exactly 0.05 from the first experiment, the probability of significance in the second experiment was 50 per cent. A more general form of this result is as follows. If the first trial yields p=α then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5.
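The 50 per cent figure can be reproduced with a short calculation. The sketch assumes (these are assumptions of the illustration, not statements from the letter) that both trials yield a standardized Normal test statistic with the same standard error, that p-values are two-sided, and that the uninformative prior makes the predictive distribution of the second statistic Normal with mean equal to the first statistic and variance 2. The function name `replication_probability` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def replication_probability(p_first, alpha=0.05):
    """P(second trial significant at level alpha, same direction),
    under a flat prior and two trials of equal precision (assumed)."""
    std_normal = NormalDist()
    z_first = std_normal.inv_cdf(1 - p_first / 2)   # statistic observed in trial 1
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)     # significance threshold
    # Predictive distribution of the second statistic: N(z_first, variance 2)
    predictive = NormalDist(z_first, sqrt(2))
    return 1 - predictive.cdf(z_alpha)

# First trial exactly at the threshold
print(f"{replication_probability(0.05, alpha=0.05):.2f}")

# The general form: p = alpha in the first trial, for several alpha
for a in (0.10, 0.05, 0.01, 0.001):
    print(a, f"{replication_probability(a, alpha=a):.2f}")
```

When the first statistic sits exactly on the threshold, the predictive distribution is centred on that threshold, so half of its mass lies beyond it whatever the value of α.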

Categories: Statistics

LSE Summer Seminar: Contemporary Problems in Philosophy of Statistics

As a visitor of the Centre for Philosophy of Natural and Social Science (CPNSS) at the London School of Economics and Political Science, I am planning to lead 5 seminars in the department of Philosophy, Logic, and Scientific Method this summer (2) and autumn (3) on Contemporary Philosophy of Statistics under the PH500 rubric, (listed under summer term).   This will be rather informal, based on the book I am writing with this name. There will be at least one guest seminar leader in the fall. Anyone interested in attending or finding out more may write to me: error@vt.edu .*

Wednesday   6th June            3-5pm                        T206

Wednesday 13th June             3-5pm                        T206

Autumn term dates: To Be Announced

LSE contact person: c.j.thompson@lse.ac.uk.

PH 500. Contemporary Problems in Philosophy of Statistical Science

Categories: Announcement, philosophy of science, Statistics

Comedy Hour at the Bayesian (Epistemology) Retreat: Highly Probable vs Highly Probed

Bayesian philosophers (among others) have analogous versions of the criticism in my April 28 blogpost: error probabilities (associated with inferences to hypotheses) may conflict with chosen posterior probabilities in hypotheses. Since it’s Saturday night let’s listen in to one of the comedy hours at the Bayesian retreat (note the sedate philosopher’s comedy club backdrop):

Did you hear the one about the frequentist error statistical tester who inferred a hypothesis H passed a stringent test (with data x)?

The problem was, the epistemic probability in H was so low that H couldn’t be believed!  Instead we believe its denial H’!  So, she will infer hypotheses that are simply unbelievable!

So clearly the error statistical testing account fails to serve in an account of knowledge or inference (i.e., an epistemic account). However severely I might wish to say that a hypothesis H has passed a test, the Bayesian critic assigns a sufficiently low prior probability to H so as to yield a low posterior probability in H[i].  But this is no argument about why this counts in favor of, rather than against, their Bayesian computation as an appropriate assessment of the warrant to be accorded to hypothesis H.

To begin with, in order to use techniques for assigning frequentist probabilities to events, their examples invariably involve “hypotheses” that consist of asserting that a sample possesses a characteristic, such as “having a disease” or “being college ready” or, for that matter, “being true.”  This would not necessarily be problematic if it were not for the fact that their criticism requires shifting the probability to the particular sample selected—for example, a student Isaac is college-ready, or this null hypothesis (selected from a pool of nulls) is true.  This was, recall, the fallacious probability assignment that we saw in Berger’s attempt, later (perhaps) disavowed. Also there are just two outcomes, say s and ~s, and no degrees of discrepancy from H.

Categories: philosophy of science, Statistics

Stephen Senn: A Paradox of Prior Probabilities

Stephen Senn

Head of the Methodology and Statistics Group,

Competence Center for Methodology and Statistics (CCMS), Luxembourg

This paradox is clearly inspired by and in a sense is just another form of Philip Dawid’s selection paradox[1]. See my paper in The American Statistician for a discussion of this[2]. However, I rather like this concrete example of it.

Imagine that you are about to carry out a Bayesian analysis of a new treatment for rheumatism. However, just to avoid various complications I am going to assume that you are looking at a potential side effect of the treatment. I am going to take the effect on diastolic blood pressure (DBP) as the example of a side-effect one might look at.

Now, to be truly Bayesian I think that you ought to have a look at a long list of previous treatments for rheumatism but time is short and this is not always so easy. So instead you argue like this.

  1. I know from the results of the WHO Monica project that the standard deviation of DBP is about 11 mmHg in a general population.
  2. I have no prior opinion as to whether anti-rheumatics as a class have a beneficial or harmful effect on DBP.
  3. I think that large effects on DBP, whether harmful or beneficial, are rather improbable for a drug designed to treat rheumatism.
  4. I believe the data are approximately Normal.
  5. I am going to use a conjugate prior for the effect of treatment with mean 0 and standard deviation = 4 mmHg. This makes very large beneficial or harmful effects unlikely but still allows reasonable play for the data. This means that the prior variance is 16 mmHg² compared to a data variance I am expecting to be about 120 mmHg². This means that as soon as I have treated 8 subjects the variance of the data mean should be smaller (about 15 mmHg²) than the prior variance, and so I will actually be weighting the data more than the prior at that point. This seems about reasonable to me.
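The arithmetic in point 5 can be checked with the standard conjugate Normal update, in which prior and data are weighted by their precisions (inverse variances). The sketch uses the figures from the list (prior variance 16, data variance about 120); the function names and the example sample mean of 5 mmHg are hypothetical.

```python
PRIOR_MEAN = 0.0        # mmHg, from point 5
PRIOR_VAR = 4.0 ** 2    # mmHg^2, from point 5
DATA_VAR = 120.0        # mmHg^2, expected variance of a single observation

def data_weight(n):
    """Share of the posterior mean contributed by the sample mean of n subjects."""
    precision_data = n / DATA_VAR      # inverse of the variance of the mean
    precision_prior = 1.0 / PRIOR_VAR
    return precision_data / (precision_data + precision_prior)

def posterior(n, sample_mean):
    """Posterior mean and variance of the treatment effect."""
    w = data_weight(n)
    post_mean = w * sample_mean + (1 - w) * PRIOR_MEAN
    post_var = 1.0 / (n / DATA_VAR + 1.0 / PRIOR_VAR)
    return post_mean, post_var

for n in (4, 7, 8, 16):
    print(f"n = {n:2d}  var of mean = {DATA_VAR / n:6.2f}  weight on data = {data_weight(n):.3f}")

# Hypothetical observed mean change of 5 mmHg in 8 subjects
print(posterior(8, sample_mean=5.0))
```

With 7 subjects the prior still carries slightly more weight than the data; with 8 the balance tips the other way, as the list says.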

You can choose different figures if you want but here I am attempting to apply a standard Bayesian analysis in a reasonably honest manner.

Categories: Statistics

Comedy Hour at the Bayesian Retreat: P-values versus Posteriors

Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?

JB: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true!

Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!

Raucous laughter ensues!

(Hah, hah,…. I feel I’m back in high school: “So funny, I forgot to laugh!”)

The frequentist tester should retort:

Frequentist significance tester: But you assumed 50% of the null hypotheses are true, and  computed P(H0|x) (imagining P(H0)= .5)—and then assumed my p-value should agree with the number you get!

But, our significance tester is not heard from as they move on to the next joke….

Of course it is well-known that for a fixed p-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H0 [i] .  Somewhat more recent work generalizes the result, e.g., J. Berger and Sellke, 1987. Although from their Bayesian perspective, it appears that p-values come up short as measures of evidence, the significance testers balk at the fact that use of the recommended priors allows highly significant results to be interpreted as no evidence against the null — or even evidence for it!   An interesting twist in recent work is to try to “reconcile” the p-value and the posterior e.g., Berger 2003[ii].

The conflict between p-values and Bayesian posteriors considers the two sided  test of the Normal mean, H0: μ = μ0 versus H1: μ ≠ μ0 .

“If n = 50 one can classically ‘reject H0 at significance level p = .05,’ although Pr (H0|x) = .52 (which would actually indicate that the evidence favors H0).” (Berger and Sellke, 1987, p. 113).

If n = 1000, a result statistically significant at the .05 level leads to a posterior to the null of .82!

[Chart: Table 1 (modified) from J.O. Berger and T. Sellke (1987), “Testing a Point Null Hypothesis,” JASA 82(397): 113.]
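The two posterior figures quoted above (.52 for n = 50 and .82 for n = 1000) can be reproduced with a short calculation. The sketch assumes a prior of .5 on H0 with the remaining .5 spread over the alternative as a Normal distribution centred at μ0 with variance σ²; that choice of spread is an assumption of this illustration, as is the function name `posterior_null`.

```python
from math import exp, sqrt

def posterior_null(z, n, prior_null=0.5):
    """P(H0 | x) for the two-sided Normal test with known sigma.

    z is the standardized test statistic sqrt(n) * (mean - mu0) / sigma.
    Assumes a N(mu0, sigma^2) prior on mu under the alternative.
    """
    # Ratio of the marginal density under H1 to the density under H0
    ratio_alt_to_null = (1 + n) ** -0.5 * exp(0.5 * z * z * n / (1 + n))
    prior_odds_alt = (1 - prior_null) / prior_null
    return 1.0 / (1.0 + prior_odds_alt * ratio_alt_to_null)

# A result just significant at the two-sided .05 level (z = 1.96)
for n in (10, 50, 100, 1000):
    print(f"n = {n:4d}  P(H0 | x) = {posterior_null(1.96, n):.2f}")
```

Holding z fixed at 1.96, the posterior on the null climbs with n, which is the pattern the quoted figures display.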

Many find the example compelling evidence that the p-value “overstates evidence against a null” because it claims to use an “impartial” or “uninformative”(?) Bayesian prior probability assignment of .5 to H0, the remaining .5 being spread out over the alternative parameter space. Others charge that the problem is not p-values but the high prior (Casella and R. Berger, 1987).  Moreover, the “spiked concentration of belief in the null” is at odds with the prevailing view “we know all nulls are false”.  Note too the conflict with confidence interval reasoning since the value zero (0) lies outside the corresponding confidence interval (Mayo 2005).

But often, as in the opening joke, the prior assignment is claimed to be keeping to the frequentist camp and frequentist error probabilities: it is imagined that we sample randomly from a population of hypotheses, some proportion of which are assumed to be true, 50% is a common number used. We randomly draw a hypothesis and get this particular one, maybe it concerns the mean deflection of light, or perhaps it is an assertion of bioequivalence of two drugs or whatever. The percentage “initially true” (in this urn of nulls) serves as the prior probability for H0. I see this gambit in statistics, psychology, philosophy and elsewhere, and yet it commits a fallacious instantiation of probabilities:

50% of the null hypotheses in a given pool of nulls are true.

This particular null H0 was randomly selected from this urn (and, it may be added, nothing else is known, or the like).

Therefore P(H0 is true) = .5.

It isn’t that one cannot play a carnival game of reaching into an urn of nulls (and one can imagine lots of choices for what to put in the urn), and use a Bernoulli model for the chance of drawing a true hypothesis (assuming we could even tell), but this “generic hypothesis”  is no longer the particular hypothesis one aims to use in computing the probability of data x0 (be it on eclipse data, risk rates, or whatever) under hypothesis H0. [iii]  In any event .5 is not the frequentist probability that the chosen null H0 is true. (Note the selected null would get the benefit of being selected from an urn of nulls where few have been shown false yet: “innocence by association”.)
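How much the percentage in the opening joke owes to the composition of the urn can be seen in a small calculation. The sketch is a deliberate simplification and every modelling choice in it is an assumption of the illustration: each test yields a standard Normal statistic z under a true null, every false null shares the same alternative with mean `shift`, and the share of true nulls is computed among results landing at z = 1.96.

```python
from statistics import NormalDist

def share_true_nulls(z, prior_true, shift):
    """Proportion of true nulls among results with statistic z,
    when a fraction prior_true of the urn's nulls are true."""
    density_null = NormalDist(0.0, 1.0).pdf(z)
    density_alt = NormalDist(shift, 1.0).pdf(z)   # hypothetical common alternative
    numerator = prior_true * density_null
    return numerator / (numerator + (1 - prior_true) * density_alt)

# Same observed result, different urns (hypothetical compositions)
for prior_true in (0.1, 0.5, 0.9):
    print(f"urn with {prior_true:.0%} true nulls -> share true at z = 1.96: "
          f"{share_true_nulls(1.96, prior_true, shift=3.0):.2f}")
```

The observed result is the same in every row; only the assumed contents of the urn change, and the reported percentage changes with them.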

Yet J. Berger claims his applets are perfectly frequentist, and by adopting his recommended O-priors, we frequentists can become more frequentist (than using our flawed p-values)[iv]. We get what he calls conditional p-values (of a special sort). This is a reason for coining a different name, e.g., frequentist error statistician.

Upshot: Berger and Sellke tell us they will cure  the significance tester’s tendency to exaggerate the evidence against the null  (in two-sided testing) by using some variant on a spiked prior. But the result of their “cure” is that outcomes may too readily be taken as no evidence against, or even evidence for, the null hypothesis, even if it is false.  We actually don’t think we need a cure.  Faced with conflicts between error probabilities and Bayesian posterior probabilities, the error statistician may well conclude that the flaw lies with the latter measure. This is precisely what Fisher argued:

Discussing a test of the hypothesis that the stars are distributed at random, Fisher takes the low p-value (about 1 in 33,000) to “exclude at a high level of significance any theory involving a random distribution” (Fisher, 1956, page 42). Even if one were to imagine that H0 had an extremely high prior probability, Fisher continues—never minding “what such a statement of probability a priori could possibly mean”—the resulting high a posteriori probability to H0, he thinks, would only show that “reluctance to accept a hypothesis strongly contradicted by a test of significance” (44) . . . “is not capable of finding expression in any calculation of probability a posteriori” (43). Sampling theorists do not deny there is ever a legitimate frequentist prior probability distribution for a statistical hypothesis: one may consider hypotheses about such distributions and subject them to probative tests. Indeed, Fisher says, if one were to consider the claim about the a priori probability to be itself a hypothesis, it would be rejected by the data!


[i] A result my late colleague I.J. Good wanted me to call the Jeffreys-Good-Lindley Paradox.

[ii] An applet is available at http://www.stat.duke.edu/~berger

[iii] Bayesian philosophers, e.g., Achinstein, allow this does not yield a frequentist prior, but he claims it yields an acceptable prior for the epistemic  probabilist (e.g., See Error and Inference 2010).

[iv] Does this remind you of how the Bayesian is said to become more subjective by using the Berger O-Bayesian prior? See Berger deconstruction.

References & Related articles

Berger, J. O.  (2003). “Could Fisher, Jeffreys and Neyman have Agreed on Testing?” Statistical Science 18: 1-12.

Berger, J. O. and Sellke, T.  (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.

Fisher, R. A. (1956). Statistical Methods and Scientific Inference, Edinburgh: Oliver and Boyd.

Jeffreys, H. (1939). Theory of Probability, Oxford: Oxford University Press.

Mayo, D. (2003). Comment on J. O. Berger’s “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”, Statistical Science 18: 19-24.

Mayo, D. (2004). “An Error-Statistical Philosophy of Evidence,” in M. Taper and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press: 79-118.

Mayo, D.G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. and Kruse, M. (2001). “Principles of Inference and Their Consequences,” in D. Corfield and J. Williamson (eds.) Foundations of Bayesianism. Dordrecht: Kluwer Academic Publishers: 381-403.

Mayo, D. and Spanos, A. (2011). “Error Statistics,” in Philosophy of Statistics, Handbook of Philosophy of Science, Volume 7 (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster). Elsevier: 1-46.

Categories: Statistics

Matching Numbers Across Philosophies

The search for an agreement on numbers across different statistical philosophies is an understandable pastime in foundations of statistics. Perhaps identifying matching or unified numbers, apart from what they might mean, would offer a glimpse as to shared underlying goals? Jim Berger (2003) assures us there is no sacrilege in agreeing on methodology without philosophy, claiming “while the debate over interpretation can be strident, statistical practice is little affected as long as the reported numbers are the same” (Berger, 2003, p. 1).

Do readers agree?

Neyman and Pearson (or perhaps it was mostly Neyman) set out to determine when tests of statistical hypotheses may be considered “independent of probabilities a priori” (p. 201). In such cases, frequentist and Bayesian may agree on a critical or rejection region.

The agreement between “default” Bayesians and frequentists in the case of one-sided Normal (IID) testing (known σ) is very familiar. As noted in Ghosh, Delampady, and Samanta (2006, p. 35), if we wish to reject a null value when “the posterior odds against it are 19:1 or more, i.e., if posterior probability of H0 is < .05” then the rejection region matches that of the corresponding test of H0 (at the .05 level) if that were the null hypothesis. By contrast, they go on to note the also familiar fact that there would be disagreement between the frequentist and Bayesian if one were instead testing the two-sided H0: μ=μ0 vs. H1: μ≠μ0 with known σ. In fact, the same outcome that would be regarded as evidence against the null in the one-sided test (for the default Bayesian and frequentist) can result in statistically significant results being construed as no evidence against the null (for the Bayesian) or even evidence for it (due to a spiked prior).[i]
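The one-sided agreement is easy to verify numerically. The sketch assumes the “default” prior is an improper uniform prior on μ, so that the posterior for μ is Normal with mean equal to the observed sample mean and standard deviation σ/√n; that reading of “default”, the function names, and the sample means tried are assumptions of this illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p_value(m_obs, mu0, sigma, n):
    """P(M >= m_obs) computed under mu = mu0."""
    z = (m_obs - mu0) / (sigma / sqrt(n))
    return 1 - NormalDist().cdf(z)

def posterior_prob_null(m_obs, mu0, sigma, n):
    """P(mu <= mu0 | m_obs) under a flat prior on mu (assumed default prior)."""
    return NormalDist(m_obs, sigma / sqrt(n)).cdf(mu0)

# Hypothetical observed means with mu0 = 0, sigma = 1, n = 25
for m_obs in (0.1, 0.33, 0.392, 0.5):
    p = one_sided_p_value(m_obs, 0.0, 1.0, 25)
    post = posterior_prob_null(m_obs, 0.0, 1.0, 25)
    print(f"M = {m_obs:.3f}  p-value = {p:.4f}  P(H0 | x) = {post:.4f}")
```

Since the two numbers coincide for every outcome, rejecting when the posterior probability of H0 falls below .05 picks out the same region as the .05 level one-sided test.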

Categories: Statistics

U-Phil: Jon Williamson: Deconstructing Dynamic Dutch Books

Jon Williamson

I am  posting Jon Williamson’s* (Philosophy, Kent) U-Phil from 4-15-12

In this paper http://www.springerlink.com/content/q175036678w17478 (Synthese 178:67–85) I identify four ways in which Bayesian conditionalisation can fail. Of course not all Bayesians advocate conditionalisation as a universal rule, and I argue that objective Bayesianism as based on the maximum entropy principle should be preferred to subjective Bayesianism as based on conditionalisation, where the two disagree.

Conditionalisation is just one possible way of updating probabilities and I think it’s interesting to see how different formal approaches compare.

*Williamson participated in our June 2010 “Phil-Stat Meets Phil Sci” conference at the LSE, and we jointly ran a conference at Kent in June 2009.

Categories: Statistics, U-Phil

Jean Miller: Happy Sweet 16 to EGEK #2 (Hasok Chang Review of EGEK)

Jean Miller here, reporting back from the island. Tonight we complete our “sweet sixteen” celebration of Mayo’s EGEK (1996) with the book review by Dr. Hasok Chang (currently the Hans Rausing Professor of History and Philosophy of Science at the University of Cambridge). His was chosen as our top favorite in the category of ‘reviews by philosophers’. Enjoy!

REVIEW: British Journal for the Philosophy of Science 48 (1997), 455-459
DEBORAH MAYO Error and the Growth of Experimental Knowledge, 
The University of Chicago Press, 1996
By: Hasok Chang

Deborah Mayo’s Error and the Growth of Experimental Knowledge is a rich, useful, and accessible book. It is also a large volume which few people can realistically be expected to read cover to cover. Considering those factors, the main focus of this review will be on providing various potential readers with guidelines for making the best use of the book.

As the author herself advises, the main points can be grasped by reading the first and the last chapters. The real benefit, however, would only come from studying some of the intervening chapters closely. Below I will offer comments on several of the major strands that can be teased apart, though they are found rightly intertwined in the book.

Categories: philosophy of science, Statistics

Jean Miller: Happy Sweet 16 to EGEK! (Shalizi Review: “We Have Ways of Making You Talk”)

Jean Miller here. (I obtained my PhD with D. Mayo in Phil/STS at VT.) Some of us “island philosophers” have been looking to pick our favorite book reviews of EGEK (Mayo 1996; Lakatos Prize 1999) to celebrate its “sweet sixteen” this month. This review, by Dr. Cosma Shalizi (CMU, Stat), has been chosen as the top favorite (in the category of reviews outside philosophy). Below are some excerpts. It was hard to pick, as each paragraph held some new surprise or unique way to succinctly nail down the views in EGEK. You can read the full review here. Enjoy.

“We Have Ways of Making You Talk, or, Long Live Peircism-Popperism-Neyman-Pearson Thought!”
by Cosma Shalizi

After I’d bungled teaching it enough times to have an idea of what I was doing, one of the first things students in my introductory physics classes learned (or anyway were taught), and which I kept hammering at all semester, was error analysis: estimating the uncertainty in measurements, propagating errors from measured quantities into calculated ones, and some very quick and dirty significance tests, tests for whether or not two numbers agree, within their associated margins of error. I did this for purely pragmatic reasons: it seemed like one of the most useful things we were supposed to teach, and also one of the few areas where what I did had any discernible effect on what they learnt. Now that I’ve read Mayo’s book, I’ll be able to offer another excuse to my students the next time I teach error analysis, namely, that it’s how science really works.

I exaggerate her conclusion slightly, but only slightly. Mayo is a dues-paying philosopher of science (literally, it seems), and like most of the breed these days is largely concerned with questions of method and justification, of “ampliative inference” (C. S. Peirce) or “non-demonstrative inference” (Bertrand Russell). Put bluntly and concretely: why, since neither can be deduced rigorously from unquestionable premises, should we put more trust in David Grinspoon’s ideas about Venus than in those of Immanuel Velikovsky? A nice answer would be something like, “because good scientific theories are arrived at by employing thus-and-such a method, which infallibly leads to the truth, for the following self-evident reasons.” A nice answer, but not one which is seriously entertained by anyone these days, apart from some professors of sociology and literature moonlighting in the construction of straw men. In the real world, science is alas fallible, subject to constant correction, and very messy. Still, mess and all, we somehow or other come up with reliable, codified knowledge about the world, and it would be nice to know how the trick is turned: not only would it satisfy curiosity (“the most agreeable of all vices” – Nietzsche), and help silence such people as do, in fact, prefer Velikovsky to Grinspoon, but it might lead us to better ways of turning the trick. Asking scientists themselves is nearly useless: you’ll almost certainly just get a recital of whichever school of methodology we happened to blunder into in college, or impatience at asking silly questions and keeping us from the lab. If this vice is to be indulged in, someone other than scientists will have to do it: namely, the methodologists.

Categories: philosophy of science, Statistics

Earlier U-Phils and Deconstructions

Dear Reader: If you wish to see some previous rounds of philosophical analyses and deconstructions on this blog, we’ve listed some of them below (search this blog under “U-Phil” for more):

Introductory explanation: https://errorstatistics.com/2012/01/13/u-phil-so-you-want-to-do-a-philosophical-analysis/

Mayo on Jim Berger:  https://errorstatistics.com/2011/12/11/irony-and-bad-faith-deconstructing-bayesians-1/

Contributed deconstructions of J. Berger: https://errorstatistics.com/2011/12/26/contributed-deconstructions-irony-bad-faith-3/

J. Berger on J. Berger: https://errorstatistics.com/2011/12/29/jim-berger-on-jim-berger/

Mayo on Senn:  https://errorstatistics.com/2012/01/15/mayo-philosophizes-on-stephen-senn-how-can-we-cultivate-senns-ability/

Others on Senn: https://errorstatistics.com/2012/01/22/u-phil-stephen-senn-1-c-robert-a-jaffe-and-mayo-brief-remarks/

Gelman on Senn: https://errorstatistics.com/2012/01/23/u-phil-stephen-senn-2-andrew-gelman/

Senn on Senn: http://errorstatistics.com/2012/01/24/u-phil-3-stephen-senn-on-stephen-senn/

Mayo, Senn & Wasserman on Gelman: https://errorstatistics.com/2012/03/06/2645/

Hennig on Gelman: https://errorstatistics.com/2012/03/10/a-further-comment-on-gelman-by-c-hennig/

Deconstructing Dutch books: https://errorstatistics.com/2012/04/15/3376/

Deconstructing Larry Wasserman: https://errorstatistics.com/2012/07/28/u-phil-deconstructing-larry-wasserman/

Aris Spanos on Larry Wasserman: https://errorstatistics.com/2012/08/08/u-phil-aris-spanos-on-larry-wasserman/

Hennig and Gelman on Wasserman: https://errorstatistics.com/2012/08/10/u-phil-hennig-and-gelman-on-wasserman-2011/

Wasserman replies to Spanos and Hennig: https://errorstatistics.com/2012/08/11/u-phil-wasserman-replies-to-spanos-and-hennig/

Concluding the deconstruction, Wasserman-Mayo: https://errorstatistics.com/2012/08/13/u-phil-concluding-the-deconstruction-wasserman-mayo/

https://errorstatistics.com/2013/02/10/u-phil-gandenberger-hennig-birnbaums-proof/

https://errorstatistics.com/2013/01/30/u-phil-j-a-miller-blogging-the-slp/

There are others, but this should do. If you care to write on my previous post, send it directly to error@vt.edu.

Sincerely,

D Mayo

Categories: philosophy of science, U-Phil

A. Spanos: Jerzy Neyman and his Enduring Legacy

A Statistical Model as a Chance Mechanism

Aris Spanos

Jerzy Neyman (April 16, 1894 – August 5, 1981) was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals.

One of Neyman’s most remarkable, but least recognized, achievements was his adaptation of Fisher’s (1922) notion of a statistical model to render it pertinent for non-random samples. Fisher’s original parametric statistical model Mθ(x) was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data x0:=(x1,x2,…,xn) can be viewed as a ‘truly representative sample’ from that ‘population’:

“The postulate of randomness thus resolves itself into the question, Of what population is this a random sample?” (ibid., p. 313), underscoring that “the adequacy of our choice may be tested a posteriori” (p. 314).

In cases where data x0 come from sample surveys, or can be viewed as a typical realization of a random sample X:=(X1,X2,…,Xn), i.e. Independent and Identically Distributed (IID) random variables, the ‘population’ metaphor can be helpful in adding some intuitive appeal to the inductive dimension of statistical inference, because one can imagine using a subset of a population (the sample) to draw inferences pertaining to the whole population.

This ‘infinite population’ metaphor, however, is of limited value in most applied disciplines relying on observational data. To see how inept this metaphor is, consider the question: what is the hypothetical ‘population’ when modeling the gyrations of stock market prices? More generally, what is observed in such cases is a certain on-going process and not a fixed population from which we can select a representative sample. For that very reason, most economists in the 1930s considered Fisher’s statistical modeling irrelevant for economic data!

Categories: Statistics
