Statistics

More from the Foundations of Simplicity Workshop*

*See also earlier posts from the CMU workshop here and here.

Elliott Sober has been writing on simplicity for a long time, so it was good to hear his latest thinking. If I understood him, he continues to endorse a comparative likelihoodist account, but he allows that, in model selection, “parsimony fights likelihood,” while, in adequate evolutionary theory, the two are thought to go hand in hand. Where it seems needed, therefore, he accepts a kind of “pluralism”. His discussion of the rival models in evolutionary theory and how they may give rise to competing likelihoods (for “tree taxonomies”) bears examination in its own right, but being in no position to accomplish this, I shall limit my remarks to the applicability of Sober’s insights (as my notes reflect them) to the philosophy of statistics and statistical evidence.

1. Comparativism:  We can agree that a hypothesis is not appraised in isolation, but to say that appraisal is “contrastive” or “comparativist” is ambiguous. Error statisticians view hypothesis testing as between exhaustive hypotheses H and not-H (usually within a model), but deny that the most that can be said is that one hypothesis or model is comparatively better than another, among a group of hypotheses that is to be delineated at the outset. There’s an important difference here. The best-tested of the lot need not be well-tested!

2. Falsification: Sober made a point of saying that his account does not falsify models or hypotheses. We are to start out with all the possible models to be considered (hopefully including one that is true or approximately true), akin to the “closed universe” of standard Bayesian accounts[i], but do we not get rid of any as falsified, given data? It seems not.

Continue reading

Categories: philosophy of science, Statistics | 3 Comments

PhilStatLaw: “Let’s Require Health Claims to Be ‘Evidence Based'” (Schachtman)

I see that Nathan Schachtman has had many interesting posts during the time I was away.  His recent post endorses the idea of “a hierarchy of evidence”–but philosophers of “evidence-based” medicine generally question or oppose it, at least partly because of disagreement as to where to place RCTs in the hierarchy.  What do people think?

Litigation arising from the FDA’s refusal to approve “health claims” for foods and dietary supplements is a fertile area for disputes over the interpretation of statistical evidence.  A ‘‘health claim’’ is ‘‘any claim made on the label or in labeling of a food, including a dietary supplement, that expressly or by implication … characterizes the relationship of any substance to a disease or health-related condition.’’ 21 C.F.R. § 101.14(a)(1); see also 21 U.S.C. § 343(r)(1)(A)-(B).

Unlike the federal courts exercising their gatekeeping responsibility, the FDA has committed to pre-specified principles of interpretation and evaluation. By regulation, the FDA gives notice of standards for evaluating complex evidentiary displays for the ‘‘significant scientific agreement’’ required for approving a food or dietary supplement health claim.  21 C.F.R. § 101.14.  See FDA – Guidance for Industry: Evidence-Based Review System for the Scientific Evaluation of Health Claims – Final (2009).

If the FDA’s refusal to approve a health claim requires pre-specified criteria of evaluation, then we should be asking ourselves why the federal courts have failed to develop a set of criteria for evaluating health effects claims as part of their Rule 702 (“Daubert”) gatekeeping responsibilities.  Why, close to 20 years after the Supreme Court decided Daubert, can lawyers make “health claims” without having to satisfy evidence-based criteria?

Read the rest.

Categories: philosophy of science, Statistics

Further Reflections on Simplicity: Mechanisms

To continue with some philosophical reflections on the papers from the “Ockham’s razor” conference, let me respond to something in Shalizi’s recent comments (http://cscs.umich.edu/~crshalizi/weblog/). His emphasis on the interest in understanding processes and mechanisms, as opposed to mere prediction, seems exactly right. But he raises a question that seems to me simply answered (on grounds of evidence): if “a model didn’t seem to need” a mechanism, why is it left out?

“It’s this, the leave-out-processes-you-don’t-need, which seems to me the core of the Razor for scientific model-building. This is definitely not the same as parameter-counting, and I think it’s also different from capacity control and even from description-length-measuring (cf.), though I am open to Peter persuading me otherwise. I am not, however, altogether sure how to formalize it, or what would justify it, beyond an aesthetic preference for tidy models. (And who died and left the tidy-minded in charge?) The best hope for such justification, I think, is something like Kevin’s idea that the Razor helps us get to the truth faster, or at least with fewer needless detours. Positing processes and mechanisms which aren’t strictly called for to account for the phenomena is asking for trouble needlessly.”

But it is easy to see that if a model M is adequate for data x regarding an aspect of a phenomenon (i.e., M has passed reasonably severe tests with x), then a model M’ that added an “unnecessary” mechanism would have passed with very low severity, or, if one prefers, M’ would be very poorly corroborated.  To justify “leaving-out-processes-you-don’t-need”, then, the appeal is not to aesthetics or heuristics but to the severity or well-testedness of M and M’.

Continue reading

Categories: philosophy of science, Statistics | 4 Comments

Deviates, Sloths, and Exiles: Philosophical Remarks on the Ockham’s Razor Workshop*

Picking up the pieces…

My flight out of Pittsburgh has been cancelled, and as I may be stuck in the airport for some time, I will try to make a virtue of it by jotting down some of my promised reflections on the “simplicity and truth” conference at Carnegie Mellon (organized by Kevin Kelly). My remarks concern only the explicit philosophical connections drawn by (4 of) the seven non-philosophers who spoke. For more general remarks, see blogs of: Larry Wasserman (Normal Deviate) and Cosma Shalizi (Three-Toed Sloth). (The following, based on my notes and memory, may include errors/gaps, but I trust that my fellow bloggers and sloggers, will correct me.)

First to speak were Vladimir Vapnik and Vladimir Cherkassky, from the field of machine learning, a discipline I know of only formally. Vapnik, of the Vapnik-Chervonenkis (VC) theory, is known for his seminal work here. Their papers, both of which directly addressed the philosophical implications of their work, share enough themes to merit being taken up together.

Vapnik and Cherkassky find a number of striking dichotomies in the standard practice of both philosophy and statistics. They contrast the “classical” conception of scientific knowledge as essentially rational with the more modern, “data-driven” empirical view:

The former depicts knowledge as objective, deterministic, rational. Ockham’s razor is a kind of synthetic a priori statement that warrants our rational intuitions as the foundation of truth with a capital T, as well as a naïve realism (we may rely on Cartesian “clear and distinct” ideas; God does not deceive; and so on). The latter empirical view, illustrated by machine learning, is enlightened. It settles for predictive successes and instrumentalism, views models as mental constructs (in here, not out there), and exhorts scientists to restrict themselves to problems deemed “well posed” by machine-learning criteria.

But why suppose the choice is between assuming “a single best (true) theory or model” and the extreme empiricism of their instrumental machine learner? Continue reading

Categories: philosophy of science, Statistics | 14 Comments

The Error Statistical Philosophy and The Practice of Bayesian Statistics: Comments on Gelman and Shalizi

The following is my commentary on a paper by Gelman and Shalizi, forthcoming (some time in 2013) in the British Journal of Mathematical and Statistical Psychology* (submitted February 14, 2012).
_______________________

The Error Statistical Philosophy and the Practice of Bayesian Statistics: Comments on A. Gelman and C. Shalizi: Philosophy and the Practice of Bayesian Statistics**
Deborah G. Mayo

  1. Introduction

I am pleased to have the opportunity to comment on this interesting and provocative paper. I shall begin by citing three points at which the authors happily depart from existing work on statistical foundations.

First, there is the authors’ recognition that methodology is ineluctably bound up with philosophy. If nothing else “strictures derived from philosophy can inhibit research progress” (p. 4). They note, for example, the reluctance of some Bayesians to test their models because of their belief that “Bayesian models were by definition subjective,” or perhaps because checking involves non-Bayesian methods (4, n4).

Second, they recognize that Bayesian methods need a new foundation. Although the subjective Bayesian philosophy, “strongly influenced by Savage (1954), is widespread and influential in the philosophy of science (especially in the form of Bayesian confirmation theory),” and while many practitioners perceive the “rising use of Bayesian methods in applied statistical work” (2) as supporting this Bayesian philosophy, the authors flatly declare that “most of the standard philosophy of Bayes is wrong” (2 n2). Despite their qualification that “a statistical method can be useful even if its philosophical justification is in error”, their stance will rightly challenge many a Bayesian.

Continue reading

Categories: Statistics

G. Cumming Response: The New Statistics

Prof. Geoff Cumming [i] has taken up my invite to respond to “Do CIs Avoid Fallacies of Tests? Reforming the Reformers” (May 17th), reposted today as well. (I extend the same invite to anyone I comment on, whether it be in the form of a comment or full post).   He reviews some of the complaints against p-values and significance tests, but he has not here responded to the particular challenge I raise: to show how his appeals to CIs avoid the fallacies and weakness of significance tests. The May 17 post focuses on the fallacy of rejection; the one from June 2, on the fallacy of acceptance. In each case, one needs to supplement his CIs with something along the lines of the testing scrutiny offered by SEV. At the same time, a SEV assessment avoids the much-lampooned uses of p-values–or so I have argued. He does allude to a subsequent post, so perhaps he will address these issues there.

The New Statistics

PROFESSOR GEOFF CUMMING [ii] (submitted June 13, 2012)

I’m new to this blog—what a trove of riches! I’m prompted to respond by Deborah Mayo’s typically insightful post of 17 May 2012, in which she discussed one-sided tests and referred to my discussion of one-sided CIs (Cumming, 2012, pp 109-113). A central issue is:

Cumming (quoted by Mayo): as usual, the estimation approach is better

Mayo: Is it?

Lots to discuss there. In this first post I’ll outline the big picture as I see it.

‘The New Statistics’ refers to effect sizes, confidence intervals, and meta-analysis, which, of course, are not themselves new. But using them, and relying on them as the basis for interpretation, would be new for most researchers in a wide range of disciplines—that for decades have relied on null hypothesis significance testing (NHST). My basic argument for the new statistics rather than NHST is summarised in a brief magazine article (http://tiny.cc/GeoffConversation) and radio talk (http://tiny.cc/geofftalk). The website www.thenewstatistics.com has information about the book (Cumming, 2012) and ESCI software, which is a free download.

Continue reading

Categories: Statistics | 5 Comments

Repost (5/17/12): Do CIs Avoid Fallacies of Tests? Reforming the Reformers

The one method that enjoys the approbation of the New Reformers is that of confidence intervals (See May 12, 2012, and links). The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter µ” (Cumming 2012, p. 69). He recommends prespecified confidence levels .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value.” (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it?   Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+.

H0: µ ≤  0 against H1: µ >  0 , and let σ= 1.

Test T+ at significance level .02 is analogous to forming the one-sided (lower) 98% confidence interval:

µ > M – 2(1/ √n ).

where M, following Cumming, is the sample mean (thereby avoiding those x-bars). M – 2(1/ √n ) is the lower limit (LL) of a 98% CI.
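To make the arithmetic concrete, here is a minimal sketch of the lower limit M – 2(1/√n) and of the confidence level that goes with the 2-standard-deviation cut-off. The function names and the sample values (M = 0.5, n = 100) are my own illustrative assumptions, not figures from Cumming; only the formula comes from the text above.

```python
from math import sqrt
from statistics import NormalDist


def lower_limit(sample_mean, n, sigma=1.0, k=2.0):
    """Lower limit of the one-sided interval mu > M - k * sigma / sqrt(n)."""
    return sample_mean - k * sigma / sqrt(n)


def one_sided_level(k=2.0):
    """Confidence level attached to a k-standard-error one-sided cut-off."""
    return NormalDist().cdf(k)


# Hypothetical data, chosen only for illustration: M = 0.5 from n = 100.
ll = lower_limit(0.5, 100)
level = one_sided_level()
print(f"LL = {ll:.3f}, one-sided confidence level = {level:.3f}")
```

The level comes out a little under .98, which is why the text speaks of ~.98 rather than exactly .98.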

Central problems with significance tests (whether of the N-P or Fisherian variety) include: Continue reading

Categories: Statistics

Scratch Work for a SEV Homework Problem

Someone wrote to me asking to see the scratch work for the SEV calculations.  (See June 14 post, also LSE problem set.)  I’ll just do the second one:

What is the Severity with which (μ<3.29) passes the test T+ in the case where  σx = 2?  We have that the observed sample mean M is 1.4, so

SEV (μ < 3.29) = P( test T+ yields a result that fits the 0 null less well than the one you got (in the direction of the alternative); computed assuming μ as large as 3.29)

SEV(μ < 3.29) = P(M > 1.4; μ > 3.29) > P(Z > (1.4 - 3.29)/2)* = P(Z > -1.89/2) = P(Z > -.945) ~ .83

*We calculate this at the point μ = 3.29, since the SEV would be larger for greater values of μ.

That’s quite a difference from the power calculation of .5, calculated in the usual way of a detectable discrepancy size (DDS) analysis.
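For anyone who wants to check the scratch work by machine, here is a minimal sketch using only the Python standard library. The function name sev_upper is my own label; the numbers (M = 1.4, σx = 2, μ = 3.29) are those of the problem.

```python
from statistics import NormalDist


def sev_upper(mu_prime, observed_mean, sigma_x):
    """SEV(mu < mu_prime) = P(M > observed_mean; mu = mu_prime).

    Evaluated at the point mu = mu_prime, as in the starred note above.
    """
    z = (observed_mean - mu_prime) / sigma_x
    return 1.0 - NormalDist().cdf(z)


# Second case of test T+: observed M = 1.4, sigma_x = 2.
sev_second = sev_upper(3.29, 1.4, 2.0)
print(f"SEV(mu < 3.29) = {sev_second:.2f}")
```

The same function, called with sigma_x = 1, gives the SEV for the first case.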

QUESTIONS?

NEW PROBLEM: You want to make an inference that passes with high SEV, say you want  SEV(μ < μ’) = .99, with the same (statistically insignificant) outcome you got from the second case of test T+ as before (σx = 2).  For what value of μ’ can you infer μ < μ’ with a SEV of .99?
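One way to set this up, sketched under the same Normal model: since SEV(μ < μ’) = P(Z > (M - μ’)/σx), asking for a given SEV amounts to inverting the standard Normal distribution. The function name mu_bound_for_sev is hypothetical, and the sketch is only meant as a check on your own scratch work.

```python
from statistics import NormalDist


def mu_bound_for_sev(target_sev, observed_mean, sigma_x):
    """Return mu' such that SEV(mu < mu') equals target_sev.

    Solves 1 - Phi((observed_mean - mu') / sigma_x) = target_sev for mu'.
    """
    return observed_mean + sigma_x * NormalDist().inv_cdf(target_sev)


# Second case of test T+: observed M = 1.4, sigma_x = 2, desired SEV = .99.
mu_prime = mu_bound_for_sev(0.99, 1.4, 2.0)
print(f"mu' = {mu_prime:.2f}")
```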

Categories: Statistics | 5 Comments

Answer to the Homework & a New Exercise

Debunking the “power paradox” allegation from my previous post. The authors consider a one-tailed Z test of the hypothesis H0: μ ≤ 0 versus H1: μ > 0: our Test T+.  The observed sample mean is M = 1.4, and in the first case σx = 1, and in the second case σx = 2.

First case: The power against μ = 3.29 is high, .95 (i.e., P(Z > 1.645; μ = 3.29) = 1 - Φ(-1.645) = .95), and thus the DDS assessor would take the result as a good indication that μ < 3.29.

Second case: For σx = 2, the cut-off for rejection would be 0 + 1.65(2) = 3.30.

So, in the second case (σx = 2) the probability of erroneously accepting H0, even if μ were as high as 3.29, is .5!  (i.e., P(Z ≤ 1.645; μ = 3.29) = Φ(1.645 - (3.29/2)) ~ .5.)  Although p1 < p2[i], the justifiable upper bound in the first test is smaller (closer to 0) than in the second!  Hence, the DDS assessment is entirely in keeping with the appropriate use of error probabilities in interpreting tests. There is no conflict with p-value reasoning.
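The two cases can be checked with a few lines of standard-library Python. This is a sketch, not anything from Hoenig and Heisey: the function names are mine, and the level α = .05 is an assumption read off the 1.645 cut-off used above.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def power(mu_alt, sigma_x, mu_null=0.0, alpha=0.05):
    """P(M > cut-off; mu = mu_alt) for the one-sided test T+."""
    cutoff = mu_null + Z.inv_cdf(1 - alpha) * sigma_x
    return 1.0 - Z.cdf((cutoff - mu_alt) / sigma_x)


def p_value(observed_mean, sigma_x, mu_null=0.0):
    """One-sided p-value for the observed sample mean."""
    return 1.0 - Z.cdf((observed_mean - mu_null) / sigma_x)


# Same observed mean M = 1.4 in both cases; only sigma_x differs.
for label, sigma_x in (("first", 1.0), ("second", 2.0)):
    print(label,
          "power against 3.29:", round(power(3.29, sigma_x), 3),
          "p-value:", round(p_value(1.4, sigma_x), 3))
```

The higher power in the first case goes with the smaller p-value there, which is the pattern the argument above turns on.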

NEW PROBLEM

The DDS power analyst always takes the worst case of just missing the cut-off for rejection. Compare instead

SEV(μ < 3.29) for the first test, and SEV(μ < 3.29) for the second (using the actual outcomes as SEV requires).


[i] p1= .081 and p2 = .242.

Categories: Statistics | 6 Comments

U-Phil: Is the Use of Power* Open to a Power Paradox?

* to assess Detectable Discrepancy Size (DDS)

In my last post, I argued that DDS type calculations (also called Neymanian power analysis) provide needful information to avoid fallacies of acceptance in the test T+; whereas, the corresponding confidence interval does not (at least not without special testing supplements).  But some have argued that DDS computations are “fundamentally flawed” leading to what is called the “power approach paradox”, e.g., Hoenig and Heisey (2001).

We are to consider two variations on the one-tailed test T+: H0: μ ≤ 0 versus H1: μ > 0 (p. 21).  Following their terminology and symbols:  The Z value in the first, Zp1, exceeds the Z value in the second, Zp2, although the same observed effect size occurs in both[i], and both have the same sample size, implying that σ1 < σ2.  For example, suppose σx1 = 1 and σx2 = 2.  Let observed sample mean M be 1.4 for both cases, so Zp1 = 1.4 and Zp2 = .7. They note that for any chosen power, the computable detectable discrepancy size will be smaller in the first experiment, and for any conjectured effect size, the computed power will always be higher in the first experiment.

“These results lead to the nonsensical conclusion that the first experiment provides the stronger evidence for the null hypothesis (because the apparent power is higher but significant results were not obtained), in direct contradiction to the standard interpretation of the experimental results (p-values).” (p. 21)

But rather than showing the DDS assessment to be “nonsensical”, or revealing any direct contradiction with the interpretation of p-values, this just demonstrates something nonsensical in their interpretation of the two p-value results from tests with different variances.  Since it’s Sunday night and I’m nursing[ii] overexposure to rowing in the Queen’s Jubilee boats in the rain and wind, how about you find the howler in their treatment. (Also please inform us of articles pointing this out in the last decade, if you know of any.)

______________________

Hoenig, J. M. and D. M. Heisey (2001), “The Abuse of Power: The Pervasive Fallacy of Power Calculations in Data Analysis,” The American Statistician, 55: 19-24.

 


[i] The subscript indicates the p-value of the associated Z value.

[ii] With English tea and a cup of strong “Elbar grease”.

Categories: Statistics, U-Phil | 7 Comments

Review of Error and Inference by C. Hennig

Theoria just sent me this review by Hennig* of Error and Inference, in THEORIA 74 (2012): 245-247.

(Open access)

Deborah G. Mayo and Aris Spanos, eds. 2009. Error and Inference. Cambridge: Cambridge University Press.

Error and Inference focuses on the error-statistical philosophy of science (ESP) put forward by Deborah Mayo and Aris Spanos (MS). Chapters 1, 6 and 7 are mainly written by MS (partly with the statistician David Cox), whereas Chapters 2-5, 8, and 9 are driven by the contributions of other authors. There are responses to all these contributions at the end of the chapters, usually written by Mayo.

The structure of the book with the responses at the end of each chapter is a striking feature. The critical contributions enable a very lively discussion of ESP. On the other hand, always having the last word puts Mayo and Spanos in a quite advantageous position. Some of the contributors may have underestimated Mayo’s ability to make the most of this advantage.

Central to ESP are the issues of probing scientific theories objectively by data, and Mayo’s concept of “severe testing” (ST). ST is based on a frequentist interpretation of probability, on conventional hypothesis testing and the associated error probabilities. ESP advertises a “piecemeal” approach to testing a scientific theory, in which various different aspects, which can be used to make predictions about data, are subjected to hypothesis tests. A statistical problem with such an approach is that failure of rejection of a null hypothesis H0 does not necessarily constitute evidence in favour of H0. The space of probability models is so rich that it is impossible to rule out all other probability models.

Continue reading

Categories: philosophy of science, Statistics

Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)

*The title is to be sung to the tune of “Anything You Can Do I Can Do Better”  from one of my favorite plays, Annie Get Your Gun (‘you’ being replaced by ‘test’).

This post may be seen to continue the discussion in the May 17 post on Reforming the Reformers.

Consider again our one-sided Normal test T+, with null H0: μ ≤ μ0 vs H1: μ > μ0, and μ0 = 0, α = .025, and σ = 1, but let n = 25. So M is statistically significant only if it exceeds .392. Suppose M just misses significance, say

Mo = .39.

The flip side of a fallacy of rejection (discussed before) is a fallacy of acceptance, or the fallacy of misinterpreting statistically insignificant results.  To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ =μ0, we wish to identify discrepancies that can and cannot be ruled out.  For our test T+, we reason from insignificant results to inferential claims of the form:

μ < μ0 + γ

Fisher continually emphasized that failure to reject was not evidence for the null.  Neyman, we saw, in chastising Carnap, argued for the following kind of power analysis:

Neymanian Power Analysis (Detectable Discrepancy Size DDS): If data x are not statistically significantly different from H0, and the power to detect discrepancy γ is high (low), then x constitutes good (poor) evidence that the actual effect is no greater than γ. (See 11/9/11 post)
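As a rough sketch of how such a power analysis runs for our test T+ (μ0 = 0, σ = 1, n = 25, α = .025), the following computes the cut-off for M and the power to detect a few discrepancies γ. The function names and the particular γ values are my own illustrative choices; nothing here depends on the observed M.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def cutoff(mu0, sigma, n, alpha):
    """Smallest sample mean M that is statistically significant at level alpha."""
    return mu0 + Z.inv_cdf(1 - alpha) * sigma / sqrt(n)


def power_at(gamma, mu0, sigma, n, alpha):
    """P(M > cut-off; mu = mu0 + gamma): power to detect discrepancy gamma."""
    standard_error = sigma / sqrt(n)
    return 1.0 - Z.cdf((cutoff(mu0, sigma, n, alpha) - (mu0 + gamma)) / standard_error)


print("cut-off for M:", round(cutoff(0.0, 1.0, 25, 0.025), 3))
for gamma in (0.2, 0.4, 0.6, 0.8):  # hypothetical discrepancies
    print("gamma =", gamma, "power =", round(power_at(gamma, 0.0, 1.0, 25, 0.025), 3))
```

On the DDS reading, an insignificant result is good evidence that μ < μ0 + γ only for those γ against which the power is high.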

By taking into account the actual x0, a more nuanced post-data reasoning may be obtained.

“In the Neyman-Pearson theory, sensitivity is assessed by means of the power—the probability of reaching a preset level of significance under the assumption that various alternative hypotheses are true. In the approach described here, sensitivity is assessed by means of the distribution of the random variable P, considered under the assumption of various alternatives.” (Cox and Mayo 2010, p. 291):

Continue reading

Categories: Reformers: Prionvac, Statistics | 8 Comments

Metablog: May 31, 2012

Dear Reader: I will be traveling a lot in the next few weeks, and may not get to post much; we’ll see. If I do not reply to comments, I’m not ignoring them; they’re a lot more fun than some of the things I must do now to complete my book, but I need to resist them, especially while traveling and giving seminars.* The rule we’ve followed is for comments to shut after 10 days, but we wanted to allow them still to appear. The blogpeople on Elba forward comments for 10 days, so beyond that it’s just haphazard if I notice them. It’s impossible otherwise to keep this blog up at all, and I would like to. Feel free to call any to my attention (use “can we talk” page or error@vt.edu). If there’s a burning issue, interested readers might wish to poke around (or scour) the multiple layers of goodies on the left hand side of this web page, wherein all manner of foundational/statistical controversies are considered from many years of working in this area. In a recent attempt by Aris Spanos and me to address the age-old criticisms from the perspective of the “error statistical philosophy,” we delineate 13 criticisms.  I list them below. Continue reading

Categories: Metablog, Philosophy of Statistics, Statistics | 10 Comments

Painting-by-Number #1

In an exchange with an anonymous commentator, responding to my May 23 blog post, I was asked what I meant by an argument (in favor of a method) based on “painting-by-number” reconstructions. “Painting-by-numbers” refers to reconstructing an inference or application of method X (analogous to a method of painting) to make it consistent with an application of method Y (painting with a paint-by-number kit). The locution comes from EGEK (Mayo 1996) and alludes to a kind of argument sometimes used to garner “success stories” for a method: i.e., show that any case, given enough latitude, could be reconstructed so as to be an application of (or at least consistent with) the preferred method.

Referring to specific applications of error-statistical methods, I wrote in EGEK (pp. 100-101):

We may grant that experimental inferences, once complete, may be reconstructed so as to be seen as applications of Bayesian methods—even though that would be stretching it in many cases. My point is that the inferences actually made are applications of standard non-Bayesian methods [e.g., significance tests]. . . . The point may be made with an analogy. Imagine the following conversation: Continue reading

Categories: Statistics | 12 Comments

An Error-Statistical Philosophy of Evidence (PH500, LSE Seminar)

This short paper, together with the response to comments by Casella and McCoy, may provide an OK overview of some issues/ideas, and as I’m making it available for my upcoming PH500 seminar*, I thought I’d post it too. The paper itself was a 15-minute presentation at the Ecological Society of America in 1998; my response to criticisms, around the same length, was requested much later. While in some ways the time lag shows, e.g., McCoy’s reference to “reductionist” accounts–part of the popular constructive leanings of the time; scant mention of Bayesian developments taking place around then, it is simple and short and non-technical **. Also, as I should hope, my own views have gone considerably beyond what I wrote then.

(Taper and Lele did an excellent job with this volume, as long as it took, particularly interspersing the commentary. I recommend it!***)

Mayo, D. (2004). “An Error-Statistical Philosophy of Evidence” in M. Taper and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press: 79-118 (with discussion). Continue reading

Categories: philosophy of science, Statistics | 18 Comments

Does the Bayesian Diet Call For Error-Statistical Supplements?

Some of the recent comments on my May 20 post lead me to point us back to my earlier (April 15) post on dynamic Dutch books, and continue where Howson left off:

“And where does this conclusion leave the Bayesian theory? ….I claim that nothing valuable is lost by abandoning updating rules.  The idea that the only updating policy sanctioned by the Bayesian theory is updating by conditionalization was untenable even on its own terms, since the learning of each conditioning proposition could not  itself have been by conditionalization.” (Howson 1997, 289).

So a Bayesian account requires a distinct account of empirical learning in order to learn “of each conditioning proposition” (propositions which may be statistical hypotheses).  This was my argument in EGEK (1996, 87)*. And this other account, I would go on to suggest, should ensure the claims (which I prefer to “propositions”) are reliably warranted or severely corroborated.

*Error and the Growth of Experimental Knowledge (Mayo 1996):  Scroll down to chapter 3.

Categories: Statistics | 32 Comments

Betting, Bookies and Bayes: Does it Not Matter?

On Gelman’s blog today he offers a simple rejection of Dutch Book arguments for Bayesian inference:

“I have never found this argument appealing, because a bet is a game not a decision. A bet requires 2 players, and one player has to offer the bets.”

But what about dynamic Bayesian Dutch book arguments which are thought to be the basis for advocating updating by Bayes’s theorem?  Betting scenarios, even if hypothetical, are often offered as the basis for making Bayesian measurements operational, and for claiming Bayes’s rule is a warranted representation of updating “uncertainty”. The question I had asked in an earlier (April 15) post (and then placed on hold) is: Does it not matter that Bayesians increasingly seem to debunk  betting representations?

Categories: Statistics | 27 Comments

Do CIs Avoid Fallacies of Tests? Reforming the Reformers

The one method that enjoys the approbation of the New Reformers is that of confidence intervals (See May 12, 2012, and links). The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter µ” (Cumming 2012, p. 69). He recommends prespecified confidence levels .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value.” (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it?   Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+. Continue reading

Categories: Statistics | 14 Comments

Saturday Night Brainstorming & Task Forces: The TFSI on NHST

Each year leaders of the movement to reform statistical methodology in psychology and related social sciences get together for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like the APA publication manual to adopt, and strategize about how to institutionalize improvements to statistical methodology. See my discussion of the New Reformers in the blogposts of Sept 26, Oct. 3 and 4, 2011[i]

While frustrated that the TFSI has still not banned null hypothesis significance testing (NHST), since attempts going back to at least 1996, the reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?  

Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers, somewhere near an airport in a major metropolitan area.[ii] Please see 2015 update here. Continue reading

Categories: Statistics | 7 Comments

Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence,”

Dear Reader:  I am typing in some excerpts from a letter Stephen Senn shared with me in relation to my April 28, 2012 blogpost.  It is a letter to the editor of Statistics in Medicine  in response to S. Goodman. It contains several important points that get to the issues we’ve been discussing, and you may wish to track down the rest of it. Sincerely, D. G. Mayo

Statist. Med. 2002; 21:2437–2444  https://errorstatistics.com/wp-content/uploads/2013/12/goodman.pdf

 STATISTICS IN MEDICINE, LETTER TO THE EDITOR

A comment on replication, p-values and evidence: S.N. Goodman, Statistics in Medicine 1992; 11:875–879

From: Stephen Senn*

Some years ago, in the pages of this journal, Goodman gave an interesting analysis of ‘replication probabilities’ of p-values. Specifically, he considered the possibility that a given experiment had produced a p-value that indicated ‘significance’ or near significance (he considered the range p=0.10 to 0.001) and then calculated the probability that a study with equal power would produce a significant result at the conventional level of significance of 0.05. He showed, for example, that given an uninformative prior, and (subsequently) a resulting p-value that was exactly 0.05 from the first experiment, the probability of significance in the second experiment was 50 per cent. A more general form of this result is as follows. If the first trial yields p=α then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5. Continue reading
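The 50 per cent figure can be reproduced with a short sketch, under assumptions that I should flag: both trials are summarized by a Normal test statistic with unit variance, the prior on the true (standardized) effect is flat, and p-values are one-sided. Under those assumptions the predictive distribution of the second statistic is Normal with mean equal to the first observed statistic and variance 2. The function name is mine, not Goodman’s or Senn’s.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def replication_probability(p_first, alpha=0.05):
    """Probability that a second, equally powered trial is significant at alpha.

    Assumes a flat prior, so the second statistic is predicted to be
    Normal(z_first, variance 2), and one-sided p-values throughout.
    """
    z_first = Z.inv_cdf(1 - p_first)  # statistic observed in the first trial
    z_crit = Z.inv_cdf(1 - alpha)     # significance threshold for the second
    return Z.cdf((z_first - z_crit) / sqrt(2))


print(replication_probability(0.05))  # first trial exactly at the threshold
```

When the first p-value equals α, the predictive distribution is centred exactly on the threshold, so the probability is one half whatever the value of α.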

Categories: Statistics | 8 Comments
