Monthly Archives: July 2015

Telling What’s True About Power, if practicing within the error-statistical tribe



Suppose you are reading about a statistically significant result x (at level α) from a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ:   H0: µ ≤  0 against H1: µ >  0. 

I have heard some people say [0]:

A. If the test’s power to detect alternative µ’ is very low, then the statistically significant x is poor evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s poor evidence that µ > µ’).* See point on language in notes.

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.

I have heard other people say:

B. If the test’s power to detect alternative µ’ is very low, then the statistically significant x is good evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that  µ > µ’).

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.

Which is correct, from the perspective of the (error statistical) philosophy, within which power and associated tests are defined?

Allow that the test assumptions are adequately met. I have often said on this blog, and I repeat, the most misunderstood and abused (or unused) concept from frequentist statistics is that of a test’s power to reject the null hypothesis under the assumption that alternative µ’ is true: POW(µ’). I deliberately write it in this correct manner because it is faulty to speak of the power of a test without specifying against what alternative it’s to be computed. It will also get you into trouble if you define power as in the first premise in a recent post: Continue reading
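To make the notation concrete, here is a minimal sketch of how POW(µ’) could be computed for the one-sided test T+ described above. The function names and the numbers (α = 0.025, σ = 1, n = 25) are illustrative assumptions, not values taken from the post. The test is taken to reject when the sample mean reaches the cutoff 0 + z_α·σ/√n.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and sd 1


def cutoff(alpha, sigma, n, mu0=0.0):
    """Smallest sample mean that rejects H0: mu <= mu0 at level alpha."""
    return mu0 + STD_NORMAL.inv_cdf(1 - alpha) * sigma / sqrt(n)


def power(mu_prime, alpha, sigma, n, mu0=0.0):
    """POW(mu') = P(sample mean >= cutoff), computed under mu = mu'."""
    se = sigma / sqrt(n)  # standard error of the sample mean
    return 1 - STD_NORMAL.cdf((cutoff(alpha, sigma, n, mu0) - mu_prime) / se)


# Hypothetical settings, chosen only for illustration.
alpha, sigma, n = 0.025, 1.0, 25
for mu_prime in (0.0, 0.2, cutoff(alpha, sigma, n), 0.6, 0.8):
    print(f"POW({mu_prime:.3f}) = {power(mu_prime, alpha, sigma, n):.3f}")
```

Two checks follow directly from the definition: the power against µ’ = 0 equals α, and the power against a µ’ equal to the cutoff is .5. The alternative is an explicit argument of `power`, which is why POW(µ’) is always written with its alternative.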

Categories: confidence intervals and tests, power, Statistics | 36 Comments

Stephen Senn: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics


Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

This post first appeared here. An issue sometimes raised about randomized clinical trials is the problem of indefinitely many confounders. This, for example, is what John Worrall has to say:

Even if there is only a small probability that an individual factor is unbalanced, given that there are indefinitely many possible confounding factors, then it would seem to follow that the probability that there is some factor on which the two groups are unbalanced (when remember randomly constructed) might for all anyone knows be high. (Worrall J. What evidence is evidence-based medicine? Philosophy of Science 2002; 69: S316-S330: see p. S324 )

It seems to me, however, that this overlooks four matters. The first is that it is not indefinitely many variables we are interested in but only one, albeit one we can’t measure perfectly. This variable can be called ‘outcome’. We wish to see to what extent the difference observed in outcome between groups is compatible with the idea that chance alone explains it. The indefinitely many covariates can help us predict outcome but they are only of interest to the extent that they do so. However, although we can’t measure the difference we would have seen in outcome between groups in the absence of treatment, we can measure how much it varies within groups (where the variation cannot be due to differences between treatments). Thus we can say a great deal about random variation to the extent that group membership is indeed random. Continue reading

Categories: RCTs, S. Senn, Statistics | 6 Comments


3 years ago…

MONTHLY MEMORY LANE: 3 years ago: July 2012. I mark in red three posts that seem most apt for general background on key issues in this blog.[1]  This new feature, appearing the last week of each month, began at the blog’s 3-year anniversary in Sept, 2014. (Once again it was tough to pick just 3; please check out others which might interest you, e.g., Schachtman on StatLaw, the machine learning conference on simplicity, the story of Lindley and particle physics, Glymour and so on.)

July 2012

[1] excluding those recently reblogged. Posts that are part of a “unit” or a group of “U-Phils” count as one.

Categories: 3-year memory lane, Statistics | Leave a comment

“Statistical Significance” According to the U.S. Dept. of Health and Human Services (ii)

Mayo, frustrated

Someone linked this to me on Twitter. I thought it was a home blog at first. Surely the U.S. Dept of Health and Human Services can give a better definition than this.

U.S. Department of Health and Human Services
Effective Health Care Program
Glossary of Terms

We know that many of the concepts used on this site can be difficult to understand. For that reason, we have provided you with a glossary to help you make sense of the terms used in Comparative Effectiveness Research. Every word that is defined in this glossary should appear highlighted throughout the Web site…..

Statistical Significance

Definition: A mathematical technique to measure whether the results of a study are likely to be true. Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance. Statistical significance is usually expressed as a P-value. The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true). Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

Example: For example, results from a research study indicated that people who had dementia with agitation had a slightly lower rate of blood pressure problems when they took Drug A compared to when they took Drug B. In the study analysis, these results were not considered to be statistically significant because p=0.2. The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

You can find it here. First of all, one should never use “likelihood” and “probability” in what is to be a clarification of formal terms, as these mean very different things in statistics. Some of the claims given actually aren’t so bad if “likely” takes its statistical meaning, but are all wet if construed as mathematical probability. Continue reading
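For readers who want the formal distinction spelled out, here is a small sketch using a Binomial model. The model and the numbers (n = 10 trials, 3 observed successes, θ = 0.5) are assumptions chosen for illustration and have nothing to do with the glossary’s drug example. A probability distribution fixes the parameter and varies the outcome, so it sums to 1. A likelihood fixes the observed outcome and varies the parameter, and it need not integrate to 1.

```python
from math import comb


def binom_pmf(k, n, theta):
    """P(K = k) for K ~ Binomial(n, theta)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


n, k_obs, theta0 = 10, 3, 0.5  # hypothetical values

# Probability: theta is fixed at theta0, the outcome k varies.
total_prob = sum(binom_pmf(k, n, theta0) for k in range(n + 1))

# Likelihood: the outcome is fixed at k_obs, theta varies over [0, 1].
# Midpoint-rule approximation of the area under the likelihood function.
steps = 10_000
likelihood_area = sum(
    binom_pmf(k_obs, n, (i + 0.5) / steps) for i in range(steps)
) / steps

print(f"sum of probabilities over outcomes: {total_prob:.6f}")
print(f"area under the likelihood in theta: {likelihood_area:.6f}")
```

For this model the area under the likelihood works out to 1/(n + 1), so it is not a probability distribution over θ.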

Categories: P-values, Statistics | 68 Comments

Spot the power howler: α = ß?

Spot the fallacy!

  1. The power of a test is the probability of correctly rejecting the null hypothesis. Write it as 1 – β.
  2. So, the probability of incorrectly rejecting the null hypothesis is β.
  3. But the probability of incorrectly rejecting the null is α (the type 1 error probability).

So α = β.

I’ve actually seen this, and variants on it [i].

[i] Although they didn’t go so far as to reach the final, shocking, deduction.
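A numerical sketch may help in spotting where the chain breaks. It assumes a one-sided Normal test of H0: µ ≤ 0 with known σ, and the settings (α = 0.025, σ = 1, n = 25, alternative µ’ = 0.6) are hypothetical. The point it illustrates is that β is the probability of failing to reject computed under the alternative µ’, while α is the probability of rejecting computed under the null, so the two refer to different events under different hypotheses.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal

# Hypothetical settings, for illustration only.
alpha, sigma, n = 0.025, 1.0, 25
se = sigma / sqrt(n)
c = Z.inv_cdf(1 - alpha) * se  # reject H0 when the sample mean >= c


def prob_reject(mu):
    """P(sample mean >= c), computed under the assumption that the mean is mu."""
    return 1 - Z.cdf((c - mu) / se)


mu_prime = 0.6                        # hypothetical alternative
type1 = prob_reject(0.0)              # alpha: rejecting, computed under mu = 0
pow_mu_prime = prob_reject(mu_prime)  # 1 - beta: rejecting, computed under mu'
beta = 1 - pow_mu_prime               # NOT rejecting, computed under mu'

print(f"alpha = P(reject; mu = 0)            = {type1:.4f}")
print(f"POW   = P(reject; mu = {mu_prime})          = {pow_mu_prime:.4f}")
print(f"beta  = P(fail to reject; mu = {mu_prime})  = {beta:.4f}")
```

The complement of rejecting under µ’ is failing to reject under µ’. It says nothing about how often the test rejects when the null is true, which is what step 2 needs.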


Categories: Error Statistics, power, Statistics | 12 Comments

Higgs discovery three years on (Higgs analysis and statistical flukes)



2015: The Large Hadron Collider (LHC) is back in collision mode in 2015[0]. There’s a 2015 update, a virtual display, and links from ATLAS, one of two detectors at the LHC, here. The remainder is from one year ago (2014). I’m reblogging a few of the Higgs posts at the anniversary of the 2012 discovery. (The first was in this post.) The following was originally “Higgs Analysis and Statistical Flukes: part 2” (from March, 2013).[1]

Some people say to me: “This kind of reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out).[2] Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories.

“Higgs Analysis and Statistical Flukes: part 2”

Everyone was excited when the Higgs boson results were reported on July 4, 2012 indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of excess events of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post (part 1). It is an outsider’s angle on one small aspect of the statistical inferences involved. But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we [philosophers of statistics] should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels. Continue reading
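As a rough guide to what a “5 sigma observed effect” amounts to, the sketch below converts a number of standard deviations into a one-sided tail probability. It assumes the test statistic is standard Normal under background alone, which is a simplification for illustration and not a description of the actual ATLAS or CMS analyses. The function name is hypothetical.

```python
from statistics import NormalDist


def one_sided_tail(z):
    """P(Z >= z) for a standard Normal Z, via symmetry: P(Z >= z) = P(Z <= -z)."""
    return NormalDist().cdf(-z)


for z in (2, 3, 5):
    p = one_sided_tail(z)
    print(f"{z} sigma: P(Z >= {z}) = {p:.3e}  (about 1 in {1 / p:,.0f})")
```

Under this simplified assumption, an excess of 5 standard deviations or more would arise from background alone with probability of roughly 3 in 10 million.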

Categories: Higgs, highly probable vs highly probed, P-values, Severity | Leave a comment

Winner of the June Palindrome contest: Lori Wike


Winner of June 2015 Palindrome Contest: (a dozen book choices)

Lori Wike: Principal bassoonist of the Utah Symphony; Faculty member at University of Utah and Westminster College

Palindrome: Sir, a pain, a madness! Elba gin in a pro’s tipsy end? I know angst, sir! I taste, I demonstrate lemon omelet arts. Nome diet satirists gnaw on kidneys, pits or panini. Gab less: end a mania, Paris!

Book choice: Conjectures and Refutations (K. Popper 1962, New York: Basic Books)

The requirement: A palindrome using “demonstrate” (and Elba, of course).

Bio: Lori Wike is principal bassoonist of the Utah Symphony and is on the faculty of the University of Utah and Westminster College. She holds a Bachelor of Music degree from the Eastman School of Music and a Master of Arts degree in Comparative Literature from UC-Irvine. Continue reading

Categories: Palindrome | Leave a comment

Larry Laudan: “When the ‘Not-Guilty’ Falsely Pass for Innocent”, the Frequency of False Acquittals (guest post)

Professor Larry Laudan
Lecturer in Law and Philosophy
University of Texas at Austin

“When the ‘Not-Guilty’ Falsely Pass for Innocent” by Larry Laudan

While it is a belief deeply ingrained in the legal community (and among the public) that false negatives are much more common than false positives (a 10:1 ratio being the preferred guess), empirical studies of that question are very few and far between. While false convictions have been carefully investigated in more than two dozen studies, there are virtually no well-designed studies of the frequency of false acquittals. The disinterest in the latter question is dramatically borne out by looking at discussions among intellectuals of the two sorts of errors. (A search of Google Books identifies some 6.3k discussions of the former and only 144 treatments of the latter in the period from 1800 to now.) I’m persuaded that it is time we brought false negatives out of the shadows, not least because each such mistake carries significant potential harms, typically inflicted by falsely-acquitted recidivists who are on the streets instead of in


In criminal law, false negatives occur under two circumstances: when a guilty defendant is acquitted at trial and when an arrested, guilty defendant has the charges against him dropped or dismissed by the judge or prosecutor. Almost no one tries to measure how often either type of false negative occurs. That is partly understandable, given the fact that the legal system prohibits a judicial investigation into the correctness of an acquittal at trial; the double jeopardy principle guarantees that such acquittals are fixed in stone. Thanks in no small part to the general societal indifference to false negatives, there have been virtually no efforts to design empirical studies that would yield reliable figures on false acquittals. That means that my efforts here to estimate how often they occur must depend on a plethora of indirect indicators. With a bit of ingenuity, it is possible to find data that provide strong clues as to approximately how often a truly guilty defendant is acquitted at trial and in the pre-trial process. The resulting inferences are not precise and I will try to explain why as we go along. As we look at various data sources not initially designed to measure false negatives, we will see that they nonetheless provide salient information about when and why false acquittals occur, thereby enabling us to make an approximate estimate of their frequency.

My discussion of how to estimate the frequency of false negatives will fall into two parts, reflecting the stark differences between the sources of errors in pleas and the sources of error in trials. (All the data to be cited here deal entirely with cases of crimes of violence.) Continue reading

Categories: evidence-based policy, false negatives, PhilStatLaw, Statistics | 9 Comments
