Higgs discovery two years on (2: Higgs analysis and statistical flukes)

I’m reblogging a few of the Higgs posts, with some updated remarks, on this two-year anniversary of the discovery. (The first was in my last post.) The following was originally “Higgs Analysis and Statistical Flukes: part 2” (from March 2013).[1]

Some people say to me: “This kind of reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”, as if statistical inferences in HEP were radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out).[2] Even with high-level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees of support/belief/plausibility to propositions, models, or theories.

“Higgs Analysis and Statistical Flukes: part 2”

Everyone was excited when the Higgs boson results were reported on July 4, 2012, indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of excess events of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post (part 1). It is an outsider’s angle on one small aspect of the statistical inferences involved. But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we [philosophers of statistics] should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels.

Here I keep close to an official report from ATLAS, in which researchers define a “global signal strength” parameter “such that μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the SM Higgs boson signal in addition to the background” (where SM is the Standard Model). The statistical test may be framed as a one-sided test, where the test statistic (which is actually a ratio) records differences in the positive direction, in standard deviation (sigma) units. Reports such as

Pr(Test T would yield at least a 5 sigma excess; H0: background only) = extremely low

are deduced from the sampling distribution of the test statistic, fortified with much cross-checking of results (e.g., by modeling and simulating relative frequencies of observed excesses generated with “Higgs signal +background” compared to background alone).  The inferences, even the formal statistical ones, go beyond p-value reports. For instance, they involve setting lower and upper bounds such that values excluded are ruled out with high severity, to use my term. But the popular report is in terms of the observed 5 sigma excess in an overall test T, and that is mainly what I want to consider here.
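To make that kind of cross-check concrete, here is a toy sketch in Python. The numbers (100 expected background events, 50 signal events, a Normal approximation to the count fluctuations) are mine and purely illustrative, nothing like the actual ATLAS analysis:

```python
import random
from math import sqrt

random.seed(42)
b, s = 100.0, 50.0        # hypothetical expected background and signal counts
n_sim = 100_000

def draw_excesses(mean_count):
    # Normal approximation to the count fluctuations,
    # expressed as excesses over background b in sigma units
    return [(random.gauss(mean_count, sqrt(mean_count)) - b) / sqrt(b)
            for _ in range(n_sim)]

d_bkg = draw_excesses(b)       # H0: background only (mu = 0)
d_sig = draw_excesses(b + s)   # mu = 1: signal + background

for c in (2, 3, 5):
    f_bkg = sum(d >= c for d in d_bkg) / n_sim
    f_sig = sum(d >= c for d in d_sig) / n_sim
    print(f"{c} sigma: background {f_bkg:.5f}, signal+background {f_sig:.5f}")
```

Under background alone, 2 sigma excesses turn up a few percent of the time while 5 sigma excesses essentially never do; with a genuine signal present, large excesses become routine.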

Error probabilities

In a Neyman-Pearson setting, a cut-off cα is chosen pre-data so that the probability of a type I error is low. In general,

Pr(d(X) > cα; H0) ≤ α

and in particular, alluding to an overall test T:

(1) Pr(Test T yields d(X) > 5 standard deviations; H0) ≤  .0000003.
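Where does the .0000003 come from? It is just the upper-tail area of a standard Normal beyond 5 sigma, computable with the standard library (the function name is my own):

```python
from math import erfc, sqrt

def one_sided_p(sigmas):
    # upper-tail area of a standard Normal beyond `sigmas` standard deviations
    return 0.5 * erfc(sigmas / sqrt(2))

print(one_sided_p(5))  # ≈ 2.87e-07, i.e., the .0000003 in (1)
```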

The test at the same time is designed to ensure a reasonably high probability of detecting global strength discrepancies of interest. (I always use “discrepancy” to refer to parameter magnitudes, to avoid confusion with observed differences).

[Notice these are not likelihoods.] Alternatively, researchers can report observed standard deviations (here, the sigmas), or equivalently, the associated observed statistical significance probability, p0. In general,

Pr(P < p0; H0) ≤ p0

and in particular,

(2) Pr(Test T yields P < .0000003; H0) ≤ .0000003.

For test T to yield a “worse fit” with H0 (a smaller p-value) due to background alone is sometimes called “a statistical fluke” or a “random fluke”, and the probability of so statistically significant a random fluke is ~0. With the March 2013 results, the 5 sigma difference has grown to 7 sigmas.

So probabilistic statements along the lines of (1) and (2) are standard. They allude to sampling distributions, either of test statistic d(X), or the P-value viewed as a random variable. They are scarcely illicit or prohibited. (I return to this in the last section of this post.)
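The claim that the P-value is itself a random variable satisfying Pr(P < p0; H0) ≤ p0 is easy to check by simulation. A toy one-sided z-test (the setup is mine):

```python
import random
from math import erfc, sqrt

random.seed(1)
n = 100_000
# one-sided P-values for z-statistics generated under H0
pvals = [0.5 * erfc(random.gauss(0, 1) / sqrt(2)) for _ in range(n)]

for p0 in (0.5, 0.05, 0.01):
    freq = sum(p < p0 for p in pvals) / n
    print(f"Pr(P < {p0}; H0) is approximately {freq:.4f}")
```

The empirical frequencies track p0, which is exactly what licenses reports like (2).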

An implicit principle of inference or evidence

Admittedly, the move to taking the 5 sigma effect as evidence for a genuine effect (of the Higgs-like sort) results from an implicit principle of evidence that I have been calling the severity principle (SEV). Perhaps the weakest form applies to a statistical rejection or falsification of the null. (I will deliberately use a few different variations on statements that can be made.)

Data x from a test T provide evidence for rejecting H0 (just) to the extent that H0 would (very probably) have survived, were it a reasonably adequate description of the process generating the data (with respect to the question).

It is also captured by a general frequentist principle of evidence (FEV) (Mayo and Cox 2010), a variant on the general idea of severity (SEV) (EGEK 1996, Mayo and Spanos 2006, etc.).

The sampling distribution is computed under the assumption that the production of observed results is similar to the “background alone”, with respect to relative frequencies of signal-like events. (Likewise for computations under hypothesized discrepancies.) The relationship between H0 and the probabilities of outcomes is an intimate one: the various statistical nulls refer to aspects of general types of data generating procedures (for a taxonomy, see Cox 1958, 1977). “H0 is true” is a shorthand for a very long statement that H0 is an approximately adequate model of a specified aspect of the process generating the data in the context. (This relates to statistical models and hypotheses living “lives of their own”.)

Severity and the detachment of inferences

The sampling distributions serve to give counterfactuals. In this case they tell us what it would be like, statistically, were the mechanism generating the observed signals similar to H0.[i] While one would want to go on to consider the probability that test T yields so statistically significant an excess under various alternatives to μ = 0, this suffices for the present discussion. Sampling distributions can be used to arrive at error probabilities that are relevant for understanding the capabilities of the test process, in relation to something we want to find out. Since a relevant test statistic is a function of the data and quantities about which we want to learn, the associated sampling distribution is the key to inference. (This is why bootstrap, and other types of resampling, work when one has a random sample from the process or population of interest.)
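On that parenthetical point, a minimal bootstrap sketch (the Gaussian sample and its parameters are made up for illustration) shows how resampling a single random sample approximates the sampling distribution of an estimator:

```python
import random
from statistics import mean, stdev

random.seed(7)
sample = [random.gauss(10, 2) for _ in range(50)]  # one random sample from the process

# resample with replacement to approximate the sampling distribution of the mean
boot_means = [mean(random.choices(sample, k=len(sample))) for _ in range(5000)]

print(f"bootstrap estimate of the mean: {mean(boot_means):.2f}")
print(f"bootstrap standard error:       {stdev(boot_means):.2f}")  # roughly 2/sqrt(50)
```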

The severity principle, put more generally:

Data from a test T[ii] provide good evidence for inferring H (just) to the extent that H passes severely with x0, i.e., to the extent that H would (very probably) not have survived the test so well were H false.

(The severity principle can also be made out just in terms of relative frequencies, as with bootstrap re-sampling.) In this case, what is surviving is minimally the non-null. Regardless of the specification of a statistical inference, to assess the severity associated with a claim H requires considering H’s denial: together they exhaust the answers to a given question.

Without making such a principle explicit, some critics assume the argument is all about the reported p-value. The inference actually detached from the evidence can be put in any number of ways, and no uniformity is to be expected or needed:

(3) There is strong evidence for H: a Higgs (or a Higgs-like) particle.

(3)’ They have experimentally demonstrated  H: a Higgs (or Higgs-like) particle.

Or just, infer H.

Doubtless particle physicists would qualify these statements, but nothing turns on that. ((3) and (3)’ are a bit stronger than merely falsifying the null because certain properties of the particle must be shown. I leave this to one side.)

As always, the mere p-value is a pale reflection of the detailed information about the consistency of results that really fortifies the knowledge of a genuine effect. Nor is the precise improbability level what matters. We care about the inferences to real effects (and estimated discrepancies) that are warranted.

Qualifying claims by how well they have been probed

The inference is qualified by the statistical properties of the test, as in (1) and (2), but that does not prevent detaching (3). This much is shown: they are able to experimentally demonstrate the Higgs particle. They can take that much of the problem as solved and move on to other problems of discerning the properties of the particle, and much else that goes beyond our discussion*. There is obeisance to the strict fallibility of every empirical claim, but there is no probability assigned.  Neither is there in day-to-day reasoning, nor in the bulk of scientific inferences, which are not formally statistical. Having inferred (3), granted, one may say informally, “so probably we have experimentally demonstrated the Higgs”, or “probably, the Higgs exists” (?). Or an informal use of “likely” might arise. But whatever these might mean in informal parlance, they are not formal mathematical probabilities. (As often argued on this blog, discussions on statistical philosophy must not confuse these.)

[We can however write, SEV(H) ~1]

The claim in (3) is approximate and limited–as are the vast majority of claims of empirical knowledge and inference–and, moreover, we can say in just what ways. It is recognized that subsequent data will add precision to the magnitudes estimated, and may eventually lead to new and even entirely revised interpretations of the known experimental effects, models and estimates. That is what cumulative knowledge is about. (I sometimes hear people assert, without argument, that modeled quantities, or parameters, used to describe data generating processes are “things in themselves” and are outside the realm of empirical inquiry. This is silly. Else we’d be reduced to knowing only tautologies and maybe isolated instances as to how “I seem to feel now,” attained through introspection.)

Telling what’s true about significance levels

So we grant the critic that something like the severity principle is needed to move from statistical information plus background (theoretical and empirical) to inferences about evidence and inference (and to what levels of approximation). It may be called lots of other things and framed in different ways, and the reader is free to experiment. What we should not grant the critic is any allegation that there should be, or invariably is, a link from a small observed significance level to a small posterior probability assignment to H0. Worse, (1 - the p-value) is sometimes alleged to be the posterior probability accorded to the Standard Model itself! This is neither licensed nor wanted!

If critics (or the p-value police, as Wasserman called them) maintain that Higgs researchers are misinterpreting their significance levels, correct them with the probabilities in (1) and (2). If they say, it is patently obvious that Higgs researchers want to use the p-value as a posterior probability assignment to H0, point out the more relevant and actually attainable [iii] inference that is detached in (3). If they persist that what is really, really wanted is a posterior probability assignment to the inference about the Higgs in (3), ask why? As a formal posterior probability it would require a prior probability on all hypotheses that could explain the data. That would include not just H and H0 but all rivals to the Standard Model, rivals to the data and statistical models, and higher level theories as well. But can’t we just imagine a Bayesian catchall hypothesis?  On paper, maybe, but where will we get these probabilities? What do any of them mean? How can the probabilities even be comparable in different data analyses, using different catchalls and different priors?[iv]

Degrees of belief will not do. Many scientists perhaps had (and have) strong beliefs in the Standard Model before the big collider experiments—given its perfect predictive success. Others may believe (and fervently wish) that it will break down somewhere (showing supersymmetry or whatnot); a major goal of inquiry is learning about viable rivals and how they may be triggered and probed. Research requires an open world not a closed one with all possibilities trotted out and weighed by current beliefs. [v] We need to point up what has not yet been well probed which, by the way, is very different from saying of a theory that it is “not yet probable”.

Those prohibited phrases

One may wish to return to some of the condemned phrases of particular physics reports. Take,

“There is less than a one in a million chance that their results are a statistical fluke”.

This is not to assign a probability to the null, just one of many ways (perhaps not the best) of putting claims about the sampling distribution: the statistical null asserts that H0: background alone adequately describes the process.

H0 does not assert the results are a statistical fluke, but it tells us what we need to determine the probability of observed results “under H0”. In particular, consider all outcomes in the sample space that are further from the null prediction than the observed, in terms of p-values {x: p < p0}. Even when H0 is true, such “signal-like” outcomes may occur. They are p0-level flukes. Were such flukes generated even with moderate frequency under H0, they would not be evidence against H0. But in this case, such flukes occur a teeny tiny proportion of the time. Then SEV enters: if we are regularly able to generate such teeny tiny p-values, we have evidence of a genuine discrepancy from H0.

I am repeating myself, I realize, in the hopes that at least one phrasing will drive the point home. Nor is it even the improbability that substantiates this; it is the fact that an extraordinary set of coincidences would have to have occurred again and again. To nevertheless retain H0 as the source of the data would block learning. (Moreover, they know that if some horrible systematic mistake were made, it would be detected in later data analyses.)

I will not deny that there have been misinterpretations of p-values, but if a researcher has just described performing a statistical significance test, it would be “ungenerous” to twist probabilistic assertions into posterior probabilities. It would be a kind of “confirmation bias” whereby one insists on finding one sentence among very many that could conceivably be misinterpreted Bayesianly.

Triggering, indicating, inferring

As always, the error statistical philosopher would distinguish different questions at multiple stages of the inquiry. The aim of many preliminary steps is “behavioristic” and performance oriented: the goal being to control error rates on the way toward finding excess events or bumps of interest.

I hope it is (more or less) clear that burgundy is new; black is old. If interested, see statistical flukes (part 3).

The original posts of parts 1 and 2 had around 30 comments each; you might want to look at them:

Part 1:

Part 2

*Fisher insisted that to assert a phenomenon is experimentally demonstrable: “[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher, Design of Experiments, 1947, p. 14)
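Fisher’s criterion lends itself to a quick numerical illustration: when a genuine effect is present, a test with decent power “rarely fails” to give a statistically significant result. The effect size, sample size, and threshold below are my own toy choices:

```python
import random
from math import erfc, sqrt

random.seed(3)
n, effect, trials = 25, 0.8, 2000  # hypothetical sample size and true effect (in sd units)

def p_value(xs):
    # one-sided z-test of H0: mean = 0, with known sd = 1
    z = sum(xs) / sqrt(len(xs))
    return 0.5 * erfc(z / sqrt(2))

hits = sum(p_value([random.gauss(effect, 1) for _ in range(n)]) < 0.05
           for _ in range(trials))
print(f"fraction of experiments reaching p < 0.05: {hits / trials:.3f}")
```

Nearly every repetition of the experiment reaches significance, which is just Fisher’s “reliable method of procedure” in numerical form.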

New Notes

[1] I plan to do some new work in this arena soon, so I’ll be glad to have comments.

[2] I have often noted that there are other times where we are trying to find evidence to support a previously held position.

REFERENCES (from March, 2013 post):

ATLAS Collaboration  (November 14, 2012),  Atlas Note: “Updated ATLAS results on the signal strength of the Higgs-like boson for decays into WW and heavy fermion final states”, ATLAS-CONF-2012-162.

Cox, D.R. (1958), “Some Problems Connected with Statistical Inference,” Annals of Mathematical Statistics, 29: 357–72.

Cox, D.R. (1977), “The Role of Significance Tests (with Discussion),” Scandinavian Journal of Statistics, 4: 49–70.

Mayo, D.G. (1996), Error and the Growth of Experimental Knowledge, University of Chicago Press, Chicago.

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 247-275.

Mayo, D.G., and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal of Philosophy of Science, 57: 323–357.


Original notes:

[i] This is a bit stronger than merely falsifying the null here, because certain features of the particle discerned must also be shown. I leave details to one side.

[ii] Which almost always refers to a set of tests, not just one.

[iii] I sense that some Bayesians imagine P(H) is more “hedged” than to actually infer (3). But the relevant hedging, the type we can actually attain, is  given by an assessment of severity or corroboration or the like. Background enters via a repertoire of information about experimental designs, data analytic techniques, mistakes and flaws to be wary of, and a host of theories and indications about which aspects have/have not been severely probed. Many background claims enter to substantiate the error probabilities; others do not alter them.

[iv]In aspects of the modeling, researchers make use of known relative frequencies of events (e.g., rates of types of collisions) that lead to legitimate, empirically based, frequentist “priors” if one wants to call them that.

[v] After sending out the letter, prompted by Lindley, O’Hagan wrote up a synthesis



“Statistical Science and Philosophy of Science: where should they meet?”


Four score years ago (!) we held the conference “Statistical Science and Philosophy of Science: Where Do (Should) They Meet?” at the London School of Economics, Center for the Philosophy of Natural and Social Science (CPNSS), where I’m a visiting professor.[1] Many of the discussions on this blog grew out of contributions from the conference, and conversations initiated soon after. The conference site is here; my paper on the general question is here.[2]

My main contribution was “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations” SS & POS 2. It begins like this: 

1. Comedy Hour at the Bayesian Retreat[3]

Overheard at the comedy hour at the Bayesian retreat: Did you hear the one about the frequentist…


Fallacies of statistics & statistics journalism, and how to avoid them: Summary & Slides Day #8 (Phil 6334)



We spent the first half of Thursday’s seminar discussing the Fisher, Neyman, and E. Pearson “triad”[i]. So, since it’s Saturday night, join me in rereading for the nth time these three very short articles. The key issues were: error of the second kind, behavioristic vs evidential interpretations, and Fisher’s mysterious fiducial intervals. Although we often hear exaggerated accounts of the differences in the Fisherian vs Neyman-Pearson (N-P) methodology, in fact N-P were simply providing Fisher’s tests with a logical ground (even though other foundations for tests are still possible), and Fisher welcomed this gladly. Notably, with the single null hypothesis, N-P showed that it was possible to have tests where the probability of rejecting the null when true exceeded the probability of rejecting it when false. Hacking called such tests “worse than useless”, and N-P developed a theory of testing that avoids such problems. Statistical journalists who report on the alleged “inconsistent hybrid” (a term popularized by Gigerenzer) should recognize the extent to which the apparent disagreements on method reflect professional squabbling between Fisher and Neyman after 1935. [A recent example is a Nature article by R. Nuzzo in ii below.] The two types of tests are best seen as asking different questions in different contexts. They both follow error-statistical reasoning.


New SEV calculator (guest app: Durvasula)

Karthik Durvasula, a blog follower[i], sent me a highly apt severity app that he created:
I have his permission to post it or use it for pedagogical purposes, so since it’s Saturday night, go ahead and have some fun with it. Durvasula had the great idea of using it to illustrate howlers. Also, I would add, to discover them.
It follows many of the elements of the Excel Sev Program discussed recently, but it’s easier to use.* (I’ll add some notes about the particular claim (i.e., discrepancy) for which SEV is being computed later on.)
*If others want to tweak or improve it, he might pass on the source code (write to me on this).
[i] I might note that Durvasula was the winner of the January palindrome contest.
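For anyone curious about the bare formula such calculators implement in the simplest case: for a one-sided Normal test with known σ, the severity for inferring μ > μ1 after observing mean x̄0 is SEV(μ > μ1) = Pr(X̄ ≤ x̄0; μ = μ1). A minimal sketch of my own (not Durvasula’s app; the numbers are illustrative):

```python
from math import erf, sqrt

def severity(xbar, mu1, sigma=1.0, n=100):
    # SEV(mu > mu1) = Pr(Xbar <= observed xbar; mu = mu1),
    # i.e., the standard Normal CDF at (xbar - mu1) / (sigma / sqrt(n))
    z = (xbar - mu1) / (sigma / sqrt(n))
    return 0.5 * (1 + erf(z / sqrt(2)))

# observed mean 0.2 with sigma = 1, n = 100 (so the sd of the mean is 0.1)
for mu1 in (0.0, 0.1, 0.2, 0.3):
    print(f"SEV(mu > {mu1}) = {severity(0.2, mu1):.3f}")
```

Here the claim μ > 0 passes with high severity (about 0.977), while μ > 0.3 is poorly warranted (about 0.159): the kind of discrepancy-by-discrepancy assessment severity is meant to give.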

Two Severities? (PhilSci and PhilStat)

The blog “It’s Chancy” (Corey Yanofsky) has a post today about “two severities” which warrants clarification. Two distinctions are being blurred: between formal and informal severity assessments, and between a statistical philosophy (something Corey says he’s interested in) and its relevance to philosophy of science (which he isn’t). I call the latter an error statistical philosophy of science. The former requires both formal, semi-formal and informal severity assessments. Here’s his post:

In the comments to my first post on severity, Professor Mayo noted some apparent and some actual misstatements of her views. To avert misunderstandings, she directed readers to two of her articles, one of which opens by making this distinction:

“Error statistics refers to a standpoint regarding both (1) a general philosophy of science and the roles probability plays in inductive inference, and (2) a cluster of statistical tools, their interpretation, and their justification.”

In Mayo’s writings I see  two interrelated notions of severity corresponding to the two items listed in the quote: (1) an informal severity notion that Mayo uses when discussing philosophy of science and specific scientific investigations, and (2) Mayo’s formalization of severity at the data analysis level.

One of my besetting flaws is a tendency to take a narrow conceptual focus to the detriment of the wider context. In the case of Severity, part one, I think I ended up making claims about severity that were wrong. I was narrowly focused on severity in sense (2) — in fact, on one specific equation within (2) — but used a mish-mash of ideas and terminology drawn from all of my readings of Mayo’s work. When read through a philosophy-of-science lens, the result is a distorted and misstated version of severity in sense (1).

As a philosopher of science, I’m a rank amateur; I’m not equipped to add anything to the conversation about severity as a philosophy of science. My topic is statistics, not philosophy, and so I want to warn readers against interpreting Severity, part one as a description of Mayo’s philosophy of science; it’s more of a wordy introduction to the formal definition of severity in sense (2). (It’s Chancy, Jan 11, 2014)

A needed clarification may be found in a post of mine which begins: 

Error statistics: (1) There is a “statistical philosophy” and a philosophy of science. (a) An error-statistical philosophy alludes to the methodological principles and foundations associated with frequentist error-statistical methods. (b) An error-statistical philosophy of science, on the other hand, involves using the error-statistical methods, formally or informally, to deal with problems of philosophy of science: to model scientific inference (actual or rational), to scrutinize principles of inference, and to address philosophical problems about evidence and inference (the problem of induction, underdetermination, warranting evidence, theory testing, etc.).

I assume the interest here* is on the former, (a). I have stated it in numerous ways, but the basic position is that inductive inference—i.e., data-transcending inference—calls for methods of controlling and evaluating error probabilities (even if only approximate). An inductive inference, in this conception, takes the form of inferring hypotheses or claims to the extent that they have been well tested. It also requires reporting claims that have not passed severely, or have passed with low severity. In the “severe testing” philosophy of induction, the quantitative assessment offered by error probabilities tells us not “how probable” but, rather, “how well probed” hypotheses are.  The local canonical hypotheses of formal tests and estimation methods need not be the ones we entertain post data; but they give us a place to start without having to go “the designer-clothes” route.

The post-data interpretations might be formal, semi-formal, or informal.

See also: Staley’s review of Error and Inference (Mayo and Spanos eds.)


A. Spanos lecture on “Frequentist Hypothesis Testing”


Aris Spanos

I attended a lecture by Aris Spanos to his graduate econometrics class here at Va Tech last week[i]. This course, which Spanos teaches every fall, gives a superb illumination of the disparate pieces involved in statistical inference and modeling, and affords clear foundations for how they are linked together. His slides follow the intro section. Some examples with severity assessments are also included.

Frequentist Hypothesis Testing: A Coherent Approach

Aris Spanos

1    Inherent difficulties in learning statistical testing

Statistical testing is arguably the most important, but also the most difficult and confusing, chapter of statistical inference for several reasons, including the following.

(i) The need to introduce numerous new notions, concepts and procedures before one can paint — even in broad brushes — a coherent picture of hypothesis testing.

(ii) The current textbook discussion of statistical testing is both highly confusing and confused. There are several sources of confusion.

  • (a) Testing is conceptually one of the most sophisticated sub-fields of any scientific discipline.
  • (b) Inadequate knowledge by textbook writers who often do not have the technical skills to read and understand the original sources, and have to rely on second-hand accounts of previous textbook writers that are often misleading or just outright erroneous. In most of these textbooks hypothesis testing is poorly explained as an idiot’s guide to combining off-the-shelf formulae with statistical tables like the Normal, the Student’s t, the chi-square, etc., where the underlying statistical model that gives rise to the testing procedure is hidden in the background.
  • (c) The misleading portrayal of Neyman-Pearson testing as essentially decision-theoretic in nature, when in fact the decision-theoretic approach has much greater affinity with Bayesian rather than frequentist inference.
  • (d) A deliberate attempt to distort and cannibalize frequentist testing by certain Bayesian drumbeaters who revel in (unfairly) maligning frequentist inference in their attempts to motivate their preferred view on statistical inference.

(iii) The discussion of frequentist testing is rather incomplete in so far as it has been beleaguered by serious foundational problems since the 1930s. As a result, different applied fields have generated their own secondary literatures attempting to address these problems, but often making things much worse! Indeed, in some fields like psychology it has reached the stage where one has to correct the ‘corrections’ of those chastising the initial correctors!

In an attempt to alleviate problem (i), the discussion that follows uses a sketchy historical development of frequentist testing. To ameliorate problem (ii), the discussion includes ‘red flag’ pointers (¥) designed to highlight important points that shed light on certain erroneous interpretations or misleading arguments. The discussion will pay special attention to (iii), addressing some of the key foundational problems.

[i] It is based on Ch. 14 of Spanos (1999), Probability Theory and Statistical Inference, Cambridge University Press.[ii]

[ii] You can win a free copy of this 700+ page text by creating a simple palindrome!


A critical look at “critical thinking”: deduction and induction

I’m cleaning away some cobwebs around my old course notes, as I return to teaching after 2 years off (since I began this blog). The change of technology alone over a mere 2 years (at least here at Super Tech U) might be enough to earn me techno-dinosaur status: I knew “Blackboard” but now it’s “Scholar” of which I know zilch. The course I’m teaching is supposed to be my way of bringing “big data” into introductory critical thinking in philosophy! No one can be free of the “sexed up term for statistics,” Nate Silver told us (here and here), and apparently all the college Deans & Provosts have followed suit. Of course I’m (mostly) joking; and it was my choice.

Anyway, the course is a nostalgic trip back to critical thinking. Stepping back from the grown-up metalogic and advanced logic I usually teach, hop-skipping over baby logic, whizzing past toddler and infant logic…. and arriving at something akin to what R.A. Fisher dubbed “the study of the embryology of knowledge” (1935, 39) (a kind of ‘fetal logic’?) which, in its very primitiveness, actually demands a highly sophisticated analysis. In short, it’s turning out to be the same course I taught nearly a decade ago! (but with a new book and new twists). But my real point is that the hodge-podge known as “critical thinking,” were it seriously considered, requires getting to grips with some very basic problems that we philosophers, with all our supposed conceptual capabilities, have left unsolved. (I am alluding to Gandenberger’s remark). I don’t even think philosophers are working on the problem (these days). (Are they?)

I refer, of course, to our inadequate understanding of how to relate deductive and inductive inference, assuming the latter to exist (which I do)—whether or not one chooses to call its study a “logic”[i]. [That is, even if one agrees with the Popperians that the only logic is deductive logic, there may still be such a thing as a critical scrutiny of the approximate truth of premises, without which no inference is ever detached even from a deductive argument. This is also required for Popperian corroboration or well-testedness.]

We (and our textbooks) muddle along with vague attempts to see inductive arguments as more or less parallel to deductive ones, only with probabilities someplace or other. I’m not saying I have easy answers, I’m saying I need to invent a couple of new definitions in the next few days that can at least survive the course. Maybe readers can help.


I view ‘critical thinking’ as developing methods for critically evaluating the (approximate) truth or adequacy of the premises which may figure in deductive arguments. These methods would themselves include both deductive and inductive or “ampliative” arguments. Deductive validity is a matter of form alone, and so philosophers are stuck on the idea that inductive logic would have a formal rendering as well. But this simply is not the case. Typical attempts are arguments with premises that take overly simple forms:

If all (or most) J’s were observed to be K’s, then the next J will be a K, at least with a probability p.

To evaluate such a claim (essentially the rule of enumerative induction) requires context-dependent information (about the nature and selection of the K and J properties, their variability, the “next” trial, and so on). Besides, most interesting ampliative inferences are to generalizations and causal claims, not mere predictions to the next J. The problem isn’t that an algorithm couldn’t evaluate such claims, but that the evaluation requires context-dependent information as to how the ampliative leap can go wrong. Yet our most basic texts speak as if potentially warranted inductive arguments are like potentially sound deductive arguments, more or less. But it’s not easy to get the “more or less” right, for any given example, while still managing to say anything systematic and general. That is essentially the problem…..

The age-old definition of argument that we all learned from Irving Copi still serves: a group of statements, one of which (the conclusion) is claimed to follow from one or more others (the premises) which are regarded as supplying evidence for the truth of that one. This is written:

P1, P2,…Pn/ ∴ C.

In a deductively valid argument, if the premises are all true then, necessarily, the conclusion is true. To use the “⊨” (double turnstile) symbol:[ii]

 P1, P2,…Pn ⊨  C.

Does this mean:

 P1, P2,…Pn/ ∴ necessarily C?

No, because we do not detach “necessarily C”, which would suggest C was a necessary claim (i.e., true in all possible worlds). “Necessarily” qualifies “⊨”, the very relationship between premises and conclusion:

It’s logically impossible to have all true premises and a false conclusion, on pain of logical contradiction.

We should see it (i.e., deductive validity) as qualifying the process of “inferring,” as opposed to the “inference” that is detached–the statement  placed to the right of “⊨”. A valid argument is a procedure of inferring that is 100% reliable, in the sense that if the premises are all true, then 100% of the time the conclusion is true.

Deductively Valid Argument: Three equivalent expressions:

(D-i) If the premises are all true, then necessarily, the conclusion is true.
(I.e., if the conclusion is false, then (necessarily) one of premises is false.)

(D-ii) It’s (logically) impossible for the premises to be true and the conclusion false.
(I.e., to have the conclusion false with the premises true leads to a logical contradiction, A & ~A.)

(D-iii) The argument maps true premises into a true conclusion with 100% reliability.
(I.e., if the premises are all true, then 100% of the time the conclusion is true).

(Deductively) Sound argument:  deductively valid + premises are true/approximately true.
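Because (D-ii) quantifies over every assignment of truth values, validity for propositional argument forms can be checked mechanically. A toy sketch (my own illustration, not from any textbook; the helper names are invented):

```python
from itertools import product

def valid(premises, conclusion, n_vars):
    """An argument form is valid (in the D-ii sense) iff no truth
    assignment makes every premise true and the conclusion false."""
    for vals in product([True, False], repeat=n_vars):
        if all(p(*vals) for p in premises) and not conclusion(*vals):
            return False
    return True

def implies(a, b):
    return (not a) or b

# Modus ponens: P, P -> Q, therefore Q  (valid)
print(valid([lambda p, q: p, lambda p, q: implies(p, q)],
            lambda p, q: q, 2))

# Affirming the consequent: Q, P -> Q, therefore P  (invalid:
# p = False, q = True makes the premises true and the conclusion false)
print(valid([lambda p, q: q, lambda p, q: implies(p, q)],
            lambda p, q: p, 2))
```

Validity here is a matter of form alone, which is exactly why it can be decided by enumeration.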

All of this is baby logic; but with so-called inductive arguments, terms are not so clear-cut. (“Embryonic logic” demands, at times, more sophistication than grown-up logic.) But maybe the above points can help…


With an inductive argument, the conclusion goes beyond the premises. So it’s logically possible for all the premises to be true and the conclusion false.

Notice that if one had characterized deductive validity as

(a)  P1, P2,…Pn ⊨ necessarily C,

then it would be an easy slide to seeing inductively inferring as:

(b)  P1, P2,…Pn ⊨ probably C.

But (b) is wrongheaded, I say, for the same reason (a) is. Nevertheless, (b) (or something similar) is found in many texts. We (philosophers) should stop foisting ampliative inference into the deductive mould. So, here I go trying out some decent parallels:

In all of the following, “true” will mean “true or approximately true”.

An inductive argument (to inference C) is strong or potentially severe only if any of the following (equivalent) claims hold:[iii]

(I-i) If the conclusion is false, then very probably at least one of the premises is false.

(I-ii) It’s improbable that the premises are all true while the conclusion is false.

(I-iii) The argument leads from true premises to a true conclusion with high reliability (i.e., if the premises are all true, then (1 − α)100% of the time the conclusion is true, for some small α).

To get the probabilities to work, the premises and conclusion must refer to “generic” claims of the given type, but this is the case for deductive arguments as well (otherwise their truth values couldn’t vary). However, the basis for the [I-i through I-iii] requirement, in any of its forms, will not be formal; it will demand a contingent or empirical ground. Even once these are grounded, the approximate truth of the premises is still required; otherwise the argument is only potentially severe. (This is parallel to viewing a valid deductive argument as potentially sound.)
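To make (I-iii) concrete, here is a small simulation (my own toy example, with an invented rule, not anything from the post). The rule infers “this coin favors heads” whenever 100 tosses yield at least 60 heads. If the conclusion were false (a fair coin), it would be improbable that the premises all hold, which is just the (I-ii) requirement:

```python
import random

random.seed(1)

def premises_true(tosses):
    # The rule's premise: at least 60 of the 100 tosses landed heads.
    return sum(tosses) >= 60

# Estimate how often the premises would hold if the conclusion
# ("the coin favors heads") were false, i.e., under a fair coin.
trials = 100_000
false_alarms = 0
for _ in range(trials):
    tosses = [random.random() < 0.5 for _ in range(100)]
    if premises_true(tosses):
        false_alarms += 1

# The exact binomial probability is about .028, so the rule passes
# from true premises to a true conclusion with high reliability.
print(false_alarms / trials)
```

Note that nothing formal certifies the rule; the low error rate rests on the (contingent) fairness model of the coin, which is the point about empirical grounding above.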

We get the following additional parallel:

Deductively unsound argument:

Denial of (D-i), (D-ii), or (D-iii): it’s logically possible for all its premises to be true and the conclusion false.
One or more of its premises are false.

Inductively weak inference: insevere grounds for C

Denial of (I-i), (I-ii), or (I-iii): the premises would be fairly probable even if C were false.
Its premises are false (not true to a sufficient approximation).

There’s still some “winking” going on, and I’m sure I’ll have to tweak this. What do you think?

Fully aware of how the fuzziness surrounding inductive inference has non-trivially (adversely) influenced the entire research program in philosophy of induction, I’ll want to rethink some elements from scratch, this time around….


So I’m back in my Theban palace high atop the mountains in Blacksburg, Virginia. The move from looking out at the Empire State Building to staring at endless mountain ranges is… calming.[iv]


[i] I do, following Peirce, but it’s an informal not a formal logic (using the terms strictly).

[ii] The double turnstile denotes the “semantic consequence” relationship; the single turnstile, the syntactic (deducibility) relationship. But some students are not so familiar with “turnstiles”.

[iii]I intend these to function equivalently.

[iv] Someone asked me “what’s the biggest difference I find in coming to the rural mountains from living in NYC?” I think the biggest contrast is the amount of space. Not just that I live in a large palace, there’s the tremendous width of grocery aisles: 3 carts wide rather than 1.5 carts wide. I hate banging up against carts in NYC, but this feels like a major highway!

Copi, I.  (1956). Introduction to Logic. New York: Macmillan.

Fisher, R.A.  (1935). The Design of Experiments.  Edinburgh: Oliver & Boyd.



Categories: critical thinking, Severity, Statistics | 28 Comments

P-values as posterior odds?

I don’t know how to explain to this economist blogger that he is erroneously using p-values when he claims that “the odds are” (1 – p)/p that a null hypothesis is false. Maybe others want to jump in here?

On significance and model validation (Lars Syll)

Let us suppose that we as educational reformers have a hypothesis that implementing a voucher system would raise mean test results by 100 points (null hypothesis). Instead, when sampling, it turns out it only raises them by 75 points, with a standard error (telling us how much the mean varies from one sample to another) of 20. Continue reading
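For whatever it’s worth, a short simulation (my own sketch, not anything from either blog) brings out the error: when the null hypothesis is true, a p-value at or below .05 still occurs about 5% of the time. The p-value reports how improbable the data are under H0, not how improbable H0 is, so (1 – p)/p cannot be the odds against the null:

```python
import random
from math import sqrt, erf

def phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

random.seed(0)

# Simulate experiments in which H0: mu <= 0 is TRUE (mu = 0, SD = 1,
# n = 25) and record how often the one-sided p-value is at or below .05.
n, trials = 25, 20_000
small_p = 0
for _ in range(trials):
    xbar = sum(random.gauss(0, 1) for _ in range(n)) / n
    p_value = 1 - phi(xbar * sqrt(n))
    if p_value <= 0.05:
        small_p += 1

# Comes out roughly .05: small p-values turn up at rate alpha even
# though the null is true in every one of these trials.
print(small_p / trials)
```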

Categories: fallacy of non-significance, Severity, Statistics | 36 Comments

Severity Calculator


SEV calculator (with comparisons to p-values, power, CIs)

In the illustration in the Jan. 2 post,

H0: μ ≤ 0 vs. H1: μ > 0

and the standard deviation SD = 1, n = 25, so σx̄ = SD/√n = .2.
Setting α to .025, the cut-off for rejection is .39 (which can be rounded to .4).

Let the observed mean x̄ = .2, a statistically insignificant result (p-value = .16):
SEV(μ < .2) = .5
SEV(μ < .3) = .7
SEV(μ < .4) = .84
SEV(μ < .5) = .93
SEV(μ < .6*) = .975

Some students asked about crunching some of the numbers, so here’s a rather rickety old SEV calculator*. It is limited, rather scruffy-looking (nothing like the pretty visuals others post) but it is very useful. It also shows the Normal curves, how shaded areas change with changed hypothetical alternatives, and gives contrasts with confidence intervals. Continue reading
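The core of such a calculator is only a few lines. Here is a minimal Python sketch of the severity computation for the one-sided Normal test above (my own stripped-down version, not the Excel program; the function names are invented):

```python
from math import sqrt, erf

def phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def sev_upper(mu1, xbar, sd, n):
    """Severity for the claim mu < mu1, given observed mean xbar:
    SEV(mu < mu1) = P(X-bar > xbar; mu = mu1)."""
    se = sd / sqrt(n)
    return 1 - phi((xbar - mu1) / se)

# Reproduce the list above: SD = 1, n = 25, observed mean .2
for mu1 in (0.2, 0.3, 0.4, 0.5, 0.6):
    print(f"SEV(mu < {mu1}) = {sev_upper(mu1, 0.2, 1, 25):.3f}")
```

These agree with the rounded values listed above (e.g., SEV(μ < .4) ≈ .841).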

Categories: Severity, statistical tests | Leave a comment

Severity as a ‘Metastatistical’ Assessment

Some weeks ago I discovered an error* in the upper severity bounds for the one-sided Normal test in section 5 of: “Statistical Science Meets Philosophy of Science Part 2″ SS & POS 2.  The published article has been corrected.  The error was in section 5.3, but I am blogging all of 5.  

(* μo was written where xo should have been!)

5. The Error-Statistical Philosophy

I recommend moving away, once and for all, from the idea that frequentists must ‘sign up’ for either Neyman and Pearson, or Fisherian paradigms. As a philosopher of statistics I am prepared to admit to supplying the tools with an interpretation and an associated philosophy of inference. I am not concerned to prove this is what any of the founders ‘really meant’.

Fisherian simple-significance tests, with their single null hypothesis and at most an idea of  a directional alternative (and a corresponding notion of the ‘sensitivity’ of a test), are commonly distinguished from Neyman and Pearson tests, where the null and alternative exhaust the parameter space, and the corresponding notion of power is explicit. On the interpretation of tests that I am proposing, these are just two of the various types of testing contexts appropriate for different questions of interest. My use of a distinct term, ‘error statistics’, frees us from the bogeymen and bogeywomen often associated with ‘classical’ statistics, and it is to be hoped that that term is shelved. (Even ‘sampling theory’, technically correct, does not seem to represent the key point: the sampling distribution matters in order to evaluate error probabilities, and thereby assess corroboration or severity associated with claims of interest.) Nor do I see that my comments turn on whether one replaces frequencies with ‘propensities’ (whatever they are). Continue reading

Categories: Error Statistics, philosophy of science, Philosophy of Statistics, Severity, Statistics | 5 Comments

An established probability theory for hair comparison? “is not — and never was”


Hypothesis H: “person S is the source of this hair sample,” if indicated by a DNA match, has passed a more severe test than if it were indicated merely by a visual analysis under a microscope. There is a much smaller probability of an erroneous hair match using DNA testing than using the method of visual analysis employed for decades by the FBI.

The Washington Post reported on its latest investigation into flawed statistics behind hair match testimony. “Thousands of criminal cases at the state and local level may have relied on exaggerated testimony or false forensic evidence to convict defendants of murder, rape and other felonies”. Below is an excerpt of the Post article by Spencer S. Hsu.

I asked John Byrd, forensic anthropologist and follower of this blog, what he thought. It turns out that “hair comparisons do not have a well-supported weight of evidence calculation.” (Byrd).  I put Byrd’s note at the end of this post. Continue reading

Categories: Severity, Statistics | 14 Comments

Mayo: (section 5) “StatSci and PhilSci: part 2″

Here is section 5 of my new paper: “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations” SS & POS 2. Sections 1 and 2 are in my last post.*


Categories: Error Statistics, philosophy of science, Philosophy of Statistics, Severity | 5 Comments

Mayo: (first 2 sections) “StatSci and PhilSci: part 2″

Here are the first two sections of my new paper: “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations” SS & POS 2. (Alternatively, go to the RMM page and scroll down to the Sept 26, 2012 entry.)

1. Comedy Hour at the Bayesian Retreat[i]

 Overheard at the comedy hour at the Bayesian retreat: Did you hear the one about the frequentist…

 “who defended the reliability of his radiation reading, despite using a broken radiometer, on the grounds that most of the time he uses one that works, so on average he’s pretty reliable?”


 “who claimed that observing ‘heads’ on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”

Such jests may work for an after-dinner laugh, but if it turns out that, despite being retreads of ‘straw-men’ fallacies, they form the basis of why some statisticians and philosophers reject frequentist methods, then they are not such a laughing matter. But surely the drubbing of frequentist methods could not be based on a collection of howlers, could it? I invite the reader to stay and find out. Continue reading

Categories: Error Statistics, Philosophy of Statistics, Severity | 2 Comments


