# Severity

## Memento & Quiz (on SEV): Excursion 3, Tour I

.

As you enjoy the weekend discussion & concert in the Captain’s Central Limit Library & Lounge, your Tour Guide has prepared a brief overview of Excursion 3 Tour I, and a short (semi-severe) quiz on severity, based on exhibit (i).*

We move from Popper through a gallery on “Data Analysis in the 1919 Eclipse tests of the General Theory of Relativity (GTR)” (3.1) which leads to the main gallery on the origin of statistical tests (3.2) by way of a look at where the main members of our statistical cast are in 1919: Fisher, Neyman and Pearson. From the GTR episode, we identify the key elements of a statistical test–the steps in E.S. Pearson’s opening description of tests in 3.2. The classical testing notions–type I and II errors, power, consistent tests–are shown to grow out of requiring probative tests. The typical (behavioristic) formulation of N-P tests came later. The severe tester breaks out of the behavioristic prison. A first look at the severity construal of N-P tests is in Exhibit (i). Viewing statistical inference as severe testing shows how to do all N-P tests do (and more) while a member of the Fisherian Tribe (3.3). We consider the frequentist principle of evidence FEV and the divergent interpretations that are called for by Cox’s taxonomy of null hypotheses. The last member of the taxonomy–substantively based null hypotheses–returns us to the opening episode of GTR. Continue reading

Categories: Severity, Statistical Inference as Severe Testing

## First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3]

Excursion 3 Exhibit (i)

Exhibit (i) N-P Methods as Severe Tests: First Look (Water Plant Accident)

There’s been an accident at a water plant where our ship is docked, and the cooling system had to be repaired.  It is meant to ensure that the mean temperature of discharged water stays below the temperature that threatens the ecosystem, perhaps not much beyond 150 degrees Fahrenheit. There were 100 water measurements taken at randomly selected times and the sample mean x computed, each with a known standard deviation σ = 10.  When the cooling system is effective, each measurement is like observing X ~ N(150, 102). Because of this variability, we expect different 100-fold water samples to lead to different values of X, but we can deduce its distribution. If each X ~N(μ = 150, 102) then X is also Normal with μ = 150, but the standard deviation of X is only σ/√n = 10/√100 = 1. So X ~ N(μ = 150, 1). Continue reading

## If you think it’s a scandal to be without statistical falsification, you will need statistical tests (ii)

.

1. PhilSci and StatSci. I’m always glad to come across statistical practitioners who wax philosophical, particularly when Karl Popper is cited. Best of all is when they get the philosophy somewhere close to correct. So, I came across an article by Burnham and Anderson (2014) in Ecology:

While the exact definition of the so-called ‘scientific method’ might be controversial, nearly everyone agrees that the concept of ‘falsifiability’ is a central tenant [sic] of empirical science (Popper 1959). It is critical to understand that historical statistical approaches (i.e., P values) leave no way to ‘test’ the alternative hypothesis. The alternative hypothesis is never tested, hence cannot be rejected or falsified!… Surely this fact alone makes the use of significance tests and P values bogus. Lacking a valid methodology to reject/falsify the alternative science hypotheses seems almost a scandal.” (Burnham and Anderson p. 629)

Well I am (almost) scandalized by this easily falsifiable allegation! I can’t think of a single “alternative”, whether in a “pure” Fisherian or a Neyman-Pearson hypothesis test (whether explicit or implicit) that’s not falsifiable; nor do the authors provide any. I grant that understanding testability and falsifiability is far more complex than the kind of popularized accounts we hear about; granted as well, theirs is just a short paper.[1] But then why make bold declarations on the topic of the “scientific method and statistical science,” on falsifiability and testability? Continue reading

## Severity in a Likelihood Text by Charles Rohde

.

I received a copy of a statistical text recently that included a discussion of severity, and this is my first chance to look through it. It’s Introductory Statistical Inference with the Likelihood Function by Charles Rohde from Johns Hopkins. Here’s the blurb:

.

This textbook covers the fundamentals of statistical inference and statistical theory including Bayesian and frequentist approaches and methodology possible without excessive emphasis on the underlying mathematics. This book is about some of the basic principles of statistics that are necessary to understand and evaluate methods for analyzing complex data sets. The likelihood function is used for pure likelihood inference throughout the book. There is also coverage of severity and finite population sampling. The material was developed from an introductory statistical theory course taught by the author at the Johns Hopkins University’s Department of Biostatistics. Students and instructors in public health programs will benefit from the likelihood modeling approach that is used throughout the text. This will also appeal to epidemiologists and psychometricians. After a brief introduction, there are chapters on estimation, hypothesis testing, and maximum likelihood modeling. The book concludes with sections on Bayesian computation and inference. An appendix contains unique coverage of the interpretation of probability, and coverage of probability and mathematical concepts.

It’s welcome to see severity in a statistics text; an example from Mayo and Spanos (2006) is given in detail. The author even says some nice things about it: “Severe testing does a nice job of clarifying the issues which occur when a hypothesis is accepted (not rejected) by finding those values of the parameter (here mu) which are plausible (have high severity[i]) given acceptance. Similarly severe testing addresses the issue of a hypothesis which is rejected…. . ”  I don’t know Rohde, and the book isn’t error-statistical in spirit at all.[ii] In fact, inferences based on error probabilities are often called “illogical” because they take into account cherry-picking, multiple testing, optional stopping and other biasing selection effects that the likelihoodist considers irrelevant. I wish he had used severity to address some of the classic howlers he delineates regarding N-P statistics. To his credit, they are laid out with unusual clarity. For example a rejection of a point null µ= µ0 based on a result that just reaches the 1.96 cut-off for a one-sided test is claimed to license the inference to a point alternative µ= µ’ that is over 6 standard deviations greater than the null. (pp. 49-50). But it is not licensed. The probability of a larger difference than observed, were the data generated under such an alternative is ~1, so the severity associated with such an inference is ~ 0. SEV(µ <µ’) ~1.

[i]Not to quibble, but I wouldn’t say parameter values are assigned severity, but rather that various hypotheses about mu pass with severity. The hypotheses are generally in the form of discrepancies, e..g,µ >µ’

[ii] He’s a likelihoodist from Johns Hopkins. Royall has had a strong influence there (Goodman comes to mind), and elsewhere, especially among philosophers. Bayesians also come back to likelihood ratio arguments, often. For discussions on likelihoodism and the law of likelihood see:

How likelihoodists exaggerate evidence from statistical tests

Breaking the Law of Likelihood ©

Breaking the Law of Likelihood, to keep their fit measures in line A, B

Why the Law of Likelihood is Bankrupt as an Account of Evidence

Royall, R. (2004), “The Likelihood Paradigm for Statistical Evidence” 119-138; Rejoinder 145-151, in M. Taper, and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press.

Click to access 2006Mayo_Spanos_severe_testing.pdf

Categories: Error Statistics, Severity

## Higgs discovery three years on (Higgs analysis and statistical flukes)

.

2015: The Large Hadron Collider (LHC) is back in collision mode in 2015[0]. There’s a 2015 update, a virtual display, and links from ATLAS, one of two detectors at (LHC)) here. The remainder is from one (2014) I’m reblogging a few of the Higgs posts at the anniversary of the 2012 discovery. (The first was in this post.) The following, was originally “Higgs Analysis and Statistical Flukes: part 2″ (from March, 2013).[1]

Some people say to me: “This kind of reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out)[2] Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories.

“Higgs Analysis and Statistical Flukes: part 2”

Everyone was excited when the Higgs boson results were reported on July 4, 2012 indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of excess events of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post (part 1). It is an outsider’s angle on one small aspect of the statistical inferences involved. But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we [philosophers of statistics] should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels. Continue reading

Categories: Higgs, highly probable vs highly probed, P-values, Severity

## “Error statistical modeling and inference: Where methodology meets ontology” A. Spanos and D. Mayo

.

A new joint paper….

“Error statistical modeling and inference: Where methodology meets ontology”

Aris Spanos · Deborah G. Mayo

Abstract: In empirical modeling, an important desideratum for deeming theoretical entities and processes real is that they can be reproducible in a statistical sense. Current day crises regarding replicability in science intertwine with the question of how statistical methods link data to statistical and substantive theories and models. Different answers to this question have important methodological consequences for inference, which are intertwined with a contrast between the ontological commitments of the two types of models. The key to untangling them is the realization that behind every substantive model there is a statistical model that pertains exclusively to the probabilistic assumptions imposed on the data. It is not that the methodology determines whether to be a realist about entities and processes in a substantive field. It is rather that the substantive and statistical models refer to different entities and processes, and therefore call for different criteria of adequacy.

Keywords: Error statistics · Statistical vs. substantive models · Statistical ontology · Misspecification testing · Replicability of inference · Statistical adequacy

To read the full paper: “Error statistical modeling and inference: Where methodology meets ontology.”

Reference: Spanos, A. & Mayo, D. G. (2015). “Error statistical modeling and inference: Where methodology meets ontology.” Synthese (online May 13, 2015), pp. 1-23.

## Heads I win, tails you lose? Meehl and many Popperians get this wrong (about severe tests)!

bending of starlight.

[T]he impressive thing about the 1919 tests of Einstein ‘s theory of gravity] is the risk involved in a prediction of this kind. If observation shows that the predicted effect is definitely absent, then the theory is simply refuted. The theory is incompatible with certain possible results of observation—in fact with results which everybody before Einstein would have expected. This is quite different from the situation I have previously described, [where]..it was practically impossible to describe any human behavior that might not be claimed to be a verification of these [psychological] theories.” (Popper, CR, [p. 36))

Popper lauds Einstein’s General Theory of Relativity (GTR) as sticking its neck out, bravely being ready to admit its falsity were the deflection effect not found. The truth is that even if no deflection effect had been found in the 1919 experiments it would have been blamed on the sheer difficulty in discerning so small an effect (the results that were found were quite imprecise.) This would have been entirely correct! Yet many Popperians, perhaps Popper himself, get this wrong.[i] Listen to Popperian Paul Meehl (with whom I generally agree).

The stipulation beforehand that one will be pleased about substantive theory T when the numerical results come out as forecast, but will not necessarily abandon it when they do not, seems on the face of it to be about as blatant a violation of the Popperian commandment as you could commit. For the investigator, in a way, is doing…what astrologers and Marxists and psychoanalysts allegedly do, playing heads I win, tails you lose.” (Meehl 1978, 821)

No, there is a confusion of logic. A successful result may rightly be taken as evidence for a real effect H, even though failing to find the effect need not be taken to refute the effect, or even as evidence as against H. This makes perfect sense if one keeps in mind that a test might have had little chance to detect the effect, even if it existed. The point really reflects the asymmetry of falsification and corroboration. Popperian Alan Chalmers wrote an appendix to a chapter of his book, What is this Thing Called Science? (1999)(which at first had criticized severity for this) once I made my case. [i] Continue reading

| Tags:

## Higgs discovery two years on (2: Higgs analysis and statistical flukes)

I’m reblogging a few of the Higgs posts, with some updated remarks, on this two-year anniversary of the discovery. (The first was in my last post.) The following, was originally “Higgs Analysis and Statistical Flukes: part 2″ (from March, 2013).[1]

Some people say to me: “This kind of reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out)[2] Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories.  Continue reading

## “Statistical Science and Philosophy of Science: where should they meet?”

Four score years ago (!) we held the conference “Statistical Science and Philosophy of Science: Where Do (Should) They meet?” at the London School of Economics, Center for the Philosophy of Natural and Social Science, CPNSS, where I’m visiting professor [1] Many of the discussions on this blog grew out of contributions from the conference, and conversations initiated soon after. The conference site is here; my paper on the general question is here.[2]

My main contribution was “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations” It begins like this:

1. Comedy Hour at the Bayesian Retreat[3]

Overheard at the comedy hour at the Bayesian retreat: Did you hear the one about the frequentist… Continue reading

## Fallacies of statistics & statistics journalism, and how to avoid them: Summary & Slides Day #8 (Phil 6334)

.

We spent the first half of Thursday’s seminar discussing the FisherNeyman, and E. Pearson “triad”[i]. So, since it’s Saturday night, join me in rereading for the nth time these three very short articles. The key issues were: error of the second kind, behavioristic vs evidential interpretations, and Fisher’s mysterious fiducial intervals. Although we often hear exaggerated accounts of the differences in the Fisherian vs Neyman-Pearson (NP) methodology, in fact, N-P were simply providing Fisher’s tests with a logical ground (even though other foundations for tests are still possible), and Fisher welcomed this gladly. Notably, with the single null hypothesis, N-P showed that it was possible to have tests where the probability of rejecting the null when true exceeded the probability of rejecting it when false. Hacking called such tests “worse than useless”, and N-P develop a theory of testing that avoids such problems. Statistical journalists who report on the alleged “inconsistent hybrid” (a term popularized by Gigerenzer) should recognize the extent to which the apparent disagreements on method reflect professional squabbling between Fisher and Neyman after 1935 [A recent example is a Nature article by R. Nuzzo in ii below]. The two types of tests are best seen as asking different questions in different contexts. They both follow error-statistical reasoning. Continue reading

| Tags:

## New SEV calculator (guest app: Durvasula)

Karthik Durvasula, a blog follower[i], sent me a highly apt severity app that he created: https://karthikdurvasula.shinyapps.io/Severity_Calculator/
updated June,
I have his permission to post it or use it for pedagogical purposes, so since it’s Saturday night, go ahead and have some fun with it. Durvasula had the great idea of using it to illustrate howlers. Also, I would add, to discover them.
It follows many of the elements of the Excel Sev Program discussed recently, but it’s easier to use.* (I’ll add some notes about the particular claim (i.e, discrepancy) for which SEV is being computed later on).
*If others want to tweak or improve it, he might pass on the source code (write to me on this).
[i] I might note that Durvasula was the winner of the January palindrome contest.
Categories: Severity, Statistics

## Two Severities? (PhilSci and PhilStat)

The blog “It’s Chancy” (Corey Yanofsky) has a post today about “two severities” which warrants clarification. Two distinctions are being blurred: between formal and informal severity assessments, and between a statistical philosophy (something Corey says he’s interested in) and its relevance to philosophy of science (which he isn’t). I call the latter an error statistical philosophy of science. The former requires both formal, semi-formal and informal severity assessments. Here’s his post:

In the comments to my first post on severity, Professor Mayo noted some apparent and some actual misstatements of her views.To avert misunderstandings, she directed readers to two of her articles, one of which opens by making this distinction:

“Error statistics refers to a standpoint regarding both (1) a general philosophy of science and the roles probability plays in inductive inference, and (2) a cluster of statistical tools, their interpretation, and their justiﬁcation.”

In Mayo’s writings I see  two interrelated notions of severity corresponding to the two items listed in the quote: (1) an informal severity notion that Mayo uses when discussing philosophy of science and specific scientific investigations, and (2) Mayo’s formalization of severity at the data analysis level.

One of my besetting flaws is a tendency to take a narrow conceptual focus to the detriment of the wider context. In the case of Severity, part one, I think I ended up making claims about severity that were wrong. I was narrowly focused on severity in sense (2) — in fact, on one specific equation within (2) — but used a mish-mash of ideas and terminology drawn from all of my readings of Mayo’s work. When read through a philosophy-of-science lens, the result is a distorted and misstated version of severity in sense (1) .

As a philosopher of science, I’m a rank amateur; I’m not equipped to add anything to the conversation about severity as a philosophy of science. My topic is statistics, not philosophy, and so I want to warn readers against interpreting Severity, part one as a description of Mayo’s philosophy of science; it’s more of a wordy introduction to the formal definition of severity in sense (2).[It’s Chancy, Jan 11, 2014)

A needed clarification may be found in a post of mine which begins:

Error statistics: (1) There is a “statistical philosophy” and a philosophy of science. (a) An error-statistical philosophy alludes to the methodological principles and foundations associated with frequentist error-statistical methods. (b) An error-statistical philosophy of science, on the other hand, involves using the error-statistical methods, formally or informally, to deal with problems of philosophy of science: to model scientific inference (actual or rational), to scrutinize principles of inference, and to address philosophical problems about evidence and inference (the problem of induction, underdetermination, warranting evidence, theory testing, etc.).

I assume the interest here* is on the former, (a). I have stated it in numerous ways, but the basic position is that inductive inference—i.e., data-transcending inference—calls for methods of controlling and evaluating error probabilities (even if only approximate). An inductive inference, in this conception, takes the form of inferring hypotheses or claims to the extent that they have been well tested. It also requires reporting claims that have not passed severely, or have passed with low severity. In the “severe testing” philosophy of induction, the quantitative assessment offered by error probabilities tells us not “how probable” but, rather, “how well probed” hypotheses are.  The local canonical hypotheses of formal tests and estimation methods need not be the ones we entertain post data; but they give us a place to start without having to go “the designer-clothes” route.

The post-data interpretations might be formal, semi-formal, or informal.

See also: Staley’s review of Error and Inference (Mayo and Spanos eds.)

## A. Spanos lecture on “Frequentist Hypothesis Testing”

Aris Spanos

I attended a lecture by Aris Spanos to his graduate econometrics class here at Va Tech last week[i]. This course, which Spanos teaches every fall, gives a superb illumination of the disparate pieces involved in statistical inference and modeling, and affords clear foundations for how they are linked together. His slides follow the intro section. Some examples with severity assessments are also included.

Frequentist Hypothesis Testing: A Coherent Approach

Aris Spanos

1    Inherent difficulties in learning statistical testing

Statistical testing is arguably  the  most  important, but  also the  most difficult  and  confusing chapter of statistical inference  for several  reasons, including  the following.

(i) The need to introduce numerous new notions, concepts and procedures before one can paint —  even in broad brushes —  a coherent picture  of hypothesis  testing.

(ii) The current textbook discussion of statistical testing is both highly confusing and confused.  There  are several sources of confusion.

• (a) Testing is conceptually one of the most sophisticated sub-fields of any scientific discipline.
• (b) Inadequate knowledge by textbook writers who often do not have  the  technical  skills to read  and  understand the  original  sources, and  have to rely on second hand  accounts  of previous  textbook writers that are  often  misleading  or just  outright erroneous.   In most  of these  textbooks hypothesis  testing  is poorly  explained  as  an  idiot’s guide to combining off-the-shelf formulae with statistical tables like the Normal, the Student’s t, the chi-square,  etc., where the underlying  statistical  model that gives rise to the testing procedure  is hidden  in the background.
• (c)  The  misleading  portrayal of Neyman-Pearson testing  as essentially  decision-theoretic in nature, when in fact the latter has much greater  affinity with the Bayesian rather than the frequentist inference.
• (d)  A deliberate attempt to distort and  cannibalize  frequentist testing by certain  Bayesian drumbeaters who revel in (unfairly)  maligning frequentist inference in their  attempts to motivate their  preferred view on statistical inference.

(iii) The  discussion of frequentist testing  is rather incomplete  in so far as it has been beleaguered by serious foundational problems since the 1930s. As a result, different applied fields have generated their own secondary  literatures attempting to address  these  problems,  but  often making  things  much  worse!  Indeed,  in some fields like psychology  it has reached the stage where one has to correct the ‘corrections’ of those chastising  the initial  correctors!

In an attempt to alleviate  problem  (i),  the discussion  that follows uses a sketchy historical  development of frequentist testing.  To ameliorate problem (ii), the discussion includes ‘red flag’ pointers (¥) designed to highlight important points that shed light on certain  erroneous  in- terpretations or misleading arguments.  The discussion will pay special attention to (iii), addressing  some of the key foundational problems.

[i] It is based on Ch. 14 of Spanos (1999) Probability Theory and Statistical Inference. Cambridge[ii].

[ii] You can win a free copy of this 700+ page text by creating a simple palindrome! https://errorstatistics.com/palindrome/march-contest/

| Tags:

## A critical look at “critical thinking”: deduction and induction

I’m cleaning away some cobwebs around my old course notes, as I return to teaching after 2 years off (since I began this blog). The change of technology alone over a mere 2 years (at least here at Super Tech U) might be enough to earn me techno-dinosaur status: I knew “Blackboard” but now it’s “Scholar” of which I know zilch. The course I’m teaching is supposed to be my way of bringing “big data” into introductory critical thinking in philosophy! No one can be free of the “sexed up term for statistics,” Nate Silver told us (here and here), and apparently all the college Deans & Provosts have followed suit. Of course I’m (mostly) joking; and it was my choice.

Anyway, the course is a nostalgic trip back to critical thinking. Stepping back from the grown-up metalogic and advanced logic I usually teach, hop-skipping over baby logic, whizzing past toddler and infant logic…. and arriving at something akin to what R.A. Fisher dubbed “the study of the embryology of knowledge” (1935, 39) (a kind of ‘fetal logic’?) which, in its very primitiveness, actually demands a highly sophisticated analysis. In short, it’s turning out to be the same course I taught nearly a decade ago! (but with a new book and new twists). But my real point is that the hodge-podge known as “critical thinking,” were it seriously considered, requires getting to grips with some very basic problems that we philosophers, with all our supposed conceptual capabilities, have left unsolved. (I am alluding to Gandenberger‘s remark). I don’t even think philosophers are working on the problem (these days). (Are they?)

I refer, of course, to our inadequate understanding of how to relate deductive and inductive inference, assuming the latter to exist (which I do)—whether or not one chooses to call its study a “logic”[i]. [That is, even if one agrees with the Popperians that the only logic is deductive logic, there may still be such a thing as a critical scrutiny of the approximate truth of premises, without which no inference is ever detached even from a deductive argument. This is also required for Popperian corroboration or well-testedness.]

We (and our textbooks) muddle along with vague attempts to see inductive arguments as more or less parallel to deductive ones, only with probabilities someplace or other. I’m not saying I have easy answers, I’m saying I need to invent a couple of new definitions in the next few days that can at least survive the course. Maybe readers can help.

______________________

I view ‘critical thinking’ as developing methods for critically evaluating the (approximate) truth or adequacy of the premises which may figure in deductive arguments. These methods would themselves include both deductive and inductive or “ampliative” arguments. Deductive validity is a matter of form alone, and so philosophers are stuck on the idea that inductive logic would have a formal rendering as well. But this simply is not the case. Typical attempts are arguments with premises that take overly simple forms:

If all (or most) J’s were observed to be K’s, then the next J will be a K, at least with a probability p.

To evaluate such a claim (essentially the rule of enumerative induction) requires context-dependent information (about the nature and selection of the K and J properties, their variability, the “next” trial, and so on). Besides, most interesting ampliative inferences are to generalizations and causal claims, not mere predictions to the next J. The problem isn’t that an algorithm couldn’t evaluate such claims, but that the evaluation requires context-dependent information as to how the ampliative leap can go wrong. Yet our most basic texts speak as if potentially warranted inductive arguments are like potentially sound deductive arguments, more or less. But it’s not easy to get the “more or less” right, for any given example, while still managing to say anything systematic and general. That is essentially the problem…..
______________________

The age-old definition of argument that we all learned from Irving Copi still serves: a group of statements, one of which (the conclusion) is claimed to follow from one or more others (the premises) which are regarded as supplying evidence for the truth of that one. This is written:

P1, P2,…Pn/ ∴ C.

In a deductively valid argument, if the premises are all true then, necessarily, the conclusion is true. To use the “⊨” (double turnstile) symbol:

P1, P2,…Pn ⊨  C.

Does this mean:

P1, P2,…Pn/ ∴ necessarily C?

No, because we do not detach “necessarily C”, which would suggest C was a necessary claim (i.e., true in all possible worlds). “Necessarily” qualifies “⊨”, the very relationship between premises and conclusion:

It’s logically impossible to have all true premises and a false conclusion, on pain of logical contradiction.

We should see it (i.e., deductive validity) as qualifying the process of “inferring,” as opposed to the “inference” that is detached–the statement  placed to the right of “⊨”. A valid argument is a procedure of inferring that is 100% reliable, in the sense that if the premises are all true, then 100% of the time the conclusion is true.

Deductively Valid Argument: Three equivalent expressions:

(D-i) If the premises are all true, then necessarily, the conclusion is true.
(I.e., if the conclusion is false, then (necessarily) one of premises is false.)

(D-ii) It’s (logically) impossible for the premises to be true and the conclusion false.
(I.e., to have the conclusion false with the premises true leads to a logical contradiction, A & ~A.)

(D-iii) The argument maps true premises into a true conclusion with 100% reliability.
(I.e., if the premises are all true, then 100% of the time the conclusion is true).

(Deductively) Sound argument:  deductively valid + premises are true/approximately true.

All of this is baby logic; but with so-called inductive arguments, terms are not so clear-cut. (“Embryonic logic” demands, at times, more sophistication than grown-up logic.) But maybe the above points can help…

________

With an inductive argument, the conclusion goes beyond the premises. So it’s logically possible for all the premises to be true and the conclusion false.

Notice that if one had characterized deductive validity as

(a)  P1, P2,…Pn ⊨ necessarily C,

then it would be an easy slide to seeing inductively inferring as:

(b)  P1, P2,…Pnprobably C.

But (b) is wrongheaded, I say, for the same reason (a) is. Nevertheless, (b) (or something similar) is found in many texts. We (philosophers) should stop foisting ampliative inference into the deductive mould. So, here I go trying out some decent parallels:

In all of the following, “true” will mean “true or approximately true”.

An inductive argument (to inference C) is strong or potentially severe only if any of the following (equivalent claims) hold [iii]

(I-i) If the conclusion is false, then very probably at least one of the premises is false.

(I-ii) It’s improbable that the premises are all true while the conclusion false.

(I-iii) The argument leads from true premises to a true conclusion with high reliability (i.e., if the premises are all true then (1-a)% of the time, the conclusion is true).

To get the probabilities to work, the premises and conclusion must refer to “generic” claims of the type, but this is the case for deductive arguments as well (else their truth values wouldn’t be altering). However, the basis for the [I-i through I-iii] requirement, in any of its forms, will not be formal; it will demand a contingent or empirical ground. Even after these are grounded, the approximate truth of the premises will be required. Otherwise, it’s only potentially severe. (This is parallel to viewing a valid deductive argument as potentially sound.)

We get the following additional parallel:

Deductively unsound argument:

Denial of D-(i), (D-ii), or (D-iii): it’s logically possible for all its premises to be true and the conclusion false.
OR
One or more of its premises are false.

Inductively weak inference: insevere grounds for C

Denial of I-(i), (ii), or (iii): Premises would be fairly probable even if C is false.
OR
Its premises are false (not true to a sufficient approximation)

There’s still some “winking” going on, and I’m sure I’ll have to tweak this. What do you think?

Fully aware of how the fuzziness surrounding inductive inference has non-trivially (adversely) influenced the entire research program in philosophy of induction, I’ll want to rethink some elements from scratch, this time around….

______________

So I’m back in my Thebian palace high atop the mountains in Blacksburg, Virginia. The move from looking out at the Empire state building to staring at endless mountain ranges is… calming.[iv]

References:

[i] I do, following Peirce, but it’s an informal not a formal logic (using the terms strictly).

[ii]The double turnstile denotes the “semantic consequence” relationship; the single turnstyle, the syntatic (deducibility) relationship. But some students are not so familiar with “turnstiles”.

[iii]I intend these to function equivalently.

[iv] Someone asked me “what’s the biggest difference I find in coming to the rural mountains from living in NYC?” I think the biggest contrast is the amount of space. Not just that I live in a large palace, there’s the tremendous width of grocery aisles: 3 carts wide rather than 1.5 carts wide. I hate banging up against carts in NYC, but this feels like a major highway!

Copi, I.  (1956). Introduction to Logic. New York: Macmillan.

Fisher, R.A.  (1935). The Design of Experiments.  Edinburgh: Oliver & Boyd.

Categories: critical thinking, Severity, Statistics

## P-values as posterior odds?

I don’t know how to explain to this economist blogger that he is erroneously using p-values when he claims that “the odds are” (1 – p)/p that a null hypothesis is false. Maybe others want to jump in here?

On significance and model validation (Lars Syll)

Let us suppose that we as educational reformers have a hypothesis that implementing a voucher system would raise the mean test results with 100 points (null hypothesis). Instead, when sampling, it turns out it only raises it with 75 points and having a standard error (telling us how much the mean varies from one sample to another) of 20. Continue reading

Categories: fallacy of non-significance, Severity, Statistics

## Severity Calculator

SEV calculator (with comparisons to p-values, power, CIs)

In the illustration in the Jan. 2 post,

H0: μ < 0 vs H1: μ > 0

and the standard deviation SD = 1, n = 25, so σx  = SD/√n = .2
Setting α to .025, the cut-off for rejection is .39.  (can round to .4).

Let the observed mean X  = .2 , a statistically insignificant result (p value = .16)
SEV (μ < .2) = .5
SEV(μ <.3) = .7
SEV(μ <.4) = .84
SEV(μ <.5) = .93
SEV(μ <.6*) = .975
*rounding

Some students asked about crunching some of the numbers, so here’s a rather rickety old SEV calculator*. It is limited, rather scruffy-looking (nothing like the pretty visuals others post) but it is very useful. It also shows the Normal curves, how shaded areas change with changed hypothetical alternatives, and gives contrasts with confidence intervals. Continue reading

Categories: Severity, statistical tests

## Severity as a ‘Metastatistical’ Assessment

Some weeks ago I discovered an error* in the upper severity bounds for the one-sided Normal test in section 5 of: “Statistical Science Meets Philosophy of Science Part 2” SS & POS 2.  The published article has been corrected.  The error was in section 5.3, but I am blogging all of 5.

(* μo was written where xo should have been!)

5. The Error-Statistical Philosophy

I recommend moving away, once and for all, from the idea that frequentists must ‘sign up’ for either Neyman and Pearson, or Fisherian paradigms. As a philosopher of statistics I am prepared to admit to supplying the tools with an interpretation and an associated philosophy of inference. I am not concerned to prove this is what any of the founders ‘really meant’.

Fisherian simple-significance tests, with their single null hypothesis and at most an idea of  a directional alternative (and a corresponding notion of the ‘sensitivity’ of a test), are commonly distinguished from Neyman and Pearson tests, where the null and alternative exhaust the parameter space, and the corresponding notion of power is explicit. On the interpretation of tests that I am proposing, these are just two of the various types of testing contexts appropriate for different questions of interest. My use of a distinct term, ‘error statistics’, frees us from the bogeymen and bogeywomen often associated with ‘classical’ statistics, and it is to be hoped that that term is shelved. (Even ‘sampling theory’, technically correct, does not seem to represent the key point: the sampling distribution matters in order to evaluate error probabilities, and thereby assess corroboration or severity associated with claims of interest.) Nor do I see that my comments turn on whether one replaces frequencies with ‘propensities’ (whatever they are). Continue reading

## An established probability theory for hair comparison? “is not — and never was”

Hypothesis H: “person S is the source of this hair sample,” if indicated by a DNA match, has passed a more severe test than if it were indicated merely by a visual analysis under a microscopic. There is a much smaller probability of an erroneous hair match using DNA testing than using the method of visual analysis used for decades by the FBI.

The Washington Post reported on its latest investigation into flawed statistics behind hair match testimony. “Thousands of criminal cases at the state and local level may have relied on exaggerated testimony or false forensic evidence to convict defendants of murder, rape and other felonies”. Below is an excerpt of the Post article by Spencer S. Hsu.

I asked John Byrd, forensic anthropologist and follower of this blog, what he thought. It turns out that “hair comparisons do not have a well-supported weight of evidence calculation.” (Byrd).  I put Byrd’s note at the end of this post. Continue reading

Categories: Severity, Statistics

## Mayo: (section 5) “StatSci and PhilSci: part 2”

Here is section 5 of my new paper: “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations” Sections 1 and 2 are in my last post.*

5. The Error-Statistical Philosophy

I recommend moving away, once and for all, from the idea that frequentists must ‘sign up’ for either Neyman and Pearson, or Fisherian paradigms. As a philosopher of statistics I am prepared to admit to supplying the tools with an interpretation and an associated philosophy of inference. I am not concerned to prove this is what any of the founders ‘really meant’.

Fisherian simple-significance tests, with their single null hypothesis and at most an idea of  a directional alternative (and a corresponding notion of the ‘sensitivity’ of a test), are commonly distinguished from Neyman and Pearson tests, where the null and alternative exhaust the parameter space, and the corresponding notion of power is explicit. On the interpretation of tests that I am proposing, these are just two of the various types of testing contexts appropriate for different questions of interest. My use of a distinct term, ‘error statistics’, frees us from the bogeymen and bogeywomen often associated with ‘classical’ statistics, and it is to be hoped that that term is shelved. (Even ‘sampling theory’, technically correct, does not seem to represent the key point: the sampling distribution matters in order to evaluate error probabilities, and thereby assess corroboration or severity associated with claims of interest.) Nor do I see that my comments turn on whether one replaces frequencies with ‘propensities’ (whatever they are). Continue reading

## Mayo: (first 2 sections) “StatSci and PhilSci: part 2”

Here are the first two sections of my new paper: “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations” SS & POS 2. (Alternatively, go to the RMM page and scroll down to the Sept 26, 2012 entry.)

1. Comedy Hour at the Bayesian Retreat[i]

Overheard at the comedy hour at the Bayesian retreat: Did you hear the one about the frequentist…

“who defended the reliability of his radiation reading, despite using a broken radiometer, on the grounds that most of the time he uses one that works, so on average he’s pretty reliable?”

or

“who claimed that observing ‘heads’ on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”

Such jests may work for an after-dinner laugh, but if it turns out that, despite being retreads of ‘straw-men’ fallacies, they form the basis of why some statisticians and philosophers reject frequentist methods, then they are not such a laughing matter. But surely the drubbing of frequentist methods could not be based on a collection of howlers, could it? I invite the reader to stay and find out. Continue reading

Categories: Error Statistics, Philosophy of Statistics, Severity