Error Statistics

Levels of Inquiry

[Figure: levels of inquiry, linking data, statistics, and theory]

Many fallacious uses of statistical methods result from supposing that a statistical inference licenses a jump to a substantive claim that is ‘on a different level’ from the statistical one being probed. Given the familiar refrain that statistical significance is not substantive significance, it may seem surprising how often criticisms of significance tests depend on running the two together! But it is not just two: a great many levels need distinguishing in linking the collecting, modeling, and analyzing of data to the variety of substantive claims of an inquiry (though for simplicity I often focus on the three depicted, described in various ways).

A question that continues to arise revolves around a blurring of levels, and is behind my recent ESP post.  It goes roughly like this:

If we are prepared to take a statistically significant proportion of successes (greater than .5) in n Binomial trials as grounds for inferring a real (better than chance) effect (perhaps of two teaching methods) but not as grounds for inferring Uri’s ESP (at guessing outcomes, say), then aren’t we implicitly invoking a difference in prior probabilities?  The answer is no, but there are two very different points to be made:

First, merely finding evidence of a non-chance effect is at a different “level” from a subsequent question about the explanation or cause of that effect. To infer from the former to the latter is an example of a fallacy of rejection.[i] The nature and threats of error in a hypothesis about a specific cause of an effect are very different from those in merely inferring a real effect. There are distinct levels of inquiry and distinct errors at each level. The severity analysis for the respective claims makes this explicit.[ii] Even a test that did a good job distinguishing and ruling out threats to a hypothesis of “mere chance” would not thereby have probed errors about specific causes or potential explanations. Nor does an “isolated record” of statistically significant results suffice. Recall Fisher: “In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result” (1935, 14). PSI researchers never managed to demonstrate this.
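The first, purely statistical, level can be made concrete with a toy computation (my own illustration, not from the post): the exact one-sided p-value for a proportion of successes greater than .5 in n binomial trials, under the “mere chance” hypothesis p = 0.5. The numbers (60 hits in 100 trials) are hypothetical.

```python
from math import comb

def binom_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided p-value for a
    test of 'mere chance' (p = 0.5) against a real effect (p > 0.5)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 60 correct guesses in 100 binary trials:
p_value = binom_upper_tail(60, 100)
print(round(p_value, 4))  # ≈ 0.0284: statistically significant at the .05 level
```

Even granting the significant result, the inference licensed is only to a non-chance effect; whether its cause is a better teaching method, ESP, or a flaw in the trial design is a question at a different level, with its own distinct errors to probe.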

Categories: Background knowledge, Error Statistics, Philosophy of Statistics, Statistics | 2 Comments

Insevere tests and pseudoscience

Against the PSI skeptics of this period (discussed in my last post), defenders of PSI would often erect means of taking experimental results as success stories (e.g., if he failed to correctly predict the next card, maybe he was aiming at the second or third card). If the data could not be made to fit some ESP claim or other (e.g., through multiple endpoints), the failure might, as a last resort, be explained away as due to the negative energy of nonbelievers (or to being on the Carson show). They manage to get their ESP hypothesis H to “pass,” but the “test” had little or no capability of finding (uncovering, admitting) the falsity of H, even if H is false. (This is the basis for my term “Gellerization”.) In such cases, I would deny that the results afford any evidence for H. They are terrible evidence for H. Now any domain will have some terrible tests, but a field that routinely passes off terrible tests as success stories I would deem pseudoscientific.

We get a kind of minimal requirement for a test result to afford any evidence for a hypothesis H, however partial and approximate H may be: if H is assured of having* “passed” a test T, even if H is false, then T is a terrible test, or no test at all.**

Far from trying to reveal flaws, such a test masks them or prevents them from being uncovered. No one would be impressed to learn that their bank had passed a “stress test” if it turned out that the test had little or no chance of giving a failing score to any bank, regardless of its ability to survive a stressed economy. (Would they?)
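The minimal requirement can be put in simulation form (a toy sketch of my own, not from the post): estimate the probability that the test would fail H when H is false. For a Gellerized test, that probability is zero, so a “pass” carries no evidential weight.

```python
import random

def gellerized_test(data):
    """A 'test' that counts any outcome as a pass for H
    (multiple endpoints: a hit, a near-miss, a next-card hit...)."""
    return True  # H passes no matter what the data say

def capability_of_failing(test, simulate_under_not_H, trials=10_000):
    """Estimate P(test fails H | H false). Near 0 means a 'pass' is
    terrible evidence: the test had no capability of finding H false."""
    fails = sum(not test(simulate_under_not_H()) for _ in range(trials))
    return fails / trials

random.seed(1)
# Under 'H is false', guesses are pure chance (20 binary guesses):
chance_data = lambda: [random.random() < 0.5 for _ in range(20)]
print(capability_of_failing(gellerized_test, chance_data))  # 0.0
```

A bank stress test that returned a passing score for every bank would score exactly the same way: zero capability of failing, hence zero evidential value in a pass.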

There are a million different ways to flesh out the idea, and I welcome hearing others. Now you might say that no one would disagree with this. Great. Because a core requirement for an adequate account of inquiry, as I see it, is that it be able to capture this rationale for deeming evidence pretty terrible and inquiry fairly pseudoscientific, and that it do so in a way that affords a starting point for not-so-awful tests and rather reliable learning.

* or very probably would have passed.

**QUESTION: I seek your input: which sounds better, or is more accurate: saying a test T passes a hypothesis H, or that a hypothesis H passes a test T? I’ve used both and want to settle on one.

Categories: Error Statistics, philosophy of science | 5 Comments

Barnard, background info/intentions

G.A. Barnard: 23 Sept.1915 – 9 Aug.2002

G.A. Barnard’s birthday is 9/23, so here’s a snippet of his discussion with Savage (1962) (link below [i]) that connects to our two recent issues: stopping rules, and background information (here and here), at least of one type.

Barnard: I have been made to think further about this issue of the stopping rule since I first suggested that the stopping rule was irrelevant (Barnard 1947a,b). This conclusion does not follow only from the subjective theory of probability; it seems to me that the stopping rule is irrelevant in certain circumstances.  Since 1947 I have had the great benefit of a long correspondence—not many letters because they were not very frequent, but it went on over a long time—with Professor Bartlett, as a result of which I am considerably clearer than I was before. My feeling is that, as I indicated [on p. 42], we meet with two sorts of situation in applying statistics to data. One is where we want to have a single hypothesis with which to confront the data. Do they agree with this hypothesis or do they not? Now in that situation you cannot apply Bayes’s theorem because you have not got any alternatives to think about and specify—not yet. I do not say they are not specifiable—they are not specified yet. And in that situation it seems to me the stopping rule is relevant.
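Why an error statistician takes the stopping rule to be relevant can be shown with a quick simulation (my own illustration, not Barnard’s): sampling under the null and testing after every observation, stopping at the first nominally significant z-statistic, rejects the true null far more often than the nominal level suggests.

```python
import random
from statistics import NormalDist

def try_until_significant(max_n=1000, alpha=0.05):
    """Sample standard-normal data under the null (mean 0), running a
    z-test after every observation; report whether nominal
    'significance' is ever reached before max_n observations."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    total = 0.0
    for n in range(1, max_n + 1):
        total += random.gauss(0, 1)
        if abs(total) / n**0.5 > z_crit:  # two-sided z-test of mean = 0
            return True
    return False

random.seed(0)
reps = 500
rejections = sum(try_until_significant() for _ in range(reps))
print(rejections / reps)  # well above the nominal 0.05
```

With a fixed sample size the rejection rate would hover near 0.05; the try-and-try-again stopping rule alters the sampling distribution and so alters the error probabilities, which is precisely why it matters to a frequentist even though the likelihood function is unchanged.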

Categories: Background knowledge, Error Statistics, Philosophy of Statistics | Leave a comment

Objectivity (#5): Three Reactions to the Challenge of Objectivity (in inference)

(1) If discretionary judgments are thought to introduce subjectivity in inference, a classic strategy thought to achieve objectivity is to extricate such choices, replacing them with purely formal a priori computations or agreed-upon conventions (see March 14).  If leeway for discretion introduces subjectivity, then cutting off discretion must yield objectivity!  Or so some argue. Such strategies may be found, to varying degrees, across the different approaches to statistical inference.

The inductive logics of the type developed by Carnap promised an objective guide for measuring degrees of confirmation in hypotheses, despite much-discussed problems, paradoxes, and conflicting choices of confirmation logics.  In Carnapian inductive logics, initial assignments of probability are based on a choice of language and on intuitive, logical principles. The consequent logical probabilities can then be updated (given the statements of evidence) with Bayes’s Theorem. The fact that the resulting degrees of confirmation are at once analytic and a priori (giving them an air of objectivity) reveals the central weakness of such confirmation theories as “guides for life”: as guides, say, for empirical frequencies or for finding things out in the real world. Something very similar happens with the varieties of “objective” Bayesian accounts, both in statistics and in formal Bayesian epistemology in philosophy (a topic to which I will return; if interested, see my RMM contribution).

A related way of trying to remove latitude for discretion might be to define objectivity in terms of the consensus of a specified group, perhaps of experts, or of agents with “diverse” backgrounds. Once again, such a convention may enable agreement yet fail to have the desired link-up with the real world.  It would be necessary to show why consensus reached by the particular choice of group (another area for discretion) achieves the learning goals of interest.


Categories: Objectivity, Statistics | Leave a comment

Objectivity (#4) and the “Argument From Discretion”

We constantly hear that procedures of inference are inescapably subjective because of the latitude of human judgment as it bears on the collection, modeling, and interpretation of data. But this is seriously equivocal: Being the product of a human subject is hardly the same as being subjective, at least not in the sense we are speaking of—that is, as a threat to objective knowledge. Are all these arguments about the allegedly inevitable subjectivity of statistical methodology rooted in equivocations? I argue that they are!

Insofar as humans conduct science and draw inferences, it is obvious that human judgments and human measurements are involved. True enough, but too trivial an observation to help us distinguish among the different ways judgments should enter, and how, nevertheless, to avoid introducing bias and unwarranted inferences. The issue is not that a human is doing the measuring, but whether we can reliably use the thing being measured to find out about the world.


Categories: Objectivity, Statistics | 29 Comments

Lifting a piece from Spanos’ contribution* will usefully add to the mix

The following two sections from Aris Spanos’ contribution to the RMM volume are relevant to the points raised by Gelman (as regards what I am calling the “two slogans”)**.

6.1 Objectivity in Inference (from Spanos, RMM 2011, pp. 166-7)

The traditional literature seems to suggest that ‘objectivity’ stems from the mere fact that one assumes a statistical model (a likelihood function), enabling one to accommodate highly complex models. Worse, in Bayesian modeling it is often misleadingly claimed that as long as a prior is determined by the assumed statistical model—the so-called reference prior—the resulting inference procedures are objective, or at least as objective as the traditional frequentist procedures:

“Any statistical analysis contains a fair number of subjective elements; these include (among others) the data selected, the model assumptions, and the choice of the quantities of interest. Reference analysis may be argued to provide an ‘objective’ Bayesian solution to statistical inference in just the same sense that conventional statistical methods claim to be ‘objective’: in that the solutions only depend on model assumptions and observed data.” (Bernardo 2010, 117)

This claim brings out the unfathomable gap between the notion of ‘objectivity’ as understood in Bayesian statistics, and the error statistical viewpoint. As argued above, there is nothing ‘subjective’ about the choice of the statistical model Mθ(z) because it is chosen with a view to account for the statistical regularities in data z0, and its validity can be objectively assessed using trenchant M-S testing. Model validation, as understood in error statistics, plays a pivotal role in providing an ‘objective scrutiny’ of the reliability of the ensuing inductive procedures.


Categories: Philosophy of Statistics, Statistics, Testing Assumptions, U-Phil | 43 Comments

Objectivity #3: Clean(er) Hands With Metastatistics

I claim that all but the first of the “dirty hands” argument’s five premises are flawed. Even the first premise too directly identifies a policy decision with a statistical report. But the key flaws begin with premise 2. Although risk policies may be based on a statistical report of evidence, it does not follow that the considerations suitable for judging risk policies are the ones suitable for judging the statistical report. They are not. The latter, of course, should not be reduced to some kind of unthinking accept/reject report. If responsible, it must clearly and completely report the nature and extent of (risk-related) effects that are and are not indicated by the data, making plain how the methodological choices made in the generation, modeling, and interpreting of data raise or lower the chances of finding evidence of specific risks. These choices may be called risk assessment policy (RAP) choices.

Categories: Objectivity, Statistics | 10 Comments

Objectivity #2: The “Dirty Hands” Argument for Ethics in Evidence

Some argue that generating and interpreting data for purposes of risk assessment invariably introduces ethical (and other value) considerations that might not only go beyond, but might even conflict with, the “accepted canons of objective scientific reporting.”  This thesis (call it the thesis of ethics in evidence and inference) is thought by some to show that an ethical interpretation of evidence may warrant violating canons of scientific objectivity, and even that a scientist must choose between the norms of morality and objectivity.

The reasoning is that since the scientists’ hands must invariably get “dirty” with policy and other values, they should opt for interpreting evidence in a way that promotes ethically sound values, or maximizes public benefit (in some sense).

I call this the “dirty hands” argument, alluding to a term used by philosopher Carl Cranor (1994).[1]

I cannot say how far its proponents would endorse taking the argument.[2] However, it seems that if this thesis is accepted, it may be possible to regard as “unethical” the objective reporting of scientific uncertainties in evidence.  This consequence is worrisome: in fact, it would conflict with the generally accepted imperative for an ethical interpretation of scientific evidence.

Nevertheless, the “dirty hands” argument as advanced has apparently plausible premises, one or more of which would need to be denied to avoid the conclusion which otherwise follows deductively. It goes roughly as follows:

  1. Whether observed data are taken as evidence of a risk depends on a methodological decision as to when to reject the null hypothesis of no risk H0 (and infer the data are evidence of a risk).
  2. Thus, in interpreting data to feed into policy decisions with potentially serious risks to the public, the scientist is actually engaged in matters of policy (what is generally framed as an issue of evidence and science is actually an issue of policy values, ethics, and politics).
  3. The public funds scientific research, and the scientist should be responsible for promoting the public good, so scientists should interpret risk evidence so as to maximize public benefit.
  4. Therefore, a responsible (ethical) interpretation of scientific data on risks is one that maximizes public benefit, and one that does not do so is irresponsible or unethical.
  5. Public benefit is maximized by minimizing the chance of failing to find a risk. This leads to the conclusion in 6:
  6. CONCLUSION: In situations of risk assessment, the ethical interpreter of evidence will maximize the chance of inferring there is a risk, even if this means inferring a risk when there is none with high probability (or at least a probability much higher than is normally countenanced).

The argument about ethics in evidence is often put in terms of balancing type I and type II errors.

Type I error: test T finds evidence of an increased risk (H0 is rejected), when in fact the risk is absent (false positive).

Type II error: test T does not find evidence of an increased risk (H0 is accepted), when in fact an increased risk δ is present (false negative).

The traditional balance of type I and type II error probabilities, wherein type I errors are minimized, some argue, is unethical. Rather than minimize type I errors, it might be claimed, an “ethical” tester should minimize type II errors.
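The trade-off the argument leans on can be computed directly (a minimal sketch of my own; the one-sided z-test and the numbers δ = 0.2, n = 25 are hypothetical): driving the type II error probability down by loosening the rejection cutoff drives the type I error probability up in lockstep.

```python
from statistics import NormalDist

def error_probs(alpha, delta, n, sigma=1.0):
    """One-sided z-test of H0: no increased risk (mean 0) vs. an
    increase delta. Returns (type I, type II) error probabilities
    for the cutoff set by alpha."""
    z = NormalDist()
    cutoff = z.inv_cdf(1 - alpha)                     # reject H0 if z-stat > cutoff
    type_II = z.cdf(cutoff - delta * n**0.5 / sigma)  # miss a real increase delta
    return alpha, type_II

for alpha in (0.05, 0.20, 0.50):
    a, b = error_probs(alpha, delta=0.2, n=25)
    print(f"type I = {a:.2f}  type II = {b:.2f}")
```

At α = .05 the chance of missing the δ = 0.2 increase is about .74; pushing the type II error down to about .16 requires tolerating a type I error of .50, i.e., inferring a risk half the time when there is none. That is the quantitative face of premise 5 and conclusion 6.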

I claim that at least 3 of the premises, while plausible-sounding, are false.  What do you think?
_____________________________________________________

[1] Cranor (to my knowledge) was among the first to articulate the argument in philosophy, in relation to statistical significance tests (it is echoed by more recent philosophers of evidence-based policy):

Scientists should adopt more health protective evidentiary standards, even when they are not consistent with the most demanding inferential standards of the field.  That is, scientists may be forced to choose between the evidentiary ideals of their fields and the moral value of protecting the public from exposure to toxins; frequently they cannot realize both (Cranor 1994, pp. 169-70).

Kristin Shrader-Frechette has advanced analogous arguments in numerous risk research contexts.

[2] I should note that Cranor is aware that properly scrutinizing statistical tests can advance matters here.

Cranor, C. (1994), “Public Health Research and Uncertainty”, in K. Shrader-Frechette, Ethics of Scientific Research, Rowman and Littlefield, pp. 169-186.

Shrader-Frechette, K. (1994), Ethics of Scientific Research, Rowman and Littlefield.

Categories: Objectivity, Objectivity, Statistics | Tags: , , , , | 17 Comments
