Author Archives: Mayo

Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)

Having reblogged the 5/17/12 post on “reforming the reformers” yesterday, I thought I should reblog its follow-up: 6/2/12.

Consider again our one-sided Normal test T+, with null H0: μ < μ0 vs μ >μ0  and  μ0 = 0,  α=.025, and σ = 1, but let n = 25. So M is statistically significant only if it exceeds .392. Suppose M (the sample mean) just misses significance, say

Mo = .39.

The flip side of a fallacy of rejection (discussed before) is a fallacy of acceptance, or the fallacy of misinterpreting statistically insignificant results.  To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ =μ0, we wish to identify discrepancies that can and cannot be ruled out.  For our test T+, we reason from insignificant results to inferential claims of the form:

μ < μ0 + γ

Fisher continually emphasized that failure to reject was not evidence for the null.  Neyman, we saw, in chastising Carnap, argued for the following kind of power analysis:

Neymanian Power Analysis (Detectable Discrepancy Size DDS): If data x are not statistically significantly different from H0, and the power to detect discrepancy γ is high (low), then x constitutes good (poor) evidence that the actual effect is < γ. (See 11/9/11 post).

By taking into account the actual x0, a more nuanced post-data reasoning may be obtained.

“In the Neyman-Pearson theory, sensitivity is assessed by means of the power—the probability of reaching a preset level of significance under the assumption that various alternative hypotheses are true. In the approach described here, sensitivity is assessed by means of the distribution of the random variable P, considered under the assumption of various alternatives. “ (Cox and Mayo 2010, p. 291):

This may be captured in :

FEV(ii): A moderate p-value is evidence of the absence of a discrepancy d from Ho only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller p value) were a discrepancy d to exist. (Mayo and Cox 2005, 2010, 256).

This is equivalently captured in the Rule of Acceptance (Mayo (EGEK) 1996, and in the severity interpretation for acceptance, SIA, Mayo and Spanos (2006, p. 337):

SIA: (a): If there is a very high probability that [the observed difference] would have been larger than it is, were μ > μ1, then μ < μ1 passes the test with high severity,…

But even taking tests and CIs just as we find them, we see that CIs do not avoid the fallacy of acceptance: they do not block erroneous construals of negative results adequately. Continue reading

Categories: CIs and tests, Error Statistics, reformers, Statistics | Tags: , , , , , , , | Leave a comment

PhilStock: Topsy-Turvy Game

stock picture smaillSee rejected posts.  

Categories: PhilStock, Rejected Posts | Leave a comment

Do CIs Avoid Fallacies of Tests? Reforming the Reformers (Reblog 5/17/12)

The one method that enjoys the approbation of the New Reformers is that of confidence intervals. The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter m (Cumming 2012, p.69). He recommends prespecified confidence levels .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value. “ (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it?   Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+.

H0: µ ≤  0 against H1: µ >  0 , and let σ= 1.

Test T+ at significance level .02 is analogous to forming the one-sided (lower) 98% confidence interval:

µ > M – 2(1/ √n ).

where M, following Cumming, is the sample mean (thereby avoiding those x-bars). M – 2(1/ √n ) is the lower limit (LL) of a 98% CI.

Central problems with significance tests (whether of the N-P or Fisherian variety) include:

(1) results are too dichotomous (e.g., significant at a pre-set level or not);

(2) two equally statistically significant results but from tests with different sample sizes are reported in the same way  (whereas the larger the sample size the smaller the discrepancy the test is able to detect);

(3) significance levels (even observed p-values) fail to indicate the extent of the effect or discrepancy (in the case of test T+ , in the positive direction).

We would like to know for what values of δ it is warranted to infer  µ > µ0 + δ. Continue reading

Categories: confidence intervals and tests, reformers, Statistics | Tags: , , , | 7 Comments

Some statistical dirty laundry

Objectivity 1: Will the Real Junk Science Please Stand Up?I finally had a chance to fully read the 2012 Tilberg Report* on “Flawed Science” last night. The full report is now here. Here are some stray thoughts…

1. Slipping into pseudoscience.
The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”[1]. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job.

One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts…. [T]he use of research procedures in such a way as to ‘repress’ negative results by some means” may be called verification bias. [my emphasis] (Report, 48).

I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses, as to count as no evidence at all (see some from their list ). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89). [2] It is unclear at what point a field slips into the pseudoscience realm.

2. A role for philosophy of science?
I am intrigued that one of the final recommendations in the Report is this:

In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered. Based on these insights, research Master’s students and PhD students must receive practical training from their supervisors in the application of the rules governing proper and honest scientific research, which should include examples of such undesirable conduct as data massage. The Graduate School must explicitly ensure that this is implemented.

A philosophy department could well create an entire core specialization that revolved around “the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science” (ideally linked with one or more other departments).  That would be both innovative and fill an important gap, it seems to me. Is anyone doing this?

3. Hanging out some statistical dirty laundry.images
Items in their laundry list include:

  • An experiment fails to yield the expected statistically significant results. The experiment is repeated, often with minor changes in the manipulation or other conditions, and the only experiment subsequently reported is the one that did yield the expected results. The article makes no mention of this exploratory method… It should be clear, certainly with the usually modest numbers of experimental subjects, that using experiments in this way can easily lead to an accumulation of chance findings…. Continue reading
Categories: junk science, spurious p values, Statistics | 6 Comments

Winner of May Palindrome Contest

“Able no one nil red nudist opening nine pots. I’d underline ‘No’ on Elba.”  Anonymous. See rejected posts.

Categories: Palindrome | Leave a comment

K. Staley: review of Error & Inference

kent-staleyK. W. Staley
Associate Professor
Department of Philosophy,
Saint Louis University

(Almost) All about error


BOOK REVIEW Metascience (2012) 21:709–713 DOI 10.1007/s11016-011-9618-1E & I Cover 2
Deborah G. Mayo and Aris Spanos (eds): Error and inference: Recent exchanges on experimental reasoning, reliability, objectivity, and rationality. New York: Cambridge University Press, 2010, xvii+419 pp

The ERROR’06 (experimental reasoning, reliability, objectivity, and rationality) conference held at Virginia Tech aimed to advance the discussion of some central themes in philosophy of science debated by Deborah Mayo and her more-or-less friendly critics over the years. The volume here reviewed brings together the contributions of these critics and Mayo’s responses to them (with Mayo’s collaborator Aris Spanos). (I helped with the organization of the conference and, with Mayo and Jean Miller, edited a separate collection of workshop papers that were presented there, published as a special issue of Synthese.) My review will focus on a couple of themes I hope to be of interest to a broad philosophical audience, then turn more briefly to an overview of the entire collection. The discussions in Error and Inference (E&I) are indispensable for understanding several current issues regarding the methodology of science.

The remarkably useful introductory chapter lays out the broad themes of the volume and discusses ‘‘The Error-Statistical Philosophy’’. Here, Mayo and Spanos provide the most succinct and non-technical account of the error-statistical approach that has yet been published, a feature that alone should commend this text to anyone who has found it difficult to locate a reading on error statistics suitable for use in teaching.

Mayo holds that the central question for a theory of evidence is not the degree to which some observation E confirms some hypothesis H but how well-probed for error a hypothesis H is by a testing procedure T that results in data x0. This reorientation has far-reaching consequences for Mayo’s approach to philosophy of science. On this approach, addressing the question of when data ‘‘provide good evidence for or a good test of’’ a hypothesis requires attention to characteristics of the process by means of which the data are used to bear on the hypothesis. Mayo identifies the starting point from which her account is developed as the ‘‘Weak Severity Principle’’ (WSP):

Data x0 do not provide good evidence for hypothesis H if x0 results from a test procedure with a very low probability or capacity of having uncovered the falsity of H (even if H is incorrect). (21)

The weak severity principle is then developed into the full severity principle (SP), according to which ‘‘data x0 provide a good indication of or evidence for hypothesis H (just) to the extent that test T has severely passed H with x0’’ where H passes a severe test T with x0 if x0 ‘‘agrees with’’ H and ‘‘with very high probability, test T would have produced a result that accords less well with H than doesx0, if H were false or incorrect’’ (22). This principle constitutes the heart of the error-statistical account of evidence, and E&I, by including some of the most important critiques of the principle, provides a forum in which Mayo and Spanos attempt to correct misunderstandings of the principle and to clarify its meaning and application.

The appearance in the WSP of the disjunctive phrase ‘‘a very low probability or capacity’’ (my emphasis) indicates a point central to much of this clarificatory work. The error-statistical account is resolutely frequentist in its construal of probability. It is commonly held (including by some frequentists) that the rationale for frequentist statistical methods lies exclusively in the fact that they can sometimes be shown to have low error rates in the long run. Throughout E&I, Mayo insists that this ‘‘behaviorist rationale’’ is not applicable when it comes to evaluating a particular body of data in order to determine what inferences may be warranted. That evaluation rests upon thinking about the particular data and the inference at hand in light of the capacity of the test to reveal potential errors in the inference drawn. Frequentist probabilities are part of how one models the error-detecting capacities of the process. As Mayo explains in a later chapter co-authored with David Cox, tests of hypotheses function analogously to measuring instruments: ‘‘Just as with the use of measuring instruments, applied to a specific case, we employ the performance features to make inferences about aspects of the particular thing that is measured, aspects that the measuring tool is appropriately capable of revealing’’ (257).

One of the most fascinating exchanges in E&I concerns the role of severe testing in the appraisal of ‘‘large-scale’’ theories. According to Mayo, theory appraisal proceeds by a ‘‘piecemeal’’ process of severe probing for specific ways in which a theory might be in error. She illustrates this with the history of experimental tests of theories of gravity, emphasizing Clifford Will’s parametrized post-Newtonian (PPN) framework, by means of which all metric theories of gravity can be represented in their weak-field, slow-motion limits by means of ten parameters. Experimental work on gravity theories then severely tests hypotheses about the values of those parameters. Rather than attempting to confirm or probabilify the general theory of relativity (GTR), the aim is to learn about the ways in which GTR might be in error, more generally to ‘‘measure how far off what a given theory says about a phenomenon can be from what a ‘correct’ theory would need to say about it’’ (55).

Alan Chalmers and Alan Musgrave both challenge this view. According to Chalmers, no general theory, whether ‘‘low level’’ or ‘‘high level’’, can pass a severe test because the content of theories surpasses whatever empirical evidence supports them. As a consequence, Chalmers argues, Mayo’s severe-testing account of scientific inference must be incomplete because even low-level experimental testing sometimes demands relying on general theoretical claims. Similarly, Musgrave accuses Mayo of holding that (general) theories are not tested by ‘‘testing their consequences’’, but that ‘‘all that we really test are the consequences’’ (105), leaving her with ‘‘nothing to say’’ about the assessment, adoption, or rejection of general theories (106). Continue reading

Categories: Error Statistics, Statistics | Tags: , | 1 Comment

A.Birnbaum: Statistical Methods in Scientific Inference

Birnbaum: born May 27, 1923

Today is (statistician) Allan Birnbaum’s birthday. He lived to be only 53 [i]. From the perspective of philosophy of statistics and philosophy of science, Birnbaum is best known for his work on likelihood, the Likelihood Principle [ii], and for his attempts to blend concepts of likelihood with error probability ideas to obtain what he called “concepts of statistical evidence”. Failing to find adequate concepts of statistical evidence, Birnbaum called for joining the work of “interested statisticians, scientific workers and philosophers and historians of science”–an idea I would heartily endorse!  While known for attempts to argue that the (strong) Likelihood Principle followed from sufficiency and conditionality principles, a few years after publishing this result, he seems to have turned away from it, perhaps discovering gaps in his argument.

NATURE VOL. 225 MARCH 14, 1970 (1033)

LETTERS TO THE EDITOR

Statistical methods in Scientific Inference

 It is regrettable that Edwards’s interesting article[1], supporting the likelihood and prior likelihood concepts, did not point out the specific criticisms of likelihood (and Bayesian) concepts that seem to dissuade most theoretical and applied statisticians from adopting them. As one whom Edwards particularly credits with having ‘analysed in depth…some attractive properties” of the likelihood concept, I must point out that I am not now among the ‘modern exponents” of the likelihood concept. Further, after suggesting that the notion of prior likelihood was plausible as an extension or analogue of the usual likelihood concept (ref.2, p. 200)[2], I have pursued the matter through further consideration and rejection of both the likelihood concept and various proposed formalizations of prior information and opinion (including prior likelihood).  I regret not having expressed my developing views in any formal publication between 1962 and late 1969 (just after ref. 1 appeared). My present views have now, however, been published in an expository but critical article (ref. 3, see also ref. 4)[3] [4], and so my comments here will be restricted to several specific points that Edwards raised.

 If there has been ‘one rock in a shifting scene’ or general statistical thinking and practice in recent decades, it has not been the likelihood concept, as Edwards suggests, but rather the concept by which confidence limits and hypothesis tests are usually interpreted, which we may call the confidence concept of statistical evidence. This concept is not part of the Neyman-Pearson theory of tests and confidence region estimation, which denies any role to concepts of statistical evidence, as Neyman consistently insists. The confidence concept takes from the Neyman-Pearson approach techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data. (The absence of a comparable property in the likelihood and Bayesian approaches is widely regarded as a decisive inadequacy.) The confidence concept also incorporates important but limited aspects of the likelihood concept: the sufficiency concept, expressed in the general refusal to use randomized tests and confidence limits when they are recommended by the Neyman-Pearson approach; and some applications of the conditionality concept. It is remarkable that this concept, an incompletely formalized synthesis of ingredients borrowed from mutually incompatible theoretical approaches, is evidently useful continuously in much critically informed statistical thinking and practice [emphasis mine].

While inferences of many sorts are evident everywhere in scientific work, the existence of precise, general and accurate schemas of scientific inference remains a problem. Mendelian examples like those of Edwards and my 1969 paper seem particularly appropriate as case-study material for clarifying issues and facilitating effective communication among interested statisticians, scientific workers and philosophers and historians of science.

Allan Birnbaum
New York University
Courant Institute of Mathematical Sciences,
251 Mercer Street,
New York, NY 10012

Birnbaum’s confidence concept, sometimes written (Conf), was his attempt to find in error statistical ideas a concept of statistical evidence–a term that he invented and popularized. In Birnbaum 1977 (24), he states it as follows:

(Conf): A concept of statistical evidence is not plausible unless it finds ‘strong evidence for J as against H with small probability (α) when H is true, and with much larger probability (1 – β) when J is true.

Birnbaum questioned whether Neyman-Pearson methods had “concepts of evidence”  simply because Neyman talked of “inductive behavior” and Wald and others cauched statistical methods in decision-theoretic terms. I have been urging that we consider instead how the tools may actually be used, and not be restricted by the statistical philosophies of founders (not to mention that so many of their statements are tied up with personality disputes, and problems of “anger management”). Recall, as well, E. Pearson’s insistence on an evidential construal of N-P methods, and the fact that Neyman, in practice, spoke of drawing inferences and reaching conclusions (e.g., Neyman’s nursery posts, links in [iii] below). Continue reading

Categories: Likelihood Principle, phil/history of stat, Statistics | Tags: | 3 Comments

Schachtman: High, Higher, Highest Quality Research Act

wavy capitalSince posting on the High Quality Research act a few weeks ago, I’ve been following it in the news, have received letters from professional committees (asking us to write letters), and now see that  Nathan A. Schachtman, Esq., PC posted the following on May 25, 2013 on his legal blog*:

NAS-3“The High Quality Research Act” (HQRA), which has not been formally introduced in Congress, continues to draw attention. SeeClowns to the left of me, Jokers to the right.”  Last week, Sarewitz suggests that “the problem” is the hype about the benefits of pure research and the let down that results from the realization that scientific progress is “often halting and incremental,” with much research not “particularly innovative or valuable.”  Fair enough, but why is this Congress such an unsophisticated consumer of scientific research in the 21st century?  How can it be a surprise that the scientific community engages in the same rent-seeking behaviors as do other segments of our society? Has it escaped Congress’s attention that scientists are subject to enthusiasms and group think, just like, … congressmen?

Nature published an editorial piece suggesting that the HQRA is not much of a threat. Daniel Sarewitz, “Pure hype of pure research helps no one, ” 497 Nature 411 (2013).

Still, Sarewitz believes that the HQRA bill is not particularly threatening to the funding of science:

“In other words, it’s not a very good bill, but neither is it much of a threat. In fact, it’s just the latest skirmish in a long-running battle for political control over publicly funded science — one fought since at least 1947, when President Truman vetoed the first bill to create the NSF because it didn’t include strong enough lines of political accountability.”

This sanguine evaluation misses the effect of the superlatives in the criteria for National Science Foundation funding:

“(1) is in the interests of the United States to advance the national health, prosperity, or welfare, and to secure the national defense by promoting the progress of science;

(2) is the finest quality, is ground breaking, and answers questions or solves problems that are of utmost importance to society at large; and

(3) is not duplicative of other research projects being funded by the Foundation or other Federal science agencies.” Continue reading

Categories: evidence-based policy, PhilStatLaw, Statistics | Tags: | 12 Comments

Gelman sides w/ Neyman over Fisher in relation to a famous blow-up

3-d red yellow puzzle people (E&I)

blog-o-log

Andrew Gelman had said he would go back to explain why he sided with Neyman over Fisher in relation to a big, famous argument discussed on my Feb. 16, 2013 post: “Fisher and Neyman after anger management?”, and I just received an e-mail from Andrew saying that he has done so: “In which I side with Neyman over Fisher”. (I’m not sure what Senn’s reply might be.) Here it is:

“In which I side with Neyman over Fisher” Posted by  on 24 May 2013, 9:28 am

As a data analyst and a scientist, Fisher > Neyman, no question. But as a theorist, Fisher came up with ideas that worked just fine in his applications but can fall apart when people try to apply them too generally.gelman5

Here’s an example that recently came up.

Deborah Mayo pointed me to a comment by Stephen Senn on the so-called Fisher and Neyman null hypotheses. In an experiment with n participants (or, as we used to say, subjects or experimental units), the Fisher null hypothesis is that the treatment effect is exactly 0 for every one of the n units, while the Neyman null hypothesis is that the individual treatment effects can be negative or positive but have an average of zero.

Senn explains why Neyman’s hypothesis in general makes no sense—the short story is that Fisher’s hypothesis seems relevant in some problems (sometimes we really are studying effects that are zero or close enough for all practical purposes), whereas Neyman’s hypothesis just seems weird (it’s implausible that a bunch of nonzero effects would exactly cancel). And I remember a similar discussion as a student, many years ago, when Rubin talked about that silly Neyman null hypothesis. Continue reading

Categories: Fisher, Statistics, Stephen Senn | Tags: , | 10 Comments

Mayo’s slides from the Onto-Meth conference*

img_1249-e1356389909748Methodology and Ontology in Statistical Modeling: Some error statistical reflections (Spanos and Mayo)uncorrected

 Our presentation falls under the second of the bulleted questions for the conference (conference blog is here):

How do methods of data generation, statistical modeling, and inference influence the construction and appraisal of theories?

Statistical methodology can influence what we think we’re finding out about the world, in the most problematic ways, traced to such facts as:

  • All statistical models are false
  • Statistical significance is not substantive significance
  • Statistical association is not causation
  • No evidence against a statistical null hypothesis is not evidence the null is true
  • If you torture the data enough they will confess.

(or just omit unfavorable data)

These points are ancient (lying with statistics, lies damn lies, and statistics)

People are discussing these problems more than ever (big data), but it’s rarely realized is how much certain methodologies are at the root of the current problems.

__________________1__________________

All Statistical Models are False

Take the popular slogan in statistics and elsewhere is “all statistical models are false!”

What the “all models are false” charge boils down to:

(1)          the statistical model of the data is at most an idealized and partial representation of the actual data generating source.

(2) a statistical inference is at most an idealized and partial answer to a substantive theory or question.

  • But we already know our models are idealizations: that’s what makes them models
  • Reasserting these facts is not informative,.
  • Yet they are taken to have various (dire) implications about the nature and limits of statistical methodology
  • Neither of these facts precludes the use of these to find out true things
  • On the contrary, it would be impossible to learn about the world if we did not deliberately falsify and simplify.

    __________________2__________________

  • Notably, the “all models are false” slogan is followed up by “But some are useful”,
  • Their usefulness, we claim, is being capable of adequately capturing an aspect of a phenomenon of interest
  • Then a hypothesis asserting its adequacy (or inadequacy) is capable of being true!

Note: All methods of statistical inferences rest on statistical models.

What differentiates accounts is how well they step up to the plate in checking adequacy, learning despite violations of statistical assumptions (robustness)

__________________3__________________

Statistical significance is not substantive significance

Statistical models (as they arise in the methodology of statistical inference) live somewhere between

  1. Substantive questions, hypotheses, theories H
  1. Statistical models of phenomenon, experiments, data: M
  1. Data x

What statistical inference has to do is afford adequate link-ups (reporting precision, accuracy, reliability)

__________________4__________________ Continue reading

Categories: O & M conference | 34 Comments

Mayo: Meanderings on the Onto-Methodology Conference

mayo blackboard b&w 2Writing a blog like this, a strange and often puzzling exercise[1], does offer a forum for sharing half-baked chicken-scratchings from the back of frayed pages on themes from our Onto-Meth[2] conference from two weeks ago[3]. (The previous post had notes from blogger and attendee, Gandenberger.)

Onto-Meth conference

Onto-Meth conference

Several of the talks reflect a push-back against the idea that the determination of “ontology” in science—e.g., the objects and processes of theories, models and hypotheses—is (or should strive to correspond to?)  “real” objects in the world and/or what is approximately the case about them. Instead, at least some of the speakers wish to liberate ontology to recognize how “merely” pragmatic goals, needs, and desires are not just second-class citizens, but can and do (and should?) determine the categories of reality. Well there are a dozen equivocations here, most of which we did not really discuss at the conference.

In my own half of the Spanos-Mayo (D & P presentation[4]) I granted and even promoted the idea of a methodology that was pragmatic while also objective, so I’m not objecting to that part. The measurement of my weight is a product of “discretionary” judgments (e.g., to weigh in pounds with a scale having a given precision), but it is also a product of how much I really weigh (no getting around it). By understanding the properties of methodological tools and measuring systems, it is possible to “subtract out” the influence of the judgments to get at what is actually the case. At least approximately. But that view is different, it seems to me, from someone like Larry Laudan (at least in his later metamorphosis). Even though he considers his “reticulated” view a fairly hard-nosed spin on the Kuhnian idea of scientific paradigms as invariably containing an ontology (e.g., theories), a methodology, and (what he called) an “axiology” or set of aims (OMA), Laudan seems to think standards are so variable that what counts as evidence is constantly fluctuating (aside from maybe retaining the goal of fitting diverse facts). So I wonder if these pragmatic leanings are more like Laudan or more like me (and my view here, I take it, is essentially that of Peirce). I am perfectly sympathetic to the piecemeal “locavoracity” idea in Ruesche, by the way.

My worry, one of them, is that all kinds of rival entities and processes arise to account for (accord with, predict, and purportedly explain) data and patterns in data, and don’t we need ways to discriminate them? During the open discussion, I mentioned several examples, some of which I can make out all scrunched up in the corners of my coffee-logged program, such as appeals to “cultural theories” of risk and risk perceptions. These theories say appeals to supposedly “real” hazards, e.g, chance of disease, death, catastrophe, and other “objective” risk assessments are wrong.  They say it is not only possible but preferable (truer?) to capture attitudes toward risks, e.g., GM foods, nuclear energy, climate change, breast implants, etc. by means of one or another favorite politico-cultural grid-group categories (e.g., marginal-individualists, passive-egalitarians, hierarchical-border people, fatalists,  etc.). (Your objections to these vague category schemes are often taken as further evidence that you belong in one of the pigeon-holes!) And the other day I heard a behavioral economist declare that he had found the “mechanism” to explain deciding between options in virtually all walks of life using a regression parameter, he called it beta, and guess what? beta = 1/3! He proved it worked statistically too. He might be right, he had a lot of data. Anyway, in my deliberate attempt to trigger discussion at the conference end, I was wondering if some of the speakers and/or attendees (Danks, Woodward, Glymour? Anyone?) had anything to say about cases that some of us might wish to call reification. Continue reading

Categories: O & M conference, Statistics | 10 Comments

Gandenberger on Ontology and Methodology (May 4) Conference: virginia Tech

greg pic

Gregory Gandenberger
Ph.D graduate student: Dept. of History and Philosophy of Science & Dept. of Statistics
University of Pittsburgh
http://gsganden.tumblr.com/

Onto-Meth conference

Onto-Meth conference


Some Thoughts on the O&M 2013 Conference
I was struck by how little speakers at the Ontology and Methodology conference engaged with the realism/antirealism debate. Laura Ruetsche defended a version of Arthur Fine’s Natural Ontological Attitude (NOA) in the first talk of the conference, but none of the speakers after her addressed the debate directly. David Danks and Jim Woodward made it particularly clear that they were deliberately avoiding questions about realism in favor of questions about what kinds of ontologies our theories should have in order to best serve the various purposes for which we develop them.

I am not criticizing the speakers! I am inclined to agree with Clark Glymour that the kinds of questions Danks and Woodward addressed are more interesting and important than questions about “what’s really real.” On the other hand, I worry that we lose something when we focus only on the use of science toward such ends as prediction and control. During the discussion period at the end of the conference, Peter Godfrey-Smith argued that science has some value simply for telling us what really is the case. For instance, science tells us that all living things on earth have a common ancestor, and that fact is a good thing to know regardless of whether or not it helps us predict or control anything.

One feature of the realism/antirealism debate that has long bothered me is that it treats all of “our best sciences” as if they had roughly the same epistemic status. In fact, realism about quantum field theory, for instance, is much harder to defend than realism about evolutionary biology. I am inclined to dismiss the realism debate as ill-formed insofar as it presumes that the question of scientific realism is a single question that spans all of the sciences. I am also suspicious of the debate in its bread-and-butter domain of fundamental physics. It is not clear to me that there is such a thing as fundamental physics; that if there is such a thing as fundamental physics, then it is converging toward a unified ontology; that if it is converging toward a unified ontology, then we can make sense of the question whether or not that ontology is correct; or that if we can make sense of the question whether or not that ontology is correct, then we have the means to give a justified answer to that question.

Nevertheless, as Glymour pointed out during the open discussion period, there are still good and open questions to address about whether and how we are justified in believing that science tells us the truth in other domains (such as evolutionary theory) where the realism question seems relatively well-formed and answerable. We can dismiss questions about “what’s really real” at a “fundamental level” while still thinking that philosophers of science should have a story to tell the 46% of Americans who believe that human beings were created in more or less their current form within the last 10,000 years—not a story about how science serves purposes of prediction and control, but a story about how science can help us find the truth.

Categories: O & M conference | 7 Comments

“A sense of security regarding the future of statistical science…” Anon review of Error and Inference

errorinferencebookcover-e1335149598836-1Aris Spanos, my colleague and co-author (Economics),recently came across this seemingly anonymous review of our Error and Inference (2010) [E & I]. It’s interesting that the reviewer remarks that “The book gives a sense of security regarding the future of statistical science and its importance in many walks of life.” I wish I knew just what the reviewer means–but it’s appreciated regardless.

2010 American Statistical Association and the American Society for Quality

TECHNOMETRICS, AUGUST 2010, VOL. 52, NO. 3, Book Reviews, 52:3, pp. 362-370.

Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. MAYO and Aris SPANOS, New York: Cambridge University Press, 2010, ISBN 978-0-521-88008-4, xvii+419 pp., $60.00.

This edited volume contemplates the interests of both scientists and philosophers regarding gathering reliable information about the problem/question at hand in the presence of error, uncertainty, and with limited data information.

The volume makes a significant contribution in bridging the gap between scientific practice and the philosophy of science. The main contribution of this volume pertains to issues of error and inference, and showcases intriguing discussions on statistical testing and providing alternative strategy to Bayesian inference. In words, it provides cumulative information towards the philosophical and methodological issues of scientific inquiry at large.

The target audience of this volume is quite general and open to a broad readership. With some reasonable knowledge of probability theory and statistical science, one can get the maximum benefit from most of the chapters of the volume. The volume contains original and fascinating articles by eminent scholars (nine, including the editors) who range from names in statistical science to philosophy, including D. R. Cox, a name well known to statisticians.

The editors have done a superb job in presenting, organizing, and structuring the material in a logical order. The “Introduction and Background” is nicely presented and summarized, allowing for a smooth reading of the rest of the volume. There is a broad range of carefully selected topics from various related fields reflecting recent developments in these areas. The rest of the volume is divided in nine chapters/sections as follows:

1. Learning from Error, Severe Testing, and the Growth of Theoretical

Knowledge

2. The Life of Theory in the New Experimentalism

3. Revisiting Critical Rationalism

4. Theory Confirmation and Novel Evidence

5. Induction and Severe Testing

6. Theory Testing in Economics and the Error-Statistical Perspective

7. New Perspectives on (Some Old) Problems of Frequentist Statistics

8. Casual Modeling, Explanation and Severe Testing

9. Error and Legal Epistemology

In summary, this volume contains a wealth of knowledge and fascinating debates on a host of important and controversial topics equally important to the philosophy of science and scientific practice. This is a must-read—I enjoyed reading it and I am sure you will too! The book gives a sense of security regarding the future of statistical science and its importance in many walks of life. I also want to take the opportunity to suggest another seemingly related book by Harman and Kulkarni (2007). The review of this book was appeared in Technometricsin May 2008 (Ahmed 2008).

The following are chapters in E & I (2010) written by Mayo and/or Spanos, if you’re interested. If you produce a palindrome meeting the extremely simple requirements for May (by May 25 or so), you can win a free copy! Continue reading

Categories: Review of Error and Inference, Statistics | 3 Comments

‘No-Shame’ Psychics Keep Their Predictions Vague: New Rejected post

imagesSee new rejected post.(You may comment here or on the Rejected Posts blog)

Categories: msc kvetch, rejected post | Leave a comment

If it’s called the “The High Quality Research Act,” then ….

Unknown-2Among the (less technical) items sent my way over the past few days are discussions of the so-called High Quality Research Act. I’d not heard of it, but it’s apparently an outgrowth of the recent hand-wringing over junk science, flawed statistics, non-replicable studies, and fraud (discussed at times on this blog). And it’s clearly a hot topic. Let me just run this by you and invite your comments (before giving my impression). Following the Bill, below, is a list of five NSF projects about which the HQRA’s sponsor has requested further information, and then part of an article from today’s New Yorker on this “divisive new bill”: “Not Safe for Funding: The N.S.F. and the Economics of Science”.

[DISCUSSION DRAFT]

A BILL

April 18, 2013

TO [BE SUPPLIED]

Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled,

SECTION 1. SHORT TITLE.

This act may be cited as the “High Quality Research Act”.

SECTION 2. HIGH QUALITY RESEARCH.

(a) CERTIFICATION.—prior to making an award of any contract or grant funding for a scientific research project, the Director of the NSF shall publish a statement on the public website of the Foundation that certifies that the research project—

(1) is in the interests of the U.S. to advance the national health, prosperity, or welfare, and to secure the national defense by promoting the progress of science;

(2) is the finest quality, is ground breaking, and answers questions or solves problems that are of utmost importance to society at large; and

(3) is not duplicative of other research projects being funded by the Foundation or other Federal Science agencies.

(b) TRANSFER OF FUNDS.—Any unobligated funds for projects ot meeting the requirements of subjection (a) may be awarded to other scientific research projects that do meet such requirements.

(e) INITIAL IMPLEMENTATION REPORT.—Not later than 60 days after the date of enactment of this Act, the Director shall report to the Committee on Commerce, Science, and Transportation of the Senate and the Committee on Science, Space, and Technology of the House of Representatives on how the requirements set for in subsection (a) are being implemented.

(d) NATIONAL SCIENCE BOARD IMPLEMENTATION REPORT. __ Not later than 1 year after the date of enactment of this act, the national science board shall report to the committee on commerce, science, and transportation of the senate and the committee on science, space and technology of the house of representatives its findings and recommendations on how the requirements of subsection (a) are being implemented.

etc. etc.

Link to the Bill

Rep. Lamar Smith,author of the Bill, listed five NSF projects about which he has requested further information. 

1. Award Abstract #1247824: “Picturing Animals in National Geographic, 1888-2008,” March 15, 2013, ($227,437); 

2. Award Abstract #1230911: “Comparative Histories of Scientific Conservation: Nature, Science, and Society in Patagonian and Amazonian South America,” September 1, 2012 ($195,761);

3. Award Abstract #1230365: “The International Criminal Court and the Pursuit of Justice,” August 15, 2012 ($260,001);

4. Award Abstract #1226483, “Comparative Network Analysis: Mapping Global Social Interactions,” August 15, 2012, ($435,000); and

5. Award Abstract #1157551: “Regulating Accountability and Transparency in China’s Dairy Industry,” June 1, 2012 ($152,464).

________________________

MAY 9, 2013

NOT SAFE FOR FUNDING: THE N.S.F. AND THE ECONOMICS OF SCIENCE Continue reading

Categories: junk science, science communication, Statistics | 14 Comments

Professorships in Scandal?

Unknown-1On page 1 of the New York Times yesterday was an article, “The Last Refuge From Scandal? Professorships”:

The traditional path to an academic job is long and laborious: the solitude and penury of graduate study, the scramble for one of the few open positions in each field, the blood sport of competitive publishing. But while colleges have always courted accomplished public figures, a leap to the front of the class has now become a natural move for those who have suffered spectacular career flameouts. At this point, the transition from public disgrace to college lectern is so familiar that when Mr. Galliano merely stepped foot on the campus of Central Saint Martins, an art and design school in London, speculation rippled around the world— incorrectly — that he would soon be teaching there.

I guess this shouldn’t surprise anyone. Sexy course titles and “novelty academics” are pretty old-hat; power and scandal, even if on the sleazy side, attract students; and if students are buying, universities can’t be blamed for selling. Or can they?  Here are some examples they cite:

After a sex scandal forced Eliot Spitzer from the governor’s mansion in Albany, he turned up at City College, teaching a course called “Law and Public Policy.” …

More recently, Parsons the New School for Design announced that John Galliano, the celebrated clothing designer who lost his job at Christian Dior after unleashing a torrent of anti-Semitic vitriol in a bar, would be leading a four-day workshop and discussion called “Show Me Emotion.”

And David H. Petraeus, the general turned intelligence chief turned ribald punch line, will have not one college paycheck, but two. Last month, the City University of New York said he would be the next visiting professor of public policy at Macaulay Honors College. On Thursday, the University of Southern California announced that Mr. Petraeus would also be teaching there…

Despite a petition objecting to Galliano, there seems to be little public concern that offering such courses threatens a university’s ethical standards, especially, perhaps, if “only” sexual transgressions are involved.  Still, while I can see students wanting to enroll in a course taught by a Petreaus or a Spitzer, I doubt the same would be true for one run by a Deiderick Stapel*.  Is it because in the former cases the scandal does not directly touch on their accomplishments? Is there a justifiable principle of distinction operating?**   (Or might it depend on the course?) Continue reading

Categories: rejected post | 4 Comments

Schedule for Ontology & Methodology, 2013

copy-cropped-ampersand-logo-blog1

May 4 (Saturday):

8:30-9:00: Pastries & Coffee (Continental Breakfast) outside of Pamplin 2030

MORNING SESSIONS:

9:00-9:15—Welcome talk
9:15-10:00 Ruetsche: “Method, Metaphysics, and Quantum Theory”
10:00-10:25: Discussion

10:25-10:40 coffee break

10:40-11:05 Shech, “Phase Transitions, Ontology and Earman’s Sound Principle”
11:05-11:20: Discussion

11:20-12:05 Godfrey-Smith, “Evolution and Agency: A Case Study in Ontology and Methodology”
12:05-12:30: Discussion

12:30-1:30 Box Lunch

AFTERNOON SESSIONS: Continue reading

Categories: Announcement | Leave a comment

What should philosophers of science do? (Higgs, statistics, Marilyn)

Marilyn Monroe not walking past a Higgs boson and not making it decay, whatever philosophers might say.

Marilyn Monroe not walking past a Higgs boson and not making it decay, whatever philosophers might say.

My colleague, Lydia Patton, sent me this interesting article, “The Philosophy of the Higgs,” (from The Guardian, March 24, 2013) when I began the posts on “statistical flukes” in relation to the Higgs experiments (here and here); I held off posting it partly because of the slightly sexist attention-getter pic  of Marilyn (in reference to an “irrelevant blonde”[1]), and I was going to replace it, but with what?  All the men I regard as good-looking have dark hair (or no hair). But I wanted to take up something in the article around now, so here it is, a bit dimmed. Anyway apparently MM was not the idea of the author, particle physicist Michael Krämer, but rather a group of philosophers at a meeting discussing philosophy of science and science. In the article, Krämer tells us:

For quite some time now, I have collaborated on an interdisciplinary project which explores various philosophical, historical and sociological aspects of particle physics at the Large Hadron Collider (LHC). For me it has always been evident that science profits from a critical assessment of its methods. “What is knowledge?”, and “How is it acquired?” are philosophical questions that matter for science. The relationship between experiment and theory (what impact does theoretical prejudice have on empirical findings?) or the role of models (how can we assess the uncertainty of a simplified representation of reality?) are scientific issues, but also issues from the foundation of philosophy of science. In that sense they are equally important for both fields, and philosophy may add a wider and critical perspective to the scientific discussion. And while not every particle physicist may be concerned with the ontological question of whether particles or fields are the more fundamental objects, our research practice is shaped by philosophical concepts. We do, for example, demand that a physical theory can be tested experimentally and thereby falsified, a criterion that has been emphasized by the philosopher Karl Popper already in 1934. The Higgs mechanism can be falsified, because it predicts how Higgs particles are produced and how they can be detected at the Large Hadron Collider.

On the other hand, some philosophers tell us that falsification is strictly speaking not possible: What if a Higgs property does not agree with the standard theory of particle physics? How do we know it is not influenced by some unknown and thus unaccounted factor, like a mysterious blonde walking past the LHC experiments and triggering the Higgs to decay? (This was an actual argument given in the meeting!) Many interesting aspects of falsification have been discussed in the philosophical literature. “Mysterious blonde”-type arguments, however, are philosophical quibbles and irrelevant for scientific practice, and they may contribute to the fact that scientists do not listen to philosophers.

I entirely agree that philosophers have wasted a good deal of energy maintaining that it is impossible to solve Duhemian problems of where to lay the blame for anomalies. They misrepresent the very problem by supposing there is a need to string together a tremendously long conjunction consisting of a hypothesis H and a bunch of auxiliaries Ai which are presumed to entail observation e. But neither scientists nor ordinary people would go about things in this manner. The mere ability to distinguish the effects of different sources suffices to pinpoint blame for an anomaly. For some posts on falsification, see here and here*.

The question of why scientists do not listen to philosophers was also a central theme of the recent inaugural conference of the German Society for Philosophy of Science. I attended the conference to present some of the results of our interdisciplinary research group on the philosophy of the Higgs. I found the meeting very exciting and enjoyable, but was also surprised by the amount of critical self-reflection. Continue reading

Categories: Higgs, Statistics, StatSci meets PhilSci | 88 Comments

Getting Credit (or blame) for Something You Didn’t Do (BP oil spill, comedy hour)

UnknownThree years ago, many of us were glued to the “spill cam” showing, in real time, the gushing oil from the April 20, 2010 explosion sinking the Deepwater Horizon oil rig in the Gulf of Mexico, killing 11, and spewing oil until July 15. Trials have been taking place this month, as people try to meet the 3 year deadline to sue BP and others. But what happened to the 200 million gallons of oil?  (Is anyone up to date on this?)  Has it vanished or just sunk to the bottom of the sea by dispersants which may have caused hidden destruction of sea life? I don’t know, but given it’s Saturday night around the 3 year anniversary, let’s listen into a reblog of a spill-related variation on the second of two original “overheard at the comedy hour” jokes. 

In effect, it accuses the frequentist error-statistical account of licensing the following (make-believe) argument after the 2010 oil spill:

Oil Exec: We had highly reliable evidence that H: the pressure was at normal levels on April 20, 2010!

Senator: But you conceded that whenever your measuring tool showed dangerous or ambiguous readings, you continually lowered the pressure, and that the stringent “cement bond log” test was entirely skipped.

 Oil Exec:  Granted, we omitted reliable checks on April 20, 2010, but usually we do a better job—I am reporting the average!  You see, we use a randomizer that most of the time directs us to run the gold-standard check on pressure. But, but April  20 just happened to be one of those times we did the nonstringent test; but on average we do ok.

Senator:  But you don’t know that your system would have passed the more stringent test you didn’t perform!

Oil Exec:  That’s the beauty of the the frequentist test!

Even if we grant (for the sake of the joke) that overall, this “test” rarely errs in the report it outputs (pass or fail),  that is irrelevant to appraising the inference from the data on April 20, 2010 (which would have differed had the more stringent test been run). That interpretation violates the severity criterion:  the observed passing result was altogether common if generated from a source where the pressure level was unacceptably high, Therefore it misinterprets the actual data. The question is why anyone would saddle the frequentist with such shenanigans on averages?  … Lest anyone think I am inventing a criticism, here is a familiar statistical instantiation, where the choice for each experiment is given to be .5 (Cox 1958).

Two Measuring Instruments with Different Precisions:

 A single observation X is to be made on a normally distributed random variable with unknown mean m, but the measurement instrument is chosen by a coin flip: with heads we use instrument E’ with a known small variance, say 10-4, while with tails, we use E”, with a known large variance, say 104. The full data indicates whether E’ or E” was performed, and the particular value observed, which we can write as x’ and x”, respectively. (This example comes up in, ton o’bricks).

In applying our test T+ (see November 2011 blog post ) to a null hypothesis, say, µ = 0, the “same” value of X would correspond to a much smaller p-value were it to have come from E’ than if it had come from E”.  Denote the two p-values as p’ and p”, respectively.  However, or so the criticism proceeds, the error statistician would report the average p-value:  .5(p’ + p”).

But this would give a misleading assessment of the precision and corresponding severity with either measurement! Instead you should report the p-value of the result in the experiment actually run (this is Cox’s Weak Conditionality Principle, WCP).

But what could lead the critic to suppose the error statistician must average over experiments not even performed?  Rule #2 for legitimate criticism is to give the position being criticized the most generous construal one can think of.  Perhaps the critic supposes what is actually a distortion of even the most radical behavioristic construal:

  •   If you consider outcomes that could have occurred in hypothetical repetitions of this experiment, you must also consider other experiments you did not run (but could have been run) in reasoning from the data observed (from the test you actually ran), and report some kind of frequentist average!

The severity requirement makes explicit that such a construal is to be rejected—I would have thought it obvious, and not in need of identifying a special principle. Since it wasn’t, I articulated this special notion for interpreting tests and the corresponding severity criterion. Continue reading

Categories: Bayesian/frequentist, Comedy, Statistics | 2 Comments

Blog Contents 2013 (March)

metablog old fashion typewriterError Statistics Philosophy Blog: March 2013* (Frequentists in Exile-the blog)**:

(3/1) capitalizing on chance
(3/4) Big Data or Pig Data?
(3/7) Stephen Senn: Casting Stones
(3/10) Blog Contents 2013 (Jan & Feb)
(3/11) S. Stanley Young: Scientific Integrity and Transparency
(3/13) Risk-Based Security: Knives and Axes
(3/15) Normal Deviate: Double Misunderstandings About p-values
(3/17) Update on Higgs data analysis: statistical flukes (1)
(3/21) Telling the public why the Higgs particle matters
(3/23) Is NASA suspending public education and outreach?
(3/27) Higgs analysis and statistical flukes (part 2)
(3/31) possible progress on the comedy hour circuit?

*March was incredibly busy here; I’m saving up several partially-baked posts on draft. Also, while I love this old typewriter, I’ve had to have special keys made for common statistical symbols, and that has delayed me some. I hope people will scan the previous contents starting from the beginning (e.g., with “prionvac“): it’s philosophy, remember, and philosophy has to be reread many times over.  January and February 2013 contents are here.

**compiled by Jean Miller and Nicole Jinn.

Categories: Metablog, Statistics | Leave a comment

Blog at WordPress.com.