Author Archives: Mayo

PhilStock: The Great Taper Caper

Posted on June 19, 2013 by Mayo

Categories: PhilStock, Rejected Posts | Leave a comment

P-values can’t be trusted except when used to argue that P-values can’t be trusted!

Posted on June 14, 2013 by Mayo

Have you noticed that some of the harshest criticisms of frequentist error-statistical methods these days rest on methods and grounds that the critics themselves purport to reject? Is there a whiff of inconsistency in proclaiming an “anti-hypothesis-testing stance” while in the same breath extolling the uses of statistical significance tests and p-values in mounting criticisms of significance tests and p-values? I was reminded of this in the last two posts (comments) on this blog (here and here) and one from Gelman from a few weeks ago (“Interrogating p-values”).

Gelman quotes from a note he is publishing:

“..there has been a growing sense that psychology, biomedicine, and other fields are being overwhelmed with errors … . In two recent series of papers, Gregory Francis and Uri Simonsohn and collaborators have demonstrated too-good-to-be-true patterns of p-values in published papers, indicating that these results should not be taken at face value.”

But this fraudbusting is based on finding statistically significant differences from null hypotheses (e.g., nulls asserting random assignments of treatments)! If we are to hold small p-values untrustworthy, we would be hard pressed to take them as legitimating these criticisms, especially those of a career-ending sort.

…in addition to the well-known difficulties of interpretation of p-values…,…and to the problem that, even when all comparisons have been openly reported and thus p-values are mathematically correct, the ‘statistical significance filter’ ensures that estimated effects will be in general larger than true effects, with this discrepancy being well over an order of magnitude in settings where the true effects are small… (Gelman 2013)

But surely anyone who believed this would be up in arms about using small p-values as evidence of statistical impropriety. Am I the only one wondering about this?*

CLARIFICATION (6/15/13): Corey’s comment today leads me to a clarification, lest anyone misunderstand my point. I am sure that Francis, Simonsohn and others would never be using p-values and associated methods in the service of criticism if they did not regard the tests as legitimate scientific tools. I wasn’t talking about them. I was alluding to critics of tests who point to their work as evidence the statistical tools are not legitimate. Now maybe Gelman only intends to say, what we know and agree with, that tests can be misused and misinterpreted. But in these comments, our exchanges, and elsewhere, it is clear he is saying something much stronger. In my view, the use of significance tests by debunkers should have been taken as strong support for the value of the tools, correctly used. In short, I thought it was a success story! And I was rather perplexed to see somewhat the reverse.

______________________

*This just in: If one wants to see a genuine ~~quack~~ extremist** who was outed long ago***, see Ziliak’s article declaring the Higgs physicists are pseudoscientists for relying on significance levels! (in the Financial Post 6/12/13).

**I am not placing the critics referred to above under this umbrella in the least.

***For some reviews of Ziliak and McCloskey, see widgets on left. For their flawed testimony on the Matrixx case, please search this blog.

Categories: reforming the reformers, Statistical fraudbusting, Statistics | 43 Comments

Mayo: comment on the repressed memory research

Posted on June 11, 2013 by Mayo

Here are some reflections on the repressed memory articles from Richard Gill’s post, focusing on Geraerts et al. (2008).

1. Richard Gill reported that “Everyone does it this way, in fact, if you don’t, you’d never get anything published: …People are not deliberately cheating: they honestly believe in their theories and believe the data is supporting them and are just doing their best to make this as clear as possible to everyone.”

This remark is very telling. I recommend we just regard those cases as illustrating a theory one believes, rather than providing evidence for that theory. If we could mark them as such, we could stop blaming significance tests for playing a role in what are actually only attempts to illustrate a theory, or to strengthen someone’s beliefs about it.

2. I was surprised the examples had to do with recovered memories. Wasn’t that entire area dubbed a pseudoscience way back (at least 15-25 years ago?) when “therapy induced” memories of childhood sexual abuse (CSA) were discovered to be just that—therapy induced and manufactured? After the witch hunts that ensued (the very accusation sufficing for evidence), I thought the field of “research” had been put out of its and our misery. So, aside from having used the example in a course on critical thinking, I’m not up on this current work at all. But, as these are just blog comments, let me venture some off-the-cuff skeptical thoughts. They will have almost nothing to do with the statistical data analysis, by the way…

3. Geraerts et al. (2008, 22) admit at the start of the article that therapy-recovered CSA memories are unreliable, and the idea of automatically repressing a traumatic event like CSA implausible. Then mightn’t it seem the entire research program should be dropped? Not to its adherents! As with all theories that enjoy the capacity of being sufficiently flexible to survive anomaly (Popper’s pseudosciences), there’s some life left here too. Maybe, its adherents reason, it’s not necessary for those who report “spontaneously recovered” CSA memories to be repressors; instead they may merely be “suppressors” who are good at blocking out negative events. If so, they didn’t automatically repress but rather deliberately suppressed: “Our findings may partly explain why people with spontaneous CSA memories have the subjective impression that they have ‘repressed’ their CSA memories for many years.” (ibid., 22).

4. Shouldn’t we stop there? I would. We have a research program growing out of an exemplar of pseudoscience being kept alive by ever-new “monster-barring” strategies (as Lakatos called them). (I realize they’re not planning to go out to the McMartin school, but still…) If a theory T is flexible enough so that any observations can be interpreted through it, and thereby regarded as confirming T, then it is no surprise that this is still true when the instances are dressed up with statistics. It isn’t that theories of repressed memories are implausible or improbable (in whatever sense one takes those terms). It is the ever-flexibility of these theories that renders the research program pseudoscience (along with, in this case, a history of self-sealing data interpretations). Continue reading →

Categories: junk science, Statistical fraudbusting, Statistics | 7 Comments

Richard Gill: “Integrity or fraud… or just questionable research practices?”

Posted on June 8, 2013 by Mayo


Professor Richard Gill
Statistics Group
Mathematical Institute
Leiden University
http://www.math.leidenuniv.nl/~gill/

I am very grateful to Richard Gill for permission to post an e-mail from him (after my “dirty laundry” post) along with slides from his talk, “Integrity or fraud… or just questionable research practices?” and associated papers. I record my own reflections on the pseudoscientific nature of the program in one of the Geraerts et al. papers in a later post.

I certainly have been thinking about these issues a lot in recent months. I got entangled in intensive scientific and media discussions – mainly confined to the Netherlands – concerning the cases of social psychologist Dirk Smeesters and of psychologist Elke Geraerts. See: http://www.math.leidenuniv.nl/~gill/Integrity.pdf

And I recently got asked to look at the statistics in some papers of another … [researcher] ..but this one is still confidential ….

The verdict on Smeesters was that he, like Stapel, actually faked data (though he still denies this).

The Geraerts case is very much open, very much unclear. The senior co-authors Merckelbach, McNally of the attached paper, published in the journal “Memory”, have asked the journal editors for it to be withdrawn because they suspect the lead author, Elke Geraerts, of improper conduct. She denies any impropriety. It turns out that none of the co-authors have the data. Legally speaking it belongs to the University of Maastricht where the research was carried out and where Geraerts was a promising postdoc in Merckelbach’s group. She later got a chair at Erasmus University Rotterdam and presumably has the data herself but refuses to share it with her old co-authors or any other interested scientists. Just looking at the summary statistics in the paper one sees evidence of “too good to be true”. Average scores in groups supposed in theory to be similar are much closer to one another than one would expect on the basis of the within group variation (the paper reports averages and standard deviations for each group, so it is easy to compute the F statistic for equality of the three similar groups and use its left tail probability as test statistic. Continue reading →
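The "too good to be true" check Gill describes can be sketched from summary statistics alone. The following is only a rough illustration, assuming the usual one-way ANOVA setup (independent Normal groups with a common variance); the group means, standard deviations and sizes are hypothetical and are not taken from the paper. With exactly three groups the numerator has 2 degrees of freedom, and the F distribution then has a closed-form left tail, which is what the second function relies on.

```python
def f_stat_from_summaries(means, sds, ns):
    """One-way ANOVA F statistic computed from group means, SDs and sizes."""
    k = len(means)
    total_n = sum(ns)
    grand_mean = sum(n * m for n, m in zip(ns, means)) / total_n
    # Between-group mean square: spread of the group means around the grand mean.
    ms_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means)) / (k - 1)
    # Within-group mean square: pooled variance inside the groups.
    ms_within = sum((n - 1) * s ** 2 for n, s in zip(ns, sds)) / (total_n - k)
    return ms_between / ms_within, total_n - k


def left_tail_three_groups(f_value, df_within):
    """P(F <= f_value) for F(2, df_within); closed form valid for three groups only."""
    return 1.0 - (1.0 + 2.0 * f_value / df_within) ** (-df_within / 2.0)


# Hypothetical summary statistics, NOT the values reported in the paper.
means = [10.0, 10.1, 9.9]
sds = [2.0, 2.0, 2.0]
ns = [30, 30, 30]

f_value, df_within = f_stat_from_summaries(means, sds, ns)
p_left = left_tail_three_groups(f_value, df_within)
print(f"F = {f_value:.3f}, left-tail probability = {p_left:.3f}")
```

A small left-tail probability says the group averages agree more closely than the within-group variation would lead one to expect, which is the sense of "too good to be true" here.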

Categories: junk science, Statistical fraudbusting, Statistics | 5 Comments

Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)

Posted on June 6, 2013 by Mayo

Having reblogged the 5/17/12 post on “reforming the reformers” yesterday, I thought I should reblog its follow-up: 6/2/12.

Consider again our one-sided Normal test T+, with null H₀: μ ≤ μ₀ vs H₁: μ > μ₀ and μ₀ = 0, α = .025, and σ = 1, but let n = 25. So M is statistically significant only if it exceeds .392. Suppose M (the sample mean) just misses significance, say

M₀ = .39.
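The cut-off and the near miss can be checked with a few lines of arithmetic. This is a minimal sketch using only the values stated above (μ₀ = 0, σ = 1, n = 25, α = .025); the variable names are my own.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025

se = sigma / sqrt(n)                        # standard error of the sample mean M
z_alpha = NormalDist().inv_cdf(1 - alpha)   # one-sided critical value, about 1.96
cutoff = mu0 + z_alpha * se                 # M must exceed this to be significant

m_obs = 0.39                                # the observed sample mean
p_value = 1 - NormalDist(mu0, se).cdf(m_obs)

print(f"cutoff = {cutoff:.3f}")
print(f"p-value for M = {m_obs}: {p_value:.4f}")
print("significant at alpha" if m_obs > cutoff else "just misses significance")
```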

The flip side of a fallacy of rejection (discussed before) is a fallacy of acceptance, or the fallacy of misinterpreting statistically insignificant results. To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ =μ₀, we wish to identify discrepancies that can and cannot be ruled out. For our test T+, we reason from insignificant results to inferential claims of the form:

μ < μ₀ + γ

Fisher continually emphasized that failure to reject was not evidence for the null. Neyman, we saw, in chastising Carnap, argued for the following kind of power analysis:

Neymanian Power Analysis (Detectable Discrepancy Size DDS): If data x are not statistically significantly different from H₀, and the power to detect discrepancy γ is high (low), then x constitutes good (poor) evidence that the actual effect is < γ. (See 11/9/11 post).
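To make the power analysis concrete, here is a rough sketch of the power of test T+ against a few discrepancies γ, assuming the setup stated earlier (μ₀ = 0, σ = 1, n = 25, α = .025). The function name and the particular values of γ are illustrative choices, not part of the original discussion.

```python
from math import sqrt
from statistics import NormalDist

MU0, SIGMA, N, ALPHA = 0.0, 1.0, 25, 0.025
SE = SIGMA / sqrt(N)
CUTOFF = MU0 + NormalDist().inv_cdf(1 - ALPHA) * SE  # about .392


def power(gamma):
    """Probability that M exceeds CUTOFF when the true mean is MU0 + gamma."""
    return 1 - NormalDist(MU0 + gamma, SE).cdf(CUTOFF)


for gamma in (0.2, 0.4, 0.6, 0.8):
    print(f"gamma = {gamma:.1f}: power = {power(gamma):.3f}")
```

On the DDS reasoning, a non-significant result is good evidence that the effect is less than γ only for those γ where this power is high, and poor evidence where it is low.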

By taking into account the actual x₀, a more nuanced post-data reasoning may be obtained.

“In the Neyman-Pearson theory, sensitivity is assessed by means of the power—the probability of reaching a preset level of significance under the assumption that various alternative hypotheses are true. In the approach described here, sensitivity is assessed by means of the distribution of the random variable P, considered under the assumption of various alternatives.” (Cox and Mayo 2010, p. 291)

This may be captured in:

FEV(ii): A moderate p-value is evidence of the absence of a discrepancy d from H₀ only if there is a high probability the test would have given a worse fit with H₀ (i.e., a smaller p-value) were a discrepancy d to exist. (Mayo and Cox 2005, 2010, 256).

This is equivalently captured in the Rule of Acceptance (Mayo (EGEK) 1996), and in the severity interpretation for acceptance, SIA (Mayo and Spanos 2006, p. 337):

SIA: (a): If there is a very high probability that [the observed difference] would have been larger than it is, were μ > μ₁, then μ < μ₁ passes the test with high severity,…
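A post-data severity assessment for the observed M₀ = .39 might be sketched as follows. I am assuming the setup stated earlier (σ = 1, n = 25) and the usual convention of evaluating the probability at the boundary value μ = μ₁; the function name and the chosen values of μ₁ are illustrative.

```python
from math import sqrt
from statistics import NormalDist

SIGMA, N = 1.0, 25
SE = SIGMA / sqrt(N)
M_OBS = 0.39  # observed sample mean, just short of significance


def severity_upper_bound(mu1):
    """SEV(mu < mu1): probability of a sample mean larger than M_OBS,
    computed under mu = mu1 (the boundary of mu > mu1)."""
    return 1 - NormalDist(mu1, SE).cdf(M_OBS)


for mu1 in (0.2, 0.39, 0.59, 0.79):
    print(f"SEV(mu < {mu1:.2f}) = {severity_upper_bound(mu1):.3f}")
```

Unlike the pre-data power calculation, this uses the actual outcome: claims of the form μ < μ₁ pass with high severity only for μ₁ well above the observed mean.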

But even taking tests and CIs just as we find them, we see that CIs do not avoid the fallacy of acceptance: they do not block erroneous construals of negative results adequately. Continue reading →

Categories: CIs and tests, Error Statistics, reformers, Statistics | Tags: confidence intervals, criticism of frequentist methods, fallacy of acceptance, fallacy of rejection, P-value, power, R. Carnap, reformers | Leave a comment

PhilStock: Topsy-Turvy Game

Posted on June 6, 2013 by Mayo

See rejected posts.

Categories: PhilStock, Rejected Posts | Leave a comment

Do CIs Avoid Fallacies of Tests? Reforming the Reformers (Reblog 5/17/12)

Posted on June 5, 2013 by Mayo

The one method that enjoys the approbation of the New Reformers is that of confidence intervals. The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter µ” (Cumming 2012, p. 69). He recommends prespecified confidence levels .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value.” (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it? Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+.

H₀: µ ≤ 0 against H₁: µ > 0, and let σ = 1.

Test T+ at significance level .02 is analogous to forming the one-sided (lower) 98% confidence interval:

µ > M – 2(1/√n).

where M, following Cumming, is the sample mean (thereby avoiding those x-bars). M – 2(1/√n) is the lower limit (LL) of a 98% CI.
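The duality between test T+ and the one-sided interval can be made concrete with a short sketch, assuming σ = 1 and the 2-standard-deviation cut-off used here. The function names and the sample values passed to them are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

SIGMA = 1.0
K = 2.0  # the 2-standard-deviation cut-off


def lower_limit(m, n):
    """Lower limit (LL) of the one-sided interval: M - 2(1/sqrt(n))."""
    return m - K * SIGMA / sqrt(n)


def rejects_null(m, n):
    """Test T+ rejects H0: mu <= 0 when M exceeds 2 standard errors."""
    return m > K * SIGMA / sqrt(n)


# Confidence level that actually corresponds to a 2-SD cut-off (roughly .98).
print(f"one-sided confidence level: {NormalDist().cdf(K):.4f}")

# Hypothetical outcomes: the test rejects exactly when the LL is above 0.
for m, n in [(0.5, 100), (0.1, 100), (0.5, 9)]:
    print(m, n, round(lower_limit(m, n), 3), rejects_null(m, n))
```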

Central problems with significance tests (whether of the N-P or Fisherian variety) include:

(1) results are too dichotomous (e.g., significant at a pre-set level or not);

(2) two equally statistically significant results but from tests with different sample sizes are reported in the same way (whereas the larger the sample size the smaller the discrepancy the test is able to detect);

(3) significance levels (even observed p-values) fail to indicate the extent of the effect or discrepancy (in the case of test T+ , in the positive direction).

We would like to know for what values of δ it is warranted to infer µ > µ₀ + δ. Continue reading →

Categories: confidence intervals and tests, reformers, Statistics | Tags: confidence intervals, Geoff Cumming, reformers, significance tests | 7 Comments

Some statistical dirty laundry

Posted on June 1, 2013 by Mayo

I finally had a chance to fully read the 2012 Tilburg Report* on “Flawed Science” last night. The full report is now here. Here are some stray thoughts…

1. Slipping into pseudoscience.
The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”[1]. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job.

One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts…. “[T]he use of research procedures in such a way as to ‘repress’ negative results by some means” may be called verification bias. [my emphasis] (Report, 48).

I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses, as to count as no evidence at all (see some from their list). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89). [2] It is unclear at what point a field slips into the pseudoscience realm.

2. A role for philosophy of science?
I am intrigued that one of the final recommendations in the Report is this:

In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered. Based on these insights, research Master’s students and PhD students must receive practical training from their supervisors in the application of the rules governing proper and honest scientific research, which should include examples of such undesirable conduct as data massage. The Graduate School must explicitly ensure that this is implemented.

A philosophy department could well create an entire core specialization that revolved around “the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science” (ideally linked with one or more other departments). That would be both innovative and fill an important gap, it seems to me. Is anyone doing this?

3. Hanging out some statistical dirty laundry.
Items in their laundry list include:

An experiment fails to yield the expected statistically significant results. The experiment is repeated, often with minor changes in the manipulation or other conditions, and the only experiment subsequently reported is the one that did yield the expected results. The article makes no mention of this exploratory method… It should be clear, certainly with the usually modest numbers of experimental subjects, that using experiments in this way can easily lead to an accumulation of chance findings…. Continue reading →
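How quickly chance findings accumulate under this practice can be seen from a back-of-the-envelope calculation. The sketch below assumes a true null hypothesis, independent repetitions, and a fixed significance level of .05; none of these specifics come from the Report, and the function name is my own.

```python
def prob_at_least_one_significant(attempts, alpha=0.05):
    """Chance that at least one of `attempts` independent tests of a true
    null reaches significance at level alpha: 1 - (1 - alpha)**attempts."""
    return 1.0 - (1.0 - alpha) ** attempts


for attempts in (1, 3, 5, 10):
    print(attempts, round(prob_at_least_one_significant(attempts), 3))
```

If only the "successful" run is reported, the nominal .05 level badly understates the actual probability of erroneously finding an effect.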

Categories: junk science, spurious p values, Statistics | 6 Comments

Winner of May Palindrome Contest

Posted on June 1, 2013 by Mayo

“Able no one nil red nudist opening nine pots. I’d underline ‘No’ on Elba.” Anonymous. See rejected posts.

Categories: Palindrome | Leave a comment

K. Staley: review of Error & Inference

Posted on May 29, 2013 by Mayo

K. W. Staley
Associate Professor
Department of Philosophy,
Saint Louis University

(Almost) All about error

BOOK REVIEW Metascience (2012) 21:709–713 DOI 10.1007/s11016-011-9618-1
Deborah G. Mayo and Aris Spanos (eds): Error and inference: Recent exchanges on experimental reasoning, reliability, objectivity, and rationality. New York: Cambridge University Press, 2010, xvii+419 pp

The ERROR’06 (experimental reasoning, reliability, objectivity, and rationality) conference held at Virginia Tech aimed to advance the discussion of some central themes in philosophy of science debated by Deborah Mayo and her more-or-less friendly critics over the years. The volume here reviewed brings together the contributions of these critics and Mayo’s responses to them (with Mayo’s collaborator Aris Spanos). (I helped with the organization of the conference and, with Mayo and Jean Miller, edited a separate collection of workshop papers that were presented there, published as a special issue of Synthese.) My review will focus on a couple of themes I hope to be of interest to a broad philosophical audience, then turn more briefly to an overview of the entire collection. The discussions in Error and Inference (E&I) are indispensable for understanding several current issues regarding the methodology of science.

The remarkably useful introductory chapter lays out the broad themes of the volume and discusses “The Error-Statistical Philosophy”. Here, Mayo and Spanos provide the most succinct and non-technical account of the error-statistical approach that has yet been published, a feature that alone should commend this text to anyone who has found it difficult to locate a reading on error statistics suitable for use in teaching.

Mayo holds that the central question for a theory of evidence is not the degree to which some observation E confirms some hypothesis H but how well-probed for error a hypothesis H is by a testing procedure T that results in data x₀. This reorientation has far-reaching consequences for Mayo’s approach to philosophy of science. On this approach, addressing the question of when data “provide good evidence for or a good test of” a hypothesis requires attention to characteristics of the process by means of which the data are used to bear on the hypothesis. Mayo identifies the starting point from which her account is developed as the “Weak Severity Principle” (WSP):

Data x₀ do not provide good evidence for hypothesis H if x₀ results from a test procedure with a very low probability or capacity of having uncovered the falsity of H (even if H is incorrect). (21)

The weak severity principle is then developed into the full severity principle (SP), according to which “data x₀ provide a good indication of or evidence for hypothesis H (just) to the extent that test T has severely passed H with x₀” where H passes a severe test T with x₀ if x₀ “agrees with” H and “with very high probability, test T would have produced a result that accords less well with H than does x₀, if H were false or incorrect” (22). This principle constitutes the heart of the error-statistical account of evidence, and E&I, by including some of the most important critiques of the principle, provides a forum in which Mayo and Spanos attempt to correct misunderstandings of the principle and to clarify its meaning and application.

The appearance in the WSP of the disjunctive phrase “a very low probability or capacity” (my emphasis) indicates a point central to much of this clarificatory work. The error-statistical account is resolutely frequentist in its construal of probability. It is commonly held (including by some frequentists) that the rationale for frequentist statistical methods lies exclusively in the fact that they can sometimes be shown to have low error rates in the long run. Throughout E&I, Mayo insists that this “behaviorist rationale” is not applicable when it comes to evaluating a particular body of data in order to determine what inferences may be warranted. That evaluation rests upon thinking about the particular data and the inference at hand in light of the capacity of the test to reveal potential errors in the inference drawn. Frequentist probabilities are part of how one models the error-detecting capacities of the process. As Mayo explains in a later chapter co-authored with David Cox, tests of hypotheses function analogously to measuring instruments: “Just as with the use of measuring instruments, applied to a specific case, we employ the performance features to make inferences about aspects of the particular thing that is measured, aspects that the measuring tool is appropriately capable of revealing” (257).

One of the most fascinating exchanges in E&I concerns the role of severe testing in the appraisal of “large-scale” theories. According to Mayo, theory appraisal proceeds by a “piecemeal” process of severe probing for specific ways in which a theory might be in error. She illustrates this with the history of experimental tests of theories of gravity, emphasizing Clifford Will’s parametrized post-Newtonian (PPN) framework, by means of which all metric theories of gravity can be represented in their weak-field, slow-motion limits by means of ten parameters. Experimental work on gravity theories then severely tests hypotheses about the values of those parameters. Rather than attempting to confirm or probabilify the general theory of relativity (GTR), the aim is to learn about the ways in which GTR might be in error, more generally to “measure how far off what a given theory says about a phenomenon can be from what a ‘correct’ theory would need to say about it” (55).

Alan Chalmers and Alan Musgrave both challenge this view. According to Chalmers, no general theory, whether “low level” or “high level”, can pass a severe test because the content of theories surpasses whatever empirical evidence supports them. As a consequence, Chalmers argues, Mayo’s severe-testing account of scientific inference must be incomplete because even low-level experimental testing sometimes demands relying on general theoretical claims. Similarly, Musgrave accuses Mayo of holding that (general) theories are not tested by “testing their consequences”, but that “all that we really test are the consequences” (105), leaving her with “nothing to say” about the assessment, adoption, or rejection of general theories (106). Continue reading →

Categories: Error Statistics, Statistics | Tags: Error & Inference, Staley | 1 Comment

A.Birnbaum: Statistical Methods in Scientific Inference

Posted on May 27, 2013 by Mayo

Birnbaum: born May 27, 1923

Today is (statistician) Allan Birnbaum’s birthday. He lived to be only 53 [i]. From the perspective of philosophy of statistics and philosophy of science, Birnbaum is best known for his work on likelihood, the Likelihood Principle [ii], and for his attempts to blend concepts of likelihood with error probability ideas to obtain what he called “concepts of statistical evidence”. Failing to find adequate concepts of statistical evidence, Birnbaum called for joining the work of “interested statisticians, scientific workers and philosophers and historians of science”–an idea I would heartily endorse! While known for attempts to argue that the (strong) Likelihood Principle followed from sufficiency and conditionality principles, a few years after publishing this result, he seems to have turned away from it, perhaps discovering gaps in his argument.

NATURE VOL. 225 MARCH 14, 1970 (1033)

LETTERS TO THE EDITOR

Statistical methods in Scientific Inference

It is regrettable that Edwards’s interesting article[1], supporting the likelihood and prior likelihood concepts, did not point out the specific criticisms of likelihood (and Bayesian) concepts that seem to dissuade most theoretical and applied statisticians from adopting them. As one whom Edwards particularly credits with having ‘analysed in depth…some attractive properties’ of the likelihood concept, I must point out that I am not now among the ‘modern exponents’ of the likelihood concept. Further, after suggesting that the notion of prior likelihood was plausible as an extension or analogue of the usual likelihood concept (ref.2, p. 200)[2], I have pursued the matter through further consideration and rejection of both the likelihood concept and various proposed formalizations of prior information and opinion (including prior likelihood). I regret not having expressed my developing views in any formal publication between 1962 and late 1969 (just after ref. 1 appeared). My present views have now, however, been published in an expository but critical article (ref. 3, see also ref. 4)[3] [4], and so my comments here will be restricted to several specific points that Edwards raised.

If there has been ‘one rock in a shifting scene’ of general statistical thinking and practice in recent decades, it has not been the likelihood concept, as Edwards suggests, but rather the concept by which confidence limits and hypothesis tests are usually interpreted, which we may call the confidence concept of statistical evidence. This concept is not part of the Neyman-Pearson theory of tests and confidence region estimation, which denies any role to concepts of statistical evidence, as Neyman consistently insists. The confidence concept takes from the Neyman-Pearson approach techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data. (The absence of a comparable property in the likelihood and Bayesian approaches is widely regarded as a decisive inadequacy.) The confidence concept also incorporates important but limited aspects of the likelihood concept: the sufficiency concept, expressed in the general refusal to use randomized tests and confidence limits when they are recommended by the Neyman-Pearson approach; and some applications of the conditionality concept. It is remarkable that this concept, an incompletely formalized synthesis of ingredients borrowed from mutually incompatible theoretical approaches, is evidently useful continuously in much critically informed statistical thinking and practice [emphasis mine].

While inferences of many sorts are evident everywhere in scientific work, the existence of precise, general and accurate schemas of scientific inference remains a problem. Mendelian examples like those of Edwards and my 1969 paper seem particularly appropriate as case-study material for clarifying issues and facilitating effective communication among interested statisticians, scientific workers and philosophers and historians of science.

Allan Birnbaum
New York University
Courant Institute of Mathematical Sciences,
251 Mercer Street,
New York, NY 10012

Birnbaum’s confidence concept, sometimes written (Conf), was his attempt to find in error statistical ideas a concept of statistical evidence–a term that he invented and popularized. In Birnbaum 1977 (24), he states it as follows:

(Conf): A concept of statistical evidence is not plausible unless it finds ‘strong evidence for J as against H’ with small probability (α) when H is true, and with much larger probability (1 – β) when J is true.

Birnbaum questioned whether Neyman-Pearson methods had “concepts of evidence” simply because Neyman talked of “inductive behavior” and Wald and others couched statistical methods in decision-theoretic terms. I have been urging that we consider instead how the tools may actually be used, and not be restricted by the statistical philosophies of founders (not to mention that so many of their statements are tied up with personality disputes, and problems of “anger management”). Recall, as well, E. Pearson’s insistence on an evidential construal of N-P methods, and the fact that Neyman, in practice, spoke of drawing inferences and reaching conclusions (e.g., Neyman’s nursery posts, links in [iii] below). Continue reading →

Categories: Likelihood Principle, phil/history of stat, Statistics | Tags: Birnbaum | 3 Comments

Schachtman: High, Higher, Highest Quality Research Act

Posted on May 26, 2013 by Mayo

Since posting on the High Quality Research act a few weeks ago, I’ve been following it in the news, have received letters from professional committees (asking us to write letters), and now see that Nathan A. Schachtman, Esq., PC posted the following on May 25, 2013 on his legal blog*:

“The High Quality Research Act” (HQRA), which has not been formally introduced in Congress, continues to draw attention. See “Clowns to the left of me, Jokers to the right.” Last week, Sarewitz suggests that “the problem” is the hype about the benefits of pure research and the let down that results from the realization that scientific progress is “often halting and incremental,” with much research not “particularly innovative or valuable.” Fair enough, but why is this Congress such an unsophisticated consumer of scientific research in the 21st century? How can it be a surprise that the scientific community engages in the same rent-seeking behaviors as do other segments of our society? Has it escaped Congress’s attention that scientists are subject to enthusiasms and group think, just like, … congressmen?

Nature published an editorial piece suggesting that the HQRA is not much of a threat. Daniel Sarewitz, “Pure hype of pure research helps no one,” 497 Nature 411 (2013).

Still, Sarewitz believes that the HQRA bill is not particularly threatening to the funding of science:

“In other words, it’s not a very good bill, but neither is it much of a threat. In fact, it’s just the latest skirmish in a long-running battle for political control over publicly funded science — one fought since at least 1947, when President Truman vetoed the first bill to create the NSF because it didn’t include strong enough lines of political accountability.”

This sanguine evaluation misses the effect of the superlatives in the criteria for National Science Foundation funding:

“(1) is in the interests of the United States to advance the national health, prosperity, or welfare, and to secure the national defense by promoting the progress of science;

(2) is the finest quality, is ground breaking, and answers questions or solves problems that are of utmost importance to society at large; and

(3) is not duplicative of other research projects being funded by the Foundation or other Federal science agencies.” Continue reading →

Categories: evidence-based policy, PhilStatLaw, Statistics | Tags: Schachtman | 12 Comments

Gelman sides w/ Neyman over Fisher in relation to a famous blow-up

Posted on May 24, 2013 by Mayo

blog-o-log

Andrew Gelman had said he would go back to explain why he sided with Neyman over Fisher in relation to a big, famous argument discussed on my Feb. 16, 2013 post: “Fisher and Neyman after anger management?”, and I just received an e-mail from Andrew saying that he has done so: “In which I side with Neyman over Fisher”. (I’m not sure what Senn’s reply might be.) Here it is:

“In which I side with Neyman over Fisher” Posted by Andrew on 24 May 2013, 9:28 am

As a data analyst and a scientist, Fisher > Neyman, no question. But as a theorist, Fisher came up with ideas that worked just fine in his applications but can fall apart when people try to apply them too generally.

Here’s an example that recently came up.

Deborah Mayo pointed me to a comment by Stephen Senn on the so-called Fisher and Neyman null hypotheses. In an experiment with n participants (or, as we used to say, subjects or experimental units), the Fisher null hypothesis is that the treatment effect is exactly 0 for every one of the n units, while the Neyman null hypothesis is that the individual treatment effects can be negative or positive but have an average of zero.

Senn explains why Neyman’s hypothesis in general makes no sense—the short story is that Fisher’s hypothesis seems relevant in some problems (sometimes we really are studying effects that are zero or close enough for all practical purposes), whereas Neyman’s hypothesis just seems weird (it’s implausible that a bunch of nonzero effects would exactly cancel). And I remember a similar discussion as a student, many years ago, when Rubin talked about that silly Neyman null hypothesis. Continue reading →
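A small sketch may help keep the two null hypotheses apart. It is my own illustration, not code from Gelman or Senn: the helper names and the toy numbers are hypothetical. The first two functions check each null against a list of individual treatment effects (which in practice are never observed directly). The third is a Monte Carlo randomization test, which is valid under Fisher's sharp null because, if no unit is affected, reshuffling the treatment labels leaves every outcome unchanged.

```python
import random
from statistics import mean


def fisher_null_holds(effects):
    """Fisher's null: the treatment effect is exactly 0 for every unit."""
    return all(e == 0 for e in effects)


def neyman_null_holds(effects, tol=1e-12):
    """Neyman's null: individual effects may be nonzero but average to zero."""
    return abs(mean(effects)) <= tol


def randomization_p_value(treated, control, draws=5000, seed=0):
    """Monte Carlo randomization test of Fisher's sharp null.

    Reshuffles treatment labels and counts how often the absolute
    difference in group means is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    k = len(treated)
    hits = 0
    for _ in range(draws):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:k]) - mean(pooled[k:]))
        if diff >= observed - 1e-12:  # small tolerance for float comparison
            hits += 1
    return hits / draws


# Hypothetical effects that cancel exactly: Neyman's null holds, Fisher's does not.
effects = [1, -1, 2, -2]
print(fisher_null_holds(effects), neyman_null_holds(effects))

# Hypothetical outcomes for four treated and four control units.
print(randomization_p_value([10, 11, 12, 13], [1, 2, 3, 4]))
```

The `effects` list is exactly the kind of case Senn finds implausible: a bunch of nonzero effects that happen to cancel.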

Categories: Fisher, Statistics, Stephen Senn | Tags: blogolog, Fisher | 10 Comments

Mayo’s slides from the Onto-Meth conference*

Posted on May 22, 2013 by Mayo

Methodology and Ontology in Statistical Modeling: Some error statistical reflections (Spanos and Mayo)—uncorrected

Our presentation falls under the second of the bulleted questions for the conference (conference blog is here):

How do methods of data generation, statistical modeling, and inference influence the construction and appraisal of theories?

Statistical methodology can influence what we think we’re finding out about the world, in the most problematic ways, traced to such facts as:

All statistical models are false
Statistical significance is not substantive significance
Statistical association is not causation
No evidence against a statistical null hypothesis is not evidence the null is true
If you torture the data enough they will confess.

(or just omit unfavorable data)

These points are ancient (lying with statistics, lies damn lies, and statistics)

People are discussing these problems more than ever (big data), but what’s rarely realized is how much certain methodologies are at the root of the current problems.

__________________1__________________

All Statistical Models are False

Take the popular slogan in statistics and elsewhere: “all statistical models are false!”

What the “all models are false” charge boils down to:

(1) the statistical model of the data is at most an idealized and partial representation of the actual data generating source.

(2) a statistical inference is at most an idealized and partial answer to a substantive theory or question.

But we already know our models are idealizations: that’s what makes them models
Reasserting these facts is not informative.
Yet they are taken to have various (dire) implications about the nature and limits of statistical methodology
Neither of these facts precludes the use of these models to find out true things
On the contrary, it would be impossible to learn about the world if we did not deliberately falsify and simplify.
__________________2__________________

Notably, the “all models are false” slogan is followed up by “But some are useful”,

Their usefulness, we claim, is being capable of adequately capturing an aspect of a phenomenon of interest

Then a hypothesis asserting its adequacy (or inadequacy) is capable of being true!

Note: All methods of statistical inferences rest on statistical models.

What differentiates accounts is how well they step up to the plate in checking adequacy, learning despite violations of statistical assumptions (robustness)

__________________3__________________

Statistical significance is not substantive significance

Statistical models (as they arise in the methodology of statistical inference) live somewhere between

Substantive questions, hypotheses, theories H

Statistical models of phenomena, experiments, data: M

Data x

What statistical inference has to do is afford adequate link-ups (reporting precision, accuracy, reliability)
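As a rough numerical illustration of the slogan on this slide (my own sketch with made-up numbers, not part of the original presentation): hold a substantively tiny observed difference fixed and let the sample size grow. The setup assumes a one-sided test on a Normal mean with known standard deviation; the function name is hypothetical.

```python
from math import erfc, sqrt


def one_sided_p_value(observed_mean, null_mean, sigma, n):
    """P-value for H0: mean = null_mean against a larger mean,
    assuming Normal data with known sigma (illustrative assumption)."""
    z = (observed_mean - null_mean) / (sigma / sqrt(n))
    return 0.5 * erfc(z / sqrt(2))  # upper-tail area of the standard Normal


# The same small observed difference (0.01 standard deviations) at three sample sizes.
for n in (100, 10_000, 90_000):
    p = one_sided_p_value(observed_mean=0.01, null_mean=0.0, sigma=1.0, n=n)
    print(f"n = {n:>6}: p = {p:.5f}")
```

The observed difference never changes, yet the p-value shrinks as n grows. Whether a difference of 0.01 standard deviations matters is a substantive question that the p-value alone does not answer, which is why the link-ups between H, M, and x need to report magnitudes as well.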

__________________4__________________ Continue reading →

Categories: O & M conference | 34 Comments

Mayo: Meanderings on the Onto-Methodology Conference

Posted on May 19, 2013 by Mayo

Writing a blog like this, a strange and often puzzling exercise[1], does offer a forum for sharing half-baked chicken-scratchings from the back of frayed pages on themes from our Onto-Meth[2] conference from two weeks ago[3]. (The previous post had notes from blogger and attendee, Gandenberger.)

Onto-Meth conference

Several of the talks reflect a push-back against the idea that the determination of “ontology” in science—e.g., the objects and processes of theories, models and hypotheses—is (or should strive to correspond to?) “real” objects in the world and/or what is approximately the case about them. Instead, at least some of the speakers wish to liberate ontology to recognize how “merely” pragmatic goals, needs, and desires are not just second-class citizens, but can and do (and should?) determine the categories of reality. Well there are a dozen equivocations here, most of which we did not really discuss at the conference.

In my own half of the Spanos-Mayo (D & P presentation[4]) I granted and even promoted the idea of a methodology that was pragmatic while also objective, so I’m not objecting to that part. The measurement of my weight is a product of “discretionary” judgments (e.g., to weigh in pounds with a scale having a given precision), but it is also a product of how much I really weigh (no getting around it). By understanding the properties of methodological tools and measuring systems, it is possible to “subtract out” the influence of the judgments to get at what is actually the case. At least approximately. But that view is different, it seems to me, from someone like Larry Laudan (at least in his later metamorphosis). Even though he considers his “reticulated” view a fairly hard-nosed spin on the Kuhnian idea of scientific paradigms as invariably containing an ontology (e.g., theories), a methodology, and (what he called) an “axiology” or set of aims (OMA), Laudan seems to think standards are so variable that what counts as evidence is constantly fluctuating (aside from maybe retaining the goal of fitting diverse facts). So I wonder if these pragmatic leanings are more like Laudan or more like me (and my view here, I take it, is essentially that of Peirce). I am perfectly sympathetic to the piecemeal “locavoracity” idea in Ruetsche, by the way.

My worry, one of them, is that all kinds of rival entities and processes arise to account for (accord with, predict, and purportedly explain) data and patterns in data, and don’t we need ways to discriminate them? During the open discussion, I mentioned several examples, some of which I can make out all scrunched up in the corners of my coffee-logged program, such as appeals to “cultural theories” of risk and risk perceptions. These theories say appeals to supposedly “real” hazards, e.g., chance of disease, death, catastrophe, and other “objective” risk assessments are wrong. They say it is not only possible but preferable (truer?) to capture attitudes toward risks, e.g., GM foods, nuclear energy, climate change, breast implants, etc. by means of one or another favorite politico-cultural grid-group categories (e.g., marginal-individualists, passive-egalitarians, hierarchical-border people, fatalists, etc.). (Your objections to these vague category schemes are often taken as further evidence that you belong in one of the pigeon-holes!) And the other day I heard a behavioral economist declare that he had found the “mechanism” to explain deciding between options in virtually all walks of life using a regression parameter, he called it beta, and guess what? beta = 1/3! He proved it worked statistically too. He might be right, he had a lot of data. Anyway, in my deliberate attempt to trigger discussion at the conference end, I was wondering if some of the speakers and/or attendees (Danks, Woodward, Glymour? Anyone?) had anything to say about cases that some of us might wish to call reification. Continue reading →

Categories: O & M conference, Statistics | 10 Comments

Gandenberger on Ontology and Methodology (May 4) Conference: Virginia Tech

Posted on May 18, 2013 by Mayo

Gregory Gandenberger
Ph.D. graduate student: Dept. of History and Philosophy of Science & Dept. of Statistics
University of Pittsburgh
http://gsganden.tumblr.com/

Onto-Meth conference

Some Thoughts on the O&M 2013 Conference
I was struck by how little speakers at the Ontology and Methodology conference engaged with the realism/antirealism debate. Laura Ruetsche defended a version of Arthur Fine’s Natural Ontological Attitude (NOA) in the first talk of the conference, but none of the speakers after her addressed the debate directly. David Danks and Jim Woodward made it particularly clear that they were deliberately avoiding questions about realism in favor of questions about what kinds of ontologies our theories should have in order to best serve the various purposes for which we develop them.

I am not criticizing the speakers! I am inclined to agree with Clark Glymour that the kinds of questions Danks and Woodward addressed are more interesting and important than questions about “what’s really real.” On the other hand, I worry that we lose something when we focus only on the use of science toward such ends as prediction and control. During the discussion period at the end of the conference, Peter Godfrey-Smith argued that science has some value simply for telling us what really is the case. For instance, science tells us that all living things on earth have a common ancestor, and that fact is a good thing to know regardless of whether or not it helps us predict or control anything.

One feature of the realism/antirealism debate that has long bothered me is that it treats all of “our best sciences” as if they had roughly the same epistemic status. In fact, realism about quantum field theory, for instance, is much harder to defend than realism about evolutionary biology. I am inclined to dismiss the realism debate as ill-formed insofar as it presumes that the question of scientific realism is a single question that spans all of the sciences. I am also suspicious of the debate in its bread-and-butter domain of fundamental physics. It is not clear to me that there is such a thing as fundamental physics; that if there is such a thing as fundamental physics, then it is converging toward a unified ontology; that if it is converging toward a unified ontology, then we can make sense of the question whether or not that ontology is correct; or that if we can make sense of the question whether or not that ontology is correct, then we have the means to give a justified answer to that question.

Nevertheless, as Glymour pointed out during the open discussion period, there are still good and open questions to address about whether and how we are justified in believing that science tells us the truth in other domains (such as evolutionary theory) where the realism question seems relatively well-formed and answerable. We can dismiss questions about “what’s really real” at a “fundamental level” while still thinking that philosophers of science should have a story to tell the 46% of Americans who believe that human beings were created in more or less their current form within the last 10,000 years—not a story about how science serves purposes of prediction and control, but a story about how science can help us find the truth.

Categories: O & M conference | 7 Comments

“A sense of security regarding the future of statistical science…” Anon review of Error and Inference

Posted on May 14, 2013 by Mayo

Aris Spanos, my colleague and co-author (Economics), recently came across this seemingly anonymous review of our Error and Inference (2010) [E & I]. It’s interesting that the reviewer remarks that “The book gives a sense of security regarding the future of statistical science and its importance in many walks of life.” I wish I knew just what the reviewer means–but it’s appreciated regardless.

2010 American Statistical Association and the American Society for Quality

TECHNOMETRICS, AUGUST 2010, VOL. 52, NO. 3, Book Reviews, 52:3, pp. 362-370.

Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. MAYO and Aris SPANOS, New York: Cambridge University Press, 2010, ISBN 978-0-521-88008-4, xvii+419 pp., $60.00.

This edited volume contemplates the interests of both scientists and philosophers regarding gathering reliable information about the problem/question at hand in the presence of error, uncertainty, and with limited data information.

The volume makes a significant contribution in bridging the gap between scientific practice and the philosophy of science. The main contribution of this volume pertains to issues of error and inference, and showcases intriguing discussions on statistical testing and providing alternative strategy to Bayesian inference. In words, it provides cumulative information towards the philosophical and methodological issues of scientific inquiry at large.

The target audience of this volume is quite general and open to a broad readership. With some reasonable knowledge of probability theory and statistical science, one can get the maximum benefit from most of the chapters of the volume. The volume contains original and fascinating articles by eminent scholars (nine, including the editors) who range from names in statistical science to philosophy, including D. R. Cox, a name well known to statisticians.

The editors have done a superb job in presenting, organizing, and structuring the material in a logical order. The “Introduction and Background” is nicely presented and summarized, allowing for a smooth reading of the rest of the volume. There is a broad range of carefully selected topics from various related fields reflecting recent developments in these areas. The rest of the volume is divided in nine chapters/sections as follows:

1. Learning from Error, Severe Testing, and the Growth of Theoretical Knowledge

2. The Life of Theory in the New Experimentalism

3. Revisiting Critical Rationalism

4. Theory Confirmation and Novel Evidence

5. Induction and Severe Testing

6. Theory Testing in Economics and the Error-Statistical Perspective

7. New Perspectives on (Some Old) Problems of Frequentist Statistics

8. Causal Modeling, Explanation and Severe Testing

9. Error and Legal Epistemology

In summary, this volume contains a wealth of knowledge and fascinating debates on a host of important and controversial topics equally important to the philosophy of science and scientific practice. This is a must-read—I enjoyed reading it and I am sure you will too! The book gives a sense of security regarding the future of statistical science and its importance in many walks of life. I also want to take the opportunity to suggest another seemingly related book by Harman and Kulkarni (2007). The review of this book was appeared in Technometrics in May 2008 (Ahmed 2008).

The following are chapters in E & I (2010) written by Mayo and/or Spanos, if you’re interested. If you produce a palindrome meeting the extremely simple requirements for May (by May 25 or so), you can win a free copy! Continue reading →

Categories: Review of Error and Inference, Statistics | 3 Comments

‘No-Shame’ Psychics Keep Their Predictions Vague: New Rejected post

Posted on May 13, 2013 by Mayo

See new rejected post. (You may comment here or on the Rejected Posts blog)

Categories: msc kvetch, rejected post | Leave a comment

If it’s called the “The High Quality Research Act,” then ….

Posted on May 9, 2013 by Mayo

Among the (less technical) items sent my way over the past few days are discussions of the so-called High Quality Research Act. I’d not heard of it, but it’s apparently an outgrowth of the recent hand-wringing over junk science, flawed statistics, non-replicable studies, and fraud (discussed at times on this blog). And it’s clearly a hot topic. Let me just run this by you and invite your comments (before giving my impression). Following the Bill, below, is a list of five NSF projects about which the HQRA’s sponsor has requested further information, and then part of an article from today’s New Yorker on this “divisive new bill”: “Not Safe for Funding: The N.S.F. and the Economics of Science”.

[DISCUSSION DRAFT]

A BILL

April 18, 2013

TO [BE SUPPLIED]

Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled,

SECTION 1. SHORT TITLE.

This act may be cited as the “High Quality Research Act”.

SECTION 2. HIGH QUALITY RESEARCH.

(a) CERTIFICATION.—Prior to making an award of any contract or grant funding for a scientific research project, the Director of the NSF shall publish a statement on the public website of the Foundation that certifies that the research project—

(1) is in the interests of the U.S. to advance the national health, prosperity, or welfare, and to secure the national defense by promoting the progress of science;

(2) is the finest quality, is ground breaking, and answers questions or solves problems that are of utmost importance to society at large; and

(3) is not duplicative of other research projects being funded by the Foundation or other Federal Science agencies.

(b) TRANSFER OF FUNDS.—Any unobligated funds for projects not meeting the requirements of subsection (a) may be awarded to other scientific research projects that do meet such requirements.

(c) INITIAL IMPLEMENTATION REPORT.—Not later than 60 days after the date of enactment of this Act, the Director shall report to the Committee on Commerce, Science, and Transportation of the Senate and the Committee on Science, Space, and Technology of the House of Representatives on how the requirements set forth in subsection (a) are being implemented.

(d) NATIONAL SCIENCE BOARD IMPLEMENTATION REPORT.—Not later than 1 year after the date of enactment of this Act, the National Science Board shall report to the Committee on Commerce, Science, and Transportation of the Senate and the Committee on Science, Space, and Technology of the House of Representatives its findings and recommendations on how the requirements of subsection (a) are being implemented.

etc. etc.

Link to the Bill:

Rep. Lamar Smith, author of the Bill, listed five NSF projects about which he has requested further information.

1. Award Abstract #1247824: “Picturing Animals in National Geographic, 1888-2008,” March 15, 2013, ($227,437);

2. Award Abstract #1230911: “Comparative Histories of Scientific Conservation: Nature, Science, and Society in Patagonian and Amazonian South America,” September 1, 2012 ($195,761);

3. Award Abstract #1230365: “The International Criminal Court and the Pursuit of Justice,” August 15, 2012 ($260,001);

4. Award Abstract #1226483, “Comparative Network Analysis: Mapping Global Social Interactions,” August 15, 2012, ($435,000); and

5. Award Abstract #1157551: “Regulating Accountability and Transparency in China’s Dairy Industry,” June 1, 2012 ($152,464).

________________________

MAY 9, 2013

NOT SAFE FOR FUNDING: THE N.S.F. AND THE ECONOMICS OF SCIENCE Continue reading →

Categories: junk science, science communication, Statistics | 14 Comments

Professorships in Scandal?

Posted on May 6, 2013 by Mayo

On page 1 of the New York Times yesterday was an article, “The Last Refuge From Scandal? Professorships”:

The traditional path to an academic job is long and laborious: the solitude and penury of graduate study, the scramble for one of the few open positions in each field, the blood sport of competitive publishing. But while colleges have always courted accomplished public figures, a leap to the front of the class has now become a natural move for those who have suffered spectacular career flameouts. At this point, the transition from public disgrace to college lectern is so familiar that when Mr. Galliano merely stepped foot on the campus of Central Saint Martins, an art and design school in London, speculation rippled around the world— incorrectly — that he would soon be teaching there.

I guess this shouldn’t surprise anyone. Sexy course titles and “novelty academics” are pretty old-hat; power and scandal, even if on the sleazy side, attract students; and if students are buying, universities can’t be blamed for selling. Or can they? Here are some examples they cite:

After a sex scandal forced Eliot Spitzer from the governor’s mansion in Albany, he turned up at City College, teaching a course called “Law and Public Policy.” …

More recently, Parsons the New School for Design announced that John Galliano, the celebrated clothing designer who lost his job at Christian Dior after unleashing a torrent of anti-Semitic vitriol in a bar, would be leading a four-day workshop and discussion called “Show Me Emotion.”

And David H. Petraeus, the general turned intelligence chief turned ribald punch line, will have not one college paycheck, but two. Last month, the City University of New York said he would be the next visiting professor of public policy at Macaulay Honors College. On Thursday, the University of Southern California announced that Mr. Petraeus would also be teaching there…

Despite a petition objecting to Galliano, there seems to be little public concern that offering such courses threatens a university’s ethical standards, especially, perhaps, if “only” sexual transgressions are involved. Still, while I can see students wanting to enroll in a course taught by a Petraeus or a Spitzer, I doubt the same would be true for one run by a Diederik Stapel*. Is it because in the former cases the scandal does not directly touch on their accomplishments? Is there a justifiable principle of distinction operating?** (Or might it depend on the course?) Continue reading →

Categories: rejected post | 4 Comments

© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.

© Deborah G. Mayo, Error Statistics Philosophy, 2011-2025. All Rights Reserved.