Author Archives: Mayo

About Mayo

I am a professor in the Department of Philosophy at Virginia Tech and hold a visiting appointment at the Center for the Philosophy of Natural and Social Science of the London School of Economics. I am the author of Error and the Growth of Experimental Knowledge, which won the 1998 Lakatos Prize, awarded to the most outstanding contribution to the philosophy of science during the previous six years. I have applied my approach toward solving key problems in philosophy of science: underdetermination, the role of novel evidence, Duhem's problem, and the nature of scientific progress. I am also interested in applications to problems in risk analysis and risk controversies, and co-edited Acceptable Evidence: Science and Values in Risk Management (with Rachelle Hollander). I teach courses in introductory and advanced logic (including the metatheory of logic and modal logic), in scientific method, and in philosophy of science. I also teach special topics courses in Science and Technology Studies.

P-values can’t be trusted except when used to argue that P-values can’t be trusted!

Have you noticed that some of the harshest criticisms of frequentist error-statistical methods these days rest on methods and grounds that the critics themselves purport to reject? Is there a whiff of inconsistency in proclaiming an “anti-hypothesis-testing stance” while in the same breath extolling the uses of statistical significance tests and p-values in mounting criticisms of significance tests and p-values? I was reminded of this in the last two posts (comments) on this blog (here and here) and one from Gelman from a few weeks ago (“Interrogating p-values”).

Gelman quotes from a note he is publishing:

“..there has been a growing sense that psychology, biomedicine, and other fields are being overwhelmed with errors … . In two recent series of papers, Gregory Francis and Uri Simonsohn and collaborators have demonstrated too-good-to-be-true patterns of p-values in published papers, indicating that these results should not be taken at face value.”

But this fraudbusting is based on finding statistically significant differences from null hypotheses (e.g., nulls asserting random assignments of treatments)! If we are to hold small p-values untrustworthy, we would be hard pressed to take them as legitimating these criticisms, especially those of a career-ending sort.

…in addition to the well-known difficulties of interpretation of p-values…,…and to the problem that, even when all comparisons have been openly reported and thus p-values are mathematically correct, the ‘statistical significance filter’ ensures that estimated effects will be in general larger than true effects, with this discrepancy being well over an order of magnitude in settings where the true effects are small… (Gelman 2013)

But surely anyone who believed this would be up in arms about using small p-values as evidence of statistical impropriety. Am I the only one wondering about this?*

CLARIFICATION (6/15/13): Corey’s comment today leads me to a clarification, lest anyone misunderstand my point. I am sure that Francis, Simonsohn and others would never be using p-values and associated methods in the service of criticism if they did not regard the tests as legitimate scientific tools. I wasn’t talking about them. I was alluding to critics of tests who point to their work as evidence the statistical tools are not legitimate. Now maybe Gelman only intends to say, what we know and agree with, that tests can be misused and misinterpreted. But in these comments, our exchanges, and elsewhere, it is clear he is saying something much stronger. In my view, the use of significance tests by debunkers should have been taken as strong support for the value of the tools, correctly used. In short, I thought it was a success story, and I was rather perplexed to see somewhat the reverse.

______________________

*This just in: If one wants to see a genuine quack extremist** who was outed long ago***, see Ziliak’s article declaring the Higgs physicists are pseudoscientists for relying on significance levels! (In the Financial Post, 6/12/13.)

**I am not placing the critics referred to above under this umbrella in the least.

***For some reviews of Ziliak and McCloskey, see widgets on left. For their flawed testimony on the Matrixx case, please search this blog.

Categories: reforming the reformers, Statistical fraudbusting, Statistics | 29 Comments

Mayo: comment on the repressed memory research

Here are some reflections on the repressed memory articles from Richard Gill’s post, focusing on Geraerts et al. (2008).

1. Richard Gill reported that “Everyone does it this way, in fact, if you don’t, you’d never get anything published: …People are not deliberately cheating: they honestly believe in their theories and believe the data is supporting them and are just doing their best to make this as clear as possible to everyone.”

This remark is very telling. I recommend we just regard those cases as illustrating a theory one believes, rather than providing evidence for that theory. If we could mark them as such, we can stop blaming significance tests for playing a role in what are actually only illustrative attempts, or to strengthen someone’s beliefs about a theory.

2. I was surprised the examples had to do with recovered memories. Wasn’t that entire area dubbed a pseudoscience way back (at least 15-25 years ago?) when “therapy induced” memories of childhood sexual abuse (CSA) were discovered to be just that—therapy induced and manufactured? After the witch hunts that ensued (the very accusation sufficing for evidence), I thought the field of “research” had been put out of its and our misery. So, aside from having used the example in a course on critical thinking, I’m not up on this current work at all. But, as these are just blog comments, let me venture some off-the-cuff skeptical thoughts. They will have almost nothing to do with the statistical data analysis, by the way…

3. Geraerts et al. (2008, 22) admit at the start of the article that therapy-recovered CSA memories are unreliable, and the idea of automatically repressing a traumatic event like CSA implausible. Then mightn’t it seem the entire research program should be dropped? Not to its adherents! As with all theories that enjoy the capacity of being sufficiently flexible to survive anomaly (Popper’s pseudosciences), there’s some life left here too. Maybe, its adherents reason, it’s not necessary for those who report “spontaneously recovered” CSA memories to be repressors; instead they may merely be “suppressors” who are good at blocking out negative events. If so, they didn’t automatically repress but rather deliberately suppressed: “Our findings may partly explain why people with spontaneous CSA memories have the subjective impression that they have ‘repressed’ their CSA memories for many years.” (ibid., 22).

4. Shouldn’t we stop there? I would. We have a research program growing out of an exemplar of pseudoscience being kept alive by ever-new “monster-barring” strategies (as Lakatos called them). (I realize they’re not planning to go out to the McMartin school, but still…) If a theory T is flexible enough so that any observations can be interpreted through it, and thereby regarded as confirming T, then it is no surprise that this is still true when the instances are dressed up with statistics. It isn’t that theories of repressed memories are implausible or improbable (in whatever sense one takes those terms). It is the ever-flexibility of these theories that renders the research program pseudoscience (along with, in this case, a history of self-sealing data interpretations).

5. Let’s give the researchers a bit more leeway. Let’s consider how they propose to “test” their hypothesized explanation. We still won’t need to look at the data for this… In Geraerts’s own research (as reported ‘in press’) “we found that the memories of CSA emerging during recovered memory therapy could not be corroborated, whereas those emerging outside therapy were corroborated just as often as memories of CSA that had never been forgotten.” (ibid., 23).

First of all, they could never have literally “found” that information, but let us grant for the sake of argument that they found the memories recovered in psychotherapy so unreliable that those spontaneously discovered/remembered are quite reliable in comparison. (They did not, by the way, check on the reliability of the CSA memories of their research subjects, so far as I can tell.) Doesn’t this admission show that recovered memory therapy was/is a highly unreliable practice? If repressed memory therapists managed to “uncover” CSA “memories” by essentially manufacturing them, then isn’t there a danger that they are capable of implanting yet more false impressions in their subjects? I just wonder about the self-criticism here…

6. The gist of what they claim to show is that participants “recruited through ads in papers”, (ibid., 24) who reported spontaneously recovered CSA are actually just very good at deliberately forgetting unpleasant things (as compared to a control group who report no abuse). Two other groups are recruited: one with therapy-discovered CSA, and a second with people who never forgot CSA. So 4 groups in all.

In the main part of the experiment, all the participants write down positive and negative (anxious) events from the past few years, then are asked to suppress thinking about them during a 2-minute “suppression period.” The negative events are not the long ago CSA events, by the way. If one of the “target thoughts” pops into their minds in the suppression period, they are to trigger a joystick. (Various stages of imagining, expressing and suppressing thoughts ensue. They take home a 7-day diary to keep up the reports.)

I take it the researchers didn’t register in advance what would count as a failed result. I mean, let’s say the therapy-discovered CSA group reported statistically significantly fewer occurrences of the negative target during those two minutes (instead of the spontaneous group). That might be interpreted as indicating they tend to obey therapists’ wishes (they suppress when they’re told to suppress). That too could have been a publishable result, helping to explain the rampant false memories in this general group.

What they claim they hoped to show is that those who report spontaneously recovered memories are not repressors even though they think they are. That is, they hope to show the spontaneous recoverers do not “automatically” blank out negative events. Instead they are “suppressors” (those who deliberately don’t think about negative events). Let’s grant that was the pre-data goal. But is there really a difference here? Those who report spontaneously recovered memories claim they really never thought about the CSA until the day it was spontaneously brought to mind, but Geraerts claims they actually had remembered it but they forgot they remembered it. So, we know in advance that self-described repressors are easily redescribed by the researchers as suppressors.

All of these points, and many more besides, would arise in a critique before even looking at any results. It is based on logic and some information about the flaws of this and related research programs. We do not say the theories are implausible, only that the onus is on the researchers to show how they will conduct a stringent test of their theories, but we do not see that.

Note that the above criticisms are quite separate from the statistical questions Professor Gill was called in to consider. We don’t need shrewd statistics to criticize this research–although maybe we do for fraud. Yet as fraudbuster-buster* Gill seems to be saying, there is a fine line between fraud and bad practices.

7. So what about the statistical analysis? “LSD tests indicated that people with spontaneous recovered memories reported significantly fewer occurrences of the anxious target thought than did the other groups.” (25) This is by means of Post-hoc Least-Significant-Difference (LSD) tests. Putting the best spin on the statistics, what is the upshot?

People reporting recovered memories are not repressors, but rather suppressors, as evidenced by the fact that they successfully block out negative events (when told not to think about them in an experiment), at least statistically significantly more often than do the other groups.

But notice that these people answered the ad, so they haven’t suppressed the memory of the CSA event. To Geraerts, further evidence that they are suppressors is the fact that they don’t think too much about the negative (target) event in the week after the experiment. But this seems irrelevant, since we know they remembered the CSA event.

But others are apparently giving the research greater mileage than I would. As Gill observes, “they honestly believe in their theories and believe the data is [are] supporting them”. I am prepared to be corrected by suppressors…

*This term seems more apt, now that I better understand Gill’s work in this arena.

REFERENCES:

Geraerts, E., McNally, R. J., Jelicic, M., Merckelbach, H., & Raymaekers, L. (2008). Linking thought suppression and recovered memories of childhood sexual abuse. Memory, 16, 22-28.

Gill, R. (2013). http://errorstatistics.com/2013/06/08/richard-gill-integrity-or-fraud-or-just-quesionable-research-practices/

Categories: junk science, Statistical fraudbusting, Statistics | 7 Comments

Richard Gill: “Integrity or fraud… or just questionable research practices?”

Professor Gill

Professor Richard Gill
Statistics Group
Mathematical Institute
Leiden University
http://www.math.leidenuniv.nl/~gill/

I am very grateful to Richard Gill for permission to post an e-mail from him (after my “dirty laundry” post) along with slides from his talk, “Integrity or fraud… or just questionable research practices?” and associated papers. I record my own reflections on the pseudoscientific nature of the program in one of the Geraerts et al. papers in a later post.

I certainly have been thinking about these issues a lot in recent months. I got entangled in intensive scientific and media discussions – mainly confined to the Netherlands – concerning the cases of social psychologist Dirk Smeesters and of psychologist Elke Geraerts. See: http://www.math.leidenuniv.nl/~gill/Integrity.pdf

And I recently got asked to look at the statistics in some papers of another … [researcher] ..but this one is still confidential ….

The verdict on Smeesters was that he, like Stapel, actually faked data (though he still denies this).

The Geraerts case is very much open, very much unclear. The senior co-authors Merckelbach and McNally of the attached paper, published in the journal “Memory”, have asked the journal editors for it to be withdrawn because they suspect the lead author, Elke Geraerts, of improper conduct. She denies any impropriety. It turns out that none of the co-authors have the data. Legally speaking it belongs to the University of Maastricht where the research was carried out and where Geraerts was a promising postdoc in Merckelbach’s group. She later got a chair at Erasmus University Rotterdam and presumably has the data herself but refuses to share it with her old co-authors or any other interested scientists. Just looking at the summary statistics in the paper one sees evidence of “too good to be true”. Average scores in groups supposed in theory to be similar are much closer to one another than one would expect on the basis of the within-group variation (the paper reports averages and standard deviations for each group, so it is easy to compute the F statistic for equality of the three similar groups and use its left tail probability as test statistic).
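The check Gill describes can be sketched from summary statistics alone. What follows is a minimal illustration, not Gill's own code: the group sizes, means and standard deviations are made-up numbers (none are taken from the Memory paper), and the function names are hypothetical. It relies on the fact that with three groups the numerator has 2 degrees of freedom, in which case the left-tail probability of the F distribution has a simple closed form.

```python
def f_statistic(groups):
    """One-way ANOVA F computed from (n, mean, sd) summaries of each group."""
    k = len(groups)
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    # Between-group mean square: spread of the group averages
    ms_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups) / (k - 1)
    # Within-group mean square: pooled variance inside the groups
    ms_within = sum((n - 1) * s ** 2 for n, _, s in groups) / (total_n - k)
    return ms_between / ms_within, k - 1, total_n - k


def left_tail_f_df1_2(f, df2):
    """P(F <= f) for an F(2, df2) variable; valid only when df1 == 2."""
    return 1.0 - (df2 / (df2 + 2.0 * f)) ** (df2 / 2.0)


# Hypothetical summaries (n, mean, sd) for three groups expected to be similar
similar_groups = [(30, 5.10, 2.0), (30, 5.12, 2.1), (30, 5.08, 1.9)]

f_obs, df1, df2 = f_statistic(similar_groups)
assert df1 == 2  # the closed form above requires exactly three groups
p_left = left_tail_f_df1_2(f_obs, df2)

print(f"F = {f_obs:.4f} on ({df1}, {df2}) df, left-tail p = {p_left:.4f}")
```

A very small left-tail probability says the group averages sit closer together than the within-group scatter would ordinarily allow, which is the “too good to be true” signal. For more than three groups the closed form no longer applies and a general F distribution routine would be needed.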

The same phenomenon turns up in another unpublished paper by the same authors and moreover in one of the papers contained in Geraerts’s (Maastricht) thesis. I attach the two papers published in Geraerts’s thesis which present results in very much the same pattern as the disputed “Memory” paper. Four groups of subjects, three supposed in theory to be rather similar, one expected to be strikingly different. In one of the two, just as in the Memory paper, the average scores of the three similar groups are much closer to one another than one would expect on the basis of the within-groups variation.

I got involved in the quarrel between Merckelbach and Geraerts which was being fought out in the media so various science journalists also consulted me about the statistical issues. I asked Geraerts if I could have the data of the Memory paper so that I could carry out distribution-free versions of the statistical tests of “too good to be true” which are easy to perform if you just have the summary statistics. She claimed that I had to get permission from the University of Maastricht. At some point both the presidents of Maastricht and Erasmus university were involved and presumably their legal departments too. Finally I got permission and arranged a meeting with Geraerts where she was going to tell me “her side of the story” and give me the data and we would look at my analyses together. Merckelbach and his other co-authors all enthusiastically supported this too, by the way. However at the last moment the chair of her department at Erasmus university got worried and stepped in and now an internal Rotterdam (=Erasmus) committee is investigating the allegations and Geraerts is not allowed to give anyone the data or talk to anyone about the problem.

I think this is totally crazy. First of all, the data set should have been made public years ago. Secondly, the fact that the co-authors of the paper never even saw the data themselves is a sign of poor research practices. Thirdly, getting university lawyers and having high level university ethics committees involved does not further science. Science is furthered by open discussion. Publish the data, publish the criticism, and let the scientific community come to its own conclusion. Hold a workshop where different points of view are presented about what is going on in these papers, where statisticians and psychologists communicate to one another.

Probably, Geraerts’s data has been obtained by some combination of the usual “questionable research practices” which are prevalent in the field in question. Everyone does it this way, in fact, if you don’t, you’d never get anything published: sample sizes are too small, effects are too small, noise is too large. People are not deliberately cheating: they honestly believe in their theories and believe the data is supporting them and are just doing the best to make this as clear as possible to everyone.

Richard

PS summary of my investigation of the papers contained in Geraerts’s PhD thesis:

ch 8 Geraerts et al 2006b BRAT Long term consequences of suppression of intrusive anxious thoughts and repressive coping.

ch 9 Geraerts et al 2006 AJP Suppression of intrusive thoughts and working memory capacity in repressive coping.

These two chapters show the pattern of four groups of subjects, three of which are very similar, while the fourth is strikingly different with respect to certain (but not all) responses.

In the case of chapter 8, the groups which are expected to be similar are (just as in the already disputed Memory and JAb papers) actually much too similar! The average scores are closer to one another than one can expect on the basis of the observed within-group variation (1 over square root of N law).

In the case of chapter 9, nothing odd seems to be going on. The variation between the average scores of similar groups of subjects is just as big as it ought to be, relative to the variation within the groups.

Geraerts et al (2008 Memory pdf). “Recovered memories of childhood sexual abuse: Current findings and their legal implications” Legal and Criminological Psychology 13, 165–176

It was Richard Gill who first told me about Diederik Stapel shortly after I started blogging, see an earlier post on Diederik. We were at a workshop on Error in the Sciences at Leiden in 2011. I was very lucky to have had Gill be the commentator/presenter of my paper—he was excellent!—and I thank him for these intriguing items. My puzzlements and reactions will follow in a separate post….

Categories: junk science, Statistical fraudbusting, Statistics | 5 Comments

Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)

Having reblogged the 5/17/12 post on “reforming the reformers” yesterday, I thought I should reblog its follow-up: 6/2/12.

Consider again our one-sided Normal test T+, with null H0: μ ≤ μ0 vs H1: μ > μ0, with μ0 = 0, α = .025, and σ = 1, but let n = 25. So M is statistically significant only if it exceeds .392. Suppose M (the sample mean) just misses significance, say

Mo = .39.
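The cut-off of .392 follows directly from the stated test specification. A minimal sketch of the arithmetic, using only the values given above (the variable names are illustrative):

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025

# Standard error of the sample mean M
se = sigma / sqrt(n)                          # 0.2

# Critical z for a one-sided test at level alpha (about 1.96)
z_alpha = NormalDist().inv_cdf(1 - alpha)

# Smallest sample mean that counts as statistically significant
cutoff = mu0 + z_alpha * se

print(round(z_alpha, 2), round(cutoff, 3))    # 1.96 0.392
print(0.39 > cutoff)                          # False: .39 just misses
```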

The flip side of a fallacy of rejection (discussed before) is a fallacy of acceptance, or the fallacy of misinterpreting statistically insignificant results.  To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ =μ0, we wish to identify discrepancies that can and cannot be ruled out.  For our test T+, we reason from insignificant results to inferential claims of the form:

μ < μ0 + γ

Fisher continually emphasized that failure to reject was not evidence for the null.  Neyman, we saw, in chastising Carnap, argued for the following kind of power analysis:

Neymanian Power Analysis (Detectable Discrepancy Size DDS): If data x are not statistically significantly different from H0, and the power to detect discrepancy γ is high (low), then x constitutes good (poor) evidence that the actual effect is < γ. (See 11/9/11 post).

By taking into account the actual x0, a more nuanced post-data reasoning may be obtained.

“In the Neyman-Pearson theory, sensitivity is assessed by means of the power—the probability of reaching a preset level of significance under the assumption that various alternative hypotheses are true. In the approach described here, sensitivity is assessed by means of the distribution of the random variable P, considered under the assumption of various alternatives.” (Cox and Mayo 2010, p. 291)

This may be captured in:

FEV(ii): A moderate p-value is evidence of the absence of a discrepancy d from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller p-value) were a discrepancy d to exist. (Mayo and Cox 2005, 2010, 256).

This is equivalently captured in the Rule of Acceptance (Mayo (EGEK) 1996) and in the severity interpretation for acceptance, SIA (Mayo and Spanos 2006, p. 337):

SIA: (a): If there is a very high probability that [the observed difference] would have been larger than it is, were μ > μ1, then μ < μ1 passes the test with high severity,…

But even taking tests and CIs just as we find them, we see that CIs do not avoid the fallacy of acceptance: they do not block erroneous construals of negative results adequately.

The one-sided CI for the parameter μ in test T+, with Mo the observed sample mean and α = .025, is:

(Mo – 1.96(σ/√n), infinity)

(σ would generally be estimated.) Outcome M = .39 just fails to reject H0 at the .025 level; correspondingly, 0 is included in the one-sided 97.5% interval:

-.002 < μ
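The lower bound of -.002 is just the observed mean minus 1.96 standard errors. A quick check, assuming the rounded critical value 1.96 used in the text (variable names are illustrative):

```python
from math import sqrt

sigma, n = 1.0, 25
m_obs = 0.39
z = 1.96  # rounded critical value, as in the text

# Lower limit of the one-sided 97.5% confidence interval
lower = m_obs - z * sigma / sqrt(n)

print(round(lower, 3))   # -0.002
print(lower < 0)         # True: 0 lies inside the interval
```

This is the test-interval duality at work: 0 falls inside the interval exactly because .39 falls short of the .392 cut-off.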

Suppose one had an insignificant result from test T+  and wanted to evaluate the inference:   μ < .4

(It doesn’t matter why just now, this is an illustration).

Since the power of test T+ to detect  μ =.4 is hardly more than .5, Neyman would say “it was a little rash” to regard the observed mean as indicating μ < .4 , to use his language in chiding Carnap.  So the N-P tester avoids taking the insignificant result as evidence that μ < .4.  Not only has she avoided regarding the insignificant result as evidence of no discrepancy from the null, she immediately and properly denies there is good evidence for ruling out a discrepancy of .4. [i]
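The “hardly more than .5” figure can be reproduced as follows. This is a sketch under the stated setup (μ0 = 0, σ = 1, n = 25, cut-off at 1.96 standard errors); the function name `power` is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist


def power(mu1, mu0=0.0, sigma=1.0, n=25, z_alpha=1.96):
    """P(M > cutoff) when the true mean is mu1, for the one-sided test T+."""
    se = sigma / sqrt(n)
    cutoff = mu0 + z_alpha * se          # .392 in the running example
    return 1 - NormalDist().cdf((cutoff - mu1) / se)


print(round(power(0.4), 2))   # 0.52
```

With power this close to one half, a non-significant result would occur almost half the time even if μ were .4, so it is poor grounds for inferring μ < .4.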

What about our CI reformer?

How does the confidence interval:       -.002 < μ

block interpreting the negative result as evidence that the discrepancy is less than .4?

It does not.

Yet many New Reformers declare that once the confidence interval is in hand, there is no additional information to be obtained from a power analysis, by which they are referring to precisely what Neyman recommends (although they have in mind Cohen).[ii]

CI advocates typically hold that anything tests can do, CIs do better, or at any rate that the information is already in the CI. However, in claiming this (regarding test T+), they always compute the CI-upper bound–yet  it is the lower bound that corresponds to this test. If we wish to use the upper confidence bound as a kind of adjunct for interpreting intervals, fine, but then CIs must be supplemented with a principle for warranting such a move: it does not come from CI theory.  But even granting the use of the corresponding 95% CI they recommend, we get:

(-.002 < μ < .782)

How does this rule out supposing one has corroboration for μ < .4?  It doesn’t. All of the values in the interval (they tell us) are plausible, so they fail to rule out the erroneous inference. (Some CI advocates even chastise the power analyst for denying there is evidence for μ < μ’, for a value of  μ’ smaller than the upper limit (UL) of the CI. Their grounds are that the data are strong evidence that μ < UL. True. But that does not prevent us from denying there is strong evidence that μ < various μ values less than the upper limit.)

The hypothesis μ < 0.4 is non-rejectable by the test with this outcome. In general, the values within the interval are not excluded, they are survivors, as it were, of the test. But if we wish to block fallacies of acceptance, CI’s won’t go far enough. Although M is not sufficiently greater (or less) than any of the values in the confidence interval to reject them at the α-level, this does not imply there is evidence for each of the values in the interval (for discussion see Mayo and Spanos 2006).

By contrast, for each value of μ1 in the confidence interval, there would be a different answer to the question [ii]:

What is the power of the test against μ1?

Thus the power analyst makes distinctions that the CI theorist does not. As we saw, the power analyst blocks as “rash” (Neyman) the inference μ < 0.4 since the power of T+ to detect .4 is not high (about .5). Even worse off would be the inference to μ < 0.2 since the power to detect .2 is only .16. Likewise for a severity analysis, which also avoids the coarseness of a power analysis.[iii]
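To see how the power analyst's answer varies across values that the interval treats alike, one can tabulate the power at several points inside (-.002, .782). This is a sketch for the same test (μ0 = 0, σ = 1, n = 25, cut-off .392); the grid of μ1 values and the names are illustrative. Computed this way, the power against .2 comes out near .17, which the text quotes as .16.

```python
from math import sqrt
from statistics import NormalDist

sigma, n, z_alpha = 1.0, 25, 1.96
se = sigma / sqrt(n)
cutoff = z_alpha * se                      # .392, with mu0 = 0

# Power of T+ against several values of mu1 lying inside the 95% interval
power_by_mu1 = {
    mu1: 1 - NormalDist().cdf((cutoff - mu1) / se)
    for mu1 in (0.2, 0.4, 0.6, 0.782)
}

for mu1, pw in power_by_mu1.items():
    print(f"mu1 = {mu1:.3f}   power = {pw:.3f}")
```

Every one of these values is a “survivor” of the test, yet the power ranges from under .2 to above .95, so the warrant for inferring μ < μ1 differs sharply from one value to the next.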

CIs are swell, and their connection to severity evaluations may be developed, but CI theory requires being supplemented by a principle that will direct their correct interpretation if they are to avoid the fallacies of significance tests.

*The title is to be sung to the tune of “Anything You Can Do I Can Do Better”  from one of my favorite plays, Annie Get Your Gun (‘you’ being replaced by ‘test’).

[i] Recall what Neyman wrote in chiding Rudolf Carnap (see 11/9/11 post):

I am concerned with the term “degree of confirmation” introduced by Carnap.  …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true].  The question is: does this result “confirm” the hypothesis that H0 is true of the particular data set? (Neyman, pp 40-41).

Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc.  If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H0, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present].  Therefore, the failure of the test to reject H0 cannot be reasonably considered as anything like a confirmation of H0.  The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95.

For more on power, see posts under “Neyman’s Nursery” (1, 2, 3, 4, 5).

[ii] Likewise for the question: what is the SEV associated with a given inference, from a given test with a given outcome.

[iii] If we observe not M = .39, but rather M = -.2, we again fail to reject H0, but the power analyst, looking just at cα = 1.96, is led to the same assessment, again regarding as fallacious the claim to have evidence for μ < 0.2. Although the “prespecified” power is low, .16, we would wish to say, taking into account the actual outcome, that there is a high probability for a more significant result than the one attained, were μ as great as 0.2!
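The contrast in note [iii] can be made concrete by computing the probability of a larger sample mean than the one observed, were μ equal to μ1. This is a sketch of the severity assessment for an inference of the form μ < μ1 in the same test (σ = 1, n = 25); the function name `sev_upper` is an illustrative choice, not notation from the cited papers.

```python
from math import sqrt
from statistics import NormalDist


def sev_upper(m_obs, mu1, sigma=1.0, n=25):
    """SEV(mu < mu1): probability of a sample mean larger than m_obs were mu = mu1."""
    se = sigma / sqrt(n)
    return 1 - NormalDist().cdf((m_obs - mu1) / se)


# Same inference (mu < 0.2), two different non-significant outcomes
print(round(sev_upper(-0.2, 0.2), 3))   # 0.977
print(round(sev_upper(0.39, 0.2), 3))   # 0.171
```

The prespecified power against .2 is the same in both cases, but the post-data assessment separates them: with M = -.2 the inference μ < 0.2 is well warranted, with M = .39 it is not.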

References
Cohen, J. (1988), Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Erlbaum.

Mayo, D. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.

Mayo, D. and Cox, D. (2010), “Frequentist Statistics as a Theory of Inductive Inference,” in D. Mayo and A. Spanos (2010), pp. 247-275.

Mayo, D. and Spanos, A. (eds.) (2010), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, CUP.

Categories: CIs and tests, Error Statistics, reformers, Statistics | Leave a comment

PhilStock: Topsy-Turvy Game

See rejected posts.

Categories: PhilStock, Rejected Posts | Leave a comment

Do CIs Avoid Fallacies of Tests? Reforming the Reformers (Reblog 5/17/12):

The one method that enjoys the approbation of the New Reformers is that of confidence intervals. The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter µ” (Cumming 2012, p. 69). He recommends prespecified confidence levels .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value.” (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it?   Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+.

H0: µ ≤  0 against H1: µ >  0 , and let σ= 1.

Test T+ at significance level .02 is analogous to forming the one-sided (lower) 98% confidence interval:

µ > M – 2(1/ √n ).

where M, following Cumming, is the sample mean (thereby avoiding those x-bars). M – 2(1/ √n ) is the lower limit (LL) of a 98% CI.

Central problems with significance tests (whether of the N-P or Fisherian variety) include:

(1) results are too dichotomous (e.g., significant at a pre-set level or not);

(2) two equally statistically significant results but from tests with different sample sizes are reported in the same way  (whereas the larger the sample size the smaller the discrepancy the test is able to detect);

(3) significance levels (even observed p-values) fail to indicate the extent of the effect or discrepancy (in the case of test T+ , in the positive direction).

We would like to know for what values of δ it is warranted to infer  µ > µ0 + δ.

Considering problem (2), suppose two tests of type T+ reach the same significance level, .02 and let

(i) n = 100 and  (ii) n = 400.

(With n = 100, M = .2; with n = 400, M = .1)

(i) for n = 100, the .98 (lower) CI = µ > M – 2(1/10)

(ii)  for n = 400, the .98 (lower) CI = µ > M – 2(1/20)

So in both cases, the confidence intervals are

µ > 0

or as he writes them (0, infinity]. So how are the CIs distinguishing them?
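To make the arithmetic concrete, here is a minimal sketch of the two lower limits, assuming σ = 1 and the 2-standard-deviation cut-off used above. The function name `lower_limit` and the `cases` dictionary are my own illustrative choices, not anything from Cumming.

```python
import math

def lower_limit(sample_mean, n, sigma=1.0, cutoff=2.0):
    """Lower limit of the one-sided CI: M - cutoff * sigma / sqrt(n).

    cutoff=2.0 is the 2-standard-deviation cut-off from the text
    (roughly the .98 level); sigma=1.0 matches test T+.
    """
    return sample_mean - cutoff * sigma / math.sqrt(n)

# (observed mean M, sample size n) for the two cases in the text
cases = {"(i)": (0.2, 100), "(ii)": (0.1, 400)}
limits = {label: lower_limit(m, n) for label, (m, n) in cases.items()}

for label, ll in limits.items():
    print(f"{label}: mu > {ll:.3f}")
```

Both lower limits land at 0, so the reported intervals coincide even though the sample sizes differ by a factor of four.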

The sample means in both cases here are just statistically significant. As Cumming states, for a 98% CI, the p-value is .02 if the interval falls so that the LL is at µ0 (p. 103).  Here, the LL (lower limit) of the CI is µ0–namely, 0.

So the p-value in our case would be .02 and the result is taken to infer µ > 0. So where is the difference? The construal is dichotomous: in or out, plausible or not; all values within the interval are on par.  But if we wish to avoid fallacies, CI’s won’t go far enough.

To avoid fallacies of rejection, distinguish between cases (i) and (ii), and make good on the promise to have solved the problem in (3), we would need to report the extent of discrepancies well and poorly indicated. Let’s just pick an example to illustrate: Is there evidence of a discrepancy of .1, i.e., that µ > .1?

For n = 100, we would say that µ > .1 is fairly well indicated (p-value is .16, associated SEV is .84)*.

The reasoning is counterfactual: were µ less than or equal to .1, it is fairly probable, .84, that a smaller M would have occurred.

For n = 400, µ > .1 is poorly indicated (p-value is .5, associated SEV is .5).

The reasoning, among many ways it can be put, is that the M observed is scarcely unusual under the assumption that µ is less than or equal to .1. The probability is .5 that we’d observe a sample mean as statistically significant as (or more significant than) the one we observed, even if µ were only .1.
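The two severity figures can be checked with a short sketch, assuming the Normal model of test T+ with σ = 1. The helper names `normal_cdf` and `severity` are hypothetical; the quantity computed is the probability of a less statistically significant result (a smaller M) under µ = .1.

```python
import math

def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def severity(sample_mean, n, mu1, sigma=1.0):
    """SEV(mu > mu1): probability of a sample mean smaller than the
    one observed, computed under the assumption mu = mu1."""
    standard_error = sigma / math.sqrt(n)
    return normal_cdf((sample_mean - mu1) / standard_error)

sev_i = severity(0.2, 100, 0.1)   # case (i):  n = 100, M = .2
sev_ii = severity(0.1, 400, 0.1)  # case (ii): n = 400, M = .1

print(f"(i)  SEV(mu > .1) = {sev_i:.2f}, p-value = {1 - sev_i:.2f}")
print(f"(ii) SEV(mu > .1) = {sev_ii:.2f}, p-value = {1 - sev_ii:.2f}")
```

The same function can be evaluated over a range of values of `mu1` to see which discrepancies are well or poorly indicated in each case.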

Now it might be said that it is required to always compute a two-sided interval.  But we cannot just deny one-sided tests, nor does Cumming. In fact, he encourages one-sided tests/CIs (although he also says he is happy for people to decide afterwards whether to report it as a one or two-sided test (p. 112), but I put this to one side).

Or it might be suggested that we do the usual one-sided test by means of the one-sided CI (lower) interval, but we add the CI upper (at the same level) for purposes of scrutinizing the effective discrepancy indicated. First, let me be clear that Cumming does not suggest this.  Were one to suggest this, a justification would be needed, and that would demand going beyond confidence interval reasoning to something akin to severe-testing reasoning.

Merely forming the two two-sided intervals would not help:

(i) (0, .4]

(ii) (0, .2]
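As a rough check on these two intervals, the sketch below forms M ± 2 standard errors (again assuming σ = 1 and the 2-standard-deviation cut-off) and asks whether µ = .1 falls inside each. The function name `two_sided_interval` is my own.

```python
import math

def two_sided_interval(sample_mean, n, sigma=1.0, cutoff=2.0):
    """Interval M +/- cutoff * sigma / sqrt(n), returned as (lower, upper)."""
    half_width = cutoff * sigma / math.sqrt(n)
    return (sample_mean - half_width, sample_mean + half_width)

interval_i = two_sided_interval(0.2, 100)   # case (i)
interval_ii = two_sided_interval(0.1, 400)  # case (ii)

# Is mu = .1 inside each interval?
inside_i = interval_i[0] < 0.1 < interval_i[1]
inside_ii = interval_ii[0] < 0.1 < interval_ii[1]

print(interval_i, inside_i)
print(interval_ii, inside_ii)
```

The value .1 sits inside both intervals, so an in-or-out reading of the interval treats the two cases alike with respect to µ > .1.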

The question is: how well do the data indicate µ > .1?  It would scarcely jump out at you that this is poorly warranted by (ii).  In this way, simple severe testing reasoning distinguishes (i) and (ii), as was wanted.

This was not a very extreme example either, in terms of the difference in sample sizes. The converse problem, the inability of standard CIs to avoid fallacies of insignificant results, is even more glaring, whereas avoiding them is easily and intuitively accomplished by a severity evaluation. (For several computations, see Mayo and Spanos 2006: “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction.”)

Nor does it suffice to have a series of confidence intervals as some suggest: they are still viewed as parameter values that “fit” the data to varying degrees, without a clear principled ground for using the series of intervals to avoid unwarranted interpretations. The counterfactual reasoning may be made out in terms of (what may be dubbed) a Severity Interpretation of Rejection and Acceptance (SIR) and (SIA), in Mayo and Spanos 2006, or in terms of the frequentist principle of evidence (FEV) in Mayo and Cox (2010, 256): “Frequentist Statistics as a Theory of Inductive Inference”.

*Here SEV is calculated by the probability of getting a less statistically significant result, computed under the assumption that µ = .1. The SEV would increase if computed under smaller values of µ.

Cumming, G. (2012), Understanding the New Statistics, Routledge.

Mayo, D. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.

Mayo, D. and Cox, D. (2010), “Frequentist Statistics as a Theory of Inductive Inference,” in D. Mayo and A. Spanos (2011), pp. 247-275.


[1] Null Hypothesis Significance Tests.

[2] The warrant for the confidence, for Cumming, is the usual one: it was arrived at by a method with high probability of covering the true parameter value, and he has worked out nifty programs to view the “dance” of CI’s, as well as other statistics.

[3] “If we think of whether or not the CIs include the null value, there’s a direct correspondence between, respectively, the two- and one-tailed tests, and the two- and one-sided intervals” (Cumming 2012, p. 111).


Some statistical dirty laundry

I finally had a chance to fully read the 2012 Tilburg Report* on “Flawed Science” last night. Here are some stray thoughts…

1. Slipping into pseudoscience.
The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”[1]. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job.

One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts…. [T]he use of research procedures in such a way as to ‘repress’ negative results by some means” may be called verification bias. [my emphasis] (Report, 48).

I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses, as to count as no evidence at all (see some from their list). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89).[2] It is unclear at what point a field slips into the pseudoscience realm.

2. A role for philosophy of science?
I am intrigued that one of the final recommendations in the Report is this:

In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered. Based on these insights, research Master’s students and PhD students must receive practical training from their supervisors in the application of the rules governing proper and honest scientific research, which should include examples of such undesirable conduct as data massage. The Graduate School must explicitly ensure that this is implemented.

A philosophy department could well create an entire core specialization that revolved around “the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science” (ideally linked with one or more other departments).  That would be both innovative and fill an important gap, it seems to me. Is anyone doing this?

3. Hanging out some statistical dirty laundry.
Items in their laundry list include:

  • An experiment fails to yield the expected statistically significant results. The experiment is repeated, often with minor changes in the manipulation or other conditions, and the only experiment subsequently reported is the one that did yield the expected results. The article makes no mention of this exploratory method… It should be clear, certainly with the usually modest numbers of experimental subjects, that using experiments in this way can easily lead to an accumulation of chance findings….

Winner of May Palindrome Contest

“Able no one nil red nudist opening nine pots. I’d underline ‘No’ on Elba.”  Anonymous. See rejected posts.


K. Staley: review of Error & Inference

K. W. Staley
Associate Professor
Department of Philosophy,
Saint Louis University

(Almost) All about error


BOOK REVIEW Metascience (2012) 21:709–713 DOI 10.1007/s11016-011-9618-1
Deborah G. Mayo and Aris Spanos (eds): Error and inference: Recent exchanges on experimental reasoning, reliability, objectivity, and rationality. New York: Cambridge University Press, 2010, xvii+419 pp

The ERROR’06 (experimental reasoning, reliability, objectivity, and rationality) conference held at Virginia Tech aimed to advance the discussion of some central themes in philosophy of science debated by Deborah Mayo and her more-or-less friendly critics over the years. The volume here reviewed brings together the contributions of these critics and Mayo’s responses to them (with Mayo’s collaborator Aris Spanos). (I helped with the organization of the conference and, with Mayo and Jean Miller, edited a separate collection of workshop papers that were presented there, published as a special issue of Synthese.) My review will focus on a couple of themes I hope to be of interest to a broad philosophical audience, then turn more briefly to an overview of the entire collection. The discussions in Error and Inference (E&I) are indispensable for understanding several current issues regarding the methodology of science.

The remarkably useful introductory chapter lays out the broad themes of the volume and discusses ‘‘The Error-Statistical Philosophy’’. Here, Mayo and Spanos provide the most succinct and non-technical account of the error-statistical approach that has yet been published, a feature that alone should commend this text to anyone who has found it difficult to locate a reading on error statistics suitable for use in teaching.

Mayo holds that the central question for a theory of evidence is not the degree to which some observation E confirms some hypothesis H but how well-probed for error a hypothesis H is by a testing procedure T that results in data x0. This reorientation has far-reaching consequences for Mayo’s approach to philosophy of science. On this approach, addressing the question of when data ‘‘provide good evidence for or a good test of’’ a hypothesis requires attention to characteristics of the process by means of which the data are used to bear on the hypothesis. Mayo identifies the starting point from which her account is developed as the ‘‘Weak Severity Principle’’ (WSP):

Data x0 do not provide good evidence for hypothesis H if x0 results from a test procedure with a very low probability or capacity of having uncovered the falsity of H (even if H is incorrect). (21)

The weak severity principle is then developed into the full severity principle (SP), according to which ‘‘data x0 provide a good indication of or evidence for hypothesis H (just) to the extent that test T has severely passed H with x0’’ where H passes a severe test T with x0 if x0 ‘‘agrees with’’ H and ‘‘with very high probability, test T would have produced a result that accords less well with H than does x0, if H were false or incorrect’’ (22). This principle constitutes the heart of the error-statistical account of evidence, and E&I, by including some of the most important critiques of the principle, provides a forum in which Mayo and Spanos attempt to correct misunderstandings of the principle and to clarify its meaning and application.

The appearance in the WSP of the disjunctive phrase ‘‘a very low probability or capacity’’ (my emphasis) indicates a point central to much of this clarificatory work. The error-statistical account is resolutely frequentist in its construal of probability. It is commonly held (including by some frequentists) that the rationale for frequentist statistical methods lies exclusively in the fact that they can sometimes be shown to have low error rates in the long run. Throughout E&I, Mayo insists that this ‘‘behaviorist rationale’’ is not applicable when it comes to evaluating a particular body of data in order to determine what inferences may be warranted. That evaluation rests upon thinking about the particular data and the inference at hand in light of the capacity of the test to reveal potential errors in the inference drawn. Frequentist probabilities are part of how one models the error-detecting capacities of the process. As Mayo explains in a later chapter co-authored with David Cox, tests of hypotheses function analogously to measuring instruments: ‘‘Just as with the use of measuring instruments, applied to a specific case, we employ the performance features to make inferences about aspects of the particular thing that is measured, aspects that the measuring tool is appropriately capable of revealing’’ (257).

One of the most fascinating exchanges in E&I concerns the role of severe testing in the appraisal of ‘‘large-scale’’ theories. According to Mayo, theory appraisal proceeds by a ‘‘piecemeal’’ process of severe probing for specific ways in which a theory might be in error. She illustrates this with the history of experimental tests of theories of gravity, emphasizing Clifford Will’s parametrized post-Newtonian (PPN) framework, by means of which all metric theories of gravity can be represented in their weak-field, slow-motion limits by means of ten parameters. Experimental work on gravity theories then severely tests hypotheses about the values of those parameters. Rather than attempting to confirm or probabilify the general theory of relativity (GTR), the aim is to learn about the ways in which GTR might be in error, more generally to ‘‘measure how far off what a given theory says about a phenomenon can be from what a ‘correct’ theory would need to say about it’’ (55).

Alan Chalmers and Alan Musgrave both challenge this view. According to Chalmers, no general theory, whether ‘‘low level’’ or ‘‘high level’’, can pass a severe test because the content of theories surpasses whatever empirical evidence supports them. As a consequence, Chalmers argues, Mayo’s severe-testing account of scientific inference must be incomplete because even low-level experimental testing sometimes demands relying on general theoretical claims. Similarly, Musgrave accuses Mayo of holding that (general) theories are not tested by ‘‘testing their consequences’’, but that ‘‘all that we really test are the consequences’’ (105), leaving her with ‘‘nothing to say’’ about the assessment, adoption, or rejection of general theories (106).


A. Birnbaum: Statistical Methods in Scientific Inference

Birnbaum: born May 27, 1923

Today is (statistician) Allan Birnbaum’s birthday. He lived to be only 53 [i]. From the perspective of philosophy of statistics and philosophy of science, Birnbaum is best known for his work on likelihood, the Likelihood Principle [ii], and for his attempts to blend concepts of likelihood with error probability ideas to obtain what he called “concepts of statistical evidence”. Failing to find adequate concepts of statistical evidence, Birnbaum called for joining the work of “interested statisticians, scientific workers and philosophers and historians of science”–an idea I would heartily endorse!  While known for attempts to argue that the (strong) Likelihood Principle followed from sufficiency and conditionality principles, a few years after publishing this result, he seems to have turned away from it, perhaps discovering gaps in his argument.

NATURE VOL. 225 MARCH 14, 1970 (1033)

LETTERS TO THE EDITOR

Statistical methods in Scientific Inference

It is regrettable that Edwards’s interesting article[1], supporting the likelihood and prior likelihood concepts, did not point out the specific criticisms of likelihood (and Bayesian) concepts that seem to dissuade most theoretical and applied statisticians from adopting them. As one whom Edwards particularly credits with having ‘analysed in depth…some attractive properties’ of the likelihood concept, I must point out that I am not now among the ‘modern exponents’ of the likelihood concept. Further, after suggesting that the notion of prior likelihood was plausible as an extension or analogue of the usual likelihood concept (ref. 2, p. 200)[2], I have pursued the matter through further consideration and rejection of both the likelihood concept and various proposed formalizations of prior information and opinion (including prior likelihood). I regret not having expressed my developing views in any formal publication between 1962 and late 1969 (just after ref. 1 appeared). My present views have now, however, been published in an expository but critical article (ref. 3, see also ref. 4)[3] [4], and so my comments here will be restricted to several specific points that Edwards raised.

If there has been ‘one rock in a shifting scene’ of general statistical thinking and practice in recent decades, it has not been the likelihood concept, as Edwards suggests, but rather the concept by which confidence limits and hypothesis tests are usually interpreted, which we may call the confidence concept of statistical evidence. This concept is not part of the Neyman-Pearson theory of tests and confidence region estimation, which denies any role to concepts of statistical evidence, as Neyman consistently insists. The confidence concept takes from the Neyman-Pearson approach techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data. (The absence of a comparable property in the likelihood and Bayesian approaches is widely regarded as a decisive inadequacy.) The confidence concept also incorporates important but limited aspects of the likelihood concept: the sufficiency concept, expressed in the general refusal to use randomized tests and confidence limits when they are recommended by the Neyman-Pearson approach; and some applications of the conditionality concept. It is remarkable that this concept, an incompletely formalized synthesis of ingredients borrowed from mutually incompatible theoretical approaches, is evidently useful continuously in much critically informed statistical thinking and practice [emphasis mine].

While inferences of many sorts are evident everywhere in scientific work, the existence of precise, general and accurate schemas of scientific inference remains a problem. Mendelian examples like those of Edwards and my 1969 paper seem particularly appropriate as case-study material for clarifying issues and facilitating effective communication among interested statisticians, scientific workers and philosophers and historians of science.

Allan Birnbaum
New York University
Courant Institute of Mathematical Sciences,
251 Mercer Street,
New York, NY 10012

Birnbaum’s confidence concept, sometimes written (Conf), was his attempt to find in error statistical ideas a concept of statistical evidence–a term that he invented and popularized. In Birnbaum 1977 (24), he states it as follows:

(Conf): A concept of statistical evidence is not plausible unless it finds ‘strong evidence for J as against H’ with small probability (α) when H is true, and with much larger probability (1 – β) when J is true.

Birnbaum questioned whether Neyman-Pearson methods had “concepts of evidence” simply because Neyman talked of “inductive behavior” and Wald and others couched statistical methods in decision-theoretic terms. I have been urging that we consider instead how the tools may actually be used, and not be restricted by the statistical philosophies of founders (not to mention that so many of their statements are tied up with personality disputes, and problems of “anger management”). Recall, as well, E. Pearson’s insistence on an evidential construal of N-P methods, and the fact that Neyman, in practice, spoke of drawing inferences and reaching conclusions (e.g., Neyman’s nursery posts, links in [iii] below).


Schachtman: High, Higher, Highest Quality Research Act

Since posting on the High Quality Research Act a few weeks ago, I’ve been following it in the news, have received letters from professional committees (asking us to write letters), and now see that Nathan A. Schachtman, Esq., PC posted the following on May 25, 2013 on his legal blog*:

“The High Quality Research Act” (HQRA), which has not been formally introduced in Congress, continues to draw attention. See “Clowns to the left of me, Jokers to the right.”  Last week, Sarewitz suggests that “the problem” is the hype about the benefits of pure research and the let down that results from the realization that scientific progress is “often halting and incremental,” with much research not “particularly innovative or valuable.”  Fair enough, but why is this Congress such an unsophisticated consumer of scientific research in the 21st century?  How can it be a surprise that the scientific community engages in the same rent-seeking behaviors as do other segments of our society? Has it escaped Congress’s attention that scientists are subject to enthusiasms and group think, just like, … congressmen?

Nature published an editorial piece suggesting that the HQRA is not much of a threat. Daniel Sarewitz, “Pure hype of pure research helps no one,” 497 Nature 411 (2013).

Still, Sarewitz believes that the HQRA bill is not particularly threatening to the funding of science:

“In other words, it’s not a very good bill, but neither is it much of a threat. In fact, it’s just the latest skirmish in a long-running battle for political control over publicly funded science — one fought since at least 1947, when President Truman vetoed the first bill to create the NSF because it didn’t include strong enough lines of political accountability.”

This sanguine evaluation misses the effect of the superlatives in the criteria for National Science Foundation funding:

“(1) is in the interests of the United States to advance the national health, prosperity, or welfare, and to secure the national defense by promoting the progress of science;

(2) is the finest quality, is ground breaking, and answers questions or solves problems that are of utmost importance to society at large; and

(3) is not duplicative of other research projects being funded by the Foundation or other Federal science agencies.”


Gelman sides w/ Neyman over Fisher in relation to a famous blow-up

blog-o-log

Andrew Gelman had said he would go back to explain why he sided with Neyman over Fisher in relation to a big, famous argument discussed on my Feb. 16, 2013 post: “Fisher and Neyman after anger management?”, and I just received an e-mail from Andrew saying that he has done so: “In which I side with Neyman over Fisher”. (I’m not sure what Senn’s reply might be.) Here it is:

“In which I side with Neyman over Fisher” Posted by  on 24 May 2013, 9:28 am

As a data analyst and a scientist, Fisher > Neyman, no question. But as a theorist, Fisher came up with ideas that worked just fine in his applications but can fall apart when people try to apply them too generally.

Here’s an example that recently came up.

Deborah Mayo pointed me to a comment by Stephen Senn on the so-called Fisher and Neyman null hypotheses. In an experiment with n participants (or, as we used to say, subjects or experimental units), the Fisher null hypothesis is that the treatment effect is exactly 0 for every one of the n units, while the Neyman null hypothesis is that the individual treatment effects can be negative or positive but have an average of zero.

Senn explains why Neyman’s hypothesis in general makes no sense—the short story is that Fisher’s hypothesis seems relevant in some problems (sometimes we really are studying effects that are zero or close enough for all practical purposes), whereas Neyman’s hypothesis just seems weird (it’s implausible that a bunch of nonzero effects would exactly cancel). And I remember a similar discussion as a student, many years ago, when Rubin talked about that silly Neyman null hypothesis.


Mayo’s slides from the Onto-Meth conference*

Methodology and Ontology in Statistical Modeling: Some error statistical reflections (Spanos and Mayo), uncorrected

 Our presentation falls under the second of the bulleted questions for the conference (conference blog is here):

How do methods of data generation, statistical modeling, and inference influence the construction and appraisal of theories?

Statistical methodology can influence what we think we’re finding out about the world, in the most problematic ways, traced to such facts as:

  • All statistical models are false
  • Statistical significance is not substantive significance
  • Statistical association is not causation
  • No evidence against a statistical null hypothesis is not evidence the null is true
  • If you torture the data enough they will confess.

(or just omit unfavorable data)

These points are ancient (lying with statistics; lies, damn lies, and statistics)

People are discussing these problems more than ever (big data), but what’s rarely realized is how much certain methodologies are at the root of the current problems.

__________________1__________________

All Statistical Models are False

Take the popular slogan in statistics and elsewhere: “all statistical models are false!”

What the “all models are false” charge boils down to:

(1) the statistical model of the data is at most an idealized and partial representation of the actual data generating source.

(2) a statistical inference is at most an idealized and partial answer to a substantive theory or question.

  • But we already know our models are idealizations: that’s what makes them models
  • Reasserting these facts is not informative.
  • Yet they are taken to have various (dire) implications about the nature and limits of statistical methodology
  • Neither of these facts precludes the use of these to find out true things
  • On the contrary, it would be impossible to learn about the world if we did not deliberately falsify and simplify.

__________________2__________________

  • Notably, the “all models are false” slogan is followed up by “But some are useful”,
  • Their usefulness, we claim, is being capable of adequately capturing an aspect of a phenomenon of interest
  • Then a hypothesis asserting its adequacy (or inadequacy) is capable of being true!

Note: All methods of statistical inference rest on statistical models.

What differentiates accounts is how well they step up to the plate in checking adequacy, learning despite violations of statistical assumptions (robustness)

__________________3__________________

Statistical significance is not substantive significance

Statistical models (as they arise in the methodology of statistical inference) live somewhere between

  1. Substantive questions, hypotheses, theories: H
  2. Statistical models of phenomena, experiments, data: M
  3. Data: x

What statistical inference has to do is afford adequate link-ups (reporting precision, accuracy, reliability)

__________________4__________________


Mayo: Meanderings on the Onto-Methodology Conference

Writing a blog like this, a strange and often puzzling exercise[1], does offer a forum for sharing half-baked chicken-scratchings from the back of frayed pages on themes from our Onto-Meth[2] conference from two weeks ago[3]. (The previous post had notes from blogger and attendee, Gandenberger.)

Onto-Meth conference

Several of the talks reflect a push-back against the idea that the determination of “ontology” in science—e.g., the objects and processes of theories, models and hypotheses—is (or should strive to correspond to?)  “real” objects in the world and/or what is approximately the case about them. Instead, at least some of the speakers wish to liberate ontology to recognize how “merely” pragmatic goals, needs, and desires are not just second-class citizens, but can and do (and should?) determine the categories of reality. Well there are a dozen equivocations here, most of which we did not really discuss at the conference.

In my own half of the Spanos-Mayo (D & P presentation[4]) I granted and even promoted the idea of a methodology that was pragmatic while also objective, so I’m not objecting to that part. The measurement of my weight is a product of “discretionary” judgments (e.g., to weigh in pounds with a scale having a given precision), but it is also a product of how much I really weigh (no getting around it). By understanding the properties of methodological tools and measuring systems, it is possible to “subtract out” the influence of the judgments to get at what is actually the case. At least approximately. But that view is different, it seems to me, from someone like Larry Laudan (at least in his later metamorphosis). Even though he considers his “reticulated” view a fairly hard-nosed spin on the Kuhnian idea of scientific paradigms as invariably containing an ontology (e.g., theories), a methodology, and (what he called) an “axiology” or set of aims (OMA), Laudan seems to think standards are so variable that what counts as evidence is constantly fluctuating (aside from maybe retaining the goal of fitting diverse facts). So I wonder if these pragmatic leanings are more like Laudan or more like me (and my view here, I take it, is essentially that of Peirce). I am perfectly sympathetic to the piecemeal “locavoracity” idea in Ruetsche, by the way.

My worry, one of them, is that all kinds of rival entities and processes arise to account for (accord with, predict, and purportedly explain) data and patterns in data, and don’t we need ways to discriminate them? During the open discussion, I mentioned several examples, some of which I can make out all scrunched up in the corners of my coffee-logged program, such as appeals to “cultural theories” of risk and risk perceptions. These theories say appeals to supposedly “real” hazards, e.g., chance of disease, death, catastrophe, and other “objective” risk assessments are wrong.  They say it is not only possible but preferable (truer?) to capture attitudes toward risks, e.g., GM foods, nuclear energy, climate change, breast implants, etc. by means of one or another favorite politico-cultural grid-group categories (e.g., marginal-individualists, passive-egalitarians, hierarchical-border people, fatalists, etc.). (Your objections to these vague category schemes are often taken as further evidence that you belong in one of the pigeon-holes!) And the other day I heard a behavioral economist declare that he had found the “mechanism” to explain deciding between options in virtually all walks of life using a regression parameter, he called it beta, and guess what? beta = 1/3! He proved it worked statistically too. He might be right, he had a lot of data. Anyway, in my deliberate attempt to trigger discussion at the conference end, I was wondering if some of the speakers and/or attendees (Danks, Woodward, Glymour? Anyone?) had anything to say about cases that some of us might wish to call reification.


Gandenberger on Ontology and Methodology (May 4) Conference: Virginia Tech

Gregory Gandenberger
Ph.D. graduate student: Dept. of History and Philosophy of Science & Dept. of Statistics
University of Pittsburgh
http://gsganden.tumblr.com/

Onto-Meth conference


Some Thoughts on the O&M 2013 Conference
I was struck by how little speakers at the Ontology and Methodology conference engaged with the realism/antirealism debate. Laura Ruetsche defended a version of Arthur Fine’s Natural Ontological Attitude (NOA) in the first talk of the conference, but none of the speakers after her addressed the debate directly. David Danks and Jim Woodward made it particularly clear that they were deliberately avoiding questions about realism in favor of questions about what kinds of ontologies our theories should have in order to best serve the various purposes for which we develop them.

I am not criticizing the speakers! I am inclined to agree with Clark Glymour that the kinds of questions Danks and Woodward addressed are more interesting and important than questions about “what’s really real.” On the other hand, I worry that we lose something when we focus only on the use of science toward such ends as prediction and control. During the discussion period at the end of the conference, Peter Godfrey-Smith argued that science has some value simply for telling us what really is the case. For instance, science tells us that all living things on earth have a common ancestor, and that fact is a good thing to know regardless of whether or not it helps us predict or control anything.

One feature of the realism/antirealism debate that has long bothered me is that it treats all of “our best sciences” as if they had roughly the same epistemic status. In fact, realism about quantum field theory, for instance, is much harder to defend than realism about evolutionary biology. I am inclined to dismiss the realism debate as ill-formed insofar as it presumes that the question of scientific realism is a single question that spans all of the sciences. I am also suspicious of the debate in its bread-and-butter domain of fundamental physics. It is not clear to me that there is such a thing as fundamental physics; that if there is such a thing as fundamental physics, then it is converging toward a unified ontology; that if it is converging toward a unified ontology, then we can make sense of the question whether or not that ontology is correct; or that if we can make sense of the question whether or not that ontology is correct, then we have the means to give a justified answer to that question.

Nevertheless, as Glymour pointed out during the open discussion period, there are still good and open questions to address about whether and how we are justified in believing that science tells us the truth in other domains (such as evolutionary theory) where the realism question seems relatively well-formed and answerable. We can dismiss questions about “what’s really real” at a “fundamental level” while still thinking that philosophers of science should have a story to tell the 46% of Americans who believe that human beings were created in more or less their current form within the last 10,000 years—not a story about how science serves purposes of prediction and control, but a story about how science can help us find the truth.

Categories: O & M conference | 7 Comments

“A sense of security regarding the future of statistical science…” Anon review of Error and Inference

Aris Spanos, my colleague and co-author (Economics), recently came across this seemingly anonymous review of our Error and Inference (2010) [E & I]. It’s interesting that the reviewer remarks that “The book gives a sense of security regarding the future of statistical science and its importance in many walks of life.” I wish I knew just what the reviewer means–but it’s appreciated regardless.

2010 American Statistical Association and the American Society for Quality

TECHNOMETRICS, AUGUST 2010, VOL. 52, NO. 3, Book Reviews, 52:3, pp. 362-370.

Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. MAYO and Aris SPANOS, New York: Cambridge University Press, 2010, ISBN 978-0-521-88008-4, xvii+419 pp., $60.00.

This edited volume contemplates the interests of both scientists and philosophers regarding gathering reliable information about the problem/question at hand in the presence of error, uncertainty, and with limited data information.

The volume makes a significant contribution in bridging the gap between scientific practice and the philosophy of science. The main contribution of this volume pertains to issues of error and inference, and showcases intriguing discussions on statistical testing and providing alternative strategy to Bayesian inference. In words, it provides cumulative information towards the philosophical and methodological issues of scientific inquiry at large.

The target audience of this volume is quite general and open to a broad readership. With some reasonable knowledge of probability theory and statistical science, one can get the maximum benefit from most of the chapters of the volume. The volume contains original and fascinating articles by eminent scholars (nine, including the editors) who range from names in statistical science to philosophy, including D. R. Cox, a name well known to statisticians.

The editors have done a superb job in presenting, organizing, and structuring the material in a logical order. The “Introduction and Background” is nicely presented and summarized, allowing for a smooth reading of the rest of the volume. There is a broad range of carefully selected topics from various related fields reflecting recent developments in these areas. The rest of the volume is divided in nine chapters/sections as follows:

1. Learning from Error, Severe Testing, and the Growth of Theoretical Knowledge

2. The Life of Theory in the New Experimentalism

3. Revisiting Critical Rationalism

4. Theory Confirmation and Novel Evidence

5. Induction and Severe Testing

6. Theory Testing in Economics and the Error-Statistical Perspective

7. New Perspectives on (Some Old) Problems of Frequentist Statistics

8. Causal Modeling, Explanation and Severe Testing

9. Error and Legal Epistemology

In summary, this volume contains a wealth of knowledge and fascinating debates on a host of important and controversial topics equally important to the philosophy of science and scientific practice. This is a must-read—I enjoyed reading it and I am sure you will too! The book gives a sense of security regarding the future of statistical science and its importance in many walks of life. I also want to take the opportunity to suggest another seemingly related book by Harman and Kulkarni (2007). The review of this book was appeared in Technometrics in May 2008 (Ahmed 2008).

The following are chapters in E & I (2010) written by Mayo and/or Spanos, if you’re interested. If you produce a palindrome meeting the extremely simple requirements for May (by May 25 or so), you can win a free copy! Read more »

Categories: Review of Error and Inference, Statistics | 3 Comments

‘No-Shame’ Psychics Keep Their Predictions Vague: New Rejected post

See new rejected post. (You may comment here or on the Rejected Posts blog.)

Categories: msc kvetch, rejected post | Leave a comment

If it’s called “The High Quality Research Act,” then ….

Among the (less technical) items sent my way over the past few days are discussions of the so-called High Quality Research Act. I’d not heard of it, but it’s apparently an outgrowth of the recent hand-wringing over junk science, flawed statistics, non-replicable studies, and fraud (discussed at times on this blog). And it’s clearly a hot topic. Let me just run this by you and invite your comments (before giving my impression). Following the Bill, below, is a list of five NSF projects about which the HQRA’s sponsor has requested further information, and then part of an article from today’s New Yorker on this “divisive new bill”: “Not Safe for Funding: The N.S.F. and the Economics of Science”.

[DISCUSSION DRAFT]

A BILL

April 18, 2013

TO [BE SUPPLIED]

Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled,

SECTION 1. SHORT TITLE.

This act may be cited as the “High Quality Research Act”.

SECTION 2. HIGH QUALITY RESEARCH.

(a) CERTIFICATION.—Prior to making an award of any contract or grant funding for a scientific research project, the Director of the NSF shall publish a statement on the public website of the Foundation that certifies that the research project—

(1) is in the interests of the U.S. to advance the national health, prosperity, or welfare, and to secure the national defense by promoting the progress of science;

(2) is the finest quality, is ground breaking, and answers questions or solves problems that are of utmost importance to society at large; and

(3) is not duplicative of other research projects being funded by the Foundation or other Federal Science agencies.

(b) TRANSFER OF FUNDS.—Any unobligated funds for projects not meeting the requirements of subsection (a) may be awarded to other scientific research projects that do meet such requirements.

(c) INITIAL IMPLEMENTATION REPORT.—Not later than 60 days after the date of enactment of this Act, the Director shall report to the Committee on Commerce, Science, and Transportation of the Senate and the Committee on Science, Space, and Technology of the House of Representatives on how the requirements set forth in subsection (a) are being implemented.

(d) NATIONAL SCIENCE BOARD IMPLEMENTATION REPORT.—Not later than 1 year after the date of enactment of this Act, the National Science Board shall report to the Committee on Commerce, Science, and Transportation of the Senate and the Committee on Science, Space, and Technology of the House of Representatives its findings and recommendations on how the requirements of subsection (a) are being implemented.

etc. etc.

Link to the Bill

Rep. Lamar Smith, author of the Bill, listed five NSF projects about which he has requested further information.

1. Award Abstract #1247824: “Picturing Animals in National Geographic, 1888-2008,” March 15, 2013 ($227,437);

2. Award Abstract #1230911: “Comparative Histories of Scientific Conservation: Nature, Science, and Society in Patagonian and Amazonian South America,” September 1, 2012 ($195,761);

3. Award Abstract #1230365: “The International Criminal Court and the Pursuit of Justice,” August 15, 2012 ($260,001);

4. Award Abstract #1226483: “Comparative Network Analysis: Mapping Global Social Interactions,” August 15, 2012 ($435,000); and

5. Award Abstract #1157551: “Regulating Accountability and Transparency in China’s Dairy Industry,” June 1, 2012 ($152,464).

________________________

MAY 9, 2013

NOT SAFE FOR FUNDING: THE N.S.F. AND THE ECONOMICS OF SCIENCE Read more »

Categories: junk science, science communication, Statistics | 14 Comments

Professorships in Scandal?

On page 1 of the New York Times yesterday was an article, “The Last Refuge From Scandal? Professorships”:

The traditional path to an academic job is long and laborious: the solitude and penury of graduate study, the scramble for one of the few open positions in each field, the blood sport of competitive publishing. But while colleges have always courted accomplished public figures, a leap to the front of the class has now become a natural move for those who have suffered spectacular career flameouts. At this point, the transition from public disgrace to college lectern is so familiar that when Mr. Galliano merely stepped foot on the campus of Central Saint Martins, an art and design school in London, speculation rippled around the world— incorrectly — that he would soon be teaching there.

I guess this shouldn’t surprise anyone. Sexy course titles and “novelty academics” are pretty old-hat; power and scandal, even if on the sleazy side, attract students; and if students are buying, universities can’t be blamed for selling. Or can they?  Here are some examples they cite:

After a sex scandal forced Eliot Spitzer from the governor’s mansion in Albany, he turned up at City College, teaching a course called “Law and Public Policy.” …

More recently, Parsons the New School for Design announced that John Galliano, the celebrated clothing designer who lost his job at Christian Dior after unleashing a torrent of anti-Semitic vitriol in a bar, would be leading a four-day workshop and discussion called “Show Me Emotion.”

And David H. Petraeus, the general turned intelligence chief turned ribald punch line, will have not one college paycheck, but two. Last month, the City University of New York said he would be the next visiting professor of public policy at Macaulay Honors College. On Thursday, the University of Southern California announced that Mr. Petraeus would also be teaching there…

Despite a petition objecting to Galliano, there seems to be little public concern that offering such courses threatens a university’s ethical standards, especially, perhaps, if “only” sexual transgressions are involved. Still, while I can see students wanting to enroll in a course taught by a Petraeus or a Spitzer, I doubt the same would be true for one run by a Diederik Stapel*. Is it because in the former cases the scandal does not directly touch on their accomplishments? Is there a justifiable principle of distinction operating?** (Or might it depend on the course?) Read more »

Categories: rejected post | 3 Comments

Schedule for Ontology & Methodology, 2013

May 4 (Saturday):

8:30-9:00: Pastries & Coffee (Continental Breakfast) outside of Pamplin 2030

MORNING SESSIONS:

9:00-9:15: Welcome talk
9:15-10:00: Ruetsche, “Method, Metaphysics, and Quantum Theory”
10:00-10:25: Discussion

10:25-10:40: Coffee break

10:40-11:05: Shech, “Phase Transitions, Ontology and Earman’s Sound Principle”
11:05-11:20: Discussion

11:20-12:05: Godfrey-Smith, “Evolution and Agency: A Case Study in Ontology and Methodology”
12:05-12:30: Discussion

12:30-1:30: Box Lunch

AFTERNOON SESSIONS: Read more »

Categories: Announcement | Leave a comment
