The Paradox of Replication, and the vindication of the P-value (but she can go deeper) 9/2/15 update (ii)


The unpopular P-value is invited to dance.

  1. The Paradox of Replication

Critic 1: It’s much too easy to get small P-values.

Critic 2: We find it very difficult to get small P-values; only 36 of 100 psychology experiments were found to yield small P-values in the recent Open Science collaboration on replication (in psychology).

Is it easy or is it hard?

You might say there’s no paradox: the problem is that the significance levels in the original studies are often due to cherry-picking, multiple testing, optional stopping and other biasing selection effects. The mechanism by which biasing selection effects blow up P-values is very well understood, and we can demonstrate exactly how it occurs. In short, many of the initially significant results merely report “nominal” P-values, not “actual” ones, and there’s nothing inconsistent between the complaints of critic 1 and critic 2.
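To see the mechanism concretely, here is a minimal simulation sketch (my own toy setup, not anything from the Report): a researcher with K null outcome measures who reports only the best-looking one will obtain a “nominal” p < .05 far more often than 5% of the time.

```python
# Minimal simulation sketch (toy setup, not from the Report): a researcher measures
# K outcomes with zero true effect and reports only the smallest P-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, K, trials = 30, 20, 10_000          # sample size, outcomes searched, simulated studies
hits = 0
for _ in range(trials):
    data = rng.normal(0.0, 1.0, size=(K, n))                  # all nulls true
    t = data.mean(axis=1) / (data.std(axis=1, ddof=1) / np.sqrt(n))
    p = 1 - stats.t.cdf(t, df=n - 1)                          # one-sided P-values
    hits += p.min() < 0.05                                    # report the best-looking one
print("nominal level: 0.05   actual rate of 'significant' studies:", hits / trials)
# With K = 20 independent null outcomes, the chance of at least one nominal p < .05
# is 1 - .95**20, about .64: the reported P-value badly understates the actual error rate.
```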

The resolution of the paradox attests to what many have long been saying: the problem is not with the statistical methods but with their abuse. Even the P-value, the most unpopular girl in the class, gets to show a little bit of what she’s capable of. She will give you a hard time when it comes to replicating nominally significant results, if they were largely due to biasing selection effects. That is just what is wanted; it is an asset that she feels the strain, and lets you know. It is statistical accounts that can’t pick up on biasing selection effects that should worry us (especially those that deny they are relevant). That is one of the most positive things to emerge from the recent, impressive, replication project in psychology. From an article in the Smithsonian magazine “Scientists Replicated 100 Psychology Studies, and Fewer Than Half Got the Same Results”:

The findings also offered some support for the oft-criticized statistical tool known as the P value, which measures whether a result is significant or due to chance. …

The project analysis showed that a low P value was fairly predictive of which psychology studies could be replicated. Twenty of the 32 original studies with a P value of less than 0.001 could be replicated, for example, while just 2 of the 11 papers with a value greater than 0.04 were successfully replicated. (Link is here.)

The Replication Report itself, published in Science, gives more details:

Considering significance testing, reproducibility was stronger in studies and journals representing cognitive psychology than social psychology topics. For example, combining across journals, 14 of 55 (25%) of social psychology effects replicated by the P < 0.05 criterion, whereas 21 of 42 (50%) of cognitive psychology effects did so. …The difference in significance testing results between fields appears to be partly a function of weaker original effects in social psychology studies, particularly in JPSP, and perhaps of the greater frequency of high-powered within-subjects manipulations and repeated measurement designs in cognitive psychology as suggested by high power despite relatively small participant samples. …

A negative correlation of replication success with the original study P value indicates that the initial strength of evidence is predictive of reproducibility. For example, 26 of 63 (41%) original studies with P < 0.02 achieved P < 0.05 in the replication, whereas 6 of 23 (26%) that had a P value between 0.02 < P < 0.04 and 2 of 11 (18%) that had a P value > 0.04 did so (Fig. 2). Almost two thirds (20 of 32, 63%) of original studies with P < 0.001 had a significant P value in the replication. [i]

Since only around 50% of replications are expected to be as strong as the original, the cases with an initial significance level < .02 don’t do too badly, judging just on the numbers. But I disagree with those who say that all that’s needed is to lower the required P-value, because that ignores the real monster: biasing selection effects.
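Where does the “around 50%” expectation come from? Here is a minimal sketch (my own illustration under a simple normal test, not the Report’s analysis): if the original study just reached the two-sided .05 cutoff and the true effect were exactly the original estimate, an identical replication would reach significance only about half the time.

```python
# Sketch under a simple normal test (my toy numbers, not the Report's analysis):
# the original study just reaches the two-sided .05 cutoff (z = 1.96), and the true
# effect is taken to equal the original estimate; the replication repeats the design.
from scipy import stats

z_cut = stats.norm.ppf(0.975)                 # 1.96
true_effect_in_se_units = z_cut               # "true" effect set at the original estimate
power_of_replication = 1 - stats.norm.cdf(z_cut - true_effect_in_se_units)
print(round(power_of_replication, 2))         # 0.5: the replication is roughly a coin flip
```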

 2. Is there evidence that differences (between the initial studies and the replications) are due to A, B, C…, or not? Notably, simple significance tests and cognate methods were the tools of choice in exploring possible explanations for the disagreeing results.

Last, there was little evidence that perceived importance of the effect, expertise of the original or replication teams, or self-assessed quality of the replication accounted for meaningful variation in reproducibility across indicators. Replication success was more consistently related to the original strength of evidence (such as original P value, effect size, and effect tested) than to characteristics of the teams and implementation of the replication (such as expertise, quality, or challenge of conducting study) (tables S3 and S4).

They look to a battery of simple significance tests for answers, if only indications. It is apt that they report these explanations as the result of “exploratory” analysis; they weren’t generalizing, but scrutinizing whether various factors could readily account for the results.

What evidence is there that the replication studies are not themselves due to bias? According to the Report:

There is no publication bias in the replication studies because all results are reported. Also, there are no selection or reporting biases because all were confirmatory tests based on pre-analysis plans. This maximizes the interpretability of the replication P values and effect estimates.

One needn’t rule out bias altogether to agree with the Report that the replication research controlled the most common biases and flexibilities to which initial experiments were open. If your P-value emerged from torture and abuse, it can’t be hidden from a replication that ties your hands. If you don’t cherry-pick, try and try again, barn hunt, capitalize on flexible theory, and so on, it’s hard to satisfy R.A. Fisher’s requirement of rarely failing to bring about statistically significant results–unless you’ve found a genuine effect. Admittedly a small part of finding things out, the same methods can be used to go deeper in discovering and probing alternative explanations of an effect.

3. Observed differences cannot be taken as caused by the “treatment”: My main worries with the replicationist conclusions in psychology are that they harbor many of the same presuppositions that cause problems in (at least some) psychological experiments to begin with, notably the tendency to assume that differences observed–any differences–are due to the “treatments”, and further, that they are measuring the phenomenon of interest. Even nonsignificant observed differences are interpreted as merely indicating smaller effects of the experimental manipulation, when the significance test is failing to indicate even a genuine effect, much less the particular causal thesis. The statistical test is shouting disconfirmation, if not falsification, of unwarranted hypotheses, but no such interpretation is heard.

It would be interesting to see a list of the failed replications. (I’ll try to dig them out at some point.) The New York Times gives three, but even they are regarded as “simply weaker”.

The overall “effect size,” a measure of the strength of a finding, dropped by about half across all of the studies. Yet very few of the redone studies contradicted the original ones; their results were simply weaker.

This is akin to the habit some researchers have of describing non-significant results as sort of “trending” significant––when the P-value is telling them it’s not significant, and I don’t mean falling short of a “bright line” at .05, but levels like .2, .3, and .4.  These differences are easy to bring about by chance variability alone. Psychologists also blur the observed difference (in statistics) with the inferred discrepancy (in parameter values). This inflates the inference. I don’t know the specific P-values for the following three:

More than 60 of the studies did not hold up. Among them was one on free will. It found that participants who read a passage arguing that their behavior is predetermined were more likely than those who had not read the passage to cheat on a subsequent test.

Another was on the effect of physical distance on emotional closeness. Volunteers asked to plot two points that were far apart on graph paper later reported weaker emotional attachment to family members, compared with subjects who had graphed points close together.

A third was on mate preference. Attached women were more likely to rate the attractiveness of single men highly when the women were highly fertile, compared with when they were less so. In the reproduced studies, researchers found weaker effects for all three experiments.

What are the grounds for saying they’re merely weaker? The author of the mate preference study protests even this mild criticism, claiming that a “theory required adjustment” shows her findings to have been replicated after all.

In an email, Paola Bressan, a psychologist at the University of Padua and an author of the original mate preference study, identified several such differences [between her study and the replication] — including that her sample of women were mostly Italians, not American psychology students — that she said she had forwarded to the Reproducibility Project. “I show that, with some theory-required adjustments, my original findings were in fact replicated,” she said.

Wait a minute. This was to be a general evolutionary theory, yes? According to the abstract:

Because men of higher genetic quality tend to be poorer partners and parents than men of lower genetic quality, women may profit from securing a stable investment from the latter, while obtaining good genes via extra pair mating with the former. Only if conception occurs, however, do the evolutionary benefits of such a strategy overcome its costs. Accordingly, we predicted that (a) partnered women should prefer attached men, because such men are more likely than single men to have pair-bonding qualities, and hence to be good replacement partners, and (b) this inclination should reverse when fertility rises, because attached men are less available for impromptu sex than single men. (A link to the abstract and paper is here.)

Is the author saying that Italian women obey a distinct evolutionary process? I take it one could argue that evolutionary forces manifest themselves in different ways in distinct cultures. Doubtless, ratings of attractiveness by U.S. psychology students can’t be assumed to reflect assessments about availability for impromptu sex. But can they even among Italian women? This is just one particular story through which the data are being viewed. [9/2/15 Update on the mate preference and ovulation study is in Section 4.]

I can understand that the authors of the replication Report wanted to tread carefully to avoid the kind of pushback that erupted when a hypothesis about cleanliness and morality failed to be replicated. (“Repligate” some called it.) My current concern echoes the one I raised about that case (in an earlier post):

“the [replicationist] question wasn’t: can the hypotheses about cleanliness and morality be well-tested or well probed by finding statistical associations between unscrambling cleanliness words and “being less judgmental” about things like eating your dog if he’s run over? At least not directly. In other words, the statistical-substantive link was not at issue.”

Just because subjects (generally psychology students) select a number on a questionnaire, or can be scored on an official test of attitude, feelings, self-esteem, etc., doesn’t mean it’s actually been measured, and you can proceed to apply statistics. You may adopt a method that allows you to go from statistical significance to causal claims—the unwarranted NHST animal that Fisher opposed—but the question does not disappear [ii]. Reading a passage against “free will” makes me more likely to cheat on a test? (There’s scarce evidence that reading a passage influenced the subject’s view on the deep issue of free will, nor even that the subject (chose to*) “cheat”, much less that the former is responsible for the latter.) When I plot two faraway points on a graph I’m more likely to feel more “distant” from my family than if I plot two close together points? The effect is weaker but still real? There are oceans of studies like these (especially in social psychology & priming research). Some are even taken to inform philosophical theories of mind or ethics when, in my opinion, philosophers should be providing a rigorous methodological critique of these studies [iii].  We need to go deeper; in many cases, no statistical analysis would even be required. The vast literatures on the assumed effect live lives of their own; to test their fundamental presuppositions could bring them all crashing down [iv]. Are they to remain out of bounds of critical scrutiny? What do you think?

I may come back to this post in later installments.

*Irony intended.

4. Update on the Italian Mate Selection Replication

Here’s the situation as I understand it, having read both the replication and the response by Bressan. The women in the study had to be single, not pregnant, not on the pill, and heterosexual. Among the single women, some are in relationships; they are “partnered”. The thesis is this: if a partnered woman is not ovulating, she’s more attracted to the “attached” guy, because he is deemed capable of a long-term commitment, as evidenced by his being in a relationship. So she might leave her current guy for him (at least if he’s handsome in a masculine sort of way). On the other hand, if she’s ovulating, she’d be more attracted to a single (not attached) man than to an attached man. “In this way she could get pregnant and carry the high-genetic-fitness man’s offspring without having to leave her current, stable relationship” (Frazier and Hasselman).

So the deal is this: if she’s ovulating, she’s got to do something fast: have a baby with the single (non-attached) guy who’s not very good at commitments (but shows high testosterone, and thus high immunities, according to the authors), and then race back to have the baby in her current stable relationship. As Bressan puts it in her response to the replication: “This effect was interpreted on the basis of the hypothesis that, during ovulation, partnered women would be “shopping for good genes” because they “already have a potentially investing ‘father’ on their side.” But would he be an invested father if it were another man’s baby? I mean, does this even make sense on crude evolutionary terms? [I don’t claim to know. I thought male lions are prone to stomp on babies fathered by other males. Even with humans, I doubt that even the “feminine” male Pleistocene partner would remain fully invested.]

Nevertheless, when you see the whole picture, Bressan does raise some valid questions about the replication attempt (BRESSAN COMMENTARY). I may come back to this later. You can find all the reports, responses by authors, and other related materials here.

[i] Here’s a useful overview from the Report in Science:

Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.

Since only around 50% of replications are expected to be as strong as the original, this might not seem that low. I think the entire issue of importance goes beyond rates, and that focusing on rates of replication actually distracts from what’s involved in appraising a given study or theory.

[ii] Statistical methods are relevant to answering this question and even to falsifying conjectured causal claims. My point is that it demands more than checking the purely statistical question in these “direct” replications, and more than P-values. Oddly, since these studies appeal to power, they ought to be cast in Neyman-Pearson hypothesis testing (ideally without the behavioristic rationale). This would immediately scotch an illicit slide from statistical to substantive inference.

[iii] Yes, this is one of the sources of my disappointment: philosophers of science should be critically assessing this so-called “naturalized” philosophy. It all goes back to Quine, but never mind.

[iv] It would not be difficult to test whether these measures are valid. The following is about the strongest, hedged, claim (from the Report) that the replication result is sounder than the original:

If publication, selection, and reporting biases completely explain the effect differences, then the replication estimates would be a better estimate of the effect size than would the meta-analytic and original results. However, to the extent that there are other influences, such as moderation by sample, setting, or quality of replication, the relative bias influencing original and replication effect size estimation is unknown.

Categories: replication research, reproducibility, spurious p values, Statistics | 12 Comments

3 YEARS AGO (AUGUST 2012): MEMORY LANE


MONTHLY MEMORY LANE: 3 years ago: August 2012. I mark in red three posts that seem most apt for general background on key issues in this blog.[1] Posts that are part of a “unit” or a group of “U-Phils” count as one (there are 4 U-Phils on Wasserman this time). Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014. We’re about to turn four.

August 2012

[1] excluding those reblogged fairly recently.

[2] Larry Wasserman’s paper was “Low Assumptions, High dimensions” in our special RIMM volume.

Categories: 3-year memory lane, Statistics | 1 Comment

How to avoid making mountains out of molehills, using power/severity


A classic fallacy of rejection is taking a statistically significant result as evidence of a discrepancy from a test (or null) hypothesis larger than is warranted. Standard tests do have resources to combat this fallacy, but you won’t see them in textbook formulations. It’s not new statistical method, but new (and correct) interpretations of existing methods, that are needed. One can begin with a companion to the rule in this recent post:

(1) If POW(T+,µ’) is low, then the statistically significant x is a good indication that µ > µ’.

To have the companion rule also in terms of power, let’s suppose that our result is just statistically significant. (As soon as it exceeds the cut-off the rule has to be modified). 

Rule (1) was stated in relation to a statistically significant result x (at level α) from a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ:   H0: µ ≤  0 against H1: µ >  0. Here’s the companion:

(2) If POW(T+,µ’) is high, then an α statistically significant x is a good indication that µ < µ’.
(The higher the POW(T+,µ’) is, the better the indication  that µ < µ’.)

That is, if the test’s power to detect alternative µ’ is high, then the statistically significant x is a good indication (or good evidence) that the discrepancy from null is not as large as µ’ (i.e., there’s good evidence that  µ < µ’).

 An account of severe testing based on error statistics is always keen to indicate inferences that are not warranted by the data, as well as those that are. Not only might we wish to indicate which discrepancies are poorly warranted, we can give upper bounds to warranted discrepancies by using (2).

EXAMPLE. Let σ = 10, n = 100, so (σ/√n) = 1. Test T+ rejects H0 at the .025 level if M > 1.96(1). For simplicity, let the cut-off, M*, be 2. Let the observed mean M0 just reach the cut-off 2.

POWER: POW(T+, µ’) = Pr(Test T+ rejects H0; µ’) = Pr(M > M*; µ’), where M is the sample mean and M* is the cut-off for rejection. (Since it’s continuous, it doesn’t matter if we write > or ≥.)[i]

The power against alternatives between the null and the cut-off M* will range from α to .5. Power exceeds .5 only once we consider alternatives greater than M*. Using one of our power facts: POW(µ’ = M* + 1(σ/√n)) = .84.

That is, adding one (σ/ √n) unit to the cut-off M* takes us to an alternative against which the test has power = .84. So, POW(T+, µ = 3) = .84. See this post.

By (2), the (just) significant result x is decent evidence that µ < 3, because if µ ≥ 3, we’d have observed a more statistically significant result with probability .84. The upper .84 confidence limit is 3. The significant result is even better evidence that µ < 4; the upper .975 confidence limit is 4 (approx.), etc.
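For readers who want to check these numbers, here is a minimal sketch (my own, using only the quantities defined in the example above):

```python
# Sketch of the numbers in the example: sigma = 10, n = 100, so SE = sigma/sqrt(n) = 1;
# cutoff M* = 2; observed mean M0 just reaches 2.
from scipy import stats

se, M_star = 1.0, 2.0

def POW(mu_prime):
    # POW(T+, mu') = Pr(M > M*; mu')
    return 1 - stats.norm.cdf(M_star, loc=mu_prime, scale=se)

print(round(POW(3), 2))    # 0.84: power against mu' = M* + 1 SE
print(round(POW(4), 3))    # 0.977 (~ .975): power against mu' = M* + 2 SE
# Rule (2): POW(3) = .84 is high, so the just-significant result is a decent indication
# that mu < 3; note 3 = M0 + 1 SE is the .84 upper confidence bound and 4 = M0 + 2 SE
# is (approximately) the .975 upper bound.
```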

Reporting (2) is typically of importance in cases of highly sensitive tests, but I think it should always accompany a rejection to avoid making mountains out of molehills. (Only (2) should be custom-tailored to the outcome not the cut-off.) In the case of statistical insignificance, (2) is essentially ordinary power analysis. (In that case, the interest may be to avoid making molehills out of mountains.) Power analysis, applied to insignificant results, is especially of interest with low-powered tests. For example, failing to find a statistically significant increase in some risk may at most rule out (substantively) large risk increases. It might not allow ruling out risks of concern. Naturally, that’s a context-dependent consideration, often stipulated in regulatory statutes.

Rule (2) also provides a way to distinguish values within a 1-α confidence interval (instead of choosing a given confidence level and then reporting CIs in the dichotomous manner that is now typical).

At present, power analysis is only used to interpret negative results–and there it is often confused with “retrospective power” (what I call shpower). Again, confidence bounds could be, but they are not now, used to this end (but rather the opposite [iii]).

Severity replaces M* in (2) with the actual result, be it significant or insignificant. 

Looking at power means looking at the best case (just reaching a significance level) or the worst case (just missing it). This is way too coarse; we need to custom tailor results using the observed data. That’s what severity does, but for this post, I wanted to just illuminate the logic.[ii]

One more thing:  

Applying (1) and (2) requires the error probabilities to be actual (approximately correct): Strictly speaking, rules (1) and (2) have a conjunct in their antecedents [iv]: “given the test assumptions are sufficiently well met”. If background knowledge leads you to deny (1) or (2), it indicates you’re denying the reported error probabilities are the actual ones. There’s evidence the test fails an “audit”. That, at any rate, is what I would argue.

————

[i] To state power in terms of P-values: POW(µ’) = Pr(P < p*; µ’) where P < p* corresponds to rejecting the null hypothesis at the given level.

[ii] It must be kept in mind that inferences are going to be in the form of µ > µ’ = µ0 + δ, or µ < µ’ = µ0 + δ, or the like. They are not to point values! (Not even to the point µ = M0.) Most simply, you may consider that the inference is in terms of the one-sided upper confidence bound (for various confidence levels)–the dual for test T+.

[iii] That is, upper confidence bounds are viewed as “plausible” bounds, and as values for which the data provide positive evidence. As soon as you get to an upper bound at confidence levels of around .6, .7, .8, etc. you actually have evidence µ’ < CI-upper. See this post.

[iv] The “antecedent” of a conditional refers to the statement between the “if” and the “then”.

OTHER RELEVANT POSTS ON POWER

Categories: fallacy of rejection, power, Statistics | 20 Comments

Statistics, the Spooky Science


I was reading this interview of Erich Lehmann yesterday: “A Conversation with Erich L. Lehmann”

Lehmann: …I read over and over again that hypothesis testing is dead as a door nail, that nobody does hypothesis testing. I talk to Julie and she says that in the behaviorial sciences, hypothesis testing is what they do the most. All my statistical life, I have been interested in three different types of things: testing, point estimation, and confidence-interval estimation. There is not a year that somebody doesn’t tell me that two of them are total nonsense and only the third one makes sense. But which one they pick changes from year to year. [Laughs] (p.151)…..

DeGroot: …It has always amazed me about statistics that we argue among ourselves about which of our basic techniques are of practical value. It seems to me that in other areas one can argue about whether a methodology is going to prove to be useful, but people would agree whether a technique is useful in practice. But in statistics, as you say, some people believe that confidence intervals are the only procedures that make any sense on practical grounds, and others think they have no practical value whatsoever. I find it kind of spooky to be in such a field.

Lehmann: After a while you get used to it. If somebody attacks one of these, I just know that next year I’m going to get one who will be on the other side. (pp.151-2)

Emphasis is mine.

I’m reminded of this post.

Morris H. DeGroot, Statistical Science, 1986, Vol. 1, No.2, 243-258

 

 

Categories: phil/history of stat, Statistics | 1 Comment

Severity in a Likelihood Text by Charles Rohde


I received a copy of a statistical text recently that included a discussion of severity, and this is my first chance to look through it. It’s Introductory Statistical Inference with the Likelihood Function by Charles Rohde from Johns Hopkins. Here’s the blurb:


This textbook covers the fundamentals of statistical inference and statistical theory including Bayesian and frequentist approaches and methodology possible without excessive emphasis on the underlying mathematics. This book is about some of the basic principles of statistics that are necessary to understand and evaluate methods for analyzing complex data sets. The likelihood function is used for pure likelihood inference throughout the book. There is also coverage of severity and finite population sampling. The material was developed from an introductory statistical theory course taught by the author at the Johns Hopkins University’s Department of Biostatistics. Students and instructors in public health programs will benefit from the likelihood modeling approach that is used throughout the text. This will also appeal to epidemiologists and psychometricians. After a brief introduction, there are chapters on estimation, hypothesis testing, and maximum likelihood modeling. The book concludes with sections on Bayesian computation and inference. An appendix contains unique coverage of the interpretation of probability, and coverage of probability and mathematical concepts.

It’s welcome to see severity in a statistics text; an example from Mayo and Spanos (2006) is given in detail. The author even says some nice things about it: “Severe testing does a nice job of clarifying the issues which occur when a hypothesis is accepted (not rejected) by finding those values of the parameter (here mu) which are plausible (have high severity[i]) given acceptance. Similarly severe testing addresses the issue of a hypothesis which is rejected…. ” I don’t know Rohde, and the book isn’t error-statistical in spirit at all.[ii] In fact, inferences based on error probabilities are often called “illogical” because they take into account cherry-picking, multiple testing, optional stopping and other biasing selection effects that the likelihoodist considers irrelevant. I wish he had used severity to address some of the classic howlers he delineates regarding N-P statistics. To his credit, they are laid out with unusual clarity. For example, a rejection of a point null µ = µ0, based on a result that just reaches the 1.96 cut-off for a one-sided test, is claimed to license the inference to a point alternative µ = µ’ that is over 6 standard deviations greater than the null (pp. 49-50). But it is not licensed. The probability of a larger difference than the one observed, were the data generated under such an alternative, is ~1, so the severity associated with such an inference is ~0; SEV(µ < µ’) ~ 1.
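For concreteness, here is a minimal sketch (my own, in standardized units with SE = 1) of the severity computation behind that last claim:

```python
# Sketch of the severity computation (standardized units, SE = 1): the observed result
# just reaches the 1.96 cutoff, while the point alternative mu' sits 6 SEs above the null.
from scipy import stats

z_obs, mu_prime = 1.96, 6.0
# Severity for "mu >= mu'": the probability of a result less accordant with that claim
# (a smaller observed mean), were mu equal to mu'. Here it is essentially 0.
sev_ge = stats.norm.cdf(z_obs - mu_prime)
# Severity for "mu < mu'": the probability of a larger result, were mu equal to mu'.
sev_lt = 1 - sev_ge
print(round(sev_ge, 5), round(sev_lt, 5))     # ~0.0  ~1.0
```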

[i] Not to quibble, but I wouldn’t say parameter values are assigned severity; rather, various hypotheses about mu pass with severity. The hypotheses are generally in the form of discrepancies, e.g., µ > µ’.

[ii] He’s a likelihoodist from Johns Hopkins. Royall has had a strong influence there (Goodman comes to mind), and elsewhere, especially among philosophers. Bayesians also come back to likelihood ratio arguments, often. For discussions on likelihoodism and the law of likelihood see:

How likelihoodists exaggerate evidence from statistical tests

Breaking the Law of Likelihood ©

Breaking the Law of Likelihood, to keep their fit measures in line A, B

Why the Law of Likelihood is Bankrupt as an Account of Evidence

Royall, R. (2004), “The Likelihood Paradigm for Statistical Evidence” 119-138; Rejoinder 145-151, in M. Taper, and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press.

http://www.phil.vt.edu/dmayo/personal_website/2006Mayo_Spanos_severe_testing.pdf

Categories: Error Statistics, Severity | Leave a comment

Performance or Probativeness? E.S. Pearson’s Statistical Philosophy


E.S. Pearson

Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long run error properties are of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson (11 Aug, 1895-12 June, 1980). I reblog a relevant post from 2012.

Cases of Type A and Type B

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

In cases of type A, long-run results are clearly of interest, while in cases of type B, repetition is impossible and may be irrelevant:

“In other and, no doubt, more numerous cases there is no repetition of the same type of trial or experiment, but all the same we can and many of us do use the same test rules to guide our decision, following the analysis of an isolated set of numerical data. Why do we do this? What are the springs of decision? Is it because the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment?

Or is it because we are content that the application of a rule, now in this investigation, now in that, should result in a long-run frequency of errors in judgment which we control at a low figure?” (Ibid., 173)

Although Pearson leaves this tantalizing question unanswered, claiming, “On this I should not care to dogmatize”, in studying how Pearson treats cases of type B, it is evident that in his view, “the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment” in learning about the particular case at hand. Continue reading

Categories: 3-year memory lane, phil/history of stat | Tags: | 28 Comments

A. Spanos: Egon Pearson’s Neglected Contributions to Statistics


11 August 1895 – 12 June 1980

Today is Egon Pearson’s birthday. I reblog a post by my colleague Aris Spanos from (8/18/12): “Egon Pearson’s Neglected Contributions to Statistics.”  Happy Birthday Egon Pearson!

Egon Pearson (11 August 1895 – 12 June 1980) is widely known today for his contribution in recasting Fisher’s significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

(i) specification: the need to state explicitly the inductive premises of one’s inferences,

(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality  assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the Student’s t fame] and then Fisher (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as the simple Normal model: Continue reading

Categories: phil/history of stat, Statistics, Testing Assumptions | Tags: , , , | Leave a comment

Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”


Memory lane: Did you ever consider how some of the colorful exchanges among better-known names in statistical foundations could be the basis for high literary drama in the form of one-act plays (even if appreciated by only 3-7 people in the world)? (Think of the expressionist exchange between Bohr and Heisenberg in Michael Frayn’s play Copenhagen, except here there would be no attempt at all to popularize—only published quotes and closely remembered conversations would be included, with no attempt to create a “story line”.) Somehow I didn’t think so. But rereading some of Savage’s high-flown praise of Birnbaum’s “breakthrough” argument (for the Likelihood Principle) today, I was swept into a “(statistical) theater of the absurd” mindset. (Update Aug. 2015 [ii])

The first one came to me in autumn 2008 while I was giving a series of seminars on philosophy of statistics at the LSE. Modeled on a disappointing (to me) performance of The Woman in Black, “A Funny Thing Happened at the [1959] Savage Forum” relates Savage’s horror at George Barnard’s announcement of having rejected the Likelihood Principle!


The current piece also features George Barnard. It recalls our first meeting in London in 1986. I’d sent him a draft of my paper on E.S. Pearson’s statistical philosophy, “Why Pearson Rejected the Neyman-Pearson Theory of Statistics” (later adapted as chapter 11 of EGEK) to see whether I’d gotten Pearson right. Since Tuesday (Aug 11) is Pearson’s birthday, I’m reblogging this. Barnard had traveled quite a ways, from Colchester, I think. It was June and hot, and we were up on some kind of a semi-enclosed rooftop. Barnard was sitting across from me looking rather bemused.

The curtain opens with Barnard and Mayo on the roof, lit by a spot mid-stage. He’s drinking (hot) tea; she, a Diet Coke. The dialogue (is what I recall from the time[i]):

 Barnard: I read your paper. I think it is quite good.  Did you know that it was I who told Fisher that Neyman-Pearson statistics had turned his significance tests into little more than acceptance procedures? Continue reading

Categories: Barnard, phil/history of stat, Statistics | Tags: , , , , | Leave a comment

Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen

Jerzy Neyman (April 16, 1894 – August 5, 1981)

“Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena” by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

Neyman died on August 5, 1981. Here’s an unusual paper of his, “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.” I have been reading a fair amount by Neyman this summer in writing about the origins of his philosophy, and have found further corroboration of the position that the behavioristic view attributed to him, while not entirely without substance*, is largely a fable that has been steadily built up and accepted as gospel. This has justified ignoring Neyman-Pearson statistics (as resting solely on long-run performance and irrelevant to scientific inference) and turning to crude variations of significance tests that Fisher wouldn’t have countenanced for a moment (so-called NHSTs): lacking alternatives, incapable of learning from negative results, and permitting all sorts of P-value abuses–notably going from a small P-value to claiming evidence for a substantive research hypothesis. The upshot is to reject all of frequentist statistics, even though P-values are a teeny tiny part.

*This represents a change in my perception of Neyman’s philosophy since EGEK (Mayo 1996). I still say that, for our uses of the methods, it doesn’t matter what anybody thought–that “it’s the methods, stupid!”

Anyway, I recommend, in this very short paper, the general comments and the example on home ownership. Here are two snippets: Continue reading

Categories: Error Statistics, Neyman, Statistics | Tags: | 19 Comments

Telling What’s True About Power, if practicing within the error-statistical tribe


Suppose you are reading about a statistically significant result x (at level α) from a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ:   H0: µ ≤  0 against H1: µ >  0. 

I have heard some people say [0]:

A. If the test’s power to detect alternative µ’ is very low, then the statistically significant x is poor evidence of a discrepancy (from the null) corresponding to µ’.  (i.e., there’s poor evidence that  µ > µ’ ).*See point on language in notes.

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.

I have heard other people say:

B. If the test’s power to detect alternative µ’ is very low, then the statistically significant x is good evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that  µ > µ’).

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.

Which is correct, from the perspective of the (error statistical) philosophy, within which power and associated tests are defined?

Allow that the test assumptions are adequately met. I have often said on this blog, and I repeat, that the most misunderstood and abused (or unused) concept from frequentist statistics is that of a test’s power to reject the null hypothesis under the assumption that alternative µ’ is true: POW(µ’). I deliberately write it in this correct manner because it is faulty to speak of the power of a test without specifying the alternative against which it is to be computed. It will also get you into trouble if you define power as in the first premise in a recent post: Continue reading
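To have the relevant quantities in hand (without settling the A versus B dispute in this excerpt), here is a minimal sketch, with assumed numbers, for computing POW(µ’) in the test T+ described above:

```python
# Sketch with assumed numbers: H0: mu <= 0 vs H1: mu > 0, known sigma, SE = 1,
# alpha = .025, so the test rejects when the sample mean M exceeds 1.96 SE.
from scipy import stats

def POW(mu_prime, se=1.0, alpha=0.025):
    cut = stats.norm.ppf(1 - alpha) * se                    # rejection cutoff
    return 1 - stats.norm.cdf(cut, loc=mu_prime, scale=se)  # Pr(M > cut; mu')

for mu_prime in (0.5, 1.0, 3.0, 4.0):
    print(mu_prime, round(POW(mu_prime), 2))                # low power at small mu', high at large
# Claims A and B disagree about what a (just) statistically significant result indicates
# about mu > mu' in the low-power versus high-power cases; these are the inputs to that dispute.
```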

Categories: confidence intervals and tests, power, Statistics | 36 Comments

Stephen Senn: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics


Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

This post first appeared here. An issue sometimes raised about randomized clinical trials is the problem of indefinitely many confounders. This, for example, is what John Worrall has to say:

Even if there is only a small probability that an individual factor is unbalanced, given that there are indefinitely many possible confounding factors, then it would seem to follow that the probability that there is some factor on which the two groups are unbalanced (when remember randomly constructed) might for all anyone knows be high. (Worrall J. What evidence is evidence-based medicine? Philosophy of Science 2002; 69: S316-S330: see p. S324 )

It seems to me, however, that this overlooks four matters. The first is that it is not indefinitely many variables we are interested in but only one, albeit one we can’t measure perfectly. This variable can be called ‘outcome’. We wish to see to what extent the difference observed in outcome between groups is compatible with the idea that chance alone explains it. The indefinitely many covariates can help us predict outcome but they are only of interest to the extent that they do so. However, although we can’t measure the difference we would have seen in outcome between groups in the absence of treatment, we can measure how much it varies within groups (where the variation cannot be due to differences between treatments). Thus we can say a great deal about random variation to the extent that group membership is indeed random. Continue reading
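As a toy illustration of the point about within-group variation (my own assumed data, not Senn’s example), a re-randomization (permutation) test uses the variation that random allocation itself induces to calibrate how big a between-group difference chance alone tends to produce:

```python
# Toy simulation (assumed data, not Senn's example): a randomized trial with no true
# treatment effect. The permutation distribution, built from re-randomizing the labels,
# uses within-group variation to show how large a between-group difference chance produces.
import numpy as np

rng = np.random.default_rng(7)
outcome = rng.normal(0.0, 1.0, size=40)          # 40 patients, no treatment effect
treated = rng.permutation(40) < 20               # random allocation, 20 per arm

obs_diff = outcome[treated].mean() - outcome[~treated].mean()

perm_diffs = []
for _ in range(10_000):
    labels = rng.permutation(treated)            # re-randomize the allocation
    perm_diffs.append(outcome[labels].mean() - outcome[~labels].mean())

p_value = np.mean(np.abs(perm_diffs) >= abs(obs_diff))
print(round(obs_diff, 3), round(p_value, 3))     # observed difference and its permutation P-value
```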

Categories: RCTs, S. Senn, Statistics | Tags: , | 6 Comments

3 YEARS AGO (JULY 2012): MEMORY LANE


MONTHLY MEMORY LANE: 3 years ago: July 2012. I mark in red three posts that seem most apt for general background on key issues in this blog.[1]  This new feature, appearing the last week of each month, began at the blog’s 3-year anniversary in Sept, 2014. (Once again it was tough to pick just 3; please check out others which might interest you, e.g., Schachtman on StatLaw, the machine learning conference on simplicity, the story of Lindley and particle physics, Glymour and so on.)

July 2012

[1] excluding those recently reblogged. Posts that are part of a “unit” or a group of “U-Phils” count as one.

Categories: 3-year memory lane, Statistics | Leave a comment

“Statistical Significance” According to the U.S. Dept. of Health and Human Services (ii)


Mayo, frustrated

Someone linked this to me on Twitter. I thought it was a home blog at first. Surely the U.S. Dept of Health and Human Services can give a better definition than this.

U.S. Department of Health and Human Services
Effective Health Care Program
Glossary of Terms

We know that many of the concepts used on this site can be difficult to understand. For that reason, we have provided you with a glossary to help you make sense of the terms used in Comparative Effectiveness Research. Every word that is defined in this glossary should appear highlighted throughout the Web site…..

Statistical Significance

Definition: A mathematical technique to measure whether the results of a study are likely to be true. Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance. Statistical significance is usually expressed as a P-value. The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true). Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

Example: For example, results from a research study indicated that people who had dementia with agitation had a slightly lower rate of blood pressure problems when they took Drug A compared to when they took Drug B. In the study analysis, these results were not considered to be statistically significant because p=0.2. The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

You can find it here. First of all, one should never use “likelihood” and “probability” in what is to be a clarification of formal terms, as these mean very different things in statistics. Some of the claims given actually aren’t so bad if “likely” takes its statistical meaning, but are all wet if construed as mathematical probability. Continue reading
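As a minimal illustration (my own, not part of the glossary or the post) of what the “.05” actually refers to: when the null hypothesis is true, a valid test produces p < .05 in about 5% of hypothetical repetitions. That is an error rate of a procedure, not “the probability that the results are due to chance,” and certainly not the probability that the results are “true.”

```python
# Illustration (my own): when the null hypothesis is true, a valid test yields p < .05
# in about 5% of hypothetical repetitions. The .05 is an error rate of the procedure,
# not the probability that a particular result is "true" or "due to chance".
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, trials = 50, 20_000
x = rng.normal(0.0, 1.0, size=(trials, n))                  # null true in every repetition
t = x.mean(axis=1) / (x.std(axis=1, ddof=1) / np.sqrt(n))
p = 2 * (1 - stats.t.cdf(np.abs(t), df=n - 1))              # two-sided P-values
print(round(np.mean(p < 0.05), 3))                          # ~ 0.05
```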

Categories: P-values, Statistics | 68 Comments

Spot the power howler: α = β?

Spot the fallacy!

  1. The power of a test is the probability of correctly rejecting the null hypothesis. Write it as 1 – β.
  2. So, the probability of incorrectly rejecting the null hypothesis is β.
  3. But the probability of incorrectly rejecting the null is α (the type 1 error probability).

So α = β.

I’ve actually seen this, and variants on it [i].

[i] Although they didn’t go so far as to reach the final, shocking, deduction.
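For concreteness, here is a quick numerical check (my own, using the one-sided normal test from the power/severity post above): the two error probabilities are computed under different hypotheses and are plainly not equal.

```python
# Quick check using the one-sided normal test from the power/severity post above (SE = 1):
from scipy import stats

cut = 2.0                                          # reject H0: mu <= 0 when M > 2
alpha = 1 - stats.norm.cdf(cut, loc=0.0)           # type 1 error, computed under the null
beta = stats.norm.cdf(cut, loc=3.0)                # type 2 error at the alternative mu' = 3
print(round(alpha, 3), round(beta, 3), round(1 - beta, 2))   # 0.023, 0.159, power 0.84
```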

 

Categories: Error Statistics, power, Statistics | 12 Comments

Higgs discovery three years on (Higgs analysis and statistical flukes)


2015: The Large Hadron Collider (LHC) is back in collision mode in 2015[0]. There’s a 2015 update, a virtual display, and links from ATLAS, one of two detectors at the LHC, here. The remainder is from one year ago (2014). I’m reblogging a few of the Higgs posts at the anniversary of the 2012 discovery. (The first was in this post.) The following was originally “Higgs Analysis and Statistical Flukes: part 2” (from March 2013).[1]

Some people say to me: “This kind of reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out)[2] Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories. 

“Higgs Analysis and Statistical Flukes: part 2”

Everyone was excited when the Higgs boson results were reported on July 4, 2012 indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of excess events of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post (part 1). It is an outsider’s angle on one small aspect of the statistical inferences involved. But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we [philosophers of statistics] should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels. Continue reading
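For reference, here is a minimal sketch (my own, under a simple normal approximation that the actual HEP analyses go far beyond) of how an observed “sigma” level translates into a one-sided P-value:

```python
# Sketch: textbook conversion of an observed "sigma" level to a one-sided P-value under
# a normal approximation (the actual HEP analyses involve far more than this).
from scipy import stats

for sigma in (3, 5):
    p = 1 - stats.norm.cdf(sigma)
    print(sigma, f"{p:.2e}")      # 3 sigma ~ 1.3e-03, 5 sigma ~ 2.9e-07
```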

Categories: Higgs, highly probable vs highly probed, P-values, Severity | Leave a comment

Winner of the June Palindrome contest: Lori Wike


Winner of June 2014 Palindrome Contest: (a dozen book choices)

Lori Wike: Principal bassoonist of the Utah Symphony; Faculty member at University of Utah and Westminster College

Palindrome: Sir, a pain, a madness! Elba gin in a pro’s tipsy end? I know angst, sir! I taste, I demonstrate lemon omelet arts. Nome diet satirists gnaw on kidneys, pits or panini. Gab less: end a mania, Paris!

Book choice: Conjectures and Refutations (K. Popper 1962, New York: Basic Books)

The requirement: A palindrome using “demonstrate” (and Elba, of course).

Bio: Lori Wike is principal bassoonist of the Utah Symphony and is on the faculty of the University of Utah and Westminster College. She holds a Bachelor of Music degree from the Eastman School of Music and a Master of Arts degree in Comparative Literature from UC-Irvine. Continue reading

Categories: Palindrome | Leave a comment

Larry Laudan: “When the ‘Not-Guilty’ Falsely Pass for Innocent”, the Frequency of False Acquittals (guest post)


Professor Larry Laudan
Lecturer in Law and Philosophy
University of Texas at Austin

“When the ‘Not-Guilty’ Falsely Pass for Innocent” by Larry Laudan

While it is a belief deeply ingrained in the legal community (and among the public) that false negatives are much more common than false positives (a 10:1 ratio being the preferred guess), empirical studies of that question are very few and far between. While false convictions have been carefully investigated in more than two dozen studies, there are virtually no well-designed studies of the frequency of false acquittals. The disinterest in the latter question is dramatically borne out by looking at discussions among intellectuals of the two sorts of errors. (A search of Google Books identifies some 6.3k discussions of the former and only 144 treatments of the latter in the period from 1800 to now.) I’m persuaded that it is time we brought false negatives out of the shadows, not least because each such mistake carries significant potential harms, typically inflicted by falsely-acquitted recidivists who are on the streets instead of in prison.

 

In criminal law, false negatives occur under two circumstances: when a guilty defendant is acquitted at trial and when an arrested, guilty defendant has the charges against him dropped or dismissed by the judge or prosecutor. Almost no one tries to measure how often either type of false negative occurs. That is partly understandable, given the fact that the legal system prohibits a judicial investigation into the correctness of an acquittal at trial; the double jeopardy principle guarantees that such acquittals are fixed in stone. Thanks in no small part to the general societal indifference to false negatives, there have been virtually no efforts to design empirical studies that would yield reliable figures on false acquittals. That means that my efforts here to estimate how often they occur must depend on a plethora of indirect indicators. With a bit of ingenuity, it is possible to find data that provide strong clues as to approximately how often a truly guilty defendant is acquitted at trial and in the pre-trial process. The resulting inferences are not precise and I will try to explain why as we go along. As we look at various data sources not initially designed to measure false negatives, we will see that they nonetheless provide salient information about when and why false acquittals occur, thereby enabling us to make an approximate estimate of their frequency.

My discussion of how to estimate the frequency of false negatives will fall into two parts, reflecting the stark differences between the sources of errors in pleas and the sources of error in trials. (All the data to be cited here deal entirely with cases of crimes of violence.) Continue reading

Categories: evidence-based policy, false negatives, PhilStatLaw, Statistics | Tags: | 9 Comments

Stapel’s Fix for Science? Admit the story you want to tell and how you “fixed” the statistics to support it!


Stapel’s “fix” for science is to admit it’s all “fixed!”

The recent case of Michael LaCour, the guy suspected of using faked data in a (retracted) study on how to promote support for gay marriage, is directing a bit of limelight onto our star fraudster Diederik Stapel (50+ retractions).

The Chronicle of Higher Education just published an article by Tom Bartlett: “Can a Longtime Fraud Help Fix Science?” You can read his full interview of Stapel here. A snippet:

You write that “every psychologist has a toolbox of statistical and methodological procedures for those days when the numbers don’t turn out quite right.” Do you think every psychologist uses that toolbox? In other words, is everyone at least a little bit dirty?

Stapel: In essence, yes. The universe doesn’t give answers. There are no data matrices out there. We have to select from reality, and we have to interpret. There’s always dirt, and there’s always selection, and there’s always interpretation. That doesn’t mean it’s all untruthful. We’re dirty because we can only live with models of reality rather than reality itself. It doesn’t mean it’s all a bag of tricks and lies. But that’s where the inconvenience starts. Continue reading

Categories: junk science, Statistics | 11 Comments

3 YEARS AGO (JUNE 2012): MEMORY LANE


MONTHLY MEMORY LANE: 3 years ago: June 2012. I mark in red three posts that seem most apt for general background on key issues in this blog.[1]  It was extremely difficult to pick only 3 this month; please check out others that look interesting to you. This new feature, appearing the last week of each month, began at the blog’s 3-year anniversary in Sept, 2014.

 

June 2012

[1] excluding those recently reblogged. Posts that are part of a “unit” or a group of “U-Phils” count as one.

Categories: 3-year memory lane | 1 Comment

Can You change Your Bayesian prior? (ii)


This is one of the questions high on the “To Do” list I’ve been keeping for this blog.  The question grew out of discussions of “updating and downdating” in relation to papers by Stephen Senn (2011) and Andrew Gelman (2011) in Rationality, Markets, and Morals.[i]

“As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’.” (Senn, 2011, p. 59)

“If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.” (Gelman, 2011, p. 77)

But if uncertainty is not expressible as a prior, then a major lynchpin for Bayesian updating seems questionable. If you can go from the posterior to the prior, on the other hand, perhaps it can also lead you to come back and change it.

Is it legitimate to change one’s prior based on the data?

I don’t mean update it, but reject the one you had and replace it with another. My question may yield different answers depending on the particular Bayesian view. I am prepared to restrict the entire question of changing priors to Bayesian “probabilisms”, meaning the inference takes the form of updating priors to yield posteriors, or to report a comparative Bayes factor. Interpretations can vary. In many Bayesian accounts the prior probability distribution is a way of introducing prior beliefs into the analysis (as with subjective Bayesians) or, conversely, to avoid introducing prior beliefs (as with reference or conventional priors). Empirical Bayesians employ frequentist priors based on similar studies or well established theory. There are many other variants.
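As a toy sketch of the updating-versus-“downdating” arithmetic in the Senn quote (my own conjugate example with made-up numbers, not his), with a Beta prior and binomial data the posterior is reached by adding the counts, and, given the same data, an elicited posterior determines the prior by subtracting them back off:

```python
# Toy conjugate sketch: Beta(a, b) prior + Binomial data (s successes, f failures)
# gives a Beta(a + s, b + f) posterior.
a, b = 2.0, 3.0            # assumed prior
s, f = 7, 3                # assumed data

posterior = (a + s, b + f)                        # updating: prior + data -> posterior
print("posterior:", posterior)

# "Downdating": given the same data, an elicited posterior determines the prior.
elicited = posterior
recovered_prior = (elicited[0] - s, elicited[1] - f)
print("recovered prior:", recovered_prior)        # (2.0, 3.0): the arithmetic runs both ways
```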


S. SENN: According to Senn, one test of whether an approach is Bayesian is that while Continue reading

Categories: Bayesian/frequentist, Gelman, S. Senn, Statistics | 111 Comments
