“Is the Philosophy of Probabilism an Obstacle to Statistical Fraud Busting?” was my presentation at the 2014 Boston Colloquium for the Philosophy of Science: **“Revisiting the Foundations of Statistics in the Era of Big Data: Scaling Up to Meet the Challenge.”**

As often happens, I never put these slides into a stand-alone paper. But I have incorporated them into my book (in progress*), “How to Tell What’s True About Statistical Inference”. Background and slides were posted last year.

Slides (draft from Feb 21, 2014)

Download the 54th Annual Program

Cosponsored by the Department of Mathematics & Statistics at Boston University.

Friday, February 21, 2014

10 a.m. – 5:30 p.m.

Photonics Center, 9th Floor Colloquium Room (Rm 906)

8 St. Mary’s Street


*Seeing a light at the end of tunnel, finally.

Filed under: P-values, significance tests, Statistical fraudbusting, Statistics


Mayo:

It happens I’ve been reading a lot lately about the assumption in social psychology, and psychology generally, that what they’re studying is measurable and quantifiable. Addressing the problem has been shelved on the back burner for decades, thanks to some redefinitions of what it is to “measure” in psych (anything for which there’s a rule to pop out a number, says Stevens, an operationalist in the naive positivist spirit). This, at any rate, is what I’m reading, thanks to papers sent by a colleague of Meehl’s (N. Waller). (Here’s one by Michell.) I think it’s time to reopen the question. The measures I see of “severity of moral judgment”, “degree of self-esteem” and much else in psychology appear to fall into this behavior in a very non-self-critical manner. No statistical window-dressing (nor banning of statistical inference) can help them become more scientific. So when I saw this on Math Babe’s Twitter I decided to try the “reblog” function and see what happened. Here it is (with her F word included). The article to which she alludes is “Recruiting Better Talent Through Brain Games”.

Originally posted on mathbabe:

Have you ever heard of phrenology? It was, once upon a time, the “science” of measuring someone’s skull to understand their intellectual capabilities.

This sounds totally idiotic but was a huge fucking deal in the mid-1800’s, and really didn’t stop getting some credit until much later. I know that because I happen to own the 1911 edition of the Encyclopedia Britannica, which was written by the top scholars of the time but is now horribly and fascinatingly outdated.

For example, the entry for “Negro” is famously racist. Wikipedia has an excerpt: “Mentally the negro is inferior to the white… the arrest or even deterioration of mental development [after adolescence] is no doubt very largely due to the fact that after puberty sexual matters take the first place in the negro’s life and thoughts.”

But really that one line doesn’t tell the whole story. Here’s the whole thing…


Filed under: msc kvetch, scientism, Statistics

**MONTHLY MEMORY LANE: 3 years ago: February 2012.** I am to mark in **red** **three** posts (or units) that seem most apt for general background on key issues in this blog. Given our Fisher reblogs, we’ve already seen many this month. So I’m marking in red (1) the Triad, and (2) the unit on Spanos’ misspecification tests. Please see those posts for their discussion. The two posts from 2/8 are apt if you are interested in a famous case involving statistics at the Supreme Court. Beyond that it’s just my funny theatre-of-the-absurd piece with Barnard. (Gelman’s is just a link to his blog.)

**February 2012**

- (2/3) Senn Again (Gelman)
- (2/7) When Can Risk-Factor Epidemiology Provide Reliable Tests?
- (2/8) Guest Blogger: Interstitial Doubts About the Matrixx (Schachtman)
- (2/8) Distortions in the Court? (PhilStat/PhilStock): Cobb on Ziliak & McCloskey

TRIAD:

- (2/11) R.A. Fisher: Statistical Methods and Scientific Inference
- (2/11) JERZY NEYMAN: Note on an Article by Sir Ronald Fisher
- (2/12) E.S. Pearson: Statistical Concepts in Their Relation to Reality

REBLOGGED LAST WEEK

- (2/12) Guest Blogger. STEPHEN SENN: Fisher’s alternative to the alternative
- (2/15) Guest Blogger. Aris Spanos: The Enduring Legacy of R.A. Fisher
- (2/17) Two New Properties of Mathematical Likelihood

- (2/20) Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”? (Rejected Post Feb 20)

M-S TESTING UNIT

- (2/22) Intro to Misspecification Testing: Ordering From A Full Diagnostic Menu (part 1)
- (2/23) Misspecification Testing: (part 2) A Fallacy of Error “Fixing”
- (2/27) Misspecification Testing: (part 3) Subtracting-out effects “on paper”
- (2/28) Misspecification Tests: (part 4) and brief concluding remarks

This new, once-a-month, feature began at the blog’s 3-year anniversary in Sept, 2014.

**Previous 3 YEAR MEMORY LANES:**

Sept. 2011 (Within “All She Wrote (so far)”)

Filed under: 3-year memory lane, Statistics

This headliner appeared before, but to a sparse audience, so Management’s giving him another chance… His joke relates both to Senn’s post (about alternatives) and to my recent post about using (1 – β)/α as a likelihood ratio, but for very different reasons. (I’ve explained at the bottom of this “(b) draft”.)

**…If you look closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike (especially as he’s no longer doing the Tonight Show)…**

**It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler joke* in criticizing the use of p-values.**

“Did you hear the one about significance testers rejecting H_{0} because of outcomes H_{0} didn’t predict?

‘What’s unusual about that?’ you ask?

What’s unusual is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny; to see why, I’ll strive to give it a generous interpretation.

We can view p-values in terms of rejecting *H*_{0}, as in the joke: There’s a test statistic D such that *H*_{0} is rejected if its observed value d_{0} reaches or exceeds a cut-off d* where Pr(D > d*; *H*_{0}) is small, say .025.

Reject *H*_{0} if Pr(D > d_{0}; *H*_{0}) < .025.

The report might be: “reject *H*_{0} at level .025”.

*Example*: *H*_{0}: The mean light deflection effect is 0. So if we observe a 1.96 standard deviation difference (in one-sided Normal testing) we’d reject *H*_{0} .

Now it’s true that if the observation were further into the rejection region, say 2, 3 or 4 standard deviations, it too would result in rejecting the null, and with an even smaller p-value. It’s also true that *H*_{0} “has not predicted” a 2, 3, 4, 5 etc. standard deviation difference in the sense that differences so large are “far from” or improbable under the null. But wait a minute. What if we’ve only observed a 1 standard deviation difference (p-value = .16)? It is unfair to count it against the null that 1.96, 2, 3, 4 etc. standard deviation differences would have diverged seriously from the null, when we’ve only observed the 1 standard deviation difference. Yet the p-value tells you to compute Pr(D > 1; *H*_{0}), which includes these more extreme outcomes! This is “a remarkable procedure” indeed! [i]
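The numbers in this paragraph are easy to check. Here is a minimal sketch (mine, not part of the original post) computing the one-sided Normal tail area Pr(D > d_{0}; H_{0}) with the standard library’s complementary error function:

```python
from math import erfc, sqrt

def one_sided_p(d0: float) -> float:
    """One-sided p-value Pr(D >= d0; H0) for D ~ N(0, 1) under the null."""
    return 0.5 * erfc(d0 / sqrt(2))

print(round(one_sided_p(1.96), 3))  # reaches the .025 cut-off
print(round(one_sided_p(1.0), 2))   # ≈ .16: no rejection at level .025
```

With d_{0} = 1 the tail area is about .16, so a significance tester does not reject; only d_{0} ≥ 1.96 reaches the .025 cut-off.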

So much for making out the howler. The only problem is that significance tests do not do this: they do not reject with, say, D = 1 because larger D values might have occurred (but did not). D = 1 does not reach the cut-off, and does not lead to rejecting *H*_{0}. Moreover, looking at the tail area makes it harder, not easier, to reject the null (although this isn’t the only function of the tail area): it requires not merely that Pr(D = d_{0}; *H*_{0}) be small, but that Pr(D > d_{0}; *H*_{0}) be small. And this is well justified, because when this probability is not small you should not regard the result as evidence of discrepancy from the null. Before getting to this….

1. The joke talks about outcomes the null does not predict–just what we wouldn’t know without an assumed test statistic. But the tail-area consideration arises in Fisherian tests precisely in order to determine what outcomes *H*_{0} “has not predicted”; that is, it arises to identify a sensible test statistic D.

In familiar scientific tests, we know the outcomes that are ‘more extreme’ from a given hypothesis in the direction of interest, e.g., the more patients show side effects after taking drug Z, the less indicative Z is benign, *not the other way around*. But that’s to assume the equivalent of a test statistic. In Fisher’s set-up, one needs to identify a suitable measure of accordance, fit, or directional departure. Improbability of outcomes (under *H*_{0}) should not indicate discrepancy from *H*_{0} if even less probable outcomes would occur under discrepancies from *H*_{0}. (Note: To avoid confusion, I always use “discrepancy” to refer to the parameter values used in describing the underlying data generation; values of D are “differences”.)

*2. N-P tests and tail areas*: Now N-P tests do not consider “tail areas” explicitly, but they fall out of the desiderata of good tests and sensible test statistics. N-P tests were developed to provide the tests that Fisher used with a rationale by making explicit the alternatives of interest—even if just in terms of directions of departure.

In order to determine the appropriate test and compare alternative tests “Neyman and I introduced the notions of the class of admissible hypotheses and the power function of a test. The class of admissible alternatives is formally related to the direction of deviations—changes in mean, changes in variability, departure from linear regression, existence of interactions, or what you will.” (Pearson 1955, 207)

Under N-P test criteria, tests should rarely reject a null erroneously, and, as discrepancies from the null increase, the probability of signaling discordance from the null should increase. In addition to ensuring Pr(D < d*; *H*_{0}) is high, one wants Pr(D > d*; *H*’: μ = μ_{0} + γ) to increase as γ increases. Any sensible distance measure D must **track** discrepancies from *H*_{0}. If you’re going to reason, “the larger the D value, the worse the fit with *H*_{0},” then observed differences must occur **because** of the falsity of *H*_{0} (in this connection consider Kadane’s howler).
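The tracking requirement can be illustrated with a small sketch (my numbers, for a unit-variance Normal test with one-sided cut-off d* = 1.96): the rejection probability climbs with the discrepancy γ, while Pr(D < d*; H_{0}) stays high.

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """Standard Normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

D_STAR = 1.96  # one-sided cut-off: Pr(D > d*; H0) ≈ .025

def reject_prob(gamma: float) -> float:
    """Pr(D > d*) when D ~ N(gamma, 1), i.e., under discrepancy gamma."""
    return 1 - normal_cdf(D_STAR - gamma)

for g in (0.0, 1.0, 2.0, 3.0):
    print(f"gamma = {g:.1f}: Pr(reject) = {reject_prob(g):.3f}")
```

At γ = 0 the rejection probability is just the error rate .025; it increases monotonically with γ, which is exactly what a sensible distance measure must deliver.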

3. But Fisher, strictly speaking, has only the null distribution, along with an implicit interest in tests with *sensitivity* toward implicit departures. To find out if *H*_{0} has or has not predicted observed results, we need a sensible distance measure. (Recall Senn’s post: “Fisher’s alternative to the alternative”, just reblogged yet again.**)

Suppose I take an observed difference d_{0} as grounds to reject *H*_{0 }on account of its being improbable under *H*_{0}, when in fact larger differences (larger D values) are more probable under *H*_{0}. Then, as Fisher rightly notes, the improbability of the observed difference was a poor indication of underlying discrepancy. This fallacy would be revealed by looking at the tail area; whereas it is readily committed with accounts that only look at the improbability of the observed outcome d_{0} under *H*_{0}.

4. Even if you have a sensible distance measure D (tracking the discrepancy relevant for the inference), and observe D = d, the improbability of d under *H*_{0} should not be indicative of a genuine discrepancy if it’s rather easy to bring about differences even greater than observed under *H*_{0}. Equivalently, we want a high probability of inferring *H*_{0} when *H*_{0} is true. In my terms, considering Pr(D < d*; *H*_{0}) is what’s needed to block rejecting the null and inferring alternative *H*’ when you haven’t rejected the null with severity (where *H*’ and *H*_{0} exhaust the parameter space). In order to say that we have “sincerely tried”, to use Popper’s expression, to reject *H*’ when it is false and *H*_{0} is correct, we need Pr(D < d*; *H*_{0}) to be high.

*5. Concluding remarks*:

The rationale for the tail area *for Fisher*, as I see it, is twofold: to get the right direction of departure, but also to ensure Pr(test T does *not* reject *H*_{0}; *H*_{0} ) is high.

If we don’t already have an appropriate distance measure D, then we don’t know which outcomes we should regard as those *H*_{0} *does or does not *predict–so Jeffreys’ quip can’t even be made out. That’s why Fisher looks at the tail area associated with any candidate for a test statistic. Neyman and Pearson make alternatives explicit in order to arrive at relevant test statistics. (For N-P, the “tail area” falls out of the rejection region; they actually criticize Fisher for not justifying his use of it.)

If we have an appropriate D, then Jeffreys’ criticism is equally puzzling because considering the tail area does not make it easier to reject *H*_{0} but harder. Harder because it’s not enough that the outcome be improbable under the null, outcomes even greater must be improbable under the null. And it makes it a lot harder (leading to blocking a rejection) just when it should: because the data could readily be produced by *H*_{0} [ii].
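A hypothetical fair-coin case (echoing note [i] below) makes this concrete: any exact outcome of 100 tosses is improbable under the null, yet the tail area for 51 heads is nowhere near small, so the test rightly blocks rejection.

```python
from math import comb

N, K = 100, 51  # 51% heads in 100 tosses; H0: fair coin

def pmf(k: int, n: int = N, p: float = 0.5) -> float:
    """Binomial point probability Pr(X = k; H0: p = 1/2)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

point = pmf(K)                               # Pr(exactly 51 heads; H0)
tail = sum(pmf(j) for j in range(K, N + 1))  # p-value: Pr(X >= 51; H0)

print(f"Pr(X = 51; H0)  = {point:.3f}")  # small on its own
print(f"Pr(X >= 51; H0) = {tail:.3f}")   # not small: no rejection
```

The point probability is about .078, but the tail area is about .46: looking at outcomes even greater than observed is what prevents rejecting the fair-coin null on such meager evidence.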

Either way, Jeffreys’ criticism, funny as it seems, collapses.

When an observation leads to rejecting the null in a significance test, it is because of that outcome—*not because of any unobserved outcomes.* Considering other possible outcomes that could have arisen is essential for determining (and controlling) the capabilities of the given testing method. In fact, understanding the properties of our testing tool T just is to understand what T would do under different outcomes, under different conjectures about what’s producing the data.

**Feb 22 addition(b): The relevance to Senn’s post is pretty obvious, as considering the tail area is a way Fisher ensures a sensible (directional) test statistic. The connection to my post about using (1 – β)/α as a likelihood ratio is less direct. If you use this in a Bayesian computation you’ll get a higher posterior probability for the non-null than if you used the observed data point. In this case, considering the tail area (for the power) really would make it easier to find evidence against the null. But it comes from using these terms in an entirely unintended way. Moreover, power is a lousy “distance” (between data and hypothesis) measure. That’s what I was trying to bring up in the post that went off topic.**

[i] Jeffreys’ next sentence, remarkably, is: “On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it.” This further supports my reading: it is as if we’d reject a fair-coin null because it would not predict 100% heads, even though we only observed 51% heads. But the allegation has no relation to significance tests of the Fisherian or N-P varieties.

[ii] One may argue it should be even harder, but this is a distinct issue.

[iii] I’ll indicate a significantly changed draft with [b] in the title.

*I initially called this “Sir Harold’s ‘howler’.” That phrase fell out naturally from the alliteration, but it’s strictly incorrect (as I wish to use the term “howler”). I don’t think so famous a “one-liner”–one that raises a legitimate question to be answered–should be lumped in with the group of howlers that are repeated over and over again, despite clarifications/explanations/corrections having been given many times. (So there’s a time factor involved.) I also wouldn’t place logical puzzles, e.g., the Birnbaum argument, in this category. By contrast, alleging that rejecting a null is intended, by N-P theory, to give stronger evidence against the null as the power increases, is a howler. Several other howlers may be found on this blog. I realized the need for a qualification in reading a comment on this blog by Andrew Gelman (1/14).

**Perhaps Senn disagrees with my take?

Jeffreys, H. (1939 edition), *Theory of Probability*. Oxford.

Pearson, E.S. (1955), “Statistical Concepts in Their Relation to Reality,” *Journal of the Royal Statistical Society*, Series B, 17, 204-207.

Filed under: Comedy, Discussion continued, Fisher, Jeffreys, P-values, Statistics, Stephen Senn

**As part of the week of recognizing R.A. Fisher (February 17, 1890 – July 29, 1962), I reblog Senn from 3 years ago.**

*‘Fisher’s alternative to the alternative’*

*By: Stephen Senn*

[2012 marked] the 50th anniversary of R.A. Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are *more sensitive* than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976, p. 473)

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976, and the lecture on which it was based was given in Detroit on 29 December 1971 (p. 441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre, but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published, and this throws light on many aspects of Fisher’s thought, including on significance tests.

The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.

Fisher replies with a ‘shorter catechism’ on transformations which ends as follows:

A…Have not Neyman and Pearson developed a general mathematical theory for deciding what tests of significance to apply?

B…Their method only leads to definite results when mathematical postulates are introduced, which could only be justifiably believed as a result of extensive experience….the introduction of hidden postulates only disguises the tentative nature of the process by which real knowledge is built up. (Bennett 1990) (p246)

It seems clear that by *hidden postulates* Fisher means *alternative hypotheses*, and I would sum up Fisher’s argument like this. Null hypotheses are more primitive than statistics: to state a null hypothesis immediately carries an implication about an infinity of test statistics. You have to choose one, however. To say that you should choose the one with the greatest *power* gets you nowhere. This *power* depends on the alternative hypothesis, but how will you choose your alternative hypothesis? If you knew that under all circumstances in which the null hypothesis was true you would know which alternative was false, you would already know more than the experiment was designed to find out. All that you can do is apply your experience to use statistics which, when employed in valid tests, reject the null hypothesis most often. Hence statistics are more primitive than alternative hypotheses, and the latter cannot be made the justification of the former.

I think that this is an important criticism of Fisher’s but not entirely fair. The experience of any statistician rarely amounts to so much that this can be made the (sure) basis for the choice of test. I think that (s)he uses a mixture of experience and argument. I can give an example from my own practice. In carrying out meta-analyses of binary data I have theoretical grounds (I believe) for a prejudice against the risk difference scale and in favour of odds ratios. I think that this prejudice was originally analytic. To that extent I was being rather Neyman-Pearson. However some extensive empirical studies of large collections of meta-analyses have shown that there is less heterogeneity on the odds ratio scale compared to the risk-difference scale. To that extent my preference is Fisherian. However, there are some circumstances (for example where it was reasonably believed that only a small proportion of patients would respond) under which I could be persuaded that the odds ratio was not a good scale. This strikes me as veering towards the N-P.
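The disagreement between the two scales mentioned above can be seen with an illustrative sketch (my own numbers, not Senn’s empirical studies): holding one odds ratio fixed across trials with different baseline risks forces the risk difference to vary, i.e., to look heterogeneous on that scale.

```python
def treated_risk(p0: float, odds_ratio: float) -> float:
    """Treated-group risk implied by a fixed odds ratio at baseline risk p0."""
    odds = odds_ratio * p0 / (1 - p0)
    return odds / (1 + odds)

OR = 2.0  # one odds ratio held constant across hypothetical trials
for p0 in (0.05, 0.20, 0.50):
    rd = treated_risk(p0, OR) - p0
    print(f"baseline risk {p0:.2f}: risk difference {rd:.3f}")
```

With the odds ratio pinned at 2, the implied risk difference moves from roughly .05 to .17 as baseline risk rises, so a collection of trials homogeneous on the odds-ratio scale must appear heterogeneous on the risk-difference scale.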

Nevertheless, I have a lot of sympathy with Fisher’s criticism. It seems to me that what the practicing scientist wants to know is what is a good test in practice rather than what would be a good test in theory if this or that could be believed about the world.

**References: **

J. H. Bennett (1990) *Statistical Inference and Analysis: Selected Correspondence of R.A. Fisher*, Oxford: Oxford University Press.

L. J. Savage (1976) “On rereading R. A. Fisher,” *The Annals of Statistics*, 4, 441-500.

- JERZY NEYMAN: Note on an Article by Sir Ronald Fisher (errorstatistics.com)
- E.S. PEARSON: Statistical Concepts in Their Relation to Reality (errorstatistics.com)
- Fisher, Statistical Methods and Scientific Inference (errorstatistics.com)

Filed under: Fisher, Statistics, Stephen Senn Tagged: power, Ronald Fisher, Savage, Stephen Senn

In recognition of R.A. Fisher’s birthday….

**‘R. A. Fisher: How an Outsider Revolutionized Statistics’**

by **Aris Spanos**

Few statisticians will dispute that R. A. Fisher **(February 17, 1890 – July 29, 1962)** is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of *optimal estimation* based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of *optimal testing* in the early 1930s. According to Hald (1998)

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the *ultimate outsider* when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however, did not stop him from publishing his first paper in statistics in 1912 (while still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method of his 1921 paper.

After graduating from Cambridge he drifted into a series of jobs, including subsistence farming and teaching high school mathematics and physics, until his temporary appointment as a statistician at Rothamsted Experimental Station in 1919. During the period 1912-1919 his interest in statistics was driven by his passion for eugenics and a realization that his mathematical knowledge of n-dimensional geometry could be put to good use in deriving finite sample distributions for estimators and tests in the spirit of Gosset’s (1908) paper. Encouraged by his early correspondence with Gosset, he derived the finite sampling distribution of the sample correlation coefficient, which he published in 1915 in *Biometrika*, the only statistics journal at the time, edited by Karl Pearson. To put this result in proper context: Pearson had been working on this problem for two decades and had published more than a dozen papers with several assistants on approximating the first two moments of the sample correlation coefficient; Fisher derived the relevant distribution, not just the first two moments.

Due to its importance, the 1915 paper provided Fisher’s first skirmish with the ‘statistical establishment’. Karl Pearson would not accept being overrun by a ‘newcomer’ lightly. So he prepared a critical paper with four of his assistants, which became known as “the cooperative study”, questioning Fisher’s result as stemming from a misuse of Bayes theorem. He proceeded to publish it in *Biometrika* in 1917 without bothering to let Fisher know before publication. Fisher was furious at K. Pearson’s move and prepared his answer in a highly polemical style, which Pearson promptly refused to publish in his journal. Eventually Fisher was able to publish his answer, after tempering the style, in *Metron*, a brand new statistics journal. As a result of this skirmish, Fisher pledged never to send another paper to *Biometrika*, and declared war on K. Pearson’s perspective on statistics. Fisher questioned not only Pearson’s method of moments, as giving rise to inefficient estimators, but also his derivation of the degrees of freedom of his chi-square test. Several highly critical published papers ensued.[i]

Between 1922 and 1930 Fisher did most of his influential work in recasting statistics, including publishing a highly successful textbook in 1925, but the ‘statistical establishment’ kept him ‘in his place’: a statistician at an experimental station. All his attempts to find an academic position, including a position in Social Biology at the London School of Economics (LSE), were unsuccessful (see Box, 1978, p. 202). Being turned down for the LSE position was not unrelated to the fact that the professor of statistics at the LSE was Arthur Bowley (1869-1957), second only to Pearson in the statistical high priesthood.[ii]

Coming of age as a statistician during the 1920s in England meant being awarded the Guy Medal in gold, silver or bronze, or at least receiving an invitation to present your work to the Royal Statistical Society (RSS). Despite his fundamental contributions to the field, Fisher’s invitation to the RSS would not come until 1934. To put that in perspective, Jerzy Neyman, his junior by some distance, was invited six months earlier! Indeed, one can make a strong case that the statistical establishment kept Fisher away for as long as they could get away with it. However, by 1933 they must have felt that they had to invite Fisher, after he accepted a professorship at University College, London. The position was created after Karl Pearson retired and the College decided to split his chair into a statistics position, which went to Egon Pearson (Pearson’s son), and a Galton Professorship in Eugenics, which was offered to Fisher. To make it worse, Fisher’s offer came with a humiliating clause forbidding him to teach statistics at University College (see Box, 1978, p. 258); the father of modern statistics was explicitly told to keep his views on statistics to himself!

Fisher’s presentation to the Royal Statistical Society, on December 18th, 1934, entitled “The Logic of Inductive Inference”, was an attempt to summarize and explain his published work on recasting the problem of statistical induction since his classic 1922 paper. Bowley was (self?) appointed to move the traditional vote of thanks and open the discussion. After some begrudging thanks for Fisher’s ‘contributions to statistics in general’, he went on to disparage his new approach to statistical inference based on the likelihood function by describing it as abstruse, arbitrary and misleading. His comments were predominantly sarcastic and discourteous, and went as far as to accuse Fisher of plagiarism, by not acknowledging Edgeworth’s priority on the likelihood function idea (see Fisher, 1935, pp. 55-7). The litany of churlish comments continued with the rest of the old guard: Isserlis, Irwin and the philosopher Wolf (1935, pp. 57-64), who was brought in by Bowley to undermine Fisher’s philosophical discussion on induction. Jeffreys complained about Fisher’s criticisms of the Bayesian approach (1935, pp. 70-2).

To Fisher’s support came … Egon Pearson, Neyman and Bartlett. E. Pearson argued that:

“When these ideas [on statistical induction] were fully understood … it would be realized that statistical science owed a very great deal to the stimulus Professor Fisher had provided in many directions.” (Fisher, 1935, pp. 64-5)

Neyman too came to Fisher’s support, praising Fisher’s path-breaking contributions, and explaining Bowley’s reaction to Fisher’s critical review of the traditional view of statistics as an understandable attachment to old ideas (1935, p. 73).

Fisher, in his reply to Bowley and the old guard, was equally contemptuous:

“The acerbity, to use no stronger term, with which the customary vote of thanks has been moved and seconded … does not, I confess, surprise me. From the fact that thirteen years have elapsed between the publication, by the Royal Society, of my first rough outline of the developments, which are the subject of to-day’s discussion, and the occurrence of that discussion itself, it is a fair inference that some at least of the Society’s authorities on matters theoretical viewed these developments with disfavour, and admitted with reluctance. … However true it may be that Professor Bowley is left very much where he was, the quotations show at least that Dr. Neyman and myself have not been left in his company. … For the rest, I find that Professor Bowley is offended with me for ‘introducing misleading ideas’. He does not, however, find it necessary to demonstrate that any such idea is, in fact, misleading. It must be inferred that my real crime, in the eyes of his academic eminence, must be that of ‘introducing ideas’.” (Fisher, 1935, pp. 76-82)[iii]

In summary, the pioneering work of Fisher and later supplemented by Egon Pearson and Neyman, was largely ignored by the Royal Statistical Society (RSS) establishment until the early 1930s. By 1933 it was difficult to ignore their contributions, published primarily in other journals, and the ‘establishment’ of the RSS decided to display its tolerance to their work by creating ‘the Industrial and Agricultural Research Section’, under the auspices of which both papers by Neyman and Fisher were presented in 1934 and 1935, respectively. [iv]

In 1943, Fisher was offered the Balfour Chair of Genetics at the University of Cambridge. Recognition from the RSS came in 1946 with the Guy medal in gold, and he became its president in 1952-1954, just after he was knighted! Sir Ronald Fisher retired from Cambridge in 1957. The father of modern statistics never held an academic position in statistics!

Read more in Spanos 2008 (below)

**References**

Bowley, A. L. (1902, 1920, 1926, 1937) *Elements of Statistics*, 2nd, 4th, 5th and 6th editions, Staples Press, London.

Box, J. F. (1978) *The Life of a Scientist: R. A. Fisher*, Wiley, NY.

Fisher, R. A. (1912), “On an Absolute Criterion for Fitting Frequency Curves,” *Messenger of Mathematics*, 41, 155-160.

Fisher, R. A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population,” *Biometrika,* 10, 507-21.

Fisher, R. A. (1921) “On the ‘probable error’ of a coefficient deduced from a small sample,” *Metron* 1, 2-32.

Fisher, R. A. (1922) “On the mathematical foundations of theoretical statistics,” *Philosophical Transactions of the Royal Society*, A 222, 309-68.

Fisher, R. A. (1922a) “On the interpretation of χ^{2} from contingency tables, and the calculation of P,” *Journal of the Royal Statistical Society*, 85, 87–94.

Fisher, R. A. (1922b) “The goodness of fit of regression formulae and the distribution of regression coefficients,” *Journal of the Royal Statistical Society,* 85, 597–612.

Fisher, R. A. (1924) “The conditions under which χ^{2} measures the discrepancy between observation and hypothesis,” *Journal of the Royal Statistical Society*, 87, 442-450.

Fisher, R. A. (1925) *Statistical Methods for Research Workers*, Oliver & Boyd, Edinburgh.

Fisher, R. A. (1935) “The logic of inductive inference,” *Journal of the Royal Statistical Society* 98, 39-54, discussion 55-82.

Fisher, R. A. (1937), “Professor Karl Pearson and the Method of Moments,” *Annals of Eugenics*, 7, 303-318.

Gosset, W. S. (“Student”) (1908) “The probable error of a mean,” *Biometrika*, 6, 1-25.

Hald, A. (1998) *A History of Mathematical Statistics from 1750 to 1930*, Wiley, NY.

Hotelling, H. (1930) “British statistics and statisticians today,” *Journal of the American Statistical Association*, 25, 186-90.

Neyman, J. (1934) “On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection,” *Journal of the Royal Statistical Society,* 97, 558-625.

Rao, C. R. (1992) “R. A. Fisher: The Founder of Modern Statistics,” *Statistical Science*, 7, 34-48.

RSS (Royal Statistical Society) (1934) *Annals of the Royal Statistical Society* 1834-1934, The Royal Statistical Society, London.

Savage, L. J. (1976) “On re-reading R. A. Fisher,” *Annals of Statistics*, 4, 441-500.

Spanos, A. (2008), “Statistics and Economics,” pp. 1057-1097 in *The New Palgrave Dictionary of Economics*, Second Edition. Eds. Steven N. Durlauf and Lawrence E. Blume, Palgrave Macmillan.

Tippett, L. H. C. (1931) *The Methods of Statistics*, Williams & Norgate, London.

[i] Fisher (1937), published a year after Pearson’s death, is particularly acerbic. In Fisher’s mind, Karl Pearson went after a young Indian statistician – totally unfairly – just the way he went after him in 1917.

[ii] Bowley received the Guy Medal in silver from the Royal Statistical Society (RSS) as early as 1895, and became a member of the Council of the RSS in 1898. He was awarded the society’s highest honor, the Guy Medal in gold, in 1935.

[iii] It is important to note that Bowley revised his textbook in statistics for the last time in 1937, and predictably, he missed the whole change of paradigms brought about by Fisher, Neyman and Pearson.

[iv] In their centennial volume published in 1934, the RSS acknowledged the development of ‘mathematical statistics’, referring to Galton, Edgeworth, Karl Pearson, Yule and Bowley as the main pioneers, and listed the most important contributions in this sub-field which appeared in its Journal during the period 1909-33, but the three important papers by Fisher (1922a-b; 1924) are conspicuously absent from that list. The list itself is dominated by contributions in vital, commercial, financial and labour statistics (see RSS, 1934, pp. 208-23). There is a single reference to Egon Pearson.

This was first posted on 17 Feb. 2013 here.

**HAPPY BIRTHDAY R.A. FISHER!**

Filed under: Fisher, phil/history of stat, Spanos, Statistics ]]>

*In recognition of R.A. Fisher’s birthday tomorrow, I will post several entries on him. I find this (1934) paper to be intriguing–immediately before the conflicts with Neyman and Pearson erupted. It represents essentially the last time he could take their work at face value, without the professional animosities that almost entirely caused, rather than being caused by, the apparent philosophical disagreements and name-calling everyone focuses on. Fisher links his tests and sufficiency to the Neyman and Pearson lemma in terms of power. It’s as if we may see them as ending up in a very similar place (no pun intended) while starting from different origins. I quote just the most relevant portions…the full article is linked below. I’d blogged it earlier here. You may find some gems in it.*

**‘Two new Properties of Mathematical Likelihood’**

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principal result obtained by Neyman and Pearson in discussing the efficacy of tests of significance. Neyman and Pearson introduce the notion that any chosen test of a hypothesis H_{0} is more powerful than any other equivalent test, with regard to an alternative hypothesis H_{1}, when it rejects H_{0} in a set of samples having an assigned aggregate frequency ε when H_{0} is true, and the greatest possible aggregate frequency when H_{1} is true.

If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H_{1} is less than that of any other group of samples outside the region, but is not less on the hypothesis H_{0}, then the test can evidently be made more powerful by substituting the one group for the other.

Consequently, for the most powerful test possible the ratio of the probabilities of occurrence on the hypothesis H_{0} to that on the hypothesis H_{1} is less in all samples in the region of rejection than in any sample outside it. For samples involving continuous variation the region of rejection will be bounded by contours for which this ratio is constant. The regions of rejection will then be required in which the likelihood of H_{0} bears to the likelihood of H_{1}, a ratio less than some fixed value defining the contour. (295)…

It is evident, at once, that such a system is only possible when the class of hypotheses considered involves only a single parameter θ, or, what comes to the same thing, when all the parameters entering into the specification of the population are definite functions of one of their number. In this case, the regions defined by the uniformly most powerful test of significance are those defined by the estimate of maximum likelihood, T. For the test to be uniformly most powerful, moreover, these regions must be independent of θ, showing that the statistic must be of the special type distinguished as sufficient. Such sufficient statistics have been shown to contain all the information which the sample provides relevant to the value of the appropriate parameter θ. It is inevitable therefore that if such a statistic exists it should uniquely define the contours best suited to discriminate among hypotheses differing only in respect of this parameter; and it is surprising that Neyman and Pearson should lay it down as a preliminary consideration that ‘the testing of statistical hypotheses cannot be treated as a problem in estimation.’ When tests are considered only in relation to sets of hypotheses specified by one or more variable parameters, the efficacy of the tests can be treated directly as the problem of estimation of these parameters. Regard for what has been established in that theory, apart from the light it throws on the results already obtained by their own interesting line of approach, should also aid in treating the difficulties inherent in cases in which no sufficient statistic exists. (296)

You can read the full paper here.

Filed under: Fisher, phil/history of stat, Statistics Tagged: Bayesianism, induction, Ronald Fisher, significance tests ]]>

My post “What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?” gave rise to a set of comments that were mostly off topic but interesting in their own right. As the thread became too long to follow, I have put what appears to be the last group of comments here, starting with Matloff’s query. Please feel free to continue the discussion here; we may want to come back to the topic. Feb 17: Please note one additional voice at the end. (Check back to that post if you want to see the history.)

I see the conversation is continuing. I have not had time to follow it, but I do have a related question, on which I’d be curious as to the response of the Bayesians in our midst here.

Say the analyst is sure that μ > c, and chooses a prior distribution with support on (c,∞). That guarantees that the resulting estimate is > c. But suppose the analyst is wrong, and μ is actually less than c. (I believe that some here conceded this could happen in some cases in which the analyst is “sure” μ > c.) Doesn’t this violate one of the most cherished (by Bayesians) features of the Bayesian method — that the effect of the prior washes out as the sample size n goes to infinity?

(to Matloff),

The short answer is that assuming information such as “mu is greater than c” which isn’t true screws up the analysis. It’s like a mathematician starting a proof by saying “assume 3 is an even number”. If it were possible to consistently get good results from false assumptions, there would be no need to ever get our assumptions right.

The longer answer goes like this. Statisticians can get inferences and their associated uncertainties from probability distributions. If those inferences are true to within those uncertainties, we say the distribution is ‘good’. Statisticians typically do this with posteriors, good posteriors being those that give us interval estimates that jibe with reality. Obviously though it can be done for any distribution no matter its type or purpose.

Therefore, a prior is only ‘good’ if the inferences drawn from it are true to within the implied uncertainties. That’s how Bayesian priors on mu are ‘tested’ even though the prior is modeling the uncertainty in a single value of mu rather than the frequency of multiple mu’s. You simply compare the inferences from the prior and see if it’s consistent with the prior information.

Given the prior with support on (c, infty) we’d infer that “the true mu is greater than c”. If the true mu is less than c, then the prior is ‘bad’ and shouldn’t be used. Using it is equivalent to making a false assumption no different than “assume 3 is an even number”.

**Alan**

The moral of the story Matloff is that your prior should only say “mu is greater than c” if your prior information guarantees it. If the prior information about mu isn’t strong enough to guarantee it with certainty you should choose a prior which reflects that and has a larger support than (c, infty)

**rasmusab**

Well, using a (c,∞) prior makes a model that “considers” values less than c impossible, and is useful when you don’t have the time or need to come up with something more nuanced. But if it seems that the (c,∞) prior is not doing a good job (or if you learn new information) there is nothing stopping you from changing the prior (as you can change other assumptions in the model). So you could say, “All priors are false, but some are useful”.

Of course, if you want to you can put some other prior on μ where you reserve a tiny bit of probability on μ less than 3 and in that model you would have the property that “the effect of the prior washes out as the sample size n goes to infinity”.

Thanks for the thoughtful comments, Alan and rasmusab. But I think you agree, then, with my point: One of the most famous defenses offered by Bayesians for their methods — that the influence of the prior gradually washes out (“Our answers won’t be much different from those of the frequentists”) — fails in a broad category of situations. The Bayesian philosophy is not quite as advertised.

The other point I’d make in response to your comments (which I’ve mentioned before here and in Andrew Gelman’s blog) is that frequentist methods are robust to bad assumptions, in the sense that one can verify the assumptions via the data (if you have enough of it). By contrast, one can’t do that for a (subjective) prior, by definition, because one is working with only one realization of the parameter θ.
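Matloff's wash-out point can be seen numerically. Here is a minimal sketch of my own (the cutoff c = 3, true mu = 2, and all other numbers are made up for illustration, not taken from the thread): with a flat prior supported only above c, the posterior can never recover a true mu below c, however large n gets.

```python
# Sketch of a prior that cannot wash out: flat prior on (3, 8] for mu,
# Normal data with known sigma = 1 but true mu = 2 (outside the support).
# All numbers are illustrative assumptions, not from the discussion above.
import numpy as np

rng = np.random.default_rng(0)
c, true_mu, sigma = 3.0, 2.0, 1.0
grid = np.linspace(c + 1e-6, 8.0, 5000)      # stand-in for support (c, inf)

post_means = {}
for n in (10, 10000):
    x = rng.normal(true_mu, sigma, size=n)
    # Normal log-likelihood of mu on the grid (up to a constant)
    loglik = -n * (grid - x.mean()) ** 2 / (2 * sigma**2)
    post = np.exp(loglik - loglik.max())     # flat prior => posterior ~ likelihood
    post /= post.sum()
    post_means[n] = float((grid * post).sum())

print(post_means)
```

However much data arrives, the posterior simply piles up against the boundary at c = 3: the posterior mean creeps toward 3 as n grows but stays above it, and never approaches the true mu = 2.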

**Alan**

Matloff, I’ve never heard anyone claim that if a prior assigns zero probability to the true value of mu that the posterior will settle on the true mu given enough data. Since elementary algebra shows the support of the posterior is a subset of the support of the prior, the claim is trivially false, and I doubt anyone ever did say it was true.

John Byrd, there is no *“validated by estimating error probabilities that will result from applications of it”* being done. The prior and posterior describe an uncertainty range for a single mu. There are no frequencies to calibrate to. Separately, if x_i = mu+e_i and the measuring instrument gives errors ~N(0,10) as in the post, it’s possible to get a CI entirely below the cutoff. This will happen some small percentage of the time randomly. If we know from other evidence that mu is guaranteed to be greater than the cutoff, then “truncation” will imply the true mu is in the empty set (the intersection of the CI and the interval greater than the cutoff). Is that answer acceptable to you? Mayo seems to indicate it is, and that I’m “stamping [my] feet” over it.

Mayo, for P(mu|A) to do its job it has to faithfully reflect what A says about mu. If it doesn’t, the distribution is “wrong”. If A says “it’s possible mu is less than c” but P(mu|A) says “mu must be bigger than c” then the distribution is bad. P(mu|A) is contradicting what A has to say about mu. That’s the philosophical origin of the ‘test’ and it in no way requires some extra Bayesian ingredient.

Even if it did, in what sense could this secretly be “Error Statistical” when it involves assigning probabilities to hypotheses and uses distributions which aren’t frequency distributions in any way? (This is not a rhetorical question. If everything else is ignored, please answer this one.)

**john byrd**

From Alan: “Therefore, a prior is only ‘good’ if the inferences drawn from it are true to within the implied uncertainties. That’s how Bayesian priors on mu are ‘tested’ even though the prior is modeling the uncertainty in a single value of mu rather than the frequency of multiple mu’s. You simply compare the inferences from the prior and see if it’s consistent with the prior information.”

I understand that a Bayesian model– like any model– can be validated by estimating error probabilities that will result from applications of it. That is a good thing and a saving grace. But, consider this need for validation in the context of the toy example of the couch measurement, and it becomes very clear why Mark’s answer was correct, and my suggestion to stick to the CI because a laser transit has its own error makes practical sense for scientists trying to solve problems. If you get a CI with most likely values of mu below 3, you will likely end up having to revise your prior following attempts to validate…

It seems very improbable to me that you can follow the protocol of validating a Bayesian model against real data and end up sharply divergent from the CI in a case like that. If you gain advantage by validation in that you obtain more data, then the CI can also be narrowed with the additional data. Two paths to the same end point?

**rasmusab**

I actually believe most people that do Bayesian data analysis (those you call Bayesians) actually use convenience priors (such as default priors, or reference priors). And I think that’s fine, as long as you know that you are using a convenience prior. Just like most people use convenience models (like linear regression), it’s quick and easy and hopefully works ok most of the time.

It’s a different case if you were to choose a convenience prior and then stick to it whatever happens. That would be like sticking to linear regression without ever questioning the model assumptions. And that would be questionable.

A useful way of thinking about priors is just as “part of the model”. Just like the assumption of linearity is part of the model, and has to be justified, the priors are also part of the model and have to be justified. But sometimes you use linear regression because you have no better option, and sometimes you use convenience priors because you haven’t figured out something better.

What I meant with the Rubin/Jaynes approach was a very pragmatic approach to Bayesian data analysis, like the one described here, for example: http://projecteuclid.org/euclid.aos/1176346785

**Matloff:**

I’m replying to rasmusab, who had replied to me.

You and I of course agree on the conditions under which the Bayesians’ famous “the prior eventually washes out” claim fails. But my point was that the Bayesians don’t put an asterisk on that famous slogan, which is why I said the Bayesian approach is not quite as advertised. That’s a really big deal to me.

And more importantly, we’re not talking about some rare case here. On the contrary, the excellent book *Bayesian Ideas and Data Analysis*, one of whose authors is my former colleague Wes Johnson (a really smart guy and a leading Bayesian), is chock full of examples of priors that assume bound(s) on θ.

The examples in that book — and in every other book I know of on the Bayesian method — show that many, indeed most, Bayesians set up priors exactly in the way you believe that the vast majority don’t: Their priors are chosen, as you say, because “it feels right.” Of course, they also often choose “convenient” priors because they lead to nice posterior distributions, making the priors even more questionable.

I’m not familiar with the Rubin/Jaynes approach. A quick Web search seems to indicate it is aimed at performing “What if?” analyses. I have no problem with that at all (providing, as always, that the ultimate consumers of the analyses are aware of the nature of what was done).

**john byrd**

Alan: It appears that you employ circular reasoning. The prior is to be corrected through “experience” unless it is to be taken as a certainty before application? Makes no sense. This is what I call the self-licking ice cream cone approach to Bayesian philosophy. Establish a prior, take it as meaningful, sell it to others unless the model does not work. If the model performs poorly, change the prior, call it prior information anyway, then repeat process.

You say: “If we know from other evidence that mu is guaranteed to be greater than the cutoff, then ‘truncation’ will imply the true mu is in the empty set.” So, you say we must accept the prior as more important than the data. And also: “Therefore, a prior is only ‘good’ if the inferences drawn from it are true to within the implied uncertainties. That’s how Bayesian priors on mu are ‘tested’ even though the prior is modeling the uncertainty in a single value of mu rather than the frequency of multiple mu’s. You simply compare the inferences from the prior and see if it’s consistent with the prior information.” It appears the latter approach of testing to correct the prior is most reasonable. The latter approach would correct the prior to avoid the empty set.

So, you are faced with a scenario where IF you are willing to allow that your prior is subject to revision when faced with reality, then your Bayesian model will gravitate to the CI solution. Or, you can simply not test it. But then it becomes religion not science.

And, it appears to me that validating a model by comparing its predictions to reality to measure its performance is precisely seeking to minimize error probabilities. Seems obvious to me. I am puzzled that you do not think so.

John: You bring out a good point: they have to assume something like the single mu that is responsible for the current data itself having been randomly selected from a population of mus. That’s a sample of size 1. We wouldn’t reject a statistical hypothesis on the basis of a sample of size one. So, it’s not clear they can be seen as getting error probabilities, which require a sampling distribution. We’re never just interested in fitting this case, the error probabilities are used to assess the overall capacity of the method to have resulted in erroneous interpretations of data.

And of course, there’s the problem of distinguishing between violated assumptions, like iid, and a violated prior. I note this in my remarks on Gelman and Shalizi’s paper.

But rasmuab, you are ignoring the key point: One can use the data to assess the propriety of frequentist models, as linearity of a regression function, but one can NOT do that in the (subjective) Bayesian case. In Bayesian settings, since one has only a single realization of θ at hand, one can’t estimate the distribution of θ to verify the prior.

All this changes in the empirical Bayes case. Then there is a real distribution for θ, and one’s model for that distribution can be verified as in any other frequentist method — because it IS a frequentist method. For instance, Fisher’s linear discriminant analysis (or for that matter logistic regression) is used without raising an eyebrow, even though it is an empirical Bayes method.

I skimmed through the first few pages of the Rubin paper (thanks for the interesting link), and immediately noticed that his very first example, on law school grades, uses an empirical Bayes approach, not a subjective one, which makes it frequentist.

Feb 17 addition:

I had grabbed the last handful of comments (excluding most of mine) but didn’t mean to exclude anyone who made remarks on the new topic (of truncation), so here was Mark’s remark to Alan’s initial concern about truncation:

Alan, let me get this straight. Your example involves a case where there’s a hard physical constraint on the mean being greater than 3, but no such physical constraint on individual observations? The only possible way to get a CI that lies almost entirely below the cutoff is to have the vast majority of values lying below the cutoff. What’s a Bayesian to do in this case, stamp his feet and say “no, no, no, the mean must be constrained to be greater than 3, so I’ll put the vast majority of my weight on my prior” (that is, acknowledge that the data are noisy and so essentially throw them out)? I’d love to see a Bayesian analysis where a) there is a physical constraint on the mean being greater than 3, b) almost all of the data are sufficiently lower than the cutoff *such that the standard frequentist CI was almost entirely below the cutoff*, and c) the final inference was not based almost exclusively on the prior. If your answer is that your final inference in this case would be essentially the prior, then I frankly don’t see anything less absurd in your approach than claiming that (3, 3.00001) is a reasonable CI. It’s the same argument, as far as I’m concerned, they’re equally concocted.

Now, if there truly is a physical cutoff, such that both the mean and realized values are required to be above this cutoff, then there is a very simple frequentist approach to incorporate this background information. Do a transformation like log(X-3). No need to truncate, your entire CI will be in the required range.
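Mark's transformation suggestion can be sketched in code. This is my own illustration with made-up data (lognormal noise above the cutoff 3), not anything from the thread; note that the back-transformed interval is for 3 + exp(E[log(X − 3)]), a geometric-mean-type quantity, rather than the arithmetic mean itself.

```python
# Sketch of the log(X - 3) trick: a t-interval on the transformed scale,
# back-transformed, lies entirely above the physical cutoff by construction.
# The data-generating choices here are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = 3.0 + rng.lognormal(mean=0.0, sigma=0.5, size=50)   # every value exceeds 3

y = np.log(x - 3.0)                                     # transform
lo, hi = stats.t.interval(0.95, df=len(y) - 1,
                          loc=y.mean(), scale=stats.sem(y))
ci = (3.0 + np.exp(lo), 3.0 + np.exp(hi))               # back-transform
print(ci)   # both endpoints are above 3, with no ad hoc truncation
```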

Filed under: Discussion continued, Statistics ]]>

**February is a good time to read or reread these pages from Popper’s Conjectures and Refutations**. Below are (a) some of my newer reflections on Popper after rereading him in the graduate seminar I taught one year ago with Aris Spanos (Phil 6334), and (b) my slides on the problem of induction and some notes on Popper.

As is typical in rereading any deep philosopher, I discover (or rediscover) different morsels of clues to understanding—whether fully intended by the philosopher or a byproduct of their other insights, and a more contemporary reading. So it is with Popper. A couple of key ideas to emerge from the seminar discussion (my slides are below) are:

- Unlike the “naïve” empiricists of the day, Popper recognized that observations are not just given unproblematically, but also require an interpretation, an interest, a point of view, a problem. What came first, a hypothesis or an observation? Another hypothesis, if only at a lower level, says Popper. He draws the contrast with Wittgenstein’s “verificationism”. In typical positivist style, the verificationist sees observations as the given “atoms,” and other knowledge is built up out of truth functional operations on those atoms.[1] However, scientific generalizations beyond the given observations cannot be so deduced, hence the traditional philosophical problem of induction isn’t solvable. One is left trying to build a formal “inductive logic” (generally deductive affairs, ironically) that is thought to capture intuitions about scientific inference (a largely degenerating program). The formal probabilists, as well as philosophical Bayesianism, may be seen as descendants of the logical positivists–instrumentalists, verificationists, operationalists (and the corresponding “isms”).
**So understanding Popper throws a great deal of light on current day philosophy of probability and statistics.**

- The fact that observations must be interpreted opens the door to interpretations that prejudge the construal of data. With enough interpretive latitude, anything (or practically anything) that is observed can be interpreted as in sync with a general claim H. (Once you opened your eyes, you see confirmations everywhere, as with a gestalt conversion, as Popper put it.) For Popper, positive instances of a general claim H, i.e., observations that agree with or “fit” H, do not even count as evidence for H if virtually any result could be interpreted as according with H.

* Note a modification of Popper here*: Instead of putting the “riskiness” on H itself, it is the method of assessment or testing that bears the burden of showing that something (ideally quite a lot) has been done in order to scrutinize the way the data were interpreted (to avoid “verification bias”). The scrutiny needs to ensure that it would be difficult (rather than easy) to get an accordance between data x and H (as strong as the one obtained) if H were false (or specifiably flawed).

* Note the second modification of Popper* that goes along with the first: It isn’t that GTR opened itself to literal “refutation” (as Popper says), because even if true, a positive result could scarcely be said to follow, or even to have been expected in 1919 (or long afterward). (Poor fits, at best, were expected.) So failing to find the “predicted” phenomenon (the Einstein deflection effect) would not falsify GTR. There were too many background explanations for observed anomalies (Duhem’s problem). This is so even though observing a deflection effect

[1] The verificationist’s view of meaning: the meaning of a proposition is its method of verification. Popper contrasts his problem of demarcating science and non-science from this question of “meaning”. Were the verificationist’s account of meaning used as a principle of “demarcation” it would be both too narrow and too wide. (see Popper).

[2] For discussion of background theories in the early eclipse tests, see EGEK chapter 8:

For more contemporary experiments, see my discussion in *Error and Inference*: http://www.phil.vt.edu/dmayo/personal_website/ch%201%20mayo%20theory.pdf

NOTE: I have a “no pain philosophy” 3-part tutorial (very short) on Popper on this blog. If you search under that, you’ll find it. Questions are welcome.

**Problem of Induction & some Notes on Popper**

Filed under: Phil 6334 class material, Popper, Statistics ]]>

**Here’s a quick note on something that I often find in discussions on tests, even though it treats “power”, which is a capacity-of-test notion, as if it were a fit-with-data notion…..**

1. **Take a one-sided Normal test T+ with n iid samples:**

**H_{0}: µ ≤ 0 against H_{1}: µ > 0**

σ = 10, n = 100, σ/√n = σ_{x} = 1, α = .025.

So the test rejects H_{0} iff Z > c_{.025} = 1.96 (1.96 is the “cut-off”).

~~~~~~~~~~~~~~

**Simple rules for alternatives against which T+ has high power:**

- If we add σ_{x} (here 1) to the cut-off (here, 1.96), we are at an alternative value for µ that test T+ has .84 power to detect.
- If we add 3σ_{x} to the cut-off, we are at an alternative value for µ that test T+ has ~.999 power to detect. This value can be written as µ^{.999} = 4.96.
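The two power benchmarks can be checked directly; a quick sketch using scipy, with σ_{x} = 1 as in the setup above:

```python
# Power of T+ against an alternative mu', with sigma_xbar = 1 and
# cutoff c_.025 = 1.96: Power(mu') = Pr(Z > 1.96; mu = mu').
from scipy.stats import norm

cut = norm.ppf(0.975)                       # 1.96
power = lambda mu: 1 - norm.cdf(cut - mu)

print(round(power(cut + 1), 2))             # cut-off + 1*sigma_xbar -> 0.84
print(round(power(cut + 3), 3))             # cut-off + 3*sigma_xbar -> 0.999
```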

Let the observed outcome just reach the cut-off to reject the null, z_{0} = 1.96.

If we were to form a “likelihood ratio” of μ = 4.96 compared to μ_{0} = 0 using

[Power(T+, 4.96)]/α,

it would be 40 (.999/.025).

It is absurd to say the alternative 4.96 is supported 40 times as much as the null, understanding support as likelihood or comparative likelihood. (The data 1.96 are even closer to 0 than to 4.96. The same point can be made with less extreme cases.) What is commonly done next is to assign priors of .5 to the two hypotheses, yielding

Pr(H_{0} |z_{0}) = 1/ (1 + 40) = .024, so Pr(H_{1} |z_{0}) = .976.

Such an inference is highly unwarranted and would almost always be wrong.
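For contrast, here is a sketch computing both the [Power/α] ratio above and the genuine likelihood ratio at the specific observed outcome z_{0} = 1.96 (σ_{x} = 1 throughout):

```python
# The "ratio" formed from power and alpha, versus the actual likelihood
# ratio of mu = 4.96 to mu = 0 at the specific outcome z0 = 1.96.
from scipy.stats import norm

alpha, z0, mu1 = 0.025, 1.96, 4.96
power = 1 - norm.cdf(z0 - mu1)                 # ~ .999

print(round(power / alpha))                    # ~ 40: the bogus "support" ratio
print(round(1 / (1 + power / alpha), 3))       # the ~ .024 "posterior" on H0

lr = norm.pdf(z0, loc=mu1) / norm.pdf(z0, loc=0)
print(round(lr, 3))                            # ~ 0.076: z0 actually favors mu = 0
```

The genuine likelihood ratio is well below 1: the outcome z_{0} = 1.96 fits μ = 0 roughly 13 times better than μ = 4.96, the opposite of what the [Power/α] ratio suggests.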

~~~~~~~~~~~~~~

**How could people think it plausible to compute a comparative likelihood this way?**

I have been thinking about this for a while because it’s ubiquitous throughout criticisms of error statistical testing, and it comes from a plausible comparativist likelihood position (which I do not hold), namely that data are better evidence for μ than for μ’ if μ is more likely than μ’ given the data. I’m guessing they’re reasoning as follows:

- The probability is very high that z > 1.96 under the assumption that μ = 4.96.
- The probability is low that z > 1.96 under the assumption that μ = μ_{0} = 0.
- We’ve observed z_{0} = 1.96 (so you’ve observed z > 1.96).
- Therefore, μ = 4.96 makes the observation more probable than does μ = 0.
- Therefore the outcome is (comparatively) better evidence for μ = 4.96 than for μ = 0.

But the “outcome” in a likelihood must be the specific outcome observed, and the comparative appraisal of which hypothesis accords better with the data only makes sense when one keeps to this. Power against μ’ concerns the capacity of a test to have produced a larger difference, under μ’. (It refers to all of the outcomes that could have been generated.)

~~~~~~~~~~~~~~

**That’s not at all how power works.**

The result is that power works in the opposite way! If there’s a high probability you should have observed a larger difference than you did, assuming the data came from a world where μ = μ’, then the data indicate you’re *not* in a world where μ is as high as μ’. In fact:

if Pr(Z > z_{0}; μ = μ’) = high, then Z = z_{0} is strong evidence that μ < μ’!

Rather than being evidence *for* μ’, the statistically significant result is evidence *against* μ being as high as μ’.

~~~~~~~~~~~~~~

**Stephen Senn**

Stephen Senn (2007, p. 201) has correctly said that the following is “nonsense”:

“[U]pon rejecting the null hypothesis, not only may we conclude that the treatment is effective but also that it has a clinically relevant effect.”

Now the test is designed to have high power to detect a clinically relevant effect (usually .8 or .9). I happen to have chosen an extremely high power (.999) but the claim holds for any alternative that the test has high power to detect. The clinically relevant discrepancy, as he describes it, is one “we should not like to miss”, but obtaining a statistically significant result is not evidence we’ve found a discrepancy that big. (See also Senn’s post here.)

Supposing that it is, is essentially to treat the test as if it were:

H_{0}: μ ≤ 0 vs H_{1}: μ > 4.96

This, he says, is “ludicrous” as it:

“would imply that we knew, before conducting the trial, that the treatment effect is either zero or at least equal to the clinically relevant difference. But where we are unsure whether a drug works or not, it would be ludicrous to maintain that it cannot have an effect which, while greater than nothing, is less than the clinically relevant difference.” (Senn, 2007, p. 201)

The same holds with H_{0}:μ = 0 as null.

If anything, it is the lower confidence limit that we would look at to see what discrepancies from 0 are warranted. The lower .975 limit (if one-sided) or .95 (if two-sided) would be 0 and .3, respectively. So we would be warranted in inferring from z_{0}:

μ > 0 or μ > .3.

~~~~~~~~~~~~~~

**What does the severe tester say?**

In sync with the confidence interval, she would say SEV(μ > 0) = .975 (if one-sided), and would also note some other benchmarks, e.g., SEV(μ > .96) = .84.

Equally important for her is a report of what is poorly warranted. In particular the claim that the data indicate

μ > 4.96

*would be wrong over 99% of the time!*

Of course, I would want to use the actual result, rather than the cut-off for rejection (as with power) but the reasoning is the same, and here I deliberately let the outcome just hit the cut-off for rejection.
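The severe tester’s benchmarks are easy to recompute. A sketch, again under my assumption of a unit-standard-error Normal statistic with cutoff 1.96: severity for the claim μ > μ₁, given a just-significant z₀, is Pr(Z ≤ z₀; μ = μ₁).

```python
from math import erf, sqrt

def Phi(x):
    """Standard Normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def sev_greater(mu1, z0=1.96, se=1.0):
    """Severity for the claim mu > mu1, given a just-significant z0:
    the probability of a result no larger than z0, were mu only mu1."""
    return Phi((z0 - mu1) / se)

print(round(sev_greater(0.0), 3))   # 0.975 -- mu > 0 is well warranted
print(round(sev_greater(0.96), 2))  # 0.84  -- a further benchmark
print(round(sev_greater(4.96), 3))  # 0.001 -- mu > 4.96 is poorly warranted:
                                    #          wrong over 99% of the time
```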

~~~~~~~~~~~~~~

**The (type 1, 2 error probability) trade-off vanishes**

Notice what happens if we consider the “real type 1 error” as Pr(H_{0} | z_{0}).

Since Pr(H_{0} |z_{0}) decreases with increasing power, it decreases with decreasing type 2 error. So we know that to identify “type 1 error” and Pr(H_{0} |z_{0}) is to use language in a completely different way than the one in which power is defined. For there we must have a trade-off between type 1 and 2 error probabilities.
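A toy calculation makes the point. Suppose one treats size and power as the likelihoods of z₀ under the two point hypotheses and (purely for illustration) gives them equal priors; the numbers below are mine, not anything in the testing framework:

```python
def pr_H0_given_z0(alpha, power):
    """'Posterior' on H0 when size and power are (mis)used as the
    likelihoods of z0, with equal priors on the two point hypotheses.
    Illustration only -- not a quantity error statistics endorses."""
    return alpha / (alpha + power)

alpha = 0.025  # type 1 error probability, held fixed throughout
for power in (0.50, 0.80, 0.999):  # type 2 error = 1 - power, shrinking
    print(power, round(pr_H0_given_z0(alpha, power), 3))
# The quantity falls as power rises, i.e. as the type 2 error probability
# falls -- while the genuine type 1 error probability stays fixed at .025.
# No trade-off appears, so this is a different notion from the type 1
# error probability with which power genuinely trades off.
```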

The conclusion is that using size and power as likelihoods is a bad idea for anyone who wants to assess the comparative evidence by likelihoods. It’s true that the error statistician is not in the business of making inferences to point values, nor to comparative appraisals of different point hypotheses (much less do we wish to be required to assign priors to the point hypotheses). Criticisms often start out forming these ratios and then blaming the “tail areas” for exaggerating the evidence against. We don’t form those ratios. My point here, though, is that this gambit serves very badly for a Bayes ratio or likelihood assessment.

**Likelihood is a “fit” measure, “power” is not. (Power is a “capacity” measure.)**

~~~~~~~~~~~~~~

Send any corrections, I was just scribbling this….

This is related to my “no headache power for Deirdre” post, and several posts having to do with allegations that p-values overstate the evidence against the null hypothesis, such as this one.

Senn, S. (2007), *Statistical Issues in Drug Development*. Wiley.

Filed under: Bayesian/frequentist, law of likelihood, Statistical power, statistical tests, Statistics, Stephen Senn

**Stephen Senn**

Head, Methodology and Statistics Group,

Competence Center for Methodology and Statistics (CCMS), Luxembourg

**Is Pooling Fooling?**

‘And take the case of a man who is ill. I call two physicians: they differ in opinion. I am not to lie down, and die between them: I must do something.’ Samuel Johnson, in Boswell’s *A Journal of a Tour to the Hebrides*

A common dilemma facing meta-analysts is what to put together with what? One may have a set of trials that seem to be approximately addressing the same question but some features may differ. For example, the inclusion criteria might have differed with some trials only admitting patients who were extremely ill but with other trials treating the moderately ill as well. Or it might be the case that different measurements have been taken in different trials. An even more extreme case occurs when different, if presumed similar, treatments have been used.

It is helpful to make a point of terminology here. In what follows I shall be talking about pooling results from various trials. This does not involve naïve pooling of patients across trials. I assume that each trial will provide a valid within-trial comparison of treatments. It is these comparisons that are to be pooled (appropriately).

A possible way to think of this is in terms of a Bayesian model with a prior distribution covering the extent to which results might differ as features of trials are changed. I don’t deny that this is sometimes an interesting way of looking at things (although I do maintain that it is much more tricky than many might suppose[1]) but I would also like to draw attention to the fact that there is a frequentist way of looking at this problem that is also useful.

Suppose that we have *k* ‘null’ hypotheses that we are interested in testing, each capable of being tested in one of *k* trials. We can label these H_{n1}, H_{n2}, … H_{nk}. We are perfectly entitled to test the null hypothesis H_{joint} that they are all jointly true. In doing this we can use appropriate judgement to construct a composite statistic based on all the trials whose distribution is known under the null. This is a justification for pooling.

Of course, how we choose to pool is a matter of skill, judgement, experience and statistical know-how. In the Neyman-Pearson framework it would depend on some suitable and possibly quite complicated alternative hypothesis. In the Fisherian framework the choice is more direct. Either way we can construct a composite statistic based on all the trials and test H_{joint}.

What we have to be careful about, however, is choosing what hypothesis we are entitled to assert if H_{joint} is rejected. Rejection of H_{joint} does not entitle us to regard each of H_{n1}, H_{n2}, … H_{nk} as rejected. We are entitled to assert that at least one is rejected.
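One standard way of constructing such a composite statistic is inverse-variance (fixed-effect) weighting of the within-trial comparisons. A minimal sketch with made-up estimates and standard errors — nothing here beyond the idea of one composite test of H_joint comes from Senn’s text:

```python
from math import erf, sqrt

def Phi(x):
    """Standard Normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def pooled_test(estimates, ses):
    """Inverse-variance pooling of k within-trial treatment comparisons,
    yielding one composite z-statistic and a two-sided p-value for H_joint."""
    weights = [1 / s**2 for s in ses]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = 1 / sqrt(sum(weights))
    z = pooled / pooled_se
    return pooled, 2 * (1 - Phi(abs(z)))

# Hypothetical within-trial estimates (say, log hazard ratios) and their SEs:
est, p = pooled_test([0.40, 0.10, 0.35], [0.20, 0.15, 0.25])
# A small p rejects H_joint -- i.e. at least one of the k nulls is false.
# It does not license asserting that each of H_n1, ..., H_nk is rejected.
```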

The issue at stake is well illustrated by a famous meta-analysis of rofecoxib[2] carried out by Juni et al in 2004. They pooled a number of studies comparing rofecoxib to various comparators (naproxen, other non-steroidal anti-inflammatory drugs or placebo) and concluded that it was possible to decide that there was an increased risk of adverse cardiovascular events compared to placebo prior to 2004, when Merck pulled the drug off the market. In a subsequent commentary Kim and Reicin, two scientists working for Merck, protested that pooling comparators like this violated general principles of meta-analysis[3].

However, the discussion above shows that both were wrong. There is no general principle requiring the pooling of like with like but on the other hand it is not logical, having pooled unlike with unlike, to conclude that a treatment that is not identical to all comparators must be different from each and every one[4].

In fact, sometimes, a pooled meta-analysis is properly regarded as a step to looking further. In a much-cited paper that presented, amongst other matters, a meta-analysis of the effect of cholesterol-lowering treatment on risk of ischaemic heart disease, Simon Thompson[5] was able to show heterogeneity of effect amongst the trials and that this was ascribable to a number of factors; included amongst them was the type of treatment given.
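The conventional first check for heterogeneity of this sort is Cochran’s Q: the weighted squared deviations of each trial’s estimate from the fixed-effect pooled estimate, referred to a chi-squared distribution on k − 1 degrees of freedom. A sketch with invented numbers (the statistic is standard; the data are mine):

```python
def cochran_Q(estimates, ses):
    """Cochran's Q heterogeneity statistic: under homogeneity of the true
    trial effects it is approximately chi-squared with k - 1 df."""
    weights = [1 / s**2 for s in ses]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return sum(w * (e - pooled)**2 for w, e in zip(weights, estimates))

# Hypothetical within-trial estimates and standard errors:
Q = cochran_Q([0.40, 0.10, 0.35], [0.20, 0.15, 0.25])
# Compare Q against chi-squared with 2 df (95th percentile ~5.99); a large
# Q suggests the trials differ by more than sampling error, inviting
# investigation of trial-level factors such as the type of treatment given.
```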

Of course, the fact that one may pool different treatments does not mean that this is always wise. My experience is that those who have worked in drug development are very reluctant to pool different formulations, let alone molecules without careful consideration, whereas those who have not are less so. A trial I worked on nearly twenty years ago showed (to high precision) a relative potency of four to one between two dry-powder formulations[6] of the same drug.

Pooling doses might be suitable for some purposes but not others. For instance, if there were no difference between several doses of a treatment and placebo as regards side-effects, one might take this as reassuring regarding the lowest dose but it would be quite unacceptable as a proof of safety of the highest. Similarly if in such a pooling there was a definite benefit compared to placebo one might take this as showing the efficacy of the highest dose but not the lowest. Such judgements would be based upon the presumed monotonicity of the dose response. However in either of these cases the pooled analysis (if performed at all) would probably be taken as a starting point for investigation with attempts to follow (depending on numbers available) to say something about individual doses.

It is interesting to note that fashions regarding pooling of treatments are rapidly changing as network meta-analysis (see[7] for an example) is becoming much more popular. Such analyses use comparisons within trials as a means of connecting treatments but maintain distinctions between different treatments.

Considering the case of pooling different populations where the treatments are otherwise identical raises different issues. Here the very problem raised by pooling calls into question the interpretation of a single trial. Consider the case where two trials are run in asthma. One specifies that patients should be aged 18-65 and the other aged 65-75. Let us call the first group *non elderly adults* and the second *elderly*. By pooling them in a meta-analysis we are testing the hypothesis that there is no difference between the effects of the treatments in either group. A rejected hypothesis then implies a difference in effect in at least one group.

The issue this raises is that any trial can be regarded as containing subgroups that might have formed the object of separate study. For example, we could have run a single trial which included patients aged 18-75. Clearly it would be absurd to suggest, if analysis shows a difference between treatments, that *therefore* there is a difference for all patients of any age: *non elderly adults* and also *elderly*.

It might be supposed that this means that there can never be any justification in using any treatment because we can always imagine some further subdivision of the patients. However, this ignores the necessity of choice. This is where the relevance of Johnson’s remark quoted at the beginning comes in. Consider a case where A has been compared to B in a trial or set of trials involving many different types of patient: young, old, male, female, severely ill, moderately ill and so forth. The fact that the mean effect of B is better than A does not prove that it is better than A for every patient. But consider this: if nothing else is known, however much you might doubt whether B really was better for a given patient than A, it would be perverse to use this as a reason for recommending A, given that A was on average worse than B. Whatever your doubts about B for this patient, your doubts about A would be higher.

In much of the recent discussion about subgroups in clinical trials, some of it driven by regulators, I think that this point has been overlooked. One could say that, having established reasonably precisely the average effect of a treatment, this then becomes, if not the new null hypothesis, then at least a base hypothesis for future action. In my view the further investigation of subgroups then becomes a project amongst many possible projects. If it can be realistically done cheaply in a way that permits useful inferences, so be it. If not, it should be regarded as competing for resources with other projects, perhaps involving other treatments altogether. The question then is ‘does it make the cut?’

**Declaration of interest**

I consult regularly for the pharmaceutical industry. A full declaration of interest is maintained here http://www.senns.demon.co.uk/Declaration_Interest.htm

**References**

1. Senn, S.J., *Trying to be precise about vagueness.* Statistics in Medicine, 2007. **26**: p. 1417-1430.
2. Juni, P., et al., *Risk of cardiovascular events and rofecoxib: cumulative meta-analysis.* Lancet, 2004. **364**(9450): p. 2021-9.
3. Kim, P.S. and A.S. Reicin, *Discontinuation of Vioxx.* Lancet, 2005. **365**(9453): p. 23; author reply 26-7.
4. Senn, S.J., *Overstating the evidence: double counting in meta-analysis and related problems.* BMC Medical Research Methodology, 2009. **9**: p. 10.
5. Thompson, S.G., *Systematic Review – Why Sources of Heterogeneity in Meta-analysis Should Be Investigated.* British Medical Journal, 1994. **309**(6965): p. 1351-1355.
6. Senn, S.J., et al., *An incomplete blocks cross-over in asthma: a case study in collaboration*, in *Cross-over Clinical Trials*, J. Vollmar and L.A. Hothorn, Editors. 1997, Fischer: Stuttgart. p. 3-26.
7. Senn, S., et al., *Issues in performing a network meta-analysis.* Statistical Methods in Medical Research, 2013. **22**(2): p. 169-189.

Filed under: evidence-based policy, PhilPharma, S. Senn, Statistics

*Saturday Night Brainstorming: The TFSI on NHST–part reblog from here and here, with a substantial 2015 update!*

*Each year leaders of the movement to “reform” statistical methodology in psychology, social science, and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like to see adopted, not just by the APA publication manual any more, but all science journals! ***Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers.**

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*Frustrated that the TFSI has still not banned null hypothesis significance testing (NHST)–a fallacious version of statistical significance tests that dares to violate Fisher’s first rule: It’s illicit to move directly from statistical to substantive effects–**the New Reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not? *

*Most recently, the group has helped successfully launch a variety of “replication and reproducibility projects”. Having discovered how much the reward structure encourages bad statistics and gaming the system, they have cleverly pushed to change the reward structure: Failed replications (from a group chosen by a crowd-sourced band of replicationistas ) would not be hidden in those dusty old file drawers, but would be guaranteed to be published without that long, drawn out process of peer review. Do these failed replications indicate the original study was a false positive? or that the replication attempt is a false negative? It’s hard to say. *

*This year, as is typical, there is a new member who is pitching in to contribute what he hopes are novel ideas for reforming statistical practice. In addition, for the first time, there is a science reporter blogging the meeting for her next freelance “bad statistics” piece for a high-impact science journal. Notice, it seems this committee only grows; no one has dropped off in the 3 years I’ve followed them. *

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

** Pawl**: This meeting will come to order. I am pleased to welcome our new member, Dr. Ian Nydes, adding to the medical strength we have recently built with epidemiologist S.C. In addition, we have a science writer with us today, Jenina Oozo. To familiarize everyone, we begin with a review of old business, and gradually turn to new business.

** Franz**: It’s so darn frustrating after all these years to see researchers still using NHST methods; some of the newer modeling techniques routinely build on numerous applications of those pesky tests.

** Jake**: And the premier publication outlets in the social sciences still haven’t mandated the severe reforms sorely needed. Hopefully the new blood, Dr. Ian Nydes, can help us go beyond resurrecting the failed attempts of the past.

** Marty**: Well, I have with me a comprehensive 2012 report by M. Orlitzky that observes that “NHST is used in 94% of all articles in the Journal of Applied Psychology….Similarly, in economics, reliance on NHST has actually increased rather than decreased after McCloskey and Ziliak’s (1996) critique of the prevalence of NHST in the

** Dora**: Oomph! Maybe their articles made things worse; I’d like to test if the effect is statistically real or not.

** Pawl**: Yes, that would be important. But, what

** Franz**: Already tried. Rozeboom 1997, page 335. Very, very similar phrasing also attempted by many, many others over 50 years. All failed. Darn.

** Jake**: Didn’t it kill to see all the attention p-values got with the 2012 Higgs Boson discovery? P-value policing by Lindley and O’Hagan (to use a term from the Normal Deviate) just made things worse.

** Pawl**: Indeed! And the machine’s back on in 2015. Fortunately, one could see the physicist’s analysis in terms of frequentist confidence intervals.

** Nayth**: As the “non-academic” Big Data member of the TFSI, I have something brand new: explain that “frequentist methods–in striving for immaculate statistical procedures that can’t be contaminated by the researcher’s bias–keep him hermetically sealed off from the real world.”

** Gerry**: Declared by Nate Silver 2012, page 253. Anyway, we’re not out to condemn all of frequentist inference, because then our confidence intervals go out the window too! Let alone our replication gig.

** Marty**: It’s a gig that keeps on giving.

** Dora**: I

** Nayth**: Well here’s a news flash, Dora: I’ve heard the same thing about Ziliak and McCloskey’s book!

** Gerry**: If we can get back to our business, my diagnosis is that these practitioners are suffering from a psychological disorder; their “mindless, mechanical behavior” is very much “reminiscent of compulsive hand washing.” It’s that germaphobia business that Nayth just raised. Perhaps we should begin to view ourselves as Freudian analysts who empathize with the “the anxiety and guilt, the compulsive and ritualistic behavior foisted upon” researchers. We should show that we understand how statistical controversies are “projected onto an ‘intrapsychic’ conflict in the minds of researchers”. It all goes back to that “hybrid logic” attempting “to solve the conflict between its parents by denying its parents.”

** Pawl**: Oh My, Gerry! That old Freudian metaphor scarcely worked even after Gigerenzer popularized it. 2000, pages 283, 280, and 281.

** Gerry**: I thought it was pretty good, especially the part about “denying its parents”.

** Dora**: I like the part about the “compulsive hand washing”. Cool!

** Jake**: Well, we need a fresh approach, not redundancy, not repetition. So how about we come right out with it: “What’s wrong with NHST? Well, … it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it” tells us what we want to know, because we want to know what we want…

** Dora**:

** Pawl**: She’s right, oh my: “I suggest to you that Sir Ronald has befuddled us, mesmerized us…. [NHST] is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology.” Merely refuting the null hypothesis is too weak to corroborate substantive theories, “we have to have ‘Popperian risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon called a

** Gerry**:

** Marty**: Quite unlike Meehl, some of us deinstitutionalizers and cultural organizational researchers view Popper as not a hero but as the culprit. No one is alerting researchers that “NHST is the key statistical technique that puts into practice hypothetico-deductivism, the scientific inference procedure based on Popper’s falsifiability criterion. So, as long as the [research] community is devoted to hypothetico-deductivism, NHST will likely persist”. Orlitzky 2012, 203. Rooting Popper out is imperative, if we’re ever going to deinstitutionalize NHST.

** Jake**: You want to ban Popper too? Now you’re really going to scare people off our mission.

** Nayth**: True, it’s not just Meehl who extols Popper. Even some of the most philosophical of statistical practitioners are channeling Popper. I was just reading an on-line article by Andrew Gelman. He says:

“At a philosophical level, I have been persuaded by the arguments of Popper (1959), … and others that scientific revolutions arise from the identification and resolution of anomalies. In statistical terms, an anomaly is a misfit of model to data (or perhaps an internal incoherence of the model), and it can be identified by a (Fisherian) hypothesis test without reference to any particular alternative (what Cox and Hinkley 1974 call “pure significance testing”). ….At the next stage, we see science—and applied statistics—as resolving anomalies via the creation of improved models …. This view corresponds closely to the error-statistics idea of Mayo (1996)” (Gelman 2011, 70)

** Jake**: Of course Popper’s prime example of non falsifiable science was Freudian/Adlerian psychology which gave psychologist Paul Meehl conniptions because he was a Freudian as well as a Popperian. I’ve always suspected that’s one reason Meehl castigated experimental psychologists who could falsify via P-values, and thereby count as scientific (by Popper’s lights) whereas he could not. At least not yet.

** Gerry**: Maybe for once we should set the record straight: “It should be recognized that, according to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter cannot be established on the basis of one single experiment but requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions.”

**Pawl**: That was tried by Gigerenzer in “The Inference Experts” (1989, 96).

** S.C**: This is radical but maybe p-values should just be used as measures of observed fit, and all inferential uses of significance testing banned.

** Franz:** But then you give up control of error probabilities. Instead of nagging about bans and outlawing, I say we try a more positive approach: point out how meta-analysis “means that cumulative understanding and progress in theory development is possible after all.”

*(Franz stands. Chest up, chin out, hand over his heart): *

“It means that the behavioral and social sciences can attain the status of true sciences: they are not doomed forever to the status of quasi-sciences or pseudoscience. ..[T]he gloom, cynicism, and nihilism that have enveloped many in the behavioral and social sciences is lifting. Young people starting out in the behavioral and social sciences today can hope for a much brighter future.”

** Pawl**: My! That was incredibly inspiring Franz.

** Dora**: Yes, really moving, only …

** Gerry**: Only problem is, Schmidt’s already said it, 1996, page 123.

* S.C.*: How’s that meta-analysis working out for you social scientists, huh? Is the gloom lifting?

** Franz**: It was until we saw the ‘train wreck looming’ (after Diederik Stapel)…but now we have replication projects.

** Nayth**: From my experience as a TV pundit, I say just tell everyone how bad data and NHST are producing “‘statistically significant’ (but manifestly ridiculous) findings” like how toads can predict earthquakes. You guys need to get on the Bayesian train.

** Marty**: Is it leaving? Anyway, this is in Nathan Silver’s 2012 book, page 253. But I don’t see why it’s so ridiculous, I’ll bet it’s true. I read that some woman found that all the frogs she had been studying every day just up and disappeared from the pond just before that quake in L’Aquila, Italy, in 2009.

** Dora**: I’m with Marty. I really, really believe animals pick up those ions from sand and pools of water near earthquakes. My gut feeling is its very probable. Does our epidemiologist want to jump in here?

** S.C.**: Not into the green frog pool I should hope! Nyuk! Nyuk! But I do have a radical suggestion that no one has so far dared to utter.

** Dora**: Oomph! Tell, tell!

** S.C.**: “Given the extensive misinterpretations of frequentist statistics and the enormous (some would say impossible) demands made by fully Bayesian analyses, a serious argument can be made for de-emphasizing (if not eliminating) inferential statistics in favor of more data presentation… Without an inferential ban, however, an improvement of practice will require re-education, not restriction”.

** Marty**: “Living With P-Values,” Greenland and Poole in a 2013 issue of

** Pawl**: I just had a quick look, but their article appears to resurrect the same-old same-old: P-values are or have to be (mis)interpreted as posteriors, so here are some priors to do the trick. Of course their reconciliation between P-values and the posterior probability (with weak priors) that you’ve got the wrong directional effect (in one sided tests) was shown long ago by Cox, Pratt, others.

** Franz**: It’s a neat trick, but it’s unclear how this reconciliation advances our goals. Historically, the TFSI has not pushed the Bayesian line. We in psychology have enough trouble being taken as serious scientists.

** Nayth**: Journalists must be Bayesian because,let’s face it, we’re all biased, and we need to be up front about it.

** Jenina**: I know I’m here just to take notes for a story, but I believe Nate Silver made this pronouncement in his 10 or 11 point list in his presidential address to the Joint Statistical Meetings in 2013. Has it caught on?

** Nayth**: Well, of course I’d never let the writers of my on-line data-driven news journal introduce prior probabilities into their articles, are you kidding me? No way! We’ve got to keep it objective, push randomized-controlled trials, you know, regular statistical methods–you want me to lose my advertising revenue?

** Dora**: Do as we say, not as we do– when it’s really important to us, and our cost-functions dictate otherwise.

** Gerry**: As Franz observes, the TFSI has not pushed the Bayesian line because we want people to use confidence intervals (CIs). Anything tests can do CIs do better.

** S.C.**: You still need tests. Notice you don’t use CIs alone in replication or fraudbusting.

** Dora**: No one’s going to notice we use the very methods we are against when we have to test if another researcher did a shoddy job. Mere mathematical quibbling I say.

* Pawl: *But it remains to show how confidence intervals can ensure ‘Popperian risk’ and ‘severe tests’. We may need to supplement CIs with some kind of severity analysis [as in Mayo] discussed in her blog. In the Year of Statistics, 2013, we promised we’d take up the challenge at long last, but have we? No. I’d like to hear from our new member, Dr. Ian Nydes.

** Ian**: I have a new and radical suggestion, and coming from a doctor, people will believe it: prove mathematically that “the high rate of non replication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research finds solely on the basis of a single study assessment by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values….

*It can be proven that most claimed research findings are false”*

** S.C.**: That was, word for word, in John Ioannidis’ celebrated (2005) paper: “Why Most Published Research Findings Are False.” I and my colleague have serious problems with this alleged “proof”.

** Ian**: Of course it’s Ioannidis, and many before him (e.g., John Pratt, Jim Berger), but it doesn’t matter. It works. The best thing about it is that many frequentists have bought it–hook, line, and sinker–as a genuine problem for

* Jake: *Ioannidis is to be credited for calling wide attention to some of the terrible practices we’ve been on about since at least the 1960s (if not earlier). Still, I agree with S.C. here: his “proof” works only if you basically assume researchers are guilty of all the sins that we’re trying to block: turning tests into dichotomous “up or down” affairs, p-hacking, cherry-picking and you name it. But as someone whose main research concerns power analysis, the thing I’m most disturbed about is his abuse of power in his computations.

** Dora**: Oomph! Power, shpower! Fussing over the correct use of such mathematical concepts is just, well, just mathematics; while surely the “Lord’s work,” it doesn’t pay the rent, and Ioannidis has stumbled on a gold mine!

* Pawl: *Do medical researchers really claim “conclusive research finds solely on the basis of a single study assessment by formal statistical significance”? Do you mean to tell me that medical researchers are actually engaging in worse practices than I’ve been chastising social science researchers about for decades?

**Marty**: I love it! We’re not the bottom of the scientific barrel after all–medical researchers are worse!

** Dora**: And they really “are killing people”, unlike that wild exaggeration by Ziliak and McCloskey (2008, 186).

* Gerry: *Ioannidis (2005) is a wild exaggeration. But my main objection is that it misuses frequentist probability. Suppose you’ve got an urn filled with hypotheses; they can be from all sorts of fields, and let’s say 10% of them are true, the rest false. In an experiment involving randomly selecting a hypothesis from this urn, the probability of selecting one with the property “true” is 10%. But now suppose I pick out H': “Ebola can be contracted by sexual intercourse”, or any hypothesis you like. It’s mistaken to say that Pr(H’) = 0.10.

** Pawl**: It’s a common fallacy of probabilistic instantiation, like saying a particular 95% confidence interval estimate has probability .95.

** S.C.**: Only it’s worse, because we know the probabilities the hypotheses are true are very different from what you get in following his “Cult of the Holy Spike” priors.

** Dora**: More mathematical nit-picking. Who cares? My loss function says they’re irrelevant.

** Ian**: But Pawl, if all you care about is long-run screening probabilities, and posterior predictive values (PPVs), then it’s correct; the thing is, it works! It’s got everyone all upset.

** Nayth**: Not everyone, we Bayesians love it. The best thing is that it’s basically a Bayesian computation but with frequentist receiver operating curves!

** Dora**: It’s brilliant! No one cares about the delicate mathematical nit-picking. Ka-ching! (That’s the sound of my cash register, I’m an economist, you know.)

** Pawl**: I’ve got to be frank, I don’t see how some people, and I include some people in this room, can disparage the scientific credentials of significance tests and yet rely on them to indict other people’s research, and even to insinuate questionable research practices (QRPs) if not out and out fraud (as with Smeesters, Forster, Anil Potti, many others). Folks are saying that in the past we weren’t so hypocritical.

(silence)

** Gerry**: I’ve heard others raise Pawl’s charge. They’re irate that the hard core reformers nowadays act like p-values can’t be trusted except when used to show that p-values can’t be trusted. Is this something our committee has to worry about?

** Marty**: Nah! It’s a gig!

** Jenina**: I have to confess that this is a part of my own special “gig”. I’ve a simple 3-step recipe that automatically lets me publish an attention-grabbing article whenever I choose.

** Nayth**: Really? What are they? (I love numbered lists.)

** Jenina**: It’s very simple:

Step #1: Non-replication: a story of some poor fool who thinks he’s on the brink of glory and fame when he gets a single statistically significant result in a social psychology experiment. Only then it doesn’t replicate! (Priming studies work well, or situated cognition).

Step #2: A crazy, sexy example where p-values are claimed to support a totally unbelievable, far-out claim. As a rule I take examples about sex (as in this study on penis size and voting behavior)–perhaps with a little evolutionary psychology twist thrown in to make it more “high brow”. (Appeals to ridicule about Fisher or Neyman-Pearson give it a historical flair, while still keeping it fun.) Else I choose something real spo-o-ky, like ESP. (If I run out of examples, I take what some of the more vocal P-bashers are into, citing them of course, and getting extra brownie points.) Then, it’s a cinch to say “this stuff’s so unbelievable, if we just use our brains we’d know it was bunk!” (I guess you could call this my Bayesian cheerleading bit.)

Step #3: Ioannidis’ proof that most published research findings are false, illustrated with colorful charts.

**Nayth**: So it's:

-Step #1: non-replication,

-Step #2: an utterly unbelievable but sexy "chump effect" [ii] that someone, somewhere has found statistically significant, and

-Step #3: Ioannidis (2005).

**Jenina** (breaks into hysterical laughter): Yes, and sometimes,…(laughing so hard she can hardly speak)…sometimes the guy from Step #1 (who couldn't replicate, and so couldn't publish, his result) goes out and joins the replication movement and gets to publish his non-replication, without the hassle of peer review. (Ha ha!)

(General laughter, smirking, groans, or head shaking)

*[Nayth: (Aside to Jenina) Your Step #2 is just like my toad example, except that turned out to be somewhat plausible.]*

**Dora**: I love the three-step cha-cha-cha! It's a win-win strategy: non-replication, chump effect, Ioannidis ("most research findings are false").

**Marty**:

**Pawl**: I'm more interested in a 2014 paper Ioannidis jointly wrote with several others on reducing research waste.

**Gerry**: I move we place it on our main reading list for our 2016 meeting and then move we adjourn to drinks and dinner. I’ve got reservations at a 5 star restaurant, all covered by TFSI Foundation.

**Jake**: I second. All in favor?

**All**:

**Pawl**: Adjourned. Doubles of Elbar Grease for all!

**S.C.**: Isn't that Deborah Mayo's special concoction?

**Pawl**: Yes, most of us, if truth be known, are closet or even open error statisticians!*

***This post, of course, is a parody or satire (statistical satirical); all quotes are authentic as cited. Send any corrections.**

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Parting Remark: Given Fisher's declaration, when first setting out tests, to the effect that isolated results are too cheap to be worth having; Cox's (1982) insistence donkey's years ago that "It is very bad practice to summarise an important investigation solely by a value of P"; and a million other admonishments against statistical fallacies and lampoons, I'm starting to think that a license should be procured (upon passing a severe test) before being permitted to use statistical tests of any kind.

My own conception is an inferential reformulation of Neyman-Pearson statistics in which one uses error probabilities to infer discrepancies that are well or poorly warranted by given data. (It dovetails also with Fisherian tests, as seen in Mayo and Cox 2010, using essentially the P-value distribution for sensitivity rather than attained power.) Some of the latest movements to avoid biases and selection effects, to ban hunting, cherry picking, and multiple testing, and to promote controlled trials and attention to experimental design are all to the good. They get their justification from the goal of avoiding corrupt error probabilities. Anyone who rejects the inferential use of error probabilities is hard pressed to justify the strenuous efforts to sustain them.

These error probabilities, on which confidence levels and severity assessments are built, are very different from PPVs and similar computations that arise in the context of screening, say, thousands of genes. My worry is that the best of the New Reforms, in failing to make clear the error-statistical basis for their recommendations, and in confusing screening with the evidential appraisal of particular statistical hypotheses, will fail to halt some of the deepest and most pervasive confusions and fallacies about running and interpreting statistical tests.
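The contrast drawn here between screening PPVs and per-test error probabilities can be made concrete with a small numerical sketch (my own illustration, not from the post; the prevalence, power, and test parameters below are all hypothetical):

```python
from math import sqrt
from statistics import NormalDist

# Screening computation (Ioannidis-style): among statistically significant
# results, what fraction have a false null? Depends on an assumed prevalence
# of true effects across the body of tests -- not a property of any one test.
alpha, power, prior = 0.05, 0.80, 0.10   # hypothetical: 10% of tested nulls are false

ppv = (power * prior) / (power * prior + alpha * (1 - prior))
print(f"PPV = {ppv:.2f}")   # most "findings" true only if the prior is favorable

# By contrast, a severity assessment attaches to a particular inference from
# particular data. Hypothetical example: one-sided Normal test of mu <= 0 vs
# mu > 0, sigma = 1, n = 25, observed mean 0.5. Severity for "mu > 0.2" is
# the probability of a result less impressive than the one observed, were
# mu only 0.2: SEV = P(X-bar < 0.5; mu = 0.2).
n, sigma, xbar, mu1 = 25, 1.0, 0.5, 0.2
sev = NormalDist().cdf((xbar - mu1) / (sigma / sqrt(n)))
print(f"SEV(mu > {mu1}) = {sev:.2f}")
```

With these made-up numbers the PPV is 0.64 while the severity for the modest discrepancy mu > 0.2 is about 0.93; the point of the sketch is only that the two computations answer different questions, one about a population of screened hypotheses, the other about what a given data set warrants.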


[i] References here are to Popper, 1977, 1962; Mayo, 1991, 1996, Salmon, 1984.

[ii] From "*The Chump Effect: Reporters are credulous, studies show*," by Andrew Ferguson.

“Entire journalistic enterprises, whole books from cover to cover, would simply collapse into dust if even a smidgen of skepticism were summoned whenever we read that “scientists say” or “a new study finds” or “research shows” or “data suggest.” Most such claims of social science, we would soon find, fall into one of three categories: the trivial, the dubious, or the flatly untrue.”

(selected) REFERENCES:

Cohen, J. (1994). The Earth is round (p < .05). *American Psychologist*, 49, 997-1003.

Cox, D. R. (1958). Some problems connected with statistical inference. *Annals of Mathematical Statistics* 29 : 357-372.

Cox, D. R. (1977). The role of significance tests. (With discussion). *Scand. J. Statist*. 4 : 49-70.

Cox, D. R. (1982). Statistical significance tests. *Br. J. Clinical. Pharmac.* 14 : 325-331.

Gigerenzer, G. et al. (1989), *The Empire of Chance*, Cambridge: CUP.

Gigerenzer, G. (2000), "The Superego, the Ego, and the Id in Statistical Reasoning," in *Adaptive Thinking: Rationality in the Real World*, OUP.

Gelman, A. (2011), “Induction and Deduction in Bayesian Data Analysis,” *RMM* vol. 2, 2011, 67-78. Special Topic: Statistical Science and Philosophy of Science: where do (should) they meet in 2011 and beyond?

Greenland, S. and Poole, C. (2013), "Living with P Values: Resurrecting a Bayesian Perspective on Frequentist Statistics," *Epidemiology* 24: 62-8.

Ioannidis, J. P. A. (2005), "Why Most Published Research Findings Are False," *PLoS Medicine* 2(8): e124.

Mayo, D. G. (2012). “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations”, *Rationality, Markets, and Morals (RMM)* 3, Special Topic: Statistical Science and Philosophy of Science, 71–107.

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in *The Second Erich L. Lehmann Symposium: Optimality*, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.

Mayo, D. G. and Spanos, A. (2011) “Error Statistics” in *Philosophy of Statistics , Handbook of Philosophy of Science* Volume 7 *Philosophy of Statistics*, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.

McCloskey, D. N., & Ziliak, S. T. (1996). The standard error of regression. *Journal of Economic Literature*, 34(1), 97-114.

Meehl, P. E. (1990), "Why Summaries of Research on Psychological Theories Are Often Uninterpretable," *Psychological Reports*, 66, 195-244.

Meehl, P. and Waller, N. (2002), "The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude," *Psychological Methods*, Vol. 7: 283–300.

Orlitzky, M. (2012), “How Can Significance Tests be Deinstitutionalized?” *Organizational Research Methods* 15(2): 199-228.

Popper, K. (1962). *Conjectures and Refutations. *NY: Basic Books.

Popper, K. (1977). *The Logic of Scientific Discovery, *NY: Basic Books. (Original published 1959)

Rozeboom, W. (1997), “Good Science is Abductive, not hypothetico-deductive.” In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), *What if there were no significance tests?* (pp. 335-391). Hillsdale, NJ: Lawrence Erlbaum.

Salmon, W. C. (1984). *Scientific Explanation and the Causal Structure of the World, *Princeton, NJ: Princeton.

Schmidt, F. (1996), "Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers," *Psychological Methods*, Vol. 1(2): 115-129.

Silver, N. (2012), *The Signal and the Noise*, Penguin.

Ziliak, S. T., & McCloskey, D. N. (2008), *The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives*. Ann Arbor: University of Michigan Press. (For a short piece, see "The Cult of Statistical Significance" from the Section on Statistical Education, *JSM* 2009.)

Filed under: Comedy, reforming the reformers, science communication, Statistical fraudbusting, statistical tests, Statistics Tagged: criticism of frequentist methods, NHST, power, reformers, significance tests, Sir Karl Popper, test ban

**MONTHLY MEMORY LANE: 3 years ago: January 2012.** I mark in **red** three posts that seem most apt for general background on key issues in this blog.

**January 2012**

- (1/3) Model Validation and the LLP-(Long Playing Vinyl Record)
- (1/8) Don't Birnbaumize that Experiment my Friend*
- (1/10) Bad-Faith Assertions of Conflicts of Interest?*
- (1/13) U-PHIL: "So you want to do a philosophical analysis?"
- (1/14) **"You May Believe You are a Bayesian But You Are Probably Wrong"** (Extract from Senn RMM article)
- (1/15) **U-Phil: Mayo Philosophizes on Stephen Senn (0): "How Can We Cultivate Senn's-Ability?"**
- (1/17) "Philosophy of Statistics": Nelder on Lindley
- (1/19) RMM-6 Special Volume on Stat Sci Meets Phil Sci (Sprenger)
- (1/22) **U-Phil: Stephen Senn (1)**: C. Robert, A. Jaffe, and Mayo (brief remarks)
- (1/23) **U-Phil: Stephen Senn (2)**: Andrew Gelman
- (1/24) **U-Phil (3)**: Stephen Senn on Stephen Senn!
- (1/26) Updating & Downdating: One of the Pieces to Pick up
- (1/29) **No-Pain Philosophy**: Skepticism, Rationality, Popper, and All That: First of 3 Parts

This new once-a-month feature began at the blog's 3-year anniversary in Sept. 2014. I will count U-Phils on a single paper as one of the three I highlight (else I'd have to choose between them). I will comment on 3-year-old posts from time to time.

This Memory Lane needs a bit of explanation. This blog began largely as a forum to discuss a set of contributions from a conference I organized (with A. Spanos and J. Miller*), "Statistical Science and Philosophy of Science: Where Do (Should) They Meet?" at the London School of Economics, Center for the Philosophy of Natural and Social Science (CPNSS), in June 2010 (where I am a visitor). Additional papers grew out of conversations initiated soon after (with Andrew Gelman and Larry Wasserman). The conference site is here. My reflections in this general arena (Sept. 26, 2012) are here.

As articles appeared in a special topic of the on-line journal *Rationality, Markets and Morals (RMM)*, edited by Max Albert[i]—also a conference participant—I would announce an open invitation to readers to take a couple of weeks to write an extended comment. Each "U-Phil"—which stands for "U philosophize"—was a contribution to this activity. I plan to go back to that exercise at some point. Generally I would give a "deconstruction" of the paper first, followed by U-Phils, and then the author gave responses to the U-Phils and me as they wished. You can readily search this blog for all the U-Phils and deconstructions.**

I was also keeping a list of issues that we either haven't taken up, or need to return to. One example here is: Bayesian updating and downdating. Further notes about the origins of this blog are here. **I recommend everyone reread Senn's paper.**

**For newcomers, here's your chance to catch up; for old timers, this is philosophy: rereading is essential!**

[i] Along with Hartmut Kliemt and Bernd Lahno.

*For a full list of collaborators, sponsors, logisticians, and related collaborations, see the conference page. The full list of speakers is found there as well.

**The U-Phil exchange between Mayo and Senn was published in the same special topic of RMM. But I still wish to know how we can cultivate "Senn's-ability." We could continue that activity as well, perhaps.

**Previous 3 YEAR MEMORY LANES:**

Dec. 2011

Nov. 2011

Oct. 2011

Sept. 2011 (Within "All She Wrote (so far)")

Filed under: 3-year memory lane, blog contents, Statistics, Stephen Senn, U-Phil

**Trial in Medical Research Scandal Postponed**

By Jay Price

DURHAM, N.C. — A judge in Durham County Superior Court has postponed the first civil trial against Duke University by the estate of a patient who had enrolled in one of a trio of clinical cancer studies that were based on bogus science.

The case is part of what the investigative TV news show “60 Minutes” said could go down in history as one of the biggest medical research frauds ever.

The trial had been scheduled to start Monday, *but several attorneys involved contracted flu*. Judge Robert C. Ervin hasn’t settled on a new start date, but after a conference call with him Monday night, attorneys in the case said it could be as late as this fall.

**Flu? Don’t these lawyers get flu shots? Wasn’t Duke working on a flu vaccine? Delaying til Fall 2015?**

The postponement delayed resolution in the long-running case for the two patients still alive among the eight who filed suit. It also prolonged a lengthy public relations headache for Duke Medicine that has included retraction of research papers in major scientific journals, the embarrassing segment on “60 Minutes” and the revelation that the lead scientist had falsely claimed to be a Rhodes Scholar in grant applications and credentials.

Because it’s not considered a class action, the eight cases may be tried individually. The one designated to come first was brought by Walter Jacobs, whose wife, Julie, had enrolled in an advanced stage lung cancer study based on the bad research. She died in 2010.

“We regret that our trial couldn’t go forward on the scheduled date,” said Raleigh attorney Thomas Henson, who is representing Jacobs. “As our filed complaint shows, this case goes straight to the basic rights of human research subjects in clinical trials, and we look forward to having those issues at the forefront of the discussion when we are able to have our trial rescheduled.”

It all began in 2006 with research led by a young Duke researcher named Anil Potti. He claimed to have found genetic markers in tumors that could predict which cancer patients might respond well to what form of cancer therapy. The discovery, which one senior Duke administrator later said would have been a sort of Holy Grail of cancer research if it had been accurate, electrified other scientists in the field.

Then, starting in 2007, came the three clinical trials aimed at testing the approach. These enrolled more than 100 lung and breast cancer patients, and were eventually expected to enroll hundreds more.

Duke shut them down permanently in 2010 after finding serious problems with Potti’s science.

Now some of the patients – or their estates, since many have died from their illnesses – are suing Duke, Potti, his mentor and research collaborator Dr. Joseph Nevins, and various Duke administrators. The suit alleges, among other things, that they had engaged in a systematic plan to commercially develop cancer tests worth billions of dollars while using science that they knew or should have known to be fraudulent.

The latest revelation in the case, based on documents that emerged from the lawsuit and first reported in the Cancer Letter, a newsletter that covers cancer research issues, is that a young researcher working with Potti had alerted university officials to problems with the research data two years before the experiments on the cancer patients were stopped.

The whistleblower, Brad Perez, is now finishing up a medical residency at Duke. Perez declined to be interviewed, but responded by email that the issues with the research led him to quit working with Potti, though that cost him an extra year in medical school.

“In the course of my work in the Potti lab, I discovered what I perceived to be problems in the predictor models that made it difficult for me to continue working in that environment,” he wrote. “I raised my concerns with my laboratory peers, laboratory supervisors and medical school administrators. I chose to take an additional year to complete medical school in order to have a more successful research experience.”

In an emailed statement in response to questions about the case, Michael Schoenfeld, Duke’s vice president for public affairs and government relations, said Perez had passed his concerns about the lab through proper channels at Duke, and that the resulting review didn’t find research misconduct.

Since then, though, Perez’s concerns have been fully appreciated and recognized, Schoenfeld wrote.

“We can say with great confidence that any concerns like this received today would be handled very differently,” Schoenfeld wrote.

**Really? What would they do differently?**

“Despite his experience in Dr. Potti’s lab, we’re pleased that Dr. Perez elected to complete his medical education and research training at Duke, and is currently completing his residency in radiation oncology at Duke.”

Through his Raleigh attorney, Dan McLamb, Potti declined an interview, citing the pending court action. Potti now works at another cancer clinic, in Grand Forks, N.D.

**No surprise he’s still practicing. Remind me of where not to go.**

Potti, Nevins and various collaborators published studies in major research journals based on Potti’s findings beginning in 2006. But researchers elsewhere couldn’t reproduce their results and quickly began to raise questions. In particular, two biostatisticians at MD Anderson Cancer Center in Houston, Keith Baggerly and Kevin Coombes, brought problems they found to the attention Duke officials and began questioning the research publicly.

In 2009 Duke suspended the enrollment of new patients and commissioned an outside review. But the reviewers reported that Potti’s work seemed fine, and Duke rebooted the trials. University leaders later said those reviewers hadn’t looked at the basic data Potti had used.

Only after the Cancer Letter, which has followed the case closely for years, published a report in 2010 saying that Potti had falsely claimed a Rhodes Scholarship in grant applications and elsewhere did Duke’s official support for the research finally began to crumble. It again suspended new enrollments and ended the studies.

Outside scientists who raised questions about the research said they were most worried about the prospect that patients were being put at risk by their participation in the clinical trials. They said the unproven genetic analysis could result in patients being prescribed an improper treatment.

**The following sounds like doublespeak:**

Duke has maintained, though, that the patients received proper care.

“The criticism in the lawsuit is not related to the high quality of care this patient received,” Schoenfeld wrote in his statement. “While the science behind the genomic predictor used in the trials was ultimately found to involve falsified data, a key factor in the approval of the trial protocols provided that every patients would receive standard of care therapy for their disease whether or not the predictor ultimately proved to be useful.”

**First, the patients were promised "a personalized cancer regimen" custom-tailored for their tumor; second, the last sentence is incomprehensible. Third, some of them were apparently getting the less effective treatment due to data mix-ups. Realize that these "treatments" also involved additional surgeries for purposes of the clinical trial only. Please correct me if I'm mistaken.**

Regardless of which treatment patients in the clinical trials received, it was considered a best one for treating their disease, he wrote.

**“A” best one?**

The lawsuit charges, among other things, that in grant applications for the clinical trials that Potti intentionally lied, and included false and fraudulent information about the research results. Nevins, as his research supervisor, and Duke should have known what was wrong, the lawsuit says, because biostatisticians from MD Anderson and others had made numerous attempts to call attention to flaws in the science.

The suit also charges that the clinical trials began after Duke had been “placed on notice” of the flawed underlying science and suggests that the relationships among researchers and administrators were too cozy within the university for it to properly pursue questions about the research.

Duke has made substantive changes to prevent the problems brought to light in the Potti case from recurring, Schoenfeld said. These include better data management, new reviews of potential conflict of interest and improvements in handling reports related to the integrity of research.

“Many lessons learned from this situation have led to significant improvements in both basic and clinical research processes including many new and expanded programs related to scientific accountability, reporting of concerns related to research integrity, multiple improvements in data management and governance, and new scientific and conflict of interest review processes,” he wrote.

Duke put Potti on administrative leave in July 2010 after the charges about his credentials emerged. The next month, Duke announced that he had indeed padded his resume. Potti resigned six months later, and Nevins began the process of retracting the journal articles. Nevins retired from Duke in 2013.

In 2012, Potti accepted a reprimand from the North Carolina Medical Board. He remains licensed to practice in the state and in a few other states. According to records posted online by the North Carolina Medical Board, Potti had agreed to settlements in at least 11 malpractice cases against him, each resulting in a payment of at least $75,000.

Also, in a consent order negotiated with the medical board, Potti agreed to accept a formal reprimand for unprofessional conduct and admitted to having inaccurate information on his resume and in official Duke biographical sketches and to using those flawed credentials in research grant applications.

After leaving Duke, Potti worked awhile in a clinical role rather than in research at a cancer clinic in South Carolina. He was fired from that job after the “60 Minutes” segment aired, though the company he had worked for there, Coastal Cancer Center, said in a news release that his work had been exemplary.

The news release also said that the company had hired him after received glowing letters of recommendation from top medical officials at Duke.

While the trial may be delayed for months, the judge still is expected to hear motions in the case Thursday.

**Why months?**

**Note: I did not fix any of the ungrammatical parts of this news release.**

**Read the article here:**

**For background posts on this blog, please see:**

What have we learned from the Anil Potti training and test data fireworks? Part 1 (draft 2)

————————————————–

**The following is an excerpt from this week’s Cancer Letter on the issue of “No Harm done?”**

**http://www.cancerletter.com/articles/20150123_2**

No Harm Done? (1/23/15)

Duke's motions for a summary judgment argue that the case should turn on North Carolina law, as opposed to established ethical constructs.

In an effort to determine the burden of proof that has to be met by the plaintiff to demonstrate negligence per se, Duke's motion states that standards contained in the 1979 report by the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, known as the Belmont Report, don't create obligations under North Carolina law. Similarly, they argue that the federal law, Title 45, Part 46 of the Code of Federal Regulations, which sets out requirements for research institutions, is not a part of North Carolina law, either.

Duke basically states that it did nothing wrong.

“Plaintiffs cannot show that a different course of treatment would have made any difference in their care or chance of survival,” the Duke motion reads. “Expert testimony in this case has not established that any clinical trial available in the United States in 2010 would have prolonged plaintiffs’ life expectancy or treated them more effectively. Therefore, plaintiffs cannot meet causation of damage elements of their negligence per se claim.”

Another court filing deals specifically with the case of Juliet Jacobs, a patient with metastatic lung cancer who—with Potti’s knowledge—made a recording of the now disgraced doctor as he presented the trial to her. Juliet’s widower, Walter, is one of the plaintiffs.

Duke attorneys argue that in that specific instance, “these defendants did not abuse, breach, or take advantage of Mrs. Jacobs’s confidence or trust. Instead, they were open, fair, and honest with Mrs. Jacobs and her husband regarding her prognosis and treatment options. Mr. & Mrs. Jacobs were made aware that the clinical trial may increase, decrease or have no effect on Mrs. Jacobs’s likelihood of responding to chemotherapy. They were also encouraged to seek other treatment alternatives.”

Duke’s filings also hold that “the undisputed evidence in this case has established that there was no clinical trial or other treatment available in the United States in 2010 that would have cured Mrs. Jacobs’s cancer or prolonged her life expectancy. Plaintiff cannot show that a different course of treatment would have made any difference in Mrs. Jacobs’s chance of survival.”

Duke attorneys are not representing Potti, who was dismissed from the university. However, they are representing Nevins, the deans, the IRB chair and the spinoff company that was going to commercialize the Nevins-and-Potti inventions.

The defendants argue that the plaintiffs cannot prove “negligence per se” claims because they cannot show that there was “(1) a duty created by a statute or ordinance; (2) that the statute or ordinance was enacted to protect a class of persons which includes the plaintiff; (3) a breach of the statutory duty; (4) that the injury sustained was suffered by an interest which the statute protected; (5) that the injury was of the nature contemplated in the statute; and (6) that the violation of the statute proximally caused the injury.”

Plaintiffs argue that Duke is ultimately responsible for the actions of its scientists and administrators.

“Defendants admit that Dr. Potti fabricated, falsified and intentionally manipulated the data that formed the ‘basis for clinical trials’ in which Juliet Jacobs was enrolled,” one of the plaintiffs’ filings states. “Much of the… falsified, fabricated, and manipulated data came from the laboratory of Dr. Nevins, for which he was ultimately responsible. In fact, Dr. Nevins admitted one set of ‘intentionally altered’ data that came from his lab ‘provided support for the lung cancer trials…’

“Manipulating and fabricating the data for a clinical trial and then lying to a patient to obtain informed consent is a breach of good faith. It constitutes battery and invalidates informed consent. Dr. Potti is the physician who presented the informed consent to the plaintiffs. He is the one who falsified, fabricated and intentionally manipulated the data. He entered into a Consent Order with the North Carolina Medical Board admitting that he committed ‘unprofessional conduct.’ He admitted that there was a responsibility to tell the patients, including Juliet Jacobs, about the controversy with the medicine. Dr. Potti did not inform the Jacobs of either the ‘controversy’ or the fraud.”

Nevins acknowledges that he did not examine the data until October 2010, three months after this publication reported that Potti had misstated his credentials, claiming to have been a Rhodes Scholar, and after Potti was barred from Duke campus.

"Money, Fame and Overall Fortune"

Countering Duke's assertion that no one was injured because patients were assigned to standard therapy, the plaintiffs say that Juliet Jacobs was falsely led to accept a treatment regimen she would not have ordinarily considered.

**Imagine if she had been told the trials had been stopped on grounds of flawed data/bad models, and only recently renewed. Imagine if she’d seen the Perez letter or the Baggerly and Coombes articles. I can’t argue the legal subtleties, but it’s outrageous.**

The patient’s husband and daughter “testified to the exact opposite,” the filing reads. “Plaintiffs showed that Juliet and Walter Jacobs did not want standard of care chemotherapy and would not have participated if it had not been for the defendants’ fraud.”

The Duke protocol required a second biopsy and led the patient to a chemotherapy regimen that was more aggressive than she would have ordinarily chosen for end-of-life care.

“The second biopsy was not required for the alleged ‘standard of care’ chemotherapy—it was required for participation in the clinical trials,” the plaintiffs argue. “Defendants want to turn a lawsuit based upon personal injury into a wrongful death action. The question is not whether ‘standard of care chemotherapy’ was provided and whether or not the same caused her death. Instead, the question posed by the plaintiffs is whether or not the defendants’ actions caused a personal injury to Juliet and Walter Jacobs. Attempting to recast this as a wrongful death action…is a red herring thrown to distract the finder of fact.”

Most importantly, Juliet Jacobs was deceived, the plaintiffs’ attorneys argue.

“Because her quality of life was very important to her, if she had been given proper consent and told that there was no ‘silver bullet’ and if she had not been told by Dr. Potti that he could give her a chance to live for ten years, she and Walter would more likely than not have made other choices regarding how they spent her last days and what quality that life would have.”

An audio recording of the Jacobs meeting with Potti captures the doctor expressing hope for a miracle.

In the recording, Juliet Jacobs says that her son-in-law has had chemotherapy for a decade, and that he is the only survivor in a clinical trial.

Potti: “Wow. And I, I wouldn’t be surprised if I expect that from you. That’s what I mean. I’m 100 percent on board here, OK?”

Like other patients, Jacobs was presented with a consent form that contained the claim that the genomic predictor that would be used had the accuracy of approximately 80 percent.

Instead of going into hospice care, Juliet Jacobs ended up with a lot of toxicity and a quality of life her family members described as poor.

The date of the family’s meeting with Potti is important: Feb. 11, 2010, a month after Duke restarted the trials following an internal investigation that has since been shown to be cursory and skewed. That controversy was never mentioned to the prospective patient and her family.

Knowing what he knows now, Walter Jacobs is furious.

“I know that it’s an immoral, evil, awful thing that has been done,” he said in a deposition.

The plaintiffs also allege a “civil conspiracy.”

“The underlying conspiracy was among the defendants and Dr. Potti and Dr. Nevins, on behalf of themselves and on behalf of their outside financial interest, Cancer Guide, to cover up the falsification in order to continue the clinical trials. The successful conclusion of the clinical trials would have meant money, fame and overall fortune.”

Filed under: junk science, rejected post, Statistics Tagged: Anil Potti

Here's the follow-up to my last (reblogged) post, initially here. My take hasn't changed much from 2013. Should we be labeling some pursuits "for entertainment only"? Why not? (See also a later post on the replication crisis in psych.)

**^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^**

I had said I would label as pseudoscience or questionable science any enterprise that regularly permits the kind of ‘verification biases’ in the statistical dirty laundry list. How regularly? (I’ve been asked)

Well, surely if it’s as regular as, say, *much of* social psychology, it goes over the line. But it’s not mere regularity, it’s the nature of the data, the type of inferences being drawn, and the extent of self-scrutiny and recognition of errors shown (or not shown). The regularity is just a consequence of the methodological holes. My standards may be considerably more stringent than most, but quite aside from statistical issues, I simply do not find hypotheses well-tested if they are based on “experiments” that consist of giving questionnaires. At least not without a lot more self-scrutiny and discussion of flaws than I ever see. (There may be counterexamples.)

Attempts to recreate phenomena of interest in typical social science “labs” leave me with the same doubts. Huge gaps often exist between elicited and inferred results. One might locate the problem under “external validity” but to me it is just the general problem of relating statistical data to substantive claims.

Experimental economists (expereconomists) take lab results plus statistics to warrant sometimes ingenious inferences about substantive hypotheses. Vernon Smith (winner of the Nobel Prize in economics) is rare in subjecting his own results to “stress tests”. I’m not withdrawing the optimistic assertions he cites from EGEK (Mayo 1996) on Duhem-Quine (e.g., from “Rhetoric and Reality” 2001, p. 29). I’d still maintain, “Literal control is not needed to attribute experimental results correctly (whether to affirm or deny a hypothesis). Enough experimental knowledge will do”. But that requires piecemeal strategies that accumulate, and at least a little bit of “theory” and/or a decent amount of causal understanding.[1]

I think the generalizations extracted from questionnaires allow for an enormous amount of “reading into” the data. Suddenly one finds the “best” explanation. Questionnaires should be deconstructed for how they may be misinterpreted, not to mention how responders tend to guess what the experimenter is looking for. (I’m reminded of the current hoopla over questionnaires on breadwinners, housework and divorce rates!) I respond with the same eye-rolling to just-so story telling along the lines of evolutionary psychology.

I apply the “Stapel test”: Even if Stapel had bothered to actually carry out the data-collection plans that he so carefully crafted, I would not find the inferences especially telling in the least. Take for example the planned-but-not-implemented study discussed in the recent New York Times article on Stapel:

Stapel designed one such study to test whether individuals are inclined to consume more when primed with the idea of capitalism. He and his research partner developed a questionnaire that subjects would have to fill out under two subtly different conditions. In one, an M&M-filled mug with the word “kapitalisme” printed on it would sit on the table in front of the subject; in the other, the mug’s word would be different, a jumble of the letters in “kapitalisme.” Although the questionnaire included questions relating to capitalism and consumption, like whether big cars are preferable to small ones, the study’s key measure was the amount of M&Ms eaten by the subject while answering these questions….Stapel and his colleague hypothesized that subjects facing a mug printed with “kapitalisme” would end up eating more M&Ms.

Stapel had a student arrange to get the mugs and M&Ms and later load them into his car along with a box of questionnaires. He then drove off, saying he was going to run the study at a high school in Rotterdam where a friend worked as a teacher.

Stapel dumped most of the questionnaires into a trash bin outside campus. At home, using his own scale, he weighed a mug filled with M&Ms and sat down to simulate the experiment. While filling out the questionnaire, he ate the M&Ms at what he believed was a reasonable rate and then weighed the mug again to estimate the amount a subject could be expected to eat. He built the rest of the data set around that number. He told me he gave away some of the M&M stash and ate a lot of it himself. “I was the only subject in these studies,” he said.

He didn’t even know what a plausible number of M&Ms consumed would be! But never mind that, observing a genuine “effect” in this silly study would not have probed the hypothesis. Would it?

**II. Dancing the pseudoscience limbo: How low should we go?**

Should those of us serious about improving the understanding of statistics be expending ammunition on studies sufficiently crackpot to lead CNN to withdraw reporting on a resulting (published) paper?

“Last week CNN pulled a story about a study purporting to demonstrate a link between a woman’s ovulation and how she votes, explaining that it failed to meet the cable network’s editorial standards. The story was savaged online as “silly,” “stupid,” “sexist,” and “offensive.” Others were less nice.”

That’s too low down for me… (though it’s good for it to be in Retraction Watch). Even stooping down to the level of “The Journal of Psychological Pseudoscience” strikes me as largely a waste of time–for meta-methodological efforts at least. **January 25, 2015 note: Given the replication projects, and the fact that a meta-methodological critique of them IS worthwhile, this claim should be qualified. Remember this post was first blogged in June, 2013. **

I was hastily making these same points in an e-mail to A. Gelman just yesterday:

E-mail to Gelman: Yes, the idea that X should be published iff a p<.05 in an interesting topic is obviously crazy.

I keep emphasizing that the problems of design and of linking stat to substantive are the places to launch a critique, and the onus is on the researcher to show how violations are avoided. … I haven’t looked at the ovulation study (but this kind of thing has been done a zillion times) and there are a zillion confounding factors and other sources of distortion that I know were not ruled out. I’m prepared to abide such studies as akin to Zoltar at the fair [Zoltar the fortune teller]. Or, view it as a human interest story—let’s see what amusing data they collected, […oh, so they didn’t even know if women they questioned were ovulating]. You talk of top psych journals, but I see utter travesties in the ones you call top. I admit I have little tolerance for this stuff, but I fail to see how adopting a better statistical methodology could help them. …

Look, there aren’t real regularities in many, many areas–better statistics could only reveal this to an honest researcher. If Stapel actually collected data on M&M’s and having a mug with “Kapitalism” in front of subjects, it would still be B.S.! There are a lot of things in the world I consider crackpot. They may use some measuring devices, and I don’t blame those measuring devices simply because they occupy a place in a pseudoscience or “pre-science” or “a science-wannabe”. Do I think we should get rid of pseudoscience? Yes! [At least if they have pretensions to science, and are not described as “for entertainment purposes only”[2].] But I’m afraid this would shut down [or radically redescribe] a lot more fields than you and most others would agree to. So it’s live and let live, and does anyone really think it’s hurting honest science very much?
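The point in the e-mail can be illustrated with a quick simulation (my sketch, not anything from the original exchange): when there is no real regularity at all, roughly 1 in 20 studies will still clear p < .05 by chance, so a “publish iff p < .05” rule guarantees a steady stream of noise findings. Here both “groups” are drawn from the very same distribution, and significance is assessed with a simple two-sided z-test (a reasonable approximation to the t-test at n = 100 per group):

```python
import random
import statistics
from statistics import NormalDist

random.seed(1)

def fake_study(n=100):
    """Run one null 'study': both groups come from the SAME distribution,
    so any apparent effect is pure noise. Returns a two-sided p-value."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    se = (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Simulate a literature of 1000 null studies and count the "discoveries".
pvals = [fake_study() for _ in range(1000)]
hits = sum(p < 0.05 for p in pvals)
print(f"{hits} of 1000 null studies came out 'significant' at p < .05")
# expect roughly 50, i.e. about 5%
```

Of course, with the garden of forking paths (many outcome measures, subgroups, and analysis choices per study), the effective rate of spurious “discoveries” per published claim is far higher than this baseline 5%.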

There are fields like (at least parts of) experimental psychology that have been trying to get scientific by relying on formal statistical methods, rather than doing science. We get pretensions to science, and then when things don’t work out, they blame the tools. First, significance tests, then confidence intervals, then meta-analysis,…do you think these same people are going to get the cumulative understanding they seek when they move to Bayesian methods? Recall [Frank] Schmidt in one of my Saturday night comedies, rhapsodizing about meta-analysis:

“It means that the behavioral and social sciences can attain the status of true sciences: they are not doomed forever to the status of quasi-sciences or pseudoscience. …[T]he gloom, cynicism, and nihilism that have enveloped many in the behavioral and social sciences is lifting. Young people starting out in the behavioral and social sciences today can hope for a much brighter future.” (Schmidt 1996)

**III. Dale Carnegie salesman fallacy:**

It’s not just that bending over backwards to criticize the most blatant abuses of statistics is a waste of time. I also think dancing the pseudoscientific limbo too low has a tendency to promote its very own fallacy! I don’t know if it has a name, so I made one up. Carnegie didn’t mean this to be used fallaciously, but merely as a means to a positive sales pitch for an idea, call it H. You want to convince a person of H? Get them to say yes to a series of claims first, then throw in H and let them make the leap to accept H too. “You agree that the p-values in the ovulation study show nothing?” “Yes” “You agree that study on bicep diameter is bunk?” “Yes, yes”, and “That study on ESP—pseudoscientific, yes?” “Yes, yes, yes!” Then announce, “I happen to favor operational probalogist statistics (H)”. Nothing has been said to advance H, no reasons have been given that it avoids the problems raised. But all those yeses may well lead the person to say yes to H, and to even imagine an argument has been given. Dale Carnegie was a shrewd man.

Note: added Jan 24, 2015: You might be interested in the (brief) exchange between Gelman and me in the comments from the original post.

Of relevance was the later post on the replication crisis in psychology: http://errorstatistics.com/2014/06/30/some-ironies-in-the-replication-crisis-in-social-psychology-1st-installment/

[1] Vernon Smith ends his paper:

My personal experience as an experimental economist since 1956 resonates well with Mayo’s critique of Lakatos: “Lakatos, recall, gives up on justifying control; at best we decide—by appeal to convention—that the experiment is controlled. … I reject Lakatos and others’ apprehension about experimental control. Happily, the image of experimental testing that gives these philosophers cold feet bears little resemblance to actual experimental learning. Literal control is not needed to correctly attribute experimental results (whether to affirm or deny a hypothesis). Enough experimental knowledge will do. Nor need it be assured that the various factors in the experimental context have no influence on the result in question—far from it. A more typical strategy is to learn enough about the type and extent of their influences and then estimate their likely effects in the given experiment”. [Mayo EGEK 1996, 240]. V. Smith, “Method in Experiment: Rhetoric and Reality” 2001, 29.

My example in this chapter was linking statistical models in experiments on Brownian motion (by Brown).

[2] I actually like Zoltar (or Zoltan) fortune telling machines, and just the other day was delighted to find one in a costume store on 21st St.

Filed under: junk science, Statistical fraudbusting, Statistics

Alan