Monthly Archives: February 2020

R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)



This is a belated birthday post for R.A. Fisher (17 February 1890 – 29 July 1962); it’s a guest post from earlier on this blog by Aris Spanos.

Happy belated birthday to R.A. Fisher!

‘R. A. Fisher: How an Outsider Revolutionized Statistics’

by Aris Spanos

Few statisticians will dispute that R. A. Fisher (February 17, 1890 – July 29, 1962) is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics as modern model-based induction in a series of papers in the early 1920s. He put forward a theory of optimal estimation based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of optimal testing in the early 1930s. According to Hald (1998):

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the ultimate outsider when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however, did not stop him from publishing his first paper in statistics in 1912 (while still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that would eventually become the likelihood method of his 1921 paper.



After graduating from Cambridge he drifted into a series of jobs, including subsistence farming and teaching high school mathematics and physics, until his temporary appointment as a statistician at Rothamsted Experimental Station in 1919. During the period 1912-1919 his interest in statistics was driven by his passion for eugenics and a realization that his mathematical knowledge of n-dimensional geometry could be put to good use in deriving finite sample distributions for estimators and tests in the spirit of Gosset’s (1908) paper. Encouraged by his early correspondence with Gosset, he derived the finite sampling distribution of the sample correlation coefficient, which he published in 1915 in Biometrika, the only statistics journal at the time, edited by Karl Pearson. To put this result in its proper context: Pearson had been working on this problem for two decades and had published, with several assistants, more than a dozen papers on approximating the first two moments of the sample correlation coefficient; Fisher derived the relevant distribution, not just the first two moments.

Due to its importance, the 1915 paper provided Fisher’s first skirmish with the ‘statistical establishment’. Karl Pearson would not lightly accept being overrun by a ‘newcomer’. So, he prepared a critical paper with four of his assistants that became known as “the cooperative study”, questioning Fisher’s result as stemming from a misuse of Bayes’ theorem. He proceeded to publish it in Biometrika in 1917 without bothering to let Fisher know before publication. Fisher was furious at K. Pearson’s move and prepared his answer in a highly polemical style, which Pearson promptly refused to publish in his journal. Eventually, after tempering the style, Fisher was able to publish his answer in Metron, a brand-new statistics journal. As a result of this skirmish, Fisher pledged never to send another paper to Biometrika, and declared war on K. Pearson’s perspective on statistics. Fisher questioned not only Pearson’s method of moments, as giving rise to inefficient estimators, but also his derivation of the degrees of freedom of his chi-square test. Several highly critical published papers ensued.[i]

Between 1922 and 1930 Fisher did most of his influential work in recasting statistics, including publishing a highly successful textbook in 1925, but the ‘statistical establishment’ kept him ‘in his place’: a statistician at an experimental station. All his attempts to find an academic position, including a position in Social Biology at the London School of Economics (LSE), were unsuccessful (see Box, 1978, p. 202). Being turned down for the LSE position was not unrelated to the fact that the professor of statistics at the LSE was Arthur Bowley (1869-1957), second only to Pearson in the statistical high priesthood.[ii]

Coming of age as a statistician in 1920s England meant being awarded the Guy Medal in gold, silver or bronze, or at least receiving an invitation to present your work to the Royal Statistical Society (RSS). Despite his fundamental contributions to the field, Fisher’s invitation to the RSS would not come until 1934. To put that in perspective, Jerzy Neyman, his junior by some distance, was invited six months earlier! Indeed, one can make a strong case that the statistical establishment kept Fisher away for as long as they could get away with it. However, by 1933 they must have felt that they had to invite Fisher, after he accepted a professorship at University College, London. The position was created after Karl Pearson retired and the College decided to split his chair into a statistics position, which went to Egon Pearson (Pearson’s son), and a Galton professorship in Eugenics, which was offered to Fisher. To make it worse, Fisher’s offer came with a humiliating clause: he was forbidden to teach statistics at University College (see Box, 1978, p. 258); the father of modern statistics was explicitly told to keep his views on statistics to himself!

Fisher’s presentation to the Royal Statistical Society, on December 18th, 1934, entitled “The Logic of Inductive Inference”, was an attempt to summarize and explain his published work on recasting the problem of statistical induction since his classic 1922 paper. Bowley was (self?) appointed to move the traditional vote of thanks and open the discussion. After some begrudging thanks for Fisher’s ‘contributions to statistics in general’, he went on to disparage Fisher’s new approach to statistical inference based on the likelihood function, describing it as abstruse, arbitrary and misleading. His comments were predominantly sarcastic and discourteous, and went as far as to accuse Fisher of plagiarism for not acknowledging Edgeworth’s priority on the likelihood function idea (see Fisher, 1935, pp. 55-7). The litany of churlish comments continued with the rest of the old guard: Isserlis, Irwin and the philosopher Wolf (1935, pp. 57-64), who was brought in by Bowley to undermine Fisher’s philosophical discussion of induction. Jeffreys complained about Fisher’s criticisms of the Bayesian approach (1935, pp. 70-2).

To Fisher’s support came … Egon Pearson, Neyman and Bartlett. E. Pearson argued that:

“When these ideas [on statistical induction] were fully understood … it would be realized that statistical science owed a very great deal to the stimulus Professor Fisher had provided in many directions.” (Fisher, 1935, pp. 64-5)

Neyman too came to Fisher’s support, praising Fisher’s path-breaking contributions, and explaining Bowley’s reaction to Fisher’s critical review of the traditional view of statistics as an understandable attachment to old ideas (1935, p. 73).

Fisher, in his reply to Bowley and the old guard, was equally contemptuous:

“The acerbity, to use no stronger term, with which the customary vote of thanks has been moved and seconded … does not, I confess, surprise me. From the fact that thirteen years have elapsed between the publication, by the Royal Society, of my first rough outline of the developments, which are the subject of to-day’s discussion, and the occurrence of that discussion itself, it is a fair inference that some at least of the Society’s authorities on matters theoretical viewed these developments with disfavour, and admitted with reluctance. … However true it may be that Professor Bowley is left very much where he was, the quotations show at least that Dr. Neyman and myself have not been left in his company. … For the rest, I find that Professor Bowley is offended with me for “introducing misleading ideas”. He does not, however, find it necessary to demonstrate that any such idea is, in fact, misleading. It must be inferred that my real crime, in the eyes of his academic eminence, must be that of “introducing ideas”.” (Fisher, 1935, pp. 76-82)[iii]

In summary, the pioneering work of Fisher, later supplemented by that of Egon Pearson and Neyman, was largely ignored by the Royal Statistical Society (RSS) establishment until the early 1930s. By 1933 it was difficult to ignore their contributions, published primarily in other journals, and the ‘establishment’ of the RSS decided to display its tolerance of their work by creating ‘the Industrial and Agricultural Research Section’, under the auspices of which the papers by Neyman and Fisher were presented in 1934 and 1935, respectively.[iv]

In 1943, Fisher was offered the Balfour Chair of Genetics at the University of Cambridge. Recognition from the RSS came in 1946 with the Guy medal in gold, and he became its president in 1952-1954, just after he was knighted! Sir Ronald Fisher retired from Cambridge in 1957. The father of modern statistics never held an academic position in statistics!

You can read more in Spanos 2008 (below)

References

Bowley, A. L. (1902, 1920, 1926, 1937) Elements of Statistics, 2nd, 4th, 5th and 6th editions, Staples Press, London.

Box, J. F. (1978) The Life of a Scientist: R. A. Fisher, Wiley, NY.

Fisher, R. A. (1912), “On an Absolute Criterion for Fitting Frequency Curves,” Messenger of Mathematics, 41, 155-160.

Fisher, R. A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population,” Biometrika, 10, 507-21.

Fisher, R. A. (1921) “On the ‘probable error’ of a coefficient deduced from a small sample,” Metron 1, 2-32.

Fisher, R. A. (1922) “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society, A 222, 309-68.

Fisher, R. A. (1922a) “On the interpretation of χ² from contingency tables, and the calculation of P,” Journal of the Royal Statistical Society, 85, 87-94.

Fisher, R. A. (1922b) “The goodness of fit of regression formulae and the distribution of regression coefficients,”  Journal of the Royal Statistical Society, 85, 597–612.

Fisher, R. A. (1924) “The conditions under which χ² measures the discrepancy between observation and hypothesis,” Journal of the Royal Statistical Society, 87, 442-450.

Fisher, R. A. (1925) Statistical Methods for Research Workers, Oliver & Boyd, Edinburgh.

Fisher, R. A. (1935) “The logic of inductive inference,” Journal of the Royal Statistical Society 98, 39-54, discussion 55-82.

Fisher, R. A. (1937), “Professor Karl Pearson and the Method of Moments,” Annals of Eugenics, 7, 303-318.

Gosset, W. S. (1908) “The probable error of the mean,” Biometrika, 6, 1-25.

Hald, A. (1998) A History of Mathematical Statistics from 1750 to 1930, Wiley, NY.

Hotelling, H. (1930) “British statistics and statisticians today,” Journal of the American Statistical Association, 25, 186-90.

Neyman, J. (1934) “On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection,” Journal of the Royal Statistical Society, 97, 558-625.

Rao, C. R. (1992) “R. A. Fisher: The Founder of Modern Statistics,” Statistical Science, 7, 34-48.

RSS (Royal Statistical Society) (1934) Annals of the Royal Statistical Society 1834-1934, The Royal Statistical Society, London.

Savage, L . J. (1976) “On re-reading R. A. Fisher,” Annals of Statistics, 4, 441-500.

Spanos, A. (2008), “Statistics and Economics,” pp. 1057-1097 in The New Palgrave Dictionary of Economics, Second Edition. Eds. Steven N. Durlauf and Lawrence E. Blume, Palgrave Macmillan.

Tippett, L. H. C. (1931) The Methods of Statistics, Williams & Norgate, London.


[i] Fisher (1937), published a year after Pearson’s death, is particularly acerbic. In Fisher’s mind, Karl Pearson went after a young Indian statistician – totally unfairly – just the way he went after him in 1917.

[ii] Bowley received the Guy Medal in silver from the Royal Statistical Society (RSS) as early as 1895, and became a member of the Council of the RSS in 1898. He was awarded the society’s highest honor, the Guy Medal in gold, in 1935.

[iii] It is important to note that Bowley revised his textbook in statistics for the last time in 1937, and predictably, he missed the whole change of paradigms brought about by Fisher, Neyman and Pearson.

[iv] In their centennial volume published in 1934, the RSS acknowledged the development of ‘mathematical statistics’, referring to Galton, Edgeworth, Karl Pearson, Yule and Bowley as the main pioneers, and listed the most important contributions in this sub-field which appeared in its Journal during the period 1909-33, but the three important papers by Fisher (1922a-b; 1924) are conspicuously absent from that list. The list itself is dominated by contributions in vital, commercial, financial and labour statistics (see RSS, 1934, pp. 208-23). There is a single reference to Egon Pearson.

This was first posted on 17 Feb. 2013 here.

HAPPY BIRTHDAY R.A. FISHER!

Categories: Fisher, phil/history of stat, Spanos

Bad Statistics is Their Product: Fighting Fire With Fire (ii)

Mayo fights fire w/ fire

I. Doubt is Their Product is the title of a (2008) book by David Michaels, Assistant Secretary for OSHA from 2009-2017. I first mentioned it on this blog back in 2011 (“Will the Real Junk Science Please Stand Up?”). The expression is from a statement by a cigarette executive (“doubt is our product”), and the book’s thesis is explained in its subtitle: How Industry’s Assault on Science Threatens Your Health. Imagine you have just picked up a book, published in 2020: Bad Statistics is Their Product. Is the author writing about how exaggerating bad statistics may serve the interest of denying well-established risks? [Interpretation A] Or perhaps she’s writing on how exaggerating bad statistics serves the interest of denying well-established statistical methods? [Interpretation B] Both may result in distorting science and even in dismantling public health safeguards–especially if made the basis of evidence policies in agencies. A responsible philosopher of statistics should care.

II. Fixing Science. So, one day in January, I was invited to speak in a panel, “Falsifiability and the Irreproducibility Crisis,” at a conference, “Fixing Science: Practical Solutions for the Irreproducibility Crisis.” The inviter, David Randall, whom I did not know, explained that a speaker had withdrawn from the session because of some kind of controversy surrounding the conference, but did not give details. He pointed me to an op-ed in the Wall Street Journal. I had already heard about the conference months before (from Nathan Schachtman), and before checking out the op-ed, my first thought was: I wonder if the controversy has to do with the fact that a keynote speaker is Ron Wasserstein, ASA Executive Director, a leading advocate of retiring “statistical significance” and barring P-value thresholds in interpreting data. Another speaker eschews all current statistical inference methods (e.g., P-values, confidence intervals) as just too uncertain (D. Trafimow). More specifically, I imagined it might have to do with the controversy over whether the March 2019 editorial in TAS (Wasserstein, Schirm, and Lazar 2019) was a continuation of the ASA 2016 Statement on P-values, and thus an official ASA policy document, or not. Karen Kafadar, recent President of the American Statistical Association (ASA), made it clear in December 2019 that it is not.[2] The “no significance/no thresholds” view is the position of the guest editors of the March 2019 issue. (See “P-Value Statements and Their Unintended(?) Consequences” and “Les stats, c’est moi”.) Kafadar created a new 2020 ASA Task Force on Statistical Significance and Replicability to:

prepare a thoughtful and concise piece …without leaving the impression that p-values and hypothesis tests—and, perhaps by extension as many have inferred, statistical methods generally—have no role in “good statistical practice”. (Kafadar 2019, p. 4)

Maybe those inviting me didn’t know I’m “anti” the Anti-Statistical Significance campaign (“On some self-defeating aspects of the 2019 recommendations”), that I agree with John Ioannidis (2019) that “retiring statistical significance would give bias a free pass”, and that I published an editorial “P-value Thresholds: Forfeit at Your Peril”. While I regard many of today’s statistical reforms as welcome (preregistration, testing for replication, transparency about data-dredging, P-hacking and multiple testing), I argue that those in Wasserstein et al. (2019) are “Doing more harm than good”. In “Don’t Say What You don’t Mean”, I express doubts that Wasserstein et al. (2019) could really mean to endorse certain statements in their editorial that are so extreme as to conflict with the ASA 2016 guide on P-values. To be clear, I reject oversimple dichotomies and cookbook uses of tests, long lampooned, and have developed a reformulation of tests that avoids the fallacies of significance and non-significance.[1] It’s just that many of the criticisms are confused, and, consequently, so are many reforms.

III. Bad Statistics is Their Product. It turns out that the brouhaha around the conference had nothing to do with all that. I thank Dorothy Bishop for pointing me to her blog, which gives a much fuller background. Aside from the lack of women (I learned a new word–a manference), her real objection is on the order of “Bad Statistics is Their Product”: the groups sponsoring the Fixing Science conference, the National Association of Scholars and the Independent Institute, Bishop argues, are using the replication crisis to cast doubt on well-established risks, notably those of climate change. She refers to a book whose title echoes David Michaels’s: Merchants of Doubt (2010), by the historians of science Conway and Oreskes. Bishop writes:

Uncertainty about science that threatens big businesses has been promoted by think tanks … which receive substantial funding from those vested interests. The Fixing Science meeting has a clear overlap with those players. (Bishop)

The speakers on bad statistics, as she sees it, are “foils” for these interests, and thus “responsible scientists should avoid” the meeting.

But what if things are the reverse? What if “bad statistics is our product” leaders also have an agenda? By influencing groups who have a voice in evidence policy in government agencies, they might effectively discredit methods they don’t like, and advance those they do. Suppose you have strong arguments that the consequences of this will undermine important safeguards (despite the key players being convinced they’re promoting better science). Then you should speak, if you can, and not stay away. You should try to fight fire with fire.

IV. So what Happened? So I accepted the invitation and gave what struck me as a fairly radical title: “P-Value ‘Reforms’: Fixing Science or Threats to Replication and Falsification?” (The abstract and slides are below.) Bishop is right that evidence of bad science can be exploited to selectively weaken entire areas of science; but evidence of bad statistics can also be exploited to selectively weaken entire methods one doesn’t like, and successfully gain acceptance of alternative methods, without the hard work of showing those alternative methods do a better, or even a good, job at the task at hand. Of course both of these things might be happening simultaneously.

Do the conference organizers overlap with science policy as Bishop alleges? I’d never encountered either outfit before, but Bishop quotes from their annual report:

In April we published The Irreproducibility Crisis, a report on the modern scientific crisis of reproducibility—the failure of a shocking amount of scientific research to discover true results because of slipshod use of statistics, groupthink, and flawed research techniques. We launched the report at the Rayburn House Office Building in Washington, DC; it was introduced by Representative Lamar Smith, the Chairman of the House Committee on Science, Space, and Technology.

So there is indeed overlap with science policy makers in Washington, and their publication, The Irreproducibility Crisis, is clearly prepared to find its scapegoat in the bad statistics supposedly encouraged by statistical significance tests. To its credit, it discusses how data-dredging and multiple testing can make it easy to arrive at impressive-looking findings that are spurious, but nothing is said about ways to adjust or account for multiple testing and multiple modeling. (P-values are defined correctly, but their interpretation of confidence levels is incorrect.) Published before the Wasserstein et al. (2019) call to end P-value thresholds, which would require the FDA and other agencies to end what many consider vital safeguards of error control, it doesn’t go that far. Not yet at least! Trying to prevent that from happening is a key reason I decided to attend. (updated 2/16)
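Since the report stops short of showing what such an adjustment would even look like, here is a minimal sketch of two textbook multiplicity corrections, Bonferroni and Holm. It is only an illustration of the general technique; the ten example P-values (and everything else in the snippet) are hypothetical numbers of my own, not figures from the NAS report.

```python
# Minimal sketch: two standard family-wise adjustments applied to a set of
# nominal p-values from a multiple-testing search. Purely illustrative.

def bonferroni(pvalues):
    """Multiply each nominal p-value by the number of tests (capped at 1)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def holm(pvalues):
    """Holm's step-down adjustment: less conservative than Bonferroni,
    but still controls the family-wise error rate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Hypothetical nominal p-values from a family of ten searched endpoints/subgroups.
nominal = [0.011, 0.09, 0.12, 0.21, 0.33, 0.41, 0.52, 0.64, 0.77, 0.90]
print("Bonferroni:", [round(p, 3) for p in bonferroni(nominal)])
print("Holm:      ", [round(p, 3) for p in holm(nominal)])
```

Run as written, the smallest nominal P-value (.011) clears a .05 threshold on its own, but rises to roughly .11 under either correction once the ten-test search is taken into account.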

My first step was to send David Randall my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP)–which he actually read and wrote a report on–and I met up with him in NYC to talk. He seemed surprised to learn about the controversies over statistical foundations and the disagreement about reforms. So did I hold people’s feet to the fire at the conference (when it came to scapegoating statistical significance tests and banning P-value thresholds for error probability control)? I did! I continue to do so in communications with David Randall. (I’ll write more in the comments to this post, once our slides are up.)

As for climate change, I wound up entirely missing that part of the conference: Due to the grounding of all flights to and from CLT the day I was to travel, thanks to rain, hail and tornadoes, I could only fly the following day, so our sessions were swapped. I hear the presentations will be posted. Doubtless, some people will use bad statistics and the “replication crisis” to claim there’s reason to reject our best climate change models, without having adequate knowledge of the science. But the real and present danger today that I worry about is that they will use bad statistics to claim there’s reason to reject our best (error) statistical practices, without adequate knowledge of the statistics or the philosophical and statistical controversies behind  the “reforms”.

Let me know what you think in the comments.

V. Here’s my abstract and slides

P-Value “Reforms”: Fixing Science or Threats to Replication and Falsification?

Mounting failures of replication give a new urgency to critically appraising proposed statistical reforms. While many reforms are welcome, others are quite radical. The sources of irreplication are not mysterious: in many fields, latitude in collecting and interpreting data makes it too easy to dredge up impressive looking findings even when spurious. Paradoxically, some of the reforms intended to fix science enable rather than reveal illicit inferences due to P-hacking, multiple testing, and data-dredging. Some even preclude testing and falsifying claims altogether. Too often the statistics wars become proxy battles between competing tribal leaders, each keen to advance a method or philosophy, rather than improve scientific accountability.

[1] Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST), 2018; SIST excerpts; Mayo and Cox 2006; Mayo and Spanos 2006.

[2] All uses of ASA II on this blog must now be qualified to reflect this.

[3] You can find a lot on the conference and the groups involved on-line. The letter by Lenny Teytelman warning people off the conference is here. Nathan Schachtman has a post up today on his law blog here.


Categories: ASA Guide to P-values, Error Statistics, P-values, replication research, slides

My paper, “P values on Trial” is out in Harvard Data Science Review


My new paper, “P Values on Trial: Selective Reporting of (Best Practice Guides Against) Selective Reporting,” is out in Harvard Data Science Review (HDSR). HDSR describes itself as “A Microscopic, Telescopic, and Kaleidoscopic View of Data Science.” The editor-in-chief is Xiao-Li Meng, a statistician at Harvard. He writes a short blurb on each article in his opening editorial of the issue.

This is a case where reality proves the parody (or maybe, the proof of the parody is in the reality) or something like that. More specifically, Excursion 4 Tour III of my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) opens with a parody of a legal case, that of Scott Harkonen (in the parody, his name is Paul Hack). You can read it here. A few months after the book came out, the actual case took a turn that went even a bit beyond what I imagined could transpire in my parody. I got cold feet when it came to naming names in the book, but in this article I do.

Below I paste Meng’s blurb, followed by the start of my article.

Meng’s blurb (his full editorial is here):

P values on Trial (and the Beauty and Beast in a Single Number)

Perhaps there are no statistical concepts or methods that have been used and abused more frequently than statistical significance and the p value.  So much so that some journals are starting to recommend authors move away from rigid p value thresholds by which results are classified as significant or insignificant. The American Statistical Association (ASA) also issued a statement on statistical significance and p values in 2016, a unique practice in its nearly 180 years of history.  However, the 2016 ASA statement did not settle the matter, but only ignited further debate, as evidenced by the 2019 special issue of The American Statistician.  The fascinating account by the eminent philosopher of science Deborah Mayo of how the ASA’s 2016 statement was used in a legal trial should remind all data scientists that what we do or say can have completely unintended consequences, despite our best intentions.

The ASA is a leading professional society of the studies of uncertainty and variabilities. Therefore, the tone and overall approach of its 2016 statement is understandably nuanced and replete with cautionary notes. However, in the case of Scott Harkonen (CEO of InterMune), who was found guilty of misleading the public by reporting a cherry-picked ‘significant p value’ to market the drug Actimmune for unapproved uses, the appeal lawyers cited the ASA Statement’s cautionary note that “a p value without context or other evidence provides limited information,” as compelling new evidence that the scientific theory upon which petitioner’s conviction was based [that of statistical significance testing] is demonstrably false.  I doubt the authors of the ASA statement ever anticipated that their warning against the inappropriate use of p value could be turned into arguments for protecting exactly such uses.

To further clarify the ASA’s position, especially in view of some confusions generated by the aforementioned special issue, the ASA recently established a task force on statistical significance (and research replicability) to “develop thoughtful principles and practices that the ASA can endorse and share with scientists and journal editors” within 2020.  As a member of the task force, I’m particularly mindful of the message from Mayo’s article, and of the essentially impossible task of summarizing scientific evidence by a single number.  As consumers of information, we are all seduced by simplicity, and nothing is simpler than conveying everything through a single number, which renders simplicity on multiple fronts, from communication to decision making.  But, again, there is no free lunch.  Most problems are just too complex to be summarized by a single number, and concision in this context can exact a considerable cost. The cost could be a great loss of information or validity of the conclusion, which are the central concerns regarding the p value.  The cost can also be registered in terms of the tremendous amount of hard work it may take to produce a usable single summary.

P-Values on Trial: Selective Reporting of (Best Practice Guides Against) Selective Reporting

Abstract

In an attempt to stem the practice of reporting impressive-looking findings based on data dredging and multiple testing, the American Statistical Association’s (ASA) 2016 guide to interpreting p values (Wasserstein & Lazar) warns that engaging in such practices “renders the reported p-values essentially uninterpretable” (pp. 131-132). Yet some argue that the ASA statement actually frees researchers from culpability for failing to report or adjust for data dredging and multiple testing. We illustrate the puzzle by means of a case appealed to the Supreme Court of the United States: that of Scott Harkonen. In 2009, Harkonen was found guilty of issuing a misleading press report on results of a drug advanced by the company of which he was CEO. Downplaying the high p value on the primary endpoint (and 10 secondary points), he reported statistically significant drug benefits had been shown, without mentioning this referred only to a subgroup he identified from ransacking the unblinded data. Nevertheless, Harkonen and his defenders argued that “the conclusions from the ASA Principles are the opposite of the government’s” conclusion that his construal of the data was misleading (Harkonen v. United States, 2018, p. 16). On the face of it, his defenders are selectively reporting on the ASA guide, leaving out its objections to data dredging. However, the ASA guide also points to alternative accounts to which some researchers turn to avoid problems of data dredging and multiple testing. Since some of these accounts give a green light to Harkonen’s construal, a case might be made that the guide, inadvertently or not, frees him from culpability.

Keywords: statistical significance, p values, data dredging, multiple testing, ASA guide to p values, selective reporting

  1. Introduction

The biggest source of handwringing about statistical inference boils down to the fact it has become very easy to infer claims that have not been subjected to stringent tests. Sifting through reams of data makes it easy to find impressive-looking associations, even if they are spurious. Concern with spurious findings is considered sufficiently serious to have motivated the American Statistical Association (ASA) to issue a guide to stem misinterpretations of p values (Wasserstein & Lazar, 2016; hereafter, ASA guide). Principle 4 of the ASA guide asserts that:

Proper inference requires full reporting and transparency. P-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain p-values (typically those passing a significance threshold) renders the reported p-values essentially uninterpretable. (pp. 131–132)

An intriguing example is offered by a legal case that was back in the news in 2018, having made it to the U.S. Supreme Court (Harkonen v. United States, 2018). In 2009, Scott Harkonen (CEO of drug company InterMune) was found guilty of wire fraud for issuing a misleading press report on Phase III results of a drug Actimmune in 2002, successfully pumping up its sales. While Actimmune had already been approved for two rare diseases, it was hoped that the FDA would approve it for a far more prevalent, yet fatal, lung disease (whose treatment would cost patients $50,000 a year). Confronted with a disappointing lack of statistical significance (p = .52)[1] on the primary endpoint—that the drug improves lung function as reflected by progression free survival—and on any of ten prespecified endpoints, Harkonen engaged in postdata dredging on the unblinded data until he unearthed a non-prespecified subgroup with a nominally statistically significant survival benefit. The day after the Food and Drug Administration (FDA) informed him it would not approve the use of the drug on the basis of his post hoc finding, Harkonen issued a press release to doctors and shareholders optimistically reporting Actimmune’s statistically significant survival benefits in the subgroup he identified from ransacking the unblinded data.
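To see how easily that kind of post hoc subgroup hunting produces a nominally significant result, consider a small simulation. It is purely illustrative and is not a reanalysis of the InterMune trial: the sample sizes, the number of subgroups searched, and the test used are all hypothetical choices of mine.

```python
# Illustrative sketch: a "null" two-arm trial with no drug effect at all is
# searched over many arbitrary post hoc subgroups; small nominal p-values
# show up routinely just from the search.
import math
import random

random.seed(1)

N = 300            # patients per arm (hypothetical)
N_SUBGROUPS = 20   # candidate post hoc subgrouping variables searched (hypothetical)

def two_sample_z_pvalue(x, y):
    """Two-sided p-value from a large-sample z-test for a difference in means."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((xi - mx) ** 2 for xi in x) / (nx - 1)
    vy = sum((yi - my) ** 2 for yi in y) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Null world: outcomes in both arms are drawn from the same distribution.
treatment = [random.gauss(0, 1) for _ in range(N)]
control = [random.gauss(0, 1) for _ in range(N)]

# Each "subgroup" is defined by an irrelevant baseline coin flip; the drug
# effect is then tested within that subgroup only.
pvalues = []
for _ in range(N_SUBGROUPS):
    keep_t = [random.random() < 0.5 for _ in range(N)]
    keep_c = [random.random() < 0.5 for _ in range(N)]
    t_sub = [v for v, k in zip(treatment, keep_t) if k]
    c_sub = [v for v, k in zip(control, keep_c) if k]
    pvalues.append(two_sample_z_pvalue(t_sub, c_sub))

print(f"smallest nominal p over {N_SUBGROUPS} post hoc subgroups: {min(pvalues):.3f}")
# Back-of-the-envelope, treating the searched tests as independent:
print(f"rough chance of at least one p < .05 under the null: {1 - 0.95 ** N_SUBGROUPS:.0%}")
```

With twenty subgroups searched, the rough (independence-based) chance of turning up at least one nominal P-value below .05 is already about 64%, even though the simulated drug does nothing; reporting only that smallest P-value, without the search that produced it, is exactly the practice Principle 4 warns against.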

What makes the case intriguing is not its offering yet another case of p-hacking, nor that it has found its way more than once to the Supreme Court. Rather, it is because in 2018, Harkonen and his defenders argued that the ASA guide provides “compelling new evidence that the scientific theory upon which petitioner’s conviction was based [that of statistical significance testing] is demonstrably false” (Goodman, 2018, p. 3). His appeal alleges that “the conclusions from the ASA Principles are the opposite of the government’s” charge that his construal of the data was misleading (Harkonen v. United States, 2018, p. 16 ).

Are his defenders merely selectively reporting on the ASA guide, making no mention of Principle 4, with its loud objections to the behavior Harkonen displayed? It is hard to see how one can hold Principle 4 while averring the guide’s principles run counter to the government’s charges against Harkonen. However, if we view the ASA guide in the context of today’s disputes about statistical evidence, things may look topsy turvy. None of the attempts to overturn his conviction succeeded (his sentence had been to a period of house arrest and a fine), but his defenders are given a leg to stand on—wobbly as it is. While the ASA guide does not show that the theory of statistical significance testing ‘is demonstrably false,’ it might be seen to communicate a message that is in tension with itself on one of the most important issues of statistical inference.

Before beginning, some caveats are in order. The legal case was not about which statistical tools to use, but merely whether Harkonen, in his role as CEO, was guilty of intentionally issuing a misleading report to shareholders and doctors. However, clearly, there could be no hint of wrongdoing if it were acceptable to treat post hoc subgroups the same as prespecified endpoints. In order to focus solely on that issue, I put to one side the question whether his press report rises to the level of wire fraud. Lawyer Nathan Schachtman argues that “the judgment in United States v. Harkonen is at odds with the latitude afforded companies in securities fraud cases” even where multiple testing occurs (Schachtman, 2020, p. 48). Not only are the intricacies of legal precedent outside my expertise, the arguments in his defense, at least the ones of interest here, regard only the data interpretation. Moreover, our concern is strictly with whether the ASA guide provides grounds to exonerate Harkonen-like interpretations of data.

I will begin by describing the case in relation to the ASA guide. I then make the case that Harkonen’s defenders mislead by omission of the relevant principle in the guide. I will then reopen my case by revealing statements in the guide that have thus far been omitted from my own analysis. Whether they exonerate Harkonen’s defenders is for you, the jury, to decide.

You can read the full article at HDSR here. The Harkonen case is also discussed on this blog: search Harkonen (and Matrixx).


Categories: multiple testing, P-values, significance tests, Statistics
