My paper, “P values on Trial” is out in Harvard Data Science Review

Posted on February 1, 2020 by Mayo

My new paper, “P Values on Trial: Selective Reporting of (Best Practice Guides Against) Selective Reporting” is out in Harvard Data Science Review (HDSR). HDSR describes itself as “A Microscopic, Telescopic, and Kaleidoscopic View of Data Science.” The editor-in-chief is Xiao-li Meng, a statistician at Harvard. He writes a short blurb on each article in his opening editorial of the issue.

This is a case where reality proves the parody (or maybe, the proof of the parody is in the reality) or something like that. More specifically, Excursion 4 Tour III of my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) opens with a parody of a legal case, that of Scott Harkonen (in the parody, his name is Paul Hack). You can read it here. A few months after the book came out, the actual case took a turn that went even a bit beyond what I imagined could transpire in my parody. I got cold feet when it came to naming names in the book, but in this article I do.

Below I paste Meng’s blurb, followed by the start of my article.

Meng’s blurb (his full editorial is here):

P values on Trial (and the Beauty and Beast in a Single Number)

Perhaps there are no statistical concepts or methods that have been used and abused more frequently than statistical significance and the p value. So much so that some journals are starting to recommend authors move away from rigid p value thresholds by which results are classified as significant or insignificant. The American Statistical Association (ASA) also issued a statement on statistical significance and p values in 2016, a unique practice in its nearly 180 years of history. However, the 2016 ASA statement did not settle the matter, but only ignited further debate, as evidenced by the 2019 special issue of The American Statistician. The fascinating account by the eminent philosopher of science Deborah Mayo of how the ASA’s 2016 statement was used in a legal trial should remind all data scientists that what we do or say can have completely unintended consequences, despite our best intentions.

The ASA is a leading professional society of the studies of uncertainty and variabilities. Therefore, the tone and overall approach of its 2016 statement is understandably nuanced and replete with cautionary notes. However, in the case of Scott Harkonen (CEO of InterMune), who was found guilty of misleading the public by reporting a cherry-picked ‘significant p value’ to market the drug Actimmune for unapproved uses, the appeal lawyers cited the ASA Statement’s cautionary note that “a p value without context or other evidence provides limited information,” as compelling new evidence that the scientific theory upon which petitioner’s conviction was based [that of statistical significance testing] is demonstrably false. I doubt the authors of the ASA statement ever anticipated that their warning against the inappropriate use of p value could be turned into arguments for protecting exactly such uses.

To further clarify the ASA’s position, especially in view of some confusions generated by the aforementioned special issue, the ASA recently established a task force on statistical significance (and research replicability) to “develop thoughtful principles and practices that the ASA can endorse and share with scientists and journal editors” within 2020. As a member of the task force, I’m particularly mindful of the message from Mayo’s article, and of the essentially impossible task of summarizing scientific evidence by a single number. As consumers of information, we are all seduced by simplicity, and nothing is simpler than conveying everything through a single number, which renders simplicity on multiple fronts, from communication to decision making. But, again, there is no free lunch. Most problems are just too complex to be summarized by a single number, and concision in this context can exact a considerable cost. The cost could be a great loss of information or validity of the conclusion, which are the central concerns regarding the p value. The cost can also be registered in terms of the tremendous amount of hard work it may take to produce a usable single summary.

P-Values on Trial: Selective Reporting of (Best Practice Guides Against) Selective Reporting

Abstract

In an attempt to stem the practice of reporting impressive-looking findings based on data dredging and multiple testing, the American Statistical Association’s (ASA) 2016 guide to interpreting p values (Wasserstein & Lazar) warns that engaging in such practices “renders the reported p-values essentially uninterpretable” (pp. 131-132). Yet some argue that the ASA statement actually frees researchers from culpability for failing to report or adjust for data dredging and multiple testing. We illustrate the puzzle by means of a case appealed to the Supreme Court of the United States: that of Scott Harkonen. In 2009, Harkonen was found guilty of issuing a misleading press report on results of a drug advanced by the company of which he was CEO. Downplaying the high p value on the primary endpoint (and 10 secondary points), he reported statistically significant drug benefits had been shown, without mentioning this referred only to a subgroup he identified from ransacking the unblinded data. Nevertheless, Harkonen and his defenders argued that “the conclusions from the ASA Principles are the opposite of the government’s” conclusion that his construal of the data was misleading (Harkonen v. United States, 2018, p. 16). On the face of it, his defenders are selectively reporting on the ASA guide, leaving out its objections to data dredging. However, the ASA guide also points to alternative accounts to which some researchers turn to avoid problems of data dredging and multiple testing. Since some of these accounts give a green light to Harkonen’s construal, a case might be made that the guide, inadvertently or not, frees him from culpability.

Keywords: statistical significance, p values, data dredging, multiple testing, ASA guide to p values, selective reporting

Introduction

The biggest source of handwringing about statistical inference boils down to the fact it has become very easy to infer claims that have not been subjected to stringent tests. Sifting through reams of data makes it easy to find impressive-looking associations, even if they are spurious. Concern with spurious findings is considered sufficiently serious to have motivated the American Statistical Association (ASA) to issue a guide to stem misinterpretations of p values (Wasserstein & Lazar, 2016; hereafter, ASA guide). Principle 4 of the ASA guide asserts that:

Proper inference requires full reporting and transparency. P-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain p-values (typically those passing a significance threshold) renders the reported p-values essentially uninterpretable. (pp. 131–132)
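To see why selective reporting has this effect, here is a minimal simulation sketch. It assumes k independent tests of true null hypotheses with continuous test statistics, so that each p-value is uniform on (0, 1); the function names and the values of k are purely illustrative and are not taken from any actual trial.

```python
import random

def familywise_rate(k, alpha=0.05):
    """Chance that at least one of k independent null p-values falls below alpha."""
    return 1 - (1 - alpha) ** k

def simulate_dredging(k, alpha=0.05, trials=20000, seed=0):
    """Estimate how often a search over k null tests turns up a nominal 'finding'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Assumption: under a true null, each p-value is uniform on (0, 1).
        smallest_p = min(rng.random() for _ in range(k))
        if smallest_p < alpha:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(k, round(familywise_rate(k), 3), round(simulate_dredging(k), 3))
```

Under these assumptions, reporting only the smallest of ten p-values yields a nominally significant result in roughly 40% of cases even when no effect exists, which is the sense in which the reported p-value no longer means what it claims to mean.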

An intriguing example is offered by a legal case that was back in the news in 2018, having made it to the U.S. Supreme Court (Harkonen v. United States, 2018). In 2009, Scott Harkonen (CEO of drug company InterMune) was found guilty of wire fraud for issuing a misleading press report on Phase III results of a drug Actimmune in 2002, successfully pumping up its sales. While Actimmune had already been approved for two rare diseases, it was hoped that the FDA would approve it for a far more prevalent, yet fatal, lung disease (whose treatment would cost patients $50,000 a year). Confronted with a disappointing lack of statistical significance (p = .52)[1] on the primary endpoint—that the drug improves lung function as reflected by progression free survival—and on any of ten prespecified endpoints, Harkonen engaged in postdata dredging on the unblinded data until he unearthed a non-prespecified subgroup with a nominally statistically significant survival benefit. The day after the Food and Drug Administration (FDA) informed him it would not approve the use of the drug on the basis of his post hoc finding, Harkonen issued a press release to doctors and shareholders optimistically reporting Actimmune’s statistically significant survival benefits in the subgroup he identified from ransacking the unblinded data.

What makes the case intriguing is not its offering yet another case of p-hacking, nor that it has found its way more than once to the Supreme Court. Rather, it is because in 2018, Harkonen and his defenders argued that the ASA guide provides “compelling new evidence that the scientific theory upon which petitioner’s conviction was based [that of statistical significance testing] is demonstrably false” (Goodman, 2018, p. 3). His appeal alleges that “the conclusions from the ASA Principles are the opposite of the government’s” charge that his construal of the data was misleading (Harkonen v. United States, 2018, p. 16).

Are his defenders merely selectively reporting on the ASA guide, making no mention of Principle 4, with its loud objections to the behavior Harkonen displayed? It is hard to see how one can hold Principle 4 while averring the guide’s principles run counter to the government’s charges against Harkonen. However, if we view the ASA guide in the context of today’s disputes about statistical evidence, things may look topsy turvy. None of the attempts to overturn his conviction succeeded (his sentence had been to a period of house arrest and a fine), but his defenders are given a leg to stand on—wobbly as it is. While the ASA guide does not show that the theory of statistical significance testing ‘is demonstrably false,’ it might be seen to communicate a message that is in tension with itself on one of the most important issues of statistical inference.

Before beginning, some caveats are in order. The legal case was not about which statistical tools to use, but merely whether Harkonen, in his role as CEO, was guilty of intentionally issuing a misleading report to shareholders and doctors. However, clearly, there could be no hint of wrongdoing if it were acceptable to treat post hoc subgroups the same as prespecified endpoints. In order to focus solely on that issue, I put to one side the question whether his press report rises to the level of wire fraud. Lawyer Nathan Schachtman argues that “the judgment in United States v. Harkonen is at odds with the latitude afforded companies in securities fraud cases” even where multiple testing occurs (Schachtman, 2020, p. 48). Not only are the intricacies of legal precedent outside my expertise, the arguments in his defense, at least the ones of interest here, regard only the data interpretation. Moreover, our concern is strictly with whether the ASA guide provides grounds to exonerate Harkonen-like interpretations of data.

I will begin by describing the case in relation to the ASA guide. I then make the case that Harkonen’s defenders mislead by omission of the relevant principle in the guide. I will then reopen my case by revealing statements in the guide that have thus far been omitted from my own analysis. Whether they exonerate Harkonen’s defenders is for you, the jury, to decide.

You can read the full article at HDSR here. The Harkonen case is also discussed on this blog: search Harkonen (and Matrixx).


29 thoughts on “My paper, “P values on Trial” is out in Harvard Data Science Review”

February 1, 2020

Richard Gill

“Nothing is simpler than conveying everything through a single number, which renders simplicity on multiple fronts, from communication to decision making”. This reminds me of 1 in 342 million. https://www.math.leidenuniv.nl/~gill/Statistics_and_Serial_Killer_Nurses.pdf

February 1, 2020

Richard Gill

In the legal case, isn’t it relevant to decide whether Harkonen was deliberately misleading people or not? Suppose he’s not trained as a statistician, and he just does what everybody else in a similar position does. There has just been a legal case in the Netherlands concerning an econometrician who did some worthless statistical analyses which he self-published (i.e. no peer review) and which his university promoted with press releases which got him on TV and caused major unjustified financial loss to several businesses. The businesses complained to the university scientific integrity committee. The university saw no wrongdoing, the businesses appealed, it went to the national scientific integrity organ. Their conclusions have just been published. The guy is not found guilty of a violation of scientific integrity because his work was typical for the field he is working in: everyday econometrics, i.e. a mindless linear regression model applied to a data set which certainly cannot be thought of as a random sample from any meaningful population (let alone that the regression model could be thought of as a causal model). The scientific integrity committee doesn’t realise that there are also lots of people in econometrics who are concerned about causality; they think that this is just a worry of nit-picking mathematicians. The university is blamed for putting out press releases with text written by a PR guy and not checked by the scientist. Though I am pretty sure the “scientist” would have had no objection at all to the text which the publicity department put out. https://www.math.leidenuniv.nl/~gill/Haring.pdf


February 1, 2020

Mayo

Richard:
Yes it definitely is relevant to decide whether Harkonen was deliberately misleading people. The jury decided he was.
Your example is fascinating, I haven’t read it through. Maybe write me a summary for this blog, or possibly I can link to it.


February 1, 2020

Richard Gill

I plan to write a paper about the example I mentioned – a “Dutch new herring” scandal – before the herring fleet next sails into harbour. That’s around 1 June. The herring gather to mate in the North Sea in May, and they are then at their best, and all of them in one place, too!

February 1, 2020

John Byrd

Suppose they convert the results of the individual tests from p-values to likelihood ratios. Can they then argue that the press release was valid because the “evidence” from the statistical analysis support H1 over H0, citing the acclaimed likelihood principle? Would a statistician testify to this argument?


February 1, 2020

Mayo

Hi John: My answers are: yes, and maybe. I tried to pin Goodman down before finalizing this paper, but he wouldn’t tell me. I mean, he wrote back, but did not have an answer.

Let me qualify my answer. I mean that if the view of evidence was the likelihood ratio, then he would not be culpable. But his report was on statistical significance, and there was a prespecified sampling plan with the FDA. To overthrow an account that insists on error probability control is what many fear. It’s why I wrote this. I do discuss this in the paper.


February 2, 2020

Richard Gill

I think a likelihood ratio without a sampling plan is also wrong! A likelihood ratio makes sense when you combine it with your prior. Your prior would change dramatically depending on what information you have about *why* someone decided to tell you exactly that particular likelihood ratio.


February 2, 2020

Mayo

Richard:
It might influence a prior but (a) Likelihoodists don’t use priors, (b) it would violate the Likelihood Principle (which some Bayesians might be happy to do–despite violating coherence), and (c) someone who takes all the import of the evidence to be in the likelihood ratio is not bound to report: “I ransacked until I got a high LR in favor of benefit in order to trick you into thinking there was good evidence for the data dredged hypothesis”.


February 2, 2020

Richard Gill

I never met a live likelihoodist who actually did statistics in the real world. Most problems have very many parameters, very many nuisance parameters. Coherence is an ideal. Life is complex. I think one has to be pragmatic. Remember – we don’t actually believe in our models, anyway. I think that “someone who takes all the import of the evidence to be in the likelihood ratio” is probably pretty stupid.


February 2, 2020

Mayo

Richard: Well Michael Lew often comes around this blog, and of course there’s Royall. But you’re moving too quickly from the fact that in the real world people aren’t Bayesian coherent to the supposition that priors should be influenced by error probabilities (in the frequentist sense), and that Bayesians accept this. Do you know of any Bayesians who go that far? Also, there’s my point (c). If you read any part of his defense or appeals, you’ll see these people exist–not only exist, they purport to be the consensus view these days. Please look up the quotes in this paper from Benjamini et al., and from Nuzzo. And also the one from Lindley. Even the 2016 ASA Guide qualifies principle 4 (the one concerned with data-dredging and multiple testing) to apply only to p-values and “related measures”.


February 2, 2020

John Byrd

Right, and the dangerous implication of ASA II pointing flippantly at the “other approaches” as superior alternatives is that the defendant in this case should convert to LR’s, cite the Likelihood Principle, and proceed to say there is nothing whatsoever wrong with claiming real effects for the few results that come up from fishing in a large pool of results. Because this other approach does not concern itself with the overall study design. ASA II says we only worry about these real problems if using the method that demands we worry about these real problems. Ignorance is bliss.


February 2, 2020

Mayo

John:
We can’t call it ASA II any more. I’d like to think that my referring to it in that manner–the manner in which it was presented–namely, as a continuation of ASA I–helped to underscore the problem already recognized by some ASA board members. Now it’s plain that it’s not a continuation, and not a policy document, but something put forward by the authors of the editorial, Wasserstein et al., 2019. In fact, as I understand it, Wasserstein is not even to be seen wearing his Executive Director’s hat when serving as a guest editor for the special March 2019 issue. He is to be seen merely as an author and not an ASA official–or so it appears the ASA board is claiming. Of course, wherever he goes, people cannot help but see him as wearing his ASA Executive Director’s hat.

On your other point, the defendant can’t convert to a different account of evidence in the middle of his trial; he can at most try to argue that these other approaches give a green light to his interpretation. The danger will be in future cases, especially after Wasserstein et al., 2019 which, remember, was not available at any time during Harkonen’s long history of appeals (which were over some time in fall of 2018).

February 3, 2020

Michael J Lew

Richard, you make some good points about pitfalls of a likelihood ratio, but you (and Mayo) have confused a few important issues.

1. The evidence in the data concerning the relative merits of parameter values can be seen in the likelihood function, but not in any singular likelihood ratio. (Except in the strange case where the parameter of interest can take only two values.)

Yes, your concern about the practicalities of generating a likelihood function where there are many parameters of interest (or the vector parameter of interest has several dimensions) or in the presence of nuisance parameters is entirely valid. I suspect that non-likelihood methods that avoid that problem can do so by being less responsive to the full evidential import of the data.

2. There are many types of questions that one might like to answer using data, some that are best answered using just the evidence, others using the evidence and a prior, and others that need some input of information regarding the reliability or representativeness of the data. Then there are decisions that are final or interim. No statistical procedure can be appropriate for more than a small fraction of the situations that a statistician might find him or herself in.

3. When we make a clear distinction between the evidence in _these_ data concerning _this_ parameter according to _this_ statistical model (“local evidence”) and the error rate expectations for the analytical decision procedure (“global errors”) then it is easier to navigate the difficulties implied by 2.

4. No statistical procedure can protect against deliberate or ignorant misapplication or against deliberately or accidentally unrepresentative data. And we must not pretend otherwise.

I have written about these issues in a recent paper available from arXiv https://arxiv.org/abs/1910.02042


February 3, 2020

Mayo

Michael:
I’m so glad you came by to comment on Richard. I do discuss, in the paper, the fact that likelihood ratio advocates are free to insist on other considerations separate from assessing the evidence (and even cite Royall). What I don’t understand is how someone like Goodman can say that the LR reports the evidence, in a case like Harkonen. I also don’t see how the likelihood ratio function, your first point, helps to pick up on the sampling plan, but maybe I’m misunderstanding how you’re using that term.


February 3, 2020

Michael J Lew

The likelihood function does not “pick up on the sampling plan”. The sampling plan can affect the global error rate properties of the analysis and may influence the reliability of real-world inferences based on the analysis, but it does not affect the local evidence. (Yes, I know that many people want a system where the evidence itself is altered by sampling rules and by the ethical behaviour of the data gatherers. However, statistical evidence does not appear to work that way.)
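The point can be illustrated with a standard textbook comparison, offered here only as a sketch and not as anything computed in this thread. Assume hypothetical data of 9 successes and 3 failures, obtained either with the number of trials fixed in advance (binomial) or by sampling until the third failure (negative binomial). The two likelihood functions are proportional, so every likelihood ratio is the same, while the one-sided p-values for a null success probability of 0.5 differ because the sampling plan changes the set of more extreme outcomes.

```python
from math import comb

S, F = 9, 3  # hypothetical data: 9 successes, 3 failures

def lik_binomial(theta):
    # Plan A (assumed): S + F trials fixed in advance.
    return comb(S + F, S) * theta**S * (1 - theta)**F

def lik_negbinomial(theta):
    # Plan B (assumed): sample until the F-th failure, so the last trial is a failure.
    return comb(S + F - 1, S) * theta**S * (1 - theta)**F

def likelihood_ratio(lik, theta1, theta0=0.5):
    # The combinatorial constant cancels, so the sampling plan drops out.
    return lik(theta1) / lik(theta0)

def p_binomial(theta0=0.5):
    # P(at least S successes in S + F trials) under theta0.
    n = S + F
    return sum(comb(n, k) * theta0**k * (1 - theta0)**(n - k)
               for k in range(S, n + 1))

def p_negbinomial(theta0=0.5):
    # P(at least S successes before the F-th failure)
    # = P(at most F - 1 failures in the first S + F - 1 trials).
    m = S + F - 1
    return sum(comb(m, j) * (1 - theta0)**j * theta0**(m - j)
               for j in range(F))

if __name__ == "__main__":
    print(likelihood_ratio(lik_binomial, 0.75), likelihood_ratio(lik_negbinomial, 0.75))
    print(p_binomial(), p_negbinomial())
```

In the terms used above, the likelihood ratio is the "local evidence" and is blind to the stopping rule, whereas the p-values are "global" error probabilities and land on opposite sides of 0.05 in this example (299/4096 versus 67/2048).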

The reason to inspect the full likelihood function (rather than a pre- or post-data selected singular likelihood ratio) is that it allows you to see what the data say about the parameter of interest within the chosen statistical model. If you want to know what the data say then you need the whole function.

I cannot comment on what Goodman has in mind.


February 3, 2020

Mayo

Michael:
Thank you for clarifying this. Your answer, as I understand it, is that the likelihood function does not take the sampling plan into account, but that one can add considerations later on, and these can include priors as well as error probabilities of the overall method. It’s interesting that you say many would like the sampling plan to influence the evidence (“many people want a system where the evidence itself is altered by sampling rules”) because, these days, I’m often thinking it’s a minority view (that I and others hold).
What happens if you want to look at the full likelihood function if the hypothesis results from the kind of post-hoc selection as with Harkonen? I guess you report the LR values for different parameter values, not for different possible post hoc hypotheses.


February 3, 2020

Michael J Lew

The hypotheses that are acted upon by statistical inference are usually hypotheses concerning the values of parameters in a statistical model. Extrapolation from the within model world to the real world always needs thoughtful care and caution. I do not think that an arithmetical decision rule should be considered thoughtful.

A likelihood function would not have rescued the Harkonen case. Neither would ‘correction’ of the “nominal” p-values. The idea that so-called “nominal” p-values should be “corrected” to account for multiplicity of testing is silly because it removes the evidential calibration of the original p-value and it changes the null hypothesis into an uncertain composite. Yes, it would likely have led to a “non-significant” p-value, but it would have done so by robbing all of the analyses of power. What should have happened is that the data dredging would have been described. The subgroup analysis should never have been reported as if it allowed anything beyond the weakest, suggestive type of inference. A scientific response to that inference might have been to seek new data.

The lesson should be that we always need to distinguish between preliminary evidence and corroborating evidence. The data that suggest a hypothesis (the data-dredged subgroup data) cannot be used to test that same hypothesis, as we all know. A likelihood function would have been useful to delineate the range of effect sizes that might be used to design a new experimental test of the hypothesis.

(I wrote about those ideas in my recent paper too. Linked in a comment above.)

I’m glad to hear that you consider attempts to roll error rates together with evidence to be a minority pursuit, but I’m not at all sure that that is the case.

February 3, 2020

Mayo

Michael:
I don’t really disagree with anything you wrote, and your comments are very illuminating. (I’m curious what Richard Gill thinks.) It’s true that the trial plan would not have recommended “correcting” the p-value for nonprespecified hypotheses–as I understand it. They have to be prespecified for adjustments to kick in. I (and the prosecutors) agree that “What should have happened is that the data dredging would have been described. The subgroup analysis should never have been reported as if it allowed anything beyond the weakest, suggestive type of inference. A scientific response to that inference might have been to seek new data.”
The FDA helped the company to run a new trial properly testing what had been a data-dredged subgroup. Had Harkonen qualified his results as you suggest, he would not have been found guilty. However, your position and that of the prosecutors disagree with the position taken in his defense and claimed to be supported by the 2016 ASA guide. His defenders go so far as to claim that the ASA guide shows the falsity of the scientific theory of statistical significance tests! Worse, many of the people supporting Harkonen in his attempt to be shown to be “actually innocent” in 2018 agree. The main caveat, for them, seems to be that the initial jury had been instructed in accordance with the use of statistical significance tests in clinical trials, and no expert witnesses were called to challenge this. I believe the Innocence Project was on his side, but I don’t know the details as to what they were told.
February 4, 2020

Richard Gill

I said above “The likelihood function is just what it is. It’s a sufficient statistic”. I should have given it a rather more privileged status: it’s *the* minimal sufficient statistic. But this does not mean that experiments which happen to produce the same outcome of minimal sufficient statistic should be treated in the same way. In LeCam’s (decision theoretic) approach this is nicely expressed through the theorem that says that experiments whose set of probability distributions of their likelihood functions are the same are essentially the same experiment, up to added noise. And this still doesn’t resolve the questions of ancillarity and conditioning, i.e. the choice of reference set.

February 4, 2020

Richard David Gill

Yes, I am very aware of all those issues! Good that you bring them up, too.


February 4, 2020

Richard Gill

I just posted: “Yes, I am very aware of all those issues! Good that you bring them up, too.”

That was a reply to Lew’s response to me some way above here but WordPress has put it completely in the wrong place and doesn’t allow me to edit it, either…


February 4, 2020

Mayo

Richard: I think the order makes sense because for each of the 2 Lew comments, there’s a reply from me, and now your reply is below.


February 4, 2020

Richard Gill

OK, Mayo. But I do have to get used to the idea that I can’t edit a reply, not even for a few minutes, after posting it. I also have to get used to the idea that I should add a lot more of the context of what I am replying to. Don’t worry, I’ll learn!

BTW, the sampling plan is indeed not reflected in the whole likelihood function. But, if you were a real Bayesian, your prior would be very “entangled” with your sampling plan. So the Bayesian does (or in principle, can) take care of everything, provided their models are correct and they have superpowers regarding their capacity to figure out their real prior. Some people have more confidence in proposing priors than others. I don’t know if the ones who have more confidence in doing it, actually are good at doing it. But I hope they learn from their mistakes. Trouble is, we often never find out if we were wrong or not.

The likelihood function is just what it is. It’s a sufficient statistic, and this means that if you violate the likelihood principle, your inference is partly based on random noise, and it could in principle be improved. But I personally don’t think that that is a terrible sin. In the real world, all the time, we have to make compromises between competing principles. Optimality is a nice property to have in order to sell your work to your consumers, but apart from that, it may not be wise to go after it, always.
February 4, 2020

Michael J Lew

Richard (and Mayo), glad to see that we are largely in agreement. I would like to take up a couple of points that seem to me to be important and under-explored: the use of a prior to deal with sampling plan and concerns about the reliability of the data derivation; and the meaning of ‘violating the likelihood principle’.

In my personal theory of statistical and scientific inference (see my paper!), the global and local issues have to be taken into account separately (conditional on their relevance to the inferential purpose), and the Bayesian prior provides the prior information or belief concerning the probable value of the parameter of interest. Statistical inferences are constructed by the most relevant combination of the global, local and prior. (Relevance determined by both the type of question to be answered*, which is affected by whether the research is preliminary or confirmatory, and the available analytical procedures.)

Concerns about the reliability of the data would be dealt with at the step of extrapolation from the statistical inference (which has model-bounded relevance) to the real-world scientific inference.

(Yes, the process I describe might be viewed as nebulous and under-specified, but scientific inferences should be made after thoughtful consideration of all relevant information. That cannot be turned into a singular algorithm.)

Notice that I do not advocate the “entangling” of the prior with the sampling plan and data reliability. I would expect that many forms of entanglement would lead to the posterior density being moved away from the best estimate towards a more skeptical value. Or would it involve making the prior more peaked and thus reducing the influence of the likelihood on the posterior? How would the non-prior issues be put into the prior if the prior was uninformative?

Now the likelihood principle. To me the likelihood principle should have a relatively limited scope: it applies only to inferences about the parameter on the x-axis of the likelihood function within the statistical model. Some people assert that the likelihood principle says something like “only the likelihood function can be used when making inferences about the parameter of interest”. I would advocate violating that latter form of likelihood principle more often than not, and, even with my more restricted form of the principle, the scientific inferential scheme that I prefer might lead to apparent violations.

When you say that violating the likelihood principle means that the inference is based partly on random noise, what do you have in mind?

* The types of questions come from Royall: what do these data say?; what should I do or decide now that I have these data?; what should I believe now that I have these data?

February 5, 2020

Mayo

Richard:
You can put the name of the person you’re responding to and link to their comment by copying and pasting the date corresponding to their comment, like this:
https://errorstatistics.com/2020/02/01/my-paper-p-values-on-trial-is-out-in-harvard-data-science-review/#comment-188095

Comments are numbered.
Also, as I have said, if you want me to change or wipe out a comment, I will do it, with a caveat regarding other comments that might suddenly make no sense.

February 2, 2020

Steven McKinney

“There is no consensus whether, when, or how to adjust p-values or Type I error rates for multiple testing” (Mayo (2020) p. 8)

This disingenuous quote by shady lawyers associated with the Harkonen case is shameful. Sadly we are surrounded by such disingenuous lawyering, including currently in the well of the U.S. Senate, without even a nod of disapproval by the Chief Justice of the Supreme Court.

There is no consensus on whether climate change is due to human burning of fossil fuels, because some shameless fossil fuel executives, investors and lawyers fail to concur. That doesn’t make the connection invalid: the vast majority of scientists and other honest actors concur fully that human use of fossil fuels for over 200 years is finally showing a major and undeniable impact on rising global temperatures on land and in the ocean, with all the other attendant consequences of stronger storms, heavier rains, melting sea ice and on and on.

There is no consensus that the Earth is round, because there are still a few oddball characters that enjoy peddling flat-earth tales. Just because there is no consensus does not mean that the Earth is not round.

So just because there is no consensus concerning how to adjust p-values or Type I error rates for multiple testing does not give Harkonen a hall pass to dredge data until he finds something he can dress up and apply lipstick to peddle as a “scientific” finding to investors. There is broad agreement through a wide cross-section of credentialed statistically educated scientists and researchers that the methods of Benjamini and Hochberg, for example, describe a sensible way to adjust a suite of p-values in a multiple testing situation to achieve something close to the desired Type I error rate. There are a host of other multiple comparisons adjustment procedures developed decades before Benjamini and Hochberg’s recent improved methodology to address this issue.
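For readers unfamiliar with the procedure named above, here is a minimal sketch of the Benjamini-Hochberg step-up rule. Strictly speaking it controls the false discovery rate (the expected proportion of false rejections among all rejections) rather than the family-wise Type I error rate, and the usual guarantee assumes independent or positively dependent tests. The function name and the p-values in the example are illustrative assumptions, not data from the Harkonen trial.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected at FDR level q.

    Step-up rule: sort the m p-values, find the largest rank k with
    p_(k) <= k * q / m, and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank  # keep the largest rank that passes its threshold
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected

# Made-up p-values from ten hypothetical analyses of the same data set.
example = [0.004, 0.03, 0.04, 0.20, 0.45, 0.50, 0.62, 0.71, 0.88, 0.95]
print(benjamini_hochberg(example))
```

The point of the sketch is that whether a small p-value survives adjustment depends on how many analyses were run, which is exactly the information that must be reported.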

Harkonen and his defenders deflect attention away from decades of such efforts, and the wide consensus in the statistical community that such procedures are necessary to ensure overall error rates end up in the desired range.

Now page 7 of Mayo (2020) states “Inferring from a nonstatistically significant difference (which his p =.5 clearly was) to claiming there’s zero association is a well-known fallacy”, but in a properly powered (power = 1 – Type II error rate), pre-specified clinical trial one can indeed conclude (with a known Type II error rate at the minimum difference of medical value) that there’s no effect of a size that has clinical or medical value. When a priori data is used to understand the distributional properties of the type of data under consideration, and a large enough subsequent set of data is collected that could frequently detect (detect with high power) differences of medical value, then a large p-value does allow us to claim in a non-fallacious manner that no relevant difference exists, and we do know the error rate accompanying our claim. Harkonen of course could not assert any Type II error rate when dredging the trial data at hand. He was also deceptively asserting that his data-dredged p = 0.004 finding should be understood to have occurred in an analytic framework with a Type I error rate of 5%, which of course was not the case, since the number of analyses Harkonen undertook on his route to finding this p = 0.004 was not disclosed.
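To see why the undisclosed number of analyses matters, here is a small sketch of how the chance of at least one nominally significant result grows with the number of tests. It assumes k independent tests of true null hypotheses, each run at level alpha; real post hoc subgroup analyses are usually correlated, so this is an idealization, and the function names are hypothetical.

```python
def familywise_error_rate(k, alpha=0.05):
    """P(at least one of k independent true-null tests gives p <= alpha)."""
    return 1.0 - (1.0 - alpha) ** k

def bonferroni_threshold(k, alpha=0.05):
    """Per-test cutoff that keeps the family-wise error rate at or below alpha."""
    return alpha / k

for k in (1, 5, 10, 20):
    print(
        f"k = {k:2d}: P(at least one p <= 0.05) = {familywise_error_rate(k):.3f}, "
        f"Bonferroni cutoff = {bonferroni_threshold(k):.4f}"
    )
```

Under these assumptions, ten looks at the data already push the chance of a spurious "significant" finding to roughly 40%, and a reported p = 0.004 would clear a Bonferroni cutoff only if it came from about a dozen analyses or fewer.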

Nathan Schachtman and others, in an Amicus brief discussed at this website

https://law.stanford.edu/2013/10/07/lawandbiosciences-2013-10-07-u-s-v-harkonen-should-scientists-worry-about-being-prosecuted-for-how-they-interpret-their-research-results/

state

“All branches of government depend upon access to scientific data, interpreted and evaluated by capable scientists, without fear of reprisal. The prosecution and resulting conviction in this case threaten to chill scientific speech in many important activities and contexts, to the detriment of public health.”

This argument is again disingenuous. Scott Harkonen was not acting as a scientist when he sliced and diced the data until he found a p-value of 0.004, then deceptively announced this finding while telling his trial analysts to shut up about his dredging expedition.

For a lawyer to claim that a desperate businessman seeking to deceive investors so said businessman might garner some of their money is somehow a “capable scientist” undertaking “important activities” for the betterment of public health is disingenuous. Why lawyers choose to engage in such disingenuous argument in support of obvious criminals to the detriment of our broader society mystifies me and leaves me deeply saddened.

I note that the final comment on this Stanford blog post is by one “D. Mayo” stating

This gives another, telling, perspective on the discussion we’re having on this at my blog; namely he was not really acting as a scientist but as a CEO. https://errorstatistics.com/2013/10/09/bad-statistics-crime-or-free-speech-ii-harkonen-update-phil-stat-law-stock/comment-page-1/#comment-15316

Thank you Mayo for your years of efforts to push back against so much disingenuous nonsense. I am still somewhat flummoxed that Wasserstein et al. made the unfortunate statement “whether a p-value passes any arbitrary threshold should not be considered at all”, and aghast that their opening argument references the bizarre points of view of Ziliak and McCloskey in their odd book “The cult of statistical significance . . .”, hardly a standard reference in any serious statistical arena and itself the go-to bible of cultish deniers of effective statistical methodology.

Indeed as Mayo states at the end of this well thought out paper, “The onus is on the ASA to clarify its standpoint on this important issue of statistical evidence.” There’s some cleanup that must be done in light of the unfortunate set of inappropriate suggestions arising from the Wasserstein et al. collection of essays in the American Statistician, so that such fodder never proves useful in disingenuous court arguments such as those attempting to defend the convicted stock fraudster Harkonen. This juror finds no exoneration for Harkonen’s defenders.

February 2, 2020

Mayo

Dear Steven:
Thank you so much for your great comment! I agree with everything you say, except that not all the amicus writers were lawyers; the others were statisticians. These are the same statisticians leading the campaigns to “reform” statistical science by disallowing the use of the word “significance” and precluding the use of P-value thresholds. As indicated in my caveat at the start of the paper, I admit that the question of whether what he did rose to the level of fraud is a somewhat separate issue. His defenders say that so long as ANY statistician or other practitioner in the field agrees with his construal, then it cannot be fraudulent. Maybe if Harkonen had included an expert witness prepared to defend his construal of the data, he might have gotten off.
I am, however, somewhat sympathetic to the legal issue that Schachtman was trying to press regarding what was said about the Matrixx case. It’s discussed in Appendix B in the paper and on this blog.

Anyone who doubts that leading statisticians take the standpoint Steven finds disingenuous, check this post (which includes links to others)
https://errorstatistics.com/2013/10/09/bad-statistics-crime-or-free-speech-ii-harkonen-update-phil-stat-law-stock/
or the SCOTUS links in the published version. It’s a great feature of HDSR.

February 4, 2020

Richard Gill

I said above “The likelihood function is just what it is. It’s a sufficient statistic”. I should have given it a rather more privileged status: it’s *the* minimal sufficient statistic. But this does not mean that experiments which happen to produce the same outcome of the minimal sufficient statistic should be treated in the same way. In LeCam’s (decision-theoretic) approach this is nicely expressed through the theorem that says that experiments whose sets of probability distributions of their likelihood functions are the same are essentially the same experiment, up to added noise. And this still doesn’t resolve the questions of ancillarity and conditioning, i.e., the choice of the reference set.

February 5, 2020

Mayo

Richard:
I’m replying to your comment:
https://errorstatistics.com/2020/02/01/my-paper-p-values-on-trial-is-out-in-harvard-data-science-review/#comment-188113
You say you’re frustrated WordPress won’t let you change your comment. Do some have that feature? I can’t recall if I’ve ever seen that. There is a limit to how many replies a comment can have, so, for example, I’ve run out of responses for the comment I’m replying to. I’ll see if I can change the setting.

I’m confused as to whether you’re saying you would want to obey the LP, and think anything beyond it is noise, because you are also saying the sampling distribution must be taken into account, and even that Bayesians would let it alter their priors (which is very non-standard for a Bayesian). There’s a great deal of tension here between 3 positions.*
Can I ask you to look at Excursion 1 of my book, especially Tour II? It discusses these issues. One example is that of how LPers regard stopping rules as irrelevant, at least in the kind of case I take up there. You can find the full excerpt of that portion of the book on this blog if you don’t have the book. Here it is: https://errorstatistics.com/2019/04/04/excursion-1-tour-ii-error-probing-tools-versus-logics-of-evidence-excerpt/
*One thing that I found illuminating in writing Cox and Mayo 2010 has to do with how to understand a sufficient statistic. It is sufficient only along with its sampling distribution. I’ll link to the article and paste it here.
It’s on p. 289 of this paper in Error and Inference. https://www.phil.vt.edu/dmayo/personal_website/ch%207%20cox%20&%20mayo.pdf
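As a rough illustration of why stopping rules matter to error probabilities, here is a small simulation sketch. It assumes normal data with known standard deviation 1 and a true mean of 0, and it compares a two-sided z-test run once at a fixed sample size with one run after every new observation, stopping at the first nominally significant result. The function names, sample sizes and seed are illustrative choices, not taken from the book excerpt.

```python
import math
import random

def optional_stopping_rejection_rate(max_n=100, n_sims=2000, z_crit=1.96, seed=0):
    """Share of null simulations that reach |z| >= z_crit at ANY n from 1 to max_n."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)      # one new observation; the null is true
            z = total / math.sqrt(n)          # z-statistic with known sigma = 1
            if abs(z) >= z_crit:
                rejections += 1               # "try and try again" succeeded
                break
    return rejections / n_sims

def fixed_n_rejection_rate(n=100, n_sims=2000, z_crit=1.96, seed=0):
    """Share of null simulations rejected when testing once, at the planned n."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
        if abs(total / math.sqrt(n)) >= z_crit:
            rejections += 1
    return rejections / n_sims

print("fixed n:          ", fixed_n_rejection_rate())
print("optional stopping:", optional_stopping_rejection_rate())
```

The fixed-n rate should land near the nominal 5%, while the optional-stopping rate comes out several times larger, even though the likelihood function at the stopping point is the same whichever plan produced the data.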

I welcome constructive comments that are of relevance to the post and the discussion, and discourage detours into irrelevant topics, however interesting, or unconstructive declarations that "you (or they) are just all wrong". If you want to correct or remove a comment, send me an e-mail. If readers have already replied to the comment, you may be asked to replace it to retain comprehension.

© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.