But anyone who does want to read about Harkonen can search this blog. (I wonder what company he’s in nowadays.)

The Harkonen case has a complicated fact set, not easily summarized here. But to make it very short, he was prosecuted for wire fraud, for reporting a clinical trial that, in his words, demonstrated a survival benefit in a subgroup that was not prespecified. The overall mortality benefit was p = 0.08, which shrank to 0.055 on a per protocol analysis. So, very close even without data dredging. And there was a prior independent trial that showed benefit, with a very low p-value. I conceded that Harkonen’s practice in not revealing that the analysis was not prespecified was poor practice, but hardly a criminal fraud, especially considering the complexity of his speaking about multiple trials. In some of our past exchanges, Mayo took the position that Harkonen’s analysis was indefensible, and I had responded, legally, with a demurrer: poor practice, which is all too common, but too close to the line to put the man in prison unless we wanted to throw a LOT of scientists in the hoosegow as well.

Nathan

*I’m more inclined than he seems to be to call violators “fraudfeasors”–a good term!

*It does, however, make it doubly ironic that the case should have been referred to in the goofy Matrixx case. Error upon error. Don’t these Supreme Court members have huge staffs to look up relevant precedents? Perhaps since it was “obiter dicta” (or however that goes) no one cared much, but surely there were examples of egregious side effects shown without resort to statistics. I don’t want to confuse the current discussion with that case, so I withdraw my Wells question, which was just curiosity.

and parts 2 through 6 of this series, and other posts as well. Professor Gastwirth wrote an article about the case, in which he seemed to try to say it wasn’t so off the wall, but I think he missed some really important points. The remarkable thing, of course, is that the case would not (or should not) have survived Daubert/Rule 702 review if decided today, but the Solicitor General and then the Supreme Court cited the case in 2011 as though it were still good law.

Good morning! Not too much violence to my original post, but you do challenge me on a few topics. As for my attempt to distinguish the Harkonen case and some of the other legal cases, I know I have not, to date, persuaded you. We lawyers reserve “guilty” for those who have committed crimes. For me, when many NIH-funded researchers publish articles with subgroup analyses not prespecified in their protocols and not identified as post hoc analyses, and when federal government researchers at NIH tout non-prespecified outcomes in RCTs, again without notice that the outcomes were not prespecified, and when the US gov’t takes the position in another case before the Supreme Court (Matrixx Initiatives v. Siracusano) that statistical significance is not necessary to “demonstrate” causation, then I think the government loses its moral, legal, and scientific standing to prosecute someone like Dr Harkonen.

My fellow amici may have had different goals, but I was never trying to advance Harkonen’s approach as “best practice” or the like. I would even heartily agree that such a practice should lead to Harkonen’s opinion’s being excluded if offered as testimony in litigation, but I don’t believe it is a fraud.

Why not? First, Harkonen had other information about efficacy. The 1999 Austrian RCT of interferon gamma 1b showed efficacy, and was published in the NEJM. The Austrian RCT was then continued, again with a strong showing of benefit for the therapy arm of the trial. The specific trial that Harkonen had new data on had a showing of survival benefit (a prespecified outcome), with large “effect” size, at p = 0.08, which shrank to 0.055 when the data were analyzed for compliance with the protocol. And when the researchers published their data with time-to-event analyses, the hazard ratio for the entire cohort was well below 1.0, with p < 0.05.

The "offending" subgroup analysis, admittedly non-prespecified, had a p = 0.004, and the gov't never showed, as was its burden, that this p-value would have been inflated above 5% if adjusted. Nor did the gov't show, despite the multiple attempts to define the "right" subgroup, that the reported subgroup of mild to moderate cases of disease was clinically implausible, or that the magnitude of the mortality benefit was clinically insignificant.
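To make the adjustment point concrete, here is a rough illustration of how the reported p = 0.004 behaves under a simple Bonferroni multiplicity correction. The number of candidate subgroups k is an assumption on my part; the record does not fix it.

```python
# Hypothetical sketch: Bonferroni adjustment of the post hoc subgroup p-value.
# Only p = 0.004 comes from the case record; the counts k are invented.
p_raw = 0.004

for k in (5, 10, 12):
    p_adj = min(1.0, p_raw * k)  # Bonferroni: multiply by the number of tests
    print(f"k={k:2d}  adjusted p={p_adj:.3f}  still below 0.05: {p_adj < 0.05}")
```

Even assuming a dozen candidate subgroups were tried, the adjusted value (0.048) stays below the conventional 5% level, which is consistent with the argument that the gov't never showed inflation above that threshold.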

Anyway, a longer story than I can bang out here in comments, but I would be happy to share my amicus brief with anyone who wants to read it.

As for the "notorious" Wells case, I have blogged extensively about it. If you like, I can provide links. The case was cited by the gov't in the Matrixx case, and then again by the Supreme Court in Matrixx, as an example of how causation decisions can be made without statistical significance, but as I have pointed out, plaintiffs' expert witnesses in Wells actually had studies that showed at least "nominal" significance. The problem was multiple testing (both announced and covert – but no one was prosecuted), and confounding by multiple exposures (arsenical spermicides, known to be genotoxic, were included in some of the studies, but were not part of the exposure claimed to have caused the birth defects in the specific case).

So there you have it for Sunday morning. A response on Harkonen, and an elaboration on Wells. I will try to respond to other points, and answer questions if I can.

Nathan

look at the data, decide upon some location-scale model and (speculatively) identify the amount of copper with the location parameter mu. Exceeding the legal limit is translated into the null hypothesis H_0: mu > 2.03. How does the statistician choose a model? One can take the Pratt approach: the observations `look normally distributed’ (they do!), so use the normal model. Having decided on the normal model, one does the best one can under that model (behaving as if true) and uses the mean to `estimate’ mu. I take this to be as severe as possible under the model. But as the observations also `look Laplace distributed’ (they do!), the statistician can use the Laplace distribution, do the best under this model and `estimate’ mu by the median. This is even more severe than the mean. Finally, the observations also `look comb* distributed’ (they do!), so the statistician can use the comb distribution, do the best under this model and `estimate’ mu by maximum likelihood. This is more severe than the mean and the median, and much more so. The three 95% confidence intervals are [1.970, 2.062], [1.989, 2.071] and [2.0248, 2.0256] respectively. The latter confidence interval is not a typing error; it really is that small. A Bayesian approach is of no help: the posterior for mu under the comb model is essentially concentrated on the confidence interval. Severity or efficiency can be imported from the model. Tukey called this a free lunch, which in his experience does not exist in statistics, and said that bland or hornless models should be used, for example minimum Fisher information models. This is what I mean by regularization. The normal model is bland; the comb model is not. The modus operandi `behave as if true’ does not commit the statistician to actually believing that the model is true; it simply means that for the purpose of the analysis the statistician behaves as if this were so. Instead of calculating confidence intervals as above, I suggest the use of approximation intervals, which specify those values of the parameter which are consistent with the data. For the normal model, consistency is based on the Kolmogorov (or better, the Kuiper) distance of the data from the model, on the absence of outliers, on the skewness of the data and on the values of the mean and standard deviation. You can do this also for the comb model, and the two approximation intervals are about the same. This means that the EDA part of the analysis is included in the specification of the parameters. A small approximation interval can be an indication of lack of fit. What one does not do is declare the normal model adequate, move into the `behave as if true’ mode, base the rest of the analysis on the optimal estimators, and forget completely about the results of the EDA phase.

*comb distribution defined in my book
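A minimal numerical sketch of the mean-versus-median contrast described above. The copper readings here are invented; only the legal limit 2.03 comes from the comment, and the comb model is omitted since it is defined in the book.

```python
import numpy as np

# Invented readings near the legal limit 2.03; the real data are not given.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.02, scale=0.05, size=27)
n = x.size

# MLE of the location mu: the sample mean under the normal model,
# the sample median under the Laplace model.
mu_normal = x.mean()
mu_laplace = float(np.median(x))

# Rough large-sample 95% interval under the normal model.
se = x.std(ddof=1) / np.sqrt(n)
ci = (mu_normal - 1.96 * se, mu_normal + 1.96 * se)
print(f"mean={mu_normal:.4f}  median={mu_laplace:.4f}  "
      f"CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

Both estimators land close together on data like these; the comment's point is that a sufficiently irregular model (the comb) can manufacture a far narrower interval from the same observations.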

I’ll take a stab at this. The point of the first paper is that in a Bayesian analysis, the inference is no longer weakly powered as it was in the hypothesis test.

I don’t think it’s fair to say all the additional information is coming from the prior in a Bayesian analysis. In multilevel models, the information comes from the model structure and the ability to, say, make a single inference from multiple measurements.

I think there’s also a bit of confusion regarding the 2014 paper, as many people seem to interpret the message as, “Gelman says we should be using multiple comparison adjustments after all!” My read of it is that, conditional on doing weakly powered tests (with many subjective decision points), multiple comparisons matter. However, my preferred approach is still to use the structure of the problem so as to avoid doing weakly powered analyses in the first place.
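A toy sketch of the partial-pooling idea behind multilevel models mentioned above (all numbers are invented): each group estimate borrows strength from the others by shrinking toward the grand mean, by a factor set by the within- and between-group variances.

```python
# Hypothetical partial pooling across four groups; values are invented.
group_means = [0.30, 0.10, -0.05, 0.22]
sigma2_within = 0.04   # assumed sampling variance of each group mean
tau2_between = 0.01    # assumed between-group variance

grand = sum(group_means) / len(group_means)
# Shrinkage factor: 0 would mean complete pooling, 1 no pooling at all.
shrink = tau2_between / (tau2_between + sigma2_within)
pooled = [grand + shrink * (m - grand) for m in group_means]
print(f"grand mean={grand:.4f}  shrinkage={shrink:.2f}")
print([round(p, 4) for p in pooled])
```

The shrunken estimates sit between each raw group mean and the grand mean, which is why structured inference of this kind reduces the need for post hoc multiplicity corrections on many weakly powered per-group tests.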

BTW, thank you for the generous offer in the other thread. I will buy the e-book. Still have a backlog of reading material though (slowly making my way through Senn’s book, for one thing)

as if true’ (this is much stronger than fixing a model)”. Even the “behavioristic” model of Neyman identified the “act” with a very specific behavior, e.g., declare there’s an indication of a discrepancy; deny we can warrant ‘confirming’ that the discrepancy from the null is less than an amount against which the test has low power (Neyman’s “power analysis”); publish/do not publish a result; infer the experimental effect of interest is unable to be brought about at will; decide the randomization assumption has been violated (as with cloud seeding); infer the model is adequate for probing such and such a question, but not another; report the data indicate doubts that the gold has been adulterated by more than x%; report the data indicate the skull’s age is approx m ± e, etc., etc.

Howson and Urbach spread the view that for Neyman and Pearson one is instructed to bet all one’s worldly goods on the result of a single statistical test–among other howlers, like denying there’s such a thing as evidence, learning, or inference. Neyman never regarded statistical models as more than what he called “picturesque descriptions” of a given aspect of a situation, more or less adequate for tackling a given problem.

Having said all that, what do you mean by “A theory of error statistics requires a discussion of regularization”?

precise description of my attitude to EDA and formal statistical inference. The modus operandi of EDA is tentative and probing. It is distribution-function based, with a weak topology typified by the Kolmogorov metric. The modus operandi of formal inference is `behave as if true’ (this is much stronger than fixing a model). It is density based, with a strong topology typified by the total variation metric. There is therefore a double break when moving from EDA to formal inference. The distribution function and the density function are linked by the pathologically discontinuous differential operator. This discontinuity makes all discussions about the likelihood principle pointless. Likelihood requires truth; the adequacy of a model is not sufficient. Once in the `behave as if true’ mode, the means by which one came to be in possession of the truth, foul or fair, are irrelevant, which is why (Steven McKinney) `exploratory findings’ are seldom mentioned. This applies equally well to Bayesians and frequentists. I argue for one phase, essentially the EDA phase, for one topology, and for treating models consistently as approximations. This may at first glance seem innocuous, but it is not. The Bayesian paradigm requires truth. Two different models cannot both be true, which via a Dutch book argument leads to the additivity of Bayesian priors. However, two different models can both be adequate approximations. There is no exclusion and consequently no additivity. For the frequentists there are no `true parameter values’ and so no confidence intervals which cover them. For approximate models, an appropriately defined P-value is a measure of how weak the concept of adequate approximation has to be before the model can be regarded as an adequate approximation. Finally, the pathological discontinuity of the differential operator means that it is possible to have arbitrarily severe models which fit the data. A theory of error statistics requires a discussion of regularization.
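A rough sketch of how an approximation interval in the Kolmogorov metric might be computed. The data, the cutoff 1.36/sqrt(n), and the search grid are all invented for illustration; the definition in the comment also involves outliers, skewness, and the mean and standard deviation, which this sketch omits.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def kolmogorov_distance(x, mu, s):
    """Sup distance between the empirical CDF of x and the N(mu, s^2) CDF."""
    xs = np.sort(x)
    n = xs.size
    F = np.array([normal_cdf((v - mu) / s) for v in xs])
    i = np.arange(1, n + 1)
    return float(max((i / n - F).max(), (F - (i - 1) / n).max()))

# Invented data near the copper example's legal limit 2.03.
rng = np.random.default_rng(1)
x = rng.normal(2.02, 0.05, 50)
s = x.std(ddof=1)
d_max = 1.36 / math.sqrt(x.size)  # conventional ~95% Kolmogorov-Smirnov cutoff

# Approximation interval for mu: those values under which the normal model
# remains an adequate approximation to the data in the Kolmogorov metric.
grid = np.linspace(x.mean() - 0.05, x.mean() + 0.05, 201)
ok = [mu for mu in grid if kolmogorov_distance(x, mu, s) <= d_max]
print(f"approximation interval for mu: [{min(ok):.4f}, {max(ok):.4f}]")
```

Unlike a confidence interval derived by behaving as if the model were true, this interval reports which parameter values keep the model adequate for the data, so the EDA check is built into the parameter specification itself.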