Thanks so much for your comment. There’s too much I’d want to say. But first off, I had no idea that Fisher would ever entertain post-data designation of alternatives (did he really do that?), because N-P spend a long time talking about the need for an alternative and don’t really mention that. They’re concerned with (a) the fact that one can find a way to have the same data reject or accept while still satisfying Fisher’s p-value requirement, and (b) the fact that considering the alternative enables ensuring power. They complain that some of Fisher’s tests have low power, and yet that he takes nonsignificance as corroborating the null in some sense. (Neyman’s 1956 reply is linked in my last post.)

So, I’m surprised you say that Fisher didn’t block data-dependent alternatives. What I mean is, although the formal test appears to license a data-dependent test statistic, I assumed it was an unwritten rule that Fisher precluded doing so, because that would undermine the key argument: e.g., that a result as or more statistically significant would be rare under the null. To choose a data-dependent test statistic, Fisher might say, would be to hide information. I will continue to assume that Fisherians do block such a move, because otherwise David Cox wouldn’t be so concerned with selection effects.
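To make the selection-effect worry concrete, here is a toy simulation (my own construction, not anything from Fisher or Cox): under a true null, computing two test statistics and reporting whichever looks more significant inflates the Type I error rate above the nominal 5%, even though each test on its own is valid.

```python
# Hypothetical illustration: a data-dependent choice of test statistic
# (pick whichever of two statistics gives the smaller p-value) breaks
# the nominal error rate, even though each test alone is valid.
import math
import random
import statistics

random.seed(1)

def p_from_z(z):
    # two-sided p-value for a standard-normal test statistic
    return math.erfc(abs(z) / math.sqrt(2))

n, reps, alpha = 30, 5000, 0.05
rejections = 0
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n)]            # the null is true
    z_mean = statistics.mean(x) * math.sqrt(n)            # mean-based test
    z_med = statistics.median(x) * math.sqrt(n) / 1.2533  # median-based test
    # (1.2533 ~ sqrt(pi/2), the asymptotic sd factor for the normal median)
    p = min(p_from_z(z_mean), p_from_z(z_med))            # post-data "shopping"
    rejections += p < alpha
print(rejections / reps)  # inflated above the nominal 0.05
```

The point is exactly the one attributed to Fisher above: the “result as or more significant would be rare under the null” argument presupposes the statistic was not picked to flatter the data.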

Now you raise a very different case: data dependencies in identifying the source of anomalies, whether in formally testing hypotheses, in testing model assumptions, or in entirely informal set-ups. I’ve written a lot on this over many years. There’s a difference between what might be called “explaining a known effect” and trying to infer whether there even is a genuine effect at all. The Fisherian test context we were talking about is an example of the latter. By contrast, explaining a deflection effect, or changes in the scale of photos before and after an eclipse, are examples of the former. Was it the heat of the sun? Or did Eddington get some of his jelly donut on the plates? Reliable means to pinpoint blame for known effects or anomalies post-data may be available, even if no one thought of the way heat can warp telescope mirrors in advance. There is a set of assumptions required for usable plates (for safely applying a statistical method), just as there is a set of assumptions for a usable model, and these are predesignated. If you can show the mirror distortion (or jelly donut) precludes a reliable estimate of error, then the original data do not satisfy the assumptions needed to be usable. That’s why George Barnard emphasized, specifically discussing the eclipse tests, that a Neyman-style requirement of predesignating the usable data points can’t hold up in such cases.

I’m blurring a few things, but I’m denying that all data dependent specifications invalidate the relevant error probabilities. It’s rather the opposite: to vouchsafe relevant error probabilities may require post-data specifics.

‘I personally choose the method which is most likely to be profitable when designing the experiment rather than use Prof. Fisher’s system of a posteriori choice [2] which has always seemed to me to savour rather too much of “heads I win, tails you lose”’ (p. 370).

Here Student is referring to the opinion of Fisher that, for a matched-pairs design where the result was non-significant, one could always check to see whether the two-sample t-test might not give a significant result. This is a piece of “cake and eat it” advice of Fisher’s that I have always found rather strange.
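Student’s complaint can be put in code. In this toy simulation (my construction, not Student’s; 2.262 and 2.101 are the standard two-sided 5% critical values for t with 9 and 18 degrees of freedom), the pairs are generated independently so that each test alone is a valid 5% test, yet “reject if either test rejects” has a Type I error rate above 5%.

```python
# Hypothetical illustration of "heads I win, tails you lose": trying the
# paired t-test and then falling back on the two-sample t-test gives two
# chances to reject, so the overall Type I error rate exceeds 5%.
import math
import random
import statistics

random.seed(2)

def t_paired(x, y):
    d = [a - b for a, b in zip(x, y)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

def t_two_sample(x, y):  # pooled variance, equal group sizes
    n = len(x)
    sp2 = (statistics.variance(x) + statistics.variance(y)) / 2
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(2 * sp2 / n)

n_pairs, reps = 10, 20000
T9, T18 = 2.262, 2.101  # two-sided 5% critical values for t(9) and t(18)
rej = 0
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n_pairs)]  # no treatment effect
    y = [random.gauss(0, 1) for _ in range(n_pairs)]
    rej += abs(t_paired(x, y)) > T9 or abs(t_two_sample(x, y)) > T18
print(rej / reps)  # above the nominal 0.05
```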

However, that does not mean that Neyman is let off the hook, since he substitutes an a priori choice that could have been different had another statistician been involved, a choice that implies a great deal of information is available when it is not.

Furthermore, your criticism ought to apply equally to any scheme in which models are examined for adequacy prior to carrying out a substantive test, a procedure that makes some of us uneasy, but which has its defenders. (See, for example, posts by Aris Spanos.)

One way of rescuing Fisher’s scheme from your criticism is to say that the choice for a current data set should be based on analysis of previous ones. This, in my view, has a lot to recommend it. And I have frequently suggested to pharma companies that they should consider re-analysing old trials, not with the object of changing their views on previously studied treatments, but in order to learn how to analyse better in future. See [3] for a discussion.

References

[1] Gossett WS. Comparison between balanced and random arrangements of field plots. Biometrika 1938; 29: 363–378.

[2] Fisher RA. Statistical Methods for Research Workers, §24.1 (5th ed.).

[3] Senn S. Statisticians and evidence: mote and beam. Pharmaceutical Statistics 2008; 7(3): 155–157.

I’m suspicious of the whole “magicians had to be brought in” thing, which is something that magicians like to say. Sure, Randi’s great, etc., but at the same time it’s my impression that professional magicians have a lot invested in the idea that they have special skills. Lots of us could read Daryl Bem’s papers, for example, and see all sorts of problems. No expertise in sleight of hand required. I have a feeling the same thing was the case with Uri Geller. Sure, he had some tricks, but the general principle (that he’s holding the spoons and has some method of bending them with his hands) is not so complicated. I don’t have to know exactly how Geller did it to realize that he’s not demonstrating some new principle of physics.

You write, “what the practicing scientist wants to know is what is a good test in practice.” I think you have to be careful about giving the practicing scientist what he or she wants! It’s my impression that the practicing scientist wants certainty: thus, the result of an experiment is either “statistically significant” (the result is true) or “not significant” (the result is false). Or perhaps “marginally significant,” which is really great, because then if you want the result to be true you can call it evidence in favor, and if you want the result to be false you can call it evidence against. This desire for certainty on the part of statistical consumers has historically been aided and abetted by the statistics profession.

Neyman shows that with only a null you can always reject, and complains as well that without considering “sensitivity” ahead of time, Fisher “takes it that there’s no effect” (or the like) when a non-significant result occurs, even when the test had low power. (He discusses this in the section I linked to in my previous comment.) I take it that most of the time the test statistic, for Fisher, is to come from a good estimator of the parameter in question. But there’s still the matter of sensitivity. Cox requires at least an implicit alternative. So I don’t really understand this alternative to the alternative (as far as current uses of tests go), even though I obviously see where Fisher was keen to reject the type 2 error in the way he imagined N-P had in mind. Barnard says that Fisher was happy with power and didn’t object even to a behavioristic justification of tests, so long as that wasn’t the only way they could be used. That seems right.
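Neyman’s “sensitivity” point is easy to quantify. A sketch (my own illustrative numbers, assuming a one-sided z-test with known unit variance): at a true effect of 0.3 standard deviations, a small study has so little power that a non-significant result says almost nothing against the null being false.

```python
# Power of a one-sided z-test of H0: mu = 0 at alpha = 0.05, unit sd.
# With low power, "non-significant" cannot be read as "no effect".
import math

def phi(z):
    # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power_one_sided_z(delta, n, z_alpha=1.645):
    # probability of rejecting when the true mean is delta
    return 1 - phi(z_alpha - delta * math.sqrt(n))

for n in (10, 50, 200):
    print(n, round(power_one_sided_z(0.3, n), 2))  # power grows with n
```

At n = 10 the power is well under a third, which is exactly the situation in which Neyman objected to reading nonsignificance as support for the null.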

Thanks.

Mayo

“The present book attempts to fill this gap by promoting what Hampel (2006) calls the original and correct fiducial argument (Fisher, 1930, 1973), as opposed to Fisher’s later incorrect fiducial theory. The second decade of the second millennium is witnessing a renewed interest in fiducial analysis (see, e.g., Hannig [2009] and references therein) and in the related concept of confidence distribution (see e.g. the review and discussion paper Xie and Singh [2013]).”


*http://www.cambridge.org/catalogue/catalogue.asp?isbn=9780521861601

There were a few posts:

The other noteworthy and surprising thing, is that Fisher is still adhering to the idea that probabilistic instantiation is a legitimate deductive move, and castigating Neyman for not seeing this. This is like 20 years after the fiducial argument was being puzzled over, if not refuted. This bothers me, because it makes me question some of Fisher’s best insights. It’s extremely noteworthy, as well, that Neyman is still having trouble explaining what goes wrong with such an instantiation.

People are reluctant to get into the fiducial business in interpreting the Neyman-Fisher dispute all those years, but I’ve realized in the past couple of years that this is a big mistake. Lehmann, for example, says we can discuss Fisher & Neyman without getting into that, but the arguments between them are highly distorted as a result. Why does it matter? It shouldn’t. But people have taken to heart the idea that Fisherian p-values are inductive, and N-P error probabilities are behavioristic. Since the latter are assumed irrelevant to inference, people are taught p-values without alternative hypotheses. Power is thrown in, and the inconsistent hybrid is born. And on it goes…

Lehmann, E. L. 1993. “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” Journal of the American Statistical Association 88 (424): 1242–1249.

A relevant post with links is here:

https://errorstatistics.com/2015/11/20/erich-lehmann-neyman-pearson-vs-fisher-on-p-values/

“…To summarize, p values, fixed-level significance statements, conditioning, and power considerations can be combined into a unified approach. When long-term power and conditioning are in conflict, specification of the appropriate frame of reference takes priority, because it determines the meaning of the probability statements. A fundamental gap in the theory is the lack of clear principles for selecting the appropriate framework. Additional work in this area will have to come to terms with the fact that the decision in any particular situation must be based not only on abstract principles but also on contextual aspects.”

He opts for “conditioning” over power in scientific contexts, but I think some will take issue with why the problem is even put this way. The relevant framework should have been considered at the start, some may say, rather than “conditioning” afterward.

But the most notable upshot is that, far from there being a great catastrophic foundational problem, the main spokesperson for N-P seems to be saying “no big deal”: it depends on your question, problem, and goal.

https://errorstatistics.files.wordpress.com/2014/06/lehmann_1993.pdf
