A comment today by Stephen Senn leads me to post the last few sentences of my (2010) paper with David Cox, “Frequentist Statistics as a Theory of Inductive Inference”:
“A fundamental tenet of the conception of inductive learning most at home with the frequentist philosophy is that inductive inference requires building up incisive arguments and inferences by putting together several different piece-meal results; we have set out considerations to guide these pieces[i]. Although the complexity of the issues makes it more difficult to set out neatly, as, for example, one could by imagining that a single algorithm encompasses the whole of inductive inference, the payoff is an account that approaches the kind of arguments that scientists build up in order to obtain reliable knowledge and understanding of a field.” (273)[ii]
A reread for Saturday night?
[i]The pieces hang together by dint of the rationale growing out of a severity criterion (or something akin, but using a different term).
[ii]Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos, eds.), Cambridge: Cambridge University Press, pp. 1-27. The paper first appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.
Dr. Mayo,
Do you believe that if two different statisticians use the exact same evidence to answer a question, and break the problem down into different but both legitimate (by whatever criterion you favor) piecemeal steps, then they should, when they put it all together, get the same answer? Or at least, answers that don’t contradict each other?
If “no”, then on what basis can their answers be “objective” when the conclusions depend on how the analysis was broken down into pieces?
If “yes”, then do you believe Error Statistics, and in particular SEV, has this characteristic?
Or alternatively, do you think it’s worth looking at the mathematical consequences of forcing answers from both legitimate “paths” to be consistent with each other?
Anon: I don’t know what you mean by “forcing answers from both legitimate ‘paths’ to be consistent.” But objectivity, on my view, isn’t a matter of getting the same answer, even in the mythical situation of having the same evidence.
For example, two analysts are investigating H_0. They make the same modeling assumptions and use the same data D.
However, analyst A uses tests T_1, T_2, T_3, while analyst B uses tests T_4 and T_5. Assume they carry out the tests correctly. Call these different sequences of tests “different research paths”.
If analyst A says “my work shows there’s strong evidence for H_0”, while analyst B says “my work shows there’s strong evidence against H_0” then these results aren’t consistent.
In this case, the inconsistency doesn’t arise from different data or modeling assumptions. It arises purely because they chose different sets of piecemeal analyses.
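To make this concrete, here is a minimal sketch of the kind of thing I have in mind (the data, the fully specified N(0,1) null, and the particular pair of tests are purely illustrative):

```python
# Illustrative sketch: same data D, same model assumption, two legitimate
# exact tests of the simple null H0: X_i ~ N(0, 1), opposite verdicts.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
# Hypothetical data: the mean is fine, but the spread is well below 1.
D = rng.normal(loc=0.0, scale=0.3, size=n)

# Analyst A: z-test on the mean (statistic sensitive to location shifts).
z = np.sqrt(n) * D.mean()                # ~ N(0, 1) under H0
p_A = 2 * stats.norm.sf(abs(z))

# Analyst B: two-sided chi-square test on the sum of squares (sensitive to scale).
ss = np.sum(D**2)                        # ~ chi-square(n) under H0
p_B = 2 * min(stats.chi2.cdf(ss, df=n), stats.chi2.sf(ss, df=n))

print(f"Analyst A (mean test):  p = {p_A:.3f}")   # large: reads as 'consistent with H0'
print(f"Analyst B (scale test): p = {p_B:.3g}")   # tiny: 'strong evidence against H0'
```

Both analysts use the same data and the same modeling assumptions; only the choice of test statistic differs, yet A reports no evidence against H_0 while B reports strong evidence against it.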
So my questions are:
A: If this can happen, do you think it’s a problem?
B: Do you think SEV has this problem?
C: If it is a problem and we agree to only look at methods that don’t have this flaw, then what general form must those methods have?
Anon: I think this is a mindset that directs one to consider “methods” and “rules” that are on automatic pilot. We are forced to “use the old noodle,” as Le Cam put it. I think it would be a good idea for people to consider how non-statistical science makes progress at the frontiers. (I began this blog talking about prions and Mad Cow; one or two parts involved statistical questions, such as the rates of infection, but the bulk of it did not.) In solving various problems, disparate answers arise.
The severity approach, don’t forget, always requires an inference or assessment about what was not probed severely. That’s an important part of adjudicating/understanding disparate answers. The key is to be able to understand why apparently different answers to the same problem arise.
My point has nothing to do with “rules” that put inference on automatic pilot.
If what I describe can happen, then every researcher at the outset faces the very practical problem of deciding which piecemeal path to take. Their arbitrary choice can dramatically affect the answer in that case.
Your only practical advice to someone facing that problem is to tell them science is messy.
Anon: I’m not giving practical advice to someone. If I were, I’d deny we should limit ourselves to methods where approaching a problem from different paths couldn’t lead to disparate results. As a general rule, that would be silly.
But I take it you’re actually interested in a specific case where one approach is found defective from the perspective of given goals and criteria. And further, you know why. It’s not fashion; you can show, I presume, that one approach gets it wrong. Good. That’s what I mean by having a rationale.
No, I’m not at all interested in “one approach is found defective from the perspective of given goals and criteria”; I have no idea how you got that from my comments. It couldn’t be further from what I’m aiming at.
“I’d deny we should limit ourselves to methods where approaching a problem from different paths couldn’t lead to disparate results. As a general rule, that would be silly.”
So you think, as a matter of principle, that given the same data and modeling assumptions, and without using any other information, it’s OK if one choice of test statistic strongly confirms H_0 while another strongly discredits it.
How much of Error Statistics depends on this principle?
Anon: Straw man! And it comes at the perfect time, as I’m busy writing up the final exam for my critical thinking class.
If there’s an example where this allegedly occurs, I’d be glad to have a look later on.
Are you willing to stake the reputation of Error Statistics on the claim this doesn’t happen?
I only ask because it’s easy to find examples of it. I’m shocked you hadn’t discovered this already in your applied work utilizing Severity. Unfortunately, if you do only consider methods consistent in this way, it leads naturally toward the Bayesian formalism.
Anon:
We are prepared to allow anonymous comments, so long as that isn’t working too badly, but we think it’s important to prevent an unconstructive turn, particularly when it’s entirely unnecessary, as when it’s not at all clear what’s being alleged. You’re shocked that I hadn’t discovered that approaching a problem from different paths could lead to disparate results? But I never said that; quite the opposite. I said I’d deny we should limit ourselves to methods where approaching a problem from different paths couldn’t lead to disparate results.
Tests with different properties, for just one example, yield different results. Understanding the properties of the tests allows one to understand why, and to scrutinize critically which inferences are warranted.
But I would also deny that an error statistician who is consistent is led naturally toward assigning prior probabilities to hypotheses in order to go down a Bayesian path. So, basically, I think we made a wrong turn in understanding somewhere, but I’m unable to write further in reply.
Dr Mayo: You missed the fallacy of complex question: How much of your book depends on a principle leading to diametrically opposite inductive inferences? Little bit of humour and ridicule too. This sure beats studying.
Student: Go back to studying; now I can’t include this on the test.
Neyman proved that with a single test hypothesis, there are test statistics x, y such that when |x| is large, |y| is small. That’s why he and Pearson introduced the alternative to the null.
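One way to exhibit such a pair is via the probability integral transform (a sketch only, not Neyman’s own argument): for a fully specified null, any exact test statistic can be converted into a second exact statistic that reverses its ordering of the samples.

```python
# Sketch: two exact test statistics for the same simple null H0: X_i ~ N(0, 1)
# that order samples in opposite ways (when |x| is large, y is small).
import numpy as np
from scipy import stats

def G(t):
    """Null CDF of |x| (half-normal) when x ~ N(0, 1)."""
    return 2 * stats.norm.cdf(t) - 1

def Ginv(u):
    """Quantile function of that half-normal distribution."""
    return stats.norm.ppf((1 + u) / 2)

rng = np.random.default_rng(1)
n = 30
data = rng.normal(loc=1.0, scale=1.0, size=n)    # hypothetical sample

x = np.sqrt(n) * data.mean()                     # standardized mean, N(0,1) under H0
y = Ginv(1 - G(abs(x)))                          # same null distribution as |x|, reversed ordering

p_x = 1 - G(abs(x))    # small exactly when |x| is large
p_y = 1 - G(y)         # small exactly when |x| is small
print(f"|x| = {abs(x):.2f}, y = {y:.2g}, p_x = {p_x:.3g}, p_y = {p_y:.3g}")
```

Both statistics have the correct null distribution, so “reject when |x| is large” and “reject when y is large” are each exact level-alpha tests of H0, yet for any usual alpha they reject on disjoint regions (the largest versus the smallest values of |x|); only by specifying which departures matter, i.e. an alternative, is there anything to pick out the relevant ordering.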
E. Berk: True. Of course I spoze it’s possible to always say that approaching a problem from one path rather than another reflects different information, but that’s a stretch. It’s not just that different results are to be expected with different designs, questions and whatnot; it’s that this is necessary in order to learn, empirically, the different properties of the tools, and perhaps reject one route in favor of another (as in Senn’s example, which started my whole post). Aside from that, there’s (literally) the role of chance. They used one telescope at Principe and another at Sobral (in the 1919 eclipse results), and all kinds of accidents wound up making some plates more informative. Others needed to be scrapped entirely (for good reason). Usable sample size is a post-data affair as well.
Your interlocutor wrote: “the inconsistency doesn’t arise from different data or modeling assumptions”, which is incorrect in these cases.
E. Berk: Clearly so. I think he’s just trying to rattle my chain.
I had started to challenge the interlocutor to provide an example, a real one; I was wondering what world this person was living in. But I was traveling, lost the wireless connection in an airport, and arrived home to see that perhaps it was something of a joke? Perhaps another attempt at a howler?
John: Well, it would have been good if you had (jumped in). No, it was definitely not a joke.
Anon (and the others): Two different tests can give different results, but taken together they give a fuller picture of what is going on (although one needs to worry about multiple testing when putting too many tests together).
I think that it’s very important here to distinguish between situations in which a binary decision is needed in finite time, and situations where it is about collecting more and more knowledge, “finding things out”, but always leaving open the possibility of criticising what once was seen as “secured” knowledge.
In decision setups, people using different but equally legitimate tests may make conflicting decisions.
In “collecting information” setups, different tests computed on the same data give different bits of information that can only be seen as contradictory if over-interpreted.
Keep in mind that a non-rejection never means that the null is true. Even if a test has high severity (and may therefore be seen as positive confirmation of an H0 if not significant), it can only distinguish the H0 from specific alternatives, and there are always further alternatives open with the potential to create a consistent story out of all the apparently contradictory test results.
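To make the last point concrete, here is a minimal sketch (assuming the standard one-sided Normal test of H0: mu <= mu0 with known sigma, in the style of the usual post-data severity computation; the numbers are purely illustrative) of how the same non-significant result severely rules out only sufficiently large discrepancies from the null, not smaller ones:

```python
# Sketch: post-data severity for claims 'mu <= mu1' after a non-significant
# result in the one-sided test of H0: mu <= mu0 vs H1: mu > mu0, sigma known.
import numpy as np
from scipy import stats

mu0, sigma, n = 0.0, 1.0, 100
xbar = 0.1                          # observed mean; d(x0) = (xbar - mu0)/(sigma/sqrt(n)) = 1.0
se = sigma / np.sqrt(n)

def sev_mu_le(mu1):
    """SEV(mu <= mu1) = P(Xbar > xbar; mu = mu1) for the non-significant result."""
    return stats.norm.sf((xbar - mu1) / se)

for mu1 in [0.1, 0.2, 0.3, 0.5]:
    print(f"SEV(mu <= {mu1}) = {sev_mu_le(mu1):.3f}")
# The values rise from 0.5 (mu1 = 0.1) toward 1 (mu1 = 0.5): the non-significant
# result warrants ruling out the larger discrepancies but not the smaller ones.
```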
Christian: Thanks for this sane reflection.