Error statistics doesn’t blame for possible future crimes of QRPs (ii)

A seminal controversy in statistical inference is whether error probabilities associated with an inference method are evidentially relevant once the data are in hand. Frequentist error statisticians say yes; Bayesians say no. A “no” answer goes hand in hand with holding the Likelihood Principle (LP), which follows from inference by Bayes’s theorem. A “yes” answer violates the LP (also called the strong LP). Error probabilities drop out under the LP because, according to it, all the evidence from the data is contained in the likelihood ratios (at least for inference within a statistical model). For the error statistician, likelihood ratios are merely measures of comparative fit, and they omit crucial information about reliability. A dramatic illustration of this disagreement involves optional stopping, and it’s the one to which Roderick Little turns in the chapter “Do you like the likelihood principle?” of his new book, which I cite in my last post.

Bayesians criticize error statisticians for taking into account error probabilities, which consider outcomes other than the one observed. That’s because their account “conditions” on the observed data. While this line of criticism is familiar, I noticed something very odd in the passages Roderick Little cites from Berger and Wolpert’s Likelihood Principle (chapter 4) that are intended to describe the error statistical position. You can find Berger and Wolpert (1988) here. First consider the example.

We are to consider Savage’s (1962) famous example: two-sided testing of a Normal mean with known σ: H0: μ = 0 against H1: μ ≠ 0 (no effect vs. some effect). If the researcher keeps sampling until reaching a “nominally” significant result, ‘then with probability 1 he will eventually reject any sharp null hypothesis, even though it be true’ (Edwards, Lindman, and Savage [1963], p. 239), hereafter ELS.(1) Nevertheless, ELS aver that ‘the import of the sequence of n data actually observed will be exactly the same as it would be had you planned to take exactly n observations in the first place’ (ibid., pp. 238-9), at least for Bayesians. This is called the Stopping Rule Principle, and it follows directly from the LP.
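To see what is at stake numerically, here is a minimal simulation sketch. It is my own illustration, not from ELS or Little, and the sample-size caps, number of simulated trials, and seed are arbitrary choices: we sample from a Normal with μ = 0 (so H0 is true) and known σ = 1, test at the nominal .05 level after every new observation, and stop at the first “nominally” significant result. The proportion of runs that reject the true null climbs well above .05 as the number of allowed looks grows, in line with the ELS claim that it tends to 1.

```python
# Minimal sketch of Savage's optional stopping example (illustrative only).
import numpy as np

rng = np.random.default_rng(1)

def optional_stopping_reject_rate(n_max, n_sims=2000, z_crit=1.96):
    """Proportion of simulated trials that reach nominal .05 significance
    at some look, testing after every observation, when H0: mu = 0 is true."""
    rejections = 0
    for _ in range(n_sims):
        x = rng.standard_normal(n_max)                  # mu = 0, sigma = 1 known
        n = np.arange(1, n_max + 1)
        z = np.cumsum(x) / np.sqrt(n)                   # z = sqrt(n) * xbar / sigma
        if np.any(np.abs(z) >= z_crit):                 # nominal .05 reached somewhere
            rejections += 1
    return rejections / n_sims

for n_max in (10, 100, 1000):
    print(n_max, optional_stopping_reject_rate(n_max))
# The rejection rate of the true null is well above the nominal .05 and keeps
# climbing toward 1 as the allowed number of looks (n_max) grows.
```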

For the error statistician, the actual significance level in a sequential trial is the probability of finding a nominally .05 significant result at some stopping point or other, up to the point at which sampling stops: the P-value accumulates. To be clear, sequential trials are regularly used by error statisticians, but the multiplicity must be taken into account (Wald 1947, Armitage 1975). Pre-data, this ensures error probability control; post-data, it ensures that the error statistical assessment, e.g., the P-value report, is a relevant assessment of the warranted inference.(2) Ignoring the stopping rule, as Berger and Wolpert’s Bayesian recommends, would make it easy to infer evidence of an effect erroneously.
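As a hedged illustration of “the P-value accumulates” (my sketch, not from the post; the five equally spaced looks and the 1.96 cutoff are illustrative assumptions), the actual significance level of a design with interim analyses can be estimated by simulating the probability, under H0, of a nominally significant result at some look or other:

```python
# Estimate the *actual* significance level of a design with interim looks,
# each tested at the nominal .05 level (illustrative sketch).
import numpy as np

rng = np.random.default_rng(2)

def actual_level(looks, nominal_z=1.96, n_sims=100_000):
    n_max = looks[-1]
    x = rng.standard_normal((n_sims, n_max))      # H0 true: mu = 0, sigma = 1
    csum = np.cumsum(x, axis=1)
    reject_any = np.zeros(n_sims, dtype=bool)
    for n in looks:
        z_n = csum[:, n - 1] / np.sqrt(n)         # z statistic at look n
        reject_any |= np.abs(z_n) >= nominal_z
    return reject_any.mean()

print(actual_level([20, 40, 60, 80, 100]))  # roughly 0.14 with these five looks, not .05
```

With five looks the overall probability of a nominally significant result under H0 is roughly 0.14; this is the multiplicity that a properly designed sequential trial (Wald 1947, Armitage 1975) adjusts for.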

Suppose the trial had a predesignated sample size of n = 100, and the resulting data yield a P-value of .02. It would be wrong to claim that the error statistician would need to report a much larger P-value, perhaps even 1, because the researcher, if confronted with a nonsignificant result, might have committed a QRP: the researcher might have kept sampling until reaching a preset (nominal) P-value, say .05 (corresponding to a 2 SE difference in either direction), and then, hiding the fact that this resulted from trying and trying again, gone ahead and reported the actual P-value as .05. According to this wrong-headed view of error statistical reasoning, where the possibility of future guilt is claimed to enter in assessing the attained data, the legitimate P-value of .02, arrived at fair and square, would no longer be legitimate. It would have to balloon to 1.(3) That, at any rate, appears to be the message of the Berger and Wolpert passages (pp. 74-5) that Little cites.
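As a quick check of the numbers in this example (the snippet is mine and merely unpacks the arithmetic; it assumes scipy is available): a 2 SE two-sided departure corresponds to a nominal P-value of about .046, roughly .05, while the attained P-value of .02 in the fixed n = 100 design corresponds to |z| of about 2.33.

```python
# Arithmetic check of the .05 (2 SE) and .02 figures in this example.
from scipy.stats import norm

p_two_SE = 2 * norm.sf(2.0)      # two-sided P-value for |z| = 2: about 0.0455
z_for_p02 = norm.isf(0.02 / 2)   # |z| whose two-sided P-value is .02: about 2.33
print(p_two_SE, z_for_p02)
```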

Berger and Wolpert go on to speculate that, since continuing to sample might depend on successfully getting a grant, the probability of this occurring would also have to enter into a proper error-statistical assessment of the P-value attained. But this is crazy, and not at all what the error statistician requires in taking account of error probabilities. One would be inclined simply to dismiss such appeals to ridicule and humor, were they not taken to heart by critics of frequentist error statistics. (For a discussion of optional stopping in SIST, Mayo 2018, see this link or search this blog.)

Consider how an analogous criticism could be lodged against the error statistical requirement of predesignating endpoints in clinical trials. The predesignated P-value, having been met by an honest tester, could be dismissed on the accusation that, in some possible world, finding non-significance might have led the researcher to dredge subgroups and engage in other forms of slicing and dicing the data. This consideration of future guilt would, according to Berger and Wolpert, invalidate the attained P-value. With open-ended dredging (data “torture”), no error probability can be assigned. So if we take their construal of the error statistical standpoint seriously, it’s not clear how a valid P-value could ever be attained.

Now Berger and Wolpert might claim that their argument is intended only for researchers who, once pressured, concede that they might have been tempted to cheat. The argument still fails. Such a concession might make us more inclined to audit that researcher stringently in the future, but it would not alter the warrant properly accorded to the inference actually reached.

By the way, you might wonder why my picture is on the cover of Little’s book (along with Fisher, Neyman, and others). Little takes up Birnbaum’s (1962) argument claiming that acceptable error statistical principles lead to the Likelihood Principle, and he discusses my 2014 criticism of it in the journal Statistical Science. You can find that article, along with a discussion by several statisticians, here. Little does not seem convinced by my argument, though.

Share your thoughts in the comments.(4)

 


(1) This is an example of a proper stopping rule: with probability 1 it stops after finitely many steps. I discuss this in EGEK (Mayo 1996), Chapter 10, Mayo and Kruse (2001), and SIST (Mayo 2018), Excursion 1 Tour II. For those really interested, you can find a pale copy of the 1962 “Savage forum” where Savage first announces this result to the world.

(2) From the severe testing standpoint, ignoring the stopping rule allows readily inferring that there is evidence for a non-null hypothesis even though it has passed with low, if not minimal, severity.

(3) Ironically, ignoring the stopping rule has some unwelcome results for Bayesians. As Peter Armitage ([1962], p. 72) points out: ‘The departure of the mean by two standard errors … corresponds to the null hypothesis being at the five per cent point of the posterior distribution’, using a uniform prior probability for μ. Since in this case the posterior for the null hypothesis matches the significance level, a Bayesian with this prior is assured of assigning a low posterior probability to H0, even though it is true. The same thing happens in using this stopping rule (and prior) to form Bayesian credible intervals: they are assured of excluding the true value.
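A minimal simulation sketch of Armitage’s point (my own, not from Armitage; the cap on the sample size, the number of runs, and the 1.96 cutoff for a 95% interval are illustrative assumptions): with a uniform prior the posterior for μ is Normal with mean x̄ and variance σ²/n, so stopping the first time |x̄| ≥ 2 SE guarantees that the 95% credible interval formed at the stopping time excludes the true value μ = 0 in every run that stops, and by the ELS result every run eventually stops.

```python
# Optional stopping plus a uniform prior: credible intervals at the stopping
# time always exclude the true mu = 0 (illustrative sketch).
import numpy as np

rng = np.random.default_rng(3)

def stopped_and_excluding(n_cap=5000, n_sims=500):
    stopped, excluded = 0, 0
    for _ in range(n_sims):
        x = rng.standard_normal(n_cap)                  # mu = 0 (true), sigma = 1 known
        n = np.arange(1, n_cap + 1)
        xbar = np.cumsum(x) / n
        se = 1.0 / np.sqrt(n)
        hits = np.nonzero(np.abs(xbar) >= 2 * se)[0]    # optional stopping: 2 SE departure
        if hits.size:                                   # stops with prob. 1 as n_cap grows
            stopped += 1
            k = hits[0]
            lo, hi = xbar[k] - 1.96 * se[k], xbar[k] + 1.96 * se[k]
            excluded += not (lo <= 0.0 <= hi)           # 95% credible interval excludes 0?
    return stopped, excluded

print(stopped_and_excluding())  # every run that stops excludes the true value mu = 0
```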

(4) I apologize for some typos and blurry pics in the earlier draft (i). I was keen to put this up while in an airport yesterday, and neglected to check.

 

 

REFERENCES

Armitage, P. (1962). ‘Contribution to Discussion’, in L. J. Savage (ed.) (1962), pp. 62–103.

Armitage, P. (1975): Sequential Medical Trials, 2nd ed. New York: Wiley.

Berger, J. O. and Wolpert, R. L. (1988). The Likelihood Principle, 2nd ed. Hayward, CA: Institute of Mathematical Statistics.

Birnbaum, A. (1962). ‘On the Foundations of Statistical Inference’, Journal of the American Statistical Association 57(298), 269–326.

Edwards, W., Lindman, H., and Savage, L. J. (1963). ‘Bayesian Statistical Inference for Psychological Research’, Psychological Review 70: 193–242.

Mayo, D. (1996). [EGEK] Error and the Growth of Experimental Knowledge (Chapter 10, ‘Why You Cannot Be Just a Little Bit Bayesian’). Chicago: University of Chicago Press.

Mayo, D. G. and Kruse, M. (2001). ‘Principles of Inference and Their Consequences’, in D. Corfield and J. Williamson (eds.), Foundations of Bayesianism. Dordrecht: Kluwer Academic Publishers, pp. 381–403.

Mayo, D. (2018). [SIST] Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.

Savage, L. J. (ed.) (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen.

Wald, A. (1947). Sequential Analysis. New York: John Wiley & Sons.

 

Categories: Likelihood Principle, Rod Little, stopping rule


5 thoughts on “Error statistics doesn’t blame for possible future crimes of QRPs (ii)”

  1. John Byrd

    Being a practitioner of science, not a statistician, I cannot help but be appalled that many people seem to think the stopping rule example is a good argument for the acceptance of the LP. Any results, and any inference, are only as good as the methods used to obtain them. This includes everything from how the data were generated (sampling schemes, equipment, trained personnel, etc.) to how the statistic was generated on the back end. Optional stopping is clearly a foul, unless the method of choosing when to stop is captured in the statistic on the back end. The sales pitch for the LP seems to be: look how easy your life can be…

    • John:
      Thanks for your comment. But aside from the usual line, that the irrelevance of the stopping rule restores the simplicity and freedom lost with N-P statistics (a paraphrase of Savage), these passages completely misrepresent what’s called for in taking account of stopping rules.

      • fwiw

        “since continuing to sample might depend on successfully getting a grant, the probability of this occurring would also have to enter into a proper error-statistical assessment of the P-value attained. But this is crazy,”

        To compute a p-value, we need to know or assume what the researchers would have done had the data been different. If we don’t agree on this, we won’t agree on the p-value.

        Why is it crazy? Berger says: if they do not see an effect, sampling continues with probability p(they win the grant). You say: sampling stops with probability 1 whether or not they see an effect. Then, having made different assumptions about the sampling distribution, you and Berger compute different p-values.

        • fwiw: I don’t understand your question. Are you asking: couldn’t someone erect the crazy rule in Berger and Wolpert as their P-value? Berger and Wolpert claim it is the standard frequentist understanding of P-values, with or without optional stopping.

        • John Byrd

          “To compute a p-value, we need to know or assume what the researchers would have done had the data been different. If we don’t agree on this, we won’t agree on the p-value.”

          Talking about the researcher’s intent or thinking is misleading and just makes this more confusing. The focus should be on the design protocol as actually carried out. If a researcher has a design that includes repeating an experiment until a desired result is obtained, then that design must have a P-value proper to the design (one that takes into account the multiple attempts). That is not hard to grasp. If a researcher does a single experiment, calculates the P-value, and then does another (the same experiment) hoping for a more exciting result and calculates a P-value pretending it is a single experiment, then this is malpractice.

