Monthly Archives: June 2017

3 YEARS AGO (JUNE 2014): MEMORY LANE

3 years ago...

MONTHLY MEMORY LANE: 3 years ago: June 2014. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 4 others of general relevance to philosophy of statistics [2].  Posts that are part of a “unit” or a group count as one.

June 2014

  • (6/5) Stephen Senn: Blood Simple? The complicated and controversial world of bioequivalence (guest post)
  • (6/9) “The medical press must become irrelevant to publication of clinical trials.”
  • (6/11) A. Spanos: “Recurring controversies about P values and confidence intervals revisited”
  • (6/14) “Statistical Science and Philosophy of Science: where should they meet?”
  • (6/21) Big Bayes Stories? (draft ii)
  • (6/25) Blog Contents: May 2014
  • (6/28) Sir David Hendry Gets Lifetime Achievement Award
  • (6/30) Some ironies in the ‘replication crisis’ in social psychology (4th and final installment)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016, March 30, 2017 (moved to 4): a very convenient way to allow data-dependent choices.


Categories: 3-year memory lane | Leave a comment

Can You Change Your Bayesian Prior? The one post whose comments (some of them) will appear in my new book

I blogged this exactly 2 years ago here, seeking insight for my new book (Mayo 2017). Over 100 (rather varied) interesting comments ensued. This is the first time I’m incorporating blog comments into published work. You might be interested to follow the nooks and crannies from back then, or add a new comment to this.

This is one of the questions high on the “To Do” list I’ve been keeping for this blog.  The question grew out of discussions of “updating and downdating” in relation to papers by Stephen Senn (2011) and Andrew Gelman (2011) in Rationality, Markets, and Morals.[i]

“As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’.” (Senn, 2011, p. 59)

“If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.” (Gelman, 2011, p. 77)

But if uncertainty is not expressible as a prior, then a major lynchpin for Bayesian updating seems questionable. If you can go from the posterior to the prior, on the other hand, perhaps it can also lead you to come back and change it.
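To make the symmetry between updating and "downdating" concrete, here is a minimal sketch under assumptions that are not in the post: a conjugate Beta prior for a binomial success probability, and invented counts of 7 successes and 3 failures. The function names are hypothetical. With conjugacy, updating adds the observed counts to the prior's parameters, so downdating simply subtracts them from an elicited posterior.

```python
from fractions import Fraction


def update(prior, successes, failures):
    """Bayes updating: Beta(a, b) prior plus binomial data gives a Beta posterior."""
    a, b = prior
    return (a + successes, b + failures)


def downdate(posterior, successes, failures):
    """Recover the Beta prior implied by an elicited Beta posterior and the data."""
    a, b = posterior
    prior = (a - successes, b - failures)
    if prior[0] <= 0 or prior[1] <= 0:
        # Not every stated posterior corresponds to a proper Beta prior.
        raise ValueError("no proper Beta prior yields this posterior from these data")
    return prior


def beta_mean(params):
    """Mean of a Beta(a, b) distribution, kept exact with Fraction."""
    a, b = params
    return Fraction(a, a + b)


data = (7, 3)    # hypothetical data: 7 successes, 3 failures
prior = (2, 2)   # hypothetical prior Beta(2, 2)

posterior = update(prior, *data)        # Beta(9, 5)
recovered = downdate(posterior, *data)  # back to Beta(2, 2)

print(posterior, beta_mean(posterior))
print(recovered)
```

In this toy setting the arithmetic runs equally well in either direction, which is Senn's point about the mathematics; nothing in the calculation itself says which distribution should be treated as given.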

Is it legitimate to change one’s prior based on the data?

I don’t mean update it, but reject the one you had and replace it with another. My question may yield different answers depending on the particular Bayesian view. I am prepared to restrict the entire question of changing priors to Bayesian “probabilisms”, meaning the inference takes the form of updating priors to yield posteriors, or to report a comparative Bayes factor. Interpretations can vary. In many Bayesian accounts the prior probability distribution is a way of introducing prior beliefs into the analysis (as with subjective Bayesians) or, conversely, to avoid introducing prior beliefs (as with reference or conventional priors). Empirical Bayesians employ frequentist priors based on similar studies or well established theory. There are many other variants.

S. SENN: According to Senn, one test of whether an approach is Bayesian is that while “arrival of new data will, of course, require you to update your prior distribution to being a posterior distribution, no conceivable possible constellation of results can cause you to wish to change your prior distribution. If it does, you had the wrong prior distribution and this prior distribution would therefore have been wrong even for cases that did not leave you wishing to change it.” (Senn, 2011, p. 63)

“If you cannot go back to the drawing board, one seems stuck with priors one now regards as wrong; if one does change them, then what was the meaning of the prior as carrying prior information?” (Senn, 2011, p. 58)

I take it that Senn is referring to a Bayesian prior expressing belief. (He will correct me if I’m wrong.)[ii] Senn takes the upshot to be that priors cannot be changed based on data. Is there a principled ground for blocking such moves?

I.J. GOOD: The traditional idea was that one would have thought very hard about one’s prior before proceeding; that’s what Jack Good always said. Good advocated his device of “imaginary results” whereby one would envisage all possible results in advance (1971, p. 431) and choose a prior that one can live with whatever happens. This could take a long time! Given how difficult this would be in practice, Good allowed

“that it is possible after all to change a prior in the light of actual experimental results” [but] “rationality of type II has to be used.” (Good 1971, p. 431)

Maybe this is an example of what Senn calls requiring the informal to come to the rescue of the formal? Good was commenting on D. J. Bartholomew [iii] in the same wonderful volume (edited by Godambe and Sprott).

D. LINDLEY: According to subjective Bayesian Dennis Lindley:

“[I]f a prior leads to an unacceptable posterior then I modify it to cohere with properties that seem desirable in the inference.” (Lindley 1971, p. 436)

This would seem to open the door to all kinds of verification biases, wouldn’t it? This is the same Lindley who famously declared:

“I am often asked if the method gives the right answer: or, more particularly, how do you know if you have got the right prior. My reply is that I don’t know what is meant by “right” in this context. The Bayesian theory is about coherence, not about right or wrong.” (1976, p. 359)

H. KYBURG:  Philosopher Henry Kyburg (who wrote a book on subjective probability, but was or became a frequentist) gives what I took to be the standard line (for subjective Bayesians at least):

“There is no way I can be in error in my prior distribution for μ ––unless I make a logical error–… . It is that very fact that makes this prior distribution perniciously subjective. It represents an assumption that has consequences, but cannot be corrected by criticism or further evidence.” (Kyburg 1993, p. 147)

It can be updated, of course, via Bayes’ rule.

D.R. COX: While recognizing the serious problem of “temporal incoherence” (a violation of diachronic Bayes updating), David Cox writes:

“On the other hand [temporal coherency] is not inevitable and there is nothing intrinsically inconsistent in changing prior assessments” in the light of data; however, the danger is that “even initially very surprising effects can post hoc be made to seem plausible.” (Cox 2006, p. 78)

An analogous worry would arise, Cox notes, if frequentists permit data dependent selections of hypotheses (significance seeking, cherry picking, etc). However, frequentists (if they are not to be guilty of cheating) would need to take into account any adjustments to the overall error probabilities of the test. But the Bayesian is not in the business of computing error probabilities associated with a method for reaching posteriors. At least not traditionally. Would Bayesians even be required to report such shifts of priors? (A principle is needed.)

What if the proposed adjustment of prior is based on the data and resulting likelihoods, rather than an impetus to ensure one’s favorite hypothesis gets a desirable posterior? After all, Jim Berger says that prior elicitation typically takes place after “the expert has already seen the data” (2006, p. 392). Do they instruct them to try not to take the data into account? Anyway, if the prior is determined post-data, then one wonders how it can be seen to reflect information distinct from the data under analysis. All the work to obtain posteriors would have been accomplished by the likelihoods. There’s also the issue of using data twice.

So what do you think is the answer? Does it differ for subjective vs conventional vs other stripes of Bayesian?

[i] Both were contributions to the RMM (2011) volume Special Topic: Statistical Science and Philosophy of Science: Where Do (Should) They Meet in 2011 and Beyond? (edited by D. Mayo, A. Spanos, and K. Staley). The volume was an outgrowth of a 2010 conference that Spanos and I (and others) ran in London (LSE), and conversations that emerged soon after. See full list of participants, talks and sponsors here.

[ii] Senn and I had a published exchange on his paper that was based on my “deconstruction” of him on this blog, followed by his response! The published comments are here (Mayo) and here (Senn).

[iii] At first I thought Good was commenting on Lindley. Bartholomew came up in this blog in discussing when Bayesians and frequentists can agree on numbers.

WEEKEND READING

Gelman, A. 2011. “Induction and Deduction in Bayesian Data Analysis.”
Senn, S. 2011. “You May Believe You Are a Bayesian But You Are Probably Wrong.”
Berger, J. O.  2006. “The Case for Objective Bayesian Analysis.”

Discussions and Responses on Senn and Gelman can be found by searching this blog.

Commentary on Berger & Goldstein: Christen, Draper, Fienberg, Kadane, Kass, Wasserman
Rejoinders: Berger, Goldstein

REFERENCES

Berger, J. O.  2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402.

Cox, D. R. 2006. Principles of Statistical Inference. Cambridge, UK: Cambridge University Press.

Mayo, D. G. 2017. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge.

Categories: Bayesian priors, Bayesian/frequentist | 11 Comments

Performance or Probativeness? E.S. Pearson’s Statistical Philosophy

E.S. Pearson (11 Aug, 1895-12 June, 1980)

E.S. Pearson died on this day in 1980. Aside from being co-developer of Neyman-Pearson statistics, Pearson was interested in philosophical aspects of statistical inference. A question he asked is this: Are methods with good error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long run error properties is of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. But how exactly does it work? It’s not just the frequentist error statistician who faces this question, but also some contemporary Bayesians who aver that the performance or calibration of their methods supplies an evidential (or inferential or epistemic) justification (e.g., Robert Kass 2011). The latter generally ties the reliability of the method that produces the particular inference C to degrees of belief in C. The inference takes the form of a probabilism, e.g., Pr(C|x), equated, presumably, to the reliability (or coverage probability) of the method. But why? The frequentist inference is C, which is qualified by the reliability of the method, but there’s no posterior assigned to C. Again, what’s the rationale? I think existing answers (from both tribes) come up short in non-trivial ways.

I’ve recently become clear (or clearer) on a view I’ve been entertaining for a long time. There’s more than one goal in using probability, but when it comes to statistical inference in science, I say, the goal is not to infer highly probable claims (in the formal sense)* but claims which have been highly probed and have passed severe probes. Even highly plausible claims can be poorly tested (and I require a bit more of a test than informal uses of the word). The frequency properties of a method are relevant in those contexts where they provide assessments of a method’s capabilities and shortcomings in uncovering ways C may be wrong. Knowledge of the method’s capabilities is used, in turn, to ascertain how well or severely C has been probed. C is warranted only to the extent that it survived a severe probe of ways it can be incorrect. There’s poor evidence for C when little has been done to rule out C’s flaws. The most important role of error probabilities is in blocking inferences to claims that have not passed severe tests, but also to falsify (statistically) claims whose denials pass severely. This view is in the spirit of E.S. Pearson, Peirce, and Popper, though none fully worked it out. Each supplied important hints. Working it out is one of the things I do, or try to do, in my latest work. The following remarks of Pearson, earlier blogged here, contain some of his hints.

*Nor to give a comparative assessment of the probability of claims

From Pearson, E. S. (1947)

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

In cases of type A, long-run results are clearly of interest, while in cases of type B, repetition is impossible and may be irrelevant:

“In other and, no doubt, more numerous cases there is no repetition of the same type of trial or experiment, but all the same we can and many of us do use the same test rules to guide our decision, following the analysis of an isolated set of numerical data. Why do we do this? What are the springs of decision? Is it because the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment?

Or is it because we are content that the application of a rule, now in this investigation, now in that, should result in a long-run frequency of errors in judgment which we control at a low figure?” (Ibid., 173)

Although Pearson leaves this tantalizing question unanswered, claiming, “On this I should not care to dogmatize”, in studying how Pearson treats cases of type B, it is evident that in his view, “the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment” in learning about the particular case at hand.

“Whereas when tackling problem A it is easy to convince the practical man of the value of a probability construct related to frequency of occurrence, in problem B the argument that ‘if we were to repeatedly do so and so, such and such result would follow in the long run’ is at once met by the commonsense answer that we never should carry out a precisely similar trial again.

Nevertheless, it is clear that the scientist with a knowledge of statistical method behind him can make his contribution to a round-table discussion…” (Ibid., 171).

Pearson gives the following example of a case of type B (from his wartime work), where he claims no repetition is intended:

“Example of type B. Two types of heavy armour-piercing naval shell of the same caliber are under consideration; they may be of different design or made by different firms…. Twelve shells of one kind and eight of the other have been fired; two of the former and five of the latter failed to perforate the plate….” (Pearson 1947, 171)

“Starting from the basis that, individual shells will never be identical in armour-piercing qualities, however good the control of production, he has to consider how much of the difference between (i) two failures out of twelve and (ii) five failures out of eight is likely to be due to this inevitable variability…” (Ibid.)

We’re interested in considering what other outcomes could have occurred, and how readily, in order to learn what variability alone is capable of producing. As a noteworthy aside, Pearson shows that treating the observed difference (between the two proportions) in one way yields an observed significance level of 0.052; treating it differently (along Barnard’s lines), he gets 0.025 as the (upper) significance level. But in scientific cases, Pearson insists, the difference in error probabilities makes no real difference to substantive judgments in interpreting the results. Only in an unthinking, automatic, routine use of tests would it matter:

“Were the action taken to be decided automatically by the side of the 5% level on which the observation point fell, it is clear that the method of analysis used would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule.” (ibid., 192)

The two analyses correspond to the tests effectively asking different questions, and if we recognize this, says Pearson, different meanings may be appropriately attached.
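As an illustration of how a significance level like the first one can arise, here is a minimal sketch. It assumes (the post does not say this) that the 0.052 figure comes from conditioning on both margins of the 2×2 table, as in a Fisher-style exact test: with 7 failures among 20 shells in total, it asks how probable it is that 2 or fewer of them would fall among the 12 shells of the first type by chance alone. The function name is hypothetical, and the Barnard-style calculation behind 0.025 is not attempted here.

```python
from math import comb


def hypergeom_lower_tail(k_obs, n_drawn, n_marked, n_total):
    """P(X <= k_obs), where X counts marked items among n_drawn items
    taken without replacement from n_total items, n_marked of them marked."""
    denom = comb(n_total, n_drawn)
    numer = sum(
        comb(n_marked, k) * comb(n_total - n_marked, n_drawn - k)
        for k in range(k_obs + 1)
    )
    return numer / denom


# Pearson's shells: 12 of the first type (2 failures), 8 of the second (5 failures).
p_value = hypergeom_lower_tail(k_obs=2, n_drawn=12, n_marked=7, n_total=20)
print(round(p_value, 3))  # 0.052
```

Under that assumption the one-sided tail area is 6565/125970, which rounds to the 0.052 quoted above. That the result sits just on one side of 5% is exactly why Pearson warns against letting "the side of the 5% level" decide the matter automatically.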

Three Steps in the Original Construction of Tests

After setting up the test (or null) hypothesis, and the alternative hypotheses against which “we wish the test to have maximum discriminating power” (Pearson 1947, 173), Pearson defines three steps in specifying tests:

“Step 1. We must specify the experimental probability set, the set of results which could follow on repeated application of the random process used in the collection of the data…

Step 2. We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts” (Pearson 1966a, 173).

“Step 3. We then, if possible[i], associate with each contour level the chance that, if [the null] is true, a result will occur in random sampling lying beyond that level” (ibid.).

Pearson warns that:

“Although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order.” (Ibid. 173).
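The three steps can be sketched in a deliberately simple setting. Everything specific here is an assumption for illustration, not Pearson's example: a single binomial experiment with n = 10 trials, a null value p0 = 0.5, a one-sided alternative p > p0, and a distance measure equal to the excess of observed successes over the number expected under the null.

```python
from math import comb

n, p0 = 10, 0.5  # hypothetical experiment: 10 trials, null success probability 0.5


def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Step 1: the experimental probability set, i.e. every result that could occur.
sample_space = range(n + 1)


# Step 2: order the results by how strongly they point away from the null
# toward the alternatives of interest (here p > p0).
def distance(x):
    return x - n * p0


contours = sorted(sample_space, key=distance)


# Step 3: attach to each contour the chance, under the null,
# of a result lying at or beyond that contour.
def tail_prob(x):
    return sum(
        binom_pmf(y, n, p0) for y in sample_space if distance(y) >= distance(x)
    )


levels = {x: tail_prob(x) for x in contours}

for x in contours:
    print(x, round(levels[x], 4))
```

Note that `tail_prob` cannot be written until `distance` has been chosen: swapping in a different distance measure in Step 2 changes which results count as "beyond" a given contour, and so changes the probabilities in Step 3. That is one way to read Pearson's insistence on the ordering of the steps.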

Strict behavioristic formulations jump from step 1 to step 3, after which one may calculate how the test has in effect accomplished step 2.  However, the resulting test, while having adequate error probabilities, may have an inadequate distance measure and may even be irrelevant to the hypothesis of interest. This is one reason critics can construct howlers that appear to be licensed by N-P methods, and which make their way from time to time into this blog.

So step 3 remains crucial, even for cases of type [B]. There are two reasons: first, pre-data planning, which is familiar enough; second, post-data scrutiny. Post data, step 3 enables determining the capability of the test to have detected various discrepancies, departures, and errors, on which a critical scrutiny of the inferences is based. More specifically, the error probabilities are used to determine how well/poorly corroborated, or how severely tested, various claims are, post-data.

If we can readily bring about statistically significantly higher rates of success with the first type of armour-piercing naval shell than with the second (in the above example), we have evidence the first is superior. Or, as Pearson modestly puts it: the results “raise considerable doubts as to whether the performance of the [second] type of shell was as good as that of the [first]….” (Ibid., 192)[ii]

Still, while error rates of procedures may be used to determine how severely claims have or have not passed, they do not automatically do so; hence, again, the door is open to potential howlers that neither Egon nor Jerzy, for that matter, would have countenanced.

Neyman Was the More Behavioristic of the Two

Pearson was (rightly) considered to have rejected the more behaviorist leanings of Neyman.

Here’s a snippet from an unpublished letter he wrote to Birnbaum (1974) about the idea that the N-P theory admits of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

In Pearson’s (1955) response to Fisher (blogged here):

“To dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot….!” (Pearson 1955, 204)

“To the best of my ability I was searching for a way of expressing in mathematical terms what appeared to me to be the requirements of the scientist in applying statistical tests to his data. After contact was made with Neyman in 1926, the development of a joint mathematical theory proceeded much more surely; it was not till after the main lines of this theory had taken shape with its necessary formalization in terms of critical regions, the class of admissible hypotheses, the two sources of error, the power function, etc., that the fact that there was a remarkable parallelism of ideas in the field of acceptance sampling became apparent. Abraham Wald’s contributions to decision theory of ten to fifteen years later were perhaps strongly influenced by acceptance sampling problems, but that is another story.” (ibid., 204-5).

“It may be readily agreed that in the first Neyman and Pearson paper of 1928, more space might have been given to discussing how the scientific worker’s attitude of mind could be related to the formal structure of the mathematical probability theory….Nevertheless it should be clear from the first paragraph of this paper that we were not speaking of the final acceptance or rejection of a scientific hypothesis on the basis of statistical analysis…. Indeed, from the start we shared Professor Fisher’s view that in scientific enquiry, a statistical test is ‘a means of learning’…” (Ibid., 206)

“Professor Fisher’s final criticism concerns the use of the term ‘inductive behavior’; this is Professor Neyman’s field rather than mine.” (Ibid., 207)

__________________________

 

References:

Kass, R. (2011), “Statistical Inference: The Big Picture,” Statistical Science 26, No. 1, 1–9.

Pearson, E. S. (1935), The Application of Statistical Methods to Industrial Standardization and Quality Control, London: British Standards Institution.

Pearson, E. S. (1947), “The Choice of Statistical Tests illustrated on the Interpretation of Data Classed in a 2×2 Table,” Biometrika 34(1/2): 139-167.

Pearson, E. S. (1955), “Statistical Concepts and Their Relationship to Reality,” Journal of the Royal Statistical Society, Series B, (Methodological), 17(2): 204-207.

Neyman, J. and Pearson, E. S. (1928), “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference, Part I.” Biometrika 20(A): 175-240.


[i] In some cases only an upper limit to this error probability may be found.

[ii] Pearson inadvertently switches from number of failures to number of successes in the conclusion of this paper.

Categories: E.S. Pearson, highly probable vs highly probed, phil/history of stat | Leave a comment

3 YEARS AGO (May 2014): MEMORY LANE

3 years ago...

MONTHLY MEMORY LANE: 3 years ago: May 2014. I leave them unmarked this month; read whatever looks interesting.

May 2014

  • (5/1) Putting the brakes on the breakthrough: An informal look at the argument for the Likelihood Principle
  • (5/3) You can only become coherent by ‘converting’ non-Bayesianly
  • (5/6) Winner of April Palindrome contest: Lori Wike
  • (5/7) A. Spanos: Talking back to the critics using error statistics (Phil6334)
  • (5/10) Who ya gonna call for statistical Fraudbusting? R.A. Fisher, P-values, and error statistics (again)
  • (5/15) Scientism and Statisticism: a conference* (i)
  • (5/17) Deconstructing Andrew Gelman: “A Bayesian wants everybody else to be a non-Bayesian.”
  • (5/20) The Science Wars & the Statistics Wars: More from the Scientism workshop
  • (5/25) Blog Table of Contents: March and April 2014
  • (5/27) Allan Birnbaum, Philosophical Error Statistician: 27 May 1923 – 1 July 1976
  • (5/31) What have we learned from the Anil Potti training and test data frameworks? Part 1 (draft 2)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

 


Categories: 3-year memory lane | 1 Comment
