Statistics

Jerzy Neyman and “Les Miserables Citations” (statistical theater in honor of his birthday)

images-14

Neyman April 16, 1894 – August 5, 1981

For my final Jerzy Neyman item, here’s the post I wrote for his birthday last year: 

A local acting group is putting on a short theater production based on a screenplay I wrote:  “Les Miserables Citations” (“Those Miserable Quotes”) [1]. The “miserable” citations are those everyone loves to cite, from their early joint 1933 paper:

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.

But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. (Neyman and Pearson 1933, pp. 290-1).

In this early paper, Neyman and Pearson were still groping toward the basic concepts of tests–for example, “power” had yet to be coined. Taken out of context, these quotes have led to knee-jerk (behavioristic) interpretations which neither Neyman nor Pearson would have accepted. What was the real context of those passages? Well, the paper opens, just five paragraphs earlier, with a discussion of a debate between two French probabilists—Joseph Bertrand, author of “Calculus of Probabilities” (1907), and Emile Borel, author of “Le Hasard” (1914)! According to Neyman, what served “as an inspiration to Egon S. Pearson and myself in our effort to build a frequentist theory of testing hypotheses”(1977, p. 103) initially grew out of remarks of Borel, whose lectures Neyman had attended in Paris. He returns to the Bertrand-Borel debate in four different papers, and circles back to it often in his talks with his biographer, Constance Reid. His student Erich Lehmann (1993), regarded as the authority on Neyman, wrote an entire paper on the topic: “The Bertrand-Borel Debate and the Origins of the Neyman Pearson Theory”.

Since it’s Saturday night, let’s listen in on this one act play, just about to begin at the Elba Dinner Theater. Don’t worry, food and drink are allowed to be taken in. (I’ve also included, in the References, several links to papers for your weekend reading enjoyment!)  There go les trois coups–the curtain’s about to open!

search

The curtain opens with a young Neyman and Pearson (from 1933) standing mid-stage, lit by a spotlight. (Neyman does the talking, since its his birthday).

Neyman: “Bertrand put into statistical form a variety of hypotheses, as for example the hypothesis that a given group of stars…form a ‘system.’ His method of attack, which is that in common use, consisted essentially in calculating the probability, P, that a certain character, x, of the observed facts would arise if the hypothesis tested were true. If P were very small, this would generally be considered as an indication that…H was probably false, and vice versa. Bertrand expressed the pessimistic view that no test of this kind could give reliable results.

Borel, however, considered…that the method described could be applied with success provided that the character, x, of the observed facts were properly chosen—were, in fact, a character which he terms ‘en quelque sorte remarquable’” (Neyman and Pearson 1933, p.141/290).

The stage fades to black, then a spotlight shines on Bertrand, stage right.

Bertrand: “How can we decide on the unusual results that chance is incapable of producing?…The Pleiades appear closer to each other than one would naturally expect…In order to make the vague idea of closeness more precise, should we look for the smallest circle that contains the group? the largest of the angular distances? the sum of squares of all the distances?…Each of these quantities is smaller for the group of the Pleiades than seems plausible. Which of them should provide the measure of implausibility. …

[He turns to the audience, shaking his head.]

The application of such calculations to questions of this kind is a delusion and an abuse.” (Bertrand, 1907, p. 166; Lehmann 1993, p. 963).

The stage fades to black, then a spotlight appears on Borel, stage left.

Borel: “The particular form that problems of causes often take…is the following: Is such and such a result due to chance or does it have a cause? It has often been observed how much this statement lacks in precision. Bertrand has strongly emphasized this point. But …to refuse to answer under the pretext that the answer cannot be absolutely precise, is to… misunderstand the essential nature of the application of mathematics.” (ibid. p. 964) Bertrand considers the Pleiades. ‘If one has observed a [precise angle between the stars]…in tenths of seconds…one would not think of asking to know the probability [of observing exactly this observed angle under chance] because one would never have asked that precise question before having measured the angle’… (ibid.)

The question is whether one has the same reservations in the case in which one states that one of the angles of the triangle formed by three stars has “une valeur remarquable” [a striking or noteworthy value], and is for example equal to the angle of the equilateral triangle…. (ibid.)

Here is what one can say on this subject: One should carefully guard against the tendency to consider as striking an event that one has not specified beforehand, because the number of such events that may appear striking, from different points of view, is very substantial” (ibid. p. 964).J. Neyman and E. Pearson

The stage fades to black, then a spotlight beams on Neyman and Pearson mid-stage. (Neyman does the talking)

Neyman: “We appear to find disagreement here, but are inclined to think that…the two writers [Bertrand and Borel] are not really considering precisely the same problem. In general terms the problem is this: Is it possible that there are any efficient tests of hypotheses based upon the theory of probability, and if so, what is their nature. …What is the precise meaning of the words ‘an efficient test of a hypothesis’?” (1933, p. 140/290)

“[W]e may consider some specified hypothesis, as that concerning the group of stars, and look for a method which we should hope to tell us, with regard to a particular group of stars, whether they form a system, or are grouped ‘by chance,’…their relative movements unrelated.” (ibid.)
“If this were what is required of ‘an efficient test’, we should agree with Bertrand in his pessimistic view. For however small be the probability that a particular grouping of a number of stars is due to ‘chance’, does this in itself provide any evidence of another ‘cause’ for this grouping but ‘chance’? …Indeed, if x is a continuous variable—as for example is the angular distance between two stars—then any value of x is a singularity of relative probability equal to zero. We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis. But we may look at the purpose of tests from another view-point.” (Emphasis added; ibid. pp. 141-2; 290-1)

Fade to black, spot on narrator mid-stage:

Narrator: We all know our famous (miserable) lines are about to come. But let’s linger on the “as far as a particular hypothesis is concerned” portion. For any particular case, one may identify a data dependent feature x that would be highly improbable “under the particular hypothesis of chance”. We must “carefully guard,” Borel warns, “against the tendency to consider as striking an event that one has not specified beforehand”. But if you are required to set the test’s capabilities ahead of time then you need to specify the type of falsity of Ho, the distance measure or test statistic beforehand. An efficient test should capture Fisher’s concern with tests sensitive to departures of interest. Listen to Neyman over 40 years later, reflecting on the relevance of Borel’s position in 1977.

Fade to black. Spotlight on an older Neyman, stage right.

Neyman April 16, 1894 – August 5, 1981

Neyman: “The question (what is an efficient test of a statistical hypothesis) is about an intelligible methodology for deciding whether the observed difference…contradicts the stochastic model….

This question was the subject of a lively discussion by Borel and others. Borel was optimistic but insisted that: (a) the criterion to test a hypothesis (a ‘statistical hypothesis’) using some observations must be selected not after the examination of the results of observation, but before, and (b) this criterion should be a function of the observations (of some sort remarkable) (Neyman 1977, pp. 102-103).
It is these remarks of Borel that served as an inspiration to Egon S. Pearson and myself in our effort to build a frequentist theory of testing hypotheses.”(ibid. p. 103)

Fade to back. Spotlight on an older Egon Pearson writing a letter to Neyman about the preprint Neyman sent of his 1977 paper. (The letter is unpublished, but I cite Lehmann 1993).

Pearson: “I remember that you produced this quotation [from Borel] when we began to get our [1933] paper into shape… The above stages [wherein he had been asking ‘Why use that particular test statistic?’] led up to Borel’s requirement of finding…a criterion which was ‘a function of the observations ‘en quelque sorte remarquable’. Now my point is that you and I (perhaps my first leading) had ourselves reached the Borel requirement independently of Borel, because we were serious humane thinkers; Borel’s expression neatly capped our own.”

Fade to black. End Play

Egon has the habit of leaving the most tantalizing claims unpacked, and this is no exception: What exactly is the Borel requirement already reached due to their being “serious humane thinkers”? I can well imagine growing this one act play into something like the expressionist play of Michael Fraylin, Copenhagen, wherein a variety of alternative interpretations are entertained based on subsequent work and remarks. I don’t say that it would enjoy a long life on Broadway, but a small handful of us would relish it.

As with my previous attempts at “statistical theatre of the absurd, (e.g., “Stat on a hot-tin roof”) there would be no attempt at all to popularize—only published quotes and closely remembered conversations would be included.

Deconstructions on the Meaning of the Play by Theater Critics

It’s not hard to see that “as far as a particular” star grouping is concerned, we cannot expect a reliable inference to just any non-chance effect discovered in the data. The more specific the feature is to these particular observations, the more improbable. What’s the probability of 3 hurricanes followed by 2 plane crashes (as occurred last month, say)? Harold Jeffreys put it this way: any sample is improbable in some respect;to cope with this fact statistical method does one of two things: appeals to prior probabilities of a hypothesis or to error probabilities of a procedure. The former can check our tendency to find a more likely explanation H’ than chance by an appropriately low prior weight to H’. What does the latter approach do? It says, we need to consider the problem as of a general type. It’s a general rule, from a test statistic to some assertion about alternative hypotheses, expressing the non-chance effect. Such assertions may be in error but we can control such erroneous interpretations. We deliberately move away from the particularity of the case at hand, to the general type of mistake that could be made.

Isn’t this taken care of by Fisher’s requirement that Pr(P < p0; Ho) = p—that the test rarely rejects the null if true?   It may be, in practice, Neyman and Pearson thought, but only with certain conditions that were not explicitly codified by Fisher’s simple significance tests. With just the null hypothesis, it is unwarranted to take low P-values as evidence for a specific “cause” or non-chance explanation. Many could be erected post data, but the ways these could be in error would not have been probed. Fisher (1947, p. 182) is well aware that “the same data may contradict the hypothesis in any of a number of different ways,” and that different corresponding tests would be used.

The notion that different tests of significance are appropriate to test different features of the same null hypothesis presents no difficulty to workers engaged in practical experimentation. [T]he experimenter is aware of what observational discrepancy it is which interests him, and which he thinks may be statistically significant, before he inquires what test of significance, if any, is available appropriate to his needs (ibid., p. 185).

Even if “an experienced experimenter” knows the appropriate test, this doesn’t lessen the importance of NP’s interest in seeking to identify a statistical rationale for the choices made on informal grounds. In today’s world, if not in Fisher’s day, there’s legitimate concern about selecting the alternative that gives the more impressive P-value.

Here’s Egon Pearson writing with Chandra Sekar: In testing if a sample has been drawn from a single normal population, “it is not possible to devise an efficient test if we only bring into the picture this single normal probability distribution with its two unknown parameters. We must also ask how sensitive the test is in detecting failure of the data to comply with the hypotheses tested, and to deal with this question effectively we must be able to specify the directions in which the hypothesis may fail” ( p. 121). “It is sometimes held that the criterion for a test can be selected after the data, but it will be hard to be unprejudiced at this point” (Pearson & Chandra Sekar, 1936, p. 129).

To base the choice of the test of a statistical hypothesis upon an inspection of the observations is a dangerous practice; a study of the configuration of a sample is almost certain to reveal some feature, or features, which are exceptions if the hypothesis is true….By choosing the feature most unfavourable to Ho out of a very large number of features examined it will usually be possible to find some reason for rejecting the hypothesis. It must be remembered, however, that the point now at issue will not be whether it is exceptional to find a given criterion with so unfavourable a value. We shall need to find an answer to the more difficult question. Is it exceptional that the most unfavourable criterion of the n, say, examined should have as unfavourable a value as this? (ibid., p. 127).

Notice, the goal is not behavioristic; it’s a matter of avoiding the glaring fallacies in the test at hand, fallacies we know all too well.

“The statistician who does not know in advance with which type of alternative to H0 he may be faced, is in the position of a carpenter who is summoned to a house to undertake a job of an unknown kind and is only able to take one tool with him! Which shall it be? Even if there is an omnibus tool, it is likely to be far less sensitive at any particular job than a specialized one; but the specialized tool will be quite useless under the wrong conditions” (ibid., p. 126).

In a famous result, Neyman (1952) demonstrates that by dint of a post-data choice of hypothesis, a result that leads to rejection in one test yields the opposite conclusion in another, both adhering to a fixed significance level. [Fisher concedes this as well.] If you are keen to ensure the test is capable of teaching about discrepancies of interest, you should prespecify an alternative hypothesis, where the null and alternative hypothesis exhaust the space, relative to a given question. We can infer discrepancies from the null, as well as corroborate their absence by considering those the test had high power to detect.

Playbill Souvenir

Let’s flesh out Neyman’s conclusion to the Borel-Bertrand debate: if we accept the words, “an efficient test of the hypothesis H” to mean a statistical (methodological) falsification rule that controls the probabilities of erroneous interpretations of data, and ensures the rejection was because of the underlying cause (as modeled), then we agree with Borel that efficient tests are possible. This requires (a) a prespecified test criterion to avoid verification biases while ensuring power (efficiency), and (b) consideration of alternative hypotheses to avoid fallacies of acceptance and rejection. We must steer clear of isolated or particular curiosities to find indications that we are tracking genuine effects.

“Fisher’s the one to be credited,” Pearson remarks, “for his emphasis on planning an experiment, which led naturally to the examination of the power function, both in choosing the size of sample so as to enable worthwhile results to be achieved, and in determining the most appropriate tests” (Pearson 1962, p. 277). If you’re planning, you’re prespecifying, perhaps, nowadays, by means of explicit preregistration.

Nevertheless prespecifying the question (or test statistic) is distinct from predesignating a cut-off P-value for significance. Discussions of tests often suppose one is somehow cheating if the attained P-value is reported, as if it loses its error probability status. It doesn’t.[2] I claim they are confusing prespecifying the question or hypothesis, with fixing the P-value in advance–a confusion whose origin stems from failing to identify the rationale behind conventions of tests, or so I argue. Nor is it even that the predesignation is essential, rather than an excellent way to promote valid error probabilities.

But not just any characteristic of the data affords the relevant error probability assessment. It has got to be pretty remarkable!

Enter those pivotal statistics called upon in Fisher’s Fiducial inference. In fact, the story could well be seen to continue in the following two posts: “You can’t take the Fiducial out of Fisher if you want to understand the N-P performance philosophy“, and ” Deconstructing the Fisher-Neyman conflict wearing fiducial glasses”.

[1] Or, it might have been titled, “A Polish Statistician in Paris”, given the remake of “An American in Paris” is still going strong on Broadway, last time I checked.

[2] We know that Lehmann insisted people report the attained p-value so that others could apply their own preferred error probabilities. N-P felt the same way. (I may add some links to relevant posts later on.)

REFERENCES

Bertrand, J. (1888/1907). Calcul des Probabilités. Paris: Gauthier-Villars.

Borel, E. 1914. Le Hasard. Paris: Alcan.

Fisher, R. A. 1947. The Design of Experiments (4th ed.). Edinburgh: Oliver and Boyd.

Lehmann, E.L. 2012. “The Bertrand-Borel Debate and the Origins of the Neyman-Pearson Theory” in J. Rojo (ed.), Selected Works of E. L. Lehmann, 2012, Springer US, Boston, MA, pp. 965-974.

Neyman, J. 1952. Lectures and Conferences on Mathematical Statistics and Probability. 2nd ed. Washington, DC: Graduate School of U.S. Dept. of Agriculture.

Neyman, J. 1977. “Frequentist Probability and Frequentist Statistics“, Synthese 36(1): 97–131.

Neyman, J. & Pearson, E. 1933. “On the Problem of the Most Efficient Tests of Statistical Hypotheses“, Philosophical Transactions of the Royal Society of London 231. Series A, Containing Papers of a Mathematical or Physical Character: 289–337.

Pearson, E. S. 1962. “Some Thoughts on Statistical Inference”, The Annals of Mathematical Statistics, 33(2): 394-403.

Pearson, E. S. & Sekar, C. C. 1936. “The Efficiency of Statistical Tools and a Criterion for the Rejection of Outlying Observations“, Biometrika 28(3/4): 308-320. Reprinted (1966) in The Selected Papers of E. S. Pearson, (pp. 118-130). Berkeley: University of California Press.

Reid, Constance (1982). Neyman–from life

 

 

Categories: E.S. Pearson, Neyman, Statistics | Leave a comment

Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen

neyman

April 16, 1894 – August 5, 1981

I’ll continue to post Neyman-related items this week in honor of his birthday. This isn’t the only paper in which Neyman makes it clear he denies a distinction between a test of  statistical hypotheses and significance tests. He and E. Pearson also discredit the myth that the former is only allowed to report pre-data, fixed error probabilities, and are justified only by dint of long-run error control. Controlling the “frequency of misdirected activities” in the midst of finding something out, or solving a problem of inquiry, on the other hand, are epistemological goals. What do you think?

Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena
by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

I recommend, especially, the example on home ownership. Here are two snippets: Continue reading

Categories: Error Statistics, Neyman, Statistics | Tags: | 2 Comments

If you’re seeing limb-sawing in P-value logic, you’re sawing off the limbs of reductio arguments

images-2I was just reading a paper by Martin and Liu (2014) in which they allude to the “questionable logic of proving H0 false by using a calculation that assumes it is true”(p. 1704).  They say they seek to define a notion of “plausibility” that

“fits the way practitioners use and interpret p-values: a small p-value means H0 is implausible, given the observed data,” but they seek “a probability calculation that does not require one to assume that H0 is true, so one avoids the questionable logic of proving H0 false by using a calculation that assumes it is true“(Martin and Liu 2014, p. 1704).

Questionable? A very standard form of argument is a reductio (ad absurdum) wherein a claim C  is inferred (i.e., detached) by falsifying ~C, that is, by showing that assuming ~C entails something in conflict with (if not logically contradicting) known results or known truths [i]. Actual falsification in science is generally a statistical variant of this argument. Supposing Hin p-value reasoning plays the role of ~C. Yet some aver it thereby “saws off its own limb”! Continue reading

Categories: P-values, reforming the reformers, Statistics | 13 Comments

3 YEARS AGO (MARCH 2014): MEMORY LANE

3 years ago...

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: March 2014. I mark in red three posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 4 others I’d recommend[2].  Posts that are part of a “unit” or a group count as one. 3/19 and 3/17 are one, as are  3/19, 3/12 and 3/4, and the 6334 items 3/11, 3/22 and 3/26. So that covers nearly all the posts!

March 2014

 

  • (3/1) Cosma Shalizi gets tenure (at last!) (metastat announcement)
  • (3/2) Significance tests and frequentist principles of evidence: Phil6334 Day #6
  • (3/3) Capitalizing on Chance (ii)
  • (3/4) Power, power everywhere–(it) may not be what you think! [illustration]
  • (3/8) Msc kvetch: You are fully dressed (even under you clothes)?
  • (3/8) Fallacy of Rejection and the Fallacy of Nouvelle Cuisine
  • (3/11) Phil6334 Day #7: Selection effects, the Higgs and 5 sigma, Power
  • (3/12) Get empowered to detect power howlers
  • (3/15) New SEV calculator (guest app: Durvasula)
  • (3/17) Stephen Senn: “Delta Force: To what extent is clinical relevance relevant?” (Guest Post)

     

     

  • (3/19) Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
  • (3/22) Fallacies of statistics & statistics journalism, and how to avoid them: Summary & Slides Day #8 (Phil 6334)
  • (3/25) The Unexpected Way Philosophy Majors Are Changing The World Of Business

     

  • (3/26) Phil6334:Misspecification Testing: Ordering From A Full Diagnostic Menu (part 1)
  • (3/28) Severe osteometric probing of skeletal remains: John Byrd
  • (3/29) Winner of the March 2014 palindrome contest (rejected post)
  • (3/30) Phil6334: March 26, philosophy of misspecification testing (Day #9 slides)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30,2016, March 30,2017 (moved to 4) -very convenient way to allow data-dependent choices.

Save

Save

Save

Save

Save

Categories: 3-year memory lane, Error Statistics, Statistics | Leave a comment

Er, about those other approaches, hold off until a balanced appraisal is in

I could have told them that the degree of accordance enabling the ASA’s “6 principles” on p-values was unlikely to be replicated when it came to most of the “other approaches” with which some would supplement or replace significance tests– notably Bayesian updating, Bayes factors, or likelihood ratios (confidence intervals are dual to hypotheses tests). [My commentary is here.] So now they may be advising a “hold off” or “go slow” approach until some consilience is achieved. Is that it? I don’t know. I was tweeted an article about the background chatter taking place behind the scenes; I wasn’t one of people interviewed for this. Here are some excerpts, I may add more later after it has had time to sink in. (check back later)

“Reaching for Best Practices in Statistics: Proceed with Caution Until a Balanced Critique Is In”

J. Hossiason

“[A]ll of the other approaches*, as well as most statistical tools, may suffer from many of the same problems as the p-values do. What level of likelihood ratio in favor of the research hypothesis will be acceptable to the journal? Should scientific discoveries be based on whether posterior odds pass a specific threshold (P3)? Does either measure the size of an effect (P5)?…How can we decide about the sample size needed for a clinical trial—however analyzed—if we do not set a specific bright-line decision rule? 95% confidence intervals or credence intervals…offer no protection against selection when only those that do not cover 0, are selected into the abstract (P4). (Benjamini, ASA commentary, pp. 3-4)

What’s sauce for the goose is sauce for the gander right?  Many statisticians seconded George Cobb who urged “the board to set aside time at least once every year to consider the potential value of similar statements” to the recent ASA p-value report. Disappointingly, a preliminary survey of leaders in statistics, many from the original p-value group, aired striking disagreements on best and worst practices with respect to these other approaches. The Executive Board is contemplating a variety of recommendations, minimally, that practitioners move with caution until they can put forward at least a few agreed upon principles for interpreting and applying Bayesian inference methods. The words we heard ranged from “go slow” to “moratorium [emphasis mine]. Having been privy to some of the results of this survey, we at Stat Report Watch decided to contact some of the individuals involved. Continue reading

Categories: P-values, reforming the reformers, Statistics | 6 Comments

BOSTON COLLOQUIUM FOR PHILOSOPHY OF SCIENCE: Understanding Reproducibility & Error Correction in Science

BOSTON COLLOQUIUM FOR PHILOSOPHY OF SCIENCE

2016–2017
57th Annual Program

Download the 57th Annual Program

The Alfred I. Taub forum:

UNDERSTANDING REPRODUCIBILITY & ERROR CORRECTION IN SCIENCE

Cosponsored by GMS and BU’s BEST at Boston University.
Friday, March 17, 2017
1:00 p.m. – 5:00 p.m.
The Terrace Lounge, George Sherman Union
775 Commonwealth Avenue

  • Reputation, Variation, &, Control: Historical Perspectives
    Jutta Schickore History and Philosophy of Science & Medicine, Indiana University, Bloomington.
  • Crisis in Science: Time for Reform?
    Arturo Casadevall Molecular Microbiology & Immunology, Johns Hopkins
  • Severe Testing: The Key to Error Correction
    Deborah Mayo Philosophy, Virginia Tech
  • Replicate That…. Maintaining a Healthy Failure Rate in Science
    Stuart Firestein Biological Sciences, Columbia

 

boston-mayo-2017

Categories: Announcement, Statistical fraudbusting, Statistics | Leave a comment

The ASA Document on P-Values: One Year On

imgres-6

I’m surprised it’s a year already since posting my published comments on the ASA Document on P-Values. Since then, there have been a slew of papers rehearsing the well-worn fallacies of tests (a tad bit more than the usual rate). Doubtless, the P-value Pow Wow raised people’s consciousnesses. I’m interested in hearing reader reactions/experiences in connection with the P-Value project (positive and negative) over the past year. (Use the comments, share links to papers; and/or send me something slightly longer for a possible guest post.)
Some people sent me a diagram from a talk by Stephen Senn (on “P-values and the art of herding cats”). He presents an array of different cat commentators, and for some reason Mayo cat is in the middle but way over on the left side,near the wall. I never got the key to interpretation.  My contribution is below: 

Chart by S.Senn

“Don’t Throw Out The Error Control Baby With the Bad Statistics Bathwater”

D. Mayo*[1]

The American Statistical Association is to be credited with opening up a discussion into p-values; now an examination of the foundations of other key statistical concepts is needed. Continue reading

Categories: Bayesian/frequentist, P-values, science communication, Statistics, Stephen Senn | 14 Comments

3 YEARS AGO (FEBRUARY 2014): MEMORY LANE

3 years ago...

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: February 2014. I normally mark in red three posts from each month that seem most apt for general background on key issues in this blog, but I decided just to list these as they are (some are from a seminar I taught with Aris Spanos 3 years ago; several on Fisher were recently reblogged). I hope you find something of interest!    

February 2014

  • (2/1) Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs B-boosts)
  • (2/3) PhilStock: Bad news is bad news on Wall St. (rejected post)
  • (2/5) “Probabilism as an Obstacle to Statistical Fraud-Busting” (draft iii)
  • (2/9) Phil6334: Day #3: Feb 6, 2014
  • (2/10) Is it true that all epistemic principles can only be defended circularly? A Popperian puzzle
  • (2/12) Phil6334: Popper self-test
  • (2/13) Phil 6334 Statistical Snow Sculpture
  • (2/14) January Blog Table of Contents
  • (2/15) Fisher and Neyman after anger management?
  • (2/17) R. A. Fisher: how an outsider revolutionized statistics
  • (2/18) Aris Spanos: The Enduring Legacy of R. A. Fisher
  • (2/20) R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’
  • (2/21) STEPHEN SENN: Fisher’s alternative to the alternative
  • (2/22) Sir Harold Jeffreys’ (tail-area) one-liner: Sat night comedy [draft ii]
  • (2/24) Phil6334: February 20, 2014 (Spanos): Day #5
  • (2/26) Winner of the February 2014 palindrome contest (rejected post)
  • (2/26) Phil6334: Feb 24, 2014: Induction, Popper and pseudoscience (Day #4)

 

 

Categories: 3-year memory lane, Statistics | 2 Comments

Guest Blog: STEPHEN SENN: ‘Fisher’s alternative to the alternative’

“You May Believe You Are a Bayesian But You Are Probably Wrong”

.

As part of the week of recognizing R.A.Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012.  (I will comment in the comments.)

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests. Continue reading

Categories: Fisher, S. Senn, Statistics | 13 Comments

Guest Blog: ARIS SPANOS: The Enduring Legacy of R. A. Fisher

By Aris Spanos

One of R. A. Fisher’s (17 February 1890 — 29 July 1962) most re­markable, but least recognized, achievement was to initiate the recast­ing of statistical induction. Fisher (1922) pioneered modern frequentist statistics as a model-based approach to statistical induction anchored on the notion of a statistical model, formalized by:

Mθ(x)={f(x;θ); θ∈Θ}; x∈Rn ;Θ⊂Rm; m < n; (1)

where the distribution of the sample f(x;θ) ‘encapsulates’ the proba­bilistic information in the statistical model.

Before Fisher, the notion of a statistical model was vague and often implicit, and its role was primarily confined to the description of the distributional features of the data in hand using the histogram and the first few sample moments; implicitly imposing random (IID) samples. The problem was that statisticians at the time would use descriptive summaries of the data to claim generality beyond the data in hand x0:=(x1,x2,…,xn) As late as the 1920s, the problem of statistical induction was understood by Karl Pearson in terms of invoking (i) the ‘stability’ of empirical results for subsequent samples and (ii) a prior distribution for θ.

Fisher was able to recast statistical inference by turning Karl Pear­son’s approach, proceeding from data x0 in search of a frequency curve f(x;ϑ) to describe its histogram, on its head. He proposed to begin with a prespecified Mθ(x) (a ‘hypothetical infinite population’), and view x0 as a ‘typical’ realization thereof; see Spanos (1999). Continue reading

Categories: Fisher, Spanos, Statistics | Tags: , , , , , , | Leave a comment

R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

Today is R.A. Fisher’s birthday. I’ll post some different Fisherian items this week in honor of it. This paper comes just before the conflicts with Neyman and Pearson erupted.  Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power.  It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below. Happy Birthday Fisher!

Two New Properties of Mathematical Likelihood

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

  The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true. Continue reading

Categories: Fisher, phil/history of stat, Statistics | Tags: , , , | 2 Comments

Cox’s (1958) weighing machine example

IMG_0079

.

A famous chestnut given by Cox (1958) recently came up in conversation. The example  “is now usually called the ‘weighing machine example,’ which draws attention to the need for conditioning, at least in certain types of problems” (Reid 1992, p. 582). When I describe it, you’ll find it hard to believe many regard it as causing an earthquake in statistical foundations, unless you’re already steeped in these matters. If half the time I reported my weight from a scale that’s always right, and half the time use a scale that gets it right with probability .5, would you say I’m right with probability ¾? Well, maybe. But suppose you knew that this measurement was made with the scale that’s right with probability .5? The overall error probability is scarcely relevant for giving the warrant of the particular measurement,knowing which scale was used. Continue reading

Categories: Error Statistics, Sir David Cox, Statistics, strong likelihood principle | 1 Comment

Hocus pocus! Adopt a magician’s stance, if you want to reveal statistical sleights of hand

images-3

.

Here’s the follow-up post to the one I reblogged on Feb 3 (please read that one first). When they sought to subject Uri Geller to the scrutiny of scientists, magicians had to be brought in because only they were sufficiently trained to spot the subtle sleight of hand shifts by which the magician tricks by misdirection. We, too, have to be magicians to discern the subtle misdirections and shifts of meaning in the discussions of statistical significance tests (and other methods)—even by the same statistical guide. We needn’t suppose anything deliberately devious is going on at all! Often, the statistical guidebook reflects shifts of meaning that grow out of one or another critical argument. These days, they trickle down quickly to statistical guidebooks, thanks to popular articles on the “statistics crisis in science”. The danger is that their own guidebooks contain inconsistencies. To adopt the magician’s stance is to be on the lookout for standard sleights of hand. There aren’t that many.[0]

I don’t know Jim Frost, but he gives statistical guidance at the minitab blog. The purpose of my previous post is to point out that Frost uses the probability of a Type I error in two incompatible ways in his posts on significance tests. I assumed he’d want to clear this up, but so far he has not. His response to a comment I made on his blog is this: Continue reading

Categories: frequentist/Bayesian, P-values, reforming the reformers, S. Senn, Statistics | 39 Comments

High error rates in discussions of error rates: no end in sight

27D0BB5300000578-3168627-image-a-27_1437433320306

waiting for the other shoe to drop…

“Guides for the Perplexed” in statistics become “Guides to Become Perplexed” when “error probabilities” (in relation to statistical hypotheses tests) are confused with posterior probabilities of hypotheses. Moreover, these posteriors are neither frequentist, subjectivist, nor default. Since this doublespeak is becoming more common in some circles, it seems apt to reblog a post from one year ago (you may wish to check the comments).

Do you ever find yourself holding your breath when reading an exposition of significance tests that’s going swimmingly so far? If you’re a frequentist in exile, you know what I mean. I’m sure others feel this way too. When I came across Jim Frost’s posts on The Minitab Blog, I thought I might actually have located a success story. He does a good job explaining P-values (with charts), the duality between P-values and confidence levels, and even rebuts the latest “test ban” (the “Don’t Ask, Don’t Tell” policy). Mere descriptive reports of observed differences that the editors recommend, Frost shows, are uninterpretable without a corresponding P-value or the equivalent. So far, so good. I have only small quibbles, such as the use of “likelihood” when meaning probability, and various and sundry nitpicky things. But watch how in some places significance levels are defined as the usual error probabilities —indeed in the glossary for the site—while in others it is denied they provide error probabilities. In those other places, error probabilities and error rates shift their meaning to posterior probabilities, based on priors representing the “prevalence” of true null hypotheses.

Begin with one of his kosher posts “Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics” (blue is Frost): Continue reading

Categories: highly probable vs highly probed, J. Berger, reforming the reformers, Statistics | 1 Comment

3 YEARS AGO (JANUARY 2014): MEMORY LANE

3 years ago...

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: January 2014. I mark in red three posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 3 others I’d recommend[2].  Posts that are part of a “unit” or a group count as one. This month, I’m grouping the 3 posts from my seminar with A. Spanos, counting them as 1.

January 2014

  • (1/2) Winner of the December 2013 Palindrome Book Contest (Rejected Post)
  • (1/3) Error Statistics Philosophy: 2013
  • (1/4) Your 2014 wishing well. …
  • (1/7) “Philosophy of Statistical Inference and Modeling” New Course: Spring 2014: Mayo and Spanos: (Virginia Tech)
  • (1/11) Two Severities? (PhilSci and PhilStat)
  • (1/14) Statistical Science meets Philosophy of Science: blog beginnings
  • (1/16) Objective/subjective, dirty hands and all that: Gelman/Wasserman blogolog (ii)
  • (1/18) Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]
  • (1/22) Phil6334: “Philosophy of Statistical Inference and Modeling” New Course: Spring 2014: Mayo and Spanos (Virginia Tech) UPDATE: JAN 21
  • (1/24) Phil 6334: Slides from Day #1: Four Waves in Philosophy of Statistics
  • (1/25) U-Phil (Phil 6334) How should “prior information” enter in statistical inference?
  • (1/27) Winner of the January 2014 palindrome contest (rejected post)
  • (1/29) BOSTON COLLOQUIUM FOR PHILOSOPHY OF SCIENCE: Revisiting the Foundations of Statistics

    .

  • (1/31) Phil 6334: Day #2 Slides

 

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016-very convenient.

 

Save

Save

Save

Save

Save

Categories: 3-year memory lane, Bayesian/frequentist, Statistics | 1 Comment

The “P-values overstate the evidence against the null” fallacy

3077175-lg

.

The allegation that P-values overstate the evidence against the null hypothesis continues to be taken as gospel in discussions of significance tests. All such discussions, however, assume a notion of “evidence” that’s at odds with significance tests–generally Bayesian probabilities of the sort used in Jeffrey’s-Lindley disagreement (default or “I’m selecting from an urn of nulls” variety). Szucs and Ioannidis (in a draft of a 2016 paper) claim “it can be shown formally that the definition of the p value does exaggerate the evidence against H0” (p. 15) and they reference the paper I discuss below: Berger and Sellke (1987). It’s not that a single small P-value provides good evidence of a discrepancy (even assuming the model, and no biasing selection effects); Fisher and others warned against over-interpreting an “isolated” small P-value long ago.  But the formulation of the “P-values overstate the evidence” meme introduces brand new misinterpretations into an already confused literature! The following are snippets from some earlier posts–mostly this one–and also includes some additions from my new book (forthcoming). 

Categories: Bayesian/frequentist, fallacy of rejection, highly probable vs highly probed, P-values, Statistics | 46 Comments

BOSTON COLLOQUIUM FOR PHILOSOPHY OF SCIENCE: Understanding Reproducibility & Error Correction in Science

BOSTON COLLOQUIUM FOR PHILOSOPHY OF SCIENCE

2016–2017
57th Annual Program

Download the 57th Annual Program

The Alfred I. Taub forum:

UNDERSTANDING REPRODUCIBILITY & ERROR CORRECTION IN SCIENCE

Cosponsored by GMS and BU’s BEST at Boston University.
Friday, March 17, 2017
1:00 p.m. – 5:00 p.m.
The Terrace Lounge, George Sherman Union
775 Commonwealth Avenue

  • Reputation, Variation, &, Control: Historical Perspectives
    Jutta Schickore History and Philosophy of Science & Medicine, Indiana University, Bloomington.
  • Crisis in Science: Time for Reform?
    Arturo Casadevall Molecular Microbiology & Immunology, Johns Hopkins
  • Severe Testing: The Key to Error Correction
    Deborah Mayo Philosophy, Virginia Tech
  • Replicate That…. Maintaining a Healthy Failure Rate in Science
    Stuart Firestein Biological Sciences, Columbia

 

boston-mayo-2017

Categories: Announcement, philosophy of science, Philosophy of Statistics, Statistical fraudbusting, Statistics | Leave a comment

Midnight With Birnbaum (Happy New Year 2016)

 Just as in the past 5 years since I’ve been blogging, I revisit that spot in the road at 11p.m., just outside the Elbar Room, get into a strange-looking taxi, and head to “Midnight With Birnbaum”. (The pic on the left is the only blurry image I have of the club I’m taken to.) I wonder if the car will come for me this year, given that my Birnbaum article has been out since 2014… The (Strong) Likelihood Principle–whether or not it is named–remains at the heart of many of the criticisms of Neyman-Pearson (N-P) statistics (and cognate methods). Yet as Birnbaum insisted, the “confidence concept” is the “one rock in a shifting scene” of statistical foundations, insofar as there’s interest in controlling the frequency of erroneous interpretations of data. (See my rejoinder.) Birnbaum bemoaned the lack of an explicit evidential interpretation of N-P methods. Maybe in 2017? Anyway, it’s 6 hrs later here, so I’m about to leave for that spot in the road… If I’m picked up, I’ll add an update at the end.

You know how in that (not-so) recent Woody Allen movie, “Midnight in Paris,” the main character (I forget who plays it, I saw it on a plane) is a writer finishing a novel, and he steps into a cab that mysteriously picks him up at midnight and transports him back in time where he gets to run his work by such famous authors as Hemingway and Virginia Wolf?  He is impressed when his work earns their approval and he comes back each night in the same mysterious cab…Well, imagine an error statistical philosopher is picked up in a mysterious taxi at midnight (New Year’s Eve 2011 2012, 2013, 2014, 2015, 2016) and is taken back fifty years and, lo and behold, finds herself in the company of Allan Birnbaum.[i] There are a couple of brief (12/31/14 & 15) updates at the end.  

.

.

ERROR STATISTICIAN: It’s wonderful to meet you Professor Birnbaum; I’ve always been extremely impressed with the important impact your work has had on philosophical foundations of statistics.  I happen to be writing on your famous argument about the likelihood principle (LP).  (whispers: I can’t believe this!)

BIRNBAUM: Ultimately you know I rejected the LP as failing to control the error probabilities needed for my Confidence concept. Continue reading

Categories: Birnbaum Brakes, Statistics, strong likelihood principle | Tags: , , , | 21 Comments

Szucs & Ioannidis Revive the Limb-Sawing Fallacy

 

images-2

.

When logical fallacies of statistics go uncorrected, they are repeated again and again…and again. And so it is with the limb-sawing fallacy I first posted in one of my “Overheard at the Comedy Hour” posts.* It now resides as a comic criticism of significance tests in a paper by Szucs and Ioannidis (posted this week),  Here’s their version:

“[P]aradoxically, when we achieve our goal and successfully reject Hwe will actually be left in complete existential vacuum because during the rejection of HNHST ‘saws off its own limb’ (Jaynes, 2003; p. 524): If we manage to reject H0then it follows that pr(data or more extreme data|H0) is useless because H0 is not true” (p.15).

Here’s Jaynes (p. 524):

“Suppose we decide that the effect exists; that is, we reject [null hypothesis] H0. Surely, we must also reject probabilities conditional on H0, but then what was the logical justification for the decision? Orthodox logic saws off its own limb.’ 

Ha! Ha! By this reasoning, no hypothetical testing or falsification could ever occur. As soon as H is falsified, the grounds for falsifying disappear! If H: all swans are white, then if I see a black swan, H is falsified. But according to this criticism, we can no longer assume the deduced prediction from H! What? Continue reading

Categories: Error Statistics, P-values, reforming the reformers, Statistics | 14 Comments

3 YEARS AGO (DECEMBER 2013): MEMORY LANE

3 years ago...

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: December 2013. I mark in red three posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 3 others I’d recommend[2].  Posts that are part of a “unit” or a group count as one. In this post, that makes 12/27-12/28 count as one.

December 2013

  • (12/3) Stephen Senn: Dawid’s Selection Paradox (guest post)
  • (12/7) FDA’s New Pharmacovigilance
  • (12/9) Why ecologists might want to read more philosophy of science (UPDATED)
  • (12/11) Blog Contents for Oct and Nov 2013
  • (12/14) The error statistician has a complex, messy, subtle, ingenious piece-meal approach
  • (12/15) Surprising Facts about Surprising Facts
  • (12/19) A. Spanos lecture on “Frequentist Hypothesis Testing
  • (12/24) U-Phil: Deconstructions [of J. Berger]: Irony & Bad Faith 3
  • (12/25) “Bad Arguments” (a book by Ali Almossawi)
  • (12/26) Mascots of Bayesneon statistics (rejected post)
  • (12/27) Deconstructing Larry Wasserman
  • (12/28) More on deconstructing Larry Wasserman (Aris Spanos)
  • (12/28) Wasserman on Wasserman: Update! December 28, 2013
  • (12/31) Midnight With Birnbaum (Happy New Year)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016-very convenient.

 

Save

Save

Save

Save

Save

Categories: 3-year memory lane, Bayesian/frequentist, Error Statistics, Statistics | 1 Comment

Blog at WordPress.com.