Author Archives: Mayo

About Mayo

I am a professor in the Department of Philosophy at Virginia Tech and hold a visiting appointment at the Center for the Philosophy of Natural and Social Science of the London School of Economics. I am the author of Error and the Growth of Experimental Knowledge, which won the 1998 Lakatos Prize, awarded to the most outstanding contribution to the philosophy of science during the previous six years. I have applied my approach toward solving key problems in philosophy of science: underdetermination, the role of novel evidence, Duhem's problem, and the nature of scientific progress. I am also interested in applications to problems in risk analysis and risk controversies, and co-edited Acceptable Evidence: Science and Values in Risk Management (with Rachelle Hollander). I teach courses in introductory and advanced logic (including the metatheory of logic and modal logic), in scientific method, and in philosophy of science. I also teach special topics courses in Science and Technology Studies.

G.A. Barnard: The “catch-all” factor: probability vs likelihood

 

G.A. Barnard: 23 Sept. 1915 – 9 Aug. 2002

With continued acknowledgement of Barnard’s birthday on Friday, Sept. 23, I reblog an exchange on catchall probabilities from “The Savage Forum” (pp. 79-84, Savage 1962) with some new remarks.[i]

 BARNARD:…Professor Savage, as I understand him, said earlier that a difference between likelihoods and probabilities was that probabilities would normalize because they integrate to one, whereas likelihoods will not. Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses. … If this is so we ought to allow that in addition to the hypotheses that we really consider we should allow something that we had not thought of yet, and of course as soon as we do this we lose the normalizing factor of the probability, and from that point of view probability has no advantage over likelihood. This is my general point, that I think while I agree with a lot of the technical points, I would prefer that this is talked about in terms of likelihood rather than probability. I should like to ask what Professor Savage thinks about that, whether he thinks that the necessity to enumerate hypotheses exhaustively, is important.

SAVAGE: Surely, as you say, we cannot always enumerate hypotheses so completely as we like to think. The list can, however, always be completed by tacking on a catch-all ‘something else’. In principle, a person will have probabilities given ‘something else’ just as he has probabilities given other hypotheses. In practice, the probability of a specified datum given ‘something else’ is likely to be particularly vague­–an unpleasant reality. The probability of ‘something else’ is also meaningful of course, and usually, though perhaps poorly defined, it is definitely very small. Looking at things this way, I do not find probabilities unnormalizable, certainly not altogether unnormalizable.

Whether probability has an advantage over likelihood seems to me like the question whether volts have an advantage over amperes. The meaninglessness of a norm for likelihood is for me a symptom of the great difference between likelihood and probability. Since you question that symptom, I shall mention one or two others. …

On the more general aspect of the enumeration of all possible hypotheses, I certainly agree that the danger of losing serendipity by binding oneself to an over-rigid model is one against which we cannot be too alert. We must not pretend to have enumerated all the hypotheses in some simple and artificial enumeration that actually excludes some of them. The list can however be completed, as I have said, by adding a general ‘something else’ hypothesis, and this will be quite workable, provided you can tell yourself in good faith that ‘something else’ is rather improbable. The ‘something else’ hypothesis does not seem to make it any more meaningful to use likelihood for probability than to use volts for amperes.

Let us consider an example. Off hand, one might think it quite an acceptable scientific question to ask, ‘What is the melting point of californium?’ Such a question is, in effect, a list of alternatives that pretends to be exhaustive. But, even specifying which isotope of californium is referred to and the pressure at which the melting point is wanted, there are alternatives that the question tends to hide. It is possible that californium sublimates without melting or that it behaves like glass. Who dare say what other alternatives might obtain? An attempt to measure the melting point of californium might, if we are serendipitous, lead to more or less evidence that the concept of melting point is not directly applicable to it. Whether this happens or not, Bayes’s theorem will yield a posterior probability distribution for the melting point given that there really is one, based on the corresponding prior conditional probability and on the likelihood of the observed reading of the thermometer as a function of each possible melting point. Neither the prior probability that there is no melting point, nor the likelihood for the observed reading as a function of hypotheses alternative to that of the existence of a melting point enter the calculation. The distinction between likelihood and probability seems clear in this problem, as in any other.

BARNARD: Professor Savage says in effect, ‘add at the bottom of list H1, H2,…”something else”’. But what is the probability that a penny comes up heads given the hypothesis ‘something else’. We do not know. What one requires for this purpose is not just that there should be some hypotheses, but that they should enable you to compute probabilities for the data, and that requires very well defined hypotheses. For the purpose of applications, I do not think it is enough to consider only the conditional posterior distributions mentioned by Professor Savage.

LINDLEY: I am surprised at what seems to me an obvious red herring that Professor Barnard has drawn across the discussion of hypotheses. I would have thought that when one says this posterior distribution is such and such, all it means is that among the hypotheses that have been suggested the relevant probabilities are such and such; conditionally on the fact that there is nothing new, here is the posterior distribution. If somebody comes along tomorrow with a brilliant new hypothesis, well of course we bring it in.

BARTLETT: But you would be inconsistent because your prior probability would be zero one day and non-zero another.

LINDLEY: No, it is not zero. My prior probability for other hypotheses may be ε. All I am saying is that conditionally on the other 1 – ε, the distribution is as it is.

BARNARD: Yes, but your normalization factor is now determined by ε. Of course ε may be anything up to 1. Choice of letter has an emotional significance.

LINDLEY: I do not care what it is as long as it is not one.

BARNARD: In that event two things happen. One is that the normalization has gone west, and hence also this alleged advantage over likelihood. Secondly, you are not in a position to say that the posterior probability which you attach to an hypothesis from an experiment with these unspecified alternatives is in any way comparable with another probability attached to another hypothesis from another experiment with another set of possibly unspecified alternatives. This is the difficulty over likelihood. Likelihood in one class of experiments may not be comparable to likelihood from another class of experiments, because of differences of metric and all sorts of other differences. But I think that you are in exactly the same difficulty with conditional probabilities just because they are conditional on your having thought of a certain set of alternatives. It is not rational in other words. Suppose I come out with a probability of a third that the penny is unbiased, having considered a certain set of alternatives. Now I do another experiment on another penny and I come out of that case with the probability one third that it is unbiased, having considered yet another set of alternatives. There is no reason why I should agree or disagree in my final action or inference in the two cases. I can do one thing in one case and other in another, because they represent conditional probabilities leaving aside possibly different events.

LINDLEY: All probabilities are conditional.

BARNARD: I agree.

LINDLEY: If there are only conditional ones, what is the point at issue?

PROFESSOR E.S. PEARSON: I suggest that you start by knowing perfectly well that they are conditional and when you come to the answer you forget about it.

BARNARD: The difficulty is that you are suggesting the use of probability for inference, and this makes us able to compare different sets of evidence. Now you can only compare probabilities on different sets of evidence if those probabilities are conditional on the same set of assumptions. If they are not conditional on the same set of assumptions they are not necessarily in any way comparable.

LINDLEY: Yes, if this probability is a third conditional on that, and if a second probability is a third, conditional on something else, a third still means the same thing. I would be prepared to take my bets at 2 to 1.

BARNARD: Only if you knew that the condition was true, but you do not.

GOOD: Make a conditional bet.

BARNARD: You can make a conditional bet, but that is not what we are aiming at.

WINSTEN: You are making a cross comparison where you do not really want to, if you have got different sets of initial experiments. One does not want to be driven into a situation where one has to say that everything with a probability of a third has an equal degree of credence. I think this is what Professor Barnard has really said.

BARNARD: It seems to me that likelihood would tell you that you lay 2 to 1 in favour of H1 against H2, and the conditional probabilities would be exactly the same. Likelihood will not tell you what odds you should lay in favour of H1 as against the rest of the universe. Probability claims to do that, and it is the only thing that probability can do that likelihood cannot.

[i] Anyone who thinks we really want a Bayesian probability assignment to a hypothesis must come to grips with the fact that it depends on having a catchall factor (all possible hypotheses that could explain the data) and the probability of the data given “something else”. Barnard is telling Savage this is unrealistic, and that when something new enters, the original probability assessments are wrong. In their attempts to get the “catchall factor” to disappear, most probabilists appeal to comparative assessments–likelihood ratios or Bayes factors. So, Barnard was saying, back then, probability has no advantage over likelihood. Several key problems remain: (i) the appraisal is always relative to the choice of alternative, and this allows “favoring” one or the other hypothesis without being able to say there is evidence for either; (ii) although the hypotheses are not exhaustive, many give priors to the null and alternative that sum to 1; (iii) the ratios do not have the same evidential meaning in different cases (what’s high? 10, 50, 800?); and (iv) there’s a lack of control of the probability of misleading interpretations, except with predesignated point-against-point hypotheses or special cases (this is why Barnard later rejected Likelihoodism). You can read the rest of pages 78-103 of the Savage Forum here. This exchange was first blogged here. Share your comments.
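To see the renormalization worry in miniature, here is a toy numerical sketch (my own illustration with made-up numbers, not anything from the Forum): posteriors computed over an assumed list of hypotheses plus a catch-all with prior ε get revised across the board once a hypothesis “we had not thought of yet” is carved out of the catch-all.

```python
# Toy illustration (hypothetical numbers): how posteriors over an assumed set of
# hypotheses shift when a previously unconsidered hypothesis is carved out of the
# catch-all "something else".

def posterior(priors, likelihoods):
    """Bayes' theorem over an enumerated (allegedly exhaustive) set of hypotheses."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [round(j / total, 3) for j in joint]

# Day 1: H1, H2, and a catch-all with prior eps; the likelihood under the
# catch-all is the vague quantity Barnard is pressing Savage about.
eps = 0.05
print(posterior([0.475, 0.475, eps], [0.20, 0.05, 0.10]))
# -> roughly [0.77, 0.19, 0.04]

# Day 2: a "brilliant new hypothesis" H3 is brought in, taking most of eps.
print(posterior([0.475, 0.475, 0.04, 0.01], [0.20, 0.05, 0.60, 0.10]))
# -> H1's posterior drops to roughly 0.66 even though nothing about H1 changed.
```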

 References

Savage, L. (1962), “Discussion”, in The Foundations of Statistical Inference: A Discussion, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 76.
 Links to a scan of the entire Savage forum may be found here.
Categories: Barnard, highly probable vs highly probed, phil/history of stat | Leave a comment

George Barnard’s birthday: stopping rules, intentions

G.A. Barnard: 23 Sept.1915 – 9 Aug.2002

Today is George Barnard’s birthday. I met him in the 1980s and we corresponded off and on until 1999. Here’s a snippet of his discussion with Savage (1962) (link below [i]) that connects to issues often taken up on this blog: stopping rules and the likelihood principle. (It’s a slightly revised reblog of an earlier post.) I’ll post some other items related to Barnard this week, in honor of his birthday.

Happy Birthday George!

Barnard: I have been made to think further about this issue of the stopping rule since I first suggested that the stopping rule was irrelevant (Barnard 1947a,b). This conclusion does not follow only from the subjective theory of probability; it seems to me that the stopping rule is irrelevant in certain circumstances. Since 1947 I have had the great benefit of a long correspondence—not many letters because they were not very frequent, but it went on over a long time—with Professor Bartlett, as a result of which I am considerably clearer than I was before. My feeling is that, as I indicated [on p. 42], we meet with two sorts of situation in applying statistics to data. One is where we want to have a single hypothesis with which to confront the data. Do they agree with this hypothesis or do they not? Now in that situation you cannot apply Bayes’s theorem because you have not got any alternatives to think about and specify—not yet. I do not say they are not specifiable—they are not specified yet. And in that situation it seems to me the stopping rule is relevant.

In particular, suppose somebody sets out to demonstrate the existence of extrasensory perception and says ‘I am going to go on until I get a one in ten thousand significance level’. Knowing that this is what he is setting out to do would lead you to adopt a different test criterion. What you would look at would not be the ratio of successes obtained, but how long it took him to obtain it. And you would have a very simple test of significance which said if it took you so long to achieve this increase in the score above the chance fraction, this is not at all strong evidence for E.S.P., it is very weak evidence. And the reversing of the choice of test criteria would I think overcome the difficulty.

This is the answer to the point Professor Savage makes; he says why use one method when you have vague knowledge, when you would use a quite different method when you have precise knowledge. It seems to me the answer is that you would use one method when you have precisely determined alternatives, with which you want to compare a given hypothesis, and you use another method when you do not have these alternatives.

Savage: May I digress to say publicly that I learned the stopping-rule principle from professor Barnard, in conversation in the summer of 1952. Frankly I then thought it a scandal that anyone in the profession could advance an idea so patently wrong, even as today I can scarcely believe that some people resist an idea so patently right. I am particularly surprised to hear Professor Barnard say today that the stopping rule is irrelevant in certain circumstances only, for the argument he first gave in favour of the principle seems quite unaffected by the distinctions just discussed. The argument then was this: The design of a sequential experiment is, in the last analysis, what the experimenter actually intended to do. His intention is locked up inside his head and cannot be known to those who have to judge the experiment. Never having been comfortable with that argument, I am not advancing it myself. But if Professor Barnard still accepts it, how can he conclude that the stopping-rule principle is only sometimes valid? (emphasis added)

Barnard: If I may reply briefly to Professor Savage’s question as to whether I still accept the argument I put to Professor Savage in 1952 (Barnard 1947a), I would say that I do so in relation to the question then discussed, where it is a matter of choosing from among a number of simple statistical hypotheses. When it is a question of deciding whether an observed result is reasonably consistent or not with a single hypothesis, no simple statistical alternatives being specified, then the argument cannot be applied. I would not claim it as foresight so much as good fortune that on page 664 of the reference given I did imply that the likelihood-ratio argument would apply ‘to all questions where the choice lies between a finite number of exclusive alternatives’; it is implicit that the alternatives here must be statistically specified. (75-77)

By the time I met Barnard, he was in what might be called the Fisher-E.S. Pearson-Birnbaum-Cox error statistical camp, rather than being a Likelihoodist. The example under discussion is a 2-sided Normal test of a 0 mean and was brought up by Peter Armitage (a specialist in sequential trials in medicine). I don’t see that it makes sense to favor preregistration and preregistered reports while denying the relevance to evidence of optional stopping and outcomes other than the one observed. That your appraisal of the evidence is altered when you actually see the history or plan supplied by the preregistered report is equivalent to worrying about the fact that the researcher would have stopped had the results been significant prior to the point of stopping–when this is not written down. It’s not a matter of “intentions locked up inside his head” that renders an inference unwarranted when it results from trying and trying again (to reach significance)–at least in a case like this one.
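For readers who want to see the “trying and trying again” point in numbers, here is a rough simulation (my own sketch, not Armitage’s calculation): with a true null, sampling until the nominal p-value dips below 0.05 produces “significant” results far more often than 5% of the time, and the rate climbs toward 1 as the maximum sample size grows.

```python
# Sketch of optional stopping under a true null (illustrative, not Armitage's example):
# test H0: mu = 0 (Normal data, sigma = 1 known) after each new observation and stop
# at the first nominal two-sided p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_max, trials = 0.05, 500, 2000
hits = 0
for _ in range(trials):
    x = rng.normal(0.0, 1.0, n_max)          # H0 is true throughout
    csum = np.cumsum(x)
    for n in range(2, n_max + 1):
        z = csum[n - 1] / np.sqrt(n)          # z statistic for H0: mu = 0
        if 2 * (1 - stats.norm.cdf(abs(z))) < alpha:
            hits += 1
            break

# The reported proportion is well above 0.05, which is precisely why an error
# statistician refuses to treat the stopping rule as irrelevant to the evidence.
print(f"Proportion of trials reaching nominal p < {alpha}: {hits / trials:.2f}")
```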
Share your Barnard reflections and links in the comments.
References
Savage, L. (1962), “Discussion”, in The Foundations of Statistical Inference: A Discussion, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 76.

Barnard, G.A. (1947a), “A Review of Sequential Analysis by Abraham Wald,” J. Amer. Statist. Assoc., 42, 658-669.

Barnard, G.A. (1947b), “The Meaning of a Significance Level”, Biometrika, 34, 179-182.

Categories: Likelihood Principle, Philosophy of Statistics | Tags: | 2 Comments

Revisiting Popper’s Demarcation of Science 2017

28 July 1902 – 17 Sept. 1994

Karl Popper died on September 17, 1994. One thing that gets revived in my new book (Statistical Inference as Severe Testing, 2018, CUP) is a Popperian demarcation of science vs pseudoscience. Here’s a snippet from what I call a “live exhibit” (where the reader experiments with a subject) toward the end of a chapter on Popper:

Live Exhibit. Revisiting Popper’s Demarcation of Science: Here’s an experiment: Try shifting what Popper says about theories to a related claim about inquiries to find something out. To see what I have in mind, join me in watching a skit over the lunch break:

Physicist: “If mere logical falsifiability suffices for a theory to be scientific, then, we can’t properly oust astrology from the scientific pantheon. Plenty of nutty theories have been falsified, so by definition they’re scientific. Moreover, scientists aren’t always looking to subject well corroborated theories to “grave risk” of falsification.”

Fellow traveler: “I’ve been thinking about this. On your first point, Popper confuses things by making it sound as if he’s asking: When is a theory unscientific? What he is actually asking or should be asking is: When is an inquiry into a theory, or an appraisal of claim H unscientific? We want to distinguish meritorious modes of inquiry from those that are BENT. If the test methods enable ad hoc maneuvering, sneaky face-saving devices, then the inquiry–the handling and use of data–is unscientific. Despite being logically falsifiable, theories can be rendered immune from falsification by means of cavalier methods for their testing. Adhering to a falsified theory no matter what is poor science. On the other hand, some areas have so much noise that you can’t pinpoint what’s to blame for failed predictions. This is another way that inquiries become bad science.”

She continues:

“On your second point, it’s true that Popper talked of wanting to subject theories to grave risk of falsification. I suggest that it’s really our inquiries into, or tests of, the theories that we want to subject to grave risk. The onus is on interpreters of data to show how they are countering the charge of a poorly run test. I admit this is a modification of Popper. One could reframe the entire problem as one of the character of the inquiry or test.”

In the context of trying to find something out, in addition to blocking inferences that fail the minimal requirement for severity[1]:

A scientific inquiry or test: must be able to embark on a reliable inquiry to pinpoint blame for anomalies (and use the results to replace falsified claims and build a repertoire of errors).

The parenthetical remark isn’t absolutely required, but is a feature that greatly strengthens scientific credentials. Without solving, not merely embarking on, some Duhemian problems, there are no interesting falsifications. The ability or inability to pin down the source of failed replications–a familiar occupation these days–speaks to the scientific credentials of an inquiry. At any given time, there are anomalies whose sources haven’t been traced–unsolved Duhemian problems–generally at “higher” levels of the theory-data array. Embarking on solving these is the impetus for new conjectures. Checking test assumptions is part of working through the Duhemian maze. The reliability requirement is given by inferring claims just to the extent that they pass severe tests. There’s no sharp line for demarcation, but when these requirements are absent, an inquiry veers into the realm of questionable science or pseudoscience. Some physicists worry that highly theoretical realms can’t be expected to be constrained by empirical data. Theoretical constraints are also important.

[1] Before claiming to have evidence for claim C, something must have been done to have found flaws in C, were C false. If a method is incapable of finding flaws with C, then finding none is poor grounds for inferring they are absent.

Mayo, D. 2018. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Cambridge)

Categories: Error Statistics, Popper, pseudoscience, science vs pseudoscience | Tags: | 8 Comments

Peircean Induction and the Error-Correcting Thesis

C. S. Peirce: 10 Sept. 1839 – 19 April 1914

Sunday, September 10, was C.S. Peirce’s birthday. He’s one of my heroes. He’s a treasure chest on essentially any topic, and anticipated quite a lot in statistics and logic. (As Stephen Stigler (2016) notes, he’s to be credited with articulating and applying randomization [1].) I always find something that feels astoundingly new, even rereading him. He’s been a great resource as I complete my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018) [2]. I’m reblogging the main sections of a (2005) paper of mine. It’s written for a very general philosophical audience; the statistical parts are very informal. I first posted it in 2013. Happy (belated) Birthday, Peirce.

Peircean Induction and the Error-Correcting Thesis
Deborah G. Mayo
Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy, Volume 41, Number 2, 2005, pp. 299-319

Peirce’s philosophy of inductive inference in science is based on the idea that what permits us to make progress in science, what allows our knowledge to grow, is the fact that science uses methods that are self-correcting or error-correcting:

Induction is the experimental testing of a theory. The justification of it is that, although the conclusion at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error. (5.145)

Inductive methods—understood as methods of experimental testing—are justified to the extent that they are error-correcting methods. We may call this Peirce’s error-correcting or self-correcting thesis (SCT):

Self-Correcting Thesis SCT: methods for inductive inference in science are error correcting; the justification for inductive methods of experimental testing in science is that they are self-correcting.

Peirce’s SCT has been a source of fascination and frustration. By and large, critics and followers alike have denied that Peirce can sustain his SCT as a way to justify scientific induction: “No part of Peirce’s philosophy of science has been more severely criticized, even by his most sympathetic commentators, than this attempted validation of inductive methodology on the basis of its purported self-correctiveness” (Rescher 1978, p. 20).

In this paper I shall revisit the Peircean SCT: properly interpreted, I will argue, Peirce’s SCT not only serves its intended purpose, it also provides the basis for justifying (frequentist) statistical methods in science. While on the one hand, contemporary statistical methods increase the mathematical rigor and generality of Peirce’s SCT, on the other, Peirce provides something current statistical methodology lacks: an account of inductive inference and a philosophy of experiment that links the justification for statistical tests to a more general rationale for scientific induction. Combining the mathematical contributions of modern statistics with the inductive philosophy of Peirce sets the stage for developing an adequate justification for contemporary inductive statistical methodology.

2. Probabilities are assigned to procedures not hypotheses

Peirce’s philosophy of experimental testing shares a number of key features with the contemporary (Neyman and Pearson) Statistical Theory: statistical methods provide, not means for assigning degrees of probability, evidential support, or confirmation to hypotheses, but procedures for testing (and estimation), whose rationale is their predesignated high frequencies of leading to correct results in some hypothetical long run. A Neyman and Pearson (N-P) statistical test, for example, instructs us “To decide whether a hypothesis, H, of a given type be rejected or not, calculate a specified character, x0, of the observed facts; if x > x0 reject H; if x < x0 accept H.” Although the outputs of N-P tests do not assign hypotheses degrees of probability, “it may often be proved that if we behave according to such a rule … we shall reject H when it is true not more, say, than once in a hundred times, and in addition we may have evidence that we shall reject H sufficiently often when it is false” (Neyman and Pearson, 1933, p. 142).[i]
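A quick simulation conveys the long-run reading of such a rule. This is my own illustrative sketch (a one-sided Normal test with an arbitrary discrepancy of 0.5), not an example from the paper: following the fixed rejection rule, erroneous rejections of a true null occur at about the predesignated rate, while a false null is rejected “sufficiently often”.

```python
# Illustrative sketch of an N-P rule's long-run error rates (not from the paper):
# reject H0: mu = 0 when the mean of n Normal(mu, 1) observations exceeds
# x0 = 1.645/sqrt(n), a one-sided test with size 0.05.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 25, 100_000
x0 = 1.645 / np.sqrt(n)

means_null = rng.normal(0.0, 1.0, (reps, n)).mean(axis=1)   # H0 true
means_alt  = rng.normal(0.5, 1.0, (reps, n)).mean(axis=1)   # a discrepancy of 0.5 is present

print("rejection rate, H0 true        :", (means_null > x0).mean())  # about 0.05
print("rejection rate, discrepancy 0.5:", (means_alt  > x0).mean())  # about 0.8 (the test's power)
```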

The relative frequencies of erroneous rejections and erroneous acceptances in an actual or hypothetical long run sequence of applications of tests are error probabilities; we may call the statistical tools based on error probabilities, error statistical tools. In describing his theory of inference, Peirce could be describing that of the error-statistician:

The theory here proposed does not assign any probability to the inductive or hypothetic conclusion, in the sense of undertaking to say how frequently that conclusion would be found true. It does not propose to look through all the possible universes, and say in what proportion of them a certain uniformity occurs; such a proceeding, were it possible, would be quite idle. The theory here presented only says how frequently, in this universe, the special form of induction or hypothesis would lead us right. The probability given by this theory is in every way different—in meaning, numerical value, and form—from that of those who would apply to ampliative inference the doctrine of inverse chances. (2.748)

The doctrine of “inverse chances” alludes to assigning (posterior) probabilities to hypotheses by applying the definition of conditional probability (Bayes’s theorem)—a computation that requires starting out with a (prior or “antecedent”) probability assignment to an exhaustive set of hypotheses:

If these antecedent probabilities were solid statistical facts, like those upon which the insurance business rests, the ordinary precepts and practice [of inverse probability] would be sound. But they are not and cannot be statistical facts. What is the antecedent probability that matter should be composed of atoms? Can we take statistics of a multitude of different universes? (2.777)

For Peircean induction, as in the N-P testing model, the conclusion or inference concerns a hypothesis that either is or is not true in this one universe; thus, assigning a frequentist probability to a particular conclusion, other than the trivial ones of 1 or 0, for Peirce, makes sense only “if universes were as plentiful as blackberries” (2.684). Thus the Bayesian inverse probability calculation seems forced to rely on subjective probabilities for computing inverse inferences, but “subjective probabilities” Peirce charges “express nothing but the conformity of a new suggestion to our prepossessions, and these are the source of most of the errors into which man falls, and of all the worse of them” (2.777).

Hearing Peirce contrast his view of induction with the more popular Bayesian account of his day (the Conceptualists), one could be listening to an error statistician arguing against the contemporary Bayesian (subjective or other)—with one important difference. Today’s error statistician seems to grant too readily that the only justification for N-P test rules is their ability to ensure we will rarely take erroneous actions with respect to hypotheses in the long run of applications. This so-called inductive behavior rationale seems to supply no adequate answer to the question of what is learned in any particular application about the process underlying the data. Peirce, by contrast, was very clear that what is really wanted in inductive inference in science is the ability to control error probabilities of test procedures, i.e., “the trustworthiness of the proceeding”. Moreover, it is only by a faulty analogy with deductive inference, Peirce explains, that many suppose that inductive (synthetic) inference should supply a probability to the conclusion: “… in the case of analytic inference we know the probability of our conclusion (if the premises are true), but in the case of synthetic inferences we only know the degree of trustworthiness of our proceeding” (“The Probability of Induction”, 2.693).

Knowing the “trustworthiness of our inductive proceeding”, I will argue, enables determining the test’s probative capacity, how reliably it detects errors, and the severity of the test a hypothesis withstands. Deliberately making use of known flaws and fallacies in reasoning with limited and uncertain data, tests may be constructed that are highly trustworthy probes in detecting and discriminating errors in particular cases. This, in turn, enables inferring which inferences about the process giving rise to the data are and are not warranted: an inductive inference to hypothesis H is warranted to the extent that with high probability the test would have detected a specific flaw or departure from what H asserts, and yet it did not.

3. So why is justifying Peirce’s SCT thought to be so problematic?

You can read Section 3 here. (It’s not necessary for understanding the rest.)

4. Peircean induction as severe testing

… [I]nduction, for Peirce, is a matter of subjecting hypotheses to “the test of experiment” (7.182).

The process of testing it will consist, not in examining the facts, in order to see how well they accord with the hypothesis, but on the contrary in examining such of the probable consequences of the hypothesis … which would be very unlikely or surprising in case the hypothesis were not true. (7.231)

When, however, we find that prediction after prediction, notwithstanding a preference for putting the most unlikely ones to the test, is verified by experiment,…we begin to accord to the hypothesis a standing among scientific results.

This sort of inference it is, from experiments testing predictions based on a hypothesis, that is alone properly entitled to be called induction. (7.206)

While these and other passages are redolent of Popper, Peirce differs from Popper in crucial ways. Peirce, unlike Popper, is primarily interested not in falsifying claims but in the positive pieces of information provided by tests, with “the corrections called for by the experiment” and with the hypotheses, modified or not, that manage to pass severe tests. For Popper, even if a hypothesis is highly corroborated (by his lights), he regards this as at most a report of the hypothesis’ past performance and denies it affords positive evidence for its correctness or reliability. Further, Popper denies that he could vouch for the reliability of the method he recommends as “most rational”—conjecture and refutation. Indeed, Popper’s requirements for a highly corroborated hypothesis are not sufficient for ensuring severity in Peirce’s sense (Mayo 1996, 2003, 2005). Where Popper recoils from even speaking of warranted inductions, Peirce conceives of a proper inductive inference as what had passed a severe test—one which would, with high probability, have detected an error if present.

In Peirce’s inductive philosophy, we have evidence for inductively inferring a claim or hypothesis H when not only does H “accord with” the data x; but also, so good an accordance would very probably not have resulted, were H not true. In other words, we may inductively infer H when it has withstood a test of experiment that it would not have withstood, or withstood so well, were H not true (or were a specific flaw present). This can be encapsulated in the following severity requirement for an experimental test procedure, ET, and data set x.

Hypothesis H passes a severe test with x iff (firstly) x accords with H and (secondly) the experimental test procedure ET would, with very high probability, have signaled the presence of an error were there a discordancy between what H asserts and what is correct (i.e., were H false).

The test would “have signaled an error” by having produced results less accordant with H than what the test yielded. Thus, we may inductively infer H when (and only when) H has withstood a test with high error-detecting capacity; the higher this probative capacity, the more severely H has passed. What is assessed (quantitatively or qualitatively) is not the amount of support for H but the probative capacity of the test of experiment ET (with regard to those errors that an inference to H is declaring to be absent). …

You can read the rest of Section 4 here.

5. The path from qualitative to quantitative induction

In my understanding of Peircean induction, the difference between qualitative and quantitative induction is really a matter of degree, according to whether their trustworthiness or severity is quantitatively or only qualitatively ascertainable. This reading not only neatly organizes Peirce’s typologies of the various types of induction, it underwrites the manner in which, within a given classification, Peirce further subdivides inductions by their “strength”.

(I) First-Order, Rudimentary or Crude Induction

Consider Peirce’s First Order of induction: the lowest, most rudimentary form, which he dubs the “pooh-pooh argument”. It is essentially an argument from ignorance: Lacking evidence for the falsity of some hypothesis or claim H, provisionally adopt H. In this very weakest sort of induction, crude induction, the most that can be said is that a hypothesis would eventually be falsified if false. (It may correct itself—but with a bang!) It “is as weak an inference as any that I would not positively condemn” (8.237). While uneliminable in ordinary life, Peirce denies that rudimentary induction is to be included as scientific induction. Without some reason to think evidence of H‘s falsity would probably have been detected, were H false, finding no evidence against H is poor inductive evidence for H. H has passed only a highly unreliable error probe.

(II) Second Order (Qualitative) Induction

It is only with what Peirce calls “the Second Order” of induction that we arrive at a genuine test, and thereby scientific induction. Within second order inductions, a stronger and a weaker type exist, corresponding neatly to viewing strength as the severity of a testing procedure.

The weaker of these is where the predictions that are fulfilled are merely of the continuance in future experience of the same phenomena which originally suggested and recommended the hypothesis… (7.116)

The other variety of the argument … is where [results] lead to new predictions being based upon the hypothesis of an entirely different kind from those originally contemplated and these new predictions are equally found to be verified. (7.117)

The weaker type occurs where the predictions, though fulfilled, lack novelty; whereas, the stronger type reflects a more stringent hurdle having been satisfied: the hypothesis has had “novel” predictive success, and thereby higher severity. (For a discussion of the relationship between types of novelty and severity see Mayo 1991, 1996). Note that within a second order induction the assessment of strength is qualitative, e.g., very strong, weak, very weak.

The strength of any argument of the Second Order depends upon how much the confirmation of the prediction runs counter to what our expectation would have been without the hypothesis. It is entirely a question of how much; and yet there is no measurable quantity. For when such measure is possible the argument … becomes an induction of the Third Order [statistical induction]. (7.115)

It is upon these and like passages that I base my reading of Peirce. A qualitative induction, i.e., a test whose severity is qualitatively determined, becomes a quantitative induction when the severity is quantitatively determined; when an objective error probability can be given.

(III) Third Order, Statistical (Quantitative) Induction

We enter the Third Order of statistical or quantitative induction when it is possible to quantify “how much” the prediction runs counter to what our expectation would have been without the hypothesis. In his discussions of such quantifications, Peirce anticipates to a striking degree later developments of statistical testing and confidence interval estimation (Hacking 1980, Mayo 1993, 1996). Since this is not the place to describe his statistical contributions, I move to more modern methods to make the qualitative-quantitative contrast.

6. Quantitative and qualitative induction: significance test reasoning

Quantitative Severity

A statistical significance test illustrates an inductive inference justified by a quantitative severity assessment. The significance test procedure has the following components: (1) a null hypothesis H0, which is an assertion about the distribution of the sample X = (X1, …, Xn), a set of random variables, and (2) a function of the sample, d(X), the test statistic, which reflects the difference between the data x = (x1, …, xn) and the null hypothesis H0. The observed value of d(X) is written d(x). The larger the value of d(x), the further the outcome is from what is expected under H0, with respect to the particular question being asked. We can imagine that the null hypothesis H0 is

H0: there are no increased cancer risks associated with hormone replacement therapy (HRT) in women who have taken them for 10 years.

Let d(x) measure the increased risk of cancer in n women, half of whom were randomly assigned to HRT. H0 asserts, in effect, that it is an error to take as genuine any positive value of d(x)—any observed difference is claimed to be “due to chance”. The test computes (3) the p-value, which is the probability of a difference larger than d(x), under the assumption that H0 is true:

p-value = Prob(d(X) > d(x); H0).

If this probability is very small, the data are taken as evidence that

H*: cancer risks are higher in women treated with HRT

The reasoning is a statistical version of modus tollens.

If the hypothesis H0 is correct then, with high probability, 1- p, the data would not be statistically significant at level p.

x is statistically significant at level p.

Therefore, x is evidence of a discrepancy from H0, in the direction of an alternative hypothesis H.

(i.e., H* severely passes, where the severity is 1 minus the p-value)[iii]
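To make the computation concrete, here is a minimal sketch with hypothetical counts (not data from any actual HRT trial), using a two-proportion z approximation for d(x) and reporting the p-value along with the severity, 1 minus the p-value, with which H* passes.

```python
# Minimal sketch with hypothetical counts (not real HRT data): a two-proportion
# z approximation for d(x), the p-value Pr(d(X) > d(x); H0), and severity = 1 - p.
import math
from scipy import stats

n_trt, cases_trt = 5000, 165       # women on HRT (hypothetical)
n_ctl, cases_ctl = 5000, 120       # women on placebo (hypothetical)

p_trt, p_ctl = cases_trt / n_trt, cases_ctl / n_ctl
p_pool = (cases_trt + cases_ctl) / (n_trt + n_ctl)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_trt + 1 / n_ctl))

d_x = (p_trt - p_ctl) / se                    # observed test statistic d(x)
p_value = 1 - stats.norm.cdf(d_x)             # one-sided: probing an increased risk

print(f"d(x) = {d_x:.2f}")
print(f"p-value = {p_value:.4f}")
print(f"severity with which H* passes = {1 - p_value:.4f}")
```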

If a particular conclusion is wrong, subsequent severe (or highly powerful) tests will with high probability detect it. In particular, if we are wrong to reject H0 (and H0 is actually true), we would find we were rarely able to get so statistically significant a result to recur, and in this way we would discover our original error.

It is true that the observed conformity of the facts to the requirements of the hypothesis may have been fortuitous. But if so, we have only to persist in this same method of research and we shall gradually be brought around to the truth. (7.115)

The correction is not a matter of getting higher and higher probabilities, it is a matter of finding out whether the agreement is fortuitous; whether it is generated about as often as would be expected were the agreement of the chance variety.

[Here are Part 2 and Part 3; you can find the rest of Section 6 here.]

[1] Stigler discusses some of the experiments Peirce performed. In one, with Joseph Jastrow, the goal was to test whether there’s a threshold below which you can’t discern the difference in weights between two objects. Psychologists had hypothesized that there was a minimal threshold “such that if the difference was below the threshold, termed the just noticeable difference (jnd), the two stimuli were indistinguishable…. [Peirce and Jastrow] showed this speculation was false” (Stigler 2016, 160). No matter how close in weight the objects were, the probability of a correct discernment of difference differed from ½. A good example of evidence for a “no-effect” null by falsifying the alternative statistically.

[2] I’m now, truly, within days of completing a very short, but deep, conclusion. (9/13/17)

REFERENCES:

Hacking, I. 1980 “The Theory of Probable Inference: Neyman, Peirce and Braithwaite”, pp. 141-160 in D. H. Mellor (ed.), Science, Belief and Behavior: Essays in Honour of R.B. Braithwaite. Cambridge: Cambridge University Press.

Laudan, L. 1981 Science and Hypothesis: Historical Essays on Scientific Methodology. Dordrecht: D. Reidel.

Levi, I. 1980 “Induction as Self Correcting According to Peirce”, pp. 127-140 in D. H. Mellor (ed.), Science, Belief and Behavior: Essays in Honor of R.B. Braithwaite. Cambridge: Cambridge University Press.

Mayo, D. 1991 “Novel Evidence and Severe Tests”, Philosophy of Science, 58: 523-552.

———- 1993 “The Test of Experiment: C. S. Peirce and E. S. Pearson”, pp. 161-174 in E. C. Moore (ed.), Charles S. Peirce and the Philosophy of Science. Tuscaloosa: University of Alabama Press.

——— 1996 Error and the Growth of Experimental Knowledge, The University of Chicago Press, Chicago.

———–2003 “Severe Testing as a Guide for Inductive Learning”, in H. Kyburg (ed.), Probability Is the Very Guide in Life. Chicago: Open Court Press, pp. 89-117.

———- 2005 “Evidence as Passing Severe Tests: Highly Probed vs. Highly Proved” in P. Achinstein (ed.), Scientific Evidence, Johns Hopkins University Press.

Mayo, D. and Kruse, M. 2001 “Principles of Inference and Their Consequences,” pp. 381-403 in Foundations of Bayesianism, D. Corfield and J. Williamson (eds.), Dordrecht: Kluwer Academic Publishers.

Mayo, D. and Spanos, A. 2004 “Methodology in Practice: Statistical Misspecification Testing” Philosophy of Science, Vol. II, PSA 2002, pp. 1007-1025.

———- 2006 “Severe Testing as a Basic Concept in a Neyman-Pearson Theory of Induction”, The British Journal for the Philosophy of Science 57: 323-357.

Mayo, D. and Cox, D.R. 2006 “The Theory of Statistics as the ‘Frequentist’s’ Theory of Inductive Inference”, Institute of Mathematical Statistics (IMS) Lecture Notes-Monograph Series, Contributions to the Second Lehmann Symposium, 2005.

Neyman, J. and Pearson, E.S. 1933 “On the Problem of the Most Efficient Tests of Statistical Hypotheses”, in Philosophical Transactions of the Royal Society, A: 231, 289-337, as reprinted in J. Neyman and E.S. Pearson (1967), pp. 140-185.

———- 1967 Joint Statistical Papers, Berkeley: University of California Press.

Niiniluoto, I. 1984 Is Science Progressive? Dordrecht: D. Reidel.

Peirce, C. S. Collected Papers: Vols. I-VI, C. Hartshorne and P. Weiss (eds.) (1931-1935). Vols. VII-VIII, A. Burks (ed.) (1958), Cambridge: Harvard University Press.

Popper, K. 1962 Conjectures and Refutations: the Growth of Scientific Knowledge, Basic Books, New York.

Rescher, N.  1978 Peirce’s Philosophy of Science: Critical Studies in His Theory of Induction and Scientific Method, Notre Dame: University of Notre Dame Press.

Stigler, S. 2016 The Seven Pillars of Statistical Wisdom, Harvard.


[i] Others who relate Peircean induction and Neyman-Pearson tests are Isaac Levi (1980) and Ian Hacking (1980). See also Mayo 1993 and 1996.

[ii] This statement of (b) is regarded by Laudan as the strong thesis of self-correcting. A weaker thesis would replace (b) with (b’): science has techniques for determining unambiguously whether an alternative T’ is closer to the truth than a refuted T.

[iii] If the p-value were not very small, then the difference would be considered statistically insignificant (generally, small values are 0.1 or less). We would then regard H0 as consistent with data x, but we may wish to go further and determine the size of an increased risk r that has thereby been ruled out with severity. We do so by finding a risk increase r such that Prob(d(X) > d(x); risk increase r) is high. Then the assertion ‘the risk increase < r’ passes with high severity, we would argue.

If there were a discrepancy from hypothesis H0 of r (or more), then, with high probability, 1 - p, the data would be statistically significant at level p.

x is not statistically significant at level p.

Therefore, x is evidence that any discrepancy from H0 is less than r.
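A small sketch of this footnote’s computation, with hypothetical numbers of my own choosing: for an insignificant observed difference, the severity with which ‘the risk increase is less than r’ passes is Prob(d(X) > d(x); risk increase r), evaluated here for a few candidate values of r.

```python
# Sketch of footnote [iii] with hypothetical numbers: severity for the claim
# "risk increase < r" after an insignificant result, computed as
# Pr(d(X) > d(x); risk increase = r) under a Normal approximation.
from scipy import stats

se = 0.0033          # standard error of the observed risk difference (hypothetical)
diff_obs = 0.0033    # observed risk difference, about 1 standard error (insignificant)

for r in (0.005, 0.010, 0.015):            # candidate risk increases to rule out
    z = (diff_obs - r) / se                # recentre d(x) at the alternative "increase = r"
    sev = 1 - stats.norm.cdf(z)            # Pr of a larger difference, were the increase r
    print(f"SEV(risk increase < {r:.3f}) = {sev:.3f}")
# Larger discrepancies r are ruled out with high severity; smaller ones are not.
```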

For a general treatment of severity, see Mayo and Spanos (2006).

[Ed. Note: A not bad biographical sketch can be found on wikipedia.]

Categories: Bayesian/frequentist, C.S. Peirce | 2 Comments

Professor Roberta Millstein, Distinguished Marjorie Grene speaker September 15

 

CANCELED

Virginia Tech Philosophy Department

2017 Distinguished Marjorie Grene Speaker

 

Professor Roberta L. Millstein


University of California, Davis

“Types of Experiments and Causal Process Tracing: What Happened on the Kaibab Plateau in the 1920s?”

September 15, 2017

320 Lavery Hall: 5:10-6:45pm

 


ABSTRACT. In a well-cited article, ecologist Jared Diamond characterizes three main types of experiment that are performed in community ecology: the Laboratory Experiment (LE), the Field Experiment (FE), and the Natural Experiment (NE). Diamond argues that each form of experiment has strengths and weaknesses, with respect to, for example, realism or the ability to follow a causal trajectory. But does Diamond’s typology exhaust the available kinds of cause-finding practices? Some social scientists have characterized something they call causal process tracing. Is this a fourth type of experiment or something else? In this talk, I examine Diamond’s typology and causal process tracing in the context of a case study concerning the dynamics of wolf and deer populations on the Kaibab Plateau in the 1920s, a case that has been used as a canonical example of a trophic cascade by ecologists but which has also been subject to a fair bit of controversy. I argue that ecologists have profitably deployed causal process tracing together with other types of experiment to help settle questions of causality in this case. It remains to be seen how widespread the use of causal process tracing outside of the social sciences is (or could be), but there are some potentially promising applications, particularly with respect to questions about specific causal sequences.

There will be an additional, informal discussion of Millstein’s* (2013) Chapter 8: “Natural Selection and Causal Productivity” on Saturday, September 16, 10:15 a.m. at Thebes (Mayo’s house). For queries: error@vt.edu

 

Sponsored by the Mayo-Chatfield Fund for Experimental Reasoning, Reliability, Objectivity and Rationality (E.R.R.O.R.) and the Philosophy Department


*Roberta L. Millstein is Professor of Philosophy at the University of California, Davis, with affiliations to the Science and Technology Studies Program and the John Muir Institute for the Environment. She specializes in the history and philosophy of biology as well as environmental ethics, with a particular focus on fundamental concepts in evolution and ecology. Her work has appeared in journals such as Philosophy of Science; Journal of the History of Biology; Studies in History and Philosophy of Biological and Biomedical Sciences; and Ethics, Policy & Environment. She is Co-Editor of the online, open-access journal Philosophy, Theory, and Practice in Biology and has served on governing boards for several academic societies. Her current project develops and defends a reinterpretation of Aldo Leopold’s land ethic in light of contemporary ecology.

Categories: Announcement | 4 Comments

All She Wrote (so far): Error Statistics Philosophy: 6 years on

D.G. Mayo with her blogging typewriter

Error Statistics Philosophy: Blog Contents (6 years) [i]
By: D. G. Mayo

Dear Reader: It’s hard to believe I’ve been blogging for six years (since Sept. 3, 2011)! A big celebration is taking place at the Elbar Room this evening. If you’re in the neighborhood, stop by for some Elba Grease.

Amazingly, this old typewriter not only still works; one of the whiz kids on Elba managed to bluetooth it to go directly from my typewriter onto the blog (I never got used to computer keyboards.) I still must travel to London to get replacement ribbons for this klunker.

Please peruse the offerings below, and take advantage of some of the super contributions and discussions by guest posters and readers! I don’t know how much longer I’ll continue blogging–I’ve had to cut back this past year (sorry)–but at least until the publication of my book “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars” (CUP, 2018). After that I plan to run conferences, workshops, and ashrams on PhilStat and PhilSci, and I will invite readers to take part! Keep reading and commenting. Sincerely, D. Mayo


 

September 2011

October 2011 Continue reading

Categories: blog contents, Metablog | Leave a comment

Egon Pearson’s Heresy

E.S. Pearson: 11 Aug 1895-12 June 1980.

Here’s one last entry in honor of Egon Pearson’s birthday: “Statistical Concepts in Their Relation to Reality” (Pearson 1955). I’ve posted it several times over the years (6!), but always find a new gem or two, despite its being so short. E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in a long-run control of rates of erroneous interpretations–what he termed the “behavioral” rationale of tests. In an unpublished letter E. Pearson wrote to Birnbaum (1974), he talks about N-P theory admitting of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

(Nowadays, some people concentrate to an absurd extent on “science-wise error rates in dichotomous screening”.) Continue reading

Categories: phil/history of stat, Philosophy of Statistics, Statistics | Tags: , , | Leave a comment

A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

11 August 1895 – 12 June 1980

Continuing with my Egon Pearson posts in honor of his birthday, I reblog a post by Aris Spanos: “Egon Pearson’s Neglected Contributions to Statistics”.

Egon Pearson (11 August 1895 – 12 June 1980) is widely known today for his contribution in recasting Fisher’s significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

(i) specification: the need to state explicitly the inductive premises of one’s inferences,

(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the Student’s t fame] and then Fisher (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as the simple Normal model: Continue reading

Categories: E.S. Pearson, phil/history of stat, Spanos, Testing Assumptions | 2 Comments

Performance or Probativeness? E.S. Pearson’s Statistical Philosophy


E.S. Pearson (11 Aug, 1895-12 June, 1980)

This is a belated birthday post for E.S. Pearson (11 August 1895 – 12 June 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ll blog some E. Pearson items this week, including my latest reflection on a historical anecdote regarding Egon and the woman he wanted to marry, and surely would have, were it not for his father Karl!

HAPPY BELATED BIRTHDAY EGON!

Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long-run error properties is of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson. Continue reading

Categories: highly probable vs highly probed, phil/history of stat, Statistics | Tags: | Leave a comment

Thieme on the theme of lowering p-value thresholds (for Slate)


Here’s an article by Nick Thieme on the same theme as my last blogpost. Thieme, who is Slate’s 2017 AAAS Mass Media Fellow, is the first person to interview me on p-values who (a) was prepared to think through the issue for himself (or herself), and (b) included more than a tiny fragment of my side of the exchange.[i] Please share your comments.

Will Lowering P-Value Thresholds Help Fix Science? P-values are already all over the map, and they’re also not exactly the problem.

 

 

Illustration by Slate

Last week a team of 72 scientists released the preprint of an article attempting to address one aspect of the reproducibility crisis, the crisis of conscience in which scientists are increasingly skeptical about the rigor of our current methods of conducting scientific research.

Their suggestion? Change the threshold for what is considered statistically significant. The team, led by Daniel Benjamin, a behavioral economist from the University of Southern California, is advocating that the “probability value” (p-value) threshold for statistical significance be lowered from the current standard of 0.05 to a much stricter threshold of 0.005. Continue reading

Categories: P-values, reforming the reformers, spurious p values | 14 Comments

“A megateam of reproducibility-minded scientists” look to lowering the p-value


Having discussed the “p-values overstate the evidence against the null” fallacy many times over the past few years, I leave it to readers to disinter the issues (pro and con), and appraise the assumptions, in the most recent rehearsal of the well-known Bayesian argument. There’s nothing intrinsically wrong with demanding everyone work with a lowered p-value–if you’re so inclined to embrace a single, dichotomous standard without context-dependent interpretations, especially if larger sample sizes are required to compensate for the loss of power. But lowering the p-value won’t solve the problems that vex people (biasing selection effects), and is very likely to introduce new ones (see my comment). Kelly Servick, a reporter from Science, gives the ingredients of the main argument given by “a megateam of reproducibility-minded scientists” in an article out today: Continue reading

Categories: Error Statistics, highly probable vs highly probed, P-values, reforming the reformers | 55 Comments

3 YEARS AGO (JULY 2014): MEMORY LANE

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: July 2014. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1]. Posts that are part of a “unit” or a group count as one. This month there are three such groups: 7/8 and 7/10; 7/14 and 7/23; 7/26 and 7/31.

July 2014

  • (7/7) Winner of June Palindrome Contest: Lori Wike
  • (7/8) Higgs Discovery 2 years on (1: “Is particle physics bad science?”)
  • (7/10) Higgs Discovery 2 years on (2: Higgs analysis and statistical flukes)
  • (7/14) “P-values overstate the evidence against the null”: legit or fallacious? (revised)
  • (7/23) Continued:”P-values overstate the evidence against the null”: legit or fallacious?
  • (7/26) S. Senn: “Responder despondency: myths of personalized medicine” (Guest Post)
  • (7/31) Roger Berger on Stephen Senn’s “Blood Simple” with a response by Senn (Guest Posts)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

 


Categories: 3-year memory lane, Higgs, P-values | Leave a comment

On the current state of play in the crisis of replication in psychology: some heresies

.

The replication crisis has created a “cold war between those who built up modern psychology and those” tearing it down with failed replications–or so I read today [i]. As an outsider (to psychology), the severe tester is free to throw some fuel on the fire on both sides. This is a short update on my post “Some ironies in the replication crisis in social psychology” from 2014.

Following the model from clinical trials, an idea gaining steam is to prespecify a “detailed protocol that includes the study rationale, procedure and a detailed analysis plan” (Nosek et al. 2017). In this new paper, they’re called registered reports (RRs). An excellent start. I say it makes no sense to favor preregistration and deny the relevance to evidence of optional stopping and outcomes other than the one observed. If your appraisal of the evidence is altered when you actually see the history supplied by the RR, then you are in effect granting that biasing selection effects matter even when they’re not written down; your statistical method should pick up on them (as p-values, confidence levels and many other error probabilities do). There’s a tension between the RR requirements and accounts following the Likelihood Principle (no need to name names [ii]). Continue reading
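As a toy illustration of why error probabilities “pick up on” optional stopping (my own simulation sketch, not anything from the Nosek et al. paper), consider a researcher who tests after each new observation and stops the first time the nominal p-value dips below 0.05:

```python
# Sketch: optional stopping inflates the type I error rate.
# Data come from a true null (standard normal); we test after each new
# observation from n = 10 onward and stop at the first nominal p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
n_max, n_sims, alpha = 100, 2000, 0.05
false_rejections = 0

for _ in range(n_sims):
    x = rng.standard_normal(n_max)          # null is true: the mean really is 0
    for n in range(10, n_max + 1):
        _, p = stats.ttest_1samp(x[:n], 0.0)
        if p < alpha:                        # "stop when it looks significant"
            false_rejections += 1
            break

print(f"actual type I error rate ≈ {false_rejections / n_sims:.2f} "
      f"(nominal level {alpha})")
# Trying again up to n = 100 pushes the actual rate well above 0.05,
# which is exactly the history an error-statistical appraisal must record.
```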

Categories: Error Statistics, preregistration, reforming the reformers, replication research | 9 Comments

S. Senn: Fishing for fakes with Fisher (Guest Post)

Stephen Senn
Head of Competence Center
for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Fishing for fakes with Fisher

 Stephen Senn

The essential fact governing our analysis is that the errors due to soil heterogeneity will be divided by a good experiment into two portions. The first, which is to be made as large as possible, will be completely eliminated, by the arrangement of the experiment, from the experimental comparisons, and will be as carefully eliminated in the statistical laboratory from the estimate of error. As to the remainder, which cannot be treated in this way, no attempt will be made to eliminate it in the field, but, on the contrary, it will be carefully randomised so as to provide a valid estimate of the errors to which the experiment is in fact liable. R. A. Fisher, The Design of Experiments, (Fisher 1990) section 28.

Fraudian analysis?

John Carlisle must be a man endowed with exceptional energy and determination. A recent paper of his is entitled, ‘Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals,’ (Carlisle 2017) and has created quite a stir. The journals examined include the Journal of the American Medical Association and the New England Journal of Medicine. What Carlisle did was examine 29,789 variables using 72,261 means to see if they were ‘consistent with random sampling’ (by which, I suppose, he means ‘randomisation’). The papers chosen had to report either standard deviations or standard errors of the mean. P-values as measures of balance, or lack of it, were then calculated using each of three methods, and the method that gave the value closest to 0.5 was chosen. For a given trial, the chosen P-values were then converted to z-scores, combined by summing, and the sum was converted back to a P-value using a method that assumes the z-scores are independent. As Carlisle writes, ‘All p values were one-sided and inverted, such that dissimilar means generated p values near 1’. Continue reading
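The recombination step sounds like a Stouffer-type combination of independent p-values; here is a minimal sketch of that idea (an assumption on my part about the details, not Carlisle’s exact procedure, and the numbers are invented):

```python
# Sketch: Stouffer-style combination of a trial's baseline-balance p-values.
# Assumes the (one-sided, inverted) p-values are independent.
import numpy as np
from scipy.stats import norm

def stouffer_combine(p_values):
    """Convert p-values to z-scores, sum, rescale by sqrt(k), and convert
    the combined z-score back to a single p-value."""
    z = norm.ppf(np.asarray(p_values))   # p near 1 maps to a large positive z
    z_combined = z.sum() / np.sqrt(len(z))
    return norm.cdf(z_combined)

# Invented baseline-balance p-values for one hypothetical trial, on the
# inverted scale where dissimilar means give values near 1:
p = [0.52, 0.97, 0.97, 0.97]
print(f"combined p = {stouffer_combine(p):.3f}")
# Combined values very near 0 or 1 are what get flagged as hard to square
# with genuine randomisation.
```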

Categories: Fisher, RCTs, Stephen Senn | 5 Comments

3 YEARS AGO (JUNE 2014): MEMORY LANE

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: June 2014. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 4 others of general relevance to philosophy of statistics [2].  Posts that are part of a “unit” or a group count as one.

June 2014

  • (6/5) Stephen Senn: Blood Simple? The complicated and controversial world of bioequivalence (guest post)
  • (6/9) “The medical press must become irrelevant to publication of clinical trials.”
  • (6/11) A. Spanos: “Recurring controversies about P values and confidence intervals revisited”
  • (6/14) “Statistical Science and Philosophy of Science: where should they meet?”
  • (6/21) Big Bayes Stories? (draft ii)
  • (6/25) Blog Contents: May 2014
  • (6/28) Sir David Hendry Gets Lifetime Achievement Award
  • (6/30) Some ironies in the ‘replication crisis’ in social psychology (4th and final installment)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016; March 30, 2017 (moved to 4): a very convenient way to allow data-dependent choices.


Categories: 3-year memory lane | Leave a comment

Can You Change Your Bayesian Prior? The one post whose comments (some of them) will appear in my new book

I blogged this exactly 2 years ago here, seeking insight for my new book (Mayo 2017). Over 100 (rather varied) interesting comments ensued. This is the first time I’m incorporating blog comments into published work. You might be interested to follow the nooks and crannies from back then, or add a new comment to this.

This is one of the questions high on the “To Do” list I’ve been keeping for this blog.  The question grew out of discussions of “updating and downdating” in relation to papers by Stephen Senn (2011) and Andrew Gelman (2011) in Rationality, Markets, and Morals.[i]

“As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’.” (Senn, 2011, p. 59)

“If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.” (Gelman, 2011, p. 77)

But if uncertainty is not expressible as a prior, then a major lynchpin for Bayesian updating seems questionable. If, on the other hand, you can go from the posterior to the prior, perhaps the data can also lead you to come back and change that prior.
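In symbols (just Bayes’ theorem rearranged; a sketch of my own, not anything Senn or Gelman write down):

```latex
% Bayes' theorem, and the "downdating" rearrangement: given the posterior
% and the likelihood, the prior is recoverable up to normalization.
\pi(\theta \mid x)
  = \frac{\pi(\theta)\, p(x \mid \theta)}
         {\int \pi(\theta')\, p(x \mid \theta')\, d\theta'}
\quad\Longrightarrow\quad
\pi(\theta) \;\propto\; \frac{\pi(\theta \mid x)}{p(x \mid \theta)} .
```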

Is it legitimate to change one’s prior based on the data? Continue reading

Categories: Bayesian priors, Bayesian/frequentist | 14 Comments

Performance or Probativeness? E.S. Pearson’s Statistical Philosophy

E.S. Pearson (11 Aug, 1895-12 June, 1980)

E.S. Pearson died on this day in 1980. Aside from being co-developer of Neyman-Pearson statistics, Pearson was interested in philosophical aspects of statistical inference. A question he asked is this: Are methods with good error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long-run error properties is of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. But how exactly does it work? It’s not just the frequentist error statistician who faces this question, but also some contemporary Bayesians who aver that the performance or calibration of their methods supplies an evidential (or inferential or epistemic) justification (e.g., Robert Kass 2011). The latter generally tie the reliability of the method that produces the particular inference C to degrees of belief in C. The inference takes the form of a probabilism, e.g., Pr(C|x), equated, presumably, to the reliability (or coverage probability) of the method. But why? The frequentist inference is C, which is qualified by the reliability of the method, but there’s no posterior assigned to C. Again, what’s the rationale? I think existing answers (from both tribes) come up short in non-trivial ways. Continue reading

Categories: E.S. Pearson, highly probable vs highly probed, phil/history of stat | Leave a comment

3 YEARS AGO (May 2014): MEMORY LANE

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: May 2014. I leave them unmarked this month, read whatever looks interesting.

May 2014

  • (5/1) Putting the brakes on the breakthrough: An informal look at the argument for the Likelihood Principle
  • (5/3) You can only become coherent by ‘converting’ non-Bayesianly
  • (5/6) Winner of April Palindrome contest: Lori Wike
  • (5/7) A. Spanos: Talking back to the critics using error statistics (Phil6334)
  • (5/10) Who ya gonna call for statistical Fraudbusting? R.A. Fisher, P-values, and error statistics (again)
  • (5/15) Scientism and Statisticism: a conference* (i)
  • (5/17) Deconstructing Andrew Gelman: “A Bayesian wants everybody else to be a non-Bayesian.”
  • (5/20) The Science Wars & the Statistics Wars: More from the Scientism workshop
  • (5/25) Blog Table of Contents: March and April 2014
  • (5/27) Allan Birnbaum, Philosophical Error Statistician: 27 May 1923 – 1 July 1976
  • (5/31) What have we learned from the Anil Potti training and test data frameworks? Part 1 (draft 2)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

 


Categories: 3-year memory lane | 1 Comment

Allan Birnbaum: Foundations of Probability and Statistics (27 May 1923 – 1 July 1976)

27 May 1923-1 July 1976

Today is Allan Birnbaum’s birthday. In honor of his birthday, I’m posting the articles in the Synthese volume that was dedicated to his memory in 1977. The editors describe it as their way of “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. I paste a few snippets from the articles by Giere and Birnbaum. If you’re interested in statistical foundations, and are unfamiliar with Birnbaum, here’s a chance to catch up. (Even if you are, you may be unaware of some of these key papers.)

HAPPY BIRTHDAY ALLAN!

Synthese Volume 36, No. 1 Sept 1977: Foundations of Probability and Statistics, Part I

Editorial Introduction:

This special issue of Synthese on the foundations of probability and statistics is dedicated to the memory of Professor Allan Birnbaum. Professor Birnbaum’s essay ‘The Neyman-Pearson Theory as Decision Theory; and as Inference Theory; with a Criticism of the Lindley-Savage Argument for Bayesian Theory’ was received by the editors of Synthese in October, 1975, and a decision was made to publish a special symposium consisting of this paper together with several invited comments and related papers. The sad news about Professor Birnbaum’s death reached us in the summer of 1976, but the editorial project could nevertheless be completed according to the original plan. By publishing this special issue we wish to pay homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics. We are grateful to Professor Ronald Giere who wrote an introductory essay on Professor Birnbaum’s concept of statistical evidence and who compiled a list of Professor Birnbaum’s publications.

THE EDITORS

Continue reading

Categories: Birnbaum, Likelihood Principle, Statistics, strong likelihood principle | Tags: | 1 Comment

Frequentstein’s Bride: What’s wrong with using (1 – β)/α as a measure of evidence against the null?


ONE YEAR AGO: …and growing more relevant all the time. Rather than leak any of my new book*, I reblog some earlier posts, even if they’re a bit scruffy. This was first blogged here (with a slightly different title). It’s married to posts on “the P-values overstate the evidence against the null fallacy”, such as this, and is wedded to this one on “How to Tell What’s True About Power if You’re Practicing within the Frequentist Tribe”. 

In their “Comment: A Simple Alternative to p-values” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)

Continue reading
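To see what the number looks like in a typical setting (my own hypothetical illustration; the test, effect size and sample size are assumptions, not Benjamin and Berger’s example):

```python
# Sketch: the pre-data Rejection Ratio (1 - beta)/alpha for a one-sided
# z-test, with an assumed standardized effect size d and sample size n.
from scipy.stats import norm

def rejection_ratio(alpha, d, n):
    """Return (power/alpha, power) for a one-sided z-test at level alpha
    against the alternative with standardized effect size d."""
    z_crit = norm.ppf(1 - alpha)
    power = 1 - norm.cdf(z_crit - d * n ** 0.5)
    return power / alpha, power

ratio, power = rejection_ratio(alpha=0.05, d=0.3, n=69)
print(f"power ≈ {power:.2f}, rejection ratio ≈ {ratio:.1f}")
# With alpha = 0.05 and power around 0.80 the ratio is about 16.
# Note it is fixed before any data are seen; the same 16 is reported
# however well or badly the actual data end up probing the alternative.
```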

Categories: Bayesian/frequentist, fallacy of rejection, J. Berger, power, S. Senn | 17 Comments
