G.A. Barnard’s 101st Birthday: The Bayesian “catch-all” factor: probability vs likelihood

barnard-1979-picture

G. A. Barnard: 23 Sept 1915-30 July, 2002

Today is George Barnard’s 101st birthday. In honor of this, I reblog an exchange between Barnard, Savage (and others) on likelihood vs probability. The exchange is from pp 79-84 (of what I call) “The Savage Forum” (Savage, 1962).[i] Six other posts on Barnard are linked below: 2 are guest posts (Senn, Spanos); the other 4 include a play (pertaining to our first meeting), and a letter he wrote to me. 

 ♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠

BARNARD:…Professor Savage, as I understand him, said earlier that a difference between likelihoods and probabilities was that probabilities would normalize because they integrate to one, whereas likelihoods will not. Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses. … If this is so we ought to allow that in addition to the hypotheses that we really consider we should allow something that we had not thought of yet, and of course as soon as we do this we lose the normalizing factor of the probability, and from that point of view probability has no advantage over likelihood. This is my general point, that I think while I agree with a lot of the technical points, I would prefer that this is talked about in terms of likelihood rather than probability. I should like to ask what Professor Savage thinks about that, whether he thinks that the necessity to enumerate hypotheses exhaustively, is important.

SAVAGE: Surely, as you say, we cannot always enumerate hypotheses so completely as we like to think. The list can, however, always be completed by tacking on a catch-all ‘something else’. In principle, a person will have probabilities given ‘something else’ just as he has probabilities given other hypotheses. In practice, the probability of a specified datum given ‘something else’ is likely to be particularly vague­–an unpleasant reality. The probability of ‘something else’ is also meaningful of course, and usually, though perhaps poorly defined, it is definitely very small. Looking at things this way, I do not find probabilities unnormalizable, certainly not altogether unnormalizable.

Whether probability has an advantage over likelihood seems to me like the question whether volts have an advantage over amperes. The meaninglessness of a norm for likelihood is for me a symptom of the great difference between likelihood and probability. Since you question that symptom, I shall mention one or two others. …

On the more general aspect of the enumeration of all possible hypotheses, I certainly agree that the danger of losing serendipity by binding oneself to an over-rigid model is one against which we cannot be too alert. We must not pretend to have enumerated all the hypotheses in some simple and artificial enumeration that actually excludes some of them. The list can however be completed, as I have said, by adding a general ‘something else’ hypothesis, and this will be quite workable, provided you can tell yourself in good faith that ‘something else’ is rather improbable. The ‘something else’ hypothesis does not seem to make it any more meaningful to use likelihood for probability than to use volts for amperes.

Let us consider an example. Off hand, one might think it quite an acceptable scientific question to ask, ‘What is the melting point of californium?’ Such a question is, in effect, a list of alternatives that pretends to be exhaustive. But, even specifying which isotope of californium is referred to and the pressure at which the melting point is wanted, there are alternatives that the question tends to hide. It is possible that californium sublimates without melting or that it behaves like glass. Who dare say what other alternatives might obtain? An attempt to measure the melting point of californium might, if we are serendipitous, lead to more or less evidence that the concept of melting point is not directly applicable to it. Whether this happens or not, Bayes’s theorem will yield a posterior probability distribution for the melting point given that there really is one, based on the corresponding prior conditional probability and on the likelihood of the observed reading of the thermometer as a function of each possible melting point. Neither the prior probability that there is no melting point, nor the likelihood for the observed reading as a function of hypotheses alternative to that of the existence of a melting point enter the calculation. The distinction between likelihood and probability seems clear in this problem, as in any other.

BARNARD: Professor Savage says in effect, ‘add at the bottom of list H1, H2,…”something else”’. But what is the probability that a penny comes up heads given the hypothesis ‘something else’. We do not know. What one requires for this purpose is not just that there should be some hypotheses, but that they should enable you to compute probabilities for the data, and that requires very well defined hypotheses. For the purpose of applications, I do not think it is enough to consider only the conditional posterior distributions mentioned by Professor Savage.

LINDLEY: I am surprised at what seems to me an obvious red herring that Professor Barnard has drawn across the discussion of hypotheses. I would have thought that when one says this posterior distribution is such and such, all it means is that among the hypotheses that have been suggested the relevant probabilities are such and such; conditionally on the fact that there is nothing new, here is the posterior distribution. If somebody comes along tomorrow with a brilliant new hypothesis, well of course we bring it in.

BARTLETT: But you would be inconsistent because your prior probability would be zero one day and non-zero another.

LINDLEY: No, it is not zero. My prior probability for other hypotheses may be ε. All I am saying is that conditionally on the other 1 – ε, the distribution is as it is.

BARNARD: Yes, but your normalization factor is now determined by ε. Of course ε may be anything up to 1. Choice of letter has an emotional significance.

LINDLEY: I do not care what it is as long as it is not one.

BARNARD: In that event two things happen. One is that the normalization has gone west, and hence also this alleged advantage over likelihood. Secondly, you are not in a position to say that the posterior probability which you attach to an hypothesis from an experiment with these unspecified alternatives is in any way comparable with another probability attached to another hypothesis from another experiment with another set of possibly unspecified alternatives. This is the difficulty over likelihood. Likelihood in one class of experiments may not be comparable to likelihood from another class of experiments, because of differences of metric and all sorts of other differences. But I think that you are in exactly the same difficulty with conditional probabilities just because they are conditional on your having thought of a certain set of alternatives. It is not rational in other words. Suppose I come out with a probability of a third that the penny is unbiased, having considered a certain set of alternatives. Now I do another experiment on another penny and I come out of that case with the probability one third that it is unbiased, having considered yet another set of alternatives. There is no reason why I should agree or disagree in my final action or inference in the two cases. I can do one thing in one case and other in another, because they represent conditional probabilities leaving aside possibly different events.

LINDLEY: All probabilities are conditional.

BARNARD: I agree.

LINDLEY: If there are only conditional ones, what is the point at issue?

PROFESSOR E.S. PEARSON: I suggest that you start by knowing perfectly well that they are conditional and when you come to the answer you forget about it.

BARNARD: The difficulty is that you are suggesting the use of probability for inference, and this makes us able to compare different sets of evidence. Now you can only compare probabilities on different sets of evidence if those probabilities are conditional on the same set of assumptions. If they are not conditional on the same set of assumptions they are not necessarily in any way comparable.

LINDLEY: Yes, if this probability is a third conditional on that, and if a second probability is a third, conditional on something else, a third still means the same thing. I would be prepared to take my bets at 2 to 1.

BARNARD: Only if you knew that the condition was true, but you do not.

GOOD: Make a conditional bet.

BARNARD: You can make a conditional bet, but that is not what we are aiming at.

WINSTEN: You are making a cross comparison where you do not really want to, if you have got different sets of initial experiments. One does not want to be driven into a situation where one has to say that everything with a probability of a third has an equal degree of credence. I think this is what Professor Barnard has really said.

BARNARD: It seems to me that likelihood would tell you that you lay 2 to 1 in favour of H1 against H2, and the conditional probabilities would be exactly the same. Likelihood will not tell you what odds you should lay in favour of H1 as against the rest of the universe. Probability claims to do that, and it is the only thing that probability can do that likelihood cannot.

You can read the rest of pages 78-103 of the Savage Forum here.

 HAPPY BIRTHDAY GEORGE!

References

[i] Savage, L. (1962), “Discussion”, in The Foundations of Statistical Inference: A Discussion, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 76.
 
 

 

Categories: Barnard, highly probable vs highly probed, phil/history of stat, Statistics | 8 Comments

The Myth of ‘The Myth of Objectivity” (i)

images-28Objectivity in statistics, as in science more generally, is a matter of both aims and methods. Objective science, in our view, aims to find out what is the case as regards aspects of the world [that hold] independently of our beliefs, biases and interests; thus objective methods aim for the critical control of inference and hypotheses, constraining them by evidence and checks of error. (Cox and Mayo 2010, p. 276)


I. The myth of objectivity.
Whenever you come up against blanket slogans such as “no methods are objective” or “all methods are equally objective and subjective,” it is a good guess that the problem is being trivialized into oblivion. Yes, there are judgments, disagreements, and values in any human activity, which alone makes it too trivial an observation to distinguish among very different ways that threats of bias and unwarranted inferences may be controlled. Is the objectivity-subjectivity distinction really toothless as many will have you believe? I say no.

Cavalier attitudes toward objectivity are in tension with widely endorsed movements to promote replication, reproducibility, and to come clean on a number of sources behind illicit results: multiple testing, cherry picking, failed assumptions, researcher latitude, publication bias and so on. The moves to take back science–if they are not mere lip-service–are rooted in the supposition that we can more objectively scrutinize results,even if it’s only to point out those that are poorly tested. The fact that the term “objectivity” is used equivocally should not be taken as grounds to oust it, but rather to engage in the difficult work of identifying what there is in “objectivity” that we won’t give up, and shouldn’t.

II. The Key is Getting Pushback. While knowledge gaps leave plenty of room for biases, arbitrariness and wishful thinking, we regularly come up against data that thwart our expectations and disagree with the predictions we try to foist upon the world. We get pushback! This supplies objective constraints on which our critical capacity is built. Our ability to recognize when data fail to match anticipations affords the opportunity to systematically improve our orientation. In an adequate account of statistical inference, explicit attention is paid to communicating results to set the stage for others to check, debate, extend or refute the inferences reached. Don’t let anyone say you can’t hold them to an objective account of statistical inference.

If you really want to find something out, and have had some experience with flaws and foibles, you deliberately arrange inquiries so as to capitalize on pushback, on effects that will not go away, and on strategies to get errors to ramify quickly to force you to pay attention to them. The ability to register alterations in error probabilities due to hunting, optional stopping, and other questionable research practices (QRPs) is a crucial part of objectivity in statistics. In statistical design, day-to-day tricks of the trade to combat bias are amplified and made systematic. It is not because of a “disinterested stance” that such methods are invented. It is that we, competitively and self-interestedly, want to find things out.

Admittedly, that desire won’t suffice to incentivize objective scrutiny if you can do just as well producing junk. Succeeding in scientific learning is very different from success at grants, honors, publications, or engaging in technical activism, replication research and meta-research. That’s why the reward structure of science is so often blamed nowadays. New incentives, gold stars and badges for sharing data, preregistration, and resisting the urge to cherry pick, outcome-switch, or otherwise engage in bad science are proposed. I say that if the allure of carrots has grown stronger than the sticks (which they have), then what we need are stronger sticks.

III. Objective procedures. It is often urged that, however much we may aim at objective constraints, we can never have clean hands, free of the influence of beliefs and interests. The fact that my background knowledge enters in researching a claim H doesn’t mean I combinine my beliefs about H into the analysis so as to prejudge any inference. I may instead use background information to give H a hard time. I may use it to question your claim to have grounds to infer H by showing it hasn’t survived a stringent effort to falsify or find flaws in H. The test H survived might be quite lousy, and even if I have independent grounds to believe H, I may deny you’ve done a good job testing it.

Others argue that we invariably sully methods of inquiry by the entry of personal judgments in their specification and interpretation. It’s just human all too human. The issue is not that a human is doing the measuring; the issue is whether that which is being measured is something we can reliably use to solve some problem of inquiry. That an inference is done by machine, untouched by human hands, wouldn’t make it objective, in the relevant sense. There are three distinct requirements for an objective procedure for solving problems of inquiry:

  • Relevance: It should be relevant to learning about the intended topic of inquiry; having an uncontroversial way to measure something doesn’t make it relevant to solving a knowledge-based problem of inquiry.
  • Reliably capable: It should not routinely declare the problem solved when it is not solved (or solved incorrectly); it should be capable of controlling the reliability of erroneous reports of purported answers to question.
  • Capacity to learn from error: If the problem is not solved (or poorly solved) at a given stage, the method should set the stage for pinpointing why. (It should be able to at least embark on an inquiry for solving “Duhemian problems” of where to lay blame for anomalies.)

Yes, there are numerous choices in collecting, analyzing, modeling, and drawing inferences from data, and there is often disagreement about how they should be made. Why suppose this means all accounts are in the same boat as regards subjective factors? It need not, and they are not. An account of inference shows itself to be objective precisely in how it steps up to the plate in handling potential threats to objectivity.

IV. Idols of Objectivity. We should reject phony objectivity and false trappings of objectivity. They often grow out of one or another philosophical conception of what objectivity requires—even though you will almost surely not see them described that way. If it’s thought objectivity is limited to direct observations (whatever they are) plus mathematics and logic, as does the typical logical positivist, then it’s no surprise to wind up worshiping “the idols of a universal method” as Gigerenzer and Marewski (2015) call it. Such a method is to supply a formal, ideally mechanical, way to process statements of observations and hypotheses. To recognize such mechanical rules don’t exist is not to relinquish the view that they’re demanded by objectivity. Instead, objectivity goes by the board, replaced by various stripes of relativism and constructivism, or more extreme forms of post-modernisms.

Relativists may augment their rather thin gruel with a pseudo-objectivity arising from social or political negotiation, cost-benefits (“they’re buying it”), or a type of consensus (“it’s in a 5 star journal”), but that’s to give away the goat far too soon. The result is to abandon the core stipulations of scientific objectivity. To be clear: There are authentic problems that threaten objectivity. We shouldn’t allow outdated philosophical accounts to induce us into giving it up.

V. From Discretion to Subjective Probabilities. Some argue that “discretionary choices” in tests, which Neyman himself tended to call “subjective”[1], leads us to subjective probabilities in claims. A weak version goes: since you can’t avoid subjective (discretionary) choices in getting the data and the model, there can be little ground for complaint about subjective degrees of belief in the resulting inference. This is weaker than arguing you must use subjective probabilities; it argues merely that doing so is no worse than discretion. But it still misses the point.

Even if the entry of discretionary judgments in the journey to a statistical inference/model have the capability to introduce subjectivity, they need not. Second, not all discretionary judgments are in the same boat when it comes to being open to severe testing.

A stronger version of the argument goes on a slippery slope from the premise of discretion in data generation and modeling to the conclusion: statistical inference just is a matter of subjective beliefs (or their updates). How does that work? One variant, which I do not try to pin on anyone in particular, involves a subtle slide from “our models are merely objects of belief”, to “statistical inference is a matter of degrees of belief”. From there it’s a short step to “statistical inference is a matter of subjective probability” (whether my assignments or that of an imaginary omniscient agent).

It is one thing to describe our models as objects of belief and quite another to maintain that our task is to model beliefs.

This is one of those philosophical puzzles of language that might set some people’s eyes rolling. If I believe in the deflection effect (of gravity) then that effect is the object of my belief, but only in the sense that my belief is about said effect. Yet if I’m inquiring into the deflection effect, I’m not inquiring into beliefs about the effect. The philosopher of science Clark Glymour (2010, p. 335) calls this a shift from phenomena (content) to epiphenomena (degrees of belief).

Karl Popper argues that the central confusion all along was sliding from the degree of the rationality (or warrantedness) of a belief, to the degree of rational belief (1959, p. 424). The former is assessed via degrees of corroboration and well-testedness, rooted in the error probing capacities of procedures. (These are supplied by error probabilities of methods, formal or informal.)

VI. Blurring What’s Being Measured vs My Ability to Test It. You will sometimes hear a Bayesian claim that anyone who says their probability assignments to hypotheses are subjective must also call the use of any model subjective because it too is based on my choice of specifications. This is a confusion of two notions of subjective.

  • The first concerns what’s being measured, and for the Bayesian, with some exceptions, probability is supposed to represent a subject’s strength of belief (be it actual or rational), betting odds, or the like.
  • The second sense of subjective concerns whether the measurement is checkable or testable.

This goes back to my point about what’s required for a feature to be relevant to a method’s objectivity in III.

(Passages, modified, are from Mayo, Statistical Inference as Severe Testing (forthcoming)

[1]But he never would allow subjective probabilities to enter in statistical inference. Objective, i.e., frequentist, priors in a hypothesis H could enter, but he was very clear that this required H’s truth being the result of some kind of stochastic mechanism. He found that idea plausible in cases, the problem was not knowing the stochastic mechanism sufficiently to assign the priors. Such frequentist (or “empirical”) priors in hypotheses are not given by drawing Hrandomly from an urn of hypothesis k% of which are assumed to be true. Yet, an “objective” Bayesian like Jim Berger will call these frequentist, resulting in enormous confusion in today’s guidebooks on the probability of type 1 errors.

REFERENCES:

Cox D. R. and Mayo. D. G. (2010). “Objectivity and Conditionality in Frequentist Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 276-304.

Gigerenzer, G. and Marewski, J. 2015. ‘Surrogate Science: The Idol of a Universal Method for Scientific Inference,’ Journal of Management 41(2): 421-40.

Glymour, C. 2010. ‘Explanation and Truth’, in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos eds.), CUP: 331–350.

Mayo, D. (1983). “An Objective Theory of Statistical Testing.” Synthese 57(2): 297-340.

Popper, K. 1959. The Logic of Scientific Discovery. New York: Basic Books.

 

 

Categories: Background knowledge | Tags: | 6 Comments

Peircean Induction and the Error-Correcting Thesis (Part I)

C. S. Peirce: 10 Sept, 1839-19 April, 1914

C. S. Peirce: 10 Sept, 1839-19 April, 1914

Today is C.S. Peirce’s birthday. He’s one of my all time heroes. You should read him: he’s a treasure chest on essentially any topic, and he anticipated several major ideas in statistics (e.g., randomization, confidence intervals) as well as in logic. I’ll reblog the first portion of a (2005) paper of mine. Links to Parts 2 and 3 are at the end. It’s written for a very general philosophical audience; the statistical parts are pretty informal. Happy birthday Peirce.

Peircean Induction and the Error-Correcting Thesis
Deborah G. Mayo
Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy, Volume 41, Number 2, 2005, pp. 299-319

Peirce’s philosophy of inductive inference in science is based on the idea that what permits us to make progress in science, what allows our knowledge to grow, is the fact that science uses methods that are self-correcting or error-correcting:

Induction is the experimental testing of a theory. The justification of it is that, although the conclusion at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error. (5.145)

Inductive methods—understood as methods of experimental testing—are justified to the extent that they are error-correcting methods. We may call this Peirce’s error-correcting or self-correcting thesis (SCT):

Self-Correcting Thesis SCT: methods for inductive inference in science are error correcting; the justification for inductive methods of experimental testing in science is that they are self-correcting.

Peirce’s SCT has been a source of fascination and frustration. By and large, critics and followers alike have denied that Peirce can sustain his SCT as a way to justify scientific induction: “No part of Peirce’s philosophy of science has been more severely criticized, even by his most sympathetic commentators, than this attempted validation of inductive methodology on the basis of its purported self-correctiveness” (Rescher 1978, p. 20).

In this paper I shall revisit the Peircean SCT: properly interpreted, I will argue, Peirce’s SCT not only serves its intended purpose, it also provides the basis for justifying (frequentist) statistical methods in science. While on the one hand, contemporary statistical methods increase the mathematical rigor and generality of Peirce’s SCT, on the other, Peirce provides something current statistical methodology lacks: an account of inductive inference and a philosophy of experiment that links the justification for statistical tests to a more general rationale for scientific induction. Combining the mathematical contributions of modern statistics with the inductive philosophy of Peirce, sets the stage for developing an adequate justification for contemporary inductive statistical methodology.

2. Probabilities are assigned to procedures not hypotheses

Peirce’s philosophy of experimental testing shares a number of key features with the contemporary (Neyman and Pearson) Statistical Theory: statistical methods provide, not means for assigning degrees of probability, evidential support, or confirmation to hypotheses, but procedures for testing (and estimation), whose rationale is their predesignated high frequencies of leading to correct results in some hypothetical long-run. A Neyman and Pearson (NP) statistical test, for example, instructs us “To decide whether a hypothesis, H, of a given type be rejected or not, calculate a specified character, x0, of the observed facts; if x> x0 reject H; if x< x0 accept H.” Although the outputs of N-P tests do not assign hypotheses degrees of probability, “it may often be proved that if we behave according to such a rule … we shall reject H when it is true not more, say, than once in a hundred times, and in addition we may have evidence that we shall reject H sufficiently often when it is false” (Neyman and Pearson, 1933, p.142).[i]

The relative frequencies of erroneous rejections and erroneous acceptances in an actual or hypothetical long run sequence of applications of tests are error probabilities; we may call the statistical tools based on error probabilities, error statistical tools. In describing his theory of inference, Peirce could be describing that of the error-statistician:

The theory here proposed does not assign any probability to the inductive or hypothetic conclusion, in the sense of undertaking to say how frequently that conclusion would be found true. It does not propose to look through all the possible universes, and say in what proportion of them a certain uniformity occurs; such a proceeding, were it possible, would be quite idle. The theory here presented only says how frequently, in this universe, the special form of induction or hypothesis would lead us right. The probability given by this theory is in every way different—in meaning, numerical value, and form—from that of those who would apply to ampliative inference the doctrine of inverse chances. (2.748)

The doctrine of “inverse chances” alludes to assigning (posterior) probabilities in hypotheses by applying the definition of conditional probability (Bayes’s theorem)—a computation requires starting out with a (prior or “antecedent”) probability assignment to an exhaustive set of hypotheses:

If these antecedent probabilities were solid statistical facts, like those upon which the insurance business rests, the ordinary precepts and practice [of inverse probability] would be sound. But they are not and cannot be statistical facts. What is the antecedent probability that matter should be composed of atoms? Can we take statistics of a multitude of different universes? (2.777)

For Peircean induction, as in the N-P testing model, the conclusion or inference concerns a hypothesis that either is or is not true in this one universe; thus, assigning a frequentist probability to a particular conclusion, other than the trivial ones of 1 or 0, for Peirce, makes sense only “if universes were as plentiful as blackberries” (2.684). Thus the Bayesian inverse probability calculation seems forced to rely on subjective probabilities for computing inverse inferences, but “subjective probabilities” Peirce charges “express nothing but the conformity of a new suggestion to our prepossessions, and these are the source of most of the errors into which man falls, and of all the worse of them” (2.777).

Hearing Pierce contrast his view of induction with the more popular Bayesian account of his day (the Conceptualists), one could be listening to an error statistician arguing against the contemporary Bayesian (subjective or other)—with one important difference. Today’s error statistician seems to grant too readily that the only justification for N-P test rules is their ability to ensure we will rarely take erroneous actions with respect to hypotheses in the long run of applications. This so called inductive behavior rationale seems to supply no adequate answer to the question of what is learned in any particular application about the process underlying the data. Peirce, by contrast, was very clear that what is really wanted in inductive inference in science is the ability to control error probabilities of test procedures, i.e., “the trustworthiness of the proceeding”. Moreover it is only by a faulty analogy with deductive inference, Peirce explains, that many suppose that inductive (synthetic) inference should supply a probability to the conclusion: “… in the case of analytic inference we know the probability of our conclusion (if the premises are true), but in the case of synthetic inferences we only know the degree of trustworthiness of our proceeding (“The Probability of Induction” 2.693).

Knowing the “trustworthiness of our inductive proceeding”, I will argue, enables determining the test’s probative capacity, how reliably it detects errors, and the severity of the test a hypothesis withstands. Deliberately making use of known flaws and fallacies in reasoning with limited and uncertain data, tests may be constructed that are highly trustworthy probes in detecting and discriminating errors in particular cases. This, in turn, enables inferring which inferences about the process giving rise to the data are and are not warranted: an inductive inference to hypothesis H is warranted to the extent that with high probability the test would have detected a specific flaw or departure from what H asserts, and yet it did not.

3. So why is justifying Peirce’s SCT thought to be so problematic?

You can read Section 3 here. (it’s not necessary for understanding the rest).

4. Peircean induction as severe testing

… [I]nduction, for Peirce, is a matter of subjecting hypotheses to “the test of experiment” (7.182).

The process of testing it will consist, not in examining the facts, in order to see how well they accord with the hypothesis, but on the contrary in examining such of the probable consequences of the hypothesis … which would be very unlikely or surprising in case the hypothesis were not true. (7.231)

When, however, we find that prediction after prediction, notwithstanding a preference for putting the most unlikely ones to the test, is verified by experiment,…we begin to accord to the hypothesis a standing among scientific results.

This sort of inference it is, from experiments testing predictions based on a hypothesis, that is alone properly entitled to be called induction. (7.206)

While these and other passages are redolent of Popper, Peirce differs from Popper in crucial ways. Peirce, unlike Popper, is primarily interested not in falsifying claims but in the positive pieces of information provided by tests, with “the corrections called for by the experiment” and with the hypotheses, modified or not, that manage to pass severe tests. For Popper, even if a hypothesis is highly corroborated (by his lights), he regards this as at most a report of the hypothesis’ past performance and denies it affords positive evidence for its correctness or reliability. Further, Popper denies that he could vouch for the reliability of the method he recommends as “most rational”—conjecture and refutation. Indeed, Popper’s requirements for a highly corroborated hypothesis are not sufficient for ensuring severity in Peirce’s sense (Mayo 1996, 2003, 2005). Where Popper recoils from even speaking of warranted inductions, Peirce conceives of a proper inductive inference as what had passed a severe test—one which would, with high probability, have detected an error if present.

In Peirce’s inductive philosophy, we have evidence for inductively inferring a claim or hypothesis H when not only does H “accord with” the data x; but also, so good an accordance would very probably not have resulted, were H not true. In other words, we may inductively infer H when it has withstood a test of experiment that it would not have withstood, or withstood so well, were H not true (or were a specific flaw present). This can be encapsulated in the following severity requirement for an experimental test procedure, ET, and data set x.

Hypothesis H passes a severe test with x iff (firstly) x accords with H and (secondly) the experimental test procedure ET would, with very high probability, have signaled the presence of an error were there a discordancy between what H asserts and what is correct (i.e., were H false).

The test would “have signaled an error” by having produced results less accordant with H than what the test yielded. Thus, we may inductively infer H when (and only when) H has withstood a test with high error detecting capacity, the higher this probative capacity, the more severely H has passed. What is assessed (quantitatively or qualitatively) is not the amount of support for H but the probative capacity of the test of experiment ET (with regard to those errors that an inference to H is declaring to be absent)……….

You can read the rest of Section 4 here here

5. The path from qualitative to quantitative induction

In my understanding of Peircean induction, the difference between qualitative and quantitative induction is really a matter of degree, according to whether their trustworthiness or severity is quantitatively or only qualitatively ascertainable. This reading not only neatly organizes Peirce’s typologies of the various types of induction, it underwrites the manner in which, within a given classification, Peirce further subdivides inductions by their “strength”.

(I) First-Order, Rudimentary or Crude Induction

Consider Peirce’s First Order of induction: the lowest, most rudimentary form that he dubs, the “pooh-pooh argument”. It is essentially an argument from ignorance: Lacking evidence for the falsity of some hypothesis or claim H, provisionally adopt H. In this very weakest sort of induction, crude induction, the most that can be said is that a hypothesis would eventually be falsified if false. (It may correct itself—but with a bang!) It “is as weak an inference as any that I would not positively condemn” (8.237). While uneliminable in ordinary life, Peirce denies that rudimentary induction is to be included as scientific induction. Without some reason to think evidence of H‘s falsity would probably have been detected, were H false, finding no evidence against H is poor inductive evidence for H. H has passed only a highly unreliable error probe.

(II) Second Order (Qualitative) Induction

It is only with what Peirce calls “the Second Order” of induction that we arrive at a genuine test, and thereby scientific induction. Within second order inductions, a stronger and a weaker type exist, corresponding neatly to viewing strength as the severity of a testing procedure.

The weaker of these is where the predictions that are fulfilled are merely of the continuance in future experience of the same phenomena which originally suggested and recommended the hypothesis… (7.116)

The other variety of the argument … is where [results] lead to new predictions being based upon the hypothesis of an entirely different kind from those originally contemplated and these new predictions are equally found to be verified. (7.117)

The weaker type occurs where the predictions, though fulfilled, lack novelty; whereas, the stronger type reflects a more stringent hurdle having been satisfied: the hypothesis has had “novel” predictive success, and thereby higher severity. (For a discussion of the relationship between types of novelty and severity see Mayo 1991, 1996). Note that within a second order induction the assessment of strength is qualitative, e.g., very strong, weak, very weak.

The strength of any argument of the Second Order depends upon how much the confirmation of the prediction runs counter to what our expectation would have been without the hypothesis. It is entirely a question of how much; and yet there is no measurable quantity. For when such measure is possible the argument … becomes an induction of the Third Order [statistical induction]. (7.115)

It is upon these and like passages that I base my reading of Peirce. A qualitative induction, i.e., a test whose severity is qualitatively determined, becomes a quantitative induction when the severity is quantitatively determined; when an objective error probability can be given.

(III) Third Order, Statistical (Quantitative) Induction

We enter the Third Order of statistical or quantitative induction when it is possible to quantify “how much” the prediction runs counter to what our expectation would have been without the hypothesis. In his discussions of such quantifications, Peirce anticipates to a striking degree later developments of statistical testing and confidence interval estimation (Hacking 1980, Mayo 1993, 1996). Since this is not the place to describe his statistical contributions, I move to more modern methods to make the qualitative-quantitative contrast.

6. Quantitative and qualitative induction: significance test reasoning

Quantitative Severity

A statistical significance test illustrates an inductive inference justified by a quantitative severity assessment. The significance test procedure has the following components: (1) a null hypothesis H0, which is an assertion about the distribution of the sample X = (X1, …, Xn), a set of random variables, and (2) a function of the sample, d(x), the test statistic, which reflects the difference between the data x = (x1, …, xn), and null hypothesis H0. The observed value of d(X) is written d(x). The larger the value of d(x) the further the outcome is from what is expected under H0, with respect to the particular question being asked. We can imagine that null hypothesis H0 is

H0: there are no increased cancer risks associated with hormone replacement therapy (HRT) in women who have taken them for 10 years.

Let d(x) measure the increased risk of cancer in n women, half of which were randomly assigned to HRT. H0 asserts, in effect, that it is an error to take as genuine any positive value of d(x)—any observed difference is claimed to be “due to chance”. The test computes (3) the p-value, which is the probability of a difference larger than d(x), under the assumption that H0 is true:

p-value = Prob(d(X) > d(x)); H0).

If this probability is very small, the data are taken as evidence that

H*: cancer risks are higher in women treated with HRT

The reasoning is a statistical version of modes tollens.

If the hypothesis H0 is correct then, with high probability, 1- p, the data would not be statistically significant at level p.

x is statistically significant at level p.

Therefore, x is evidence of a discrepancy from H0, in the direction of an alternative hypothesis H.

(i.e., H* severely passes, where the severity is 1 minus the p-value)[iii]

For example, the results of recent, large, randomized treatment-control studies showing statistically significant increased risks (at the 0.001 level) give strong evidence that HRT, taken for over 5 years, increases the chance of breast cancer, the severity being 0.999. If a particular conclusion is wrong, subsequent severe (or highly powerful) tests will with high probability detect it. In particular, if we are wrong to reject H0 (and H0 is actually true), we would find we were rarely able to get so statistically significant a result to recur, and in this way we would discover our original error.

It is true that the observed conformity of the facts to the requirements of the hypothesis may have been fortuitous. But if so, we have only to persist in this same method of research and we shall gradually be brought around to the truth. (7.115)

The correction is not a matter of getting higher and higher probabilities, it is a matter of finding out whether the agreement is fortuitous; whether it is generated about as often as would be expected were the agreement of the chance variety.

[Part 2 and Part 3 are here; you can find the full paper here.]

REFERENCES:

Hacking, I. 1980 “The Theory of Probable Inference: Neyman, Peirce and Braithwaite”, pp. 141-160 in D. H. Mellor (ed.), Science, Belief and Behavior: Essays in Honour of R.B. Braithwaite. Cambridge: Cambridge University Press.

Laudan, L. 1981 Science and Hypothesis: Historical Essays on Scientific Methodology. Dordrecht: D. Reidel.

Levi, I. 1980 “Induction as Self Correcting According to Peirce”, pp. 127-140 in D. H. Mellor (ed.), Science, Belief and Behavior: Essays in Honor of R.B. Braithwaite. Cambridge: Cambridge University Press.

Mayo, D. 1991 “Novel Evidence and Severe Tests”, Philosophy of Science, 58: 523-552.

———- 1993 “The Test of Experiment: C. S. Peirce and E. S. Pearson”, pp. 161-174 in E. C. Moore (ed.), Charles S. Peirce and the Philosophy of Science. Tuscaloosa: University of Alabama Press.

——— 1996 Error and the Growth of Experimental Knowledge, The University of Chicago Press, Chicago.

———–2003 “Severe Testing as a Guide for Inductive Learning”, in H. Kyburg (ed.), Probability Is the Very Guide in Life. Chicago: Open Court Press, pp. 89-117.

———- 2005 “Evidence as Passing Severe Tests: Highly Probed vs. Highly Proved” in P. Achinstein (ed.), Scientific Evidence, Johns Hopkins University Press.

Mayo, D. and Kruse, M. 2001 “Principles of Inference and Their Consequences,” pp. 381-403 in Foundations of Bayesianism, D. Cornfield and J. Williamson (eds.), Dordrecht: Kluwer Academic Publishers.

Mayo, D. and Spanos, A. 2004 “Methodology in Practice: Statistical Misspecification Testing” Philosophy of Science, Vol. II, PSA 2002, pp. 1007-1025.

———- (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Theory of Induction”, The British Journal of Philosophy of Science 57: 323-357.

Mayo, D. and Cox, D.R. 2006 “The Theory of Statistics as the ‘Frequentist’s’ Theory of Inductive Inference”, Institute of Mathematical Statistics (IMS) Lecture Notes-Monograph Series, Contributions to the Second Lehmann Symposium, 2005.

Neyman, J. and Pearson, E.S. 1933 “On the Problem of the Most Efficient Tests of Statistical Hypotheses”, in Philosophical Transactions of the Royal Society, A: 231, 289-337, as reprinted in J. Neyman and E.S. Pearson (1967), pp. 140-185.

———- 1967 Joint Statistical Papers, Berkeley: University of California Press.

Niiniluoto, I. 1984 Is Science Progressive? Dordrecht: D. Reidel.

Peirce, C. S. Collected Papers: Vols. I-VI, C. Hartshorne and P. Weiss (eds.) (1931-1935). Vols. VII-VIII, A. Burks (ed.) (1958), Cambridge: Harvard University Press.

Popper, K. 1962 Conjectures and Refutations: the Growth of Scientific Knowledge, Basic Books, New York.

Rescher, N.  1978 Peirce’s Philosophy of Science: Critical Studies in His Theory of Induction and Scientific Method, Notre Dame: University of Notre Dame Press.


[i] Others who relate Peircean induction and Neyman-Pearson tests are Isaac Levi (1980) and Ian Hacking (1980). See also Mayo 1993 and 1996.

[ii] This statement of (b) is regarded by Laudan as the strong thesis of self-correcting. A weaker thesis would replace (b) with (b’): science has techniques for determining unambiguously whether an alternative T’ is closer to the truth than a refuted T.

[iii] If the p-value were not very small, then the difference would be considered statistically insignificant (generally small values are 0.1 or less). We would then regard H0 as consistent with data x, but we may wish to go further and determine the size of an increased risk r that has thereby been ruled out with severity. We do so by finding a risk increase, such that, Prob(d(X) > d(x); risk increase r) is high, say. Then the assertion: the risk increase < r passes with high severity, we would argue.

If there were a discrepancy from hypothesis H0 of r (or more), then, with high probability,1-p, the data would be statistically significant at level p.

x is not statistically significant at level p.

Therefore, x is evidence than any discrepancy from H0 is less than r.

For a general treatment of effect size, see Mayo and Spanos (2006).

[Ed. Note: A not bad biographical sketch can be found on wikipedia.]

Categories: Bayesian/frequentist, C.S. Peirce, Error Statistics, Statistics | 18 Comments

All She Wrote (so far): Error Statistics Philosophy: 5 years on

metablog old fashion typewriter

D.G. Mayo with her  blogging typewriter

Error Statistics Philosophy: Blog Contents (5 years) [i]
By: D. G. Mayo

Dear Reader: It’s hard to believe I’ve been blogging for five years (since Sept. 3, 2011)! A big celebration is taking place at the Elbar Room this evening. If you’re in the neighborhood, stop by for some Elba Grease.

Amazingly, this old typewriter not only still works; one of the whiz kids on Elba managed to bluetooth it to go directly from my typewriter onto the blog (I never got used to computer keyboards.) I still must travel to London to get replacement ribbons for this klunker.

Please peruse the offerings below, and take advantage of some of the super contributions and discussions by guest posters and readers! I don’t know how much longer I’ll continue blogging, but at least until the publication of my book on statistical inference. After that I plan to run conferences, workshops, and ashrams on PhilStat and PhilSci, and will invite readers to take part! Keep reading and commenting. Sincerely, D. Mayo

imgres-11

.

September 2011

Continue reading

Categories: blog contents, Metablog, Statistics | 11 Comments

TragiComedy hour: P-values vs posterior probabilities vs diagnostic error rates

Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?

Critic: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true!

Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!

Raucous laughter ensues!

(Hah, hah… “So funny, I forgot to laugh! Or, I’m crying and laughing at the same time!) Continue reading

Categories: Bayesian/frequentist, Comedy, significance tests, Statistics | 8 Comments

Larry Laudan: “‘Not Guilty’: The Misleading Verdict” (Guest Post)

Larry Laudan

.

Prof. Larry Laudan
Lecturer in Law and Philosophy
University of Texas at Austin

“‘Not Guilty’: The Misleading Verdict and How It Fails to Serve either Society or the Innocent Defendant”

Most legal systems in the developed world share in common a two-tier verdict system: ‘guilty’ and ‘not guilty’.  Typically, the standard for a judgment of guilty is set very high while the standard for a not-guilty verdict (if we can call it that) is quite low. That means any level of apparent guilt less than about 90% confidence that the defendant committed the crime leads to an acquittal (90% being the usual gloss on proof beyond a reasonable doubt, although few legal systems venture a definition of BARD that precise). According to conventional wisdom, the major reason for setting the standard as high as we do is the desire, even the moral necessity, to shield the innocent from false conviction. Continue reading

Categories: L. Laudan, PhilStat Law | Tags: | 22 Comments

History of statistics sleuths out there? “Ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot”–No wait, it was apples, probably

E.S.Pearson on Gate

E.S.Pearson on a Gate, Mayo sketch

Here you see my scruffy sketch of Egon drawn 20 years ago for the frontispiece of my book, “Error and the Growth of Experimental Knowledge” (EGEK 1996). The caption is

“I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot… –E.S Pearson, “Statistical Concepts in Their Relation to Reality”.

He is responding to Fisher to “dispel the picture of the Russian technological bogey”. [i]

So, as I said in my last post, just to make a short story long, I’ve recently been scouring around the history and statistical philosophies of Neyman, Pearson and Fisher for purposes of a book soon to be completed, and I discovered a funny little error about this quote. Only maybe 3 or 4 people alive would care, but maybe someone out there knows the real truth.

OK, so I’d been rereading Constance Reid’s great biography of Neyman, and in one place she interviews Egon about the sources of inspiration for their work. Here’s what Egon tells her: Continue reading

Categories: E.S. Pearson, phil/history of stat, Statistics | 1 Comment

Performance or Probativeness? E.S. Pearson’s Statistical Philosophy

egon pearson

E.S. Pearson (11 Aug, 1895-12 June, 1980)

This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ve recently been scouring around the history and statistical philosophies of Neyman, Pearson and Fisher for purposes of a book soon to be completed. I recently discovered a little anecdote that calls for a correction in something I’ve been saying for years. While it’s little more than a point of trivia, it’s in relation to Pearson’s (1955) response to Fisher (1955)–the last entry in this post.  I’ll wait until tomorrow or the next day to share it, to give you a chance to read the background. 

 

Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long run error properties are of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson. 

Cases of Type A and Type B

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Continue reading

Categories: 4 years ago!, highly probable vs highly probed, phil/history of stat, Statistics | Tags: | Leave a comment

If you think it’s a scandal to be without statistical falsification, you will need statistical tests (ii)

Screen Shot 2016-08-09 at 2.55.33 PM

.

1. PhilSci and StatSci. I’m always glad to come across statistical practitioners who wax philosophical, particularly when Karl Popper is cited. Best of all is when they get the philosophy somewhere close to correct. So, I came across an article by Burnham and Anderson (2014) in Ecology:

While the exact definition of the so-called ‘scientific method’ might be controversial, nearly everyone agrees that the concept of ‘falsifiability’ is a central tenant [sic] of empirical science (Popper 1959). It is critical to understand that historical statistical approaches (i.e., P values) leave no way to ‘test’ the alternative hypothesis. The alternative hypothesis is never tested, hence cannot be rejected or falsified!… Surely this fact alone makes the use of significance tests and P values bogus. Lacking a valid methodology to reject/falsify the alternative science hypotheses seems almost a scandal.” (Burnham and Anderson p. 629)

Well I am (almost) scandalized by this easily falsifiable allegation! I can’t think of a single “alternative”, whether in a “pure” Fisherian or a Neyman-Pearson hypothesis test (whether explicit or implicit) that’s not falsifiable; nor do the authors provide any. I grant that understanding testability and falsifiability is far more complex than the kind of popularized accounts we hear about; granted as well, theirs is just a short paper.[1] But then why make bold declarations on the topic of the “scientific method and statistical science,” on falsifiability and testability? Continue reading

Categories: P-values, Severity, statistical tests, Statistics, StatSci meets PhilSci | 22 Comments

S. Senn: “Painful dichotomies” (Guest Post)

.

Stephen Senn
Head of  Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Painful dichotomies

The tweet read “Featured review: Only 10% people with tension-type headaches get a benefit from paracetamol” and immediately I thought, ‘how would they know?’ and almost as quickly decided, ‘of course they don’t know, they just think they know’. Sure enough, on following up the link to the Cochrane Review in the tweet it turned out that, yet again, the deadly mix of dichotomies and numbers needed to treat had infected the brains of researchers to the extent that they imagined that they had identified personal response. (See Responder Despondency for a previous post on this subject.)

The bare facts they established are the following:

The International Headache Society recommends the outcome of being pain free two hours after taking a medicine. The outcome of being pain free or having only mild pain at two hours was reported by 59 in 100 people taking paracetamol 1000 mg, and in 49 out of 100 people taking placebo.

and the false conclusion they immediately asserted is the following

This means that only 10 in 100 or 10% of people benefited because of paracetamol 1000 mg.

To understand the fallacy, look at the accompanying graph. Continue reading

Categories: junk science, PhilStat/Med, Statistics, Stephen Senn | 27 Comments

3 YEARS AGO (JULY 2013): MEMORY LANE

3 years ago...

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: July 2013. I mark in red three posts that seem most apt for general background on key issues in this blog, excluding those reblogged recently [1], and in green up to 3 others I’d recommend[2].  Posts that are part of a “unit” or a group of “U-Phils”(you [readers] philosophize) count as one.

July 2013

  • (7/3) Phil/Stat/Law: 50 Shades of gray between error and fraud
  • (7/6) Bad news bears: ‘Bayesian bear’ rejoinder–reblog mashup
  • (7/10) PhilStatLaw: Reference Manual on Scientific Evidence (3d ed) on Statistical Significance (Schachtman)
  • (7/11) Is Particle Physics Bad Science? (memory lane)
  • (7/13) Professor of Philosophy Resigns over Sexual Misconduct (rejected post)
  • (7/14) Stephen Senn: Indefinite irrelevance
  • (7/17) Phil/Stat/Law: What Bayesian prior should a jury have? (Schachtman)
  • (7/19) Msc Kvetch: A question on the Martin-Zimmerman case we do not hear
  • (7/20) Guest Post: Larry Laudan. Why Presuming Innocence is Not a Bayesian Prior
  • (7/23) Background Knowledge: Not to Quantify, But To Avoid Being Misled By, Subjective Beliefs
  • (7/26) New Version: On the Birnbaum argument for the SLP: Slides for JSM talk

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016.

 

Save

Save

Save

Save

Save

Categories: 3-year memory lane, Error Statistics, Statistics | Leave a comment

For Popper’s Birthday: A little Popper self-test & reading from Conjectures and Refutations

images-10

28 July 1902 – 17 September 1994

Today is Karl Popper’s birthday. I’m linking to a reading from his Conjectures and Refutations[i] along with an undergraduate item I came across: Popper Self-Test Questions. It includes multiple choice questions, quotes to ponder, and thumbnail definitions at the end[ii].

Blog Readers who wish to send me their answers will have their papers graded (at least try the multiple choice; if you’re unsure, do the reading). [Use the comments or e-mail.]

[i] Popper reading from Conjectures and Refutations
[ii] I might note the “No-Pain philosophy” (3 part) Popper posts from this blog: parts 12, and 3.

HAPPY BIRTHDAY POPPER!

REFERENCE:

Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.

Save

Save

Categories: Popper | 11 Comments

“Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.”

images-24

Seeing the world through overly rosy glasses

Taboos about power nearly always stem from misuse of power analysis. Sander Greenland (2012) has a paper called “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.”  I’m not saying Greenland errs; the error would be made by anyone who interprets power analysis in a manner giving rise to Greenland’s objection. So what’s (ordinary) power analysis?

(I) Listen to Jacob Cohen (1988) introduce Power Analysis

“PROVING THE NULL HYPOTHESIS. Research reports in the literature are frequently flawed by conclusions that state or imply that the null hypothesis is true. For example, following the finding that the difference between two sample means is not statistically significant, instead of properly concluding from this failure to reject the null hypothesis that the data do not warrant the conclusion that the population means differ, the writer concludes, at least implicitly, that there is no difference. The latter conclusion is always strictly invalid, and is functionally invalid as well unless power is high. The high frequency of occurrence of this invalid interpretation can be laid squarely at the doorstep of the general neglect of attention to statistical power in the training of behavioral scientists. Continue reading

Categories: Cohen, Greenland, power, Statistics | 46 Comments

Philosophy and History of Science Announcements

.

.

2016 UK-EU Foundations of Physics Conference

Start Date:16 July 2016

Categories: Announcement | Leave a comment

3 YEARS AGO (JUNE 2013): MEMORY LANE

3 years ago...

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: June 2013. I mark in red three posts that seem most apt for general background on key issues in this blog, excluding those reblogged recently [1].  Posts that are part of a “unit” or a group of “U-Phils”(you [readers] philosophize) count as one. Here I grouped 6/5 and 6/6.

June 2013

  • (6/1) Winner of May Palindrome Contest
  • (6/1) Some statistical dirty laundry*(recently reblogged)
  • (6/5) Do CIs Avoid Fallacies of Tests? Reforming the Reformers :(6/5 and6/6 are paired as one)
  • (6/6) PhilStock: Topsy-Turvy Game
  • (6/6) Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)
  • (6/8) Richard Gill: “Integrity or fraud… or just questionable research practices?*(recently reblogged)
  • (6/11) Mayo: comment on the repressed memory research [How a conceptual criticism, requiring no statistics, might go.]
  • (6/14) P-values can’t be trusted except when used to argue that p-values can’t be trusted!
  • (6/19) PhilStock: The Great Taper Caper
  • (6/19) Stanley Young: better p-values through randomization in microarrays
  • (6/22) What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Sat night potpourri*(recently reblogged)
  • (6/26) Why I am not a “dualist” in the sense of Sander Greenland
  • (6/29) Palindrome “contest” contest
  • (6/30) Blog Contents: mid-year

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

Save

Save

Save

Save

Save

Categories: 3-year memory lane, Error Statistics, Statistics | Leave a comment

A. Birnbaum: Statistical Methods in Scientific Inference (May 27, 1923 – July 1, 1976)

Allan Birnbaum: May 27, 1923- July 1, 1976

Allan Birnbaum died 40 years ago today. He lived to be only 53 [i]. From the perspective of philosophy of statistics and philosophy of science, Birnbaum is best known for his work on likelihood, the Likelihood Principle [ii], and for his attempts to blend concepts of likelihood with error probability ideas to arrive at what he termed “concepts of statistical evidence”. Failing to find adequate concepts of statistical evidence, Birnbaum called for joining the work of “interested statisticians, scientific workers and philosophers and historians of science”–an idea I have heartily endorsed. While known for a result that the (strong) Likelihood Principle followed from sufficiency and conditionality principles (a result that Jimmy Savage deemed one of the greatest breakthroughs in statistics), a few years after publishing it, he turned away from it, perhaps discovering gaps in his argument. A post linking to a 2014 Statistical Science issue discussing Birnbaum’s result is here. Reference [5] links to the Synthese 1977 volume dedicated to his memory. The editors describe it as their way of “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. Ample weekend reading! Continue reading

Categories: Birnbaum, Likelihood Principle, phil/history of stat, Statistics | Tags: | 62 Comments

Richard Gill: “Integrity or fraud… or just questionable research practices?” (Is Gill too easy on them?)

Professor Gill

Professor Gill

Professor Richard Gill
Statistics Group
Mathematical Institute
Leiden University

It was statistician Richard Gill who first told me about Diederik Stapel (see an earlier post on Diederik). We were at a workshop on Error in the Sciences at Leiden in 2011. I was very lucky to have Gill be assigned as my commentator/presenter—he was excellent! As I was explaining some data problems to him, he suddenly said, “Some people don’t bother to collect data at all!” That’s when I learned about Stapel.

Committees often turn to Gill when someone’s work is up for scrutiny of bad statistics or fraud, or anything in between. Do you think he’s being too easy on researchers when he says, about a given case:

“data has been obtained by some combination of the usual ‘questionable research practices’ [QRPs] which are prevalent in the field in question. Everyone does it this way, in fact, if you don’t, you’d never get anything published. …People are not deliberately cheating: they honestly believe in their theories and believe the data is supporting them.”

Isn’t that the danger in relying on deeply felt background beliefs?  Have our attitudes changed (toward QRPs) over the past 3 years (harsher or less harsh)? Here’s a talk of his I blogged 3 years ago (followed by a letter he allowed me to post). I reflect on the pseudoscientific nature of the ‘recovered memories’ program in one of the Geraerts et al. papers in a later post. Continue reading

Categories: 3-year memory lane, junk science, Statistical fraudbusting, Statistics | 4 Comments

What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Are we lowering the bar?

images-2

For entertainment only

In a post 3 years ago (“What do these share in common: m&m’s, limbo stick, ovulation, Dale Carnegie? Sat night potpourri”), I expressed doubts about expending serious effort to debunk the statistical credentials of studies that most readers without any statistical training would regard as “for entertainment only,” dubious, or pseudoscientific quackery. It needn’t even be that the claim is implausible, what’s implausible is that it has been well probed in the experiment at hand. Given the attention being paid to such examples by some leading statisticians, and scores of replication researchers over the past 3 years–attention that has been mostly worthwhile–maybe the bar has been lowered. What do you think? Anyway, this is what I blogged 3 years ago. (Oh, I decided to put in a home-made cartoon!) Continue reading

Categories: junk science, replication research, Statistics | 2 Comments

Some statistical dirty laundry: have the stains become permanent?

images

.

Right after our session at the SPSP meeting last Friday, I chaired a symposium on replication that included Brian Earp–an active player in replication research in psychology (Replication and Evidence: A tenuous relationship p. 80). One of the first things he said, according to my notes, is that gambits such as cherry picking, p-hacking, hunting for significance, selective reporting, and other QRPs, had been taught as acceptable become standard practice in psychology, without any special need to adjust p-values or alert the reader to their spuriousness [i]. (He will correct me if I’m wrong[2].) It shocked me to hear it, even though it shouldn’t have, given what I’ve learned about statistical practice in social science. It was the Report on Stapel that really pulled back the curtain on this attitude toward QRPs in social psychology–as discussed in this blogpost 3 years ago. (If you haven’t read Section 5 of the report on flawed science, you should.) Many of us assumed that QRPs, even if still committed, were at least recognized to be bad statistical practices since the time of Morrison and Henkel’s (1970) Significance Test Controversy. A question now is this: have all the confessions of dirty laundry, the fraudbusting of prominent researchers, the pledges to straighten up and fly right, the years of replication research, done anything to remove the stains? I leave the question open for now. Here’s my “statistical dirty laundry” post from 2013: Continue reading

Categories: junk science, reproducibility, spurious p values, Statistics | 4 Comments

Mayo & Parker “Using PhilStat to Make Progress in the Replication Crisis in Psych” SPSP Slides

Screen Shot 2016-06-19 at 12.53.32 PMHere are the slides from our talk at the Society for Philosophy of Science in Practice (SPSP) conference. I covered the first 27, Parker the rest. The abstract is here:

Categories: P-values, reforming the reformers, replication research, Statistics, StatSci meets PhilSci | Leave a comment

Create a free website or blog at WordPress.com.