**Junk Science** (as first coined).* Have you ever noticed in wranglings over evidence-based policy that it’s always one side that’s politicizing the evidence—the side whose policy one doesn’t like? The evidence on the near side—your side—is, of course, solid science. Let’s call those who first coined the term “junk science” Group 1. For Group 1, junk science is bad science used to defend pro-regulatory stances, whereas sound science would identify errors in reports of potential risk. (Yes, to my knowledge this was the first popular use of “junk science”.) For the challengers—let’s call them Group 2—junk science is bad science used to defend the *anti*-regulatory stance, whereas sound science would identify potential risks, advocate precautionary stances, and recognize errors where risk is denied.

*Both groups agree that politicizing science is very, very bad—but it’s only the other group that does it!*

A given print exposé exploring the distortions of fact on one side or the other routinely showers wild praise on its own side’s—its science’s and its policy’s—objectivity, its adherence to the facts, just the facts. How impressed might we be with the text or the group that admitted to its own biases?

Take, say, global warming, genetically modified crops, electric-power lines, medical diagnostic testing. Group 1 alleges that those who point up the risks (actual or potential) have a vested interest in construing the evidence that exists (and the gaps in the evidence) accordingly, which may bias the relevant science and pressure scientists to be politically correct. Group 2 alleges the reverse, pointing to industry biases in the analysis or reanalysis of data and pressures on scientists doing industry-funded work to go along to get along.

When the battle between the two groups is joined, issues of evidence—what counts as bad/good evidence for a given claim—and issues of regulation and policy—what are “acceptable” standards of risk/benefit—may become so entangled that no one recognizes how much of the disagreement stems from divergent *assumptions* about how models are produced and used, as well as from contrary stands on the foundations of uncertain knowledge and statistical inference. The core disagreement is mistakenly attributed to divergent policy values, at least for the most part.

Over the years I have tried my hand at sorting out these debates (e.g., Mayo and Hollander 1991). My account of testing actually came into being to systematize reasoning from statistically insignificant results in evidence-based risk policy: no evidence of risk is not evidence of no risk! (see October 5). Unlike the disputants who get the most attention, I have argued that the current polarization cries out for critical or meta-scientific (or meta-statistical) scrutiny of the uncertainties, assumptions, and risks of error that are part and parcel of gathering and interpreting evidence on both sides. Unhappily, the disputants tend not to welcome this position—and are even hostile to it. This used to shock me when I was starting out: why would those trying to promote greater risk accountability not want to avail themselves of ways to hold agencies and companies responsible when they bury risks in fallacious interpretations of statistically insignificant results? By now, I am used to it.

This isn’t to say that there’s no honest self-scrutiny going on, only that all sides are so used to anticipating conspiracies of bias that my position is likely viewed as yet another politically motivated ruse. So we are left with scientific evidence playing less and less of a role in constraining or adjudicating disputes. Even to suggest an evidential adjudication risks being attacked as a paid insider.

I agree with David Michaels (2008, 61) that “the battle for the integrity of science is rooted in issues of methodology,” but winning the battle would demand something that both sides are increasingly unwilling to grant. It comes as no surprise that some of the best scientists stay as far away as possible from such controversial science.

**What about the recent case of some scientists asking Obama to prosecute “global warming skeptics”? Science is being politicized but on which side (or both)?**

*Just as relevant now as when I first blogged this 4 years ago (under “objectivity”).

Mayo, D. and Hollander, R. (eds.). 1991. *Acceptable Evidence: Science and Values in Risk Management*. Oxford: Oxford University Press.

Mayo, D. 1991. “Sociological versus Metascientific Views of Risk Assessment,” in D. Mayo and R. Hollander (eds.), *Acceptable Evidence*: 249–79.

Michaels, D. 2008. *Doubt Is Their Product*. Oxford: Oxford University Press.

Filed under: 4 years ago!, junk science, Objectivity, Statistics Tagged: David Michaels, evidence based policy, Evidence-based medicine, Junk science, risk assessment

**ONE YEAR AGO, the NYT “Science Times” (9/29/14) published Faye Flam’s article, first blogged here.**

Congratulations to Faye Flam for finally getting her article, “The odds, continually updated,” published in the Science Times section of the *New York Times*, after months of reworking and editing, interviewing and reinterviewing. I’m grateful that *one remark from me remained.* Seriously, I am. A few comments: the Monty Hall example is simple probability, not statistics, and finding that fisherman who floated on his boots at best used likelihoods. I might note, too, that critiquing that ultra-silly example about ovulation and voting—a study so bad that CNN actually had to pull it due to reader complaints[i]—scarcely required more than noticing that the researchers didn’t even know the women were ovulating[ii]. Experimental design is an old area of statistics developed by frequentists; on the other hand, these ovulation researchers really believe their theory (and can point to a huge literature)…. Anyway, I should stop kvetching and thank Faye and the NYT for doing the article at all[iii]. Here are some excerpts:
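On the first point—that Monty Hall is a matter of simple probability rather than statistical inference—the famous 2/3 answer for switching can be checked with nothing more than a direct simulation of the game. A quick sketch (the function name, trial count, and seed are purely illustrative):

```python
import random

def monty_hall_win_rate(switch, trials=100_000, seed=1):
    """Simulate the Monty Hall game; return the fraction of wins."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)    # door hiding the car
        pick = rng.randrange(3)   # contestant's initial choice
        # Host opens a door that is neither the contestant's pick nor the car.
        opened = next(d for d in range(3) if d not in (pick, car))
        if switch:
            # Switch to the one remaining closed door.
            pick = next(d for d in range(3) if d not in (pick, opened))
        wins += (pick == car)
    return wins / trials

print(f"stay:   {monty_hall_win_rate(switch=False):.3f}")  # close to 1/3
print(f"switch: {monty_hall_win_rate(switch=True):.3f}")   # close to 2/3
```

No priors, no data, no inference: the answer falls out of the game’s fixed probability structure, which is exactly why the example belongs to probability rather than statistics.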

…….When people think of statistics, they may imagine lists of numbers — batting averages or life-insurance tables. But the current debate is about how scientists turn data into knowledge, evidence and predictions. Concern has been growing in recent years that some fields are not doing a very good job at this sort of inference. In 2012, for example, a team at the biotech company Amgen announced that it had analyzed 53 cancer studies and could not replicate 47 of them.

Similar follow-up analyses have cast doubt on so many findings in fields such as neuroscience and social science that researchers talk about a “replication crisis.”

Some statisticians and scientists are optimistic that Bayesian methods can improve the reliability of research by allowing scientists to crosscheck work done with the more traditional or “classical” approach, known as frequentist statistics. The two methods approach the same problems from different angles.

…..

Looking at Other Factors

Take, for instance, a study concluding that single women who were ovulating were 20 percent more likely to vote for President Obama in 2012 than those who were not. (In married women, the effect was reversed.)

Dr. Gelman re-evaluated the study using Bayesian statistics. That allowed him to look at probability not simply as a matter of results and sample sizes, but in the light of other information that could affect those results.

He factored in data showing that people rarely change their voting preference over an election cycle, let alone a menstrual cycle. When he did, the study’s statistical significance evaporated. (The paper’s lead author, Kristina M. Durante of the University of Texas, San Antonio, said she stood by the finding.)

Dr. Gelman said the results would not have been considered statistically significant had the researchers used the frequentist method properly. He suggests using Bayesian calculations not necessarily to replace classical statistics but to flag spurious results.

……

Others say that in confronting the so-called replication crisis, the best cure for misleading findings is not Bayesian statistics, but good frequentist ones.

The technique was developed to distinguish real effects from chance, and to prevent scientists from fooling themselves. It was frequentist statistics that allowed people to uncover all the problems with irreproducible research in the first place, said Deborah Mayo, a philosopher of science at Virginia Tech.

Uri Simonsohn, a psychologist at the University of Pennsylvania, agrees. Several years ago, he published a paper that exposed common statistical shenanigans in his field — logical leaps, unjustified conclusions, and various forms of unconscious and conscious cheating.

He said he had looked into Bayesian statistics and concluded that if people misused or misunderstood one system, they would do just as badly with the other. Bayesian statistics, in short, can’t save us from bad science. …
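The logic of Gelman’s reanalysis described in the excerpt can be sketched as a conjugate normal-normal update: a strong prior that vote intentions barely move swamps a noisy observed effect. The numbers below are hypothetical, chosen only to illustrate the shrinkage, not taken from the actual study:

```python
import math

def normal_update(obs, obs_se, prior_mean, prior_sd):
    """Conjugate normal-normal update; returns (posterior mean, posterior sd)."""
    w_prior = 1.0 / prior_sd**2   # precision of the prior
    w_obs = 1.0 / obs_se**2       # precision of the observation
    post_var = 1.0 / (w_prior + w_obs)
    post_mean = post_var * (w_prior * prior_mean + w_obs * obs)
    return post_mean, math.sqrt(post_var)

# Hypothetical: a noisy 20-point observed swing (standard error 8 points)
# meets a prior that vote preferences move very little (mean 0, sd 2 points).
mean, sd = normal_update(obs=20.0, obs_se=8.0, prior_mean=0.0, prior_sd=2.0)
print(f"posterior swing: {mean:.1f} \u00b1 {sd:.1f} points")
```

With these made-up inputs the posterior lands near 1 ± 2 points: the headline effect is no longer distinguishable from zero once the background information about stable voting preferences enters the calculation.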

**You can read Faye’s article here: “The odds, continually updated”.**

[i]“Last week CNN pulled a story about a study purporting to demonstrate a link between a woman’s ovulation and how she votes, explaining that it failed to meet the cable network’s editorial standards. The story was savaged online as ‘silly,’ ‘stupid,’ ‘sexist,’ and ‘offensive.’ Others were less nice.”

[ii] I used it here as an illustration of an example that fell below my “limbo stick” cut-off of being worth criticizing. Doing so tends to lead to what I call the Dale Carnegie Fallacy.

[iii] Faye was really exceptional in her attempts to understand the ideas, and to avoid biasing the story too much more than was necessary. I look forward to more from Flam at her new gig.

Filed under: Bayesian/frequentist, Statistics

**MONTHLY MEMORY LANE: 3 years ago: September 2012.** I mark in **red** **three** posts that seem most apt for general background on key issues in this blog.[1] (Once again it was tough to pick just 3; many of the ones I selected are continued in the following posts, so please check out subsequent dates of posts that interest you…)

**September 2012**

- (9/3) After dinner Bayesian comedy hour…
- (9/6) **Stephen Senn: The nuisance parameter nuisance**
- (9/8) Metablog: One-Year Anniversary
- (9/8) Return to the comedy hour… (on significance tests)
- (9/12) **U-Phil (9/25/12): How should “prior information” enter in statistical inference?**
- (9/15) More on using background info
- (9/19) Barnard, background info/intentions
- (9/22) **Statistics and ESP research (Diaconis)**
- (9/25) Insevere tests and pseudoscience
- (9/26) Levels of Inquiry
- (9/29) Stephen Senn: On the (ir)relevance of stopping rules in meta-analysis
- (9/30) Letter from George (Barnard)

**[1]** Excluding those reblogged fairly recently. Posts that are part of a “unit” or a group of “U-Phils” count as one. Monthly memory lanes began at the blog’s 3-year anniversary in Sept. 2014.

Filed under: 3-year memory lane, Statistics

**From “The Savage Forum” (pp. 79-84, Savage 1962)[i]**

**BARNARD**:…Professor Savage, as I understand him, said earlier that a difference between likelihoods and probabilities was that probabilities would normalize because they integrate to one, whereas likelihoods will not. Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses. … If this is so we ought to allow that in addition to the hypotheses that we really consider we should allow something that we had not thought of yet, and of course as soon as we do this we lose the normalizing factor of the probability, and from that point of view probability has no advantage over likelihood. This is my general point, that I think while I agree with a lot of the technical points, I would prefer that this is talked about in terms of likelihood rather than probability. I should like to ask what Professor Savage thinks about that, whether he thinks that the necessity to enumerate hypotheses exhaustively, is important.

**SAVAGE**: Surely, as you say, we cannot always enumerate hypotheses so completely as we like to think. The list can, however, always be completed by tacking on a catch-all ‘something else’. In principle, a person will have probabilities given ‘something else’ just as he has probabilities given other hypotheses. In practice, the probability of a specified datum given ‘something else’ is likely to be particularly vague–an unpleasant reality. The probability of ‘something else’ is also meaningful of course, and usually, though perhaps poorly defined, it is definitely very small. Looking at things this way, I do not find probabilities unnormalizable, certainly not altogether unnormalizable.

Whether probability has an advantage over likelihood seems to me like the question whether volts have an advantage over amperes. The meaninglessness of a norm for likelihood is for me a symptom of the great difference between likelihood and probability. Since you question that symptom, I shall mention one or two others. …

On the more general aspect of the enumeration of all possible hypotheses, I certainly agree that the danger of losing serendipity by binding oneself to an over-rigid model is one against which we cannot be too alert. We must not pretend to have enumerated all the hypotheses in some simple and artificial enumeration that actually excludes some of them. The list can however be completed, as I have said, by adding a general ‘something else’ hypothesis, and this will be quite workable, provided you can tell yourself in good faith that ‘something else’ is rather improbable. The ‘something else’ hypothesis does not seem to make it any more meaningful to use likelihood for probability than to use volts for amperes.

Let us consider an example. Off hand, one might think it quite an acceptable scientific question to ask, ‘What is the melting point of californium?’ Such a question is, in effect, a list of alternatives that pretends to be exhaustive. But, even specifying which isotope of californium is referred to and the pressure at which the melting point is wanted, there are alternatives that the question tends to hide. It is possible that californium sublimates without melting or that it behaves like glass. Who dare say what other alternatives might obtain? An attempt to measure the melting point of californium might, if we are serendipitous, lead to more or less evidence that the concept of melting point is not directly applicable to it. Whether this happens or not, Bayes’s theorem will yield a posterior probability distribution for the melting point given that there really is one, based on the corresponding prior conditional probability and on the likelihood of the observed reading of the thermometer as a function of each possible melting point. Neither the prior probability that there is no melting point, nor the likelihood for the observed reading as a function of hypotheses alternative to that of the existence of a melting point enter the calculation. The distinction between likelihood and probability seems clear in this problem, as in any other.

**BARNARD**: Professor Savage says in effect, ‘add at the bottom of the list H_{1}, H_{2},…“something else”’. But what is the probability that a penny comes up heads given the hypothesis ‘something else’? We do not know. What one requires for this purpose is not just that there should be some hypotheses, but that they should enable you to compute probabilities for the data, and that requires very well defined hypotheses. For the purpose of applications, I do not think it is enough to consider only the conditional posterior distributions mentioned by Professor Savage.

**LINDLEY**: I am surprised at what seems to me an obvious red herring that Professor Barnard has drawn across the discussion of hypotheses. I would have thought that when one says this posterior distribution is such and such, all it means is that among the hypotheses that have been suggested the relevant probabilities are such and such; conditionally on the fact that there is nothing new, here is the posterior distribution. If somebody comes along tomorrow with a brilliant new hypothesis, well of course we bring it in.

**BARTLETT**: But you would be inconsistent because your prior probability would be zero one day and non-zero another.

**LINDLEY**: No, it is not zero. My prior probability for other hypotheses may be ε. All I am saying is that conditionally on the other 1 – ε, the distribution is as it is.

**BARNARD**: Yes, but your normalization factor is now determined by ε. Of course ε may be anything up to 1. Choice of letter has an emotional significance.

**LINDLEY**: I do not care what it is as long as it is not one.

**BARNARD**: In that event two things happen. One is that the normalization has gone west, and hence also this alleged advantage over likelihood. Secondly, you are not in a position to say that the posterior probability which you attach to an hypothesis from an experiment with these unspecified alternatives is in any way comparable with another probability attached to another hypothesis from another experiment with another set of possibly unspecified alternatives. This is the difficulty over likelihood. Likelihood in one class of experiments may not be comparable to likelihood from another class of experiments, because of differences of metric and all sorts of other differences. But I think that you are in exactly the same difficulty with conditional probabilities just because they are conditional on your having thought of a certain set of alternatives. It is not rational, in other words. Suppose I come out with a probability of a third that the penny is unbiased, having considered a certain set of alternatives. Now I do another experiment on another penny and I come out of that case with the probability one third that it is unbiased, having considered yet another set of alternatives. There is no reason why I should agree or disagree in my final action or inference in the two cases. I can do one thing in one case and another in another, because they represent conditional probabilities leaving aside possibly different events.

**LINDLEY**: All probabilities are conditional.

**BARNARD**: I agree.

**LINDLEY**: If there are only conditional ones, what is the point at issue?

**PROFESSOR E.S. PEARSON**: I suggest that you start by knowing perfectly well that they are conditional and when you come to the answer you forget about it.

**BARNARD**: The difficulty is that you are suggesting the use of probability for inference, and this makes us able to compare different sets of evidence. Now you can only compare probabilities on different sets of evidence if those probabilities are conditional on the same set of assumptions. If they are not conditional on the same set of assumptions they are not necessarily in any way comparable.

**LINDLEY**: Yes, if this probability is a third conditional on that, and if a second probability is a third, conditional on something else, a third still means the same thing. I would be prepared to take my bets at 2 to 1.

**BARNARD**: Only if you knew that the condition was true, but you do not.

**GOOD**: Make a conditional bet.

**BARNARD**: You can make a conditional bet, but that is not what we are aiming at.

**WINSTEN**: You are making a cross comparison where you do not really want to, if you have got different sets of initial experiments. One does not want to be driven into a situation where one has to say that everything with a probability of a third has an equal degree of credence. I think this is what Professor Barnard has really said.

**BARNARD**: It seems to me that likelihood would tell you that you lay 2 to 1 in favour of H_{1} against H_{2}, and the conditional probabilities would be exactly the same. Likelihood will not tell you what odds you should lay in favour of H_{1} as against the rest of the universe. Probability claims to do that, and it is the only thing that probability can do that likelihood cannot.

**In their attempts to get the “catchall factor” to disappear, many appeal to comparative assessments–likelihood ratios or Bayes factors. Several key problems remain: (i) the appraisal is always relative to the choice of alternative, and this allows “favoring” one or the other hypothesis without being able to say there is evidence for either; (ii) although the hypotheses are not exhaustive, many give priors to the null and alternative that sum to 1; (iii) the ratios do not have the same evidential meaning in different cases (what’s high? 10, 50, 800?); and (iv) there’s a lack of control of the probability of misleading interpretations, except with predesignated point-against-point hypotheses or special cases (this is why Barnard later rejected the Likelihood Principle). You can read the rest of pages 78-103 of the Savage Forum here. This exchange was first blogged here. Share your comments.**
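Barnard’s complaint about the catch-all can be made concrete with a toy computation (a sketch; the hypotheses and all numbers are invented for illustration). With fully specified hypotheses, Bayes’ theorem normalizes cleanly; tack on a “something else” hypothesis and every posterior now depends on a likelihood no one can supply:

```python
def posteriors(priors, likelihoods):
    """Bayes' theorem over an enumerated set of hypotheses."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

# Two fully specified hypotheses about a coin (fair vs. 70% heads),
# after observing a single head: everything is well defined.
print(posteriors(priors=[0.5, 0.5], likelihoods=[0.5, 0.7]))

# Add Savage's catch-all with prior 0.1. Its likelihood for the data
# is unspecified; whatever value we plug in changes every posterior --
# the normalization is now hostage to a number we do not have.
for catchall_lik in (0.1, 0.5, 0.9):
    print(catchall_lik,
          posteriors(priors=[0.45, 0.45, 0.1],
                     likelihoods=[0.5, 0.7, catchall_lik]))
```

Note that the *ratio* of the posteriors of the two specified hypotheses (driven by the likelihood ratio 0.7/0.5) is untouched by the catch-all, which is Barnard’s point in favor of talking in terms of likelihood here.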

**References**

[i] Savage, L. J. (1962). “Discussion,” in G. A. Barnard and D. R. Cox (eds.), *The Foundations of Statistical Inference: A Discussion*. London: Methuen, p. 76.

*Other Barnard links on this blog:

Aris Spanos: Comment on the Barnard and Copas (2002) Empirical Example

Mayo, Barnard, Background Information/Intentions

Links to a scan of the entire Savage forum may be found here.

Filed under: Barnard, highly probable vs highly probed, phil/history of stat, Statistics

The answer to the question of my last post is George Barnard, and today is his 100th birthday*. The paragraphs stem from a 1981 conference in honor of his 65th birthday, published in his 1985 monograph, “A Coherent View of Statistical Inference” (Statistics Technical Report Series, University of Waterloo). **Happy Birthday George!**

[I]t seems to be useful for statisticians generally to engage in retrospection at this time, because there seems now to exist an opportunity for a convergence of view on the central core of our subject. Unless such an opportunity is taken there is a danger that the powerful central stream of development of our subject may break up into smaller and smaller rivulets which may run away and disappear into the sand.

I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth. It is also responsible for the lack of use of sound statistics in the more developed areas of science and engineering. While the foundations have an interest of their own, and can, in a limited way, serve as a basis for extending statistical methods to new problems, their study is primarily justified by the need to present a coherent view of the subject when teaching it to others. One of the points I shall try to make is, that we have created difficulties for ourselves by trying to oversimplify the subject for presentation to others. It would surely have been astonishing if all the complexities of such a subtle concept as probability in its application to scientific inference could be represented in terms of only three concepts––estimates, confidence intervals, and tests of hypotheses. Yet one would get the impression that this was possible from many textbooks purporting to expound the subject. We need more complexity; and this should win us greater recognition from scientists in developed areas, who already appreciate that inference is a complex business while at the same time it should deter those working in less developed areas from thinking that all they need is a suite of computer programs.

Here’s an excerpt from the following section: “A Little History”(1)

Although I had been interested in statistics at school, in 1932, and first met Fisher in 1933, I came properly into the subject during the Second World War….[I]t was not, I think, recognized until the publication of Joan Box’s book, that the man who, more than any other, was responsible for creating the concepts now central to our subject, was cut off from these developments by some mysterious personal or political agency….

It is idle to speculate on what might have happened had the leaders of the subject, Fisher, Bartlett, Pearson, Neyman, Wald, Wilks, and others, all been engaged to work together during the war. Cynics might suggest that the resulting explosions would have made the Manhattan Project redundant. But on an optimistic view we could have been spared the sharp and not particularly fruitful controversies which have beset the foundations over the past thirty years. Only now do we seem to be approaching a consensus on the respective roles of “tests” or P-values, “estimates,” likelihood, Bayes’ theorem, confidence or “fiducial” distributions and other more complex concepts. ….

It is interesting that Barnard calls for “more complexity” while urging “a coherent view” of statistics. I agree that a “coherent” view is possible at a foundational, philosophical level, if not on a formal level.

I’ll reblog some other posts on Barnard this week.

*There was at least one correct, original answer from Oliver Maclaren.

(1) Barnard’s rivulets remind me of Walt Whitman’s Autumn Rivulets.

Filed under: Barnard, phil/history of stat, Statistics

[I]t seems to be useful for statisticians generally to engage in retrospection at this time, because there seems now to exist an opportunity for a convergence of view on the central core of our subject. Unless such an opportunity is taken there is a danger that the powerful central stream of development of our subject may break up into smaller and smaller rivulets which may run away and disappear into the sand.

I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth. It is also responsible for the lack of use of sound statistics in the more developed areas of science and engineering. While the foundations have an interest of their own, and can, in a limited way, serve as a basis for extending statistical methods to new problems, their study is primarily justified by the need to present a coherent view of the subject when teaching it to others. One of the points I shall try to make is, that we have created difficulties for ourselves by trying to oversimplify the subject for presentation to others. It would surely have been astonishing if all the complexities of such a subtle concept as probability in its application to scientific inference could be represented in terms of only three concepts––estimates, confidence intervals, and tests of hypotheses. Yet one would get the impression that this was possible from many textbooks purporting to expound the subject. We need more complexity; and this should win us greater recognition from scientists in developed areas, who already appreciate that inference is a complex business while at the same time it should deter those working in less developed areas from thinking that all they need is a suite of computer programs.

**Who wrote this and when?**

Filed under: Error Statistics, Statistics

Jump to Part (ii) 9/18/15 and (iii) 9/20/15 updates

I heard a podcast the other day in which the philosopher of science, Massimo Pigliucci, claimed that Popper’s demarcation of science fails because it permits pseudosciences like astrology to count as scientific! Now Popper requires supplementing in many ways, but we can get far more mileage out of Popper’s demarcation than Pigliucci supposes.

Pigliucci has it that, according to Popper, mere logical falsifiability suffices for a theory to be scientific, and this prevents Popper from properly ousting astrology from the scientific pantheon. Not so. In fact, Popper’s central goal is to call our attention to theories that, despite being logically falsifiable, are rendered immune from falsification by means of *ad hoc* maneuvering, sneaky face-saving devices, “monster-barring” or “conventionalist stratagems”. Lacking space on Twitter (where the “Philosophy Bites” podcast was linked), I’m placing some quick comments here. (For other posts on Popper, please search this blog.) Excerpts from the classic two pages in *Conjectures and Refutations* (1962, pp. 36-7) will serve our purpose:

It is easy to obtain confirmations, or verifications, for nearly every theory–if we look for confirmations.

Confirmations should count only if they are the result of risky predictions; that is [if the theory or claim H is false] we should have expected an event which was incompatible with the theory [or claim]…. Every genuine test of a theory is an attempt to falsify it, or to refute it. Testability is falsifiability, but there are degrees of testability; some theories are more testable…

Confirming evidence should not count except when it is the result of a genuine test of the theory, and this means that it can be presented as a serious but unsuccessful attempt to falsify the theory. (I now speak of such cases as ‘corroborating evidence’.)

Some genuinely testable theories, when found to be false, are still upheld by their admirers–for example by introducing *ad hoc* some auxiliary assumption, or by re-interpreting the theory *ad hoc* in such a way that it escapes refutation. Such a procedure is always possible, but it rescues the theory from refutation only at the price of destroying, or at least lowering, its scientific status. (I later described such a rescuing operation as a ‘conventionalist twist’ or a ‘conventionalist stratagem’.)…Einstein’s theory of gravitation clearly satisfied the criterion of falsifiability. Even if our measuring instruments at the time did not allow us to pronounce on the results of the tests with complete assurance, there was clearly a possibility of refuting the theory.

Astrology did not pass the test. Astrologers were greatly impressed, and misled, by what they believed to be confirming evidence–so much so that they were quite unimpressed by any unfavourable evidence. Moreover, by making their interpretations and prophecies sufficiently vague they were able to explain away anything that might have been a refutation of the theory had the theory and the prophecies been more precise.

In order to escape falsification they destroyed the testability of their theory. It is a typical soothsayer’s trick to predict things so vaguely that the predictions can hardly fail: that they become irrefutable. The Marxist theory of history, in spite of the serious efforts of some of its founders and followers, ultimately adopted this soothsaying practice. In some of its earlier formulations…their predictions were testable, and in fact falsified. Yet instead of accepting the refutations the followers of Marx re-interpreted both the theory and the evidence in order to make them agree. In this way they rescued the theory from refutation…. They thus gave a ‘conventionalist twist’ to the theory; and by this stratagem they destroyed its much advertised claim to scientific status.

The two psycho-analytic theories were in a different class. They were simply non-testable, irrefutable. There was no conceivable human behavior which could contradict them….I personally do not doubt that much of what they say is of considerable importance, and may well play its part one day in a psychological science which is testable. But it does mean that those ‘clinical observations’ which analysts naively believe confirm their theory cannot do this any more than the daily confirmations which astrologers find in their practice.

Only in the third case does Popper take up theories that he considers non-testable due to (self-sealing) features in the theories themselves. The only difference for Popper is that in the third case the *ad hoc* saves are already part and parcel of the theory, but little turns on that. Popper’s central thesis is that it makes no difference how you immunize a theory from having its flaws uncovered by data—the result is that it is not actually being critically tested by data, data aren’t really being taken seriously. Thus such theories, or theory appraisals, are unscientific.

Thus, Popper is quite clear that the appraisals of theories by data, in various domains, are unscientific because, far from subjecting claims to severe criticism, far from accepting that flaws have been unearthed when predictions fail, far from giving the theories a hard time, theories are retained by means of *ad hoc* saves and conventionalist stratagems. The claims are logically falsifiable but they are rendered immune to criticism. In these arenas, theories are not being appraised in a scientific (critical) manner; the fact that data are involved fails utterly to make them scientific. An unscientific appraisal is merely telling us that data “could be interpreted in the light of” the theory. In genuine sciences, passing a test must be difficult to achieve, if specifiable flaws exist.

Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory—or in other words, only if they result from serious attempts to refute the theory, and especially from trying to find faults where these might be expected in the light of all our knowledge. (Popper, 1994, p. 89)

Popper demands more than logical falsifiability for a theory (or theory appraisal) to be properly scientific. In fact, Popper intends his demarcation to capture a condition that “cannot be formalized”; it is “material” or “historical”; it can be located in the methodological process by which data are brought to bear on theories. Popper’s demarcation, remember, is intended as a contrast with the conception of science he finds in verificationists, inductivists and “confirmation theorists”. Reject verificationism and beware of verification biases, says Popper: confirmations are “too cheap to be worth having” and can only count if they are the result of severe testing. Genuine evidence for claim H requires (at minimum) spelling out those outcomes that would have been construed as counterevidence to H. The onus is on interpreters of data to show how the charge of questionable science has been avoided. The “ability” in Popper’s falsifiability refers to the capability of the testing methodology by which a theory is appraised.

Pigliucci is right that philosophers of science these days have tended to shy away from the task of demarcating science and pseudoscience. The task is left to those courageous committees reviewing fraud cases[1]—and they invariably turn to Popper!

There is much more that is required to flesh out Popper’s view of the demarcation of science. Here I am simply responding to this one point I heard on the podcast. I will likely update this with further remarks (look for (ii), etc.), and naturally invite Pigliucci to comment.

Popper confuses things by making it sound as if he’s asking: *When is a theory unscientific?* when he is actually asking: *When is an appraisal of a theory or claim H unscientific?* Unscientific appraisals of *H* are those that lack severity, often as a result of various face-saving stratagems. Nowadays these are better known as cherry picking, P-hacking, trying and trying again, multiple testing, and a slew of other *biasing selection effects*. Popper’s main shortcoming is that he never provided an adequate account of severe testing, either in the case of falsifying (discorroborating) or corroborating. He defined “*H* passes a severe test with data *x*” as

Pigliucci sees his position on demarcation as reacting to Laudan, who declared in 1983 that the demarcation problem was dead. It died, apparently, because philosophers couldn’t provide a set of necessary and sufficient conditions to define “science”. But such an analytic activity is not what’s involved in identifying minimal requirements for good or terrible tests. Nor would Laudan disagree. (He will correct me if I’m wrong.)

Laudan had just come to Virginia Tech around the time of the demarcation paper. I was fresh out of graduate school. I think Laudan’s point was mainly that we should stick to identifying reliable/unreliable methods, rather than try to identify which practices get to wear the label “science”. The question of “what is science” used to occupy years of brown bags here, way back when, at the very start of the STS program. Thanks to Laudan, I quickly discovered how to use my work in philosophy of statistics to help solve these core problems in philosophy of science. Notably, the error statistical methodology can be used to supply an adequate account of stringent, reliable or severe tests—just what Popper lacked. With a severe testing account in hand, the demarcation task becomes scrutinizing particular inquiries and inferences, not whole fields. How well do they accomplish the task of severely probing errors? Can they reliably solve their Duhemian problems (of where to lay the blame for anomalies)? Or do they permit researchers ample degrees of freedom to explain away anomalies? It’s when an inquiry is incapable of learning from anomaly and error that it slips into the “questionable science” category—or so I argue.[2]

Pigliucci goes in a different direction. Lacking necessary and sufficient conditions, he proposes to map out an array of sciences in the spirit of a (Wittgensteinian) family resemblance. The trouble is, lacking criteria for answering the above questions, the pigeon-holing tends to reflect someone’s assessment of plausibility, and would differ depending on who formulates the array. This gets us to Laudan’s worry: who is in and who is out, and whether a field is deemed scientific or fringe, may be largely a reflection of one group’s values, be they political, economic, religious, life-style or other. The danger is that determining what counts as junk science or good science itself turns into a rather non-scientific enterprise. Rival positions typically allege the “other side” is politicizing the science. See, “Will the real junk science please stand up?”

To be clear, I deny this needs to happen; it occurs when we fail to identify at least minimal requirements for passing a severe test. With that in hand, a cluster of ways of violating severity is forthcoming, e.g., cherry picking, monster barring, multiple testing, post-data selection effects, exception incorporation, barn-hunting, data reinterpretation, etc. In each case, the capability of the practice to be genuinely self-critical—to find flaws in its models, hypotheses and data, even where they exist—is absent or low.

***

[1] The committee reviewing fraudster Diederik Stapel does quite a good job:

One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts

…. [T]he use of research procedures in such a way as to ‘repress’ negative results by some means may be called verification bias. (Report, 48).

[2] For my deconstruction of Kuhn in light of Popper on demarcation, see “Ducks, Rabbits, and Normal Science: Recasting the Kuhn’s-eye view of Popper”: http://www.phil.vt.edu/dmayo/personal_website/EGEKChap2.pdf

Laudan, L. (1983) “The Demise of the Demarcation Problem.” In *Physics, Philosophy and Psychoanalysis*, edited by R. S. Cohen and L. Laudan, pp. 111–27. Dordrecht: D. Reidel.

Popper, K. (1962) *Conjectures and Refutations: The Growth of Scientific Knowledge*, New York: Basic Books.

Popper, K. (1994) *The Myth of the Framework: In Defence of Science and Rationality* (edited by N.A. Notturno). London: Routledge.

Filed under: Error Statistics, Popper, pseudoscience, Statistics Tagged: Laudan, science/pseudoscience

**Last third of “Peircean Induction and the Error-Correcting Thesis”**

Deborah G. Mayo

*Transactions of the Charles S. Peirce Society* 41(2) 2005: 299–319

Part 2 is here.

**8. Random sampling and the uniformity of nature**

We are now in a position to address the final move in warranting Peirce’s SCT. The severity or trustworthiness assessment, on which the error-correcting capacity depends, requires an appropriate link (qualitative or quantitative) between the data and the data-generating phenomenon, e.g., a reliable calibration of a scale in a qualitative case, or a probabilistic connection between the data and the population in a quantitative case. Establishing such a link, however, is commonly regarded as assuming that observed regularities will persist, or as making some “uniformity of nature” assumption—the bugbear of attempts to justify induction.

But Peirce contrasts his position with those favored by followers of Mill, and “almost all logicians” of his day, who “commonly teach that the inductive conclusion approximates to the truth because of the uniformity of nature” (2.775). Inductive inference, as Peirce conceives it (i.e., severe testing) does not use the uniformity of nature as a premise. Rather, the justification is sought in the manner of obtaining data. Justifying induction is a matter of showing that there exist methods with good error probabilities. For this it suffices that randomness be met only approximately, that inductive methods check their own assumptions, and that they can often detect and correct departures from randomness.

… It has been objected that the sampling cannot be random in this sense. But this is an idea which flies far away from the plain facts. Thirty throws of a die constitute an approximately random sample of all the throws of that die; and that the randomness should be approximate is all that is required. (1.94)

Peirce backs up his defense with robustness arguments. For example, in an (attempted) Binomial induction, Peirce asks, “what will be the effect upon inductive inference of an imperfection in the strictly random character of the sampling” (2.728). What if, for example, a certain proportion of the population had twice the probability of being selected? He shows that “an imperfection of that kind in the random character of the sampling will only weaken the inductive conclusion, and render the concluded ratio less determinate, but will not necessarily destroy the force of the argument completely” (2.728). This is particularly so if the sample mean is near 0 or 1. In other words, violating experimental assumptions may be shown to weaken the trustworthiness or severity of the proceeding, but this may only mean we learn a little less.
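Peirce’s claim here can be checked with a little arithmetic. The sketch below (my own toy calculation, not Peirce’s) supposes that trait-bearers are twice as likely to enter the sample: the expected sample ratio becomes 2p/(1 + p) rather than p, a distortion that weakens but does not destroy the inference, and that nearly vanishes when p is close to 0 or 1.

```python
# Expected sample ratio when members bearing the trait have selection
# weight w (w = 2 mirrors Peirce's "twice the probability" case).
# Illustrative numbers only -- not Peirce's own computation.
def expected_sample_ratio(p, w=2.0):
    """Expected sample proportion given true proportion p and weight w."""
    return w * p / (w * p + (1 - p))

for p in [0.05, 0.30, 0.50, 0.95]:
    print(p, round(expected_sample_ratio(p), 3))
# The concluded ratio is biased upward (e.g., 0.30 -> 0.462), yet it
# still points to roughly the right region; near 0 or 1 the distortion
# is slight (0.05 -> 0.095, 0.95 -> 0.974), just as Peirce observes.
```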

Yet a further safeguard is at hand:

Nor must we lose sight of the constant tendency of the inductive process to correct itself. This is of its essence. This is the marvel of it. …even though doubts may be entertained whether one selection of instances is a random one, yet a different selection, made by a different method, will be likely to vary from the normal in a different way, and if the ratios derived from such different selections are nearly equal, they may be presumed to be near the truth. (2.729)

Here, the marvel is an inductive method’s ability to correct the attempt at random sampling. Still, Peirce cautions, we should not depend so much on the self-correcting virtue that we relax our efforts to get a random and independent sample. But if our effort is not successful, and neither is our method robust, we will probably discover it. “This consideration makes it extremely advantageous in all ampliative reasoning to fortify one method of investigation by another” (ibid.).

*“The Supernal Powers Withhold Their Hands And Let Me Alone”*

Peirce turns the tables on those skeptical about satisfying random sampling—or, more generally, satisfying the assumptions of a statistical model. He declares himself “willing to concede, in order to concede as much as possible, that when a man draws instances at random, all that he knows is that he tried to follow a certain precept” (2.749). There might be a “mysterious and malign connection between the mind and the universe” that deliberately thwarts such efforts. He considers betting on the game of *rouge et noire*: “could some devil look at each card before it was turned, and then influence me mentally” to bet or not, the ratio of successful bets might differ greatly from 0.5. But, as Peirce is quick to point out, this would equally vitiate deductive inferences about the expected ratio of successful bets.

Consider our informal example of weighing with calibrated scales. If I check the properties of the scales against known, standard weights, then I can check whether my scales are working in a particular case. Were the scales infected by systematic error, I would discover this by finding systematic mismatches with the known weights; I could then subtract it out of my measurements. That the scales have given properties when I weigh objects of known weight indicates that they have the same properties when the weights are unknown, lest I be forced to assume that my knowledge or ignorance somehow influences the properties of the scale. More generally, Peirce’s insightful argument goes, the experimental procedure thus confirmed where the measured property is known must work as well when it is unknown, unless a mysterious and malign demon deliberately thwarts my efforts.

Peirce therefore grants that the validity of induction is based on assuming “that the supernal powers withhold their hands and let me alone, and that no mysterious uniformity … interferes with the action of chance” (ibid.). But this is very different from the uniformity of nature assumption.

…the negative fact supposed by me [no mysterious force interferes with the action of chance] is merely the denial of any major premise from which the falsity of the inductive conclusion could be deduced. Actually so long as the influence of this mysterious source not be overwhelming, the wonderful self-correcting nature of the ampliative inference would enable us, even so, to detect and make allowance for them. (2.749)

Not only do we not need the uniformity of nature assumption, Peirce declares “That there is a general tendency toward uniformity in nature is not merely an unfounded, it is an absolutely absurd, idea in any other sense than that man is adapted to his surroundings” (2.750). In other words, it is not nature that is uniform, it is we who are able to find patterns enough to serve our needs and interests. But the validity of inductive inference does not depend on this.

**9. Conclusion**

For Peirce, “the true guarantee of the validity of induction” is that it is a method of reaching conclusions which corrects itself; inductive methods—understood as methods of severe testing—are justified to the extent that they are error-correcting methods (SCT). I have argued that the well-known skepticism as regards Peirce’s SCT is based on erroneous views concerning the nature of inductive testing as well as what is required for a method to be self-correcting. Once these two theses are revisited, justifying the SCT boils down to showing that severe testing methods exist and that they provide reliable means for learning from error.

An inductive inference to hypothesis *H* is warranted to the extent that *H* passes a severe test, that is, one which, with high probability, would have detected a specific flaw or departure from what *H* asserts, and yet it did not. By deliberately making use of known flaws and fallacies in reasoning with limited and uncertain data, tests may be constructed that are highly trustworthy probes for detecting and discriminating errors in particular cases. Modern statistical methods (e.g., statistical significance tests) based on controlling a test’s error probabilities provide tools which, when properly interpreted, afford severe tests. While on the one hand, contemporary statistical methods increase the mathematical rigor and generality of Peirce’s SCT, on the other, Peirce provides something current statistical methodology lacks: an account of inductive inference and a philosophy of experiment that links the justification for statistical tests to a more general rationale for scientific induction. Combining the mathematical contributions of modern statistics with the inductive philosophy of Peirce sets the stage for developing an adequate solution to the age-old problem of induction. To carry out this project fully is a topic for future work.*

[You can find a pdf version of this paper here.]

**REFERENCES and Notes (see part 1)**

*That was 2005; I think (hope) I’ve made headway since then.

Filed under: C.S. Peirce, Error Statistics, phil/history of stat

**Continuation of “Peircean Induction and the Error-Correcting Thesis”**

Deborah G. Mayo

*Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy*, Volume 41, Number 2, 2005, pp. 299-319

Part 1 is here.

There are two other points of confusion in critical discussions of the SCT that we may note here:

*I. The SCT and the Requirements of Randomization and Predesignation*

The concern with “the trustworthiness of the proceeding” for Peirce, like the concern with error probabilities (e.g., significance levels) for error statisticians generally, is directly tied to their view that inductive methods should closely link inferences to the methods of data collection as well as to how the hypothesis came to be formulated or chosen for testing.

This account of the rationale of induction is distinguished from others in that it has as its consequences two rules of inductive inference which are very frequently violated (1.95); namely, that the sample be (approximately) random and that the property being tested not be determined by the particular sample *x*—i.e., predesignation.

The picture of Peircean induction that one finds in critics of the SCT disregards these crucial requirements for induction: Neither enumerative induction nor H-D testing, as ordinarily conceived, requires such rules. Statistical significance testing, however, clearly does.

Suppose, for example, that researchers wishing to demonstrate the benefits of hormone replacement therapy (HRT) search the data for factors on which treated women fare much better than untreated women and, finding one such factor, proceed to test the null hypothesis:

*H*₀: there is no improvement in factor F (e.g., memory) among women treated with HRT.

Having selected this factor for testing solely because it is a factor on which treated women show impressive improvement, it is not surprising that this null hypothesis is rejected and the results are taken to show a genuine improvement in the population. However, when the null hypothesis is tested on the same data that led it to be chosen for testing, it is well known that a spurious impression of a genuine effect easily results. Suppose, for example, that 20 factors are examined for impressive-looking improvements among HRT-treated women, and the one difference that appears large enough to test turns out to be significant at the 0.05 level. The actual significance level—the actual probability of reporting a statistically significant effect when in fact the null hypothesis is true—is not 5% but approximately 64% (Mayo 1996, Mayo and Kruse 2001, Mayo and Cox 2006). To infer the denial of *H*₀, and infer that there is evidence that HRT improves memory, is to make an inference with low severity (approximately 0.36).
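The 64% figure can be reproduced with a line of arithmetic. The sketch below assumes the 20 factors are tested independently, each at the 0.05 level (an idealization for illustration):

```python
# Actual (overall) significance level when 20 independent factors are
# each tested at the 0.05 level and any rejection is reported.
alpha, k = 0.05, 20

# P(at least one erroneous rejection when every null is true)
actual_significance = 1 - (1 - alpha) ** k
print(round(actual_significance, 2))      # -> 0.64, not 0.05

# Severity of inferring a genuine effect from the reported rejection
print(round(1 - actual_significance, 2))  # -> 0.36
```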

*II Understanding the “long-run error correcting” metaphor*

Discussions of Peircean ‘self-correction’ often confuse two interpretations of the ‘long-run’ error correcting metaphor, even in the case of quantitative induction: *(a) Asymptotic self-correction (as n approaches ∞):* In this construal, it is imagined that one has a sample, say of size

It may help to consider a very informal example. Suppose that weight gain is measured by 10 well-calibrated and stable methods, possibly using several measuring instruments and the results show negligible change over a test period of interest. This may be regarded as grounds for inferring that the individual’s weight gain is negligible within limits set by the sensitivity of the scales. Why? While it is true that by averaging more and more weight measurements, i.e., an eleventh, twelfth, etc., one would get asymptotically close to the true weight, that is not the rationale for the particular inference. The rationale is rather that the error probabilistic properties of the weighing procedure (the probability of ten-fold weighings erroneously failing to show weight change) inform one of the correct weight in the case at hand, e.g., that a 0 observed weight increase passes the “no-weight gain” hypothesis with high severity.
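The error-probabilistic rationale can be made concrete with a toy computation (the numbers are my own assumptions, not from the paper): if each weighing has independent Gaussian error, the probability that the ten-fold average fails to register a genuine gain is readily calculated, and when that probability is tiny, an observed 0 change passes the “no weight gain” hypothesis with high severity.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

sigma, n = 0.2, 10            # assumed sd per weighing; number of weighings
se = sigma / math.sqrt(n)     # sd of the averaged measurement

# Probability the average shows no increase (mean <= 0) if the true
# gain were gamma = 0.5 -- i.e., the chance the procedure misses it.
gamma = 0.5
p_miss = normal_cdf((0 - gamma) / se)
print(p_miss)  # vanishingly small (~1e-15): the procedure would almost
               # surely have detected such a gain, so an observed 0
               # change passes "no weight gain" with high severity
```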

**7. Induction corrects its premises**

Justifying the severity, and accordingly, the error-correcting capacity, of tests depends upon being able to justify sufficiently test assumptions, whether in the quantitative or qualitative realms. In the former, a typical assumption would be that the data set constitutes a random sample from the appropriate population; in the latter, assumptions would include such things as “my instrument (e.g., scale) is working”. The problem of justifying methods is often taken to stymie attempts to justify inductive methods. Self-correcting, or error-correcting, enters here too, and precisely in the way that Peirce recognized. This leads me to consider something apparently overlooked by his critics; namely, Peirce’s insistence that induction “not only corrects its conclusions, *it even corrects its premises*” (3.575).

Induction corrects its premises by checking, correcting, or validating its own assumptions. One way that induction corrects its premises is by correcting and improving upon the accuracy of its data. This idea is at the heart of what allows induction—understood as severe testing—to be genuinely ampliative: to come out with more than is put in. Peirce comes to his philosophical stances from his experiences with astronomical observations.

Every astronomer, however, is familiar with the fact that the catalogue place of a fundamental star, which is the result of elaborate reasoning, is far more accurate than any of the observations from which it was deduced. (5.575)

His day-to-day use of the method of least squares made it apparent to him how knowledge of errors of observation can be used to infer an accurate observation from highly shaky data.

It is commonly assumed that empirical claims are only as reliable as the data involved in their inference; thus it is assumed, with Popper, that “should we try to establish anything with our tests, we should be involved in an infinite regress” (Popper 1962, p. 388). Peirce explicitly rejects this kind of “tower image” and argues that we can often arrive at rather accurate claims from far less accurate ones. For instance, with a little data massaging, e.g., averaging, we can obtain a value of a quantity of interest that is far more accurate than the individual measurements.
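A minimal simulation (with made-up numbers) illustrates the point: the averaged value is roughly √n times more accurate than any single reading it is built from.

```python
import random
import statistics

random.seed(1)
true_value = 10.0
sigma = 1.0                       # sd of one raw observation (assumed)

readings = [random.gauss(true_value, sigma) for _ in range(100)]
estimate = statistics.mean(readings)

print(abs(readings[0] - true_value))  # error of a single observation
print(abs(estimate - true_value))     # error of the averaged value
# Theoretically the mean has sd sigma / sqrt(100) = 0.1: the inferred
# value is about ten times more accurate than the data it rests on.
```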

*Qualitative Error Correction*

Peirce applies the same strategy from astronomy to a qualitative example:

That Induction tends to correct itself, is obvious enough. When a man undertakes to construct a table of mortality upon the basis of the Census, he is engaged in an inductive inquiry. And lo, the very first thing that he will discover from the figures … is that those figures are very seriously vitiated by their falsity. (5.576)

How is it discovered that there are systematic errors in the age reports? By noticing that the number of men reporting their age as 21 far exceeds the number reporting 20 (while in all other cases ages are much more likely to be expressed in round numbers). Induction, as Peirce understands it, helps to uncover this subject bias—that those under 21 tend to put down that they are 21. It does so by means of formal models of age distributions along with informal, background knowledge of the root causes of such bias. “The young find it to their advantage to be thought older than they are, and the old to be thought younger than they are” (5.576). Moreover, statistical considerations often allow correcting for bias, i.e., by estimating the number of “21” reports that are likely to be attributable to 20 year olds. As with the star catalogue, the data thus corrected are more accurate than the original data report.
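Here is a small simulation of the check Peirce describes (the population and misreporting rate are my own stand-ins, not census figures): comparing each age’s count against a background model of the age distribution exposes the excess at 21 and the matching deficit at 20, and the excess itself estimates how many reports to reassign.

```python
import random
from collections import Counter

random.seed(0)
true_ages = [random.randint(18, 23) for _ in range(60000)]  # toy census

def reported(age, p_misreport=0.3):
    """A 20-year-old claims to be 21 with probability p_misreport."""
    if age == 20 and random.random() < p_misreport:
        return 21
    return age

counts = Counter(reported(a) for a in true_ages)

# Background model: ages roughly uniform, so each age should appear
# about len(true_ages)/6 times; the excess at 21 estimates the number
# of misreports, which can be reassigned to 20 to correct the data.
expected = len(true_ages) / 6
excess_at_21 = counts[21] - expected
deficit_at_20 = expected - counts[20]
print(counts[20], counts[21])
print(round(excess_at_21), round(deficit_at_20))
```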

By means of an informal tool kit of key errors and their causes, coupled with formal or systematic tools to model them, experimental inquiry checks and corrects its own assumptions for the purpose of carrying out some other inquiry. As I have been urging for Peircean self-correction generally, satisfying the SCT is not a matter of saying with enough data we will get better and better estimates of the star positions or the distribution of ages in a population; it is a matter of being able to employ methods in a given inquiry to detect and correct mistakes in that inquiry, or that data set. To get such methods off the ground there is no need to build a careful tower where inferences are piled up, each depending on what went on before: Properly exploited, inaccurate observations can give way to far more accurate data. By building up a “repertoire” of errors and means to check, avoid, or correct them, scientific induction is self-correcting.

*Induction Fares Better Than Deduction at Correcting its Errors*

Consider how this reading of Peirce makes sense of his holding inductive science as better at self-correcting than deductive science.

Deductive inquiry … has its errors; and it corrects them, too. But it is by no means so sure, or at least so swift to do this as is Inductive science. (5.577)

An example he gives is that the error in Euclid’s elements was undiscovered until non-Euclidean geometry was developed. Or again, “It is evident that when we run a column of figures down as well as up, as a check” or look out for possible flaws in a demonstration, “we are acting precisely as when in an induction we enlarge our sample for the sake of the self-correcting effect of induction” (5.580). In both cases we are appealing to various methods we have devised because we find they increase our ability to correct our mistakes, and thus increase the error probing power of our reasoning. What is distinctive about the methodology of inductive testing is that it deliberately directs itself to devising tools for reliable error probes. This is not so for mathematics. Granted, “once an error is suspected, the whole world is speedily in accord about it” (5.577) in deductive reasoning. But, for the most part mathematics does not itself supply tools for uncovering flaws.

So it appears that this marvelous self-correcting property of Reason … belongs to every sort of science, although it appears as essential, intrinsic and inevitable only in the highest type of reasoning, which is induction. (5.579)

In one’s inductive or experimental tool kit, one finds explicit models and methods whose single purpose is the business of detecting patterns of irregularity, checking assumptions, assessing departures from canonical models, and so on. If an experimental test is unable to do this—if it is unable to mount severe tests—then it fails to count as scientific induction.


[You can find a pdf version of this paper here.]

**REFERENCES and Notes (see part 1 here)**

Filed under: Bayesian/frequentist, C.S. Peirce, Error Statistics, Statistics

Yesterday was C.S. Peirce’s birthday. He’s one of my all-time heroes. You should read him: he’s a treasure chest on essentially any topic. I only recently discovered a passage where Popper calls Peirce one of the greatest philosophical thinkers ever (I don’t have it handy). If Popper had taken a few more pages from Peirce, he would have seen how to solve many of the problems in his work on scientific inference, probability, and severe testing. I’ll blog the main sections of a (2005) paper of mine over the next few days. It’s written for a very general philosophical audience; the statistical parts are pretty informal. I first posted it in 2013. *Happy **(slightly belated)** Birthday, Peirce*.

**Peircean Induction and the Error-Correcting Thesis**

Deborah G. Mayo

*Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy*, Volume 41, Number 2, 2005, pp. 299-319

Peirce’s philosophy of inductive inference in science is based on the idea that what permits us to make progress in science, what allows our knowledge to grow, is the fact that science uses methods that are self-correcting or error-correcting:

Induction is the experimental testing of a theory. The justification of it is that, although the conclusion at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error. (5.145)

Inductive methods—understood as methods of experimental testing—are justified to the extent that they are error-correcting methods. We may call this Peirce’s error-correcting or self-correcting thesis (SCT):

**Self-Correcting Thesis SCT:** methods for inductive inference in science are error correcting; the justification for inductive methods of experimental testing in science is that they are self-correcting.

Peirce’s SCT has been a source of fascination and frustration. By and large, critics and followers alike have denied that Peirce can sustain his SCT as a way to justify scientific induction: “No part of Peirce’s philosophy of science has been more severely criticized, even by his most sympathetic commentators, than this attempted validation of inductive methodology on the basis of its purported self-correctiveness” (Rescher 1978, p. 20).

In this paper I shall revisit the Peircean SCT: properly interpreted, I will argue, Peirce’s SCT not only serves its intended purpose, it also provides the basis for justifying (frequentist) statistical methods in science. While on the one hand, contemporary statistical methods increase the mathematical rigor and generality of Peirce’s SCT, on the other, Peirce provides something current statistical methodology lacks: an account of inductive inference and a philosophy of experiment that links the justification for statistical tests to a more general rationale for scientific induction. Combining the mathematical contributions of modern statistics with the inductive philosophy of Peirce sets the stage for developing an adequate justification for contemporary inductive statistical methodology.

**2. Probabilities are assigned to procedures, not hypotheses**

Peirce’s philosophy of experimental testing shares a number of key features with the contemporary (Neyman and Pearson) statistical theory: statistical methods provide, not means for assigning degrees of probability, evidential support, or confirmation to hypotheses, but procedures for testing (and estimation) whose rationale is their predesignated high frequencies of leading to correct results in some hypothetical long run. A Neyman and Pearson (NP) statistical test, for example, instructs us “To decide whether a hypothesis, *H*, of a given type be rejected or not, calculate a specified character, *x*₀, of the observed facts; if

The relative frequencies of erroneous rejections and erroneous acceptances in an actual or hypothetical long-run sequence of applications of tests are error probabilities; we may call the statistical tools based on error probabilities *error statistical* tools. In describing his theory of inference, Peirce could be describing that of the error statistician:

The theory here proposed does not assign any probability to the inductive or hypothetic conclusion, in the sense of undertaking to say how frequently that conclusion would be found true. It does not propose to look through all the possible universes, and say in what proportion of them a certain uniformity occurs; such a proceeding, were it possible, would be quite idle. The theory here presented only says how frequently, in this universe, the special form of induction or hypothesis would lead us right. The probability given by this theory is in every way different—in meaning, numerical value, and form—from that of those who would apply to ampliative inference the doctrine of inverse chances. (2.748)

The doctrine of “inverse chances” alludes to assigning (posterior) probabilities to hypotheses by applying the definition of conditional probability (Bayes’s theorem)—a computation that requires starting out with a (prior or “antecedent”) probability assignment to an exhaustive set of hypotheses:

If these antecedent probabilities were solid statistical facts, like those upon which the insurance business rests, the ordinary precepts and practice [of inverse probability] would be sound. But they are not and cannot be statistical facts. What is the antecedent probability that matter should be composed of atoms? Can we take statistics of a multitude of different universes? (2.777)

For Peircean induction, as in the N-P testing model, the conclusion or inference concerns a hypothesis that either is or is not true in this one universe; thus, assigning a frequentist probability to a particular conclusion, other than the trivial ones of 1 or 0, for Peirce, makes sense only “if universes were as plentiful as blackberries” (2.684). Hence the Bayesian inverse probability calculation seems forced to rely on subjective probabilities for computing inverse inferences, but “subjective probabilities”, Peirce charges, “express nothing but the conformity of a new suggestion to our prepossessions, and these are the source of most of the errors into which man falls, and of all the worst of them” (2.777).

Hearing Peirce contrast his view of induction with the more popular Bayesian account of his day (the Conceptualists), one could be listening to an error statistician arguing against the contemporary Bayesian (subjective or other)—with one important difference. Today’s error statistician seems to grant too readily that the only justification for N-P test rules is their ability to ensure we will rarely take erroneous actions with respect to hypotheses in the long run of applications. This so-called inductive behavior rationale seems to supply no adequate answer to the question of what is learned in any particular application about the process underlying the data. Peirce, by contrast, was very clear that what is really wanted in inductive inference in science is the ability to control error probabilities of test procedures, i.e., “the trustworthiness of the proceeding”. Moreover, it is only by a faulty analogy with deductive inference, Peirce explains, that many suppose that inductive (synthetic) inference should supply a probability to the conclusion: “… in the case of analytic inference we know the probability of our conclusion (if the premises are true), but in the case of synthetic inferences we only know the degree of trustworthiness of our proceeding” (“The Probability of Induction”, 2.693).

Knowing the “trustworthiness of our inductive proceeding”, I will argue, enables determining the test’s probative capacity, how reliably it detects errors, and the severity of the test a hypothesis withstands. By deliberately taking account of known flaws and fallacies in reasoning with limited and uncertain data, tests may be constructed that are highly trustworthy probes for detecting and discriminating errors in particular cases. This, in turn, enables inferring which inferences about the process giving rise to the data are and are not warranted: an inductive inference to hypothesis *H* is warranted to the extent that with high probability the test would have detected a specific flaw or departure from what *H* asserts, and yet it did not.

**3. So why is justifying Peirce’s SCT thought to be so problematic?**

You can read Section 3 here. (It’s not necessary for understanding the rest.)

**4. Peircean induction as severe testing**

… [I]nduction, for Peirce, is a matter of subjecting hypotheses to “the test of experiment” (7.182).

The process of testing it will consist, not in examining the facts, in order to see how well they accord with the hypothesis, but on the contrary in examining such of the probable consequences of the hypothesis … which would be very unlikely or surprising in case the hypothesis were not true. (7.231)

When, however, we find that prediction after prediction, notwithstanding a preference for putting the most unlikely ones to the test, is verified by experiment,…we begin to accord to the hypothesis a standing among scientific results.

This sort of inference it is, from experiments testing predictions based on a hypothesis, that is alone properly entitled to be called induction. (7.206)

While these and other passages are redolent of Popper, Peirce differs from Popper in crucial ways. Peirce, unlike Popper, is primarily interested not in falsifying claims but in the positive pieces of information provided by tests, with “the corrections called for by the experiment” and with the hypotheses, modified or not, that manage to pass severe tests. For Popper, even if a hypothesis is highly *corroborated (by his lights)*, he regards this as at most a report of the hypothesis’ past performance and denies it affords positive evidence for its correctness or reliability. Further, Popper denies that he could vouch for the reliability of the method he recommends as “most rational”—conjecture and refutation. Indeed, Popper’s requirements for a highly corroborated hypothesis are not sufficient for ensuring severity in Peirce’s sense (Mayo 1996, 2003, 2005). Where Popper recoils from even speaking of warranted inductions, Peirce conceives of a proper inductive inference as what had passed a severe test—one which would, with high probability, have detected an error if present.

In Peirce’s inductive philosophy, we have evidence for inductively inferring a claim or hypothesis *H* when not only does *H* “accord with” the data **x**, but also, so good an accordance would very probably not have resulted, were *H* false. In other words:

*Hypothesis H passes a severe test with* **x** iff (first) **x** accords with *H*, and (second) with very high probability, the test would “have signaled an error” by having produced results less accordant with *H* than what the test yielded, were *H* false. Thus, we may inductively infer *H* when (and only when) *H* has withstood a test with high error-detecting capacity: the higher this probative capacity, the more severely *H* has passed. What is assessed (quantitatively or qualitatively) is not the amount of support for *H* but the probative capacity of the test of experiment ET (with regard to those errors that an inference to *H* is declaring to be absent).
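The severity requirement can be sketched numerically under a hypothetical normal model (an assumption of this example, not of the text): the severity with which “μ > 0” passes, given an observed mean, is the probability that the test would have produced a result less accordant with that claim were it false (μ = 0):

```python
from math import sqrt
from statistics import NormalDist

sigma, n = 1.0, 25  # hypothetical normal model, H0: mu = 0 vs H: mu > 0

def severity_mu_positive(xbar: float) -> float:
    """Prob(a result less accordant with 'mu > 0' than xbar; mu = 0):
    how probably the test would 'have signaled an error' were H false."""
    return NormalDist(0, sigma / sqrt(n)).cdf(xbar)

hi = severity_mu_positive(0.6)  # a large observed mean passes with high severity
lo = severity_mu_positive(0.1)  # a small one passes with low severity
```

The same observed accordance with “μ > 0” thus counts for more or less depending on the test’s capacity to have detected its falsity.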

You can read the rest of Section 4 here.

**5. The path from qualitative to quantitative induction**

In my understanding of Peircean induction, the difference between qualitative and quantitative induction is really a matter of degree, according to whether the trustworthiness or severity of the test is quantitatively or only qualitatively ascertainable. This reading not only neatly organizes Peirce’s typologies of the various types of induction, it underwrites the manner in which, within a given classification, Peirce further subdivides inductions by their “strength”.

*(I) First-Order, Rudimentary or Crude Induction*

Consider Peirce’s First Order of induction: the lowest, most rudimentary form, which he dubs the “pooh-pooh argument”. It is essentially an argument from ignorance: Lacking evidence for the falsity of some hypothesis or claim *H*, provisionally adopt *H*. In this very weakest sort of induction, crude induction, the most that can be said is that a hypothesis would eventually be falsified if false. (It may correct itself—but with a bang!) It “is as weak an inference as any that I would not positively condemn” (8.237). While uneliminable in ordinary life, Peirce denies that rudimentary induction is to be included as scientific induction. Without some reason to think evidence of *H*’s falsity would probably have been detected, were *H* false, finding no evidence against *H* is poor inductive evidence *for* *H*. *H* has passed only a highly unreliable error probe.

*(II) Second Order (Qualitative) Induction*

It is only with what Peirce calls “the Second Order” of induction that we arrive at a genuine test, and thereby scientific induction. Within second order inductions, a stronger and a weaker type exist, corresponding neatly to viewing strength as the severity of a testing procedure.

The weaker of these is where the predictions that are fulfilled are merely of the continuance in future experience of the same phenomena which originally suggested and recommended the hypothesis… (7.116)

The other variety of the argument … is where [results] lead to new predictions being based upon the hypothesis of an entirely different kind from those originally contemplated and these new predictions are equally found to be verified. (7.117)

The weaker type occurs where the predictions, though fulfilled, lack novelty; whereas, the stronger type reflects a more stringent hurdle having been satisfied: the hypothesis has had “novel” predictive success, and thereby higher severity. (For a discussion of the relationship between types of novelty and severity see Mayo 1991, 1996). Note that within a second order induction the assessment of strength is qualitative, e.g., very strong, weak, very weak.

The strength of any argument of the Second Order depends upon how much the confirmation of the prediction runs counter to what our expectation would have been without the hypothesis. It is entirely a question of how much; and yet there is no measurable quantity. For when such measure is possible the argument … becomes an induction of the Third Order [statistical induction]. (7.115)

It is upon these and like passages that I base my reading of Peirce. A qualitative induction, i.e., a test whose severity is qualitatively determined, becomes a quantitative induction when the severity is quantitatively determined; when an objective error probability can be given.

*(III) Third Order, Statistical (Quantitative) Induction*

We enter the Third Order of statistical or quantitative induction when it is possible to quantify “how much” the prediction runs counter to what our expectation would have been without the hypothesis. In his discussions of such quantifications, Peirce anticipates to a striking degree later developments of statistical testing and confidence interval estimation (Hacking 1980, Mayo 1993, 1996). Since this is not the place to describe his statistical contributions, I move to more modern methods to make the qualitative-quantitative contrast.

**6. Quantitative and qualitative induction: significance test reasoning**

*Quantitative Severity*

A statistical significance test illustrates an inductive inference justified by a quantitative severity assessment. The significance test procedure has the following components: (1) a *null hypothesis* *H₀*, which is an assertion about the distribution of the sample; (2) a test statistic or distance measure *d*(**x**), reflecting how far the data depart from what *H₀* asserts; and (3) the *p*-value, the probability of so large a *d*(**x**), computed under the assumption that *H₀* is true. For example:

*H₀*: there are no increased cancer risks associated with hormone replacement therapy (HRT) in women who have taken it for 10 years.

Let *d*(**x**) measure the increased risk of cancer observed in a sample of women treated with HRT. The *p*-value is the probability of so large an observed increase under the null hypothesis:

*p*-value = Prob(*d*(**X**) ≥ *d*(**x**); *H₀* true).

If this probability is very small, the data are taken as evidence against *H₀* and in favor of the alternative:

*H**: cancer risks are higher in women treated with HRT.

The reasoning is a statistical version of *modus tollens*:

If the hypothesis *H₀* is correct then, with high probability, 1 − *p*, the data would not be statistically significant at level *p*.

**x** is statistically significant at level *p*.

Therefore, **x** is evidence of a discrepancy from *H₀*, in the direction of *H**.

(i.e., *H** severely passes, where the severity is 1 minus the *p*-value) [iii]

For example, the results of recent, large, randomized treatment-control studies showing statistically significant increased risks (at the 0.001 level) give strong evidence that HRT, taken for over 5 years, increases the chance of breast cancer, the severity being 0.999. If a particular conclusion is wrong, subsequent severe (or highly powerful) tests will with high probability detect it. In particular, if we are wrong to reject *H₀* (and *H₀* is in fact true), we would find we were rarely able to get so statistically significant a result to recur, and in this way we would discover the original error.
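The HRT reasoning can be sketched with a two-proportion significance test. The counts below are invented purely for illustration; they are not from the studies the text mentions:

```python
from math import sqrt
from statistics import NormalDist

# Invented counts (illustration only): cancers among treated vs control women.
x_t, n_t = 120, 8000   # HRT-treated group
x_c, n_c = 80, 8000    # control group
p_t, p_c = x_t / n_t, x_c / n_c

# d(x): standardized difference in observed risks (pooled, one-sided test);
# p-value = Prob(d(X) >= d(x); H0: no increased risk).
p_pool = (x_t + x_c) / (n_t + n_c)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
d_x = (p_t - p_c) / se
p_value = 1 - NormalDist().cdf(d_x)
severity = 1 - p_value  # severity with which H* passes, as in the text
```

With these invented counts the observed difference is statistically significant, so *H** passes with severity close to 1.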

It is true that the observed conformity of the facts to the requirements of the hypothesis may have been fortuitous. But if so, we have only to persist in this same method of research and we shall gradually be brought around to the truth. (7.115)

The correction is not a matter of getting higher and higher probabilities, it is a matter of finding out whether the agreement is fortuitous; whether it is generated about as often as would be expected were the agreement of the chance variety.

[I will post Part 2 tomorrow, then part 3; you can find the rest of section 6 here.]

**REFERENCES:**

Hacking, I. 1980 “The Theory of Probable Inference: Neyman, Peirce and Braithwaite”, pp. 141-160 in D. H. Mellor (ed.), *Science, Belief and Behavior: Essays in Honour of R.B. Braithwaite*. Cambridge: Cambridge University Press.

Laudan, L. 1981 *Science and Hypothesis: Historical Essays on Scientific Methodology*. Dordrecht: D. Reidel.

Levi, I. 1980 “Induction as Self Correcting According to Peirce”, pp. 127-140 in D. H. Mellor (ed.), *Science, Belief and Behavior: Essays in Honor of R.B. Braithwaite*. Cambridge: Cambridge University Press.

Mayo, D. 1991 “Novel Evidence and Severe Tests”, *Philosophy of Science*, 58: 523-552.

———- 1993 “The Test of Experiment: C. S. Peirce and E. S. Pearson”, pp. 161-174 in E. C. Moore (ed.), *Charles S. Peirce and the Philosophy of Science*. Tuscaloosa: University of Alabama Press.

——— 1996 *Error and the Growth of Experimental Knowledge*, The University of Chicago Press, Chicago.

———- 2003 “Severe Testing as a Guide for Inductive Learning”, in H. Kyburg (ed.), *Probability Is the Very Guide of Life*. Chicago: Open Court Press, pp. 89-117.

———- 2005 “Evidence as Passing Severe Tests: Highly Probed vs. Highly Proved” in P. Achinstein (ed.), *Scientific Evidence*, Johns Hopkins University Press.

Mayo, D. and Kruse, M. 2001 “Principles of Inference and Their Consequences,” pp. 381-403 in *Foundations of Bayesianism*, D. Corfield and J. Williamson (eds.), Dordrecht: Kluwer Academic Publishers.

Mayo, D. and Spanos, A. 2004 “Methodology in Practice: Statistical Misspecification Testing” *Philosophy of Science*, Vol. II, PSA 2002, pp. 1007-1025.

———- 2006 “Severe Testing as a Basic Concept in a Neyman-Pearson Theory of Induction”, *The British Journal for the Philosophy of Science* 57: 323-357.

Mayo, D. and Cox, D.R. 2006 “The Theory of Statistics as the ‘Frequentist’s’ Theory of Inductive Inference”, *Institute of Mathematical Statistics (IMS) Lecture Notes-Monograph Series, Contributions to the Second Lehmann Symposium*, *2005*.

Neyman, J. and Pearson, E.S. 1933 “On the Problem of the Most Efficient Tests of Statistical Hypotheses”, in *Philosophical Transactions of the Royal Society*, A: 231, 289-337, as reprinted in J. Neyman and E.S. Pearson (1967), pp. 140-185.

———- 1967 *Joint Statistical Papers*, Berkeley: University of California Press.

Niiniluoto, I. 1984 *Is Science Progressive?* Dordrecht: D. Reidel.

Peirce, C. S. *Collected Papers: Vols. I-VI*, C. Hartshorne and P. Weiss (eds.) (1931-1935). Vols. VII-VIII, A. Burks (ed.) (1958), Cambridge: Harvard University Press.

Popper, K. 1962 *Conjectures and Refutations: the Growth of Scientific Knowledge*, Basic Books, New York.

Rescher, N. 1978 *Peirce’s Philosophy of Science: Critical Studies in His Theory of Induction and Scientific Method*, Notre Dame: University of Notre Dame Press.

[i] Others who relate Peircean induction and Neyman-Pearson tests are Isaac Levi (1980) and Ian Hacking (1980). See also Mayo 1993 and 1996.

[ii] This statement of (b) is regarded by Laudan as the strong thesis of self-correcting. A weaker thesis would replace (b) with (b’): science has techniques for determining unambiguously whether an alternative *T’* is closer to the truth than a refuted *T*.

[iii] If the *p*-value were not very small, then the difference would be considered statistically insignificant (generally small values are 0.1 or less). We would then regard *H₀* as consistent with data **x**, and we could reason as follows:

If there were a discrepancy from hypothesis *H₀* of size γ (or greater), then, with high probability, the data would be statistically significant at level *p*.

**x** is not statistically significant at level *p*.

Therefore, **x** is evidence that any discrepancy from *H₀* is less than γ.

For a general treatment of effect size, see Mayo and Spanos (2006).
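The note’s reasoning can be sketched under the same kind of hypothetical normal test used above (H₀: μ = 0, α = 0.05, an assumption of this example): ask how probable a statistically significant result would have been were the discrepancy as large as γ; when that probability is high, an insignificant result is good evidence that the discrepancy is less than γ:

```python
from math import sqrt
from statistics import NormalDist

sigma, n, alpha = 1.0, 25, 0.05  # hypothetical normal test of H0: mu = 0
cutoff = NormalDist().inv_cdf(1 - alpha) * sigma / sqrt(n)

def prob_significant(gamma: float) -> float:
    """Prob(test yields significance; true discrepancy from H0 is gamma)."""
    return 1 - NormalDist(gamma, sigma / sqrt(n)).cdf(cutoff)

# An insignificant result rules out gamma = 1.0 with high severity,
# but says little about a discrepancy as small as gamma = 0.2:
large_discrepancy = prob_significant(1.0)   # close to 1
small_discrepancy = prob_significant(0.2)   # well below 1
```

An insignificant result thus licenses “discrepancy < γ” only for those γ the test had high capacity to detect.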

[Ed. Note: A not-bad biographical sketch can be found on Wikipedia.]

Filed under: Bayesian/frequentist, C.S. Peirce, Error Statistics, Statistics

**Error Statistics Philosophy: Blog Contents (4 years)**

*Dear Reader: It’s hard to believe I’ve been blogging for 4 whole years (as of Sept. 3, 2015)! A big celebration is taking place at the Elbar Room as I type this. (Remember the 1 year anniversary here? Remember that hideous blogspot? Oy!) Please peruse the offerings below, and take advantage of some of the super contributions and discussions by readers! I don’t know how much longer I’ll continue blogging; in the past 6 months I’ve mostly been focusing on completing my book, “How to Tell What’s True About Statistical Inference.” I plan to experiment with some new ideas and novel pursuits in the coming months. Stay tuned, and thanks for reading! Best Wishes, D. Mayo*

**September 2011**

- (9/3) Frequentists in Exile: The Purpose of this Blog
- (9/3) Overheard at the comedy hour at the Bayesian retreat
- (9/4) Drilling Rule #1
- (9/9) Kuru
- (9/13) In Exile, Clinging to Old Ideas?
- (9/15) SF conferences & E. Lehmann
- (9/16) Getting It Right But for the Wrong Reason
- (9/20) A Highly Anomalous Event
- (9/23) LUCKY 13 (Critcisms)
- (9/26) Whipping Boys and Witch Hunters
- (9/29) Part 1: Imaginary scientist at an imaginary company, Prionvac, and an imaginary reformer

**October 2011**

- (10/3) Part 2 Prionvac: The Will to Understand Power
- (10/4) Part 3 Prionvac: How the Reformers Should Have done Their Job
- (10/5) Formaldehyde Hearing: How to Tell the Truth With Statistically Insignificant Results
- (10/7) Blogging the (Strong) Likelihood Principle
- (10/10) RMM-1: Special Volume on Stat Sci Meets Phil Sci
- (10/10) Objectivity 1: Will the Real Junk Science Please Stand Up?
- (10/13) Objectivity #2: The “Dirty Hands” Argument for Ethics in Evidence
- (10/14) King Tut Includes ErrorStatistics in Top 50 Statblogs!
- (10/16) Objectivity #3: Clean(er) Hands With Metastatistcs
- (10/19) RMM-2: “A Conversation Between Sir David Cox & D.G. Mayo
- (10/20) Blogging the Likelihood Principle #2: Solitary Fishing: SLP Violations
- (10/22) The Will to Understand Power: Neyman’s Nursery (NN1)
- (10/28) RMM-3: Special Volume on Stat Scie Meets Phl Sci (Hendry)
- (10/30) Background Knowledge: Not to Quantify, but to Avoid Being Misled by, Subjective Beliefs
- (10/31) Oxford Gaol: Statistical Bogeymen

**November 2011**

- (11/1) RMM-4: Special Volume on Stat Scie Meets Phil Sci (Spanos)
- (11/3) Who is Really Doing the Work?*
- (11/5) Skeleton Key and Skeletal Points for (Esteemed) Ghost Guest
- (11/9) Neyman’s Nursery 2: Power and Severity [Continuation of Oct. 22 Post]
- (11/12) Neyman’s Nursery (NN) 3: SHPOWER vs POWER
- (11/15) Logic Takes a Bit of a Hit!: (NN 4) Continuing: Shpower (“observed” power) vs Power
- (11/18) Neyman’s Nursery (NN5): Final Post
- (11/21) RMM-5: Special Volume on Stat Scie Meets Phil Sci (Wasserman)
- (11/23) Elbar Grease: Return to the Comedy Hour at the Bayesian Retreat
- (11/28) The UN Charter: double-counting and data snooping
- (11/29) If you try sometime, you find you get what you need!

**December 2011**

- (12/2) Getting Credit (or blame) for Something You Don’t Deserve (and first honorable mention)
- (12/6) Putting the Brakes on the Breakthrough Part 1*
- (12/7) Part II: Breaking Through the Breakthrough*
- (12/11) Irony and Bad Faith: Deconstructing Bayesians
- (12/19) Deconstructing and Deep-Drilling* 2
- (12/22) The 3 stages of the acceptance of novel truths
- (12/25) Little Bit of Blog Log-ic
- (12/26) Contributed Deconstructions: Irony & Bad Faith 3
- (12/29) JIM BERGER ON JIM BERGER!
- (12/31) Midnight With Birnbaum

**January 2012**

- (1/3) Model Validation and the LLP-(Long Playing Vinyl Record)
- (1/8) Don’t Birnbaumize that Experiment my Friend*
- (1/10) Bad-Faith Assertions of Conflicts of Interest?*
- (1/13) U-PHIL: “So you want to do a philosophical analysis?”
- (1/14) “You May Believe You are a Bayesian But You Are Probably Wrong” (Extract from Senn RMM article)
- (1/15) Mayo Philosophizes on Stephen Senn: “How Can We Culivate Senn’s-Ability?”
- (1/17) “Philosophy of Statistics”: Nelder on Lindley
- (1/19) RMM-6 Special Volume on Stat Sci Meets Phil Sci (Sprenger)
- (1/22) U-Phil: Stephen Senn (1): C. Robert, A. Jaffe, and Mayo (brief remarks)
- (1/23) U-Phil: Stephen Senn (2): Andrew Gelman
- (1/24) U-Phil (3): Stephen Senn on Stephen Senn!
- (1/26) Updating & Downdating: One of the Pieces to Pick up
- (1/29) No-Pain Philosophy: Skepticism, Rationality, Popper, and All That: First of 3 Parts

**February 2012**

- (2/3) Senn Again (Gelman)
- (2/7) When Can Risk-Factor Epidemiology Provide Reliable Tests?
- (2/8) Guest Blogger: Interstitial Doubts About the Matrixx (Schachtman)
- (2/8) Distortions in the Court? (PhilStat/PhilStock) **Cobb on Zilizk & McCloskey
- (2/11) R.A. Fisher: Statistical Methods and Scientific Inference
- (2/11) JERZY NEYMAN: Note on an Article by Sir Ronald Fisher
- (2/12) E.S. Pearson: Statistical Concepts in Their Relation to Reality
- (2/12) Guest Blogger. STEPHEN SENN: Fisher’s alternative to the alternative
- (2/15) Guest Blogger. Aris Spanos: The Enduring Legacy of R.A. Fisher
- (2/17) Two New Properties of Mathematical Likelihood
- (2/20) Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”? (Rejected Post Feb 20)
- (2/22) Intro to Misspecification Testing: Ordering From A Full Diagnostic Menu (part 1)
- (2/23) Misspecification Testing: (part 2) A Fallacy of Error “Fixing”
- (2/27) Misspecification Testing: (part 3) Subtracting-out effects “on paper”
- (2/28) Misspecification Tests: (part 4) and brief concluding remarks

**March 2012**

- (3/2) MetaBlog: March 2, 2012
- (3/3) Statistical Science Court?
- (3/6) Mayo, Senn, and Wasserman on Gelman’s RMM** Contribution
- (3/8) Lifting a piece from Spanos’ contribution* will usefully add to the mix
- (3/10) U-PHIL: A Further Comment on Gelman by Christian Henning (UCL, Statistics)
- (3/11) Blogologue*
- (3/11) RMM-7: Commentary and Response on Senn published: Special Volume on Stat Scie Meets Phil Sci
- (3/14) Objectivity (#4) and the “Argument From Discretion”
- (3/18) Objectivity (#5): Three Reactions to the Challenge of Objectivity (in inference)
- (3/22) Generic Drugs Resistant to Lawsuits
- (3/25) The New York Times Goes to War Against Generic Drug Manufacturers: Schactman
- (3/26) Announcement: Philosophy of Scientific Experiment Conference
- (3/28) Comment on the Barnard and Copas (2002) Empirical Example: Aris Spanos

**April 2012**

- (4/1) Philosophy of Statistics: Retraction Watch, Vol. 1, No. 1
- (4/3) History and Philosophy of Evidence-Based Health Care
- (4/4) Fallacy of Rejection and the Fallacy of Nouvelle Cuisine
- (4/6) Going Where the Data Take Us
- (4/9) N. Schachtman: Judge Posner’s Digression on Regression
- (4/10) Call for papers: Philosepi?
- (4/12) That Promissory Note From Lehmann’s Letter; Schmidt to Speak
- (4/15) U-Phil: Deconstructing Dynamic Dutch-Books?
- (4/16) A. Spanos: Jerzy Neyman and his Enduring Legacy
- (4/17) Earlier U-Phils and Deconstructions
- (4/18) Jean Miller: Happy Sweet 16 to EGEK! (Shalizi Review: “We have Ways of Making You Talk”)
- (4/21) Jean Miller: Happy Sweet 16 to EGEK #2 (Hasok Chang Review of EGEK)
- (4/23) U-Phil: Jon Williamson: Deconstructing DynamicDutch Books
- (4/25) Matching Numbers Across Philosophies
- (4/28) Comedy Hour at the Bayesian Retreat: P-values versus Posteriors

**May 2012**

- (5/1) Stephen Senn: A Paradox of Prior Probabilities
- (5/5) Comedy Hour at the Bayesian (Epistemology) Retreat: Highly Probable vs Highly Probed
- (5/8) LSE Summer Seminar: Contemporary Problems in Philosophy of Statistics
- (5/10) Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence,”
- (5/12) Saturday Night Brainstorming & Task Forces: The TFSI on NHST

- (5/17) Do CIs Avoid Fallacies of Tests? Reforming the Reformers
- (5/20) Betting, Bookies and Bayes: Does it Not Matter?
- (5/23) Does the Bayesian Diet Call For Error-Statistical Supplements?
- (5/24) An Error-Statistical Philosophy of Evidence (PH500, LSE Seminar)
- (5/28) Painting-by-Number #1
- (5/31) Metablog: May 31, 2012

**June 2012**

- (6/2) Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)
- (6/6) Review of Error and Inference by C. Hennig
- (6/9) U-Phil: Is the Use of Power* Open to a Power Paradox?
- (6/12) CMU Workshop on Foundations for Ockham’s Razor
- (6/14) Answer to the Homework & a New Exercise
- (6/15) Scratch Work for a SEV Homework Problem
- (6/17) Repost (5/17/12): Do CIs Avoid Fallacies of Tests? Reforming the Reformers
- (6/17) G. Cumming Response: The New Statistics
- (6/19) The Error Statistical Philosophy and The Practice of Bayesian Statistics: Comments on Gelman and Shalizi
- (6/23) Promissory Note
- (6/26) Deviates, Sloths, and Exiles: Philosophical Remarks on the Ockham’s Razor Workshop*
- (6/29) Further Reflections on Simplicity: Mechanisms

**July 2012**

- (7/1) PhilStatLaw: “Let’s Require Health Claims to Be ‘Evidence Based’” (Schachtman)
- (7/2) More from the Foundations of Simplicity Workshop*
- (7/3) Elliott Sober Responds on Foundations of Simplicity
- (7/4) Comment on Falsification
- (7/6) Vladimir Cherkassky Responds on Foundations of Simplicity
- (7/8) Metablog: Up and Coming
- (7/9) Stephen Senn: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics
- (7/10) PhilStatLaw: Reference Manual on Scientific Evidence (3d ed) on Statistical Significance (Schachtman)
- (7/11) Is Particle Physics Bad Science?
- (7/12) Dennis Lindley’s “Philosophy of Statistics”
- (7/15) Deconstructing Larry Wasserman – it starts like this…
- (7/16) Peter Grünwald: Follow-up on Cherkassky’s Comments
- (7/19) New Kvetch Posted 7/18/12
- (7/21) “Always the last place you look!”
- (7/22) Clark Glymour: The Theory of Search Is the Economics of Discovery (part 1)
- (7/23) Clark Glymour: The Theory of Search Is the Economics of Discovery (part 2)
- (7/27) P-values as Frequentist Measures
- (7/28) U-PHIL: Deconstructing Larry Wasserman
- (7/31) What’s in a Name? (Gelman’s blog)

**August 2012**

- (8/2) Stephen Senn: Fooling the Patient: an Unethical Use of Placebo? (Phil/Stat/Med)
- (8/5) A “Bayesian Bear” rejoinder practically writes itself…
- (8/6) Bad news bears: Bayesian rejoinder
- (8/8) U-PHIL: Aris Spanos on Larry Wasserman
- (8/10) U-PHIL: Hennig and Gelman on Wasserman (2011)
- (8/11) E.S. Pearson Birthday
- (8/11) U-PHIL: Wasserman Replies to Spanos and Hennig
- (8/13) U-Phil: (concluding the deconstruction) Wasserman/Mayo
- (8/14) Good Scientist Badge of Approval?
- (8/16) E.S. Pearson’s Statistical Philosophy
- (8/18) A. Spanos: Egon Pearson’s Neglected Contributions to Statistics
- (8/20) Higgs Boson: Bayesian “Digest and Discussion”
- (8/22) Scalar or Technicolor? S. Weinberg, “Why the Higgs?”
- (8/27) Knowledge/evidence not captured by mathematical prob.
- (8/30) Frequentist Pursuit
- (8/31) Failing to Apply vs Violating the Likelihood Principle

**September 2012**

- (9/3) After dinner Bayesian comedy hour. …
- (9/6) Stephen Senn: The nuisance parameter nuisance
- (9/8) Metablog: One-Year Anniversary
- (9/8) Return to the comedy hour … (on significance tests)
- (9/12) U-Phil (9/25/12) How should “prior information” enter in statistical inference?
- (9/15) More on using background info
- (9/19) Barnard, background info/intentions
- (9/22) Statistics and ESP research (Diaconis)
- (9/25) Insevere tests and pseudoscience
- (9/26) Levels of Inquiry
- (9/29) Stephen Senn: On the (ir)relevance of stopping rules in meta-analysis
- (9/30) Letter from George (Barnard)

**October 2012**

- (10/02) PhilStatLaw: Infections in the court
- (10/05) Metablog: Rejected posts (blog within a blog)
- (10/05) Deconstructing Gelman, Part 1: “A Bayesian wants everybody else to be a non-Bayesian.”
- (10/07) Deconstructing Gelman, Part 2: Using prior information
- (10/09) Last part (3) of the deconstruction: beauty and background knowledge
- (10/12) U-Phils: Hennig and Aktunc on Gelman 2012
- (10/13) Mayo Responds to U-Phils on Background Information
- (10/15) New Kvetch: race-based academics in Fla
- (10/17) RMM-8: New Mayo paper: “StatSci and PhilSci: part 2 (Shallow vs Deep Explorations)”
- (10/18) Query
- (10/18) Mayo: (first 2 sections) “StatSci and PhilSci: part 2”
- (10/20) Mayo: (section 5) “StatSci and PhilSci: part 2”
- (10/21) Mayo: (section 6) “StatSci and PhilSci: part 2”
- (10/22) Mayo: (section 7) “StatSci and PhilSci: part 2”
- (10/24) Announcement: Ontology and Methodology (Virginia Tech)
- (10/25) New rejected post: phil faux
- (10/27) New rejected post: “Are you butter off now?”
- (10/29) Reblogging: Oxford Gaol: Statistical Bogeymen
- (10/29) Type 1 and 2 errors: Frankenstorm
- (10/30) Guest Post: Greg Gandenberger, “Evidential Meaning and Methods of Inference”
- (10/31) U-Phil: Blogging the Likelihood Principle: New Summary

**November 2012**

- (11/04) PhilStat: So you’re looking for a Ph.D. dissertation topic?
- (11/07) Seminars at the London School of Economics: Contemporary Problems in Philosophy of Statistics
- (11/10) Bad news bears: ‘Bayesian bear’ rejoinder – reblog
- (11/12) new rejected post: kvetch (and query)
- (11/14) continuing the comments. …
- (11/16) Philosophy of Science Association (PSA) 2012 Program
- (11/18) What is Bayesian/Frequentist Inference? (from the normal deviate)
- (11/18) New kvetch/PhilStock: Rapiscan Scam
- (11/19) Comments on Wasserman’s “what is Bayesian/frequentist inference?”
- (11/21) Irony and Bad Faith: Deconstructing Bayesians – reblog
- (11/23) Announcement: 28 November: My Seminar at the LSE (Contemporary PhilStat)
- (11/25) Likelihood Links [for 28 Nov. Seminar and Current U-Phil]
- (11/28) Blogging Birnbaum: on Statistical Methods in Scientific Inference
- (11/30) Error Statistics (brief overview)

**December 2012**

- (12/2) Normal Deviate’s blog on false discovery rates
- (12/2) Statistical Science meets Philosophy of Science
- (12/3) Mayo Commentary on Gelman & Robert’s paper
- (12/6) Announcement: U-Phil Extension: Blogging the Likelihood Principle
- (12/7) Nov. Palindrome Winner: Kepler
- (12/8) Don’t Birnbaumize that experiment my friend*–updated reblog
- (12/11) Announcement: Prof. Stephen Senn to lead LSE grad seminar: 12-12-12
- (12/11) Mayo on S. Senn: “How Can We Cultivate Senn’s-Ability?”
- (12/13) “Bad statistics”: crime or free speech?
- (12/14) PhilStat/Law (“Bad Statistics” Cont.)
- (12/17) PhilStat/Law/Stock: multiplicity and duplicity
- (12/19) PhilStat/Law/Stock: more on “bad statistics”: Schachtman
- (12/21) Rejected Post: Clinical Trial Statistics Doomed by Mayan Apocalypse?
- (12/22) Msc kvetch: unfair but lawful discrimination (vs the irresistibly attractive)
- (12/24) 13 well-worn criticisms of significance tests (and how to avoid them)
- (12/27) 3 msc kvetches on the blog bagel circuit
- (12/30) An established probability theory for hair comparison?“–is not — and never was”
- (12/31) Midnight with Birnbaum-reblog

**January 2013**

- (1/2) Severity as a ‘Metastatistical’ Assessment
- (1/4) Severity Calculator
- (1/6) Guest post: Bad Pharma? (S. Senn)
- (1/9) RCTs, skeptics, and evidence-based policy
- (1/10) James M. Buchanan
- (1/11) Aris Spanos: James M. Buchanan: a scholar, teacher and friend
- (1/12) Error Statistics Blog: Table of Contents
- (1/15) Ontology & Methodology: Second call for Abstracts, Papers
- (1/18) New Kvetch/PhilStock
- (1/19) Saturday Night Brainstorming and Task Forces: (2013) TFSI on NHST
- (1/22) New PhilStock
- (1/23) P-values as posterior odds?
- (1/26) Coming up: December U-Phil Contributions….
- (1/27) U-Phil: S. Fletcher & N.Jinn
- (1/30) U-Phil: J. A. Miller: Blogging the SLP

**February 2013**

- (2/2) U-Phil: Ton o’ Bricks
- (2/4) January Palindrome Winner
- (2/6) Mark Chang (now) gets it right about circularity
- (2/8) From Gelman’s blog: philosophy and the practice of Bayesian statistics
- (2/9) New kvetch: Filly Fury
- (2/10) U-PHIL: Gandenberger & Hennig: Blogging Birnbaum’s Proof
- (2/11) U-Phil: Mayo’s response to Hennig and Gandenberger
- (2/13) Statistics as a Counter to Heavyweights…who wrote this?
- (2/16) Fisher and Neyman after anger management?
- (2/17) R. A. Fisher: how an outsider revolutionized statistics
- (2/20) Fisher: from ‘Two New Properties of Mathematical Likelihood’
- (2/23) Stephen Senn: Also Smith and Jones
- (2/26) PhilStock: DO < $70
- (2/26) Statistically speaking…

**March 2013**

- (3/1) capitalizing on chance
- (3/4) Big Data or Pig Data?
- (3/7) Stephen Senn: Casting Stones
- (3/10) Blog Contents 2013 (Jan & Feb)
- (3/11) S. Stanley Young: Scientific Integrity and Transparency
- (3/13) Risk-Based Security: Knives and Axes
- (3/15) Normal Deviate: Double Misunderstandings About p-values
- (3/17) Update on Higgs data analysis: statistical flukes (1)
- (3/21) Telling the public why the Higgs particle matters
- (3/23) Is NASA suspending public education and outreach?
- (3/27) Higgs analysis and statistical flukes (part 2)
- (3/31) possible progress on the comedy hour circuit?

**April 2013**

- (4/1) Flawed Science and Stapel: Priming for a Backlash?
- (4/4) Guest Post. Kent Staley: On the Five Sigma Standard in Particle Physics
- (4/6) Who is allowed to cheat? I.J. Good and that after dinner comedy hour….
- (4/10) Statistical flukes (3): triggering the switch to throw out 99.99% of the data
- (4/11) O & M Conference (upcoming) and a bit more on triggering from a participant…..
- (4/14) Does statistics have an ontology? Does it need one? (draft 2)
- (4/19) Stephen Senn: When relevance is irrelevant
- (4/22) Majority say no to inflight cell phone use, knives, toy bats, bow and arrows, according to survey
- (4/23) PhilStock: Applectomy? (rejected post)
- (4/25) Blog Contents 2013 (March)
- (4/27) Getting Credit (or blame) for Something You Didn’t Do (BP oil spill, comedy hour)
- (4/29) What should philosophers of science do? (falsification, Higgs, statistics, Marilyn)

**May 2013**

- (5/3) Schedule for Ontology & Methodology, 2013
- (5/6) Professorships in Scandal?
- (5/9) If it’s called the “The High Quality Research Act,” then ….
- (5/13) ‘No-Shame’ Psychics Keep Their Predictions Vague: New Rejected post
- (5/14) “A sense of security regarding the future of statistical science…” Anon review of Error and Inference
- (5/18) Gandenberger on Ontology and Methodology (May 4) Conference: Virginia Tech
- (5/19) Mayo: Meanderings on the Onto-Methodology Conference
- (5/22) Mayo’s slides from the Onto-Meth conference
- (5/24) Gelman sides w/ Neyman over Fisher in relation to a famous blow-up
- (5/26) Schachtman: High, Higher, Highest Quality Research Act
- (5/27) A.Birnbaum: Statistical Methods in Scientific Inference
- (5/29) K. Staley: review of Error & Inference

**June 2013**

- (6/1) Winner of May Palindrome Contest
- (6/1) Some statistical dirty laundry
- (6/5) Do CIs Avoid Fallacies of Tests? Reforming the Reformers (Reblog 5/17/12):
- (6/6) PhilStock: Topsy-Turvy Game
- (6/6) Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)
- (6/8) Richard Gill: “Integrity or fraud… or just questionable research practices?”
- (6/11) Mayo: comment on the repressed memory research
- (6/14) P-values can’t be trusted except when used to argue that p-values can’t be trusted!
- (6/19) PhilStock: The Great Taper Caper
- (6/19) Stanley Young: better p-values through randomization in microarrays
- (6/22) What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Sat night potpourri
- (6/26) Why I am not a “dualist” in the sense of Sander Greenland
- (6/29) Palindrome “contest” contest
- (6/30) Blog Contents: mid-year

**July 2013**

- (7/3) Phil/Stat/Law: 50 Shades of gray between error and fraud
- (7/6) Bad news bears: ‘Bayesian bear’ rejoinder–reblog mashup
- (7/10) PhilStatLaw: Reference Manual on Scientific Evidence (3d ed) on Statistical Significance (Schachtman)
- (7/11) Is Particle Physics Bad Science? (memory lane)
- (7/13) Professor of Philosophy Resigns over Sexual Misconduct (rejected post)
- (7/14) Stephen Senn: Indefinite irrelevance
- (7/17) Phil/Stat/Law: What Bayesian prior should a jury have? (Schachtman)
- (7/19) Msc Kvetch: A question on the Martin-Zimmerman case we do not hear
- (7/20) Guest Post: Larry Laudan. Why Presuming Innocence is Not a Bayesian Prior
- (7/23) Background Knowledge: Not to Quantify, But To Avoid Being Misled By, Subjective Beliefs
- (7/26) New Version: On the Birnbaum argument for the SLP: Slides for JSM talk

**August 2013**

- (8/1) Blogging (flogging?) the SLP: Response to Reply- Xi’an Robert
- (8/5) At the JSM: 2013 International Year of Statistics
- (8/6) What did Nate Silver just say? Blogging the JSM
- (8/9) 11th bullet, multiple choice question, and last thoughts on the JSM
- (8/11) E.S. Pearson: “Ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot”
- (8/13) Blogging E.S. Pearson’s Statistical Philosophy
- (8/15) A. Spanos: Egon Pearson’s Neglected Contributions to Statistics
- (8/17) Gandenberger: How to Do Philosophy That Matters (guest post)
- (8/21) Blog contents: July, 2013
- (8/22) PhilStock: Flash Freeze
- (8/22) A critical look at “critical thinking”: deduction and induction
- (8/28) Is being lonely unnatural for slim particles? A statistical argument
- (8/31) Overheard at the comedy hour at the Bayesian retreat-2 years on

**September 2013**

- (9/2) Is Bayesian Inference a Religion?
- (9/3) Gelman’s response to my comment on Jaynes
- (9/5) Stephen Senn: Open Season (guest post)
- (9/7) First blog: “Did you hear the one about the frequentist…”? and “Frequentists in Exile”
- (9/10) Peircean Induction and the Error-Correcting Thesis (Part I)
- (9/10) (Part 2) Peircean Induction and the Error-Correcting Thesis
- (9/12) (Part 3) Peircean Induction and the Error-Correcting Thesis
- (9/14) “When Bayesian Inference Shatters” Owhadi, Scovel, and Sullivan (guest post)
- (9/18) PhilStock: Bad news is good news on Wall St.
- (9/18) How to hire a fraudster chauffeur
- (9/22) Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”
- (9/23) Barnard’s Birthday: background, likelihood principle, intentions
- (9/24) Gelman est effectivement une erreur statistician
- (9/26) Blog Contents: August 2013
- (9/29) Highly probable vs highly probed: Bayesian/ error statistical differences

**October 2013**

- (10/3) Will the Real Junk Science Please Stand Up? (critical thinking)
- (10/5) Was Janina Hosiasson pulling Harold Jeffreys’ leg?
- (10/9) Bad statistics: crime or free speech (II)? Harkonen update: Phil Stat / Law /Stock
- (10/12) Sir David Cox: a comment on the post, “Was Hosiasson pulling Jeffreys’ leg?”
- (10/19) Blog Contents: September 2013
- (10/19) Bayesian Confirmation Philosophy and the Tacking Paradox (iv)*
- (10/25) Bayesian confirmation theory: example from last post…
- (10/26) Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs what ?)
- (10/31) WHIPPING BOYS AND WITCH HUNTERS

**November 2013**

- (11/2) Oxford Gaol: Statistical Bogeymen
- (11/4) Forthcoming paper on the strong likelihood principle
- (11/9) Null Effects and Replication
- (11/9) Beware of questionable front page articles warning you to beware of questionable front page articles (iii)
- (11/13) T. Kepler: “Trouble with ‘Trouble at the Lab’?” (guest post)
- (11/16) PhilStock: No-pain bull
- (11/16) S. Stanley Young: More Trouble with ‘Trouble in the Lab’ (Guest post)
- (11/18) Lucien Le Cam: “The Bayesians hold the Magic”
- (11/20) Erich Lehmann: Statistician and Poet
- (11/23) Probability that it is a statistical fluke [i]
- (11/27) “The probability that it be a statistical fluke” [iia]
- (11/30) Saturday night comedy at the “Bayesian Boy” diary (rejected post*)

**December 2013**

- (12/3) Stephen Senn: Dawid’s Selection Paradox (guest post)
- (12/7) FDA’s New Pharmacovigilance
- (12/9) Why ecologists might want to read more philosophy of science (UPDATED)
- (12/11) Blog Contents for Oct and Nov 2013
- (12/14) The error statistician has a complex, messy, subtle, ingenious piece-meal approach
- (12/15) Surprising Facts about Surprising Facts
- (12/19) A. Spanos lecture on “Frequentist Hypothesis Testing”
- (12/24) U-Phil: Deconstructions [of J. Berger]: Irony & Bad Faith 3
- (12/25) “Bad Arguments” (a book by Ali Almossawi)
- (12/26) Mascots of Bayesneon statistics (rejected post)
- (12/27) Deconstructing Larry Wasserman
- (12/28) More on deconstructing Larry Wasserman (Aris Spanos)
- (12/28) Wasserman on Wasserman: Update! December 28, 2013
- (12/31) Midnight With Birnbaum (Happy New Year)

**January 2014**

- (1/2) Winner of the December 2013 Palindrome Book Contest (Rejected Post)
- (1/3) Error Statistics Philosophy: 2013
- (1/4) Your 2014 wishing well. …
- (1/7) “Philosophy of Statistical Inference and Modeling” New Course: Spring 2014: Mayo and Spanos: (Virginia Tech)
- (1/11) Two Severities? (PhilSci and PhilStat)
- (1/14) Statistical Science meets Philosophy of Science: blog beginnings
- (1/16) Objective/subjective, dirty hands and all that: Gelman/Wasserman blogolog (ii)
- (1/18) Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]
- (1/22) Phil6334: “Philosophy of Statistical Inference and Modeling” New Course: Spring 2014: Mayo and Spanos (Virginia Tech) UPDATE: JAN 21
- (1/24) Phil 6334: Slides from Day #1: Four Waves in Philosophy of Statistics
- (1/25) U-Phil (Phil 6334) How should “prior information” enter in statistical inference?
- (1/27) Winner of the January 2014 palindrome contest (rejected post)
- (1/29) BOSTON COLLOQUIUM FOR PHILOSOPHY OF SCIENCE: Revisiting the Foundations of Statistics
- (1/31) Phil 6334: Day #2 Slides

**February 2014**

- (2/1) Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs B-boosts)
- (2/3) PhilStock: Bad news is bad news on Wall St. (rejected post)
- (2/5) “Probabilism as an Obstacle to Statistical Fraud-Busting” (draft iii)
- (2/9) Phil6334: Day #3: Feb 6, 2014
- (2/10) Is it true that all epistemic principles can only be defended circularly? A Popperian puzzle
- (2/12) Phil6334: Popper self-test
- (2/13) Phil 6334 Statistical Snow Sculpture
- (2/14) January Blog Table of Contents
- (2/15) Fisher and Neyman after anger management?
- (2/17) R. A. Fisher: how an outsider revolutionized statistics
- (2/18) Aris Spanos: The Enduring Legacy of R. A. Fisher
- (2/20) R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’
- (2/21) STEPHEN SENN: Fisher’s alternative to the alternative
- (2/22) Sir Harold Jeffreys’ (tail-area) one-liner: Sat night comedy [draft ii]
- (2/24) Phil6334: February 20, 2014 (Spanos): Day #5
- (2/26) Winner of the February 2014 palindrome contest (rejected post)
- (2/26) Phil6334: Feb 24, 2014: Induction, Popper and pseudoscience (Day #4)

**March 2014**

- (3/1) Cosma Shalizi gets tenure (at last!) (metastat announcement)
- (3/2) Significance tests and frequentist principles of evidence: Phil6334 Day #6
- (3/3) Capitalizing on Chance (ii)
- (3/4) Power, power everywhere–(it) may not be what you think! [illustration]
- (3/8) Msc kvetch: You are fully dressed (even under your clothes)?
- (3/8) Fallacy of Rejection and the Fallacy of Nouvelle Cuisine
- (3/11) Phil6334 Day #7: Selection effects, the Higgs and 5 sigma, Power
- (3/12) Get empowered to detect power howlers
- (3/15) New SEV calculator (guest app: Durvasula)
- (3/17) Stephen Senn: “Delta Force: To what extent is clinical relevance relevant?” (Guest Post)
- (3/19) Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
- (3/22) Fallacies of statistics & statistics journalism, and how to avoid them: Summary & Slides Day #8 (Phil 6334)
- (3/25) The Unexpected Way Philosophy Majors Are Changing The World Of Business
- (3/26) Phil6334:Misspecification Testing: Ordering From A Full Diagnostic Menu (part 1)
- (3/28) Severe osteometric probing of skeletal remains: John Byrd
- (3/29) Winner of the March 2014 palindrome contest (rejected post)
- (3/30) Phil6334: March 26, philosophy of misspecification testing (Day #9 slides)

**April 2014**

- (4/1) Skeptical and enthusiastic Bayesian priors for beliefs about insane asylum renovations at Dept of Homeland Security: I’m skeptical and unenthusiastic
- (4/3) Self-referential blogpost (conditionally accepted*)
- (4/5) Who is allowed to cheat? I.J. Good and that after dinner comedy hour. . ..
- (4/6) Phil6334: Duhem’s Problem, highly probable vs highly probed; Day #9 Slides
- (4/8) “Out Damned Pseudoscience: Non-significant results are the new ‘Significant’ results!” (update)
- (4/12) “Murder or Coincidence?” Statistical Error in Court: Richard Gill (TEDx video)
- (4/14) Phil6334: Notes on Bayesian Inference: Day #11 Slides
- (4/16) A. Spanos: Jerzy Neyman and his Enduring Legacy
- (4/17) Duality: Confidence intervals and the severity of tests
- (4/19) Getting Credit (or blame) for Something You Didn’t Do (BP oil spill)
- (4/21) Phil 6334: Foundations of statistics and its consequences: Day#12
- (4/23) Phil 6334 Visitor: S. Stanley Young, “Statistics and Scientific Integrity”
- (4/26) Reliability and Reproducibility: Fraudulent p-values through multiple testing (and other biases): S. Stanley Young (Phil 6334: Day #13)
- (4/30) Able Stats Elba: 3 Palindrome nominees for April! (rejected post)

**May 2014**

- (5/1) Putting the brakes on the breakthrough: An informal look at the argument for the Likelihood Principle
- (5/3) You can only become coherent by ‘converting’ non-Bayesianly
- (5/6) Winner of April Palindrome contest: Lori Wike
- (5/7) A. Spanos: Talking back to the critics using error statistics (Phil6334)
- (5/10) Who ya gonna call for statistical Fraudbusting? R.A. Fisher, P-values, and error statistics (again)
- (5/15) Scientism and Statisticism: a conference* (i)
- (5/17) Deconstructing Andrew Gelman: “A Bayesian wants everybody else to be a non-Bayesian.”
- (5/20) The Science Wars & the Statistics Wars: More from the Scientism workshop
- (5/25) Blog Table of Contents: March and April 2014
- (5/27) Allan Birnbaum, Philosophical Error Statistician: 27 May 1923 – 1 July 1976
- (5/31) What have we learned from the Anil Potti training and test data frameworks? Part 1 (draft 2)

**June 2014**

- (6/5) Stephen Senn: Blood Simple? The complicated and controversial world of bioequivalence (guest post)
- (6/9) “The medical press must become irrelevant to publication of clinical trials.”
- (6/11) A. Spanos: “Recurring controversies about P values and confidence intervals revisited”
- (6/14) “Statistical Science and Philosophy of Science: where should they meet?”
- (6/21) Big Bayes Stories? (draft ii)
- (6/25) Blog Contents: May 2014
- (6/28) Sir David Hendry Gets Lifetime Achievement Award
- (6/30) Some ironies in the ‘replication crisis’ in social psychology (4th and final installment)

**July 2014**

- (7/7) Winner of June Palindrome Contest: Lori Wike
- (7/8) Higgs Discovery 2 years on (1: “Is particle physics bad science?”)
- (7/10) Higgs Discovery 2 years on (2: Higgs analysis and statistical flukes)
- (7/14) “P-values overstate the evidence against the null”: legit or fallacious? (revised)
- (7/23) Continued:”P-values overstate the evidence against the null”: legit or fallacious?
- (7/26) S. Senn: “Responder despondency: myths of personalized medicine” (Guest Post)
- (7/31) Roger Berger on Stephen Senn’s “Blood Simple” with a response by Senn (Guest Posts)

**August 2014**

- (08/03) Blogging Boston JSM2014?
- (08/05) Neyman, Power, and Severity
- (08/06) What did Nate Silver just say? Blogging the JSM 2013
- (08/09) Winner of July Palindrome: Manan Shah
- (08/09) Blog Contents: June and July 2014
- (08/11) Egon Pearson’s Heresy
- (08/17) Are P Values Error Probabilities? Or, “It’s the methods, stupid!” (2nd install)
- (08/23) Has Philosophical Superficiality Harmed Science?
- (08/29) BREAKING THE LAW! (of likelihood): to keep their fit measures in line (A), (B 2nd)

**September 2014**

- (9/30) Letter from George (Barnard)
- (9/27) Should a “Fictionfactory” peepshow be barred from a festival on “Truth and Reality”? Diederik Stapel says no (rejected post)
- (9/23) G.A. Barnard: The Bayesian “catch-all” factor: probability vs likelihood
- (9/21) Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”
- (9/18) Uncle Sam wants YOU to help with scientific reproducibility!
- (9/15) A crucial missing piece in the Pistorius trial? (2): my answer (Rejected Post)
- (9/12) “The Supernal Powers Withhold Their Hands And Let Me Alone”: C.S. Peirce
- (9/6) *Statistical Science*: The Likelihood Principle issue is out…!
- (9/4) All She Wrote (so far): Error Statistics Philosophy Contents-3 years on
- (9/3) 3 in blog years: Sept 3 is 3rd anniversary of errorstatistics.com

**October 2014**

- (10/01) Oy Faye! What are the odds of not conflating simple conditional probability and likelihood with Bayesian success stories?
- (10/05) Diederik Stapel hired to teach “social philosophy” because students got tired of success stories… or something (rejected post)
- (10/07) A (Jan 14, 2014) interview with Sir David Cox by “Statistics Views”
- (10/10) BREAKING THE (Royall) LAW! (of likelihood) (C)
- (10/14) Gelman recognizes his error-statistical (Bayesian) foundations
- (10/18) PhilStat/Law: Nathan Schachtman: Acknowledging Multiple Comparisons in Statistical Analysis: Courts Can and Must
- (10/22) September 2014: Blog Contents
- (10/25) 3 YEARS AGO: MONTHLY MEMORY LANE
- (10/26) To Quarantine or not to Quarantine?: Science & Policy in the time of Ebola
- (10/31) Oxford Gaol: Statistical Bogeymen

**November 2014**

- (11/01) Philosophy of Science Assoc. (PSA) symposium on Philosophy of Statistics in the Higgs Experiments “How Many Sigmas to Discovery?”
- (11/09) “Statistical Flukes, the Higgs Discovery, and 5 Sigma” at the PSA
- (11/11) The Amazing Randi’s Million Dollar Challenge
- (11/12) A biased report of the probability of a statistical fluke: Is it cheating?
- (11/15) Why the Law of Likelihood is bankrupt–as an account of evidence
- (11/18) Lucien Le Cam: “The Bayesians Hold the Magic”
- (11/20) Erich Lehmann: Statistician and Poet
- (11/22) Msc Kvetch: “You are a Medical Statistic”, or “How Medical Care Is Being Corrupted”
- (11/25) How likelihoodists exaggerate evidence from statistical tests
- (11/30) 3 YEARS AGO: MONTHLY (Nov.) MEMORY LANE

**December 2014**

- (12/02) My Rutgers Seminar: tomorrow, December 3, on philosophy of statistics
- (12/04) “Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance” (Dec 3 Seminar slides)
- (12/06) How power morcellators inadvertently spread uterine cancer
- (12/11) Msc. Kvetch: What does it mean for a battle to be “lost by the media”?
- (12/13) S. Stanley Young: Are there mortality co-benefits to the Clean Power Plan? It depends. (Guest Post)
- (12/17) Announcing Kent Staley’s new book, An Introduction to the Philosophy of Science (CUP)
- (12/21) Derailment: Faking Science: A true story of academic fraud, by Diederik Stapel (translated into English)
- (12/23) All I want for Chrismukkah is that critics & “reformers” quit howlers of testing (after 3 yrs of blogging)! So here’s Aris Spanos “Talking Back!”
- (12/26) 3 YEARS AGO: MONTHLY (Dec.) MEMORY LANE
- (12/29) To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)
- (12/31) Midnight With Birnbaum (Happy New Year)

**January 2015**

- (01/02) Blog Contents: Oct.–Dec. 2014
- (01/03) No headache power (for Deirdre)
- (01/04) Significance Levels are Made a Whipping Boy on Climate Change Evidence: Is .05 Too Strict? (Schachtman on Oreskes)
- (01/07) “When Bayesian Inference Shatters” Owhadi, Scovel, and Sullivan (reblog)
- (01/08) On the Brittleness of Bayesian Inference–An Update: Owhadi and Scovel (guest post)
- (01/12) “Only those samples which fit the model best in cross validation were included” (whistleblower) “I suspect that we likely disagree with what constitutes validation” (Potti and Nevins)
- (01/16) Winners of the December 2014 Palindrome Contest: TWO!
- (01/18) Power Analysis and Non-Replicability: If bad statistics is prevalent in your field, does it follow you can’t be guilty of scientific fraud?
- (01/21) Some statistical dirty laundry
- (01/24) What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Sat night potpourri
- (01/26) Trial on Anil Potti’s (clinical) Trial Scandal Postponed Because Lawyers Get the Sniffles (updated)
- (01/27) 3 YEARS AGO: (JANUARY 2012) MEMORY LANE
- (01/31) Saturday Night Brainstorming and Task Forces: (4th draft)

**February 2015**

- (02/05) Stephen Senn: Is Pooling Fooling? (Guest Post)
- (02/10) What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?
- (02/13) Induction, Popper and Pseudoscience
- (02/16) Continuing the discussion on truncation, Bayesian convergence and testing of priors
- (02/16) R. A. Fisher: ‘Two New Properties of Mathematical Likelihood’: Just before breaking up (with N-P)
- (02/17) R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)
- (02/19) Stephen Senn: Fisher’s Alternative to the Alternative
- (02/21) Sir Harold Jeffreys’ (tail area) one-liner: Saturday night comedy (b)
- (02/25) 3 YEARS AGO: (FEBRUARY 2012) MEMORY LANE
- (02/27) Big Data is the New Phrenology?

**March 2015**

- (03/01) “Probabilism as an Obstacle to Statistical Fraud-Busting”
- (03/05) A puzzle about the latest test ban (or ‘don’t ask, don’t tell’)
- (03/12) All She Wrote (so far): Error Statistics Philosophy: 3.5 years on
- (03/16) Stephen Senn: The pathetic P-value (Guest Post)
- (03/21) Objectivity in Statistics: “Arguments From Discretion and 3 Reactions”
- (03/24) 3 YEARS AGO (MARCH 2012): MEMORY LANE
- (03/28) Your (very own) personalized genomic prediction varies depending on who else was around?

**April 2015**

- (04/01) Are scientists really ready for ‘retraction offsets’ to advance ‘aggregate reproducibility’? (let alone ‘precautionary withdrawals’)
- (04/04) Joan Clarke, Turing, I.J. Good, and “that after-dinner comedy hour…”
- (04/08) Heads I win, tails you lose? Meehl and many Popperians get this wrong (about severe tests)!
- (04/13) Philosophy of Statistics Comes to the Big Apple! APS 2015 Annual Convention — NYC
- (04/16) A. Spanos: Jerzy Neyman and his Enduring Legacy
- (04/18) Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen
- (04/22) NEYMAN: “Note on an Article by Sir Ronald Fisher” (3 uses for power, Fisher’s fiducial argument)
- (04/24) “Statistical Concepts in Their Relation to Reality” by E.S. Pearson
- (04/27) 3 YEARS AGO (APRIL 2012): MEMORY LANE
- (04/30) 96% Error in “Expert” Testimony Based on Probability of Hair Matches: It’s all Junk!

**May 2015**

- (05/04) Spurious Correlations: Death by getting tangled in bedsheets and the consumption of cheese! (Aris Spanos)
- (05/08) What really defies common sense (Msc kvetch on rejected posts)
- (05/09) Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)
- (05/16) “Error statistical modeling and inference: Where methodology meets ontology” A. Spanos and D. Mayo
- (05/19) Workshop on Replication in the Sciences: Society for Philosophy and Psychology: (2nd part of double header)
- (05/24) From our “Philosophy of Statistics” session: APS 2015 convention
- (05/27) “Intentions” is the new code word for “error probabilities”: Allan Birnbaum’s Birthday
- (05/30) 3 YEARS AGO (MAY 2012): Saturday Night Memory Lane

**June 2015**

- (06/04) What Would Replication Research Under an Error Statistical Philosophy Be?
- (06/09) “Fraudulent until proved innocent: Is this really the new “Bayesian Forensics”? (rejected post)
- (06/11) Evidence can only strengthen a prior belief in low data veracity, N. Liberman & M. Denzler: “Response”
- (06/14) Some statistical dirty laundry: The Tilberg (Stapel) Report on “Flawed Science”
- (06/18) Can You change Your Bayesian prior? (ii)
- (06/25) 3 YEARS AGO (JUNE 2012): MEMORY LANE
- (06/30) Stapel’s Fix for Science? Admit the story you want to tell and how you “fixed” the statistics to support it!

**July 2015**

- (07/03) Larry Laudan: “When the ‘Not-Guilty’ Falsely Pass for Innocent”, the Frequency of False Acquittals (guest post)
- (07/09) Winner of the June Palindrome contest: Lori Wike
- (07/11) Higgs discovery three years on (Higgs analysis and statistical flukes)
- (07/14) Spot the power howler: α = β?
- (07/17) “Statistical Significance” According to the U.S. Dept. of Health and Human Services (ii)
- (07/22) 3 YEARS AGO (JULY 2012): MEMORY LANE
- (07/24) Stephen Senn: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics
- (07/29) Telling What’s True About Power, if practicing within the error-statistical tribe

**August 2015**

- (08/05) Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen
- (08/08) Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”
- (08/11) A. Spanos: Egon Pearson’s Neglected Contributions to Statistics
- (08/14) Performance or Probativeness? E.S. Pearson’s Statistical Philosophy
- (08/15) Severity in a Likelihood Text by Charles Rohde
- (08/19) Statistics, the Spooky Science
- (08/20) How to avoid making mountains out of molehills, using power/severity
- (08/24) 3 YEARS AGO (AUGUST 2012): MEMORY LANE
- (08/31) The Paradox of Replication, and the vindication of the P-value (but she can go deeper) 9/2/15 update (ii)

[i] Table of Contents compiled by N. Jinn & J. Miller.*

*I thank Jean Miller for her assiduous work on the blog, and all contributors and readers for helping “frequentists in exile” to feel (and truly become) less exiled–wherever they may be!

Filed under: blog contents, Metablog, Statistics

**The Paradox of Replication**

Critic 1: It’s much too easy to get small P-values.

Critic 2: We find it very difficult to get small P-values; only 36 of 100 psychology experiments were found to yield small P-values in the recent Open Science collaboration on replication (in psychology).

Is it easy or is it hard?

You might say, there’s no paradox, the problem is that the significance levels in the original studies are often due to cherry-picking, multiple testing, optional stopping and other *biasing selection effects*. *The mechanism by which biasing selection effects blow up P-values is very well understood, and we can demonstrate exactly how it occurs.* In short, many of the initially significant results merely report “nominal” P-values not “actual” ones, and there’s nothing inconsistent between the complaints of critic 1 and critic 2.
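That mechanism is easy to exhibit in a few lines. Below is a minimal sketch (illustrative numbers only, not drawn from any study discussed here): data are simulated with no real effect, but the analyst peeks after every batch of ten observations and stops the first time the nominal P-value dips below .05. The actual probability of declaring significance comes out far above the nominal 5%:

```python
import math
import random

def p_value(xs):
    """Two-sided z-test p-value for H0: mean = 0, known sd = 1."""
    z = abs(sum(xs)) / math.sqrt(len(xs))
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

random.seed(1)
trials, hits = 4000, 0
for _ in range(trials):
    xs = []
    for _look in range(10):                     # peek after every 10 points
        xs += [random.gauss(0, 1) for _ in range(10)]
        if p_value(xs) < 0.05:                  # stop at first "significance"
            hits += 1
            break

print(f"nominal level: 0.05, actual type I error: {hits / trials:.2f}")
```

Reporting the final P-value as if the sample size had been fixed in advance is exactly the nominal-vs-actual gap: each look is at the .05 level, but the chance of at least one “hit” across ten looks is several times that.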

The resolution of the paradox attests to what many have long been saying: the problem is not with the statistical methods but with their abuse. Even the P-value, the most unpopular girl in the class, gets to show a bit of what she’s capable of. She will give you a hard time when it comes to replicating nominally significant results if they were largely due to biasing selection effects. That is just what is wanted; it is an asset that she feels the strain and lets you know. It is statistical accounts that cannot pick up on biasing selection effects that should worry us (especially those that deny such effects are relevant). That is one of the most positive things to emerge from the recent, impressive replication project in psychology. From an article in *Smithsonian* magazine, “Scientists Replicated 100 Psychology Studies, and Fewer Than Half Got the Same Results”:

The findings also offered some support for the oft-criticized statistical tool known as the *P* value, which measures whether a result is significant or due to chance. …The project analysis showed that a low *P* value was fairly predictive of which psychology studies could be replicated. Twenty of the 32 original studies with a *P* value of less than 0.001 could be replicated, for example, while just 2 of the 11 papers with a value greater than 0.04 were successfully replicated. (Link is here.)

The Replication Report itself, published in *Science*, gives more details:

Considering significance testing, reproducibility was stronger in studies and journals representing cognitive psychology than social psychology topics. For example, combining across journals, 14 of 55 (25%) of social psychology effects replicated by the *P* < 0.05 criterion, whereas 21 of 42 (50%) of cognitive psychology effects did so. …The difference in significance testing results between fields appears to be partly a function of weaker original effects in social psychology studies, particularly in *JPSP*, and perhaps of the greater frequency of high-powered within-subjects manipulations and repeated measurement designs in cognitive psychology as suggested by high power despite relatively small participant samples. …A negative correlation of replication success with the original study *P* value indicates that the initial strength of evidence is predictive of reproducibility. For example, 26 of 63 (41%) original studies with *P* < 0.02 achieved *P* < 0.05 in the replication, whereas 6 of 23 (26%) that had a *P* value between 0.02 < *P* < 0.04 and 2 of 11 (18%) that had a *P* value > 0.04 did so (Fig. 2). Almost two thirds (20 of 32, 63%) of original studies with *P* < 0.001 had a significant *P* value in the replication. [i]

Since only around 50% of replications are expected to be as strong as the original, the cases with initial significance level < .02 don’t do too badly, judging by the numbers alone. But I disagree with those who say that all that’s needed is to lower the required P-value; that ignores the real monster: biasing selection effects.
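The ~50% expectation has a simple back-of-the-envelope rationale. Under the (strong) assumptions that the true effect equals the one originally observed and that the replication is an exact, same-sized copy, the replication’s z-statistic is approximately N(z_orig, 1), so a just-significant original replicates at the .05 level only about half the time:

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Probability an exact same-size replication again clears the ~1.96 cutoff,
# if the true effect equals the originally observed one.
for z_orig in (1.96, 2.33, 3.09):      # original two-sided p ~ .05, .02, .002
    p_replicate = 1 - phi(1.96 - z_orig)
    print(f"original z = {z_orig}: P(replication significant) = {p_replicate:.2f}")
```

On this rough model, originals near the .02 cutoff replicate around 60% of the time, which is one way to see why the < .02 cases “don’t do too badly.”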

**2. Is there evidence that differences (between initial studies vs. replications) are due to A, B, C… or not?** Moreover, simple significance tests and cognate methods were the tools of choice in exploring possible explanations for the disagreeing results.

Last, there was little evidence that perceived importance of the effect, expertise of the original or replication teams, or self-assessed quality of the replication accounted for meaningful variation in reproducibility across indicators. Replication success was more consistently related to the original strength of evidence (such as original *P* value, effect size, and effect tested) than to characteristics of the teams and implementation of the replication (such as expertise, quality, or challenge of conducting study) (tables S3 and S4).

They look to a battery of simple significance tests for answers, if only indications. It is apt that they report these explanations as the result of “exploratory” analysis; they weren’t generalizing, but scrutinizing whether various factors could readily account for the results.

What evidence is there that the replication studies are not themselves due to bias? According to the Report:

There is no publication bias in the replication studies because all results are reported. Also, there are no selection or reporting biases because all were confirmatory tests based on pre-analysis plans. This maximizes the interpretability of the replication *P* values and effect estimates.

One needn’t rule out bias altogether to agree with the Report that the replication research controlled the most common biases and flexibilities to which the initial experiments were open. If your P-value emerged from torture and abuse, it can’t be hidden from a replication that ties your hands. If you don’t cherry-pick, try and try again, barn hunt, or capitalize on flexible theory, it’s hard to satisfy R.A. Fisher’s requirement of rarely failing to bring about statistically significant results–*unless you’ve found a genuine effect.* Admittedly, this is only a small part of finding things out; the same methods can be used to go deeper in discovering and probing alternative explanations of an effect.

**3. Observed differences cannot be taken as caused by the “treatment”:** My main worries with the replicationist conclusions in psychology are that they harbor many of the same presuppositions that cause problems in (at least some) psychological experiments to begin with, notably the tendency to assume that observed differences–*any* differences–are due to the “treatments”, and further, that they are measuring the phenomenon of interest. Even nonsignificant observed differences are interpreted as merely indicating smaller effects of the experimental manipulation, when the significance test may be indicating the absence of any genuine effect, much less support for the particular causal thesis. The statistical test is shouting disconfirmation, if not falsification, of unwarranted hypotheses, but no such interpretation is heard.

It would be interesting to see a list of the failed replications. (I’ll try to dig them out at some point.) The New York Times gives three, but even they are regarded as “simply weaker”.

The overall “effect size,” a measure of the strength of a finding, dropped by about half across all of the studies. Yet very few of the redone studies contradicted the original ones; their results were simply weaker.

This is akin to the habit some researchers have of describing non-significant results as “trending” toward significance–when the P-value is telling them it’s not significant, and I don’t mean just falling short of a “bright line” at .05, but levels like .2, .3, and .4. Such differences are easy to bring about by chance variability alone. Psychologists also blur the observed difference (in statistics) with the inferred discrepancy (in parameter values), which inflates the inference. I don’t know the specific P-values for the following three:

More than 60 of the studies did not hold up. Among them was one on free will. It found that participants who read a passage arguing that their behavior is predetermined were more likely than those who had not read the passage to cheat on a subsequent test.

Another was on the effect of physical distance on emotional closeness. Volunteers asked to plot two points that were far apart on graph paper later reported weaker emotional attachment to family members, compared with subjects who had graphed points close together.

A third was on mate preference. Attached women were more likely to rate the attractiveness of single men highly when the women were highly fertile, compared with when they were less so. In the reproduced studies, researchers found weaker effects for all three experiments.

What are the grounds for saying they’re merely weaker? The author of the mate preference study protests even this mild criticism, claiming that a “theory required adjustment” shows her findings to have been replicated after all.

In an email, Paola Bressan, a psychologist at the University of Padua and an author of the original mate preference study, identified several such differences [between her study and the replication] — including that her sample of women were mostly Italians, not American psychology students — that she said she had forwarded to the Reproducibility Project. “I show that, with some theory-required adjustments, my original findings were in fact replicated,” she said.

Wait a minute. This was to be a general evolutionary theory, yes? According to the abstract:

Because men of higher genetic quality tend to be poorer partners and parents than men of lower genetic quality, women may profit from securing a stable investment from the latter, while obtaining good genes via extra pair mating with the former. Only if conception occurs, however, do the evolutionary benefits of such a strategy overcome its costs. Accordingly, we predicted that (a) partnered women should prefer attached men, because such men are more likely than single men to have pair-bonding qualities, and hence to be good replacement partners, and (b) this inclination should reverse when fertility rises, because attached men are less available for impromptu sex than single men. (A link to the abstract and paper is here.)

Is the author saying that Italian women obey a distinct evolutionary process? I take it one could argue that evolutionary forces manifest themselves in different ways in distinct cultures. Doubtless, ratings of attractiveness by U.S. psychology students can’t be assumed to reflect assessments about availability for impromptu sex. But can they even among Italian women? This is just one particular story through which the data are being viewed. **[9/2/15 Update on the mate preference and ovulation study is in Section 4.]**

I can understand that the authors of the replication Report wanted to tread carefully to avoid the kind of pushback that erupted when a hypothesis about cleanliness and morality failed to be replicated. (“Repligate” some called it.) My current concern echoes the one I raised about that case (in an earlier post):

“the [replicationist] question wasn’t: can the hypotheses about cleanliness and morality be well-tested or well probed by finding statistical associations between unscrambling cleanliness words and “being less judgmental” about things like eating your dog if he’s run over? At least not directly. In other words, the statistical-substantive link was not at issue.”

Just because subjects (generally psychology students) select a number on a questionnaire, or can be scored on an official test of attitude, feelings, self-esteem, etc., doesn’t mean the phenomenon has actually been measured, so that you can proceed to apply statistics. You may adopt a method that allows you to go from statistical significance to causal claims—the unwarranted NHST animal that Fisher opposed—but the question does not disappear [ii]. Reading a passage against “free will” makes me more likely to cheat on a test? (There’s scarce evidence that reading a passage influenced the subject’s view on the deep issue of free will, nor even that the subject (chose to*) “cheat”, much less that the former is responsible for the latter.) When I plot two faraway points on a graph I’m more likely to feel “distant” from my family than if I plot two close-together points? The effect is weaker but still real? There are oceans of studies like these (especially in social psychology and priming research). Some are even taken to inform philosophical theories of mind or ethics when, in my opinion, philosophers should be providing a rigorous methodological critique of these studies [iii]. We need to go deeper; in many cases, no statistical analysis would even be required. The vast literatures on the assumed effects live lives of their own; to test their fundamental presuppositions could bring them all crashing down [iv]. Are they to remain out of bounds of critical scrutiny? What do you think?

I may come back to this post in later installments.

*Irony intended.

**4. Update on the Italian Mate Selection Replication**

Here’s the situation as I understand it, having read both the replication and the response by Bressan. The women in the study had to be single, not pregnant, not on the pill, and heterosexual. Among the single women, some are in relationships; they are “partnered”. The thesis is this: if a partnered woman is not ovulating, she’s more attracted to the “attached” guy, because he is deemed capable of a long-term commitment, as evidenced by his being in a relationship. So she might leave her current guy for him (at least if he’s handsome in a masculine sort of way). On the other hand, if she’s ovulating, she’d be more attracted to a single (not attached) man than to an attached man. “In this way she could get pregnant and carry the high-genetic-fitness man’s offspring without having to leave her current, stable relationship” (Frazier and Hasselman, replication report).

So the deal is this: if she’s ovulating, she’s got to do something fast: have a baby with the single (non-attached) guy who’s not very good at commitments (but shows high testosterone, and thus high immunities, according to the authors), and then race back to raise the baby in her current stable relationship. As Bressan puts it in her response to the replication: “This effect was interpreted on the basis of the hypothesis that, during ovulation, partnered women would be ‘shopping for good genes’ because they ‘already have a potentially investing “father” on their side.’” But would he be an invested father if it was another man’s baby? I mean, does this even make sense on crude evolutionary terms? [I don’t claim to know. I thought male lions are prone to stomp on babies fathered by other males. Even with humans, I doubt that even the “feminine” male Pleistocene partner would remain fully invested.]

Nevertheless, when you see the whole picture, Bressan does raise some valid questions about the replication attempt in her commentary. I may come back to this later. You can find all the reports, responses by authors, and other related materials here.

[i] Here’s a useful overview from the Report in *Science*:

Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.

Since only around 50% of replications are expected to be as strong as the original, this might not seem that low. But I think the important issues go beyond rates, and that focusing on rates of replication actually distracts from what’s involved in appraising a given study or theory.

[ii] Statistical methods are relevant to answering this question, and even to falsifying conjectured causal claims. My point is that doing so demands more than checking the purely statistical question in these “direct” replications, and more than P-values. Oddly, since these studies appeal to power, they ought to be framed as Neyman-Pearson hypothesis tests (ideally without the behavioristic rationale). This would immediately scotch the illicit slide from statistical to substantive inference.

[iii] Yes, this is one of the sources of my disappointment: philosophers of science should be critically assessing this so-called “naturalized” philosophy. It all goes back to Quine, but never mind.

[iv] It would not be difficult to test whether these measures are valid. The following is about the strongest hedged claim (from the Report) that the replication result is sounder than the original:

If publication, selection, and reporting biases completely explain the effect differences, then the replication estimates would be a better estimate of the effect size than would the meta-analytic and original results. However, to the extent that there are other influences, such as moderation by sample, setting, or quality of replication, the relative bias influencing original and replication effect size estimation is unknown.

Filed under: replication research, reproducibility, spurious p values, Statistics

**MONTHLY MEMORY LANE: 3 years ago: August 2012.** I mark in **red** **three** posts that seem most apt for general background on key issues in this blog.**[1]** Posts that are part of a “unit” or a group of “U-Phils” count as one (there are 4 U-Phils on Wasserman this time). Monthly memory lanes began at the blog’s 3-year anniversary in Sept. 2014. We’re about to turn four.

**August 2012**

- (8/2) Stephen Senn: Fooling the Patient: an Unethical Use of Placebo? (Phil/Stat/Med)
- (8/5) A “Bayesian Bear” rejoinder practically writes itself…
- (8/6) Bad news bears: Bayesian rejoinder
- (8/8) U-PHIL: Aris Spanos on Larry Wasserman[2]
- (8/10) U-PHIL: Hennig and Gelman on Wasserman (2011)
- (8/11) E.S. Pearson Birthday
- (8/11) U-PHIL: Wasserman Replies to Spanos and Hennig
- (8/13) U-Phil: (concluding the deconstruction) Wasserman/Mayo
- (8/14) Good Scientist Badge of Approval?
- (8/16) E.S. Pearson’s Statistical Philosophy
- (8/18) A. Spanos: Egon Pearson’s Neglected Contributions to Statistics
- (8/20) Higgs Boson: Bayesian “Digest and Discussion”
- (8/22) Scalar or Technicolor? S. Weinberg, “Why the Higgs?”
- (8/25) “Did Higgs Physicists Miss an Opportunity by Not Consulting More With Statisticians?”
- (8/27) Knowledge/evidence not captured by mathematical prob.
- (8/30) Frequentist Pursuit
- (8/31) Failing to Apply vs Violating the Likelihood Principle

**[1]** Excluding those reblogged fairly recently.

[2] Larry Wasserman’s paper was “Low Assumptions, High Dimensions” in our special RMM volume.

Filed under: 3-year memory lane, Statistics

A classic fallacy of rejection is taking a statistically significant result as evidence of a discrepancy from a test (or null) hypothesis larger than is warranted. Standard tests do have resources to combat this fallacy, but you won’t see them in textbook formulations. It’s not new statistical method, but new (and correct) interpretations of existing methods, that are needed. One can begin with a companion to the rule in this recent post:

(1) If POW(T+, µ’) is low, then the statistically significant x is a *good* indication that µ > µ’.

To have the companion rule also in terms of power, let’s suppose that our result is *just* statistically significant. (As soon as it exceeds the cut-off, the rule has to be modified.)

Rule (1) was stated in relation to a statistically significant result **x** (at level α) from a one-sided test T+ of the mean of a Normal distribution (H_{0}: µ ≤ µ_{0} against H_{1}: µ > µ_{0}, with σ known). The companion rule is:

(2) If POW(T+, µ’) is high, then an α statistically significant x is a *good* indication that µ < µ’.

(The higher POW(T+, µ’) is, the better the indication that µ < µ’.) That is, if the test’s power to detect alternative µ’ is high, then the statistically significant x is a good indication (or good evidence) that the discrepancy from the null is not as large as µ’ (i.e., there’s good evidence that µ < µ’).

An account of severe testing based on error statistics is always keen to indicate inferences that are not warranted by the data, as well as those that are. Not only might we wish to indicate which discrepancies are poorly warranted, we can give upper bounds to warranted discrepancies by using (2).

**EXAMPLE**. Let σ = 10, *n* = 100, so (σ/√*n*) = 1. Test T+ rejects H_{0} at the .025 level if M > 1.96(1). For simplicity, let the cut-off, M*, be 2. Let the observed mean M_{0} just reach the cut-off 2.

**POWER**: POW(T+, µ’) = POW(Test T+ rejects *H*_{0}; µ’) = Pr(M > M*; µ’), where M is the sample mean and M* is the cut-off for rejection. (Since M is continuous, it doesn’t matter if we write > or ≥.)[i]

The power against alternatives between the null and the cut-off M* ranges from α to .5; power exceeds .5 only for alternatives greater than M*. Using one of our power facts: POW(T+, µ’ = M* + 1(σ/√*n*)) = .84.

That is, adding one (σ/√*n*) unit to the cut-off M* takes us to an alternative against which the test has power .84. So, POW(T+, µ’ = 3) = .84. See this post.

By (2), the (just) significant result x is decent evidence that µ < 3, because if µ ≥ 3, we’d have observed a more statistically significant result with probability at least .84. The upper .84 confidence limit is 3. The significant result is even better evidence that µ < 4; the upper .975 confidence limit is 4 (approx.), etc.
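The .84 and .975 figures follow directly from the Normal distribution of M. A quick sketch checking them (assuming scipy is available; the setup mirrors the example above):

```python
from scipy.stats import norm

sigma, n = 10, 100
se = sigma / n ** 0.5        # (σ/√n) = 1
M_cut = 2.0                  # the cut-off M* (1.96 rounded to 2, as in the example)

def power(mu):
    # POW(T+, µ') = Pr(M > M*; µ'), with M ~ Normal(µ', σ/√n)
    return 1 - norm.cdf(M_cut, loc=mu, scale=se)

print(round(power(3), 3))    # 0.841: good indication that µ < 3
print(round(power(4), 3))    # 0.977: even better indication that µ < 4
```

Each extra (σ/√n) unit past the cut-off pushes the power, and hence the strength of the µ < µ’ indication, further toward 1.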

Reporting (2) is typically of importance in cases of highly sensitive tests, but I think it should always accompany a rejection, to avoid making mountains out of molehills. (Only, (2) should be custom-tailored to the outcome, not the cut-off.) In the case of statistical *in*significance, (2) is essentially ordinary *power analysis* (where the interest may be to avoid making molehills out of mountains). Power analysis, applied to insignificant results, is especially of interest with low-powered tests. For example, failing to find a statistically significant increase in some risk may at most rule out (substantively) large risk increases; it might not allow ruling out risks of concern. Naturally, that’s a context-dependent consideration, often stipulated in regulatory statutes.
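To illustrate the negative-result use of power analysis, here is a hypothetical low-powered risk study (all numbers invented for illustration): a nonsignificant result is good evidence only against increases the test had high power to detect.

```python
from scipy.stats import norm

# Hypothetical low-powered study of a risk increase:
# one-sided test of H0: no increase (µ = 0), σ = 10, n = 25.
sigma, n = 10, 25
se = sigma / n ** 0.5                # = 2
cut = norm.ppf(0.975) * se           # ≈ 3.92, the .025-level cut-off

def power(mu):
    # Pr(M > cut; µ'): the chance this test detects an increase of size µ'
    return 1 - norm.cdf(cut, loc=mu, scale=se)

print(round(power(3), 2))   # ≈ 0.32: nonsignificance cannot rule out µ' = 3
print(round(power(8), 2))   # ≈ 0.98: it is good evidence that µ < 8
```

If an increase of 3 units is the “risk of concern”, this test’s insignificant result leaves it entirely open; only much larger increases are well ruled out.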

Rule (2) also provides a way to distinguish values *within* a 1-α confidence interval (instead of choosing a given confidence level and then reporting CIs in the dichotomous manner that is now typical).

At present, power analysis is only used to interpret negative results–and there it is often confused with “retrospective power” (what I call shpower). Again, confidence bounds could be, but they are not now, used to this end (but rather the opposite [iii]).

**Severity replaces M* in (2) with the actual result, be it significant or insignificant. **

Looking at power means looking at the best case (just reaching a significance level) or the worst case (just missing it). This is way too coarse; we need to *custom tailor* results using the observed data. That’s what severity does, but for this post, I wanted to just illuminate the logic.[ii]

*One more thing:*

**Applying (1) and (2) requires the error probabilities to be actual** (approximately correct): Strictly speaking, rules (1) and (2) have a conjunct in their antecedents [iv]: “given the test assumptions are sufficiently well met”. *If background knowledge leads you to deny (1) or (2), it indicates you’re denying the reported error probabilities are the actual ones.* There’s evidence the test fails an “audit”. That, at any rate, is what I would argue.

————

[i] To state power in terms of P-values: POW(µ’) = Pr(P < p*; µ’) where P < p* corresponds to rejecting the null hypothesis at the given level.

[ii] It must be kept in mind that inferences are going to be in the form of µ > µ’ = µ_{0} + δ, or µ < µ’ = µ_{0} + δ, or the like. They are *not* to point values! (Not even to the point µ = M_{0}.) Most simply, you may consider that the inference is in terms of the one-sided upper confidence bound (for various confidence levels)–the dual of test T+.

[iii] That is, upper confidence bounds are viewed as “plausible” bounds, and as values for which the data provide positive evidence. As soon as you get to an upper bound at confidence levels of around .6, .7, .8, etc. you actually have evidence µ’ < CI-upper. See this post.
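The point in [iii] can be made concrete: for the observed mean M_0 = 2 in the example of the main post, the one-sided upper bound at confidence level c is M_0 + z_c(σ/√n). A sketch (assuming scipy):

```python
from scipy.stats import norm

sigma, n, M0 = 10, 100, 2.0
se = sigma / n ** 0.5                    # = 1

for level in (0.6, 0.7, 0.8, 0.84, 0.975):
    upper = M0 + norm.ppf(level) * se    # one-sided upper confidence bound
    print(f"level {level}: evidence that µ < {upper:.2f}")
```

At level .84 the bound is ≈ 3, and at .975 it is ≈ 3.96, matching the upper limits cited in the main post.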

[iv] The “antecedent” of a conditional refers to the statement between the “if” and the “then”.

OTHER RELEVANT POSTS ON POWER

- (6/9) U-Phil: Is the Use of Power* Open to a Power Paradox?
- (3/4/14) Power, power everywhere–(it) may not be what you think! [illustration]
- (3/12/14) Get empowered to detect power howlers
- (3/17/14) Stephen Senn: “Delta Force: To what Extent is clinical relevance relevant?”
- (3/19/14) Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
- (12/29/14) To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)
- (01/03/15) No headache power (for Deirdre)

Filed under: fallacy of rejection, power, Statistics

I was reading this interview Of Erich Lehmann yesterday: “A Conversation with Erich L. Lehmann”

Lehmann: …I read over and over again that hypothesis testing is dead as a door nail, that nobody does hypothesis testing. I talk to Julie and she says that in the behavioral sciences, hypothesis testing is what they do the most. All my statistical life, I have been interested in three different types of things: testing, point estimation, and confidence-interval estimation. There is not a year that somebody doesn’t tell me that two of them are total nonsense and only the third one makes sense. But which one they pick changes from year to year. [Laughs] (p. 151)…

DeGroot: …It has always amazed me about statistics that we argue among ourselves about which of our basic techniques are of practical value. It seems to me that in other areas one can argue about whether a methodology is going to prove to be useful, but people would agree whether a technique is useful in practice. But in statistics, as you say, some people believe that confidence intervals are the only procedures that make any sense on practical grounds, and others think they have no practical value whatsoever. I find it kind of spooky to be in such a field.

Lehmann: After a while you get used to it. If somebody attacks one of these, I just know that next year I’m going to get one who will be on the other side. (pp. 151-2)

Emphasis is mine.

I’m reminded of this post.

Morris H. DeGroot, *Statistical Science*, 1986, Vol. 1, No. 2, 243-258


Filed under: phil/history of stat, Statistics