SIST* BLOG POSTS (up to Nov 17, 2018)
Excerpts
Mementos, Keepsakes and Souvenirs
*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018)
Excursion 2 Tour II: Falsification, Pseudoscience, Induction*
Outline of Tour. Tour II visits Popper, falsification, corroboration, Duhem’s problem (what to blame in the case of anomalies) and the demarcation of science and pseudoscience (2.3). While Popper comes up short on each, the reader is led to improve on Popper’s notions (live exhibit (v)). Central ingredients for our journey are put in place via souvenirs: a framework of models and problems, and a post-Popperian language to speak about inductive inference. Defining a severe test, for Popperians, is linked to when data supply novel evidence for a hypothesis: family feuds about defining novelty are discussed (2.4). We move into Fisherian significance tests and the crucial requirements he set (often overlooked): isolated significant results are poor evidence of a genuine effect, and statistical significance doesn’t warrant substantive, e.g., causal inference (2.5). Applying our new demarcation criterion to a plausible effect (males are more likely than females to feel threatened by their partner’s success), we argue that a real revolution in psychology will need to be more revolutionary than at present. Whole inquiries might have to be falsified, their measurement schemes questioned (2.6). The Tour’s pieces are synthesized in (2.7), where a guest lecturer explains how to solve the problem of induction now, having redefined induction as severe testing.
Mementos from 2.3
There are four key, interrelated themes from Popper:
(1) Science and Pseudoscience. For a theory to be scientific it must be testable and falsifiable.
(2) Conjecture and Refutation. We learn not by enumerative induction but by trial and error: conjecture and refutation.
(3) Observations Are Not Given. If they are at the “foundation,” it is only because there are apt methods for testing their validity. We dub claims observable because or to the extent that they are open to stringent checks.
(4) Corroboration Not Confirmation, Severity Not Probabilism. Rejecting probabilism, Popper denies scientists are interested in highly probable hypotheses (in any sense). They seek bold, informative, interesting conjectures and ingenious and severe attempts to refute them.
These themes are in the spirit of the error statistician. Considerable spade-work is required to see what to keep and what to revise, so bring along your archeological shovels.
The Severe Tester Revises Popper’s Demarcation of Science (Live Exhibit (vi)): What he should be asking is not whether a theory is unscientific, but when is an inquiry into a theory, or an appraisal of claim H, unscientific? We want to distinguish meritorious modes of inquiry from those that are BENT. If the test methods enable ad hoc maneuvering, sneaky face-saving devices, then the inquiry – the handling and use of data – is unscientific. Despite being logically falsifiable, theories can be rendered immune from falsification by means of questionable methods for their testing.
Greater Content, Greater Severity. The severe tester accepts Popper’s central intuition in (4): if we wanted highly probable claims, scientists would stick to low-level observables and not seek generalizations, much less theories with high explanatory content. A highly explanatory, high-content theory, with interconnected tentacles, has a higher probability of having flaws discerned than low-content theories that do not rule out as much. Thus, when the bolder, higher content, theory stands up to testing, it may earn higher overall severity than the one with measly content. It is the fuller, unifying, theory developed in the course of solving interconnected problems that enables severe tests.
Methodological Probability. Probability in learning attaches to a method of conjecture and refutation, that is to testing: it is methodological probability. An error probability is a special case of a methodological probability. We want methods with a high probability of teaching us (and machines) how to distinguish approximately correct and incorrect interpretations of data. That a theory is plausible is of little interest, in and of itself; what matters is that it is implausible for it to have passed these tests were it false or incapable of adequately solving its set of problems.
Methodological falsification. We appeal to methodological rules for when to regard a claim as falsified.
Despite giving lip service to testing and falsification, many popular accounts of statistical inference do not embody falsification – even of a statistical sort.
However, the falsifying hypotheses that are integral for Popper also necessitate an evidence-transcending (inductive) statistical inference. If your statistical account denies we can reliably falsify interesting theories because doing so is not strictly deductive, it is irrelevant to real-world knowledge.
The Popperian (Methodological) Falsificationist Is an Error Statistician
When is a statistical hypothesis to count as falsified? Although extremely rare events may occur, Popper notes:
such occurrences would not be physical effects, because, on account of their immense improbability, they are not reproducible at will … If, however, we find reproducible deviations from a macro effect … deduced from a probability estimate … then we must assume that the probability estimate is falsified. (Popper 1959, p. 203)
In the same vein, we heard Fisher deny that an “isolated record” of statistically significant results suffices to warrant a reproducible or genuine effect (Fisher 1935a, p. 14).
In a sense, the severe tester ‘breaks’ from Popper by solving his key problem: Popper’s account rests on severe tests, tests that would probably falsify claims if false, but he cannot warrant saying a method is probative or severe, because that would mean it was reliable, which makes Popperians squeamish. It would appear to concede to his critics that Popper has a “whiff of induction” after all. But it’s not inductive enumeration. Error statistical methods (whether from statistics or informal) can supply the severe tests Popper sought.
A scientific inquiry (a procedure for finding something out) for a severe tester:
The parenthetical remark isn’t absolutely required, but is a feature that greatly strengthens scientific credentials.
The reliability requirement is: infer claims just to the extent that they pass severe tests. There’s no sharp line for demarcation, but when these requirements are absent, an inquiry veers into the realm of questionable science or pseudoscience.
To see mementos of 2.4-2.7, I’ve placed them here.**
All of 2.3 is here.
Please use the comments for your questions, corrections, suggested additions.
*All items refer to my new book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP 2018)
**I’m bound to revise and add to these during a seminar next semester.
Excursion 2 Tour I: Induction and Confirmation (Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars)
Tour Blurb. The roots of rival statistical accounts go back to the logical Problem of Induction. (2.1) The logical problem of induction is a matter of finding an argument to justify a type of argument (enumerative induction), so it is important to be clear on arguments, their soundness versus their validity. These are key concepts of fundamental importance to our journey. Given that any attempt to solve the logical problem of induction leads to circularity, philosophers turned instead to building logics that seemed to capture our intuitions about induction. This led to confirmation theory and some projects in today’s formal epistemology. There’s an analogy between contrasting views in philosophy and statistics: Carnapian confirmation is to Bayesian statistics, as Popperian falsification is to frequentist error statistics. Logics of confirmation take the form of probabilisms, either in the form of raising the probability of a hypothesis, or arriving at a posterior probability. (2.2) The contrast between these types of probabilisms, and the problems each is found to have in confirmation theory are directly relevant to the types of probabilisms in statistics. Notably, Harold Jeffreys’ non-subjective Bayesianism, and current spin-offs, share features with Carnapian inductive logics. We examine the problem of irrelevant conjunctions: that if x confirms H, it confirms (H & J) for any J. This also leads to what’s called the tacking paradox.
Quiz on 2.1 Soundness vs Validity in Deductive Logic. Let ~C be the denial of claim C. For each of the following arguments, indicate whether it is valid and sound, valid but unsound, or invalid.
Remember, validity is a matter of form. Any argument with the same form as a valid argument is itself valid. If an argument is not deductively valid, then it is invalid. An invalid argument is one where it’s possible to have an argument with its same form where all the premises are true and the conclusion false. A deductively sound argument must be both valid and have all true premises.
(1) All U.S. senators are male.
Dianne Feinstein is a U.S. senator who is female.
Therefore, it’s not true that all U.S. senators are male. __________________
(2) All U.S. senators are male.
Dianne Feinstein is a U.S. senator.
Therefore, Feinstein is male. __________________
(3) All numbers are either even or odd.
3 is a number but is neither even nor odd.
Therefore, it’s not true that all numbers are even or odd.________________________
So as not to duplicate too closely the form of (1), I had actually wanted the following form for (3). I’ll call it (3)’:
(3)’ If all numbers are either even or odd, then 3 is either even or odd.
3 is neither even nor odd.
Therefore, it’s not true that all numbers are even or odd.________________________
(4) All U.S. senators are female.
Dianne Feinstein is female.
Therefore, Dianne Feinstein is a U.S. senator. ______________________
(5) If all senators are female, then Senators Feinstein and Warren are female.
Senators Feinstein and Warren are female.
Therefore, all senators are female.___________________
(6) If a Normal model M gave good predictions in the 3 cases I applied it, then M will always give good predictions.
Normal model M gave good predictions in the 3 cases I applied it.
Therefore, model M will always give good predictions. __________________________________
(1) – (6) follow patterns in 2.1. Here’s one that’s a bit different for extra credit:
If Normal model M gave good predictions in all the 5 cases I applied it, then M will always give good predictions.
Therefore, if Normal model M ever fails to give good predictions, then M would have failed in at least 1 of the 5 cases I applied it. (Is it valid or invalid?) ____________________
(Answers will be posted in the comments next week; I invite you to post yours in the meantime.)
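For quiz questions like these, validity of an argument form can be checked mechanically with a brute-force truth table: a form is valid just in case no assignment of truth values makes all premises true and the conclusion false. Here is a minimal sketch in Python; the encodings of the two forms (affirming the consequent, as in (5), and modus tollens, the pattern behind falsification) are mine, not from the text:

```python
from itertools import product

def valid(premises, conclusion, n_vars):
    """A form is valid iff no assignment makes all premises true
    and the conclusion false."""
    for vals in product([False, True], repeat=n_vars):
        if all(p(*vals) for p in premises) and not conclusion(*vals):
            return False
    return True

# Pattern of (5), affirming the consequent:
# if P then Q; Q; therefore P.
affirm_consequent = valid(
    premises=[lambda p, q: (not p) or q,   # if P then Q
              lambda p, q: q],             # Q
    conclusion=lambda p, q: p,             # therefore P
    n_vars=2)

# Modus tollens, the pattern of (1) and (3)':
# if P then Q; not-Q; therefore not-P.
modus_tollens = valid(
    premises=[lambda p, q: (not p) or q,
              lambda p, q: not q],
    conclusion=lambda p, q: not p,
    n_vars=2)

print(affirm_consequent)  # False: invalid form
print(modus_tollens)      # True: valid form
```

Soundness, of course, cannot be checked this way: it additionally requires the premises to be true in fact.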
Excursion 2 Tour I concepts: the asymmetry of induction and falsification; argument, sound and valid; enumerative induction (straight rule); problem of induction; confirmation theory (and formal epistemology); statistical affirming the consequent; guide to life; paradox of irrelevant conjunction, tacking paradox; Likelihood Ratio [LR] between H and ~H; the concept “entails severely”; Bayes-Boost (B-boost); absolute vs incremental confirmation; Fisher and Peirce on the faulty analogy between deduction and induction
Where you are in the Journey:
Excursion 1 Tour I: I posted all of Excursion 1 Tour I, here, here, and here.
Excursion 1 Tour II: Except for Souvenir C: A Severe Tester’s Translation Guide, I did not post Excursion 1 Tour II. For the material on Royall and the Law of Likelihood in 1.4 (including a link to an article by Royall), see this post; for stopping rules and the likelihood principle, see this post. That post also offers Museum links to the Savage Forum!
Excursion 2 Tour I: I posted the first stop of Tour I (2.1) here. Material from 2.2 (irrelevant conjunction/tacking paradox) may be found in these blogposts here and here.
For the full Itinerary: SIST Itinerary.
Stephen Senn
Consultant Statistician
Edinburgh
The Rothamsted School
I never worked at Rothamsted but during the eight years I was at University College London (1995-2003) I frequently shared a train journey to London from Harpenden (the village in which Rothamsted is situated) with John Nelder, as a result of which we became friends and I acquired an interest in the software package Genstat®.
That in turn got me interested in John Nelder’s approach to analysis of variance, which is a powerful formalisation of ideas present in the work of others associated with Rothamsted. Nelder’s important predecessors in this respect include, at least, RA Fisher (of course) and Frank Yates and others such as David Finney and Frank Anscombe. John died in 2010 and I regard Rosemary Bailey, who has done deep and powerful work on randomisation and the representation of experiments through Hasse diagrams, as being the greatest living proponent of the Rothamsted School. Another key figure is Roger Payne who turned many of John’s ideas into code in Genstat®.
Lord’s Paradox
Lord’s paradox dates from 1967(1) and I wrote a paper(2) about it in Statistics in Medicine some years ago. It was reading The Book of Why(3) by Judea Pearl and Dana MacKenzie and its interesting account of Pearl’s important work in causal inference that revived my interest in it. I recommend The Book of Why but it has one rather irritating feature. It claims that all that statisticians ever did with causation is point out that correlation does not mean causation. I find this rather surprising, since very little of what I have ever done as a statistician has had anything to do with correlation but rather a lot with causation and I certainly don’t think that I am unusual in this respect. I thought it would be an interesting challenge to illustrate what the Rothamsted school, armed with Genstat®, might make of Lord’s paradox.
Interesting discussions of Lord’s paradox will be found not only in The Book of Why but in papers by Van Breukelen(4) and Wainer and Brown(5). However, the key paper is that of Holland and Rubin(6), which has been an important influence on my thinking. I shall consider the paradox in the Wainer and Brown form, in which it is supposed that we are considering the effect of diet on the weight of students in two halls of residence (say 1 & 2), one providing diet A and the other diet B. The mean weight at the start of observation in September differs between the two halls (it is higher in B than in A), and it differs by exactly the same amount at the end of observation the following June; in fact (though this is not necessary to the paradox), in neither hall has there been any change over time in mean weight. The means at outcome are the same as the means at baseline. The four means are as given in the table below, in which X stands for baseline and Y for outcome, with Y_{B} – Y_{A} = X_{B} – X_{A} = D.

Although the mean weights per hall are the same at outcome as at baseline, some students have lost weight and some have gained weight, so the correlation between baseline and outcome is less than 1. However, the variances are the same at baseline and at outcome and, indeed, from one hall to another, as is the correlation. In further discussion I shall assume that we have the same number of students per hall.
Two statisticians (say John and Jane) now prepare to analyse the data. John uses a so-called change-score (the difference between weight at outcome and weight at baseline for every student). Once he has averaged the scores per hall, he will be calculating the difference
(Y_{B} – X_{B}) – (Y_{A} – X_{A}) = (Y_{B} – Y_{A}) – (X_{B} – X_{A}) = D – D = 0.
John thus concludes that there is no effect of diet on weight. Jane, on the other hand, proposes to use analysis of covariance. This is equivalent to ‘correcting’ each student’s weight at outcome by the within-halls regression of the weight at outcome on baseline. Since the variances at baseline and outcome are the same, this is equivalent to correcting the weights by the correlation coefficient, r. We can skip some of the algebra here but it turns out that Jane calculates
(Y_{B} – Y_{A}) – r(X_{B} – X_{A}) = D – rD = (1 – r)D,
which is not equal to zero unless r = 1. However, that is not the case here and so a difference is observed. Jane, furthermore, finds that this is extremely significantly different from 0. Hence, Jane concludes that there is a difference between the diets. Who is right?
A graphical representation is given in Figure 1, where we can see that if we adjust along the line of equality there is no difference between the halls but if we adjust using the within groups regression there is.
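The two calculations can be reproduced in a small simulation. This is an illustrative sketch only: the specific numbers (100,000 students per hall, D = 5, r = 0.6, sd = 10) are hypothetical, chosen so that the sample estimates sit close to their expectations:

```python
import numpy as np

# Illustrative simulation of the Wainer-Brown set-up; numbers are hypothetical.
rng = np.random.default_rng(0)
n, D, r, sd = 100_000, 5.0, 0.6, 10.0  # large n so estimates sit near expectations

def hall(mean):
    # Baseline X and outcome Y: same mean, same variance, within-hall correlation r.
    cov = sd**2 * np.array([[1.0, r], [r, 1.0]])
    x, y = rng.multivariate_normal([mean, mean], cov, size=n).T
    return x, y

xA, yA = hall(60.0)        # hall A
xB, yB = hall(60.0 + D)    # hall B: heavier by D at baseline and at outcome

# John: difference in mean change scores, (Yb - Xb) - (Ya - Xa); expected 0.
john = (yB - xB).mean() - (yA - xA).mean()

# Jane: ANCOVA, i.e. adjust the outcome difference by the pooled
# within-hall slope b of Y on X (with equal variances, b equals r).
b = (np.cov(xA, yA)[0, 1] + np.cov(xB, yB)[0, 1]) / (
    xA.var(ddof=1) + xB.var(ddof=1))
jane = (yB.mean() - yA.mean()) - b * (xB.mean() - xA.mean())

print(john)  # close to 0
print(jane)  # close to (1 - r) * D = 2.0
```

The same data thus hand John an estimate near zero and Jane one near (1 – r)D, which is exactly the disagreement the paradox turns on.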
The Book of Why points out that the initial weight X is a confounding variable here and not a mediator, stating, ‘Therefore, the second statistician would be unambiguously correct here’ (p. 216). My analysis, however, is slightly different. Basically, I consider that the first statistician is unambiguously wrong but that the second statistician is not unambiguously right. Jane may be right but this depends on assumptions that need to be made explicit. I shall now explain why.
As Holland and Rubin point out, a key to understanding the paradox is to try and think causally: is there a causal question and, if so, what does it imply? The way I usually try and understand these things is by imagining what I would do if it were a reasonable experiment (which, as we shall see, it is not) and then considering what further adjustments are necessary.
Genstat® versus Lord’s paradox
So let us first of all assume that in order to understand the effects of diet, each of the two halls had been randomised to the diet. What would a reasonable analysis be? I shall start the investigation by considering outcomes only and see what John Nelder’s theory of the analysis of experiments(7, 8), as encoded in Genstat® would lead us to conclude. I shall then consider the role of the baseline values. I assume, just for illustration, that we have 100 students per hall and have created a data-set with exactly this situation: two halls, one diet per hall, 100 students per hall.
To analyse the structure, Genstat® requires me to declare the block structure first. This is how the experimental material is organised before anything is done to it. Here we have students nested within halls. This is indicated as follows:
BLOCKSTRUCTURE Hall/Student
Here “/” is the so-called nesting operator. Next, I have to inform the program of the treatment structure. This is quite simple in this case. There is only one treatment and that is Diet. So I write
TREATMENTSTRUCTURE Diet
Note that this difference between blocking and treatment structure is fundamental to John Nelder’s approach and, indeed, where not explicit, implicit in the whole approach of the Rothamsted school to designed experiments and thus to Genstat®. Without taking anything away from the achievements of the causal revolution outlined in The Book of Why it is interesting to note an analogy to the crucial difference between see (block structure) and do (treatment structure) in Pearl’s theory.
Next I need to inform Genstat® what the outcome variable is via an ANOVA statement, for example,
ANOVA Weight
but if I don’t, and just write
ANOVA
all that it will do is produce a so-called null analysis of variance as follows:
Source of variation d.f.
Hall stratum
Diet 1
Hall.Student stratum 198
Total 199
This immediately shows a problem. The problem is, in a sense, obvious, and I am sure many a reader will consider that I have taken a sledgehammer to crack a nut, but in my defence I can say that, however obvious it is, it appears to have been rather overlooked in discussions of Lord’s paradox. The problem is that, as any statistician can tell you, it is not enough to produce an estimate: you also have to produce an estimate of how uncertain the estimate is. Genstat® tells me here that this is impossible. The block structure defines two strata: the hall stratum and the student-within-hall stratum (here indicated by Hall.Student). There is only one degree of freedom in the first stratum, but unfortunately the treatment appears in this stratum and competes for this single degree of freedom with what has to be used for error variation, namely the difference between halls. There is nothing that can be said, using the data only, about how precisely we have estimated the effect of diet, and if this is the case the estimate is useless. What we have is the structure of what in a clinical trials context would be called a cluster randomised trial with only two clusters.
What happens if I re-arrange my experiment to deal with this? Let us accept an implicit practical constraint that we cannot allocate students in the same hall to different diets but let us suppose that we can recruit more halls. Suppose that I could recruit 20 halls, ten for each diet with the same number of students studied in total, so that each hall provides ten students and, as before, I have 200 students. The Genstat® null ANOVA now looks like this.
Source of variation d.f.
Hall stratum
Diet 1
Residual 18
Hall.Student stratum 180
Total 199
We can see now, even more clearly, that it is the hall stratum that provides the residual variation with which we can estimate the precision of the treatment estimate, and furthermore that, whatever the contribution of studying students may be to making our experiment more precise, we cannot use the degrees of freedom within halls to estimate how precise it will be unless we can declare that the contribution of halls to the overall variance is zero. This is an important point to remember, because it is now time we considered the baseline weight.
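The degrees-of-freedom bookkeeping behind the two null ANOVAs above can be reproduced directly. This is a sketch of the arithmetic only; the function is mine, not Genstat® syntax:

```python
# Degrees-of-freedom bookkeeping for a null ANOVA with students
# nested within halls and one diet per hall (two diets in total).

def null_anova(n_halls, students_per_hall, n_treatments=2):
    total = n_halls * students_per_hall - 1
    hall_stratum = n_halls - 1            # between-hall df
    treatment = n_treatments - 1          # Diet
    residual = hall_stratum - treatment   # df left to estimate error for Diet
    within = total - hall_stratum         # Hall.Student stratum
    return treatment, residual, within, total

# Two halls of 100: no residual df in the hall stratum, so the
# precision of the diet estimate cannot be assessed from the data.
print(null_anova(2, 100))   # (1, 0, 198, 199)

# Twenty halls of 10: 18 residual df for the diet comparison.
print(null_anova(20, 10))   # (1, 18, 180, 199)
```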
Suppose that we now stop to compare John’s and Jane’s estimates in terms of our improved experiment. Given that we have more information (if the variance between halls is important), we ought to expect to find that the values will differ less. First note that only the term (Y_{B} – Y_{A}) can reflect the difference in diets and that this term is the same for both John’s and Jane’s estimates. Therefore, any convergence of John’s and Jane’s estimates is not because the term (Y_{B} – Y_{A}) will estimate the causal effect of diet better, although it may very reasonably be expected to do so, since that virtue is reflected in both their approaches. Note also that diet can only affect this term, since the terms involving X_{B} – X_{A} occur before the dietary intervention.
No. The reason that we may expect some convergence is that although the correction term, involving the baselines, is not the same for both statisticians, for both it is a multiple of the same difference. For John we have (X_{B} – X_{A}) and for Jane we have r(X_{B} – X_{A}), and the difference between the two is (1 – r)(X_{B} – X_{A}); this may be expected to get smaller as the number of halls increases. In fact, over all randomisations it is zero, and if we keep the number of students per hall constant but increase the number of halls it approaches zero, so that in some sense we can regard both John and Jane as measuring the same effect marginally.
Now, it is certainly my point of view(9) that we should not be satisfied with such marginal arguments, although they should always be considered because they are calibrating. The consequence of this is that although marginal inferences do not trump conditional ones, if you get your marginal inference wrong you will almost certainly do the same with your conditional one. But suppose we have a large number of halls but notice some particular difference at baseline in weights between the two groups of halls, be it large or be it small. What should we do about it? It turns out that if we can condition on this difference appropriately we will have an estimate that is a) independent of the observed difference and b) efficient (that is to say, has a small variance). It is also known that a way to do this that works, asymptotically at least, is analysis of covariance. So we should adjust the difference in weights at outcome using the differences in weights at baseline in an analysis of covariance. Doesn’t this get us back to Jane’s solution?
Not quite. The relevant difference we would observe is the relevant difference between the groups of halls. What Jane was proposing to do was to use the correlation within halls to correct a difference between halls. However, we have already seen that the Hall.Student stratum is not relevant for judging the variance of the outcomes. Can it automatically be right for judging the covariance? No. It might be an assumption one chooses to make, but it will be a choice, and it certainly cannot be said that this choice would be unambiguously right. If we just rely on the data, then Genstat® will have the baseline covariate entering in both strata, that is to say not only within halls but between them.
Thus, my conclusion is that Jane’s analysis could be right if the within-hall variances and covariances are relevant to the variation across halls. They might be relevant but it is far from obvious that they must be and it therefore does not follow that Jane’s argument is unambiguously right.
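The possibility that the within-hall and between-hall regressions disagree, which is the crux of the objection to treating Jane's analysis as automatically right, is easy to exhibit in a simulation. All numbers here are hypothetical; a shared hall-level effect makes the hall means track each other with a slope near 1 even though the student-level correlation is much weaker:

```python
import numpy as np

# Hypothetical illustration: within-hall and between-hall slopes need not agree.
rng = np.random.default_rng(1)
halls, per_hall = 200, 50

u = rng.normal(0, 8, halls)      # hall effect, shared by baseline X and outcome Y
r_w = 0.3                        # within-hall (student-level) correlation
cov = 25 * np.array([[1.0, r_w], [r_w, 1.0]])
e = rng.multivariate_normal([0.0, 0.0], cov, size=(halls, per_hall))

X = 60 + u[:, None] + e[:, :, 0]
Y = 60 + u[:, None] + e[:, :, 1]

# Pooled within-hall slope: regress Y on X after removing hall means.
Xw = X - X.mean(axis=1, keepdims=True)
Yw = Y - Y.mean(axis=1, keepdims=True)
slope_within = (Xw * Yw).sum() / (Xw ** 2).sum()

# Between-hall slope: regress hall means of Y on hall means of X.
xm, ym = X.mean(axis=1), Y.mean(axis=1)
slope_between = np.cov(xm, ym)[0, 1] / xm.var(ddof=1)

print(slope_within)   # near r_w = 0.3
print(slope_between)  # near 1, driven by the shared hall effect
```

Using the within-hall slope to adjust a between-hall difference, as Jane does, is then a substantive choice, not a data-driven necessity.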
Of course, what I described was what one would decide for an experiment. You may choose to disagree that such a supposedly randomised experiment could provide any guidance for something quasi-experimental such as Lord’s paradox. After all, we are not told that the diets were randomised to the halls. I am not so sure. I think that in this case, at least, the quasi-experimental set-up inherits the problems that the similar randomised experiment would show. I think that it is far from obvious that what Jane proposes to do is unambiguously right.
Whether you agree or disagree, I hope I have succeeded in showing you that statistical theory, and in particular careful examination of variation, a topic initiated by RA Fisher one hundred years ago(10) and for which he proposed the squared measure and gave it the name variance, goes beyond merely warning that correlation is not causation. Sometimes correlation isn’t even correlation.
(Associated slides are below.)
References
I will continue to post mementos and, at times, short excerpts following the pace of one “Tour” a week, in sync with some book clubs reading Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST or Statinfast 2018, CUP), e.g., Lakens. This puts us at Excursion 2 Tour I, but first, here’s a quick Souvenir (Souvenir C) from Excursion 1 Tour II:
Souvenir C: A Severe Tester’s Translation Guide
Just as in ordinary museum shops, our souvenir literature often probes treasures that you didn’t get to visit at all. Here’s an example of that, and you’ll need it going forward. There’s a confusion about what’s being done when the significance tester considers the set of all of the outcomes leading to a d(x) greater than or equal to 1.96, i.e., {x: d(x) ≥ 1.96}, or just d(x) ≥ 1.96. This is generally viewed as throwing away the particular x, and lumping all these outcomes together. What’s really happening, according to the severe tester, is quite different. What’s actually being signified is that we are interested in the method, not just the particular outcome. Those who embrace the LP make it very plain that data-dependent selections and stopping rules drop out. To get them to drop in, we signal an interest in what the test procedure would have yielded. This is a counterfactual and is altogether essential in expressing the properties of the method, in particular, the probability it would have yielded some nominally significant outcome or other.
When you see Pr(d(X) ≥ d(x_{0}); H_{0}), or Pr(d(X) ≥ d(x_{0}); H_{1}), for any particular alternative of interest, insert:
“the test procedure would have yielded”
just before the d(X). In other words, this expression, with its inequality, is a signal of interest in, and an abbreviation for, the error probabilities associated with a test.
Applying the Severity Translation. In Exhibit (i), Royall described a significance test with a Bernoulli(θ) model, testing H_{0}: θ ≤ 0.2 vs. H_{1}: θ > 0.2. We blocked an inference from observed difference d(x_{0}) ≃ 3.3 to θ = 0.8 as follows. (Recall that x̄ = 0.53 and d(x_{0}) ≃ 3.3.)
We computed Pr(d(X) > 3.3; θ = 0.8) ≃ 1.
We translate it as Pr(the test would yield d(X) > 3.3; θ = 0.8) ≃ 1.
We then reason as follows:
Statistical inference: If θ = 0.8, then the method would virtually always give a difference larger than what we observed. Therefore, the data indicate θ < 0.8.
(This follows for rejecting H_{0} in general.) When we ask: “How often would your test have found such a significant effect even if H_{0} is approximately true?” we are asking about the properties of the experiment that did happen. The counterfactual “would have” refers to how the procedure would behave in general, not just with these data, but with other possible data sets in the sample space.
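The numbers in this example can be checked in a few lines. The sample size is not stated in this excerpt; n = 16 is my assumption, chosen because it makes d(x_0) come out at 3.3 under the normal approximation to the Bernoulli mean:

```python
from math import sqrt, erf

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# n = 16 is an assumption (not stated in the excerpt); it makes d(x0) = 3.3.
n, theta0, xbar = 16, 0.2, 0.53

# Test statistic under H0: theta <= 0.2 (normal approximation).
d_x0 = (xbar - theta0) / sqrt(theta0 * (1 - theta0) / n)
print(round(d_x0, 2))   # 3.3

# "The test procedure would have yielded d(X) > 3.3" under theta = 0.8:
theta1 = 0.8
se1 = sqrt(theta1 * (1 - theta1) / n)
prob = 1 - Phi((xbar - theta1) / se1)
print(round(prob, 3))   # about 0.997, i.e. virtually 1
```

Since the method would almost always have yielded an even larger difference were θ = 0.8, the observed d(x_0) = 3.3 is grounds to infer θ < 0.8, just as the translation says.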
Below are my slides from a session on replication at the recent Philosophy of Science Association meetings in Seattle.
Excursion 1 Tour II: Error Probing Tools vs. Logics of Evidence
Blurb. Core battles revolve around the relevance of a method’s error probabilities. What’s distinctive about the severe testing account is that it uses error probabilities evidentially: to assess how severely a claim has passed a test. Error control is necessary but not sufficient for severity. Logics of induction focus on the relationships between given data and hypotheses – so outcomes other than the one observed drop out. This is captured in the Likelihood Principle (LP). Tour II takes us to the crux of central wars in relation to the Law of Likelihood (LL) and Bayesian probabilism. (1.4) Hypotheses deliberately designed to accord with the data can result in minimal severity. The likelihoodist wishes to oust them via degrees of belief captured in prior probabilities. To the severe tester, such gambits directly alter the evidence by leading to inseverity. (1.5) Stopping rules: if a tester tries and tries again until significance is reached – optional stopping – significance will be attained erroneously with high probability. According to the LP, the stopping rule doesn’t alter evidence. The irrelevance of optional stopping is an asset for holders of the LP; it’s the opposite for a severe tester. The warring sides talk past each other.
1.4 The Law of Likelihood and Error Statistics: Key Items
Ian Hacking (1965) – the Law of Likelihood.
Law of Likelihood (LL): Data x are better evidence for hypothesis H1 than for H0 if x is more probable under H1 than under H0.
Likelihoods are defined and several examples are given.
Likelihoods of hypotheses should not be confused with their probabilities.
The Law of Likelihood (LL) is seen to fail the minimal severity requirement – at least if it is taken as an account of inference.
Gellerized hypotheses: maximally fitting, but minimally severely tested, hypotheses.
We observe one outcome, but we can consider that for any outcome, unless it makes H0 maximally likely, we can find an H1 that is more likely.
A severity assessment is one level removed: you give me the rule, and I consider its latitude for erroneous outputs.
Sampling distribution.
Richard Royall: He distinguishes three questions: belief, action, and evidence:
Exhibit (i): Law of Likelihood Compared to a Significance Test.
Why the LL Rejects Composite Hypotheses
Royall holds that all attempts to say whether x is good evidence for H, or even if x is better evidence for H than is y, are futile. Similarly,
“What does the [LL] say when one hypothesis attaches the same probability to two different observations? It says absolutely nothing . . . [it] applies when two different hypotheses attach probabilities to the same observation” (Royall 2004, p. 148).
The severe tester distinguishes the evidential warrant for one and the same hypothesis H in two cases: one where it was constructed post hoc, cherry picked, and so on, a second where it was predesignated.
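Here is a minimal sketch of how post hoc “Gellerization” guarantees a likelihood-ratio “win” (my illustration, not the book’s own example), assuming Bernoulli trials with a predesignated H0: θ = 0.5:

```python
import math

# Sketch of "Gellerization": whatever outcome x we observe in n Bernoulli
# trials, the hypothesis theta = x/n, constructed after seeing the data,
# is maximally likely, so the Law of Likelihood favors it over the
# predesignated H0: theta = 0.5 -- for every possible outcome.
def binom_lik(theta, x, n):
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

n = 20
for x in range(n + 1):
    assert binom_lik(x / n, x, n) >= binom_lik(0.5, x, n)  # H1 always "wins"

x = 16                                # e.g., 16 heads in 20 tosses
lr = binom_lik(x / n, x, n) / binom_lik(0.5, x, n)
print(f"likelihood ratio favoring the post hoc H1: {lr:.1f}")
```

The data-dressed hypothesis fits maximally by construction – which is precisely why fit alone gives it minimal severity.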
Souvenir B: Likelihood versus Error Statistical
To the Likelihoodist, points in favor of the LL are:
To the error statistician, problems with the LL include:
Notice, the last two points are identical for both. What’s a selling point for a Likelihoodist is a problem for an error statistician.
1.5 Trying and Trying Again: Key Items
“Trying and trying again” to achieve statistical significance; stopping rules and their relevance/irrelevance.
Edwards, Lindman, and Savage (E, L, & S, 1963).
Simmons, Nelson, and Simonsohn
The Likelihood Principle (LP).
Weak Repeated Sampling Principle.
(Cox and Hinkley 1974, p. 51). “[W]e should not follow procedures which for some possible parameter values would give, in hypothetical repetitions, misleading conclusions most of the time” (ibid., pp. 45–6).
The 1959 Savage Forum
Arguments from Intentions:
Error Probabilities Violate the LP
Problem of “known (or old) evidence” made famous by Clark Glymour (1980).
Souvenir C: A Severe Tester’s Translation Guide [i]
HOW TO FIND MATERIAL FROM EXCURSION 1 TOUR II (if you don’t have a copy of the book). I have not posted Excursion 1 Tour II (I did post Tour I). (Andrew Gelman may post a draft for a possible discussion on his blog.)
However, there are posts on this blog that cover much of the material from 1.4 and 1.5 (in blog form). For the material on Royall and the Law of Likelihood in 1.4 (including a link to an article by Royall), see this post; for stopping rules and the likelihood principle, see this post. That post also offers Museum links to the Savage Forum! You can also search this blog for terms of interest, and there’s quite a lot on those in 1.4 and 1.5. Have fun! Please share comments, queries, favorite quotes, etc.
[i] I may post Souvenir C separately.
Tour Guide Mementos (Excursion 1, Tour I of How to Get Beyond the Statistics Wars)
FOR ALL OF TOUR I (proofs): SIST Excursion 1 Tour I
I’ve been asked if I agree with Regina Nuzzo’s recent note on p-values [i]. I don’t want to be nit-picky, but one very small addition to Nuzzo’s helpful tips for communicating statistical significance can make it a great deal more helpful. Here’s my friendly amendment. She writes:
Basics to remember
What’s most important to keep in mind? That we use p-values to alert us to surprising data results, not to give a final answer on anything. (Or at least that’s what we should be doing). And that results can get flagged as “statistically surprising” with a small p-value for a number of reasons:
- There was a fluke. Something unusual happened in the data just by chance.
- Something was violated. By this I mean there was a mismatch between what was actually done in the data analysis and what needed to be done for the p-value to be a valid indicator. One little-known requirement, for example, is that the data analysis be planned before looking at the data. Another is that all analyses and results be presented, no matter the outcome. Yes, these seem like strange, nit-picky rules, but they’re part of the deal when using p-values. A small p-value might simply be a sign that data analysis rules were broken.
- There was a real but tiny relationship, so tiny that we shouldn’t really care about it. A large trial can detect a true effect that is too small to matter at all, but a p-value will still flag it as being surprising.
- There was a relationship that is worth more study. There’s more to be done. Can the result be replicated? Is the effect big enough to matter? How does it relate to other studies?
Or any combination of the above. (Nuzzo)
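Nuzzo’s third bullet – a real but tiny relationship – can be seen with a little arithmetic. A sketch (my illustration, not Nuzzo’s), assuming a true shift of 0.01 standard deviations, σ = 1, and an observed mean equal to the true shift:

```python
import math

# Sketch: a true effect of 0.01 standard deviations -- negligible
# practically -- still earns a tiny p-value once the sample is large enough.
def two_sided_p(z):
    return math.erfc(abs(z) / math.sqrt(2))   # two-sided Normal p-value

for n in (100, 10_000, 1_000_000):
    z = 0.01 * math.sqrt(n)                   # z = effect / (sigma / sqrt(n))
    print(f"n = {n:>9,}: z = {z:5.2f}, p = {two_sided_p(z):.2e}")
```

At n = 100 the effect is invisible; at n = 1,000,000 the same trivial effect is flagged as wildly “surprising.” The p-value tracks detectability, not importance.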
My tiny addition is to the next to last sentence of #2 (“the dark horse that we can’t ignore”). I suggest replacing it with (something like):
One little-known requirement, for example, is that the data analysis be planned before looking at the data. Another is that all analyses and results be presented, no matter the outcome. Yes, these seem like strange, nit-picky rules, but they’re part of the deal in making statistical inferences (or the like).
Why is that tiny addition (in bold) so important? Because without it many people suppose that other statistical methods don’t have to worry about post data selections, selective reporting and the like. With all the airplay it has been receiving – the acknowledged value of preregistration, the “21 word solution”[ii] and many other current reforms – hopefully the danger of ad hoc moves in the “forking paths” (Gelman and Loken 2014) in collecting and interpreting data is no longer a “little-known requirement”. But it’s very important to emphasize that the potential to mislead is not unique to statistical significance tests. The same P-hacked hypothesis can find its way into likelihood ratios, Bayes factors, posterior probabilities and credibility regions. In their recent Significance article, “Cargo-cult statistics”, Stark and Saltelli emphasize:
“The misuse of p-values, hypothesis tests, and confidence intervals might be deemed frequentist cargo-cult statistics. There is also Bayesian cargo-cult statistics. While a great deal of thought has been given to methods for eliciting priors, in practice, priors are often chosen for convenience or out of habit; perhaps worse, some practitioners choose the prior after looking at the data, trying several priors, and looking at the results – in which case Bayes’ rule no longer applies!”
[i]I’m always asked these things nowadays on twitter, and I’m keen to bring people back to my blog as a far more appropriate place to actually have a discussion.
[ii] “We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study” (Simmons, Nelson, and Simonsohn 2012, p. 4).
References not linked to
Gelman, A. and Loken, E. (2014). ‘The Statistical Crisis in Science’, American Scientist 102, 460–5.
Simmons, J., Nelson, L., and Simonsohn, U. (2012). ‘A 21 word solution’, Dialogue: The Official Newsletter of the Society for Personality and Social Psychology 26(2), 4–7.
I came across this anomaly on Christian Robert’s blog.
Last week, I received this new book of Deborah Mayo, which I was looking forward reading and annotating!, but thrice alas, the book had been sabotaged: except for the preface and acknowledgements, the entire book is printed upside down [a minor issue since the entire book is concerned] and with some part of the text cut on each side [a few letters each time but enough to make reading a chore!]. I am thus waiting for a tested copy of the book to start reading it in earnest!
How bizarre, my book has been slashed with a knife, cruelly stabbing the page, letting words bleed out helter skelter. Some part of the text cut on each side? It wasn’t words with “Bayesian” in them was it? The only anomalous volume I’ve seen has a slightly crooked cover. Do you think it is the Book Slasher out for Halloween, or something more sinister? It’s a bit like serving the Michelin restaurant reviewer by dropping his meal on the floor, or accidentally causing a knife wound. I hope they remedy this quickly. (Talk about Neyman and quality control.)
Readers: Feel free to use the comments to share your particular tale of woe in acquiring the book.
Tour guides in your travels jot down Mementos and Keepsakes from each Tour[i] of my new book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP 2018). Their scribblings, which may at times include details, at other times just a word or two, may be modified through the Tour, and in response to questions from travelers (so please check back). Since these are just mementos, they should not be seen as replacements for the more careful notions given in the journey (i.e., book) itself. Still, you’re apt to flesh out your notes in greater detail, so please share yours (along with errors you’re bound to spot), and we’ll create Meta-Mementos.
Excursion 1. Tour I: Beyond Probabilism and Performance
Notes from Section 1.1 Severity Requirement: Bad Evidence, No Test (BENT)
1.1 Terms (quick looks, to be crystalized as we journey on)
error statistical account: one that revolves around the control and assessment of a method’s error probabilities. An inference is qualified by the error probability of the method that led to it.
(This replaces common uses of “frequentist” which actually has many other connotations.)
error statistician: one who uses error statistical methods.
severe testers: a proper subset of error statisticians: those who use error probabilities to assess and control severity. (They may use them for other purposes as well.)
The severe tester also requires reporting what has been poorly probed and inseverely tested.
Error probabilities can, but don’t necessarily, provide assessments of the capability of methods to reveal or avoid mistaken interpretations of data. When they do, they may be used to assess how severely a claim passes a test.
We can keep to testing language as part of the meta-language we use to talk about formal statistical methods, where the latter include estimation, exploration, prediction, and data analysis.
There’s a diﬀerence between ﬁnding H poorly tested by data x, and ﬁnding x renders H improbable – in any of the many senses the latter takes on.
H: Isaac knows calculus.
x: results of a coin flipping experiment
Even taking H to be true, data x has done nothing to probe the ways in which H might be false.
5. R.A. Fisher, against isolated statistically significant results (p. 4).
[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher 1935b/1947, p. 14)
Notes from Section 1.2 of SIST: How to get beyond the stat wars
6. statistical philosophy (associated with a statistical methodology): core ideas that direct its principles, methods, and interpretations.
two main philosophies about the roles of probability in statistical inference: performance (in the long run) and probabilism.
(i) performance: probability functions to control and assess the relative frequency of erroneous inferences in some long run of applications of the method
(ii) probabilism: probability functions to assign degrees of belief, support, or plausibility to hypotheses. They may be non-comparative (a posterior probability) or comparative (a likelihood ratio or Bayes Factor)
Severe testing introduces a third:
(iii) probativism: probability functions to assess and control a method’s capability of detecting mistaken inferences, i.e., the severity associated with inferences.
• Performance is a necessary but not a sufficient condition for probativeness.
• Just because an account is touted as having a long-run rationale, it does not mean it lacks a short run rationale, or even one relevant for the particular case at hand.
7. Severity strong (argument from coincidence):
We have evidence for a claim C just to the extent it survives a stringent scrutiny. If C passes a test that was highly capable of finding flaws or discrepancies from C, and yet no or few are found, then the passing result, x, is evidence for C.
lift-off vs drag down
(i) lift-off: an overall inference can be more reliable and precise than its premises individually.
(ii) drag-down: An overall inference is only as reliable/precise as is its weakest premise.
• Lift-off is associated with convergent arguments, drag-down with linked arguments.
• statistics is the science par excellence for demonstrating lift-off!
8. arguing from error: there is evidence an error is absent to the extent that a procedure with a high capability of signaling the error, if and only if it is present, nevertheless detects no error.
Bernoulli (coin tossing) model: we record success or failure, assume a fixed probability of success θ on each trial, and that trials are independent. (P-value in the case of the Lady Tasting Tea, pp. 16–17.)
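A quick sketch of a Bernoulli-model P-value in the spirit of the Lady Tasting Tea (the book’s exact setup on pp. 16–17 may differ; here the null is that the taster merely guesses on each of n independent trials):

```python
import math

# Sketch: a simple Bernoulli-model P-value. Under H0 the taster merely
# guesses, so each of n independent trials succeeds with probability 0.5.
def p_value(successes, n, theta0=0.5):
    # P(X >= successes) under the Binomial(n, theta0) null
    return sum(math.comb(n, k) * theta0**k * (1 - theta0)**(n - k)
               for k in range(successes, n + 1))

n, observed = 10, 9
print(f"P(at least {observed} of {n} correct | guessing) = {p_value(observed, n):.4f}")
```

Nine or more correct out of ten has probability 11/1024 ≈ 0.011 under mere guessing – small, but per Fisher (note 5 above), an isolated record still wants a reliable method of procedure behind it.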
Error probabilities can be readily invalidated due to how the data (and hypotheses!) are generated or selected for testing.
9. computed (or nominal) vs actual error probabilities: You may claim it’s very difficult to get such an impressive result due to chance, when in fact it’s very easy to do so with selective reporting (e.g., your computed P-value can be small, but the actual P-value is high).
Examples: Peirce and Dr. Playfair (a law is inferred even though half of the cases required Playfair to modify the formula after the fact); Texas marksman (shooting prowess inferred from shooting bullets into the side of a barn, and painting a bull’s eye around clusters of bullet holes); Pickrite stock portfolio (Pickrite’s effectiveness at stock picking is inferred based on selecting those on which the “method” did best).
• We appeal to the same statistical reasoning to show the problematic cases as to show genuine arguments from coincidence.
• A key role for statistical inference is to identify ways to spot egregious deceptions and create strong arguments from coincidence.
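A simulation sketch of note 9 (my illustration): search twenty null hypotheses, report only the most impressive, and compare the computed (nominal) P-value with the actual probability of finding something that impressive by searching:

```python
import math
import random

# Sketch: the *computed* P-value of a cherry-picked result ignores the
# search, while the *actual* error probability accounts for it.
def one_sided_p(z):
    return math.erfc(z / math.sqrt(2)) / 2    # P(Z >= z) under the null

rng = random.Random(7)
m = 20                                         # hypotheses searched; best reported
zs = [rng.gauss(0, 1) for _ in range(m)]       # every null is true
p_computed = one_sided_p(max(zs))              # the "nominal" P-value reported
p_actual = 1 - (1 - p_computed)**m             # chance the best of 20 looks this good
print(f"computed P = {p_computed:.3f}, actual P = {p_actual:.3f}")
```

The reported P-value looks impressive only because the nineteen less impressive results are kept offstage – the Texas marksman in miniature.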
10. Auditing a P-value (one part): checking whether the results are due to selective reporting, cherry picking, trying and trying again, or any number of other similar ruses.
• Replicability isn’t enough. Example: observational studies on Hormone Replacement Therapy (HRT) reproducibly showed benefits, but had little capacity to unearth biases due to “the healthy women’s syndrome.”
Souvenir A.[ii] Postcard to Send: the 4 fallacies from the opening of 1.1.
• We should oust mechanical, recipe-like uses of statistical methods long lampooned.
• But simple significance tests have their uses, and shouldn’t be ousted simply because some people are liable to violate Fisher’s warnings.
• They have the means by which to register formally the fallacies in the postcard list. (Failed statistical assumptions, selection effects alter a test’s error probing capacities).
• Don’t throw out the error control baby with the bad statistics bathwater.
10. severity requirement (weak): If data x agree with a claim C but the method was practically incapable of finding flaws with C even if they exist, then x is poor evidence for C.
severity (strong): If C passes a test that was highly capable of finding flaws or discrepancies from C, and yet no or few are found, then the passing result, x, is an indication of, or evidence for, C.
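As a preview of how later Tours make severity computable, here is a sketch of the severity calculation for a one-sided Normal test as it appears in the error-statistical literature (my paraphrase with illustrative numbers, not a quotation from the book):

```python
import math

# Preview sketch: severity for the claim mu > mu1 after a one-sided Normal
# test T+ of H0: mu <= 0 rejects with observed mean xbar (sigma known).
def Phi(z):
    return 0.5 * math.erfc(-z / math.sqrt(2))   # standard Normal CDF

def severity(mu1, xbar, sigma=1.0, n=100):
    # SEV(mu > mu1) = P(a smaller mean than xbar would occur; mu = mu1)
    se = sigma / math.sqrt(n)
    return Phi((xbar - mu1) / se)

xbar = 0.2                                      # observed mean; se = 0.1
for mu1 in (0.0, 0.1, 0.2, 0.3):
    print(f"SEV(mu > {mu1}) = {severity(mu1, xbar):.3f}")
```

The same observed mean severely warrants mu > 0 (SEV ≈ 0.98) but not mu > 0.3 (SEV ≈ 0.16): what’s assessed is how well probed a claim is, not how believable.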
Notes from Section 1.3: The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon
The Bayesian versus frequentist dispute parallels disputes between probabilism and performance.
-Using Bayes’ Theorem doesn’t make you a Bayesian.
-subjective Bayesianism and non-subjective (default) Bayesians
11. Advocates of uniﬁcations are keen to show that (i) default Bayesian methods have good performance in a long series of repetitions – so probabilism may yield performance; or alternatively, (ii) frequentist quantities are similar to Bayesian ones (at least in certain cases) – so performance may yield probabilist numbers. Why is this not bliss? Why are so many from all sides dissatisﬁed?
It had long been assumed that only subjective or personalistic Bayesianism had a shot at providing genuine philosophical foundations, but some Bayesians have come to question whether the widespread use of methods under the Bayesian umbrella, however useful, indicates support for subjective Bayesianism as a foundation.
Marriages of Convenience? The current frequentist–Bayesian uniﬁcations are often marriages of convenience;
-some are concerned that methodological conﬂicts are bad for the profession.
-frequentist tribes have not disappeared; scientists still call for error control.
-Frequentists’ incentive to marry: Lacking a suitable epistemic interpretation of error probabilities – significance levels, power, and conﬁdence levels – frequentists are constantly put on the defensive.
Eclecticism and Ecumenism. Current-day eclecticisms have a long history – the dabbling in tools from competing statistical tribes has not been thought to pose serious challenges.
Decoupling. On the horizon is the idea that statistical methods may be decoupled from the philosophies in which they are traditionally couched (e.g., Gelman and Cosma Shalizi 2013). The concept of severe testing is suﬃciently general to apply to any of the methods now in use.
Why Our Journey? To disentangle the jungle. Being hesitant to reopen wounds from old battles does not heal them. They show up in the current problems of scientific integrity, irreproducibility, questionable research practices, and in the swirl of methodological reforms and guidelines that spin their way down from journals and reports.
How it occurs: the new stat scrutiny (arising from failures of replication) collects from:
-the earlier social science “significance test controversy”
-the traditional frequentist and Bayesian accounts, and corresponding frequentist-Bayesian wars
-the newer Bayesian–frequentist uniﬁcations (non-subjective, default Bayesianism)
This jungle has never been disentangled.
Your Tour Guide invites your questions in the comments.
[i] As these are scribbled down in notebooks through ocean winds, wetlands and insects, do not expect neatness. Please share improvements and corrections.
[ii] For a free copy of “Statistical Inference as Severe Testing”, send me your conception of Souvenir A, your real souvenir A, or a picture of your real Souvenir A (through Nov 16, 2018).
In my talk yesterday at the Philosophy Department at Virginia Tech, I introduced my new book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Cambridge 2018). I began with my preface (explaining the meaning of my title), and turned to the Statistics Wars, largely from Excursion 1 of the book. After the sum-up at the end, I snuck in an example from the replication crisis in psychology. Here are the slides.
Where you are in the Journey* We’ll move from the philosophical ground ﬂoor to connecting themes from other levels, from Popperian falsiﬁcation to signiﬁcance tests, and from Popper’s demarcation to current-day problems of pseudoscience and irreplication. An excerpt from our Museum Guide gives a broad-brush sketch of the ﬁrst few sections of Tour II:
Karl Popper had a brilliant way to “solve” the problem of induction: Hume was right that enumerative induction is unjustiﬁed, but science is a matter of deductive falsiﬁcation. Science was to be demarcated from pseudoscience according to whether its theories were testable and falsiﬁable. A hypothesis is deemed severely tested if it survives a stringent attempt to falsify it. Popper’s critics denied he could sustain this and still be a deductivist …
Popperian falsiﬁcation is often seen as akin to Fisher’s view that “every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis” (1935a, p. 16). Though scientists often appeal to Popper, some critics of signiﬁcance tests argue that they are used in decidedly non-Popperian ways. Tour II explores this controversy.
While Popper didn’t make good on his most winning slogans, he gives us many seminal launching-oﬀ points for improved accounts of falsiﬁcation, corroboration, science versus pseudoscience, and the role of novel evidence and predesignation. These will let you revisit some thorny issues in today’s statistical crisis in science.
Here’s Popper’s summary (drawing from Popper, Conjectures and Refutations, 1962, p. 53):
- [Enumerative] induction … is a myth. It is neither a psychological fact … nor one of scientific procedure.
- The actual procedure of science is to operate with conjectures…
- Repeated observation and experiments function in science as tests of our conjectures or hypotheses, i.e., as attempted refutations.
- [It is wrongly believed that using the inductive method can] serve as a criterion of demarcation between science and pseudoscience. … None of this is altered in the least if we say that induction makes theories only probable.
There are four key, interrelated themes:
(1) Science and Pseudoscience. Redeﬁning scientiﬁc method gave Popper a new basis for demarcating genuine science from questionable science or pseudoscience. Flexible theories that are easy to conﬁrm – theories of Marx, Freud, and Adler were his exemplars – where you open your eyes and ﬁnd conﬁrmations everywhere, are low on the scientiﬁc totem pole (ibid., p. 35). For a theory to be scientiﬁc it must be testable and falsiﬁable.
(2) Conjecture and Refutation. The problem of induction is a problem only if it depends on an unjustiﬁable procedure such as enumerative induction. Popper shocked everyone by denying scientists were in the habit of inductively enumerating. It doesn’t even hold up on logical grounds. To talk of “another instance of an A that is a B” assumes a conceptual classiﬁcation scheme. How else do we recognize it as another item under the umbrellas A and B? (ibid., p. 44). You can’t just observe, you need an interest, a point of view, a problem.
The actual procedure for learning in science is to operate with conjectures in which we then try to ﬁnd weak spots and ﬂaws. Deductive logic is needed to draw out the remote logical consequences that we actually have a shot at testing (ibid., p. 51). From the scientist down to the amoeba, says Popper, we learn by trial and error: conjecture and refutation (ibid., p. 52). The crucial diﬀerence is the extent to which we constructively learn how to reorient ourselves after clashes.
Without waiting, passively, for repetitions to impress or impose regularities upon us, we actively try to impose regularities upon the world… These may have to be discarded later, should observation show that they are wrong. (ibid., p. 46)
(3) Observations Are Not Given. Popper rejected the time-honored empiricist assumption that observations are known relatively unproblematically. If they are at the “foundation,” it is only because there are apt methods for testing their validity. We dub claims observable because or to the extent that they are open to stringent checks. (Popper: “anyone who has learned the relevant technique can test it” (1959, p. 99).) Accounts of hypothesis appraisal that start with “evidence x,” as in conﬁrmation logics, vastly oversimplify the role of data in learning.
(4) Corroboration Not Conﬁrmation, Severity Not Probabilism. Last, there is his radical view on the role of probability in scientiﬁc inference. Rejecting probabilism, Popper not only rejects Carnap-style logics of conﬁrmation, he denies scientists are interested in highly probable hypotheses (in any sense). They seek bold, informative, interesting conjectures and ingenious and severe attempts to refute them. If one uses a logical notion of probability, as philosophers (including Popper) did at the time, the high content theories are highly improbable; in fact, Popper said universal theories have 0 probability. (Popper also talked of statistical probabilities as propensities.)
These themes are in the spirit of the error statistician. Considerable spade-work is required to see what to keep and what to revise, so bring along your archeological shovels.
There is a reason that statisticians and scientists often refer back to Popper; his basic philosophy – at least his most winning slogans – are in sync with ordinary intuitions about good scientific practice. Even people divorced from Popper’s full philosophy wind up going back to him when they need to demarcate science from pseudoscience. Popper’s right that if using enumerative induction makes you scientific then anyone from an astrologer to one who blithely moves from observed associations to full blown theories is scientific. Yet the criterion of testability and falsifiability – as typically understood – is nearly as bad. It is both too strong and too weak. Any crazy theory found false would be scientific, and our most impressive theories aren’t deductively falsifiable. Larry Laudan’s famous (1983) “The Demise of the Demarcation Problem” declared the problem taboo. This is a highly unsatisfactory situation for philosophers of science. Now Laudan and I generally see eye to eye; perhaps our disagreement here is just semantics. I share his view that what really matters is determining if a hypothesis is warranted or not, rather than whether the theory is “scientific,” but surely Popper didn’t mean logical falsifiability sufficed. Popper is clear that many unscientific theories (e.g., Marxism, astrology) are falsifiable. It’s clinging to falsified theories that leads to unscientific practices. (Note: The use of a strictly falsified theory for prediction, or because nothing better is available, isn’t unscientific.) I say that, with a bit of fine-tuning, we can retain the essence of Popper to capture what makes an inquiry, if not an entire domain, scientific.
Following Laudan, philosophers tend to shy away from saying anything general about science versus pseudoscience – the predominant view is that there is no such thing. Some say that there’s at most a kind of “family resemblance” amongst domains people tend to consider scientific (Dupré 1993, Pigliucci 2010, 2013). One gets the impression that the demarcation task is being left to committees investigating allegations of poor science or fraud. They are forced to articulate what to count as fraud, as bad statistics, or as mere questionable research practices (QRPs). People’s careers depend on their ruling: they have “skin in the game,” as Nassim Nicholas Taleb might say (2018). The best one I know – the committee investigating fraudster Diederik Stapel – advises making philosophy of science a requirement for researchers (Levelt Committee, Noort Committee, and Drenth Committee 2012). So let’s not tell them philosophers have given up on it.
Diederik Stapel. A prominent social psychologist “was found to have committed a serious infringement of scientiﬁc integrity by using ﬁctitious data in his publications” (Levelt Committee 2012, p. 7). He was required to retract 58 papers, relinquish his university degree and much else. The authors of the report describe walking into a culture of conﬁrmation and veriﬁcation bias. They could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientiﬁc method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (ibid., p. 48). Free of the qualms that give philosophers of science cold feet, they advance some obvious yet crucially important rules with Popperian echoes:
One of the most fundamental rules of scientiﬁc research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that conﬁrm the research hypotheses. Violations of this fundamental rule, such as continuing to repeat an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tend to conﬁrm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts. (ibid., p. 48)
Exactly! This is our minimal requirement for evidence: If it’s so easy to ﬁnd agreement with a pet theory or claim, such agreement is bad evidence, no test, BENT. To scrutinize the scientiﬁc credentials of an inquiry is to determine if there was a serious attempt to detect and report errors and biasing selection eﬀects. We’ll meet Stapel again when we reach the temporary installation on the upper level: The Replication Revolution in Psychology.
The issue of demarcation (point (1)) is closely related to Popper’s conjecture and refutation (point (2)). While he regards a degree of dogmatism to be necessary before giving theories up too readily, the trial and error methodology “gives us a chance to survive the elimination of an inadequate hypothesis – when a more dogmatic attitude would eliminate it by eliminating us” (Popper 1962, p. 52). Despite giving lip service to testing and falsiﬁcation, many popular accounts of statistical inference do not embody falsiﬁcation – even of a statistical sort.
Nearly everyone, however, now accepts point (3), that observations are not just “given”– knocking out a crucial pillar on which naïve empiricism stood. To the question: What came ﬁrst, hypothesis or observation? Popper answers, another hypothesis, only lower down or more local. Do we get an inﬁnite regress? No, because we may go back to increasingly primitive theories and even, Popper thinks, to an inborn propensity to search for and ﬁnd regularities (ibid., p. 47). I’ve read about studies appearing to show that babies are aware of what is statistically unusual. In one, babies were shown a box with a large majority of red versus white balls (Xu and Garcia 2008, Gopnik 2009). When a succession of white balls are drawn, one after another, with the contents of the box covered with a screen, the babies looked longer than when the more common red balls were drawn. I don’t ﬁnd this far-fetched. Anyone familiar with preschool computer games knows how far toddlers can get in solving problems without a single word, just by trial and error.
Greater Content, Greater Severity. The position people are most likely to take a pass on is (4), his view of the role of probability. Yet Popper’s central intuition is correct: if we wanted highly probable claims, scientists would stick to low-level observables and not seek generalizations, much less theories with high explanatory content. In this day of fascination with Big Data’s ability to predict what book I’ll buy next, a healthy Popperian reminder is due: humans also want to understand and to explain. We want bold “improbable” theories. I’m a little puzzled when I hear leading machine learners praise Popper, a realist, while proclaiming themselves fervid instrumentalists. That is, they hold the view that theories, rather than aiming at truth, are just instruments for organizing and predicting observable facts. It follows from the success of machine learning, Vladimir Cherkassky avers, that “realism is not possible.” This is very quick philosophy! “… [I]n machine learning we are given a set of [random] data samples, and the goal is to select the best model (function, hypothesis) from a given set of possible models” (Cherkassky 2012). Fine, but is the background knowledge required for this setup itself reducible to a prediction–classification problem? I say no, as would Popper. Even if Cherkassky’s problem is relatively theory free, it wouldn’t follow this is true for all of science. Some of the most impressive “deep learning” results in AI have been criticized for lacking the ability to generalize beyond observed “training” samples, or to solve open-ended problems (Gary Marcus 2018).
A valuable idea to take from Popper is that probability in learning attaches to a method of conjecture and refutation, that is to testing: it is methodological probability. An error probability is a special case of a methodological probability. We want methods with a high probability of teaching us (and machines) how to distinguish approximately correct and incorrect interpretations of data, even leaving murky cases in the middle, and how to advance knowledge of detectable, while strictly unobservable, eﬀects.
The choices for probability that we are commonly offered are stark: “in here” (beliefs ascertained by introspection) or “out there” (frequencies in long runs, or chance mechanisms). This is the “epistemology” versus “variability” shoehorn we reject (Souvenir D). To qualify the method by which H was tested, frequentist performance is necessary, but it’s not sufficient. The assessment must be relevant to ensuring that claims have been put to severe tests. You can talk of a test having a type of propensity or capability to have discerned flaws, as Popper did at times. A highly explanatory, high-content theory, with interconnected tentacles, has a higher probability of having flaws discerned than low-content theories that do not rule out as much. Thus, when the bolder, higher content, theory stands up to testing, it may earn higher overall severity than the one with measly content. That a theory is plausible is of little interest, in and of itself; what matters is that it is implausible for it to have passed these tests were it false or incapable of adequately solving its set of problems. It is the fuller, unifying, theory developed in the course of solving interconnected problems that enables severe tests.
Methodological probability is not to quantify my beliefs, but neither is it about a world I came across without considerable effort to beat nature into line. Much less is it about a world-in-itself which, by definition, can’t be accessed by us. Deliberate effort and ingenuity are what allow me to ensure I shall come up against a brick wall, and be forced to reorient myself, at least with reasonable probability, when I test a flawed conjecture. The capabilities of my tools to uncover mistaken claims (their error probabilities) are real properties of the tools. Still, they are my tools, specially and imaginatively designed. If people say they’ve made so many judgment calls in building the inferential apparatus that what’s learned cannot be objective, I suggest they go back and work some more at their experimental design, or develop better theories.
Falsification Is Rarely Deductive. It is rare for any interesting scientific hypotheses to be logically falsifiable. This might seem surprising given all the applause heaped on falsifiability. For a scientific hypothesis H to be deductively falsified, it would be required that some observable result taken together with H yields a logical contradiction (A & ~A). But the only theories that deductively prohibit observations are of the sort one mainly finds in philosophy books: “All swans are white” is falsified by a single non-white swan. There are some statistical claims and contexts, I will argue, where it’s possible to achieve or come close to deductive falsification: claims such as, these data are independent and identically distributed (IID). Going beyond a mere denial to replacing them requires more work.
However, interesting claims about mechanisms and causal generalizations require numerous assumptions (substantive and statistical) and are rarely open to deductive falsiﬁcation. How then can good science be all about falsiﬁability? The answer is that we can erect reliable rules for falsifying claims with severity. We corroborate their denials. If your statistical account denies we can reliably falsify interesting theories, it is irrelevant to real-world knowledge. Let me draw your attention to an exhibit on a strange disease, kuru, and how it falsiﬁed a fundamental dogma of biology.
Exhibit (v): Kuru. Kuru (which means “shaking”) was widespread among the Fore people of New Guinea in the 1960s. In around 3–6 months, kuru victims go from having difficulty walking, to outbursts of laughter, to inability to swallow and death. Kuru, and (what we now know to be) related diseases, e.g., mad cow, Creutzfeldt–Jakob, and scrapie, are “spongiform” diseases, causing brains to appear spongy. Kuru clustered in families, in particular among Fore women and their children, or elderly parents. Researchers began to suspect transmission was through mortuary cannibalism. Consuming the brains of loved ones, a way of honoring the dead, was also a main source of meat permitted to women. Some say men got first dibs on the muscle; others deny men partook in these funerary practices. What we know is that ending these cannibalistic practices all but eradicated the disease. No one expected at the time that understanding kuru’s cause would falsify an established theory that only viruses and bacteria could be infectious. This “central dogma of biology” says:
H: All infectious agents have nucleic acid.
Any infectious agent free of nucleic acid would be anomalous for H – meaning it goes against what H claims. A separate step is required to decide when H’s anomalies should count as falsifying H. There needn’t be a cut-oﬀ so much as a standpoint as to when continuing to defend H becomes bad science. Prion researchers weren’t looking to test the central dogma of biology, but to understand kuru and related diseases. The anomaly erupted only because kuru appeared to be transmitted by a protein alone, by changing a normal protein shape into an abnormal fold. Stanley Prusiner called the infectious protein a prion – for which he received much grief. He thought, at ﬁrst, he’d made a mistake “and was puzzled when the data kept telling me that our preparations contained protein but not nucleic acid” (Prusiner 1997). The anomalous results would not go away and, eventually, were demonstrated via experimental transmission to animals. The discovery of prions led to a “revolution” in molecular biology, and Prusiner received a Nobel prize in 1997. It is logically possible that nucleic acid is somehow involved. But continuing to block the falsiﬁcation of H (i.e., block the “protein only” hypothesis) precludes learning more about prion diseases, which now include Alzheimer’s. (See Mayo 2014a.)
Insofar as we falsify general scientific claims, we are all methodological falsificationists. Some people say, “I know my models are false, so I’m done with the job of falsifying before I even begin.” Really? That’s not falsifying. Let’s look at your method: always infer that H is false, that it fails to solve its intended problem. Then you’re bound to infer this even when this is erroneous. Your method fails the minimal severity requirement.
Do Probabilists Falsify? It isn’t obvious a probabilist desires to falsify, rather than supply a probability measure indicating disconfirmation, the opposite of a B-boost (a B-bust?), or a low posterior. Members of some probabilist tribes propose that Popper is subsumed under a Bayesian account by taking a low value of Pr(x|H) to falsify H. That could not work. Individual outcomes described in detail will easily have very small probabilities under H without being genuine anomalies for H. To the severe tester, this is an attempt to distract from the inability of probabilists to falsify, insofar as they remain probabilists. What about comparative accounts (Likelihoodists or Bayes factor accounts), which I also place under probabilism? Reporting that one hypothesis is more likely than the other is not to falsify anything. Royall is clear that it’s wrong to even take the comparative report as evidence against one of the two hypotheses: they are not exhaustive. (Nothing turns on whether you prefer to put Likelihoodism under its own category.) Must all such accounts abandon the ability to falsify? No, they can indirectly falsify hypotheses by adding a methodological falsification rule. A natural candidate is to falsify H if its posterior probability is sufficiently low (or, perhaps, sufficiently disconfirmed). Of course, they’d need to justify the rule, ensuring it wasn’t often mistaken.
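Checking that such a rule isn’t often mistaken is exactly an error-probability question, and it can be put to a simulation. The sketch below is mine, not anything proposed in the text: the coin-bias hypotheses, the prior, and the 0.05 threshold are all invented for illustration. It simulates data from H itself and counts how often the low-posterior rule would erroneously “falsify” H.

```python
import random

# Hypothetical setup: H says a coin has bias 0.5; a point alternative says 0.7.
# The methodological rule under scrutiny: falsify H if Pr(H | data) < threshold.

def posterior_H(heads, n, prior=0.5, p_H=0.5, p_alt=0.7):
    """Posterior probability of H after observing `heads` in n tosses."""
    like_H = (p_H ** heads) * ((1 - p_H) ** (n - heads))
    like_alt = (p_alt ** heads) * ((1 - p_alt) ** (n - heads))
    return prior * like_H / (prior * like_H + (1 - prior) * like_alt)

def erroneous_falsification_rate(n=50, threshold=0.05, trials=5000, seed=1):
    """Generate data from H itself; count how often the rule rejects H anyway."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n))  # coin truly fair
        if posterior_H(heads, n) < threshold:
            errors += 1
    return errors / trials
```

Under these invented numbers the rule rarely errs, but that is something shown by the check, not assumed: a justification of the rule is a claim about its performance, which is the severe tester’s point.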
When is a statistical hypothesis to count as falsiﬁed? Although extremely rare events may occur, Popper notes:
such occurrences would not be physical effects, because, on account of their immense improbability, they are not reproducible at will … If, however, we find reproducible deviations from a macro effect … deduced from a probability estimate … then we must assume that the probability estimate is falsified. (Popper 1959, p. 203)
In the same vein, we heard Fisher deny that an “isolated record” of statistically signiﬁcant results suﬃces to warrant a reproducible or genuine eﬀect (Fisher 1935a, p. 14). Early on, Popper (1959) bases his statistical falsifying rules on Fisher, though citations are rare. Even where a scientiﬁc hypothesis is thought to be deterministic, inaccuracies and knowledge gaps involve error-laden predictions; so our methodological rules typically involve inferring a statistical hypothesis. Popper calls it a falsifying hypothesis. It’s a hypothesis inferred in order to falsify some other claim. A ﬁrst step is often to infer an anomaly is real, by falsifying a “due to chance” hypothesis.
The recognition that we need methodological rules to warrant falsiﬁcation led Popperian Imre Lakatos to dub Popper’s philosophy “methodological falsiﬁcationism” (Lakatos 1970, p. 106). If you look at this footnote, where Lakatos often buried gems, you read about “the philosophical basis of some of the most interesting developments in modern statistics. The Neyman–Pearson approach rests completely on methodological falsiﬁcationism” (ibid., p. 109, note 6). Still, neither he nor Popper made explicit use of N-P tests. Statistical hypotheses are the perfect tool for “falsifying hypotheses.” However, this means you can’t be a falsiﬁcationist and remain a strict deductivist. When statisticians (e.g., Gelman 2011) claim they are deductivists like Popper, I take it they mean they favor a testing account like Popper, rather than inductively building up probabilities. The falsifying hypotheses that are integral for Popper also necessitate an evidence-transcending (inductive) statistical inference.
This is hugely problematic for Popper because being a strict Popperian means never having to justify a claim as true or a method as reliable. After all, this was part of Popper’s escape from induction. The problem is this: Popper’s account rests on severe tests, tests that would probably falsify claims if false, but he cannot warrant saying a method is probative or severe, because that would mean it was reliable, which makes Popperians squeamish. It would appear to concede to his critics that Popper has a “whiﬀ of induction” after all. But it’s not inductive enumeration. Error statistical methods (whether from statistics or informal) can supply the severe tests Popper sought. This leads us to Pierre Duhem, physicist and philosopher of science.
….
Live Exhibit (vi): Revisiting Popper’s Demarcation of Science. Here’s an experiment: try shifting what Popper says about theories to a related claim about inquiries to ﬁnd something out. To see what I have in mind, let’s listen to an exchange between two fellow travelers over coﬀee at Statbucks.
TRAVELER 1: If mere logical falsiﬁability suﬃces for a theory to be scientiﬁc, then, we can’t properly oust astrology from the scientiﬁc pantheon. Plenty of nutty theories have been falsiﬁed, so by deﬁnition they’re scientiﬁc. Moreover, scientists aren’t always looking to subject well-corroborated theories to “grave risk” of falsiﬁcation.
TRAVELER 2: I’ve been thinking about this. On your first point, Popper confuses things by making it sound as if he’s asking: When is a theory unscientific? What he is actually asking, or should be asking, is: When is an inquiry into a theory, or an appraisal of claim H, unscientific? We want to distinguish meritorious modes of inquiry from those that are BENT. If the test methods enable ad hoc maneuvering and sneaky face-saving devices, then the inquiry – the handling and use of data – is unscientific. Despite being logically falsifiable, theories can be rendered immune from falsification by means of cavalier methods for their testing. Adhering to a falsified theory no matter what is poor science. Some areas have so much noise and/or flexibility that they can’t or won’t distinguish warranted from unwarranted explanations of failed predictions. Rivals may find flaws in one another’s inquiry or model, but the criticism is not constrained by what’s actually responsible. This is another way inquiries can become unscientific.^{1}
She continues:
On your second point, it’s true that Popper talked of wanting to subject theories to grave risk of falsification. I suggest that it’s really our inquiries into, or tests of, the theories that we want to subject to grave risk. The onus is on interpreters of data to show how they are countering the charge of a poorly run test. I admit this is a modification of Popper. One could reframe the entire demarcation problem as one about the character of an inquiry or test.
She makes a good point. In addition to blocking inferences that fail the minimal requirement for severity:
A scientiﬁc inquiry or test: must be able to embark on a reliable probe to pinpoint blame for anomalies (and use the results to replace falsiﬁed claims and build a repertoire of errors).
The parenthetical remark isn’t absolutely required, but is a feature that greatly strengthens scientiﬁc credentials. Without solving, not merely embarking on, some Duhemian problems there are no interesting falsiﬁcations. The ability or inability to pin down the source of failed replications – a familiar occupation these days – speaks to the scientiﬁc credentials of an inquiry. At any given time, even in good sciences there are anomalies whose sources haven’t been traced – unsolved Duhemian problems – generally at “higher” levels of the theory-data array. Embarking on solving these is the impetus for new conjectures. Checking test assumptions is part of working through the Duhemian maze. The reliability requirement is: infer claims just to the extent that they pass severe tests. There’s no sharp line for demarcation, but when these requirements are absent, an inquiry veers into the realm of questionable science or pseudoscience. Some physicists worry that highly theoretical realms can’t be expected to be constrained by empirical data. Theoretical constraints are also important. We’ll ﬂesh out these ideas in future tours.
^{1} For example, astronomy, but not astrology, can reliably solve its Duhemian puzzles. See Chapter 2 of Mayo (1996), following my reading of Kuhn (1970) on “normal science.”
*Where you are in the Journey: I posted all of Excursion 1 Tour I, here, here, and here, and omitted Tour II (but blogposts on the Law of Likelihood, Royall, optional stopping, and Barnard may be found by searching this blog). You are now in Excursion 2; the first stop of Tour I (2.1) is here. The main material from 2.2 can be found in this blogpost. You can read the rest of Excursion 2 Tour II section 2.3, in proof form, here. For the full Itinerary of Statistical Inference as Severe Testing: How to Get Beyond the Stat Wars (2018, CUP), see the SIST Itinerary.
My new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, you might have discovered, includes Souvenirs throughout (A–Z). But there are some highlights within sections that might be missed in the excerpts I’m posting. One such “keepsake” is a quote from Fisher at the very end of Section 2.1.
These are some of the ﬁrst clues we’ll be collecting on a wide diﬀerence between statistical inference as a deductive logic of probability, and an inductive testing account sought by the error statistician. When it comes to inductive learning, we want our inferences to go beyond the data: we want lift-oﬀ. To my knowledge, Fisher is the only other writer on statistical inference, aside from Peirce, to emphasize this distinction.
In deductive reasoning all knowledge obtainable is already latent in the postulates. Rigour is needed to prevent the successive inferences growing less and less accurate as we proceed. The conclusions are never more accurate than the data. In inductive reasoning we are performing part of the process by which new knowledge is created. The conclusions normally grow more and more accurate as more data are included. It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based. (Fisher 1935b, p. 54)
How do you understand this remark of Fisher’s? (Please share your thoughts in the comments.) My interpretation, and its relation to the “lift-off” needed to warrant inductive inferences, is discussed in an earlier section, 1.2, posted here. Here’s part of that.
The weakest version of the severity requirement (Section 1.1), in the sense of easiest to justify, is negative, warning us when BENT data are at hand, and a surprising amount of mileage may be had from that negative principle alone. It is when we recognize how poorly certain claims are warranted that we get ideas for improved inquiries. In fact, if you wish to stop at the negative requirement, you can still go pretty far along with me. I also advocate the positive counterpart:
Severity (strong): We have evidence for a claim C just to the extent it survives a stringent scrutiny. If C passes a test that was highly capable of ﬁnding ﬂaws or discrepancies from C, and yet none or few are found, then the passing result, x, is evidence for C.
One way this can be achieved is by an argument from coincidence. The most vivid cases occur outside formal statistics.
Some of my strongest examples tend to revolve around my weight. Before leaving the USA for the UK, I record my weight on two scales at home, one digital, one not, and the big medical scale at my doctor’s oﬃce. Suppose they are well calibrated and nearly identical in their readings, and they also all pick up on the extra 3 pounds when I’m weighed carrying three copies of my 1-pound book, Error and the Growth of Experimental Knowledge (EGEK). Returning from the UK, to my astonishment, not one but all three scales show anywhere from a 4–5 pound gain. There’s no difference when I place the three books on the scales, so I must conclude, unfortunately, that I’ve gained around 4 pounds. Even for me, that’s a lot. I’ve surely falsified the supposition that I lost weight! From this informal example, we may make two rather obvious points that will serve for less obvious cases. First, there’s the idea I call lift-oﬀ.
Lift-oﬀ: An overall inference can be more reliable and precise than its premises individually.
Each scale, by itself, has some possibility of error, and limited precision. But the fact that all of them have me at an over 4-pound gain, while none show any difference in the weights of EGEK, pretty well seals it. Were one scale oﬀ balance, it would be discovered by another, and would show up in the weighing of books. They cannot all be systematically misleading just when it comes to objects of unknown weight, can they? Rejecting a conspiracy of the scales, I conclude I’ve gained weight, at least 4 pounds. We may call this an argument from coincidence, and by its means we can attain lift-oﬀ. Lift-oﬀ runs directly counter to a seemingly obvious claim of drag-down.
Drag-down: An overall inference is only as reliable/precise as is its weakest premise.
The drag-down assumption is common among empiricist philosophers: As they like to say, “It’s turtles all the way down.” Sometimes our inferences do stand as a kind of tower built on linked stones – if even one stone fails they all come tumbling down. Call that a linked argument.
Our most prized scientific inferences would be in a very bad way if piling on assumptions invariably leads to weakened conclusions. Fortunately we also can build what may be called convergent arguments, where lift-oﬀ is attained. This seemingly banal point suffices to combat some of the most well entrenched skepticisms in philosophy of science. And statistics happens to be the science par excellence for demonstrating lift-oﬀ!
Now consider what justifies my weight conclusion, based, as we are supposing it is, on a strong argument from coincidence. No one would say: “I can be assured that by following such a procedure, in the long run I would rarely report weight gains erroneously, but I can tell nothing from these readings about my weight now.” To justify my conclusion by long-run performance would be absurd. Instead we say that the procedure had enormous capacity to reveal if any of the scales were wrong, and from this I argue about the source of the readings: H: I’ve gained weight. Simple as that. It would be a preposterous coincidence if none of the scales registered even slight weight shifts when weighing objects of known weight, and yet were systematically misleading when applied to my weight.

You see where I’m going with this. This is the key – granted, with a homely example – that can fill a very important gap in frequentist foundations: just because an account is touted as having a long-run rationale, it does not mean it lacks a short-run rationale, or even one relevant for the particular case at hand. Nor is it merely the improbability of all the results, were H false; it is rather like denying an evil demon has read my mind just in the cases where I do not know the weight of an object, and deliberately deceived me. The argument to “weight gain” is an example of an argument from coincidence to the absence of an error, what I call:
Arguing from Error: There is evidence an error is absent to the extent that a procedure with a very high capability of signaling the error, if and only if it is present, nevertheless detects no error.
I am using “signaling” and “detecting” synonymously: It is important to keep in mind that we don’t know if the test output is correct, only that it gives a signal or alert, like sounding a bell. Methods that enable strong arguments to the absence (or presence) of an error I call strong error probes. Our ability to develop strong arguments from coincidence, I will argue, is the basis for solving the “problem of induction.”
Where you are in the Journey*
Cox: [I]n some ﬁelds foundations do not seem very important, but we both think that foundations of statistical inference are important; why do you think that is?
Mayo: I think because they ask about fundamental questions of evidence, inference, and probability … we invariably cross into philosophical questions about empirical knowledge and inductive inference. (Cox and Mayo 2011, p. 103)
Contemporary philosophy of science presents us with some taboos: Thou shalt not try to ﬁnd solutions to problems of induction, falsiﬁcation, and demarcating science from pseudoscience. It’s impossible to understand rival statistical accounts, let alone get beyond the statistics wars, without ﬁrst exploring how these came to be “lost causes.” I am not talking of ancient history here: these problems were alive and well when I set out to do philosophy in the 1980s. I think we gave up on them too easily, and by the end of Excursion 2 you’ll see why. Excursion 2 takes us into the land of “Statistical Science and Philosophy of Science” (StatSci/PhilSci). Our Museum Guide gives a terse thumbnail sketch of Tour I. Here’s a useful excerpt:
Once the Problem of Induction was deemed to admit of no satisfactory, non-circular solutions (~1970s), philosophers of science turned to building formal logics of induction using the deductive calculus of probabilities, often called Conﬁrmation Logics or Theories. A leader of this Conﬁrmation Theory movement was Rudolf Carnap. A distinct program, led by Karl Popper, denies there is a logic of induction, and focuses on Testing and Falsiﬁcation of theories by data. At best a theory may be accepted or corroborated if it fails to be falsiﬁed by a severe test. The two programs have analogues to distinct methodologies in statistics: Conﬁrmation theory is to Bayesianism as Testing and Falsiﬁcation are to Fisher and Neyman–Pearson.
Tour I begins with the traditional Problem of Induction, then moves to Carnapian conﬁrmation and takes a brief look at contemporary formal epistemology. Tour II visits Popper, falsiﬁcation, and demarcation, moving into Fisherian tests and the replication crisis. Redolent of Frank Lloyd Wright’s Guggenheim Museum in New York City, the StatSci/PhilSci Museum is arranged in concentric sloping oval ﬂoors that narrow as you go up. It’s as if we’re in a three-dimensional Normal curve. We begin in a large exposition on the ground ﬂoor. Those who start on the upper ﬂoors forfeit a central Rosetta Stone to decipher today’s statistical debates.
Start with the asymmetry of falsiﬁcation and conﬁrmation. One black swan falsiﬁes the universal claim that C: all swans are white. Observing a single white swan, while a positive instance of C, wouldn’t allow inferring generalization C, unless there was only one swan in the entire population. If the generalization refers to an inﬁnite number of cases, as most people would say about scientiﬁc theories and laws, then no matter how many positive instances observed, you couldn’t infer it with certainty. It’s always possible there’s a black swan out there, a negative instance, and it would only take one to falsify C. But surely we think enough positive instances of the right kind might warrant an argument for inferring C. Enter the problem of induction. First, a bit about arguments.
Soundness versus Validity
An argument is a group of statements, one of which is said to follow from or be supported by the others. The others are premises, the one inferred, the conclusion. A deductively valid argument is one where if its premises are all true, then its conclusion must be true. Falsiﬁcation of “all swans are white” follows a deductively valid argument. Let ~C be the denial of claim C.
(1) C: All swans are white.
x is a swan but is black.
Therefore, ~C.
We can also infer, validly, what follows if a generalization C is true.
(2) C: All swans are white
x is a swan.
Therefore, x is white.
However, validity is not the same thing as soundness. Here’s a case of argument form (2):
(3) All philosophers can ﬂy
Mayo is a philosopher.
Therefore, Mayo can ﬂy.
Validity is a matter of form. Since (3) has a valid form, it is a valid argument. But its conclusion is false! That’s because it is unsound: at least one of its premises is false (the ﬁrst). No one can stop you from applying deductively valid arguments, regardless of your statistical account. Don’t assume you will get truth thereby. Bayes’ Theorem can occur in a valid argument, within a formal system of probability:
(4) If Pr(H_{1}), …, Pr(H_{n}) are the prior probabilities of an exhaustive set of hypotheses, and Pr(x|H_{i}) the corresponding likelihoods.
Data x are given, and Pr(H_{1}|x) is deﬁned.
Therefore, Pr(H_{1}|x) = p.^{1}
The conclusion is the posterior probability Pr(H_{1}|x). It can be inferred only if the argument is sound: all the givens must hold (at least approximately). To deny that all of statistical inference is reducible to Bayes’ Theorem is not to preclude your using this or any other deductive argument. What you need to be concerned about is their soundness. So, you will still need a way to vouchsafe the premises.
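For a concrete instance of the valid form (4), here is a minimal sketch with made-up priors and likelihoods (the numbers are purely illustrative, not drawn from the text). Once the givens are granted, the posterior follows by calculation:

```python
# Hypothetical priors for an exhaustive, mutually exclusive set {H1, H2, H3}
priors = {"H1": 0.2, "H2": 0.5, "H3": 0.3}
# Hypothetical likelihoods Pr(x | Hi) for the observed data x
likelihoods = {"H1": 0.9, "H2": 0.1, "H3": 0.3}

# Bayes' Theorem: Pr(Hi | x) = Pr(x | Hi) Pr(Hi) / sum_j Pr(x | Hj) Pr(Hj)
evidence = sum(priors[h] * likelihoods[h] for h in priors)
posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}
# posteriors["H1"] works out to 0.18 / 0.32 = 0.5625
```

The arithmetic is deductive, as the text says: whether Pr(H_{1}|x) = 0.5625 tells us anything about the world depends entirely on the soundness of the priors and likelihoods supplied.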
Now to the traditional philosophical problem of induction. What is it? Why has confusion about induction and the threat of the traditional or “logical” problem of induction made some people afraid to dare use the “I” word? The traditional problem of induction seeks to justify a type of argument: one taking a form of enumerative induction (EI) (or the straight rule of induction). Infer from past cases of A’s that were B’s to all or most A’s will be B’s:
EI: All observed A1, A2, .. ., An have been B’s.
Therefore, H: all A’s are B’s.
It is not a deductively valid argument, because clearly its premises can all be true while its conclusion false. It’s invalid, as is any inductive argument. As Hume (1739) notes, nothing changes if we place the word “probably” in front of the conclusion: it is justified to infer from past A’s being B’s that, probably, all or most A’s will be B’s. To “rationally” justify induction is to supply a reasoned argument for using EI. The traditional problem of induction, then, involves trying to find an argument to justify a type of argument!
Exhibit (i): Justifying Induction Is Circular. In other words, the traditional problem of induction is to justify the conclusion:
Conclusion: EI is rationally justified; it’s a reliable rule.
We need an argument for concluding EI is reliable. Using an inductive argument to justify induction lands us in a circle. We’d be using the method we’re trying to justify, or begging the question. What about a deductively valid argument? The premises would have to be things we know to be true, otherwise the argument would not be sound. We might try:
Premise 1: EI has been reliable in a set of observed cases.
Trouble is, this premise can’t be used to deductively infer EI will be reliable in general: the known cases only refer to the past and present, not the future. Suppose we add a premise:
Premise 2: Methods that have worked in past cases will work in future cases.
Yet to assume Premise 2 is true is to use EI, and thus, again, to beg the question.
Another idea for the additional premise is in terms of assuming nature is uniform. We do not escape: to assume the uniformity of nature is to assume EI is a reliable method. Therefore, induction cannot be rationally justiﬁed. It is called the logical problem of induction because logical argument alone does not appear able to solve it. All attempts to justify EI assume past successes of a rule justify its general reliability, which is to assume EI – what we’re trying to show.
I’m skimming past the rest of a large exhibition on brilliant attempts to solve induction in this form. Some argue that although an attempted justiﬁcation is circular it is not viciously circular. (An excellent source is Skyrms 1986.)
But wait. Is inductive enumeration a rule that has been reliable even in the past? No. It is reasonable to expect that unobserved or future cases will be very diﬀerent from the past, that apparent patterns are spurious, and that observed associations are not generalizable. We would only want to justify inferences of that form if we had done a good job ruling out the many ways we know we can be misled by such an inference. That’s not the way conﬁrmation theorists see it, or at least, saw it.
Exhibit (ii): Probabilistic (Statistical) Aﬃrming the Consequent. Enter logics of conﬁrmation. Conceding that we cannot justify the inductive method (EI), philosophers sought logics that represented apparently plausible inductive reasoning. The thinking is this: never mind trying to convince a skeptic of the inductive method, we give up on that. But we know what we mean. We need only to make sense of the habit of applying EI. True to the logical positivist spirit of the 1930s–1960s, they sought evidential relationships between statements of evidence and conclusions. I sometimes call them evidential-relation (E-R) logics. They didn’t renounce enumerative induction, they sought logics that embodied it. Begin by ﬂeshing out the full argument behind EI:
If H: all A’s are B’s, then all observed A’s (A_{1}, A_{2}, .. ., A_{n}) are B’s.
All observed A’s (A_{1}, A_{2}, .. ., A_{n}) are B’s.
Therefore, H: all A’s are B’s.
The premise that we added, the ﬁrst, is obviously true; the problem is that the second premise can be true while the conclusion false. The argument is deductively invalid – it even has a name: aﬃrming the consequent. However, its probabilistic version is weaker. Probabilistic aﬃrming the consequent says only that the conclusion is probable or gets a boost in conﬁrmation or probability – a B-boost. It’s in this sense that Bayes’ Theorem is often taken to ground a plausible conﬁrmation theory. It probabilistically justiﬁes EI in that it embodies probabilistic aﬃrming the consequent.
How do we obtain the probabilities? Rudolf Carnap’s audacious program (1962) had been to assign probabilities of hypotheses or statements by deducing them from the logical structure of a particular (ﬁrst order) language. These were called logical probabilities. The language would have a list of properties (e.g., “is a swan,” “is white”) and individuals or names (e.g., i, j, k). The task was to assign equal initial probability assignments to states of this mini world, from which we could deduce the probabilities of truth functional combinations. The degree of probability, usually understood as a rational degree of belief, would hold between two statements, one expressing a hypothesis and the other the data. C(H,x) symbolizes “the conﬁrmation of H, given x.” Once you have chosen the initial assignments to core states of the world, calculating degrees of conﬁrmation is a formal or syntactical matter, much like deductive logic. The goal was to somehow measure the degree of implication or conﬁrmation that x aﬀords H. Carnap imagined the scientist coming to the inductive logician to have the rational degree of conﬁrmation in H evaluated, given her evidence. (I’m serious.) Putting aside the diﬃculty of listing all properties of scientiﬁc interest, from where do the initial assignments come?
Carnap’s first attempt at a C-function resulted in no learning! For a toy illustration, take a universe with three items, i, j, k, and a single property B. “Bk” expresses “k has property B.” There are eight possibilities, each called a state description. Here’s one: {Bi, ~Bj, ~Bk}. If each is given initial probability of ⅛, we have what Carnap called the logic c^{†}. The degree of confirmation that j has B, given that i has B, is ½ – the same as the initial confirmation of Bj (since Bj occurs in four of the eight state descriptions). Nothing has been learned: c^{†} is scrapped. By apportioning initial probabilities more coarsely, one could learn, but there was an infinite continuum of inductive logics characterized by choosing the value of a parameter he called λ (λ continuum). λ in effect determines how much uniformity and regularity to expect. To restrict the field, Carnap had to postulate what he called “inductive intuitions.” As a logic student, I too found these attempts tantalizing – until I walked into my first statistics class. I was also persuaded by philosopher Wesley Salmon:
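The failure of c^{†} is pure counting, so it can be checked mechanically. The sketch below enumerates the eight state descriptions for the toy universe and verifies that, under equal weights of ⅛, conditioning on Bi leaves the confirmation of Bj exactly where it started:

```python
from itertools import product
from fractions import Fraction

# Carnap's c-dagger logic for a toy universe {i, j, k} with one property B.
# Each of the 2^3 = 8 state descriptions gets initial probability 1/8.
individuals = ["i", "j", "k"]
states = list(product([True, False], repeat=len(individuals)))  # 8 state descriptions

def prob(event):
    """Probability of an event (a predicate on state descriptions) under c-dagger."""
    return Fraction(sum(1 for s in states if event(s)), len(states))

B = {name: idx for idx, name in enumerate(individuals)}

p_Bj = prob(lambda s: s[B["j"]])                        # initial confirmation of Bj
p_Bi_and_Bj = prob(lambda s: s[B["i"]] and s[B["j"]])
p_Bj_given_Bi = p_Bi_and_Bj / prob(lambda s: s[B["i"]])

print(p_Bj)           # 1/2
print(p_Bj_given_Bi)  # 1/2 -- identical: observing Bi teaches nothing about Bj
```

Because the weight sits on maximally specific state descriptions, the individuals are probabilistically independent by construction, so no observation of one can ever raise the confirmation of a claim about another – which is why c^{†} had to be scrapped.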
Carnap has stated that the ultimate justiﬁcation of the axioms is inductive intuition. I do not consider this answer an adequate basis for a concept of rationality. Indeed, I think that every attempt, including those by Jaakko Hintikka and his students, to ground the concept of rational degree of belief in logical probability suﬀers from the same unacceptable apriorism. (Salmon 1988, p. 13).
This program, still in its heyday in the 1980s, was part of a general logical positivist attempt to reduce science to observables plus logic (no metaphysics). Had this reductionist goal been realized, which it wasn’t, the idea of scientiﬁc inference being reduced to particular predicted observations might have succeeded. Even with that observable restriction, the worry remained: what does a highly probable claim, according to a particular inductive logic, have to do with the real world? How can it provide “a guide to life?” (e.g., Kyburg 2003, Salmon 1966). The epistemology is restricted to inner coherence and consistency. However much contemporary philosophers have gotten beyond logical positivism, the hankering for an inductive logic remains. You could say it’s behind the appeal of the default (non-subjective) Bayesianism of Harold Jeﬀreys, and other attempts to view probability theory as extending deductive logic.
Exhibit (iii): A Faulty Analogy Between Deduction and Induction. When we heard Hacking announce (Section 1.4): “there is no such thing as a logic of statistical inference” (1980, p. 145), it wasn’t only the failed attempts to build one, but the recognition that the project is “founded on a false analogy with deductive logic” (ibid.). The issue here is subtle, and we’ll revisit it through our journey. I agree with Hacking, who is agreeing with C. S. Peirce:
In the case of analytic [deductive] inference we know the probability of our conclusion (if the premises are true), but in the case of synthetic [inductive] inferences we only know the degree of trustworthiness of our proceeding. (Peirce 2.693)
In getting new knowledge, in ampliative or inductive reasoning, the conclusion should go beyond the premises; probability enters to qualify the overall “trustworthiness” of the method. Hacking not only retracts his Law of Likelihood (LL), but also his earlier denial that Neyman–Pearson statistics is inferential. “I now believe that Neyman, Peirce, and Braithwaite were on the right lines to follow in the analysis of inductive arguments” (Hacking 1980, p. 141). Let’s adapt some of Hacking’s excellent discussion.
When we speak of an inference, it could mean the entire argument including premises and conclusion. Or it could mean just the conclusion, or statement inferred. Let’s use “inference” to mean the latter – the claim detached from the premises or data. A statistical procedure of inferring refers to a method for reaching a statistical inference about some aspect of the source of the data, together with its probabilistic properties: in particular, its capacities to avoid erroneous (and ensure non-erroneous) interpretations of data. These are the method’s error probabilities. My argument from coincidence to weight gain (Section 1.3) inferred H: I’ve gained at least 4 pounds. The inference is qualiﬁed by the detailed data (group of weighings), and information on how capable the method is at blocking erroneous pronouncements of my weight. I argue that, very probably, my scales would not produce the weight data they do (e.g., on objects with known weight) were H false. What is being qualiﬁed probabilistically is the inferring or testing process.
By contrast, in a probability or confirmation logic, what is generally detached is the probability of H, given data. It is a probabilism. Hacking’s diagnosis in 1980 is that this grows out of an abiding logical positivism, with which he admits to having been afflicted. There’s this much analogy with deduction: in a deductively valid argument, if the premises are true then, necessarily, the conclusion is true. But we don’t attach the “necessarily” to the conclusion. Instead it qualifies the entire argument. So mimicking deduction, why isn’t the inductive task to qualify the method in some sense, for example, report that it would probably lead to true or approximately true conclusions? That would be to show the reliable performance of an inference method. If that’s what an inductive method requires, then Neyman–Pearson tests, which afford good performance, are inductive.
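What it means for error probabilities to attach to the testing *process* rather than to any single conclusion can be illustrated by simulation. This is a generic sketch, not the weighing example from the text: a one-sided z-test with invented parameters (H0: μ = 0 vs. H1: μ > 0, known σ = 1, n = 25, α = 0.05), run many times to estimate how often the method errs.

```python
import math
import random

# Error probabilities qualify the testing process, not a single inference.
# Minimal sketch (all numbers invented): one-sided z-test of H0: mu = 0
# vs H1: mu > 0, with known sigma = 1, n = 25, alpha = 0.05.
random.seed(1)
n, sigma = 25, 1.0
z_crit = 1.645  # approximate upper 5% point of the standard normal

def rejects(mu):
    """Run the test once on a sample drawn with true mean mu."""
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    return xbar > z_crit * sigma / math.sqrt(n)

trials = 20000
type_I = sum(rejects(0.0) for _ in range(trials)) / trials  # erroneous rejections under H0
power = sum(rejects(0.5) for _ in range(trials)) / trials   # correct rejections when mu = 0.5

print(round(type_I, 3))  # close to 0.05: the method's Type I error probability
print(round(power, 3))   # high: the method rarely misses a discrepancy of mu = 0.5
```

The estimated rates characterize the procedure over repeated applications; whether good long-run performance by itself warrants detaching any particular inference is exactly the point on which the text goes on to part company with Hacking.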
My main diﬀerence from Hacking here is that I don’t argue, as he seems to, that the warrant for the inference is that it stems from a method that very probably gets it right (so I may hope it is right this time). It’s not that the method’s reliability “rubs oﬀ” on this particular claim. I say inference C may be detached as indicated or warranted, having passed a severe test (a test that C probably would have failed, if false in a speciﬁed manner). This is the central point of Souvenir D. The logician’s “semantic entailment” symbol, the double turnstile: “|=”, could be used to abbreviate “entails severely”:
Data + capacities of scales |=_{SEV} I’ve gained at least k pounds.
(The premises are on the left side of |=.) However, I won’t use this notation.
Keeping to a deductive logic of probability, we never detach an inference. This is in sync with a probabilist such as Bruno de Finetti:
The calculus of probability can say absolutely nothing about reality ... As with the logic of certainty, the logic of the probable adds nothing of its own: it merely helps one to see the implications contained in what has gone before. (de Finetti 1974, p. 215)
These are some of the ﬁrst clues we’ll be collecting on a wide diﬀerence between statistical inference as a deductive logic of probability, and an inductive testing account sought by the error statistician. When it comes to inductive learning, we want our inferences to go beyond the data: we want lift-oﬀ. To my knowledge, Fisher is the only other writer on statistical inference, aside from Peirce, to emphasize this distinction.
In deductive reasoning all knowledge obtainable is already latent in the postulates. Rigour is needed to prevent the successive inferences growing less and less accurate as we proceed. The conclusions are never more accurate than the data. In inductive reasoning we are performing part of the process by which new knowledge is created. The conclusions normally grow more and more accurate as more data are included. It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based. (Fisher 1935b, p. 54)
NOTES:

1. *Where you are in the Journey: I posted all of Excursion 1 Tour I, here, here, and here, and omitted Tour II (but blog posts on the Law of Likelihood, Royall, optional stopping, and Barnard, whose birthday was Sept 23, may be found by searching this blog). I am now moving to Excursion 2, posting the first stop of Tour I (2.1). For the full Itinerary of Statistical Inference as Severe Testing: How to Get Beyond the Stat Wars (2018, CUP): SIST Itinerary
Error Statistics Philosophy: Blog Contents (7 years) [i]
By: D. G. Mayo
Dear Reader: I began this blog 7 years ago (Sept. 3, 2011)! A big celebration is taking place at the Elbar Room this evening, both for the blog and the appearance of my new book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP). While a special rush edition made an appearance on Sept 3, in time for the RSS meeting in Cardiff, it was decided to hold off on the festivities until copies of the book were officially available (yesterday)! If you’re in the neighborhood, stop by for some Elba Grease!
Many of the discussions in the book were importantly influenced (corrected and improved) by readers’ comments on the blog over the years. I thank readers for their input. Please peruse the offerings below, taking advantage of the discussions by guest posters and readers! I posted the first 3 sections of Tour I (in Excursion 1) here, here, and here.
This blog will return to life, although I’m not yet sure of exactly what form it will take. Ideas are welcome. The tone of a book differs from a blog, so we’ll have to see what voice emerges here.
Sincerely,
D. Mayo
September 2011
October 2011
November 2011
December 2011
January 2012
February 2012
March 2012
April 2012
May 2012
June 2012
July 2012
August 2012
September 2012
October 2012
November 2012
December 2012
January 2013
February 2013
March 2013
April 2013
May 2013
June 2013
July 2013
August 2013
September 2013
October 2013
November 2013
December 2013
January 2014
February 2014
March 2014
April 2014
May 2014
June 2014
July 2014
August 2014
September 2014
October 2014
November 2014
December 2014
January 2015
February 2015
March 2015
April 2015
May 2015
June 2015
July 2015
August 2015
September 2015
October 2015
November 2015
December 2015
January 2016
February 2016
March 2016
April 2016
May 2016
June 2016
July 2016
August 2016
September 2016
October 2016
November 2016
December 2016
January 2017
February 2017
March 2017
April 2017
May 2017
June 2017
July 2017
August 2017
September 2017
October 2017
November 2017
December 2017
January 2018
February 2018
March 2018
April 2018
May 2018
June 2018
July 2018
August 2018
[i] Compiled by Jean Miller.