Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?
Critic: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true!
Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!
Raucous laughter ensues!
(Hah, hah… “So funny, I forgot to laugh! Or, I’m crying and laughing at the same time!”)
The frequentist tester should retort:
Frequentist Tester: But you assume 50% of the null hypotheses are true, compute P(H_{0}|x) using P(H_{0}) = .5, imagine the null is rejected based on a single small p-value, and then blame the p-value for disagreeing with the result of your computation!
At times you even use α and power as likelihoods in your analysis! These tests violate both Fisherian and Neyman-Pearson tests.
It is well-known that for a fixed p-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H_{0}. This Jeffreys-Lindley “disagreement” is considered problematic for Bayes ratios (e.g., Bernardo). It is not problematic for error statisticians. We always indicate the extent of discrepancy that is and is not indicated, and avoid making mountains out of molehills (See Spanos 2013). J. Berger and Sellke (1987) attempt to generalize the result to show the “exaggeration” even without large n. From their Bayesian perspective, it appears that p-values come up short; error statistical testers (and even some tribes of Bayesians) balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null, or even evidence for it!
The conflict between p-values and Bayesian posteriors typically considers the two sided test of the Normal mean, H_{0}: μ = μ_{0} versus H_{1}: μ ≠ μ_{0}.
“If n = 50 one can classically ‘reject H_{0} at significance level p = .05,’ although Pr (H_{0}|x) = .52 (which would actually indicate that the evidence favors H_{0}).” (Berger and Sellke, 1987, p. 113).
If n = 1000, a result statistically significant at the .05 level leads to a posterior for the null of .82!
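The two figures just quoted can be checked with a few lines of arithmetic. The sketch below assumes the spike-and-slab setup usually associated with this example: prior mass .5 on H_{0}, with the other .5 spread over the alternative as a Normal density centered at μ_{0} with the same variance as a single observation. That choice of slab is an assumption here (it is the one that reproduces the quoted numbers), and the function name is purely illustrative.

```python
import math

def posterior_null(z, n, prior_null=0.5):
    """P(H0 | x) for the two-sided Normal test under a spike-and-slab prior.

    Assumption (not stated in the text): the slab on the alternative is
    N(mu0, sigma^2). z is the standardized difference
    sqrt(n) * (xbar - mu0) / sigma.
    """
    # Bayes factor in favor of H0: density of xbar under H0 divided by
    # its marginal density under the slab.
    bf01 = math.sqrt(n + 1) * math.exp(-0.5 * z**2 * n / (n + 1))
    prior_odds = prior_null / (1 - prior_null)
    post_odds = prior_odds * bf01
    return post_odds / (1 + post_odds)

for n in (50, 1000):
    print(n, round(posterior_null(1.96, n), 2))
```

With z = 1.96 (two-sided p = .05) this returns roughly .52 for n = 50 and .82 for n = 1000. Changing `prior_null` shows how much of the "conflict" is carried by the .5 spike rather than by the p-value.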
Some find the example shows the p-value “overstates evidence against a null” because it claims to use an “impartial” or “uninformative” Bayesian prior probability assignment of .5 to H_{0}, the remaining .5 being spread out over the alternative parameter space. (“Spike and slab” I’ve heard Gelman call this, derisively.) Others demonstrate that the problem is not p-values but the high prior.
Casella and R. Berger (1987) show that “concentrating mass on the point null hypothesis is biasing the prior in favor of H_{0} as much as possible” (p. 111) whether in 1- or 2-sided tests. Note, too, the conflict with confidence interval reasoning since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). Many complain the “spiked concentration of belief in the null” is at odds with the view that “we know all nulls are false” (even though that view is also false). See Senn’s interesting points on this same issue in his letter (to Goodman) here.
But often, as in the opening joke, the prior assignment is claimed to be keeping to the frequentist camp and to frequentist error probabilities. How’s that supposed to work? It is imagined that we sample randomly from a population of hypotheses, k% of which are assumed to be true. 50% is a common number. We randomly draw a hypothesis and get this particular one, maybe it concerns the mean deflection of light, or perhaps it is an assertion of bioequivalence of two drugs or whatever. The percentage “initially true” (in this urn of nulls) serves as the prior probability for your particular H_{0}. I see this gambit in statistics, psychology, philosophy and elsewhere, and yet it commits a fallacious instantiation of probabilities:
50% of the null hypotheses in a given pool of nulls are true.
This particular null H_{0} was randomly selected from this urn (some may wish to add “nothing else is known” which would scarcely be true here).
Therefore P(H_{0} is true) = .5.
I discussed this 20 years ago, Mayo 1997a and b (links in the references) and ever since. However, statistical fallacies repeatedly return to fashion in slightly different guises. Nowadays, you’re most likely to see it within what may be called diagnostic screening models of tests.
It’s not that you can’t play a carnival game of reaching into an urn of nulls (and there are lots of choices for what to put in the urn), and use a Bernoulli model for the chance of drawing a true hypothesis (assuming we even knew the % of true hypotheses, which we do not), but the “event of drawing a true null” is no longer the particular hypothesis one aims to use in computing the probability of data x_{0} under hypothesis H_{0}. In other words, it’s no longer the H_{0} needed for the likelihood portion of the frequentist computation. (Note, too, the selected null would get the benefit of being selected from an urn of nulls where few have been shown false yet: “innocence by association”. See my comment on J. Berger 2003, pp. 19-24.)
In any event, .5 is not the frequentist probability that the selected null H_{0} is true–in those cases where a frequentist prior exists. (I first discussed the nature of legitimate frequentist priors with Erich Lehmann; see the poem he wrote for me as a result in Mayo 1997a).
The diagnostic screening model of tests. The diagnostic screening model of tests has become increasingly commonplace, thanks to Big Data, perverse incentives, nonreplication and all the rest (Ioannidis 2005). As Taleb puts it:
“With big data, researchers have brought cherry-picking to an industrial level”.
Now the diagnostic screening model is apt for various goals–diagnostic screening (for disease) most obviously, but also for TSA bag checks, high throughput studies in genetics and other contexts where the concern is controlling the noise in the network rather than appraising the well-testedness of your research claim. Dichotomies are fine for diagnostics (disease or not, worth further study or not, dangerous bag or not). Forcing scientific inference into a binary basket is what most of us wish to move away from, yet the new screening model dichotomizes results into significant/non-significant, usually at the .05 level. One shouldn’t mix the notions of prevalence, positive predictive value, negative predictive value, etc. from screening with the concepts from statistical testing in science. Yet people do, and there are at least 2 tragicomic results: One is that error probabilities are blamed for disagreeing with measures of completely different things. One journal editor claims the fact that p-values differ from posteriors proves the “invalidity” of p-values.
The second tragicomic result is that inconsistent meanings of type 1 (and 2) error probabilities have found their way into the latest reforms, and into guidebooks for how to avoid inconsistent interpretations of statistical concepts. Whereas there’s a trade-off between type 1 error and type 2 error probabilities in Neyman-Pearson style hypothesis tests, this is no longer true when a type 1 error probability is defined as the posterior of H_{0} conditional on rejecting. Topsy turvy claims about power readily ensue (search this blog under power for numerous examples).
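To make the Neyman-Pearson trade-off concrete, here is a minimal sketch for a one-sided z-test. The particular numbers (σ = 1, n = 25, an alternative at μ = 0.5) are hypothetical choices for illustration, not values from any study discussed here.

```python
from statistics import NormalDist

def type2_error(alpha, delta=0.5, n=25):
    """Type 2 error probability of a one-sided z-test of H0: mu = 0
    against the (hypothetical) alternative mu = delta, with sigma = 1."""
    std = NormalDist()
    cutoff = std.inv_cdf(1 - alpha)           # reject when z >= cutoff
    return std.cdf(cutoff - delta * n**0.5)   # P(fail to reject; mu = delta)

for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(type2_error(alpha), 3))
```

With the sample size and alternative held fixed, lowering the type 1 error probability raises the type 2 error probability. No such constraint ties a posterior probability of H_{0} given rejection to the type 2 error probability, which is one way to see that the two notions of "type 1 error" are different quantities.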
Conventional Bayesian variant. J. Berger doesn’t really imagine selecting from an urn of nulls (he claims). Instead, spiked priors come from one of the systems of default or conventional priors. Curiously, he claims that by adopting his recommended conventional priors, frequentists can become more frequentist (than using flawed error probabilities). We get what he calls conditional p-values (or conditional error probabilities). Magician that he is, the result is that frequentist error probabilities are no longer error probabilities, or even frequentist!
How it happens is not entirely clear, but it’s based on his defining a “Frequentist Principle” that demands that a type 1 (or 2) error probability yield the same number as his conventional posterior probability. (See Berger 2003, and my comment in Mayo 2003).
Senn, in a guest post remarks:
The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives.
It is as if in giving recommendations in dosing children one abandoned a formula based on age and adopted one based on weight but insisted on using the same number of kg one had used for years.
Urn of Nulls. Others appear to be serious about the urn of nulls metaphor (e.g., Colquhoun 2014). Say 50% of the nulls in the urn are imagined to be true. Then, when you select your null, its initial probability of truth is .5. This, however, is to commit the fallacy of probabilistic instantiation.
Two moves are made: (1) it’s admitted it’s an erroneous probabilistic instantiation, but the goal is said to be assessing “science wise error rates” as in a diagnostic screening context. A second move (2) is to claim that a high positive predictive value PPV from the diagnostic model warrants high “epistemic probability”–whatever that is– to the particular case at hand.
The upshot of both is at odds with the goal of restoring scientific integrity. Even if we were to grant these “prevalence rates” (to allude to diagnostic testing), my question is: Why would it be relevant to how good a job you did in testing your particular hypothesis, call it H*? Sciences with high “crud factors” (Meehl 1990) might well get a high PPV simply because nearly all their nulls are false. This still wouldn’t be evidence of replication ability, nor of understanding of the phenomenon. It would reward non-challenging thinking, and taking the easiest way out.
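The point about crud factors can be seen directly in the screening arithmetic. The sketch below uses the standard positive predictive value formula from diagnostic testing; the prevalence and power figures plugged in are hypothetical, chosen only to contrast two imagined fields.

```python
def ppv(prevalence_false_nulls, alpha=0.05, power=0.8):
    """Positive predictive value in the screening model: the proportion of
    'significant' results for which the null is false.

    prevalence_false_nulls is the assumed share of false nulls in the urn
    (a quantity we would never actually know).
    """
    true_pos = power * prevalence_false_nulls
    false_pos = alpha * (1 - prevalence_false_nulls)
    return true_pos / (true_pos + false_pos)

# A well-powered field where half the nulls are true...
print(round(ppv(0.50, power=0.8), 3))
# ...versus a poorly powered field where nearly every null is false.
print(round(ppv(0.95, power=0.2), 3))
```

The second field earns the higher PPV despite its poor power, purely because of what is assumed to be in the urn. Nothing in the number reflects how well any particular H* was probed.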
Safe Science. We hear it recommended that research focus on questions and hypotheses with high prior prevalence. Of course we’d never know the % of true nulls (many say all nulls are false, although that too is false) and we could cleverly game the description to have suitably high or low prior prevalence. Just think of how many ways you could describe those urns of nulls to get a desired PPV, especially on continuous parameters. Then there’s the heralding of safe science:
Large-scale evidence should be targeted for research questions where the pre-study probability is already considerably high, so that a significant research finding will lead to a post-test probability that would be considered quite definitive (Ioannidis, 2005, p. 0700).
The diagnostic model, in effect, says keep doing what you’re doing: publish after an isolated significant result, possibly with cherry-picking and selection effects to boot, just make sure there’s high enough prior prevalence. That preregistration often makes previous significant results vanish shows the problem isn’t the statistical method but its abuse. Ioannidis has done much to expose bad methods, but not with the diagnostic model he earlier popularized.
In every case of a major advance or frontier science that I can think of, there had been little success in adequately explaining some effect–low prior prevalence. It took Prusiner 10 years of failed experiments to finally transmit the prion for mad cow to chimps. People didn’t believe there could be infection without nucleic acid (some still adhere to the “protein only” hypothesis.) He finally won a Nobel Prize, but he would have had a lot less torture if he’d just gone along to get along, keeping to the central dogma of biology rather than following the results that upended it. However, it’s the researcher who has worked with a given problem, building on results and subjecting them to scrutiny, who understands the phenomenon well enough to not just replicate, but alter the entire process in new ways (e.g., prions are now being linked to Alzheimer’s).
Researchers who have churned out and published isolated significant results, and focused on “research questions where the pre-study probability is already considerably high” might meet the quota on PPV, but still won’t have the understanding to even show they “know how to conduct an experiment which will rarely fail to give us a statistically significant result”–which was Fisher’s requirement before inferring a genuine phenomenon (Fisher 1947).
Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than P-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities. (Senn)
References & Related articles
Berger, J. O. (2003). “Could Fisher, Jeffreys and Neyman have Agreed on Testing?” Statistical Science 18: 1-12.
Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.
Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.
Colquhoun, D. (2014) “An investigation of the false discovery rate and the misinterpretation of p-values.” Royal Society Open Science, 2014 1(3): pp. 1-16.
Fisher, R. A., (1956). Statistical Methods and Scientific Inference, Edinburgh: Oliver and Boyd.
Fisher, R.A. (1947), Design of Experiments.
Ioannidis, J. (2005). “Why Most Published Research Findings Are False”.
Jeffreys, (1939). Theory of Probability, Oxford: Oxford University Press.
Mayo, D. (1997a). “Duhem’s Problem, The Bayesian Way, and Error Statistics, or ‘What’s Belief got To Do With It?’” and “Response to Howson and Laudan,” Philosophy of Science 64(1): 222-244 and 323-333. NOTE: This issue only comes up in the “Response”, but it made most sense to include both here.
Mayo, D. (1997b) “Error Statistics and Learning from Error: Making a Virtue of Necessity,” in L. Darden (ed.) Supplemental Issue PSA 1996: Symposia Papers, Philosophy of Science 64: S195-S212.
Mayo, D. (2003). Comment on J. O. Berger’s “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”, Statistical Science 18: 19-24.
Mayo, D. (2004). “An Error-Statistical Philosophy of Evidence,” in M. Taper and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press: 79-118.
Mayo (2005). “Philosophy of Statistics” in S. Sarkar and J. Pfeifer (eds.) Philosophy of Science: An Encyclopedia, London: Routledge: 802-815. (Has typos.)
Mayo, D.G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.
Mayo, D. and Kruse, M. (2001). “Principles of Inference and Their Consequences,” in D. Corfield and J. Williamson (eds.) Foundations of Bayesianism. Dordrecht: Kluwer Academic Publishers: 381-403.
Mayo, D. and Spanos, A. (2011). “Error Statistics” in Philosophy of Statistics , Handbook of Philosophy of Science Volume 7 Philosophy of Statistics, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports 66 (1): 195-244.
Pratt, J. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence: Comment.” J. Amer. Statist. Assoc. 82: 123-125.
Prusiner, S. (1991). Molecular Biology of Prion Diseases. Science, 252(5012), 1515-1522.
Prusiner, S. B. (2014) Madness and Memory: The Discovery of Prions—a New Biological Principle of Disease, New Haven, Connecticut: Yale University Press.
Spanos, A. (2013). “Who Should Be Afraid of the Jeffreys-Lindley Paradox”.
Taleb, N. (2013). “Beware the Big Errors of Big Data”. Wired.
Prof. Larry Laudan
Lecturer in Law and Philosophy
University of Texas at Austin
“‘Not Guilty’: The Misleading Verdict and How It Fails to Serve either Society or the Innocent Defendant”
Most legal systems in the developed world share in common a two-tier verdict system: ‘guilty’ and ‘not guilty’. Typically, the standard for a judgment of guilty is set very high while the standard for a not-guilty verdict (if we can call it that) is quite low. That means any level of apparent guilt less than about 90% confidence that the defendant committed the crime leads to an acquittal (90% being the usual gloss on proof beyond a reasonable doubt, although few legal systems venture a definition of BARD that precise). According to conventional wisdom, the major reason for setting the standard as high as we do is the desire, even the moral necessity, to shield the innocent from false conviction.
There is, however, an egregious drawback to a legal system so structured. To wit, a verdict of ‘not guilty’ tells us nothing whatever about whether it is reasonable to believe that the defendant did not commit the crime. It offers no grounds whatever for inferring that an acquitted defendant probably did not commit the crime. That fact alone should make most of us leery about someone acquitted of a felony. Will a bank happily hire someone recently acquitted of a forgery charge? Are the neighbors going to rest easy when one of them was charged with, and then acquitted of, child molestation?
While the current proof standard provides ample protection to the innocent from being falsely convicted (the false positive rate is ~3%), it does little or nothing to protect the reputation of the truly innocent defendants. If properly understood, it fails to send any message to the general public about how they should regard and treat an acquitted defendant because it fails to tell the public whether it’s likely or unlikely that he committed the crime.
It would not be difficult to remedy this centuries-old mess, both for the public and for the acquitted defendant, by employing a three-verdict system, as the Scots have been doing for some time. Their verdicts are: guilty, guilt not proven and innocent. In a Scottish trial, if guilt is proven beyond a reasonable doubt, the defendant is found guilty; if the jury thinks it more likely than not that defendant committed no crime, his verdict is ‘innocent’; if the jury suspects that defendant did the crime but is not sure beyond all reasonable doubt, the verdict is ‘guilt not proven’. Both the guilt-not-proven verdict and the innocence verdict are officially acquittals in the sense that those receiving it serve no jail time. (This gives a whole new meaning to the well-known phrase ‘going scot-free’.)
The Scottish verdict pattern serves the interests of both the innocent defendant and the general society. The Scots know that if a defendant received an innocent verdict, then the jury believed it likely that he committed no crime and that he should be treated accordingly. That is both important information for the citizenry and a substantial protection for the innocent defendant himself, since the innocent verdict is in effect an exoneration, entailing the likelihood of his innocence.
On the other hand, the Scottish guilt-not-proven verdict sends out the important message to citizens that no other Anglo-Saxon legal system can; to wit, that the acquitted defendant (with a guilt-not-proven verdict) should be treated warily by society since he was probably the culprit, even though he was neither convicted nor punished.
Interestingly, there is ample use of the intermediary verdict. The Scottish government reports in a study of criminal prosecutions in 2005 and 2006 that it turned out that 71% of those defendants tried for homicide and acquitted received a ‘guilt-not-proven’ verdict. That means that about 7-in-10 acquittals for murder in Scotland involved defendants regarded by the jurors as having probably committed the crime.[1] In a more recent analysis, the Scottish government reported that in rape cases some 35% of acquittals resulted in ‘guilt not proven’ verdicts. In murder cases, the probably guilty verdict rate was 27% of all acquittals.[2]
It’s worth adding that Scotland’s intermediary verdict gives us access to information on an error whose frequency no other Western legal system can easily compute: to wit, the frequency of false acquittals. It tells us that, at least in Scotland, the rate of false acquittals hovers between 1-in-4 and 1-in-3. That is crucial information for those of us who believe that a legitimate system of inquiry—whether a legal one or otherwise— must get a handle on its error rates. Without knowing that, we cannot possibly figure out whether the distribution of erroneous verdicts is in line with our beliefs about the respective costs of the two errors.
Scottish criminal law has one other interesting feature worthy of mention in this context: a verdict there requires only a majority vote from the 15 citizens who serve as the jury. By contrast, most American states require a unanimous vote among 12 jurors, contributing to a situation in which mistrials are both expensive and common. They are expensive because they usually lead to re-trials, which are rarely cheap. In some jurisdictions in the US, 20% or more of trials end in a hung jury.[3] Not surprisingly, hung juries in Scottish cases are much less frequent.
***
[1] See http://www.scotland.gov.uk/Publications/2006/04/25104019/11. See also the Scottish Government Statistical Bulletin, Crim/2006/Part 11.
[2] See Scottish Government, Criminal Proceedings in Scotland, 2013-14, Table 2B.
[3] A study by Paula Agor et al. (Are Hung Juries a Problem? National Center for State Courts and National Institute of Justice, 2002) found that in Washington, D.C. Superior Courts some 22.4% of jury trials ended in a hung jury; in Los Angeles Superior Courts, the hung jury rate was 19.5%.
ADDITIONAL RESOURCES:
Among Laudan’s books:
1977. Progress and its Problems: Towards a Theory of Scientific Growth
1981. Science and Hypothesis
1984. Science and Values
1990. Science and Relativism: Dialogues on the Philosophy of Science
1996. Beyond Positivism and Relativism
2006. Truth, Error and Criminal Law: An Essay in Legal Epistemology
Here you see my scruffy sketch of Egon drawn 20 years ago for the frontispiece of my book, “Error and the Growth of Experimental Knowledge” (EGEK 1996). The caption is
“I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot…” –E.S. Pearson, “Statistical Concepts in Their Relation to Reality”.
He is responding to Fisher to “dispel the picture of the Russian technological bogey”. [i]
So, as I said in my last post, just to make a short story long, I’ve recently been scouring around the history and statistical philosophies of Neyman, Pearson and Fisher for purposes of a book soon to be completed, and I discovered a funny little error about this quote. Only maybe 3 or 4 people alive would care, but maybe someone out there knows the real truth.
OK, so I’d been rereading Constance Reid’s great biography of Neyman, and in one place she interviews Egon about the sources of inspiration for their work. Here’s what Egon tells her:
One day at the beginning of April 1926, down ‘in the middle of small samples,’ wandering among apple plots at East Malling, where a cousin was director of the fruit station, he was ‘suddenly smitten,’ as he later expressed it, with a ‘doubt’ about the justification for using Student’s ratio (the t-statistic) to test a normal mean (Quotes are from Pearson in Reid, p. 60).
Soon after, Egon contacted Neyman and their joint work began.
I assumed the meanderings over apple plots was a different time, and that Egon just had a habit of conducting his deepest statistical thinking while overlooking fruit. Yet it shared certain unique features with the revelation when gazing over at the blackcurrant plot, as in my picture, if only in the date and the great importance he accorded it (although I never recall his saying he was “smitten” before). I didn’t think more about it. Then, late one night last week I grabbed a peculiar book off my shelf that contains a smattering of writings by Pearson for a work he never completed: “Student: A Statistical Biography of William Sealy Gosset” (1990, edited and augmented by Plackett and Barnard, Clarendon, Oxford). The very first thing I open up to is a note by Egon Pearson:
I cannot recall now what was the form of the doubt which struck me at East Malling, but it would naturally have arisen when discussing there the interpretation of results derived from small experimental plots. I seem to visualize myself sitting alone on a gate thinking over the basis of ‘small sample’ theory and ‘mathematical statistics Mark II’ [i.e., Fisher]. When nearly thirty years later (JRSS B, 17, 204 1955), I wrote refuting the suggestion of R.A.F. [Fisher] that the Neyman-Pearson approach to testing statistical hypotheses had arisen in industrial acceptance procedures, the plot which the gate was overlooking had through the passage of time become a blackcurrant one! (Pearson 1990 p. 81)
What? This is weird. So that must mean it wasn’t blackcurrants after all, and Egon is mistaken in the caption under the picture I drew 20 years ago. Yet, he doesn’t say here that it was apples either, only that it had “become a blackcurrant” plot in a later retelling. So, not blackcurrant, so, it must have been apple, putting this clue together with what he told Constance Reid. So it appears I can no longer quote that “blackcurrant” statement, at least not without explaining that, in all likelihood, it was really apples. If any statistical sleuths out there can corroborate that it was apples, or know the correct fruit that Egon was gazing at (and, come to think of it, why couldn’t it have been both?) I’d be very grateful to know [ii]. I will happily cite you. I know this is a bit of minutiae–don’t say I didn’t warn you [iii]. By contrast, the Pearson paper replying to Fisher is extremely important (and very short). It’s entitled “Statistical Concepts in Their Relation to Reality”. You can read the paper HERE.
[i] Some of the previous lines, and 6 following words:
There was no question of a difference in point of view having ‘originated’ when Neyman ‘re-interpreted’ Fisher’s early work on tests of significance ‘in terms of that technological and commercial apparatus which is known as an acceptance procedure’. …
Indeed, to dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot at the East Malling Research Station! –E.S. Pearson, “Statistical Concepts in Their Relation to Reality”
[ii] As Erich Lehmann put it in his EGEK review, Pearson is “the hero of Mayo’s story” because I found in his work, if only in brief discussions, hints, and examples, the key elements for an “inferential” or “evidential” interpretation of Neyman. So I should get the inspirational fruit correct.
[iii] I’m not saying I know the answer isn’t in the book on Student, or someplace else.
Fisher, R. A. 1955, “Statistical Methods and Scientific Induction”.
Pearson, E. S. 1955, “Statistical Concepts in Their Relation to Reality”.
Reid, C. 1998, Neyman–From Life. Springer.
This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ve recently been scouring around the history and statistical philosophies of Neyman, Pearson and Fisher for purposes of a book soon to be completed. I recently discovered a little anecdote that calls for a correction in something I’ve been saying for years. While it’s little more than a point of trivia, it’s in relation to Pearson’s (1955) response to Fisher (1955)–the last entry in this post. I’ll wait until tomorrow or the next day to share it, to give you a chance to read the background.
Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long run error properties is of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.
Cases of Type A and Type B
“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)
Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:
“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…
(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)
In cases of type A, long-run results are clearly of interest, while in cases of type B, repetition is impossible and may be irrelevant:
“In other and, no doubt, more numerous cases there is no repetition of the same type of trial or experiment, but all the same we can and many of us do use the same test rules to guide our decision, following the analysis of an isolated set of numerical data. Why do we do this? What are the springs of decision? Is it because the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment?
Or is it because we are content that the application of a rule, now in this investigation, now in that, should result in a long-run frequency of errors in judgment which we control at a low figure?” (Ibid., 173)
Although Pearson leaves this tantalizing question unanswered, claiming, “On this I should not care to dogmatize”, in studying how Pearson treats cases of type B, it is evident that in his view, “the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment” in learning about the particular case at hand.
“Whereas when tackling problem A it is easy to convince the practical man of the value of a probability construct related to frequency of occurrence, in problem B the argument that ‘if we were to repeatedly do so and so, such and such result would follow in the long run’ is at once met by the commonsense answer that we never should carry out a precisely similar trial again.
Nevertheless, it is clear that the scientist with a knowledge of statistical method behind him can make his contribution to a round-table discussion…” (Ibid., 171).
Pearson gives the following example of a case of type B (from his wartime work), where he claims no repetition is intended:
“Example of type B. Two types of heavy armour-piercing naval shell of the same caliber are under consideration; they may be of different design or made by different firms…. Twelve shells of one kind and eight of the other have been fired; two of the former and five of the latter failed to perforate the plate….” (Pearson 1947, 171)
“Starting from the basis that, individual shells will never be identical in armour-piercing qualities, however good the control of production, he has to consider how much of the difference between (i) two failures out of twelve and (ii) five failures out of eight is likely to be due to this inevitable variability…” (Ibid.)
We’re interested in considering what other outcomes could have occurred, and how readily, in order to learn what variability alone is capable of producing. As a noteworthy aside, Pearson shows that treating the observed difference (between the two proportions) in one way yields an observed significance level of 0.052; treating it differently (along Barnard’s lines), he gets 0.025 as the (upper) significance level. But in scientific cases, Pearson insists, the difference in error probabilities makes no real difference to substantive judgments in interpreting the results. Only in an unthinking, automatic, routine use of tests would it matter:
“Were the action taken to be decided automatically by the side of the 5% level on which the observation point fell, it is clear that the method of analysis used would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule.” (ibid., 192)
The two analyses correspond to the tests effectively asking different questions, and if we recognize this, says Pearson, different meanings may be appropriately attached.
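To see where a figure like 0.052 can come from, here is a minimal sketch. It assumes (the passage does not say so) that the first analysis is the one-sided conditional test for the 2×2 table, which holds both margins fixed and sums hypergeometric probabilities over the tables at least as unfavourable to the second shell type as the one observed. The function name is hypothetical.

```python
from math import comb

def hypergeom_lower_tail(fail_a, n_a, fail_b, n_b):
    """P(failures in sample A <= fail_a), with all margins of the 2x2 table fixed."""
    total = n_a + n_b
    total_fail = fail_a + fail_b
    denom = comb(total, n_a)               # ways to pick which shells form sample A
    lowest = max(0, total_fail - n_b)      # fewest failures sample A could contain
    tail = 0
    for k in range(lowest, fail_a + 1):
        # k failures and (n_a - k) successes land in sample A
        tail += comb(total_fail, k) * comb(total - total_fail, n_a - k)
    return tail / denom

# Pearson's data: 2 failures in 12 shells of one kind, 5 failures in 8 of the other.
p_conditional = hypergeom_lower_tail(2, 12, 5, 8)
print(round(p_conditional, 3))  # 0.052
```

Under that assumption the sketch reproduces the 0.052. The 0.025 figure comes from a different reference set (Barnard's approach) and is not computed here, which is the point: the two numbers answer different questions.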
Three Steps in the Original Construction of Tests
After setting up the test (or null) hypothesis, and the alternative hypotheses against which “we wish the test to have maximum discriminating power” (Pearson 1947, 173), Pearson defines three steps in specifying tests:
“Step 1. We must specify the experimental probability set, the set of results which could follow on repeated application of the random process used in the collection of the data…
Step 2. We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts” (Pearson 1966a, 173).
“Step 3. We then, if possible[i], associate with each contour level the chance that, if [the null] is true, a result will occur in random sampling lying beyond that level” (ibid.).
Pearson warns that:
“Although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order.” (Ibid. 173).
Strict behavioristic formulations jump from step 1 to step 3, after which one may calculate how the test has in effect accomplished step 2. However, the resulting test, while having adequate error probabilities, may have an inadequate distance measure and may even be irrelevant to the hypothesis of interest. This is one reason critics can construct howlers that appear to be licensed by N-P methods, and which make their way from time to time into this blog.
So step 3 remains crucial, even for cases of type [B]. There are two reasons: first, pre-data planning, which is familiar enough; second, post-data scrutiny. Post data, step 3 enables determining the capability of the test to have detected various discrepancies, departures, and errors, on which a critical scrutiny of the inferences is based. More specifically, the error probabilities are used to determine how well/poorly corroborated, or how severely tested, various claims are, post-data.
If we can readily bring about statistically significantly higher rates of success with the first type of armour-piercing naval shell than with the second (in the above example), we have evidence the first is superior. Or, as Pearson modestly puts it: the results “raise considerable doubts as to whether the performance of the [second] type of shell was as good as that of the [first]….” (Ibid., 192)[ii]
Still, while error rates of procedures may be used to determine how severely claims have/have not passed, they do not automatically do so. This, again, opens the door to potential howlers that neither Egon nor Jerzy, for that matter, would have countenanced.
Neyman Was the More Behavioristic of the Two
Pearson was (rightly) considered to have rejected the more behaviorist leanings of Neyman.
Here’s a snippet from an unpublished letter he wrote to Birnbaum (1974) about the idea that the N-P theory admits of two interpretations: behavioral and evidential:
“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.
In Pearson’s (1955) response to Fisher (blogged here):
“To dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot….!” (Pearson 1955, 204)
“To the best of my ability I was searching for a way of expressing in mathematical terms what appeared to me to be the requirements of the scientist in applying statistical tests to his data. After contact was made with Neyman in 1926, the development of a joint mathematical theory proceeded much more surely; it was not till after the main lines of this theory had taken shape with its necessary formalization in terms of critical regions, the class of admissible hypotheses, the two sources of error, the power function, etc., that the fact that there was a remarkable parallelism of ideas in the field of acceptance sampling became apparent. Abraham Wald’s contributions to decision theory of ten to fifteen years later were perhaps strongly influenced by acceptance sampling problems, but that is another story.” (ibid., 204-5).
“It may be readily agreed that in the first Neyman and Pearson paper of 1928, more space might have been given to discussing how the scientific worker’s attitude of mind could be related to the formal structure of the mathematical probability theory…. Nevertheless it should be clear from the first paragraph of this paper that we were not speaking of the final acceptance or rejection of a scientific hypothesis on the basis of statistical analysis…. Indeed, from the start we shared Professor Fisher’s view that in scientific enquiry, a statistical test is ‘a means of learning’…” (Ibid., 206)
“Professor Fisher’s final criticism concerns the use of the term ‘inductive behavior’; this is Professor Neyman’s field rather than mine.” (Ibid., 207)
__________________________
References:
Pearson, E. S. (1935), The Application of Statistical Methods to Industrial Standardization and Quality Control, London: British Standards Institution.
Pearson, E. S. (1947), “The Choice of Statistical Tests Illustrated on the Interpretation of Data Classed in a 2×2 Table,” Biometrika 34(1/2): 139-167.
Pearson, E. S. (1955), “Statistical Concepts and Their Relationship to Reality,” Journal of the Royal Statistical Society, Series B (Methodological), 17(2): 204-207.
Neyman, J. and Pearson, E. S. (1928), “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference, Part I.” Biometrika 20(A): 175-240.
[i] In some cases only an upper limit to this error probability may be found.
[ii] Pearson inadvertently switches from number of failures to number of successes in the conclusion of this paper.
[iii] I thank Aris Spanos for locating this work of Pearson’s from 1935
1. PhilSci and StatSci. I’m always glad to come across statistical practitioners who wax philosophical, particularly when Karl Popper is cited. Best of all is when they get the philosophy somewhere close to correct. So, I came across an article by Burnham and Anderson (2014) in Ecology:
“While the exact definition of the so-called ‘scientific method’ might be controversial, nearly everyone agrees that the concept of ‘falsifiability’ is a central tenant [sic] of empirical science (Popper 1959). It is critical to understand that historical statistical approaches (i.e., P values) leave no way to ‘test’ the alternative hypothesis. The alternative hypothesis is never tested, hence cannot be rejected or falsified!… Surely this fact alone makes the use of significance tests and P values bogus. Lacking a valid methodology to reject/falsify the alternative science hypotheses seems almost a scandal.” (Burnham and Anderson p. 629)
Well I am (almost) scandalized by this easily falsifiable allegation! I can’t think of a single “alternative”, whether in a “pure” Fisherian or a Neyman-Pearson hypothesis test (whether explicit or implicit) that’s not falsifiable; nor do the authors provide any. I grant that understanding testability and falsifiability is far more complex than the kind of popularized accounts we hear about; granted as well, theirs is just a short paper.^{[1]} But then why make bold declarations on the topic of the “scientific method and statistical science,” on falsifiability and testability?
We know that literal deductive falsification only occurs with trite examples like “All swans are white”: a single black swan falsifies the universal claim C: all swans are white, whereas observing a single white swan wouldn’t allow inferring C (unless there was only 1 swan, or no variability in color). But Burnham and Anderson are discussing statistical falsification, and statistical methods of testing. Moreover, the authors champion a methodology that they say has nothing to do with testing or falsifying: “Unlike significance testing”, the approaches they favor “are not ‘tests,’ are not about testing” (p. 628). I’m not disputing their position that likelihood ratios, odds ratios, Akaike model selection methods are not about testing, but falsification is all about testing! No tests, no falsification, not even of the null hypotheses (which they presumably agree significance tests can falsify). It seems almost a scandal, and it would be one if critics of statistical testing were held to a more stringent, more severe, standard of evidence and argument than they are.
I may add installments/corrections (certainly on E. Pearson’s birthday Thursday); I’ll update with (i), (ii) and the date.
A bit of background. I view significance tests as only a part of a general statistical methodology of testing, estimation, and modeling that employs error probabilities of methods to control and assess how capable methods are at probing errors, and blocking misleading interpretations of data. I call it an error statistical methodology. I reformulate statistical tests as tools for severe testing. The outputs report on the discrepancies that have and have not been tested with severity. There’s much in Popper I agree with: data x only count as evidence for a claim H_{1} if they constitute an unsuccessful attempt to falsify H_{1}. One does not have evidence for a claim if nothing has been done to rule out ways the claim may be false. I use formal error probabilities to direct a more satisfactory notion of severity than Popper.
2. Popper, Fisher-Neyman-Pearson, and falsification.
Popper’s philosophy shares quite a lot with the stringent testing ideas found in Fisher, and also Neyman-Pearson, something Popper himself recognized in the work the authors cite (LSD). Here is Popper:
We say that a theory is falsified only if we have accepted basic statements which contradict it…. This condition is necessary but not sufficient; for we have seen that non-reproducible single occurrences are of no significance to science. Thus a few stray basic statements contradicting a theory will hardly induce us to reject it as falsified. We shall take it as falsified only if we discover a reproducible effect which refutes the theory. In other words, we only accept the falsification if a low level empirical hypothesis which describes such an effect is proposed and corroborated. (Popper LSD, 1959, 203)
Such “a low level empirical hypothesis” is well captured by a statistical claim. Unlike the logical positivists, Popper realized that singular observation statements couldn’t provide the “basic statements” for science. In the same spirit, Fisher warned that in order to use significance tests to legitimately indicate incompatibility with hypotheses, we need not an isolated low P-value, but an experimental phenomenon.
[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher 1947, p. 14)
If such statistically significant effects are produced reliably, as Fisher required, they indicate a genuine effect. Conjectured statistical effects are likewise falsified if they contradict data and/or could only be retained through ad hoc saves, verification biases and “exception incorporation”. Moving in stages between data collection, modeling, inferring, and from statistical to substantive hypotheses and back again, learning occurs by a series of piecemeal steps with the same reasoning. The fact that at one stage H_{1} might be the alternative, at another, the test hypothesis, is no difficulty. The logic differs from inductively updating the probability of a hypothesis, as well as from a comparison of how much more probable H_{1} makes the data than does H_{0}, as in likelihood ratios. These are 2 variants of probabilism.
Now there are many who embrace probabilism who deny they need tools to reject or falsify hypotheses. That’s fine. But having declared it a scandal (almost) for a statistical account to lack a methodology to reject/falsify, it’s a bit surprising to learn their account offers no such falsificationist tools. (Perhaps I’m misunderstanding; I invite correction.) For example, the likelihood ratio, they declare, “is an evidence ratio about parameters, given the model and the data. It is the likelihood ratio that defines evidence (Royall 1997)” (Burnham and Anderson, p. 628). They italicize “given” which underscores that these methods begin their work only after models are specified. Richard Royall is mentioned often, but Royall is quite clear that for data to favor H_{1} over H_{0} is not to have supplied evidence against H_{0}. (“the fact that we can find some other hypothesis that is better supported than H does not mean that the observations are evidence against H” (1997, pp.21-2).) There’s no such thing as evidence for or against a single hypothesis for him. But without evidence against H_{0}, one can hardly mount a falsification of H_{0}. Thus, I fail to see how their preferred account promotes falsification. It’s (almost) a scandal.
Maybe all they mean is that “historical” Fisher said the tests have only a null, so the only alternative would be its denial. First, we shouldn’t be limiting ourselves to what Fisher thought, nor keeping an arbitrary distinction among Fisherian tests, N-P tests, and confidence intervals. (David Cox is a leading Fisherian and his tests have either implicit or explicit alternatives. The choice of a test statistic indicates the alternative, even if it’s only directional. In N-P tests, the test hypothesis and the alternative may be swapped.) Second, even if one imagines the alternative is limited to either of the following:
(i) the effect is real/ non-spurious, or (ii) a parametric non-zero claim (e.g., μ ≠ 0),
they are still statistically falsifiable. An example of the first came last week. Shock waves were felt in high energy particle physics (HEP) when early indications (from last December) of a genuine new particle, one that would falsify the highly corroborated Standard Model (SM), were themselves falsified. This was based on falsifying a common statistical alternative in a significance test: the observed “resonance” (a great term) is real. (The “bumps” began to fade with more data [2].) As for case (ii), some of the most important results in science are null results. By means of high precision null hypothesis tests, bounds for statistical parameters are inferred by rejecting (or falsifying) discrepancies beyond the limits tests are capable of detecting. Think of the famous negative result of the Michelson-Morley experiments that falsified the “ether” (or aether) of the type ruled out by special relativity, or the famous equivalence principles of experimental GTR. An example of each is briefly touched upon in a paper with David Cox (Mayo and Cox 2006). Of course, background knowledge about the instruments and theories is operative throughout. More typical are the cases where power analysis can be applied, as discussed in this post.
“Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.”
Perhaps they only mean to say that Fisherian tests don’t directly try to falsify “the effect is real”. But they’re supposed to: it should be very difficult to bring about statistically significant results if the world is like H_{0}.
3. Model validation, specification and falsification.
When serious attention is paid to the discovery of new ways to extend models and theories, and to model validation, basic statistical tests are looked to. This is so even for Bayesians, be they ecumenical like George Box, or “falsificationists” like Gelman.
For Box, any account that relies on statistical models requires “diagnostic checks and tests of fit which, I will argue, require frequentist theory significance tests for their formal justification” (Box 1983, p. 57). This leads Box to advocate ecumenism. He asks,
[w]hy can’t all criticism be done using Bayes posterior analysis?…The difficulty with this approach is that by supposing all possible sets of assumptions are known a priori, it discredits the possibility of new discovery. But new discovery is, after all, the most important object of the scientific process (ibid., p. 73).
Listen to Andrew Gelman (2011):
At a philosophical level, I have been persuaded by the arguments of Popper (1959), Kuhn (1970), Lakatos (1978), and others that scientific revolutions arise from the identification and resolution of anomalies. In statistical terms, an anomaly is a misfit of model to data (or perhaps an internal incoherence of the model), and it can be identified by a (Fisherian) hypothesis test without reference to any particular alternative (what Cox and Hinkley 1974 call ‘pure significance testing’)^{[3]} (Gelman 2011, p. 70).
Discovery, model checking and correcting rely on statistical testing, formal or informal.
4. “An explicit, objective criterion of ‘best’ models” using methods that obey the LP (p.628).
Say Burnham and Anderson:
“At a deeper level, P values are not proper evidence as they violate the likelihood principle (Royall 1997)” (p. 627).
A list of pronouncements by Royall follows. What we know at a much deeper level is that any account that obeys the likelihood principle (LP) is not an account that directly assesses or controls the error probabilities of procedures. Control of error probabilities, even approximately, is essential for good tests, and this grows out of a concern, not for controlling error rates in the long run, but for evaluating how well tested models and hypotheses are with the data in hand. As with others who embrace the LP, the authors reject adjusting for selection effects, data dredging, multiple testing, etc.–gambits that alter the sampling distribution and, handled cavalierly, are responsible for much of the bad statistics we see. By the way, reference or default Bayesians also violate the LP. You can’t just make declarations about “proper evidence” without proper evidence. (There’s quite a lot on the LP on this blog; see also links to posts below the references.)
Burnham and Anderson are concerned with how old a method is. Oh the horrors of being a “historical” method. Appealing to ridicule (“progress should not have to ride in a hearse”) is no argument. Besides, it’s manifestly silly to suppose you use a single method, or that error statistical tests haven’t been advanced as well as reinterpreted since Fisher’s day. Moreover, the LP is a historical, baked-on principle suitable for ye olde logical positivist days when empirical observations were treated as “given”. Within that statistical philosophy, it was typical to hold that the data speak for themselves, and that questionable research practices such as cherry-picking, data-dredging, data-dependent selections, and optional stopping are irrelevant to “what the data are saying”! It’s redolent of the time where statistical philosophy sought a single, “objective” evidential relationship to hold between given data, model and hypotheses. Holders of the LP still say this, and the authors are no exception.
[The LP was, I believe, articulated by George Barnard who announced he rejected it at the 1959 Savage forum for all but predesignated simple hypotheses. If you have a date or correction, please let me know. 8/10]
The truth is that one of the biggest problems behind the “replication crisis” is the violation of some age-old truisms about science. It’s the consumers of bad science (in medicine at least) that are likely to ride in a hearse. There’s something wistful about remarks we hear from some quarters now. Listen to Ben Goldacre (2016) in Nature: “The basics of a rigorous scientific method were worked out many years ago, but there is now growing concern about systematic structural flaws that undermine the integrity of published data,” which he follows with a list of selective publication, data dredging and all the rest, “leading collectively to the ‘replication crisis’.”
He’s trying to remind us that the rules for good science were all in place long ago and somehow are now being ignored or trampled over, in some fields. Wherever there’s a legitimate worry about “perverse incentives,” it’s not a good idea to employ methods where selection effects vanish.
5. Concluding comments
I don’t endorse many of the applications of significance tests in the literature, especially in the social sciences. Many p-values reported are vitiated by fallacious interpretations (going from a statistical to substantive effect), violated assumptions, and biasing selection effects. I’ve long recommended a reformulation of the tools to avoid fallacies of rejection and non-rejection. In some cases, sadly, better statistical inference cannot help, but that doesn’t make me want to embrace methods that do not directly pick up on the effects of biasing selections. Just the opposite.
If the authors are serious about upholding Popperian tenets of good science, then they’ll want to ensure the claims they make can be regarded as having passed a stringent probe into their falsity. I invite comments and corrections.
(Look for updates.)
____________
^{[1]}They are replying to an article by Paul Murtaugh. See the link to his paper here.
[2]http://www.physicsmatt.com/blog/2016/8/5/standard-model-1-diphotons-0
^{[3]}Gelman continues: “At the next stage, we see science–and applied statistics–as resolution of anomalies via the creation of improved models which often include their predecessors as special cases. This view corresponds closely to the error-statistics idea of Mayo (1996).”
Related Blogposts
LAW OF LIKELIHOOD: ROYALL
8/29/14: BREAKING THE LAW! (of likelihood): to keep their fit measures in line (A), (B 2nd)
10/10/14: BREAKING THE (Royall) LAW! (of likelihood) (C)
11/15/14: Why the Law of Likelihood is bankrupt—as an account of evidence
11/25/14: How likelihoodists exaggerate evidence from statistical tests
P-VALUES EXAGGERATE
7/14/14: “P-values overstate the evidence against the null”: legit or fallacious? (revised)
7/23/14: Continued: “P-values overstate the evidence against the null”: legit or fallacious?
5/12/16: Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence”
Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn
Painful dichotomies
The tweet read “Featured review: Only 10% people with tension-type headaches get a benefit from paracetamol” and immediately I thought, ‘how would they know?’ and almost as quickly decided, ‘of course they don’t know, they just think they know’. Sure enough, on following up the link to the Cochrane Review in the tweet it turned out that, yet again, the deadly mix of dichotomies and numbers needed to treat had infected the brains of researchers to the extent that they imagined that they had identified personal response. (See Responder Despondency for a previous post on this subject.)
The bare facts they established are the following:
The International Headache Society recommends the outcome of being pain free two hours after taking a medicine. The outcome of being pain free or having only mild pain at two hours was reported by 59 in 100 people taking paracetamol 1000 mg, and in 49 out of 100 people taking placebo.
and the false conclusion they immediately asserted is the following:
This means that only 10 in 100 or 10% of people benefited because of paracetamol 1000 mg.
To understand the fallacy, look at the accompanying graph. This shows the simplest possible model describing events over time that is consistent with the ‘facts’. The model in question is the exponential distribution and what is shown is the cumulative probability of response for individuals suffering from tension headache depending on whether they are treated with placebo or paracetamol. The dashed vertical line is at the arbitrary International Headache Society critical time point of 2 hours. This intersects the placebo curve at 0.49 and the paracetamol curve at 0.59, exactly the figures quoted in the Cochrane review.
The model that the diagram represents is simplistic and almost certainly false. It is what would apply if it were the case that all patients given placebo had the same probability over time of headache resolution and ditto for paracetamol and an exponential model applied. However, the point is that for all we know it is true. It would take careful measurement over time for repeated headaches of the same individuals to establish the element of personal response (Senn 2016).
The curve given for placebo is what we would expect to find for the simple exponential model if it were the case that mean time to response were 2.97 hours when a patient was given placebo. The curve for paracetamol has a mean of 2.24 hours. It is important to understand that this is perfectly compatible with this being the long term average response time (that is to say averaged over many many headaches) for every patient and this means that any patient at any time feeling the symptoms of headache could expect to shorten that headache by 2.97-2.24=0.73 hrs or just under 45 minutes.
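The means quoted above can be checked directly. This is a minimal sketch, assuming only the simple exponential model described in the text, where the probability of resolution by time t is 1 − exp(−t/mean); the function names are hypothetical.

```python
from math import exp, log

def mean_time_from_response(p_response, t_hours):
    """Mean of an exponential time-to-resolution, given P(resolved by t_hours)."""
    # Solve 1 - exp(-t / mean) = p_response for the mean.
    return -t_hours / log(1.0 - p_response)

def prob_resolved_by(t_hours, mean_hours):
    """Cumulative probability of resolution by t_hours under the exponential model."""
    return 1.0 - exp(-t_hours / mean_hours)

mean_placebo = mean_time_from_response(0.49, 2.0)      # about 2.97 hours
mean_paracetamol = mean_time_from_response(0.59, 2.0)  # about 2.24 hours
saving_minutes = 60 * (mean_placebo - mean_paracetamol)

print(round(mean_placebo, 2), round(mean_paracetamol, 2), round(saving_minutes))
```

Every patient in this model has the same two curves, so the 59-versus-49 split at two hours is produced without any patient being a "responder" or "non-responder".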
Is this a benefit or not? I would say, ‘yes’. And that means that a perfectly logical way to describe the results is to say, ‘for all we know, any patient taking paracetamol for headache will benefit. The size of that benefit is an increase of the probability of resolution at 2 hours of 10 percentage points or a reduction of mean headache time of 3/4 of an hour’.
The latter, of course, depends on the exponential model being appropriate and it may be that some alternative can be found by careful analysis of the data. The point is, however, that the claim that only 10% will benefit by taking paracetamol is completely unjustified.
Unfortunately, the combination of arbitrary dichotomies (Senn 2003) and naïve analysis continues to fuel misunderstandings regarding personalised medicine.
Acknowledgement:
This work was funded by grant 602552 for the IDEAL project under the European Union FP7 programme and support from the programme is gratefully acknowledged.
MONTHLY MEMORY LANE: 3 years ago: July 2013. I mark in red three posts that seem most apt for general background on key issues in this blog, excluding those reblogged recently [1], and in green up to 3 others I’d recommend[2]. Posts that are part of a “unit” or a group of “U-Phils”(you [readers] philosophize) count as one.
July 2013
[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.
[2] New Rule, July 30, 2016.
Today is Karl Popper’s birthday. I’m linking to a reading from his Conjectures and Refutations[i] along with an undergraduate item I came across: Popper Self-Test Questions. It includes multiple choice questions, quotes to ponder, and thumbnail definitions at the end[ii].
Blog Readers who wish to send me their answers will have their papers graded (at least try the multiple choice; if you’re unsure, do the reading). [Use the comments or e-mail.]
[i] Popper reading from Conjectures and Refutations
[ii] I might note the “No-Pain philosophy” (3 part) Popper posts from this blog: parts 1, 2, and 3.
HAPPY BIRTHDAY POPPER!
REFERENCE:
Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.
Taboos about power nearly always stem from misuse of power analysis. Sander Greenland (2012) has a paper called “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.” I’m not saying Greenland errs; the error would be made by anyone who interprets power analysis in a manner giving rise to Greenland’s objection. So what’s (ordinary) power analysis?
(I) Listen to Jacob Cohen (1988) introduce Power Analysis
“PROVING THE NULL HYPOTHESIS. Research reports in the literature are frequently flawed by conclusions that state or imply that the null hypothesis is true. For example, following the finding that the difference between two sample means is not statistically significant, instead of properly concluding from this failure to reject the null hypothesis that the data do not warrant the conclusion that the population means differ, the writer concludes, at least implicitly, that there is no difference. The latter conclusion is always strictly invalid, and is functionally invalid as well unless power is high. The high frequency of occurrence of this invalid interpretation can be laid squarely at the doorstep of the general neglect of attention to statistical power in the training of behavioral scientists.
What is really intended by the invalid affirmation of a null hypothesis is not that the population ES is literally zero, but rather that it is negligible, or trivial. This proposition may be validly asserted under certain circumstances. Consider the following: for a given hypothesis test, one defines a numerical value i (or iota) for the ES, where i is so small that it is appropriate in the context to consider it negligible (trivial, inconsequential). Power (1 – b) is then set at a high value, so that b is relatively small. When, additionally, a is specified, n can be found. Now, if the research is performed with this n and it results in nonsignificance, it is proper to conclude that the population ES is no more than i, i.e., that it is negligible; this conclusion can be offered as significant at the b level specified. In much research, “no” effect (difference, correlation) functionally means one that is negligible; “proof” by statistical induction is probabilistic. Thus, in using the same logic as that with which we reject the null hypothesis with risk equal to a, the null hypothesis can be accepted in preference to that which holds that ES = i with risk equal to b. Since i is negligible, the conclusion that the population ES is not as large as i is equivalent to concluding that there is “no” (nontrivial) effect. This comes fairly close and is functionally equivalent to affirming the null hypothesis with a controlled error rate (b), which, as noted above, is what is actually intended when null hypotheses are incorrectly affirmed” (J. Cohen 1988, p. 16).
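Cohen’s recipe (fix a negligible i, fix a and b, then find n) can be sketched for one special case. The assumptions here are not Cohen’s: a one-sided Normal test of a mean with known σ, with i measured in the units of the mean, and purely illustrative numbers. Cohen’s a and b are read as the Type 1 and Type 2 error probabilities; the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def sample_size(iota, sigma, alpha, beta):
    """Smallest n giving power >= 1 - beta against a discrepancy of iota."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = STD_NORMAL.inv_cdf(1 - beta)
    return ceil(((z_alpha + z_beta) * sigma / iota) ** 2)

def power(iota, sigma, alpha, n):
    """Probability the one-sided test rejects when the true discrepancy is iota."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_alpha - iota * sqrt(n) / sigma)

# Hypothetical inputs: negligible discrepancy 0.5, sigma 2, a = .025, b = .05.
n = sample_size(iota=0.5, sigma=2.0, alpha=0.025, beta=0.05)
print(n, round(power(0.5, 2.0, 0.025, n), 3))
```

With such an n, a nonsignificant result licenses, in Cohen’s sense, only the claim that the discrepancy is less than i, with error rate b; it does not license the claim that the discrepancy is zero.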
Here Cohen imagines the researcher sets the size of a negligible discrepancy ahead of time–something not always available. Even where a negligible i may be specified, the power to detect that i may be low and not high. Two important points can still be made:
Now to tell what’s true about Greenland’s concern that “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.”
(II) The first step is to understand the assertion, giving the most generous interpretation. It deals with nonsignificance, so our ears are perked for a fallacy of nonrejection or nonsignificance. We know that “high power” is an incomplete concept, so he clearly means high power against “the alternative”.
For a simple example of Greenland’s phenomenon, consider an example of the Normal test we’ve discussed a lot on this blog. Let T+: H_{0}: µ ≤ 12 versus H_{1}: µ > 12, σ = 2, n = 100. Test statistic Z is √100(M – 12)/2 where M is the sample mean. With α = .025, the cut-off for declaring .025 significance is M*_{.025} = 12 + 2(2)/√100 = 12.4 (rounding to 2 rather than 1.96 to use a simple Figure below).
[Note: The thick black vertical line in the Figure, which I haven’t gotten to yet, is going to be the observed mean, M_{0 }= 12.35. It’s a bit lower than the cut-off at 12.4.]
Now a title like Greenland’s is supposed to signal some problem. What is it? The statistical part just boils down to noting that the observed mean M_{0 }(e.g., 12.35) may fail to make it to the cut-off M* (here 12.4), and yet be closer to an alternative against which the test has high power (e.g., 12.6) than it is to the null value, here 12. This happens because the Type 2 error probability is allowed to be greater than the Type 1 error probability (here .025).
Abbreviate the alternative against which the test T+ has .84 power as µ^{.84}, as I’ve often done. (See, for example, this post.) That is, the probability that test T+ rejects the null when µ = µ^{.84} is .84, i.e., POW(T+, µ^{.84}) = .84. One of our power short-cut rules tells us:
µ^{.84} = M* + 1σ_{M} = 12.4 + .2 = 12.6,
where σ_{M} := σ/√100 = .2.
Note: the Type 2 error probability in relation to alternative µ = 12.6 is .16. This is the area to the left of 12.4 under the red curve above. Pr(M < 12.4; μ = 12.6) = Pr(Z < -1) = .16 = β(12.6).
µ^{.84} exceeds the null value by 3σ_{M}, so any observed mean that exceeds 12 by more than 1.5σ_{M} but less than 2σ_{M} gives an example of Greenland’s phenomenon. [Note: the previous sentence corrects an earlier wording.] In T+, values 12.3 < M_{0} < 12.4 do the job. Pick M_{0} = 12.35. That value is indicated by the black vertical line in the figure above.
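The arithmetic for T+ can be checked with a short script. This is only a sketch under the post's own simplifications: it uses the rounded multiplier 2 rather than 1.96 for the cut-off, and the names (`cutoff`, `power`, `mu_84`, and so on) are illustrative labels, not notation from the post.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and sd 1

# Test T+: H0: mu <= 12 vs H1: mu > 12, with sigma = 2, n = 100
mu_0, sigma, n = 12.0, 2.0, 100
sigma_M = sigma / sqrt(n)        # standard error of the sample mean: 0.2
cutoff = mu_0 + 2 * sigma_M      # M* = 12.4, using 2 in place of 1.96


def power(mu_alt):
    """Pr(M >= M*; mu = mu_alt): probability T+ rejects when mu = mu_alt."""
    return 1 - STD_NORMAL.cdf((cutoff - mu_alt) / sigma_M)


mu_84 = cutoff + 1 * sigma_M     # short-cut rule: 12.6
pow_84 = power(mu_84)            # roughly .84
beta_84 = 1 - pow_84             # Type 2 error probability, roughly .16

# Greenland's phenomenon: M0 misses the cut-off yet sits nearer 12.6 than 12
M_0 = 12.35
not_significant = M_0 < cutoff
closer_to_alternative = abs(M_0 - mu_84) < abs(M_0 - mu_0)

print(f"sigma_M={sigma_M:.2f} cutoff={cutoff:.2f} mu_84={mu_84:.2f}")
print(f"POW(T+, mu_84)={pow_84:.3f} beta={beta_84:.3f}")
print(f"nonsignificant={not_significant} closer_to_12.6={closer_to_alternative}")
```

The unrounded power is about .841, which the post reports as .84.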
Having established the phenomenon, your next question is: so what?
It would be problematic if power analysis took the insignificant result as evidence for μ = 12 (i.e., 0 discrepancy from the null). I’ve no doubt some try to construe it as such, and that Greenland has been put in the position of needing to correct them. This is the reverse of the “mountains out of molehills” fallacy. It’s making molehills out of mountains. It’s not uncommon when a nonsignificant observed risk increase is taken as evidence that risks are “negligible or nonexistent” or the like. The data are looked at through overly rosy glasses (or bottle). Power analysis enters to avoid taking no evidence of increased risk as evidence of no risk. Its reasoning only licenses μ < µ^{.84} where .84 was chosen for “high power”. From what we see in Cohen, he does not give a green light to the fallacious use of power analysis.
(III) Now for how the inference from power analysis is akin to significance testing (as Cohen observes). Let μ^{1−β} be the alternative against which test T+ has high power, (1 – β). Power analysis sanctions the inference that would accrue if we switched the null and alternative, yielding the one-sided test in the opposite direction, T-, we might call it. That is, T- tests H_{0}: μ ≥ μ^{1−β} versus H_{1}: μ < μ^{1−β} at the β level. The test rejects H_{0} (at level β) when M < μ^{1−β} – z_{β}σ_{M}. Such a significant result would warrant inferring μ < μ^{1−β} at significance level β. Using power analysis doesn’t require making this switcheroo, which might seem complicated. The point is that there’s really no new reasoning involved in power analysis, which is why the members of the Fisherian tribe manage it without even mentioning power.
EXAMPLE. Use μ^{.84} in test T+ (α = .025, n = 100, σ_{M }= .2) to create test T-. Test T+ has .84 power against μ^{.84} = 12 + 3σ_{M} = 12.6 (with our usual rounding). So, test T- is
H_{0}: μ ≥ 12.6 versus H_{1}: μ < 12.6
and a result is statistically significantly smaller than 12.6 at level .16 whenever sample mean M < 12.6 – 1σ_{M} = 12.4. To check, note (as when computing the Type 2 error probability of test T+) that
Pr(M < 12.4; μ = 12.6) = Pr(Z < -1) = .16 = β. In test T-, this serves as the Type 1 error probability.
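The same check can be run numerically. A minimal sketch, assuming the rounded values of the example (σ_M = .2, z_β = 1); the function name `type1_error_T_minus` is hypothetical and introduced only for illustration.

```python
from statistics import NormalDist

sigma_M = 0.2          # sigma / sqrt(n), as in test T+
mu_null = 12.6         # boundary of H0 in T-: mu >= 12.6
z_beta = 1.0           # multiplier giving a level of about .16
cutoff = mu_null - z_beta * sigma_M   # T- rejects H0 when M < 12.4


def type1_error_T_minus():
    """Pr(M < cutoff; mu = 12.6), the Type 1 error probability of T-."""
    return NormalDist(mu=mu_null, sigma=sigma_M).cdf(cutoff)


alpha_T_minus = type1_error_T_minus()
# The same area is the Type 2 error probability of T+ against mu = 12.6
beta_T_plus = NormalDist().cdf(-z_beta)

print(f"cutoff={cutoff:.1f} alpha(T-)={alpha_T_minus:.3f} beta(T+)={beta_T_plus:.3f}")
```

Both quantities come out at about .159, the value rounded to .16 above.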
So ordinary power analysis follows the identical logic as significance testing. [i] Here’s a qualitative version of the logic of ordinary power analysis.
Ordinary Power Analysis: If data x are not statistically significantly different from H_{0}, and the power to detect discrepancy γ is high, then x indicates that the actual discrepancy is no greater than γ.[ii]
Or, another way to put this:
If data x are not statistically significantly different from H_{0}, then x indicates that the underlying discrepancy (from H_{0}) is no greater than γ, just to the extent that the power to detect discrepancy γ is high.
************************************************************************************************
[i] Neyman, we’ve seen, was an early power analyst. See, for example, this post.
[ii] Compare power analytic reasoning with severity reasoning from a negative or insignificant result.
POWER ANALYSIS: If Pr(d > c_{α}; µ’) = high and the result is not significant, then it’s evidence µ < µ’.
SEVERITY ANALYSIS (for an insignificant result): If Pr(d > d_{0}; µ’) = high and the result is not significant, then it’s evidence µ < µ’.
Severity replaces the pre-designated cut-off c_{α} with the observed d_{0}. Thus we obtain the same result remaining in the Fisherian tribe. We still abide by power analysis though, since if Pr(d > d_{0}; µ’) = high then Pr(d > c_{α}; µ’) = high, at least in a sensible test like T+. In other words, power analysis is conservative. It gives a sufficient but not a necessary condition for warranting bound: µ < µ’. But why view a miss as good as a mile? Power is too coarse.
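To see the contrast in numbers, here is a rough sketch that reuses test T+ (σ_M = .2, cut-off 12.4) and the alternative µ’ = 12.6 from the post. The function names are mine, and the second observed mean, 12.1, is a hypothetical value added only for comparison.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
sigma_M = 0.2       # standard error in test T+
cutoff = 12.4       # pre-designated cut-off (c_alpha on the scale of M)


def prob_exceeds(threshold, mu_prime):
    """Pr(M > threshold; mu = mu_prime) for a Normal sample mean."""
    return 1 - STD_NORMAL.cdf((threshold - mu_prime) / sigma_M)


def power(mu_prime):
    """Pr(M > cutoff; mu_prime): uses only the pre-designated cut-off."""
    return prob_exceeds(cutoff, mu_prime)


def severity(m_obs, mu_prime):
    """Pr(M > m_obs; mu_prime): uses the observed (nonsignificant) mean."""
    return prob_exceeds(m_obs, mu_prime)


mu_prime = 12.6
for m_obs in (12.35, 12.1):   # 12.35 is from the post; 12.1 is hypothetical
    print(f"M0={m_obs}: power={power(mu_prime):.3f} "
          f"severity={severity(m_obs, mu_prime):.3f}")
```

Power stays at about .84 whatever the nonsignificant outcome, while the severity for µ < 12.6 rises as the observed mean falls further below the cut-off. That is the sense in which power is coarse.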
Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum. [Link to quote above: p. 16]
Greenland, S. 2012. ‘Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative’, Annals of Epidemiology 22, pp. 364-8.
MONTHLY MEMORY LANE: 3 years ago: June 2013. I mark in red three posts that seem most apt for general background on key issues in this blog, excluding those reblogged recently [1]. Posts that are part of a “unit” or a group of “U-Phils” (you [readers] philosophize) count as one. Here I grouped 6/5 and 6/6.
June 2013
[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.
Allan Birnbaum died 40 years ago today. He lived to be only 53 [i]. From the perspective of philosophy of statistics and philosophy of science, Birnbaum is best known for his work on likelihood, the Likelihood Principle [ii], and for his attempts to blend concepts of likelihood with error probability ideas to arrive at what he termed “concepts of statistical evidence”. Failing to find adequate concepts of statistical evidence, Birnbaum called for joining the work of “interested statisticians, scientific workers and philosophers and historians of science”–an idea I have heartily endorsed. While known for a result that the (strong) Likelihood Principle followed from sufficiency and conditionality principles (a result that Jimmy Savage deemed one of the greatest breakthroughs in statistics), a few years after publishing it, he turned away from it, perhaps discovering gaps in his argument. A post linking to a 2014 Statistical Science issue discussing Birnbaum’s result is here. Reference [5] links to the Synthese 1977 volume dedicated to his memory. The editors describe it as their way of “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. Ample weekend reading!
NATURE VOL. 225 MARCH 14, 1970 (1033)
LETTERS TO THE EDITOR
Statistical Methods in Scientific Inference (posted earlier here)
It is regrettable that Edwards’s interesting article[1], supporting the likelihood and prior likelihood concepts, did not point out the specific criticisms of likelihood (and Bayesian) concepts that seem to dissuade most theoretical and applied statisticians from adopting them. As one whom Edwards particularly credits with having ‘analysed in depth…some attractive properties’ of the likelihood concept, I must point out that I am not now among the ‘modern exponents’ of the likelihood concept. Further, after suggesting that the notion of prior likelihood was plausible as an extension or analogue of the usual likelihood concept (ref. 2, p. 200)[2], I have pursued the matter through further consideration and rejection of both the likelihood concept and various proposed formalizations of prior information and opinion (including prior likelihood). I regret not having expressed my developing views in any formal publication between 1962 and late 1969 (just after ref. 1 appeared). My present views have now, however, been published in an expository but critical article (ref. 3, see also ref. 4)[3] [4], and so my comments here will be restricted to several specific points that Edwards raised.
If there has been ‘one rock in a shifting scene’ of general statistical thinking and practice in recent decades, it has not been the likelihood concept, as Edwards suggests, but rather the concept by which confidence limits and hypothesis tests are usually interpreted, which we may call the confidence concept of statistical evidence. This concept is not part of the Neyman-Pearson theory of tests and confidence region estimation, which denies any role to concepts of statistical evidence, as Neyman consistently insists. The confidence concept takes from the Neyman-Pearson approach techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data. (The absence of a comparable property in the likelihood and Bayesian approaches is widely regarded as a decisive inadequacy.) The confidence concept also incorporates important but limited aspects of the likelihood concept: the sufficiency concept, expressed in the general refusal to use randomized tests and confidence limits when they are recommended by the Neyman-Pearson approach; and some applications of the conditionality concept. It is remarkable that this concept, an incompletely formalized synthesis of ingredients borrowed from mutually incompatible theoretical approaches, is evidently useful continuously in much critically informed statistical thinking and practice [emphasis mine].
While inferences of many sorts are evident everywhere in scientific work, the existence of precise, general and accurate schemas of scientific inference remains a problem. Mendelian examples like those of Edwards and my 1969 paper seem particularly appropriate as case-study material for clarifying issues and facilitating effective communication among interested statisticians, scientific workers and philosophers and historians of science.
Allan Birnbaum
New York University
Courant Institute of Mathematical Sciences,
251 Mercer Street,
New York, NY 10012
Birnbaum’s confidence concept, sometimes written (Conf), was his attempt to find in error statistical ideas a concept of statistical evidence–a term that he invented and popularized. In Birnbaum 1977 (24), he states it as follows:
(Conf): A concept of statistical evidence is not plausible unless it finds ‘strong evidence for J as against H’ with small probability (α) when H is true, and with much larger probability (1 – β) when J is true.
Birnbaum questioned whether Neyman-Pearson methods had “concepts of evidence” simply because Neyman talked of “inductive behavior” and Wald and others couched statistical methods in decision-theoretic terms. I have been urging that we consider instead how the tools may actually be used, and not be restricted by the statistical philosophies of founders (not to mention that so many of their statements are tied up with personality disputes, and problems of “anger management”). Recall, as well, E. Pearson’s insistence on an evidential construal of N-P methods, and the fact that Neyman, in practice, spoke of drawing inferences and reaching conclusions (e.g., Neyman’s nursery posts, links in [iii] below).
Still, since Birnbaum’s (Conf) appears to allude to pre-trial error probabilities, I regard (Conf) as still too “behavioristic”. But I discovered that Pratt, in the link in [5] below, entertains the possibility of viewing Conf in terms of what might be called post-data or “attained” error probabilities. Some of his papers hint at the possibility that he would have wanted to use Conf for a post-data assessment of how well (or poorly) various claims were tested. I developed the concept of severity and severe testing to provide an “evidential” or “inferential” notion, along with a statistical philosophy and a philosophy of science in which it is to be embedded.
I think that Fisher (1955) is essentially correct in maintaining that “When, therefore, Neyman denies the existence of inductive reasoning he is merely expressing a verbal preference”. It is a verbal preference one can also find in Popper’s view of corroboration. (He, and current day critical rationalists, also hold that probability arises to evaluate degrees of severity, well-testedness or corroboration, not inductive confirmation.) The inference to the severely corroborated claim is still inductive. It goes beyond the premises. It is qualified by the relevant severity assessments.
I have many of Birnbaum’s original drafts of papers and articles here (with carbon copies (!) and hand-written notes in the margins), thanks to the philosopher of science, Ronald Giere, who gave them to me years ago[iii].
***
[i] His untimely death was a suicide.
[ii] A considerable number of posts on the strong likelihood principle (SLP) may be found searching this blog (e.g., here and here). Links or references to the associated literature, perhaps all of it, may also be found here. A post linking to the 2014 Statistical Science issue on my criticism of Birnbaum’s “breakthrough” (to the SLP) is here.
[iii] See posts under “Neyman’s Nursery” (1, 2, 3, 4, 5)
References
[3] Birnbaum, A., in Philosophy, Science and Method: Essays in Honor of Ernest Nagel (edited by Morgenbesser, S., Suppes, P., and White, M.) (St. Martin’s Press. NY, 1969).
[4] Likelihood in International Encyclopedia of the Social Sciences (Crowell-Collier, NY, 1968).
[5] Full contents of Synthese 1977, dedicated to his memory in 1977, can be found in this post.
[6] Birnbaum, A. (1977). “The Neyman-Pearson theory as decision theory, and as inference theory; with a criticism of the Lindley-Savage argument for Bayesian theory”. Synthese 36 (1) : 19-49. See links in [5]
Professor Richard Gill
Statistics Group
Mathematical Institute
Leiden University
It was statistician Richard Gill who first told me about Diederik Stapel (see an earlier post on Diederik). We were at a workshop on Error in the Sciences at Leiden in 2011. I was very lucky to have Gill be assigned as my commentator/presenter—he was excellent! As I was explaining some data problems to him, he suddenly said, “Some people don’t bother to collect data at all!” That’s when I learned about Stapel.
Committees often turn to Gill when someone’s work is up for scrutiny of bad statistics or fraud, or anything in between. Do you think he’s being too easy on researchers when he says, about a given case:
“data has been obtained by some combination of the usual ‘questionable research practices’ [QRPs] which are prevalent in the field in question. Everyone does it this way, in fact, if you don’t, you’d never get anything published. …People are not deliberately cheating: they honestly believe in their theories and believe the data is supporting them.”
Isn’t that the danger in relying on deeply felt background beliefs? Have our attitudes changed (toward QRPs) over the past 3 years (harsher or less harsh)? Here’s a talk of his I blogged 3 years ago (followed by a letter he allowed me to post). I reflect on the pseudoscientific nature of the ‘recovered memories’ program in one of the Geraerts et al. papers in a later post.
I certainly have been thinking about these issues a lot in recent months. I got entangled in intensive scientific and media discussions – mainly confined to the Netherlands – concerning the cases of social psychologist Dirk Smeesters and of psychologist Elke Geraerts. See: http://www.math.leidenuniv.nl/~gill/Integrity.pdf
And I recently got asked to look at the statistics in some papers of another … [researcher] ..but this one is still confidential ….
The verdict on Smeesters was that he, like Stapel, actually faked data (though he still denies this). The Geraerts case is very much open, very much unclear. The senior co-authors Merckelbach, McNally of the attached paper, published in the journal “Memory”, have asked the journal editors for it to be withdrawn because they suspect the lead author, Elke Geraerts, of improper conduct. She denies any impropriety. It turns out that none of the co-authors have the data. Legally speaking it belongs to the University of Maastricht where the research was carried out and where Geraerts was a promising postdoc in Merckelbach’s group. She later got a chair at Erasmus University Rotterdam and presumably has the data herself but refuses to share it with her old co-authors or any other interested scientists. Just looking at the summary statistics in the paper one sees evidence of “too good to be true”. Average scores in groups supposed in theory to be similar are much closer to one another than one would expect on the basis of the within group variation (the paper reports averages and standard deviations for each group, so it is easy to compute the F statistic for equality of the three similar groups and use its left tail probability as test statistic).
The same phenomenon turns up in another unpublished paper by the same authors and moreover in one of the papers contained in Geraerts’s (Maastricht) thesis. I attach the two papers published in Geraerts’s thesis which present results in very much the same pattern as the disputed “Memory” paper. Four groups of subjects, three supposed in theory to be rather similar, one expected to be strikingly different. In one of the two, just as in the Memory paper, the average scores of the three similar groups are much closer to one another than one would expect on the basis of the within-groups variation.
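[Illustration: the “too good to be true” check described in this letter can be sketched from summary statistics alone. This is not Gill’s actual analysis. It assumes Normal data with a common within-group variance, the group means, standard deviations and sizes below are hypothetical, and the function name is invented for the example. With exactly three groups the numerator has 2 degrees of freedom, so the left tail of the F distribution has a simple closed form and no statistics library is needed.]

```python
def left_tail_f_three_groups(means, sds, sizes):
    """F statistic and Pr(F <= observed) for equality of three group means."""
    if not (len(means) == len(sds) == len(sizes) == 3):
        raise ValueError("closed form below assumes exactly three groups")
    total = sum(sizes)
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / total
    df_between, df_within = 2, total - 3
    ms_between = sum(n * (m - grand_mean) ** 2
                     for n, m in zip(sizes, means)) / df_between
    ms_within = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds)) / df_within
    f_stat = ms_between / ms_within
    # With 2 numerator df the F cdf has a closed form:
    # Pr(F <= x) = 1 - (d2 / (d2 + 2x)) ** (d2 / 2)
    left_tail = 1 - (df_within / (df_within + 2 * f_stat)) ** (df_within / 2)
    return f_stat, left_tail


# Hypothetical summary statistics, NOT taken from any of the papers discussed
f_stat, p_left = left_tail_f_three_groups(
    means=[10.0, 10.1, 9.9], sds=[2.0, 2.0, 2.0], sizes=[30, 30, 30])
print(f"F={f_stat:.3f} left-tail p={p_left:.3f}")
```

[A very small left-tail probability says the group averages agree more closely than the within-group variation would lead one to expect.]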
I got involved in the quarrel between Merckelbach and Geraerts which was being fought out in the media so various science journalists also consulted me about the statistical issues. I asked Geraerts if I could have the data of the Memory paper so that I could carry out distribution-free versions of the statistical tests of “too good to be true” which are easy to perform if you just have the summary statistics. She claimed that I had to get permission from the University of Maastricht. At some point both the presidents of Maastricht and Erasmus university were involved and presumably their legal departments too. Finally I got permission and arranged a meeting with Geraerts where she was going to tell me “her side of the story” and give me the data and we would look at my analyses together. Merckelbach and his other co-authors all enthusiastically supported this too, by the way. However at the last moment the chair of her department at Erasmus university got worried and stepped in and now an internal Rotterdam (=Erasmus) committee is investigating the allegations and Geraerts is not allowed to give anyone the data or talk to anyone about the problem.
I think this is totally crazy. First of all, the data set should have been made public years ago. Secondly, the fact that the co-authors of the paper never even saw the data themselves is a sign of poor research practices. Thirdly, getting university lawyers and having high level university ethics committees involved does not further science. Science is furthered by open discussion. Publish the data, publish the criticism, and let the scientific community come to its own conclusion. Hold a workshop where different points of view are presented about what is going on in these papers, where statisticians and psychologists communicate to one another.
Probably, Geraerts’s data has been obtained by some combination of the usual “questionable research practices” which are prevalent in the field in question. Everyone does it this way, in fact, if you don’t, you’d never get anything published: sample sizes are too small, effects are too small, noise is too large. People are not deliberately cheating: they honestly believe in their theories and believe the data is supporting them and are just doing the best to make this as clear as possible to everyone.
Richard
PS: summary of my investigation of the papers contained in Geraerts’s PhD thesis:
ch 8 Geraerts et al 2006b BRAT Long term consequences of suppression of intrusive anxious thoughts and repressive coping.
ch 9 Geraerts et al 2006 AJP Suppression of intrusive thoughts and working memory capacity in repressive coping.
These two chapters show the pattern of four groups of subjects, three of which are very similar, while the fourth is strikingly different with respect to certain (but not all) responses.
In the case of chapter 8, the groups which are expected to be similar are (just as in the already disputed Memory and JAb papers) actually much too similar! The average scores are closer to one another than one can expect on the basis of the observed within-group variation (1 over square root of N law).
In the case of chapter 9, nothing odd seems to be going on. The variation between the average scores of similar groups of subjects is just as big as it ought to be, relative to the variation within the groups.
Geraerts et al (2008 Memory pdf). “Recovered memories of childhood sexual abuse: Current findings and their legal implications” Legal and Criminological Psychology 13, 165–176
In a post 3 years ago (“What do these share in common: m&m’s, limbo stick, ovulation, Dale Carnegie? Sat night potpourri”), I expressed doubts about expending serious effort to debunk the statistical credentials of studies that most readers without any statistical training would regard as “for entertainment only,” dubious, or pseudoscientific quackery. It needn’t even be that the claim is implausible, what’s implausible is that it has been well probed in the experiment at hand. Given the attention being paid to such examples by some leading statisticians, and scores of replication researchers over the past 3 years–attention that has been mostly worthwhile–maybe the bar has been lowered. What do you think? Anyway, this is what I blogged 3 years ago. (Oh, I decided to put in a home-made cartoon!)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I had said I would label as pseudoscience or questionable science any enterprise that regularly permits the kind of ‘verification biases’ in the statistical dirty laundry list. How regularly? (I’ve been asked)
Well, surely if it’s as regular as, say, much of social psychology, it goes over the line. But it’s not mere regularity, it’s the nature of the data, the type of inferences being drawn, and the extent of self-scrutiny and recognition of errors shown (or not shown). The regularity is just a consequence of the methodological holes. My standards may be considerably more stringent than most, but quite aside from statistical issues, I simply do not find hypotheses well-tested if they are based on “experiments” that consist of giving questionnaires. At least not without a lot more self-scrutiny and discussion of flaws than I ever see. (There may be counterexamples.)
Attempts to recreate phenomena of interest in typical social science “labs” leave me with the same doubts. Huge gaps often exist between elicited and inferred results. One might locate the problem under “external validity” but to me it is just the general problem of relating statistical data to substantive claims.
Experimental economists (expereconomists) take lab results plus statistics to warrant sometimes ingenious inferences about substantive hypotheses. Vernon Smith (of the Nobel Prize in Econ) is rare in subjecting his own results to “stress tests”. I’m not withdrawing the optimistic assertions he cites from EGEK (Mayo 1996) on Duhem-Quine (e.g., from “Method in Experiment: Rhetoric and Reality” 2002, p. 104). I’d still maintain, “Literal control is not needed to attribute experimental results correctly (whether to affirm or deny a hypothesis). Enough experimental knowledge will do”. But that requires piece-meal strategies that accumulate, and at least a little bit of “theory” and/or a decent amount of causal understanding.[1]
I think the generalizations extracted from questionnaires allow for an enormous amount of “reading into” the data. Suddenly one finds the “best” explanation. Questionnaires should be deconstructed for how they may be misinterpreted, not to mention how responders tend to guess what the experimenter is looking for. (I’m reminded of the current hoopla over questionnaires on breadwinners, housework and divorce rates!) I respond with the same eye-rolling to just-so story telling along the lines of evolutionary psychology.
I apply the “Stapel test”: Even if Stapel had bothered to actually carry out the data-collection plans that he so carefully crafted, I would not find the inferences especially telling in the least. Take for example the planned-but-not-implemented study discussed in the recent New York Times article on Stapel:
Stapel designed one such study to test whether individuals are inclined to consume more when primed with the idea of capitalism. He and his research partner developed a questionnaire that subjects would have to fill out under two subtly different conditions. In one, an M&M-filled mug with the word “kapitalisme” printed on it would sit on the table in front of the subject; in the other, the mug’s word would be different, a jumble of the letters in “kapitalisme.” Although the questionnaire included questions relating to capitalism and consumption, like whether big cars are preferable to small ones, the study’s key measure was the amount of M&Ms eaten by the subject while answering these questions….Stapel and his colleague hypothesized that subjects facing a mug printed with “kapitalisme” would end up eating more M&Ms.
Stapel had a student arrange to get the mugs and M&Ms and later load them into his car along with a box of questionnaires. He then drove off, saying he was going to run the study at a high school in Rotterdam where a friend worked as a teacher.
Stapel dumped most of the questionnaires into a trash bin outside campus. At home, using his own scale, he weighed a mug filled with M&Ms and sat down to simulate the experiment. While filling out the questionnaire, he ate the M&Ms at what he believed was a reasonable rate and then weighed the mug again to estimate the amount a subject could be expected to eat. He built the rest of the data set around that number. He told me he gave away some of the M&M stash and ate a lot of it himself. “I was the only subject in these studies,” he said.
He didn’t even know what a plausible number of M&Ms consumed would be! But never mind that, observing a genuine “effect” in this silly study would not have probed the hypothesis. Would it?
II. Dancing the pseudoscience limbo: How low should we go?
Should those of us serious about improving the understanding of statistics be expending ammunition on studies sufficiently crackpot to lead CNN to withdraw reporting on a resulting (published) paper?
“Last week CNN pulled a story about a study purporting to demonstrate a link between a woman’s ovulation and how she votes, explaining that it failed to meet the cable network’s editorial standards. The story was savaged online as “silly,” “stupid,” “sexist,” and “offensive.” Others were less nice.”
That’s too low down for me.…(though it’s good for it to be in Retraction Watch). Even stooping down to the level of “The Journal of Psychological Pseudoscience” strikes me as largely a waste of time–for meta-methodological efforts at least.
I was hastily making these same points in an e-mail to Andrew Gelman just yesterday:
E-mail to Gelman: Yes, the idea that X should be published iff a p<.05 in an interesting topic is obviously crazy.
I keep emphasizing that the problems of design and of linking stat to substantive are the places to launch a critique, and the onus is on the researcher to show how violations are avoided. … I haven’t looked at the ovulation study (but this kind of thing has been done a zillion times) and there are a zillion confounding factors and other sources of distortion that I know were not ruled out. I’m prepared to abide such studies as akin to Zoltar at the fair [Zoltar the fortune teller]. Or, view it as a human interest story—let’s see what amusing data they collected, […oh, so they didn’t even know if women they questioned were ovulating]. You talk of top psych journals, but I see utter travesties in the ones you call top. I admit I have little tolerance for this stuff, but I fail to see how adopting a better statistical methodology could help them. …
Look, there aren’t real regularities in many, many areas–better statistics could only reveal this to an honest researcher. If Stapel actually collected data on M&M’s and having a mug with “Kapitalism” in front of subjects, it would still be B.S.! There are a lot of things in the world I consider crackpot. They may use some measuring devices, and I don’t blame those measuring devices simply because they occupy a place in a pseudoscience or “pre-science” or “a science-wannabe”. Do I think we should get rid of pseudoscience? Yes! [At least if they have pretensions to science, and are not described as “for entertainment purposes only”[2].] But I’m afraid this would shut down [or radically redescribe] a lot more fields than you and most others would agree to. So it’s live and let live, and does anyone really think it’s hurting honest science very much?
There are fields like (at least parts of) experimental psychology that have been trying to get scientific by relying on formal statistical methods, rather than doing science. We get pretensions to science, and then when things don’t work out, they blame the tools. First, significance tests, then confidence intervals, then meta-analysis,…do you think these same people are going to get the cumulative understanding they seek when they move to Bayesian methods? Recall [Frank] Schmidt in one of my Saturday night comedies, rhapsodizing about meta-analysis:
“It means that the behavioral and social sciences can attain the status of true sciences: they are not doomed forever to the status of quasi-sciences or pseudoscience. ..[T]he gloom, cynicism, and nihilism that have enveloped many in the behavioral and social sciences is lifting. Young people starting out in the behavioral and social sciences today can hope for a much brighter future.”(Schmidt 1996)
III. Dale Carnegie salesman fallacy:
It’s not just that bending over backwards to criticize the most blatant abuses of statistics is a waste of time. I also think dancing the pseudoscientific limbo too low has a tendency to promote its very own fallacy! I don’t know if it has a name, so I made one up. Carnegie didn’t mean this to be used fallaciously, but merely as a means to a positive sales pitch for an idea, call it H. You want to convince a person of H? Get them to say yes to a series of claims first, then throw in H and let them make the leap to accept H too. “You agree that the p-values in the ovulation study show nothing?” “Yes” “You agree that study on bicep diameter is bunk?” “Yes, yes”, and “That study on ESP—pseudoscientific, yes?” “Yes, yes, yes!” Then announce, “I happen to favor operational probalogist statistics (H)”. Nothing has been said to advance H, no reasons have been given that it avoids the problems raised. But all those yeses may well lead the person to say yes to H, and to even imagine an argument has been given. Dale Carnegie was a shrewd man.
(June 25, 2016 cartoon)
Note: You might be interested in the (brief) exchange between Gelman and me in the comments from the original post.
Of relevance is a later post on the replication crisis in psych. Search the blog for more on replication, if interested.
[1] Vernon Smith ends his paper:
My personal experience as an experimental economist since 1956 resonates well with Mayo’s critique of Lakatos: “Lakatos, recall, gives up on justifying control; at best we decide—by appeal to convention—that the experiment is controlled. … I reject Lakatos and others’ apprehension about experimental control. Happily, the image of experimental testing that gives these philosophers cold feet bears little resemblance to actual experimental learning. Literal control is not needed to correctly attribute experimental results (whether to affirm or deny a hypothesis). Enough experimental knowledge will do. Nor need it be assured that the various factors in the experimental context have no influence on the result in question—far from it. A more typical strategy is to learn enough about the type and extent of their influences and then estimate their likely effects in the given experiment”. [Mayo EGEK 1996, 240]. V. Smith, “Method in Experiment: Rhetoric and Reality” 2002, p. 106.
My example in this chapter was linking statistical models in experiments on Brownian motion (by Brown).
[2] I actually like Zoltar (or Zoltan) fortune telling machines, and just the other day was delighted to find one in a costume store on 21st St.
Right after our session at the SPSP meeting last Friday, I chaired a symposium on replication that included Brian Earp–an active player in replication research in psychology (Replication and Evidence: A tenuous relationship p. 80). One of the first things he said, according to my notes, is that gambits such as cherry picking, p-hacking, hunting for significance, selective reporting, and other QRPs, had become standard practice in psychology, without any special need to adjust p-values or alert the reader to their spuriousness [i]. (He will correct me if I’m wrong[2].) It shocked me to hear it, even though it shouldn’t have, given what I’ve learned about statistical practice in social science. It was the Report on Stapel that really pulled back the curtain on this attitude toward QRPs in social psychology–as discussed in this blogpost 3 years ago. (If you haven’t read Section 5 of the report on flawed science, you should.) Many of us assumed that QRPs, even if still committed, were at least recognized to be bad statistical practices since the time of Morrison and Henkel’s (1970) Significance Test Controversy. A question now is this: have all the confessions of dirty laundry, the fraudbusting of prominent researchers, the pledges to straighten up and fly right, the years of replication research, done anything to remove the stains? I leave the question open for now. Here’s my “statistical dirty laundry” post from 2013:
[i] I assume this is no longer true.
[2] June 24: Earp’s correction was that QRPs had “become standard practice”. But if they were taught as things a scientist with integrity must avoid, or adjust for (or at least inform the reader about), then how did they become standard practice? In the interviews conducted by the Stapel committee, the interviewees showed a cavalier attitude toward these moves.
Some statistical dirty laundry
I finally had a chance to fully read the 2012 Tilburg Report* on “Flawed Science” last night. Here are some stray thoughts…
1. Slipping into pseudoscience.
The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”[1]. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job:
One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts…. [T]he use of research procedures in such a way as to ‘repress’ negative results by some means may be called verification bias. [my emphasis] (Report, 48).
I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses, as to count as no evidence at all (see some from their list). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89). [2] It is unclear at what point a field slips into the pseudoscience realm.
2. A role for philosophy of science?
I am intrigued that one of the final recommendations in the Report is this:
In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered. Based on these insights, research Master’s students and PhD students must receive practical training from their supervisors in the application of the rules governing proper and honest scientific research, which should include examples of such undesirable conduct as data massage. The Graduate School must explicitly ensure that this is implemented.
A philosophy department could well create an entire core specialization that revolved around “the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science” (ideally linked with one or more other departments). That would be both innovative and fill an important gap, it seems to me. Is anyone doing this?
3. Hanging out some statistical dirty laundry.
Items in their laundry list include:
- An experiment fails to yield the expected statistically significant results. The experiment is repeated, often with minor changes in the manipulation or other conditions, and the only experiment subsequently reported is the one that did yield the expected results. The article makes no mention of this exploratory method… It should be clear, certainly with the usually modest numbers of experimental subjects, that using experiments in this way can easily lead to an accumulation of chance findings….
- A variant of the above method is: a given experiment does not yield statistically significant differences between the experimental and control groups. The experimental group is compared with a control group from a different experiment—reasoning that ‘they are all equivalent random groups after all’—and thus the desired significant differences are found. This fact likewise goes unmentioned in the article….
- The removal of experimental conditions. For example, the experimental manipulation in an experiment has three values. …Two of the three conditions perform in accordance with the research hypotheses, but a third does not. With no mention in the article of the omission, the third condition is left out….
- The merging of data from multiple experiments [where data] had been combined in a fairly selective way,…in order to increase the number of subjects to arrive at significant results…
- Research findings were based on only some of the experimental subjects, without reporting this in the article. On the one hand ‘outliers’…were removed from the analysis whenever no significant results were obtained. (Report, 49-50)
For many further examples, and also caveats [3], see the Report.
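The first item on the list, rerunning an experiment until it "works" and reporting only that run, can be illustrated with a small simulation. This is a minimal sketch of my own, not anything taken from the Report: it assumes a true null (both groups drawn from the same N(0, 1) population), a two-sided z-test with known variance, 20 subjects per group, and up to 5 attempts per "lab". All function names and parameter values are hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist, fmean


def two_sample_p(n, rng):
    """Two-sided z-test p-value for two groups of size n.

    Both groups come from the same N(0, 1) population, so the null
    hypothesis of no difference is true by construction (an assumption
    of this sketch).
    """
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    # With known unit variance, the difference in means has sd sqrt(2/n).
    z = (fmean(a) - fmean(b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def reported_significance_rate(max_tries, n=20, alpha=0.05, labs=2000, seed=1):
    """Share of labs that can report p < alpha when each lab may rerun
    the experiment up to max_tries times and reports only a 'success'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(labs):
        # any() stops at the first significant attempt, like a lab that
        # stops rerunning once the experiment "works".
        if any(two_sample_p(n, rng) < alpha for _ in range(max_tries)):
            hits += 1
    return hits / labs


honest = reported_significance_rate(max_tries=1)   # one pre-planned test
hacked = reported_significance_rate(max_tries=5)   # rerun until it "works"
expected = 1 - 0.95 ** 5                           # analytic rate for 5 independent tries

print(f"single test:        {honest:.3f}")
print(f"best of 5 attempts: {hacked:.3f} (analytic: {expected:.3f})")
```

Under these assumptions the nominal .05 error rate applies only to the single pre-planned test. With five independent attempts the probability of at least one "significant" result is 1 − .95^5, so the reported p-value no longer reflects the actual error probability of the procedure that generated it.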
4. Significance tests don’t abuse science, people do.
Interestingly, the Report distinguishes the above laundry list from “statistical incompetence and lack of interest found” (52). If the methods used were statistical, then the scrutiny might be called “metastatistical” or the full scrutiny “meta-methodological”. Stapel often fabricated data, but the upshot of these criticisms is that sufficient finagling may similarly predetermine that a researcher’s favorite hypothesis gets support. (There is obviously a big advantage in having the data to scrutinize, as many are now demanding). Is it a problem of these methods that they are abused? Or does the fault lie with the abusers? Obviously the latter. Statistical methods don’t kill scientific validity, people do.
I have long rejected dichotomous testing, but the gambits in the laundry list create problems even for more sophisticated uses of methods, e.g., for indicating magnitudes of discrepancy and associated confidence intervals. At least the methods admit of tools for mounting a critique.
In “The Mind of a Con Man” (NY Times, April 26, 2013[4]), Diederik Stapel explains how he always read the research literature extensively to generate his hypotheses. “So that it was believable and could be argued that this was the only logical thing you would find.” Rather than report on believability, researchers need to report the properties of the methods they used: What was their capacity to have identified, avoided, or admitted verification bias? The role of probability here would not be to quantify the degree of confidence or believability in a hypothesis, given the background theory or most intuitively plausible paradigms, but rather to check how severely probed or well-tested a hypothesis is, whether the assessment is formal, quasi-formal or informal. Was a good job done in scrutinizing flaws…or a terrible one? Or was there just a bit of data massaging and cherry picking to support the desired conclusion? As a matter of routine, researchers should tell us. Yes, as Joe Simmons, Leif Nelson and Uri Simonsohn suggest in “A 21-word solution”, they should “say it!” I am no longer inclined to regard their recommendation as too unserious: researchers who are “clean” should go ahead and “clap their hands”[5]. (I will consider their article in a later post…)
I recommend reading the Tilburg report!
*The subtitle is “The fraudulent research practices of social psychologist Diederik Stapel.”
[1] “A ‘byproduct’ of the Committees’ inquiries is the conclusion that, far more than was originally assumed, there are certain aspects of the discipline itself that should be deemed undesirable or even incorrect from the perspective of academic standards and scientific integrity.” (Report 54).
[2] Mere falsifiability, by the way, does not suffice for stringency; but there are also methods Popper rejects that could yield severe tests, e.g., double counting. (Search this blog for more entries.)
[3] “It goes without saying that the Committees are not suggesting that unsound research practices are commonplace in social psychology. …although they consider the findings of this report to be sufficient reason for the field of social psychology in the Netherlands and abroad to set up a thorough internal inquiry into the state of affairs in the field” (Report, 48).
[4] Philosopher Janet Stemwedel discusses the NY Times article, noting that Diederik taught a course on research ethics!
[5] From Simmons, Nelson and Simonsohn:
The Fall 2012 Newsletter for the Society for Personality and Social Psychology
Popper, K. 1994, The Myth of the Framework.
Many support our call for transparency, and agree that researchers should fully disclose details of data collection and analysis. Many do not agree. What follows is a message for the former; we begin by preaching to the choir.
Choir: There is no need to wait for everyone to catch up with your desire for a more transparent science. If you did not p-hack a finding, say it, and your results will be evaluated with the greater confidence they deserve.
If you determined sample size in advance, say it.
If you did not drop any variables, say it.
If you did not drop any conditions, say it.