Statistics

Getting Credit (or blame) for Something You Didn’t Do (BP oil spill)

 

Spill Cam


Four years ago, many of us were glued to the “spill cam” showing, in real time, the oil gushing from the well after the April 20, 2010 explosion that sank the Deepwater Horizon oil rig in the Gulf of Mexico, killing 11, and spewing oil until July 15. (Remember junk shots, top kill, blowout preventers?)[1] The EPA lifted its gulf drilling ban on BP just a couple of weeks ago* (BP has paid around $27 billion in fines and compensation), and April 20, 2014, is the deadline to properly file forms for new compensation claims.

(*After which BP had another small spill in Lake Michigan.)

But what happened to the 200 million gallons of oil? Has it vanished, or has it just been sunk to the bottom of the sea by dispersants that may have caused hidden destruction of sea life? I don’t know, but given it’s Saturday night, let’s listen in to a reblog of a spill-related variation on the second of two original “overheard at the comedy hour” jokes.

In effect, it accuses the frequentist error-statistical account of licensing the following (make-believe) argument after the 2010 oil spill:

Oil Exec: We had highly reliable evidence that H: the pressure was at normal levels on April 20, 2010!

Senator: But you conceded that whenever your measuring tool showed dangerous or ambiguous readings, you continually lowered the pressure, and that the stringent “cement bond log” test was entirely skipped.

 Oil Exec:  Granted, we omitted reliable checks on April 20, 2010, but usually we do a better job—I am reporting the average!  You see, we use a randomizer that most of the time directs us to run the gold-standard check on pressure. April 20 just happened to be one of those times we did the nonstringent test; but on average we do OK.

Senator:  But you don’t know that your system would have passed the more stringent test you didn’t perform!

Oil Exec:  That’s the beauty of the frequentist test!

Even if we grant (for the sake of the joke) that overall, this “test” rarely errs in the report it outputs (pass or fail), that is irrelevant to appraising the inference from the data on April 20, 2010 (which would have differed had the more stringent test been run). That interpretation violates the severity criterion: the observed passing result was altogether common if generated from a source where the pressure level was unacceptably high; therefore it misinterprets the actual data. The question is why anyone would saddle the frequentist with such shenanigans on averages. … Lest anyone think I am inventing a criticism, here is a familiar statistical instantiation, where the probability of choosing each experiment is given as .5 (Cox 1958).

Two Measuring Instruments with Different Precisions:

A single observation X is to be made on a normally distributed random variable with unknown mean µ, but the measurement instrument is chosen by a coin flip: with heads we use instrument E’ with a known small variance, say 10⁻⁴, while with tails, we use E”, with a known large variance, say 10⁴. The full data indicate whether E’ or E” was performed, and the particular value observed, which we can write as x’ and x”, respectively. (This example comes up in my discussions of the strong likelihood principle (SLP), e.g., ton o’bricks, and here.)

In applying our favorite one-sided (upper) Normal test T+ to a null hypothesis, say, µ = 0, the “same” value of X would correspond to a much smaller p-value were it to have come from E’ than if it had come from E”.  Denote the two p-values as p’ and p”, respectively.  However, or so the criticism proceeds, the error statistician would report the average p-value:  .5(p’ + p”).

But this would give a misleading assessment of the precision and corresponding severity with either measurement! Instead you should report the p-value of the result in the experiment actually run (this is Cox’s Weak Conditionality Principle, WCP).
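To see how the conditional and averaged assessments come apart, here is a minimal sketch in Python (the observed value x_obs and the one-sided direction of T+ are illustrative choices, not from any source):

```python
import numpy as np
from scipy import stats

# Known standard deviations of the two instruments (variances 10^-4 and 10^4)
SD_PRECISE, SD_IMPRECISE = 1e-2, 1e2

def p_value(x, sd):
    """One-sided (upper) p-value for testing mu = 0 against mu > 0."""
    return 1 - stats.norm.cdf(x / sd)

x_obs = 0.05  # hypothetical observed measurement

p_precise = p_value(x_obs, SD_PRECISE)      # p':  had it come from E' (precise)
p_imprecise = p_value(x_obs, SD_IMPRECISE)  # p'': had it come from E'' (imprecise)

print(f"p'  (conditional on E'):       {p_precise:.2e}")   # essentially 0
print(f"p'' (conditional on E''):      {p_imprecise:.3f}")  # essentially 0.5
print(f"averaged 'p-value' .5(p'+p''): {0.5 * (p_precise + p_imprecise):.3f}")
# The WCP directs us to report the p-value for the instrument actually used;
# the average misdescribes the precision of both possible experiments.
```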

But what could lead the critic to suppose the error statistician must average over experiments not even performed?  Rule #2 for legitimate criticism is to give the position being criticized the most generous construal one can think of.  Perhaps the critic supposes what is actually a distortion of even the most radical behavioristic construal:

  •   If you consider outcomes that could have occurred in hypothetical repetitions of this experiment, you must also consider other experiments you did not run (but could have been run) in reasoning from the data observed (from the test you actually ran), and report some kind of frequentist average!

The severity requirement makes explicit that such a construal is to be rejected—I would have thought it obvious, and not in need of identifying a special principle. Since it wasn’t, I articulated this special notion for interpreting tests and the corresponding severity criterion.
I gave an honorary mention to Christian Robert [3] on this point in his discussion of Cox and Mayo (2010). Robert writes (p. 9):

A compelling section is the one about the weak conditionality principle (pp.294- 298), as it objects to the usual statement that a frequency approach breaks this principle. In a mixture experiment about the same parameter θ, inferences made conditional on the experiment “are appropriately drawn in terms of the sampling behaviour in the experiment known to have been performed” (p. 296). This seems hardly objectionable, as stated. And I must confess the sin of stating the opposite as The Bayesian Choice has this remark (Robert (2007), Example 1.3.7, p.18) that the classical confidence interval averages over the experiments. The term experiment validates the above conditioning in that several experiments could be used to measure θ, each with a different p-value. I will not argue with this.

He would want me to mention that he does raise some caveats:

I could, however, [argue] about ‘conditioning is warranted to achieve objective frequentist goals’ (p. 298) in that the choice of the conditioning, among other things, weakens the objectivity of the analysis. In a sense the above pirouette out of the conditioning principle paradox suffers from the same weakness, namely that when two distributions characterise the same data (the mixture and the conditional distributions), there is a choice to be made between “good” and “bad”. http://arxiv.org/abs/1111.5827

But there is nothing arbitrary about regarding as “good” the only experiment actually run and from which the actual data arose.  The severity criterion only makes explicit what is/should be already obvious. Objectivity, for us, is directed by the goal of making correct and warranted inferences, not freedom from thinking. After all, any time an experiment E is performed, the critic could insist that the decision to perform E is the result of some chance circumstances and with some probability we might have felt differently that day and have run some other test, perhaps a highly imprecise test or a much more precise test or anything in between, and demand that we report whatever average properties they come up with.  The error statistician can only shake her head in wonder that this gambit is at the heart of criticisms of frequentist tests.

Still, we exiled ones can’t be too fussy, and Robert still gets the mention for conceding that we have  a solid leg on which to pirouette.


[1] The relevance of the Deepwater Horizon spill to this blog stems from its having occurred while I was busy organizing the conference “StatSci meets PhilSci” (to take place at the LSE in June 2010). So all my examples there involved “deepwater drilling”, but of the philosophical sort. Search the blog for further connections (especially the RMM volume, and the blog’s “mascot” stock, Diamond Offshore, DO, which has now bottomed out at around $48—long story).

Of course, the spill cam wasn’t set up right away.

[2] If any readers work on the statistical analysis of the toxicity of the fish or sediment from the BP oil spill, or know of good references, please let me know.

BP said all tests had shown that Gulf seafood was safe to consume and there had been no published studies demonstrating seafood abnormalities due to the Deepwater Horizon accident.

[3] There have been around 4-5 other “honorable mentions” since then; I’m not sure of the exact count.

 

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos, eds.), Cambridge: Cambridge University Press: 247-275.

 

 

 


Categories: Comedy, Statistics | 2 Comments

A. Spanos: Jerzy Neyman and his Enduring Legacy

A Statistical Model as a Chance Mechanism
Aris Spanos 

Jerzy Neyman (April 16, 1894 – August 5, 1981), was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals. (This article was first posted here.)


Neyman: 16 April 1894 – 5 Aug 1981

One of Neyman’s most remarkable, but least recognized, achievements was his adaptation of Fisher’s (1922) notion of a statistical model to render it pertinent for non-random samples. Fisher’s original parametric statistical model Mθ(x) was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data x0:=(x1,x2,…,xn) can be viewed as a ‘truly representative sample’ from that ‘population’:

“The postulate of randomness thus resolves itself into the question, Of what population is this a random sample?” (ibid., p. 313), underscoring that “the adequacy of our choice may be tested a posteriori.” (p. 314)

In cases where data x0 come from sample surveys, or can be viewed as a typical realization of a random sample X:=(X1,X2,…,Xn), i.e. Independent and Identically Distributed (IID) random variables, the ‘population’ metaphor can be helpful in adding some intuitive appeal to the inductive dimension of statistical inference, because one can imagine using a subset of a population (the sample) to draw inferences pertaining to the whole population.

This ‘infinite population’ metaphor, however, is of limited value in most applied disciplines relying on observational data. To see how inept this metaphor is, consider the question: what is the hypothetical ‘population’ when modeling the gyrations of stock market prices? More generally, what is observed in such cases is a certain on-going process and not a fixed population from which we can select a representative sample. For that very reason, most economists in the 1930s considered Fisher’s statistical modeling irrelevant for economic data!

Due primarily to Neyman’s experience with empirical modeling in a number of applied fields, including genetics, agriculture, epidemiology, biology, astronomy and economics, his notion of a statistical model evolved beyond Fisher’s ‘infinite populations’ in the 1930s into Neyman’s frequentist ‘chance mechanisms’ (see Neyman, 1950, 1952):

Guessing and then verifying the ‘chance mechanism’, the repeated operation of which produces the observed frequencies. This is a problem of ‘frequentist probability theory’. Occasionally, this step is labeled ‘model building’. Naturally, the guessed chance mechanism is hypothetical. (Neyman, 1977, p. 99)

From my perspective, this was a major step forward for several reasons, including the following.

First, the notion of a statistical model as a ‘chance mechanism’ extended the intended scope of statistical modeling to include dynamic phenomena that give rise to data from non-IID samples, i.e. data that exhibit both dependence and heterogeneity, like stock prices.

Second, the notion of a statistical model as a ‘chance mechanism’ is not only of metaphorical value, but it can be operationalized in the context of a statistical model, formalized by:

Mθ(x) = {f(x;θ), θ∈Θ}, x∈ℝⁿ, Θ⊂ℝᵐ; m << n,

where the distribution of the sample f(x;θ) describes the probabilistic assumptions of the statistical model. This takes the form of a statistical Generating Mechanism (GM), stemming from  f(x;θ), that can be used to generate simulated data on a computer. An example of such a Statistical GM is:

Xt = α0 + α1Xt-1 + σεt,  t=1,2,…,n

This indicates how one can use pseudo-random numbers for the error term  εt ~NIID(0,1) to simulate data for the Normal, AutoRegressive [AR(1)] Model. One can generate numerous sample realizations, say N=100000, of sample size n in nanoseconds on a PC.
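As a sketch of how such a GM is operationalized (the parameter values and sample size below are hypothetical choices, not estimates from any data), one can generate a realization of the Normal AR(1) model directly from NIID pseudo-random errors:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar1(n, alpha0=0.2, alpha1=0.7, sigma=1.0, x0=0.0):
    """One realization from X_t = alpha0 + alpha1*X_{t-1} + sigma*eps_t, eps_t ~ NIID(0,1)."""
    x = np.empty(n)
    prev = x0
    for t in range(n):
        prev = alpha0 + alpha1 * prev + sigma * rng.standard_normal()
        x[t] = prev
    return x

sample = simulate_ar1(n=100)   # one simulated sample realization of size n = 100
print(sample[:5])
```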

Third, the notion of a statistical model as a ‘chance mechanism’ puts a totally different spin on another metaphor widely used by uninformed critics of frequentist inference: the ‘long-run’ metaphor associated with the relevant error probabilities used to calibrate frequentist inferences. The operationalization of the statistical GM reveals that the temporal aspect of this metaphor is totally irrelevant for frequentist inference; remember Keynes’s catch phrase “In the long run we are all dead”? Instead, what matters in practice is repeatability in principle, not repetition over time. For instance, one can use the above statistical GM to generate the empirical sampling distributions for any test statistic, and thus render operational not only the pre-data error probabilities, like the type I and II errors and the power of a test, but also the post-data probabilities associated with the severity evaluation; see Mayo (1996).
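Continuing the sketch (same hypothetical parameter values), repeating the generation N times yields the empirical sampling distribution of any statistic of interest—here, for illustration, the least-squares estimator of α1—with error probabilities read off as relative frequencies:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_ar1(n, alpha0, alpha1, sigma):
    x, prev = np.empty(n), 0.0
    for t in range(n):
        prev = alpha0 + alpha1 * prev + sigma * rng.standard_normal()
        x[t] = prev
    return x

def ols_alpha1(x):
    """Least-squares estimate of alpha1 from regressing X_t on X_{t-1} (with intercept)."""
    y, lag = x[1:], x[:-1]
    X = np.column_stack([np.ones_like(lag), lag])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

N, n = 10000, 100
estimates = np.array([ols_alpha1(simulate_ar1(n, 0.2, 0.7, 1.0)) for _ in range(N)])

# The histogram of `estimates` is the empirical sampling distribution of the
# estimator under the assumed GM; its tail areas give the corresponding error
# probabilities (and, post-data, severity assessments) as relative frequencies.
print(estimates.mean(), estimates.std())
```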

HAPPY BIRTHDAY NEYMAN!

For further discussion on the above issues see:

Spanos, A. (2012), “A Frequentist Interpretation of Probability for Model-Based Inductive Inference,” in Synthese:

http://www.econ.vt.edu/faculty/2008vitas_research/Spanos/1Spanos-2011-Synthese.pdf

Fisher, R. A. (1922), “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society A, 222: 309-368.

Mayo, D. G. (1996), Error and the Growth of Experimental Knowledge, The University of Chicago Press, Chicago.

Neyman, J. (1950), First Course in Probability and Statistics, Henry Holt, NY.

Neyman, J. (1952), Lectures and Conferences on Mathematical Statistics and Probability, 2nd ed. U.S. Department of Agriculture, Washington.

Neyman, J. (1977), “Frequentist Probability and Frequentist Statistics,” Synthese, 36, 97-131.

[i] He was born in an area that was part of Russia.

Categories: phil/history of stat, Spanos, Statistics | 4 Comments

Phil 6334: Notes on Bayesian Inference: Day #11 Slides

 


A. Spanos Probability/Statistics Lecture Notes 7: An Introduction to Bayesian Inference (4/10/14)

Categories: Bayesian/frequentist, Phil 6334 class material, Statistics | 10 Comments

“Murder or Coincidence?” Statistical Error in Court: Richard Gill (TEDx video)

“There was a vain and ambitious hospital director. A bad statistician. … There were good medics and bad medics, good nurses and bad nurses, good cops and bad cops … Apparently, even some people in the Public Prosecution service found the witch hunt deeply disturbing.”

This is how Richard Gill, statistician at Leiden University, describes a feature film (Lucia de B.) just released about the case of Lucia de Berk, a nurse found guilty of several murders based largely on statistics. Gill is widely known (among other things) for exposing the flawed statistical analysis used to convict her, which ultimately led (after Gill’s tireless efforts) to her conviction being overturned. (I hope they translate the film into English.) In a recent e-mail Gill writes:

“The Dutch are going into an orgy of feel-good tear-jerking sentimentality as a movie comes out (the premiere is tonight) about the case. It will be a good movie, actually, but it only tells one side of the story. …When a jumbo jet goes down we find out what went wrong and prevent it from happening again. The Lucia case was a similar disaster. But no one even *knows* what went wrong. It can happen again tomorrow.

I spoke about it a couple of days ago at a TEDx event (Flanders).

You can find some p-values in my slides ["Murder by Numbers", pasted below the video]. They were important – first in convicting Lucia, later in getting her a fair re-trial.”

Since it’s Saturday night, let’s watch Gill’s TEDx talk, “Statistical Error in Court”.

Slides from the Talk: “Murder by Numbers”:

 

Categories: junk science, P-values, PhilStatLaw, science communication, Statistics | Leave a comment

“Out Damned Pseudoscience: Non-significant results are the new ‘Significant’ results!” (update)

Sell me that antiseptic!

We were reading “Out, Damned Spot: Can the ‘Macbeth effect’ be replicated?” (Earp, B., Everett, J., Madva, E., and Hamlin, J. 2014, in Basic and Applied Social Psychology 36: 91-8) in an informal gathering of our 6334 seminar yesterday afternoon at Thebes. Some of the graduate students are interested in so-called “experimental” philosophy, and I asked for an example that used statistics for purposes of analysis. The example—and it’s a great one (thanks Rory M!)—revolves around priming research in social psychology. Yes, the field that has come in for so much criticism of late, especially after Diederik Stapel was found to have been fabricating data altogether (search this blog, e.g., here).[1]

But since then the field has, ostensibly, attempted to clean up its act. On the meta-level, Simmons, Nelson, and Simonsohn (2011) is an excellent example of the kind of self-scrutiny the field needs, and their list of requirements and guidelines offers a much needed start (along with their related work). But the research itself appears to be going on in the same way as before (I don’t claim this one is representative), except that now researchers are keen to show their ability and willingness to demonstrate failure to replicate. So negative results are the new positives! If the new fashion is non-replication, that’s what will be found (following Kahneman’s call for a “daisy chain” in [1]).

In “Out, Damned Spot,” the authors are unable to replicate what they describe as a famous experiment (Zhong and Liljenquist 2006) wherein participants who read “a passage describing an unethical deed as opposed to an ethical deed, … were subsequently likelier to rate cleansing products as more desirable than other consumer products” (92). There are a variety of protocols, all rather similar. For instance, students are asked to write out a passage to the effect that:

“I shredded a document that I knew my co-worker Harlan was desperately looking for so that I would be the one to get a promotion.”

or

“I place the much sought-after document in Harlan’s mail box.”

See the article for the exact words. Participants are told, untruthfully, that the study is on handwriting, or on punctuation, or the like. (Aside: Would you feel more desirous of soap products after punctuating a paragraph about shredding a file that your colleague is looking for? More desirous than when…? More desirous than if you put it in his mailbox, I guess.[2]) In another variation on the Zhong et al. studies, when participants are asked to remember an unethical vs. ethical deed they committed, they tended to pick an antiseptic wipe over a pen as compensation.

Yet these authors declare there is “a robust experimental foundation for the existence of a real-life Macbeth Effect” and therefore are  surprised that they are unable to replicate the result. The very fact that the article starts with giving high praise to these earlier studies already raises a big question mark in my mind as to their critical capacities, so I am not too surprised that they do not bring such capacities into their own studies. It’s so nice to have cross-out capability. Given that the field considers this effect solid and important, it is appropriate for the authors to regard it as such. (I think they are just jumping onto the new bandwagon. Admittedly, I’m skeptical, so send me defenses, if you have them. I place this under “fallacies of negative results”)

I asked the group of seminar participants if they could even identify a way to pick up on the “Macbeth” effect assuming no limits to where they could look or what kind of imaginary experiment one could run. Hmmm. We were hard pressed to come up with any. Follow evil-doers around (invisibly) and see if they clean-up? Follow do-gooders around (invisibly) to see if they don’t wash so much? (Never mind that cleanliness is next to godliness.) Of course if the killer has got blood on her (as in Lady “a little water clears us of this deed” Macbeth) she’s going to wash up, but the whole point is to apply it to moral culpability more generally (seeing if moral impurity cashes out as physical). So the first signal that an empirical study is at best wishy-washy, and at worst pseudoscientific is the utter vagueness of the effect they are studying. There’s little point to a sophisticated analysis of the statistics if you cannot get past this…unless you’re curious as to what other howlers lie in store. Yet with all of these experiments, the “causal” inference of interest is miles and miles away from the artificial exercises subjects engage in…. (unless too trivial to bother studying).

Returning to their study: after the writing exercise, the current researchers (Earp et al.) have participants rate various consumer products for their desirability on a scale of 1 to 7.

They found “no significant difference in the mean desirability of the cleansing items between the moral condition (M = 3.09) and immoral condition (M = 3.08)” (94)—a difference that is so small as to be suspect in itself. Their two-sided confidence interval contains 0, so the null is not rejected. (We get a p-value and Cohen’s d, but no data.) Aris Spanos brought out a point we rarely hear (one that came up in our criticism of a study on hormesis): it’s easy to get phony results with artificial measurement scales like 1-7. (Send links of others discussing this.) The mean isn’t even meaningful, and anyway, by adjusting the scale, a non-significant difference can become significant. (I don’t think this is mentioned in Simmons, Nelson, and Simonsohn 2011, but I need to reread it.)

The authors seem to think that failing to replicate studies restores credibility, and is indicative of taking a hard-nosed line, getting beyond the questionable significant results that have come in for such a drubbing. It does not. You can do just as questionable a job finding no effect as finding one. What they need to do is offer a stringent critique of the other (and their own) studies. A negative result is not a stringent critique. (Kahneman: please issue this further requirement.)

In fact, the scrutiny our seminar group arrived at in a mere one-hour discussion did more to pinpoint the holes in the other studies than all their failures to replicate. As I see it, that’s the kind of meta-level methodological scrutiny that their field needs if they are to lift themselves out of the shadows of questionable science. I could go on for pages and pages on all that is irksome and questionable about their analysis but will not. These researchers don’t seem to get it. (Or so it seems.)

If philosophers are basing philosophical theories on such “experimental” work without tearing them apart methodologically, then they’re not doing their job. Quine was wrong, and Popper was right (on this point): naturalized philosophy (be it ethics, epistemology or other) is not a matter of looking to psychological experiment.

Some proposed labels: We might label as questionable science any inferential inquiry where the researchers have not shown sufficient self-scrutiny of fairly flagrant threats to the inferences of interest. These threats would involve problems all along the route from the data generation and modeling to their interpretation. If an enterprise regularly fails to demonstrate such self-scrutiny, or worse, if its standard methodology revolves around reports that do a poor job at self-scrutiny, then I label the research area pseudoscience. If it regularly uses methods that permit erroneous interpretations of data with high probability, then we might be getting into “fraud” or at least “junk” science. (Some people want to limit “fraud” to a deliberate act. Maybe so, but my feeling is, as professional researchers claiming to have evidence of something, the onus is on them to be self-critical. Unconscious wishful thinking doesn’t get you off the hook.)

[1] In 2012 Kahneman said he saw a train-wreck looming for social psychology and suggested a “daisy chain” of replication.

[2] Correction: I had “less” switched with “more” in the early draft (I wrote this quickly during the seminar).

[3] New reference from Uri Simonsohn: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2259879

[4] Addition, April 11, 2014: A commentator wrote that I should read Mook’s classic paper against “external validity”.

http://www.uoguelph.ca/~psystats/readings_3380/mook%20article.pdf

In replying, I noted that I agreed with Mook entirely: “I entirely agree that ‘artificial’ experiments can enable the most probative and severe tests. But Mook emphasizes the hypothetical-deductive method (which we may assume would be of the statistical and not purely deductive variety). This requires the very entailments that are questionable in these studies. …I have argued (in sync with Mook, I think) as to why ‘generalizability’, external validity and the like may miss the point of severe testing, which typically requires honing in on, amplifying, isolating, and even creating effects that would not occur in any natural setting. If our theory T would predict a result or effect with which these experiments conflict, then the argument to the flaw in T holds—as Mook remarks. What is missing from the experiments I criticize is the link that’s needed for testing—the very point that Mook is making.

I especially like Mook’s example of the wire monkeys. Despite the artificiality, we can use that experiment to understand hunger reduction is not valued more than motherly comfort or the like. That’s the trick of a good experiment, that if the theory (e.g., about hunger-reduction being primary) were true, then we would not expect those laboratory results. The key question is whether we are gaining an understanding, and that’s what I’m questioning.

I’m emphasizing the meaningfulness of the theory-statistical hypothesis link on purpose. People get so caught up in the statistics that they tend to ignore, at times, the theory-statistics link.

Granted, there are at least two distinct things that might be tested: the effect itself (here the Macbeth effect), and the reliability of the previous positive results. Even if the previous positive results are irrelevant for understanding the actual effect of interest, one may wish to argue that it is or could be picking up something reliably. Even though understanding the effect is of primary importance, one may claim only to be interested in whether the previous results are statistically sound. Yet another interest might be claimed to be learning more about how to trigger it. I think my criticism, in this case, actually gets to all of these, and for different reasons. I’ll be glad to hear other positions.

 

 

______

Simmons, J., Nelson, L. and Simonsohn, U. (2011). “False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant,” Psychological Science 22(11): 1359-1366.

Zhong, C. and Liljenquist, K. 2006. Washing away your sins: Threatened morality and physical cleansing. Science, 313, 1451-1452.

Categories: fallacy of non-significance, junk science, reformers, Statistics | 12 Comments

Who is allowed to cheat? I.J. Good and that after dinner comedy hour….

It was from my Virginia Tech colleague I.J. Good (in statistics), who died five years ago (April 5, 2009), at 93, that I learned most of what I call “howlers” on this blog. His favorites were based on the “paradoxes” of stopping rules. (I had posted this last year here.)

“In conversation I have emphasized to other statisticians, starting in 1950, that, in virtue of the ‘law of the iterated logarithm,’ by optional stopping an arbitrarily high sigmage, and therefore an arbitrarily small tail-area probability, can be attained even when the null hypothesis is true. In other words if a Fisherian is prepared to use optional stopping (which usually he is not) he can be sure of rejecting a true null hypothesis provided that he is prepared to go on sampling for a long time. The way I usually express this ‘paradox’ is that a Fisherian [but not a Bayesian] can cheat by pretending he has a plane to catch like a gambler who leaves the table when he is ahead” (Good 1983, 135) [*]

This paper came from a conference where we both presented, and he was extremely critical of my error statistical defense on this point. (I was like a year out of grad school, and he a University Distinguished Professor.) 

One time, years later, after hearing Jack give this howler for the nth time, “a Fisherian [but not a Bayesian] can cheat, etc.,” I was driving him to his office, and suddenly blurted out what I really thought:

“You know Jack, as many times as I have heard you tell this, I’ve always been baffled as to its lesson about who is allowed to cheat. Error statisticians require the overall and not the ‘computed’ significance level be reported. To us, what would be cheating would be reporting the significance level you got after trying and trying again in just the same way as if the test had a fixed sample size. True, we are forced to fret about how stopping rules alter the error probabilities of tests, while the Bayesian is free to ignore them, but why isn’t the real lesson that the Bayesian is allowed to cheat?” (A published version of my remark may be found in EGEK p. 351: “As often as my distinguished colleague presents this point…”)

 To my surprise, or actually shock, after pondering this a bit, Jack said something like, “Hmm, I never thought of it this way.”

By the way, the story of the “after dinner Bayesian comedy hour” on this blog did not allude to Jack but to someone who gave a much more embellished version. Since it’s Saturday night, let’s once again listen in to the comedy hour that unfolded at my dinner table at an academic conference:

 Did you hear the one about the researcher who gets a phone call from the guy analyzing his data? First the guy congratulates him and says, “The results show a statistically significant difference at the .05 level—p-value .048.” But then, an hour later, the phone rings again. It’s the same guy, but now he’s apologizing. It turns out that the experimenter intended to keep sampling until the result was 1.96 standard deviations away from the 0 null—in either direction—so they had to reanalyze the data (n=169), and the results were no longer statistically significant at the .05 level. 

Much laughter.

So the researcher is tearing his hair out when the same guy calls back again. “Congratulations!” the guy says. “I just found out that the experimenter actually had planned to take n=169 all along, so the results are statistically significant.”

 Howls of laughter.

 But then the guy calls back with the bad news . . .

It turns out that, failing to score a sufficiently impressive effect after n’ trials, the experimenter went on to n” trials, and so on and so forth until finally, say, on trial number 169, he obtained a result 1.96 standard deviations from the null.

It continues this way, and every time the guy calls in and reports a shift in the p-value, the table erupts in howls of laughter! From everyone except me, sitting in stunned silence, staring straight ahead. The hilarity ensues from the idea that the experimenter’s reported psychological intentions about when to stop sampling alter the statistical results.

The allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter may be called the argument from intentions. When stopping rules matter, however, we are looking not at “intentions” but at real alterations to the probative capacity of the test, as picked up by a change in the test’s corresponding error probabilities. The analogous problem occurs if there is a fixed null hypothesis and the experimenter is allowed to search for maximally likely alternative hypotheses (Mayo and Kruse 2001; Cox and Hinkley 1974). Much the same issue is operating in what physicists call the look-elsewhere effect (LEE), which arose in the context of “bump hunting” in the Higgs results.

The optional stopping effect often appears in illustrations of how error statistics violates the Likelihood Principle (LP), alluding to a two-sided test from a Normal distribution:

Xi ~ N(µ,σ) and we test  H0: µ=0, vs. H1: µ≠0.

The stopping rule might take the form:

Keep sampling until |m| ≥ 1.96(σ/√n),

with m the sample mean. When n is fixed the type 1 error probability is .05, but with this stopping rule the actual significance level may differ from, and will be greater than, .05. In fact, ignoring the stopping rule allows a high or maximal probability of error. For a sampling theorist, this example alone, “taken in the context of examining consistency with θ = 0, is enough to refute the strong likelihood principle” (Cox 1977, p. 54), since, with probability 1, it will stop with a “nominally” significant result even though θ = 0. As Birnbaum (1969, 128) puts it, “the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of probabilities of erroneous interpretations.” From the error-statistical standpoint, ignoring the stopping rule allows readily inferring that there is evidence for a non-null hypothesis even though it has passed with low if not minimal severity.
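A minimal simulation sketch of this inflation (the cap max_n and the number of trials are arbitrary choices; a literally open-ended rule cannot be run on a computer):

```python
import numpy as np

rng = np.random.default_rng(42)

def try_and_try_again(max_n, sigma=1.0):
    """Sample from N(0, sigma^2) -- so the null IS true -- until |mean| >= 1.96*sigma/sqrt(n)
    or max_n is reached. Returns True if a 'nominally significant' result is obtained."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += sigma * rng.standard_normal()
        if abs(total / n) >= 1.96 * sigma / np.sqrt(n):
            return True
    return False

trials = 2000
rejections = sum(try_and_try_again(max_n=1000) for _ in range(trials))
print(f"Actual type I error rate with optional stopping: {rejections / trials:.3f}")
# Typically several times the nominal .05, and it approaches 1 as max_n grows
# (the law of the iterated logarithm guarantees the boundary is eventually crossed).
```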

Peter Armitage, in his comments on Savage at the 1959 forum (“Savage Forum” 1962), put it thus:

I think it is quite clear that likelihood ratios, and therefore posterior probabilities, do not depend on a stopping rule. . . . I feel that if a man deliberately stopped an investigation when he had departed sufficiently far from his particular hypothesis, then “Thou shalt be misled if thou dost not know that.” If so, prior probability methods seem to appear in a less attractive light than frequency methods, where one can take into account the method of sampling. (Savage 1962, 72; emphasis added; see [ii])

H is not being put to a stringent test when a researcher allows trying and trying again until the data are far enough from H0 to reject it in favor of H.

Stopping Rule Principle

The effect of optional stopping can appear evanescent—locked in someone’s head—if one has no way of taking error probabilities into account:

In general, suppose that you collect data of any kind whatsoever — not necessarily Bernoullian, nor identically distributed, nor independent of each other . . . — stopping only when the data thus far collected satisfy some criterion of a sort that is sure to be satisfied sooner or later, then the import of the sequence of n data actually observed will be exactly the same as it would be had you planned to take exactly n observations in the first place. (Edwards, Lindman, and Savage 1963, 238-239)

This is called the irrelevance of the stopping rule or the Stopping Rule Principle (SRP), and is an implication of the (strong) likelihood principle (LP), which is taken up elsewhere in this blog.[i]

To the holder of the LP, the intuition is that the stopping rule is irrelevant; to the error statistician the stopping rule is quite relevant, because the probability that the persistent experimenter finds data against the no-difference null is increased even if the null is true. It alters the well-testedness of claims inferred. (Error #11 of Mayo and Spanos 2011, “Error Statistics”.)

A Funny Thing Happened at the Savage Forum[i]

While Savage says he was always uncomfortable with the argument from intentions, he is reminding Barnard of the argument that Barnard promoted years before. He’s saying, in effect, Don’t you remember, George? You’re the one who so convincingly urged in 1952 that to take stopping rules into account is like taking psychological intentions into account:

The argument then was this: The design of a sequential experiment is, in the last analysis, what the experimenter actually intended to do. His intention is locked up inside his head. (Savage 1962, 76)

But, alas, Barnard had changed his mind. Still, the argument from intentions is repeated again and again by Bayesians. Howson and Urbach think it entails dire conclusions for significance tests:

A significance test inference, therefore, depends not only on the outcome that a trial produced, but also on the outcomes that it could have produced but did not.  And the latter are determined by certain private intentions of the experimenter, embodying his stopping rule.  It seems to us that this fact precludes a significance test delivering any kind of judgment about empirical support. . . . For scientists would not normally regard such personal intentions as proper influences on the support which data give to a hypothesis. (Howson and Urbach 1993, 212)

It is fallacious to insinuate that regarding optional stopping as relevant is in effect to make private intentions relevant. Although the choice of stopping rule (as with other test specifications) is determined by the intentions of the experimenter, it does not follow that taking account of its influence is to take account of subjective intentions. The allegation is a non sequitur.

We often hear things like:

[I]t seems very strange that a frequentist could not analyze a given set of data, such as (x1,…, xn) [in Armitage’s example] if the stopping rule is not given. . . . [D]ata should be able to speak for itself. (Berger and Wolpert 1988, 78)

But data do not speak for themselves, unless sufficient information is included to correctly appraise relevant error probabilities. The error statistician has a perfectly nonpsychological way of accounting for the impact of stopping rules, as well as other aspects of experimental plans. The impact is on the stringency or severity of the test that the purported “real effect” has passed. In the optional stopping plan, there is a difference in the set of possible outcomes; certain outcomes available in the fixed sample size plan are no longer available. If a stopping rule is truly open-ended (it need not be), then the possible outcomes do not contain any that fail to reject the null hypothesis. (The above rule stops in a finite number of trials; it is “proper”.)

Does the difference in error probabilities corresponding to a difference in sampling plans correspond to any real difference in the experiment? Yes. The researchers really did do something different in the try-and-try-again scheme and, as Armitage says, thou shalt be misled if your account cannot report this.

We have banished the argument from intentions, the allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter. So if you’re at my dinner table, can I count on you not to rehearse this one…?

One last thing….

 The Optional Stopping Effect with Bayesian (Two-sided) Confidence Intervals

The equivalent stopping rule can be framed in terms of the corresponding 95% “confidence interval” method, given the normal distribution above (their term and quotes):

Keep sampling until the 95% confidence interval excludes 0.

Berger and Wolpert concede that using this stopping rule “has thus succeeded in getting the [Bayesian] conditionalist to perceive that μ ≠ 0, and has done so honestly” (pp. 80-81). This seems to be a striking admission—especially as the Bayesian interval assigns a probability of .95 to the truth of the interval estimate (using a “noninformative prior density”):

µ = m ± 1.96(σ/√n)

But, they maintain (or did back then) that the LP only “seems to allow the experimenter to mislead a Bayesian. The ‘misleading,’ however, is solely from a frequentist viewpoint, and will not be of concern to a conditionalist.” Does this mean that while the real error probabilities are poor, Bayesians are not impacted, since, from the perspective of what they believe, there is no misleading?
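A minimal sketch of this equivalent rule (again with an arbitrary cap on n and an arbitrary number of trials, purely for illustration): whenever the rule stops, the reported interval excludes the true value µ = 0 by construction, even though the flat-prior Bayesian assigns it .95 posterior probability.

```python
import numpy as np

rng = np.random.default_rng(7)

def stop_when_interval_excludes_zero(max_n, sigma=1.0):
    """Sample from N(0, sigma^2) -- so the true mu IS 0 -- until the 95% interval
    m +/- 1.96*sigma/sqrt(n) excludes 0. Returns that interval, or None if max_n is reached."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += sigma * rng.standard_normal()
        m = total / n
        half = 1.96 * sigma / np.sqrt(n)
        if abs(m) > half:
            return (m - half, m + half)
    return None

intervals = [stop_when_interval_excludes_zero(1000) for _ in range(2000)]
stopped = [iv for iv in intervals if iv is not None]
# Every interval reported at the stopping time excludes the true value mu = 0.
print(f"stopped in {len(stopped)} of 2000 trials; none of those intervals contain 0")
```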


[*] It was because of these “conversations” that Jack thought his name should be included in the “Jeffreys-Lindley paradox”, so I always call it the Jeffreys-Good-Lindley paradox. I discuss this in EGEK 1996, Chapter 10, and in Mayo and Kruse (2001). See a recent paper by my colleague Aris Spanos (2013) on the Jeffreys-Lindley paradox.

[i] There are certain exceptions where the stopping rule may be “informative”.  Other posts may be found on LP violations, and an informal version of my critique of Birnbaum’s LP argument. On optional stopping, see also Irony and Bad Faith.

[ii] I found, on an old webpage of mine, (a pale copy of) the “Savage forum”:

REFERENCES

Armitage, P. (1962), “Discussion”, in The Foundations of Statistical Inference: A Discussion, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 72.

Berger J. O. and Wolpert, R. L. (1988), The Likelihood Principle: A Review, Generalizations, and Statistical Implications 2nd edition, Lecture Notes-Monograph Series, Vol. 6, Shanti S. Gupta, Series Editor, Hayward, California: Institute of Mathematical Statistics.

Birnbaum, A. (1969), “Concepts of Statistical Evidence” In Philosophy, Science, and Method: Essays in Honor of Ernest Nagel, S. Morgenbesser, P. Suppes, and M. White (eds.): New York: St. Martin’s Press, 112-43.

Cox, D. R. (1977), “The Role of Significance Tests (with discussion)”, Scandinavian Journal of Statistics 4, 49–70.

Cox, D. R. and D. V. Hinkley (1974), Theoretical Statistics, London: Chapman & Hall.

Edwards, W., Lindman, H. and Savage, L. (1963), “Bayesian Statistical Inference for Psychological Research,” Psychological Review 70: 193-242.

Good, I. J. (1983), Good Thinking: The Foundations of Probability and Its Applications, Minneapolis: University of Minnesota Press.

Howson, C., and P. Urbach (1993[1989]), Scientific Reasoning: The Bayesian Approach, 2nd  ed., La Salle: Open Court.

Mayo, D. (1996), [EGEK] Error and the Growth of Experimental Knowledge (Chapter 10: Why You Cannot Be Just a Little Bayesian), Chicago: The University of Chicago Press.

Mayo, D. G. and Kruse, M. (2001). “Principles of Inference and Their Consequences,” in D. Corfield and J. Williamson (eds.), Foundations of Bayesianism. Dordrecht: Kluwer Academic Publishers: 381-403.

Savage, L. (1962), “Discussion”, in The Foundations of Statistical Inference: A Discussion, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 76.

Spanos, A. (2013), “Who Should Be Afraid of the Jeffreys-Lindley Paradox?” Philosophy of Science 80: 73-93.

Categories: Bayesian/frequentist, Comedy, Statistics | 18 Comments

Skeptical and enthusiastic Bayesian priors for beliefs about insane asylum renovations at Dept of Homeland Security: I’m skeptical and unenthusiastic


Danvers State Hospital

I had heard of medical designs that employ individuals who supply Bayesian subjective priors that are deemed either “enthusiastic” or “skeptical” as regards the probable value of medical treatments.[i] From what I gather, these priors are combined with data from trials in order to help decide whether to stop trials early or continue. But I’d never heard of these Bayesian designs in relation to decisions about building security or renovations! Listen to this….
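For concreteness, here is a minimal sketch of the statistical idea behind such designs (a conjugate Normal update with entirely hypothetical numbers; the function and figures are mine, not from any cited protocol):

```python
import numpy as np

def posterior(prior_mean, prior_sd, data_mean, data_se):
    """Conjugate Normal update: combine a Normal prior with a Normal likelihood summary."""
    w_prior, w_data = 1 / prior_sd**2, 1 / data_se**2
    post_var = 1 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * data_mean)
    return post_mean, np.sqrt(post_var)

# Hypothetical interim trial summary: estimated treatment effect 2.0, standard error 1.0
data_mean, data_se = 2.0, 1.0

skeptical = posterior(prior_mean=0.0, prior_sd=0.5, data_mean=data_mean, data_se=data_se)    # centered on "no effect"
enthusiastic = posterior(prior_mean=3.0, prior_sd=0.5, data_mean=data_mean, data_se=data_se) # centered on a large effect

print("skeptical posterior (mean, sd):   ", skeptical)
print("enthusiastic posterior (mean, sd):", enthusiastic)
# The rough logic: if even the skeptical posterior concentrates on a benefit, that is
# taken as strong grounds to stop early for efficacy; if even the enthusiastic posterior
# concentrates near no effect, that is taken as grounds to stop for futility.
```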

You may have heard that the Department of Homeland Security (DHS), whose 240,000 employees are scattered among 50 office locations around D.C., has been planning to have headquarters built at an abandoned insane asylum, St. Elizabeths, in DC [ii]. See a recent discussion here. In 2006 officials projected the new facility would be ready by 2015; now an additional $3.2 billion is needed to complete the renovation of the 159-year-old mental hospital by 2026 (Congressional Research Service)[iii]. The initial plan of developing the entire structure is no longer feasible, so to determine which parts of the facility are most likely to be promising, “DHS is bringing in a team of data analysts who are possessed,” said Homeland Security Secretary Jeh Johnson (during a DHS meeting, Feb 26)—“possessed with vibrant background beliefs to sense which buildings are most probably worth renovating, from the point of view of security. St. Elizabeths needs to be fortified with 21st-century technologies for cybersecurity and antiterrorism missions,” Johnson explained.

Failing to entice private companies to renovate the dilapidated west campus of the historic mental health facility that sits on 176 acres overlooking the Anacostia River, they can only hope to renovate selectively: “Which parts are we going to overhaul? Parts of the hospital have been rotting for years!” Johnson declared.


[I’m too rushed at the moment to write this up clearly but thought it of sufficient interest to post a draft (look for follow-up drafts).]

Skeptical and enthusiastic priors: excerpt from DHS memo:

The description of the use of so-called “enthusiastic” and “skeptical” priors is sketched in a DHS memo released in January 2014 (but which had been first issued in 2011). Here’s part of it:

Enthusiastic priors are used in evaluating the portions of St. Elizabeths campus thought to be probably unpromising, in terms of environmental soundness, or because of an existing suspicion of probable security leaks. If the location fails to be probably promising using an enthusiastic prior, plus data, then there is overwhelming evidence to support the decision that the particular area is not promising.

Skeptical priors are used in situations where the particular asylum wing, floor, or campus quadrant is believed to be probably promising for DHS. If the skeptical opinion, combined with the data on the area in question, yields a high posterior belief that it is a promising area to renovate, this would be taken as extremely convincing evidence to support the decision that the wing, floor or building is probably promising.

But long before they can apply this protocol, they must hire specialists to provide the enthusiastic or skeptical priors. (See stress testing below.) The article further explains, “In addition, Homeland Security took on a green initiative — deciding to outfit the campus’ buildings (some dating back to 1855) with features like rainwater toilets and Brazilian hardwood in the name of sustainability.” With that in mind, they also try to get a balance of environmentalist enthusiasts and green skeptics.

Asked how he can justify the extra $3 billion (a minimal figure), Mr. Johnson said that “I think that the morale of DHS, unity of mission, would go a long way if we could get to a headquarters.” He was pleased to announce that an innovative program of recruiting was now nearly complete.

Stress Testing for Calibrated “Enthusiastic” and “Skeptical” Prior Probabilities

Perhaps the most interesting part of all this is how they conduct stress testing for individuals to supply calibrated Bayesian priors concerning St. Elizabeths. Before being hired to give “skeptical” or “enthusiastic” prior distributions, candidates must pass a rather stringent panoply of stress tests based on their hunches regarding relevant facts associated with a number of other abandoned insane asylums. (It turns out there are a lot of them throughout the world. I had no idea.) The list of asylums on which they based the testing (over the past years) has been kept Top Secret Classified until very recently [iv]. Even now, one is directed to a non-governmental website to find a list of 8 or so of the old mental facilities that apparently appeared in just one batch of tests.

Scott Bays-Knorr, a DHS data analyst specialist who is coordinating the research and hiring of “sensors,” made it clear that the research used acceptable, empirical studies: “We’re not testing for paranormal ability or any hocus pocus. These are facts, and we are interested in finding those people whose beliefs match the facts reliably. DHS only hires highly calibrated, highly sensitive individuals”, said Bays-Knorr. Well I’m glad there’s no hocus-pocus at least.

The way it works is that they combine written tests with fMRI data— which monitors blood flow and, therefore, activity inside the brain in real time —to try to establish a neural signature that can be correlated with security-relevant data about the abandoned state hospitals. “The probability they are attuned to these completely unrelated facts about abandoned state asylums they’ve probably never even heard of is about 0. So we know our results are highly robust,” Bays-Knorr assured some skeptical senators.

Danvers State Hospital

Take, for example, Danvers State Hospital, a psychiatric asylum opened in 1878 in Danvers, Massachusetts.

“We check their general sensitivity by seeing if any alarm bells go off in relation to little known facts about unrelated buildings that would be important to a high security facility. ‘What about a series of underground tunnels’ we might ask, any alarm bells go off? Any hairs on their head stand up when we flash a picture of a mysterious fire at the Danvers site in 2007?” Bays-Knorr enthused. “If we’ve got a verified fire skeptic who, when we get him to DC, believes that a part of St. Elizabeths is clear, then we start to believe that’s a fire-safe location. You don’t want to build U.S. cybersecurity if good sensors give it a high probability of being an incendiary location.” I think some hairs on my head are starting to stand up.

Interestingly, some of the tests involve matching drawn pictures, which remind me a little of those remote sensing tests of telepathy. Here’s one such target picture:

Target: Danvers State Hospital

They claim they can ensure robustness by correlating a sensor’s impressions of completely unrelated facts about the facility. For example, using fMRI data they can check if “anything lights up” in connection with Lovecraft’s Arkham Sanatorium, the short story “The Thing on the Doorstep”, or Arkham Asylum in the Batman comic world.

Bays-Knorr described how simple facts are used as a robust benchmark for what he calls “the touchy-feely stuff”. For example, picking up on a simple fact in connection with High Royds Hospital (in England) is sensing its alternative name: West Riding Pauper Lunatic Asylum. They compare that to reactions to the question of why patient farming ended. I frankly don’t get it, but then again, I’m highly skeptical of approaches not constrained by error statistical probing.

 

Yet Bays-Knorr seemed to be convincing many of the Senators who will have to approve an extra $3 billion for the project. He further described the safeguards: “We never published this list of asylums; the candidate sensors did not even know what we were going to ask them. It doesn’t matter if they’re asylum specialists or have a 6th sense. If they have good hunches, if they fit the average of the skeptics or the enthusiasts, then we want them.” Only if the correlations are sufficiently coherent is a ‘replication score’ achieved. The testing data are then sent to an independent facility of blind “big data” statisticians, Cherry Associates, from whom the purpose of the analysis is kept entirely hidden. “We look for feelings and sensitivity; often the person doesn’t know she even has it,” one Cherry Assoc representative noted. Testing has gone on for the past 7 years ($700 million) and is only now winding up. (I’m not sure how many were hired, but with $150,000 salaries for part-time work, it seems a good gig!)

Community priors, skeptical and enthusiastic, are eventually obtained based on those hired as U.S. Government DHS Calibrated Prior Degree Specialists.

Sounds like lunacy to me![v]


High Royds Hospital

 

 

[i] Spiegelhalter, D. J., Abrams, K. R., & Myles, J. P. (2004). Bayesian approaches to clinical trials and health care evaluation. Chichester: Wiley.

[ii] Homepage for DHS and St. Elizabeths Campus Plans.

http://www.stelizabethsdevelopment.com/index.html

GSA Development of St. Elizabeths campus: “Preserving the Legacy, Realizing Potential”

[iii]

U.S. House of Representatives
Committee on Homeland Security
January 2014

Prepared by Majority Staff of the Committee on Homeland Security

http://homeland.house.gov/sites/homeland.house.gov/files/documents/01-10-14-StElizabeths-Report.pdf

[iv] The randomly selected hospitals in one standardized test included the following:

Topeka State Hospital

Danvers State Hospital

Denbigh Asylum

Pilgrim State Hospital

Trans-Allegheny Asylum

High Royds Hospital

Whittingham Hospital

Norwich State Hospital

Essential guide to abandoned insane asylums: http://www.atlasobscura.com/articles/abandoned-insane-asylums

[v] Or, a partial April Fool’s joke!

 

 

Categories: junk science, Statistics, subjective Bayesian elicitation | 11 Comments

Phil 6334: March 26, philosophy of misspecification testing (Day #9 slides)

 

“Probability/Statistics Lecture Notes 6: An Introduction to Mis-Specification (M-S) Testing” (Aris Spanos)

 

[Other slides from Day 9 by guest, John Byrd, can be found here.]

Categories: misspecification testing, Phil 6334 class material, Spanos, Statistics | Leave a comment

Severe osteometric probing of skeletal remains: John Byrd

John E. Byrd, Ph.D. D-ABFA

Central Identification Laboratory
JPAC

Guest, March 27, Phil 6334

“Statistical Considerations of the Histomorphometric Test Protocol for Determination of Human Origin of Skeletal Remains”

 By:
John E. Byrd, Ph.D. D-ABFA
Maria-Teresa Tersigni-Tarrant, Ph.D.
Central Identification Laboratory
JPAC

Categories: Phil6334, Philosophy of Statistics, Statistics | 1 Comment

Phil 6334:Misspecification Testing: Ordering From A Full Diagnostic Menu (part 1)


 We’re going to be discussing the philosophy of m-s testing today in our seminar, so I’m reblogging this from Feb. 2012. I’ve linked the 3 follow-ups below. Check the original posts for some good discussion. (Note visitor*)

“This is the kind of cure that kills the patient!”

is the line of Aris Spanos that I most remember from when I first heard him talk about testing the assumptions of, and respecifying, statistical models in 1999. (The patient, of course, is the statistical model.) On finishing my book, EGEK 1996, I had been keen to fill its central gaps, one of which was fleshing out a crucial piece of the error-statistical framework of learning from error: how to validate the assumptions of statistical models. But the whole problem turned out to be far more philosophically—not to mention technically—challenging than I imagined. I will try (in 3 short posts) to sketch a procedure that I think puts the entire process of model validation on a sound logical footing. Thanks to attending several of Spanos’ seminars (and his patient tutorials, for which I am very grateful), I was eventually able to reflect philosophically on aspects of his already well-worked-out approach. (Synergies with the error statistical philosophy, of which this is a part, warrant a separate discussion.)

Problems of Validation in the Linear Regression Model (LRM)

The example Spanos was considering was the Linear Regression Model (LRM), which may be seen to take the form:

M0:      yt = β0 + β1xt + ut,  t=1,2,…,n,…

where µt = β0 + β1xt is viewed as the systematic component, and ut = yt − β0 − β1xt as the error (non-systematic) component. The error process {ut, t=1, 2, …, n, …} is assumed to be Normal, Independent and Identically Distributed (NIID) with mean 0 and variance σ2, i.e. Normal white noise. Using the data z0:={(xt, yt), t=1, 2, …, n}, the coefficients (β0, β1) are estimated (by least squares), yielding an empirical equation intended to enable us to understand how yt varies with xt.

Empirical Example

Suppose that in her attempt to find a way to understand and predict changes in the U.S.A. population, an economist discovers, using regression, an empirical relationship that appears to provide almost a ‘law-like’ fit (see figure 1):

yt = 167.115 + 1.907xt + ût,     (1)

where yt denotes the population of the USA (in millions), and xt denotes a secret variable whose identity is not revealed until the end (of these 3 posts). Both series refer to annual data for the period 1955-1989.

Figure 1: Fitted Line

A Primary Statistical Question: How good a predictor is xt?

The goodness-of-fit measure of this estimated regression, R²=.995, indicates an almost perfect fit. Testing the statistical significance of the coefficients shows them to be highly significant; the p-values are zero (0) to the third decimal, indicating a very strong relationship between the variables. Everything looks hunky dory; what could go wrong?

Is this inference reliable? Not unless the data z0 satisfy the probabilistic assumptions of the LRM, i.e., the errors are NIID with mean 0, variance σ2.
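To make “what could go wrong” concrete, here is a toy sketch (my own simulated series in Python, not the post’s data, whose x variable stays secret): a fit that looks law-like even though the errors are far from independent, so the nominal standard errors and p-values cannot be taken at face value.

```python
# Toy illustration (mine, not the post's data): a "law-like" fit with dependent errors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1955)
n = 35                                    # 35 annual observations, as in 1955-1989
x = 10 + 0.9 * np.arange(n)               # a smoothly trending regressor
u = np.zeros(n)
for t in range(1, n):                     # AR(1) errors: the "I" in NIID clearly fails
    u[t] = 0.9 * u[t - 1] + rng.normal(0, 0.5)
y = 167 + 1.9 * x + u

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(f"R^2 = {fit.rsquared:.3f}")                              # close to 1
print(f"nominal p-value for the slope = {fit.pvalues[1]:.1e}")  # essentially zero
print(f"Durbin-Watson = {durbin_watson(fit.resid):.2f}")        # typically well below 2: dependence left in the residuals
```

The fit and the tiny p-value look just as impressive as in (1), yet the last line already signals that the independence assumption fails, which is the business of the m-s tests discussed below.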

Misspecification (M-S) Tests: Questions of model validation may be  seen as ‘secondary’ questions in relation to primary statistical ones; the latter often concern the sign and magnitude of the coefficients of this linear relationship.

Partitioning the Space of Possible Models: Probabilistic Reduction (PR)

The task in validating a model M0 (LRM) is to test ‘M0 is valid’ against everything else!

In other words, if we let H0 assert that the ‘true’ distribution of the sample Z, f(z) belongs to M0, the alternative H1 would be the entire complement of M0, more formally:

H0: f(z) ∈ M0   vs.   H1: f(z) ∈ [P – M0]

where P denotes the set of all possible statistical models that could have given rise to z0:={(xt,yt), t=1, 2, …, n}, and ∈ is “an element of” (all we could find).

The traditional analysis of the LRM has already, implicitly, reduced the space of models that could be considered. It reflects just one way of reducing the set of all possible models of which data z0 can be seen to be a realization. This provides the motivation for Spanos’ modeling approach (first in Spanos 1986, 1989, 1995).

Given that each statistical model arises as a parameterization from the joint distribution:

D(Z1,…,Zn;φ): = D((X1, Y1), (X2, Y2), …., (Xn, Yn); φ),

we can consider how one or another set of probabilistic assumptions on the joint distribution gives rise to different models. The assumptions used to reduce P, the set of all possible models,  to a single model, here the LRM, come from a menu of three broad categories.  These three categories  can always be used in statistical modeling:

(D) Distribution, (M) Dependence, (H) Heterogeneity.

For example, the LRM arises when we reduce P by means of the “reduction” assumptions:

(D) Normal (N), (M) Independent (I), (H) Identically Distributed (ID).

Since we are partitioning or reducing P by means of the probabilistic assumptions, it may be called the Probabilistic Partitioning or Probabilistic Reduction (PR) approach.[i]

The same assumptions, traditionally given by means of the error term, are instead specified in terms of the observable random variables (yt, Xt): [1]-[5] in table 1 to render them directly assessable by the data in question.

Table 1 – The Linear Regression Model (LRM)

yt = β0 + β1xt + ut,  t=1,2,…,n,…

[1] Normality:          D(yt|xt; θ) is Normal
[2] Linearity:          E(yt|Xt=xt) = β0 + β1xt, linear in xt
[3] Homoskedasticity:   Var(yt|Xt=xt) = σ2, free of xt
[4] Independence:       {(yt|Xt=xt), t=1,…,n,…} is an independent process
[5] t-invariance:       θ:=(β0, β1, σ2) is constant over t

There are several advantages to specifying the model assumptions in terms of the observables yt and xt instead of the unobservable error term.

First, hidden or implicit assumptions now become explicit ([5]).

Second, some of the error term assumptions, such as having a zero mean, do not look nearly as innocuous when expressed as an assumption concerning the linearity of the regression function between yt and xt .

Third, the LRM (conditional) assumptions can be assessed indirectly from the data via the (unconditional) reduction assumptions, since:

N entails [1]-[3],             I entails [4],             ID entails [5].

As a first step, we partition the set of all possible models coarsely in terms of the reduction assumptions on D(Z1,…,Zn;φ):

                     LRM                        Alternatives
(D) Distribution:    Normal                     non-Normal
(M) Dependence:      Independent                Dependent
(H) Heterogeneity:   Identically Distributed    non-ID

Given the practical impossibility of probing for violations in all possible directions, the PR approach consciously considers an effective probing strategy to home in on the directions in which the primary statistical model might be potentially misspecified.   Having taken us back to the joint distribution, why not get ideas by looking at yt and xt themselves using a variety of graphical techniques?  This is what the Probabilistic Reduction (PR) approach prescribes for its diagnostic task….Stay tuned!
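As a rough sketch of that prescription (my own code; the helper name ms_tplots and the plotting choices are mine, not Spanos’s), the t-plots the PR approach begins with can be produced as follows:

```python
# Minimal sketch of the "look at the data" step: t-plots of y_t, x_t and the residuals,
# the kind of graphs used to suggest which reduction assumptions (N, I, ID) to probe.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

def ms_tplots(y, x):
    """Plot y_t, x_t and the OLS residuals against the index t."""
    t = np.arange(len(y))
    resid = sm.OLS(y, sm.add_constant(x)).fit().resid
    fig, axes = plt.subplots(3, 1, sharex=True, figsize=(6, 7))
    for ax, series, label in zip(axes, (y, x, resid), ("y_t", "x_t", "residuals")):
        ax.plot(t, series, marker="o", ms=3)
        ax.set_ylabel(label)
    axes[-1].set_xlabel("t")
    fig.suptitle("t-plots: trends suggest non-ID; smooth, cycle-like swings suggest dependence")
    plt.show()
```

Applied to the toy series from the earlier sketch, the residual panel shows the kind of smooth swings that would direct the probing toward the dependence and t-invariance assumptions.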

*Rather than list scads of references, I direct the interested reader to those in Spanos.


[i] This is because when the NIID assumptions are imposed on the joint distribution D(Z1,…,Zn;φ), the latter simplifies into a product of conditional distributions, which yields the LRM.

See follow-up parts:

PART 2: http://errorstatistics.com/2012/02/23/misspecification-testing-part-2/

PART 3: http://errorstatistics.com/2012/02/27/misspecification-testing-part-3-m-s-blog/

PART 4: http://errorstatistics.com/2012/02/28/m-s-tests-part-4-the-end-of-the-story-and-some-conclusions/

*We also have a visitor to the seminar from Hawaii, John Byrd, a forensic anthropologist and statistical osteometrician. He’s long been active on the blog. I’ll post something of his later on.

Categories: Intro MS Testing, Statistics | Tags: , , , , | 16 Comments

The Unexpected Way Philosophy Majors Are Changing The World Of Business

 

Philosopher

“Philosophy majors rule” according to this recent article. We philosophers should be getting the word out. Admittedly, the type of people inclined to do well in philosophy are already likely to succeed in analytic areas. Coupled with the chutzpah of taking up an “outmoded and impractical” major like philosophy in the first place, innovative tendencies are not surprising. But can the study of philosophy also promote these capacities? I think it can and does; yet it could be far more effective than it is if it were less hermetic and more engaged with problem-solving across the landscape of science, statistics, law, medicine, and evidence-based policy. Here’s the article:

___________________________________________________

  The Unexpected Way Philosophy Majors Are Changing The World Of Business

Dr. Damon Horowitz quit his technology job and got a Ph.D. in philosophy — and he thinks you should too.

“If you are at all disposed to question what’s around you, you’ll start to see that there appear to be cracks in the bubble,” Horowitz said in a 2011 talk at Stanford. “So about a decade ago, I quit my technology job to get a philosophy PhD. That was one of the best decisions I’ve made in my life.”

As Horowitz demonstrates, a degree in philosophy can be useful for professions beyond a career in academia. Degrees like his can help in the business world, where a philosophy background can pave the way for real change. After earning his PhD in philosophy from Stanford, where he studied computer science as an undergraduate, Horowitz went on to become a successful tech entrepreneur and Google’s in-house philosopher/director of engineering. His own career makes a pretty good case for the value of a philosophy education.

Despite a growing media interest in the study of philosophy and dramatically increasing enrollment in philosophy programs at some universities, the subject is still frequently dismissed as outmoded and impractical, removed from the everyday world and relegated to the loftiest of ivory towers.

That doesn’t fit with the realities of both the business and tech worlds, where philosophy has proved itself to be not only relevant but often the cornerstone of great innovation. Philosophy and entrepreneurship are a surprisingly good fit. Some of the most successful tech entrepreneurs and innovators come from a philosophy background and put the critical thinking skills they developed to good use launching new digital services to fill needs in various domains of society. Atlantic contributor Edward Tenner even went so far as to call philosophy the “most practical major.”

In fact, many leaders of the tech world — from LinkedIn co-founder Reid Hoffman to Flickr founder Stewart Butterfield — say that studying philosophy was the secret to their success as digital entrepreneurs.

“The thought leaders of our industry are not the ones who plodded dully, step by step, up the career ladder,” said Horowitz. “They’re the ones who took chances and developed unique perspectives.”

Here are a few reasons that philosophy majors will become the entrepreneurs who are shaping the business world.

Philosophy develops strong critical thinking skills and business instincts.

Philosophy is a notoriously challenging major, and has rigorous standards of writing and argumentation, which can help students to develop strong critical thinking skills that can be applied to a number of different professions. The ability to think critically may be of particular advantage to tech entrepreneurs.

“Open-ended assignments push philosophy students to find and take on a unique aspect of the work of the philosopher they are studying, to frame their thinking around a fresh and interesting question, or to make original connections between the writings of two distinct thinkers,” Christine Nasserghodsi, director of innovation at the Wellington International School in Dubai, wrote in a HuffPost College blog. “Similarly, entrepreneurs need to be able to identify and understand new and unique opportunities in existing markets.”

Flickr co-founder Stewart Butterfield got his bachelor’s and master’s degrees in philosophy at University of Victoria and Cambridge, where he specialized in philosophy of mind. After the highly profitable sale of Flickr to Yahoo!, the Canadian tech entrepreneur began working on a new online civilization-building game, Glitch.

“I think if you have a good background in what it is to be human, an understanding of life, culture and society, it gives you a good perspective on starting a business, instead of an education purely in business,” Butterfield told University of Victoria students in 2008. “You can always pick up how to read a balance sheet and how to figure out profit and loss, but it’s harder to pick up the other stuff on the fly.”

Former philosophy students have gone on to make waves in the tech world.

Besides Horowitz and Butterfield, a number of tech executives, including former Hewlett-Packard Company CEO Carly Fiorina and LinkedIn co-founder and executive chairman Reid Hoffman, studied philosophy as undergraduates. Hoffman majored in philosophy at Oxford before he went on to become a highly successful tech entrepreneur and venture capitalist, and author of The Startup Of You.

“My original plan was to become an academic,” Hoffman told Wired. “I won a Marshall scholarship to read philosophy at Oxford, and what I most wanted to do was strengthen public intellectual culture — I’d write books and essays to help us figure out who we wanted to be.”

Hoffman decided to instead become a software engineer when he realized that staying in academia might not have the measurable impact on the world that he desires. Now, he uses the sharp critical thinking skills he honed while studying philosophy to make profitable investments in tech start-ups.

“When presented with an investment that I think will change the world in a really good way, if I can do it, I’ll do it,” he said.

Philosophers (amateur and professional) will be the ones to grapple with the biggest issues facing their generation.

I can’t say I agree with the following suggestion that philosophers have a “better understanding of the nature of man and his place in the world” but it sounds good.

Advances in physics, technology and neuroscience pose an ever-evolving set of questions about the nature of the world and man’s place in it; questions that we may not yet have the answers to, but that philosophers diligently explore through theory and argument. And of course, there are some questions of morality and meaning that were first posed by ancient thinkers and that we must continue to question as humanity evolves: How should we treat one another? What does it mean to live a good life?

The Princeton philosophy department argues that because philosophers have a “better understanding of the nature of man and his place in the world,” they’re better able to identify and address issues in modern society. For this reason, philosophy should occupy a more prominent place in the business world, says Dov Seidman, author of HOW: Why HOW We Do Anything Means Everything.

“Philosophy can help us address the (literally) existential challenges the world currently confronts, but only if we take it off the back burner and apply it as a burning platform in business,” Seidman wrote in a 2010 Bloomberg Businessweek article. “Philosophy explores the deepest, broadest questions of life—why we exist, how society should organize itself, how institutions should relate to society, and the purpose of human endeavor, to name just a few.”

Philosophy students are ‘citizens of the world.’

In an increasingly global economy — one in which many businesses are beginning to accept a sense of social responsibility — those who care and are able to think critically about global and humanitarian issues will be the ones who are poised to create real change.

Rebecca Newberger Goldstein, philosopher, novelist and author of the forthcoming Plato at the Googleplex, recently told The Atlantic that doing philosophical work makes students “citizens of the world.” She explains why students should study philosophy, despite their concerns about employability:

To challenge your own point of view. Also, you need to be a citizen in this world. You need to know your responsibilities. You’re going to have many moral choices every day of your life. And it enriches your inner life. You have lots of frameworks to apply to problems, and so many ways to interpret things. It makes life so much more interesting. It’s us at our most human. And it helps us increase our humanity. No matter what you do, that’s an asset.

This global-mindedness and humanistic perspective may even make you a more desirable job candidate.

“You go into the humanities to pursue your intellectual passion, and it just so happens as a byproduct that you emerge as a desired commodity for industry,” said Horowitz. “Such is the halo of human flourishing.”

Categories: philosophy of science, Philosophy of Statistics, Statistics | 1 Comment

Fallacies of statistics & statistics journalism, and how to avoid them: Summary & Slides Day #8 (Phil 6334)


We spent the first half of Thursday’s seminar discussing the Fisher, Neyman, and E. Pearson “triad”[i]. So, since it’s Saturday night, join me in rereading for the nth time these three very short articles. The key issues were: error of the second kind, behavioristic vs evidential interpretations, and Fisher’s mysterious fiducial intervals. Although we often hear exaggerated accounts of the differences in the Fisherian vs Neyman-Pearson (N-P) methodology, in fact, N-P were simply providing Fisher’s tests with a logical ground (even though other foundations for tests are still possible), and Fisher welcomed this gladly. Notably, N-P showed that, when only a single (null) hypothesis is considered, it is possible to have tests where the probability of rejecting the null when it is true exceeds the probability of rejecting it when it is false. Hacking called such tests “worse than useless”, and N-P developed a theory of testing that avoids such problems. Statistical journalists who report on the alleged “inconsistent hybrid” (a term popularized by Gigerenzer) should recognize the extent to which the apparent disagreements on method reflect professional squabbling between Fisher and Neyman after 1935 [a recent example is the Nature article by R. Nuzzo discussed in [ii] below]. The two types of tests are best seen as asking different questions in different contexts. They both follow error-statistical reasoning.
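To see what “worse than useless” comes to concretely, here is a minimal numerical sketch (my own toy example in Python, not anything in the triad): a test of H0: µ = 0 based on a single N(µ, 1) observation whose rejection region is a small central band. Its size is .05, yet under every alternative it is less likely to reject than under the null.

```python
# A deliberately bad test: reject H0: mu = 0 when |X| <= c, X ~ N(mu, 1).
# Nothing rules this out if no alternatives are considered, yet power < size everywhere.
from scipy.stats import norm

alpha = 0.05
c = norm.ppf(0.5 + alpha / 2)            # chosen so that P(|X| <= c; mu = 0) = alpha

def prob_reject(mu):
    """Probability that the bad test rejects H0 when the true mean is mu."""
    return norm.cdf(c - mu) - norm.cdf(-c - mu)

print(f"size at mu = 0:   {prob_reject(0.0):.3f}")   # 0.050
print(f"power at mu = 1:  {prob_reject(1.0):.3f}")   # below 0.05
print(f"power at mu = 2:  {prob_reject(2.0):.3f}")   # smaller still
```

Requiring tests to be unbiased (power at least as large as size against every alternative) is one way N-P theory rules such constructions out.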

We then turned to a severity evaluation of tests as a way to avoid classic fallacies and misinterpretations.


“Probability/Statistics Lecture Notes 5 for 3/20/14: Post-data severity evaluation” (Prof. Spanos)

[i] Fisher, Neyman, and E. Pearson.

[ii] In a recent Nature article by Regina Nuzzo, we hear that N-P statistics “was spearheaded in the late 1920s by Fisher’s bitter rivals”. Nonsense. It was Neyman and Pearson who came to Fisher’s defense against the old guard. See for example Aris Spanos’ post here. According to Nuzzo, “Neyman called some of Fisher’s work mathematically ‘worse than useless’”. It never happened. Nor does she reveal, if she is aware of it, the purely technical notion being referred to. Nuzzo’s article doesn’t give the source of the quote; I’m guessing it’s from Gigerenzer quoting Hacking, or Goodman (whom she is clearly following and cites) quoting Gigerenzer quoting Hacking, but that’s a big jumble.

N-P did provide a theory of testing that could avoid the purely technical problem that can theoretically emerge in an account that does not consider alternatives or discrepancies from a null. As for Fisher’s charge against an extreme behavioristic, acceptance sampling approach, there’s something to this, but as Neyman’s response shows, Fisher, in practice, was more inclined toward a dichotomous “thumbs up or down” use of tests than Neyman. Recall Neyman’s “inferential” use of power in my last post.  If Neyman really had altered the tests to such an extreme, it wouldn’t have required Barnard to point it out to Fisher many years later. Yet suddenly, according to Fisher, we’re in the grips of Russian 5-year plans or U.S. robotic widget assembly lines! I’m not defending either side in these fractious disputes, but alerting the reader to what’s behind a lot of writing on tests (see my anger management post). I can understand how Nuzzo’s remark could arise from a quote of a quote, doubly out of context. But I think science writers on statistical controversies have an obligation to try to avoid being misled by whomever they’re listening to at the moment. There are really only a small handful of howlers to take note of. It’s fine to sign on with one side, but not to state controversial points as beyond debate. I’ll have more to say about her article in a later post (and thanks to the many of you who have sent it to me).

Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-339). Hillsdale: Lawrence Erlbaum Associates.

Hacking, I. (1965). Logic of statistical inference. Cambridge: Cambridge University Press.

Nuzzo, R. (2014). “Scientific method: Statistical errors: P values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume”. Nature, 12 February 2014.

Categories: phil/history of stat, Phil6334, science communication, Severity, significance tests, Statistics | Tags: | 35 Comments

Stephen Senn: “Delta Force: To what extent is clinical relevance relevant?” (Guest Post)


Stephen Senn
Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS),
Luxembourg

Delta Force
To what extent is clinical relevance relevant?

Inspiration
This note has been inspired by a Twitter exchange with respected scientist and famous blogger  David Colquhoun. He queried whether a treatment that had 2/3 of an effect that would be described as clinically relevant could be useful. I was surprised at the question, since I would regard it as being pretty obvious that it could but, on reflection, I realise that things that may seem obvious to some who have worked in drug development may not be obvious to others, and if they are not obvious to others are either in need of a defence or wrong. I don’t think I am wrong and this note is to explain my thinking on the subject.

Conventional power or sample size calculations
As it happens, I don’t particularly like conventional power calculations but I think they are, nonetheless, a good place to start. To carry out such a calculation a statistician needs the following ingredients:

  1. A definition of a rational design (the smallest design that is feasible but would retain the essential characteristics of the design chosen).
  2. An agreed outcome measure.
  3. A proposed analysis.
  4. A measure of variability for the rational design. (This might, for example, be the between-patient variance σ2 for a parallel group design.)
  5. An agreed type I error rate, α.
  6. An agreed power, 1-β.
  7. A clinically relevant difference, δ. (To be discussed.)
  8. The size of the experiment, n, (in terms of multiples of the rational design).

In treatments of this subject points 1-3 are frequently glossed over as already being known and given, although in my experience, any serious work on trial design involves the statistician in a lot of work investigating and discussing these issues. In consequence, in conventional discussions, attention is placed on points 4-8. Typically, it is assumed that 4-7 are given and 8, the size of the experiment, is calculated as a consequence. More rarely, 4, 5, 7 and 8 are given and 6, the power, is calculated from the other 4. An obvious weakness of this system is that there is no formal mention of cost whether in money, lost opportunities or patient time and suffering.

An example
A parallel group trial is planned in asthma with 3 months follow up. The agreed outcome measure is forced expiratory volume in one second (FEV1) at the end of the trial. The between-patient standard deviation is 450ml and the clinically relevant difference is 200ml. A type I error rate of 5% is chosen and the test will be two-sided. A power of 80% is targeted.

An approximate formula that may be used is

n ≈ (2σ²/δ²) × (zα/2 + zβ)²                                    (1)

Here the second term on the right hand side reflects what I call decision precision, with zα/2, zβ  as the relevant percentage points of the standard Normal. If you lower the type I error rate or increase the power, decision precision will increase. The first term on the right hand side is the variance for a rational design (consisting of one patient on each arm) expressed as a ratio to the square of a clinically relevant difference. It is a noise-to-signal ratio.

Substituting we have

n ≈ (2 × 450²/200²) × (1.96 + 0.84)² ≈ 10.125 × 7.84 ≈ 79.4

Thus we need an 80-fold replication of the rational design, which is to say, 80 patients on each arm.
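For readers who want to reproduce the arithmetic, here is a minimal sketch in Python (mine, not Senn’s code) of the calculation behind the 80 patients per arm:

```python
# Conventional sample-size calculation for the asthma example (a sketch, not Senn's code).
from scipy.stats import norm

sigma, delta = 450.0, 200.0          # between-patient SD and clinically relevant difference, in ml
alpha, power = 0.05, 0.80            # two-sided type I error rate and target power

z_alpha = norm.ppf(1 - alpha / 2)    # 1.96
z_beta = norm.ppf(power)             # 0.84

n = (2 * sigma**2 / delta**2) * (z_alpha + z_beta)**2   # equation (1)
print(f"replications of the rational design: {n:.1f}")  # about 80, i.e. 80 patients per arm
```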

What is delta?
I now list different points of view regarding this.

     1.     It is the difference we would like to observe
This point of view is occasionally propounded but it is incompatible with the formula used. To see this consider a re-arrangement of equation (1) as

δ/√(2σ²/n) = zα/2 + zβ                                    (2)

The numerator on the left hand side is the clinically relevant difference and the denominator is the standard error. Now if the observed difference, d, is the same as the clinically relevant difference, then we can replace δ by d in (2), but that would imply that the ratio of the observed difference to its standard error would be (in our example) 2.8. This does not correspond to a P-value of 0.05, which our calculation was supposed to deliver us with 80% probability if the clinically relevant difference obtained, but to a P-value of 0.006, or just over 1/10 of what our power calculation would accept as constituting proof of efficacy.

To put it another way, if δ is the value we would like to observe and if the treatment does, indeed, have a value of δ, then we have only half a chance, not an 80% chance, that the trial will deliver to us a value as big as this.
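A quick check of those two probabilities (my own sketch, not Senn’s), working in units of δ so that the standard error implied by equation (2) is 1/(zα/2 + zβ):

```python
# If the true effect equals delta and the trial is sized by (1), the chance of
# observing d >= delta is only 1/2, while the chance of a 'significant' result is 0.80.
from scipy.stats import norm

z_alpha, z_beta = norm.ppf(0.975), norm.ppf(0.80)
se = 1.0 / (z_alpha + z_beta)                       # standard error in units of delta, from (2)
true_effect = 1.0                                   # true effect equal to delta

p_at_least_delta = 1 - norm.cdf((1.0 - true_effect) / se)   # 0.50
p_significant = 1 - norm.cdf(z_alpha - true_effect / se)    # 0.80 (upper tail, to a good approximation)
print(f"P(d >= delta) = {p_at_least_delta:.2f}, P(significant) = {p_significant:.2f}")
```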

     2.     It is the difference we would like to ‘prove’ obtains
This view is hopeless. It requires that the lower confidence limit should be greater than δ. If this is what is needed, the power calculation is completely irrelevant.

     3.     It is the difference we believe obtains
This is another wrong-headed notion. Since the smaller the value of δ, the larger the sample size, it would have the curious side effect that, given a number of drug-development candidates, we would spend most money on those we considered least promising. There are some semi-Bayesian versions of this in which a probability distribution for δ would be substituted for a single value. Most medical statisticians would reject this as being a pointless elaboration of a point of view that is wrong in the first place. If you reject the notion that δ is your best guess as to what the treatment effect is, there is no need to elaborate this rejected position by giving δ a probability distribution.

Note, I am not rejecting the idea of Bayesian sample size calculations. A fully decision-analytic approach might be interesting.  I am rejecting what is a Bayesian-frequentist chimera.

     4.     It is the difference you would not like to miss
This is the interpretation I favour. The idea is that we control two (conditional) errors in the process. The first is α, the probability of claiming that a treatment is effective when it is, in fact, no better than placebo. The second is the error of failing to develop a (very) interesting treatment further. If a trial in drug development is not ‘successful’, there is a chance that the whole development programme will be cancelled. It is the conditional probability of cancelling an interesting project that we seek to control.

Note that the FDA will usually require that two phase III trials are ‘significant’, and significance requires that the observed effect is at least equal to (zα/2/(zα/2 + zβ))·δ. In our example this would give us (1.96/2.8)δ = 0.7δ, or a little over two thirds of δ, for at least two trials for any drug that obtained registration. In practice, the observed average of the two would have an effect somewhat in excess of 0.7δ. Of course, we would be naïve to believe that all drugs that get accepted have this effect (regression to the mean is ever-present) but nevertheless it provides some reassurance.

Lessons
In other words, if you are going to do a power calculation and you are going to target some sort of value like 80% power, you need to set δ at a value that is higher than one you would be happy to find. Statisticians like me think of δ as the difference we would not like to miss and we call this the clinically relevant difference.

Does this mean that an effect that is 2/3 of the clinically relevant difference is worth having? Not necessarily. That depends on what your understanding of the phrase is. It should be noted, however, that when it is crucial to establish that no important difference between treatments exists, as in a non-inferiority study, then another sort of difference is commonly used. This is referred to as the clinically irrelevant difference. Such differences are quite commonly no more than 1/3 of the sort of difference a drug will have shown historically to placebo and hence much smaller than the difference you would not like to miss.

Another lesson, however, is this. In this area, as in others in the analysis of clinical data, dichotomisation is a bad habit. There are no hard and fast boundaries. Relevance is a matter of degree not kind.

Categories: power, Statistics, Stephen Senn | 38 Comments

New SEV calculator (guest app: Durvasula)

Karthik Durvasula, a blog follower[i], sent me a highly apt severity app that he created: https://karthikdurvasula.shinyapps.io/Severity_Calculator/
I have his permission to post it or use it for pedagogical purposes, so since it’s Saturday night, go ahead and have some fun with it. Durvasula had the great idea of using it to illustrate howlers. Also, I would add, to discover them.
It follows many of the elements of the Excel Sev Program discussed recently, but it’s easier to use.* (I’ll add some notes about the particular claim (i.e., discrepancy) for which SEV is being computed later on.)
*If others want to tweak or improve it, he might pass on the source code (write to me on this).
[i] I might note that Durvasula was the winner of the January palindrome contest.
Categories: Severity, Statistics | 12 Comments

Get empowered to detect power howlers

If a test’s power to detect µ’ is low, then is a statistically significant result good or lousy evidence of discrepancy µ’? Which is it?

If your smoke alarm has little capability of triggering unless your house is fully ablaze, then if it has triggered, is that a strong or weak indication of a fire? Compare this insensitive smoke alarm to one that is so sensitive that burning toast sets it off. The answer is: that the insensitive alarm has been triggered is a good indication of the presence of (some) fire, while hearing the ultra-sensitive alarm go off is not.[i]

Yet I often hear people say things to the effect that:

if you get a result significant at a low p-value, say ~.03,
but the power of the test to detect alternative µ’ is also low, say .04 (i.e., POW(µ’) = .04), then “the result hasn’t done much to distinguish” the data from that obtained by chance alone.

–but wherever that reasoning is coming from, it’s not from statistical hypothesis testing, properly understood. It’s easy to see.

We can use a variation on the one-sided test T+ from our illustration of power: We’re testing the mean of a Normal distribution with n iid samples, and (for simplicity) known σ:

H0: µ ≤  0 against H1: µ >  0

Let σ = 1, n = 25, so (σ/√n) = .2.
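Here is a minimal numerical sketch (mine, not the Excel SEV program or Durvasula’s app; the observed mean and the discrepancy µ’ below are illustrative choices) that puts numbers on the smoke-alarm reasoning:

```python
# Test T+: H0: mu <= 0 vs H1: mu > 0, sigma = 1, n = 25, so SE = 0.2.
# Compute the p-value of an observed mean, the power against a small discrepancy mu',
# and the severity with which that result warrants "mu > mu'".
from scipy.stats import norm

sigma, n = 1.0, 25
se = sigma / n**0.5                    # 0.2

x_obs = 0.4                            # illustrative observed sample mean
p_value = 1 - norm.cdf(x_obs / se)     # about .02

mu_prime = 0.05                        # illustrative discrepancy the test has little power to detect
cutoff = norm.ppf(0.975) * se          # reject at the .025 level when x-bar > cutoff
power = 1 - norm.cdf((cutoff - mu_prime) / se)        # POW(mu') is about .04
severity = norm.cdf((x_obs - mu_prime) / se)          # SEV(mu > mu') is about .96

print(f"p = {p_value:.3f}, POW(mu') = {power:.3f}, SEV(mu > mu') = {severity:.3f}")
```

Low power against µ’ goes with high severity for inferring µ > µ’: the insensitive alarm’s ringing is a good indication of the fire, not a poor one.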

Continue reading

Categories: confidence intervals and tests, power, Statistics | 34 Comments

Power, power everywhere–(it) may not be what you think! [illustration]

Statistical power is one of the neatest [i], yet most misunderstood, statistical notions [ii]. So here’s a visual illustration (written initially for our 6334 seminar), but worth a look by anyone who wants an easy way to attain the will to understand power. (Please see notes below slides.)

[i] I was tempted to say power is one of the “most powerful” notions. It is. True, severity leads us to look not at the cut-off for rejection (as with power) but at the actual observed value, or observed p-value. But the reasoning is the same. Likewise for less artificial cases where the standard deviation has to be estimated. See Mayo and Spanos 2006.

[ii]

  • Some say that to compute power requires either knowing the alternative hypothesis (whatever that means), or worse, the alternative’s prior probability! Then there’s the tendency (by reformers no less!) to transpose power in such a way as to get the appraisal of tests exactly backwards. An example is Ziliak and McCloskey (2008). See, for example, the will to understand power: http://errorstatistics.com/2011/10/03/part-2-prionvac-the-will-to-understand-power/
  • Many allege that a null hypothesis may be rejected (in favor of alternative H’) with greater warrant, the greater the power of the test against H’, e.g., Howson and Urbach (2006, 154). But this is mistaken. The frequentist appraisal of tests is the reverse, whether Fisherian significance tests or those of the Neyman-Pearson variety. One may find the fallacy exposed back in Morrison and Henkel (1970)! See EGEK 1996, pp. 402-3.
  •  For a humorous post on this fallacy, see: “The fallacy of rejection and the fallacy of nouvelle cuisine”: http://errorstatistics.com/2012/04/04/jackie-mason/

You can find a link to the Severity Excel Program (from which the pictures came) on the right hand column of this blog, and a link to basic instructions. This corresponds to EXAMPLE SET 1 pdf for Phil 6334.

Howson, C. and P. Urbach (2006). Scientific Reasoning: The Bayesian Approach. La Salle, Il: Open Court.

Mayo, D. G. and A. Spanos (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction”, British Journal for the Philosophy of Science, 57: 323-357.

Morrison, D. and Henkel, R. (eds.) (1970), The Significance Test Controversy.

Ziliak, S. and McCloskey, D. (2008), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives, University of Michigan Press.

Categories: Phil6334, Statistical power, Statistics | 26 Comments

capitalizing on chance (ii)

DGM playing the slots

I may have been exaggerating one year ago when I started this post with “Hardly a day goes by”, but now it is literally the case*. (This  also pertains to reading for Phil6334 for Thurs. March 6):

Hardly a day goes by where I do not come across an article on the problems for statistical inference based on fallaciously capitalizing on chance: high-powered computer searches and “big” data trolling offer rich hunting grounds out of which apparently impressive results may be “cherry-picked”:

When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level. . . . Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be “significant at the 5 percent level.” Does this mean that differences as large as the one tested would occur by chance only 5 percent of the time when the true difference is zero? The answer is no, because the difference tested has been selected from the twenty differences that were examined. The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104)[1]

…Oh wait: this is from a contributor to Morrison and Henkel way back in 1970! But there is one big contrast, I find, that makes current day reports so much more worrisome: critics of the Morrison and Henkel ilk clearly report that to ignore a variety of “selection effects” results in a fallacious computation of the actual significance level associated with a given inference; clear terminology is used to distinguish the “computed” or “nominal” significance level on the one hand, and the actual or warranted significance level on the other. Continue reading
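Selvin’s 64 percent is easy to check; here is a minimal sketch (mine, not Selvin’s, using one-sided z-tests on independent noise as a stand-in for his twenty differences):

```python
# Where "not 5 percent but 64 percent" comes from: if twenty independent true-null
# statistics are examined and only the most impressive is tested at the nominal .05
# level, the chance that at least one reaches nominal significance is 1 - 0.95**20.
import numpy as np

nominal, k = 0.05, 20
print(f"actual level: {1 - (1 - nominal) ** k:.2f}")           # 0.64

# The same thing by simulation: twenty z-tests on pure noise, cherry-pick the largest.
rng = np.random.default_rng(0)
z = rng.standard_normal((100_000, k))                          # each column: a true-null z-statistic
print(f"simulated:    {(z.max(axis=1) > 1.645).mean():.2f}")   # about 0.64
```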

Categories: junk science, selection effects, spurious p values, Statistical fraudbusting, Statistics | 4 Comments

Significance tests and frequentist principles of evidence: Phil6334 Day #6

Slides (2 sets) from Phil 6334 2/27/14 class (Day #6).


D. Mayo:
“Frequentist Statistics as a Theory of Inductive Inference”

A. Spanos
“Probability/Statistics Lecture Notes 4: Hypothesis Testing”

Categories: P-values, Phil 6334 class material, Philosophy of Statistics, Statistics | Tags: | Leave a comment

Cosma Shalizi gets tenure (at last!) (metastat announcement)

News Flash! Congratulations to Cosma Shalizi who announced yesterday that he’d been granted tenure (Statistics, Carnegie Mellon). Cosma is a leading error statistician, a creative polymath and long-time blogger (at Three-Toed Sloth). Shalizi wrote an early book review of EGEK (Mayo 1996)* that people still send me from time to time, in case I hadn’t seen it! You can find it on this blog from 2 years ago (posted by Jean Miller). A discussion of a meeting of the minds between Shalizi and Andrew Gelman is here.

*Error and the Growth of Experimental Knowledge.

Categories: Announcement, Error Statistics, Statistics | Tags: | Leave a comment

Phil6334: Feb 24, 2014: Induction, Popper and pseudoscience (Day #4)

Phil 6334* Day #4: Mayo slides follow the comments below. (Make-up for Feb 13 snow day.) Popper reading is from Conjectures and Refutations.


As is typical in rereading any deep philosopher, I discover (or rediscover) different morsels of clues to understanding—whether fully intended by the philosopher or a byproduct of their other insights, and a more contemporary reading. So it is with Popper. A couple of key ideas to emerge from Monday’s (make-up) class and the seminar discussion (my slides are below):

  1. Unlike the “naïve” empiricists of the day, Popper recognized that observations are not just given unproblematically, but also require an interpretation, an interest, a point of view, a problem. What came first, a hypothesis or an observation? Another hypothesis, if only at a lower level, says Popper.  He draws the contrast with Wittgenstein’s “verificationism”. In typical positivist style, the verificationist sees observations as the given “atoms,” and other knowledge is built up out of truth functional operations on those atoms.[1] However, scientific generalizations beyond the given observations cannot be so deduced, hence the traditional philosophical problem of induction isn’t solvable. One is left trying to build a formal “inductive logic” (generally deductive affairs, ironically) that is thought to capture intuitions about scientific inference (a largely degenerating program). The formal probabilists, as well as philosophical Bayesianism, may be seen as descendants of the logical positivists–instrumentalists, verificationists, operationalists (and the corresponding “isms”). So understanding Popper throws a lot of light on current day philosophy of probability and statistics.
  2. The fact that observations must be interpreted opens the door to interpretations that prejudge the construal of data. With enough interpretive latitude, anything (or practically anything) that is observed can be interpreted as in sync with a general claim H. (Once you opened your eyes, you see confirmations everywhere, as with a gestalt conversion, as Popper put it.) For Popper, positive instances of a general claim H, i.e., observations that agree with or “fit” H, do not even count as evidence for H if virtually any result could be interpreted as according with H.
    Note a modification of Popper here: Instead of putting the “riskiness” on H itself, it is the method of assessment or testing that bears the burden of showing that something (ideally quite a lot) has been done in order to scrutinize the way the data were interpreted (to avoid “verification bias”). The scrutiny needs to ensure that it would be difficult (rather than easy) to get an accordance between data x and H (as strong as the one obtained) if H were false (or specifiably flawed). Continue reading
Categories: Phil 6334 class material, Popper, Statistics | 7 Comments
