Monthly Archives: April 2013

What should philosophers of science do? (Higgs, statistics, Marilyn)

Marilyn Monroe not walking past a Higgs boson and not making it decay, whatever philosophers might say.

My colleague, Lydia Patton, sent me this interesting article, “The Philosophy of the Higgs,” (from The Guardian, March 24, 2013) when I began the posts on “statistical flukes” in relation to the Higgs experiments (here and here); I held off posting it partly because of the slightly sexist attention-getter pic of Marilyn (in reference to an “irrelevant blonde”[1]), and I was going to replace it, but with what? All the men I regard as good-looking have dark hair (or no hair). But I wanted to take up something in the article around now, so here it is, a bit dimmed. Anyway, apparently MM was not the idea of the author, particle physicist Michael Krämer, but rather of a group of philosophers at a meeting discussing philosophy of science and science. In the article, Krämer tells us:

For quite some time now, I have collaborated on an interdisciplinary project which explores various philosophical, historical and sociological aspects of particle physics at the Large Hadron Collider (LHC). For me it has always been evident that science profits from a critical assessment of its methods. “What is knowledge?”, and “How is it acquired?” are philosophical questions that matter for science. The relationship between experiment and theory (what impact does theoretical prejudice have on empirical findings?) or the role of models (how can we assess the uncertainty of a simplified representation of reality?) are scientific issues, but also issues from the foundation of philosophy of science. In that sense they are equally important for both fields, and philosophy may add a wider and critical perspective to the scientific discussion. And while not every particle physicist may be concerned with the ontological question of whether particles or fields are the more fundamental objects, our research practice is shaped by philosophical concepts. We do, for example, demand that a physical theory can be tested experimentally and thereby falsified, a criterion that has been emphasized by the philosopher Karl Popper already in 1934. The Higgs mechanism can be falsified, because it predicts how Higgs particles are produced and how they can be detected at the Large Hadron Collider.

On the other hand, some philosophers tell us that falsification is strictly speaking not possible: What if a Higgs property does not agree with the standard theory of particle physics? How do we know it is not influenced by some unknown and thus unaccounted factor, like a mysterious blonde walking past the LHC experiments and triggering the Higgs to decay? (This was an actual argument given in the meeting!) Many interesting aspects of falsification have been discussed in the philosophical literature. “Mysterious blonde”-type arguments, however, are philosophical quibbles and irrelevant for scientific practice, and they may contribute to the fact that scientists do not listen to philosophers.

I entirely agree that philosophers have wasted a good deal of energy maintaining that it is impossible to solve Duhemian problems of where to lay the blame for anomalies. They misrepresent the very problem by supposing there is a need to string together a tremendously long conjunction consisting of a hypothesis H and a bunch of auxiliaries Ai which are presumed to entail observation e. But neither scientists nor ordinary people would go about things in this manner. The mere ability to distinguish the effects of different sources suffices to pinpoint blame for an anomaly. For some posts on falsification, see here and here*.

The question of why scientists do not listen to philosophers was also a central theme of the recent inaugural conference of the German Society for Philosophy of Science. I attended the conference to present some of the results of our interdisciplinary research group on the philosophy of the Higgs. I found the meeting very exciting and enjoyable, but was also surprised by the amount of critical self-reflection. Continue reading

Categories: Higgs, Statistics, StatSci meets PhilSci

Getting Credit (or blame) for Something You Didn’t Do (BP oil spill, comedy hour)

Three years ago, many of us were glued to the “spill cam” showing, in real time, the gushing oil from the April 20, 2010 explosion sinking the Deepwater Horizon oil rig in the Gulf of Mexico, killing 11, and spewing oil until July 15. Trials have been taking place this month, as people try to meet the 3-year deadline to sue BP and others. But what happened to the 200 million gallons of oil? (Is anyone up to date on this?) Has it vanished, or was it just sunk to the bottom of the sea by dispersants, which may have caused hidden destruction of sea life? I don’t know, but given it’s Saturday night around the 3-year anniversary, let’s listen in to a reblog of a spill-related variation on the second of two original “overheard at the comedy hour” jokes.

In effect, it accuses the frequentist error-statistical account of licensing the following (make-believe) argument after the 2010 oil spill:

Oil Exec: We had highly reliable evidence that H: the pressure was at normal levels on April 20, 2010!

Senator: But you conceded that whenever your measuring tool showed dangerous or ambiguous readings, you continually lowered the pressure, and that the stringent “cement bond log” test was entirely skipped.

 Oil Exec:  Granted, we omitted reliable checks on April 20, 2010, but usually we do a better job—I am reporting the average!  You see, we use a randomizer that most of the time directs us to run the gold-standard check on pressure. But, but April  20 just happened to be one of those times we did the nonstringent test; but on average we do ok.

Senator:  But you don’t know that your system would have passed the more stringent test you didn’t perform!

Oil Exec:  That’s the beauty of the frequentist test!

Even if we grant (for the sake of the joke) that overall, this “test” rarely errs in the report it outputs (pass or fail), that is irrelevant to appraising the inference from the data on April 20, 2010 (which would have differed had the more stringent test been run). That interpretation violates the severity criterion: the observed passing result was altogether common if generated from a source where the pressure level was unacceptably high. Therefore it misinterprets the actual data. The question is why anyone would saddle the frequentist with such shenanigans on averages?  … Lest anyone think I am inventing a criticism, here is a familiar statistical instantiation, where the probability of choosing each experiment is given to be .5 (Cox 1958).

Two Measuring Instruments with Different Precisions:

 A single observation X is to be made on a normally distributed random variable with unknown mean µ, but the measurement instrument is chosen by a coin flip: with heads we use instrument E’ with a known small variance, say 10⁻⁴, while with tails, we use E”, with a known large variance, say 10⁴. The full data indicate whether E’ or E” was performed, and the particular value observed, which we can write as x’ and x”, respectively. (This example comes up in “ton o’ bricks”.)

In applying our test T+ (see November 2011 blog post ) to a null hypothesis, say, µ = 0, the “same” value of X would correspond to a much smaller p-value were it to have come from E’ than if it had come from E”.  Denote the two p-values as p’ and p”, respectively.  However, or so the criticism proceeds, the error statistician would report the average p-value:  .5(p’ + p”).

But this would give a misleading assessment of the precision and corresponding severity with either measurement! Instead you should report the p-value of the result in the experiment actually run (this is Cox’s Weak Conditionality Principle, WCP).
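To put numbers on the contrast, here is a minimal sketch. It assumes a one-sided (upper-tail) test of µ = 0 and a hypothetical observed value of 0.03; the function and variable names are mine, chosen only for illustration. The variances are the 10⁻⁴ and 10⁴ of the example.

```python
import math

def upper_tail_p(x, mu0, sigma):
    """One-sided p-value P(X >= x) when X ~ N(mu0, sigma^2)."""
    z = (x - mu0) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

# Standard deviations of the two instruments (variances 1e-4 and 1e4).
SIGMA_PRECISE = math.sqrt(1e-4)    # E'
SIGMA_IMPRECISE = math.sqrt(1e4)   # E''

x_obs = 0.03  # hypothetical observed value, not from the post

p_precise = upper_tail_p(x_obs, 0.0, SIGMA_PRECISE)      # p'
p_imprecise = upper_tail_p(x_obs, 0.0, SIGMA_IMPRECISE)  # p''
p_averaged = 0.5 * (p_precise + p_imprecise)             # what the critic alleges

print(f"p'  (E' run):   {p_precise:.5f}")
print(f"p'' (E'' run):  {p_imprecise:.5f}")
print(f"averaged:       {p_averaged:.5f}")
```

Under these assumptions the same x is about 3 standard deviations out on E’ but essentially at the null on E”, and the averaged figure describes neither experiment. Reporting the p-value of the instrument actually used is what the WCP directs.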

But what could lead the critic to suppose the error statistician must average over experiments not even performed?  Rule #2 for legitimate criticism is to give the position being criticized the most generous construal one can think of.  Perhaps the critic supposes what is actually a distortion of even the most radical behavioristic construal:

  •   If you consider outcomes that could have occurred in hypothetical repetitions of this experiment, you must also consider other experiments you did not run (but could have been run) in reasoning from the data observed (from the test you actually ran), and report some kind of frequentist average!

The severity requirement makes explicit that such a construal is to be rejected—I would have thought it obvious, and not in need of identifying a special principle. Since it wasn’t, I articulated this special notion for interpreting tests and the corresponding severity criterion. Continue reading

Categories: Bayesian/frequentist, Comedy, Statistics

Blog Contents 2013 (March)

Error Statistics Philosophy Blog: March 2013* (Frequentists in Exile-the blog)**:

(3/1) capitalizing on chance
(3/4) Big Data or Pig Data?
(3/7) Stephen Senn: Casting Stones
(3/10) Blog Contents 2013 (Jan & Feb)
(3/11) S. Stanley Young: Scientific Integrity and Transparency
(3/13) Risk-Based Security: Knives and Axes
(3/15) Normal Deviate: Double Misunderstandings About p-values
(3/17) Update on Higgs data analysis: statistical flukes (1)
(3/21) Telling the public why the Higgs particle matters
(3/23) Is NASA suspending public education and outreach?
(3/27) Higgs analysis and statistical flukes (part 2)
(3/31) possible progress on the comedy hour circuit?

*March was incredibly busy here; I’m saving up several partially-baked posts on draft. Also, while I love this old typewriter, I’ve had to have special keys made for common statistical symbols, and that has delayed me some. I hope people will scan the previous contents starting from the beginning (e.g., with “prionvac“): it’s philosophy, remember, and philosophy has to be reread many times over.  January and February 2013 contents are here.

**compiled by Jean Miller and Nicole Jinn.

Categories: Metablog, Statistics

PhilStock: Applectomy? (rejected post)

Apple (AAPL) stock is a perfect example of how psychology, fear and superstition enter into stock prices as much as do measures of valuation. Any predictions for this afternoon’s earnings? In general, here’s a field where regardless of what happens, “experts” never have to say they were wrong–especially about Tech. So, certainly we don’t. Thus, a wild guess–AAPL (currently down 300 points over its high) goes up with earnings, but not massively (~5-10pts). Still, there’s such a fear of its being “RIMMED” (i.e., dramatically losing its status as top tech, as did Research in Motion), that it may be beaten down some more.

(To be placed in rejected posts blog)

Categories: Rejected Posts

Majority say no to inflight cell phone use, knives, toy bats, bow and arrows, according to survey

The Transportation Security Administration (TSA) has just announced it is backing off its decision to permit, beginning Thursday, 25 April, pocket knives, toy bats, golf clubs (limit 2), lacrosse sticks, billiard cues, ski poles, fishing reels, and other assorted sports equipment, at least for the time being. See my post on “risk based security”. Apparently, Pistole (TSA chief) could not entirely ignore the vociferous objections of numerous stakeholders, whom he had not even bothered to consult, after all. Recall that the former TSA chief, Hawley, had actually wanted to go further, saying

 “They ought to let everything on that is sharp and pointy. Battle axes, machetes … you will not be able to take over the plane. It is as simple as that,” he said. (Link is here.)

I don’t have a strong feeling about blades, but I am very much in sync with the survey that influenced Pistole’s about-face as regards cell phones (against) and liquids in carry-ons (for).

Vast majority of Americans say no to cell phone use and pocket knives inflight according to new survey

In a new, nationwide survey, Travel Leaders Group asked Americans across the country if they are in favor of the change and 73% of those polled do not want pocket knives allowed in airplane cabins. Also, a vast majority (nearly 80%) indicate they do not want fellow airline passengers to have the ability to make cell phone calls inflight. The survey includes responses from 1,788 consumers throughout the United States and was conducted by Travel Leaders Group – an $18 billion powerhouse in the travel industry – from March 15 to April 8, 2013.

“The results are very clear. Most Americans would prefer the status quo with regard to cell phone use inflight. Because so many planes are flying at near capacity and many passengers already feel a lack of personal space within the airplane cabin, it’s understandable that they want to continue to have some amount of peace and quiet whether they are on a short commuter flight or a flight that lasts several hours,” stated Travel Leaders Group CEO Barry Liben.

I’m really heartened to see that people are flouting the knee-jerk expectation that they’d want as much high tech as possible, and are weighing in against cell phones on planes. Recall my post on cell phones (now in rejected posts). Here are some of the statistics from the survey:

When asked, “Are you in favor of this change or against it?” 73% of those polled said they are not in favor of allowing pocket knives on planes.

  • I’m OK with it.
  • I’m OK with everything except pocket knives.
  • I don’t think these items should be allowed.
  • I don’t know.


Cell Phone Use Inflight

Studies are underway to determine if full cell phone use is safe while inflight and a decision on whether to allow such use (not just “airplane mode”) is expected this summer.  In Travel Leaders Group’s survey, nearly 80% of those polled are against allowing passengers to make cell phone calls during flight.  Here are the detailed responses:


  • I am opposed to it.
  • I am in favor as long as it is not used for conversations.
  • I am in favor of it.
  • I don’t know.


Additional Statistics and Findings:

  • Eliminate One TSA Security Measure: With regard to TSA security screening at the airport, when asked, “Which of the following TSA security measures would you most like to eliminate?” the top responses were: “removing of shoes” (27.9%), “limits on liquids in carry-on baggage” (24.1%), and “none, do not eliminate any security measures” (19.8%).

  • Airport Security Satisfaction: When asked, “What is your level of satisfaction with airport security today?” 82.0% indicate they are satisfied or neutral with today’s security measures (62.2% indicate they are “satisfied,” 19.8% are “neither satisfied nor unsatisfied” and 18.0% are “unsatisfied”).

  • Coach Class Flyers: When asked, “Do you ever fly in Coach Class?” over 94% of those polled said “Yes.” And of those who indicate they fly in Coach Class, when asked what makes flying in Coach most uncomfortable, the top responses were: “Lack of leg room” (49.5%); “seat size” (17.2%) and “pitch of the seat – person in front of me reclines too much” (15.0%).

  • This is the fifth consecutive year for this travel survey.  American consumers were engaged predominantly through social media channels such as Facebook and Twitter, as well as through direct contact with travel clients for the following Travel Leaders Group companies: Nexion, Results! Travel, Travel Leaders, Tzell Travel Group and  (

 So a tiny bit of good news among the forced air traffic control reductions and FAA cuts that began yesterday: See

Categories: Uncategorized

Stephen Senn: When relevance is irrelevant

(guest post) When Relevance is Irrelevant, by Stephen Senn

Head of Competence Center for Methodology and Statistics (CCMS)

Applied statisticians tend to perform analyses on additive scales and additivity is an important aspect of an analysis to try to check. Consider survival analysis. The most important model used, the default in many cases, is the proportional hazards model introduced by David Cox in 1972[1] and sometimes referred to as Cox regression. In fact, from one point of view, analysis takes place on the log-hazard scale and so the model could equally be referred to by the rather clumsier title additive log-hazards model and there is quite a literature on how the proportionality (or equivalently, additivity) assumption can be checked.

Words have a definite power over the mind and you sometimes encounter the nonsensical claim that if the proportionality assumption does not apply you should consider a log-rank test instead. In fact, when testing the null hypothesis that two treatments are identical, neither the log-rank test nor the score test using the proportional hazards model requires the assumption of proportionality: the assumption is trivially satisfied by the fact of two treatments being identical. Furthermore the log-rank test is just a special case of proportional hazards: the score test for a proportional hazards model without any covariates is the log-rank test. Finally, it is easy to produce examples where proportional hazards would apply in a model with covariates but not in the model without covariates but very difficult to produce the converse.

An objection often made regarding such models is that they are very difficult for physicians to understand. My reply is to ask what is preferable: a difficult truth or an easy lie? Ah yes, it is sometimes countered, but surely I agree on the importance of clinical relevance. It is surely far more useful to express the results of a proportional hazards analysis in clinically relevant terms that can be understood, such as difference in median length of survival or the difference in the event rate up to a particular census point (say one year after treatment).

A disturbing paper by Snapinn and Jiang[2] points to a problem, however, and to explain it I can do no better than cite the abstract:

The standard analysis of a time-to-event variable often involves the calculation of a hazard ratio based on a survival model such as Cox regression; however, many people consider such relative measures of effect to be poor expressions of clinical meaningfulness. Two absolute measures of effect are often used to assess clinical meaningfulness: (1) many disease areas frequently use the absolute difference in event rates (or its inverse, the number-needed-to-treat) and (2) oncology frequently uses the difference between the median survival times in the two groups. While both of these measures appear reasonable, they directly contradict each other. This paper describes the basic mathematics leading to the two measures and shows examples. The contradiction described here raises questions about the concept of clinical meaningfulness. (p2341)

To see the problem, consider the following. The more serious the disease, the less a given difference in the rate at which people die will impact on the time survived and hence on differences in median survival. However, generally, the higher the baseline mortality rate the greater the difference in survival at a given time point that will be conveyed by a given treatment benefit.
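A small worked example may help. It is only a sketch under an assumption of my own, not taken from the paper: survival times are exponential, so a constant hazard ratio applies. The hazard ratio of 0.7, the two baseline hazards and the one-year census point are all hypothetical.

```python
import math

def median_gain(base_hazard, hazard_ratio):
    """Difference in median survival (treated minus control), exponential model."""
    control_median = math.log(2) / base_hazard
    treated_median = math.log(2) / (base_hazard * hazard_ratio)
    return treated_median - control_median

def survival_gain(base_hazard, hazard_ratio, t):
    """Difference in the proportion surviving to time t (treated minus control)."""
    treated = math.exp(-base_hazard * hazard_ratio * t)
    control = math.exp(-base_hazard * t)
    return treated - control

HAZARD_RATIO = 0.7   # same relative benefit in both settings (hypothetical)
CENSUS_POINT = 1.0   # years

for base_hazard in (0.05, 0.5):  # mild vs. serious disease, deaths per person-year
    print(
        f"baseline hazard {base_hazard}: "
        f"median gain {median_gain(base_hazard, HAZARD_RATIO):.2f} years, "
        f"one-year survival gain {survival_gain(base_hazard, HAZARD_RATIO, CENSUS_POINT):.3f}"
    )
```

With the same hazard ratio, the more serious disease shows the smaller gain in median survival but the larger gain in one-year survival, so the two "clinically relevant" measures rank the benefit in opposite orders. In this toy model the second pattern holds only while the census point is early relative to typical survival, which is one reason for the "generally" above.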

If you find this less than clear, you have my sympathy. The only solution I can offer is to suggest that you read the paper by Snapinn and Jiang[2]. However, in that case also consider the following point. If the point is so subtle, how many physicians who cannot understand proportional hazards can understand numbers needed to treat or differences in median survival? My opinion is that they can be counted on the fingers of one foot. Continue reading

Categories: Statistics

Does statistics have an ontology? Does it need one? (draft 2)

Chance, rational beliefs, decision, uncertainty, probability, error probabilities, truth, random sampling, resampling, opinion, expectations. These are some of the concepts we bandy about by giving various interpretations to mathematical statistics, to statistical theory, and to probabilistic models. But are they real? The question of “ontology” asks about such things, and given the “Ontology and Methodology” conference here at Virginia Tech (May 4, 5), I’d like to get your thoughts (for possible inclusion in a Mayo-Spanos presentation).* Also, please consider attending**.

Interestingly, I noticed the posts that have garnered the most comments have touched on philosophical questions of the nature of entities and processes behind statistical idealizations (e.g.,

1. When an interpretation is supplied for a formal statistical account, its theorems may well turn out to express approximately true claims, and the interpretation may be deemed useful, but this does not mean the concepts give correct descriptions of reality. The interpreted axioms, and inference principles, are chosen to reflect a given philosophy, or set of intended aims: roughly, to use probabilistic ideas (i) to control error probabilities of methods (Neyman-Pearson, Fisher), or (ii) to assign and update degrees of belief, actual or rational (Bayesian). But this does not mean its adherents have to take seriously the realism of all the concepts generated. In fact, we often (on this blog) see supporters of various stripes of frequentist and Bayesian accounts running far away from taking their accounts literally, even as those interpretations are, or at least were, the basis and motivation for the development of the formal edifice (“we never meant this literally”). But are these caveats on the same order? Or do some threaten the entire edifice of the account?

Starting with the error statistical account, recall Egon Pearson in his “Statistical Concepts in Their Relation to Reality” making it clear to Fisher that the business of controlling erroneous actions in the long run, acceptance sampling in industry and 5-year plans, only arose with Wald, and were never really part of the original Neyman-Pearson tests (declaring that the behaviorist philosophy was Neyman’s, not his).  The paper itself may be found here. I was interested to hear (Mayo 2005)  Neyman’s arch opponent, Bruno de Finetti, remark (quite correctly) that the expression “inductive behavior…that was for Neyman simply a slogan underlining and explaining the difference between his, the Bayesian and the Fisherian formulations” became with Abraham Wald’s work, “something much more substantial” (de Finetti 1972, 176).

Granted, it has not been obvious to people just how to interpret N-P tests “evidentially” or “inferentially” (the subject of my work over many years). But there always seemed to me to be enough hints and examples to see what was intended: A statistical hypothesis H assigns probabilities to possible outcomes, and the warrant for accepting H as adequate, for an error statistician, is in terms of how well corroborated H is: how well H has stood up to tests that would have detected flaws in H, at least with very high probability. So the grounds for holding or using H are error statistical. The control and assessment of error probabilities may be used inferentially to determine the capabilities of methods to detect the adequacy/inadequacy of models, and express the extent of the discrepancies that have been identified. We also employ these ideas to detect gambits that make it too easy to find evidence for claims, even if the claims have been subjected to weak tests and biased procedures. A recent post is here.

The account has never professed to supply a unified logic, or any kind of logic for inference. The idea that there was a single rational way to make inferences was ridiculed by Neyman (whose birthday is April 16). Continue reading

Categories: Bayesian/frequentist, Error Statistics, Statistics

O & M Conference (upcoming) and a bit more on triggering from a participant…..

I notice that one of the contributed speakers, Koray Karaca*, at the upcoming Ontology and Methodology Conference at Virginia Tech (May 4-5) focuses his paper on triggering! I entirely agree with the emphasis on the need to distinguish different questions at multiple stages of an inquiry or research endeavor from the design, collection and modeling of data to a series of hypotheses, questions, problems, and threats of error. I do note a couple of queries below that I hope will be discussed at some point. Here’s part of his abstract…which may be found on the just created O & M Conference Blog (link is also at the O&M page on this blog). Recent posts on the Higgs data analysis are here, here, and here. Kent Staley had a recent post on the Higgs as well. (For earlier Higgs discussions search this blog.)

Koray Karaca
The method of robustness analysis and the problem of data-selection at the ATLAS experiment

In the first part, I characterize and distinguish between two problems of “methodological justification” that arise in the context of scientific experimentation. What I shall call the “problem of validation” concerns the accuracy and reliability of experimental procedures through which a particular set of experimental data is first acquired and later transformed into an experimental result. Therefore, the problem of validation can be phrased as follows: how to justify that a particular set of data as well as the procedures that transform it into an experimental result are accurate and reliable, so that the experimental result obtained at the end of the experiment can be taken as valid.  On the other hand, what I shall call the “problem of exploration” is concerned with the methodological question of whether an experiment is able, either or both, (1) to provide a genuine test of the conclusions of a scientific theory or hypothesis if the theory in question has not been previously (experimentally) tested, or to provide a novel test if the theory or hypothesis in question has already been tested, and (2) to discover completely novel phenomena; i.e., phenomena which have not been predicted by present theories and detected in previous theories. Even though the problem of validation and the ways it is dealt with in scientific practice has been thoroughly discussed in the literature of scientific experimentation, the significance of the problem of exploration has not yet been fully appreciated. In this work, I shall address this problem and examine the way it is handled in the present-day high collision-rate particle physics experiments. To this end, I shall consider the ATLAS experiment, which is one of the Large Hadron Collider (LHC) experiments currently running at CERN. 
…What are called “interesting events” are those collision events that are taken to serve to test the as-yet-untested predictions of the Standard Model of particle physics (SM) and its possible extensions, as well as to discover completely novel phenomena not predicted before by any theories or theoretical models.

To read the rest of the abstract, go to our just-made-public O & M conference blog.

First let me say that I’m delighted this case will be discussed at the O&M conference, and look forward to doing so. Here are a couple of reflections from the abstract, partly on terminology. First, I find it interesting that he places “triggering” (what I alluded to in my last post as a behavioristic, pre-data, task) under “exploratory”. He may be focussed more on what occurs (in relation to this one episode anyhow) when data are later used to check for indications of anomalies for the Standard Model Higgs, having been “parked” for later analysis. I thought the exploratory stage is usually a stage of informal or semi-formal data analysis to find interesting patterns and potential ingredients (variables, functions) for models, model building, and possible theory development. When Strassler heard there would be “parked data” for probing anomalies, I take it his theories kicked in to program those exotic indicators. Second, it seems to me that philosophers of science and “confirmation theorists” of various sorts, have focussed on when “data,” all neat and tidied up, count as supporting, confirming, falsifying hypotheses and theories. I wouldn’t have thought the problem of data collection, modeling or justifying data was “thoroughly discussed”; it absolutely should be, it’s just that such discussion seems all too rare. I may be wrong (I’d be glad to see references).

*Koray is a postdoctoral research fellow at the University of Wuppertal, and he knows I’m mentioning him here.

Categories: experiment & modeling

Statistical flukes (3): triggering the switch to throw out 99.99% of the data

This is the last of my 3 parts on “statistical flukes” in the Higgs data analysis. The others are here and here. Kent Staley had a recent post on the Higgs as well.

Many preliminary steps in the Higgs data generation and analysis fall under an aim that I call “behavioristic” and performance oriented: the goal being to control error rates on the way toward finding out something else–here, excess events or bumps of interest.

(a) Triggering. First of all, 99.99% of the data must be thrown away! So there needs to be a trigger to “accept or reject” collision data for analysis, whether for immediate processing or for later on, as in so-called “data parking”.

With triggering we are not far off the idea that a result of a “test”, or single piece of data analysis, is to take one “action” or another:

reject the null -> retain the data;

do not reject -> discard the data.

(Here the null might, in effect, hypothesize that the data are not interesting.) It is an automatic classification scheme, given limits of processing and storing; the goal of controlling the rates of retaining uninteresting and discarding potentially interesting data is paramount.[i] It is common for performance oriented tasks to enter, especially in getting the data for analysis, and they too are very much under the error statistical umbrella.
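As a toy illustration of such a classification rule, here is a minimal sketch. Everything in it is an assumption made for the example: each event is reduced to a single hypothetical "score" that is standard normal for uninteresting events and shifted upward for interesting ones. Real triggers are vastly more elaborate. The one figure carried over from above is the target of discarding 99.99% of uninteresting events.

```python
from statistics import NormalDist

# Hypothetical score distributions (assumptions for illustration only).
uninteresting = NormalDist(mu=0.0, sigma=1.0)
interesting = NormalDist(mu=4.0, sigma=1.0)

# Fix the rate of retaining uninteresting events at 0.01%,
# i.e. throw away 99.99% of them, and solve for the cutoff.
RETAIN_RATE_UNINTERESTING = 1e-4
threshold = uninteresting.inv_cdf(1.0 - RETAIN_RATE_UNINTERESTING)

def trigger(score):
    """Automatic action: reject the null -> retain; do not reject -> discard."""
    return "retain" if score > threshold else "discard"

# The two error rates of the rule.
false_retain_rate = 1.0 - uninteresting.cdf(threshold)  # uninteresting but kept
false_discard_rate = interesting.cdf(threshold)         # interesting but lost

print(f"threshold: {threshold:.3f}")
print(f"rate of retaining uninteresting events: {false_retain_rate:.6f}")
print(f"rate of discarding interesting events:  {false_discard_rate:.3f}")
```

The point of the sketch is only that the rule is judged by its two error rates, fixed before any particular event is seen, which is what makes the task "behavioristic".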

Particle physicist Matt Strassler has excellent discussions of triggering and parking on his blog “Of Particular Significance”. Here’s just one passage:

Data Parking at CMS (and the Delayed Data Stream at ATLAS) takes advantage of the fact that the computing bottleneck for dealing with all this data is not data storage, but data processing. The experiments only have enough computing power to process about 300 – 400 bunch-crossings per second. But at some point the experimenters concluded that they could afford to store more than this, as long as they had time to process it later. That would never happen if the LHC were running continuously, because all the computers needed to process the stored data from the previous year would instead be needed to process the new data from the current year. But the 2013-2014 shutdown of the LHC, for repairs and for upgrading the energy from 8 TeV toward 14 TeV, allows for the following possibility: record and store extra data in 2012, but don’t process it until 2013, when there won’t be additional data coming in. It’s like catching more fish faster than you can possibly clean and cook them — a complete waste of effort — until you realize that summer’s coming to an end, and there’s a huge freezer next door in which you can store the extra fish until winter, when you won’t be fishing and will have time to process them.

(b) Bump indication. Then there are rules for identifying bumps, excesses more than 2 or 3 standard deviations above what is expected or predicted. This may be the typical single significance test serving as more of an indicator rule.  Observed signals are classified as either rejecting, or failing to reject, a null hypothesis of “mere background”; non-null indications are bumps, deemed potentially interesting. Estimates of the magnitude of any departures are reported and graphically displayed. They are not merely searching for discrepancies with the “no Higgs particle” hypothesis, they are looking for discrepancies with the simplest type, the simple Standard Model Higgs. I discussed this in my first flukes post. Continue reading
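The indicator rule in (b) can be sketched as follows. This is a simplified illustration, not the experiments' actual procedure: it assumes a single bin with a known expected background count and uses the normal approximation to a Poisson count, and the counts and the 2-standard-deviation cutoff are hypothetical inputs.

```python
import math

def excess_in_sigma(observed, expected_background):
    """Excess over background in standard deviations (normal approx. to Poisson)."""
    return (observed - expected_background) / math.sqrt(expected_background)

def one_sided_p(z):
    """Upper-tail probability of a standard normal at z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def classify(observed, expected_background, threshold_sigma=2.0):
    """Flag a bin as a 'bump' if the excess exceeds the chosen number of sigma."""
    z = excess_in_sigma(observed, expected_background)
    label = "bump" if z > threshold_sigma else "background-like"
    return label, z, one_sided_p(z)

# Hypothetical counts: 100 events expected from background alone.
for observed in (110, 130):
    label, z, p = classify(observed, 100)
    print(f"observed {observed}: {z:.1f} sigma, p = {p:.5f} -> {label}")
```

Reporting z alongside the label mirrors the practice of reporting the magnitude of the departure rather than only the classification.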

Categories: Error Statistics

Who is allowed to cheat? I.J. Good and that after dinner comedy hour….

It was from my Virginia Tech colleague I.J. Good (in statistics), who died four years ago (April 5, 2009), at 93, that I learned most of what I call “howlers” on this blog. His favorites were based on the “paradoxes” of stopping rules.

“In conversation I have emphasized to other statisticians, starting in 1950, that, in virtue of the ‘law of the iterated logarithm,’ by optional stopping an arbitrarily high sigmage, and therefore an arbitrarily small tail-area probability, can be attained even when the null hypothesis is true. In other words if a Fisherian is prepared to use optional stopping (which usually he is not) he can be sure of rejecting a true null hypothesis provided that he is prepared to go on sampling for a long time. The way I usually express this ‘paradox’ is that a Fisherian [but not a Bayesian] can cheat by pretending he has a plane to catch like a gambler who leaves the table when he is ahead” (Good 1983, 135) [*]

This paper came from a conference where we both presented, and he was extremely critical of my error statistical defense on this point. (I was a year out of grad school, and he a University Distinguished Professor.) 

One time, years later, after hearing Jack give this howler for the nth time, “a Fisherian [but not a Bayesian] can cheat, etc.,” I was driving him to his office, and suddenly blurted out what I really thought:

“You know Jack, as many times as I have heard you tell this, I’ve always been baffled as to its lesson about who is allowed to cheat. Error statisticians require the overall and not the ‘computed’ significance level be reported. To us, what would be cheating would be reporting the significance level you got after trying and trying again in just the same way as if the test had a fixed sample size. True, we are forced to fret about how stopping rules alter the error probabilities of tests, while the Bayesian is free to ignore them, but why isn’t the real lesson that the Bayesian is allowed to cheat?” (A published version of my remark may be found in EGEK p. 351: “As often as my distinguished colleague presents this point…”)

 To my surprise, or actually shock, after pondering this a bit, Jack said something like, “Hmm, I never thought of it this way.”
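
The contrast between the “computed” and the overall significance level can be made concrete with a small simulation. The sketch below is an illustration only, not anything taken from Good's paper or from EGEK: it assumes standard normal data with known variance, a true null hypothesis of mean zero, and a nominal two-sided 0.05 cutoff (|z| > 1.96). The function name and its parameters are hypothetical.

```python
import math
import random


def rejection_rate(max_n, peek, z_crit=1.96, n_trials=2000, seed=1):
    """Estimate how often a true null (mean 0) is rejected.

    peek=False: test once, at the fixed sample size max_n.
    peek=True:  test after every observation, stop at the first |z| > z_crit.
    Data are simulated N(0, 1) draws; all settings are illustrative assumptions.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)
            z = total / math.sqrt(n)  # z statistic for the mean, sigma known to be 1
            if (peek or n == max_n) and abs(z) > z_crit:
                rejections += 1
                break
    return rejections / n_trials


fixed_rate = rejection_rate(max_n=200, peek=False)
peeking_rate = rejection_rate(max_n=200, peek=True)
print(f"fixed sample size: {fixed_rate:.3f}")
print(f"optional stopping: {peeking_rate:.3f}")
```

Under these assumptions the fixed-sample rate should come out near the nominal 0.05, while the try-and-try-again rate is several times larger and keeps growing as `max_n` increases. Reporting 0.05 in the second case is what the error statistician would count as cheating.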

By the way, the story of the “after dinner Bayesian comedy hour” on this blog did not allude to Jack but to someone who gave a much more embellished version. Since it’s Saturday night, let’s once again listen in on the comedy hour that unfolded at my dinner table at an academic conference:

 Did you hear the one about the researcher who gets a phone call from the guy analyzing his data? First the guy congratulates him and says, “The results show a Continue reading

Categories: Bayesian/frequentist, Comedy, Statistics

Guest Post. Kent Staley: On the Five Sigma Standard in Particle Physics

Kent Staley
Associate Professor
Department of Philosophy
Saint Louis University

Regular visitors to Error Statistics Philosophy may recall a discussion that broke out here and on other sites last summer when the CMS and ATLAS collaborations at the Large Hadron Collider announced that they had discovered a new particle in their search for the Higgs boson that had at least some of the properties expected of the Higgs. Both collaborations emphasized that they had results that were significant at the level of “five sigma,” and the press coverage presented this as a requirement in high energy particle physics for claiming a new discovery. Both the use of significance testing and the reliance on the five sigma standard became a matter of debate.
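
For readers who want the numbers behind “five sigma,” here is a minimal sketch that converts a number of standard deviations into a tail-area probability. It assumes a standard normal test statistic and the one-sided convention; the function name is hypothetical, and nothing here reproduces the collaborations' actual analysis.

```python
import math


def one_sided_tail(z):
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


for sigma in (2, 3, 5):
    p = one_sided_tail(sigma)
    print(f"{sigma} sigma: p = {p:.3g} (about 1 in {1 / p:,.0f})")
```

On this convention five sigma corresponds to a tail area of roughly 1 in 3.5 million, against roughly 1 in 740 for three sigma.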

Mayo has already commented on the recent updates to the Higgs search results (here and here); these seem to have further solidified the evidence for a new boson and the identification of that boson with the Higgs of the Standard Model. I have been thinking recently about the five sigma standard of discovery and what we might learn from reflecting on its role in particle physics. (I gave a talk on this at a workshop sponsored by the “Epistemology of the Large Hadron Collider” project at Wuppertal [i], which included both philosophers of science and physicists associated with the ATLAS collaboration.)

Just to refresh our memories, back in July 2012, Tony O’Hagan posted at the ISBA forum (prompted by “a question from Dennis Lindley”) three questions regarding the five-sigma claim:

  1. “Why such an extreme evidence requirement? We know from a Bayesian perspective that this only makes sense if (a) the existence of the Higgs boson (or some other particle sharing some of its properties) has extremely small prior probability and/or (b) the consequences of erroneously announcing its discovery are dire in the extreme. Neither seems to be the case, so why 5-sigma?
  2. “Rather than ad hoc justification of a p-value, it is of course better to do a proper Bayesian analysis. Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?
  3. “We know that given enough data it is nearly always possible for a significance test to reject the null hypothesis at arbitrarily low p-values, simply because the parameter will never be exactly equal to its null value. And apparently the LHC has accumulated a very large quantity of data. So could even this extreme p-value be illusory?”
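
The arithmetic behind the third question can be sketched as follows. This is an illustrative calculation, not part of O’Hagan’s post: it assumes a one-sided z test of a normal mean with known standard deviation, and a hypothetical tiny true discrepancy `delta` from the null value.

```python
import math


def expected_z(delta, sigma, n):
    """Expected z statistic when the true mean sits delta above the null value."""
    return delta * math.sqrt(n) / sigma


def one_sided_p(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


delta, sigma = 0.01, 1.0  # hypothetical: a discrepancy of 1% of a standard deviation
for n in (100, 10_000, 250_000, 1_000_000):
    z = expected_z(delta, sigma, n)
    print(f"n = {n:>9,}: expected z = {z:5.2f}, p at that z = {one_sided_p(z):.3g}")
```

With `delta` held fixed, the expected z grows like the square root of n, so the p-value at that z shrinks as data accumulate; in this toy setup the expected z reaches five at n = 250,000.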

O’Hagan received a lot of responses to this post, and he very helpfully wrote up and posted a digest of those responses, discussed on this blog here and here. Continue reading

Categories: Error Statistics, P-values, Statistics

Flawed Science and Stapel: Priming for a Backlash?

Diederik Stapel is back in the news, given the availability of the English translation of the Tilburg (Levelt and Noort Committees) Report as well as his book, Ontsporing (Dutch for “Off the Rails”), where he tries to explain his fraud. An earlier post on him is here. While the disgraced social psychologist was shown to have fabricated the data for something like 50 papers, it seems that some people think he deserves a second chance. A childhood friend, Simon Kuper, in an article “The Sin of Bad Science,” describes a phone conversation with Stapel:

“I’ve lost everything,” the disgraced former psychology professor tells me over the phone from the Netherlands. He is almost bankrupt. … He has tarnished his own discipline of social psychology. And he has become a national pariah. …

Very few social psychologists make stuff up, but he was working in a discipline where cavalier use of data was common. This is perhaps the main finding of the three Dutch academic committees which investigated his fraud. The committees found many bad practices: researchers who keep rerunning an experiment until they get the right result, who omit inconvenient data, misunderstand statistics, don’t share their data, and so on….

Chapter 5 of the Report, pp. 47-54, is extremely illuminating about the general practices they discovered in examining Stapel’s papers; I recommend it.

Social psychology might recover. However, Stapel might not. A country’s way of dealing with sinners is often shaped by its religious heritage. In Catholicism, sinners can get absolution in the secrecy of confession. … …In many American versions of Protestantism, the sinner can be “born again”. …Stapel’s misfortune is to be Dutch. The dominant Dutch tradition is Calvinist, and Calvinism believes in eternal sin. …But the downside to not forgiving sinners is that there are almost no second acts in Dutch lives.

But it isn’t just old acquaintances who think Stapel might be ready for a comeback. A few researchers are beginning to defend the field from the broader accusations the Report levels against the scientific integrity of social psychology. They do not deny the “cavalier” practices, but regard them as acceptable and even necessary! This might even pave the way for Stapel’s rehabilitation. An article by a delegate for the 3rd World Conference on Research Integrity (in Montreal, Canada, in May) reports on members of a new group critical of the Report, including some who were interviewed by the Tilburg Committees: Continue reading

Categories: junk science, Statistics
