This is the last of my 3 parts on “statistical flukes” in the Higgs data analysis. The others are here and here. Kent Staley had a recent post on the Higgs as well.
Many preliminary steps in the Higgs data generation and analysis fall under an aim that I call “behavioristic” and performance-oriented: the goal is to control error rates on the way toward finding out something else–here, excess events or bumps of interest.
(a) Triggering. First of all, 99.99% of the data must be thrown away! So there needs to be a trigger to “accept or reject” collision data for analysis–whether for immediate processing or for later on, as in so-called “data parking”.
With triggering we are not far off the idea that a result of a “test”, or single piece of data analysis, is to take one “action” or another:
reject the null -> retain the data;
do not reject -> discard the data.
(Here the null might, in effect, hypothesize that the data are not interesting.) It is an automatic classification scheme, given limits of processing and storing; the goal of controlling the rates of retaining uninteresting and discarding potentially interesting data is paramount.[i] Performance-oriented tasks of this kind commonly enter, especially in getting the data for analysis, and they too fall very much under the error-statistical umbrella.
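To make the “action” reading of a trigger concrete, here is a minimal sketch in Python. It is not how the CMS or ATLAS triggers work: the single “interestingness” score, the two Gaussian score distributions, the threshold, and the names `trigger` and `simulate_trigger` are all assumptions made up for illustration. The only point is that the rule is judged by its two error rates: how often uninteresting data are retained and how often potentially interesting data are discarded.

```python
import random


def trigger(score, threshold):
    """Decision rule: True means 'reject the null -> retain the data'."""
    return score > threshold


def simulate_trigger(n_events=100_000, signal_fraction=0.001,
                     threshold=3.0, seed=0):
    """Estimate the two error rates of the rule on simulated events.

    Assumption (illustrative only): uninteresting events score ~ N(0, 1),
    interesting events score ~ N(4, 1).
    """
    rng = random.Random(seed)
    n_boring = n_interesting = 0
    kept_boring = dropped_interesting = 0
    for _ in range(n_events):
        is_interesting = rng.random() < signal_fraction
        score = rng.gauss(4.0 if is_interesting else 0.0, 1.0)
        retained = trigger(score, threshold)
        if is_interesting:
            n_interesting += 1
            if not retained:
                dropped_interesting += 1  # discarded something interesting
        else:
            n_boring += 1
            if retained:
                kept_boring += 1  # retained something uninteresting
    return {
        "false_retain_rate": kept_boring / n_boring if n_boring else 0.0,
        "false_discard_rate": (dropped_interesting / n_interesting
                               if n_interesting else 0.0),
    }


if __name__ == "__main__":
    print(simulate_trigger())
```

Raising the threshold in this toy setup lowers the rate of retaining uninteresting events at the cost of discarding more interesting ones, which is the trade-off a performance-oriented rule is designed to control.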
Particle physicist Matt Strassler has excellent discussions of triggering and parking on his blog “Of Particular Significance”. Here’s just one passage:
Data Parking at CMS (and the Delayed Data Stream at ATLAS) takes advantage of the fact that the computing bottleneck for dealing with all this data is not data storage, but data processing. The experiments only have enough computing power to process about 300 – 400 bunch-crossings per second. But at some point the experimenters concluded that they could afford to store more than this, as long as they had time to process it later. That would never happen if the LHC were running continuously, because all the computers needed to process the stored data from the previous year would instead be needed to process the new data from the current year. But the 2013-2014 shutdown of the LHC, for repairs and for upgrading the energy from 8 TeV toward 14 TeV, allows for the following possibility: record and store extra data in 2012, but don’t process it until 2013, when there won’t be additional data coming in. It’s like catching more fish faster than you can possibly clean and cook them — a complete waste of effort — until you realize that summer’s coming to an end, and there’s a huge freezer next door in which you can store the extra fish until winter, when you won’t be fishing and will have time to process them.
(b) Bump indication. Then there are rules for identifying bumps: excesses more than 2 or 3 standard deviations above what is expected or predicted. This may be the typical single significance test, serving more as an indicator rule. Observed signals are classified as either rejecting, or failing to reject, a null hypothesis of “mere background”; non-null indications are bumps, deemed potentially interesting. Estimates of the magnitude of any departures are reported and graphically displayed. They are not merely searching for discrepancies with the “no Higgs particle” hypothesis; they are looking for discrepancies with the simplest type, the simple Standard Model Higgs. I discussed this in my first flukes post.
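An indicator rule of this kind can be sketched as a simple counting test. The sketch below is illustrative only and is not the experiments' actual analysis: it assumes a single bin with a known expected background `b`, a Poisson model for the observed count, and a hypothetical function `classify_bump` with a 2-standard-deviation cutoff. It reports both the classification (bump or not) and the estimated size of the departure.

```python
import math
from statistics import NormalDist


def poisson_tail(n_obs, b):
    """P(N >= n_obs) when N ~ Poisson(b), the 'mere background' null."""
    if b <= 0:
        raise ValueError("expected background b must be positive")
    # Sum the Poisson probabilities for counts below n_obs, in log space.
    cdf_below = sum(
        math.exp(-b + k * math.log(b) - math.lgamma(k + 1))
        for k in range(n_obs)
    )
    return max(0.0, 1.0 - cdf_below)


def classify_bump(n_obs, b, z_threshold=2.0):
    """Classify an observed count as a bump or not (toy indicator rule)."""
    p = poisson_tail(n_obs, b)
    # Keep p strictly inside (0, 1) so the normal quantile is defined.
    p_clamped = min(max(p, 1e-16), 1.0 - 1e-16)
    # One-sided p-value converted to standard deviations.
    z = -NormalDist().inv_cdf(p_clamped)
    return {
        "excess": n_obs - b,      # estimated magnitude of the departure
        "p_value": p,
        "z": z,
        "bump": z >= z_threshold,  # reject 'mere background' or not
    }


if __name__ == "__main__":
    print(classify_bump(130, 100.0))  # a sizeable excess over background
    print(classify_bump(100, 100.0))  # a count right at expectation
```

A real search has many bins, uncertain background estimates, and specific signal models to compare against, so this one-bin rule only shows the logic of flagging an excess and reporting its magnitude.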