Statistics

“A sense of security regarding the future of statistical science…” Anon review of Error and Inference

errorinferencebookcover-e1335149598836-1Aris Spanos, my colleague and co-author (Economics),recently came across this seemingly anonymous review of our Error and Inference (2010) [E & I]. It’s interesting that the reviewer remarks that “The book gives a sense of security regarding the future of statistical science and its importance in many walks of life.” I wish I knew just what the reviewer means–but it’s appreciated regardless.

2010 American Statistical Association and the American Society for Quality

TECHNOMETRICS, AUGUST 2010, VOL. 52, NO. 3, Book Reviews, 52:3, pp. 362-370.

Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. MAYO and Aris SPANOS, New York: Cambridge University Press, 2010, ISBN 978-0-521-88008-4, xvii+419 pp., $60.00.

This edited volume contemplates the interests of both scientists and philosophers regarding gathering reliable information about the problem/question at hand in the presence of error, uncertainty, and with limited data information.

The volume makes a significant contribution in bridging the gap between scientific practice and the philosophy of science. The main contribution of this volume pertains to issues of error and inference, and showcases intriguing discussions on statistical testing and providing alternative strategy to Bayesian inference. In words, it provides cumulative information towards the philosophical and methodological issues of scientific inquiry at large.

The target audience of this volume is quite general and open to a broad readership. With some reasonable knowledge of probability theory and statistical science, one can get the maximum benefit from most of the chapters of the volume. The volume contains original and fascinating articles by eminent scholars (nine, including the editors) who range from names in statistical science to philosophy, including D. R. Cox, a name well known to statisticians.

The editors have done a superb job in presenting, organizing, and structuring the material in a logical order. The “Introduction and Background” is nicely presented and summarized, allowing for a smooth reading of the rest of the volume. There is a broad range of carefully selected topics from various related fields reflecting recent developments in these areas. The rest of the volume is divided in nine chapters/sections as follows:

1. Learning from Error, Severe Testing, and the Growth of Theoretical

Knowledge

2. The Life of Theory in the New Experimentalism

3. Revisiting Critical Rationalism

4. Theory Confirmation and Novel Evidence

5. Induction and Severe Testing

6. Theory Testing in Economics and the Error-Statistical Perspective

7. New Perspectives on (Some Old) Problems of Frequentist Statistics

8. Casual Modeling, Explanation and Severe Testing

9. Error and Legal Epistemology

In summary, this volume contains a wealth of knowledge and fascinating debates on a host of important and controversial topics equally important to the philosophy of science and scientific practice. This is a must-read—I enjoyed reading it and I am sure you will too! The book gives a sense of security regarding the future of statistical science and its importance in many walks of life. I also want to take the opportunity to suggest another seemingly related book by Harman and Kulkarni (2007). The review of this book was appeared in Technometricsin May 2008 (Ahmed 2008).

The following are chapters in E & I (2010) written by Mayo and/or Spanos, if you’re interested. If you produce a palindrome meeting the extremely simple requirements for May (by May 25 or so), you can win a free copy!

  • Spanos, A. (2010). Theory Testing in Economics and the Error-Statistical Perspective in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos eds.), Cambridge: Cambridge University Press:  202-246.
  • Spanos, A. (2010). On a New Philosophy of Frequentist Inference Exchanges with David Cox and Deborah G. Mayo in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos eds.), Cambridge: Cambridge University Press:  325-330.
  • Spanos, A. (2010). Graphical Causal Modeling ad Error Statistics Exchanges with Clark Glymour in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos eds.), Cambridge: Cambridge University Press:  364-375.
Categories: Review of Error and Inference, Statistics | Leave a comment

If it’s called the “The High Quality Research Act,” then ….

Unknown-2Among the (less technical) items sent my way over the past few days are discussions of the so-called High Quality Research Act. I’d not heard of it, but it’s apparently an outgrowth of the recent hand-wringing over junk science, flawed statistics, non-replicable studies, and fraud (discussed at times on this blog). And it’s clearly a hot topic. Let me just run this by you and invite your comments (before giving my impression). Following the Bill, below, is a list of five NSF projects about which the HQRA’s sponsor has requested further information, and then part of an article from today’s New Yorker on this “divisive new bill”: “Not Safe for Funding: The N.S.F. and the Economics of Science”.

[DISCUSSION DRAFT]

A BILL

April 18, 2013

TO [BE SUPPLIED]

Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled,

SECTION 1. SHORT TITLE.

This act may be cited as the “High Quality Research Act”.

SECTION 2. HIGH QUALITY RESEARCH.

(a) CERTIFICATION.—prior to making an award of any contract or grant funding for a scientific research project, the Director of the NSF shall publish a statement on the public website of the Foundation that certifies that the research project—

(1) is in the interests of the U.S. to advance the national health, prosperity, or welfare, and to secure the national defense by promoting the progress of science;

(2) is the finest quality, is ground breaking, and answers questions or solves problems that are of utmost importance to society at large; and

(3) is not duplicative of other research projects being funded by the Foundation or other Federal Science agencies.

(b) TRANSFER OF FUNDS.—Any unobligated funds for projects ot meeting the requirements of subjection (a) may be awarded to other scientific research projects that do meet such requirements.

(e) INITIAL IMPLEMENTATION REPORT.—Not later than 60 days after the date of enactment of this Act, the Director shall report to the Committee on Commerce, Science, and Transportation of the Senate and the Committee on Science, Space, and Technology of the House of Representatives on how the requirements set for in subsection (a) are being implemented.

(d) NATIONAL SCIENCE BOARD IMPLEMENTATION REPORT. __ Not later than 1 year after the date of enactment of this act, the national science board shall report to the committee on commerce, science, and transportation of the senate and the committee on science, space and technology of the house of representatives its findings and recommendations on how the requirements of subsection (a) are being implemented.

etc. etc.

Link to the Bill

Rep. Lamar Smith,author of the Bill, listed five NSF projects about which he has requested further information. 

1. Award Abstract #1247824: “Picturing Animals in National Geographic, 1888-2008,” March 15, 2013, ($227,437); 

2. Award Abstract #1230911: “Comparative Histories of Scientific Conservation: Nature, Science, and Society in Patagonian and Amazonian South America,” September 1, 2012 ($195,761);

3. Award Abstract #1230365: “The International Criminal Court and the Pursuit of Justice,” August 15, 2012 ($260,001);

4. Award Abstract #1226483, “Comparative Network Analysis: Mapping Global Social Interactions,” August 15, 2012, ($435,000); and

5. Award Abstract #1157551: “Regulating Accountability and Transparency in China’s Dairy Industry,” June 1, 2012 ($152,464).

________________________

MAY 9, 2013

NOT SAFE FOR FUNDING: THE N.S.F. AND THE ECONOMICS OF SCIENCE

POSTED BY DYLAN WALSH (The New Yorker)

Last month, Representative Lamar Smith, chairman of the House Committee on Science, Space, and Technology, introduced a divisive new bill, the High Quality Research Act, that would change the criteria by which the National Science Foundation evaluates research projects and awards funding. (The N.S.F., with a budget of seven billion dollars, funds roughly twenty per cent of federally supported basic research in American universities.) Currently, proposals are evaluated through a traditional peer-review process, in which scientists and experts with knowledge of the relevant fields evaluate the projects’ “intellectual merits” and “broader impacts.” Peer review is a central tenet of modern academic science, and, according to critics, the new bill threatens to supersede it with politics.

John Holdren, the director of the White House Office of Science and Technology Policy, said last week that “adding Congress as reviewers is a mistake.” Representative Eddie Bernice Johnson warned more forcefully that Representative Smith was “sending a chilling message to the entire scientific community that peer review may always be trumped by political review.” But in a statement, Representative Smith said the draft bill “improves on [the peer-review process] by adding a layer of accountability.” The bill’s new three-point criteria for funding require that a project be “in the interests of the United States to advance the national health, prosperity, or welfare, and to secure the national defense”; solve “problems that are of the utmost importance to society at large”; and not be “duplicative of other research projects being funded by the Foundation or other Federal science agencies.”

Implicit in the proposal’s language is a desire for oversight built into the process of determining which areas of study are significant. (Representative Smith cited five N.S.F.-funded social science projects, with concerns as to whether they “adhere to NSF’s ‘intellectual merit’ guideline”(above). To Smith’s point, despite sizable public investment in research and development (nearly $150 billion this year), relatively scant attention is devoted to investigating whether the process of science, in its current form, is well-designed for generating knowledge.

As it turns out, the N.S.F. recently awarded a grant to Kevin Zollman, an assistant professor of philosophy at Carnegie Mellon University, to investigate what drives researchers to pursue particular projects, and how they secure funding for them. In an ideal world, all science would proceed from the simple and lofty social goal of expanding human knowledge, but researchers are of course subject to the constraints of economic reality. To disentangle these knotted incentives, Zollman is applying a branch of game theory known as “mechanism design,” which uses simple models to understand how individual players within a system balance competing motivations to arrive at different ends.

Zollman’s first object of study is the “priority rule,” which states that all of the credit earned by scientific discovery—the money, the praise, the promotion and adulation—is directed to the first researcher across the finish line, whether by weeks, days, or hours. The priority rule was first elaborated in the nineteen-fifties by the sociologist Robert K. Merton, who observed that “In the institution of science, originality is at a premium.” He described its role over centuries of scientific progress, with fierce and stubborn quarrels punctuating the time line of discovery: Newton and Leibniz fought over the invention of calculus, Galileo and Simon Mayr over the discovery of Jupiter’s moons.

As researchers scramble to carve out intellectual real estate, however, overly aggressive competition and singular focus on originality can elicit a host of negative behaviors: bias toward reporting positive or rushed results, withholding or fabricating data, and counterproductive levels of output. Of fifty million scholarly articles published since 1665, more than half of them appeared in the last twenty-five years. While a number of factors contribute to this glut, one of them is the pressure to publish results even when they’re not immediately relevant. “Much of the recent scientific literature is repetitive, unimportant, poorly conceived or executed, and oversold; perhaps deservingly, much of it is ignored,” wrote the microbiologist Ferric Fang in a 2012 editorial. The problem is becoming particularly acute as a growing pool of scientists face a shrinking pot of money, both from drying stimulus funds scheduled to disappear this September and 2.9-percent budget cuts forced on the N.S.F. through sequestration. The currently overcharged pull of priority, by spurring unnecessary questions and hairs to be split, can thus dilute or stifle the advancement of science.

In light of these problems, Representative Smith proposed his bill. Yet the economic straits in science already encourage research that targets specific, funder-driven priorities over riskier, more open-ended questions. As the Nobel laureate Roger Kornberg lamented in 2007, “If the work that you propose to do isn’t virtually certain of success, then it won’t be funded.” Elevated political scrutiny would likely only reduce the willingness of agencies like the N.S.F. to fund projects without clearly defined, or even expected, outcomes. In tension with this reality is the fact that revolution in science is often indebted to prolonged exploration of basic research projects, which may at their outset fail to meet Representative Smith’s criteria. Forecasting how well research will “advance the national health, prosperity, or welfare,” or predetermining the degree to which a project is or is not “groundbreaking,” demands an unlikely prescience. Further embedding specific requirements in grant allocation could make improvements at the margins by defunding egregiously conspicuous research. But it also threatens to close off a large landscape of research questions with unforeseen potential.

The article is here.

Categories: junk science, science communication, Statistics | 14 Comments

What should philosophers of science do? (Higgs, statistics, Marilyn)

Marilyn Monroe not walking past a Higgs boson and not making it decay, whatever philosophers might say.

Marilyn Monroe not walking past a Higgs boson and not making it decay, whatever philosophers might say.

My colleague, Lydia Patton, sent me this interesting article, “The Philosophy of the Higgs,” (from The Guardian, March 24, 2013) when I began the posts on “statistical flukes” in relation to the Higgs experiments (here and here); I held off posting it partly because of the slightly sexist attention-getter pic  of Marilyn (in reference to an “irrelevant blonde”[1]), and I was going to replace it, but with what?  All the men I regard as good-looking have dark hair (or no hair). But I wanted to take up something in the article around now, so here it is, a bit dimmed. Anyway apparently MM was not the idea of the author, particle physicist Michael Krämer, but rather a group of philosophers at a meeting discussing philosophy of science and science. In the article, Krämer tells us:

For quite some time now, I have collaborated on an interdisciplinary project which explores various philosophical, historical and sociological aspects of particle physics at the Large Hadron Collider (LHC). For me it has always been evident that science profits from a critical assessment of its methods. “What is knowledge?”, and “How is it acquired?” are philosophical questions that matter for science. The relationship between experiment and theory (what impact does theoretical prejudice have on empirical findings?) or the role of models (how can we assess the uncertainty of a simplified representation of reality?) are scientific issues, but also issues from the foundation of philosophy of science. In that sense they are equally important for both fields, and philosophy may add a wider and critical perspective to the scientific discussion. And while not every particle physicist may be concerned with the ontological question of whether particles or fields are the more fundamental objects, our research practice is shaped by philosophical concepts. We do, for example, demand that a physical theory can be tested experimentally and thereby falsified, a criterion that has been emphasized by the philosopher Karl Popper already in 1934. The Higgs mechanism can be falsified, because it predicts how Higgs particles are produced and how they can be detected at the Large Hadron Collider.

On the other hand, some philosophers tell us that falsification is strictly speaking not possible: What if a Higgs property does not agree with the standard theory of particle physics? How do we know it is not influenced by some unknown and thus unaccounted factor, like a mysterious blonde walking past the LHC experiments and triggering the Higgs to decay? (This was an actual argument given in the meeting!) Many interesting aspects of falsification have been discussed in the philosophical literature. “Mysterious blonde”-type arguments, however, are philosophical quibbles and irrelevant for scientific practice, and they may contribute to the fact that scientists do not listen to philosophers.

I entirely agree that philosophers have wasted a good deal of energy maintaining that it is impossible to solve Duhemian problems of where to lay the blame for anomalies. They misrepresent the very problem by supposing there is a need to string together a tremendously long conjunction consisting of a hypothesis H and a bunch of auxiliaries Ai which are presumed to entail observation e. But neither scientists nor ordinary people would go about things in this manner. The mere ability to distinguish the effects of different sources suffices to pinpoint blame for an anomaly. For some posts on falsification, see here and here*.

The question of why scientists do not listen to philosophers was also a central theme of the recent inaugural conference of the German Society for Philosophy of Science. I attended the conference to present some of the results of our interdisciplinary research group on the philosophy of the Higgs. I found the meeting very exciting and enjoyable, but was also surprised by the amount of critical self-reflection.

In the opening talk Peter Godfrey-Smith from the City University of New York emphasized three roles for philosophy: an integrative role, whereby philosophy can assess and connect various fields with an emphasis on generic categories and perspectives; an incubator role, where philosophy develops new ideas in a broad and speculative form, which are then pursued in a more focussed and specific way within an individual science; and an educative role, where philosophy teaches various general skills, including critical and abstract thinking. The problem I see with the integrative and incubator roles of philosophy is the high degree of specialization in modern science. It is very hard for a philosopher to keep up with scientific progress, and how could one integrate various fields without having fully appreciated the essential features of the individual sciences? As Margaret Morrison from the University of Toronto pointed out in her talk, if philosophy steps back too far from the individual sciences, the account becomes too general and isolated from scientific practice. On the other hand, if philosophy is too close to an individual science, it may not be philosophy any longer.

I think philosophy of science should not consider itself primarily as a service to science, but rather identify and answer questions within its own domain. I certainly would not be concerned if my own research went unnoticed by biologists, chemists, or philosophers, as long as it advances particle physics. On the other hand, as Morrison pointed out, science does generate its own philosophical problems, and philosophy may provide some kind of broader perspective for understanding those problems.

So then, should we physicists listen to philosophers?

An emphatic “No!”, if philosophers want to impose their preconceptions of how science should be done. I do not subscribe to Feyerabend’s provocative claim that “anything goes” in science, but I believe that many things go, and certainly many things should be tried.

But then, “Yes!”, we should listen, as philosophy can provide a critical assessment of our methods, in particular if we consider physics to be more than predicting numbers and collecting data, but rather an attempt to understand and explain the world. And even if philosophy might be of no direct help to science, it may be of help to scientists through its educational role, and sharpen our awareness of conceptional problems in our research**.

What I want to talk about are the roles of philosophers of science. While I do not disagree with the roles Godfrey-Smith allots philosophers of science, to incubate, integrate, and educate (about things like logic and critical thinking), and his list would not preclude what I have in mind, I would press to go much further. To focus just on one of my own areas of interest, there is enormous unclarity in discussions by statistical practitioners regarding such philosophical notions as objectivity, truth, falsifiability, evidence, inductive inference, and the roles of probability in modeling and inference. It is as if a certain trepidation and groupthink take over when it comes to philosophically tinged notions, and philosophers are rarely consulted to lend insight.  When they are, I’m afraid, they do not escape the criticism Stephen Weinberg raises in the linked Godfrey-Smith article (i.e., being wedded to a position that grows out of “theory-laden” philosophy, where the theories are philosophical.) Fresh methodological problems arise in practice, but philosophers of science are not consulted. Nor is it surprising. Peter Achinstein[2] has often said that scientists do not and should not consult philosophical accounts about evidence,because while scientists evaluate evidence empirically, philosophical accounts are merely based on a priori computations. Sad, if still true.

By and large, philosophers of science have reneged on the promise of the 80s to be relevant to science. In some areas, in particular the one I know best, philosophers of science have gone backwards. Philosophers of statistics were ahead of their time in the 70s and early 80s, engaging in discussions side by side with statistical practitioners (Godambe and Sprott 1971, Harper and Hooker, 1977 come to mind.) Contributions to the field were as likely to be by a philosopher as by a statistician. I talk about this much more elsewhere (e.g., the introduction to Mayo and Spanos, Error and Inference (CUP 2010), so I’m being quick here. Soon after I got my Ph.D, things seemed to dissipate…

Nowadays, while the foundations of statistics are being considered anew by many statisticians, philosophers of statistics are almost nowhere to be found.  Arguments given for some very popular slogans (mostly by non-philosophers), are too readily taken on faith as canon by others, and are repeated as gospel. Examples are easily found: all models are false, no models are falsifiable, everything is subjective, or equally subjective and objective, and the only properly epistemological use of probability is to supply posterior probabilities for quantifying actual or rational degrees of belief. Then there is the cluster of “howlers” allegedly committed by frequentist error statistical methods repeated verbatim (discussed on this blog). Margaret Morrison is right that many ask: is truly relevant philosophy really philosophy? I and a few others[3] think the answer is Yes! I have organized conferences[4] and published papers that address these issues, and it is the focus of a current book, nearing completion.

Even in the Higgs example, recall the controversy about whether particle physicists were misinterpreting their p-values; the letter-writing campaign by subjective Bayesians, etc. [5]. There is a valid question as to whether it is the philosopher of X’s responsibility to solve philosophical problems in domain X; and the answer will surely depend on the field. But in statistical science—itself sometimes regarded as “applied philosophy of science,” –I say the answer is, emphatically, yes! Their failure to do so has left them out of one of the most interesting periods in the areas of statistical science as well as machine learning.

*For a unit on Popper that includes Duhem’s problem and falsification, see http://errorstatistics.com/2012/02/01/no-pain-philosophy-skepticism-rationality-popper-and-all-that-part-2-duhems-problem-methodological-falsification/

*Michael Krämer is a theoretical particle physicist at the RWTH Aachen University and likes philosophy. Follow him on Twitter at @mikraemer


[1] The article’s subtitle is: “Particle physicist Michael Krämer hangs out with philosophers and learns that one should be wary of irrelevant blondes” (whatever that means).

[2] See, for example, p. 2 of the intro to E&I.

[3] Clark Glymour, Jim Woodward come to mind. They, as well as Godfrey-Smith, will be at our O&M conference at Virginia Tech next weekend!

[4] E.g.,Statistical Science and Philosophy of Science: Where Do/Should They Meet? For selected contributions and related papers see http://www.rmm-journal.de/htdocs/st01.html. Several of these papers have been discussed in “U-Phils” on this blog. Search for the author or title.

[5] O’Hagan: digest of  responses, discussed on this blog here and here.

  • Achinstein, P. (2001), The Book of Evidence, Oxford: Oxford University Press.
  • Cox and Mayo 2010.
  • Godambe, V.  and Sprott, D., (eds), (1971). Foundations of Statistical Inference, Holt, Rinehart and Winston of Canada, Toronto, 1971.
  • Harper, W. L. and Hooker C. A. (eds.) (1976): Foundations of Probability Theory, Statistical Inference and Statistical Theories of Science. Vol. 2, Dordrecht, The Netherlands: D. Reidel
  • Mayo, D. G. and Spanos, A. (2010). “Introduction and Background: Part I: Central Goals, Themes, and Questions; Part II The Error-Statistical Philosophy” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-14, 15-27.
Categories: Higgs, Statistics, StatSci meets PhilSci | 88 Comments

Getting Credit (or blame) for Something You Didn’t Do (BP oil spill, comedy hour)

UnknownThree years ago, many of us were glued to the “spill cam” showing, in real time, the gushing oil from the April 20, 2010 explosion sinking the Deepwater Horizon oil rig in the Gulf of Mexico, killing 11, and spewing oil until July 15. Trials have been taking place this month, as people try to meet the 3 year deadline to sue BP and others. But what happened to the 200 million gallons of oil?  (Is anyone up to date on this?)  Has it vanished or just sunk to the bottom of the sea by dispersants which may have caused hidden destruction of sea life? I don’t know, but given it’s Saturday night around the 3 year anniversary, let’s listen into a reblog of a spill-related variation on the second of two original “overheard at the comedy hour” jokes. 

In effect, it accuses the frequentist error-statistical account of licensing the following (make-believe) argument after the 2010 oil spill:

Oil Exec: We had highly reliable evidence that H: the pressure was at normal levels on April 20, 2010!

Senator: But you conceded that whenever your measuring tool showed dangerous or ambiguous readings, you continually lowered the pressure, and that the stringent “cement bond log” test was entirely skipped.

 Oil Exec:  Granted, we omitted reliable checks on April 20, 2010, but usually we do a better job—I am reporting the average!  You see, we use a randomizer that most of the time directs us to run the gold-standard check on pressure. But, but April  20 just happened to be one of those times we did the nonstringent test; but on average we do ok.

Senator:  But you don’t know that your system would have passed the more stringent test you didn’t perform!

Oil Exec:  That’s the beauty of the the frequentist test!

Even if we grant (for the sake of the joke) that overall, this “test” rarely errs in the report it outputs (pass or fail),  that is irrelevant to appraising the inference from the data on April 20, 2010 (which would have differed had the more stringent test been run). That interpretation violates the severity criterion:  the observed passing result was altogether common if generated from a source where the pressure level was unacceptably high, Therefore it misinterprets the actual data. The question is why anyone would saddle the frequentist with such shenanigans on averages?  … Lest anyone think I am inventing a criticism, here is a familiar statistical instantiation, where the choice for each experiment is given to be .5 (Cox 1958).

Two Measuring Instruments with Different Precisions:

 A single observation X is to be made on a normally distributed random variable with unknown mean m, but the measurement instrument is chosen by a coin flip: with heads we use instrument E’ with a known small variance, say 10-4, while with tails, we use E”, with a known large variance, say 104. The full data indicates whether E’ or E” was performed, and the particular value observed, which we can write as x’ and x”, respectively. (This example comes up in, ton o’bricks).

In applying our test T+ (see November 2011 blog post ) to a null hypothesis, say, µ = 0, the “same” value of X would correspond to a much smaller p-value were it to have come from E’ than if it had come from E”.  Denote the two p-values as p’ and p”, respectively.  However, or so the criticism proceeds, the error statistician would report the average p-value:  .5(p’ + p”).

But this would give a misleading assessment of the precision and corresponding severity with either measurement! Instead you should report the p-value of the result in the experiment actually run (this is Cox’s Weak Conditionality Principle, WCP).

But what could lead the critic to suppose the error statistician must average over experiments not even performed?  Rule #2 for legitimate criticism is to give the position being criticized the most generous construal one can think of.  Perhaps the critic supposes what is actually a distortion of even the most radical behavioristic construal:

  •   If you consider outcomes that could have occurred in hypothetical repetitions of this experiment, you must also consider other experiments you did not run (but could have been run) in reasoning from the data observed (from the test you actually ran), and report some kind of frequentist average!

The severity requirement makes explicit that such a construal is to be rejected—I would have thought it obvious, and not in need of identifying a special principle. Since it wasn’t, I articulated this special notion for interpreting tests and the corresponding severity criterion.

Let me now give a special (the first!) honorary mention to Christian Robert [2] on this point, as raised in Cox and Mayo (2010).  He writes p. 9 http://arxiv.org/abs/1111.5827:

A compelling section is the one about the weak conditionality principle (pp.294- 298), as it objects to the usual statement that a frequency approach breaks this principle. In a mixture experiment about the same parameter θ, inferences made conditional on the experiment “are appropriately drawn in terms of the sampling behaviour in the experiment known to have been performed” (p. 296). This seems hardly objectionable, as stated. And I must confess the sin of stating the opposite as The Bayesian Choice has this remark (Robert (2007), Example 1.3.7, p.18) that the classical confidence interval averages over the experiments. The term experiment validates the above conditioning in that several experiments could be used to measure θ, each with a different p-value. I will not argue with this.

He would want me to mention that he does raise some caveats:

I could, however, [argue] about ‘conditioning is warranted to achieve objective frequentist goals’ (p. 298) in that the choice of the conditioning, among other things, weakens the objectivity of the analysis. In a sense the above pirouette out of the conditioning principle paradox suffers from the same weakness, namely that when two distributions characterise the same data (the mixture and the conditional distributions), there is a choice to be made between “good” and “bad”.

But there is nothing arbitrary about regarding as “good” the only experiment actually run and from which the actual data arose.  The severity criterion only makes explicit what is/should be already obvious. Objectivity, for us, is directed by the goal of making correct and warranted inferences, not freedom from thinking. After all, any time an experiment E is performed, the critic could insist that the decision to perform E is the result of some chance circumstances and with some probability we might have felt differently that day and have run some other test, perhaps a highly imprecise test or a much more precise test or anything in between, and demand that we report whatever average properties they come up with.  The error statistician can only shake her head in wonder that this gambit is at the heart of criticisms of frequentist tests.

Still, we exiled ones can’t be too fussy, and Robert still gets the mention for conceding that we have  a solid leg on which to pirouette.


[1] You can search the blog for connections between this event, the June 2010 conference at the LSE (especially the RMM volume), my introduction to deepwater drilling, and the blog’s “mascot” stock, Diamond offshore, DO, which, incidentally, just had earnings.

[2] There have been around 4-5 others since then, not sure.


Categories: Bayesian/frequentist, Comedy, Statistics | 2 Comments

Blog Contents 2013 (March)

metablog old fashion typewriterError Statistics Philosophy Blog: March 2013* (Frequentists in Exile-the blog)**:

(3/1) capitalizing on chance
(3/4) Big Data or Pig Data?
(3/7) Stephen Senn: Casting Stones
(3/10) Blog Contents 2013 (Jan & Feb)
(3/11) S. Stanley Young: Scientific Integrity and Transparency
(3/13) Risk-Based Security: Knives and Axes
(3/15) Normal Deviate: Double Misunderstandings About p-values
(3/17) Update on Higgs data analysis: statistical flukes (1)
(3/21) Telling the public why the Higgs particle matters
(3/23) Is NASA suspending public education and outreach?
(3/27) Higgs analysis and statistical flukes (part 2)
(3/31) possible progress on the comedy hour circuit?

*March was incredibly busy here; I’m saving up several partially-baked posts on draft. Also, while I love this old typewriter, I’ve had to have special keys made for common statistical symbols, and that has delayed me some. I hope people will scan the previous contents starting from the beginning (e.g., with “prionvac“): it’s philosophy, remember, and philosophy has to be reread many times over.  January and February 2013 contents are here.

**compiled by Jean Miller and Nicole Jinn.

Categories: Metablog, Statistics | Leave a comment

Stephen Senn: When relevance is irrelevant

Stephen Senn(guest post) When Relevance is Irrelevant, by Stephen Senn

Head of Competence Center for Methodology and Statistics (CCMS)

Applied statisticians tend to perform analyses on additive scales and additivity is an important aspect of an analysis to try to check. Consider survival analysis. The most important model used, the default in many cases, is the proportional hazards model introduced by David Cox in 1972[1] and sometimes referred to as Cox regression. In fact, from one point of view, analysis takes place on the log-hazard scale and so the model could equally be referred to by the rather clumsier title additive log-hazards model and there is quite a literature on how the proportionality (or equivalently, additivity) assumption can be checked.

Words have a definite power on the mind and you sometimes encounter the nonsensical claim that if the proportionality assumption does not apply you should consider a log-rank test instead. In fact, when testing the null hypothesis that two treatments are identical, neither the log-rank test nor the score test using the proportional hazards model require the assumption of proportionality: the assumption is trivially satisfied by the fact of two treatments being identical. Furthermore the log-rank test is just a special case of proportional hazards: the score test for a proportional hazards model without any covariates is the log-rank test. Finally, it is easy to produce examples where proportional hazards would apply in a model with covariates but not in the model without covariates but very difficult to produce the converse.

An objection often made regarding such models is that they are very difficult for physicians to understand. My reply is to ask what is preferable: a difficult truth or an easy lie? Ah yes, it is sometimes countered, but surely I agree on the importance of clinical relevance. It is surely far more useful to express the results of a proportional hazards analysis in clinically relevant terms that can be understood, such as difference in median length of survival or the difference in the event rate up to a particular census point (say one year after treatment).

A disturbing paper by Snapinn and Jiang[2] points to a problem, however, and to explain it I can do no better that cite the abstract:

The standard analysis of a time-to-event variable often involves the calculation of a hazard ratio based on a survival model such as Cox regression; however, many people consider such relative measures of effect to be poor expressions of clinical meaningfulness. Two absolute measures of effect are often used to assess clinical meaningfulness: (1) many disease areas frequently use the absolute difference in event rates (or its inverse, the number-needed-to-treat) and (2) oncology frequently uses the difference between the median survival times in the two groups. While both of these measures appear reasonable, they directly contradict each other. This paper describes the basic mathematics leading to the two measures and shows examples. The contradiction described here raises questions about the concept of clinical meaningfulness. (p2341)

To see the problem, consider the following. The more serious the disease, the less a given difference in the rate at which people die will impact on the time survived and hence on differences in median survival. However, generally, the higher the baseline mortality rate the greater the difference in survival at a given time point that will be conveyed by a given treatment benefit.

If you find this less than clear, you have my sympathy. The only solution I can offer is to suggest that you read the paper by Snappin and Jiang[2]. However, in that case also consider the following point. If the point is so subtle, how many physicians who cannot understand proportional hazards can understand numbers needed to treat or differences in median survival? My opinion is that they can be counted on the fingers of one foot.

Let me explain the point at issue by analogy. If one were to study road traffic accidents one would find that among the very many factors affecting seriousness of the consequences of an accident would be the relative velocity at impact. However, if one looks at Newton’s laws of motion one finds that the second law speaks of the relationship between force, mass, and acceleration but not velocity. Now it is clear that a) acceleration being a concept that is defined (or at least understood) in terms of velocity (it is, indeed, derivative of velocity) it is a more complicated concept than velocity b) all cars have speedometers that show velocity but not acceleration c) the traffic laws are couched in terms of velocity rather than acceleration and d) it seems that from the point of view of human health it is velocity that is important.

None of this remotely constitutes an argument for replacing Newton’s second law. On the contrary what it implies is that you might need to work a little to use Newton’s laws to translate the effect of relative velocity into accident survivability. However, any attempt to simplify will run the danger of being an oversimplification.

This point was very well understood by a somewhat neglected scientist, the centenary of whose death falls this year:  James Berry( 1852-1913) an English executioner or hangman but one who recognised the value of physics. Ronald Meek, the Marxist economist whose work I was expected to study when a student of Economics and Statistics in the 1970s, devotes a chapter of his entertaining book, Figuring out Society[3], to Mr Berry. Berry decided that the length of the drop in an execution required scientific study: too short and death was not instantaneous, too long and decapitation ensued. He soon realised that a simple law of hanging, linear in the height of the drop, was wrong and instead came upon the idea of a ‘striking force’. This enabled him to hang criminals along a curve. (I understand that in American universities professors also sometimes execute judgement along a curve.)   The striking force required was adjusted according to the weight and neck musculature of the condemned and the height was then determined from the curve.

Many of the current proponents of evidence based medicine could learn from Berry’s example. NNTs derived from clinical trials are misleading indicators as to what will happen in clinical practice. For that to be the case would require that the patients in the clinical trial we run were a representative sample of the population of patients. They are not, and if they were the fact that we set such store on concurrent control would be inexplicable. To translate the results of clinical trials into practice may require a lot of work involving modelling and further background information. ‘Additive at the point of analysis but relevant at the point of application’ should be the motto.  Sometimes short cuts lead to long delays.

References

1.              Cox DR. Regression models and life-tables (with discussion). Journal of the Royal Statistical Society Series B 1972; 34: 187-220.

2.              Snapinn S, Jiang Q. On the clinical meaningfulness of a treatment’s effect on a time-to-event variable. Statistics in Medicine 2011; 30: 2341-2348.

3.              Meek RL. Figuring out society. Fontana, 1971.

Categories: Statistics | 9 Comments

Does statistics have an ontology? Does it need one? (draft 2)

questionmark pinkChance, rational beliefs, decision, uncertainty, probability, error probabilities, truth, random sampling, resampling, opinion, expectations. These are some of the concepts we bandy about by giving various interpretations to mathematical statistics, to statistical theory, and to probabilistic models. But are they real? The question of “ontology” asks about such things, and given the “Ontology and Methodology” conference here at Virginia Tech (May 4, 5), I’d like to get your thoughts (for possible inclusion in a Mayo-Spanos presentation).*  Also, please consider attending**.

Interestingly, I noticed the posts that have garnered the most comments have touched on philosophical questions of the nature of entities and processes behind statistical idealizations (e.g.,http://errorstatistics.com/2012/10/18/query/).copy-cropped-ampersand-logo-blog1

1. When an interpretation is supplied for a formal statistical account, its theorems may well turn out to express approximately true claims, and the interpretation may be deemed useful, but this does not mean the concepts give correct descriptions of reality. The interpreted axioms, and inference principles, are chosen to reflect a given philosophy, or set of intended aims: roughly, to use probabilistic ideas (i) to control error probabilities of methods (Neyman-Pearson, Fisher), or (ii) to assign and update degrees of belief, actual or rational (Bayesian).  But this does not mean its adherents have to take seriously the realism of all the concepts generated. In fact ,we often (on this blog) see supporters of various stripes of frequentist and Bayesian accounts running far away from taking their accounts literally, even as those interpretations are, or at least were, the basis and motivation for the development of the formal edifice (“we never meant this literally”).  But are these caveats on the same order? Or do some threaten the entire edifice of the account?

Starting with the error statistical account, recall Egon Pearson in his “Statistical Concepts in Their Relation to Reality” making it clear to Fisher that the business of controlling erroneous actions in the long run, acceptance sampling in industry and 5-year plans, only arose with Wald, and were never really part of the original Neyman-Pearson tests (declaring that the behaviorist philosophy was Neyman’s, not his).  The paper itself may be found here. I was interested to hear (Mayo 2005)  Neyman’s arch opponent, Bruno de Finetti, remark (quite correctly) that the expression “inductive behavior…that was for Neyman simply a slogan underlining and explaining the difference between his, the Bayesian and the Fisherian formulations” became with Abraham Wald’s work, “something much more substantial” (de Finetti 1972, 176).

Granted, it has not been obvious to people just how to interpret N-P tests “evidentially “ or “inferentially”—the subject of my work over many years. But there always seemed to me to be enough hints and examples to see what was intended: A statistical hypothesis H assigns probabilities to possible outcomes, and the warrant for accepting H as adequate—for an error statistician– is in terms of how well corroborated H is: how well H has stood up to tests that would have detected flaws in H, at least with very high probability. So the grounds for holding or using H are error statistical. The control and assessment of error probabilities may be used inferentially to determine the capabilities of methods to detect the adequacy/inadequacy of models, and express the extent of the discrepancies that have been identified. We also employ these ideas to detect gambits that make it too easy to find evidence for claims, even if the claims have been subjected to weak tests and biased procedures. A recent post is here.

The account has never professed to supply a unified logic, or any kind of logic for inference. The idea that there was a single rational way to make inferences was ridiculed by Neyman (whose birthday is April 16).

2. Proposed (“we never meant this literally”) withdrawals  from the Bayesian interpretations do not seem so innocuous. Perhaps some will say this just shows my bias. Let me grant that the popular idea of interpreting prior probability distributions as non-subjective, in some sense or other, is not so radical (though I’d still want to know how to interpret posteriors and why). But what we usually see now is some blurring of the two: touting the advantage of Bayesian methods because they incorporate background beliefs, while also advertising “conventional” (default, reference, or “objective”) priors as having minimal influence on inference. [1] See “Grace and amen Bayesianism within this deconstruction. Also relevant: Irony and Bad Faith: Deconstructing Bayesians.

Perhaps the most popular view nowadays regards the prior as some kind of uninterpreted mathematical construct, merely serving to get a posterior. These same Bayesians, some of them, advocate “testing” the prior, but this is hard to grasp if we do not know what the priors intend to be, or stand for.  Then there are those Bayesians, perhaps they are a radical (but influential) subgroup, who deny the machine of updating by Bayes theorem altogether.  In Gelman (2011) (our special topic of RMM):

“Our key departure from the mainstream Bayesian view (as expressed, for example, [in Wikipedia]) is that we do not attempt to assign posterior probabilities to models or to select or average over them using posterior probabilities. Instead, we use predictive checks to compare models to data and use the information thus learned about anomalies to motivate model improvements.” (p. 71).

In Gelman and Robert (2013), we hear that a major source of Bayesian criticism comes from assuming “that Bayesians actually seem to believe their assumptions rather than merely treating them as counters in a mathematical game.” (p. 3) This comes as a surprise to those of us who thought the Bayesians really meant it. So what is the game being played?

[W]e make strong assumptions and use subjective knowledge in order to make inferences and predictions that can be tested by comparing to observed and new data (see Gelman and Shalizi, 2012, or Mayo, 1996 for a similar attitude coming from a non-Bayesian direction). (p. 3)

So maybe some kind of a “non-Bayesian checking of Bayesian models” would offer more a more promising foundation, at least for Gelman’s brand of “Bayesian falsificationism” (Gelman 2011). See my 2013 Comments on Gelman and Shalizi [2]. On the face of it, any inference, whether to the adequacy of a model (for a given purpose), or to a posterior probability, can be said to be warranted just to the extent that the inference has withstood severe testing: one with a high probability of having found flaws were they present.  The ontology matters less than the epistemology.

Thus, the severity idea, could conceivably illuminate what’s going on with Gelman’s model checking; I find the idea promising, but do not really know what he thinks.

But to pursue such an avenue still requires reckoning with a fundamental issue at the foundations of Bayesian method: the interpretation of and justification for the prior probability distribution. Error statisticians use idealizations, but they are tightly constrained by the need for error probabilities, in a statistical model, to approximate the actual ones, even if only hypothetical, or checked by simulation. We are modeling real processes, not knowledge of processes.

Gelman and Robert (2013) allow:

“that many Bayesians over the years have muddied the waters by describing parameters as random rather than fixed. Once again, for Bayesians as much as for any other statistician, parameters are (typically) fixed but unknown. It is the knowledge about these unknowns that Bayesians model as random” (p. 4).

Bayesians will …assign a probability distribution to a parameter that one could not possibly imagine to have been generated by a random process, parameters such as the coefficient of party identification in a regression on vote choice, or the overdispersion in a network model, or Hubble’s constant in cosmology. There is no inconsistency in this opposition once one realizes that priors are not reflections of a hidden “truth” but rather evaluations of the modeler’s uncertainty about the parameter. (p. 3)

The choice, of course, is not between modeling a “hidden ‘truth’” and modeling “the modeler’s uncertainty”. Actually, in the majority of the examples I have seen, it seems better to imagine the parameter being generated by a random process.  On the other hand, “the modeler’s uncertainty about the parameter” is one of the most unclear parts of Bayesian modeling. It is not that we can’t see measuring the degree of evidence, corroboration, severity of test, or the like, that is accorded a claim about a fixed parameter.  We can and do. It is just that those measures will not be well represented as posterior or prior probabilities, obeying the probability calculus.

Possibly an idea I once proposed–a variation on a view held by the frequentist Reichenbach– can work (in EGEK, ch. 4 1996). Reichenbach suggested that scientists might eventually be able to assess the relative frequency with which a given type of hypothesis or theory is true. This might provide it a frequentist probability assignment. I don’t see how one could get such a relative frequency (or rather I can see many different reference sets that could be used), nor why knowing such quantities would be useful in appraising the evidence for a given hypothesis H. My variation (Chapter 4 Duhem, Kuhn, and Bayes, pp 120-4) is to consider the relative frequency with which evidence of a certain strength, (e.g., passing k tests with increasingly impressive error probabilities)  is generated, despite H being false. This is attainable. But that of course take us to an error probabilistic assessment!

Maybe this style of Bayesianism doesn’t need a clear ontology so long as it’s got a clear epistemology. But does it?***

What do readers think?

*To see the full list of speakers: “Ontology and Methodology” conference. Actually our presentation will likely take a different tack, but I still want to hear your thoughts.

**Registration is free, but required, by April 20-25.

***I should say right off (for those who do not know) that my work is not in metaphysics, but on philosophical problems about inductive-statisical inference , experiment and evidence.My colleague (and co-conference organizer) Ben Jantzen is the “ontology” guy, and the third colleague involved, Lydia Patton, does O & M as well as HPS.

For further references, see those within posts and papers linked here, or search this blog.

De Finetti, B. (1972), Probability, Induction, and Statistics: The Art of Guessing. NY, Wiley.

Gelman, A. (2011). Induction and deduction in Bayesian data analysisRationality, Markets and Morals (RMM) 2, 67–78.

Gelman, A.and C. Shalizi. (Article first published online: 24 Feb 2012). “Philosophy and the Practice of Bayesian statistics (with discussion)”.British Journal of Mathematical and Statistical Psychology (BJMSP).

Gelman, A, and Robert, C. (2013). Not only defended but also applied: The perceived absurdity of Bayesian inference.

http://www.stat.columbia.edu/~gelman/research/published/feller8.pdf

Kass and Wasserman, L. (1996). The Selection of Prior Distributions by Formal Rules. Journal of the American Statistical Association 91, 1343-1370.

Mayo, D. G. (1996).[EGEK] Error and the growth of experimental knowledge. Chicago: University of Chicago Press.

_____ (2005). Evidence as passing severe tests: Highly probable vs. highly probed hypotheses. In P. Achinstein (Ed.), Scientific Evidence (pp. 95-127). Baltimore: Johns Hopkins University Press.

_____ (2011). Statistical science and philosophy of science: where do/should they meet in 2011 (and beyond)?Rationality, Markets and Morals (RMM) 2, Special Topic: Statistical Science and Philosophy of Science, 79–102.

_____ (2013). Comments on A. Gelman and C. Shalizi: Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, forthcoming.

Mayo, D. and Cox, D. (2010). Frequentist statistics as a theory of inductive inference. In D. Mayo and A. Spanos (Eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (pp. 247-275). Cambridge: Cambridge University Press. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, 247-275.

Mayo, D. and Spanos, A. (2011). Error statistics. In P. Bandyopadhyay and M. Forster (Volume Eds.); D. M.Gabbay, P. Thagard and J. Woods (General Eds.). Philosophy of statistics: Handbook of philosophy of science Vol 7 (pp. 1-46). The Netherlands: Elsevier.

Pearson, E. S. (1955). Statistical concepts in their relation to reality.  Journal of the Royal Statistical SocietyB 17, 204-207.

Senn, S. (2011). You may believe you are a Bayesian but you are probably wrong. Rationality, Markets and Morals (RMM) 2, Special Topic: Statistical Science and Philosophy of Science, 48-66.


[1] For a thorough account of problems with the latter, see Kass and Wasserman (1996).

[2] I take Gelman-Shalizi (2012) to be an attempt at a meeting of the minds between Bayesian Gelman and error statistical Shalizi. I may be wrong.

Categories: Bayesian/frequentist, Error Statistics, Statistics | 59 Comments

Who is allowed to cheat? I.J. Good and that after dinner comedy hour….

UnknownIt was from my Virginia Tech colleague I.J. Good (in statistics), who died four years ago (April 5, 2009), at 93, that I learned most of what I call “howlers” on this blog. His favorites were based on the “paradoxes” of stopping rules.

“In conversation I have emphasized to other statisticians, starting in 1950, that, in virtue of the ‘law of the iterated logarithm,’ by optional stopping an arbitrarily high sigmage, and therefore an arbitrarily small tail-area probability, can be attained even when the null hypothesis is true. In other words if a Fisherian is prepared to use optional stopping (which usually he is not) he can be sure of rejecting a true null hypothesis provided that he is prepared to go on sampling for a long time. The way I usually express this ‘paradox’ is that a Fisherian [but not a Bayesian] can cheat by pretending he has a plane to catch like a gambler who leaves the table when he is ahead” (Good 1983, 135) [*]

This paper came from a conference where we both presented, and he was extremely critical of my error statistical defense on this point. (I was a year out of grad school, and he a University Distinguished Professor.) 

One time, years later, after hearing Jack give this howler for the nth time, “a Fisherian [but not a Bayesian] can cheat, etc.,” I was driving him to his office, and suddenly blurted out what I really thought:

“You know Jack, as many times as I have heard you tell this, I’ve always been baffled as to its lesson about who is allowed to cheat. Error statisticians require the overall and not the ‘computed’ significance level be reported. To us, what would be cheating would be reporting the significance level you got after trying and trying again in just the same way as if the test had a fixed sample size. True, we are forced to fret about how stopping rules alter the error probabilities of tests, while the Bayesian is free to ignore them, but why isn’t the real lesson that the Bayesian is allowed to cheat?” (A published version of my remark may be found in EGEK p. 351: “As often as my distinguished colleague presents this point…”)

 To my surprise, or actually shock, after pondering this a bit, Jack said something like, “Hmm, I never thought of it this way.”

images-3By the way, the story of the “after dinner Bayesian comedy hour” on this blog, did not allude to Jack but to someone who gave a much more embellished version. Since it’s Saturday night, let’s once again listen into the comedy hour that unfolded at my dinner table at an academic conference:

 Did you hear the one about the researcher who gets a phone call from the guy analyzing his data? First the guy congratulates him and says, “The results show a Read more »

Categories: Bayesian/frequentist, Comedy, Statistics | Tags: , , | 68 Comments

Guest Post. Kent Staley: On the Five Sigma Standard in Particle Physics

Kent Staley

Kent Staley
Associate Professor
Department of philosophy
Saint Louis University

Regular visitors to Error Statistics Philosophy may recall a discussion that broke out here and on other sites last summer when the CMS and ATLAS collaborations at the Large Hadron Collider announced that they had discovered a new particle in their search for the Higgs boson that had at least some of the properties expected of the Higgs. Both collaborations emphasized that they had results that were significant at the level of “five sigma,” and the press coverage presented this is a requirement in high energy particle physics for claiming a new discovery. Both the use of significance testing and the reliance on the five sigma standard became a matter of debate.

Mayo has already commented on the recent updates to the Higgs search results (here and here); these seem to have further solidified the evidence for a new boson and the identification of that boson with the Higgs of the Standard Model. I have been thinking recently about the five sigma standard of discovery and what we might learn from reflecting on its role in particle physics. (I gave a talk on this at a workshop sponsored by the “Epistemology of the Large Hadron Collider” project at Wuppertal [i], which included both philosophers of science and physicists associated with the ATLAS collaboration.)

Just to refresh our memories, back in July 2012, Tony O’Hagan posted at the ISBA forum (prompted by “a question from Dennis Lindley”) three questions regarding the five-sigma claim:

  1. “Why such an extreme evidence requirement?} We know from a Bayesian perspective that this only makes sense if (a) the existence of the Higgs boson (or some other particle sharing some of its properties) has extremely small prior probability and/or (b) the consequences of erroneously announcing its discovery are dire in the extreme. Neither seems to be the case, so why 5-sigma?
  2. “Rather than ad hoc justification of a p-value, it is of course better to do a proper Bayesian analysis. Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?
  3. “We know that given enough data it is nearly always possible for a significance test to reject the null hypothesis at arbitrarily low p-values, simply because the parameter will never be exactly equal to its null value. And apparently the LHC has accumulated a very large quantity of data. So could even this extreme p-value be illusory?”

O’Hagan received a lot of responses to this post, and he very helpfully wrote up and posted a digest of those responses, discussed on this blog here and here. Read more »

Categories: Error Statistics, P-values, Statistics | 26 Comments

Flawed Science and Stapel: Priming for a Backlash?

my 1st fraud kitDeiderik Stapel is back in the news, given the availability of the English translation of the Tilberg (Levelt and Noort Committees) Report as well as his book, Ontsporing (Dutch for “Off the Rails”), where he tries to explain his fraud. An earlier post on him is here. While the disgraced social psychologist was shown to have fabricated the data for something like 50 papers, it seems that some people think he deserves a second chance. A childhood friend, Simon Kuper, in an article “The Sin of Bad Science,” describes a phone conversation with Stapel:

“I’ve lost everything,” the disgraced former psychology professor tells me over the phone from the Netherlands. He is almost bankrupt. … He has tarnished his own discipline of social psychology. And he has become a national pariah. …

Very few social psychologists make stuff up, but he was working in a discipline where cavalier use of data was common. This is perhaps the main finding of the three Dutch academic committees which investigated his fraud. The committees found many bad practices: researchers who keep rerunning an experiment until they get the right result, who omit inconvenient data, misunderstand statistics, don’t share their data, and so on….

Chapter 5 of the Report, pp 47-54, is extremely illuminating about the general practices they discovered in examining Stapel’s papers, I recommend it.

Social psychology might recover. However, Stapel might not. A country’s way of dealing with sinners is often shaped by its religious heritage. In Catholicism, sinners can get absolution in the secrecy of confession. … …In many American versions of Protestantism, the sinner can be “born again”. …Stapel’s misfortune is to be Dutch. The dominant Dutch tradition is Calvinist, and Calvinism believes in eternal sin. …But the downside to not forgiving sinners is that there are almost no second acts in Dutch lives.

http://www.ft.com/intl/cms/s/2/d1e53488-48cd-11e2-a6b3-00144feab49a.html#axzz2PAPIxuHx

But it isn’t just old acquaintances who think Stapel might be ready for a comeback. A few researchers are beginning to defend the field from the broader accusations the Report wages against the scientific integrity of social psychology. They do not deny the “cavalier” practices, but regard them as acceptable and even necessary! This might even pave the way for Stapel’s rehabilitation. An article by a delegate for the 3rd World Conference on Research Integrity (wcri2013.org) in Montreal, Canada, in May reports on members of a new group critical of the Report, including some who were interviewed by the Tilberg Committees: Read more »

Categories: junk science, Statistics | 21 Comments

Higgs analysis and statistical flukes (part 2)

imagesEveryone was excited when the Higgs boson results were reported on July 4, 2012 indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of excess events of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post. This, too, is a rough outsider’s angle on one small aspect of the statistical inferences involved. (Doubtless there will be corrections.) But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels.

Following an official report from ATLAS, researchers define a “global signal strength” parameter “such that μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the SM Higgs boson signal in addition to the background” (where SM is the Standard Model). The statistical test may be framed as a one-sided test, where the test statistic (which is actually a ratio) records differences in the positive direction, in standard deviation (sigma) units. Reports such as: Read more »

Categories: P-values, statistical tests, Statistics | 33 Comments

Update on Higgs data analysis: statistical flukes (part 1)

physics pic yellow particle burst blue coneI am always impressed at how researchers flout the popular philosophical conception of scientists as being happy as clams when their theories are ‘born out’ by data, while terribly dismayed to find any anomalies that might demand “revolutionary science” (as Kuhn famously called it). Scientists, says Kuhn, are really only trained to do “normal science”—science within a paradigm of hard core theories that are almost never, ever to be questioned.[i] It is rather the opposite, and the reports out last week updating the Higgs data analysis reflect this yen to uncover radical anomalies from which scientists can push the boundaries of knowledge. While it is welcome news that the new data do not invalidate the earlier inference of a Higgs-like particle, many scientists are somewhat dismayed to learn that it appears to be quite in keeping with the Standard Model. In a March 15 article in National Geographic News:

Although a full picture of the Higgs boson has yet to emerge, some physicists have expressed disappointment that the new particle is so far behaving exactly as theory predicts. Read more »

Categories: significance tests, Statistics | 30 Comments

Risk-Based Security: Knives and Axes

headlesstsaAfter a 6-week hiatus from flying, I’m back in the role of female opt-out[i] in a brand new Delta[ii] terminal with free internet and ipads[iii]. I heard last week that the TSA plans to allow small knives in carry-ons, for the first time since 9/11, as “part of an overall risk-based security approach”. But now it appears that flight attendants, pilot unions, a number of elected officials, and even federal air marshals are speaking out against the move, writing letters and petitions of opposition.

“The Flight Attendants Union Coalition, representing nearly 90,000 flight attendants, and the Coalition of Airline Pilots Associations, which represents 22,000 airline pilots, also oppose the rule change.”

Former flight attendant Tiffany Hawk is “stupefied” by the move, “especially since the process that turns checkpoints into maddening logjams — removing shoes, liquids and computers — remains unchanged,” she wrote in an opinion column for CNN. Link is here. Read more »

Categories: evidence-based policy, Rejected Posts, Statistics | 17 Comments

S. Stanley Young: Scientific Integrity and Transparency

Stanley Young recently shared his summary testimony with me, and has agreed to my posting it.

YoungPhoto2008 S. Stanley Young, PhD
Assistant Director for Bioinformatics
National Institute of Statistical Sciences
Research Triangle Park, NC

One-page Summary Young
Testimony of Committee on Science, Space and Technology, 5 March 2013
Scientific Integrity and Transparency
S. Stanley Young, PhD, FASA, FAAAS

Integrity and transparency are two sides of the same coin. Transparency leads to integrity. Transparency means that study protocol, statistical analysis code and data sets used in papers supporting regulation by the EPA should be publicly available as quickly as possible and not just going forward. Some might think that peer review is enough to ensure the validity of claims made in scientific papers. Peer review only says that the work meets the common standards of the discipline and on the face of it, the claims are plausible, Feinstein, Science, 1988. Peer review is not enough. Read more »

Categories: evidence-based policy, Statistics | 10 Comments

Stephen Senn: Casting Stones

senncropped1Casting Stones, by Stephen Senn*

At the end of last year I received a strange email from the editor of the British Medical Journal(BMJ) appealing for  ‘evidence’ to persuade the UK parliament of the necessity of making sure that data for clinical trials conducted by the pharmaceutical industry are made readily available to all and sundry.  I don’t disagree with this aim. In fact in an article(1) I published over a dozen years ago I wrote ‘No sponsor who refuses to provide end-users with trial data deserves to sell drugs.’(P26)

However, the way in which the BMJ is choosing to collect evidence does not set a good example. It is one I hope that all scientists would disown and one of which even journalists should be ashamed.

The letter reads

“Dear Prof Senn,

We need your help to show the House of Commons Science and Technology Select Committee the true scale of the problem of missing clinical data by collating a list of examples. Read more »

Categories: evidence-based policy, Statistics | 28 Comments

Big Data or Pig Data?

pig-bum-textI don’t know if my reading of this Orwellian* piece is in sync with what Rameez intended, but he thought it was fine for me to post it here. See what you think: 

“Big Data or Pig Data” (A fable on huge amounts of data and why we don’t need models) by Remeez Rahman, computer scientist: posted at Realm of the SCENSCI

 There was a pig who wanted to be a scientist. He was not interested in models. When asked how he planned on making sense of the world, the pig would say in a deep mysterious voice, “I don’t do models: the world is my model” and then with a twinkle in his eyes, look at his interlocutor smugly.

By his phrase, “I don’t do models, the world is my model”, he meant that the world’s data was enough for him, the pig scientist. The more the data, the more accurately the pig declared, he would be able to predict what might happen in the world. Read more »

Categories: Statistics | 21 Comments

capitalizing on chance

Mayo playing the slots

DGM playing the slots

Hardly a day goes by where I do not come across an article on the problems for statistical inference based on fallaciously capitalizing on chance: high-powered computer searches and “big” data trolling offer rich hunting grounds out of which apparently impressive results may be “cherry-picked”:

When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level. . . . Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be “significant at the 5 percent level.” Does this mean that differences as large as the one tested would occur by chance only 5 percent of the time when the true difference is zero? The answer is no, because the difference tested has been selected from the twenty differences that were examined. The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104)[1]

…Oh wait -this is from a contributor to Morrison and Henkel way back in 1970! But there is one big contrast, I find, that makes current day reports so much more worrisome: critics of the Morrison and Henkel ilk clearly report that to ignore a variety of “selection effects” results in a fallacious computation of the actual significance level associated with a given inference; clear terminology is used to distinguish the “computed” or “nominal” significance level on the one hand, and the actual or warranted significance level on the other. Nowadays, writers make it much less clear that the fault lies with the fallacious use of significance tests and other error statistical methods. Instead, the tests are blamed for permitting or even encouraging such misuses. Criticisms to the effect that we should stop trying to teach these methods correctly have hardly helped. The situation is especially puzzling given the fact that these same statistical fallacies have trickled down to the public sphere, what with Ben Goldacre’s “Bad Pharma”, calls for “all trials” to be registered and reported, and the popular articles on the ills of ‘big data’: Read more »

Categories: Error Statistics, Statistics | 18 Comments

Statistically speaking…

images

calculus tattoo

Statistically speaking, we don’t use calculus By Dave Gammon

An article in a local op-ed piece today (Roanoke Times) claims:

“Quantitative skills are highly sought after by employers, and the best time to learn these skills is in high school and early college. And we all know the best math students should eventually learn calculus.

Or should they? Maybe it’s statistics, not calculus, that is a more worthy pursuit for the vast majority of students.”

This reminds me of the trouble I got into when, as a graduate student at the University of Pennsylvania, I supplemented my fellowship in philosophy by leading some recitation classes in statistics at the Wharton school. Although it was vaguely suggested that I not assign homework problems that required calculus, since many of the exercises in the sections of the text (on business statistics) that I was to cover required, and were illuminated by, calculus, (and given that the text was written by a Wharton statistics professor [de Cani]), I went ahead and assigned some of them, and promptly was reported by the students[i]. The author of this article appears to have no clue that statistical methods depend on calculus and the “area under a curve”. Read more »

Categories: Statistics, Uncategorized | 18 Comments

Stephen Senn: Also Smith and Jones

Stephen SennAlso Smith and Jones[1]
by Stephen Senn

Head of Competence Center for Methodology and Statistics (CCMS)

 

This story is based on a paradox proposed to me by Don Berry. I have my own opinion on this but I find that opinion boring and predictable. The opinion of others is much more interesting and so I am putting this up for others to interpret.

Two scientists working for a pharmaceutical company collaborate in designing and running a clinical trial known as CONFUSE (Clinical Outcomes in Neuropathic Fibromyalgia in US Elderly). One of them, Smith is going to start another programme of drug development in a little while. The other one, Jones, will just be working on the current project. The planned sample size is 6000 patients.

Smith says that he would like to look at the experiment after 3000 patients in order to make an important decision as regards his other project. As far as he is concerned that’s good enough.

Jones is horrified. She considers that for other reasons CONFUSE should continue to recruit all 6000 and that on no account should the trial be stopped early.

Smith say that he is simply going to look at the data to decide whether to initiate a trial in a similar product being studied in the other project he will be working on. The fact that he looks should not affect Jones’s analysis.

Jones is still very unhappy and points out that the integrity of her trial is being compromised.

Smith suggests that all that she needs to do is to state quite clearly in the protocol that the trial will proceed whatever the result of the interim administrative look and she should just write that this is so in the protocol. The fact that she states publicly that on no account will she claim significance based on the first 3000 alone will reassure everybody including the FDA. (In drug development circles, FDA stands for Finally Decisive Argument.)

However, Jones insists. She wants to know what Smith will do if the result after 3000 patients is not significant.

Smith replies that in that case he will not initiate the trial in the parallel project. It will suggest to him that it is not worth going ahead.

Jones wants to know suppose that the results for the first 3000 are not significant what will Smith do once the results of all 6000 are in.

Smith replies that, of course, in that case he will have a look. If (but it seems to him an unlikely situation) the results based on all 6000 will be significant, even though the results based on the first 3000 were not, he may well decide that the treatment works after all and initiate his alternative program, regretting, of course, the time that has been lost.

Jones points out that Smith will not be controlling his type I error rate by this procedure.

‘OK’, Says Smith, ‘to satisfy you I will use adjusted type I error rates. You, of course, don’t have to.’

The trial is run. Smith looks after 3000 patients and concludes the difference is not significant. The trial continues on its planned course. Jones looks after 6000 and concludes it is significant P=0.049. Smith looks after 6000 and concludes it is not significant, P=0.052. (A very similar thing happened in the famous TORCH study(1))

Shortly after the conclusion of the trial, Smith and Jones are head-hunted and leave the company.  The brief is taken over by new recruit Evans.

What does Evans have on her hands: a significant study or not?

Reference

1.  Calverley PM, Anderson JA, Celli B, Ferguson GT, Jenkins C, Jones PW, et al. Salmeterol and fluticasone propionate and survival in chronic obstructive pulmonary disease. The New England journal of medicine. 2007;356(8):775-89.


[1] Not to be confused with either Alias Smith and Jones nor even Alas Smith and Jones

Categories: Philosophy of Statistics, Statistics | Tags: , , , | 14 Comments

Fisher:’Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

I find this to be an intriguing discussion–before some of the conflicts with N and P erupted.  Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power.  It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below.

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

  The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true.

If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other.

Consequently, for the most powerful test possible the ratio of the probabilities of occurrence on the hypothesis H0 to that on the hypothesis H1 is less in all samples in the region of rejection than in any sample outside it. For samples involving continuous variation the region of rejection will be bounded by contours for which this ratio is constant. The regions of rejection will then be required in which the likelihood of H0 bears to the likelihood of H1, a ratio less than some fixed value defining the contour. (295)…

It is evident, at once, that such a system is only possible when the class of hypotheses considered involves only a single parameter θ, or, what come to the same thing, when all the parameters entering into the specification of the population are definite functions of one of their number.  In this case, the regions defined by the uniformly most powerful test of significance are those defined by the estimate of maximum likelihood, T.  For the test to be uniformly most powerful, moreover, these regions must be independent of θ showing that the statistic must be of the special type distinguished as sufficient.  Such sufficient statistics have been shown to contain all the information which the sample provides relevant to the value of the appropriate parameter θ . It is inevitable therefore that if such a statistic exists it should uniquely define the contours best suited to discriminate among hypotheses differing only in respect of this parameter; and it is surprising that Neyman and Pearson should lay it down as a preliminary consideration that ‘the tesitng of statistical hypotheses cannot be treated as a problem in estimation.’ When tests are considered only in relation to sets of hypotheses specified by one or more variable parameters, the efficacy of the tests can be treated directly as the problem of estimation of these parameters.  Regard for what has been established in that theory, apart from the light it throws on the results already obtained by their own interesting line of approach, should also aid in treating the difficulties inherent in cases in which no sufficient statistics exists. (296)

Categories: phil/history of stat, Statistics | Tags: , , , , , | Leave a comment

Blog at WordPress.com. Theme: Customized Adventure Journal by Contexture International.

Follow

Get every new post delivered to your Inbox.

Join 84 other followers