Slides from the Boston Colloquium for Philosophy of Science: “Severe Testing: The Key to Error Correction”

Slides from my March 17 presentation on “Severe Testing: The Key to Error Correction” given at the Boston Colloquium for Philosophy of Science Alfred I. Taub forum on “Understanding Reproducibility and Error Correction in Science.”


Categories: fallacy of rejection, Fisher, fraud, frequentist/Bayesian, Likelihood Principle, reforming the reformers | 16 Comments

16 thoughts on “Slides from the Boston Colloquium for Philosophy of Science: ‘Severe Testing: The Key to Error Correction’”

  1. Do you have any thoughts on the relevance of this discussion to climate science?

    For example, statistical measures are the cornerstone of its relevance to the public policy debate — comparing the predictions of models with observations, to estimate the reliability and accuracy of the models.

    There is some discussion in the peer-reviewed literature about the best methodology for these tests, but not much. Citations available on request.

  2. The climate science literature about validation of models is quite sparse, considering its public policy importance. Most studies are hindcasts, which compare observations to model “predictions” of past climate while ignoring the effect of model tuning. For example, see “Well-estimated global surface warming in climate projections selected for ENSO phase”, James S. Risbey et al., Nature Climate Change, September 2014. Hindcasting of the CMIP5 models.

    Three representative papers in this area:

    (1) “Reconciling warming trends” by Gavin A. Schmidt et al, Nature Geoscience, March 2014 — Ungated copy here.

    (2) “Recent observed and simulated warming”, John C. Fyfe and Nathan P. Gillett, Nature Climate Change, March 2014 (gated). “Fyfe et al. showed that global warming over the past 20 years is significantly less than that calculated from 117 simulations of the climate by 37 models participating in Phase 5 of the Coupled Model Intercomparison Project (CMIP5). This might be due to some combination of errors… It is this light that we revisit the findings of Fyfe and colleagues.”

    (3) “Well-estimated global surface warming in climate projections selected for ENSO phase”, James S. Risbey et al., Nature Climate Change, September 2014. Hindcasting of the CMIP5 models.

    The following posts have citations, abstracts, links, and excerpts to papers about testing climate models (ungated, where available).

    (1) This discusses this issue from a public policy perspective. Section (f) at the end has an extensive list of references:

    (2) This discusses application of Karl Popper’s concepts of “risky predictions” and “severe testing” to climate science:

    It includes a discussion of “Should we assess climate model predictions in light of severe tests?”, Joel Katzav, Eos, 7 June 2011.

  3. Steven McKinney

    Editor of the Fabius Maximus website:

    The unfortunate issue in climate science is that we have only one earth right now. Global warming is thus not an experiment we can repeat over and over again, to see if we rarely fail to obtain a result showing statistical significance for effect sizes of scientific relevance.

    We get one shot with the earth’s climate at this time. None of us opining about whether CO2 does anything or not will be around when our grandchildren’s grandchildren witness the outcome.

    Mayo’s severity concepts apply to scenarios that are repeatable. People all over the world build small houses from glass or plastic, and rarely fail to grow plants therein that require warmer temperatures than available outside the small glass house. A most reproducible outcome indeed. CO2 in the atmosphere produces a similar effect, as any visitor to Venus can attest. The sensible among us conclude that piling more and more CO2 into our atmosphere at this time is a strategy with a likely disastrous outcome for human beings, based on reproducible smaller scale demonstrations. Bugs and moulds and bacteria that thrive in greenhouses will scarcely note our passing should the fossil fuel apologists prevail.

    Fabius Maximus would have left oil in the ground, rather than frantically digging it all up as quickly as possible and expending it. Why the big rush to burn it all right now?

    • Steve,

      This isn’t the place to debate public policy about climate change. I suggest you read the Working Group I report of the IPCC’s AR5. The summary explains the situation clearly and in detail, describing the uncertainties that need clarification before re-arranging the world economy.

      WG1AR5_SPM_FINAL.pdf

      That’s vital, because there are other serious threats that require funding. For example, there is a large peer-reviewed literature showing that the oceans are being destroyed. We cannot fully fund efforts to fight every threat. Priorities must be set.

      The comparison with Venus is not relevant. It is hotter because it is closer to the sun (a total solar irradiance almost 2x that of Earth) and has a denser atmosphere. The pressure on Venus is 93x that of Earth, the equivalent of almost 1 kilometer under the ocean. Also, the composition of the atmosphere is radically different (e.g., clouds of sulphuric acid, a powerful greenhouse agent). See the NASA fact sheet:

    • Steven: I’m surprised you seem to suggest that severity is relevant for cases where the entire universe is repeated over and over. We consider hypothetical outcomes that could have occurred in each experiment in order to assess and control its capabilities to uncover errors, whether the inferences concern historical phenomena (e.g., anthropology), this one world, or one hypothesis about it. There’s no research problem that we can’t be wrong about, and so there’s always a type of mistaken interpretation of data. We cash this out in terms of counterfactuals (what would have occurred), often with simulated repetitions. We’re generalizing about the methods and models, which are never unique, but always of a general type. That’s how we severely probe.
      There are people using severity to test climate models, but I only know a little about it. I realize this likely wasn’t your main point.

      • Professor Mayo,

        Can you recall anyone doing severe testing of climate models? The only papers I’ve found are by Joel K. Katzav. They made little impression on the field. I forgot to mention the second one in my comment. It specifically mentions your work.

        “Should we assess climate model predictions in light of severe tests?” in Eos, 7 June 2011.

        “Severe testing of climate change hypotheses”, Joel K. Katzav, Studies in History and Philosophy of Modern Physics, Volume 44, Issue 4, p. 433-441 (November 2013).

        375419940988534.pdf

        • No, there were different authors, I’d have to look them up. They were quite interesting.

          • Professor Mayo,

            There is a new paper in my reading pile. Deferred, as it looks over my head, but probably is in your wheelhouse.

            “Reconciling the signal and noise of atmospheric warming on decadal timescales”, Roger N. Jones and James H. Ricketts, Earth System Dynamics, 8 (1).


            “Interactions between externally forced and internally generated climate variations on decadal timescales is a major determinant of changing climate risk. Severe testing is applied to observed global and regional surface and satellite temperatures and modelled surface temperatures to determine whether these interactions are independent, as in the traditional signal-to-noise model, or whether they interact, resulting in step-like warming. …”

      • Steven McKinney


        I am unclear on how I suggest that severity is relevant for cases where the entire universe is repeated over and over. Can you clarify what I said that entails an entire universe repeating over and over?

        What I mean to point out is that frequentist statistical inference is relevant for repeatable phenomena, and is not relevant for one-off events. We cannot study the long run frequency properties of things that only happen once. It is going to be a long, long time until stores of carbon fuels such as oil build back up on Earth, after we dig them up and burn them so relatively quickly over a few hundred years. Thus the current climate change debate concerns a single event – the pushing of our climate to a hotter state, a state not seen for several million years, well before the evolution of the human species.

        Putting this event into the science replication crisis debate currently ongoing is inappropriate. Your slides discuss the replication crisis and the “Editor of the Fabius Maximus website” asks for “thoughts on the relevance of this discussion to climate science”.

        My claim is that this discussion needs to be reviewed carefully in this regard, because we have only one shot at getting the answer right. Certainly replication is relevant as regards the many studies that inform our debate on climate change, but we have huge volumes of repeated findings over decades concerning climate change phenomena. This contrasts sharply with the thin volumes of one-off findings in current psychology that too often end up in newspaper headlines and fuel the replication crisis discussion. We can certainly document the cherry picking of a few small studies that get published and hyped in current psychology and biology, but this is not the case for studies shoring up the debate on the effects of CO2 on global temperatures.

        The About page at the Fabius Maximus website lists the current editor as Larry Kummer.

        The Fabius Maximus website is a most odd collection of writings indeed, with titles such as “Manufacturing climate nightmares: misusing science to create horrific predictions” and repeated references to “Lefties” and the despair and propaganda they disseminate.

        Writing about the replication crisis currently debated in several research fields, Larry Kummer states

        “Many sciences are vulnerable, but climate science might become the most affected. It combines high visibility, a central role in one of our time’s major public policy questions, and a frequent disregard for the methodological safeguards that other sciences rely upon.”

        Really? After decades of study, how are methodological safeguards being disregarded?

        So Larry has some curious agenda, picking away at certain climate science topics and at those who believe it is past time to do something about our warming planet.

        Larry’s scientific chops do not impress me – he may have been a great money maker at UBS investments, but his understanding of Venus leaves me confused. The very link that Larry embeds above, to a NASA fact sheet, shows the CO2 level on Venus to be 96.5%, that is 965,000 parts per million versus Earth’s recently historic 250 parts per million, now at 400 parts per million and climbing. Larry points to Venus’s higher sulfur dioxide levels (150 parts per million), yet I find articles such as this

        “Robock, an expert on how volcanoes affect climate and a professor of environmental sciences at Rutgers University, cautions that while the Pinatubo eruption confirmed the cooling effect of sulfate aerosols, it injected a massive amount of sulfur dioxide into the stratosphere over a few days.”

        so on Earth sulfur dioxide has a cooling effect. Curious that a mere 150 parts per million on Venus is claimed to be responsible for so much of Venus’s trapped heat, never mind the 965,000 parts per million of CO2.

        Here’s reporting from the Seattle Times newspaper:

        “Here’s why: When CO2 mixes with water it takes on a corrosive power that erodes some animals’ shells or skeletons. It lowers the pH, making oceans more acidic and sour, and robs the water of ingredients animals use to grow shells in the first place.

        Acidification wasn’t supposed to start doing its damage until much later this century.

        Instead, changing sea chemistry already has killed billions of oysters along the Washington coast and at a hatchery that draws water from Hood Canal. It’s helping destroy mussels on some Northwest shores. It is a suspect in the softening of clam shells and in the death of baby scallops. It is dissolving a tiny plankton species eaten by many ocean creatures, from auklets and puffins to fish and whales — and that had not been expected for another 25 years.

        And this is just the beginning.”

        Larry states “This isn’t the place to debate public policy about climate change.” This is beyond political. When our food no longer grows, this is an existential threat. Where more appropriate to review such an issue than a philosophical arena, as presented by philosopher Mayo?

    • john byrd

      “The unfortunate issue in climate science is that we have only one earth right now. Global warming is thus not an experiment we can repeat over and over again, to see if we rarely fail to obtain a result showing statistical significance for effect sizes of scientific relevance.”

      Actually, we might not be around to see repeats, but there is no reason that the Earth cannot have repeats of the experiment. The Earth can recover without us. Maybe some other intelligence can check the results.

  4. john byrd

    I find the fixation on significance testing in this “replication crisis” somewhere between puzzling and disturbing. Sure, the statistical setups are important, but not more important than the fundamental issues of replicating data collection, experimental design, etc. It seems that many of the psych studies are not easily replicated, at least to an outsider. Where it gets scary is to see the critical focus on p-values while naively ignoring the folly of failing to properly consider sampling error in any statistical consideration. One should expect that heightened awareness of the problems with cherry-picking, etc. would naturally lead to a realization of the folly of the Strong Likelihood Principle (SLP). Maybe next year.

  5. In the paper “Reconciling the signal and noise of atmospheric warming on decadal timescales”, linked to above, we applied the substantive-null-of-model-adequacy approach: we linked two physically plausible rival hypotheses to two statistical hypotheses (step and trend) to determine whether externally forced atmospheric warming is a gradual or step-like process. Six tests were applied to obtain a result. We were unable to determine any likelihoods associated with the results, but one alternative passed the six tests and the other failed.
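
    For readers who want a concrete picture of what pitting a step hypothesis against a trend hypothesis can look like, here is a minimal Python sketch on synthetic data. It is not the six-test procedure from the paper. The function names, the noise level, the series length, and the use of a plain least-squares comparison are all assumptions made purely for illustration.

```python
import random

def sse_trend(y):
    """Residual sum of squares for an ordinary least-squares straight line."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return sum((v - (intercept + slope * i)) ** 2 for i, v in enumerate(y))

def sse_step(y, min_seg=3):
    """Best single-step fit: two flat segments, break chosen to minimise SSE.

    Returns (sse, k), where k is the index of the first point after the step.
    """
    best = None
    for k in range(min_seg, len(y) - min_seg + 1):
        left, right = y[:k], y[k:]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        sse = (sum((v - mean_left) ** 2 for v in left)
               + sum((v - mean_right) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, k)
    return best

# Synthetic series (hypothetical values, not real temperature data).
random.seed(0)
n = 40
step_series = [(0.0 if i < 20 else 0.5) + random.gauss(0, 0.05) for i in range(n)]
trend_series = [0.0125 * i + random.gauss(0, 0.05) for i in range(n)]

for name, series in (("step", step_series), ("trend", trend_series)):
    step_sse, break_at = sse_step(series)
    print(name, "series: trend SSE =", round(sse_trend(series), 3),
          "| step SSE =", round(step_sse, 3), "| break index =", break_at)
```

    Note that the step model has one more free parameter (the break location) than the straight line, so a raw comparison of residuals leans slightly toward the step fit. A serious analysis would have to account for that, and for autocorrelation in real temperature series.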

I welcome constructive comments that are of relevance to the post and the discussion, and discourage detours into irrelevant topics, however interesting, or unconstructive declarations that "you (or they) are just all wrong". If you want to correct or remove a comment, send me an e-mail. If readers have already replied to the comment, you may be asked to replace it to retain comprehension.