Sensitivity and Severity: Gardiner and Zaharatos (2022) (i)


I’ve been reading an illuminating paper by Georgi Gardiner and Brian Zaharatos (Gardiner and Zaharatos, 2022; hereafter, G & Z), “The safe, the sensitive, and the severely tested,” that forges links between contemporary epistemology and my severe testing account. It’s part of a collection published in Synthese on “Recent issues in Philosophy of Statistics”. Gardiner and Zaharatos were among the 15 faculty who attended the 2019 summer seminar in philstat that I ran (with Aris Spanos). The authors courageously jump over some high hurdles separating the two projects (whether a palisade or a ha-ha; see G & Z) and manage to bring them into close connection. The traditional epistemologist is largely focused on an analytic task of defining what is meant by knowledge (generally restricted to low-level perceptual claims, or claims about single events), whereas the severe tester is keen to articulate when scientific hypotheses are well or poorly warranted by data. Still, while severity grows out of statistical testing, I intend the account to hold for any case of error-prone inference. So it should stand up to the examples one meets in the jungles of epistemology. For all of the examples I’ve seen so far, it does. I will admit, the epistemologists have storehouses of thorny examples, many of which I’ll come back to. This will be part 1 of two, possibly even three, posts on the topic; revisions to this part will be indicated with ii, iii, etc., and no, I haven’t used a chatbot or anything in writing this.

I won’t dwell on the many differences in goals and language, but will focus on the points of contrast that best reveal what severity has to offer the epistemologist. The epistemological notion that is closest in spirit to severity, G & Z propose, is what epistemologists call “sensitivity”:

Mayo has independently developed a sensitivity condition without drawing on the resources of contemporary epistemological theory. She has developed a sensitivity account, without perceiving herself as such. (G & Z, 19)

So am I like the Molière of sensitivity? [1] In fact, severe testing quite consciously gives an account of inference that is sensitive to erroneously inferring (warranting or believing) claims. It is only because statistical methods deliberately supply such tools that I, as a philosopher, look to them in the first place.

It is noteworthy that the authors link severity with “non-formal epistemology” rather than formal epistemology. This seems right to me. Formal epistemology is largely Bayesian, or at least “probabilist”, in the sense of using probability to capture degrees of belief, plausibility, or support in hypotheses or claims. However, non-formal epistemologists regularly slip into probabilist talk in speaking of statistical evidence and inference, and this creates obstacles for their own project. Notably, they are all too happy with crude induction: from “k% of observed A’s have been B’s” to inferring that the probability a specific A is a B equals k%. Fallacies of probabilistic instantiation, reference class problems, and lack of randomness loom large, but do not get attention. Adding a second big assumption, that a claim’s being probable (in some sense) warrants inferring it, non-formal epistemologists then set about finding principles to block such an apparent warrant. But I’m getting ahead of myself. I return to this at the end of the current post.

Severity.

Informal. Here’s an informal take on my notion of severity, weak and strong. We start with a minimal requirement: we deny there is evidence for claim C if little or nothing has been done that could have found flaws in C (weak severity). Claim C is warranted (by data x) just to the extent that it has been subjected to, and passes, a test that probably would have found flaws in C, were they present. This probability is the severity with which C has passed the test. Claims that pass with high severity are said to be warranted (strong severity). The severity accorded a claim is automatically deemed low if it’s impossible to assess the relevant error probabilities, even approximately.[2]
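Schematically, in my own compressed notation (a paraphrase, not a verbatim formula from SIST): for a method or test M, data x, and claim C,

    SEV(M, x, C) = Pr(M would yield a result according less well with C than x does; C is false),

and C passes severely just to the extent that this probability is high, while the weak requirement denies there is evidence for C when it is low, or cannot be assessed even approximately.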

The authors are right to suggest there are echoes of “sensitivity” in epistemology. Very roughly, sensitivity requires, for any claim C that is judged or believed to be true, that if C were false, it would (probably?) not be believed, judged, or the like. However, they also aver that severity faces, or appears to face, a problem thought to bedevil sensitivity in epistemology. If correct, this suggests that skeptical possibilities result in even well-tested claims failing the minimal requirement for evidence. In Part 2, I will show how the severe tester debunks the allegation that sensitivity (in epistemology) is said to face.

A full discussion of severe testing may be found in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP 2018): SIST. All 16 “tours” of the book may be found, in proof form, in these excerpts on this blog.

Severity in error statistics.

The informal description of severity that I gave above is liable to be misunderstood without first understanding a little of how severity assessments arise in error statistical methods such as statistical significance tests (even though my severity idea results in reformulating these tools). If severity is misunderstood, it will be misapplied when called on for the epistemological project. So here are some elements from error statistics: In formal error statistics, probability is used to assess and control the probability that a method leads to misinterpreting data. These are the method’s error probabilities. While, technically, error probabilities of a method allude to its behavior in (actual or hypothetical) repeated use, they may also serve to capture the capabilities of methods to avoid error in the case at hand. This is what allows moving from error probabilities (of a method) to specific warrants (for applications of that method). Or so I have been arguing for some time.

Pre-data, a method such as a hypothesis test is specified so as to ensure that true claims are passed or inferred, and false claims are not passed or are rejected, although these requirements are qualified probabilistically. For example, in a statistical significance test, the claims of interest might be: the data (say from a randomized control trial) are evidence that a given treatment is beneficial for a given disease, or the data fail to provide evidence of benefit (at least in the experimental population). The standard type I error here is to infer evidence of benefit when it is absent (i.e., when the observed effect is merely due to background variability or “noise”); the type II error is to fail to infer benefit when it exists. The probability a method M “would have found” flaws in claim C refers to the probability that method M would have rejected C, computed under various statistical hypotheses. This “computed under…” phrase may be unfamiliar to you, but it’s important. It is not a conditional probability, and it does not use a prior probability. I’ve written so much on this blog about statistical tests that it’s best to just direct you to search this blog. Can it be used in informal epistemology? I say yes.
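To make the “computed under” locution concrete, here is a minimal sketch (my own illustrative numbers, not from the paper or any actual trial) of the pre-data error probabilities of a one-sided Normal test of “no benefit” against “positive benefit”. The type I error probability is computed under the no-benefit value, the type II error probability under a hypothesized benefit of 0.2, and no prior probabilities appear anywhere:

    # Pre-data error probabilities of a one-sided z-test, "computed under"
    # hypothesized parameter values (illustrative numbers only).
    import numpy as np
    from scipy import stats

    sigma, n = 1.0, 100                      # known s.d. and sample size, assumed for illustration
    se = sigma / np.sqrt(n)                  # standard error of the sample mean
    alpha = 0.025                            # chosen type I error probability
    cutoff = stats.norm.ppf(1 - alpha) * se  # infer benefit when the sample mean exceeds this

    # Type I error: Pr(test infers benefit; computed under mu = 0, i.e., no benefit)
    type_I = 1 - stats.norm.cdf(cutoff, loc=0.0, scale=se)

    # Type II error: Pr(test fails to infer benefit; computed under mu = 0.2)
    type_II = stats.norm.cdf(cutoff, loc=0.2, scale=se)

    print(f"cutoff on the sample mean: {cutoff:.3f}")     # about 0.196
    print(f"type I error (under mu = 0):    {type_I:.3f}")   # 0.025 by construction
    print(f"type II error (under mu = 0.2): {type_II:.3f}")  # about 0.48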

One need not consider the case a “test” in any official way (e.g., the claim C need not be prespecified). Claim C can be an estimate, a prediction, a perceptual judgment, or something else. You don’t have to call the inference from data x to claim C a test. But keeping to testing language underscores that we are interested in how well probed claims are, at least when we are engaged in an analysis, as we are here. (In fact, I dub the severe tester’s use of probability “probativism,” in contrast to “probabilism” and “performance”; these are briefly discussed in G & Z.) The main thing is that there is a method M that moves from data and background to a claim C (which may be a denial of some claim).

Severity is post-data. The severity notion I develop is a post-data measure. For example, the null hypothesis H0 might assert no benefit, whereas alternatives under H1 assert positive magnitudes of benefit. We rarely want merely to infer evidence of some benefit, but rather how much of a benefit. G & Z illustrate with a figure relating effect size to severity. Once a test result is at hand, and an inference is reached with data x, severe testers evaluate how well (or poorly) warranted various claims are. Any inference reached is accompanied by at least one poorly warranted claim (in relation to the relevant errors).
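In that same toy test, once an observed sample mean is in hand, the severity for claims of the form “the benefit exceeds μ1” can be computed for various μ1, giving the kind of effect-size-to-severity curve G & Z illustrate. A minimal sketch, with a made-up observed mean (my illustration, not their figure):

    # Post-data severity for claims "mu > mu_1", one-sided Normal test
    # (continuing the illustrative setup above; the observed mean is made up).
    import numpy as np
    from scipy import stats

    sigma, n = 1.0, 100
    se = sigma / np.sqrt(n)
    xbar_obs = 0.25                          # hypothetical observed sample mean

    # SEV(mu > mu_1) = Pr(sample mean <= xbar_obs; computed under mu = mu_1):
    # the probability of a result according less well with "mu > mu_1" than
    # the observed result does, were mu only as large as mu_1.
    for mu_1 in [0.0, 0.1, 0.2, 0.3]:
        sev = stats.norm.cdf(xbar_obs, loc=mu_1, scale=se)
        print(f"SEV(mu > {mu_1:.1f}) = {sev:.3f}")   # 0.994, 0.933, 0.691, 0.309

On these made-up numbers, “the benefit exceeds 0” is extremely well warranted while “the benefit exceeds 0.3” is poorly warranted, which is the point about attending to magnitudes: any inference is accompanied by claims that are not well warranted by the same data.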

Levels. The severity assessment is always at a “level”, analogous to a significance level or confidence level. The focus is on the level attained post-data. This depends on how well probed the claim actually is with the given data x and method M. In informal settings (the ones we are usually faced with), these error probabilities remain qualitative and so does the level of warrant: poorly probed, reasonably well probed, extremely well probed, etc. Even in formal statistical contexts, a precise error probability is rarely required. Epistemologists like to talk about beliefs, especially a “subject S believes that P” (for a proposition P). You’re free to do so, although I won’t, except when quoting others. What level of severity to require for an indication, strong evidence, or knowledge (or warrant) will vary with the context, but I will not have to specify thresholds here. A method that errs over 50% of the time (in relation to a type of claim) is unreliable.

The kinds of claims that arise in epistemology are dichotomous: rather than considering magnitudes of effects, the possibilities are generally exhausted by C and its denial. That simplifies things.

Counterfactuals in error statistics.

My severity requirement alludes to counterfactuals but, unlike their typical treatment by epistemologists, I don’t cash them out in terms of possible worlds, be they close or distant. In formal statistics, the needed counterfactuals stem from the sampling distribution of a statistic (“computed under” various assumptions about the world giving rise to the data). The statistic, T, called the test statistic, measures the accordance or fit between data x and claim H. The larger T is, the more improbable x is when computed under H (and the more probable under ~H). If the data are sufficiently distant from what would be expected under H, the method M outputs ~H. So the probability attaches to the method.
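A small simulation (my own, with made-up parameter values) shows how the counterfactual is cashed out by the sampling distribution rather than by possible worlds: the method’s probability of outputting ~H is computed under different hypothesized values, and it is to the method that these probabilities attach:

    # The counterfactual "were H false, x would probably not fit H", cashed out
    # via the sampling distribution of a test statistic T (made-up values).
    import numpy as np

    rng = np.random.default_rng(1)
    n, sigma, reps = 100, 1.0, 100_000
    cutoff = 1.96                     # output ~H0 when T = sqrt(n) * xbar / sigma exceeds this

    def rejection_rate(mu):
        """Proportion of simulated samples on which the method outputs ~H0,
        i.e., the method's behavior computed under the value mu."""
        xbar = rng.normal(mu, sigma / np.sqrt(n), size=reps)
        T = np.sqrt(n) * xbar / sigma
        return np.mean(T > cutoff)

    print("Pr(method outputs ~H0; mu = 0)   ~", rejection_rate(0.0))   # about 0.025
    print("Pr(method outputs ~H0; mu = 0.4) ~", rejection_rate(0.4))   # about 0.98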

Auditing. A crucial part of the severity requirement is checking the assumptions underlying these claims, which I call auditing. An application of an adequate statistical method may be said to supply nominally good error probabilities, but they may actually be poor if they do not stand up to testing by the required audit. Audits themselves occur at two levels: the primary inference to C, and the secondary scrutiny of assumptions (typically, in statistics, of a model of the data generation).

A passage from Gardiner and Zaharatos (G & Z):

Strong severity aims to characterise the epistemic value of good tests. A good test is good because were H false, the test would have detected it. For observed data e to support a hypothesis H, on Mayo’s view, it does not suffice for e to fit H. In addition, e’s fitting H must be a good test of H. A test is good if H were false, the data wouldn’t fit H. (ibid. 4)

I especially like the first sentence of this passage, because it is often supposed that good error probabilities are for shop-keeping and acceptance sampling in industry. Among the ways I’d wish to qualify the claims in this passage:

  • First, I would say, a test is good because were H false, the test would (probably) have detected the falsity, and would not (often) have declared a flaw otherwise. A method is not good if it always or often infers that C is false or unwarranted, even if C is true or warranted. A method is poor if it often blocks or fails to affirm true claims.
  • Second, rather than speak of “support”, I speak of data warranting a hypothesis H, or more generally, a claim C. The notion of “support” is used in connection with probabilist approaches where the goal is degree of support, or, in Bayesian accounts, how much of a probability boost a hypothesis receives with data.
  • Third, severity requires probabilistic qualifications: very probably, the error that is purported to be absent would have been found if it were present, or C would have failed the test (at the given level).
  • Fourth, we immediately get into trouble speaking of detached statements, probabilities, likelihoods and the like. Consider the last sentence: “A test is good if H were false, the data wouldn’t fit H.” A severe tester needs to take account of how the H (and the x) under analysis were generated.

Remember the method used by Scott Harkonen? (See Mayo 2020, and these two blogposts: Oct 9, 2013 and Feb 1, 2020.) Failing to find statistical significance on 10 different endpoints, he tries and tries again until finding a subgroup of patients in which a sufficiently high number showed benefit from the treatment (for a serious lung disease). Let HPD be the post-data hypothesis Harkonen formulates based on the unblinded data. HPD: the treatment benefits patients with such and such characteristics. Look at what happens in relation to the specially generated null hypothesis, ~HPD: the treatment does not benefit these patients. The data x “do not fit” the post-data null hypothesis ~HPD. The only reason Harkonen selected this post hoc subgroup is that he could declare that the data do not fit the post-designated null hypothesis ~HPD. Using a likelihood ratio as a measure of fit, the data do not fit ~HPD since Pr(x;~HPD) < Pr(x;HPD). The biased selection does not show up in the likelihood ratio; it shows up only when the associated error probability is considered. What resources does the epistemologist have to pick up on the fact that the method had high error probabilities, in our sense? I’m not sure, but that’s what I’d like to supply them.
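A toy simulation of my own (illustrative only, not Harkonen’s actual data or analysis) brings out why the error probability, and not the likelihood ratio for the selected subgroup, registers the biased selection: search over 10 post hoc subgroups when there is no benefit anywhere, report whichever one “fits” best, and the searching method declares a nominally significant fit about 40% of the time:

    # Toy simulation: hunting across 10 subgroups/endpoints under a true null
    # everywhere, then reporting the best nominal result (illustrative only).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    reps, n_subgroups = 20_000, 10

    # per-subgroup z-statistics when there is NO benefit in any subgroup
    z = rng.normal(0.0, 1.0, size=(reps, n_subgroups))
    pvals = 1 - stats.norm.cdf(z)                 # one-sided nominal p-values
    best_looks_significant = pvals.min(axis=1) < 0.05

    print("Pr(at least one nominally 'significant' subgroup; no benefit anywhere) ~",
          best_looks_significant.mean())          # about 1 - 0.95**10 = 0.40

The likelihood ratio computed for the selected subgroup alone looks the same whether the subgroup was prespecified or hunted for; only the error probability of the hunting method picks up the difference.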

Sensitivity in Epistemology.

In one place, Gardiner and Zaharatos define sensitivity as

S’s true judgement that p is sensitive iff in the nearest possible worlds in which p is not true, S does not judge that p. (G & Z, 13)

I’m not sure if they intended to write “true judgment” here. They drop the possible worlds in the following:

Sensitivity of belief. S’s belief that p is sensitive iff if  p were false,  S would not believe that p (ibid.)

A judgement that p is sensitive iff were p false, the agent would not have judged that p. This ‘judgement’ might be a legal verdict, scientific conclusion, formal finding, news report or similar. The agent might be a group or community. For some such judgements, an individual’s believing p is not a central or necessary condition that p (ibid.)

That’s good because we want to drop the “subject (or knower) S” from the notion. But we need to qualify with an error probability. An article on sensitivity in legal epistemology, by Enoch et al. (2012, 204), defines sensitivity using probability rather than possible worlds:

Sensitivity: S’s belief that p is sensitive =df. Had it not been the case that p, S would (most probably) not have believed that p.

Nudging their definition closer to severity, one might try:

Claim C, which passes test M, is warranted (with severity?) iff were C false, C would (most probably) not have passed, or not have passed so well, with method M.

I insert “?” because it is at most something that might be tried. A main problem is that I don’t know how probability is being used here. Enoch et al., I take it, construe it as probabilifying the claim C itself, perhaps with a posterior probability. My position would side with Peirce:

if universes were as plenty as blackberries, if we could put a quantity of them in a bag, shake them well up, draw out a sample and examine them to see what proportion of them had one arrangement and what proportion another. (2.684)

It still seems odd to my ears to call the claim (belief or judgment) sensitive, rather than the method that outputs the claim. I also worry that insufficient attention is paid to ensuring that true claims pass.

Some Classic Examples.

What allows viewing sensitivity and severity as in the same spirit is considering how they are used in debunking some classic examples. So let’s turn to that.

Lottery paradox. The lottery paradox, for which my friend Henry Kyburg is famous, goes like this:

The evidence, x, is that a person A has bought a ticket in a fair lottery with only a one in a million chance of having her ticket drawn as the winner. While the probability that “A will not win” is high, this claim does not pass severely with this evidence, because even if A’s ticket is a winner, there is no chance of finding this out. The paradox that Kyburg was on about is that if it is inferred, for each ticket-holder, that they will not win, then we would infer no one will win, contradicting the supposition that there will be a winner. (Kyburg’s solution denies that we should conjoin the individually highly probable claims “A1 will not win”, “A2 will not win”, etc.) Here, the assumption is that it is a fair lottery, so that each ticket has the same probability of being selected.

The severe tester denies it is warranted to infer A will not win. The improbability of winning is given in the description of the lottery, so nothing has been done to distinguish a winning ticket from a losing ticket. I discuss this in Mayo 1996.
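A toy rendering of my own (nothing here is in G & Z) makes the severity point mechanical: a rule that infers “this ticket will not win” from the mere improbability of winning is almost always right about an arbitrary ticket, yet it has no chance of detecting the error in the one case where the claim is false:

    # Lottery: high probability without severity (toy rendering; a smaller
    # lottery than one-in-a-million so that the rare case actually occurs).
    import numpy as np

    rng = np.random.default_rng(0)
    N, draws = 1_000, 100_000                        # 1000-ticket fair lottery, repeated many times

    winners = rng.integers(N, size=draws)            # the winning ticket in each repetition
    ticket_A = rng.integers(N, size=draws)           # A's ticket in each repetition
    rule_says_not_win = np.ones(draws, dtype=bool)   # the rule's verdict, regardless of the draw

    rule_right_about_A = np.mean(ticket_A != winners)    # about 1 - 1/N = 0.999
    false_cases = ticket_A == winners                    # repetitions where "A will not win" is false
    detected_when_false = np.mean(~rule_says_not_win[false_cases])   # 0.0: the rule never notices

    print(rule_right_about_A, detected_when_false)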

The authors, G & Z, discuss a few other popular examples. Take the example of “prisoners”.

Prisoner (“guilt by association”).

Security footage reveals that ninety-nine prisoners together attack a guard. One prisoner refuses to participate. Prison officials decide that since for each prisoner it is 99% probable they are guilty, they have adequate evidence to successfully prosecute individual prisoners for assault. They charge Ryan, an arbitrarily selected prisoner in the yard, with assault. A guilty verdict is returned. Given the evidence, it is highly probable that Ryan rioted. But convicting Ryan on this evidence seems epistemically inappropriate. (G & Z, 11)

G & Z rightly appeal to severity (or sensitivity) to show the epistemic inappropriateness, but what about the inappropriateness of supposing that “for each prisoner it is 99% probable they are guilty” and “Given the evidence, it is highly probable that Ryan rioted”? You could say that the probability that a randomly selected prisoner rioted is .99, but this does not mean that Ryan, the one selected, has a .99 probability of having rioted (whatever this might mean). Either Ryan is guilty or he isn’t. Moreover, the fact that we can randomly select prisoners does not mean they each had an equal probability of rioting.

Even if one wants to arrive at a method for assigning epistemic warrant to specific claims, based on proportions (it need not be a probability), the problem of the reference class must be dealt with, if the method is to be decently reliable. (Principles of indifference do not suffice.)

Putting aside this issue, G & Z are right in handling this example: nothing has been done to distinguish Ryan’s guilt from his innocence.
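A toy rendering of my own (not in G & Z) makes that point computationally: the verdict rests only on the 99-of-100 proportion, so were the charged prisoner in fact the innocent one, the method would convict him all the same, despite the high “evidential probability”:

    # Prisoner: the convicting method never distinguishes guilt from innocence
    # (toy rendering with a made-up selection process).
    import numpy as np

    rng = np.random.default_rng(0)
    n_prisoners, trials = 100, 10_000
    innocent = 0                                        # the one prisoner who refused to participate

    charged = rng.integers(n_prisoners, size=trials)    # arbitrarily selected defendant each time
    convicted = np.ones(trials, dtype=bool)             # the statistical evidence convicts everyone

    pr_charged_rioted = np.mean(charged != innocent)            # about 0.99
    innocent_cases = charged == innocent
    acquitted_when_innocent = np.mean(~convicted[innocent_cases])   # 0.0: no sensitivity/severity

    print(pr_charged_rioted, acquitted_when_innocent)

Whatever the reference-class worries raised above, the severity verdict is clear: the method’s probability of clearing the innocent prisoner, were he the one charged, is zero.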

Let me end this post with their excellent sum-up unifying sensitivity and severity:

Unification. The parallel between sensitivity and severe testing is apparent. Sensitivity is not a matter of how probable the claim is given the evidence. A judgement can have very high evidential probability, and yet be insensitive. This is exemplified by the lottery, prisoner, and sex crime examples. Instead sensitivity asks ‘were the claim false, would this falsity be detectable?’ … Severe testing likewise focuses on this subjunctive question: If the claim were wrong, would the fit between the favoured hypothesis and the data be notably weaker? And has anything been done so that were the hypothesis false, the data collected would indicate this falsity? In cases like Prisoner and Lottery, the answer is resoundingly no to both questions. (G & Z, 14)

Gardiner and Zaharatos have managed to put a new spin on severity. Their paper—which I highly recommend— has encouraged me to revisit the jungles of classic epistemology, at least where severity has something to say. But I don’t expect to start speaking like a native.

Stay tuned for part 2. In the meantime, share your thoughts in the comments to this blog.

References

Enoch D., Spectre, L. & Fisher, T. (2012). Statistical evidence, sensitivity, and the legal value of knowledge. Philosophy & Public Affairs, 40(3), 197–224.

Gardiner, G., & Zaharatos, B. (2022). The safe, the sensitive, and the severely tested: a unified account. Synthese: An International Journal for Epistemology, Methodology and Philosophy of Science, 200(5). [See the class on 4/26 from Mayo’s 2023 graduate seminar (syllabus here).]

Mayo, D. G. (2020). P-Values on Trial: Selective Reporting of (Best Practice Guides Against) Selective Reporting. Harvard Data Science Review 2.1.

Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, Cambridge: Cambridge University Press. [See this post of excerpts for proofs.]

Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge, Chicago: University of Chicago Press.

Peirce, C. S. (1931-35). Collected papers. Vols. 1-6. Edited by C. Hartshorne and P. Weiss. Cambridge: Harvard University Press.

ENDNOTES

[1] I refer to Molière (1890), for those of you who had to read Le Bourgeois Gentilhomme in high school:

My faith! For more than forty years I have been speaking prose while knowing nothing of it, and I am the most obliged person in the world to you for telling me so.

[2] See Statistical Inference as Severe Testing (SIST 2018, 18). Gambits that result in low severity or the inability to assess severity (variants of cherry-picking, P-hacking, data-dredging, and optional stopping) are examples of biasing selection effects (92). Gellerization is another term I’ve used.
