Abandon Statistical Significance and Bayesian Epistemology: some troubles in philosophy v3


Has the “abandon significance” movement in statistics trickled down into philosophy of science? A little bit. Nowadays (since the late 1990s [i]), probabilistic inference and confirmation enter philosophy by way of fields dubbed formal epistemology and Bayesian epistemology. These fields, as I see them, are essentially ways to do analytic epistemology using probability. Given its goals, I do not criticize the best-known current text in Bayesian Epistemology with that title, Titelbaum 2022, for not engaging in foundational problems of Bayesian practice, be it subjective, non-subjective (conventional), empirical or what some call “pragmatic” Bayesianism. The text focuses on probability as subjective degree of belief. I have employed chapters from it in my own seminars in spring 2023 to explain some Bayesian puzzles such as the tacking paradox. But I am troubled with some of the examples Titelbaum uses in criticizing statistical significance tests. I only came across them when flipping through some later chapters of the text while observing a session of my colleague Rohan Sud’s course on Bayesian Epistemology this spring. It was not a topic of his seminar.

1. A test of statistical significance. What is it?  First of all, it’s a test of a statistical hypothesis. There is a set of possible outcomes, a sample space, and a statistical test hypothesis H0 which assigns probabilities (or densities) to outcomes modeled in terms of the distribution of a random variable X. There is a test statistic d(X), a function of X and H0, such that the larger the observed value d(x), the smaller the probability of so large (or larger) a value, computed under H0. H0 provides this probability assignment by hypothesizing the value of an unknown parameter θ in a statistical model M. θ is viewed as a fixed but unknown quantity, except for special cases where θ may itself be regarded as a random variable that takes on values with given probabilities.

The test is a rule that maps observed values of d(x) into either “reject H0” or “do not reject H0” in such a way that there is a low probability of erroneously rejecting H0 and a much higher probability of correctly rejecting H0. “Reject H0” and “fail to reject H0” are generally interpreted as: x is evidence against H0, or x fails to provide evidence against H0 (which is not the same as evidence for H0). Titelbaum focuses on simple Fisherian tests, so I will too.

Coin tossing (Bernoulli) trials. For instance, in a random sample X of n coin tosses (X = X1, X2,…Xn), H0 might assert that θ, the probability of heads on each trial, is .5. A test statistic d(X) would be the difference between the observed proportion of heads and .5 (in units of the standard error SE). If we observe 60% heads in 100 trials, the test statistic is .6 – .5 in SE units, which yields 2. (Under H0, the SE is .05.) The probability that d(X) exceeds 2 is ~.02. This is the p-value associated with the result for testing H0.  The test infers: there is evidence of inconsistency (or discordance) with H0 at statistical significance level .02. The rationale is that 98% of the time we’d observe a smaller proportion of heads than we did, under the assumption that H0 is correct about the data generating mechanism. (Note: .98 is 1 – the p-value.)
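To make the arithmetic concrete, here is a minimal sketch (mine, not from the post) of the calculation just described, assuming Python with scipy is available:

```python
from math import sqrt
from scipy.stats import norm, binom

n, heads, theta0 = 100, 60, 0.5
se = sqrt(theta0 * (1 - theta0) / n)      # standard error under H0: 0.05
d_obs = (heads / n - theta0) / se         # observed test statistic: ~2.0
p_normal = norm.sf(d_obs)                 # one-sided normal-approximation p-value, ~.023 (the ~.02 above)
p_exact = binom.sf(heads - 1, n, theta0)  # exact binomial tail Pr(X >= 60 | H0), ~.028
print(d_obs, p_normal, p_exact)
```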

Until we check the assumptions of the model, I would say we merely have an indication of evidence. And, even if they check out, as Fisher repeatedly emphasized, isolated significant results do not suffice: “we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result” (Fisher 1947, p. 14).

7/31 addition: Titelbaum gives the 2016 ASA statement’s definition of p-values in 13.6, p. 462: p-value = Pr(outcome at least as extreme as what was observed | null hypothesis), which is OK except for the fact that the p-value is not a conditional probability. (I put this to one side for brevity.)

2. Titelbaum’s troubles. The first thing that troubles me about Titelbaum’s “troubles with significance testing” (13.2.2, 461) is that his critical examples do not concern statistical hypotheses, but the occurrence of events, such as “lands heads”, “has a disease” or “is a soccer player”, with stipulated frequentist probabilities of occurrence. There is no problem about computing these frequentist probabilities, but the probability of event A, given event B, is not B’s “p-value”.

Testing for soccer. Treating it as such can lead to absurd results, as in Titelbaum’s example: “Consider the hypothesis that John plays soccer. Conditional on this hypothesis, it’s unlikely that he plays goalie….But now suppose we observe John playing goalie” where it is stipulated “that goalies are rare soccer players….Yet it would be a mistake to conclude that John doesn’t play soccer” (Titelbaum 463). True, but Titelbaum is mistaken to think significance tests license such an inference. [ii] Since he has told us that goalies are rare soccer players (I don’t have a clue about sports), this leads to a logical contradiction. [Premises: (Gj & Sj), (Gj → Sj); conclusion: ~Sj, where G and S are the predicates “is a goalie” and “is a soccer player”, respectively, and j is the name John.] There are no p-values here, and terrible error probabilities. The probability that an inference to “John is not a soccer player” is erroneous is 1. (Also, all of us non-goalies are erroneously inferred to be soccer players.) [iii]

Titelbaum cites Dickson and Baird (2011, 219-220) as the source of the example. Since I don’t think in terms of sports, note that on their construal of statistical significance tests, observing that a person is a Nobel prize winner would license inferring she was not a person, since Nobel prize winners are rare people. This makes no sense. John being a soccer player is not a statistical hypothesis assigning probabilities to outcomes, and even if it were a statistical H, the probability of data x given H is not its p-value. I turn to this.

It’s still a mistake with statistical hypotheses: What if we have a genuine statistical hypothesis, as in the coin tossing case: H0: the probability of heads on each trial is .5? Every random sample of 100 tosses, x, has a very low probability under H0 (θ = .5) in the Bernoulli model M. But it’s incorrect to view the probability of x given H0 as the p-value. Added 7/31: Titelbaum’s “testing for soccer” example encourages this conception.
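A small sketch of the distinction just drawn (my illustration, not Titelbaum’s): under H0 every specific sequence x of 100 tosses has the same tiny probability, (.5)^100, yet the p-value attached to an unremarkable outcome, computed from the tail area of the test statistic, is not small at all.

```python
from math import sqrt
from scipy.stats import norm

n, theta0 = 100, 0.5
prob_specific_sequence = theta0 ** n    # ~7.9e-31, identical for every sequence of 100 tosses
heads = 53                              # an unremarkable outcome
se = sqrt(theta0 * (1 - theta0) / n)
d_obs = (heads / n - theta0) / se       # only 0.6 SE units from .5
p_value = norm.sf(d_obs)                # ~.27: no evidence against H0
print(prob_specific_sequence, d_obs, p_value)
```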

Anyone who reads this blog knows I don’t consider simple statistical significance tests an adequate account of statistical inference (without reformulation). Here I keep to a rudimentary Fisherian view, because that is what Titelbaum does, and that suffices to block his criticisms [iv]. Added 7/31: Neyman and Pearson show that with a single null hypothesis, the p-value will depend on a choice of test statistic. Titelbaum also raises this objection (p. 466). That’s why Neyman and Pearson introduce the alternative hypothesis and power. But Fisher prefers to rely on an appropriate choice of test statistic (and corresponding sensitivity to discrepancies of interest). See Senn’s “Fisher’s alternative to the alternative“.

When Titelbaum says “unlikely events occur, and occur all the time” (463), he thinks he’s objecting to Fisherian tests, but Fisher would agree (remember to construe Titelbaum’s “likely” as “probable”). What’s infrequent is for the test statistic d(X) to land in the rejection region computed under H0. (It can be made frequent only by violating assumptions of the model M.)
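A rough simulation of this point, under the coin-tossing setup above (again my sketch, with an arbitrary seed): individual outcomes are each improbable under H0, but landing in the rejection region d(X) ≥ 2 happens in only about 2–3% of samples generated from H0.

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta0, reps = 100, 0.5, 100_000
heads = rng.binomial(n, theta0, size=reps)                      # repeated samples drawn from H0
d = (heads / n - theta0) / np.sqrt(theta0 * (1 - theta0) / n)   # test statistic for each sample
print(np.mean(d >= 2))                                          # ~.028: rejections are rare when H0 is true
```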

3. Diagnostic screening. Titelbaum’s next criticism (464) is better known than the soccer case. Here we are given frequentist “prior” probabilities for a disease or abnormality.

Suppose a randomly selected member of the population receives a positive test result for a particular disease. This result may receive a very low p-value relative to the null hypothesis that the individual lacks the disease. But if the frequency of the disease in the general population is even lower, it would be a mistake to reject the null.

Do you agree? Is it a mistake to take a positive test result as evidence for a disease, where a positive result is very rare under no disease? If it is a mistake, then it’s even more of a mistake to take a negative result as evidence for disease. So if we required a high posterior probability of disease given a positive result, the test would have no chance of detecting disease even if present.

Doctor: Your positive result is not evidence of disease, given how rare the disease is.

Patient: Does your screening have any chance of finding evidence of disease even if present?

If a high posterior is required for evidence of disease, the doctor would have to say no.

To have some numbers, assume the probability of a positive result among those with no disease is very low, say .01, while the probability of a positive result among those with the disease is high (the typical criticism sets it at 1). If it is stipulated that the probability of the disease is sufficiently small, say .001, the posterior probability of disease, given a positive result, can still be very small, in this case ~.1. The posterior probability is correct, but the test has not done its job: to discern evidence of the rare disease.
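For readers who want to check the stipulated numbers, here is a minimal Bayes’ theorem calculation (the inputs are the example’s stipulations, not real screening data):

```python
prior_disease = 0.001          # stipulated prevalence
p_pos_given_disease = 1.0      # sensitivity, set to 1 as in the typical criticism
p_pos_given_no_disease = 0.01  # false positive rate

p_pos = (p_pos_given_disease * prior_disease
         + p_pos_given_no_disease * (1 - prior_disease))
posterior = p_pos_given_disease * prior_disease / p_pos
print(posterior)               # ~.09 (roughly .1): low, even though a positive result is
                               # 100 times more probable under disease than under no disease
```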

Titelbaum’s diagnostic screening criticism is based on assuming that taking a positive result + as evidence for abnormality or disease requires Pr(disease|+) to be high. I don’t think most Bayesians would agree with this, given that Pr(+|disease) is 100 times Pr(+|no disease). Bayesians generally allow that evidence confirms a claim (to some degree) if its posterior probability has increased from its prior.

The example is crudely simplistic, but it’s the one we’re given. For frequentists to infer evidence of disease is not to infer that the disease is definitely present: the inference is a statistical claim, with given error probabilities. With diagnostic screening, the error statistical report is generally in terms of sensitivity and specificity. In this example, it is also stipulated that there are prior frequentist probabilities or prior prevalences. If so, my frequentist doctor reports that while the positive result is evidence of the presence of abnormality, there is a very low frequentist probability of disease among those testing positive, here ~.1.

It would be unusual for a diagnostic test of disease to infer its presence or absence, rather than to infer evidence of some sign of abnormality warranting a call-back (e.g., in breast-cancer screening). There would also typically be a report of degrees and type of abnormality. My doctor would also add that the proportion with disease among callbacks is low, especially among women with characteristics q, r, s, etc., for further investigation. The question of what action to take is distinct: with luggage, ringing the alarm just leads to pat-downs or rummaging through your bag for prohibited objects. Of course, some diagnostic tests are too sensitive, or not sensitive enough, for +/– results to be informative. That’s a different, substantive, issue.

The base rate fallacy. Titelbaum says, “Admittedly the framing of [the medical screening] example brings prior probabilities into the discussion, which the frequentist is keen to avoid. But ignoring such considerations encourages us to commit the Base Rate Fallacy” (464). He assumes, in other words, that since frequentists do not assign probabilities to hypotheses where doing so is illegitimate, they will “abstain” from assigning them even when proper frequentist probabilities are given!

It is not that the frequentist is keen to avoid prior probabilities of events with given frequentist probabilities. She is keen to avoid assigning probabilities to hypotheses unless they can be construed as random variables with frequentist distributions. Examples such as diagnostic screening with given probabilities are just ordinary conditional probability examples. Many would deny there’s anything Bayesian about it, unless any use of conditional probability is to be called Bayesian.

By and large, by the way, Bayesians share the frequentist perspective that parameters in a model are constants (except for random effects). The difference is that Bayesians are prepared to assign probabilities to constants by allowing probability to be degree of belief assignments.

Frequentists do not abstain from assigning probabilities to claims when legitimate frequentist priors are given, but they will critically evaluate how those priors are arrived at. Where do we get relevant base rates? Which reference class should we use?

Probabilistic instantiation fallacy. Thus far I have criticized Titelbaum’s diagnostic screening example for assuming evidence of H requires the posterior probability of H to be high. I deny this, even where priors are given. For one thing, it led to never having evidence for a rare disease, even if present. Even hypotheses that are probable in some sense need not have been well probed by the data x. I now go further. What about the assumed prior probability? In the diagnostic screening example, it is assumed that if an individual, say Jill, is randomly selected from a population where 1 out of 1000 have a property (e.g., disease), then the probability that Jill has the property is .001. But this is a fallacy, akin to assuming that a randomly selected .95 confidence interval estimate has a probability of .95 of covering the true parameter value. [Granted, applying uniform priors can yield .95 as the probability that the particular estimate is correct.]

It’s important to realize that probability, in randomly selecting members of a population, refers to the selecting process. Even if we imagine Jill was randomly selected from a population where the abnormality is very rare, it does not follow that the probability Jill has it is low. To suppose it does is to commit the fallacy of probabilistic instantiation.
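A sketch of the confidence-interval analogy above (my illustration, with made-up parameter values): the .95 describes the long-run coverage of the estimating procedure over repeated samples, while any one realized interval simply does or does not cover the fixed true value.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma, n, reps = 10.0, 2.0, 25, 10_000
half_width = 1.96 * sigma / np.sqrt(n)   # known-sigma interval, for simplicity
covered = 0
for _ in range(reps):
    xbar = rng.normal(true_mu, sigma, n).mean()
    covered += (xbar - half_width <= true_mu <= xbar + half_width)
print(covered / reps)                    # ~.95: a property of the procedure, not of any single interval
```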

I have discussed this elsewhere [v]. Peter Achinstein (2010, 187), who considers himself an objective Bayesian epistemologist, says it would be a fallacy if he were a frequentist:

My response to the probabilistic fallacy charge is to say that it would be true if the probabilities in question were construed as relative frequencies. However, … I am concerned with epistemic probability.

Achinstein is prepared to assign objective epistemic probabilities in this way if “the only thing known” is that Jill was randomly selected from a population where 1 in 1000 have the disease. The problems in supposing we get knowledge from ignorance, indifference, or uninformative priors are very old. I’m not sure what Titelbaum’s position is on them. His diagnostic screening example was intended to give frequentist probabilities.

I am not saying frequentists can never assign probabilities to Jill having a disease, or other specific events. With disease, it will involve a combination of genetic, environmental, and several other background variables to assess the relevant reference class in which to place Jill. Perhaps the new AI/ML techniques will provide them. When they do, the question of whether and how to employ them in evaluating evidence of hypotheses will still be a separate issue.

Please share corrections, reactions, and questions in the comments. Connections to the previous 5 guestposts are especially welcome.

I thank Conor Mayo-Wilson for his comments, reflected in 7/31 revisions.

Notes:

[i] In an earlier period, as in the 80s, philosophers of science engaged regularly with statistics.

[ii] Using “likely” both for the probability of events and for the technical sense in which a hypothesis is “likely” is a slippery business. For example, many rival hypotheses can all be maximally likely, in the technical sense.

[iii] Here, H0 is: John is a soccer player; the test rule is: observe goalie, infer not soccer; observe not goalie, infer soccer. If you have another construal, let me know.

[iv] Although I prefer an account closer to Neyman and Pearson (N-P), with an explicit alternative, there are important roles for Fisherian tests where the choice is left to specifying the test statistic d(X); notably, testing assumptions of the model. Fisher gives a demanding set of requirements for an adequate test statistic (see SIST). For a broad class of tests, as David Cox shows, Fisher’s tests lead to the same place as N-P.

[v] E.g., Mayo 1997, 2005, 2010, 2018. See also Spanos 2010. In the Achinstein discussion, a student is randomly selected from a population where college-readiness is rare. High test scores still yield a low posterior probability for being classified as ready.

References

  • Achinstein, P. (2010). Mill’s Sins or Mayo’s Errors. In Error and Inference, D. Mayo and A. Spanos (eds.). Cambridge University Press.
  • Cox, D. and Hinkley, D. (1974). Theoretical Statistics. London: Chapman and Hall.
  • Mayo, D. (1997). Response to Howson and Laudan. Philosophy of Science 64(1): 222-244 and 323-333.
  • Mayo, D. (2005). Evidence as Passing Severe Tests: Highly Probable versus Highly Probed Hypotheses. In Scientific Evidence, P. Achinstein (ed.). Johns Hopkins University Press: 95-127.
  • Mayo, D. (2010). Sins of the Epistemic Probabilist: Exchanges with Peter Achinstein. In Error and Inference. D. Mayo and A. Spanos (eds.).  Cambridge University Press: 189-201.
  • Mayo, D. (2018): SIST. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge University Press. (See Section 5.6  in Excursion 5,  Tour II: Positive Predictive Value: Fine for Luggage.)
  • Mayo, D. and Hand, D. (2022). Statistical significance and its critics: practicing damaging science, or damaging scientific practice? Synthese 200.
  • Spanos, A. (2010). Is Frequentist Testing Vulnerable to the Base-Rate Fallacy? Philosophy of Science, 77(4), 565–583.
  • Titelbaum, M. (2022). Fundamentals of Bayesian Epistemology 2: Arguments, Challenges, Alternatives. Oxford University Press. (Section 13.2.2)


15 thoughts on “Abandon Statistical Significance and Bayesian Epistemology: some troubles in philosophy v3”

  1. Christian Hennig

    Of course there’s “trouble with significance testing” if tests are used that are not any good. No frequentist would deny that. This is very similar to criticising Bayesian inference based on silly results obtained from using very silly priors (and then misinterpreting the results on top of that).

    • But I think that Titelbaum is not setting out to make fun of statistical significance tests, and he has a responsibility to correctly define p-value. Recall my urging you to qualify your description of statistical significance test reasoning in your recent guest post. This was the reason.

    • Yes, but it seems difficult to explain what makes some examples like Titelbaum’s (inferring John is not a soccer player upon finding he is a goalie, hence a soccer player) silly–or rather, not even examples of significance tests. I thought my criticism was pretty straightforward. For starters, every inference to non-soccer would be wrong (as it’s given that goalies are rare soccer players). I have most of the serious ones in my (2018) book.

  2. Zach

    Many of these problems are on display in XKCD’s frequentist vs Bayesian comic. I recently wrote up what I thought was wrong with the comic, and think the arguments are in line with what you say here. I would be interested in your thoughts: https://smthzch.github.io/posts/xkcd_freq.html

    • Zach:
      Thanks for your comment. I will look at your link. You can find ‘comedy hour’ posts on this blog, and in my book Statistical Inference as Severe Testing: How to get beyond the statistics wars (CUP 2018). Each begins with “did you hear the one about the frequentist…?” But I distinguish serious confusions based on misunderstanding or presupposing a philosophy of statistics at odds with frequentist error statistics from those howlers that are mere appeals to ridicule. Perhaps they’re hard to distinguish at times. I don’t know Titelbaum, but I presume he believes his examples are licensed by Fisherian tests (though perhaps not by N-P tests, with their type 2 errors). But Fisher had strict requirements for adequate test statistics that, if followed, result in tests that jibe with N-P. Blame goes to the 2016 document for neither mentioning what a test statistic requires nor considering alternative hypotheses. There was quite a lot of disagreement about that document. (I know because I was there.) Readers can search this blog for details.

  3. I should note that although Titelbaum’s discussion begins with the 2016 ASA statement on statistical significance, his faulty examples of p-values are not found there or, for that matter, in Wasserstein 2019. However, the “screening model” example that comes next can be found in “redefining significance” (Benjamin et al. 2017). People can write down the approximately correct definition of p-values, but not know how to apply it. That is what I think may be happening here. There’s also a tendency to employ someone else’s criticism, as with the soccer player howler, rather than go to a direct source of the definitions. But I admit both the 2016 and 2019 documents encourage a careless attitude, and that is their greatest damage. If the executive director is treating p-values with ridicule, then there’s a sense that “anything goes”. I’ve discussed this with Wasserstein directly, as have others, so there’s no secret here.

  4. Yudi Pawitan

    Two comments to add to the Discussion:

    1. Soccer example and statistical hypotheses: I see no problem in setting “A is a soccer player” or any other event as a hypothesis. To be non-trivial, they just need to be logically distinct from the evidence and remain hidden. If the evidence is “A is a goalie”, then it’s trivial that A is a soccer player, so it does not make sense to reject the hypothesis no matter how rare a goalie is. Consider instead “A is a basketball player”, and the evidence: A is 220cm, can run 40m in 5sec, jump 80cm vertically, has super cardio, and nothing else. Then the hypothesis is quite legitimate. The idea that the p-value alone is not (or should not be) the only basis for rejection of a hypothesis seems rather trivial. Other logical elements are needed to make it sensible, e.g. logical separation between hypothesis and evidence, better power at alternatives, etc. In practice, whenever the P-value is regularly used, typically these are satisfied.
    2. Diagnostic test example. Consider a very rare motor-neuron disease (MND), e.g. due to a genetic mutation, where many suggestive non-genetic tests have larger false-positive probabilities than the population prevalence of the MND. In this case, the posterior probability of disease given a positive test is indeed small. From my experience at work, the specialists will intuitively concur with the posterior probability, i.e. not to trust the non-genetic tests fully, but to use them only as a basis for pursuing other tests. They will eventually perform genetic tests to search for the causal basis, in effect trying to get a zero p-value. (The genetic cause here is only illustrative; it can be generalized to other causes, e.g. infections. Doctors in the western world are extremely reluctant to prescribe antibiotics, unless they see conclusive evidence from a bacterial culture test, again trying to get the p-value to zero.)
    • Yudi:
      I didn’t get into this in this post (maybe I should), but a specific event, e.g., Jill has an abnormality, can be treated as a hypothesis by imagining that the probabilistic assignments given from the background are the assignments given by the specific event. But then the “prevalence” claim no longer holds for Jill. In order to engage the examples given by Colin Howson, I always grant it may be converted to a statistical hypothesis this way (e.g., Mayo 1997), but having converted it, the probabilities from random sampling no longer apply to the specific case, or need not. This is especially problematic when the diagnostic screening model of statistical tests is extended to genuine statistical hypotheses, as when H, having been selected from an urn with 50% true claims, is thought to have frequentist probability .5. The reference class problem looms large, but even if you had the right one for H, it wouldn’t show how much evidence there is for H, in my view.

      • Yudi Pawitan

        I don’t think it’s the issue of ‘specific event’. Let’s try to be more concrete, i.e. whether “Jill has Duchenne muscular dystrophy (DMD)”. For me that’s a legitimate hypothesis (which will explain her observable symptoms). There are some motor-neuron related symptoms due to this disease; they have certain prevalences, which will apply when thinking about Jill. These symptoms are not specific, i.e. could be due to other diseases, so could be false positives for DMD. So all the standard statistical thinking applies to Jill’s case.

        • Yudi:
          You can call any claim a hypothesis, but it won’t be a statistical hypothesis of the kind for which p-values are defined. The point is simply whether there’s an assignment of probabilities to values of a random variable, but I agree that if you stretch it, and use the background assignments as if they’re part of a model, it can be seen as such. This doesn’t solve anything.

  5. Your summary of the soccer example is uncharitable, I believe, and your summary of Titelbaum’s definition of p-value is misleading.

    Titelbaum does define a p-value to be a conditional probability, which does obscure two distinctions that you make, namely, the distinctions between (i) statistical hypotheses and non-statistical ones and (ii) conditioning and supposition. Those distinctions are controversial. But even if he wanted to draw them, Titelbaum does *not* equate the p-value of the observed outcome x relative to the null H0 with p(x|H0). His definition — in equation 13.6 on page 504 — explicitly defines a p-value in terms of observations “at least as extreme” as x0. So your summary of his definition is misleading.

    The soccer player example strikes me as perfectly fine too. Because his book is intended to be accessible to the average undergraduate, Titelbaum doesn’t introduce needless formalism. But if one wants to flesh out the example to avoid the purported problems raised above, one can. Let N be the population of all *non*-soccer players at a high school and S be the set of all soccer players. Let H0 be the hypothesis that the subject of our conversation, John, was selected at random from S. Let H1 be the hypothesis he was selected at random from N. Suppose the high school year book lists all students’ positions on sports teams. Then H0 and H1 are statistical hypotheses assigning probabilities to events like “John is forward for the basketball team” being listed on John’s yearbook page. Let x0 be the observation that John is listed as “goalie for soccer team” in his school year book. Goalie is the rarest position on the soccer team. So if we define “x is at least as extreme as y under H” by the relation P_H(x) <= P_H(y), then on the supposition that H0, the p-value of x0 simply is P_H(x0), just as Titelbaum claims. In general, if one defines “more extreme than” in this conventional way, the “more extreme than” relation makes no difference to the calculation of a p-value of a point null hypothesis H0 if the observation x0 is the least likely one on H0. Titelbaum’s example seems to me to be a fine criticism of a standard interpretation of Fisherian testing whereby one rejects a null if the p-value is low. The example nicely illustrates that a low p-value is not good evidence against a null if the p-value under the alternative H1 is lower. That’s something you agree with.

    As Sam Fletcher and I note (footnote 9 here: https://core.ac.uk/download/pdf/295732731.pdf), the mathematical definitions of p-value differ from one (rigorous theoretical) textbook to the next. What is agreed upon is not a definition, but how to calculate p-values in certain canonical contexts. So saying that Titelbaum has not calculated a p-value correctly in a non-canonical context is, I believe, not particularly charitable.

    • Thank you so much for your comment Conor! I don’t know why it was held up in moderation. I received your comment in my email, assuming it was up, but noticed in posting this reply that it was not. WordPress is holding all comments in moderation and they don’t know why, possibly if there’s a link in them. They tell me it’s all done by AI. This is quick as I’m in a Dr.’s office.

      The soccer example is quoted from Titelbaum, it’s not a summary. I did try hard to make sense of it. You say “that Titelbaum has not calculated a p-value correctly in a non-canonical context is, I believe, not particularly charitable”. But why assume it is applicable to “a non-canonical context”? The whole point of my post—perhaps the only point– is to block doing so. The problem with the soccer example is not lack of “more extreme”, it’s the problem as I describe it in the post, namely, that Titelbaum is mistaken to think significance tests license an inference to John is not a soccer player. “Since he has told us that goalies are rare soccer players (I don’t have a clue about sports), this leads to a logical contradiction. [Premises: (Gj & Sj), (Gj → Sj); conclusion: ~Sj, where G and S are the predicates “is a goalie”, “is a soccer player”, respectively, and j is the name John.] There are no p-values here, and terrible error probabilities. The probability that an inference to “John is not a soccer player” is erroneous is 1. (Also, all of us non-goalies, are erroneously inferred to be soccer players.)”

      Your high school yearbook doesn’t help the case. Note too that you have to add probability assignments that are not in the events in order to convert them into statistical hypotheses. (I’ve done that myself in order to entertain so and so’s criticism.) But what is the point of requiring the probability of the test statistic be computable from H0 alone (as part of the definition of a test statistic), if one can arrange background probabilities to be part of the event? I admit to not understanding how this example can be seen as an application of Fisherian tests, except as I have tried to. His claim that for Fisher “sufficiently unlikely outcomes are practically impossible” (463) further underwrites the conception I attributed to him. I revised that for the imagined case of a statistical hypothesis. You say: “The [soccer] example nicely illustrates that a low p-value is not good evidence against a null if the p-value under the alternative H1 is lower. That’s something you agree with.” I don’t get what I’m supposedly agreeing with. I don’t. I’ve added Titelbaum’s definition of p-values (13.6, p. 462) to the post. It’s just not applied in the examples I have troubles with. Larry Wasserman is clearest on this; when I get back, I’ll post it. https://normaldeviate.wordpress.com/2013/03/14/double-misunderstandings-about-p-values/

      I admit there are shortcomings to p-values; if I didn’t I would not have sought to reformulate error statistical tests. If one is setting out to criticize a strict Fisherian test with just a single hypothesis, it’s important to include his requirements for a test statistic d(X). I prefer to make the alternative explicit as with N-P tests. But, as David Cox, Stephen Senn and others have shown, you can get to the same destination with a proper notion of a test statistic. The 2016 ASA statement does not make it clear whether they are alluding to p-values in general or only Fisherian tests—I assume it’s the former. The majority of applications to which it is to apply use the notion of a test’s power, and thus go beyond Fisher’s theory (though he had “sensitivity”). Since it is this full theory that matters for evaluating statistical significance tests, that is what needs to be taken into account. There are many other issues in the chapter that I didn’t take up in the post to keep it focused on the soccer and diagnostic example.

  6. Hi all,

    Thanks so much to everyone for such thoughtful consideration of my chapter, and to Prof. Mayo for alerting me that this interaction was happening on the blog and inviting my comments. Many excellent clarifications have been made above, so at this point I think I’ll make two comments:

    1. The point about a p-value not being a conditional probability is an interesting one that I’ll have to think about. I understand that the null hypothesis always comes with a probability distribution from which the p-value is calculated, and that it makes no sense to think about a probability conditional on such a distribution. What’s tricky is that there’s often a proposition associated with the null hypothesis as well. For instance, the main example I use in my text to illustrate p-values is one on which the members of a second-grade class are found to have an uncommonly high mean IQ score. I imagine someone considering a number of possible explanations (the teacher’s somehow improving scores? students assigned into the class for specific reasons? etc.), and contrasting those with the null hypothesis that the students have been selected randomly from the general population, nothing the teacher does affects IQ test results, and the high mean is simply due to variance and the luck of the draw. One goes on to calculate the p-value for that null hypothesis by looking at the probability distribution of IQ scores in the general population, which is not an ingredient of a conditional probability. But I wonder if it’s necessarily a mistake to also treat that p-value as a probability conditional on the proposition associated with the null hypothesis? (Namely the proposition that the high IQ is due entirely to chance.) I’d be very interested in everyone’s thoughts on that question.
    2. As for the various examples Prof. Mayo cites and criticizes: I think one of the problems we’re having here is about terminology. Near the very beginning of her post, Prof. Mayo writes, ‘“Reject H0” and “fail to reject H0” are generally interpreted as x is evidence against H0, or x fails to provide evidence against H0, (which is not the same as evidence for H0.)’ If one interprets “reject H_0” as “x is evidence against H_0”, then much of what I say in the examples she quotes seems nonsensical. But if you look more broadly at the section of my text from which those quotes come, you’ll see that what I’m doing in that section is working through many possible readings of frequentists’ “rejection” terminology. On p. 463 specifically, I’ve just floated the interpretation that one could use Cournot’s Principle (treating sufficiently unlikely outcomes as practical impossibilities) to license a disjunctive syllogism and treat the null as false for practical purposes when its p-value is sufficiently low. The soccer example from Dickson and Baird, and positive disease result example, come immediately after that proposal, and are meant to show why it is a bad idea, not anything about the interpretation that takes “reject” to indicate “evidence against”.
      In the very next paragraph, I move on to questioning significance testing “in more subtle ways”. On p. 465, I eventually reach the proposed interpretation that a sufficiently low p-value indicates that the test “disconfirms” the null, which I’ve made quite clear earlier in the text is my Bayesian way of saying it provides evidence against the null. I then go on to worry about Fisherian significance testing as an indication of evidence against the null, for reasons of choice of test statistic, etc.
      Now one might complain that reading the frequentist’s “reject” as anything like “treat as false” is just a horrible misrepresentation of the view, and not something that should even be brought up. There I’d have to plead that, first of all, this is an introductory text and this is an interpretation many students have been given and find plausible, and second, that it’s an interpretation that has been taken seriously (for better or worse) in the literature. (That disjunctive syllogism idea certainly didn’t originate with me!) But I absolutely agree that this interpretation isn’t a good one, and that serious, high-level philosophical discussion of frequentism shouldn’t hold the view to it.
    • Michael:
      Thanks so much for your comment on my blogpost.

      “Now one might complain that reading the frequentist’s “reject” as anything like “treat as false” is just a horrible misrepresentation of the view, and not something that should even be brought up.”

      I don’t see where construing rejection of H as treat H as false removes any of the criticisms I raise. Don’t forget, though, that statistical falsifications always include the associated error probabilities. (They don’t deductively infer “not-H”.)

      “On p. 465, I eventually reach the proposed interpretation that a sufficiently low p-value indicates that the test “disconfirms” the null, which I’ve made quite clear earlier in the text is my Bayesian way of saying it provides evidence against the null. I then go on to worry about Fisherian significance testing as an indication of evidence against the null, for reasons of choice of test statistic, etc.”

      Yes, I didn’t go on to discuss your worries about Fisherian significance testing because of choices of test statistic. I will now. It’s incorrect to speak of “the null.” Different test statistics are testing different claims, and different ways claims can be false. Therefore, what is inferred when various different null hypotheses are falsified differs. Test statistics are very special statistics: their distributions must be known under the corresponding statistical null hypothesis, and large (or “extreme”) values of the test statistic indicate relevant departures or discrepancies from the null hypothesis. A test statistic in the fair coin tossing example is not the sample mean M, but (for one sided departures): (M – .5) in units of the standard error (SE). (For two-sided departures, the corresponding absolute value would serve.)* A P-value is essentially the same as a report of the corresponding test statistic, varying according to the question. (The distribution of a continuous P-value is uniform under the null.) In some cases, the question concerns the value of a parameter in a model; in other cases, the question concerns the assumptions of a statistical model. The data are modelled in different ways and the p-values differ, just as they should. Of course, Neyman-Pearson statistical significance tests make the alternative hypotheses explicit, and then test statistics fall out from considering the test’s power. (Generally, the latter way of proceeding is preferable, in my view, but it’s also implicit in Fisher’s requirements of data reduction and sensitivity.)
      *Familiar test statistics are Z, t, F, chi-square.
      Aris Spanos says much more about this in his post

      Aris Spanos Guest Post: “On Frequentist Testing: revisiting widely held confusions and misinterpretations”

    • Michael:

      To continue the remarks from my last comment, which go beyond the issues in my post, I’m not saying that there is just one way of testing a given null hypothesis. E.g., some use ratios, others, differences. In an earlier guest post by Stephen Senn, he describes Fisher’s position: “All that you can do is apply your experience to use statistics, which when employed in valid tests, reject the null hypothesis most often.” (Senn post)–at least when it is false. The test statistic, he says, is more “primitive” than the N-P appeal to an alternative. In Senn’s own work on clinical trials, sometimes the appeal is to experience, other times to analysis of the N-P sort. You might be interested in his comment on Spanos today:

      https://errorstatistics.com/2024/08/08/aris-spanos-guest-commentary-on-frequentist-testing-revisiting-widely-held-confusions-and-misinterpretations-2/comment-page-1/#comment-265024

      Now Bayesians use numerous different measures of “confirmation”, as Fitelson and others (e.g., Popper) point out. The quantitative confirmation changes. I know that some measures seem better at avoiding certain fallacies, but does the Bayesian have a general way to critically evaluate the differences in degrees of confirmation that result?

