We spent the first half of Thursday’s seminar discussing the Fisher, Neyman, and E. Pearson “triad”[i]. So, since it’s Saturday night, join me in rereading for the nth time these three very short articles. The key issues were: error of the second kind, behavioristic vs evidential interpretations, and Fisher’s mysterious fiducial intervals. Although we often hear exaggerated accounts of the differences in the Fisherian vs Neyman-Pearson (N-P) methodology, in fact, N-P were simply providing Fisher’s tests with a logical ground (even though other foundations for tests are still possible), and Fisher welcomed this gladly. Notably, N-P showed that, when only a single (null) hypothesis is considered, it is possible to have tests where the probability of rejecting the null when it is true exceeds the probability of rejecting it when it is false. Hacking called such tests “worse than useless”, and N-P developed a theory of testing that avoids such problems. Statistical journalists who report on the alleged “inconsistent hybrid” (a term popularized by Gigerenzer) should recognize the extent to which the apparent disagreements on method reflect professional squabbling between Fisher and Neyman after 1935. [A recent example is a Nature article by R. Nuzzo; see note ii below.] The two types of tests are best seen as asking different questions in different contexts. They both follow error-statistical reasoning.
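To make the “worse than useless” point concrete, here is a minimal numerical sketch of my own (a toy Normal-mean example with made-up numbers, not anything from the triad itself): with a deliberately wrong-tailed rejection region, the probability of rejecting the null when it is true (0.05) exceeds the probability of rejecting it when the alternative is true, which is exactly the pathology that attention to power rules out.

```python
# Toy sketch (hypothetical numbers): a test whose power is below its size.
# Test H0: mu = 0 against H1: mu = 1, sigma = 1, n = 25, but (perversely)
# reject H0 when the sample mean falls in the LOWER 5% tail.
from scipy.stats import norm

n, sigma, alpha = 25, 1.0, 0.05
se = sigma / n**0.5
cutoff = norm.ppf(alpha, loc=0, scale=se)   # reject if xbar < cutoff

size = norm.cdf(cutoff, loc=0, scale=se)    # P(reject | mu = 0) = 0.05
power = norm.cdf(cutoff, loc=1, scale=se)   # P(reject | mu = 1), essentially 0

print(size, power)  # power far below size: a "worse than useless" test
```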
We then turned to a severity evaluation of tests as a way to avoid classic fallacies and misinterpretations.
“Probability/Statistics Lecture Notes 5 for 3/20/14: Post-data severity evaluation” (Prof. Spanos)
[i] Fisher, Neyman, and E. Pearson.
[ii] In a recent Nature article by Regina Nuzzo, we hear that N-P statistics “was spearheaded in the late 1920s by Fisher’s bitter rivals”. Nonsense. It was Neyman and Pearson who came to Fisher’s defense against the old guard. See for example Aris Spanos’ post here. According to Nuzzo, “Neyman called some of Fisher’s work mathematically ‘worse than useless’”. It never happened. Nor does she reveal (if she is aware of it) the purely technical notion being referred to. Nuzzo’s article doesn’t give the source of the quote; I’m guessing it’s from Gigerenzer quoting Hacking, or Goodman (whom she is clearly following and cites) quoting Gigerenzer quoting Hacking, but that’s a big jumble.
N-P did provide a theory of testing that could avoid the purely technical problem that can theoretically emerge in an account that does not consider alternatives or discrepancies from a null. As for Fisher’s charge against an extreme behavioristic, acceptance sampling approach, there’s something to this, but as Neyman’s response shows, Fisher, in practice, was more inclined toward a dichotomous “thumbs up or down” use of tests than Neyman. Recall Neyman’s “inferential” use of power in my last post. If Neyman really had altered the tests to such an extreme, it wouldn’t have required Barnard to point it out to Fisher many years later. Yet suddenly, according to Fisher, we’re in the grips of Russian 5-year plans or U.S. robotic widget assembly lines! I’m not defending either side in these fractious disputes, but alerting the reader to what’s behind a lot of writing on tests (see my anger management post). I can understand how Nuzzo’s remark could arise from a quote of a quote, doubly out of context. But I think science writers on statistical controversies have an obligation to try to avoid being misled by whomever they’re listening to at the moment. There are really only a small handful of howlers to take note of. It’s fine to sign on with one side, but not to state controversial points as beyond debate. I’ll have more to say about her article in a later post (and thanks to the many of you who have sent it to me).
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-339). Hillsdale: Lawrence Erlbaum Associates.
Hacking, I. (1965). Logic of statistical inference. Cambridge: Cambridge University Press.
Nuzzo, R. (2014). “Scientific method: Statistical errors: P values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume”. Nature, 12 February 2014.
Journalists of statistics, particularly those who jump on the popular bandwagon to rehearse foibles of significance testing, purport to be genuinely interested in improving the scientific status of the discussion, if not also of the practice, of statistical science. If so, they should strenuously resist repeating the standard mash-up of sloganeering coming from one side of the aisle. It prevents their being taken entirely seriously, even if people generously say, “well, s/he didn’t know; she was earnestly trying to provoke honest conversation”. Well, after the 5th or 20th time, that excuse won’t wash. I’d like to see one, just ONE, popular article where we can hear the voices of those in favor of frequentist error-statistics over other statistical philosophies. Let’s hear their responses to the hackneyed, knee-jerk criticisms, and permit a genuine evidence-based conversation to ensue.
It is curious when the p-value bashers maintain their support for “all trials” and the Cochrane Collaboration in medicine, and their strictures against verification biases. The use of RCTs and strictures against “searching for the pony” are at home in frequentist, not Bayesian, statistics. Likewise, randomization has no real home in Bayesianism, and multiple testing need gather no dirt for them either.
e.Berk: You’re behind the times. You can read Bayesian Data Analysis, 3rd edition, chapter 8, for the role of randomization in Bayesian statistics. For multiple comparisons, see this answer and the links therein to exactly this question.
Corey: I don’t think so. Randomization’s role is to ground significance levels, which Bayesians reject. They condition on the data and observed likelihoods, so the error probabilities and the sample space are irrelevant. Colin Howson argues for judgment samples and rejects long runs as irrelevant. (Worrall does too.) I think you may mean Wasserman saying, on this blog or his, that randomization could let the Bayesian posterior be less biased by the prior, but that’s not its use in significance tests.
e.Berk: Randomization’s role *in the approach you favor* is to ground significance levels. It does not then follow that randomization has no home in Bayesian statistics. For insight into what the Bayesian approach actually entails, you’d do far better to pay attention to Gelman, an expert in applied statistics and author of the aforementioned Bayesian Data Analysis (possibly the best graduate-level Bayesian statistics text), than to Bayesian philosophers such as Howson and Worrall.
Let me be specific (and flesh out what Lauren Muller wrote too): just as randomization secures the probabilistic assumptions upon which significance testing rests, it also helps to secure the modeling assumptions upon which Bayesian inferences are based.
Corey: This may be getting the post off topic, but since you’re on it, and I’m sort of moderating, I might point out to readers that Gelman himself bemoans the fact that Bayesians tend to regard their models as expressions of subjective beliefs and thus as not open to testing. For his article in our RMM volume see:
Click to access Article_Gelman.pdf
(Gelman-Bayes may be searched on this blog.) If we are going to test model assumptions, which I’m all in favor of, I think the statistical assumptions should be checked separately from the prior, to pinpoint blame reliably. In any event, if the priors are to be tested (perhaps using significance testing), we need to know what they are asserting.
The logic of randomization in clinical trials cannot be said to be purely statistical in origin. A key reason it was accepted here by Bradford-Hill in Britain in the 1940s was that it was also an unbiased, objective means of assigning patients to treatments. Thus, Bayesians could still use it for clinical, as opposed to statistical reasons. The uptake of randomization in trials appears to be because it met multiple logics simultaneously.
Lauren: I wouldn’t say that any experimental design procedure is purely statistical in origin, but in order for it to have a rationale within a methodology of statistical inference, it must play a clear inferential role. Strictly speaking, randomization conflicts with the Likelihood Principle, which says once the data are in hand, the sampling rule is inferentially irrelevant. Of course, Bayesians have other techniques which they regard as enabling balance and matching.
It is quite misleading to say that “strictly speaking, randomisation conflicts with the Likelihood Principle”! That principle only says that the evidence in the data relevant to the parameter of interest is in the relevant likelihood function. It is, properly, silent on the issue of how to obtain an unbiassed sample that is useful for testing any particular hypothesis about the parameter of interest.
Even when randomisation is not needed to ground a statistical method (as it is, for example, to enable Student’s t-test to be an analogue of a permutations test) it is still a great way to maximise the chances of getting a sample that allows safe scientific conclusions. Lauren is correct.
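To make the t-test/permutation-test point concrete, here is a minimal sketch of my own (simulated data, hypothetical group sizes): the random assignment itself grounds a permutation-based significance level, which Student’s t-test then approximates.

```python
# Hypothetical sketch: random assignment grounds a permutation significance
# level; Student's t-test serves as an analogue of the permutation test.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
treated = rng.normal(1.0, 1.0, 12)   # simulated outcomes under treatment
control = rng.normal(0.0, 1.0, 12)   # simulated outcomes under control

obs_diff = treated.mean() - control.mean()
pooled = np.concatenate([treated, control])

# Re-randomize the treatment labels many times; the significance level is
# the proportion of re-assignments giving a difference at least as large.
reps, count = 10000, 0
for _ in range(reps):
    perm = rng.permutation(pooled)
    count += (perm[:12].mean() - perm[12:].mean()) >= obs_diff
perm_p = count / reps

t_p = ttest_ind(treated, control, alternative="greater").pvalue
print(perm_p, t_p)   # the two p-values are typically close
```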
Michael: Thank you. Without getting into any detail, let me just say that these are all error-statistical or frequentist or sampling theory arguments with error statistical concepts and frequentist rationales, as with significance tests. The allusion by E. Berk was to subjective Bayesian arguments. My remark was about Neyman vs Fisher and getting the history right.
Mayo: How does randomisation conflict with the likelihood principle? As far as I can see the likelihood principle says nothing about randomisation.
The Likelihood Principle implies that the stopping rules do not affect the evidence in the likelihood function, but that does not mean that it is not necessary to ensure that the sample is representative of any population about which one would like to make an inference.
Michael: It is not misleading. Someone who holds the Likelihood Principle, subjective or objective, or likelihoodists like Royall, will deny the use of the randomization mechanism used for assigning treatments in obtaining probabilistic assessments of evidence (as with significance levels). They relinquish what many consider the central reason to regard Fisher’s introduction of randomization as so brilliant: as a way to obtain significance probabilities to measure evidence of inconsistency with a model.
Anonymous, your conjecture about the attitudes of those who hold the Likelihood Principle to be a useful principle is not really very important to the issue as to whether the Likelihood Principle implies anything about the role of, or need for, randomisation.
Michael: Actually anon is correct. Once the data are in hand, an LPer cannot, and purports not to want to, make use of the sampling distribution over outcomes other than the one observed. So the significance levels generated by randomized assignments are not deemed part of the post-data evidential appraisal. I’m not sure if you’re defending the LP (strong likelihood principle)? Since it permits strongly misleading interpretations of data with high probability (and not only by rendering optional stopping irrelevant), it’s hard to see why one would hold it and at the same time be concerned with error probabilities. That is why Birnbaum denied the LP could even count as a viable concept of evidence: no error control.
Check Berger and Wolpert, The Likelihood Principle.
Too late for blogging….
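Since optional stopping came up above, here is a minimal simulation sketch of my own (not from Berger and Wolpert, and with made-up settings) of the error-statistical worry: keep sampling until a nominally significant z-value appears, and the actual probability of reporting “significance” under a true null climbs far above the nominal 0.05, even though the likelihood function is unaffected by the stopping rule.

```python
# Hypothetical sketch: optional stopping ("try and try again") inflates the
# actual probability of a nominally significant result under a true null.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
trials, n_max, alpha = 5000, 500, 0.05
z_crit = norm.ppf(1 - alpha / 2)

hits = 0
for _ in range(trials):
    x = rng.normal(0.0, 1.0, n_max)      # H0 true: the mean really is 0
    ns = np.arange(1, n_max + 1)
    z = np.cumsum(x) / np.sqrt(ns)       # z-statistic after each new observation
    if np.any(np.abs(z[9:]) > z_crit):   # peek from n = 10 onward, stop at "significance"
        hits += 1

print(hits / trials)   # well above the nominal 0.05, and it grows with n_max
```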
Ronald Fisher played a key role in historically linking design to statistical inference, and randomization to that design and ANOVA.
It is useful here to trace randomization back to Fisher’s original logics – or at least our guesses and re-interpretations thereof. Fisher was always at pains to argue for a mixed scientific/experimental and statistical rationale, having tried to meld these two. For him, the development of randomization (or random assignment) was also embedded within the pre-existent historical ideas and experience of agricultural experiments and the use and logic of his ANOVA. (As we know, the debates about what Fisher meant are ongoing – and he changed his mind too, it seems.)
One of the means Fisher used to discredit Neyman & Pearson’s approach was to charge that they were not practical experimentalists and did not think like scientists. Thus he argued the validity of his proposals both statistically and experimentally (incidentally, never actually using the words “causal inference”, always stressing “inductive inference” in his texts).
It may be interesting to see what his biographer, Joan Fisher Box, said about his motivations for randomization (Joan Fisher Box, R. A. Fisher: The Life of a Scientist, 1978):
“Almost certainly it was the evident lack of independence of field observations that led Fisher to seek a foundation for his analysis which did not involve this assumption [of independence]. He knew that the effect of even a moderate lack of independence of the observations, unlike the effect of moderate nonnormality, could be disastrous to the analysis” (p.148)
and later,
“No doubt these ideas were also supported on the theoretical side by Fisher’s geometrical insight: he could see the pattern that randomization would produce in the n-dimensional space, and he could see that randomization would produce a symmetry in that pattern rather like that produced by a kaleidoscope. This might approximate the spherical symmetry that would have been induced by normality. His confidence in the validity of his results must to some extent have rested on this insight” (p. 149).
Obviously the current statistical theories have taken these ideas further, but historically, for Fisher at least, there doesn’t seem to have been a pure space of either experimental or statistical thinking, and, for the record, also his genetic and social experimentation. The clinical trialist in medicine just added clinical logic to these multi-layered interpretations. It is an interesting philosophical question exactly where “statistics” begins and ends in this complex terrain, but there is no doubt that the uses and applications thereof (and the associated interpretations) are an integral part of these logics.
Lauren: This will be quick as I am dashing to a meeting. Thanks for your interesting comment. I know all Fisher’s writings quite well, and also N and P’s and the various biographers (Reid more than Box). (I still find it valuable to go back and reread F, N & P often.) I don’t have any objections to what you wrote, with the exception of Fisher’s remarks about N and P not being practical scientists. Fisher’s attacks on Neyman during those years were 99.9% professional antagonism. You might read the “triad” from this post.
Fisher’s notorious polemics aside, he was actually involved in more direct scientific research than Neyman, e.g. his genetics and perhaps agriculture. I am sitting here re-reading Fisher’s Statistical Methods and Scientific Inference (1973) and am struck once again at how much he emphasizes that it is “scientific” inference (rather than simply statistical), and also at the extent of his move away from significance testing towards estimation. Fisher’s position definitely changed across time, often criticizing in N-P what he had started, e.g. fixed levels of significance. This said, I am not aware of any field of disciplinary theory outside of statistics that N or P developed; Fisher did, and thus did seem a fuller “scientist”. The messy practical, institutional and epistemic domain of science seems to also scaffold the “truth” of historical statistics, but then you know that anyway. Sorry, I am probably far off your original topic!
Lauren: I happen to be sitting on Stat Methods & Sci Inference because I’m writing to some “new fiducialists”*. I recommend you read some of N’s work and decide about how much of a scientist he was. *N could never get over Fisher’s flawed “probabilistic instantiation” in fiducialing (in the Fisher piece in the triad). N was right. But aspects of the idea are still being salvaged and developed.
Personally, my hero of the three has long been E.S. Pearson.
You are right, I have not read enough of Neyman’s actual texts, my doctorate is on Fisher. Egon Pearson’s article on Fisher, “Memories of the Impact of Fisher’s Work in the 1920’s” suggests an insightful and generous soul. I have battled through Ian Hacking on Fiducial arguments and not got far! Good luck!
Lauren: What dept. are you getting your doctorate in? Can you give me the E.S. Pearson reference? One of the most interesting upshots of working with David Cox is that he and I attained a much greater appreciation of Neyman and Fisher, respectively. He saw that Neyman was much more “evidential” than his caricature (we discuss this in Mayo and Cox 2010), and I came to appreciate the taxonomy of simple significance tests. My colleague Aris Spanos (Economics) was also very important in my understanding of the role of Fisherian tests in testing the validity of models. Some points of possible relevance emerge in a published conversation between Cox and me:
Click to access Article_Cox_Mayo.pdf
You can find it discussed on this blog as well. Thanks for your comments.
The E.S. Pearson article (1974) is in International Statistical Review 42 (1). I am doing my doctorate in a psychology dept at a smallish university in the Western Cape, South Africa. I assumed this was an open blog, I am sorry if I have gatecrashed! I have really enjoyed your material and papers, thank you for the further suggestions. I don’t follow all your statistical arguments, but I am fascinated by the history and philosophy of trials and statistics and Fisher’s role here. Statistics tends to be a rather ahistorical domain, so it is easy to adopt simplistic caricatures of its protagonists and their tools (as Gigerenzer points out). I look forward to reading more Neyman and Pearson, thanks for the triad link.
Lauren: I don’t think this will come out in the right place on the blog. No gatecrashing, we frequentists in exile are quite happy to have a historical scholar comment or read the blog. (nowadays, with twitter, we don’t get as much conversation.) I hope you will learn whatever’s needed to understand the statistical arguments. Psych needs someone who can explicate these things in a constructive way without distortion.
When you wrote “as Gigerenzer points out” I spoze you meant that literally, i.e., that Gerd claims it’s easy to adopt caricatures? In fact Gigerenzer has seemed to single-handedly create this psychological story: the compulsive hand-washing, the rituals, the id-ego-superego of testing, the fear of Fisher’s wrath, denial of the parents, and of course mentioning those “worse than useless tests” in such a way that Nuzzo could pick it up out of context. (I’m not saying he doesn’t indicate what he means, but it may not be too obvious to someone picking up a quote from here or there.) See one of my comedy task forces, you’ll probably recognize him:
https://errorstatistics.com/2013/01/19/saturday-night-brainstorming-and-task-forces-2013-tfsi-on-nhst/
Gigerenzer is always doing interesting things (I met him once). I’m not saying he doesn’t have a good gig, and I’m sure it’s a lot of fun, but I wouldn’t call this careful history of statistics. It’s a distortion. Frankly I’m currently irked at his one-man (successful!) agenda in getting people to accept the “inconsistent hybrid” myth. The reason it’s hurtful is that it encourages people to throw up their hands and say: that entire school is an inconsistent hybrid and the texts distort the methods, so let’s not have any respect for them.
By the way, speaking of disrespect, are you aware of the slander Fisher has come into in the past few years in popstat books like Ziliak and McCloskey (who also appear in my “task force”) and Nate Silver? Silver makes him out to be a bad man who denied uncertainty and gave us methods that completely ignore bias. That’s not a quote, but it’s not a stretch either. Don’t believe me? Pick up The Signal and the Noise.
Wrote last comments at almost midnight so excuse limitations…. I enjoy Gigerenzer’s work as a commentary on the statistics that has been taught to students in psychology and the health sciences (my background is more psychiatry and public health). Believe me, it is a mishmash here, and now, adding some Bayesian ideas to the pot, potentially an even bigger one! This said, firstly, Gigerenzer’s ideas are surely a comment on his experience in this specific applied field, and should not be taken as a general truth (I like my psychoanalysis applied to patients, rather than to my statistics). Secondly, the problems are surely not with the statistical tools per se but with the lack of a critical philosophy of science in their interpretation and application.
I came to this field of statistical histories not by choice (I have had to really learn to appreciate statistics), but out of desperation due to the misuse of trials in my field in evidence-based medicine and global health policy-making. Here inference, in trials, is indeed automated, and all the critique of mindless null-hypothesis testing is valid. Incidentally, the proposed “solution” of effect sizes, power and confidence does not solve the problem either. Results are not treated as uncertain or provisional, or as part of any real substantive hypothesis or disciplinary theory. They seem to simply operate as classification instruments in dichotomous decisions as to whether a biotechnology is efficacious/effective or not. “Science” is invoked here, it seems, to authorize pre-existent policy decisions, with very simplistic and deterministic models of causality and of what statistics can (and cannot) do.
I would love to see frequentist tools used as (an important) part of scientific inquiry here, not simply as a component of some mindless, mechanized decision-making pipeline. Inference misuse here is about certainty, not error or “learning”. From my perspective, Bayesian approaches only exacerbate this problem, as with their prior probabilities they try to fit even more of reality and inference into an already abused system – leaving a false confidence that statistical inference has covered it all!
(By the way, from my trenches, frequentist ideas are still central, although Bayesian methods are increasingly being adopted, not for epistemological reasons, but largely because they are seen as quicker and cheaper to do! They also seem to be undertaken with no real shift in philosophy, just another black-boxed statistical technique in a pre-existent clinical trial framework.)
Fisher’s name, and not Neyman & Pearson’s, is invoked here to warrant the validity of randomized clinical trials, so it was to him that I went to find a historical culprit for the process. Of course neither he nor N-P can be blamed for this. However, I have found in close reading of Fisher’s early texts (Statistical Methods & Design of Experiments) that he often did sell his experimentation as conclusive and self-contained, downplaying the role of disciplinary knowledge in coming to these conclusions. Of course, Fisher was a very contradictory, ambitious man who worked hard to sell his statistics and related experimentation as an accurate, objective, mobile, pan-disciplinary, global tool – and the rest is history.
I feel no need to support or defend Fisher and his techniques per se, except from crude misinformation. I have developed a great respect for his work, but a lot of that has also come from reading and investigating his genetics and evolutionary theory too. His awful ideas and activism on eugenics cannot be left out of the picture, and it is fascinating how statistically inconsistent he could be when trying to prove a eugenic point for policy purposes. Exploring just how these ideas all come together in his work is part of my scholarly project.
I commend your general project of trying to fit frequentist statistics into a more critical, error-based philosophy of science, and hope to read more to see if you have looked at the actual use of statistics outside the academy. I see from your CV that you have connections with LSE; I have enjoyed reading Nancy Cartwright’s papers on the application of her ideas on causality to trials. It is however a very specific model of causality that has much in common with the problematic interpretations that exist in trials.
Anyway enough for now, unfortunately for you I’ve never got the point of Twitter! The kids are off safely to school, and I can now write – about Fisher! Thank you for “having the conversation”.
Lauren: Thanks for your comment. Yes, I’ve been a visiting professor for years at LSE (Cartwright’s not there any more). What in the world do you mean by “Inference mis-use here is about certainty, not error or ‘learning’”? What are the up-down biotech classification decisions? As for twitter, I don’t get it either. I tried it just to see what it was when its stock IPOed. I do get reactions to posts and other comments at times that link to useful articles.
Okay, I am not writing clearly enough. (1) My comment on the regular misuse of statistics to create false certainty and lack of learning in my fields has been (somewhat) described here in an article by Harry Marks, http://ije.oxfordjournals.org/content/32/6/932.full.pdf+html. (Fisher, I believe, was more ambivalent about these issues than Marks paints him.) (2) My comments on biotechnologies and the creation of statistical tools as “technologies” draw on the theory of Science and Technology Studies (STS) and Science Studies. Too much to go into here. (3) You seemed to suggest that this blog is mostly being followed and responded to on Twitter. My comment was a self-criticism on the length of my post – if I too were on Twitter you would not need to read such a long posting! (4) I see that you have been involved in some exciting work on the application of statistics outside of the academy, such as the 2008 “Workshop on Philosophy of Science & Evidence Relevant for Regulation & Policy”. I look forward to reading the articles. And finally, I liked the interview article with David Cox.
Lauren: This is a very interesting paper by Marks. As for the Russian connection, it helps explain Fisher’s attack on Neyman (also in the triad just posted), and in this connection it’s curious that Barnard didn’t believe me when I included that quote in a paper on Pearson. For a related “statistical theater of the absurd” piece (of mine), see:
https://errorstatistics.com/2013/09/22/statistical-theater-of-the-absurd-stat-on-a-hot-tin-roof/
1) It is indeed very strange that Barnard was surprised by your comment here – one would think that Fisher’s position would have been more broadly known – pity you didn’t ask him! 2) Have you seen the review of Fisher’s biography by William Kruskal, http://drsmorey.org/bibtex/upload/Kruskal:1980.pdf? It has an interesting historical account on pg. 1022 (2nd column) about Fisher’s position on an alternative hypothesis. 3) How have you managed to reconcile Fisher’s position on “infinite populations” and N-P’s “repeated sampling”? I am not sure if today’s frequentist term, “long run frequencies”, captures what Fisher meant, even if he meant hypothetical populations. By the way, the question of infinite populations of real biological entities was a real point of conflict for Fisher with Sewall Wright in genetics.
I have lots of ontological, rather than analytical, questions about statistical entities. Fisher was very concrete at times about his statistical entities, as they were features of a historical biological project in which he participated; this is so well illustrated in his fascinating 1924 text, “The Biometrical Study of Heredity”, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2942624/. Lots could be said about Fisher’s role in synthesizing experimentally orientated, more causal Mendelism and the more associative, large-sample Biometrical school.
Lauren: These are good questions, but I can scarcely do justice to them on a blog comment. Perhaps, for now, some chapters of my Error and the Growth of Experimental Knowledge (EGEK 1996) would be relevant. Anyhow, my publication page is here:
http://www.phil.vt.edu/dmayo/personal_website/bibliography%20complete.htm
e.Berk: That’s remindful of my recent post on the philosophy of probabilism being an obstacle to fraud busting. Naturally, Bayesians, as Larry Laudan puts it, can always pull a rabbit out of their hats: the prior saves them. But what happens with two experiments on the same hypotheses, only different in design? If the prior represents prior belief or information, the priors shouldn’t change in the two cases. In any event, I’m not saying these testing critics like subjective Bayesianism either, but they do seem sold on the Bayesian idea that in order for an inference to be relevant to the hypothesis, it needs to take the form of a posterior probability. The error probability person denies this. That’s the biggest disagreement, but it is just about never mentioned or even noticed.
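The “same hypotheses, different design” point is often illustrated with the textbook binomial versus negative binomial example (a standard toy case, not anyone’s actual study): the two likelihood functions for θ are proportional, so inference that conditions only on the likelihood is identical, while the sampling distributions, and hence the error probabilities and p-values, differ.

```python
# Hypothetical sketch: same data (9 successes, 3 failures), two designs.
# Binomial: n = 12 fixed. Negative binomial: sample until 3 failures occur.
# Likelihoods for theta are proportional, but p-values for H0: theta = 0.5
# (vs theta > 0.5) differ, because the sample spaces differ.
from scipy.stats import binom, nbinom

theta0, k, n = 0.5, 9, 12

# Binomial design: P(at least 9 successes out of n = 12)
p_binomial = binom.sf(k - 1, n, theta0)

# Negative binomial design: P(at least 9 successes before the 3rd failure),
# i.e. the number of successes >= 9 with r = 3 "failures" at rate 1 - theta0
p_negbinom = nbinom.sf(k - 1, 3, 1 - theta0)

print(p_binomial, p_negbinom)   # ~0.073 vs ~0.033: different error appraisals
```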
Do you see how on p. 290 of Neyman’s article (in the triad) he gives as the third role for the probability of a type 2 error that of determining whether failure to reject a null should count as any kind of “confirmation” of that (null) hypothesis? I was amazed that I missed this until after discovering those “hidden Neyman” papers and then going back to this one. I find it very striking.
See, in this connection, my last post: https://errorstatistics.com/2014/03/19/power-taboos-statue-of-liberty-senn-neyman-carnap-severity/
Note the page on this blog that links to our seminar: https://errorstatistics.com/phil6334-s14-mayo-and-spanos/
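To connect Neyman’s third use of the type 2 error probability with the post-data severity evaluation mentioned at the top of the post, here is a minimal numerical sketch of my own (a one-sided Normal test with known σ and made-up numbers): the power against an alternative μ1, and the severity with which a non-significant result warrants the claim μ ≤ μ1, computed from the same ingredients.

```python
# Hypothetical sketch: using the type 2 error probability (power) and a
# post-data severity assessment to interpret a non-significant result.
# Test T+: H0: mu <= 0 vs H1: mu > 0, sigma = 1 known, n = 100, alpha = 0.025.
from scipy.stats import norm

n, sigma, alpha = 100, 1.0, 0.025
se = sigma / n**0.5
cutoff = norm.ppf(1 - alpha) * se          # reject H0 if xbar > cutoff (~0.196)

mu1 = 0.2                                  # discrepancy of interest
power = norm.sf((cutoff - mu1) / se)       # P(reject; mu = mu1), ~0.52

xbar_obs = 0.05                            # observed, non-significant result
# Severity for the claim "mu <= mu1": the probability the test would have
# produced a result larger than the one observed, were mu as large as mu1.
sev = norm.sf((xbar_obs - mu1) / se)       # ~0.93

print(power, sev)
```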
The importance of randomization has much more to do with causal inference and the blocking of confounding paths (in Judea Pearl’s causal graph formalism), which has nothing to do with either Bayesian or frequentist statistics.
vl: huh? you don’t think that blocking confounders, etc. has to do with controlling the probability of erroneous causal inferences?
vl: isn’t “nothing to do with” a bit strong?
Operating as either Bayesians or frequentists, statisticians want to avoid problems of confounding, and often work very hard to avoid them.
Confounding is a problem of causality in which there is a bias between the underlying causal relationships in nature and what the inferential procedure estimates. If such a bias exists, it doesn’t go away even with an infinite sample size, in which case it doesn’t make much difference whether the effect is estimated using Bayesian or frequentist approaches.
The role of randomization is to eliminate confounding paths by making treatment assignment exogenous to the system under study. The exogeneity of treatment assignment is thus important whether one is a Bayesian or a frequentist.
Judea Pearl himself can probably argue this point better than I can; his argument, as I understand it, is that the language of probability is fundamentally inadequate to express causal relationships.
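A minimal simulation sketch of the point about confounding bias not shrinking with sample size (my own toy model, not Pearl’s, with made-up parameters): an unobserved confounder drives both treatment and outcome, so the naive observational difference in means stays biased even with a huge sample, while randomized assignment recovers the true effect, whichever inferential school then analyzes the data.

```python
# Hypothetical sketch: confounding bias does not shrink with sample size;
# randomizing treatment assignment removes it.
import numpy as np

rng = np.random.default_rng(2)
n, true_effect = 1_000_000, 1.0

u = rng.normal(size=n)                        # unobserved confounder
# Observational regime: the confounder pushes units into treatment.
t_obs = (u + rng.normal(size=n) > 0).astype(float)
y_obs = true_effect * t_obs + 2.0 * u + rng.normal(size=n)

# Randomized regime: treatment assigned by coin flip, exogenous to u.
t_rnd = rng.integers(0, 2, size=n).astype(float)
y_rnd = true_effect * t_rnd + 2.0 * u + rng.normal(size=n)

naive_obs = y_obs[t_obs == 1].mean() - y_obs[t_obs == 0].mean()
naive_rnd = y_rnd[t_rnd == 1].mean() - y_rnd[t_rnd == 0].mean()
print(naive_obs, naive_rnd)   # badly biased (~3.3) vs close to the true 1.0
```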
I agree with your last sentence. Many equate them. (don’t see the sample size Bayes-freq point in your first para)