**MONTHLY MEMORY LANE: 3 years ago: August 2012.** I mark in **red** the **three** posts that seem most apt for general background on key issues in this blog.[1] Posts that are part of a “unit” or a group of “U-Phils” count as one (there are 4 U-Phils on Wasserman this time). Monthly memory lanes began at the blog’s 3-year anniversary in Sept. 2014. We’re about to turn four.

**August 2012**

- (8/2) Stephen Senn: Fooling the Patient: an Unethical Use of Placebo? (Phil/Stat/Med)
- (8/5) A “Bayesian Bear” rejoinder practically writes itself…
- (8/6) Bad news bears: Bayesian rejoinder
- (8/8) U-PHIL: Aris Spanos on Larry Wasserman[2]
- (8/10) U-PHIL: Hennig and Gelman on Wasserman (2011)
- (8/11) E.S. Pearson Birthday
- (8/11) U-PHIL: Wasserman Replies to Spanos and Hennig
- (8/13) U-Phil: (concluding the deconstruction) Wasserman/Mayo
- (8/14) Good Scientist Badge of Approval?
- (8/16) E.S. Pearson’s Statistical Philosophy
- (8/18) A. Spanos: Egon Pearson’s Neglected Contributions to Statistics
- (8/20) Higgs Boson: Bayesian “Digest and Discussion”
- (8/22) Scalar or Technicolor? S. Weinberg, “Why the Higgs?”
- (8/25) “Did Higgs Physicists Miss an Opportunity by Not Consulting More With Statisticians?”
- (8/27) Knowledge/evidence not captured by mathematical prob.
- (8/30) Frequentist Pursuit
- (8/31) Failing to Apply vs Violating the Likelihood Principle

[1] Excluding those reblogged fairly recently.

[2] Larry Wasserman’s paper was “Low Assumptions, High Dimensions” in our special RMM volume.

Filed under: 3-year memory lane, Statistics

A classic fallacy of rejection is taking a statistically significant result as evidence of a discrepancy from a test (or null) hypothesis larger than is warranted. Standard tests do have resources to combat this fallacy, but you won’t see them in textbook formulations. It’s not a new statistical method, but new (and correct) interpretations of existing methods, that are needed. One can begin with a companion to the rule in this recent post:

(1) If POW(T+, µ’) is low, then the statistically significant x is a good indication that µ > µ’.

To have the companion rule also in terms of power, let’s suppose that our result *is just statistically significant*. (As soon as it exceeds the cut-off, the rule has to be modified.)

Rule (1) was stated in relation to a statistically significant result x (at level α) from a one-sided test T+ of the mean of a Normal distribution: H0: µ ≤ µ0 against H1: µ > µ0.

(2) If POW(T+, µ’) is high, then an α statistically significant x is a good indication that µ < µ’.

(The higher POW(T+, µ’) is, the better the indication that µ < µ’.) That is, if the test’s power to detect alternative µ’ is high, then the statistically significant x is a good indication (or good evidence) that the discrepancy from the null is not as large as µ’ (i.e., there’s good evidence that µ < µ’).

An account of severe testing based on error statistics is always keen to indicate inferences that are not warranted by the data, as well as those that are. Not only might we wish to indicate which discrepancies are poorly warranted, we can give upper bounds to warranted discrepancies by using (2).

**EXAMPLE**. Let σ = 10, *n* = 100, so (σ/√*n*) = 1. Test T+ rejects H0 at the .025 level if M > 1.96(σ/√*n*). For simplicity, let the cut-off, M*, be 2. Let the observed mean M0 just reach the cut-off, 2.

**POWER**: POW(T+, µ’) = POW(Test T+ rejects H0; µ’) = Pr(M > M*; µ’), where M is the sample mean and M* is the cut-off for rejection. (Since M is continuous, it doesn’t matter whether we write > or ≥.)[i]

The power against alternatives between the null and the cut-off M* will range from α to .5. Power exceeds .5 only once we consider alternatives greater than M*. Using one of our power facts: POW(T+, µ’ = M* + 1(σ/√*n*)) = .84.

That is, adding one (σ/√*n*) unit to the cut-off M* takes us to an alternative against which the test has power .84. So POW(T+, µ = 3) = .84. See this post.

By (2), the (just) significant result x is decent evidence that µ < 3, because if µ ≥ 3, we’d have observed a more statistically significant result with probability .84. The upper .84 confidence limit is 3. The significant result is even better evidence that µ < 4; the upper .975 confidence limit is 4 (approx.), etc.
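The numbers in this example are easy to reproduce. A minimal sketch in Python (the names `Phi` and `power` are my own, not from the post):

```python
from math import erf, sqrt

def Phi(z):                          # standard Normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma, n = 10, 100
se = sigma / sqrt(n)                 # sigma/sqrt(n) = 1
M_star = 2.0                         # cut-off M* (approx. the 1.96 level)
M_obs = 2.0                          # observed mean, just significant

def power(mu_prime):
    # POW(T+, mu') = Pr(M > M*; mu')
    return 1 - Phi((M_star - mu_prime) / se)

print(round(power(3.0), 2))          # 0.84: a good indication that mu < 3
print(round(power(2.0), 2))          # 0.5: power at the cut-off itself
print(M_obs + 1.0 * se)              # 3.0: the upper .84 confidence limit
print(round(M_obs + 1.96 * se, 2))   # 3.96: approx. the upper .975 limit
```

The last two lines show the duality with one-sided upper confidence bounds: adding k standard-error units to the observed mean gives the upper bound at the confidence level Φ(k).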

Reporting (2) is typically of importance in cases of highly sensitive tests, but I think it should always accompany a rejection to avoid making mountains out of molehills. (Only (2) should be custom-tailored to the outcome not the cut-off.) In the case of statistical *in*significance, (2) is essentially ordinary *power analysis.* (In that case, the interest may be to avoid making molehills out of mountains.) Power analysis, applied to insignificant results, is especially of interest with low-powered tests. For example, failing to find a statistically significant increase in some risk may at most rule out (substantively) large risk increases. It might not allow ruling out risks of concern. Naturally, that’s a context-dependent consideration, often stipulated in regulatory statutes.

Rule (2) also provides a way to distinguish values *within* a 1-α confidence interval (instead of choosing a given confidence level and then reporting CIs in the dichotomous manner that is now typical).

At present, power analysis is only used to interpret negative results–and there it is often confused with “retrospective power” (what I call shpower). Again, confidence bounds could be, but they are not now, used to this end (but rather the opposite [iii]).

**Severity replaces M* in (2) with the actual result, be it significant or insignificant. **

Looking at power means looking at the best case (just reaching a significance level) or the worst case (just missing it). This is way too coarse; we need to *custom tailor* results using the observed data. That’s what severity does, but for this post, I wanted to just illuminate the logic.[ii]
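To illustrate the contrast, here is a small sketch (my own notation) in which the fixed cut-off M* of the power calculation is replaced by the actual outcome, using the same example as above (σ = 10, n = 100):

```python
from math import erf, sqrt

def Phi(z):                           # standard Normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma, n = 10, 100
se = sigma / sqrt(n)

def sev_upper(M_obs, mu_prime):
    # SEV(mu < mu') = Pr(a result more significant than M_obs; mu = mu')
    #               = Pr(M > M_obs; mu'), with the *observed* mean M_obs
    # in place of the fixed cut-off M* used in the power calculation
    return 1 - Phi((M_obs - mu_prime) / se)

print(round(sev_upper(2.0, 3.0), 2))  # 0.84: agrees with power when M_obs = M*
print(round(sev_upper(2.5, 3.0), 2))  # 0.69: same mu', different outcome
```

When the result just reaches the cut-off, severity and power coincide; for any other outcome they come apart, which is the custom-tailoring the post describes.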

*One more thing:*

**Applying (1) and (2) requires the error probabilities to be actual** (approximately correct): Strictly speaking, rules (1) and (2) have a conjunct in their antecedents [iv]: “given the test assumptions are sufficiently well met”. *If background knowledge leads you to deny (1) or (2), it indicates you’re denying the reported error probabilities are the actual ones.* There’s evidence the test fails an “audit”. That, at any rate, is what I would argue.

————

[i] To state power in terms of P-values: POW(µ’) = Pr(P < p*; µ’) where P < p* corresponds to rejecting the null hypothesis at the given level.

[ii] It must be kept in mind that inferences are going to be in the form µ > µ’ = µ0 + δ, or µ < µ’ = µ0 + δ, or the like. They are *not* to point values! (Not even to the point µ = M0.) Most simply, you may consider that the inference is in terms of the one-sided upper confidence bound (for various confidence levels)–the dual of test T+.

[iii] That is, upper confidence bounds are viewed as “plausible” bounds, and as values for which the data provide positive evidence. As soon as you get to an upper bound at confidence levels of around .6, .7, .8, etc. you actually have evidence µ’ < CI-upper. See this post.

[iv] The “antecedent” of a conditional refers to the statement between the “if” and the “then”.

OTHER RELEVANT POSTS ON POWER

- (6/9) U-Phil: Is the Use of Power* Open to a Power Paradox?
- (3/4/14) Power, power everywhere–(it) may not be what you think! [illustration]
- (3/12/14) Get empowered to detect power howlers
- 3/17/14 Stephen Senn: “Delta Force: To what Extent is clinical relevance relevant?”
- (3/19/14) Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
- (12/29/14) To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)
- (01/03/15) No headache power (for Deirdre)

Filed under: fallacy of rejection, power, Statistics

I was reading this interview Of Erich Lehmann yesterday: “A Conversation with Erich L. Lehmann”

Lehmann: …I read over and over again that hypothesis testing is dead as a door nail, that nobody does hypothesis testing. I talk to Julie and she says that in the behavioral sciences, hypothesis testing is what they do the most. All my statistical life, I have been interested in three different types of things: testing, point estimation, and confidence-interval estimation. There is not a year that somebody doesn’t tell me that two of them are total nonsense and only the third one makes sense. But which one they pick changes from year to year. [Laughs] (p. 151)…

DeGroot: …It has always amazed me about statistics that we argue among ourselves about which of our basic techniques are of practical value. It seems to me that in other areas one can argue about whether a methodology is going to prove to be useful, but people would agree whether a technique is useful in practice. But in statistics, as you say, some people believe that confidence intervals are the only procedures that make any sense on practical grounds, and others think they have no practical value whatsoever. I find it kind of spooky to be in such a field.

Lehmann: After a while you get used to it. If somebody attacks one of these, I just know that next year I’m going to get one who will be on the other side. (pp. 151-2)

Emphasis is mine.

I’m reminded of this post.

Morris H. DeGroot, *Statistical Science*, 1986, Vol. 1, No. 2, 243-258.


Filed under: phil/history of stat, Statistics

I received a copy of a statistical text recently that included a discussion of severity, and this is my first chance to look through it. It’s *Introductory Statistical Inference with the Likelihood Function* by Charles Rohde from Johns Hopkins. Here’s the blurb:

This textbook covers the fundamentals of statistical inference and statistical theory, including Bayesian and frequentist approaches, with as much methodology as possible without excessive emphasis on the underlying mathematics. This book is about some of the basic principles of statistics that are necessary to understand and evaluate methods for analyzing complex data sets. The likelihood function is used for pure likelihood inference throughout the book. There is also coverage of severity and finite population sampling. The material was developed from an introductory statistical theory course taught by the author at the Johns Hopkins University’s Department of Biostatistics. Students and instructors in public health programs will benefit from the likelihood modeling approach that is used throughout the text. This will also appeal to epidemiologists and psychometricians. After a brief introduction, there are chapters on estimation, hypothesis testing, and maximum likelihood modeling. The book concludes with sections on Bayesian computation and inference. An appendix contains unique coverage of the interpretation of probability, and coverage of probability and mathematical concepts.

It’s welcome to see severity in a statistics text; an example from Mayo and Spanos (2006) is given in detail. The author even says some nice things about it: “Severe testing does a nice job of clarifying the issues which occur when a hypothesis is accepted (not rejected) by finding those values of the parameter (here *mu*) which are plausible (have high severity[i]) given acceptance. Similarly severe testing addresses the issue of a hypothesis which is rejected….” I don’t know Rohde, and the book isn’t error-statistical in spirit at all.[ii] In fact, inferences based on error probabilities are often called “illogical” because they take into account cherry-picking, multiple testing, optional stopping and other biasing selection effects that the likelihoodist considers irrelevant. I wish he had used severity to address some of the classic howlers he delineates regarding N-P statistics. To his credit, they are laid out with unusual clarity. For example, a rejection of a point null µ = µ0 based on a result that just reaches the 1.96 cut-off for a one-sided test is claimed to license the inference to a point alternative µ = µ’ that is over 6 standard deviations greater than the null (pp. 49-50). But it is not licensed. The probability of a larger difference than observed, were the data generated under such an alternative, is ~1, so the severity associated with such an inference is ~0; SEV(µ < µ’) ~ 1.
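The arithmetic of that howler is easy to check. A sketch, working in σ/√n units with names of my own choosing:

```python
from math import erf, sqrt

def Phi(z):                      # standard Normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

# work in sigma/sqrt(n) units: null mu0 = 0, observed mean at the cut-off
M_obs = 1.96
mu_prime = 6.0                   # point alternative 6 standard deviations out

# Pr(M > M_obs; mu'): a larger difference than the one observed is
# practically certain under the 6-sigma alternative...
pr_larger = 1 - Phi(M_obs - mu_prime)
# ...so inferring mu > mu' has severity ~0, while SEV(mu < mu') ~ 1
sev_greater = 1 - pr_larger
sev_less = pr_larger
print(sev_greater, sev_less)     # ~0 and ~1 respectively
```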

[i] Not to quibble, but I wouldn’t say parameter values are assigned severity, but rather that various hypotheses about µ pass with severity. The hypotheses are generally in the form of discrepancies, e.g., µ > µ’.

[ii] He’s a likelihoodist from Johns Hopkins. Royall has had a strong influence there (Goodman comes to mind), and elsewhere, especially among philosophers. Bayesians also come back to likelihood ratio arguments, often. For discussions on likelihoodism and the law of likelihood see:

How likelihoodists exaggerate evidence from statistical tests

Breaking the Law of Likelihood ©

Breaking the Law of Likelihood, to keep their fit measures in line A, B

Why the Law of Likelihood is Bankrupt as an Account of Evidence

Royall, R. (2004), “The Likelihood Paradigm for Statistical Evidence,” 119-138; Rejoinder 145-151, in M. Taper and S. Lele (eds.), *The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations*. Chicago: University of Chicago Press.

http://www.phil.vt.edu/dmayo/personal_website/2006Mayo_Spanos_severe_testing.pdf

Filed under: Error Statistics, Severity

Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (*performance*). Or is it the other way round: that the control of long run error properties are of crucial importance for probing the causes of the data at hand? (*probativeness*). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson (11 Aug, 1895-12 June, 1980). I reblog a relevant post from 2012.

*Cases of Type A and Type B*

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

In cases of type A, long-run results are clearly of interest, while in cases of type B, repetition is impossible and may be irrelevant:

“In other and, no doubt, more numerous cases there is no repetition of the same type of trial or experiment, but all the same we can and many of us do use the same test rules to guide our decision, following the analysis of an isolated set of numerical data. Why do we do this? What are the springs of decision? Is it because the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment?

Or is it because we are content that the application of a rule, now in this investigation, now in that, should result in a long-run frequency of errors in judgment which we control at a low figure?” (Ibid., 173)

Although Pearson leaves this tantalizing question unanswered, claiming, “On this I should not care to dogmatize”, in studying how Pearson treats cases of type B, it is evident that in his view, “the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment” in learning about the particular case at hand.

“Whereas when tackling problem A it is easy to convince the practical man of the value of a probability construct related to frequency of occurrence, in problem B the argument that ‘if we were to repeatedly do so and so, such and such result would follow in the long run’ is at once met by the commonsense answer that we never should carry out a precisely similar trial again.

Nevertheless, it is clear that the scientist with a knowledge of statistical method behind him can make his contribution to a round-table discussion…” (Ibid., 171).

Pearson gives the following example of a case of type B (from his wartime work), where he claims no repetition is intended:

“Example of type B. Two types of heavy armour-piercing naval shell of the same caliber are under consideration; they may be of different design or made by different firms…. Twelve shells of one kind and eight of the other have been fired; two of the former and five of the latter failed to perforate the plate….” (Pearson 1947, 171)

“Starting from the basis that individual shells will never be identical in armour-piercing qualities, however good the control of production, he has to consider how much of the difference between (i) two failures out of twelve and (ii) five failures out of eight is likely to be due to this inevitable variability…” (Ibid.)

*We’re interested in considering what other outcomes could have occurred, and how readily, in order to learn what variability alone is capable of producing.* As a noteworthy aside, Pearson shows that treating the observed difference (between the two proportions) in one way yields an observed significance level of 0.052; treating it differently (along Barnard’s lines), he gets 0.025 as the (upper) significance level. But in scientific cases, Pearson insists, the difference in error probabilities makes no real difference to substantive judgments in interpreting the results. Only in an unthinking, automatic, routine use of tests would it matter:

“Were the action taken to be decided automatically by the side of the 5% level on which the observation point fell, it is clear that the method of analysis used would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule.” (ibid., 192)

The two analyses correspond to the tests effectively asking different questions, and if we recognize this, says Pearson, different meanings may be appropriately attached.
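For the curious, the 0.052 figure can be reproduced by a conditional (hypergeometric, Fisher-exact-style) analysis of the 2×2 table; that this is exactly the computation Pearson used is my assumption, and the 0.025 Barnard-style figure arises from an unconditional treatment not sketched here:

```python
from math import comb

# Twelve shells of type (i), two failures; eight of type (ii), five failures
n1, f1 = 12, 2
n2, f2 = 8, 5
N, F = n1 + n2, f1 + f2              # 20 shells, 7 failures in all

# conditioning on 7 failures in total, the chance type (i) shows 2 or fewer
p = sum(comb(F, k) * comb(N - F, n1 - k) for k in range(f1 + 1)) / comb(N, n1)
print(round(p, 3))                   # 0.052
```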

*Three Steps in the Original Construction of Tests*

After setting up the test (or null) hypothesis, and the alternative hypotheses against which “we wish the test to have maximum discriminating power” (Pearson 1947, 173), Pearson defines three steps in specifying tests:

“Step 1. We must specify the experimental probability set, the set of results which could follow on repeated application of the random process used in the collection of the data…

Step 2. We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the Information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts” (Pearson 1966a, 173).

“Step 3. We then, if possible[i], associate with each contour level the chance that, if [the null] is true, a result will occur in random sampling lying beyond that level” (ibid.).

Pearson warns that:

“Although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order.” (Ibid. 173).

Strict behavioristic formulations jump from step 1 to step 3, after which one may calculate how the test has in effect accomplished step 2. However, the resulting test, while having adequate error probabilities, may have an inadequate distance measure and may even be irrelevant to the hypothesis of interest. This is one reason critics can construct howlers that appear to be licensed by N-P methods, and which make their way from time to time into this blog.

So step 3 remains crucial, even for cases of type [B]. There are two reasons: pre-data planning—that’s familiar enough—but secondly, post-data scrutiny. Post-data, step 3 enables determining the capability of the test to have detected various discrepancies, departures, and errors, on which a critical scrutiny of the inferences is based. More specifically, the error probabilities are used to determine how well/poorly corroborated, or how severely tested, various claims are, post-data.

If we can readily bring about statistically significantly higher rates of success with the first type of armour-piercing naval shell than with the second (in the above example), we have evidence the first is superior. Or, as Pearson modestly puts it: the results “raise considerable doubts as to whether the performance of the [second] type of shell was as good as that of the [first]….” (Ibid., 192)[ii]

Still, while error rates of procedures may be used to determine how severely claims have/have not passed, they do not automatically do so—hence, again, opening the door to potential howlers that neither Egon nor Jerzy, for that matter, would have countenanced.

*Neyman Was the More Behavioristic of the Two*

Pearson was (rightly) considered to have rejected the more behaviorist leanings of Neyman.

Here’s a snippet from an unpublished letter he wrote to Birnbaum (1974) about the idea that the N-P theory admits of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

In Pearson’s (1955) response to Fisher (blogged here):

“To dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot….!” (Pearson 1955, 204)

“To the best of my ability I was searching for a way of expressing in mathematical terms what appeared to me to be the requirements of the scientist in applying statistical tests to his data. After contact was made with Neyman in 1926, the development of a joint mathematical theory proceeded much more surely; it was not till after the main lines of this theory had taken shape with its necessary formalization in terms of critical regions, the class of admissible hypotheses, the two sources of error, the power function, etc., that the fact that there was a remarkable parallelism of ideas in the field of acceptance sampling became apparent. Abraham Wald’s contributions to decision theory of ten to fifteen years later were perhaps strongly influenced by acceptance sampling problems, but that is another story.“ (ibid., 204-5).

“It may be readily agreed that in the first Neyman and Pearson paper of 1928, more space might have been given to discussing how the scientific worker’s attitude of mind could be related to the formal structure of the mathematical probability theory….Nevertheless it should be clear from the first paragraph of this paper that we were not speaking of the final acceptance or rejection of a scientific hypothesis on the basis of statistical analysis…. Indeed, from the start we shared Professor Fisher’s view that in scientific enquiry, a statistical test is ‘a means of learning”… (Ibid., 206)

“Professor Fisher’s final criticism concerns the use of the term ‘inductive behavior’; this is Professor Neyman’s field rather than mine.” (Ibid., 207)

__________________________

Aside: It is interesting, given these non-behavioristic leanings, that Pearson had earlier worked in acceptance sampling and quality control (from which he claimed to have obtained the term “power”). From the Cox-Mayo “conversation” (2011, 110):

COX: It is relevant that Egon Pearson had a very strong interest in industrial design and quality control.

MAYO: Yes, that’s surprising, given his evidential leanings and his apparent distaste for Neyman’s behavioristic stance. I only discovered that around 10 years ago; he wrote a small book.[iii]

COX: He also wrote a very big book, but all copies were burned in one of the first air raids on London.

Some might find it surprising to learn that it is from this early acceptance sampling work that Pearson obtained the notion of “power”, but I don’t have the quote handy where he said this……


**References:**

Cox, D. and Mayo, D. G. (2011), “Statistical Scientist Meets a Philosopher of Science: A Conversation,” *Rationality, Markets and Morals*:* Studies at the Intersection of Philosophy and Economics*, 2: 103-114.

Pearson, E. S. (1935), The Application of Statistical Methods to Industrial Standardization and Quality Control, London: British Standards Institution.

Pearson, E. S. (1947), “The choice of Statistical Tests illustrated on the Interpretation of Data Classed in a 2×2 Table,” *Biometrika* 34(1/2): 139-167.

Pearson, E. S. (1955), “Statistical Concepts and Their Relationship to Reality” *Journal of the Royal Statistical Society, Series B, (Methodological)*, 17(2): 204-207.

Neyman, J. and Pearson, E. S. (1928), “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference, Part I.” *Biometrika* 20(A): 175-240.

[i] In some cases only an upper limit to this error probability may be found.

[ii] Pearson inadvertently switches from number of failures to number of successes in the conclusion of this paper.

[iii] I thank Aris Spanos for locating this work of Pearson’s from 1935.

Filed under: 3-year memory lane, phil/history of stat Tagged: E S Pearson

*Today is Egon Pearson’s birthday. I reblog a post by my colleague Aris Spanos from 8/18/12: “Egon Pearson’s Neglected Contributions to Statistics.” Happy Birthday, Egon Pearson!*

**Egon Pearson** (11 August 1895 – 12 June 1980) is widely known today for his contribution in recasting Fisher’s significance testing into the *Neyman-Pearson (1933) theory of hypothesis testing*. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

**(i) specification**: the need to state explicitly the inductive premises of one’s inferences,

**(ii) robustness**: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

**(iii) Mis-Specification (M-S) testing**: probing for potential departures from the Normality assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of **Student’s t** fame] and then **Fisher** (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as *the simple Normal model*:

X_{k} ∽ NIID(μ,σ²), k=1,2,…,n,… (1)

where ‘NIID(μ,σ²)’ stands for ‘Normal, Independent and Identically Distributed with mean μ and variance σ²’. These procedures include the ‘optimal’ estimators of μ and σ², Xbar and s², and the pivotal quantities:

(a) τ(**X**) =[√n(Xbar- μ)/s] ∽ St(n-1), (2)

(b) *v*(**X**) =[(n-1)s²/σ²] ∽ χ²(n-1), (3)

where St(n-1) and χ²(n-1) denote the Student’s t and chi-square distributions with (n-1) degrees of freedom.
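The distributional claim in (3), that *v*(**X**) is a genuine pivot with a χ²(n-1) distribution whatever the true μ and σ², can be checked by simulation; a minimal sketch (all names mine):

```python
import random
import statistics

rng = random.Random(42)
mu, sigma, n, reps = 5.0, 2.0, 10, 20000

pivots = []
for _ in range(reps):
    x = [rng.gauss(mu, sigma) for _ in range(n)]
    s2 = statistics.variance(x)               # the estimator s^2
    pivots.append((n - 1) * s2 / sigma**2)    # the pivot v(X) of (3)

# chi-square(n-1) has mean n-1 and variance 2(n-1): here 9 and 18
print(statistics.mean(pivots))                # close to 9
print(statistics.variance(pivots))            # close to 18
```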

The question of ‘how these inferential results might be affected when the Normality assumption is false’ was originally raised by Gosset in a letter to Fisher in 1923:

“What I should like you to do is to find a solution for some other population than a normal one.” (Lehmann, 1999)

He went on to say that he tried the rectangular (uniform) distribution but made no progress, and he was seeking Fisher’s help in tackling this ‘robustness/sensitivity’ problem. In his reply, which was unfortunately lost, Fisher must have derived the sampling distribution of τ(**X**), assuming some skewed distribution (possibly log-Normal). We know this from Gosset’s reply:

“I like the result for z [τ(**X**)] in the case of that horrible curve you are so fond of. I take it that in skew curves the distribution of z is skew in the opposite direction.” (Lehmann, 1999)

After this exchange Fisher was not particularly receptive to Gosset’s requests to address the problem of working out the implications of non-Normality for the Normal-based inference procedures; t, chi-square and F tests.

In contrast, **Egon Pearson** shared Gosset’s concerns about the robustness of Normal-based inference results (a)-(b) to non-Normality, and made an attempt to address the problem in a series of papers in the late 1920s and early 1930s. This line of research for Pearson began with a review of Fisher’s 2nd edition of the 1925 book, published in *Nature* and dated June 8th, 1929. Pearson, after praising the book for its path-breaking contributions, dared to raise a mild criticism relating to (i)-(ii) above:

“There is one criticism, however, which must be made from the statistical point of view. A large number of tests are developed upon the assumption that the population sampled is of ‘normal’ form. That this is the case may be gathered from a very careful reading of the text, but the point is not sufficiently emphasised. It does not appear reasonable to lay stress on the ‘exactness’ of tests, when no means whatever are given of appreciating how rapidly they become inexact as the population samples diverge from normality.” (Pearson, 1929a)

Fisher reacted badly to this criticism and was preparing an acerbic reply to the ‘young pretender’ when Gosset jumped into the fray with his own letter in *Nature*, dated July 20th, in an obvious attempt to moderate the ensuing fight. Gosset succeeded in tempering Fisher’s reply, dated August 17th, but instead of addressing the ‘robustness/sensitivity’ issue, Fisher focused primarily on Gosset’s call to address ‘the problem of what sort of modification of my tables for the analysis of variance would be required to adapt that process to non-normal distributions’. He described that as a hopeless task. This is an example of Fisher’s genius when cornered by an insightful argument. He sidestepped the issue of ‘robustness’ to departures from Normality by broadening it – alluding to other possible departures from the IID assumptions – and rendering it a hopeless task by focusing on the call to ‘modify’ the statistical tables for all possible non-Normal distributions; there is an infinity of potential modifications!

**Egon Pearson** recognized the importance of stating explicitly the inductive premises upon which the inference results are based, and pressed ahead with exploring the robustness issue using several non-Normal distributions within the Pearson family. His probing was based primarily on *simulation*, relying on tables of pseudo-random numbers; see Pearson and Adyanthaya (1928, 1929), Pearson (1929b, 1931). His broad conclusions were that the t-test:

τ_{0}(**X**) = |[√n(Xbar − μ_{0})/s]|, (4)

for testing the hypotheses:

H_{0}: μ = μ_{0} vs. H_{1}: μ ≠ μ_{0}, (5)

is relatively robust to certain departures from Normality, especially when the underlying distribution is symmetric, but the ANOVA test is rather sensitive to such departures! He continued this line of research into his 80s; see Pearson and Please (1975).
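Pearson’s simulation-based probing can be mimicked cheaply today. The sketch below (a rough modern stand-in, not his actual design) estimates the type I error rate of the two-sided t-test of (5) under a Normal and under a symmetric non-Normal (uniform) parent:

```python
import random
import statistics
from math import sqrt

rng = random.Random(7)
n, reps, mu0 = 10, 20000, 0.0
t_crit = 2.262                      # two-sided 5% point of St(9)

def reject_rate(draw):
    # fraction of simulated samples on which |t| exceeds the critical value
    hits = 0
    for _ in range(reps):
        x = [draw() for _ in range(n)]
        t = sqrt(n) * (statistics.mean(x) - mu0) / statistics.stdev(x)
        hits += abs(t) > t_crit
    return hits / reps

# both parents have true mean mu0 = 0; the nominal size is 5%
rate_normal = reject_rate(lambda: rng.gauss(0, 1))
rate_uniform = reject_rate(lambda: rng.uniform(-1, 1))
print(rate_normal, rate_uniform)    # both close to 0.05
```

The actual rejection rate under the symmetric uniform parent stays close to the nominal 5%, which is the kind of robustness Pearson reported.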

Perhaps more importantly, Pearson (1930) proposed a *test for the Normality* assumption based on the skewness and kurtosis coefficients: a Mis-Specification (M-S) test. Ironically, Fisher (1929) provided the sampling distributions of the sample skewness and kurtosis statistics upon which Pearson’s test was based. Pearson continued sharpening his original M-S test for Normality, and his efforts culminated with the D’Agostino and Pearson (1973) test that is widely used today; see also Pearson et al. (1977). The crucial importance of testing Normality stems from the fact that it renders the ‘robustness/sensitivity’ problem manageable. The test results can be used to narrow down the possible departures one needs to worry about. They can also be used to suggest ways to respecify the original model.
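The ingredients of such an M-S test are just the sample skewness and kurtosis coefficients, which should be near 0 and 3 under Normality. A sketch of the idea (this computes only the raw coefficients, not the actual D’Agostino-Pearson test statistic):

```python
import random
import statistics

def skew_kurt(x):
    # raw sample skewness and kurtosis; under Normality they should be
    # near 0 and 3 respectively
    n = len(x)
    m = statistics.fmean(x)
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

rng = random.Random(3)
normal = [rng.gauss(0, 1) for _ in range(50000)]
skewed = [rng.expovariate(1.0) for _ in range(50000)]  # skew 2, kurtosis 9

print(skew_kurt(normal))     # near (0, 3): no signal of non-Normality
print(skew_kurt(skewed))     # far from (0, 3): Normality clearly rejected
```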

After Pearson’s early publications on the ‘robustness/sensitivity’ problem Gosset realized that *simulation* alone was not effective enough to address the question of robustness, and called upon Fisher, who initially rejected Gosset’s call by saying ‘it was none of his business’, to derive analytically the implications of non-Normality using different distributions:

“How much does it [non-Normality] matter? And in fact that is your business: none of the rest of us have the slightest chance of solving the problem: we can play about with samples [i.e. perform simulation studies], I am not belittling E. S. Pearson’s work, but it is up to you to get us a proper solution.” (Lehmann, 1999).

In this passage one can discern the high esteem in which Gosset held Fisher for his technical ability. Fisher’s reply was rather blunt:

“I do not think what you are doing with nonnormal distributions is at all my business, and I doubt if it is the right approach. … Where I differ from you, I suppose, is in regarding normality as only a part of the difficulty of getting data; viewed in this collection of difficulties I think you will see that it is one of the least important.”

It’s clear from this that Fisher understood the problem of how to handle departures from Normality more broadly than his contemporaries. His answer alludes to two issues that were not well understood at the time:

(a) departures from the other two probabilistic assumptions (IID) have much more serious consequences for Normal-based inference than Normality, and

(b) deriving the consequences of particular forms of non-Normality on the reliability of Normal-based inference, and proclaiming a procedure enjoys a certain level of ‘generic’ robustness, does *not* provide a complete answer to the problem of dealing with departures from the inductive premises.

In relation to (a) it is important to note that the role of ‘randomness’, as it relates to the IID assumptions, was not well understood until the 1940s, when the notion of non-IID was framed in terms of explicit forms of heterogeneity and dependence pertaining to stochastic processes. Hence, the problem of assessing departures from IID was largely ignored at the time, and attention focused almost exclusively on departures from Normality. Indeed, the early literature on nonparametric inference retained the IID assumptions and focused on inference procedures that replace the Normality assumption with indirect distributional assumptions pertaining to the ‘true’ but unknown *f*(x), like the existence of certain moments, its symmetry, smoothness, continuity and/or differentiability, unimodality, etc.; see Lehmann (1975). It is interesting to note that Egon Pearson did not consider the question of testing the IID assumptions until his 1963 paper.

In relation to (b), when one poses the question ‘how robust to non-Normality is the reliability of inference based on a t-test?’ one ignores the fact that the t-test might no longer be the ‘optimal’ test under a non-Normal distribution. This is because the sampling distribution of the test statistic and the associated type I and II error probabilities depend crucially on the validity of the statistical model assumptions. When any of these assumptions are invalid, the relevant error probabilities are no longer the ones derived under the original model assumptions, and the optimality of the original test is called into question. For instance, assuming that the ‘true’ distribution is uniform (Gosset’s rectangular):

X_{k }∽ U(a-μ,a+μ), k=1,2,…,n,… (6)

where *f*(x;a,μ)=(1/(2μ)), (a-μ) ≤ x ≤ (a+μ), μ > 0,

how does one assess the robustness of the t-test? One might invoke its generic robustness to symmetric non-Normal distributions and proceed as if the t-test is ‘fine’ for testing the hypotheses (5). A better-grounded answer is to assess the discrepancy between the nominal (assumed) error probabilities of the t-test based on (1) and the actual ones based on (6). If the latter approximate the former ‘closely enough’, one can justify the generic robustness. These answers, however, raise the broader question of what the relevant error probabilities are. After all, the optimal test for the hypotheses (5) in the context of (6) is no longer the t-test, but the test defined by:

w(**X**)=|{(n-1)([X_{[1] }+X_{[n]}]-μ_{0})}/{[X_{[1]}-X_{[n]}]}|∽F(2,2(n-1)), (7)

with a rejection region C_{1}:={**x**: w(**x**) > c_{α}}, where (X_{[1]}, X_{[n]}) denote the smallest and the largest element in the ordered sample (X_{[1]}, X_{[2]},…, X_{[n]}), and F(2,2(n-1)) the F distribution with 2 and 2(n-1) degrees of freedom; see Neyman and Pearson (1928). One can argue that the relevant comparison error probabilities are no longer the ones associated with the t-test ‘corrected’ to account for the assumed departure, but those associated with the test in (7). For instance, let the t-test have nominal and actual significance level, .05 and .045, and power at μ_{1}=μ_{0}+1, of .4 and .37, respectively. The conventional wisdom will call the t-test robust, but is it reliable (effective) when compared with the test in (7) whose significance level and power (at μ_{1}) are say, .03 and .9, respectively?
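The gap between nominal and actual error probabilities can be exhibited by simulation. The sketch below is a hypothetical setup in the spirit of Pearson’s simulation studies (not a reconstruction of them): it applies the nominal 5% two-sided t-test, with the tabulated cut-off t₀.₀₂₅(19) ≈ 2.093, to samples that are actually uniform, and records how often the true null is rejected. For this symmetric distribution the actual size comes out close to the nominal .05, illustrating the ‘generic robustness’ claim, though, as argued above, that comparison alone says nothing about optimality.

```python
import math
import random

def t_statistic(xs, mu0):
    """One-sample t-statistic: sqrt(n) * (mean - mu0) / s."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return (mean - mu0) * math.sqrt(n) / math.sqrt(s2)

def actual_size_under_uniform(n=20, reps=10000, t_crit=2.093, seed=0):
    """Rejection rate of the nominal 5% two-sided t-test (t_crit is the
    tabulated .025 point for 19 df) when the data are U(-1, 1): the null
    mean 0 is true, but the Normality assumption fails."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if abs(t_statistic(xs, 0.0)) > t_crit:
            rejections += 1
    return rejections / reps

print(actual_size_under_uniform())  # close to the nominal .05
```

Replacing the uniform generator with a skewed one (e.g. exponential, recentred at its mean) is the natural next experiment, and is where the size distortions become more visible.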

A strong case can be made that a more complete approach to the statistical misspecification problem is:

(i) to probe thoroughly for any departures from all the model assumptions using trenchant M-S tests, and if any departures are detected,

(ii) proceed to respecify the statistical model by choosing a more appropriate model with a view to accounting for the statistical information that the original model did not.

Admittedly, this is a more demanding way to deal with departures from the underlying assumptions, but it addresses the concerns of Gosset, Egon Pearson, Neyman and Fisher much more effectively than the invocation of vague robustness claims; see Spanos (2010).

Two additional posts on Pearson are here and here; please search this blog for yet others.

**References**

Bartlett, M. S. (1981) “Egon Sharpe Pearson, 11 August 1895-12 June 1980,” *Biographical Memoirs of Fellows of the Royal Society*, 27: 425-443.

D’Agostino, R. and E. S. Pearson (1973) “Tests for Departure from Normality. Empirical Results for the Distributions of b₂ and √(b₁),” *Biometrika*, 60: 613-622.

Fisher, R. A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population,” *Biometrika*, 10: 507-521.

Fisher, R. A. (1921) “On the “probable error” of a coefficient of correlation deduced from a small sample,” *Metron*, 1: 3-32.

Fisher, R. A. (1922a) “On the mathematical foundations of theoretical statistics,” *Philosophical Transactions of the Royal Society* A, 222, 309-368.

Fisher, R. A. (1922b) “The goodness of fit of regression formulae, and the distribution of regression coefficients,” *Journal of the Royal Statistical Society*, 85: 597-612.

Fisher, R. A. (1925) *Statistical Methods for Research Workers*, Oliver and Boyd, Edinburgh.

Fisher, R. A. (1929), “Moments and Product Moments of Sampling Distributions,” *Proceedings of the London Mathematical Society*, Series 2, 30: 199-238.

Neyman, J. and E. S. Pearson (1928) “On the use and interpretation of certain test criteria for purposes of statistical inference: Part I,” *Biometrika*, 20A: 175-240.

Neyman, J. and E. S. Pearson (1933) “On the problem of the most efficient tests of statistical hypotheses”, *Philosophical Transactions of the Royal Society*, A, 231: 289-337.

Lehmann, E. L. (1975) *Nonparametrics: statistical methods based on ranks*, Holden-Day, San Francisco.

Lehmann, E. L. (1999) “‘Student’ and Small-Sample Theory,” *Statistical Science*, 14: 418-426.

Pearson, E. S. (1929a) “Review of ‘Statistical Methods for Research Workers,’ 1928, by Dr. R. A. Fisher”, *Nature*, June 8th, pp. 866-7.

Pearson, E. S. (1929b) “Some notes on sampling tests with two variables,” *Biometrika*, 21: 337-60.

Pearson, E. S. (1930) “A further development of tests for normality,” *Biometrika*, 22: 239-49.

Pearson, E. S. (1931) “The analysis of variance in cases of non-normal variation,” *Biometrika*, 23: 114-33.

Pearson, E. S. (1963) “Comparison of tests for randomness of points on a line,” *Biometrika*, 50: 315-25.

Pearson, E. S. and N. K. Adyanthaya (1928) “The distribution of frequency constants in small samples from symmetrical populations,” *Biometrika*, 20: 356-60.

Pearson, E. S. and N. K. Adyanthaya (1929) “The distribution of frequency constants in small samples from non-normal symmetrical and skew populations,” *Biometrika*, 21: 259-86.

Pearson, E. S. and N. W. Please (1975) “Relations between the shape of the population distribution and the robustness of four simple test statistics,” *Biometrika*, 62: 223-241.

Pearson, E. S., R. B. D’Agostino and K. O. Bowman (1977) “Tests for departure from normality: comparisons of powers,” *Biometrika*, 64: 231-246.

Spanos, A. (2010) “Akaike-type Criteria and the Reliability of Inference: Model Selection vs. Statistical Model Specification,” *Journal of Econometrics*, 158: 204-220.

Student (1908), “The Probable Error of the Mean,” *Biometrika*, 6: 1-25.

Filed under: phil/history of stat, Statistics, Testing Assumptions Tagged: Aris Spanos, Misspecification, R.A. Fisher, robustness ]]>

**Memory lane: **Did you ever consider how some of the colorful exchanges among better-known names in statistical foundations could be the basis for high literary drama in the form of one-act plays (even if appreciated by only 3-7 people in the world)? (Think of the expressionist exchange between Bohr and Heisenberg in Michael Frayn’s play *Copenhagen*, except here there would be no attempt at all to popularize—only published quotes and closely remembered conversations would be included, with no attempt to create a “story line”.) Somehow I didn’t think so. But rereading some of Savage’s high-flown praise of Birnbaum’s “breakthrough” argument (for the Likelihood Principle) today, I was swept into a “(statistical) theater of the absurd” mindset. (Update Aug. 2015 [ii])

The first one came to me in autumn 2008 while I was giving a series of seminars on philosophy of statistics at the LSE. Modeled on a disappointing (to me) performance of *The Woman in Black, *“A Funny Thing Happened at the [1959] Savage Forum” relates Savage’s horror at George Barnard’s announcement of having rejected the Likelihood Principle!

The current piece also features George Barnard. It recalls our first meeting in London in 1986. I’d sent him a draft of my paper on E.S. Pearson’s statistical philosophy, “Why Pearson Rejected the Neyman-Pearson Theory of Statistics” (later adapted as chapter 11 of *EGEK*) to see whether I’d gotten Pearson right. Since Tuesday (Aug 11) is Pearson’s birthday, I’m reblogging this. Barnard had traveled quite a ways, from Colchester, I think. It was June and hot, and we were up on some kind of a semi-enclosed rooftop. Barnard was sitting across from me looking rather bemused.

*The curtain opens with Barnard and Mayo on the roof, lit by a spot mid-stage. He’s drinking (hot) tea; she, a Diet Coke. The dialogue (is what I recall from the time[i]):*

*Barnard:* I read your paper. I think it is quite good. Did you know that it was I who told Fisher that Neyman-Pearson statistics had turned his significance tests into little more than acceptance procedures?

*Mayo:* Thank you so much for reading my paper. I recall a reference to you in Pearson’s response to Fisher, but I didn’t know the full extent.

*Barnard:* I was the one who told Fisher that Neyman was largely to blame. He shouldn’t be too hard on Egon. His statistical philosophy, you are aware, was different from Neyman’s.

*Mayo:* That’s interesting. I did quote Pearson, at the end of his response to Fisher, as saying that inductive behavior was “Neyman’s field, not mine”. I didn’t know your role in his laying the blame on Neyman!

*Fade to black. The lights go up on Fisher, stage left, flashing back some 30 years earlier . . .*

*Fisher:* Now, acceptance procedures are of great importance in the modern world. When a large concern like the Royal Navy receives material from an engineering firm it is, I suppose, subjected to sufficiently careful inspection and testing to reduce the frequency of the acceptance of faulty or defective consignments. . . . I am casting no contempt on acceptance procedures, and I am thankful, whenever I travel by air, that the high level of precision and reliability required can really be achieved by such means. But the logical differences between such an operation and the work of scientific discovery by physical or biological experimentation seem to me so wide that the analogy between them is not helpful . . . . [Advocates of behavioristic statistics are like]

Russians [who] are made familiar with the ideal that research in pure science can and should be geared to technological performance, in the comprehensive organized effort of a five-year plan for the nation. . . .

In the U.S. also the great importance of organized technology has I think made it easy to confuse the process appropriate for drawing correct conclusions, with those aimed rather at, let us say, speeding production, or saving money. (Fisher 1955, 69-70)

*Fade to black. The lights go up on Egon Pearson stage right (who looks like he does in my sketch [frontispiece] from EGEK 1996, a bit like a young C. S. Peirce):*

*Pearson:* There was no sudden descent upon British soil of Russian ideas regarding the function of science in relation to technology and to five-year plans. . . . Indeed, to dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot . . . . To the best of my ability I was searching for a way of expressing in mathematical terms what appeared to me to be the requirements of the scientist in applying statistical tests to his data. (Pearson 1955, 204)

*Fade to black. The spotlight returns to Barnard and Mayo, but brighter. It looks as if it’s gotten hotter. Barnard wipes his brow with a white handkerchief. Mayo drinks her Diet Coke.*

*Barnard (ever so slightly angry):* You have made one blunder in your paper. Fisher would never have made that remark about Russia.

*There is a tense silence.*

*Mayo:* But—it was a quote.

*End of Act 1.*

*Given this was pre-internet, we couldn’t go to the source then and there, so we agreed to search for the paper in the library. Well, you get the idea. Maybe I could call the piece “Stat on a Hot Tin Roof.” *

*If you go see it, don’t say I didn’t warn you.*

I’ve gotten various new speculations over the years as to why he had this reaction to the mention of Russia (check discussions in earlier posts with this play). Neyman was born in a part of Poland that, from time to time, was part of Russia. Then there’s what Box wrote (in his recent autobiography) on Barnard’s sympathies. Feel free to share your speculations.

[i] We had also discussed this many years later, in 1999.

[ii] I’ve recently written a new one-act play on Neyman’s reminiscences on the origin of “that one paragraph”.

Filed under: Barnard, phil/history of stat, Statistics Tagged: Fisher, George Barnard, Likelihood Principle, Neyman, Pearson ]]>

“Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena” by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

Neyman died on August 5, 1981. Here’s an unusual paper of his, “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.” I have been reading a fair amount by Neyman this summer in writing about the origins of his philosophy, and have found further corroboration of the position that the behavioristic view attributed to him, while not entirely without substance*, is largely a fable that has been steadily built up and accepted as gospel. This has justified ignoring Neyman-Pearson statistics (as resting solely on long-run performance and irrelevant to scientific inference) and turning to crude variations of significance tests that Fisher wouldn’t have countenanced for a moment (so-called NHSTs): lacking alternatives, incapable of learning from negative results, and permitting all sorts of P-value abuses–notably going from a small p-value to claiming evidence for a substantive research hypothesis. The upshot is to reject all of frequentist statistics, even though P-values are a teeny tiny part. *This represents a change in my perception of Neyman’s philosophy since EGEK (Mayo 1996). I still say that, for our uses of the methods, it doesn’t matter what anybody thought: “it’s the methods, stupid!” Anyway, I recommend, in this very short paper, the general comments and the example on home ownership. Here are two snippets:

1. INTRODUCTION

The title of the present session involves an element that appears mysterious to me. This element is the apparent distinction between tests of statistical hypotheses, on the one hand, and tests of significance, on the other. If this is not a lapse of someone’s pen, then I hope to learn the conceptual distinction. Particularly with reference to applied statistical work in a variety of domains of Science, my own thoughts of tests of significance, or EQUIVALENTLY of tests of statistical hypotheses, are that they are tools to reduce the frequency of errors.…

(iv) A similar remark applies to the use of the words “decision” or “conclusion”. It seem to me that at our discussion, these particular words were used to designate only something like a final outcome of complicated analysis involving several tests of different hypotheses. In my own way of speaking, I do not hesitate to use the words ‘decision’ or “conclusion” every time they come handy. For example, in the analysis of the follow-up data for the [home ownership] experiment, Mark Eudey and I started by considering the importance of bias in forming the experimental and control groups of families. As a result of the tests we applied, we decided to act on the assumption (or concluded) that the two groups are not random samples from the same population. Acting on this assumption (or having reached this conclusions), we sought for ways to analyze that data other than by comparing the experimental and the control groups. The analyses we performed led us to “conclude” or “decide” that the hypotheses tested could be rejected without excessive risk of error. In other words, after considering the probability of error (that is, after considering how frequently we would be in error if in conditions of our data we rejected the hypotheses tested), we decided to act on the assumption that “high” scores on “potential” and on “education” are indicative of better chances of success in the drive to home ownership. (750-1; the emphasis is Neyman’s)

To read the full (short) paper: Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.

Following Neyman, I’ve “decided” to use the terms ‘tests of hypotheses’ and ‘tests of significance’ interchangeably in my new book.[1] Now it’s true that Neyman was more behavioristic than Pearson, and it’s also true that tests of statistical hypotheses or tests of significance need an explicit reformulation and statistical philosophy to explicate the role of error probabilities in inference. My way of providing this has been in terms of severe tests. However, in Neyman-Pearson applications, more than in their theory, you can find many examples as well. Recall Neyman’s paper, “The Problem of Inductive Inference” (Neyman 1955) wherein Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap. …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true]. The question is: does this result “confirm” the hypothesis that H_{0} is true of the particular data set? (Neyman, pp. 40-41)

Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc. If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H_{0}, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present]. Therefore, the failure of the test to reject H_{0} cannot be reasonably considered as anything like a confirmation of H_{0}. The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)

The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.

Neyman, like Peirce, Popper and many others, holds that the only “logic” is deductive logic. “Confirmation” for Neyman is akin to Popperian “corroboration”–you could corroborate a hypothesis *H* only to the extent that it passed a severe test–one with a high probability of having found flaws in *H*, if they existed. Neyman puts this in terms of having high power to reject *H*, if *H* is false and alternative *H’* true, and high probability of finding no evidence against *H* if true, but it’s the same idea. (Their weakness is in being predesignated error probabilities, but severity fixes this.) Unlike Popper, however, Neyman actually provides a methodology that can be shown to accomplish the task reliably.

Still, Fisher was correct to claim that Neyman is merely recording his preferred way of speaking. One could choose a different way. For example, Peirce defined induction as passing a severe test, and Popper said you could define it that way if you wanted to. But the main thing is that Neyman is attempting to distinguish the “inductive” or “evidence transcending” conclusions that statistics affords, on his approach[2] from assigning to hypotheses degrees of belief, probability, support, plausibility or the like.

De Finetti gets it right when he says that the expression “inductive behavior…that was for Neyman simply a slogan underlining and explaining the difference between his own, the Bayesian and the Fisherian formulations” became, with Wald’s work, “something much more substantial” (de Finetti 1972, p.176). De Finetti called this “the involuntarily destructive aspect of Wald’s work” (ibid.).

For related papers, see:

- Mayo, D. G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” in *Optimality: The Second Erich L. Lehmann Symposium* (ed. J. Rojo), Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.
- Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” *British Journal for the Philosophy of Science*, 57: 323-357.

[1] That really is a decision, though it’s based on evidence that doing so is in sync with what both Neyman and Pearson thought. There’s plenty of evidence, by the way, that Fisher is more behavioristic and less evidential than is Neyman, and certainly less than E. Pearson. I think this “he said/she said” route to understanding statistical methods is a huge mistake. I keep saying, “It’s the methods, stupid!”

[2] And, Neyman rightly assumed at first, from Fisher’s approach. Fisher’s loud rants, later on, that Neyman turned his tests into crude acceptance sampling affairs akin to Russian 5 year-plans, and money-making goals of U.S. commercialism, all occurred after the break in 1935 which registered a conflict of egos, not statistical philosophies. Look up “anger management” on this blog.

Fisher is the arch anti-Bayesian; whereas, Neyman experimented with using priors at the start. The problem wasn’t so much viewing parameters as random variables, but lacking knowledge of what their frequentist distributions could possibly be. Thus he sought methods whose validity held up regardless of priors. Here E. Pearson was closer to Fisher, but unlike the two others, he was a really nice guy. (I hope everyone knows I’m talking of Egon here, not his mean daddy.) See chapter 11 of EGEK (1996):

- Mayo, D. 1996. “Why Pearson rejected the Neyman-Pearson (behavioristic) philosophy and a note on objectivity in statistics” .

**References**

de Finetti, B. 1972. *Probability, Induction and Statistics*: The Art of Guessing. Wiley.

Neyman, J. 1976. “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.” *Commun. Statist. Theor. Meth*. A5(8), 737-751.

Filed under: Error Statistics, Neyman, Statistics Tagged: behavioristic vs evidential ]]>

Suppose you are reading about a statistically significant result *x* (at level α) from a one-sided test T+ of the mean of a Normal distribution, with the numerical specifics (σ = 10, n = 100) given below.

I have heard some people say [0]:

A. If the test’s power to detect alternative µ’ is very low, then the statistically significant *x* is *poor* evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s poor evidence that µ > µ’). *See point on language in notes. They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.

I have heard other people say:

B. If the test’s power to detect alternative µ’ is very low, then the statistically significant *x* is *good* evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that µ > µ’). They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is *unwarranted*.

**Which is correct, from the perspective of the (error statistical) philosophy, within which power and associated tests are defined?**

Allow that the test assumptions are adequately met. I have often said on this blog, and I repeat: the most misunderstood and abused (or unused) concept from frequentist statistics is that of a test’s power to reject the null hypothesis under the assumption that alternative µ’ is true: POW(µ’). I deliberately write it in this manner because it is faulty to speak of the power of a test without specifying the alternative against which it’s to be computed. It will also get you into trouble if you define power as in the first premise of a recent post:

the probability of correctly rejecting the null

–which is both ambiguous and fails to specify the all-important *conjectured* alternative. [For handholding slides on power, please see this post.] That you compute power for several alternatives is not the slightest bit problematic; it’s precisely what you *want* to do in order to assess the test’s capability to detect discrepancies. If you knew the true parameter value, why would you be running an inquiry to make statistical inferences about it?

It must be kept in mind that inferences are going to be in the form of µ > µ’ = µ_{0} + δ, or µ < µ’ = µ_{0} + δ, or the like. They are *not* to point values! (Not even to the point µ = M_{0}.) Most simply, you may consider that the inference is in terms of the one-sided lower confidence bound (for various confidence levels)–the dual of test T+.

**DEFINITION**: POW(T+, µ’) = Pr(Test T+ rejects *H*_{0}; µ’) = Pr(M > M*; µ’), where M is the sample mean and M* is the cut-off for rejection. (Since it’s continuous it doesn’t matter if we write > or ≥.) I’ll leave off the T+ and write POW(µ’).

In terms of P-values: POW(µ’) = Pr(P < p*; µ’) where P < p* corresponds to rejecting the null hypothesis at the given level.

**Let σ = 10, n = 100, so (σ/√n) = 1.** (Nice and simple!)

**Test T+ rejects H_{0} at the ~.025 level if M > 2.**

**CASE 1:** We need a µ’ such that POW(µ’) = low. The power against alternatives between the null and the cut-off M* will range from α to .5. Consider the power against the null:

1. POW(µ = 0) = α = .025. Since the probability of M > 2, under the assumption that µ = 0, is low, the significant result indicates µ > 0. That is, since power against µ = 0 is low, the statistically significant result is a good indication that µ > 0. Equivalently, 0 is the lower bound of a .975 confidence interval.

2. For a second example of low power that does not use the null: we get power of .04 if µ’ = M* – 1.75(σ/√n), which in this case is 2 – 1.75 = .25. That is, POW(.25) = .04.[ii] Equivalently, µ > .25 is the lower confidence bound at level .96 (this is the CI that is dual to the test T+).

**CASE 2:** We need a µ’ such that POW(µ’) = high. Using one of our power facts, POW(M* + 1(σ/√n)) = .84.

3. That is, adding one (σ/√n) unit to the cut-off M* takes us to an alternative against which the test has power = .84. So µ = 2 + 1 will work: POW(T+, µ = 3) = .84. See this post. Should we say that the significant result is a good indication that µ > 3? No: the corresponding confidence level would be only .16. Pr(M > 2; µ = 3) = Pr(Z > –1) = .84. It would be *terrible* evidence for µ > 3!
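The numbers in cases 1 and 2 can be verified directly from the Normal CDF: with σ/√n = 1 and cut-off M* = 2, POW(µ’) = Pr(M > 2; µ’) = 1 – Φ(2 – µ’). A minimal check (note the text’s .025 is a rounding of 1 – Φ(2) ≈ .023, since the cut-off 2 is only approximately the 1.96 point):

```python
import math

def normal_cdf(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power(mu_prime, cutoff=2.0, se=1.0):
    """POW(mu') = Pr(M > cutoff; mu = mu') for test T+ with sigma/sqrt(n) = se."""
    return 1.0 - normal_cdf((cutoff - mu_prime) / se)

print(round(power(0.0), 3))   # 0.023 (~ the .025 level, against the null)
print(round(power(0.25), 3))  # 0.04  (case 1, example 2)
print(round(power(3.0), 3))   # 0.841 (case 2: power ~ .84)
```

The same function, read the other way, gives the confidence levels quoted above: 1 – POW(3) ≈ .16 is the (low) confidence level for the inference µ > 3.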

Blue curve is the null; red curve is one possible conjectured alternative, µ = 3. The green area is the power; the little turquoise area is α.

As Stephen Senn points out (in my favorite of his posts), the alternative against which we set high power is the discrepancy from the null that “we should not like to miss”, delta (Δ). But Δ is not the discrepancy we may infer from a significant result (in a test where POW(Δ) = .84).

**So the correct answer is B.**

**Does A hold true if we happen to know (based on previous severe tests) that µ < µ’? I’ll return to this.**

*Point on language: “to detect alternative µ'” means, “produce a statistically significant result when µ = µ’.” It does not mean we infer µ’. Nor do we know the underlying µ’ after we see the data. Perhaps the strict definition should be employed unless one is clear on this. The power of the test to detect µ’ just refers to the probability the test would produce a result that rings the significance alarm, if the data were generated from a world or experiment where µ = µ’.

[0] A comment by Michael Lew brings out my presumption of a given α level, since that is the context in which this matter arises. (I actually have a further destination in mind, indicated in red, and it won’t be clear until a follow-up.) Back to Lew’s query: power is always defined for the “worst case” of just reaching the α-level cut-off, which is why I prefer the data-dependent severity, or associated confidence interval. Lew’s reference to a likelihood analysis underscores the need to indicate how one is evaluating evidence. Although I indicated “practicing within the error statistical tribe”, perhaps that was too vague. Still, the interpretation in the case he gives is not very different (except that I wouldn’t favor x over µ’). I thank Lew for his graphic.

[i] I surmise, without claiming a scientific data base, that this fallacy has been increasing over the past few years. It was discussed way back when in Morrison and Henkel (1970). (A relevant post relates to a Jackie Mason comedy routine.) Research was even conducted to figure out how psychologists could be so wrong. Wherever I’ve seen it, it’s due to (explicitly or implicitly) transposing the conditional in a Bayesian use of power. For example, (1 – β)/ α is treated as a kind of likelihood in a Bayesian computation. I say this is unwarranted, even for a Bayesian’s goal, see 2/10/15 post below.

[ii] Pr(M > 2; µ = .25) = Pr(Z > 1.75) = .04.
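For readers who want to check footnote [ii], here is a minimal sketch (mine, not from the post) that reproduces the number, assuming the sample mean M has standard error 1, so that Z = M − µ is standard normal:

```python
import math

def phi_upper(z):
    # Pr(Z > z) for a standard normal Z, via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))

# Pr(M > 2; µ = .25) with standard error 1 (assumed), i.e. Pr(Z > 1.75)
p = phi_upper((2 - 0.25) / 1.0)
print(round(p, 2))  # → 0.04
```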

OTHER RELEVANT POSTS ON POWER

- (6/9) U-Phil: Is the Use of Power* Open to a Power Paradox?
- (3/4/14) Power, power everywhere–(it) may not be what you think! [illustration]
- (3/12/14) Get empowered to detect power howlers
- 3/17/14 Stephen Senn: “Delta Force: To what Extent is clinical relevance relevant?”
- (3/19/14) Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
- (12/29/14) To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)
- (01/03/15) No headache power (for Deirdre)
- (02/10/15) What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?

Filed under: confidence intervals and tests, power, Statistics ]]>

**Stephen Senn**

*Head of Competence Center for Methodology and Statistics (CCMS)*

*Luxembourg Institute of Health*

*This post first appeared here.* An issue sometimes raised about randomized clinical trials is the problem of indefinitely many confounders. This, for example, is what John Worrall has to say:

Even if there is only a small probability that an individual factor is unbalanced, given that there are indefinitely many possible confounding factors, then it would seem to follow that the probability that there is some factor on which the two groups are unbalanced (when remember randomly constructed) might for all anyone knows be high. (Worrall, J., “What Evidence Is Evidence-Based Medicine?”, *Philosophy of Science* 2002; 69: S316–S330; see p. S324)

It seems to me, however, that this overlooks four matters. The first is that it is not indefinitely many variables we are interested in but only one, albeit one we can’t measure perfectly. This variable can be called ‘outcome’. We wish to see to what extent the difference observed in outcome between groups is compatible with the idea that chance alone explains it. The indefinitely many covariates can help us predict outcome but they are only of interest to the extent that they do so. However, although we can’t measure the difference we would have seen in outcome between groups in the absence of treatment, we can measure how much it varies within groups (where the variation cannot be due to differences between treatments). Thus we can say a great deal about random variation to the extent that group membership is indeed random.

The second point is that in the absence of a treatment effect, where randomization has taken place, the statistical theory predicts probabilistically how the variation in outcome between groups relates to the variation within. The third point, strongly related to the other two, is that statistical inference in clinical trials proceeds using ratios. The F statistic produced from Fisher’s famous analysis of variance is the *ratio* of the variance between to the variance within, calculated using observed outcomes. (The ratio form is due to Snedecor but Fisher’s approach using semi-differences of natural logarithms is equivalent.) The critics of randomization are talking about the effect of the unmeasured covariates on the *numerator* of this ratio. However, any factor that could be imbalanced *between* groups could vary strongly *within*, and thus while the numerator would be affected, so would the denominator. Any Bayesian will soon come to the conclusion that, given randomization, coherence imposes strong constraints on the degree to which one expects an unknown something to inflate the numerator (which implies not only differing between groups but also, coincidentally, having predictive strength) but not the denominator.

The final point is that statistical inferences are probabilistic: either about statistics in the frequentist mode or about parameters in the Bayesian mode. Many strong predictors varying from patient to patient will tend to inflate the variance within groups; this will be reflected in due turn in wider confidence intervals for the estimated treatment effect. It is not enough to attack the estimate. Being a statistician means never having to say you are certain. It is not the estimate that has to be attacked to prove a statistician a liar, it is the certainty with which the estimate has been expressed. We don’t call a man a liar who claims that with probability one half you will get one head in two tosses of a coin just because you might get two tails.
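Senn’s point about the denominator can be seen in a small simulation. The sketch below (my own illustration, with invented numbers) gives every patient a strong unmeasured covariate and no treatment effect at all; randomization nevertheless keeps the ratio test calibrated, because the covariate inflates the within-group spread along with any chance between-group imbalance:

```python
import random
import statistics

random.seed(42)

def one_trial(n_per_group=50):
    # Outcome = strong unmeasured covariate (weight 3) + noise; NO treatment effect.
    outcomes = [3 * random.gauss(0, 1) + random.gauss(0, 1)
                for _ in range(2 * n_per_group)]
    random.shuffle(outcomes)               # randomize patients to the two arms
    a, b = outcomes[:n_per_group], outcomes[n_per_group:]
    # Ratio statistic: between-group difference over within-group spread
    se = ((statistics.variance(a) + statistics.variance(b)) / n_per_group) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

# Under the null, |t| > 2 should occur roughly 5% of the time, covariate or not.
trials = 2000
rate = sum(abs(one_trial()) > 2 for _ in range(trials)) / trials
print(rate)
```

The unmeasured covariate triples the variance in the numerator, but it triples the denominator’s as well, so the false-rejection rate stays near the nominal level.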

Filed under: RCTs, S. Senn, Statistics Tagged: confounders, Evidence-based medicine ]]>

**MONTHLY MEMORY LANE: 3 years ago: July 2012. **I mark in **red** **three** posts that seem most apt for general background on key issues in this blog**.[1]** This new feature, appearing the last week of each month, began at the blog’s 3-year anniversary in Sept, 2014. (Once again it was tough to pick just 3; please check out others which might interest you, e.g., Schachtman on StatLaw, the machine learning conference on simplicity, the story of Lindley and particle physics, Glymour and so on.)

**July 2012**

- (7/1) PhilStatLaw: “Let’s Require Health Claims to Be ‘Evidence Based’” (Schachtman)
- (7/2) More from the Foundations of Simplicity Workshop*
- (7/3) Elliott Sober Responds on Foundations of Simplicity
- (7/4) Comment on Falsification
- (7/6) Vladimir Cherkassky Responds on Foundations of Simplicity
- (7/8) Metablog: Up and Coming
- **(7/9) Stephen Senn: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics**
- (7/10) PhilStatLaw: Reference Manual on Scientific Evidence (3d ed) on Statistical Significance (Schachtman)
- (7/11) Is Particle Physics Bad Science?
- (7/12) Dennis Lindley’s “Philosophy of Statistics”
- (7/15) Deconstructing Larry Wasserman – it starts like this…
- (7/16) Peter Grünwald: Follow-up on Cherkassky’s Comments
- (7/19) New Kvetch Posted 7/18/12
- (7/21) “Always the last place you look!”
- (7/22) Clark Glymour: The Theory of Search Is the Economics of Discovery (part 1)
- (7/23) Clark Glymour: The Theory of Search Is the Economics of Discovery (part 2)
- (7/27) P-values as Frequentist Measures
- **(7/28) U-PHIL: Deconstructing Larry Wasserman** (connects to 7/15)
- **(7/31) What’s in a Name? (Gelman’s blog)**

**[1] **excluding those recently reblogged. Posts that are part of a “unit” or a group of “U-Phils” count as one.

Filed under: 3-year memory lane, Statistics ]]>

Someone linked this to me on Twitter. I thought it was a home blog at first. Surely the U.S. Dept of Health and Human Services can give a better definition than this.

U.S. Department of Health and Human Services

Effective Health Care Program

Glossary of Terms

We know that many of the concepts used on this site can be difficult to understand. For that reason, we have provided you with a glossary to help you make sense of the terms used in Comparative Effectiveness Research. Every word that is defined in this glossary should appear highlighted throughout the Web site…..

Statistical Significance

Definition: A mathematical technique to measure whether the results of a study are likely to be true. Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance. Statistical significance is usually expressed as a P-value. The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true). Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

Example: For example, results from a research study indicated that people who had dementia with agitation had a slightly lower rate of blood pressure problems when they took Drug A compared to when they took Drug B. In the study analysis, these results were not considered to be statistically significant because p=0.2. The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

You can find it here. First of all, one should never use “likelihood” and “probability” interchangeably in what is meant to be a clarification of formal terms, as these mean very different things in statistics. Some of the claims given actually aren’t so bad if “likely” takes its statistical meaning, but are all wet if construed as mathematical probability.

What really puzzles me is, how do they expect readers to understand the claims that appear within this definition? Are their meanings known to anyone? Watch:

**Statistical Significance **

- A mathematical technique to measure whether the results of a study are likely to be true.

**What does it mean to say “the results of a study are likely to be true”?**

*Statistical significance* is calculated as the probability that an effect observed in a research study is occurring because of chance.

**Meaning?**

- Statistical significance is usually expressed as a P-value.
- The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true).

**How should we define “more likely that the results are true”?**

- Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

**oy, oy**

- The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

**Oy, oy, oy. **OK, I’ll turn this into a single “oy” and just suggest dropping “probably” (leaving the hypertext “probability”). But this was part of the illustration, not the definition.

Surely it’s possible to keep to their brevity and do a better job than this, even though one would really want to explain about the types of null hypotheses, the test statistic, the assumptions of the test (we aren’t told if their example is an RCT.) I’ve listed how they might capture what I think they mean to say, off the top of my head. Submit your improvements, corrections and additions, and I’ll add them. Updates will be indicated with (ii), (iii), etc.

**Statistical Significance**

- A mathematical technique to measure whether the results of a study are likely to be true.

a) A statistical technique to measure whether the results of a study indicate the null hypothesis is false, that some *genuine* discrepancy from the null hypothesis exists.

*Statistical significance* is calculated as the probability that an effect observed in a research study is occurring because of chance.

a) The statistical significance of an observed difference is the probability of observing results at least as large as was observed, even if the null hypothesis is true.

b) The statistical significance of an observed difference is how frequently even larger differences than were observed would occur (through chance variability), even if the null hypothesis is true.

- Statistical significance is usually expressed as a P-value.

a) Statistical significance may be expressed as a P-value associated with an observed difference from a null hypothesis *H*_{0} within a given statistical test T.

- The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true).

a) The smaller the P-value, the less consistent the results are with the null hypothesis, and the more consistent they are with a genuine discrepancy from the null.

- Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

a) Researchers generally regard the results as inconsistent with the null if statistical significance is less than 0.05 (p<.05).

- (Part of the illustrative example): The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

a) The probability that even larger differences would occur due to chance variability (even if the null is true) is high enough to regard the result as consistent with the null being true.

**7/17/15 remark:** Maybe there’s a convention in this glossary that if the word is not in hypertext, it is being used informally. In that case, this might not be so bad. I’d remove “probably” to get:

b) The probability that the results were due to chance was high enough to conclude that the two drugs did not differ in causing blood pressure problems.

**7/17/15:** In (ii) In reaction to a comment, I replaced d_{obs} with “observed difference”, and cut out Pr(d ≥ d_{obs} ;*H*_{0}). I also allowed that #6 wasn’t too bad, especially if (the non-hypertext) “probably” is removed. The only thing is, this was *not* part of the definition, but rather the illustration. So maybe this could be the basis for fixing the others in the definition itself.
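Proposed definition (b) above (“how frequently even larger differences than were observed would occur through chance variability, even if the null hypothesis is true”) can be put directly into a simulation. This is my own sketch, with invented numbers, not anything from the glossary:

```python
import random
import statistics

random.seed(1)

observed_diff = 0.5     # hypothetical observed difference between Drug A and Drug B
n = 30                  # patients per group (invented for illustration)

# Simulate the null: no real difference, so any group difference is chance alone.
sims = 10_000
as_large = 0
for _ in range(sims):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    if statistics.mean(a) - statistics.mean(b) >= observed_diff:
        as_large += 1

p_value = as_large / sims   # one-sided p-value: frequency of chance differences this large
print(round(p_value, 3))
```

Nothing here is “the probability the results are true”; it is the relative frequency with which chance variability alone would produce so large a difference.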


Filed under: P-values, Statistics ]]>

- The power of a test is the probability of correctly rejecting the null hypothesis. Write it as 1 – β.
- So, the probability of incorrectly rejecting the null hypothesis is β.
- But the probability of incorrectly rejecting the null is α (the type 1 error probability).

So α = β.

I’ve actually seen this, and variants on it [i].

[i] Although they didn’t go so far as to reach the final, shocking, deduction.
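The hole in the “deduction” is that α and power are tail areas computed under different hypotheses: α under the null, power under an alternative. A quick check (my own sketch, for a one-sided Normal test with known standard error):

```python
import math

def phi_upper(z):
    # Pr(Z > z) for a standard normal Z
    return 0.5 * math.erfc(z / math.sqrt(2))

cutoff = 1.645                     # reject when the test statistic exceeds 1.645
alpha = phi_upper(cutoff)          # computed under the NULL: ≈ .05
power = phi_upper(cutoff - 3.0)    # computed under an alternative 3 SEs away: ≈ .91
beta = 1 - power                   # ≈ .09 — plainly not equal to alpha
print(round(alpha, 2), round(beta, 2))  # prints 0.05 0.09
```

Step 2 of the howler conflates “incorrectly rejecting the null” (a type 1 error, probability α) with “incorrectly failing to reject” (a type 2 error, probability β under the alternative).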

Filed under: Error Statistics, power, Statistics ]]>

2015: The Large Hadron Collider (LHC) is back in collision mode in 2015[0]. There’s a 2015 update, a virtual display, and links from ATLAS, one of the two detectors at the LHC, here. *The remainder is from one year ago (2014). I’m reblogging a few of the Higgs posts at the anniversary of the 2012 discovery. (The first was in this post.) The following was originally “Higgs Analysis and Statistical Flukes: part 2” (from March, 2013).[1]*

Some people say to me: “This kind of reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out)[2] Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories.

**“Higgs Analysis and Statistical Flukes: part 2”**

Everyone was excited when the Higgs boson results were reported on July 4, 2012 indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of *excess events* of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post (part 1). It is an outsider’s angle on one small aspect of the statistical inferences involved. But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we [philosophers of statistics] should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels.

Here I keep close to an official report from ATLAS, in which researchers define a “global signal strength” parameter “such that μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the SM Higgs boson signal in addition to the background” (where SM is the Standard Model). The statistical test may be framed as a one-sided test, where the test statistic (which is actually a ratio) records differences in the positive direction, in standard deviation (sigma) units. Reports such as

Pr(Test T would yield at least a 5 sigma excess; H_{0}: background only) = extremely low

are deduced from the sampling distribution of the test statistic, fortified with much cross-checking of results (e.g., by modeling and simulating relative frequencies of observed excesses generated with “Higgs signal +background” compared to background alone). The inferences, even the formal statistical ones, go beyond p-value reports. For instance, they involve setting lower and upper bounds such that values excluded are ruled out with high severity, to use my term. But the popular report is in terms of the observed 5 sigma excess in an overall test T, and that is mainly what I want to consider here.

*Error probabilities*

In a Neyman-Pearson setting, a cut-off c_{α} is chosen pre-data so that the probability of a type I error is low. In general,

Pr(d(X) > c_{α}; H_{0}) ≤ α

and in particular, alluding to an overall test T:

(1) Pr(Test T yields d(X) > 5 standard deviations; H_{0}) ≤ .0000003.

The test at the same time is designed to ensure a reasonably high probability of detecting global strength discrepancies of interest. (I always use “discrepancy” to refer to parameter magnitudes, to avoid confusion with observed differences).

[Notice these are not likelihoods.] Alternatively, researchers can report observed standard deviations (here, the sigmas), or equivalently, the associated observed statistical significance probability, *p*_{0}. In general,

Pr(P < *p*_{0}; H_{0}) < *p*_{0}

and in particular,

(2) Pr(Test T yields P < .0000003; H_{0}) < .0000003.

For test T to yield a “worse fit” with *H*_{0 }(smaller p-value) due to background alone is sometimes called “a statistical fluke” or a “random fluke”, and the probability of so statistically significant a random fluke is ~0. With the March 2013 results, the 5 sigma difference has grown to 7 sigmas.
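For concreteness, the tail area behind (1) is easy to reproduce (a back-of-envelope version; the actual analyses use the more complicated ratio test statistic and extensive simulation):

```python
import math

# Pr(Z > 5) for a standard normal Z: the "5 sigma" tail area in (1)
p_5sigma = 0.5 * math.erfc(5 / math.sqrt(2))
print(p_5sigma)   # ≈ 2.9e-7, below the .0000003 bound
```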

So probabilistic statements along the lines of (1) and (2) are standard. They allude to sampling distributions, either of test statistic *d*(**X**), or the p-value viewed as a random variable. They are scarcely illicit or prohibited. (I return to this in the last section of this post).

*An implicit principle of inference or evidence*

Admittedly, the move to taking the 5 sigma effect as evidence for a genuine effect (of the Higgs-like sort) results from an implicit principle of evidence that I have been calling the severity principle (SEV). Perhaps the weakest form is to a statistical rejection or falsification of the null. (I will deliberately use a few different variations on statements that can be made.)

Data x_{0} from a test T provide evidence for rejecting H_{0} (just) to the extent that H_{0} would (very probably) have survived, were it a reasonably adequate description of the process generating the data (with respect to the question).

It is also captured by a general frequentist principle of evidence (FEV) (Mayo and Cox 2010), a variant on the general idea of severity (SEV) (EGEK 1996, Mayo and Spanos 2006, etc.).

The sampling distribution is computed, under the assumption that the production of observed results is similar to the “background alone”, with respect to relative frequencies of signal-like events. (Likewise for computations under hypothesized discrepancies.) The relationship between *H*_{0}* *and the probabilities of outcomes is an intimate one: the various statistical nulls live their lives to refer to aspects of general types of data generating procedures (for a taxonomy, see Cox 1958, 1977). “*H** _{0 }*is true” is a shorthand for a very long statement that

*Severity and the detachment of inferences*

The sampling distributions serve to give counterfactuals. In this case, they tell us what it would be like, statistically, were the mechanism generating the observed signals similar to *H*_{0}.[i] While one would want to go on to consider the probability test T yields so statistically significant an excess under various alternatives to μ = 0, this suffices for the present discussion. Sampling distributions can be used to arrive at error probabilities that are relevant for understanding the capabilities of the test process, in relation to something we want to find out. *Since a relevant test statistic is a function of the data and quantities about which we want to learn, the associated sampling distribution is the key to inference*. (This is why the bootstrap, and other types of re-sampling, work when one has a random sample from the process or population of interest.)

The *severity principle*, put more generally:

Data from a test T[ii] provide good evidence for inferring H (just) to the extent that H passes severely with x_{0}, i.e., to the extent that H would (very probably) not have survived the test so well were H false.

(The severity principle can also be made out just in terms of relative frequencies, as with bootstrap re-sampling.) In this case, what is surviving is minimally the non-null. Regardless of the specification of a statistical inference, to assess the severity associated with a claim *H* requires considering *H*‘s denial: together they exhaust the answers to a given question.
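In the simplest one-parameter Normal setting, a severity assessment comes out as an ordinary tail-area computation. A minimal sketch (my own invented numbers; the assessment of the Higgs data is far more involved):

```python
import math

def phi(z):
    # standard normal CDF
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Suppose the observed mean is 2 standard errors above 0, and we ask how severely
# the claim "µ > 0.5" has passed (all numbers invented for illustration).
m_obs, se, mu_claim = 2.0, 1.0, 0.5
# SEV(µ > 0.5) = Pr(a worse fit, M < m_obs, would occur; µ = 0.5):
# how probably the claim would NOT have survived so well, were it false.
sev = phi((m_obs - mu_claim) / se)
print(round(sev, 3))   # → 0.933
```

Claims about larger discrepancies (say µ > 1.5) would earn correspondingly lower severity from the same data.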

Without making such a principle explicit, some critics assume the argument is all about the reported p-value. The inference actually **detached** from the evidence can be put in any number of ways, and no uniformity is to be expected or needed:

(3) There is strong evidence for H: a Higgs (or a Higgs-like) particle.

(3)’ They have experimentally demonstrated H: a Higgs (or Higgs-like) particle.

Or just, infer H.

Doubtless particle physicists would qualify these statements, but nothing turns on that. ((3) and (3)’ are a bit stronger than merely falsifying the null because certain properties of the particle must be shown. I leave this to one side.)

As always, the mere p-value is a pale reflection of the detailed information about the consistency of results that really fortifies the knowledge of a genuine effect. Nor is the precise improbability level what matters. We care about the inferences to real effects (and estimated discrepancies) that are warranted.

*Qualifying claims by how well they have been probed*

The inference is qualified by the statistical properties of the test, as in (1) and (2), but that does not prevent detaching (3). This much is shown: they are able to experimentally demonstrate the Higgs particle. They can take that much of the problem as solved and move on to other problems of discerning the properties of the particle, and much else that goes beyond our discussion*. There is obeisance to the strict fallibility of every empirical claim, but there is no probability assigned. Neither is there in day-to-day reasoning, nor in the bulk of scientific inferences, which are not formally statistical. Having inferred (3), granted, one may say informally, “so probably we have experimentally demonstrated the Higgs”, or “probably, the Higgs exists” (?). Or an informal use of “likely” might arise. But whatever these might mean in informal parlance, they are not formal mathematical probabilities. (As often argued on this blog, discussions on statistical philosophy must not confuse these.)

[We can however write, SEV(H) ~1]

The claim in (3) is approximate and limited–as are the vast majority of claims of empirical knowledge and inference–and, moreover, we can say in just what ways. It is recognized that subsequent data will add precision to the magnitudes estimated, and may eventually lead to new and even entirely revised interpretations of the known experimental effects, models and estimates. That is what cumulative knowledge is about. (I sometimes hear people assert, without argument, that modeled quantities, or parameters, used to describe data generating processes are “things in themselves” and are outside the realm of empirical inquiry. This is silly. Else we’d be reduced to knowing only tautologies and maybe isolated instances as to how “I seem to feel now,” attained through introspection.)

*Telling what’s true about significance levels*

So we grant the critic that something like the severity principle is needed to move from statistical information plus background (theoretical and empirical) to inferences about evidence and inference (and to what levels of approximation). It may be called lots of other things and framed in different ways, and the reader is free to experiment. What we should not grant the critic is any allegation that there should be, or invariably is, a link from a small observed significance level to a small posterior probability assignment to *H*_{0}. Worse, (1 – the p-value) is sometimes alleged to be the posterior probability accorded to the Standard Model itself! This is neither licensed nor wanted!

If critics (or the p-value police, as Wasserman called them) maintain that Higgs researchers are misinterpreting their significance levels, correct them with the probabilities in (1) and (2). If they say, it is patently obvious that Higgs researchers want to use the p-value as a posterior probability assignment to *H*_{0}, point out the more relevant and actually attainable [iii] inference that is detached in (3). If they persist that what is really, really wanted is a posterior probability assignment to the inference about the Higgs in (3), ask why? As a formal posterior probability it would require a prior probability on all hypotheses that could explain the data. That would include not just

Degrees of belief will not do. Many scientists perhaps had (and have) strong beliefs in the Standard Model before the big collider experiments—given its perfect predictive success. Others may believe (and fervently wish) that it will break down somewhere (showing supersymmetry or whatnot); a major goal of inquiry is learning about viable rivals and how they may be triggered and probed. Research requires an open world not a closed one with all possibilities trotted out and weighed by current beliefs. [v] We need to point up what has not yet been well probed which, by the way, is very different from saying of a theory that it is “not yet probable”.

*Those prohibited phrases*

One may wish to return to some of the condemned phrases of particular physics reports. Take,

“There is less than a one in a million chance that their results are a statistical fluke”.

This is not to assign a probability to the null, just one of many ways (perhaps not the best) of putting claims about the sampling distribution: The statistical null asserts that *H*_{0}: background alone adequately describes the process.

*H*_{0} does not assert the results are a statistical fluke, but it tells us what we need to determine the probability of observed results “under *H*_{0}”. In particular, consider all outcomes in the sample space that are further from the null prediction than the observed, in terms of p-values {** x**: p < p

I am repeating myself, I realize, on the hopes that at least one phrasing will drive the point home. Nor is it even the improbability that substantiates this, it is the fact that an extraordinary set of coincidences would have to have occurred again and again. To nevertheless retain *H*_{0} as the source of the data would block learning. (Moreover, they know that if some horrible systematic mistake was made, it would be detected in later data analyses.)

I will not deny that there have been misinterpretations of p-values, but if a researcher has just described performing a statistical significance test, it would be “ungenerous” to twist probabilistic assertions into posterior probabilities. It would be a kind of “confirmation bias” whereby one insists on finding one sentence among very many that could conceivably be misinterpreted Bayesianly.

*Triggering, indicating, inferring*

As always, the error statistical philosopher would distinguish different questions at multiple stages of the inquiry. The aim of many preliminary steps is “behavioristic” and performance oriented: the goal being to control error rates on the way toward finding excess events or bumps of interest.

I hope it is (more or less) clear that blue is from 2015, burgundy is from 2014; black is old. If interested: *See statistical flukes (part 3)*

The original posts of parts 1 and 2 had around 30 comments each; you might want to look at them:

Part 1: http://errorstatistics.com/2013/03/17/update-on-higgs-data-analysis-statistical-flukes-1/

Part 2 http://errorstatistics.com/2013/03/27/higgs-analysis-and-statistical-flukes-part-2/

*Fisher insisted that to assert a phenomenon is experimentally demonstrable:

[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher, *Design of Experiments* 1947, 14)

2015/2014 Notes

[0] Physicists manage to learn quite a lot from negative results. They’d love to find something more exotic, but the negative results will not go away.

“Physicists aren’t just praying for hints of new physics, Strassler stresses. He says there is very good reason to believe that the LHC should find new particles. For one, the mass of the Higgs boson, about 125.09 billion electron volts, seems precariously low if the census of particles is truly complete. Various calculations based on theory dictate that the Higgs mass should be comparable to a figure called the Planck mass, which is about 17 orders of magnitude higher than the boson’s measured heft.” The article is here.

[1] My presentation at a Symposium on the Higgs discovery at the Philosophy of Science Association (Nov. 2014) is here.

[2] I have often noted that there are other times where we are trying to find evidence to support a previously held position.

___________

Original notes:

[i] This is a bit stronger than merely falsifying the null here, because certain features of the particle discerned must also be shown. I leave details to one side.

[ii] Which almost always refers to a set of tests, not just one.

[iii] I sense that some Bayesians imagine P(*H*) is more “hedged” than to actually infer (3). But the relevant hedging, the type we can actually attain, is given by an assessment of severity or corroboration or the like. Background enters via a repertoire of information about experimental designs, data analytic techniques, mistakes and flaws to be wary of, and a host of theories and indications about which aspects have/have not been severely probed. Many background claims enter to substantiate the error probabilities; others do not alter them.

[iv] In aspects of the modeling, researchers make use of known relative frequencies of events (e.g., rates of types of collisions) that lead to legitimate, empirically based, frequentist “priors” if one wants to call them that.

[v] After sending out the letter, prompted by Lindley, O’Hagan wrote up a synthesis: http://errorstatistics.com/2012/08/25/did-higgs-physicists-miss-an-opportunity-by-not-consulting-more-with-statisticians/

REFERENCES (from March, 2013 post):

ATLAS Collaboration (November 14, 2012), Atlas Note: “Updated ATLAS results on the signal strength of the Higgs-like boson for decays into WW and heavy fermion final states”, ATLAS-CONF-2012-162. http://cds.cern.ch/record/1494183/files/ATLAS-CONF-2012-162.pdf

Cox, D.R. (1958), “Some Problems Connected with Statistical Inference,” *Annals of Mathematical Statistics*, 29: 357–72.

Cox, D.R. (1977), “The Role of Significance Tests (with Discussion),” *Scandinavian Journal of Statistics*, 4: 49–70.

Mayo, D.G. (1996), *Error and the Growth of Experimental Knowledge*, University of Chicago Press, Chicago.

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 247-275.

Mayo, D.G., and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” *British Journal for the Philosophy of Science*, 57: 323–357.

Filed under: Higgs, highly probable vs highly probed, P-values, Severity ]]>

**Winner of June 2014 Palindrome Contest: (a dozen book choices)**

**Lori Wike:** Principal bassoonist of the Utah Symphony; faculty member at the University of Utah and Westminster College

**Palindrome: Sir, a pain, a madness! Elba gin in a pro’s tipsy end? I know angst, sir! I taste, I demonstrate lemon omelet arts. Nome diet satirists gnaw on kidneys, pits or panini. Gab less: end a mania, Paris!**

**Book choice**: *Conjectures and Refutations *(K. Popper 1962, New York: Basic Books)

**The requirement:** A palindrome using “demonstrate” (and Elba, of course).

**Bio:** Lori Wike is principal bassoonist of the Utah Symphony and is on the faculty of the University of Utah and Westminster College. She holds a Bachelor of Music degree from the Eastman School of Music and a Master of Arts degree in Comparative Literature from UC-Irvine.

**Statement**: “I’m very happy to be a fourth-time winner in this palindrome contest. This palindrome ended up being a particularly fun one to write. Here is a picture of me visiting Akaka Falls, a necessary stop on any palindromist tour itinerary! I’ve been fascinated by palindromes ever since first learning about them as a child in a Martin Gardner book. I started writing palindromes several years ago when my interest in the form was rekindled by reading about the constraint-based techniques of several Oulipo writers. While I love all sorts of wordplay and puzzles, and I occasionally write some word-unit palindromes as well, I find writing the traditional letter-unit palindromes to be the most satisfying challenge. My latest palindrome goal is to attempt to write a palindromic mystery story.”

**Runner-up:** John Falcone (Asturias, Spain)

**Die tarts! No medics. I’d abandon Elba. Nut unable? Nod. Nab a disc. I demonstrate id.**

**Mayo’s June attempts (selected):**

- Able Noah, Plato or God. Deified lo-diet arts, no media. I demonstrate idol: deified dogroot! Alpha on Elba.
- Disable code, but use tarts. No Medco data doc demonstrates U-tube doc. Elba’s id.
- Elba fat, a diet? Arts, no media. I demonstrate i-data fable
- Lo! Disable now Sam. God’s assay: “Monet arts, no med deified!” Demonstrate, no? My ass! As dogma’s won Elba’s idol.

**Lori is amazing, but you can win with very simple palindromes. (I call on Lori only if no one has won for months on end.) I’ll be adding more book selections for July and August.**

Filed under: Palindrome ]]>