How to avoid making mountains out of molehills (using power and severity)


In preparation for a new post that takes up some of the recent battles on reforming or replacing p-values, I reblog an older post on power, one of the most misunderstood and abused notions in statistics. (I add a few “notes on howlers”.)  The power of a test T in relation to a discrepancy from a test hypothesis H0 is the probability T will lead to rejecting H0 when that discrepancy is present. Power is sometimes misappropriated to mean something only distantly related to the probability a test leads to rejection; but I’m getting ahead of myself. This post is on a classic fallacy of rejection.

A classic fallacy of rejection is taking a statistically significant result as evidence of a discrepancy from a test (or null) hypothesis larger than is warranted. Standard tests do have resources to combat this fallacy, but you won't see them in textbook formulations. It's not a new statistical method that's needed, but new (and correct) interpretations of existing methods. One can begin with a companion to the rule in this recent post:

(1) If POW(T+,µ’) is low, then the statistically significant x is a good indication that µ > µ’.

To have the companion rule also in terms of power, let’s suppose that our result is just statistically significant at a level α. (As soon as the observed difference exceeds the cut-off the rule has to be modified). 

Rule (1) was stated in relation to a statistically significant result x (at level α) from a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ:   H0: µ ≤  0 against H1: µ >  0. Here’s the companion:

(2) If POW(T+,µ’) is high, then an α statistically significant x is a good indication that µ < µ’.
(The higher the POW(T+,µ’) is, the better the indication  that µ < µ’.)

That is, if the test’s power to detect alternative µ’ is high, then the just statistically significant x is a good indication (or good evidence) that the discrepancy from null is not as large as µ’ (i.e., there’s good evidence that  µ < µ’).

An account of severe testing based on error statistics is always keen to indicate inferences that are not warranted by the data, as well as those that are. Not only might we wish to indicate which discrepancies are poorly warranted, we can give upper bounds to warranted discrepancies by using (2).

POWER: POW(T+,µ’) = POW(Test T+ rejects H0;µ’) = Pr(M > M*; µ’), where M is the sample mean and M* is the cut-off for rejection. (Since it’s continuous, it doesn’t matter if we write > or ≥.)[i]

EXAMPLE. Let σ = 10, n = 100, so (σ/√n) = 1.  Test T+ rejects H0 at the .025 level if M > 1.96(1).

Find the power against µ = 2.8. To find Pr(M > 1.96; 2.8), get the standard Normal z = (1.96 – 2.8)/1 = -.84. Find the area to the right of -.84 on the standard Normal curve. It is approximately .8. So POW(T+, 2.8) = .8.
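For readers who want to check such numbers by machine, here is a minimal Python sketch (my addition, not part of the original post; scipy assumed) of the power computation for test T+ as specified above:

```python
from scipy.stats import norm

sigma, n, alpha = 10, 100, 0.025
se = sigma / n ** 0.5                  # sigma/sqrt(n) = 1
cutoff = norm.ppf(1 - alpha) * se      # 1.96: reject H0: mu <= 0 if M > 1.96

def power(mu_alt):
    """POW(T+, mu') = Pr(M > cutoff; mu'), with M ~ N(mu', se^2)."""
    return 1 - norm.cdf((cutoff - mu_alt) / se)

print(round(power(2.8), 2))            # 0.8, matching z = (1.96 - 2.8)/1 = -.84
```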

For simplicity in what follows, let the cut-off, M*, be 2. Let the observed mean M0 just reach the cut-off  2.

The power against alternatives between the null and the cut-off M* will range from α to .5. Power exceeds .5 only once we consider alternatives greater than M*, for these yield negative z values. Power fact: POW(T+, µ = M* + 1(σ/√n)) = .84.

That is, adding one (σ/ √n) unit to the cut-off M* takes us to an alternative against which the test has power = .84. So, POW(T+, µ = 3) = .84. See this post.

By (2), the (just) significant result x is decent evidence that µ < 3, because if µ ≥ 3, we'd have observed a more statistically significant result, with probability .84. The upper .84 confidence limit is 3. The significant result is much better evidence that µ < 4; the upper .975 confidence limit is 4 (approx.), etc.
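Here is a small companion sketch (again my own, using M0 = 2 as in the text) showing the power fact and the corresponding upper confidence limits side by side:

```python
from scipy.stats import norm

se, M_star, M0 = 1.0, 2.0, 2.0   # sigma/sqrt(n); cut-off; observed mean at the cut-off

# Power fact: one (sigma/sqrt n) unit above the cut-off gives power .84
print(round(1 - norm.cdf((M_star - 3.0) / se), 2))   # POW(T+, mu = 3) = 0.84

# Upper confidence limits from M0: M0 + z_c * se at confidence level c
print(round(M0 + norm.ppf(0.84) * se, 2))            # 2.99, i.e. ~3 (upper .84 limit)
print(round(M0 + norm.ppf(0.975) * se, 2))           # 3.96, i.e. ~4 (upper .975 limit)
```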

Reporting (2) is typically of importance in cases of highly sensitive tests, but I think it should always accompany a rejection to avoid making mountains out of molehills. (However, in my view, (2) should be custom-tailored to the outcome not the cut-off.) In the case of statistical insignificance, (2) is essentially ordinary power analysis. (In that case, the interest may be to avoid making molehills out of mountains.) Power analysis, applied to insignificant results, is especially of interest with low-powered tests. For example, failing to find a statistically significant increase in some risk may at most rule out (substantively) large risk increases. It might not allow ruling out risks of concern. Naturally, what counts as a risk of concern is a context-dependent consideration, often stipulated in regulatory statutes.

NOTES ON HOWLERS: When researchers set a high power to detect µ', it is not an indication they regard µ' as plausible, likely, expected, probable or the like. Yet we often hear people say "if statistical testers set .8 power to detect µ = 2.8 (in test T+), they must regard µ = 2.8 as probable in some sense". No, in no sense. Another thing you might hear is, "when H0: µ ≤ 0 is rejected (at the .025 level), it's reasonable to infer µ > 2.8", or "testers are comfortable inferring µ ≥ 2.8". No, they are not comfortable, nor should you be. Such an inference would be wrong with probability ~.8. Given M = 2 (or 1.96), you need to subtract to get a lower confidence bound, if the confidence level is not to exceed .5. For example, µ > .5 is a lower confidence bound at confidence level .93.
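A quick check of the numbers in this note (my sketch; same setup, observed mean M0 = 2):

```python
from scipy.stats import norm

se, M0 = 1.0, 2.0   # sigma/sqrt(n); observed mean at the cut-off M* = 2

# Confidence level of the lower bound "mu > .5" from M0 = 2
print(round(norm.cdf((M0 - 0.5) / se), 2))        # 0.93

# If mu were 2.8 (so "mu > 2.8" is false), a rejection -- and hence the
# erroneous inference "mu > 2.8" -- would occur with probability ~.8
print(round(1 - norm.cdf((M0 - 2.8) / se), 2))    # 0.79
```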

Rule (2) also provides a way to distinguish values within a 1-α confidence interval (instead of choosing a given confidence level and then reporting CIs in the dichotomous manner that is now typical).

At present, power analysis is only used to interpret negative results–and there it is often called "retrospective power" (which is a fine term, but it's often defined as what I call shpower). Again, confidence bounds could be, but they are not now, used to this end [iii].

Severity replaces M* in (2) with the actual result, be it significant or insignificant. 

Looking at power means looking at the best case (just reaching a significance level) or the worst case (just missing it). This is way too coarse; we need to custom tailor results using the observed data. That’s what severity does, but for this post, I wanted to just illuminate the logic.[ii]
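Here's a minimal sketch of that replacement, assuming the same test T+ as above and made-up observed means for comparison: rule (2) with the cut-off M* swapped for the actual M0.

```python
from scipy.stats import norm

def sev_upper(M0, mu_prime, se=1.0):
    """Severity for 'mu < mu_prime' given observed mean M0:
    Pr(M > M0; mu = mu_prime), i.e. rule (2) with M* replaced by M0."""
    return 1 - norm.cdf((M0 - mu_prime) / se)

print(round(sev_upper(2.0, 3.0), 2))   # 0.84: observed mean just at the cut-off
print(round(sev_upper(2.5, 3.0), 2))   # 0.69: custom-tailored to a larger observed mean
```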

One more thing:  

Applying (1) and (2) requires the error probabilities to be actual (approximately correct): Strictly speaking, rules (1) and (2) have a conjunct in their antecedents [iv]: “given the test assumptions are sufficiently well met”. If background knowledge leads you to deny (1) or (2), it indicates you’re denying the reported error probabilities are the actual ones. There’s evidence the test fails an “audit”. That, at any rate, is what I would argue.

————

[i] To state power in terms of P-values: POW(µ’) = Pr(P < p*; µ’) where P < p* corresponds to rejecting the null hypothesis at the given level.

[ii] It must be kept in mind that statistical testing inferences are going to be in the form of µ > µ' = µ0 + δ, or µ ≤ µ' = µ0 + δ, or the like (where µ0 is the null value, here 0). They are not to point values! (Not even to the point µ = M0.) Take a look at the alternative H1: µ > 0. It is not a point value. Although we are going beyond inferring the existence of some discrepancy, we still retain inferences in the form of inequalities.

[iii] That is, upper confidence bounds are too readily viewed as "plausible" bounds, and as values for which the data provide positive evidence. In fact, as soon as you get to an upper bound at confidence levels of around .6, .7, .8, etc., you actually have evidence that µ is less than that upper bound. See this post.

[iv] The “antecedent” of a conditional refers to the statement between the “if” and the “then”.

OTHER RELEVANT POSTS ON POWER

Categories: CIs and tests, Error Statistics, power | 1 Comment

The Conversion of Subjective Bayesian, Colin Howson, & the problem of old evidence (i)


“The subjective Bayesian theory as developed, for example, by Savage … cannot solve the deceptively simple but actually intractable old evidence problem, whence as a foundation for a logic of confirmation at any rate, it must be accounted a failure.” (Howson, (2017), p. 674)

What? Did the "old evidence" problem cause Colin Howson to recently abdicate his decades-long position as a leading subjective Bayesian? It seems to have. I was so surprised to come across this in a recent perusal of Philosophy of Science that I wrote to him to check if it is really true. (It is.) I thought perhaps it was a different Colin Howson, or the son of the one who co-wrote 3 editions of Howson and Urbach: Scientific Reasoning: The Bayesian Approach, espousing hard-line subjectivism since 1989.[1] I am not sure which of the several paradigms of non-subjective or default Bayesianism Howson endorses (he'd argued for years, convincingly, against any one of them), nor how he handles various criticisms (Kass and Wasserman 1996); I put that aside. Nor have I yet worked through his rather complex paper to the extent necessary. What about the "old evidence" problem, made famous by Clark Glymour (1980)? What is it?

Consider Jay Kadane, a well-known subjective Bayesian statistician. According to Kadane, the probability statement: Pr(d(X) ≥ 1.96) = .025

“is a statement about d(X) before it is observed. After it is observed, the event {d(X) ≥ 1.96} either happened or did not happen and hence has probability either one or zero” (2011, p. 439).

Knowing d0 = 1.96 (the specific value of the test statistic d(X)), Kadane is saying, there's no more uncertainty about it.* But would he really give it probability 1? If the probability of the data x is 1, Glymour argues, then Pr(x|H) also is 1, but then Pr(H|x) = Pr(H)Pr(x|H)/Pr(x) = Pr(H), so there is no boost in probability for a hypothesis or model arrived at after x. So does that mean known data don't supply evidence for H? (Known data are sometimes said to violate temporal novelty: data are temporally novel only if the hypothesis or claim of interest came first.) If the data have probability 1, the boost seems to be blocked. That's the old evidence problem. Subjective Bayesianism is faced with the old evidence problem if known evidence has probability 1, or so the argument goes.

What’s the accepted subjective Bayesian solution to this?  (I’m really asking.) One attempt is to subtract out, or try to, the fact that x is known, and envision being in a context prior to knowing x. That’s not very satisfactory or realistic, in general. Subjective Bayesians in statistics, I assume, just use the likelihoods and don’t worry about this: known data are an instance of a general random variable X, and you just use the likelihood once it’s known that {X = x}. But can you do this and also hold, with Kadane, that it’s an event with probability 1? I’ve always presumed that the problem was mainly for philosophers who want to assign probabilities to statements in a language, rather than focusing on random variables and their distributions, or statistical models (a mistake in my opinion). I also didn’t think subjective Bayesians in statistics were prepared to say, with Kadane, that an event has probability 1 after it’s observed or known. Yet if probability measures your uncertainty in the event, Kadane seems right. So how does the problem of old evidence get solved by subjective Bayesian practitioners? I asked Kadane years ago, but did not get a reply.

Any case where the data are known prior to constructing or selecting a hypothesis to accord with them would, strictly speaking, count as a case where data are known, or so it seems.** The best-known cases in philosophy allude to a known phenomenon, such as Mercury’s perihelion, as evidence for Einstein’s GTR. (The perihelion was long known as anomalous for Newton, yet GTR’s predicting it, without adjustments, is widely regarded as evidence for GTR.)[2] You can read some attempted treatments by philosophers in Howson’s paper; I discuss Garber’s attempt in Chapter 10, Mayo 1996 [EGEK], 10.2.[3] I’d like to hear from readers, regardless of statistical persuasion, how it’s handled in practice (or why it’s deemed unproblematic).

But wait, are we sure it isn’t also a problem for non-subjective or default Bayesians? In this paradigm (and there are several varieties), the prior probabilities in hypotheses are not taken to express degrees of belief but are given by various formal assignments, so as to have minimal impact on the posterior probability. Although the holy grail of finding “uninformative” default priors has been given up, default priors are at least supposed to ensure the data dominate in some sense.[4] A true blue subjective Bayesian like Kadane is unhappy with non-subjective priors. Rather than quantify prior beliefs, non-subjective priors are viewed as primitives or conventions or references for obtaining posterior probabilities. How are they to be interpreted? It’s not clear, but let’s put this aside to focus on the “old evidence” problem.

OK, so how do subjective Bayesians get around  the old evidence problem?

*I thank Jay Kadane for noticing I used the inequality in my original post 11/27/17. I haven’t digested his general reaction yet, stay tuned.
**There’s a place where Glymour (or Glymour, Scheines, Spirtes, and Kelly 1987) slyly argues that, strictly speaking, the data are always known by the time you appraise some model–or so I seem to recall. But I’d have to research that or ask him.

[1] I’ll have to add a footnote to my new book (Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, CUP, 2018), as I allude to him as a subjectivist Bayesian philosopher throughout.
[2] I argue that the reason it was good evidence for GTR is precisely because it was long known, and yet all attempts to explain it were ad hoc, so that they failed to pass severe tests. The deflection effect, by contrast, was new and no one had discerned it before, let alone tried to explain it. Note that this is at odds with the idea that novel results count more for a theory H when H (temporally) predicts them, than when H accounts for known results. (Here the known perihelion of Mercury is thought to be better evidence for GTR than the novel deflection effect.) But the issue isn’t the novelty of the results, it’s how well-tested H is, or so I argue. (EGEK Chapter 8, p. 288).
[3] I don’t see that the newer attempts avoid the key problem in Garber’s. I’m not sure if Howson is rescinding the remark I quote from him in EGEK, p. 333. Here he was trying to solve it by subtracting the data out from what’s known.
[4] Some may want to use “informative” priors as well, but their meaning/rationale is unclear. Howson mentions Wes Salmon’s style of Bayesianism in this paper, but Salmon was a frequentist.

REFERENCES

-Glymour, C. (1980), Theory and Evidence, Princeton University Press. I’ve added a link to the relevant chapter, “Why I am Not a Bayesian” (from Fitelson resources). The relevant pages are 85-93.
-Howson, C. (2017), “Putting on the Garber Style? Better Not”, Philosophy of Science 84: 659–676.
-Kass, R. and Wasserman, L. (1996), “The Selection of Prior Distributions By Formal Rules”, JASA 91: 1343-70.

Further References to Solutions (to this or Related problems): 

-Garber, Daniel. 1983. “Old Evidence and Logical Omniscience in Bayesian Confirmation Theory.” In Minnesota Studies in the Philosophy of Science, ed. J. Earman, 99–131. Minneapolis: University of Minnesota Press.
-Hartmann, Stephan, and Branden Fitelson. 2015. “A New Garber-Style Solution to the Problem of Old Evidence.” Philosophy of Science 82 (4): 712–17.
-Seidenfeld, T., Schervish, M., and Kadane, J. 2012. “What Kind of Uncertainty Is That?” Journal of Philosophy: 516–533.

Categories: Bayesian priors, objective Bayesians, Statistics | Tags: | 25 Comments

Erich Lehmann’s 100th Birthday: Neyman-Pearson vs Fisher on P-values

Erich Lehmann 20 November 1917 – 12 September 2009

Erich Lehmann was born 100 years ago today! (20 November 1917 – 12 September 2009). Lehmann was Neyman’s first student at Berkeley (Ph.D 1942), and his framing of Neyman-Pearson (NP) methods has had an enormous influence on the way we typically view them.*


I got to know Erich in 1997, shortly after publication of EGEK (1996). One day, I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that).  He began by telling me that he was sitting in a very large room at an ASA (American Statistical Association) meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, wood table sat just one book, all alone, shiny red.

He said, “I wonder if it might be of interest to me!” So he walked up to it…. It turned out to be my Error and the Growth of Experimental Knowledge (1996, Chicago), which he reviewed soon after[0]. (What are the chances?) Some related posts on Lehmann’s letter are here and here.

Continue reading

Categories: Fisher, P-values, phil/history of stat | 3 Comments

3 YEARS AGO (NOVEMBER 2014): MEMORY LANE


3 years ago…

MONTHLY MEMORY LANE: 3 years ago: November 2014. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green 3- 4 others of general relevance to philosophy of statistics (in months where I’ve blogged a lot)[2].  Posts that are part of a “unit” or a group count as one (11/1/14 & 11/09/14 and 11/15/14 & 11/25/14 are grouped). The comments are worth checking out.

 

November 2014

  • 11/01 Philosophy of Science Assoc. (PSA) symposium on Philosophy of Statistics in the Higgs Experiments “How Many Sigmas to Discovery?”
  • 11/09 “Statistical Flukes, the Higgs Discovery, and 5 Sigma” at the PSA
  • 11/11 The Amazing Randi’s Million Dollar Challenge
  • 11/12 A biased report of the probability of a statistical fluke: Is it cheating?
  • 11/15 Why the Law of Likelihood is bankrupt–as an account of evidence

     

  • 11/18 Lucien Le Cam: “The Bayesians Hold the Magic”
  • 11/20 Erich Lehmann: Statistician and Poet
  • 11/22 Msc Kvetch: “You are a Medical Statistic”, or “How Medical Care Is Being Corrupted”
  • 11/25 How likelihoodists exaggerate evidence from statistical tests
  • 11/30 3 YEARS AGO: MONTHLY (Nov.) MEMORY LANE

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016, March 30, 2017 – a very convenient way to allow data-dependent choices (note why it’s legit in selecting blog posts, on severity grounds).

 


Categories: 3-year memory lane | 1 Comment

Yoav Benjamini, “In the world beyond p < .05: When & How to use P < .0499…"


These were Yoav Benjamini’s slides, “In the world beyond p<.05: When & How to use P<.0499…”, from our session at the ASA 2017 Symposium on Statistical Inference (SSI): A World Beyond p < 0.05. (Mine are in an earlier post.) He begins by asking:

However, it’s mandatory to adjust for selection effects, and Benjamini is one of the leaders in developing ways to carry out the adjustments. Even calling out the avenues for cherry-picking and multiple testing, long known to invalidate p-values, would make replication research more effective (and less open to criticism). Continue reading

Categories: Error Statistics, P-values, replication research, selection effects | 22 Comments

Going round and round again: a roundtable on reproducibility & lowering p-values


There will be a roundtable on reproducibility Friday, October 27th (noon Eastern time), hosted by the International Methods Colloquium, on the reproducibility crisis in social sciences motivated by the paper, “Redefine statistical significance.” Recall, that was the paper written by a megateam of researchers as part of the movement to require p ≤ .005, based on appraising significance tests by a Bayes Factor analysis, with prior probabilities on a point null and a given alternative. It seems to me that if you’re prepared to scrutinize your frequentist (error statistical) method on grounds of Bayes Factors, then you must endorse using Bayes Factors (BFs) for inference to begin with. If you don’t endorse BFs–and, in particular, the BF required to get the disagreement with p-values–*, then it doesn’t make sense to appraise your non-Bayesian method on grounds of agreeing or disagreeing with BFs. For suppose you assess the recommended BFs from the perspective of an error statistical account–that is, one that checks how frequently the method would uncover or avoid the relevant mistaken inference.[i] Then you will find the situation is reversed, and the recommended BF exaggerates the evidence!  (In particular, with high probability, it gives an alternative H’ fairly high posterior probability, or comparatively higher probability, even though H’ is false.) They’re measuring very different things, and it’s illicit to expect an agreement on numbers.[ii] We’ve discussed this quite a lot on this blog (2 are linked below [iii]).
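As a toy illustration of that last point (mine, not the megateam’s analysis or anything from the roundtable; the point alternatives and unit standard error are assumptions of the sketch), here is how a Bayes factor for a point null against various point alternatives behaves at a result just reaching z = 1.96:

```python
from scipy.stats import norm

z_obs, se = 1.96, 1.0   # observed mean just at the two-sided .05 cut-off

def bf_null(m_obs, mu1, se=1.0):
    """Likelihood ratio (Bayes factor) for H0: mu = 0 against a point alternative mu = mu1."""
    return norm.pdf(m_obs, 0, se) / norm.pdf(m_obs, mu1, se)

for mu1 in (1.0, 1.96, 3.0, 5.0):
    print(mu1, round(bf_null(z_obs, mu1), 2))
# 1.0 -> 0.23, 1.96 -> 0.15, 3.0 -> 0.25, 5.0 -> 14.88: the same just-significant
# result looks like modest evidence against H0, or strong evidence for it,
# depending on the alternative chosen -- the p-value and the BF answer different questions.
```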

If the given list of panelists is correct, it looks to be 4 against 1, but I’ve no doubt that Lakens can handle it.

Continue reading

Categories: Announcement, P-values, reforming the reformers, selection effects | 5 Comments

Deconstructing “A World Beyond P-values”

A world beyond p-values?

I was asked to write something explaining the background of my slides (posted here) in relation to the recent ASA “A World Beyond P-values” conference. I took advantage of some long flight delays on my return to jot down some thoughts:

The contrast between the closing session of the conference “A World Beyond P-values,” and the gist of the conference itself, shines a light on a pervasive tension within the “Beyond P-Values” movement. Two very different debates are taking place. First there’s the debate about how to promote better science. This includes welcome reminders of the timeless demands of rigor and integrity required to avoid deceiving ourselves and others–especially crucial in today’s world of high-powered searches and Big Data. That’s what the closing session was about. [1] Continue reading

Categories: P-values, Philosophy of Statistics, reforming the reformers | 8 Comments

Statistical skepticism: How to use significance tests effectively: 7 challenges & how to respond to them

Here are my slides from the ASA Symposium on Statistical Inference: “A World Beyond p < .05”, in the session, “What are the best uses for P-values?”. (Aside from me, our session included Yoav Benjamini and David Robinson, with chair: Nalini Ravishanker.)

7 QUESTIONS

  • Why use a tool that infers from a single (arbitrary) P-value that pertains to a statistical hypothesis H0 to a research claim H*?
  • Why use an incompatible hybrid (of Fisher and N-P)?
  • Why apply a method that uses error probabilities, the sampling distribution, researcher “intentions” and violates the likelihood principle (LP)? You should condition on the data.
  • Why use methods that overstate evidence against a null hypothesis?
  • Why do you use a method that presupposes the underlying statistical model?
  • Why use a measure that doesn’t report effect sizes?
  • Why do you use a method that doesn’t provide posterior probabilities (in hypotheses)?

 

Categories: P-values, spurious p values, statistical tests, Statistics | Leave a comment

New venues for the statistics wars

I was part of something called “a brains blog roundtable” on the business of p-values earlier this week–I’m glad to see philosophers getting involved.

Next week I’ll be in a session that I think is intended to explain what’s right about P-values at an ASA Symposium on Statistical Inference: “A World Beyond p < .05”. Continue reading

Categories: Announcement, Bayesian/frequentist, P-values | 3 Comments

G.A. Barnard: The “catch-all” factor: probability vs likelihood

 

G.A. Barnard: 23 Sept. 1915 – 9 Aug. 2002

With continued acknowledgement of Barnard’s birthday on Friday, Sept. 23, I reblog an exchange on catchall probabilities from “The Savage Forum” (pp. 79-84, Savage 1962) with some new remarks.[i]

 BARNARD:…Professor Savage, as I understand him, said earlier that a difference between likelihoods and probabilities was that probabilities would normalize because they integrate to one, whereas likelihoods will not. Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses. … If this is so we ought to allow that in addition to the hypotheses that we really consider we should allow something that we had not thought of yet, and of course as soon as we do this we lose the normalizing factor of the probability, and from that point of view probability has no advantage over likelihood. This is my general point, that I think while I agree with a lot of the technical points, I would prefer that this is talked about in terms of likelihood rather than probability. I should like to ask what Professor Savage thinks about that, whether he thinks that the necessity to enumerate hypotheses exhaustively, is important. Continue reading

Categories: Barnard, highly probable vs highly probed, phil/history of stat | 6 Comments

George Barnard’s birthday: stopping rules, intentions

G.A. Barnard: 23 Sept.1915 – 9 Aug.2002

Today is George Barnard’s birthday. I met him in the 1980s and we corresponded off and on until 1999. Here’s a snippet of his discussion with Savage (1962) (link below [i]) that connects to issues often taken up on this blog: stopping rules and the likelihood principle. (It’s a slightly revised reblog of an earlier post.) I’ll post some other items related to Barnard this week, in honor of his birthday.

Happy Birthday George!

Barnard: I have been made to think further about this issue of the stopping rule since I first suggested that the stopping rule was irrelevant (Barnard 1947a,b). This conclusion does not follow only from the subjective theory of probability; it seems to me that the stopping rule is irrelevant in certain circumstances.  Since 1947 I have had the great benefit of a long correspondence—not many letters because they were not very frequent, but it went on over a long time—with Professor Bartlett, as a result of which I am considerably clearer than I was before. My feeling is that, as I indicated [on p. 42], we meet with two sorts of situation in applying statistics to data. One is where we want to have a single hypothesis with which to confront the data. Do they agree with this hypothesis or do they not? Now in that situation you cannot apply Bayes’s theorem because you have not got any alternatives to think about and specify—not yet. I do not say they are not specifiable—they are not specified yet. And in that situation it seems to me the stopping rule is relevant. Continue reading

Categories: Likelihood Principle, Philosophy of Statistics | Tags: | 2 Comments

Revisiting Popper’s Demarcation of Science 2017

28 July 1902- 17 Sept. 1994

Karl Popper died on September 17, 1994. One thing that gets revived in my new book (Statistical Inference as Severe Testing, 2018, CUP) is a Popperian demarcation of science vs pseudoscience. Here’s a snippet from what I call a “live exhibit” (where the reader experiments with a subject) toward the end of a chapter on Popper:

Live Exhibit. Revisiting Popper’s Demarcation of Science: Here’s an experiment: Try shifting what Popper says about theories to a related claim about inquiries to find something out. To see what I have in mind, join me in watching a skit over the lunch break:

Physicist: “If mere logical falsifiability suffices for a theory to be scientific, then, we can’t properly oust astrology from the scientific pantheon. Plenty of nutty theories have been falsified, so by definition they’re scientific. Moreover, scientists aren’t always looking to subject well corroborated theories to “grave risk” of falsification.”

Fellow traveler: “I’ve been thinking about this. On your first point, Popper confuses things by making it sound as if he’s asking: When is a theory unscientific? What he is actually asking or should be asking is: When is an inquiry into a theory, or an appraisal of claim H unscientific? We want to distinguish meritorious modes of inquiry from those that are BENT. If the test methods enable ad hoc maneuvering, sneaky face-saving devices, then the inquiry–the handling and use of data–is unscientific. Despite being logically falsifiable, theories can be rendered immune from falsification by means of cavalier methods for their testing. Adhering to a falsified theory no matter what is poor science. On the other hand, some areas have so much noise that you can’t pinpoint what’s to blame for failed predictions. This is another way that inquiries become bad science.”

She continues: Continue reading

Categories: Error Statistics, Popper, pseudoscience, science vs pseudoscience | Tags: | 10 Comments

Peircean Induction and the Error-Correcting Thesis


C. S. Peirce: 10 Sept, 1839-19 April, 1914

Sunday, September 10, was C.S. Peirce’s birthday. He’s one of my heroes. He’s a treasure chest on essentially any topic, and anticipated quite a lot in statistics and logic. (As Stephen Stigler (2016) notes, he’s to be credited with articulating and applying randomization [1].) I always find something that feels astoundingly new, even rereading him. He’s been a great resource as I complete my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018) [2]. I’m reblogging the main sections of a (2005) paper of mine. It’s written for a very general philosophical audience; the statistical parts are very informal. I first posted it in 2013. Happy (belated) Birthday, Peirce!

Peircean Induction and the Error-Correcting Thesis
Deborah G. Mayo
Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy, Volume 41, Number 2, 2005, pp. 299-319

Peirce’s philosophy of inductive inference in science is based on the idea that what permits us to make progress in science, what allows our knowledge to grow, is the fact that science uses methods that are self-correcting or error-correcting:

Induction is the experimental testing of a theory. The justification of it is that, although the conclusion at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error. (5.145)

Inductive methods—understood as methods of experimental testing—are justified to the extent that they are error-correcting methods. We may call this Peirce’s error-correcting or self-correcting thesis (SCT): Continue reading

Categories: Bayesian/frequentist, C.S. Peirce | 2 Comments

Professor Roberta Millstein, Distinguished Marjorie Grene speaker September 15

 

CANCELED

Virginia Tech Philosophy Department

2017 Distinguished Marjorie Grene Speaker

 

Professor Roberta L. Millstein


University of California, Davis

“Types of Experiments and Causal Process Tracing: What Happened on the Kaibab Plateau in the 1920s?”

September 15, 2017

320 Lavery Hall: 5:10-6:45pm

 


Continue reading

Categories: Announcement | 4 Comments

All She Wrote (so far): Error Statistics Philosophy: 6 years on


D.G. Mayo with her  blogging typewriter

Error Statistics Philosophy: Blog Contents (6 years) [i]
By: D. G. Mayo

Dear Reader: It’s hard to believe I’ve been blogging for six years (since Sept. 3, 2011)! A big celebration is taking place at the Elbar Room this evening. If you’re in the neighborhood, stop by for some Elba Grease.

Amazingly, this old typewriter not only still works; one of the whiz kids on Elba managed to bluetooth it to go directly from my typewriter onto the blog (I never got used to computer keyboards.) I still must travel to London to get replacement ribbons for this klunker.

Please peruse the offerings below, and take advantage of some of the super contributions and discussions by guest posters and readers! I don’t know how much longer I’ll continue blogging–I’ve had to cut back this past year (sorry)–but at least until the publication of my book “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars” (CUP, 2018). After that I plan to run conferences, workshops, and ashrams on PhilStat and PhilSci, and I will invite readers to take part! Keep reading and commenting. Sincerely, D. Mayo


 

September 2011

October 2011 Continue reading

Categories: blog contents, Metablog | Leave a comment

Egon Pearson’s Heresy

E.S. Pearson: 11 Aug 1895-12 June 1980.

Here’s one last entry in honor of Egon Pearson’s birthday: “Statistical Concepts in Their Relation to Reality” (Pearson 1955). I’ve posted it several times over the years (6!), but always find a new gem or two, despite its being so short. E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in a long-run control of rates of erroneous interpretations–what he termed the “behavioral” rationale of tests. In an unpublished letter E. Pearson wrote to Birnbaum (1974), he talks about N-P theory admitting of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

(Nowadays, some people concentrate to an absurd extent on “science-wise error rates in dichotomous screening”.) Continue reading

Categories: phil/history of stat, Philosophy of Statistics, Statistics | Tags: , , | Leave a comment

A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

11 August 1895 – 12 June 1980

Continuing with my Egon Pearson posts in honor of his birthday, I reblog a post by Aris Spanos: “Egon Pearson’s Neglected Contributions to Statistics”.

    Egon Pearson (11 August 1895 – 12 June 1980) is widely known today for his contribution in recasting Fisher’s significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

(i) specification: the need to state explicitly the inductive premises of one’s inferences,

(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality  assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the Student’s t fame] and then Fisher (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as the simple Normal model: Continue reading

Categories: E.S. Pearson, phil/history of stat, Spanos, Testing Assumptions | 2 Comments

Performance or Probativeness? E.S. Pearson’s Statistical Philosophy


E.S. Pearson (11 Aug, 1895-12 June, 1980)

This is a belated birthday post for E.S. Pearson (11 August 1895 – 12 June 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ll blog some E. Pearson items this week, including my latest reflection on a historical anecdote regarding Egon and the woman he wanted to marry, and surely would have, were it not for his father Karl!

HAPPY BELATED BIRTHDAY EGON!

Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long-run error properties is of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.  Continue reading

Categories: highly probable vs highly probed, phil/history of stat, Statistics | Tags: | Leave a comment

Thieme on the theme of lowering p-value thresholds (for Slate)


Here’s an article by Nick Thieme on the same theme as my last blogpost. Thieme, who is Slate’s 2017 AAAS Mass Media Fellow, is the first person to interview me on p-values who (a) was prepared to think through the issue for himself (or herself), and (b) included more than a tiny fragment of my side of the exchange.[i] Please share your comments.

Will Lowering P-Value Thresholds Help Fix Science? P-values are already all over the map, and they’re also not exactly the problem.

 

 

Illustration by Slate


Last week a team of 72 scientists released the preprint of an article attempting to address one aspect of the reproducibility crisis, the crisis of conscience in which scientists are increasingly skeptical about the rigor of our current methods of conducting scientific research.

Their suggestion? Change the threshold for what is considered statistically significant. The team, led by Daniel Benjamin, a behavioral economist from the University of Southern California, is advocating that the “probability value” (p-value) threshold for statistical significance be lowered from the current standard of 0.05 to a much stricter threshold of 0.005. Continue reading

Categories: P-values, reforming the reformers, spurious p values | 14 Comments

“A megateam of reproducibility-minded scientists” look to lowering the p-value


Having discussed the “p-values overstate the evidence against the null fallacy” many times over the past few years, I leave it to readers to disinter the issues (pro and con), and appraise the assumptions, in the most recent rehearsal of the well-known Bayesian argument. There’s nothing intrinsically wrong with demanding everyone work with a lowered p-value–if you’re so inclined to embrace a single, dichotomous standard without context-dependent interpretations, especially if larger sample sizes are required to compensate for the loss of power. But lowering the p-value won’t solve the problems that vex people (biasing selection effects), and is very likely to introduce new ones (see my comment). Kelly Servick, a reporter from Science, gives the ingredients of the main argument given by “a megateam of reproducibility-minded scientists” in an article out today: Continue reading

Categories: Error Statistics, highly probable vs highly probed, P-values, reforming the reformers | 55 Comments
