Diary For Statistical War Correspondents on the Latest Ban on Speech

When science writers, especially “statistical war correspondents”, contact you to weigh in on some article, they may talk to you until they get something spicy, and then they may or may not include the background context. So a few writers contacted me this past week regarding this article (“Retire Statistical Significance”)–a teaser, I now suppose, to advertise the ASA collection growing out of that conference “A world beyond P ≤ .05” way back in Oct 2017, where I gave a paper*. I jotted down some points, since Richard Harris from NPR needed them immediately, and I had just gotten off a plane when he emailed. He let me follow up with him, which is rare and greatly appreciated. So I streamlined the first set of points, and dropped any points he deemed technical. I sketched the third set for a couple of other journals who contacted me, who may or may not use them. Here’s Harris’ article, which includes a couple of my remarks.

First set.

1. We agree on the age-old fallacy of non-rejection of a null hypothesis: a non-statistically significant result at level P is not evidence for the null, because the test may have had a low probability of rejecting the null even if it is false (i.e., it might have low power to detect a particular alternative).

The solution in the severity interpretation of tests is to take a result that is not statistically significant at a small level, i.e., a large P-value, as ruling out given discrepancies from the null or other reference value:

The data indicate that discrepancies from the null are less than those parametric values the test had a high probability of detecting, if present. See p. 351 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). [i]

This is akin to the use of power analysis, except that it is sensitive to the actual outcome. It is very odd that this paper makes no mention of power analysis, since that is the standard way to interpret non-significant results.

Using non-significant results (“moderate” P-values) to set upper bounds is done throughout the sciences and is highly informative. This paper instead urges us to read something into any observed difference that happens to be in the welcome direction, potentially arguing for an effect.
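To make the severity reading of a non-significant result concrete, here is a minimal sketch in Python (purely illustrative, not from the paper under discussion), assuming the simplest one-sided Normal test of H0: mu ≤ mu0 versus H1: mu > mu0 with known sigma; the function name and the numbers are hypothetical:

from scipy.stats import norm

def severity_upper(xbar, mu1, sigma, n):
    """Severity for the claim 'mu <= mu1' after a non-significant result
    in a one-sided test of H0: mu <= mu0 vs H1: mu > mu0 (known sigma).
    SEV(mu <= mu1) = P(a sample mean larger than xbar would occur; mu = mu1)."""
    se = sigma / n ** 0.5
    return 1 - norm.cdf((xbar - mu1) / se)

# Illustration: sigma = 10, n = 100, observed mean 1.5 (z = 1.5, P ~ .07: not significant at .05)
print(severity_upper(1.5, 3.0, 10, 100))   # ~0.93: discrepancies as large as 3 are fairly well ruled out
print(severity_upper(1.5, 0.5, 10, 100))   # ~0.16: the claim mu <= 0.5 is poorly warranted

On this reading, a large P-value licenses ruling out only those discrepancies the test had a high capability of detecting; it is not blanket evidence of “no effect”.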

2. I agree that one shouldn’t mechanically use P < .05. Ironically, they endorse a .95 confidence interval (CI). They should actually use several levels, as is done with a severity assessment.

I have objections to their interpretation of CIs, but I will mainly focus my objections on the proposed ban of the words “significance” and “significant”. It’s not too hard to report that results are significant at level .001, or whatever. Assuming researchers invariably use an unthinking cut-off, rather than reporting the significance level attained by the data, they want to ban the words. They (Greenland, at least) claim this is a political fight, and so arguing by an appeal to numbers (of those who sign on to their paper) is appropriate for science. I think many will take this as yet one more round of significance test bashing–even though, amazingly, it runs opposite to the most popular of today’s statistical wars. I explain in #3. (The actual logic of significance testing is lost in both types of criticisms.)

3. The most noteworthy feature of this criticism of statistical significance tests is that it runs opposite to the most well-known and widely circulated current criticisms of significance tests.

In other words, the big move in the statistics wars these days is to fight irreplication by making it harder to reject, and find evidence against, a null hypothesis. The most well known Bayesian reforms being bandied about do this by giving a point prior–a lump of prior probability–to a point null hypothesis. (There’s no mention of this in the paper.)

These Bayesians argue that small P-values are consistent with strong evidence for the null hypothesis. They conclude that P-values exaggerate the evidence against the null hypothesis. Never mind for now that they are insisting P-values be measured against a standard that is radically different from what the P-value means. All of the criticisms invoke reasoning at odds with statistical significance tests. I want to point out the inconsistency between those reforms and the current one. I will call them Group A and Group B:

Group A: “Make it harder to find evidence against the null”: a P-value of .05 (i.e. a statistically significant result) should not be taken as evidence against the null, it may often be evidence for the null.

Group B (“Retire Stat Sig”): “Make it easier to find evidence against the null”: a P-value > .05 (i.e., a non-statistically significant result) should not be taken as evidence for the null, it may often be evidence against the null.

A proper use and interpretation of statistical tests (as set out in my SIST) interprets P-values correctly in both cases and avoids fallacies of rejection (inferring a magnitude of discrepancy larger than warranted) and fallacies of non-rejection (inferring the absence of an effect smaller than warranted).

The fact that we shouldn’t use thresholds unthinkingly does not mean we don’t need thresholds for lousy and terrible evidence! When data provide lousy evidence, when little if anything has been done to rule out known flaws in a claim, it’s not a little bit of evidence (on my account). The most serious concern with the “Retire” argument to ban thresholds for significance is that it is likely to encourage the practice whereby researchers spin their non-significant results by P-hacking or data dredging. It’s bad enough that they do this. Read Goldacre. [ii]

Note that they say the researcher should discuss the observed difference. This opens the door to spinning it convincingly for the uninitiated reader.

4. What about selection effects? The really important question that is not mentioned in this paper is whether the researcher is allowed to search for endpoints post-data.

My own account replaces P-values with reports of how severely tested various claims are, whether formal or informal. If we are in a context reporting P-values, the phrase “statistically significant” at the observed P-value is important because the significance level is invalidated by multiple testing, optional stopping, data-dependent subgroups, and data dredging. Everyone knows that. (A P-value, by contrast, if detached from corresponding & testable claims about significance levels, is sometimes seen as a mere relationship between data and a hypothesis.) Getting rid of the term is just what is wanted by those who think the researcher should be free to scour the data in search of impressive-looking effects, or interpret data according to what they believe. Some aver that their very good judgment allows them to determine post-data what the pre-registered endpoints really are or were or should have been. (Goldacre calls this “trust the trialist”). The paper mentions pre-registration fleetingly, but these days we see nods to it that actually go hand in hand with flouting it.

The ASA P-value Guide very pointedly emphasizes that selection effects invalidate P-values. But it does not say that selection effects need to be taken into account by any of the “alternative measures of evidence”, including Bayesian and Likelihoodist. Are they free from Principle 4 on transparency, or not? Whether or when to take account of multiple testing and data dredging are known to be key points on which those accounts differ from significance tests (at least all those who hold to the Likelihood Principle, as with Bayes Factors and Likelihood Ratios).

5. A few asides:

They should really be doing one-sided tests and do away with the point null altogether (except for special cases; I agree with D.R. Cox, who suggests doing two one-sided tests). (With one-sided tests, the test hypothesis and the alternative hypothesis are symmetrical, as with N-P tests.)

The authors seem to view a test as a report on parameter values that merely fit or are compatible with data. This misses testing reasoning! Granted the points within a CI aren’t far enough away to reject the null at level .05–but that doesn’t mean there’s evidence for them. In other words, they commit the same fallacy they are on about, but regarding members of the CI. In fact there is fairly good evidence the parameter value is less than those values close to the upper confidence limit. Yet this paper calls them compatible, even where there’s rather strong evidence against them, as with an upper .9 level bound, say.

[Using one-sided tests, and letting the null assert that a positive effect exists, the recommended account is tantamount to taking the non-significant result as evidence for this null.]

 

Second Set (to briefly give the minimal non-technical points):

I do think we should avoid the fallacy of going from a large P-value to evidence for a point null hypothesis: inferring evidence of no effect.

CIs at the .95 level are more dichotomous than reporting attained P-values for various hypotheses.

The fact that we shouldn’t use thresholds unthinkingly does not mean we don’t need thresholds for lousy and terrible evidence!

The most serious concern with the argument to ban thresholds for significance is that it encourages researchers to spin their non-significant results by P-hacking, data dredging, multiple testing, and outcome-switching.

I would like to see some attention paid to how easy it is to misinterpret results with Bayesian and Likelihoodist methods. Obeying the LP, there is no onus to take account of selection effects, and priors are very often data-dependent, giving even more flexibility.

 

Third Set (for different journals)

Banning the word “significance” may well free researchers from being held accountable when they downplay negative results and search the data for impressive-looking subgroups.

It’s time for some attention to be paid to how easy it is to misinterpret results on various (subjective, default) Bayesian methods–if there is even agreement on one to examine. The brouhaha is all about a method that plays a small role in an overarching methodology that is able to bound the probabilities of seriously misleading interpretations of data. These are called error probabilities. Their role is just a first indication of whether results could readily be produced by chance variability alone.

Rival schools of statistics (the ASA Guide’s “alternative accounts of evidence”) have never shown their worth in controlling error probabilities of methods. (Without this, we cannot assess their capability for having probed mistaken interpretations of data).

Until those alternative methods are subject to scrutiny for the same or worse abuses–biasing selection effects–we should be wary of ousting these methods and the proper speech that goes with them.

One needs to consider a statistical methodology as a whole–not one very small piece. That full methodology may be called error statistics. (Focusing on the simple significance test, with a point null & no alternative or power consideration, as in the ASA Guide, hardly does justice to the overall error statistical methodology. Error statistics is known to be a piecemeal account–it’s highly distorting to focus on an artificial piece of it.)

Those who use these methods with integrity never recommend using a single test to move from statistical significance to a substantive scientific claim. Once a significant effect is found, they move on to estimating its effect size & exploring properties of the phenomenon. I don’t favor existing testing methodologies but rather reinterpret tests as a way to infer discrepancies that are well or poorly indicated. I described this account over 25 years ago.

On the other hand, simple significance tests are important for testing assumptions of statistical models. Bayesians, if they test their assumptions, use them as well, so they could hardly ban them entirely. But what are P-values measuring? OOPS! You’re not allowed to utter the term s____ance level that was coined for this purpose. Big Brother has dictated! (Look at how strange it is to rewrite Goldacre’s claim below without it. [ii])

I’m very worried that the lead editorial in the new “world after P ≤ 0.05” collection warns us that even if scientists repeatedly show statistically significant increases (p < 0.01 or 0.001) in lead poisoning among children in City F, we mustn’t “conclude anything about scientific or practical importance”, such as that the water is causing lead poisoning.

“Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof)” (p.1, editorial for the Special Issue).

Following this rule (and note that the qualification that had been in the ASA Guide is missing) would mean never inferring risks of concern when there was uncertainty (among much else that would go by the wayside). Risks would have to be so large and pervasive that no statistics is needed! Statistics would be just window dressing, with no actual upshot about the world. Menopausal women would still routinely be taking, and dying from, hormone replacement therapy because “real world” observational results are compatible with HRT staving off age-related diseases.

Welcome to the brave new world after abandoning error control.

See also my post “Deconstructing ‘A World Beyond P-values’” on the 2017 conference.

 

[i] Mayo, D. (2018). Statistical Inference as Severe Testing: How To Get Beyond the Statistics Wars, Cambridge: Cambridge University Press.

 

[ii] Should we replace the offending terms with “moderate or non-small P-values”? The required level for “significance” is separately reported.

Misleading reporting by presenting a study in a more positive way than the actual results reflect constitutes ‘spin’. Authors of an analysis of 72 trials with non-significant results reported it was a common phenomenon, with 40% of the trials containing some form of spin. Strategies included reporting on statistically significant results for within-group comparisons, secondary outcomes, or subgroup analyses and not the primary outcome, or focussing the reader on another study objective away from the statistically non-significant result. (Goldacre)

 

[added March 25: To be clear, I have no objection to recommending people not use “statistical significance” routinely in that it may be confused with “important”. But the same warnings about equivocation would have to be given to the use of claims: H is more likely than H’. H is more probable than H’. H has probability p. What I object to is mandating a word ban, along with derogating statistical tests in general, while raising no qualms or questions about alternative methods. It doesn’t suffice to say “all methods have problems” either. Let’s look at them.

In the time people have spent repeating old criticisms of significance tests, different ways to deal with data-dependent selection effects could have been developed and experimented with. I know there is considerable work in this area, but I haven’t seen it in the pop discussions of significance tests and P-values.]

 

Categories: ASA Guide to P-values, P-values

1 Day to Apply for the Summer Seminar in Phil Stat

Go to the website for instructions: SummerSeminarPhilStat.com.

Categories: Summer Seminar in PhilStat

S. Senn: To infinity and beyond: how big are your data, really? (guest post)


 

Stephen Senn
Consultant Statistician
Edinburgh

What is this you boast about?

Failure to understand components of variation is the source of much mischief. It can lead researchers to overlook that they can be rich in data-points but poor in information. The important thing is always to understand what varies in the data you have, and to what extent your design, and the purpose you have in mind, master it. The result of failing to understand this can be that you mistakenly calculate standard errors of your estimates that are too small because you divide the variance by an n that is too big. In fact, the problems can go further than this, since you may even pick up the wrong covariance and hence use inappropriate regression coefficients to adjust your estimates.

I shall illustrate this point using clinical trials in asthma.

Breathing lessons

Suppose that I design a clinical trial in asthma as follows. I have six centres, each centre has four patients, each patient will be studied in two episodes of seven days and during these seven days the patients will be measured daily, that is to say, seven times per episode. I assume that between the two episodes of treatment there is a period of some days in which no measurements are taken. In the context of a cross-over trial, which I may or may not decide to run, such a period is referred to as a washout period.

The block structure is like this:

Centres/Patients/Episodes/Measurements

The / sign is a nesting operator and it shows, for example, that I have Patients ‘nested’ within centres. For example, I could label the patients 1 to 4 in each centre, but I don’t regard patient 3 (say) in centre 1 as being somehow similar to patient 3 in centre 2 and patient 3 in centre 3 and so forth. Patient is a term that is given meaning by referring it to centre.

The block structure is shown in Figure 1, which does not, however, show the seven measurements per episode.

Figure 1. Schematic representation of the block structure for some possible clinical trials. The six centres are shown by black lines. For each centre there are four patients shown by blue lines and each patient is studied in two episodes, shown by red lines.

I now wish to compare two treatments, two so-called beta-agonists. The first of these I shall call Zephyr and the second Mistral. I shall do this using a measure of lung function called forced expiratory volume in one second (FEV1). If there are no dropouts and no missing measurements, I shall have 6 x 4 x 2 x 7 = 336 FEV1 readings. Is this my ‘n’?

I am going to use Genstat®, a package that fully incorporates John Nelder’s ideas of general balance [1, 2] and the analysis of designed experiments, and uses, in fact, what I have called the Rothamsted approach to experiments.

I start by declaring the block structure thus

BLOCKSTRUCTURE Centre/Patient/Episode/Measurement

This is the ‘null’ situation: it describes the variation in the experimental material before any treatment is applied. I can ask Genstat® to do a ‘null’ skeleton analysis of variance for me by typing the statement

ANOVA

The output is as given in Table 1.

Analysis of variance

Source of variation                              d.f.
Centre stratum                                      5
Centre.Patient stratum                             18
Centre.Patient.Episode stratum                     24
Centre.Patient.Episode.Measurement stratum        288
Total                                             335

Table 1. Degrees of freedom for a null analysis of variance for a nested block structure.

This only gives me possible sources of variation and degrees of freedom associated with them but not the actual variances: that would require data. There are six centres, so five degrees of freedom between centres. There are four patients per centre, so three degrees of freedom per centre between patients but there are six centres and therefore 6 x 3 = 18 in total. There are two episodes per patient and so one degree of freedom between episodes per patient but there are 24 patients and so 24 degrees of freedom in total. Finally, there are seven measurements per episode and hence six degrees of freedom but 48 episodes in total so  48 x 6 = 288 degrees of freedom for measurements.
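The degrees-of-freedom bookkeeping can be checked in a few lines; the sketch below (illustrative Python, not Genstat output) simply reproduces the arithmetic of Table 1 for the 6/4/2/7 structure:

# Nesting: Centres/Patients/Episodes/Measurements = 6/4/2/7
centres, patients, episodes, measures = 6, 4, 2, 7

df_centre  = centres - 1                                      # 5
df_patient = centres * (patients - 1)                         # 18
df_episode = centres * patients * (episodes - 1)              # 24
df_measure = centres * patients * episodes * (measures - 1)   # 288
df_total   = centres * patients * episodes * measures - 1     # 335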

Having some actual data would put flesh on the bones of this skeleton by giving me some mean square errors, but to understand the general structure this is not necessary. It tells me that at the highest level I will have variation between centres, next patients within centres, after that episodes within patients and finally measurements within episodes. Which of these are relevant to judging the effect of any treatments I wish to study depends how I allocate treatments.

Design matters

I now consider three possible approaches to allocating treatments to patients. In each of the three designs, the same number of measurements will be available for each treatment. There will be 168 measurements under Zephyr and 168 measurements under Mistral, and thus 336 in total. However, as I shall show, the designs will be very different, and this will lead to different analyses being appropriate and lead us to understand better what our n really is.

I shall also suppose that we are interested in causal analysis rather than prediction. That is to say, we are interested in estimating the effect that the treatments did have (actually, the difference in their effects) in the trial that was actually run. The matter of predicting what would happen in future to other patients is much more delicate and raises other issues and I shall not address it here, although I may do so in future. For further discussion see my paper Added Values[3].

In the first experiment, I carry out a so-called cluster-randomised trial. I choose three centres at random and all patients in the three centres chosen, in both episodes and on all occasions, receive Zephyr. For the other three centres, all patients on all occasions receive Mistral. I create a factor Treatment (cluster trial) (Cluster for short), which encodes this allocation so that the pattern of allocation to Zephyr or Mistral reflects this randomised scheme.

In the second experiment, I carry out a parallel group trial blocking by centre. In each centre, I choose two patients to receive Zephyr and two to receive Mistral. Thus, overall, there are 6 x 2 = 12 patients on each treatment. I create a factor Treatment (parallel trial) (Parallel for short) to reflect this.

The third experiment consists of a cross-over trial. Each patient is randomised to one of two sequences, either receiving Zephyr in episode one and Mistral in episode two, or vice versa. Each patient receives both treatments so that there will be 6 x 4 = 24 patients given each treatment. I create a factor Treatment (cross-over trial) (Cross-over for short) to encode this.

Note that the total number of measurements obtained is the same for each of the three schemes. For the cluster randomised trial, a given treatment will be studied in three centres, each of which has four patients, each of whom will be studied in two episodes on seven occasions. Thus, we have 3 x 4 x 2 x 7 = 168 measurements per treatment. For the parallel group trial, 12 patients are studied for a given treatment in two episodes, each providing 7 measurements. Thus, we have 12 x 2 x 7 = 168 measurements per treatment. For the cross-over trial we have 24 patients, each of whom will receive a given treatment in one episode (either episode one or two), so we have 24 x 1 x 7 = 168 measurements per treatment.

Thus, from one point of view, the n in the data is the same for each of these three designs. However, each of the three designs provides very different amounts of information, and this alone should be enough to warn anybody against assuming that all problems of precision can be solved by increasing the number of data.

Controlled Analysis

Before collecting any data, I can analyse this scheme and use Nelder’s approach to tell me where the information is in each scheme.

Using the three factors to encode the corresponding allocation, I now ask Genstat® to prepare a dummy analysis of variance (in advance of having collected any data) as follows. All I need to do is type a statement of the form

TREATMENTSTRUCTURE Design
ANOVA

where Design is set equal to Cluster, Parallel, or Cross-over, as the case may be. The result is shown in Table 2.

 

Analysis of variance

Source of variation                              d.f.
Centre stratum
  Treatment (cluster trial)                         1
  Residual                                          4
Centre.Patient stratum
  Treatment (parallel trial)                        1
  Residual                                         17
Centre.Patient.Episode stratum
  Treatment (cross-over trial)                      1
  Residual                                         23
Centre.Patient.Episode.Measurement stratum        288
Total                                             335

Table 2. Analysis of variance skeleton for three possible designs using the block structure given in Table 1

This shows us that the three possible designs will have quite different degrees of precision associated with them. Since, for the cluster trial, any given centre only receives one of the treatments, the variation between centres affects the estimate of the treatment effect and its standard error must reflect this. Since, however, the parallel trial balances treatments by centres, it is unaffected by variation between centres. It is, however, affected by variation between patients. This variation is, in turn, eliminated by the cross-over trial, which, in consequence, is only affected by variation between episodes (although this variation will, itself, inherit variation from measurements). Each higher level of variation inherits variation from the lower levels but adds its own.

Note, however, that for all three designs the unbiased estimate of the treatment effect is the same. All that is necessary is to average the 168 measurements under Zephyr and the 168 under Mistral and calculate the difference. What varies is the appropriate estimate of the variance of that treatment estimate.

Suppose that, more generally, we have m centres, with n patients per centre and p episodes per patient, with the number of measurements per episode fixed. Then for the cross-over trial the variance of our estimate will be proportional to \sigma_E^2/(mnp), where \sigma_E^2 is the variance between episodes. For the parallel group trial, there will be a further term involving \sigma_P^2/(mn), where \sigma_P^2 is the variance between patients. Finally, for the cluster randomised trial there will be a further term involving \sigma_C^2/m, where \sigma_C^2 is the variance between centres.

The consequences of this are that you cannot decrease the variance of a cluster randomised trial indefinitely simply by increasing the number of patients; it is centres you need to increase. Nor can you decrease the variance of a parallel group trial indefinitely by increasing the number of episodes; it is patients you need to increase.
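A rough numerical sketch in Python may help fix ideas; the variance components below are invented purely for illustration and proportionality constants are ignored, so only the structure of the terms matters:

m, n, p = 6, 4, 2                   # centres, patients per centre, episodes per patient
s2_E, s2_P, s2_C = 1.0, 4.0, 9.0    # hypothetical between-episode, between-patient, between-centre variances

var_crossover = s2_E / (m * n * p)                                # ~0.02
var_parallel  = s2_E / (m * n * p) + s2_P / (m * n)               # ~0.19
var_cluster   = s2_E / (m * n * p) + s2_P / (m * n) + s2_C / m    # ~1.69

# Doubling the number of patients per centre (n = 8) shrinks the first two terms
# but leaves the cluster trial's dominant term s2_C / m untouched: only more centres help.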

Degrees of Uncertainty

Why should this matter? Why should it matter how certain we are about anything? There are several reasons. Bayesian statisticians need to know what relative weight to give their prior belief and the evidence from the data. If they do not, they do not know how to produce a posterior distribution; if they do not know the variances of both data and prior, they do not know the posterior variance. Frequentists and Bayesians are often required to combine evidence from various sources, as, say, in a so-called meta-analysis. They need to know what weight to give to each, and again to assess the total information available at the end. Any rational approach to decision-making requires an appreciation of the value of information. If one had to make a decision with no further prospect of obtaining information, it might make little difference how precise the current estimate was; but if the option of obtaining further information at some cost applies, this is no longer true. In short, estimation of uncertainty is important. Indeed, it is a central task of statistics.

Finally, there is one further point that is important. What applies to variances also applies to covariances. If you are adjusting for a covariate using a regression approach, then the standard estimate of the coefficient of adjustment will involve a covariance divided by a variance. Just as there can be variances at various levels, there can be covariances at various levels. It is important to establish which is relevant [4], otherwise you will calculate the adjustment incorrectly.

Consequences

Just because you have many data does not mean that you will come to precise conclusions: the variance of the effect estimate may not, as one might naively suppose, be inversely proportional to the number of data points, but to some other, much rarer feature of the data-set. Failure to appreciate this has led to excessive enthusiasm for the use of synthetic patients and historical controls as alternatives to concurrent controls. However, the relevant dominating component of variation is that between studies, not that between patients. This does not shrink to zero as the number of subjects goes to infinity. It does not even shrink to zero as the number of studies goes to infinity, since if the current study is the only one in which the new treatment appears, the relevant variance for that arm is at least \sigma_{St}^2/1, where \sigma_{St}^2 is the variance between studies, even if, for the ‘control’ data-set, it may be negligible, thanks to data collected from many subjects in many studies.

There is a lesson here for epidemiology also. All too often, the argument in the epidemiological, and more recently the causal, literature has been about which effects one should control for or condition on, without appreciating that merely stating what should be controlled for does not settle how to do it. I am not talking here about the largely sterile debate, to which I have contributed myself [5], as to how adjustment should be made for possible confounders at a given level (for example, propensity score or linear model), but about the level at which such adjustment can be made. The usual implicit assumption is that an observational study is somehow a deficient parallel group trial, with maybe complex and perverse allocation mechanisms that must somehow be adjusted for, but that once such adjustments have been made, precision increases as the number of subjects increases. But suppose the true analogy is a cluster randomised trial. Then, whatever you adjust for, your standard errors will be too small.

Finally, it is my opinion that much of the discussion about Lord’s paradox would have benefitted from an appreciation of the issue of components of variance. I am used to informing medical clients that saying we will analyse the data using analysis of variance is about as useful as saying we will treat the patients with a pill. The varieties of analysis of variance are legion, and the same is true of analysis of covariance. So, you conditioned on the baseline values. Bravo! But how did you condition on them? If you used a slope obtained at the wrong level of the data then, except fortuitously, your adjustment will be wrong, as will the precision you claim for it.

Finally, if I may be permitted an auto-quote, the price one pays for not using concurrent control is complex and unconvincing mathematics. That complexity may be being underestimated by those touting ‘big data’.


References

  1. Nelder JA. The analysis of randomised experiments with orthogonal block structure I. Block structure and the null analysis of variance. Proceedings of the Royal Society of London. Series A 1965; 283: 147-162.
  2. Nelder JA. The analysis of randomised experiments with orthogonal block structure II. Treatment structure and the general analysis of variance. Proceedings of the Royal Society of London. Series A 1965; 283: 163-178.
  3. Senn SJ. Added Values: Controversies concerning randomization and additivity in clinical trials. Statistics in Medicine 2004; 23: 3729-3753.
  4. Kenward MG, Roger JH. The use of baseline covariates in crossover studies. Biostatistics 2010; 11: 1-17.
  5. Senn SJ, Graf E, Caputo A. Stratification for the propensity score compared with linear regression techniques to assess the effect of treatment or exposure. Statistics in Medicine 2007; 26: 5529-5544.

Some relevant blogposts

Lord’s Paradox:

    • (11/11/18) Stephen Senn: Rothamsted Statistics meets Lord’s Paradox (Guest Post)
    • (11/22/18) Stephen Senn: On the level. Why block structure matters and its relevance to Lord’s paradox (Guest Post)

Personalized Medicine:

    • (01/30/18) S. Senn: Evidence Based or Person-centred? A Statistical debate (Guest Post)
    • (7/11/18) S. Senn: Personal perils: are numbers needed to treat misleading us as to the scope for personalised medicine? (Guest Post)
    • (07/26/14) S. Senn: “Responder despondency: myths of personalized medicine” (Guest Post)

Randomisation:

    • (07/01/17) S. Senn: Fishing for fakes with Fisher (Guest Post)
Categories: Lord's paradox, S. Senn

Blurbs of 16 Tours: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST)

Statistical Inference as Severe Testing:
How to Get Beyond the Statistics Wars (2018, CUP)

Deborah G. Mayo

Abstract for Book

By disinterring the underlying statistical philosophies this book sets the stage for understanding and finally getting beyond today’s most pressing controversies revolving around statistical methods and irreproducible findings. Statistical Inference as Severe Testing takes the reader on a journey that provides a non-technical “how to” guide for zeroing in on the most influential arguments surrounding commonly used–and abused–statistical methods. The book sets sail with a tool for telling what’s true about statistical controversies: If little if anything has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a stringent or severe test. In the severe testing account, probability arises in inference, not to measure degrees of plausibility or belief in hypotheses, but to assess and control how severely tested claims are. Viewing statistical inference as severe testing supplies novel solutions to problems of induction, falsification and demarcating science from pseudoscience, and serves as the linchpin for understanding and getting beyond the statistics wars. The book links philosophical questions about the roles of probability in inference to the concerns of practitioners in psychology, medicine, biology, economics, physics and across the landscape of the natural and social sciences.

Keywords for book:

Severe testing, Bayesian and frequentist debates, Philosophy of statistics, Significance testing controversy, statistics wars, replication crisis, statistical inference, error statistics, Philosophy and history of Neyman, Pearson and Fisherian statistics, Popperian falsification

Excursion 1: How to Tell What’s True about Statistical Inference

Tour I: Beyond Probabilism and Performance

(1.1) If we’re to get beyond the statistics wars, we need to understand the arguments behind them. Disagreements about the roles of probability in statistical inference–holdovers from long-standing frequentist-Bayesian battles–still simmer below the surface of current debates on scientific integrity, irreproducibility, and questionable research practices. Striving to restore scientific credibility, researchers, professional societies, and journals are getting serious about methodological reforms. Some–disapproving of cherry picking and advancing preregistration–are welcome. Others might create obstacles to the critical standpoint we seek. Without understanding the assumptions behind proposed reforms, their ramifications for statistical practice remain hidden. (1.2) Rival standards reflect a tension between using probability (i) to constrain a method’s ability to avoid erroneously interpreting data (performance), and (ii) to assign degrees of support, confirmation, or plausibility to hypotheses (probabilism). We set sail with a tool for telling what’s true about statistical inference: If little has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a severe test. From this minimal severe-testing requirement, we develop a statistical philosophy that goes beyond probabilism and performance. (1.3) We survey the current state of play in statistical foundations.

Excursion 1 Tour I: Keywords

Error statistics, severity requirement: weak/strong, probabilism, performance, probativism, statistical inference, argument from coincidence, Lift-off (vs drag down), sampling distribution, cherry-picking

 

Excursion 1 Tour II: Error Probing Tools vs. Logics of Evidence

Core battles revolve around the relevance of a method’s error probabilities. What’s distinctive about the severe testing account is that it uses error probabilities evidentially: to assess how severely a claim has passed a test. Error control is necessary but not sufficient for severity. Logics of induction focus on the relationships between given data and hypotheses–so outcomes other than the one observed drop out. This is captured in the Likelihood Principle (LP). Tour II takes us to the crux of central wars in relation to the Law of Likelihood (LL) and Bayesian probabilism. (1.4) Hypotheses deliberately designed to accord with the data can result in minimal severity. The likelihoodist tries to oust them via degrees of belief captured in prior probabilities. To the severe tester, such gambits directly alter the evidence by leading to inseverity. (1.5) If a tester tries and tries again until significance is reached–optional stopping–significance will be attained erroneously with high probability. According to the LP, the stopping rule doesn’t alter evidence. The irrelevance of optional stopping is an asset for holders of the LP; it’s the opposite for a severe tester. The warring sides talk past each other.
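The optional stopping point lends itself to a quick simulation; the sketch below (an illustration in Python, not an example from the book) tests a true null after each new observation and stops at the first nominal .05 rejection, showing that the actual probability of ever reaching “significance” is far above .05:

import numpy as np

rng = np.random.default_rng(0)

def try_and_try_again(max_n=1000, z_crit=1.96):
    """Sample from N(0, 1) -- so the null is true -- testing after each observation."""
    x = rng.standard_normal(max_n)
    for k in range(2, max_n + 1):
        z = x[:k].mean() * np.sqrt(k)   # known sigma = 1
        if abs(z) > z_crit:             # nominal two-sided .05 test
            return True                 # 'significance' reached; stop
    return False

trials = 2000
print(sum(try_and_try_again() for _ in range(trials)) / trials)   # well above .05, and it grows with max_n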

Excursion 1 Tour II: Keywords

Statistical significance: nominal vs actual, Law of likelihood, Likelihood principle, Inductive inference, Frequentist/Bayesian, confidence concept, Bayes theorem, default/non-subjective Bayesian, stopping rules/optional stopping, argument from intentions

Excursion 2: Taboos of Induction and Falsification

Tour I: Induction and Confirmation

The roots of rival statistical accounts go back to the logical Problem of Induction. (2.1) The logical problem of induction is a matter of finding an argument to justify a type of argument (enumerative induction), so it is important to be clear on arguments, their soundness versus their validity. Given that any attempt to solve the logical problem of induction leads to circularity, philosophers turned instead to building logics that seemed to capture our intuitions about induction, e.g., Carnap’s confirmation theory. There’s an analogy between contrasting views in philosophy and statistics: Carnapian confirmation is to Bayesian statistics, as Popperian falsification is to frequentist error statistics. Logics of confirmation take the form of probabilisms, either in the form of raising the probability of a hypothesis, or arriving at a posterior probability. (2.2) The contrast between these types of probabilisms, and the problems each is found to have in confirmation theory, is directly relevant to the types of probabilisms in statistics. Notably, Harold Jeffreys’ non-subjective Bayesianism, and current spin-offs, share features with Carnapian inductive logics. We examine problems of irrelevant conjunctions: if x confirms H, it confirms (H & J) for any J.

Tour I: keywords

asymmetry of induction and falsification, argument, sound and valid, enumerative induction (straight rule), confirmation theory (and formal epistemology), statistical affirming the consequent, guide to life, problem of induction, irrelevant conjunction, likelihood ratio, old evidence problem

 

Excursion 2 Tour II: Falsification, Pseudoscience, Induction

Tour II visits Popper, falsification, corroboration, Duhem’s problem (what to blame in the case of anomalies) and the demarcation of science and pseudoscience (2.3). While Popper comes up short on each, the reader is led to improve on Popper’s notions. Central ingredients for our journey are put in place via souvenirs: a framework of models and problems, and a post-Popperian language to speak about inductive inference. Defining a severe test, for Popperians, is linked to when data supply novel evidence for a hypothesis: family feuds about defining novelty are discussed (2.4). We move into Fisherian significance tests and the crucial requirements he set: isolated significant results are poor evidence of a genuine effect, and statistical significance doesn’t warrant substantive, e.g., causal inference (2.5). Applying our new demarcation criterion to a plausible effect (males are more likely than females to feel threatened by their partner’s success), we argue that a real revolution in psychology will need to be more revolutionary than at present. Whole inquiries might have to be falsified, their measurement schemes questioned (2.6). The Tour’s pieces are synthesized in (2.7), where a guest lecturer explains how to solve the problem of induction now, having redefined induction as severe testing.

 Excursion 2 Tour II: keywords

Corroboration, Demarcation of science and pseudoscience, Falsification, Duhem’s problem, Novelty, Biasing selection effects, Simple significance tests, Fallacies of rejection, NHST, Reproducibility and replication

Excursion 3: Statistical Tests and Scientific Inference

Tour I: Ingenious and Severe Tests

We move from Popper to the development of statistical tests (3.2) by way of a gallery on (3.1): Data Analysis in the 1919 Eclipse tests of the General Theory of Relativity (GTR). The tour opens by homing in on where the main members of our statistical cast are in 1919: Fisher, Neyman and Pearson. From the GTR episode, we identify the key elements of a statistical test–the steps we find in E. Pearson’s opening description of tests in (3.2). The typical (behavioristic) formulation of N-P tests is as mechanical rules to accept or reject claims with good long run error probabilities. The severe tester breaks out of the behavioristic prison. The classical testing notions–Type I and II errors, power, consistent tests–are shown to grow out of requiring probative tests. Viewing statistical inference as severe testing, we explore how members of the Fisherian tribe can do all N-P tests do (3.3). We consider the frequentist principle of evidence FEV (Mayo and Cox) and the divergent interpretations that are called for by Cox’s taxonomy of null hypotheses. The last member of the taxonomy–substantively based null hypotheses–returns us to the opening episode of GTR.

Tour I: keywords

eclipse test, statistical test ingredients, Type I & II errors, power, P-value, uniformly most powerful (UMP); severity interpretation of tests, severity function, frequentist principle of evidence FEV; Cox’s taxonomy of nulls

 

Excursion 3 Tour II: It’s The Methods, Stupid

Tour II disentangles a jungle of conceptual issues at the heart of today’s statistical wars. (3.4) unearths the basis for counterintuitive inferences thought to be licensed by Fisherian or N-P tests. These howlers and chestnuts show: the need for an adequate test statistic, the difference between implicationary and actual assumptions, and the fact that tail areas serve to raise, and not lower, the bar for rejecting a null hypothesis. Stop (3.5) pulls back the curtain on an equivocal use of “error probability”. When critics allege that Fisherian P-values are not error probabilities, they mean Fisher wanted an evidential not a performance interpretation–this is a philosophical not a mathematical claim. In fact, N-P and Fisher used P-values in both ways. Critics argue that P-values are for evidence, unlike error probabilities, but in the next breath they aver P-values aren’t good measures of evidence either, since they disagree with probabilist measures: likelihood ratios, Bayes Factors or posteriors (3.6). But the probabilist measures are inconsistent with the error probability ones. By claiming the latter are what’s wanted, the probabilist begs key questions, and misinterpretations are entrenched.

Excursion 3 Tour II keywords

howlers and chestnuts of statistical tests, Jeffreys tail area criticism, two machines with different positions, weak conditionality principle, likelihood principle, long run performance vs probabilism, Neyman vs Fisher, hypothetical long-runs, error probability 1 and error probability 2, incompatibilism (Fisher & Neyman-Pearson must be separated)

 

Excursion 3 Tour III: Capability and Severity: Deeper Concepts

A long-standing family feud among frequentists is between hypothesis tests and confidence intervals (CIs). In fact there’s a clear duality between the two: the parameter values within the (1 – α) CI are those that are not rejectable by the corresponding test at level α. (3.7) illuminates both CIs and severity by means of this duality. A key idea is arguing from the capabilities of methods to what may be inferred. In (3.8) we reopen a highly controversial matter of interpretation in relation to statistics and the 2012 discovery of the Higgs particle based on a “5 sigma observed effect”. Because the 5-sigma standard refers to frequentist significance testing, the discovery was immediately imbued with controversies that, at bottom, concern statistical philosophy. Some Bayesians even hinted it was “bad science”. One of the knottiest criticisms concerns the very meaning of the phrase: “the probability our data are merely a statistical fluctuation”. Failing to clarify it may impinge on the nature of future big science inquiry. The problem is a bit delicate, and my solution is likely to be provocative. Even rejecting my construal will allow readers to see what it’s like to switch from wearing probabilist, to severe testing, glasses.

Excursion 3 Tour III: keywords

confidence intervals, duality of confidence intervals and tests, rubbing off interpretation, confidence level, Higgs particle, look elsewhere effect, random fluctuations, capability curves, 5 sigma, beyond standard model physics (BSM)

Excursion 4: Objectivity and Auditing

Tour I: The Myth of “The Myth of Objectivity”

Blanket slogans such as “all methods are equally objective and subjective” trivialize into oblivion the problem of objectivity. Such cavalier attitudes are at odds with the moves to take back science. The goal of this tour is to identify what there is in objectivity that we won’t give up, and shouldn’t. While knowledge gaps leave room for biases and wishful thinking, we regularly come up against data that thwart our expectations and disagree with predictions we try to foist upon the world. This pushback supplies objective constraints on which our critical capacity is built. Supposing an objective method is to supply formal, mechanical, rules to process data is a holdover of a discredited logical positivist philosophy.

Discretion in data generation and modeling does not warrant concluding: statistical inference is a matter of subjective belief. It is one thing to talk of our models as objects of belief and quite another to maintain that our task is to model beliefs. For a severe tester, a statistical method’s objectivity requires the ability to audit an inference: check assumptions, pinpoint blame for anomalies, falsify, and directly register how biasing selection effects–hunting, multiple testing and cherry-picking–alter its error probing capacities. 

Tour I: keywords

objective vs. subjective, objectivity requirements, auditing, dirty hands argument, logical positivism; default Bayesians, equipoise assignments, (Bayesian) wash-out theorems, degenerating program, epistemology: internal/external distinction

 

Excursion 4 Tour II: Rejection Fallacies: Who’s Exaggerating What?

We begin with the Mountains out of Molehills Fallacy (large n problem): the fallacy of taking a (P-level) rejection of H0 with larger sample size as indicating a greater discrepancy from H0 than with a smaller sample size (4.3). The Jeffreys-Lindley paradox shows that with large enough n, a .05 significant result can correspond to assigning H0 a high probability of .95. There are family feuds as to whether this is a problem for Bayesians or frequentists! The severe tester takes account of sample size in interpreting the discrepancy indicated. A modification of confidence intervals (CIs) is required.
It is commonly charged that significance levels overstate the evidence against the null hypothesis (4.4, 4.5). What’s meant? One answer considered here is that the P-value can be smaller than a posterior probability on the null hypothesis, based on a lump prior (often .5) on a point null hypothesis. There are battles between and within tribes of Bayesians and frequentists. Some argue for lowering the P-value to bring it into line with a particular posterior. Others argue the supposed exaggeration results from an unwarranted lump prior on a wrongly formulated null. We consider how to evaluate reforms based on Bayes factor standards (4.5). Rather than dismiss criticisms of error statistical methods that assume a standard from a rival account, we give them a generous reading. Only once the minimal principle for severity is violated do we reject them. Souvenir R summarizes the severe tester’s interpretation of a rejection in a statistical significance test. At least 2 benchmarks are needed: reports of discrepancies (from a test hypothesis) that are, and those that are not, well indicated by the observed difference.
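For readers who want to see the “exaggeration” argument in numbers, here is a small sketch (an illustration under one conventional choice of spiked prior, not a calculation from the book): hold the observed z fixed at 1.96, so the two-sided P-value stays near .05, and let n grow; with a .5 lump prior on the point null and a N(0, 1) slab on the alternative, the posterior probability of the null climbs toward 1, which is the Jeffreys-Lindley effect:

import numpy as np
from scipy.stats import norm

def posterior_null(z, n, sigma=1.0, tau=1.0, prior_null=0.5):
    """P(H0 | data) for the point null theta = 0 with lump prior prior_null
    and a N(0, tau^2) slab on the alternative; z is the observed standardized mean."""
    se = sigma / np.sqrt(n)
    xbar = z * se
    m0 = norm.pdf(xbar, 0, se)                            # marginal likelihood under H0
    m1 = norm.pdf(xbar, 0, np.sqrt(tau ** 2 + se ** 2))   # marginal likelihood under H1
    return prior_null * m0 / (prior_null * m0 + (1 - prior_null) * m1)

for n in (10, 100, 10_000, 1_000_000):
    print(n, round(posterior_null(1.96, n), 3))   # P-value stays ~.05; P(H0 | data) rises toward 1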

Keywords:

significance test controversy, mountains out of molehills fallacy, large n problem, confidence intervals, P-values exaggerate evidence, Jeffreys-Lindley paradox, Bayes/Fisher disagreement, uninformative (diffuse) priors, Bayes factors, spiked priors, spike and slab, equivocating terms, severity interpretation of rejection (SIR)

 

Excursion 4 Tour III: Auditing: Biasing Selection Effects & Randomization

Tour III takes up Peirce’s “two rules of inductive inference”: predesignation (4.6) and randomization (4.7). The Tour opens on a court case transpiring: the CEO of a drug company is being charged with giving shareholders an overly rosy report based on post-data dredging for nominally significant benefits. Auditing a result includes checking for (i) selection effects, (ii) violations of model assumptions, and (iii) obstacles to moving from statistical to substantive claims. We hear it’s too easy to obtain small P-values, yet replication attempts find it difficult to get small P-values with preregistered results. I call this the paradox of replication. The problem isn’t P-values but failing to adjust them for cherry picking and other biasing selection effects. Adjustments by Bonferroni and false discovery rates are considered. There is a tension between popular calls for preregistering data analysis, and accounts that downplay error probabilities. Worse, in the interest of promoting a methodology that rejects error probabilities, researchers who most deserve lambasting are thrown a handy line of defense. However, data dependent searching need not be pejorative. In some cases, it can improve severity. (4.6)

Big Data cannot ignore experimental design principles. Unless we take account of the sampling distribution, it becomes difficult to justify resampling and randomization. We consider RCTs in development economics (RCT4D) and genomics. Failing to randomize microarrays is thought to have resulted in a decade lost in genomics. Granted the rejection of error probabilities is often tied to presupposing their relevance is limited to long-run behavioristic goals, which we reject. They are essential for an epistemic goal: controlling and assessing how well or poorly tested claims are. (4.7)

Keywords

error probabilities and severity, predesignation, biasing selection effects, paradox of replication, capitalizing on chance, bayes factors, batch effects, preregistration, randomization: Bayes-frequentist rationale, bonferroni adjustment, false discovery rates, RCT4D, genome-wide association studies (GWAS)

 

Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking

While all models are false, it’s also the case that no useful models are true. Were a model so complex as to represent data realistically, it wouldn’t be useful for finding things out. A statistical model is useful by being adequate for a problem, meaning it enables controlling and assessing if purported solutions are well or poorly probed and to what degree. We give a way to define severity in terms of solving a problem. (4.8) When it comes to testing model assumptions, many Bayesians agree with George Box (1983) that “it requires frequentist theory of significance tests” (p. 57). Tests of model assumptions, also called misspecification (M-S) tests, are thus a promising area for Bayes-frequentist collaboration. (4.9) When the model is in doubt, the likelihood principle is inapplicable or violated. We illustrate a non-parametric bootstrap resampling. It works without relying on a theoretical probability distribution, but it still has assumptions. (4.10) We turn to the M-S testing approach of econometrician Aris Spanos. (4.11) I present the high points for unearthing spurious correlations, and assumptions of linear regression, employing 7 figures. M-S tests differ importantly from model selection–the latter uses a criterion for choosing among models, but does not test their statistical assumptions. Model selection criteria assess fit rather than whether a model has captured the systematic information in the data.

Keywords

adequacy for a problem, severity (in terms of problem solving), model testing/misspecification (M-S) tests, likelihood principle conflicts, bootstrap, resampling, Bayesian p-value, central limit theorem, nonsense regression, significance tests in model checking, probabilistic reduction, respecification

Excursion 5: Power and Severity

Tour I: Power: Pre-data and Post-data

The power of a test to detect a discrepancy from a null hypothesis H0 is its probability of leading to a significant result if that discrepancy exists. Critics of significance tests often compare H0 and a point alternative H1 against which the test has high power. But these don’t exhaust the space. Blurring the power against H1 with a Bayesian posterior in H1 results in exaggerating the evidence. (5.1) A drill is given for practice (5.2). As we learn from Neyman and Popper: if data failed to reject a hypothesis H, it does not corroborate H unless the test probably would have rejected it if false. A classic fallacy is to construe no evidence against H0 as evidence of the correctness of H0. It was in the list of slogans opening Excursion 1. H is corroborated severely only if, and only to the extent that, it passes a test it probably would have failed, if false. By reflecting this reasoning, power analysis avoids such fallacies, but it’s too coarse. Severity analysis follows the pattern but is sensitive to the actual outcome (it uses what I call attained power). (5.3) Using severity curves we read off assessments for interpreting non-significant results in a standard test. (5.4)

 Tour I: keywords

power of a test, attained power (and severity), fallacies of non-rejection, severity curves, severity interpretation of negative results (SIN), power analysis, Cohen and Neyman on power analysis, retrospective power

 

Excursion 5 Tour II: How not to Corrupt Power

We begin with objections to power analysis, and scrutinize accounts that appear to be at odds with power and severity analysis. (5.5) Understanding power analysis also promotes an improved construal of CIs: instead of a fixed confidence level, several levels are needed, as with confidence distributions. Severity offers an evidential assessment rather than mere coverage probability. We examine an influential new front in the statistics wars based on what I call the diagnostic model of tests. (5.6) The model is a cross between a Bayesian and frequentist analysis. To get the priors, the hypothesis you’re about to test is viewed as a random sample from an urn of null hypotheses, a high proportion of which are true. The analysis purports to explain the replication crisis because the proportion of true nulls amongst hypotheses rejected may be higher than the probability of rejecting a null hypothesis given it’s true. We question the assumptions and the altered meaning of error probability (error probability 2 in 3.6). The Tour links several arguments that use probabilist measures to critique error statistics.

Excursion 5 Tour II: keywords

confidence distributions, coverage probability, criticisms of power, diagnostic model of tests, shpower vs power, fallacy of probabilistic instantiation, crud factors

  

Excursion 5 Tour III: Deconstructing the N-P vs. Fisher Debates

We begin with a famous passage from Neyman and Pearson (1933), taken to show N-P philosophy is limited to long-run performance. The play, “Les Miserables Citations”, leads to a deconstruction that illuminates the evidential over the performance construal. (5.7) To cope with the fact that any sample is improbable in some respect, statistical methods either appeal to prior probabilities of hypotheses or to error probabilities of a method. Pursuing the latter, N-P are led to (i) a prespecified test criterion and (ii) consideration of alternative hypotheses and power. Fisher at first endorsed their idea of a most powerful test. Fisher hoped fiducial probability would both control error rates of a method – performance – and supply an evidential assessment. When confronted with the fact that fiducial solutions disagreed with performance goals he himself had held, Fisher abandoned them. (5.8) He railed against Neyman, who was led to a performance construal largely to avoid inconsistencies in Fisher’s fiducial probability. The problem we face today is precisely to find a measure that controls error while capturing evidence. This is what severity purports to supply. We end with a connection with recent work on confidence distributions.

Excursion 5 Tour III: keywords

Bertrand and Borel debate, Neyman-Pearson test development, behavioristic (performance model) of tests, deconstructing N-P (1933), Fisher’s fiducial probabilities, Neyman/Fisher feuds, Neyman and Fisher dovetail, confidence distributions

Excursion 6: (Probabilist) Foundations Lost, (Probative) Foundations Found

Excursion 6 Tour I: What Ever Happened to Bayesian Foundations

Statistical battles often grow out of assuming the goal is a posterior probabilism of some sort. Yet when we examine each of the ways this could be attained, the desirability for science evanesces. We survey classical subjective Bayes via an interactive museum display on Lindley and commentators. (6.1) We survey a plethora of meanings given to Bayesian priors (6.2) and current family feuds between subjective and non-subjective Bayesians. (6.3) The most prevalent Bayesian accounts are default/non-subjective, but there is no agreement on suitable priors. Sophisticated methods give as many priors as there are parameters and different orderings. They are deemed mere formal devices for obtaining a posterior. How then should we interpret the posterior as an adequate summary of information? While touted as the best way to bring in background, they are simultaneously supposed to minimize the influence of background. The main assets of the Bayesian picture–a coherent way to represent and update beliefs–go by the board. (6.4) The very idea of conveying “the” information in the data is unsatisfactory. It turns on what one wants to know. An answer to how much a prior would be updated differs from an answer to how well or poorly tested claims are. The latter question, of interest to a severe tester, is not answered by accounts that require assigning probabilities to a catchall factor: science must be open ended.

Excursion 6 Tour I: keywords

Classic subjective Bayes, subjective vs default Bayesians, Bayes conditioning, default priors (and their multiple meanings), default Bayesian and the Likelihood Principle, catchall factor

 

Excursion 6 Tour II: Pragmatic and Error Statistical Bayesians

Tour II asks: Is there an overarching philosophy that “matches contemporary attitudes”? Kass’s pragmatic Bayesianism seeks unification by a restriction to cases where the default posteriors match frequentist error probabilities. (6.5) Even with this severe limit, the necessity for a split personality remains: probability is to capture variability as well as degrees of belief. We next consider the falsificationist Bayesianism of Andrew Gelman, and his work with others. (6.6) This purports to be an error statistical view, and we consider how its foundations might be developed. The question of where it differs from our misspecification testing is technical and is left open. Even more important than shared contemporary attitudes is changing them: not to encourage a switch of tribes, but to understand and get beyond the tribal warfare. If your goal is really and truly probabilism, you are better off recognizing the differences than trying to unify or reconcile. Snapshots from the error statistical lens let you see how frequentist methods supply tools for controlling and assessing how well or poorly warranted claims are. If you’ve come that far in making the gestalt switch to the error statistical paradigm, a new candidate for an overarching philosophy is at hand. Our Farewell Keepsake delineates the requirements for a normative epistemology and surveys nine key statistics wars and a cluster of familiar criticisms of error statistical methods. They can no longer be blithely put forward as having weight without wrestling with the underlying presuppositions and challenges collected on our journey. This provides the starting point for any future attempts to refight these battles. The reader will then be beyond the statistics wars. (6.7)

Excursion 6 Tour II: keywords

pragmatic Bayesians, falsificationist Bayesian, confidence distributions, epistemic meaning for coverage probability, optional stopping and Bayesian intervals, error statistical foundations

 

REFERENCE

Mayo, D. (2018). Statistical Inference as Severe Testing: How To Get Beyond the Statistics Wars, Cambridge: Cambridge University Press.

_______

*Earlier excerpts and mementos from SIST up to Dec 31, 2018 are here.

Jan 10, 2019 Excerpt from SIST is here.

Jan 13, 2019 Mementos from SIST (Excursion 4) are here. These are summaries of all 4 tours.

Feb 23, 2019 Excerpt from SIST 5.8 is here.

Categories: Statistical Inference as Severe Testing | Leave a comment

Deconstructing the Fisher-Neyman conflict wearing fiducial glasses + Excerpt 5.8 from SIST


Fisher/ Neyman

This continues my previous post: “Can’t take the fiducial out of Fisher…” in recognition of Fisher’s birthday, February 17. These 2 posts reflect my working out of these ideas in writing Section 5.8 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, CUP 2018). Here’s all of Section 5.8 (“Neyman’s Performance and Fisher’s Fiducial Probability”) for your Saturday night reading.* 

Move up 20 years to the famous 1955/56 exchange between Fisher and Neyman. Fisher clearly connects Neyman’s adoption of a behavioristic-performance formulation to his denying the soundness of fiducial inference. When “Neyman denies the existence of inductive reasoning, he is merely expressing a verbal preference. For him ‘reasoning’ means what ‘deductive reasoning’ means to others.” (Fisher 1955, p. 74).

Fisher was right that Neyman’s calling the outputs of statistical inferences “actions” merely expressed Neyman’s preferred way of talking. Nothing earth-shaking turns on the choice to dub every inference “an act of making an inference”.[i] The “rationality” or “merit” goes into the rule. Neyman, much like Popper, had a good reason for drawing a bright red line between his use of probability (for corroboration or probativeness) and its use by ‘probabilists’ (who assign probability to hypotheses). Fisher’s fiducial probability was in danger of blurring this very distinction. Popper said, and Neyman would have agreed, that he had no problem with our using the word induction so long as it was kept clear that it meant testing hypotheses severely.

In Fisher’s next few sentences, things get very interesting. In reinforcing his choice of language, Fisher continues, Neyman “seems to claim that the statement (a) “μ has a probability of 5 per cent. of exceeding M” is a different statement from (b) “M has a probability of 5 per cent. of falling short of μ”. There’s no problem about equating these two so long as M is a random variable. But watch what happens in the next sentence. [I’m using M rather than X ; Fisher’s paper uses lower case x in the following, though clearly he means X in [1].] According to Fisher,

Neyman violates ‘the principles of deductive logic [by accepting a] statement such as

[1]                      Pr{(M – ts) < μ < (M  + ts)} = α,

as rigorously demonstrated, and yet, when numerical values are available for the statistics M
and s, so that on substitution of these and use of the 5 per cent. value of t, the statement would read

[2]                   Pr{92.99 < μ < 93.01} = 95 per cent.,

to deny to this numerical statement any validity. This evidently is to deny the syllogistic process of making a substitution in the major premise of terms which the minor premise establishes as equivalent (Fisher 1955, p. 75).

But the move from (1) to (2) is fallacious! Could Fisher really be committing this fallacious probabilistic instantiation? I.J. Good (1971) describes how many felt, and often still feel:

…if we do not examine the fiducial argument carefully, it seems almost inconceivable that Fisher should have made the error which he did in fact make. It is because (i) it seemed so unlikely that a man of his stature should persist in the error, and (ii) because he modestly says (…[1959], p. 54) his 1930 explanation ‘left a good deal to be desired’, that so many people assumed for so long that the argument was correct. They lacked the daring to question it.

In responding to Fisher, Neyman (1956, p.292) declares himself at his wit’s end in trying to find a way to convince Fisher of the inconsistencies in moving from (1) to (2).

When these explanations did not suffice to convince Sir Ronald of his mistake, I was tempted to give up. However, in a private conversation David Blackwell suggested that Fisher’s misapprehension may be cleared up by the examination of several simple examples. They illustrate the general rule that valid probability statements regarding relations involving random variables may cease and usually do cease to be valid if random variables are replaced by their observed particular values. (p. 292)[ii]

“Thus if X is a normal random variable with mean zero and an arbitrary variance greater than zero, we may agree” [that Pr(X < 0) = .5. But observing, say, X = 1.7 yields Pr(1.7 < 0) = .5, which is clearly illicit]. “It is doubtful whether the chaos and confusion now reigning in the field of fiducial argument were ever equaled in any other doctrine. The source of this confusion is the lack of realization that equation (1) does not imply (2)” (Neyman 1956).

For decades scholars have tried to figure out what Fisher might have meant, and while the matter remains unsettled, this much is agreed: the instantiation that Fisher is yelling about, 20 years after the creation of N-P tests and the break with Neyman, is fallacious. Fiducial probabilities can only properly attach to the method. Keeping to “performance” language is a sure way to avoid the illicit slide from (1) to (2). Once the intimate tie-ins with Fisher’s fiducial argument are recognized, the rhetoric of the Neyman-Fisher dispute takes on a completely new meaning. When Fisher says “Neyman only cares for acceptance sampling contexts”, as he does after around 1950, he’s really saying Neyman thinks fiducial inference is contradictory unless it’s viewed in terms of properties of the method in (actual or hypothetical) repetitions. The fact that Neyman (with the contributions of Wald, and later Robbins) went overboard in his behaviorism,[iii] to the extent that even Egon wanted to divorce him—ending his 1955 reply to Fisher with the claim that inductive behavior was “Neyman’s field rather than mine”—is a different matter.
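To see why the probability in (1) attaches to the method rather than to the realized interval in (2), a minimal simulation sketch may help. This is my own illustration, not anything from Fisher or Neyman; the true mean, standard deviation, and sample size are assumptions chosen merely to echo the numbers in (2).

```python
# Minimal simulation sketch: statement (1) concerns the random interval
# (a property of the method over repetitions); once M and s are replaced by
# observed numbers, as in (2), there is nothing random left to which the
# probability can attach. The true mu, sigma, and n below are illustrative.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)
mu_true, sigma, n, reps = 93.0, 0.5, 10, 100_000
t_crit = t.ppf(0.975, df=n - 1)

samples = rng.normal(mu_true, sigma, size=(reps, n))
M = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1) / np.sqrt(n)   # estimated s.e. of the mean
covered = (M - t_crit * s < mu_true) & (mu_true < M + t_crit * s)

print(f"Coverage over {reps} repetitions: {covered.mean():.3f}   (close to 0.95)")
# But any single realized interval either contains mu_true or it doesn't --
# no 95% probability attaches to that fixed numerical interval.
lo, hi = M[0] - t_crit * s[0], M[0] + t_crit * s[0]
print(f"One realized interval: ({lo:.3f}, {hi:.3f}); covers mu? {lo < mu_true < hi}")
```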

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[i] Fisher also commonly spoke of the output of tests as actions. Neyman rightly says that he is only following Fisher. As the years went by, Fisher came to renounce things he himself had said earlier in the midst of polemics against Neyman.

[ii] But surely this is the kind of simple example that would have been brought forward right off the bat, before the more elaborate, infamous cases (Fisher-Behrens). Did Fisher ever say “oh now I see my mistake” as a result of these simple examples? Not to my knowledge. So I find this statement of Neyman’s about the private conversation with Blackwell a little curious. Anyone know more about it?

[iii] At least in his theory, but not in his practice. A relevant post is “distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen“.

Fisher, R.A. (1955). “Statistical Methods and Scientific Induction”.

Good, I.J. (1971b). In reply to comments on his “The probabilistic explication of information, evidence, surprise, causality, explanation and utility”. In Godambe and Sprott (1971).

Neyman, J. (1956). “Note on an Article by Sir Ronald Fisher”.

Pearson, E.S. (1955). “Statistical Concepts in Their Relation to Reality”.

 

____________________________

 

*Earlier excerpts and mementos from SIST up to Dec 31, 2018 are here.

Jan 10, 2019 Excerpt from SIST is here.

Jan 13, 2019 Mementos from SIST (Excursion 4) are here. These are summaries of all 4 tours.

Categories: fiducial probability, Fisher, Neyman, Statistics | 2 Comments

Can’t Take the Fiducial Out of Fisher (if you want to understand the N-P performance philosophy) [i]


R.A. Fisher: February 17, 1890 – July 29, 1962

Continuing with posts in recognition of R.A. Fisher’s birthday, I post one from a few years ago on a topic that had previously not been discussed on this blog: Fisher’s fiducial probability

[Neyman and Pearson] “began an influential collaboration initially designed primarily, it would seem, to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals” (D.R. Cox, 2006, p. 195).

The entire episode of fiducial probability is fraught with minefields. Many say it was Fisher’s biggest blunder; others suggest it still hasn’t been understood. The majority of discussions omit the side trip to the Fiducial Forest altogether, finding the surrounding brambles too thorny to penetrate. Besides, a fascinating narrative about the Fisher-Neyman-Pearson divide has managed to bloom and grow while steering clear of fiducial probability–never mind that it remained a centerpiece of Fisher’s statistical philosophy. I now think that this is a mistake. It was thought, following Lehmann (1993) and others, that we could take the fiducial out of Fisher and still understand the core of the Neyman-Pearson vs Fisher (or Neyman vs Fisher) disagreements. We can’t. Quite aside from the intrinsic interest in correcting the “he said/he said” of these statisticians, the issue is intimately bound up with the current (flawed) consensus view of frequentist error statistics. Continue reading

Categories: fiducial probability, Fisher, Phil6334/ Econ 6614, Statistics | Leave a comment

Guest Blog: R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)


In recognition of R.A. Fisher’s birthday on February 17…a week of Fisher posts!

‘R. A. Fisher: How an Outsider Revolutionized Statistics’

by Aris Spanos

Few statisticians will dispute that R. A. Fisher (February 17, 1890 – July 29, 1962) is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of optimal estimation based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of optimal testing in the early 1930s. According to Hald (1998)

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the ultimate outsider when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however, did not stop him from publishing his first paper in statistics in 1912 (still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper. Continue reading

Categories: Fisher, phil/history of stat, Phil6334/ Econ 6614, Spanos, Statistics | 2 Comments

R.A. Fisher: “Statistical methods and Scientific Induction”

I continue a week of Fisherian posts begun on his birthday (Feb 17). This is his contribution to the “Triad”–an exchange between  Fisher, Neyman and Pearson 20 years after the Fisher-Neyman break-up. The other two are below. They are each very short and are worth your rereading.

17 February 1890 — 29 July 1962

“Statistical Methods and Scientific Induction”

by Sir Ronald Fisher (1955)

SUMMARY

The attempt to reinterpret the common tests of significance used in scientific research as though they constituted some kind of  acceptance procedure and led to “decisions” in Wald’s sense, originated in several misapprehensions and has led, apparently, to several more.

The three phrases examined here, with a view to elucidating the fallacies they embody, are:

  1. “Repeated sampling from the same population”,
  2. Errors of the “second kind”,
  3. “Inductive behavior”.

Mathematicians without personal contact with the Natural Sciences have often been misled by such phrases. The errors to which they lead are not only numerical.

To continue reading Fisher’s paper.

 

Note on an Article by Sir Ronald Fisher

by Jerzy Neyman (1956)

Neyman

Summary

(1) FISHER’S allegation that, contrary to some passages in the introduction and on the cover of the book by Wald, this book does not really deal with experimental design is unfounded. In actual fact, the book is permeated with problems of experimentation.  (2) Without consideration of hypotheses alternative to the one under test and without the study of probabilities of the two kinds, no purely probabilistic theory of tests is possible. Continue reading

Categories: E.S. Pearson, fiducial probability, Fisher, Neyman, phil/history of stat, Phil6334/ Econ 6614 | 1 Comment

Guest Post: STEPHEN SENN: ‘Fisher’s alternative to the alternative’

“You May Believe You Are a Bayesian But You Are Probably Wrong”


As part of the week of posts on R.A. Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012 and 2017. See especially the comments from Feb 2017.

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976, p. 473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (p. 441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre, but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published, and this throws light on many aspects of Fisher’s thought, including on significance tests. Continue reading

Categories: Fisher, S. Senn, Statistics | Leave a comment

Happy Birthday R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

Today is R.A. Fisher’s birthday. I will post some Fisherian items this week in recognition of it*. This paper comes just before the conflicts with Neyman and Pearson erupted. Fisher links his tests and sufficiency to the Neyman-Pearson lemma in terms of power. We may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below. Happy Birthday Fisher!

Two New Properties of Mathematical Likelihood

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

  The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true. If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other. Continue reading
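As a concrete gloss on the improvement argument in this passage (my own sketch, not part of Fisher’s paper or the excerpt): take simple H0: μ = 0 versus H1: μ = 1 for a Normal mean with unit variance. The likelihood ratio is monotone in the sample mean, so the most powerful size-α test rejects for large sample means; any rival region of the same size, such as a two-sided one, spends rejection probability on samples relatively more probable under H0, and so has lower power. The sample size and α below are illustrative assumptions.

```python
# Minimal sketch of the Neyman-Pearson "most powerful" comparison for
# simple H0: mu = 0 vs H1: mu = 1 with X-bar ~ N(mu, 1/n).
# n and alpha are illustrative assumptions.
from scipy.stats import norm
import numpy as np

n, alpha, mu1 = 9, 0.05, 1.0
se = 1 / np.sqrt(n)

# One-sided (likelihood-ratio) region: x-bar > c1, with Pr(reject; H0) = alpha
c1 = norm.ppf(1 - alpha, loc=0, scale=se)
power_np = 1 - norm.cdf(c1, loc=mu1, scale=se)

# A rival region of the same size: |x-bar| > c2
c2 = norm.ppf(1 - alpha / 2, loc=0, scale=se)
power_two_sided = (1 - norm.cdf(c2, loc=mu1, scale=se)) + norm.cdf(-c2, loc=mu1, scale=se)

print(f"size alpha = {alpha}")
print(f"power of one-sided (most powerful) test: {power_np:.3f}")
print(f"power of two-sided test of same size:    {power_two_sided:.3f}")
```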

Categories: Fisher, phil/history of stat, Phil6334/ Econ 6614, Statistics | Tags: , , , | Leave a comment

American Phil Assoc Blog: The Stat Crisis of Science: Where are the Philosophers?


The Statistical Crisis of Science: Where are the Philosophers?

This was published today on the American Philosophical Association blog. 

“[C]onfusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth.” (George Barnard 1985, p. 2)

“Relevant clarifications of the nature and roles of statistical evidence in scientific research may well be achieved by bringing to bear in systematic concert the scholarly methods of statisticians, philosophers and historians of science, and substantive scientists…” (Allan Birnbaum 1972, p. 861).

“In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered.” (p. 57, Committee Investigating fraudulent research practices of social psychologist Diederik Stapel)

I was the lone philosophical observer at a special meeting convened by the American Statistical Association (ASA) in 2015 to construct a non-technical document to guide users of statistical significance tests–one of the most common methods used to distinguish genuine effects from chance variability across a landscape of social, physical and biological sciences.

It was, by the ASA Director’s own description, “historical”, but it was also highly philosophical, and its ramifications are only now being discussed and debated. Today, introspection on statistical methods is rather common due to the “statistical crisis in science”. What is it? In a nutshell: high-powered computer methods make it easy to arrive at impressive-looking ‘findings’ that too often disappear when others try to replicate them, with hypotheses and data analysis protocols required to be fixed in advance.

Continue reading
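For those who prefer to see the mechanism rather than take it on description, here is a minimal simulation sketch (my own illustration, with assumed settings): search a batch of truly null effects, keep the ones that reach p < .05, and check how many survive a fresh, fixed-in-advance replication.

```python
# Minimal simulation sketch of the 'crisis' mechanism: search many null
# effects, report the ones that cross p < .05, and watch them fail to
# reappear in fresh data. All settings are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(7)
n_hypotheses, n_per_study = 200, 30     # every effect is truly zero

original = rng.normal(0, 1, size=(n_hypotheses, n_per_study))
p_orig = ttest_1samp(original, 0, axis=1).pvalue
found = p_orig < 0.05                    # 'impressive-looking findings'

replication = rng.normal(0, 1, size=(found.sum(), n_per_study))
p_rep = ttest_1samp(replication, 0, axis=1).pvalue

print(f"'findings' in the original search: {found.sum()} of {n_hypotheses}")
print(f"of those, replicated at p < .05:  {(p_rep < 0.05).sum()}")
```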

Categories: Error Statistics, Philosophy of Statistics, Summer Seminar in PhilStat | 1 Comment

Summer Seminar in PhilStat: July 28-Aug 11

Please See New Information for Summer Seminar in PhilStat

Categories: Announcement, Summer Seminar in PhilStat | 1 Comment

Little Bit of Logic (5 mini problems for the reader)

Little bit of logic (5 little problems for you)[i]

Deductively valid arguments can readily have false conclusions! Yes, deductively valid arguments allow drawing their conclusions with 100% reliability but only if all their premises are true. For an argument to be deductively valid means simply that if the premises of the argument are all true, then the conclusion is true. For a valid argument to entail the truth of its conclusion, all of its premises must be true. In that case the argument is said to be (deductively) sound.

Equivalently, using the definition of deductive validity that I prefer: a deductively valid argument is one where the truth of all its premises, together with the falsity of its conclusion, leads to a logical contradiction (A & ~A).

Show that an argument with the form of disjunctive syllogism can have a false conclusion. Such an argument takes the form (where A, B are statements): Continue reading
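As a warm-up for the problems (my own illustration, not the intended solutions), here is a brute-force truth-table check that the disjunctive syllogism form is deductively valid: no assignment of truth values makes both premises true and the conclusion false. Whether the conclusion of any particular argument of that form is true is a matter of soundness, i.e., of whether its premises are in fact true.

```python
# Minimal sketch: check validity of disjunctive syllogism
# (A or B; not-A; therefore B) by exhausting the truth table.
# Validity says nothing about soundness: with a false premise,
# the conclusion of a valid argument may still be false.
from itertools import product

def valid(premises, conclusion):
    """True iff no assignment makes every premise true and the conclusion false."""
    return all(conclusion(A, B)
               for A, B in product([True, False], repeat=2)
               if all(p(A, B) for p in premises))

premises = [lambda A, B: A or B,      # A or B
            lambda A, B: not A]       # not-A
conclusion = lambda A, B: B           # therefore B

print("disjunctive syllogism valid?", valid(premises, conclusion))   # True
```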

Categories: Error Statistics | 22 Comments

Mayo Slides Meeting #1 (Phil 6334/Econ 6614, Mayo & Spanos)

Slides for Meeting #1 (Phil 6334/Econ 6614: Current Debates on Statistical Inference and Modeling, D. Mayo and A. Spanos)

 

Categories: Phil6334/ Econ 6614 | Leave a comment

Excerpt from Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking

4.8 All Models Are False

. . . it does not seem helpful just to say that all models are wrong. The very word model implies simplification and idealization. . . . The construction of idealized representations that capture important stable aspects of such systems is, however, a vital part of general scientific analysis. (Cox 1995, p. 456)

 A popular slogan in statistics and elsewhere is “all models are false!” Is this true? What can it mean to attribute a truth value to a model? Clearly what is meant involves some assertion or hypothesis about the model – that it correctly or incorrectly represents some phenomenon in some respect or to some degree. Such assertions clearly can be true. As Cox observes, “the very word model implies simplification and idealization.” To declare, “all models are false” by dint of their being idealizations or approximations, is to stick us with one of those “all flesh is grass” trivializations (Section 4.1). So understood, it follows that all statistical models are false, but we have learned nothing about how statistical models may be used to infer true claims about problems of interest. Since the severe tester’s goal in using approximate statistical models is largely to learn where they break down, their strict falsity is a given. Yet it does make her wonder why anyone would want to place a probability assignment on their truth, unless it was 0? Today’s tour continues our journey into solving the problem of induction (Section 2.7). Continue reading

Categories: Statistical Inference as Severe Testing | 3 Comments

Protected: Participants in 6334/6614 Meeting place Jan-Feb

This content is password protected. To view it please enter your password below:

Categories: SIST | Enter your password to view comments.

6334/6614: Captain’s Library: Biblio With Links

Mayo and A. Spanos
PHIL 6334/ ECON 6614: Spring 2019: Current Debates on Statistical Inference and Modeling

Bibliography (this includes a selection of articles with links; numbers 1-15 after the item refer to seminar meeting number.)

See Syllabus (first) for class meetings, and the page PhilStat19 menu up top for other course items.

Achinstein (2010). Mill’s Sins or Mayo’s Errors? (E&I: 170-188). (11)

Bacchus, Kyburg, & Thalos (1990). Against Conditionalization, Synthese (85): 475-506. (15)

Barnett (1999). Comparative Statistical Inference (Chapter 6: Bayesian Inference), John Wiley & Sons. (1), (15)

Begley & Ellis (2012) Raise standards for preclinical cancer research. Nature 483: 531-533. (10)

Continue reading

Categories: SIST | 1 Comment

(Full) Excerpt of Excursion 4 Tour I: The Myth of “The Myth of Objectivity”

A month ago, I excerpted just the very start of Excursion 4 Tour I* on The Myth of the “Myth of Objectivity”. It’s a short Tour, and this continues the earlier post.

4.1    Dirty Hands: Statistical Inference Is Sullied with Discretionary Choices

If all flesh is grass, kings and cardinals are surely grass, but so is everyone else and we have not learned much about kings as opposed to peasants. (Hacking 1965, p.211)

Trivial platitudes can appear as convincingly strong arguments that everything is subjective. Take this one: No human learning is pure so anyone who demands objective scrutiny is being unrealistic and demanding immaculate inference. This is an instance of Hacking’s “all flesh is grass.” In fact, Hacking is alluding to the subjective Bayesian de Finetti (who “denies the very existence of the physical property [of] chance” (ibid.)). My one-time colleague, I. J. Good, used to poke fun at the frequentist as “denying he uses any judgments!” Let’s admit right up front that every sentence can be prefaced with “agent x judges that,” and not sweep it under the carpet (SUTC) as Good (1976) alleges. Since that can be done for any statement, it cannot be relevant for making the distinctions in which we are interested, and we know can be made, between warranted or well-tested claims and those so poorly probed as to be BENT. You’d be surprised how far into the thicket you can cut your way by brandishing this blade alone. Continue reading

Categories: objectivity, SIST | Leave a comment

New Course Starts Tomorrow: Current Debates on Statistical Inference and Modeling: Joint Phil and Econ

I will post items on a new PhilStat Spring 19 page on this blog.

Categories: Announcement | Leave a comment

A letter in response to the ASA’s Statement on p-Values by Ionides, Giessing, Ritov and Page

I came across an interesting letter in response to the ASA’s Statement on p-values that I hadn’t seen before. It’s by Ionides, Giessing, Ritov and Page, and it’s very much worth reading. I make some comments below. Continue reading

Categories: ASA Guide to P-values, P-values | 7 Comments
