Error Statistics

Is it impossible to commit Type I errors in statistical significance tests? (i)


While immersed in our fast-paced, remote NISS debate (October 15) with J. Berger and D. Trafimow, I didn’t immediately catch all that was said by my co-debaters (I will shortly post a transcript). We had all opted for no practice. But looking over the transcript, I was surprised to find that David Trafimow was indeed saying the answer to the question in my title is yes. Here are some excerpts from his remarks:

Trafimow 8:44
See, it’s tantamount to impossible that the model is correct, which means that the model is wrong. And so what you’re in essence doing then, is you’re using the P-value to index evidence against a model that is already known to be wrong. …But the point is the model was wrong. And so there’s no point in indexing evidence against it. So given that, I don’t really see that there’s any use for them. …

Trafimow 18:27
I’ll make a more general comment, which is that since since the model is wrong, in the sense of not being exactly correct, whenever you reject it, you haven’t learned anything. And in the case where you fail to reject it, you’ve made a mistake. So the worst, so the best possible cases you haven’t learned anything, the worst possible cases is you’re wrong…

Trafimow 37:54
Now, Deborah, again made the point that you need procedures for testing discrepancies from the null hypothesis, but I will repeat that …P-values don’t give you that. P-values are about discrepancies from the model…

But P-values are not about discrepancies from the model (in which a null or test hypothesis is embedded). If they were, you might say, as he does, that you should properly always find small P-values, so long as the model isn’t exactly correct. If you don’t, he says, you’re making a mistake. But this is wrong, and is in need of clarification. In fact, if violations of the model assumptions prevent computing a legitimate P-value, then its value is not really “about” anything. [a] (New Endnote 11/29.)

Three main points:

[1] It’s very important to see that the statistical significance test is not testing whether the overall model is wrong, and it is not indexing evidence against the model. It is only testing the null hypothesis (or test hypothesis) H0. It is an essential part of the definition of a test statistic T that its distribution be known, at least approximately, under H0. Cox has discussed this for over 40 years; I’ll refer first to a recent, and then an early paper.

Cox (2020, p. 1):

Suppose that we study a system with haphazard variation and are interested in a hypothesis, H, about the system. We find a test quantity, a function t(y) of data y, such that if H holds, t(y) can be regarded as the observed value of a random variable t(Y) having a distribution under H that is known numerically to an adequate approximation, either by mathematical theory or by computer simulation. Often the distribution of t(Y) is known also under plausible alternatives to H, but this is not necessary. It is enough that the larger the value of t(y), the stronger the pointer against H.

Cox (1977, pp. 1-2):

The basis of a significance test is an ordering of the points in [a sample space] in order of increasing inconsistency with H0, in the respect under study. Equivalently there is a function t = t(y) of the observations, called a test statistic, and such that the larger is t(y), the stronger is the inconsistency of y with H0, in the respect under study. The corresponding random variable is denoted by T. To complete the formulation of a significance test, we need to be able to compute, at least approximately,

p(y_obs) = p_obs = pr(T > t_obs; H0),                                  (1)

called the observed level of significance.

…To formulate a test, we therefore need to define a suitable function t(.), or rather the associated ordering of the sample points. Essential requirements are that (a) the ordering is scientifically meaningful, (b) it is possible to evaluate, at least approximately, the probability (1).
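To make Cox’s requirement (b) concrete, here is a minimal sketch, using a toy example of my own rather than anything from Cox’s papers, in which the distribution of a test statistic under H0 is approximated by computer simulation, one of the two routes Cox allows.

```python
# A minimal sketch (toy example): approximate pr(T > t_obs; H0) by simulating
# the test statistic under H0.  Here H0: mu = 0 for IID Normal data, and t(y)
# is the absolute one-sample t-statistic, so a larger t means stronger
# inconsistency with H0 "in the respect under study".
import numpy as np

rng = np.random.default_rng(0)

def t_stat(y):
    return abs(y.mean() / (y.std(ddof=1) / np.sqrt(len(y))))

y_obs = np.array([0.6, 1.1, -0.2, 0.9, 1.4, 0.3, 0.8, 1.0])   # illustrative data
t_obs = t_stat(y_obs)

# The t-statistic does not depend on the unknown variance, so simulating with
# unit variance under H0 is enough.
sims = np.array([t_stat(rng.normal(0.0, 1.0, size=len(y_obs)))
                 for _ in range(50_000)])
p_obs = (sims > t_obs).mean()
print(f"t_obs = {t_obs:.2f}, simulated p-value ≈ {p_obs:.4f}")
```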

To suppose, as Trafimow plainly does, that we can never commit a Type I error in statistical significance testing because the underlying model “is not exactly correct” is a serious misinterpretation. The statistical significance test only tests one null hypothesis at a time. It is piecemeal. If it’s testing, say, the mean of a Normal distribution, it’s not also testing the underlying assumptions of the Normal model (Normal, IID). Those assumptions are tested separately, and the error statistical methodology offers systematic ways for doing so, with yet more statistical significance tests [see point 3].

[2] Moreover, although the model assumptions must be met adequately in order for the P-value to serve as a test of H0, it isn’t required that we have an exactly correct model, merely that the reported error probabilities are close to the actual ones. As I say in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018) (several excerpts of which can be found on this blog):

Statistical models are at best approximations of aspects of the data-generating process. Reasserting this fact is not informative about the case at hand. These models work because they need only capture rather coarse properties of the phenomena: the error probabilities of the test method are approximately and conservatively related to actual ones. …Far from wanting true (or even “truer”) models, we need models whose deliberate falsity enables finding things out. (p. 300)

Nor do P-values “track” violated assumptions; such violations can lead to computing an incorrectly high, or an incorrectly low, P-value.

And what about cases where we know ahead of time that a hypothesis H0 is strictly false?—I’m talking about the hypothesis here, not the underlying model. (Examples would be with a point null, or one asserting “there’s no Higgs boson”.) Knowing a hypothesis H0 is false is not yet to falsify it. That is, we are not warranted in inferring we have evidence of a genuine effect or discrepancy from H0, and we still don’t know in which way it is flawed.

[3] What is of interest in testing H0 with a statistical significance test is whether there is a systematic discrepancy or inconsistency with H0—one that is not readily accounted for by background variability, chance, or “noise” (as modelled). We don’t need, or even want, a model that fully represents the phenomenon—whatever that would mean. In “design-based” tests, we look to experimental procedures, within our control, as with randomisation.

Fisher:

the simple precaution of randomisation will suffice to guarantee the validity of the test of significance, by which the result of the experiment is to be judged. (Fisher 1935, 21)

We look to RCTs quite often these days to test the benefits (and harms) of vaccines for Covid-19. Researchers observe differences in the number of Covid-19 cases in two randomly assigned groups, vaccinated and unvaccinated. We know there is ordinary variability in contracting Covid-19; it might be that, just by chance, more people who would have remained Covid-free, even without the vaccine, happen to be assigned to the vaccination group. The random assignment allows computing the probability that an even larger difference in Covid-19 rates would be observed even if H0 were true: that the two groups have the same chance of avoiding Covid-19. (I’m describing things extremely roughly; a much more realistic account of randomisation is given in several guest posts by Senn (e.g., this blogpost).) Unless this probability is small, it would not be correct to reject H0 and infer that there is evidence the vaccine is effective. Yet Trafimow, if we take him seriously, is saying it would always be correct to reject H0, and that to fail to reject it is to make a mistake. I hope that no one is seriously suggesting we should always infer there’s evidence a vaccine or other treatment works, but I don’t know how else to understand the position that it’s always correct to reject H0. That is a dangerous and wrong view, and fortunately one that researchers do not act on.
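For readers who like to see the bare logic in code, here is a deliberately crude sketch of the randomisation reasoning just described. The group sizes and case counts are toy numbers of my own, not data from any actual trial, and the test is the simplest possible re-randomisation check of the difference in infection rates.

```python
# A rough sketch (toy numbers, not real trial data): under H0 the cases would
# have occurred whichever arm a subject was assigned to, so we re-assign arm
# labels at random and ask how often the observed (or a larger) difference in
# infection rates arises by chance alone.
import numpy as np

rng = np.random.default_rng(1)

n_vaccine, n_placebo = 200, 100            # hypothetical 2:1 allocation
cases_vaccine, cases_placebo = 4, 10       # hypothetical observed case counts

n_total = n_vaccine + n_placebo
total_cases = cases_vaccine + cases_placebo
outcomes = np.zeros(n_total, dtype=int)
outcomes[:total_cases] = 1                 # 1 = contracted Covid-19

obs_diff = cases_placebo / n_placebo - cases_vaccine / n_vaccine

n_sim, count = 20_000, 0
for _ in range(n_sim):
    rng.shuffle(outcomes)                  # a fresh random assignment to arms
    v = outcomes[:n_vaccine].sum()         # cases landing in the vaccine arm
    p = total_cases - v                    # cases landing in the placebo arm
    if p / n_placebo - v / n_vaccine >= obs_diff:
        count += 1

print(f"randomisation p-value ≈ {count / n_sim:.3f}")
```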

When we don’t have design-based assumptions, we may check the model-based assumptions by means of tests that are secondary in relation to the primary test. The trick is to get them to be independent of the unknowns in the primary test, and there are systematic ways to achieve this.

Cox 2006:

We now turn to a complementary use of these ideas, namely to test the adequacy of a given model, what is also sometimes called model criticism… It is necessary if we are to parallel the previous argument to find a statistic whose distribution is exactly or very nearly independent of the unknown parameter μ. An important way of doing this is by appeal to the second property of sufficient statistics, namely that after conditioning on their observed value the remaining data have a fixed distribution. (2006, p. 33)

“In principle, the information in the data is split into two parts, one to assess the unknown parameters of interest and the other for model criticism” (Cox 2006, p. 198). If the model is appropriate then the conditional distribution of Y given the value of the sufficient statistic s is known, so it serves to assess if the model is violated. The key is often to look at residuals: the difference between each observed outcome and what is expected under the model. The full data are remodelled to ask a different question. [i]
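Here is a minimal sketch of that splitting of the information, using an illustration of my own (a Normal model with unknown mean) rather than Cox’s examples:

```python
# A minimal illustration of splitting the data: under a Normal model with
# unknown mean mu, the sample mean carries the information about mu, while the
# residuals y_i - ybar have a distribution that does not involve mu, so they
# can be used for model criticism -- here a Shapiro-Wilk check of Normality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.exponential(scale=2.0, size=50)    # deliberately non-Normal data

ybar = y.mean()                            # used for the primary inference about mu
residuals = y - ybar                       # used for the secondary (model) check

stat, p_model = stats.shapiro(residuals)   # secondary test of the Normality assumption
print(f"estimate of mu: {ybar:.2f}")
print(f"Shapiro-Wilk p-value for the residuals: {p_model:.4f}")
```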

In testing assumptions, the null hypothesis is generally that the assumption(s) hold approximately. Again, even when we know this secondary null is strictly false, we want to learn in what way, and use the test to pinpoint improved models to try. (These new models must be separately tested.) [ii]

The essence of the reasoning can be made out entirely informally. Think of how the 1919 Eddington eclipse tests probed departures from the Newtonian predicted light deflection. They tested the Newtonian “half deflection” H0: μ ≤ 0.87, vs H1: μ > 0.87, which includes the Einstein value of 1.75. These primary tests relied upon sufficient accuracy in the telescopes to get a usable standard error for the star positions during the eclipse, and 6 months before (SIST, Excursion 3 Tour I). In one set of plates, which some thought supported Newton, this necessary assumption was falsified using a secondary test. Relying only on known star positions and the detailed data, it was clear that the sun’s heat had systematically distorted the telescope mirror. No assumption about general relativity was required.

If I update this, I will indicate with (i), (ii), etc.

I invite your comments and/or guest posts on this topic.

NOTE: Links to the full papers/book are given in this post, so you might want to check them out.

[i] See Spanos 2010 (pp. 322-323) from Error & Inference. (This is his commentary on Cox and Mayo in the same volume.) Also relevant is Mayo and Spanos 2011 (pp. 193-194).

[ii] It’s important to see that other methods, error statistical or Bayesian, rely on models. A central asset of the simple significance test, on which Bayesians will concur, is its apt role in testing assumptions.

ENDNOTES ADDED AFTER 11/28:

[a] A passage I happened to find in Cox 2006 yesterday is worth noting (the book is linked in this post). He emphasizes that even in testing the model, we’re only interested in serious inconsistencies, not mere inexactitude of some sort.

An important possibility is that the data under analysis are derived from a probability density g(.) that is not a member of the family f(y; θ ) originally chosen to specify the model. Note that since all models are idealizations the empirical content of this possibility is that the data may be seriously inconsistent with the assumed model and that although a different model is to be preferred, it is fruitful to examine the consequences for the fitting of the original family. (p. 141, Principles of Statistical Inference)

Categories: D. Trafimow, J. Berger, National Institute of Statistical Sciences (NISS), Testing Assumptions | 12 Comments

S. Senn: “A Vaccine Trial from A to Z” with a Postscript (guest post)


Stephen Senn
Consultant Statistician
Edinburgh, Scotland

Alpha and Omega (or maybe just Beta)

Well actually, not from A to Z but from AZ. That is to say, the trial I shall consider is the placebo-controlled trial of the Oxford University vaccine for COVID-19 currently being run by AstraZeneca (AZ) under protocol AZD1222 – D8110C00001 and which I considered in a previous blog, Heard Immunity. A summary of the design features is given in Table 1. The purpose of this blog is to look a little deeper at features of the trial, and the way I am going to do so is with the help of geometric representations of the sample space, that is to say the possible results the trial could produce. However, the reader is warned that I am only an amateur in all this. The true professionals are the statisticians at AZ who, together with their life science colleagues in AZ and Oxford, designed the trial.

Whereas in an October 20 post (on PHASTAR) I considered the sequential nature of the trial, here I am going to ignore that feature and only look at the trial as if it had a single look. Note that the trial employs a two to one randomisation, twice as many subjects being given vaccine as placebo.

However, first I shall draw attention to one interesting feature. Like the two other trials that I also previously considered (one by BioNTech and Pfizer and the other by Moderna) the null hypothesis that is being tested is not that the vaccine has no efficacy but that its efficacy does not exceed 30%. Vaccine Efficacy (VE) is defined as

VE = 1 − R_vaccine / R_placebo,

where R_placebo and R_vaccine are the ‘true’ rates of infection under placebo and vaccine respectively.

Obviously, if the vaccine were completely ineffective, the value of VE would be 0. Presumably the judgement is that a vaccine will be of no practical use unless it has an efficacy of at least 30%; perhaps a lower value could not really help to control the epidemic. The trial is designed to show that the efficacy exceeds this threshold. In what follows, you can take it as read that the probability of the trial failing when the efficacy is some value less than 30% (such as 27%, say) is even greater than if the value is exactly 30%. Therefore, it becomes of interest to consider the way the trial will behave if the value is exactly 30%.

Figuring it out

Figure 1 gives a representation of what might happen in terms of cases of infected subjects in both arms of the trial based on its design. It’s a complicated diagram and I shall take some time to explain it. For the moment I invite the reader to ignore the concentric circles and the shading. I shall get to those in due course.

Figure 1 Possible and expected outcomes for the trial plotted in the two dimensional space of vaccine and placebo cases of infection. The contour plot applies when the null hypothesis is true.

The X axis gives the number of cases in the vaccine group and the Y axis the number of cases under placebo. It is important to bear in mind that twice as many subjects are being treated with vaccine as with placebo. The line of equality of infection rates is given by the dashed white diagonal line towards the bottom right hand side of the plot, labelled ‘0% efficacy’. This joins (for example) the points (80, 40) and (140, 70), corresponding to twice as many cases under vaccine as placebo and reflecting the 2:1 allocation ratio. Other diagonal lines correspond to 30%, 50% and 60% VE respectively.

The trial is designed to stop once 150 cases of infection have occurred. This boundary is represented by the diagonal solid red line descending from the upper left (30 cases in the vaccine group and 120 cases in the placebo group) towards the bottom right (120 cases in the vaccine group and 30 cases in the placebo group). Thus, we know in advance that the combination of results we shall see must lie on this line.

Note that the diagram is slightly misleading: since the space concerned refers to numbers of cases, it is neither continuous in X nor continuous in Y. The only possible values are the whole numbers, W, that is to say the non-negative integers. However, the same is not true for expected numbers, and this is a common difference between parameters and random variables in statistics. For example, if we have a Poisson random variable with a given mean, the only possible values of the random variable are the whole numbers 0, 1, 2,… but the mean can be any positive real number.

Ripples in the pond

Figure 2 is the same diagram as Figure 1 as regards every feature except that which I invited the reader to ignore. The concentric circles are contour plots that represent features of the trial that are suitable for planning. In order to decide how many subjects to recruit, the scientists at AZ and Oxford had to decide what infection rate was likely. They chose an infection rate of 0.8% per 6 months under placebo. This in turn implies that of 10,000 subjects treated with placebo, we might expect 80 to get COVID. On the other hand, a vaccine efficacy of 30% would imply an infection rate of 0.56%, since (1 − 0.30) × 0.8% = 0.56%.

For 20,000 subjects treated with vaccine we would expect (0.56/100) × 20,000 = 112 of them to be infected with COVID, and if the vaccine efficacy were 60%, the value assumed for the power calculation, then the expected infection rate would be 0.32% and we would expect 64 of the subjects to be infected.

Since the infection rates are small, a Poisson distribution is a possible simple model for the probability of seeing certain combinations of infections. This is what the contour plots illustrate. For both cases, the expected number of cases under placebo is assumed to be 80 and this is illustrated by a dashed horizontal white line. However, the lower infection rate under H1 has the effect of shifting the contour plots to the left. Thus, in Figure 1 the dashed vertical line indicating the expected numbers in the vaccine arm is at 112 and in Figure 2 it is at 64. Nothing else changes between the figures.
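For anyone wishing to reproduce something like these contours, here is a sketch of the joint distributions as I read the figures, taking the counts in the two arms to be independent Poisson variables with the means given above (80 under placebo in both figures; 112 or 64 under vaccine):

```python
# A sketch of what the contour plots represent (my reading of the figures):
# independent Poisson counts for the two arms, with placebo mean 80 in both
# cases and vaccine mean 112 (30% efficacy, Figure 1) or 64 (60% efficacy,
# Figure 2).  The grids could be handed to any contour-plotting routine.
import numpy as np
from scipy.stats import poisson

vaccine_cases = np.arange(0, 201)   # X axis
placebo_cases = np.arange(0, 151)   # Y axis

def joint_pmf(mean_vaccine, mean_placebo=80):
    """Joint probability of each (vaccine, placebo) pair of case counts."""
    return np.outer(poisson.pmf(placebo_cases, mean_placebo),
                    poisson.pmf(vaccine_cases, mean_vaccine))

grid_h0 = joint_pmf(112)   # the contours of Figure 1
grid_h1 = joint_pmf(64)    # the contours of Figure 2
print(grid_h0.shape, grid_h0.sum().round(3), grid_h1.sum().round(3))
```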

Figure 2 Possible and expected outcomes for the trial plotted in the two dimensional space of vaccine and placebo cases of infection. The contour plot applies when the value under the alternative hypothesis assumed for power calculations is true.

Test bed

How should we carry out a significance test? One way of doing so is to condition on the total number of infected cases. The issue of whether to condition or not is a notorious controversy in statistics. Here the total of 150 is fixed, but I think that there is a good argument for conditioning whether or not it is fixed. Such conditioning in this case leads to a binomial distribution describing the number of cases of infection, out of the 150 observed, that are in the vaccine group. Ignoring any covariates, therefore, a simple analysis is to compare the proportion of cases we see to the proportion we would expect to see under the null hypothesis. This proportion is given by 112/(112+80) = 0.583. (Note a subtle but important point here. The total number of cases expected is 192 but we know the trial will stop at 150. That is irrelevant. It is the expected proportion that matters here.)

By trial and error or by some other means we can now discover that the probability of 75 or fewer cases given vaccine out of 150 in total, when the probability is 0.583, is 0.024. The AZ protocol requires a two-sided P-value less than or equal to 4.9%, which is to say 0.0245 one sided, assuming the usual doubling rule, so this is just low enough. On the other hand, the probability of 76 or fewer cases under vaccine is 0.035 and thus too high. This establishes the point X=75, Y=75 as a critical value of the test. This is shown by the small red circle labelled ‘critical value’ on both figures. It just so happens that this lies along the 50% efficacy line. Thus observed 50% efficacy will be (just) enough to reject the hypothesis that the true efficacy is 30% or lower.
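The arithmetic above can be checked in a few lines. The numbers are those already given in the post; the code, and the helper function converting an efficacy into the probability that a case falls in the vaccine arm, are my own sketch.

```python
# With a 2:1 allocation and vaccine efficacy VE, the chance that any given case
# falls in the vaccine arm is 2(1 - VE) / (2(1 - VE) + 1); at the null value
# VE = 0.30 this is 1.4/2.4 = 112/192 = 0.583.  Conditioning on 150 cases in
# total then gives binomial tail probabilities.
from scipy.stats import binom

def p_case_in_vaccine_arm(ve, allocation=2.0):
    return allocation * (1 - ve) / (allocation * (1 - ve) + 1)

p0 = p_case_in_vaccine_arm(0.30)
print(f"null proportion: {p0:.3f}")                                   # 0.583
print(f"P(75 or fewer vaccine cases): {binom.cdf(75, 150, p0):.3f}")  # about 0.024
print(f"P(76 or fewer vaccine cases): {binom.cdf(76, 150, p0):.3f}")  # about 0.035
```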

Reading the tea-leaves 

There are many other interesting features of this trial I could discuss, in particular what alternative analyses might be tried (the protocol refers to a ‘modified Poisson regression approach’ due to Zou, 2004), but I shall just consider one other issue here. That is that, in theory, the point at which the trial stops might give some indication as to vaccine efficacy, a point that might be of interest to avid third party trial-watchers. If you look at Figure 3, which combines Figure 1 and Figure 2, you will note that the expected number of cases under H0, if the values used for planning are correct, is at least (when vaccine efficacy is 30%) 80+112=192. For zero efficacy the figure is 80+160=240. However, the trial will stop once 150 cases of infection have been observed. Thus, under H0, the trial is expected to stop before all 30,000 subjects have had six months of follow-up.

On the other hand, for an efficacy of 60%, given in Figure 3, the value is 80+64 = 144 and so slightly less than the figure required. Thus, under H1, the trial might not be big enough. Taken together, these figures imply that, other things being equal, the earlier the trial stops the more likely the result is to be negative and the longer it continues, the more likely it is to be positive.

Of course, this raises the issue as to whether one can judge what is early and what is late. To make some guesses as to background rates of infection is inevitable when planning a trial. One would be foolish to rely on them when interpreting it.

Figure 3 Combination of Figures 1 and 2 showing contour plots for the joint density for the number of cases when the vaccine efficacy is 30% (H0) and the value under H1 of 60% used for planning.

Reference

Zou G. (2004). A modified Poisson regression approach to prospective studies with binary data. Am J Epidemiol 159(7): 702–6.

POSTSCRIPT: Needlepoint

Pressing news

Extract of a press-release from Pfizer, 9 November 2020:

“I am happy to share with you that Pfizer and our collaborator, BioNTech, announced positive efficacy results from our Phase 3, late-stage study of our potential COVID-19 vaccine. The vaccine candidate was found to be more than 90% effective in preventing COVID-19 in participants without evidence of prior SARS-CoV-2 infection in the first interim efficacy analysis.” Albert Bourla (Chairman and CEO, Pfizer.)

Naturally, this had Twitter agog, and calculations were soon produced to try and reconstruct the basis on which the claim was being made: how many cases of COVID-19 infection under vaccine had been seen in order to be able to make this claim? In the end these amateur calculations don’t matter. It’s what Pfizer calculates and what the regulators decide about the calculation that matters. I note by the by that a fair proportion of Twitter seemed to think that journal publication and peer review are essential. I don’t share this point of view, which I tend to think of as “quaint”. It’s the regulator’s view I am interested in, but we shall have to wait for that.

Nevertheless, calculation can be fun, and if I did not think so, I would be in the wrong profession. So here goes. However, first I should acknowledge that Jen Rogers’s interesting blog on the subject has been very useful in preparing this note.

The back of the envelope

To do the calculation properly, this is what one would have to know (in each case the item that is needed is followed by a brief discussion).

Disposition of subjects. Randomisation was one to one but strictly speaking we want to know the exact achieved proportions. BusinessWire describes a total of “43,538 participants to date, 38,955 of whom have received a second dose of the vaccine candidate as of November 8, 2020”.

Number of cases of infection. According to BusinessWire, 94 were seen.

Method of analysis. Pfizer states in the protocol that a Bayesian analysis will be used. I shall not attempt this but use a very simple frequentist one, conditioning on the total infected.

Aim of claim. Is the point estimate the basis of the claim, or is it the lower bound of some confidence interval?

Level of confidence to be used. Pfizer planned to look five times but it seems that the first look was abandoned. The reported look is the 2nd but at a number of cases that is slightly greater (94) than the number originally planned for the 3rd (92). I shall assume that the confidence level for look three of an O’Brien-Fleming boundary is appropriate.

Missingness. A simple analysis would assume no missing data, or at least that any missing data are missing completely at random.

Other matters. Two doses are required. Were there any cases arising between the two doses and, if so, what was done with them?

 

If I condition on the total number of infected cases, and assume equal numbers of subjects on each arm, then by varying the number of cases in the vaccine group and subtracting them from the total of 94 to get those on the control group arm, I can calculate the vaccine efficacy. This has been done in the figure below.

The solid blue circles are the estimates of the vaccine efficacy. The ‘whiskers’ below indicate a confidence limit of 99.16%, which (I think) is the level appropriate for the third look in an O’Brien-Fleming scheme for an overall type I error rate of 5%. Horizontal lines have been drawn at 30% efficacy (the value used in the protocol for the null hypothesis) and 90% efficacy (the claimed effect in the press release). Three cases on the vaccine arm would give a lower confidence limit for the vaccine efficacy of about 91.3%, whereas four would give a value of 89.2%. Eight cases would give a point estimate of 90.7%. So depending on what exactly the claim of “more than 90% effective” might mean (and a whole host of other assumptions) we could argue that between three and eight cases of infection were seen.
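For what it is worth, the point-estimate part of that mapping is easy to check. The sketch below is mine and reproduces only the point estimates, not the O’Brien-Fleming-adjusted confidence limits, whose exact construction is not spelled out above.

```python
# A back-of-the-envelope check: conditioning on 94 cases with 1:1 allocation,
# if v cases fall in the vaccine arm the estimated relative risk is
# v / (94 - v) and the estimated vaccine efficacy is 1 minus that ratio.
total_cases = 94
for v in range(3, 9):
    ve_hat = 1 - v / (total_cases - v)
    print(f"{v} vaccine cases -> estimated VE = {ve_hat:.1%}")
# 8 cases give roughly 90.7%, matching the point estimate quoted above.
```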

Safety second

Of course safety is often described as being first in terms of priorities but it usually takes longer to see the results that are necessary to judge it than to see those for efficacy. According to BusinessWire “Pfizer and BioNTech are continuing to accumulate safety data and currently estimate that a median of two months of safety data following the second (and final) dose of the vaccine candidate – the amount of safety data specified by the FDA in its guidance for potential Emergency Use Authorization – will be available by the third week of November.”

The world awaits the results with interest.

Reference

1. P. C. O’Brien and T. R. Fleming (1979). A multiple testing procedure for clinical trials. Biometrics 35, 549–556.

 

Categories: covid-19, Error Statistics, RCTs, Stephen Senn | 9 Comments

Phil Stat Forum: November 19: Stephen Senn, “Randomisation and Control in the Age of Coronavirus?”

For information about the Phil Stat Wars forum and how to join, see this post and this pdf. 


Continue reading

Categories: Error Statistics, randomization | Leave a comment

S. Senn: Testing Times (Guest post)


 

Stephen Senn
Consultant Statistician
Edinburgh, Scotland

Testing Times

Screening for attention

There has been much comment on Twitter and other social media about testing for coronavirus and the relationship between a test being positive and the person tested having been infected. Some primitive form of Bayesian reasoning is often used to justify concern that an apparent positive may actually be falsely so, with specificity and sensitivity taking the roles of likelihoods and prevalence that of a prior distribution. This way of looking at testing dates back at least to a 1959 paper by Ledley and Lusted [1]. However, as others [2, 3] have pointed out, there is a trap for the unwary in this, in that it is implicitly assumed that specificity and sensitivity are constant values unaffected by prevalence, and it is far from obvious that this should be the case. Continue reading
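The ‘primitive’ calculation Senn has in mind runs roughly as follows. The sketch and its illustrative numbers are mine, not the post’s, and it takes for granted exactly the constancy of sensitivity and specificity that Senn goes on to question:

```python
# Sensitivity and specificity play the role of likelihoods, prevalence that of
# a prior; the output is the probability that a positive result reflects a
# genuine infection.  Illustrative numbers only.
def prob_infected_given_positive(sensitivity, specificity, prevalence):
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# A seemingly good test applied at low prevalence:
print(prob_infected_given_positive(sensitivity=0.9,
                                   specificity=0.99,
                                   prevalence=0.005))   # roughly 0.31
```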

Categories: S. Senn, significance tests, Testing Assumptions | 14 Comments

September 24: Bayes factors from all sides: who’s worried, who’s not, and why (R. Morey)

Information and directions for joining our forum are here.

Continue reading

Categories: Announcement, bayes factors, Error Statistics, Phil Stat Forum, Richard Morey | 1 Comment

5 September, 2018 (w/updates) RSS 2018 – Significance Tests: Rethinking the Controversy


Day 2, Wed 5th September, 2018:

The 2018 Meeting of the Royal Statistical Society (Cardiff)

11:20 – 13:20

Keynote 4 – Significance Tests: Rethinking the Controversy Assembly Room

Speakers:
Sir David Cox, Nuffield College, Oxford
Deborah Mayo, Virginia Tech
Richard Morey, Cardiff University
Aris Spanos, Virginia Tech

Intermingled in today’s statistical controversies are some long-standing, but unresolved, disagreements on the nature and principles of statistical methods and the roles for probability in statistical inference and modelling. In reaction to the so-called “replication crisis” in the sciences, some reformers suggest significance tests as a major culprit. To understand the ramifications of the proposed reforms, there is a pressing need for a deeper understanding of the source of the problems in the sciences and a balanced critique of the alternative methods being proposed to supplant significance tests. In this session speakers offer perspectives on significance tests from statistical science, econometrics, experimental psychology and philosophy of science. There will also be a panel discussion.

5 Sept. 2018 (taken by A. Spanos)

Continue reading

Categories: Error Statistics | Tags: | Leave a comment

Statistical Crises and Their Casualties–what are they?

What do I mean by “The Statistics Wars and Their Casualties”? It is the title of the workshop I have been organizing with Roman Frigg at the London School of Economics (CPNSS) [1], which was to have happened in June. It is now the title of a forum I am zooming on Phil Stat that I hope you will want to follow. It’s time that I explain and explore some of the key facets I have in mind with this title. Continue reading

Categories: Error Statistics | 4 Comments

August 6: JSM 2020 Panel on P-values & “Statistical Significance”

SLIDES FROM MY PRESENTATION

July 30 PRACTICE VIDEO for JSM talk (All materials for Practice JSM session here)

JSM 2020 Panel Flyer (PDF)
JSM online program (w/ panel abstract & information):

Categories: ASA Guide to P-values, Error Statistics, evidence-based policy, JSM 2020, P-values, Philosophy of Statistics, science communication, significance tests | 3 Comments

Bad Statistics is Their Product: Fighting Fire With Fire (ii)

Mayo fights fire w/ fire

I. Doubt is Their Product is the title of a (2008) book by David Michaels, Assistant Secretary for OSHA from 2009-2017. I first mentioned it on this blog back in 2011 (“Will the Real Junk Science Please Stand Up?”). The expression is from a statement by a cigarette executive (“doubt is our product”), and the book’s thesis is explained in its subtitle: How Industry’s Assault on Science Threatens Your Health. Imagine you have just picked up a book, published in 2020: Bad Statistics is Their Product. Is the author writing about how exaggerating bad statistics may serve the interest of denying well-established risks? [Interpretation A]. Or perhaps she’s writing on how exaggerating bad statistics serves the interest of denying well-established statistical methods? [Interpretation B]. Both may result in distorting science and even in dismantling public health safeguards–especially if made the basis of evidence policies in agencies. A responsible philosopher of statistics should care. Continue reading

Categories: ASA Guide to P-values, Error Statistics, P-values, replication research, slides | 33 Comments

A. Saltelli (Guest post): What can we learn from the debate on statistical significance?

Professor Andrea Saltelli
Centre for the Study of the Sciences and the Humanities (SVT), University of Bergen (UIB, Norway),
&
Open Evidence Research, Universitat Oberta de Catalunya (UOC), Barcelona

What can we learn from the debate on statistical significance?

The statistical community is in the midst of a crisis whose latest convulsion is a petition to abolish the concept of significance. The problem is perhaps neither with significance, nor with statistics, but with the inconsiderate way we use numbers, and with our present approach to quantification. Unless the crisis is resolved, there will be a loss of consensus in scientific arguments, with a corresponding decline of public trust in the findings of science. Continue reading

Categories: Error Statistics | 11 Comments

The First Eye-Opener: Error Probing Tools vs Logics of Evidence (Excursion 1 Tour II)

1.4, 1.5

In Tour II of this first Excursion of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018, CUP), I pull back the cover on disagreements between experts charged with restoring integrity to today’s statistical practice. Some advised me to wait until later (in the book) to get to this eye-opener. Granted, the full story involves some technical issues, but after many months, I think I arrived at a way to get to the heart of things informally (with a promise of more detailed retracing of steps later on). It was too important not to reveal right away that some of the most popular “reforms” fall down on the job even with respect to our most minimal principle of evidence (you don’t have evidence for a claim if little if anything has been done to probe the ways it can be flawed). Continue reading

Categories: Error Statistics, law of likelihood, SIST | 14 Comments

National Academies of Science: Please Correct Your Definitions of P-values

Mayo banging head

If you were on a committee to highlight issues surrounding P-values and replication, what’s the first definition you would check? Yes, exactly. Apparently, when it came to the recently released National Academies of Science “Consensus Study” Reproducibility and Replicability in Science 2019, no one did. Continue reading

Categories: ASA Guide to P-values, Error Statistics, P-values | 20 Comments

Performance or Probativeness? E.S. Pearson’s Statistical Philosophy: Belated Birthday Wish

E.S. Pearson

This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ll post some Pearson items this week to mark his birthday.

HAPPY BELATED BIRTHDAY EGON!

Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long run error properties is of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.

Cases of Type A and Type B

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

Continue reading

Categories: E.S. Pearson, Error Statistics | Leave a comment

Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen

Neyman April 16, 1894 – August 5, 1981

I’ll continue to post Neyman-related items this week in honor of his birthday. This isn’t the only paper in which Neyman makes it clear he denies a distinction between a test of statistical hypotheses and a significance test. He and E. Pearson also discredit the myth that the former is only allowed to report pre-data, fixed error probabilities, and is justified only by dint of long-run error control. Controlling the “frequency of misdirected activities” in the midst of finding something out, or solving a problem of inquiry, is, on the other hand, an epistemological goal. What do you think?

Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena
by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

I recommend, especially, the example on home ownership. Here are two snippets: Continue reading

Categories: Error Statistics, Neyman, Statistics | Tags: | Leave a comment

Neyman vs the ‘Inferential’ Probabilists


We celebrated Jerzy Neyman’s Birthday (April 16, 1894) last night in our seminar: here’s a pic of the cake.  My entry today is a brief excerpt and a link to a paper of his that we haven’t discussed much on this blog: Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making‘ [i] It’s chock full of ideas and arguments, but the one that interests me at the moment is Neyman’s conception of “his breakthrough”, in relation to a certain concept of “inference”.  “In the present paper” he tells us, “the term ‘inferential theory’…will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation …either a substitute a priori distribution [exemplified by the so called principle of insufficient reason] or a new measure of uncertainty” such as Fisher’s fiducial probability. So if you hear Neyman rejecting “inferential accounts” you have to understand it in this very specific way: he’s rejecting “new measures of confidence or diffidence”. Here he alludes to them as “easy ways out”. Now Neyman always distinguishes his error statistical performance conception from Bayesian and Fiducial probabilisms [ii]. The surprising twist here is semantical and the culprit is none other than…Allan Birnbaum. Yet Birnbaum gets short shrift, and no mention is made of our favorite “breakthrough” (or did I miss it?).

drawn by his wife, Olga

Note: In this article, “attacks” on various statistical “fronts” refer to ways of attacking problems in one or another statistical research program.
HAPPY BIRTHDAY WEEK FOR NEYMAN! Continue reading

Categories: Bayesian/frequentist, Error Statistics, Neyman | Leave a comment

Several reviews of Deborah Mayo’s new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars « Statistical Modeling, Causal Inference, and Social Science

Source: Several reviews of Deborah Mayo’s new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars « Statistical Modeling, Causal Inference, and Social Science

Categories: Error Statistics | Leave a comment

Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt


For the first time, I’m excerpting all of Excursion 1 Tour II from SIST (2018, CUP).

1.4 The Law of Likelihood and Error Statistics

If you want to understand what’s true about statistical inference, you should begin with what has long been a holy grail–to use probability to arrive at a type of logic of evidential support–and in the first instance you should look not at full-blown Bayesian probabilism, but at comparative accounts that sidestep prior probabilities in hypotheses. An intuitively plausible logic of comparative support was given by the philosopher Ian Hacking (1965)–the Law of Likelihood. Fortunately, the Museum of Statistics is organized by theme, and the Law of Likelihood and the related Likelihood Principle is a big one. Continue reading

Categories: Error Statistics, law of likelihood, SIST | 2 Comments

American Phil Assoc Blog: The Stat Crisis of Science: Where are the Philosophers?

Ship StatInfasST

The Statistical Crisis of Science: Where are the Philosophers?

This was published today on the American Philosophical Association blog. 

“[C]onfusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth.” (George Barnard 1985, p. 2)

“Relevant clarifications of the nature and roles of statistical evidence in scientific research may well be achieved by bringing to bear in systematic concert the scholarly methods of statisticians, philosophers and historians of science, and substantive scientists…” (Allan Birnbaum 1972, p. 861).

“In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered.” (p. 57, Committee Investigating fraudulent research practices of social psychologist Diederik Stapel)

I was the lone philosophical observer at a special meeting convened by the American Statistical Association (ASA) in 2015 to construct a non-technical document to guide users of statistical significance tests–one of the most common methods used to distinguish genuine effects from chance variability across a landscape of social, physical and biological sciences.

It was, by the ASA Director’s own description, “historical”, but it was also highly philosophical, and its ramifications are only now being discussed and debated. Today, introspection on statistical methods is rather common due to the “statistical crisis in science”. What is it? In a nutshell: high powered computer methods make it easy to arrive at impressive-looking ‘findings’ that too often disappear when others try to replicate them, with hypotheses and data analysis protocols required to be fixed in advance.

Continue reading

Categories: Error Statistics, Philosophy of Statistics, Summer Seminar in PhilStat | 2 Comments

Little Bit of Logic (5 mini problems for the reader)

Little bit of logic (5 little problems for you)[i]

Deductively valid arguments can readily have false conclusions! Yes, deductively valid arguments allow drawing their conclusions with 100% reliability but only if all their premises are true. For an argument to be deductively valid means simply that if the premises of the argument are all true, then the conclusion is true. For a valid argument to entail the truth of its conclusion, all of its premises must be true. In that case the argument is said to be (deductively) sound.

Equivalently, using the definition of deductive validity that I prefer: A deductively valid argument is one where, the truth of all its premises together with the falsity of its conclusion, leads to a logical contradiction (A & ~A).

Show that an argument with the form of disjunctive syllogism can have a false conclusion. Such an argument takes the form (where A, B are statements): Continue reading
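Since the excerpt is cut off before the form is displayed, the sketch below assumes the usual form of disjunctive syllogism (A or B; not-A; therefore B). It is my own brute-force illustration of the definitions above, not the author’s intended solution: the form has no counterexample (so it is valid), yet the conclusion can still be false, because in any such case a premise is false too.

```python
# Brute-force check over all truth-value assignments to A and B.
from itertools import product

def premises_and_conclusion(a, b):
    premises = [a or b, not a]   # A or B;  not-A
    conclusion = b               # therefore B
    return premises, conclusion

# Validity: no assignment makes all premises true and the conclusion false.
counterexamples = [(a, b) for a, b in product([True, False], repeat=2)
                   if all(premises_and_conclusion(a, b)[0])
                   and not premises_and_conclusion(a, b)[1]]
print("counterexamples to validity:", counterexamples)   # [] -> the form is valid

# The conclusion can nevertheless be false -- only when some premise is false.
for a, b in product([True, False], repeat=2):
    premises, conclusion = premises_and_conclusion(a, b)
    if not conclusion:
        print(f"A={a}, B={b}: conclusion false; all premises true? {all(premises)}")
```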

Categories: Error Statistics | 22 Comments

Mayo-Spanos Summer Seminar PhilStat: July 28-Aug 11, 2019: Instructions for Applying Now Available

INSTRUCTIONS FOR APPLYING ARE NOW AVAILABLE

See the Blog at SummerSeminarPhilStat

Categories: Announcement, Error Statistics, Statistics | Leave a comment
