S. Senn: Testing Times (Guest post)



Stephen Senn
Consultant Statistician
Edinburgh, Scotland

Testing Times

Screening for attention

There has been much comment on Twitter and other social media about testing for coronavirus and the relationship between a test being positive and the person tested having been infected. Some primitive form of Bayesian reasoning is often used to justify concern that an apparent positive may actually be falsely so, with specificity and sensitivity taking the roles of likelihoods and prevalence that of a prior distribution. This way of looking at testing dates back at least to a 1959 paper by Ledley and Lusted[1]. However, as others[2, 3] have pointed out, there is a trap for the unwary here, in that it is implicitly assumed that specificity and sensitivity are constant values unaffected by prevalence, and it is far from obvious that this should be the case.

In the age of COVID-19 this is a highly suitable subject for a blog. However, I am a highly unsuitable person to blog about it, since what I know about screening programmes could be written on the back of a postage stamp and the matter is very delicate. So, I shall duck the challenge but instead write about something that bears more than a superficial similarity to it, namely, testing for model adequacy prior to carrying out a test of a hypothesis of primary interest. It is an issue that arises in particular in cross-over trials, where there may be concerns that carry-over has taken place. Here, I may or may not be an expert, that is for others to judge, but I can’t claim that I think I am unqualified to write about it, since I once wrote a book on the subject[4, 5]. So this blog will be about testing model assumptions taking the particular example of cross-over trials.

Presuming to assume

The simplest of all cross-over trials, the so-called AB/BA cross-over, is one in which two treatments A and B are compared by randomising patients to one of two sequences: either A followed by B (labelled AB) or B followed by A (labelled BA). Each patient is thus studied in two periods, receiving one of the two treatments in each. There may be a so-called wash-out period between them but, whether or not a wash-out is employed, the assumption will be made that by the time the effect of a treatment comes to be measured in period two, the effect of any treatment given in period one has disappeared. If such a residual effect, referred to as a carry-over, existed, it would bias the treatment effect since, for example, the result in period two of the AB sequence would not only reflect the effect of giving B but also the previous effect of having given A.

Everyone is agreed that if the effect of carry-over can be assumed negligible, an efficient estimate of the difference between the effect of B and A can be made by allowing each patient to act as his or her own control. One way of doing this is to calculate a difference for each patient of the period two values minus the period one values. I shall refer to these as the period differences. If the effect of treatment B is the same as that of treatment A, then these period differences will not be expected to differ systematically from one sequence to another. However, if (say) the effect of B was greater than that of A (and higher values were better), then in the AB sequence a positive difference would be added to the period differences and in the BA sequence that difference would be subtracted from the period differences. We should thus expect the means of the period differences for the two sequences to differ. So one way of testing the null hypothesis of no treatment effect is to carry out a two-sample t-test comparing the period differences between one sequence and another. Equivalent from the point of view of testing, but more convenient from the point of view of estimation, is to work with the semi-period differences, that is to say the period differences divided by two. I shall refer to the associated estimate as CROS and the t-statistic for this test as CROSt, since they are what the cross-over trial was designed to produce.

Unfortunately, however, these period differences could also reflect carry-over and, if it occurs, this would bias the estimate of the treatment effect. The usual effect will be to bias it towards the null and so there may be a loss of power. Is there a remedy? One possibility is to discard the second period values. After all, the first period values cannot be contaminated by carry-over. On the other hand, single-period values are what we have in any parallel group trial. So all we need to do is regard the first period values as coming from a parallel group trial. Patients in the AB sequence yield values under A and patients in the BA sequence yield values under B, so a comparison of the first period values, again using a two-sample t-test, is a test of the null hypothesis of no difference between treatments. I shall refer to this estimate as the PAR statistic and the corresponding t-statistic as PARt.

Note that the PAR statistic is expected to be a second best to the CROS statistic. Not only are half the values discarded but, since we can no longer use each patient as his or her own control, the relevant variance is a sum of between- and within-patient variances, unlike for CROS, which only reflects within-patient variation. Nevertheless, PAR may be expected to be unbiased in circumstances where CROS will not be and, since all applied statistics is a bias-variance trade-off, it seems conceivable that there are circumstances under which PAR would be preferable.

Carry-over carry on

However, there is a problem. How shall we know that carry-over has occurred? It turns out that there is a t-test for this too. First, we construct for each patient the mean over the two periods. In each sequence such means must reflect the effect of both treatments, since each patient receives each treatment, and they must also reflect the effect of both periods, since each patient will be treated in each period. However, in the AB sequence the total (and hence the mean) will also reflect the effect of any carry-over from A in the second period, whereas in the BA sequence the total (and hence the mean) will reflect the carry-over of B. Thus, if the two carry-overs differ, which is what matters, these sequence means will differ and therefore the t-test comparing these totals (equivalently, the means) between the two sequences is a valid test of zero differential carry-over. I shall refer to the estimate as SEQ and the corresponding t-statistic as SEQt.
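To make the three statistics concrete, their constructions can be sketched in Python with scipy. The data below are simulated placeholders, and the variable names simply follow the text's CROS, PAR and SEQ labels rather than any standard package:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: rows are patients, columns are periods 1 and 2.
ab = rng.normal(size=(12, 2))  # sequence AB: A in period 1, B in period 2
ba = rng.normal(size=(12, 2))  # sequence BA: B in period 1, A in period 2

# CROSt: two-sample t-test on the period differences (period 2 minus period 1)
cros_t, _ = stats.ttest_ind(ab[:, 1] - ab[:, 0], ba[:, 1] - ba[:, 0])

# PARt: two-sample t-test on the first-period values only
par_t, _ = stats.ttest_ind(ab[:, 0], ba[:, 0])

# SEQt: two-sample t-test on the per-patient totals (the carry-over test)
seq_t, _ = stats.ttest_ind(ab.sum(axis=1), ba.sum(axis=1))
```

The same two-sample t-test applied to the period differences, the first-period values and the per-patient totals yields CROSt, PARt and SEQt respectively.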

The rule of three

These three tests were then formally incorporated in a testing strategy known as the two-stage procedure[6] as follows. First a test for carry-over was performed using SEQt. Since the test was a between-patient test and therefore of low power, a nominal type I error rate of 10% was generally used. If SEQt was not significant, the statistician proceeded to use CROSt to test the principal hypothesis of interest, namely that of the equality of the two treatments. If, however, SEQt was significant, which might be taken as an indication of carry-over, the fallback test PARt was used instead to test equality of the treatments.

The procedure is illustrated in Figure 1.


Figure 1. The two-stage procedure.
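In code, the decision rule of Figure 1 might be sketched as follows. This is a hypothetical implementation for illustration only, with my own data layout and function name; as the rest of the post argues, the procedure itself should not be used:

```python
import numpy as np
from scipy import stats

def two_stage(ab, ba, alpha=0.05, carry_alpha=0.10):
    """Two-stage procedure for an AB/BA cross-over.

    ab, ba: arrays of shape (patients, 2) holding period 1 and period 2
    values for the AB and BA sequences respectively.
    """
    # Stage 1: SEQ test for carry-over on the per-patient totals, at 10%.
    _, p_seq = stats.ttest_ind(ab.sum(axis=1), ba.sum(axis=1))
    if p_seq >= carry_alpha:
        # Carry-over 'not found': use the within-patient CROS test.
        _, p = stats.ttest_ind(ab[:, 1] - ab[:, 0], ba[:, 1] - ba[:, 0])
        chosen = "CROS"
    else:
        # Carry-over 'found': fall back on the first-period PAR test.
        _, p = stats.ttest_ind(ab[:, 0], ba[:, 0])
        chosen = "PAR"
    return chosen, p, p < alpha
```

This mirrors the SAS macro described below: point it at the data and it chooses CROS or PAR for you and returns a P-value, which is exactly what makes its operating characteristics those of the whole procedure rather than of either test alone.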

Chimeric catastrophe

Of course, to the extent that the three tests are used as prescribed, they can be combined in a single algorithm. In fact in the pharmaceutical company that I joined in 1987, the programming group had written a SAS® macro to do exactly that. You just needed to point the macro at your data and it would calculate SEQ, come to a conclusion, choose either CROS or PAR as appropriate and give you your P-value.

I hated the procedure as soon as I saw it and never used it. I argued that it was an abuse of testing to assume that just because SEQ was not significant no carry-over had occurred. One had to rely on other arguments to justify ignoring carry-over. It was only on hearing a colleague lecture on an example where the test for carry-over had proved significant and, much to his surprise given its low power, the first period test had also proved significant, that I suddenly realised that SEQ and PAR were highly correlated and therefore this was only to be expected. In consequence, the procedure would not maintain the Type I error rate. Only a few days later a manuscript from Statistics in Medicine arrived on my desk for review. The paper by Peter Freeman[7] overturned everything everyone believed on testing for carry-over. In bolting these tests together, statisticians had created a chimeric monster. Far from helping to solve the problem of carry-over, the two-stage procedure had made it worse.

‘How can screening for something be a problem?’, an applied statistician might ask, but in asking that they would be completely forgetting the advice they would give a physician who wanted to know the same thing. The process as a whole of screening plus remedial action needed to be studied, and statisticians had failed to do so. Peter Freeman completely changed that. He did what statisticians should have done and looked at how the procedure as a whole behaved. In the years since, I have simply asked statisticians who wish to give an opinion on cross-over trials what they think of Freeman’s paper[7]. It has become a litmus paper for me. Their answer tells me everything I need to know.

Correlation is not causation but it can cause trouble

So what is the problem? The problem is illustrated by Figure 2. This shows a simulation from a null case. There is no difference between the treatments and no carry-over. The correlation between periods one and two has been set to 0.7. One thousand trials in 24 patients (12 for each sequence) have been simulated. The figure plots CROSt (blue circles) and PARt (red diamonds) on the Y axis against SEQt on the X axis. The vertical lines show the critical boundaries for SEQt at the 10% level and the horizontal lines show the critical boundaries for CROSt and PARt at the 5% level. Filled circles or diamonds indicate significant results of CROSt and PARt and open circles or diamonds indicate non-significant values.

It is immediately noticeable that CROSt and SEQt are uncorrelated. This is hardly surprising, since given equal variances CROS and SEQ are orthogonal by construction. On the other hand, PARt and SEQt are very strongly correlated. This ought not to be surprising. PAR uses the first period means. SEQ also uses the first period means, with the same sign. Even if the second period means were uncorrelated with the first, the two statistics would be correlated, since the same information is used. However, in practice the second period means will be correlated and thus a strong correlation can result. In this example the empirical correlation is 0.92.

The consequence is that if SEQt is significant, PARt is likely to be so too. This can be seen from the scatter plot, where there are far more filled diamonds in the regions to the left of the lower critical value or to the right of the upper critical value for SEQt than in the region in between. In this simulation of 1000 trials, 99 values of SEQt are significant at the 10% level, while 50 values of CROSt and 53 values of PARt are significant at the 5% level. These figures are close to the expected values. However, 91 values are significant using the two-stage procedure. Of course, this is just a simulation. However, for the theory[8] see http://www.senns.demon.co.uk/ROEL.pdf.

Figure 2. Scatterplot of t-statistics (CROSt) for the within-patient test (blue circles) or the t-statistics (PARt) for the between-patient test (red diamonds) against the value of the t-statistic (SEQt) for the carry-over test. Filled values are ‘significant’ at the 5% level.

In fact, this inflation really underestimates the problem with the two-stage procedure. Either the extra complication is irrelevant (we end up using CROSt) or the conditional type-I error rate is massively inflated. In this example, of the 99 cases where SEQt is significant, 48 of the values of PARt are significant. A nominal 5% significance rate has become nearly a 50% conditional one!

What are the lessons?

The first lesson is that, despite what your medical statistics textbook might tell you, you should never use the two-stage procedure. It is completely unacceptable.

Should you test for carry-over at all? That’s a bit more tricky. In principle more evidence is always better than less. The practical problem is that there is no advice that I can offer you as to what to do next on ‘finding’ carry-over except to drop the nominal target significance level. (See The AB/BA cross-over: how to perform the two-stage analysis if you can’t be persuaded that you shouldn’t[8], but note the warning in the title.)

Should you avoid using cross-over trials? No. They can be very useful on occasion. Their use needs to be grounded in biology and pharmacology. Statistical manipulation is not the cure for carry-over.

Are there more general lessons? Probably. The two-stage analysis is the worst case I know of, but there may be others where testing assumptions is dangerous. Remember, a decision to behave as if something is true is not the same as knowing it is true. Also, beware of recognisable subsets. There are deep waters here.


  1. Ledley, R.S. and L.B. Lusted, Reasoning foundations of medical diagnosis; symbolic logic, probability, and value theory aid our understanding of how physicians reason. Science, 1959. 130(3366): p. 9-21.
  2. Dawid, A.P., Properties of Diagnostic Data Distributions. Biometrics, 1976. 32: p. 647-658.
  3. Guggenmoos-Holzmann, I. and H.C. van Houwelingen, The (in)validity of sensitivity and specificity. Statistics in Medicine, 2000. 19(13): p. 1783-92.
  4. Senn, S.J., Cross-over Trials in Clinical Research. First ed. Statistics in Practice, ed. V. Barnett. 1993, Chichester: John Wiley. 257.
  5. Senn, S.J., Cross-over Trials in Clinical Research. Second ed. 2002, Chichester: Wiley.
  6. Hills, M. and P. Armitage, The two-period cross-over clinical trial. British Journal of Clinical Pharmacology, 1979. 8: p. 7-20.
  7. Freeman, P., The performance of the two-stage analysis of two-treatment, two-period cross-over trials. Statistics in Medicine, 1989. 8: p. 1421-1432.
  8. Senn, S.J., The AB/BA cross-over: how to perform the two-stage analysis if you can’t be persuaded that you shouldn’t., in Liber Amicorum Roel van Strik, B. Hansen and M. de Ridder, Editors. 1996, Erasmus University: Rotterdam. p. 93-100.


Categories: S. Senn, significance tests, Testing Assumptions | 1 Comment

Souvenir From the NISS Stat Debate for Users of Bayes Factors (& P-Values)


What would I say is the most important takeaway from last week’s NISS “statistics debate” if you’re using (or contemplating using) Bayes factors (BFs)–of the sort Jim Berger recommends–as replacements for P-values? It is that J. Berger only regards the BFs as appropriate when there’s grounds for a high concentration (or spike) of probability on a sharp null hypothesis, e.g., H0: θ = θ0.

Thus, it is crucial to distinguish between precise hypotheses that are just stated for convenience and have no special prior believability, and precise hypotheses which do correspond to a concentration of prior belief. (J. Berger and Delampady 1987, p. 330).

Now, to be clear, I do not think that P-values need to be misinterpreted (Bayesianly) to use them evidentially, and think it’s a mistake to try to convert them into comparative measures of belief or support. However, it’s important to realize that even if you do think such a conversion is required, and are contemplating replacing them with the kind of BF Jim Berger advances, then it would be wrong to do so if there were no grounds for a high prior belief on a point null. Jim said in the debate that people want a Bayes factor, so we give it to them. But when you’re asking for it, especially if it’s described as a “default” method, you might assume it is capturing a reasonably common standpoint—not one that only arises in an idiosyncratic case. To allege that there’s really much less evidence against the sharp null than is suggested by a P-value, as does the BF advocate, is to hide the fact that most of this “evidence” is due to the spiked concentration of prior belief being given to the sharp null hypothesis. This is an a priori bias in favor of the sharp null, not evidence in the data. (There is also the matter of how the remainder of the prior is smeared over the parameter values in the alternative.) Jim Berger, somewhat to my surprise (at the debate), reaffirmed that that is the context for the intended use of his recommended Bayes factor with the spiked prior. Yet these BFs are being touted as a tool to replace P-values for everyday use.
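The arithmetic behind this point can be illustrated with a stylized normal-model calculation (known σ = 1, a 0.5 spike of prior mass on H0, and a N(0, 1) prior under the alternative). This is an assumption-laden sketch in the spirit of the spiked-prior setup discussed here, not anyone's exact computation:

```python
from math import exp, pi, sqrt

def norm_pdf(x, sd):
    """Density of a N(0, sd^2) variable at x."""
    return exp(-0.5 * (x / sd) ** 2) / (sd * sqrt(2 * pi))

def spike_posterior(z, n, pi0=0.5, tau=1.0):
    """P(H0 | data) for testing H0: theta = 0 from a sample mean with
    standard error 1/sqrt(n), prior mass pi0 on the null and a
    N(0, tau^2) prior on theta under the alternative."""
    xbar = z / sqrt(n)
    m0 = norm_pdf(xbar, 1 / sqrt(n))              # marginal density under H0
    m1 = norm_pdf(xbar, sqrt(tau ** 2 + 1 / n))   # marginal density under H1
    bf01 = m0 / m1                                # Bayes factor in favor of H0
    return pi0 * bf01 / (pi0 * bf01 + 1 - pi0)

print(spike_posterior(1.96, 100))
```

With z = 1.96 and n = 100, the two-sided P-value is about 0.05, yet the posterior probability of the sharp null stays around 0.6, and nearly all of that comes from the 0.5 lump of prior placed on it, together with how the remaining prior is spread over the alternative.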

Jim’s Sharp Null BFs were developed For a Very Special Case. Harold Jeffreys developed the spiked priors for a Bayesian special problem: how to give high posterior probabilities to well corroborated theories. This is quite different from the typical use of statistical significance tests to detect indications of an observed effect that is not readily due to noise. (Of course isolated small P-values do not suffice to infer a genuine experimental phenomenon, as R.A. Fisher emphasized.)

Precise hypotheses . . . ideally relate to, say, some precise theory being tested. Of primary interest is whether the theory is right or wrong; the amount by which it is wrong may be of interest in developing alternative theories, but the initial question of interest is that modeled by the precise hypothesis test (J. Berger and Sellke 1987, p. 136).

Casella and Roger Berger (1987b) respond to Jim Berger and Sellke and to Jim Berger and Delampady–all in 1987. “We would be surprised if most researchers would place even a 10% prior probability of H0. We hope that the casual reader of Berger and Delampady realizes that the big discrepancies between P-values and P(H0|x) . . . are due to a large extent to the large value of [the prior of 0.5 to H0] that was used.” They make the astute point that the most common uses of a point null, asserting the difference between means is 0, or that a regression coefficient is 0, merely describe a potentially interesting feature of the population, with no special prior believability. As they put it, “J. Berger and Delampady admit…, P-values are reasonable measures of evidence when there is no a priori concentration of belief about H0” (ibid., p. 345). Thus, they conclude, “the very argument that Berger and Delampady use to dismiss P-values can be turned around to argue for P-values” (ibid., p. 346).

As I said in response to debate question 3, “the move to redefine statistical significance, advanced by a megateam in 2017, including Jim, all rest upon the lump high prior probability on the null as well as the appropriateness of evaluating P-values using Bayes factors. The redefiners are prepared to say there’s no evidence against or even evidence for a null hypothesis, even though that point null is entirely excluded from the corresponding 95% confidence interval. This would often erroneously fail to uncover discrepancies [from the point null]”.

Conduct an Error Statistical Critique. Thus a question you should ask in contemplating the application of the default BF is this: What’s the probability the default BF would find no evidence against the null or even evidence for it for an alternative or discrepancy of interest to you? If the probability is fairly high then you’d not want to apply it.

Notice what we’re doing in asking this question: we’re applying the frequentist error statistical analysis to the Bayes factor. What’s sauce for the goose is sauce for the gander.[ii] This is what the error statistician needs to do whenever she’s told an alternative measure ought to be adopted as a substitute for an error statistical one: check its error statistical properties.

Is the Spiked Prior Appropriate to the Problem, Even With a Well-corroborated Value? Even in those highly special cases where a well-corroborated substantive theory gives a high warrant for a particular value of a parameter, it’s far from clear that a spiked prior reflects how scientists examine the question: is the observed anomaly (with the theory) merely background noise or some systematic effect? Remember when neutrinos appeared to travel faster than light—an anomaly for special relativity—in an OPERA experiment in 2011?

This would be a case where Berger would place a high concentration of prior probability on the point null, the speed of light c given by special relativity. The anomalous results, at most, would lower the posterior belief. But I don’t think scientists were interested in reporting that the posterior probability for the special relativity value had gone down a bit, due to their anomalous result, but was still quite high. Rather, they wanted to know whether the anomaly was mere noise or genuine, and finding it was genuine, they wanted to pinpoint blame for the anomaly. It turns out a fiber optic cable wasn’t fully screwed in and one of the clocks was ticking too fast. Merely discounting the anomaly (or worse, interpreting it as evidence strengthening their belief in the precise null) because of strong belief in special relativity would sidestep the most interesting work: gleaning important information about how well or poorly run the experiment was.[iii]

It is interesting to compare the position of the spiked prior with an equally common Bayesian position that all null hypotheses are false. The disagreement may stem from viewing H0 as asserting the correctness of a scientific theory (the spiked prior view) as opposed to asserting a parameter in a model, representing a portion of that theory, is correct (the all nulls are false view).

Search Under “Overstates” On This Blog for More (and “P-values” for much more). The reason to focus attention on the disagreement between the P-value and the Bayes factor with a sharp null is that it explains an important battle in the statistics wars, and thus points the way to (hopefully) getting beyond it. The very understanding of the use and interpretation of error probabilities differs in the rival approaches.

As I was writing my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP), I was often distracted by high pitched discussions in 2015-17 about P-values “overstating” evidence on account of being smaller than a posterior probability on a sharp null. Thus, I wound up writing several posts, the ingredients of which made their way into the book, notably, Section 4.4. Here’s one. I eventually coined it as a fallacy, “P-values overstate the evidence fallacy”. For many excerpts from the book, including the rest of the “Tour” where this issue arises, see this blogpost.

Stephen Senn wrote excellent guest posts on P-values for this blog that are especially clarifying, such as this one. He observes that Jeffreys, having already placed the spiked prior on the point null, required only that the posterior on the alternative exceeded .5 in order to find evidence against the null, not that it be a large number such as .95.

A parsimony principle is used on the prior distribution. You can’t use it again on the posterior distribution. Once that is calculated, you should simply prefer the more probable model. The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives. (S. Senn)


[i] I mentioned two of the simplest inferential arguments using P-values during the debate: one for blocking an inference, a second for inferring incompatibility with (or discrepancy from) a null hypothesis, set as a reference: “If even larger or more extreme effects than you observed are frequently brought about by chance variability alone (P-value is not small), clearly you don’t have evidence of incompatibility with the “mere chance” hypothesis.”

“…A small P-value indicates discrepancy from a null value because, with high probability 1 – p, the test would have produced a larger P-value (less impressive difference) in a world adequately described by H0. Since the null hypothesis would very probably have survived if correct, when it doesn’t survive, it indicates inconsistency with it.” For a more detailed discussion see SIST, e.g., Souvenir C (SIST, p. 52) https://errorstatistics.files.wordpress.com/2019/04/sist_ex1-tourii.pdf.

[ii] From SIST* (p. 247): “The danger in critiquing statistical method X from the standpoint of the goals and measures of a distinct school Y, is that of falling into begging the question. If the P-value is exaggerating evidence against a null, meaning it seems too small from the perspective of school Y, then Y’s numbers are too big, or just irrelevant, from the perspective of school X. Whatever you say about me bounces off and sticks to you.”

[iii] Conversely, the sharp null in discovering the Higgs Boson  was disbelieved even before they built the expensive particle colliders (physicists knew there had to be a Higgs particle of some sort). You can find a number of posts on the Higgs on this blog (also in Mayo 2018, Excursion 3 Tour III).

*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). The book follows an “itinerary” of a stat cruise with lots of museum stops and souvenirs.


Categories: bayes factors, Berger, P-values, S. Senn | 4 Comments

My Responses (at the P-value debate)


How did I respond to those 7 burning questions at last week’s (“P-Value”) Statistics Debate? Here’s a fairly close transcript of my (a) general answer, and (b) final remark, for each question–without the in-between responses to Jim and David. The exception is question 5 on Bayes factors, which naturally included Jim in my general answer. 

The questions with the most important consequences, I think, are questions 3 and 5. I’ll explain why I say this in the comments. Please share your thoughts. Continue reading

Categories: bayes factors, P-values, Statistics, statistics debate NISS | 1 Comment

The P-Values Debate



National Institute of Statistical Sciences (NISS): The Statistics Debate (Video)

Categories: J. Berger, P-values, statistics debate | 12 Comments

The Statistics Debate! (NISS DEBATE, October 15, Noon – 2 pm ET)

October 15, Noon – 2 pm ET (Website)

Where do YOU stand?

Given the issues surrounding the misuses and abuse of p-values, do you think p-values should be used? Continue reading

Categories: Announcement, J. Berger, P-values, Philosophy of Statistics, reproducibility, statistical significance tests, Statistics | Tags: | 9 Comments

CALL FOR PAPERS (Synthese) Recent Issues in Philosophy of Statistics: Evidence, Testing, and Applications


Call for Papers: Topical Collection in Synthese

Title: Recent Issues in Philosophy of Statistics: Evidence, Testing, and Applications

The deadline for submissions is 1 December 2020 (extended from 1 November 2020)

Description: Continue reading

Categories: Announcement, CFP, Synthese | Leave a comment

G.A. Barnard’s 105th Birthday: The Bayesian “catch-all” factor: probability vs likelihood


G. A. Barnard: 23 Sept 1915-30 July, 2002

Yesterday was statistician George Barnard’s 105th birthday. To acknowledge it, I reblog an exchange between Barnard, Savage (and others) on likelihood vs probability. The exchange is from pp 79-84 (of what I call) “The Savage Forum” (Savage, 1962).[i] A portion appears on p. 420 of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). Six other posts on Barnard are linked below, including 2 guest posts, (Senn, Spanos); a play (pertaining to our first meeting), and a letter Barnard wrote to me in 1999.  Continue reading

Categories: Barnard, phil/history of stat, Statistics | 10 Comments

Live Exhibit: Bayes Factors & Those 6 ASA P-value Principles


Live Exhibit: So what happens if you replace “p-values” with “Bayes Factors” in the 6 principles from the 2016 American Statistical Association (ASA) Statement on P-values? (Remove “or statistical significance” in question 5.)

Does the one positive assertion hold? Are the 5 “don’ts” true? Continue reading

Categories: ASA Guide to P-values, bayes factors | 2 Comments

September 24: Bayes factors from all sides: who’s worried, who’s not, and why (R. Morey)

Information and directions for joining our forum are here.

Continue reading

Categories: Announcement, bayes factors, Error Statistics, Phil Stat Forum, Richard Morey | 1 Comment

All She Wrote (so far): Error Statistics Philosophy: 9 years on

Dear Reader: I began this blog 9 years ago (Sept. 3, 2011)! A double celebration is taking place at the Elbar Room tonight (a smaller one was held earlier in the week), both for the blog and the 2 year anniversary of the physical appearance of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars [SIST] (CUP, 2018). A special rush edition made an appearance on Sept 3, 2018 in time for the RSS meeting in Cardiff. If you’re in the neighborhood, stop by for some Elba Grease.


Many of the discussions in the book were importantly influenced (corrected and improved) by readers’ comments on the blog over the years. I posted several excerpts and mementos from SIST here. I thank readers for their input. Readers should look up the topics in SIST on this blog to check out the comments, and see how ideas were developed, corrected and turned into “excursions” in SIST. Continue reading

Categories: blog contents, Metablog | Leave a comment

5 September, 2018 (w/updates) RSS 2018 – Significance Tests: Rethinking the Controversy


Day 2, Wed 5th September, 2018:

The 2018 Meeting of the Royal Statistical Society (Cardiff)

11:20 – 13:20

Keynote 4 – Significance Tests: Rethinking the Controversy Assembly Room

Sir David Cox, Nuffield College, Oxford
Deborah Mayo, Virginia Tech
Richard Morey, Cardiff University
Aris Spanos, Virginia Tech

Intermingled in today’s statistical controversies are some long-standing, but unresolved, disagreements on the nature and principles of statistical methods and the roles for probability in statistical inference and modelling. In reaction to the so-called “replication crisis” in the sciences, some reformers suggest significance tests as a major culprit. To understand the ramifications of the proposed reforms, there is a pressing need for a deeper understanding of the source of the problems in the sciences and a balanced critique of the alternative methods being proposed to supplant significance tests. In this session speakers offer perspectives on significance tests from statistical science, econometrics, experimental psychology and philosophy of science. There will also be a panel discussion.

5 Sept. 2018 (taken by A.Spanos)

Continue reading

Categories: Error Statistics | Tags: | Leave a comment

The Physical Reality of My New Book! Here at the RSS Meeting (2 years ago)


You can find several excerpts and mementos from the book, including whole “tours” (in proofs) updated June 2020 here.

Categories: SIST | Leave a comment

Statistical Crises and Their Casualties–what are they?

What do I mean by “The Statistics Wars and Their Casualties”? It is the title of the workshop I have been organizing with Roman Frigg at the London School of Economics (CPNSS) [1], which was to have happened in June. It is now the title of a forum I am zooming on Phil Stat that I hope you will want to follow. It’s time that I explain and explore some of the key facets I have in mind with this title. Continue reading

Categories: Error Statistics | 4 Comments

New Forum on The Statistics Wars & Their Casualties: August 20, Preregistration (D. Lakens)

I will now hold a monthly remote forum on Phil Stat: The Statistics Wars and Their Casualties–the title of the workshop I had scheduled to hold at the London School of Economics (Centre for Philosophy of Natural and Social Science: CPNSS) on 19-20 June 2020. (See the announcement at the bottom of this blog). I held the graduate seminar in Philosophy (PH500) that was to precede the workshop remotely (from May 21-June 25), and this new forum will be both an extension of that and a linkage to the planned workshop. The issues are too pressing to put off for a future in-person workshop, which I still hope to hold. It will begin with presentations by workshop participants, with lots of discussion. If you want to be part of this monthly forum and engage with us, please go to the information and directions page. The links are now fixed, sorry. (It also includes readings for Aug 20.)  If you are already on our list, you’ll automatically be notified of new meetings. (If you have questions, email me.) Continue reading


August 6: JSM 2020 Panel on P-values & “Statistical Significance”


July 30 PRACTICE VIDEO for JSM talk (All materials for Practice JSM session here)

JSM 2020 Panel Flyer (PDF)
JSM online program (w/panel abstract & information):


JSM 2020 Panel on P-values & “Statistical Significance”

All: On July 30 (10am EST) I will give a virtual version of my JSM presentation remotely, like the one I will actually give on Aug 6 at the JSM. Co-panelist Stan Young may as well. One of our surprise guests tomorrow (not at the JSM) will be Yoav Benjamini!  If you’re interested in attending our July 30 practice session* please follow the directions here. Background items for this session are in the “readings” and “memos” of session 5.

*unless you’re already on our LSE Phil500 list

JSM 2020 Panel Flyer (PDF)
JSM online program (w/panel abstract & information):


Stephen Senn: Losing Control (guest post)


Stephen Senn
Consultant Statistician

Losing Control

Match points

The idea of local control is fundamental to the design and analysis of experiments and contributes greatly to a design’s efficiency. In clinical trials such control is often accompanied by randomisation, and the way that the randomisation is carried out has a close relationship to how the analysis should proceed. For example, if a parallel group trial is carried out in different centres, but randomisation is ‘blocked’ by centre then, logically, centre should be in the model (Senn, S. J. & Lewis, R. J., 2019). On the other hand, if all the patients in a given centre are allocated the same treatment at random, as in a so-called cluster randomised trial, then the fundamental unit of inference becomes the centre and patients are regarded as repeated measures on it. In other words, the way in which the allocation has been carried out affects the degree of matching that has been achieved and this, in turn, is related to the analysis that should be employed. A previous blog of mine, To Infinity and Beyond, discusses the point.
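Senn’s contrast between blocking by centre and randomising whole centres can be made concrete with a small simulation. The sketch below is mine, not from the post; the centre sizes, block size, and function names are all hypothetical illustrations of the two allocation schemes.

```python
# Hypothetical sketch contrasting two randomisation schemes from the excerpt:
# (1) randomisation blocked by centre, which balances treatments within each
#     centre (so centre should appear in the analysis model), and
# (2) cluster randomisation, where every patient in a centre gets the same
#     treatment and the centre becomes the unit of inference.
import random

def blocked_by_centre(centres, block_size=4):
    """Randomise patients within each centre in balanced blocks of A and B."""
    alloc = {}
    for centre, n in centres.items():
        assignments = []
        for _ in range(0, n, block_size):
            block = ["A", "B"] * (block_size // 2)
            random.shuffle(block)          # permute within the block
            assignments.extend(block)
        alloc[centre] = assignments[:n]
    return alloc

def cluster_randomised(centres):
    """Randomise whole centres: one coin flip per centre, shared by all
    its patients."""
    return {centre: [random.choice(["A", "B"])] * n
            for centre, n in centres.items()}

if __name__ == "__main__":
    centres = {"C1": 4, "C2": 8, "C3": 4}   # patients per centre (made up)
    blocked = blocked_by_centre(centres)
    cluster = cluster_randomised(centres)
    # Within-centre balance holds under blocking but not under clustering:
    assert all(a.count("A") == a.count("B") for a in blocked.values())
    assert all(len(set(a)) == 1 for a in cluster.values())
```

The two assertions at the end capture the point of the excerpt: blocking guarantees an equal split of A and B inside every centre, while cluster randomisation makes each centre homogeneous, so the effective sample size is the number of centres, not the number of patients.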


JSM 2020: P-values & “Statistical Significance”, August 6

Link: https://ww2.amstat.org/meetings/jsm/2020/onlineprogram/ActivityDetails.cfm?SessionID=219596

To register for JSM: https://ww2.amstat.org/meetings/jsm/2020/registration.cfm


Colleges & Covid-19: Time to Start Pool Testing


I. “Colleges Face Rising Revolt by Professors,” proclaims an article in today’s New York Times, in relation to returning to in-person teaching:

Thousands of instructors at American colleges and universities have told administrators in recent days that they are unwilling to resume in-person classes because of the pandemic. More than three-quarters of colleges and universities have decided students can return to campus this fall. But they face a growing faculty revolt.

David Hand: Trustworthiness of Statistical Analysis (LSE PH 500 presentation)

This was David Hand’s guest presentation (25 June) at our zoomed graduate research seminar (LSE PH500) on Current Controversies in Phil Stat (~30 min.)  I’ll make some remarks in the comments, and invite yours.


Trustworthiness of Statistical Analysis

David Hand

Abstract: Trust in statistical conclusions derives from the trustworthiness of the data and analysis methods. Trustworthiness of the analysis methods can be compromised by misunderstanding and incorrect application. However, that should stimulate a call for education and regulation, to ensure that methods are used correctly. The alternative of banning potentially useful methods, on the grounds that they are often misunderstood and misused, is short-sighted, unscientific, and Procrustean. It damages the capability of science to advance, and feeds into public mistrust of the discipline.

Below are Prof. Hand’s slides w/o audio, followed by a video w/audio. You can also view them on the Meeting #6 post on the PhilStatWars blog (https://phil-stat-wars.com/2020/06/21/meeting-6-june-25/).

