# A. Saltelli (Guest post): What can we learn from the debate on statistical significance?

Professor Andrea Saltelli
Centre for the Study of the Sciences and the Humanities (SVT), University of Bergen (UIB, Norway),
&
Open Evidence Research, Universitat Oberta de Catalunya (UOC), Barcelona

The statistical community is in the midst of a crisis whose latest convulsion is a petition to abolish the concept of significance. The problem is perhaps neither with significance, nor with statistics, but with the inconsiderate way we use numbers, and with our present approach to quantification. Unless the crisis is resolved, there will be a loss of consensus in scientific arguments, with a corresponding decline of public trust in the findings of science.

# The sins of quantification

Every quantification which is unclear as to its scope and the context in which it is produced obscures rather than elucidates.

Traditionally, the strength of numbers in the making of an argument has rested on their purported objectivity and neutrality. Expressions such as “Concrete numbers”, “The numbers speak for themselves”, and “The data/the model don’t lie” are common currency. Today, doubts about algorithmic instances of quantification, e.g. in promoting, detaining, or conceding freedom or credit, are becoming more urgent and visible. Yet the doubt should be general. It is increasingly recognised that in every activity of quantification the technique or the methods are never neutral, because it is never possible to separate entirely the act of quantifying from the wishes and expectations of the quantifier. Thus, books apparently telling separate stories, such as Rigor Mortis, Weapons of Math Destruction, The Tyranny of Metrics, or Useless Arithmetic, dealing with statistics, algorithms, indicators and models, share a common concern.

# Statisticians know

Statisticians are increasingly aware that each number presupposes an underlying narrative, a worldview, and a purpose of the exercise. The maturity of this debate in the house of statistics is not an accident. Statistics is a discipline, with recognized leaders and institutions, and although one might derive an impression of disorder from the use of a petition to influence a scientific argument, one cannot deny that the problems in statistics are being tackled head on, in the public arena, in spite of the obvious difficulty for the lay public to follow the technicality of the arguments. With its ongoing discussion of significance, the community of statistics is teaching us an important lesson about the tight coupling between technique and values. How so? We recap here some elements of the debate.

• For some, it would be better to throw away the concept of significance altogether, because the p-test, with its magical p<0.05 threshold, is being misused as a measure of veracity and publishability.
• Others object that the discussion should not take place through the instrument of a petition, and that withdrawing tests of significance would make science even more uncertain.
• The former retort that since this discussion has been going on for decades in academic journals without the existing flaws being fixed, perhaps the time is ripe for action.

A good vantage point to look at this debate in its entirety is this section in Andrew Gelman’s blog.

# Different worlds

An important aspect of this discussion is that the contenders may inhabit different worlds. One world is full of important effects which are overlooked because the test of significance fails (p value greater than 0.05 in statistical parlance). The other world is instead replete with bogus results passed on to the academic literature thanks to a low value of the p-test (p<0.05).

A modicum of investigation reveals that the contention is normative, or indeed political. To take an example, some may fear the introduction on the market of ineffectual pharmaceutical products, others that important epidemiological effects of a pollutant on health may be overlooked. The first group would thus favour a more restrictive threshold for the test, the second group a less restrictive one.
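The trade-off between the two worlds can be made concrete with a small sketch. It assumes a one-sided z-test with known unit variance, and the effect size (0.3) and sample size (50) are illustrative values chosen here, not taken from any study. The function name `error_rates` is likewise hypothetical.

```python
from statistics import NormalDist


def error_rates(alpha, effect_size, n):
    """Return (false positive rate, false negative rate) for a one-sided
    z-test of H0: mean = 0, assuming known unit variance (an assumption
    made only for this illustration)."""
    std_normal = NormalDist()
    # Critical value: reject H0 when the z statistic exceeds it.
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Under the alternative, the z statistic is centred at effect_size * sqrt(n).
    shift = effect_size * n ** 0.5
    power = 1 - std_normal.cdf(z_crit - shift)
    return alpha, 1 - power


# A stricter threshold admits fewer bogus results but overlooks more real effects.
for alpha in (0.01, 0.05, 0.10):
    false_pos, false_neg = error_rates(alpha, effect_size=0.3, n=50)
    print(f"alpha={alpha:.2f}  false positives={false_pos:.2f}  "
          f"false negatives={false_neg:.2f}")
```

Which row of the output is preferable cannot be read off the numbers alone; it depends on which error one fears more, which is the normative point made above.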

All this is not new. Philosopher Richard Rudner had already written in 1953 that it is impossible to use a test of significance without knowing to what it is being applied, i.e. without making a value judgment. Interestingly, Rudner used this example to make the point that scientists do need to make value judgments.

# How about mathematical models?

In all this discussion mathematical models have enjoyed a relative immunity, perhaps because mathematical modelling is not a discipline. But the absence of awareness of a quality problem is not proof of the absence of a problem. And there are signals that the crisis there might be even worse than that which is recognised in statistics.

Implausible quantifications of the effect of climate change on the gross domestic product of a country at the year 2100, or of the safety of a disposal for nuclear waste a million years from now, or of the risk of the financial products at the heart of the latest financial crisis, are just examples that are easily seen in the literature. Political decisions in the field of transport may be based on a model which needs as an input the average number of passengers sitting in a car several decades in the future. A scholar studying science and technology laments the generation of artefactual numbers through methods and concepts such as ‘expected utility’, ‘decision theory’, ‘life cycle assessment’, ‘ecosystem services’, ‘sound scientific decisions’ and ‘evidence-based policy’ to convey a spurious impression of certainty and control over important issues concerning health and the environment. A rhetorical use of quantification may thus be used in evidence-based policy to hide important knowledge and power asymmetries: the production of evidence empowers those who can pay for it, a trend noted in both the US and Europe.

# Resistance?

Since its inception the current of post normal science (PNS) has insisted on the need to fight against instrumental or fantastic quantifications. PNS scholars suggested the use of pedigree for numerical information (NUSAP), and recently for mathematical models. Combined with PNS’ concept of extended peer communities, these tools are meant to facilitate a discussion of the various attributes of a quantification. This information includes not just its uncertainty, but also its history, the profile of its producers, its position within a system of power and norms, and overall its ‘fitness for function’, while also identifying the possible exclusion of competing stakes and worldviews.

Stat-Activisme, a recent French intellectual movement, proposes to ‘fight against’ as well as ‘fight with’ numbers. Stat-Activisme targets invasive metrics and biased statistics, with a rich repertoire of strategies from ‘statistical judo’ to the construction of alternative measures.

As philosopher Jerome Ravetz reminds us, so long as our modern scientific culture has faith in numbers as if they were ‘nuggets of truth’, we will be victims of ‘funny numbers’ employed to rule our technical society.

Note: A different version of this piece has been published in Italian in the journal Epidemiologia and Prevenzione.


### 11 thoughts on “A. Saltelli (Guest post): What can we learn from the debate on statistical significance?”

1. I thank Saltelli for his guest blog, and for his patience in waiting for me to post it. I wanted to articulate my problems with one of the leading fronts in today’s statistics battles in a few posts first (there were to be 3, but I stopped at 2)–my last two blogposts.
There are two sides to the current “crisis of stat” movement and research. One is truly trying to improve accountability and best practices by improved experimental designs, “21 word” solutions, and other consciousness-raising moves. These grow from recognizing that today’s powerful methods for data dredging make it easy to come up with apparent evidence for preconceived or selected claims while failing to warrant intended error probability control. All of this is to the good. The second is the one that I discuss in my last two posts. Rather than try to ensure non-fallacious uses of error statistical methods, such as statistical significance tests, the second side assumes the tests have already lost the war, so arguments aren’t needed, replacements are on order. The known pitfalls of some of the leading replacements are ignored. Also ignored are existing, more sophisticated uses of tests and confidence intervals. (I have myself offered a reformulation of tests wherein inferences take the form of discrepancies, from a reference hypothesis, that are well or poorly warranted.)
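The point about data dredging and lost error probability control can be illustrated with a minimal sketch. It assumes independent tests in which every null hypothesis is true, so that each p-value is uniformly distributed on [0, 1]. The function names are hypothetical and the numbers of tests are arbitrary.

```python
import random


def chance_of_spurious_finding(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent true-null
    tests yields p < alpha."""
    return 1 - (1 - alpha) ** num_tests


def simulate_dredging(num_tests, alpha=0.05, trials=20_000, seed=0):
    """Monte Carlo check: run num_tests null tests per trial and report
    how often the smallest p-value alone would look 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under a true null hypothesis a p-value is uniform on [0, 1].
        smallest_p = min(rng.random() for _ in range(num_tests))
        if smallest_p < alpha:
            hits += 1
    return hits / trials


for k in (1, 5, 20):
    print(f"{k:>2} tests: analytic={chance_of_spurious_finding(k):.3f}  "
          f"simulated={simulate_dredging(k):.3f}")
```

Reporting only the best of many looks at the data means the nominal 0.05 error rate no longer describes the procedure actually carried out.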

My last 2 posts give details of my criticisms of the current P-value campaign led by spokespeople in the ASA: “The ASA’s P-value Project: Why it’s Doing More Harm than Good”
https://errorstatistics.com/2019/11/14/the-asas-p-value-project-why-its-doing-more-harm-than-good-cont-from-11-4-19/

and: “On Some Self-Defeating Aspects of the ASA’s 2019 Recommendations on Statistical Significance Tests”
https://errorstatistics.com/2019/11/04/on-some-self-defeating-aspects-of-the-asas-2019-recommendations-on-statistical-significance-tests/

My editorial on Hardwicke and Ioannidis (2019) “P-value thresholds: forfeit at your peril” is linked in this post.

While Saltelli suggests the debate is “mature” and not as disorderly as it appears, I argue that today’s debates are less mature, more disorderly and much less self-aware than the statistical significance test (and related) controversies of the past 40 or 50 (or more) years. The philosophical presuppositions (e.g., about the roles of probability in science, the nature of falsificationist reasoning in statistics and how it differs from accounts of support, confirmation and belief) go unattended and remain hidden. That did not use to be the case. Nowadays, leaders of rival statistical tribes are keen to use today’s “crises” as a way to get their favorite method accepted, without any kind of debate as to whether they are up to the job performed by methods they seek to scapegoat and replace. Saltelli’s overarching frameworks are useful and intriguing, but specific cases under the various headings can take very different forms. I’m interested to know where in his framework Saltelli might pigeon-hole today’s stat wars–or my perspective on them– and their strange casualties.

2. Christian Hennig

Very good guest post! I agree that uncritical belief/trust in the objectivity of numbers and mathematical models has a big role in the current crisis/controversy. Which is a major reason why I don’t think that tests themselves are the major issue, and why I’m not optimistic that any suggested formal replacement of p-values will improve things substantially.
https://www.researchgate.net/publication/225691477_Mathematical_Models_and_Reality_A_Constructivist_Perspective

• Christian: I don’t think the issue is formal/informal–as if rendering the process of planning and interpreting experiments purely qualitative will somehow prevent well-known sources of bad-science and cheating. I know you don’t mean to say there’s a cure in getting rid of formalisms, but there’s a dangerous tendency to make it appear that qualitative hunches and beliefs can and should enter to tell us how to interpret p-values (or other measures).

I don’t think it’s all that mysterious how to do science with integrity, nor are we in the dark about bad science. It’s just that many people would prefer not to do science with integrity, and their views have great allure. As soon as it was realized that preregistration greatly decreased irreplicable results, we started to hear voices in opposition–even if it’s a minority. A much more popular view is to redefine irreplication. After all, even a series of nonsignificant results is still, strictly speaking, “compatible” with a genuine effect. It’s harder to have constraints on science, and if you’re in a field where finding out what’s the case isn’t considered so important by society, then you’ll do better allowing shaky methods, flexibility, and no falsification.

Saltelli mentions that post-normal science (whatever that is*) calls for info about “its history, the profile of its producers, its position within a system of power and norms, …while also identifying the possible exclusion of competing stakes and worldviews”. We should apply this also to meta-statistics and meta-science. There’s an unfair balance of power and exclusion of world views when ASA spokespeople use their position of power to declare which concepts have no right to survive.

*I realize he’s alluding to Kuhn, but the Kuhnian story of normal, revolutionary, post-normal science never held up to being anything like a correct view of real science. That doesn’t mean I deny the importance of looking at the power structure and biases in science–especially when they’re allowed free rein.

3. Interesting Hardwicke and Ioannidis reference that Saltelli links to. I read it as showing that the majority of signatories to ‘ban p-values’ recently used statistical significance language in their papers, and that the ‘ban p-values’ type of discussion is being driven by the biostatistics/epidemiology and psychology fields, both areas where they’ve had issues with experiments not replicating as expected or not replicating the way they wanted (reproducibility).

I noticed that there are many papers, even in Nature and ASA publications, using p-values and stat sig language, after the ‘ban p-values’ articles were published.

Justin

4. Andrea Saltelli

I thank Deborah Mayo for her remarks. My viewpoint is that of a practitioner in mathematical modelling, metrics/indicators, and methods for their quality, such as sensitivity analysis, sensitivity auditing, and others. I am not a leader in the statistical community, and look at this discussion, or battle in the words of Deborah Mayo, from the sidelines.

It would appear that I am perhaps starry-eyed when calling the discussion among statisticians ‘mature’. There are clearly different kinds of immaturity, that of unruly children and that of a debate which hasn’t reached maturity yet. I am in all likelihood seduced by the statisticians’ capacity and willingness to discuss their problems when compared to the ‘anything goes’ of other instances of quantification. If a statistician said to colleagues that an estimate is possible [1] for the increased crime rate at the year 2100 as resulting from climate change in the US at the level of a county, then his or her colleagues would probably suggest a visit to a clinician. Statisticians are above all clear, from my vantage point, that the problems being confronted are both technical and normative, again an aspect I see less in different communities. As our relation to quantification is today in need of careful investigation, statisticians can do their bit, as do practitioners from other disciplines, from philosophy [2] to law [3].

Now, in order to be as precise as possible in answering Mayo’s question, I would say that the statistical battles exhibit the advantage of being fought primarily, though not exclusively, within a discipline, with recognized leaders, hence with a chance to produce some kind of fungible result when the dust settles. If I were to take sides here, which I think I had better not do, I would rather look at the implications of this battle for science and society [4] overall. With hindsight [5] one can say that the relevance of the debate opposing Keynes and Knight to Ramsey and Savage many years ago was not on the points of doctrine, but on the implications for the management of human affairs entailed by the competing theories.

Deborah Mayo likes the lesson I quote from post normal science, though she resents the term, and the reference to Kuhn. I am convinced that some exposure to PNS ideas [6] and practices is in general a positive epistemological therapy, introducing elements of reasonableness as a cure against misplaced rationality. Keynes and Knight would have probably appreciated that.

• Andrea: You say that you “would rather look at the implications of this battle for science and society [4] overall”, but that is the question I’m asking you to address. It requires looking at the particularity of what’s being assumed, and what the consequences are. In my last 2 posts I focus just on the ASA P-value campaign.* Just alluding to implications for science and society without at least pondering what these might be, how society might react, what consequences are likely, etc., is not very helpful. Here’s one implication we all should be concerned with: without P-value thresholds, unwarranted results can be spun and dredged so as to claim evidence of some welcome result or other. Ioannidis puts it well in the Gelman post to which Saltelli links. It’s in the areas where there’s the most money to be made that the harms are greatest, and thus those are the areas we should be most worried about.

(*Even if the ASA, as a whole, agrees neither with abandoning the concept of statistical significance nor rejecting P-value thresholds (ensconced in best practice manuals for clinical trials), it can’t just deny it is permitting the “campaign” led by R. Wasserstein, amplified by their new communications person, Nuzzo, to proffer the recommendations in ASA II to journals, associations, researchers, etc.)

5. Huw Llewelyn

I thank Andrea Saltelli for taking a helpfully broad view of the debate about statistical significance. I would like to widen this further from the viewpoint of a doctor who tests hypotheses (diagnostic models) intensively on a daily basis. I have described this process mathematically and the latter may have much in common with statistical hypothesis testing.

The first analogy is with numerical test precision. For example, we might wish to estimate the probability that a sample has been taken at random from one of a row of individual test tubes with different measured values, each of which therefore has the same prior probability. This means that there is a probability of 0.95 that any repeat sample value will fall within +/-1.96 SD and 0.025 that it will fall beyond one 1.96 SD limit.
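The 0.95 and 0.025 figures quoted above can be checked with a short sketch, assuming the repeat values follow a normal distribution (an assumption implied by the 1.96 SD limits rather than stated in the comment).

```python
from statistics import NormalDist

# Standard normal: deviations are measured in units of one SD.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability that a repeat value falls within +/-1.96 SD of the mean.
within_limits = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

# Probability that it falls beyond one particular 1.96 SD limit.
beyond_one_limit = 1 - std_normal.cdf(1.96)

print(f"within +/-1.96 SD: {within_limits:.4f}")
print(f"beyond one limit:  {beyond_one_limit:.4f}")
```

These values describe the stochastic part only; they say nothing about the non-stochastic sources of error discussed next.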

This first step assesses stochastic matters only. In addition we have to consider hypotheses such as no inconsistency of test methodology, no inaccuracy of result reporting (due to unintentional error or ignorance or dishonesty) and so on.

If we can show that these postulated causes of inaccurate precision or bias are improbable or eliminated, then the probability of replication within two limits will remain just below or exactly 0.95 or 0.025 below one limit. This only applies to the probability of replication of an observation (such as a test result or the observed difference between two results perhaps with and without a treatment).

A doctor would also consider a list of possible diagnostic explanations for such an observation with a high probability of replication. It might help doctors and their students if the principles of statistics and scientific hypothesis testing could be explained by comparing them to these principles of diagnostic reasoning.

6. rkenett

Huw – interesting comments. I was particularly interested by “A doctor would also consider a list of possible diagnostic explanations for such an observation with a high probability of replication.”

A decade ago I started investigating verbal expressions of research findings and the use of alternative representations. This is the list of “possible diagnostics” you refer to. In these alternatives, some have meaning equivalence and some have surface similarity, i.e. look similar but have a different meaning. These are separated by a boundary of meaning (BOM). For a manuscript under revision explaining this see https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070, for how this relates to research in psychology, see https://dx.doi.org/10.31234/osf.io/jqw35, for examples of specifying a boundary of meaning in clinical research see https://www.ncbi.nlm.nih.gov/pubmed/30270168.

The point is that the current discussions on p-values or Bayes factors are rather narrow in the context of representations of findings and their generalisation. In that sense, I do not agree with Saltelli that it has reached “maturity”.

• Huw Llewelyn

Ron

Thank you for your comments; I’m sorry for the delay in replying. I tried to read your papers but found the vocabulary difficult. I was reminded of the way our intellectual backgrounds separate us. Perhaps this is an important cause of the problems facing statistics.

It seems to me that different concepts and vocabularies prevent effective exchange of ideas between statisticians, physicians, biological scientists, philosophers, psychologists.

In my research MD thesis I tried to express medical reasoning in terms of probability and set theory. I tried to lay bare in a common language the way that doctors thought verbally (as opposed to subconscious ‘pattern recognition’). These concepts are now described in Chapters 1, 2 and 13 of the Oxford Handbook of Clinical Diagnosis. The following have been made ‘open access’ by Oxford University Press.

Chapter 1: A description of the diagnostic process
http://oxfordmedicine.com/view/10.1093/med/9780199679867.001.0001/med-9780199679867-chapter-1
Chapter 2: a detailed example description of the diagnostic process
http://oxfordmedicine.com/view/10.1093/med/9780199679867.001.0001/med-9780199679867-chapter-2
Chapter 13: The probability and set theory of diagnosis
http://oxfordmedicine.com/view/10.1093/med/9780199679867.001.0001/med-9780199679867-chapter-13

In the same way that I find the vocabulary and concepts of other disciplines such as psychology and philosophy difficult to follow, others (including medical epidemiologists) find my medical chapters difficult. Practicing doctors have no issues but find the maths difficult.

With best wishes

Huw

• rkenett

Thank you for the links to the Oxford Handbook of Clinical Diagnosis.

I downloaded the files but did not look at them carefully enough, yet.

First of all, I fully agree that vocabulary and communication is an important challenge. Unfortunately, statisticians have traditionally not been good at it.

The path I have taken is that, indeed, clinicians need a way to communicate findings that is not based on standard statistical reports that look like tables, with p-values and stars.

The way I framed this is in the context of generalization of findings. How can this be done on the basis of statistical analysis, but using a more verbal form?

This led to alternative representations that produce alternatives with meaning equivalence and alternatives with surface similarity.

The trick is to have explicit statements representing what the research found and what is not claimed by it.

I have used it with clinicians and interns who relate to this approach, instantly.

Thanks again for the pointers

Ron
