I came across an excellent post on a blog kept by Daniel Lakens: “So you banned p-values, how’s that working out for you?” He refers to the journal that recently banned significance tests, confidence intervals, and a vague assortment of other statistical methods, on the grounds that all such statistical inference tools are “invalid” since they don’t provide posterior probabilities of some sort (see my post). The editors’ charge of “invalidity” could only hold water if these error statistical methods purported to provide posteriors based on priors, which is false. The entire methodology is based on methods in which probabilities arise to qualify the method’s capabilities to detect and avoid erroneous interpretations of data [0]. The logic is of the falsification variety found throughout science. Lakens, an experimental psychologist, does a great job delineating some of the untoward consequences of their inferential ban. I insert some remarks in **bold**.

The journal Basic and Applied Social Psychology banned p-values a year ago. I read some of their articles published in the last year. I didn’t like many of them. Here’s why.

First of all, it seems BASP didn’t just ban p-values. They also banned confidence intervals, because God forbid you use that lower bound to check whether or not it includes 0. They also banned reporting sample sizes for between subject conditions, because God forbid you divide that SD by the square root of N and multiply it by 1.96 and subtract it from the mean and guesstimate whether that value is smaller than 0.
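To make the arithmetic in that last sentence concrete, here is a minimal sketch of the back-of-envelope check Lakens describes: divide the SD by the square root of N, multiply by 1.96, and subtract the result from the mean. The function name and the input numbers are hypothetical, chosen only for illustration; they do not come from any BASP article. The multiplier 1.96 is the usual normal approximation, not an exact t critical value.

```python
import math

def lower_bound_95(mean, sd, n):
    """Approximate lower 95% bound for a mean: mean - 1.96 * SD / sqrt(N)."""
    standard_error = sd / math.sqrt(n)
    return mean - 1.96 * standard_error

# Hypothetical summary statistics (assumed for illustration only).
example_mean, example_sd, example_n = 0.50, 1.20, 30

bound = lower_bound_95(example_mean, example_sd, example_n)
print(f"approximate lower bound = {bound:.3f}; above 0: {bound > 0}")
```

The check needs all three ingredients (mean, SD, and per-condition N), so a report that omits the sample size leaves a reader unable to run it.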

It reminds me of alcoholics who go into detox and have to hand in their perfume, before they are tempted to drink it. Thou shall not know whether a result is significant – it’s for your own good! Apparently, thou shall also not know whether effect sizes were estimated with any decent level of accuracy. Nor shall thou include the effect in future meta-analyses to commit the sin of cumulative science. (my emphasis)

There are some nice papers where the p-value ban has no negative consequences. For example, Swab & Greitemeyer (2015) examine whether indirect (virtual) intergroup contact (seeing you have 1 friend in common with an outgroup member, vs not) would influence intergroup attitudes. It did not, in 8 studies. P-values can’t be used to accept the null-hypothesis, and these authors explicitly note they aimed to control Type 2 errors based on an a-priori power analysis. So, after observing many null-results, they drew the correct conclusion that if there was an effect, it was very unlikely to be larger than what the theory on evaluative conditioning predicted. After this conclusion, they logically switch to parameter estimation, perform a meta-analysis and based on a Cohen’s d of 0.05, suggest that this effect is basically 0. It’s a nice article, and the p-value ban did not make it better or worse.

**If the journal is banning reports of inferential notions, then how do power and Type 2 errors slip by the editors’ bloodhounds?**

But in many other papers, especially those where sample sizes were small, and experimental designs were used to examine hypothesized differences between conditions, things don’t look good.

In many of the articles published in BASP, researchers make statements about differences between groups. Whether or not these provide support for their hypotheses becomes a moving target, without the need to report p-values. For example, some authors interpret a d of 0.36 as support for an effect, while in the same study, a Cohen’s d < 0.29 (we are not told the exact value) is not interpreted as an effect. You can see how banning p-values solved the problem of dichotomous interpretations (I’m being ironic). Also, with 82 people divided over three conditions, the p-value associated with the d = 0.36 interpreted as an effect is around p = 0.2. If BASP had required authors to report p-values, they might have interpreted this effect a bit more cautiously. And in case you are wondering: No, this is not the only non-significant finding interpreted as an effect. Surprisingly enough, it seems to happen a lot more often than in journals where authors report p-values! Who would have predicted this?! (my emphasis) Nice work Trafimow and Marks! Just what psychology needs.

Saying one thing is bigger than something else, and reporting an effect size, works pretty well in simple effects. But how would you say there is a statistically significant interaction, if you can’t report inferential statistics and p-values? Here are some of my favorite statements.

“The ANOVA also revealed an interaction between [X] and [Y], η² = 0.03 (small to medium effect).”

How much trust do you have in that interaction from an exploratory ANOVA with a small to medium effect size of .03, partial eta squared? That’s what I thought.

“The main effects were qualified by an [X] by [Y] interaction. See Figure 2 for means and standard errors”

The main effects were qualified, but the interaction was not quantified. What does this author expect I do with the means and standard errors? Look at it while humming ‘ohm’ and wait to become enlightened? Everybody knows these authors calculated p-values, and based their statements on these values.

**My predictions on the consequences of this journal’s puzzling policy appear to be true, all too true: They allow error statistical methods for purposes of a paper’s acceptance, but then require their extirpation in the published paper. I call it the “Don’t ask, don’t tell” policy (see this post). See also my commentary on the ASA P-value report.**

In normal scientific journals, authors sometimes report a Bonferroni correction. But there’s no way you are going to Bonferroni those means and standard deviations, now is there? With their ban on p-values and confidence intervals, BASP has banned error control. For example, read the following statement:

“Willpower theories were also related to participants’ BMI. The more people endorsed a limited theory, the higher their BMI. This finding corroborates the idea that a limited theory is related to lower self-control in terms of dieting and might therefore also correlate with patients BMI.”

This is based on a two-sided p-value of 0.026, and it was one of 10 calculated correlation coefficients. Would a Bonferroni adjusted p-value have led to a slightly more cautious conclusion?

Oh, and if you hoped banning p-values would lead anyone to use Bayesian statistics: No. It leads to a surprisingly large number of citations to Trafimow’s articles where he tries to use p-values as measures of evidence, and is disappointed they don’t do what he expects. Which is like going to The Hangover part 4 and complaining it’s really not that funny. Except everyone who publishes in BASP mysteriously agrees that Trafimow’s articles show NHST has been discredited and is illogical. (my emphasis)
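The Bonferroni question raised above (a two-sided p-value of 0.026, one of 10 correlations) can be checked in a few lines. This is only a sketch: the 0.05 level is an assumed conventional threshold that the post does not state, the function name is illustrative, and the familywise figure assumes the 10 tests are independent, which correlations computed on one sample need not be.

```python
def bonferroni_adjust(p_value, n_tests):
    """Bonferroni-adjusted p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)

alpha = 0.05        # ASSUMPTION: conventional level, not stated in the post
p_observed = 0.026  # two-sided p-value reported in the post
n_tests = 10        # number of correlations reported in the post

p_adjusted = bonferroni_adjust(p_observed, n_tests)
per_test_alpha = alpha / n_tests

# Chance of at least one p < alpha among n_tests true nulls,
# ASSUMING the tests are independent.
familywise_rate = 1 - (1 - alpha) ** n_tests

print(f"adjusted p = {p_adjusted:.2f} (per-test threshold {per_test_alpha:.3f})")
print(f"uncorrected familywise error rate = {familywise_rate:.2f}")
```

Under these assumptions the adjusted p-value is well above the assumed threshold, which is the "slightly more cautious conclusion" the question points toward.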

**This last sentence gets to the most unfortunate consequence of all. In a field increasingly recognized to be driven by “perverse incentives,” and desperately in need of publishing reform, even the appearance of “pay to play” is disturbing, when editors hold so idiosyncratic a view about standard statistical methods.**

In their latest editorial [1], Trafimow and Marks hit down some arguments you could, after a decent bottle of liquor, interpret as straw men against their ban of p-values. They don’t, and have never, discussed the only thing p-values are meant to do: control error rates. Instead, they seem happy to publish articles where some (again, there are some very decent articles in BASP) authors get all the leeway they need to adamantly claim effects are observed, even though these effects look a lot like noise.

**I’m guessing that Daniel means they might (after liquor at least) be interpreted as converting the many telling criticisms of their ban into such weak versions as to render them “straw men”. I make some comments on this editorial [1].**

The absence of p-values has not prevented dichotomous conclusions, nor claims that data support theories (which is only possible using Bayesian statistics), nor anything else p-values were blamed for in science. After reading a year’s worth of BASP articles, you’d almost start to suspect p-values are not the real problem. Instead, it looks like researchers find making statistical inferences pretty difficult, and forcing them to ignore p-values didn’t magically make things better.[2]

As far as I can see, all that banning p-values has done, is increase the Type 1 error rate in BASP articles. Restoring a correct use of p-values would substantially improve how well conclusions authors draw actually follow from the data they have collected. The only expense, I predict, is a much lower number of citations to articles written by Trafimow about how useless p-values are. (my emphasis)

**Lakens, by dint of this post, certainly deserves an Honorable Mention, and can choose a book prize from the palindrome prize list. He has agreed to answer questions posted in the comments. So share your thoughts.**

[0] As I say (slide 26) in my recent Popper talk at the LSE: “To use an eclectic toolbox in statistics, it’s important not to expect an agreement on numbers from methods evaluating different things. A p-value isn’t ‘invalid’ because it does not supply ‘the probability of the null hypothesis, given the finding’ (the posterior probability of *H*_{0}) (Trafimow and Marks, 2015).”

[1] I checked this editorial. Among at least a half dozen fallacies*****, the editors say that the definition of a p-value is “true by definition and hence trivial”. But the definition of the posterior probability of *H* given **x** is also “true by definition and hence trivial”. Yet they’re quite sure that P(H|**x**) is informative. Why? It’s just true by definition.

Another puzzling claim is that “One cannot compute the probability of the finding due to chance unless one knows the population effect size. And if one knows the population effect size, there is no need to do the research.”

Given how they understand “probability of the finding due to chance” what this says is you can’t compute P(*H*|**x**) unless you know the population effect size. So this nihilistic claim of the editors is that to make a statistical inference about *H* requires knowing *H*, but then there’s no need to do the research. So there’s never any reason to do any research!

*****I won’t call them statistical howlers (a term I’ve used on this blog) because they really have little to do with statistics and involve rudimentary logical gaffes.

[2] My one question is what Lakens means in saying that to claim that data support theories “is only possible using Bayesian statistics”. I don’t see how Bayesian statistics infers that data support theories, unless he means they may be used to provide a *comparative* measure of support, such as a Bayes’ Factor or likelihood ratio. On the other hand, if “supporting a theory” means something like “accepting or inferring a theory is well tested” then it’s outside of Bayesian probabilism (understood as a report of a posterior probability, however defined). An “acceptance” or “rejection” rule could be *added* to Bayesian updating (e.g., infer *H* if its posterior is high enough), but I’m not sure Bayesians find this welcome. It’s also possible that Lakens finds authors of this journal claiming their theories are “probable,” and he’s pointing out their error.

Send me your thoughts.

Lakens’s most telling remark is:

“They don’t, and have never, discussed the only thing p-values are meant to do: control error rates. Instead, they seem happy to publish articles where some (again, there are some very decent articles in BASP) authors get all the leeway they need to adamantly claim effects are observed, even though these effects look a lot like noise”.

The role of p-values as error probabilities is often something people are silent about these days, and yet Lakens is correct to identify this as the p-value’s central function. Whether one views control of error probabilities in terms of good long-run performance or, as I prefer, enabling claims about the probativeness of the test in question, they are part of a methodology directed to block unwarranted claims about having evidence for genuine, reliable effects. Notice that the replicationist research proceeds by checking for significant p-values (or by means of dual uses of confidence intervals). When only 36 or however many studies were replicable in the recent OSC report, I don’t think anyone maintained that they don’t count as non-replications because they were based on significance testing methodology. More than that, the researchers proceeded to employ significance tests to rule out various explanations for lack of replication. (The complaints I’ve heard concern the assumptions of the tests, such as “fidelity”.)

Unless I missed it, the (2016) ASA P-value statement[1] doesn’t include mention of the role of p-values in controlling or evaluating error probabilities either. One could easily construe the definition given as merely providing a “nominal” p-value as a distance measure. Yet, their key principles (e.g., about cherry picking leading to spurious p-values) depend entirely on assuming (as is correct) that p-values ought to be measuring error probabilities and not mere “fit”. That is why I emphasized those points in my invited commentary on the p-value statement [2].

[1] https://errorstatistics.files.wordpress.com/2016/06/the-asa-s-statement-on-p-values-context-process-and-purpose.pdf

[2] https://errorstatistics.com/2016/03/07/dont-throw-out-the-error-control-baby-with-the-bad-statistics-bathwater/

What about this: http://andrewgelman.com/2016/03/07/29212/ ?

The problem with Gelman’s blog post is that merely pointing out some of the problems with a particular statistic does not necessarily render that statistic useless.

Any statistic, applied in an inappropriate scenario, will present problems. To therefore conclude that the whole enterprise of statistics is pointless hardly contributes to understanding how we know anything about the real world. That we do know much, and have learned it through disciplined use of data, appears around us every day now, whenever anyone pulls out a cell phone and performs calculations somewhat unimaginable just decades ago.

The Gelman blog post you link to states

“Second, it seems to me that statistics is often sold as a sort of alchemy that transmutes randomness into certainty, an “uncertainty laundering” that begins with data and concludes with success as measured by statistical significance. Again, I do not exempt my own books from this criticism: we present neatly-packaged analyses with clear conclusions. This is what is expected—demanded—of subject-matter journals. Just try publishing a result with p = 0.20. ”

Personally I do not find the writings of Mayo, Spanos, Cox and others to resemble those of alchemists. They are the top shelf statistical philosophers of our time, discussing central issues concerning how we know anything, in particular in situations where numerical data are available, the central enterprise of science.

I have published a result with p = 0.20; it happens often when scientists report all results, not just a cherry-picked subset. Journal editors should be far more tolerant of submitted papers that list all findings, even those with large p-values. When accompanied by a-priori power analyses, or severity analyses, much can be learned from p-values, even 0.20 ones.

The blog post you link to has a link within it to another Gelman blog post, stating

“to gather more data and look for statistical significance is at best a waste of time, at worst a way to confuse yourself with noise.”

The very premise of statistics and information theory is dispelled in one sentence – that to gather more data is merely to confuse yourself with noise.

So these are the types of problems with Gelman’s blog posts. Do review many of Mayo’s related posts, to get a better understanding of how data is effectively probed to understand natural phenomena.

Steven: Thanks so much for your comment. Indeed, “negative” results can be very informative. I happened to read a rather good short paper on it today:

By teaching a fallacious animal, NHST, that purports to move from a statistically significant difference in a single trial to a substantive research claim, some fields commit multiple, classic fallacies against which Fisher and Neyman and Pearson warned. Further, ignoring N-P tests, these same practitioners are at a loss to interpret insignificant results.

Enrique: Of course I’m aware of Gelman’s comment on the ASA P-value Statement. It gets at the point about the Statement on pp. 39-40 of my Popper talk.

https://errorstatistics.com/2016/05/10/my-slides-the-statistical-replication-crisis-paradoxes-and-scapegoats/.

That some accounts are “free” to ignore error probabilities, because their account conditions on the data, does not make the ill effects of hunting and snooping, multiple testing, cherry picking and a host of selection effects go away. Nor does the charge that it’s infeasible to ever know all of the experiments that could have been done show it’s hopeless to vouchsafe error control for this experiment. Otherwise preregistration, efforts not to commit QRPs, transparency about data-dependent alterations (Simonsohn 21 word requirement) would be empty requirements. They’re not empty, as we see in the difficulty to replicate results arrived at through QRPs. As I note on slide 33, consulting our prior beliefs about theories doesn’t cure the problem of inferences that are invalidated by biasing selection effects. Plausibility of a theory H must be distinguished from how well run a test of H is–at least on the error statistical account. I’ve discussed all this in a clearer fashion in published work over the years.

Mayo, I ask this question on the basis that likelihood analyses are those that are free to “ignore error probabilities”.

If a likelihoodlum were to warn against scientific conclusions based on results of a single experiment and to suggest that replication of interesting results was a prerequisite for acceptance of the scientific hypothesis, and were such a likelihoodlum to remind all that the inferences are the responsibility of the inferring mind rather than the statistical procedures, wouldn’t that likelihoodlum be acting in a manner that is acceptable to a severitist? If not, then why not?

Michael: Just on the first part, the answer is no because the likelihood principle is still at odds with the error probability control–control that is vitiated with such gambits as cherry-picking, significance seeking, post data subgroups, outcome switching, p-hacking, look elsewhere effects, searching for the pony, monster barring, hunting, snooping, and lots of other colorful terms for biasing selection effects. And if you ask: what if a likelihoodlum* added prohibitions to these gambits, then does he or she satisfy severity requirements? The answer is still no, because it’s altogether necessary to have a rationale for such a prohibition. Moreover, not all cases of data dependent selections are injurious to severity, so we need a rationale to distinguish them, and also to try and adjust for them.

As for the view that “inferences are the responsibility of the inferring mind rather than the statistical procedures”, I don’t see that as something a severe testing theorist would sign up for.

*The word “likelihoodlum”, which perhaps you invented, always makes Likelihoodists appear as fun-loving, easy-going folks, when in fact, we see, they are fiercely loyal to their principle (LP).

Likelihoodlums might be those who feel that the likelihood principle is sensible, but the likelihood principle only says that the evidence in the data about the parameters within the statistical model is contained in the likelihood function. Such a principle does not say anything about how to react to that evidence and how to form inferences. Suggestions that it does so come from misconceptions about the scope of the principle, as I have argued endlessly but ineffectively.

All of the bad procedures that you list lose their power to mislead when they are followed by a confirmatory study.

If likelihood analysis “does not say anything about …how to form inferences,” then you concur with my claim that it’s “bankrupt as a theory of inference”. https://errorstatistics.com/2014/11/15/why-the-law-of-likelihood-is-bankrupt-as-an-account-of-evidence-first/

The LP itself becomes a definition of a kind of evidence that is distinct from inference, and we have no way to evaluate if it’s a good account of evidence. For any criticism you say, well that comes in when we do inference. But, we’re interested in an account of inference.

Rubbish! You know very well my response to that blog post. If not, then read my comments from the time.

How can you claim that _any_ account of evidence is good or bad when you refuse to define what you mean by the word ‘evidence’?

Inferences should be made on the basis of the information available. Among that information is the evidence in the data, the nature of the statistical model which allows quantitation of the evidence, the nature of the experiments that yielded the evidence and the evidence and opinions that are available outside the experiment in question. Every possible “theory of inference” falls down by one criterion or another unless that theory allows mindful consideration of the totality of that information. To say that likelihood is “bankrupt as a theory of inference” is as meaningful as saying that a roll of thread is bankrupt as a theory of clothing.

Michael: But I do define evidence, in terms of severe testing. Most importantly, it supplies a view of no evidence, or bad evidence no test (BENT). If an account regards x as evidence for H, even though there is high or even maximal probability of finding such evidence, even if H is false, then I say that account is bankrupt as an account of evidence. Cox would say it violates the weak repeated sampling principle. And even aside from extreme cases, the mere fact that it doesn’t distinguish the evidence x affords H on grounds of data-dependent selections that wreck error probabilities, suffices to discount it. In one case, for example, the test was preregistered; in another, H was constructed post data to make the data probable. If an account lumps these together so long as the likelihood ratios are proportional (i.e., so long as the stipulations of the LP are met), then I say it doesn’t do the job I require of an evidential account. (I’m not stating the full requirement of the LP, but people can search this blog.) I realize there are lots of people who would agree with your conception of comparative evidence (an example of a likelihoodist philosopher is Elliott Sober); I just find it comes up severely short, and further, there’s a better one around.

“I say it doesn’t do the job I require of an evidential account”. I want to say that I also require that “job”, among others, to be accomplished within the processes of inference. However, I don’t agree that it has to be part of what should be described as “evidence”.

To rail against the likelihood principle on the basis that it is not the repeated sampling principle is unhelpful and unnecessary. Both can be considered when making inferences. Of course, to consider both one has to make inferences on the basis of _more_ than just the evidence in the likelihood function regarding support of model parameter values that provide the x-axis of that curve. To call likelihood bankrupt as a theory of evidence on the basis that an algorithmic ‘inference’ based on only a likelihood ratio is flawed is, as I mention above, about as sensible as claiming that a roll of thread is a bankrupt theory of clothing.

Michael: I would only be repeating myself to reply as to why I strongly disagree.

“As for the view that “inferences are the responsibility of the inferring mind rather than the statistical procedures”, I don’t see that as something a severe testing theorist would sign up for.”

Well, that is a shame, because as I see it the only argument that favours the algorithmic approach to inference that is necessary when the mind is excluded is the argument that it is convenient for accounting. I do not care so much for accounting for errors when in the real world there is no way to assess the rate of errors, and in the real world the conclusions in the minds of scientists should properly be graded and subject to change in light of new information. Such conclusions are not readily dealt with using the accounting of error rates.

Your severity concept is so close to likelihood that it is a real shame that you see likelihoodlums as the enemy. The enemy is mindless reliance on significance in place of thoughtful scientific conclusions based on evaluation of totality of evidence and the probativeness of testing.

Michael: In my account, as in science, error probabilities are employed in order to scrutinize inferences and ascertain gaps and anomalies, showing something new is needed. Likelihood and (traditional*) Bayesian accounts are locked in the space of models they start out with.

What they’re learning this month about a new particle (in high energy physics) is something no one had a clue about (they’re saying “who ordered this?”). You think that by updating priors they eventually would have given a low enough probability to all the “beyond standard model” (BSM) physics they had in mind? Maybe, but they’re doing it in a few months. And even if they reached a point of giving a low prior to all the theories now entertained (assuming vast, vast funds for high energy physics), they still wouldn’t have the clues as to how to extend the standard model to the new theory they’re learning about.

Using significance test reasoning they find the clues in short order. Inductive inference does involve “speeding things up in this way”, and pointing you to brand new ideas as a result of identifying anomalous effects that couldn’t be explained away. That is why (ecumenicist Bayesian) George Box says we need frequentist significance tests if we’re going to find anything new. (The citation is in my comment on the p-value statement)

OK, so you have ‘defined’ evidence. However, as far as I’m concerned you’ve done nothing more than define severity and then equated severity with evidence. All other procedures will rise or fall in your evidential account on their closeness to severity. I guess that is a privilege that you can take, but it is not one that will take us very far.

In my opinion, you have two separate accounts of evidence. One based on severity where the hypotheses are entities that may be thought of as existing independently of each other, perhaps not contained within a single statistical model, and another based on severity curves that takes ‘hypotheses’ that are parameter values within a statistical model. The latter is probably entirely compatible with likelihood analysis, and indeed it seems interchangeable with it for simple cases.

You point out the aspect of severity testing that you call BENT as an important positive feature. That feature is absent from an analysis based on only a likelihood function, but in my accounts I have always tried to make it clear that the _inference_ has to be made on more than just the hypothetical parameter values best supported by the likelihood analysis of the evidence provided by the data. Separating evidence from the process of inference not only allows the word ‘inference’ to retain its dictionary meaning, it also allows the likelihood principle to be accepted while BENT is used as a principle for inference.

On “likelihoodlum”: David Draper used this term in 1999 (JRSSD Vol. 48, No. 1 pg 27) as did Roger Berger (Stat Sci Vol 14 No. 4 pg 373) and it’s also in a 1995 book by Poirier. The obvious connection to “hoodlum” strongly suggests it is pejorative. Its use is therefore not helpful to discussion, unless it’s very clear one is writing in a light-hearted way – which is difficult.

George: Thanks for the history. It’s true that whenever I use it on this blog (and that’s the only place I have) I mention that a true blue likelihoodist (Lew) uses it. Was R. Berger critical? I’m guessing he was. It’s because it seems pejorative that I wondered at Lew using it, so I decided it was meant to be cute.

My use of likelihoodlum is intended to be light hearted (my heart is feather-weight), but there may be a little more to it. I adopted it as a badge of honour in response to the unfair treatment of likelihood and a persistent refusal by many to hear what the data say. I like the term and do not intend to offend.

I am amused to be repeatedly described as a likelihoodist, true blue or other. I am only a likelihoodist in so far as one has to be a likelihoodist in order to see what the data say and to use that information when making reasoned inferences. I do not deny that error rates are a factor to be considered when making inferences, and I am well-disposed towards severity. However, my opinion is that any inferential approach that is deaf to what the observed data say about the possible values of the parameter(s) of interest is lacking. I also feel that different mixtures of approaches are required for different types of analytical problems. Any single-minded approach to inference will be inappropriate in some circumstances.

Michael: Unfortunately, your argument is completely circular. We deny we’ve heard what the data have to say if we can’t distinguish between predesignated and data dependent hypotheses—just for one of many examples where likelihoodists (or lums) differ from error statisticians. That’s why Birnbaum, once a staunch lum rejected it except for cases where one is comparing 2 predesignated hypotheses. If x accords with H, but your procedure had a high prob of resulting in such an accordance even if H is false, then it’s a distortion of the evidence to hide this! Just to remind people, by the way, my use of error probs is not merely for controlling long run error rates–I reject the behavioristic use of error probs. The unreliability that results from cherry picking, p-hacking, outcome switching etc. is problematic, not because if we kept doing this we’d often be wrong (even though that’s true). It’s because the gambit yields a terrible test (and lousy evidence) in the case at hand.

Mayo wrote: “it’s a distortion of the evidence to hide this!” Lew responds: who’s hiding it? Read my previous comment again, I wrote “I do not deny that error rates are a factor to be considered when making inferences.”

You raise, again, Birnbaum’s conversion. I’ve written about it at length here: http://arxiv.org/abs/1507.08394

I note that Birnbaum changed his mind about likelihood on the basis of a badly flawed example that can only be interpreted as being either overfitting of the model, or the use of a silly model containing an ‘evil demon’. What’s more, the example is only relevant when a decision is being made on the basis of a single observation: a second observed datum makes Birnbaum’s problem go away entirely. I would like to think that there was more to his decision to repudiate likelihood, but he used that example as a justification in two separate communications. Hacking also raised that example as a problem for the likelihood principle, but did not justify the weight it had been given.

I am tiring of your repeated implication that I (or other likelihood-ists-lums) would ignore the experimental design issues when making inference. The likelihood principle simply says that most of those design features are irrelevant to the data-directed order of preference among the possible values of the parameter(s) of interest. It absolutely definitively does not say that one has to ignore other factors when making inference. It tells how to evaluate evidence about the parameter value(s), not how to make inferences.

The reason that I have been trying to get you, Mayo, to be clear on what the word ‘evidence’ means is to prompt you to see that under the meaning of the word implied by the likelihood principle, there are other factors that can and should be used when making inference.

Daniel: You wrote: “No, this is not the only non-significant finding interpreted as an effect. ” I’m wondering if they presented the data so that you could work this out.

Hi, no, they did not provide me the data – but given the sample size and the d effect size, you can get an approximate t-value and p-value.
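For readers who want to try this back-calculation, here is a minimal sketch, assuming two independent groups and Cohen's d computed with the pooled standard deviation. The function name and the input numbers are hypothetical, not taken from any BASP article, and the p-value uses a normal approximation to the t distribution, so it is only a rough guide for small samples.

```python
import math


def approx_t_and_p(d, n1, n2):
    """Approximate t, df and two-sided p from Cohen's d for two independent groups."""
    # For a pooled-SD Cohen's d, t = d * sqrt(n1 * n2 / (n1 + n2)).
    t = d * math.sqrt(n1 * n2 / (n1 + n2))
    df = n1 + n2 - 2
    # Normal approximation to the t distribution (an assumption made to stay
    # within the standard library); it gives a p-value that is slightly too
    # small when df is small.
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, df, p


# Hypothetical inputs: d = 0.30 with 40 participants per condition.
t_value, df, p_value = approx_t_and_p(d=0.30, n1=40, n2=40)
print(f"t({df}) = {t_value:.2f}, approximate two-sided p = {p_value:.2f}")
```

With these made-up inputs the result is far from the conventional 0.05 threshold, which is the kind of check a reader can run when a paper reports only d and the group sizes.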

One thing that bothers me in all this endless discussion about p-values and how they are used in psychological research is that most arguments blame p-values for questionable research practices and thus focus on the analyses.

But there’s a bigger problem in data collection: many inferential studies in Psychology are based on convenience samples – psych undergrads, open internet forms, etc. How can we use any kind of inferential procedure with such ambiguous samples? A sampling distribution can only be defined under a precise sampling scheme; sampling information should also be incorporated in Bayesian models.

I see no point in arguing against or in favor of any inferential procedure in Psychology while data collection and measurement problems are just swept under the rug. What kind of excuses do researchers in Psychology use to justify the inferential value of a convenience sample? “It’s a valid random sample from a hypothetical (and thus imaginary and nonexistent) hyperpopulation”? I don’t agree with the ban in any way, but I’d say that BASP’s ban on inferential procedures is right for the wrong reasons: the fault lies not in p-values, confidence intervals, or posterior probabilities based on uniform priors, but in how many psychological studies collect their data. Under convenience sampling, the best we can do with our data is describe it – any inference based on such a sample will depend on assumptions that are not met in any approximate way.

Erikson84: I entirely agree that the biggest problem with the psych inferences concerns the validity of the measurements (e.g., unscrambling soap words to “treat” with situated cognition of cleanliness, measures of self-esteem, etc.), but tackling this goes beyond anything I imagine critics are prepared to consider. On the other hand, your point about a “convenience” sample may or may not impinge upon the inferences based on randomized assignments because, strictly speaking, such experimental populations are almost never representative of the general population of interest. I do think it’s something that should be discussed more than it is in psych. But subjects who agree to take part in clinical trials in medicine are not considered representative of the general population of interest: the goal seems to be simply to show it’s possible for the “treatment” to bring about the effect somewhere. (Proof of principle?) Experts in RCTs may weigh in here. Granted, it’s problematic for psych in that they tend to make general inferences directly, but it’s wrong to say that no statistical inferences are legitimate. Thus, I wouldn’t agree that “BASP’s ban on inferential procedures is right for the wrong reasons”. They don’t say their concern is with concept validity or external validity; they say, and truly appear to believe, that statistical inference must be in terms of a posterior probability in H, never mind that they haven’t explained why, in any of the ways this posterior could be arrived at, the result indicates it’s warranted to infer H. Misgivings on this point lead many Bayesians to settle for comparative measures, e.g., Bayes factors or likelihood ratios. We can justify statistical inferences from well conducted significance tests and confidence intervals, but it’s hard to see how this journal isn’t encouraging their misunderstanding, if only inadvertently.
Again, I concur that a critique should focus on the validity of the measurements and the grounds for generalizations. But that’s distinct from “test bans” and such.

Mayo,

I have indeed forgotten to consider inferences coming from randomized experiments. Of course, sampling distributions of discrepancies between conditions are usually valid under randomization even for a non-probabilistic sample, if we are not worried about generalization. In this case we can make valid inferences assuming hypothetical repeated attributions to treatment conditions.
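To make the idea of “hypothetical repeated attributions to treatment conditions” concrete, here is a minimal sketch of a randomization (permutation) test. The function name and the data are hypothetical; the only assumption is that participants were randomly assigned to the two conditions, not that they were randomly sampled from any population.

```python
import random


def randomization_p_value(treated, control, n_perm=10000, seed=0):
    """Monte Carlo randomization test for a difference in means.

    The reference distribution comes from re-assigning the observed
    outcomes to conditions at random, mimicking the random assignment.
    """
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = list(treated) + list(control)
    n_t = len(treated)
    n_c = len(pooled) - n_t
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / n_c
        # Small tolerance so re-assignments that reproduce the observed
        # difference are not lost to floating-point rounding.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the Monte Carlo p-value strictly above zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Hypothetical outcomes for six participants per condition.
treated_scores = [5.1, 6.0, 5.8, 6.3, 5.9, 6.1]
control_scores = [4.2, 4.8, 5.0, 4.5, 4.9, 4.4]
obs_diff, perm_p = randomization_p_value(treated_scores, control_scores)
print(f"observed difference = {obs_diff:.2f}, randomization p = {perm_p:.4f}")
```

The p-value here speaks only to the causal question within this group of participants; it says nothing about generalization to a wider population.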

My comment makes no sense in this case, even though the measurement and generalization problems still remain.

Maybe I could rephrase like this: under ambiguous data collection (no clear sampling or randomization scheme), statistical inference would be misleading. Under this circumstance, which is not uncommon in Psychology, avoiding inferential statements should be preferred. So the ban could be justified if the editors were worried about correspondence between statistical assumptions and research practices. As you have pointed out, it’s not the case, and simply banning inference won’t produce better research practices.

Parenthetical, but I had to stick this in here.

http://www.smbc-comics.com/index.php?id=4127

This is my first comment on this blog. I’m not a statistician nor a philosopher, nor am I a native English speaker. So please bear with me through grammatical errors and possibly fallacious arguments.

1. I agree with erikson84 with the statement “BASP’s ban on inferential procedures is right for the wrong reasons”. P-values and confidence intervals (CI) are taken too lightly by the research and clinical community that I know. Results are interpreted as significant or non-significant, and that’s about it. That is the main aim of a clinical discussion, a journal club, or a statistics class nowadays. Is it statistically significant? Yes, then it is the truth. No, then it is discarded. Bias and model specification are topics that are not discussed often. It is refreshing to see what happens to researchers when this simplistic dichotomy is banned. I guess this was bound to happen, but it is good, because it shows us where we are in terms of critical and statistical thinking, and forces researchers to take into account the rationale, methods and results of their research. Of course, there has to be a learning curve, which seems steep and lengthy.

2. P-values and CIs are not to blame. That said, their correct interpretation and usage have proven very difficult for people working in areas related to statistics, and even for statisticians. There’s an ongoing debate on different views (Fisher, N-P, “hybrids”), which doesn’t seem to be coming to an end in the near future. I have found p-values are difficult to teach as they are counterintuitive for students. It’s very easy to misinterpret them and it’s very hard to strip out all wrong preconceptions. Then again, p-values and CIs are not to blame, they are what they are, but I have found them impractical.

3. I recall that most frequentist theorems about statistical tests assume random sampling. I might be (very) wrong. Assuming I’m right, convenience samples would have limitations regarding these tests. Validity of statistical inference becomes questionable. However, I’ve found researchers don’t regard this as important because it is “very difficult to obtain a true random sample”. What’s more important for them is if the statistical model is correct. I don’t know yet if the latter is what’s right and the random sample assumption is unimportant.

Then there’s the issue of censuses. There’s an ongoing controversy about the use of p-values and CIs if we have the entire population. The rationale for the use of these inferential procedures is that censuses are conducted to inform decisions. And the population studied might have variability with time and biological variability, which can lead to results that lead to wrong decisions. Inferential procedures can grasp this uncertainty and generalize to a theoretical superpopulation, so as to have an account of the variability. For example, I take a census of 2016. I report CIs for some variable and these are imprecise. This will inform me that there’s a lot of underlying variability so I can make the correct decision. But the way I see it is that, actually, the 2016 census is a convenience sample of a greater population (say 2000-2016). If this is true, then what is warranted is a different study that samples subjects for those other years to see the trend and actual variability between them. My take is that for census studies it is sensible to describe the population and not infer about a theoretical parameter of an imaginary superpopulation. They’re a snapshot and nothing else. If you want to know the actual trend, then the question merits a different study design.

4. The view of RCTs as “working for some people somewhere” is very narrow. Most RCTs are designed so as to make a biological causal claim about an intervention and an outcome, that is, to see if an intervention has a “true” clinically important effect given certain (ideal) circumstances. But an RCT tells us more than that: it informs us about diagnostic tests and pathophysiology, for example. There’s a good discussion of this in Phillips, Philosophy of Science, Dec 2015.

Then there’s the clinical problem: in which types of patients and circumstances does a given intervention work? That’s a totally different issue. First there has to be a sound intervention-outcome association established. The next studies should focus on specific populations and clinical circumstances to try to understand why the intervention works or doesn’t work for and in them. If we start with a “representative” sample then, as Rothman says, this pursuit of representativeness can hinder the validity of causal relations. And he deals mainly with observational studies, which are more prone to bias.

In my experience, interventions in Medicine are directed to populations (Public Health interventions) and to individuals (Clinical Epidemiology – EBM). For example, vaccines, anti-tobacco laws and safe drinking water services work in the majority of the world population and are put into action through government policies. However, it would be cumbersome to have a public policy for every specific disease, cancer, hypotensive medication, etc. because this depends on specific patient clinical features, preferences and values. This is the realm of the clinician and EBM.

The most interesting thing of all is that interventions based on clinical trials work. Although it may seem like a mere anecdote, you see it on a day to day basis in clinical practice. People get well with adequate treatment. Additionally, there has been recent evidence that following evidence-based practice guidelines actually prevents outcomes (Chung et al BMJ 2015; 351:h3913).

Finally, I guess external validity has to do more with reason than with sampling schemes.

I hope I wasn’t too off-topic.

Thanks for the opportunity to express some of my ideas.

Martin, it appears to me that the problems you flag are broad, and point at the problems people have with relating statistical methods to scientific inference. I do not believe p-values or confidence intervals are worse off in this regard. It also appears that you are focused on the problem of properly framing problems with regard to the relevant reference class. This issue is every bit as difficult for alternative statistical approaches, and I do not believe that researchers or the lay public have a better track record with alternatives such as likelihoods or Bayesian models. The difference is that we all understand and recognize the misuses of p-values and confidence intervals more readily, and this easier recognition is (ironically) a strength that superficially looks like a weakness to some.

John, I agree.

My colleagues and I were only taught p-values and CIs during our studies. Exploratory Data Analysis, Likelihoods and Bayesian techniques were shown to us just tangentially.

Afterwards, when I went deeper into the subject, I found there has been a long discussion on frequentist statistical inference, the dos and don’ts, strengths and weaknesses, different perspectives, and so on. It is, most definitely, a more studied area of statistical inference.

I strongly believe p-values and CIs are very useful when we are dealing with RCTs. However, I don’t find them that useful when there’s not a clear sampling scheme.

I guess the main problem of p-values in practice is that they are counter intuitive, at least in clinical areas of knowledge. I’m just learning about Bayesian theory but I’ve found out we clinicians tend to think more in a Bayesian way.

I mean: we have a prior of a health problem (or problems) based on prevalence. We actually deal with several hypotheses at a time, to which we (tacitly) assign a probability: high, moderate, low, very low. Then we gather data regarding those possible hypotheses through a thorough clinical history and examination. We make up our mind given the data and our priors and establish working diagnoses. If we have doubts we order complementary tests to confirm or discard the possible differential diagnoses. We combine that with our priors and have a final diagnosis (posterior probability). The clinical course will tell us if what we believed was true or not. Of course decisions are made in the context of severity, complications, costs, etc.
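The diagnostic reasoning described here can be written down as a small Bayes' theorem calculation. This is a minimal sketch, assuming a single binary diagnosis and a test with known sensitivity and specificity; the function name and all numbers are hypothetical and chosen only for illustration.

```python
def posterior_probability(prior, sensitivity, specificity, positive=True):
    """Update the probability of disease after one test result via Bayes' theorem."""
    if positive:
        like_disease = sensitivity           # P(test + | disease)
        like_healthy = 1.0 - specificity     # P(test + | no disease)
    else:
        like_disease = 1.0 - sensitivity     # P(test - | disease)
        like_healthy = specificity           # P(test - | no disease)
    numerator = prior * like_disease
    return numerator / (numerator + (1.0 - prior) * like_healthy)


# Hypothetical values: 10% prevalence, test with 90% sensitivity, 80% specificity.
after_first = posterior_probability(prior=0.10, sensitivity=0.90, specificity=0.80)

# A second positive result from a different test, treated (as an assumption)
# as conditionally independent of the first given the true disease status.
after_second = posterior_probability(prior=after_first, sensitivity=0.85, specificity=0.95)

print(f"after first positive test:  {after_first:.3f}")
print(f"after second positive test: {after_second:.3f}")
```

Note that the answer is only as good as the prevalence figure and the test characteristics fed into it, which is where questions about sampling and the reference class come back in.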

I don’t see how the N-P framework could become a part of day to day practice. The long-run error rates seem a little too abstract.

Whereas an error in diagnosis or therapy could just be something I’ll take into account into my prior the next time I encounter a similar situation. This is more down to Earth, in my opinion. Of course, this is an opinion from someone who’s just beginning to get the whole picture.

Returning to the BASP ban on p-values. This is only one journal out of many journals out there. There’s been a lot of bashing and I don’t think it’s completely deserved, even though the reasons most people give for disagreeing with the ban are sensible. I don’t think this ban will spread out to many other journals in the near future. I do see it as an opportunity to see what happens when researchers, who were only taught one way of analyzing data and making inferences from them, have to go and think more about what they have in their hands. We could learn very much from what’s happening with the p-value ban in terms of methods and statistical thinking. I hope that this BASP p-value ban experiment turns out to be successful, not because of the p-value ban itself, but because it would help researchers to find different ways to think about their studies. I think some good can come out of it.

Martin, I would encourage you to think carefully about why use of priors and likelihoods in no way sidesteps problems with sampling (assumptions such as randomness) and the reference class problem (why should I believe your posterior has any applicability to this case?). From what I see out in the real world, it can easily muddy the water. As to how to use confidence intervals and p-values, they were born out of a hypothesis testing philosophy where one seeks to remove false hypotheses and, by eliminating the false, to fix onto the true (viable) hypothesis. I have read some depictions of evidence based medicine that appear to be moving that way, as science has done for a long time.

I may once more be late to the party, but here are a few words on random sampling and to what extent this is an assumption of frequentist analysis. Quite generally, model assumptions are abstract and formal and as such are never perfectly fulfilled. This holds for all mathematical modelling and isn’t anything particularly problematic about frequentism. This particularly means that it doesn’t make much sense to say that “the model/method assumes this-and-that and it’s therefore only warranted to apply it if this-and-that is fulfilled”. If it was so, no mathematical model could ever be legitimately applied.

The really relevant question is always to what extent issues with the model assumptions invalidate inferences that are made using a formal model that is violated in a certain way. In case that there’s a convenience sample and frequentist inference, the question therefore is whether the way the convenience sample was compiled has the potential to bias the inference. This is a tough question, but it should be discussed, because proper random samples are in very short supply. Note also that, as mentioned before, randomisation is not a magic bullet that makes the issue go away, because random sampling is concerned with generalisation whereas randomisation is concerned with causality.

Of course there are lots of examples where convenience sampling will strongly bias inferences for reasons that are obvious when thinking thoroughly about how the sampling procedure may interact with the target of inference. Then there are situations in which one could imagine that it’s not so much of an issue; also one could distinguish two different aspects of the problem, namely a) whether sample selection is properly random and b) whether the population represented by the sample is the population of interest. Aspect b) can be handled by just adapting the interpretation of results to an appropriately restricted population. Aspect a) may often be seen as less problematic (although it can be); the question to ask here is to what extent the availability of sampling units interacts with the target of interest. I think the best that can be done is to think thoroughly in every single case about this, document it, and then to decide whether the problem seems to be negligible (which may of course be challenged).

This may seem somewhat unsatisfactory because at the end of the day one can never be really sure whether an analysis is valid, but that’s how it is. There is no alternative. Any non-frequentist approach can also only deal with this to exactly the extent to which this has been thoroughly discussed and potentially problematic deviations from random sampling have been thoroughly modelled. This is possible in Bayesian and non-Bayesian ways and happens very rarely in Bayesian and non-Bayesian analyses alike (or be it EDA or non-Bayesian likelihood or whatever).

Christian,

Thank you for your input. I was so eager in my criticism that I even left out the validity of randomization distributions under random assignment, an issue Mayo quickly pointed out.

I agree with you on all your points. Models are approximations to reality and so are their assumptions. I am quite fond of Mario Bunge’s distinction between model objects and theoretical models: in science, we begin by building an abstract, idealized object that we can describe in a rigorous way. In many research domains, the IID assumption is really OK, even if the sampling procedure was not randomized in any meaningful way, because it allows us to model the data generating process with less mathematical hassle. But in others, like most fields of Psychology, the IID assumption that is implied by most traditional models is a gross approximation at best.

So, I do think that most research in Psych goes overboard in assuming how well the data could be represented by a simple IID model. In most papers I’m used to reading, there’s no clear discussion of how much the sampling procedure deviates from random, independent sampling, but it doesn’t stop researchers from justifying their inferences using “exact” p-values from traditional tests. Of course, in case of experiments with random assignment, the inference is sound for causal reasoning, but not for generalization.

On the other hand, if we do assume that our models are approximate, and their assumptions are idealistic, how much can we trust in how we control the error rate? If the sampling scheme deviates from IID in ways that we don’t know exactly (and when do we ever know it?), the actual error rate will be a lot different from the theoretical error rate. Even if we use error probabilities in their evidential sense, as Mayo advocates, a very low p-value might arise from assumption violation and not strictly from discrepancies from H_0. How can we build confidence that our inferences are not artifacts of inadequate assumptions?

I don’t mean this as a criticism of frequentism in particular – in the case of Psych research, I think it is a matter of better measures, sampling, modeling and model criticism, regardless of one’s preferred interpretation of probability or inferential procedure.

erikson

“On the other hand, if we do assume that our models are approximate, and their assumptions are idealistic, how much can we trust in how we control the error rate? If the sampling scheme deviates from IID in ways that we don’t know exactly (and when do we ever know it?), the actual error rate will be a lot different from the theoretical error rate.”

I don’t think there is any such thing as a well defined “actual” error rate. Error rates are only defined under assumptions that are formal, not real. So the issue is not so much what an “actual error rate” might be but rather to what extent the theoretical one is relevant to the study in question.

Actually one could compute/simulate error rates under non-iid models that could be motivated in one way or another to be more realistic for the study in question. This can be instructive, but these wouldn’t be “actual/true” error rates either.
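As an illustration of this kind of simulation, here is a minimal sketch that estimates how often a naive one-sample t-test rejects a true null when observations come in clusters rather than being iid. The clustered model, the function name and all parameter values are assumptions chosen for illustration; as noted above, the resulting rates are error rates under another formal model, not “actual” ones.

```python
import math
import random
import statistics

# Two-sided 5% critical value of Student's t with 39 degrees of freedom,
# matching the 10 x 4 = 40 observations used below.
T_CRIT_DF39 = 2.0227


def rejection_rate(cluster_sd, n_clusters=10, cluster_size=4, n_sims=2000, seed=1):
    """Share of simulated samples in which a naive t-test rejects a true null (mean 0)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        sample = []
        for _ in range(n_clusters):
            # A shared shift makes members of the same cluster correlated;
            # cluster_sd = 0 gives an ordinary iid normal sample.
            shift = rng.gauss(0.0, cluster_sd)
            sample.extend(shift + rng.gauss(0.0, 1.0) for _ in range(cluster_size))
        n = len(sample)
        t = statistics.mean(sample) / (statistics.stdev(sample) / math.sqrt(n))
        if abs(t) > T_CRIT_DF39:
            rejections += 1
    return rejections / n_sims


iid_rate = rejection_rate(cluster_sd=0.0)
clustered_rate = rejection_rate(cluster_sd=1.0)
print(f"rejection rate, iid model:       {iid_rate:.3f}")
print(f"rejection rate, clustered model: {clustered_rate:.3f}")
```

Under the iid model the rejection rate should sit near the nominal 5%, while under the clustered model it should be clearly higher, because the naive standard error ignores the within-cluster correlation.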

This is not meant to disagree with your main issue; you point at something very important. It’s rather about “reframing”; how to think about the problem.

“Even if we use error probabilities in their evidential sense, as Mayo advocates, a very low p-value might arise from assumption violation and not strictly from discrepancies from H_0. How can we build confidence that our inferences are not artifacts of inadequate assumptions?”

Mayo also advocates that we check the assumptions. My take on this is that all information we have should be used to think about potentially critical issues with model assumptions, those that can easily invalidate inferences. This could be the data themselves but also information about where the “sample” came from etc. Also, what does our result look like in the light of all the other related things people have done in related problems? If we think as critically as we can about this (and do some data exploration) and then we don’t find any issue, this is as good as it gets. It’s not perfect. We may still be wrong (and should learn from error if it later turns out we are).

Christian: Knowing the literal, actual error probabilities is rarely if ever what’s needed for warranted interpretations of the data. Nor did N-P think they were. I agree with looking at background information as well as model assumptions, but these shouldn’t be run together. It’s very important to distinguish them. One would distinguish, as well, a “big picture” inference as to what’s known about the problem or hypothesis in general, as opposed to what these data teach us about the question posed, as framed in the model.

Christian, Mayo,

I remember one study on a large health insurance database in which researchers computed an ‘empirical distribution’, given that they could compute a test statistic for cases in which the background information says the effect is truly (or close enough to make little difference) 0. So, they could compare the obtained test statistic from a comparison of interest against this empirical distribution to make inferences (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4285234/).
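The general idea can be sketched as follows. This is a simplified rank-based stand-in, not the procedure used in the linked study: it assumes we already have test statistics from comparisons where background knowledge says the true effect is (about) zero, and it compares the statistic of interest against them. The function name and all values are hypothetical.

```python
def empirical_p_value(observed, null_statistics):
    """Two-sided empirical p-value of `observed` against statistics from known-null comparisons."""
    exceed = sum(1 for s in null_statistics if abs(s) >= abs(observed))
    # Add-one correction: the observed statistic counts as one draw,
    # so the p-value can never be exactly zero.
    return (exceed + 1) / (len(null_statistics) + 1)


# Hypothetical z-statistics from 19 comparisons believed to have no true effect.
# They are more spread out than a standard normal, as bias would produce.
negative_controls = [0.3, -1.1, 2.4, 1.8, -0.6, 2.9, 0.9, -2.2, 1.5, 3.1,
                     -0.2, 1.2, 2.0, -1.7, 0.5, 2.6, 1.0, -0.8, 1.9]

calibrated_p = empirical_p_value(observed=2.5, null_statistics=negative_controls)
print(f"empirical p-value = {calibrated_p:.2f}")
```

Against a standard normal reference, z = 2.5 would give a two-sided p-value of roughly 0.01; against these made-up negative controls it is unremarkable, which is the point of such calibration.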

I understand that such calibration is not always possible – in fact, we will have to trust in our theoretical guarantees most of the time. But, as Christian puts it, shouldn’t we use all available information to criticize the model and give some confidence to our inferences?

Mayo, could you please elaborate on your last comment? Why wouldn’t ‘actual error probabilities’ be of interest in interpreting the data? Why shouldn’t checks of model assumptions and of background information be run together? Could you please point to some of your articles where you discuss problems of model definition and assumption violation and how we should tackle them in inferential reasoning? I’m truly interested in how to deal with these questions, since I believe that most problems in Psych studies could be avoided if we (talking as a Psych researcher) were more careful about data gathering, measurement and modeling.

Mayo: Yes, I agree; if it seemed otherwise I probably wrote something too sloppily.

OK, so I decided to review some readings to better understand this problem. But I’m more confused now because there seems to be a contradiction:

In Mayo and Spanos (2011, p.190), it’s clearly stated that “statistical misspecifications often create sizeable deviations between the calculated or nominal error probabilities and the actual error probabilities associated with an inference, thereby vitiating error statistical inferences.” That was exactly my point in my first post: most sampling schemes in Psych do not allow for valid inferences, because they are unclear and grossly violate most test assumptions (random assignment aside, as already pointed out, but only for causal reasoning).

How can I reconcile this with Mayo’s “Knowing the literal, actual error probabilities is rarely if ever what’s needed for warranted interpretations of the data”? The excerpt above says otherwise: unless we can be confident that our assumptions are at least approximately satisfied, we will have nominal error probabilities and invalid inferences. Of course, the ‘actual’ here is not the ‘absolute truth’, but as good as we can approximate under our model and how much we can trust its assumptions hold.

Some assumptions can be tested in a formal way, building confidence in the model. But how can we test IID assumptions on a haphazard sample? Formal testing won’t help much, but background information can be used to justify the IID approximation. Why shouldn’t they ‘run together’? Or we can use the ‘it’s a random sample from an unknown superpopulation’ argument and make inferences to this never-to-be-seen-again superpopulation, in which case why should we care about inferences at all?

As someone with little formal statistical training, I’m genuinely interested in those questions and how to better tackle them in research – I’m not asking them for the sake of criticism. So if anyone can point me to relevant works that discuss those issues, I’ll appreciate it.

Erikson: The point is that the reported error probs should be approximately correct. It may be that substantive considerations are very relevant for ensuring or questioning assumptions, but wouldn’t it be odd if, given what you know, the stat assumptions ought to hold and yet they fail in the face of testing assumptions? The misspecification methods in Spanos were developed with observational studies in mind. I’m not sure what you mean by saying they can’t be applied to a “haphazard” sample.

Can you explain:

“most sampling schemes in Psych do not allow for valid inferences, because they are unclear and grossly violate most test assumptions (random assignment aside,”

Are you referring to problems of measurement, and of external validity?

Mayo,

Thank you for your clarification.

What I have in mind with ‘haphazard sample’ is the usual non-probabilistic sample that is very common in Psych studies. Undergrads, Mechanical Turk workers, open internet questionnaires… An autoregressive or non-parametric test might tell us that there is no serious issue of serial autocorrelation in the sample, and that might be enough to claim our independence assumption is met.

My main concern is: what does a p-value mean for such a sample? The inferential value of a test statistic comes from its sampling distribution, assuming the null hypothesis and all the necessary regularity conditions. But without a clear sampling scheme or at least some good background justification, the sampling distribution is not well-defined, unless we use the “imaginary superpopulation” excuse. For such a sample, systematic and sampling error will be unknown, so we won’t be able to derive a sampling distribution for any test statistic, and any p-value will be nominal.

I understand that this issue is not problematic in some fields. E.g., biologists might not have any reason to believe that the irises collected in field A will be systematically different from irises of the same species collected in field B, at least in some aspects of interest. Or, at least, that any systematic within-species variation will be much smaller than relevant between-species variation, so they don’t have to worry much about ‘random sampling from the entire population of irises of a given species’. But this is hardly the case in most Psych studies, in which I have very good reasons to believe that the outcome of interest is confounded with other variables that influence participation in the study.

Under such biased sampling, in which we can’t even have a clear notion of how systematic and random error affect the results, how can we interpret a p-value, or, for that matter, any inferential statistic? As Berk and Freedman argue, “researchers may find themselves assuming that their sample is a random sample from an imaginary population. Such a population has no empirical existence, but is defined in an essentially circular way—as that population from which the sample may be assumed to be randomly drawn. At the risk of the obvious, inferences to imaginary populations are also imaginary.” (http://www.stat.berkeley.edu/~census/berk2.pdf)

Erikson:

I think this came up earlier. It’s the random assignment of treatment that is assumed to permit the computation of the p-value. The question of generalization is distinct.

Mayo,

Indeed, in case of experiments the p-value is valid for causal reasoning.

But my questions above are about observational studies in particular – I guess I haven’t stated it clearly.

Erikson: The p-value refers to the formal null model. There is nothing “unreal” or “unactual” about it in the sense that it specifies a probability for an extreme test result under the null model that can be used for making a statement about how well the null model fits observed reality (regarding the aspect that is tested).

Personally I don’t like discussing this in terms of a difference between “nominal” and “actual” error rates because I don’t know what the latter are supposed to be.

The issue with non-probabilistic samples is rather with the interpretation of the formal models. A significant p-value indicates that the null model is inappropriate, but the reason for this could be bias from the sample rather than any substantial effect, which is how many people would like to interpret it.

This assumes that the p-value was not obtained by any kind of data dredging, hunting for significance etc., in which case one indeed could define “actual error rates” that differ from the nominal ones by setting up combined null models and analysing what exactly was done.

A paper by Fraser (March 2016) on p-values that is sophisticated and sane.