A puzzle about the latest test ban (or ‘don’t ask, don’t tell’)



A large number of people have sent me articles on the “test ban” of statistical hypothesis tests and confidence intervals at a journal called Basic and Applied Social Psychology (BASP)[i]. Enough. One person suggested that, since it came so close to my recent satirical Task force post, I either had advance knowledge or some kind of ESP. Oh please, no ESP required. None of this is the slightest bit surprising, and I’ve seen it before; I simply didn’t find it worth blogging about. Statistical tests are being banned, say the editors, because they purport to give probabilities of null hypotheses (really?) and do not, hence they are “invalid”.[ii] (Confidence intervals are thrown in the waste bin as well, also claimed “invalid”.) “The state of the art remains uncertain” regarding inferential statistical procedures, say the editors. I don’t know, maybe some good will come of all this.

Yet there’s a part of their proposal that brings up some interesting logical puzzles, and logical puzzles are my thing. In fact, I think there is a mistake the editors should remedy, lest authors be led into disingenuous stances, and strange tangles ensue. I refer to their rule that authors be allowed to submit papers whose conclusions are based on allegedly invalid methods so long as, once accepted, they remove any vestiges of them!

“Question 1. Will manuscripts with p-values be desk rejected automatically?

Answer to Question 1. No. If manuscripts pass the preliminary inspection, they will be sent out for review. But prior to publication, authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about “significant” differences or lack thereof, and so on).”

Now if these measures are alleged to be irrelevant and invalid instruments for statistical inference, then why should they be included in the peer review process at all? Will reviewers be told to ignore them? That would only seem fair: papers should not be judged by criteria alleged to be invalid, but how will reviewers blind themselves to them? It would seem the measures should be excluded from the get-go. If they are included in the review, why shouldn’t the readers see what the reviewers saw when they recommended acceptance?

But here’s where the puzzle really sets in. If the authors must free their final papers from such encumbrances as sampling distributions and error probabilities, what will be the basis presented for their conclusions in the published paper? Presumably, from the notice, they are allowed only mere descriptive statistics or non-objective Bayesian reports (added: actually can’t tell which kind of Bayesianism they allow, given the Fisher reference which doesn’t fit*). Won’t this be tantamount to requiring that authors support their research in a way that is either (actually) invalid, or has little to do with the error statistical properties that were actually reported and on which the acceptance was based?[ii]

Maybe this won’t happen because prospective authors already know there’s a bias in this particular journal against reporting significance levels, confidence levels, power etc., but the announcement says they are permitted.

Translate P-values into euphemisms

Or might authors be able to describe p-values only using a variety of euphemisms, for instance: “We have consistently observed differences such that, were there no genuine effect, then there is a very high probability we would have observed differences smaller than those we found; yet we kept finding results that could almost never have been produced if we hadn’t got hold of a genuine discrepancy from the null model.” Or some such thing, just as long as the dreaded “P-word” is not mentioned? In one way, that would be good; the genuine basis for why and when small p-values warrant indications of discrepancies should be made explicit. I’m all for that. But, in all likelihood such euphemisms would be laughed at; everyone would know the code for “small p-value” when banned from saying the “p-word”, so what would have been served?

Or, much more likely, rewording p-values wouldn’t be allowed, so authors might opt to:

Find a way to translate error-statistical results to Bayesian posteriors?

They might say something like: “These results make me assign a very small probability to the ‘no effect’ hypothesis”, even though their study actually used p-values and not priors? But a problem immediately arises. If the paper is accepted based on p-values, then if they want to use priors to satisfy the editors in the final publication, they might have to resort to the uninformative priors that the editors have also banned [added: again, on further analysis, it’s unclear which type of Bayesian priors they are permitting as “interesting” enough to be considered on a case by case basis, as the Fisher genetics example supports frequentist priors]. So it would follow that unless authors did a non-objective Bayesian analysis first, the only reasonable thing would be for the authors to give, in their published paper, merely a descriptive report.[iii]

Give descriptive reports and make no inferences

If the only way to translate an error statistical report into a posterior entails illicit uninformative priors, then authors can opt for a purely descriptive report. What kind of descriptive report would convey the basis of the inference if it was actually based on statistical inference methods? Unclear, but there’s something else. Won’t descriptive reports in published papers be a clear tip-off for readers that p-values, size, power or confidence intervals were actually used in the original paper? The only way they wouldn’t be is if the papers were merely descriptive from the start. Will readers be able to find out? Will they be able to obtain those error statistics used or will the authors not be allowed to furnish them? If they are allowed to furnish them, then all the test ban would have achieved is the need for a secret middle level source that publishes the outlawed error probabilities. How does this fit with the recent moves toward transparency, shared data, even telling whether variables were selected post hoc, etc.? See “upshot” below. This is sounding more like “don’t ask, don’t tell!”

To sum up this much.

If papers based on error statistics are accepted, then the final published papers must find a different way to justify their results. We have considered three ways, either:

  1. using euphemisms for error probabilities,
  2. merely giving a descriptive report without any hint of inference, or
  3. translating what was done so as to give a (non-default? informative? non-nonsubjective?) posterior probability.

But there’s a serious problem with each.

Consider #3 again. If they’re led to invent priors that permit translating the low p-value into a low prior for the null, say, then won’t that just create the invalidity that was actually not there at all when p-values were allowed to be reported as p-values? If they’re also led to obey the ban on non-informative priors, mightn’t they be compelled to employ (or assume) information in the form of a prior, say, even though that did not enter their initial argument? You can see how confusing this can get. Will the readers at least be told by the authors that they had to change the justification from the one used in the appraisal of the manuscript? “Don’t ask, don’t tell” doesn’t help if people are trying to replicate the result thinking the posterior probability was the justification when in fact it was based on a p-value. Each generally has different implications for replication. Of course, if it’s just descriptive statistics, it’s not clear what “replication” would even amount to.

What happens to randomization and experimental design?

If we’re ousting error probabilities, be they p-values, type 1 and 2 errors, power, or confidence levels, then shouldn’t authors be free to oust the methods of experimental design and data collection whose justification is in substantiating the “physical basis for the validity of the test” of significance? (Fisher, DOE 17). Why should they go through the trouble of experimental designs whose justification is precisely to support an inference procedure the editors deem illegitimate?


It would have made more sense if the authors were required to make the case without the (allegedly) invalid measures from the start. Maybe they should correct this. I’m serious, at least if one is to buy into the test ban. Authors could be encouraged to attend to points almost universally ignored (in social psychology) when the attention is on things like p-values, to wit: what’s the connection between what you’re measuring and your inference or data interpretation? (Remember unscrambling soap words and moral judgments?) [iv] On the other hand, the steps toward progress are at risk of being nullified.

See out damned pseudoscience, and Some ironies in the replication crisis in social psychology

The major problems with the uses of NHST in social psych involve the presumption that one is allowed to go from a statistical to a substantive (often causal!) inference (never mind that everyone has known this fallacy for 100 years), invalid statistical assumptions (including questionable proxy variables), and questionable research practices (QRPs): cherry-picking, post-data subgroups, barn-hunting, p-hacking, and so on. That these problems invalidate the method’s error probabilities was the basis for deeming them bad practices!

Everyone can see at a glance (without any statistics) that reporting a lone .05 p-value for green jelly beans and acne (in that cartoon), while failing to report the 19 other colors that showed no association, means that the reported .05 p-value is invalidated! We can grasp immediately that finding 1 of 20 with a nominal p-value of .05 is common, not rare, by chance alone. Therefore, it shows directly that the actual p-value is not low as purported! That’s what an invalid p-value really means. The main reason for the existence of the p-value is that it renders certain practices demonstrably inadmissible (like this one). They provably alter the actual p-value. Without such invalidating moves, the reported p-value is very close to the actual! Pr(p-value < .05; null) ≈ .05. But these illicit moves reveal themselves in invalid p-values![v] What grounds will there be for transparency about such cherry-picking now, in that journal?
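The jelly bean arithmetic can be made concrete with a short sketch. It is only an illustration under stated assumptions: the 20 tests are treated as independent, every null is taken to be true, and each p-value is taken to be uniform on (0, 1) under its null. The function names are hypothetical, not from any library.

```python
import random


def chance_of_nominal_hit(num_tests: int, alpha: float) -> float:
    """Exact probability that at least one of num_tests independent
    true-null tests yields p < alpha (assumes uniform p-values)."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_nominal_hit(num_tests: int, alpha: float,
                         num_studies: int, seed: int = 0) -> float:
    """Monte Carlo check: fraction of simulated studies in which the
    smallest of num_tests null p-values falls below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_studies):
        # Under a true null (continuous test statistic), p ~ Uniform(0, 1).
        p_values = [rng.random() for _ in range(num_tests)]
        if min(p_values) < alpha:
            hits += 1
    return hits / num_studies


if __name__ == "__main__":
    exact = chance_of_nominal_hit(20, 0.05)
    simulated = simulate_nominal_hit(20, 0.05, num_studies=20_000)
    print(f"exact: {exact:.3f}, simulated: {simulated:.3f}")
```

Under these assumptions the chance of at least one nominal .05 result among 20 colors is roughly .64, not .05, which is the sense in which the reported p-value differs from the actual one.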

Remember that bold move by Simmons, Nelson and Simonsohn? (See “statistical dirty laundry” post here). They had called on researchers to “just say it”: “If you determined sample size in advance, say it. If you did not drop any variables, say it. If you did not drop any conditions, say it.”

The new call, for this journal at least, will be: “If you used p-values, confidence intervals, size, power, sampling distributions, just don’t say it”.[vi]


*See my comment on this blog concerning their Fisher 1973 reference.

[i]An NSF Director asked for my response but I didn’t have one for dissemination. They sent me the ASA response.

[ii]Allowing statistical significance to go directly to substantive significance, as we often see in NHST, is invalid; but there’s nothing invalid in the correct report of a p-value, as used, for instance, in the recent discovery of the Higgs particle (search blog for posts), in showing that hormone replacement therapy increases risks of breast cancer (unlike what observational studies were telling us for years), or that Anil Potti’s prediction model, on which personalized cancer treatments were based, was invalid. Everyone who reads this blog knows I oppose cookbook statistics, and knows I’d insist on indicating discrepancies passed with good or bad severity, insist on taking account of a slew of selection effects, and violation of statistical model assumptions, especially links between observed proxy variables in social psych and the claims inferred. Alternatives to the null are made explicit, but what’s warranted may not be the alternative posed for purposes of getting a good distance measure, etc etc. (You can search this post for all these issues and more.)

[iii]Can anyone think of an example wherein a warranted low Bayesian probability of the null hypothesis—what the editors seek—would not have corresponded to finding strong evidence of discrepancy from the null hypothesis by means of a low p-value, ideally with corresponding discrepancy size? I can think of cases where a high posterior in a “real effect” claim is shot down by a non-low p-value (once selection effects, and stopping rules are taken account of) but that’s not at issue, apparently.

[iv]I think one of the editors may have had a representative at the Task force meeting I recently posted.

An aside: These groups seem to love evocative terms and acronyms. We’ve got the Test Ban (reminds me of when I was a kid in NYC public schools and we had to get under our desks) of NHSTP at BASP.

[v] Anyone who reads this blog knows that I favor reporting the discrepancies well-warranted and poorly warranted and not merely a p-value. There are some special circumstances where the p-value alone is of value. (See Mayo and Cox 2010).

[vi] Think of how all this would have helped Diederik Stapel.

Categories: P-values, reforming the reformers, Statistics

73 thoughts on “A puzzle about the latest test ban (or ‘don’t ask, don’t tell’)”

  1. Steven McKinney

    Thanks for tackling this subject Mayo.

    Though it is tedious for you, it is at times such as this that we really need your philosophical thinking and exposition, to help the rest of us think more clearly about this silly proclamation and others of a similar nature.

    “The state of the art remains uncertain” regarding inferential statistical procedures, say the editors.

    Editors who say such a thing, and are not statisticians, should take a step back and convene some kind of panel, including statisticians, to try and figure out what they are uncertain about and how to work towards certainty.

    As you well describe here, the problem is not the statistical procedures themselves, as developed and described by various genius thinkers over the last 300 years. The problem is the recent explosion of higher education graduates, with little statistical training, cranking out reams of papers, and handling the interpretation of statistical findings very poorly indeed.

    Thus a sensible group of editors would work towards improving knowledge of proper handling of statistical evidence, and towards encouraging research groups in their field to engage with statisticians more often; rather than stepping outside their bailiwick and making ridiculous decisions.

    Clear descriptions of problematic practices such as p-hacking, using examples and terminology reflective of the editors’ field, and recommended improvements are what is needed here, not a goofy ban on methodologies developed over centuries to deal with precisely the problems facing so many scientific fields these days.

    • Steven: I thank you for your remark, and it’s not that it’s tedious, it’s just that I saw no reason to add to their self promotion based on confusions. Yet officials from the ASA and RSS have responded, so I guess they achieved this goal. Only thing, now that people will look more closely, they might be aghast at what they find, based on what Trafimow sent me.

  2. Well, I’d like to ban p-values too, but obviously they have no idea of what they are talking about.

    In terms of confidence intervals, their comment is one commonly heard, but which is in my view silly. I’d parody their remark by saying, “The probability of heads in a coin toss is not 1/2. That’s not a probability; instead, it’s the proportion of heads we’d get in infinitely many tosses of the coin.” 🙂

  3. David Judkins

    Satisfying read. Thank you.

  4. Mark

    I was thinking the same thing about randomized trials; in fact, I think this should preclude publishing the results from any randomized trial in this journal.

    A friend of mine said “I assume they won’t allow the reporting of SEs, either…” And if they’re consistent, they shouldn’t! Sampling distributions be damned!

    A plus side for many authors I work with… this journal would also now have to ban all power calculations. Folks I work with just hate it when I tell them they need to enroll more participants than they can afford, now they simply don’t have to worry about it! Happy day!

    • Thanks for this Mark. They seem not to grasp that these concepts are of a piece. But then again, if they actually thought p-values were supposed to give posterior probabilities of some type (we’re not told) then I guess the “state of the art” really is a problem for them. They should say what their desirable posterior probability means: I strongly believe in the hypothesis, very often with data like this the hypothesis is true? Or what?
      But my main point here is they should fix this logical conundrum.

      • newtostats

        I’ve been reading your blog and tweets for a while now and have seen a number of comments like these:

        “I strongly believe in the hypothesis, very often with data like this the hypothesis is true? Or what?”

        “1st says it gives degs of belief, 2nd won’t tell you; just a formal comp to get a posterior.”

        “So much easier to just declare “I believe in this effect with all my heart” and tenaciously cling to it. No data can falsify you!”

        A uniform prior on a parameter is saying “based on everything known beforehand, the true parameter could reasonably be anything, therefore we’ll give full consideration to all possibilities without favorites one way or another, and let the data narrow the posterior down on the true value”.

        This seems intuitive, unexceptional, un-mysterious, fully compatible with objective science and not related to beliefs at all. I’m at a loss to explain why you or anyone else is having difficulty with it.

        • NTS: Firstly, these are taken out of different contexts and had different purposes. The first was trying to give the best spin on an informative but non-subjective prior. The second was reflecting the fact that nonsubjective Bayesians often say the prior is uninterpreted, it’s merely a computation to get a posterior. The third, I’m fairly sure, was a tweet responding to a tweet that said “die P-value, die”. But this tweeter and I always joke around, we’re not even far apart on most issues. I don’t really like twitter and should likely stop using it.

          On uninformative priors, two points: (1) There’s more than one way to try to express an uninformative prior, and what appears uninformative in one parametrization becomes informative in another; in short, there is no uninformative prior, as nonsubjective Bayesians clearly admit. (2) This journal was also banning uninformative priors as I understand it. Is this not so?

          Moreover, using certain uninformative priors can enable a match with error probabilities, but that’s not desirable from my and many other people’s perspective, i.e., we don’t want to assign a non-null .95 degree of (nonsubjective) belief based on a .05 stat sig result. A much weaker claim is intended by a correct construal of a stat sig result (assuming of course that it’s actual and not merely nominal, and model assumptions OK).

          It seems to me that those in earnest about wanting a nonsubjective posterior assessment that is not frequentist are after something like: this claim is well warranted, well-tested, flaws well ruled out or some such thing. If it’s something else they should tell us.
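The parametrization point in (1) above can be illustrated with a small sketch. The choice of a parameter theta on (0, 1) and the transformation phi = theta squared is an assumption made purely for illustration; neither appears in the editors' notice or in this thread, and the function names are hypothetical.

```python
import math


def prob_phi_at_most(c: float) -> float:
    """P(phi <= c) when theta ~ Uniform(0, 1) and phi = theta**2.
    Since phi <= c exactly when theta <= sqrt(c), this is sqrt(c)."""
    return math.sqrt(c)


def prob_theta_at_most(t: float) -> float:
    """P(theta <= t) when instead phi ~ Uniform(0, 1) and theta = sqrt(phi).
    Since theta <= t exactly when phi <= t**2, this is t**2."""
    return t ** 2


if __name__ == "__main__":
    # A prior flat in theta puts half its mass on phi <= 0.25 ...
    print(prob_phi_at_most(0.25))    # 0.5
    # ... while a prior flat in phi puts only a quarter on theta <= 0.5.
    print(prob_theta_at_most(0.5))   # 0.25
```

A prior that is flat in theta therefore favors small values of phi, and a prior that is flat in phi favors large values of theta, so "flat" depends on which parametrization is chosen.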

          • newtostats

            My point was priors have been explained extensively, I’ve read dozens and dozens of authors explaining them in detail. It’s not a state secret what a non-subjective prior is. For the simplicity of a blog comment I used a uniform prior, but could have been more general if writing a book or giving a lecture.

            “There’s more than one way to try to express an uninformative prior, and what appears uninformative in one parametrization becomes informative in another”

            I never mentioned “uninformative”, but this is as it should be. If you transform an entire parameter space to the value 7, then while you may be uninformed about the parameter, you are simultaneously highly informed about the transformed value (it’s exactly 7). Being clueless about one space doesn’t contradict being informed about a transformed space.

            Either way this is not an objection to the explication given for what the prior means (I never mentioned “uninformed”). Nor do I see any denials that the stated explanation makes sense, is compatible with objective science, and has nothing to do with beliefs. Saying Bayesian and classical statistics agree sometimes numerically is not an argument against Bayes.

            • Well, but I noticed you haven’t told us the desired meaning for the non-subjective, non-frequentist prior/posterior, nor what’s wrong with my attempts. It would be good to know what’s being talked about, what’s to be tested, if it is, etc. Then maybe it can be clearer why an account that applies w/o priors can be “valid” and not “invalid”.

              • newtostats

                I did. Once again “based on everything known beforehand, the true parameter could reasonably be anything, therefore we’ll give full consideration to all possibilities without favorites one way or another, and let the data narrow the posterior down on the true value”.

                Clearly this is not the most general case, but forgetting that for a moment, what is your objection to it?

  5. Perhaps most curious of all is that the editors reference Fisher as the justification for reserving the right to consider Bayesian solutions on a case by case basis. That’s because “there have been Bayesian proposals that at least somewhat circumvent the Laplacian assumption, and there might even be cases where there are strong grounds for assuming that the numbers really are there (see Fisher, 1973, for an example).”

    Fisher 73 is Statistical Methods and Scientific Inference. Fisher, of course, is the arch anti-Bayesian (unlike, say, Neyman). Is it the fiducial stuff they’re referring to? But there are no priors there. If it’s the Michell example, it appears the editors are drawing the reverse inference from Fisher.*

    Fisher is claiming that even if we imagine that “a datum were added to Michell’s problem to the effect that it was a million to one a priori that the stars should be scattered at random. We need not consider what such a statement of probability a priori could possibly mean in the astronomical problem…” If this datum were introduced then “a probability statement could be inferred a posteriori, to the effect that the odds were 30 to 1 that the stars really had been scattered at random. The inherent improbability of what has been observed being observable on this view still remains in our minds, and no explanation has been given of it. It has been overweighted, not neutralized, by the even greater supposed improbability of the universe chosen for examination being of the supposedly exceptional kind in which the stars are not distributed at random. The observer is thus not left at all in the same state of mind as if the stars had actually displayed no evidence against a random arrangement…The example shows that the resistance felt by the normal mind to accepting a story intrinsically too improbable is not capable of finding expression in any calculation of probability a posteriori.” (Fisher 1973, 42-3)

    Moreover “if this datum were considered as a hypothesis, it would be rejected at once by the observations at a level of significance almost as great as the hypothesis ‘The stars are really distributed at random’, was rejected in the first instance.” (Fisher 1973, 44)

    In other words, even if we granted an extremely low a priori probability to the null (random distribution of stars), it would only show the grounds we actually use to reject the null—i.e., the significance level—fail to be captured by the posterior probability. Moreover, Fisher is saying, the significance level would actually be grounds for rejecting the a priori probability—were that to be treated as a hypothesis to be tested by the data!

    No doubt “there might even be cases where there are strong grounds for assuming that the numbers really are there” as the editors say, but I don’t think you’ll find them in Fisher.
    *Actually, it’s the mice, see comment below

    • newtostats

      Suppose there’s no other past data/evidence for any hypothesis other than the current data.

      Then given two hypotheses H_notaccord and H_accord, if the first doesn’t accord with the data in the sense that P(observed data |H_notaccord) is low, while P(observed data|H_accord) is high, then Bayes Theorem will favor H_accord.

      Given two hypotheses H_strongtest and H_weaktest which both accord with the data in the above sense, but the former makes sharp predictions which would rarely turn out to be right accidentally, while the latter makes diffuse predictions which are consistent with almost any outcome, then Bayes Theorem will favor H_strongtest.

      • NTS: It would be good to use better known terms such as likelihood ratios and such. But if I’m understanding your gist, the hypothesis that would very rarely agree so well with the data “accidentally” would be the better tested hypothesis on Neyman-Pearson criteria. It would be too easy for the diffuse hypothesis to attain agreement, even if false. Are you saying you introduce a prior in order to achieve this familiar testing result? e.g., some people will give hypotheses with adjustable parameters low priors, but this does not turn out to be in sync with what scientists do. Most importantly, it isn’t merely the diffuseness of a hypothesis that can make it too easily in accord with data; there are a host of other procedures that can have the same effect, e.g., stopping when the data look good, cherry picking, modifying the variables or test statistic, etc. All these are picked up by error probabilities and result in the hypothesis that is “too easy to fit data, even if false” passing a test with very low stringency or severity.

        • newtostats

          Let me be more specific. First P(H_strongtest)=P(H_weaktest)=.5 (that’s the no past data/evidence remark).

          Second, the observed data accords with both. Specifically we could take this to mean the observed data lay in the shortest (least volume) 95% Bayesian credibility region of both P(data |H_strongtest) and P(data |H_weaktest).

          Third, if P(data |H_strongtest) makes sharp predictions for the data so there’s little chance of a data set accidentally according with H_strongtest, while P(data | H_weaktest) is so diffuse there is a high chance most any data set will accord with H_weaktest, then P(H_strongtest | observed data) will be much higher than P(H_weaktest |observed data).
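The three steps above can be turned into a small numeric sketch. Everything specific in it is an assumption chosen for illustration and is not taken from the comment: a single observation, normal sampling distributions centered at 0 with standard deviation 1 (sharp) and 10 (diffuse), and an observed value of 0.5. The function names are hypothetical.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_of_first(like_first: float, like_second: float,
                       prior_first: float = 0.5) -> float:
    """Bayes' theorem for two exhaustive hypotheses:
    returns P(first | data) given the two likelihoods and the prior."""
    numerator = prior_first * like_first
    return numerator / (numerator + (1.0 - prior_first) * like_second)


if __name__ == "__main__":
    x_obs = 0.5                          # illustrative observation
    sharp_sd, diffuse_sd = 1.0, 10.0     # illustrative spreads

    # Both hypotheses "accord" with the data: x_obs lies inside the
    # central 95% region (|x| <= 1.96 * sd) of each distribution.
    assert abs(x_obs) <= 1.96 * sharp_sd
    assert abs(x_obs) <= 1.96 * diffuse_sd

    post_sharp = posterior_of_first(
        normal_pdf(x_obs, 0.0, sharp_sd),
        normal_pdf(x_obs, 0.0, diffuse_sd),
    )
    print(f"P(H_strongtest | data) = {post_sharp:.3f}")
    print(f"P(H_weaktest   | data) = {1.0 - post_sharp:.3f}")
```

With equal priors and these made-up numbers, the sharp hypothesis ends up with a posterior of roughly .9. The sketch does not address selection effects or stopping rules, which leave these likelihoods unchanged.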

          • John Byrd

            Congratulations, you are becoming an error statistician.

            • newtostats

              I didn’t do anything. It’s a straightforward consequence of the sum and product rules of probability theory. It’s true no matter what I think. It would be true even if I had never existed.

              To be an error statistician, you have to dump the sum/product rules, which are used implicitly in every calculation ever made in statistics, whenever they contradict an error statistician’s intuition. I’m unwilling to do that since the sum/product rules have never once let me down (they’ve been put to severe tests!) and I see no reason to think Mayo’s or John Byrd’s intuitions are infallible.

              • John Byrd

                NTS: Suppose the posterior under the strong test is 0.70 and under the weak test is 0.55. What does it mean to you? How do you use this information?

                • newtostats

                  The short answer is the evidence considered is more consistent with the former than the latter. The longer answer is that if you examine all the remaining possibilities for the truth consistent with that (partial) evidence more of them agree with the former hypothesis than the latter.

                  • john byrd

                    Are the remaining possibilities alternate hypotheses?

                    • newtostats

                      No. If a hypothesis/evidence determined the thing you cared about exactly (deductively) then you wouldn’t need statistics at all. It’s only because the hypothesis leaves some indeterminacy (i.e., non-uniqueness, a range of possibilities) in the thing you care about that you use probabilities.

                      Looking over your comments, and Mayo’s, as well as some others, it’s clear you only conceive of two viewpoints. Probabilities are either “beliefs” or “frequencies”. Unfortunately, that’s not where all the action is. The real deal is a third viewpoint. Probabilities model indeterminacy, i.e. the multitude of possibilities consistent with the evidence, rather than the frequency with which you’ll see those possibilities. It’s more general, more useful, more realistic, simpler in several senses, and contains none of the problems or difficulties of other approaches.

                      Now you and Mayo may deny this, but it’s impossible to discuss without talking past each other unless you all take the time to understand this view.

                    • I’m not sure where “newtostats” (who is really old to this blog) gets the ideas he is suggesting I/we hold. I deny “probabilities are [in the sense of must be] either ‘beliefs’ or ‘frequencies’”. The evidence is indeed consistent with a multitude of hypotheses (claims, models), but I’ve no clue what it means to say that probabilities model this multitude on his view of probability. Nor do I see why he thinks a frequentist assigns frequencies to the multitude of hypotheses (models or claims), unless one is involved in a very special kind of modeling.

                      Statistical hypotheses assign probabilities to outcomes. In statistical inference one infers from data x to statistical hypotheses that model aspects of the source of the data. The inferences are qualified and assessed by how good a job the inference process has done, e.g., by the capabilities the inference method has to avoid erroneous interpretations about the source of the data. In formal statistics these are given by error probabilities. They are influenced and altered by biases, selection effects, cherry-picking, stopping rules, and violated assumptions of the statistical model employed.

                    • john byrd

                      “If a hypothesis/evidence determined the thing you cared about exactly (deductively) then you wouldn’t need statistics at all. It’s only because the hypothesis leaves some indeterminacy (i.e., non-uniqueness, a range of possibilities) in the thing you care about that you use probabilities.”

                      Suppose the hypothesis I care about is that these are fair dice? I think I need statistics to address it. You don’t?

                      (I never said probabilities are only frequencies or beliefs, by the way. Mayo has said that is certainly not a legitimate dichotomy.)

      • Mark

        But what would Jaynes say, I wonder?

    • Mark

      Actually, Fisher wasn’t quite an arch-anti-Bayesian. In both Design of Experiments and Scientific Inference, he endorsed objective Bayesian methods (or, Bayesian methods with objective priors). I can’t point to page numbers at the moment, but I very recently reread both books and was a bit shocked at his consistency in this regard.

      He was also very clear that fiducial methods required that there be *no* prior knowledge.

      • Mark: When you find the pages let me know.

        • Mark

          Will do… Won’t be until Monday, though, as my copy is in my office.

  6. Clark Glymour

    Why don’t they go all the way and just ban data?

    • Yes, the Stapel method. You’d think they’d at least insist on likelihood functions.

    • haha 😀 First step to ban science altogether! 🙂

  7. Incidentally, I was not intending for this post to be at all humorous. It’s a serious puzzle. I resisted the easy ways to make it jokey. I’m completely and entirely serious here.

  8. I find it interesting that a field so poorly related to Mathematics as Psychology happens to gather so many strong opinions against p-values.

    Seems to me they want to go back to old business-style decision making; that is, give me a few pie charts and don’t trouble my superior knowledge of the business with your mathematics.

    Well, if you cannot measure something you cannot call it science. I don’t remember if it was Wittgenstein or Russell (or someone else with similar caliber) that complained about Freud’s interpretation of dreams not being science because dreams cannot be objectively subject to science.

    I have the feeling these magazines want to turn Psychology into a Philosophy of Mind where everyone’s opinions go as long as they have the right pie chart.

    • newtostats

      It’s likely more of a reaction to widespread suspicion that nearly all psychological research is wrong. Bans are a sign of intellectual weakness (you don’t see physicists banning anything) and are always a bad idea and a bad sign.

    • Fran: That’s one of the reasons Meehl was so hard on the significance testers: they appeared to satisfy Popper’s criterion for science, whereas his cherished Freudianism did not. I think there was something right about his feeling that he was onto deeper insights with Freud than many of the statistical experiments in psych he was calling out. Remember though that Popper also considered that Freudian theory would become testable some day. Meehl thought so too.

      • Indeed, though other giants have stronger opinions about what constitutes science, here is a really interesting opinion from Feynman 🙂

        • He’s terrific in “Cargo Cult Science”. It’s just what we need now. I wonder if I can watch this on the plane?

  9. Things get scarier when you see what these editors really think. David Trafimow sent me a paper of his from Psych Review (linked below). In the first 2 pages we get the following.
    “1. Propose a hypothesis to be (hopefully) supported.”
    Let’s call this H’.
    In the next line we hear that H’ and the null Ho are supposed to be mutually exclusive and EXHAUSTIVE.
    So if only “[p(F|Ho) = 0], then the logic of NHSTP would be unassailable and provable by the following reasoning:”
    We then get a disjunctive syllogism from the fact that Ho is ruled out, and (H’ V Ho) is a tautology to H’!

    So it follows that in the many examples where the null is known to be false, that it is unassailable logic to infer the research hypothesis H’!
    If I had any doubts that this was how many psych people view NHST, this clinches it. Here it is:

    Click to access bayes-2003.pdf

    No time to read the rest, but this is pretty surprising given that he cites many of the critics who repeatedly admonish testers for pretending that their research hypothesis H’, generally causal, is the logical denial of the null hypothesis.

    • newtostats

      I’m not understanding your fear. In practice most of the time you have two hypotheses like “H0: parameter less than C” and “H1: parameter greater than or equal to C”. They form a mutually exclusive and exhaustive pair, since they don’t intersect and the parameter has to be on the Real line somewhere. There’s nothing objectionable or unusual about what Trafimow is saying as far as that goes.

      An example where you don’t have mutually exclusive might be:

      H0: parameter less than 1
      H1: parameter greater than 0

      An example where you don’t have exhaustive might be:

      H0: parameter less than -1
      H1: parameter greater than +1
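The cases above can be checked mechanically. What follows is only a minimal sketch, assuming each hypothesis is a one-sided ray on the real line; the names `Ray` and `classify` are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ray:
    """A one-sided hypothesis about a real parameter (hypothetical helper)."""
    bound: float
    closed: bool  # True if the bound itself is included (<= or >=)


def classify(lower: Ray, upper: Ray) -> tuple[bool, bool]:
    """Return (mutually_exclusive, exhaustive).

    `lower` is the hypothesis "parameter < bound" (or <= if closed);
    `upper` is the hypothesis "parameter > bound" (or >= if closed).
    """
    if lower.bound < upper.bound:
        # A gap lies between the rays: no overlap, but values are left out.
        return (True, False)
    if lower.bound > upper.bound:
        # The rays overlap on an interval, so together they cover the line.
        return (False, True)
    # Same bound: everything depends on who owns the boundary point.
    both_closed = lower.closed and upper.closed
    either_closed = lower.closed or upper.closed
    return (not both_closed, either_closed)


# H0: parameter < C, H1: parameter >= C (here C = 0)
print(classify(Ray(0, False), Ray(0, True)))
# H0: parameter < 1, H1: parameter > 0
print(classify(Ray(1, False), Ray(0, False)))
# H0: parameter < -1, H1: parameter > +1
print(classify(Ray(-1, False), Ray(1, False)))
```

Note that with strict inequalities on both sides of the same bound the pair is exclusive but leaves out the boundary point, so it is exhaustive only if one side includes it.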

      • In N-P theory you do, but that doesn’t get to the research hypothesis e.g., about causes.

        • newtostats

          I reread your comment carefully again, but I’m not getting it. The most common situation is where H0 and H1 are exhaustive. If you show somehow that “H0: parameter less than C” is false, it has to be that “H1: parameter is greater than C” is true.

          Yet in your comment you write “So it follows that in the many examples where the null is known to be false, that it is unassailable logic to infer the research hypothesis H’!” as though it’s scandalous anyone would say that. Please explain.

          • NTS: If the null and alternative are exhaustive, as with embedded N-P tests then
            unlike what the editor writes or implies:
            (1) you don’t need a p-value of 0 to warrant inferring evidence of a discrepancy from the null (we’re assuming it’s an actual not merely a nominal p-value);
            (2) You do not, even in that case, have grounds for assigning a posterior probability to the alternative (about the parameter)
            (3) You’d need a distinct step, nearly always, to infer a substantive research hypothesis H’ of the sort that is typical in psychology and other sciences. That is, H’ goes beyond the mere assertion of a discrepancy from null; but if it doesn’t go beyond that, then a well-founded significance test does warrant the (rather weak) claim of evidence of the existence of a genuine effect or discrepancy from the null.

            • newtostats

              I don’t know. If there’s strong evidence the percentage of students with autism isn’t greater than 10%, I’m inclined to think that does mean it’s less than 10%. What do you think it means?

              • NTS: Sure, the issue that’s crucial is: what counts as strong evidence? That is to say, what is the meaning of strong evidence? I have an account of evidence based on the idea, to begin with, that finding data x in agreement with a claim fails to count as good evidence for H if the method was guaranteed to find agreement, whether or not H was true. That’s a fairly obvious idea of H being immune to data, tunnel vision, verification bias or the like. There are many ways to repress the recognition of unwanted/unfavorable data. That H has failed to pass a stringent test, however, doesn’t mean H is improbable, implausible, or the like. The justification or warrant for a claim, in my philosophy of science, is not a matter of probabilifying claims or making them firmer in some sense. Take a look at my previous blogpost, just before this one, on “probabilism as an obstacle to fraudbusting.”

  10. Actually, it turns out the example from Fisher to which the editor(s) are alluding is a case where Fisher mentions an example where an ordinary (frequentist) computation of Bayes’s rule is possible. Suppose it is known that a black mouse has “prior to any test-mating in which it may be used, a known probability of 1/3 of being homozygous, and of 2/3 of being heterozygous. If, therefore, on testing with a brown mate it yields seven offspring, all being black, we have a situation perfectly analogous to that set out by Bayes.” (Fisher 1973, p. 19). But Fisher goes on to say this is almost never our situation with regard to scientific inference. No one doubts you can apply Bayes’s rule to such events.
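The arithmetic behind the mouse example can be sketched as follows. This is a hedged illustration, not Fisher's own presentation: it assumes the usual Mendelian setup, in which a homozygous black mouse mated with a brown one always yields black offspring and a heterozygous one yields black offspring with probability 1/2 each, independently. The function name `posterior_homozygous` is hypothetical.

```python
from fractions import Fraction


def posterior_homozygous(prior_hom: Fraction, n_black: int) -> Fraction:
    """Posterior probability of homozygosity after n_black black offspring.

    Assumed likelihoods (standard Mendelian reading of the example):
      P(all black | homozygous)   = 1
      P(all black | heterozygous) = (1/2) ** n_black
    """
    prior_het = 1 - prior_hom
    like_hom = Fraction(1)
    like_het = Fraction(1, 2) ** n_black
    numerator = prior_hom * like_hom
    return numerator / (numerator + prior_het * like_het)


# Prior 1/3 homozygous, 2/3 heterozygous; seven offspring, all black.
print(posterior_homozygous(Fraction(1, 3), 7))  # 64/65
```

The priors here are known frequencies from the breeding setup, which is why the computation is uncontroversial on either side of the debate.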

    • Mark

      Yes, exactly. Fisher was also adamant that he (and Bayes himself, actually) was an objectivist when it came to probability (as was David Freedman, and as am I). Although in at least one of the examples that I’m thinking of from his book, the “objective” prior that he had in mind seemed a little hairier to me.

      • Mark: There’s an equivocation nowadays in speaking of being an “objectivist” especially in the same sentence with Bayes, because that could mean frequentist or it could mean default-style Bayesian priors. I’m pretty sure that Freedman was not the latter.

        • Mark

          Agree, but no equivocation on my part (nor, it would seem to me, on Fisher’s or Freedman’s, although I’d hesitate to put words in their mouths… I’ll let them speak for themselves). In DOE (9th ed, 1971, p.6), Fisher said “advocates of inverse probability seem forced to regard mathematical probability, not as an objective quantity *measured* by observable frequencies, but as measuring merely psychological tendencies.”

          Then, in talking about Bayes, in SMSI (3rd. ed., 1973, pp. 13-16) he said “some have insisted that [probability] should properly be used only for the expression of a state of rational judgement based on sufficient objective evidence, while others have thought that equality of probability may be asserted merely from the indifference of … the objective evidence, if there were none. … Bayes evidently held the first of these opinions … his definition is therefore equivalent to the limiting value of the relative frequency of success.” And, “on the contrary, Laplace … manifestly inclined to the second view.” And, “I have, for myself, no doubt that Bayes’ definition is the more satisfactory.”

          As for Freedman (https://mangellabs.soe.ucsc.edu/sites/default/files/16/freedman_antibayes.pdf), he said “objectivists hold that probabilities are inherent properties of the system being studied.” Yes, this is exactly what I hold (and seems consistent with Popper’s propensity theory, which I dig). Also, “for an objectivist such as myself…”

          • Mark: Yes, all that is so for Fisher and Freedman; the ambiguity is in the “default” or “reference” Bayesians who have followed J. Berger and others into meaning something different by O-Bayes, usually/often (?) on the order of a prior that maximizes the input of the data in some sense (but there are several varieties). I am trying to keep it clear for the reader.
            So when the editor told me he had Fisher’s genetic example in mind, I realized he might well be favoring frequentist or empirical Bayes, but empirical Bayesians also want to use/control error probabilities which are apparently banned.

        • newtostats

          Most objective Bayesians fall into neither of these categories. Default Bayesians are rare in practice. They want “default” priors the same way everyone else uses default sampling distributions. Their goal is to get priors by searching for commonly accepted ones that make little final difference. They judge priors based on how few will complain about them.

          Objective Bayesians in practice are the opposite of this. They want priors which reflect the (objective) residual uncertainty inherent in a given set of information/evidence. If the information is precisely defined, then the prior is uniquely given for that information, however, different information/evidence leads to different priors.

          They can use either informative or uninformative information/evidence on which to base their priors, but prefer “informative” for obvious reasons when available. They judge priors based on (1) whether the information/evidence itself is true and (2) whether the prior accurately reflects the residual uncertainty in that state of information.

  11. For what it’s worth, a psychologist friend of mine says she has never even heard of this journal.


    • ND: Nice to hear from you. I’ve heard others say that as well. Of course I’m unaware of all but 1 or 2 journals in this arena. In that connection many people have expressed puzzlement at the oversize reaction, and the explanatory theories are of interest in their own right, but too speculative to share.

  12. vl54321

    “a claim fails to count as good evidence for H if the method was guaranteed to find agreement, whether or not H was true. ”

    Every time I see this, I have a really hard time reconciling this with the aversion to enumerating the hypothesis space. It doesn’t seem possible to honestly assess this without at least an _implicit_ model of the space of alternative possibilities and the statistic under them.

    Sure it’s comforting not to have to explicitly state the space of models because your representation of the space of models could be wrong, but not doing strikes me as sweeping the uncertainty under the rug.

    • VL: First of all this is a definition. Second of all, we can readily apply it, and in published papers I list the ways. Of course they’re quite common: e.g., any data that disagrees with H is removed or reinterpreted. Or like when Geller says he can predict the next ESP card and fails he says, I meant the second, or third, or whatever will do the trick. We are not barred from speaking of “H is false”–its falsity is quite specific in my account, whether it’s a substantive claim about causes or a statistical claim about parameter values. Of course, keep in mind that giving defns. differs from applying them; we may not be able to tell in any particular case if a lousy job has been done in trying to find flaws. The above is minimalist, I would demand much, much more for evidence, like the agreement results despite the method being highly capable of unearthing flaws of a given type. Traveling, can’t link to papers just now.

    • john byrd

      “Every time I see this, I have a really hard time reconciling this with the aversion to enumerating the hypothesis space.”

      Consider an example: Ms. Mayo teaches statistics at a local high school (Fewready HS?). To illustrate a point she assigns the class a quick hands-on project. She says her hypothesis is that the mean height of students enrolled in the school is 180cm. (They do not know that Ms. Mayo got the stature data from the school nurse and so she has a tight grip on a correct answer). She wants each student to conduct a test of her hypothesis. Bob ventures across the hall into the gym and sees several of his friends at basketball practice. He knows they will help him out, so he measures the heights of 10 of the team mates and calculates a mean height of 190. He creates a CI at 95% and rejects Ms. Mayo’s null hypothesis. Mary also ventures across the hall and sees the cheerleaders having practice. She knows them all well, and convinces them to submit to measurement. She gets a mean stature at 162, and also calculates a 95% CI. She rejects Ms. Mayo’s null hypothesis.

      Both instances involve methods (taking biased grab samples) that are virtually guaranteed to reject the null if the null is unbiased. In this case, the method used renders any subsequent results useless. What is at issue is not the hypothesis space, but other more fundamental aspects of the test. Likewise, cherry-picking, optional stopping, poorly specified models etc. lead to an inability to draw conclusions.

      Of course, Ms. Mayo succeeded in teaching a valuable lesson: all conclusions are only as good as the methods used to derive them.
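A rough simulation can illustrate how often a grab sample rejects a true null compared with a random sample. It is only a sketch under invented assumptions: the school has 1000 students with heights drawn from a normal distribution (standard deviation 10 cm) and recentred so the mean is exactly 180 cm, the tallest 50 students stand in for the basketball team, and each student runs a two-sided one-sample t-test at the 5% level on 10 measurements. None of these numbers come from the example itself, and all function names are hypothetical.

```python
import random
import statistics

T_CRIT_DF9 = 2.262  # two-sided 5% critical value of Student's t, 9 df (n = 10)


def rejects_null(sample, mu0=180.0, t_crit=T_CRIT_DF9):
    """One-sample t-test of H0: mean = mu0; True if H0 is rejected."""
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / len(sample) ** 0.5
    return abs(mean - mu0) / std_err > t_crit


def rejection_rate(pool, n=10, trials=2000, seed=0):
    """Share of trials in which a sample of n drawn from pool rejects H0."""
    rng = random.Random(seed)
    hits = sum(rejects_null(rng.sample(pool, n)) for _ in range(trials))
    return hits / trials


def make_school(size=1000, mu=180.0, sigma=10.0, seed=1):
    """Invented population, recentred so its mean is exactly mu (H0 true)."""
    rng = random.Random(seed)
    heights = [rng.gauss(mu, sigma) for _ in range(size)]
    shift = mu - statistics.mean(heights)
    return [h + shift for h in heights]


school = make_school()
team = sorted(school)[-50:]  # assumption: the team is the 50 tallest students

random_rate = rejection_rate(school)  # proper random sample of the school
grab_rate = rejection_rate(team)      # grab sample from the gym

print(f"random sample rejects the true null in {random_rate:.1%} of trials")
print(f"grab sample rejects the true null in {grab_rate:.1%} of trials")
```

Under these assumptions the random sample should reject at roughly the nominal 5% rate, while the grab sample rejects almost every time even though the null is true by construction.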

      • vl

        john byrd – thank you for perfectly illustrating my point.

        In order to realize that the hypothesis wasn’t severely probed, you had to articulate the alternative model that would produce the same explanation “he measures the heights of 10 of the team mates and calculates a mean height of 190.” “Mary also ventures across the hall and sees the cheerleaders having practice. She knows them all well, and convinces them to submit to measurement.”

        Without articulating that, you have not expressed why there’s a severity problem.

        The difference in the bayesian context is that you could incorporate these alternative explanations into your model space and you’d see the underdetermination in the posterior.

        Yes, you could well be wrong about the model of the space of models, but if you relegate severity to just something that can only be talked about in story form, in my opinion that’s ‘not even wrong’ territory.

        • VL: Just to note I’m not in sync with Byrd’s ex., but no time to reply just now.

        • john byrd

          “…the alternative model that would produce the same explanation “he measures the heights of 10 of the team mates and calculates a mean height of 190.” …

          I do not see the sampling method used as a model that is proposed as an alternate. I see this as a problem of method deriving the data to be used in a statistical model. The statistical model was defined when the null was provided. Are you saying the students unknowingly formed alternate models?

          • john byrd

            To be clear, there is a severity problem when you take a biased grab sample instead of taking an appropriate random sample. This is true with reference only to the sampling distribution under the null. It is not a severe test when done this way because it very easily leads to a rejection of a true null. This is a very simple example, but I think serves as a case where there is no need to calculate a severity value, nor any need to reference alternative hypotheses.

          • vl

            @john byrd

            No, I’m saying that you as a statistical interpreter articulated the possibility that “he measures the heights of 10 of the team mates and calculates a mean height of 190” as a model of the process whereby the measurement was generated.

            It doesn’t matter what the students “think”. In a Bayesian analysis, the focus is on what are the possible mechanisms that could generate the data including alternative choices of the population being sampled and their relation to the target parameter.

            • john byrd

              “In a Bayesian analysis, the focus is on what are the possible mechanisms that could generate the data including alternative choices of the population being sampled and their relation to the target parameter.”

              OK, that may be the way you wish to view the circumstances. But, I do not view it that way and find it a bit awkward. I see the grab sample approach as simply a poor choice of method for testing a null that was provided. The null did not have any explicit alternatives and we did not need any to address the question. The grab sample method has low severity because it is very likely to give a rejection of the null even when the null is true. Of course, it also violates the assumptions of the significance test. To me, this is a simpler, cleaner way to view the problem.

  13. Jumping off the island of Elba, don’t expect responses for a day.

  14. My comments on this issue are in: http://andrewgelman.com/2015/02/26/psych-journal-bans-significance-tests-stat-blogger-inundated-with-emails/#comment-212877

    It is quite obvious that H0 and H1 are not exhaustive hypotheses, that is H1 cannot be the negation of H0. It is easy to see that, since the universe where H0 and H1 are exhaustive requires many many many statistical assumptions. It is difficult to discuss with those who do not realize this simple fact.

  15. Pingback: Friday links: weak inference in ecology, the Ambiguous Pazuma, you vs. lunch, and more | Dynamic Ecology

  16. Jeremy Fox wrote on his blog:
    “As with Andrew Gelman, the entire internet apparently asked philosopher Deborah Mayo for her opinion on that social psychology journal that banned all inferential statistics. She obliges here, comparing the policy to “don’t ask, don’t tell”. My (brief) comments are here. It occurs to me that maybe the journal’s policy isn’t their real policy, but rather is a devious plan to goad top people like Andrew Gelman and Deborah Mayo into telling them what their policy should be. Sort of the statistical policy equivalent of Calvin and Hobbes shoveling snow. 🙂 Oh, and I love Clark Glymour’s comment over on Deborah Mayo’s blog: “Why don’t they go all the way and just ban data?” :-)”


    That’s one of the reasons I was reluctant to react at all.

    • john byrd

      I followed the link and saw his discussion of Platt 1964, and the recent Fudge 2014. From the latter is the quote, “We can evaluate a prediction’s utility by asking ourselves whether the hypothesis can survive if the prediction is found to be false. If it can, then it is not a strong prediction, and probably not worth testing.” And also, “The test of a good experiment or test is to ask whether the results, whichever way they turn out, will allow you to evaluate how good a given prediction is.”

      • I’m missing the reference.

        • john byrd

          The references are available at the link.

          • John: Yes, but what issue are you talking about?

            • john byrd

              I see the reasoning to be similar to, “a claim fails to count as good evidence for H if the method was guaranteed to find agreement, whether or not H was true. ”

  17. scatter

    “The major problems with the uses of NHST in social psych involve the presumption that one is allowed to go from a statistical to a substantive (often causal!) inference—never mind that everyone has known this fallacy for 100 years—”

    What use is there for rejecting the null of zero difference (unless this happens to be deduced from the substantive hypothesis) other than to subsequently make the logical error you mention? Oddly, one of the cases where the zero-difference null hypothesis is valid would be testing the existence of ESP, yet no one is convinced by those significant results.

    Even if god told me two groups were different it does not tell me why they are different. The difference may be a “good” thing, a “bad” thing, or due to some irrelevant experimental artifact. To figure out why I will need to run many many controls to rule the plethora of possible explanations, this is prohibitively expensive and will not be done.

    Or I could do parameter estimation, compare that information to other parameter estimates arrived at in other studies, abduce an explanation for the available data, then deduce a prediction. In that case I have no need for the null hypothesis since I have a prediction to compare to. Even better, multiple explanations can be put forward with associated predictions. If the data can distinguish between any two of these it must also be able to rule out chance.

    • “What use is there for rejecting the null of zero difference” if not to commit the fallacy of inferring anything you like that entails the stat sig result?–you ask. It’s to infer the existence of a discrepancy of various magnitudes (and direction). If the null and alternative exhaust the space, the rejection indicates the alternative statistical hyp, but substantive (and even many statistical) claims need not have been well probed in the least, based on having merely rejected a null.

  18. Pingback: “So you banned p-values, how’s that working out for you?” D. Lakens exposes the consequences of a puzzling “ban” on statistical inference | A bunch of data
