Ben Goldacre (author of *Bad Science*), in a *Nature* article today (“Make Journals Report Clinical Trials Properly”), expresses puzzlement as to why bad statistical practices – “selective publication, inadequate descriptions of study methods that block efforts at replication, and data dredging through undisclosed use of multiple analytical strategies” – are continuing to occur even in the face of the new “technical activism” (a great term he introduces). Worse, these questionable practices are actually being defended by some medical journals. “[J]ournal editors now need to engage in a serious public discussion on why this is still happening”. Goldacre doesn’t consider that at least some of the pushback he’s seeing has a basis in statistical philosophy! I explain at the end. Here’s Goldacre (Feb 2, 2016; emphasis mine):

Science is in flux. The basics of a rigorous scientific method were worked out many years ago, but there is now growing concern about systematic structural flaws that undermine the integrity of published data: selective publication, inadequate descriptions of study methods that block efforts at replication, and data dredging through undisclosed use of multiple analytical strategies. Problems such as these undermine the integrity of published data and increase the risk of exaggerated or even false-positive findings, leading collectively to the ‘replication crisis’.

Alongside academic papers that document the prevalence of these problems, we have seen a growth in ‘technical activism’: groups creating data structures and services to help find solutions. These include the Reproducibility Project, which shares out the work of replicating hundreds of published papers in psychology, and Registered Reports, in which researchers can specify their methods and analytical strategy before they begin a study. These initiatives can generate conflict, because they set out to hold individuals to account. Most researchers maintain a public pose that science is about healthy, reciprocal, critical appraisal. But when you replicate someone’s methods and find discrepant results, there is inevitably a risk of friction.

Our team in the Centre for Evidence-Based Medicine at the University of Oxford, UK, is now facing the same challenge. We are targeting the problem of selective outcome reporting in clinical trials.

At the outset, those conducting clinical trials are supposed to publicly declare what measurements they will take to assess the relative benefits of the treatments being compared. This is long-standing best practice, because an outcome such as ‘cardiovascular health’ could be measured in many ways. So researchers are expected to list the specific blood tests and symptom-rating scales that they will use, for example, alongside the dates on which measurements will be taken, and any cut-off values they will apply to turn continuous data into categorical variables.

This is all done to prevent researchers from ‘data-dredging’ their results. If researchers switch from these pre-specified outcomes, without explaining that they have done so, then they break the assumptions of their statistical tests. That carries a significant risk of exaggerating findings, or simply getting them wrong, and this in turn helps to explain why so many trial results eventually turn out to be incorrect.

You might think that this problem is so obvious that it would already be competently managed by researchers and journals. But that is not the case. Repeatedly, academic papers have been published showing that outcome-switching is highly prevalent, and that such switches often lead to more favourable statistically significant results being reported instead. ….

Our group has taken a new approach to trying to fix this problem. Since last October, we have been checking the outcomes reported in every trial published in five top medical journals against the pre-specified outcomes from the registry entries or protocols. Most had discrepancies, many of them major. Then, crucially, we have submitted a correction letter, on every trial that misreported its outcomes, to the journal in question. (All of our raw data, methods and correspondence with journals are available on our website at COMPare-trials.org.)

We expected that journals would take these discrepancies seriously, because trial results are used by physicians, researchers and patients to make informed decisions about treatments. Instead, we have seen a wide range of reactions. Some have demonstrated best practice: the *BMJ*, for instance, quickly published a correction on one misreported trial we found, within days of our letter being posted.

Other journals have not followed the *BMJ*’s lead. The editors at *Annals of Internal Medicine*, for example, have responded to our correction letters with an unsigned rebuttal that, in our view, raises serious questions about their commitment to managing outcome-switching. For example, they repeatedly (but confusedly) argue that it is acceptable to identify “prespecified outcomes” from documents produced after a trial began; they make concerning comments that undermine the crucial resource of trial registers; and they say that their expertise allows them to permit — and even solicit — undeclared outcome-switching.

Read the full article. Here’s their post discussing the anonymous response they received from the *Annals of Internal Medicine*.

The practice of identifying “prespecified outcomes” from post hoc information can indeed “break the assumptions of their statistical tests”.[1] For example, the *reported* significance level may have no relation to the *actual* significance level. You might report that an observed effect would not easily (frequently) be brought about by mere chance variability (small *reported* P-value), when in fact it would frequently be brought about by chance alone, thanks to dredging (large *actual* P-value). You’re breaking the test’s assumptions, but what if your account of evidence denies that’s any skin off its nose?
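The gap between the reported and actual significance levels is easy to exhibit by simulation. A minimal sketch (my own illustration, not from the post): suppose a trial measures ten independent outcomes, all pure noise, and only the best-looking one is reported.

```python
import numpy as np

def dredged_rejection_rate(n_outcomes=10, n_trials=5000, alpha=0.05, seed=0):
    """Fraction of null trials whose *reported* (smallest) P-value falls
    below alpha when only the best of n_outcomes is reported.  Under the
    null, each P-value is Uniform(0, 1)."""
    rng = np.random.default_rng(seed)
    p = rng.uniform(size=(n_trials, n_outcomes))
    return (p.min(axis=1) < alpha).mean()

# Reported (nominal) level: 0.05.  Actual level: about 1 - 0.95**10 ≈ 0.40.
print(dredged_rejection_rate())
```

With a single prespecified outcome the rate stays at the nominal 0.05; with undisclosed dredging over ten outcomes it is inflated roughly eightfold.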

Take for example the epidemiologist Stephen Goodman, a co-director of a leading home for “technical activism” (Meta-Research Innovation Center at Stanford):

“Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value” (Goodman 1999, p. 1010).

“Nothing to do with the data”? On the frequentist (error statistical) philosophy, it has a lot to do with the data. To Goodman’s credit, he’s up front about his standpoint being based on accepting the “likelihood principle”. [See Supplement below.] However, what people come away with is the upshot of that evidential standpoint, not the philosophical nuances that lurk in the undergrowth. At their inaugural conference, the questionable relevance of significance levels was the punch line of a joke in a funny group video you may have seen (posted on 538).[2] The take-away message is scarcely: do all you can to report any post-data specifications that would violate the legitimacy of your significance level.
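The kind of P-value adjustment Goodman objects to can be made concrete. A sketch using the Bonferroni correction, one standard frequentist adjustment (the correction is my choice of example, and the numbers are hypothetical):

```python
def bonferroni(p_values):
    """Bonferroni adjustment: scale each P-value by the number of tests
    performed, capping at 1, so the family-wise error rate stays at the
    nominal level."""
    k = len(p_values)
    return [min(1.0, k * p) for p in p_values]

# Hypothetical trial: five outcomes were tested, smallest P-value reported.
reported = [0.012, 0.21, 0.44, 0.62, 0.93]
adjusted = bonferroni(reported)
print(round(adjusted[0], 3))  # 0.06: no longer 'significant' at the 0.05 level
```

On the error-statistical view the adjustment reflects a real feature of the data-generating procedure; on Goodman’s view it “defies scientific sense”, since the five likelihoods are untouched by it.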

Goldacre is surely right to suspect that some of the resistance to calls against “outcome switching” is defensiveness; but he shouldn’t close his eyes to the role played by foundational principles of evidence.

**What do you think?**

*******************

(ii) Supplement: You may wonder how the contrasting standpoints (between, say, a Goodman and a Goldacre) involve *philosophical* principles of evidence. In a nutshell, the former holds an “evidential-relation” (EGEK) or a logicist (Hacking) view of statistical evidence where, given statements of hypotheses and data, an evidential appraisal – generally comparative – falls out. Then considerations such as when the hypotheses were constructed drop out – at least for questions of evidence. Philosophers sought “logics of confirmation” for a long time, and some still do. The idea is to have a context-free logic for inductive inference akin to logics of deductive inference. This contrasts with the position that an evidential appraisal depends on features (of the selection and generation of data) that alter the error probabilities of the procedure, such as “selection effects”. Interested readers can search this blog for statistically oriented discussions (under likelihood principle, law of likelihood) or philosophically oriented ones (novel evidence, double counting, Popper, Carnap). Carnap’s logicism took the form of confirmation theories. Popper rejected this and required “novel” evidence for a severe test – even though he tended to change his definition of novelty, and never settled on an adequate notion of severity. On the statistical side, relating to the likelihood principle, is the “Law of Likelihood” (LL). Some relevant posts are here and here.

The (LL) regards data **x** as evidence supporting *H₁* over *H₀* iff Pr(**x**; *H₁*) > Pr(**x**; *H₀*), where *H₀* and *H₁* are statistical hypotheses that assign probabilities to the random variable *X* taking value **x**.

On many accounts, the likelihood ratio also measures the strength of that comparative evidence. Here’s Richard Royall (whom Goodman follows):

“If hypothesis A implies that the probability that a random variable X takes the value x is p_A(x), while hypothesis B implies that the probability is p_B(x), then the observation X = x is evidence supporting A over B if and only if p_A(x) > p_B(x), and the likelihood ratio, p_A(x)/p_B(x), measures the strength of that evidence.” (Royall, 2004, p. 122)

Moreover, “the likelihood ratio is the exact factor by which the probability ratio [ratio of priors in A and B] is changed”. (ibid. 123)
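For concreteness, here are the (LL) comparison and Royall’s updating factor in a toy binomial case; the hypotheses, data, and prior odds are my own illustration, not Royall’s:

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr(X = k) for a Binomial(n, p) random variable."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Toy data: x = 7 successes in n = 10 trials; compare H1: p = 0.7
# against H0: p = 0.5.
n, x = 10, 7
lr = binom_pmf(x, n, 0.7) / binom_pmf(x, n, 0.5)
print(round(lr, 2))  # 2.28: by (LL), x supports H1 over H0 since lr > 1

# Royall's further point: the likelihood ratio is exactly the factor by
# which prior odds on H1 vs H0 become posterior odds.
prior_odds = 1.0                  # even odds, assumed for illustration
posterior_odds = prior_odds * lr  # ≈ 2.28

# Note the ratio is identical whether the comparison was prespecified or
# found by dredging -- which is precisely the point in dispute.
```

The last comment marks where the philosophies diverge: nothing in the computation registers how the hypotheses were selected.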

RELATED POSTS: Statistical “reforms” without philosophy are blind

Why the law of likelihood is bankrupt as an account of evidence.

Breaking the Royall Law of Likelihood.

[1] Violations of statistical test assumptions won’t always occur as a result of post data specifications. Post hoc determinations can, in certain cases, be treated “as if” they were prespecified, but the onus is on the researcher to show the error probabilities aren’t vitiated. Interestingly, the philosopher C.S. Peirce anticipates this contemporary point.

[2] I first saw the video in a tweet by the American Statistical Association.

REFERENCES:

Goldacre, B. (2008). *Bad Science*. HarperCollins Publishers.

Goldacre, B. (2016). “Make Journals Report Clinical Trials Properly,” *Nature* 530: 7 (4 February 2016).

Goodman, S. (1999). “Toward Evidence-Based Medical Statistics. 2: The Bayes Factor,” *Annals of Internal Medicine*, 130(12): 1005-13.

Hacking, I. (1980). “The Theory of Probable Inference: Neyman, Peirce and Braithwaite.” In D. H. Mellor (ed.), *Science, Belief and Behavior: Essays in Honor of R.B. Braithwaite*, 141-160. Cambridge: CUP.

Royall, R. (2004). “The Likelihood Paradigm for Statistical Evidence,” 119-138; Rejoinder, 145-151, in M. Taper and S. Lele (eds.), *The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations*. Chicago: University of Chicago Press.

Video link: “Not even Scientists Can Easily Explain P-values”

Likelihood principle ain’t magic. Conclusions based on the likelihood principle are conditional on the likelihood, which is just part of the model, just an assumption. Sure, you can work with that, consider departures from the assumptions, etc. But I don’t like it when the likelihood is taken as a known, God-given entity.

True, but even with a legitimate likelihood, Goodman (I originally, accidentally, wrote Greenland) is saying error probabilities don’t matter. I just wanted to draw Goldacre’s attention to this. He’s to be credited for explaining biasing selection effects at a clear and non-technical level in his popular books, but it’s as if he assumes everybody subscribes to the statistical philosophy that has a problem with “outcome switching”. I don’t know how to get his attention.

Mayo: Can you please give a cite where Greenland says “error probabilities don’t matter”?

I can give you some cites where Greenland is all about error probabilities, as in empirical Bayes and partial/semiparametric Bayes for multiparameter inference (Greenland, 1992, 1993, 1994, 1997, 1999, 2000 and so on into the 21st century). The rationale for those methods is estimation to minimize MSE (instead of the more usual MVU criterion which – rather unrealistically, to say the least – acts as if there is infinite loss from any degree of bias by limiting MSE minimization to the domain of unbiased estimators). You could call those methods Efron-Morris error statistics – their rationale is sampling-error minimization, not coherence. They are not NP binary-decision error statistics, but I think Neyman at least welcomed these sorts of methods.
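The MSE-versus-MVU contrast Greenland draws is easy to demonstrate. A toy simulation (my own, not from his papers) in which a deliberately biased shrinkage estimator beats the unbiased sample mean on mean squared error:

```python
import numpy as np

rng = np.random.default_rng(1)

# Estimate mu from n normal observations (sigma = 1), with mu fairly near 0.
mu, n, reps = 0.3, 10, 20000
x = rng.normal(mu, 1.0, size=(reps, n))

xbar = x.mean(axis=1)   # unbiased (MVU) estimator; MSE = 1/n = 0.1
shrunk = 0.8 * xbar     # biased toward 0, but with smaller variance

mse_unbiased = ((xbar - mu) ** 2).mean()
mse_shrunk = ((shrunk - mu) ** 2).mean()
# Theory: MSE(shrunk) = 0.8**2 * 0.1 + (0.2 * mu)**2 = 0.0676 < 0.1
print(mse_shrunk < mse_unbiased)  # True
```

The shrinkage estimator trades a small squared bias (0.0036 here) for a larger reduction in variance, which is the sampling-error rationale Greenland describes.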

Sander:

I was mystified by your comment, so did a search through the post; no mention of Greenland, but then in a comment written hastily, I did accidentally write Greenland instead of Goodman! Sorry. This is the problem with writing comments while traveling. Readers: Goodman and Greenland sometimes come together in joint papers, so that’s why I erred. I’ve now corrected that. But it’s obvious from the post I’m talking about Goodman! Thanks for the correction.

There is little point in trying to discuss the issues relating to likelihood and P-values and evidence and multiple testing and sequential sampling unless you are able to draw a clear distinction between the evidential content _of the data itself_ and the appropriate _response_ to the evidence.

The evidential content of the data itself is best seen in the likelihood function, but is also expressed in the observed P-value. It is what the likelihood principle refers to, and it is all that the likelihood principle should be applied to.

The appropriate response to the evidence depends on things like what you are trying to achieve and how the evidence came about. It depends therefore on whether the study is intended to be definitive or preliminary, exploratory or confirmatory, standalone or part of a series, and so on. Those various intentions each require different emphases on the answers to the three core questions: what do the data say?; what should I believe now that I have these data?; and what should I do or decide now that I have these data?.

To discuss this blog post without being clear on those points is a waste of time.

Michael:

First, thanks for your question; I decided I’d better supplement this post (ii), since I’d said nothing about the “philosophical” principles of evidence associated with the issue, and not all readers will know of the previous discussions on this blog.

Second, this notion of “an appropriate response to the evidence” is new to me: is it to go under “what should I believe?” or “what should I do?”

Naturally, since it contains the word “evidence” you must intend it to be distinct from the first, evidential question.

If all this were merely semantics, and in fact my evidential appraisal is no different from Royall’s or your “reacting to the data,” I’d not mention it. What baffles me is that the major discussants of today’s meta-research, the new technical activists (Goldacre calls them) talk right past each other on some core issues.

I mean you were at the P-value pow-wow, I think you know what I mean.

Finally, you say the “evidential content of the data itself is best seen in the likelihood function”.

Why is it “best”?

You then say “but is also expressed in the observed P-value”, which suggests you’re viewing the P-value as a logicist measure, unaffected by post hoc outcome changing and the like. Correct? Then your observed P-value is what’s often called the “nominal” P-value. In that case, I don’t see why you think it gives the evidential content in the data! You can define it that way, but it’s not the definition Fisher intended, and anyway, it would only require us to have a distinct notion of the “actual” P-value (distinct from your observed P-value), which is sure to confuse people further.

Mayo, the likelihood function shows what the data say about support for parameter values within the statistical model. You respond to that information by changing your beliefs or by deciding to do something or accept some hypothesis. If the data have come from a biased approach then you might choose to be reluctant to take the evidential picture provided by the likelihood function at face value. A discount of some sort might be applied when making inferences. I find it best to think of such a discount as being external to the actual evidence of the data.

The evidence answers the first of the questions, What do the data say?. These are different questions: Should you believe the data? Should you apply a discount to the weight you apply to the evidence in inference, or a discount that attempts to undo some bias in the evidence?

An analogy might be helpful. Imagine that I testify in court that “I saw Joe Blow at the scene of the crime two minutes before the robbery went down.” That sentence is my testimony. It is the evidence that I provided. Now, you might be aware of some background information that suggests that I am a serial liar or that the Lew family has been carrying out a feud with the Blow family for generations, or that I am an alternative suspect in the case, or that I might have been drunk at the time, or that I was not wearing the glasses that I need for clear vision. Those things might make you doubt the validity of my evidence, and they might make you place little weight on my evidence or they might make you doubt the identification of Joe Blow, or maybe the “two minutes” part of my testimony. They are relevant to how you respond to my evidence, but it is convenient to consider them to be external to the evidence of my testimony.

The likelihood function is analogous to my testimony. It contains, depicts or presents the evidence. The effects of outcome switching, P-value hacking, cherry picking etc. are all analogous to the factors in the previous paragraph that would make you doubtful of the validity of the evidence. I contend that it is better for clarity to keep the scope of ‘the evidence’ separate from those factors. Otherwise we are not able to discuss it clearly. And otherwise it is impossible to be comfortable with the likelihood principle. I think that the mistrust of the likelihood principle comes mostly from assuming that the word ‘evidence’ entails things like the family feud of my example. The likelihood principle has nothing at all to say about how to deal with feuds.

Now, you will ask how specifically to apply a discount to evidence when making inferences. I don’t have a recipe that fits all, or even most circumstances. Apply a discount knowingly and carefully. Apply it openly and discuss it in reports of inferences.

People talk past each other when discussing evidence because the terminology is imprecise and most people are not practiced in thinking in evidential terms. Bayesians tend to focus on posteriors which are not depictions of the evidence, and error statisticians focus on method-related error rates which are also not the evidence. When likelihoodlums talk about evidence their words are interpreted as if they are talking about the evidence in the data along with the background factors that should affect the responses to the evidence. If you re-read many of our previous conversations on this blog you will see what I mean.

Why are likelihood functions “best”? Because they make explicit the evidential support for the possible values of the parameter(s) of interest. I do not see the attraction of doing an experiment that allows estimation of the values of a parameter of interest and then just saying it’s ‘significant’. It’s best also because it depicts the evidence without the discounts that might be applied on account of various ‘bad behaviours’ on the part of experimenters and analysts. Those discounts might sensibly be applied across the board to the support for all parameter values, or they might be applied in a way that effectively shifts the support towards smaller parameter values and away from larger parameter values, for example.

P-values are related to likelihood functions. Each dataset that can be analysed to yield a likelihood function can also be analysed to produce a P-value that relates to the same parameter. Each likelihood function can be ‘pointed to’, or indexed, by the corresponding P-value. I suggest that that P-value should be the observed, or nominal, P-value, but you can use a ‘corrected’ P-value if you insist. If you do so then there will be two different P-values that point to the same likelihood function, but that two-to-one relationship is a consequence of your choosing to manipulate the observed P-value and is not a reason to say that there is some conflict between P-values and likelihood functions.
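A minimal sketch of the indexing Lew describes, under assumptions of my own choosing (a one-sample z-test of H0: mu = 0 with known sigma = 1 and fixed n): the observed P-value can be inverted back to the sample mean, which fixes the entire likelihood function.

```python
from statistics import NormalDist

N = NormalDist()  # standard normal

def p_from_mean(xbar, n):
    """Observed two-sided P-value as a function of the sample mean."""
    return 2 * (1 - N.cdf(abs(xbar) * n ** 0.5))

def mean_from_p(p, n):
    """Invert the P-value to |xbar|; since the likelihood function is
    L(mu) proportional to exp(-n * (xbar - mu)**2 / 2), the P-value
    thereby 'points to' one likelihood function (up to the sign of xbar)."""
    return N.inv_cdf(1 - p / 2) / n ** 0.5

xbar, n = 0.4, 25
p = p_from_mean(xbar, n)            # ≈ 0.0455
print(round(mean_from_p(p, n), 6))  # 0.4: the likelihood curve is recovered
```

A ‘corrected’ P-value breaks this one-to-one correspondence, which is just the point at issue between Lew and Mayo.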

Your interpretation of Fisher’s intentions is just as reliable as mine. Unreliable. He was not always consistent and his meanings are frequently difficult to discern. My impression is that he was keen on conditioning wherever reasonable, and so many of his P-values would be more akin to my ‘observed’ P-values than they are to your ‘actual’ P-values. (I note that there are many circumstances where the ‘observed’ P-value and the ‘actual’ P-value are identical.) I think that your use of the word ‘actual’ without specification of the conditioning choices is leading to confusion.

This is the definition of P-value that came out of the ASA P-value pow-wow that you refer to:

“A p-value is the probability under a specified statistical model that a statistical summary of the data (including, for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.”

You prefer to use a model that incorporates accounting for the number of comparisons etc., where I would prefer that accounting to be external to the P-value, in order that the P-value remain an uncomplicated index to the likelihood function that depicts the evidential support of the data for values of the model parameters. That difference of opinion is of little consequence, but to say that your preferred accounting yields an ‘actual’ P-value is a strange use of ‘actual’. I think that you can choose a more appropriate adjective.

Michael: Let me just say that I’m very grateful for your comment. It summarizes your view clearly and should help readers to understand the issues. There has been almost no discussion on the blog as of late, and I appreciate the drops of water it provides for a very dry and parched desert. (I blame twitter, and I’m guilty of using it for quick comments as well.)

Now on the substance: the difference in the views of evidence and inference is scarcely of “little consequence” (even if I were to agree that a chosen definition for P-value is mere semantics). It’s absolutely at the heart of what Goldacre is on about. That’s my point.

As for what the founders really thought, the fact is that Fisher, Neyman, Pearson, and Neyman and Pearson jointly were explicit about adjustments for post data selections, as are those who gave us modern treatments of their accounts. Previous posts attest to it w/ extensive quotes. That is why, for any measure of “distance” or “fit” you choose, F-N-P demanded to know the probability of getting so impressive a fit (with a given H) even if H is false (or discrepancies exist).

You see, statistical inference is inference, not mere data summarization (as informative as that can be). The inference is inductive in that it goes beyond the data. Because it goes beyond the data, it may be in error. But we can control and assess these errors, and that’s what error probabilities are about. Now I go a bit beyond F-N-P to provide an account wherein those error probabilities, where relevant, serve to characterize the probative capacities of methods in such a way that the justification is a matter of warranted evidence (not merely long-run error control).

Likelihood ratios are not inferences. See my “why the law of likelihood is bankrupt as an account of inference” post (not the exact title).

Mayo, you keep saying things like “likelihood ratios are not inferences”, and I keep thinking that you are missing the point. You are correct, they are not inferences, but that doesn’t matter. The inferences that we should care about are scientific inferences.

My approach is to use statistical methods to inform the scientific inferences. Likelihood functions (and the various ratios that they contain) inform scientific inferences by answering the question of what the data say.

I am less inclined towards the Neyman approach which seems to pass the responsibility for the inference on to the statistical method by answering the question of what should I do now that I have these data without addressing the third question, what do I believe now that I have these data.

Likelihood functions inform a scientific mind more fully than a P-value, and much more fully than an asterisk, and so they are a superior aid to scientific inference. A Bayesian posterior can be even more informative, as long as the prior is appropriate.

Perhaps it would be appropriate to flesh out what is meant by various uses of the word “inference” as well as “evidence”.

Michael: “Mayo, you keep saying things like “likelihood ratios are not inferences”, and I keep thinking that you are missing the point. You are correct, they are not inferences, but that doesn’t matter. The inferences that we should care about are scientific inferences.”

This is a rather large admission. We want to carry out statistical inference before getting to scientific inference, and if LRs don’t do that, then they’re inadequate for the job (of statistical inference). Now it’s absurd to suppose that if one IS interested in statistical inference one is somehow barred from looking at the likelihood function. The severity account would essentially direct you to that, when relevant, in the midst of determining poorly warranted inferences, and it would require even more to determine how well model assumptions are satisfied; but again, it’s kooky to spoze you are limited to one way of reducing/summarizing data. I’m talking about using data to evaluate evidence for statistical inference. Nor is there a forced choice between LRs and a P-value – that, again, is a false dilemma.

I think you should know I don’t obey the so-called “Neyman approach” (I say “so-called” because in practice he didn’t play any such automatic game, and certainly E Pearson didn’t.)

There are plenty of posts and papers on my recommended account of using error probabilities in inference, whether one wants to call the statistical qualification degrees of “corroboration” or “severity” or something else.

Likelihood functions are integral to error statistical inference, even though they don’t tell us what the data say in order to interpret them in relation to the question of interest. For that one needs to consider the capabilities of the method to control erroneous interpretations. It’s not a separate step (even though subsequent scientific inferences are, as are “decisions”).

I also argue that it’s entirely inadequate for an account of evidence to deny we can compare the warrant x and y afford a given hypothesis H. Yet Royall says:

“[T]he likelihood view is that observations [like x and y]…have no valid interpretation as evidence in relation to the single hypothesis H.” (Royall 2004, p. 149).

This is a very odd notion of evidence. We most certainly can say that x is quite lousy evidence for H, if nothing (or very little) has been done to find flaws in H, or if I constructed an H to agree swimmingly with x, but by means that make it extremely easy to achieve, even if H is false.

By declaring such considerations irrelevant to the evidential purport of data, the likelihoodlum (your term) is free to deny the importance of Goldacre’s concern with “outcome switching”. Yet it’s altogether essential in using data as evidence for inference.

See the following post:

https://errorstatistics.com/2014/11/15/why-the-law-of-likelihood-is-bankrupt-as-an-account-of-evidence-first/

Mayo, your response is strange to me. I suspect that you are still not really reading what I write in a way that captures my intended meanings.

You write “We want to carry out statistical inference before getting to scientific inference”. Well, that depends on what you mean by “inference”. Is it the same in both places?

This is some of what I write about inference for my statistics students:

“Do the answers to questions 3, 4 and 5 constitute inferences? [Those questions are about the sample mean, the sample distribution and a difference between means in sample subgroups.] It depends on exactly what we mean by ‘inference’, but I prefer to think that those answers are statements of facts, and preserve the word inference for the act of coming to a conclusion, opinion, or decision after consideration of facts. Thus I will stipulate that the answers to questions 3, 4 and 5 constitute factual statements in which trivial statistical methods have been used for the purposes of accuracy, objectivity and clarity. If we were to use those answers to address the earlier questions, 1 and 2 [questions regarding the biology of real world population(s) from which the sample may have been obtained], then we would be making inferences that would be based on those observed facts.”

That is an explanation of what I mean by “inference”. What do you mean?

Yet again, I will say that I know that you are not, and have not been, advocating the statistical approach that most readers would understand to be what I am referring to with the shortcut of “Neyman approach”. Your personal approach is not really at issue in the blog post that started this conversation. The problems that you write of in the original post are at least partly caused or facilitated by the habit of taking a ‘statistically significant’ effect as being somehow sanctified with a blessing of reality or reliability. What I refer to as the Neyman approach is far from a severity analysis.

I have never intended to make P-values and likelihood functions mutually exclusive alternatives, any more than the acceptance of cars by my society precludes my use of a bicycle. The two can work well together, and offer different advantages. For most purposes the likelihood function is more informative, but the P-value is certainly more succinct and better known. You should be well aware of my writing about the close association of P-values and likelihood functions, as I’ve sent the paper to you and linked it in comments on this blog on several occasions. I have never intended to communicate the “kooky” idea that people should be limited to one way of summarising data. That seems to be something that you have imposed on my words.

The likelihood function depicts the evidential support of the observed data for the range of values of the parameters of the statistical model. I think that your continued focus on individual points on the continuum of parameter values leads you to misinterpret Royall. Nonetheless, I agree with him (and Fisher and Edwards and Pawitan and Hacking (early Hacking) and Goodman and so on) that the evidence is comparative. It points in favour of any particular value of the parameter _relative_ to another. The idea of evidential support for one value of the parameter without comparators is just silly.

Your responses to likelihoodlum claims continue to ignore the distinction between assessment of statistical evidence and scientific inference, and you ignore the idea that the veracity and reliability of the evidence, as well as its likely biases, can be (should be) dealt with when going from the evidence to the inference. I tried quite hard in my previous comment to make that clear.

I know that you love to refer me to previous blog posts, but please note that I made 17 individual comments on the one that you linked this time. I’ve seen it, read it, dissected it, and I’m done with it. I don’t think it is a good account of the role that likelihood functions can play in scientific inference.

Michael:

That’s a change of view! You are now saying that these considerations are relevant to inference, but not to evidence! You seem to have come over to the error statistical side:

What would Royall say?

“Your responses to likelihoodlum claims continue to ignore the distinction between assessment of statistical evidence and scientific inference, and you ignore the idea that the veracity and reliability of the evidence, as well as its likely biases, can be (should be) dealt with when going from the evidence to the inference.”

I try not to make semantics the linchpin of key issues. My interest is in inference, and it’s fine to put this as evidence for an inference. What’s unacceptable is to claim that such and such considerations are irrelevant to evidence, but relevant for inference. Evidence is in relation to a claim. If evidence is highly unreliable evidence for C, we don’t say it’s evidence…for something or other, just not for claim C.

Nor have likelihoodlums tried to make an evidence/inference distinction. Royall speaks of evidence, belief, and action. You can insist inference is an act, but if it’s the act of inferring an evidential warrant, it can’t be irrelevant to evidence but relevant to evidential warrant.

Mayo, you say “That’s a change of view!”, but if you look back through my writings and comments on your blog and Andrew Gelman’s blog and my answers to questions on stats.stackexchange.com you will certainly find that my views have not been entirely static over the last ten years. They change with my improving understanding. However, they have not changed in any stepwise manner, and I think that the changes are subtle. My recent comments here are not informed by any important recent change in my view.

As far as I can tell, I have been relatively unsuccessful in communicating my views in comments here.

1. The data contain evidence.

2. A likelihood function depicts that evidence in relation to the parameter values of the statistical model chosen for analysis.

3. The data come from some sort of study.

4. That study may have a good design, or not.

5. The study design might be honestly and completely communicated, or not.

6. Aspects of study design may lead to the data, and thus the evidence, being unreliable in some sense or biased.

7. Scientific inferences should be based on all of the above, as well as background material external to the study and data in question.
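Points 1 and 2 can be made concrete with a minimal sketch (my own illustration, using a hypothetical binomial experiment, not anything from the thread): the likelihood function computed over a grid of parameter values, with evidence read comparatively as ratios between values.

```python
from math import comb

def binom_likelihood(k, n, p):
    """Likelihood of success probability p given k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

k, n = 7, 20  # hypothetical data: 7 successes in 20 trials
grid = [i / 100 for i in range(1, 100)]
lik = {p: binom_likelihood(k, n, p) for p in grid}

# The likelihood function peaks at the observed proportion...
mle = max(lik, key=lik.get)

# ...but evidence is comparative: support for one parameter value
# is only defined relative to another.
lr = lik[0.35] / lik[0.60]
print(mle, lr)
```

Absolute likelihood values mean little on their own; only ratios such as `lr` carry the comparative evidential reading that Royall (and point 2 above) have in mind.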

My guess about what degrades the clarity of my communication with you, and perhaps others, is that it comes from two main differences of outlook.

First, I am increasingly of the opinion that the important things that statistics can do to help scientific inference can, and should, be divided into the three questions that Royall was first to put: What do the data say? What should I believe now that I have these data? What should I do now that I have these data? Discussions of the merits and deficiencies of the various schools of statistical thought should take place with those questions in mind because many of the deficiencies of frequentist approaches occur when they are applied to the first two questions, and many of the criticisms of the use of likelihood functions assume that those functions answer the second or third question, and so on. The features of an approach that are important depend on the purpose for which it is to be used.

Second, the classical focus of testing on the `null hypothesis’ has led to problems for our communication. The `hypothesis’ is usually not a hypothesis in the way that a lay person would expect. It is just a speculation regarding the correctness of a particular value of the statistical model parameter of interest. You talk and write about hypotheses, and so I assume that you think of them as hypotheses. Denoting them as parameter values has the dual advantage of making it clear that the null lies on a continuum of possible values, and of making it explicit that the statistical analysis requires a statistical model. When the continuum of interesting values that the parameter of interest can take is noted, it is easier to deal with questions regarding what the data say and what you should believe.

I would be grateful if you would read my first comment in this thread again and comment on my scenario concerning testimony about Joe Blow. Does that not provide a way to distinguish usefully between the evidence in the data and the alterations to how we should respond to the data that are contingent on the study design?

Michael: I’m glad you admit a change in view, but you’re still not clear as to which of the 3 questions you place inference under. It can’t be under the evidence question, because you regard considerations as irrelevant to evidence that perhaps you now want to make relevant for inference.

I find that an untenable and unclear position. A key goal for statistical inference is using data to reach/evaluate claims about aspects of the underlying data generation–as modelled.

If considerations are relevant for inferences they must be relevant for evidence for inferences.

Mayo, you are making a bit of a fuss about a change of position on my part, but it feels to me that what has happened is that you are now seeing my meaning more nearly than you have in the past. I wrote this in a comment on this blog in November 2014:

“The likelihood principle says that all of the evidence in the data relevant to parameter values in the model is contained in the likelihood function. That DOES NOT say that one has to make inferences only on the basis of the likelihood function. Priors can be important, and they are dealt with by Bayes. However, the probability of erroneous inference is obviously also important. I think of that as being the ‘reliability’ of the evidence.”

[…]

“However, if there has been optional stopping or multiplicity of testing then they add some degree of unreliability to the evidence. That does not change the likelihood function, and does not change the strength of the evidence or change the parameter values favoured by the evidence, but it may change how one should make inferences about those parameter values.”

Those words seem to me to contain the same ideas as those I wrote above in this thread, even if my words are now clearer.

Now `inference’. The word `inference’ is applicable when there is a consideration of the facts to come to some sort of conclusion. Often answers to question 1 (what do the data say?) do not require much in the way of inference, as what the data say is often clear enough to be accounted for as a `fact’, albeit a relatively labile fact. Answers to the other questions should be inferences. And ideally, answers to question 2 (what should I believe?) are informed by what the data say, and answers to question 3 must be informed by what the data say and, where appropriate, they can be informed by answers to question 2 as well. Inferences can be restricted to conclusions about a statistical model, and then they are well described as statistical inferences, or they may be about real-world features or populations, in which case they can be called scientific inferences. The latter sort of inferences should incorporate, as far as possible, all relevant material that is known to the inferrer.

Mayo – I’ll put on my Bayesian hat for a second. A slightly different (and more Bayesian) set of principles are (see Evans’ work)

1) Replace P(A) as the measure of belief that event A is true by P(A|C) after being told that event C has occurred.

2) We have evidence for the truth of A when P(A|C) > P(A), evidence against the truth of A when P(A|C) < P(A), and no evidence one way or the other when P(A|C) = P(A).

Do you have the same objections to these as you do to Likelihoodist principles?
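For concreteness, principles 1) and 2) can be checked on a toy discrete example (a made-up joint distribution, my own construction rather than anything from Evans):

```python
# Toy joint distribution over outcomes (a, c): a = "A occurs", c = "C occurs".
# The probabilities are invented purely for illustration.
joint = {(True, True): 0.30, (True, False): 0.10,
         (False, True): 0.20, (False, False): 0.40}

p_a = sum(p for (a, c), p in joint.items() if a)   # P(A)
p_c = sum(p for (a, c), p in joint.items() if c)   # P(C)
p_a_given_c = joint[(True, True)] / p_c            # P(A | C)

# Relative belief ratio: > 1 means evidence for A, < 1 against, = 1 neutral.
rb = p_a_given_c / p_a
print(p_a, p_a_given_c, rb)
```

Here belief in A rises from 0.4 to 0.6 on learning C, so by principle 2) we have evidence for A; the ratio quantifies the change.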

Om: My objections are much more severe against this idea of evidence as “belief boost”. How strongly you believe the Standard Model in physics might not change even with more evidence; or your beliefs might change up or down. What’s that got to do with warranted belief change? You can have a belief boost in C even when little if anything has been done to rule out ways C can be false. If C entails A, the posterior goes up, so we get evidence for (C and J) with J an irrelevant conjunct. (I can link to that post on irrelevant conjuncts.) Then of course there’s temporal incoherence: what you thought you’d believe before knowing E can differ once you know it. Search Dutch books.

Maybe you’re just teasing me.

My comment was just re your supplement on principles of evidence. Michael Evans agrees that Birnbaum’s argument for the likelihood principle doesn’t work but also defends the above view.

The irrelevant conjunct argument doesn’t work – sticking to the above principles and following only probability theory manipulations (not the suspect epistemic closure under entailment principle) shows that there is no boost for irrelevant conjuncts

(I did write a post about the tacking paradox here: http://omaclaren.com/2015/10/06/the-tacking-paradox-model-closure-and-irrelevant-hypotheses/ which isn’t that interesting to me anymore but I think the argument still stands)

RE: temporal incoherence etc. Personally I’m not too worried about any of that. I just see Bayes as one (often useful) way of updating some initial information into a new state of information subject to certain e.g. regularity assumptions. Many of the philosophical arguments for Bayes seem suspect in light of the required regularity assumptions, though the standard Bayesian line is that when these fail we learn something so all is fine. My views as posted on the brittleness post are probably closer to my current (updating!) position(s).

Michael:

You write of “the evidential content _of the data itself_” which is fine, but remember that this evidential content is defined only in the context of some model. There’s no evidential content of the data in a pure, model-free sense.

Even with the stat model, the notion that there IS an evidential content attached to data is a holdover from a logical positivistic philosophy wherein it’s supposed “the empirical data” are the unquestioned (or least questioned) foundation of knowledge. This view won’t hold up and was a main reason for the demise of positivism. Somehow, in statistics, we still have holdovers from this discredited view.

Andrew, yes I absolutely agree that the likelihoods are model-bound. As far as I can tell, all statistical approaches to support of inference are model-bound, even if the models vary in their complexity and extent, and so it is really important to bear the models in mind when making scientific inferences.

I have previously argued against the comparison of likelihoods that come from different models, but with little success. 😦

“There’s no evidential content of the data in a pure, model-free sense.” I am sure everyone agrees with Gelman’s point (I hope), but we can go a little further into the perhaps obvious assertion that there is no evidential content of the data if the observations were not made in the manner prescribed by the requirements of the study/experiment. This simple requirement is often a great challenge to meet, and I believe accounts for some portion of the failures to replicate. Goldacre listed as one of the problems “inadequate descriptions of study methods that block efforts at replication”. This gets at the same issue.

Part of the package of assumptions made when making statistical inference is that the data have integrity with regard to the study protocol (e.g. random selections were random, measurements were taken in a consistent manner, instruments were performing properly, etc.). The information that supports acceptance of this assumption is critically a part of “the evidence”, as is the model. We should not refer to ***THE*** evidence in a statistical analysis without including these critical factors as part of data as evidence. Data in a model, considered in isolation, are just math, not science.

john byrd, NO! I do not agree that there is no evidential content of the data without a model. There is evidence there, but we cannot deal with it statistically without at least a minimal statistical model. The existence of the evidence is not contingent on a statistical analysis and so it cannot be contingent on the presence of a model.

If you succeed in your insistence that the word `evidence’ be expanded to contain the model and the study design then we lose the ability to clearly talk about how to deal with differences between possible models and designs.

Michael, at issue are problems in replication, which involves comparing designs and actual results. A failure to replicate could be due to a flawed original study, or random error in one or both studies, or a flawed replication study. In the latter case, a failure to properly make the observations/collect the data will often be the root cause. This factor is part of the evidence to consider, not just what the data tell us. (Look back at the Higgs boson experiments and consider how the numbers generated in the following experiments could not be separated from examination of how the numbers were obtained.) Claiming that data are evidence in isolation from all else seems to be a new and narrow view, and somewhat perverse. I am not trying to change the prevailing view as I understand it.

John, we can agree to define the word `evidence’ however we choose, but I would like a definition that fits well with the description and understanding of the problems that we all agree about. Do you not find my restriction of `evidence’, when combined with explicit discussion of the factors that affect how we should respond to the evidence, helpful? Do you not find that it makes it easy to describe the influence of the various factors on my courtroom testimony example?

I think that we currently lack any agreed usage of the word evidence, and that is hindering our ability to talk about the appropriate ways of making inferences, and preventing many students from being able to put the various schools of statistical thought into a unified framework.

Michael: As I’ve said, evidence is always in relation to a claim, there’s not evidence full stop. I have defined very simply “bad evidence” or no test at all.

Michael, I agree that you do not ignore the relevant factors I say are quintessential to what we call evidence. We simply debate what can properly be referred to as “the” evidence in a study that involves statistical inference.

I think your trial example makes my point better, actually. A person claiming to be a witness will be questioned and their testimony rigorously checked. If it is found to be unsound– as when the person was not at the scene to witness the event– then the testimony is not treated as evidence. No one just takes witness testimony at face value. Likewise, experts who are asked to testify must go through a vetting process and their analysis/test results subject to a review to determine if proper methods were used properly (called a Daubert hearing in US) prior to being admitted into evidence. Again, if the work was found to be flawed, then no testimony and no evidence.

That is the legal system, not science, but it makes sense.

The data are (1,3) with evidential content zero. The semantics are Manchester City 1 Leicester City 3. The evidential content of this is that Leicester City is a world class football (soccer) team and no model is required. Is evidential content = information content? Information content requires semantics. John Searle is sound on this but this requirement is a source of confusion in many books and articles.

http://www.nybooks.com/articles/2014/10/09/what-your-computer-cant-know/

Given this I agree with Michael’s first paragraph of his latest posting.

Laurie: I take it you’re concurring that “evidence” doesn’t need a model, but if you don’t know what it means, you’re saying, it has evidential content 0? The meaning is generally regarded as given by a model, but nothing turns on this (and I’m not a Searle fan). But the issue of interest is not this one. It’s whether it makes sense to claim that considerations that don’t alter evidential import, because they don’t alter LRs (which of course depend on the adequacy of the statistical model), do at the same time alter evidential import for inference. They are not evidentially relevant but they are inferentially relevant, Lew is now saying. Of course you can define things any old way, but an epistemological rationale is needed in contexts of evidence and inference. I say this move represents a new twist of contorted semantics in order to try and wriggle out of the admission that LRs alone fail to pick up considerations that we ought to regard as evidentially (and inferentially) relevant in learning from data. In any event, the two considerations go hand in hand. On the other hand, the belief/action questions (and I’m not sure where Lew puts inference) call for further inputs.

Laurie – what do you mean by ‘model’ in the football example?

I sent this comment as an email, trying to follow along as an interested amateur w/ v. minor training. I think Mayo’s comment above answers this, but is the objection to the LR as quantifying evidence for a particular parameter value the fact that the severity principle can’t be applied to likelihoods? I understand much of this comment thread is on contrasts of evidence & inference, so apologies if this is derailing.

Kyle: Severity can be applied to any such purported measure of evidence or rule for inference or what have you. Inferring evidence for H’ against Ho on grounds of likelihood licenses interpretations of evidence lacking severity. You can readily find a maximally likely H’ so that the result is evidence against Ho in favor of the maximally likely H’, according to comparative likelihood. But since you can readily find such an H’ even when Ho is true, there’s a lack of error control. That’s why Birnbaum and others who had been sympathetic to likelihoodism abandoned it. You can look up Birnbaum and the likelihood principle on this blog.
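The error-control point can be simulated in a few lines (a sketch assuming a normal model with known sigma, my own illustration): even when Ho is true, the maximally likely alternative always fits at least as well, so comparative likelihood alone reports evidence against Ho in every single sample.

```python
import math
import random

random.seed(1)
mu0, sigma, n = 0.0, 1.0, 20  # Ho: mu = 0 is TRUE in this simulation

def log_lik(mu, xs):
    # Normal log-likelihood up to an additive constant (sigma known).
    return -sum((x - mu)**2 for x in xs) / (2 * sigma**2)

ratios = []
for _ in range(1000):
    xs = [random.gauss(mu0, sigma) for _ in range(n)]
    mu_hat = sum(xs) / n  # the maximally likely alternative H'
    # Likelihood ratio of H' over the true Ho; >= 1 by construction.
    ratios.append(math.exp(log_lik(mu_hat, xs) - log_lik(mu0, xs)))

print(min(ratios))
```

The ratio in favour of the best-fitting alternative is never below 1, despite the fact that Ho generated every data set; that is the lack of error control at issue.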

Thanks for your comment.

If you don’t know what the data means then either it is pointless to talk about evidence (evidence about what?) or the evidence is zero, whichever formulation you prefer. I agree with that unless someone comes up with good counter-examples. If we know what we are talking about, and if we can formulate what we wish to know about the real world, then we may or may not need a model. I gave an explicit example where a model is not required. If we wish to investigate and quantify the effect of the weight of the model on that of the child then we can do this with a model (some form of regression) or we can do it without a model and treat the problem as one of functional choice. The meaning of the parameters or the functional is not given by the model. It requires a speculative identification with the semantics of the data. This is one reason I always dislike arguments about P-values which start from some null hypothesis H_0: mu=mu_0.

I am talking about the Searle of the Chinese room, not the Searle of social reality, although it is the same person.

If we restrict attention to parametric models rather than functionals we can talk about likelihood. LR (likelihood ratio?) depends not only on the definition of adequacy but also on the regularization, which by the way you never mention: likelihoods are derivatives and the differential operator is pathologically discontinuous. Thus specifying a likelihood is an ill-posed problem. Once you have regularized and specified those parameter values which are consistent with the data, likelihood has nothing to add: if there is a disagreement between adequacy and likelihood, adequacy wins.

I don’t think my semantics are contorted, but I confess to not knowing what the difference is between evidential and inferential relevance. I had an exchange of views with Michael on likelihood which ended somewhat abruptly. He thinks likelihood is important, but his arguments in favour of likelihood only served to confirm my well held belief that it is irrelevant (apart from maximum likelihood and the planting of signposts).

Laurie: I agree with essentially every thing you wrote in your last comment, thanks.

Laurie, I’m not sure that a model is not required for your example, although it depends on what you mean by `required’ and `model’. I would say that in order to evaluate the relative meanings of 1 and 3 we need a model of what constitutes `better’. If the numbers are football scores then 3 is better than 1, but if they are golf scores or yachting results then 1 is better than 3. The rules of scoring represent a model of some sort.

Replace weight of model by weight of mother.

Oliver, Michael: OED model 2.e. A simplified or idealized description or conception of a particular system, situation, or process (often in mathematical terms: so mathematical model) that is put forward as a basis for calculation, predictions, or further investigation.

The rules of football define football just as the rules of chess define chess. The rules of a game are not a model of the game; they are the game, so to speak. This now sounds a bit like Searle and the construction of social reality. Nor is a real football game a model of the rules.

Any non-statistician reading this must think we are all a bit weird.

We _are_ all weird. Only a weirdo could be interested enough in this stuff to argue endlessly and, maybe, pointlessly.

Michael: Not at all weird, on the contrary. The issue of evidence is crucially at the heart of holding accountable just about everyone around us who purports to have the evidence. What’s more than a little weird, but downright dangerous, is that so many people are prepared to jump on the bandwagon and repeat what “leaders” and “experts” claim about evidence and inference, without insisting on thinking it through themselves.

I’m reminded of a recent post by Jim Frost, a statistical advisor at Minitab. Despite posting several very sensible things on significance tests, he falls into confusion by echoing, with inadequate scrutiny, what he heard from J. Berger (mixed in with Colquhoun and others), and apparently has 0 interest in straightening things out:

https://errorstatistics.com/2016/01/19/high-error-rates-in-discussions-of-error-rates-i/

https://errorstatistics.com/2016/01/24/hocus-pocus-adopt-a-magicians-stance-if-you-want-to-reveal-statistical-sleights-of-hand/

“The rules of the game…are the game”. Sure, I’m pretty sympathetic to what I think some people call ‘structuralist’ philosophy (esp. in mathematics).

I’m just not sure how I would express the rules other than in mathematics (and to me logic is a subset of mathematics). And then we end up with a mathematical structure representing the rules which is the game. An applied mathematician would probably call that a mathematical model of the game.

To get forward predictions we plug in info like team quality (with uncertainty) and propagate through to get output like a predicted score (with uncertainty).

The inverse problem is to do something like predict team quality (with uncertainty) given observed scores and holding the mathematical structure fixed (same game – still football).

Contra the forward problem this is generally ill-posed so we regularise via various means. For Bayesians that means a prior, but for others something else. For a Bayesian the difference between prior and posterior measures quantifies the ‘evidence’ the data provides, given the model through which we connected inputs and outputs.

I’ve now lost track of what we were talking about. Hope only weirdos are reading.

Om: You say:

“For a Bayesian the difference between prior and posterior measures quantifies the ‘evidence’ the data provides, given the model”.

That is why Royall and others say that Bayesians and likelihoodlums agree on evidence.

Royall isn’t quite right. See Michael Evans’ comment on your Birnbaum paper*, and my first comment on this post.

*: http://www.utstat.utoronto.ca/mikevans/papers/mayoevansdiscuss.pdf

Om: The belief boost notion in your comment (wearing the Bayesian hat) would be even less appealing than Royall’s likelihoodism (he’s a frequentist about evidence). And statistical hypotheses are not events such that we’re inclined to say people have been “told that event C has occurred”. How much more strongly would you believe the Higgs results x, if you were told the event “the Standard Model” had “occurred”, compared to how strongly you believe x? The answer is how much evidence the Higgs results provide the Standard Model, for You now.

Quick attempt.

Take a parameter space theta, indexed by delta in [0,1], so

theta(delta) = {standard model+delta}

prior over theta values: p(theta)

results space x; higgs results x = x0.

Connect by mathematical model of relevant physics, p(x|theta)

Then

p(x) = integral p(x|theta)p(theta) dtheta

p(x|x = x0) = integral p(x|theta,x=x0)p(theta|x=x0) dtheta

p(x|x = x0) = integral p(x|theta)p(theta|x=x0) dtheta [same mathematical model]

p(theta|x=x0)/p(theta) = p(x=x0|theta)/p(x=x0)

standard model: theta = 0

evidence for standard model given Higgs results:

p(theta=0|x=x0)/p(theta=0) = p(x=x0|theta=0)/p(x=x0)
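The derivation can be checked numerically on a grid (a sketch in which a made-up Gaussian function stands in for the physics model p(x|theta); nothing here reflects actual Higgs data):

```python
import math

# Discretised parameter space: theta(delta) = standard model + delta,
# delta in [0, 1] on a grid of 101 points.
deltas = [i / 100 for i in range(101)]
prior = [1 / len(deltas)] * len(deltas)  # uniform prior over theta

# Stand-in for p(x = x0 | theta): invented so the data favour small delta.
def lik(delta):
    return math.exp(-0.5 * (delta / 0.2)**2)

p_x0 = sum(lik(d) * w for d, w in zip(deltas, prior))        # p(x = x0)
posterior = [lik(d) * w / p_x0 for d, w in zip(deltas, prior)]

# Relative belief ratio at the standard model (theta = 0):
# p(theta=0 | x=x0) / p(theta=0)  =  p(x=x0 | theta=0) / p(x=x0)
rb_sm = posterior[0] / prior[0]
print(rb_sm)
```

With this stand-in likelihood, `rb_sm` comes out above 1, so on the relative-belief reading the (invented) data provide evidence for the standard-model value theta = 0.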

omaclaren: Looks right.

Note though that since p(x=x0|theta)/p(x=x0) = c/p(x=x0) * L(theta), it can be mistaken for the likelihood. It has the same shape, and so supports the same relative comparisons, but in Evans’ theory p(x=x0|theta)/p(x=x0) is what is required.

This may be causing some confusion.

Keith O’Rourke

Hi Keith, yup good point. Thanks for making it more explicit.

I’m afraid that the Evans discussion is loaded with fallacious affirming-the-consequent arguments, on the order of: having stringently tested H influences your subjective beliefs in H, so your subjective beliefs in H measure the stringency of tests H has passed.

I disagree. Also note that it is *changes* in beliefs/state of mind that Evans uses to measure evidence. And also note that the tacking paradox doesn’t apply to this (just compute the marginals while avoiding the dubious epistemic closure ‘principle’)

Evans can be hard to understand if one does not (repeatedly) work through all the maths – which omaclaren likely did 😉

But this recent expository open access paper of his (with accessible worked examples) might help –

Measuring statistical evidence using relative belief. Michael Evans 2016 http://www.sciencedirect.com/science/article/pii/S2001037015000549

For instance:

“Since evidence is what causes beliefs to change, it is proposed to measure evidence by the amount beliefs change from a priori to a posteriori”

“One of the key concerns with Bayesian inference methods is that the choice of the prior can bias the analysis in various ways.” and ” the solution is to measure a priori whether or not the chosen prior induces bias either in favor of or against”

“Considering the bias in the evidence is connected with the idea of a severe test as discussed in Popper and Mayo and Spanos.”

Yes, here* we have the kind of fallacious affirming the consequent I referred to. Evidence is not the only cause of change of belief, and the question at issue here would be measuring warranted evidence by subjective belief change. Imagine, for some parallels: the way to measure the intrinsic morality of an action is whether the act is commonly seen in humans. Intrinsic morality is not the only reason for prevalence of an action, genuine evidence is not the only reason for a change of belief, etc., etc.

Affirming the consequent fallacy here goes from

If A is the case, there’s (commonly) an influence on B. So B measures A.

Even if A is sufficient, so long as it’s not necessary for B, the premises can be true and the conclusion false. Subjective belief may be a terrible indicator of warranted evidence, prevalence of acts (e.g., lying) isn’t a good indication of an act’s morality.

As for whether “the solution is to measure a priori whether or not the chosen prior induces bias either in favor of or against,” the needed solution just brings us to the task that was to have been carried out by an adequate statistical account.

I’m not sure how he thinks we rule out biases. If he means the prior has no effect, and the appraisal of beliefs is solely a matter of evidence, then we’re back to the task of finding a reliable account of evidence. I don’t think a ratio of beliefs or even likelihoods provides such an account.

I will buy his book.

*”Since evidence is what causes beliefs to change, it is proposed to measure evidence by the amount beliefs change from a priori to a posteriori”

Hi Mayo. As far as I’m aware Evans doesn’t claim that his proposal is deductively valid.

It seems to me to be analogous to the standard way of defining scientific quantities operationally. Consider (wiki):

“One volt is defined as the difference in electric potential between two points of a conducting wire when an electric current of one ampere dissipates one watt of power between those points.”

Also note that his approach is model (or method you might say) bound (eg a measurement of evidence relative to model M, where M is expressed in the probability models). One has to use the same model structure before and after seeing the data in the standard case, but you can allow for ‘paradigm shifts’ in the sense of changing the model structure itself. ‘Beliefs’ are simply distributions over parameter values to be plugged into the model – see Higgs example say.

It’s not the statistical method that needs to be deductive, it’s the arguments one uses. Evans, so far as I can tell, doesn’t deny his account is about subjective beliefs, so the argument becomes not “degrees of belief measure evidence” but “degrees of belief measure degrees of belief”.

But the wording encourages the equivocation to measuring something actually of interest–warranted evidence. That’s the fallacious argument.

*Change in* beliefs (state of info) measures evidence.

Change in momentum (state) equals applied force but a force is not a momentum.

So, in this analogy:

belief/info state – momentum

evidence – force

It needn’t be equal: if evidence is not a necessary condition for belief change, then belief change may be found despite the lack of said cause.

No but it is a useful definition that relates distinct ideas. I probably wouldn’t use ’cause’ in this context, more likely balance of information flow within a model or somesuch. Like balance of momentum transfers. Personally I think directionality (cause) tends to stem from boundary conditions while the laws themselves are invariant and adirectional.

(also – another point of difference I probably have with Evans/Keith is that I often prefer to think of measurements as finite approximations to a continuous reality which appears to be the opposite of their view. Luckily, Evans’ approach to continuous problems appears consistent with the modern approaches to infinite dimensional bayesian inverse problems starting from Radon-Nikodym derivatives.)

I actually thought including “cause” helped his argument ever so slightly.

Interesting. Like I said, I tend to prefer to keep the mechanisms invariant and adirectional and the ‘causal’ direction determined by initial/boundary conditions.

For example a voltage difference doesn’t cause a current and a current doesn’t cause a voltage difference but there is an invariant relationship between them in terms of the power dissipated. One can either impose (as a boundary condition) a voltage clamp and measure a current or impose a current clamp and measure a voltage difference. You might say an *imposed* voltage causes a current but you could also just keep ‘laws’ and ‘boundary/initial conditions’ separate.

This also applies to problems in mechanics (e.g. D’Alembert’s principle) and, in my view, to statistical inference.
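The clamp picture above can be sketched in a few lines. This is only an illustration with assumed values (a 50-ohm resistor), not anything from the discussion: the same invariant law V = IR is solved under either boundary condition, and the two clamps agree wherever they describe the same (V, I) pair.

```python
# The 'law' V = I * R is invariant and adirectional; the apparent causal
# direction comes from which quantity we impose as a boundary condition.
R = 50.0  # ohms (assumed resistance for the sketch)

def current_under_voltage_clamp(v_imposed):
    """Impose a voltage difference, measure the resulting current."""
    return v_imposed / R

def voltage_under_current_clamp(i_imposed):
    """Impose a current, measure the resulting voltage difference."""
    return i_imposed * R

# Round-tripping through either clamp recovers the same (V, I) pair,
# and hence the same dissipated power P = V * I.
v = 5.0
i = current_under_voltage_clamp(v)
power = v * i
```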

One of the problems under discussion is the reproducibility of P-values. There are many reasons for non-reproducibility that have to do with the care with which the data were obtained. In the following I will ignore such reasons and suppose that the model is adequate. As an example I take measurements of the copper content of a sample of drinking water:

2.16 2.21 2.15 2.05 2.06 2.04 1.90 2.03 2.06 2.02 2.06 1.92 2.08 2.05 1.88 1.99 2.01 1.86 1.70 1.88 1.99 1.93 2.20 2.02 1.92 2.13 2.13

The model is N(mu, sigma^2) and the null hypothesis is H_0: mu = 2.09. The standard P-value based on the t-statistic is pt(sqrt(27)*(mean(copper)-2.09)/sd(copper), 26) = 0.0017, if my calculations are correct. I have never understood this definition and consequently I don’t like it. But leaving that aside, we can investigate the reproducibility of the P-value within the context of the model. To do this we simulate data sets of size n=27 under a N(mu, sigma^2) model and calculate the P-values for H_0. The question is which (mu, sigma) to choose. Suppose we put mu = mean(copper) = 2.016 and sigma = sd(copper) = 0.116. The simulations show that in 21% of the cases the P-value exceeds 0.01, and in 6.3% of the cases it exceeds 0.05. We could choose other values of (mu, sigma) for the simulations. The restriction will be that they are consistent with the copper data; we can define this as belonging to the 0.9-confidence region for (mu, sigma). Suppose we calculate this by using the t-statistic for mu and the chi-squared statistic for sigma. If my calculations are correct, the region is (1.978, 2.054) x (0.0915, 0.159). Now (2.04, 0.14) lies in this region, and if we perform the simulations with these values then in 46% of the cases the P-value exceeds 0.05 and in 15.5% it exceeds 0.2. On this basis an initial P-value of 0.0017 followed by a repetition with a P-value of 0.2 is not all that surprising.
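The simulation is easy to reproduce. Here is a minimal sketch in Python with numpy/scipy (the comment uses R-style calls such as pt and sd; this is a translation, and the exact percentages will vary slightly with the random seed and number of simulations):

```python
import numpy as np
from scipy import stats

copper = np.array([2.16, 2.21, 2.15, 2.05, 2.06, 2.04, 1.90, 2.03, 2.06,
                   2.02, 2.06, 1.92, 2.08, 2.05, 1.88, 1.99, 2.01, 1.86,
                   1.70, 1.88, 1.99, 1.93, 2.20, 2.02, 1.92, 2.13, 2.13])
n = copper.size  # 27
mu0 = 2.09

# One-sided P-value from the t-statistic, as in the pt(...) call above.
t = np.sqrt(n) * (copper.mean() - mu0) / copper.std(ddof=1)
p = stats.t.cdf(t, df=n - 1)  # small, on the order of 0.001-0.002

# Reproducibility: simulate fresh samples of size 27 from N(mu, sigma^2)
# with (mu, sigma) set to the sample mean and standard deviation.
rng = np.random.default_rng(0)
mu, sigma = copper.mean(), copper.std(ddof=1)
sims = rng.normal(mu, sigma, size=(100_000, n))
t_sim = np.sqrt(n) * (sims.mean(axis=1) - mu0) / sims.std(axis=1, ddof=1)
p_sim = stats.t.cdf(t_sim, df=n - 1)

frac_gt_001 = (p_sim > 0.01).mean()  # roughly a fifth of repetitions
frac_gt_005 = (p_sim > 0.05).mean()  # a few percent of repetitions
```

Rerunning the simulation with other (mu, sigma) pairs from the 0.9-confidence region, such as (2.04, 0.14), reproduces the larger spread of P-values described above.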