So, the paper by Valentin Amrhein, Sander Greenland, and Blake McShane that we discussed a few weeks ago has just appeared online as a comment piece in Nature, along with a letter with hundreds (or is it thousands?) of supporting signatures.
Following the first circulation of that article, its authors and some others of us had an email discussion that I thought might be of general interest.
I won’t copy out all the emails, but I’ll share enough to try to convey the sense of the conversation, and any readers are welcome to continue the discussion in the comments.
1. Is it appropriate to get hundreds of people to sign a letter of support for a scientific editorial?
John Ioannidis wrote:
Brilliant Comment! I am extremely happy that you are publishing it and that it will certainly attract a lot of attention.
He had some specific disagreements (see below for more on this). Also, he was bothered by the group-signed letter and wrote:
I am afraid that what you are doing at this point is not science, but campaigning. Leaving the scientific merits and drawbacks of your Comment aside, I am afraid that a campaign to collect signatures for what is a scientific method and statistical inference question sets a bad precedent. It is one thing to ask for people to work on co-drafting a scientific article or comment. This takes effort, real debate, multiple painful iterations among co-authors, responsibility, undiluted attention to detailed arguments, and full commitment. Lists of signatories have a very different role. They do make sense for issues of politics, ethics, and injustice. However, I think that they have no place in choosing and endorsing scientific methods. Otherwise scientific methodology would be validated, endorsed and prioritized based on who has the most popular Twitter, Facebook or Instagram account. I dread to imagine who will prevail.
To this, Sander Greenland replied:
YES we are campaigning and it’s long overdue . . . because YES this is an issue of politics, ethics, and injustice! . . .
My own view is that this significance issue has been a massive problem in the sociology of science, hidden and often hijacked by those pundits under the guise of methodology or “statistical science” (a nearly oxymoronic term). Our commentary is an early step toward revealing that sad reality. Not one point in our commentary is new, and our central complaints (like ending the nonsense we document) have been in the literature for generations, to little or no avail – e.g., see Rothman 1986 and Altman & Bland 1995, attached, and then the travesty of recent JAMA articles like the attached Brown et al. 2017 paper (our original example, which Nature nixed over sociopolitical fears). Single commentaries even with 80 authors have had zero impact on curbing such harmful and destructive nonsense. This is why we have felt compelled to turn to a social movement: Soft-pedaled academic debate has simply not worked. If we fail, we will have done no worse than our predecessors (including you) in cutting off the harmful practices that plague about half of scientific publications, and affect the health and safety of entire populations.
And I replied:
I signed the form because I feel that this would do more good than harm, but as I wrote here, I fully respect the position of not signing any petitions. Just to be clear, I don’t think that my signing of the form is an act of campaigning or politics. I just think it’s a shorthand way of saying that I agree with the general points of the published article and that I agree with most of its recommendations.
Zad Chow replied more agnostically:
Whether political or not, signing a piece as a form of endorsement seems far more appropriate than having papers with mass authorships of 50+ authors, where it is unlikely that every single one of those authors contributed enough to actually be an author, and where their placement as an author is also a political message.
I also wonder if such pieces, whether they be mass authorships or endorsements by signing, actually lead to notable change. My guess is that they really don’t, but whether or not such endorsements are “popularity contests” via social media, I think I’d prefer that people who participate in science have some voice in the matter, rather than having the views of a few influential individuals, whether they be methodologists or journal editors, constantly repeated and executed in different outlets.
2. Is “retiring statistical significance” really a good idea?
Now on to problems with the Amrhein et al. article. I mostly liked it, although I did have a couple places where I suggested changes of emphasis, as noted in my post linked above. The authors made some of my suggested changes; in other places I respect their decisions even if I might have written things slightly differently.
Ioannidis had more concerns, as he wrote in an email listing a bunch of specific disagreements with points in the article:
1. Statement: Such misconceptions have famously warped the literature with overstated claims and, less famously, led to claims of conflicts between studies where none exist
Why it is misleading: Removing statistical significance entirely will allow anyone to make any overstated claim about any result being important. It will also facilitate claiming that there are no conflicts between studies when conflicts do exist.
2. Statement: Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P-value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero.
Why it is misleading: In most scientific fields we need to conclude something and then convey our uncertainty about the conclusion. Clear, pre-specified rules on how to conclude are needed. Otherwise, anyone can conclude anything according to one’s whim. In many cases using sufficiently stringent p-value thresholds, e.g. p=0.005 for many disciplines (or properly multiplicity-adjusted p=0.05, e.g. 10⁻⁹ for genetics, or FDR or Bayes factor thresholds, or any thresholds) makes perfect sense. We need to make some careful choices and move on. Saying that any and all associations cannot be 100% dismissed is correct strictly speaking, but practically it is nonsense. We will get paralyzed because we cannot exclude that everything may be causing everything.
3. Statement: statistically non-significant results were interpreted as indicating ‘no difference’ in XX% of articles
Why it is misleading: this may have been entirely appropriate in many/most/all cases; one has to examine each of them carefully. It is probably at least as inappropriate, or even more so, that some/many of the remaining 100-XX% were not indicated as “no difference.”
4. Statement: The editors introduce the collection (2) with the caution “don’t say ‘statistically significant’.” Another article (3) with dozens of signatories calls upon authors and journal editors to disavow the words. We agree and call for the entire concept of statistical significance to be abandoned. We don’t mean to drop P-values, but rather to stop using them dichotomously to decide whether a result refutes or supports a hypothesis.
Why it is misleading: please see my e-mail about what I think regarding the inappropriateness of having “signatories” when we are discussing scientific methods. We do need to reach conclusions dichotomously most of the time: is this genetic variant causing depression, yes or no? Should I spend 1 billion dollars to develop a treatment based on this pathway, yes or no? Is this treatment effective enough to warrant taking it, yes or no? Is this pollutant causing cancer, yes or no?
5. Statement: whole paragraph beginning with “Tragically…”
Why it is misleading: we have no evidence that if people did not have to defend their data as statistically significant, publication bias would go away and people would not be reporting whatever results look nicer, stronger, more desirable and more fit to their biases. Statistical significance or any other preset threshold (e.g. Bayesian or FDR) sets an obstacle to making unfounded claims. People may play tricks to pass the obstacle, but setting no obstacle is worse.
6. Statement: For example, the difference between getting P = 0.03 versus P = 0.06 is the same as the difference between getting heads versus tails on a single fair coin toss (8).
Why it is misleading: this example is factually wrong; it is true only if we are certain that the effect being addressed is truly non-null.
7. Statement: One way to do this is to rename confidence intervals ‘compatibility intervals,’ …
Why it is misleading: Probably the last thing we need in the current confusing situation is to add yet another new, idiosyncratic term. “Compatibility” is even a poor choice, probably worse than “confidence”. Results may be entirely off due to bias, and the X% CI (whatever C stands for) may not even include the truth much of the time if bias is present.
8. Statement: We recommend that authors describe the practical implications of all values inside the interval, especially the observed effect or point estimate (that is, the value most compatible with the data) and the limits.
Why it is misleading: I think it is far more important to consider what biases may exist and which may lead the entire interval, no matter what we call it, to be off and thus incompatible with the truth.
9. Statement: We’re frankly sick of seeing nonsensical ‘proofs of the null’ and claims of non-association in presentations, research articles, reviews, and instructional materials.
Why it is misleading: I (and many others) am frankly sick of seeing nonsensical “proofs of the non-null”: people making strong statements about associations and even causality with (or even without) formal statistical significance (or other statistical inference tools), plus tons of spin and bias. Removing the statistical significance obstacle entirely will just give a free-lunch, all-is-allowed bonus to make any desirable claim. All science will become like nutritional epidemiology.
10. Statement: That means you can and should say “our results indicate a 20% increase in risk” even if you found a large P-value or a wide interval, as long as you also report and discuss the limits of that interval.
Why it is misleading: yes, indeed. But then, welcome to the world where everything is important, noteworthy, must be licensed, must be sold, must be bought, must lead to public health policy, must change our world.
11. Statement: Paragraph starting with “Third, the default 95% used”
Why it is misleading: indeed, but this means that more appropriate P-value thresholds and, respectively, X% CIs are preferable, and these need to be decided carefully in advance. Otherwise, everything is done post hoc and any pre-conceived bias of the investigator can be “supported”.
12. Statement: Factors such as background evidence, study design, data quality, and mechanistic understanding are often more important than statistical measures like P-values or intervals (10).
Why it is misleading: while it sounds reasonable that all these other factors are important, most of them are often substantially subjective. Conversely, statistical analysis at least has some objectivity, and if the rules are carefully set before the data are collected and the analysis is run, then statistical guidance based on some thresholds (p-values, Bayes factors, FDR, or other) can be useful. Otherwise statistical inference also becomes entirely post hoc and subjective.
13. Statement: The objection we hear most against retiring statistical significance is that it is needed to make yes-or-no decisions. But for the choices often required in regulatory, policy, and business environments, decisions based on the costs, benefits, and likelihoods of all potential consequences always beat those made based solely on statistical significance. Moreover, for decisions about whether to further pursue a research idea, there is no simple connection between a P-value and the probable results of subsequent studies.
Why it is misleading: This argument is equivalent to hand waving. Indeed, most of the time yes/no decisions need to be made and this is why removing statistical significance and making it all too fluid does not help. It leads to an “anything goes” situation. Study designs for questions that require decisions need to take all these other parameters into account ideally in advance (whenever possible) and set some pre-specified rules on what will be considered “success”/actionable result and what not. This could be based on p-values, Bayes factors, FDR, or other thresholds or other functions, e.g. effect distribution. But some rule is needed for the game to be fair. Otherwise we will get into more chaos than we have now, where subjective interpretations already abound. E.g. any company will be able to claim that any results of any trial on its product do support its application for licensing.
14. Statement: People will spend less time with statistical software and more time thinking.
Why it is misleading: I think it is unlikely that people will spend less time with statistical software but it is likely that they will spend more time mumbling, trying to sell their pre-conceived biases with nice-looking narratives. There will be no statistical obstacle on their way.
15. Statement: the approach we advocate will help halt overconfident claims, unwarranted declarations of ‘no difference,’ and absurd statements about ‘replication failure’ when results from original and the replication studies are highly compatible.
Why it is misleading: the proposed approach will probably paralyze efforts to refute the millions of nonsense statements that have been propagated by biased research, mostly observational, but also many subpar randomized trials.
Overall assessment: the Comment is written with an undercurrent belief that there are zillions of true, important effects out there that we erroneously dismiss. The main problem is quite the opposite: there are zillions of nonsense claims of associations and effects that, once published, are very difficult to get rid of. The proposed approach will make people who have tried to cheat by massaging statistics very happy, since now they would not have to worry at all about statistics. Any results can be spun to fit their narrative. Getting rid entirely of statistical significance and preset, carefully considered thresholds has the potential of making nonsense irrefutable and invincible.
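A quick note on point 6 before moving on: the coin-toss comparison in the Comment comes, as I understand it, from the Shannon-information (or S-value) reading of P-values, and the arithmetic behind it is easy to check (a minimal sketch; the variable names are mine):

```python
import math

# Shannon surprisal (S-value) of a P-value: s = -log2(p), read as
# bits of information against the model that produced the P-value.
s_03 = -math.log2(0.03)  # about 5.06 bits
s_06 = -math.log2(0.06)  # about 4.06 bits

# The gap is log2(0.06 / 0.03) = 1 bit: the information carried by
# one fair coin toss landing heads rather than tails.
gap = s_03 - s_06
```

Whether that one-bit framing is an apt summary of the evidence is exactly what point 6 disputes.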
That said, despite these various specific points of disagreement, Ioannidis emphasized that Amrhein et al. raise important points that “need to be given an opportunity to be heard loud and clear and in their totality.”
In reply to Ioannidis’s points above, I replied:
1. You write, “Removing statistical significance entirely will allow anyone to make any overstated claim about any result being important.” I completely disagree. Or, maybe I should say, anyone is already allowed to make any overstated claim about any result being important. That’s what PNAS is, much of the time. To put it another way: I believe that embracing uncertainty and avoiding overstated claims are important. I don’t think statistical significance has much to do with that.
2. You write, “In most scientific fields we need to conclude something and then convey our uncertainty about the conclusion. Clear, pre-specified rules on how to conclude are needed. Otherwise, anyone can conclude anything according to one’s whim.” Again, it is already the case that people can conclude what they want. One concern is what is done by scientists who are honestly trying to do their best. I think those scientists are often misled by statistical significance, all the time, ALL THE TIME, taking patterns that are “statistically significant” and calling them real, and taking patterns that are “not statistically significant” and treating them as zero. Entire scientific papers are, through this mechanism, data in, random numbers out. And this doesn’t even address the incentives problem, by which statistical significance can create an actual disincentive to gather high-quality data.
I disagree with many other items on your list, but two is enough for now. I think the overview is that you’re pointing out that scientists and consumers of science want to make reliable decisions, and statistical significance, for all its flaws, delivers some version of reliable decisions. And my reaction is that whatever plus it is that statistical significance sometimes provides reliable decisions, is outweighed by (a) all the times that statistical significance adds noise and provides unreliable decisions, and (b) the false sense of security that statistical significance gives so many researchers.
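The “misled by statistical significance” point can be made concrete with a small simulation of the significance filter (a sketch with invented numbers: a small true effect measured with a large standard error, i.e., a low-power study):

```python
import numpy as np

# Low-power setting: true effect 2, standard error 10.
rng = np.random.default_rng(0)
true_effect, se = 2.0, 10.0
estimates = rng.normal(true_effect, se, size=200_000)

# Keep only the estimates that reach "statistical significance".
significant = estimates[np.abs(estimates) > 1.96 * se]

# Conditional on significance, the surviving estimates overstate the
# true effect (a type M error), here by roughly an order of magnitude...
exaggeration = np.abs(significant).mean() / true_effect

# ...and a sizable fraction of them even have the wrong sign.
wrong_sign = (significant < 0).mean()
```

Selecting on the threshold turns noisy but honest estimates into guaranteed overstatements, which is the sense in which such papers can be data in, random numbers out.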
One reason this is all relevant, and interesting, is that we all agree on so much—yet we disagree so strongly here. I’d love to push this discussion toward the real tradeoffs that arise when considering alternative statistical recommendations, and I think what Ioannidis wrote, along with the Amrhein/Greenland/McShane article, would be a great starting point.
Ioannidis then responded to me:
On whether removal of statistical significance will increase or decrease the chances that overstated claims will be made and authors will be more or less likely to conclude according to their whim, the truth is that we have no randomized trial to tell whether you are right or I am right. I fully agree that people are often confused about what statistical significance means, but does this mean we should ban it? Should we also ban FDR thresholds? Should we also ban Bayes factor thresholds? Also probably we have different scientific fields in mind. I am afraid that if we ban thresholds and other (ideally pre-specified) rules, we are just telling people to just describe their data as best as they can and unavoidably make strength-of-evidence statements as they wish, kind of impromptu and post-hoc. I don’t think this will work. The notion that someone can just describe the data without making any inferences seems unrealistic and it also defies the purpose of why we do science: we do want to make inferences eventually and many inferences are unavoidably binary/dichotomous. Also actions based on inferences are binary/dichotomous in their vast majority.
I agree that the effects of any interventions are unknown. We’re offering, or trying to offer, suggestions for good statistical practice in the hope that this will lead to better outcomes. This uncertainty is a key reason why this discussion is worth having, I think.
3. Mob rule, or rule of the elites, or gatekeepers, consensus, or what?
One issue that came up is, what’s the point of that letter with all those signatories? Is it mob rule, the idea that scientific positions should be determined by those people who are loudest and most willing to express strong opinions (“the mob” != “the silent majority”)? Or does it represent an attempt by well-connected elites (such as Greenland and myself!) to tell people what to think? Is the letter attempting to serve a gatekeeping function by restricting how researchers can analyze their data? Or can this all be seen as a crude attempt to establish a consensus of the scientific community?
None of these seem so great! Science should be determined by truth, accuracy, reproducibility, strength of theory, real-world applicability, moral values, etc. All sorts of things, but these should not be the property of the mob, or the elites, or gatekeepers, or a consensus.
That said, the mob, the elites, gatekeepers, and the consensus aren’t going anywhere. Like it or not, people do pay attention to online mobs. I hate it, but it’s there. And elites will always be with us, sometimes for good reasons. I don’t think it’s such a bad idea that people listen to what I say, in part on the strength of my carefully-written books—and I say that even though, at the beginning of my career, I had to spend a huge amount of time and effort struggling against the efforts of elites (my colleagues in the statistics department at the University of California, and their friends elsewhere) who did their best to use their elite status to try to put me down. And gatekeepers . . . hmmm, I don’t know if we’d be better off without anyone in charge of scientific publishing and the news media—but, again, the gatekeepers are out there: NPR, PNAS, etc. are real, and the gatekeepers feed off of each other: the news media bow down before papers published in top journals, and the top journals jockey for media exposure. Finally, the scientific consensus is what it is. Of course people mostly do what’s in textbooks, and published articles, and what they see other people do.
So, for my part, I see that letter of support as Amrhein, Greenland, and McShane being in the arena, recognizing that mob, elites, gatekeepers, and consensus are real, and trying their best to influence these influencers and to counter negative influences from all those sources. I agree with the technical message being sent by Amrhein et al., as well as with their open way of expressing it, so I’m fine with them making use of all these channels, including getting lots of signatories, enlisting the support of authority figures, working with the gatekeepers (their comment is being published in Nature, after all; that’s one of the tabloids), and openly attempting to shift the consensus.
Amrhein et al. don’t have to do it that way. It would be also fine with me if they were to just publish a quiet paper in a technical journal and wait for people to get the point. But I’m fine with the big push.
4. And now to all of you . . .
As noted above, I accept the continued existence and influence of mob, elites, gatekeepers, and consensus. But I’m also bothered by these, and I like to go around them when I can.
Hence, I’m posting this on the blog, where we have the habit of reasoned discussion rather than mob-like rhetorical violence, where the comments have no gatekeeping (in 15 years of blogging, I’ve had to delete fewer than 5 out of 100,000 comments—that’s 0.005%!—because they were too obnoxious), and where any consensus is formed from discussion that might just lead to the pluralistic conclusion that sometimes no consensus is possible. And by opening up our email discussion to all of you, I’m trying to demystify (to some extent) the elite discourse and make this a more general conversation.
P.S. There’s some discussion in comments about what to do in situations like the FDA testing a new drug. I have a response to this point, and it’s what Blake McShane, David Gal, Christian Robert, and Jennifer Tackett wrote in section 4.4 of our article, Abandon Statistical Significance:
While our focus has been on statistical significance thresholds in scientific publication, similar issues arise in other areas of statistical decision making, including, for example, neuroimaging where researchers use voxelwise NHSTs to decide which results to report or take seriously; medicine where regulatory agencies such as the Food and Drug Administration use NHSTs to decide whether or not to approve new drugs; policy analysis where non-governmental and other organizations use NHSTs to determine whether interventions are beneficial or not; and business where managers use NHSTs to make binary decisions via A/B tests. In addition, thresholds arise not just around scientific publication but also within research projects, for example, when researchers use NHSTs to decide which avenues to pursue further based on preliminary findings.
While considerations around taking a more holistic view of the evidence and consequences of decisions are rather different across each of these settings and different from those in scientific publication, we nonetheless believe our proposal to demote the p-value from its threshold screening role and emphasize the currently subordinate factors applies in these settings. For example, in neuroimaging, the voxelwise NHST approach misses the point in that there are typically no true zeros and changes are generally happening at all brain locations at all times. Plotting images of estimates and uncertainties makes sense to us, but we see no advantage in using a threshold.
For regulatory, policy, and business decisions, cost-benefit calculations seem clearly superior to acontextual statistical thresholds. Specifically, and as noted, such thresholds implicitly express a particular tradeoff between Type I and Type II error, but in reality this tradeoff should depend on the costs, benefits, and probabilities of all outcomes.
That said, we acknowledge that thresholds—of a non-statistical variety—may sometimes be useful in these settings. For example, consider a firm contemplating sending a costly offer to customers. Suppose the firm has a customer-level model of the revenue expected in response to the offer. In this setting, it could make sense for the firm to send the offer only to customers that yield an expected profit greater than some threshold, say, zero.
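That expected-profit rule takes only a few lines to state (the numbers and names here are hypothetical, not from the article):

```python
# Hypothetical sketch of the non-statistical threshold described above.

def expected_profit(p_respond: float, revenue_if_respond: float,
                    offer_cost: float) -> float:
    """Expected profit from sending the costly offer to one customer."""
    return p_respond * revenue_if_respond - offer_cost

def send_offer(p_respond: float, revenue_if_respond: float,
               offer_cost: float, threshold: float = 0.0) -> bool:
    """Send only if expected profit clears the chosen threshold."""
    return expected_profit(p_respond, revenue_if_respond, offer_cost) > threshold

# A 10%-likely responder worth $50 clears a $2 offer cost;
# a 2%-likely responder does not.
```

The threshold here is on a decision-relevant quantity (profit), not on a P-value.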
Even in pure research scenarios where there is no obvious cost-benefit calculation—for example a comparison of the underlying mechanisms, as opposed to the efficacy, of two drugs used to treat some disease—we see no value in p-value or other statistical thresholds. Instead, we would like to see researchers simply report results: estimates, standard errors, confidence intervals, etc., with statistically inconclusive results being relevant for motivating future research.
While we see the intuitive appeal of using p-value or other statistical thresholds as a screening device to decide what avenues (e.g., ideas, drugs, or genes) to pursue further, this approach fundamentally does not make efficient use of data: there is in general no connection between a p-value—a probability based on a particular null model—and either the potential gains from pursuing a potential research lead or the predictive probability that the lead in question will ultimately be successful. Instead, to the extent that decisions do need to be made about which lines of research to pursue further, we recommend making such decisions using a model of the distribution of effect sizes and variation, thus working directly with hypotheses of interest rather than reasoning indirectly from a null model.
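One simple instance of “a model of the distribution of effect sizes” is normal-normal shrinkage: pull each noisy estimate toward the population of effects and rank leads by the shrunken values rather than by P-values (a sketch; in practice mu and tau would themselves be estimated, e.g. by a multilevel model):

```python
import numpy as np

def posterior_means(y, se, mu=0.0, tau=1.0):
    """Posterior mean of each true effect under y_i ~ N(theta_i, se_i^2)
    with theta_i ~ N(mu, tau^2): estimates are shrunk toward mu in
    proportion to their noise."""
    y, se = np.asarray(y, float), np.asarray(se, float)
    w = tau**2 / (tau**2 + se**2)  # shrinkage weight: 1 = keep, 0 = discard
    return mu + w * (y - mu)

# A huge but very noisy estimate gets pulled back hard, while a modest,
# precisely measured one survives shrinkage and ranks first.
y, se = [8.0, 1.5], [10.0, 0.5]
shrunk = posterior_means(y, se)
ranking = np.argsort(-shrunk)  # best lead first
```

Screening on the shrunken estimates works directly with the hypotheses of interest, instead of reasoning indirectly from a null model.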
We would also like to see—when possible in these and other settings—more precise individual-level measurements, a greater use of within-person or longitudinal designs, and increased consideration of models that use informative priors, that feature varying treatment effects, and that are multilevel or meta-analytic in nature (Gelman, 2015, 2017; McShane and Bockenholt, 2017, 2018).