A debate about effect-size variation in psychology: Simmons and Simonsohn; McShane, Böckenholt, and Hansen; Judd and Kenny; and Stanley and Doucouliagos

A couple weeks ago, Uri Simonsohn and Joe Simmons sent me and others a note that they were writing a blog post citing some of our work and asking for us to point out anything that we find “inaccurate, unfair, snarky, misleading, or in want of a change for any reason.”

I took a quick look and decided that my part in this was small enough that I didn’t really have anything to say. But some of my colleagues did have reactions which they shared with the blog authors. Unfortunately, Simonsohn and Simmons did not want to post these replies on their blog or to link to them, so my colleagues asked me to post something here. So that’s what I’m doing.

1. Post by Joe Simmons and Uri Simonsohn

This is the post that started it all, and it begins:

A number of authors have recently proposed that (i) psychological research is highly unpredictable, with identical studies obtaining surprisingly different results, (ii) the presence of heterogeneity decreases the replicability of psychological findings. In this post we provide evidence that contradicts both propositions.

Consider these quotes:

“heterogeneity persists, and to a reasonable degree, even in […] Many Labs projects […] where rigid, vetted protocols with identical study materials are followed […] heterogeneity […] cannot be avoided in psychological research—even if every effort is taken to eliminate it.”
McShane, Tackett, Böckenholt, and Gelman (American Statistician 2019)

“Heterogeneity […] makes it unlikely that the typical psychological study can be closely replicated”
Stanley, Carter, and Doucouliagos (Psychological Bulletin 2018)

“Repeated investigations of the same phenomenon [get] effect sizes that vary more than one would expect […] even in exact replication studies. […] In the presence of heterogeneity, […] even large N studies may find a result in the opposite direction from the original study. This makes us question the wisdom of placing a great deal of faith in a single replication study”
Judd and Kenny (Psychological Methods 2019)

This post is not an evaluation of the totality of these three papers, but rather a specific evaluation of the claims in the quoted text. . .

2. Response by Blakeley McShane, Ulf Böckenholt, and Karsten Hansen

I wish Simmons and Simonsohn had just linked to this, but since they didn’t, here it is. And here’s the summary that McShane, Böckenholt, and Hansen wrote for me to post here:

We thank Joe and Uri for featuring our papers in their blogpost and Andrew for hosting a discussion of it. We keep our remarks brief here but note that (i) the longer comments that we sent Joe and Uri before their post went live are available here (they denied our request to link to this from their blogpost) and (ii) our “Large-Scale Replication” paper that discusses many of these issues in greater depth (especially on page 101) is available here.

A long tradition has argued that heterogeneity is unavoidable in psychological research. Joe and Uri seem to accept this reality when study stimuli are varied. However, they seem to categorically deny it when study stimuli are held constant but study contexts (e.g., labs in Many Labs, waves in their Maluma example) are varied. Their view seems both dogmatic and obviously false (e.g., should studies with stimuli featuring Michigan students yield the same results when conducted on Michigan versus Ohio State students? Should studies with English-language stimuli yield the same results when conducted on English speakers versus non-English speakers?). And, even in their own tightly-controlled Maluma example, the average difference across waves is ≈15% of the overall average effect size.

Further, the analyses Joe and Uri put forth in favor of their dogma are woefully unconvincing to all but true believers. Specifically, their analyses amount to (i) assuming or forcing homogeneity across contexts, (ii) employing techniques with weak ability to detect heterogeneity, and (iii) concluding in favor of homogeneity when the handicapped techniques fail to detect heterogeneity. This is not particularly persuasive, especially given that these detection issues are greatly exacerbated by the paucity of waves/labs in the Maluma, Many Labs, M-Turk, and RRR data and the sparsity in the Maluma data which result in low power to detect and imprecise estimates of heterogeneity across contexts.

Joe and Uri also seem to misattribute to us the view that psychological research is in general “highly unpredictable” and that this makes replication hopeless or unlikely. To be clear, we along with many others believe exact replication is not possible in psychological research and therefore (by definition) some degree of heterogeneity is inevitable. Yet, we are entirely open to the idea that certain paradigms may evince low heterogeneity across stimuli, contexts, or both—perhaps even so low that one may ignore it without introducing much error (at least for some purposes if not all). However, it seems clearly fanatical to impose the view that heterogeneity is zero or negligible a priori. It cannot be blithely assumed away, and thus we have argued it is one of the many things that must be accounted for in study design and statistical analysis whether for replication or more broadly.

But, we would go further: heterogeneity is not a nuisance but something to embrace! We can learn much more about the world by using methods that assess and allow/account for heterogeneity. And, heterogeneity provides an opportunity to enrich theory because it can suggest the existence of unknown or unaccounted-for moderators.

It is obvious and uncontroversial that heterogeneity impacts replicability. The question is not whether but to what degree, and this will depend on how heterogeneity is measured, its extent, and how replicability is operationalized in terms of study design, statistical findings, etc. A serious and scholarly attempt to investigate this is both welcome and necessary!
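As an aside on point (ii) of the response above, the claim that heterogeneity tests have little power when there are only a few labs or waves can be checked by simulation. The sketch below is not from McShane, Böckenholt, and Hansen. It assumes a normal random-effects model with equal standard errors in every lab, and all the numbers (mean effect 0.3, standard error 0.1, between-lab standard deviation 0.05) are hypothetical. The critical value of Cochran's Q is estimated by simulating under homogeneity rather than taken from a chi-squared table.

```python
import random

def cochran_q(estimates, se):
    """Cochran's Q when every study has the same standard error."""
    mean = sum(estimates) / len(estimates)
    return sum((d - mean) ** 2 for d in estimates) / se ** 2

def simulate_q(k, tau, se, mu, n_sims, rng):
    """Simulate n_sims sets of k lab estimates and return their Q statistics."""
    qs = []
    for _ in range(n_sims):
        # Each lab gets its own true effect (sd tau), then sampling noise (sd se).
        estimates = [rng.gauss(rng.gauss(mu, tau), se) for _ in range(k)]
        qs.append(cochran_q(estimates, se))
    return qs

def power_to_detect(k, tau, se=0.1, mu=0.3, n_sims=4000, seed=1):
    """Share of simulations in which Q exceeds its simulated 5% critical value."""
    rng = random.Random(seed)
    null_qs = sorted(simulate_q(k, 0.0, se, mu, n_sims, rng))
    critical = null_qs[int(0.95 * n_sims)]  # 95th percentile under homogeneity
    alt_qs = simulate_q(k, tau, se, mu, n_sims, rng)
    return sum(q > critical for q in alt_qs) / n_sims

# Hypothetical settings: the same true heterogeneity, few labs versus many labs.
power_5_labs = power_to_detect(k=5, tau=0.05)
power_50_labs = power_to_detect(k=50, tau=0.05)
```

Under these made-up settings the test flags the heterogeneity in only about one in ten simulated five-lab projects, and noticeably more often with fifty labs. Failing to reject homogeneity with a handful of labs is therefore weak evidence that heterogeneity is absent.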

3. Response by Charles Judd and David Kenny

Judd and Kenny’s response goes as follows:

Joe Simmons and Uri Simonsohn attribute to us (Kenny & Judd, 2019) a version of effect size heterogeneity that we are not sure we recognize. This is largely because the empirical results that they show seem to us perfectly consistent with the model of heterogeneity that we thought we had proposed. In the following we try to clearly say what our heterogeneity model really is and how Joe and Uri’s data seem to us consistent with that model.

Our model posits that an effect size from any given study, 𝑑_i, estimates some true effect size, 𝛿_i, and that these true effect sizes have some variation, 𝜎_𝛿, around their mean, 𝜇_𝛿. What might be responsible for this variation (i.e., the heterogeneity of true effect sizes)? There are many potential factors, but certainly among such factors are procedural variations of the sort that Joe and Uri include in the studies they report.

In the series of studies Joe and Uri conducted, participants are shown two shapes, one more rounded and one more jagged. Participants are then given two names, one male and one female, and asked which name is more likely to go with which shape. Across studies, different pairs of male and female names are used, but always with the same two shapes.

What Joe and Uri report is that across all studies there is an average effect (with the female name of the pair being seen as more likely for the rounded shape), but that the effect sizes in the individual studies vary considerably depending on which name pair is used in any particular study. For instance, when the name pair consists of Sophia and Jack, the effect is substantially larger than when the name pair consists of Liz and Luca.

Joe and Uri then replicate these studies a second time and show that the variation in the effect sizes across the different name-pairs is quite replicable, yielding a very substantial correlation of the effect sizes between the two replications, computed across the different name-pairs.

We believe that our model of heterogeneity can fully account for these results. The individual name-pairs each have a true effect size associated with them, 𝛿_i, and these vary around their grand mean 𝜇_𝛿. Different name-pairs produce heterogeneity of effect sizes. Name-pairs constitute a random factor that moderates the effect sizes obtained. It most properly ought to be incorporated into a single analysis of all the obtained data, across all the studies they report, treating it and participants as factors that induce random variation in the effect of interest (Judd, Kenny, & Westfall, 2012; 2017). . . .

The point is that there are potentially a very large number of random factors that may moderate effect sizes and that may vary from replication attempt to replication attempt. In Joe and Uri’s work, these other random factors didn’t vary, but that’s usually not the case when one decides to replicate someone else’s effect. Sample selection methods vary, stimuli vary in subtle ways, lighting varies, external conditions and participant motivation vary, experimenters vary, etc. The full list of potential moderators is long and perhaps ultimately unknowable. And heterogeneity is likely to ensue. . . .
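To make the model described in this response concrete, here is a minimal simulation sketch. It is an illustration, not Judd and Kenny's code. Each name-pair i gets a true effect δ_i drawn around μ_δ with standard deviation σ_δ, and each of two replication waves observes δ_i plus independent sampling noise. The number of name-pairs and all parameter values are hypothetical, and normal distributions are an assumption.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simulate_two_waves(n_pairs=200, mu_delta=0.4, sigma_delta=0.3, se=0.1, seed=7):
    """True effects vary across name-pairs; each wave adds its own sampling noise."""
    rng = random.Random(seed)
    true_effects = [rng.gauss(mu_delta, sigma_delta) for _ in range(n_pairs)]
    wave1 = [rng.gauss(delta, se) for delta in true_effects]
    wave2 = [rng.gauss(delta, se) for delta in true_effects]
    return true_effects, wave1, wave2

true_effects, wave1, wave2 = simulate_two_waves()
r_between_waves = pearson(wave1, wave2)

# Under this model the expected between-wave correlation is
# sigma_delta^2 / (sigma_delta^2 + se^2), which is 0.9 for the values above.
expected_r = 0.3 ** 2 / (0.3 ** 2 + 0.1 ** 2)
```

In this setup, large heterogeneity across name-pairs and a high correlation between replication waves appear together, which is the pattern the response argues is consistent with the model.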

4. Response by T. D. Stanley and Chris Doucouliagos

And here’s what Stanley and Doucouliagos write:

Last Fall, MAER-Net (Meta-Analysis of Economics Research-Network) had a productive discussion about the replication ‘crisis,’ and how it could be turned into a credibility revolution. We examined the high heterogeneity revealed by our survey of over 12,000 psychological studies and how it implies that close replication is unlikely (Stanley et al., 2018). Marcel van Assen pointed out that the then recently-released, large-scale, multi-lab replication project, Many Labs 2 (Klein et al., 2018), “hardly show heterogeneity,” and Marcel claimed “it is a myth (and mystery) why researchers believe heterogeneity is omnipresent in psychology.”

Supporting Marcel’s view is the recent post by Joe Simmons and Uri Simonsohn about a series of experiments that are directly replicated a second time using the same research protocols. They find high heterogeneity across versions of the experiment (I^2 = 79%), but little heterogeneity across replications of the exact same experiment.

We accept that carefully-conducted, exact replications of psychological experiments can produce reliable findings with little heterogeneity (MAER-Net). However, contrary to Joe and Uri’s blog, such modest heterogeneity from exactly replicated experiments is fully consistent with the high heterogeneity that our survey of 200 psychology meta-analyses finds and its implication that “it (remains) unlikely that the typical psychological study can be closely replicated” . . .

Because Joe and Uri’s blog was not pre-registered and concerns only one idiosyncratic experiment at one lab, we focus instead on ML2’s pre-registered, large-scale replication of 28 experiments across 125 sites, addressing the same issue and producing the same general result. . . . ML2 focuses on measuring the “variation in effect magnitudes across samples and settings” (Klein et al., 2018, p. 446). Each ML2 experiment is repeated at many labs using the same methods and protocols established in consultation with the original authors. After such careful and exact replication, ML2 finds only a small amount of heterogeneity remains across labs and settings. It seems that psychological phenomena and the methods used to study them are sufficiently reliable to produce stable and reproducible findings. Great news for psychology! But this fact does not conflict with our survey of 200 meta-analyses nor its implications about replications (Stanley et al., 2018).

In fact, ML2’s findings corroborate both the high heterogeneity our survey finds and its implication that typical studies are unlikely to be closely replicated by others. Both high and little heterogeneity at the same time? What explains this heterogeneity in heterogeneity?

First, our survey finds that typical heterogeneity in an area of research is 3 times larger than sampling error (I^2 = 74%; std dev = .35 SMD). Stanley et al. (2018) shows that this high heterogeneity makes it unlikely that the typical study will be closely replicated (p. 1339), and ML2 confirms our prediction!

Yes, ML2 discovers little heterogeneity among different labs all running the exact same replication, but ML2 also finds huge differences between the original and replicated effect sizes . . . If we take the experiments that ML2 selected to replicate as ‘typical,’ then it is unlikely that this ‘typical’ experiment can be closely replicated. . . .

Heterogeneity may not be omnipresent, but it is frequently: seen among published research results, identified in meta-analyses, and confirmed by large-scale replications. As Blakeley, Ulf and Karsten remind us, heterogeneity has important theoretical implications, and it can also be identified and explained by meta-regression analysis.
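A note on the arithmetic behind "3 times larger than sampling error (I^2 = 74%)" in the response above. The sketch below uses the usual reading of I^2 as the share of total variance in observed effects that comes from heterogeneity rather than sampling error. Treating sampling variance as a single "typical" value is a simplifying assumption here, and the function names are made up for this illustration.

```python
def i_squared(heterogeneity_var, sampling_var):
    """Share of total variance in observed effects attributable to heterogeneity."""
    return heterogeneity_var / (heterogeneity_var + sampling_var)

def variance_ratio_from_i_squared(i2):
    """Heterogeneity variance divided by sampling variance, implied by a given I^2."""
    return i2 / (1.0 - i2)

# I^2 = 74%, as reported in the response, implies a ratio of 0.74 / 0.26.
ratio_at_74 = variance_ratio_from_i_squared(0.74)

# Conversely, heterogeneity variance exactly 3 times sampling variance gives I^2 = 75%.
i2_at_ratio_3 = i_squared(3.0, 1.0)
```

So I^2 = 74% corresponds to heterogeneity variance a bit under three times the sampling variance, which matches the rounded "3 times" in the text.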

5. Making sense of it all

I’d like to thank all parties involved—Simmons and Simonsohn; McShane, Böckenholt, and Hansen; Judd and Kenny; and Stanley and Doucouliagos—for their contributions to the discussion.

On the substance of the matter, I agree with Judd and Kenny and Stanley and Doucouliagos that effects do vary (indeed, just about every psychology experiment you’ll ever see is a study of a two-way or three-way interaction, hence varying effects are baked into the discussion), and it would be a mistake to consider variation as zero, just because it’s hard to detect variation in a particular dataset (echoes of the hot-hand fallacy fallacy!).

I’ve sounded the horn earlier on the statistical difficulties of estimating treatment effect variation, so I also see where Simmons and Simonsohn are coming from, pointing out that apparent variation can be much larger than underlying variation. Indeed, this relates to the justly celebrated work of Simmons, Nelson, and Simonsohn on researcher degrees of freedom and “false positive psychology”: The various overestimated effects in the Psychological Science / PNAS canon can be viewed as a big social experiment in which noisy data and noisy statistics were, until recently, taken as evidence that we live in a capricious social world, and that we’re buffeted by all sorts of large effects.

Simmons and Simonsohn’s efforts to downplay overestimation of effect-size variability are, therefore, consistent with their earlier work on downplaying overestimates of effect sizes. Remember: just about every effect being studied in psychology is an interaction. So if an effect size (e.g., ovulation and voting) was advertised as 20% but is really, say, 0.2%, then that’s really an effect-size heterogeneity that’s being scaled down.

I also like McShane, Böckenholt, and Hansen’s remark that “heterogeneity is not a nuisance but something to embrace.”

6. Summary

On the technical matter, I agree with the discussants that it’s a mistake to think of effect-size variation as negligible. Indeed, effect-size variation—also called “interactions”—is central to psychology research. At the same time, I respect Simmons and Simonsohn’s position that effect-size variation is not as large as it’s been claimed to be—that’s related to the uncontroversial statement that Psychological Science and PNAS have published a lot of bad papers—and that overrating of the importance of effect-size variation has led to lots of problems.

It’s too bad that Uri and Joe’s blog doesn’t have a comment section; fortunately we can have the discussion here. All are welcome to participate.

Continuing discussion of status threat and presidential elections, with discussion of challenge of causal inference from survey data

Last year we reported on an article by sociologist Steve Morgan, criticizing a published paper by political scientist Diana Mutz.

A couple months later we updated with Mutz’s response to Morgan’s critique.

Finally, Morgan has published a reply to Mutz’s response to Morgan’s comments on Mutz’s paper. Here’s a passage that is of methodological interest:

I [Morgan] first offer the most important example of the repeated mistake Mutz commits, and I then explain the underlying methodology of fixed-effects models that justifies the correct interpretation. Here is the most important example of her mistake: When interpreting her models to develop a conclusion about the role of social dominance orientation (SDO) in the 2012 and 2016 elections, Mutz (2018b) states,

When a person’s desire for group dominance increased from 2012 to 2016, so did the probability of defecting to Trump. However, as shown by the insignificant interaction between SDO and wave in both analyses, there is no evidence that those high in preexisting SDO were especially likely to defect to Trump, thus countering the idea that SDO was made more salient in 2016. Instead, it is the increase in SDO, which is indicative of status threat, that corresponded to increasing positivity toward Trump. (p. 5)

Even if one accepts that Mutz’s model is perfectly specified to reveal the causal effects of interest to her, the correct interpretation of the model that she estimated would instead be this:

The causal effect of SDO on Republican thermometer advantage was 0.206 in 2012 and 0.184 in 2016. Thus, these estimates provide no evidence that SDO was more important for the 2016 election than it was for the 2012 election. In addition, the model does not reveal whether changes in individuals’ SDO levels between 2012 and 2016 caused any prospective voters to rate Trump more warmly than Romney, in comparison with Clinton and Obama, respectively.

Had Mutz recognized that such a paragraph would be the correct interpretation of her primary status-threat coefficients, she could not have written the article that she did.
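For readers who want the algebra behind the two wave-specific numbers in Morgan's paragraph, here is a sketch of a generic two-wave fixed-effects model with an SDO-by-wave interaction. The specification and every symbol in it are assumptions made for illustration; it is not copied from Mutz's or Morgan's code, and the real model includes other terms. Here y_it is the thermometer advantage, x_it is SDO, α_i is the person fixed effect, and w_t is a 2016 indicator.

```latex
y_{it} = \alpha_i + \gamma\, w_t + \beta\, x_{it} + \theta\,(x_{it} \times w_t) + \ldots + \varepsilon_{it},
\qquad
w_t = \begin{cases} 0, & t = 2012 \\ 1, & t = 2016 \end{cases}

\text{coefficient on } x \text{ in 2012: } \beta = 0.206,
\qquad
\text{coefficient on } x \text{ in 2016: } \beta + \theta = 0.184
\;\;\Longrightarrow\;\;
\theta = 0.184 - 0.206 = -0.022
```

Read this way, the interaction θ is the difference between the 2016 and 2012 coefficients on SDO, and a value near zero is what lies behind the statement that the estimates give no evidence SDO mattered more in 2016 than in 2012.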

Ya want the data and code? We got the data and code!

Morgan adds:

For those of you who may find the discussion of fixed-effects models useful, perhaps as an example for classroom use, here are links to some other resources:

My July 2018 Github repository, with Mutz’s code, my analysis code chunks inserted into her code, her data, and results for my original critique:

Lecture slides on fixed-effect models, using Mutz’s data that I presented at the Rostock retreat on causality:

An updated Github repository, with additional analysis code chunks, data, and results, developed for the Rostock lecture and also for my reply to Mutz’s comment:

“How many years do we lose to the air we breathe?” Or not.

From this Washington Post article:

But . . . wait a second. The University of Chicago’s Energy Policy Institute . . . what exactly is that?

Let’s do a google, then we get to the relevant page.

I’m concerned because this is the group that did this report, which featured this memorable graph:

[Image: screenshot of the graph from the report]

See this article (“Evidence on the deleterious impact of sustained use of polynomial regression on causal inference”) for further discussion of the statistical problems with this sort of analysis.

Anyway, the short story is that I’m skeptical about these numbers.

Does this matter? I dunno. I’m pretty sure that air pollution is bad for you, and I expect it does reduce life expectancy, so maybe you could say: So what if these particular numbers are off?

On the other hand, if all we want to say is that air pollution is bad for you, we can say that. The contribution of that Washington Post article and associated publicity is, first, the particular numbers being reported and, second, the claim about the magnitude of the effects.

So if I don’t trust the stats, why should I trust a claim like this?

I just don’t know what to think. Maybe the press should be careful about reporting these numbers uncritically. Or maybe it doesn’t matter because it’s a crisis so it’s better to have falsely-precise numbers than none at all. I’m concerned that the estimates being reported are overestimates because of the statistical significance filter.
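The statistical significance filter mentioned above is easy to demonstrate. The sketch below is a toy simulation with hypothetical numbers, not an analysis of the air pollution estimates: a true effect of 1 unit is measured with standard error 1, and only estimates more than 1.96 standard errors from zero get "reported."

```python
import random

def significance_filter(true_effect, se, n_sims=20000, seed=3):
    """Compare the average of all estimates with the average of the significant ones."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    # Keep only estimates that would pass a two-sided 5% significance test.
    significant = [e for e in estimates if abs(e) > 1.96 * se]
    mean_all = sum(estimates) / len(estimates)
    mean_abs_significant = sum(abs(e) for e in significant) / len(significant)
    share_significant = len(significant) / n_sims
    return mean_all, mean_abs_significant, share_significant

# Hypothetical setting: a noisy study of a real but modest effect.
mean_all, mean_abs_significant, share_significant = significance_filter(1.0, 1.0)
```

The unfiltered estimates average out to about the true effect, but every estimate that clears the threshold is at least 1.96 in magnitude, so the reported ones overstate a true effect of 1 by a factor of two or more. The noisier the study relative to the true effect, the worse the exaggeration.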

Automatic voter registration impact on state voter registration

Sean McElwee points us to this study by Kevin Morris and Peter Dunphy, who write:

Automatic voter registration or AVR . . . features two seemingly small but transformative changes to how people register to vote:

1. Citizens who interact with government agencies like the Department of Motor Vehicles are registered to vote, unless they decline. In other words, a person is registered unless they opt out, instead of being required to opt in.

2. The information citizens provide as part of their application for government services is electronically transmitted to elections officials, who verify their eligibility to vote. This process is seamless and secure.

In the past five years, 15 states and the District of Columbia have adopted AVR. (Three states — Connecticut, Utah, and New Mexico — have adopted something very close to automatic registration.)

Morris and Dunphy perform some statistical analysis on state-level registration data and report:

– AVR markedly increases the number of voters being registered — increases in the number of registrants ranging from 9 to 94 percent.

– These registration increases are found in big and small states, as well as states with different partisan makeups.

I’m not sure what to think. On one hand, I’d expect this sort of treatment to work; on the other hand, some of the data look weird, such as Oregon in 2015, which suggests that other things are going on besides the treatment:

Lots more registrations in the control group than the non-control group in that year, even though it says that 2015 is before the treatment period. So it seems that there are systematic differences between treatment and control groups. Given there are such differences, and these differences show up differently in different years (compare 2013 to 2015), why should we be so sure that the differences in Georgia 2017, for example, are due to the treatment? To put it another way, they seem to be leaning very heavily on the comparisons to the matched tracts.

That said, there’s a lot about these data I don’t understand. Like, why do registrations increase steadily within each year? I’m sure there’s a reason, and maybe it’s in the report and I didn’t see it.

The authors write:

We were able to isolate the effect of AVR using a common political science method known as “matching.” We ran an algorithm to match areas that implemented AVR with demographically similar jurisdictions that did not. Matching similar jurisdictions allowed us to build a baseline figure of what a state’s registration rate would have looked like had it not implemented AVR.

That’s all fine but it doesn’t really address these data issues.

McElwee wrote his own critique of the Morris and Dunphy study, focusing on the difference between “back end” and “front end” registration systems. You can follow the link for details on this.

What’s the upshot?

Yair points us to this page, The Upshot, Five Years In, by the New York Times data journalism team, listing their “favorite, most-read or most distinct work since 2014.”

And some of these are based on our research:

There Are More White Voters Than People Think. That’s Good News for Trump. (Story by Nate Cohn. Research with Yair Ghitza)

We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results. (Story by Nate Cohn. Research with Sam Corbett-Davies and David Rothschild)

How Birth Year Influences Political Views (Story and visualization by Amanda Cox. Research with Yair Ghitza)

Big-time data journalism is a pretty new thing. Back when we were doing Red State Blue State, around 2004 and onward, we’d present some results on blogs, and we published a research article and a book, but it was hard to get these ideas out there in the news media, back in the day when Michael Barone was considered a quantitative political reporter and political scatterplots, line plots, and data maps were specialty fare.

P.S. Also included is How Arizona State Reinvented Free-Throw Distraction, featuring an analysis that I think is a noise-chasing mess. This one’s on their list of “The somewhat weird list: our most unconventional efforts to explain the world.” Unconventional, indeed. Unconventional and wrong, in this case. But, hey, we all make mistakes. I know I do.

Several post-doc positions in probabilistic programming etc. in Finland

There are several open post-doc positions at Aalto University and the University of Helsinki in 1. probabilistic programming, 2. simulator-based inference, 3. data-efficient deep learning, 4. privacy preserving and secure methods, 5. interactive AI. All these research programs are connected and collaborating. I (Aki) am the coordinator for project 1 and a contributor to the others. Overall we are developing methods and tools for a big part of Bayesian workflow. The methods are developed generally, but Stan is one of the platforms used to make the first implementations, and thus some post-docs will also work with the Stan development team. See more details here.

“Appendix: Why we are publishing this here instead of as a letter to the editor in the journal”

David Allison points us to this letter he wrote with Cynthia Kroeger and Andrew Brown:

Unsubstantiated conclusions in randomized controlled trial of binge eating program due to Differences in Nominal Significance (DINS) Error

Cachelin et al. tested the effects of a culturally adapted, Cognitive Behavioral Therapy-based, guided self-help (CBTgsh) intervention on binge eating reduction . . . The authors report finding a causal effect in their conclusion by stating,

Treatment with the CBTgsh program resulted in significant reductions in frequency of binge eating, depression, and psychological distress and 47.6% of the intention-to-treat CBTgsh group were abstinent from binge eating at follow-up. In contrast, no significant changes were found from pre- to 12-week follow-up assessments for the waitlisted group. Results indicate that CBTgsh can be effective in addressing the needs of Latinas who binge eat and can lead to improvements in symptoms.

This study is well-designed to test for causal effects between these groups; however, the authors did not conduct the statistical test needed to draw causal inference. Specifically, the authors base their conclusions from a parallel groups RCT on within-group analyses. Such analyses have been well-documented as invalid as tests for between-group treatment effects; instead, between-group tests should be utilized to inform conclusions (Bland & Altman, 2011; Gelman & Stern, 2006; Huck & McLean, 1975).

The Differences in Nominal Significance (DINS) error is a term used to describe this error of basing between-group conclusions on comparisons of the statistical significance of two (or more) separate tests . . . DINS errors are common within peer-reviewed obesity literature . . .

The difference between “significant” and “not significant” is not itself statistically significant.
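A small worked example shows why. The numbers below are hypothetical and are not taken from Cachelin et al.; the sketch assumes two independent groups and normal approximations. One group's change is 25 with standard error 10, the other's is 10 with standard error 10.

```python
import math

def z_and_p(estimate, se):
    """z-score and two-sided normal p-value for an estimate and its standard error."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical within-group changes (estimate, standard error).
treated_change, treated_se = 25.0, 10.0
control_change, control_se = 10.0, 10.0

z_treated, p_treated = z_and_p(treated_change, treated_se)   # "significant" on its own
z_control, p_control = z_and_p(control_change, control_se)   # "not significant" on its own

# The between-group comparison is the test that actually addresses the treatment effect.
difference = treated_change - control_change
se_difference = math.sqrt(treated_se ** 2 + control_se ** 2)  # independent groups
z_difference, p_difference = z_and_p(difference, se_difference)
```

The first group is 2.5 standard errors from zero and the second only 1, yet their difference of 15 has a standard error of about 14 and is nowhere near conventional significance. Comparing the two within-group p-values says nothing reliable about the between-group effect.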

Allison adds:

You might be interested in the Appendix titled “Why we are publishing this here instead of as a letter to the editor in the journal.”

Here’s the story:

Appendix: Why we are publishing this here instead of as a letter to the editor in the journal

We first contacted a Peer Review Manager of Psychological Services on May 12, 2018 to inquire as to how one should submit a Letter to the Editor to their journal, regarding an article published in their journal, because this article type was not an option in the author center of their online submission system.

On June 2, 2018, we received a reply from the Peer Review Manager stating the journal does not usually receive submissions like this, but that the editor confirmed we could submit it as a regular article and just explain that it is a Letter to the Editor in the cover letter.

On July 3, 2018, we followed these instructions and submitted the letter above.

On August 16, 2018, we wrote to inquire as to the status of our submitted letter. A reply from the Peer Review Manager was received on August 23, 2018 stating that the handling editor confirmed she is working on it and consulting with a reviewer.

On October 4, 2018, we received a decision from an Editor, stating that the editorial team was contacted and that they do not publish letters to the editor in this journal. Permission was requested to send a blinded version of our letter to the authors.

On October 13, 2018, the Peer Review Manager followed up to request permission to share a masked version of our letter with the authors of the original manuscript, so that they can potentially address the concerns.

On October 16, 2018, I replied to the Peer Review Manager indicating that I was unsure how to respond, because I was confused about the decision. I attached our previous email correspondence that mentioned how the Editor confirmed that we could submit the article as a regular article and explain that it was a Letter to the Editor in the cover letter – even though the journal normally does not have submissions like this. I mentioned that other journals often have original authors address concerns in a formal response and asked whether this person knew the process by which authors would address our concerns otherwise.

On October 17, 2018, I received an email from the Editor, indicating that our letter was read through carefully and our concerns were taken seriously. Multiple statisticians were consulted to see how to best address the concerns. The editorial team also was consulted with. The editorial team agreed that given the nature of our letter was statistical procedures related to a study, that publishing it in the journal was not the best step. It was explained that they reject papers that are not in the area of typical foci of their journal or readership, and that this is why the letter could not be accepted for publication. It was explained that they agreed the best step was to reach out to the author that needed corrections and wanted permission to send our blinded letter to them, so they can make the corrections.

On October 23, 2018, we responded to the Editor and provided consent to share our letter with the original authors. We also shared with them our plan to publish our submitted letter as a comment in PubPeer and update our comment as progress continues.

The Editor responded on October 23, 2018, asking whether we still wanted our letter to be blinded when they share it with original authors.

We responded on October 24, 2018, giving permission to send our letter unblinded. We said to please let us know if we can help further in any way and to please feel free to share with the original authors that we are happy to help if they think that might be useful. The Editor responded affirmatively and thanked us.

In sum, we decided to post our letter here. Posting our concerns here is in line with the COPE ideal of quickly making the scientific community aware of an issue. We also find the editor’s decision to not allow for the publication of LTEs discussing errors in their papers to be counter to the ideals of rigor, reproducibility, and transparency.

This is very similar to what happened to me when I tried to publish a letter pointing out a problem in a paper published in the American Sociological Review. I eventually gave up and just published the story in Chance. That was a few years ago. Had it happened more recently, I would’ve submitted it to Sociological Science.

Just today, someone sent me another story of this sort: A paper with serious statistical errors appeared in a medical journal, my correspondent found some problems with it but the journal editor refused to do anything about it. (In this case, the data were not made available, making it difficult to figure out what exactly was going on.) My correspondent was stuck, didn’t know what to do. I suggested publishing the criticism as a short article in a different journal in the same medical subfield. We’ll see what happens.

P.S. Since we’re on the topic of publication and conflict of interest, here’s an unrelated story.

R-squared for multilevel models

Brandon Sherman writes:

I was just having a discussion with someone about multilevel models, and the following topic came up. Imagine we’re building a multilevel model to predict SAT scores using many students. First we fit a model on students only, then students in classrooms, then students in classrooms within district, the previous case within cities, then counties, countries, etc. Maybe we even add in census tract info. The idea here is we keep arbitrarily adding levels to the hierarchy.

In standard linear regression, adding variables, regardless of informativeness, always leads to an increase in R^2. In the case of multilevel modeling, does adding levels to the hierarchy always lead to a stronger relationship with the response, even if it’s a tiny one that’s only applicable to the data the model is built on?

My reply: Not always. See here.
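
The premise about standard linear regression is easy to check numerically. Here is a minimal sketch, assuming simulated data and a hypothetical helper `r_squared` (neither comes from the question): it fits least squares with one real predictor and then adds pure-noise predictors one at a time. It only illustrates the in-sample behavior of classical regression; it says nothing about multilevel models, where partial pooling means the same guarantee need not hold.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 from an ordinary least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)                   # one predictor that really matters
y = 1.0 + 0.5 * x + rng.normal(size=n)   # simulated outcome (assumed model)
noise = rng.normal(size=(n, 10))         # ten predictors unrelated to y

# R^2 for the nested models: x alone, then x plus the first k noise columns.
r2_path = [r_squared(np.column_stack([x, noise[:, :k]]), y) for k in range(11)]

for k, r2 in enumerate(r2_path):
    print(f"{k:2d} noise predictors: R^2 = {r2:.4f}")
```

Because each model nests the previous one, the printed values never go down, even though the added columns carry no information about the outcome.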

P.S. Since we’re on the topic, I should point you to this recent paper with Ben, Jonah, and Aki on Bayesian R-squared.

Wanted: Statistical success stories

Bill Harris writes:

Sometime when you get a free moment, it might be great to publish a post that links to good, current exemplars of analyses. There’s a current discussion about RCTs on a program evaluation mailing list I monitor. I posted links to your power=0.06 post and your Type S and Type M post, but some still seem to think RCTs are the foundation. I can say “Read one of your books” or “Read this or that book,” or I could say “Peruse your blog for the last, oh, eight-ten years,” but either one requires a bunch of dedication. I could say “Read some Stan examples,” but those seem mostly focused on teaching Stan. Some published examples use priors you no longer recommend, as I recall. I think I’ve noticed a few models with writeups on your blog that really did begin to show how one can put together a useful analysis without getting into NHST and RCTs, but I don’t recall where they are.

Relatedly, Ramiro Barrantes-Reynolds writes:

I would be very interested in seeing more in your blog about research that does a good job in the areas that are most troublesome for you: measurement, noise, forking paths, etc; or that addresses those aspects so as to make better inferences. I think after reading your blog I know what to look for to see when some investigator (or myself) is chasing noise (i.e. I have a sense of what NOT to do), but I am missing good examples to follow in order to do better research – I would consider myself a beginning statistician so examples of research that is well done and addresses the issues of forking paths, measurement, etc help. I think blog posts and the discussion that arises would be beneficial to the community.

So, two related questions. The first one’s about causal inference beyond simple analyses of randomized trials; the second is about examples of good measurement and inference in the context of forking paths.

My quick answer is that, yes, we do have examples in our books, and it doesn’t involve that much dedication to order them and take a look at the examples. I also have a bunch of examples here and here.

More specifically:

Causal inference without a randomized trial: Millennium villages, incumbency advantage (and again)

Measurement: penumbras, assays

Forking paths: Millennium villages, polarization

I guess the other suggestion is that we post on high-quality new work so we can all discuss, not just what makes bad work bad, but also what makes good work good. That makes sense. To start with, you should start pointing me to some good stuff to post on.

No, it’s not correct to say that you can be 95% sure that the true value will be in the confidence interval

Hans van Maanen writes:

May I put another statistical question to you?

If I ask my frequentist statistician for a 95%-confidence interval, I can be 95% sure that the true value will be in the interval she just gave me. My visualisation is that she filled a bowl with 100 intervals, 95 of which do contain the true value and 5 do not, and she picked one at random.
Now, if she gives me two independent 95%-CI’s (e.g., two primary endpoints in a clinical trial), I can only be 90% sure (0.95^2 = 0.9025) that they both contain the true value. If I have a table with four measurements and 95%-CI’s, there’s only an 81% chance they all contain the true value.

Also, if we have two results and we want to be 95% sure both intervals contain the true values, we should construct two 97.5%-CI’s (0.95^(1/2) = 0.9747), and if we want to have 95% confidence in four results, we need 99%-CI’s (0.95^(1/4) = 0.9873).

I’ve read quite a few texts trying to get my head around confidence intervals, but I don’t remember seeing this discussed anywhere. So am I completely off, is this a well-known issue, or have I just invented the Van Maanen Correction for Multiple Confidence Intervals? ;-))

I hope you have time for an answer. It puzzles me!
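
For reference, the arithmetic in the question can be checked in a few lines. This is a minimal sketch that takes the question's own assumption of independent intervals at face value; the function names `joint_coverage` and `per_interval_level` are made up for illustration. The second calculation is the one usually called the Šidák adjustment.

```python
def joint_coverage(level, k):
    """Chance that k independent intervals, each with coverage `level`, all cover."""
    return level ** k

def per_interval_level(joint, k):
    """Per-interval coverage needed so that k independent intervals jointly cover
    with probability `joint`."""
    return joint ** (1.0 / k)

for k in (1, 2, 4):
    print(
        f"k={k}: joint coverage of 95% intervals = {joint_coverage(0.95, k):.4f}, "
        f"level needed for 95% joint coverage = {per_interval_level(0.95, k):.4f}"
    )
```

The numbers match the 90% and 81% figures in the question, and they put the per-interval level for four results at about 98.7%.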

My reply:

Sure, I can help you, but in English:

1. “If I ask my frequentist statistician for a 95%-confidence interval, I can be 95% sure that the true value will be in the interval she just gave me.” Not quite true. Yes, true on average, but not necessarily true in any individual case. Some intervals are clearly wrong. Here’s the point: even if you picked an interval at random from the bowl, once you see the interval you have additional information. Sometimes the entire interval is implausible, suggesting that you probably picked one of the bad intervals in the bowl. Other times, the interval contains the entire range of plausible values, so you can be almost completely sure that you picked one of the good intervals in the bowl. This can especially happen if your study is noisy and the sample size is small. For example, suppose you’re trying to estimate the difference in proportion of girl births, comparing two different groups of parents (for example, beautiful parents and ugly parents). You decide to conduct a study of N=400 births, with 200 in each group. Your estimate will be p2 – p1, with standard error sqrt(0.5^2/200 + 0.5^2/200) = 0.05, so your 95% confidence interval will be p2 – p1 +/- 0.10. We happen to be pretty sure that any true population difference will be less than 0.01 in absolute value (see here), hence if p2 – p1 is between -0.09 and +0.09, the interval covers the whole range from -0.01 to +0.01 and we can be pretty sure that our 95% interval does contain the true value. Conversely, if p2 – p1 is less than -0.11 or more than +0.11, the interval excludes that whole range and we can be pretty sure that our interval does not contain the true value. Thus, once we see the interval, it’s no longer generally a correct statement to say that you can be 95% sure the interval contains the true value.

2. Regarding your question: I don’t really think it makes sense to want 95% confidence in four results. It makes more sense to accept that our inferences are uncertain; we should not demand, or act as if, they are all correct.
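
The girl-births example in point 1 can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis of real data: the true proportions `p1` and `p2` are made-up values chosen so that the true difference is 0.005 (inside the plausible range from the example), and the interval is the estimate +/- 0.10, as above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200                    # births per group, as in the example
p1, p2 = 0.485, 0.490      # hypothetical true proportions (assumption)
true_diff = p2 - p1        # 0.005, within the plausible range of +/- 0.01
reps = 100_000             # number of simulated studies

# Estimated difference in proportions for each simulated study.
est = (rng.binomial(n, p2, reps) - rng.binomial(n, p1, reps)) / n

half_width = 0.10          # interval is est +/- 0.10
covered = np.abs(est - true_diff) <= half_width

inside = np.abs(est) <= 0.09    # estimates where the interval spans all plausible values
outside = np.abs(est) > 0.11    # estimates where the interval excludes all plausible values

print(f"overall coverage:               {covered.mean():.3f}")
print(f"coverage given |estimate|<=0.09: {covered[inside].mean():.3f}")
print(f"coverage given |estimate|>0.11:  {covered[outside].mean():.3f}")
print(f"share of studies with |estimate|>0.11: {outside.mean():.3f}")
```

Averaged over all simulated studies, the intervals cover the true difference roughly 95% of the time, but the coverage is very different once you condition on the interval you actually saw: the intervals centered within 0.09 of zero always cover, and the ones centered more than 0.11 from zero never do.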