How to simulate an instrumental variables problem?

Edward Hearn writes:

In an effort to buttress my own understanding of multilevel methods, especially those involving instrumental variables, I have been working through the examples and exercises in Jennifer Hill’s and your book.

I can find general answers at the Github repo for ARM examples, but for Chapter 10, Exercise 3 (simulating an IV regression to test assumptions using a binary treatment and instrument) and for the book examples, no code is given and I simply cannot figure out the solution.

My reply:

I have no homework solutions to send. But maybe some blog commenters would like to help out?

Here’s the exercise:
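For commenters who want a starting point: the exercise asks for a simulation of an instrumental-variables setup with a binary treatment and a binary instrument. Below is one minimal sketch; the effect size, the compliance model, and all variable names are my own choices for illustration, not from the book.

```python
import random

random.seed(42)

n = 50_000
effect = 1.5  # true treatment effect (my choice, not from the book)

z = [random.randint(0, 1) for _ in range(n)]  # binary instrument, randomized
u = [random.gauss(0, 1) for _ in range(n)]    # unobserved confounder

# Binary treatment: the instrument can only push people toward treatment
# (monotonicity), while the confounder u also affects uptake.
t = [1 if -0.3 + 1.0 * zi + ui + random.gauss(0, 1) > 0 else 0
     for zi, ui in zip(z, u)]

# Outcome: z affects y only through t (exclusion restriction)
y = [2.0 + effect * ti + ui + random.gauss(0, 1)
     for ti, ui in zip(t, u)]

def mean(v):
    return sum(v) / len(v)

# Naive comparison of treated vs. untreated is confounded by u
naive = (mean([yi for yi, ti in zip(y, t) if ti == 1])
         - mean([yi for yi, ti in zip(y, t) if ti == 0]))

# Wald / IV estimate: intent-to-treat effect divided by the compliance difference
itt_y = (mean([yi for yi, zi in zip(y, z) if zi == 1])
         - mean([yi for yi, zi in zip(y, z) if zi == 0]))
itt_t = (mean([ti for ti, zi in zip(t, z) if zi == 1])
         - mean([ti for ti, zi in zip(t, z) if zi == 0]))
wald = itt_y / itt_t

print(f"naive: {naive:.2f}  IV: {wald:.2f}  truth: {effect}")
```

With a large n the naive estimate lands well above the truth while the Wald ratio lands near it. The exercise then asks what happens when each assumption (randomized instrument, exclusion restriction, monotonicity) is violated, which you can explore by editing the corresponding line.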

Harvard dude calls us “online trolls”

Story here.

Background here (“How post-hoc power calculation is like a shit sandwich”) and here (“Post-Hoc Power PubPeer Dumpster Fire”).

OK, to be fair, “shit sandwich” could be considered kind of a trollish thing for me to have said. But the potty language in this context was not gratuitous; it furthered the larger point I was making. There’s a tradeoff: use clean language and that will help with certain readers; on the other hand, vivid language and good analogies make the writing more readable and can make the underlying argument more accessible. So no easy answers on that one.

In any case, the linked Pubpeer thread had no trolling at all, by me or anybody else.

Jeff Lax writes:

I’m probably not the only one telling you about this Science story, but just in case.

The link points to a new research article reporting a failed replication of a study from 2008. The journal that published that now-questionable result refuses to consider publishing the replication attempt.

My reply:

I agree it’s a problem but it doesn’t surprise me. It’s pretty random what these tabloids publish, as they get so many submissions. Sure, they couldn’t publish this particular paper, but maybe there was something more exciting submitted to Science that week, maybe a new manuscript by Michael Lacour?

The publication asymmetry: What happens if the New England Journal of Medicine publishes something that you think is wrong?

After reading my news article on the replication crisis, retired cardiac surgeon Gerald Weinstein wrote:

I have long been disappointed by the quality of research articles written by people and published by editors who should know better. Previously, I had published two articles on experimental design written with your colleague Bruce Levin [of the Columbia University biostatistics department]:

Weinstein GS and Levin B: The coronary artery surgery study (CASS): a critical appraisal. J. Thorac. Cardiovasc. Surg. 1985;90:541-548.

Weinstein GS and Levin B: The effect of crossover on the statistical power of randomized studies. Ann. Thorac. Surg. 1989;48:490-495.

I [Weinstein] would like to point out some additional problems with such studies in the hope that you could address them in some future essays. I am focusing on one recent article in the New England Journal of Medicine because it is typical of so many other clinical studies:

Alirocumab and Cardiovascular Outcomes after Acute Coronary Syndrome

November 7, 2018. DOI: 10.1056/NEJMoa1801174

BACKGROUND: Patients who have had an acute coronary syndrome are at high risk for recurrent ischemic cardiovascular events. We sought to determine whether alirocumab, a human monoclonal antibody to proprotein convertase subtilisin–kexin type 9 (PCSK9), would improve cardiovascular outcomes after an acute coronary syndrome in patients receiving high-intensity statin therapy.

METHODS: We conducted a multicenter, randomized, double-blind, placebo-controlled trial involving 18,924 patients who had an acute coronary syndrome 1 to 12 months earlier, had a low-density lipoprotein (LDL) cholesterol level of at least 70 mg per deciliter (1.8 mmol per liter), a non–high-density lipoprotein cholesterol level of at least 100 mg per deciliter (2.6 mmol per liter), or an apolipoprotein B level of at least 80 mg per deciliter, and were receiving statin therapy at a high-intensity dose or at the maximum tolerated dose. Patients were randomly assigned to receive alirocumab subcutaneously at a dose of 75 mg (9462 patients) or matching placebo (9462 patients) every 2 weeks. The dose of alirocumab was adjusted under blinded conditions to target an LDL cholesterol level of 25 to 50 mg per deciliter (0.6 to 1.3 mmol per liter). The primary end point was a composite of death from coronary heart disease, nonfatal myocardial infarction, fatal or nonfatal ischemic stroke, or unstable angina requiring hospitalization.

RESULTS: The median duration of follow-up was 2.8 years. A composite primary end-point event occurred in 903 patients (9.5%) in the alirocumab group and in 1052 patients (11.1%) in the placebo group (hazard ratio, 0.85; 95% confidence interval [CI], 0.78 to 0.93; P<0.001). A total of 334 patients (3.5%) in the alirocumab group and 392 patients (4.1%) in the placebo group died (hazard ratio, 0.85; 95% CI, 0.73 to 0.98). The absolute benefit of alirocumab with respect to the composite primary end point was greater among patients who had a baseline LDL cholesterol level of 100 mg or more per deciliter than among patients who had a lower baseline level. The incidence of adverse events was similar in the two groups, with the exception of local injection-site reactions (3.8% in the alirocumab group vs. 2.1% in the placebo group).

Here are some major problems I [Weinstein] have found in this study:

1. Misleading terminology: the “primary composite endpoint.” Many drug studies, such as those concerning PCSK9 inhibitors (which are supposed to lower LDL or “bad” cholesterol) use the term “primary endpoint” which is actually “a composite of death from coronary heart disease, nonfatal myocardial infarction, fatal or nonfatal ischemic stroke, or unstable angina requiring hospitalization.” [Emphasis added]

Obviously, a “composite primary endpoint” is an oxymoron (which of the primary colors are composites?) but, worse, the term is so broad that it casts doubt on any conclusions drawn. For example, stroke is generally an embolic phenomenon and may be caused by atherosclerosis, but also may be due to atrial fibrillation in at least 15% of cases. Including stroke in the “primary composite endpoint” is misleading, at best.

By casting such a broad net, the investigators seem to be seeking evidence from any of the four elements in the so-called primary endpoint. Instead of being specific as to which types of events are prevented, the composite primary endpoint obscures the clinical benefit.

2. The use of relative risks, odds ratios or hazard ratios to obscure clinically insignificant differences in absolute differences. “A composite primary end-point event occurred in 903 patients (9.5%) in the alirocumab group and in 1052 patients (11.1%) in the placebo group.” This is an absolute difference of only 1.6%. Such small differences are unlikely to be clinically important, or even replicated on subsequent studies, yet the authors obscure this fact by citing hazard ratios. Only in a supplemental appendix (available online), does this become apparent. Note the enlarged and prominently displayed hazard ratio, drawing attention away from the almost nonexistent difference in event rates (and lack of error bars). Of course, when the absolute differences are small, the ratio of two small numbers can be misleadingly large.

I am concerned because this type of thing is appearing more and more frequently. Minimally effective drugs are being promoted at great expense, and investigators are unthinkingly adopting questionable methods in search of new treatments. No wonder they can’t be repeated.
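Weinstein’s second point can be made concrete with the numbers from the abstract. A quick sketch: the risk ratio below is not the same quantity as the reported hazard ratio, though here the two are close, and the number-needed-to-treat line is my addition, not a figure from the article.

```python
# Event counts and group sizes reported in the trial abstract
events_drug, n_drug = 903, 9462
events_placebo, n_placebo = 1052, 9462

p_drug = events_drug / n_drug           # 9.5%
p_placebo = events_placebo / n_placebo  # 11.1%

abs_diff = p_placebo - p_drug    # absolute risk difference, about 1.6%
risk_ratio = p_drug / p_placebo  # about 0.86 (the reported hazard ratio is 0.85)
nnt = 1 / abs_diff               # patients treated (median 2.8 yr) per event avoided

print(f"{abs_diff:.1%}  {risk_ratio:.2f}  NNT ~ {nnt:.0f}")  # 1.6%  0.86  NNT ~ 64
```

Roughly 60-odd patients would need to be treated for the median follow-up to avoid one composite event, which is exactly the kind of absolute framing Weinstein says the hazard-ratio presentation obscures.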

I suggested to Weinstein that he write a letter to the journal, and he replied:

Unfortunately, the New England Journal of Medicine has a strict 175-word limit on letters to the editor.

In addition, they have not been very receptive to my previous submissions. Today they rejected my short letter on an article that reached a conclusion that was the opposite of the data due to a similar category error, even though I kept it within that word limit.

“I am sorry that we will not be able to publish your recent letter to the editor regarding the Perner article of 06-Dec-2018. The space available for correspondence is very limited, and we must use our judgment to present a representative selection of the material received.” Of course, they have the space to publish articles that are false on their face.

Here is the letter they rejected:

Re: Pantoprazole in Patients at Risk for Gastrointestinal Bleeding in the ICU

(December 6, 2018 N Engl J Med 2018; 379:2199-2208)

This article appears to reach an erroneous conclusion based on its own data. The study implies that pantoprazole is ineffective in preventing GI bleeding in ICU patients when, in fact, the results show that it is effective.

The purpose of the study was to evaluate the effectiveness of pantoprazole in preventing GI bleeding. Instead, the abstract shifts gears and uses death within 90 days as the primary endpoint and the Results section focuses on “at least one clinically important event (a composite of clinically important gastrointestinal bleeding, pneumonia, Clostridium difficile infection, or myocardial ischemia).” For mortality and for the composite “clinically important event,” relative risks, confidence intervals and p-values are given, indicating no significant difference between pantoprazole and control, but a p-value was not provided for GI bleeding, which is the real primary endpoint, even though “In the pantoprazole group, 2.5% of patients had clinically important gastrointestinal bleeding, as compared with 4.2% in the placebo group.” According to my calculations, the chi-square value is 7.23, with a p-value of 0.0072, indicating that pantoprazole is effective at the p<0.05 level in decreasing gastrointestinal bleeding in ICU patients. [emphasis added]

My concern is that clinicians may be misled into believing that pantoprazole is not effective in preventing GI bleeding in ICU patients when the study indicates that it is, in fact, effective.

This sort of mislabeling of end-points is now commonplace in many medical journals. I am hoping you can shed some light on this. Perhaps you might be able to get the NY Times or the NEJM to publish an essay by you on this subject, as I believe the quality of medical publications is suffering from this practice.

I have no idea. I’m a bit intimidated by medical research with all its specialized measurements and models. So I don’t think I’m the right person to write this essay; indeed I haven’t even put in the work to evaluate Weinstein’s claims above.

But I do think they’re worth sharing, just because there is this “publication asymmetry” in which, once something appears in print, especially in a prestigious journal, it becomes very difficult to criticize (except in certain cases when there’s a lot of money, politics, or publicity involved).
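For what it’s worth, Weinstein’s chi-square can be reproduced from the published percentages. The event counts and group sizes below are my reconstruction from the reported 2.5% and 4.2%, not figures quoted from the article, so treat them as an approximation.

```python
import math

# Counts reconstructed from the reported percentages: about 2.5% of ~1645
# patients on pantoprazole and 4.2% of ~1653 on placebo had clinically
# important GI bleeding. These counts are my reconstruction.
bleed_drug, n_drug = 41, 1645
bleed_plac, n_plac = 69, 1653

a, b = bleed_drug, n_drug - bleed_drug
c, d = bleed_plac, n_plac - bleed_plac
n = n_drug + n_plac

# Pearson chi-square for a 2x2 table, no continuity correction
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df

print(f"chi-square: {chi2:.2f}, p = {p:.4f}")  # chi-square: 7.23, p = 0.0072
```

This matches the 7.23 and p = 0.0072 in the rejected letter, which suggests the reconstruction is close to the counts Weinstein used.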

We’re done with our Applied Regression final exam (and solution to question 15)

We’re done with our exam.

And the solution to question 15:

15. Consider the following procedure.

• Set n = 100 and draw n continuous values x_i uniformly distributed between 0 and 10. Then simulate data from the model y_i = a + bx_i + error_i, for i = 1,…,n, with a = 2, b = 3, and independent errors from a normal distribution.

• Regress y on x. Look at the median and mad sd of b. Check to see if the interval formed by the median ± 2 mad sd includes the true value, b = 3.

• Repeat the above two steps 1000 times.

(a) True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

(b) Same as above, except the error distribution is bimodal, not normal. True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

Both (a) and (b) are true.

(a) is true because everything’s approximately normally distributed so you’d expect a 95% chance for an estimate +/- 2 se’s to contain the true value. In real life we’re concerned with model violations, but here it’s all simulated data so no worries about bias. And n=100 is large enough that we don’t have to worry about the t rather than normal distribution. (Actually, even if n were pretty small, we’d be doing ok with estimates +/- 2 sd’s because we’re using the mad sd which gets wider when the t degrees of freedom are low.)

And (b) is true too because of the central limit theorem. Switching from a normal to a bimodal distribution will affect predictions for individual cases but it will have essentially no effect on the distribution of the estimate, which is an average from 100 data points.
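Both claims can be checked by simulation. The sketch below substitutes the classical least-squares estimate ± 2 standard errors for the question’s median ± 2 mad sd (with n = 100 the two behave nearly identically), so it is an approximation of the procedure, not a transcription of it.

```python
import math
import random

random.seed(2019)

def coverage(draw_error, n=100, a=2.0, b=3.0, reps=1000):
    """Share of reps in which bhat +/- 2 se covers the true slope b."""
    hits = 0
    for _ in range(reps):
        x = [random.uniform(0, 10) for _ in range(n)]
        y = [a + b * xi + draw_error() for xi in x]
        xbar, ybar = sum(x) / n, sum(y) / n
        sxx = sum((xi - xbar) ** 2 for xi in x)
        bhat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
        ahat = ybar - bhat * xbar
        rss = sum((yi - ahat - bhat * xi) ** 2 for xi, yi in zip(x, y))
        se = math.sqrt(rss / (n - 2) / sxx)
        hits += bhat - 2 * se <= b <= bhat + 2 * se
    return hits / reps

normal_err = lambda: random.gauss(0, 1)
# bimodal: equal mixture of normals centered at -2 and +2
bimodal_err = lambda: random.gauss(random.choice([-2.0, 2.0]), 0.5)

cov_normal = coverage(normal_err)
cov_bimodal = coverage(bimodal_err)
print(cov_normal, cov_bimodal)  # both close to 0.95
```

Swapping in the bimodal errors changes nothing about the coverage, which is the point of part (b): the sampling distribution of the slope estimate is an average over 100 points and the central limit theorem takes over.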

Common mistakes

Most of the students got (a) correct but not (b). I guess I have to bang even harder on the relative unimportance of the error distribution (except when the goal is predicting individual cases).

Question 14 of our Applied Regression final exam (and solution to question 13)

Here’s question 14 of our exam:

14. You are predicting whether a student passes a class given pre-test score. The fitted model is Pr(Pass) = logit^−1(a_j + 0.1 x), for a student in classroom j whose pre-test score is x. The pre-test scores range from 0 to 50. The a_j’s are estimated to have a normal distribution with mean 1 and standard deviation 2.

(a) Draw the fitted curve Pr(Pass) given x, for students in an average classroom.

(b) Draw the fitted curve for students in a classroom at the 25th and the 75th percentile of classrooms.

And the solution to question 13:

13. You fit a model of the form: y ~ x + u.full + (1 | group). The estimated coefficients are 2.5, 0.7, and 0.5, respectively, for the intercept, x, and u.full, with group and individual residual standard deviations estimated as 2.0 and 3.0, respectively. Write the above model as
y_i = a_j[i] + b x_i + ε_i,
a_j = A + B u_j + η_j.

(a) Give the estimates of b, A, and B together with the estimated distributions of the error terms.

(b) Ignoring uncertainty in the parameter estimates, give the predictive standard deviation for a new observation in an existing group and for a new observation in a new group.

(a) The estimates of b, A, and B are 0.7, 2.5, and 0.5, respectively, and the estimated distributions are ε ~ normal(0, 3.0) and η ~ normal(0, 2.0).

(b) 3.0 and sqrt(3.0^2 + 2.0^2) = 3.6.
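Part (b) can be verified by simulating the two prediction scenarios directly. The sketch below is mine; it just draws the error terms from the estimated distributions.

```python
import random
import statistics

random.seed(7)

sigma_y, sigma_a = 3.0, 2.0  # individual and group sds from the fitted model
reps = 200_000

# New observation in an existing group: a_j is already estimated, so only
# the individual error epsilon varies.
existing = [random.gauss(0, sigma_y) for _ in range(reps)]

# New observation in a new group: the group effect eta varies as well.
new_group = [random.gauss(0, sigma_a) + random.gauss(0, sigma_y)
             for _ in range(reps)]

sd_existing = statistics.pstdev(existing)
sd_new = statistics.pstdev(new_group)
print(round(sd_existing, 2), round(sd_new, 2))  # close to 3.0 and sqrt(13) = 3.6
```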

Common mistakes

Almost everyone got part (a) correct, and most people got (b) also, but there was some confusion about the uncertainty for a new observation in a new group.

Naomi Wolf and David Brooks

Palko makes a good point:

Parul Sehgal has a devastating review of the latest from Naomi Wolf, but while Sehgal is being justly praised for her sharp and relentless treatment of her subject, she stops short before she gets to the most disturbing and important implication of the story.

There’s an excellent case made here that Wolf’s career should have collapsed long ago under the weight of her contradictions and factual errors, but that leaves the question of responsibility: how enablers have sustained that career, and how many other journalistic all-stars owe their successes to the turning of blind eyes.

For example, Sehgal’s review ran in the New York Times. One of, if not the most prominent voice of that paper is David Brooks. . . .

Really these columnists should just stick to football writing, where it’s enough just to be entertaining, and accuracy and consistency don’t matter so much.

Question 12 of our Applied Regression final exam (and solution to question 11)

Here’s question 12 of our exam:

12. In the regression above, suppose you replaced height in inches by height in centimeters. What would then be the intercept and slope of the regression? (One inch is 2.54 centimeters.)

And the solution to question 11:

11. We defined a new variable based on weight (in pounds):

heavy <- weight>200

and then ran a logistic regression, predicting “heavy” from height (in inches):

glm(formula = heavy ~ height, family = binomial(link = "logit"))
            coef.est coef.se
(Intercept) -21.51   1.60
height        0.28   0.02
---
n = 1984, k = 2

(a) Graph the logistic regression curve (the probability that someone is heavy) over the approximate range of the data. Be clear where the line goes through the 50% probability point.

(b) Fill in the blank: near the 50% point, comparing two people who differ by one inch in height, you’ll expect a difference of ____ in the probability of being heavy.

(a) The x-axis should range from approximately 60 to 80 (most people have heights between 60 and 80 inches), and the y-axis should range from 0 to 1. The easiest way to draw the logistic regression curve is to first figure out where it goes through 0.5. That’s when the linear predictor equals 0, thus -21.51 + 0.28*x = 0, so x = 21.51/0.28 ≈ 77. Then at that point the curve has slope 0.07 (remember the divide-by-4 rule), and that will be enough to get something pretty close to the fitted curve.

(b) As just noted, the divide-by-4 rule gives us an answer of 0.07.
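Both the 50% crossing point and the divide-by-4 slope can be checked numerically; here is a quick sketch (in Python rather than the R used in the exam output).

```python
import math

def invlogit(u):
    return 1.0 / (1.0 + math.exp(-u))

a, b = -21.51, 0.28
x50 = -a / b  # where the linear predictor is 0, i.e. Pr(heavy) = 0.5
h = 1e-5
# numerical slope of the fitted curve at the 50% point
slope = (invlogit(a + b * (x50 + h)) - invlogit(a + b * (x50 - h))) / (2 * h)
print(round(x50, 1), round(slope, 3))  # 76.8 inches; slope b/4 = 0.07
```

The divide-by-4 rule works because the derivative of the inverse logit at 0 is exactly 1/4, so the curve’s maximum slope is the coefficient divided by 4.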

Common mistakes

In making the graphs, most of the students didn’t think about the range of x, for example having x go from 0 to 100, which doesn’t make sense as there aren’t any people close to 0 or 100 inches tall. To demand that the range of the curve fit the range of the data is not just being picky: it changes the entire interpretation of the fitted model because changing the range of x also changes the range of probabilities on the y-axis.

How statistics is used to crush (scientific) dissent.

Lakeland writes:

When we interpret powerful as political power, I think it’s clear that Classical Statistics has the most political power, that is, the power to get people to believe things and change policy or alter funding decisions etc… Today Bayes is questioned at every turn, and ridiculed for being “subjective” with a focus on the prior, or modeling “belief”. People in current power to make decisions about resources etc are predominantly users of Classical type methods (hypothesis testing, straw man NHST specifically, and to a lesser extent maximum likelihood fitting and in econ Difference In Difference analysis and synthetic controls and robust standard errors and etc all based on sampling theory typically without mechanistic models…).

The alternative is hard: model mechanisms directly, use Bayes to constrain the model to the reasonable range of applicability, and do a lot of computing to get fitted results that are difficult for anyone without a lot of Bayesian background to understand, and that specifically make a lot of assumptions and choices that are easy to question. It’s hard to argue against “model free inference procedures” that “guarantee unbiased estimates of causal effects” and etc. But it’s easy to argue that some specific structural assumption might be wrong and therefore the result of a Bayesian analysis might not hold…

So from a political perspective, I see Classical Stats as it’s applied in many areas as a way to try to wield power to crush dissent.

My reply:

Yup. But the funny thing is that I think that a lot of the people doing bad science also feel that they’re being pounded by classical statistics.

It goes like this:
– Researcher X has an idea for an experiment.
– X does the experiment and gathers data, would love to publish.
– Because of the annoying hegemony of classical statistics, X needs to do a zillion analyses to find statistical significance.
– Publication! NPR! Gladwell! Freakonomics, etc.
– Methodologist Y points to problems with the statistical analysis, the nominal p-values aren’t correct, etc.
– X is angry: first the statistical establishment required statistical significance, now the statistical establishment is saying that statistical significance isn’t good enough.
– From Researcher X’s point of view, statistics is being used to crush new ideas and it’s being used to force creative science into narrow conventional pathways.

This is a narrative that’s held by some people who detest me (and, no, I’m not Methodologist Y; this might be Greg Francis or Uri Simonsohn or all sorts of people). There’s some truth to the narrative, which is one thing that makes things complicated.