Sadish Dhakal writes:

I am struggling with the problem of conditioning on post-treatment variables. I was hoping you could provide some guidance. Note that I have repeated cross sections, not panel data. Here is the problem simplified:

There are two programs. A policy introduced some changes in one of the programs, which I call the treatment group (T). People can select into T. In fact there’s strong evidence that T programs become more popular in the period after policy change (P). But this is entirely consistent with my hypothesis. My hypothesis is that high-quality people select into the program. I expect that people selecting into T will have better outcomes (Y) because they are of higher quality. Consider the specification (avoiding indices):

Y = b0 + b1 T + b2 P + b3 T X P + e (i)

I expect that b3 will be positive (which it is). Again, my hypothesis is that b3 is positive only because higher quality people select into T after the policy change. Let me reframe the problem slightly (And please correct me if I’m reframing it wrong). If I could observe and control for quality Q, I could write the error term e = Q + u, and b3 in the below specification would be zero.

Y = b0 + b1 T + b2 P + b3 T X P + Q + u (ii)

My thesis is not that the policy “caused” better outcomes, but that it induced selection. How worried should I be about conditioning on T? How should I go about avoiding bogus conclusions?

My reply:

There are two ways I can see to attack this problem, and I guess you’d want to do both. First is to control for lots of pre-treatment predictors, including whatever individual characteristics you can measure which you think would predict the decision to select into T. Second is to include in your model a latent variable representing this information, if you don’t think you can measure it directly. You can then do a Bayesian analysis averaging over your prior distribution on this latent variable, or a sensitivity analysis assessing the bias in your regression coefficient as a function of characteristics of the latent variable and its correlations with your outcome of interest.

I’ve not done this sort of analysis myself; perhaps you could look at a textbook on causal inference such as Tyler VanderWeele’s Explanation in Causal Inference: Methods for Mediation and Interaction, or Miguel Hernan and Jamie Robins’s Causal Inference.