So. I was reading the London Review of Books the other day and came across this passage by the philosopher Kieran Setiya:
Some of the most striking discoveries of experimental philosophers concern the extent of our own personal inconsistencies . . . how we respond to the trolley problem is affected by the details of the version we are presented with. It also depends on what we have been doing just before being presented with the case. After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five. . . .
I’m not up on this literature, but I was suspicious. Watching a TV show for 5 minutes can change your view so strongly?? I was reminded of the claim from a few years ago, that subliminal smiley faces had huge effects on attitudes toward immigration—it turns out the data showed no such thing. And I was bothered, because it seemed that a possibly false fact was being used as part of a larger argument about philosophy. The concept of “experimental philosophy”—that’s interesting, but only if the experiments make sense.
So I thought I’d look into this particular example.
I started by googling *saturday night live trolley problem* which led me to this article in Slate by Daniel Engber, “Does the Trolley Problem Have a Problem?: What if your answer to an absurd hypothetical question had no bearing on how you behaved in real life?”
OK, so Engber’s skeptical too. I searched in the article for Saturday Night Live and found this passage:
Trolley-problem studies also tell us people may be more likely to favor the good of the many over the rights of the few when they’re reading in a foreign language, smelling Parmesan cheese, listening to sound effects of people farting, watching clips from Saturday Night Live, or otherwise subject to a cavalcade of weird and subtle morality-bending factors in the lab.
Which contained a link to this two-page article in Psychological Science by Piercarlo Valdesolo and David DeSteno, “Manipulations of Emotional Context Shape Moral Judgment.”
From that article:
The structure of such dilemmas often requires endorsing a personal moral violation in order to uphold a utilitarian principle. The well-known footbridge dilemma is illustrative. In it, the lives of five people can be saved through sacrificing another. However, the sacrifice involves pushing a rather large man off a footbridge to stop a runaway trolley before it kills the other five. . . . the proposed dual-process model of moral judgment suggests another unexamined route by which choice might be influenced: contextual sensitivity of affect. . . .
We examined this hypothesis using a paradigm in which 79 participants received a positive or neutral affect induction and immediately afterward were presented with the footbridge and trolley dilemmas embedded in a small set of nonmoral distractors. The trolley dilemma is logically equivalent to the footbridge dilemma, but does not require consideration of an emotion-evoking personal violation to reach a utilitarian outcome; consequently, the vast majority of individuals select the utilitarian option for this dilemma.
Here are the two footnotes to the above passage:
1. Given that repeated consideration of dilemmas describing moral violations would rapidly reduce positive mood, we utilized responses to the matched set of the footbridge and trolley dilemmas as the primary dependent variable.
2. Precise wording of the dilemmas can be found in Thomson (1986) or obtained from the authors.
I don’t understand footnote 1 at all. From my reading of it, I’d think that a matched set of the dilemmas corresponds to each participant in the experiment getting both questions, and then in the analysis having the responses compared. But from the published article it’s not clear what’s going on: only 77 people seem to have been asked about the trolley dilemma compared to 79 asked about the footbridge (I don’t know what happened to those two missing responses) and, in any case, the outcome variables in the analyses are the responses to each question, one at a time. I’m not saying this to pick at the paper; I just don’t quite see how their analysis matches their described design. The problem isn’t just two missing people, it’s also that the numbers don’t align. In the data for the footbridge dilemma, 38 people get the control condition (“a 5-min segment taken from a documentary on a small Spanish village”) and 41 get the treatment (“a 5-min comedy clip taken from ‘Saturday Night Live’”), which adds up to the 79 participants that the entire experiment is said to have. But for the trolley dilemma, it says that 40 got the control and 37 got the treatment, so the control group is larger for one question than for the other, even though each participant was presumably assigned to a single condition. Maybe data were garbled in some way? The paper was published in 2006, long before data sharing was any sort of standard, and this little example reminds us why we now think it good practice to share all data and experimental conditions.
Regarding footnote 2: I don’t have a copy of Thomson (1986) at hand, but some googling led me to this description by Michael Waldmann and Alex Wiegmann:
In the philosopher’s Judith Thomson’s (1986) version of the trolley dilemma, a situation is described in which a trolley whose brakes fail is about to run over five workmen who work on the tracks. However, the trolley could be redirected by a bystander on a side track where only one worker would be killed (bystander problem). Is it morally permissible for the bystander to throw the switch or is it better not to act and let fate run its course?
Now for the data. Valdesolo and DeSteno find the following results:
– Flip-the-switch-on-the-trolley problem (no fat guy, no footbridge): 38/40 flip the switch under the control condition, 33/37 flip the switch under the “Saturday Night Live” condition. That’s an estimated treatment effect of -0.06 with standard error 0.06.
– Footbridge problem (trolley, fat guy, footbridge): 3/38 push the man under the control condition, 10/41 push the man under the “Saturday Night Live” condition. That’s an estimated treatment effect of 0.16 with standard error 0.08.
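The arithmetic behind these two estimates can be checked with a minimal Python sketch. It assumes the usual independent-binomial formula for the standard error of a difference in proportions, which may not be exactly what was used to get the numbers above; the function name `diff_in_proportions` is just for illustration.

```python
from math import sqrt

def diff_in_proportions(y_control, n_control, y_treated, n_treated):
    """Estimated treatment effect (treated minus control) and its standard error.

    Assumes two independent binomial samples; this is a standard
    textbook formula, not something taken from the paper itself.
    """
    p_c = y_control / n_control
    p_t = y_treated / n_treated
    estimate = p_t - p_c
    se = sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treated)
    return estimate, se

# Counts as reported above: (successes, n) under control, then under SNL.
trolley = diff_in_proportions(38, 40, 33, 37)
footbridge = diff_in_proportions(3, 38, 10, 41)

print("trolley:    %.2f +/- %.2f" % trolley)
print("footbridge: %.2f +/- %.2f" % footbridge)

# The "three times more likely" figure is the ratio of the two footbridge proportions.
footbridge_ratio = (10 / 41) / (3 / 38)
print("footbridge ratio: %.1f" % footbridge_ratio)
```

The ratio of proportions is where the factor of 3 comes from, but a ratio built on 3 people out of 38 is very sensitive to one or two responses going the other way.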
So from this set of experiments alone, I would not say it’s accurate to write that “After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five.” For one thing, it’s not clear who the participants are in these experiments, so the description “Americans” seems too general. But, beyond that, we have a treatment with an effect -0.06 +/- 0.06 in one experiment and 0.16 +/- 0.08 in another: the evidence seems equivocal. Or, to put it another way, I wouldn’t expect such a large difference (“three times more likely”) to replicate in a new study or to be valid in the general population. (See for example section 2.1 of this paper for another example. The bias occurs because the study is noisy and there is selection on statistical significance.)
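To illustrate how selection on statistical significance inflates estimates from a noisy study, here is a rough simulation sketch. The true effect of 0.05 is purely hypothetical (nobody knows the true effect here), the standard error of 0.08 is borrowed from the footbridge comparison, and the estimate is assumed to be normally distributed around the true effect.

```python
import random

def selection_on_significance(true_effect, se, n_sims=100_000, seed=1):
    """Simulate many replications of a study and keep only the
    'statistically significant' ones (|estimate| > 1.96 * se).

    Returns the share of replications that reach significance and the
    average factor by which the significant estimates overstate the
    true effect in absolute value.
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)  # normal sampling distribution (assumption)
        if abs(estimate) > 1.96 * se:
            significant.append(estimate)
    power = len(significant) / n_sims
    mean_abs = sum(abs(e) for e in significant) / len(significant)
    exaggeration = mean_abs / abs(true_effect)
    return power, exaggeration

# Hypothetical true effect of 0.05; se = 0.08 as in the footbridge comparison.
power, exaggeration = selection_on_significance(true_effect=0.05, se=0.08)
print("share significant: %.2f" % power)
print("average exaggeration factor: %.1f" % exaggeration)
```

Under these assumptions, any estimate that clears the significance threshold has to exceed 1.96 × 0.08 ≈ 0.16 in absolute value, so every published "significant" result would overstate a true effect of 0.05 by more than a factor of 3.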
At this point I thought it best to dig deeper. Setiya’s article is a review of the book, “Philosophy within Its Proper Bounds,” by Edouard Machery. I looked up the book on Amazon, searched for “trolley,” and found this passage:
From this I learned that there were some follow-up experiments. The two papers cited are “Divergent effects of different positive emotions on moral judgment,” by Nina Strohminger, Richard Lewis, and David Meyer (2011), and “To push or not to push? Affective influences on moral judgment depend on decision frame,” by Bernhard Pastötter, Sabine Gleixner, Theresa Neuhauser, and Karl-Heinz Bäuml (2013).
I followed the links to both papers. Machery describes these as replications, but none of the studies in question are exact replications, as the experimental conditions differ from the original study. Strohminger et al. use audio clips of comedians, inspirational stories, and academic lectures: no Saturday Night Live, no video clips at all. And Pastötter et al. don’t use video or comedy: they use audio clips of happy or sad-sounding music.
I’m not saying that these follow-up studies have no value or that they should not be considered replications of the original experiment, in some sense. I’m bringing them up partly because details matter (after all, if the difference between a serious video and a comedy video could have a huge effect on a survey response, one could also imagine that it makes a difference whether stimuli involve speech or music, or whether they are audio or video) but also because of the flexibility, the “researcher degrees of freedom,” involved in whether to consider something as a replication at all. Recall that when a study does not successfully replicate, a common reaction is to point out differences between the old and new experimental conditions and then declare that the new study was not a real replication. But if the new study’s results are in the same direction as the old’s, then it’s treated as a replication, no questions asked. So the practice of counting replications has a heads-I-win, tails-you-lose character. (For an extreme example, recall Daryl Bem’s paper where he claimed to present dozens of replications of his controversial ESP study. One of those purported replications was entitled “Further testing of the precognitive habituation effect using spider stimuli.” I think we can be pretty confident that if the spider experiment didn’t yield the desired results, Bem could’ve just said it wasn’t a real replication because his own experiment didn’t involve spiders at all.)
Anyway, that’s just terminology. I have no problem with the Strohminger et al. and Pastötter et al. studies, which we can simply call follow-up experiments.
And, just to be clear, I agree that there’s nothing special about an SNL video or for that matter about a video at all. My concern about the replication studies is more of a selection issue: if a new study doesn’t replicate the original claim, then a defender can say it’s not a real replication. I guess we could call that “the no true replication fallacy”! Kinda like those notorious examples where people claimed that a failed replication didn’t count because it was done in a different country, or the stimulus was done for a different length of time, or the outdoor temperature was different.
The real question is, what did they find and how do these findings relate to the larger claim?
And the answer is, it’s complicated.
First, the two new studies only look at the footbridge scenario (where the decision is whether to push the fat man), not the flip-the-switch-on-the-trolley scenario, which is not so productive to study because most people are already willing to flip the switch. So the new studies do not allow comparison of the two scenarios. (Strohminger et al. used 12 high-conflict moral dilemmas; see here.)
Second, the two new studies looked at interactions rather than main effects.
The Strohminger et al. analysis is complicated and I didn’t follow all the details, but I don’t see a direct comparison estimating the effect of listening to comedy versus something else. In any case, though, I think this experiment (55 people in what seems to be a between-person design) would be too small to reliably estimate the effect of interest, considering how large the standard error was in the original N=79 study.
Pastötter et al. had no comedy at all and found no main effect; rather, as reported by Machery, they found an effect whose sign depended on framing (whether the question was asked as “Do you think it is appropriate to be active and push the man?” or “Do you think it is appropriate to be passive and not push the man?”).
I guess the question is, does the constellation of these results represent a replication of the finding that “situational cues or causal factors influencing people’s affective states—emotions or moods—have consistent effects on people’s general judgments about cases”?
And my answer is: I’m not sure. With this sort of grab bag of different findings (sometimes main effects, sometimes interactions) with different experimental conditions, I don’t really know what to think. I guess that’s the advantage of large preregistered replications: for all their flaws, they give us something to focus on.
Just to be clear: I agree that effects don’t have to be large to be interesting or important. But at the same time it’s not enough to just say that effects exist. I have no doubt that affective states affect survey responses, and these effects will be of different magnitudes and directions for different people and in different situations (hence the study of interactions as well as main effects). There have to be some consistent or systematic patterns for this to be considered a scientific effect, no? So, although I agree that effects don’t need to be large, I also don’t think a statement such as “emotions influence judgment” is enough either.
One thing that does seem clear is that details matter, and lots of the details get garbled in the retelling. For example, Setiya reports that “Americans are three times more likely” to say they’d push someone, but that factor of 3 is based on a small noisy study on an unknown population, for which I’ve not seen any exact replication, so to make that claim is a big leap of faith, or of statistical inference. Meanwhile, Engber refers to the flip-the-switch version of the dilemma, for which the data show no such effect of the TV show. More generally, everyone seems to like talking about Saturday Night Live, I guess because it evokes vivid images, even though the larger study had no TV comedy at all but compared clips of happy or sad-sounding music.
What have we learned from this journey?
Reporting science is challenging, even for skeptics. None of the authors discussed above (Setiya, Engber, or Machery) are trying to sell us on this research, and none of them have a vested interest in making overblown claims. Indeed, I think it would be fair to describe Setiya and Engber as skeptics in this discussion. But even skeptics can get lost in the details. We all have a natural desire to smooth over the details and go for the bigger story. But this is tricky when the bigger story, whatever it is, depends on details that we don’t fully understand. Presumably our understanding in 2018 of affective influences on these survey responses should not depend on exactly how an experiment was done in 2006. But the descriptions of the effects are framed in terms of that 2006 study, and with each lab’s experiment measuring something a bit different, I find it very difficult to put everything together.
This relates to the problem we discussed the other day, of psychology textbooks putting a complacent spin on the research in their field. The desire for a smooth and coherent story gets in the way of the real-world complexity that motivates this research in the first place.
There’s also another point that Engber emphasizes, which is the difference between a response to a hypothetical question and an action in the external world. Paradoxically, one reason why I can accept that various irrelevant interventions (for example, watching a comedy show or a documentary film) could have a large effect on the response to the trolley question is that this response is not something that most people have thought about before. In contrast, I found similar claims involving political attitudes and voting (for example, the idea that 20% of women change their presidential preference depending on time of the month) to be ridiculous, in part because most people already have settled political views. But then, if the only reason we find the trolley claims plausible is that people aren’t answering them thoughtfully, then we’re really only learning about people’s quick reactions, not their deeper views. Quick reactions are important too; we should just be clear if that’s what we’re studying.
P.S. Edouard Machery and Nina Strohminger offered useful comments that influenced what I wrote above.