Why edit a journal? More generally, how to contribute to scientific discussion?

The other day I wrote:

Journal editing is a volunteer job, and people sign up for it because they want to publish exciting new work, or maybe because they enjoy the power trip, or maybe out of a sense of duty—but, in any case, they typically aren’t in it for the controversy.

Jon Baron, editor of the journal Judgment and Decision Making, saw this and wrote:

In my case, the reasons are “all three”! But it isn’t a matter of “exciting new work” so much as “solid work with warranted conclusions, even if boring”. This is a very old-fashioned experimental psychologist’s approach. Boring is good. And the “power” is not a trivial consideration; many things that academics do have the purpose of influencing their fields, and editing, for me, beats teaching, blogging, writing trade books, giving talks, or even . . . (although it does not beat writing a textbook).

I’ve been asked many times to edit journals but I’ve always said no because I’ve felt that, personally, I can make better contributions to the field as a loner. Editing a journal would require too much social skill for me. We each should contribute where we can.

Also recall this story:

I remember, close to 20 years ago, an economist friend of mine was despairing of the inefficiencies of the traditional system of review, and he decided to do something about it: He created his own system of journals. They were all online (a relatively new thing at the time), with an innovative transactional system of reviewing (as I recall, every time you submitted an article you were implicitly agreeing to review three articles by others) and a multi-tier acceptance system, so that very few papers got rejected; instead they were just binned into four quality levels. And all the papers were open-access or something like that.

The system was pretty cool, but for some reason it didn’t catch on—I guess that, like many such systems, it relied a lot on continuing volunteer efforts of its founder, and perhaps he just got tired of running an online publishing empire, and the whole thing kinda fell apart. The journals lost all their innovative aspects and became just one more set of social science publishing outlets. My friend ended up selling his group of journals to a traditional for-profit company, they were no longer free, etc. It was like the whole thing never happened.

A noble experiment, but not self-sustaining. Which was too bad, given that he’d put so much effort into building a self-sustaining structure.

Perhaps one lesson from my friend’s unfortunate experience is that it’s not enough to build a structure; you also need to build a community.

Another lesson is that maybe it can help to lean on some existing institution. This guy built up his whole online publishing company from scratch, which was kinda cool, but then when he no longer felt like running it, it dissolved, and then he ended up with a pile of money, which he probably didn’t need and he might never get around to spending, while losing the scientific influence, which is more interesting and important. Maybe it would’ve been better for him to have teamed up with an economics society, or with some university, governmental body, or public-interest organization.

Good intentions are not enough, and even good intentions + a lot of effort aren’t enough. You have to work with existing institutions, or create your own. This blog works in part because it piggybacks off the existing institution of blogging. Nowadays there isn’t much blogging anymore, but the circa 2005-era blogosphere was helpful in giving us a sense of how to set up our community. We built upon the strengths of the blogosphere and avoided some of the pitfalls.

Similarly this is the challenge of reforming scientific communication: to do something better while making use of existing institutions and channels whereby researchers donate their labor.

My (remote) talk this Friday 3pm at the Department of Cognitive Science at UCSD

It was too much to do one more flight so I’ll do this one in (nearly) carbon-free style using hangout or skype.

It’s 3pm Pacific time in CSB (Cognitive Science Building) 003 at the University of California, San Diego.

This is what they asked for in the invite:

Our Friday afternoon COGS200 series has been a major foundation of the Cognitive Science community and curriculum in our department for decades and is attended by faculty and students from diverse fields (e.g. anthropology, human-computer-interface/design, AI/machine learning, neuroscience, philosophy of mind, psychology, genetics, etc).

One of the goals of our Spring quarter series is to expose attendees to research on the cultural practices surrounding data acquisition, analysis, and interpretation. In particular, we were hoping to have a section exploring current methods in statistical inference, with an emphasis on designing analyses appropriate to the question being asked. If you are interested and willing, we would love for you to share your expertise on multilevel / hierarchical modeling—as well as your more general perspective on how scientists can better deploy statistical models for conducting good, replicable science. (Relevant papers that come to mind include your 2016 paper on “multiverse analyses”, as well as your 2017 “Abandon statistical significance” paper.)

I’m still not sure what’s the best thing to talk about. I guess I’ll start with what’s in that above paragraph and then go from there.

Peter Ellis on Forecasting Antipodal Elections with Stan

I liked this intro to Peter Ellis from Rob J. Hyndman’s talk announcement:

He [Peter Ellis] started forecasting elections in New Zealand as a way to learn how to use Stan, and the hobby has stuck with him since he moved back to Australia in late 2018.

You may remember Peter from my previous post on his analysis of NZ traffic crashes.

The talk

Speaker: Peter Ellis

Title: Poll position: statistics and the Australian federal election

Abstract: The result of the Australian federal election in May 2019 stunned many in the politics-observing class because it diverged from a long chain of published survey results of voting intention. How surprising was the outcome? Not actually a complete outlier; about a one in six chance, according to Peter Ellis’s forecasting model for Free Range Statistics. This seminar will walk through that model from its data management (the R package ozfedelect, built specifically to support it), the state-space model written in Stan and R that produces the forecasts; and its eventual visualisation and communication to the public. There are several interesting statistical issues relating to how we translate crude survey data into actual predicted seats, and some even more interesting communication issues about how all this is understood by the public. This talk is aimed at those with an interest in one or more of R, Stan, Bayesian modelling and forecasts, and Australian voting behaviour.

Location: 11am, 31 May 2019. Room G03, Learning and Teaching Building, 19 Ancora Imparo Way, Clayton Campus, Monash University [Melbourne, Australia]

The details

Ellis’s blog, Free Range Statistics, has the details of the Australian election model and much much more.

You can also check out his supporting R package, ozfedelect, on GitHub.

From hobbyist to pundit

Ellis’s hobby led to his being quoted by The Economist in a recent article, Did pollsters misread Australia’s election or did pundits?. Quite the hobby.

But wait, there’s more…

There are a lot more goodies on Peter Ellis’s blog, both with and without Stan.

A plea

I’d like to make a plea for a Stan version of the Bradley-Terry model (the statistical basis of ELO) for predicting Australian Football League matches. It’s an exercise somewhere in the regression sections of Gelman et al.’s Bayesian Data Analysis to formulate the model (including how to extend to ties). I have a half-baked Bradley-Terry case study I’ve been meaning to finish, but would be even happier to get an outside contribution! I’d be happy to share what I have so far.

They’re working for the clampdown

This is just disgraceful: powerful academics using their influence to suppress (“clamp down on”) dissent. They call us terrorists, they lie about us in their journals, and they plot to clamp down on us. I can’t say at this point that I’m surprised to see this latest, but it saddens and angers me nonetheless to see people who could be research leaders but who instead are saying:

We’re working for the clampdown
We will teach our twisted speech
To the unbelievers
We will train our blue-eyed men
To the unbelievers

What are we gonna do now?

Donald J. Trump and Robert E. Lee

The other day the president made some news by praising Civil War general Robert E. Lee, and it struck me that Trump and Lee had a certain amount in common. Not in their personalities, but in their situations.

Lee could’ve fought on the Union side in the Civil War. Or he could’ve saved a couple hundred thousand lives by surrendering his army at some point, once he realized they were going to lose. But, conditional on fighting on, he had to innovate. He had to gamble, over and over again, because that was his only chance.

Similarly for Trump. There was no reason he had to run for president. And, once he had the Republican nomination, he still could’ve stepped down. But, conditional on him running for president with all these liabilities (he’s unpopular, he leads an unpopular party, and he has a lot of legal issues), he’s had to use unconventional tactics, and to continue to use unconventional tactics. Like Lee, there’s no point at which he could rest on his successes.

Against Arianism 3: Consider the cognitive models of the field

“You took my sadness out of context at the Mariners Apartment Complex” – Lana Del Rey

It’s sunny, I’m in England, and I’m having a very tasty beer, and Lauren, Andrew, and I just finished a paper called The experiment is just as important as the likelihood in understanding the prior: A cautionary note on robust cognitive modelling.

So I guess it’s time to resurrect a blog series. On the off chance that any of you have forgotten, the Against Arianism series focusses on the idea that, in the same way that Arianism1 was heretical, so too is the idea that priors and likelihoods can be considered separately. Rather, they are consubstantial–built of the same probability substance.

There is no new thing under the sun, so obviously this has been written about a lot. But because it’s my damn blog post, I’m going to focus on a paper Andrew, Michael, and I wrote in 2017 called The Prior Can Often Only Be Understood in the Context of the Likelihood. This paper was dashed off in a hurry and under deadline pressure, but I quite like it. But it’s also maybe not the best place to stop the story.

An opportunity to comment

A few months back, the fabulous Lauren Kennedy was visiting me in Toronto on a different project. Lauren is a postdoc at Columbia working partly on complex survey data, but her background is quantitative methods in psychology. Among other things, we saw a fairly regrettable (but excellent) Claire Denis movie about vampires2.

But that’s not relevant to the story. What is relevant was that Lauren had seen an open invitation to write a comment on a paper in Computational Brain & Behaviour about Robust3 Modelling in Cognitive Science written by a team cognitive scientists and researchers in scientific theory, philosophy, and practice  (Michael Lee, Amy Criss, Berna Devezer, Christopher Donkin, Alexander Etz, Fábio Leite, Dora Matzke, Jeffrey Rouder, Jennifer Trueblood, Corey White, and Joachim Vandekerckhove).

Their bold aim to sketch out the boundaries of good practice for cognitive modelling (and particularly for the times where modelling meets data) is laudable, not least because such an endeavor will always be doomed to fail in some way. But the act of stating some ideas for what constitutes best practice gives the community a concrete pole to hang this important discussion on. And Computational Brain & Behaviour recognized this and decided to hang an issue off the paper and its discussions.

The paper itself is really thoughtful and well done. And obviously I do not agree with everything in it, but that doesn’t stop me from the feeling that wide-spread adoption of  their suggestions would definitely make quantitative research better.

But Lauren noticed one tool that we have found extremely useful that wasn’t mentioned in the paper: prior predictive checks. She asked if I’d be interested in joining her on a paper, and I quickly said yes!

It turns out there is another BART

The best thing about working with Lauren on this was that she is a legit psychology researcher so she isn’t just playing in someone’s back yard, she owns a patch of sand. It was immediately clear that it would be super-quick to write a comment that just said “you should use prior predictive checks”. But that would miss a real opportunity. Because cognitive modelling isn’t quite the same as standard statistical modelling (although in the case where multilevel models are appropriate Daniel  Schad, Michael Betancourt, and  Shravan Vasishth just wrote an excellent paper on importing general ideas of good statistical workflows into Cognitive applications).

Rather than using our standard data analysis models, a lot of the time cognitive models are generative models for the cognitive process coupled (sometimes awkwardly) with models for the data that is generated from a certain experiment. So we wanted an example model that is more in line with this practice than our standard multilevel regression examples.

Lauren found the Balloon Analogue Risk Task (BART) in Lee and Wagenmakers’ book Bayesian Cognitive Modeling: A Practical Course, which conveniently has Stan code online4. We decided to focus on this example because it’s fairly easy to understand and has all the features we needed. But hopefully we will eventually write a longer paper that covers more common types of models.

BART is an experiment that makes participants simulate pumping balloons with some fixed probability of popping after every pump. Every pump gets them more money, but they get nothing if the balloon pops. The model contains a parameter (\gamma^+) for risk taking behaviour and the experiment is designed to see if the risk taking behaviour changes as a person gets more drunk.  The model is described in the following DAG:

Exploring the prior predictive distribution 

Those of you who have been paying attention will notice the Uniform(0,10) priors on the logit scale and think that these priors are a little bit terrible. And they are! Direct simulation from model leads to absolutely silly predictive distributions for the number of pumps in a single trial. Worse still, the pumps are extremely uniform across trials. Which means that the model thinks, a priori, that it is quite likely for a tipsy undergraduate to pump a balloon 90 times in each of the 20 trials. The mean number of pumps is a much more reasonable 10.

Choosing tighter upper bounds on the uniform priors leads to more sensible prior predictive distributions, but then Lauren went to test out what changes this made to inference (in particular looking at how it affects the Bayes factor against the null that the \gamma^+ parameters were the same across different levels of drunkenness). It made very little difference.  This seemed odd so she started looking closer.

Where is the p? Or, the Likelihood Principle gets in the way

So what is going on here? Well the model describe in Lee and Wagenmaker’s book is not a generative model for the experimental data. Why not? Because the balloon sometimes pops! But because in this modelling setup the probability of explosion is independent of the number of pumps, this explosive possibility only appears as a constant in the likelihood.

The much lauded Likelihood Principle tells us that we do not need to worry about these constants when we are doing inference. But when we are trying to generate data from the prior predictive distribution, we really need to care about these aspects of the model.

Once the context on the experiment is taken into account, the prior predictive distributions change a lot.

Context is important when taking statistical methods into new domains

Prior predictive checks are really powerful tools. They give us a way to set priors, they give us a way to understand what our model does, they give us a way to generate data that we can use to assess the behaviour of different model comparison tools under the experimental design at hand. (Neyman-Pearson acolytes would talk about power here, but the general question lives on beyond that framework).

Modifications of prior predictive checks should also be used to assess how predictions, inference, and model comparison methods behave under different but realistic deviations from the assumed generative model. (One of the points where I disagree with Lee et al.‘s paper is that it’s enough to just pre-register model comparision methods. We also need some sort of simulation study to know how they work for the problem at hand!)

But prior predictive checks require understanding of the substantive field as well as understanding of how the experiment was performed. And it is not always as simple as just predict y!

Balloons pop. Substantive knowledge may only be about contrasts or combinations of predictions. We need to always be aware that it’s a lot of work to translate a tool to a new scientific context. Even when that tool  appears to be as straightforward to use and as easy to explain as prior predictive checks.

And maybe we should’ve called that paper The Prior Can Often Only Be Understood in the Context of the Experiment.


1 The fourth century Christian heresy that posited that Jesus was created by God and hence was not of the same substance. The council of Nicaea ended up writing a creed to stamp that one out.

2 Really never let me choose the movie. Never.

3 I hate the word “robust” here. Robust against what?! The answer appears to be “robust against un-earned certainty”, but I’m not sure. Maybe they want to Winsorize cognitive science?

4 Lauren had to twiddle it a bit, particularly using a non-centered parameterization to eliminate divergences.

Pushing the guy in front of the trolley

So. I was reading the London Review of Books the other day and came across this passage by the philosopher Kieran Setiya:

Some of the most striking discoveries of experimental philosophers concern the extent of our own personal inconsistencies . . . how we respond to the trolley problem is affected by the details of the version we are presented with. It also depends on what we have been doing just before being presented with the case. After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five. . . .

I’m not up on this literature, but I was suspicious. Watching a TV show for 5 minutes can change your view so strongly?? I was reminded of the claim from a few years ago, that subliminal smiley faces had huge effects on attitudes toward immigration—it turns out the data showed no such thing. And I was bothered, because it seemed that a possibly false fact was being used as part of a larger argument about philosophy. The concept of “experimental philosophy”—that’s interesting, but only if the experiments make sense.

So I thought I’d look into this particular example.

I started by googling *saturday night live trolley problem* which led me to this article in Slate by Daniel Engber, “Does the Trolley Problem Have a Problem?: What if your answer to an absurd hypothetical question had no bearing on how you behaved in real life?”

OK, so Engber’s skeptical too. I searched in the article for Saturday Night Live and found this passage:

Trolley-problem studies also tell us people may be more likely to favor the good of the many over the rights of the few when they’re reading in a foreign language, smelling Parmesan cheese, listening to sound effects of people farting, watching clips from Saturday Night Live, or otherwise subject to a cavalcade of weird and subtle morality-bending factors in the lab.

Which contained a link to this two-page article in Psychological Science by Piercarlo Valdesolo and David DeSteno, “Manipulations of Emotional Context Shape Moral Judgment.”

From that article:

The structure of such dilemmas often requires endorsing a personal moral violation in order to uphold a utilitarian principle. The well-known footbridge dilemma is illustrative. In it, the lives of five people can be saved through sacrificing another. However, the sacrifice involves pushing a rather large man off a footbridge to stop a runaway trolley before it kills the other five. . . . the proposed dual-process model of moral judgment suggests another unexamined route by which choice might be influenced: contextual sensitivity of affect. . . .

We examined this hypothesis using a paradigm in which 79 participants received a positive or neutral affect induction and immediately afterward were presented with the footbridge and trolley dilemmas embedded in a small set of nonmoral distractors.[1] The trolley dilemma is logically equivalent to the footbridge dilemma, but does not require consideration of an emotion-evoking personal violation to reach a utilitarian outcome; consequently, the vast majority of individuals select the utilitarian option for this dilemma.[2]

Here are the two footnotes to the above passage:

[1] Given that repeated consideration of dilemmas describing moral violations would rapidly reduce positive mood, we utilized responses to the matched set of the footbridge and trolley dilemmas as the primary dependent variable.

[2] Precise wording of the dilemmas can be found in Thomson (1986) or obtained from the authors.

I don’t understand footnote 1 at all. From my reading of it, I’d think that a matched set of the dilemmas corresponds to each participant in the experiment getting both questions, and then in the analysis having the responses compared. But from the published article it’s not clear what’s going on, as only 77 people seem to have been asked about the trolley dilemma compared to 79 asked about the footbridge—I don’t know what happened to those two missing responses—and, in any case, the dependent or outcome variable in the analyses are the responses to each question, one at a time. I’m not saying this to pick at the paper; I just don’t quite see how their analysis matches their described design. The problem isn’t just two missing people, it’s also that the numbers don’t align. In the data for the footbridge dilemma, 38 people get the control condition (“a 5-min segment taken from a documentary on a small Spanish village”) and 41 get the treatment (“a 5-min comedy clip taken from ‘Saturday Night Live’”). The entire experiment is said to have 79 participants. But for the trolley dilemma, it says that 40 got the control and 37 got the treatment. Maybe data were garbled in some way? The paper was published in 2006 so long before data sharing was any sort of standard, and this little example reminds us why we now think it good practice to share all data and experimental conditions.

Regarding footnote 2: I don’t have a copy of Thomson (1986) at hand, but some googling led me to this description by Michael Waldmann and Alex Wiegmann:

In the philosopher’s Judith Thomson’s (1986) version of the trolley dilemma, a situation is described in which a trolley whose brakes fail is about to run over five workmen who work on the tracks. However, the trolley could be redirected by a bystander on a side track where only one worker would be killed (bystander problem). Is it morally permissible for the bystander to throw the switch or is it better not to act and let fate run its course?

Now for the data. Valdesolo and DeSteno find the following results:

– Flip-the-swithch-on-the-trolley problem (no fat guy, no footbridge): 38/40 flip the switch under the control condition, 33/37 flip the switch under the “Saturday Night Live” condition. That’s an estimated treatment effect of -0.06 with standard error 0.06.

– Footbridge problem (trolley, fat guy, footbridge): 3/38 push the man under the control condition, 10/41 push the man under the “Saturday Night Live” condition. That’s an estimated treatment effect of 0.16 with standard error 0.08.

So from this set of experiments alone, I would not say it’s accurate to write that “After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five.” For one thing, it’s not clear who the participants are in these experiments, so the description “Americans” seems too general. But, beyond that, we have a treatment with an effect -0.06 +/- 0.06 in one experiment and 0.16 +/- 0.08 in another: the evidence seems equivocal. Or, to put it another way, I wouldn’t expect such a large difference (“three times more likely”) to replicate in a new study or to be valid in the general population. (See for example section 2.1 of this paper for another example. The bias occurs because the study is noisy and there is selection on statistical significance.)

At this point I thought it best to dig deeper. Setiya’s article is a review of the book, “Philosophy within Its Proper Bounds,” by Edouard Machery. I looked up the book on Amazon, searched for “trolley,” and found this passage:

From this I learned that were some follow-up experiments. The two papers cited are Divergent effects of different positive emotions on moral judgment, by Nina Strohminger, Richard Lewis, and David Meyer (2011), and To push or not to push? Affective influences on moral judgment depend on decision frame, by Bernhard Pastötter, Sabine Gleixner, Theresa Neuhauser, and Karl-Heinz Bäuml (2013).

I followed the link to both papers. Machery describes these as replications, but none of the studies in question are exact replications, as the experimental conditions differ from the original study. Strohminger et al. use audio clips of comedians, inspirational stories, and academic lectures: no Saturday Night Live, no video clips at all. And Pastötter et al. don’t use video or comedy: they use audio clips of happy or sad-sounding music.

I’m not saying that these follow-up studies have no value or that they should not be considered replications of the original experiment, in some sense. I’m bringing them up partly because details matter—after all, if the difference between a serious video and a comedy video could have a huge effect on a survey response, one could also imagine that it makes a difference whether stimuli involve speech or music, or whether they are audio or video—but also because of the flexibility, the “researcher degrees of freedom,” involved in whether to consider something as a replication at all. Recall that when a study does not successfully replicate, a common reaction is to point out differences between the old and new experimental conditions and then declare that that the new study was not a real replication. But if the new study’s results are in the same direction as the old’s, then it’s treated as a replication, no questions asked. So the practice of counting replications has a heads-I-win, tails-you-lose character. (For an extreme example, recall Daryl Bem’s paper where he claimed to present dozens of replications of his controversial ESP study. One of those purported replications was entitled “Further testing of the precognitive habituation effect using spider stimuli.” I think we can be pretty confident that if the spider experiment didn’t yield the desired results, Bem could’ve just said it wasn’t a real replication because his own experiment didn’t involve spiders at all.)

Anyway, that’s just terminology. I have no problem with the Strohminger et al. and Pastötter et al. studies, which we can simply call follow-up experiments.

And, just to be clear, I agree that there’s nothing special about an SNL video or for that matter about a video at all. My concern about the replication studies is more of a selection issue: if a new study doesn’t replicate the original claim, then a defender can say it’s not a real replication. I guess we could call that “the no true replication fallacy”! Kinda like those notorious examples where people claimed that a failed replication didn’t count because it was done in a different country, or the stimulus was done for a different length of time, or the outdoor temperature was different.

The real question is, what did they find and how do these findings relate to the larger claim?

And the answer is, it’s complicated.

First, the two new studies only look at the footbridge scenario (where the decision is whether to push the fat man), not the flip-the-switch-on-the-trolley scenario, which is not so productive to study because most people are already willing to flip the switch. So the new studies to not allow comparison the two scenarios. (Strohminger et al. used 12 high conflict moral dilemmas; see here)

Second, the two new studies looked at interactions rather than main effects.

The Strohminger et al. analysis is complicated and I didn’t follow all the details, but I don’t see a direct comparison estimating the effect of listening to comedy versus something else. In any case, though, I think this experiment (55 people in what seems to be a between-person design) would be too small to reliably estimate the effect of interest, considering how large the standard error was in the original N=79 study.

Pastötter et al. had no comedy at all and found no main effect; rather, as reported by Machery, they found an effect whose sign depended on framing (whether the question was asked as, “Do you think it is appropriate to be active and push the man?” or “Do you think it is appropriate to be passive and not push the man?”:

I guess the question is, does the constellation of these results represent a replication of the finding that “situational cues or causal factors influencing people’s affective states—emotions or moods—have consistent effects on people’s general judgments about cases”?

And my answer is: I’m not sure. With this sort of grab bag of different findings (sometimes main effects, sometimes interactions) with different experimental conditions, I don’t really know what to think. I guess that’s the advantage of large preregistered replications: for all their flaws, they give us something to focus on.

Just to be clear: I agree that effects don’t have to be large to be interesting or important. But at the same time it’s not enough to just say that effects exist. I have no doubt that affective states affect survey responses, and these effects will be of different magnitudes and directions for different people and in different situations (hence the study of interactions as well as main effects). There have to be some consistent or systematic patterns for this to be considered a scientific effect, no? So, although I agree that effects don’t need to be large, I also don’t think a statement such as “emotions influence judgment” is enough either.

One thing that does seem clear, is that details matter, and lots of the details get garbled in the retelling. For example, Setiya reports that “Americans are three times more likely” to say they’d push someone, but that factor of 3 is based on a small noisy study on an unknown population, and for which I’ve not seen any exact replication, so to make that claim is a big leap of faith, or of statistical inference. Meanwhile, Engber refers to the flip-the-switch version of the dilemma, for which case the data show no such effect of the TV show. More generally, everyone seems to like talking about Saturday Night Live, I guess because it evokes vivid images, even though the larger study had no TV comedy at all but compared clips of happy or sad-sounding music.

What have we learned from this journey?

Reporting science is challenging, even for skeptics. None of the authors discussed above—Setiya, Engber, or Machery—are trying to sell us on this research, and none of them have a vested interest in making overblown claims. Indeed, I think it would be fair to describe Setiya and Engber as skeptics in this discussion. But even skeptics can get lost in the details. We all have a natural desire to smooth over the details and go for the bigger story. But this is tricky when the bigger story, whatever it is, depends on details that we don’t fully understand. Presumably our understanding in 2018 of affective influences on these survey responses should not depend on exactly how an experiment was done in 2006—but the description of the effects are framed in terms of that 2006 study, and with each lab’s experiment measuring something a bit different, I find it very difficult to put everything together.

This relates to the problem we discussed the other day, of psychology textbooks putting a complacent spin on the research in their field. The desire for a smooth and coherent story gets in the way of the real-world complexity that motivates this research in the first place.

There’s also another point that Engber emphasizes, which is the difference between a response to a hypothetical question, and an action in the external world. Paradoxically, one reason why I can accept that various irrelevant interventions (for example, watching a comedy show or a documentary film) could have a large effect on the response to the trolley question is that this response is not something that most people have thought about before. In contrast, I found similar claims involving political attitudes and voting (for example, the idea that 20% of women change their presidential preference depending on time of the month) to be ridiculous, on part because most people already have settled political views. But then, if the only reason we find the trolley claims plausible is that people aren’t answering them thoughtfully, then we’re really only learning about people’s quick reactions, not their deeper views. Quick reactions are important too; we should just be clear if that’s what we’re studying.

P.S. Edouard Machery and Nina Strohminger offered useful comments that influenced what I wrote above.

Going beyond the rainbow color scheme for statistical graphics

Yesterday in our discussion of easy ways to improve your graphs, a commenter wrote:

I recently read and enjoyed several articles about alternatives to the rainbow color palette. I particularly like the sections where they show how each color scheme looks under different forms of color-blindness and/or in black and white.

Here’s a couple of them (these are R-centric but relevant beyond that):

The viridis color palettes, by Bob Rudis, Noam Ross and Simon Garnier

Somewhere over the Rainbow, by Ross Ihaka, Paul Murrell, Kurt Hornik, Jason Fisher, Reto Stauffer, Claus Wilke, Claire McWhite, and Achim Zeileis.

I particularly like that second article, which includes lots of examples.

What are some common but easily avoidable graphical mistakes?

John Kastellec writes:

I was thinking about writing a short paper aimed at getting political scientists to not make some common but easily avoidable graphical mistakes. I’ve come up with the following list of such mistakes. I was just wondering if any others immediately came to mind?

– Label lines directly

– Make labels big enough to read

– Small multiples instead of spaghetti plots

– Avoid stacked barplots

– Make graphs completely readable in black-and-white

– Leverage time as clearly as possible by placing it on the x-axis.

That reminds me . . . I was just at a pharmacology conference. And everybody there—I mean everybody—used the rainbow color scheme for their graphs. Didn’t anyone send them the memo, that we don’t do rainbow anymore? I prefer either a unidirectional shading of colors, or a bidirectional shading as in figure 4 here, depending on context.