Peter Ellis on Forecasting Antipodal Elections with Stan

I liked this intro to Peter Ellis from Rob J. Hyndman’s talk announcement:

He [Peter Ellis] started forecasting elections in New Zealand as a way to learn how to use Stan, and the hobby has stuck with him since he moved back to Australia in late 2018.

You may remember Peter from my previous post on his analysis of NZ traffic crashes.

The talk

Speaker: Peter Ellis

Title: Poll position: statistics and the Australian federal election

Abstract: The result of the Australian federal election in May 2019 stunned many in the politics-observing class because it diverged from a long chain of published survey results of voting intention. How surprising was the outcome? Not actually a complete outlier; about a one in six chance, according to Peter Ellis’s forecasting model for Free Range Statistics. This seminar will walk through that model from its data management (the R package ozfedelect, built specifically to support it), the state-space model written in Stan and R that produces the forecasts; and its eventual visualisation and communication to the public. There are several interesting statistical issues relating to how we translate crude survey data into actual predicted seats, and some even more interesting communication issues about how all this is understood by the public. This talk is aimed at those with an interest in one or more of R, Stan, Bayesian modelling and forecasts, and Australian voting behaviour.

Location: 11am, 31 May 2019. Room G03, Learning and Teaching Building, 19 Ancora Imparo Way, Clayton Campus, Monash University [Melbourne, Australia]

The details

Ellis’s blog, Free Range Statistics, has the details of the Australian election model and much much more.

You can also check out his supporting R package, ozfedelect, on GitHub.

From hobbyist to pundit

Ellis’s hobby led to his being quoted by The Economist in a recent article, Did pollsters misread Australia’s election or did pundits?. Quite the hobby.

But wait, there’s more…

There are a lot more goodies on Peter Ellis’s blog, both with and without Stan.

A plea

I’d like to make a plea for a Stan version of the Bradley-Terry model (the statistical basis of ELO) for predicting Australian Football League matches. It’s an exercise somewhere in the regression sections of Gelman et al.’s Bayesian Data Analysis to formulate the model (including how to extend to ties). I have a half-baked Bradley-Terry case study I’ve been meaning to finish, but would be even happier to get an outside contribution! I’d be happy to share what I have so far.

They’re working for the clampdown

This is just disgraceful: powerful academics using their influence to suppress (“clamp down on”) dissent. They call us terrorists, they lie about us in their journals, and they plot to clamp down on us. I can’t say at this point that I’m surprised to see this latest, but it saddens and angers me nonetheless to see people who could be research leaders but who instead are saying:

We’re working for the clampdown
We will teach our twisted speech
To the unbelievers
We will train our blue-eyed men
To the unbelievers

What are we gonna do now?

Donald J. Trump and Robert E. Lee

The other day the president made some news by praising Civil War general Robert E. Lee, and it struck me that Trump and Lee had a certain amount in common. Not in their personalities, but in their situations.

Lee could’ve fought on the Union side in the Civil War. Or he could’ve saved a couple hundred thousand lives by surrendering his army at some point, once he realized they were going to lose. But, conditional on fighting on, he had to innovate. He had to gamble, over and over again, because that was his only chance.

Similarly for Trump. There was no reason he had to run for president. And, once he had the Republican nomination, he still could’ve stepped down. But, conditional on him running for president with all these liabilities (he’s unpopular, he leads an unpopular party, and he has a lot of legal issues), he’s had to use unconventional tactics, and to continue to use unconventional tactics. Like Lee, there’s no point at which he could rest on his successes.

Against Arianism 3: Consider the cognitive models of the field

“You took my sadness out of context at the Mariners Apartment Complex” – Lana Del Rey

It’s sunny, I’m in England, and I’m having a very tasty beer, and Lauren, Andrew, and I just finished a paper called The experiment is just as important as the likelihood in understanding the prior: A cautionary note on robust cognitive modelling.

So I guess it’s time to resurrect a blog series. On the off chance that any of you have forgotten, the Against Arianism series focusses on the idea that, in the same way that Arianism1 was heretical, so too is the idea that priors and likelihoods can be considered separately. Rather, they are consubstantial–built of the same probability substance.

There is no new thing under the sun, so obviously this has been written about a lot. But because it’s my damn blog post, I’m going to focus on a paper Andrew, Michael, and I wrote in 2017 called The Prior Can Often Only Be Understood in the Context of the Likelihood. This paper was dashed off in a hurry and under deadline pressure, but I quite like it. But it’s also maybe not the best place to stop the story.

An opportunity to comment

A few months back, the fabulous Lauren Kennedy was visiting me in Toronto on a different project. Lauren is a postdoc at Columbia working partly on complex survey data, but her background is quantitative methods in psychology. Among other things, we saw a fairly regrettable (but excellent) Claire Denis movie about vampires2.

But that’s not relevant to the story. What is relevant was that Lauren had seen an open invitation to write a comment on a paper in Computational Brain & Behaviour about Robust3 Modelling in Cognitive Science written by a team cognitive scientists and researchers in scientific theory, philosophy, and practice  (Michael Lee, Amy Criss, Berna Devezer, Christopher Donkin, Alexander Etz, Fábio Leite, Dora Matzke, Jeffrey Rouder, Jennifer Trueblood, Corey White, and Joachim Vandekerckhove).

Their bold aim to sketch out the boundaries of good practice for cognitive modelling (and particularly for the times where modelling meets data) is laudable, not least because such an endeavor will always be doomed to fail in some way. But the act of stating some ideas for what constitutes best practice gives the community a concrete pole to hang this important discussion on. And Computational Brain & Behaviour recognized this and decided to hang an issue off the paper and its discussions.

The paper itself is really thoughtful and well done. And obviously I do not agree with everything in it, but that doesn’t stop me from the feeling that wide-spread adoption of  their suggestions would definitely make quantitative research better.

But Lauren noticed one tool that we have found extremely useful that wasn’t mentioned in the paper: prior predictive checks. She asked if I’d be interested in joining her on a paper, and I quickly said yes!

It turns out there is another BART

The best thing about working with Lauren on this was that she is a legit psychology researcher so she isn’t just playing in someone’s back yard, she owns a patch of sand. It was immediately clear that it would be super-quick to write a comment that just said “you should use prior predictive checks”. But that would miss a real opportunity. Because cognitive modelling isn’t quite the same as standard statistical modelling (although in the case where multilevel models are appropriate Daniel  Schad, Michael Betancourt, and  Shravan Vasishth just wrote an excellent paper on importing general ideas of good statistical workflows into Cognitive applications).

Rather than using our standard data analysis models, a lot of the time cognitive models are generative models for the cognitive process coupled (sometimes awkwardly) with models for the data that is generated from a certain experiment. So we wanted an example model that is more in line with this practice than our standard multilevel regression examples.

Lauren found the Balloon Analogue Risk Task (BART) in Lee and Wagenmakers’ book Bayesian Cognitive Modeling: A Practical Course, which conveniently has Stan code online4. We decided to focus on this example because it’s fairly easy to understand and has all the features we needed. But hopefully we will eventually write a longer paper that covers more common types of models.

BART is an experiment that makes participants simulate pumping balloons with some fixed probability of popping after every pump. Every pump gets them more money, but they get nothing if the balloon pops. The model contains a parameter (\gamma^+) for risk taking behaviour and the experiment is designed to see if the risk taking behaviour changes as a person gets more drunk.  The model is described in the following DAG:

Exploring the prior predictive distribution 

Those of you who have been paying attention will notice the Uniform(0,10) priors on the logit scale and think that these priors are a little bit terrible. And they are! Direct simulation from model leads to absolutely silly predictive distributions for the number of pumps in a single trial. Worse still, the pumps are extremely uniform across trials. Which means that the model thinks, a priori, that it is quite likely for a tipsy undergraduate to pump a balloon 90 times in each of the 20 trials. The mean number of pumps is a much more reasonable 10.

Choosing tighter upper bounds on the uniform priors leads to more sensible prior predictive distributions, but then Lauren went to test out what changes this made to inference (in particular looking at how it affects the Bayes factor against the null that the \gamma^+ parameters were the same across different levels of drunkenness). It made very little difference.  This seemed odd so she started looking closer.

Where is the p? Or, the Likelihood Principle gets in the way

So what is going on here? Well the model describe in Lee and Wagenmaker’s book is not a generative model for the experimental data. Why not? Because the balloon sometimes pops! But because in this modelling setup the probability of explosion is independent of the number of pumps, this explosive possibility only appears as a constant in the likelihood.

The much lauded Likelihood Principle tells us that we do not need to worry about these constants when we are doing inference. But when we are trying to generate data from the prior predictive distribution, we really need to care about these aspects of the model.

Once the context on the experiment is taken into account, the prior predictive distributions change a lot.

Context is important when taking statistical methods into new domains

Prior predictive checks are really powerful tools. They give us a way to set priors, they give us a way to understand what our model does, they give us a way to generate data that we can use to assess the behaviour of different model comparison tools under the experimental design at hand. (Neyman-Pearson acolytes would talk about power here, but the general question lives on beyond that framework).

Modifications of prior predictive checks should also be used to assess how predictions, inference, and model comparison methods behave under different but realistic deviations from the assumed generative model. (One of the points where I disagree with Lee et al.‘s paper is that it’s enough to just pre-register model comparision methods. We also need some sort of simulation study to know how they work for the problem at hand!)

But prior predictive checks require understanding of the substantive field as well as understanding of how the experiment was performed. And it is not always as simple as just predict y!

Balloons pop. Substantive knowledge may only be about contrasts or combinations of predictions. We need to always be aware that it’s a lot of work to translate a tool to a new scientific context. Even when that tool  appears to be as straightforward to use and as easy to explain as prior predictive checks.

And maybe we should’ve called that paper The Prior Can Often Only Be Understood in the Context of the Experiment.


1 The fourth century Christian heresy that posited that Jesus was created by God and hence was not of the same substance. The council of Nicaea ended up writing a creed to stamp that one out.

2 Really never let me choose the movie. Never.

3 I hate the word “robust” here. Robust against what?! The answer appears to be “robust against un-earned certainty”, but I’m not sure. Maybe they want to Winsorize cognitive science?

4 Lauren had to twiddle it a bit, particularly using a non-centered parameterization to eliminate divergences.

Pushing the guy in front of the trolley

So. I was reading the London Review of Books the other day and came across this passage by the philosopher Kieran Setiya:

Some of the most striking discoveries of experimental philosophers concern the extent of our own personal inconsistencies . . . how we respond to the trolley problem is affected by the details of the version we are presented with. It also depends on what we have been doing just before being presented with the case. After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five. . . .

I’m not up on this literature, but I was suspicious. Watching a TV show for 5 minutes can change your view so strongly?? I was reminded of the claim from a few years ago, that subliminal smiley faces had huge effects on attitudes toward immigration—it turns out the data showed no such thing. And I was bothered, because it seemed that a possibly false fact was being used as part of a larger argument about philosophy. The concept of “experimental philosophy”—that’s interesting, but only if the experiments make sense.

So I thought I’d look into this particular example.

I started by googling *saturday night live trolley problem* which led me to this article in Slate by Daniel Engber, “Does the Trolley Problem Have a Problem?: What if your answer to an absurd hypothetical question had no bearing on how you behaved in real life?”

OK, so Engber’s skeptical too. I searched in the article for Saturday Night Live and found this passage:

Trolley-problem studies also tell us people may be more likely to favor the good of the many over the rights of the few when they’re reading in a foreign language, smelling Parmesan cheese, listening to sound effects of people farting, watching clips from Saturday Night Live, or otherwise subject to a cavalcade of weird and subtle morality-bending factors in the lab.

Which contained a link to this two-page article in Psychological Science by Piercarlo Valdesolo and David DeSteno, “Manipulations of Emotional Context Shape Moral Judgment.”

From that article:

The structure of such dilemmas often requires endorsing a personal moral violation in order to uphold a utilitarian principle. The well-known footbridge dilemma is illustrative. In it, the lives of five people can be saved through sacrificing another. However, the sacrifice involves pushing a rather large man off a footbridge to stop a runaway trolley before it kills the other five. . . . the proposed dual-process model of moral judgment suggests another unexamined route by which choice might be influenced: contextual sensitivity of affect. . . .

We examined this hypothesis using a paradigm in which 79 participants received a positive or neutral affect induction and immediately afterward were presented with the footbridge and trolley dilemmas embedded in a small set of nonmoral distractors.[1] The trolley dilemma is logically equivalent to the footbridge dilemma, but does not require consideration of an emotion-evoking personal violation to reach a utilitarian outcome; consequently, the vast majority of individuals select the utilitarian option for this dilemma.[2]

Here are the two footnotes to the above passage:

[1] Given that repeated consideration of dilemmas describing moral violations would rapidly reduce positive mood, we utilized responses to the matched set of the footbridge and trolley dilemmas as the primary dependent variable.

[2] Precise wording of the dilemmas can be found in Thomson (1986) or obtained from the authors.

I don’t understand footnote 1 at all. From my reading of it, I’d think that a matched set of the dilemmas corresponds to each participant in the experiment getting both questions, and then in the analysis having the responses compared. But from the published article it’s not clear what’s going on, as only 77 people seem to have been asked about the trolley dilemma compared to 79 asked about the footbridge—I don’t know what happened to those two missing responses—and, in any case, the dependent or outcome variable in the analyses are the responses to each question, one at a time. I’m not saying this to pick at the paper; I just don’t quite see how their analysis matches their described design. The problem isn’t just two missing people, it’s also that the numbers don’t align. In the data for the footbridge dilemma, 38 people get the control condition (“a 5-min segment taken from a documentary on a small Spanish village”) and 41 get the treatment (“a 5-min comedy clip taken from ‘Saturday Night Live’”). The entire experiment is said to have 79 participants. But for the trolley dilemma, it says that 40 got the control and 37 got the treatment. Maybe data were garbled in some way? The paper was published in 2006 so long before data sharing was any sort of standard, and this little example reminds us why we now think it good practice to share all data and experimental conditions.

Regarding footnote 2: I don’t have a copy of Thomson (1986) at hand, but some googling led me to this description by Michael Waldmann and Alex Wiegmann:

In the philosopher’s Judith Thomson’s (1986) version of the trolley dilemma, a situation is described in which a trolley whose brakes fail is about to run over five workmen who work on the tracks. However, the trolley could be redirected by a bystander on a side track where only one worker would be killed (bystander problem). Is it morally permissible for the bystander to throw the switch or is it better not to act and let fate run its course?

Now for the data. Valdesolo and DeSteno find the following results:

– Flip-the-swithch-on-the-trolley problem (no fat guy, no footbridge): 38/40 flip the switch under the control condition, 33/37 flip the switch under the “Saturday Night Live” condition. That’s an estimated treatment effect of -0.06 with standard error 0.06.

– Footbridge problem (trolley, fat guy, footbridge): 3/38 push the man under the control condition, 10/41 push the man under the “Saturday Night Live” condition. That’s an estimated treatment effect of 0.16 with standard error 0.08.

So from this set of experiments alone, I would not say it’s accurate to write that “After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five.” For one thing, it’s not clear who the participants are in these experiments, so the description “Americans” seems too general. But, beyond that, we have a treatment with an effect -0.06 +/- 0.06 in one experiment and 0.16 +/- 0.08 in another: the evidence seems equivocal. Or, to put it another way, I wouldn’t expect such a large difference (“three times more likely”) to replicate in a new study or to be valid in the general population. (See for example section 2.1 of this paper for another example. The bias occurs because the study is noisy and there is selection on statistical significance.)

At this point I thought it best to dig deeper. Setiya’s article is a review of the book, “Philosophy within Its Proper Bounds,” by Edouard Machery. I looked up the book on Amazon, searched for “trolley,” and found this passage:

From this I learned that were some follow-up experiments. The two papers cited are Divergent effects of different positive emotions on moral judgment, by Nina Strohminger, Richard Lewis, and David Meyer (2011), and To push or not to push? Affective influences on moral judgment depend on decision frame, by Bernhard Pastötter, Sabine Gleixner, Theresa Neuhauser, and Karl-Heinz Bäuml (2013).

I followed the link to both papers. Machery describes these as replications, but none of the studies in question are exact replications, as the experimental conditions differ from the original study. Strohminger et al. use audio clips of comedians, inspirational stories, and academic lectures: no Saturday Night Live, no video clips at all. And Pastötter et al. don’t use video or comedy: they use audio clips of happy or sad-sounding music.

I’m not saying that these follow-up studies have no value or that they should not be considered replications of the original experiment, in some sense. I’m bringing them up partly because details matter—after all, if the difference between a serious video and a comedy video could have a huge effect on a survey response, one could also imagine that it makes a difference whether stimuli involve speech or music, or whether they are audio or video—but also because of the flexibility, the “researcher degrees of freedom,” involved in whether to consider something as a replication at all. Recall that when a study does not successfully replicate, a common reaction is to point out differences between the old and new experimental conditions and then declare that that the new study was not a real replication. But if the new study’s results are in the same direction as the old’s, then it’s treated as a replication, no questions asked. So the practice of counting replications has a heads-I-win, tails-you-lose character. (For an extreme example, recall Daryl Bem’s paper where he claimed to present dozens of replications of his controversial ESP study. One of those purported replications was entitled “Further testing of the precognitive habituation effect using spider stimuli.” I think we can be pretty confident that if the spider experiment didn’t yield the desired results, Bem could’ve just said it wasn’t a real replication because his own experiment didn’t involve spiders at all.)

Anyway, that’s just terminology. I have no problem with the Strohminger et al. and Pastötter et al. studies, which we can simply call follow-up experiments.

And, just to be clear, I agree that there’s nothing special about an SNL video or for that matter about a video at all. My concern about the replication studies is more of a selection issue: if a new study doesn’t replicate the original claim, then a defender can say it’s not a real replication. I guess we could call that “the no true replication fallacy”! Kinda like those notorious examples where people claimed that a failed replication didn’t count because it was done in a different country, or the stimulus was done for a different length of time, or the outdoor temperature was different.

The real question is, what did they find and how do these findings relate to the larger claim?

And the answer is, it’s complicated.

First, the two new studies only look at the footbridge scenario (where the decision is whether to push the fat man), not the flip-the-switch-on-the-trolley scenario, which is not so productive to study because most people are already willing to flip the switch. So the new studies to not allow comparison the two scenarios. (Strohminger et al. used 12 high conflict moral dilemmas; see here)

Second, the two new studies looked at interactions rather than main effects.

The Strohminger et al. analysis is complicated and I didn’t follow all the details, but I don’t see a direct comparison estimating the effect of listening to comedy versus something else. In any case, though, I think this experiment (55 people in what seems to be a between-person design) would be too small to reliably estimate the effect of interest, considering how large the standard error was in the original N=79 study.

Pastötter et al. had no comedy at all and found no main effect; rather, as reported by Machery, they found an effect whose sign depended on framing (whether the question was asked as, “Do you think it is appropriate to be active and push the man?” or “Do you think it is appropriate to be passive and not push the man?”:

I guess the question is, does the constellation of these results represent a replication of the finding that “situational cues or causal factors influencing people’s affective states—emotions or moods—have consistent effects on people’s general judgments about cases”?

And my answer is: I’m not sure. With this sort of grab bag of different findings (sometimes main effects, sometimes interactions) with different experimental conditions, I don’t really know what to think. I guess that’s the advantage of large preregistered replications: for all their flaws, they give us something to focus on.

Just to be clear: I agree that effects don’t have to be large to be interesting or important. But at the same time it’s not enough to just say that effects exist. I have no doubt that affective states affect survey responses, and these effects will be of different magnitudes and directions for different people and in different situations (hence the study of interactions as well as main effects). There have to be some consistent or systematic patterns for this to be considered a scientific effect, no? So, although I agree that effects don’t need to be large, I also don’t think a statement such as “emotions influence judgment” is enough either.

One thing that does seem clear, is that details matter, and lots of the details get garbled in the retelling. For example, Setiya reports that “Americans are three times more likely” to say they’d push someone, but that factor of 3 is based on a small noisy study on an unknown population, and for which I’ve not seen any exact replication, so to make that claim is a big leap of faith, or of statistical inference. Meanwhile, Engber refers to the flip-the-switch version of the dilemma, for which case the data show no such effect of the TV show. More generally, everyone seems to like talking about Saturday Night Live, I guess because it evokes vivid images, even though the larger study had no TV comedy at all but compared clips of happy or sad-sounding music.

What have we learned from this journey?

Reporting science is challenging, even for skeptics. None of the authors discussed above—Setiya, Engber, or Machery—are trying to sell us on this research, and none of them have a vested interest in making overblown claims. Indeed, I think it would be fair to describe Setiya and Engber as skeptics in this discussion. But even skeptics can get lost in the details. We all have a natural desire to smooth over the details and go for the bigger story. But this is tricky when the bigger story, whatever it is, depends on details that we don’t fully understand. Presumably our understanding in 2018 of affective influences on these survey responses should not depend on exactly how an experiment was done in 2006—but the description of the effects are framed in terms of that 2006 study, and with each lab’s experiment measuring something a bit different, I find it very difficult to put everything together.

This relates to the problem we discussed the other day, of psychology textbooks putting a complacent spin on the research in their field. The desire for a smooth and coherent story gets in the way of the real-world complexity that motivates this research in the first place.

There’s also another point that Engber emphasizes, which is the difference between a response to a hypothetical question, and an action in the external world. Paradoxically, one reason why I can accept that various irrelevant interventions (for example, watching a comedy show or a documentary film) could have a large effect on the response to the trolley question is that this response is not something that most people have thought about before. In contrast, I found similar claims involving political attitudes and voting (for example, the idea that 20% of women change their presidential preference depending on time of the month) to be ridiculous, on part because most people already have settled political views. But then, if the only reason we find the trolley claims plausible is that people aren’t answering them thoughtfully, then we’re really only learning about people’s quick reactions, not their deeper views. Quick reactions are important too; we should just be clear if that’s what we’re studying.

P.S. Edouard Machery and Nina Strohminger offered useful comments that influenced what I wrote above.

Going beyond the rainbow color scheme for statistical graphics

Yesterday in our discussion of easy ways to improve your graphs, a commenter wrote:

I recently read and enjoyed several articles about alternatives to the rainbow color palette. I particularly like the sections where they show how each color scheme looks under different forms of color-blindness and/or in black and white.

Here’s a couple of them (these are R-centric but relevant beyond that):

The viridis color palettes, by Bob Rudis, Noam Ross and Simon Garnier

Somewhere over the Rainbow, by Ross Ihaka, Paul Murrell, Kurt Hornik, Jason Fisher, Reto Stauffer, Claus Wilke, Claire McWhite, and Achim Zeileis.

I particularly like that second article, which includes lots of examples.

What are some common but easily avoidable graphical mistakes?

John Kastellec writes:

I was thinking about writing a short paper aimed at getting political scientists to not make some common but easily avoidable graphical mistakes. I’ve come up with the following list of such mistakes. I was just wondering if any others immediately came to mind?

– Label lines directly

– Make labels big enough to read

– Small multiples instead of spaghetti plots

– Avoid stacked barplots

– Make graphs completely readable in black-and-white

– Leverage time as clearly as possible by placing it on the x-axis.

That reminds me . . . I was just at a pharmacology conference. And everybody there—I mean everybody—used the rainbow color scheme for their graphs. Didn’t anyone send them the memo, that we don’t do rainbow anymore? I prefer either a unidirectional shading of colors, or a bidirectional shading as in figure 4 here, depending on context.

Abortion attitudes: The polarization is among richer, more educated whites

Abortion has been in the news lately. A journalist asked me something about abortion attitudes and I pointed to a post from a few years ago about partisan polarization on abortion. Also this with John Sides on why abortion consensus is unlikely. That was back in 2009, and consensus doesn’t seem any more likely today.

It’s perhaps not well known (although it’s consistent with what we found in Red State Blue State) that just about all the polarization on abortion comes from whites, and most of that is from upper-income, well-educated whites. Here’s an incomplete article that Yair and I wrote on this from 2010; we haven’t followed up on it recently.

Vigorous data-handling tied to publication in top journals among public heath researchers

Gur Huberman points us to this news article by Nicholas Bakalar, “Vigorous Exercise Tied to Macular Degeneration in Men,” which begins:

A new study suggests that vigorous physical activity may increase the risk for vision loss, a finding that has surprised and puzzled researchers.

Using questionnaires, Korean researchers evaluated physical activity among 211,960 men and women ages 45 to 79 in 2002 and 2003. Then they tracked diagnoses of age-related macular degeneration, from 2009 to 2013. . . .

They found that exercising vigorously five or more days a week was associated with a 54 percent increased risk of macular degeneration in men. They did not find the association in women.

The study, in JAMA Ophthalmology, controlled for more than 40 variables, including age, medical history, body mass index, prescription drug use and others. . . . an accompanying editorial suggests that the evidence from such a large cohort cannot be ignored.

The editorial, by Myra McGuinness, Julie Simpson, and Robert Finger, is unfortunately written entirely from the perspective of statistical significance and hypothesis testing, but they raise some interesting points nonetheless (for example, that the subgroup analysis can be biased if the matching of treatment to control group is done for the entire sample but not for each subgroup).

The news article is not so great, in my opinion. Setting aside various potential problems with the study (including those issues raised by McGuinness et al. in their editorial), the news article makes the mistake of going through all the reported estimates and picking the largest one. That’s selection bias right there. “A 54 percent increased risk,” indeed. If you want to report the study straight up, no criticism, fine. But then you should report the estimated main effect, which was 23% (as reported in the journal article, “(HR, 1.23; 95% CI, 1.02-1.49)”). That 54% number is just ridiculous. I mean, sure, maybe the effect really is 54%, who knows? But such an estimate is not supported by the data: it’s the largest of a set of reported numbers, any of which could’ve been considered newsworthy. If you take a set of numbers and report only the maximum, you’re introducing a bias.

Part of the problem, I suppose, is incentives. If you’re a health/science reporter, you have a few goals. One is to report exciting breakthroughs. Another is to get attention and clicks. Both goals are served, at least in the short term, by exaggeration. Even if it’s not on purpose.

OK, on to the journal article. As noted above, it’s based on a study of 200,000 people: “individuals between ages 45 and 79 years who were included in the South Korean National Health Insurance Service database from 2002 through 2013,” of whom half engaged in vigorous physical activity and half did not. It appears that the entire database contained about 500,000 people, of which 200,000 were selected for analysis in this comparison. The outcome is neovascular age-related macular degeneration, which seems to be measured by a prescription for ranibizumab, which I guess was the drug of choice for this condition in Korea at that time? Based on the description in the paper, I’m assuming they didn’t have direct data on the medical conditions, only on what drugs were prescribed, and when, hence “ranibizumab use from August 1, 2009, indicated a diagnosis of recently developed active (wet) neovascular AMD by an ophthalmologist.” I don’t know if there were people with neovascular AMD who which was not captured in this dataset because they never received this diagnosis.

In their matched sample of 200,000 people, 448 were recorded as having neovascular AMD: 250 in the vigorous exercise group and 198 in the control group. The data were put into a regression analysis, yielding an estimated hazard ratio of 1.23 with 95% confidence interval of [1.02, 1.49]. Also lots of subgroup analyses: unsurprisingly, the point estimate is higher for some subgroups than others; also unsurprisingly, some of the subgroup analyses reach statistically significance and some are not.

It is misleading to report that vigorous physical activity was associated with a greater hazard rate for neovascular AMD in men but not in women. Both the journal article and the news article made this mistake. The difference between “significant” and “non-significant” is not itself statistically significant.

So what do I think about all this? First, the estimates are biased due to selection on statistical significance (see, for example, section 2.1 here). Second, given how surprised everyone is, this suggests a prior distribution on any effect that should be concentrated near zero, which would pull all estimates toward 0 (or pull all hazard ratios toward 1), and I expect that the 95% intervals would then all include the null effect. Third, beyond all the selection mentioned above, there’s the selection entailed in studying this particular risk factor and this particular outcome. In this big study, you could study the effect of just about any risk factor X on just about any outcome Y. I’d like to see a big grid of all these things, all fit with a multilevel model. Until then, we’ll need good priors on the effect size for each study, or else some corrections for type M and type S errors.

Just reporting the raw estimate from one particular study like that: No way. That’s a recipe for future non-replicable results. Sorry, NYT, and sorry, JAMA: you’re gettin played.

P.S. Gur wrote:

The topic may merit two posts — one for the male subpopulation, another for the female.

To which I replied:

20 posts, of which 1 will be statistically significant.

P.P.S. On the plus side, Jonathan Falk pointed me the other day to this post by Scott Alexander, who writes the following about a test of a new psychiatric drug:

The pattern of positive results shows pretty much the random pattern you would expect from spurious findings. They’re divided evenly among a bunch of scales, with occasional positive results on one scale followed by negative results on a very similar scale measuring the same thing. Most of them are only the tiniest iota below p = 0.05. Many of them only work at 40 mg, and disappear in the 80 mg condition; there are occasional complicated reasons why drugs can work better at lower doses, but Occam’s razor says that’s not what’s happening here. One of the results only appeared in Stage 2 of the trial, and disappeared in Stage 1 and the pooled analysis. This doesn’t look exactly like they just multiplied six instruments by two doses by three ways of grouping the stages, got 36 different cells, and rolled a die in each. But it’s not too much better than that. Who knows, maybe the drug does something? But it sure doesn’t seem to be a particularly effective antidepressant, even by our very low standards for such. Right now I am very unimpressed.

It’s good to see this mode of thinking becoming so widespread. It makes me feel that things are changing in a good way.

So, some good news for once!